<a href="https://colab.research.google.com/github/karthik2529/Data-Analysis-/blob/main/IBM_PredictingEmployeeAttrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Specify the path to your dataset
file_path = '/content/drive/My Drive/IBM_HR_EmployeeAttrition.csv'

In [7]:
# Load the dataset
df = pd.read_csv(file_path)

In [None]:
# Display the first few rows of the dataset
print(df.head())

In [None]:

# Check for missing values
print(df.isnull().sum())

In [None]:
# Summary statistics of numerical features
print(df.describe())

In [None]:
# Check the unique values of selected categorical variables
print(df['Department'].unique())
print(df['EducationField'].unique())
# Repeat for the remaining categorical columns as needed


In [13]:
# Handling missing values
# In this example, we'll simply drop rows with missing values, but you can choose other strategies based on your dataset
df.dropna(inplace=True)
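
Dropping rows is the simplest option, but the comment above mentions other strategies. One common alternative is imputation; the sketch below is only an illustration on a made-up toy frame (the values and the `toy` name are assumptions, not taken from the HR dataset): numeric columns get the median and non-numeric columns get the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HR data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [30, np.nan, 45, 38],
    "Department": ["Sales", "Sales", None, "Research & Development"],
})

# Numeric columns: fill missing entries with the column median
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Non-numeric columns: fill missing entries with the most frequent value
cat_cols = toy.select_dtypes(exclude="number").columns
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```

Whether imputation is preferable to dropping rows depends on how many values are missing and why; if `df.isnull().sum()` reports zeros everywhere, neither step changes the data.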

In [27]:
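# Encoding categorical variables
# Note: the same encoder object is refit for each column, so label_encoder.classes_
# only holds the labels of the last column that was encoded.
label_encoder = LabelEncoder()
df['Attrition'] = label_encoder.fit_transform(df['Attrition'])
df['BusinessTravel'] = label_encoder.fit_transform(df['BusinessTravel'])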
# Encode other categorical variables similarly
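
`LabelEncoder` assigns integer codes in sorted (alphabetical) label order, which is fine for a Yes/No target but hides the mapping and ignores any natural ordering of categories. As a hedged alternative, an explicit dictionary keeps the coding visible. The category names and the ordering chosen in `travel_map` below are assumptions for illustration and should be checked against `df['BusinessTravel'].unique()`.

```python
import pandas as pd

# Toy values; the category names are assumed to match the dataset's labels
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "BusinessTravel": ["Travel_Rarely", "Non-Travel", "Travel_Frequently", "Travel_Rarely"],
})

# Explicit mappings: 1 means the employee left; travel codes follow travel frequency
attrition_map = {"No": 0, "Yes": 1}
travel_map = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}

toy["Attrition"] = toy["Attrition"].map(attrition_map)
toy["BusinessTravel"] = toy["BusinessTravel"].map(travel_map)

print(toy)
```

Any label missing from a mapping becomes `NaN` after `.map`, which makes unexpected categories easy to spot.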


In [26]:
# Encoding categorical variables
# Before encoding, let's identify categorical variables
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns:", categorical_cols)


Categorical columns: ['Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime']


In [28]:
# Using one-hot encoding for categorical variables
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

In [29]:
# Scaling numerical features
scaler = StandardScaler()
num_features = ['Age', 'MonthlyIncome', 'YearsAtCompany'] # Example numerical features
df[num_features] = scaler.fit_transform(df[num_features])
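
Fitting the scaler on the full dataset, as above, means the test rows contribute to the mean and standard deviation used in training. A common way to avoid this is to split first and let a pipeline fit the scaler on the training portion only. The following is a minimal sketch on synthetic data (`X_demo`, `y_demo`, and `pipe` are illustrative names, not part of this notebook).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HR features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on the training data only,
# then applies the same transformation when scoring the test data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

print("Test accuracy:", pipe.score(X_te, y_te))
```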


In [30]:
# Save the cleaned dataset
df.to_csv("cleanedHR_dataset.csv", index=False)

In [31]:
# Split data into features (X) and target (y)
X = df.drop('Attrition', axis=1)
y = df['Attrition']

In [32]:
# Step 5: Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
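
When the target is imbalanced, a plain random split can leave the test set with a different share of positives than the training set. Passing `stratify=y` to `train_test_split` keeps the class proportions similar in both parts. The sketch below uses made-up labels (the 16% positive rate is an illustrative assumption).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 16 positives out of 100 (illustrative proportion)
y_demo = np.array([1] * 16 + [0] * 84)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive rate, train:", y_tr.mean())
print("Positive rate, test:", y_te.mean())
```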



In [None]:
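# Step 6: Initialize and train the model (e.g., Logistic Regression)
from sklearn.linear_model import LogisticRegression
# max_iter is raised from the default of 100 to help the solver converge on features that are mostly unscaled
model = LogisticRegression(max_iter=1000)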
model.fit(X_train, y_train)


In [34]:
# Step 7: Predict on the test set
y_pred = model.predict(X_test)

In [None]:
# Step 8: Evaluate the model
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
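
On an imbalanced target, accuracy alone can look good even for a model that never predicts attrition. A quick sanity check is to compare against an "always predict the majority class" baseline and to look at balanced accuracy as well. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy ground truth: 84 negatives and 16 positives (illustrative proportion)
y_true = np.array([0] * 84 + [1] * 16)

# Baseline that always predicts the majority class (no attrition)
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
baseline_balanced = balanced_accuracy_score(y_true, y_majority)

print("Baseline accuracy:", baseline_accuracy)
print("Baseline balanced accuracy:", baseline_balanced)
```

A trained model is only informative to the extent that it beats this baseline, which is why the per-class recall in the classification report matters more than the headline accuracy.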

In [None]:
# Example using Random Forest
from sklearn.ensemble import RandomForestClassifier
# random_state is fixed so that repeated runs give the same forest
model_rf = RandomForestClassifier(random_state=42)
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", accuracy_rf)
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
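
A fitted random forest also exposes `feature_importances_`, which can help decide which features deserve attention. The sketch below shows the pattern on synthetic data; the column names `f0` to `f3` are placeholders, and with the notebook's own data the same two lines would use `model_rf` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances, labelled by column and sorted from high to low
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Impurity-based importances tend to favor features with many distinct values, so they are best read as a rough ranking rather than a precise measure.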


In [37]:
# Task 2: Exploratory Data Analysis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [38]:
# Load the cleaned dataset
df = pd.read_csv("cleanedHR_dataset.csv")


In [None]:
# Statistical summary of the dataset
print(df.describe())

In [None]:
# Correlation matrix
corr_matrix = df.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
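
With many columns, the full heatmap is hard to read. A compact complement is to take only the `Attrition` column of the correlation matrix and sort it by absolute value. The toy frame below uses made-up numbers purely to show the pattern; it says nothing about the real dataset.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Attrition": [1, 0, 0, 1, 0, 1],
    "MonthlyIncome": [2000, 6000, 5500, 2500, 7000, 3000],
    "DistanceFromHome": [20, 5, 3, 9, 8, 25],
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)["Attrition"]
    .drop("Attrition")
    .sort_values(key=abs, ascending=False)
)

print(target_corr)
```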

In [None]:

# Distribution of Attrition
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='Attrition')
plt.title("Distribution of Attrition")
plt.show()

In [None]:

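# Pairplot of selected features
g = sns.pairplot(df[['Attrition', 'Age', 'MonthlyIncome', 'YearsAtCompany', 'JobSatisfaction']], hue='Attrition')
# pairplot draws a grid of axes, so the title is set on the whole figure rather than on the last subplot
g.figure.suptitle("Pairplot of Selected Features", y=1.02)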
plt.show()


In [None]:
"""Step 4: Insights and Feature Selection

Insights:

Distribution of Attrition:
The count plot shows that the dataset is imbalanced, with fewer instances of attrition than non-attrition.

Correlation between Features and Attrition:
The correlation matrix helps identify features with relatively high positive or negative correlations with attrition. These may include 'JobLevel', 'MonthlyIncome', 'TotalWorkingYears', and 'YearsAtCompany'.

Relationship between Selected Features and Attrition:
The pairplot visualizes how 'Age', 'MonthlyIncome', 'YearsAtCompany', and 'JobSatisfaction' vary with attrition.

Feature Selection:
Based on the insights gathered from the EDA, we can prioritize the following features for model development:

'Age': Age of the employee.
'MonthlyIncome': Monthly income of the employee.
'YearsAtCompany': Number of years the employee has been with the company.
'JobSatisfaction': Level of job satisfaction reported by the employee.

These features appear related to attrition in the plots above and are likely to be useful predictors for the model."""

In [None]:
"""Conclusion:
EDA provides insight into the dataset and helps identify important features for model development.
By selecting the most relevant features, we can build a model that predicts employee attrition more effectively.
Validate the selected features further and iterate on the model as needed."""

In [47]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Specify the path to save the file in your Google Drive
file_path = '/content/drive/My Drive/cleanedHR_dataset.csv'

# Save the cleaned dataset to your Google Drive
df.to_csv(file_path, index=False)

# Print confirmation message
print("Dataset saved successfully to Google Drive!")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Dataset saved successfully to Google Drive!


In [48]:
# Task 3: Employee Attrition Prediction Model and Recommendations
# Step 1: Data Splitting
from sklearn.model_selection import train_test_split


In [49]:

# Splitting the dataset into features (X) and target variable (y)
X = df.drop('Attrition', axis=1)
y = df['Attrition']

In [50]:

# Splitting data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [51]:

# Step 2: Model Selection
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [53]:
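# Initialize models
# max_iter is raised to help the logistic regression solver converge; random_state makes the forest reproducible
logistic_model = LogisticRegression(max_iter=1000)
random_forest_model = RandomForestClassifier(random_state=42)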

In [None]:
# Step 3: Model Training
logistic_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)
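
Because attrition is the minority class, both estimators can be told to weight classes inversely to their frequency via `class_weight="balanced"`. This is one possible way to handle the imbalance, not something the cells above do. The sketch compares recall with and without weighting on synthetic data (all names and the 85/15 split are illustrative assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 15% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Same model, with and without class weighting
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))

print("Recall, default weights:", recall_plain)
print("Recall, balanced weights:", recall_weighted)
```

Weighting usually trades some precision for higher recall on the minority class, so both metrics should be checked before adopting it.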

In [55]:
# Step 4: Model Evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [56]:
# Function to evaluate model performance
def evaluate_model(model, X_test, y_test):
    """Return accuracy, precision, recall, F1, and the confusion matrix on the test set.

    Precision, recall, and F1 are computed for the positive class (Attrition = 1).
    """
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)
    return accuracy, precision, recall, f1, confusion_mat

In [58]:
# Evaluate Logistic Regression model
accuracy_lr, precision_lr, recall_lr, f1_lr, confusion_mat_lr = evaluate_model(logistic_model, X_test, y_test)

  _warn_prf(average, modifier, msg_start, len(result))
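
The `_warn_prf` line above is the tail of a scikit-learn undefined-metric warning. It most likely means the logistic regression model predicted no positive (attrition) cases on the test set, which leaves precision undefined. The sketch below reproduces that situation with a majority-class dummy model on made-up labels and shows the `zero_division` argument, which sets the fallback value explicitly instead of warning.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: four negatives, two positives
y_true = np.array([0, 0, 0, 0, 1, 1])
X_demo = np.zeros((6, 1))

# A model that always predicts the majority class, so it never predicts 1
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_true)
y_hat = baseline.predict(X_demo)

# zero_division=0 reports 0.0 for the undefined precision without a warning
precision = precision_score(y_true, y_hat, zero_division=0)
recall = recall_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)

print("Precision:", precision)
print("Recall:", recall)
print(cm)
```

Inspecting `confusion_mat_lr` would confirm whether this is what happened: an all-zero second column means no attrition cases were predicted.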


In [59]:
# Evaluate Random Forest model
accuracy_rf, precision_rf, recall_rf, f1_rf, confusion_mat_rf = evaluate_model(random_forest_model, X_test, y_test)


In [62]:
# Step 5: Hyperparameter Tuning (optional)
# If necessary, optimize hyperparameters for the best-performing model(s)
from sklearn.model_selection import GridSearchCV

In [63]:
# Define the full hyperparameter grid for the Random Forest model (3^4 = 81 combinations, 405 fits with cv=5)
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

In [66]:
# Initialize GridSearchCV for Random Forest model
grid_search_rf = GridSearchCV(random_forest_model, param_grid_rf, cv=5, scoring='accuracy')

In [68]:
# Fit the grid search to the data
grid_search_rf.fit(X_train, y_train)

In [67]:
from sklearn.model_selection import GridSearchCV

# Define a reduced hyperparameter grid for the Random Forest model (2^4 = 16 combinations, 80 fits with cv=5) to shorten the search
param_grid_rf = {
    'n_estimators': [50, 100],  # Reduced number of estimators
    'max_depth': [None, 10],     # Reduced number of max depth values
    'min_samples_split': [2, 5], # Reduced number of min samples split values
    'min_samples_leaf': [1, 2]   # Reduced number of min samples leaf values
}

# Initialize GridSearchCV for Random Forest model
grid_search_rf = GridSearchCV(random_forest_model, param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the grid search to the data
grid_search_rf.fit(X_train, y_train)


In [70]:
# Get the best parameters from grid search
best_params_rf = grid_search_rf.best_params_

# Update Random Forest model with best parameters
best_rf_model = RandomForestClassifier(**best_params_rf)
best_rf_model.fit(X_train, y_train)
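
By default (`refit=True`), `GridSearchCV` retrains the best configuration on the whole training set, so the tuned model is also available as `best_estimator_` without building a new classifier. The sketch below shows this on synthetic data and uses `scoring="f1"` as one possible alternative to accuracy for an imbalanced target; the small grid and all names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data: roughly 20% positives
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, weights=[0.8, 0.2], random_state=0
)

# Deliberately small grid so the example runs quickly
param_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="f1"
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)

# Already refit on all of X_demo with the best parameters
tuned_model = search.best_estimator_
```

The tuned model can then be passed to `evaluate_model` in the same way as the untuned ones to check whether tuning helped on the held-out test set.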


In [None]:
# Step 6: Recommendations and Actionable Insights
# Based on the model analysis, provide recommendations and actionable insights for HR teams to improve employee retention and job satisfaction.
# The example recommendations below can be customized based on the insights gained from the model analysis.

# Example recommendations:
recommendations = [
    "Implement regular employee engagement surveys to identify areas of improvement.",
    "Provide opportunities for professional development and career advancement.",
    "Improve communication channels between management and employees.",
    "Offer competitive compensation and benefits packages.",
    "Foster a positive work culture and promote work-life balance."
]

# Print recommendations
print("Recommendations for HR Teams:")
for i, recommendation in enumerate(recommendations, start=1):
    print(f"{i}. {recommendation}")
