<a href="https://colab.research.google.com/github/janhavi9-lab/Welcome-to-Open-Source/blob/master/employeeattrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
import zipfile  # the dataset was downloaded from Kaggle as a zip archive
import os  # this cell was written with help from Gemini

zip_file_path = '/content/archive (2).zip'

# Create a directory to extract files into, if it doesn't exist
extract_dir = '/content/extracted_data'
os.makedirs(extract_dir, exist_ok=True)

# Unzip the file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

print(f"Files extracted to: {extract_dir}")
print("Contents of the extracted directory:")
for root, dirs, files in os.walk(extract_dir):
    for name in files:
        print(os.path.join(root, name))
    for name in dirs:
        print(os.path.join(root, name))

Files extracted to: /content/extracted_data
Contents of the extracted directory:
/content/extracted_data/WA_Fn-UseC_-HR-Employee-Attrition.csv


In [7]:
Emp_data = pd.read_csv('/content/extracted_data/WA_Fn-UseC_-HR-Employee-Attrition.csv')

In [9]:
Emp_data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [10]:
Emp_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

# Task
Predict employee attrition using a RandomForestClassifier model, preprocess the `Emp_data` DataFrame by handling categorical variables and dropping irrelevant columns, split the data into training and testing sets, evaluate the model's performance, and visualize the feature importance to identify key factors influencing attrition.

## Preprocess Data for Modeling

### Subtask:
Prepare the `Emp_data` DataFrame for machine learning by handling categorical variables and dropping irrelevant columns. The target variable for prediction will be 'Attrition'.


**Reasoning**:
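The first step is to drop columns that carry no predictive signal. `EmployeeCount` and `StandardHours` show the same value (1 and 80) on every row displayed above, `EmployeeNumber` is an identifier, and `Over18` is likewise treated as irrelevant. I will use the `drop` method for this purpose.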



In [12]:
columns_to_drop = ['EmployeeCount', 'StandardHours', 'EmployeeNumber', 'Over18']
Emp_data.drop(columns=columns_to_drop, inplace=True)
print(f"Columns {columns_to_drop} dropped from Emp_data.")
Emp_data.head()

Columns ['EmployeeCount', 'StandardHours', 'EmployeeNumber', 'Over18'] dropped from Emp_data.


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,2,Female,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,3,Male,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,4,Male,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,4,Female,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,Male,...,3,4,1,6,3,3,2,2,2,2
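One way to confirm which columns are safe to drop is to count the distinct values in each column. The sketch below runs on a small hypothetical DataFrame (`toy`), not on the real `Emp_data`, and the names `constant_cols` and `id_like_cols` are introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for Emp_data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [41, 49, 41, 33],
    "EmployeeCount": [1, 1, 1, 1],      # constant
    "StandardHours": [80, 80, 80, 80],  # constant
    "EmployeeNumber": [1, 2, 4, 5],     # unique per row: an identifier
})

n_unique = toy.nunique()

# Columns with a single distinct value cannot help separate the classes.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with a distinct value on every row behave like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("Constant columns:", constant_cols)
print("Identifier-like columns:", id_like_cols)
```

On real data the identifier check is only a hint: a continuous numeric column can also have a distinct value on every row, so each flagged column still needs a manual look.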


**Reasoning**:
The next step is to convert the 'Attrition' column from categorical ('Yes', 'No') to numerical (1, 0) as required by machine learning models.



In [13]:
Emp_data['Attrition'] = Emp_data['Attrition'].map({'Yes': 1, 'No': 0})
print("Attrition column converted to numerical:")
print(Emp_data['Attrition'].value_counts())

Attrition column converted to numerical:
Attrition
0    1233
1     237
Name: count, dtype: int64
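`Series.map` with a dictionary returns `NaN` for any label that is not a key, so it is worth checking that nothing was lost. The following is a minimal sketch on hypothetical labels (the lowercase `"yes"` is deliberately unmapped), not on the notebook's data.

```python
import pandas as pd

# Hypothetical labels; "yes" (lowercase) is deliberately not in the mapping.
labels = pd.Series(["Yes", "No", "No", "yes"])
mapped = labels.map({"Yes": 1, "No": 0})

# Series.map returns NaN for any value missing from the dictionary.
n_unmapped = int(mapped.isna().sum())
print("Unmapped labels:", n_unmapped)

# Share of the positive class among the successfully mapped labels.
positive_rate = mapped.dropna().mean()
print(f"Positive rate: {positive_rate:.2f}")
```

In the notebook output above, the two counts add up to 1233 + 237 = 1470, the full row count, so no label was left unmapped. The positive class makes up 237 / 1470, or about 16%, of the data.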


**Reasoning**:
After converting Attrition to numbers, all remaining categorical columns are one-hot encoded. The first category of each column is dropped (`drop_first=True`) because it is implied by the remaining indicator columns; this avoids multicollinearity and keeps the data model-ready.



In [14]:
categorical_cols = Emp_data.select_dtypes(include=['object']).columns
print(f"Categorical columns to be one-hot encoded: {list(categorical_cols)}")

# Apply one-hot encoding to the identified categorical columns
df_encoded = pd.get_dummies(Emp_data, columns=categorical_cols, drop_first=True)

print("One-hot encoding applied. Displaying first few rows of the encoded DataFrame:")
df_encoded.head()

Categorical columns to be one-hot encoded: ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
One-hot encoding applied. Displaying first few rows of the encoded DataFrame:


Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,Education,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,...,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single,OverTime_Yes
0,41,1,1102,1,2,2,94,3,2,4,...,False,False,False,False,False,True,False,False,True,True
1,49,0,279,8,1,3,61,2,2,2,...,False,False,False,False,True,False,False,True,False,False
2,37,1,1373,2,2,4,92,2,1,3,...,True,False,False,False,False,False,False,False,True,True
3,33,0,1392,3,4,4,56,3,1,3,...,False,False,False,False,True,False,False,True,False,True
4,27,0,591,2,1,1,40,3,1,2,...,True,False,False,False,False,False,False,True,False,False
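To see what `drop_first=True` does, the sketch below encodes a single three-category column with and without it. The toy frame and its category values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical column with three travel categories (illustrative values).
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"]
})

# Without drop_first: one indicator column per category.
full = pd.get_dummies(toy, columns=["BusinessTravel"])

# With drop_first: the first category (in sorted order) is left out,
# and a row of all-False indicators represents it implicitly.
reduced = pd.get_dummies(toy, columns=["BusinessTravel"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A column with k categories therefore becomes k - 1 indicator columns, which is why two-valued columns such as `OverTime` appear as the single column `OverTime_Yes` in the encoded DataFrame.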


## Split Data into Training and Testing Sets

### Subtask:
Divide the preprocessed features and the target variable ('Attrition') into training and testing sets.


**Reasoning**:
First, I will separate the features (X) and the target variable (y) from the `df_encoded` DataFrame. The 'Attrition' column will be the target variable, and all other columns will be features.



In [15]:
X = df_encoded.drop('Attrition', axis=1)
y = df_encoded['Attrition']

print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)

Features (X) shape: (1470, 44)
Target (y) shape: (1470,)


**Reasoning**:
The next step is to import the `train_test_split` function from `sklearn.model_selection` to prepare for splitting the data into training and testing sets.



In [16]:
from sklearn.model_selection import train_test_split

print("train_test_split imported successfully.")

train_test_split imported successfully.


**Reasoning**:
Now that the data is separated into features (X) and target (y), and `train_test_split` is imported, I will split the data into training and testing sets as per the instructions, using a test size of 0.2 and a random state of 42.



In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data split into training and testing sets:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Data split into training and testing sets:
X_train shape: (1176, 44)
X_test shape: (294, 44)
y_train shape: (1176,)
y_test shape: (294,)
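The split above is not stratified, so the share of attrition cases can differ between the training and testing sets. A possible alternative, sketched here on synthetic data with the same class counts (`X_demo` and `y_demo` are stand-ins, not the notebook's `X` and `y`), is to pass `stratify` to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 1470 rows, 237 positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1470, 5))
y_demo = np.array([1] * 237 + [0] * 1233)

# stratify=y_demo keeps the class ratio (about 16% positives) in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate overall:  {y_demo.mean():.3f}")
print(f"Positive rate in train: {y_tr.mean():.3f}")
print(f"Positive rate in test:  {y_te.mean():.3f}")
```

With a minority class this small, stratifying makes the test-set metrics less dependent on which rows happen to land in the 20% split.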


## Train a Predictive Model

### Subtask:
Train a classification model, such as a RandomForestClassifier, on the training data to predict employee attrition. RandomForest is a good choice as it naturally provides feature importance scores.


**Reasoning**:
The first step is to import the necessary model, `RandomForestClassifier`, from `sklearn.ensemble` as specified in the instructions.



In [18]:
from sklearn.ensemble import RandomForestClassifier

print("RandomForestClassifier imported successfully.")

RandomForestClassifier imported successfully.


**Reasoning**:
Now that the `RandomForestClassifier` has been imported, the next step is to instantiate the classifier object with `random_state=42` as instructed.



In [19]:
model = RandomForestClassifier(random_state=42)
print("RandomForestClassifier instantiated.")

RandomForestClassifier instantiated.


**Reasoning**:
Now that the model is instantiated, the next step is to train the `RandomForestClassifier` on the training data (`X_train` and `y_train`) as per the instructions.



In [20]:
model.fit(X_train, y_train)
print("RandomForestClassifier trained successfully on the training data.")

RandomForestClassifier trained successfully on the training data.


## Evaluate Model Performance

### Subtask:
Assess the trained model's performance on the test set using relevant metrics such as accuracy, precision, recall, and F1-score to understand how well it predicts attrition.


**Reasoning**:
The first step in evaluating the model's performance is to make predictions on the test set using the trained `RandomForestClassifier` model.



In [21]:
y_pred = model.predict(X_test)
print("Predictions made on the test set.")

Predictions made on the test set.


**Reasoning**:
Now that predictions have been made, the next step is to import the necessary metrics from `sklearn.metrics` to evaluate the model's performance as specified in the instructions.



In [22]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

print("Evaluation metrics imported successfully.")

Evaluation metrics imported successfully.


**Reasoning**:
Now that the evaluation metrics are imported, I will calculate and print the accuracy, precision, recall, and F1-score of the model using the `y_test` and `y_pred` values.



In [23]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

Accuracy: 0.8776
Precision: 0.8000
Recall: 0.1026
F1-Score: 0.1818


**Reasoning**:
The final step in evaluating the model's performance is to generate and print a classification report, which provides a comprehensive summary of precision, recall, and F1-score for each class.



In [24]:
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.88      1.00      0.93       255
           1       0.80      0.10      0.18        39

    accuracy                           0.88       294
   macro avg       0.84      0.55      0.56       294
weighted avg       0.87      0.88      0.83       294
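The report can be read more concretely as confusion-matrix counts. The counts below are inferred from the reported metrics and supports (39 leavers, 255 stayers); they are an assumption consistent with the numbers above, not an output of the notebook.

```python
# Confusion-matrix counts inferred from the reported metrics (an assumption,
# not printed by the notebook): 39 leavers and 255 stayers in the test set.
tp, fn = 4, 35    # leavers predicted as leaving / predicted as staying
tn, fp = 254, 1   # stayers predicted as staying / predicted as leaving

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
```

Under these counts the model flags only 5 employees as likely to leave and is right about 4 of them, which explains the high precision alongside the very low recall.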



## Extract and Visualize Feature Importance

### Subtask:
Obtain the feature importance scores from the trained model and visualize them to identify which factors contribute most to predicting employee attrition. This shows which variables the model relies on most.
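The notebook does not include a cell for this step. A minimal sketch of how it could look is given below, using synthetic data and hypothetical feature names (`X_demo`, `demo_model`, `feature_0` and so on) so that it runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_train / y_train; feature names are hypothetical.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

demo_model = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# feature_importances_ holds one impurity-based score per column; they sum to 1.
importances = pd.Series(
    demo_model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances.head(5))

# Horizontal bar chart of the top features, most important at the top.
top = importances.head(5).sort_values()
fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(top.index, top.values)
ax.set_xlabel("Importance")
ax.set_title("Top feature importances (synthetic data)")
fig.tight_layout()
plt.show()
```

In the notebook itself, `model` and `X_train.columns` would take the place of `demo_model` and `X_demo.columns`. Because the categorical variables were one-hot encoded, the importance of a variable such as `JobRole` is spread across its indicator columns.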


## Summary

### Data Preparation
Unnecessary columns were removed, and the target variable `Attrition` was converted into numerical form (1 = Yes, 0 = No). The remaining categorical features were one-hot encoded, resulting in a dataset with 44 features (45 columns including the target).

### Model Training & Results
The data was split into 80% training and 20% testing sets, and a Random Forest model was trained. The model achieved about 88% accuracy on the test set. It identified employees who stayed almost perfectly (recall 1.00 for class 0) but detected only about 10% of the 39 employees in the test set who left (recall 0.10 for class 1).

### Key Insight
The high accuracy is misleading because of class imbalance: only 237 of the 1,470 employees (about 16%) left, so a model that always predicts "stayed" would already reach about 87% accuracy on this test set (255 of 294). To improve attrition prediction, techniques such as SMOTE, undersampling, or adjusting class weights should be tried.
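Of the techniques named above, adjusting class weights needs no extra library. The sketch below compares minority-class recall with and without `class_weight="balanced"` on synthetic imbalanced data; `X_demo`, `y_demo`, and the `recalls` dictionary are assumptions for illustration, and the effect on the real attrition data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 16% positives), standing in for X and y.
X_demo, y_demo = make_classification(
    n_samples=1470, n_features=20, n_informative=5,
    weights=[0.84, 0.16], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Train once with default weights and once with weights inversely
# proportional to class frequency, then compare recall on class 1.
recalls = {}
for name, class_weight in [("default", None), ("balanced", "balanced")]:
    clf = RandomForestClassifier(random_state=42, class_weight=class_weight)
    clf.fit(X_tr, y_tr)
    recalls[name] = recall_score(y_te, clf.predict(X_te))
    print(f"{name:>8}: recall on the minority class = {recalls[name]:.3f}")
```

Class weights do not guarantee better recall for a random forest, so the comparison should be repeated on the actual data. SMOTE and undersampling are provided by the separate imbalanced-learn package and are not shown here.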