**Welcome to your First Graded Assignment on Machine Learning!**

## Employee Churn Dataset

About the Data

This dataset contains information about employees in a company, including their educational backgrounds, work history, demographics, and employment-related factors. It has been anonymized to protect privacy while still providing valuable insights into the workforce.

**Goal:** Predict whether an employee will leave the company (churn/attrition). This is a binary classification task: 1 = left, 0 = stayed.

The features in the training data are listed below.


1. **Education:** The highest educational qualification of each employee (Bachelors, Masters, or PhD in this dataset).

2. **Joining Year:** The year each employee joined the company, indicating their length of service.

3. **City:** The location or city where each employee is based or works.

4. **Payment Tier:** Categorization of employees into different salary tiers.

5. **Age:** The age of each employee, providing demographic insights.

6. **Gender:** Gender identity of employees, promoting diversity analysis.

7. **Ever Benched:** Indicates if an employee has ever been temporarily without assigned work.

8. **Experience in Current Domain:** The number of years of experience employees have in their current field.

Target Column

9. **Leave or Not:** Whether the employee left the company (1 = left, 0 = stayed).

---

#### Use the Updated Script to Serve a Streamlit Predictor App for the Attached Dataset

1. Load the dataset in your Jupyter notebook.
2. Generate a schema for the numerical and categorical values, and export it.
3. Create an ML fitting pipeline on the dataset and export the pipeline.
4. Create a Streamlit app and load the schema and model.
5. Add your name as a sub-header under the header, in the form "Created by <User Name>".
6. Add interactive input widgets to the Streamlit app for the user.
7. Use the widget values as inputs for prediction.
8. Show the model prediction on the frontend when the Submit button is clicked.

### **Importing Data**

In [1]:
## Importing Required Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


import warnings
warnings.filterwarnings('ignore')

In [2]:
## Reading the dataset
df = pd.read_csv('Employee.csv')
df.head()

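| | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain | LeaveOrNot |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bachelors | 2017 | Bangalore | 3 | 34 | Male | No | 0 | 0 |
| 1 | Bachelors | 2013 | Pune | 1 | 28 | Female | No | 3 | 1 |
| 2 | Bachelors | 2014 | New Delhi | 3 | 38 | Female | No | 2 | 0 |
| 3 | Masters | 2016 | Bangalore | 3 | 27 | Male | No | 5 | 1 |
| 4 | Masters | 2017 | Pune | 3 | 24 | Male | Yes | 2 | 1 |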


### **Data Cleaning**

In [3]:
df.shape

(4653, 9)

In [4]:
df.isnull().sum()

Education                    0
JoiningYear                  0
City                         0
PaymentTier                  0
Age                          0
Gender                       0
EverBenched                  0
ExperienceInCurrentDomain    0
LeaveOrNot                   0
dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Education                  4653 non-null   object
 1   JoiningYear                4653 non-null   int64 
 2   City                       4653 non-null   object
 3   PaymentTier                4653 non-null   int64 
 4   Age                        4653 non-null   int64 
 5   Gender                     4653 non-null   object
 6   EverBenched                4653 non-null   object
 7   ExperienceInCurrentDomain  4653 non-null   int64 
 8   LeaveOrNot                 4653 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 327.3+ KB


### Standardize Text Data

In [6]:
import re
import string

def standardize_text(column):
    # Convert to lowercase
    column = column.str.lower()
    # Strip leading and trailing spaces
    column = column.str.strip()
    # Remove punctuation; re.escape keeps characters such as ']' and '\' literal inside the character class
    column = column.str.replace(f'[{re.escape(string.punctuation)}]', '', regex=True)
    return column

# Iterate through DataFrame columns and apply standardization
for column in df.select_dtypes(include='object').columns:
    df[column] = standardize_text(df[column])

# Check the standardized DataFrame
print(df)

      Education  JoiningYear       City  PaymentTier  Age  Gender EverBenched  \
0     bachelors         2017  bangalore            3   34    male          no   
1     bachelors         2013       pune            1   28  female          no   
2     bachelors         2014  new delhi            3   38  female          no   
3       masters         2016  bangalore            3   27    male          no   
4       masters         2017       pune            3   24    male         yes   
...         ...          ...        ...          ...  ...     ...         ...   
4648  bachelors         2013  bangalore            3   26  female          no   
4649    masters         2013       pune            2   37    male          no   
4650    masters         2018  new delhi            3   27    male          no   
4651  bachelors         2012  bangalore            3   30    male         yes   
4652  bachelors         2015  bangalore            3   33    male         yes   

      ExperienceInCurrentDo
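
As a quick illustration of what the standardization step does, the sketch below applies the same three operations (lowercase, strip, remove punctuation) to a small Series. The input values are made up for the example and are not taken from `Employee.csv`. It wraps the punctuation in `re.escape` so that every character is treated literally inside the regex character class.

```python
import re
import string

import pandas as pd

# Hypothetical raw values with mixed case, stray spaces and punctuation.
raw = pd.Series(["  New Delhi ", "Ph.D.", "BANGALORE!"])

# Character class matching any single ASCII punctuation character.
punctuation_class = f"[{re.escape(string.punctuation)}]"

cleaned = (
    raw.str.lower()
       .str.strip()
       .str.replace(punctuation_class, "", regex=True)
)
print(cleaned.tolist())
```

Normalizing before the duplicate check matters because two rows that differ only in case or stray whitespace would otherwise count as distinct.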

### Checking and dropping duplicates

In [7]:
duplicate_rows = df.duplicated().sum()
duplicate_rows

np.int64(1889)

In [8]:
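# .copy() makes df_cleaned an independent DataFrame, which avoids a SettingWithCopyWarning
# when the LeaveOrNot column is reassigned later
df_cleaned = df.drop_duplicates(keep='first').copy()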
df_cleaned.shape

(2764, 9)

In [9]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2764 entries, 0 to 4651
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Education                  2764 non-null   object
 1   JoiningYear                2764 non-null   int64 
 2   City                       2764 non-null   object
 3   PaymentTier                2764 non-null   int64 
 4   Age                        2764 non-null   int64 
 5   Gender                     2764 non-null   object
 6   EverBenched                2764 non-null   object
 7   ExperienceInCurrentDomain  2764 non-null   int64 
 8   LeaveOrNot                 2764 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 215.9+ KB
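
The `Index: 2764 entries, 0 to 4651` line shows that dropping duplicates keeps the original row labels instead of renumbering them. A minimal sketch on a made-up four-row frame (hypothetical values) shows the same behaviour, plus an optional `reset_index(drop=True)` if contiguous labels are preferred:

```python
import pandas as pd

# Hypothetical rows; the second row repeats the first exactly.
toy = pd.DataFrame({
    "City": ["pune", "pune", "bangalore", "pune"],
    "Age": [28, 28, 34, 30],
})

# duplicated() flags rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# keep='first' retains the first occurrence and the original index labels.
deduped = toy.drop_duplicates(keep="first")

# Optional: renumber the rows from 0 after dropping.
reindexed = deduped.reset_index(drop=True)

print(n_duplicates)
print(deduped.index.tolist())
print(reindexed.index.tolist())
```

The notebook does not reset the index, which is harmless here because later steps select rows by position or pass the whole frame to scikit-learn.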


### **Feature Engineering**

In [10]:
df_cleaned['LeaveOrNot'] = df_cleaned['LeaveOrNot'].map({1 : 'Yes' , 0 : 'No'})

In [11]:
# Splitting features and target variable
X = df_cleaned.drop(columns=['LeaveOrNot'])
y = df_cleaned['LeaveOrNot']
X.head()

| | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain |
|---|---|---|---|---|---|---|---|---|
| 0 | bachelors | 2017 | bangalore | 3 | 34 | male | no | 0 |
| 1 | bachelors | 2013 | pune | 1 | 28 | female | no | 3 |
| 2 | bachelors | 2014 | new delhi | 3 | 38 | female | no | 2 |
| 3 | masters | 2016 | bangalore | 3 | 27 | male | no | 5 |
| 4 | masters | 2017 | pune | 3 | 24 | male | yes | 2 |


#### SELECTING CATEGORICAL AND NUMERICAL FOR PROCESSING

In [12]:
X.select_dtypes(include=[object]).head()

| | Education | City | Gender | EverBenched |
|---|---|---|---|---|
| 0 | bachelors | bangalore | male | no |
| 1 | bachelors | pune | female | no |
| 2 | bachelors | new delhi | female | no |
| 3 | masters | bangalore | male | no |
| 4 | masters | pune | male | yes |


In [14]:
categorical_features = X.select_dtypes(include=[object]).columns

# X no longer contains 'LeaveOrNot', so difference() only sorts the names alphabetically here
categorical_features = list(categorical_features.difference(['LeaveOrNot']))

print('\n','Categorical Features','\n', categorical_features,'\n')


 Categorical Features 
 ['City', 'Education', 'EverBenched', 'Gender'] 



In [15]:
X.select_dtypes(include=[np.float64,np.int64]).head()

| | JoiningYear | PaymentTier | Age | ExperienceInCurrentDomain |
|---|---|---|---|---|
| 0 | 2017 | 3 | 34 | 0 |
| 1 | 2013 | 1 | 28 | 3 |
| 2 | 2014 | 3 | 38 | 2 |
| 3 | 2016 | 3 | 27 | 5 |
| 4 | 2017 | 3 | 24 | 2 |


In [17]:
numerical_features = list(X.select_dtypes(include=[np.float64,np.int64]))

print('\n','Numerical Features','\n', numerical_features,'\n')


 Numerical Features 
 ['JoiningYear', 'PaymentTier', 'Age', 'ExperienceInCurrentDomain'] 



#### SPLITTING DATA FOR TRAIN / TEST

In [18]:
from sklearn.model_selection import train_test_split

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [19]:
print('Train Data','\n',y_train.value_counts(normalize=True),'\n','\n','Test Data','\n', y_test.value_counts(normalize=True))

Train Data 
 LeaveOrNot
No     0.607417
Yes    0.392583
Name: proportion, dtype: float64 
 
 Test Data 
 LeaveOrNot
No     0.60217
Yes    0.39783
Name: proportion, dtype: float64
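
The class proportions above happen to be close in both splits. If that balance should be guaranteed rather than left to chance, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up labels (60 `No`, 40 `Yes`), not on the notebook's data:

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with a 60/40 class balance.
features = list(range(100))
labels = ["No"] * 60 + ["Yes"] * 40

# stratify=labels keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

train_yes_share = y_tr.count("Yes") / len(y_tr)
test_yes_share = y_te.count("Yes") / len(y_te)
print(train_yes_share, test_yes_share)
```

In the notebook this would correspond to passing `stratify=y` in the existing call. Doing so would change which rows land in each split, so the metrics reported later would differ.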


#### EXPORTING FEATURE INPUT METADATA

In [20]:
def summarize_cat(data, categorical_features):
    """Return a DataFrame listing the unique members of each categorical column."""
    results = []

    for column in categorical_features:
        # Get the unique members of the column
        members = data[column].unique().tolist()
        # Append the column name and its unique members to the results list
        results.append([column, members])

    # Create a DataFrame from the results list
    return pd.DataFrame(results, columns=['Column Name', 'Members'])

summarize_cat(X_train, categorical_features)

| | Column Name | Members |
|---|---|---|
| 0 | City | [bangalore, new delhi, pune] |
| 1 | Education | [bachelors, masters, phd] |
| 2 | EverBenched | [no, yes] |
| 3 | Gender | [male, female] |


In [21]:
summarize_cat(df_cleaned,categorical_features).to_dict()

{'Column Name': {0: 'City', 1: 'Education', 2: 'EverBenched', 3: 'Gender'},
 'Members': {0: ['bangalore', 'pune', 'new delhi'],
  1: ['bachelors', 'masters', 'phd'],
  2: ['no', 'yes'],
  3: ['male', 'female']}}

In [22]:
# EXPORTING FOR DE

my_feature_dict = {'CATEGORICAL' : summarize_cat(df_cleaned,categorical_features).to_dict(), 'NUMERICAL' : {'Column Name': numerical_features}}

my_feature_dict.get('NUMERICAL')

{'Column Name': ['JoiningYear',
  'PaymentTier',
  'Age',
  'ExperienceInCurrentDomain']}

In [23]:
my_feature_dict

{'CATEGORICAL': {'Column Name': {0: 'City',
   1: 'Education',
   2: 'EverBenched',
   3: 'Gender'},
  'Members': {0: ['bangalore', 'pune', 'new delhi'],
   1: ['bachelors', 'masters', 'phd'],
   2: ['no', 'yes'],
   3: ['male', 'female']}},
 'NUMERICAL': {'Column Name': ['JoiningYear',
   'PaymentTier',
   'Age',
   'ExperienceInCurrentDomain']}}

In [24]:
import pickle

# save dictionary to my_feature_dict.pkl file
with open('my_feature_dict.pkl', 'wb') as fp:
    pickle.dump(my_feature_dict, fp)
    print('dictionary saved successfully to file')

dictionary saved successfully to file
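
On the consuming side (for example the Streamlit app), the exported dictionary has to be loaded back and turned into something convenient for building input widgets. The sketch below round-trips a copy of the dictionary through a temporary file and flattens the categorical part into a `{column: options}` mapping. The helper name `options_by_column` is hypothetical and not part of the assignment script.

```python
import os
import pickle
import tempfile

# Same structure as my_feature_dict, copied from the notebook output above.
feature_dict = {
    "CATEGORICAL": {
        "Column Name": {0: "City", 1: "Education", 2: "EverBenched", 3: "Gender"},
        "Members": {
            0: ["bangalore", "pune", "new delhi"],
            1: ["bachelors", "masters", "phd"],
            2: ["no", "yes"],
            3: ["male", "female"],
        },
    },
    "NUMERICAL": {
        "Column Name": ["JoiningYear", "PaymentTier", "Age", "ExperienceInCurrentDomain"]
    },
}


def options_by_column(schema):
    """Map each categorical column name to its list of allowed values."""
    cat = schema["CATEGORICAL"]
    return {name: cat["Members"][idx] for idx, name in cat["Column Name"].items()}


# Save and reload through a temporary file to mimic the export/import round trip.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_feature_dict.pkl")
    with open(path, "wb") as fp:
        pickle.dump(feature_dict, fp)
    with open(path, "rb") as fp:
        loaded = pickle.load(fp)

options = options_by_column(loaded)
print(options["City"])
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.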


#### CREATING THE PIPELINE

In [25]:
from sklearn.pipeline import Pipeline

# PREPROCESSING TRANSFORMATIONS HERE ARE CHOSEN AS AN EXAMPLE
# REAL-WORLD SELECTION OF PREPROCESSING TRANSFORMATIONS MUST BE LOGICAL

#transform_senior_citizen = lambda x: x.assign(SENIORCITIZEN=x['SENIORCITIZEN'].map({1: 'Yes', 0: 'No'}))

from sklearn.preprocessing import FunctionTransformer

#preprocessor_stage_1 = Pipeline(steps=[
 #   ('transform_sc', FunctionTransformer(transform_senior_citizen)),
#])

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

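# StandardScaler ignores NaNs when fitting and passes them through, so filling with 0
# after scaling is equivalent to imputing the (scaled) column mean
pipeline_num = Pipeline(steps=[
    ('scale_data', StandardScaler()),
    ('simple_imputer1', SimpleImputer(strategy='constant', fill_value=0)),
])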

from sklearn.preprocessing import OneHotEncoder

pipeline_cat = Pipeline(steps=[
    ('OneHotEncode', OneHotEncoder(handle_unknown="ignore"))
])

from sklearn.compose import ColumnTransformer

preprocessor_stage_1 = ColumnTransformer(
    transformers=[
        ('cat', pipeline_cat, categorical_features),  # Categorical columns
        ('num', pipeline_num, numerical_features),     # Numerical columns
    ],remainder='drop')

preprocessor_stack = Pipeline(steps=[
    ('preprocessor_stage_1', preprocessor_stage_1),
    #('preprocessor_stage_2', preprocessor_stage_2)
])

# remainder='drop' REMOVES ANY COLUMN NOT LISTED IN categorical_features OR numerical_features (FOR EXAMPLE AN ID COLUMN)
# HERE EVERY COLUMN OF X IS LISTED, SO NOTHING IS DROPPED

In [26]:
preprocessor_stack

#### FITTING THE PIPELINE

In [27]:
preprocessor_stack.fit(X_train)

In [28]:
pd.DataFrame(preprocessor_stack.transform(X_train),columns=preprocessor_stack[-1].get_feature_names_out())

| | cat__City_bangalore | cat__City_new delhi | cat__City_pune | cat__Education_bachelors | cat__Education_masters | cat__Education_phd | cat__EverBenched_no | cat__EverBenched_yes | cat__Gender_female | cat__Gender_male | num__JoiningYear | num__PaymentTier | num__Age | num__ExperienceInCurrentDomain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.016242 | -1.00359 | 0.773695 | -1.639104 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | -0.586786 | 0.58933 | 0.382737 | -1.017268 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | -1.655472 | 0.58933 | -0.594661 | 1.470075 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.016242 | -1.00359 | -0.594661 | -1.017268 |
| 4 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | -0.586786 | 0.58933 | -0.985620 | 0.848239 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2206 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.016242 | 0.58933 | 0.187257 | -1.017268 |
| 2207 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.481899 | 0.58933 | -0.594661 | -1.017268 |
| 2208 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | -1.655472 | 0.58933 | -0.203702 | -0.395432 |
| 2209 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | -1.121129 | -1.00359 | -0.203702 | -1.017268 |


In [29]:
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_stack),
    ('classifier', RandomForestClassifier())
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

In [30]:
# Checking Training Accuracy
y_train_pred = pipeline.predict(X_train)

# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_train,y_train_pred))
print("\nClassification Report:\n", classification_report(y_train,y_train_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_train,y_train_pred))

Accuracy: 0.9353233830845771

Classification Report:
               precision    recall  f1-score   support

          No       0.93      0.97      0.95      1343
         Yes       0.95      0.89      0.91       868

    accuracy                           0.94      2211
   macro avg       0.94      0.93      0.93      2211
weighted avg       0.94      0.94      0.93      2211


Confusion Matrix:
 [[1299   44]
 [  99  769]]
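
Training accuracy is computed on rows the forest has already seen, so it tends to be optimistic. One common way to get a less biased estimate without touching the test set is k-fold cross-validation. The sketch below uses `cross_val_score` on synthetic data from `make_classification`; the data and the `n_estimators` and `random_state` values are placeholders, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; not the employee dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on the same rows the model was fitted on.
train_accuracy = model.fit(X_demo, y_demo).score(X_demo, y_demo)

# Accuracy averaged over 5 held-out folds; each fold refits a fresh clone of the model.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print("train accuracy:", round(train_accuracy, 3))
print("cv accuracy:   ", round(cv_scores.mean(), 3))
```

In the notebook, the same call would take `pipeline`, `X_train` and `y_train`, so that preprocessing is refitted inside every fold.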


#### BREAKING PIPELINE INTO EXPLAINABLE PARTS ON TEST

In [32]:
# SELECTING A SINGLE TEST ROW (POSITION 21) FOR A SAMPLE PREDICTION

my_pred_array = X_test.iloc[21:22]

my_pred_array

| | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain |
|---|---|---|---|---|---|---|---|---|
| 670 | masters | 2018 | new delhi | 3 | 27 | female | no | 5 |


In [33]:
pd.DataFrame(preprocessor_stack.transform(my_pred_array),columns=preprocessor_stack[0].get_feature_names_out())

| | cat__City_bangalore | cat__City_new delhi | cat__City_pune | cat__Education_bachelors | cat__Education_masters | cat__Education_phd | cat__EverBenched_no | cat__EverBenched_yes | cat__Gender_female | cat__Gender_male | num__JoiningYear | num__PaymentTier | num__Age | num__ExperienceInCurrentDomain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.550585 | 0.58933 | -0.79014 | 1.470075 |


In [34]:
# USING PIPELINE TO DO ALL TOGETHER (PREPROCESSING FOLLOWED BY MODEL PREDICT)

# SINGLE PREDICTION

y_pred = pipeline.predict(my_pred_array)

y_pred


array(['Yes'], dtype=object)

In [35]:
# USING PIPELINE TO DO ALL TOGETHER (PREPROCESSING FOLLOWED BY MODEL PREDICT)

# MULTIPLE PREDICTION
y_test_pred = pipeline.predict(X_test)

# EVALUATE MODEL FOR TEST ACCURACY SINCE WE HAVE TEST SET
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_test_pred))
print("\nClassification Report:\n", classification_report(y_test, y_test_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_test_pred))

Accuracy: 0.7305605786618445

Classification Report:
               precision    recall  f1-score   support

          No       0.76      0.81      0.78       333
         Yes       0.68      0.60      0.64       220

    accuracy                           0.73       553
   macro avg       0.72      0.71      0.71       553
weighted avg       0.73      0.73      0.73       553


Confusion Matrix:
 [[271  62]
 [ 87 133]]
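
The numbers in the classification report can be recomputed by hand from the confusion matrix, which is a useful check on how to read it. In scikit-learn's layout, rows are true labels and columns are predicted labels, here in the order `No`, `Yes`. The variable names below are only for illustration.

```python
# Test-set confusion matrix from the output above.
# Row 1: true 'No'  -> predicted 'No' (tn), predicted 'Yes' (fp)
# Row 2: true 'Yes' -> predicted 'No' (fn), predicted 'Yes' (tp)
tn, fp = 271, 62
fn, tp = 87, 133

total = tn + fp + fn + tp

accuracy = (tn + tp) / total        # share of all predictions that are correct
precision_yes = tp / (tp + fp)      # of predicted leavers, how many actually left
recall_yes = tp / (tp + fn)         # of actual leavers, how many were caught
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

print(round(accuracy, 4), round(precision_yes, 2), round(recall_yes, 2), round(f1_yes, 2))
```

The values agree with the report for the `Yes` class. The recall of about 0.60 means roughly four in ten employees who actually left were missed, which is worth weighing against the much higher training accuracy seen earlier.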


### EXPORTING THE PIPELINE

In [36]:
!pip install dill






In [37]:
import dill

# save trained pipeline file

with open('pipeline.pkl', 'wb') as file:
    dill.dump(pipeline, file)

print('pipeline saved successfully to file')

pipeline saved successfully to file
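
With both files exported, the remaining assignment steps happen in a separate Streamlit script. What follows is a rough sketch of one possible layout, not a reference solution: the function names `build_input_frame` and `run_app`, the widget defaults, and the form key are all assumptions. Streamlit and dill are imported inside `run_app` so the helper can be used without them installed.

```python
import pandas as pd


def build_input_frame(values, column_order):
    """Turn a {column: value} mapping (e.g. widget values) into a one-row DataFrame."""
    return pd.DataFrame([[values[c] for c in column_order]], columns=column_order)


def run_app(pipeline_path="pipeline.pkl", schema_path="my_feature_dict.pkl"):
    # Lazy imports: only needed when the app is actually served.
    import pickle

    import dill
    import streamlit as st

    with open(schema_path, "rb") as fp:
        schema = pickle.load(fp)
    with open(pipeline_path, "rb") as fp:
        pipeline = dill.load(fp)

    st.title("Employee Churn Predictor")
    st.subheader("Created by <User Name>")

    cat = schema["CATEGORICAL"]
    values = {}
    with st.form("prediction_form"):
        # One dropdown per categorical feature, limited to the exported members.
        for idx, name in cat["Column Name"].items():
            values[name] = st.selectbox(name, cat["Members"][idx])
        # One integer input per numerical feature (default 0 is a placeholder).
        for name in schema["NUMERICAL"]["Column Name"]:
            values[name] = st.number_input(name, value=0, step=1)
        submitted = st.form_submit_button("Submit")

    if submitted:
        row = build_input_frame(values, list(values))
        st.write("Prediction:", pipeline.predict(row)[0])


# In app.py, call run_app() at module level and start it with: streamlit run app.py
```

Because the categorical options come straight from the exported schema, they are already lowercase and punctuation-free, matching what the pipeline saw during training.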
