Context:
This dataset contains information about employees in a company, including their educational backgrounds, work history, demographics, and employment-related factors. It has been anonymized to protect privacy while still providing valuable insights into the workforce.

We need to predict whether an employee will leave the company or not.

Columns:
- Education: The educational qualifications of employees, including degree, institution, and field of study.
- Joining Year: The year each employee joined the company, indicating their length of service.
- City: The location or city where each employee is based or works.
- Payment Tier: Categorization of employees into different salary tiers.
- Age: The age of each employee, providing demographic insights.
- Gender: Gender identity of employees, promoting diversity analysis.
- Ever Benched: Indicates if an employee has ever been temporarily without assigned work.
- Experience in Current Domain: The number of years of experience employees have in their current field.
- Leave or Not: The target column, indicating whether the employee leaves the company.

In [1]:
# import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_curve, roc_auc_score
from sklearn.preprocessing import StandardScaler

In [3]:
dataset_df = pd.read_csv('Employee.csv')

In [6]:
dataset_df.head()

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
0,Bachelors,2017,Bangalore,3,34,Male,No,0,0
1,Bachelors,2013,Pune,1,28,Female,No,3,1
2,Bachelors,2014,New Delhi,3,38,Female,No,2,0
3,Masters,2016,Bangalore,3,27,Male,No,5,1
4,Masters,2017,Pune,3,24,Male,Yes,2,1


In [9]:
dataset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Education                  4653 non-null   object
 1   JoiningYear                4653 non-null   int64 
 2   City                       4653 non-null   object
 3   PaymentTier                4653 non-null   int64 
 4   Age                        4653 non-null   int64 
 5   Gender                     4653 non-null   object
 6   EverBenched                4653 non-null   object
 7   ExperienceInCurrentDomain  4653 non-null   int64 
 8   LeaveOrNot                 4653 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 327.3+ KB


In [10]:
dataset_df.describe()

Unnamed: 0,JoiningYear,PaymentTier,Age,ExperienceInCurrentDomain,LeaveOrNot
count,4653.0,4653.0,4653.0,4653.0,4653.0
mean,2015.06297,2.698259,29.393295,2.905652,0.343864
std,1.863377,0.561435,4.826087,1.55824,0.475047
min,2012.0,1.0,22.0,0.0,0.0
25%,2013.0,3.0,26.0,2.0,0.0
50%,2015.0,3.0,28.0,3.0,0.0
75%,2017.0,3.0,32.0,4.0,1.0
max,2018.0,3.0,41.0,7.0,1.0


Explore details about categorical variables

In [27]:
print(dataset_df['Education'].value_counts())
print(dataset_df['Education'].count())
print('------------------')
print(dataset_df['City'].value_counts())
print(dataset_df['City'].count())
print('------------------')
print(dataset_df['Gender'].value_counts())
print(dataset_df['Gender'].count())
print('------------------')
print(dataset_df['EverBenched'].value_counts())
print(dataset_df['EverBenched'].count())

print('---------------------------------------------')
print('---------------------------------------------')
print(dataset_df.isna().sum())

Education
Bachelors    3601
Masters       873
PHD           179
Name: count, dtype: int64
4653
------------------
City
Bangalore    2228
Pune         1268
New Delhi    1157
Name: count, dtype: int64
4653
------------------
Gender
Male      2778
Female    1875
Name: count, dtype: int64
4653
------------------
EverBenched
No     4175
Yes     478
Name: count, dtype: int64
4653
---------------------------------------------
---------------------------------------------
Education                    0
JoiningYear                  0
City                         0
PaymentTier                  0
Age                          0
Gender                       0
EverBenched                  0
ExperienceInCurrentDomain    0
LeaveOrNot                   0
dtype: int64


Based on the details above, there are no missing values, and every column appears to contain acceptable values.

Now, convert the categorical columns into numerical ones using one-hot encoding.

In [32]:
encoded_dataset_df = pd.get_dummies(dataset_df, columns=['Education', 'City', 'Gender', 'EverBenched'], drop_first=True, dtype=int)
encoded_dataset_df.head()

Unnamed: 0,JoiningYear,PaymentTier,Age,ExperienceInCurrentDomain,LeaveOrNot,Education_Masters,Education_PHD,City_New Delhi,City_Pune,Gender_Male,EverBenched_Yes
0,2017,3,34,0,0,0,0,0,0,1,0
1,2013,1,28,3,1,0,0,0,1,0,0
2,2014,3,38,2,0,0,0,1,0,0,0
3,2016,3,27,5,1,1,0,0,0,1,0
4,2017,3,24,2,1,1,0,0,1,1,1
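
To make the effect of `drop_first=True` easier to see, here is a minimal sketch on a toy frame. The rows and the `toy_df` / `toy_encoded` names are made up for illustration and are not taken from Employee.csv. The first category in sorted order ("Bangalore") gets no column of its own and is represented by zeros in the remaining city columns.

```python
import pandas as pd

# Toy frame (hypothetical rows, not taken from Employee.csv)
toy_df = pd.DataFrame({
    "City": ["Bangalore", "Pune", "New Delhi", "Pune"],
    "Age": [34, 28, 38, 24],
})

# drop_first=True drops the first category in sorted order ("Bangalore"),
# which then acts as the baseline: both remaining city columns are 0 for it.
toy_encoded = pd.get_dummies(toy_df, columns=["City"], drop_first=True, dtype=int)
print(toy_encoded)
```

Dropping one level per categorical column avoids perfectly collinear dummy columns, which is usually the motivation when the features feed a linear model such as logistic regression.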


Prepare the X and Y datasets

In [40]:
# columns= already selects the axis, so axis=1 is redundant
X = encoded_dataset_df.drop(columns=['LeaveOrNot'])
y = encoded_dataset_df['LeaveOrNot']

display(X.head())
display(y.head())

Unnamed: 0,JoiningYear,PaymentTier,Age,ExperienceInCurrentDomain,Education_Masters,Education_PHD,City_New Delhi,City_Pune,Gender_Male,EverBenched_Yes
0,2017,3,34,0,0,0,0,0,1,0
1,2013,1,28,3,0,0,0,1,0,0
2,2014,3,38,2,0,0,1,0,0,0
3,2016,3,27,5,1,0,0,0,1,0
4,2017,3,24,2,1,0,0,1,1,1


0    0
1    1
2    0
3    1
4    1
Name: LeaveOrNot, dtype: int64

Prepare the train and test datasets

In [43]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=21)
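
About 34% of the rows have `LeaveOrNot` equal to 1 (the mean in the `describe()` output), so the classes are not balanced. One optional variation, which is not what the cell above does, is to pass `stratify=` so that both splits keep roughly the same class share. The sketch below uses synthetic stand-in data, and the `toy_*` / `t*` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1000 rows, 34% positive class
toy_X = pd.DataFrame({"feature": range(1000)})
toy_y = pd.Series([1] * 340 + [0] * 660, name="LeaveOrNot")

# stratify=toy_y asks for the same class proportions in both splits
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=21, stratify=toy_y
)

print("positive share, train:", ty_train.mean())
print("positive share, test :", ty_test.mean())
```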

Scale the dataset

In [51]:
# Standardizing the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)
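
As a sanity check on what the scaling step computes, the sketch below reproduces `StandardScaler` by hand on a tiny made-up column (the values are hypothetical). The point it illustrates is that the mean and standard deviation come from the training data only and are then reused for the test data, which is why the cell above calls `fit_transform` on the training set and only `transform` on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, e.g. ages
toy_train = np.array([[22.0], [26.0], [28.0], [32.0]])
toy_test = np.array([[41.0]])

toy_scaler = StandardScaler().fit(toy_train)

# Manual z-score using training statistics only.
# np.std defaults to the population standard deviation (ddof=0),
# which matches what StandardScaler uses.
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)
manual_train = (toy_train - train_mean) / train_std
manual_test = (toy_test - train_mean) / train_std

print(np.allclose(toy_scaler.transform(toy_train), manual_train))
print(np.allclose(toy_scaler.transform(toy_test), manual_test))
```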

Train and evaluate a Logistic Regression model

In [53]:
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
Y_pred = logreg.predict(X_test)

# Accuracy on the training and test sets, as percentages
log_train = round(logreg.score(X_train, y_train) * 100, 2)
log_accuracy = round(accuracy_score(y_test, Y_pred) * 100, 2)

# Store the F1 score under its own name. Assigning it to f1_score shadows the
# imported function, so re-running the cell raises
# TypeError: 'numpy.float64' object is not callable
# If the name was already overwritten in this kernel, re-run the import cell first.
log_f1 = f1_score(y_test, Y_pred)
cnf_matrix = confusion_matrix(y_test, Y_pred)

print("Training Accuracy    :", log_train)
print("Model Accuracy Score :", log_accuracy)
print("Model F1 Score       :", log_f1)
print(cnf_matrix)
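
The accuracy and F1 metrics computed in the cell above can both be derived from the confusion matrix. The sketch below shows the arithmetic in plain Python, assuming scikit-learn's binary layout `[[TN, FP], [FN, TP]]` (rows are actual classes, columns are predicted classes). The helper name `metrics_from_confusion` and the counts are hypothetical and are not results from this notebook.

```python
def metrics_from_confusion(cm):
    """Return (accuracy, precision, recall, f1) for a 2x2 confusion matrix.

    cm is laid out as [[TN, FP], [FN, TP]]: rows = actual, columns = predicted.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts, for illustration only
toy_cm = [[50, 10], [20, 20]]
acc, prec, rec, f1 = metrics_from_confusion(toy_cm)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

With an imbalanced target, accuracy alone can look reasonable even when recall on the leaving class is low, which is one reason to report F1 and the confusion matrix next to it.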