### Libraries

In [2]:
import sklearn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score



### Load dataset and Exploratory Analysis

In [3]:
df = pd.read_csv('dataset.csv')

In [4]:
df.shape

(23058, 30)

In [5]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,JobInvolvement,...,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Employee Source,AgeStartedWorking
0,41,Voluntary Resignation,Travel_Rarely,Sales,1,2,Life Sciences,2,Female,3,...,0,8,0,1,6,4,0,5,Referral,33
1,37,Voluntary Resignation,Travel_Rarely,Human Resources,6,4,Human Resources,1,Female,3,...,0,8,0,1,6,4,0,5,Referral,29
2,41,Voluntary Resignation,Travel_Rarely,Sales,1,2,Life Sciences,2,Female,3,...,0,8,0,1,6,4,0,5,Referral,33
3,37,Voluntary Resignation,Travel_Rarely,Human Resources,6,4,Marketing,1,Female,3,...,0,8,0,1,6,4,0,5,Referral,29
4,37,Voluntary Resignation,Travel_Rarely,Human Resources,6,4,Human Resources,1,Female,3,...,0,8,0,1,6,4,0,5,Referral,29


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23058 entries, 0 to 23057
Data columns (total 30 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       23058 non-null  int64 
 1   Attrition                 23058 non-null  object
 2   BusinessTravel            23058 non-null  object
 3   Department                23058 non-null  object
 4   DistanceFromHome          23058 non-null  int64 
 5   Education                 23058 non-null  int64 
 6   EducationField            23058 non-null  object
 7   EnvironmentSatisfaction   23058 non-null  int64 
 8   Gender                    23058 non-null  object
 9   JobInvolvement            23058 non-null  int64 
 10  JobLevel                  23058 non-null  int64 
 11  JobRole                   23058 non-null  object
 12  JobSatisfaction           23058 non-null  int64 
 13  MaritalStatus             23058 non-null  object
 14  MonthlyIncome         

### Selecting target variable

In [7]:
df['Attrition'].value_counts()

Attrition
Current employee         19370
Voluntary Resignation     3601
Termination                 87
Name: count, dtype: int64

### Encoding target variable and preparing data

Class 0 = Non-occurrence of event (didn't quit)
Class 1 = Occurrence of event (quit)

Let's analyze the results for class 1 to understand the factors that influence employee satisfaction and, in turn, lead employees to quit.

In [8]:
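# Filter dataset to keep only 'Current employee' and 'Voluntary Resignation'.
# .copy() makes the result an independent DataFrame, so assigning to a column
# later does not raise a SettingWithCopyWarning.
df = df[df['Attrition'].isin(['Current employee', 'Voluntary Resignation'])].copy()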

In [9]:
# Check unique values in 'Attrition' column after filter
df['Attrition'].value_counts()

Attrition
Current employee         19370
Voluntary Resignation     3601
Name: count, dtype: int64

In [10]:
# Encode target variable
df['Attrition'] = df['Attrition'].apply(lambda x: 1 if x == 'Voluntary Resignation' else 0)

In [11]:
df['Attrition'].value_counts()

Attrition
0    19370
1     3601
Name: count, dtype: int64

In [12]:
# split variables
x = df.drop('Attrition', axis = 1)
y = df['Attrition']

In [13]:
# Split into train and test subsets (80% / 20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)
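
The classes here are imbalanced (3601 resignations out of 22971 rows, about 16%), and the split above does not pass `stratify`. As a hedged aside, the sketch below shows on made-up data how `stratify` keeps the class ratio equal in both subsets. The names `toy_x`, `toy_y`, `x_tr`, `x_te`, `y_tr`, and `y_te` are illustrative and not part of the notebook, and applying this to the real split would change the reported results.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data only: 20 rows with 20% positives
toy_x = pd.DataFrame({'feature': range(20)})
toy_y = pd.Series([0] * 16 + [1] * 4)

# stratify=toy_y preserves the 80/20 class ratio in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=42, stratify=toy_y)

print(y_tr.mean(), y_te.mean())
```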

In [14]:
# Split categorical and numeric variables
cat_features = x.select_dtypes(include = ['object']).columns.tolist()
num_features = x.select_dtypes(include = ['int64', 'float64']).columns.tolist()

### Numeric Variables Preprocessing Pipeline

The goal of this pipeline is to ensure that all numeric features in the dataset are treated consistently and appropriately before being fed into the Machine Learning model.

- Missing Value Handling: Replaces missing values with the median to prevent the model from being affected by missing data.
- Standardization: Rescales each numeric feature to zero mean and unit variance so that all numeric features are on the same scale, which helps the model fit and ensures that no single feature dominates the others because of its units.

In [15]:
# Create pipeline
numeric_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'median')),
    ('scaler', StandardScaler())])
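
To make the two steps concrete, here is a minimal sketch on a made-up single column with one missing value. The array `toy` and the name `toy_numeric` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative column with one missing value (not taken from dataset.csv)
toy = np.array([[1.0], [2.0], [np.nan], [9.0]])

toy_numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

scaled = toy_numeric.fit_transform(toy)

# The median of the observed values (1, 2, 9) is 2, so the gap is filled with 2
median_used = toy_numeric.named_steps['imputer'].statistics_[0]

print(median_used)
# After scaling, the column has (numerically) zero mean and unit standard deviation
print(scaled.mean(), scaled.std())
```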

### Categorical Variables Preprocessing Pipeline

The goal of this pipeline is to ensure that all categorical features in the dataset are treated consistently and appropriately before being fed into the Machine Learning model.

- Missing Value Treatment: Replaces missing values with 'missing', creating a special category for missing values.
- One-Hot Encoding: Transforms categorical features into a form that can be used by the Machine Learning model, converting each category into a binary column.


In [22]:
# Create pipeline
categorical_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'constant', fill_value = 'missing')),
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))])
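
The following sketch illustrates both steps on made-up data, including what `handle_unknown = 'ignore'` does with a category that was not present during fitting. The frames `toy_train` and `toy_new`, and the category 'Legal', are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: one missing value in training, one unseen category later
toy_train = pd.DataFrame({'Department': ['Sales', np.nan, 'Human Resources']})
toy_new = pd.DataFrame({'Department': ['Legal']})

toy_categorical = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# One binary column per category, including the 'missing' placeholder
encoded_train = toy_categorical.fit_transform(toy_train).toarray()
categories = list(toy_categorical.named_steps['onehot'].categories_[0])

# An unseen category is encoded as all zeros instead of raising an error
encoded_new = toy_categorical.transform(toy_new).toarray()

print(categories)
print(encoded_train)
print(encoded_new)
```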

### Modelling Pipeline
The ColumnTransformer class allows you to apply different transformations to different subsets of features. This is useful when you need to preprocess numeric and categorical columns differently.

In [25]:
preprocessor = ColumnTransformer(
    transformers = [
        ('num', numeric_transformer, num_features),
        ('cat', categorical_transformer, cat_features)])
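
As a small illustration of how the column routing works, the sketch below applies a scaler to one numeric column and an encoder to one categorical column of a made-up frame. The frame `toy` and the name `toy_preprocessor` are assumptions for this example; the imputation steps are omitted to keep it short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative frame with one numeric and one categorical column
toy = pd.DataFrame({
    'Age': [30, 40, 50],
    'Gender': ['Female', 'Male', 'Female'],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender']),
])

transformed = toy_preprocessor.fit_transform(toy)

# One scaled column for Age plus two one-hot columns for Gender
print(transformed.shape)
```

The outputs are concatenated in the order the transformers are listed, which is why the numeric feature names come before the one-hot names when the coefficient labels are assembled later.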

The goal of the modeling pipeline is to combine all preprocessing and modeling steps into a single workflow that can be applied consistently to both training and test data. This ensures that all necessary transformations are applied correctly and in the right order, making the process easier to replicate and maintain.

In [26]:
model_df = Pipeline(steps = [('preprocessor', preprocessor),
                             ('classifier', LogisticRegression(max_iter = 1000))])
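
The sketch below shows the same idea end to end on made-up data: the preprocessing statistics are learned during `fit` and then reused unchanged at prediction time. All names prefixed with `toy_` and the values in the frames are illustrative, not taken from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative training data
toy_x = pd.DataFrame({
    'DistanceFromHome': [1, 2, 3, 20, 25, 30],
    'OverTime': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
})
toy_y = [0, 0, 0, 1, 1, 1]

toy_model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['DistanceFromHome']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['OverTime']),
    ])),
    ('classifier', LogisticRegression(max_iter=1000)),
])

toy_model.fit(toy_x, toy_y)

# New rows pass through the already-fitted scaler and encoder before prediction
toy_new = pd.DataFrame({'DistanceFromHome': [2, 28], 'OverTime': ['No', 'Yes']})
toy_pred = toy_model.predict(toy_new)

# The scaler's mean was computed from the training rows only
train_mean = toy_model.named_steps['preprocessor'].named_transformers_['num'].mean_[0]

print(toy_pred, train_mean)
```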

In [27]:
# Train model
model_df.fit(x_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Age', 'DistanceFromHome',
                                                   'Education',
                                                   'EnvironmentSatisfaction',
                                                   'JobInvolvement', 'JobLevel',
                                                   'JobSatisfaction',
                                                   'MonthlyIncome',
                                                   'NumCompaniesWorked',
                                                   'P

In [29]:
# Predict on the test dataset
y_pred = model_df.predict(x_test)

In [31]:
# Evaluate model
accuracy = accuracy_score(y_test,y_pred)
print(accuracy)

0.8554951033732318
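
Because the classes are imbalanced, it helps to compare this accuracy with a trivial baseline. The sketch below uses the class counts from the `value_counts()` output above to compute the accuracy of always predicting class 0. It is computed on the full filtered dataset, so the figure for the actual test split may differ slightly; the variable names are illustrative.

```python
# Class counts after filtering, taken from the value_counts() output
n_stayed = 19370
n_resigned = 3601

# Accuracy of a model that always predicts class 0 (didn't quit)
majority_baseline = n_stayed / (n_stayed + n_resigned)

# Accuracy printed by the notebook for the logistic regression pipeline
model_accuracy = 0.8554951033732318

print(round(majority_baseline, 4))
print(round(model_accuracy - majority_baseline, 4))
```

The baseline is about 0.843, so the model's accuracy is only about one percentage point higher. Metrics that focus on class 1, such as recall or precision, would give a clearer picture of how well resignations are detected.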


#### Checking model coefficients

In [32]:
coefficients = model_df.named_steps['classifier'].coef_[0]

In [34]:
features_names = num_features + list(model_df.named_steps['preprocessor'] \
                                     .transformers_[1][1].named_steps['onehot'].get_feature_names_out(cat_features))

In [37]:
features_names

['Age',
 'DistanceFromHome',
 'Education',
 'EnvironmentSatisfaction',
 'JobInvolvement',
 'JobLevel',
 'JobSatisfaction',
 'MonthlyIncome',
 'NumCompaniesWorked',
 'PercentSalaryHike',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StockOptionLevel',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'WorkLifeBalance',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager',
 'AgeStartedWorking ',
 'BusinessTravel_Non-Travel',
 'BusinessTravel_Travel_Frequently',
 'BusinessTravel_Travel_Rarely',
 'Department_Human Resources',
 'Department_Research & Development',
 'Department_Sales',
 'EducationField_Human Resources',
 'EducationField_Life Sciences',
 'EducationField_Marketing',
 'EducationField_Medical',
 'EducationField_Other',
 'EducationField_Technical Degree',
 'Gender_Female',
 'Gender_Male',
 'JobRole_Healthcare Representative',
 'JobRole_Human Resources',
 'JobRole_Laboratory Technician',
 'JobRole_Manager',
 'JobRole_Manufacturing Dir

In [41]:
# Build a DataFrame of feature names and coefficients, sorted from largest to smallest
coeff_df = pd.DataFrame({'Attributes': features_names, 'Coefficient': coefficients}).sort_values(by = 'Coefficient', ascending = False)

In [42]:
coeff_df.head(10)

Unnamed: 0,Attributes,Coefficient
22,BusinessTravel_Travel_Frequently,0.659193
48,OverTime_Yes,0.429008
46,MaritalStatus_Single,0.375998
32,EducationField_Technical Degree,0.353505
56,Employee Source_Referral,0.283704
37,JobRole_Laboratory Technician,0.277759
43,JobRole_Sales Representative,0.200103
26,Department_Sales,0.162521
1,DistanceFromHome,0.160598
18,YearsSinceLastPromotion,0.156987
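
Logistic regression is linear in the log-odds, so exponentiating a coefficient gives an odds ratio. As a rough sketch, the code below does this for the three largest coefficients in the table. For the one-hot columns, the ratio compares having the category with not having it, holding the other features fixed. The dictionary name `top_coefficients` is illustrative.

```python
import math

# Coefficients copied from the coeff_df.head(10) output
top_coefficients = {
    'BusinessTravel_Travel_Frequently': 0.659193,
    'OverTime_Yes': 0.429008,
    'MaritalStatus_Single': 0.375998,
}

# exp(coefficient) is the multiplicative change in the odds of resignation
odds_ratios = {name: math.exp(coef) for name, coef in top_coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: {ratio:.2f}')
```

For example, the coefficient of 0.659193 corresponds to odds of resignation that are roughly 1.9 times higher for frequent travelers. For numeric features the pipeline standardizes the inputs, so their coefficients refer to a change of one standard deviation rather than one raw unit.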


### Analyzing Results

**BusinessTravel_Travel_Frequently (0.659193)**

Employees who travel frequently for business are more likely to resign voluntarily. This is the largest positive coefficient in the model, suggesting that frequent travel may be a source of stress or dissatisfaction.

**EducationField_Technical Degree (0.353505)**

Employees with a technical degree are more likely to voluntarily resign compared to those with other educational backgrounds. This may indicate that these employees have more opportunities in the job market or that their expectations are not being met.

**Employee Source_Referral (0.283704)**

Employees who were hired through referrals are more likely to resign voluntarily. This may suggest that, despite being referred, they are not as well aligned with the company as other employees.

**MaritalStatus_Single (0.375998)**

Single employees are more likely to voluntarily resign compared to married employees or employees in other marital statuses. This may be due to greater flexibility and fewer personal responsibilities.

**JobRole_Laboratory Technician (0.277759)**

Employees who work as laboratory technicians are more likely to voluntarily quit. This may indicate dissatisfaction with the specific role or work environment.

**OverTime_Yes (0.429008)**

Employees who work overtime are more likely to voluntarily quit. This suggests that overwork can lead to burnout and dissatisfaction.

**DistanceFromHome (0.160598)**

A longer distance from home to work is associated with a higher likelihood of voluntarily quitting. Long commutes can cause burnout and dissatisfaction.

**YearsSinceLastPromotion (0.156987)**

Employees who have gone more years since their last promotion are more likely to quit voluntarily. This may indicate dissatisfaction with growth opportunities within the company.

**JobRole_Sales Representative (0.200103)**

Employees who work as sales representatives are more likely to resign voluntarily. This role may involve high performance pressure or a lack of adequate support.

**YearsAtCompany (0.105078)**

The more years an employee has been with the company, the more likely they are to resign voluntarily. This may indicate that long-tenured employees feel stagnant or seek new opportunities.

### Conclusion

Positive coefficients indicate factors associated with a higher likelihood of voluntary resignation, holding the other features fixed. Understanding these factors can help a company take preventive measures, such as improving working conditions, providing opportunities for growth, and minimizing the need for overtime, to reduce the voluntary turnover rate.