# Logistic Regression using Employee Churn Dataset

### Definition
Logistic regression, despite its name, is a classification model rather than a regression model. It is the appropriate analysis to conduct when the dependent variable is dichotomous (binary), i.e. whether an event will occur or not. Like all regression analyses, logistic regression is a predictive analysis. It is easy to implement and achieves very good performance when the classes are linearly separable.

#### What is the difference between Linear and Logistic Regression?

While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it is not the best tool for predicting the class of an observed data point. In order to estimate the class of a data point, we need some sort of guidance on what would be the **most probable class** for that data point. For this, we use **Logistic Regression**.


Logistic Regression is a variation of Linear Regression, used when the observed dependent variable, <b>y</b>, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.

Logistic regression fits a special s-shaped curve by taking the linear regression function and transforming the numeric estimate into a probability with the following function, which is called the sigmoid function 𝜎:

$$
h_\theta(x) = \sigma(\theta^TX) = \frac{e^{(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots)}}{1 + e^{(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots)}}
$$
Or:
$$
\text{ProbabilityOfClass}_1 = P(Y=1 \mid X) = \sigma(\theta^TX) = \frac{e^{\theta^TX}}{1 + e^{\theta^TX}}
$$

In this equation, $\theta^TX$ is the regression result (the sum of the variables weighted by the coefficients), $e$ is the base of the natural exponential function, and $\sigma(\theta^TX)$ is the sigmoid or [logistic function](http://en.wikipedia.org/wiki/Logistic_function?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2021-01-01), also called the logistic curve. It has a characteristic "S" shape (sigmoid curve).
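As a minimal sketch of the formula above, the sigmoid can be written in a few lines of NumPy. The coefficients `theta` and the observation `x` are made-up values for illustration only; they are not taken from the employee data. The form $1/(1+e^{-z})$ used here is algebraically the same as $e^{z}/(1+e^{z})$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one observation (x[0] = 1 is the intercept term)
theta = np.array([-1.0, 0.5, 0.25])
x = np.array([1.0, 2.0, 4.0])

z = theta @ x      # linear score theta^T x = -1 + 1 + 1 = 1
p = sigmoid(z)     # estimated P(Y=1 | x)
print(round(float(p), 4))  # 0.7311
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual default decision threshold.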

So, briefly, Logistic Regression passes the input through the logistic/sigmoid but then treats the result as a probability:

<img
src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/images/mod_ID_24_final.png" width="400" align="center">

The objective of the **Logistic Regression** algorithm is to find the best parameters $\theta$ for $h_\theta(x) = \sigma(\theta^TX)$, such that the model best predicts the class of each case.
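To make "finding the best parameters" concrete, here is a rough sketch that fits $\theta$ by plain batch gradient descent on the mean log-loss. This is an illustration only: the function name `fit_logistic`, the learning rate, the iteration count, and the toy data are all assumptions, and scikit-learn's `LogisticRegression` uses a different, regularized solver by default.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Batch gradient descent on the mean log-loss; returns theta (intercept first)."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ theta)                  # current predicted probabilities
        grad = Xb.T @ (p - y) / len(y)           # gradient of the mean log-loss
        theta -= lr * grad
    return theta

# Toy data: one feature, with overlapping classes so a finite optimum exists
X_demo = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y_demo = np.array([0, 0, 0, 1, 0, 1, 1, 1])

theta_demo = fit_logistic(X_demo, y_demo)
probs_demo = sigmoid(np.column_stack([np.ones(len(X_demo)), X_demo]) @ theta_demo)
print(theta_demo, probs_demo.round(2))
```

The fitted slope is positive, so the predicted probability of class 1 rises with the feature value, tracing the S-shaped curve described above.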

source: IBM Data Science

## Objective

* Implement Logistic Regression using Scikit Learn on the Employee churn data
* Use a multicollinearity check to determine which independent features should be used in the process
* Create a model which will be trained and tested
* Evaluate the Model to know its efficiency in making predictions
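The notebook does not show an explicit multicollinearity computation, so the following is only one possible way to do such a check: the variance inflation factor (VIF). For each feature, the VIF equals the corresponding diagonal entry of the inverse of the feature correlation matrix. The helper name `vif` and the synthetic columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def vif(frame):
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)

# Synthetic example: "a_copy" is almost a duplicate of "a", "b" is independent
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "a_copy": a + 0.1 * rng.normal(size=n),
})

scores = vif(toy)
print(scores.round(1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a feature is largely explained by the others; here `a` and `a_copy` score very high while `b` stays close to 1.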

### Initial Understanding of data

Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

Education
1 'Below College'
2 'College'
3 'Bachelor'
4 'Master'
5 'Doctor'

EnvironmentSatisfaction
1 'Low'
2 'Medium'
3 'High'
4 'Very High'

JobInvolvement
1 'Low'
2 'Medium'
3 'High'
4 'Very High'

JobSatisfaction
1 'Low'
2 'Medium'
3 'High'
4 'Very High'

PerformanceRating
1 'Low'
2 'Good'
3 'Excellent'
4 'Outstanding'

RelationshipSatisfaction
1 'Low'
2 'Medium'
3 'High'
4 'Very High'

WorkLifeBalance
1 'Bad'
2 'Good'
3 'Better'
4 'Best'
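These integer codes can be turned back into readable labels when exploring or plotting the data. A small sketch, assuming a frame with an `Education` column coded as listed above; the `sample` frame and the `EducationLabel` column name are made up for illustration.

```python
import pandas as pd

# Code-to-label mapping taken from the Education listing above
education_labels = {
    1: "Below College",
    2: "College",
    3: "Bachelor",
    4: "Master",
    5: "Doctor",
}

sample = pd.DataFrame({"Education": [2, 1, 4, 3, 5]})
sample["EducationLabel"] = sample["Education"].map(education_labels)
print(sample)
```

The same pattern applies to the other coded columns, each with its own dictionary.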


## Libraries Import

In [44]:
import pandas as pd
import pylab as pl
import numpy as np

from plotly.offline import iplot, init_notebook_mode
import plotly.express as px

## Data Import

In [45]:
df = pd.read_csv('employee-attrition.csv')

In [46]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


## Data Exploration

In [47]:
df.isnull().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [49]:
df.describe()

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.92381,802.485714,9.192517,2.912925,1.0,1024.865306,2.721769,65.891156,2.729932,2.063946,...,2.712245,80.0,0.793878,11.279592,2.79932,2.761224,7.008163,4.229252,2.187755,4.123129
std,9.135373,403.5091,8.106864,1.024165,0.0,602.024335,1.093082,20.329428,0.711561,1.10694,...,1.081209,0.0,0.852077,7.780782,1.289271,0.706476,6.126525,3.623137,3.22243,3.568136
min,18.0,102.0,1.0,1.0,1.0,1.0,1.0,30.0,1.0,1.0,...,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,465.0,2.0,2.0,1.0,491.25,2.0,48.0,2.0,1.0,...,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,802.0,7.0,3.0,1.0,1020.5,3.0,66.0,3.0,2.0,...,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,1157.0,14.0,4.0,1.0,1555.75,4.0,83.75,3.0,3.0,...,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,60.0,1499.0,29.0,5.0,1.0,2068.0,4.0,100.0,4.0,5.0,...,4.0,80.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


In [50]:
categorical_counts = df['Attrition'].value_counts()
categorical_counts

No     1233
Yes     237
Name: Attrition, dtype: int64
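The counts above show that the classes are imbalanced. As a quick worked calculation using only those two numbers, this is the attrition rate and the accuracy that a trivial "always predict No" rule would reach. It is a useful reference point when reading any accuracy score later on.

```python
# Class counts copied from the value_counts() output above
counts = {"No": 1233, "Yes": 237}
total = sum(counts.values())

attrition_rate = counts["Yes"] / total            # share of employees who left
baseline_accuracy = max(counts.values()) / total  # accuracy of always predicting "No"

print(f"{attrition_rate:.3f} {baseline_accuracy:.3f}")  # 0.161 0.839
```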

In [51]:
fig = px.pie(df, "Attrition", color='Attrition', hole=.3)
fig.show()

### Converting categorical values of our dependent feature to numerical

In [52]:
# Encode the target as integers: Yes -> 1, No -> 2 (the outputs below use this encoding)
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 2})

In [53]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,2,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,1,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,2,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,2,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


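### Checking which independent features correlate strongly with the dependent feature

In [54]:
# numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
df.corr(numeric_only=True)["Attrition"]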

Age                         0.159205
Attrition                   1.000000
DailyRate                   0.056652
DistanceFromHome           -0.077924
Education                   0.031373
EmployeeCount                    NaN
EmployeeNumber              0.010577
EnvironmentSatisfaction     0.103369
HourlyRate                  0.006846
JobInvolvement              0.130016
JobLevel                    0.169105
JobSatisfaction             0.103481
MonthlyIncome               0.159840
MonthlyRate                -0.015170
NumCompaniesWorked         -0.043494
PercentSalaryHike           0.013478
PerformanceRating          -0.002889
RelationshipSatisfaction    0.045872
StandardHours                    NaN
StockOptionLevel            0.137145
TotalWorkingYears           0.171063
TrainingTimesLastYear       0.059478
WorkLifeBalance             0.063939
YearsAtCompany              0.134392
YearsInCurrentRole          0.160545
YearsSinceLastPromotion     0.033019
YearsWithCurrManager        0.156199
N

In [55]:
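# Based on the correlations above, we will use the following features for our model:
# Age, EnvironmentSatisfaction, JobInvolvement, JobLevel, JobSatisfaction, MonthlyIncome, StockOptionLevel, TotalWorkingYears,
# YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager
# (Attrition is coded 1 = Yes, 2 = No, so a positive correlation means higher values go with staying)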
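The feature list above appears consistent with keeping every column whose absolute correlation with `Attrition` is at least about 0.1. The cutoff is inferred from the printed values, not stated in the notebook. A sketch of that kind of filter on synthetic data follows; the column names `target`, `related`, and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
target = rng.integers(0, 2, size=n)

toy = pd.DataFrame({
    "target": target,
    "related": target + rng.normal(scale=0.5, size=n),  # depends on the target
    "noise": rng.normal(size=n),                        # unrelated to the target
})

# Correlation of every feature with the target, excluding the target itself
corr = toy.corr()["target"].drop("target")

threshold = 0.1  # assumed cutoff, chosen to mirror the selection above
selected = corr[corr.abs() >= threshold].index.tolist()
print(selected)
```

Taking the absolute value matters because a strongly negative correlation is as informative as a strongly positive one.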

In [56]:
import statsmodels.api as sm

## Feature Selection

After running the OLS model several times, we eliminated a number of features and decided to use the following features for our model.

In [57]:
# 'DailyRate' and 'MonthlyIncome' were removed because an initial model run showed their coefficients were a bit off
X = df[['Age', 'TotalWorkingYears', 'YearsInCurrentRole', 'JobInvolvement', 'EnvironmentSatisfaction', 'JobLevel',
        'YearsWithCurrManager', 'YearsAtCompany', 'StockOptionLevel', 'JobSatisfaction']]
y = df[['Attrition']]

In [58]:
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

In [59]:
model.summary()

0,1,2,3
Dep. Variable:,Attrition,R-squared:,0.104
Model:,OLS,Adj. R-squared:,0.098
Method:,Least Squares,F-statistic:,16.98
Date:,"Thu, 05 Oct 2023",Prob (F-statistic):,2.00e-29
Time:,16:28:00,Log-Likelihood:,-534.38
No. Observations:,1470,AIC:,1091.0
Df Residuals:,1459,BIC:,1149.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.1925,0.065,18.484,0.000,1.066,1.319
Age,0.0033,0.001,2.357,0.019,0.001,0.006
TotalWorkingYears,0.0009,0.002,0.361,0.718,-0.004,0.006
YearsInCurrentRole,0.0092,0.004,2.252,0.024,0.001,0.017
JobInvolvement,0.0639,0.013,4.968,0.000,0.039,0.089
EnvironmentSatisfaction,0.0346,0.008,4.149,0.000,0.018,0.051
JobLevel,0.0305,0.013,2.295,0.022,0.004,0.057
YearsWithCurrManager,0.0112,0.004,2.642,0.008,0.003,0.019
YearsAtCompany,-0.0062,0.003,-2.078,0.038,-0.012,-0.000

0,1,2,3
Omnibus:,372.838,Durbin-Watson:,1.93
Prob(Omnibus):,0.0,Jarque-Bera (JB):,707.32
Skew:,-1.59,Prob(JB):,2.56e-154
Kurtosis:,4.201,Cond. No.,296.0


## Data Visualisation

In [60]:
fig = px.histogram(df, x="Age", color='Attrition', barmode='group')
fig.show()

In [61]:
fig = px.histogram(df, x="TotalWorkingYears", color='Attrition', barmode='group')
fig.show()

In [62]:
fig = px.histogram(df, x="YearsInCurrentRole", color='Attrition', barmode='group')
fig.show()

In [63]:
fig = px.histogram(df, x="YearsAtCompany", color='Attrition', barmode='group')
fig.show()

In [64]:
fig = px.histogram(df, x="JobSatisfaction", color='Attrition', barmode='group')
fig.show()

## Train Test Split

In [65]:
from sklearn.model_selection import train_test_split

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.30)
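The split above is random and unseeded, so every run produces a different partition and a slightly different score. One possible variation, shown here on toy arrays and not on the notebook's data, fixes `random_state` for reproducibility and uses `stratify` so both parts keep roughly the same class ratio. The seed value 42 and the toy arrays are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with roughly the 16% positive rate seen in the attrition column
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 16 + [0] * 84)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.30,
    random_state=42,   # fixed seed: the same split on every run
    stratify=y_toy,    # keep the class proportions in both parts
)

print(len(y_tr), len(y_te), y_tr.mean().round(2), y_te.mean().round(2))
```

With an imbalanced target, stratifying avoids a test set that happens to contain very few positive cases.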

In [67]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
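What the scaler does can be sketched by hand: the mean and standard deviation are computed on the training rows only and then reused for the test rows, so no information from the test set leaks into preprocessing. The toy arrays below are invented. The first column is constant, like the `const` column that `sm.add_constant` added to `X` earlier, and it ends up as all zeros after scaling.

```python
import numpy as np

X_train_toy = np.array([[1.0, 20.0], [1.0, 30.0], [1.0, 40.0]])  # first column is constant
X_test_toy = np.array([[1.0, 50.0]])

# Statistics come from the training data only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)   # population standard deviation (ddof=0)
std[std == 0] = 1.0             # guard against division by zero for constant columns

train_scaled = (X_train_toy - mean) / std
test_scaled = (X_test_toy - mean) / std

print(train_scaled.round(3))
print(test_scaled.round(3))     # second value is about 2.449
```

A constant column carries no information after standardization, which is harmless here because `LogisticRegression` fits its own intercept by default.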

In [68]:
# Define a dictionary to store the results of each model
results = {}

## Model Building

In [69]:
# Fit the LogisticRegression classifier to the training set
from sklearn.linear_model import LogisticRegression
LogReg = LogisticRegression()
# y_train is a single-column DataFrame; flatten it to the 1-D shape scikit-learn expects
LogReg.fit(X_train, y_train.values.ravel())
LogReg_y_pred = LogReg.predict(X_test)



## Model Evaluation

In [70]:
from sklearn import metrics
# Evaluate the LogisticRegression Classifier model
LogReg_accuracy = metrics.accuracy_score(y_test, LogReg_y_pred)
# Store the results of LogisticRegression model in the dictionary
results["LogReg"] = {"accuracy": LogReg_accuracy}

In [71]:
results

{'LogReg': {'accuracy': 0.854875283446712}}
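Accuracy is a single number and can hide how well the minority class is handled. As an illustrative sketch, the four confusion-matrix counts, together with precision and recall, can be computed directly. The label arrays below are invented, and they use 1 for the positive class and 0 for the negative class, which differs from the 1/2 coding used in this notebook.

```python
import numpy as np

# Invented labels for illustration: 1 = left the company, 0 = stayed
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # leavers correctly flagged
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # stayers wrongly flagged
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # leavers missed
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # stayers correctly passed

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn, tn)                                             # 1 1 2 6
print(round(accuracy, 2), round(precision, 2), round(recall, 2))  # 0.7 0.5 0.33
```

In this toy case the accuracy is 70% even though only one of the three leavers is found, which is the kind of gap that recall makes visible.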

### Remarks

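* Our model scored about 85% accuracy on the test set. About 84% of employees in the dataset did not leave (1233 of 1470), so this score should be read against that baseline before relying on the model to predict whether an employee will leave or stay
* The model could be improved further by experimenting with converting more of the categorical features to numerical ones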