<center><img src="https://github.com/kings-shah/GCD_Capstone_HR/blob/319fac03a6383942d87e531d08c9e8ac5d7c715e/companylogo.png?raw=true" width="60%" height="150" /></center>

# **Table of Contents**
---

1. [**Introduction**](#Section1)<br>
2. [**Problem Statement**](#Section2)<br>
3. [**Installing & Importing Libraries**](#Section3)<br>
4. [**Data Acquisition & Description**](#Section4)<br>
5. [**Data Pre-Profiling**](#Section5)<br>
6. [**Data Pre-Processing**](#Section6)<br>
7. [**Data Post-Profiling**](#Section7)<br>
8. [**Exploratory Data Analysis**](#Section8)<br>
9. [**Summarization**](#Section9)<br>
  9.1 [**Conclusion**](#Section91)<br>
  9.2 [**Actionable Insights**](#Section91)<br>

---

---
<a name = Section1></a>
# **1. Introduction**
---
Your client for this project is the HR Department at a software company.

- They want to try a new initiative to retain employees.
- The idea is to use data to predict whether an employee is likely to leave.
- Once these employees are identified, HR can be more proactive in reaching out to them before it's too late.
- They only want to deal with the data that is related to permanent employees.

**Current Practice**
- Once an employee leaves, he or she sits an “exit interview” and shares the reasons for leaving. The HR Department then tries to draw insights from the interview and makes changes accordingly.

This practice suffers from the following problems:

- The first is that the approach is too haphazard. The quality of insight gained from an interview depends heavily on the skill of the interviewer.
- The second is that these insights can't be aggregated and cross-referenced across all employees who have left.
- The third is that it is too late by the time the proposed policy changes take effect.

The HR department has hired you as data science consultants. They want to supplement their exit interviews with a more proactive approach.

---
<a name = Section2></a>
# **2. Consulting Goals**
---
<b> Your Role </b>
- You are given datasets of past employees and their status (still employed or already left).
- Your task is to build a classification model using the datasets.
- Because the company has no existing machine learning model for this problem, you don’t have a quantifiable win condition. You need to build the best possible model.

<b> Problem Specifics </b>
- <b>Deliverable</b> : Predict whether an employee will stay or leave.
- <b> Machine learning task</b>: Classification
- <b>Target variable</b>: Status (Employed/Left)
- <b>Win condition</b>: N/A (best possible model)

<center><img src="https://github.com/kings-shah/GCD_Capstone_HR/blob/319fac03a6383942d87e531d08c9e8ac5d7c715e/hr.png?raw=true"></center>

---
<a id = Section3></a>
# **3. Installing & Importing Libraries**
---

- This section installs and imports the libraries required for the analysis.

In [4]:
pip install mysql-connector-python


Collecting mysql-connector-python
  Downloading mysql_connector_python-8.0.29-cp37-cp37m-manylinux1_x86_64.whl (25.2 MB)
Installing collected packages: mysql-connector-python
Successfully installed mysql-connector-python-8.0.29


In [5]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing pandas (for tabular data analysis)
#from pandas_profiling import ProfileReport                          # Import Pandas Profiling (to generate univariate analysis)
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing NumPy (for numerical computing)
#-------------------------------------------------------------------------------------------------------------------------------
import mysql.connector                                              # Importing the MySQL connector (to query the database)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing the pyplot interface of matplotlib
import seaborn as sns                                               # Importing seaborn (for statistical visualization)
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import scipy as sp                                                  # Importing SciPy (for scientific calculations)
#-------------------------------------------------------------------------------------------------------------------------------

---
<a id = Section4></a>
# **4. Data Acquisition & Description**
---


**Fetching and describing department_data**

In [6]:
import mysql.connector
import pandas as pd
from mysql.connector import errorcode

try:
    # Connect to the MySQL server that hosts the Capstone2 database
    cnx = mysql.connector.connect(user='student', password='student',
                                  host='cpanel.insaid.co',
                                  database='Capstone2')
    # Load the whole department_data table into a DataFrame
    department_data = pd.read_sql_query('select * from department_data', con=cnx)
except mysql.connector.Error as err:
    # Report the two most common connection failures explicitly
    if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
        print("Something is wrong with your user name or password")
    elif err.errno == errorcode.ER_BAD_DB_ERROR:
        print("Database does not exist")
    else:
        print(err)
else:
    # Close the connection once the query has succeeded
    cnx.close()


**department_data**
This dataset contains information about each department. The schema of the dataset is as follows:

- **dept_id** – Unique Department Code
- **dept_name** – Name of the Department
- **dept_head** – Name of the Head of the Department

In [7]:
department_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   dept_id    11 non-null     object
 1   dept_name  11 non-null     object
 2   dept_head  11 non-null     object
dtypes: object(3)
memory usage: 392.0+ bytes


In [8]:
department_data.describe(include='all')

| | dept_id | dept_name | dept_head |
|---|---|---|---|
| count | 11 | 11 | 11 |
| unique | 11 | 11 | 11 |
| top | D00-IT | IT | Henry Adey |
| freq | 1 | 1 | 1 |


In [9]:
# department_data categorical fields
s = (department_data.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)
for x in object_cols:
    print(x,":",department_data[x].unique())
    print()

Categorical variables:
['dept_id', 'dept_name', 'dept_head']
dept_id : ['D00-IT' 'D00-SS' 'D00-TP' 'D00-ENG' 'D00-SP' 'D00-FN' 'D00-PR' 'D00-AD'
 'D00-MN' 'D00-MT' 'D00-PD']

dept_name : ['IT' 'Sales' 'Temp' 'Engineering' 'Support' 'Finance' 'Procurement'
 'Admin' 'Management' 'Marketing' 'Product']

dept_head : ['Henry Adey' 'Edward J Bayley' 'Micheal Zachrey' 'Sushant Raghunathan K'
 'Amelia Westray' 'Aanchal J' 'Louie Viles' 'Evelyn Tolson'
 'Ellie Trafton' 'Reuben Swann' 'Darcy Staines']



**Fetching and describing employee_details_data**

**employee_details_data**
This dataset consists of Employee ID, their Age, Gender and Marital Status. The schema of this dataset is as follows:

- **employee_id** – Unique ID Number for each employee
- **age** – Age of the employee
- **gender** – Gender of the employee
- **marital_status** – Marital Status of the employee

In [10]:
# Open a connection to the Capstone2 database
cnx1 = mysql.connector.connect(user='student', password='student',
                              host='cpanel.insaid.co',
                              database='Capstone2')

# Load the employee_details_data table into a DataFrame, then release the connection
emp_details_df = pd.read_sql_query('select * from employee_details_data',con=cnx1)
cnx1.close()
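
The same connect-then-query pattern is used for each table in this section. A minimal sketch of a reusable helper follows. The names `load_table`, `connect` and `demo_connect` are hypothetical, and the demo runs against an in-memory SQLite database as a stand-in, not against the project's MySQL server.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def load_table(table_name, connect):
    """Return every row of `table_name` as a DataFrame and close the connection.

    `connect` is any zero-argument callable that returns a DB-API connection.
    """
    # closing() guarantees cnx.close() runs even if the query raises
    with closing(connect()) as cnx:
        return pd.read_sql_query(f"select * from {table_name}", con=cnx)


def demo_connect():
    """Stand-in connection factory: an in-memory SQLite database with one toy row."""
    cnx = sqlite3.connect(":memory:")
    cnx.execute("create table department_data (dept_id text, dept_name text)")
    cnx.execute("insert into department_data values ('D00-IT', 'IT')")
    return cnx


demo_df = load_table("department_data", demo_connect)
print(demo_df)
```

With the MySQL connector, `connect` could be a small function or lambda that wraps the `mysql.connector.connect(...)` call shown above, so each table needs only one line.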

In [11]:
emp_details_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14245 entries, 0 to 14244
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   employee_id     14245 non-null  int64 
 1   age             14245 non-null  int64 
 2   gender          14245 non-null  object
 3   marital_status  14245 non-null  object
dtypes: int64(2), object(2)
memory usage: 445.3+ KB


In [12]:
emp_details_df.describe(include='all')

| | employee_id | age | gender | marital_status |
|---|---|---|---|---|
| count | 14245.0 | 14245.0 | 14245 | 14245 |
| unique | | | 2 | 2 |
| top | | | Male | Unmarried |
| freq | | | 9382 | 7283 |
| mean | 112123.050544 | 32.889926 | | |
| std | 8500.457343 | 9.970834 | | |
| min | 100101.0 | 22.0 | | |
| 25% | 105775.0 | 24.0 | | |
| 50% | 111298.0 | 29.0 | | |
| 75% | 116658.0 | 41.0 | | |


In [13]:
# emp_details_df categorical fields
s = (emp_details_df.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)
for x in object_cols:
    print(x,":",emp_details_df[x].value_counts())
    print()

Categorical variables:
['gender', 'marital_status']
gender : Male      9382
Female    4863
Name: gender, dtype: int64

marital_status : Unmarried    7283
Married      6962
Name: marital_status, dtype: int64



In [14]:
# Get list of numerical variables
s = (emp_details_df.dtypes == 'int64')
numeric_cols = list(s[s].index)

print("Numeric variables INT:")
print(numeric_cols)
for x in numeric_cols:
    #print(x,":",employee_data_df[x].unique())
    print(x,":",emp_details_df[x].unique())
    print()

Numeric variables INT:
['employee_id', 'age']
employee_id : [113558 112256 112586 ... 128083 118487 118849]

age : [43 24 22 36 38 51 54 49 37 27 47 28 53 39 35 42 40 23 45 25 30 34 26 44
 52 31 32 33 29 41 46 48 57 50 55 56]



**Fetching and describing employee_data**

In [15]:
# Open a connection to the Capstone2 database
cnx1 = mysql.connector.connect(user='student', password='student',
                              host='cpanel.insaid.co',
                              database='Capstone2')

# Load the employee_data table into a DataFrame, then release the connection
employee_data_df = pd.read_sql_query('select * from employee_data',con=cnx1)
cnx1.close()

**employee_data**
This dataset consists of each employee’s Administrative Information, Workload Information, Mutual Evaluation Information and Status.

<br>**Target variable**

- **status** – Current employment status (Employed / Left)

**Administrative information**

- **department** – Department to which the employee belongs or belonged -- **object**, **707 nulls**
- **salary** – Salary level relative to the rest of the department -- **object (to be converted to numeric)**, **0 nulls**
- **tenure** – Number of years at the company -- **float**, **150 nulls**
- **recently_promoted** – Was the employee promoted in the last 3 years? -- **float**, **13853 nulls**
- **employee_id** – Unique ID number for each employee -- **int**, **0 nulls**

**Workload information**

- **n_projects** – Number of projects the employee has worked on -- **int**, **0 nulls**
- **avg_monthly_hrs** – Average number of hours worked per month -- **float**, **0 nulls**

**Mutual evaluation information**

- **satisfaction** – Score for the employee’s satisfaction with the company (higher is better) -- **float**, **150 nulls**
- **last_evaluation** – Score for the most recent evaluation of the employee (higher is better) -- **float**, **1487 nulls**
- **filed_complaint** – Has the employee filed a formal complaint in the last 3 years? -- **float**, **12104 nulls**

In [16]:
#Fetching unseen data
test=pd.read_excel('https://github.com/kings-shah/GCD_Capstone_HR/blob/33a7bfba3de50f9cba9ab0ebbdf158cd1d69d8d4/GCD_Capstone_Project_unseen_data.xlsx?raw=true')

In [17]:
employee_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14150 entries, 0 to 14149
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   avg_monthly_hrs    14150 non-null  float64
 1   department         13443 non-null  object 
 2   filed_complaint    2046 non-null   float64
 3   last_evaluation    12663 non-null  float64
 4   n_projects         14150 non-null  int64  
 5   recently_promoted  297 non-null    float64
 6   salary             14150 non-null  object 
 7   satisfaction       14000 non-null  float64
 8   status             14150 non-null  object 
 9   tenure             14000 non-null  float64
 10  employee_id        14150 non-null  int64  
dtypes: float64(6), int64(2), object(3)
memory usage: 1.2+ MB


In [18]:
employee_data_df.describe(include='all')

| | avg_monthly_hrs | department | filed_complaint | last_evaluation | n_projects | recently_promoted | salary | satisfaction | status | tenure | employee_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14150.0 | 13443 | 2046.0 | 12663.0 | 14150.0 | 297.0 | 14150 | 14000.0 | 14150 | 14000.0 | 14150.0 |
| unique | | 12 | | | | | 3 | | 2 | | |
| top | | D00-SS | | | | | low | | Employed | | |
| freq | | 3905 | | | | | 6906 | | 10784 | | |
| mean | 199.994346 | | 1.0 | 0.718399 | 3.778304 | 1.0 | | 0.621212 | | 3.499357 | 112080.750247 |
| std | 50.833697 | | 0.0 | 0.173108 | 1.250162 | 0.0 | | 0.250482 | | 1.462584 | 8748.202856 |
| min | 49.0 | | 1.0 | 0.316175 | 1.0 | 1.0 | | 0.040058 | | 2.0 | 0.0 |
| 25% | 155.0 | | 1.0 | 0.563711 | 3.0 | 1.0 | | 0.450356 | | 3.0 | 105772.5 |
| 50% | 199.0 | | 1.0 | 0.724731 | 4.0 | 1.0 | | 0.652394 | | 3.0 | 111291.5 |
| 75% | 245.0 | | 1.0 | 0.871409 | 5.0 | 1.0 | | 0.824925 | | 4.0 | 116650.75 |


In [19]:
employee_data_df.isnull().sum()

avg_monthly_hrs          0
department             707
filed_complaint      12104
last_evaluation       1487
n_projects               0
recently_promoted    13853
salary                   0
satisfaction           150
status                   0
tenure                 150
employee_id              0
dtype: int64
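
Raw counts are easier to judge as a share of the rows. In the output above, **recently_promoted** is missing in roughly 98% of the 14150 rows and **filed_complaint** in roughly 86%. A minimal sketch of a helper that reports both figures follows. The name `missing_report` is hypothetical, and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return the count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame for illustration only (not the project data)
toy = pd.DataFrame({
    "tenure": [3.0, np.nan, 4.0, 2.0],
    "filed_complaint": [np.nan, np.nan, 1.0, np.nan],
    "n_projects": [4, 2, 5, 3],
})
toy_report = missing_report(toy)
print(toy_report)
```

Calling `missing_report(employee_data_df)` would give the same view for the project data.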

In [20]:
#employee_data_df categorical fields
s = (employee_data_df.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)
for x in object_cols:
    #print(x,":",employee_data_df[x].unique())
    print(x,":",employee_data_df[x].value_counts())
    print()

Categorical variables:
['department', 'salary', 'status']
department : D00-SS     3905
D00-ENG    2575
D00-SP     2113
D00-IT     1157
D00-PD      855
D00-MT      815
D00-FN      725
D00-MN      593
-IT         207
D00-AD      175
D00-PR      173
D00-TP      150
Name: department, dtype: int64

salary : low       6906
medium    6101
high      1143
Name: salary, dtype: int64

status : Employed    10784
Left         3366
Name: status, dtype: int64



In [21]:
# Get list of numerical variables
s = (employee_data_df.dtypes == 'int64')
numeric_cols = list(s[s].index)

print("Numeric variables INT:")
print(numeric_cols)
for x in numeric_cols:
    #print(x,":",employee_data_df[x].unique())
    print(x,":",employee_data_df[x].unique())
    print()

Numeric variables INT:
['n_projects', 'employee_id']
n_projects : [6 2 7 5 4 3 1]

employee_id : [124467 112210 126150 ... 106064 113083 104996]



In [22]:
# Get list of numerical variables
s = (employee_data_df.dtypes == 'float64')
float_cols = list(s[s].index)

print("Float variables Float:")
print(float_cols)
for x in float_cols:
    #print(x,":",employee_data_df[x].unique())
    print(x,":",employee_data_df[x].unique())
    print(x," Mean:",employee_data_df[x].mean())
    print(x," Median:",employee_data_df[x].median())
    print()

Float variables Float:
['avg_monthly_hrs', 'filed_complaint', 'last_evaluation', 'recently_promoted', 'satisfaction', 'tenure']
avg_monthly_hrs : [246. 134. 156. 256. 146. 135. 270. 244. 289. 281. 269. 267. 257. 155.
 128. 274. 151. 127. 132. 309. 130. 233. 245. 149. 232. 284. 249. 164.
 159. 154. 239. 260. 125. 308. 306. 141. 143. 261. 301. 296. 271. 129.
 290. 225. 253. 255. 268. 153. 294. 293. 235. 158. 273. 277. 198. 160.
 131. 150. 254. 152. 236. 145. 279. 259. 297. 258. 140. 223. 147. 148.
 310. 137. 303. 202. 136. 287. 218. 172. 305. 291. 243. 228. 283. 242.
 192. 298. 285. 247. 216. 280. 265. 263. 276. 139. 142. 299. 278. 282.
 241. 144. 157. 264. 138. 224. 251. 124. 119. 248. 304. 262. 266. 133.
 252. 275. 219. 307. 226. 214. 180. 300. 240. 217. 227. 238. 177. 181.
 165. 288. 286. 272. 250. 126. 292.  65. 222. 229. 302. 237. 161. 295.
 221.  63. 195. 213. 234. 205. 212. 179.  72.  87. 163. 169. 231. 166.
 220.  68. 196. 162. 182. 204. 184.  74.  67. 206. 183. 189. 168. 178.
 2

Observations on the employee_data_df dataset:

1. 707 missing values in the **department** column
   - The department of these employees is unknown, so the missing values are replaced with the mode.
2. The value **-IT** in the **department** column
   - Replace with **D00-IT**.
3. 12104 missing values in the **filed_complaint** column
   - Replace the missing values with 0.0.
4. 13853 missing values in the **recently_promoted** column
   - Replace the missing values with 0.0.
5. 1487 missing values in the **last_evaluation** column
   - Replace with the median (0.72).
6. 150 missing values in the **satisfaction** column
   - Replace with 0, as these belong to the Temp department (D00-TP) only.
7. 150 missing values in the **tenure** column
   - Replace with 0, as these belong to the Temp department (D00-TP) only.
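
Items 6 and 7 rest on the claim that the missing **satisfaction** and **tenure** values all come from one department. A minimal sketch of how that could be checked follows. The helper name `departments_with_missing` is hypothetical, and the toy frame only mimics the pattern described above.

```python
import numpy as np
import pandas as pd


def departments_with_missing(df, column, dept_col="department"):
    """Count, per department, the rows where `column` is missing."""
    missing_rows = df[df[column].isnull()]
    # dropna=False keeps rows whose department is itself missing visible
    return missing_rows[dept_col].value_counts(dropna=False)


# Toy frame for illustration only (not the project data)
toy = pd.DataFrame({
    "department": ["D00-TP", "D00-TP", "D00-IT", "D00-SS"],
    "satisfaction": [np.nan, np.nan, 0.7, 0.4],
    "tenure": [np.nan, np.nan, 3.0, 2.0],
})
tp_check = departments_with_missing(toy, "satisfaction")
print(tp_check)
```

If the result for `employee_data_df` lists a single department, the claim holds. Any other department in the result would suggest a different fill value is needed for those rows.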



<a name = Section5></a>

---
# **5. Data Pre-Processing**
---

In [25]:

# A missing flag is treated as "no complaint" / "no promotion", so fill it with 0.0
employee_data_df.filed_complaint = employee_data_df.filed_complaint.fillna(0.0)
employee_data_df.recently_promoted = employee_data_df.recently_promoted.fillna(0.0)
# Repair the malformed department code '-IT'
employee_data_df.department = employee_data_df.department.replace('-IT', 'D00-IT')
# Apply the same fixes to the unseen data
test.filed_complaint = test.filed_complaint.fillna(0.0)
test.recently_promoted = test.recently_promoted.fillna(0.0)
test.department = test.department.replace('-IT', 'D00-IT')

In [26]:
# Fill tenure and satisfaction with 0.0 and last_evaluation with its median
employee_data_df.tenure = employee_data_df.tenure.fillna(0.0)
employee_data_df.satisfaction = employee_data_df.satisfaction.fillna(0.0)
employee_data_df.last_evaluation = employee_data_df.last_evaluation.fillna(employee_data_df.last_evaluation.median())

In [28]:
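# Same imputation for the unseen data, using its own median for last_evaluation
test.tenure = test.tenure.fillna(0.0)
test.satisfaction = test.satisfaction.fillna(0.0)
test.last_evaluation = test.last_evaluation.fillna(test.last_evaluation.median())
# Confirm that no missing values remain in filed_complaint
employee_data_df.filed_complaint.isnull().sum()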

0

In [30]:
test.shape

(100, 10)

In [31]:
employee_data_df.shape

(14150, 11)

In [32]:
# Remove exact duplicate rows, then drop the rows whose employee_id is 0.
# Filtering with a boolean mask and reassigning avoids pandas' SettingWithCopyWarning,
# which an in-place drop on the de-duplicated frame would raise.
employee_data_df = employee_data_df.drop_duplicates()
employee_data_df = employee_data_df[employee_data_df.employee_id != 0]


In [33]:
# Fill the remaining missing departments with the most frequent department
employee_data_df.department = employee_data_df.department.fillna(employee_data_df.department.mode()[0])

In [34]:
# Same for the unseen data, using its own most frequent department
test.department = test.department.fillna(test.department.mode()[0])
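
The cells above compute the median and mode separately for the training data and the unseen data. A common alternative, offered here as a design option and not as what this notebook does, is to learn the fill values once from the training data and reuse them everywhere. In this minimal sketch the function names `fit_fill_values` and `apply_fill_values` are hypothetical and the toy frames are made up.

```python
import numpy as np
import pandas as pd


def fit_fill_values(train_df):
    """Learn the fill value for each column from the training data only."""
    return {
        "filed_complaint": 0.0,
        "recently_promoted": 0.0,
        "tenure": 0.0,
        "satisfaction": 0.0,
        "last_evaluation": train_df["last_evaluation"].median(),
        "department": train_df["department"].mode()[0],
    }


def apply_fill_values(df, fill_values):
    """Return a copy of `df` with missing values filled column by column."""
    # Skip columns that a given frame does not have
    present = {col: val for col, val in fill_values.items() if col in df.columns}
    return df.fillna(value=present)


# Toy frames for illustration only (not the project data)
train_toy = pd.DataFrame({
    "department": ["D00-SS", "D00-SS", "D00-IT"],
    "last_evaluation": [0.5, 0.7, 0.9],
})
unseen_toy = pd.DataFrame({
    "department": [None, "D00-IT"],
    "last_evaluation": [np.nan, 0.6],
})
fills = fit_fill_values(train_toy)
unseen_filled = apply_fill_values(unseen_toy, fills)
print(unseen_filled)
```

With only 100 rows in the unseen data, its own median and mode are noisier estimates than those of the 14150-row training table, which is the main argument for this variant.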

<a name = Section7></a>

---
# **7. Data Post-Processing**
---


<a name = Section71></a>
### **7.1 Data Encoding**

- In this section, we encode the categorical features and adjust any other columns as needed.

In [35]:
employee_data_df.isnull().sum()

avg_monthly_hrs      0
department           0
filed_complaint      0
last_evaluation      0
n_projects           0
recently_promoted    0
salary               0
satisfaction         0
status               0
tenure               0
employee_id          0
dtype: int64

In [36]:
dict_status = {'Left':0, 'Employed':1}
dict_salary = {'low':1, 'medium':2, 'high':3}
employee_data_df.replace({'salary': dict_salary},inplace=True)
employee_data_df.replace({'status': dict_status},inplace=True)

test.replace({'salary': dict_salary},inplace=True)
#test.replace({'status': dict_status},inplace=True)

def OHE(original_dataframe, col):    
    dummies = pd.get_dummies(original_dataframe[[col]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([col], axis=1)
    return(res) 

employee_data_df = OHE(employee_data_df,'department')
test = OHE(test,'department')
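
Because `OHE` is applied to the training data and the unseen data separately, the two frames only end up with the same dummy columns if every department occurs in both. A minimal sketch of a safeguard follows. The helper name `align_to_training` is hypothetical and the toy frames are made up.

```python
import pandas as pd


def align_to_training(train_X, new_X):
    """Give `new_X` exactly the training columns, in the training order."""
    # Dummy columns absent from new_X are created and filled with 0;
    # columns unknown to the training data are dropped.
    return new_X.reindex(columns=train_X.columns, fill_value=0)


# Toy frames for illustration only (not the project data)
train_toy = pd.DataFrame({
    "salary": [1, 2],
    "department_D00-IT": [1, 0],
    "department_D00-SS": [0, 1],
})
unseen_toy = pd.DataFrame({
    "department_D00-IT": [1],
    "salary": [3],
})
aligned = align_to_training(train_toy, unseen_toy)
print(aligned)
```

Aligning the unseen data to the training feature columns before calling `predict` keeps the column order consistent with what the model was fitted on.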


<a name = Section72></a>
### **7.2 Data Preparation**

- Now we will **split** our **data** into **dependent** and **independent** variables, and then into training and validation sets using the holdout validation technique.

In [37]:
# Separate the features (X) from the target (y)
X = employee_data_df.drop(columns='status')
y = employee_data_df['status']

In [38]:
# Splitting data into training and validation sets, holding out 25% for validation
from sklearn.model_selection import train_test_split                # To split the data into training and testing parts
from sklearn.linear_model import LogisticRegression                 # Baseline classification model
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.metrics import accuracy_score                          # For calculating the accuracy of the model
from sklearn.metrics import precision_score                         # For calculating the precision of the model
from sklearn.metrics import recall_score                            # For calculating the recall of the model
from sklearn.metrics import precision_recall_curve                  # For precision and recall metric estimation
from sklearn.metrics import confusion_matrix                        # For verifying model performance using a confusion matrix
from sklearn.metrics import f1_score                                # For checking the F1 score of our model
from sklearn.metrics import roc_curve                               # For ROC-AUC metric estimation
from sklearn.metrics import RocCurveDisplay                         # For plotting ROC curves (plot_roc_curve was removed in scikit-learn 1.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Display the shape of training and testing data
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
X_train.info()
#X_train.columns
X_train.head()

X_train shape:  (10587, 20)
y_train shape:  (10587,)
X_test shape:  (3529, 20)
y_test shape:  (3529,)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10587 entries, 9435 to 7273
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   avg_monthly_hrs     10587 non-null  float64
 1   filed_complaint     10587 non-null  float64
 2   last_evaluation     10587 non-null  float64
 3   n_projects          10587 non-null  int64  
 4   recently_promoted   10587 non-null  float64
 5   salary              10587 non-null  int64  
 6   satisfaction        10587 non-null  float64
 7   tenure              10587 non-null  float64
 8   employee_id         10587 non-null  int64  
 9   department_D00-AD   10587 non-null  uint8  
 10  department_D00-ENG  10587 non-null  uint8  
 11  department_D00-FN   10587 non-null  uint8  
 12  department_D00-IT   10587 non-null  uint8  
 13  department_D00-MN   10587 non-null  uint8  
 14

| | avg_monthly_hrs | filed_complaint | last_evaluation | n_projects | recently_promoted | salary | satisfaction | tenure | employee_id | department_D00-AD | department_D00-ENG | department_D00-FN | department_D00-IT | department_D00-MN | department_D00-MT | department_D00-PD | department_D00-PR | department_D00-SP | department_D00-SS | department_D00-TP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9435 | 237.0 | 0.0 | 0.835774 | 5 | 0.0 | 1 | 0.754157 | 4.0 | 107035 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7487 | 130.0 | 0.0 | 0.506343 | 2 | 0.0 | 2 | 0.450085 | 3.0 | 117847 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5120 | 250.0 | 0.0 | 0.785817 | 6 | 0.0 | 1 | 0.335011 | 3.0 | 110091 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 12005 | 191.0 | 0.0 | 0.523009 | 4 | 0.0 | 2 | 0.656776 | 4.0 | 104013 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6557 | 154.0 | 0.0 | 0.571853 | 3 | 0.0 | 2 | 0.945445 | 3.0 | 110702 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
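
The target is imbalanced: in the raw table, 3366 of 14150 employees (roughly 24%) have left. The split above does not stratify, so the class share can drift slightly between the training and validation sets. A minimal sketch of a stratified holdout split follows, as an optional variant. The toy arrays are made up, and the `stratify` argument is not used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target for illustration only: 300 rows of class 1, 100 rows of class 0
y_toy = np.array([1] * 300 + [0] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print("share of class 0 in train:", (y_tr == 0).mean())
print("share of class 0 in test: ", (y_te == 0).mean())
```

For the project data the equivalent call would pass `stratify=y`. Note that this changes which rows land in each set, so the scores reported below would not be reproduced exactly.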


<a name = Section8></a>

---
# **8. Model Development & Evaluation**
---

- In this section we will **develop a Logistic Regression model**

- Then we will **analyze the results** obtained and **make our observations**.

- For **evaluation purposes** we will **focus** on the **F1 score**, as required by this project.
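
The F1 score is the harmonic mean of precision and recall, F1 = 2 · P · R / (P + R), and it depends on which class is treated as positive. A minimal sketch with scikit-learn's `f1_score` follows. The labels are a made-up toy example that uses the encoding from Section 7.1 (0 = Left, 1 = Employed).

```python
from sklearn.metrics import f1_score

# Toy labels for illustration only: 0 = Left, 1 = Employed
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# pos_label=0 scores the model on the "Left" class, the one HR wants to catch
f1_left = f1_score(y_true, y_pred, pos_label=0)
# pos_label=1 (the default) scores the "Employed" class instead
f1_employed = f1_score(y_true, y_pred, pos_label=1)

print("F1 (Left):", f1_left)
print("F1 (Employed):", f1_employed)
```

For the models in this section, the same call on `y_test` and the predicted labels gives the F1 score alongside the accuracy that is printed.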

<a name = Section81></a>
### **8.1 Baseline Model Development & Evaluation**

- Here we develop a Logistic Regression classification model as a baseline, using default settings except for the regularization strength (C=100).

In [39]:
logreg = LogisticRegression(C=100).fit(X_train,y_train)
#logreg()

# Predicting training and testing labels
y_train_pred_count = logreg.predict(X_train)
y_test_pred_count = logreg.predict(X_test)
print('Accuracy score for test validation data is:', accuracy_score(y_test,y_test_pred_count))

Accuracy score for test validation data is: 0.7631056956644942


In [40]:
from sklearn.preprocessing import RobustScaler
scaler_rbs = RobustScaler()
X_train_rbs = scaler_rbs.fit_transform(X_train)
X_test_rbs = scaler_rbs.transform(X_test)
# Instantiate a Logistic Regression
logreg = LogisticRegression(C=100).fit(X_train_rbs,y_train)
#logreg()

# Predicting training and testing labels
y_train_pred_count = logreg.predict(X_train_rbs)
y_test_pred_count = logreg.predict(X_test_rbs)
print('Accuracy score for test validation data is:', accuracy_score(y_test,y_test_pred_count))

Accuracy score for test validation data is: 0.8016435250779258
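
Scaling lifts the validation accuracy from about 0.763 to about 0.802. When the scaled model is later cross-validated, the scaler should be fitted inside each fold. A minimal sketch that bundles both steps into one pipeline follows. It uses synthetic data from `make_classification`, and `max_iter=1000` is an assumption added to help convergence, not a setting from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data for illustration only (not the project data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# The scaler is re-fitted on the training part of every fold, so no
# information from the held-out fold leaks into the scaling step
scaled_logreg = make_pipeline(RobustScaler(), LogisticRegression(C=100, max_iter=1000))
cv_scores = cross_val_score(scaled_logreg, X_demo, y_demo, cv=5, scoring="accuracy")

print(cv_scores.mean())
```

The pipeline object can be fitted and scored exactly like a single estimator, for example `scaled_logreg.fit(X_train, y_train)` followed by `scaled_logreg.score(X_test, y_test)`.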


**Random Forest**

In [41]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Single-step pipeline wrapping a Random Forest with a fixed seed and depth limit
pipeRF = Pipeline([
    ("RF", RandomForestClassifier(random_state=42, max_depth=16)),
])
pipeRF.fit(X_train, y_train)
print("Testing Accuracy")
print(pipeRF.score(X_test, y_test))
print("Training Accuracy")
print(pipeRF.score(X_train, y_train))

# 10-fold cross-validated accuracy on the training split
scores = cross_val_score(pipeRF, X_train, y_train, cv=10, scoring='accuracy')
print()
print("Accuracy")
print(np.mean(scores))

Testing Accuracy
0.9815811844715217
Training Accuracy
0.9961273259658071

Accuracy
0.977992042284068


**XGBoost**

In [42]:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Single-step pipeline wrapping an XGBoost classifier with a fixed seed and depth limit
pipeXGB = Pipeline([
    ("XGB", XGBClassifier(random_state=42, max_depth=15)),
])
pipeXGB.fit(X_train, y_train)
print("Testing Accuracy")
print(pipeXGB.score(X_test, y_test))
print("Training Accuracy")
print(pipeXGB.score(X_train, y_train))

# 10-fold cross-validated accuracy on the training split
scores = cross_val_score(pipeXGB, X_train, y_train, cv=10, scoring='accuracy')
print()
print("Accuracy")
print(np.mean(scores))

Testing Accuracy
0.9827146500425049
Training Accuracy
0.9989609898932653

Accuracy
0.9791250082558178


In [43]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

RF = RandomForestClassifier(max_depth=13)
LR = LogisticRegression(solver='liblinear')
SV = SVC(kernel="linear")
DT = DecisionTreeClassifier()
XB=XGBClassifier(max_depth=15)

In [45]:
rf = RF
rf.fit(X_train, y_train.values.ravel())
print('\n Random Forest Classifier : {:.3f}'.format(accuracy_score(y_test, rf.predict(X_test))))


 Random Forest Classifier : 0.981


In [46]:
xgb = XB
xgb.fit(X_train, y_train.values.ravel())
print('\n XGBoost Classifier : {:.3f}'.format(accuracy_score(y_test, xgb.predict(X_test))))


 XGBoost Classifier : 0.983


## Predictions for the unseen test data

In [47]:
y_pred_final=xgb.predict(test)

In [48]:
df_test= pd.DataFrame({"employee_id":test.employee_id,"status":y_pred_final})

In [49]:
# Map the encoded predictions back to their labels (0 = Left, 1 = Employed)
df_test['status'] = df_test['status'].map({0: 'Left', 1: 'Employed'})

In [50]:
df_test.to_csv('GCD_output.csv',index=False, header=False)

In [51]:
df_test.shape

(100, 2)