# Shaastra 2022

# Machine Learning - A project based approach 

## Prediction of attrition - whether a particular employee will leave the company or not

### 14 January 2022


#### Session 2: Classification projects
#### Trainers:
Abhijit Rathod and Sai Shashank 

### Install required modules here 

####  To install catboost use pip install catboost

#### To install xgboost: If you are using anaconda use conda install -c anaconda py-xgboost OR pip install xgboost

#### To install pandas Profiling 
##### 1. pip install pandas-profiling
##### 2. pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
##### 3. conda install -c conda-forge pandas-profiling


In [None]:
# Block reserved for any installations 




## Import Python Modules, and Required Libraries 
##### Some modules and libraries are not used in this project but are given here for reader's reference (for use in other projects)

In [123]:
#Essential modules 
import pandas as pd
import numpy as np

#Visualisation libraries 
import seaborn as sns
import matplotlib.pyplot as plt

#Encoding 
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce 

#Normalization
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from scipy import stats

#Model Building

from sklearn import model_selection
from sklearn.model_selection import train_test_split,StratifiedKFold

##Classifiers 

# Logistic Regression
from sklearn.linear_model import LogisticRegression

#Decision tree
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

#Random Forest
from sklearn.ensemble import RandomForestClassifier

#Boosting 
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

#Bagging 
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification #for bootstrapping

#KFold 
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate

#Evaluation libraries 
from sklearn.metrics import recall_score
from sklearn.metrics import plot_confusion_matrix, confusion_matrix
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import roc_auc_score, roc_curve

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Timer
import timeit

## Helper Functions 
#### These helper functions are not strictly needed. They are given just to show off :P. Built-in functions are also available as an alternative.

In [2]:
def printColumnTypes(df):
    '''separates non-numeric and numeric columns and prints their names'''
    non_num_df = df.select_dtypes(include=['object'])
    num_df = df.select_dtypes(exclude=['object'])
    print("Non-Numeric columns:")
    for col in non_num_df:
        print(f"{col}")
    print("")
    print("Numeric columns:")
    for col in num_df:
        print(f"{col}")

def missing_cols(df):
    '''prints each column that has missing values, with the count and percentage of missing values'''
    total = 0
    for col in df.columns:
        missing_vals = df[col].isnull().sum()
        pct = df[col].isna().mean() * 100
        total += missing_vals
        if missing_vals != 0:
            print('{} => {} [{}%]'.format(col, missing_vals, round(pct, 2)))
    
    if total == 0:
        print("no missing values")

# PHASE 1: EXPLORING TRAINING DATA

## Upload Data

In [None]:
df = pd.read_csv(r"F:\Shaastra Workshop\Data Sprint 56 HR Analytics\Attrition_training_data.csv")
df

# df = df.sample(frac = 0.1)    ## Remove comment if you want to use this line 
                                ## If the dataset is huge (more than 1,00,000 rows), take a sample of the data for validation purposes. 
                                ## frac = 0.1 means 10% of the data is used

## Data Exploration Before Data Cleaning and Preprocessing 

##### Caution: It is not recommended to run this block if your laptop has low memory

In [None]:
## Magic code

## This gives entire profiling of data with visual representation in the form of an html page.

## Note: It is not recommended to run this block, if the laptop is occupied with some other tasks 
## as it requires a huge amount of memory. 

from pandas_profiling import ProfileReport
prof = ProfileReport(df)
prof

In [None]:
prof

## Dimension of the dataset

In [None]:
print('Dimension of dataframe is', df.shape)

## Data types of each column

In [None]:
## method 1

df.info()

In [None]:
## Method 2

printColumnTypes(df)

##### Observation: From the information table above, we can see that
##### 1. There are 28 features in the dataset 
##### 2. Non-Numeric columns:
BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, Attrition

## Checking number of unique values in each column 

In [None]:
df.nunique()

## Data Cleaning and Preprocessing 

### Check for missing data

In [None]:
## Method 1

df.isnull().sum()

In [None]:
## Method 2 

missing_cols(df)      # Gives percentage of missing values

##### Observation: Columns NumCompaniesWorked, TotalWorkingYears, EnvironmentSatisfaction, JobSatisfaction, WorkLifeBalance have missing values

## Dealing with Missing Values 

### Alternative 1: If the percentage of missing values is more than 30%, it is better to drop that column (a rule of thumb, not a strict rule)

### Alternative 2: Replace missing values with 'zero'

### Alternative 3: Replace missing values with 'mean' of the column 

### Alternative 4: Replace some missing values with 'mean', some with zeroes

### Alternative 5: Use advanced imputers like KNN imputer (Not in the scope of the workshop)

##### Note: axis=1 (or axis='columns') refers to columns and axis=0 (or axis='index') refers to rows. With the pandas drop method, specifying axis=1 removes columns and specifying axis=0 removes rows from the dataset.
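
The axis convention is easiest to see on a tiny frame. The sketch below uses a made-up DataFrame (`toy` and its column names are illustrative, not part of the workshop data) and drops one column and one row.

```python
import pandas as pd

# Toy frame; the column names are made up for illustration.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

dropped_col = toy.drop(["b"], axis=1)   # axis=1 / 'columns': removes column "b"
dropped_row = toy.drop([0], axis=0)     # axis=0 / 'index': removes the row labelled 0

print(dropped_col.shape)  # (3, 2)
print(dropped_row.shape)  # (2, 3)
```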

In [None]:
## Alternative 1:

df = df.drop(['DataFrame Column 1','DataFrame Column 2'], axis = 1)
df

In [None]:
## Alternative 2:
#df = df.fillna(0)

df = df.fillna(0)

In [None]:
## Alternative 3: 

#df['DataFrame Column'] = df['DataFrame Column'].fillna(0)
#df['DataFrame Column'] = df['DataFrame Column'].fillna(df['DataFrame Column'].mean())

# Replace missing values in one column with the (integer) mean of that column.
# Assigning the result back has the same effect as the older inplace = True style.
df['NumCompaniesWorked'] = df['NumCompaniesWorked'].fillna(int(df['NumCompaniesWorked'].mean()))
df

# Or fill the same column with zero instead (has no effect if the mean fill above has already run).
df['NumCompaniesWorked'] = df['NumCompaniesWorked'].fillna(0)
df

In [None]:
## Alternative 4:
#df = df.fillna(df.mean())

df = df.fillna(df.mean(numeric_only=True))   # fill each numeric column with its own mean
df

## Check if there are missing values 

In [None]:
missing_cols(df)

## Encoding 

### Get the list of categorical Columns

In [7]:
objList = df.select_dtypes(include = "object").columns
print (objList)

Index(['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole',
       'MaritalStatus', 'Over18', 'OverTime'],
      dtype='object')


In [8]:
## Method 1: Label Encoder
#Label Encoding for object to numeric conversion

le = LabelEncoder()

for feat in objList:
    df[feat] = le.fit_transform(df[feat].astype(str))

print (df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1176 entries, 0 to 1175
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1176 non-null   int64
 1   BusinessTravel            1176 non-null   int32
 2   DailyRate                 1176 non-null   int64
 3   Department                1176 non-null   int32
 4   DistanceFromHome          1176 non-null   int64
 5   Education                 1176 non-null   int64
 6   EducationField            1176 non-null   int32
 7   EmployeeCount             1176 non-null   int64
 8   EmployeeNumber            1176 non-null   int64
 9   EnvironmentSatisfaction   1176 non-null   int64
 10  Gender                    1176 non-null   int32
 11  HourlyRate                1176 non-null   int64
 12  JobInvolvement            1176 non-null   int64
 13  JobLevel                  1176 non-null   int64
 14  JobRole                   1176 non-null 

In [None]:
## Method 2: Find and replace 

replacement = {"Column 1":     {"element 1": 1, "element 2": 2},
                "Column 2": {"element 1": 1, "element 2": 2, "element": 3, "element 4": 4 }}

df = df.replace(replacement)
df.head()      ## head() shows first 5 rows 

##### To know more about categorical encoding visit https://pbpython.com/categorical-encoding.html
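
Label encoding assigns integers in alphabetical order, which implies an ordering between categories that may not exist. Tree-based models usually tolerate this, but linear models such as logistic regression can misread it. A minimal sketch of one-hot encoding as an alternative, assuming a made-up `Department` column (the values are illustrative only):

```python
import pandas as pd

# Toy column; the values are illustrative, not taken from the workshop data.
toy = pd.DataFrame({"Department": ["Sales", "HR", "Research", "Sales"]})

# Integer codes in alphabetical order: HR=0, Research=1, Sales=2
label_encoded = toy["Department"].astype("category").cat.codes

# One 0/1 indicator column per category, no implied order
one_hot = pd.get_dummies(toy, columns=["Department"], dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

One-hot encoding adds one column per category, so it is better suited to columns with few unique values.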

## Check if all the columns are numeric 

In [None]:
df.info()

## Outliers Detection and Removal (Optional)

In [None]:
## Z-score method is used to detect and remove outliers 

z = np.abs(stats.zscore(df))
print(z)

In [None]:
## It gives indices of the rows with z-score greater than 3 i.e. rows with outliers 

threshold = 3
print(np.where(z > threshold))

In [None]:
clean_df = df[(z < 3).all(axis=1)]
clean_df

In [None]:
print('Dimension of dataframe after cleaning is', clean_df.shape)
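
To see what the z-score filter does, here is a small sketch on made-up data (the `salary` and `age` columns are hypothetical). The z-score is computed by hand with the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Toy data: 19 ordinary salary values and one extreme value.
toy = pd.DataFrame({
    "salary": [10.0] * 19 + [100.0],
    "age": list(range(30, 50)),
})

# Absolute z-score of every entry, column by column
z = np.abs((toy - toy.mean()) / toy.std(ddof=0))

threshold = 3
clean = toy[(z < threshold).all(axis=1)]   # keep rows where every column is within the threshold

print(len(toy), "rows before,", len(clean), "rows after")
```

Note that with very few rows no value can reach a z-score of 3, so this rule only makes sense on reasonably sized datasets.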

## Exploratory Data Analysis After Cleaning and Preprocessing - Visualization 

### Data Description 

In [None]:
df.describe()  ## Gives count, mean, std, min, max, and quartile values for each variable 

## Class Distribution

In [None]:
df['Attrition'].value_counts()

In [None]:
sns.countplot(x = 'Attrition', data = df, palette="Set1");

###### Observation: Data is imbalanced, as the proportion of employees staying with the company is much higher than that of employees planning to leave the company. 
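
Imbalance matters because a model that always predicts the majority class already gets a high accuracy. A small sketch with a made-up 84/16 split (the numbers are an assumption for illustration, not the real class counts):

```python
import pandas as pd

# Hypothetical target column: 84 employees stay (0), 16 leave (1).
y_toy = pd.Series([0] * 84 + [1] * 16, name="Attrition")

proportions = y_toy.value_counts(normalize=True)   # share of each class
baseline_accuracy = proportions.max()              # accuracy of "always predict the majority class"

print(proportions)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

This is why the later sections also look at recall, F1 score and ROC AUC rather than accuracy alone.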

### Histograms 

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

sns.histplot(df, x = 'BusinessTravel', discrete = True, ax = axs[0]);
sns.histplot(df, x = 'Gender', discrete = True, ax = axs[1]);

In [None]:
df['BusinessTravel'].value_counts()

##### Frequently: 
##### Rarely: 
##### : 

##### Observation:  

In [None]:
df['Gender'].value_counts()

##### Male employees: 
##### Female employees: 

##### Observation: Proportion of male employees is significantly higher than female employees. 

In [None]:
fig, axs = plt.subplots(1,2, figsize=(10, 5))

sns.histplot(x = 'Age', data = df, discrete = True, ax = axs[0]);
sns.histplot(x = 'YearsAtCompany', data = df, discrete = True, ax = axs[1]);

In [None]:
fig, axs = plt.subplots(1,2, figsize=(10, 5))

sns.histplot(x = 'Education', data = df, discrete = True, ax = axs[0]);
sns.histplot(x = 'EducationField', data = df, discrete = True, ax = axs[1]);

## Would you like to try?

In [None]:
fig, axs = plt.subplots(1,2, figsize=(10, 5))

sns.histplot(x = '_______', data = df, discrete = True, ax = axs[0]);
sns.histplot(x = '_______', data = df, discrete = True, ax = axs[1]);

In [None]:
# Consider EnvironmentSatisfaction, JobSatisfaction, WorkLifeBalance,JobInvolvement

fig, axs = plt.subplots(2,2, figsize=(16, 10))

#sns.histplot(x = 'Education', data = df, discrete = True, ax = axs[0][0]);
#sns.histplot(x = 'EducationField', data = df, discrete = True, ax = axs[0][1]);
#sns.histplot(x = 'EducationField', data = df, discrete = True, ax = axs[1][0]);
#sns.histplot(x = 'EducationField', data = df, discrete = True, ax = axs[1][1]);

## Box Plots    
#### To identify outliers 

In [None]:
sns.set_theme(style="whitegrid")
ax = sns.boxplot(x=df["Age"])

## Feature Selection 

### Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in.
### Having irrelevant features in your data can decrease the accuracy of the models, because the model learns from those irrelevant features.

### To know more visit: https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e


### Method 1: Correlation Matrix 

In [None]:
plt.figure(figsize = (12,12))
corrmat = df.corr()
ax = sns.heatmap(corrmat, vmax=1, square=True);

In [120]:
numeric_feature_cols = [x for x in df.columns if x not in ['Attrition']]
#cat_feature_cols = []
target_col = ['Attrition']

In [None]:
import scipy.stats 

corrs = []
for col in numeric_feature_cols:
    corr = scipy.stats.spearmanr(df['Attrition'], df[col])
    corrs.append({
        'feature': col,
        'correlation': corr[0],
        'correlation_p_value': corr[1]
    })
    
pd.DataFrame(corrs).sort_values('correlation')

##### Features are arranged in ascending order of their correlation with Attrition. Monthly income has the highest correlation with 'age_group'. The higher the absolute correlation (positive or negative) with the target variable, the more important the feature is likely to be. This method is not very reliable, as there is a possibility of interaction between variables. 

## Data preparation for Modelling 

### X and Y division

#### Drop irrelevant columns here. We generally drop columns 
##### 1. With unique id; In this case, EmployeeID. 
##### 2. With more than 30% missing values 
##### 3. More than 70-80% unique values 
##### 4. Only 1 unique value. Example: StandardHours
##### 5. Very low correlation with target variable 
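
Rules 1, 3 and 4 can be checked with `nunique()`. A minimal sketch on a made-up frame (the values are illustrative; the variable names `constant_cols` and `id_like_cols` are hypothetical):

```python
import pandas as pd

# Toy frame with an id-like column, a constant column and an ordinary column.
toy = pd.DataFrame({
    "EmployeeNumber": [101, 102, 103, 104],
    "StandardHours": [80, 80, 80, 80],
    "Age": [30, 41, 30, 25],
})

n_unique = toy.nunique()

constant_cols = n_unique[n_unique == 1].index.tolist()          # only 1 unique value
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()    # every row has a different value

print("Constant columns:", constant_cols)
print("Id-like columns:", id_like_cols)
```

A continuous numeric column can also have a different value in every row, so the id-like list should be reviewed by hand before dropping anything.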

In [10]:
X = df.drop(['Attrition'], axis=1)     

y = df['Attrition']

In [None]:
X

In [None]:
y

### Data Normalization (Optional)

In [None]:
# Min-max normalization, applied column by column
column_maxes = X.max()
column_mins = X.min()
normalized_X = (X - column_mins) / (column_maxes - column_mins)   # constant columns give NaN (0/0)

normalized_X

##### This data normalization is just a demonstration of how data is normalised.
##### Normalization doesn't have much impact on decision tree algorithms. Hence normalized data is not used in prediction models. 
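
Min-max normalization is usually applied per column, so that every feature ends up in the range 0 to 1. The sketch below, on made-up values, compares the manual formula with scikit-learn's `MinMaxScaler`, which is already imported at the top of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy features on very different scales (values are illustrative).
toy = pd.DataFrame({"Age": [20, 30, 40], "MonthlyIncome": [1000, 5000, 9000]})

# Manual per-column formula: (x - min) / (max - min)
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Same result using MinMaxScaler
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(manual)
print(scaled)
```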

## Data Scaling 

In [None]:
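st_x = StandardScaler()
# Keep X as a DataFrame so that column names stay available later (e.g. for feature importances)
X = pd.DataFrame(st_x.fit_transform(X), columns=X.columns, index=X.index)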

# PHASE 2: ALGORITHMS ON TRAINING DATA

# Train and Validation Division

#### test_size = 0.25 means entire dataset is divided into 2 sets - Train data = 75% and Test/Validation data = 25%

In [None]:
X_train, X_val, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)
X_train.shape, X_val.shape
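
For an imbalanced target, one option worth knowing is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. This is a sketch on made-up arrays (`X_toy`, `y_toy` are hypothetical), not a change to the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 80 of class 0 and 20 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

print(X_tr.shape, X_va.shape)
print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in validation:", y_va.mean())
```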

# Different Algorithms 

### 1. Decision Tree Classifier

In [None]:
# Build and fit the model 

clf_gini = DecisionTreeClassifier(criterion='gini', random_state=0)
clf_gini.fit(X_train, y_train)

In [None]:
# Predict 

y_pred_gini = clf_gini.predict(X_val)

In [None]:
# Plot confusion matrix 

cm = confusion_matrix(y_test, y_pred_gini)
plot_confusion_matrix(clf_gini, X_val, y_test)

In [None]:
result = pd.crosstab(y_test, y_pred_gini, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred_gini)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_gini)
print("Accuracy of the model:",result2)

In [None]:
# Cross Validation using KFold

scores = cross_validate(clf_gini, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))
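
A variation on the cross validation above: `StratifiedKFold` (already imported) keeps the class ratio in every fold, and `cross_val_score` accepts a `scoring` argument so that F1 can be reported instead of accuracy. The sketch uses synthetic data from `make_classification`, so the scores say nothing about the attrition dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 85% / 15%), for illustration only.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_toy, y_toy, cv=cv, scoring="f1"
)

print("F1 per fold:", scores)
print("Mean F1:", scores.mean())
```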

## 2. Random Forest 

### 2a. Random Forest with default Parameters 

In [114]:
## Random Forest Trees with default parameters (i.e. 100 Decision Trees)

rfc = RandomForestClassifier(random_state=0, class_weight='balanced')
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_val)

In [None]:
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(rfc, X_val, y_test)

In [None]:
result = pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy of the model:",result2)

In [None]:
from sklearn.metrics import f1_score
f1_score(y_test, y_pred, average='weighted')

In [None]:
# Cross Validation using KFold 

scores = cross_validate(rfc, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))

### 2b. Random Forest with Parameter Tuning

#### Grid Search Cross Validation to find best parameters 

In [92]:
## Might take a while

start = timeit.default_timer()

rfc_best = RandomForestClassifier()
parameters = {
    "n_estimators":[250, 500],
    "max_depth":[8, 16, None],
    "max_features": ['auto']
    }

from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(rfc_best,parameters,cv=5, scoring='f1')      # scoring = 'roc_auc', scoring = 'accuracy'
cv.fit(X_train,y_train.values.ravel())

def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean,std,params in zip(mean_score,std_score,params):
        print(f'{round(mean,3)} + or -{round(std,3)} for the {params}')

stop = timeit.default_timer()

print('Time taken to perform Gridsearch: ', stop - start)        

display(cv)

Time taken to perform Gridsearch:  26.66947970000001
Best parameters are: {'max_depth': 16, 'max_features': 'auto', 'n_estimators': 250}


0.211 + or -0.039 for the {'max_depth': 8, 'max_features': 'auto', 'n_estimators': 250}
0.178 + or -0.033 for the {'max_depth': 8, 'max_features': 'auto', 'n_estimators': 500}
0.259 + or -0.05 for the {'max_depth': 16, 'max_features': 'auto', 'n_estimators': 250}
0.217 + or -0.051 for the {'max_depth': 16, 'max_features': 'auto', 'n_estimators': 500}
0.238 + or -0.036 for the {'max_depth': None, 'max_features': 'auto', 'n_estimators': 250}
0.218 + or -0.055 for the {'max_depth': None, 'max_features': 'auto', 'n_estimators': 500}


##### The GridSearch algorithm finds the best parameters by trying every combination of the given parameter values. It also cross-validates each combination, and the average score (here the F1 score, as set by scoring) and its standard deviation are displayed for each combination. 

##### In this case best parameters are: 
##### n_estimators = ___
##### max_depth = ____
##### max_features = _____
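
The same idea on a much smaller scale: the sketch below runs `GridSearchCV` over 4 combinations on synthetic data and reads the winner from `best_params_` and `best_score_`. The grid values and the data are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# 2 x 2 = 4 combinations, each evaluated with 3-fold cross validation
param_grid = {"n_estimators": [10, 25], "max_depth": [2, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="f1"
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best mean F1:", round(search.best_score_, 3))
```

The number of model fits is (number of combinations) x (number of folds), so large grids get slow quickly.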

In [None]:
## Random Forest Trees with Best Parameters

rfc_best = RandomForestClassifier(n_estimators = 250, max_depth = 16, max_features = 'auto', random_state=0)   # values from the grid search output above

start = timeit.default_timer()

rfc_best.fit(X_train, y_train)

stop = timeit.default_timer()

print('Time taken to perform Random Forest Classifier: ', stop - start) 

y_pred_best = rfc_best.predict(X_val)

In [None]:
## Confusion Matrix 

cm = confusion_matrix(y_test, y_pred_best)
plot_confusion_matrix(rfc_best, X_val, y_test)

In [None]:
result = pd.crosstab(y_test, y_pred_best, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred_best)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_best)
print("Accuracy of the model:",result2)

In [None]:
from sklearn.metrics import f1_score
f1_score(y_test, y_pred_best, average='weighted')

In [None]:
# Cross Validation using KFold 

scores = cross_validate(rfc_best, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))

## Method 2 of Feature Selection -  Feature Scores



In [None]:
feature_scores = pd.Series(rfc_best.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_scores

In [None]:
## Plot Important Features 

feature_imp = pd.DataFrame(sorted(zip(rfc_best.feature_importances_,X.columns)), columns=['Value','Feature'])

plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data = feature_imp.sort_values(by="Value", ascending=False)[:25], color = 'blue');

### 2c. Random Forest using SMOTE (For Imbalanced target variables) (Optional)

#### Sample code for SMOTE demonstration

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

smote = SMOTE(sampling_strategy='not majority')
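
SMOTE creates new minority-class rows by interpolating between a minority sample and one of its nearest minority neighbours. The function below is a simplified NumPy illustration of that idea only; it is not the imblearn implementation, and `smote_like` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style oversampling (illustration, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)              # random starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one random neighbour each
    lam = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Toy minority-class points
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
synthetic = smote_like(X_min, n_new=6)
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority points, so no value falls outside the range of the original minority data.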

In [None]:
rfc_smote = RandomForestClassifier(random_state=0, max_depth = 16, n_estimators = 500)

In [None]:
clf_smote = Pipeline(steps = [('sampling', smote),('classifier',rfc_smote)])

In [None]:
clf_smote.fit(X_train, y_train)

In [None]:
y_pred_smote = clf_smote.predict(X_val)

In [None]:
cm = confusion_matrix(y_test, y_pred_smote)
plot_confusion_matrix(clf_smote, X_val, y_test)

In [None]:
result = pd.crosstab(y_test, y_pred_smote, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred_smote)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_smote)
print("Accuracy of the model:",result2)

In [None]:
# Cross Validation using KFold 

scores = cross_validate(clf_smote, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))

## Boosting Algorithms 

## 3. AdaBoostClassifier

In [None]:
clf_adaboost = AdaBoostClassifier(random_state=0)
clf_adaboost.fit(X_train, y_train)

In [None]:
y_pred_adaboost = clf_adaboost.predict(X_val)

In [None]:
cm = confusion_matrix(y_test, y_pred_adaboost)
plot_confusion_matrix(clf_adaboost, X_val, y_test)

In [None]:
result = pd.crosstab(y_test, y_pred_adaboost, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred_adaboost)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_adaboost)
print("Accuracy of the model:",result2)

In [None]:
# Cross Validation using KFold 

scores = cross_validate(clf_adaboost, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))

## 4. Gradient Boosting 

In [None]:
clf_grboost = GradientBoostingClassifier(random_state=0)
clf_grboost.fit(X_train, y_train)

In [None]:
y_pred_grboost = clf_grboost.predict(X_val)

In [None]:
cm = confusion_matrix(y_test, y_pred_grboost)
plot_confusion_matrix(clf_grboost, X_val, y_test)
print('Confusion matrix\n\n', cm)

In [None]:
result = pd.crosstab(y_test, y_pred_grboost, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred_grboost)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_grboost)
print("Accuracy of the model:",result2)

In [None]:
# Cross Validation using KFold 

scores = cross_validate(clf_grboost, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))

### Gradient Boosting with Parameter Tuning

In [None]:
clf_grboost_best = GradientBoostingClassifier(random_state=0)
parameters = {
    "n_estimators":[100, 250, 500],
    "max_depth":[8, 16, None],
    "max_features": [0.1, 0.25, 0.5, 1.0]
      }

from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(clf_grboost_best, parameters,cv=5)
cv.fit(X_train,y_train.values.ravel())

def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean,std,params in zip(mean_score,std_score,params):
        print(f'{round(mean,3)} + or -{round(std,3)} for the {params}')
        
display(cv)

#### Gradient Boosting with best parameters 

In [None]:
clf_grboost_best = GradientBoostingClassifier(random_state=0)
clf_grboost_best.fit(X_train, y_train)

In [None]:
y_pred_grboost_best = clf_grboost_best.predict(X_val)

In [None]:
cm = confusion_matrix(y_test, y_pred_grboost_best)
plot_confusion_matrix(clf_grboost_best, X_val, y_test)

In [None]:
result = pd.crosstab(y_test, y_pred_grboost_best, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred_grboost_best)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_grboost_best)
print("Accuracy of the model:",result2)

In [None]:
# Cross Validation using KFold 

scores = cross_validate(clf_grboost_best, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))

## 5. XGBoost 

In [None]:
clf_xgb = XGBClassifier()
clf_xgb.fit(X_train, y_train)

In [None]:
y_pred_xgb = clf_xgb.predict(X_val)

In [None]:
cm = confusion_matrix(y_test, y_pred_xgb)
plot_confusion_matrix(clf_xgb,X_val,y_test)
print('Confusion matrix\n\n', cm)

In [None]:
result = pd.crosstab(y_test, y_pred_xgb, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred_xgb)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_xgb)
print("Accuracy of the model:",result2)

In [None]:
scores = cross_validate(clf_xgb, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))

### XGBoost with parameter tuning 

In [94]:
clf_xgboost_best = XGBClassifier(random_state=0)
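parameters = {
    "n_estimators":[100, 250, 500],
    "max_depth":[4, 5, 6],
    "max_features": [0.1, 0.25, 0.5, 1.0],   # not an XGBoost parameter (hence the warnings in the output); colsample_bytree is the closest equivalent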
    "learning_rate": [0.1, 0.3, 0.5],
    "gamma": [0, 0.5, 1, 5]
      }

from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(clf_xgboost_best, parameters,cv=5, scoring='f1')
cv.fit(X_train,y_train.values.ravel())


def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean,std,params in zip(mean_score,std_score,params):
        print(f'{round(mean,3)} + or -{round(std,3)} for the {params}')
        
display(cv)

Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.



In [95]:
clf_xgboost_best = XGBClassifier(random_state=0,
                                learning_rate = 0.3,
                                max_depth = 4,
                                max_features = 0.1,   # ignored by XGBoost, see the warning in the output
                                n_estimators = 500)
clf_xgboost_best.fit(X_train, y_train)

Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.3, max_delta_step=0, max_depth=4,
              max_features=0.1, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=500, n_jobs=8,
              num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              validate_parameters=1, verbosity=None)

In [96]:
y_pred_xgboost_best = clf_xgboost_best.predict(X_val)

In [None]:
cm = confusion_matrix(y_test, y_pred_xgboost_best)
plot_confusion_matrix(clf_xgboost_best, X_val, y_test)

In [None]:
result = pd.crosstab(y_test, y_pred_xgboost_best, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred_xgboost_best)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_xgboost_best)
print("Accuracy of the model:",result2)

In [None]:
# Cross Validation using KFold 

scores = cross_validate(clf_xgboost_best, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))

## 6. LightGBM 

In [None]:
clf_lgbm = LGBMClassifier()
clf_lgbm.fit(X_train, y_train)

In [None]:
y_pred_lgbm= clf_lgbm.predict(X_val)

In [None]:
cm = confusion_matrix(y_test, y_pred_lgbm)
plot_confusion_matrix(clf_lgbm,X_val,y_test)
print('Confusion matrix\n\n', cm)

In [None]:
result = pd.crosstab(y_test, y_pred_lgbm, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred_lgbm)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_lgbm)
print("Accuracy of the model:",result2)

In [None]:
scores = cross_validate(clf_lgbm, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))

## 7. Catboost

In [None]:
clf_catboost = CatBoostClassifier(random_state=0, class_weights = [0.15 , 0.85])      
clf_catboost.fit(X_train, y_train)



## For imbalanced data use class_weights = [a, b], where a and b are the weights of the two classes - generally between 0 and 1. 
                                                     # Give more weight to the class with fewer rows 
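
One way to pick such weights, sketched here under the assumption of a made-up 85/15 class split: scikit-learn's `compute_class_weight` with `class_weight="balanced"` gives each class a weight inversely proportional to its frequency. Rescaling the weights to sum to 1 is an extra step added here for readability.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical target: 85 rows of class 0 and 15 rows of class 1.
y_toy = np.array([0] * 85 + [1] * 15)

# "balanced" weight of a class = n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

# Rescale so that the two weights sum to 1
normalised = weights / weights.sum()

print("Balanced weights:", weights)
print("Rescaled weights:", normalised)
```

With an 85/15 split the rescaled weights come out as 0.15 for the majority class and 0.85 for the minority class.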

In [104]:
y_pred_catboost = clf_catboost.predict(X_val)

In [None]:
cm = confusion_matrix(y_test, y_pred_catboost)
plot_confusion_matrix(clf_catboost,X_val,y_test)
print('Confusion matrix\n\n', cm)

In [None]:
result1 = classification_report(y_test, y_pred_catboost)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_catboost)
print("Accuracy of the model:",result2)

In [None]:
scores = cross_validate(clf_catboost, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))

## 8. Bagging

In [None]:
clf_bagging = BaggingClassifier(random_state=0)
clf_bagging.fit(X_train, y_train)

In [None]:
y_pred_bagging = clf_bagging.predict(X_val)

In [None]:
cm = confusion_matrix(y_test, y_pred_bagging)
plot_confusion_matrix(clf_bagging,X_val,y_test)
print('Confusion matrix\n\n', cm)

In [None]:
result = pd.crosstab(y_test, y_pred_bagging, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, y_pred_bagging)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, y_pred_bagging)
print("Accuracy of the model:",result2)

In [None]:
# Cross Validation using KFold 

scores = cross_validate(clf_bagging, X_val, y_test,cv=KFold(n_splits=5))
k=scores['test_score']
print("K fold scores are:", k)
print("Average of k fold scores is:", np.mean(k, axis = None))
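
The cross-validation cells in this section pass `KFold(n_splits=5)` without shuffling, so the rows are cut into five contiguous blocks and each block is used once as the test fold. `cross_validate` refits a fresh copy of the classifier on the other four blocks each time. The pure-Python sketch below shows only how such folds are laid out; `kfold_indices` is a hypothetical helper, not a scikit-learn function.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    # The first (n_samples % n_splits) folds receive one extra sample
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Hypothetical data set with 10 rows
folds = list(kfold_indices(n_samples=10, n_splits=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Because plain `KFold` ignores the labels, a fold of an imbalanced data set can end up with very few positive rows. `StratifiedKFold`, imported at the top of the notebook, keeps the class ratio roughly equal in every fold and may be worth trying here.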

## 9. Logistic Regression

In [None]:
start = timeit.default_timer()

lr = LogisticRegression(class_weight = 'balanced')
lr.fit(X_train, y_train)

stop = timeit.default_timer()

print('Time: ', stop - start) 

preds = lr.predict(X_val)

In [None]:
pd.DataFrame(confusion_matrix(y_test, preds), columns=['Predicted 0', 'Predicted 1'], index=['Actual 0', 'Actual 1'])

In [None]:
result = pd.crosstab(y_test, preds, rownames=['Actual Result'], colnames=['Predicted Result'])
print("Confusion Matrix:\n")
print(result)
result1 = classification_report(y_test, preds)
print("\nClassification Report:\n",)
print (result1)
result2 = accuracy_score(y_test, preds)
print("Accuracy of the model:",result2)

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(f'True Positives: {tp}')
print(f'False Positives: {fp}')
print(f'True Negatives: {tn}')
print(f'False Negatives: {fn}')
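
The four counts above are enough to recompute the main figures in the classification report. The sketch below assumes made-up counts (they are not results from this data set) and a hypothetical helper name. It also shows why accuracy alone can mislead on imbalanced data.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only
metrics = summarize_confusion(tn=80, fp=5, fn=10, tp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy is 0.85, yet only one in three employees who actually left is detected (recall of about 0.33).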

In [128]:
probas = lr.predict_proba(X_val)[:, 1]

In [None]:
probas

In [130]:
def get_preds(threshold, probabilities):
    return [1 if prob > threshold else 0 for prob in probabilities]

In [131]:
roc_values = []
for thresh in np.linspace(0, 1, 100):
    preds = get_preds(thresh, probas)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    tpr = tp/(tp+fn)
    fpr = fp/(fp+tn)
    roc_values.append([tpr, fpr])
tpr_values, fpr_values = zip(*roc_values)

In [None]:
fig, ax = plt.subplots(figsize=(10,7))
ax.plot(fpr_values, tpr_values)
ax.plot(np.linspace(0, 1, 100),
         np.linspace(0, 1, 100),
         label='baseline',
         linestyle='--')
plt.title('Receiver Operating Characteristic Curve', fontsize=18)
plt.ylabel('TPR', fontsize=16)
plt.xlabel('FPR', fontsize=16)
plt.legend(fontsize=12);

In [None]:
roc_auc_score(y_test, lr.predict_proba(X_val)[:, 1])
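
The area under the curve can also be approximated directly from the `(fpr, tpr)` pairs collected in the threshold loop, using the trapezoidal rule. This is a minimal sketch with hypothetical points. On real data, a grid of 100 thresholds may give a value slightly different from `roc_auc_score`, which works from the predicted scores themselves.

```python
def trapezoid_auc(fpr_values, tpr_values):
    """Approximate the area under an ROC curve with the trapezoidal rule."""
    # Sort the points so the curve runs from left (FPR = 0) to right (FPR = 1)
    points = sorted(zip(fpr_values, tpr_values))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points, including the end points (0, 0) and (1, 1)
fpr_example = [0.0, 0.5, 1.0]
tpr_example = [0.0, 1.0, 1.0]
auc_example = trapezoid_auc(fpr_example, tpr_example)
print("Approximate AUC:", auc_example)

# The dashed baseline (a random classifier) has an area of 0.5
baseline_auc = trapezoid_auc([0.0, 1.0], [0.0, 1.0])
print("Baseline AUC:", baseline_auc)
```

In the notebook, the call would be `trapezoid_auc(fpr_values, tpr_values)`.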

# PHASE 3: ALGORITHMS ON TESTING DATA

In [None]:
df2 = pd.read_csv(r"F:\Shaastra Workshop\Data Sprint 56 HR Analytics\Attrition_testing_data.csv")
df2

## Preprocessing of Testing Data (Exactly the Same as for Training Data)


### 1. Check column drops 
### 2. Check missing values
### 3. Check encoding 
### 4. Check scaling 
### 5. Check anything else

In [None]:
objList = df2.select_dtypes(include = "object").columns
print (objList)

In [None]:
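## Method 1: Label Encoder
# Label encoding for object-to-numeric conversion
# Caution: fit_transform refits the encoder on the testing data, so the integer codes
# match the training codes only if each column has the same set of categories in both files

le = LabelEncoder()

for feat in objList:
    df2[feat] = le.fit_transform(df2[feat].astype(str))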

print (df2.info())
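
`LabelEncoder` assigns integer codes in sorted order of the categories it sees during fitting. If a column in the testing file is missing a category that was present in training, refitting shifts the codes. The pure-Python sketch below illustrates the effect and one way around it: learn the mapping once on the training values and reuse it. The helper names and the department values are assumptions made for illustration.

```python
def fit_category_mapping(train_values):
    """Map each category to an integer code, in sorted order of the categories."""
    return {cat: code for code, cat in enumerate(sorted(set(train_values)))}


def apply_category_mapping(values, mapping, unknown_code=-1):
    """Encode values with an existing mapping; unseen categories get unknown_code."""
    return [mapping.get(v, unknown_code) for v in values]


# Hypothetical column values; "HR" does not occur in the testing data
train_dept = ["Sales", "HR", "Research", "Sales"]
test_dept = ["Research", "Sales", "Sales"]

mapping = fit_category_mapping(train_dept)          # learned on training data
encoded_test = apply_category_mapping(test_dept, mapping)

# Refitting on the testing data alone produces different codes
refit_mapping = fit_category_mapping(test_dept)
refit_encoded = apply_category_mapping(test_dept, refit_mapping)

print("Training mapping:", mapping)
print("Reused mapping  :", encoded_test)
print("Refit on test   :", refit_encoded)
```

With scikit-learn, the equivalent idea is to keep one fitted encoder per column from the training phase and call `transform` (not `fit_transform`) on the testing data. Note that `LabelEncoder.transform` raises an error for categories it has not seen.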

# Final Data for Testing model

In [None]:
X_train = X
y_train = y
print (X_train.shape, y_train.shape)
print (df2.shape)                      ## X_test is replaced by df2


## Check that X_train and df2 have the same number of columns
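
Comparing shapes only shows whether the column counts match. A quick name-by-name comparison can show which columns differ. The sketch below is a hypothetical helper; apart from `EmployeeNumber`, the column names are made up for illustration.

```python
def compare_columns(train_columns, test_columns):
    """Report columns that differ between the training and testing features."""
    train_set, test_set = set(train_columns), set(test_columns)
    return {
        "missing_in_test": sorted(train_set - test_set),
        "extra_in_test": sorted(test_set - train_set),
        "same_order": list(train_columns) == list(test_columns),
    }


# Hypothetical column lists
train_cols = ["Age", "Department", "MonthlyIncome"]
test_cols = ["Age", "Department", "MonthlyIncome", "EmployeeNumber"]

report = compare_columns(train_cols, test_cols)
print(report)
```

In the notebook, the call would be `compare_columns(X_train.columns, df2.columns)`. Any column listed under `extra_in_test` would need to be dropped from `df2` before calling `predict`, unless it was also used in training.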

# Use your best algorithm 

### 1. Copy and paste your algorithm into the following block
### 2. Replace X_test with df2
### 3. Change the name of the prediction variable to test_prediction 


### Note: The final line should look like test_prediction = model_name.predict(df2)

In [107]:
## Random Forest Trees with default parameters (i.e. 100 Decision Trees)

rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)


test_prediction = rfc.predict(df2)

# Preparing Prediction File for downloading 

In [None]:
from pandas import DataFrame
df3 = DataFrame(test_prediction, columns=['Attrition'])      ## Change column name as per the requirement of the submission file
df3

In [None]:
test_prediction

In [None]:
y_id = df2['EmployeeNumber'].to_numpy()      ## Converting into an array 
y_id

In [None]:
Id = DataFrame(y_id, columns = ['Id'], index = None)       ## Converting into a dataframe
Id

In [112]:
result = Id.join(df3)     ### Joining test unique id with predictions 

In [None]:
result

# Downloading submission file 

In [79]:
### Save the file 
### Remember to change the file name 

result.to_csv(r'F:\Shaastra Workshop\Data Sprint 56 HR Analytics\prediction_2.csv', index = False)   

## 'index = False' omits the first column, which would otherwise hold the auto-generated index ids
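
The resulting file should contain exactly two columns, `Id` and `Attrition`, with a header row and no index column. The sketch below reproduces that layout with the standard-library `csv` module so it can be inspected without writing to disk. The ids and predictions are made up, and `build_submission_csv` is a hypothetical helper.

```python
import csv
import io


def build_submission_csv(ids, predictions):
    """Return CSV text with an Id column and an Attrition column."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Id", "Attrition"])        # header row
    writer.writerows(zip(ids, predictions))     # one row per employee
    return buffer.getvalue()


# Hypothetical ids and predictions
csv_text = build_submission_csv([101, 102, 103], [0, 1, 0])
print(csv_text)
```

Checking that the number of ids equals the number of predictions before saving helps catch a mismatch that a join on the index would otherwise hide as missing values.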

## Compare your Results 

### The best model with respect to accuracy: ___________ with accuracy of ___________ %

### Write the accuracy of each model:

### Model:                                              Accuracy  

### 1. XGBoost:
### 2. Gradient Boosting:
### 3. LightGBM:
### 4. CatBoost:
### 5. Random Forest:
### 6. Bagging:
### 7. Decision Trees:
### 8. AdaBoost:
### 9. Logistic Regression:

## End of code