# Big Data Final Project

#### _Team Members: Pooja Sastry (819907953), Sindhuri Punyamurthula (820923656)_

### Goal: 
To predict employee attrition in a company using their real-time dataset. This is a binary classification problem of predicting if an employee will leave a company or not, based on several attributes. 
### Dataset:
The human resource data from the Kaggle repository has been used for learning the models: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/data. This dataset has 34 features and a target label 'Attrition'. Some of the important features are: Employee_Age, Years_Of_Service, Gender,Distance_From_Home, Job_Level, Current_Salary, Performance_Rating,BusinessTravel,etc. We renamed the dataset to _AttritionData.csv_ and it is located in the same folder as our project notebook.
### Kernel:
This notebook runs on _Apache Toree - Scala_ kernel.

## _Code:_

### Import Libraries

In [1]:
import java.io._
import scala.io.Source
import scala.collection.immutable
import org.apache.spark.sql.types.{StructField, StructType, StringType, DoubleType, IntegerType}
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

## 1. Load the Data

### 1.1. Define the schema 

In [2]:
val emp_schema = new StructType(Array(
    new StructField("Age", IntegerType, true),
    new StructField("Attrition", StringType, true),
    new StructField("BusinessTravel", StringType, true),
    new StructField("DailyRate", IntegerType, true),
    new StructField("Department", StringType, true),
    new StructField("DistanceFromHome", IntegerType, true),
    new StructField("Education", IntegerType, true),
    new StructField("EducationField", StringType, true),
    new StructField("EmployeeCount", IntegerType, true),
    new StructField("EmployeeNumber", IntegerType, true),
    new StructField("EnvironmentSatisfaction", IntegerType, true),
    new StructField("Gender", StringType, true),
    new StructField("HourlyRate", IntegerType, true),
    new StructField("JobInvolvement", IntegerType, true),
    new StructField("JobLevel", IntegerType, true),
    new StructField("JobRole", StringType, true),
    new StructField("JobSatisfaction", IntegerType, true),
    new StructField("MaritalStatus", StringType, true),
    new StructField("MonthlyIncome", IntegerType, true),
    new StructField("MonthlyRate", IntegerType, true),
    new StructField("NumCompaniesWorked", IntegerType, true),
    new StructField("Over18", StringType, true),
    new StructField("OverTime", StringType, true),
    new StructField("PercentSalaryHike", IntegerType, true),    
    new StructField("PerformanceRating", IntegerType, true),
    new StructField("RelationshipSatisfaction", IntegerType, true),
    new StructField("StandardHours", IntegerType, true),
    new StructField("StockOptionLevel", IntegerType, true),
    new StructField("TotalWorkingYears", IntegerType, true),
    new StructField("TrainingTimesLastYear", IntegerType, true),
    new StructField("WorkLifeBalance", IntegerType, true),
    new StructField("YearsAtCompany", IntegerType, true),
    new StructField("YearsInCurrentRole", IntegerType, true),
    new StructField("YearsSinceLastPromotion", IntegerType, true),
    new StructField("YearsWithCurrManager", IntegerType, true)))

In [3]:
val attritionDf = spark.read.format("csv").option("header", true).schema(emp_schema).load("./AttritionData.csv")

In [4]:
attritionDf.na.drop()
attritionDf.show

+---+---------+-----------------+---------+--------------------+----------------+---------+--------------+-------------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfaction|MaritalStatus|MonthlyIncome|MonthlyRate|NumCompaniesWorked|Over18|OverTime|PercentSalaryHike|PerformanceRating|RelationshipSatisfaction|StandardHours|StockOptionLevel|TotalWorkingYears|TrainingTimesLastYear|WorkLifeBalanc

In [5]:
attritionDf.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- Attrition: string (nullable = true)
 |-- BusinessTravel: string (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)
 |-- Education: integer (nullable = true)
 |-- EducationField: string (nullable = true)
 |-- EmployeeCount: integer (nullable = true)
 |-- EmployeeNumber: integer (nullable = true)
 |-- EnvironmentSatisfaction: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- HourlyRate: integer (nullable = true)
 |-- JobInvolvement: integer (nullable = true)
 |-- JobLevel: integer (nullable = true)
 |-- JobRole: string (nullable = true)
 |-- JobSatisfaction: integer (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- MonthlyIncome: integer (nullable = true)
 |-- MonthlyRate: integer (nullable = true)
 |-- NumCompaniesWorked: integer (nullable = true)
 |-- Over18: string (nullable = true)
 |-- OverTime: string 

### 1.2. Extract, Transform and Select features using StringIndexer and VectorAssembler

In [6]:
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler

In [7]:
val Attrition_LabelIndexer = new StringIndexer().setInputCol("Attrition").setOutputCol("label")
val BusinessTravel_LabelIndexer = new StringIndexer().setInputCol("BusinessTravel").setOutputCol("BusinessTravelIndexed")
val Department_LabelIndexer = new StringIndexer().setInputCol("Department").setOutputCol("DepartmentIndexed")
val EducationField_LabelIndexer = new StringIndexer().setInputCol("EducationField").setOutputCol("EducationFieldIndexed")
val Gender_LabelIndexer = new StringIndexer().setInputCol("Gender").setOutputCol("GenderIndexed")
val JobRole_LabelIndexer = new StringIndexer().setInputCol("JobRole").setOutputCol("JobRoleIndexed")
val MaritalStatus_LabelIndexer = new StringIndexer().setInputCol("MaritalStatus").setOutputCol("MaritalStatusIndexed")
val Over18_LabelIndexer = new StringIndexer().setInputCol("Over18").setOutputCol("Over18Indexed")
val OverTime_LabelIndexer = new StringIndexer().setInputCol("OverTime").setOutputCol("OverTimeIndexed")

In [8]:
val Attrition_IndexedDF = Attrition_LabelIndexer.fit(attritionDf).transform(attritionDf)
val BusinessTravel_IndexedDF = BusinessTravel_LabelIndexer.fit(Attrition_IndexedDF).transform(Attrition_IndexedDF)
val Department_IndexedDF = Department_LabelIndexer.fit(BusinessTravel_IndexedDF).transform(BusinessTravel_IndexedDF)
val EducationField_IndexedDF = EducationField_LabelIndexer.fit(Department_IndexedDF).transform(Department_IndexedDF)
val Gender_IndexedDF = Gender_LabelIndexer.fit(EducationField_IndexedDF).transform(EducationField_IndexedDF)
val JobRole_IndexedDF = JobRole_LabelIndexer.fit(Gender_IndexedDF).transform(Gender_IndexedDF)
val MaritalStatus_IndexedDF = MaritalStatus_LabelIndexer.fit(JobRole_IndexedDF).transform(JobRole_IndexedDF)
val Over18_IndexedDF = Over18_LabelIndexer.fit(MaritalStatus_IndexedDF).transform(MaritalStatus_IndexedDF)
val OverTime_IndexedDF = OverTime_LabelIndexer.fit(Over18_IndexedDF).transform(Over18_IndexedDF)

In [9]:
val assembler = new VectorAssembler().setInputCols(Array("BusinessTravelIndexed", "DepartmentIndexed", "EducationFieldIndexed", 
"GenderIndexed", "JobRoleIndexed", "MaritalStatusIndexed", "Over18Indexed", "OverTimeIndexed", "Age", 
"DailyRate", "DistanceFromHome", "Education","EmployeeCount", "EmployeeNumber","EnvironmentSatisfaction","HourlyRate",
"JobInvolvement","JobLevel","JobSatisfaction","MonthlyIncome","MonthlyRate","NumCompaniesWorked","PercentSalaryHike",
"PerformanceRating","RelationshipSatisfaction","StandardHours","StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear",
"WorkLifeBalance","YearsAtCompany","YearsInCurrentRole","YearsSinceLastPromotion","YearsWithCurrManager")).setOutputCol("features")
val result = assembler.transform(OverTime_IndexedDF)
result.show

+---+---------+-----------------+---------+--------------------+----------------+---------+--------------+-------------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+-----+---------------------+-----------------+---------------------+-------------+--------------+--------------------+-------------+---------------+--------------------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfaction|MaritalStatus|MonthlyIncome|MonthlyRate|NumCompanies

In [10]:
result.select("label","features","BusinessTravelIndexed","DepartmentIndexed","EducationFieldIndexed").show(2)

+-----+--------------------+---------------------+-----------------+---------------------+
|label|            features|BusinessTravelIndexed|DepartmentIndexed|EducationFieldIndexed|
+-----+--------------------+---------------------+-----------------+---------------------+
|  1.0|[0.0,1.0,0.0,1.0,...|                  0.0|              1.0|                  0.0|
|  0.0|[1.0,0.0,0.0,0.0,...|                  1.0|              0.0|                  0.0|
+-----+--------------------+---------------------+-----------------+---------------------+
only showing top 2 rows



### 1.3. Feature Correlation using Pearson's Correlation Matrix

By plotting a correlation matrix, we can have an overview of how the features are related to one another. 

In [11]:
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
val coeff1 = Correlation.corr(result, "features").head
println("Pearson's correlation matrix:\n" + coeff1.toString)

Pearson's correlation matrix:
[1.0                    -9.345554473742944E-4  ... (34 total)
-9.345554473742944E-4  1.0                    ...
-0.021512719737780385  0.253736538418516      ...
-0.03298095630330798   4.883422630625519E-4   ...
-0.028663665240755     0.08477914511425376    ...
0.06573146408981918    -0.03081840512586054   ...
NaN                    NaN                    ...
-0.016543041959464554  3.3961808998089277E-4  ...
-0.024751434227472353  -0.007652016028022201  ...
0.00408603392695391    -0.021958839348349083  ...
0.024469441632343172   0.0021963708361808073  ...
-7.5693310007276E-4    0.019636446644669635   ...
NaN                    NaN                    ...
0.015577822975475329   0.05766297309290677    ...
-0.004174404869855     -0.026110442307181205  ...
-0.026528186185597415  -0.021527658389011904  ...
-0.039061500556181315  -0.017692783090383083  ...
-0.019311282061786422  0.08801799142179016    ...
0.033962151335930515   -0.006231043718765055  ...
-0.03431

## 2. Data Analysis

We found a few interesting facts when exploring the data.

### 2.1. Number of employees who left vs. stayed

In [12]:
var leftVsStayed=result.groupBy("Attrition").agg(count("Attrition"))
leftVsStayed.show

+---------+----------------+
|Attrition|count(Attrition)|
+---------+----------------+
|       No|            1233|
|      Yes|             237|
+---------+----------------+



From the above results, we can say that approximately 16% of the employees left the company.

### 2.2. Average employee satisfaction of employees who left vs. stayed

In [13]:
var satisfaction=result.groupBy("Attrition").agg(mean("JobSatisfaction"))
satisfaction.show

+---------+--------------------+
|Attrition|avg(JobSatisfaction)|
+---------+--------------------+
|       No|   2.778588807785888|
|      Yes|  2.4683544303797467|
+---------+--------------------+



### 2.3. Average employee satisfaction based on job role

In [14]:
var ProjectsSatisfaction=result.groupBy("JobRole").agg(mean("JobSatisfaction"))
ProjectsSatisfaction.show

+--------------------+--------------------+
|             JobRole|avg(JobSatisfaction)|
+--------------------+--------------------+
|     Sales Executive|   2.754601226993865|
|Manufacturing Dir...|   2.682758620689655|
|Laboratory Techni...|  2.6911196911196913|
|Sales Representative|  2.7349397590361444|
|Healthcare Repres...|   2.786259541984733|
|  Research Scientist|  2.7739726027397262|
|             Manager|  2.7058823529411766|
|   Research Director|                 2.7|
|     Human Resources|  2.5576923076923075|
+--------------------+--------------------+



Human Resources Employees have the least job satisfaction rate while Health Care Representatives have the  most.

## 3. Machine Learning Models

The prediction of employee attrition requires a Binary Classification Model which decides the Attrition variable as 'Yes/No'. We used 6 different models : Naive Bayes, Decision Tree, LogisticRegression, Gradient Boosting, Random Forest and Linear Support Vector Classifier.
We evaluated these models and compared the accuracy and time taken by each of them. Confusion Matrix was computed for each of the models.

### Randomly Split the data for Training and Testing

In [15]:
val Array(train, test) = result.randomSplit(Array(0.7, 0.3))

## Model 1 : Naive Bayes

In [16]:
import org.apache.spark.ml.classification.NaiveBayes
val nb = new NaiveBayes().setLabelCol("label").setFeaturesCol("features")

val startTime_nb = System.nanoTime()  

val model_nb = nb.fit(train)

val time_nb = (System.nanoTime() - startTime_nb) / 1e9

println("Time elapsed for Naive Bayes:")
println(time_nb)

Time elapsed for Naive Bayes:
1.94149036


In [17]:
val predictions_nb = model_nb.transform(test)
predictions_nb.show

+---+---------+-----------------+---------+--------------------+----------------+---------+----------------+-------------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+-----+---------------------+-----------------+---------------------+-------------+--------------+--------------------+-------------+---------------+--------------------+--------------------+--------------------+----------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|  EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfac

In [18]:
predictions_nb.select("label","features","prediction","BusinessTravelIndexed").show(5)

+-----+--------------------+----------+---------------------+
|label|            features|prediction|BusinessTravelIndexed|
+-----+--------------------+----------+---------------------+
|  0.0|[2.0,0.0,0.0,1.0,...|       1.0|                  2.0|
|  1.0|[1.0,1.0,2.0,0.0,...|       1.0|                  1.0|
|  0.0|[0.0,0.0,0.0,1.0,...|       1.0|                  0.0|
|  1.0|[1.0,1.0,3.0,1.0,...|       1.0|                  1.0|
|  1.0|[0.0,1.0,4.0,0.0,...|       1.0|                  0.0|
+-----+--------------------+----------+---------------------+
only showing top 5 rows



In [19]:
import org.apache.spark.ml.feature.Binarizer
val binarizer: Binarizer = new Binarizer().setInputCol("prediction").setOutputCol("binarized_prediction").setThreshold(0.5)
val predictionBinary_nb = binarizer.transform(predictions_nb) 

In [20]:
val wrongPredictions_nb = predictionBinary_nb.where(expr("label != prediction"))
val countErrors_nb = wrongPredictions_nb.groupBy("label").agg(count("prediction").alias("Errors"))
countErrors_nb.show

+-----+------+
|label|Errors|
+-----+------+
|  0.0|   192|
|  1.0|    26|
+-----+------+



In [21]:
val correctPredictions_nb = predictionBinary_nb.where(expr("label == prediction"))
val countCorrectPredictions_nb = correctPredictions_nb.groupBy("label").agg(count("prediction").alias("Correct"))
countCorrectPredictions_nb.show

+-----+-------+
|label|Correct|
+-----+-------+
|  0.0|    198|
|  1.0|     45|
+-----+-------+



### Confusion Matrix and Accuracy (Naive Bayes)

In [22]:
import org.apache.spark.mllib.evaluation.MulticlassMetrics
val selectMetrics_nb = predictions_nb.select("label","prediction")
val rdd_nb = selectMetrics_nb.rdd.map(row => {
      val label = row.getDouble(0)
      val prediction = row.getDouble(1)
      (prediction, label)
    })
val metrics_nb = new MulticlassMetrics(rdd_nb)
println("Confusion matrix for Naive Bayes:")
val confusionMatrix_nb=metrics_nb.confusionMatrix
println(confusionMatrix_nb)
println("Accuracy of Naive Bayes:")
val accuracy_nb=metrics_nb.accuracy
println(accuracy_nb)

Confusion matrix for Naive Bayes:
198.0  192.0  
26.0   45.0   
Accuracy of Naive Bayes:
0.527114967462039


## Model 2 : Decision Tree

In [23]:
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.regression.DecisionTreeRegressor

val dt = new DecisionTreeRegressor().setMaxBins(100).setLabelCol("label").setFeaturesCol("features")

val startTime_dt = System.nanoTime()

val model_dt = dt.fit(train)

val time_dt = (System.nanoTime() - startTime_dt) / 1e9

println("Time elapsed for Decision tree:")
println(time_dt)

Time elapsed for Decision tree:
2.662944262


In [24]:
val predictions_dt = model_dt.transform(test)
predictions_dt.show

+---+---------+-----------------+---------+--------------------+----------------+---------+----------------+-------------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+-----+---------------------+-----------------+---------------------+-------------+--------------+--------------------+-------------+---------------+--------------------+-------------------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|  EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfaction|MaritalStatus|MonthlyIncome|

In [25]:
val predictionBinary_dt = binarizer.transform(predictions_dt) 

In [26]:
val wrongPredictions_dt = predictionBinary_dt.where(expr("label != prediction"))
val countErrors_dt = wrongPredictions_dt.groupBy("label").agg(count("prediction").alias("Errors"))
countErrors_dt.show

+-----+------+
|label|Errors|
+-----+------+
|  0.0|   386|
|  1.0|    67|
+-----+------+



In [27]:
val correctPredictions_dt = predictionBinary_dt.where(expr("label == prediction"))
val countCorrectPredictions_dt = correctPredictions_dt.groupBy("label").agg(count("prediction").alias("Correct"))
countCorrectPredictions_dt.show

+-----+-------+
|label|Correct|
+-----+-------+
|  0.0|      4|
|  1.0|      4|
+-----+-------+



### Confusion Matrix and Accuracy (Decision Tree)

In [28]:
val selectMetrics_dt = predictions_dt.select("label","prediction")
val rdd_dt = selectMetrics_dt.rdd.map(row => {
      val label = row.getDouble(0)
      val prediction = row.getDouble(1)
      (prediction, label)
    })
val metrics_dt = new MulticlassMetrics(rdd_dt)
println("Confusion matrix for Decision Tree:")
val confusionMatrix_dt=metrics_dt.confusionMatrix
println(confusionMatrix_dt)
println("Accuracy of Decision Tree:")
val accuracy_dt=metrics_dt.accuracy
println(accuracy_dt)

Confusion matrix for Decision Tree:
4.0  2.0  
2.0  4.0  
Accuracy of Decision Tree:
0.01735357917570499


## Model 3 : Logistic Regression

In [29]:
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

val startTime_lr = System.nanoTime()  

val model_lr = lr.fit(train)

val time_lr = (System.nanoTime() - startTime_lr) / 1e9

println("Time elapsed for Logistic Regression:")
println(time_lr)

println(s"Coefficients: ${model_lr.coefficients} Intercept: ${model_lr.intercept}")

Time elapsed for Logistic Regression:
4.851917167
Coefficients: [-0.21410064512914723,0.5188448826291124,0.06407087472622376,-0.2966350864732555,0.003492472344570198,0.009485474047862567,0.0,1.6996626163602908,-0.026956384625340387,-4.6787964656225E-4,0.04354410688164103,0.0743567547665852,0.0,-6.154104911826443E-5,-0.4794414116244865,8.355537226788188E-5,-0.5878689533641921,-0.30229601514450877,-0.3643527029542467,-4.090608640565197E-5,1.6552747463417513E-7,0.11221342254993368,-0.017812551534071,0.13303511119294464,-0.15386620604854412,0.0,-0.5046918307510031,-0.057262107862185364,-0.1245343845503649,-0.2893295936603231,0.07817516073603716,-0.1392369852351677,0.1647641276196548,-0.09977153045052696] Intercept: 4.898769992117542


In [30]:
val predictions_lr=model_lr.transform(test)
predictions_lr.show

+---+---------+-----------------+---------+--------------------+----------------+---------+----------------+-------------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+-----+---------------------+-----------------+---------------------+-------------+--------------+--------------------+-------------+---------------+--------------------+--------------------+--------------------+----------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|  EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfac

In [31]:
val predictionBinary_lr = binarizer.transform(predictions_lr) 

In [32]:
val wrongPredictions_lr = predictionBinary_lr.where(expr("label != prediction"))
val countErrors_lr = wrongPredictions_lr.groupBy("label").agg(count("prediction").alias("Errors"))
countErrors_lr.show

+-----+------+
|label|Errors|
+-----+------+
|  0.0|    12|
|  1.0|    47|
+-----+------+



In [33]:
val correctPredictions_lr = predictionBinary_lr.where(expr("label == prediction"))
val countCorrectPredictions_lr = correctPredictions_lr.groupBy("label").agg(count("prediction").alias("Correct"))
countCorrectPredictions_lr.show

+-----+-------+
|label|Correct|
+-----+-------+
|  0.0|    378|
|  1.0|     24|
+-----+-------+



### Confusion Matrix and Accuracy (Logistic Regression)

In [34]:
val selectMetrics_lr = predictions_lr.select("label","prediction")
val rdd_lr = selectMetrics_lr.rdd.map(row => {
      val label = row.getDouble(0)
      val prediction = row.getDouble(1)
      (prediction, label)
    })
val metrics_lr = new MulticlassMetrics(rdd_lr)
println("Confusion matrix for Logistic Regression:")
val confusionMatrix_lr=metrics_lr.confusionMatrix
println(confusionMatrix_lr)
println("Accuracy of Logistic Regression:")
val accuracy_lr=metrics_lr.accuracy
println(accuracy_lr)

Confusion matrix for Logistic Regression:
378.0  12.0  
47.0   24.0  
Accuracy of Logistic Regression:
0.8720173535791758


## Model 4 : Gradient Boosting Classifier

In [35]:
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
val gbt = new GBTClassifier().setMaxIter(10).setLabelCol("label").setFeaturesCol("features")

val startTime_gbt = System.nanoTime()  

val model_gbt = gbt.fit(train)

val time_gbt = (System.nanoTime() - startTime_gbt) / 1e9

println("Time elapsed for Gradient Boost:")
println(time_gbt)

Time elapsed for Gradient Boost:
6.941712915


In [36]:
val predictions_gbt=model_gbt.transform(test)
predictions_gbt.show

+---+---------+-----------------+---------+--------------------+----------------+---------+----------------+-------------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+-----+---------------------+-----------------+---------------------+-------------+--------------+--------------------+-------------+---------------+--------------------+--------------------+--------------------+----------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|  EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfac

In [37]:
val predictionBinary_gbt = binarizer.transform(predictions_gbt) 

In [38]:
val wrongPredictions_gbt = predictionBinary_gbt.where(expr("label != prediction"))
val countErrors_gbt = wrongPredictions_gbt.groupBy("label").agg(count("prediction").alias("Errors"))
countErrors_gbt.show

+-----+------+
|label|Errors|
+-----+------+
|  0.0|    26|
|  1.0|    43|
+-----+------+



In [39]:
val correctPredictions_gbt = predictionBinary_gbt.where(expr("label == prediction"))
val countCorrectPredictions_gbt = correctPredictions_gbt.groupBy("label").agg(count("prediction").alias("Correct"))
countCorrectPredictions_gbt.show

+-----+-------+
|label|Correct|
+-----+-------+
|  0.0|    364|
|  1.0|     28|
+-----+-------+



### Confusion Matrix and Accuracy (Gradient Boosting Classifier)

In [40]:
val selectMetrics_gbt = predictions_gbt.select("label","prediction")
val rdd_gbt = selectMetrics_gbt.rdd.map(row => {
      val label = row.getDouble(0)
      val prediction = row.getDouble(1)
      (prediction, label)
    })
val metrics_gbt = new MulticlassMetrics(rdd_gbt)
println("Confusion matrix for Graident Boost:")
val confusionMatrix_gbt=metrics_gbt.confusionMatrix
println(confusionMatrix_gbt)
println("Accuracy of Gradient Boost:")
val accuracy_gbt=metrics_gbt.accuracy
println(accuracy_gbt)

Confusion matrix for Graident Boost:
364.0  26.0  
43.0   28.0  
Accuracy of Gradient Boost:
0.8503253796095445


### Feature importance using Gradient Boosting Classifier

The attribute _featureimportances_ tells us which features within the dataset has been given most importance through the Gradient Boost algorithm.

In [41]:
val importance_gbt=model_gbt.featureImportances
val featuresArray=Array("BusinessTravelIndexed","Age","DailyRate","DepartmentIndexed","DistanceFromHome","Education","EducationFieldIndexed","EmployeeCount",
"EmployeeNumber","EnvironmentSatisfaction","GenderIndexed","HourlyRate","JobInvolvement","JobLevel","JobRoleIndexed","JobSatisfaction","MaritalStatusIndexed",
"MonthlyIncome","MonthlyRate","NumCompaniesWorked","Over18Indexed","OverTimeIndexed","PercentSalaryHike","PerformanceRating","RelationshipSatisfaction",
"StandardHours","StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","WorkLifeBalance","YearsAtCompany","YearsInCurrentRole","YearsSinceLastPromotion",
"YearsWithCurrManager")
val features_importance_gbt = featuresArray.zip(importance_gbt.toArray).sortBy(-_._2)
for (item <- features_importance_gbt)
{
    println(item)
}

(DistanceFromHome,0.0842059217229255)
(EmployeeNumber,0.06708985482086756)
(GenderIndexed,0.06667782068935119)
(EnvironmentSatisfaction,0.06598774736027943)
(JobRoleIndexed,0.06045894805229842)
(StockOptionLevel,0.04907723497687035)
(DailyRate,0.048325912129020786)
(YearsAtCompany,0.04799554051669127)
(Over18Indexed,0.043271798934119435)
(JobLevel,0.04093618065006249)
(JobSatisfaction,0.04068730744797036)
(EmployeeCount,0.0391714950078628)
(MaritalStatusIndexed,0.0340120703070706)
(PercentSalaryHike,0.03336586028807294)
(TotalWorkingYears,0.027956587415729678)
(MonthlyRate,0.02791972844667373)
(BusinessTravelIndexed,0.02772339715111606)
(YearsSinceLastPromotion,0.027046397191247226)
(NumCompaniesWorked,0.026351250143904006)
(RelationshipSatisfaction,0.024998623214240793)
(WorkLifeBalance,0.02071676647347053)
(TrainingTimesLastYear,0.018494634368742366)
(MonthlyIncome,0.014489594767120867)
(YearsInCurrentRole,0.012546053691839067)
(OverTimeIndexed,0.012141929335073732)
(DepartmentIndexe

<img src = "./Importance_GBT.png">

We can see that "Distance From Home" was surprisingly considered as the most important feature by the Gradient Boosting algorithm while a seemingly more important feature like "Performance Rating" was given a ranking score of 0.

_Note: The important features in the graphs are representative of one of our code runs._

## Model 5 : Random Forest

In [42]:
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
val rf = new RandomForestClassifier().setNumTrees(3).setLabelCol("label").setFeaturesCol("features")

val startTime_rf = System.nanoTime()  

val model_rf = rf.fit(train)

val time_rf = (System.nanoTime() - startTime_rf) / 1e9

println("Time elapsed for Random Forest:")
println(time_rf)

Time elapsed for Random Forest:
2.094999318


In [43]:
val predictions_rf=model_rf.transform(test)
predictions_rf.show

+---+---------+-----------------+---------+--------------------+----------------+---------+----------------+-------------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+-----+---------------------+-----------------+---------------------+-------------+--------------+--------------------+-------------+---------------+--------------------+--------------------+--------------------+----------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|  EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfac

In [44]:
val predictionBinary_rf = binarizer.transform(predictions_rf) 

In [45]:
val wrongPredictions_rf = predictionBinary_rf.where(expr("label != prediction"))
val countErrors_rf = wrongPredictions_rf.groupBy("label").agg(count("prediction").alias("Errors"))
countErrors_rf.show

+-----+------+
|label|Errors|
+-----+------+
|  0.0|    12|
|  1.0|    56|
+-----+------+



In [46]:
val correctPredictions_rf = predictionBinary_rf.where(expr("label == prediction"))
val countCorrectPredictions_rf = correctPredictions_rf.groupBy("label").agg(count("prediction").alias("Correct"))
countCorrectPredictions_rf.show

+-----+-------+
|label|Correct|
+-----+-------+
|  0.0|    378|
|  1.0|     15|
+-----+-------+



### Confusion Matrix and Accuracy (Random Forest)

In [47]:
val selectMetrics_rf = predictions_rf.select("label","prediction")
val rdd_rf = selectMetrics_rf.rdd.map(row => {
      val label = row.getDouble(0)
      val prediction = row.getDouble(1)
      (prediction, label)
    })
val metrics_rf = new MulticlassMetrics(rdd_rf)
println("Confusion matrix for Random Forest:")
val confusionMatrix_rf=metrics_rf.confusionMatrix
println(confusionMatrix_rf)
println("Accuracy of Random Forest:")
val accuracy_rf=metrics_rf.accuracy
println(accuracy_rf)

Confusion matrix for Random Forest:
378.0  12.0  
56.0   15.0  
Accuracy of Random Forest:
0.8524945770065075


### Feature importance Using Random Forests

The attribute _featureimportances_ tells us which features within the dataset have been given the most importance through the Random Forest algorithm. 

In [48]:
val importance_rf=model_rf.featureImportances
val featuresArray=Array("AttritionIndexed","BusinessTravelIndexed","DailyRate","DailyRate","DepartmentIndexed","DistanceFromHome","Education","EducationFieldIndexed","EmployeeCount","EmployeeNumber","EnvironmentSatisfaction","GenderIndexed","HourlyRate","JobInvolvement","JobLevel","JobRoleIndexed","JobSatisfaction","MaritalStatusIndexed","MonthlyIncome","MonthlyRate","NumCompaniesWorked","Over18Indexed","OverTimeIndexed","PercentSalaryHike","PerformanceRating","RelationshipSatisfaction","StandardHours","StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","WorkLifeBalance","YearsAtCompany","YearsInCurrentRole","YearsSinceLastPromotion","YearsWithCurrManager")
val features_importance_rf = featuresArray.zip(importance_rf.toArray).sortBy(-_._2)
for (item <- features_importance_rf)
{
    println(item)
}

(WorkLifeBalance,0.12190322369356148)
(EmployeeCount,0.11339566327260553)
(EducationFieldIndexed,0.09385288663005897)
(MonthlyRate,0.0666084617118382)
(JobInvolvement,0.06116503590601826)
(JobLevel,0.04840638252144266)
(EmployeeNumber,0.046129429495306275)
(DistanceFromHome,0.04606877617980757)
(Over18Indexed,0.04282709636269108)
(YearsSinceLastPromotion,0.040916720611687905)
(StockOptionLevel,0.04024648006507451)
(JobRoleIndexed,0.037998829287455745)
(BusinessTravelIndexed,0.030537865272743447)
(MaritalStatusIndexed,0.030183706132486304)
(DepartmentIndexed,0.02507923721703621)
(TotalWorkingYears,0.02109009868902441)
(StandardHours,0.020782919679253182)
(GenderIndexed,0.018114620667342138)
(JobSatisfaction,0.017321501068724907)
(NumCompaniesWorked,0.01728424055545186)
(OverTimeIndexed,0.0161307986155934)
(EnvironmentSatisfaction,0.014796713285478734)
(PerformanceRating,0.009974999032846323)
(YearsInCurrentRole,0.008163579455391113)
(AttritionIndexed,0.00467692262345907)
(DailyRate,0.00

<img src = "./Importance_RF.png">

Here, 'Education Field' of the employee was given the most importance followed by the feature 'Work Life Balance'.

_Note: The important features in the graphs are representative of one of our code runs._

## Model 6 : Linear Support Vector Machine

In [49]:
import org.apache.spark.ml.classification.LinearSVC

val svm = new LinearSVC().setMaxIter(10).setRegParam(0.1).setLabelCol("label").setFeaturesCol("features")

val startTime_svm = System.nanoTime()  

val model_svm = lr.fit(train)

val time_svm = (System.nanoTime() - startTime_svm) / 1e9

println("Time elapsed for Linear SVM:")
println(time_svm)


Time elapsed for Linear SVM:
4.640156044


In [50]:
val predictions_svm=model_svm.transform(test)
predictions_svm.show

+---+---------+-----------------+---------+--------------------+----------------+---------+----------------+-------------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+-----+---------------------+-----------------+---------------------+-------------+--------------+--------------------+-------------+---------------+--------------------+--------------------+--------------------+----------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|  EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfac

In [51]:
val predictionBinary_svm = binarizer.transform(predictions_svm) 

In [52]:
val wrongPredictions_svm = predictionBinary_svm.where(expr("label != prediction"))
val countErrors_svm = wrongPredictions_svm.groupBy("label").agg(count("prediction").alias("Errors"))
countErrors_svm.show

+-----+------+
|label|Errors|
+-----+------+
|  0.0|    12|
|  1.0|    47|
+-----+------+



In [53]:
val correctPredictions_svm = predictionBinary_svm.where(expr("label == prediction"))
val countCorrectPredictions_svm = correctPredictions_svm.groupBy("label").agg(count("prediction").alias("Correct"))
countCorrectPredictions_svm.show

+-----+-------+
|label|Correct|
+-----+-------+
|  0.0|    378|
|  1.0|     24|
+-----+-------+



### Confusion Matrix and Accuracy (Linear Support Vector Machine)

In [54]:
val selectMetrics_svm = predictions_svm.select("label","prediction")
val rdd_svm = selectMetrics_svm.rdd.map(row => {
      val label = row.getDouble(0)
      val prediction = row.getDouble(1)
      (prediction, label)
    })
val metrics_svm = new MulticlassMetrics(rdd_svm)
println("Confusion matrix for Linnear SVM:")
val confusionMatrix_svm=metrics_svm.confusionMatrix
println(confusionMatrix_svm)
println("Accuracy of Linear SVM:")
val accuracy_svm=metrics_svm.accuracy
println(accuracy_svm)

Confusion matrix for Linnear SVM:
378.0  12.0  
47.0   24.0  
Accuracy of Linear SVM:
0.8720173535791758


##  4. Parameter Tuning with Cross-Validation Pipelining over a Parameter Grid

We also performed cross-validation over a grid of parameters for a few models and evaluated them. We saw an improvement in the accuracy for two of the models, Random Forest and Gradient Boosting (shown below). However, we observed that the accuracy of other models did not improve when cross-validation pipeline model was applied.<br>

To perform cross-validation pipelining, firstly we determined the most useful features using feature importance as shown in the above methods. The top features (with feature importances value greater than 0.0) were used as inputs to the Vector Slicer which picks out these most important features. We then used a StandardScaler so that all features are on a similar scale. We set the number of folds to 10 for the cross validation.<br>
We selected from a grid of hyperparameters for fine-tuning the model and computed the time elapsed, accuracy and confusion matrices for the two models.

### 4.1. Random Forest Model with Pipeline

In [55]:
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StandardScaler

val slicer = new VectorSlicer().setInputCol("features").setOutputCol("slicedfeatures").setNames(Array("EducationFieldIndexed","WorkLifeBalance","JobLevel","EmployeeCount","YearsSinceLastPromotion","JobSatisfaction","MonthlyRate","DailyRate",
"TotalWorkingYears","JobInvolvement","EmployeeNumber","StandardHours","StockOptionLevel","PerformanceRating","Over18Indexed","EnvironmentSatisfaction",
"MonthlyIncome","JobRoleIndexed","BusinessTravelIndexed","OverTimeIndexed","DepartmentIndexed","NumCompaniesWorked","DistanceFromHome",
"GenderIndexed","YearsAtCompany","MaritalStatusIndexed"))       

In [56]:
val scaler = new StandardScaler().setInputCol("slicedfeatures").setOutputCol("scaledfeatures").setWithStd(true).setWithMean(true)             

val rf_ParamGrid = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("scaledfeatures")

val rfPipeline = new Pipeline().setStages(Array(slicer, scaler, rf_ParamGrid))

In [57]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row

val paramGrid_rf = new ParamGridBuilder().addGrid(rf_ParamGrid.maxBins, Array(25, 28, 31)).addGrid(rf_ParamGrid.maxDepth, Array(4, 6, 8)).addGrid(rf_ParamGrid.impurity, Array("entropy", "gini")).build()               

val evaluator_rf = new BinaryClassificationEvaluator().setLabelCol("label").setMetricName("areaUnderPR")

val cv_rf = new CrossValidator().setEstimator(rfPipeline).setEvaluator(evaluator_rf).setEstimatorParamMaps(paramGrid_rf).setNumFolds(10)

val startTime_rf = System.nanoTime()  

val crossValidatorModel_rf = cv_rf.fit(train)

val rfPipeTime = (System.nanoTime() - startTime_rf) / 1e9

println("Time elapsed using Random Forest with Pipelining:")
println(rfPipeTime)

Time elapsed using Random Forest with Pipelining:
202.294446795


In [58]:
val predictions_rf_ParamGrid = crossValidatorModel_rf.transform(test)
crossValidatorModel_rf.explainParams()

estimator: estimator for selection (current: pipeline_69abe024ae18)
estimatorParamMaps: param maps for the estimator (current: [Lorg.apache.spark.ml.param.ParamMap;@68a89377)
evaluator: evaluator used to select hyper-parameters that maximize the validated metric (current: binEval_f205501f4b09)
numFolds: number of folds for cross validation (>= 2) (default: 3, current: 10)
seed: random seed (default: -1191137437)

In [59]:
val predictionBinary_rf_ParamGrid = binarizer.transform(predictions_rf_ParamGrid) 

val wrongPredictions_rf_ParamGrid = predictionBinary_rf_ParamGrid.where(expr("label != prediction"))
val countErrors_rf_ParamGrid = wrongPredictions_rf_ParamGrid.groupBy("label").agg(count("prediction").alias("Errors"))
countErrors_rf_ParamGrid.show

val correctPredictions_rf_ParamGrid = predictionBinary_rf_ParamGrid.where(expr("label == prediction"))
val countCorrectPredictions_rf_ParamGrid = correctPredictions_rf_ParamGrid.groupBy("label").agg(count("prediction").alias("Correct"))
countCorrectPredictions_rf_ParamGrid.show

+-----+------+
|label|Errors|
+-----+------+
|  0.0|     6|
|  1.0|    57|
+-----+------+

+-----+-------+
|label|Correct|
+-----+-------+
|  0.0|    384|
|  1.0|     14|
+-----+-------+



### Confusion Matrix and Accuracy (Random Forest Pipeline)

In [60]:
val selectMetrics_rf_ParamGrid = predictions_rf_ParamGrid.select("label","prediction")
val rdd_rf_ParamGrid = selectMetrics_rf_ParamGrid.rdd.map(row => {
      val label = row.getDouble(0)
      val prediction = row.getDouble(1)
      (prediction, label)
    })
val metrics_rf_ParamGrid = new MulticlassMetrics(rdd_rf_ParamGrid)
println("Confusion matrix for Random Forests using Pipelining:")
val confusionMatrix_rf_ParamGrid=metrics_rf_ParamGrid.confusionMatrix
println(confusionMatrix_rf_ParamGrid)
println("Accuracy of Random Forests using Pipelining:")
val accuracy_rf_ParamGrid=metrics_rf_ParamGrid.accuracy
println(accuracy_rf_ParamGrid)

Confusion matrix for Random Forests using Pipelining:
384.0  6.0   
57.0   14.0  
Accuracy of Random Forests using Pipelining:
0.8633405639913232


### 4.2. Gradient Boosting Classifier Model with pipeline

In [61]:
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StandardScaler

val slicer = new VectorSlicer().setInputCol("features").setOutputCol("slicedfeatures").setNames(Array("BusinessTravelIndexed","Age","DailyRate","DepartmentIndexed","DistanceFromHome","Education","EmployeeCount",
"EmployeeNumber","EnvironmentSatisfaction","GenderIndexed","HourlyRate","JobLevel","JobRoleIndexed","JobSatisfaction","MaritalStatusIndexed",
"MonthlyIncome","MonthlyRate","NumCompaniesWorked","Over18Indexed","OverTimeIndexed","PercentSalaryHike","RelationshipSatisfaction",
"StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","WorkLifeBalance","YearsAtCompany","YearsInCurrentRole","YearsSinceLastPromotion",
"YearsWithCurrManager"))


In [62]:
val scaler = new StandardScaler().setInputCol("slicedfeatures").setOutputCol("scaledfeatures").setWithStd(true).setWithMean(true)

val gbt_ParamGrid = new GBTClassifier().setLabelCol("label").setFeaturesCol("scaledfeatures")

val gbtPipeline = new Pipeline().setStages(Array(slicer, scaler, gbt_ParamGrid))

In [63]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row

val paramGrid_gbt = new ParamGridBuilder().build()               

val evaluator_gbt = new BinaryClassificationEvaluator().setLabelCol("label").setMetricName("areaUnderPR")

val cv_gbt = new CrossValidator().setEstimator(gbtPipeline).setEvaluator(evaluator_gbt).setEstimatorParamMaps(paramGrid_gbt).setNumFolds(10)

val startTime_gbt = System.nanoTime()  

val crossValidatorModel_gbt = cv_gbt.fit(train)

val gbtPipeTime = (System.nanoTime() - startTime_gbt) / 1e9

println("Time elapsed for Gradient Boosting with pipelining:")
println(gbtPipeTime)


Time elapsed for Gradient Boosting with pipelining:
127.333577878


In [64]:
val predictions_gbt_ParamGrid = crossValidatorModel_gbt.transform(test)
crossValidatorModel_gbt.explainParams()

estimator: estimator for selection (current: pipeline_9ceecef1562a)
estimatorParamMaps: param maps for the estimator (current: [Lorg.apache.spark.ml.param.ParamMap;@36c431a2)
evaluator: evaluator used to select hyper-parameters that maximize the validated metric (current: binEval_716fafcd4a70)
numFolds: number of folds for cross validation (>= 2) (default: 3, current: 10)
seed: random seed (default: -1191137437)

In [65]:
val predictionBinary_gbt_ParamGrid = binarizer.transform(predictions_gbt_ParamGrid) 

val wrongPredictions_gbt_ParamGrid = predictionBinary_gbt_ParamGrid.where(expr("label != prediction"))
val countErrors_gbt_ParamGrid = wrongPredictions_gbt_ParamGrid.groupBy("label").agg(count("prediction").alias("Errors"))
countErrors_gbt_ParamGrid.show

val correctPredictions_gbt_ParamGrid = predictionBinary_gbt_ParamGrid.where(expr("label == prediction"))
val countCorrectPredictions_gbt_ParamGrid = correctPredictions_gbt_ParamGrid.groupBy("label").agg(count("prediction").alias("Correct"))
countCorrectPredictions_gbt_ParamGrid.show


+-----+------+
|label|Errors|
+-----+------+
|  0.0|    25|
|  1.0|    46|
+-----+------+

+-----+-------+
|label|Correct|
+-----+-------+
|  0.0|    365|
|  1.0|     25|
+-----+-------+



### Confusion Matrix and Accuracy (Gradient Boosting Pipeline)

In [66]:
val selectMetrics_gbt_ParamGrid = predictions_gbt_ParamGrid.select("label","prediction")
val rdd_gbt_ParamGrid = selectMetrics_gbt_ParamGrid.rdd.map(row => {
      val label = row.getDouble(0)
      val prediction = row.getDouble(1)
      (prediction, label)
    })
val metrics_gbt_ParamGrid = new MulticlassMetrics(rdd_gbt_ParamGrid)
println("Confusion matrix for Gradient Boosting with pipelining:")
val confusionMatrix_gbt_ParamGrid=metrics_gbt_ParamGrid.confusionMatrix
println(confusionMatrix_gbt_ParamGrid)
println("Accuracy of Gradient Boosting with pipelining:")
val accuracy_gbt_ParamGrid=metrics_gbt_ParamGrid.accuracy
println(accuracy_gbt_ParamGrid)

Confusion matrix for Gradient Boosting with pipelining:
365.0  25.0  
46.0   25.0  
Accuracy of Gradient Boosting with pipelining:
0.8459869848156182


## 5. Evaluation of Models

### 5.1. Comparison of Models by Classification Accuracy

In [67]:
import scala.collection.immutable.ListMap
var allModelsAccuracies = scala.collection.mutable.Map[String, Double]()
allModelsAccuracies+=("DecisionTree"->accuracy_dt)
allModelsAccuracies+=("NaiveBayes"->accuracy_nb)
allModelsAccuracies+=("GradientBoost"->accuracy_gbt_ParamGrid)
allModelsAccuracies+=("LogisticRegression"->accuracy_lr)
allModelsAccuracies+=("LinearSupportVector"->accuracy_svm)
allModelsAccuracies+=("RandomForest"->accuracy_rf_ParamGrid)

for (item<-allModelsAccuracies)
    {
        println(item)
    }

(NaiveBayes,0.527114967462039)
(GradientBoost,0.8459869848156182)
(LogisticRegression,0.8720173535791758)
(DecisionTree,0.01735357917570499)
(RandomForest,0.8633405639913232)
(LinearSupportVector,0.8720173535791758)


<img src = "./Accuracy.png">

__Analysis: Above graph shows the performance of all classifiers used, based on their accuracies. Logistic Regression and Linear Support Vector models are most accurate in predicting which employee is likely to leave the company while Decision Tree model turns out to be least accurate. <br><br>
One of the Python codes on Kaggle (https://www.kaggle.com/arthurtok/employee-attrition-via-rf-gbm), which used the same dataset, used the models Gradient Boosting Classifier and Random Forests. The performance of these two models in our case is on similar scales as theirs.__<br><br>
_Note: The scores in the graphs are representative of one of our code runs._

### 5.2. Comparison of Models by Execution Time 

In [68]:
import scala.collection.immutable.ListMap
var allModelsTimes = scala.collection.mutable.Map[String, Double]()
allModelsTimes+=("DecisionTree"->time_dt)
allModelsTimes+=("NaiveBayes"->time_nb)
allModelsTimes+=("GradientBoost"->time_gbt)
allModelsTimes+=("LogisticRegression"->time_lr)
allModelsTimes+=("LinearSupportVector"->time_svm)
allModelsTimes+=("RandomForest"->time_rf)

for (item<-allModelsTimes)
    {
        println(item)
    }

(NaiveBayes,1.94149036)
(GradientBoost,6.941712915)
(LogisticRegression,4.851917167)
(DecisionTree,2.662944262)
(RandomForest,2.094999318)
(LinearSupportVector,4.640156044)


<img src = "./Time.png">

__Analysis: The graph compares the execution (training + testing) times of each of the models. Random Forest model is the quickest to make predictions while Gradient Boosting Classifier takes the maximum time. Logistic Regression and Linear Support Vector models which had the highest accuracies, take medium to high execution times.__<br><br>
_Note: The times in the graphs are representative of one of our code runs._

__Result: Based on the above comparisons we can say that the trade-off between accuracy score and execution time for Random Forest model is the best. Hence, we can say that Random Forest is the most suitable model fore predicting whether an employee is likely to leave the company or not.__