<strong> IBM Attrition Case </strong> <br>
by: Sophie Briques <br>
2020-05-29

This notebook is complementary to the IBM attrition project that can be found here: https://sbriques.github.io/IBM-Attrition/

This notebook was created in Dataiku DSS, so the data-import steps will need to be adjusted before it can run locally.

Set Up:

In [None]:
%pylab inline

# Importing dataframe essential packages
import dataiku
from   dataiku import pandasutils as pdu
import pandas as pd

# Importing PySpark Packages
import dataiku.spark as dkuspark
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer, CountVectorizer
from pyspark.sql import Row
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *
from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.evaluation import *
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Load PySpark
sc = pyspark.SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

Since we’re using the dataset cleaned within Dataiku, we run the following lines to read it through the Dataiku API, first as a Pandas dataframe and then as a Spark dataframe:

In [None]:
# Get a handle on the cleaned Dataiku dataset
mydataset = dataiku.Dataset("IBM_clean_Hadoop")

# Read the dataset as a Pandas Dataframe
df_pd = mydataset.get_dataframe()

# Read the dataset as a Spark dataframe
df_sprk = dkuspark.get_dataframe(sqlContext, mydataset)
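
Outside Dataiku, the two reads above need a replacement. The following is only a rough sketch of one way to do it, assuming the cleaned dataset has been exported to a CSV file. The function name `load_local`, the file name `IBM_clean.csv`, and the app name are all hypothetical and not part of the original project.

```python
def load_local(csv_path="IBM_clean.csv"):
    """Load a local CSV export as both a Pandas and a Spark dataframe.

    `csv_path` is a hypothetical file name used for illustration only.
    """
    # Imported inside the function so the heavy Spark dependency is only
    # needed when the function is actually called
    import pandas as pd
    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session using all available cores
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("ibm-attrition-local")
             .getOrCreate())

    # Pandas copy for quick inspection, Spark copy for modelling
    df_pd = pd.read_csv(csv_path)
    df_sprk = spark.read.csv(csv_path, header=True, inferSchema=True)
    return df_pd, df_sprk
```

Calling `df_pd, df_sprk = load_local()` would then stand in for the Dataiku-specific cell, provided the exported file contains the same engineered columns.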

Next, we check that we loaded the cleaned version of the dataframe. Because we removed some observations, it should have 1447 rows, along with the additional features we engineered with SQL.

In [None]:
# Get the count of records in the dataframe
print(df_sprk.count())

# Get a view of the first 5 records in the dataframe
df_pd.head()

After checking the dataset, we can start building our base model:

In [None]:
# Assembling the predictor columns into a single feature vector
x_var_vec = VectorAssembler(inputCols = ['boolean_overtime',
                                         'boolean_businesstravel',
                                         'Age',
                                         'YearsInCurrentRole',
                                         'MonthlyIncome',
                                         'StockOptionLevel',
                                         'JobSatisfaction',
                                         'NumCompaniesWorked',
                                         'JobInvolvement'], outputCol = "features")

# Adding the assembled feature vector to the dataframe as a new column
vec_to_df = x_var_vec.transform(df_sprk)

# Keeping only the feature vector and the target variable
df_logit = vec_to_df.select(['features', 'boolean_attrition'])

# Renaming Target Column
df_logit = df_logit.withColumnRenamed("boolean_attrition", "label")

# Splitting the dataset
splits = df_logit.randomSplit([0.7,0.3])
train_df = splits[0]
test_df = splits[1]

# Creating the Logistic Regression estimator
lr = LogisticRegression(maxIter = 20)

# Fitting the model on the training set
model = lr.fit(train_df)

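# Calculating Evaluation Metrics
result    = model.transform(test_df)
evaluator = BinaryClassificationEvaluator(rawPredictionCol = "rawPrediction")
AUC_ROC   = evaluator.evaluate(result,{evaluator.metricName: "areaUnderROC"})
coefs     = model.coefficients
intercept = model.intercept
# Printing the base model results for comparison with the tuned model
print('LOGISTIC REGRESSION: Base model')
print('AUC ROC: ' + str(AUC_ROC))
print('Coefficients: ' + str(coefs))
print('Intercept: ' + str(intercept))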
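
To make the `areaUnderROC` metric used above more concrete, here is a small pure-Python sketch of what it measures: the probability that a randomly chosen positive case (an employee who left) receives a higher score than a randomly chosen negative case. Spark computes the metric from the ROC curve rather than by comparing pairs, so this illustrates the interpretation only. The function name and toy data are made up.

```python
def auc_roc(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Illustrative only, not Spark's implementation.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 2 leavers (label 1) and 3 stayers (label 0)
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.3, 0.4, 0.6, 0.1]
print(auc_roc(toy_labels, toy_scores))  # 5 of 6 pairs are ordered correctly
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking.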

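We’ll also tune the model with a parameter grid search and 5-fold cross-validation. The cross-validator tries every combination in the grid and keeps the one with the best average AUC ROC across folds, which should improve our model’s predictive capabilities.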

In [None]:
## Setting up Parameter GridSearch and CrossValidation
# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10])
             .build())
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validation
cvModel = cv.fit(train_df)

# Calculating Evaluation Metrics
result_cv    = cvModel.transform(test_df)
evaluator_cv = BinaryClassificationEvaluator(rawPredictionCol = "rawPrediction")
AUC_ROC_cv   = evaluator_cv.evaluate(result_cv,{evaluator_cv.metricName: "areaUnderROC"})
coefs_cv     = cvModel.bestModel.coefficients
intercept_cv = cvModel.bestModel.intercept

print('LOGISTIC REGRESSION: After CV')
print('AUC ROC: ' + str(AUC_ROC_cv))
print('Coefficients: ' + str(coefs_cv))
print('Intercept: ' + str(intercept_cv))
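
To get a feel for how much work the grid search above does, the sketch below enumerates the same parameter values in plain Python and counts the resulting model fits. It does not call Spark, and the variable names are illustrative.

```python
from itertools import product

# Same values as in the ParamGridBuilder above
reg_params = [0.01, 0.5, 2.0]
elastic_net_params = [0.0, 0.5, 1.0]
max_iters = [1, 5, 10]
num_folds = 5

# Every combination of the three parameters is one candidate model
grid = [
    {"regParam": r, "elasticNetParam": e, "maxIter": m}
    for r, e, m in product(reg_params, elastic_net_params, max_iters)
]

n_candidates = len(grid)            # 3 * 3 * 3 = 27
cv_fits = n_candidates * num_folds  # each candidate is fit once per fold

print(n_candidates, cv_fits)  # 27 135
```

After scoring the candidates, `CrossValidator` refits the best one on the full training set, which adds one more fit.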

The most important number in these results is the AUC ROC score (area under the ROC curve), which summarizes our model’s predictive capabilities: 0.5 is no better than random guessing, and 1.0 is a perfect ranking. A score of 0.82 is not bad at all!

The coefficients of a logistic regression need to be treated differently from those of a linear regression. We first take the exponential of a coefficient, which gives the odds ratio: the multiplicative change in the odds for a one-unit increase in that feature. For example, for every additional company worked at in the past, an employee’s odds of leaving IBM increase by (exp(0.015) - 1) * 100 ≈ 1.5%.
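
The conversion from a logistic regression coefficient to a percentage change in odds can be sketched as a small helper. The function name `pct_change_in_odds` is made up, and 0.015 is the rounded coefficient quoted in the text.

```python
import math


def pct_change_in_odds(coef, delta=1.0):
    """Percentage change in the odds when a feature increases by `delta`.

    exp(coef * delta) is the odds ratio; subtracting 1 and multiplying
    by 100 expresses it as a percentage change.
    """
    return (math.exp(coef * delta) - 1.0) * 100.0


# One additional company worked at, using the rounded coefficient 0.015
print(round(pct_change_in_odds(0.015), 2))  # 1.51

# Three additional companies: the effect compounds multiplicatively
print(round(pct_change_in_odds(0.015, delta=3), 2))
```

A negative coefficient gives a negative percentage, meaning the odds of leaving decrease as that feature increases.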