## Tree Based Models

In this notebook, we have implemented tree based models using [ML package](http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#) in PySpark like,
- Decision Tree, 
- Random Forest,
- Gradient-boosted Tree classifier

**Import dependencies** 

In [1]:
from pyspark.sql.functions import *
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

In [2]:
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### Load Data

In [4]:
df = spark.read.csv('../data/Telco-Customer-Churn.csv', header = True, inferSchema = True)

In [5]:
df.printSchema()

root
 |-- customerID: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: string (nullable = true)
 |-- Churn: string (nullable = true)



**Replace/ Drop Missing Values**

In [6]:
#Replacing spaces with null values in total charges column
dfWithEmptyReplaced = df.withColumn('TotalCharges', when(col('TotalCharges') == ' ', None).otherwise(col('TotalCharges')).cast("float"))
dfWithEmptyReplaced = dfWithEmptyReplaced.na.drop()

In [7]:
#Replacing 'No internet service' to No for the following columns
replace_cols = [ 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                'TechSupport','StreamingTV', 'StreamingMovies']

In [8]:
#replace values
for col_name in replace_cols:
    dfwithNo = dfWithEmptyReplaced.withColumn(col_name, when(col(col_name)== "No internet service","No").otherwise(col(col_name)))

In [9]:
dfwithNo.createOrReplaceTempView("datawrangling")

In [10]:
# Using Spark SQL to create categories 
df_wrangling = spark.sql("""
select distinct 
         customerID
        ,gender
        ,SeniorCitizen
        ,Partner
        ,Dependents
        ,tenure
        ,case when (tenure<=12) then "Tenure_0-12"
              when (tenure>12 and tenure <=24) then "Tenure_12-24"
              when (tenure>24 and tenure <=48) then "Tenure_24-48"
              when (tenure>48 and tenure <=60) then "Tenure_48-60"
              when (tenure>60) then "Tenure_gt_60"
        end as tenure_group
        ,PhoneService
        ,MultipleLines
        ,InternetService
        ,OnlineSecurity
        ,OnlineBackup
        ,DeviceProtection
        ,TechSupport
        ,StreamingTV
        ,StreamingMovies
        ,Contract
        ,PaperlessBilling
        ,PaymentMethod
        ,MonthlyCharges
        ,TotalCharges
        ,Churn
    from datawrangling
""")


In [11]:
# select on categorical Columns from dataset
categoricalColumns = ['gender','SeniorCitizen','Partner','Dependents','PhoneService','MultipleLines','InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract','PaperlessBilling','PaymentMethod']
stages = [] # stages in our Pipeline

In [12]:
for categoricalCol in categoricalColumns:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages.  These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

In [13]:
# Convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol="Churn", outputCol="label")
stages += [label_stringIdx]

**Transforming all features into a vector using VectorAssembler**

In [14]:
# Transform all features into a vector using VectorAssembler
numericCols = ['MonthlyCharges', 'TotalCharges']#'TotalRmbRCN1', 
assemblerInputs = numericCols + [c + "classVec" for c in categoricalColumns]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
IDcols = ['customerID']

**Create a pipeline to transform dataset**

In [15]:
# Create a Pipeline.
pipeline = Pipeline(stages=stages)
# Run the feature transformations.
#  - fit() computes feature statistics as needed.
#  - transform() actually transforms the features.
pipelineModel = pipeline.fit(df_wrangling)
dataset = pipelineModel.transform(df_wrangling)
# Keep relevant columns
selectedcols= ["label", "features"] + IDcols
dataset = dataset.select(selectedcols)

In [16]:
dataset.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- customerID: string (nullable = true)



In [17]:
dataset.show(5)

+-----+--------------------+----------+
|label|            features|customerID|
+-----+--------------------+----------+
|  0.0|(28,[0,1,3,6,7,10...|6497-TILVL|
|  1.0|(28,[0,1,3,5,6,7,...|0691-JVSYA|
|  0.0|(28,[0,1,3,4,5,6,...|8544-GOQSH|
|  0.0|(28,[0,1,2,3,4,5,...|5172-MIGPM|
|  0.0|(28,[0,1,2,3,5,6,...|4312-KFRXN|
+-----+--------------------+----------+
only showing top 5 rows



### Create Training and Test Set

In [18]:
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=200)
trainingData.createOrReplaceTempView("train")
print('Train Data',trainingData.count())
testData.createOrReplaceTempView("test")
print('Test Data',testData.count())

Train Data 4956
Test Data 2076


#### Decision Tree

In [19]:
# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)

# Train model with Training Data
dtModel = dt.fit(trainingData)

# Make predictions on test data
predictions = dtModel.transform(testData)

# Evaluate the model
evaluator = BinaryClassificationEvaluator()
accuracy = evaluator.evaluate(predictions)

In [20]:
print('Accuracy on test data:', accuracy)

Accuracy on test data: 0.775358211593196


In [21]:
print("numNodes = ", dtModel.numNodes)
print("depth = ", dtModel.depth)

numNodes =  7
depth =  3


** Create and Plot Confusion Matrix**

In [22]:
# View Best model's predictions and probabilities of each prediction class
selecteddt = predictions.select("label", "prediction", "probability")
selecteddt.createOrReplaceTempView("selecteddt")

In [23]:
confusion_matrixdt = spark.sql (""" 
select count(*), label, prediction
from selecteddt
group by label, prediction 
""")

confusion_matrixdt.show()

+--------+-----+----------+
|count(1)|label|prediction|
+--------+-----+----------+
|     194|  1.0|       1.0|
|      95|  0.0|       1.0|
|     337|  1.0|       0.0|
|    1450|  0.0|       0.0|
+--------+-----+----------+



### Random Forest

In [24]:
# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Train model with Training Data
rfModel = rf.fit(trainingData)
rfModel.featureImportances

# Make predictions on test data using the Transformer.transform() method.
predictions = rfModel.transform(testData)

# Evaluate the model
evaluator = BinaryClassificationEvaluator()
accuracy = evaluator.evaluate(predictions)

In [25]:
# Accuracy of the model
print('Accuracy on test data:', accuracy)

Accuracy on test data: 0.8507408016869923


In [26]:
# View Best model's predictions and probabilities of each prediction class
selectedrf = predictions.select("label", "prediction", "probability")
selectedrf.createOrReplaceTempView("selectedrf")

** Plot and Create Confusion Matrix**

In [27]:
confusion_matrixrf = spark.sql (""" 
select count(*), label, prediction
from selectedrf
group by label, prediction 
""")

confusion_matrixdt.show()

+--------+-----+----------+
|count(1)|label|prediction|
+--------+-----+----------+
|     194|  1.0|       1.0|
|      95|  0.0|       1.0|
|     337|  1.0|       0.0|
|    1450|  0.0|       0.0|
+--------+-----+----------+



### Gradient-boosted tree classifier

In [28]:
# Train a GBT model.
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)

In [29]:
# Train model. 
model = gbt.fit(trainingData)

In [30]:
# Make predictions.
predictions = model.transform(testData)

In [31]:
# Select (prediction, true label) and compute test error
evaluator = BinaryClassificationEvaluator()
accuracy = evaluator.evaluate(predictions)
print("Accuracy on test data: %g" % (accuracy))

Accuracy on test data: 0.856838


In [32]:
# View Best model's predictions and probabilities of each prediction class
selectedgb = predictions.select("label", "prediction", "probability")
selectedgb.createOrReplaceTempView("selectedgb")

** Plot and Create Confusion Matrix**

In [33]:
confusion_matrixgb = spark.sql (""" 
select count(*), label, prediction
from selectedrf
group by label, prediction 
""")

confusion_matrixgb.show()

+--------+-----+----------+
|count(1)|label|prediction|
+--------+-----+----------+
|     229|  1.0|       1.0|
|     112|  0.0|       1.0|
|     302|  1.0|       0.0|
|    1433|  0.0|       0.0|
+--------+-----+----------+

