# Create a Decision Tree model for Churn Analysis
Spark comes with multiple popular machine learning algorithms. In this tutorial, we use a decision tree to predict churn among telco customers. Decision trees are conceptually easier to understand than many other machine learning algorithms because most people are familiar with decisions of the form: if this, then that, else another thing. A tree reaches its decision through divide and conquer: each test splits the records into smaller groups.

A decision tree does more than evaluate an attribute value and decide what to do next. The training algorithm looks at the input data and determines how significant each attribute is and how well it separates the records into groups. Once this analysis is done, it can decide which attribute and which value range lead to a decision.
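
To make the idea of "how significant each attribute is" concrete, here is a minimal sketch in plain Python (no Spark) that scores two candidate splits with Gini impurity, one common criterion for decision trees. The toy records, the thresholds, and the helper names `gini` and `split_impurity` are assumptions made for illustration; they are not taken from the dataset or from Spark's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 binary mix."""
    total = float(len(labels))
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(rows, attribute, threshold):
    """Weighted impurity of the two groups produced by attribute <= threshold."""
    left = [r["churn"] for r in rows if r[attribute] <= threshold]
    right = [r["churn"] for r in rows if r[attribute] > threshold]
    total = float(len(rows))
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)

# Hypothetical toy records, invented for this illustration
rows = [
    {"age": 24, "usage": 230, "churn": "T"},
    {"age": 27, "usage": 60, "churn": "T"},
    {"age": 49, "usage": 75, "churn": "F"},
    {"age": 51, "usage": 220, "churn": "F"},
]

age_split = split_impurity(rows, "age", 30)
usage_split = split_impurity(rows, "usage", 100)
print("age <= 30: {}, usage <= 100: {}".format(age_split, usage_split))
```

In this toy data, splitting on age separates churners from non-churners perfectly (impurity 0.0), while splitting on usage leaves both groups evenly mixed (impurity 0.5), so a tree builder using this criterion would prefer the age split.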

We start by getting a Spark session and reading the data into a DataFrame. The `data_df.show(3)` call forces Spark to read the data and prints a formatted view of the first three rows. Other methods, such as `take`, could have been used, as seen in lab 1.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Add asset from file system
data_df = spark.read.csv('../datasets/customer_churn.csv', header=True, inferSchema=True)
data_df.show(3)

+---+-----+------+------+--------+----------+---------+---------+------------+-------------+------+-------+---------+-------------+--------------------+------+--------+
| ID|CHURN|Gender|Status|Children|Est Income|Car Owner|      Age|LongDistance|International| Local|Dropped|Paymethod|LocalBilltype|LongDistanceBilltype| Usage|RatePlan|
+---+-----+------+------+--------+----------+---------+---------+------------+-------------+------+-------+---------+-------------+--------------------+------+--------+
|  1|    T|     F|     S|     1.0|   38000.0|        N|24.393333|       23.56|          0.0|206.08|    0.0|       CC|       Budget|      Intnl_discount|229.64|     3.0|
|  6|    F|     M|     M|     2.0|   29616.0|        N|49.426667|       29.78|          0.0|  45.5|    0.0|       CH|    FreeLocal|            Standard| 75.29|     2.0|
|  8|    F|     M|     M|     0.0|   19732.8|        N|50.673333|       24.81|          0.0| 22.44|    0.0|       CC|    FreeLocal|            Standard| 47

In [2]:
# Take a look at the schema. See that data types were inferred
data_df.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- CHURN: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Children: double (nullable = true)
 |-- Est Income: double (nullable = true)
 |-- Car Owner: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- LongDistance: double (nullable = true)
 |-- International: double (nullable = true)
 |-- Local: double (nullable = true)
 |-- Dropped: double (nullable = true)
 |-- Paymethod: string (nullable = true)
 |-- LocalBilltype: string (nullable = true)
 |-- LongDistanceBilltype: string (nullable = true)
 |-- Usage: double (nullable = true)
 |-- RatePlan: double (nullable = true)



## Prepare the data before creating the model
Some columns have discrete string values: Gender, Status, Car Owner, and so on. <br/>
We use a __`StringIndexer`__ to convert the values to numbers.

We also assemble the 16 feature columns (every column except the `CHURN` label) into a single vector so that all "features" are in one column.

### Create the indexers
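Convert the discrete string values to index values. The `CHURN` indexer is fitted immediately because its `labels` are needed later to convert predictions back to strings.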

In [3]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer, IndexToString
from pyspark.ml import Pipeline, Model
from pyspark.ml.classification import DecisionTreeClassifier

churn_indexer = StringIndexer(inputCol="CHURN", outputCol="label").fit(data_df)
gender_indexer = StringIndexer(inputCol="Gender", outputCol="IXGender")
status_indexer = StringIndexer(inputCol="Status", outputCol="IXStatus")
car_indexer = StringIndexer(inputCol="Car Owner", outputCol="IXCarOwner")
pay_indexer = StringIndexer(inputCol="Paymethod", outputCol="IXPaymethod")
localbill_indexer = StringIndexer(inputCol="LocalBilltype", outputCol="IXLocalBilltype")
long_indexer = StringIndexer(inputCol="LongDistanceBilltype", outputCol="IXLongDistanceBilltype")
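
As a rough illustration of what an indexer learns, the following plain-Python sketch assigns index 0.0 to the most frequent value, 1.0 to the next, and so on. This mirrors the frequency-descending ordering that `StringIndexer` uses by default, but the helper `fit_string_index`, the alphabetical tie-breaking, and the sample values are assumptions for illustration, not Spark code.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (an assumption)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical payment-method values, invented for this illustration
paymethods = ["CC", "CH", "CC", "Auto", "CC", "CH"]

mapping = fit_string_index(paymethods)
indexed = [mapping[v] for v in paymethods]
print(mapping)
print(indexed)
```

The learned mapping is what "fitting" an indexer produces; applying it to a column is the "transform" step.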

### Create the conversion of columns to vector
Note the following statement:<br/>
`dt = DecisionTreeClassifier(maxDepth=4, labelCol="label")`

In this statement we limit the depth of the tree to 4. This is an arbitrary value that could be changed. It limits the granularity of the decision and can help avoid what is called **overfitting**, where the tree fits the training data so closely that it performs poorly on new data. This is an important concept that you may want to investigate.
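
One way to see why depth controls granularity: a binary tree of depth `d` can have at most `2 ** d` leaves, and each leaf is one distinct decision rule. The small sketch below only computes that upper bound; the helper name `max_leaves` is hypothetical.

```python
def max_leaves(max_depth):
    """Upper bound on the number of leaves in a binary tree of this depth."""
    return 2 ** max_depth

for depth in (1, 4, 10):
    print("maxDepth={} allows at most {} leaves".format(depth, max_leaves(depth)))
```

With a depth of 4 the tree can distinguish at most 16 groups of customers, whereas a depth of 10 allows up to 1024, which makes it much easier to memorize individual training records.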

In [4]:
vectorAssembler_features = VectorAssembler(inputCols=["ID", "IXGender", "IXStatus", "Children", "Est Income", "IXCarOwner", "Age", 
               "LongDistance", "International", "Local", "Dropped", "IXPaymethod", "IXLocalBilltype", 
               "IXLongDistanceBilltype", "Usage", "RatePlan"],
    outputCol="features")

dt = DecisionTreeClassifier(maxDepth=4, labelCol="label")
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=churn_indexer.labels)
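
Conceptually, the assembler gathers the selected columns of each row into one numeric vector, and the label converter maps a predicted index back to its original string. The sketch below shows both ideas in plain Python; the functions `assemble` and `index_to_string`, the reduced column list, and the label order are assumptions for illustration only.

```python
def assemble(row, columns):
    """Collect the chosen columns of one row into a single feature list."""
    return [float(row[c]) for c in columns]

def index_to_string(prediction, labels):
    """Translate a predicted index back to the original label string."""
    return labels[int(prediction)]

# Hypothetical row and label order, invented for this illustration
feature_columns = ["Children", "Est Income", "IXGender"]
row = {"Children": 1.0, "Est Income": 38000.0, "IXGender": 0.0}
labels = ["F", "T"]

features = assemble(row, feature_columns)
predicted_label = index_to_string(1.0, labels)
print(features, predicted_label)
```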

### Create the pipeline that converts the data
A pipeline chains the steps defined earlier into a single series of processing stages. We then fit the pipeline to the data to create a model.

In [5]:
pipeline_rf = Pipeline(stages=[churn_indexer, gender_indexer, status_indexer, car_indexer, pay_indexer, 
                               localbill_indexer, long_indexer, vectorAssembler_features, dt, labelConverter])
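
The fit-then-transform chaining that a pipeline performs can be sketched in a few lines of plain Python. The classes `AddIndex` and `MiniPipeline` below are hypothetical stand-ins written for this illustration; they are not part of Spark and leave out everything a real pipeline does beyond passing data from stage to stage.

```python
class AddIndex:
    """Hypothetical stage: learns a string-to-index mapping for one column."""
    def __init__(self, column, output):
        self.column, self.output = column, output

    def fit(self, rows):
        distinct = sorted({r[self.column] for r in rows})
        self.mapping = {value: i for i, value in enumerate(distinct)}
        return self

    def transform(self, rows):
        return [dict(r, **{self.output: self.mapping[r[self.column]]}) for r in rows]

class MiniPipeline:
    """Runs stages in order; each stage sees the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

# Hypothetical training and scoring rows
training_rows = [{"Gender": "F", "Status": "S"}, {"Gender": "M", "Status": "M"}]
pipeline = MiniPipeline([AddIndex("Gender", "IXGender"), AddIndex("Status", "IXStatus")])
fitted = pipeline.fit(training_rows)
scored = fitted.transform([{"Gender": "M", "Status": "S"}])
print(scored)
```

The point of the design is that the same fitted stages are applied, in the same order, to both the training data and any new data.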

### Create the model
Note that we split the input data into training data, used to create the model, and testing data, used to evaluate its accuracy. In many cases, the data is split into three groups, with a validation group that can be used to see if the model is degrading over time.

In [6]:
# Randomly split the records: roughly 80% into training_df and 20% into testing_df.
# The seed (123) makes the split reproducible.
training_df, testing_df = data_df.randomSplit([0.80, 0.20], seed=123)
model = pipeline_rf.fit(training_df)
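
The effect of the weights and the seed can be illustrated without Spark. The sketch below assigns each record to the training or testing group with a seeded random draw; the function `random_split` is a hypothetical simplification, and Spark's distributed sampling differs in its details.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to train or test at random, in proportion to the weights."""
    rng = random.Random(seed)
    train_share = weights[0] / float(sum(weights))
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_share else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows, [0.80, 0.20], 123)
train_again, test_again = random_split(rows, [0.80, 0.20], 123)
print(len(train), len(test))
```

Because each record is drawn independently, the group sizes are only approximately 80% and 20%. Reusing the same seed reproduces exactly the same split, which keeps results comparable between runs.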

### Test the model accuracy
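The model was fitted to the training data. We can now test it on data that was not part of the model creation. Note that the evaluator below is configured with `metricName="areaUnderROC"`, so the value printed as "Accuracy" is the area under the ROC curve, not the fraction of correct predictions.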

In [7]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

predictions = model.transform(testing_df)
evaluatorRF = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")
accuracy = evaluatorRF.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))

Accuracy = 0.85901
Test Error = 0.14099
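
The evaluator above reports area under the ROC curve, which is not the same quantity as the fraction of correct predictions. The plain-Python sketch below contrasts the two on a small invented example. It assumes the ROC curve is built from hard 0/1 predictions, in which case the curve has a single interior point and its area is the mean of the true positive rate and the true negative rate. The labels, predictions, and function names are hypothetical.

```python
def confusion_counts(labels, predictions):
    """Return (tp, fp, tn, fn) for binary 0/1 labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for l, p in pairs if l == 1 and p == 1)
    fp = sum(1 for l, p in pairs if l == 0 and p == 1)
    tn = sum(1 for l, p in pairs if l == 0 and p == 0)
    fn = sum(1 for l, p in pairs if l == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    tp, fp, tn, fn = confusion_counts(labels, predictions)
    return (tp + tn) / float(len(labels))

def auc_from_hard_predictions(labels, predictions):
    """Area under a ROC curve with one operating point: (TPR + TNR) / 2."""
    tp, fp, tn, fn = confusion_counts(labels, predictions)
    tpr = tp / float(tp + fn)
    tnr = tn / float(tn + fp)
    return (tpr + tnr) / 2.0

# Hypothetical imbalanced example: 2 churners among 10 customers
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy(labels, predictions)
auc = auc_from_hard_predictions(labels, predictions)
print("accuracy = {}, AUC = {}".format(acc, auc))
```

On this example the accuracy is 0.8 while the area under the curve is 0.6875, so the two numbers should not be read interchangeably, especially when churners are a minority of the records.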


# Save and Load Model
A model can be created in one notebook, or through the Watson Studio "Model" creation, and reused in another notebook. It is even possible to publish it and use it through a REST API.

In [8]:
from dsx_ml.ml import save

Using TensorFlow backend.


## Example of saving a model
This example assumes the model created above.

In [9]:
model_name = "Telco Churn Prediction Model 02"
saved_model_output = save(name = model_name, model = model, algorithm_type = 'Classification', test_data = testing_df)

In [10]:
import json
import os

metadata_path = '/user-home/{}/DSX_Projects/{}/models/{}/metadata.json'.format(
    os.environ['DSX_USER_ID'], os.environ['DSX_PROJECT_NAME'], model_name)
with open(metadata_path) as infile:
    metadata_dict = json.load(infile)

In [11]:
print("Model Type: {}".format(metadata_dict['algorithm']))

print("Feature(s):")
for feature in metadata_dict['features']:
    print('    '+feature['name'])

print("Latest Model Version: {}".format(metadata_dict['latestModelVersion']))
print("Label(s):")
for label in metadata_dict['labelColumns']:
    print('    '+label['name'])

Model Type: PipelineModel
Feature(s):
    Status
    CHURN
    Paymethod
    Gender
    Age
    RatePlan
    Car Owner
    Children
    Usage
    LongDistance
    Dropped
    LongDistanceBilltype
    International
    Est Income
    Local
    ID
    LocalBilltype
Latest Model Version: 3
Label(s):
    CHURN


## Score the model
In Watson Studio, a test API endpoint for scoring is created upon saving the model.

Now, you can send (POST) new scoring records (new data) for which you would like to get predictions. To do that, execute the following sample code:

In [12]:
import os
import requests

header_online = {'Content-Type': 'application/json', 'Authorization': os.environ['DSX_TOKEN']}

print(saved_model_output['scoring_endpoint'])

https://dsxl-api/v3/project/score/Python27/spark-2.0/Project%201/Telco%20Churn%20Prediction%20Model%2002/3


In [13]:
payload_scoring = [{"ID":23, "Gender":"M", "Status":"S", "Children":1, "Est Income":50000, "Car Owner":"N", "Age":30, "LongDistance":23.45, "International":0, "Local":200, "Dropped":0, "Paymethod":"CC", "LocalBilltype":"Budget", "LongDistanceBilltype":"Standard", "Usage":200, "RatePlan":2}]
response_scoring = requests.post(saved_model_output['scoring_endpoint'], json=payload_scoring, headers=header_online)

# Parse the JSON body of the response into a dictionary
response_dict = response_scoring.json()



In [14]:
# Number the predictions starting at 1
for n, response in enumerate(response_dict['object']['output']['predictions'], start=1):
    print("{}. {}".format(n, response))

1. F
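
The loop above assumes the response always contains the `object`, `output`, and `predictions` keys. A slightly more defensive way to read the response is sketched below. The helper `extract_predictions` and the sample dictionaries are hypothetical; only the key path is taken from the code above, and the shape of an error response is an assumption.

```python
def extract_predictions(response_dict):
    """Return the predictions list, or an empty list if the expected keys are missing."""
    try:
        return list(response_dict['object']['output']['predictions'])
    except (KeyError, TypeError):
        return []

# Hypothetical responses, shaped after the key path used above
sample_response = {"object": {"output": {"predictions": ["F", "T"]}}}
error_response = {"message": "scoring failed"}

for n, prediction in enumerate(extract_predictions(sample_response), start=1):
    print("{}. {}".format(n, prediction))

print(extract_predictions(error_response))
```

Returning an empty list keeps the printing loop simple. In real use you would probably also check `response_scoring.status_code` before parsing the body.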
