# Create a Decision Tree model for Churn Analysis
Spark comes with multiple popular machine learning algorithms. In this tutorial, we are looking at decision trees to determine the churn of telco customers. Decision trees are easier to understand in concept than many other machine learning algorithms since many people have been exposed to decisions such as: if this, then that, else another thing. We can easily understand how decisions are made this way through divide and conquer.

Decision trees do more than evaluate an attribute value and decide what to do next. The algorithm looks at the input data and determines how significant each attribute is and how well it separates the records into groups. Once this analysis is done, it can decide which attribute and which value range lead to a decision.
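
To make this concrete, here is a minimal sketch of how a single split could be chosen, assuming Gini impurity as the criterion (the default `impurity` setting of Spark's `DecisionTreeClassifier`). The helper names and the toy records are hypothetical and are not taken from the tutorial's dataset. Spark's own implementation is considerably more elaborate.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for threshold in sorted(set(values))[:-1]:  # the largest value cannot split anything
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy data: number of dropped calls and whether the customer churned
dropped = [0, 0, 1, 3, 4, 5]
churned = ["F", "F", "F", "T", "T", "T"]

threshold, impurity = best_threshold(dropped, churned)
print(threshold, impurity)
```

A tree builder would repeat this search over every attribute, keep the split with the lowest weighted impurity, and then recurse into each side.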

We start by downloading the data file, getting a Spark session, and reading the data into a DataFrame. The call to `data_df.show(3)` forces the instantiation of the data and prints a formatted view of the first three rows. Other methods could have been used, as seen in lab 1, such as the `take` method.

In [None]:
import urllib.request

url = 'https://github.com/jacquesroy/byte-size-data-science/raw/master/data/customer_churn.csv'

filename = url.rsplit('/', 1)[-1]
urllib.request.urlretrieve(url, filename)

In [None]:
from pyspark.sql import SparkSession
from pyspark import SparkFiles

spark = SparkSession.builder.getOrCreate()
# Read the downloaded CSV file: use the first row as column names and infer the data types
data_df = spark.read.csv(filename, header=True, inferSchema=True)
data_df.show(3)


In [None]:
# Take a look at the schema. See that data types were inferred
data_df.printSchema()

## Prepare the data before creating the model
Some columns have discrete string values: Gender, Status, Car Owner, and so on. <br/>
We use a __`StringIndexer`__ to convert the values to numbers.
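
As a rough illustration of what a `StringIndexer` does, the sketch below mimics its default ordering (`frequencyDesc`), where the most frequent value receives index 0. This is plain Python with a hypothetical helper and made-up values, not Spark code, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption of this sketch)
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(index) for index, value in enumerate(ordered)}

# Hypothetical column of payment methods
paymethod = ["CC", "CH", "CC", "Auto", "CC", "CH"]

mapping = fit_string_index(paymethod)
indexed = [mapping[value] for value in paymethod]
print(mapping)
print(indexed)
```

The mapping is learned once from the data (the "fit" step) and then applied to every row (the "transform" step), which is why indexers behave like small models of their own.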

We also combine the 16 input columns (every column except `CHURN`, which becomes the label) into a single vector so all "features" are in one column.
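
Conceptually, assembling a feature vector just means concatenating the chosen numeric columns of each row in a fixed order. The sketch below shows the idea in plain Python; the `assemble` helper and the sample record are hypothetical, and only a subset of the columns is used for brevity.

```python
def assemble(row, input_cols):
    """Concatenate the selected numeric columns of one row into a single feature list."""
    return [float(row[col]) for col in input_cols]

# Hypothetical record after the string columns have been indexed
row = {"IXGender": 0.0, "Children": 1, "Est Income": 38000.0, "Age": 24.39}
input_cols = ["IXGender", "Children", "Est Income", "Age"]

features = assemble(row, input_cols)
print(features)
```

The order of `input_cols` matters: the model only sees positions in the vector, so the same order has to be used for training and for scoring.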

### Create the indexers
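Each indexer converts the discrete values of one column to numeric index values. The `CHURN` indexer is fitted right away so that its labels are available later to the `IndexToString` stage.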

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer, IndexToString
from pyspark.ml import Pipeline, Model
from pyspark.ml.classification import DecisionTreeClassifier

churn_indexer = StringIndexer(inputCol="CHURN", outputCol="label").fit(data_df)
gender_indexer = StringIndexer(inputCol="Gender", outputCol="IXGender")
status_indexer = StringIndexer(inputCol="Status", outputCol="IXStatus")
car_indexer = StringIndexer(inputCol="Car Owner", outputCol="IXCarOwner")
pay_indexer = StringIndexer(inputCol="Paymethod", outputCol="IXPaymethod")
localbill_indexer = StringIndexer(inputCol="LocalBilltype", outputCol="IXLocalBilltype")
long_indexer = StringIndexer(inputCol="LongDistanceBilltype", outputCol="IXLongDistanceBilltype")

### Create the conversion of columns to vector
Note the following statement:<br/>
`dt = DecisionTreeClassifier(maxDepth=4, labelCol="label")`

In this statement we limit the depth of the tree to 4. This is an arbitrary value that could be changed. It limits the granularity of the decisions and can help avoid what is called **overfitting**, where a model fits the training data so closely that it performs poorly on new data. This is an important concept that you may want to investigate.
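
One way to see why depth controls granularity: a binary tree of depth d has at most 2^d leaves, so every extra level can double the number of groups the tree carves out of the training data. The helper below is a hypothetical illustration of that arithmetic, not a Spark API.

```python
def max_tree_size(max_depth):
    """Upper bounds for a binary tree of the given depth: (decision nodes, leaves)."""
    leaves = 2 ** max_depth
    internal = leaves - 1
    return internal, leaves

for depth in (2, 4, 8, 16):
    internal, leaves = max_tree_size(depth)
    print(f"maxDepth={depth:2d}: up to {internal} decision nodes and {leaves} leaves")
```

With enough leaves, a tree can end up isolating individual training records, which is the kind of memorization that overfitting refers to.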

In [None]:
vectorAssembler_features = VectorAssembler(inputCols=["ID", "IXGender", "IXStatus", "Children", "Est Income", "IXCarOwner", "Age", 
               "LongDistance", "International", "Local", "Dropped", "IXPaymethod", "IXLocalBilltype", 
               "IXLongDistanceBilltype", "Usage", "RatePlan"],
    outputCol="features")

dt = DecisionTreeClassifier(maxDepth=4, labelCol="label")
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=churn_indexer.labels)

### Create the pipeline that converts the data
A pipeline chains the steps defined earlier into a single series of processing stages. We then fit the pipeline to data to create a model.
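
The fit-then-transform idea behind a pipeline can be sketched in a few lines of plain Python. The `Scale` and `Threshold` stages and the `fit_pipeline` function below are hypothetical stand-ins invented for this sketch; they only mirror the general pattern and are not Spark classes.

```python
class Scale:
    """Stage that learns the maximum during fit, then divides values by it."""
    def fit(self, rows):
        peak = max(rows)
        return lambda data: [value / peak for value in data]

class Threshold:
    """Stage that learns the mean during fit, then flags values above it."""
    def fit(self, rows):
        mean = sum(rows) / len(rows)
        return lambda data: [1 if value > mean else 0 for value in data]

def fit_pipeline(stages, rows):
    """Fit each stage in order, feeding it the output of the stages before it."""
    fitted = []
    for stage in stages:
        transform = stage.fit(rows)
        fitted.append(transform)
        rows = transform(rows)  # later stages see the transformed data

    def apply(data):
        for transform in fitted:
            data = transform(data)
        return data
    return apply

model_sketch = fit_pipeline([Scale(), Threshold()], [10.0, 20.0, 40.0])
print(model_sketch([5.0, 40.0]))
```

The value returned by `fit_pipeline` plays the role of the fitted model: it remembers what every stage learned and replays the same steps on new data.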

In [None]:
pipeline_rf = Pipeline(stages=[churn_indexer, gender_indexer, status_indexer, car_indexer, pay_indexer, 
                               localbill_indexer, long_indexer, vectorAssembler_features, dt, labelConverter])

### Create the model
Note that we split the input data into training data to create the model and testing data to evaluate its accuracy. In many cases, the data is split into three groups, with a validation group that can be used to see if the model is degrading over time.
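
A three-way split can be sketched as follows. This is a plain-Python illustration with a hypothetical `random_split` helper and stand-in records. Like Spark's `randomSplit`, it assigns rows at random, so the partition sizes only approximate the requested proportions.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one partition with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, for example [0.70, 0.85, 1.0]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # First partition whose bound exceeds the draw; the last one absorbs rounding error
        index = next((i for i, bound in enumerate(bounds) if draw < bound), len(bounds) - 1)
        parts[index].append(row)
    return parts

rows = list(range(1000))  # stand-in for 1000 customer records
train, validation, test = random_split(rows, [0.70, 0.15, 0.15], seed=123)
print(len(train), len(validation), len(test))
```

In the notebook itself, the equivalent would be to pass three weights instead of two, for example `data_df.randomSplit([0.70, 0.15, 0.15], 123)`.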

In [None]:
# Randomly split the records: about 80% go to training_df and 20% to testing_df (123 is the seed)
(training_df, testing_df) = data_df.randomSplit([0.80, 0.20], 123)
model = pipeline_rf.fit(training_df)

### Test the model accuracy
The model fits the training data. We can test the accuracy of the model on data that was not part of the model creation.
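
Note that the evaluator in the next cell is configured with `metricName="areaUnderROC"`, which is a different measure from plain accuracy (the fraction of correct predictions). The sketch below computes both on a small hypothetical test set to show how they can differ. The function names and the data are invented for this illustration and do not use Spark.

```python
def accuracy(labels, predictions):
    """Fraction of records whose predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def area_under_roc(labels, scores):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical test set: 1 = churned, 0 = stayed
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]  # predicted churn probability
hard_predictions = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(labels, hard_predictions)
auc_scores = area_under_roc(labels, scores)
auc_hard = area_under_roc(labels, hard_predictions)
print("accuracy          =", acc)
print("AUC (scores)      =", auc_scores)
print("AUC (hard labels) =", auc_hard)
```

Passing the hard 0/1 `prediction` column as the raw prediction, as the cell below does, corresponds to the last of the three numbers.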

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

predictions = model.transform(testing_df)
# areaUnderROC is the area under the ROC curve, not the fraction of correct predictions.
# It is computed here from the hard 0/1 values in the "prediction" column.
evaluatorRF = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")
auc = evaluatorRF.evaluate(predictions)
print("Area under ROC = %g" % auc)

# Saving the model
We can save the model to the local filesystem. The model is saved as a directory structure.
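
The `%ls -R` magic used to inspect the saved directory relies on a Unix-like shell. As a portable alternative, the sketch below walks a directory tree with the standard library. The temporary directory and the file names are placeholders created only so the sketch runs on its own; in the notebook you would pass `'DecisionTreeChurnModel'` to the hypothetical `list_tree` helper instead.

```python
import tempfile
from pathlib import Path

def list_tree(root):
    """Return the relative paths of all files under root, sorted."""
    root = Path(root)
    return sorted(path.relative_to(root).as_posix()
                  for path in root.rglob("*") if path.is_file())

# Build a small placeholder tree so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "metadata").mkdir()
    (Path(tmp) / "stages").mkdir()
    (Path(tmp) / "metadata" / "part-00000").write_text("{}")
    (Path(tmp) / "stages" / "placeholder.txt").write_text("")
    files = list_tree(tmp)

print(files)
```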


In [None]:
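# Overwrite any copy left by a previous run; a plain save() fails if the directory already exists
model.write().overwrite().save('DecisionTreeChurnModel')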
%ls -R DecisionTreeChurnModel

In [None]:
from pyspark.ml import PipelineModel

model2 = PipelineModel.load('DecisionTreeChurnModel')

### Try it to see that we get the same results

In [None]:
predictions = model2.transform(testing_df)
# Same evaluation as before: area under the ROC curve computed from the "prediction" column
evaluatorRF = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")
auc = evaluatorRF.evaluate(predictions)
print("Area under ROC = %g" % auc)

# Use Repository Services to Save and Load Model
A model can be created in one Notebook or through the Watson Studio "Model" creation, and reused in another notebook. It is even possible to publish it and use it through a REST API.

For this, we need to save the model to the repository and create a deployment.

See the documentation at: `http://watson-ml-api.mybluemix.net/`


In [None]:
!pip install watson-machine-learning-client

In [None]:
# Replace the placeholders with the credentials of your own Watson Machine Learning service.
# Real credentials should never be published in a shared notebook.
wml_credentials = {
  "apikey": "<your-apikey>",
  "iam_apikey_description": "<your-iam-apikey-description>",
  "iam_apikey_name": "<your-iam-apikey-name>",
  "iam_role_crn": "crn:v1:bluemix:public:iam::::serviceRole:Writer",
  "iam_serviceid_crn": "<your-iam-serviceid-crn>",
  "instance_id": "<your-instance-id>",
  "password": "<your-password>",
  "url": "https://us-south.ml.cloud.ibm.com",
  "username": "<your-username>"
}


In [None]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

client = WatsonMachineLearningAPIClient(wml_credentials)
print(client.service_instance.get_url())


In [None]:
# List models already in the repository
client.repository.list_models()

In [None]:
saved_model = client.repository.store_model(model=model, meta_props={'name':'Telco Churn Prediction Model'}, 
                                            training_data=training_df, pipeline=pipeline_rf)

In [None]:
# List models in the repository
client.repository.list_models()

In [None]:
# Get the model UID
model_uid = client.repository.get_model_uid(saved_model)

## Publish the model

In [None]:
deployment_details = client.deployments.create(model_uid, "Deployment of Telco churn model", deployment_type='online')


## Accessing a Saved Model
We can get a list of all the deployed models using `client.deployments.list()`. It is then possible to iterate over them and find the name we are looking for.


In [None]:
client.deployments.list()

In [None]:
allDeployments = client.deployments.get_details()['resources']
modelUid = ""
deploymentDetail = ""
# Use a loop variable other than `model` so the fitted pipeline model is not overwritten
for deployment in allDeployments:
    asset = deployment['entity']['deployable_asset']
    print("Model name: " + asset['name'])
    print("Model uid : " + asset['guid'])
    if asset['name'] == "Telco Churn Prediction Model":
        modelUid = asset['guid']
        deploymentDetail = deployment
        print("Deployed model: " + deployment['entity']['name'])

print("\nmodelUid: " + modelUid)

### Getting the model artifact
If we already have the deployment details, we can get the scoring URL using the `get_scoring_url` method as shown below.<br/>
With the model uid, we can also retrieve the model details, which include the input attributes the model expects.

In [None]:
scoring_url = client.deployments.get_scoring_url(deploymentDetail)
print(scoring_url)

In [None]:
# Show the attributes used in the model
modelDetail = client.repository.get_model_details(model_uid)
# print(modelDetail)
vals=[]
for attr in modelDetail['entity']['input_data_schema']['fields'] :
    vals.append(attr['name'])
print(*vals, sep=', ')

In [None]:
# Execute the model
scoring_payload = {'fields': ['ID','Gender','Status','Children','Est Income','Car Owner',
                              'Age','LongDistance','International','Local','Dropped',
                              'Paymethod','LocalBilltype','LongDistanceBilltype',
                              'Usage','RatePlan'], 
                   'values': [[1,'F','S',1.0,38000.0,'N',24.393333,23.56,0.0,206.08,0.0,'CC','Budget','Intnl_discount',229.64,3.0],                      
                              [6,'M','M',2.0,29616.0,'N',49.426667,29.78,0.0,45.5,0.0,  'CH','FreeLocal','Standard',75.29,2.0]
                             ]}
predictions = client.deployments.score(scoring_url, scoring_payload)

In [None]:
print(predictions)

In [None]:
for prediction in predictions['values'] :
    print("ID: " + str(prediction[0]) + ", probability: [" + 
          str(prediction[26][0]) + ',' +  str(prediction[26][1]) + 
          "], prediction: " + str(prediction[27]) + ", predicted label: " + str(prediction[28])
         )
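
Positional indexes such as `prediction[26]` depend on the exact column order of the response. Pairing each row with the `fields` list makes the lookup explicit, as in the sketch below. The `rows_as_dicts` helper is hypothetical, and the sample response is invented and abbreviated, assuming the same `fields`/`values` layout as the scoring payload.

```python
def rows_as_dicts(response):
    """Pair every row in response['values'] with the names in response['fields']."""
    return [dict(zip(response["fields"], row)) for row in response["values"]]

# Hypothetical, abbreviated response; a real one carries many more columns
sample_response = {
    "fields": ["ID", "probability", "prediction", "predictedLabel"],
    "values": [
        [1, [0.2, 0.8], 1.0, "T"],
        [6, [0.9, 0.1], 0.0, "F"],
    ],
}

records = rows_as_dicts(sample_response)
for record in records:
    print(f"ID: {record['ID']}, probability: {record['probability']}, "
          f"prediction: {record['prediction']}, predicted label: {record['predictedLabel']}")
```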

## Removing a saved model
We can remove a deployment or a model from the repository using the __`delete`__ method.<br/>
In the example below, since we only had one model in the repository, listing the models shows nothing once the model has been removed.

In [None]:
# Remove the deployment
deployment_uid = client.deployments.get_uid(deploymentDetail)
client.deployments.delete(deployment_uid)

In [None]:
client.deployments.list()

In [None]:
client.repository.list()

In [None]:
# Remove the 'Telco Churn Prediction Model' model(s)
for mldef in client.repository.get_details()['models']['resources'] :
    if (mldef['entity']['name'] == 'Telco Churn Prediction Model') :
        ml_uid = client.repository.get_model_uid(mldef)
        client.repository.delete(ml_uid)

In [None]:
# Remove the 'Telco Churn Prediction Model' definition(s)
for mldef in client.repository.get_details()['definitions']['resources'] :
    if (mldef['entity']['name'] == 'Telco Churn Prediction Model') :
        ml_uid = client.repository.get_definition_uid(mldef)
        client.repository.delete(ml_uid)