**Check the Python version. This notebook was written for Python 3.5.x (the run shown below used 3.6.8). Not all cells may work in other versions of Python.**

In [8]:
import platform
print(platform.python_version())

3.6.8
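A version guard can make the requirement explicit instead of relying on reading the printed string. The following is a minimal sketch, not part of the original notebook; `check_python` and its `expected` argument are hypothetical names chosen for illustration.

```python
import sys

def check_python(expected=(3, 5), actual=None):
    """Return True when the running major.minor version equals `expected`."""
    # `actual` exists only so the check can be exercised with a fixed version.
    actual = tuple(sys.version_info[:2]) if actual is None else tuple(actual[:2])
    return actual == tuple(expected)

if not check_python():
    # Warn rather than fail: other versions may still run most cells.
    print("Warning: written for Python 3.5.x; running {}.{}".format(*sys.version_info[:2]))
```

A warning is used instead of an exception because the notebook only states that some cells may not work on other versions.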


### Predicting Customer Churn

### Step 1: Load data 

#### 1.1: Download the data files

In [None]:
#Run once to install the wget package
!pip install wget

In [9]:
# download data from GitHub repository
import wget
url_churn='https://raw.githubusercontent.com/SidneyPhoon/Data/master/churn.csv'

url_customer='https://raw.githubusercontent.com/SidneyPhoon/Data/master/customer-profile.csv'

#remove existing files before downloading
!rm -f churn.csv
!rm -f customer-profile.csv

churnFilename=wget.download(url_churn)
customerFilename=wget.download(url_customer)

#list existing files
!ls -l churn.csv
!ls -l customer-profile.csv

-rw-r--r-- 1 spark spark 8546 Jul 22 20:11 churn.csv
-rw-r--r-- 1 spark spark 77821 Jul 22 20:11 customer-profile.csv


#### 1.2 Read data into Spark DataFrames
Note: Refer to the Spark DataFrame API documentation to learn more about the supported operations: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html

In [10]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

customer_churn= spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option("inferSchema", "true")\
  .load("churn.csv")

customer = spark.read\
    .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("customer-profile.csv")

In [11]:
customer.take(5)

[Row(ID=11, Gender='M', Status='S', Children=2.0, Est Income=96.33, Car Owner='N', Age=56.473333, AvgMonthlySpend=32.88, CustomerSupportCalls=1.0, Paymethod='CC', MembershipPlan=1.0),
 Row(ID=14, Gender='F', Status='M', Children=2.0, Est Income=52004.8, Car Owner='N', Age=25.14, AvgMonthlySpend=23.11, CustomerSupportCalls=0.0, Paymethod='CH', MembershipPlan=1.0),
 Row(ID=22, Gender='M', Status='S', Children=1.0, Est Income=57626.9, Car Owner='Y', Age=43.906667, AvgMonthlySpend=38.96, CustomerSupportCalls=0.0, Paymethod='CC', MembershipPlan=2.0),
 Row(ID=23, Gender='M', Status='M', Children=2.0, Est Income=20078.0, Car Owner='N', Age=32.846667, AvgMonthlySpend=6.33, CustomerSupportCalls=0.0, Paymethod='CC', MembershipPlan=4.0),
 Row(ID=35, Gender='F', Status='S', Children=0.0, Est Income=78851.3, Car Owner='N', Age=48.373333, AvgMonthlySpend=28.66, CustomerSupportCalls=0.0, Paymethod='CC', MembershipPlan=4.0)]

In [12]:
customer_churn.take(5)

[Row(ID=6, CHURN='F'),
 Row(ID=11, CHURN='F'),
 Row(ID=22, CHURN='F'),
 Row(ID=23, CHURN='F'),
 Row(ID=35, CHURN='T')]

### Step 2: Merge Files

In [13]:
data=customer.join(customer_churn,customer['ID']==customer_churn['ID']).select(customer['*'],customer_churn['CHURN'])
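`DataFrame.join` performs an inner join when no join type is given, so customers without a churn record (and churn records without a customer profile) are dropped. The plain-Python sketch below illustrates that behaviour on the sample rows printed by `take(5)` above; `inner_join` is a hypothetical helper, not a Spark function.

```python
# Sample rows copied from the take(5) outputs above (only ID and one column kept).
customers = [{"ID": 11, "Gender": "M"}, {"ID": 14, "Gender": "F"},
             {"ID": 22, "Gender": "M"}, {"ID": 23, "Gender": "M"},
             {"ID": 35, "Gender": "F"}]
churn = [{"ID": 6, "CHURN": "F"}, {"ID": 11, "CHURN": "F"},
         {"ID": 22, "CHURN": "F"}, {"ID": 23, "CHURN": "F"},
         {"ID": 35, "CHURN": "T"}]

def inner_join(left, right, key):
    """Keep only rows whose key appears on both sides (inner-join semantics)."""
    lookup = {row[key]: row for row in right}
    return [dict(l, **lookup[l[key]]) for l in left if l[key] in lookup]

joined = inner_join(customers, churn, "ID")
print([row["ID"] for row in joined])  # [11, 22, 23, 35]
```

ID 14 (no churn record) and ID 6 (no customer profile) disappear, which is consistent with the merged rows displayed in the next step.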

### Step 3: Rename some columns
This step removes spaces from column names. It is an example of the data preparation you may need to do before creating a model.

In [14]:
data = data.withColumnRenamed("Est Income", "EstIncome").withColumnRenamed("Car Owner","CarOwner")
data.toPandas().head()

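| | ID | Gender | Status | Children | EstIncome | CarOwner | Age | AvgMonthlySpend | CustomerSupportCalls | Paymethod | MembershipPlan | CHURN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | M | S | 2.0 | 96.33 | N | 56.473333 | 32.88 | 1.0 | CC | 1.0 | F |
| 1 | 22 | M | S | 1.0 | 57626.9 | Y | 43.906667 | 38.96 | 0.0 | CC | 2.0 | F |
| 2 | 23 | M | M | 2.0 | 20078.0 | N | 32.846667 | 6.33 | 0.0 | CC | 4.0 | F |
| 3 | 35 | F | S | 0.0 | 78851.3 | N | 48.373333 | 28.66 | 0.0 | CC | 4.0 | T |
| 4 | 36 | F | S | 1.0 | 17540.7 | Y | 62.786667 | 13.45 | 0.0 | Auto | 4.0 | T |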


In [15]:
data.toPandas().shape

(785, 12)

In [16]:
data.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Children: double (nullable = true)
 |-- EstIncome: double (nullable = true)
 |-- CarOwner: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- AvgMonthlySpend: double (nullable = true)
 |-- CustomerSupportCalls: double (nullable = true)
 |-- Paymethod: string (nullable = true)
 |-- MembershipPlan: double (nullable = true)
 |-- CHURN: string (nullable = true)



### Step 4: Data understanding

In [17]:
df = data.toPandas()

In [None]:
!pip install pandas_profiling

In [19]:
import pandas_profiling

pandas_profiling.ProfileReport(df)





### Step 5: Build the Spark pipeline and the Random Forest model
"Pipeline" is an API in SparkML that's used for building models.
Additional information on SparkML: https://spark.apache.org/docs/2.1.0/ml-guide.html

In [20]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Prepare string variables so that they can be used by the random forest algorithm
# StringIndexer encodes a string column of labels to a column of label indices
SI1 = StringIndexer(inputCol='Gender', outputCol='GenderEncoded')
SI2 = StringIndexer(inputCol='Status',outputCol='StatusEncoded')
SI3 = StringIndexer(inputCol='CarOwner',outputCol='CarOwnerEncoded')
SI4 = StringIndexer(inputCol='Paymethod',outputCol='PaymethodEncoded')
SI5 = StringIndexer(inputCol='MembershipPlan',outputCol='MembershipPlanEncoded')

labelIndexer = StringIndexer(inputCol='CHURN', outputCol='label').fit(data)

# The Pipelines API requires that input variables are passed in a single vector column
assembler = VectorAssembler(inputCols=["GenderEncoded", "StatusEncoded", "CarOwnerEncoded", "PaymethodEncoded", "MembershipPlanEncoded", \
                                       "Children", "EstIncome", "Age", "AvgMonthlySpend", "CustomerSupportCalls"], outputCol="features")

In [21]:
# instantiate the algorithm, take the default settings
rf=RandomForestClassifier(labelCol="label", featuresCol="features")

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

pipeline = Pipeline(stages=[SI1,SI2,SI3,SI4,SI5,labelIndexer, assembler, rf, labelConverter])
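The `StringIndexer` stages in this pipeline turn each category into a numeric index, with the most frequent category receiving index 0.0 by default. The sketch below mimics that idea in plain Python. It is an illustration only: `string_index` is a hypothetical helper, the sample values are made up, and the alphabetical tie-break is an assumption rather than documented behaviour for every Spark version.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {v: float(i) for i, v in enumerate(ordered)}
    return mapping, [mapping[v] for v in values]

# Made-up Paymethod values for illustration.
mapping, encoded = string_index(["CC", "CH", "CC", "CC", "CC", "Auto", "CC"])
print(mapping)
print(encoded)
```

Because the index depends on category frequencies in the data the indexer was fitted on, the same string can map to a different number if the pipeline is refitted on different data.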

In [22]:
# Split data into train and test datasets
train, test = data.randomSplit([0.7,0.3], seed=3)
train.cache()
test.cache()

DataFrame[ID: int, Gender: string, Status: string, Children: double, EstIncome: double, CarOwner: string, Age: double, AvgMonthlySpend: double, CustomerSupportCalls: double, Paymethod: string, MembershipPlan: double, CHURN: string]
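`randomSplit` assigns rows to splits by random sampling, so the 70/30 proportions are approximate, and fixing `seed` makes the assignment repeatable. The following plain-Python sketch shows the idea under that assumption. It does not use Spark's random number generator, so it will not select the same rows as the cell above, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row."""
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point shortfall in bounds.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

# 785 is the row count reported by data.toPandas().shape earlier.
train_ids, test_ids = random_split(range(785), [0.7, 0.3], seed=3)
print(len(train_ids), len(test_ids))
```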

In [23]:
# Build model
model = pipeline.fit(train)

In [24]:
model.transform(test)

DataFrame[ID: int, Gender: string, Status: string, Children: double, EstIncome: double, CarOwner: string, Age: double, AvgMonthlySpend: double, CustomerSupportCalls: double, Paymethod: string, MembershipPlan: double, CHURN: string, GenderEncoded: double, StatusEncoded: double, CarOwnerEncoded: double, PaymethodEncoded: double, MembershipPlanEncoded: double, label: double, features: vector, rawPrediction: vector, probability: vector, prediction: double, predictedLabel: string]

### Step 6: Score the test data set

In [25]:
results = model.transform(test)
results=results.select(results["ID"],results["CHURN"],results["label"],results["predictedLabel"],results["prediction"],results["probability"])
results.toPandas().head(6)

| | ID | CHURN | label | predictedLabel | prediction | probability |
|---|---|---|---|---|---|---|
| 0 | 36 | T | 1.0 | T | 1.0 | [0.42638306091087824, 0.5736169390891218] |
| 1 | 61 | T | 1.0 | F | 0.0 | [0.5148638773455525, 0.4851361226544476] |
| 2 | 80 | F | 0.0 | F | 0.0 | [0.7535130285045396, 0.24648697149546045] |
| 3 | 87 | T | 1.0 | T | 1.0 | [0.48023729977370966, 0.5197627002262903] |
| 4 | 120 | F | 0.0 | F | 0.0 | [0.9079596558224557, 0.09204034417754423] |
| 5 | 121 | T | 1.0 | T | 1.0 | [0.3096257661058385, 0.6903742338941614] |


### Step 7: Model Evaluation 

In [26]:
print('Accuracy model1 = {:.2f}.'.format(results.filter(results.label == results.prediction).count() / float(results.count())))

Accuracy model1 = 0.78.
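The figure printed above is the share of rows where `prediction` equals `label`, which is conventionally called accuracy. Precision is a different quantity: of the rows predicted as churn, the share that actually churned. The sketch below computes both, plus recall, on the six scored rows displayed in Step 6. It is only an illustration on a tiny sample and says nothing about the full test set.

```python
# (label, prediction) pairs copied from the six scored rows shown in Step 6.
pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (1.0, 1.0)]

tp = sum(1 for y, p in pairs if y == 1.0 and p == 1.0)  # churners caught
fp = sum(1 for y, p in pairs if y == 0.0 and p == 1.0)  # false alarms
fn = sum(1 for y, p in pairs if y == 1.0 and p == 0.0)  # churners missed

accuracy = sum(1 for y, p in pairs if y == p) / float(len(pairs))
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
print(accuracy, precision, recall)
```

On these six rows accuracy is 5/6, precision is 3/3 and recall is 3/4, which shows how the three metrics can diverge on the same predictions.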


In [27]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

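# Evaluate model
# Note: rawPredictionCol points at the hard 0/1 "prediction" column, so the reported
# area reflects a single operating point rather than a threshold sweep over scores.
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label", metricName="areaUnderROC")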
print('Area under ROC curve = {:.2f}.'.format(evaluator.evaluate(results)))

Area under ROC curve = 0.77.
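The evaluator above is given the `prediction` column, which holds hard 0/1 classes rather than scores. An area under the ROC curve computed from hard classes can differ from one computed from probabilities. The plain-Python sketch below shows the difference on the six scored rows from Step 6 (probabilities rounded to four decimals). `auc` is a hypothetical helper that uses the pairwise-ranking definition of AUC, and six rows are far too few to draw conclusions about the model.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1.0]
    neg = [s for y, s in zip(labels, scores) if y == 0.0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
hard = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0]                          # prediction column
prob_churn = [0.5736, 0.4851, 0.2465, 0.5198, 0.0920, 0.6904]  # P(label = 1.0)
print(auc(labels, hard), auc(labels, prob_churn))  # 0.875 1.0
```

To evaluate on scores in Spark, the scored DataFrame would need to keep the `rawPrediction` column, which the `select` in Step 6 currently drops.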


We have finished building and testing a predictive model. The next steps are to tune it and then deploy it for real-time scoring.

### Step 8: Tune the model to find the best model

#### 8.1 Build a Parameter Grid specifying the parameters to be evaluated to determine the best combination

In [28]:
# set different levels for the maxDepth
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
paramGrid = (ParamGridBuilder().addGrid(rf.maxDepth,[3,5,6]).build())

#### 8.2 Create a cross validator to tune the pipeline with the generated parameter grid
Cross-validation attempts to fit the underlying estimator with user-specified combinations of parameters, cross-evaluate the fitted models, and output the best one.

In [29]:
# perform 3 fold cross validation
cv = CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
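With three `maxDepth` values and three folds, cross-validation fits the pipeline once per (parameter combination, fold) pair, then refits the winning combination on the full training set. The plain-Python sketch below counts those fits. `build_grid` and `k_fold_indices` are hypothetical helpers, and the interleaved fold assignment is a simplification of how Spark actually partitions rows.

```python
from itertools import product

def build_grid(params):
    """Expand {name: [values]} into a list of parameter combinations."""
    names = sorted(params)
    return [dict(zip(names, combo)) for combo in product(*(params[n] for n in names))]

def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs using interleaved folds."""
    idx = list(range(n_rows))
    for f in range(k):
        held_out = set(idx[f::k])  # every k-th row, starting at offset f
        yield [i for i in idx if i not in held_out], sorted(held_out)

grid = build_grid({"maxDepth": [3, 5, 6]})
fits = [(combo, fold) for combo in grid for fold, _ in enumerate(k_fold_indices(12, 3))]
print(len(grid), len(fits))  # 3 9
```

Adding a second parameter to the grid multiplies the number of fits, so the grid size is worth keeping in mind before tuning on larger data.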

In [30]:
# train the model
cvModel = cv.fit(train)

# pick the best model
best_rfModel = cvModel.bestModel

In [31]:
# score the test data set with the best model
cvresult=best_rfModel.transform(test)
cvresults_show=cvresult.select(cvresult["ID"],cvresult["CHURN"],cvresult["Label"],cvresult["predictedLabel"],cvresult["prediction"],cvresult["probability"])
cvresults_show.toPandas().head()

| | ID | CHURN | Label | predictedLabel | prediction | probability |
|---|---|---|---|---|---|---|
| 0 | 36 | T | 1.0 | T | 1.0 | [0.3674695140636778, 0.6325304859363222] |
| 1 | 61 | T | 1.0 | F | 0.0 | [0.5896943338739005, 0.4103056661260996] |
| 2 | 80 | F | 0.0 | F | 0.0 | [0.7326145463784617, 0.2673854536215382] |
| 3 | 87 | T | 1.0 | T | 1.0 | [0.378067400281718, 0.6219325997182821] |
| 4 | 120 | F | 0.0 | F | 0.0 | [0.9548245981912311, 0.04517540180876895] |


In [32]:
print('Model Accuracy of the best fitted model = {:.2f}.'.format(cvresult.filter(cvresult.label == cvresult.prediction).count()/ float(cvresult.count())))
print('Model Accuracy of the default model = {:.2f}.'.format(results.filter(results.label == results.prediction).count() / float(results.count())))
print('   ')
print('Area under the ROC curve of best fitted model = {:.2f}.'.format(evaluator.evaluate(cvresult)))
print('Area under the ROC curve of the default model = {:.2f}.'.format(evaluator.evaluate(results)))

Model Accuracy of the best fitted model = 0.78.
Model Accuracy of the default model = 0.78.
   
Area under the ROC curve of best fitted model = 0.77.
Area under the ROC curve of the default model = 0.77.


### Step 9: Save Model in WML repository

In this section you will store your model in the Watson Machine Learning (WML) repository by using Python client libraries.
* <a href="https://dataplatform.cloud.ibm.com/docs/content/analyze-data/ml-overview.html?context=analytics">WML Documentation</a>
* <a href="http://watson-ml-api.mybluemix.net/">WML REST API</a> 
* <a href="https://dataplatform.cloud.ibm.com/docs/content/analyze-data/ml-deploy-notebook.html?audience=wdp&context=analytics&linkInPage=true">Deploy a model from a notebook</a>
* <a href="https://wml-api-pyclient.mybluemix.net/">WML Repository API</a>
<br/>

First, install the client library from PyPI.

In [None]:
!pip install watson-machine-learning-client --upgrade

Import the installed client by running the code below.

In [2]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

#### <span style="color:red">9.1 Action Required</span>

If you do not already have an instance of the Machine Learning service in IBM Cloud, go to <a href="https://console.ng.bluemix.net/dashboard/apps/" target="_blank">IBM Cloud</a>, click **Catalog** on the top right of the menu, search for "Machine Learning", and create an instance.

If you have an existing instance of the Machine Learning service in <a href="https://console.ng.bluemix.net/dashboard/apps/" target="_blank">IBM Cloud</a>, click into the service.


#### <span style="color:red">9.2 Action Required</span>

* Click **Service credentials** on the left navigation bar
* Click **New credentials** and then the **Add** button to create new credentials
* Under **ACTIONS** click **View credentials**
* Click the **copy** icon to copy the credentials
* Paste the credentials into the code cell below


In [33]:
# @hidden_cell
wml_credentials={
  "apikey": "XXXXX",
  "iam_apikey_description": "Auto-generated for key cb719d05-707c-419f-b9d3-4b8dfdbcd2db",
  "iam_apikey_name": "Service credentials-1",
  "iam_role_crn": "crn:v1:bluemix:public:iam::::serviceRole:Writer",
  "iam_serviceid_crn": "crn:v1:bluemix:public:iam-identity::a/f99cae423540e5951676b0767ae74ae2::serviceid:ServiceId-9c745a3a-7959-4077-b2bd-fef9fdd63268",
  "instance_id": "7d184276-32ca-433c-8662-0e0660e53f65",
  "password": "XXXXX",
  "url": "https://us-south.ml.cloud.ibm.com",
  "username": "xxxxxxxxx"
}
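Pasting credentials into a cell makes it easy to share them by accident when the notebook is exported. One alternative is to read them from an environment variable. The sketch below is an assumption about how that could be done: the variable name `WML_CREDENTIALS_JSON`, the helper `load_credentials`, and the choice of required keys are all hypothetical, not part of the Watson Machine Learning client.

```python
import json
import os

def load_credentials(env_var="WML_CREDENTIALS_JSON"):
    """Read a JSON credentials document from an environment variable."""
    raw = os.environ.get(env_var)
    if raw is None:
        raise RuntimeError(
            "Set {} to the JSON copied from Service credentials".format(env_var))
    creds = json.loads(raw)
    # Keys assumed to be needed, based on the credentials shown above.
    missing = [k for k in ("apikey", "instance_id", "url") if k not in creds]
    if missing:
        raise ValueError("Credentials are missing keys: {}".format(missing))
    return creds
```

With the variable set, `wml_credentials = load_credentials()` would replace the pasted dictionary.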

#### Create the API client by running the code below.

In [34]:
wml_client = WatsonMachineLearningAPIClient(wml_credentials)

#### 9.3 Publish the model in the Watson Machine Learning repository on Cloud.

In [37]:
model_props = {wml_client.repository.ModelMetaNames.AUTHOR_NAME: "Sidney Phoon",  
               wml_client.repository.ModelMetaNames.NAME: "Predict Customer Churn"}

published_model = wml_client.repository.store_model(model=best_rfModel, pipeline=pipeline, meta_props=model_props, training_data=train)

#### Get model details

In [38]:
import json
published_model_uid = wml_client.repository.get_model_uid(published_model)
model_details = wml_client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))

{
  "metadata": {
    "guid": "fc251c16-2446-4b9b-b4a5-bbf9599ff8d6",
    "url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/7d184276-32ca-433c-8662-0e0660e53f65/published_models/fc251c16-2446-4b9b-b4a5-bbf9599ff8d6",
    "created_at": "2019-07-22T20:17:48.988Z",
    "modified_at": "2019-07-22T20:17:49.092Z"
  },
  "entity": {
    "runtime_environment": "spark-2.3",
    "learning_configuration_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/7d184276-32ca-433c-8662-0e0660e53f65/published_models/fc251c16-2446-4b9b-b4a5-bbf9599ff8d6/learning_configuration",
    "author": {
      "name": "Sidney Phoon"
    },
    "name": "Predict Customer Churn",
    "label_col": "CHURN",
    "learning_iterations_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/7d184276-32ca-433c-8662-0e0660e53f65/published_models/fc251c16-2446-4b9b-b4a5-bbf9599ff8d6/learning_iterations",
    "training_data_schema": {
      "fields": [
        {
          "metadata": {},
          "name": "ID",
 

In [39]:
#List all models
wml_client.repository.list_models()

------------------------------------  ----------------------  ------------------------  -----------------
GUID                                  NAME                    CREATED                   FRAMEWORK
fc251c16-2446-4b9b-b4a5-bbf9599ff8d6  Predict Customer Churn  2019-07-22T20:17:48.988Z  mllib-2.3
05d2aa38-3910-4743-ad29-a35036e693ac  Churn-Prediction-Flow   2019-07-02T20:47:35.254Z  spss-modeler-18.1
------------------------------------  ----------------------  ------------------------  -----------------


#### 9.4 Load model
In this subsection you will learn how to load a saved model back from a specified instance of Watson Machine Learning.

In [40]:
loaded_model = wml_client.repository.load(published_model_uid)

You can pass test data to the loaded model's transform() method to confirm that the model has been loaded correctly.

In [41]:
test_predictions = loaded_model.transform(test)

In [42]:
test_predictions.select('probability', 'predictedLabel').show(n=3, truncate=False)

+---------------------------------------+--------------+
|probability                            |predictedLabel|
+---------------------------------------+--------------+
|[0.3674695140636778,0.6325304859363222]|T             |
|[0.5896943338739005,0.4103056661260996]|F             |
|[0.7326145463784617,0.2673854536215382]|F             |
+---------------------------------------+--------------+
only showing top 3 rows



The probabilities match those of the tuned model scored in Step 8, so the loaded model works.

### Step 10: Deploy and score in the Cloud
In this section you will learn how to create an online deployment and score a new data record by using the Watson Machine Learning client.

#### 10.1 Create online deployment

In [43]:
created_deployment = wml_client.deployments.create(published_model_uid, name="Predict Customer Churn Py36 Online Deployment")



#######################################################################################

Synchronous deployment creation for uid: 'fc251c16-2446-4b9b-b4a5-bbf9599ff8d6' started

#######################################################################################


INITIALIZING
DEPLOY_SUCCESS


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='dd6eeba8-1307-4b5d-93c0-bf3fb3945e7e'
------------------------------------------------------------------------------------------------




Print the online scoring endpoint

In [44]:
scoring_endpoint = wml_client.deployments.get_scoring_url(created_deployment)

print(scoring_endpoint)

https://us-south.ml.cloud.ibm.com/v3/wml_instances/7d184276-32ca-433c-8662-0e0660e53f65/deployments/dd6eeba8-1307-4b5d-93c0-bf3fb3945e7e/online


#### 10.2 Get deployments

In [45]:
wml_client.deployments.list()

------------------------------------  ---------------------------------------------  ------  --------------  ------------------------  -----------------  -------------
GUID                                  NAME                                           TYPE    STATE           CREATED                   FRAMEWORK          ARTIFACT TYPE
dd6eeba8-1307-4b5d-93c0-bf3fb3945e7e  Predict Customer Churn Py36 Online Deployment  online  DEPLOY_SUCCESS  2019-07-22T20:19:12.400Z  mllib-2.3          model
0833d6ee-eaf8-49d9-8433-f8b1f9ed0cff  churn-prediction-flow-deploy                   online  DEPLOY_SUCCESS  2019-07-02T20:48:58.552Z  spss-modeler-18.1  model
------------------------------------  ---------------------------------------------  ------  --------------  ------------------------  -----------------  -------------


#### 10.3 Score
Use the method below to send a test scoring request to the deployed model.


In [46]:
scoring_payload = {
    "fields": [
    "Gender",
    "Status",
    "Children",
    "EstIncome",
    "CarOwner",
    "Age",
    "AvgMonthlySpend",
    "CustomerSupportCalls",
    "Paymethod",
    "MembershipPlan"
    ],
    "values": [ ["F","S",2.0,25000,"Y",33,10,1,"CC",2] ]
} 
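The payload pairs a list of field names with one list of values per record, in the same order. When records arrive as dictionaries, a small helper can build that layout and catch missing fields early. This is a sketch under that assumption; `make_payload` and `FIELDS` are hypothetical names, not part of the client library.

```python
FIELDS = ["Gender", "Status", "Children", "EstIncome", "CarOwner", "Age",
          "AvgMonthlySpend", "CustomerSupportCalls", "Paymethod", "MembershipPlan"]

def make_payload(records, fields=FIELDS):
    """Turn a list of dicts into the {"fields": [...], "values": [[...]]} layout."""
    missing = [f for r in records for f in fields if f not in r]
    if missing:
        raise KeyError("Missing fields: {}".format(sorted(set(missing))))
    # Values are emitted in the order of `fields`, one inner list per record.
    return {"fields": list(fields), "values": [[r[f] for f in fields] for r in records]}

# Same customer as in the hand-written payload above.
new_customer = {"Gender": "F", "Status": "S", "Children": 2.0, "EstIncome": 25000,
                "CarOwner": "Y", "Age": 33, "AvgMonthlySpend": 10,
                "CustomerSupportCalls": 1, "Paymethod": "CC", "MembershipPlan": 2}
payload = make_payload([new_customer])
print(payload["values"])
```

Passing several dictionaries produces one inner list per record, so the same helper covers batch scoring requests.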

In [47]:
predictions = wml_client.deployments.score(scoring_endpoint, scoring_payload)

In [48]:
print ('Prediction = {}'.format(predictions))

Prediction = {'fields': ['Gender', 'Status', 'Children', 'EstIncome', 'CarOwner', 'Age', 'AvgMonthlySpend', 'CustomerSupportCalls', 'Paymethod', 'MembershipPlan', 'CHURN', 'GenderEncoded', 'StatusEncoded', 'CarOwnerEncoded', 'PaymethodEncoded', 'MembershipPlanEncoded', 'label', 'features', 'rawPrediction', 'probability', 'prediction', 'predictedLabel'], 'values': [['F', 'S', 2.0, 25000.0, 'Y', 33.0, 10.0, 1.0, 'CC', 2.0, 'F', 0.0, 1.0, 1.0, 0.0, 2.0, 0.0, [0.0, 1.0, 1.0, 0.0, 2.0, 2.0, 25000.0, 33.0, 10.0, 1.0], [4.614414145104681, 15.385585854895318], [0.23072070725523405, 0.7692792927447659], 1.0, 'T']]}
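The response uses the same `fields`/`values` layout as the request, so reading a column means matching positions between the two lists. The sketch below pairs them into a dictionary per row. The `response` literal is a trimmed copy of the output printed above, and `rows_as_dicts` is a hypothetical helper.

```python
# Trimmed copy of the response printed above (only a few fields kept).
response = {
    "fields": ["Gender", "probability", "prediction", "predictedLabel"],
    "values": [["F", [0.23072070725523405, 0.7692792927447659], 1.0, "T"]],
}

def rows_as_dicts(resp):
    """Pair each value row with the field names so columns can be read by name."""
    return [dict(zip(resp["fields"], row)) for row in resp["values"]]

first = rows_as_dicts(response)[0]
print(first["predictedLabel"], round(first["probability"][1], 2))  # T 0.77
```

For this customer the model predicts churn (`T`) with a probability of about 0.77.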


You have come to the end of this notebook.

**Author: Sidney Phoon**<br/>
Last updated: July, 2019