# Predicting Customer Churn in Telco
This notebook is forked from the original notebook (https://github.com/elenalowery/DSX-Local-Telco-Churn/blob/master/Notebooks/Telco%20Churn%20ML_Local.ipynb), which was created for DSX Local (Data Science Experience running locally), and has been ported to work with Watson Studio on IBM Cloud.

In [None]:
# Verify Python 3.5
import platform
print(platform.python_version())
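
If you prefer a programmatic check over reading the printed version, the comparison can be sketched as below. The minimum of 3.5 is taken from the comment in the cell above; the helper name `is_supported` is made up for this illustration.

```python
import sys

# Minimum version taken from the "Verify Python 3.5" comment above.
MIN_VERSION = (3, 5)


def is_supported(version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor, ...) tuple meets the minimum version."""
    return tuple(version_info[:2]) >= minimum


if not is_supported(sys.version_info):
    print("Warning: this notebook was written for Python {}.{} or later".format(*MIN_VERSION))
```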

In this notebook you will learn how to build a predictive model with the [Apache Spark](https://spark.apache.org/) machine learning API (SparkML) and deploy it for scoring using the [Machine Learning](https://console.bluemix.net/catalog/services/machine-learning) (ML) service.

This notebook walks you through these steps:
- Load and visualize the data set
- Build a predictive model with the SparkML API
- Save the model in the ML repository
- Create a Deployment in ML (via UI)
- Test the model (via UI)
- Test the model (via REST API)

### Step 1: Review Use Case

The analytics use case implemented in this notebook is telco churn. While it's a simple use case, it implements all steps of the CRISP-DM methodology, which is the recommended best practice for implementing predictive analytics.
![CRISP-DM](https://raw.githubusercontent.com/yfphoon/dsx_demo/master/crisp_dm.png)

The analytics process starts with defining the business problem and identifying the data that can be used to solve the problem. For Telco churn, we use demographic and historical transaction data. We also know which customers have churned, which is the critical information for building predictive models. In the next step, we use visual APIs for data understanding and complete some data preparation tasks. In a typical analytics project, data preparation includes more steps (for example, formatting data or deriving new variables).

Once the data is ready, we can build a predictive model. In our example we use the SparkML Random Forest classification model. Classification is a statistical technique that assigns a "class" to each customer record (for our use case, "churn" or "no churn"). Classification models use historical data to learn the logic for predicting the class; this process is called model training. After the model is created, it is usually evaluated on a separate data set.

Finally, if the model's accuracy meets the expectations, it can be deployed for scoring. Scoring is the process of applying the model to a new set of data. For example, when we receive new transactional data, we can score the customer for the risk of churn.  

We also developed a sample Python Flask application to illustrate deployment: http://predictcustomerchurn.mybluemix.net/. This application implements the REST client call to the model.

### Working with Notebooks

If you are new to Notebooks, here's a quick overview of how to work in this environment.

1. The notebook has 2 types of cells - markdown (text) and code. 
2. Each cell with code can be executed independently or together (see options under the Cell menu). When working in this notebook, we will be running one cell at a time because we need to make code changes to some of the cells.
3. To run a cell, position the cursor in the code cell and click the Run (arrow) icon. The cell is running while you see the * next to it. Some cells have printable output.
4. Work through this notebook by reading the instructions and executing code cell by cell. Some cells will require modifications before you run them. 

### Step 2: Load data 

For this notebook, we will leverage two csv files:
- customer.csv: this file provides information about the customer such as gender, age, marital status, number of children, income, calling plans, and usage.
- churn.csv: this provides a historical record of which customers have churned.

You will need to download these files from the following GitHub repository and upload them to the Cloud Object Storage instance associated with your project.
1. Download the churn.csv and customer.csv files from the following repository to your local machine: https://github.com/elenalowery/DSX-Local-Telco-Churn/blob/master/data/
2. Upload them to your Cloud Object Storage instance by following these steps:
    - Click the **Data** icon (top right)
    - Select the **Files** tab
    - Either drop the churn.csv and customer.csv files or click browse and select those files from your local machine.

At this point, you should have both of these csv files in your Cloud Object Storage.

Next, we'll need to load the data from the csv files into Spark data frames. To do so, click **Insert to code** under the **customer.csv** file and click **Insert SparkSession DataFrame**.

This inserts code in a notebook cell that loads the csv file into a Spark data frame.

In [None]:
# The code was removed by DSX for sharing.

Repeat the same process to load **churn.csv** file into another data frame.

In [None]:
# The code was removed by DSX for sharing.

In [None]:
# Copy the dataframes into generic dataframes with more meaningful names
# Note that your data frames may have slightly different names (maybe df_data_3 and df_data_4)
df_customer = df_data_1
df_customer_churn = df_data_2

If the first step ran successfully (you saw the output), then continue reviewing the notebook and running each code cell step by step. Note that not every cell has a visual output. The cell is still running if you see a * in the brackets next to the cell. 

If the first step didn't finish successfully, check with the instructor. 

### Step 3: Merge Files
Join the two tables (customer information and churn information) based on the ID field.

In [None]:
data=df_customer.join(df_customer_churn,df_customer['ID']==df_customer_churn['ID']).select(df_customer['*'],df_customer_churn['CHURN'])
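
Spark's `join` performs an inner join unless another join type is requested, so a customer without a matching churn record is dropped from `data`. The plain-Python sketch below mimics that behavior on made-up rows; the helper `inner_join_on_id` is hypothetical and assumes each ID appears at most once in the churn records.

```python
def inner_join_on_id(customers, churn_records):
    """Keep only customers that have a churn record and attach its CHURN value."""
    churn_by_id = {row["ID"]: row["CHURN"] for row in churn_records}
    return [
        dict(customer, CHURN=churn_by_id[customer["ID"]])
        for customer in customers
        if customer["ID"] in churn_by_id
    ]


# Made-up sample rows, for illustration only
customers = [{"ID": 1, "Gender": "F"}, {"ID": 2, "Gender": "M"}, {"ID": 3, "Gender": "F"}]
churn_records = [{"ID": 1, "CHURN": "T"}, {"ID": 3, "CHURN": "F"}]

joined = inner_join_on_id(customers, churn_records)
print(len(joined))  # customer 2 has no churn record, so it is not in the result
```

Comparing the joined record count with the counts of the two input tables is a quick way to spot customers that were lost in the join.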

In [None]:
# Print how many records the data includes
data.count()

In [None]:
# Print top 5 records
data.head(5)

In [None]:
# Print the data schema
data.printSchema()

### Step 4: Rename some columns
This step removes spaces from column names. It is an example of the data preparation that you may have to do before creating a model.

In [None]:
data = data.withColumnRenamed("Est Income", "EstIncome").withColumnRenamed("Car Owner","CarOwner")
# If you need to change a column type from string to double
# (the column is already named EstIncome after the rename above):
# from pyspark.sql.functions import col
# data = data.withColumn("EstIncome", col("EstIncome").cast("double"))
data.toPandas().head()
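
When many columns contain spaces, renaming them one by one gets tedious. A minimal sketch of a generic alternative follows; the helper `strip_spaces` is made up for this illustration, and the commented line shows how the result could be applied with the DataFrame `toDF` method.

```python
def strip_spaces(column_names):
    """Return the column names with all spaces removed."""
    return [name.replace(" ", "") for name in column_names]


# Made-up subset of the column names, for illustration only
columns = ["ID", "Est Income", "Car Owner", "Age"]
print(strip_spaces(columns))

# With a Spark DataFrame, the cleaned names could be applied in one call:
# data = data.toDF(*strip_spaces(data.columns))
```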

### Step 5: Data understanding

Data preparation and data understanding are the most time-consuming tasks in the data mining process. The data scientist needs to review and evaluate the quality of data before modeling.

Visualization is one of the ways to review data.

The Brunel Visualization Language is a highly succinct and novel language that defines interactive data visualizations based on tabular data. The language is well suited for both data scientists and business users. 
More information about Brunel Visualization: https://github.com/Brunel-Visualization/Brunel/wiki

Try Brunel visualization here: http://brunel.mybluemix.net/gallery_app/renderer

In [None]:
import brunel
df = data.toPandas()
%brunel data('df') bar x(CHURN) y(EstIncome) mean(EstIncome) color(LocalBilltype) stack tooltip(EstIncome) | x(LongDistance) y(Usage) point color(Paymethod) tooltip(LongDistance, Usage) :: width=1100, height=400 

**PixieDust** is a Python helper library for Spark IPython Notebooks. One of its main features is visualization. You'll notice that unlike other APIs, which produce only static output, PixieDust creates an **interactive UI** in which you can explore data.

More information about PixieDust: https://github.com/ibm-cds-labs/pixiedust

In [None]:
from pixiedust.display import *
display(data)

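### Step 6: Build the Spark pipeline and the Random Forest model
`Pipeline` is the SparkML API that chains data preparation stages and a learning algorithm into a single workflow; fitting the pipeline on training data produces the model.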
Additional information on SparkML: https://spark.apache.org/docs/2.0.2/ml-guide.html
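
To make the idea of a pipeline concrete, here is a toy sketch in plain Python. It is not SparkML code: all class names (`Shifter`, `MeanCenterer`, `MiniPipeline`, `MiniPipelineModel`) are invented for this illustration. It mirrors the general pattern in which stages that learn from data are fitted first, and the fitted stages are then applied in order.

```python
class Shifter:
    """Transformer: adds a fixed offset to every value."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, values):
        return [value + self.offset for value in values]


class MeanCenterer:
    """Estimator: learns the mean in fit() and returns a fitted transformer."""

    def fit(self, values):
        mean = sum(values) / float(len(values))
        return Shifter(-mean)


class MiniPipelineModel:
    """Result of fitting: applies the fitted stages in order."""

    def __init__(self, transformers):
        self.transformers = transformers

    def transform(self, values):
        for transformer in self.transformers:
            values = transformer.transform(values)
        return values


class MiniPipeline:
    """Chains stages; fit() walks through them in order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data seen so far; transformers are used as they are.
            transformer = stage.fit(values) if hasattr(stage, "fit") else stage
            values = transformer.transform(values)
            fitted.append(transformer)
        return MiniPipelineModel(fitted)


mini_model = MiniPipeline([Shifter(1.0), MeanCenterer()]).fit([1.0, 2.0, 3.0])
print(mini_model.transform([10.0]))
```

The fitted object keeps what was learned from the training values (here, the mean), which is why the same fitted pipeline can later be applied to new records.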

In [None]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Prepare string variables so that they can be used by the decision tree algorithm
# StringIndexer encodes a string column of labels to a column of label indices
SI1 = StringIndexer(inputCol='Gender', outputCol='GenderEncoded')
SI2 = StringIndexer(inputCol='Status',outputCol='StatusEncoded')
SI3 = StringIndexer(inputCol='CarOwner',outputCol='CarOwnerEncoded')
SI4 = StringIndexer(inputCol='Paymethod',outputCol='PaymethodEncoded')
SI5 = StringIndexer(inputCol='LocalBilltype',outputCol='LocalBilltypeEncoded')
SI6 = StringIndexer(inputCol='LongDistanceBilltype',outputCol='LongDistanceBilltypeEncoded')
labelIndexer = StringIndexer(inputCol='CHURN', outputCol='label').fit(data)

# The Pipelines API requires that input variables are passed in a single vector column
#assembler = VectorAssembler(inputCols=["GenderEncoded", "StatusEncoded", "CarOwnerEncoded", "PaymethodEncoded", "LocalBilltypeEncoded", \
#                                       "LongDistanceBilltypeEncoded", "Children", "EstIncome", "Age", "LongDistance", "International", "Local",\
#                                      "Dropped","Usage","RatePlan"], outputCol="features")

assembler = VectorAssembler(inputCols=["GenderEncoded", "StatusEncoded", "CarOwnerEncoded", "PaymethodEncoded", "LocalBilltypeEncoded", \
                                       "LongDistanceBilltypeEncoded"], outputCol="features")

In [None]:
# instantiate the algorithm, take the default settings
rf=RandomForestClassifier(labelCol="label", featuresCol="features")

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

pipeline = Pipeline(stages=[SI1,SI2,SI3,SI4,SI5,SI6,labelIndexer, assembler, rf, labelConverter])
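
`StringIndexer` assigns indices by label frequency, so the most frequent label gets index 0, and `IndexToString` uses the learned label list to map indices back. The plain-Python sketch below imitates both directions on made-up values; the function name `fit_string_index` and the alphabetical tie-breaking are assumptions of this sketch, not SparkML behavior you should rely on.

```python
from collections import Counter


def fit_string_index(values):
    """Return the distinct labels ordered by descending frequency."""
    counts = Counter(values)
    # Ties are broken alphabetically here; this is an assumption of the sketch.
    return sorted(counts, key=lambda label: (-counts[label], label))


# Made-up sample values, for illustration only
paymethod = ["CC", "CH", "CC", "Auto", "CC", "CH"]

learned_labels = fit_string_index(paymethod)
to_index = {label: float(index) for index, label in enumerate(learned_labels)}

encoded = [to_index[value] for value in paymethod]          # StringIndexer direction
decoded = [learned_labels[int(index)] for index in encoded]  # IndexToString direction

print(learned_labels)
print(encoded)
```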

In [None]:
# Split data into train and test datasets
train, test = data.randomSplit([0.8,0.2], seed=6)
train.cache()
test.cache()
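
A weighted random split assigns each record to a split by a random draw, so the resulting sizes are close to 80/20 but usually not exact, and a fixed seed makes the split repeatable. The sketch below shows that idea in plain Python; `random_split` is a hypothetical helper and does not reproduce Spark's internal sampling.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split, with probability proportional to the weights."""
    total = float(sum(weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point rounding at the upper edge
            splits[-1].append(row)
    return splits


train_rows, test_rows = random_split(list(range(1000)), [0.8, 0.2], seed=6)
print(len(train_rows), len(test_rows))
```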

In [None]:
# Build models
model = pipeline.fit(train)

### Step 7: Score the test data set

In [None]:
results = model.transform(test)
#results=results.select(results["ID"],results["CHURN"],results["label"],results["predictedLabel"],results["prediction"],results["probability"])

results.toPandas().head(6)

### Step 8: Model Evaluation 

In [None]:
# Fraction of test records whose prediction matches the actual label (this is accuracy)
accuracy = results.filter(results.label == results.prediction).count() / float(results.count())
print('Accuracy model1 = {0:.2f}'.format(accuracy))
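
The cell above measures the share of test records whose prediction matches the label. For churn it is often useful to also look at precision and recall for the churn class. The sketch below computes all three on made-up labels; the helper `classification_metrics` is hypothetical, and treating index 1.0 as the churn class is an assumption that depends on how the label indexer ordered the classes.

```python
def classification_metrics(labels, predictions, positive=1.0):
    """Return accuracy, plus precision and recall for the positive class."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == positive and p == positive)
    false_pos = sum(1 for y, p in pairs if y != positive and p == positive)
    false_neg = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    return {
        "accuracy": correct / float(len(pairs)),
        "precision": true_pos / float(true_pos + false_pos) if true_pos + false_pos else 0.0,
        "recall": true_pos / float(true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Made-up labels and predictions, for illustration only
sample_labels = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
sample_predictions = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]

metrics = classification_metrics(sample_labels, sample_predictions)
print(metrics)
```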

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model on the raw scores, so the ROC curve is built from more than one threshold
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC")
print('Area under ROC curve = {:.2f}.'.format(evaluator.evaluate(results)))
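
Area under the ROC curve can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way on made-up data and shows that continuous scores and hard 0/1 predictions can give different values. The function `area_under_roc` is written for this illustration and is not the SparkML implementation.

```python
def area_under_roc(labels, scores, positive=1.0):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == positive]
    negatives = [s for y, s in zip(labels, scores) if y != positive]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and churn probabilities, for illustration only
sample_labels = [1.0, 0.0, 1.0, 0.0]
sample_probabilities = [0.9, 0.4, 0.45, 0.2]
hard_predictions = [1.0 if p >= 0.5 else 0.0 for p in sample_probabilities]

print(area_under_roc(sample_labels, sample_probabilities))
print(area_under_roc(sample_labels, hard_predictions))
```

Thresholding the scores discards the ranking information among records on the same side of the threshold, which is why the second value is lower here.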

We have finished building and testing a predictive model. The next step is to deploy it for real time scoring. 

### Step 9: Save Model in ML repository
After creating the predictive ML model, save it to your Machine Learning service. 

In [None]:
# Load required libraries
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact

# You need to create a Watson Machine Learning instance in IBM Cloud
# Read WML credentials.

In the next cell, insert the credentials for your Machine Learning service that you created on IBM Cloud.

wml_credentials = {

   "url": "",
   
   "access_key": "",
   
   "username": "",
   
   "password": "",
   
   "instance_id": ""
   
}

In [None]:
#Copy your WML credentials here
wml_credentials={
  "url": "https://ibm-watson-ml.mybluemix.net",
  "access_key": "",
  "username": "",
  "password": "",
  "instance_id": ""
}
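
If you would rather not paste credentials into a notebook that may be shared, one option is to read them from environment variables. This is a sketch under that assumption; the variable names (`WML_URL` and so on) and the helper `load_wml_credentials` are made up for this illustration.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_NAMES = {
    "url": "WML_URL",
    "access_key": "WML_ACCESS_KEY",
    "username": "WML_USERNAME",
    "password": "WML_PASSWORD",
    "instance_id": "WML_INSTANCE_ID",
}


def load_wml_credentials(environ=os.environ):
    """Build the credentials dictionary from environment variables."""
    missing = [name for name in ENV_NAMES.values() if name not in environ]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(sorted(missing)))
    return {key: environ[name] for key, name in ENV_NAMES.items()}


# Usage, once the variables are set:
# wml_credentials = load_wml_credentials()
```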

In [None]:
# The code was removed by DSX for sharing.

In [None]:
# Create ML repo client and authenticate using WML credentials
ml_repository_client = MLRepositoryClient(wml_credentials['url'])
ml_repository_client.authorize(wml_credentials['username'], wml_credentials['password'])


In [None]:
# Wrap the trained model in a repository artifact; model is the trained pipeline model and train is the data used for training it
model_artifact = MLRepositoryArtifact(model, training_data=train, name="churn1")

In [None]:
# Save the model artifact to the WML repository
saved_model = ml_repository_client.models.save(model_artifact)

### Step 10: Deploy model using UI and Test model using REST API or UI
1. Save the notebook and switch to the **Models** tab of the project (**hint**: right click the project name link at the top, and open with another tab in your browser). 
2. Under **Models**, find and click your saved model (note the model name you used above; in our example it is "churn1").
3. The **Overview** tab provides general information about the trained model.
4. The **Evaluation** tab shows the evaluation results of the trained model. Initially it is empty because the model has not been evaluated yet.
5. Click the **Deployments** tab and select **Add Deployment**. This provides 3 options:
   - Web Service: Provides a REST API endpoint for running the trained model on new data.
   - Batch Prediction: Runs the trained model against a data set in batch. The data set can be read from Object Storage or DB2.
   - Real Time Streaming: Runs the trained model against data from Message Hub.
   Select the Web Service option, provide a name for the web service, and click Save.
6. Once the model is deployed, click the deployed model to load a page with information about it. Specifically:
   - **Overview**: Shows general information on deployed model.
   - **Implementation**: Provides the endpoints of the deployed model as well as code snippets in multiple languages to actually make a REST call against the trained model. This also includes the required authentication when making the call.
   - **Test**: Test the trained model via UI using the following values:
`ID=99, Gender=M, Status=S, Children=0, Est Income=60000, Car Owner=Y, Age=34, LongDistance=68, International=50, Local=100, Dropped=0, Paymethod=CC, LocalBilltype=Budget, LongDistanceBilltype=Intnl_discount, Usage=334, RatePlan=3`


### Test model using REST API



In [None]:
# Go to Models, select the model you just trained (churn1), then go to Implementation and get
# the code snippet in Python

import urllib3, requests, json

# The username, password, and url used below come from the Service credentials of your
# IBM Cloud Watson Machine Learning service instance (stored in wml_credentials above)



headers = urllib3.util.make_headers(basic_auth='{username}:{password}'.format(username=wml_credentials['username'], password=wml_credentials['password']))
url = '{}/v3/identity/token'.format(wml_credentials['url'])
response = requests.get(url, headers=headers)
mltoken = json.loads(response.text).get('token')
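
The cell above assumes that the token request succeeds. If the credentials are wrong, `mltoken` ends up as `None` and the failure only shows up later in the scoring call. A minimal sketch of a stricter check follows; the helper `extract_token` is made up for this illustration.

```python
import json


def extract_token(status_code, body_text):
    """Return the token from a token response, or raise with a readable message."""
    if status_code != 200:
        raise RuntimeError("Token request failed with HTTP status {}".format(status_code))
    token = json.loads(body_text).get("token")
    if not token:
        raise RuntimeError("Token response did not contain a 'token' field")
    return token


# Usage with the response object from the cell above:
# mltoken = extract_token(response.status_code, response.text)
```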


In [None]:
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken}

#payload_scoring = {"fields": ["ID", "Gender", "Status", "Children", "EstIncome", "CarOwner", "Age", "LongDistance", "International", "Local", "Dropped", "Paymethod", "LocalBilltype", "LongDistanceBilltype", "Usage", "RatePlan"], "values": [array_of_values_to_be_scored, another_array_of_values_to_be_scored]}

payload_scoring = {"fields": ["ID", "Gender", "Status", "Children", "EstIncome", "CarOwner", "Age", "LongDistance", "International", "Local", "Dropped", "Paymethod", "LocalBilltype", "LongDistanceBilltype", "Usage", "RatePlan"], "values": [["99", "M", "S", "0", "60000", "Y", "34", "68", "50", "100", "0", "CC", "Budget", "Intnl_discount", "334", "3"]]}

# Replace this URL with the scoring endpoint of your own deployment (see the Implementation tab)
scoring_endpoint = 'https://ibm-watson-ml.mybluemix.net/v3/wml_instances/13cc486f-01c7-4393-8032-f6d6fd0b599b/published_models/9cc07790-c8af-4ff7-8bfa-1ccb120944c7/deployments/ffee1a62-97c4-4f78-9cb1-416344ef8b04/online'
response_scoring = requests.post(scoring_endpoint, json=payload_scoring, headers=header)

print("Scoring response")
print(json.loads(response_scoring.text))
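
Writing the `values` list by hand is error-prone because the order must match `fields` exactly. The sketch below builds the payload from dictionaries instead and turns a response back into dictionaries. Both helper names are made up, and the response layout (a `fields` list with parallel `values` rows) is an assumption that mirrors the request format; check it against the actual output printed above.

```python
FIELDS = ["ID", "Gender", "Status", "Children", "EstIncome", "CarOwner", "Age",
          "LongDistance", "International", "Local", "Dropped", "Paymethod",
          "LocalBilltype", "LongDistanceBilltype", "Usage", "RatePlan"]


def build_payload(records, fields=FIELDS):
    """Build a scoring payload whose value rows follow the order of the fields."""
    return {
        "fields": list(fields),
        "values": [[record[field] for field in fields] for record in records],
    }


def rows_from_response(response_json):
    """Pair each value row with the field names (assumed response layout)."""
    return [dict(zip(response_json["fields"], row)) for row in response_json["values"]]


# Same sample customer as in the cell above
customer = {"ID": "99", "Gender": "M", "Status": "S", "Children": "0",
            "EstIncome": "60000", "CarOwner": "Y", "Age": "34", "LongDistance": "68",
            "International": "50", "Local": "100", "Dropped": "0", "Paymethod": "CC",
            "LocalBilltype": "Budget", "LongDistanceBilltype": "Intnl_discount",
            "Usage": "334", "RatePlan": "3"}

payload = build_payload([customer])
print(payload["values"][0])
```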

### (Optional) Step 11: Test Saved Model with Test UI
Once you've saved and deployed the machine learning model you created, there are multiple ways to test it. One is the REST API, as shown in the cells above. Another is the Test UI of the deployed model:
- Click the **Test** tab of the deployed model.
- Enter values for the features (ID, Gender, Status, Children, EstIncome, CarOwner, Age, LongDistance, International, Local, Dropped, Paymethod, LocalBilltype, LongDistanceBilltype, Usage, RatePlan)
- Click **Predict** ==> this should return the predicted likelihood that the customer will churn

### Summary

You have finished working on this hands-on lab. In this notebook you created a model using the SparkML API, deployed it in the Machine Learning service for online (real time) scoring, and tested it using a test client.


Created by **Sidney Phoon** and **Elena Lowery**
<br/>
yfphoon@us.ibm.com
elowery@us.ibm.com
<br/>
Jan 2, 2018

Ported to Watson Studio on IBM Cloud by Joe Kozhaya
<br/>kozhaya@us.ibm.com<br/>
March 26, 2018