# Predict ride preference using IBM Watson Machine Learning

This notebook introduces commands for getting data, basic data cleaning and exploration, pipeline creation, model training, model persistence to the Watson Machine Learning repository, model deployment, and scoring.

Some familiarity with Python is helpful. This notebook uses Python 2 and Apache® Spark 2.0.


## Learning goals

The learning goals of this notebook are:

-  Load a CSV file into an Apache® Spark DataFrame.
-  Explore data.
-  Prepare data for training and evaluation.
-  Create an Apache® Spark machine learning pipeline.
-  Train and evaluate a model.
-  Persist a pipeline and model in the Watson Machine Learning repository.
-  Deploy a model for online scoring using the Watson Machine Learning API.
-  Score sample scoring data using the Watson Machine Learning API.



## Contents

This notebook contains the following parts:

1.	[Setup](#setup)
2.	[Load and explore data](#load)
3.	[Create an Apache Spark machine learning model](#model)
4.	[Persist model in IBM Watson Machine Learning](#save)
5.	[Predict locally and visualize](#predict)
6.	[Deploy and create online scoring endpoint](#deploy)


<a id="setup"></a>
## 1. Setup

Before you use the sample code in this notebook, you must perform the following setup tasks:

-  Create a [Watson Machine Learning Service](https://console.ng.bluemix.net/catalog/services/ibm-watson-machine-learning/) instance (a free plan is offered). 
-  Upload **ride_demo-1.csv** data as a data asset in IBM Data Science Experience.
-  Make sure that you are using a Spark 2.0 kernel.


<a id="load"></a>
## 2.  Load and explore data

IBM Data Science Experience (DSX) makes it easy to load your files with a few clicks!

In [1]:

import ibmos2spark

# @hidden_cell
credentials = {
    'auth_url': 'https://identity.open.softlayer.com',
    'project_id': 'c103edd6ab074e8f967770017c08c779',
    'region': 'dallas',
    'user_id': '70b92ab4ed014fe0b3564f31a53b6522',
    'username': 'member_2c8b4ad8f76fe19de7823563460e482b899f88c0',
    'password': 'ApX1Y]C*#tvNn95j'
}

configuration_name = 'os_549fa3a24c174b679ba88ab0445f1516_configs'
bmos = ibmos2spark.bluemix(sc, credentials, configuration_name)

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

**Action**: Import the data and add `.option('inferSchema', 'true')` so that Spark infers the column types instead of reading every column as a string.

In [2]:

df_data_1 = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema','true')\
  .load(bmos.url('Analytics', 'ride_demo-1.csv'))
df_data_1.take(5)


[Row(age=23, gender=u'M', party_size=1, heat=1, rain=0, Attraction=u'A Pirates Adventure   Treasures of the Seven Seas'),
 Row(age=28, gender=u'M', party_size=1, heat=1, rain=1, Attraction=u'A Pirates Adventure  Treasures of the Seven Seas'),
 Row(age=33, gender=u'M', party_size=1, heat=0, rain=1, Attraction=u'A Pirates Adventure  Treasures of the Seven Seas'),
 Row(age=18, gender=u'M', party_size=2, heat=0, rain=0, Attraction=u'Astro Orbiter'),
 Row(age=25, gender=u'M', party_size=2, heat=0, rain=0, Attraction=u'Astro Orbiter')]
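To illustrate why schema inference matters, here is a minimal pure-Python sketch. It is not how Spark implements `inferSchema`; it only shows that a CSV reader yields strings unless values are converted. The helper names `infer_value` and `read_rows` are hypothetical, and the inline sample reuses a few values from the rows above.

```python
import csv

# Inline sample based on the first rows shown above
# (the Attraction column is left out to keep the sketch short).
SAMPLE_CSV = """age,gender,party_size,heat,rain
23,M,1,1,0
28,M,1,1,1
18,M,2,0,0
"""


def infer_value(text):
    """Hypothetical helper: return an int when the text parses as one, else the string."""
    try:
        return int(text)
    except ValueError:
        return text


def read_rows(csv_text, infer):
    """Read CSV text into a list of dicts, optionally converting integer-like values."""
    reader = csv.DictReader(csv_text.splitlines())
    rows = []
    for row in reader:
        rows.append({k: (infer_value(v) if infer else v) for k, v in row.items()})
    return rows


raw_rows = read_rows(SAMPLE_CSV, infer=False)
typed_rows = read_rows(SAMPLE_CSV, infer=True)

print(type(raw_rows[0]["age"]).__name__)    # without inference, age is a string
print(type(typed_rows[0]["age"]).__name__)  # with inference, age is an integer
```

Numeric column types matter later in the notebook, because the feature vector is built from numeric columns.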

Explore the loaded data by using the following Apache® Spark DataFrame methods:
-  print schema
-  count all the records
-  print top five records

In [3]:
df = df_data_1

df.printSchema()
print "# of records: " + str(df.count())

root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- party_size: integer (nullable = true)
 |-- heat: integer (nullable = true)
 |-- rain: integer (nullable = true)
 |-- Attraction: string (nullable = true)

# of records: 75


The data has 75 rows. We will use five fields (age, gender, party_size, heat, and rain) to predict the sixth field, Attraction, which is the label: the preferred ride.

In [4]:
df.show(5)

+---+------+----------+----+----+--------------------+
|age|gender|party_size|heat|rain|          Attraction|
+---+------+----------+----+----+--------------------+
| 23|     M|         1|   1|   0|A Pirates Adventu...|
| 28|     M|         1|   1|   1|A Pirates Adventu...|
| 33|     M|         1|   0|   1|A Pirates Adventu...|
| 18|     M|         2|   0|   0|       Astro Orbiter|
| 25|     M|         2|   0|   0|       Astro Orbiter|
+---+------+----------+----+----+--------------------+
only showing top 5 rows



The `show(5)` call prints the top five rows as a formatted table.

<a id="model"></a>
## 3. Create an Apache Spark machine learning model

In this section we will prepare data, create an Apache Spark machine learning pipeline, and train a model.


### 3.1:  Prepare data

In this subsection we will split our data into training, test, and predict datasets.

In [5]:
split_data = df.randomSplit([0.7, 0.2, 0.1], 24)

training_data = split_data[0]
test_data = split_data[1]
predict_data = split_data[2]

print "Training records: " + str(training_data.count())
print "Test records: " + str(test_data.count())
print "Prediction records: " + str(predict_data.count())

Training records: 46
Test records: 25
Prediction records: 4


As you can see, our data has been split into three datasets. Because `randomSplit` assigns rows at random, the resulting sizes only approximate the 70/20/10 weights.

-  The training dataset, which is the largest group, is used for training.
-  The test dataset is used to evaluate the model.
-  The predict dataset is used for prediction.
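The split sizes (46, 25, and 4) differ from the exact 52.5, 15, and 7.5 that the weights would suggest for 75 rows. The following pure-Python sketch shows one way a weighted random split can work: each row draws a random number and lands in the bucket whose cumulative weight range contains it. This is an illustration under simplifying assumptions, not Spark's actual implementation, and the function name `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one bucket with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative boundaries, roughly [0.7, 0.9, 1.0] for weights [0.7, 0.2, 0.1].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket catches any draw not claimed earlier,
            # which also guards against floating-point rounding.
            if draw < bound or i == len(bounds) - 1:
                buckets[i].append(row)
                break
    return buckets


rows = list(range(75))  # stand-in for the 75 records
train, test, predict = random_split(rows, [0.7, 0.2, 0.1], seed=24)
print(len(train))
print(len(test))
print(len(predict))
```

The counts vary with the seed, but every row always ends up in exactly one bucket.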

### 3.2:  Create pipeline and train a model

In this section we create an Apache Spark machine learning pipeline and then train the model.

First we need to import several packages that will be used in the next few steps.

In [6]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

Next, we convert all the string fields to numeric values.

In [7]:
stringIndexer_labels = StringIndexer(inputCol="Attraction", outputCol="label").fit(df)
stringIndexer_gender = StringIndexer(inputCol="gender", outputCol="GENDER_IX").fit(df)
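As a rough illustration of what a string indexer does, the sketch below maps each distinct string to a numeric index, with the most frequent value receiving index 0.0. The function name `fit_string_index` and the sample column are hypothetical, and the alphabetical tie-breaking is an assumption of this sketch rather than documented Spark 2.0 behavior.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent value first.

    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical gender column: "F" appears more often, so it gets index 0.0.
genders = ["F", "M", "F", "F", "M"]
gender_index = fit_string_index(genders)
print(gender_index["F"])  # 0.0
print(gender_index["M"])  # 1.0
```

The same idea applies to the Attraction column, where each ride name becomes a numeric label.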


Create a feature vector by combining all features together.

In [8]:
vectorAssembler_features = VectorAssembler(inputCols=["age","GENDER_IX","party_size","heat","rain"], outputCol="features")
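Conceptually, assembling features means collecting the selected columns of a row into a single numeric vector, in the order given by `inputCols`. A minimal sketch, using a plain Python list in place of a Spark vector and a hypothetical helper `assemble_features`:

```python
FEATURE_COLUMNS = ["age", "GENDER_IX", "party_size", "heat", "rain"]


def assemble_features(row, columns=FEATURE_COLUMNS):
    """Return the row's feature values as one list of floats, in column order."""
    return [float(row[name]) for name in columns]


# One illustrative record with gender already indexed (F -> 0.0).
row = {"age": 9, "GENDER_IX": 0.0, "party_size": 2, "heat": 1, "rain": 1}
features = assemble_features(row)
print(features)  # [9.0, 0.0, 2.0, 1.0, 1.0]
```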

Next we define a Random Forest estimator.

In [9]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

Next we define a converter that maps the predicted label indices back to the original attraction names.

In [10]:
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_labels.labels)

Now we will put all the steps into a pipeline. 

In [11]:
pipeline_rf = Pipeline(stages=[stringIndexer_labels,stringIndexer_gender, vectorAssembler_features, rf, labelConverter])

Now we will create a model using our pipeline and the training_data dataset.

In [12]:
model_rf = pipeline_rf.fit(training_data)

Now we will check our model accuracy using our test_data dataset.

In [13]:
predictions = model_rf.transform(test_data)
evaluatorRF = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))

Accuracy = 0.36
Test Error = 0.64
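Accuracy is the fraction of test records whose predicted label matches the true label. With 25 test records, an accuracy of 0.36 corresponds to 9 correct predictions. The sketch below reproduces that arithmetic on made-up labels; the function `accuracy_score` is a hypothetical helper, not part of the Spark API.

```python
def accuracy_score(labels, predictions):
    """Fraction of predictions that equal the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / float(len(labels))


# Illustrative data: 25 test records, 9 of them predicted correctly.
labels = [0.0] * 25
predictions = [0.0] * 9 + [1.0] * 16

accuracy = accuracy_score(labels, predictions)
print("Accuracy = %g" % accuracy)            # Accuracy = 0.36
print("Test Error = %g" % (1.0 - accuracy))  # Test Error = 0.64
```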


At this point we would normally tune the model to reach the desired accuracy. For this example we will move on.

<a id="save"></a>
## 4. Persist model in IBM Watson Machine Learning

In this section you will learn how to store your pipeline and model in the Watson Machine Learning repository by using Python client libraries.

First, you must import client libraries.

**Note**: Apache Spark 2.0 or higher is required.

In [14]:
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact

Authenticate to Watson Machine Learning service on Bluemix.

**Action**: Use your Watson Machine Learning service instance credentials below.

In [15]:
username = '526f830a-db17-4758-a13c-da450f5c49ad'
password = '95033f37-5b87-481c-8ad9-a0ac6d4c6d2a'
service_path = 'https://ibm-watson-ml.mybluemix.net'
instance_id = '40cda31c-7686-4b23-946c-bd2d5bf7fab3'

**Tip**: You can find service_path, username, and password on the **Service Credentials** tab of the service instance created in Bluemix. If you cannot see the **instance_id** field in **Service Credentials**, generate new credentials by clicking the **New credential (+)** button.
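If you prefer not to paste credentials into a notebook cell, one option is to read them from environment variables. The sketch below is only a suggestion: the variable names (`WML_USERNAME` and so on) and the helper `load_wml_credentials` are hypothetical, and the example uses a stand-in dictionary rather than real values.

```python
import os


def load_wml_credentials(environ=os.environ):
    """Read service credentials from environment variables.

    The variable names are hypothetical; use whatever suits your environment.
    """
    names = {
        "username": "WML_USERNAME",
        "password": "WML_PASSWORD",
        "service_path": "WML_SERVICE_PATH",
        "instance_id": "WML_INSTANCE_ID",
    }
    missing = [var for var in names.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(sorted(missing)))
    return {key: environ[var] for key, var in names.items()}


# Example with a stand-in mapping instead of the real environment.
example_env = {
    "WML_USERNAME": "my-username",
    "WML_PASSWORD": "my-password",
    "WML_SERVICE_PATH": "https://ibm-watson-ml.mybluemix.net",
    "WML_INSTANCE_ID": "my-instance-id",
}
creds = load_wml_credentials(example_env)
print(creds["service_path"])
```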

In [16]:
ml_repository_client = MLRepositoryClient(service_path)
ml_repository_client.authorize(username, password)

Create the model artifact, which is an abstraction layer over the trained model.

In [17]:
model_artifact = MLRepositoryArtifact(model_rf, training_data=training_data, name="Ride Prediction with Python")

**Tip**: The MLRepositoryArtifact constructor expects a trained model object, training data, and a model name. The Watson Machine Learning service displays this model name.

### 4.1: Save pipeline and model

In [18]:
saved_model = ml_repository_client.models.save(model_artifact)

List the metadata properties available for the saved model by using the meta.available_props() method.

In [19]:
saved_model.meta.available_props()

['inputDataSchema',
 'evaluationMetrics',
 'pipelineVersionHref',
 'modelVersionHref',
 'trainingDataRef',
 'pipelineType',
 'creationTime',
 'lastUpdated',
 'label',
 'authorEmail',
 'trainingDataSchema',
 'authorName',
 'version',
 'modelType',
 'runtime',
 'evaluationMethod']

**Tip**: **modelVersionHref** is our model's unique ID in Watson Machine Learning.

In [20]:
print saved_model.meta.prop("modelVersionHref")

https://ibm-watson-ml.mybluemix.net/v2/artifacts/models/db518e7e-6202-4e6a-a122-38a88b18d2de/versions/78418b0a-9402-4710-be89-f11ae658ff47
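The href above contains both a model ID and a version ID. As a small sketch, assuming the URL layout `.../models/<model_id>/versions/<version_id>` seen in this output, the two IDs can be extracted with plain string handling. The helper `parse_model_version_href` is hypothetical and not part of the client library.

```python
def parse_model_version_href(href):
    """Split a modelVersionHref into (model_id, version_id).

    Assumes the layout .../models/<model_id>/versions/<version_id>.
    """
    parts = href.rstrip("/").split("/")
    model_id = parts[parts.index("models") + 1]
    version_id = parts[parts.index("versions") + 1]
    return model_id, version_id


href = ("https://ibm-watson-ml.mybluemix.net/v2/artifacts/models/"
        "db518e7e-6202-4e6a-a122-38a88b18d2de/versions/"
        "78418b0a-9402-4710-be89-f11ae658ff47")
model_id, version_id = parse_model_version_href(href)
print(model_id)    # db518e7e-6202-4e6a-a122-38a88b18d2de
print(version_id)  # 78418b0a-9402-4710-be89-f11ae658ff47
```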


### 4.2: Load model

Now that we have saved the model, we will load it and verify its name.

In [21]:
loadedModelArtifact = ml_repository_client.models.get(saved_model.uid)

In [22]:
print str(loadedModelArtifact.name)

Ride Prediction with Python


<a id="predict"></a>
## 5. Predict locally and visualize

In this section we will score the predict data using the loaded model.

### 5.1: Make local prediction using loaded model and predict data

In [23]:
predictions = loadedModelArtifact.model_instance().transform(predict_data)

In [24]:
predictions.show(3)

+---+------+----------+----+----+--------------------+-----+---------+--------------------+--------------------+--------------------+----------+--------------------+
|age|gender|party_size|heat|rain|          Attraction|label|GENDER_IX|            features|       rawPrediction|         probability|prediction|      predictedLabel|
+---+------+----------+----+----+--------------------+-----+---------+--------------------+--------------------+--------------------+----------+--------------------+
|  9|     F|         2|   1|   1|Enchanted Tales w...|  7.0|      0.0|[9.0,0.0,2.0,1.0,...|[7.375,3.125,0.0,...|[0.36875,0.15625,...|       0.0|Mickeys PhilharMagic|
| 12|     F|         2|   0|   1|Enchanted Tales w...|  7.0|      0.0|[12.0,0.0,2.0,0.0...|[13.7954545454545...|[0.68977272727272...|       0.0|Mickeys PhilharMagic|
| 15|     M|         1|   1|   1|The Magic Carpets...| 14.0|      1.0|[15.0,1.0,1.0,1.0...|[0.0,1.1666666666...|[0.0,0.0583333333...|      14.0|The Magic Carpets...|
+---
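In the output above, the visible `probability` entries equal the `rawPrediction` entries divided by 20 (for example, 7.375 / 20 = 0.36875). This is consistent with a random forest of 20 trees, which is the default for `RandomForestClassifier` when `numTrees` is not set. The sketch below shows the normalization on a made-up vector; only its first two entries come from the first row above.

```python
def normalize_votes(raw_prediction):
    """Scale raw class scores so that they sum to 1."""
    total = float(sum(raw_prediction))
    if total == 0.0:
        raise ValueError("raw_prediction must not sum to zero")
    return [score / total for score in raw_prediction]


# Illustrative vector: the first two entries come from the first row above;
# the last entry is made up so that the scores sum to 20.
raw_prediction = [7.375, 3.125, 9.5]
probability = normalize_votes(raw_prediction)
print(probability[:2])  # [0.36875, 0.15625]
```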

In [25]:
predictions.select("predictedLabel").groupBy("predictedLabel").count().show()

+--------------------+-----+
|      predictedLabel|count|
+--------------------+-----+
|Walt Disneys Caro...|    1|
|Mickeys PhilharMagic|    2|
|The Magic Carpets...|    1|
+--------------------+-----+



<a id="deploy"></a>
## 6. Deploy and create online scoring endpoint

In this section you will learn how to create an online scoring endpoint and score a new data record by using the Watson Machine Learning REST API.
For more information about the REST API, see the [Swagger Documentation](http://watson-ml-api.mybluemix.net/).

To work with the Watson Machine Learning REST API you must generate an access token. To do that you can use the following sample code:

In [26]:
import urllib3, requests, json

headers = urllib3.util.make_headers(basic_auth='{}:{}'.format(username, password))
url = '{}/v3/identity/token'.format(service_path)
response = requests.get(url, headers=headers)
mltoken = json.loads(response.text).get('token')
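The token request uses HTTP Basic authentication: the username and password are joined with a colon, Base64-encoded, and sent in an `authorization` header. The sketch below builds such a header with the standard library only, using placeholder credentials. The helper `basic_auth_header` is hypothetical, and the UTF-8 encoding is an assumption of this sketch.

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic authorization header from a username and password."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"authorization": "Basic " + token}


# Placeholder credentials, not real ones.
header_example = basic_auth_header("user", "secret")
print(header_example["authorization"])  # Basic dXNlcjpzZWNyZXQ=
```

Base64 is an encoding, not encryption, so such a header should only be sent over HTTPS.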

Now that we have the token we can create an online scoring endpoint.

First we will check the model for existing deployments and get the deployments URL. Then we will create the online deployment.

In [27]:
published_model_details = service_path + "/v3/wml_instances/" + instance_id + "/published_models/"\
+ loadedModelArtifact.uid 
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken}

response_get_model_details = requests.get(published_model_details, headers=header)

print 'Existing deployment count: ' + str(json.loads(response_get_model_details.text).get('entity').get('deployments').get('count'))
deployments_endpoint = json.loads(response_get_model_details.text).get('entity').get('deployments').get('url')
print deployments_endpoint

Existing deployment count: 0
https://ibm-watson-ml.mybluemix.net/v3/wml_instances/40cda31c-7686-4b23-946c-bd2d5bf7fab3/published_models/db518e7e-6202-4e6a-a122-38a88b18d2de/deployments
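The cell above parses the response text twice and would fail with an unhelpful error if a field were missing. As a sketch of a more defensive variant, the code below parses the JSON once and checks for the `entity.deployments` section. The sample response is a trimmed, hand-written stand-in containing only the fields used in this notebook, and `get_deployments_info` is a hypothetical helper.

```python
import json

# Trimmed, hand-written stand-in for the model details response.
sample_response_text = json.dumps({
    "entity": {
        "deployments": {
            "count": 0,
            "url": "https://ibm-watson-ml.mybluemix.net/v3/wml_instances/"
                   "40cda31c-7686-4b23-946c-bd2d5bf7fab3/published_models/"
                   "db518e7e-6202-4e6a-a122-38a88b18d2de/deployments",
        }
    }
})


def get_deployments_info(response_text):
    """Parse the response once and return (count, url); raise if the section is missing."""
    body = json.loads(response_text)
    deployments = body.get("entity", {}).get("deployments")
    if deployments is None:
        raise ValueError("Response has no entity.deployments section")
    return deployments.get("count"), deployments.get("url")


count, url = get_deployments_info(sample_response_text)
print("Existing deployment count: " + str(count))
print(url)
```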


In [28]:
# "type": "online" requests an online scoring endpoint.
payload_online_endpoint = {"name": "Ride Prediction Deployment", "type": "online",
    "description": "Ride prediction endpoint for suggesting rides to customers."}
response_online = requests.post(deployments_endpoint, json=payload_online_endpoint, headers=header)

scoring_endpoint = json.loads(response_online.text).get('entity').get('scoring_url')
print scoring_endpoint

https://ibm-watson-ml.mybluemix.net/v3/wml_instances/40cda31c-7686-4b23-946c-bd2d5bf7fab3/published_models/db518e7e-6202-4e6a-a122-38a88b18d2de/deployments/f0749ba5-a3e2-45f6-a339-edcf1d456310/online


Now we can send (POST) a new scoring request to our deployed model to get a ride prediction.

In [29]:
payload_scoring = {"fields": ["age","gender","party_size","heat","rain"],"values": [[21,"M",2,0,1]]}
response_scoring = requests.post(scoring_endpoint, json=payload_scoring, headers=header)

# The predicted label is the last element of the first returned row.
print json.loads(response_scoring.text)["values"][0][-1]


Peter Pans Flight
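Reading the last element of the row works as long as the predicted label is the final column. A slightly more robust sketch looks the label up by field name and falls back to the last column. The sample response here is hand-written: the `fields` key in the response and the column name `predictedLabel` at that position are assumptions, since only `values` appears in the code above.

```python
import json

# Hand-written stand-in for a scoring response (structure partly assumed).
sample_scoring_text = json.dumps({
    "fields": ["age", "gender", "party_size", "heat", "rain", "predictedLabel"],
    "values": [[21, "M", 2, 0, 1, "Peter Pans Flight"]],
})


def first_prediction(response_text, label_field="predictedLabel"):
    """Return the predicted label of the first scored row.

    Looks the label up by field name when possible and otherwise
    falls back to the last column of the row.
    """
    body = json.loads(response_text)
    row = body["values"][0]
    fields = body.get("fields", [])
    if label_field in fields:
        return row[fields.index(label_field)]
    return row[-1]


print(first_prediction(sample_scoring_text))  # Peter Pans Flight
```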


**Now we have a working online endpoint to use in our kiosk applications throughout the park.**