# Using Watson Studio Machine Learning Service for Model Training and Making Predictions

This notebook shows how to use machine learning libraries and services from Watson Studio to train, save, deploy and evaluate a model and make a prediction for new data. 

## Table of contents
- [Prepare the environment](#prepare_environment)
- [Load data](#load_data)
- [Access and manipulate data](#access_manipulate_data)
- [Evaluate the model](#evaluate_model)
- [Save the model](#save_model)
- [Make an online scoring prediction](#make_prediction)
- [Summary](#summary)

<a id="prepare_environment"></a>
## Prepare the environment

Import machine learning libraries.

In [1]:
from pyspark.ml.feature import StringIndexer, IndexToString, VectorIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

<a id="load_data"></a>
## Load data 
The 1983 Data Exposition dataset was collected by Ernesto Ramos and David Donoho and describes automobiles. It provides mpg, cylinders, displacement, and other attributes for 406 different cars, each identified by name. The dataset is freely available on the Watson Studio home page.


Perform the following steps to upload this dataset:
1. Go to the <a href="https://dataplatform.cloud.ibm.com/exchange/public/entry/view/c81e9be8daf6941023b9dc86f303053b" target="_blank">Car performance data</a> card on the Watson Studio home page.
1. Click the download button.
1. Click the **Create new** icon on the notebook action bar, and use the **Add data set** button to add the downloaded cars.csv file as a `Local File`.

The data file is listed on the **Local Data** pane in the notebook.



<a id="access_manipulate_data"></a>
## Access and manipulate data

To add the code to access the data file, click the next code cell and select **Insert Spark DataFrame in Python** in the **Insert To Code** drop-down list below the data file in the `Local Data` pane in the notebook.

+---+---------+------+----------+------+------------+----+--------+--------------------+
|mpg|cylinders|engine|horsepower|weight|acceleration|year|  origin|                name|
+---+---------+------+----------+------+------------+----+--------+--------------------+
| 18|        8| 307.0|       130|  3504|        12.0|  70|American|chevrolet chevell...|
| 15|        8| 350.0|       165|  3693|        11.5|  70|American|   buick skylark 320|
| 18|        8| 318.0|       150|  3436|        11.0|  70|American|  plymouth satellite|
| 16|        8| 304.0|       150|  3433|        12.0|  70|American|       amc rebel sst|
| 17|        8| 302.0|       140|  3449|        10.5|  70|American|         ford torino|
+---+---------+------+----------+------+------------+----+--------+--------------------+
only showing top 5 rows



<div class="alert alert-block alert-info"> Note: Make sure the DataFrame variable used in the following cell (<code>df_data_0</code>) matches the variable name in the code generated by Insert To Code.</div> 

Because the `mpg` and `horsepower` columns contain missing values, both columns are excluded from the dataset used for model training.

In [3]:
carsDataRaw = df_data_0
carsModData = carsDataRaw.drop("mpg").drop("horsepower")
carsModData.show(5)

+---------+------+------+------------+----+--------+--------------------+
|cylinders|engine|weight|acceleration|year|  origin|                name|
+---------+------+------+------------+----+--------+--------------------+
|        8| 307.0|  3504|        12.0|  70|American|chevrolet chevell...|
|        8| 350.0|  3693|        11.5|  70|American|   buick skylark 320|
|        8| 318.0|  3436|        11.0|  70|American|  plymouth satellite|
|        8| 304.0|  3433|        12.0|  70|American|       amc rebel sst|
|        8| 302.0|  3449|        10.5|  70|American|         ford torino|
+---------+------+------+------------+----+--------+--------------------+
only showing top 5 rows

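Before dropping columns, it can help to count how many values are missing in each one. The following is a minimal plain-Python sketch of that check. The rows are made-up illustrative values rather than records from cars.csv, and `count_missing` is a hypothetical helper, not part of Spark or Watson Studio.

```python
# Illustrative rows only: these are NOT taken from cars.csv.
rows = [
    {"mpg": 18.0, "horsepower": 130, "weight": 3504},
    {"mpg": None, "horsepower": 165, "weight": 3693},
    {"mpg": 16.0, "horsepower": None, "weight": 3433},
]

def count_missing(records):
    """Return a dict mapping each column name to its number of None values."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts

missing = count_missing(rows)
# Columns with at least one missing value are candidates for exclusion.
columns_to_drop = sorted(name for name, n in missing.items() if n > 0)
print(columns_to_drop)
```

Dropping whole columns, as this notebook does, keeps every row; dropping the affected rows instead would be an alternative when the columns are needed as features.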


For model training, the original dataset is split into a training dataset (about 85% of the rows) and a testing dataset (about 15%).

In [4]:
# Split with weights 0.85 / 0.15 and a fixed seed (24) for reproducibility.
splitted_data = carsModData.randomSplit([0.85, 0.15], 24)
train_data = splitted_data[0]
test_data = splitted_data[1]

print("Number of training records: {}".format(train_data.count()))
print("Number of testing records: {}".format(test_data.count()))

Number of training records: 348
Number of testing records: 58

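To get a feel for why the split sizes above (348 and 58) are only approximately 85% and 15% of 406, here is a minimal plain-Python sketch of a seeded, weighted split. It assumes each row is assigned independently by a random draw; `random_split` is a hypothetical helper written for illustration, and Spark's own sampling is implemented differently, so the counts will not match the notebook output.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one bucket using a seeded uniform draw."""
    total = float(sum(weights))
    # Cumulative, normalized upper bounds, e.g. [0.85, 1.0] for [0.85, 0.15].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            # The last bucket catches any draw not claimed earlier.
            if draw < upper or index == len(bounds) - 1:
                buckets[index].append(row)
                break
    return buckets

rows = list(range(406))  # stand-in for the 406 car records
train_rows, test_rows = random_split(rows, [0.85, 0.15], seed=24)
print(len(train_rows), len(test_rows))
```

Reusing the same seed reproduces the same split, which is the reason the notebook passes a fixed seed to `randomSplit`.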

The next task is to set the input columns for model training and choose an algorithm to train the model. In this example, linear regression is used to predict the `weight` column from the other numeric columns and the indexed `origin` column.

In [5]:
originIndexer = StringIndexer().setInputCol("origin").setOutputCol("origin_code")

vectorAssembler_features = VectorAssembler().setInputCols(["cylinders",
                                                                 "engine",
                                                                 "acceleration",
                                                                 "year",
                                                                 "origin_code"]).setOutputCol("features")

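As a rough mental model of the two feature stages: `StringIndexer` maps each distinct string to a numeric index (by default, the most frequent label gets index 0), and `VectorAssembler` concatenates the chosen columns into one feature vector. The plain-Python sketch below mimics that behavior on made-up rows. `fit_string_index` and `assemble` are hypothetical helpers, not Spark APIs, and the alphabetical tie-break is an assumption made here for determinism.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def assemble(row, columns):
    """Concatenate the named columns of a row into one feature list."""
    return [float(row[c]) for c in columns]

# Illustrative rows only, not records from cars.csv.
rows = [
    {"cylinders": 8, "engine": 307.0, "origin": "American"},
    {"cylinders": 4, "engine": 97.0, "origin": "Japanese"},
    {"cylinders": 8, "engine": 350.0, "origin": "American"},
    {"cylinders": 4, "engine": 110.0, "origin": "European"},
    {"cylinders": 4, "engine": 113.0, "origin": "Japanese"},
    {"cylinders": 6, "engine": 198.0, "origin": "American"},
]

index = fit_string_index(r["origin"] for r in rows)
for r in rows:
    r["origin_code"] = index[r["origin"]]

features = [assemble(r, ["cylinders", "engine", "origin_code"]) for r in rows]
print(index)
print(features[0])
```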
In [6]:
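# Linear regression estimator that predicts "weight" from the assembled features.
lr = LinearRegression().setLabelCol("weight").setFeaturesCol("features")
# Chain indexing, feature assembly, and regression into a single pipeline.
pipeline = Pipeline().setStages([originIndexer, vectorAssembler_features, lr])
model = pipeline.fit(train_data)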

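Conceptually, the `LinearRegression` stage looks for coefficients that minimize the squared error between predictions and the label. The sketch below shows that idea for a single feature in plain Python, using made-up points. It is not Spark's implementation, which handles many features and supports regularization; `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```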
<a id="evaluate_model"></a>
## Evaluate the model
Model performance is evaluated with R-squared (R²) on the test data. The evaluation result can also be saved to Cloudant.

In [7]:
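# Add the indexed and assembled feature columns, then drop the pipeline's
# prediction column so the regression stage can score the rows itself.
testData = model.transform(test_data).drop("prediction")
# Stage 2 is the fitted linear regression model; evaluate() returns a summary.
metric = model.stages[2].evaluate(testData)
print("R Square of Test Data: {}".format(metric.r2))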

R Square of Test Data: 0.863976844308

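For reference, R-squared compares the model's squared error with the squared error of always predicting the mean: R² = 1 - SS_res / SS_tot. The following plain-Python sketch computes it directly. The actual weights are the five rows shown earlier in this notebook, while the predictions are made up for illustration; `r_squared` is a hypothetical helper, not the Spark evaluator.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Weights from the first five rows shown above; predictions are made up.
actual = [3504.0, 3693.0, 3436.0, 3433.0, 3449.0]
predicted = [3480.0, 3650.0, 3460.0, 3400.0, 3500.0]
print(r_squared(actual, predicted))
```

A value of 1.0 means the predictions match the labels exactly, and a value of 0.0 means the model does no better than predicting the mean weight for every car.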

<a id="save_model"></a>
## Save the model
After the model is successfully trained, the repository service is used to save it. The model name and author information can be customized.

In [8]:
from dsx_ml.ml import save
saved_model_output = save(name='CarsModelPython', model=model, test_data=test_data,algorithm_type='Regression')

Using TensorFlow backend.


<div class="alert alert-block alert-info"> Note: The warnings in the cell above are expected.</div> 

<a id="make_prediction"></a>
## Make an online scoring prediction

Upon saving a model, an internal online scoring endpoint is automatically created.

In [9]:
import os
import requests

header_online = {'Content-Type': 'application/json', 'Authorization': os.environ['DSX_TOKEN']}

print(saved_model_output['scoring_endpoint'])

https://dsxl-api/v3/project/score/Python27/spark-2.0/dsx-samples/CarsModelPython/1


New data is provided in the following cell.

In [10]:
new_data = {"cylinders" : 6, "engine" : 289, "acceleration" : 11.1, "year" : 79, "origin" : "American" }
print(new_data)

{'engine': 289, 'acceleration': 11.1, 'cylinders': 6, 'origin': 'American', 'year': 79}

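Because the saved model is the whole pipeline, a new record needs the raw input columns (including the string `origin`, not `origin_code`). A small check like the following can catch a missing field before the request is sent. This is a sketch in plain Python; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here, derived from the columns used in this notebook.

```python
# Raw input columns consumed by the pipeline stages in this notebook.
REQUIRED_COLUMNS = ["cylinders", "engine", "acceleration", "year", "origin"]

def missing_columns(record, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the record."""
    return [name for name in required if name not in record]

new_data = {"cylinders": 6, "engine": 289, "acceleration": 11.1,
            "year": 79, "origin": "American"}
absent = missing_columns(new_data)
if absent:
    raise ValueError("Missing input columns: {}".format(absent))
print("All required columns present")
```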

The model scores the new data and returns a predicted value for `weight`.

In [11]:
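# The endpoint expects a JSON list of records.
payload = [new_data]
# verify=False skips TLS certificate verification for the internal endpoint.
scoring_response = requests.post(saved_model_output['scoring_endpoint'], json=payload, headers=header_online, verify=False)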

print(scoring_response.content)

{"success":true,"description":"Success","object":{"error":"","output":{"classes":[],"predictions":[3553],"probabilities":[]},"returnCode":"0"}}

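The prediction can be pulled out of the JSON response rather than read from the printed text. The sketch below parses the response body shown above with the standard `json` module; the nested keys are taken from that output, and the error handling is an illustrative assumption about how a failed call might be reported.

```python
import json

# Response body copied from the output above.
raw = ('{"success":true,"description":"Success","object":{"error":"",'
       '"output":{"classes":[],"predictions":[3553],"probabilities":[]},'
       '"returnCode":"0"}}')

body = json.loads(raw)
if body.get("success"):
    # One prediction is returned per record in the request payload.
    predicted_weight = body["object"]["output"]["predictions"][0]
    print("Predicted weight: {}".format(predicted_weight))
else:
    raise RuntimeError(body.get("description", "scoring failed"))
```

With a live response object, `scoring_response.json()` would take the place of `json.loads(raw)`.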

<a id="summary"></a>
## Summary
In this sample, you learned how to use Watson Studio machine learning services and libraries. You also learned how to split data for model training; how to customize, save, and deploy the model; and how to use the model endpoint to score new data.

<div class="alert alert-block alert-info"> Note: To save resources and get the best performance, use the code below to stop the kernel before you exit the notebook.</div>

In [None]:
%%javascript
Jupyter.notebook.session.delete();

<hr>
Copyright &copy; IBM Corp. 2017. Released as licensed Sample Materials.