# Using Watson Studio Machine Learning Service for Model Training and Making Predictions

This notebook shows how to use the machine learning libraries and services in Watson Studio to train, save, deploy, and evaluate a model, and then make a prediction for new data.

## Table of contents
- [Prepare the environment](#prepare_environment)
- [Load data](#load_data)
- [Access and manipulate data](#access_manipulate_data)
- [Save the model](#save_model)
- [Evaluate the model](#evaluate_model)
- [Make a prediction](#make_prediction)
- [Summary](#summary)

<a id="prepare_environment"></a>
## Prepare the environment

Import machine learning libraries.

In [1]:
// Import Spark core, SQL, and ML libraries
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}
import org.apache.spark.sql.{SQLContext, SparkSession, Row}

import org.apache.spark.ml.feature.{StringIndexer, IndexToString, VectorIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.classification.{LogisticRegression, DecisionTreeClassifier}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.ibm.transformers.RenameColumn

import com.ibm.analytics.ngp.dsxML._
import com.ibm.analytics.ngp.ingest.Sampling
import com.ibm.analytics.ngp.util._
import com.ibm.analytics.ngp.pipeline.evaluate.{Evaluator,MLProblemType}

<a id="load_data"></a>
## Load data 
The 1983 Data Exposition dataset, collected by Ernesto Ramos and David Donoho, describes automobiles. It provides attributes such as mpg, cylinders, and displacement for 406 different cars, each identified by name. The dataset is freely available on the Watson Studio home page.


Perform the following steps to upload this dataset:
1. Go to the <a href="https://dataplatform.cloud.ibm.com/exchange/public/entry/view/c81e9be8daf6941023b9dc86f303053b" target="_blank">Car performance data</a> card on the Watson Studio home page.
1. Click the download button.
1. Click the **Create new** icon on the notebook action bar, and use the **Add data set** button to add the downloaded `cars.csv` file as a `Local File`.

The data file is listed on the **Local Data** pane in the notebook.



<a id="access_manipulate_data"></a>
## Access and manipulate data

To add the code to access the data file, click the next code cell and select **Insert Spark DataFrame in Scala** in the **Insert To Code** drop-down list below the data file in the `Local Data` pane in the notebook.

+----+---------+------+----------+------+------------+----+--------+--------------------+
| mpg|cylinders|engine|horsepower|weight|acceleration|year|  origin|                name|
+----+---------+------+----------+------+------------+----+--------+--------------------+
|18.0|        8| 307.0|       130|  3504|        12.0|  70|American|chevrolet chevell...|
|15.0|        8| 350.0|       165|  3693|        11.5|  70|American|   buick skylark 320|
|18.0|        8| 318.0|       150|  3436|        11.0|  70|American|  plymouth satellite|
|16.0|        8| 304.0|       150|  3433|        12.0|  70|American|       amc rebel sst|
|17.0|        8| 302.0|       140|  3449|        10.5|  70|American|         ford torino|
+----+---------+------+----------+------+------------+----+--------+--------------------+
only showing top 5 rows



<div class="alert alert-block alert-info"> Note: Make sure the DataFrame variable used in the following cell (<code>df0</code> in this example) matches the variable name in the code generated by Insert To Code.</div> 

Because the `mpg` and `horsepower` columns contain missing values, they are excluded from the dataset used for model training.

In [3]:
val carsDataRaw = df0
val carsModData = carsDataRaw.drop("mpg", "horsepower")
carsModData.show(5)

+---------+------+------+------------+----+--------+--------------------+
|cylinders|engine|weight|acceleration|year|  origin|                name|
+---------+------+------+------------+----+--------+--------------------+
|        8| 307.0|  3504|        12.0|  70|American|chevrolet chevell...|
|        8| 350.0|  3693|        11.5|  70|American|   buick skylark 320|
|        8| 318.0|  3436|        11.0|  70|American|  plymouth satellite|
|        8| 304.0|  3433|        12.0|  70|American|       amc rebel sst|
|        8| 302.0|  3449|        10.5|  70|American|         ford torino|
+---------+------+------+------------+----+--------+--------------------+
only showing top 5 rows



For model training, the original dataset is split into a training dataset (85%) and a testing dataset (15%), using a fixed seed so that the split is reproducible.

In [4]:
val splitted_data = carsModData.randomSplit(Array(0.85, 0.15), 24)
val train_data = splitted_data(0)
val test_data = splitted_data(1)

println("Number of training dataset: " + train_data.count())
println("Number of testing dataset: " + test_data.count())

Number of training dataset: 335
Number of testing dataset: 57
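
The sizes above (335 and 57 of 392 rows, roughly 85.5% and 14.5%) are close to the requested 85/15 weights but do not match them exactly, because rows are assigned by random sampling rather than by exact counts. The following is a minimal plain-Scala sketch of that idea, assuming one seeded random draw per row. `seededSplit` is a hypothetical helper written for illustration; it is not Spark's implementation and will not reproduce the exact rows Spark selects.

```scala
import scala.util.Random

// Hypothetical helper (not part of Spark): sends each item to the training
// side when its seeded random draw falls below trainFraction, and to the
// testing side otherwise.
def seededSplit[A](items: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  // Pair every item with its own random draw in [0, 1).
  val tagged = items.map(item => (item, rng.nextDouble()))
  val (train, test) = tagged.partition { case (_, draw) => draw < trainFraction }
  (train.map(_._1), test.map(_._1))
}
```

Calling `seededSplit(1 to 392, 0.85, 24L)` twice returns the same two groups, which is the property a fixed seed provides; the group sizes vary around the requested proportions.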


The next task is to set the input columns for model training and choose the algorithm used to train the model. In this example, linear regression is used to predict `weight` from the `cylinders`, `engine`, `acceleration`, `year`, and `origin` columns.
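
Because `origin` holds strings, it has to be encoded as a number before it can be placed in a feature vector. Spark's `StringIndexer` does this by assigning an index to each distinct label, by default giving the most frequent label index 0.0. The plain-Scala sketch below mimics that ordering. `indexByFrequency` is a hypothetical helper, the alphabetical tie-break is an assumption made to keep the sketch deterministic, and the sample labels in the usage note are illustrative.

```scala
// Hypothetical helper that mimics frequency-ordered label indexing.
def indexByFrequency(labels: Seq[String]): Map[String, Double] =
  labels
    .groupBy(identity)
    .map { case (label, occurrences) => (label, occurrences.size) }
    .toSeq
    // Most frequent label first; ties are broken alphabetically (assumption).
    .sortBy { case (label, count) => (-count, label) }
    .zipWithIndex
    .map { case ((label, _), index) => label -> index.toDouble }
    .toMap
```

For the illustrative input `Seq("American", "Japanese", "American", "European", "American", "Japanese")`, the helper maps `American` to `0.0`, `Japanese` to `1.0`, and `European` to `2.0`.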

In [5]:
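// Encode the string column `origin` as the numeric column `origin_code`
val originIndexer = new StringIndexer().setInputCol("origin").setOutputCol("origin_code")

// Combine the predictor columns into a single `features` vector column
val featureColumns = Array("cylinders", "engine", "acceleration", "year", "origin_code")
val vectorAssembler_features = new VectorAssembler().setInputCols(featureColumns).setOutputCol("features")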

In [6]:
// Linear regression estimator with `weight` as the label column
val lr = new LinearRegression().setLabelCol("weight").setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(originIndexer, vectorAssembler_features, lr))
val model = pipeline.fit(train_data)
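
Once fitted, a linear regression model predicts `weight` as an intercept plus a weighted sum of the feature values. The following is a minimal plain-Scala sketch of that formula. `predictLinear` is a hypothetical helper, and the numbers in the usage note are illustrative, not coefficients from the trained model.

```scala
// Hypothetical helper: prediction = intercept + sum(coefficient_i * feature_i).
def predictLinear(intercept: Double, coefficients: Seq[Double], features: Seq[Double]): Double = {
  require(coefficients.size == features.size, "one coefficient is needed per feature")
  intercept + coefficients.zip(features).map { case (c, x) => c * x }.sum
}
```

For example, `predictLinear(1.0, Seq(2.0, 3.0), Seq(4.0, 5.0))` evaluates to `24.0`.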

<a id="save_model"></a>
## Save the model
After the model is successfully trained, save the model. The model name can be customized.

In [7]:
val ml_client = ML()
val modelName = "CarsModelScala"
val fileName = "Train+and+predict+with+Scala+machine+learning.ipynb"
val saveResult = ml_client.save(model, train_data, test_data, None, modelName, "", 
                                fileName, 
                                "Regression", 
                                com.ibm.analytics.ngp.dsxML.MetaNames.LABEL_FIELD -> "weight")
print(saveResult.get)

{"path":"/user-home/999/DSX_Projects/dsx-samples/models/CarsModelScala/1","scoring_endpoint":"https://dsxl-api/v3/project/score/spark-2.0/spark-2.0/dsx-samples/CarsModelScala/1"}

<a id="evaluate_model"></a>
## Evaluate the model
Evaluate the model's performance by computing R-squared (the coefficient of determination) on the test data.

In [8]:
import org.apache.spark.ml.regression.{LinearRegressionSummary, LinearRegressionModel}

val testData = model.transform(test_data).drop("prediction")
val metric = model.stages(2).asInstanceOf[LinearRegressionModel].evaluate(testData).asInstanceOf[LinearRegressionSummary]
println(s"R Square of Test Data: ${metric.r2}")

R Square of Test Data: 0.8679219324963019
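
For reference, R-squared compares the model's squared error with the error of always predicting the mean of the actual values; a value near 1 means the model explains most of the variance. The sketch below computes it in plain Scala using the common definition 1 - SS_res / SS_tot. `rSquared` is a hypothetical helper, and it is assumed, not verified here, to agree with `metric.r2` for a model fitted with an intercept.

```scala
// Hypothetical helper: R-squared from paired actual and predicted values.
def rSquared(actual: Seq[Double], predicted: Seq[Double]): Double = {
  require(actual.nonEmpty && actual.size == predicted.size,
    "inputs must be non-empty and of equal length")
  val mean = actual.sum / actual.size
  // Residual sum of squares: error left over after the model's predictions.
  val ssRes = actual.zip(predicted).map { case (y, yHat) => math.pow(y - yHat, 2) }.sum
  // Total sum of squares: error of always predicting the mean.
  // If every actual value is identical this is 0 and the ratio is undefined.
  val ssTot = actual.map(y => math.pow(y - mean, 2)).sum
  1.0 - ssRes / ssTot
}
```

Perfect predictions give `1.0`, and predicting the mean for every row gives `0.0`.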


<a id="make_prediction"></a>
## Make a prediction

After deployment, use the model's scoring endpoint to get predictions for new data through the online scoring service.

In [9]:
import play.api.libs.json._
import spray.json.DefaultJsonProtocol._
import spray.json._
import scalaj.http.{Http, HttpOptions}

val projectName = sys.env("DSX_PROJECT_NAME")

val scoringURL = saveResult.get.fields("scoring_endpoint").convertTo[String]

print(scoringURL)

https://dsxl-api/v3/project/score/spark-2.0/spark-2.0/dsx-samples/CarsModelScala/1

New data is provided in the following cell.

In [10]:
val json_map = Json.toJson(List(Json.toJson(Map("cylinders" -> Json.toJson(6), 
                                                "engine" -> Json.toJson(289), 
                                                "acceleration" -> Json.toJson(11.1), 
                                                "year" -> Json.toJson(79), 
                                                "origin" -> Json.toJson("American")))))
val payload_scoring = Json.stringify(json_map)

print (payload_scoring)

[{"acceleration":11.1,"cylinders":6,"year":79,"origin":"American","engine":289}]

The model scores the new data and returns an estimated `weight`.

In [11]:
val authToken = sys.env("DSX_TOKEN")
val response_scoring = Http(scoringURL).postData(payload_scoring).header("Content-Type", "application/json").header("Authorization", authToken).option(HttpOptions.connTimeout(10000)).option(HttpOptions.readTimeout(50000)).option(HttpOptions.allowUnsafeSSL).asString

print (response_scoring)

HttpResponse({"success":true,"description":"Success","object":{"error":"","output":{"classes":[],"predictions":[3576],"probabilities":[]},"returnCode":"0"}},200,Map(Connection -> Vector(keep-alive), Content-Encoding -> Vector(gzip), Content-Type -> Vector(application/json), Date -> Vector(Tue, 03 Apr 2018 22:38:05 GMT), Server -> Vector(openresty), Status -> Vector(HTTP/1.1 200 OK), Transfer-Encoding -> Vector(chunked), Vary -> Vector(Accept-Encoding), X-Powered-By -> Vector(Express)))

<a id="summary"></a>
## Summary
In this sample, you learned how to use Watson Studio machine learning services and libraries. You also learned how to split data for model training; how to customize, save, and deploy the model; and how to use the model endpoint to score new data.

<div class="alert alert-block alert-info"> Note: To save resources and get the best performance, use the code below to stop the kernel before exiting your notebook.</div>

In [None]:
%%javascript
Jupyter.notebook.session.delete();

<hr>
Copyright &copy; IBM Corp. 2017. Released as licensed Sample Materials.