# Using Watson Studio Machine Learning Service for Model Training and Making Predictions on Hadoop Data

This notebook shows you how to use Python machine learning libraries and services from Watson Studio to train, evaluate, and save a model on a remote Hadoop cluster.

Our input data will reside in HDFS for a registered Hadoop Integration system. To avoid having to copy the data from Hadoop into Watson Studio, we will use a remote Livy session to build the model _within Hadoop itself_. Then we will "pull" the model into Watson Studio and save it to your Watson Studio filesystem, making it available for use with other Watson Studio model management features.

<div class="alert alert-block alert-info">Note: In this exercise we will use Spark in a <i>remote</i> Hadoop session to build the model, and then use Spark in the <i>local</i> notebook to load the model. This means that your remote Hadoop cluster <b>must</b> be running a version of Spark that is compatible with the version of Spark used by this notebook. If your remote Spark version is 2.0, run this notebook with Python 2.7; if your remote Spark version is 2.1 or 2.2, run this notebook with Python 3.5.</div>
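
The version pairing in the note above can be written down as a small lookup. This is only a sketch of the rule as stated in this notebook; the function name `python_for_remote_spark` is hypothetical and is not part of any Watson Studio library.

```python
def python_for_remote_spark(spark_version):
    """Return the notebook Python version that the note above pairs with a remote Spark version."""
    # Pairing taken from the note: Spark 2.0 -> Python 2.7; Spark 2.1 or 2.2 -> Python 3.5.
    pairing = {"2.0": "2.7", "2.1": "3.5", "2.2": "3.5"}
    # Keep only the major.minor part, e.g. "2.2.1" -> "2.2".
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in pairing:
        raise ValueError("No pairing documented for Spark {}".format(spark_version))
    return pairing[major_minor]

print(python_for_remote_spark("2.2.1"))
```

Versions that the note does not mention raise an error rather than guessing a pairing.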

## Table of contents
- [Prerequisites (Admin)](#prerequisites)
- [Create a Remote Livy Session](#create_livy_session)
- [Load Data](#load_data)
- [Access and Manipulate Data](#access_manipulate_data)
- [Evaluate the Model](#evaluate_model)
- [Copy the Model to Watson Studio Local](#model_copy_to_wsl)
- [Save the Model](#save_model)
- [Cleanup the Remote Livy Session](#cleanup_livy_session)
- [Summary](#summary)

<a id='prerequisites'></a>
## Prerequisites (Admin)

In order to run Livy sessions on a remote Hadoop cluster, your Watson Studio admin must first register a Hadoop Integration system with Watson Studio.

Ask your Watson Studio admin to use the **Admin Console => Hadoop Integration** option to register a Hadoop Integration system. **NOTE:** Installation and configuration of IBM's Hadoop Integration (`HI`) service on a Hadoop cluster must be done by a Hadoop admin _before_ that system can be registered with your Watson Studio account.

When your admin indicates that a Hadoop Integration system has been registered, you can proceed with this sample notebook.

In [1]:
# Imports needed for the cells which run locally on Watson Studio.
# `os` is needed later to read the DSX_PROJECT_DIR environment variable.
import os
import dsx_core_utils
import pandas as pd
from sklearn.externals import joblib

<a id='create_livy_session'></a>
## Create a Remote Livy Session

First, let's get a list of registered Hadoop Integration systems. For this example the model is built with the remote Spark installation and no special Python libraries are required, so we do **not** need to look for any particular image.

In [None]:
DSXHI_SYSTEMS = dsx_core_utils.get_dsxhi_info(showSummary=True)

Configure the Spark session that we will run on the selected registered HI system. In this case we want the session to start with 1 GB of driver memory and two Spark executors. **NOTE**: `myConfig` here is optional; if you prefer to use the default configuration, you can omit this cell and remove the `addlConfig` argument in the next cell.

In [3]:
myConfig = {
    "queue": "default",
    "driverMemory": "1G",
    "numExecutors": 2
}

In [4]:
# Set up sparkmagic to connect to the selected registered HI
# system with the specified configs. **NOTE** This notebook
# requires Spark 2, so you should set 'livy' to 'livyspark2'.
HI_CONFIG = dsx_core_utils.setup_livy_sparkmagic(
  system="P-Body", 
  livy="livyspark2",
  addlConfig=myConfig)

# (Re-)load sparkmagic to apply the new configs.
%reload_ext sparkmagic.magics

sparkmagic has been configured to use https://pbody-edge-1.fyre.ibm.com:8443/gateway/dsx-loc-chell-375-master-1/livy2/v1 
success configuring sparkmagic livy.


Now, let's capture some state about the configured Hadoop Integration system, to be used later in this notebook. Then start up a new, remote Livy session to connect to that HI system. **NOTE**: Depending on a) the resources available in the remote Hadoop system and b) the speed of your cluster, attempts to start the session might report errors due to a timeout or due to a session coming up `dead`. In such cases, run **`%spark cleanup`** as a separate cell, then re-run this cell. If session creation continues to fail, contact the Hadoop admin of the target Hadoop cluster to check that everything is configured as expected.

In [5]:
session_name = 'mlsess1'
livy_endpoint = HI_CONFIG['LIVY']
webhdfs_endpoint = HI_CONFIG['WEBHDFS']
%spark add -s $session_name -l python -k -u $livy_endpoint

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
271,application_1536856451431_0224,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.
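
The cleanup-then-retry advice above follows a general pattern that can be sketched in plain Python. This is not sparkmagic code: `start_session` and `cleanup` are hypothetical callables standing in for whatever starts and cleans up a session, since the `%spark` magics themselves are run as notebook cells.

```python
import time

def start_with_retry(start_session, cleanup, attempts=3, delay_seconds=0.0):
    """Call start_session(); on failure, call cleanup() and try again."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return start_session()
        except Exception as err:  # broad catch is deliberate in this sketch
            last_error = err
            cleanup()  # release whatever the failed attempt left behind
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(
        "Session failed after {} attempts: {}".format(attempts, last_error))
```

Capping the number of attempts matters here: if the session keeps failing, the cause is more likely configuration on the Hadoop side than a transient timeout.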


For reference and debugging: print the name of the Hadoop node to which the remote session has been assigned. When "local" files are created within the remote session, they will be written to this node. All of the YARN container artifacts (workspace and temp files) will exist on this node as well.

In [6]:
%%spark -s $session_name
import socket
print("Remote livy session driver: {}".format(socket.gethostname()))

Remote livy session driver: hdp-264-pbody-1.fyre.ibm.com

The following cell, and all subsequent cells which have **`%%spark`** as their first line, will run *remotely*, i.e. within a YARN container that exists on the registered Hadoop Integration system.

In [7]:
%%spark -s $session_name

# Declare imports needed for all of the cells that will run remotely.
import getpass, time, os, shutil
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline

# Load IBM Hadoop Integration utilities to facilitate remote functionality.
# This line assumes that HI version >= X.Y has been installed on the registered
# Hadoop Integration system.
hi_utils_lib = os.getenv("HI_UTILS_PATH", "")
sc.addPyFile("hdfs://{}".format(hi_utils_lib))
import hi_core_utils

# Declare a target HDFS directory path that will be used for our data.
hdfs_dataset_dir = "/user/{}/datasets".format(getpass.getuser())
input_ds = "{}/{}".format(hdfs_dataset_dir, "cars.csv")

# Create target hdfs directory, if it does not already exist.
hi_core_utils.run_command("hdfs dfs -mkdir -p {}".format(hdfs_dataset_dir))

<a id="load_data"></a>
## Load Data 
The 1983 Data Exposition dataset was collected by Ernesto Ramos and David Donoho and deals with automobiles. Data on mpg, cylinders, displacement, and other attributes were provided for 406 different cars, each identified by name. The dataset is freely available on the Watson Studio home page.

Perform the following steps to upload this dataset:
1. Go to the <a href="https://dataplatform.cloud.ibm.com/exchange/public/entry/view/c81e9be8daf6941023b9dc86f303053b" target="_blank">Car performance data</a> card on the Watson Studio home page.
1. Click the download button.
1. Click the **Create new** icon on the notebook action bar, and use the **Add data set** button to add the downloaded cars.csv file as a `Local File`. 

The data file is listed on the **Local Data** pane in the notebook.

Now, let's load our test data into HDFS. For the purposes of this sample, our data is small and comes from the local `cars.csv` file created above. We do not _need_ to put it into HDFS for this example--but we choose to do so for demonstration purposes. In a real scenario the desired data should already be loaded into HDFS.

In [8]:
# Redeclare the dataset dir locally--the earlier declaration was in the _remote_
# session so it is not available here.

# ** NOTE ** Replace {your-username-here} with your actual user name.
hdfs_dataset_dir = "/user/{your-username-here}/datasets"

# Upload the saved csv file from Local to the remote HDFS.
input_csv = os.environ["DSX_PROJECT_DIR"] + "/datasets/cars.csv"
dsx_core_utils.upload_hdfs_file(webhdfs_endpoint, input_csv, "{}/cars.csv".format(hdfs_dataset_dir))

upload success
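
The internals of `dsx_core_utils.upload_hdfs_file` are not shown in this notebook. As background, and assuming the endpoint speaks the standard WebHDFS REST API, a file upload begins with an HTTP `PUT` that uses `op=CREATE`. The sketch below only builds that URL; the helper name, the host, and the user name are hypothetical, and the base URL is assumed to end in `/webhdfs/v1`.

```python
from urllib.parse import quote, urlencode

def webhdfs_create_url(base_url, hdfs_path, overwrite=True):
    """Build the URL for the first step of a WebHDFS upload (op=CREATE)."""
    # A list of pairs keeps the query parameters in a fixed order.
    query = urlencode([("op", "CREATE"), ("overwrite", str(overwrite).lower())])
    # quote() escapes characters such as spaces but leaves "/" intact.
    return "{}{}?{}".format(base_url.rstrip("/"), quote(hdfs_path), query)

# Hypothetical endpoint and user name, for illustration only.
print(webhdfs_create_url(
    "https://example.com:8443/gateway/default/webhdfs/v1",
    "/user/user1/datasets/cars.csv"))
```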


<a id="access_manipulate_data"></a>
## Access and Manipulate Data

Now use Spark to read the data, as a **Spark dataframe**, from HDFS.

In [9]:
%%spark -s $session_name

df_data_0 = spark.read.format(
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").option(
    "header", "true").option("inferSchema", "true").load(input_ds)

df_data_0.show(5)

+---+---------+------+----------+------+------------+----+--------+--------------------+
|mpg|cylinders|engine|horsepower|weight|acceleration|year|  origin|                name|
+---+---------+------+----------+------+------------+----+--------+--------------------+
| 18|        8| 307.0|       130|  3504|        12.0|  70|American|chevrolet chevell...|
| 15|        8| 350.0|       165|  3693|        11.5|  70|American|   buick skylark 320|
| 18|        8| 318.0|       150|  3436|        11.0|  70|American|  plymouth satellite|
| 16|        8| 304.0|       150|  3433|        12.0|  70|American|       amc rebel sst|
| 17|        8| 302.0|       140|  3449|        10.5|  70|American|         ford torino|
+---+---------+------+----------+------+------------+----+--------+--------------------+
only showing top 5 rows

Because the `mpg` and `horsepower` columns contain missing values, both columns are excluded from the dataset used for model training.

In [10]:
%%spark -s $session_name
carsDataRaw = df_data_0
carsModData = carsDataRaw.drop("mpg").drop("horsepower")
carsModData.show(5)

+---------+------+------+------------+----+--------+--------------------+
|cylinders|engine|weight|acceleration|year|  origin|                name|
+---------+------+------+------------+----+--------+--------------------+
|        8| 307.0|  3504|        12.0|  70|American|chevrolet chevell...|
|        8| 350.0|  3693|        11.5|  70|American|   buick skylark 320|
|        8| 318.0|  3436|        11.0|  70|American|  plymouth satellite|
|        8| 304.0|  3433|        12.0|  70|American|       amc rebel sst|
|        8| 302.0|  3449|        10.5|  70|American|         ford torino|
+---------+------+------+------------+----+--------+--------------------+
only showing top 5 rows
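
To see which columns have gaps before deciding what to drop, you can count empty cells per column. The sketch below uses only the Python standard library on a small, hypothetical sample; it assumes missing values appear as empty cells, which may not match how the real `cars.csv` encodes them.

```python
import csv
import io

# Hypothetical sample rows; the real cars.csv has 406 rows.
SAMPLE = """mpg,cylinders,horsepower,weight
18,8,130,3504
,8,165,3693
16,8,,3433
"""

def count_missing(csv_text):
    """Return {column: number of empty cells} for a CSV string with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts

print(count_missing(SAMPLE))
```

Dropping a whole column, as this notebook does, is the simplest option; dropping or imputing individual rows is an alternative when the column is needed as a feature.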

In the model training process, the original dataset will be split into a training dataset and a testing dataset.

In [11]:
%%spark -s $session_name

splitted_data = carsModData.randomSplit([0.85, 0.15], 24)
train_data = splitted_data[0]
test_data = splitted_data[1]

print("Number of training dataset: {}".format(train_data.count()))
print("Number of testing dataset: {}".format(test_data.count()))

Number of training dataset: 348
Number of testing dataset: 58
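
The counts above (348 and 58 out of 406) are close to, but not exactly, an 85/15 split. That is expected when each row is assigned to a split independently at random. The sketch below illustrates the idea in plain Python; it is not Spark's implementation, and the same seed will not reproduce Spark's counts.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to train or test using a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

train, test = random_split(list(range(406)), 0.85, seed=24)
print(len(train), len(test))  # sizes only approximate an 85/15 split
```

Fixing the seed makes the split repeatable, so the reported metrics can be compared from run to run.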

Now set the input columns for model training, and choose the algorithm used to train the model. In this example, linear regression is used to predict `weight` from the other columns in the dataset.

In [12]:
%%spark -s $session_name
originIndexer = StringIndexer().setInputCol("origin").setOutputCol("origin_code")
vectorAssembler_features = VectorAssembler().setInputCols(
    ["cylinders", "engine", "acceleration", "year", "origin_code"]).setOutputCol("features")

In [13]:
%%spark -s $session_name
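# Linear regression estimator that predicts `weight` from the assembled feature vector.
lr = LinearRegression().setLabelCol("weight").setFeaturesCol("features")
# Pipeline stages run in order: index `origin`, assemble features, fit the regression.
pipeline = Pipeline().setStages([originIndexer, vectorAssembler_features, lr])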
model = pipeline.fit(train_data)
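
The `StringIndexer` stage in the pipeline above turns the text column `origin` into a numeric `origin_code`. By default it assigns index 0.0 to the most frequent label, 1.0 to the next, and so on. The plain-Python sketch below mimics that ordering on a hypothetical sample of values; it is an illustration only and does not define how ties between equally frequent labels are handled.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first (index 0.0)."""
    ordered = [label for label, _ in Counter(values).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample of the `origin` column.
origins = ["American", "American", "American", "Japanese", "Japanese", "European"]
index = fit_string_index(origins)
print(index)

# The assembler then places the numeric columns and the code side by side,
# in the order given to setInputCols.
row = {"cylinders": 8, "engine": 307.0, "acceleration": 12.0, "year": 70, "origin": "American"}
features = [row["cylinders"], row["engine"], row["acceleration"], row["year"], index[row["origin"]]]
print(features)
```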

<a id="evaluate_model"></a>
## Evaluate the Model
Model performance can be evaluated using R² (the coefficient of determination) computed on the test data.

In [14]:
%%spark -s $session_name
testData = model.transform(test_data).drop("prediction")
metric = model.stages[2].evaluate(testData)
print("R Square of Test Data: {}".format(metric.r2))

R Square of Test Data: 0.863976844308
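
For reference, R² compares the model's squared errors with those of a baseline that always predicts the mean: R² = 1 - SS_res / SS_tot. The sketch below computes it in plain Python; the weights and predictions are hypothetical and are not taken from the model trained above.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / float(len(actual))
    # Residual sum of squares: how far the predictions are from the actual values.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: how far the actual values are from their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical weights and predictions, for illustration only.
print(r_squared([3504, 3693, 3436, 3433], [3450, 3700, 3480, 3400]))
```

A value of 1.0 means perfect predictions, and 0.0 means the model does no better than predicting the mean.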

<a id='model_copy_to_wsl'></a>
## Copy the Model to Watson Studio Local

The model now exists within the memory of the remote Livy session. In order to use it in Watson Studio model management, we need to copy it to the local Watson Studio environment. This is done in two parts.

### Write the model to HDFS
First, in the _remote_ session, we use a Hadoop Integration utility method to write the model to HDFS.

In [15]:
%%spark -s $session_name
hi_core_utils.write_model_to_hdfs(model=model, model_name="ml_cars_model")

{'path': 'hdfs:///user/user1/.dsxhi/models/ml_cars_model/3/model', 'version': 3, 'name': 'ml_cars_model', 'latest_version': 3}

### Load the model from HDFS into Watson Studio
Then, on the Watson Studio _local_ side, use a Watson Studio utility method to load the model from HDFS into memory. Note that the model name we use here should match the one we used in the previous cell, when we wrote the model to HDFS.

Note also that this cell **does not** begin with the **`%%spark`** line, which means it is running locally in your Watson Studio.

In [16]:
ml_cars = dsx_core_utils.load_model_from_hdfs(webhdfs_endpoint, model_name="ml_cars_model")

Model loaded from hdfs:///user/user1/.dsxhi/models/ml_cars_model/3/model.tar.gz
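
The log line above refers to a `model.tar.gz` file. Spark ML models are saved as a directory tree, so packing that tree into a single archive is a natural way to move it between systems. The sketch below shows such a round trip with the standard library; it is an assumption about the general approach, not the code used by `hi_core_utils` or `dsx_core_utils`, and the file names are stand-ins.

```python
import os
import tarfile
import tempfile

def pack_directory(src_dir, archive_path):
    """Write src_dir (and everything under it) into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))

def unpack_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)

def round_trip_demo():
    """Pack a stand-in model directory, unpack it elsewhere, and return the restored text."""
    with tempfile.TemporaryDirectory() as work:
        model_dir = os.path.join(work, "model")
        os.makedirs(os.path.join(model_dir, "metadata"))
        with open(os.path.join(model_dir, "metadata", "part-00000"), "w") as handle:
            handle.write("stand-in metadata")
        archive = os.path.join(work, "model.tar.gz")
        pack_directory(model_dir, archive)
        restored = os.path.join(work, "restored")
        os.makedirs(restored)
        unpack_archive(archive, restored)
        with open(os.path.join(restored, "model", "metadata", "part-00000")) as handle:
            return handle.read()

print(round_trip_demo())
```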


<a id='save_model'></a>
## Save the Model
We can now save the Spark model to the Watson Studio filesystem for publishing, scoring, deployment, and evaluations.

When invoking the `save` function we want to pass a dataframe for **`test_data`** as an argument. By doing so we allow the `save` function to a) determine the schema of the test data automatically, and b) find an example row that can be used elsewhere in the WSL model management UI (e.g., for real-time scoring).

At this point the desired dataframe exists within the _remote_ Livy session, which means it is not directly accessible from the local notebook session. However, we can use `sparkmagic` to pull a **single** row ("`-n 1`") from the remote dataframe. This gives us the minimum information we need from the test data **without** having to read the full dataset from HDFS.

Here we load one row of data from the remote dataframe into a local dataframe named **`cars_test_data`**.

In [17]:
%%spark -s $session_name -n 1 -o cars_test_data
cars_test_data = test_data

The above cell will load the data as a pandas dataframe, **`cars_test_data`**, but the `save` call below needs it to be a Spark dataframe since we're dealing with a Spark model. So we have to convert it into a Spark dataframe.

In [18]:
from pyspark.sql import SQLContext
test_data = SQLContext(sc).createDataFrame(cars_test_data)

Now that we have our **`test_data`** dataframe, let's import the `save` function from the `dsx_ml.ml` library and save the model.

**NOTE**: Since we're using a dataframe with a **single** row, i.e. partial data, we choose to skip calculation of performance metrics for the saved model ("`skip_metrics = True`") since metrics based on a single row are not useful.

In [19]:
from dsx_ml.ml import save
save(name='Cars ML via Hadoop', model=ml_cars, test_data=test_data, algorithm_type='Regression', skip_metrics = True)

Using TensorFlow backend.


{'path': '/user-home/1001/DSX_Projects/Models on Hadoop/models/Cars ML via Hadoop/22',
 'scoring_endpoint': 'https://dsxl-api/v3/project/score/Python35/spark-2.2/Models%20on%20Hadoop/Cars%20ML%20via%20Hadoop/22'}

<a id='cleanup_livy_session'></a>
## Cleanup the Remote Livy Session
We're done with our model and we have successfully saved it to Watson Studio. Let's clean up our remote Livy session.
This will terminate the session and release resources back to the remote Hadoop Integration system.

In [20]:
%spark cleanup

<a id='summary'></a>
## Summary
In this notebook you learned how to create a Spark model using machine learning libraries _on a registered Hadoop Integration system_, allowing you to create the model where the data resides, instead of having to copy your data into the Watson Studio environment.  Once the model was created you were able to save it in the Watson Studio environment, where it can now be used as input for other Watson Studio model management features.

<div class="alert alert-block alert-info">Note: To save resources and get the best performance, use the code below to stop the kernel before exiting your notebook.</div>

In [None]:
%%javascript
Jupyter.notebook.session.delete();

<hr>
Copyright &copy; IBM Corp. 2018. Released as licensed Sample Materials.