# Machine Learning with Spark ML

### In this notebook, we will explore machine learning using Spark ML. We will exploit Spark ML's high-level APIs built on top of DataFrames to create and tune machine learning pipelines. Spark ML Pipelines enable combining multiple algorithms into a single pipeline or workflow. We will utilize Spark ML's feature transformers to convert, modify and scale the features that will be used to develop the machine learning model. Finally, we will evaluate and cross validate our model to demonstrate the process of determining a best fit model and load the results in the database.

### We are using machine learning to try to predict records that a human has not seen or vetted before. We will use these predictions to sort the highest priority records for a human to look at. We will use as a training set for the algorithm fake data that has been vetted by an analyst as high, medium or low.¶

### We will use generated travel data that has been examined for patterns of Human Trafficking from a DB2 table to do the machine learning.  We loaded this data in Lab 1.


## Table of contents

1. [Create Version](#version)
1. [Import Libraries](#install)
1. [Read from Object Storage](#object)
1. [Transform the data](#transform)
1. [Feature Engineering](#engineering)
1. [Model the data](#model)
1. [Setup the Pipeline](#pipeline)
1. [Train the model](#train)
1. [Evaluate results](#evaluate)
1. [Hyperparameter Tuning](#tuning)
1. [Score the records](#score)
1. [Insert Credentials](#credentials)
1. [Write Results](#write)
1. [Create New Version](#version2)
1. [Schedule Job](#schedule)
1. [Revert to Version](#revert)
1. [Even More Help](#help)


<a id="version"></a>
## Create Version 

Save a version of the notebook by selecting <b>File</b> > <b>Save Version</b> 
<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/jpatter/LMCO/master/Lab-1/images/FileOptions.PNG" > or by selecting the <b>Versions</b> icon and selecting <b>Save Version</b>. <img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/jpatter/LMCO/master/Lab-1/images/versions-button.png" ><br>
You can have up to ten (10) versions of a notebook.   Notebook versions are saved in a FIFO manner.

## Verify Spark version and existence of Spark

In [1]:
print('The spark version is {}.'.format(spark.version))

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190606010358-0000
KERNEL_ID = 3cb2ddf6-6f96-4f40-ad4d-7af38e94e1dc
The spark version is 2.3.3.


<a id="install"></a>
##  Import the required libraries

In [2]:
#Imports for Spark
from pyspark.ml.feature import StringIndexer, IndexToString
from pyspark.ml.feature import Bucketizer
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.feature import Normalizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.classification import NaiveBayes, DecisionTreeClassifier
from pyspark.sql.functions import year
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Imports for pixiedust
from pixiedust.display import *

Pixiedust database opened successfully
Table VERSION_TRACKER created successfully
Table METRICS_TRACKER created successfully

Share anonymous install statistics? (opt-out instructions)

PixieDust will record metadata on its environment the next time the package is installed or updated. The data is anonymized and aggregated to help plan for future releases, and records only the following values:

{
   "data_sent": currentDate,
   "runtime": "python",
   "application_version": currentPixiedustVersion,
   "space_id": nonIdentifyingUniqueId,
   "config": {
       "repository_id": "https://github.com/ibm-watson-data-lab/pixiedust",
       "target_runtimes": ["Data Science Experience"],
       "event_id": "web",
       "event_organizer": "dev-journeys"
   }
}
You can opt out by calling pixiedust.optOut() in a new cell.


[31mPixiedust runtime updated. Please restart kernel[0m
Table SPARK_PACKAGES created successfully
Table USER_PREFERENCES created successfully
Table service_connections created successfully


<a id="object"></a>
## Read Data Asset - female_human_trafficking
Make sure the cursor is in the cell below (after the comments) before you follow these steps. <br><br>
Select the <b>Find and Add Data</b> icon <br>
<img alt="IBM Bluemix.Get started now" align="left" src="https://raw.githubusercontent.com/bleonardb3/WatsonStudio/master/Lab-1/images/onezeroicon.png" > <br><br><br><br>

Select the <b>Files</b> view then click on <b>Insert to code</b> underneath female_human_trafficking, then click on <b>Insert SparkSession DataFrame </b>

<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/bleonardb3/DS_POT_02-07/master/Lab-2/images/InsertFemaleHumanTraffickingCode.png" ><br><br>

Rename the result to <b>trafficking_df</b> to ensure compliance with the following cells.

In [3]:
# Insert SparkSession DataFrame code in this cell after the comments. 
# make CERTAIN to rename the default dataframe name (df_data_1 or df_data_2 or df_data_3, etc) to trafficking_df
# Put cursor on the next line to Insesrt to code. 
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# @hidden_cell
# The following code is used to access your data and contains your credentials.
# You might want to remove those credentials before you share your notebook.

properties_88b30a37b5ad43b3a04957e342f43c15 = {
    'jdbcurl': 'jdbc:db2://dashdb-entry-yp-dal09-08.services.dal.bluemix.net:50000/BLUDB',
    'user': 'dash100302',
    'password': 'SWwY3_9sc_Kx'
}

trafficking_df = spark.read.jdbc(properties_88b30a37b5ad43b3a04957e342f43c15['jdbcurl'], table='DASH100302.FEMALE_HUMAN_TRAFFICKING', properties=properties_88b30a37b5ad43b3a04957e342f43c15)
trafficking_df.head()


Row(INTERNAL_ID=277, VETTING_LEVEL=100, DESCRIPTION='NA', NAME='Kaitlin Young', GENDER='F', BIRTH_DATE=datetime.date(2001, 8, 13), BIRTH_COUNTRY='Ghana', BIRTH_COUNTRY_CODE='GH', OCCUPATION='Press photographer', ADDRESS='905 Herrera Park Ste 485, Yorktown Hieghts, New York 10598', SSN='474-81-5023', PASSPORT_NUMBER=889016907, PASSPORT_COUNTRY='Ghana', PASSPORT_COUNTRY_CODE='GH', COUNTRIES_VISITED='JO,OM', COUNTRIES_VISITED_COUNT=2, ARRIVAL_AIRPORT_COUNTRY_CODE='US', ARRIVAL_AIRPORT_IATA='DOV', ARRIVAL_AIRPORT_MUNICIPALITY='Dover', ARRIVAL_AIRPORT_REGION='US-DE', DEPARTURE_AIRPORT_COUNTRY_CODE='JO', DEPARTURE_AIRPORT_IATA='AMM', DEPARTURE_AIRPORT_MUNICIPALITY='Amman', DEPARTURE_AIRPORT_REGION='JO-AM', UUID='a4c9a36e-413e-46fc-add5-d02f8a7c3983', AGE=15)

## Read Data Asset - Occupations
The occupations listed in the female human trafficking file are too numerous to use as input to a machine learning model. We will categorize these occupations into 15 categories by joining with two other files. The Occupation.csv file contains a mapping of the occupations in the female human trafficking table to a category code. The Categories.csv file contains each code followed by the category name. This information needs to be joined to the female human trafficking table.  

Follow the same procedure as above to insert a SparkDataFrame for Occupations 

In [4]:
# Insert SparkSession DataFrame code in this cell after the comments
# make CERTAIN to rename the default dataframe name (df_data_1 or df_data_2 or df_data_3,etc) to occupations
#Put cursor on the next line to Insert to code
import ibmos2spark
# @hidden_cell
credentials = {
    'endpoint': 'https://s3.us-south.objectstorage.service.networklayer.com',
    'service_id': 'f32f1376-17a7-4060-a111-aa7920f8b611',
    'iam_service_endpoint': 'https://iam.ng.bluemix.net/oidc/token',
    'api_key': 'YiEjcvXfkmqK5opMPNuloS95DeFfwb7_PzdwKh4UOOcl'
}

configuration_name = 'os_fe425257ac4f4a1d98451f2fb157d912_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
occupations = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('data_asset/Occupation_BJ-oz4VA4.csv', 'watsonstudioprojectd-datacatalog-bj0twxea4'))
occupations.take(5)


[Row(OCCUPATION='Chiropodist', Code='6'),
 Row(OCCUPATION='Engineer, structural', Code='2'),
 Row(OCCUPATION='Engineer, land', Code='2'),
 Row(OCCUPATION='Careers adviser', Code='15'),
 Row(OCCUPATION='Geneticist, molecular', Code='7')]

## Read Data Asset - Categories
Follow the same procedure as above to insert a SparkDataFrame for Categories

In [5]:
# Insert SparkSession DataFrame code in this cell after the comments
# make CERTAIN to rename the default dataframe name (df_data_1 or df_data_2 or df_data_3,etc) to categories
#Put cursor on the next line to Insert to code
categories = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('data_asset/Categories_rkh3QNNR4.csv', 'watsonstudioprojectd-datacatalog-bj0twxea4'))
categories.take(5)


[Row(Code='1', Category='Sports/Travel'),
 Row(Code='2', Category='Engineering'),
 Row(Code='3', Category='Information Technology'),
 Row(Code='4', Category='Journalism'),
 Row(Code='5', Category='Government')]

## Join the Occupation mapping file content with the Category file content. 
The "Code" column serves as the matching key. Note that we can join the two flat files as if they were database tables using the DataFrame API.  In a similar fashion we will join the resulting DataFrame with the trafficking_df Dataframe. 

In [6]:
occupations_df = occupations.join(categories,'Code','inner').drop('Code')
occupations_df.show()

+--------------------+------------+
|          OCCUPATION|    Category|
+--------------------+------------+
|         Chiropodist|     Medical|
|Engineer, structural| Engineering|
|      Engineer, land| Engineering|
|     Careers adviser|       Other|
|Geneticist, molec...|     Science|
|Electronics engineer| Engineering|
|Editor, commissio...|  Journalism|
|              Writer|  Journalism|
|Recruitment consu...|       Other|
|     Theatre manager|        Arts|
|        Statistician| Engineering|
|Merchant navy off...|       Other|
|Advertising accou...| Advertising|
|Administrator, lo...|  Government|
| Corporate treasurer|     Finance|
|           Paramedic|     Medical|
|Special effects a...|        Arts|
|           Architect|Construction|
|               Actor|        Arts|
|      Science writer|  Journalism|
+--------------------+------------+
only showing top 20 rows



In [7]:
trafficking_df = trafficking_df.join(occupations_df,'OCCUPATION','inner')
trafficking_df.show()

+--------------------+-----------+-------------+-----------+--------------------+------+----------+-------------+------------------+--------------------+-----------+---------------+----------------+---------------------+--------------------+-----------------------+----------------------------+--------------------+----------------------------+----------------------+------------------------------+----------------------+------------------------------+------------------------+--------------------+---+--------------------+
|          OCCUPATION|INTERNAL_ID|VETTING_LEVEL|DESCRIPTION|                NAME|GENDER|BIRTH_DATE|BIRTH_COUNTRY|BIRTH_COUNTRY_CODE|             ADDRESS|        SSN|PASSPORT_NUMBER|PASSPORT_COUNTRY|PASSPORT_COUNTRY_CODE|   COUNTRIES_VISITED|COUNTRIES_VISITED_COUNT|ARRIVAL_AIRPORT_COUNTRY_CODE|ARRIVAL_AIRPORT_IATA|ARRIVAL_AIRPORT_MUNICIPALITY|ARRIVAL_AIRPORT_REGION|DEPARTURE_AIRPORT_COUNTRY_CODE|DEPARTURE_AIRPORT_IATA|DEPARTURE_AIRPORT_MUNICIPALITY|DEPARTURE_AIRPORT_REGION

<a id="transform"></a>
## Identify our labels and transform 

We will use the 'VETTING_LEVEL' column as a label for training the machine learning model.  This is where our analyst has marked the data as vetted.  

Spark ML requires that that the labels are data type Double, so we will cast the  column as Double.

withColumn() is a Spark SQL way to manipulate a dataframe.  Since an RDD is immutable, we create a new RDD each time we transform.  This code creates a new column VettingTemp and sets it to the values in "VETTING_LEVEL" cast to a Double.    It then drops column VETTING_LEVEL and renames column VettingTemp to VETTING_LEVEL.

We will also do a similar manipulation for the COUNTRIES_VISITED_COUNT

In [8]:
DataWithLabels = (trafficking_df.withColumn("VettingTemp", trafficking_df["VETTING_LEVEL"]
    .cast("Double")).drop("VETTING_LEVEL").withColumnRenamed("VettingTemp", "VETTING_LEVEL"))

DataWithLabels = (trafficking_df.withColumn("CountriesVisitedCountTemp", trafficking_df["COUNTRIES_VISITED_COUNT"]
    .cast("Double")).drop("COUNTRIES_VISITED_COUNT").withColumnRenamed("CountriesVisitedCountTemp", "COUNTRIES_VISITED_COUNT"))

We want to use year of birth intead of date of birth in our learning.  

Another way to transform an rdd in Spark is using SQL Syntax.  Here, we will be adding a new field, BIRTH_YEAR to our vetting set.  We will also just select the fields we need.

In [9]:
DataWithLabels.createOrReplaceTempView("VettingData")
AllVettingData = sqlContext.sql ("SELECT UUID, VETTING_LEVEL, NAME, Category, COUNTRIES_VISITED_COUNT, PASSPORT_COUNTRY_CODE, year(BIRTH_DATE) as BIRTH_YEAR, 1 as Counter FROM VettingData")
FilteredVettingData = AllVettingData.filter("VETTING_LEVEL==100")
FilteredVettingData.count()

816

Use pixiedust to visually explore the data.

In [None]:
display(AllVettingData)

UUID,VETTING_LEVEL,NAME,Category,COUNTRIES_VISITED_COUNT,PASSPORT_COUNTRY_CODE,BIRTH_YEAR,Counter
5fcbbf15-8268-4306-ac26-406c2faae251,30,Stacey Courtney Gonzalez,Other,1.0,GH,1984,1
6d1fe59a-6828-4bc5-9ded-2194aa238461,30,Erika Patie Fowler,Science,2.0,GH,1998,1
e1ed476e-3c06-4572-9873-f8645a46615b,100,Amanda Dinnie Young,Medical,4.0,GH,1975,1
e37e6245-e542-4377-b1ef-2e3a2bd95cdb,100,Shari Adams,Construction,8.0,GH,1981,1
d06dc97a-7bbe-4151-840b-c5a65328dcd0,10,Natasha Wright,Education,10.0,GH,1998,1
0f1e790d-e94f-4b16-92c2-6eda38ea6ec0,10,Nancy Stephanie Taylor,Other,6.0,GH,1988,1
652f9d66-c58f-49e6-9fcf-dc29299789fe,100,Sarah Kimme Miller,Medical,5.0,GH,2000,1
e24eb501-5425-4e92-b1e3-4d2d18d2f3a0,100,Diane Welch,Advertising,3.0,GH,1988,1
13db12e6-18c1-45d7-aa7a-630642070730,20,Sandra Walker,Other,3.0,GH,1993,1
7a0fdbf3-2dc0-42e8-901c-2f0a9ec54df5,100,Amanda May,Medical,3.0,GH,1971,1


Now, let's look at the data we have:

VETTING_LEVEL is in four different statuses:

    10 - HIGH
    
    20 - MEDIUM
    
    30 - LOW
    
    100 - Unlabeled


Print the total number of vetting statuses 

In [11]:
print('The number of rows labeled high is {}.'.format(AllVettingData.filter(AllVettingData['VETTING_LEVEL'] == 10).count()))
print('The number of rows labeled medium is {}.'.format(AllVettingData.filter(AllVettingData['VETTING_LEVEL'] == 20).count()))
print('The number of rows labeled low is {}.'.format(AllVettingData.filter(AllVettingData['VETTING_LEVEL'] == 30).count()))
print('The number of unlabeled rows is {}.'.format(AllVettingData.filter(AllVettingData['VETTING_LEVEL'] == 100).count()))

The number of rows labeled high is 69.
The number of rows labeled medium is 95.
The number of rows labeled low is 105.
The number of unlabeled rows is 816.


The majority of the data has not been labeled (VETTING_LABEL=100 means unvetted).  We can not use it for our training data, so filter it out.
Print the total number of rows.

In [12]:
LabeledVettingData=AllVettingData.filter("VETTING_LEVEL != 100")
LabeledVettingData.count()

269

In [13]:
display(LabeledVettingData)

UUID,VETTING_LEVEL,NAME,Category,COUNTRIES_VISITED_COUNT,PASSPORT_COUNTRY_CODE,BIRTH_YEAR,Counter
b4d817b3-2154-4351-b0e3-bcd736a76bf5,30,Kelly Robbins,Engineering,3.0,PK,1971,1
5fcbbf15-8268-4306-ac26-406c2faae251,30,Stacey Courtney Gonzalez,Other,1.0,GH,1984,1
6d1fe59a-6828-4bc5-9ded-2194aa238461,30,Erika Patie Fowler,Science,2.0,GH,1998,1
fb3808ae-b5f8-482b-bac7-4cebb8761723,30,Shell Chapman,Journalism,8.0,GH,1997,1
240ca290-bd2a-48a0-95d0-1d9a0d9c5075,10,Linda Miller,Arts,3.0,GH,1997,1
48b1459b-e863-4e2c-ab3b-847ee296e78b,20,Kelly Smith,Retail,4.0,GH,2000,1
98185636-b26c-4d9a-8fe8-eaa778889c24,10,Carla Shelly Montoya,Retail,3.0,GH,1997,1
76f3a61b-d5ea-4c58-8531-939cce8ea416,10,Lys Young,Construction,3.0,GH,2002,1
9e281720-bc43-47a0-9691-78414c897e43,10,Alison Mary Hall,Medical,12.0,GH,2000,1
0f1e790d-e94f-4b16-92c2-6eda38ea6ec0,10,Nancy Stephanie Taylor,Other,6.0,GH,1988,1


<a id="engineering"></a>
## Feature Engineering.
### A feature is the elements of the data that we are using in our learning.  We need to transform each one of our features into a format that SparkML can use it.
More about the choices for feature engineering can be found here:
http://spark.apache.org/docs/2.0.0/ml-features.html#stringindexer


The first thing we will do is transform our labels (VETTING_LEVEL) into a format that we can use in the algorithm, and then get back to 'human readable' from in the end. The ML models require that the labels are in a column called 'label'.    The converter helps us transform these back in the end.



In [14]:
labelIndexer = StringIndexer(inputCol="VETTING_LEVEL", outputCol="label", handleInvalid="error")
labelModel = labelIndexer.fit(LabeledVettingData)
converter = IndexToString(inputCol="prediction", outputCol="predCategory", labels=labelModel.labels)

Next, we will process all of the features we will use. While there are a variety of choices for transforming elements, we will treat each as a String using the StringIndexer.

StringIndexer is a transformer that encodes a string column to a column of indices. The indices are ordered by value frequencies, so the most frequent value gets index 0. If the input column is numeric, it is cast to string first.

For our vetting dataset, we are interested in all string based features so we will use the StringIndexer for them.  We need to use 'handleInvalid="skip"' because not all values have been validated in our vetting set.  That means the algorithms will skip these records.

In [15]:
categoryIndexer = StringIndexer(inputCol="Category", outputCol="categoryIndex", handleInvalid="skip")
countryIndexer = StringIndexer(inputCol="PASSPORT_COUNTRY_CODE", outputCol="countryIndex", handleInvalid="skip")
yearOfBirthIndexer = StringIndexer(inputCol="BIRTH_YEAR", outputCol="birthYearIndex", handleInvalid="skip")

Now, put all of our features into a simple array using a VectorAssembler.

Note that COUNTRIIES_VISITED_COUNT is already a numeric, so we can just put that in the array as is.


In [16]:
vecAssembler = VectorAssembler(inputCols=["categoryIndex","countryIndex","birthYearIndex", "COUNTRIES_VISITED_COUNT"], outputCol="features")

Normalizer will help us normalize the features into a standard frmat.  It can help us improve the behavior of the learning algorithms.


In [17]:
normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)

<a id="model"></a>
## Declare the model that we want to use

The model here is Naive Bayes.  It will output each prediction into a 'prediction' column.  Naive Bayes  is a probabistic model that learns based on previous decisions.  We will take a best guess at the paramater 'smoothing'- SparkML will help us tune it later!


In [18]:
nb = NaiveBayes(smoothing=1.0, featuresCol="normFeatures",modelType="multinomial", labelCol="label", predictionCol="prediction")

<a id="pipeline"></a>
## Setup the Pipeline

The pipeline is the guts of the algorithm that strings all the work we've done together.

The stages are run in order and the input DataFrame is transformed as it passes through each stage.   First, comes the feature transformations, then the assembler to put them togather into one DF.  We pass that into the model. 

In machine learning, it is common to run a sequence of algorithms to process and learn from data, so this can get as complex as we want to make it!

In [19]:
pipeline = Pipeline(stages=[labelIndexer,categoryIndexer,countryIndexer, yearOfBirthIndexer, vecAssembler, normalizer, nb, converter])

<a id="train"></a>
## Train the model

We will split it into training data which is marked and test data which will be used to test the efficiency of the algorithms.

It is common to split the split up the data randomly into 70% for training and 30% for testing.  If we were to use a bigger training set, we might use an 80% / 20% split.

In [20]:
train, test = LabeledVettingData.randomSplit([70.0,30.0], seed=152345)
train.cache()
test.cache()
print('The number of records in the training data set is {}.'.format(train.count()))
print('The number of rows labeled high is {}.'.format(train.filter(train['VETTING_LEVEL'] == 10).count()))
print('The number of rows labeled medium is {}.'.format(train.filter(train['VETTING_LEVEL'] == 20).count()))
print('The number of rows labeled low is {}.'.format(train.filter(train['VETTING_LEVEL'] == 30).count()))
print('')

print('The number of records in the test data set is {}.'.format(test.count()))
print('The number of rows labeled high is {}.'.format(test.filter(test['VETTING_LEVEL'] == 10).count()))
print('The number of rows labeled medium is {}.'.format(test.filter(test['VETTING_LEVEL'] == 20).count()))
print('The number of rows labeled low is {}.'.format(test.filter(test['VETTING_LEVEL'] == 30).count()))

The number of records in the training data set is 194.
The number of rows labeled high is 54.
The number of rows labeled medium is 65.
The number of rows labeled low is 75.

The number of records in the test data set is 75.
The number of rows labeled high is 15.
The number of rows labeled medium is 30.
The number of rows labeled low is 30.


Fit the pipeline to the training data.  This will run the data through the algorithm to train it based on our labled data.

In [21]:
# Fit the pipeline to the training data assigning the result to a variable called 'model'.
model = pipeline.fit(train)

Make predictions on records in the Test data set.  This will test the model based on the 30% data we have left in reserve.  Keep in mind that the model has not seen the data in the test data set.

In [22]:
# Make predictions on the test data assigning the result to a variable called 'predictions'.
predictions = model.transform(test)

<a id="evaluate"></a>
## Show and Evaluate Results

Note that we only got a small sample of the results back because we have a very small amount of training data. 

In [23]:
predictions.count()

74

SparkML has automated ways to look at result quality called Evaluators.  More information can be found here:
http://spark.apache.org/docs/latest/mllib-evaluation-metrics.html

For simplicity here, we will use a a common evaluation method called Reciever Operator Characteristic.  This genenerally is used for binary classifiers, but we will use it because we only have 3 levels of prediction.

The curve is created by plotting the true positive rate against the false positive rate at various threshold settings. The ROC curve is thus the sensitivity as a function of fall-out. The area under the ROC curve is useful for comparing and selecting the best machine learning model for a given data set. A model with an area under the ROC curve score near 1 has very good performance. A model with a score near 0.5 is about as good as flipping a coin.

In [24]:
evaluator = BinaryClassificationEvaluator().setLabelCol("label").setMetricName("areaUnderROC")
print('Area under the ROC curve = {}.'.format(evaluator.evaluate(predictions)))

Area under the ROC curve = 0.7601532567049809.


<a id="tuning"></a>
## Automatic Algorithm Tuning - Also Called  Hyperparameter Tuning


Spark ML algorithms provide many hyperparameters for tuning models. These hyperparameters are distinct from the model parameters being optimized by Spark ML itself.  Hyperparameter tuning is accomplished by choosing the best set of parameters based on model performance on test data that the model was not trained with. All combinations of hyperparameters specified will be tried in order to find the one that leads to the model with the best evaluation result.

First we will build a paramater grid to tell SparkML what to change in its testing.  Note that we are changing all the paramaters we setup in our pipeline before - the 'smoothing' in our model, and the normalizer parameter.

In [25]:
paramGrid = (ParamGridBuilder().addGrid(nb.smoothing, [0.50,0.75,1.0])
                 .addGrid(normalizer.p, [1.0, 1.25, 1.5]).build())

Now, create a cross validator to tune the pipeline with the generated parameter grid.  Cross-validation attempts to fit the underlying estimator with user-specified combinations of parameters, cross-evaluate the fitted models, and output the best one.  

In [26]:
cv = CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(10)

Next, we will run the models through the grid we set above.  It runs a cross evaluation of the ML Pipeline to find the best model.  Note that since runs the model several times, it takes a few minutes to run.

In [27]:
cvModel = cv.fit(train)
print('Area under the ROC curve for best fitted model = {}.'.format(evaluator.evaluate(cvModel.transform(test))))

Area under the ROC curve for best fitted model = 0.7547892720306514.


Let's see what improvement we achieve by tuning the hyperparameters using cross-evaluation 

In [28]:
print('Area under the ROC curve for non-tuned model = {}.'.format(evaluator.evaluate(predictions)))
print('Area under the ROC curve for best fitted model = {}.'.format(evaluator.evaluate(cvModel.transform(test))))
print('Improvement = {0:0.2f}%'.format((evaluator.evaluate(cvModel.transform(test)) - evaluator.evaluate(predictions)) *100 / evaluator.evaluate(predictions)))

Area under the ROC curve for non-tuned model = 0.7601532567049809.
Area under the ROC curve for best fitted model = 0.7547892720306514.
Improvement = -0.71%


<a id="score"></a>
## Score the remaining records that were unscored, and load them into a new file in the object store. 

First, we want to only get the unvetted records.

In [29]:
NewVettingData=AllVettingData.filter("VETTING_LEVEL == 100")

Next, transform the new model with the new vetting records

In [30]:
newPreds = cvModel.transform(NewVettingData)

 Show the data we have predicted and some of the fields in the data.  

In [31]:
newPreds.select("UUID", "prediction", "predCategory", "probability", "NAME", "COUNTRIES_VISITED_COUNT", "PASSPORT_COUNTRY_CODE" ).show()

+--------------------+----------+------------+--------------------+--------------------+-----------------------+---------------------+
|                UUID|prediction|predCategory|         probability|                NAME|COUNTRIES_VISITED_COUNT|PASSPORT_COUNTRY_CODE|
+--------------------+----------+------------+--------------------+--------------------+-----------------------+---------------------+
|81d28245-a6d4-40e...|       1.0|          20|[0.32082790450264...|     Kathleen Bailey|                    4.0|                   GH|
|a52a6477-1f5e-4b0...|       0.0|          30|[0.51082392551921...|       Linda Stewart|                    3.0|                   GH|
|b5c7e0a7-56b6-4f3...|       1.0|          20|[0.35204402935090...|       Ashlee Fisher|                    3.0|                   GH|
|077fc1af-1802-40a...|       0.0|          30|[0.42104226712915...|    Jacqueline Clark|                    7.0|                   GH|
|d4252e23-2f33-42b...|       1.0|          20|[0.330318

Remember that VETTING_LEVEL is in three different statuses:


10- HIGH

20- MEDIUM

30 - LOW


Let's print the total number of vetting statuses that we predicted.  The actual predicted data is low because we only have a few vetted records.  Remember that we had to 'skip' and features that were not in our trained data, so if we didn't have someone who was born in a certain year in our training data, we won't be able to predict a result.

In [32]:
print('The number of records in the unvetted data set is {}.'.format(newPreds.count()))
print('The number of rows labeled high is {}.'.format(newPreds.filter(newPreds['predCategory'] == 10).count()))
print('The number of rows labeled medium is {}.'.format(newPreds.filter(newPreds['predCategory'] == 20).count()))
print('The number of rows labeled low is {}.'.format(newPreds.filter(newPreds['predCategory'] == 30).count()))

The number of records in the unvetted data set is 775.
The number of rows labeled high is 0.
The number of rows labeled medium is 402.
The number of rows labeled low is 373.


<a id="credentials"></a>
## Insert the database credentials

Click on the cell below, then on the notebook toolbar, click the box of 1's and 0's, which is the <b>Find and Add Data</b> icon.

<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/bleonardb3/WatsonStudio/master/Lab-1/images/onezeroicon.png" >

Under the female_human_trafficking.csv file, click <b>Insert to code</b> then select the <b>Insert Credentials</b> link to have a credentials dictionary added to the notebook.  

<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/bleonardb3/DS_POT_02-07/master/Lab-2/images/InsertCredentials.png" >
   
The @hidden_cell directive tells DSX not to export credentials when sharing.

In [34]:
# Insert database connection credentials below
# Make sure the name that is used is credentials. If credentials_1 is shown, please change to credentials. 
# @hidden_cell
# The following code contains the credentials for a connection in your Project.
# You might want to remove those credentials before you share your notebook.
credentials = {
    'username': 'dash100302',
    'password': """SWwY3_9sc_Kx""",
    'sg_service_url': 'https://sgmanager.ng.bluemix.net',
    'database': 'BLUDB',
    'host': 'dashdb-entry-yp-dal09-08.services.dal.bluemix.net',
    'port': '50000',
    'connectionPath': '/DASH100302/FEMALE_HUMAN_TRAFFICKING',
    'url': 'https://undefined'
}


<a id="write"></a>
## Write the results to the DB2 Warehouse. Please replace BLB in the Table_Name with your initials.

In [35]:
valuesToWrite= newPreds.select("UUID",  "predCategory","Category")
#Please replace BLB below with your initials. 
table_Name = "FEMALE_HUMAN_TRAFFICKING_BLB_ML_RESULTS"

from ingest.Connectors import Connectors

dashdbsaveoption = {
                     Connectors.DASHDB.HOST              : credentials["host"],
                     Connectors.DASHDB.DATABASE          : credentials["database"],
                     Connectors.DASHDB.USERNAME          : credentials["username"],
                     Connectors.DASHDB.PASSWORD          : credentials["password"],
                     Connectors.DASHDB.TARGET_TABLE_NAME : table_Name,
                     Connectors.DASHDB.TARGET_WRITE_MODE : 'insert' }

NewdashDBDF = valuesToWrite.write.format("com.ibm.spark.discover").options(**dashdbsaveoption).save()

***************************************************************************************************************************
## You have now completed the required portion of the lab in order to proceed to lab 3.  The following lab steps are optional. They show different features that are available with the notebook. 


**********************************************************************************************************************************

## Save Model - This section below demonstrates the Watson Machine Language API which allows you to programmatically save the model that was trained to a model repository.  The model can then be deployed for use in a production application.  


### You will leverage the Machine Learning service that was created in the Lab prerequisites.  
### Here is a [link](https://console.bluemix.net/dashboard/apps/pm-20) to access the machine learning service, then follow the steps below. Note that the name of the Machine Learning service in your Dashboard will be whatever you used in the prerequisites.  

1. On the Resource List panel, click on Services. <img alt="Services" src="https://raw.githubusercontent.com/bleonardb3/DS_POT_05-09/master/Lab-5/images/ClickonServices.png">  <br>
1. Click on the Machine Learning service. <img alt="Machine Learning Service" src="https://raw.githubusercontent.com/bleonardb3/DS_POT_05-09/master/Lab-5/images/ClickonMachineLearning.png">  <br>
1. Click on Service Credentials in the left panel. <img alt="Service Credentials" src="https://raw.githubusercontent.com/bleonardb3/DSX/master/Lab-2/images/MLServiceCredentials.png"> <br>
1. Click on New Credential. <img alt="New Credentials" src="https://raw.githubusercontent.com/bleonardb3/DSX/master/Lab-2/images/MLNewCredential.png"> <br>
1. On the Add new credential popup, click on Add. <img alt="Add Credential" src="https://raw.githubusercontent.com/bleonardb3/DSX/master/Lab-2/images/MLAddCredential.png"> <br>
1. In the Credentials-1 row, click on the down arrow next to View Credentials. <img alt="View Credentials" src="https://raw.githubusercontent.com/bleonardb3/DSX/master/Lab-2/images/MLViewCredentials.png"> <br>
1. Copy the credentials shown into the appropriate places in the notebook cell below. 



In [36]:
# @hidden_cell
# The following code contains the credentials the Machine Learning service. 
# You might want to remove those credentials before you share your notebook.

wml_credentials={ 
  "username": "b5b542e0-128c-4a49-9b94-3b8013757fea",
  "password": "e78a2214-3bdb-4b2b-b3a4-162013bae32a",
  "instance_id": "a12a08e4-288b-41b6-89a9-03017e5186e7",
  "url": "https://us-south.ml.cloud.ibm.com"
}




In [37]:
from repository_v3.mlrepositoryclient import MLRepositoryClient
from repository_v3.mlrepositoryartifact import MLRepositoryArtifact
from repository_v3.mlrepository import MetaProps, MetaNames
import json

In [38]:
ml_repository_client = MLRepositoryClient(wml_credentials['url'])
ml_repository_client.authorize(wml_credentials['username'], wml_credentials['password'])

In [39]:
meta_props=MetaProps({
    MetaNames.EVALUATION_METHOD: "binary",
    MetaNames.EVALUATION_METRICS: json.dumps([{
        "name": "areaUnderROC",
        "value":evaluator.evaluate(predictions),
        "threshold":0.8
    }])
})

## After the cell below is executed, a model entry "Female Human Trafficking" will appear in the Project Model section. 

In [40]:
pipeline_artifact = MLRepositoryArtifact(pipeline, name="pipeline")
model_artifact = MLRepositoryArtifact(model, training_data=train, pipeline_artifact=pipeline_artifact, meta_props=meta_props,name="Female Human Trafficking")
saved_model = ml_repository_client.models.save(model_artifact)


<a id="schedule"></a>
## Schedule Job
You can schedule a notebook version to run at specified intervals.   If a notebook version does not yet exist, one will be created for you.  If the notebook kernel was stopped when scheduled to run, it will be started.

To schedule a notebook, select the <b>Schedule</b> icon.

<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/jpatter/LMCO/master/Lab-1/images/schedule-button.png" >

Give a name to the job and pick the time period to run it.   All time periods are for the timezone of the <b>browser</b> NOT the timezone of the server where the notebook is running.

<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/jpatter/LMCO/master/Lab-1/images/Schedule-Window.PNG" >

## Download notebook

Notebooks can be downloaded in notebook (.ipynb), Python (.py), HTML (.html), markdown (.md) or reST (.rst) format.  Use <b>File</b> > <b>Download as</b> to download the notebook in any of the formats.

<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/jpatter/LMCO/master/Lab-1/images/FileOptions.PNG" >

<a id="revert"></a>
## Revert to version 
Revert to the version you saved at the beginning of this lab.   There are two ways to do this.   First, select <b>File</b> > <b>Revert to Version</b> and choose the version you created at the beginning of the lab (versions are timestamped).
<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/jpatter/LMCO/master/Lab-1/images/FileOptions.PNG" >

The second way is to select the <b>Versions</b> icon 
<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/jpatter/LMCO/master/Lab-1/images/versions-button.png" ><br>
and then select the version you wish to revert to.   You can also delete versions from here.
<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/jpatter/LMCO/master/Lab-1/images/Versions.PNG" >

<a id="help"></a>
## Even more help

Select the <b>Find Resources in the Community</b> link to display a search bar, documentation hotlinks, and a link to Stack Overflow's Data Science Experience section.

<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/jpatter/LMCO/master/Lab-1/images/community-button.png" >

<img alt="IBM Bluemix.Get started now" src="https://raw.githubusercontent.com/jpatter/LMCO/master/Lab-1/images/Community-Resources.PNG" >