# Lab: Introduction to Notebooks for SPSS Professionals - Part 1

## Introduction

The purpose of this lab is to help professionals who currently use SPSS Modeler understand how a data scientist can use notebooks for implementing various analytics use cases. 

When we compare environments, it's done for educational purposes only. SPSS Modeler and notebooks were built in different decades using different technologies; they are also typically used by different types of data scientists. While the notebook interface may seem "difficult to understand" to somebody who is used to a visual environment, it provides a great improvement in usability and productivity for data scientists who are used to working in shell environments. With a notebook, a data scientist can explain what happens in the code and add visualizations.

We chose a simple analytics use case for this lab - predicting mortgage default. We are using different predictive models (C5 classification model in SPSS and Spark ML Decision Tree Classifier in the notebook) because we want to focus on explaining functionality, and not comparing performance or accuracy.  

If you modify this notebook for customer demos, please don't use it in the context of "it's easier to build analytics in a visual environment". Showing the same analytics use case in two environments is meant to show flexibility, not to position environments against each other. 


## Notebooks

This notebook IDE is built on top of open source Jupyter notebooks. A notebook is a file type that contains:
1. Code (Python, R or Scala)
2. Markdown (comments like the ones you are reading now)
3. A connection to a kernel (runtime environment). The kernel in this environment is provided by Spark. 

The content of a notebook is organized in "cells". The two main types of cells are "code" and "markdown". As you review the rest of the notebook, you may see some text that's marked with "out". The "out" tag is for output of code directly above it. If you don't see the out tag, then the output may have been cleared or the code in the cell didn't produce any printable output. 
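
To make the cell structure concrete, here is a minimal sketch of how a Jupyter notebook file stores its cells. It assumes the standard nbformat 4 JSON layout; the two cells and their contents are hypothetical and not taken from this lab.

```python
import json

# A hypothetical two-cell notebook, written out in the nbformat 4 layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["## Step 1: Say hello"],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # None until the cell has been run
            "metadata": {},
            "outputs": [],            # empty if output was cleared or nothing was printed
            "source": ["print('hello')"],
        },
    ],
}

# Round-trip through JSON text, which is how the file is stored on disk.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

The empty `outputs` list corresponds to a code cell with no "out" section in the IDE.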

While the notebook IDE looks different from SPSS Modeler, these tools have several similarities on the technical level. When we run a Modeler stream, the visual nodes are converted to code, which is executed in Modeler Server. When we run notebooks, the code that we provide runs in the specific kernel for each programming language (the notebook IDE automatically starts a kernel for the programming language in which the notebook is implemented).

You will notice that the notebook IDE has a menu that's dedicated to kernels, with actions like start, restart, and reconnect. This is similar to connections to Modeler Server. The notebook IDE offers more flexibility for connections to the kernel and the ability to switch kernels, because several versions of programming languages are supported.

When we work with notebooks, we have the option to run the entire notebook (all cells in the notebook) or individual cells. This is similar to running the entire Modeler stream or a selected branch. If you are running an individual cell, it's important that the cells above it have already been run. Again, this is similar to Modeler (we don't start execution in the middle of the stream, and if we do, we turn on caching).
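
The following sketch illustrates why cell order matters. It is a simplified model, not how the kernel is actually implemented: each string stands in for a code cell, and a plain dictionary stands in for the kernel's memory. The variable names and values are hypothetical.

```python
# Each string stands in for one code cell; the dict stands in for the kernel's memory.
cells = [
    "income = 43593",
    "loan_amount = 12820",
    "ratio = loan_amount / float(income)",
]

def run_cells(cell_sources, namespace):
    """Run the given cells in order against one shared namespace."""
    for source in cell_sources:
        exec(source, namespace)
    return namespace

# Running every cell from the top works.
kernel = run_cells(cells, {})
ratio_ok = "ratio" in kernel

# Running only the last cell on a fresh kernel fails, because the
# variables defined by the earlier cells do not exist yet.
try:
    run_cells(cells[2:], {})
    skipped_error = None
except NameError as err:
    skipped_error = err

print(ratio_ok, type(skipped_error).__name__)
```

The same thing happens after a kernel restart: variables from earlier cells are gone until those cells are run again.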

If you want to run a cell, position the cursor at the end of the last line, and select menu Cell -> Run Cells (or click the Run icon). While the cell is running, you will see an asterisk next to the cell [*]. Don't run any subsequent cells until execution is done.


Since notebooks are based on an open source technology, you can find many tutorials and sample notebooks. 
Here are some notebooks that show "functional/technical" features of Python and R: https://github.com/IBMDataScience/sample-notebooks
You can find additional notebooks on the Community page of DSX.


## Sample Modeler Stream

Start with reviewing the sample Modeler stream that was provided by your instructor (MortgageDefault.str). In this lab we'll implement the same use case using Python and SparkML. 

## Working with a Notebook
Until you click the Edit icon (pencil in the top right corner), you are looking at the "static" version of the notebook (i.e. it's simply displaying the content of the file, and it's not connected to a kernel). 

After you click the Edit icon, notice the "Kernel starting" message in the top right corner. You should also see a menu bar because we have opened the notebook IDE. 

Explore the menu options and let the instructor know if you have any questions. We'll use some of the menu options in this lab. 

As you are working on this lab, read the information in the markdown cells. To run the code in a cell, position the cursor at the end of the last line of the code cell and click the Run icon. As mentioned earlier, the cell is still running if you see [*] next to the code cell, and not every cell will have printed output.


## Mortgage Default Use Case Implementation

### Step 1: Connect to Object Storage

We start with connecting to Object Storage. Object Storage is the Bluemix environment for storing flat files. If you go back to the Project dashboard and click on Data Assets, you'll see 4 files - Default.csv, Customer.csv, Property.csv, and TestData1.csv. 

The following code has been generated by DSX and it's connecting to one of the instructors' Object Storage. If you would like to load data files to your own Object Storage, please check with the lab instructor. 

We use the SQLContext API because it makes it easier to work with files. 


In [1]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Begin generated code

# @hidden_cell
# This function is used to setup the access of Spark to your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def set_hadoop_config_with_credentials_1002714be8c646fd80887f973e8f09df(name):
    """This function sets the Hadoop configuration so it is possible to
    access data from Bluemix Object Storage using Spark"""

    prefix = 'fs.swift.service.' + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/v3/auth/tokens')
    hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
    hconf.set(prefix + '.tenant', '1d64182068a14b259334892f004f9847')
    hconf.set(prefix + '.username', '23e888652f75406eb4927ccdc78651e4')
    hconf.set(prefix + '.password', 'LTJ}E73L2nB(DX{w')
    hconf.setInt(prefix + '.http.port', 8080)
    hconf.set(prefix + '.region', 'dallas')
    hconf.setBoolean(prefix + '.public', False)

# you can choose any name
name = 'keystone'
set_hadoop_config_with_credentials_1002714be8c646fd80887f973e8f09df(name)

# End generated code
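
The generated code above embeds credentials directly in the notebook. As a hedged alternative, the sketch below builds the same kind of settings from environment variables, so that the notebook can be shared without secrets. The `OS_*` variable names and the `build_swift_settings` helper are hypothetical, not something DSX generates, and the placeholder values are not real credentials.

```python
import os

def build_swift_settings(name, env=os.environ):
    """Return the Hadoop settings for a Swift service as a plain dict.

    The OS_* variable names are hypothetical; use whatever your
    environment actually provides.
    """
    prefix = 'fs.swift.service.' + name
    return {
        prefix + '.auth.url': 'https://identity.open.softlayer.com/v3/auth/tokens',
        prefix + '.auth.endpoint.prefix': 'endpoints',
        prefix + '.tenant': env['OS_PROJECT_ID'],
        prefix + '.username': env['OS_USER_ID'],
        prefix + '.password': env['OS_PASSWORD'],
        prefix + '.region': env.get('OS_REGION', 'dallas'),
    }

# Demonstration with placeholder values instead of real credentials.
fake_env = {
    'OS_PROJECT_ID': 'my-project',
    'OS_USER_ID': 'my-user',
    'OS_PASSWORD': 'my-secret',
}
settings = build_swift_settings('keystone', env=fake_env)

# Print everything except the password.
for key in sorted(settings):
    if not key.endswith('.password'):
        print(key, '=', settings[key])
```

In the notebook, each string pair could then be applied with `hconf.set(key, value)`, as the generated function does. The `.http.port` and `.public` settings are left out of the sketch because the generated code sets them with `setInt` and `setBoolean`.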

### Step 2: Load files

In [2]:
# The loan default information
default = sqlContext.read.format('com.databricks.spark.csv')\
  .options(header='true', inferschema='true')\
  .load("swift://ConvertSPSSModelToNotebook." + name + "/Default.csv")

# The property information
property = sqlContext.read.format('com.databricks.spark.csv')\
  .options(header='true', inferschema='true')\
  .load("swift://ConvertSPSSModelToNotebook." + name + "/Property.csv")

# Customer information
customer = sqlContext.read.format('com.databricks.spark.csv')\
  .options(header='true', inferschema='true')\
  .load("swift://ConvertSPSSModelToNotebook." + name + "/Customer.csv")

# Data set for testing - it's already been merged
testingData = sqlContext.read.format('com.databricks.spark.csv')\
  .options(header='true', inferschema='true')\
  .load("swift://ConvertSPSSModelToNotebook." + name + "/TestData1.csv")

# Preview test data set
testingData.take(5)

[Row(ID=100272, Income=43593, AppliedOnline=u'YES', Residence=u'Owner Occupier', YearCurrentAddress=13, YearsCurrentEmployer=0, NumberOfCards=1, CCDebt=2315, Loans=0, LoanAmount=12820, SalePrice=180000, Location=130, historicalLabel=0.0),
 Row(ID=100273, Income=45706, AppliedOnline=u'YES', Residence=u'Owner Occupier', YearCurrentAddress=17, YearsCurrentEmployer=16, NumberOfCards=2, CCDebt=373, Loans=1, LoanAmount=7275, SalePrice=145000, Location=100, historicalLabel=1.0),
 Row(ID=100279, Income=44756, AppliedOnline=u'YES', Residence=u'Owner Occupier', YearCurrentAddress=19, YearsCurrentEmployer=6, NumberOfCards=1, CCDebt=2117, Loans=1, LoanAmount=10760, SalePrice=145000, Location=110, historicalLabel=0.0),
 Row(ID=100280, Income=44202, AppliedOnline=u'YES', Residence=u'Owner Occupier', YearCurrentAddress=8, YearsCurrentEmployer=0, NumberOfCards=2, CCDebt=748, Loans=0, LoanAmount=10455, SalePrice=170000, Location=100, historicalLabel=0.0),
 Row(ID=100282, Income=45715, AppliedOnline=u'Y
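
The four `load()` calls in this step differ only in the file name. The sketch below shows how the `swift://` paths are composed, using a hypothetical `swift_url` helper; the container name and file names are the ones used in this lab.

```python
def swift_url(container, service_name, filename):
    """Build a swift:// URL in the form used by the load() calls in this step."""
    return "swift://{0}.{1}/{2}".format(container, service_name, filename)

container = "ConvertSPSSModelToNotebook"
service_name = "keystone"  # same value as the `name` variable in Step 1
files = ["Default.csv", "Property.csv", "Customer.csv", "TestData1.csv"]

# One URL per data file.
urls = {f: swift_url(container, service_name, f) for f in files}
for f in files:
    print(urls[f])
```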

### Step 3: Merge Files
This step is similar to the Merge node in Modeler.

In [None]:
merged = customer.join(property, customer['ID'] == property['ID'])\
                 .join(default, customer['ID'] == default['ID'])\
                 .select(customer['*'], property['SalePrice'], property['Location'], default['MortgageDefault'])
# Preview 5 rows
merged.take(5)
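
The join above is an inner join on `ID`: a customer appears in the result only if a matching row exists in both of the other tables. The following pure-Python sketch illustrates that behavior on hypothetical miniature tables (the values, including the `MortgageDefault` codes, are made up). It assumes each `ID` appears at most once in the right-hand table.

```python
# Hypothetical miniature versions of the three tables, keyed by ID.
customer = [{"ID": 1, "Income": 43593}, {"ID": 2, "Income": 45706}, {"ID": 3, "Income": 44756}]
property_ = [{"ID": 1, "SalePrice": 180000}, {"ID": 2, "SalePrice": 145000}]
default = [
    {"ID": 1, "MortgageDefault": "NO"},
    {"ID": 2, "MortgageDefault": "YES"},
    {"ID": 3, "MortgageDefault": "NO"},
]

def inner_join(left, right, key):
    """Keep only rows whose key appears on both sides and combine their fields."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            combined = dict(row)
            combined.update(match)
            joined.append(combined)
    return joined

merged_rows = inner_join(inner_join(customer, property_, "ID"), default, "ID")
print([row["ID"] for row in merged_rows])
```

Customer 3 is dropped because it has no row in the property table.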

### Step 4: Data understanding
This capability is similar to the Graphboard node in Modeler.
PixieDust is a Python helper library for Spark IPython notebooks. One of its main features is visualization. You'll notice that, unlike other APIs that produce only static output, PixieDust creates an interactive UI in which you can explore data.

Try creating different graphs. 

More information about PixieDust: https://github.com/ibm-cds-labs/pixiedust

In [None]:
from pixiedust.display import *
display(merged)

### Step 5: Rename some columns
This step is not required; it replaces column names that contain spaces with shorter names that are easier to type.

In [None]:
merged2 = merged.withColumnRenamed("Yrs at Current Address", "YearCurrentAddress").withColumnRenamed("Yrs with Current Employer","YearsCurrentEmployer")\
                .withColumnRenamed("Number of Cards","NumberOfCards").withColumnRenamed("Creditcard Debt","CCDebt").withColumnRenamed("Loan Amount", "LoanAmount")
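
When many columns need renaming, the chain of calls can be driven by a dictionary instead. This is a sketch of that idea, not part of the original lab: `rename_columns` is a hypothetical helper, and `FakeFrame` is a stand-in for a Spark DataFrame so that the example runs without Spark. In the notebook you would pass `merged` instead.

```python
from functools import reduce

# Old name -> new name, taken from the cell above.
RENAMES = {
    "Yrs at Current Address": "YearCurrentAddress",
    "Yrs with Current Employer": "YearsCurrentEmployer",
    "Number of Cards": "NumberOfCards",
    "Creditcard Debt": "CCDebt",
    "Loan Amount": "LoanAmount",
}

def rename_columns(df, renames):
    """Apply withColumnRenamed once per entry and return the final DataFrame."""
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        renames.items(),
        df,
    )

class FakeFrame(object):
    """Hypothetical stand-in for a Spark DataFrame so the sketch runs without Spark."""

    def __init__(self, columns):
        self.columns = list(columns)

    def withColumnRenamed(self, existing, new):
        # Like Spark, leave the frame unchanged if the column is not present.
        return FakeFrame([new if c == existing else c for c in self.columns])

demo = rename_columns(FakeFrame(["ID", "Income", "Loan Amount", "Creditcard Debt"]), RENAMES)
print(demo.columns)
```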

### Step 6: Build the Spark pipeline and the Decision Tree model
"Pipeline" is an API in SparkML that's used for building models.
Additional information on SparkML: http://spark.apache.org/docs/latest/ml-guide.html

In [None]:
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Prepare string variables so that they can be used by the decision tree algorithm
stringIndexer1 = StringIndexer(inputCol='AppliedOnline', outputCol='AppliedOnlineEncoded')
stringIndexer2 = StringIndexer(inputCol='Residence',outputCol='ResidenceEncoded')
stringIndexer3 = StringIndexer(inputCol='MortgageDefault', outputCol='label')

# Instantiate the algorithm
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

# The Pipelines API requires that input variables are passed in a single vector column
assembler = VectorAssembler(inputCols=["Income", "AppliedOnlineEncoded", "ResidenceEncoded", "YearCurrentAddress", "YearsCurrentEmployer", "NumberOfCards", \
                                       "CCDebt", "Loans", "LoanAmount", "SalePrice", "Location"], outputCol="features")

pipeline = Pipeline(stages=[stringIndexer1, stringIndexer2, stringIndexer3, assembler, dt])

# Build model
model = pipeline.fit(merged2)
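
To clarify what the preparation stages do, here is a simplified pure-Python sketch of the two ideas: a string indexer that maps each distinct string to a number (most frequent value gets index 0, which is Spark's default ordering), and an assembler that collects the chosen columns into one list of numbers per row. It is an illustration only, not Spark's implementation; the rows are hypothetical and the alphabetical tie-breaking is an assumption made to keep the sketch deterministic.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first.

    Ties are broken alphabetically here to keep the sketch deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def assemble(row, input_cols):
    """Collect the named columns of one row into a single list of floats."""
    return [float(row[c]) for c in input_cols]

# Hypothetical rows with some of the same column names as the lab data.
rows = [
    {"Income": 43593, "AppliedOnline": "YES", "Loans": 0},
    {"Income": 45706, "AppliedOnline": "YES", "Loans": 1},
    {"Income": 44756, "AppliedOnline": "NO", "Loans": 1},
]

# Stage 1: encode the string column.
applied_index = fit_string_index(r["AppliedOnline"] for r in rows)
for r in rows:
    r["AppliedOnlineEncoded"] = applied_index[r["AppliedOnline"]]

# Stage 2: build the feature vector that the tree algorithm consumes.
features = [assemble(r, ["Income", "AppliedOnlineEncoded", "Loans"]) for r in rows]
print(applied_index)
print(features[2])
```

The pipeline chains such stages so that the same encoding is applied both when the model is built and when new data is scored.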


### Step 7: Score the test data set

In [None]:
results = model.transform(testingData)
# This is a preview of 10 rows
results.take(10)

### Step 8: Model Analysis
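Compute the accuracy of the model, that is, the fraction of test rows where the prediction matches the historical label. This step is similar to the Analysis node in Modeler.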

In [None]:
results.filter(results.historicalLabel == results.prediction).count() / float(results.count())
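
The expression above divides the number of rows where the prediction equals the label by the total number of rows. The sketch below repeats that calculation in plain Python on hypothetical (label, prediction) pairs and, for comparison, also derives precision and recall for the "default" class from the same counts.

```python
# Hypothetical (label, prediction) pairs; 1.0 means the mortgage defaulted.
pairs = [
    (1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (0.0, 0.0),
]

tp = sum(1 for label, pred in pairs if label == 1.0 and pred == 1.0)  # true positives
fp = sum(1 for label, pred in pairs if label == 0.0 and pred == 1.0)  # false positives
fn = sum(1 for label, pred in pairs if label == 1.0 and pred == 0.0)  # false negatives
tn = sum(1 for label, pred in pairs if label == 0.0 and pred == 0.0)  # true negatives

accuracy = (tp + tn) / float(len(pairs))  # share of all rows predicted correctly
precision = tp / float(tp + fp)           # share of predicted defaults that were real
recall = tp / float(tp + fn)              # share of real defaults that were predicted

print(accuracy, precision, recall)
```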

### Step 9: Model Evaluation
This step is similar to the Evaluation node in Modeler.

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="historicalLabel", metricName="areaUnderROC")
evaluator.evaluate(results)
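
As background for the `areaUnderROC` metric, the sketch below computes it in plain Python using an equivalent definition: the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as one half. The labels and scores are hypothetical. Note that the cell above passes the `prediction` column, which holds hard 0/1 values, as the score; the sketch shows how that can give a different number than scoring with probabilities.

```python
def area_under_roc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1.0]
    neg = [s for l, s in zip(labels, scores) if l == 0.0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of default.
labels = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0]
probabilities = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]

# Hard 0/1 predictions obtained by thresholding at 0.5.
hard_predictions = [1.0 if p >= 0.5 else 0.0 for p in probabilities]

auc_from_probabilities = area_under_roc(labels, probabilities)
auc_from_hard_predictions = area_under_roc(labels, hard_predictions)
print(auc_from_probabilities, auc_from_hard_predictions)
```

If the ranking quality of the model matters, it may be worth trying the evaluator with a score column instead of `prediction`; whether that suits this lab is a judgment call.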

### Step 10: Save Model
Save model in Object Storage. 

A separate notebook has been created for "batch scoring deployment". This deployment notebook retrieves the model from object storage and applies it to a new dataset. The notebook can be scheduled to run via the Notebook scheduler (the clock icon on the menu bar) or through the deployment interface in IBM ML (currently in beta).

In [None]:
# The overwrite API - model.write.overwrite().save("ConvertSPSSModelToNotebook.mortgageDefaultModel") currently doesn't work
#model.write.overwrite().save("ConvertSPSSModelToNotebook.mortgageDefaultModel")

# We can use model.save(), but you have to specify a unique model name (replace uniquename)
# Note: model.save() only works in Spark 2.0
# model.save("ConvertSPSSModelToNotebook.mortgageDefaultModel_uniquename")
print("Saved model in Object Storage")
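
Because `model.save()` needs a name that has not been used before, one option is to generate the suffix instead of typing it by hand. The `unique_model_path` helper below is a hypothetical sketch that appends a UTC timestamp and a short random string to the base name from the cell above.

```python
import time
import uuid

def unique_model_path(base):
    """Return base plus a UTC timestamp and a short random suffix."""
    stamp = time.strftime("%Y%m%d_%H%M%S", time.gmtime())
    return "{0}_{1}_{2}".format(base, stamp, uuid.uuid4().hex[:8])

base = "ConvertSPSSModelToNotebook.mortgageDefaultModel"
path_a = unique_model_path(base)
path_b = unique_model_path(base)
print(path_a)
```

The result could be passed to `model.save(...)` in place of the hand-edited name. Keep a record of the generated path, since the deployment notebook needs it to load the model.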

You have finished the intro to Notebooks lab. 