# Lab: Introduction to Notebooks for SPSS Professionals - Part 2
In Part 2 of the lab, you will learn how to load the model that was created in Part 1 and use it to score new data.

### Step 1: Connect to Object Storage
If you wish, follow the instructions in Appendix A of the Lab Instructions to switch to your own object storage.

In [1]:
from pyspark.sql import SparkSession

# @hidden_cell
# This function sets up Spark's access to your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def set_hadoop_config_with_credentials_18c4556616c5444581b1cb6d212cf2dc(name):
    """This function sets the Hadoop configuration so it is possible to
    access data from Bluemix Object Storage using Spark"""

    prefix = 'fs.swift.service.' + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/v3/auth/tokens')
    hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
    hconf.set(prefix + '.tenant', '879160f1a1174d2f912f196ac158ffbf')
    hconf.set(prefix + '.username', 'aa5ea4cb9c48463681897f88b4a9ab08')
    hconf.set(prefix + '.password', 'veCB_4)bYn362UU&')
    hconf.setInt(prefix + '.http.port', 8080)
    hconf.set(prefix + '.region', 'dallas')
    hconf.setBoolean(prefix + '.public', False)

# you can choose any name
name = 'keystone'
set_hadoop_config_with_credentials_18c4556616c5444581b1cb6d212cf2dc(name)

spark = SparkSession.builder.getOrCreate()
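
The `name` chosen in this cell is more than a label: later cells build their paths as `swift://<container>.<name>/<object>`. A minimal sketch of that composition follows. `swift_url` is a hypothetical helper (it is not part of the lab or of Spark), and the container and file names are the ones used elsewhere in this notebook.

```python
def swift_url(container, service_name, object_name):
    """Build a swift:// URL the same way the notebook cells do with string concatenation."""
    return "swift://{}.{}/{}".format(container, service_name, object_name)


# 'keystone' is the service name configured in Step 1; any name works
# as long as the same value is used in the Hadoop configuration and in the URL.
name = "keystone"
url = swift_url("IntroToNotebooks", name, "MortgageDefaultTestData.csv")
print(url)
```

If you change `name`, every `swift://` path in the notebook changes with it, which is why the cells concatenate the variable rather than hard-coding the value.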

### Step 2: Load the new data for scoring
Note: The test data set contains all fields that were used for model training, with the exception of the target variable (MortgageDefault).

In [2]:
newData = spark.read.format('csv')\
  .options(header='true', inferschema='true')\
  .load("swift://IntroToNotebooks." + name + "/MortgageDefaultTestData.csv")

newData.toPandas().head()

| | ID | Income | AppliedOnline | Residence | YearCurrentAddress | YearsCurrentEmployer | NumberOfCards | CCDebt | Loans | LoanAmount | SalePrice | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100342 | 43202 | YES | Owner Occupier | 17 | 7 | 1 | 1412 | 1 | 8925 | 650000 | 101 |
| 1 | 100344 | 49745 | NO | Private Renting | 24 | 6 | 1 | 518 | 1 | 5915 | 170000 | 120 |
| 2 | 100367 | 43645 | YES | Owner Occupier | 5 | 13 | 2 | 502 | 0 | 4600 | 875000 | 100 |
| 3 | 100373 | 49678 | YES | Owner Occupier | 11 | 8 | 2 | 564 | 0 | 7195 | 275000 | 100 |
| 4 | 100418 | 59508 | YES | Owner Occupier | 2 | 19 | 1 | 3671 | 1 | 10595 | 135000 | 130 |
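
Before scoring, it can be worth confirming that the new data has every field the model expects and does not carry the target column. The sketch below does this on plain lists of column names. `check_scoring_columns` is a hypothetical helper, and the expected field list is an assumption copied from the header of the output above (without the leading index column). In the notebook you would pass `newData.columns` instead of the hard-coded list.

```python
# Assumption: these are the fields shown in the test data output above.
EXPECTED_FIELDS = [
    "ID", "Income", "AppliedOnline", "Residence", "YearCurrentAddress",
    "YearsCurrentEmployer", "NumberOfCards", "CCDebt", "Loans",
    "LoanAmount", "SalePrice", "Location",
]
TARGET = "MortgageDefault"


def check_scoring_columns(columns, expected=EXPECTED_FIELDS, target=TARGET):
    """Return (missing expected fields, whether the target column is present)."""
    missing = [field for field in expected if field not in columns]
    has_target = target in columns
    return missing, has_target


# A well-formed scoring set: nothing missing, no target column.
missing, has_target = check_scoring_columns(list(EXPECTED_FIELDS))
print(missing, has_target)
```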


### Step 3: Load Saved Model
The model was saved in the model-building notebook. If you changed the namespace in that notebook (in our example, ConvertSPSSModelToNotebook), use the changed value in the load path here.

In [3]:
# Load the saved model
from pyspark.ml import PipelineModel
model = PipelineModel.load("IntroToNotebooks.mortgageDefaultModel")

### Step 4: Score the New Data with the Reloaded Model
Take note of the prediction and probability columns in the result.

In [4]:
# Score the new data with the reloaded model
# Take note of the prediction and probability columns in the result

results = model.transform(newData)
results.toPandas().head(5)

| | ID | Income | AppliedOnline | Residence | YearCurrentAddress | YearsCurrentEmployer | NumberOfCards | CCDebt | Loans | LoanAmount | SalePrice | Location | AppliedOnlineEncoded | ResidenceEncoded | features | rawPrediction | probability | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100342 | 43202 | YES | Owner Occupier | 17 | 7 | 1 | 1412 | 1 | 8925 | 650000 | 101 | 0 | 0 | [43202.0, 0.0, 0.0, 17.0, 7.0, 1.0, 1412.0, 1.... | [14.1666666667, 5.83333333333] | [0.708333333333, 0.291666666667] | 0 |
| 1 | 100344 | 49745 | NO | Private Renting | 24 | 6 | 1 | 518 | 1 | 5915 | 170000 | 120 | 1 | 1 | [49745.0, 1.0, 1.0, 24.0, 6.0, 1.0, 518.0, 1.0... | [15.6602564103, 4.33974358974] | [0.783012820513, 0.216987179487] | 0 |
| 2 | 100367 | 43645 | YES | Owner Occupier | 5 | 13 | 2 | 502 | 0 | 4600 | 875000 | 100 | 0 | 0 | [43645.0, 0.0, 0.0, 5.0, 13.0, 2.0, 502.0, 0.0... | [16.8783673469, 3.12163265306] | [0.843918367347, 0.156081632653] | 0 |
| 3 | 100373 | 49678 | YES | Owner Occupier | 11 | 8 | 2 | 564 | 0 | 7195 | 275000 | 100 | 0 | 0 | [49678.0, 0.0, 0.0, 11.0, 8.0, 2.0, 564.0, 0.0... | [7.89699627858, 12.1030037214] | [0.394849813929, 0.605150186071] | 1 |
| 4 | 100418 | 59508 | YES | Owner Occupier | 2 | 19 | 1 | 3671 | 1 | 10595 | 135000 | 130 | 0 | 0 | [59508.0, 0.0, 0.0, 2.0, 19.0, 1.0, 3671.0, 1.... | [14.0, 6.0] | [0.7, 0.3] | 0 |
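
In the sample rows above, each `probability` vector equals the `rawPrediction` vector divided by its sum (the raw values add up to 20 in every row shown), and `prediction` is the index of the larger probability. The plain-Python sketch below reproduces that arithmetic for two of the rows. It is an observation about this particular output, not a description of how every Spark model derives probabilities, and `normalize` is a hypothetical helper.

```python
def normalize(raw):
    """Scale a list of non-negative raw scores so that they sum to 1."""
    total = sum(raw)
    return [value / total for value in raw]


# rawPrediction values copied from the scored output above, keyed by ID.
raw_rows = {
    100342: [14.1666666667, 5.83333333333],
    100373: [7.89699627858, 12.1030037214],
}

for record_id, raw in raw_rows.items():
    probability = normalize(raw)
    # The predicted class is the position of the largest probability.
    prediction = probability.index(max(probability))
    print(record_id, [round(p, 4) for p in probability], prediction)
```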


### Step 5: Write results into a CSV file

In [5]:
# Select the ID, prediction and probability fields from the results dataframe
r1 = results.select(results["ID"], results["prediction"], results["probability"])
r1.show(5,False)

+--------+----------+----------------------------------------+
|ID      |prediction|probability                             |
+--------+----------+----------------------------------------+
|100342.0|0.0       |[0.7083333333333333,0.29166666666666663]|
|100344.0|0.0       |[0.7830128205128204,0.21698717948717952]|
|100367.0|0.0       |[0.8439183673469388,0.15608163265306124]|
|100373.0|1.0       |[0.3948498139287613,0.6051501860712387] |
|100418.0|0.0       |[0.7,0.3]                               |
+--------+----------+----------------------------------------+
only showing top 5 rows



#### Decompose the probability column
The probability column contains a vector for each record. A vector cannot be written to a CSV file directly, so its elements must be extracted into separate columns.

In [6]:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Each UDF pulls one element out of the probability vector as a plain double
udf_0 = udf(lambda vector: float(vector[0]), DoubleType())
udf_1 = udf(lambda vector: float(vector[1]), DoubleType())

# r1 already holds only ID, prediction and probability, so no extra select is needed
r2 = (r1
    .withColumn('probability_0', udf_0(r1.probability))
    .withColumn('probability_1', udf_1(r1.probability))
    .drop("probability"))

r2.show(5, False)

+--------+----------+------------------+-------------------+
|ID      |prediction|probability_0     |probability_1      |
+--------+----------+------------------+-------------------+
|100342.0|0.0       |0.7083333333333333|0.29166666666666663|
|100344.0|0.0       |0.7830128205128204|0.21698717948717952|
|100367.0|0.0       |0.8439183673469388|0.15608163265306124|
|100373.0|1.0       |0.3948498139287613|0.6051501860712387 |
|100418.0|0.0       |0.7               |0.3                |
+--------+----------+------------------+-------------------+
only showing top 5 rows
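
To make the per-row effect of the two UDFs concrete, here is a plain-Python sketch that applies the same indexing to ordinary tuples instead of Spark rows. `split_probability` is a hypothetical helper, and the sample values are copied from the output above.

```python
def split_probability(row):
    """Turn (ID, prediction, [p0, p1]) into (ID, prediction, p0, p1)."""
    record_id, prediction, probability = row
    return (record_id, prediction, float(probability[0]), float(probability[1]))


# Two rows in the shape of r1: ID, prediction, probability vector.
rows = [
    (100342.0, 0.0, [0.7083333333333333, 0.29166666666666663]),
    (100373.0, 1.0, [0.3948498139287613, 0.6051501860712387]),
]

flat_rows = [split_probability(row) for row in rows]
for flat in flat_rows:
    print(flat)
```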



In [7]:
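# Spark writes a directory of part files at this path, not a single CSV file;
# mode='overwrite' replaces any output left by a previous run
r2.write.csv('swift://IntroToNotebooks.' + name + '/mortgage_default_scores.csv', mode='overwrite')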

In [8]:
# Show that the CSV output can be read back
r3 = spark.read.csv('swift://IntroToNotebooks.' + name + '/mortgage_default_scores.csv')
r3.select(
    r3["_c0"].alias("ID"),
    r3["_c1"].alias("prediction"),
    r3["_c2"].alias("probability_0"),
    r3["_c3"].alias("probability_1"),
).show(5, False)

+--------+----------+------------------+-------------------+
|ID      |prediction|probability_0     |probability_1      |
+--------+----------+------------------+-------------------+
|100342.0|0.0       |0.7083333333333333|0.29166666666666663|
|100344.0|0.0       |0.7830128205128204|0.21698717948717952|
|100367.0|0.0       |0.8439183673469388|0.15608163265306124|
|100373.0|1.0       |0.3948498139287613|0.6051501860712387 |
|100418.0|0.0       |0.7               |0.3                |
+--------+----------+------------------+-------------------+
only showing top 5 rows
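
The scores were written without a header row, which is why the read-back cell has to rename the positional columns `_c0` to `_c3`. The same header-less round trip can be sketched with Python's standard `csv` module, using an in-memory buffer in place of object storage. The sample values are copied from the output above. Note that the fields come back as strings here; `spark.read.csv` likewise returns string columns unless a schema is supplied or inferred.

```python
import csv
import io

rows = [
    (100342.0, 0.0, 0.7083333333333333, 0.29166666666666663),
    (100373.0, 1.0, 0.3948498139287613, 0.6051501860712387),
]

# Write the rows with no header line, as the notebook cell does.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

# Read them back and attach the column names by position.
buffer.seek(0)
names = ["ID", "prediction", "probability_0", "probability_1"]
records = [dict(zip(names, fields)) for fields in csv.reader(buffer)]
print(records[0])
```

If downstream consumers need numeric types, convert the fields after reading rather than relying on the file to carry them.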



You have finished testing the scoring notebook.

### Step 6: Set up a schedule to run a notebook
Use the clock icon in the top-right corner to schedule the notebook.

You have come to the end of the lab.

**Authors**
* Elena Lowery
* Mokhtar Kandil
* Rich Tarro
* Sidney Phoon