# Score a PySpark model with IBM Streams



This notebook uses Streams to deploy a [LinearRegression model](https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression) built using the PySpark ML library. 
The model was created [in this notebook](https://github.com/IBM/db2-event-store-iot-analytics/blob/master/notebooks/Event_Store_ML_Model_Deployment.ipynb).


The following code snippet was added to the aforementioned notebook to save the model:
``` python
saved_model_dir = "path/to/model"
model.save(saved_model_dir)
```
You can view the original notebook to learn about the model and/or customize it further. 
Tested using pyspark 2.4.4 and Spark 2.4.4. 


## About the application 
The goal of this notebook is to load a Spark model and use it to score a stream of data. 
In this scenario, we have a stream of temperature readings from sensors. Each tuple, or data item, on the stream has this format:

``` python
{'device': 3,   'sensor': 18,   'ts': 1541019342143,   'ambient_temp': 26.30057257150144,
  'power': 11.516388340597683,
  'temperature': 44.02880734420622}
```
We want to predict the next temperature that the sensor will report using the model previously created.

The result will be a new stream that includes all the data in the input and a predicted temperature as an additional `prediction` attribute:
``` python
{'device': 3,   'sensor': 18,   'ts': 1541019342143,   'ambient_temp': 26.30057257150144,
  'power': 11.516388340597683,
  'temperature': 44.02880734420622,
  'prediction': 44.940489852384474}
```
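
The relationship between the input and output formats can be sketched in plain Python. The `fake_predict` function below is a hypothetical stand-in for the Spark model, and its coefficients are borrowed from the data generator used in this notebook rather than from the trained model. The only point of the sketch is that an output tuple is the input tuple plus one `prediction` attribute.

```python
def fake_predict(ambient_temp, power):
    # Hypothetical stand-in for the trained model, NOT its real coefficients.
    return 1.3 * ambient_temp + 0.5 * power + 5


def add_prediction(tpl):
    # Copy the input so the original tuple is left untouched,
    # then attach the predicted temperature as a new attribute.
    scored = dict(tpl)
    scored["prediction"] = fake_predict(tpl["ambient_temp"], tpl["power"])
    return scored


reading = {"device": 3, "sensor": 18, "ts": 1541019342143,
           "ambient_temp": 26.3, "power": 11.5, "temperature": 44.0}
scored = add_prediction(reading)
print(sorted(set(scored) - set(reading)))  # ['prediction']
```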


### Download the model

As mentioned before, the model was saved using `model.save`. The saved model has been uploaded to GitHub as a zip file, so download and unzip it:

In [7]:
!mkdir -p model_dir
!wget -O lmodel.zip https://github.com/natashadsilva/scratch/blob/master/LinearModel.zip?raw=true
!unzip -o -d model_dir  lmodel.zip

--2019-11-26 20:01:35--  https://github.com/natashadsilva/scratch/blob/master/LinearModel.zip?raw=true
Resolving github.com (github.com)... 192.30.253.113
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/natashadsilva/scratch/raw/master/LinearModel.zip [following]
--2019-11-26 20:01:35--  https://github.com/natashadsilva/scratch/raw/master/LinearModel.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/natashadsilva/scratch/master/LinearModel.zip [following]
--2019-11-26 20:01:36--  https://raw.githubusercontent.com/natashadsilva/scratch/master/LinearModel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8
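
Where shell commands are not available, the `mkdir -p` and `unzip -o -d` steps could probably be replaced with the standard library. The sketch below is one possible way to do that. The `extract_archive` helper and the archive contents are made up for illustration, and the download step is left out.

```python
import os
import tempfile
import zipfile


def extract_archive(zip_path, target_dir):
    # Create the target directory if it is missing, then unpack the archive,
    # overwriting files that already exist.
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


# Demonstration on a throwaway archive; the member name is a placeholder.
with tempfile.TemporaryDirectory() as workdir:
    zip_path = os.path.join(workdir, "example.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("metadata/part-00000", "{}")
    extracted = extract_archive(zip_path, os.path.join(workdir, "model_dir"))
    unpacked_ok = os.path.isfile(
        os.path.join(workdir, "model_dir", "metadata", "part-00000"))

print(extracted, unpacked_ok)
```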

## Setup and install PySpark

Install PySpark using `pip`.



In [12]:
!pip install --user pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting py4j==0.10.7 (from pyspark)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/ab/09/4d/0d184230058e654eb1b04467dbc1292f00eaa186544604b471
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.4


### Make sure Streams Python API is installed
Make sure the `streamsx` package is installed and is at version 1.13.14 or later.

In [1]:
!pip show streamsx
# upgrade if needed
# !pip install --upgrade streamsx


Name: streamsx
Version: 1.13.14
Summary: IBM Streams Python Support
Home-page: https://github.com/IBMStreams/pypi.streamsx
Author: IBM Streams @ github.com
Author-email: debrunne@us.ibm.com
License: Apache License - Version 2.0
Location: /opt/conda/envs/Python36/lib/python3.6/site-packages
Requires: requests, future, dill
Required-by: 
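
The "1.13.14 or later" check can also be done in code. The helpers below are a minimal sketch, not part of `streamsx`, and they assume version strings made only of dot-separated numbers. Comparing integer tuples matters here because a plain string comparison would rank `1.9.2` above `1.13.14`.

```python
def version_tuple(version):
    # "1.13.14" -> (1, 13, 14); assumes purely numeric, dot-separated parts.
    return tuple(int(part) for part in version.split("."))


def is_at_least(installed, required):
    # Tuples compare element by element, so 13 > 9 is handled correctly.
    return version_tuple(installed) >= version_tuple(required)


print(is_at_least("1.13.14", "1.13.14"))  # True
print(is_at_least("1.9.2", "1.13.14"))    # False
```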


### Create an instance of the Streaming Analytics service

The Streams application does not run in this notebook; it runs on a Streams instance. If you do not already have one, follow the [steps here](http://ibmstreams.github.io/streamsx.documentation/docs/python/1.6/python-appapi-devguide-2/#streams) to set up the Streaming Analytics service in IBM Cloud.


### Set up a connection to the service

<h4>Submit to the Streaming Analytics service</h4>
        To connect to the Streaming Analytics service in IBM Cloud, you need the service credentials from the Streaming Analytics service dashboard.
        <p>
        To copy your service credentials, open the Streaming Analytics service dashboard, click <strong>Service Credentials</strong>, then <strong>View Credentials</strong>, and copy the contents of the cell. Click <strong>Add new credentials</strong> if there are no credentials listed.
        </p>
        <p>See the image below for an example. Click to enlarge.</p>
        <a href="https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2019/11/sa-credentials-only.png">
        <img width="600" height="500" src="https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2019/11/sa-credentials-only.png"></a>
 <br/>
 <h4>Run the cell below and enter the credentials when prompted.</h4>

In [19]:
from streamsx.topology.context import ConfigParams
from streamsx.topology import context
import json
import getpass


service_cfg  = {}

SA_credentials=getpass.getpass('Streaming Analytics credentials:')
service_cfg[ConfigParams.SERVICE_DEFINITION] = json.loads(SA_credentials)

def submit_topology(topo):
    # Disable SSL certificate verification for the connection to the service
    service_cfg[ConfigParams.SSL_VERIFY] = False

    # This specifies how the application will be deployed
    contextType = context.ContextTypes.STREAMING_ANALYTICS_SERVICE

    return context.submit(contextType, topo, config=service_cfg)

Streaming Analytics credentials:········
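
Because the credentials are pasted by hand, it may help to validate them before they reach `json.loads` in the cell above. The `parse_credentials` helper below is a hypothetical addition, not part of `streamsx`, and it checks only that the text is a JSON object. It does not check for any particular keys.

```python
import json


def parse_credentials(raw):
    # Return the credentials as a dict, or raise ValueError with a readable message.
    try:
        creds = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError("Credentials are not valid JSON: " + str(err)) from err
    if not isinstance(creds, dict):
        raise ValueError("Expected a JSON object, got " + type(creds).__name__)
    return creds


# The key and value here are placeholders, not real credential fields.
print(parse_credentials('{"example_key": "example_value"}'))
```

A truncated paste then fails immediately with a clear message instead of surfacing later as a submission error.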


## Define the class used to score the model

The `LoadSparkModel` class is a callable class that will be used to load the model, and then score it on each incoming data item, or tuple.

When the `__call__` method receives a tuple, it uses `model.transform` to get a prediction for the tuple and adds the result as a `prediction` attribute. 

The data is not scored in batches but in real-time, as it arrives.

In [9]:
import os  # used in __enter__ to set PYSPARK_PYTHON
import streamsx.ec
import time
import random

import warnings
import logging
import csv 
warnings.filterwarnings('ignore')


class LoadSparkModel:
    def __init__(self, model_file):
        # name of the model 
        self.model_path = model_file
    def __exit__(self, exc_type, exc_value, traceback):
        pass
    def __enter__(self):
        # This method is called when the application is starting on the Streams runtime.

        # PySpark requires that `PYSPARK_PYTHON` be set to the Python executable under `PYTHONHOME`.
        # Since this app runs on a Streams instance in the cloud, environment variables cannot be set on the host ahead of time,
        # so the workaround is to set the variable programmatically before importing the pyspark modules.

        os.environ["PYSPARK_PYTHON"] = os.environ["PYTHONHOME"] + "/bin/python"
        from pyspark.ml import PipelineModel
        from pyspark.sql import SparkSession
        

        logging.getLogger("SparkModel").info("INFO: Loading Spark Model")
        # create the Spark session and load the saved model
        try:
            self.sparkSession = SparkSession.builder.appName("Score Spark Model with Streams").getOrCreate()
            # get path to model at runtime
            path_to_model = streamsx.ec.get_application_directory() + "/etc/" + self.model_path
            self.sparkModel = PipelineModel.load(path_to_model)
            logging.getLogger("SparkModel").info("INFO: Successfully loaded model")
        except Exception as e:
            logging.getLogger("SparkModel").error("ERROR loading file " + str(e))
        
    def __call__(self, tpl):
        logging.getLogger("SparkModel").info("INFO: Going to run model on tuple")
        # wrap tuple in a Data Frame 
        tpl_as_DF = self.sparkSession.createDataFrame([tpl])
        # score model
        result = self.sparkModel.transform(tpl_as_DF).collect()[0]
        
        logging.getLogger("SparkModel").debug("INFO: Ran model on tuple")
        # add prediction to input tuple
        
        tpl["prediction"] = result.prediction
        return tpl
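
The load-once, score-per-tuple pattern can be tried locally without Spark or Streams. `LocalScorer` below is a hypothetical stand-in that mirrors the shape of `LoadSparkModel`. Its weights are illustrative, and the `with` statement only imitates how the runtime is assumed to enter the callable before tuples arrive.

```python
class LocalScorer:
    """Hypothetical stand-in that mirrors the shape of LoadSparkModel."""

    def __init__(self, weights):
        # Only cheap state is stored at construction time.
        self.weights = weights
        self.model = None
        self.loads = 0

    def __enter__(self):
        # Expensive setup happens once, when the context is entered.
        weights = self.weights

        def predict(tpl):
            return (weights["ambient_temp"] * tpl["ambient_temp"]
                    + weights["power"] * tpl["power"]
                    + weights["bias"])

        self.model = predict
        self.loads += 1
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.model = None

    def __call__(self, tpl):
        # Called once per tuple: score it and attach the prediction.
        tpl["prediction"] = self.model(tpl)
        return tpl


with LocalScorer({"ambient_temp": 1.3, "power": 0.5, "bias": 5.0}) as scorer:
    first = scorer({"ambient_temp": 20.0, "power": 10.0})
    second = scorer({"ambient_temp": 25.0, "power": 12.0})
    load_count = scorer.loads

print(load_count, round(first["prediction"], 2), round(second["prediction"], 2))
```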


### Define data generation function

For simplicity, we'll use the `readings` function below to simulate a stream of data. This function generates a new tuple every 0.1 seconds.
See the documentation to [connect to other data sources](http://ibmstreams.github.io/streamsx.documentation/docs/python/1.6/python-appapi-devguide-4/#adapters).

In [10]:
import random, time

def readings():
    while True:
        time.sleep(0.1)
        time_now = 1541019341*1000
        record_processed = 0
        deviceID = random.randint(1, 3)
        sensorID = random.randint(1, 50)
        ambient_temp = random.gauss(24.5, 2)  # ambient temp 
        time_now += random.randint(10,1500)
        power = random.gauss(10, 3) # power consumption
        noise = random.gauss(0,1.5)
        temp = 1.3 * ambient_temp + 0.5 * power + 5 + noise

        yield dict(device=int(deviceID), 
                   sensor=int(sensorID), ts=time_now, 
                   ambient_temp=float(ambient_temp),
                   power=float(power), 
                   temperature=float(temp))
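
Since the generator never terminates, calling `list()` on it would hang. One way to inspect a few tuples locally is `itertools.islice`. The `sample_readings` function below is a simplified, hypothetical variant of `readings` with the same fields and no sleep, so the sketch runs on its own.

```python
import itertools
import random


def sample_readings():
    # Simplified variant of readings(): same fields, no delay between tuples.
    while True:
        ambient_temp = random.gauss(24.5, 2)
        power = random.gauss(10, 3)
        noise = random.gauss(0, 1.5)
        yield dict(device=random.randint(1, 3),
                   sensor=random.randint(1, 50),
                   ts=1541019341 * 1000 + random.randint(10, 1500),
                   ambient_temp=ambient_temp,
                   power=power,
                   temperature=1.3 * ambient_temp + 0.5 * power + 5 + noise)


# islice stops after three items, so the infinite loop is never exhausted.
first_three = list(itertools.islice(sample_readings(), 3))
print(len(first_three))  # 3
```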


# Define the Streams application

A Streams application is a directed graph called a `Topology`.

## 1. Create the Topology object 

First, create the Topology and set up the prerequisites:


In [13]:

from streamsx.topology.topology import Topology
import streamsx.topology.context
from pyspark.ml.regression import LinearRegression

model_path = "model_dir" # directory we downloaded the model into

topo = Topology(name="SparkScoring")

# This makes sure the model files are available to the application at runtime

topo.add_file_dependency(model_path + "/", "etc") 


# The pyspark version on the Streams host must match the version used in this notebook.
# For Streams running in the cloud, work around this by including pyspark in the compiled application using `topo.add_pip_package`.

pyspark_version = !pip show pyspark
pyspark_version = pyspark_version[1].split(':')[1].strip()
topo.add_pip_package("pyspark==" + pyspark_version)
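
The cell above takes the version from the second line of the `pip show` output, which depends on the order of the fields. A slightly more defensive sketch looks the field up by name instead. `parse_pip_show_version` is a hypothetical helper, and the sample lines are copied from the `streamsx` output shown earlier.

```python
def parse_pip_show_version(lines):
    # Look for the "Version:" field by name rather than by position.
    for line in lines:
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Version":
            return value.strip()
    raise ValueError("no Version field found")


sample = ["Name: streamsx",
          "Version: 1.13.14",
          "Home-page: https://github.com/IBMStreams/pypi.streamsx"]
print(parse_pip_show_version(sample))  # 1.13.14
```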



2019-11-26 20:04:07,836 - __PROJECT_LIB__ - ERROR - ProjectHandle: Project ID neither provided nor found in the environment.


## 2. Create the source stream

Use the `readings` function to create our input `Stream`. The Stream class represents a potentially infinite sequence of data.

In [14]:
# The source_stream will contain the tuples generated by the readings function

source_stream = topo.source(readings)

## 3. Use the `LoadSparkModel` class to score the data from the source stream

This is the core of the application. 

- We want to convert every tuple on the source Stream to a new Stream that contains the predictions.
- We already have the `LoadSparkModel` class that will return a prediction, given a tuple.


Use the `map` transform to put it all together:

In [15]:

model_scorer = LoadSparkModel(model_path)
predictions_stream = source_stream.map(model_scorer)
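
The behaviour of `map` can be approximated locally with a generator. This is only an emulation for illustration, not the Streams implementation. The rule that a `None` result submits no tuple reflects how the `streamsx` documentation describes `map`, so treat it as an assumption and check the API reference. The `tag_hot` function and its threshold are made up.

```python
def local_map(func, tuples):
    # Apply func to each tuple; results that are None are skipped.
    for tpl in tuples:
        result = func(tpl)
        if result is not None:
            yield result


def tag_hot(tpl):
    # Hypothetical transform: drop cool readings, flag the rest.
    if tpl["temperature"] < 40:
        return None
    return dict(tpl, hot=True)


sample = [{"temperature": 35.0}, {"temperature": 44.0}, {"temperature": 41.5}]
mapped = list(local_map(tag_hot, sample))
print(mapped)
# [{'temperature': 44.0, 'hot': True}, {'temperature': 41.5, 'hot': True}]
```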


## 4. Create a view to see output from the notebook
A `View` is a connection to the running application that allows you to see the data on a particular Stream.


Create a view on `predictions_stream` to see the output:

In [17]:

predictions_stream.print()

# create a view to watch the stream data while running


results_view = predictions_stream.view(name="Predictions")


### Submit topology

As mentioned, the Streams application does not run in this notebook; it runs on the Streams instance. This cell submits the application to the instance for execution, using the `submit_topology` function defined earlier.


**Note:** If you see messages like `IntProgress(value=0, bar_style='info', description='Initializing', max=10, style=ProgressStyle(description_wid…`, it is because ipywidgets are not enabled in your kernel. They're currently not supported in Watson Studio.

In [21]:

# The submission_result object contains information about the running application, or job
print("Submitting Topology to Streams for execution..")
submission_result = submit_topology(topo)

if submission_result.job:
    streams_job = submission_result.job
    print ("JobId: ", streams_job.id , "\nJob name: ", streams_job.name)
else:
    print("Submission failed: " + str(submission_result))


Submitting Topology to Streams for execution..


IntProgress(value=0, bar_style='info', description='Initializing', max=10, style=ProgressStyle(description_wid…

JobId:  0 
Job name:  notebook::SparkScoring_0


## Connect to the view to see the data

Once the application is running, run this cell to see some output.

In [32]:
try:
    print("Fetching data from the view")
    
    data_queue = results_view.start_data_fetch()
    tpls = []
    num_to_fetch = 5
    for i in range(num_to_fetch):
        tpls.append(data_queue.get())
        
    print("Fetched " + str(len(tpls)) + " result tuples")
finally:
    results_view.stop_data_fetch()

import pandas as pd
df = pd.DataFrame(tpls)
df.head()


Fetching data from the view
Fetched 5 result tuples


|   | ambient_temp | device | power | prediction | sensor | temperature | ts |
|---|---|---|---|---|---|---|---|
| 0 | 21.599243 | 1 | 6.83905 | 36.513092 | 22 | 37.315819 | 1541019342371 |
| 1 | 22.755869 | 2 | 19.986368 | 44.562062 | 10 | 43.803365 | 1541019341967 |
| 2 | 25.470143 | 2 | 14.585983 | 45.392438 | 40 | 49.973616 | 1541019342335 |
| 3 | 22.98205 | 3 | 12.470996 | 41.111915 | 27 | 41.968606 | 1541019341880 |
| 4 | 29.672598 | 1 | 13.865026 | 50.483845 | 50 | 49.462162 | 1541019342027 |
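
The fetch loop above blocks on `data_queue.get()` until a tuple arrives, so it waits indefinitely if the job produces no output. Assuming the object returned by `start_data_fetch()` behaves like a standard `queue.Queue`, a timeout-based variant could look like the sketch below. `fetch_up_to` is a hypothetical helper, demonstrated here on a local queue.

```python
import queue


def fetch_up_to(data_queue, count, timeout_seconds):
    # Collect at most `count` items, giving up on the first timeout.
    items = []
    for _ in range(count):
        try:
            items.append(data_queue.get(timeout=timeout_seconds))
        except queue.Empty:
            break
    return items


# Local demonstration: only three items are available, five are requested.
q = queue.Queue()
for n in range(3):
    q.put({"n": n})
fetched = fetch_up_to(q, 5, 0.1)
print(len(fetched))  # 3
```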


## See the job in the Streams Console
See the running job's logs and other metrics in the Streams Console.
The Streams Python development guide has [instructions to access the Streams Console](http://ibmstreams.github.io/streamsx.documentation/docs/python/1.6/python-appapi-devguide-3/#42-see-job-status).

## Cancel the job


In [None]:
submission_result.job.cancel()

## References

- [Map transform](http://ibmstreams.github.io/streamsx.documentation/docs/python/1.6/python-appapi-devguide-4/#map)
- [Streams Python API Doc](https://streamsxtopology.readthedocs.io/en/stable/streamsx.topology.topology.html#stream-processing)