# Predict house prices

**Original source:** https://medium.com/ibm-watson-data-lab/building-your-first-machine-learning-system-b3d9401927b7  
**Modified by:** Jukka Ruponen /IBM, 2018-01-07

## Step 1: Identify what you want to predict and the source of your data

We’ve identified that we want to predict house prices, and we’ve chosen the data set that will drive those predictions.
The data set is available on GitHub:

https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv

We will use this URL later to pull the data into our Notebook.

## Step 2: Import, clean and analyze the data

The next two cells (optionally) update and then import a Python library called PixieDust. PixieDust is an open source helper library that works as an add-on to Jupyter Notebooks and makes it easy to import and visualize data.

In [None]:
#Optionally update the pixiedust to the latest version
!pip install --user --upgrade pixiedust

In [3]:
import pixiedust

Load the sample data from GitHub and create a data frame (df)

In [4]:
df = pixiedust.sampleData("https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv")

Downloading 'https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv' from https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv
Downloaded 92 bytes
Creating pySpark DataFrame for 'https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv'. Please wait...
Loading file using 'SparkSession'
Successfully created pySpark DataFrame for 'https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv'


Next, we'll display the data.
The previous cell created a Spark DataFrame called “df”. A DataFrame is a data set organized into named columns. You can think of it as a spreadsheet, or a relational database table. The Spark ML API uses DataFrames to train and test ML models.

Note: This will display all of the data. If the data set were large, you should display only the first rows instead, for example with df.show(5).

In [5]:
display(df)

SquareFeet,Bedrooms,Color,Price
2100,3,White,100000
2300,4,White,125000
2500,4,Brown,150000
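
To make the "named columns" idea concrete without Spark, here is a minimal sketch that parses the same three rows with Python's standard csv module. The names csv_text, rows and prices are illustrative and not part of the original notebook.

```python
import csv
import io

# The three rows shown above, as raw CSV text.
csv_text = """SquareFeet,Bedrooms,Color,Price
2100,3,White,100000
2300,4,White,125000
2500,4,Brown,150000
"""

# DictReader uses the header line as column names and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Access a column by name, similar in spirit to selecting a DataFrame column.
prices = [int(row["Price"]) for row in rows]
print(prices)
```

Note that csv reads every value as a string, which is why the prices are converted with int(); a Spark DataFrame carries a type for each column instead.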


## Step 3: Use Apache Spark ML to build and test a machine learning model
We’re going to build our first ML model in just a handful of cells. To start we need to import the Spark ML libraries that we’ll be using:

In [7]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

This is a regression problem (we’re trying to predict a real number), so we are going to use the Linear Regression algorithm in pyspark.ml.regression. There are other regression algorithms, but those are outside the scope of this post.

We are going to build our ML model in just four lines of code, all in a single cell in our notebook:

In [9]:
# Defining the ML model (linear regression)

# Our ML algorithm expects a single vector of feature columns.
# So here we use a VectorAssembler to tell our ML pipeline that we want SquareFeet and Bedrooms as our features:
assembler = VectorAssembler(inputCols=['SquareFeet','Bedrooms'],outputCol="features")

# Next, we create an instance of LinearRegression, the ML algorithm we are going to use.
# At a minimum, you must specify the features and the labels.
# There are other parameters you can provide to tweak the algorithm, but they’re not going to do us much good when working with three data points :)
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, labelCol='Price', featuresCol='features')

# Next, we create our pipeline.
# A Pipeline allows us to specify the steps that should be performed when training an ML model.
# In this case, we first want to assemble our two feature columns into a single vector — that’s the assembler.
# Then we want to run it through our LinearRegression algorithm (lr).
pipeline = Pipeline(stages=[assembler, lr])

# Finally, we pass our DataFrame to the fit method on the pipeline to create our ML model.
model = pipeline.fit(df)
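
As a rough illustration of the assembling step, the sketch below does in plain Python what the comments above describe: it collects the chosen columns of one row into a single list of floats. The assemble function is a hypothetical stand-in for explanation only, not the Spark VectorAssembler API.

```python
# Plain-Python stand-in for the assembling step (not the Spark API).
def assemble(row, input_cols):
    """Return the selected columns of a row as one list of floats."""
    return [float(row[col]) for col in input_cols]

# One input row with the two feature columns used in this notebook.
row = {"SquareFeet": 2400, "Bedrooms": 4}
features = assemble(row, ["SquareFeet", "Bedrooms"])
print(features)  # [2400.0, 4.0]
```

The regression stage then only has to look at this one "features" value per row, no matter how many input columns were selected.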

**Test the model**  
It’s time to test our model. We are next going to run a single prediction.

In [12]:
# Defining a Python function to get our prediction (creates the DataFrame we’ll pass to our model):
def get_prediction(square_feet, num_bedrooms):
    df_req = spark.createDataFrame([(square_feet, num_bedrooms)],
                                   ['SquareFeet','Bedrooms'])
    df_res = model.transform(df_req)
    return df_res

In [17]:
# Run the test prediction with 2400 sq-feet and 4 bedrooms:
res = get_prediction(2400, 4)
res.show()

+----------+--------+------------+------------------+
|SquareFeet|Bedrooms|    features|        prediction|
+----------+--------+------------+------------------+
|      2400|       4|[2400.0,4.0]|137499.79216713924|
+----------+--------+------------+------------------+
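
As a rough cross-check of the prediction above, the sketch below fits an ordinary least-squares line to the three training rows using SquareFeet only. This is a simplification and an assumption on our part: the Spark model uses two features and elastic-net regularization, so the two results should be close but not identical.

```python
# Training rows from the data set: (SquareFeet, Price).
data = [(2100, 100000), (2300, 125000), (2500, 150000)]

n = len(data)
mean_x = sum(x for x, _ in data) / float(n)
mean_y = sum(y for _, y in data) / float(n)

# Closed-form least squares for one feature: slope = cov(x, y) / var(x).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
         / sum((x - mean_x) ** 2 for x, _ in data))
intercept = mean_y - slope * mean_x

def predict(square_feet):
    """Predict a price from the fitted line."""
    return slope * square_feet + intercept

print(predict(2400))
```

With these three rows the unregularized line predicts 137500 for 2400 square feet, which is close to the 137499.79 returned by the pipeline.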



## Step 4: Save & Deploy model to Watson ML service

Let's first import some libraries that we'll need:

In [18]:
import json
import requests
import urllib3

We’ve now trained and tested our machine learning model in this Notebook, but a model that lives only in a notebook is of little use if we want to predict house prices from a web or mobile app. That’s where the Watson ML service comes in.

Next, we’re going to deploy this model to Watson ML and create a “scoring endpoint”, that is, a REST API for making predictions.

Next, you’ll need to specify your Watson ML credentials in this Notebook.
If you have not yet created a Watson ML service, do the following:
1. In another browser window, go to https://console.bluemix.net/catalog/services/machine-learning and create a **Watson Machine Learning** service instance  
**Note**: For your convenience, create the Watson ML service in the same space as your Spark service instance
2. Find your Watson ML credentials in the Bluemix console by opening the newly created Watson ML service and clicking "Service Credentials" on the left
3. If you do not yet have any credentials provisioned, press "Create credentials" and "Add". Then check your credentials again (step 2).
4. Copy your unique credential values (username, password and instance_id) to the cell below:

In [None]:
# Watson ML credentials:
service_path = 'https://ibm-watson-ml.mybluemix.net'
username = 'YOUR_WML_USER_NAME'
password = 'YOUR_WML_PASSWORD'
instance_id = 'YOUR_WML_INSTANCE_ID'
model_name = 'House Prices Model'
deployment_name = 'House Prices Deployment'
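
If you would rather not paste credentials into a shared notebook, one possible alternative is to read them from environment variables. The sketch below is only a suggestion: the helper read_credential and the variable names WML_USERNAME, WML_PASSWORD and WML_INSTANCE_ID are hypothetical and not defined by Watson ML or DSX.

```python
import os

def read_credential(name, default=None):
    """Read one credential from the environment, failing loudly if absent."""
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError("Missing environment variable: {}".format(name))
    return value

# Hypothetical variable names; set them in your own environment first.
# username = read_credential("WML_USERNAME")
# password = read_credential("WML_PASSWORD")
# instance_id = read_credential("WML_INSTANCE_ID")
```

The three assignments are commented out so that the cell does nothing until the variables actually exist in your environment.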

The next two cells initialize some libraries for connecting to Watson ML. These libraries are built into DSX:

In [20]:
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact

In [52]:
ml_repository_client = MLRepositoryClient(service_path)
ml_repository_client.authorize(username, password)
ml_model_name = model_name

Create model artifact (abstraction layer)

In [None]:
pipeline_artifact = MLRepositoryArtifact(pipeline, name="pipeline")

**Saving the model to Watson ML**  
Next, we’ll use these libraries to save our model to Watson ML. We pass the trained model, our data set, and a name for the model — in this case we’re calling it “House Prices Model”:

In [53]:
model_artifact = MLRepositoryArtifact(model, training_data=df, name=ml_model_name, pipeline_artifact=pipeline_artifact)
saved_model = ml_repository_client.models.save(model_artifact)
saved_model

<repository.mlrepositoryclient.model_adapter.ModelArtifact at 0x7f083d353bd0>

To confirm that our model was saved in Watson ML, list the models stored in the repository and look for our model name:

In [54]:
ml_model_name = 'House Prices Model'
ml_models = ml_repository_client.models.all()
for ml_model in ml_models:
    print('{} - {}'.format(ml_model.name, ml_model.uid))

House Prices Model - 3989a65c-c3bb-4012-aeb4-91667804cbe9


**Request an id, pointing to our saved Watson ML model**  
The call to models.save above returned an **object** that we stored in the **saved_model** variable. From it we now extract the unique ID of the model, **model_id**. This ID is important because it will be used later to create a deployment for the model.

In [55]:
model_id = saved_model.uid
model_id

'3989a65c-c3bb-4012-aeb4-91667804cbe9'

**Preparing to deploy the model in Watson ML**  
We are now going to create a Deployment for our ML model. In other words, we are going to deploy a running instance of our model. To do this, we'll use the Watson ML REST API. The Watson ML REST API uses token-based authentication, so our first step is to generate a token using our Watson ML credentials:

In [56]:
# Generate access token for Watson ML REST API
headers = urllib3.util.make_headers(basic_auth='{}:{}'.format(username, password))
url = '{}/v3/identity/token'.format(service_path)
response = requests.get(url, headers=headers)
ml_token = 'Bearer ' + json.loads(response.text).get('token')
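
To show what travels over the wire in this step, here is a minimal sketch using only the standard library. It builds a Basic authorization value of the kind urllib3's make_headers produces and turns a token response body into the "Bearer ..." value. The function names and sample values are made up for illustration; the only thing taken from the cell above is that the response JSON carries a "token" field.

```python
import base64
import json

def basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'authorization' header."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

def bearer_from_response(response_text):
    """Turn a token response body into a 'Bearer ...' header value."""
    token = json.loads(response_text).get("token")
    if token is None:
        raise ValueError("No 'token' field in response: " + response_text)
    return "Bearer " + token

# Made-up values for illustration only.
print(basic_auth_header("user", "secret"))
print(bearer_from_response('{"token": "abc123"}'))
```

Raising an error when the token is missing gives a clearer message than the TypeError that 'Bearer ' + None would produce when the credentials are rejected.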

In [None]:
# Display the access token for Watson ML REST API:
ml_token

In [None]:
# Get the details of our Watson ML instance via the REST API
model_url = service_path + "/v3/wml_instances/" + instance_id
model_header = {'Content-Type': 'application/json', 'Authorization': ml_token}
model_response = requests.get(model_url, headers=model_header)
print(model_response.text)

**Model Deployment**  
Now we can actually create our deployment. Here we make an HTTP POST to the published_models/deployments endpoint — passing in our Watson ML instance_id and the model_id of our newly saved model.

In [None]:
deployment_url = service_path + "/v3/wml_instances/" + instance_id + "/published_models/" + model_id + "/deployments/"
deployment_header = {'Content-Type': 'application/json', 'Authorization': ml_token}
deployment_payload = {"type": "online", "name": deployment_name}
deployment_response = requests.post(deployment_url, json=deployment_payload, headers=deployment_header)
print(deployment_response)
print(deployment_response.text)

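**OPTIONAL:**  
Run the cell below **ONLY** if you want to delete previously saved Watson ML models and their deployments!  
This may be useful for cleaning up, but **BE WARNED, IT IS DESTRUCTIVE!**  
It also overwrites the deployment_response variable used in Step 5, so if you do run it, save and deploy the model again afterwards.  
Otherwise, just skip it and DO NOT RUN IT!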

In [51]:
## WARNING - THIS CELL WILL DELETE ANY SAVED MODELS AND DEPLOYMENTS THAT ALREADY EXIST!
## DO NOT RUN THIS AND SKIP IT, UNLESS THIS IS EXACTLY WHAT YOU WANT TO DO!

for ml_model in ml_models:
    print('{} - {}'.format(ml_model.name, ml_model.uid))
    deployment_header = {'Content-Type': 'application/json', 'Authorization': ml_token}
    deployment_url = service_path + "/v2/published_models/" + ml_model.uid + "/deployments/"
    deployment_response = requests.get(deployment_url, headers=deployment_header)
    o = json.loads(deployment_response.text)
    if 'resources' in o.keys():
        for resource in o['resources']:
            deployment_url = service_path + "/v2/published_models/" + ml_model.uid + "/deployments/" + resource['metadata']['guid']
            deployment_response = requests.delete(deployment_url, headers=deployment_header)
            print(deployment_response.text)
        # delete the model
        ml_repository_client.models.remove(ml_model.uid)

## Step 5: Test the model in Watson ML service

**Get the HTTP endpoint URL to access our model**  
The last line below prints the scoring_url parsed from the response received from Watson ML. This is an **HTTP endpoint** that we can use to make predictions. You now have a deployed machine learning model that you can use to predict house prices from anywhere! You can call it from a front-end application, your middleware, or from a notebook — we’ll do just that next :)

In [60]:
scoring_url = json.loads(deployment_response.text).get('entity').get('scoring_url')
print(scoring_url)

https://ibm-watson-ml.mybluemix.net/v3/wml_instances/ce16a175-2a90-4725-b08a-ded2dd5fbee9/published_models/3989a65c-c3bb-4012-aeb4-91667804cbe9/deployments/31b99366-7ce9-466d-8e0b-72be805f8931/online
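
The chained .get('entity').get('scoring_url') call fails with an AttributeError if the deployment request was rejected and the response has no "entity" field. A more defensive variant could look like the sketch below. The function name, the sample body and the 201 status code are assumptions for illustration; only the entity/scoring_url shape comes from the cell above.

```python
import json

def extract_scoring_url(status_code, response_text):
    """Return entity.scoring_url, or raise with the server's message."""
    if not 200 <= status_code < 300:
        raise RuntimeError(
            "Deployment failed ({}): {}".format(status_code, response_text))
    entity = json.loads(response_text).get("entity") or {}
    url = entity.get("scoring_url")
    if url is None:
        raise RuntimeError("No scoring_url in response: " + response_text)
    return url

# Stand-in response body with the same shape the cell above relies on.
sample = '{"entity": {"scoring_url": "https://example.invalid/online"}}'
print(extract_scoring_url(201, sample))
```

In the notebook you would call it as extract_scoring_url(deployment_response.status_code, deployment_response.text).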


**Test the model via HTTP**

In [61]:
# Define the HTTP POST request to scoring_url
def get_prediction_from_watson_ml(square_feet, num_bedrooms):
    scoring_header = {'Content-Type': 'application/json', 'Authorization': ml_token}
    scoring_payload = {'fields': ['SquareFeet','Bedrooms'], 'values': [[square_feet, num_bedrooms]]}
    print(scoring_payload)
    scoring_response = requests.post(scoring_url, json=scoring_payload, headers=scoring_header)
    return scoring_response.text
    #values = json.loads(scoring_response.text)['values'][0]
    #prediction = values[len(values)-1]
    #return {'prediction': prediction, 'probability': values[len(values)-2][int(prediction)]}

In [62]:
response = get_prediction_from_watson_ml(2400, 4)
print(response)

{'fields': ['SquareFeet', 'Bedrooms'], 'values': [[2400, 4]]}
{
  "fields": ["SquareFeet", "Bedrooms", "features", "prediction"],
  "values": [[2400, 4, [2400.0, 4.0], 137499.79216713924]]
}
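
An application usually wants just the predicted number rather than the raw JSON text. A minimal sketch, assuming the response always has the "fields" and "values" shape shown above, pairs each field name with its value. The parse_predictions helper is our own illustrative name, not part of any Watson ML client.

```python
import json

# Response body copied from the output above.
response_text = """{
  "fields": ["SquareFeet", "Bedrooms", "features", "prediction"],
  "values": [[2400, 4, [2400.0, 4.0], 137499.79216713924]]
}"""

def parse_predictions(text):
    """Pair each row of 'values' with 'fields' and return a list of dicts."""
    body = json.loads(text)
    return [dict(zip(body["fields"], row)) for row in body["values"]]

first = parse_predictions(response_text)[0]
print(first["prediction"])
```

Because "values" is a list of rows, the same helper also works when several houses are sent in one scoring request.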


## Next steps

In this post you built an end-to-end machine learning system using the IBM Data Science Experience, Spark ML, and Watson ML.

In just a few lines of code, you imported and visualized a data set.  
Then you built an ML pipeline and trained an ML model.  
Finally you made that model available in Watson ML service to make predictions from software running anywhere.

If you wish, you can now create a simple application that takes "house size" and "number of bedrooms" as inputs, calls your Watson ML model, and displays the predicted price.
Check out this GitHub repo for a simple Node-RED example: https://github.com/jruponen/watson_ml_with_wdp
