# LocalCart scenario part 5: Build a product recommendation engine

In this notebook, you'll first load historical shopping data. Then you'll structure and view that data in a table that displays customer information, product categories, and shopping history details. You will use the _k_-means algorithm, which is useful for cluster analysis in data mining, to segment customers into clusters for the purpose of making an in-store purchase recommendation based on shopping history. You’ll deploy the model to the IBM Watson Machine Learning service in IBM Bluemix to create your recommendation application. By the end of the notebook, you’ll understand how to build a model to provide product recommendations for customers based on their purchase history.

<img src="https://raw.githubusercontent.com/wdp-beta/get-started/master/notebooks/images/nb5_flow.png"></img>

This notebook runs on Python 2 with Spark 2.0.

## Table of contents

1. [Setup](#setup)<br>
    1.1. [Import libraries](#libraries)<br>
    1.2. [Load sample data](#load)<br>
    1.3. [View data in a table](#view_table)<br>
2. [Create a _k_-means model](#kmeans)<br>
    2.1. [Prepare data](#prepare_data)<br>
    2.2. [Create clusters and define the model](#build_model)<br>
3. [Persist the model](#persist)<br>
4. [Deploy the model to the cloud](#deploy)<br>
    4.1. [Create deployment for the model](#create_deploy)<br>
    4.2. [Test model deployment](#test_deploy)<br>
5. [Create product recommendations](#create_recomm)<br>
6. [Summary and next steps](#summary)<br>

<a id="setup"></a>
## 1. Setup

You need to import the required library and load the customer shopping data into this notebook.

<a id="libraries"></a>
### 1.1. Import libraries

Import the PixieDust library:

In [None]:
import pixiedust

In [None]:
# If the previous cell caused an error:
# (1) Uncomment the last line in this cell by removing the # sign
# (2) Run this cell
# (3) Restart the kernel
# (4) Re-run the first cell
#!pip install --user --upgrade pixiedust

<a id="load"></a>
### 1.2. Load sample data

In this section you'll load the data file that contains the customer shopping data:

In [None]:
df = pixiedust.sampleData("https://raw.githubusercontent.com/wdp-beta/get-started/master/data/customers_orders1_opt.csv")

<a id="view_table"></a>
### 1.3. View data in a table by using PixieDust

To better examine and visualize the data, run the following cell to view it in a table format:

In [None]:
display(df)

<a id="kmeans"></a>
## 2. Create a _k_-means model

In this section of the notebook you'll use the _k_-means implementation to associate every customer with a cluster based on their shopping history.

First, import the Apache® Spark machine learning packages that you'll need in the subsequent steps:


In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.clustering import KMeansModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

<a id="prepare_data"></a>
### 2.1. Prepare data

Create a new data set with just the data that you need. Filter the columns that you want, in this case the customer ID column and the product-related columns. Remove the columns that you don't need for aggregating the data and training the model:

In [None]:
# Here are the product cols. In a real world scenario we would query a product table, or similar.
product_cols = ['Baby Food', 'Diapers', 'Formula', 'Lotion', 'Baby wash', 'Wipes', 'Fresh Fruits', 'Fresh Vegetables', 'Beer', 'Wine', 'Club Soda', 'Sports Drink', 'Chips', 'Popcorn', 'Oatmeal', 'Medicines', 'Canned Foods', 'Cigarettes', 'Cheese', 'Cleaning Products', 'Condiments', 'Frozen Foods', 'Kitchen Items', 'Meat', 'Office Supplies', 'Personal Care', 'Pet Supplies', 'Sea Food', 'Spices']
# Here we get the customer ID and the products they purchased
df_filtered = df.select(['CUST_ID'] + product_cols)

Run the `display()` command again, this time to view the filtered information:

In [None]:
display(df_filtered)

Now, aggregate the individual transactions for each customer to get a single score per product, per customer.

In [None]:
df_customer_products = df_filtered.groupby('CUST_ID').sum()  # Use customer IDs to group transactions by customer and sum them up
df_customer_products = df_customer_products.drop('sum(CUST_ID)')
display(df_customer_products)
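
To make the aggregation step concrete, here is a minimal pure-Python sketch of what grouping by customer and summing does. It does not use Spark; the helper `aggregate_by_customer` and the sample rows are hypothetical and exist only for illustration. The output keys follow the `sum(<column>)` naming that the later cells in this notebook rely on.

```python
from collections import defaultdict

def aggregate_by_customer(rows, key, value_cols):
    """Sum value_cols per distinct key, naming outputs 'sum(<col>)'."""
    totals = defaultdict(lambda: dict.fromkeys(value_cols, 0))
    for row in rows:
        for col in value_cols:
            totals[row[key]][col] += row[col]
    # One output row per customer, ordered by customer ID
    return [
        dict([(key, cust_id)] + [('sum({})'.format(c), sums[c]) for c in value_cols])
        for cust_id, sums in sorted(totals.items())
    ]

# Hypothetical transactions: one row per shopping trip
sample_rows = [
    {'CUST_ID': 1, 'Diapers': 1, 'Wine': 0},
    {'CUST_ID': 2, 'Diapers': 0, 'Wine': 2},
    {'CUST_ID': 1, 'Diapers': 2, 'Wine': 1},
]
aggregated = aggregate_by_customer(sample_rows, 'CUST_ID', ['Diapers', 'Wine'])
for row in aggregated:
    print(row)
```

Customer 1 appears in two transactions, so the two rows collapse into a single row with the per-product totals.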

<a id="build_model"></a>
### 2.2. Create clusters and define the model 

You'll use _k_-means to create 100 clusters based on the number of times a specific customer purchased a product.

First, create a feature vector by combining the aggregated per-product purchase counts (the `sum(...)` columns) into a single `features` column:

In [None]:
assembler = VectorAssembler(inputCols=["sum({})".format(x) for x in product_cols],outputCol="features") # Assemble vectors using product fields

Next, create the _k_-means clusters and the pipeline to define the model:

In [None]:
kmeans = KMeans(maxIter=50, predictionCol="cluster").setK(100).setSeed(1)  # Initialize model
pipeline = Pipeline(stages=[assembler, kmeans])
model = pipeline.fit(df_customer_products)

Finally, calculate the cluster for each customer by running the aggregated dataset through the fitted pipeline model:

In [None]:
df_customer_products_cluster = model.transform(df_customer_products)
display(df_customer_products_cluster)
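
If you want a feel for what _k_-means does with the feature vectors, the following is a minimal pure-Python sketch of the basic iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not Spark's implementation: it picks starting centers by random sampling, whereas Spark ML defaults to the `k-means||` initialization, and the four sample vectors are made up for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest(point, centers):
    # Index of the center nearest to the point
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, max_iter=50, seed=1):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(max_iter):
        new_assignments = [closest(p, centers) for p in points]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(dim) / float(len(members)) for dim in zip(*members)]
    return centers, assignments

# Hypothetical purchase counts: [Diapers, Wipes, Beer, Chips]
points = [[5, 4, 0, 0], [6, 5, 0, 1], [0, 0, 3, 4], [0, 1, 4, 3]]
centers, assignments = kmeans(points, k=2)
print(assignments)
```

The first two customers buy baby products and the last two buy snacks, so they end up in different clusters. The numeric cluster labels themselves are arbitrary.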

<a id="persist"></a>
## 3. Persist the model 

In this section you will learn how to store your model in the Watson Machine Learning repository by using the Python client libraries.

### 3.1. Import the necessary libraries

First, you must import the client libraries.

In [None]:
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact

### 3.2. Configure IBM Watson Machine Learning credentials
To access your machine learning repository programmatically, you need to copy in your credentials, which you can see in your **IBM Watson Machine Learning** service details in Bluemix.

1. Open your [Bluemix Data Services list](https://apsportal.ibm.com/settings/services?context=analytics) in a new browser window. (You can also navigate to this list by clicking the avatar icon in the upper-right corner and selecting _Settings_ and then _Services_.)
2. Click on your IBM Watson Machine Learning service.
3. Click **Service Credentials** and then click **View credentials**.
4. Copy your credentials and paste them into the `username` and `password` fields in the next cell.
5. Run the cell.


In [None]:
service_path = 'https://ibm-watson-ml.mybluemix.net'
username = 'copy_your_username_here'
password = 'copy_your_password_here'

### 3.3. Authorize the repository and save artifacts

Authorize the repository client by running the following code, which uses the Watson Machine Learning client libraries:

In [None]:
ml_repository_client = MLRepositoryClient(service_path)
ml_repository_client.authorize(username, password)
ml_model_name = 'Shopping History'

Save the model to your Watson Machine Learning instance:

In [None]:
data_training = df_customer_products.withColumnRenamed('CUST_ID', 'label')
model_artifact = MLRepositoryArtifact(model, training_data=data_training, name=ml_model_name)
saved_model = ml_repository_client.models.save(model_artifact)
saved_model

<a id="deploy"></a>
## 4. Deploy the model to the cloud

In this section you'll learn how to deploy the model to the cloud by using the Watson Machine Learning REST API. For more information about REST APIs, see the [Watson Machine Learning API Documentation](http://watson-ml-api.mybluemix.net/).

Import the required libraries:

In [None]:
import json
import requests
import urllib3

To work with the Watson Machine Learning REST API, generate an access token:

In [None]:
headers = urllib3.util.make_headers(basic_auth='{}:{}'.format(username, password))
url = '{}/v2/identity/token'.format(service_path)
response = requests.get(url, headers=headers)
ml_token = json.loads(response.text).get('token')
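
The token request above authenticates with HTTP Basic authentication. As a rough sketch of what that header contains, the following standard-library code builds an equivalent header by Base64-encoding `username:password`. The helper name `basic_auth_header` and the sample credentials are hypothetical.

```python
import base64

def basic_auth_header(username, password):
    # HTTP Basic auth: Base64-encode "username:password" and prefix it with "Basic "
    credentials = '{}:{}'.format(username, password).encode('utf-8')
    return {'authorization': 'Basic ' + base64.b64encode(credentials).decode('ascii')}

headers_example = basic_auth_header('user', 'pass')
print(headers_example)
```

Base64 is an encoding, not encryption, so send these headers only over HTTPS, as the `service_path` in this notebook does.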

<a id="create_deploy"></a>
### 4.1. Create deployment for the model

Now you can create a deployment for the model in Watson Machine Learning:

In [None]:
deployment_url = service_path + "/v2/published_models/" + saved_model.uid + "/deployments/"
deployment_header = {'Content-Type': 'application/json', 'Authorization': ml_token}
deployment_payload = {"type": "online", "name": "Shopping List Prediction"}
deployment_response = requests.post(deployment_url, json=deployment_payload, headers=deployment_header)

Execute the following code to load the scoring endpoint for the deployment. 

In [None]:
scoring_url = json.loads(deployment_response.text).get('entity').get('scoring_href')
print(scoring_url)
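
If the deployment request fails, the response will not contain a scoring endpoint and the lookup above raises an unhelpful error. A small defensive helper, sketched here, can make the failure easier to diagnose. It assumes only the `entity` and `scoring_href` keys used in this notebook; the helper name and the sample URL are hypothetical.

```python
def get_scoring_href(deployment_json):
    """Return the scoring URL from a parsed deployment response, or raise ValueError."""
    entity = deployment_json.get('entity') or {}
    href = entity.get('scoring_href')
    if not href:
        raise ValueError('No scoring_href in deployment response: {}'.format(deployment_json))
    return href

# Hypothetical parsed response, for illustration only
sample_deployment = {'entity': {'scoring_href': 'https://example.invalid/score'}}
print(get_scoring_href(sample_deployment))
```

In the notebook you could call it as `get_scoring_href(json.loads(deployment_response.text))`.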

<a id="test_deploy"></a>
### 4.2. Test deployment of the model

To verify that the model was successfully deployed to the cloud, you'll specify a customer ID (for example, customer 12027), predict that customer's cluster with the Watson Machine Learning deployment, and check that it matches the cluster that was previously associated with this customer ID.

In [None]:
customer = df_customer_products_cluster.filter('CUST_ID = 12027').collect()
print("Previously calculated cluster = {}".format(customer[0].cluster))

To determine the customer's cluster using Watson Machine Learning, you need to load the customer's purchase history. This function uses the local data frame to select every product field and the number of times that the given customer purchased each product.

In [None]:
def get_product_counts_for_customer(cust_id):
    # Look up the aggregated purchase row for the requested customer
    cust = df_customer_products.filter('CUST_ID = {}'.format(cust_id)).take(1)
    fields = []
    values = []
    for row in cust:
        for product_col in product_cols:
            field = 'sum({})'.format(product_col)
            value = row[field]
            fields.append(field)
            values.append(value)
    return (fields, values)

This function takes the customer's purchase history and calls the scoring endpoint:

In [None]:
def get_cluster_from_watson_ml(fields, values):
    scoring_header = {'Content-Type': 'application/json', 'Authorization': ml_token}
    scoring_payload = {'fields': fields, 'values': [values]}
    scoring_response = requests.post(scoring_url, json=scoring_payload, headers=scoring_header)
    return json.loads(scoring_response.text)['values'][0][len(product_cols)+1]
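
The function above reads the cluster by position, which depends on the order of the columns in the scoring response. As an alternative sketch, you could look the value up by column name instead. This assumes the response carries a `fields` list parallel to each row in `values`, which is inferred from the request payload and the positional index used above rather than confirmed here. The helper name and the mock response are hypothetical.

```python
def extract_prediction(scoring_json, column='cluster'):
    """Return the value of `column` from the first scored row."""
    fields = scoring_json['fields']
    first_row = scoring_json['values'][0]
    return first_row[fields.index(column)]

# Mock response with two product columns (assumed layout, for illustration only)
mock_response = {
    'fields': ['sum(Diapers)', 'sum(Wine)', 'features', 'cluster'],
    'values': [[1, 2, [1.0, 2.0], 7]],
}
print(extract_prediction(mock_response))
```

With this layout, the name-based lookup returns the same element as the positional index `len(product_cols)+1`.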

Finally, call the functions defined above to get the product history, call the scoring endpoint, and get the cluster associated with customer 12027:

In [None]:
product_counts = get_product_counts_for_customer(12027)
fields = product_counts[0]
values = product_counts[1]
print("Cluster calculated by Watson ML = {}".format(get_cluster_from_watson_ml(fields, values)))

### 4.3. Optional: Delete previous models

If needed, you can delete previous deployments and models that you created using this notebook. 

In [None]:
# Uncomment this cell to delete a model that was previously created
# ml_models = ml_repository_client.models.all()
# for ml_model in ml_models:
#     if ml_model.name == ml_model_name:
#         # delete any deployments
#         deployment_header = {'Content-Type': 'application/json', 'Authorization': ml_token}
#         deployment_url = service_path + "/v2/published_models/" + ml_model.uid + "/deployments/"
#         deployment_response = requests.get(deployment_url, headers=deployment_header)
#         o = json.loads(deployment_response.text)
#         if 'resources' in o.keys():
#             for resource in o['resources']:
#                 deployment_url = service_path + "/v2/published_models/" + ml_model.uid + "/deployments/" + resource['metadata']['guid']
#                 deployment_response = requests.delete(deployment_url, headers=deployment_header)
#                 print(deployment_response.text)
#         # delete the model
#         ml_repository_client.models.remove(ml_model.uid)

<a id="create_recomm"></a>
## 5. Create product recommendations

Now you can create some product recommendations.

First, run this cell to create a function that queries the database and finds the most popular products in a cluster. In this case, the **df_customer_products_cluster** dataframe is the database.

In [None]:
# This function gets the most popular products in a cluster by filtering on the cluster column and summing the purchase counts
def get_popular_products_in_cluster(cluster):
    df_cluster_products = df_customer_products_cluster.filter('cluster = {}'.format(cluster))
    df_cluster_products_agg = df_cluster_products.groupby('cluster').sum()
    row = df_cluster_products_agg.rdd.collect()[0]
    items = []
    for product_col in product_cols:
        field = 'sum(sum({}))'.format(product_col)
        items.append((product_col, row[field]))
    sortedItems = sorted(items, key=lambda x: x[1], reverse=True) # Sort by score
    popular = [x for x in sortedItems if x[1] > 0]
    return popular

Now, run this cell to create a function that will calculate the recommendations based on a given cluster. This function finds the most popular products in the cluster, filters out products already purchased by the customer or currently in the customer's shopping cart, and finally produces a list of recommended products.

In [None]:
# This function takes a cluster and the quantity of every product already purchased or in the user's cart
from pyspark.sql.functions import desc
def get_recommendations_by_cluster(cluster, purchased_quantities):
    # Existing customer products
    print('PRODUCTS ALREADY PURCHASED/IN CART:')
    customer_products = []
    for i in range(0, len(product_cols)):
        if purchased_quantities[i] > 0:
            customer_products.append((product_cols[i], purchased_quantities[i]))
    df_customer_products = sc.parallelize(customer_products).toDF(["PRODUCT","COUNT"])
    df_customer_products.show()
    # Get popular products in the cluster
    print('POPULAR PRODUCTS IN CLUSTER:')
    cluster_products = get_popular_products_in_cluster(cluster)
    df_cluster_products = sc.parallelize(cluster_products).toDF(["PRODUCT","COUNT"])
    df_cluster_products.show()
    # Filter out products the user has already purchased
    print('RECOMMENDED PRODUCTS:')
    df_recommended_products = df_cluster_products.alias('cl').join(df_customer_products.alias('cu'), df_cluster_products['PRODUCT'] == df_customer_products['PRODUCT'], 'leftouter')
    df_recommended_products = df_recommended_products.filter('cu.PRODUCT IS NULL').select('cl.PRODUCT','cl.COUNT').sort(desc('cl.COUNT'))
    df_recommended_products.show(10)
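
The left outer join followed by the `IS NULL` filter is an anti-join: it keeps the cluster's popular products that have no match in the customer's own list. The same logic can be sketched in plain Python without Spark. The function name `recommend` and the sample data are hypothetical.

```python
def recommend(cluster_products, purchased, limit=10):
    """Return popular cluster products the customer has not bought, most popular first."""
    already_bought = set(name for name, count in purchased if count > 0)
    candidates = [(name, count) for name, count in cluster_products
                  if name not in already_bought]
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:limit]

# Hypothetical (product, count) pairs, for illustration only
cluster_products = [('Diapers', 40), ('Wipes', 35), ('Formula', 20), ('Wine', 5)]
purchased = [('Diapers', 1), ('Wine', 1), ('Formula', 0)]
recommended = recommend(cluster_products, purchased)
print(recommended)
```

Products with a purchased count of zero stay eligible, which mirrors how the function above only adds products with a positive quantity to the customer's list.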

Next, run this cell to create a function that produces a list of recommended items based on the products and quantities in a user's cart. This function uses Watson Machine Learning to calculate the cluster based on the shopping cart contents and then calls the **get_recommendations_by_cluster** function.

In [None]:
# This function would be used to find recommendations based on the products and quantities in a user's cart
def get_recommendations_for_shopping_cart(products, quantities):
    fields = []
    values = []
    for product_col in product_cols:
        field = 'sum({})'.format(product_col)
        if product_col in products:
            value = quantities[products.index(product_col)]
        else:
            value = 0
        fields.append(field)
        values.append(value)
    return get_recommendations_by_cluster(get_cluster_from_watson_ml(fields, values), values)

Run this cell to create a function that produces a list of recommended items based on the purchase history of a customer. This function uses Watson Machine Learning to calculate the cluster based on the customer's purchase history and then calls the **get_recommendations_by_cluster** function.

In [None]:
# This function is used to find recommendations based on the purchase history of a customer
def get_recommendations_for_customer_purchase_history(customer_id):
    product_counts = get_product_counts_for_customer(customer_id)
    fields = product_counts[0]
    values = product_counts[1]
    return get_recommendations_by_cluster(get_cluster_from_watson_ml(fields, values), values)

Now you can take customer 12027 and produce a recommendation based on that customer's purchase history:

In [None]:
get_recommendations_for_customer_purchase_history(12027)

Now, take a sample shopping cart and produce a recommendation based on the items in the cart:

In [None]:
get_recommendations_for_shopping_cart(['Diapers','Baby wash','Wine'],[1,2,1])

<a id="summary"></a>
## 6. Summary and next steps

You successfully completed this notebook! You learned how to use Watson Machine Learning for model creation and deployment.   

Check out other notebooks in this series: 
 - [Generating a Kafka producer (JSON) into MessageHub](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-1.ipynb)
 - [Building the Streaming Pipeline](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-2.ipynb)
 - [Visualize streaming data in a real-time dashboard](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-4.ipynb)

### Author
**Mark Watson** is an IBM Developer Advocate whose primary responsibility is to make it easy for you to use the IBM platform and offerings. He is passionate about the three pillars of his work, namely development, advocacy, and community. While coding is still an essential part of his job, he focuses on examples, training materials, explanatory demos, Software Development Kits and tools that can help people excel at using IBM products.

Copyright © 2017 IBM. This notebook and its source code are released under the terms of the MIT License.