# Case 2: What Products Will Land in a Customer's Basket?

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


## Install Necessary Libraries

In [2]:
!pip install turicreate


## Import Necessary Libraries

In [3]:
import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

## Reading Data

In [4]:
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/online_retail_II.csv')

In [5]:
df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


## Data Wrangling

As with any data science project, the first step after loading the data is to understand it and transform it into a form that is useful for our purpose.

As shown above, the `Description` column contains the product names, and the `StockCode` column contains the unique identifier of each distinct product. Together with `Quantity` and `Customer ID`, these are the most important columns for this task.

The first thing we need to do is create a new dataframe containing the unique ID of each customer.

In [8]:
# Drop missing customer IDs before casting, since NaN cannot be represented as int64
df_customer = pd.DataFrame(df['Customer ID'].dropna().unique().astype('int64'))
df_customer.columns = ['CustomerID']

In [9]:
df_customer.head()

Unnamed: 0,CustomerID
0,13085
1,13078
2,15362
3,18102
4,12682


Next, we create another dataframe that is central to our analysis, which I will call `df_purchase`. It contains the customer ID, the product ID, and the total quantity of each product that each customer purchased.

In [10]:
df_purchase = df.groupby(['Customer ID','StockCode'], as_index=False)['Quantity'].sum()

In [11]:
df_purchase.head()

Unnamed: 0,Customer ID,StockCode,Quantity
0,12346.0,15056BL,1
1,12346.0,15056N,1
2,12346.0,15056P,1
3,12346.0,20679,1
4,12346.0,20682,1


In [12]:
df_purchase = df_purchase.astype({'Customer ID': 'int64'})

After this step, we have the data we need to build our model! Next, we split the `df_purchase` dataframe into training and test data using the `train_test_split` function from scikit-learn.

In [13]:
def split_data(data):
    """Split a pandas dataframe 80/20 and convert both parts to Turicreate SFrames."""
    train, test = train_test_split(data, test_size=0.2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data

In [14]:
train_data, test_data = split_data(df_purchase)

The first five rows of our training data now look like this:

In [15]:
train_data[0:5]

Customer ID,StockCode,Quantity
16602,84077,48
14854,22606,2
16360,23351,4
15358,22272,6
15821,23414,3
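
Because `train_test_split` shuffles the rows randomly, the rows shown above will differ from run to run. The following is a minimal sketch of a reproducible split, assuming we pass a fixed `random_state` (the value 42 and the toy dataframe `toy_purchase` are illustrative choices, not part of the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_purchase (values are made up for illustration)
toy_purchase = pd.DataFrame({
    'Customer ID': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'StockCode': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'D', 'B', 'D'],
    'Quantity': [2, 1, 5, 3, 1, 4, 2, 2, 6, 1],
})

# A fixed random_state makes the shuffle repeatable across runs
train_a, test_a = train_test_split(toy_purchase, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy_purchase, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))   # sizes of the two parts
print(train_a.equals(train_b))     # identical splits when the seed is the same
```

Fixing the seed makes it easier to compare models fairly, since every model is then trained and evaluated on exactly the same rows.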


## Building Recommendation System Model

With our training and test data ready, here comes the fun part: building the recommendation system. For this I am going to use a library called Turicreate. It has long been my go-to library for recommendation systems because it is straightforward to use and its results are easy to interpret.

To create recommendations with Turicreate, we need the following:
- The dataframe we built earlier (`df_purchase`)
- A list of unique customer IDs
- A list of unique product IDs
- The number of items to recommend to each user, stored below in the variable `n_rec`

In [18]:
customer_id = 'Customer ID'
item_id = 'StockCode'
customer_to_recommend = list(df_customer['CustomerID'])
n_rec = 5 # Number of items to recommend
n_display = 30 # Display the first few rows in an output dataset

For this project, we are going to try two different types of recommendation system: product popularity based recommendation and collaborative filtering. We will then compare their performance before choosing which one to implement in the end.

In [17]:
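def model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display):
    """Train the recommender selected by `name` and print sample recommendations."""
    if name == 'popularity':
        # Recommend the same top-scoring items to every user
        rec_model = tc.popularity_recommender.create(train_data,
                                                     user_id=user_id,
                                                     item_id=item_id,
                                                     target=target)
    elif name == 'cosine':
        # Item similarity based collaborative filtering with cosine similarity
        rec_model = tc.item_similarity_recommender.create(train_data,
                                                          user_id=user_id,
                                                          item_id=item_id,
                                                          target=target,
                                                          similarity_type='cosine')
    else:
        # Fail early instead of hitting an undefined variable below
        raise ValueError(f"Unknown model name: {name!r}")

    recommendation = rec_model.recommend(users=users_to_recommend, k=n_rec)
    recommendation.print_rows(n_display)
    return rec_model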

### Product Popularity Based Recommendation System

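As the name suggests, a product popularity based recommendation system recommends the most popular products to every customer. The most popular products can be defined as those with the highest sales across all customers. Note that the fractional scores in the output below suggest that, with `Quantity` as the target, Turicreate ranks items by their average quantity in the training data rather than by total sales.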

In [19]:
name = 'popularity'
target = 'Quantity'
popularity = model(train_data, name, customer_id, item_id, target, customer_to_recommend, n_rec, n_display)

+-------------+-----------+--------------------+------+
| Customer ID | StockCode |       score        | rank |
+-------------+-----------+--------------------+------+
|    13085    |   16044   |       3096.0       |  1   |
|    13085    |   37410   |       2526.1       |  2   |
|    13085    |   37352   | 1538.3333333333333 |  3   |
|    13085    |   21092   | 1052.9166666666667 |  4   |
|    13085    |   20800   |       992.0        |  5   |
|    13078    |   16044   |       3096.0       |  1   |
|    13078    |   37410   |       2526.1       |  2   |
|    13078    |   37352   | 1538.3333333333333 |  3   |
|    13078    |   21092   | 1052.9166666666667 |  4   |
|    13078    |   20800   |       992.0        |  5   |
|    15362    |   16044   |       3096.0       |  1   |
|    15362    |   37410   |       2526.1       |  2   |
|    15362    |   37352   | 1538.3333333333333 |  3   |
|    15362    |   21092   | 1052.9166666666667 |  4   |
|    15362    |   20800   |       992.0        |

As the table above shows, each customer gets 5 product recommendations, as requested through `n_rec`. However, you might notice that every customer receives the same recommendations. This is expected, because popularity is computed across all customers rather than per customer.
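
To make this concrete, here is a simplified sketch of a popularity ranking on a toy dataframe. It is not Turicreate's implementation: scoring by the mean `Quantity` per product is an assumption based on the fractional scores in the output above, the data is made up, and the sketch does not filter out products a customer has already bought.

```python
import pandas as pd

# Toy purchase data (made-up values)
toy = pd.DataFrame({
    'Customer ID': [1, 2, 3, 1, 2, 3],
    'StockCode': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Quantity': [1, 1, 1, 10, 2, 4],
})

# Score every product once, using all customers together (assumed: mean quantity)
mean_score = toy.groupby('StockCode')['Quantity'].mean().sort_values(ascending=False)
top_items = mean_score.head(2).index.tolist()

# Every customer receives the same global top list
recommendations = {int(customer): top_items for customer in toy['Customer ID'].unique()}
print(recommendations)
```

Because the score of a product does not depend on who is asking, the ranking is identical for every customer, which matches what we see in the table.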

Next, let's implement collaborative filtering.

### Collaborative Filtering

Collaborative filtering is a recommendation method in which the products recommended to a customer depend on the customer's past purchases and on how similar their purchasing behavior is to that of other customers.

Cosine similarity is a common way to measure this similarity. Note that Turicreate's `item_similarity_recommender`, which we use here, computes the similarity between products based on the customers who bought them, and then recommends products similar to those a customer has already purchased. This is easy to implement with Turicreate.
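
Cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below illustrates the idea on made-up data, assuming each product is represented by a vector of the quantities bought by four hypothetical customers. The function `cosine_similarity` is written here for illustration and is not a Turicreate API.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a product nobody bought is similar to nothing
    return dot / (norm_u * norm_v)

# Quantities bought by customers c1..c4 (made-up numbers)
item_a = [2, 0, 1, 0]
item_b = [4, 0, 2, 0]  # same buyers as item_a, in the same proportions
item_c = [0, 3, 0, 5]  # bought by a completely different set of customers

sim_ab = cosine_similarity(item_a, item_b)
sim_ac = cosine_similarity(item_a, item_c)
print(sim_ab, sim_ac)
```

Products bought by the same customers in similar proportions score close to 1, while products with no buyers in common score 0.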

In [20]:
name = 'cosine'
target = 'Quantity'
cos = model(train_data, name, customer_id, item_id, target, customer_to_recommend, n_rec, n_display)

+-------------+-----------+--------------------+------+
| Customer ID | StockCode |       score        | rank |
+-------------+-----------+--------------------+------+
|    13085    |   23554   | 2.7353433641520413 |  1   |
|    13085    |   23531   | 2.7346357107162476 |  2   |
|    13085    |   48173C  | 2.7342423038049177 |  3   |
|    13085    |   22765   | 2.7335107651623813 |  4   |
|    13085    |   37482B  | 2.731279814785177  |  5   |
|    13078    |   22687   | 7.224561340007626  |  1   |
|    13078    |   21444   | 7.224473624444399  |  2   |
|    13078    |   21545   | 7.222503073391367  |  3   |
|    13078    |   22765   | 7.222096770024691  |  4   |
|    13078    |   23401   | 7.219130055826218  |  5   |
|    15362    |   22595   | 0.8989201283454895 |  1   |
|    15362    |   21124   | 0.8597626924514771 |  2   |
|    15362    |   84327A  | 0.8321827411651611 |  3   |
|    15362    |   22630   | 0.7572551035881042 |  4   |
|    15362    |   21542   | 0.7493278741836548 |

At a quick glance, collaborative filtering appears to give better results, since the recommended products differ from one customer to another.

Let's confirm this hypothesis in the next section.

## Recommender System's Evaluation

There are several metrics we can use to evaluate a recommendation system. Here I will use three: root mean square error (RMSE), precision, and recall. They can be interpreted as follows:

- The lower the RMSE, the better the recommendation.
- The higher the precision, the better the recommendation. Precision basically tells us: out of all the products recommended to customers, how many did they actually buy?
- The higher the recall, the better the recommendation. Recall basically tells us: what percentage of the products that a customer buys were actually recommended to them?

We will treat all three metrics as equally important.
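
The three metrics can be sketched in plain Python as follows. This is an illustration of the definitions for a single customer with made-up data, not the code Turicreate runs internally. The helper names `precision_recall_at_k` and `rmse` are hypothetical.

```python
import math

def precision_recall_at_k(recommended, purchased, k):
    """Precision and recall of the top-k recommendations for one customer."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(purchased))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(purchased) if purchased else 0.0
    return precision, recall

def rmse(actual, predicted):
    """Root mean square error between actual and predicted target values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

recommended = ['A', 'B', 'C', 'D', 'E']   # ranked recommendations
purchased = ['B', 'E', 'Z', 'Y']          # held-out purchases in the test set

# 2 of the 5 recommended products were bought, and 2 of the 4 purchases were recommended
precision, recall = precision_recall_at_k(recommended, purchased, k=5)
print(precision, recall)

# Errors of 1, 0 and 2 give sqrt((1 + 0 + 4) / 3)
error = rmse([3, 5, 2], [2, 5, 4])
print(error)
```

Note that RMSE compares predicted and actual `Quantity` values, whereas precision and recall only look at whether a recommended product was bought at all.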

In [21]:
models = [popularity, cos]
names = ['Popularity Model', 'Collaborative Filtering Model']

# Avoid naming the result `eval`, which would shadow Python's built-in function
eval_results = tc.recommender.util.compare_models(test_data, models, model_names=names)

PROGRESS: Evaluate model Popularity Model



Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    |          0.0           |          0.0           |
|   2    | 0.0002700270027002697  | 1.7907907433702083e-05 |
|   3    | 0.0003600360036003596  | 2.4300350262749785e-05 |
|   4    | 0.0004500450045004502  |  4.22449942662021e-05  |
|   5    | 0.00039603960396039623 | 4.455291736620462e-05  |
|   6    | 0.0003900390039003901  | 7.068171527378201e-05  |
|   7    | 0.00046290343320046294 | 9.777166555138966e-05  |
|   8    | 0.0006300630063006297  | 0.0003583203095913241  |
|   9    | 0.0007200720072007198  | 0.00044859713913186295 |
|   10   | 0.0006480648064806485  | 0.00044859713913186295 |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 167.0074779533474

Per User RMSE (best)
+-------------+--


Precision and recall summary statistics by cutoff
+--------+----------------------+-----------------------+
| cutoff |    mean_precision    |      mean_recall      |
+--------+----------------------+-----------------------+
|   1    | 0.048604860486048604 | 0.0060854622105955495 |
|   2    |  0.0387038703870387  |  0.009572896255925898 |
|   3    | 0.034563456345634555 |  0.013353467396137705 |
|   4    | 0.031098109810981107 |  0.01521631160436448  |
|   5    | 0.028118811881188123 |  0.016577833120700384 |
|   6    | 0.025772577257725787 |  0.017552534166623838 |
|   7    | 0.02394239423942394  |  0.018539164174209095 |
|   8    | 0.02295229522952293  |  0.020158345580975887 |
|   9    | 0.021582158215821726 |  0.020980640313943498 |
|   10   | 0.020774077407740718 |  0.02205168465153129  |
+--------+----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 167.68222979789024

Per User RMSE (best)
+-------------+------+-------+
| Customer ID | rmse | coun

As the output above shows, the popularity based recommender has a slightly better (lower) RMSE, 167.01 versus 167.68, but far worse precision and recall than collaborative filtering.

Hence, let's build a collaborative filtering based recommender system as our final model.

## Building Collaborative Filtering Recommender System

In [22]:
final_model = tc.item_similarity_recommender.create(tc.SFrame(df_purchase), 
                                            user_id=customer_id, 
                                            item_id=item_id, 
                                            target='Quantity', similarity_type='cosine')

recom = final_model.recommend(users=customer_to_recommend, k=n_rec)
recom.print_rows(n_display)

+-------------+-----------+--------------------+------+
| Customer ID | StockCode |       score        | rank |
+-------------+-----------+--------------------+------+
|    13085    |   23280   | 2.4207055451823214 |  1   |
|    13085    |   23292   | 2.4114346504211426 |  2   |
|    13085    |   23290   | 2.4107505620694627 |  3   |
|    13085    |   48173C  | 2.3452104844299018 |  4   |
|    13085    |   22688   | 2.3358877055785237 |  5   |
|    13078    |   22734   | 9.860444524458476  |  1   |
|    13078    |   23174   |  9.82950988605425  |  2   |
|    13078    |   23175   | 9.809334694178073  |  3   |
|    13078    |   22137   |  9.78258188591375  |  4   |
|    13078    |   21428   | 9.684604188064476  |  5   |
|    15362    |   21749   | 1.846084331211291  |  1   |
|    15362    |   21915   | 1.7752558306643837 |  2   |
|    15362    |   21891   | 1.444044552351299  |  3   |
|    15362    |   22086   | 1.3383159825676365 |  4   |
|    15362    |   22332   | 1.2894908064290096 |

## Generating Product Recommendations to a Customer

Now for the fun part: generating product recommendations for a customer. Given a Customer ID, the collaborative filtering model will generate 5 products to recommend to that customer. Once a customer has bought a product, these are the 5 products we would recommend next.

The first thing we need to do is create a new dataframe to store the model's output, consisting of each Customer ID and the 5 products recommended to them.

In [26]:
def create_output(model, users_to_recommend, n_rec, user_col='Customer ID', item_col='StockCode'):
    """Return one row per customer with the recommended StockCodes joined by '|'."""
    # Column names are parameters so the function does not depend on global variables
    recommendation = model.recommend(users=users_to_recommend, k=n_rec)
    df_rec = recommendation.to_dataframe()
    df_rec['RecommendedProducts'] = df_rec.groupby(user_col)[item_col].transform(lambda x: '|'.join(x.astype(str)))
    df_output = df_rec[[user_col, 'RecommendedProducts']].drop_duplicates().sort_values(user_col).set_index(user_col)

    return df_output

In [27]:
df_output = create_output(final_model, customer_to_recommend, n_rec)

Next, given a customer ID, we want to see the actual product names instead of the product IDs. Hence, we need one more function that takes a customer ID and converts the recommended product IDs into product names.

In [28]:
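def customer_recomendation(customer_id):
    """Return the names of the products recommended to `customer_id`."""
    if customer_id not in df_output.index:
        print('Customer not found.')
        return []  # an empty list keeps the caller's loop safe
    stock_codes = df_output.loc[customer_id, 'RecommendedProducts'].split('|')
    # Compare as strings, because the codes were converted to strings when joined with '|'
    products = [df['Description'].loc[df['StockCode'].astype(str) == x].dropna().tolist()[0] for x in stock_codes]

    return products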
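
The lookup above scans the full dataframe once for every recommended product. As a possible alternative, the mapping from product ID to product name could be built once and reused. The sketch below shows the idea on a toy dataframe. The names `toy_df` and `code_to_name`, the shortened descriptions, and the `'UNKNOWN PRODUCT'` fallback are all illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the transaction dataframe (descriptions are made up)
toy_df = pd.DataFrame({
    'StockCode': ['85048', '85048', '22041', '21232'],
    'Description': ['GLASS BALL 20 LIGHTS', None, 'RECORD FRAME', 'TRINKET BOX'],
})

# Build the StockCode -> Description mapping once
code_to_name = (
    toy_df.dropna(subset=['Description'])       # ignore rows without a name
          .drop_duplicates(subset='StockCode')  # keep the first name per product
          .set_index('StockCode')['Description']
          .to_dict()
)

# Translate a '|'-joined recommendation string into product names
recommended_codes = '22041|85048|99999'.split('|')
names = [code_to_name.get(code, 'UNKNOWN PRODUCT') for code in recommended_codes]
print(names)
```

Using `dict.get` with a fallback also avoids an error when a recommended product has no description in the data.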

Let's say a customer with ID 12346 buys something from our online store. The next step is to run the collaborative filtering model to find the top 5 products they might like.

In [29]:
# Use a separate name so the `customer_id` column-name variable is not overwritten
target_customer = 12346
products_to_recommend = customer_recomendation(target_customer)

In [30]:
for i, v in enumerate(products_to_recommend, start=1):
    print(f'Number {i} product to be most likely to be in customer {target_customer} basket is {v}')

Number 1 product to be most likely to be in customer 12346 basket is CREAM HEART CARD HOLDER
Number 2 product to be most likely to be in customer 12346 basket is PINK/BROWN DOTS RUFFLED UMBRELLA
Number 3 product to be most likely to be in customer 12346 basket is CUBIC MUG BLUE POLKA DOT
Number 4 product to be most likely to be in customer 12346 basket is BLACK HEART CARD HOLDER
Number 5 product to be most likely to be in customer 12346 basket is ROSE DU SUD COSMETICS BAG


As shown above, when the customer with ID 12346 buys an item, the recommender system suggests 5 products to them, for example Cream Heart Card Holder and Black Heart Card Holder.