## Introduction

Recommender systems are algorithms that suggest relevant items to users. The items may be movies to watch, texts to read, products to buy, or anything else, depending on the industry. They are critical in some industries: an efficient recommender can generate a large amount of income and can help a company stand out from its competitors.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

from surprise import accuracy, Reader, Dataset
from surprise import KNNBasic, SVD, NMF

from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, LinearRegression

## 1. Data Selection

In [2]:
olist = pd.read_csv('../data/processed/olist.csv')

Based on the EDA (part 1), we split the data into new and repeat customers so that each group can be treated differently.

In [3]:
num_of_order_per_customer = olist.groupby('customer_unique_id')['order_id'].nunique()
repeat_customer_unique_id = num_of_order_per_customer[num_of_order_per_customer > 1].index.tolist()
repeat_customers = olist.loc[olist.customer_unique_id.isin(repeat_customer_unique_id)]
num_of_repeat_customers = len(repeat_customer_unique_id)
total_num_of_customers = olist['customer_unique_id'].nunique()
print(f'Number of repeat customers: {num_of_repeat_customers}, new customers: {total_num_of_customers - num_of_repeat_customers}')

Number of repeat customers: 2807, new customers: 90589


For new customers, products will be recommended based on popularity by month (and/or by location) or based on highly rated categories.

In [4]:
# Popular products
print('*'*50, '\n', 'Get popular products')
print(olist['product_id'].value_counts().head())

# Top rated products
print('*'*50, '\n', 'Get top rated products')
print(olist.groupby('product_id')['review_score'].mean().sort_values(ascending=False).head())

************************************************** 
 Get popular products
product_id
aca2eb7d00ea1a7b8ebd4e68314663af    533
99a4788cb24856965c36a24e339b6058    517
422879e10f46682990de24d770e7f83d    507
389d119b48cf3043d311335e499d9c6b    405
368c6c730842d78016ad823897a372db    395
Name: count, dtype: int64
************************************************** 
 Get top rated products
product_id
0021a87d4997a48b6cef1665602be0f5    5.0
001c5d71ac6ad696d22315953758fa04    5.0
00066f42aeeb9f3007548bb9d3f33c38    5.0
001b237c0e9bb435f2e54071129237e9    5.0
fffdb2d0ec8d6a61f0a0a0db3f25b441    5.0
Name: review_score, dtype: float64


Many products share the same average rating, so we need additional filters to pick the best ones. The following function generalizes these filters: it can restrict the data by month or location, and it breaks ties in the average rating by the number of reviews.

In [5]:
def get_top_rate_products(month=None, location=None):
    # Apply each filter only when it is provided, so that both can be combined
    df = olist
    if month is not None:
        df = df.loc[df.order_purchase_year_month == month]
    if location is not None:
        df = df.loc[df.customer_state == location]
    # Average rating and number of reviews per product
    agg_df = df.groupby('product_id').agg({'review_score': ['mean', 'count'], 'product_category_name_english': 'first'})
    agg_df.columns = ['avg_rating_score', 'count', 'product_category']
    # Rank by average rating, breaking ties by the number of reviews
    output = agg_df.sort_values(by=['avg_rating_score', 'count'], ascending=[False, False]).head()
    
    return output

# Top rated products by month
print('*'*50, '\n', 'Get top rated products in month=201711')
print(get_top_rate_products(month=201711))
# Top rated products by location
print('*'*50, '\n', 'Get top rated products in Sao Paulo')
print(get_top_rate_products(location='SP'))

************************************************** 
 Get top rated products in month=201711
                                  avg_rating_score  count product_category
product_id                                                                
89b190a046022486c635022524a974a8               5.0     15  Furniture Decor
47cd48073d67f91f09cb5ef9496c920b               5.0     10   Bed Bath Table
1166bc797ddf5fb009c376d133f61204               5.0      9       Housewares
846145e9b8d412bd1c9bb478a52ab4a0               5.0      9       Cool Stuff
10b0226d162bdc55d60c0eabf68c7021               5.0      7   Sports Leisure
************************************************** 
 Get top rated products in Sao Paulo
                                  avg_rating_score  count product_category
product_id                                                                
ebf9bc6cd600eadd681384e3116fda85               5.0     42   Bed Bath Table
8d37ee446981d3790967d0268d6cfc81               5.0     26   Bed Bath 

We will build a recommendation system for repeat customers in the following sections.

## 2. Recommendation system approaches

#### 2.1. Data split

In [6]:
# Filter unique customer
unique_customer_ids = repeat_customers['customer_unique_id'].unique()

# Split unique customer ids into train and test
train_customer_ids, test_customer_ids = train_test_split(unique_customer_ids, test_size = 0.3, random_state=42)

# Filter data for train and test sets based on customer ids
train_data = repeat_customers[repeat_customers['customer_unique_id'].isin(train_customer_ids)]
test_data = repeat_customers[repeat_customers['customer_unique_id'].isin(test_customer_ids)]

# Print sizes of the resulting sets
print('Size of train data:', len(train_data), 'Number of customers:', train_data.customer_unique_id.nunique())
print('Size of test data:', len(test_data), 'Number of customers:', test_data.customer_unique_id.nunique())

Size of train data: 5785 Number of customers: 1964
Size of test data: 2419 Number of customers: 843


#### 2.2. Evaluation metrics
- RMSE: compare the predicted review score with the actual `review_score` to evaluate the performance of the current models. The lower the RMSE, the better the model.
- Top-K accuracy: compare the overall performance of all approaches. For simplicity, we choose K=5; a user counts as a True Positive when at least one of the K suggested products was bought by that user. This metric is explainable and close to the actual business goal. The higher the accuracy, the better the model.
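
To make the two metrics concrete, here is a minimal sketch on hypothetical toy data (the user and product names below are illustrative and not taken from the Olist dataset). The helper names `rmse` and `hit_rate_at_k` are assumptions for this sketch only; the notebook's own implementation is `evaluate_performance`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def hit_rate_at_k(suggestions, purchases, k=5):
    """Share of users with at least one purchased product among their top-k suggestions."""
    hits = 0
    for user, ranked_products in suggestions.items():
        if set(ranked_products[:k]) & purchases.get(user, set()):
            hits += 1
    return hits / len(suggestions)

# Hypothetical review scores
actual_scores = [5, 3, 4, 1]
predicted_scores = [4, 3, 5, 3]
toy_rmse = rmse(actual_scores, predicted_scores)

# Hypothetical ranked suggestions and actual purchases per user
suggestions = {
    'u1': ['p1', 'p2', 'p3'],
    'u2': ['p4', 'p5', 'p6'],
    'u3': ['p7', 'p8', 'p9'],
    'u4': ['p1', 'p4', 'p7'],
}
purchases = {
    'u1': {'p3', 'p9'},   # p3 was suggested -> hit
    'u2': {'p1'},         # nothing suggested was bought -> miss
    'u3': {'p2'},         # miss
    'u4': {'p7'},         # hit
}
toy_hit_rate = hit_rate_at_k(suggestions, purchases, k=5)
```

Note that this top-K metric only asks whether at least one suggestion was bought, so it does not reward a model for getting several of the K products right.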

In [7]:
overall_performance = {} # store the top-k accuracy of every model

def evaluate_performance(predicted_review_scores=None, top_products=None, k=5):
    '''
    Compute top-k accuracy over the test customers.

    predicted_review_scores: dataframe with the predicted score of users for products
        (columns: customer_unique_id, product_id, review_score)
    top_products: optional fixed list of products suggested to every user
    '''
    # For each user, suggest the k products with the highest predicted review score.
    # Ties are not broken explicitly; they keep the order returned by sort_values.
    tp = 0
    for user_id in test_customer_ids:
        if top_products:
            suggested_products = top_products
        else:
            suggested_products = (predicted_review_scores
                                  .loc[predicted_review_scores.customer_unique_id == user_id]
                                  .sort_values(by='review_score', ascending=False)['product_id']
                                  .head(k))
        
        # A user is a true positive when at least one suggested product was bought
        bought_products = test_data.loc[test_data.customer_unique_id == user_id, 'product_id'].values
        if any(i in bought_products for i in suggested_products):
            tp += 1
    accuracy = round(tp / len(test_customer_ids), 3)
    print(f'Accuracy: {accuracy}')
    
    return accuracy

#### 2.3. Baseline models

a. Rating prediction using mean

In [8]:
np.random.seed(42)
predicted_review_scores = test_data.copy()
predicted_review_scores['product_id'] = np.random.choice(olist.product_id.unique(), size=len(test_data), replace=True)
predicted_review_scores['review_score'] = train_data['review_score'].mean()

overall_performance['Baseline: Mean Prediction'] = evaluate_performance(predicted_review_scores=predicted_review_scores)

Accuracy: 0.0


b. Top k popular products

Every user is suggested the same top 5 most popular products, as in section 1 (for new customers).

In [9]:
top_popular_products = olist['product_id'].value_counts().head().index.tolist()

overall_performance['Baseline: Top k popular products'] = evaluate_performance(top_products=top_popular_products)

Accuracy: 0.03


#### 2.4. Content-based Filtering Technique

This technique relies on the feature values of **products**. For each user, we fit an ML model that maps the input feature values to the output (`review_score`), and then suggest the products with the highest predicted rating.
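
As a rough illustration of the idea, the sketch below fits a ridge regression for one hypothetical user on a single standardized feature and then ranks a small made-up catalogue by predicted score. This is a simplification under stated assumptions: the notebook itself uses `sklearn.linear_model.Ridge` on eight features, and the helper `fit_ridge_1d`, the feature values, and the product names here are all hypothetical. As in scikit-learn's `Ridge`, the intercept is not penalized.

```python
def fit_ridge_1d(x, y, alpha=1.0):
    """Closed-form ridge regression for one feature with an unpenalized intercept."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)          # alpha shrinks the slope towards 0
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical history of one user: standardized price of each bought product and its review score
user_prices = [-1.0, 0.0, 1.0]
user_scores = [5, 4, 2]
slope, intercept = fit_ridge_1d(user_prices, user_scores, alpha=1.0)

# Hypothetical catalogue: product -> standardized price
catalogue = {'p_cheap': -1.5, 'p_mid': 0.2, 'p_expensive': 1.8}
predicted = {pid: intercept + slope * price for pid, price in catalogue.items()}

# Suggest products from the highest to the lowest predicted score
ranked = sorted(predicted, key=predicted.get, reverse=True)
```

With only a handful of purchases per user, such a per-user model has very little data to learn from, which is worth keeping in mind when reading its accuracy.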

In [10]:
product_features = ['product_name_lenght', 'product_description_lenght', 'product_photos_qty', 'product_weight_g', 'product_length_cm', 
                    'product_height_cm', 'product_width_cm', 'price']

In [11]:
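# Normalize the input first
all_products = repeat_customers[['product_id'] + product_features].drop_duplicates() # remove duplicates
scaler = StandardScaler().fit(all_products[product_features])
# Note: only repeat_customers is rescaled here; all_products, train_data and test_data
# were created before this step and keep their unscaled values
repeat_customers[product_features] = scaler.transform(repeat_customers[product_features])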

In [12]:
# Get product suggestion
user_product_suggestion_list = []
for user_id in test_customer_ids:
    # Fit one Ridge model per user on the products that this user reviewed
    df = test_data.loc[test_data.customer_unique_id == user_id, product_features + ['review_score']].drop_duplicates()
    model = Ridge().fit(df[product_features], df['review_score'])
    # Score every product and keep the 5 with the highest predicted rating
    all_products['predicted_score'] = model.predict(all_products[product_features])
    top_products = all_products.sort_values(by='predicted_score', ascending=False).head(5).copy()
    top_products['customer_unique_id'] = user_id
    
    user_product_suggestion_list.append(top_products[['customer_unique_id', 'product_id', 'predicted_score']]
                                        .rename(columns={'predicted_score': 'review_score'}))

predicted_review_scores = pd.concat(user_product_suggestion_list)

In [13]:
overall_performance['Content-based Filtering'] = evaluate_performance(predicted_review_scores=predicted_review_scores)

Accuracy: 0.001


#### 2.5. Collaborative Filtering Technique: Memory-based

* We create a utility matrix between users and products. Its entries can encode any relation between users and products, such as clicked or not, watching time, or rating. Our objective is to fill in the missing entries. Note that the missing entries generally far outnumber the available values.
* We might normalize the ratings (average subtraction) before combining them with the similarity matrix (average of the k nearest users). The similarity matrix is built by calculating *"distances"* between users or between items, and this calculation plays an important role in this approach. There are several ways to calculate these distances:
    - **User-user**: similarity is calculated between users --> "People like you also like", "You might also like"
    - **Item-item**: similarity is calculated between items --> "Users who bought this might also like this" (similar items). It is generally much faster than user-user. In addition, user profiles change quickly, which forces the entire model to be recomputed, whereas an item's average rating changes slowly. This leads to more stable rating distributions, so the model does not have to be rebuilt as often.
    - **Similarity options**: cosine similarity (most common), Euclidean distance, Pearson correlation or Jaccard.
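
The similarity options above can be sketched on a tiny, hypothetical utility matrix (users `u1`-`u4` and items `A`-`C` are made up). This sketch assumes user-mean centring with unobserved entries set to 0, which differs from the mean-filling used in the notebook cells; the point is only to show how the three measures are computed and how they are ranked.

```python
import math

# Hypothetical utility matrix: user -> {product: rating}; absent pairs are unobserved
ratings = {
    'u1': {'A': 5, 'B': 4, 'C': 1},
    'u2': {'A': 4, 'B': 5, 'C': 2},
    'u3': {'A': 1, 'B': 2, 'C': 5},
    'u4': {'A': 2, 'C': 4},
}
users = sorted(ratings)
items = ['A', 'B', 'C']

def item_vector(item):
    """Mean-centred ratings of one item across all users; unobserved entries become 0."""
    vector = []
    for user in users:
        user_mean = sum(ratings[user].values()) / len(ratings[user])
        vector.append(ratings[user][item] - user_mean if item in ratings[user] else 0.0)
    return vector

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def pearson(x, y):
    """Pearson correlation: cosine of the two vectors after centring each on its own mean."""
    x_mean, y_mean = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - x_mean for a in x], [b - y_mean for b in y])

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

vectors = {item: item_vector(item) for item in items}
others = [item for item in items if item != 'A']

# Similarities rank neighbours from the highest value to the lowest ...
most_similar_cosine = max(others, key=lambda j: cosine(vectors['A'], vectors[j]))
most_similar_pearson = max(others, key=lambda j: pearson(vectors['A'], vectors[j]))
# ... whereas a distance ranks them from the lowest value to the highest
nearest_euclidean = min(others, key=lambda j: euclidean(vectors['A'], vectors[j]))
```

The direction of the ranking matters: a helper that always sorts in one direction is only correct for either similarities or distances, not both.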

In [14]:
util_matrix = repeat_customers[['customer_unique_id', 'product_id', 'review_score']]
user_item_matrix = util_matrix.pivot_table(index='customer_unique_id', columns='product_id', values='review_score')

*a. Item-based*

In [15]:
item_matrix = user_item_matrix.apply(lambda col: col.fillna(col.mean()), axis=0)

In [16]:
def ib_get_product_suggestion(item_corr_matrix):
    # For each user, find top k similar products, based on one anchor purchase
    user_product_suggestion_list = []
    # Note: sorting by timestamp in descending order and taking .first() selects each
    # customer's most recent purchase; use ascending=True to anchor on the earliest one
    first_purchase = test_data.sort_values(by='order_purchase_timestamp', ascending=False).groupby('customer_unique_id')['product_id'].first()
    
    for user_id in test_customer_ids:
        # Look up the anchor product of this user
        first_product_id = first_purchase.loc[user_id]
        
        # Get top 5 similar products
        selected_product_corr = item_corr_matrix[first_product_id].dropna()
        top_products = pd.DataFrame(selected_product_corr.sort_values(ascending=False)[1:6]) # exclude the first one - the first purchase item
        top_products.columns = ['corr']
        
        top_products = top_products.reset_index()
        top_products.loc[:, 'customer_unique_id'] = [user_id for i in range(5)]
    
        user_product_suggestion_list.append(top_products)
    
    predicted_review_scores = (pd.concat(user_product_suggestion_list)
                               .rename(columns={'corr': 'review_score'})
                               [['customer_unique_id', 'product_id', 'review_score']])
    
    return predicted_review_scores 

In [17]:
# Pearson correlation
item_corr_matrix = item_matrix.corr()
predicted_review_scores = ib_get_product_suggestion(item_corr_matrix)
overall_performance['Item-based CF: Pearson correlation'] = evaluate_performance(predicted_review_scores=predicted_review_scores)

Accuracy: 0.079


In [18]:
# Cosine similarity
item_corr_matrix = cosine_similarity(item_matrix.T, item_matrix.T)
item_corr_matrix = pd.DataFrame(item_corr_matrix, index=user_item_matrix.columns, columns=user_item_matrix.columns) # add index/columns

predicted_review_scores = ib_get_product_suggestion(item_corr_matrix)
overall_performance['Item-based CF: Cosine similarity'] = evaluate_performance(predicted_review_scores=predicted_review_scores)

Accuracy: 0.033


In [19]:
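# Euclidean distance
# Note: ib_get_product_suggestion sorts in descending order, so with a distance matrix
# it selects the most distant products rather than the nearest ones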
item_corr_matrix = euclidean_distances(item_matrix.T, item_matrix.T)
item_corr_matrix = pd.DataFrame(item_corr_matrix, index=user_item_matrix.columns, columns=user_item_matrix.columns) # add index/columns

predicted_review_scores = ib_get_product_suggestion(item_corr_matrix)
overall_performance['Item-based CF: Euclid distance'] = evaluate_performance(predicted_review_scores=predicted_review_scores)

Accuracy: 0.001


*b. User-based*

In [20]:
user_matrix = user_item_matrix.apply(lambda x: x.fillna(x.mean()), axis=1).T

In [21]:
def ub_get_product_suggestion(user_similarity):
    user_product_suggestion_list = []
    
    for user_id in test_customer_ids:
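        # Take the 9 users at positions 1-9 after sorting. Ascending order keeps the smallest
        # values: the nearest users for a distance, but the least similar ones for correlation/cosine
        similar_users = user_similarity.loc[user_id].sort_values(ascending=True)[1:10]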
        
        top_products = user_item_matrix.loc[similar_users.index].mean().sort_values(ascending=False).head(5) # top 5 products with highest rating
        top_products = pd.DataFrame(top_products).rename(columns={0: 'review_score'}).reset_index()
        top_products.loc[:, 'customer_unique_id'] = [user_id for i in range(5)]
        
        user_product_suggestion_list.append(top_products)
    
    predicted_review_scores = pd.concat(user_product_suggestion_list)

    return predicted_review_scores

In [22]:
# Pearson correlation
user_similarity = user_matrix.corr()

predicted_review_scores = ub_get_product_suggestion(user_similarity)
overall_performance['User-based CF: Pearson correlation'] = evaluate_performance(predicted_review_scores=predicted_review_scores)

Accuracy: 0.014


In [23]:
# Cosine similarity
user_similarity = cosine_similarity(user_matrix.T, user_matrix.T)
user_similarity = pd.DataFrame(user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index) # add index/columns

predicted_review_scores = ub_get_product_suggestion(user_similarity)
overall_performance['User-based CF: Cosine similarity'] = evaluate_performance(predicted_review_scores=predicted_review_scores)

Accuracy: 0.002


In [24]:
# Euclid distance
user_similarity = euclidean_distances(user_matrix.T, user_matrix.T)
user_similarity = pd.DataFrame(user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index) # add index/columns

predicted_review_scores = ub_get_product_suggestion(user_similarity)
overall_performance['User-based CF: Euclid distance'] = evaluate_performance(predicted_review_scores=predicted_review_scores)

Accuracy: 0.024


#### 2.6. Collaborative Filtering Technique: Model-based

In essence, we compress the user-item matrix into a low-dimensional representation, using techniques such as SVD (a low-rank factorization method) or PCA (a dimensionality reduction method).
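
To illustrate what "compressing" the user-item matrix means, here is a minimal sketch of matrix factorization trained with stochastic gradient descent on hypothetical ratings. It is a simplified stand-in, not Surprise's `SVD` implementation (it has no bias terms), and the function names `train_mf` and `predict` as well as the hyperparameters are assumptions for this example. Each user and each item is represented by `n_factors` numbers, and a rating is predicted as the global mean plus the dot product of the two factor vectors.

```python
import math
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.05, reg=0.02, seed=42):
    """Fit user/item factor vectors by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(pf * qf for pf, qf in zip(p[u], q[i])))
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return mu, p, q

def predict(model, user, item):
    """Predicted rating; falls back to the global mean for an unseen user or item."""
    mu, p, q = model
    if user not in p or item not in q:
        return mu
    return mu + sum(pf * qf for pf, qf in zip(p[user], q[item]))

def rmse(model, ratings):
    return math.sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Hypothetical ratings, not taken from the Olist dataset
toy_ratings = [
    ('u1', 'A', 5), ('u1', 'B', 4), ('u1', 'C', 1),
    ('u2', 'A', 4), ('u2', 'B', 5), ('u2', 'C', 2),
    ('u3', 'A', 1), ('u3', 'B', 2), ('u3', 'C', 5),
    ('u4', 'A', 2), ('u4', 'C', 4),
]
model = train_mf(toy_ratings)
global_mean = model[0]
baseline_rmse = math.sqrt(sum((r - global_mean) ** 2 for _, _, r in toy_ratings) / len(toy_ratings))
fitted_rmse = rmse(model, toy_ratings)
cold_start_prediction = predict(model, 'unseen_user', 'A')
```

The fallback in `predict` shows the cold-start limitation of this family of models: a user with no ratings in the training set has no learned factors, so the prediction carries no personal signal.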

In [25]:
# Data preparation
reader = Reader(rating_scale=(1,5))

train_dataset = Dataset.load_from_df(train_data[['customer_unique_id', 'product_id', 'review_score']], reader)
test_dataset = Dataset.load_from_df(test_data[['customer_unique_id', 'product_id', 'review_score']], reader)

trainset = train_dataset.build_full_trainset()
testset = test_dataset.build_full_trainset().build_testset()

In [26]:
def mb_get_product_suggestion(model):
    user_product_suggestion_list = []
    
    for user_id in test_customer_ids:
        # Get top k recommendations for each user
        user_predictions = model.test([(user_id, iid, 0) for iid in repeat_customers['product_id'].unique()])
        top_pred = sorted(user_predictions, key=lambda x: x.est, reverse=True)[:5]
        
        top_products = pd.DataFrame(
            {'product_id': [pred.iid for pred in top_pred], 
             'review_score': [pred.est for pred in top_pred]}, 
            index=range(5))
        top_products.loc[:, 'customer_unique_id'] = [user_id for i in range(5)]
        
        user_product_suggestion_list.append(top_products)
    
    predicted_review_scores = pd.concat(user_product_suggestion_list)

    return predicted_review_scores

In [27]:
models = {'Model-based CF - SVD': SVD(), 'Model-based CF - NMF': NMF(), 'Model-based CF - KNNBasic': KNNBasic()}

for model_name, model in models.items():
    print('*' * 50, '\n', model_name)
    model.fit(trainset)
    predictions = model.test(testset)
    
    rmse = round(accuracy.rmse(predictions), 3)

    predicted_review_scores = mb_get_product_suggestion(model=model)
    overall_performance[model_name] = evaluate_performance(predicted_review_scores=predicted_review_scores)

************************************************** 
 Model-based CF - SVD
RMSE: 1.4028
Accuracy: 0.012
************************************************** 
 Model-based CF - NMF
RMSE: 1.4032
Accuracy: 0.0
************************************************** 
 Model-based CF - KNNBasic
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.4032
Accuracy: 0.0


#### 2.7. Summary

We now summarize all of the above results in a single table. 

Item-based CF with **Pearson correlation** gives the best result (top-5 accuracy of 7.9%), although it is quite computationally intensive. Because the dataset is so small, the best memory-based configurations outperform the other methods, including model-based filtering (at most 1.2%). Note also that the train/test split is made by customer, so the model-based models are fit without any ratings from the test customers. With a dataset this small, the most basic method, suggesting popular products, provides a pretty solid baseline at **3%** top-5 accuracy.


In [28]:
pd.DataFrame(overall_performance, index=[0])

| | Baseline: Mean Prediction | Baseline: Top k popular products | Content-based Filtering | Item-based CF: Pearson correlation | Item-based CF: Cosine similarity | Item-based CF: Euclid distance | User-based CF: Pearson correlation | User-based CF: Cosine similarity | User-based CF: Euclid distance | Model-based CF - SVD | Model-based CF - NMF | Model-based CF - KNNBasic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.03 | 0.001 | 0.079 | 0.033 | 0.001 | 0.014 | 0.002 | 0.024 | 0.012 | 0.0 | 0.0 |


## 3. Future work
#### 3.1. Deep Learning-based models

Nowadays, recommender systems have been improved by many approaches based on deep learning. 
The idea behind matrix factorization models is that the preferences of a user can be determined by a small number of hidden factors. These factors are called embeddings. There are two types of architecture:

- *Dot product embeddings*: we embed users and items separately, then combine the two embeddings with a dot product to create features
- *Concatenation embeddings*: we embed users and items separately, then concatenate the two embeddings to create features
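
The difference between the two architectures can be sketched with plain lists standing in for embedding tables. Everything below is hypothetical: the embeddings and weights are random rather than trained, and a single linear layer stands in for the multi-layer network that a real concatenation model would typically place on top.

```python
import random

rng = random.Random(0)
n_factors = 4

# Hypothetical embedding tables: one vector of n_factors numbers per user and per item
user_emb = {u: [rng.uniform(-1, 1) for _ in range(n_factors)] for u in ['u1', 'u2']}
item_emb = {i: [rng.uniform(-1, 1) for _ in range(n_factors)] for i in ['A', 'B', 'C']}

def dot_product_score(user, item):
    """Dot product architecture: the score is the inner product of the two embeddings."""
    return sum(a * b for a, b in zip(user_emb[user], item_emb[item]))

def concat_features(user, item):
    """Concatenation architecture: the two embeddings are joined into one feature vector."""
    return user_emb[user] + item_emb[item]

def concat_score(user, item, weights, bias=0.0):
    """One linear layer on the concatenated features (a real model would stack more layers)."""
    features = concat_features(user, item)
    return sum(w * x for w, x in zip(weights, features)) + bias

# Hypothetical, untrained layer weights: one weight per concatenated feature
layer_weights = [rng.uniform(-1, 1) for _ in range(2 * n_factors)]

score_dot = dot_product_score('u1', 'A')
score_concat = concat_score('u1', 'A', layer_weights)
features = concat_features('u1', 'A')
```

The dot product fixes how user and item factors interact, while the concatenated features leave that interaction to be learned by the layers placed on top of them.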

* Reference:
    - https://d2l.ai/chapter_recommender-systems/movielens.html
    - https://d2l.ai/chapter_recommender-systems/deepfm.html
    - https://d2l.ai/chapter_recommender-systems/neumf.html
    - https://github.com/recommenders-team/recommenders/blob/main/examples/06_benchmarks/movielens.ipynb