# Olist Product Recommendation System
### Part 4 Modelling - Comparing Techniques
#### Author: Olabisi Sunmon | 10th April 2023

### Problem Statement

How can we create a customised product recommendation system, using data analysis and machine learning techniques, that helps Olist customers discover new products and find relevant items to purchase, boosting revenue and customer purchase rates?



In this notebook, I will be exploring different collaborative filtering techniques to build a product recommendation system. I will compare the effectiveness of various algorithms and evaluate their performance using appropriate metrics.

---------
### Procedure:

First-time customer recommendations will be based on:
- Trending products in the country 
- Trending products in their state

Returning customer recommendations will be based on:
- KNNWithMeans
- FunkSVD 
- NormalPredictor
- CoClustering

These techniques have been published here; https://surpriselib.com

------
#### IMPORTANT 
Due to limitations in computational power, the modelling for returning customer product recommendations will be conducted on a subset of the dataset rather than on the entire dataset. To keep the subset representative of the sales data at different levels of satisfaction, I will stratify the sample on `review_score`. The sample will then contain each satisfaction level in roughly the same proportion as the full data, which helps the insights generalise to the entire population.
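
As a rough illustration of what stratifying on `review_score` achieves, the sketch below samples the same fraction from each score group and compares the score shares before and after. It is a plain-Python illustration on made-up data, not the scikit-learn implementation used in this notebook; the `stratified_sample` helper, the `orders` list and the score weights are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, frac, seed=42):
    """Sample roughly `frac` of the rows from each group defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * frac))  # keep at least one row per group
        sample.extend(rng.sample(group, n))
    return sample

# Hypothetical toy data: 1,000 orders with a skewed review_score distribution
data_rng = random.Random(0)
orders = [
    {"order": i,
     "review_score": data_rng.choices([1, 2, 3, 4, 5], weights=[10, 5, 10, 20, 55])[0]}
    for i in range(1000)
]
subset = stratified_sample(orders, "review_score", frac=0.12)

# Compare the share of each score in the full data and in the subset
full_share = Counter(o["review_score"] for o in orders)
sub_share = Counter(o["review_score"] for o in subset)
for score in sorted(full_share):
    print(score,
          round(full_share[score] / len(orders), 3),
          round(sub_share[score] / len(subset), 3))
```

The two printed columns should be close for every score, which is the property the stratified subset is meant to preserve.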


In [4]:
#Import Package
# data manipulation
import numpy as np
import pandas as pd
import joblib
from sklearn.model_selection import StratifiedShuffleSplit

#Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Modeling
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder
from surprise import Dataset
from surprise.reader import Reader
from surprise import KNNWithMeans, CoClustering
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
from surprise.model_selection import train_test_split, GridSearchCV, cross_validate
from surprise import accuracy
from surprise.accuracy import rmse
from surprise import NormalPredictor


#Ignore futurewarnings
import warnings
warnings.filterwarnings('ignore')



## Import data

In [5]:
df = joblib.load('/Users/labisi/Desktop/capstone-project-osunmon1/src/data/Processed/df_processed.pkl')
returners = joblib.load('/Users/labisi/Desktop/capstone-project-osunmon1/src/data/Processed/returners_data.pkl')
first_timers = joblib.load('/Users/labisi/Desktop/capstone-project-osunmon1/src/data/Processed/first_timer_data.pkl')

## Creating a Sample

In [6]:
# Creating a 12% stratified subset of the returners dataframe (test_size=0.12)
split = StratifiedShuffleSplit(n_splits=1, test_size=0.12, random_state=42)
for _, test_index in split.split(returners, returners['review_score']):
    returners = returners.iloc[test_index]
    
print("Shape of returners dataset:", returners.shape)
print("Shape of first timer dataset:", first_timers.shape)

Shape of returners dataset: (3476, 19)
Shape of first timer dataset: (81785, 19)


## First time Customer recommendations
Trending products in the country 

In [7]:
def popular_items(df, top): #the function returns n trending items
    top_n_items = df.product_id.value_counts().sort_values(ascending=False)[:top].index
    return list(top_n_items)

Trending products in their state

In [8]:
def popular_state(df, state, top):  # the function returns n trending items in the state
    location_df = df[df.customer_state == state]
    top_n_items = location_df.product_id.value_counts().sort_values(ascending=False)[:top].index
    return list(top_n_items)

In [9]:
#the function returns trending items in country and state of the customer

def first_time_recommender(df, uid, top):
    hot_items = popular_items(df, top)
    state = df[df.customer_unique_id==uid].customer_state.max()
    popular_in_state = popular_state(df, state, top)
    
    print(f"Trending items you might like:\n {hot_items}\n")
    print(f"Popular items in your area:\n {popular_in_state}")
    
    recommendation = {'Trending Items': hot_items, 'Area': popular_in_state}
    
    return recommendation

In [10]:
# Example Recommendation
recommendation = first_time_recommender(df, 'c71a196d46a70ec611f3922db5755d1d', 3)

Trending items you might like:
 ['aca2eb7d00ea1a7b8ebd4e68314663af', '422879e10f46682990de24d770e7f83d', '99a4788cb24856965c36a24e339b6058']

Popular items in your area:
 ['aca2eb7d00ea1a7b8ebd4e68314663af', '99a4788cb24856965c36a24e339b6058', '422879e10f46682990de24d770e7f83d']


## Returning Customers
### Collaborative Filtering
 

In [11]:
# Creating the dataframe of user, item and rating columns
collab_df = returners[['customer_unique_id', 'product_id', 'review_score']]
collab_df = collab_df.sort_values(by=['customer_unique_id','product_id'])


In [12]:
collab_df.shape

(3476, 3)

In [13]:
reader = Reader(rating_scale=(1,5))
# Set the dataset
data = Dataset.load_from_df(collab_df, reader)
data

<surprise.dataset.DatasetAutoFolds at 0x7f9d5ee33b50>

First, I will create a model using FunkSVD.

Funk Singular Value Decomposition is a matrix factorization technique used in collaborative filtering recommendation systems to predict user preferences based on the feedback of similar users.

The grid search will use 3-fold cross-validation, which reduces the risk of tuning to a single train/test split. FCP (Fraction of Concordant Pairs) will be used as the grid search measure because it evaluates how well a model ranks items according to user preferences. FCP ranges from 0 to 1, where 1 indicates perfect agreement between the predicted and actual rankings.
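
To make the measure concrete, here is a minimal sketch of FCP on made-up data. It pools the pair counts over all users, which is a simplification and may differ from how Surprise aggregates per-user counts; the `fcp` function and the `toy` predictions are hypothetical and only meant to show the idea.

```python
from collections import defaultdict
from itertools import combinations

def fcp(predictions):
    """Fraction of concordant pairs, pooled over users.

    predictions: iterable of (user_id, actual, predicted) tuples.
    """
    by_user = defaultdict(list)
    for uid, actual, predicted in predictions:
        by_user[uid].append((actual, predicted))

    concordant = discordant = 0
    for ratings in by_user.values():
        for (a1, p1), (a2, p2) in combinations(ratings, 2):
            if a1 == a2:
                continue  # tied actual ratings carry no ordering information
            if (a1 - a2) * (p1 - p2) > 0:
                concordant += 1  # predicted order matches actual order
            else:
                discordant += 1  # predicted order is reversed or tied
    if concordant + discordant == 0:
        raise ValueError("FCP is undefined: no user has two differently rated items")
    return concordant / (concordant + discordant)

# Hypothetical predictions: u1 has three rated items, u2 only one
toy = [("u1", 5, 4.5), ("u1", 3, 3.0), ("u1", 1, 3.5), ("u2", 4, 4.0)]
print(round(fcp(toy), 3))  # 0.667
```

User `u2` contributes nothing because a single rating forms no pair, which is why the measure needs at least two ratings per user to be informative.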

In [14]:
#  create parameter grid
param_grid = {
    'n_factors': [5, 150], 
    'n_epochs': [5, 40],
    'lr_all': [0.0001, 0.1],
    'biased': [False] }

# Set 3 cross validation
GS = GridSearchCV(FunkSVD, param_grid, measures=['fcp'], cv=3)

# Fit the model
GS.fit(data)
     

In [15]:
# Check the FCP accuracy score 
GS.best_score['fcp']

0.1568627450980392

In [16]:
# Check the best parameters
GS.best_params['fcp']

{'n_factors': 5, 'n_epochs': 5, 'lr_all': 0.0001, 'biased': False}

I will now fit the FunkSVD model using the best parameters.

In [17]:
# Split train and test set
train, test = train_test_split(data, test_size=0.2)
# Set the SVD algorithm
svd = FunkSVD(n_factors=5,n_epochs=5 ,lr_all=0.0001, biased=False, verbose=0)
# Fit train set
svd.fit(train)
# Test the algorithm using test set
pred = svd.test(test)

In [18]:
# Put my_pred result in a dataframe
svd_pred_df = pd.DataFrame(pred, columns=['user_id','product_id','actual','prediction','details'])

# Calculate the difference of actual and prediction into diff column
svd_pred_df['diff'] = abs(svd_pred_df['prediction'] - 
                            svd_pred_df['actual'])


In [19]:
# Build full trainset
train_df = data.build_full_trainset()

# Build the SVD algorithm
my_svd = FunkSVD(n_factors=5, n_epochs=5,lr_all=0.0001,biased=False, verbose=0)

# Fit with full trainset
my_svd.fit(train_df)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f9d5ee332e0>

In [20]:
# Define the full test set
test_df = train_df.build_anti_testset(fill=-1)

In [21]:
# set the prediction
pred = my_svd.test(test_df)

In [22]:
# Put into a dataframe
pred_df = pd.DataFrame(pred, columns=['user_id', 'product_id','actual', 'prediction','details'])                                 

I will now create a FunkSVD recommendation function 

In [23]:
def recommender_FunkSVD(uid):
    '''
    Given a user_id, recommend new product using FunkSVD.
    uid: user_id
    '''
    # check user_id prediction in pred_df
    user_df = pred_df[pred_df['user_id'] == uid]
    user_df = user_df.sort_values('prediction', ascending=False)
    # merge with rating_df
    df = user_df.merge(collab_df['product_id'].drop_duplicates(), how='left', 
                    on = 'product_id')
    return df.head(5)

In [24]:
recommender_FunkSVD('66ae4493fc7c710a8db6d9620901f40d')

Unnamed: 0,user_id,product_id,actual,prediction,details
0,66ae4493fc7c710a8db6d9620901f40d,9e572ff4654f7064419d97a891a8b0fc,-1.0,1,{'was_impossible': False}
1,66ae4493fc7c710a8db6d9620901f40d,1427b126f61597524866770b05d4eed2,-1.0,1,{'was_impossible': False}
2,66ae4493fc7c710a8db6d9620901f40d,e731dfad79b4686d049d024b9fc97360,-1.0,1,{'was_impossible': False}
3,66ae4493fc7c710a8db6d9620901f40d,b3793f4676bdf327ca34c40d236ee2b2,-1.0,1,{'was_impossible': False}
4,66ae4493fc7c710a8db6d9620901f40d,77cc62dc80ebe12a0452d1ce0565acdc,-1.0,1,{'was_impossible': False}


The recommendation function has recommended 5 products for the user '66ae4493fc7c710a8db6d9620901f40d'. The actual value of -1 shows that the user has not bought these items in the past. As the rating scale runs from 1 to 5, a predicted rating of 1 indicates that the user is very likely to rate the item low.
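
One plausible reason every prediction sits at exactly 1 is the combination of `biased=False` with a very small learning rate and few epochs. Without bias terms, the predicted rating is the dot product of the user and item factor vectors, and Surprise clips predictions to the rating scale by default. If the factors stay close to their small initial values, the dot product is near 0 and is clipped up to 1. The sketch below illustrates this under those assumptions; `predict_unbiased` and the factor values are hypothetical.

```python
def predict_unbiased(user_factors, item_factors, rating_scale=(1, 5)):
    """Dot product of the factor vectors, clipped to the rating scale."""
    raw = sum(p * q for p, q in zip(user_factors, item_factors))
    low, high = rating_scale
    return min(high, max(low, raw))

# Hypothetical, barely trained factors (small values around 0)
p_u = [0.05, -0.02, 0.10, 0.01, -0.07]
q_i = [0.03, 0.08, -0.04, 0.02, 0.06]

print(predict_unbiased(p_u, q_i))  # 1
```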

A good way to evaluate a FunkSVD model is to look at its FCP score. FCP measures the proportion of pairs of data points for which the predicted values are in the same order as the actual values. It cannot be used here because not all users have at least two predictions. This may differ if the whole dataset is used: when a recommendation system has more data to train on, it can produce better recommendations.

In [None]:
# FCP
# FCP = accuracy.fcp(pred, verbose=False)
# print(FCP)

In [22]:

# RMSE
RMSE = accuracy.rmse(pred, verbose=False)
print(RMSE)

2.0


In [23]:

# MAE
MAE = accuracy.mae(pred, verbose=False)
print(MAE)

2.0


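The RMSE and MAE scores are both 2.0; the lower the score, the more accurate the predicted ratings. Note that `pred` here holds predictions on the anti-testset, whose `actual` values are the placeholder -1, so these scores measure the distance between the predictions and the placeholder rather than the error against real ratings. I will compare other recommendation algorithms against FunkSVD using the RMSE score.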
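
For reference, a minimal sketch of the two metrics on (actual, predicted) pairs. The first list mimics the recommendation table above, where every `actual` is -1 and every `prediction` is 1; the `rmse` and `mae` helpers and both lists are illustrative only.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return sum(abs(a - p) for a, p in pairs) / len(pairs)

# Pairs shaped like the anti-testset output: actual = -1, prediction = 1
anti_testset_like = [(-1.0, 1.0)] * 5
print(rmse(anti_testset_like), mae(anti_testset_like))  # 2.0 2.0

# Hypothetical held-out ratings, where the metrics reflect real error
held_out_like = [(5, 4.2), (3, 3.9), (1, 2.5), (4, 4.1)]
print(round(rmse(held_out_like), 3), round(mae(held_out_like), 3))
```

When every pair is (-1, 1), both metrics equal exactly 2.0 regardless of how many pairs there are.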

### Comparing Different Collaborative Filtering Techniques

This technique has been explored here; https://surpriselib.com

In [24]:
acc = []
algos = [FunkSVD(), KNNWithMeans(), NormalPredictor(),CoClustering()]
for algo in algos:
    # perform cross validation
    results = cross_validate(algo, data, measures=['rmse'], cv=3, verbose=0)
    # get results
    temp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so pd.concat is used to add the algorithm name
    temp = pd.concat([temp, pd.Series([str(algo).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    acc.append(temp)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


In [25]:
acc_results = pd.DataFrame(acc).set_index('Algorithm').sort_values('test_rmse')
acc_results

Algorithm,test_rmse,fit_time,test_time
KNNWithMeans,1.516538,0.141338,0.006493
CoClustering,1.525092,0.361644,0.004384
SVD,1.533964,0.197952,0.006193
NormalPredictor,2.01889,0.00289,0.005964


The KNNWithMeans has the best test_rmse score. I will now tune the KNNWithMeans model.

In [26]:
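# Set the parameter grid
# Note: n_factors, n_epochs, lr_all and biased are SVD hyperparameters.
# KNNWithMeans accepts them as extra keyword arguments but does not use them;
# its own tunable parameters are k, min_k and sim_options.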
param_grid = {
    'n_factors': [5, 150], 
    'n_epochs': [5, 40],
    'lr_all': [0.0001, 0.1],
    'biased': [False] }

# Set GridSearchCV with 3 cross validation
GS = GridSearchCV(KNNWithMeans, param_grid, measures=['rmse'], cv=3)

# Fit the model
GS.fit(data)
     

Computing the msd similarity matrix...
Done computing similarity matrix.

In [27]:
GS.best_score['rmse']

1.512998683665387

In [28]:


# Check the best parameters
GS.best_params['rmse']

{'n_factors': 5, 'n_epochs': 5, 'lr_all': 0.0001, 'biased': False}
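
The parameters reported above belong to SVD rather than to KNNWithMeans, so this search most likely evaluated the same default KNN configuration for every combination. A grid over the parameters that KNNWithMeans does use (`k`, `min_k` and `sim_options`) might look like the sketch below. The specific values are assumptions chosen for illustration, not tuned recommendations, and `count_combinations` is a hypothetical helper that only shows how many configurations the search would fit.

```python
# Hypothetical grid over parameters that KNNWithMeans actually uses
knn_param_grid = {
    "k": [10, 20, 40],       # maximum number of neighbours
    "min_k": [1, 3],         # minimum neighbours before falling back to the mean
    "sim_options": {
        "name": ["msd", "cosine"],     # similarity measure
        "user_based": [True, False],   # user-user or item-item similarities
    },
}

def count_combinations(grid):
    """Count the parameter combinations, expanding nested option dictionaries."""
    total = 1
    for values in grid.values():
        if isinstance(values, dict):
            total *= count_combinations(values)
        else:
            total *= len(values)
    return total

print(count_combinations(knn_param_grid))  # 24

# With Surprise installed, the grid would be passed in the same way as before:
# GS = GridSearchCV(KNNWithMeans, knn_param_grid, measures=['rmse'], cv=3)
```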

In [29]:
# Split train test set
train, test = train_test_split(data, test_size=0.2)
# Set the algorithm

knn = KNNWithMeans(n_factors=5,
              n_epochs=5 ,
              lr_all=0.0001, 
              biased=False, 
              verbose=0)
# Fit train set
knn.fit(train)
# Test the algorithm using test set
pred = knn.test(test)



In [30]:
# Put my_pred result in a dataframe
knn_pred_df = pd.DataFrame(pred, columns=['user_id',
                                        'product_id',
                                        'actual',
                                        'prediction',
                                        'details'])

# Calculate the difference of actual and prediction into diff column
knn_pred_df['diff'] = abs(knn_pred_df['prediction'] - 
                            knn_pred_df['actual'])


In [31]:
# Build full trainset
train_df = data.build_full_trainset()

# Build the KNN algorithm
my_knn = KNNWithMeans(n_factors=5, 
                 n_epochs=5, 
                 lr_all=0.0001,    
                 biased=False, 
                 verbose=0)

# Fit with full trainset
my_knn.fit(train_df)

<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f9a1b0f3b50>

In [32]:
# Define the full test set
test_df = train_df.build_anti_testset(fill=-1)

In [33]:
# set the prediction
pred = my_knn.test(test_df)

In [34]:
# Put into a dataframe
pred_df = pd.DataFrame(pred, columns=['user_id', 'product_id','actual', 'prediction','details'])   

In [35]:
def recommender_knn(uid):
    '''
    Given a user_id, recommend new product using KNN.
    uid: user_id
    '''
    # check user_id prediction in pred_df
    user_df = pred_df[pred_df['user_id'] == uid]
    user_df = user_df.sort_values('prediction', ascending=False)
    # merge with rating_df
    df = user_df.merge(collab_df['product_id'].drop_duplicates(), how='left', 
                    on = 'product_id')
    return df.head(5)

In [36]:
recommender_knn('66ae4493fc7c710a8db6d9620901f40d')

Unnamed: 0,user_id,product_id,actual,prediction,details
0,66ae4493fc7c710a8db6d9620901f40d,9e572ff4654f7064419d97a891a8b0fc,-1.0,5.0,"{'actual_k': 0, 'was_impossible': False}"
1,66ae4493fc7c710a8db6d9620901f40d,1427b126f61597524866770b05d4eed2,-1.0,5.0,"{'actual_k': 0, 'was_impossible': False}"
2,66ae4493fc7c710a8db6d9620901f40d,e731dfad79b4686d049d024b9fc97360,-1.0,5.0,"{'actual_k': 0, 'was_impossible': False}"
3,66ae4493fc7c710a8db6d9620901f40d,b3793f4676bdf327ca34c40d236ee2b2,-1.0,5.0,"{'actual_k': 0, 'was_impossible': False}"
4,66ae4493fc7c710a8db6d9620901f40d,77cc62dc80ebe12a0452d1ce0565acdc,-1.0,5.0,"{'actual_k': 0, 'was_impossible': False}"


Although the KNN and FunkSVD recommendation systems both recommended the same 5 products for customer '66ae4493fc7c710a8db6d9620901f40d', the predicted ratings from the KNN recommender are much higher, at 5. The `actual_k` value of 0 in the details column shows that no neighbours contributed to these predictions.
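
The sketch below shows, in simplified form, how a mean-centred KNN estimate behaves when no usable neighbours are found: the prediction falls back to the user's own mean rating. This mirrors my understanding of how KNNWithMeans works and is not Surprise's source code; `knn_with_means_estimate` and its inputs are hypothetical.

```python
def knn_with_means_estimate(user_mean, neighbours, k=40, min_k=1):
    """Mean-centred KNN estimate with a fallback to the user's mean.

    neighbours: list of (similarity, neighbour_rating, neighbour_mean) tuples.
    Returns (estimate, actual_k).
    """
    # Keep the k most similar neighbours
    top_k = sorted(neighbours, key=lambda n: n[0], reverse=True)[:k]
    sum_sim = sum_dev = 0.0
    actual_k = 0
    for sim, rating, nb_mean in top_k:
        if sim > 0:  # only positively similar neighbours contribute
            sum_sim += sim
            sum_dev += sim * (rating - nb_mean)
            actual_k += 1
    if actual_k < min_k or sum_sim == 0:
        return user_mean, actual_k  # not enough neighbours: return the mean
    return user_mean + sum_dev / sum_sim, actual_k

# A user whose mean rating is 5 and who has no neighbours for this item
print(knn_with_means_estimate(5.0, []))  # (5.0, 0)

# A user with one neighbour who rated the item one point above their own mean
print(knn_with_means_estimate(3.0, [(1.0, 5, 4)]))  # (4.0, 1)
```

Under this reading, a prediction of 5.0 with `actual_k` of 0 would simply be the mean of the ratings that user has already given.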

In [37]:
# RMSE
RMSE = accuracy.rmse(pred, verbose=False)
RMSE

4.9583506772736206

In [38]:

# MAE
MAE = accuracy.mae(pred, verbose=False)
print(MAE)

4.7006990680466405


### Comparing the tuned FunkSVD model and the tuned KNN model

Although the untuned KNN algorithm had the best test RMSE score when compared with the CoClustering, FunkSVD and NormalPredictor algorithms, the tuned FunkSVD algorithm outperformed the tuned KNN algorithm on the anti-testset predictions, with an RMSE of 2.0 against an RMSE of about 4.96 for KNN.

The difference in performance between the KNN and FunkSVD algorithms may be because they work in very different ways. KNN algorithms do not assume any underlying data distribution and can struggle with sparse matrices, while the FunkSVD algorithm assumes that the data can be represented as a low-rank matrix. This could have resulted in the FunkSVD model performing better than the KNN model on unseen test data.

https://medium.com/analytics-vidhya/k-nearest-neighbors-all-you-need-to-know-1333eb5f0ed0


## Item - Item 
An item-item collaborative recommendation system was attempted, but it ran into difficulties because of the wide range of over 32000 products and the low average number of unique products purchased per customer. With very few pairwise arrays available, the system struggled to identify similar items, which made that recommendation system redundant.

# Summary 

In the notebook 3_modelling_comparing_market_basket I have developed a product recommendation system that suggests products to customers based on the items in their basket. As Olist's order history data grows, it would be beneficial for the business to explore product recommendations based on the customer state and/or product categories of items in their basket. This could provide a more personalised recommendation if trends are found.

It's important to note that when creating a collaborative filtering-based product recommendation system, choosing the model with the lowest test_rmse score isn't enough. It's crucial to understand how the model will work with the available order data. I recommend that Olist considers user-based recommendations using the FunkSVD recommendation model.