In [1]:
import warnings
warnings.filterwarnings('ignore')
from IPython.display import Image

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

## Analytics - Part3 (Recommender System)
- Collaborative filtering (memory-based)

*******************************************************************************************************************************************************************
### Popularity based & Collaborative filtering

- Collaborative methods in recommender systems use the past interactions between users and items to generate new recommendations. These interactions are stored in a matrix that shows how users have interacted with different items. The main idea behind collaborative methods is that these past interactions contain enough information to find similar users or items and make predictions based on these similarities.

- The advantage of collaborative approaches is that they require no additional information about users or items, which makes them applicable in many situations. Moreover, as users interact with more items, the system accumulates more information and its recommendations become more accurate.

- In this project, I will build an item-item, memory-based collaborative filtering recommender for existing customers. I will also build a popularity-based recommender for customers who have just signed up and for future customers.

### Load data

In [2]:
features = ['customer_unique_id', 'product_id', 'review_score']
df = pd.read_csv("./datasets/olist_master.csv")
df = df[features]
print(df.isna().sum())
print()
print(df.shape)

customer_unique_id    0
product_id            0
review_score          0
dtype: int64

(116581, 3)


In [3]:
df.head()

Unnamed: 0,customer_unique_id,product_id,review_score
0,5ee8fe956c2631afc0a1dcc1920d0e3d,4244733e06e7ecb4970a6e2683c13e61,5
1,8b3f917f4307d3e5cf34c0b43d6e6f50,4244733e06e7ecb4970a6e2683c13e61,5
2,69ba88e17ea574da9c9b8c8834a583d1,4244733e06e7ecb4970a6e2683c13e61,4
3,cbe063493a222cb17024ff0285b4ecb6,4244733e06e7ecb4970a6e2683c13e61,5
4,ffab5330bd7b40979ab6726b2e02292e,4244733e06e7ecb4970a6e2683c13e61,5


### Popularity based recommender system

- A popularity-based recommender system is a type of recommendation system that suggests items to users based on their overall popularity or popularity among a specific group of customers. Instead of personalizing recommendations based on individual customer preferences or behaviors, it focuses on the general popularity or trends in the business.

- Popularity-based recommenders do not take into account the specific preferences or characteristics of individual customers. The recommendations are the same for all users, or for everyone in a given user group, because they are based solely on overall popularity. For that reason, a popularity-based recommender is generally used not for existing customers but for customers who have just signed up.

In [4]:
# preprocessing
df = df.groupby('product_id', as_index=False).agg(review_count=('customer_unique_id', 'count'), avg_score=('review_score', 'mean'))
df.head()

Unnamed: 0,product_id,review_count,avg_score
0,00066f42aeeb9f3007548bb9d3f33c38,1,5.0
1,00088930e925c41fd95ebfe695fd2655,1,4.0
2,0009406fd7479715e4bef61dd91f2462,1,1.0
3,000b8f95fcb9e0096488278317764d19,2,5.0
4,000d9be29b5207b54e86aa1b1ac54872,1,5.0


- Next, let's compute each product's popularity as a weighted average review score, which pulls the average of rarely reviewed products toward the overall mean
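
The weighted score used in the following cell can be written compactly. The symbols are introduced here for illustration only: $v$ is a product's `review_count`, $R$ is its `avg_score`, $m$ is `threshold` (the 95th percentile of review counts), and $C$ is the mean of `avg_score` over all products.

$$
W = \frac{v}{v + m}\,R + \frac{m}{v + m}\,C
$$

When $v$ is much larger than $m$, $W$ approaches the product's own average $R$; when $v$ is much smaller than $m$, $W$ approaches the global mean $C$. A product with a single 5-star review is therefore not ranked above a product with many reviews averaging slightly below 5.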

In [5]:
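# Review count at the 95th percentile; acts as the weight of the global mean
threshold = df.review_count.quantile(.95)
# Global mean of the per-product average scores
avg_score = df.avg_score.mean()

def weighted_score(df=df, threshold=threshold, mean_score=avg_score):
    # Blend each product's own average with the global mean, weighted by its review count
    review_cnt = df.review_count
    score = df.avg_score
    weighted_avg = review_cnt / (review_cnt + threshold) * score + (threshold / (threshold + review_cnt) * mean_score)
    return weighted_avg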

weighted_score(df, threshold, avg_score)

0        4.115492
1        4.032159
2        3.782159
3        4.183531
4        4.115492
           ...   
32323    4.106608
32324    4.032159
32325    4.115492
32326    4.336619
32327    4.032159
Length: 32328, dtype: float64

In [6]:
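# Pass the arguments explicitly: the function's defaults were bound when it was defined
df['weighted_score'] = weighted_score(df, threshold, avg_score)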
df.head()

Unnamed: 0,product_id,review_count,avg_score,weighted_score
0,00066f42aeeb9f3007548bb9d3f33c38,1,5.0,4.115492
1,00088930e925c41fd95ebfe695fd2655,1,4.0,4.032159
2,0009406fd7479715e4bef61dd91f2462,1,1.0,3.782159
3,000b8f95fcb9e0096488278317764d19,2,5.0,4.183531
4,000d9be29b5207b54e86aa1b1ac54872,1,5.0,4.115492


- Let's keep only the products whose review count exceeds the 95th-percentile threshold and show the top 10 by weighted score

In [7]:
popular_items = df[df.review_count > threshold].sort_values('weighted_score', ascending=False)
popular_items.head(10)

Unnamed: 0,product_id,review_count,avg_score,weighted_score
29803,ebf9bc6cd600eadd681384e3116fda85,44,5.0,4.807016
627,0554911df28fda9fd668ce5ba5949695,38,5.0,4.783386
14661,73326828aa5efe1ba096223de496f596,56,4.839286,4.707252
2206,11250b0d4b709fee92441c5f34122aed,44,4.863636,4.697926
11942,5ddab10d5e0a23acb99acf56b62b3276,21,5.0,4.668309
7954,3e4176d545618ed02f382a3057de32b4,24,4.958333,4.668169
31313,f7f59e6186e10983a061ac7bdb3494d6,41,4.829268,4.661267
17837,8d37ee446981d3790967d0268d6cfc81,30,4.866667,4.643559
1290,0a4f9f421af66d2ea061fbb8883419f7,25,4.88,4.621831
16932,85b99d83c60cab5b4d8f927ad35212a1,17,5.0,4.620925


- This time, let's rank the same filtered products by number of reviews and show the top 10

In [8]:
popular_items.sort_values('review_count', ascending=False).head(10)

Unnamed: 0,product_id,review_count,avg_score,weighted_score
21716,aca2eb7d00ea1a7b8ebd4e68314663af,536,4.009328,4.009846
19387,99a4788cb24856965c36a24e339b6058,528,3.876894,3.880122
8453,422879e10f46682990de24d770e7f83d,508,3.923228,3.925599
7228,389d119b48cf3043d311335e499d9c6b,406,4.100985,4.099247
6947,368c6c730842d78016ad823897a372db,398,3.90201,3.905589
10635,53759a2ecddad2bb87a079a1f1519f73,391,3.877238,3.881557
26533,d1c427060a0f73f6b889a5c7c61f2ac4,357,4.078431,4.077136
10662,53b36df67ebb7c41585e8d54d6772e08,327,4.183486,4.178657
2741,154e7e31ebfa092203795c972e5804a6,295,4.318644,4.308451
7902,3dd2a17168ec895c781a9191c1e95ad7,278,4.183453,4.177806


- As shown above, the popularity-based recommender is not sensitive to the interests and tastes of a particular customer. It is therefore best reserved for customers who have just signed up and have no transaction history.
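
A minimal sketch of how the two recommenders could be combined at serving time: customers without history receive the popularity ranking, and everyone else receives personalized results padded with popular items when the personalized list is short. The function `recommend`, its parameters, and the toy IDs are hypothetical and not part of this notebook.

```python
from typing import Callable, Iterable, List, Set

def recommend(user_id: str,
              known_users: Set[str],
              popular_items: List[str],
              personalized: Callable[[str, int], Iterable[str]],
              top_n: int = 5) -> List[str]:
    """Route a request to the personalized or the popularity-based recommender."""
    if user_id not in known_users:
        # Cold start: no history, so fall back to the global popularity ranking
        return popular_items[:top_n]

    recs = list(personalized(user_id, top_n))
    # Pad with popular items if the personalized list has fewer than top_n entries
    for item in popular_items:
        if len(recs) >= top_n:
            break
        if item not in recs:
            recs.append(item)
    return recs[:top_n]

# Toy usage with a stand-in for the collaborative filtering model (hypothetical IDs)
popular = ["p1", "p2", "p3", "p4"]

def toy_cf(uid: str, n: int) -> List[str]:
    return ["x1", "p2"][:n]

print(recommend("new_customer", {"u1"}, popular, toy_cf, top_n=3))
print(recommend("u1", {"u1"}, popular, toy_cf, top_n=3))
```

In this notebook, `popular_items` would correspond to the product IDs of the ranked popularity table and `personalized` to a wrapper around the item-based recommender.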

### Collaborative filtering : Item-based Recommendation

In [9]:
# data load
features = ['customer_unique_id', 'product_id', 'review_score']
df = pd.read_csv("./datasets/olist_master.csv")
df = df[features]
df.head()

Unnamed: 0,customer_unique_id,product_id,review_score
0,5ee8fe956c2631afc0a1dcc1920d0e3d,4244733e06e7ecb4970a6e2683c13e61,5
1,8b3f917f4307d3e5cf34c0b43d6e6f50,4244733e06e7ecb4970a6e2683c13e61,5
2,69ba88e17ea574da9c9b8c8834a583d1,4244733e06e7ecb4970a6e2683c13e61,4
3,cbe063493a222cb17024ff0285b4ecb6,4244733e06e7ecb4970a6e2683c13e61,5
4,ffab5330bd7b40979ab6726b2e02292e,4244733e06e7ecb4970a6e2683c13e61,5


In [10]:
# preprocessing
# Attach each product's mean review score to every row, then drop exact duplicate rows
df['mean_score'] = df.groupby('product_id')['review_score'].transform('mean')
df.drop_duplicates(inplace=True)
df.head()

Unnamed: 0,customer_unique_id,product_id,review_score,mean_score
0,5ee8fe956c2631afc0a1dcc1920d0e3d,4244733e06e7ecb4970a6e2683c13e61,5,3.818182
1,8b3f917f4307d3e5cf34c0b43d6e6f50,4244733e06e7ecb4970a6e2683c13e61,5,3.818182
2,69ba88e17ea574da9c9b8c8834a583d1,4244733e06e7ecb4970a6e2683c13e61,4,3.818182
3,cbe063493a222cb17024ff0285b4ecb6,4244733e06e7ecb4970a6e2683c13e61,5,3.818182
4,ffab5330bd7b40979ab6726b2e02292e,4244733e06e7ecb4970a6e2683c13e61,5,3.818182


- Let's pivot the table above into a product-by-customer matrix so that the similarity between products can be calculated

In [11]:
df_pivot = pd.pivot_table(df, index='product_id', columns='customer_unique_id', values='review_score')
df_pivot.fillna(0, inplace=True)
df_pivot.head()

customer_unique_id,0000b849f77a49e4a4ce2b2a4ca5be3f,0000f46a3911fa3c0805444483337064,0004bd2a26a76fe21f786e4fbd80607f,00050ab1314c0e55a6ca13cf7181fecf,0005ef4cd20d2893f0d9fbd94d3c0d97,000949456b182f53c18b68d6babc79c1,000a5ad9c4601d2bbdd9ed765d5213b3,000c8bdb58a29e7115cfc257230fb21b,000de6019bb59f34c099a907c151d855,000ec5bff359e1c0ad76a81a45cb598f,...,ffee94d548cef05b146d825a7648dab4,ffef0ffa736c7b3d9af741611089729b,fff1bdd5c5e37ca79dd74deeb91aa5b6,fff699c184bcc967d62fa2c6171765f7,fff96bc586f78b1f070da28c4977e810,fffa431dd3fcdefea4b1777d114144f2,fffb09418989a0dbff854a28163e47c6,fffbf87b7a1a6fa8b03f081c5f51a201,fffcc512b7dfecaffd80f13614af1d16,fffea47cd6d3cc0a88bd621562a9d061
00066f42aeeb9f3007548bb9d3f33c38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00088930e925c41fd95ebfe695fd2655,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0009406fd7479715e4bef61dd91f2462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000b8f95fcb9e0096488278317764d19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000d9be29b5207b54e86aa1b1ac54872,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- Now, let's build the cosine similarity matrix from the pivoted table above
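
For reference, the cosine similarity between two products $i$ and $j$ is computed from their zero-filled rating vectors. The notation is introduced here for illustration: $r_{u,i}$ is the review score customer $u$ gave product $i$, or 0 if the customer has not reviewed it.

$$
\mathrm{sim}(i, j) = \frac{\sum_{u} r_{u,i}\, r_{u,j}}{\sqrt{\sum_{u} r_{u,i}^{2}}\;\sqrt{\sum_{u} r_{u,j}^{2}}}
$$

Because review scores are non-negative and missing entries are 0, only customers who reviewed both products contribute to the numerator, and the result lies between 0 and 1.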

In [12]:
df_cosim = pd.DataFrame(cosine_similarity(df_pivot, df_pivot), index=df_pivot.index, columns=df_pivot.index)
df_cosim.head()

product_id,00066f42aeeb9f3007548bb9d3f33c38,00088930e925c41fd95ebfe695fd2655,0009406fd7479715e4bef61dd91f2462,000b8f95fcb9e0096488278317764d19,000d9be29b5207b54e86aa1b1ac54872,0011c512eb256aa0dbbb544d8dffcf6e,00126f27c813603687e6ce486d909d01,001795ec6f1b187d37335e1c4704762e,001b237c0e9bb435f2e54071129237e9,001b72dfd63e9833e8c02742adf472e3,...,ffedbd68fa6f44e788ff6c2db8094715,ffef256879dbadcab7e77950f4f4a195,fff0a542c3c62682f23305214eaeaa24,fff1059cd247279f3726b7696c66e44e,fff515ea94dbf35d54d256b3e39f0fea,fff6177642830a9a94a0f2cba5e476d1,fff81cc3158d2725c0655ab9ba0f712c,fff9553ac224cec9d15d49f5a263411f,fffdb2d0ec8d6a61f0a0a0db3f25b441,fffe9eeff12fcbd74a2f2b007dde0c58
00066f42aeeb9f3007548bb9d3f33c38,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00088930e925c41fd95ebfe695fd2655,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0009406fd7479715e4bef61dd91f2462,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000b8f95fcb9e0096488278317764d19,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000d9be29b5207b54e86aa1b1ac54872,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- Let's define a function that returns the 5 most similar products and their similarity scores

In [13]:
# Determine similar items for a given item
def get_similar_items(item_id, similarity_matrix, top_n=5):
    # Retrieve the row of the cosine similarity matrix corresponding to the item
    item_row = similarity_matrix.loc[item_id]
    
    # Sort the items based on similarity scores in descending order
    similar_items = item_row.sort_values(ascending=False)
    
    # Exclude the given item itself from the list of similar items
    similar_items = similar_items.drop(item_id)
    
    # Select the top-N similar items
    similar_items = similar_items.head(top_n)
    
    return similar_items

# Select the top 5 similar items for a certain product
item_id = '00066f42aeeb9f3007548bb9d3f33c38'
top_similar_items = get_similar_items(item_id, df_cosim, top_n=5)
top_similar_items

product_id
ee59699c00f92af8c00c6c9e470a6ff0    1.000000
c3e43e07f541c49ce73e9778dfecc6ca    1.000000
ffd4bf4306745865e5692f69bd237893    0.408248
ab71466511b5e49de73454353d49a2ba    0.000000
ab6edd00a29d1e29948284a492497854    0.000000
Name: 00066f42aeeb9f3007548bb9d3f33c38, dtype: float64

- Now, let's build the recommender, which returns the top_n products most similar to the ones a customer has already reviewed

In [14]:
# Generate recommendations for users
def generate_recommendations(user_id, user_item_matrix, similarity_matrix, top_n=5):
    # Retrieve the items interacted/rated positively by the user
    user_items = user_item_matrix[user_id]
    positive_items = user_items[user_items > 0].index
    
    # Collect the top-N similar items for each positive item
    # (Series.append was removed in pandas 2.0, so the pieces are combined with pd.concat)
    similar = [get_similar_items(item_id, similarity_matrix, top_n) for item_id in positive_items]
    if not similar:
        return pd.Series(dtype=float)
    recommendations = pd.concat(similar)
    
    # Remove duplicates and items already reviewed by the user
    recommendations = recommendations.groupby(recommendations.index).max()
    recommendations = recommendations[~recommendations.index.isin(positive_items)]
    recommendations = recommendations.sort_values(ascending=False).head(top_n)
    return recommendations

# Provide recommendations
user_id = '0000b849f77a49e4a4ce2b2a4ca5be3f'
recommendations = generate_recommendations(user_id, df_pivot, df_cosim, top_n=5)
recommendations

8aea198b4506fe321e998a858a2e39c1    0.458831
b044bda7bc05cc5cf14d4969fda159cb    0.458831
2b6c38dc34f7d28ba93d469431bb6c88    0.249377
67e3279668ba06a3dd977cb31399a311    0.249377
6c40c98da8f33daca57f0e9b78eae09d    0.249377
dtype: float64

- Let's wrap the recommender in a class and provide the top 3 products to a sample customer

In [39]:
class item_based_cf:
    def __init__(self, df: pd.DataFrame, user_id, top_n):
        self.df = df
        self.user_id = user_id
        self.top_n = top_n

    def pivot_table(self):
        # Pivot to a product x customer matrix so that cosine similarity between products can be calculated
        df_pivot = pd.pivot_table(self.df, index='product_id', columns='customer_unique_id', values='review_score')
        # Replace missing ratings with 0 (the customer has not reviewed the product)
        df_pivot.fillna(0, inplace=True)
        return df_pivot
    
    # Calculate cosine similarity matrix between each product
    def get_similarity_mtx(self, pivoted: pd.DataFrame):
        df_cosim = pd.DataFrame(cosine_similarity(pivoted, pivoted), index=pivoted.index, columns=pivoted.index)
        return df_cosim

    # Determine similar items for a given item based on a cosine similarity score calculated above
    def get_similar_items(self, sim_mtx: pd.DataFrame, item_id, top_n):
        # Retrieve the row of the cosine similarity matrix corresponding to the item
        item_row = sim_mtx.loc[item_id]
        # Sort the items based on similarity scores in descending order
        similar_items = item_row.sort_values(ascending=False)
        # Exclude the given item itself from the list of similar items
        similar_items = similar_items.drop(item_id)
        # Select the top_n similar items in descending similarity score
        similar_items = similar_items.head(top_n)
        return similar_items

    def generate_recommendations(self, pivoted: pd.DataFrame, sim_mtx: pd.DataFrame, top_n):
        # Retrieve the items rated positively by the user stored on the instance
        user_items = pivoted[self.user_id]
        positive_items = user_items[user_items > 0].index

        # Collect the top_n similar items for each positive item
        # (Series.append was removed in pandas 2.0, so the pieces are combined with pd.concat)
        similar = [self.get_similar_items(sim_mtx, item_id, top_n) for item_id in positive_items]
        if not similar:
            return pd.Series(dtype=float)
        recommendations = pd.concat(similar)
        
        # Remove duplicates and items already reviewed by the customer
        recommendations = recommendations.groupby(recommendations.index).max()
        recommendations = recommendations[~recommendations.index.isin(positive_items)]
        recommendations = recommendations.sort_values(ascending=False).head(top_n)
        return recommendations

In [40]:
# Provide top3 recommendations to 0000b849f77a49e4a4ce2b2a4ca5be3f
item_cf = item_based_cf(df, '0000b849f77a49e4a4ce2b2a4ca5be3f', 3)
pivoted = item_cf.pivot_table()
sim_mtx = item_cf.get_similarity_mtx(pivoted)
item_cf.generate_recommendations(pivoted, sim_mtx, 3)

8aea198b4506fe321e998a858a2e39c1    0.458831
b044bda7bc05cc5cf14d4969fda159cb    0.458831
67e3279668ba06a3dd977cb31399a311    0.249377
dtype: float64

- The recommender system returns the 3 products above. However, all of their similarity scores are below 0.5, which is likely a consequence of data sparsity: most product pairs share few or no reviewers, so there is little evidence from which to estimate similarity. In this case, item-based collaborative filtering might not be a good choice, and user-based collaborative filtering or other recommender systems are worth considering.
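
Sparsity can distort scores in the other direction as well. The toy sketch below, which uses hypothetical ratings and a hand-written `cosine` helper rather than the notebook's data, shows that two products reviewed by a single shared customer get a similarity of exactly 1.0 on zero-filled vectors, even when that customer rated one 5 and the other 1. This may explain similarity scores of 1.0 between rarely reviewed products.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length rating vectors (0 = not reviewed)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Define the similarity as 0 when either product has no reviews at all
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Columns are three hypothetical customers; 0 means "no review"
item_a = [5, 0, 0]  # rated 5 by customer 1
item_b = [1, 0, 0]  # rated 1 by the same customer
item_c = [0, 4, 0]  # reviewed only by a different customer

print(cosine(item_a, item_b))  # 1.0: same direction, despite opposite opinions
print(cosine(item_a, item_c))  # 0.0: no shared reviewers
```

With so few overlapping reviewers, the score mostly reflects who reviewed a product rather than how they rated it, so high and low scores alike should be read with caution.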

### Downsides of the system and the future direction

While the item-based collaborative filtering recommender system based on cosine similarity has its advantages, it also has certain limitations. Here are some critical downsides of this approach:

1. Cold-start problem: The system struggles with the cold-start problem, meaning it cannot provide accurate recommendations for new items or new users with limited historical data. It heavily relies on user-item interactions to generate recommendations.

2. Sparsity of data: If the user-item interaction data is sparse, where users have rated or interacted with only a small fraction of the available items, the system may face challenges in generating meaningful recommendations. The similarity matrix may not accurately capture item similarities due to limited data.

3. Limited personalization: The item-based collaborative filtering approach personalizes only through the items a user has already rated. It relies on item-item similarities rather than on a model of each user's tastes and preferences, so users with similar rating histories receive similar recommendations. This may result in less accurate or less diverse recommendations.

4. Scalability: As the number of items and users increases, the computation and storage requirements of the cosine similarity matrix can become substantial. Calculating and storing the full similarity matrix for large datasets can be computationally expensive and memory-intensive.

5. Inability to capture evolving user preferences: The recommender system assumes that user preferences remain static over time. However, user preferences may change over time due to various factors, such as evolving tastes, trends, or life events. The system does not adapt to these changes, leading to potentially outdated recommendations.

To mitigate these downsides, it is important to consider alternative approaches, such as user-based collaborative filtering, hybrid models, or more advanced techniques, that address the specific limitations and requirements of your application.
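
As one illustration of the scalability point, the sketch below computes the neighbours of a single product directly from (customer, product, score) triples, without materializing the full product-by-product matrix. It is a simplified, pure-Python sketch under stated assumptions: the function `top_similar_items` and the toy data are hypothetical, scores are assumed to be positive, and unlike the notebook's `get_similar_items` it returns only products that share at least one reviewer with the target.

```python
import math
from collections import defaultdict

def top_similar_items(ratings, item_id, top_n=5):
    """Return [(item, cosine)] for the items most similar to item_id.

    ratings: iterable of (user, item, score) triples with positive scores.
    """
    by_item = defaultdict(dict)  # item -> {user: score}
    by_user = defaultdict(dict)  # user -> {item: score}
    for user, item, score in ratings:
        by_item[item][user] = score
        by_user[user][item] = score

    target = by_item[item_id]
    # Accumulate dot products only for items that share a reviewer with the target
    dots = defaultdict(float)
    for user, score in target.items():
        for other, other_score in by_user[user].items():
            if other != item_id:
                dots[other] += score * other_score

    def norm(vec):
        return math.sqrt(sum(s * s for s in vec.values()))

    sims = {other: dot / (norm(target) * norm(by_item[other]))
            for other, dot in dots.items()}
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy data (hypothetical): B shares customer u1 with A and customer u2 with C
toy = [("u1", "A", 5), ("u1", "B", 4), ("u2", "B", 3), ("u2", "C", 5), ("u3", "D", 2)]
print(top_similar_items(toy, "B"))
```

The work per query grows with the number of co-reviewed products rather than with the square of the catalogue size, which is the property that makes sparse representations attractive for data like this.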