# Recommender System with LightFM

Our LightFM recommender system should predict products for a customer based on similarities with other users, features of the items they purchased, and information about the customers themselves. Since our dataset consists mostly of one-time buyers, the primary challenge is to get the recommender to suggest relevant items from a small pool of data. The hope is that key information about the product and user will improve the results. Our dataset also includes a small number of customers who purchased but never received an item due to delivery issues. We will treat them as cold starts and use them to see how our system performs without any user history.

LightFM requires, at minimum, user-to-product interactions, which is enough for a basic collaborative recommendation system. We will use this type of model as our baseline for comparison. We will then create sparse matrices for user information and product information and test the model while introducing these one at a time. This will gradually increase the complexity of the model, but we hope it will improve results in turn.

A challenge in including user information is that the only information we are given about users is their location. To add inferred information, we will also cluster users based on the categories they purchase from, using unsupervised learning techniques. Ideally, this gives each user a sub-group that our algorithm can use as part of its user information. The known risk with this approach is that the model overfits to these clusters.

Our last comparison will be with PySpark's ALS recommender algorithm. This will be implemented in another notebook and tested with the same users to see how the recommendations differ.

## Create master df

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [8]:
seller_dt_cols = ['shipping_limit_date', 'review_creation_date', 'review_answer_timestamp']
customer_dt_cols = ['shipping_limit_date', 'review_creation_date', 'review_answer_timestamp', 'order_purchase_timestamp',\
                'order_delivered_customer_date', 'order_estimated_delivery_date']

customers = pd.read_csv('modified_data/customers.csv', 
                        index_col=0, delimiter = ',', 
                        parse_dates = customer_dt_cols, 
                        infer_datetime_format = True, 
                        low_memory=False)
sellers = pd.read_csv('modified_data/sellers.csv', 
                      index_col=0, delimiter = ',', 
                      parse_dates = seller_dt_cols, 
                      infer_datetime_format = True, 
                      low_memory=False)

In [9]:
sellers_selected = sellers[['order_id', 'seller_zip_code_prefix', 'seller_city', 'seller_state']]

In [10]:
# Columns that are not needed for the recommender
drop_cols = ['order_delivered_customer_date', 'order_estimated_delivery_date', 'shipping_limit_date',
             'freight_value', 'review_comment_title', 'review_creation_date',
             'review_answer_timestamp', 'geolocation_lat', 'geolocation_lng',
             'product_name_lenght', 'product_description_lenght', 'product_length_cm',
             'product_height_cm', 'product_width_cm', 'review_id']
master_df = customers.merge(sellers_selected, on='order_id').drop(drop_cols, axis=1)

In [11]:
master_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 152810 entries, 0 to 152809
Data columns (total 19 columns):
order_id                    152810 non-null object
customer_id                 152810 non-null object
order_purchase_timestamp    152810 non-null datetime64[ns]
order_item_id               152810 non-null int64
product_id                  152810 non-null object
seller_id                   152810 non-null object
price                       152810 non-null float64
review_score                152810 non-null int64
review_comment_message      152810 non-null object
customer_unique_id          152810 non-null object
customer_zip_code_prefix    152810 non-null int64
customer_city               152810 non-null object
customer_state              152810 non-null object
product_category_name       152810 non-null object
product_photos_qty          152810 non-null float64
product_weight_g            152810 non-null float64
seller_zip_code_prefix      152810 non-null int64
seller_city    

In [12]:
master_df.to_csv(r'/Users/mattmerrill/Springboard/Capstone2/olist_datascience/exploration/modified_data/master_df.csv', index=True)

### Data preprocessing:

**Metadata:**

We begin by extracting information about each product from the master dataframe. We also need aggregate information about each product, such as its average review score, the number of review comments, and the number of reviews.

We will then group these numeric product descriptors into bins with pandas `cut` and `qcut` so that the model sees generalized categories.
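
To illustrate the difference between the two binning helpers, the sketch below uses a small made-up series (the values are hypothetical and not taken from the dataset). `pd.cut` splits the value range into equal-width intervals, while `pd.qcut` splits at quantiles so that each bin holds roughly the same number of rows.

```python
import pandas as pd

# Hypothetical, skewed prices used only for illustration
prices = pd.Series([5, 10, 12, 15, 20, 30, 50, 1000])

# Equal-width bins: the range [5, 1000] is divided into 4 intervals of equal length
equal_width = pd.cut(prices, 4)

# Quantile bins: boundaries are chosen so each bin has about the same number of rows
equal_count = pd.qcut(prices, 4)

width_counts = equal_width.value_counts().sort_index().tolist()
count_counts = equal_count.value_counts().sort_index().tolist()
print(width_counts)
print(count_counts)
```

With skewed data such as prices, equal-width bins tend to put almost every row in the first interval, which is one reason to prefer `qcut` for that column.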

### Create user-item and item feature dataframes

In [13]:
user_item = master_df[['customer_id', 'customer_unique_id', 'product_id', 'review_score', 'order_purchase_timestamp']]
user_features = master_df[['customer_unique_id', 'customer_city', 'customer_state']]
item_features = master_df[['customer_unique_id','product_id', 'customer_id', 'product_category_name', 'seller_id', 'review_score', 
                          'review_comment_message', 'seller_city', 'seller_state', 'price', 'order_id', 'order_item_id']]

In [14]:
item_features = item_features.drop_duplicates(['customer_id','order_id', 'order_item_id']).reset_index(drop=True)

### Create aggregated columns to improve product features

In [15]:
# to avoid counting the placeholder 'no comment given' as a comment, convert it back to NaN
item_features['review_comment_message'] = item_features.review_comment_message\
                                                               .replace('no comment given', np.nan)

# create average review score column for each product
item_features['avg_product_score'] = item_features['review_score']\
                                                                 .groupby(item_features['product_id'])\
                                                                 .transform('mean')
# create average review score column for each seller
item_features['avg_seller_score'] = item_features['review_score']\
                                                                 .groupby(item_features['seller_id'])\
                                                                 .transform('mean')

# create number of comments column for each product (count ignores NaN)
item_features['num_comments'] = item_features['review_comment_message']\
                                                                       .groupby(item_features['product_id'])\
                                                                       .transform('count')

# create number of reviews column for each product
item_features['num_reviews'] = item_features['review_score']\
                                                            .groupby(item_features['product_id'])\
                                                            .transform('count')

In [16]:
# create average price column for each product
item_features['avg_price'] = item_features['price']\
                                                   .groupby(item_features['product_id'])\
                                                   .transform('mean')

In [17]:
# Create product_count column (number of items bought by each customer), used as the rating
item_features['product_count'] = item_features['product_id']\
                                                            .groupby(item_features['customer_unique_id'])\
                                                            .transform('count')
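
Note that grouping by `customer_unique_id` alone counts every item a customer bought, across all products. If the intent is a per-product interaction strength, grouping by the (customer, product) pair gives a different number. The sketch below contrasts the two on a hypothetical toy frame. Which definition is appropriate is a modelling choice, and the rest of this notebook uses the per-customer count.

```python
import pandas as pd

# Hypothetical purchases: customer "a" buys p1 twice and p2 once
toy = pd.DataFrame({
    "customer_unique_id": ["a", "a", "a", "b"],
    "product_id": ["p1", "p1", "p2", "p1"],
})

# Items per customer (the definition used in the cell above)
toy["items_per_customer"] = (
    toy.groupby("customer_unique_id")["product_id"].transform("count")
)

# Purchases per (customer, product) pair, an alternative definition
toy["purchases_per_pair"] = (
    toy.groupby(["customer_unique_id", "product_id"])["product_id"].transform("count")
)
print(toy)
```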

### Transform columns to improve distance-based algorithms used in clustering, to be performed in a separate notebook

In [18]:
item_features['log_avg_price'] = np.log1p(item_features['avg_price'])

In [19]:
item_features['log_num_reviews'] = np.log1p(item_features['num_reviews'])
item_features['log_num_comments'] = np.log1p(item_features['num_comments'])

In [20]:
# use cut to split the log-transformed counts into equal-width bins
item_features['num_reviews_binned'] = pd.cut(item_features['log_num_reviews'], 7, duplicates = 'drop')
item_features['num_comments_binned'] = pd.cut(item_features['log_num_comments'], 6, duplicates = 'drop')

# bin the average review scores into four equal-width intervals
item_features['avg_product_reviews_binned'] = pd.cut(item_features['avg_product_score'], 4, duplicates = 'drop')
item_features['avg_seller_reviews_binned'] = pd.cut(item_features['avg_seller_score'], 4, duplicates = 'drop')

item_features['avg_price_binned'] = pd.qcut(item_features['avg_price'], 4, duplicates = 'drop')

In [21]:
item_features = item_features.sort_values('customer_unique_id').reset_index(drop=True)

In [22]:
cols = ['num_reviews_binned', 'num_comments_binned','avg_product_reviews_binned', 
        'avg_seller_reviews_binned', 'avg_price_binned']
for col in cols:
    item_features[col] = item_features[col].astype(str)

In [23]:
# Save item features to csv to be pulled and used in notebook for clustering
item_features.to_csv(r'/Users/mattmerrill/Springboard/Capstone2/olist_datascience/exploration/modified_data/item_features.csv', index=True)

In [24]:
item_features.head()

Unnamed: 0,customer_unique_id,product_id,customer_id,product_category_name,seller_id,review_score,review_comment_message,seller_city,seller_state,price,...,avg_price,product_count,log_avg_price,log_num_reviews,log_num_comments,num_reviews_binned,num_comments_binned,avg_product_reviews_binned,avg_seller_reviews_binned,avg_price_binned
0,0000366f3b9a7992bf8c76cfdf3221e2,372645c7439f9661fbbacfd129aa92ec,fadbb3709178fc513abc1b2670aa1ad2,bed_bath_table,da8622b14eb17ae2831f4ac5b9dab84a,5,"Adorei a cortina, ficou linda na minha sala, e...",piracicaba,SP,129.9,...,109.382759,1,4.703954,3.401197,2.397895,"(3.075, 3.87]","(1.824, 2.736]","(4.0, 5.0]","(4.0, 5.0]","(74.9, 135.0]"
1,0000b849f77a49e4a4ce2b2a4ca5be3f,5099f7000472b634fea8304448d20825,4cb282e167ae9234755102258dd52ee8,health_beauty,138dbe45fc62f1e244378131a6801526,4,,sao paulo,SP,18.9,...,19.042857,1,2.997873,2.079442,0.693147,"(1.487, 2.281]","(-0.00547, 0.912]","(4.0, 5.0]","(3.0, 4.0]","(0.849, 39.9]"
2,0000f46a3911fa3c0805444483337064,64b488de448a5324c4134ea39c28a34b,9b3932a6253894a02c1df9d19004239f,stationery,3d871de0142ce09b7081e2b9d1733cb1,3,,campo limpo paulista,SP,69.0,...,73.0,1,4.304065,1.791759,1.098612,"(1.487, 2.281]","(0.912, 1.824]","(4.0, 5.0]","(4.0, 5.0]","(39.9, 74.9]"
3,0000f6ccb0745a6a4b88665a16c9f078,2345a354a6f2033609bbf62bf5be9ef6,914991f0c02ef0843c0e7010c819d642,telephony,ef506c96320abeedfb894c34db06f478,4,Bom vendedor,sao paulo,SP,25.99,...,27.99,1,3.366951,1.386294,1.098612,"(0.688, 1.487]","(0.912, 1.824]","(4.0, 5.0]","(3.0, 4.0]","(0.849, 39.9]"
4,0004aac84e0df4da2b147fca70cf8255,c72e18b3fe2739b8d24ebf3102450f37,47227568b10f5f58a524a75507e6992c,telephony,70a12e78e608ac31179aea7f8422044b,5,,jacarei,SP,180.0,...,180.0,1,5.198497,0.693147,0.0,"(0.688, 1.487]","(-0.00547, 0.912]","(4.0, 5.0]","(3.0, 4.0]","(135.0, 6735.0]"


### Functions for creating lists that will later be transformed into sparse matrices

In [25]:
def get_user_list(df, user_column):
    """
    Return a sorted array of the unique users in df[user_column].
    """
    return np.sort(df[user_column].unique())

def get_item_list(df, item_name_column):
    """
    Return an array of the unique items in df[item_name_column].
    """
    item_list = df[item_name_column].unique()
    return item_list

def get_item_feature_list(df, product_category_col, seller_col, seller_rating_col, product_rating_col, 
                          num_comments, num_reviews, avg_price, seller_city, seller_state):
    """
    Return an array of the unique feature values found across the given item feature columns.
    """
    feature_cols = [product_category_col, seller_col, seller_rating_col, product_rating_col,
                    num_comments, num_reviews, avg_price, seller_city, seller_state]
    
    return pd.concat([df[col] for col in feature_cols], ignore_index = True).unique()

def get_user_feature_list(df, cluster_id, customer_state):
    """
    Return an array of the unique feature values found across the user feature columns.
    """
    return pd.concat([df[cluster_id], df[customer_state]], ignore_index = True).unique()
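
Because these helpers pool the raw values of several columns into a single feature vocabulary, identical labels from different columns become one feature. For example, the bin label `(4.0, 5.0]` can come from either `avg_product_reviews_binned` or `avg_seller_reviews_binned`. One possible way to keep them apart, shown here as a sketch and not as what this notebook does, is to prefix each value with its column name. The helper name `prefixed_feature_list` is hypothetical.

```python
import pandas as pd

def prefixed_feature_list(df, feature_columns):
    """Return unique 'column:value' labels so equal values in different columns stay distinct."""
    labelled = [col + ":" + df[col].astype(str) for col in feature_columns]
    return pd.concat(labelled, ignore_index=True).unique()

# Hypothetical toy frame with the same bin label in two columns
toy = pd.DataFrame({
    "avg_product_reviews_binned": ["(4.0, 5.0]", "(3.0, 4.0]"],
    "avg_seller_reviews_binned": ["(4.0, 5.0]", "(4.0, 5.0]"],
})

# Pooling raw values merges the shared label; prefixing keeps it separate
plain = pd.concat([toy[c] for c in toy.columns], ignore_index=True).unique()
prefixed = prefixed_feature_list(toy, list(toy.columns))
print(len(plain), len(prefixed))
```

If this approach were adopted, the same prefixing would also have to be applied when building the product-to-feature table so that the labels match the mapping keys.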


### Add clustered groupings to user_features

Clustering with K-means, a distance-based algorithm, yielded 14 groups. Agglomerative clustering with Ward linkage was also tried but was unsuccessful: it clustered too aggressively, leaving few groups, with one group dominating most users. K-means produced more groups, and they were more evenly separated. The commonalities within groups were mainly in the types of products purchased. It grouped together categories such as "consoles_games" and "toys", and "housewares" and "furniture_decor". This may give our recommender an idea of which products are related: if users fall into the same cluster, a recommendation from within that cluster might be appropriate. As mentioned previously, however, this may lead to some overfitting.

In [26]:
# read in results from clustering
item_features_clustered = pd.read_csv('/Users/mattmerrill/Springboard/Capstone2/olist_datascience/exploration/modified_data/item_features_clustered.csv', index_col = 0)

In [28]:
# merge with user_features
user_features = pd.merge(user_features, item_features_clustered[['customer_unique_id', 'cluster_id']], 
        on = 'customer_unique_id', 
        how='left').drop('customer_city', axis=1)

In [29]:
# drop rows without a cluster assignment (NaN after the left merge)
user_features = user_features.dropna()

In [30]:
# create the user, item, feature lists
users = get_user_list(user_item, "customer_unique_id")
items = get_item_list(item_features, "product_id")
item_features_list = get_item_feature_list(item_features, "product_category_name", "seller_id", "avg_seller_reviews_binned",
                                           "avg_product_reviews_binned", "num_comments_binned", "num_reviews_binned", 
                                           "avg_price_binned", "seller_city", "seller_state")
user_features_list = get_user_feature_list(user_features, "cluster_id", "customer_state")

In [31]:
item_features_clustered.head()

Unnamed: 0,customer_unique_id,customer_id,product_id,order_id,product_category_name,order_item_id,seller_id,review_score,review_comment_message,seller_city,seller_state,price,product_category_count,cluster_id,cluster_id_2
0,7c396fd4830fd04220f754e42b4e5bff,9ef432eb6251297304e76186b10a928d,87285b34884572647811a353c7ac498a,e481f51cbdc54678b7cc49136f2d6af7,housewares,1,3504c0cb71d7fa48d967e0e4c94d59d9,4,"Não testei o produto ainda, mas ele veio corre...",maua,SP,29.99,1,12,0
1,7c396fd4830fd04220f754e42b4e5bff,31f31efcb333fcbad2b1371c8cf0fa84,9abb00920aae319ef9eba674b7d2e6ff,69923a4e07ce446644394df37a710286,baby,1,1771297ac436903d1dd6b0e9279aa505,5,O produto está ok e foi entregue bem antes do ...,guarulhos,SP,35.39,1,12,0
2,e781fdcc107d13d865fc7698711cc572,53904ddbea91e1e92b2b3f1d09a7af86,87285b34884572647811a353c7ac498a,bfc39df4f36c3693ff3b63fcbea9e90a,housewares,1,3504c0cb71d7fa48d967e0e4c94d59d9,3,no comment given,maua,SP,29.99,1,3,0
3,3a51803cc0d012c3b5dc8b7528cb05f7,a20e8105f23924cd00833fd87daa0831,87285b34884572647811a353c7ac498a,128e10d95713541c87cd1a2e48201934,housewares,1,3504c0cb71d7fa48d967e0e4c94d59d9,4,Deveriam embalar melhor o produto. A caixa vei...,maua,SP,29.99,1,3,0
4,ef0996a1a279c26e7ecbd737be23d235,26c7ac168e1433912a51b924fbd34d34,87285b34884572647811a353c7ac498a,0e7e841ddf8f8f2de2bad69267ecfbcf,housewares,1,3504c0cb71d7fa48d967e0e4c94d59d9,5,"Só achei ela pequena pra seis xícaras ,mais é ...",maua,SP,29.99,1,3,0


### Id-mapping function for converting string id into numeric for products, customers and features

In [32]:
def id_mappings(user_list, item_list, item_feature_list, user_feature_list):
    """
    
    Create id mappings to convert user_id, item_id, and feature_id
    
    """
    user_to_index_mapping = {}
    index_to_user_mapping = {}
    for user_index, user_id in enumerate(user_list):
        user_to_index_mapping[user_id] = user_index
        index_to_user_mapping[user_index] = user_id
        
    item_to_index_mapping = {}
    index_to_item_mapping = {}
    for item_index, item_id in enumerate(item_list):
        item_to_index_mapping[item_id] = item_index
        index_to_item_mapping[item_index] = item_id
        
    item_feature_to_index_mapping = {}
    index_to_item_feature_mapping = {}
    for item_feature_index, item_feature_id in enumerate(item_feature_list):
        item_feature_to_index_mapping[item_feature_id] = item_feature_index
        index_to_item_feature_mapping[item_feature_index] = item_feature_id

    user_feature_to_index_mapping = {}
    index_to_user_feature_mapping = {}
    for user_feature_index, user_feature_id in enumerate(user_feature_list):
        user_feature_to_index_mapping[user_feature_id] = user_feature_index
        index_to_user_feature_mapping[user_feature_index] = user_feature_id
        
        
    return user_to_index_mapping, index_to_user_mapping, \
           item_to_index_mapping, index_to_item_mapping, \
           item_feature_to_index_mapping, index_to_item_feature_mapping, \
           user_feature_to_index_mapping, index_to_user_feature_mapping
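
The four pairs of dictionaries follow the same pattern, so the same result can be sketched with a small helper. The name `build_index_maps` is hypothetical and is not used elsewhere in this notebook.

```python
def build_index_maps(values):
    """Return (value -> index, index -> value) dictionaries for an iterable of ids."""
    values = list(values)
    to_index = {value: index for index, value in enumerate(values)}
    to_value = dict(enumerate(values))
    return to_index, to_value

# Hypothetical ids for illustration
user_to_index, index_to_user = build_index_maps(["u_b", "u_a", "u_c"])
print(user_to_index["u_a"], index_to_user[0])
```

Calling the helper once per list (users, items, item features, user features) would give the same eight dictionaries that `id_mappings` returns.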

__Items dictionary__

This df will be used to associate each product with its corresponding features. It will be used to improve the output of the recommender system, which will produce a recommended product_id (identifying a specific product) along with its corresponding product category.

In [33]:
# generate mappings, since LightFM only accepts integer indices
user_to_index_mapping, index_to_user_mapping, \
           item_to_index_mapping, index_to_item_mapping, \
           item_feature_to_index_mapping, index_to_item_feature_mapping, \
           user_feature_to_index_mapping, index_to_user_feature_mapping = id_mappings(users, items, item_features_list, 
                                                                            user_features_list)

In [34]:
# df of product features
product_to_feature = item_features[['product_id', 'product_category_name', 'seller_id', 'num_reviews_binned', 
                                    'num_comments_binned', 'avg_product_reviews_binned', 'avg_seller_reviews_binned',
                                    'avg_price_binned', 'seller_city', 'seller_state']]
product_to_feature = product_to_feature.drop_duplicates('product_id').reset_index(drop=True)

In [35]:
# user rating df based on number of purchases
user_to_product_rating = item_features[['customer_unique_id', 'product_id', 'product_count']]
user_to_product_rating = user_to_product_rating.sort_values(by='customer_unique_id').reset_index(drop=True)

__Train, test split__

In [36]:
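# fix the seed so the split is reproducible
np.random.seed(31)

# Bernoulli(0.8) mask: True rows go to the training set, the rest to the test set
rows = np.random.binomial(1, .8, size=len(user_to_product_rating)).astype('bool')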

user_to_product_rating_train = user_to_product_rating[rows].reset_index(drop=True)
user_to_product_rating_test = user_to_product_rating[~rows].reset_index(drop=True)
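
A random row-level split can place all of a customer's purchases in the test set, which matters here because most customers bought only once. The sketch below applies the same masking idea to a hypothetical toy frame using NumPy's `Generator` API and counts how many test customers have no training rows. The frame and the seed are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical interactions: customers c0..c4, two of them with two purchases
toy = pd.DataFrame({
    "customer_unique_id": ["c0", "c1", "c1", "c2", "c3", "c3", "c4"],
    "product_id": ["p0", "p1", "p2", "p3", "p4", "p5", "p6"],
})

rng = np.random.default_rng(31)
# True with probability 0.8 -> row goes to the training set
in_train = rng.random(len(toy)) < 0.8

train = toy[in_train].reset_index(drop=True)
test = toy[~in_train].reset_index(drop=True)

# Test customers that never appear in the training rows
unseen = set(test["customer_unique_id"]) - set(train["customer_unique_id"])
print(len(train), len(test), len(unseen))
```

Customers in `unseen` would be scored without any interaction history, so only their features can inform the model's predictions for them.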

__Product-feature interactions df:__

We now want to create a dataframe that describes the relationship between products and their features.

In [37]:
# Transpose product_to_feature to extract features for each product_id
df = product_to_feature.set_index('product_id').T.reset_index(drop=True)

In [38]:
# Need to create list of product_ids, each repeated once per feature (9 features) that the resulting df will hold
cols = list(df.columns)
res =  [ele for ele in cols for i in range(len(product_to_feature.columns) - 1)]

In [39]:
# Create empty dataframe with index as product_id
features = pd.DataFrame(index = res)
features.index.name = 'product_id'

# Reset index so 'product_id' becomes column
features = features.reset_index()

# create empty column to fill
features['feature'] = ""

In [40]:
# Create list of feature items to replace empty column
feature_items = []
for col in cols:
    for i in range(len(product_to_feature.columns) - 1):
        feature_items.append(df[col][i])

In [41]:
# set column to the list of features created for each product
features['feature'] = feature_items

# change name back to product_to_feature
product_to_feature = features

# add column with all ones for sparse matrix
product_to_feature['feature_count'] = 1

In [42]:
product_to_feature.head(10)

Unnamed: 0,product_id,feature,feature_count
0,372645c7439f9661fbbacfd129aa92ec,bed_bath_table,1
1,372645c7439f9661fbbacfd129aa92ec,da8622b14eb17ae2831f4ac5b9dab84a,1
2,372645c7439f9661fbbacfd129aa92ec,"(3.075, 3.87]",1
3,372645c7439f9661fbbacfd129aa92ec,"(1.824, 2.736]",1
4,372645c7439f9661fbbacfd129aa92ec,"(4.0, 5.0]",1
5,372645c7439f9661fbbacfd129aa92ec,"(4.0, 5.0]",1
6,372645c7439f9661fbbacfd129aa92ec,"(74.9, 135.0]",1
7,372645c7439f9661fbbacfd129aa92ec,piracicaba,1
8,372645c7439f9661fbbacfd129aa92ec,SP,1
9,5099f7000472b634fea8304448d20825,health_beauty,1
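
The transpose-and-loop construction above produces a long table with one row per (product, feature) pair. As a sketch of an alternative, `DataFrame.melt` yields the same kind of long format in one call. The toy frame below is hypothetical, and the stable sort is there only to keep each product's rows together in the original column order.

```python
import pandas as pd

# Hypothetical wide table: one row per product, one column per feature
wide = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "product_category_name": ["toys", "baby"],
    "seller_state": ["SP", "RJ"],
})

# One row per (product, feature value); the source column name lands in "variable"
long = wide.melt(id_vars="product_id", value_name="feature")

# Group rows by product while preserving the feature column order
long = long.sort_values("product_id", kind="stable").reset_index(drop=True)
long = long.drop(columns="variable")

# Column of ones used as the value in the sparse matrix
long["feature_count"] = 1
print(long)
```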


__User-feature interaction df__

In [43]:
user_to_feature = user_features.drop_duplicates('customer_unique_id').reset_index(drop=True)

In [44]:
# Transpose user_to_feature to extract features for each customer
df = user_to_feature.set_index('customer_unique_id').T.reset_index(drop=True)

In [45]:
df.head()

customer_unique_id,7c396fd4830fd04220f754e42b4e5bff,e781fdcc107d13d865fc7698711cc572,3a51803cc0d012c3b5dc8b7528cb05f7,ef0996a1a279c26e7ecbd737be23d235,8a4002923e801e3120a11070fd31c9e2,0cad1c6c08ef74b3ae7818514c158258,eb5c27c09badfe9c053416ca5c3c7c35,6eff1cd9cc7f4dde6e1cbb679e219a50,cc3f5d537772957f90b7f15f1fb70316,87c9eb971f4cfcbefa08cb27b21bc9a7,...,6d1b4e0269ddf214a835aef2cfc23158,17cb19c8526583b203e7a3a6c828d3aa,7ce47ba6982b3ce9bee72ab948481f9e,f9ccc64481a04c1a9886994895b97f03,f5138d94d7e1085c5219408f32b62529,8b8c8f067a3faaf116211277147a88de,1ef6a1d949703afd7a74347aed3b0503,ebc6df639d125e446f07c0e9b1e47b90,ed81a42bec90c87578108d2e4c742d20,a49e8e11e850592fe685ae3c64b40eca
0,SP,SC,SP,SP,SP,SP,SP,RJ,SP,SP,...,SC,PE,RS,PB,SP,MG,TO,MG,BA,PR
1,12,3,3,3,2,2,2,2,2,2,...,14,4,7,7,3,9,14,13,7,0


In [46]:
# Need to create list of customer ids, each repeated once per feature (2 features) that the resulting df will hold
cols = list(df.columns)
res =  [ele for ele in cols for i in range(len(user_to_feature.columns)-1)]

In [47]:
# Create empty dataframe with index as customer_unique_id
user_feature = pd.DataFrame(index = res)
user_feature.index.name = 'customer_unique_id'

# Reset index so 'customer_unique_id' becomes column
user_feature = user_feature.reset_index()

# create empty column to fill
user_feature['feature'] = ""

In [48]:
# Create list of feature items to replace empty column
feature_items = []
for col in cols:
    for i in range(len(user_to_feature.columns)-1):
        feature_items.append(df[col][i])

In [49]:
user_feature['feature'] = feature_items
user_to_feature = user_feature
user_to_feature['feature_count'] = 1

__Final interaction dfs__

In [50]:
# 9 total features for each product
product_to_feature.head(9)

Unnamed: 0,product_id,feature,feature_count
0,372645c7439f9661fbbacfd129aa92ec,bed_bath_table,1
1,372645c7439f9661fbbacfd129aa92ec,da8622b14eb17ae2831f4ac5b9dab84a,1
2,372645c7439f9661fbbacfd129aa92ec,"(3.075, 3.87]",1
3,372645c7439f9661fbbacfd129aa92ec,"(1.824, 2.736]",1
4,372645c7439f9661fbbacfd129aa92ec,"(4.0, 5.0]",1
5,372645c7439f9661fbbacfd129aa92ec,"(4.0, 5.0]",1
6,372645c7439f9661fbbacfd129aa92ec,"(74.9, 135.0]",1
7,372645c7439f9661fbbacfd129aa92ec,piracicaba,1
8,372645c7439f9661fbbacfd129aa92ec,SP,1


In [51]:
# two features for each user
user_to_feature.head(2)

Unnamed: 0,customer_unique_id,feature,feature_count
0,7c396fd4830fd04220f754e42b4e5bff,SP,1
1,7c396fd4830fd04220f754e42b4e5bff,12,1


In [52]:
# count for each user for the products they purchased
user_to_product_rating_train.head()

Unnamed: 0,customer_unique_id,product_id,product_count
0,0000366f3b9a7992bf8c76cfdf3221e2,372645c7439f9661fbbacfd129aa92ec,1
1,0000f46a3911fa3c0805444483337064,64b488de448a5324c4134ea39c28a34b,1
2,0004aac84e0df4da2b147fca70cf8255,c72e18b3fe2739b8d24ebf3102450f37,1
3,0004bd2a26a76fe21f786e4fbd80607f,25cf184645f3fae66083bf33581b8f13,1
4,00053a61a98854899e70ed204dd4bafe,62984ea1bba7fcea1f5b57084d3bf885,2


In [53]:
user_to_product_rating_test.head()

Unnamed: 0,customer_unique_id,product_id,product_count
0,0000b849f77a49e4a4ce2b2a4ca5be3f,5099f7000472b634fea8304448d20825,1
1,0000f6ccb0745a6a4b88665a16c9f078,2345a354a6f2033609bbf62bf5be9ef6,1
2,00050ab1314c0e55a6ca13cf7181fecf,8cefe1c6f2304e7e6825150218ffc58c,1
3,000de6019bb59f34c099a907c151d855,af0a917aec9cea3b353ece61a8825326,2
4,000ed48ceeb6f4bf8ad021a10a3c7b43,d2f5484cbffe4ca766301b21ab9246dd,1


__Create sparse matrix for products and metadata__

In [54]:
from scipy import sparse

def get_interaction_matrix(df, df_column_as_row, df_column_as_col, df_column_as_value, row_indexing_map, 
                          col_indexing_map):
    
    row = df[df_column_as_row].apply(lambda x: row_indexing_map[x]).values
    col = df[df_column_as_col].apply(lambda x: col_indexing_map[x]).values
    value = df[df_column_as_value].values
    
    return sparse.coo_matrix((value, (row, col)), shape = (len(row_indexing_map), len(col_indexing_map)))
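
One property of `coo_matrix` worth keeping in mind: when the same (row, column) pair appears more than once, the values are added together on conversion to CSR or to a dense array. That can happen here if a customer has several rows for the same product, or if a product carries the same feature label twice. The indices below are hypothetical.

```python
import numpy as np
from scipy import sparse

# Hypothetical entries: the pair (0, 1) appears twice
rows = np.array([0, 0, 1])
cols = np.array([1, 1, 2])
vals = np.array([2, 2, 1])

m = sparse.coo_matrix((vals, (rows, cols)), shape=(2, 3))

# Duplicates are summed when the matrix is converted
dense = m.toarray()
print(dense)

# If a 0/1 indicator is wanted instead, one option is to threshold after conversion
binary = (m.tocsr() > 0).astype(np.int32)
print(binary.toarray())
```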

In [55]:
# generate user_item_interaction_matrix for train data
user_to_product_interaction_train = get_interaction_matrix(user_to_product_rating_train, "customer_unique_id", 
                                                    "product_id", "product_count", user_to_index_mapping, item_to_index_mapping)

# generate user_item_interaction_matrix for test data
user_to_product_interaction_test = get_interaction_matrix(user_to_product_rating_test, "customer_unique_id", 
                                                    "product_id", "product_count", user_to_index_mapping, item_to_index_mapping)

# generate item_to_feature interaction
product_to_feature_interaction = get_interaction_matrix(product_to_feature, "product_id", "feature",  "feature_count", 
                                                        item_to_index_mapping, item_feature_to_index_mapping)

user_to_feature_interaction = get_interaction_matrix(user_to_feature, "customer_unique_id", "feature", "feature_count", 
                                                     user_to_index_mapping, user_feature_to_index_mapping)

## Model with only collaborative interactions

In [None]:
import time
from lightfm import LightFM
from lightfm.evaluation import auc_score
# initialising model with warp loss function
model_without_features = LightFM(loss = "warp")

# fitting into user to product interaction matrix only / pure collaborative filtering factor
start = time.time()
#===================

model_without_features.fit(user_to_product_interaction_train,
          user_features=None, 
          item_features=None, 
          sample_weight=None, 
          epochs=1, 
          num_threads=4,
          verbose=False)

#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

# auc metric score (ranging from 0 to 1)

start = time.time()
#===================

auc_without_features = auc_score(model = model_without_features, 
                        test_interactions = user_to_product_interaction_test,
                        num_threads = 4, check_intersections = False)
#===================
end = time.time()

print("time taken = {0:.{1}f} seconds".format(end - start, 2))
print("average AUC without adding item-feature interaction = {0:.{1}f}".format(auc_without_features.mean(), 2))
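
For reference, the AUC reported by `auc_score` is computed per user: it is the probability that a randomly chosen positive item is scored higher than a randomly chosen negative item. The NumPy sketch below computes that quantity for a single hypothetical user. It illustrates the metric and is not LightFM's internal implementation; the function name `user_auc` is made up.

```python
import numpy as np

def user_auc(scores, positives):
    """Fraction of (positive, negative) item pairs ranked correctly; ties count as half."""
    scores = np.asarray(scores, dtype=float)
    positives = np.asarray(positives, dtype=bool)
    pos = scores[positives]
    neg = scores[~positives]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Hypothetical scores for five items; items 0 and 3 are the held-out purchases
scores = [0.9, 0.2, 0.4, 0.3, 0.1]
positives = [True, False, False, True, False]
print(user_auc(scores, positives))
```

A value of 0.5 corresponds to random ranking and 1.0 to every held-out purchase being ranked above every other item.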

## Model with collaborative filtering and product content

In [57]:
# initialising model with warp loss function
model_with_features = LightFM(loss = "warp")

# fitting the model with hybrid collaborative filtering + content based (product + features)
start = time.time()
#===================


model_with_features.fit(user_to_product_interaction_train,
          user_features=None, 
          item_features=product_to_feature_interaction, 
          sample_weight=None, 
          epochs=1, 
          num_threads=4,
          verbose=False)

#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

start = time.time()
#===================
auc_with_features = auc_score(model = model_with_features, 
                        test_interactions = user_to_product_interaction_test,
                        train_interactions = user_to_product_interaction_train,
                        user_features=None,
                        item_features = product_to_feature_interaction,
                        num_threads = 4, check_intersections=False)
#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

print("average AUC with adding item-feature interaction = {0:.{1}f}".format(auc_with_features.mean(), 2))


time taken = 0.24 seconds
time taken = 59.40 seconds
average AUC with adding item-feature interaction = 0.79


> Including product features gives much better results. This makes sense because many of the customers in this dataset have bought from Olist only once, so there is little interaction history to learn from.

## Model with hybrid collaborative filtering and content based including user features

In [58]:
import time
from lightfm import LightFM
from lightfm.evaluation import auc_score
# initialising model with warp loss function
model_with_features_and_users = LightFM(loss = "warp")

# fitting the model with hybrid collaborative filtering + content based (product features + user features)
start = time.time()
#===================


model_with_features_and_users.fit(user_to_product_interaction_train,
          user_features=user_to_feature_interaction, 
          item_features=product_to_feature_interaction, 
          sample_weight=None, 
          epochs=1, 
          num_threads=4,
          verbose=False)

#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

start = time.time()
#===================
auc_with_features_and_users = auc_score(model = model_with_features_and_users, 
                        test_interactions = user_to_product_interaction_test,
                        train_interactions = user_to_product_interaction_train,
                        user_features=user_to_feature_interaction,
                        item_features = product_to_feature_interaction,
                        num_threads = 4, check_intersections=False)
#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

print("average AUC with adding item-feature and user-feature interaction = {0:.{1}f}".format(auc_with_features_and_users.mean(), 2))


time taken = 0.24 seconds
time taken = 58.56 seconds
average AUC with adding item-feature and user-feature interaction = 0.97


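> When the clustering groups are added, the AUC increases to 0.97, which suggests possible overfitting, as predicted. The model may be relying too heavily on the clustering groups for its recommendations. One likely contributor is that the clusters were derived from the categories each user purchased, including purchases that ended up in the test set, so the cluster feature may leak information about the held-out interactions.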

### Combine train and test for testing on individual user recommendations

In [59]:
def combined_train_test(train, test):
    """
    
    Combine the train and test interaction matrices into one.
    Every non-zero entry of either matrix is kept; where both 
    matrices have an entry for the same (user, item) pair, the 
    larger value is used.
    """
    # initialising train dict
    train_dict = {}
    for train_row, train_col, train_data in zip(train.row, train.col, train.data):
        train_dict[(train_row, train_col)] = train_data
        
    # merging in the test set, keeping the larger value for shared entries
    
    for test_row, test_col, test_data in zip(test.row, test.col, test.data):
        train_dict[(test_row, test_col)] = max(test_data, train_dict.get((test_row, test_col), 0))
        
    
    # unpacking the dict into row, column and data lists
    row_element = []
    col_element = []
    data_element = []
    for row, col in train_dict:
        row_element.append(row)
        col_element.append(col)
        data_element.append(train_dict[(row, col)])
        
    # converting to np array
    
    row_element = np.array(row_element)
    col_element = np.array(col_element)
    data_element = np.array(data_element)
    
    return sparse.coo_matrix((data_element, (row_element, col_element)), shape = (train.shape[0], train.shape[1]))

In [60]:
# Create one user-product interaction matrix
user_to_product_interaction = combined_train_test(user_to_product_interaction_train, 
                                                 user_to_product_interaction_test)
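
The same merge can be sketched with SciPy's element-wise sparse maximum instead of a Python dictionary. This is an alternative under stated assumptions, not the notebook's method: `combine_with_maximum` is a hypothetical name, interaction values are assumed non-negative, and duplicate entries in a COO input are summed during conversion to CSR.

```python
import numpy as np
from scipy import sparse


def combine_with_maximum(train, test):
    """Element-wise maximum of two sparse matrices of the same shape.

    Assumes non-negative interaction values; duplicate (row, col) entries in a
    COO input are summed when converting to CSR.
    """
    if train.shape != test.shape:
        raise ValueError("train and test must have the same shape")
    return train.tocsr().maximum(test.tocsr()).tocoo()


# Toy 2x3 matrices standing in for the train/test interaction matrices.
toy_train = sparse.coo_matrix(np.array([[1, 0, 2], [0, 0, 0]], dtype=np.float32))
toy_test = sparse.coo_matrix(np.array([[0, 3, 1], [0, 4, 0]], dtype=np.float32))
combined = combine_with_maximum(toy_train, toy_test)
print(combined.toarray())
```

On large matrices this avoids the per-entry Python loop, at the cost of the duplicate-handling difference noted above.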

### Fit initial model with collaborative and product feature interaction matrices

In [61]:
# training the initial model on the combined dataset

initial_model1 = LightFM(loss = "warp")

# fitting to combined dataset with collaborative filtering + item features (no user features)

start = time.time()
#===================

initial_model1.fit(user_to_product_interaction,
          user_features=None, 
          item_features=product_to_feature_interaction, 
          sample_weight=None, 
          epochs=1, 
          num_threads=4,
          verbose=False)

#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

time taken = 0.30 seconds


### Fit model with user features included

In [62]:
# training the second initial model on the combined dataset

initial_model2 = LightFM(loss = "warp")

# fitting to combined dataset with collaborative filtering + user and item features

start = time.time()
#===================

initial_model2.fit(user_to_product_interaction,
          user_features=user_to_feature_interaction, 
          item_features=product_to_feature_interaction, 
          sample_weight=None, 
          epochs=1, 
          num_threads=4,
          verbose=False)

#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

time taken = 0.21 seconds


__Define recommendation function__

This class should provide recommended items given a customer's ID. For reproducibility it takes the model, the customer-product interaction matrix, the item array and the customer-to-index mapping. It prints up to three known items liked by the customer, followed by the top three recommended items based on their history.

In [63]:
class recommendation_sampling:
    
    def __init__(self, model, items = items, user_to_product_interaction_matrix = user_to_product_interaction, 
                user2index_map = user_to_index_mapping):
        
        self.user_to_product_interaction_matrix = user_to_product_interaction_matrix
        self.model = model
        self.items = items
        self.user2index_map = user2index_map
    
    def recommendation_for_user(self, user, user_features=None):
        
        # getting the userindex
        
        userindex = self.user2index_map.get(user, None)
        
        if userindex is None:
            return None
        
        users = [userindex]
        
        # products already bought
        
        known_positives = self.items[self.user_to_product_interaction_matrix.tocsr()[userindex].indices]
        
        # scores from model prediction
        scores = self.model.predict(user_ids = users, item_ids = np.arange(self.user_to_product_interaction_matrix.shape[1]),
                                    user_features=user_features,
                                    item_features = product_to_feature_interaction)

        # top items
        
        top_items = self.items[np.argsort(-scores)]
        
        # printing out the result
        print("User %s" % user)
        print("     Known positives:")
        
        for x in known_positives[:3]:
            print("                  %s" % x)
            print("                  {}".format(product_to_feature['feature'][product_to_feature['product_id'] == x].iloc[0]))
            
            
        print("     Recommended:")
        
        for x in top_items[:3]:
            print("                  %s" % x)
            print("                  {}".format(product_to_feature['feature'][product_to_feature['product_id'] == x].iloc[0]))
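
The class above ranks every item, including ones the customer already bought. If repeat purchases are not wanted in the list, one option is to mask them before sorting. The sketch below shows that step in isolation with NumPy. `top_n_unseen` is a hypothetical helper, and whether to exclude past purchases is a design choice this notebook does not make.

```python
import numpy as np


def top_n_unseen(scores, known_item_indices, n=3):
    """Indices of the n highest-scoring items the user has not bought yet."""
    scores = np.asarray(scores, dtype=float).copy()
    known = np.asarray(known_item_indices, dtype=int)
    # Push already-purchased items to the bottom of the ranking.
    scores[known] = -np.inf
    ranked = np.argsort(-scores, kind="stable")
    # Never return more items than there are unseen ones.
    n_unseen = scores.size - np.unique(known).size
    return ranked[: min(n, n_unseen)]


# Toy example: item 0 scores highest but was already purchased.
toy_scores = [0.9, 0.2, 0.8, 0.5, 0.1]
picked = top_n_unseen(toy_scores, known_item_indices=[0], n=3)
print(picked)
```

The returned indices can be used to index the `items` array in the same way `np.argsort(-scores)` is used in the class.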

### Test before hyperparameter tuning with both initial models

__Test for customer_id = 'c8ed31310fc440a3f8031b177f9842c3'__

In [64]:
# Initial model without user features
recom = recommendation_sampling(model = initial_model1)
print(recom.recommendation_for_user('c8ed31310fc440a3f8031b177f9842c3'))

User c8ed31310fc440a3f8031b177f9842c3
     Known positives:
                  4a5c3967bfd3629fe07ef4d0cc8c3818
                  construction_tools_construction
                  21b524c4c060169fa75ccf08c7da4627
                  construction_tools_construction
                  5dae498eff2d80057f56122235a36aff
                  construction_tools_construction
     Recommended:
                  d1c427060a0f73f6b889a5c7c61f2ac4
                  computers_accessories
                  36f60d45225e60c7da4558b070ce4b60
                  computers_accessories
                  35afc973633aaeb6b877ff57b2793310
                  home_confort
None


In [65]:
# initial model with user features (which includes clustered groups)
recom = recommendation_sampling(model = initial_model2)
print(recom.recommendation_for_user('c8ed31310fc440a3f8031b177f9842c3', user_features=user_to_feature_interaction))

User c8ed31310fc440a3f8031b177f9842c3
     Known positives:
                  4a5c3967bfd3629fe07ef4d0cc8c3818
                  construction_tools_construction
                  21b524c4c060169fa75ccf08c7da4627
                  construction_tools_construction
                  5dae498eff2d80057f56122235a36aff
                  construction_tools_construction
     Recommended:
                  0aabfb375647d9738ad0f7b4ea3653b1
                  consoles_games
                  89321f94e35fc6d7903d36f74e351d40
                  food
                  4fe644d766c7566dbc46fb851363cb3b
                  art
None


> Interestingly, the initial model without user features (and thus without clustering groups) suggests products that were found to be purchased often, while the second model suggests more novel items. From a single customer it is difficult to say which is better.
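
One way to check the "purchased often" reading is to count buyers per item directly from an interaction matrix and compare the most purchased items with the recommended IDs. The sketch below uses a toy matrix and made-up item IDs. `most_purchased_items` is a hypothetical helper, and it counts stored non-zero entries, so duplicate COO entries would be counted twice.

```python
import numpy as np
from scipy import sparse


def most_purchased_items(interactions, item_ids, n=3):
    """Return (item_id, buyer_count) for the n most purchased items."""
    coo = interactions.tocoo()
    # Column indices of all non-zero interactions, one per (user, item) entry.
    bought = coo.col[coo.data != 0]
    counts = np.bincount(bought, minlength=coo.shape[1])
    order = np.argsort(-counts, kind="stable")[:n]
    return [(str(item_ids[i]), int(counts[i])) for i in order]


# Toy matrix: 4 users x 3 items.
toy = sparse.coo_matrix(
    np.array([[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1]], dtype=np.float32)
)
toy_ids = np.array(["item_a", "item_b", "item_c"])
top = most_purchased_items(toy, toy_ids, n=2)
print(top)
```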

## Implement hyperparameter tuning

In [66]:
import itertools

import numpy as np

from lightfm import LightFM
from lightfm.evaluation import auc_score


def sample_hyperparameters():
    """
    Yield possible hyperparameter choices.
    """

    while True:
        yield {
            "no_components": np.random.randint(16, 64),
            "learning_schedule": np.random.choice(["adagrad", "adadelta"]),
            "loss": np.random.choice(["bpr", "warp", "warp-kos"]),
            "learning_rate": np.random.exponential(0.05),
            "item_alpha": np.random.exponential(1e-8),
            "user_alpha": np.random.exponential(1e-8),
            "max_sampled": np.random.randint(5, 15),
            "num_epochs": np.random.randint(5, 50),
        }


def random_search(test, train, user_features, item_features, num_samples=10):
    """
    Sample random hyperparameters, fit a LightFM model, and evaluate it
    on the test set.

    Parameters
    ----------

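    test: np.float32 coo_matrix of shape [n_users, n_items]
        Test data.
    train: np.float32 coo_matrix of shape [n_users, n_items]
        Training data.
    user_features: np.float32 csr_matrix of shape [n_users, n_user_features], or None
        User features passed to both fit and auc_score.
    item_features: np.float32 csr_matrix of shape [n_items, n_item_features], or None
        Item features passed to both fit and auc_score.
    num_samples: int, optional
        Number of hyperparameter choices to evaluate.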


    Returns
    -------

    generator of (auc_score, hyperparameter dict, fitted model)

    """

    for hyperparams in itertools.islice(sample_hyperparameters(), num_samples):
        num_epochs = hyperparams.pop("num_epochs")

        model = LightFM(**hyperparams)
        model.fit(train, user_features=user_features, item_features=item_features, epochs=num_epochs, num_threads=1)

        score = auc_score(model = model, 
                        test_interactions = test,
                        train_interactions = train,
                        user_features = user_features,
                        item_features = item_features,
                        num_threads = 4, check_intersections=False).mean()

        hyperparams["num_epochs"] = num_epochs

        yield (score, hyperparams, model)
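
Because the sampler above draws from NumPy's global random state, the search returns different candidates on every run. A seeded variant makes the candidate list repeatable. The sketch below mirrors the same ranges with `np.random.default_rng`. The name `sample_hyperparameters_seeded` and the `seed` argument are assumptions for illustration, and model fitting itself may still vary between runs.

```python
import itertools

import numpy as np


def sample_hyperparameters_seeded(seed):
    """Yield hyperparameter dicts drawn from a seeded NumPy Generator."""
    rng = np.random.default_rng(seed)
    while True:
        yield {
            "no_components": int(rng.integers(16, 64)),
            "learning_schedule": str(rng.choice(["adagrad", "adadelta"])),
            "loss": str(rng.choice(["bpr", "warp", "warp-kos"])),
            "learning_rate": float(rng.exponential(0.05)),
            "item_alpha": float(rng.exponential(1e-8)),
            "user_alpha": float(rng.exponential(1e-8)),
            "max_sampled": int(rng.integers(5, 15)),
            "num_epochs": int(rng.integers(5, 50)),
        }


# Two generators with the same seed produce the same candidate list.
run_a = list(itertools.islice(sample_hyperparameters_seeded(0), 3))
run_b = list(itertools.islice(sample_hyperparameters_seeded(0), 3))
print(run_a == run_b)
```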

__Without user features__

In [67]:
if __name__ == "__main__":
    
    (score, hyperparams1, model) = max(random_search(user_to_product_interaction_test,
                                                    user_to_product_interaction_train,
                                                    user_features=None,
                                                    item_features=product_to_feature_interaction), 
                                      key=lambda x: x[0])
    
    
    print("Best score {} at {}".format(score, hyperparams1))

Best score 0.7649544477462769 at {'no_components': 20, 'learning_schedule': 'adagrad', 'loss': 'warp', 'learning_rate': 0.06527879177463329, 'item_alpha': 1.549602673630783e-08, 'user_alpha': 6.547639310592444e-10, 'max_sampled': 11, 'num_epochs': 16}


In [68]:
# retraining the final model with combined dataset
model_hyperparams1 = {key: value for key, value in hyperparams1.items() if key != 'num_epochs'}
# without clustered label
final_model1 = LightFM(**(model_hyperparams1), random_state=3)

# fitting to combined dataset with collaborative filtering + item features (no user features)

start = time.time()
#===================

final_model1.fit(user_to_product_interaction,
          user_features=None, 
          item_features=product_to_feature_interaction, 
          sample_weight=None, 
          epochs=5,  # fixed at 5 here; the search above selected num_epochs=16
          num_threads=4,
          verbose=False)

#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

time taken = 1.44 seconds


__With user features__

In [69]:
if __name__ == "__main__":
    
    (score, hyperparams2, model) = max(random_search(user_to_product_interaction_test,
                                                    user_to_product_interaction_train,
                                                    user_to_feature_interaction,
                                                    product_to_feature_interaction), 
                                      key=lambda x: x[0])
    
    
    print("Best score {} at {}".format(score, hyperparams2))

Best score 0.9701918959617615 at {'no_components': 57, 'learning_schedule': 'adagrad', 'loss': 'warp-kos', 'learning_rate': 0.0204193383282095, 'item_alpha': 7.179002006738488e-09, 'user_alpha': 2.0156094663261573e-09, 'max_sampled': 12, 'num_epochs': 26}


In [70]:
# retraining the final model with combined dataset
# with clustered label
model_hyperparams2 = {key: value for key, value in hyperparams2.items() if key != 'num_epochs'}
final_model2 = LightFM(**model_hyperparams2, random_state=3)

# fitting to combined dataset with collaborative filtering + user and item features

start = time.time()
#===================

final_model2.fit(user_to_product_interaction,
          user_features=user_to_feature_interaction, 
          item_features=product_to_feature_interaction, 
          sample_weight=None, 
          epochs=hyperparams2['num_epochs'], 
          num_threads=4,
          verbose=False)

#===================
end = time.time()
print("time taken = {0:.{1}f} seconds".format(end - start, 2))

time taken = 14.20 seconds
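
Both retraining cells need the constructor arguments separated from the epoch count. A small helper can do this without modifying the tuned dictionary, so the cell can be re-run safely. `split_hyperparameters` is a hypothetical name. The example values are copied from the best result printed above, with the long floats rounded for readability.

```python
def split_hyperparameters(hyperparams, fit_keys=("num_epochs",)):
    """Return (model_kwargs, fit_kwargs) without modifying the input dict."""
    model_kwargs = {k: v for k, v in hyperparams.items() if k not in fit_keys}
    fit_kwargs = {k: v for k, v in hyperparams.items() if k in fit_keys}
    return model_kwargs, fit_kwargs


# Values taken from the tuned result with user features (floats rounded).
best = {
    "no_components": 57,
    "learning_schedule": "adagrad",
    "loss": "warp-kos",
    "learning_rate": 0.0204,
    "item_alpha": 7.18e-09,
    "user_alpha": 2.02e-09,
    "max_sampled": 12,
    "num_epochs": 26,
}
model_kwargs, fit_kwargs = split_hyperparameters(best)
print(sorted(model_kwargs))
print(fit_kwargs)
```

The first dictionary would go to the `LightFM(...)` constructor and `fit_kwargs["num_epochs"]` to the `epochs` argument of `fit`.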


### Testing recommendations for select customers

To compare model1 and model2, we will select customers with many purchased items and customers with very few to see how each model performs. Ideally we will not see anything out of the ordinary, although an unexpected item may sometimes be the better recommendation. Our final choice will be based on consistency, so several tests are run to get a good overall picture. 

__Test for customer_id = 'c8ed31310fc440a3f8031b177f9842c3'__

In [71]:
recom = recommendation_sampling(model = final_model1)
#print(recom.recommendation_for_user(2))
print(recom.recommendation_for_user('c8ed31310fc440a3f8031b177f9842c3'))

User c8ed31310fc440a3f8031b177f9842c3
     Known positives:
                  4a5c3967bfd3629fe07ef4d0cc8c3818
                  construction_tools_construction
                  21b524c4c060169fa75ccf08c7da4627
                  construction_tools_construction
                  5dae498eff2d80057f56122235a36aff
                  construction_tools_construction
     Recommended:
                  18fa9cc25ea8b54f32d029f261673c0f
                  construction_tools_construction
                  97d94ffa4936cbc2555e83aefc1f427b
                  construction_tools_construction
                  cd46a885543f0e169a49f1eb25c04e43
                  computers_accessories
None


In [72]:
recom = recommendation_sampling(model = final_model2)
#print(recom.recommendation_for_user(2))
print(recom.recommendation_for_user('c8ed31310fc440a3f8031b177f9842c3', user_features=user_to_feature_interaction))

User c8ed31310fc440a3f8031b177f9842c3
     Known positives:
                  4a5c3967bfd3629fe07ef4d0cc8c3818
                  construction_tools_construction
                  21b524c4c060169fa75ccf08c7da4627
                  construction_tools_construction
                  5dae498eff2d80057f56122235a36aff
                  construction_tools_construction
     Recommended:
                  0aabfb375647d9738ad0f7b4ea3653b1
                  consoles_games
                  d017a2151d543a9885604dc62a3d9dcc
                  fashion_bags_accessories
                  4fe644d766c7566dbc46fb851363cb3b
                  art
None


> The first run through both tuned models yields some interesting results. The model with user features recommends consoles_games, fashion_bags_accessories and art for this user, who bought only construction tools, while the model without user features stays mostly within construction tools. The former seems unexpected, but the model may be detecting other customers with similar purchases who also bought these items.

__Test for customer_id = '698e1cf81d01a3d389d96145f7fa6df8'__

In [75]:
#698e1cf81d01a3d389d96145f7fa6df8
recom = recommendation_sampling(model = final_model1)
#print(recom.recommendation_for_user(2))
print(recom.recommendation_for_user('698e1cf81d01a3d389d96145f7fa6df8'))

User 698e1cf81d01a3d389d96145f7fa6df8
     Known positives:
                  9571759451b1d780ee7c15012ea109d4
                  auto
     Recommended:
                  0152f69b6cf919bcdaf117aa8c43e5a2
                  bed_bath_table
                  b59fb744c6f3cd1dc23b10f760848d98
                  sports_leisure
                  dca18a6e2fb6da75092ed874094ed7b6
                  telephony
None


In [76]:
recom = recommendation_sampling(model = final_model2)
#print(recom.recommendation_for_user(2))
print(recom.recommendation_for_user('698e1cf81d01a3d389d96145f7fa6df8', user_features=user_to_feature_interaction))

User 698e1cf81d01a3d389d96145f7fa6df8
     Known positives:
                  9571759451b1d780ee7c15012ea109d4
                  auto
     Recommended:
                  9ddc4249779322828f89d2a9c04f7ee1
                  auto
                  629e019a6f298a83aeecc7877964f935
                  auto
                  a659cb33082b851fb87a33af8f0fff29
                  auto
None


> Based on the results above, model1 tends to suggest popular items, while model2 suggests items very similar to what has already been purchased. One positive for model2 is that it suggests different products in the same category, which could help customers discover other items similar to ones they already bought. 

### More tests

__Test for customer_id = '02ce431b0797023384d47edad5dd7284'__

In [102]:
recom = recommendation_sampling(model = final_model1)
#print(recom.recommendation_for_user(2))
recom.recommendation_for_user('02ce431b0797023384d47edad5dd7284')

User 02ce431b0797023384d47edad5dd7284
     Known positives:
                  893b9464c6ab7f700148fe9db838b6b4
                  stationery
     Recommended:
                  43423cdffde7fda63d0414ed38c11a73
                  watches_gifts
                  2ffdf10e724b958c0f7ea69e97d32f64
                  watches_gifts
                  e0d64dcfaa3b6db5c54ca298ae101d05
                  watches_gifts


In [103]:
recom = recommendation_sampling(model = final_model2)
#print(recom.recommendation_for_user(2))
recom.recommendation_for_user('02ce431b0797023384d47edad5dd7284', user_features=user_to_feature_interaction)

User 02ce431b0797023384d47edad5dd7284
     Known positives:
                  893b9464c6ab7f700148fe9db838b6b4
                  stationery
     Recommended:
                  fb55982be901439613a95940feefd9ee
                  stationery
                  e03102efbc2229024c89be731f0aedcb
                  stationery
                  c706d50b57c9e83293c2586d01f32445
                  stationery


__Test for customer_id = '059907a512be8ac75b50cbcf6f837d18'__

In [105]:
# 059907a512be8ac75b50cbcf6f837d18
recom = recommendation_sampling(model = final_model1)
#print(recom.recommendation_for_user(2))
recom.recommendation_for_user('059907a512be8ac75b50cbcf6f837d18')

User 059907a512be8ac75b50cbcf6f837d18
     Known positives:
                  2b2428ab65b564c08fd9b40e187df246
                  drinks
     Recommended:
                  4520766ec412348b8d4caa5e8a18c464
                  auto
                  b4f9530c931398e215242293c2c8ba4c
                  fixed_telephony
                  bd6e8cf9fe4122c385da2bcb9f979d5d
                  telephony


In [106]:
recom = recommendation_sampling(model = final_model2)
#print(recom.recommendation_for_user(2))
recom.recommendation_for_user('059907a512be8ac75b50cbcf6f837d18', user_features=user_to_feature_interaction)

User 059907a512be8ac75b50cbcf6f837d18
     Known positives:
                  2b2428ab65b564c08fd9b40e187df246
                  drinks
     Recommended:
                  0aabfb375647d9738ad0f7b4ea3653b1
                  consoles_games
                  d017a2151d543a9885604dc62a3d9dcc
                  fashion_bags_accessories
                  4fe644d766c7566dbc46fb851363cb3b
                  art


__Test for customer_id = '0848ef3901afaa99199cbe1bbbd71e1a'__

In [108]:
recom = recommendation_sampling(model = final_model1)
#print(recom.recommendation_for_user(2))
recom.recommendation_for_user('0848ef3901afaa99199cbe1bbbd71e1a')

User 0848ef3901afaa99199cbe1bbbd71e1a
     Known positives:
                  cba54528d3adcef3c12d8e8b9a48bc17
                  bed_bath_table
     Recommended:
                  a0fe1efb855f3e786f0650268cd77f44
                  agro_industry_and_commerce
                  1a300f482e35d7eac74b229be067aefd
                  computers_accessories
                  466d263ce8b7bd275003ee2104428127
                  telephony


In [109]:
recom = recommendation_sampling(model = final_model2)
#print(recom.recommendation_for_user(2))
recom.recommendation_for_user('0848ef3901afaa99199cbe1bbbd71e1a', user_features=user_to_feature_interaction)

User 0848ef3901afaa99199cbe1bbbd71e1a
     Known positives:
                  cba54528d3adcef3c12d8e8b9a48bc17
                  bed_bath_table
     Recommended:
                  99a4788cb24856965c36a24e339b6058
                  bed_bath_table
                  06edb72f1e0c64b14c5b79353f7abea3
                  bed_bath_table
                  ec2d43cc59763ec91694573b31f1c29a
                  bed_bath_table


__Test for customer_id = '1bbd9a63db6ed2bb12623de2b1ceb7c3'__

In [120]:
recom = recommendation_sampling(model = final_model1)
#print(recom.recommendation_for_user(2))
recom.recommendation_for_user('1bbd9a63db6ed2bb12623de2b1ceb7c3')

User 1bbd9a63db6ed2bb12623de2b1ceb7c3
     Known positives:
                  a54244559e62c8ef2939e52189d65d4c
                  food_drink
     Recommended:
                  e070a61270050c4b1b704300f331cca6
                  pet_shop
                  d0b61bfb1de832b15ba9d266ca96e5b0
                  pet_shop
                  a5ae8400fc9fcacd4af585a57bbf264a
                  pet_shop


In [121]:
recom = recommendation_sampling(model = final_model2)
#print(recom.recommendation_for_user(2))
recom.recommendation_for_user('1bbd9a63db6ed2bb12623de2b1ceb7c3', user_features=user_to_feature_interaction)

User 1bbd9a63db6ed2bb12623de2b1ceb7c3
     Known positives:
                  a54244559e62c8ef2939e52189d65d4c
                  food_drink
     Recommended:
                  d017a2151d543a9885604dc62a3d9dcc
                  fashion_bags_accessories
                  c6dd917a0be2a704582055949915ab32
                  cool_stuff
                  601a360bd2a916ecef0e88de72a6531a
                  cool_stuff


__Test for customer_id = '37b5ae93bb8e35c742a4bcd701395daa'__

In [123]:
recom = recommendation_sampling(model = final_model1)
#print(recom.recommendation_for_user(2))
recom.recommendation_for_user('37b5ae93bb8e35c742a4bcd701395daa')

User 37b5ae93bb8e35c742a4bcd701395daa
     Known positives:
                  c6dd917a0be2a704582055949915ab32
                  cool_stuff
     Recommended:
                  389d119b48cf3043d311335e499d9c6b
                  garden_tools
                  368c6c730842d78016ad823897a372db
                  garden_tools
                  422879e10f46682990de24d770e7f83d
                  garden_tools


In [124]:
recom = recommendation_sampling(model = final_model2)
#print(recom.recommendation_for_user(2))
recom.recommendation_for_user('37b5ae93bb8e35c742a4bcd701395daa', user_features=user_to_feature_interaction)

User 37b5ae93bb8e35c742a4bcd701395daa
     Known positives:
                  c6dd917a0be2a704582055949915ab32
                  cool_stuff
     Recommended:
                  f35927953ed82e19d06ad3aac2f06353
                  books_general_interest
                  5d66715cc928aadd0074f61332698593
                  electronics
                  6a8631b72a2f8729b91514db87e771c0
                  electronics


## Summary of Results

Our results were mixed for both final models. Both displayed some inconsistency, and the best recommendations were products that simply matched the category of what was already purchased. Model1 seemed to generalize toward frequently purchased items, possibly because the collaborative signal is biased toward purchase frequency. Model2 tended to depend less on the most-bought items and more on matching what was already purchased.
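
The category-matching impression can be put into numbers for the seven customers tested with the final models. The sketch below transcribes the printed categories from the outputs above and computes, per model, the average share of top-3 recommendations whose category the customer had already bought. The function names are hypothetical, customer IDs are abbreviated to their first four characters, and the known categories are limited to what the printouts show.

```python
# customer: (known categories, model1 top-3 categories, model2 top-3 categories)
results = {
    "c8ed": ({"construction_tools_construction"},
             ["construction_tools_construction", "construction_tools_construction",
              "computers_accessories"],
             ["consoles_games", "fashion_bags_accessories", "art"]),
    "698e": ({"auto"},
             ["bed_bath_table", "sports_leisure", "telephony"],
             ["auto", "auto", "auto"]),
    "02ce": ({"stationery"},
             ["watches_gifts", "watches_gifts", "watches_gifts"],
             ["stationery", "stationery", "stationery"]),
    "0599": ({"drinks"},
             ["auto", "fixed_telephony", "telephony"],
             ["consoles_games", "fashion_bags_accessories", "art"]),
    "0848": ({"bed_bath_table"},
             ["agro_industry_and_commerce", "computers_accessories", "telephony"],
             ["bed_bath_table", "bed_bath_table", "bed_bath_table"]),
    "1bbd": ({"food_drink"},
             ["pet_shop", "pet_shop", "pet_shop"],
             ["fashion_bags_accessories", "cool_stuff", "cool_stuff"]),
    "37b5": ({"cool_stuff"},
             ["garden_tools", "garden_tools", "garden_tools"],
             ["books_general_interest", "electronics", "electronics"]),
}


def category_match_rate(known, recommended):
    """Share of recommended items whose category the customer already bought."""
    if not recommended:
        return 0.0
    return sum(cat in known for cat in recommended) / len(recommended)


def mean_match_rate(rows, model_position):
    """Average match rate over customers; position 1 is model1, 2 is model2."""
    rates = [category_match_rate(row[0], row[model_position]) for row in rows.values()]
    return sum(rates) / len(rates)


model1_rate = mean_match_rate(results, 1)
model2_rate = mean_match_rate(results, 2)
print(f"model1: {model1_rate:.2f}, model2: {model2_rate:.2f}")
# prints "model1: 0.10, model2: 0.43"
```

On this small sample model2 matches the purchased category for three of the seven customers and not at all for the other four, which is consistent with the inconsistency described above. Seven customers are too few to draw firm conclusions.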

In the end, model2 seems better suited as our recommendation engine. Clustering may have given model2 the ability to recognize latent factors among users and therefore to connect relevant products more consistently. While this produced what seems to be a better-performing recommendation engine, some healthy scepticism is warranted: it is reasonable to suspect that the model is fitting too closely to the cluster IDs and not enough to other traits.

Our next step will be a final comparison using PySpark's ALS recommendation algorithm. It will be tested on the same customers to see what similarities can be found and whether this simpler approach can improve the results further. 