# import module

In [1]:
import pandas as pd
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt

# Premodeling

## Loading data

In this section, we load the dataset, which is a combinenation of customer\product\order datasets. In this dataset, 3026 customers totally consume  4155 orders on 1753 products.

In [2]:
df=pd.read_csv(open('recommender.csv','rU'),index_col='Customers.id',encoding='utf-8', engine='c')

  """Entry point for launching an IPython kernel.


engine='c',engine : {‘c’, ‘python’}, optional

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.

In [3]:
df.head()

Unnamed: 0_level_0,Order_Items.product_name,Order_Items.qty,Order_Items.product_id,Products.long_description
Customers.id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
797,basic rollators green,0.693147,2310.0,classically designed value priced constructed ...
3,urinary drain bags,1.609438,177.0,only medline drain bags have slide tap for eas...
3,sensicare nitrile exam gloves blue large,0.693147,1.0,sensicare reg nitrile exam gloves feature depe...
4,basket for button walkers,0.693147,983.0,this wire basket attaches almost any walker me...
5,tens units,0.693147,991.0,the tens sup sup analog unit uses microprocess...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4151 entries, 797 to 3736
Data columns (total 4 columns):
Order_Items.product_name     4151 non-null object
Order_Items.qty              4151 non-null float64
Order_Items.product_id       4151 non-null float64
Products.long_description    4151 non-null object
dtypes: float64(2), object(2)
memory usage: 162.1+ KB


In [5]:
df.shape

(4151, 4)

In [71]:
print('# products: %d' % df['Order_Items.product_id'].nunique())

print('# customers: %d' % len(customers_order_count_df))

# products: 1753
# customers: 3026


## Cold Start Problem in Recommender Systems

Let's find how many different orders each customer conduct

In [73]:
customers_order_count_df = df.groupby(['Customers.id','Order_Items.product_id']).\
size().groupby('Customers.id').size()

customers_order_count_df.head()

Customers.id
3    2
4    1
5    3
7    1
8    1
dtype: int64

In [72]:
print('# customers with at least 3 different orders: %d' % len(customers_order_count_df[customers_order_count_df>2]))

# customers with at least 3 different orders: 121


there aren’t enough user actions for a particular item, the engine will not know when to display it

The term “cold start” derives from cars. When the engine is cold, the car is not yet working so smoothly, but once the optimal temperature is reached, it works just fine. For a recommendation engine it simply means that the conditions are not yet optimal for it to operate smoothly and provide best results. There are two major cold start categories: product cold start and customer cold start.

ways to help recommender systems cope with these issues.
* Content-based filtering  
* popularity-based filtering.

** Here I also tried collaborative filtering model.

## Data preprocessing

In [9]:
order_full_df = df.groupby(['Customers.id','Order_Items.product_id'])['Order_Items.qty'].sum().reset_index()

In [10]:
order_full_df.head(10)

Unnamed: 0,Customers.id,Order_Items.product_id,Order_Items.qty
0,3,1.0,0.693147
1,3,177.0,1.609438
2,4,983.0,0.693147
3,5,310.0,0.693147
4,5,799.0,0.693147
5,5,991.0,0.693147
6,7,1379.0,0.693147
7,8,815.0,0.693147
8,12,795.0,1.098612
9,13,1385.0,2.197225


In [11]:
order_full_df.shape

(3695, 3)

In [12]:
order_train_df, order_test_df = train_test_split(order_full_df, 
                                   test_size=0.20,
                                   random_state=42)

print('# orders on Train set: %d' % len(order_train_df))
print('# orders on Test set: %d' % len(order_test_df))

#Indexing by customer.id to speed up the searches during evaluation
order_full_indexed_df = order_full_df.set_index('Customers.id')
order_train_indexed_df = order_train_df.set_index('Customers.id')
order_test_indexed_df = order_test_df.set_index('Customers.id')

# orders on Train set: 2956
# orders on Test set: 739


# Model training and Evaluation

## Evaluation

In Recommender Systems, there are a set metrics commonly used for evaluation. We chose to work with Top-N accuracy metrics, which evaluates the accuracy of the top recommendations provided to a user, comparing to the items the user has actually interacted in test set.
This evaluation method works as follows:

* For each user
* For each item the user has interacted in test set
* Ask the recommender model to produce a ranked list of recommended items, Compute the Top-N accuracy metrics for this user and interacted item from the recommendations ranked list

* Aggregate the global Top-N accuracy metrics. The Top-N accuracy metric choosen was Recall@N which evaluates whether the interacted item is among the top N items (hit) in the ranked list of recommendations for a user.

In [13]:
#Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluator:
    
    def get_items_ordered(self,CustomersId, df):
    # Get the user's data and merge in the movie information.    
        if CustomersId in list(df.index):
            ordered_items = df.loc[df.index == CustomersId,'Order_Items.product_id']
            return set(ordered_items)
        else:
            return []

    
   
    def get_not_ordered_items_sample(self, CustomersId, sample_size, seed=42):
        ordered_items = self.get_items_ordered(CustomersId, order_full_indexed_df)
        all_items = set(df['Order_Items.product_id'])
        non_ordered_items = all_items - ordered_items

        random.seed(seed)
        non_ordered_items_sample = random.sample(non_ordered_items, sample_size)
        return set(non_ordered_items_sample)

    def _verify_hit_top_n(self, product_id, recommended_items, topn):        
            try:
                index = next(i for i, c in enumerate(recommended_items) if c == product_id)
            except:
                index = -1
            hit = int(index in range(0, topn))
            return hit, index

    def evaluate_model_for_user(self, model, CustomersId):
        #Getting the items in test set
        ordered_values_testset = order_test_indexed_df.loc[order_test_indexed_df.index ==CustomersId,'Order_Items.product_id']
        customer_ordered_items_testset = set(ordered_values_testset)
         
        ordered_items_count_testset = len(customer_ordered_items_testset) 

        #Getting a ranked recommendation list from a model for a given user
        customer_recs_df = model.recommend_items(CustomersId, 
                                               items_to_ignore=self.get_items_ordered(CustomersId,\
                                                                                      order_train_indexed_df),\
                                                 topn=10000000000)
        customer_recs_df = customer_recs_df.sort_values(by='Order_Items.qty',ascending=False)
     
        hits_at_5_count = 0
        hits_at_10_count = 0
        #For each item the user has interacted in test set
        for product_id in customer_ordered_items_testset: 
            valid_recs = customer_recs_df['Order_Items.product_id'].values
            #Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(product_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(product_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        #Recall is the rate of the interacted items that are ranked among the Top-N recommended items, 
        #when mixed with a set of non-relevant items
        recall_at_5 = hits_at_5_count / float(ordered_items_count_testset)
        recall_at_10 = hits_at_10_count / float(ordered_items_count_testset)

        customer_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'ordered_count': ordered_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return customer_metrics

    def evaluate_model(self, model):
        #print('Running evaluation for users')
        people_metrics = []
        for idx, CustomersId in enumerate(list(order_test_indexed_df.index.unique().values)):
            if idx % 100 == 0 and idx > 0:
                print('%d users processed' % idx)
            customer_metrics = self.evaluate_model_for_user(model, CustomersId)  
            customer_metrics['_customer_id'] = CustomersId
            people_metrics.append(customer_metrics)
        print('%d customers processed' % idx)

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('ordered_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['ordered_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['ordered_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator()  

## Popularity model

A common (and usually hard-to-beat) baseline approach is the Popularity model. This model is not actually personalized - it simply recommends to a user the most popular items that the user has not previously consumed. As the popularity accounts for the "wisdom of the crowds", it usually provides good recommendations, generally interesting for most people.
Ps. The main objective of a recommender system is to leverage the long-tail items to the users with very specific interests, which goes far beyond this simple technique.

In [14]:
item_popularity_df = order_train_df.groupby('Order_Items.product_id')['Order_Items.qty'].sum().sort_values(ascending=False).reset_index()
item_popularity_df.head(10)

Unnamed: 0,Order_Items.product_id,Order_Items.qty
0,1842.0,51.359687
1,911.0,39.221707
2,2107.0,37.636618
3,1469.0,34.436473
4,1862.0,34.264072
5,910.0,33.67653
6,858.0,31.714871
7,1867.0,31.463557
8,2109.0,26.894741
9,493.0,26.5227


In [15]:
class PopularityRecommender:
    
    MODEL_NAME = 'Popularity'
    
    def __init__(self, popularity_df, items_df=None):
        self.popularity_df = popularity_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, CustomersId, items_to_ignore=[], topn=10):
        # Recommend the more popular items that the user hasn't seen yet.
        recommendations_df = self.popularity_df[~self.popularity_df['Order_Items.product_id'].isin(items_to_ignore)] \
                               .sort_values('Order_Items.qty', ascending = False) \
                               .head(topn)

        
        return recommendations_df
    
popularity_model = PopularityRecommender(item_popularity_df)

In [16]:
popularity_model.recommend_items('851',topn=10)

Unnamed: 0,Order_Items.product_id,Order_Items.qty
0,1842.0,51.359687
1,911.0,39.221707
2,2107.0,37.636618
3,1469.0,34.436473
4,1862.0,34.264072
5,910.0,33.67653
6,858.0,31.714871
7,1867.0,31.463557
8,2109.0,26.894741
9,493.0,26.5227


In [17]:
print('Evaluating Popularity recommendation model...')
pop_global_metrics, pop_detailed_results_df = model_evaluator.evaluate_model(popularity_model)
print('\nGlobal metrics:\n%s' % pop_global_metrics)
pop_detailed_results_df.head(10)

Evaluating Popularity recommendation model...
100 users processed
200 users processed
300 users processed
400 users processed
500 users processed
600 users processed
689 customers processed

Global metrics:
{'modelName': 'Popularity', 'recall@5': 0.058186738836265225, 'recall@10': 0.11096075778078485}


Unnamed: 0,_customer_id,hits@10_count,hits@5_count,ordered_count,recall@10,recall@5
138,1730,0,0,6,0.0,0.0
4,851,0,0,4,0.0,0.0
196,1616,0,0,4,0.0,0.0
413,213,1,0,4,0.25,0.0
32,515,0,0,3,0.0,0.0
3,1371,0,0,3,0.0,0.0
292,1420,0,0,3,0.0,0.0
517,1845,0,0,2,0.0,0.0
66,1313,0,0,2,0.0,0.0
330,1385,0,0,2,0.0,0.0


According to the result we achieved above, the Recall@5 of 0.058, which means that 5.8% of ordered items in test set were ranked by Popularity model among the top-5 items. And Recall@10 was even higher(11%), as expected.
Popularity models perform not so well!

## Content-Based Filtering Model

Content-Based Filtering: This method uses only information about the description and attributes of the items users has previously ordered to model user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past. In particular, various candidate items are compared with items previously ordered by the user and the items with highest cosine_similarities are recommended.

In [23]:
df.columns

Index(['Order_Items.product_name', 'Order_Items.qty', 'Order_Items.product_id',
       'Products.long_description'],
      dtype='object')

In [24]:
train_df, test_df = train_test_split(df,test_size=0.2,random_state=42)


In [25]:
order_full_df.head()


Unnamed: 0,Customers.id,Order_Items.product_id,Order_Items.qty
0,3,1.0,0.693147
1,3,177.0,1.609438
2,4,983.0,0.693147
3,5,310.0,0.693147
4,5,799.0,0.693147


In [26]:
train_df.head()


Unnamed: 0_level_0,Order_Items.product_name,Order_Items.qty,Order_Items.product_id,Products.long_description
Customers.id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
3679,medline wheelchair walker combination red incl...,0.693147,909.0,combine the functionality both rollator and tr...
910,fitright extra briefs large,0.693147,321.0,fitright reg extra brief promotes discreet com...
2113,medline plus digital wrist blood pressure monitor,1.098612,4187.0,automatically inflates and deflates provides q...
1539,cup holder for wheelchairs,0.693147,13092.0,cup holder for wheelchairs
3559,medline bed assist bar,0.693147,782.0,built last our bed assist bar provides help ge...


In [27]:
order_train_df.head()

Unnamed: 0,Customers.id,Order_Items.product_id,Order_Items.qty
2577,2611,8162.0,0.693147
1490,1565,1986.0,0.693147
1255,1337,911.0,0.693147
3012,3032,17651.0,1.386294
2353,2391,857.0,0.693147


In [28]:
df.head(10)


Unnamed: 0_level_0,Order_Items.product_name,Order_Items.qty,Order_Items.product_id,Products.long_description
Customers.id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
797,basic rollators green,0.693147,2310.0,classically designed value priced constructed ...
3,urinary drain bags,1.609438,177.0,only medline drain bags have slide tap for eas...
3,sensicare nitrile exam gloves blue large,0.693147,1.0,sensicare reg nitrile exam gloves feature depe...
4,basket for button walkers,0.693147,983.0,this wire basket attaches almost any walker me...
5,tens units,0.693147,991.0,the tens sup sup analog unit uses microprocess...
5,fitright ultra protective underwear large,0.693147,310.0,fitright ultra protective underwear large
5,sensicare silk nitrile exam gloves dark blue s...,0.693147,799.0,sensicare silk nitrile exam gloves dark blue s...
7,aloetouch sensitive personal cleansing baby wipes,0.693147,1379.0,super soft spunlace wipes are gentle the skin ...
8,universal raised toilet seat,0.693147,815.0,universal raised toilet seat has height adjust...
12,biohazard multipurpose sharps containers red,1.098612,795.0,these containers are designed for use restrict...


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4151 entries, 797 to 3736
Data columns (total 4 columns):
Order_Items.product_name     4151 non-null object
Order_Items.qty              4151 non-null float64
Order_Items.product_id       4151 non-null float64
Products.long_description    4151 non-null object
dtypes: float64(2), object(2)
memory usage: 162.1+ KB


In [30]:
stopwords_list = stopwords.words('english')

In [31]:
vectorizer = TfidfVectorizer(analyzer='word',\
                            ngram_range=(1,2),\
                            min_df=0.003,\
                            max_df=0.5,\
                            max_features=5000,\
                            stop_words=stopwords_list)

In [32]:
item_ids = df['Order_Items.product_id'].tolist()
item_ids.index(2310.0)


0

In [33]:
tfidf_matrix=vectorizer.fit_transform(df['Order_Items.product_name'] + '' + df['Products.long_description'])

In [45]:
tfidf_matrix

<4151x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 243468 stored elements in Compressed Sparse Row format>

In [34]:
tfidf_feature_names = vectorizer.get_feature_names()

In [46]:
tfidf_feature_names

['abdominal',
 'abdominal binders',
 'abduction',
 'ability',
 'ability rest',
 'absorbency',
 'absorbency choices',
 'absorbency core',
 'absorbency levels',
 'absorbency one',
 'absorbent',
 'absorbent core',
 'absorbent material',
 'absorbent pad',
 'absorbent padcommode',
 'absorbent polymer',
 'absorbing',
 'absorbing significant',
 'absorbs',
 'absorption',
 'access',
 'access maneuverability',
 'accessories',
 'accessories anti',
 'accessories holder',
 'accessories pole',
 'accessory',
 'accessory great',
 'accommodate',
 'accommodate users',
 'accommodates',
 'accommodates standard',
 'accommodates users',
 'according',
 'accurate',
 'achievement',
 'achievement daily',
 'acid',
 'acids',
 'acids antioxidants',
 'acquisition',
 'acquisition dryness',
 'acquisition layer',
 'action',
 'activated',
 'activated push',
 'activated safety',
 'active',
 'active children',
 'active ingredient',
 'active lifestyle',
 'active male',
 'activities',
 'activities add',
 'activities round'

In [35]:
tfidf_matrix

<4151x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 243468 stored elements in Compressed Sparse Row format>

In [40]:
def get_item_profile(ProductId):
    idx = item_ids.index(ProductId)
     
    item_profile = tfidf_matrix[idx:idx+1]
     
    return item_profile

def get_item_profiles(ids):
    item_profiles_list = [get_item_profile(x) for x in ids]
    item_profiles = scipy.sparse.vstack(item_profiles_list)
    return item_profiles

def build_customers_profile(CustomersId, order_full_df):
    #order_customer_df = order_indexed_df.loc[CustomersId]
    ids=order_full_df.loc[order_full_df['Customers.id']==CustomersId,'Order_Items.product_id'].tolist()
    customer_item_profiles = get_item_profiles(ids)
    
    customer_item_qtys = np.array(order_full_df.loc[order_full_df['Customers.id']==CustomersId,'Order_Items.qty']).reshape(-1,1)
    #Weighted average of item profiles by the interactions strength
    customer_item_qtys_weighted_avg = np.sum(customer_item_profiles.multiply(customer_item_qtys), axis=0) / np.sum(customer_item_qtys)
    customer_profile_norm = sklearn.preprocessing.normalize(customer_item_qtys_weighted_avg)
    return customer_profile_norm

def build_customers_profiles(): 
    order_indexed_df = order_full_df.set_index('Customers.id')
    customer_profiles = {}
    for CustomersId in order_indexed_df.index.unique():
        #print(CustomersId)
        customer_profiles[CustomersId] = build_customers_profile(CustomersId, order_full_df)
    return customer_profiles

In [None]:
CustomersId ='100'
order_indexed_df = order_full_df.set_index('Customers.id')
order_customer_df = order_indexed_df.loc[CustomersId]
customer_item_qtys = np.array(order_full_df.loc[order_full_df['Customers.id']==CustomersId,'Order_Items.qty']).reshape(-1,1)
customer_item_qtys

In [41]:
customer_profiles = build_customers_profiles()
len(customer_profiles)
#CustomersId = '100'
#order_full_df.loc[order_full_df['Customers.id']=='100','Order_Items.product_id'].tolist()
#order_indexed_df = order_full_df.set_index('Customers.id')

#order_customer_df = pd.DataFrame(order_indexed_df.loc[CustomersId]).T
#order_customer_df.head()
#ids = order_customer_df['Order_Items.product_id'].tolist()
#print(ids)
#order_customer_df.head()
#ids = order_customer_df['Order_Items.product_id'].values()
#type(ids)
#build_customers_profile(CustomersId,order_full_df)
#order_customer_df = order_indexed_df.loc[CustomersId]
#order_customer_df.head()

3026

In [42]:
class ContentBasedRecommender:
    
    MODEL_NAME = 'Content-Based'
    
    def __init__(self, items_df=None):
        self.item_ids = item_ids
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def _get_similar_items_to_customer_profile(self, CustomersId, topn=1000):
        #Computes the cosine similarity between the user profile and all item profiles
        cosine_similarities = cosine_similarity(customer_profiles[CustomersId], tfidf_matrix)
        #Gets the top similar items
        similar_indices = cosine_similarities.argsort().flatten()[-topn:]
        #Sort the similar items by similarity
        similar_items = sorted([(item_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
        return similar_items
        
    def recommend_items(self, CustomersId, items_to_ignore=[], topn=10, verbose=False):
        similar_items = self._get_similar_items_to_customer_profile(CustomersId)
        #Ignores items the user has already interacted
        similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))
        recommendations_df = pd.DataFrame(similar_items_filtered, columns=['Order_Items.product_id', 'Order_Items.qty']) \
                                    .head(topn)


        return recommendations_df
    
content_based_recommender_model = ContentBasedRecommender(train_df)

In [43]:
print('Evaluating Content-Based Filtering model...')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model(content_based_recommender_model)
print('\nGlobal metrics:\n%s' % cb_global_metrics)
cb_detailed_results_df.head(10)

Evaluating Content-Based Filtering model...
100 users processed
200 users processed
300 users processed
400 users processed
500 users processed
600 users processed
689 customers processed

Global metrics:
{'modelName': 'Content-Based', 'recall@5': 0.9012178619756428, 'recall@10': 0.9350473612990527}


Unnamed: 0,_customer_id,hits@10_count,hits@5_count,ordered_count,recall@10,recall@5
138,1730,3,1,6,0.5,0.166667
4,851,1,1,4,0.25,0.25
196,1616,3,1,4,0.75,0.25
413,213,1,1,4,0.25,0.25
32,515,0,0,3,0.0,0.0
3,1371,0,0,3,0.0,0.0
292,1420,2,2,3,0.666667,0.666667
517,1845,2,2,2,1.0,1.0
66,1313,1,1,2,0.5,0.5
330,1385,2,1,2,1.0,0.5


Yay! With personalized recommendations of content-based filtering model, we have a jump on Recall@5 to about 0.90, which means that about 94% of interacted items in test set were ranked by this model among the top-5 items.

Wonderful!


## Collaborative Filtering model

Collaborative Filtering: This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.

Model-based: This approach, models are developed using different machine learning algorithms to recommend items to users. There are many model-based CF algorithms, like neural networks, bayesian networks, clustering models, and latent factor models such as Singular Value Decomposition (SVD) and, probabilistic latent semantic analysis.

In [47]:
#Creating a sparse pivot table with users in rows and items in columns
customer_product_pivot_matrix_df = order_train_df.pivot(index='Customers.id', 
                                                          columns='Order_Items.product_id', 
                                                          values='Order_Items.qty').fillna(0)

customer_product_pivot_matrix_df.head(10)

Order_Items.product_id,11.0,14.0,15.0,19.0,20.0,22.0,26.0,28.0,29.0,30.0,...,25003.0,25107.0,25170.0,25269.0,25356.0,25527.0,25612.0,25694.0,25920.0,26175.0
Customers.id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [48]:
customer_product_pivot_matrix = customer_product_pivot_matrix_df.as_matrix()
customer_product_pivot_matrix[:10]

  """Entry point for launching an IPython kernel.


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [49]:
customer_ids = list(customer_product_pivot_matrix_df.index)
customer_ids[:10]

[3, 4, 5, 7, 12, 13, 14, 22, 23, 27]

In [50]:
#The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 15
#Performs matrix factorization of the original user item matrix
U, sigma, Vt = svds(customer_product_pivot_matrix, k = NUMBER_OF_FACTORS_MF)

In [51]:
sigma = np.diag(sigma)

In [52]:
U.shape, Vt.shape, np.diag(sigma).shape

((2489, 15), (15, 1517), (15,))

After the factorization, I try to to reconstruct the original matrix by multiplying its factors. The resulting matrix is not sparse any more. It was generated predictions for items the user have not yet interaction, which I will exploit for recommendations.

In [54]:
all_customer_predicted_qtys = np.dot(np.dot(U, sigma), Vt) 
all_customer_predicted_qtys

array([[-6.79804571e-34,  7.68446827e-37, -3.68432643e-33, ...,
        -6.13993996e-34, -4.40271119e-34, -3.44952466e-34],
       [ 2.07820141e-34, -3.17488649e-37,  2.42258536e-34, ...,
         2.37251724e-34,  9.90330768e-35,  2.46979360e-35],
       [-2.44415180e-34,  2.78454815e-37, -4.04277895e-34, ...,
        -2.11013881e-34, -1.46148937e-34, -4.05240878e-35],
       ...,
       [ 1.09216236e-32, -1.05860535e-36,  1.06415192e-33, ...,
         1.15205644e-33, -3.28674133e-33,  2.11232161e-34],
       [ 1.12748904e-33,  4.45949884e-38,  7.49462106e-34, ...,
        -2.00439296e-36, -4.28230293e-34,  7.94418789e-35],
       [ 4.49753879e-34, -6.36343206e-37,  6.16460644e-34, ...,
         4.76076985e-34,  2.52031222e-34,  6.22020747e-35]])

In [55]:
#Converting the reconstructed matrix back to a Pandas dataframe
cf_preds_df = pd.DataFrame(all_customer_predicted_qtys, columns = customer_product_pivot_matrix_df.columns, index=customer_ids).transpose()
cf_preds_df.head(10)


Unnamed: 0_level_0,3,4,5,7,12,13,14,22,23,27,...,3719,3721,3722,3723,3725,3726,3728,3730,3732,3736
Order_Items.product_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11.0,-6.798046e-34,2.0782009999999997e-34,-2.4441519999999997e-34,-3.5057389999999997e-34,2.8163e-35,2.386577e-33,-9.844454e-35,-1.5193979999999999e-34,2.0541030000000002e-33,5.3984179999999995e-34,...,-4.558116e-34,-3.592892e-34,1.799036e-35,-1.295886e-17,-4.3289139999999995e-34,2.749425e-34,-3.670878e-34,1.092162e-32,1.127489e-33,4.497539e-34
14.0,7.684468e-37,-3.174886e-37,2.784548e-37,2.957167e-37,-2.484141e-37,-1.172457e-36,-6.480759e-37,1.669533e-37,-8.645679e-37,1.395833e-36,...,4.680687e-37,3.249017e-37,-2.877675e-38,-2.703717e-20,3.5731399999999998e-37,2.9693829999999997e-38,4.1942759999999994e-38,-1.058605e-36,4.4594989999999994e-38,-6.363432e-37
15.0,-3.6843260000000005e-33,2.422585e-34,-4.042779e-34,-1.065181e-33,-1.125103e-34,1.8056540000000002e-33,2.373567e-34,-3.5189529999999998e-34,1.749064e-33,6.654519e-34,...,-5.0566919999999996e-34,2.3096989999999997e-34,2.072674e-35,3.623543e-17,-1.020003e-33,-7.952141e-34,4.494841e-34,1.064152e-33,7.494621e-34,6.164606e-34
19.0,-8.571184e-34,3.4049239999999997e-34,-2.974117e-34,-3.217285e-34,2.620236e-34,1.189109e-33,6.951267e-34,-1.793969e-34,9.071711e-34,-1.505689e-33,...,-4.628644e-34,-3.359859e-34,3.083847e-35,2.9384170000000003e-17,-3.872485e-34,-4.71293e-35,-3.562772e-35,1.205042e-33,-3.681674e-35,6.7988559999999995e-34
20.0,3.4788199999999997e-34,-3.449382e-36,1.202168e-36,6.179731e-35,2.828454e-35,2.4651829999999997e-34,-2.871914e-35,1.105644e-35,-3.4308149999999998e-37,2.247317e-34,...,-1.209861e-34,-9.320517e-35,-2.563364e-37,-4.867255e-18,4.98573e-35,1.544095e-34,-1.014249e-34,3.420051e-34,-1.322079e-34,8.774337e-37
22.0,-2.56282e-34,1.13067e-34,-1.333968e-34,-1.369616e-34,9.628161e-35,-1.886885e-34,1.715263e-34,-7.896742e-35,-1.436466e-34,-3.099071e-34,...,-1.267184e-34,-3.916408e-35,9.797020999999999e-36,9.930089e-18,-1.594302e-34,-1.863691e-34,-4.958921e-35,3.505257e-34,-2.618572e-35,2.547774e-34
26.0,-1.2453730000000001e-33,7.203751e-35,-9.850715e-35,-4.448709e-34,6.574596e-35,1.736821e-32,2.970472e-34,-1.318456e-34,9.051077000000001e-33,2.3226000000000003e-33,...,-5.6215620000000005e-33,-1.741609e-33,1.067686e-35,-5.1233540000000006e-17,-4.4611389999999995e-34,3.312572e-33,-8.84349e-35,-1.941207e-33,-1.740891e-33,2.3170209999999997e-34
28.0,1.883424e-19,1.0747769999999998e-19,-8.639791e-20,-3.717502e-21,1.463919e-19,1.045545e-18,1.8967809999999997e-19,-3.275075e-20,2.652008e-19,-4.765622e-18,...,-5.150814999999999e-19,-2.3270119999999997e-19,7.545035e-21,1.3650700000000001e-17,-4.486601e-20,4.665761e-19,-7.657416e-20,-2.5674989999999997e-19,-1.149081e-18,1.5174179999999998e-19
29.0,-6.403837e-33,-1.241794e-33,1.8883110000000003e-33,-1.4867180000000001e-33,-2.2824570000000002e-33,5.291775e-32,2.430457e-33,1.822869e-34,2.7124950000000003e-32,7.856808000000001e-32,...,5.471186e-34,-7.446367e-33,-6.80718e-35,2.7279790000000004e-17,-3.7974829999999997e-34,1.810105e-32,3.2937500000000004e-33,-8.599969e-32,-1.998399e-32,-2.322522e-33
30.0,-1.142989e-33,9.905757e-35,-1.593482e-34,-3.592858e-34,-1.212698e-35,7.4379709999999994e-34,7.992915e-35,-1.2775879999999999e-34,6.150729e-34,1.961577e-34,...,-2.7680489999999997e-34,3.146095e-35,8.466619e-36,1.0686440000000001e-17,-3.546025e-34,-2.318696e-34,9.112122e-35,8.675419e-34,2.351002e-34,2.5105529999999998e-34


In [56]:
class CFRecommender:
    
    MODEL_NAME = 'Collaborative Filtering'
    
    def __init__(self, cf_predictions_df, items_df=None):
        self.cf_predictions_df = cf_predictions_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, CustomersId, items_to_ignore=[], topn=10, verbose=False):
        # Get and sort the user's predictions
        for CustomersId in self.cf_predictions_df.columns:
            sorted_customer_predictions = self.cf_predictions_df[CustomersId].sort_values(ascending=False) \
                                    .reset_index().rename(columns={CustomersId: 'Order_Items.qty'})

        # Recommend the highest predicted qty product that the user hasn't seen yet.
            recommendations_df = sorted_customer_predictions[~sorted_customer_predictions['Order_Items.product_id'].isin(items_to_ignore)] \
                               .sort_values('Order_Items.qty', ascending = False) \
                               .head(topn)

        #recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
        #                                                  left_on = 'contentId', 
        #                                                  right_on = 'contentId')[['recStrength', 'contentId', 'title', 'url', 'lang']]


            return recommendations_df
    
cf_recommender_model = CFRecommender(cf_preds_df)

In [57]:
print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')
cf_global_metrics, cf_detailed_results_df = model_evaluator.evaluate_model(cf_recommender_model)
print('\nGlobal metrics:\n%s' % cf_global_metrics)
cf_detailed_results_df.head(10)

Evaluating Collaborative Filtering (SVD Matrix Factorization) model...
100 users processed
200 users processed
300 users processed
400 users processed
500 users processed
600 users processed
689 customers processed

Global metrics:
{'modelName': 'Collaborative Filtering', 'recall@5': 0.0027063599458728013, 'recall@10': 0.005412719891745603}


Unnamed: 0,_customer_id,hits@10_count,hits@5_count,ordered_count,recall@10,recall@5
138,1730,0,0,6,0.0,0.0
4,851,0,0,4,0.0,0.0
196,1616,0,0,4,0.0,0.0
413,213,0,0,4,0.0,0.0
32,515,0,0,3,0.0,0.0
3,1371,0,0,3,0.0,0.0
292,1420,0,0,3,0.0,0.0
517,1845,0,0,2,0.0,0.0
66,1313,0,0,2,0.0,0.0
330,1385,0,0,2,0.0,0.0


As expected, collaborative filtering model does not perform well. Because this approach is based on the users similarities(user-based approach) and items similarities(item-based approach).

# Deployment

Let's test the best model(Content-Based Filtering) for customers

In [58]:
test_df.head()

Unnamed: 0_level_0,Order_Items.product_name,Order_Items.qty,Order_Items.product_id,Products.long_description
Customers.id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1268,medline deluxe aluminum transport chair with h...,0.693147,911.0,wheels and handbrakes top lifetime warranty fr...
3652,premium series shower chair with back and arms,1.098612,2386.0,top the line safety and bathing support the dr...
2558,tub grab bars,0.693147,425.0,tub grab bar has step through clamp design tha...
1346,readybath select medium weight cleansing washc...,0.693147,1277.0,readybath reg cloths are pre moistened with ri...
578,avant gauze non woven non sterile sponges,0.693147,1695.0,avant gauze medline standard non woven dressin...


Unnamed: 0_level_0,Order_Items.product_name,Order_Items.qty,Order_Items.product_id,Products.long_description
Customers.id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
321,aluminum transport chair with wheels blue,0.693147,910.0,wheels and handbrakes top lifetime warranty fr...
1512,medline deluxe aluminum transport chair with h...,0.693147,910.0,wheels and handbrakes top lifetime warranty fr...
1766,medline deluxe aluminum transport chair with h...,0.693147,910.0,wheels and handbrakes top lifetime warranty fr...
2079,medline deluxe aluminum transport chair with h...,0.693147,910.0,wheels and handbrakes top lifetime warranty fr...
730,medline lightweight aluminum transport wheelch...,0.693147,910.0,wheels and handbrakes top lifetime warranty fr...
2827,medline deluxe transport chair inch wheels red...,0.693147,910.0,wheels and handbrakes top lifetime warranty fr...
2835,medline deluxe transport chair inch wheels red...,0.693147,910.0,wheels and handbrakes top lifetime warranty fr...
776,medline lightweight aluminum transport wheelch...,0.693147,910.0,wheels and handbrakes top lifetime warranty fr...
2166,medline deluxe aluminum transport chair with h...,0.693147,910.0,wheels and handbrakes top lifetime warranty fr...
748,medline lightweight aluminum transport wheelch...,0.693147,910.0,wheels and handbrakes top lifetime warranty fr...


In [61]:
content_based_recommender_model.recommend_items(1268, items_to_ignore=[911.0], topn=10, verbose=False)

Unnamed: 0,Order_Items.product_id,Order_Items.qty
0,910.0,0.924922
1,910.0,0.924922
2,910.0,0.924922
3,910.0,0.924922
4,910.0,0.924922
5,910.0,0.924922
6,910.0,0.924922
7,910.0,0.924922
8,910.0,0.924922
9,910.0,0.924922


In [None]:
test_df

In [69]:
test_df.loc[test_df['Order_Items.product_id']==911.0,['Products.long_description','Order_Items.product_name']].head(1)

Unnamed: 0_level_0,Products.long_description,Order_Items.product_name
Customers.id,Unnamed: 1_level_1,Unnamed: 2_level_1
1268,wheels and handbrakes top lifetime warranty fr...,medline deluxe aluminum transport chair with h...


In [70]:
test_df.loc[test_df['Order_Items.product_id']==910.0,['Products.long_description','Order_Items.product_name']].head(1)

Unnamed: 0_level_0,Products.long_description,Order_Items.product_name
Customers.id,Unnamed: 1_level_1,Unnamed: 2_level_1
321,wheels and handbrakes top lifetime warranty fr...,aluminum transport chair with wheels blue


The itme decription of recommendation  is exactly same with given item.

# Conclusion

In this notebook, I've explored and compared three basic models used in Recommender Systems. It could be observed that for 'cold-start' dataset, content-based model works best.

1, What I learned in this project:

* How to define class in python to make the notebook more organized
* New evaluation approach. In this project we evaluate the model using the  top@N metrics, it's different from all the previous projects.

2, More to do:

* The hybrid of obove models may generate a better result.
* Other model-based Filtering approaches are also good try, like Decision Trees/Logistic models/Neural network.