In [1]:
from IPython.display import Image
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd


# Product Recommendation Algorithm
<br>
<br>
A product recommendation system is a filtering system that seeks to predict and show the items that a user would like to purchase.

Recommender systems have become increasingly popular in recent years and are used in a variety of areas, including movies, music, news, books, and search queries. They are mostly used in the digital domain: the majority of today's e-commerce sites, such as eBay, Amazon, and Alibaba, use proprietary recommendation algorithms to better serve customers with products they are likely to want.
<br>
<br>
*source:*
<br>
https://towardsdatascience.com/what-are-product-recommendation-engines-and-the-various-versions-of-them-9dcab4ee26d5

## Content-based filtering
<br>
<br>
<b>Individual-based:</b>
This method is based on the description of an item and a profile of <b>the user</b>'s preferred choices.<br>
In a content-based recommendation system, keywords are used to describe the items, and a user profile is built to capture the type of item this user likes. In other words, the algorithm tries to recommend products that are similar to the ones the user has liked in the past.<br><br>
A major issue with content-based filtering, however, is whether the system can learn user preferences from a user's actions on one content source and apply them to other content types. When the system is limited to recommending content of the same type the user is already consuming, the recommendation system delivers significantly less value than when content types from other services can also be recommended.
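
As a rough illustration of the idea above, the sketch below describes each item by a set of keywords, builds a user profile from the items the user liked, and ranks the remaining items by cosine similarity to that profile. The item names, keywords, and the `cosine` helper are made up for this example and are not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical items and keywords (toy data for illustration only)
item_keywords = {
    "passport cover": {"travel", "blue", "spotty"},
    "luggage tag": {"travel", "blue", "tag"},
    "metal sign": {"home", "metal", "sign"},
    "butter dish": {"home", "kitchen", "retro"},
}

# One row per item, one column per keyword; 1 if the keyword describes the item
vocab = sorted(set().union(*item_keywords.values()))
item_vectors = pd.DataFrame(
    [[int(k in kws) for k in vocab] for kws in item_keywords.values()],
    index=list(item_keywords.keys()),
    columns=vocab,
)

# User profile: sum of the keyword vectors of the items the user liked
liked = ["passport cover"]
profile = item_vectors.loc[liked].sum(axis=0).to_numpy(dtype=float)

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Score every item the user has not liked yet and rank by similarity
scores = pd.Series(
    {
        item: cosine(profile, vec.to_numpy(dtype=float))
        for item, vec in item_vectors.iterrows()
        if item not in liked
    }
).sort_values(ascending=False)
print(scores)
```

In this toy setup the luggage tag ranks first because it shares two of the three keywords with the liked passport cover, while the two home items share none.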

## Collaborative filtering
<br>
<br>
<b>User-to-users:</b>
This method makes automatic predictions about the interests of a user by collecting preference or taste information from other users.
<br>
<b>Theory:</b>
If person A has the same opinion as person B on one issue, A is more likely to share B's opinion on a different issue than that of a randomly chosen person.
<br>
<br>
<img class="collaborative_filtering"
src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/52/Collaborative_filtering.gif/300px-Collaborative_filtering.gif">

 - Memory-based filtering: uses user rating data to compute the similarity between users or items.
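
To make "similarity between users or items" concrete, here is a minimal sketch on a made-up 3x3 purchase matrix. The same matrix yields user-to-user similarities when rows are compared and item-to-item similarities when columns are compared. The user labels and item names are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix: 1 means the user bought/liked the item (made-up data)
ratings = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1]],
    index=["A", "B", "C"],
    columns=["item1", "item2", "item3"],
)

# User-based: compare rows (each row is one user's vector)
user_sim = pd.DataFrame(
    cosine_similarity(ratings), index=ratings.index, columns=ratings.index
)

# Item-based: compare columns by transposing the matrix first
item_sim = pd.DataFrame(
    cosine_similarity(ratings.T), index=ratings.columns, columns=ratings.columns
)

print(user_sim.round(3))
print(item_sim.round(3))
```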
 


<i>source</i>
<br>
https://en.wikipedia.org/wiki/Collaborative_filtering 

### *The algorithm built below is based on this method*

### 1. Data Examination

In [2]:
df = pd.read_csv('onlinepurchase.csv')

In [3]:
df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,Invoice_time,Price,CustomerID,Country,Purchase_dt
0,493410,TEST001,This is a test product.,5,2010/1/4 9:24,4.5,12346.0,United Kingdom,2010/1/4
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010/1/4 9:43,4.25,14590.0,United Kingdom,2010/1/4
2,493412,TEST001,This is a test product.,5,2010/1/4 9:53,4.5,12346.0,United Kingdom,2010/1/4
3,493413,21724,PANDA AND BUNNIES STICKER SHEET,1,2010/1/4 9:54,0.85,,United Kingdom,2010/1/4
4,493413,84578,ELEPHANT TOY WITH BLUE T-SHIRT,1,2010/1/4 9:54,3.75,,United Kingdom,2010/1/4


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1022664 entries, 0 to 1022663
Data columns (total 9 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   Invoice       1022664 non-null  object 
 1   StockCode     1022664 non-null  object 
 2   Description   1018436 non-null  object 
 3   Quantity      1022664 non-null  int64  
 4   Invoice_time  1022664 non-null  object 
 5   Price         1022664 non-null  float64
 6   CustomerID    793031 non-null   float64
 7   Country       1022664 non-null  object 
 8   Purchase_dt   1022664 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 70.2+ MB


In [5]:
# Number of Customers who have made purchases
df['CustomerID'].nunique()

5887

### 2. Data Cleaning

In [6]:
df.duplicated().sum()

11585

In [7]:
# Drop duplicated records
df = df.drop_duplicates()

In [8]:
# Retain invoices that have positive quantity only
df = df.loc[df['Quantity']>0]

In [9]:
# Retain customers with a valid CustomerID only
df = df.dropna(subset=['CustomerID'])

In [10]:
# Make sure every product in the dataset has a description
df = df.dropna(subset=['Description'])

In [11]:
df.to_csv('onlinepurchase_cleaned.csv', sep = '|', index=False)
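
The cleaning steps above can also be bundled into one function, which makes them easy to check on a small table. The sketch below is one possible arrangement: `clean_purchases` is a hypothetical helper name, and the toy rows are made up so that each rule removes exactly one row.

```python
import numpy as np
import pandas as pd

def clean_purchases(raw):
    """Hypothetical helper applying the same four cleaning rules in sequence."""
    return (
        raw.drop_duplicates()                              # drop duplicated records
           .loc[lambda d: d["Quantity"] > 0]               # keep positive quantities only
           .dropna(subset=["CustomerID", "Description"])   # need a customer and a description
    )

# Made-up rows: one valid row plus one violation of each rule
toy = pd.DataFrame({
    "Invoice": ["1", "1", "C2", "3", "4"],
    "StockCode": ["A", "A", "B", "C", "D"],
    "Description": ["a", "a", "b", "c", None],
    "Quantity": [5, 5, -1, 1, 2],
    "CustomerID": [10.0, 10.0, 11.0, np.nan, 12.0],
})

cleaned = clean_purchases(toy)
print(cleaned)
```

Only the first row survives: the second is a duplicate, the third has a negative quantity, the fourth has no CustomerID, and the fifth has no description.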

### 3. Customer-item matrix
<br>
The first step to implementing a collaborative filtering algorithm for a product recommendation system is building a user-to-item matrix. 

We transform the data into a customer-item matrix, where each row represents a customer and each column represents one product.

In [12]:
customer_item_matrix = df.pivot_table(
    index='CustomerID',
    columns='StockCode',
    values='Quantity',
    aggfunc='sum')
customer_item_matrix.head()

StockCode,10002,10080,10120,10123C,10123G,10124A,10124G,10125,10133,10134,...,BANK CHARGES,C2,D,DOT,M,PADS,POST,SP1002,TEST001,TEST002
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346.0,,,,,,,,,,,...,,,,,,,,,45.0,1.0
12347.0,,,,,,,,,,,...,,,,,,,,,,
12348.0,,,,,,,,,,,...,,,,,,,10.0,,,
12349.0,,,,,,,,,,,...,,,,,,,3.0,,,
12350.0,,,,,,,,,,,...,,,,,,,1.0,,,


For each customer, a value in a cell means that the customer purchased the product identified by that stock code.
<br>For this algorithm, however, we only care about which products each customer bought, so we encode purchased items as 1 and everything else as 0.

In [13]:
# Encode purchased items as 1 and everything else (including NaN) as 0
customer_item_matrix = (customer_item_matrix > 0).astype(int)
customer_item_matrix.head()

StockCode,10002,10080,10120,10123C,10123G,10124A,10124G,10125,10133,10134,...,BANK CHARGES,C2,D,DOT,M,PADS,POST,SP1002,TEST001,TEST002
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12350.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
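
The pivot-and-encode step can be verified on a toy table where the result is easy to work out by hand. The sketch below uses made-up customers and stock codes. It uses a vectorized comparison for the 0/1 encoding, which treats missing cells (NaN) as "not purchased".

```python
import pandas as pd

# Made-up transactions: customer 1 buys product A twice and product B once
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3],
    "StockCode": ["A", "A", "B", "B", "C"],
    "Quantity": [2, 3, 1, 4, 1],
})

# Rows are customers, columns are products, cells are total quantities
quantities = toy.pivot_table(
    index="CustomerID", columns="StockCode", values="Quantity", aggfunc="sum"
)

# NaN > 0 evaluates to False, so never-purchased products become 0
purchased = quantities.gt(0).astype(int)
print(purchased)
```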


### 4. User-User Similarity Matrix
After building the customer-item matrix, the next step is to compute the similarity between every pair of users. Cosine similarity is a frequently used measure; we use the *cosine_similarity* function imported from scikit-learn.

The formula of cosine similarity between person A and person B:


$$similarity = \cos(U_A,U_B) = {\sum_{i=1}^n P_{Ai}P_{Bi} \over \sqrt{\sum_{i=1}^n P_{Ai}^2}\,\sqrt{\sum_{i=1}^n P_{Bi}^2}}$$

In this equation, $U_A$ and $U_B$ represent User A and User B, and $P_{Ai}$ and $P_{Bi}$ indicate whether User A and User B, respectively, have bought product $i$ (1 if bought, 0 otherwise).
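
A direct translation of the cosine formula into NumPy may help when reading it: the numerator is the dot product of the two purchase vectors, and the denominator is the product of their lengths. The function name `cosine_sim` and the two toy vectors are assumptions for illustration.

```python
import numpy as np

def cosine_sim(p_a, p_b):
    """Cosine similarity of two purchase vectors; 0.0 if either is all zeros."""
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    numerator = np.sum(p_a * p_b)
    denominator = np.sqrt(np.sum(p_a ** 2)) * np.sqrt(np.sum(p_b ** 2))
    return float(numerator / denominator) if denominator > 0 else 0.0

# Toy vectors over four products: A bought three, B bought two, two in common
p_a = [1, 1, 0, 1]
p_b = [1, 0, 0, 1]
sim = cosine_sim(p_a, p_b)
print(sim)  # 2 / sqrt(6), about 0.8165
```

For 0/1 vectors the formula reduces to the number of products both users bought, divided by the square root of the product of how many each bought.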

In [14]:
user_user_sim_matrix = pd.DataFrame(
    cosine_similarity(customer_item_matrix)
)

In [15]:
user_user_sim_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5822,5823,5824,5825,5826,5827,5828,5829,5830,5831
0,1.000000,0.000000,0.000000,0.131060,0.000000,0.000000,0.023002,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.070535,0.000000
1,0.000000,1.000000,0.053452,0.045502,0.043214,0.038881,0.031944,0.055728,0.023395,0.090351,...,0.037987,0.000000,0.067344,0.064820,0.102869,0.113961,0.067344,0.000000,0.076186,0.024398
2,0.000000,0.053452,1.000000,0.017025,0.048507,0.000000,0.023905,0.000000,0.026261,0.067612,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.138580,0.000000,0.000000,0.000000,0.054772
3,0.131060,0.045502,0.017025,1.000000,0.041292,0.037152,0.152617,0.071000,0.044710,0.057555,...,0.054447,0.000000,0.032174,0.020646,0.049147,0.140654,0.016087,0.000000,0.062399,0.038854
4,0.000000,0.043214,0.048507,0.041292,1.000000,0.000000,0.028989,0.050572,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.058824,0.000000,0.038782,0.000000,0.000000,0.029630,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5827,0.000000,0.113961,0.138580,0.140654,0.038782,0.034893,0.140153,0.066683,0.125976,0.018019,...,0.034091,0.044348,0.030218,0.077563,0.015386,1.000000,0.090655,0.015386,0.110698,0.087581
5828,0.000000,0.067344,0.000000,0.016087,0.000000,0.041239,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.090655,1.000000,0.054554,0.069264,0.120761
5829,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.034503,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.015386,0.054554,1.000000,0.105802,0.026352
5830,0.070535,0.076186,0.000000,0.062399,0.029630,0.026660,0.087612,0.000000,0.016042,0.082602,...,0.000000,0.000000,0.069264,0.000000,0.070535,0.110698,0.069264,0.105802,1.000000,0.066915


Each row and each column represents a customer. We are going to relabel them using the 'CustomerID' index of customer_item_matrix.

In [16]:
customer_item_matrix.index
#CustomerID

Float64Index([12346.0, 12347.0, 12348.0, 12349.0, 12350.0, 12351.0, 12352.0,
              12353.0, 12354.0, 12355.0,
              ...
              18278.0, 18279.0, 18280.0, 18281.0, 18282.0, 18283.0, 18284.0,
              18285.0, 18286.0, 18287.0],
             dtype='float64', name='CustomerID', length=5832)

In [17]:
user_user_sim_matrix.columns = customer_item_matrix.index
user_user_sim_matrix.index = customer_item_matrix.index

In [18]:
user_user_sim_matrix

CustomerID,12346.0,12347.0,12348.0,12349.0,12350.0,12351.0,12352.0,12353.0,12354.0,12355.0,...,18278.0,18279.0,18280.0,18281.0,18282.0,18283.0,18284.0,18285.0,18286.0,18287.0
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346.0,1.000000,0.000000,0.000000,0.131060,0.000000,0.000000,0.023002,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.070535,0.000000
12347.0,0.000000,1.000000,0.053452,0.045502,0.043214,0.038881,0.031944,0.055728,0.023395,0.090351,...,0.037987,0.000000,0.067344,0.064820,0.102869,0.113961,0.067344,0.000000,0.076186,0.024398
12348.0,0.000000,0.053452,1.000000,0.017025,0.048507,0.000000,0.023905,0.000000,0.026261,0.067612,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.138580,0.000000,0.000000,0.000000,0.054772
12349.0,0.131060,0.045502,0.017025,1.000000,0.041292,0.037152,0.152617,0.071000,0.044710,0.057555,...,0.054447,0.000000,0.032174,0.020646,0.049147,0.140654,0.016087,0.000000,0.062399,0.038854
12350.0,0.000000,0.043214,0.048507,0.041292,1.000000,0.000000,0.028989,0.050572,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.058824,0.000000,0.038782,0.000000,0.000000,0.029630,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18283.0,0.000000,0.113961,0.138580,0.140654,0.038782,0.034893,0.140153,0.066683,0.125976,0.018019,...,0.034091,0.044348,0.030218,0.077563,0.015386,1.000000,0.090655,0.015386,0.110698,0.087581
18284.0,0.000000,0.067344,0.000000,0.016087,0.000000,0.041239,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.090655,1.000000,0.054554,0.069264,0.120761
18285.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.034503,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.015386,0.054554,1.000000,0.105802,0.026352
18286.0,0.070535,0.076186,0.000000,0.062399,0.029630,0.026660,0.087612,0.000000,0.016042,0.082602,...,0.000000,0.000000,0.069264,0.000000,0.070535,0.110698,0.069264,0.105802,1.000000,0.066915


- The cosine similarity between Customer 12347 and Customer 12348 is 0.053452.
- Between Customer 12347 and Customer 12349 it is 0.045502.
- This suggests that Customer 12348 is more similar to Customer 12347 than Customer 12349 is.

### 5. Generate Recommendation List

These pairwise cosine similarity measures are what we are going to use for product recommendations. Let's work through one customer as an example.

In [19]:
# Customer 12350
sim_rank = user_user_sim_matrix.loc[12350.0].sort_values(ascending=False)
sim_rank

CustomerID
12350.0    1.000000
12568.0    0.216930
16886.0    0.171499
12503.0    0.171499
12814.0    0.171499
             ...   
15835.0    0.000000
15829.0    0.000000
15828.0    0.000000
15827.0    0.000000
12346.0    0.000000
Name: 12350.0, Length: 5832, dtype: float64

In [20]:
sim_rank.iloc[1]

0.21693045781865616
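
Taking position 1 of the sorted ranking assumes that the customer themselves always sorts first. That is not guaranteed if another customer has exactly the same set of purchases and therefore also scores 1.0. A slightly more defensive sketch drops the target before ranking. The similarity values and the `most_similar` helper below are made up for illustration.

```python
import pandas as pd

# Made-up symmetric similarity matrix for three customers
sim = pd.DataFrame(
    [[1.0, 0.2, 0.5],
     [0.2, 1.0, 0.0],
     [0.5, 0.0, 1.0]],
    index=[101, 102, 103],
    columns=[101, 102, 103],
)

def most_similar(sim_matrix, target_id, k=1):
    """Return the k most similar customers, excluding the target itself."""
    return sim_matrix.loc[target_id].drop(target_id).nlargest(k)

top = most_similar(sim, 101, k=2)
print(top)
```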

If we want to recommend products to, say, Customer 12568 (the target), based on the similarity between him/her and Customer 12350, the strategy is as follows:
- First, identify which products each of the two customers has already bought.
- Then, find out which products 12350 bought but 12568 didn't.

Since these two customers have bought similar items in the past, we assume that the target customer 12568 has a high chance of purchasing the items that he/she has not bought but Customer 12350 has.<br>
Lastly, we recommend this list of items to the target customer 12568.
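
The same strategy can also be expressed as a boolean mask over two rows of the customer-item matrix, as an alternative to comparing sets. The sketch below uses a made-up two-row matrix; the row labels `neighbor` and `target` are placeholders, not real customer IDs.

```python
import pandas as pd

# Made-up 0/1 purchase rows for a similar customer and the target customer
matrix = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 0, 1, 0]],
    index=["neighbor", "target"],
    columns=["A", "B", "C", "D"],
)

# True where the neighbor bought the product and the target did not
mask = (matrix.loc["neighbor"] == 1) & (matrix.loc["target"] == 0)
to_recommend = matrix.columns[mask.to_numpy()].tolist()
print(to_recommend)  # ['B', 'D']
```

Unlike a set difference, the mask keeps the products in column order, which makes the output reproducible.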

In [21]:
customer_item_matrix.loc[12350.0]

StockCode
10002      0
10080      0
10120      0
10123C     0
10123G     0
          ..
PADS       0
POST       1
SP1002     0
TEST001    0
TEST002    0
Name: 12350.0, Length: 4589, dtype: int64

In [22]:
customer_item_matrix.loc[12350.0].iloc[customer_item_matrix.loc[12350.0].to_numpy().nonzero()[0]]

StockCode
20615     1
20652     1
21171     1
21832     1
21864     1
21866     1
21908     1
21915     1
22348     1
22412     1
22551     1
22557     1
22620     1
79066K    1
79191C    1
84086C    1
POST      1
Name: 12350.0, dtype: int64

In [23]:
# Products Customer 12350 has already purchased
items_bought_by_12350=set(customer_item_matrix.loc[12350.0].iloc[customer_item_matrix.loc[12350.0].to_numpy().nonzero()[0]].index)
items_bought_by_12350

{'20615',
 '20652',
 '21171',
 '21832',
 '21864',
 '21866',
 '21908',
 '21915',
 '22348',
 '22412',
 '22551',
 '22557',
 '22620',
 '79066K',
 '79191C',
 '84086C',
 'POST'}

In [24]:
# Products Customer 12568 has already purchased
items_bought_by_12568=set(customer_item_matrix.loc[12568.0].iloc[customer_item_matrix.loc[12568.0].to_numpy().nonzero()[0]].index)
items_bought_by_12568

{'16161P', '20676', '22348', '47570', 'POST'}

Find out the items 12350 has bought but 12568 hasn't

In [25]:
items_to_recommend_to_12568=items_bought_by_12350-items_bought_by_12568
items_to_recommend_to_12568

{'20615',
 '20652',
 '21171',
 '21832',
 '21864',
 '21866',
 '21908',
 '21915',
 '22412',
 '22551',
 '22557',
 '22620',
 '79066K',
 '79191C',
 '84086C'}
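
As a sanity check, the similarity of 0.21693 between the two customers can be reproduced from the two sets printed above. For 0/1 purchase vectors, cosine similarity equals the number of shared products divided by the square root of the product of the two set sizes. The variable names in this sketch are chosen for illustration.

```python
import math

# The two purchase sets exactly as printed in the outputs above
items_12350 = {'20615', '20652', '21171', '21832', '21864', '21866', '21908',
               '21915', '22348', '22412', '22551', '22557', '22620', '79066K',
               '79191C', '84086C', 'POST'}
items_12568 = {'16161P', '20676', '22348', '47570', 'POST'}

# Shared products form the numerator of the cosine formula for 0/1 vectors
common = items_12350 & items_12568
cos_from_sets = len(common) / math.sqrt(len(items_12350) * len(items_12568))

# Products 12350 bought that 12568 has not
recommend = items_12350 - items_12568

print(sorted(common), round(cos_from_sets, 6), len(recommend))
```

The two customers share 2 products out of 17 and 5, giving 2 / sqrt(85), which agrees with the value in the similarity matrix.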

In [26]:
df.loc[df['StockCode'].isin(items_to_recommend_to_12568),
      ['StockCode','Description']
      ].drop_duplicates().set_index('StockCode')

Unnamed: 0_level_0,Description
StockCode,Unnamed: 1_level_1
21171,BATHROOM METAL SIGN
21864,UNION JACK FLAG PASSPORT COVER
21908,CHOCOLATE THIS WAY METAL SIGN
21832,CHOCOLATE CALCULATOR
20652,BLUE SPOTTY LUGGAGE TAG
20615,BLUE SPOTTY PASSPORT COVER
79066K,RETRO MOD TRAY
21866,UNION JACK FLAG LUGGAGE TAG
79191C,RETRO PLASTIC ELEPHANT TRAY
22412,METAL SIGN NEIGHBOURHOOD WITCH
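
If a stock code appears with more than one spelling of its description, dropping duplicates on both columns keeps one row per spelling. A possible variant, sketched below on made-up data, keeps only the first description per stock code. `describe_items` is a hypothetical helper name.

```python
import pandas as pd

# Made-up catalogue rows; stock code A appears with two description spellings
toy = pd.DataFrame({
    "StockCode": ["A", "A", "B", "C"],
    "Description": ["RED MUG", "RED MUG ", "BLUE MUG", "GREEN MUG"],
})

def describe_items(purchases, codes):
    """Map each requested stock code to its first description in the data."""
    return (
        purchases.loc[purchases["StockCode"].isin(codes), ["StockCode", "Description"]]
                 .drop_duplicates(subset="StockCode")   # one row per stock code
                 .set_index("StockCode")["Description"]
    )

lookup = describe_items(toy, {"A", "C"})
print(lookup)
```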


### 6. Automation

In [27]:
def get_user_user_sim_matrix(customer_item_matrix):
    user_user_sim_matrix = pd.DataFrame(cosine_similarity(customer_item_matrix))
    user_user_sim_matrix.columns = customer_item_matrix.index
    user_user_sim_matrix.index = customer_item_matrix.index
    
    return user_user_sim_matrix

In [28]:
def get_recommendations(target_id, user_user_sim_matrix, customer_item_matrix):
    # Rank the other customers by similarity. The target is dropped first so that
    # position 0 is the nearest neighbour even if another customer also scores 1.0
    sim_rank = user_user_sim_matrix.loc[target_id].drop(target_id).sort_values(ascending=False)

    sim_userid = sim_rank.index[0]
    cos_value = sim_rank.iloc[0]

    # Products each customer has bought, i.e. the non-zero cells of their row
    target_row = customer_item_matrix.loc[target_id]
    sim_row = customer_item_matrix.loc[sim_userid]
    items_bought_by_target_id = set(target_row[target_row > 0].index)
    items_bought_by_sim_userid = set(sim_row[sim_row > 0].index)

    # Recommend what the most similar customer bought but the target has not
    items_to_recommend_to_target_id = items_bought_by_sim_userid - items_bought_by_target_id

    # One row per recommended product; the scalar values are repeated on every row
    recommend_df = pd.DataFrame({
        'target_id': target_id,
        'sim_userid': sim_userid,
        'cos_value': cos_value,
        'recommend_prod': list(items_to_recommend_to_target_id),
    })

    return recommend_df

In [29]:
user_user_sim_matrix = get_user_user_sim_matrix(customer_item_matrix)

In [30]:
get_recommendations(12350.0, user_user_sim_matrix,customer_item_matrix)

Unnamed: 0,target_id,sim_userid,cos_value,recommend_prod
0,12350.0,12568.0,0.21693,20676
1,12350.0,12568.0,0.21693,47570
2,12350.0,12568.0,0.21693,16161P
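
To produce recommendations for many customers at once, the same nearest-neighbour logic can be applied to each customer and the results concatenated. The sketch below is self-contained and runs on a made-up 3x5 matrix. `recommend_from_nearest` is a compact, hypothetical stand-in for the function above, and the similarity matrix is computed with NumPy so the example has no other dependencies.

```python
import numpy as np
import pandas as pd

# Made-up 0/1 customer-item matrix
matrix = pd.DataFrame(
    [[1, 1, 0, 0, 1],
     [1, 1, 1, 0, 0],
     [0, 0, 1, 1, 0]],
    index=[1, 2, 3],
    columns=["A", "B", "C", "D", "E"],
)

# Pairwise cosine similarity: dot products divided by the product of row norms
values = matrix.to_numpy(dtype=float)
norms = np.linalg.norm(values, axis=1)
sim = pd.DataFrame(
    values @ values.T / np.outer(norms, norms),
    index=matrix.index,
    columns=matrix.index,
)

def recommend_from_nearest(target_id, sim, matrix):
    """Recommend what the single most similar customer bought and the target did not."""
    neighbor = sim.loc[target_id].drop(target_id).idxmax()
    mask = (matrix.loc[neighbor] == 1) & (matrix.loc[target_id] == 0)
    return pd.DataFrame({
        "target_id": target_id,
        "sim_userid": neighbor,
        "cos_value": sim.loc[target_id, neighbor],
        "recommend_prod": matrix.columns[mask.to_numpy()].tolist(),
    })

# Run for every customer and stack the per-customer tables
all_recs = pd.concat(
    [recommend_from_nearest(c, sim, matrix) for c in matrix.index],
    ignore_index=True,
)
print(all_recs)
```

On the full dataset the loop would run once per customer, so it may be worth restricting it to the customers of interest.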
