# Testing Notebook

## Some Theory about Recommender Systems

The main families of methods for RecSys are:

- Collaborative Filtering: This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.

- Content-Based Filtering: This method uses only the descriptions and attributes of the items a user has previously consumed to model that user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, candidate items are compared with items previously rated by the user, and the best-matching items are recommended.

- Hybrid methods: Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content-based filtering, can be more effective than pure approaches in some cases. These methods can also be used to overcome some of the common problems in recommender systems, such as cold start and sparsity.

Source: https://www.kaggle.com/code/gspmoreira/recommender-systems-in-python-101
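
To make the content-based idea concrete, here is a minimal sketch on a hypothetical four-item catalogue (the item names and one-hot attributes are invented for illustration and do not come from the dataset). The user profile is the average of the attribute vectors of the consumed items, and the remaining items are ranked by cosine similarity to that profile.

```python
import numpy as np

# Hypothetical toy catalogue: one row per item, one-hot attribute columns
# (is_top, is_skirt, is_black, is_white). Not taken from the dataset.
item_names = ["black top", "white top", "black skirt", "white skirt"]
item_features = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
], dtype=float)

# Assume the user has consumed only the black top (item 0).
consumed = [0]
user_profile = item_features[consumed].mean(axis=0)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Score every item the user has not consumed yet.
scores = {
    name: cosine(user_profile, item_features[i])
    for i, name in enumerate(item_names)
    if i not in consumed
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Items sharing one attribute with the consumed item score higher than the item sharing none, which is the behaviour the content-based description implies.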

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

In [2]:
import scipy
import math
import random
import sklearn
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from sklearn.preprocessing import MinMaxScaler

## 1. ItemBased Collaborative Filter Recommendation

Example: https://www.kaggle.com/code/hendraherviawan/itembased-collaborative-filter-recommendation-r/report

### 2.1 Preprocessing

In [3]:
articles = pd.read_csv("data/articles.csv")
articles.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [7]:
#keep only the columns of interest
articles = articles[['article_id', 'product_type_name', 'graphical_appearance_name', 'perceived_colour_master_name', 'section_name']]

In [9]:
articles.head()

Unnamed: 0,article_id,product_type_name,graphical_appearance_name,perceived_colour_master_name,section_name
0,108775015,Vest top,Solid,Black,Womens Everyday Basics
1,108775044,Vest top,Solid,White,Womens Everyday Basics
2,108775051,Vest top,Stripe,White,Womens Everyday Basics
3,110065001,Bra,Solid,Black,Womens Lingerie
4,110065002,Bra,Solid,White,Womens Lingerie


In [10]:
customers = pd.read_csv("data/customers.csv")
customers.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [11]:
#keep only the columns of interest
customers = customers[['customer_id', 'fashion_news_frequency', 'age']]

In [12]:
customers.head()

Unnamed: 0,customer_id,fashion_news_frequency,age
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,NONE,49.0
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,NONE,25.0
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,NONE,24.0
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,NONE,54.0
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,Regularly,52.0


In [15]:
transactions = pd.read_csv("data/transactions_train.csv")
transactions.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001.0,0.050831,2.0
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023.0,0.030492,2.0
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004.0,0.015237,2.0
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003.0,0.016932,2.0
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004.0,0.016932,2.0


In [16]:
#keep only the columns of interest
transactions = transactions[['customer_id', 'article_id', 'price']]
transactions.head()

Unnamed: 0,customer_id,article_id,price
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001.0,0.050831
1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023.0,0.030492
2,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004.0,0.015237
3,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003.0,0.016932
4,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004.0,0.016932


In [17]:
#article_id is a float. First convert to int and then to string.
transactions['article_id'] = transactions['article_id'].astype("Int64").astype(str) 

### 2.2 Build single dataframe: Articles + Transactions + Customers

#### Create Transactions subset for testing

In [18]:
transactions_subset = transactions.sample(100000, random_state=0)
#transactions_subset = transactions

In [19]:
transactions_subset.head()

Unnamed: 0,customer_id,article_id,price
7347469,1caf62af1fca9d95f42315b7f7866d4b1e10dc9068035b...,504154020,0.022017
15356593,9e5f9779bdc2cd9a5aad66222034ae6b25d2cb6fbaa568...,758611002,0.053373
718615,2fb4e6e4b1b586a3570991143636ac179cb4acd1e2f263...,568808001,0.031763
4621974,532a4be8bb46d8e3d6cfb2271b7151d390fccb8b0e5a2f...,599719008,0.020322
11611938,236715db7ed237a6b0b7207ee804eaeff11d937f9eb561...,630994004,0.037169


#### Join Transactions and Articles dataframes

In [20]:
#transactions_articles_joined = transactions_subset.set_index('article_id').join(articles.set_index('article_id'))
transactions_articles_joined = transactions_subset.join(articles.set_index('article_id'), on='article_id')

In [21]:
transactions_articles_joined.head()

Unnamed: 0,customer_id,article_id,price,product_type_name,graphical_appearance_name,perceived_colour_master_name,section_name
7347469,1caf62af1fca9d95f42315b7f7866d4b1e10dc9068035b...,504154020,0.022017,Sweater,Melange,Grey,Womens Everyday Collection
15356593,9e5f9779bdc2cd9a5aad66222034ae6b25d2cb6fbaa568...,758611002,0.053373,Skirt,All over pattern,Yellow,Womens Everyday Collection
718615,2fb4e6e4b1b586a3570991143636ac179cb4acd1e2f263...,568808001,0.031763,Trousers,Solid,Black,Womens Tailoring
4621974,532a4be8bb46d8e3d6cfb2271b7151d390fccb8b0e5a2f...,599719008,0.020322,Skirt,All over pattern,Unknown,Womens Casual
11611938,236715db7ed237a6b0b7207ee804eaeff11d937f9eb561...,630994004,0.037169,Swimsuit,Solid,Black,"Womens Swimwear, beachwear"
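
A join like the one above only matches rows when the key has the same dtype on both sides. The following is a minimal sketch on toy data (the frames `transactions_toy` and `articles_toy` are hypothetical) that casts both keys to strings and uses `merge` with `indicator=True` to count transactions that found no matching article.

```python
import pandas as pd

# Toy stand-ins for the real frames; the last article_id has no match on purpose.
transactions_toy = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "article_id": [108775015.0, 108775044.0, 999.0],
})
articles_toy = pd.DataFrame({
    "article_id": [108775015, 108775044],
    "product_type_name": ["Vest top", "Vest top"],
})

# Bring both keys to the same dtype (string) before joining.
transactions_toy["article_id"] = transactions_toy["article_id"].astype("Int64").astype(str)
articles_toy["article_id"] = articles_toy["article_id"].astype(str)

joined_toy = transactions_toy.merge(
    articles_toy,
    on="article_id",
    how="left",
    validate="many_to_one",  # raises if article_id is duplicated in articles_toy
    indicator=True,          # adds a _merge column describing each row's origin
)
unmatched = (joined_toy["_merge"] == "left_only").sum()
print(unmatched)
```

A non-zero `unmatched` count on the real data would suggest a key mismatch worth investigating before building features.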


#### Join Transactions-Articles with Customers dataframes

In [30]:
# Join also customer info

#trans_arts_cust_joined = transactions_articles_joined.set_index('customer_id').join(customers.set_index('customer_id'))
features_joined = transactions_articles_joined.join(customers.set_index('customer_id'), on='customer_id')
features_joined.head()

#The index of the joined dataframe is the original row index of transactions

Unnamed: 0,customer_id,article_id,price,product_type_name,graphical_appearance_name,perceived_colour_master_name,section_name,fashion_news_frequency,age
7347469,1caf62af1fca9d95f42315b7f7866d4b1e10dc9068035b...,504154020,0.022017,Sweater,Melange,Grey,Womens Everyday Collection,NONE,27.0
15356593,9e5f9779bdc2cd9a5aad66222034ae6b25d2cb6fbaa568...,758611002,0.053373,Skirt,All over pattern,Yellow,Womens Everyday Collection,Regularly,40.0
718615,2fb4e6e4b1b586a3570991143636ac179cb4acd1e2f263...,568808001,0.031763,Trousers,Solid,Black,Womens Tailoring,Regularly,66.0
4621974,532a4be8bb46d8e3d6cfb2271b7151d390fccb8b0e5a2f...,599719008,0.020322,Skirt,All over pattern,Unknown,Womens Casual,NONE,46.0
11611938,236715db7ed237a6b0b7207ee804eaeff11d937f9eb561...,630994004,0.037169,Swimsuit,Solid,Black,"Womens Swimwear, beachwear",Regularly,24.0


#### Check if join has been done correctly

In [23]:
#1. check that customer_id are repeated (some customers bought multiple items)
#Number of products purchased by each customer
grouped = features_joined.groupby("customer_id")["customer_id"].count().reset_index(name='counts').sort_values(by='counts', ascending=False)
grouped

Unnamed: 0,customer_id,counts
60296,b14bfba3ae0da5af6e9711059773acf713cd7bb9a2c940...,10
33987,63c984b674ff20a21f6c9eb4fe3139ceecf8526bc1559a...,7
16457,30d1e9b6378a74a740f64c3d34f1686693d0430b03c6cd...,7
26235,4d5b026a812d7d24914e5e80db9ce647600de7860b43d7...,7
70637,d00063b94dcb1342869d4994844a2742b5d62927f36843...,6
...,...,...
30945,5b0920f2e5fb360b49e19fc16f4d63f492159e583b77ff...,1
30944,5b090dcd009546f5907752e7ae5b727a4b4ff114da0366...,1
30943,5b07b76f7c42a4521ceca743140a9f91f4d20b0e7ffb1c...,1
30942,5b06d4d47173bcfea570861da81e9adfccba4ce950d2f4...,1
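
The same per-customer count can probably be obtained more compactly with `value_counts`, which already sorts in descending order. A small sketch on toy data (the frame `toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"customer_id": ["a", "b", "a", "c", "a", "b"]})

# value_counts counts rows per customer and sorts by count, largest first.
counts = (
    toy["customer_id"]
    .value_counts()
    .rename_axis("customer_id")
    .reset_index(name="counts")
)
print(counts)
```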


In [31]:
#2. check that article_id are repeated (different customers bought same item)
#Number of times the products were purchased by the customers
grouped = features_joined.groupby("article_id")["article_id"].count().reset_index(name='counts').sort_values(by='counts', ascending=False)
grouped

Unnamed: 0,article_id,counts
18313,706016002,120
18312,706016001,105
6927,610776002,99
4358,565379001,93
4112,562245001,93
...,...,...
12027,659940001,1
12028,659950004,1
12029,659953001,1
12031,659955001,1


In [32]:
#3. Check duplicated rows
features_joined.duplicated().sum()

113

Duplicate rows correspond to multiple purchases of the same item by the same client. 
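
If repeated purchases should be kept as information rather than as duplicate rows, one option is to aggregate them into a quantity per customer and article. A minimal sketch on toy data, assuming the same column names as above (the frame `toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "article_id": ["111", "111", "222", "111"],
    "price": [0.02, 0.02, 0.03, 0.02],
})

# Fully identical rows: the same customer bought the same article at the same price.
n_duplicates = toy.duplicated().sum()

# Collapse repeated purchases into one row per (customer, article) pair.
quantities = (
    toy.groupby(["customer_id", "article_id"])
    .size()
    .reset_index(name="Quantity")
)
print(n_duplicates)
print(quantities)
```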

In [33]:
features_joined.head()

Unnamed: 0,customer_id,article_id,price,product_type_name,graphical_appearance_name,perceived_colour_master_name,section_name,fashion_news_frequency,age
7347469,1caf62af1fca9d95f42315b7f7866d4b1e10dc9068035b...,504154020,0.022017,Sweater,Melange,Grey,Womens Everyday Collection,NONE,27.0
15356593,9e5f9779bdc2cd9a5aad66222034ae6b25d2cb6fbaa568...,758611002,0.053373,Skirt,All over pattern,Yellow,Womens Everyday Collection,Regularly,40.0
718615,2fb4e6e4b1b586a3570991143636ac179cb4acd1e2f263...,568808001,0.031763,Trousers,Solid,Black,Womens Tailoring,Regularly,66.0
4621974,532a4be8bb46d8e3d6cfb2271b7151d390fccb8b0e5a2f...,599719008,0.020322,Skirt,All over pattern,Unknown,Womens Casual,NONE,46.0
11611938,236715db7ed237a6b0b7207ee804eaeff11d937f9eb561...,630994004,0.037169,Swimsuit,Solid,Black,"Womens Swimwear, beachwear",Regularly,24.0


### 2.3 Manage Null values

In [34]:
features_joined.isnull().sum()

customer_id                       0
article_id                        0
price                             0
product_type_name                 0
graphical_appearance_name         0
perceived_colour_master_name      0
section_name                      0
fashion_news_frequency          414
age                             504
dtype: int64

In [35]:
#Replace missing ages with the median age
median_age = features_joined['age'].median()
features_joined['age'] = features_joined['age'].fillna(median_age)

In [36]:
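#Replace missing fashion_news_frequency values with the most common value (this column only)
most_common_frequency = features_joined['fashion_news_frequency'].value_counts().index[0]
features_joined['fashion_news_frequency'] = features_joined['fashion_news_frequency'].fillna(most_common_frequency)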

In [37]:
features_joined.isnull().sum()

customer_id                     0
article_id                      0
price                           0
product_type_name               0
graphical_appearance_name       0
perceived_colour_master_name    0
section_name                    0
fashion_news_frequency          0
age                             0
dtype: int64
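
The two imputation steps can also be written as a single `fillna` call with a per-column dictionary, which keeps each fill value tied to its own column. A minimal sketch on toy data (the frame `toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "fashion_news_frequency": ["NONE", "Regularly", None, "NONE"],
    "age": [20.0, np.nan, 40.0, 60.0],
})

median_age = toy["age"].median()
most_common_frequency = toy["fashion_news_frequency"].mode().iloc[0]

# Each column gets its own fill value; other columns are left untouched.
toy_filled = toy.fillna({
    "age": median_age,
    "fashion_news_frequency": most_common_frequency,
})
print(toy_filled)
```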

### 2.4 Manage Categorical Columns

#### Explore levels of the categorical variables

In [40]:
features_joined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 5892708 to 13284337
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   customer_id                 20000 non-null  object 
 1   article_id                  20000 non-null  object 
 2   price                       20000 non-null  float64
 3   sales_channel_id            20000 non-null  float64
 4   product_code                20000 non-null  object 
 5   product_type_no             20000 non-null  object 
 6   graphical_appearance_no     20000 non-null  object 
 7   colour_group_code           20000 non-null  object 
 8   perceived_colour_value_id   20000 non-null  object 
 9   perceived_colour_master_id  20000 non-null  object 
 10  department_no               20000 non-null  object 
 11  index_code                  20000 non-null  object 
 12  index_group_no              20000 non-null  object 
 13  section_no            

#### Handle categorical variables

In [42]:
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder

# # apply 1-hot encoding to categorical predictors
# pip = ColumnTransformer([
#         ("cat", OneHotEncoder(), ["article_id", "sales_channel_id", "product_code", "product_type_no", "graphical_appearance_no", "colour_group_code", 
#                                   "perceived_colour_value_id", "perceived_colour_master_id", "index_code", "index_group_no", "section_no",
#                                   "garment_group_no", "club_member_status", "fashion_news_frequency"]),
#     ], remainder='drop')

In [None]:
# features_prepared = pd.DataFrame(pip.fit_transform(features_joined))
# features_prepared

In [38]:
features_prepared = features_joined
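
The one-hot encoding step is commented out above and uses scikit-learn's `ColumnTransformer` and `OneHotEncoder`. As a lighter alternative for quick experiments, the sketch below swaps in `pd.get_dummies` on a toy frame; the frame `toy` and the choice of columns are assumptions made for illustration, not the notebook's final feature set.

```python
import pandas as pd

toy = pd.DataFrame({
    "fashion_news_frequency": ["NONE", "Regularly", "NONE"],
    "section_name": ["Womens Casual", "Womens Tailoring", "Womens Casual"],
    "age": [27.0, 40.0, 66.0],
})

categorical_columns = ["fashion_news_frequency", "section_name"]

# Each categorical column is replaced by one 0/1 column per level;
# numeric columns such as age pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=categorical_columns, dtype=int)
print(toy_encoded.columns.tolist())
```

High-cardinality columns such as `article_id` would produce one column per distinct value, so they are probably better handled through the customer-item matrix than through one-hot encoding.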

### 2.5 Machine Learning

https://www.datasource.ai/uploads/6b86b1630562b323a26143f90d97fe08.html

#### 2.5.1 Collaborative filtering

Build the customer-item matrix: one row per customer and one column per article.

In [39]:
#df = features_prepared.reset_index()
features_prepared_sample = features_prepared[0:100000].reset_index()

In [41]:
features_prepared_sample.head()

Unnamed: 0,index,customer_id,article_id,price,product_type_name,graphical_appearance_name,perceived_colour_master_name,section_name,fashion_news_frequency,age
0,7347469,1caf62af1fca9d95f42315b7f7866d4b1e10dc9068035b...,504154020,0.022017,Sweater,Melange,Grey,Womens Everyday Collection,NONE,27.0
1,15356593,9e5f9779bdc2cd9a5aad66222034ae6b25d2cb6fbaa568...,758611002,0.053373,Skirt,All over pattern,Yellow,Womens Everyday Collection,Regularly,40.0
2,718615,2fb4e6e4b1b586a3570991143636ac179cb4acd1e2f263...,568808001,0.031763,Trousers,Solid,Black,Womens Tailoring,Regularly,66.0
3,4621974,532a4be8bb46d8e3d6cfb2271b7151d390fccb8b0e5a2f...,599719008,0.020322,Skirt,All over pattern,Unknown,Womens Casual,NONE,46.0
4,11611938,236715db7ed237a6b0b7207ee804eaeff11d937f9eb561...,630994004,0.037169,Swimsuit,Solid,Black,"Womens Swimwear, beachwear",Regularly,24.0


In [42]:
#Get counts of each sold article
grouped = features_prepared_sample.groupby("article_id")["article_id"].count().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
grouped

Unnamed: 0,article_id,counts
18313,706016002,120
18312,706016001,105
6927,610776002,99
4358,565379001,93
4112,562245001,93
...,...,...
12027,659940001,1
12028,659950004,1
12029,659953001,1
12031,659955001,1


In [43]:
#Get counts of each customer to see if same customer has purchased more than once
grouped = features_prepared_sample.groupby("customer_id")["customer_id"].count().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
grouped

Unnamed: 0,customer_id,counts
60296,b14bfba3ae0da5af6e9711059773acf713cd7bb9a2c940...,10
33987,63c984b674ff20a21f6c9eb4fe3139ceecf8526bc1559a...,7
16457,30d1e9b6378a74a740f64c3d34f1686693d0430b03c6cd...,7
26235,4d5b026a812d7d24914e5e80db9ce647600de7860b43d7...,7
70637,d00063b94dcb1342869d4994844a2742b5d62927f36843...,6
...,...,...
30945,5b0920f2e5fb360b49e19fc16f4d63f492159e583b77ff...,1
30944,5b090dcd009546f5907752e7ae5b727a4b4ff114da0366...,1
30943,5b07b76f7c42a4521ceca743140a9f91f4d20b0e7ffb1c...,1
30942,5b06d4d47173bcfea570861da81e9adfccba4ce950d2f4...,1


In [44]:
grouped.iloc[0]['customer_id']

'b14bfba3ae0da5af6e9711059773acf713cd7bb9a2c940ddf570affb715988a0'

In [None]:
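#Add column quantity (each transaction row counts as one purchased unit)
features_prepared_sample['Quantity'] = 1
#Alias used by the cells below
df = features_prepared_sample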

In [195]:
df.head()

Unnamed: 0,index,customer_id,article_id,price,sales_channel_id,product_code,product_type_no,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,club_member_status,fashion_news_frequency,age,Quantity
0,0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2.0,663713,283,1010016,9,4,5,1338,B,1,61,1017,ACTIVE,NONE,24.0,1
1,1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2.0,541518,306,1010016,51,1,4,1334,B,1,61,1017,ACTIVE,NONE,24.0,1
2,2,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2.0,505221,252,1010010,52,2,4,5963,D,2,58,1003,ACTIVE,Regularly,32.0,1
3,3,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2.0,685687,252,1010010,52,7,4,3090,A,1,15,1023,ACTIVE,Regularly,32.0,1
4,4,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2.0,685687,252,1010010,93,4,19,3090,A,1,15,1023,ACTIVE,Regularly,32.0,1


In [196]:
customer_item_matrix = df.pivot_table(
    index='customer_id', 
    columns='article_id', 
    values='Quantity',
    aggfunc='sum'
)

In [197]:
customer_item_matrix

article_id,108775015,108775044,108775051,110065001,110065002,110065011,111565001,111586001,111593001,111609001,...,723595001,723595002,724281001,725253001,727754001,728111001,728146001,728162001,728162002,729931001
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,,,,,,,,,,,...,,,,,,,,,,
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,,,,,,,,,,,...,,,,,,,,,,
00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2,,,,,,,,,,,...,,,,,,,,,,
0003abe64294e66a6310c3436fa9e5b754cc5603deef4f26fc8ab8d043af9358,,,,,,,,,,,...,,,,,,,,,,
0004068f54dbe1c7054b23c615edc5f733a508ecc54930bf323209f20410898c,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fff3e75605ec575be9b95eda1e6557299e81bba12668d750c0e973528e48b7ee,,,,,,,,,,,...,,,,,,,,,,
fff4b145d7469e023b147b0f8375c565b1be43944987792153ccc0af41466cf3,,,,,,,,,,,...,,,,,,,,,,
fff627c97a69e53afb4a2b49a3ebf7fa06660afaac959b46e8080849008fe17c,,,,,,,,,,,...,,,,,,,,,,
fff969b13a1c848d53ae3f08f111bfebcdcf6cd27e3815235db95f1e99524c79,,,,,,,,,,,...,,,,,,,,,,


We now have a matrix in which each row is a customer, each column is an article, and each cell holds the total quantity of that article purchased by that customer (NaN where the customer never bought it).

Let's encode this data as 0/1, so that a value of 1 means that the given product was bought by the given customer, and a value of 0 means that it was never bought by that customer. Take a look at the following code:

In [198]:
customer_item_matrix = (customer_item_matrix > 0).astype(int)  #NaN > 0 is False, so missing cells become 0
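
To see what the pivot and the 0/1 encoding do, here is a minimal sketch on a toy transaction table (the frame `toy` is hypothetical). Customer `a` buys article `111` twice, so the summed quantity is 2 before encoding and 1 after it.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "article_id": ["111", "111", "222", "333"],
    "Quantity": [1, 1, 1, 1],
})

# Rows are customers, columns are articles, cells are summed quantities.
# Customer/article pairs with no purchase are NaN.
toy_matrix = toy.pivot_table(
    index="customer_id",
    columns="article_id",
    values="Quantity",
    aggfunc="sum",
)

# Comparisons with NaN are False, so unpurchased cells become 0.
toy_binary = (toy_matrix > 0).astype(int)
print(toy_binary)
```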

In [200]:
customer_item_matrix

article_id,108775015,108775044,108775051,110065001,110065002,110065011,111565001,111586001,111593001,111609001,...,723595001,723595002,724281001,725253001,727754001,728111001,728146001,728162001,728162002,729931001
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0003abe64294e66a6310c3436fa9e5b754cc5603deef4f26fc8ab8d043af9358,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0004068f54dbe1c7054b23c615edc5f733a508ecc54930bf323209f20410898c,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fff3e75605ec575be9b95eda1e6557299e81bba12668d750c0e973528e48b7ee,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fff4b145d7469e023b147b0f8375c565b1be43944987792153ccc0af41466cf3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fff627c97a69e53afb4a2b49a3ebf7fa06660afaac959b46e8080849008fe17c,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fff969b13a1c848d53ae3f08f111bfebcdcf6cd27e3815235db95f1e99524c79,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
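
The 0/1 matrix above is almost entirely zeros, so a sparse representation may be worth considering on larger samples. The following is a minimal sketch, assuming toy data and using `pd.factorize` together with the `csr_matrix` class that the notebook already imports; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix

toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "article_id": ["111", "111", "222", "333"],
})

# Map each customer and article to an integer position.
customer_codes, customer_ids = pd.factorize(toy["customer_id"])
article_codes, article_ids = pd.factorize(toy["article_id"])

# Duplicate (row, column) pairs are summed, which gives purchase counts.
counts = csr_matrix(
    ([1] * len(toy), (customer_codes, article_codes)),
    shape=(len(customer_ids), len(article_ids)),
)

# 0/1 encoding: 1 wherever at least one purchase was recorded.
binary = (counts > 0).astype(int)
print(binary.toarray())
```

`sklearn.metrics.pairwise.cosine_similarity` accepts sparse input, so a matrix built this way could be passed to it directly; `customer_ids` keeps the row labels for later lookups.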


In [201]:
#Test. Check that the matrix has been built correctly
#The row sum is the number of distinct articles bought by customer 2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13
customer_item_matrix.loc['2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13'].sum()

53

Calculate the cosine similarities between users

In [202]:
from sklearn.metrics.pairwise import cosine_similarity

In [203]:
user_user_sim_matrix = pd.DataFrame(
    cosine_similarity(customer_item_matrix)
)
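
For reference, the cosine similarity between two purchase vectors is their dot product divided by the product of their norms. The sketch below computes it by hand with NumPy on a toy 0/1 matrix (hypothetical data) and labels both the rows and the columns with the customer ids in one step.

```python
import numpy as np
import pandas as pd

toy_binary = pd.DataFrame(
    [[1, 1, 0], [1, 0, 0], [0, 0, 1]],
    index=["a", "b", "c"],
    columns=["111", "222", "333"],
)

values = toy_binary.to_numpy(dtype=float)

# Scale every row to unit length; this assumes no customer row is all zeros.
norms = np.linalg.norm(values, axis=1, keepdims=True)
unit_rows = values / norms

# The dot products of unit vectors are the cosine similarities.
toy_sim = pd.DataFrame(
    unit_rows @ unit_rows.T,
    index=toy_binary.index,
    columns=toy_binary.index,
)
print(toy_sim.round(4))
```

Customers `a` and `b` share one of `a`'s two articles, giving a similarity of 1/sqrt(2); customers with no article in common get 0.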

In [204]:
user_user_sim_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28307,28308,28309,28310,28311,28312,28313,28314,28315,28316
0,1.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
1,0.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
2,0.0,0.0,1.0,0.0,0.0,0.316228,0.0,0.0,0.000000,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
3,0.0,0.0,0.0,1.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
4,0.0,0.0,0.0,0.0,1.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28312,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.00000,0.0,0.00000,0.0
28313,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.149071,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.00000,0.0,0.57735,0.0
28314,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,1.0,0.00000,0.0
28315,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,1.00000,0.0


In [205]:
user_user_sim_matrix.columns = customer_item_matrix.index

In [206]:
user_user_sim_matrix

customer_id,0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2,0003abe64294e66a6310c3436fa9e5b754cc5603deef4f26fc8ab8d043af9358,0004068f54dbe1c7054b23c615edc5f733a508ecc54930bf323209f20410898c,0006d37aaf7dd84f9bbc02f6cadcb74fd72ebf370bdc5f110a8a4092aa7e173e,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4c73235dccbbc132280,0008968c0d451dbc5a9968da03196fe20051965edde7413775c4eb3be9abe9c2,000aa7f0dc06cd7174389e76c9e132a67860c5f65f970699daccc14425ac31a8,000b872410f5ac2064acb999a1e0a7db4c1b5007ecaa7bcdc0a0e9006fa5f968,...,ffe6376eb6b854d842e5a7714ea758de127f086a60d67d5cf425ef20361acea1,ffefe95a1c711b634023279e0bc7180d5991d4558fa036e7d5ac77cc3348d171,fff04954c6e484a8deb5ec475e581aefd25d5850d1886f6c0198edaa9b67c958,fff0ac18093a702a0a06f4cc76582632df3ede9a36556e345150befbeed6885a,fff15526121f7d914a54784e68761a1d30b7547e3555738dcceb386eaaa24c4b,fff3e75605ec575be9b95eda1e6557299e81bba12668d750c0e973528e48b7ee,fff4b145d7469e023b147b0f8375c565b1be43944987792153ccc0af41466cf3,fff627c97a69e53afb4a2b49a3ebf7fa06660afaac959b46e8080849008fe17c,fff969b13a1c848d53ae3f08f111bfebcdcf6cd27e3815235db95f1e99524c79,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264
0,1.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
1,0.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
2,0.0,0.0,1.0,0.0,0.0,0.316228,0.0,0.0,0.000000,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
3,0.0,0.0,0.0,1.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
4,0.0,0.0,0.0,0.0,1.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28312,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.00000,0.0,0.00000,0.0
28313,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.149071,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.00000,0.0,0.57735,0.0
28314,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,1.0,0.00000,0.0
28315,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,1.00000,0.0


In [207]:
user_user_sim_matrix['customer_id'] = customer_item_matrix.index
user_user_sim_matrix = user_user_sim_matrix.set_index('customer_id')

In [208]:
user_user_sim_matrix

customer_id,0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2,0003abe64294e66a6310c3436fa9e5b754cc5603deef4f26fc8ab8d043af9358,0004068f54dbe1c7054b23c615edc5f733a508ecc54930bf323209f20410898c,0006d37aaf7dd84f9bbc02f6cadcb74fd72ebf370bdc5f110a8a4092aa7e173e,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4c73235dccbbc132280,0008968c0d451dbc5a9968da03196fe20051965edde7413775c4eb3be9abe9c2,000aa7f0dc06cd7174389e76c9e132a67860c5f65f970699daccc14425ac31a8,000b872410f5ac2064acb999a1e0a7db4c1b5007ecaa7bcdc0a0e9006fa5f968,...,ffe6376eb6b854d842e5a7714ea758de127f086a60d67d5cf425ef20361acea1,ffefe95a1c711b634023279e0bc7180d5991d4558fa036e7d5ac77cc3348d171,fff04954c6e484a8deb5ec475e581aefd25d5850d1886f6c0198edaa9b67c958,fff0ac18093a702a0a06f4cc76582632df3ede9a36556e345150befbeed6885a,fff15526121f7d914a54784e68761a1d30b7547e3555738dcceb386eaaa24c4b,fff3e75605ec575be9b95eda1e6557299e81bba12668d750c0e973528e48b7ee,fff4b145d7469e023b147b0f8375c565b1be43944987792153ccc0af41466cf3,fff627c97a69e53afb4a2b49a3ebf7fa06660afaac959b46e8080849008fe17c,fff969b13a1c848d53ae3f08f111bfebcdcf6cd27e3815235db95f1e99524c79,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,1.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,0.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2,0.0,0.0,1.0,0.0,0.0,0.316228,0.0,0.0,0.000000,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
0003abe64294e66a6310c3436fa9e5b754cc5603deef4f26fc8ab8d043af9358,0.0,0.0,0.0,1.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
0004068f54dbe1c7054b23c615edc5f733a508ecc54930bf323209f20410898c,0.0,0.0,0.0,0.0,1.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fff3e75605ec575be9b95eda1e6557299e81bba12668d750c0e973528e48b7ee,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.00000,0.0,0.00000,0.0
fff4b145d7469e023b147b0f8375c565b1be43944987792153ccc0af41466cf3,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.149071,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.00000,0.0,0.57735,0.0
fff627c97a69e53afb4a2b49a3ebf7fa06660afaac959b46e8080849008fe17c,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,1.0,0.00000,0.0
fff969b13a1c848d53ae3f08f111bfebcdcf6cd27e3815235db95f1e99524c79,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,1.00000,0.0


In [209]:
#Find the customers most similar to customer 2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13 (the first result is the customer itself)
user_user_sim_matrix.loc['2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13'].sort_values(ascending=False)[0:10]

customer_id
2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13    1.000000
3e4c3cff005441f58a9641a8f4e38ca6050e772c4b7a9f2ef92448e6e244d5fc    0.206041
af12dc9783346acf18f5bac8b13585f10eb6549d035112821e67f60e7c843621    0.194257
e1946a8690ab57f6a92f779c62cf5a3fb57df87cc3062a3cd21103e1a4230850    0.194257
5da22e5fc8f619f64982431934410d75d991cafa181587d1893898189219bc41    0.184289
1a20134a3e7d92e6c73916996d7b6332b916cc6c501a250e801d16d9b5d08d60    0.173749
b3a1fc7a2679a8d888f361e247b6a84789ce96f37bd5fc672cc69c2a6a1bfe55    0.155752
1a812529f7996aa1c13b07b755ca63caec8c170d43e4553f84bcacef93ecc973    0.137361
420ef3fcd79c3c1418103f296fe41240604dad6558575cf8013db10695d84ae9    0.137361
26c47eebe4cda52bf77cf791a5d06392218143beaa45ff94a36a119915db1712    0.137361
Name: 2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13, dtype: float64
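
Because every customer has similarity 1.0 with themselves, it can be convenient to drop the customer's own entry before taking the top results. A minimal sketch with a hypothetical helper `most_similar_customers` and a toy similarity matrix:

```python
import pandas as pd

toy_sim = pd.DataFrame(
    [[1.0, 0.7, 0.0], [0.7, 1.0, 0.2], [0.0, 0.2, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

def most_similar_customers(sim_matrix, customer_id, k=10):
    """Return the k most similar customers, excluding the customer itself."""
    return (
        sim_matrix.loc[customer_id]
        .drop(customer_id)
        .sort_values(ascending=False)
        .head(k)
    )

neighbours = most_similar_customers(toy_sim, "b", k=2)
print(neighbours)
```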

These are the 10 highest-scoring entries for client 2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13. The first entry is the client itself (similarity 1.0), so the list contains the 9 most similar other clients. Let's choose the most similar one, client 3e4c3cff005441f58a9641a8f4e38ca6050e772c4b7a9f2ef92448e6e244d5fc, and discuss how we can recommend products using these results.

Let's identify both users:
- user_A: 2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13
- user_B: 3e4c3cff005441f58a9641a8f4e38ca6050e772c4b7a9f2ef92448e6e244d5fc

The strategy is as follows.

- First, we identify the items that user_A and user_B have already purchased.
- Then, we find the products that user_A has purchased but the target customer user_B has not.
- Since these two customers have bought similar items in the past, we assume that user_B has a high probability of buying the items that user_A bought and user_B has not bought yet.
- Finally, we recommend this list of items to the target customer user_B.
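
The steps above can be sketched on a small example. This is a minimal illustration, not the notebook's actual data: `toy_matrix`, `items_bought` and `recommend_from_neighbour` are hypothetical names, and the sketch assumes a customer-item matrix in which a non-zero entry means a purchase.

```python
import pandas as pd

# Hypothetical toy customer-item matrix: 1 = purchased, 0 = not purchased.
toy_matrix = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 0, 0, 0],
     [0, 0, 1, 0]],
    index=["user_A", "user_B", "user_C"],
    columns=["item_1", "item_2", "item_3", "item_4"],
)


def items_bought(matrix, customer_id):
    """Return the set of item labels with a non-zero entry for the customer."""
    row = matrix.loc[customer_id]
    return set(row[row != 0].index)


def recommend_from_neighbour(matrix, similar_id, target_id):
    """Items the similar customer bought that the target customer has not."""
    return items_bought(matrix, similar_id) - items_bought(matrix, target_id)


recommendations = recommend_from_neighbour(toy_matrix, "user_A", "user_B")
print(sorted(recommendations))  # ['item_2', 'item_4']
```

Sets are unordered, so this approach gives a list of candidates without any ranking among them.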

Let's first see how we can retrieve the items that the user_A customer has purchased in the past:

In [210]:
items_bought_by_A = set(customer_item_matrix.loc['2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13'].iloc[
    customer_item_matrix.loc['2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13'].to_numpy().nonzero()
].index)

In [211]:
items_bought_by_A

{'305304010',
 '464454011',
 '467302079',
 '467302099',
 '490113003',
 '494030013',
 '496762018',
 '496762020',
 '532954003',
 '536358002',
 '559601002',
 '559642001',
 '560209001',
 '560270002',
 '566140001',
 '581162001',
 '598795018',
 '601876004',
 '607427002',
 '607427003',
 '610216001',
 '613459001',
 '615042001',
 '619884014',
 '621018003',
 '621939010',
 '626168001',
 '627147001',
 '627147002',
 '628535006',
 '628921003',
 '630319001',
 '633109002',
 '636420003',
 '637194001',
 '638777001',
 '638777002',
 '641312002',
 '642189001',
 '642189005',
 '643305002',
 '645626001',
 '647190001',
 '651558002',
 '651558003',
 '651558006',
 '651558007',
 '651558012',
 '651558013',
 '652361001',
 '671852003',
 '676255002',
 '709688001'}

Selecting the non-zero entries in user_A's row of customer_item_matrix gives the set of items that user_A has purchased. We can apply the same code to the target customer user_B:

In [212]:
items_bought_by_B = set(customer_item_matrix.loc['3e4c3cff005441f58a9641a8f4e38ca6050e772c4b7a9f2ef92448e6e244d5fc'].iloc[
    customer_item_matrix.loc['3e4c3cff005441f58a9641a8f4e38ca6050e772c4b7a9f2ef92448e6e244d5fc'].to_numpy().nonzero()
].index)

In [213]:
items_bought_by_B

{'559601002', '559642001', '559715001', '627147002'}

We now have two sets of items, one for each of customers A and B. A simple set difference gives the items that customer A has purchased but customer B has not:

In [214]:
items_to_recommend_to_B = items_bought_by_A - items_bought_by_B

In [215]:
items_to_recommend_to_B

{'305304010',
 '464454011',
 '467302079',
 '467302099',
 '490113003',
 '494030013',
 '496762018',
 '496762020',
 '532954003',
 '536358002',
 '560209001',
 '560270002',
 '566140001',
 '581162001',
 '598795018',
 '601876004',
 '607427002',
 '607427003',
 '610216001',
 '613459001',
 '615042001',
 '619884014',
 '621018003',
 '621939010',
 '626168001',
 '627147001',
 '628535006',
 '628921003',
 '630319001',
 '633109002',
 '636420003',
 '637194001',
 '638777001',
 '638777002',
 '641312002',
 '642189001',
 '642189005',
 '643305002',
 '645626001',
 '647190001',
 '651558002',
 '651558003',
 '651558006',
 '651558007',
 '651558012',
 '651558013',
 '652361001',
 '671852003',
 '676255002',
 '709688001'}

To obtain the descriptions of these items:

In [220]:
articles.loc[
    articles['article_id'].isin(items_to_recommend_to_B), 
    ['article_id', 'prod_name', 'product_type_name']
].drop_duplicates().set_index('article_id')

article_id,prod_name,product_type_name
305304010,Boy Denim Shorts,Shorts
464454011,TANJA SKIRT,Skirt
467302079,Panda dress J,Dress
467302099,Panda dress J,Dress
490113003,Lola Denim Shorts,Shorts
494030013,Tika (1),Vest top
496762018,Summer strap dress,Dress
496762020,Summer strap dress,Dress
532954003,Small thin hoops,Earring
536358002,Cool Claudia Hoops RT,Earring


#### 2.5.2 Item-Based Filtering

Item-based collaborative filtering is similar to the user-based approach, except that it uses measures of similarity between items, rather than between users or customers.

In [221]:
item_item_sim_matrix = pd.DataFrame(
    cosine_similarity(customer_item_matrix.T)
)

In [222]:
item_item_sim_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15571,15572,15573,15574,15575,15576,15577,15578,15579,15580
0,1.000000,0.350438,0.041885,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.030387,0.000000,0.0
1,0.350438,1.000000,0.059761,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
2,0.041885,0.059761,1.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.129099,0.0,0.0,0.000000,0.000000,0.0
3,0.000000,0.000000,0.000000,1.0,0.0,0.136083,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
4,0.000000,0.000000,0.000000,0.0,1.0,0.149071,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15576,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,1.0,0.0,0.000000,0.000000,0.0
15577,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,1.0,0.000000,0.000000,0.0
15578,0.030387,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,1.000000,0.250313,0.0
15579,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.250313,1.000000,0.0


Compared with the earlier code that computes the user-to-user similarity matrix, the only difference is that here we transpose customer_item_matrix, so that the rows represent individual items and the columns represent customers.

In [223]:
item_item_sim_matrix.columns = customer_item_matrix.T.index

item_item_sim_matrix['article_id'] = customer_item_matrix.T.index
item_item_sim_matrix = item_item_sim_matrix.set_index('article_id')

In [224]:
item_item_sim_matrix

article_id,108775015,108775044,108775051,110065001,110065002,110065011,111565001,111586001,111593001,111609001,...,723595001,723595002,724281001,725253001,727754001,728111001,728146001,728162001,728162002,729931001
108775015,1.000000,0.350438,0.041885,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.030387,0.000000,0.0
108775044,0.350438,1.000000,0.059761,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
108775051,0.041885,0.059761,1.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.129099,0.0,0.0,0.000000,0.000000,0.0
110065001,0.000000,0.000000,0.000000,1.0,0.0,0.136083,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
110065002,0.000000,0.000000,0.000000,0.0,1.0,0.149071,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
728111001,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,1.0,0.0,0.000000,0.000000,0.0
728146001,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,1.0,0.000000,0.000000,0.0
728162001,0.030387,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,1.000000,0.250313,0.0
728162002,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.250313,1.000000,0.0
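
The labelling done in the cells above can also be applied when the DataFrame is built, by passing the item labels as both `index` and `columns`. A minimal sketch on hypothetical toy data (`toy_matrix` and `toy_item_sim` are illustrative names, not objects from this notebook):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy customer-item matrix: rows are customers, columns are items.
toy_matrix = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1]],
    index=["c1", "c2", "c3"],
    columns=["item_a", "item_b", "item_c"],
)

# Transposing puts items on the rows, so cosine_similarity compares items.
items = toy_matrix.columns
toy_item_sim = pd.DataFrame(
    cosine_similarity(toy_matrix.T),
    index=items,
    columns=items,
)

print(toy_item_sim.round(3))
```

Here item_a and item_b share one buyer (c1), while item_c shares none with either, so its similarity to both is 0.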


Let's suppose that a new customer has just bought the product with article_id 108775015, and we want our marketing emails to include products that this customer is likely to buy. The first step is to find the items most similar to article 108775015. The following code returns the 10 highest-scoring entries in that item's row of the similarity matrix. Every item has a similarity of 1.0 with itself, so the list includes 108775015 itself along with 9 other items:

In [225]:
top_10_similar_items = list(
    item_item_sim_matrix
    .loc['108775015']
    .sort_values(ascending=False)
    .iloc[:10]
    .index
)

In [226]:
top_10_similar_items

['108775015',
 '108775044',
 '568842007',
 '628927001',
 '536968001',
 '659211001',
 '641611002',
 '528790004',
 '635579001',
 '562251007']

In [228]:
articles.loc[
    articles['article_id'].isin(top_10_similar_items), 
    ['article_id', 'prod_name', 'product_type_name', 'graphical_appearance_name', 'colour_group_name']
].drop_duplicates().set_index('article_id')

article_id,prod_name,product_type_name,graphical_appearance_name,colour_group_name
108775015,Strap top,Vest top,Solid,Black
108775044,Strap top,Vest top,Solid,White
528790004,Cloud,Vest top,Solid,Pink
536968001,Domino,Top,Melange,Dark Grey
562251007,Stella cropped RW 5 pkt,Trousers,Denim,Blue
568842007,Nihon long leg red,Trousers,Solid,Light Pink
628927001,OLIVIA BOHO,Blouse,All over pattern,Dark Red
635579001,Blossom Blouse,Blouse,All over pattern,Light Beige
641611002,Angel Hoodie,Hoodie,Mixed solid/pattern,Greenish Khaki
659211001,Flirty Travel pack,Other accessories,Solid,Light Pink
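
Since the query item always matches itself, one possible refinement is to drop it before taking the top N. A minimal sketch, assuming a labelled item-item similarity DataFrame; `toy_item_sim` and `top_n_similar` are hypothetical names and the similarity values are made up for illustration:

```python
import pandas as pd

# Hypothetical labelled item-item similarity matrix.
labels = ["item_a", "item_b", "item_c"]
toy_item_sim = pd.DataFrame(
    [[1.0, 0.7, 0.1],
     [0.7, 1.0, 0.0],
     [0.1, 0.0, 1.0]],
    index=labels,
    columns=labels,
)


def top_n_similar(sim_matrix, item_id, n=10):
    """Return the n items most similar to item_id, excluding item_id itself."""
    scores = sim_matrix.loc[item_id].drop(item_id)
    return list(scores.sort_values(ascending=False).head(n).index)


print(top_n_similar(toy_item_sim, "item_a", n=2))  # ['item_b', 'item_c']
```

The returned list is ordered by similarity, whereas filtering a metadata table with `isin` returns rows in the table's own order, as in the output above.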


## 2. Image Processing

Future work, if time permits.

Example: https://www.kaggle.com/code/gulgaishatemerbekova/clothes-recommendation-system-using-densenet121