# Testing Notebook

## Some Theory about Recommender Systems

The main families of methods for RecSys are:

- Collaborative Filtering: This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.

- Content-Based Filtering: This method uses only information about the description and attributes of the items users has previously consumed to model user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended.

- Hybrid methods: Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content-based filtering could be more effective than pure approaches in some cases. These methods can also be used to overcome some of the common problems in recommender systems such as cold start and the sparsity problem.

https://www.kaggle.com/code/gspmoreira/recommender-systems-in-python-101

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

In [3]:
import scipy
import math
import random
import sklearn
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from sklearn.preprocessing import MinMaxScaler

## 1. ItemBased Collaborative Filter Recommendation

Example: https://www.kaggle.com/code/hendraherviawan/itembased-collaborative-filter-recommendation-r/report

### 2.1 Preprocessing

In [4]:
articles = pd.read_csv("data/articles.csv")
articles.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [5]:
import re

# regex pattern: find all column names with '_id', '_code', or '_no'
pattern = '.*(_id|_code|_no).*'

# dict comprehension: Sets all columns with '_id', '_code', or '_no' to str type
dtype_dict = {column: str for column in articles.columns if re.match(pattern, column)}

articles = articles.astype(dtype = dtype_dict)

In [6]:
customers = pd.read_csv("data/customers.csv")
customers.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [7]:
transactions = pd.read_csv("data/transactions_train.csv")
transactions.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001.0,0.050831,2.0
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023.0,0.030492,2.0
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004.0,0.015237,2.0
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003.0,0.016932,2.0
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004.0,0.016932,2.0


In [8]:
#article_id is a float. First convert to int and then to string.
transactions['article_id'] = transactions['article_id'].astype("Int64").astype(str) 

### 2.2 Build single dataframe: Articles + Transactions + Customers

#### Create Transactions subset for testing

In [158]:
#transactions_subset = transactions.sample(20000)
transactions_subset = transactions

In [159]:
transactions_subset.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2.0
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2.0
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2.0
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2.0
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2.0


#### Join Transactions and Articles dataframes

In [160]:
#transactions_articles_joined = transactions_subset.set_index('article_id').join(articles.set_index('article_id'))
transactions_articles_joined = transactions_subset.join(articles.set_index('article_id'), on='article_id')

In [161]:
transactions_articles_joined.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2.0,663713,Atlanta Push Body Harlow,283,Underwear body,Underwear,...,Expressive Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Lace push-up body with underwired, moulded, pa..."
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2.0,541518,Rae Push (Melbourne) 2p,306,Bra,Underwear,...,Casual Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Lace push-up bras with underwired, moulded, pa..."
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2.0,505221,Inca Jumper,252,Sweater,Garment Upper body,...,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Jumper in rib-knit cotton with hard-worn detai...
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2.0,685687,W YODA KNIT OL OFFER,252,Sweater,Garment Upper body,...,Campaigns,A,Ladieswear,1,Ladieswear,15,Womens Everyday Collection,1023,Special Offers,V-neck knitted jumper with long sleeves and ri...
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2.0,685687,W YODA KNIT OL OFFER,252,Sweater,Garment Upper body,...,Campaigns,A,Ladieswear,1,Ladieswear,15,Womens Everyday Collection,1023,Special Offers,V-neck knitted jumper with long sleeves and ri...


#### Join Transactions-Articles with Customers dataframes

In [162]:
# Join also customer info

#trans_arts_cust_joined = transactions_articles_joined.set_index('customer_id').join(customers.set_index('customer_id'))
trans_arts_cust_joined = transactions_articles_joined.join(customers.set_index('customer_id'), on='customer_id')
trans_arts_cust_joined.head()

#Index of output df belongs to original index of transactions

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,...,section_name,garment_group_no,garment_group_name,detail_desc,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2.0,663713,Atlanta Push Body Harlow,283,Underwear body,Underwear,...,Womens Lingerie,1017,"Under-, Nightwear","Lace push-up body with underwired, moulded, pa...",,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2.0,541518,Rae Push (Melbourne) 2p,306,Bra,Underwear,...,Womens Lingerie,1017,"Under-, Nightwear","Lace push-up bras with underwired, moulded, pa...",,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2.0,505221,Inca Jumper,252,Sweater,Garment Upper body,...,Divided Selected,1003,Knitwear,Jumper in rib-knit cotton with hard-worn detai...,1.0,1.0,ACTIVE,Regularly,32.0,8d6f45050876d059c830a0fe63f1a4c022de279bb68ce3...
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2.0,685687,W YODA KNIT OL OFFER,252,Sweater,Garment Upper body,...,Womens Everyday Collection,1023,Special Offers,V-neck knitted jumper with long sleeves and ri...,1.0,1.0,ACTIVE,Regularly,32.0,8d6f45050876d059c830a0fe63f1a4c022de279bb68ce3...
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2.0,685687,W YODA KNIT OL OFFER,252,Sweater,Garment Upper body,...,Womens Everyday Collection,1023,Special Offers,V-neck knitted jumper with long sleeves and ri...,1.0,1.0,ACTIVE,Regularly,32.0,8d6f45050876d059c830a0fe63f1a4c022de279bb68ce3...


#### Check if join has been done correctly

In [163]:
#1. check that customer_id are repeated (some customers bought multiple items)
#Number of products purchased by each customer
grouped = trans_arts_cust_joined.groupby("customer_id")["customer_id"].count().reset_index(name='counts').sort_values(by='counts', ascending=False)
grouped

Unnamed: 0,customer_id,counts
683982,b14bfba3ae0da5af6e9711059773acf713cd7bb9a2c940...,897
733432,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,819
511891,84c34f4f564db1f437943c77af41f83bf6fd7c01701cbb...,731
802295,d00063b94dcb1342869d4994844a2742b5d62927f36843...,730
46064,0bf4c6fd4e9d33f9bfb807bb78348cbf5c565846ff4006...,674
...,...,...
312298,5126e4062c035c53ac1b2a2dedf2fd09724d4f52677f90...,1
769715,c78c4b04098ca1438dd4570ab3a53651616605a283b2e6...,1
769719,c78c73c45b31be4e6a38c9e1db467a96954650004b3fcd...,1
312295,5126db4d0bf54a5b59176545b3dde26407af1d7576c04f...,1


In [164]:
#2. check that article_id are repeated (different customers bought same item)
#Number of times the products were purchased by the customers
grouped = trans_arts_cust_joined.groupby("article_id")["article_id"].count().reset_index(name='counts').sort_values(by='counts', ascending=False)
grouped

Unnamed: 0,article_id,counts
49710,706016001,17726
49711,706016002,16489
1502,372860001,15388
13026,562245001,14793
22952,610776002,14628
...,...,...
3682,478004012,1
12706,560375002,1
12705,560375001,1
64837,773223001,1


In [16]:
#3. Check duplicated rows
trans_arts_cust_joined.duplicated().sum()

0

Duplicate rows correspond to multiple purchases of the same item by the same client. 

### 2.3 Drop columns

@To-Do: Are we working with the name or the number of the following categories?:
- product
- product_type
- product_group
- garment_group
- ...

Drop the unchosen.

In [165]:
trans_arts_cust_joined.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,...,section_name,garment_group_no,garment_group_name,detail_desc,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2.0,663713,Atlanta Push Body Harlow,283,Underwear body,Underwear,...,Womens Lingerie,1017,"Under-, Nightwear","Lace push-up body with underwired, moulded, pa...",,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2.0,541518,Rae Push (Melbourne) 2p,306,Bra,Underwear,...,Womens Lingerie,1017,"Under-, Nightwear","Lace push-up bras with underwired, moulded, pa...",,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2.0,505221,Inca Jumper,252,Sweater,Garment Upper body,...,Divided Selected,1003,Knitwear,Jumper in rib-knit cotton with hard-worn detai...,1.0,1.0,ACTIVE,Regularly,32.0,8d6f45050876d059c830a0fe63f1a4c022de279bb68ce3...
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2.0,685687,W YODA KNIT OL OFFER,252,Sweater,Garment Upper body,...,Womens Everyday Collection,1023,Special Offers,V-neck knitted jumper with long sleeves and ri...,1.0,1.0,ACTIVE,Regularly,32.0,8d6f45050876d059c830a0fe63f1a4c022de279bb68ce3...
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2.0,685687,W YODA KNIT OL OFFER,252,Sweater,Garment Upper body,...,Womens Everyday Collection,1023,Special Offers,V-neck knitted jumper with long sleeves and ri...,1.0,1.0,ACTIVE,Regularly,32.0,8d6f45050876d059c830a0fe63f1a4c022de279bb68ce3...


In [166]:
# Drop columns
trans_arts_cust_joined = trans_arts_cust_joined.drop("detail_desc", axis=1)   
trans_arts_cust_joined = trans_arts_cust_joined.drop("prod_name", axis=1)  
trans_arts_cust_joined = trans_arts_cust_joined.drop("product_type_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("garment_group_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("product_group_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("graphical_appearance_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("colour_group_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("perceived_colour_value_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("perceived_colour_master_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("department_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("index_name", axis=1) 
#trans_arts_cust_joined = trans_arts_cust_joined.drop("index_group_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("section_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("index_group_name", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("postal_code", axis=1) 
trans_arts_cust_joined = trans_arts_cust_joined.drop("FN", axis=1)   

In [167]:
trans_arts_cust_joined = trans_arts_cust_joined.drop("t_dat", axis=1) 

In [168]:
trans_arts_cust_joined.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,product_code,product_type_no,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,Active,club_member_status,fashion_news_frequency,age
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2.0,663713,283,1010016,9,4,5,1338,B,1,61,1017,,ACTIVE,NONE,24.0
1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2.0,541518,306,1010016,51,1,4,1334,B,1,61,1017,,ACTIVE,NONE,24.0
2,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2.0,505221,252,1010010,52,2,4,5963,D,2,58,1003,1.0,ACTIVE,Regularly,32.0
3,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2.0,685687,252,1010010,52,7,4,3090,A,1,15,1023,1.0,ACTIVE,Regularly,32.0
4,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2.0,685687,252,1010010,93,4,19,3090,A,1,15,1023,1.0,ACTIVE,Regularly,32.0


### 2.4 Manage Null values

In [169]:
trans_arts_cust_joined.isnull().sum()

customer_id                         0
article_id                          0
price                               1
sales_channel_id                    1
product_code                        1
product_type_no                     1
graphical_appearance_no             1
colour_group_code                   1
perceived_colour_value_id           1
perceived_colour_master_id          1
department_no                       1
index_code                          1
index_group_no                      1
section_no                          1
garment_group_no                    1
Active                        9393735
club_member_status              34071
fashion_news_frequency          64810
age                             79980
dtype: int64

In [170]:
#Replace Age with the mean
mean_age = trans_arts_cust_joined['age'].median()
trans_arts_cust_joined['age'].fillna(mean_age,inplace=True)

In [171]:
#Remove Active Column
trans_arts_cust_joined = trans_arts_cust_joined.drop("Active", axis=1)   

In [172]:
#Rplace club_member_status and fashion_news_frequency with the most common value
trans_arts_cust_joined = trans_arts_cust_joined.fillna(trans_arts_cust_joined['club_member_status'].value_counts().index[0])
trans_arts_cust_joined = trans_arts_cust_joined.fillna(trans_arts_cust_joined['fashion_news_frequency'].value_counts().index[0])

In [173]:
trans_arts_cust_joined.isnull().sum()

customer_id                   0
article_id                    0
price                         0
sales_channel_id              0
product_code                  0
product_type_no               0
graphical_appearance_no       0
colour_group_code             0
perceived_colour_value_id     0
perceived_colour_master_id    0
department_no                 0
index_code                    0
index_group_no                0
section_no                    0
garment_group_no              0
club_member_status            0
fashion_news_frequency        0
age                           0
dtype: int64

### 2.4 Manage Categorical Columns

#### Explore levels of the categorical variables

In [26]:
trans_arts_cust_joined['article_id'].value_counts()

562245001    24
484398001    21
399256001    20
684209001    19
610776002    18
             ..
685933002     1
456163028     1
561551015     1
630304002     1
766194002     1
Name: article_id, Length: 11713, dtype: int64

In [27]:
trans_arts_cust_joined['sales_channel_id'].value_counts()

2.0    13818
1.0     6182
Name: sales_channel_id, dtype: int64

In [28]:
trans_arts_cust_joined['product_code'].value_counts()

562245    111
610776     85
706016     79
589599     63
573716     60
         ... 
514865      1
749806      1
688260      1
712514      1
677506      1
Name: product_code, Length: 7094, dtype: int64

In [29]:
trans_arts_cust_joined['product_type_no'].value_counts()

272    2654
265    1928
252    1770
255    1470
258    1040
       ... 
532       1
83        1
291       1
156       1
515       1
Name: product_type_no, Length: 88, dtype: int64

In [30]:
trans_arts_cust_joined['graphical_appearance_no'].value_counts()

1010016    10525
1010001     2889
1010017     1193
1010023     1182
1010010     1143
1010021      483
1010026      343
1010014      329
1010004      312
1010008      243
1010005      213
1010006      165
1010007      157
1010009      145
1010020      117
1010018      106
1010022      102
1010002      100
1010015       53
1010012       39
1010011       38
1010013       26
1010024       25
1010027       19
1010025       15
-1            14
1010019       11
1010028       10
1010003        3
Name: graphical_appearance_no, dtype: int64

In [31]:
trans_arts_cust_joined['colour_group_code'].value_counts()

9     6915
10    2148
73    1735
72     678
12     676
71     625
42     591
7      526
43     514
51     513
11     483
19     482
8      456
13     438
93     395
22     333
6      275
52     265
17     233
31     231
33     152
5      149
53     131
14     126
92     108
23      91
83      79
21      78
32      74
3       70
82      60
91      59
81      46
50      46
63      35
15      33
40      18
41      18
30      17
1       17
20      16
61      16
70       9
62       8
2        8
60       8
-1       8
4        6
80       1
90       1
Name: colour_group_code, dtype: int64

In [32]:
trans_arts_cust_joined['perceived_colour_value_id'].value_counts()

4     10141
1      3104
3      2871
2      2206
7       870
5       783
6        17
-1        8
Name: perceived_colour_value_id, dtype: int64

In [33]:
trans_arts_cust_joined['perceived_colour_master_id'].value_counts()

5     6823
2     3057
9     2694
12    1224
18    1142
11    1068
4      919
20     564
19     510
8      505
3      409
13     348
15     225
-1     148
7      146
1      124
6       77
14      17
Name: perceived_colour_master_id, dtype: int64

In [34]:
trans_arts_cust_joined['index_code'].value_counts()

A    7967
D    4676
B    3419
F    1090
C    1077
S     714
I     439
H     321
G     195
J     102
Name: index_code, dtype: int64

In [35]:
trans_arts_cust_joined['index_group_no'].value_counts()

1     12463
2      4676
3      1090
4      1057
26      714
Name: index_group_no, dtype: int64

In [36]:
trans_arts_cust_joined['section_no'].value_counts()

15    3654
53    2540
60    1667
61    1224
11    1214
16     964
6      800
51     783
5      674
57     663
62     599
66     416
64     377
26     368
18     345
2      327
65     284
50     268
58     235
52     215
19     193
20     190
77     188
8      142
21     129
55     123
76     120
46     115
14     115
47     114
79     108
44     105
23      92
56      84
72      82
41      49
25      48
42      48
45      43
43      40
80      38
40      35
97      32
82      30
31      27
22      27
27      22
49      13
70      11
48      10
24       4
71       3
28       2
30       1
Name: section_no, dtype: int64

In [37]:
trans_arts_cust_joined['garment_group_no'].value_counts()

1005    3374
1002    1961
1009    1866
1018    1733
1017    1662
1010    1572
1003    1460
1013    1383
1019    1034
1016     814
1025     556
1020     485
1007     429
1021     426
1012     369
1001     278
1008     203
1023     153
1011     140
1014      58
1006      44
Name: garment_group_no, dtype: int64

In [38]:
trans_arts_cust_joined['club_member_status'].value_counts()

ACTIVE        19472
PRE-CREATE      524
LEFT CLUB         4
Name: club_member_status, dtype: int64

In [39]:
trans_arts_cust_joined['fashion_news_frequency'].value_counts()

NONE         11533
Regularly     8390
ACTIVE          72
Monthly          5
Name: fashion_news_frequency, dtype: int64

In [40]:
trans_arts_cust_joined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 5892708 to 13284337
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   customer_id                 20000 non-null  object 
 1   article_id                  20000 non-null  object 
 2   price                       20000 non-null  float64
 3   sales_channel_id            20000 non-null  float64
 4   product_code                20000 non-null  object 
 5   product_type_no             20000 non-null  object 
 6   graphical_appearance_no     20000 non-null  object 
 7   colour_group_code           20000 non-null  object 
 8   perceived_colour_value_id   20000 non-null  object 
 9   perceived_colour_master_id  20000 non-null  object 
 10  department_no               20000 non-null  object 
 11  index_code                  20000 non-null  object 
 12  index_group_no              20000 non-null  object 
 13  section_no            

#### Handle categorical variables

In [41]:
trans_arts_cust_joined.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,product_code,product_type_no,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,club_member_status,fashion_news_frequency,age
5892708,7876f201ba4869aaf4701447a240fc7c3007180ebbac71...,605106004,0.013542,2.0,605106,286,1010001,71,3,2,3937,D,2,51,1017,ACTIVE,NONE,26.0
13414259,8560613ffb8affd37dfce112320bd68918316b68df1771...,701792005,0.022017,1.0,701792,262,1010016,51,1,4,1244,D,2,53,1007,ACTIVE,Regularly,32.0
12210512,ab315c222174fa49dcfcd8de00993b40dc32b2449c8c92...,707699001,0.015237,1.0,707699,258,1010007,10,3,9,1522,A,1,15,1010,ACTIVE,Regularly,52.0
15698421,0b0d28ce3d95d21a42a9976245e995c943f35111ab61b9...,619580008,0.009136,2.0,619580,274,1010016,73,4,2,1723,A,1,15,1025,ACTIVE,NONE,31.0
8372506,5fc9c565b43fa8a50855c8cc26526ee1f36142dc99e4ed...,719245001,0.047441,2.0,719245,272,1010016,19,4,20,1722,A,1,15,1009,ACTIVE,NONE,23.0


In [42]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# apply 1-hot encoding to catgorical predictors
pip = ColumnTransformer([
        ("cat", OneHotEncoder(), ["article_id", "sales_channel_id", "product_code", "product_type_no", "graphical_appearance_no", "colour_group_code", 
                                  "perceived_colour_value_id", "perceived_colour_master_id", "index_code", "index_group_no", "section_no",
                                  "garment_group_no", "club_member_status", "fashion_news_frequency"]),
    ], remainder='drop')

In [43]:
df_prepared = pd.DataFrame(pip.fit_transform(trans_arts_cust_joined))
df_prepared

Unnamed: 0,0
0,"(0, 2871)\t1.0\n (0, 11714)\t1.0\n (0, 131..."
1,"(0, 7634)\t1.0\n (0, 11713)\t1.0\n (0, 160..."
2,"(0, 7907)\t1.0\n (0, 11713)\t1.0\n (0, 162..."
3,"(0, 3312)\t1.0\n (0, 11714)\t1.0\n (0, 134..."
4,"(0, 8764)\t1.0\n (0, 11714)\t1.0\n (0, 167..."
...,...
19995,"(0, 5968)\t1.0\n (0, 11713)\t1.0\n (0, 150..."
19996,"(0, 277)\t1.0\n (0, 11714)\t1.0\n (0, 1187..."
19997,"(0, 6581)\t1.0\n (0, 11713)\t1.0\n (0, 153..."
19998,"(0, 8909)\t1.0\n (0, 11714)\t1.0\n (0, 168..."


### 2.5 Machine Learning

https://www.datasource.ai/uploads/6b86b1630562b323a26143f90d97fe08.html

#### 2.5.1 Collaborative filtering

Build a matrix of items for the client

In [188]:
#df = trans_arts_cust_joined.reset_index()
df = trans_arts_cust_joined[0:100000].reset_index()

In [189]:
df.head()

Unnamed: 0,index,customer_id,article_id,price,sales_channel_id,product_code,product_type_no,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,club_member_status,fashion_news_frequency,age,Quantity
0,0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2.0,663713,283,1010016,9,4,5,1338,B,1,61,1017,ACTIVE,NONE,24.0,1
1,1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2.0,541518,306,1010016,51,1,4,1334,B,1,61,1017,ACTIVE,NONE,24.0,1
2,2,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2.0,505221,252,1010010,52,2,4,5963,D,2,58,1003,ACTIVE,Regularly,32.0,1
3,3,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2.0,685687,252,1010010,52,7,4,3090,A,1,15,1023,ACTIVE,Regularly,32.0,1
4,4,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2.0,685687,252,1010010,93,4,19,3090,A,1,15,1023,ACTIVE,Regularly,32.0,1


In [191]:
#Get counts of each sold article
grouped = df.groupby("article_id")["article_id"].count().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
grouped

Unnamed: 0,article_id,counts
14861,685687004,797
14858,685687001,526
14860,685687003,438
14859,685687002,288
4149,562245001,275
...,...,...
11386,644432001,1
11385,644419001,1
5903,585275003,1
5904,585275004,1


In [192]:
#Get counts of each customer to see if same customer has purchased more than once
grouped = df.groupby("customer_id")["customer_id"].count().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
grouped

Unnamed: 0,customer_id,counts
6049,2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c...,103
27354,f6fbb1480291047c308f83a25e8e8b1e2c5a44a9390662...,72
742,05c5685f1301de31ac4463293228f74ba1b3b427dc1a79...,52
6224,31287b3d29b025cf00822b66b462a415e9c58d65385627...,52
8026,4043eeb9fb65b1735639ce44efbc5c43d2f9160fb26c8d...,52
...,...,...
14578,7def8a390bf6d679f6c3d979a6529a33bca31aa69208ff...,1
14580,7df24c972b2f724db3eef0b599e4cd98d88a4ffb884e33...,1
14582,7df3bbddfe8905e0141fc65bb162c41c7eb87d90d40325...,1
14585,7df50e9957586c3c3fa4adf87af466d84607018d7ec741...,1


In [193]:
grouped.iloc[0]['customer_id']

'2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13'

In [194]:
#Add column quantity
df['Quantity'] = 1

In [195]:
df.head()

Unnamed: 0,index,customer_id,article_id,price,sales_channel_id,product_code,product_type_no,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,club_member_status,fashion_news_frequency,age,Quantity
0,0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2.0,663713,283,1010016,9,4,5,1338,B,1,61,1017,ACTIVE,NONE,24.0,1
1,1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2.0,541518,306,1010016,51,1,4,1334,B,1,61,1017,ACTIVE,NONE,24.0,1
2,2,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2.0,505221,252,1010010,52,2,4,5963,D,2,58,1003,ACTIVE,Regularly,32.0,1
3,3,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2.0,685687,252,1010010,52,7,4,3090,A,1,15,1023,ACTIVE,Regularly,32.0,1
4,4,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2.0,685687,252,1010010,93,4,19,3090,A,1,15,1023,ACTIVE,Regularly,32.0,1


In [196]:
customer_item_matrix = df.pivot_table(
    index='customer_id', 
    columns='article_id', 
    values='Quantity',
    aggfunc='sum'
)

In [197]:
customer_item_matrix

article_id,108775015,108775044,108775051,110065001,110065002,110065011,111565001,111586001,111593001,111609001,...,723595001,723595002,724281001,725253001,727754001,728111001,728146001,728162001,728162002,729931001
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,,,,,,,,,,,...,,,,,,,,,,
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,,,,,,,,,,,...,,,,,,,,,,
00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2,,,,,,,,,,,...,,,,,,,,,,
0003abe64294e66a6310c3436fa9e5b754cc5603deef4f26fc8ab8d043af9358,,,,,,,,,,,...,,,,,,,,,,
0004068f54dbe1c7054b23c615edc5f733a508ecc54930bf323209f20410898c,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fff3e75605ec575be9b95eda1e6557299e81bba12668d750c0e973528e48b7ee,,,,,,,,,,,...,,,,,,,,,,
fff4b145d7469e023b147b0f8375c565b1be43944987792153ccc0af41466cf3,,,,,,,,,,,...,,,,,,,,,,
fff627c97a69e53afb4a2b49a3ebf7fa06660afaac959b46e8080849008fe17c,,,,,,,,,,,...,,,,,,,,,,
fff969b13a1c848d53ae3f08f111bfebcdcf6cd27e3815235db95f1e99524c79,,,,,,,,,,,...,,,,,,,,,,


 we now have a matrix where each row represents the total quantities purchased for each product for each customer.

let's code 0-1 this data, so that a value of 1 means that the given product was bought by the given customer, and a value of 0 means that the given product was never bought by the given customer. Take a look at the following code:

In [198]:
customer_item_matrix = customer_item_matrix.applymap(lambda x: 1 if x > 0 else 0)

In [200]:
customer_item_matrix

article_id,108775015,108775044,108775051,110065001,110065002,110065011,111565001,111586001,111593001,111609001,...,723595001,723595002,724281001,725253001,727754001,728111001,728146001,728162001,728162002,729931001
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0003abe64294e66a6310c3436fa9e5b754cc5603deef4f26fc8ab8d043af9358,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0004068f54dbe1c7054b23c615edc5f733a508ecc54930bf323209f20410898c,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fff3e75605ec575be9b95eda1e6557299e81bba12668d750c0e973528e48b7ee,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fff4b145d7469e023b147b0f8375c565b1be43944987792153ccc0af41466cf3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fff627c97a69e53afb4a2b49a3ebf7fa06660afaac959b46e8080849008fe17c,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fff969b13a1c848d53ae3f08f111bfebcdcf6cd27e3815235db95f1e99524c79,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [201]:
#Test. Check that matrix hs been correctly build
#Customer with id b14bfba3ae0da5af6e9711059773acf713cd7bb9a2c940ddf570affb715988a0 should have bought 4 items
customer_item_matrix.loc['2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13'].sum()

53

Calculate the cosine similarities between users

In [202]:
from sklearn.metrics.pairwise import cosine_similarity

In [203]:
user_user_sim_matrix = pd.DataFrame(
    cosine_similarity(customer_item_matrix)
)

In [204]:
user_user_sim_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28307,28308,28309,28310,28311,28312,28313,28314,28315,28316
0,1.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
1,0.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
2,0.0,0.0,1.0,0.0,0.0,0.316228,0.0,0.0,0.000000,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
3,0.0,0.0,0.0,1.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
4,0.0,0.0,0.0,0.0,1.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28312,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.00000,0.0,0.00000,0.0
28313,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.149071,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.00000,0.0,0.57735,0.0
28314,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,1.0,0.00000,0.0
28315,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,1.00000,0.0


In [205]:
user_user_sim_matrix.columns = customer_item_matrix.index

In [206]:
user_user_sim_matrix

customer_id,0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2,0003abe64294e66a6310c3436fa9e5b754cc5603deef4f26fc8ab8d043af9358,0004068f54dbe1c7054b23c615edc5f733a508ecc54930bf323209f20410898c,0006d37aaf7dd84f9bbc02f6cadcb74fd72ebf370bdc5f110a8a4092aa7e173e,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4c73235dccbbc132280,0008968c0d451dbc5a9968da03196fe20051965edde7413775c4eb3be9abe9c2,000aa7f0dc06cd7174389e76c9e132a67860c5f65f970699daccc14425ac31a8,000b872410f5ac2064acb999a1e0a7db4c1b5007ecaa7bcdc0a0e9006fa5f968,...,ffe6376eb6b854d842e5a7714ea758de127f086a60d67d5cf425ef20361acea1,ffefe95a1c711b634023279e0bc7180d5991d4558fa036e7d5ac77cc3348d171,fff04954c6e484a8deb5ec475e581aefd25d5850d1886f6c0198edaa9b67c958,fff0ac18093a702a0a06f4cc76582632df3ede9a36556e345150befbeed6885a,fff15526121f7d914a54784e68761a1d30b7547e3555738dcceb386eaaa24c4b,fff3e75605ec575be9b95eda1e6557299e81bba12668d750c0e973528e48b7ee,fff4b145d7469e023b147b0f8375c565b1be43944987792153ccc0af41466cf3,fff627c97a69e53afb4a2b49a3ebf7fa06660afaac959b46e8080849008fe17c,fff969b13a1c848d53ae3f08f111bfebcdcf6cd27e3815235db95f1e99524c79,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264
0,1.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
1,0.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
2,0.0,0.0,1.0,0.0,0.0,0.316228,0.0,0.0,0.000000,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
3,0.0,0.0,0.0,1.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
4,0.0,0.0,0.0,0.0,1.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28312,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.00000,0.0,0.00000,0.0
28313,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.149071,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.00000,0.0,0.57735,0.0
28314,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,1.0,0.00000,0.0
28315,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,1.00000,0.0


In [207]:
user_user_sim_matrix['customer_id'] = customer_item_matrix.index
user_user_sim_matrix = user_user_sim_matrix.set_index('customer_id')

In [208]:
user_user_sim_matrix

customer_id,0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2,0003abe64294e66a6310c3436fa9e5b754cc5603deef4f26fc8ab8d043af9358,0004068f54dbe1c7054b23c615edc5f733a508ecc54930bf323209f20410898c,0006d37aaf7dd84f9bbc02f6cadcb74fd72ebf370bdc5f110a8a4092aa7e173e,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4c73235dccbbc132280,0008968c0d451dbc5a9968da03196fe20051965edde7413775c4eb3be9abe9c2,000aa7f0dc06cd7174389e76c9e132a67860c5f65f970699daccc14425ac31a8,000b872410f5ac2064acb999a1e0a7db4c1b5007ecaa7bcdc0a0e9006fa5f968,...,ffe6376eb6b854d842e5a7714ea758de127f086a60d67d5cf425ef20361acea1,ffefe95a1c711b634023279e0bc7180d5991d4558fa036e7d5ac77cc3348d171,fff04954c6e484a8deb5ec475e581aefd25d5850d1886f6c0198edaa9b67c958,fff0ac18093a702a0a06f4cc76582632df3ede9a36556e345150befbeed6885a,fff15526121f7d914a54784e68761a1d30b7547e3555738dcceb386eaaa24c4b,fff3e75605ec575be9b95eda1e6557299e81bba12668d750c0e973528e48b7ee,fff4b145d7469e023b147b0f8375c565b1be43944987792153ccc0af41466cf3,fff627c97a69e53afb4a2b49a3ebf7fa06660afaac959b46e8080849008fe17c,fff969b13a1c848d53ae3f08f111bfebcdcf6cd27e3815235db95f1e99524c79,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,1.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,0.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2,0.0,0.0,1.0,0.0,0.0,0.316228,0.0,0.0,0.000000,0.0,...,0.2,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
0003abe64294e66a6310c3436fa9e5b754cc5603deef4f26fc8ab8d043af9358,0.0,0.0,0.0,1.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
0004068f54dbe1c7054b23c615edc5f733a508ecc54930bf323209f20410898c,0.0,0.0,0.0,0.0,1.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fff3e75605ec575be9b95eda1e6557299e81bba12668d750c0e973528e48b7ee,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.00000,0.0,0.00000,0.0
fff4b145d7469e023b147b0f8375c565b1be43944987792153ccc0af41466cf3,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.149071,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.00000,0.0,0.57735,0.0
fff627c97a69e53afb4a2b49a3ebf7fa06660afaac959b46e8080849008fe17c,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,1.0,0.00000,0.0
fff969b13a1c848d53ae3f08f111bfebcdcf6cd27e3815235db95f1e99524c79,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,1.00000,0.0


In [209]:
#Find similar users as customer with id b14bfba3ae0da5af6e9711059773acf713cd7bb9a2c940ddf570affb715988a0
user_user_sim_matrix.loc['2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13'].sort_values(ascending=False)[0:10]

customer_id
2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13    1.000000
3e4c3cff005441f58a9641a8f4e38ca6050e772c4b7a9f2ef92448e6e244d5fc    0.206041
af12dc9783346acf18f5bac8b13585f10eb6549d035112821e67f60e7c843621    0.194257
e1946a8690ab57f6a92f779c62cf5a3fb57df87cc3062a3cd21103e1a4230850    0.194257
5da22e5fc8f619f64982431934410d75d991cafa181587d1893898189219bc41    0.184289
1a20134a3e7d92e6c73916996d7b6332b916cc6c501a250e801d16d9b5d08d60    0.173749
b3a1fc7a2679a8d888f361e247b6a84789ce96f37bd5fc672cc69c2a6a1bfe55    0.155752
1a812529f7996aa1c13b07b755ca63caec8c170d43e4553f84bcacef93ecc973    0.137361
420ef3fcd79c3c1418103f296fe41240604dad6558575cf8013db10695d84ae9    0.137361
26c47eebe4cda52bf77cf791a5d06392218143beaa45ff94a36a119915db1712    0.137361
Name: 2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13, dtype: float64

These are the 10 most similar clients to the b14bfba3ae0da5af6e9711059773acf713cd7bb9a2c940ddf570affb715988a0 client. Let's choose client 26f41c2913090e7a620df05975a52c604caf017c3110bb62596fdebd1aae4ba9 and discuss how we can recommend products using these results.

Lets identify both users:
- user_A: b14bfba3ae0da5af6e9711059773acf713cd7bb9a2c940ddf570affb715988a0
- user_B: 26f41c2913090e7a620df05975a52c604caf017c3110bb62596fdebd1aae4ba9

The strategy is as follows.

- First, we need to identify the items that user_A and user2 customers have already purchased.
- Then, let's find the products that target client user_B has not purchased, but client user_A has.
- Since these two customers have bought similar items in the past, we will assume that target customer user_B has a high probability of buying the items he or she has not bought, but customer user_A has.
- Finally, we are going to use this list of items and recommend them to target customer user_B.

Let's first see how we can retrieve the items that the user_A customer has purchased in the past:

In [210]:
items_bought_by_A = set(customer_item_matrix.loc['2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13'].iloc[
    customer_item_matrix.loc['2fdf822dbaad2b983b37e651a982bba24352a92c8a5c4c75be25c771f2af6d13'].to_numpy().nonzero()
].index)

In [211]:
items_bought_by_A

{'305304010',
 '464454011',
 '467302079',
 '467302099',
 '490113003',
 '494030013',
 '496762018',
 '496762020',
 '532954003',
 '536358002',
 '559601002',
 '559642001',
 '560209001',
 '560270002',
 '566140001',
 '581162001',
 '598795018',
 '601876004',
 '607427002',
 '607427003',
 '610216001',
 '613459001',
 '615042001',
 '619884014',
 '621018003',
 '621939010',
 '626168001',
 '627147001',
 '627147002',
 '628535006',
 '628921003',
 '630319001',
 '633109002',
 '636420003',
 '637194001',
 '638777001',
 '638777002',
 '641312002',
 '642189001',
 '642189005',
 '643305002',
 '645626001',
 '647190001',
 '651558002',
 '651558003',
 '651558006',
 '651558007',
 '651558012',
 '651558013',
 '652361001',
 '671852003',
 '676255002',
 '709688001'}

Using this function in the customer_item_matrix for the given user_A client, we can get the list of elements that the user_A client has purchased. We can apply the same code for the target client user_B, as in the following:

In [212]:
items_bought_by_B = set(customer_item_matrix.loc['3e4c3cff005441f58a9641a8f4e38ca6050e772c4b7a9f2ef92448e6e244d5fc'].iloc[
    customer_item_matrix.loc['3e4c3cff005441f58a9641a8f4e38ca6050e772c4b7a9f2ef92448e6e244d5fc'].to_numpy().nonzero()
].index)

In [213]:
items_bought_by_B

{'559601002', '559642001', '559715001', '627147002'}

We now have two sets of items that customers A and B have purchased. Using a simple set operation, we can find the items that customer A has purchased, but customer B has not. The code is like the one below:

In [214]:
items_to_recommend_to_B = items_bought_by_A - items_bought_by_B

In [215]:
items_to_recommend_to_B

{'305304010',
 '464454011',
 '467302079',
 '467302099',
 '490113003',
 '494030013',
 '496762018',
 '496762020',
 '532954003',
 '536358002',
 '560209001',
 '560270002',
 '566140001',
 '581162001',
 '598795018',
 '601876004',
 '607427002',
 '607427003',
 '610216001',
 '613459001',
 '615042001',
 '619884014',
 '621018003',
 '621939010',
 '626168001',
 '627147001',
 '628535006',
 '628921003',
 '630319001',
 '633109002',
 '636420003',
 '637194001',
 '638777001',
 '638777002',
 '641312002',
 '642189001',
 '642189005',
 '643305002',
 '645626001',
 '647190001',
 '651558002',
 '651558003',
 '651558006',
 '651558007',
 '651558012',
 '651558013',
 '652361001',
 '671852003',
 '676255002',
 '709688001'}

To obtain the descriptions of these items:

In [220]:
articles.loc[
    articles['article_id'].isin(items_to_recommend_to_B), 
    ['article_id', 'prod_name', 'product_type_name']
].drop_duplicates().set_index('article_id')

Unnamed: 0_level_0,prod_name,product_type_name
article_id,Unnamed: 1_level_1,Unnamed: 2_level_1
305304010,Boy Denim Shorts,Shorts
464454011,TANJA SKIRT,Skirt
467302079,Panda dress J,Dress
467302099,Panda dress J,Dress
490113003,Lola Denim Shorts,Shorts
494030013,Tika (1),Vest top
496762018,Summer strap dress,Dress
496762020,Summer strap dress,Dress
532954003,Small thin hoops,Earring
536358002,Cool Claudia Hoops RT,Earring


#### 2.5.2 Item-Based Filtering

Item-based collaborative filtering is similar to the user-based approach, except that it uses measures of similarity between items, rather than between users or customers.

In [221]:
item_item_sim_matrix = pd.DataFrame(
    cosine_similarity(customer_item_matrix.T)
)

In [222]:
item_item_sim_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15571,15572,15573,15574,15575,15576,15577,15578,15579,15580
0,1.000000,0.350438,0.041885,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.030387,0.000000,0.0
1,0.350438,1.000000,0.059761,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
2,0.041885,0.059761,1.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.129099,0.0,0.0,0.000000,0.000000,0.0
3,0.000000,0.000000,0.000000,1.0,0.0,0.136083,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
4,0.000000,0.000000,0.000000,0.0,1.0,0.149071,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15576,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,1.0,0.0,0.000000,0.000000,0.0
15577,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,1.0,0.000000,0.000000,0.0
15578,0.030387,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,1.000000,0.250313,0.0
15579,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.250313,1.000000,0.0


If you compare this code with the previous one, in which we calculate an array of similarities between users, the only difference is that here we are transposing the customer_item_matrix, so that the indexes in the rows represent individual items and the columns represent the customers.

In [223]:
item_item_sim_matrix.columns = customer_item_matrix.T.index

item_item_sim_matrix['article_id'] = customer_item_matrix.T.index
item_item_sim_matrix = item_item_sim_matrix.set_index('article_id')

In [224]:
item_item_sim_matrix

article_id,108775015,108775044,108775051,110065001,110065002,110065011,111565001,111586001,111593001,111609001,...,723595001,723595002,724281001,725253001,727754001,728111001,728146001,728162001,728162002,729931001
article_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
108775015,1.000000,0.350438,0.041885,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.030387,0.000000,0.0
108775044,0.350438,1.000000,0.059761,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
108775051,0.041885,0.059761,1.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.129099,0.0,0.0,0.000000,0.000000,0.0
110065001,0.000000,0.000000,0.000000,1.0,0.0,0.136083,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
110065002,0.000000,0.000000,0.000000,0.0,1.0,0.149071,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
728111001,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,1.0,0.0,0.000000,0.000000,0.0
728146001,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,1.0,0.000000,0.000000,0.0
728162001,0.030387,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,1.000000,0.250313,0.0
728162002,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.250313,1.000000,0.0


Let's suppose that a new customer has just bought a product with article_id 108775015, and we want to include in our marketing emails some products that this customer is most likely to buy. The first thing we have to do is to find the items most similar to the one with article_id 108775015. You can use the following code to get the 10 items most similar to the item with article_id 108775015:

In [225]:
top_10_similar_items = list(
   item_item_sim_matrix\
        .loc['108775015']\
        .sort_values(ascending=False)\
        .iloc[:10]\
    .index
)

In [226]:
top_10_similar_items

['108775015',
 '108775044',
 '568842007',
 '628927001',
 '536968001',
 '659211001',
 '641611002',
 '528790004',
 '635579001',
 '562251007']

In [228]:
articles.loc[
    articles['article_id'].isin(top_10_similar_items), 
    ['article_id', 'prod_name', 'product_type_name', 'graphical_appearance_name', 'colour_group_name']
].drop_duplicates().set_index('article_id')

Unnamed: 0_level_0,prod_name,product_type_name,graphical_appearance_name,colour_group_name
article_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
108775015,Strap top,Vest top,Solid,Black
108775044,Strap top,Vest top,Solid,White
528790004,Cloud,Vest top,Solid,Pink
536968001,Domino,Top,Melange,Dark Grey
562251007,Stella cropped RW 5 pkt,Trousers,Denim,Blue
568842007,Nihon long leg red,Trousers,Solid,Light Pink
628927001,OLIVIA BOHO,Blouse,All over pattern,Dark Red
635579001,Blossom Blouse,Blouse,All over pattern,Light Beige
641611002,Angel Hoodie,Hoodie,Mixed solid/pattern,Greenish Khaki
659211001,Flirty Travel pack,Other accessories,Solid,Light Pink


## 2. Image Processing

Future, if there is time.

Example: https://www.kaggle.com/code/gulgaishatemerbekova/clothes-recommendation-system-using-densenet121