# Testing Notebook

## 1. Some Theory about Recommender Systems

The main families of methods for RecSys are:

- Collaborative Filtering: This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.

- Content-Based Filtering: This method uses only the descriptions and attributes of the items a user has previously consumed to model that user's preferences. In other words, these algorithms try to recommend items that are similar to those the user liked in the past (or is examining in the present). In particular, candidate items are compared with items previously rated by the user, and the best-matching items are recommended.

- Hybrid methods: Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content-based filtering, can be more effective than either pure approach in some cases. These methods can also help overcome common problems in recommender systems, such as cold start and data sparsity.

https://www.kaggle.com/code/gspmoreira/recommender-systems-in-python-101
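
As a minimal illustration of the collaborative filtering assumption described above, the sketch below uses a made-up 3-user, 4-item interaction matrix (all values are hypothetical). User B overlaps with user A, so B's extra item becomes a candidate recommendation for A, while user C shares nothing with A.

```python
import numpy as np

# Hypothetical toy data: rows are users A, B, C; columns are four items.
# 1 means the user bought the item, 0 means no interaction.
interactions = np.array([
    [1, 1, 0, 0],  # user A
    [1, 1, 1, 0],  # user B: shares two items with A
    [0, 0, 0, 1],  # user C: shares nothing with A
])

def cosine(u, v):
    """Cosine similarity between two interaction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine(interactions[0], interactions[1])
sim_ac = cosine(interactions[0], interactions[2])

# Items that B has and A lacks become candidate recommendations for A.
candidates_for_a = np.flatnonzero((interactions[1] == 1) & (interactions[0] == 0))

print(sim_ab, sim_ac, candidates_for_a)
```

The rest of this notebook applies the same idea to the real customer-article data.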

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

In [2]:
import scipy
import math
import random
import sklearn
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from sklearn.preprocessing import MinMaxScaler

Example: https://www.kaggle.com/code/hendraherviawan/itembased-collaborative-filter-recommendation-r/report

## 2. Read Data from Parquet Files

In [3]:
transactions = pd.read_parquet('data/transactions_train_sample_gt15transactions.parquet')
customers = pd.read_parquet('data/customers_sample_gt15transactions.parquet')
articles = pd.read_parquet('data/articles_sample_gt15transactions.parquet')

## 3. Join Dataframes

In [4]:
#### Join Transactions and Articles dataframes
transactions_articles_joined = transactions.join(articles.set_index('article_id'), on='article_id')

In [5]:
transactions_articles_joined.head()

| | t_dat | customer_id | article_id | price | sales_channel_id | week | product_code | prod_name | product_type_no | product_type_name | ... | department_name | index_code | index_name | index_group_no | index_group_name | section_no | section_name | garment_group_no | garment_group_name | detail_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-09-20 | 1728846800780188 | 519773001 | 0.028458 | 2 | 0 | 519773 | 7147 | 245 | 17 | ... | 1 | 0 | 0 | 1 | 0 | 15 | 0 | 1003 | 3 | 10231 |
| 1 | 2018-09-20 | 1728846800780188 | 578472001 | 0.032525 | 2 | 0 | 578472 | 37340 | 263 | 38 | ... | 23 | 0 | 0 | 1 | 0 | 19 | 40 | 1007 | 9 | 26053 |
| 2 | 2018-09-20 | 2076973761519164 | 661795002 | 0.167797 | 2 | 0 | 661795 | 43993 | 263 | 38 | ... | 23 | 0 | 0 | 1 | 0 | 19 | 40 | 1007 | 9 | 32892 |
| 3 | 2018-09-20 | 2076973761519164 | 684080003 | 0.101678 | 2 | 0 | 684080 | 1768 | 262 | 6 | ... | 23 | 0 | 0 | 1 | 0 | 19 | 40 | 1007 | 9 | 6151 |
| 4 | 2018-09-20 | 49501769952275870 | 615508002 | 0.016932 | 1 | 0 | 615508 | 2086 | 265 | 1 | ... | 4 | 0 | 0 | 1 | 0 | 15 | 0 | 1013 | 8 | 1937 |


In [6]:
#### Join Transactions-Articles with Customers dataframes
features_joined = transactions_articles_joined.join(customers.set_index('customer_id'), on='customer_id')

In [7]:
features_joined.head()

| | t_dat | customer_id | article_id | price | sales_channel_id | week | product_code | prod_name | product_type_no | product_type_name | ... | index_group_name | section_no | section_name | garment_group_no | garment_group_name | detail_desc | club_member_status | fashion_news_frequency | age | postal_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-09-20 | 1728846800780188 | 519773001 | 0.028458 | 2 | 0 | 519773 | 7147 | 245 | 17 | ... | 0 | 15 | 0 | 1003 | 3 | 10231 | 0 | 0 | 59.0 | 44730 |
| 1 | 2018-09-20 | 1728846800780188 | 578472001 | 0.032525 | 2 | 0 | 578472 | 37340 | 263 | 38 | ... | 0 | 19 | 40 | 1007 | 9 | 26053 | 0 | 0 | 59.0 | 44730 |
| 2 | 2018-09-20 | 2076973761519164 | 661795002 | 0.167797 | 2 | 0 | 661795 | 43993 | 263 | 38 | ... | 0 | 19 | 40 | 1007 | 9 | 32892 | 0 | 0 | 55.0 | 18589 |
| 3 | 2018-09-20 | 2076973761519164 | 684080003 | 0.101678 | 2 | 0 | 684080 | 1768 | 262 | 6 | ... | 0 | 19 | 40 | 1007 | 9 | 6151 | 0 | 0 | 55.0 | 18589 |
| 4 | 2018-09-20 | 49501769952275870 | 615508002 | 0.016932 | 1 | 0 | 615508 | 2086 | 265 | 1 | ... | 0 | 15 | 0 | 1013 | 8 | 1937 | 0 | 1 | 76.0 | 312383 |


In [8]:
len(features_joined)

1335776
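
`DataFrame.join` performs a left join by default, so the joined frame keeps exactly one row per transaction only when the key is unique in the right-hand table. The sketch below checks that property on made-up miniature tables (the `*_toy` names and values are hypothetical, not taken from the real data).

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
transactions_toy = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "article_id": [10, 11, 10],
})
articles_toy = pd.DataFrame({"article_id": [10, 11], "section_no": [15, 19]})
customers_toy = pd.DataFrame({"customer_id": [1, 2], "age": [59.0, 55.0]})

# Same pattern as the notebook: left-join on the key via the right table's index.
joined_toy = (
    transactions_toy
    .join(articles_toy.set_index("article_id"), on="article_id")
    .join(customers_toy.set_index("customer_id"), on="customer_id")
)

# A left join on a unique key neither drops nor duplicates transactions.
assert articles_toy["article_id"].is_unique
assert customers_toy["customer_id"].is_unique
assert len(joined_toy) == len(transactions_toy)
print(joined_toy)
```

The same two uniqueness checks could be run on `articles` and `customers` before trusting the row count of `features_joined`.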

## 4. Drop Unused Columns

Collaborative filtering only needs the customer_id and article_id columns.

In [9]:
features_prepared = features_joined[['customer_id', 'article_id']]

## 5. Collaborative Filtering

Build a matrix of the articles purchased by each customer.

### 5.1 Build Customer-Articles Matrix

In [10]:
# Add a Quantity column (one unit per transaction row).
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when assigning into a slice of features_joined.
features_prepared = features_prepared.assign(Quantity=1)


In [11]:
features_prepared.head()

| | customer_id | article_id | Quantity |
|---|---|---|---|
| 0 | 1728846800780188 | 519773001 | 1 |
| 1 | 1728846800780188 | 578472001 | 1 |
| 2 | 2076973761519164 | 661795002 | 1 |
| 3 | 2076973761519164 | 684080003 | 1 |
| 4 | 49501769952275870 | 615508002 | 1 |


In [12]:
customer_item_matrix = features_prepared.pivot_table(
    index='customer_id', 
    columns='article_id', 
    values='Quantity',
    aggfunc='sum'
)

In [13]:
customer_item_matrix

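| customer_id (rows) / article_id (columns) | 108775015 | 108775044 | 108775051 | 110065001 | 110065002 | 110065011 | 111565001 | 111565003 | 111586001 | 111593001 | ... | 946795001 | 947060001 | 947168001 | 947509001 | 947934001 | 949198001 | 949551001 | 949551002 | 953450001 | 956217002 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 345001598676045 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1134266496627188 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1728846800780188 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1845857727772358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2076973761519164 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18442722120177658597 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18444248544465254723 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18444595675436699040 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18445051157201360796 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |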


We now have a matrix in which each row is a customer, each column is an article, and each cell holds the total quantity of that article purchased by that customer (NaN where the customer never bought it).

Let's encode this data as 0-1, so that a value of 1 means that the given product was bought by the given customer, and a value of 0 means that it was never bought by that customer. Take a look at the following code:

In [None]:
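# Vectorized 0-1 encoding: NaN > 0 evaluates to False, so missing cells become 0.
# (DataFrame.applymap is deprecated in recent pandas and slow on a matrix this size.)
customer_item_matrix = (customer_item_matrix > 0).astype(int)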

In [None]:
customer_item_matrix
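
A dense pivot table stores a cell for every customer-article pair, most of which are empty. As a possible alternative, the sketch below builds the same kind of 0-1 matrix as a `scipy.sparse.csr_matrix` (already imported above) from made-up purchases. The `purchases` frame and all values are hypothetical, and this is only one way to do it, not the approach used in the rest of this notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy purchases; customer 1 bought article 10 twice.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 3],
    "article_id": [10, 10, 11, 11, 12],
})

# Map ids to consecutive row/column positions (in order of first appearance).
row_codes, customer_ids = pd.factorize(purchases["customer_id"])
col_codes, article_ids = pd.factorize(purchases["article_id"])

# Duplicate (row, col) entries are summed, so this holds purchase counts.
counts = csr_matrix(
    ([1] * len(purchases), (row_codes, col_codes)),
    shape=(len(customer_ids), len(article_ids)),
)

# Binarize: any positive count becomes 1.
bought = (counts > 0).astype(int)
print(bought.toarray())
```

`customer_ids` and `article_ids` keep the mapping from matrix positions back to the original ids, and `cosine_similarity` from scikit-learn accepts sparse input directly.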

### 5.2 Build Customers Similarity Matrix

Calculate the cosine similarities between users

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
user_user_sim_matrix = pd.DataFrame(
    cosine_similarity(customer_item_matrix)
)

In [None]:
user_user_sim_matrix

In [None]:
user_user_sim_matrix.columns = customer_item_matrix.index

In [None]:
user_user_sim_matrix['customer_id'] = customer_item_matrix.index
user_user_sim_matrix = user_user_sim_matrix.set_index('customer_id')

In [None]:
user_user_sim_matrix
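
For 0-1 purchase vectors, cosine similarity reduces to the number of shared articles divided by the square root of the product of the two basket sizes. The sketch below checks this on two made-up customers (the vectors `a` and `b` are hypothetical) against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical binary purchase rows for two customers over five articles.
a = np.array([1, 1, 1, 0, 0])
b = np.array([1, 1, 0, 1, 0])

shared = int(np.sum(a & b))                    # articles both customers bought
by_sets = shared / np.sqrt(a.sum() * b.sum())  # shared / sqrt(|A| * |B|)

by_sklearn = cosine_similarity([a], [b])[0, 0]
print(by_sets, by_sklearn)
```

Two customers with no article in common therefore get a similarity of exactly 0, which is why the recommendation loop later filters out zero scores.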

### 5.3 (SKIP) Test on one user: Get recommended items for User A according to the items purchased by a similar User B

In [None]:
#Find the users most similar to the customer with id 1728846800780188
user_user_sim_matrix.loc[1728846800780188].sort_values(ascending=False)[0:10]

These are the 10 most similar customers to customer 1728846800780188. Let's choose customer 1188311575786073826 and discuss how we can recommend products using these results.

Let's identify both users:
- user_A: 1728846800780188
- user_B: 1188311575786073826

The strategy is as follows.

- First, we need to identify the items that customers user_A and user_B have already purchased.
- Then, let's find the products that target customer user_A has not purchased, but customer user_B has.
- Since these two customers have bought similar items in the past, we will assume that target customer user_A has a high probability of buying the items he or she has not bought, but customer user_B has.
- Finally, we are going to use this list of items and recommend them to target customer user_A.

Let's first see how we can retrieve the items that the user_A customer has purchased in the past:

In [35]:
user_A = 1728846800780188
user_B = 1188311575786073826

In [36]:
items_bought_by_A = set(customer_item_matrix.loc[user_A].iloc[
    customer_item_matrix.loc[user_A].to_numpy().nonzero()
].index)

In [None]:
items_bought_by_A

Applying this expression to customer_item_matrix for user_A gives the set of items that user_A has purchased. We can apply the same code for user_B, as in the following:

In [38]:
items_bought_by_B = set(customer_item_matrix.loc[user_B].iloc[
    customer_item_matrix.loc[user_B].to_numpy().nonzero()
].index)

In [None]:
items_bought_by_B

We now have two sets of items that customers A and B have purchased. Using a simple set operation, we can find the items that customer B has purchased, but customer A has not. The code is like the one below:

In [40]:
items_to_recommend_to_A = items_bought_by_B - items_bought_by_A

In [41]:
items_to_recommend_to_A

{560221002,
 560221012,
 560222002,
 560222012,
 585130004,
 585158003,
 600043010,
 600044008,
 617245003,
 641187004,
 680263013,
 768879001,
 776237001}
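
The single-pair steps above can be wrapped in two small helpers. This is only a sketch: the function names `items_bought` and `recommend_from_neighbour` and the toy matrix are hypothetical and are not defined elsewhere in this notebook.

```python
import pandas as pd

def items_bought(matrix: pd.DataFrame, customer_id) -> set:
    """Return the article ids with a non-zero entry in the customer's row."""
    row = matrix.loc[customer_id]
    return set(row[row > 0].index)

def recommend_from_neighbour(matrix: pd.DataFrame, target_id, neighbour_id) -> set:
    """Articles the neighbour bought that the target has not bought yet."""
    return items_bought(matrix, neighbour_id) - items_bought(matrix, target_id)

# Hypothetical 0/1 customer-article matrix.
toy_matrix = pd.DataFrame(
    [[1, 1, 0, 0], [1, 1, 1, 1]],
    index=pd.Index([101, 102], name="customer_id"),
    columns=pd.Index([10, 11, 12, 13], name="article_id"),
)

result = recommend_from_neighbour(toy_matrix, target_id=101, neighbour_id=102)
print(result)
```

With the real data, the equivalent call would be `recommend_from_neighbour(customer_item_matrix, user_A, user_B)`.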

In [45]:
column_names = ["customer_id", "recom_articles"]
customers_rec_articles_df = pd.DataFrame(columns = column_names)

In [None]:
# DataFrame.append was removed in pandas 2.0, so build the one-row frame directly.
customers_rec_articles_df = pd.DataFrame(
    [{'customer_id': user_A, 'recom_articles': items_to_recommend_to_A}],
    columns=column_names
)
customers_rec_articles_df

### 5.4 Build the recommendations for each customer

In [None]:
# Create empty dataframe that will store the customers and their recommended products
column_names = ["customer_id", "recom_articles"]
customer_rec_articles_df = pd.DataFrame(columns = column_names)

In [None]:
%%capture --no-display 

# Collect one record per customer and build the DataFrame once at the end
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop).
rows = []

for i in range(len(customer_item_matrix)):  # every customer, including the last one
    customer_i_id = customer_item_matrix.index[i]
    items_to_recommend_to_i = set()

    items_bought_by_i = set(customer_item_matrix.loc[customer_i_id].iloc[customer_item_matrix.loc[customer_i_id].to_numpy().nonzero()].index)
    
    #Get most similar users to current user i
    #(customer i itself appears here with similarity 1 but contributes no new items)
    similar_users = user_user_sim_matrix.iloc[i].sort_values(ascending=False)[0:10] #get just the 10 most similar clients
    similar_users = similar_users[similar_users!=0].to_frame() #Get only clients with a non-zero similarity score

    #Get articles bought by the j similar users
    for j in similar_users.index.values:
        customer_j_id = j
        items_bought_by_j = set(customer_item_matrix.loc[customer_j_id].iloc[customer_item_matrix.loc[customer_j_id].to_numpy().nonzero()].index)
        items_to_recommend = items_bought_by_j - items_bought_by_i
        items_to_recommend_to_i = set.union(items_to_recommend_to_i, items_to_recommend)

    rows.append({'customer_id': customer_i_id, 'recom_articles': items_to_recommend_to_i})

customer_rec_articles_df = pd.DataFrame(rows, columns=column_names)


In [None]:
#Convert set of recommended articles to list
customer_rec_articles_df['recom_articles'] = customer_rec_articles_df['recom_articles'].apply(lambda x: list(x))

In [None]:
len(customer_rec_articles_df)

In [None]:
customer_rec_articles_df.head()
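
The loop above collects every article bought by any of the nearest customers, without ordering them. One possible refinement, which is not part of this notebook's pipeline, is to score each unseen article by the summed similarity of the neighbours who bought it and keep only the top few. The function `rank_candidates`, its parameters `k` and `top_n`, and the toy data below are all hypothetical.

```python
import numpy as np
import pandas as pd

def rank_candidates(item_matrix, sim_matrix, customer_id, k=10, top_n=5):
    """Score unseen articles by the summed similarity of the k nearest customers."""
    sims = sim_matrix.loc[customer_id].drop(customer_id)  # exclude the customer itself
    neighbours = sims[sims > 0].nlargest(k)
    # Weighted vote: each neighbour adds its similarity to every article it bought.
    scores = item_matrix.loc[neighbours.index].mul(neighbours, axis=0).sum(axis=0)
    scores = scores[item_matrix.loc[customer_id] == 0]    # keep only unseen articles
    return scores[scores > 0].sort_values(ascending=False).head(top_n)

# Hypothetical 0/1 customer-article matrix.
item_matrix = pd.DataFrame(
    [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 1, 1]],
    index=pd.Index([1, 2, 3], name="customer_id"),
    columns=pd.Index([10, 11, 12, 13], name="article_id"),
)

# Cosine similarity between customers, computed with plain NumPy.
values = item_matrix.to_numpy(dtype=float)
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
sim_matrix = pd.DataFrame(unit @ unit.T, index=item_matrix.index, columns=item_matrix.index)

ranked = rank_candidates(item_matrix, sim_matrix, customer_id=1)
print(ranked)
```

Here article 12 ranks above article 13 because both neighbours of customer 1 bought it, while only the less similar neighbour bought article 13.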

Export dataframe to csv:

In [None]:
customer_rec_articles_df.to_csv('customers_recommended_articles.csv', index=False)
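
Note that CSV stores each list of recommended articles as a string. A minimal sketch of reading such a file back, assuming a frame shaped like customer_rec_articles_df (the `recs` data below is made up, and an in-memory buffer stands in for the file on disk):

```python
import ast
import io

import pandas as pd

# Hypothetical frame shaped like customer_rec_articles_df.
recs = pd.DataFrame({
    "customer_id": [101, 102],
    "recom_articles": [[560221002, 585130004], []],
})

buffer = io.StringIO()  # stands in for the CSV file on disk
recs.to_csv(buffer, index=False)
buffer.seek(0)

# Without a converter the lists come back as strings such as "[560221002, 585130004]".
restored = pd.read_csv(buffer, converters={"recom_articles": ast.literal_eval})
print(restored["recom_articles"].tolist())
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for this kind of round trip.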

## 6. FUTURE: Content-Based Filtering

https://www.kaggle.com/code/fabiendaniel/film-recommendation-engine/notebook

SIMILARITY

Criteria to determine if two products are similar:
1. Section and Type (both together)
2. Graphical appearance
3. Colour
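
Criteria 1 and 3 could be prototyped with one-hot encoded attributes and cosine similarity, as in the rough sketch below. The column names `section`, `product_type` and `colour` and all values are hypothetical placeholders rather than the actual columns of the articles table, and graphical appearance (criterion 2) is left out because it would need image features.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical article attributes; the column names are illustrative only.
articles_toy = pd.DataFrame(
    {
        "section": ["Womens Casual", "Womens Casual", "Mens Sport"],
        "product_type": ["T-shirt", "T-shirt", "Shorts"],
        "colour": ["Black", "White", "Black"],
    },
    index=pd.Index([10, 11, 12], name="article_id"),
)

# Criterion 1 treats section and type as one combined feature.
features = pd.DataFrame({
    "section_type": articles_toy["section"] + " | " + articles_toy["product_type"],
    "colour": articles_toy["colour"],
})

# One-hot encode the categorical features and compare articles pairwise.
encoded = pd.get_dummies(features).astype(float)
item_sim = pd.DataFrame(
    cosine_similarity(encoded), index=articles_toy.index, columns=articles_toy.index
)
print(item_sim.round(2))
```

In this toy setup both criteria carry equal weight; weighting section and type more heavily than colour would be a design decision to test later.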