# H&M - Implicit ALS model
![](https://storage.googleapis.com/kaggle-competitions/kaggle/31254/logos/header.png)

## Implicit ALS base model for the competition [H&M Personalized Fashion Recommendations](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations).


[Implicit](https://github.com/benfred/implicit/) is a library for recommender models. In theory, it supports GPU out-of-the-box, but I haven't tried it yet.

In this notebook we use ALS (Alternating Least Squares), but the library supports a lot of other models with not many changes.

ALS is one of the most used ML models for recommender systems. It's a matrix factorization method based on SVD (it's actually an approximated, numerical version of SVD). Basically, ALS factorizes the interaction matrix (user x items) into two smaller matrices, one for item embeddings and one for user embeddings. These new matrices are built in a manner such that the multiplication of a user and an item gives (approximately) it's interaction score. This build embeddings for items and for users that live in the same vector space, allowing the implementation of recommendations as simple cosine distances between users and items. This is, the 12 items we recommend for a given user are the 12 items with their embedding vectors closer to the user embedding vector.

There are a lot of online resources explaining it. For example, [here](https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-2-alternating-least-square-als-matrix-4a76c58714a1).


Be aware that there was a breaking API change in a recent release of implicit (11 days ago): https://github.com/benfred/implicit/releases/tag/v0.5.0 so some thing in the documentation are off if you use the version that comes installed in the Kaggle environments. Anyway, this competition doesn't forbid Internet usage, so upgrading the package to its latest version fixes all.


---

**I have reverted the kernet to version 14 with a score of `0.014` for now. The scores above `0.014` are using the `0.02` model by Heng Zheng as the fallback strategy for cold-start users. Therefore, a `0.018` score is actually bad. I did those version with the hope that they would work nicely together, obtaining `>0.02`. Since it was not the case, reporting a `0.018` score is not accurate at all.**

If I can obtain a score that surpasses the `0.02`, using ALS + Heng's baseline, I will roll back to those versions again.


# Please, _DO_ upvote if you find this kernel useful or interesting!

# Imports

In [1]:
# FYI:
# This pip command takes a lot with GPU enabled (~15 min)
# It works though. And GPU accelerates the process *a lot*.
# I am developing with GPU turned off and submitting with GPU turned on
!pip install --upgrade implicit

Collecting implicit
  Downloading implicit-0.5.2-cp38-cp38-win_amd64.whl (628 kB)
Installing collected packages: implicit
Successfully installed implicit-0.5.2


You should consider upgrading via the 'C:\Users\mdurh\anaconda3\python.exe -m pip install --upgrade pip' command.


In [2]:
import os; os.environ['OPENBLAS_NUM_THREADS']='1'
import numpy as np
import pandas as pd
import implicit
from scipy.sparse import coo_matrix
from implicit.evaluation import mean_average_precision_at_k

# Load dataframes

In [None]:
#transaction data 
#users 
#items 

In [21]:
df = df = pd.read_csv('cleaned_data2.csv', dtype={'StockCode': str}, parse_dates=['InvoiceDate'])

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 516528 entries, 0 to 516527
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    516528 non-null  object        
 1   StockCode    516528 non-null  object        
 2   Quantity     516528 non-null  int64         
 3   InvoiceDate  516528 non-null  datetime64[ns]
 4   UnitPrice    516528 non-null  float64       
 5   CustomerID   392980 non-null  float64       
 6   Country      516528 non-null  object        
 7   TotalValue   516528 non-null  float64       
 8   Description  516528 non-null  object        
dtypes: datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 35.5+ MB


In [23]:
df_cust = df.dropna()
dfu = pd.DataFrame([df_cust.groupby("CustomerID")["InvoiceNo"].nunique().rename("Purchases"),
                          df_cust.groupby("CustomerID")['StockCode'].nunique().rename("unique_products"),
                          df_cust.groupby("CustomerID")['Quantity'].sum().rename("total_no_items"),
                          df_cust.groupby("CustomerID")['TotalValue'].sum().rename("monetary"),
                          df_cust.groupby("CustomerID")['TotalValue'].median().rename("med_purchase_value")
                         ]).T
#Country
#recency of most recent purchase
#frequency 
#length of time as a customer
dfu.head()

Unnamed: 0_level_0,Purchases,unique_products,total_no_items,monetary,med_purchase_value
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
12346.0,2.0,1.0,0.0,0.0,0.0
12347.0,7.0,100.0,2392.0,4215.5,17.0
12348.0,4.0,19.0,2260.0,1376.04,41.76
12349.0,1.0,72.0,630.0,1457.55,17.7
12350.0,1.0,16.0,196.0,294.4,18.75


In [24]:
#product table
df_purchases = df[df["Quantity"]>0]

df_returns = df[df["Quantity"]<0]

In [25]:
dfi = pd.DataFrame([df.groupby("StockCode")['Description'].unique().rename("Description"),
                         df.groupby("StockCode")['UnitPrice'].mean().rename("UnitPrice"), #problem with the unit price
                         df_purchases.groupby("StockCode")["InvoiceNo"].nunique().rename("times_purchased"),
                         df_returns.groupby("StockCode")['InvoiceNo'].nunique().rename("times_returned"),
                         df_purchases.groupby("StockCode")['Quantity'].mean().rename("avg_quant_purchased"),
                         df.groupby("StockCode")['TotalValue'].sum().rename("net_item_revenue")
                         ]).T
dfi.times_returned.fillna(0, inplace=True)

#should add the frequency over different times of the year etc. 
# 12 variables for times purchased in each month 
dfi.head()

Unnamed: 0_level_0,Description,UnitPrice,times_purchased,times_returned,avg_quant_purchased,net_item_revenue
StockCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
10002,[inflatable political globe ],1.08662,71,0.0,12.112676,759.89
10125,[mini funky design tapes],0.859681,91,0.0,13.787234,994.84
10133,[colouring pencils brown tube],0.649045,196,1.0,14.479798,1540.02
10135,[colouring pencils brown tube],1.410167,175,1.0,12.463687,2206.14
11001,[asstd design racing car pen],1.878167,113,5.0,14.043478,2152.39


In [None]:
#%%time

#base_path = '../input/h-and-m-personalized-fashion-recommendations/'
#csv_train = f'{base_path}transactions_train.csv'
#csv_sub = f'{base_path}sample_submission.csv'
#csv_users = f'{base_path}customers.csv'
#csv_items = f'{base_path}articles.csv'

#df = pd.read_csv(csv_train, dtype={'article_id': str}, parse_dates=['t_dat'])
#df_sub = pd.read_csv(csv_sub)
#dfu = pd.read_csv(csv_users)
#dfi = pd.read_csv(csv_items, dtype={'article_id': str})

In [None]:
# Trying with less data:
# https://www.kaggle.com/tomooinubushi/folk-of-time-is-our-best-friend/notebook
#df = df[df['t_dat'] > '2020-08-21']
#df.shape

In [10]:
# For validation this means 3 weeks of training and 1 week for validation
# For submission, it means 4 weeks of training
df['InvoiceDate'].max()

Timestamp('2011-12-09 12:50:00')

## Assign autoincrementing ids starting from 0 to both users and items

In [26]:
ALL_USERS = df['CustomerID'].unique().tolist()
ALL_ITEMS = df['StockCode'].unique().tolist()

user_ids = dict(list(enumerate(ALL_USERS)))
item_ids = dict(list(enumerate(ALL_ITEMS)))

user_map = {u: uidx for uidx, u in user_ids.items()}
item_map = {i: iidx for iidx, i in item_ids.items()}

df['user_id'] = df['CustomerID'].map(user_map)
df['item_id'] = df['StockCode'].map(item_map)

del dfu, dfi

## Create coo_matrix (user x item) and csr matrix (user x item)

It is common to use scipy sparse matrices in recommender systems, because the main core of the problem is typically modeled as a matrix with users and items, with the values representing whether the user purchased (or liked) an items. Since each user purchases only a small fraction of the catalog of products, this matrix is full of zero (aka: it's sparse).

In a very recent release they did an API breaking change, so be aware of that: https://github.com/benfred/implicit/releases
In this notebook we are using the latest version, so everything is aligned with (user x item)

**We are using (user x item) matrices, both for training and for evaluating/recommender.**

In the previous versions the training procedure required a COO item x user

For evaluation and prediction, on the other hand, CSR matrices with users x items format should be provided.


### About COO matrices
COO matrices are a kind of sparse matrix.
They store their values as tuples of `(row, column, value)` (the coordinates)

You can read more about them here: 
* https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)
* https://scipy-lectures.org/advanced/scipy_sparse/coo_matrix.html

From https://het.as.utexas.edu/HET/Software/Scipy/generated/scipy.sparse.coo_matrix.html

```python
>>> row  = np.array([0,3,1,0]) # user_ids
>>> col  = np.array([0,3,1,2]) # item_ids
>>> data = np.array([4,5,7,9]) # a bunch of ones of lenght unique(user) x unique(items)
>>> coo_matrix((data,(row,col)), shape=(4,4)).todense()
matrix([[4, 0, 9, 0],
        [0, 7, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 5]])
```

## About CSR matrices
* https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)


In [27]:
row = df['user_id'].values
col = df['item_id'].values
data = np.ones(df.shape[0])
coo_train = coo_matrix((data, (row, col)), shape=(len(ALL_USERS), len(ALL_ITEMS)))
coo_train

<4356x2489 sparse matrix of type '<class 'numpy.float64'>'
	with 516528 stored elements in COOrdinate format>

# Check that model works ok with data

In [30]:
%%time
model = implicit.als.AlternatingLeastSquares(factors=10, iterations=2)
model.fit(coo_train)



  0%|          | 0/2 [00:00<?, ?it/s]

Wall time: 692 ms


# Validation

## Functions required for validation

In [33]:
def to_user_item_coo(df):
    """ Turn a dataframe with transactions into a COO sparse items x users matrix"""
    row = df['user_id'].values
    col = df['item_id'].values
    data = np.ones(df.shape[0])
    coo = coo_matrix((data, (row, col)), shape=(len(ALL_USERS), len(ALL_ITEMS)))
    return coo


def split_data(df, validation_days=7):
    """ Split a pandas dataframe into training and validation data, using <<validation_days>>
    """
    validation_cut = df['InvoiceDate'].max() - pd.Timedelta(validation_days)

    df_train = df[df['InvoiceDate'] < validation_cut]
    df_val = df[df['InvoiceDate'] >= validation_cut]
    return df_train, df_val

def get_val_matrices(df, validation_days=7):
    """ Split into training and validation and create various matrices
        
        Returns a dictionary with the following keys:
            coo_train: training data in COO sparse format and as (users x items)
            csr_train: training data in CSR sparse format and as (users x items)
            csr_val:  validation data in CSR sparse format and as (users x items)
    
    """
    df_train, df_val = split_data(df, validation_days=validation_days)
    coo_train = to_user_item_coo(df_train)
    coo_val = to_user_item_coo(df_val)

    csr_train = coo_train.tocsr()
    csr_val = coo_val.tocsr()
    
    return {'coo_train': coo_train,
            'csr_train': csr_train,
            'csr_val': csr_val
          }


def validate(matrices, factors=200, iterations=20, regularization=0.01, show_progress=True):
    """ Train an ALS model with <<factors>> (embeddings dimension) 
    for <<iterations>> over matrices and validate with MAP@12
    """
    coo_train, csr_train, csr_val = matrices['coo_train'], matrices['csr_train'], matrices['csr_val']
    
    model = implicit.als.AlternatingLeastSquares(factors=factors, 
                                                 iterations=iterations, 
                                                 regularization=regularization, 
                                                 random_state=42)
    model.fit(coo_train, show_progress=show_progress)
    
    # The MAPK by implicit doesn't allow to calculate allowing repeated items, which is the case.
    # TODO: change MAP@12 to a library that allows repeated items in prediction
    map12 = mean_average_precision_at_k(model, csr_train, csr_val, K=12, show_progress=show_progress, num_threads=4)
    print(f"Factors: {factors:>3} - Iterations: {iterations:>2} - Regularization: {regularization:4.3f} ==> MAP@12: {map12:6.5f}")
    return map12

In [34]:
matrices = get_val_matrices(df)

In [35]:
%%time
best_map12 = 0
for factors in [40, 50, 60, 100, 200, 500, 1000]:
    for iterations in [3, 12, 14, 15, 20]:
        for regularization in [0.01]:
            map12 = validate(matrices, factors, iterations, regularization, show_progress=False)
            if map12 > best_map12:
                best_map12 = map12
                best_params = {'factors': factors, 'iterations': iterations, 'regularization': regularization}
                print(f"Best MAP@12 found. Updating: {best_params}")

Factors:  40 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00758
Best MAP@12 found. Updating: {'factors': 40, 'iterations': 3, 'regularization': 0.01}
Factors:  40 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00694
Factors:  40 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00694
Factors:  40 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00000
Factors:  40 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.00000
Factors:  50 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.01190
Best MAP@12 found. Updating: {'factors': 50, 'iterations': 3, 'regularization': 0.01}
Factors:  50 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00926
Factors:  50 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00926
Factors:  50 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00926
Factors:  50 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.01042
Factors:  60 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.01042
Factors:  60 -

In [36]:
del matrices

# Training over the full dataset

In [37]:
coo_train = to_user_item_coo(df)
csr_train = coo_train.tocsr()

In [38]:
def train(coo_train, factors=200, iterations=15, regularization=0.01, show_progress=True):
    model = implicit.als.AlternatingLeastSquares(factors=factors, 
                                                 iterations=iterations, 
                                                 regularization=regularization, 
                                                 random_state=42)
    model.fit(coo_train, show_progress=show_progress)
    return model

In [39]:
best_params

{'factors': 500, 'iterations': 20, 'regularization': 0.01}

In [40]:
model = train(coo_train, **best_params)

  0%|          | 0/20 [00:00<?, ?it/s]

# Submission

## Submission function

In [41]:
def submit(model, csr_train, submission_name="submissions.csv"):
    preds = []
    batch_size = 2000
    to_generate = np.arange(len(ALL_USERS))
    for startidx in range(0, len(to_generate), batch_size):
        batch = to_generate[startidx : startidx + batch_size]
        ids, scores = model.recommend(batch, csr_train[batch], N=12, filter_already_liked_items=False)
        for i, userid in enumerate(batch):
            customer_id = user_ids[userid]
            user_items = ids[i]
            article_ids = [item_ids[item_id] for item_id in user_items]
            preds.append((customer_id, ' '.join(article_ids)))

    df_preds = pd.DataFrame(preds, columns=['customer_id', 'prediction'])
    df_preds.to_csv(submission_name, index=False)
    
    display(df_preds.head())
    print(df_preds.shape)
    
    return df_preds

In [42]:
%%time
df_preds = submit(model, csr_train);

Unnamed: 0,customer_id,prediction
0,17850.0,71477 82494l 82483 82482 84029g 85123a 84029e ...
1,13047.0,22961 22090 22692 22077 22969 21755 23245 2262...
2,12583.0,22417 22551 22617 21094 22556 22728 35970 2265...
3,13748.0,22423 22086 22961 22457 84946 22077 22952 2208...
4,15100.0,21258 21259 21257 21936 22567 22627 23310 2159...


(4356, 2)
Wall time: 717 ms
