# H&M Personalized Fashion Recommendations

# 1. The Overview
The objective of this notebook is to set up a recommender model with <a href="https://www.tensorflow.org/recommenders?hl=pt-br">TensorFlow Recommenders</a> and submit the results to the H&M Personalized Fashion Recommendations Competition on Kaggle.

You can find the complete overview of the competition and the datasets by clicking <a href='https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations'>HERE</a>.

<b>Let's make a short recap of the H&M Personalized Fashion Recommendations Competition.</b>

H&M Group is a family of brands and businesses with 53 online markets and approximately 4,850 stores. Their online store offers shoppers an extensive selection of products to browse through. But with too many choices, customers might not quickly find what interests them or what they are looking for, and ultimately, they might not make a purchase. To enhance the shopping experience, product recommendations are key. More importantly, helping customers make the right choices also has positive implications for sustainability, as it reduces returns and thereby minimizes emissions from transportation.

In this competition, H&M Group invites competitors to develop product recommendations based on data from previous transactions, as well as from customer and product meta data. The available meta data spans from simple data, such as garment type and customer age, to text data from product descriptions, to image data from garment images.

The challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends.

# 2. TensorFlow Recommenders

TensorFlow Recommenders (TFRS) is a library for building recommender system models.
It helps with the full workflow of building a recommender system: data preparation, model formulation, training, evaluation, and deployment.
It's built on Keras and aims to have a gentle learning curve while still giving you the flexibility to build complex models.

TFRS makes it possible to:
* Build and evaluate flexible recommendation retrieval models.
* Freely incorporate item, user, and context information into recommendation models.
* Train multi-task models that jointly optimize multiple recommendation objectives.

TFRS is open source and available on <a href="https://github.com/tensorflow/recommenders">GitHub</a>.

To learn more, see the <a href="https://www.tensorflow.org/recommenders/examples/basic_retrieval">tutorial</a> on how to build a movie recommender system, or check the <a href="https://www.tensorflow.org/recommenders/api_docs/python/tfrs">API docs</a> for the API reference.

# 3. The Plan of Attack
A retail recommender system normally has two phases:
* Retrieval - selects an initial set of candidates from all possible candidates.
* Ranking - takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations.

In this notebook we create only the retrieval model, which is simpler yet still powerful.

The basic idea is that we have two models: the query model (customer model) and the candidate model (article model).
Then we combine them into a new model, which computes the loss and optimizes the weights of both with the Adagrad optimizer.
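
To make the two-model idea concrete, here is a minimal NumPy sketch. The table sizes and the names `customer_table` and `article_table` are made up for illustration; in this notebook the two towers are Keras models trained by TFRS. The sketch assumes the affinity between a customer and an article is the dot product of their output vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 customers, 6 articles, 8-dimensional vectors
customer_table = rng.normal(size=(4, 8))  # stand-in for query model outputs
article_table = rng.normal(size=(6, 8))   # stand-in for candidate model outputs

# Affinity matrix: one row per customer, one column per article
scores = customer_table @ article_table.T

# For each customer, the position of the article with the highest affinity
best_article = scores.argmax(axis=1)
```

Training pushes the score of purchased (customer, article) pairs up relative to the other pairs.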

<img src="https://lh4.googleusercontent.com/CmzEJysJCdipgp1ntCzgZ5pdowgieyB43ep6cU_0WRpsOhXQy4aDWTZxd2IhV0A210TWIP41BpvSXKrGh5sv5Mya30lh1vKHQDiZK3wobWgIt23hx7pCZ3Em8TXMGt1mukuiu-2b" width = "700">

# 4. Installing and importing the required dependencies

In [8]:
!pip install -q tensorflow-recommenders
!pip install -q scann

In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import StringLookup
from tensorflow.keras.layers import IntegerLookup
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Normalization
import tensorflow_recommenders as tfrs
from sklearn.preprocessing import MinMaxScaler
import scann

# 5. The Dataset

* images - a folder of images corresponding to each article_id; images are placed in subfolders starting with the first three digits of the article_id; note, not all article_id values have a corresponding image.
* articles.csv - detailed metadata for each article_id available for purchase
* customers.csv - metadata for each customer_id in the dataset
* sample_submission.csv - a sample submission file in the correct format
* transactions_train.csv - the training data, consisting of the purchases of each customer for each date, as well as additional information. Duplicate rows correspond to multiple purchases of the same item.

For the sake of this notebook we skip the images dataset; it can be added later.

In [2]:
articles_dataset = pd.read_csv("Data/articles.csv")
customers_dataset = pd.read_csv("Data/customers.csv")
train_dataset = pd.read_csv("Data/transactions_train.csv")

In [3]:
articles_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105542 entries, 0 to 105541
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   article_id                    105542 non-null  int64 
 1   product_code                  105542 non-null  int64 
 2   prod_name                     105542 non-null  object
 3   product_type_no               105542 non-null  int64 
 4   product_type_name             105542 non-null  object
 5   product_group_name            105542 non-null  object
 6   graphical_appearance_no       105542 non-null  int64 
 7   graphical_appearance_name     105542 non-null  object
 8   colour_group_code             105542 non-null  int64 
 9   colour_group_name             105542 non-null  object
 10  perceived_colour_value_id     105542 non-null  int64 
 11  perceived_colour_value_name   105542 non-null  object
 12  perceived_colour_master_id    105542 non-null  int64 
 13 

In [4]:
customers_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1371980 entries, 0 to 1371979
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   customer_id             1371980 non-null  object 
 1   FN                      476930 non-null   float64
 2   Active                  464404 non-null   float64
 3   club_member_status      1365918 non-null  object 
 4   fashion_news_frequency  1355971 non-null  object 
 5   age                     1356119 non-null  float64
 6   postal_code             1371980 non-null  object 
dtypes: float64(3), object(4)
memory usage: 73.3+ MB


In [5]:
train_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 5 columns):
 #   Column            Dtype  
---  ------            -----  
 0   t_dat             object 
 1   customer_id       object 
 2   article_id        int64  
 3   price             float64
 4   sales_channel_id  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 1.2+ GB


## Reducing Memory
Since the dataset is huge and our model has some complexity, one of the most important things to do is to use memory as efficiently as possible.

The customer_id is a 64-character string, which takes at least 64 bytes per value.

The code below converts the column to int64, which takes only 8 bytes, by parsing the last 16 hexadecimal characters of each id.

We also convert article_id to int32.

In [236]:
# Saving Memory
customers_dataset["customer_id"] = customers_dataset.customer_id.apply(lambda x: int(x[-16:],16) ).astype('int64')
articles_dataset["article_id"] = articles_dataset["article_id"].astype(np.int32)
train_dataset["customer_id"] = train_dataset.customer_id.apply(lambda x: int(x[-16:],16) ).astype('int64')
train_dataset["article_id"] = train_dataset["article_id"].astype(np.int32)

# Preprocessing
age = customers_dataset.age.values
age = MinMaxScaler().fit_transform(age.reshape(-1,1)).T[0]
customers_dataset["age"] = age
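
As a small check of the id conversion idea, the sketch below applies it to two toy ids and compares memory usage. The helper name `hex_tail_to_int64` and the toy ids are hypothetical. One assumption differs from the cell above: a 16-digit hex value can exceed the signed 64-bit range, so this sketch parses into uint64 first and then reinterprets the bits as int64, which keeps distinct tails distinct.

```python
import numpy as np
import pandas as pd

def hex_tail_to_int64(ids: pd.Series) -> pd.Series:
    """Map the last 16 hex characters of each id to a signed 64-bit integer."""
    # Parse as unsigned 64-bit first, then reinterpret the same bits as signed
    unsigned = np.array([int(s[-16:], 16) for s in ids], dtype=np.uint64)
    return pd.Series(unsigned.view(np.int64), index=ids.index)

# Two made-up 64-character hex ids
toy_ids = pd.Series(["0" * 48 + "000000000000000a", "f" * 64])
converted = hex_tail_to_int64(toy_ids)

# Compare the footprint of the string column with the integer column
bytes_as_str = toy_ids.memory_usage(deep=True)
bytes_as_int = converted.memory_usage(deep=True)
```

Note that only the last 16 characters are kept, so the mapping cannot be inverted without storing the original strings.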

## Saving as Parquet
Now, to read the datasets faster later on, we save them in Parquet format.

In [7]:
# Save Parquet
articles_dataset.to_parquet("articles.parquet.gzip", compression='gzip')
customers_dataset.to_parquet("customers.parquet.gzip", compression='gzip')
train_dataset.to_parquet("transactions.parquet.gzip", compression='gzip')

# Delete the variables used so far to reduce memory usage
del articles_dataset
del customers_dataset
del train_dataset
del age

In [235]:
# Read Parquet
articles_dataset = pd.read_parquet("Reduced_Data/articles.parquet.gzip")
customers_dataset = pd.read_parquet("Reduced_Data/customers.parquet.gzip")
train_dataset = pd.read_parquet("Reduced_Data/transactions.parquet.gzip")

In [237]:
# Train/Test
train_dataset = train_dataset[train_dataset.t_dat >= "2020-09-01"]
#test_dataset = train_dataset[train_dataset.t_dat >= "2020-09-19"]
#train_dataset = train_dataset[train_dataset.t_dat < "2020-09-19"]
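
The commented-out lines above hint at a time-based holdout. A minimal sketch of that split on a toy frame follows; the toy rows are made up, and the cut-off date is the one from the commented-out code.

```python
import pandas as pd

# Toy transactions; t_dat holds ISO-formatted date strings
toy = pd.DataFrame({
    "t_dat": ["2020-09-01", "2020-09-18", "2020-09-19", "2020-09-22"],
    "article_id": [1, 2, 3, 4],
})

split_date = "2020-09-19"  # cut-off taken from the commented-out lines

# ISO dates sort correctly as plain strings, so string comparison is enough
holdout = toy[toy.t_dat >= split_date]
train = toy[toy.t_dat < split_date]
```

Splitting by date rather than at random keeps the holdout strictly after the training period, which mirrors the competition setup.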

# 6. Parsing the data

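In [238]:
def merger(left, right, var, on):
    """Copy column `var` from `right` into `left`, matching rows on the key column `on`.

    Mapping through a dict is intended to use less memory than a full DataFrame merge.
    If a key appears several times in `right`, the last value wins.
    Note that `left` is modified in place.
    """
    mapper = right[[on, var]].set_index(on).to_dict()[var]
    left[var] = left[on].map(mapper)
    return left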
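
As an illustration of the same map-based join, the sketch below builds the lookup from a pandas Series instead of a Python dict and returns a copy rather than modifying `left`. The name `map_feature` and the toy frames are hypothetical; this is a variant for comparison, not the function used in the rest of the notebook.

```python
import pandas as pd

def map_feature(left, right, var, on):
    """Return a copy of `left` with column `var` looked up from `right` by key `on`."""
    # Keep one row per key (the last one) so the lookup index is unique
    lookup = right.drop_duplicates(subset=on, keep="last").set_index(on)[var]
    out = left.copy()
    out[var] = out[on].map(lookup)
    return out

# Toy data: article 10 was sold at two prices, article 12 was never sold
prices = pd.DataFrame({"article_id": [10, 10, 11], "price": [0.1, 0.3, 0.2]})
articles = pd.DataFrame({"article_id": [10, 11, 12]})

result = map_feature(articles, prices, "price", "article_id").fillna(-1)
```

Keys missing from `right` come back as NaN, which is why the notebook follows each call with `fillna(-1)`.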

## 6.1 Articles Dataset

In [239]:
# Articles
articles_dataset = merger(articles_dataset,train_dataset,"price","article_id").fillna(-1)

## 6.2 Customers Dataset
We have already parsed the age feature, so we can skip this step. If you want to do some feature engineering on the customers dataset, this section is the place for it.

In [240]:
# Customers

## 6.3 Training Dataset
Now we can merge all features we have into the training dataset.

In [241]:
# Training
train_dataset = merger(train_dataset,customers_dataset,"age","customer_id").fillna(-1)
#test_dataset = merger(test_dataset,customers_dataset,"age","customer_id").fillna(-1)

## 6.4 Converting to Tensors

In [242]:
# Tensors
candidates_tensor = tf.data.Dataset.from_tensor_slices(dict(articles_dataset[["article_id","price"]]))
train_tensor = tf.data.Dataset.from_tensor_slices(dict(train_dataset[['customer_id','article_id','age','price']])).shuffle(100000).batch(5000).cache()
#test_tensor = tf.data.Dataset.from_tensor_slices(dict(test_dataset[['customer_id','article_id','age','price']])).shuffle(100000).batch(5000).cache()

## 6.5 Getting Uniques
Here we get the vocabularies of customer_id and article_id to pass to the IntegerLookup layers of our models.

This is important because the embedding layer needs an integer index for each id.

We also get their lengths, which give the embedding layers their input dimensions.

In [243]:
# Getting uniques
unique_customers_ids = customers_dataset.customer_id.unique()
unique_articles_ids = articles_dataset.article_id.unique()

unique_customers = len(unique_customers_ids)
unique_articles = len(unique_articles_ids)
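
The sketch below shows, on a made-up vocabulary of three ids, how an IntegerLookup layer turns raw ids into contiguous indices and why the embedding table needs one extra row. With `mask_token=None`, index 0 is reserved for ids outside the vocabulary and the known ids follow from 1. The embedding width of 4 is arbitrary.

```python
import numpy as np
import tensorflow as tf

vocabulary = [10, 20, 30]  # toy ids, not real article ids

# Index 0 is kept for out-of-vocabulary ids; known ids map to 1..3
lookup = tf.keras.layers.IntegerLookup(vocabulary=vocabulary, mask_token=None)
indices = lookup(np.array([30, 10, 99]))  # 99 is not in the vocabulary

# One extra row in the embedding table for the out-of-vocabulary index
embedding = tf.keras.layers.Embedding(len(vocabulary) + 1, 4)
vectors = embedding(indices)
```

This is the reason the models in the next section size their embeddings as `unique_customers + 1` and `unique_articles + 1`.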

Finally, we can delete the variables we no longer need to save memory.

In [244]:
del train_dataset
del articles_dataset
del customers_dataset

# 7. Creating the Model
## 7.1 Customer/Query Model

Let's start by creating the Customer/Query model.

The first step is to create an embedding model (customer_id_model).

One of the biggest advantages of TFRS is how easily context features can be fed into the model. In our case, we feed in the age feature.


In [245]:
class CustomerModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.customer_id_model = tf.keras.Sequential()
        self.customer_id_model.add(IntegerLookup(vocabulary = unique_customers_ids,mask_token = None))
        self.customer_id_model.add(Embedding(unique_customers + 1,200))
        
    def call(self, inputs):
        reshaped_age = tf.reshape(inputs['age'],(-1,1))
        return tf.concat([self.customer_id_model(inputs["customer_id"]),
                          reshaped_age],axis=1)
    
class QueryModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.embedding_layer = CustomerModel()
        self.dense_layers = tf.keras.Sequential()
        self.dense_layers.add(tf.keras.layers.Dense(100))

        
    def call(self, inputs):
        feature_embeddings = self.embedding_layer(inputs)
        return self.dense_layers(feature_embeddings)
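
To follow the shapes through the query model, here is a NumPy sketch of the same steps on a batch of three customers. The random arrays stand in for the learned embedding and dense weights, so only the shapes are meaningful; the sizes 200 and 100 are the ones used in the classes above.

```python
import numpy as np

batch, emb_dim, out_dim = 3, 200, 100
rng = np.random.default_rng(0)

id_embedding = rng.normal(size=(batch, emb_dim))  # stand-in for the Embedding output
age = np.array([0.25, -1.0, 0.6])                 # scaled age, -1 marks a missing value

# (batch,) -> (batch, 1) so the feature can be concatenated along axis 1
reshaped_age = age.reshape(-1, 1)
features = np.concatenate([id_embedding, reshaped_age], axis=1)  # (batch, 201)

# A Dense layer without activation is a linear map: features @ W + b
W = rng.normal(size=(emb_dim + 1, out_dim))
b = np.zeros(out_dim)
query_vector = features @ W + b  # (batch, 100)
```

The final projection brings the query vector to the same width as the candidate vector, which is needed for the dot product between the two.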

## 7.2 Article/Candidate Model

Here we create the Article/Candidate Model. 

First, we need to create the embedding model, and afterwards we create the candidate model, feeding in the price feature.

In [246]:
class ArticleModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.article_id_model = tf.keras.Sequential()
        self.article_id_model.add(IntegerLookup(vocabulary = unique_articles_ids,mask_token = None))
        self.article_id_model.add(Embedding(unique_articles + 1, 200))
        
    def call(self, inputs):
        reshaped_price = tf.reshape(inputs["price"],(-1,1))
        return tf.concat([self.article_id_model(inputs["article_id"]),
                          reshaped_price],axis=1)
    
class CandidateModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.embedding_layer = ArticleModel()
        self.dense_layers = tf.keras.Sequential()
        self.dense_layers.add(tf.keras.layers.Dense(100))

        
    def call(self, inputs):
        feature_embeddings = self.embedding_layer(inputs)
        return self.dense_layers(feature_embeddings)

## 7.3 Combined Model

Then we combine the two models into a new one, the Combined Model.

In [247]:
class CombinedModel(tfrs.models.Model):
    def __init__(self):
        super().__init__()
        self.query_model = QueryModel()
        self.candidate_model = CandidateModel()
        self.task = tfrs.tasks.Retrieval(
            metrics = tfrs.metrics.FactorizedTopK(
                candidates=candidates_tensor.batch(128).map(self.candidate_model)))
        
    def compute_loss(self, features, training = False):
        query_dict = {"customer_id":features["customer_id"],"age":features["age"]}
        query_outcome = self.query_model(query_dict)
        
        candidate_dict = {"article_id":features["article_id"],"price":features["price"]}
        candidate_outcome = self.candidate_model(candidate_dict)
        
        return self.task(query_outcome, candidate_outcome, compute_metrics = not training)
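
As a rough intuition for what the retrieval task optimizes, the sketch below computes an in-batch softmax loss in NumPy. It assumes that each query's positive candidate sits at the same batch position and that the other candidates in the batch act as negatives; the function name `in_batch_softmax_loss` is hypothetical, and the actual TFRS task has further options not shown here.

```python
import numpy as np

def in_batch_softmax_loss(query_emb, cand_emb):
    """Sum of cross-entropy losses where row i's positive is candidate i."""
    logits = query_emb @ cand_emb.T
    # Subtract the row maximum for numerical stability
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability of each true pair
    return -np.trace(log_probs)

# Untrained case: all scores equal, so each of the 3 rows contributes log(3)
uniform_loss = in_batch_softmax_loss(np.zeros((3, 2)), np.zeros((3, 2)))

# Well-separated case: each query matches only its own candidate
aligned_loss = in_batch_softmax_loss(10 * np.eye(3), 10 * np.eye(3))
```

The loss falls toward zero as matching pairs score much higher than mismatched ones.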

## 7.4 Training the Model

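Before training, we instantiate the model and compile it with the Adagrad optimizer. A CSVLogger callback and a validation set are left commented out and can be enabled if needed.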

In [263]:
model = CombinedModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.002))

In [264]:
#csv_logger_callback = tf.keras.callbacks.CSVLogger("logger.csv", separator=',', append=False)

In [265]:
history = model.fit(
    train_tensor,    
    epochs=100,
    verbose=1, #callbacks = [csv_logger_callback],
    #validation_data=test_tensor,
    #validation_freq=1,
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# 8. Scanning and Indexing

In [None]:
scann_index = tfrs.layers.factorized_top_k.ScaNN(model.query_model,k=12)
scann_index.index_from_dataset(tf.data.Dataset.zip((candidates_tensor.map(lambda x : x["article_id"]).batch(100), candidates_tensor.batch(100).map(model.candidate_model))))
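
ScaNN performs an approximate nearest-neighbor search. For intuition, or as an exact baseline on a small candidate set, the same top-k lookup can be written as a brute-force search. The function name `brute_force_top_k` and the toy vectors below are made up for illustration.

```python
import numpy as np

def brute_force_top_k(query_vectors, candidate_vectors, candidate_ids, k):
    """Return the ids of the k highest-scoring candidates for each query."""
    scores = query_vectors @ candidate_vectors.T
    # Sort each row by descending score and keep the first k positions
    top = np.argsort(-scores, axis=1)[:, :k]
    return candidate_ids[top]

candidate_ids = np.array([101, 102, 103, 104])
candidate_vectors = np.eye(4)  # toy candidates: one axis each
query_vectors = np.array([[0.1, 0.9, 0.3, 0.0],
                          [0.5, 0.0, 0.2, 0.7]])

top2 = brute_force_top_k(query_vectors, candidate_vectors, candidate_ids, k=2)
```

Scoring every candidate for every query costs time proportional to the number of candidates, which is what an approximate index avoids on catalogs of this size.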

# 9. Submitting the Predictions

In [None]:
sub = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv")

In [None]:
# Convert customer_id to int64 and keep a mapping back to the original string
sub['customer_id_int64'] = sub.customer_id.apply(lambda x: int(x[-16:],16) ).astype('int64')
customer_id_dict = sub[["customer_id","customer_id_int64"]].set_index("customer_id_int64").to_dict()
sub["customer_id"] = sub["customer_id_int64"]

# Merge Age and Fillna
customers_dataset = pd.read_parquet("./customers.parquet.gzip")
sub_train = merger(sub,customers_dataset,"age","customer_id").fillna(-1)

In [None]:
_, articles = scann_index(dict(sub_train[['customer_id','age']]))
predictions = np.array(articles).astype(str)
# Left-pad every article_id with zeros to 10 characters
mapper = np.vectorize(lambda x: x.zfill(10))
zfilled_preds = [mapper(row) for row in predictions]
# Join the padded ids of each customer into one space-separated string
sub['prediction'] = pd.Series(map(' '.join, zfilled_preds))
sub["customer_id"] = sub['customer_id'].map(customer_id_dict["customer_id"])
# Write only the two columns of the sample submission
sub[["customer_id", "prediction"]].to_csv("submission.csv", index=False)
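
The formatting step can be checked in isolation. The sketch below pads made-up integer ids to 10 characters and joins them per customer, using 3 ids per row instead of 12 to keep it short. It assumes, as the `zfill(10)` call above suggests, that article ids are written as 10-character strings with leading zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical retrieved ids for two customers
retrieved = np.array([[108775015, 706016001, 372860001],
                      [610776002, 108775015, 706016001]])

# Convert to strings and left-pad with zeros to a width of 10
padded = np.char.zfill(retrieved.astype(str), 10)

# One space-separated string per customer
prediction = pd.Series([" ".join(row) for row in padded])
```

`np.char.zfill` pads the whole array at once, so no per-row loop is needed for the padding itself.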