## Feature Engineering

In this series of tutorials, we will build a recommender system for fashion items. It consists of two models: a *retrieval model* and a *ranking model*. The retrieval model quickly narrows a large collection of items down to a small set of candidates. That speed comes at the cost of granularity, so we also train a ranking model, which only has to score the candidates and can therefore afford to use more features than the retrieval model.
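
The two-stage idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the models built in this series: the embeddings are random, and `retrieve`, `rank`, and `toy_ranker` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings; in practice these would come from a trained retrieval model.
n_items, dim = 1_000, 16
item_emb = rng.normal(size=(n_items, dim))
user_emb = rng.normal(size=dim)


def retrieve(user_vec, item_matrix, k):
    """Stage 1: cheap dot-product scoring over all items, keep the k best."""
    scores = item_matrix @ user_vec
    # argpartition finds the k largest scores without fully sorting all items.
    return np.argpartition(-scores, k - 1)[:k]


def rank(candidate_ids, score_fn):
    """Stage 2: apply a costlier scoring function to the candidates only."""
    scores = np.array([score_fn(i) for i in candidate_ids])
    return candidate_ids[np.argsort(-scores)]


def toy_ranker(item_id):
    """Placeholder for a ranking model that would use richer features."""
    return float(item_emb[item_id] @ user_emb) + 0.1 * float(item_emb[item_id, 0])


candidates = retrieve(user_emb, item_emb, k=50)  # 1,000 items -> 50 candidates
ranked = rank(candidates, toy_ranker)            # candidates ordered best first
```

The expensive scoring function runs on 50 items instead of 1,000, which is what lets the ranking stage use more features per item.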

**TODO Could be nice with an illustration of the control flow here.**

### Data

We will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.

<!-- https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data

For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customer who did not make any purchase during that time are excluded from the scoring. -->

The full dataset contains images of all products, but here we will simply use the tabular data. We have three data sources:
- `articles.csv`: info about fashion items.
- `customers.csv`: info about users.
- `transactions_train.csv`: info about transactions.

In [1]:
import pandas as pd

articles_df = pd.read_csv("articles.csv")
customers_df = pd.read_csv("customers.csv")
trans_df = pd.read_csv("transactions_train.csv", parse_dates=["t_dat"])

In [2]:
articles_df.head(3)

|  | article_id | product_code | prod_name | product_type_no | product_type_name | product_group_name | graphical_appearance_no | graphical_appearance_name | colour_group_code | colour_group_name | ... | department_name | index_code | index_name | index_group_no | index_group_name | section_no | section_name | garment_group_no | garment_group_name | detail_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 108775015 | 108775 | Strap top | 253 | Vest top | Garment Upper body | 1010016 | Solid | 9 | Black | ... | Jersey Basic | A | Ladieswear | 1 | Ladieswear | 16 | Womens Everyday Basics | 1002 | Jersey Basic | Jersey top with narrow shoulder straps. |
| 1 | 108775044 | 108775 | Strap top | 253 | Vest top | Garment Upper body | 1010016 | Solid | 10 | White | ... | Jersey Basic | A | Ladieswear | 1 | Ladieswear | 16 | Womens Everyday Basics | 1002 | Jersey Basic | Jersey top with narrow shoulder straps. |
| 2 | 108775051 | 108775 | Strap top (1) | 253 | Vest top | Garment Upper body | 1010017 | Stripe | 11 | Off White | ... | Jersey Basic | A | Ladieswear | 1 | Ladieswear | 16 | Womens Everyday Basics | 1002 | Jersey Basic | Jersey top with narrow shoulder straps. |


In [3]:
customers_df.head()

|  | customer_id | FN | Active | club_member_status | fashion_news_frequency | age | postal_code |
|---|---|---|---|---|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... |  |  | ACTIVE | NONE | 49.0 | 52043ee2162cf5aa7ee79974281641c6f11a68d276429a... |
| 1 | 0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... |  |  | ACTIVE | NONE | 25.0 | 2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93... |
| 2 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... |  |  | ACTIVE | NONE | 24.0 | 64f17e6a330a85798e4998f62d0930d14db8db1c054af6... |
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... |  |  | ACTIVE | NONE | 54.0 | 5d36574f52495e81f019b680c843c443bd343d5ca5b1c2... |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | 1.0 | 1.0 | ACTIVE | Regularly | 52.0 | 25fa5ddee9aac01b35208d01736e57942317d756b32ddd... |


In [4]:
trans_df.head()

|  | t_dat | customer_id | article_id | price | sales_channel_id |
|---|---|---|---|---|---|
| 0 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 663713001 | 0.050831 | 2 |
| 1 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 541518023 | 0.030492 | 2 |
| 2 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 505221004 | 0.015237 | 2 |
| 3 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 685687003 | 0.016932 | 2 |
| 4 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 685687004 | 0.016932 | 2 |


In [5]:
print(f"There are {len(trans_df):,} transactions in total.")

There are 31,788,324 transactions in total.


With almost 32 million transactions, this is a large dataset. For the sake of the tutorial, we will work with a small subset, which we generate by sampling 25,000 customers and keeping only their transactions.

In [6]:
N_USERS = 25_000

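# Consider only customers with age defined.
customers_df = customers_df.dropna(subset=["age"])
# Sample a reproducible subset of customers.
customer_subset_df = customers_df.sample(N_USERS, random_state=27)
# Keep only the transactions made by the sampled customers (inner join).
trans_df = trans_df.merge(customer_subset_df[["customer_id"]], on="customer_id")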

print(f"Subset has {len(trans_df):,} transactions in total.")

Subset has 585,510 transactions in total.


### Feature Engineering

Next, we do some feature engineering.

The time of year in which a purchase is made should be a strong predictor, since seasonality is a major factor in fashion purchases. Here, we use the month of the purchase as a feature. Because months are cyclical (January is as close to December as it is to February), we map each month to a point on the unit circle using sine and cosine, so that neighbouring months stay close together across the year boundary.

In [7]:
import numpy as np

# Map month to range [0, 11].
month = trans_df["t_dat"].dt.month - 1
C = 2 * np.pi / 12  # Angle (in radians) covered by one month.

# Map month to the unit circle.
trans_df["month_sin"] = np.sin(month*C)
trans_df["month_cos"] = np.cos(month*C)
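
As a quick sanity check of the encoding, the sketch below computes Euclidean distances between encoded months. It is a standalone illustration; the names `points` and `dist` are introduced here and are not part of the tutorial's pipeline.

```python
import numpy as np

months = np.arange(12)  # 0 = January, ..., 11 = December
angle = 2 * np.pi * months / 12
# One (sin, cos) point on the unit circle per month.
points = np.column_stack([np.sin(angle), np.cos(angle)])


def dist(a, b):
    """Euclidean distance between two encoded months."""
    return float(np.linalg.norm(points[a] - points[b]))


jan, feb, dec, jul = 0, 1, 11, 6
print(dist(jan, feb), dist(jan, dec), dist(jan, jul))
```

January is equally far from February and December, while January and July, six months apart, sit on opposite sides of the circle at a distance of 2. A plain integer month would instead place January and December eleven units apart.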

### Feature Groups

**TODO Explain the concept of a feature group. Create a feature group for each dataframe.**

### Dataset Split

**TODO This part should be done with Hopsworks (maybe in a separate notebook). The split will be chronological instead of random.**

In [8]:
# Join customer and article attributes onto each transaction.
df = (
    trans_df
    .merge(customer_subset_df, on="customer_id")
    .merge(articles_df, on="article_id")
)

df.sample(5)

|  | t_dat | customer_id | article_id | price | sales_channel_id | month_sin | month_cos | FN | Active | club_member_status | ... | department_name | index_code | index_name | index_group_no | index_group_name | section_no | section_name | garment_group_no | garment_group_name | detail_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 60589 | 2020-01-30 | b35a247363500dc4a114d072b40e6f3bb9d311a20a2794... | 785515003 | 0.050407 | 1 | 0.0 | 1.0 |  |  | ACTIVE | ... | Suit | A | Ladieswear | 1 | Ladieswear | 11 | Womens Tailoring | 1008 | Dressed | Gently fitted, single-breasted jacket in woven... |
| 547290 | 2020-09-03 | 8c8b8eac7e41cfcc24c5b264fd2e747821691f5e3fbb6e... | 868874006 | 0.031831 | 1 | -0.866025 | -0.5 | 1.0 | 1.0 | ACTIVE | ... | Blouse | A | Ladieswear | 1 | Ladieswear | 6 | Womens Casual | 1010 | Blouses | Shirt in soft, lightweight cotton flannel with... |
| 18022 | 2019-11-29 | f4ccdc3907bdd5650f7e27955df7295f44e00901bec4d3... | 820563003 | 0.027102 | 1 | -0.866025 | 0.5 |  |  | ACTIVE | ... | Knitwear | A | Ladieswear | 1 | Ladieswear | 15 | Womens Everyday Collection | 1003 | Knitwear | Jumper in a soft knit containing some wool wit... |
| 474525 | 2019-01-17 | 34fd8d37f4f0ccf1fd8ead99abcebf39614e0a8eec2e14... | 617725001 | 0.013542 | 2 | 0.0 | 1.0 |  |  | ACTIVE | ... | Divided+ inactive from s.1 | A | Ladieswear | 1 | Ladieswear | 2 | H&M+ | 1001 | Unknown | Cotton jersey top with short sleeves with sewn... |
| 546365 | 2019-01-03 | 89d7be7d3a2f5487541c64f570a6a62186f1e13ca3e997... | 592975012 | 0.030492 | 2 | 0.0 | 1.0 |  |  | ACTIVE | ... | Trouser | A | Ladieswear | 1 | Ladieswear | 11 | Womens Tailoring | 1009 | Trousers | Tailored trousers in a sturdy textured weave w... |


In [9]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.3, random_state=27)
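
The note at the top of this section says the final split will be chronological rather than random. One way that could look in plain pandas is sketched below. It is an assumption-laden illustration, not the Hopsworks-based split the note refers to: `chronological_split` is a hypothetical helper, and `toy_df` is a small stand-in for the merged `df`.

```python
import pandas as pd

# Toy stand-in for the merged `df`; only the `t_dat` column matters here.
toy_df = pd.DataFrame({
    "t_dat": pd.to_datetime([
        "2020-01-05", "2020-03-10", "2020-06-01", "2020-08-15", "2020-09-01",
        "2020-09-10", "2020-09-15", "2020-09-18", "2020-09-20", "2020-09-22",
    ]),
    "article_id": range(10),
})


def chronological_split(frame, date_col="t_dat", val_fraction=0.3):
    """Split so that every validation row is at least as recent as the cutoff date."""
    ordered = frame[date_col].sort_values()
    n_val = max(1, round(len(ordered) * val_fraction))
    # Date of the oldest row that belongs to the validation set.
    cutoff = ordered.iloc[len(ordered) - n_val]
    train = frame[frame[date_col] < cutoff]
    val = frame[frame[date_col] >= cutoff]
    return train, val


toy_train, toy_val = chronological_split(toy_df)
```

Splitting on a cutoff date, rather than on row position, keeps all transactions from the same day on the same side of the split. When many rows share the cutoff date, the validation set can therefore end up somewhat larger than `val_fraction`.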

In [10]:
train_df.to_csv("train_df.csv", index=False)
val_df.to_csv("val_df.csv", index=False)