## Feature Engineering

**Your Python Jupyter notebook should be configured with more than 8 GB of memory.**

In this series of tutorials, we will build a recommender system for fashion items. It consists of two models: a *retrieval model* and a *ranking model*. The retrieval model quickly generates a small subset of candidate items from a large collection of items. That speed comes at the cost of granularity, so we also train a ranking model, which only has to score the candidates and can therefore afford to use more features than the retrieval model.
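
To make the two-stage idea concrete, here is a minimal sketch on random toy data. Everything in it is an assumption made for illustration: the embedding sizes, the helper names `retrieve` and `rank`, and the linear scorer that stands in for a trained ranking model. It is not the architecture built later in this series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1,000 items and one user, embedded in 8 dimensions.
item_embeddings = rng.normal(size=(1000, 8))
user_embedding = rng.normal(size=8)

# Hypothetical richer per-item features that only the ranking stage looks at.
extra_features = rng.normal(size=(1000, 20))
weights = rng.normal(size=20)


def retrieve(user_vec, item_vecs, k=50):
    """Cheap first stage: score every item with a dot product, keep the top k."""
    scores = item_vecs @ user_vec
    return np.argsort(scores)[::-1][:k]


def rank(candidate_ids, features, w):
    """Second stage: rescore only the candidates using the extra features."""
    scores = features[candidate_ids] @ w
    return candidate_ids[np.argsort(scores)[::-1]]


candidates = retrieve(user_embedding, item_embeddings, k=50)
ranked = rank(candidates, extra_features, weights)
print(ranked[:10])
```

The ranking stage touches 50 rows instead of 1,000, which is why it can afford a wider feature vector per item.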

### Data

We will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.

<!-- https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data

For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customer who did not make any purchase during that time are excluded from the scoring. -->

The full dataset contains images of all products, but here we will simply use the tabular data. We have three data sources:
- `articles.csv`: info about fashion items.
- `customers.csv`: info about users.
- `transactions_train.csv`: info about transactions.

If the files are stored in your cluster, you can use the *hopsworks* library to download them locally; in this example, we have saved them to the `Resources` directory. The cells below read them with pandas from the URL assigned to `path`.

In [None]:
# Only needed when running Python outside the Hopsworks cluster: fill in your own details.
import os

# Read the API key from a local file so it is not hard-coded in the notebook.
with open("api-key.txt", "r") as f:
    key = f.read().rstrip()

os.environ["HOPSWORKS_PROJECT"] = "hm"
os.environ["HOPSWORKS_HOST"] = "35.240.81.237"
os.environ["HOPSWORKS_API_KEY"] = key

In [None]:
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

In [None]:
import pandas as pd
path = "https://repo.hops.works/dev/jdowling/"

In [None]:
articles_df = pd.read_csv(path + "articles.csv")
articles_df["article_id"] = articles_df["article_id"].astype(str)
articles_df.head(3)

In [None]:
customers_df = pd.read_csv(path + "customers.csv")
customers_df.head()

In [None]:
trans_df = pd.read_csv(path + "transactions_train.csv", parse_dates=["t_dat"])
trans_df["article_id"] = trans_df["article_id"].astype(str)
trans_df.head()

In [None]:
trans_df.info()

In [None]:
print(f"There are {len(trans_df):,} transactions in total.")

We can see that the dataset is large. For the sake of the tutorial, we will use a small subset of it, which we generate by sampling 25,000 customers and keeping only their transactions.

In [None]:
N_USERS = 25_000

# Consider only customers with age defined.
customers_df.dropna(inplace=True, subset=["age"])
customer_subset_df = customers_df.sample(N_USERS, random_state=27)
trans_df = trans_df.merge(customer_subset_df["customer_id"])

print(f"Subset has {len(trans_df):,} transactions in total.")
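
The merge above acts as a filter: with no `on` argument, pandas joins on the columns the two sides share (here only `customer_id`) and the default inner join drops every transaction whose customer was not sampled. A minimal sketch on made-up toy frames (the values are hypothetical) shows the effect and an equivalent `isin` formulation:

```python
import pandas as pd

# Hypothetical toy data standing in for customers_df and trans_df.
customers = pd.DataFrame(
    {"customer_id": ["a", "b", "c"], "age": [25.0, None, 40.0]}
)
transactions = pd.DataFrame(
    {
        "customer_id": ["a", "a", "b", "c", "c"],
        "article_id": ["1", "2", "1", "3", "3"],
    }
)

# Keep customers with a known age, then sample one of them.
known_age = customers.dropna(subset=["age"])
subset = known_age.sample(1, random_state=27)

# Inner merge on the shared column keeps only the sampled customer's rows.
filtered = transactions.merge(subset["customer_id"])

# Equivalent filter that makes the intent explicit.
same = transactions[transactions["customer_id"].isin(subset["customer_id"])]

print(filtered)
```

Customer `b` can never appear in the result, because the row with a missing age is dropped before sampling.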

### Feature Engineering

Next, we do some feature engineering.

The time of year a purchase was made should be a strong predictor, as seasonality is a major factor in fashion purchases. Here, we will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), we'll map each month to the unit circle using sine and cosine.

In [None]:
import numpy as np

# TODO - this is a transformation. We are applying it before we write to the FG.
# We should instead apply it as a transformation fn to the feature-view

# Map month to range [0,11].
month = trans_df["t_dat"].dt.month - 1
C = 2*np.pi/12

# Map month to the unit circle.
trans_df["month_sin"] = np.sin(month*C)
trans_df["month_cos"] = np.cos(month*C)
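
As a quick sanity check of the cyclical encoding, the sketch below uses the same mapping on individual months and compares Euclidean distances on the unit circle. The helper names `encode_month` and `distance` are introduced here for illustration only.

```python
import numpy as np


def encode_month(month):
    """Map a calendar month (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * (month - 1) / 12
    return np.array([np.sin(angle), np.cos(angle)])


def distance(m1, m2):
    """Euclidean distance between two encoded months."""
    return float(np.linalg.norm(encode_month(m1) - encode_month(m2)))


# January is as close to December as it is to February,
# while July sits on the opposite side of the circle.
print(distance(1, 12), distance(1, 2), distance(1, 7))
```

With a plain integer month, January and December would be 11 apart; on the circle they are neighbours.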

We'll also remove columns with null values.

In [None]:
customers_df.dropna(axis=1, inplace=True)
articles_df.dropna(axis=1, inplace=True)
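
Note that `axis=1` drops whole columns, not rows: any column containing at least one null value is removed and every row is kept. A minimal sketch on a made-up frame (the column values are hypothetical) contrasts the two axes:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing value.
df = pd.DataFrame(
    {
        "customer_id": ["a", "b", "c"],
        "age": [25.0, 31.0, 40.0],
        "fashion_news": ["Regularly", np.nan, "Monthly"],
    }
)

by_column = df.dropna(axis=1)  # drops "fashion_news", keeps all 3 rows
by_row = df.dropna(axis=0)     # drops the row for "b", keeps all 3 columns

print(by_column.columns.tolist())
print(len(by_row))
```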

Next, we convert the `t_dat` datetime column to Unix epoch milliseconds.

In [None]:
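# The integer view is in nanoseconds (assuming datetime64[ns]); divide by 10**6 to get milliseconds.
trans_df["t_dat"] = trans_df["t_dat"].values.astype(np.int64) // 10**6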
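
The conversion relies on the datetime values being stored with nanosecond resolution, so that integer division by `10**6` yields milliseconds. The sketch below, on two made-up dates, makes that assumption explicit with a cast and checks the round trip with `pd.to_datetime(..., unit="ms")`:

```python
import numpy as np
import pandas as pd

# Hypothetical dates; the first is exactly one day after the Unix epoch.
dates = pd.Series(pd.to_datetime(["1970-01-02", "2020-09-22"]))

# Cast to nanosecond resolution explicitly before taking the integer view.
nanos = dates.astype("datetime64[ns]").values.astype(np.int64)
millis = nanos // 10**6

# Round trip back to timestamps.
restored = pd.to_datetime(millis, unit="ms")

print(millis[0])
```

One day is 86,400 seconds, so the first value should be 86,400,000 milliseconds.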

### Feature Groups

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features.

Before we can create a feature group we need to connect to our feature store.

In [None]:
fs = project.get_feature_store()

To create a feature group we need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group.

In [None]:
customers_fg = fs.create_feature_group(
    name="customers",
    version=1,
    description="Customer data.",
    primary_key=["customer_id"],
    online_enabled=True
)

Here we have also set `online_enabled=True`, which enables low-latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, we have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent, we populate it with its associated data using the `insert` function.

In [None]:
customers_fg.insert(customers_df)

Let's do the same thing for the rest of the data frames.

In [None]:
articles_fg = fs.create_feature_group(
    name="articles",
    version=1,
    description="Fashion item data.",
    primary_key=["article_id"],
    online_enabled=True
)
articles_fg.insert(articles_df)



In [None]:
trans_fg = fs.create_feature_group(
    name="transactions",
    version=1,
    description="Transaction data.",
    primary_key=["customer_id", "article_id"],
    online_enabled=True,
    event_time="t_dat"
)
)
trans_fg.insert(trans_df)
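
Since a primary key is meant to identify rows, it can be worth counting how many rows share the same key before inserting. The sketch below does this with plain pandas on made-up transactions; it only counts duplicates and makes no claim about how Hopsworks handles them.

```python
import pandas as pd

# Hypothetical transactions: customer "a" buys article "1" twice.
trans = pd.DataFrame(
    {
        "customer_id": ["a", "a", "a", "b"],
        "article_id": ["1", "1", "2", "1"],
        "t_dat": [1, 2, 3, 4],
    }
)

key = ["customer_id", "article_id"]

# Rows that repeat an earlier (customer_id, article_id) combination.
n_dupes = int(trans.duplicated(subset=key).sum())

# Including the event time in the check distinguishes repeat purchases.
n_dupes_with_time = int(trans.duplicated(subset=key + ["t_dat"]).sum())

print(n_dupes, n_dupes_with_time)
```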

You should now be able to inspect the feature groups in the Hopsworks UI.

### Next Steps

In the next notebook we'll create a dataset that we can train a retrieval model on.