## <span style="color:#ff5f27">👩🏻‍🔬 Feature Engineering </span>

**Your Python Jupyter notebook should be configured for >8GB of memory.**

In this series of tutorials, we will build a recommender system for fashion items. It will consist of two models: a *retrieval model* and a *ranking model*. The idea is that the retrieval model should be able to quickly generate a small subset of candidate items from a large collection of items. This comes at the cost of granularity, which is why we also train a ranking model that can afford to use more features than the retrieval model.
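
The two-stage idea can be sketched with a toy example. This is only an illustration under made-up assumptions: the item vectors, the `popularity` and `price` features, and the scoring rule in `rank` are all hypothetical and are not the models trained in this series.

```python
# Toy two-stage recommender. All data and scoring rules are invented for illustration.

def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(user_vec, item_vecs, k):
    """Stage 1: score every item with a cheap similarity and keep the top k."""
    by_similarity = sorted(
        item_vecs,
        key=lambda item_id: dot(user_vec, item_vecs[item_id]),
        reverse=True,
    )
    return by_similarity[:k]


def rank(candidates, item_features):
    """Stage 2: re-order the few candidates using additional features."""
    def score(item_id):
        features = item_features[item_id]
        # Hypothetical scoring rule: prefer popular, cheaper items.
        return features["popularity"] - 0.01 * features["price"]

    return sorted(candidates, key=score, reverse=True)


user_vec = [1.0, 0.0]
item_vecs = {
    "a": [0.9, 0.1],
    "b": [0.8, 0.3],
    "c": [0.1, 0.9],
    "d": [-0.5, 0.2],
}
item_features = {
    "a": {"popularity": 0.2, "price": 30.0},
    "b": {"popularity": 0.9, "price": 20.0},
    "c": {"popularity": 0.5, "price": 10.0},
    "d": {"popularity": 0.7, "price": 15.0},
}

candidates = retrieve(user_vec, item_vecs, k=2)  # small subset of the catalogue
ranked = rank(candidates, item_features)         # final ordering of that subset
print(candidates, ranked)
```

The retrieval step touches every item but uses only a cheap score, while the ranking step uses more features but sees only the `k` retrieved candidates.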

### <span style="color:#ff5f27">✍🏻 Data</span>

We will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.

<!-- https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data

For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customer who did not make any purchase during that time are excluded from the scoring. -->

The full dataset contains images of all products, but here we will simply use the tabular data. We have three data sources:
- `articles.csv`: info about fashion items.
- `customers.csv`: info about users.
- `transactions_train.csv`: info about transactions.

You can use the *hopsworks* library to download these files locally, assuming that they are stored in your cluster. In this example, we have saved them as Parquet files (`articles.parquet`, `customers.parquet`, `transactions_train.parquet`) in the `Resources` directory.

## <span style="color:#ff5f27">📝 Imports </span>

In [None]:
import os
import pandas as pd
import numpy as np

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

## <span style="color:#ff5f27">💽 Download datasets</span>

In [None]:
dataset_api = project.get_dataset_api()

data_dir = "data/"

# Create the local data directory if it does not exist yet
os.makedirs(data_dir, exist_ok=True)

for file in ["articles.parquet", "customers.parquet", "transactions_train.parquet"]:
    dataset_api.download(f"Resources/{file}", local_path=data_dir, overwrite=True)

## <span style="color:#ff5f27">🗄️ Read Articles Data</span>

In [None]:
articles_df = pd.read_parquet(data_dir + "articles.parquet")
articles_df["article_id"] = articles_df["article_id"].astype(str)
print(articles_df.shape)
articles_df.head(3)

In [None]:
# Check for NaNs
articles_df.isna().sum()[articles_df.isna().sum() > 0]

## <span style="color:#ff5f27">🗄️ Read Customers Data</span>

In [None]:
customers_df = pd.read_parquet(data_dir + "customers.parquet")
print(customers_df.shape)
customers_df.head(3)

In [None]:
# Check for NaNs
customers_df.isna().sum()[customers_df.isna().sum() > 0]

## <span style="color:#ff5f27">🗄️ Read Transactions Train Data</span>

In [None]:
trans_df = pd.read_parquet(data_dir + "transactions_train.parquet")
print(trans_df.shape)
trans_df.head(3)

In [None]:
# Check for NaNs
trans_df.isna().sum()[trans_df.isna().sum() > 0]

In [None]:
trans_df["article_id"] = trans_df["article_id"].astype(str)
# Vectorized conversion of the whole column to datetime
trans_df["t_dat"] = pd.to_datetime(trans_df["t_dat"])
trans_df.head(3)

In [None]:
print(f"There are {len(trans_df):,} transactions in total.")

We can see that we have a large dataset. For the sake of the tutorial, we will use a small subset of it, which we generate by sampling 25,000 customers and keeping only their transactions.

In [None]:
N_USERS = 25_000

# Consider only customers with age defined.
customers_df.dropna(inplace=True, subset=["age"])
customer_subset_df = customers_df.sample(N_USERS, random_state=27)
trans_df = trans_df.merge(customer_subset_df["customer_id"])

print(f"Subset has {len(trans_df):,} transactions in total.")
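
The merge in the cell above acts as a filter: an inner join on `customer_id` keeps only the transactions of sampled customers. A minimal sketch on toy data (the frames and values below are made up for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only.
customers = pd.DataFrame(
    {"customer_id": ["c1", "c2", "c3", "c4"], "age": [25.0, None, 41.0, 33.0]}
)
transactions = pd.DataFrame(
    {
        "customer_id": ["c1", "c1", "c2", "c3", "c4", "c4"],
        "article_id": ["a1", "a2", "a1", "a3", "a2", "a3"],
    }
)

# Drop customers without an age, then draw a reproducible sample of two.
with_age = customers.dropna(subset=["age"])
sampled = with_age.sample(2, random_state=27)

# Inner join on the shared key: transactions of non-sampled customers are dropped.
subset = transactions.merge(sampled[["customer_id"]], on="customer_id", how="inner")

print(f"Subset has {len(subset):,} of {len(transactions):,} transactions.")
```

Fixing `random_state` makes the sample, and therefore the subset, reproducible across runs.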

## <span style="color:#ff5f27">👨🏻‍🏭 Feature Engineering</span>

Next, we do some feature engineering.

The time of the year a purchase was made should be a strong predictor, as seasonality plays a big factor in fashion purchases. Here, we will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), we'll map each month to the unit circle using sine and cosine.
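
To see why the sine/cosine mapping captures this, here is a small standard-library sketch (the helper names `month_to_circle` and `circle_distance` are ours, for illustration only). It compares distances between encoded months:

```python
import math


def month_to_circle(month):
    """Map a calendar month (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


def circle_distance(m1, m2):
    """Euclidean distance between the encoded positions of two months."""
    return math.dist(month_to_circle(m1), month_to_circle(m2))


dec_jan = circle_distance(12, 1)  # neighbours across the year boundary
jan_feb = circle_distance(1, 2)   # neighbours within the year
jan_jul = circle_distance(1, 7)   # opposite sides of the circle

print(f"Dec-Jan: {dec_jan:.3f}, Jan-Feb: {jan_feb:.3f}, Jan-Jul: {jan_jul:.3f}")
```

With a plain integer encoding, December (12) and January (1) would be 11 apart; on the circle they are exactly as close as January and February.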

In [None]:
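%%writefile transformations.py
import numpy as np


def month_sin(t_dat):
    # Shift months from 1-12 to 0-11, then map to an angle on the unit circle.
    month = t_dat.month - 1
    C = 2 * np.pi / 12
    return np.sin(month * C).item()


def month_cos(t_dat):
    # Same angle as month_sin, projected onto the cosine axis.
    month = t_dat.month - 1
    C = 2 * np.pi / 12
    return np.cos(month * C).item()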

In [None]:
# Register transformation functions for computing the sine and cosine of the month
from transformations import month_sin, month_cos

fns = [fn.name for fn in fs.get_transformation_functions()]

if "month_sin" not in fns:
    month_to_sin = fs.create_transformation_function(month_sin, output_type=float, version=1)
    month_to_sin.save()

if "month_cos" not in fns:
    # Use a separate name so the imported month_cos function is not overwritten
    month_to_cos = fs.create_transformation_function(month_cos, output_type=float, version=1)
    month_to_cos.save()

We'll also remove columns with null values.

In [None]:
customers_df.dropna(axis=1, inplace=True)
articles_df.dropna(axis=1, inplace=True)

In [None]:
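# Placeholder columns holding the raw purchase date; presumably the registered
# month_sin / month_cos transformation functions are applied to them downstream.
trans_df["month_sin"] = trans_df["t_dat"]
trans_df["month_cos"] = trans_df["t_dat"]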

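Next, we convert the transaction date from a datetime to Unix epoch milliseconds. The integer view of a `datetime64[ns]` column is in nanoseconds, hence the division by 10^6.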
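
As a sanity check on what "Unix epoch milliseconds" means, here is a standard-library sketch for a single timestamp. The helper names are ours, and it assumes naive datetimes are interpreted as UTC, which is also how NumPy treats naive `datetime64` values.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt):
    """Milliseconds since 1970-01-01 00:00:00 UTC for a naive datetime read as UTC."""
    aware = dt.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (aware - EPOCH) // timedelta(milliseconds=1)


def from_epoch_ms(ms):
    """Inverse of to_epoch_ms: returns a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


purchase = datetime(2020, 9, 22)
ms = to_epoch_ms(purchase)
print(ms, from_epoch_ms(ms))
```

One day after the epoch corresponds to 86,400,000 ms, which is a quick way to confirm the unit.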

In [None]:
trans_df.t_dat = trans_df.t_dat.values.astype(np.int64) // 10 ** 6

## <span style="color:#ff5f27">🪄 Feature Group Creation </span>

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features.

Creating a feature group requires a connection to the feature store, which we already established above as `fs`.

To create a feature group we need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group.

In [None]:
customers_fg = fs.create_feature_group(
    name="customers",
    description="Customers data including age and postal code",
    primary_key=["customer_id"],
    online_enabled=True,
)

Here we have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, we have only specified some metadata for the feature group. It does not store any data yet, nor does it have a schema defined for the data. To make the feature group persistent, we populate it with its associated data using the `insert` function.

In [None]:
customers_fg.insert(customers_df)

Let's do the same thing for the rest of the data frames.

In [None]:
articles_fg = fs.create_feature_group(
    name="articles",
    description="Fashion items data including type of item, visual description and category",
    primary_key=["article_id"],
    online_enabled=True,
)
articles_fg.insert(articles_df)

In [None]:
trans_fg = fs.create_feature_group(
    name="transactions",
    version=1,
    description="Transactions data including customer, item, price, sales channel and transaction date",
    primary_key=["customer_id", "article_id"], 
    online_enabled=True,
    event_time="t_dat",
)
trans_fg.insert(trans_df)

You should now be able to inspect the feature groups in the Hopsworks UI.

## <span style="color:#ff5f27">⚙️ Feature View Creation </span>

A [feature view](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_view/) can be seen as a logical view over a set of features that may come from different feature groups.

Feature views provide an Offline and an Online API that can be used to generate training data or to retrieve online feature vectors at inference time.

Now we can create two feature views, one for customers and one for articles, which will be used during model serving.

In [None]:
customers_query = customers_fg.select_all()
customers_query

In [None]:
customers_feature_view = fs.create_feature_view(
    name='customers',
    query=customers_query,
)

In [None]:
articles_query = articles_fg.select_all()
articles_query

In [None]:
articles_feature_view = fs.create_feature_view(
    name='articles',
    query=articles_query,
)

---
## <span style="color:#ff5f27">⏩️ Next Steps </span>
In the next notebook we'll create a dataset that we can train a retrieval model on.