## <span style="color:#ff5f27">👩🏻‍🔬 Feature Engineering </span>

**Note**: This tutorial does not support Google Colab.

**Your Python Jupyter notebook should be configured for >8GB of memory.**

In this series of tutorials, you will build a recommender system for fashion items. It consists of two models: a *retrieval model* and a *ranking model*. The retrieval model quickly generates a small set of candidate items from a large collection. That speed comes at the cost of granularity, so you also train a ranking model, which only has to score the candidates and can therefore afford to use more features than the retrieval model.
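To make the two-stage idea concrete, here is a minimal NumPy sketch. It is not the model built in this series: the random embeddings, the extra features, and the linear scorers are all hypothetical stand-ins, chosen only to show how a cheap pass over every item can be followed by a richer pass over a few candidates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1,000 items with 8-dimensional embeddings and one user embedding.
item_embeddings = rng.normal(size=(1000, 8))
user_embedding = rng.normal(size=8)

# Hypothetical richer per-item features, used only by the ranking stage.
extra_features = rng.normal(size=(1000, 20))
ranking_weights = rng.normal(size=20)


def retrieve(user_vec, item_vecs, k=50):
    """Stage 1: cheap dot-product scoring over all items, keep the k best (unordered)."""
    scores = item_vecs @ user_vec
    return np.argpartition(-scores, k)[:k]


def rank(candidate_ids, features, weights):
    """Stage 2: score only the candidates with more features and sort them."""
    scores = features[candidate_ids] @ weights
    return candidate_ids[np.argsort(-scores)]


candidates = retrieve(user_embedding, item_embeddings, k=50)
ranked = rank(candidates, extra_features, ranking_weights)
print(ranked[:10])
```

The ranking stage touches 50 rows instead of 1,000, which is why it can use a wider feature set.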

### <span style="color:#ff5f27">✍🏻 Data</span>

You will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.

<!-- https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data

For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customer who did not make any purchase during that time are excluded from the scoring. -->

The full dataset contains images of all products, but here you will simply use the tabular data. You have three data sources:
- `articles.csv`: info about fashion items.
- `customers.csv`: info about users.
- `transactions_train.csv`: info about transactions.


## <span style="color:#ff5f27">📝 Imports </span>

In [None]:
import pandas as pd
import numpy as np

import great_expectations as ge
from great_expectations.core import ExpectationSuite, ExpectationConfiguration

from features.articles import prepare_articles
from features.customers import prepare_customers
from features.transactions import prepare_transactions
from features.ranking import compute_ranking_dataset  

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

## <span style="color:#ff5f27">🗄️ Read Articles Data</span>

The **article_id** and **product_code** serve different purposes in the context of H&M's product database:

- **Article ID**: A unique identifier for each individual article in the database. Each distinct variant of a product (e.g., a different size or color) has its own `article_id`.

- **Product Code**: An identifier for a product or style rather than an individual article. It is not unique per article: multiple articles share the same product code when they are variants of the same product.

In short, `article_id` identifies an individual item, whereas `product_code` identifies the product or style that item belongs to.

Here is an example:

**Product: Basic T-Shirt**

- **Product Code:** TS001

- **Article IDs:**
    - Article ID: 1001 (Size: Small, Color: White)
    - Article ID: 1002 (Size: Medium, Color: White)
    - Article ID: 1003 (Size: Large, Color: White)
    - Article ID: 1004 (Size: Small, Color: Black)
    - Article ID: 1005 (Size: Medium, Color: Black)

In this example, "TS001" is the product code for the basic t-shirt style. Each variant of this t-shirt (e.g., different sizes and colors) has its own unique article_id.
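The relationship in the example above can be checked with a small pandas sketch. The data frame below is made up to mirror the example (the `size` and `colour` columns are illustrative and not columns of the real dataset).

```python
import pandas as pd

# Toy data mirroring the example above; values are illustrative only.
toy_articles = pd.DataFrame({
    "article_id": [1001, 1002, 1003, 1004, 1005],
    "product_code": ["TS001"] * 5,
    "size": ["Small", "Medium", "Large", "Small", "Medium"],
    "colour": ["White", "White", "White", "Black", "Black"],
})

# article_id is unique per row ...
print(toy_articles["article_id"].is_unique)

# ... while a single product_code groups several articles.
articles_per_product = toy_articles.groupby("product_code")["article_id"].nunique()
print(articles_per_product)
```

The same `groupby` could be run on `articles_df` once it is loaded, assuming it contains both columns.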



In [None]:
articles_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/articles.parquet')
print(articles_df.shape)
articles_df.head(3)

In [None]:
# Check for NaNs
articles_df.isna().sum()[articles_df.isna().sum() > 0]

## <span style="color:#ff5f27">👨🏻‍🏭 Articles Feature Engineering</span>


In [None]:
articles_df = prepare_articles(articles_df)
articles_df.head(3)

## <span style="color:#ff5f27">🗄️ Read Customers Data</span>

In [None]:
customers_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/customers.parquet')
print(customers_df.shape)
customers_df.head(3)

In [None]:
# Check for NaNs
customers_df.isna().sum()[customers_df.isna().sum() > 0]

## <span style="color:#ff5f27">👨🏻‍🏭 Customers Feature Engineering</span>


In [None]:
customers_df = prepare_customers(customers_df)
customers_df.head(3)

## <span style="color:#ff5f27">🗄️ Read Transactions Data</span>

In [None]:
trans_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/transactions_train.parquet')[:1_000_000]
print(trans_df.shape)
trans_df.head(3)

In [None]:
# Check for NaNs
trans_df.isna().sum()[trans_df.isna().sum() > 0]

## <span style="color:#ff5f27">👨🏻‍🏭 Transactions Feature Engineering</span>

The time of year a purchase was made should be a strong predictor, as seasonality is a major factor in fashion purchases. Here, you will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), you'll map each month to a point on the unit circle using sine and cosine.
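As a rough illustration of this mapping, the sketch below encodes a few months and compares distances on the circle. The helper `encode_month` is hypothetical; `prepare_transactions` may compute `month_sin` and `month_cos` differently.

```python
import numpy as np
import pandas as pd


def encode_month(months: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: map months 1..12 to points on the unit circle."""
    angle = 2 * np.pi * months / 12
    return pd.DataFrame({"month_sin": np.sin(angle), "month_cos": np.cos(angle)})


months = pd.Series([12, 1, 2], name="month")
encoded = encode_month(months)

# Euclidean distance between neighbouring months on the circle.
dec, jan, feb = encoded.to_numpy()
dist_dec_jan = np.linalg.norm(jan - dec)
dist_jan_feb = np.linalg.norm(feb - jan)
print(dist_dec_jan, dist_jan_feb)
```

With the raw month number, December and January would be 11 apart; on the circle the two distances printed above are equal.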

In [None]:
trans_df = prepare_transactions(trans_df)
trans_df.head(3)

In [None]:
print(f"There are {len(trans_df):,} transactions in total.")

This is a large dataset. To keep the tutorial fast, you will work with a smaller subset, which you generate by sampling 25,000 customers and keeping only their transactions.

In [None]:
N_USERS = 25_000

# Consider only customers with age defined.
customers_df.dropna(inplace=True, subset=["age"])
customer_subset_df = customers_df.sample(N_USERS, random_state=27)

In [None]:
trans_df = trans_df.merge(customer_subset_df["customer_id"])

print(f"Subset has {len(trans_df):,} transactions in total.")

## <span style="color:#ff5f27">👮🏻‍♂️ Great Expectations </span>

In [None]:
ge_customers_df = ge.from_pandas(customers_df)
expectation_suite_customers = ge_customers_df.get_expectation_suite()
expectation_suite_customers.expectation_suite_name = "customers_suite"

In [None]:
expectation_suite_customers.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "age",
            "min_value": 0,
            "max_value": 120,
        }
    )
)

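# NOTE: with "mostly": 0.0 this expectation passes no matter how many values are null.
# To enforce non-null columns, use "expect_column_values_to_not_be_null" instead.
for column in ge_customers_df.columns:
    expectation_suite_customers.add_expectation(
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_be_null",
            kwargs={
                "column": column,
                "mostly": 0.0,
            }
        )
    )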

In [None]:
ge_articles_df = ge.from_pandas(articles_df)
expectation_suite_articles = ge_articles_df.get_expectation_suite()
expectation_suite_articles.expectation_suite_name = "articles_suite"

In [None]:
expectation_suite_articles.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "prod_name_length",
            "min_value": 1,
            "max_value": 200,
        }
    )
)

# NOTE: "mostly": 0.0 means this check never fails, whatever the share of nulls.
for column in ['article_id', 'product_code']:
    expectation_suite_articles.add_expectation(
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_be_null",
            kwargs={
                "column": column,
                "mostly": 0.0,
            }
        )
    )

In [None]:
ge_trans_df = ge.from_pandas(trans_df) 
expectation_suite_transactions = ge_trans_df.get_expectation_suite()
expectation_suite_transactions.expectation_suite_name = "transactions_suite"

In [None]:
expectation_suite_transactions.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_null",
        kwargs={
            "column": "customer_id",
            "mostly": 0.0,
        }
    )
)

expectation_suite_transactions.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "price",
            "min_value": 0,
            "max_value": None,
        }
    )
)
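For intuition, the range and null expectations configured above correspond roughly to simple column-level checks. The sketch below reproduces two of them in plain pandas on made-up data. It does not use Great Expectations and is not a description of how the library evaluates a suite; treating bounds as inclusive and skipping nulls in the range check are assumptions of this sketch.

```python
import pandas as pd

# Made-up sample; the real customers_df has more columns.
sample = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "age": [25.0, 130.0, None, 41.0],
})

# Range check: inclusive bounds, evaluated on non-null values only.
non_null_age = sample["age"].dropna()
age_in_range = non_null_age.between(0, 120)
print(f"age within [0, 120]: {age_in_range.sum()} of {len(non_null_age)} non-null rows")

# Null check: fraction of missing values per column.
null_fraction = sample.isna().mean()
print(null_fraction)
```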

## <span style="color:#ff5f27">🪄 Feature Group Creation </span>

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features.

Before you can create a feature group, you need a connection to your feature store. This is the `fs` handle you obtained earlier from `project.get_feature_store()`.

To create a feature group, you need to give it a name and specify a primary key. It is also good practice to provide a description of the contents of the feature group.
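Because the primary key is meant to identify rows, it can be worth checking how many rows share a key before inserting a data frame. The pandas sketch below does this on made-up transactions. How duplicate keys are treated depends on the feature store configuration, so the sketch only counts them.

```python
import pandas as pd

# Made-up transactions; column names follow the tutorial's data.
toy_trans = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c1"],
    "article_id": [1001, 1002, 1001, 1001],
    "price": [0.02, 0.03, 0.02, 0.02],
})

primary_key = ["customer_id", "article_id"]

# All rows whose key appears more than once in the frame.
duplicated_rows = toy_trans[toy_trans.duplicated(subset=primary_key, keep=False)]
n_unique_keys = len(toy_trans.drop_duplicates(subset=primary_key))

print(f"{len(toy_trans)} rows, {n_unique_keys} unique keys")
print(duplicated_rows)
```

A composite key such as `["customer_id", "article_id"]` repeats whenever a customer buys the same article more than once, which is the case this check surfaces.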

In [None]:
customers_fg = fs.get_or_create_feature_group(
    name="customers",
    description="Customers data including age and postal code",
    version=1,
    primary_key=["customer_id"],
    online_enabled=True,
    expectation_suite=expectation_suite_customers,
)

Here you have also set `online_enabled=True`, which enables low-latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, you have only specified some metadata for the feature group. It does not yet store any data or even have a schema defined for the data. To make the feature group persistent, you populate it with its associated data using the `insert` method.

In [None]:
customers_fg.insert(customers_df)

In [None]:
feature_descriptions = [
    {"name": "customer_id", "description": "Unique identifier for each customer."},
    {"name": "club_member_status", "description": "Membership status of the customer in the club."},
    {"name": "age", "description": "Age of the customer."},
    {"name": "postal_code", "description": "Postal code associated with the customer's address."},
    {"name": "age_group", "description": "Categorized age group of the customer."},
]

for desc in feature_descriptions: 
    customers_fg.update_feature_description(desc["name"], desc["description"])

Let's do the same thing for the rest of the data frames.

In [None]:
articles_fg = fs.get_or_create_feature_group(
    name="articles",
    description="Fashion items data including type of item, visual description and category",
    version=1,
    primary_key=["article_id"],
    online_enabled=True,
    expectation_suite=expectation_suite_articles,
)
articles_fg.insert(articles_df)

In [None]:
feature_descriptions = [
    {"name": "article_id", "description": "Identifier for the article."},
    {"name": "product_code", "description": "Code associated with the product."},
    {"name": "prod_name", "description": "Name of the product."},
    {"name": "product_type_no", "description": "Number associated with the product type."},
    {"name": "product_type_name", "description": "Name of the product type."},
    {"name": "product_group_name", "description": "Name of the product group."},
    {"name": "graphical_appearance_no", "description": "Number associated with graphical appearance."},
    {"name": "graphical_appearance_name", "description": "Name of the graphical appearance."},
    {"name": "colour_group_code", "description": "Code associated with the colour group."},
    {"name": "colour_group_name", "description": "Name of the colour group."},
    {"name": "perceived_colour_value_id", "description": "ID associated with perceived colour value."},
    {"name": "perceived_colour_value_name", "description": "Name of the perceived colour value."},
    {"name": "perceived_colour_master_id", "description": "ID associated with perceived colour master."},
    {"name": "perceived_colour_master_name", "description": "Name of the perceived colour master."},
    {"name": "department_no", "description": "Number associated with the department."},
    {"name": "department_name", "description": "Name of the department."},
    {"name": "index_code", "description": "Code associated with the index."},
    {"name": "index_name", "description": "Name of the index."},
    {"name": "index_group_no", "description": "Number associated with the index group."},
    {"name": "index_group_name", "description": "Name of the index group."},
    {"name": "section_no", "description": "Number associated with the section."},
    {"name": "section_name", "description": "Name of the section."},
    {"name": "garment_group_no", "description": "Number associated with the garment group."},
    {"name": "garment_group_name", "description": "Name of the garment group."},
    {"name": "prod_name_length", "description": "Length of the product name."},
]

for desc in feature_descriptions: 
    articles_fg.update_feature_description(desc["name"], desc["description"])

In [None]:
trans_fg = fs.get_or_create_feature_group(
    name="transactions",
    version=1,
    description="Transactions data including customer, item, price, sales channel and transaction date",
    primary_key=["customer_id", "article_id"],
    online_enabled=True,
    event_time="t_dat",
    expectation_suite=expectation_suite_transactions,
)
trans_fg.insert(
    trans_df,
    write_options={"wait_for_job": True},
)

In [None]:
feature_descriptions = [
    {"name": "t_dat", "description": "Timestamp of the data record."},
    {"name": "customer_id", "description": "Unique identifier for each customer."},
    {"name": "article_id", "description": "Identifier for the purchased article."},
    {"name": "price", "description": "Price of the purchased article."},
    {"name": "sales_channel_id", "description": "Identifier for the sales channel."},
    {"name": "year", "description": "Year of the transaction."},
    {"name": "month", "description": "Month of the transaction."},
    {"name": "day", "description": "Day of the transaction."},
    {"name": "day_of_week", "description": "Day of the week of the transaction."},
    {"name": "month_sin", "description": "Sine of the month used for seasonal patterns."},
    {"name": "month_cos", "description": "Cosine of the month used for seasonal patterns."},
]

for desc in feature_descriptions: 
    trans_fg.update_feature_description(desc["name"], desc["description"])

In [None]:
ranking_df = compute_ranking_dataset(
    trans_fg, 
    articles_fg, 
    customers_fg,
)
ranking_df.head(3)

In [None]:
ranking_df.label.value_counts()

In [None]:
rank_fg = fs.get_or_create_feature_group(
    name="ranking",
    version=1,
    description="Derived feature group for ranking",
    primary_key=["customer_id", "article_id"], 
    parents=[articles_fg, customers_fg, trans_fg],
)
rank_fg.insert(ranking_df)

In [None]:
feature_descriptions = [
    {"name": "customer_id", "description": "Unique identifier for each customer."},
    {"name": "article_id", "description": "Identifier for the purchased article."},
    {"name": "age", "description": "Age of the customer."},
    {"name": "month_sin", "description": "Sine of the month used for seasonal patterns."},
    {"name": "month_cos", "description": "Cosine of the month used for seasonal patterns."},
    {"name": "product_type_name", "description": "Name of the product type."},
    {"name": "product_group_name", "description": "Name of the product group."},
    {"name": "graphical_appearance_name", "description": "Name of the graphical appearance."},
    {"name": "colour_group_name", "description": "Name of the colour group."},
    {"name": "perceived_colour_value_name", "description": "Name of the perceived colour value."},
    {"name": "perceived_colour_master_name", "description": "Name of the perceived colour master."},
    {"name": "department_name", "description": "Name of the department."},
    {"name": "index_name", "description": "Name of the index."},
    {"name": "index_group_name", "description": "Name of the index group."},
    {"name": "section_name", "description": "Name of the section."},
    {"name": "garment_group_name", "description": "Name of the garment group."},
    {"name": "label", "description": "Label indicating whether the article was purchased (1) or not (0)."},
]

for desc in feature_descriptions: 
    rank_fg.update_feature_description(desc["name"], desc["description"])

You should now be able to inspect the feature groups in the Hopsworks UI.

---
## <span style="color:#ff5f27">⏩️ Next Steps </span>
In the next notebook you'll train a retrieval model.