Merged
11 changes: 9 additions & 2 deletions .gitignore
@@ -179,6 +179,13 @@ advanced_tutorials/citibike/data/__MACOSX/._202304-citibike-tripdata.csv
advanced_tutorials/citibike/data/__MACOSX/._202305-citibike-tripdata.csv
loan_approval/lending_model/roc_curve.png
advanced_tutorials/timeseries/price_model/model_prediction.png

advanced_tutorials/recommender-system/query_model/variables/variables.index
advanced_tutorials/recommender-system/query_model/variables/variables.data-00000-of-00001
advanced_tutorials/recommender-system/query_model/saved_model.pb
advanced_tutorials/recommender-system/query_model/fingerprint.pb
advanced_tutorials/recommender-system/candidate_model/variables/variables.index
advanced_tutorials/recommender-system/candidate_model/variables/variables.data-00000-of-00001
advanced_tutorials/recommender-system/candidate_model/fingerprint.pb
advanced_tutorials/recommender-system/candidate_model/saved_model.pb
integrations/neo4j/aml_model/*
integrations/neo4j/aml_model_transformer.py
integrations/neo4j/aml_model_transformer.py
58 changes: 46 additions & 12 deletions advanced_tutorials/recommender-system/1_feature_engineering.ipynb
@@ -10,17 +10,17 @@
"\n",
"**Your Python Jupyter notebook should be configured for >8GB of memory.**\n",
"\n",
"In this series of tutorials, we will build a recommender system for fashion items. It will consist of two models: a *retrieval model* and a *ranking model*. The idea is that the retrieval model should be able to quickly generate a small subset of candidate items from a large collection of items. This comes at the cost of granularity, which is why we also train a ranking model that can afford to use more features than the retrieval model.\n",
"In this series of tutorials, you will build a recommender system for fashion items. It will consist of two models: a *retrieval model* and a *ranking model*. The idea is that the retrieval model should be able to quickly generate a small subset of candidate items from a large collection of items. This comes at the cost of granularity, which is why you also train a ranking model that can afford to use more features than the retrieval model.\n",
"\n",
"### <span style=\"color:#ff5f27\">✍🏻 Data</span>\n",
"\n",
"We will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.\n",
"You will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.\n",
"\n",
"<!-- https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data\n",
"\n",
"For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customer who did not make any purchase during that time are excluded from the scoring. -->\n",
"\n",
"The full dataset contains images of all products, but here we will simply use the tabular data. We have three data sources:\n",
"The full dataset contains images of all products, but here you will simply use the tabular data. You have three data sources:\n",
"- `articles.csv`: info about fashion items.\n",
"- `customers.csv`: info about users.\n",
"- `transactions_train.csv`: info about transactions.\n"
Expand Down Expand Up @@ -75,7 +75,31 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## <span style=\"color:#ff5f27\">🗄️ Read Articles Data</span>"
"## <span style=\"color:#ff5f27\">🗄️ Read Articles Data</span>\n",
"\n",
"The **article_id** and **product_code** serve different purposes in the context of H&M's product database:\n",
"\n",
"- **Article ID**: This is a unique identifier assigned to each individual article within the database. It is typically used for internal tracking and management purposes. Each distinct item or variant of a product (e.g., different sizes or colors) would have its own unique article_id.\n",
"\n",
"- **Product Code**: This is also a unique identifier, but it is associated with a specific product or style rather than individual articles. It represents a broader category or type of product within H&M's inventory. Multiple articles may share the same product code if they belong to the same product line or style.\n",
"\n",
"While both are unique identifiers, the article_id is specific to individual items, whereas the product_code represents a broader category or style of product.\n",
"\n",
"Here is an example:\n",
"\n",
"**Product: Basic T-Shirt**\n",
"\n",
"- **Product Code:** TS001\n",
"\n",
"- **Article IDs:**\n",
" - Article ID: 1001 (Size: Small, Color: White)\n",
" - Article ID: 1002 (Size: Medium, Color: White)\n",
" - Article ID: 1003 (Size: Large, Color: White)\n",
" - Article ID: 1004 (Size: Small, Color: Black)\n",
" - Article ID: 1005 (Size: Medium, Color: Black)\n",
"\n",
"In this example, \"TS001\" is the product code for the basic t-shirt style. Each variant of this t-shirt (e.g., different sizes and colors) has its own unique article_id.\n",
"\n"
]
},
{
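The relationship between `product_code` and `article_id` described in this hunk can be sketched in plain Python. The rows below reuse the illustrative values from the notebook text (`TS001`, article IDs 1001 to 1005), not real H&M identifiers, and the `articles` list and `group_articles_by_product` helper are hypothetical names introduced only for this sketch.

```python
from collections import defaultdict

# Illustrative rows mirroring the notebook's example; these are not real H&M data.
articles = [
    {"article_id": 1001, "product_code": "TS001", "size": "Small", "color": "White"},
    {"article_id": 1002, "product_code": "TS001", "size": "Medium", "color": "White"},
    {"article_id": 1003, "product_code": "TS001", "size": "Large", "color": "White"},
    {"article_id": 1004, "product_code": "TS001", "size": "Small", "color": "Black"},
    {"article_id": 1005, "product_code": "TS001", "size": "Medium", "color": "Black"},
]


def group_articles_by_product(rows):
    """Map each product_code to the list of article_ids that share it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["product_code"]].append(row["article_id"])
    return dict(grouped)


by_product = group_articles_by_product(articles)
article_ids = [row["article_id"] for row in articles]

# article_id is unique per row, while one product_code covers several articles.
print(len(article_ids) == len(set(article_ids)))
print(by_product["TS001"])
```

Grouping in this direction makes the one-to-many relationship explicit: a single product code maps to several article IDs, never the other way around.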
Expand Down Expand Up @@ -176,7 +200,7 @@
"metadata": {},
"outputs": [],
"source": [
"trans_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/transactions_train.parquet')[:600000]\n",
"trans_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/transactions_train.parquet')[:1_000_000]\n",
"print(trans_df.shape)\n",
"trans_df.head(3)"
]
@@ -199,7 +223,7 @@
"source": [
"## <span style=\"color:#ff5f27\">👨🏻‍🏭 Transactions Feature Engineering</span>\n",
"\n",
"The time of the year a purchase was made should be a strong predictor, as seasonality plays a big factor in fashion purchases. Here, we will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), we'll map each month to the unit circle using sine and cosine."
"The time of the year a purchase was made should be a strong predictor, as seasonality plays a big factor in fashion purchases. Here, you will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), you'll map each month to the unit circle using sine and cosine."
]
},
{
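The cyclical encoding mentioned in this hunk can be illustrated with a small standard-library sketch. The notebook's own implementation is not shown in the diff, so the `encode_month` helper, the `month - 1` offset, and the `distance` function below are assumptions made for illustration rather than the tutorial's actual code.

```python
import math


def encode_month(month):
    """Map a month number (1-12) to (sin, cos) coordinates on the unit circle.

    Assumption: January is placed at angle 0; any fixed offset would work.
    """
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded months."""
    return math.dist(a, b)


jan, feb, dec = encode_month(1), encode_month(2), encode_month(12)

# January is exactly as far from February as it is from December.
print(math.isclose(distance(jan, feb), distance(jan, dec)))
```

With a raw month number, January (1) and December (12) would look eleven steps apart; on the unit circle they are neighbours, which is the property the notebook text relies on.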
@@ -225,7 +249,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that we have a large dataset. For the sake of the tutorial, we will use a small subset of this dataset, which we generate by sampling 25'000 customers and using their transactions."
"You can see that you have a large dataset. For the sake of the tutorial, you will use a small subset of this dataset, which you generate by sampling 25'000 customers and using their transactions."
]
},
{
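The subsetting step described in this hunk (sample a set of customers, then keep only their transactions) might look roughly like the following standard-library sketch. The toy `transactions` list, the `sample_customer_transactions` helper, the seed value, and the `customer_id` field name are assumptions for illustration; the notebook's own sampling code is not part of this diff.

```python
import random


def sample_customer_transactions(transactions, n_customers, seed=27):
    """Keep only the transactions belonging to a random sample of customers."""
    # Sort the unique IDs so the sample is reproducible for a given seed.
    customer_ids = sorted({t["customer_id"] for t in transactions})
    rng = random.Random(seed)
    sampled = set(rng.sample(customer_ids, min(n_customers, len(customer_ids))))
    return [t for t in transactions if t["customer_id"] in sampled]


# Toy data: 10 hypothetical customers with 3 transactions each.
transactions = [
    {"customer_id": f"c{i}", "article_id": 1000 + j}
    for i in range(10)
    for j in range(3)
]

subset = sample_customer_transactions(transactions, n_customers=4)
print(len({t["customer_id"] for t in subset}), len(subset))
```

Sampling customers rather than individual rows keeps each selected customer's full purchase history intact, which matters when the history itself is used to build features.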
Expand Down Expand Up @@ -386,14 +410,14 @@
"\n",
"A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features.\n",
"\n",
"Before we can create a feature group we need to connect to our feature store."
"Before you can create a feature group you need to connect to your feature store."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To create a feature group we need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group."
"To create a feature group you need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group."
]
},
{
@@ -416,9 +440,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).\n",
"Here you have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).\n",
"\n",
"At this point, we have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent we populate it with its associated data using the `save` function."
"At this point, you have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent you populate it with its associated data using the `insert` method."
]
},
{
Expand Down Expand Up @@ -565,7 +589,17 @@
" trans_fg, \n",
" articles_fg, \n",
" customers_fg,\n",
")"
")\n",
"ranking_df.head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ranking_df.label.value_counts()"
]
},
{