diff --git a/.gitignore b/.gitignore
index a75ea9dc..d54a7ced 100644
--- a/.gitignore
+++ b/.gitignore
@@ -179,6 +179,13 @@ advanced_tutorials/citibike/data/__MACOSX/._202304-citibike-tripdata.csv
advanced_tutorials/citibike/data/__MACOSX/._202305-citibike-tripdata.csv
loan_approval/lending_model/roc_curve.png
advanced_tutorials/timeseries/price_model/model_prediction.png
-
+advanced_tutorials/recommender-system/query_model/variables/variables.index
+advanced_tutorials/recommender-system/query_model/variables/variables.data-00000-of-00001
+advanced_tutorials/recommender-system/query_model/saved_model.pb
+advanced_tutorials/recommender-system/query_model/fingerprint.pb
+advanced_tutorials/recommender-system/candidate_model/variables/variables.index
+advanced_tutorials/recommender-system/candidate_model/variables/variables.data-00000-of-00001
+advanced_tutorials/recommender-system/candidate_model/fingerprint.pb
+advanced_tutorials/recommender-system/candidate_model/saved_model.pb
integrations/neo4j/aml_model/*
-integrations/neo4j/aml_model_transformer.py
\ No newline at end of file
+integrations/neo4j/aml_model_transformer.py
diff --git a/advanced_tutorials/recommender-system/1_feature_engineering.ipynb b/advanced_tutorials/recommender-system/1_feature_engineering.ipynb
index 83f6a8fb..9ead9d7a 100644
--- a/advanced_tutorials/recommender-system/1_feature_engineering.ipynb
+++ b/advanced_tutorials/recommender-system/1_feature_engineering.ipynb
@@ -10,17 +10,17 @@
"\n",
"**Your Python Jupyter notebook should be configured for >8GB of memory.**\n",
"\n",
- "In this series of tutorials, we will build a recommender system for fashion items. It will consist of two models: a *retrieval model* and a *ranking model*. The idea is that the retrieval model should be able to quickly generate a small subset of candidate items from a large collection of items. This comes at the cost of granularity, which is why we also train a ranking model that can afford to use more features than the retrieval model.\n",
+ "In this series of tutorials, you will build a recommender system for fashion items. It will consist of two models: a *retrieval model* and a *ranking model*. The idea is that the retrieval model should be able to quickly generate a small subset of candidate items from a large collection of items. This comes at the cost of granularity, which is why you also train a ranking model that can afford to use more features than the retrieval model.\n",
"\n",
"### âđģ Data\n",
"\n",
- "We will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.\n",
+ "You will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.\n",
"\n",
"\n",
"\n",
- "The full dataset contains images of all products, but here we will simply use the tabular data. We have three data sources:\n",
+ "The full dataset contains images of all products, but here you will simply use the tabular data. You have three data sources:\n",
"- `articles.csv`: info about fashion items.\n",
"- `customers.csv`: info about users.\n",
"- `transactions_train.csv`: info about transactions.\n"
@@ -75,7 +75,31 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## đī¸ Read Articles Data"
+ "## đī¸ Read Articles Data\n",
+ "\n",
+ "The **article_id** and **product_code** serve different purposes in the context of H&M's product database:\n",
+ "\n",
+ "- **Article ID**: This is a unique identifier assigned to each individual article within the database. It is typically used for internal tracking and management purposes. Each distinct item or variant of a product (e.g., different sizes or colors) would have its own unique article_id.\n",
+ "\n",
+ "- **Product Code**: This is also a unique identifier, but it is associated with a specific product or style rather than individual articles. It represents a broader category or type of product within H&M's inventory. Multiple articles may share the same product code if they belong to the same product line or style.\n",
+ "\n",
+ "While both are unique identifiers, the article_id is specific to individual items, whereas the product_code represents a broader category or style of product.\n",
+ "\n",
+ "Here is an example:\n",
+ "\n",
+ "**Product: Basic T-Shirt**\n",
+ "\n",
+ "- **Product Code:** TS001\n",
+ "\n",
+ "- **Article IDs:**\n",
+ " - Article ID: 1001 (Size: Small, Color: White)\n",
+ " - Article ID: 1002 (Size: Medium, Color: White)\n",
+ " - Article ID: 1003 (Size: Large, Color: White)\n",
+ " - Article ID: 1004 (Size: Small, Color: Black)\n",
+ " - Article ID: 1005 (Size: Medium, Color: Black)\n",
+ "\n",
+ "In this example, \"TS001\" is the product code for the basic t-shirt style. Each variant of this t-shirt (e.g., different sizes and colors) has its own unique article_id.\n",
+ "\n"
]
},
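+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once the articles data is loaded below, you can verify this one-to-many relationship yourself. The following snippet is a minimal sketch and assumes the articles DataFrame is named `articles_df` (adjust the name to match the read cell below):\n",
+    "\n",
+    "```python\n",
+    "# Count how many distinct articles (variants) exist per product code\n",
+    "articles_df.groupby(\"product_code\")[\"article_id\"].nunique().sort_values(ascending=False).head()\n",
+    "```"
+   ]
+  },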
{
@@ -176,7 +200,7 @@
"metadata": {},
"outputs": [],
"source": [
- "trans_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/transactions_train.parquet')[:600000]\n",
+ "trans_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/transactions_train.parquet')[:1_000_000]\n",
"print(trans_df.shape)\n",
"trans_df.head(3)"
]
@@ -199,7 +223,7 @@
"source": [
"## đ¨đģâđ Transactions Feature Engineering\n",
"\n",
- "The time of the year a purchase was made should be a strong predictor, as seasonality plays a big factor in fashion purchases. Here, we will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), we'll map each month to the unit circle using sine and cosine."
+    "The time of the year a purchase was made should be a strong predictor, as seasonality plays a big factor in fashion purchases. Here, you will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), you'll map each month to the unit circle using sine and cosine."
]
},
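+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For reference, this cyclical encoding typically takes the form\n",
+    "\n",
+    "$$month\\_sin = \\sin\\left(\\frac{2 \\pi m}{12}\\right), \\qquad month\\_cos = \\cos\\left(\\frac{2 \\pi m}{12}\\right),$$\n",
+    "\n",
+    "for month $m \\in \\{1, \\dots, 12\\}$, so that December ($m = 12$) and January ($m = 1$) end up next to each other on the unit circle."
+   ]
+  },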
{
@@ -225,7 +249,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We can see that we have a large dataset. For the sake of the tutorial, we will use a small subset of this dataset, which we generate by sampling 25'000 customers and using their transactions."
+    "As you can see, this is a large dataset. For the sake of the tutorial, you will use a small subset of it, generated by sampling 25,000 customers and keeping only their transactions."
]
},
{
@@ -386,14 +410,14 @@
"\n",
"A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features.\n",
"\n",
- "Before we can create a feature group we need to connect to our feature store."
+    "Before you can create a feature group, you need to connect to your feature store."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "To create a feature group we need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group."
+    "To create a feature group, you need to give it a name and specify a primary key. It is also good practice to provide a description of the feature group's contents."
]
},
{
@@ -416,9 +440,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Here we have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).\n",
+ "Here you have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).\n",
"\n",
- "At this point, we have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent we populate it with its associated data using the `save` function."
+ "At this point, you have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent you populate it with its associated data using the `insert` method."
]
},
{
@@ -565,7 +589,17 @@
" trans_fg, \n",
" articles_fg, \n",
" customers_fg,\n",
- ")"
+ ")\n",
+ "ranking_df.head(3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ranking_df.label.value_counts()"
]
},
{
diff --git a/advanced_tutorials/recommender-system/2_train_retrieval_model.ipynb b/advanced_tutorials/recommender-system/2_train_retrieval_model.ipynb
index ad8ddf72..48b1bb90 100644
--- a/advanced_tutorials/recommender-system/2_train_retrieval_model.ipynb
+++ b/advanced_tutorials/recommender-system/2_train_retrieval_model.ipynb
@@ -6,9 +6,9 @@
"source": [
"## đ§Ŧ Train Retrieval Model \n",
"\n",
- "In this notebook, we will train a retrieval model that will be able to quickly generate a small subset of candidate items from a large collection of items. Our model will be based on the two-tower architecture, which embeds queries and candidates (keys) into a shared low-dimensional vector space. Here, a query consists of features of a customer and a transaction (e.g. timestamp of the purchase), whereas a candidate consists of features of a particular item. All queries will have a user ID and all candidates will have an item ID, and the model will be trained such that the embedding of a user will be close to all the embeddings of items the user has previously bought.\n",
+ "In this notebook, you will train a retrieval model that will be able to quickly generate a small subset of candidate items from a large collection of items. Your model will be based on the *two-tower architecture*, which embeds queries and candidates (keys) into a shared low-dimensional vector space. Here, a query consists of features of a customer and a transaction (e.g. timestamp of the purchase), whereas a candidate consists of features of a particular item. All queries will have a user ID and all candidates will have an item ID, and the model will be trained such that the embedding of a user will be close to all the embeddings of items the user has previously bought.\n",
"\n",
- "After training the model we will save and upload its components to the Hopsworks Model Registry.\n",
+ "After training the model you will save and upload its components to the Hopsworks Model Registry.\n",
"\n",
"Let's go ahead and load the data."
]
@@ -28,6 +28,8 @@
"source": [
"import tensorflow as tf\n",
"from tensorflow.keras.layers.experimental.preprocessing import StringLookup, Normalization\n",
+ "import tensorflow_recommenders as tfrs\n",
+ "import tensorflow_addons as tfa\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
@@ -59,7 +61,7 @@
"source": [
"## đĒ Feature Selection \n",
"\n",
- "First, we'll load the feature groups we created in the previous tutorial."
+ "First, you'll load the feature groups you created in the previous tutorial."
]
},
{
@@ -86,7 +88,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We'll need to join these three data sources to make the data compatible with out retrieval model. Recall that each row in the `transactions` feature group relates information about which customer bought which item. We'll join this feature group with the `customers` and `articles` feature groups to inject customer and item features into each row."
+    "You'll need to join these three data sources to make the data compatible with your retrieval model. Recall that each row in the `transactions` feature group records which customer bought which item. You'll join this feature group with the `customers` and `articles` feature groups to inject customer and item features into each row."
]
},
{
@@ -157,29 +159,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We will train our retrieval model with a subset of features.\n",
+ "You will train your retrieval model with a subset of features.\n",
"\n",
- "For the query embedding we will use:\n",
+ "For the query embedding you will use:\n",
"- `customer_id`: ID of the customer.\n",
"- `age`: age of the customer at the time of purchase.\n",
"- `month_sin`, `month_cos`: time of year the purchase was made.\n",
"\n",
- "For the candidate embedding we will use:\n",
+ "For the candidate embedding you will use:\n",
"- `article_id`: ID of the item.\n",
"- `garment_group_name`: type of garment.\n",
"- `index_group_name`: menswear/ladieswear etc."
]
},
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "train_df[\"article_id\"] = train_df[\"article_id\"].astype(str) # to be removed\n",
- "val_df[\"article_id\"] = val_df[\"article_id\"].astype(str) # to be removed"
- ]
- },
{
"cell_type": "code",
"execution_count": null,
@@ -190,7 +182,7 @@
"candidate_features = [\"article_id\", \"garment_group_name\", \"index_group_name\"]\n",
"\n",
"def df_to_ds(df):\n",
- " return tf.data.Dataset.from_tensor_slices({col : df[col] for col in df})\n",
+ " return tf.data.Dataset.from_tensor_slices({col: df[col] for col in df})\n",
"\n",
"BATCH_SIZE = 2048\n",
"train_ds = df_to_ds(train_df).batch(BATCH_SIZE).cache().shuffle(BATCH_SIZE*10)\n",
@@ -201,7 +193,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "You will need a list of user and item IDs when we initialize our embeddings."
+ "You will need a list of user and item IDs when you initialize your embeddings."
]
},
{
@@ -210,12 +202,15 @@
"metadata": {},
"outputs": [],
"source": [
+ "# Retrieve unique customer IDs and article IDs from the training dataset\n",
"user_id_list = train_df[\"customer_id\"].unique().tolist()\n",
"item_id_list = train_df[\"article_id\"].unique().tolist()\n",
"\n",
+ "# Retrieve unique garment group names and index group names from the training dataset\n",
"garment_group_list = train_df[\"garment_group_name\"].unique().tolist()\n",
"index_group_list = train_df[\"index_group_name\"].unique().tolist()\n",
"\n",
+ "# Print the number of transactions, number of users, number of items, and unique garment group names\n",
"print(f\"Number of transactions: {len(train_df):,}\")\n",
"print(f\"Number of users: {len(user_id_list):,}\")\n",
"print(f\"Number of items: {len(item_id_list):,}\")\n",
@@ -229,10 +224,10 @@
"## đ° Two Tower Model \n",
"\n",
"The two tower model consist of two models:\n",
- "- Query model: generates a query representation given user and transaction features.\n",
- "- Candidate model: generates an item representation given item features.\n",
+ "- Query model: Generates a query representation given user and transaction features.\n",
+ "- Candidate model: Generates an item representation given item features.\n",
"\n",
- "Both models produce embeddings that live in the same embedding space. We let this space be low-dimensional to prevent overfitting on the training data. (Otherwise, the model might simply memorize previous purchases, which makes it recommend items customers already have bought.)"
+    "**Both models produce embeddings that live in the same embedding space**. You keep this space low-dimensional to prevent overfitting on the training data. (Otherwise, the model might simply memorize previous purchases, which would make it recommend items customers have already bought.)"
]
},
{
@@ -248,7 +243,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We start with creating the query model."
+    "You start by creating the query model."
]
},
{
@@ -268,7 +263,7 @@
" mask_token=None\n",
" ),\n",
" tf.keras.layers.Embedding(\n",
- " # We add an additional embedding to account for unknown tokens.\n",
+ " # You add an additional embedding to account for unknown tokens.\n",
" len(user_id_list) + 1,\n",
" EMB_DIM\n",
" )\n",
@@ -308,7 +303,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The candidate model is very similar to the query model. A difference is that it has two categorical features as input, which we one-hot encode."
+ "The candidate model is very similar to the query model. A difference is that it has two categorical features as input, which you one-hot encode."
]
},
{
@@ -328,14 +323,20 @@
" mask_token=None\n",
" ),\n",
" tf.keras.layers.Embedding(\n",
- " # We add an additional embedding to account for unknown tokens.\n",
+ " # You add an additional embedding to account for unknown tokens.\n",
" len(item_id_list) + 1,\n",
" EMB_DIM\n",
" )\n",
" ])\n",
- "\n",
- " self.garment_group_tokenizer = StringLookup(vocabulary=garment_group_list, mask_token=None)\n",
- " self.index_group_tokenizer = StringLookup(vocabulary=index_group_list, mask_token=None)\n",
+ " # Converts strings into integer indices (scikit-learn LabelEncoder analog)\n",
+ " self.garment_group_tokenizer = StringLookup(\n",
+ " vocabulary=garment_group_list, \n",
+ " mask_token=None,\n",
+ " )\n",
+ " self.index_group_tokenizer = StringLookup(\n",
+ " vocabulary=index_group_list, \n",
+ " mask_token=None,\n",
+ " )\n",
"\n",
" self.fnn = tf.keras.Sequential([\n",
" tf.keras.layers.Dense(EMB_DIM, activation=\"relu\"),\n",
@@ -345,18 +346,18 @@
" def call(self, inputs):\n",
" garment_group_embedding = tf.one_hot(\n",
" self.garment_group_tokenizer(inputs[\"garment_group_name\"]),\n",
- " len(garment_group_list)\n",
+ " len(garment_group_list),\n",
" )\n",
"\n",
" index_group_embedding = tf.one_hot(\n",
" self.index_group_tokenizer(inputs[\"index_group_name\"]),\n",
- " len(index_group_list)\n",
+ " len(index_group_list),\n",
" )\n",
"\n",
" concatenated_inputs = tf.concat([\n",
" self.item_embedding(inputs[\"article_id\"]),\n",
" garment_group_embedding,\n",
- " index_group_embedding\n",
+ " index_group_embedding,\n",
" ], axis=1)\n",
"\n",
" outputs = self.fnn(concatenated_inputs)\n",
@@ -371,7 +372,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "You will evaluate the two tower model using the *top-100 accuracy*. That is, for each transaction in the validation data we will generate the associated query embedding and retrieve the set of the 100 items that are closest to this query in the embedding space. The top-100 accuracy measures how often the item that was actually bought is part of this subset. To evaluate this, we create a dataset of all unique items in the training data."
+ "You will evaluate the two tower model using the *top-100 accuracy*. That is, for each transaction in the validation data you will generate the associated query embedding and retrieve the set of the 100 items that are closest to this query in the embedding space. The top-100 accuracy measures how often the item that was actually bought is part of this subset. To evaluate this, you create a dataset of all unique items in the training data."
]
},
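+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In other words, over the $N$ validation transactions,\n",
+    "\n",
+    "$$\\text{top-100 accuracy} = \\frac{1}{N} \\sum_{i=1}^{N} \\mathbb{1}\\big[\\text{purchased item}_i \\in \\text{top-100 candidates for query}_i\\big].$$"
+   ]
+  },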
{
@@ -389,7 +390,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "With this in place, we can finally create our two tower model."
+ "With this in place, you can finally create your two tower model."
]
},
{
@@ -398,8 +399,6 @@
"metadata": {},
"outputs": [],
"source": [
- "import tensorflow_recommenders as tfrs\n",
- "\n",
"class TwoTowerModel(tf.keras.Model):\n",
" def __init__(self, query_model, item_model):\n",
" super().__init__()\n",
@@ -479,10 +478,13 @@
"metadata": {},
"outputs": [],
"source": [
- "import tensorflow_addons as tfa\n",
- "\n",
+ "# Create a TwoTowerModel with the specified query_model and item_model\n",
"model = TwoTowerModel(query_model, item_model)\n",
+ "\n",
+ "# Define an optimizer using AdamW with a learning rate of 0.01\n",
"optimizer = tfa.optimizers.AdamW(0.001, learning_rate=0.01)\n",
+ "\n",
+ "# Compile the model using the specified optimizer\n",
"model.compile(optimizer=optimizer)"
]
},
@@ -492,7 +494,11 @@
"metadata": {},
"outputs": [],
"source": [
- "model.fit(train_ds, validation_data=val_ds, epochs=5)"
+ "model.fit(\n",
+ " train_ds, \n",
+ " validation_data=val_ds, \n",
+ " epochs=5,\n",
+ ")"
]
},
{
@@ -501,7 +507,7 @@
"source": [
"## đī¸ Upload Model to Model Registry \n",
"\n",
- "One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.\n",
+ "One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.\n",
"\n",
"Let's connect to the model registry using the [HSML library](https://docs.hopsworks.ai/machine-learning-api/latest) from Hopsworks."
]
@@ -528,10 +534,12 @@
" @tf.function()\n",
" def compute_emb(self, instances):\n",
" query_emb = self.query_model(instances)\n",
- " return {\"customer_id\": instances[\"customer_id\"],\n",
- " \"month_sin\": instances[\"month_sin\"],\n",
- " \"month_cos\": instances[\"month_cos\"],\n",
- " \"query_emb\": query_emb}\n",
+ " return {\n",
+ " \"customer_id\": instances[\"customer_id\"],\n",
+ " \"month_sin\": instances[\"month_sin\"],\n",
+ " \"month_cos\": instances[\"month_cos\"],\n",
+ " \"query_emb\": query_emb,\n",
+ " }\n",
"\n",
"# wrap query_model: query_model -> query_model_module\n",
"query_model = QueryModelModule(model.query_model)"
@@ -543,22 +551,30 @@
"metadata": {},
"outputs": [],
"source": [
- "instances_spec={\n",
- " 'customer_id': tf.TensorSpec(shape=(None,), dtype=tf.string, name='customer_id'),\n",
- " 'month_sin': tf.TensorSpec(shape=(None,), dtype=tf.float64, name='month_sin'),\n",
- " 'month_cos': tf.TensorSpec(shape=(None,), dtype=tf.float64, name='month_cos'),\n",
- " 'age': tf.TensorSpec(shape=(None,), dtype=tf.float64, name='age')\n",
+ "# Define the input specifications for the instances\n",
+ "instances_spec = {\n",
+ " 'customer_id': tf.TensorSpec(shape=(None,), dtype=tf.string, name='customer_id'), # Specification for customer IDs\n",
+ " 'month_sin': tf.TensorSpec(shape=(None,), dtype=tf.float64, name='month_sin'), # Specification for sine of month\n",
+ " 'month_cos': tf.TensorSpec(shape=(None,), dtype=tf.float64, name='month_cos'), # Specification for cosine of month\n",
+ " 'age': tf.TensorSpec(shape=(None,), dtype=tf.float64, name='age'), # Specification for age\n",
"}\n",
+ "\n",
+ "# Get the concrete function for the query_model's compute_emb function using the specified input signatures\n",
"signatures = query_model.compute_emb.get_concrete_function(instances_spec)\n",
"\n",
- "tf.saved_model.save(query_model, \"query_model\", signatures=signatures)"
+ "# Save the query_model along with the concrete function signatures\n",
+ "tf.saved_model.save(\n",
+ " query_model, # The model to save\n",
+ " \"query_model\", # Path to save the model\n",
+ " signatures=signatures, # Concrete function signatures to include\n",
+ ")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "First, we need to save our models locally."
+    "First, you need to save your models locally."
]
},
{
@@ -567,7 +583,10 @@
"metadata": {},
"outputs": [],
"source": [
- "tf.saved_model.save(model.item_model, \"candidate_model\")"
+ "tf.saved_model.save(\n",
+ " model.item_model, # The model to save\n",
+ " \"candidate_model\", # Path to save the model\n",
+ ")"
]
},
{
@@ -593,12 +612,12 @@
"query_model_output_schema = Schema([{\n",
" \"name\": \"query_embedding\",\n",
" \"type\": \"float32\",\n",
- " \"shape\": [EMB_DIM]\n",
+ " \"shape\": [EMB_DIM],\n",
"}])\n",
"\n",
"query_model_schema = ModelSchema(\n",
" input_schema=query_model_input_schema,\n",
- " output_schema=query_model_output_schema\n",
+ " output_schema=query_model_output_schema,\n",
")\n",
"\n",
"query_model_schema.to_dict()"
@@ -608,7 +627,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "With the schema in place, we can finally register our model."
+ "With the schema in place, you can finally register your model."
]
},
{
@@ -617,22 +636,26 @@
"metadata": {},
"outputs": [],
"source": [
+ "# Sample a query example from the query DataFrame\n",
"query_example = query_df.sample().to_dict(\"records\")\n",
"\n",
+ "# Create a tensorflow model for the query_model in the Model Registry \n",
"mr_query_model = mr.tensorflow.create_model(\n",
- " name=\"query_model\",\n",
- " description=\"Model that generates query embeddings from user and transaction features\",\n",
- " input_example=query_example,\n",
- " model_schema=query_model_schema,\n",
+ " name=\"query_model\", # Name of the model\n",
+ " description=\"Model that generates query embeddings from user and transaction features\", # Description of the model\n",
+ " input_example=query_example, # Example input for the model\n",
+ " model_schema=query_model_schema, # Schema of the model\n",
")\n",
- "mr_query_model.save(\"query_model\")"
+ "\n",
+ "# Save the query_model to the Model Registry\n",
+ "mr_query_model.save(\"query_model\") # Path to save the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Here we have also saved an input example from the training data, which can be helpful for test purposes.\n",
+    "Here you have also saved an input example from the training data, which can be helpful for testing purposes.\n",
"\n",
"Let's repeat the process with the candidate model."
]
@@ -643,28 +666,35 @@
"metadata": {},
"outputs": [],
"source": [
+ "# Define the input schema for the candidate_model based on item_df\n",
"candidate_model_input_schema = Schema(item_df)\n",
"\n",
+ "# Define the output schema for the candidate_model, specifying the shape and type of the output\n",
"candidate_model_output_schema = Schema([{\n",
- " \"name\": \"candidate_embedding\",\n",
- " \"type\": \"float32\",\n",
- " \"shape\": [EMB_DIM],\n",
+ " \"name\": \"candidate_embedding\", # Name of the output feature\n",
+ " \"type\": \"float32\", # Data type of the output feature\n",
+ " \"shape\": [EMB_DIM], # Shape of the output feature\n",
"}])\n",
"\n",
+ "# Combine the input and output schemas to create the overall model schema for the candidate_model\n",
"candidate_model_schema = ModelSchema(\n",
- " input_schema=candidate_model_input_schema,\n",
- " output_schema=candidate_model_output_schema\n",
+ " input_schema=candidate_model_input_schema, # Input schema for the model\n",
+ " output_schema=candidate_model_output_schema, # Output schema for the model\n",
")\n",
"\n",
+ "# Sample a candidate example from the item DataFrame\n",
"candidate_example = item_df.sample().to_dict(\"records\")\n",
"\n",
+ "# Create a tensorflow model for the candidate_model in the Model Registry\n",
"mr_candidate_model = mr.tensorflow.create_model(\n",
- " name=\"candidate_model\",\n",
- " description=\"Model that generates candidate embeddings from item features\",\n",
- " input_example=candidate_example,\n",
- " model_schema=candidate_model_schema,\n",
+ " name=\"candidate_model\", # Name of the model\n",
+ " description=\"Model that generates candidate embeddings from item features\", # Description of the model\n",
+ " input_example=candidate_example, # Example input for the model\n",
+ " model_schema=candidate_model_schema, # Schema of the model\n",
")\n",
- "mr_candidate_model.save(\"candidate_model\")"
+ "\n",
+ "# Save the candidate_model to the Model Registry\n",
+ "mr_candidate_model.save(\"candidate_model\") # Path to save the model"
]
},
{
@@ -674,7 +704,7 @@
"---\n",
"## âŠī¸ Next Steps \n",
"\n",
- "Retrieving the top-k closest candidate embeddings in a brute-force way (computing the distances between the query embedding and all candidate embeddings) is too expensive in a practical setting. In the next notebook, we will index the item embeddings using OpenSearch, which will allow us to retrieve candidates with very low latency."
+ "Retrieving the top-k closest candidate embeddings in a brute-force way (computing the distances between the query embedding and all candidate embeddings) is too expensive in a practical setting. In the next notebook, you will compute embeddings and create a feature view which will allow you to retrieve candidates with very low latency."
]
}
],
diff --git a/advanced_tutorials/recommender-system/3_build_index.ipynb b/advanced_tutorials/recommender-system/3_build_index.ipynb
deleted file mode 100644
index fe8dd55d..00000000
--- a/advanced_tutorials/recommender-system/3_build_index.ipynb
+++ /dev/null
@@ -1,372 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## đ¨đģâđĢ Build Index \n",
- "\n",
- "In this notebook we will build an index for our candidate embeddings. Here we will use OpenSearch, which is natively supported by Hopsworks."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## đ Imports "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import tensorflow as tf\n",
- "import pprint\n",
- "import numpy as np\n",
- "\n",
- "import warnings\n",
- "warnings.filterwarnings('ignore')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## đŽ Connect to Hopsworks Feature Store "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import hopsworks\n",
- "\n",
- "project = hopsworks.login()\n",
- "\n",
- "fs = project.get_feature_store()\n",
- "mr = project.get_model_registry()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## đ¯ Compute Candidate Embeddings \n",
- "\n",
- "We start by computing candidate embeddings for all items in the training data.\n",
- "\n",
- "First, we load our candidate model. Recall that we uploaded it to the Hopsworks Model Registry in the previous notebook. If you don't have the model locally you can download it from the Model Registry using the following code:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "model = mr.get_model(\n",
- " name=\"candidate_model\",\n",
- " version=1,\n",
- ")\n",
- "model_path = model.download()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "If you already have the model saved locally you can simply replace `model_path` with the path to your model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "candidate_model = tf.saved_model.load(model_path)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next we compute the embeddings of all candidate items that were used to train the retrieval model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "feature_view = fs.get_feature_view(\n",
- " name=\"retrieval\", \n",
- " version=1,\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "train_df, val_df, test_df, _, _, _ = feature_view.train_validation_test_split(\n",
- " validation_size=0.1, \n",
- " test_size=0.1,\n",
- " description='Retrieval dataset splits',\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "train_df[\"article_id\"] = train_df[\"article_id\"].astype(str)\n",
- "val_df[\"article_id\"] = val_df[\"article_id\"].astype(str)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Get list of input features for the candidate model.\n",
- "model_schema = model.model_schema['input_schema']['columnar_schema']\n",
- "candidate_features = [feat['name'] for feat in model_schema]\n",
- "\n",
- "# Get list of unique candidate items.\n",
- "item_df = train_df[candidate_features]\n",
- "item_df.drop_duplicates(subset=\"article_id\", inplace=True)\n",
- "\n",
- "item_ds = tf.data.Dataset.from_tensor_slices(\n",
- " {col: item_df[col] for col in item_df})\n",
- "\n",
- "# Compute embeddings for all candidate items.\n",
- "candidate_embeddings = item_ds.batch(2048).map(\n",
- " lambda x: (x[\"article_id\"], candidate_model(x)))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "(Strictly speaking, we haven't actually computed the candidate embeddings yet, as the dataset functions are lazily evaluated.)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## đŽ Index Embeddings \n",
- "\n",
- "Next we index these embeddings. We start by connecting to our project's OpenSearch client using the *hopsworks* library."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from opensearchpy import OpenSearch\n",
- "\n",
- "opensearch_api = project.get_opensearch_api()\n",
- "client = OpenSearch(**opensearch_api.get_default_py_config())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We'll create an index called `candidate_index`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "index_name = opensearch_api.get_project_index(\"candidate_index\")\n",
- "\n",
- "emb_dim = 16 # candidate_model.layers[-1].output.shape[-1]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Here we use the HNSW (Hierarchical Navigable Small World) data structure, which can be thought of as a skip list for graphs.\n",
- "\n",
- "See the [OpenSearch documentation](https://opensearch.org/docs/latest/search-plugins/knn/knn-index) for more detailed information about parameters."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# To delete the indices\n",
- "# response = client.indices.delete(\n",
- "# index = index_name\n",
- "# )\n",
- "# print(response)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# Dimensionality of candidate embeddings.\n",
- "\n",
- "index_body = {\n",
- " \"settings\": {\n",
- " \"knn\": True,\n",
- " \"knn.algo_param.ef_search\": 100,\n",
- " },\n",
- " \"mappings\": {\n",
- " \"properties\": {\n",
- " \"my_vector1\": {\n",
- " \"type\": \"knn_vector\",\n",
- " \"dimension\": emb_dim,\n",
- " \"method\": {\n",
- " \"name\": \"hnsw\",\n",
- " \"space_type\": \"innerproduct\",\n",
- " \"engine\": \"faiss\",\n",
- " \"parameters\": {\n",
- " \"ef_construction\": 256,\n",
- " \"m\": 48\n",
- " }\n",
- " }\n",
- " }\n",
- " }\n",
- " }\n",
- "}\n",
- "\n",
- "response = client.indices.create(index_name, body=index_body)\n",
- "print(response)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we can finally insert our candidate embeddings."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from opensearchpy.helpers import bulk\n",
- "\n",
- "actions = []\n",
- "for batch in candidate_embeddings:\n",
- " item_id_list, embedding_list = batch\n",
- " item_id_list = item_id_list.numpy().astype(int)\n",
- " embedding_list = embedding_list.numpy()\n",
- "\n",
- " for item_id, embedding in zip(item_id_list, embedding_list):\n",
- " actions.append({\n",
- " \"_index\": index_name,\n",
- " \"_id\": item_id,\n",
- " \"_source\": {\n",
- " \"my_vector1\": embedding,\n",
- " }\n",
- " })\n",
- "\n",
- "# Bulk insertion.\n",
- "bulk(client, actions)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To test that it works we can retrieve the neighbors of a random vector."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "embedding = np.random.rand(emb_dim)\n",
- "\n",
- "query = {\n",
- " \"size\": 10,\n",
- " \"query\": {\n",
- " \"knn\": {\n",
- " \"my_vector1\": {\n",
- " \"vector\": embedding,\n",
- " \"k\": 10\n",
- " }\n",
- " }\n",
- " }\n",
- "}\n",
- "\n",
- "response = client.search(\n",
- " body = query,\n",
- " index = index_name\n",
- ")\n",
- "\n",
- "pprint.pprint(response)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "---\n",
- "## âŠī¸ Next Steps \n",
- "\n",
- "At this point we have a recommender system that is able to generate a set of candidate items for a customer. However, many of these could be poor, as the candidate model was trained with only a few subset of the features. In the next notebook, we'll create a ranking dataset to train a *ranking model* to do more fine-grained predictions."
- ]
- }
- ],
- "metadata": {
- "interpreter": {
- "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
- },
- "kernelspec": {
- "display_name": "Python",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.11"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
\ No newline at end of file
diff --git a/advanced_tutorials/recommender-system/3_embeddings_creation.ipynb b/advanced_tutorials/recommender-system/3_embeddings_creation.ipynb
new file mode 100644
index 00000000..7075f0d4
--- /dev/null
+++ b/advanced_tutorials/recommender-system/3_embeddings_creation.ipynb
@@ -0,0 +1,321 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## đ¨đģâđĢ Build Index \n",
+ "\n",
+    "In this notebook, you will create a feature group for your candidate embeddings."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## đ Imports "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import tensorflow as tf\n",
+ "import pprint\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## đŽ Connect to Hopsworks Feature Store "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import hopsworks\n",
+ "\n",
+ "project = hopsworks.login()\n",
+ "\n",
+ "fs = project.get_feature_store()\n",
+ "mr = project.get_model_registry()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## đ¯ Compute Candidate Embeddings \n",
+ "\n",
+ "You start by computing candidate embeddings for all items in the training data.\n",
+ "\n",
+ "First, you load your candidate model. Recall that you uploaded it to the Hopsworks Model Registry in the previous notebook. If you don't have the model locally you can download it from the Model Registry using the following code:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model = mr.get_model(\n",
+ " name=\"candidate_model\",\n",
+ " version=1,\n",
+ ")\n",
+ "model_path = model.download()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you already have the model saved locally you can simply replace `model_path` with the path to your model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "candidate_model = tf.saved_model.load(model_path)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next you compute the embeddings of all candidate items that were used to train the retrieval model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "feature_view = fs.get_feature_view(\n",
+ " name=\"retrieval\", \n",
+ " version=1,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "train_df, val_df, test_df, _, _, _ = feature_view.train_validation_test_split(\n",
+ " validation_size=0.1, \n",
+ " test_size=0.1,\n",
+ " description='Retrieval dataset splits',\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "train_df.head(3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Get the list of input features for the candidate model from the model schema\n",
+ "model_schema = model.model_schema['input_schema']['columnar_schema']\n",
+ "candidate_features = [feat['name'] for feat in model_schema]\n",
+ "\n",
+ "# Select the candidate features from the training DataFrame\n",
+ "item_df = train_df[candidate_features]\n",
+ "\n",
+ "# Drop duplicate rows based on the 'article_id' column to get unique candidate items\n",
+ "item_df.drop_duplicates(subset=\"article_id\", inplace=True)\n",
+ "\n",
+ "item_df.head(3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create a TensorFlow dataset from the item DataFrame\n",
+ "item_ds = tf.data.Dataset.from_tensor_slices(\n",
+ " {col: item_df[col] for col in item_df})\n",
+ "\n",
+ "# Compute embeddings for all candidate items using the candidate_model\n",
+ "candidate_embeddings = item_ds.batch(2048).map(\n",
+ " lambda x: (x[\"article_id\"], candidate_model(x))\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "> Strictly speaking, you haven't actually computed the candidate embeddings yet, as the dataset functions are lazily evaluated."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## âī¸ Data Preparation \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Concatenate all article IDs and embeddings from the candidate_embeddings dataset\n",
+ "all_article_ids = tf.concat([batch[0] for batch in candidate_embeddings], axis=0)\n",
+ "all_embeddings = tf.concat([batch[1] for batch in candidate_embeddings], axis=0)\n",
+ "\n",
+ "# Convert tensors to numpy arrays\n",
+ "all_article_ids_np = all_article_ids.numpy().astype(int)\n",
+ "all_embeddings_np = all_embeddings.numpy()\n",
+ "\n",
+ "# Convert numpy arrays to lists\n",
+ "items_ids_list = all_article_ids_np.tolist()\n",
+ "embeddings_list = all_embeddings_np.tolist()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create a DataFrame\n",
+ "data_emb = pd.DataFrame({\n",
+ " 'article_id': items_ids_list, \n",
+ " 'embeddings': embeddings_list,\n",
+ "})\n",
+ "\n",
+ "data_emb.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## đĒ Feature Group Creation \n",
+ "\n",
+    "Now you are ready to create a feature group for your candidate embeddings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from hsfs import embedding\n",
+ "\n",
+ "# Create the Embedding Index\n",
+ "emb = embedding.EmbeddingIndex()\n",
+ "\n",
+ "emb.add_embedding(\n",
+ " \"embeddings\", \n",
+ " len(data_emb[\"embeddings\"].iloc[0]),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Get or create the 'candidate_embeddings_fg' feature group\n",
+ "candidate_embeddings_fg = fs.get_or_create_feature_group(\n",
+ " name=\"candidate_embeddings_fg\",\n",
+ " embedding_index=emb,\n",
+ " primary_key=['article_id'],\n",
+ " version=1,\n",
+ " description='Embeddings for each article',\n",
+ " online_enabled=True,\n",
+ ")\n",
+ "\n",
+ "candidate_embeddings_fg.insert(data_emb)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## đĒ Feature View Creation \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Get or create the 'candidate_embeddings' feature view\n",
+ "feature_view = fs.get_or_create_feature_view(\n",
+ " name=\"candidate_embeddings\",\n",
+ " version=1,\n",
+ " description='Embeddings of each article',\n",
+ " query=candidate_embeddings_fg.select([\"article_id\"]),\n",
+ ")"
+ ]
+ },
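+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an optional sanity check, you can query the new feature view for the nearest candidates of a random vector. This is a minimal sketch: it assumes the `insert` above has finished writing the embeddings to the online feature store, and it uses the same `find_neighbors` call that the ranking transformer relies on in the deployment notebook."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Build a random query vector with the same dimensionality as the stored embeddings\n",
+    "query_vector = np.random.rand(len(data_emb[\"embeddings\"].iloc[0])).tolist()\n",
+    "\n",
+    "# Retrieve the article IDs of the 10 nearest candidates from the embedding index\n",
+    "neighbors = feature_view.find_neighbors(\n",
+    "    query_vector,\n",
+    "    k=10,\n",
+    ")\n",
+    "neighbors"
+   ]
+  },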
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## âŠī¸ Next Steps \n",
+ "\n",
+    "At this point, you have a recommender system that is able to generate a set of candidate items for a customer. However, many of these candidates could be poor, as the candidate model was trained with only a small subset of the features. In the next notebook, you'll create a ranking dataset to train a *ranking model* to make more fine-grained predictions."
+ ]
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.18"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/advanced_tutorials/recommender-system/4_train_ranking_model.ipynb b/advanced_tutorials/recommender-system/4_train_ranking_model.ipynb
index cec2f58e..e35de9ce 100644
--- a/advanced_tutorials/recommender-system/4_train_ranking_model.ipynb
+++ b/advanced_tutorials/recommender-system/4_train_ranking_model.ipynb
@@ -6,7 +6,7 @@
"source": [
"## đ¨đģâđĢ Train Ranking Model \n",
"\n",
- "In this notebook, we will train a ranking model using gradient boosted trees. "
+ "In this notebook, you will train a ranking model using gradient boosted trees. "
]
},
{
@@ -149,7 +149,16 @@
" description='Ranking training dataset',\n",
")\n",
"\n",
- "#X_train, X_val, y_train, y_val = feature_view_ranking.get_train_test_split(1)"
+ "X_train.head(3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y_train.head(3)"
]
},
{
@@ -185,7 +194,10 @@
" use_best_model=True,\n",
")\n",
"\n",
- "model.fit(pool_train, eval_set=pool_val)"
+ "model.fit(\n",
+ " pool_train, \n",
+ " eval_set=pool_val,\n",
+ ")"
]
},
{
@@ -194,7 +206,7 @@
"source": [
"## đŽđģââī¸ Model Validation \n",
"\n",
- "Next, we'll evaluate how well the model performs on the validation data."
+ "Next, you'll evaluate how well the model performs on the validation data."
]
},
{
@@ -221,7 +233,7 @@
"source": [
"It can be seen that the model has a low F1-score on the positive class (higher is better). The performance could potentially be improved by adding more features to the dataset, e.g. image embeddings.\n",
"\n",
- "Let's see which features our model considers important."
+ "Let's see which features your model considers important."
]
},
{
@@ -255,7 +267,7 @@
"source": [
"It can be seen that the model places high importance on user and item embedding features. Consequently, better trained embeddings could yield a better ranking model.\n",
"\n",
- "Finally, we'll save our model."
+ "Finally, you'll save your model."
]
},
{
@@ -273,7 +285,7 @@
"source": [
"### đž Upload Model to Model Registry \n",
"\n",
- "We'll upload the model to the Hopsworks Model Registry."
+ "You'll upload the model to the Hopsworks Model Registry."
]
},
{
@@ -282,7 +294,7 @@
"metadata": {},
"outputs": [],
"source": [
- "# connect to Hopsworks Model Registry\n",
+ "# Connect to Hopsworks Model Registry\n",
"mr = project.get_model_registry()"
]
},
@@ -317,7 +329,7 @@
"---\n",
"## âŠī¸ Next Steps \n",
"\n",
- "Now we have trained both a retrieval and a ranking model, which will allow us to generate recommendations for users. In the next notebook, we'll take a look at how we can deploy these models with the `HSML` library."
+ "Now you have trained both a retrieval and a ranking model, which will allow you to generate recommendations for users. In the next notebook, you'll take a look at how you can deploy these models with the `HSML` library."
]
}
],
diff --git a/advanced_tutorials/recommender-system/5_create_deployments.ipynb b/advanced_tutorials/recommender-system/5_create_deployments.ipynb
index fc978e57..6d73b3f2 100644
--- a/advanced_tutorials/recommender-system/5_create_deployments.ipynb
+++ b/advanced_tutorials/recommender-system/5_create_deployments.ipynb
@@ -6,7 +6,7 @@
"source": [
"## đ¨đģâđĢ Create Deployment \n",
"\n",
- "In this notebook, we'll create a deployment for our recommendation system.\n",
+ "In this notebook, you'll create a deployment for your recommendation system.\n",
"\n",
"**NOTE Currently the transformer scripts are not implemented.**"
]
@@ -51,7 +51,7 @@
"metadata": {},
"outputs": [],
"source": [
- "# connect to Hopsworks Model Registry\n",
+ "# Connect to Hopsworks Model Registry\n",
"mr = project.get_model_registry()\n",
"\n",
"dataset_api = project.get_dataset_api()"
@@ -70,16 +70,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Next, we'll deploy our ranking model. Since it is a CatBoost model we need to implement a `Predict` class that tells Hopsworks how to load the model and how to use it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "ranking_model = mr.get_best_model(\"ranking_model\", \"fscore\", \"max\")"
+ "You start by deploying your ranking model. Since it is a CatBoost model you need to implement a `Predict` class that tells Hopsworks how to load the model and how to use it."
]
},
{
@@ -88,6 +79,11 @@
"metadata": {},
"outputs": [],
"source": [
+ "ranking_model = mr.get_best_model(\n",
+ " name=\"ranking_model\", \n",
+ " metric=\"fscore\", \n",
+ " direction=\"max\",\n",
+ ")\n",
"ranking_model"
]
},
@@ -111,97 +107,122 @@
"class Transformer(object):\n",
" \n",
" def __init__(self):\n",
- " # connect to Hopsworks\n",
+ " # Connect to Hopsworks\n",
" project = hopsworks.connection().get_project()\n",
- " \n",
- " # get feature views\n",
" self.fs = project.get_feature_store()\n",
- " self.articles_fv = self.fs.get_feature_view(\"articles\", 1)\n",
+ " \n",
+ " # Retrieve the 'articles' feature view\n",
+ " self.articles_fv = self.fs.get_feature_view(\n",
+ " name=\"articles\", \n",
+ " version=1,\n",
+ " )\n",
+ " \n",
+ " # Get list of feature names for articles\n",
" self.articles_features = [feat.name for feat in self.articles_fv.schema]\n",
- " self.customer_fv = self.fs.get_feature_view(\"customers\", 1)\n",
- "\n",
- " # create opensearch client\n",
- " opensearch_api = project.get_opensearch_api()\n",
- " self.os_client = OpenSearch(**opensearch_api.get_default_py_config())\n",
- " self.candidate_index = opensearch_api.get_project_index(\"candidate_index\")\n",
- "\n",
- " # get ranking model feature names\n",
+ " \n",
+ " # Retrieve the 'customers' feature view\n",
+ " self.customer_fv = self.fs.get_feature_view(\n",
+ " name=\"customers\", \n",
+ " version=1,\n",
+ " )\n",
+ "\n",
+ " # Retrieve the 'candidate_embeddings' feature view\n",
+ " self.candidate_index = self.fs.get_feature_view(\n",
+ " name=\"candidate_embeddings\", \n",
+ " version=1,\n",
+ " )\n",
+ "\n",
+ " # Retrieve ranking model\n",
" mr = project.get_model_registry()\n",
- " model = mr.get_model(\"ranking_model\", 1)\n",
+ " model = mr.get_model(\n",
+ " name=\"ranking_model\", \n",
+ " version=1,\n",
+ " )\n",
+ " \n",
+ " # Extract input schema from the model\n",
" input_schema = model.model_schema[\"input_schema\"][\"columnar_schema\"]\n",
" \n",
+ " # Get the names of features expected by the ranking model\n",
" self.ranking_model_feature_names = [feat[\"name\"] for feat in input_schema]\n",
" \n",
" def preprocess(self, inputs):\n",
+ " # Extract the input instance\n",
" inputs = inputs[\"instances\"][0]\n",
+ " \n",
+ " # Extract customer_id from inputs\n",
" customer_id = inputs[\"customer_id\"]\n",
" \n",
- " # search for candidates\n",
- " hits = self.search_candidates(inputs[\"query_emb\"], k=100)\n",
+ " # Search for candidate items\n",
+ " neighbors = self.candidate_index.find_neighbors(\n",
+ " inputs[\"query_emb\"], \n",
+ " k=100,\n",
+ " )\n",
+ " neighbors = [neighbor[0] for neighbor in neighbors]\n",
" \n",
- " # get already bought items\n",
+ " # Get IDs of items already bought by the customer\n",
" already_bought_items_ids = self.fs.sql(\n",
" f\"SELECT article_id from transactions_1 WHERE customer_id = '{customer_id}'\"\n",
" ).values.reshape(-1).tolist()\n",
" \n",
- " # build dataframes\n",
- " item_id_list = []\n",
- " item_emb_list = []\n",
- " exclude_set = set(already_bought_items_ids)\n",
- " for el in hits:\n",
- " item_id = str(el[\"_id\"])\n",
- " if item_id in exclude_set:\n",
- " continue\n",
- " item_emb = el[\"_source\"][\"my_vector1\"]\n",
- " item_id_list.append(item_id)\n",
- " item_emb_list.append(item_emb)\n",
+ " # Filter candidate items to exclude those already bought by the customer\n",
+    "        item_id_list = [\n",
+    "            str(item_id)\n",
+    "            for item_id in neighbors\n",
+    "            if str(item_id) not in already_bought_items_ids\n",
+    "        ]\n",
" item_id_df = pd.DataFrame({\"article_id\" : item_id_list})\n",
- " #item_emb_df = pd.DataFrame(item_emb_list).add_prefix(\"item_emb_\")\n",
" \n",
- " # get articles feature vectors\n",
- " articles_data = []\n",
- " for article_id in item_id_list:\n",
- " try:\n",
- " article_features = self.articles_fv.get_feature_vector({\"article_id\" : article_id})\n",
- " articles_data.append(article_features)\n",
- " except:\n",
- " logging.info(\"-- not found:\" + str(article_id))\n",
- " pass # article might have been removed from catalogue\n",
- " \n",
- " articles_df = pd.DataFrame(data=articles_data, columns=self.articles_features)\n",
+ " # Retrieve Article data for candidate items\n",
+    "        articles_data = [\n",
+    "            self.articles_fv.get_feature_vector({\"article_id\": item_id})\n",
+    "            for item_id in item_id_list\n",
+    "        ]\n",
+ "\n",
+ " articles_df = pd.DataFrame(\n",
+ " data=articles_data, \n",
+ " columns=self.articles_features,\n",
+ " )\n",
" \n",
- " # join candidates with item features\n",
- " ranking_model_inputs = item_id_df.merge(articles_df, on=\"article_id\", how=\"inner\")\n",
+ " # Join candidate items with their features\n",
+ " ranking_model_inputs = item_id_df.merge(\n",
+ " articles_df, \n",
+ " on=\"article_id\", \n",
+ " how=\"inner\",\n",
+ " ) \n",
" \n",
- " # add customer features\n",
- " customer_features = self.customer_fv.get_feature_vector({\"customer_id\": customer_id}, return_type=\"pandas\")\n",
+ " # Add customer features\n",
+ " customer_features = self.customer_fv.get_feature_vector(\n",
+ " {\"customer_id\": customer_id}, \n",
+ " return_type=\"pandas\",\n",
+ " )\n",
" ranking_model_inputs[\"age\"] = customer_features.age.values[0] \n",
" ranking_model_inputs[\"month_sin\"] = inputs[\"month_sin\"]\n",
" ranking_model_inputs[\"month_cos\"] = inputs[\"month_cos\"]\n",
+ " \n",
+ " # Select only the features required by the ranking model\n",
" ranking_model_inputs = ranking_model_inputs[self.ranking_model_feature_names]\n",
" \n",
- " return { \"inputs\" : [{\"ranking_features\": ranking_model_inputs.values.tolist(), \"article_ids\": item_id_list} ]}\n",
+ " return { \n",
+ " \"inputs\" : [{\"ranking_features\": ranking_model_inputs.values.tolist(), \"article_ids\": item_id_list}]\n",
+ " }\n",
"\n",
" def postprocess(self, outputs):\n",
+ " # Extract predictions from the outputs\n",
" preds = outputs[\"predictions\"]\n",
- " ranking = list(zip(preds[\"scores\"], preds[\"article_ids\"])) # merge lists\n",
- " ranking.sort(reverse=True) # sort by score (descending)\n",
- " return { \"ranking\": ranking }\n",
- " \n",
- " def search_candidates(self, query_emb, k=100):\n",
- " k = 100\n",
- " query = {\n",
- " \"size\": k,\n",
- " \"query\": {\n",
- " \"knn\": {\n",
- " \"my_vector1\": {\n",
- " \"vector\": query_emb,\n",
- " \"k\": k\n",
- " }\n",
- " }\n",
- " }\n",
- " }\n",
- " return self.os_client.search(body = query, index = self.candidate_index)[\"hits\"][\"hits\"]"
+ " \n",
+ " # Merge prediction scores and corresponding article IDs into a list of tuples\n",
+ " ranking = list(zip(preds[\"scores\"], preds[\"article_ids\"]))\n",
+ " \n",
+ " # Sort the ranking list by score in descending order\n",
+ " ranking.sort(reverse=True)\n",
+ " \n",
+ " # Return the sorted ranking list\n",
+ " return { \n",
+ " \"ranking\": ranking,\n",
+ " }"
]
},
{
@@ -210,9 +231,19 @@
"metadata": {},
"outputs": [],
"source": [
- "# copy transformer file into Hopsworks File System\n",
- "uploaded_file_path = dataset_api.upload(\"ranking_transformer.py\", \"Resources\", overwrite=True)\n",
- "transformer_script_path = os.path.join(\"/Projects\", project.name, uploaded_file_path)"
+ "# Copy transformer file into Hopsworks File System \n",
+ "uploaded_file_path = dataset_api.upload(\n",
+ " \"ranking_transformer.py\", # File name to be uploaded\n",
+ " \"Resources\", # Destination directory in Hopsworks File System \n",
+ " overwrite=True, # Overwrite the file if it already exists\n",
+ ") \n",
+ "\n",
+ "# Construct the path to the uploaded transformer script\n",
+ "transformer_script_path = os.path.join(\n",
+ " \"/Projects\", # Root directory for projects in Hopsworks\n",
+ " project.name, # Name of the current project\n",
+ " uploaded_file_path, # Path to the uploaded file within the project\n",
+ ")"
]
},
{
@@ -235,15 +266,24 @@
" self.model = joblib.load(os.environ[\"ARTIFACT_FILES_PATH\"] + \"/ranking_model.pkl\")\n",
"\n",
" def predict(self, inputs):\n",
+ " # Extract ranking features and article IDs from the inputs\n",
" features = inputs[0].pop(\"ranking_features\")\n",
" article_ids = inputs[0].pop(\"article_ids\")\n",
" \n",
+ " # Log the extracted features\n",
" logging.info(\"predict -> \" + str(features))\n",
"\n",
+ " # Predict probabilities for the positive class\n",
" scores = self.model.predict_proba(features).tolist()\n",
- " scores = np.asarray(scores)[:,1].tolist() # get scores of positive class\n",
+ " \n",
+ " # Get scores of positive class\n",
+ " scores = np.asarray(scores)[:,1].tolist() \n",
"\n",
- " return { \"scores\": scores, \"article_ids\": article_ids }"
+ " # Return the predicted scores along with the corresponding article IDs\n",
+ " return {\n",
+ " \"scores\": scores, \n",
+ " \"article_ids\": article_ids,\n",
+ " }"
]
},
{
@@ -252,16 +292,26 @@
"metadata": {},
"outputs": [],
"source": [
- "# upload predictor file to Hopsworks\n",
- "uploaded_file_path = dataset_api.upload(\"ranking_predictor.py\", \"Resources\", overwrite=True)\n",
- "predictor_script_path = os.path.join(\"/Projects\", project.name, uploaded_file_path)"
+ "# Upload predictor file to Hopsworks\n",
+ "uploaded_file_path = dataset_api.upload(\n",
+ " \"ranking_predictor.py\", \n",
+ " \"Resources\", \n",
+ " overwrite=True,\n",
+ ")\n",
+ "\n",
+ "# Construct the path to the uploaded script\n",
+ "predictor_script_path = os.path.join(\n",
+ " \"/Projects\", \n",
+ " project.name, \n",
+ " uploaded_file_path,\n",
+ ")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "With that in place, we can finally deploy our model."
+ "With that in place, you can finally deploy your model."
]
},
{
@@ -274,13 +324,13 @@
"\n",
"ranking_deployment_name = \"rankingdeployment\"\n",
"\n",
- "# define transformer\n",
+ "# Define transformer\n",
"ranking_transformer=Transformer(\n",
" script_file=transformer_script_path, \n",
" resources={\"num_instances\": 1},\n",
")\n",
"\n",
- "# deploy ranking model\n",
+ "# Deploy ranking model\n",
"ranking_deployment = ranking_model.deploy(\n",
" name=ranking_deployment_name,\n",
" description=\"Deployment that search for item candidates and scores them based on customer metadata\",\n",
@@ -296,6 +346,7 @@
"metadata": {},
"outputs": [],
"source": [
+ "# Start the deployment\n",
"ranking_deployment.start()"
]
},
@@ -307,10 +358,20 @@
},
"outputs": [],
"source": [
- "# #in case of failure\n",
+ "# Check logs in case of failure\n",
"# ranking_deployment.get_logs(component=\"predictor\", tail=200)"
]
},
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_top_recommendations(ranked_candidates, k=3):\n",
+ " return [candidate[-1] for candidate in ranked_candidates['ranking'][:k]]"
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -319,7 +380,7 @@
},
"outputs": [],
"source": [
- "# test ranking deployment\n",
+ "# Define a test input example\n",
"test_ranking_input = {\"instances\": [{\"customer_id\": \"641e6f3ef3a2d537140aaa0a06055ae328a0dddf2c2c0dd6e60eb0563c7cbba0\",\n",
" \"month_sin\": 1.2246467991473532e-16,\n",
" \"query_emb\": [0.214135289,\n",
@@ -340,8 +401,12 @@
" 0.966559],\n",
" \"month_cos\": -1.0}]}\n",
"\n",
- " # test ranking\n",
- "ranking_deployment.predict(test_ranking_input)"
+ "# Test ranking deployment\n",
+ "ranked_candidates = ranking_deployment.predict(test_ranking_input)\n",
+ "\n",
+ "# Retrieve article ids of the top recommended items\n",
+ "recommendations = get_top_recommendations(ranked_candidates, k=3)\n",
+ "recommendations"
]
},
{
@@ -350,7 +415,7 @@
"metadata": {},
"outputs": [],
"source": [
- "# #in case of failure\n",
+ "# Check logs in case of failure\n",
"# ranking_deployment.get_logs(component=\"transformer\",tail=200)"
]
},
@@ -362,7 +427,7 @@
"source": [
"## đ Query Model Deployment \n",
"\n",
- "We start by deploying our query model."
+ "Next, you'll deploy your query model."
]
},
{
@@ -371,6 +436,7 @@
"metadata": {},
"outputs": [],
"source": [
+ "# Retrieve the 'query_model' from the Model Registry\n",
"query_model = mr.get_model(\n",
" name=\"query_model\",\n",
" version=1,\n",
@@ -398,46 +464,60 @@
"class Transformer(object):\n",
" \n",
" def __init__(self): \n",
- " # connect to Hopsworks\n",
+ " # Connect to the Hopsworks\n",
" project = hopsworks.connection().get_project()\n",
+ " ms = project.get_model_serving()\n",
" \n",
- " # get feature views and transformation functions\n",
+ " # Retrieve the 'customers' feature view\n",
" fs = project.get_feature_store()\n",
- " self.customer_fv = fs.get_feature_view(\"customers\", 1)\n",
- " \n",
- " # get ranking deployment metadata object\n",
- " ms = project.get_model_serving()\n",
+ " self.customer_fv = fs.get_feature_view(\n",
+ " name=\"customers\", \n",
+ " version=1,\n",
+ " )\n",
+ " # Retrieve the ranking deployment \n",
" self.ranking_server = ms.get_deployment(\"rankingdeployment\")\n",
- "\n",
- " # TODO (Davit): make this as on-demand feature calculation\n",
- " self.c = 2 * np.pi / 12\n",
" \n",
" \n",
" def preprocess(self, inputs):\n",
+ " # Check if the input data contains a key named \"instances\"\n",
+ " # and extract the actual data if present\n",
" inputs = inputs[\"instances\"] if \"instances\" in inputs else inputs\n",
+ " \n",
+ " # Extract customer_id and transaction_date from the inputs\n",
" customer_id = inputs[\"customer_id\"]\n",
" transaction_date = inputs[\"transaction_date\"]\n",
" \n",
- " # extract month\n",
+ " # Extract month from the transaction_date\n",
" month_of_purchase = datetime.fromisoformat(inputs.pop(\"transaction_date\"))\n",
" \n",
- " # get customer features\n",
- " #customer_features = self.customer_fv.get_feature_vector(inputs, return_type=\"pandas\")\n",
- " customer_features = self.customer_fv.get_feature_vector({\"customer_id\": customer_id}, return_type=\"pandas\")\n",
+ " # Get customer features\n",
+ " customer_features = self.customer_fv.get_feature_vector(\n",
+ " {\"customer_id\": customer_id}, \n",
+ " return_type=\"pandas\",\n",
+ " )\n",
" \n",
- " # enrich inputs\n",
+ " # Enrich inputs with customer age\n",
" inputs[\"age\"] = customer_features.age.values[0] \n",
" \n",
- " # TODO (Davit): make this as on-demand feature calculation\n",
+ " # Calculate the sine and cosine of the month_of_purchase\n",
" month_of_purchase = datetime.strptime(transaction_date, \"%Y-%m-%dT%H:%M:%S.%f\").month\n",
- " inputs[\"month_sin\"] = float(np.sin(month_of_purchase * self.c)) \n",
- " inputs[\"month_cos\"] = float(np.cos(month_of_purchase * self.c))\n",
+ " \n",
+ " # Calculate a coefficient for adjusting the periodicity of the month\n",
+ " coef = np.random.uniform(0, 2 * np.pi) / 12\n",
+ " \n",
+ " # Calculate the sine and cosine components for the month_of_purchase\n",
+ " inputs[\"month_sin\"] = float(np.sin(month_of_purchase * coef)) \n",
+ " inputs[\"month_cos\"] = float(np.cos(month_of_purchase * coef))\n",
" \n",
- " return {\"instances\" : [inputs]}\n",
+ " return {\n",
+ " \"instances\" : [inputs]\n",
+ " }\n",
" \n",
" def postprocess(self, outputs):\n",
- " # get ordered ranking predictions \n",
- " return {\"predictions\": self.ranking_server.predict({ \"instances\": outputs[\"predictions\"]})}\n"
+ " # Return ordered ranking predictions \n",
+ " return {\n",
+ " \"predictions\": self.ranking_server.predict({ \"instances\": outputs[\"predictions\"]}),\n",
+ " }"
]
},
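+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustrative sketch (not part of the deployment itself): how the cyclical month\n",
+ "# encoding above behaves. With coef = 2 * pi / 12, December (12) and January (1)\n",
+ "# land close together on the unit circle, which is the point of the month_sin/month_cos features.\n",
+ "import numpy as np\n",
+ "\n",
+ "coef = 2 * np.pi / 12\n",
+ "for month in [1, 6, 12]:\n",
+ "    print(month, round(float(np.sin(month * coef)), 3), round(float(np.cos(month * coef)), 3))"
+ ]
+ },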
{
@@ -446,9 +526,19 @@
"metadata": {},
"outputs": [],
"source": [
- "# copy transformer file into Hopsworks File System\n",
- "uploaded_file_path = dataset_api.upload(\"querymodel_transformer.py\", \"Models\", overwrite=True)\n",
- "transformer_script_path = os.path.join(\"/Projects\", project.name, uploaded_file_path)"
+ "# Copy transformer file into Hopsworks File System\n",
+ "uploaded_file_path = dataset_api.upload(\n",
+ " \"querymodel_transformer.py\", \n",
+ " \"Models\", \n",
+ " overwrite=True,\n",
+ ")\n",
+ "\n",
+ "# Construct the path to the uploaded script\n",
+ "transformer_script_path = os.path.join(\n",
+ " \"/Projects\", \n",
+ " project.name, \n",
+ " uploaded_file_path,\n",
+ ")"
]
},
{
@@ -461,13 +551,13 @@
"\n",
"query_model_deployment_name = \"querydeployment\"\n",
"\n",
- "# define transformer\n",
+ "# Define transformer\n",
"query_model_transformer=Transformer(\n",
" script_file=transformer_script_path, \n",
" resources={\"num_instances\": 1},\n",
")\n",
"\n",
- "# deploy query model\n",
+ "# Deploy the query model\n",
"query_model_deployment = query_model.deploy(\n",
" name=query_model_deployment_name,\n",
" description=\"Deployment that generates query embeddings from customer and item features using the query model\",\n",
@@ -480,7 +570,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "At this point, we have registered our deployment. To start it up we need to run:"
+ "At this point, you have registered your deployment. To start it up you need to run:"
]
},
{
@@ -489,6 +579,7 @@
"metadata": {},
"outputs": [],
"source": [
+ "# Start the deployment\n",
"query_model_deployment.start()"
]
},
@@ -498,27 +589,25 @@
"metadata": {},
"outputs": [],
"source": [
- "# #in case of failure\n",
+ "# Check logs in case of failure\n",
"# query_model_deployment.get_logs(component=\"transformer\", tail=20)"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can test the deployment by making a prediction on the input example we registered together with the model."
- ]
- },
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
+ "# Define a test input example\n",
"data = {\"instances\": {\"customer_id\": \"641e6f3ef3a2d537140aaa0a06055ae328a0dddf2c2c0dd6e60eb0563c7cbba0\", \"transaction_date\": \"2022-11-15T12:16:25.330916\"}}\n",
- "# # data = {\"customer_id\": \"641e6f3ef3a2d537140aaa0a06055ae328a0dddf2c2c0dd6e60eb0563c7cbba0\", \"date_of_purchase\": \"2022-11-15T12:16:25.330916\"}\n",
"\n",
- "query_model_deployment.predict(data)"
+ "# Test the deployment\n",
+ "ranked_candidates = query_model_deployment.predict(data)\n",
+ "\n",
+ "# Retrieve article ids of the top recommended items\n",
+ "recommendations = get_top_recommendations(ranked_candidates['predictions'], k=3)\n",
+ "recommendations"
]
},
{
@@ -527,7 +616,7 @@
"metadata": {},
"outputs": [],
"source": [
- "# #in case of failure\n",
+ "# Check logs in case of failure\n",
"# query_model_deployment.get_logs(component=\"transformer\",tail=200)"
]
},
@@ -535,7 +624,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Let's stop the deployment when we're not using it."
+ "Stop the deployment when you're not using it."
]
},
{
@@ -544,8 +633,11 @@
"metadata": {},
"outputs": [],
"source": [
- "# ranking_deployment.stop()\n",
- "# query_model_deployment.stop()"
+ "# Stop the ranking model deployment\n",
+ "ranking_deployment.stop()\n",
+ "\n",
+ "# Stop the query model deployment\n",
+ "query_model_deployment.stop()"
]
},
{
diff --git a/advanced_tutorials/recommender-system/README.md b/advanced_tutorials/recommender-system/README.md
index be77fc78..de0712b0 100644
--- a/advanced_tutorials/recommender-system/README.md
+++ b/advanced_tutorials/recommender-system/README.md
@@ -14,7 +14,7 @@
* 1_feature_engineering.ipynb
* 2_train_retrieval_model.ipynb
-* 3_build_index.ipynb
+* 3_embeddings_creation.ipynb
* 4_train_ranking_model.ipynb
* 5_create_deployments.ipynb
* 6_inference_and_ui.py
diff --git a/advanced_tutorials/recommender-system/features/ranking.py b/advanced_tutorials/recommender-system/features/ranking.py
index c3bd7542..b4c23dfa 100644
--- a/advanced_tutorials/recommender-system/features/ranking.py
+++ b/advanced_tutorials/recommender-system/features/ranking.py
@@ -15,12 +15,12 @@ def compute_ranking_dataset(trans_fg, articles_fg, customers_fg):
# Define the number of negative pairs to generate
n_neg = len(positive_pairs) * 10
- # Generate random article_id for negative_pairs that are not in positive_pairs
- negative_article_ids = positive_pairs["article_id"].drop_duplicates().sample(n_neg, replace=True, random_state=2).to_frame()
-
# Initialize the negative_pairs DataFrame
negative_pairs = pd.DataFrame()
+ # Randomly sample article_ids for the negative pairs (combined with randomly sampled customer_ids below, these are unlikely to correspond to real purchases)
+ negative_pairs['article_id'] = positive_pairs["article_id"].drop_duplicates().sample(n_neg, replace=True, random_state=2)
+
# Add customer_id to negative_pairs
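+ # .to_numpy() assigns the sampled values by position, avoiding pandas index alignment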
negative_pairs["customer_id"] = positive_pairs["customer_id"].sample(n_neg, replace=True, random_state=3).to_numpy()
@@ -32,7 +32,7 @@ def compute_ranking_dataset(trans_fg, articles_fg, customers_fg):
negative_pairs["label"] = 0
# Concatenate positive and negative pairs
- ranking_df = pd.concat([positive_pairs, negative_pairs], ignore_index=True)
+ ranking_df = pd.concat([positive_pairs, negative_pairs[positive_pairs.columns]], ignore_index=True)
# Keep unique article_id from item features
item_df = articles_fg.read()