logicalclocks · jimdowling · Feb 22, 2024 · Feb 22, 2024 · Feb 22, 2024 · Maxxx-zh
diff --git a/api_examples/hsfs/knn_search/news-search-knn.ipynb b/api_examples/hsfs/knn_search/news-search-knn.ipynb
@@ -0,0 +1,312 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "8b0ba628",
+   "metadata": {},
+   "source": [
+    "# News search using kNN in Hopsworks"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d727fa41",
+   "metadata": {},
+   "source": [
+    "In this tutorial, you will learn how to build a news search application that lets you search news articles using natural language. You will create embeddings for the news and use kNN search over those embeddings to find articles similar to a given description. The steps are:\n",
+    "1. Load news data\n",
+    "2. Create embeddings for the news heading and news body\n",
+    "3. Ingest the news data and embeddings into Hopsworks\n",
+    "4. Search news using Hopsworks"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "32974a73",
+   "metadata": {},
+   "source": [
+    "## Load news data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "840cc4e5",
+   "metadata": {},
+   "source": [
+    "First, load the news articles, which were downloaded from [Kaggle news articles](https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles).\n",
+    "Since creating embeddings for the full dataset is time-consuming, here we sample 300 articles."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0d4062d8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "df_all = pd.read_csv(\"https://repo.hops.works/dev/jdowling/Articles.csv\", encoding='utf-8', encoding_errors='ignore')\n",
+    "df = df_all.sample(n=300).reset_index(drop=True)\n",
+    "df[\"news_id\"] = list(range(len(df)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ddea5ab0",
+   "metadata": {},
+   "source": [
+    "## Create embeddings"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b43d7b68",
+   "metadata": {},
+   "source": [
+    "Next, you need to create embeddings for the heading and body of each news article. These embeddings are later compared, using kNN search, against the embedding of the news description you want to search for. Here we use a lightweight language model (LM) to encode the news into embeddings. You can use any other language model, including LLMs (Llama, Mistral)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "776aaa6d-b2b3-4c6a-9814-1dcc547f0a36",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install sentence_transformers -q"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "017ae8a7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sentence_transformers import SentenceTransformer\n",
+    "model = SentenceTransformer('all-MiniLM-L6-v2')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "245ee674",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# truncate the body to 100 characters\n",
+    "embeddings_body = model.encode([body[:100] for body in df[\"Article\"]])\n",
+    "embeddings_heading = model.encode(df[\"Heading\"].tolist())\n",
+    "df[\"embedding_heading\"] = pd.Series(embeddings_heading.tolist())\n",
+    "df[\"embedding_body\"] = pd.Series(embeddings_body.tolist())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "af272d56",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1ca7d180",
+   "metadata": {},
+   "source": [
+    "## Ingest into Hopsworks"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4edaa1d3",
+   "metadata": {},
+   "source": [
+    "You need to ingest the data into Hopsworks so that it is stored and indexed. First, log in to Hopsworks and get a handle to the feature store."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "88f99b8b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import hopsworks\n",
+    "proj = hopsworks.login()\n",
+    "fs = proj.get_feature_store()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "53ca5b13",
+   "metadata": {},
+   "source": [
+    "Next, because embeddings are stored in an index in the backing vector database, you need to specify the index name and the embedding features in the dataframe."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4994b30b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "version = 1\n",
+    "from hsfs import embedding\n",
+    "\n",
+    "emb = embedding.EmbeddingIndex(index_name=f\"news_fg_{version}\")\n",
+    "# specify the name and dimension of the embedding features \n",
+    "emb.add_embedding(\"embedding_body\", len(df[\"embedding_body\"][0]))\n",
+    "emb.add_embedding(\"embedding_heading\", len(df[\"embedding_heading\"][0]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "755be3cb",
+   "metadata": {},
+   "source": [
+    "Next, you create a feature group with the `embedding_index` and ingest data into the feature group."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a2fa6af0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "news_fg = fs.get_or_create_feature_group(\n",
+    "    name=\"news_fg\",\n",
+    "    embedding_index=emb,\n",
+    "    primary_key=[\"news_id\"],\n",
+    "    version=version,\n",
+    "    online_enabled=True\n",
+    ")\n",
+    "\n",
+    "news_fg.insert(df, write_options={\"start_offline_materialization\": False})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "508ae2c4",
+   "metadata": {},
+   "source": [
+    "## Search News"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "baa6d6c3",
+   "metadata": {},
+   "source": [
+    "Once the data is ingested into Hopsworks, you can search for news by providing a news description. The description must first be encoded with the same LM you used to encode the news. You can then find news similar to the description using the kNN search functionality provided by the feature group."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e114be5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# set the logging level to WARNING to suppress INFO messages\n",
+    "import logging\n",
+    "logging.getLogger().setLevel(logging.WARNING)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "343bcbcf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "news_description = \"news about europe\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aa0d15d3",
+   "metadata": {},
+   "source": [
+    "You can search for news similar to the description by matching against the news headings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9a73ea59",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "results = news_fg.find_neighbors(model.encode(news_description), k=3, col=\"embedding_heading\")\n",
+    "# each result is a (similarity score, feature values) tuple; here index 2 of the feature values is the heading\n",
+    "for result in results:\n",
+    "    print(result[1][2])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c2c4f5fc",
+   "metadata": {},
+   "source": [
+    "Alternatively, you can search for news similar to the description by matching against the news body and filtering by news type."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7a3c31b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "results = news_fg.find_neighbors(model.encode(news_description), k=3, col=\"embedding_body\",\n",
+    "                                filter=news_fg.newstype == \"business\")\n",
+    "# print out the heading\n",
+    "for result in results:\n",
+    "    print(result[1][2])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5cf246b3",
+   "metadata": {},
+   "source": [
+    "## Next step"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eb82cba1",
+   "metadata": {},
+   "source": [
+    "Now you are able to search articles using natural language. You can learn how to rank the results in [this tutorial]() and learn best practices in the [guide]()."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
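
For readers of this notebook who want intuition for what the kNN search step computes, here is a minimal sketch of brute-force nearest-neighbour search in plain Python. It is not how Hopsworks implements `find_neighbors` (the notebook relies on an index in a vector database); the cosine metric, the helper names `cosine_similarity` and `brute_force_knn`, and the three-dimensional toy vectors are all assumptions made for illustration. The return shape mirrors the `(score, row)` tuples that the notebook iterates over.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_knn(
    query: Sequence[float], rows: List[list], k: int = 3
) -> List[Tuple[float, list]]:
    """Score every row against the query and return the k best as (score, row), best first.

    Each row is [heading, embedding]; both names are hypothetical toy fields.
    """
    scored = [(cosine_similarity(query, row[1]), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: made-up headings with made-up 3-dimensional "embeddings".
toy_rows = [
    ["EU leaders meet in Brussels", [0.9, 0.1, 0.0]],
    ["Cricket team wins series", [0.0, 0.2, 0.9]],
    ["European markets rally", [0.8, 0.3, 0.1]],
]
toy_query = [1.0, 0.0, 0.0]  # stands in for model.encode(news_description)

neighbors = brute_force_knn(toy_query, toy_rows, k=2)
for score, row in neighbors:
    print(f"{score:.3f}  {row[0]}")
```

Brute force scores every stored vector, so its cost grows linearly with the number of articles. That is fine for the 300 sampled rows here, and it is the reason larger collections typically use an index, as this notebook does.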