## <span style="color:#ff5f27;"> 🔍🗞️ News search using kNN in Hopsworks</span>

In this tutorial, you will build a news search application that lets you search news articles using natural language. You will create embeddings for the news and use kNN search over those embeddings to find articles similar to a given description. The steps are:
1. Load news data
2. Create embeddings for the news heading and news body
3. Ingest the news data and embedding into Hopsworks
4. Search news using Hopsworks

## <span style='color:#ff5f27'> 📝 Imports</span>

In [None]:
!pip install -U 'hopsworks[python]' --quiet
!pip install sentence_transformers -q

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import logging
import hopsworks
from hsfs import embedding

## <span style="color:#ff5f27;"> 📰 Load news data</span>

First, you need to load the news articles downloaded from [Kaggle news articles](https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles).
Because creating embeddings for the full dataset is time-consuming, this tutorial works with a random sample of 300 articles.

In [None]:
df_all = pd.read_csv(
    "https://repo.hops.works/dev/jdowling/Articles.csv", 
    encoding='utf-8', 
    encoding_errors='ignore',
)

df = df_all.sample(n=300).reset_index(drop=True)
df["news_id"] = list(range(len(df)))
df.columns = df.columns.str.lower()
df.head(3)

## <span style="color:#ff5f27;"> 🧠 Create embeddings</span>

Next, you need to create embeddings for the heading and body of each news article. These embeddings are later compared, using kNN search, against the embedding of the news description you want to search for. Here we use a lightweight language model (LM) to encode the news into embeddings. You can use any other language model instead, including LLMs such as Llama or Mistral.

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
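# Embed only the first 100 characters of each article body
embeddings_body = model.encode([body[:100] for body in df["article"]])
# Pass a plain list of strings so encoding does not depend on the DataFrame index
embeddings_heading = model.encode(df["heading"].tolist())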

df["embedding_heading"] = pd.Series(embeddings_heading.tolist())
df["embedding_body"] = pd.Series(embeddings_body.tolist())

df.head(3)

## <span style="color:#ff5f27;"> 📥 Ingest into Hopsworks</span>

You need to ingest the data into Hopsworks so that it is stored and indexed. First, log in to Hopsworks and get the feature store.

In [None]:
project = hopsworks.login()
fs = project.get_feature_store()

Next, because embeddings are stored in an index in the backing vector database, you need to specify the index name and the embedding features in the dataframe.

In [None]:
VERSION = 1

embedding_index = embedding.EmbeddingIndex(index_name=f"news_fg_{VERSION}")

In [None]:
# Specify the name and dimension of the embedding features 
embedding_index.add_embedding("embedding_body", model.get_sentence_embedding_dimension())
embedding_index.add_embedding("embedding_heading", model.get_sentence_embedding_dimension())
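
Each embedding feature is registered with a fixed dimension, so it can be useful to confirm before ingestion that every vector in the dataframe has that length. The following is a minimal sketch in plain Python using toy data; `check_dimensions` is a hypothetical helper written for this illustration and is not part of the Hopsworks API.

```python
def check_dimensions(vectors, expected_dim):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Toy data (made up): three 4-dimensional vectors and one malformed 3-dimensional one.
toy_vectors = [
    [0.1, 0.2, 0.3, 0.4],
    [0.0, 0.5, 0.5, 0.0],
    [0.9, 0.1, 0.0],
    [0.3, 0.3, 0.3, 0.1],
]

bad_rows = check_dimensions(toy_vectors, expected_dim=4)
print(bad_rows)  # [2]
```

In this notebook, you could call it as `check_dimensions(df["embedding_body"], model.get_sentence_embedding_dimension())` and expect an empty list.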

Next, you create a feature group with the `embedding_index` and ingest data into the feature group.

In [None]:
news_fg = fs.get_or_create_feature_group(
    name="news_fg",
    version=VERSION,
    primary_key=["news_id"],
    online_enabled=True,
    embedding_index=embedding_index,
)

In [None]:
news_fg.insert(df)

## <span style="color:#ff5f27;"> 🔎🗞️ Search News</span>

Once the data is ingested into Hopsworks, you can search the news by providing a description. The description must first be encoded with the same LM you used to encode the news. You can then find the news most similar to the description using the kNN search functionality provided by the feature group.

In [None]:
# Set the logging level to WARNING to suppress INFO messages
logging.getLogger().setLevel(logging.WARNING)

In [None]:
news_description = "news about europe"
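
To build intuition for what a kNN search does with the encoded description, here is a brute-force sketch in plain Python. It ranks a handful of toy vectors by Euclidean distance to a query vector and keeps the closest `k`. This is an illustration only: the function names and toy data are made up, a vector database typically relies on an approximate index rather than scanning every vector, and the distance metric actually used depends on how the embedding index is configured.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(query, vectors, k):
    """Return (distance, index) pairs for the k vectors closest to query."""
    scored = [(euclidean_distance(query, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored)[:k]


# Toy 2-dimensional "embeddings" standing in for the news vectors (made up).
toy_embeddings = [
    [0.0, 0.0],
    [1.0, 1.0],
    [0.9, 1.2],
    [5.0, 5.0],
]
toy_query = [1.0, 1.0]

neighbors = brute_force_knn(toy_query, toy_embeddings, k=2)
nearest_ids = [i for _, i in neighbors]
print(nearest_ids)  # [1, 2]
```

The real embeddings work the same way, only with the dimension reported by `model.get_sentence_embedding_dimension()` instead of two.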

You can search for news whose heading is similar to the description.

In [None]:
results = news_fg.find_neighbors(
    model.encode(news_description), 
    k=3, 
    col="embedding_heading",
)

# Print the heading: result[1] holds the feature values, and index 2 is the heading
for result in results:
    print(result[1][2])
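
Positional indexing such as `result[1][2]` is compact but easy to get wrong if the column order changes. Assuming each result is a `(score, feature values)` tuple, as the indexing above suggests, a small helper can pair the values with column names. The sketch below runs on hypothetical data; `rows_as_dicts`, the column list, the scores, and the headings are all made up for illustration and are not real search output.

```python
def rows_as_dicts(results, column_names):
    """Pair each result's feature values with column names and keep the score."""
    rows = []
    for score, values in results:
        row = dict(zip(column_names, values))
        row["score"] = score
        rows.append(row)
    return rows


# Hypothetical column order and results in the assumed (score, values) shape.
hypothetical_columns = ["article", "date", "heading"]
hypothetical_results = [
    (0.71, ["Body text A", "1/1/2015", "Example heading A"]),
    (0.65, ["Body text B", "1/2/2015", "Example heading B"]),
]

rows = rows_as_dicts(hypothetical_results, hypothetical_columns)
for row in rows:
    print(row["score"], row["heading"])
```

With real results, you would pass the feature group's actual column names in the order they are returned.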

Alternatively, you can search for news whose body is similar to the description and filter the results by news type.

In [None]:
results = news_fg.find_neighbors(
    model.encode(news_description), 
    k=3, 
    col="embedding_body",
    filter=news_fg.newstype == "business",
)

# Print the heading: result[1] holds the feature values, and index 2 is the heading
for result in results:
    print(result[1][2])

---

## <span style="color:#ff5f27;">➡️ Next step</span>

You can now search articles using natural language. To learn how to rank the results, see [this tutorial](https://github.com/logicalclocks/hopsworks-tutorials/tree/branch-4.2/api_examples/vector_similarity_search/2_feature_view_embeddings_api.ipynb).