
# Topic Vector Search Demonstration

This notebook demonstrates how to set up a vector store and retrieve documents from it by searching with a topic embedding produced by a topic model.

For convenience, the BERTopic package is used to create the topic model. However, you can use any topic model pipeline that meets your needs. The last step in the pipeline should generate an embedding that represents the "centroid" of the topic. This is the embedding used to retrieve search results.
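
One literal reading of "centroid" is the mean of the document embeddings assigned to a topic. The sketch below is illustrative only: it is not how this notebook builds its topic embeddings (Step 4 embeds the topic keywords instead), and the function name `topic_centroid` and the toy vectors are assumptions.

```python
import math

def topic_centroid(doc_embeddings):
    """Return the L2-normalized mean of a list of equal-length vectors."""
    if not doc_embeddings:
        raise ValueError("need at least one embedding")
    dim = len(doc_embeddings[0])
    count = len(doc_embeddings)
    mean = [sum(vec[i] for vec in doc_embeddings) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Normalizing keeps the centroid comparable under cosine similarity.
    return [x / norm for x in mean] if norm else mean

# Toy 2-d "embeddings" for three posts in one topic (made-up values).
centroid = topic_centroid([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(centroid)
```

A mean-of-documents centroid reflects everything in the cluster, including noise, whereas a keyword-based embedding reflects only the terms chosen to represent the topic.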

## Steps Overview
1. Load the sample posts and store them as `PostDocument` objects in OpenSearch.
2. Train a topic model using the methods defined in `topic_model.py`.
3. Explore the results of the topic model, including coherence score and topic diversity.
4. Search OpenSearch for posts matching the topic embeddings and evaluate the search performance.


In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 100)
pd.set_option('display.max_colwidth', 9999)
pd.options.display.float_format = '{:.2f}'.format

import sys
import os
sys.path.append(os.path.abspath(os.path.join('src')))

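import logging
from dotenv import load_dotenv

# Define the logger used below; the original cell referenced it without creating it
logger = logging.getLogger(__name__)
env_loaded = load_dotenv()
if not env_loaded:
    logger.error("Environment variables did not load. Did you create a .env file in the root of the project?")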


## Step 1: Load Sample Posts and Store in OpenSearch

In this step, we load the sample posts and store them as `PostDocument` objects
in OpenSearch. A pydantic model structures the data and provides convenience
methods for pre-processing the text.
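
For vector search to work, the index needs a k-NN vector field whose dimension matches the embedding model. The `create_index` helper is not shown in this notebook, so the following is only a guess at the kind of index body it might send. The field names follow the `PostDocument` fields, `build_index_body` is a hypothetical name, and the field types are assumptions. `all-MiniLM-L6-v2` produces 384-dimensional vectors.

```python
def build_index_body(embedding_dim=384):
    """Build an OpenSearch index body with a k-NN vector field (illustrative)."""
    return {
        "settings": {"index": {"knn": True}},  # enable k-NN search on this index
        "mappings": {
            "properties": {
                "post_id": {"type": "keyword"},
                "post_author": {"type": "keyword"},
                "created_at": {"type": "date"},
                "modified_at": {"type": "date"},
                "post_text": {"type": "text"},
                # The vector field that similarity queries run against
                "doc_embedding": {"type": "knn_vector", "dimension": embedding_dim},
            }
        },
    }

index_body = build_index_body()
# With a live client this could be passed as:
# opensearch_client.indices.create(index="post_docs", body=index_body)
```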

In [2]:
# Step 1: Load the sample posts and store in OpenSearch
from opensearchpy import OpenSearch
import time
# Import the main function to run the initial setup
from src.source import create_index, load_sample_posts, convert_to_pydantic, create_embeddings_for_posts, store_posts_in_opensearch

# Initialize OpenSearch client
opensearch_client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}], http_compress=True
)

# Create the index
create_index(opensearch_client)

# Load sample posts from the JSON file
sample_posts = load_sample_posts("sample_posts.json")
print(len(sample_posts))

# Convert posts to Pydantic models
posts = convert_to_pydantic(sample_posts)

# Pre-process text and create embeddings for each post
await create_embeddings_for_posts(posts)

# Store the posts in OpenSearch using batch upload
store_posts_in_opensearch(opensearch_client, posts)

# Allow OpenSearch to index the documents
time.sleep(1)

INFO:opensearch:HEAD http://localhost:9200/post_docs [status:200 request:0.012s]
INFO:src.source:Loaded 145 posts from sample_posts.json.
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2


Index 'post_docs' already exists.
145


INFO:src.source:Completed creating embeddings for posts.
INFO:opensearch:POST http://localhost:9200/_bulk [status:200 request:0.098s]
INFO:opensearch:POST http://localhost:9200/_bulk [status:200 request:0.064s]
INFO:opensearch:POST http://localhost:9200/_bulk [status:200 request:0.058s]
INFO:src.source:Successfully stored 145 posts in OpenSearch.


In [3]:
from src.index import INDEX_NAME
from models import PostDocument

# Ensure that the data is loaded correctly
response = opensearch_client.search(index=INDEX_NAME, body={"size": 1000, "query": {"match_all": {}}})
results = response["hits"]["hits"]
print(f"Total number of posts in OpenSearch: {len(results)}")

# Display the first post document
PostDocument(**results[0]['_source'])

INFO:opensearch:POST http://localhost:9200/post_docs/_search [status:200 request:0.798s]


Total number of posts in OpenSearch: 145


PostDocument(post_id='185bb492-e993-4f69-9b88-e55b59da7567', post_author='user_95', created_at=datetime.datetime(2023, 10, 24, 12, 55, 53, 722524, tzinfo=TzInfo(UTC)), modified_at=datetime.datetime(2024, 2, 28, 17, 9, 42, 572027, tzinfo=TzInfo(UTC)), post_text="Let's paws for a moment to appreciate the majesty of cats 🐱 Their grace and agility never fail to amaze me! 😻 #CatLove #FelineFun", doc_embedding=[-0.024599889293313026, 0.0070482720620930195, 0.08802028745412827, -0.005882162135094404, -0.004295976832509041, 0.021553533151745796, 0.12377623468637466, -0.05751150846481323, -0.03581945225596428, -0.029644496738910675, 0.005496165249496698, -0.09166023880243301, 0.012848381884396076, 0.04113560542464256, -0.07764243334531784, 0.06196322292089462, -0.07074573636054993, -0.0001087921264115721, -0.03267202898859978, 0.10064049810171127, -0.02325579896569252, -0.009864791296422482, -0.0099931126460433, 0.02681971900165081, -0.07322884351015091, 0.034660059958696365, -0.042337533

Above is an example of a structured document. The pydantic model includes methods that pre-process the post into NLP-ready text. The embedding is stored as a top-level field on the document so that OpenSearch can index it.
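
The pre-processing methods themselves live in the pydantic model and are not shown here. Judging from the pre-processed text that appears later in `topic_assignments.csv`, they lowercase the text, spell out emojis, and strip hashtag symbols and most punctuation. A minimal sketch of that kind of clean-up, assuming a hypothetical `preprocess` function and a two-entry emoji lookup in place of a full mapping:

```python
import re

# Tiny stand-in for an emoji-to-name lookup; a real pipeline would use a full mapping.
EMOJI_NAMES = {"\U0001F431": "cat_face", "\U0001F63B": "smiling_cat_with_heart-eyes"}

def preprocess(text):
    """Lowercase, spell out known emojis, and drop punctuation except ' _ and -."""
    for symbol, name in EMOJI_NAMES.items():
        text = text.replace(symbol, f" {name} ")
    text = text.lower()
    text = re.sub(r"[^a-z0-9'_\-\s]", " ", text)  # removes '#', '!', '.', etc.
    return re.sub(r"\s+", " ", text).strip()

raw = "Let's paws for a moment to appreciate the majesty of cats \U0001F431 #CatLove"
print(preprocess(raw))
```

Whatever clean-up is used, the same function should be applied to the text that is embedded for documents and to the text used for topic keywords, so that both end up in the same vocabulary.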


## Step 2: Train a Topic Model

We use the `TopicModeler` class defined in `topic_model.py` to train a BERTopic model on the sample posts stored in OpenSearch.


In [4]:
# Step 2: Train a topic model

# Import the TopicModeler class
from src.topic_model import TopicModeler

# Initialize the TopicModeler
topic_modeler = TopicModeler(index_name="post_docs")

# Run the topic model training process
topic_modeler.run()

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:src.topic_model:Output directory already exists: output
INFO:src.topic_model:Retrieving embeddings from OpenSearch.
INFO:opensearch:POST http://localhost:9200/post_docs/_search [status:200 request:0.092s]
INFO:src.topic_model:Retrieved 145 post documents.
INFO:src.topic_model:Preprocessing text.
INFO:src.topic_model:Extracting embeddings.
INFO:src.topic_model:Training BERTopic model.
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/


## Step 3: Explore the Results of the Topic Model

Explore the results of the trained topic model by calculating the coherence score and topic diversity.
We also preview each topic's BERTopic keywords and reference posts.


In [5]:
os.environ['TOKENIZERS_PARALLELISM'] = 'true'
# Step 3: Explore the results of the topic model

from src.topic_model import TopicModeler
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
import pandas as pd
import json

# Load the trained topic model
topic_modeler = TopicModeler(index_name="post_docs")
topic_model = topic_modeler.load_topic_model()

# Load the sample posts
with open("sample_posts.json", "r") as file:
    sample_posts = json.load(file)
docs = [post['post_text'] for post in sample_posts]

# Calculate coherence score
def calculate_coherence_score(topic_model, docs):
    try:
        topics = topic_model.get_topics()
        texts = [doc.split() for doc in docs]
        dictionary = Dictionary(texts)
        topics_words = [[word for word, _ in topic_model.get_topic(topic)] for topic in topics]
        coherence_model = CoherenceModel(topics=topics_words, texts=texts, dictionary=dictionary, coherence="c_v")
        coherence_score = coherence_model.get_coherence()
        return coherence_score
    except Exception as error:
        raise RuntimeError(f"Error calculating coherence score: {error}") from error

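# Calculate topic diversity
# NOTE: topic ids are dictionary keys, so they are always unique and this ratio
# is always 1.0. It does not measure keyword overlap between topics.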
def calculate_topic_diversity(topic_model):
    topics = topic_model.get_topics()
    topic_ids = list(topics.keys())
    topic_diversity = len(set(topic_ids)) / len(topic_ids)
    return topic_diversity

# Calculate coherence score
coherence_score = calculate_coherence_score(topic_model, docs)
print(f"Coherence Score: {coherence_score}")

# Calculate topic diversity
topic_diversity = calculate_topic_diversity(topic_model)
print(f"Topic Diversity: {topic_diversity}")


INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:src.topic_model:Output directory already exists: output
INFO:src.topic_model:Loading BERTopic model.
INFO:src.topic_model:Model loaded from output/bertopic_model
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary<0 unique tokens: []>
INFO:gensim.corpora.dictionary:built Dictionary<1373 unique tokens: ['#CatLove', '#FelineFun', "Let's", 'Their', 'a']...> from 145 documents (total 3137 corpus positions)
INFO:gensim.utils:Dictionary lifecycle event {'msg': 'built Dictionary<1373 unique tokens: [\'#CatLove\', \'#FelineFun\', "Let\'s", \'Their\', \'a\']...> from 145 documents (total 3137 corpus positions)', 'datetime': '2024-09-20T12:52:55.629716', 'gensim': '4.3.3', 'python': '3.12.0 (main, Mar 18 2024, 22:21:23) [Clang 15.0.0 (clang-1500.0.40.1)]', 'platform': 'macOS-14.0-arm64-arm-64bit', 'event'

Coherence Score: 0.4449198280131731
Topic Diversity: 1.0


## Review the topics and their top words

In [6]:
from IPython.display import display, Markdown

text = f"""
For this simplified example, 20 posts were generated for each topic. An additional 30 posts
were generated on random subjects by OpenAI. The prompt was given a set of
keywords for each topic. The goal was to generate a set of diverse and coherent
samples for the purposes of this demonstration. If the model were perfectly
tuned you'd expect to see 30 posts categorized as noise, and 20 posts in each
of the other topics.

Cluster parameters (HDBSCAN):
- min_cluster_size=12  # Clusters need at least this many points to form a distinct group
- min_samples=5
- metric="euclidean"  # The standard backend does not support cosine distance, so we use Euclidean
- cluster_selection_method="eom"
- cluster_selection_epsilon=0.001  # Makes cluster selection more conservative
- prediction_data=True

The topic model correctly identified the topics in the sample posts with a
coherence score of {round(coherence_score, 2)} and a topic diversity of {topic_diversity}.
"""

display(Markdown(text))


For this simplified example, 20 posts were generated for each topic. An additional 30 posts
were generated on random subjects by OpenAI. The prompt was given a set of
keywords for each topic. The goal was to generate a set of diverse and coherent
samples for the purposes of this demonstration. If the model were perfectly
tuned you'd expect to see 30 posts categorized as noise, and 20 posts in each
of the other topics.

Cluster parameters (HDBSCAN):
- min_cluster_size=12  # Clusters need at least this many points to form a distinct group
- min_samples=5
- metric="euclidean"  # The standard backend does not support cosine distance, so we use Euclidean
- cluster_selection_method="eom"
- cluster_selection_epsilon=0.001  # Makes cluster selection more conservative
- prediction_data=True

The topic model correctly identified the topics in the sample posts with a
coherence score of 0.44 and a topic diversity of 1.0.


In [7]:
topic_modeler.topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,OpenAI,Representative_Docs
0,-1,2,-1_characters_universe_therapy_walk,"[characters, universe, therapy, walk, means, marvel, leaf_fluttering_in_wind, grown, crown, creators]","[marvel, therapy, comic, leaf_fluttering_in_wind, nature, walk, money, grown, creators, characters]",[Marvel Therapy Walk],
1,0,30,0_cats_cat_love_truly,"[cats, cat, love, truly, cat_face, eyes, paw_prints, smiling_cat_with_heart, grinning_cat, glowing_star]","[smiling_cat_with_heart, grinning_cat_with_smiling_eyes, catlove, cat, cats, cat_face, grinning_cat, feline, playfulfelines, smiling_face_with_smiling_eyes]",[Feline Love Bond],
2,1,25,1_high_speed_rail_highspeedrail,"[high, speed, rail, highspeedrail, project, california, future, transportation, progress, speed_trainbridge_at_night]","[highspeedrail, rail, speed_trainbridge_at_night, speed_trainrailway_track, transportation, infrastructure, speed_train, transportationit, railway_tracktrain, speed]",[Future Rail Connectivity],
3,2,24,2_music_playlist_song_self,"[music, playlist, song, self, time, headphone, share, let, world, feel]","[musical_notessparkling_heart, musical_note, musical_notes, music, musical_notesmiling_face_with_smiling_eyes, rainbowmusical_note, milky_waymusical_note, musical, melodies, musicrecommendation]",[Musical Discovery Journey],
4,3,23,3_water_open_water_wave_swim,"[water, open, water_wave, swim, man_swimming, challenge, nature, sunrise, alcatraz, sports_medal]","[man_swimming, water_wave, swim, swimming, waters, swimmers, surfing, dive, openwater, tides]",[Open Water Swimmer],
5,4,21,4_let_community_equality_justice,"[let, community, equality, justice, change, solidarity, vote, diversity, activism, africa]","[solidarity, activism, injustices, unite, community, africa, social, socialactivism, kindness, communities]",[Social Justice Activism],
6,5,20,5_fog_city_francisco_san,"[fog, city, francisco, san, mist, embrace, foggy, like, misty, karl]","[fog, foggy, francisco, mist, mistymornings, waterfront, night_with_starssparkles, san, misty, air]",[Foggy City Vibes],


In [8]:
topics = topic_modeler.topic_model.get_topics()
topics

{-1: [['characters', 0.37347458805864675],
  ['universe', 0.37347458805864675],
  ['therapy', 0.37347458805864675],
  ['walk', 0.37347458805864675],
  ['means', 0.37347458805864675],
  ['marvel', 0.37347458805864675],
  ['leaf_fluttering_in_wind', 0.37347458805864675],
  ['grown', 0.37347458805864675],
  ['crown', 0.37347458805864675],
  ['creators', 0.37347458805864675]],
 0: [['cats', 0.10069224113528003],
  ['cat', 0.09413007090890406],
  ['love', 0.07134464228587827],
  ['truly', 0.056600751203848464],
  ['cat_face', 0.056600751203848464],
  ['eyes', 0.056600751203848464],
  ['paw_prints', 0.056600751203848464],
  ['smiling_cat_with_heart', 0.04776083327986644],
  ['grinning_cat', 0.04776083327986644],
  ['glowing_star', 0.04528060096307877]],
 1: [['high', 0.1827207227593195],
  ['speed', 0.1322705024807515],
  ['rail', 0.1322705024807515],
  ['highspeedrail', 0.1228812552914524],
  ['project', 0.11295801710357188],
  ['california', 0.09689003850077074],
  ['future', 0.06817809739

HDBSCAN is a clustering algorithm that does not force every document into a
cluster. Instead, it allows for outliers. Topic id `-1` therefore collects the
documents that did not fit well into any cluster.

We could spend time tuning the model's hyperparameters. However, we're not using
this model to classify posts. Our goal is to identify topics and associate those
topics with keywords and embeddings. So, in evaluating how well the model
performs, we want to focus on the keywords. These words define the 'topic
embedding', which is then used to retrieve posts on the given topic.

Notes:
- The coherence score measures how interpretable the topics are; a higher score indicates more coherent topics. The score here (0.44) is relatively low, which suggests that the topics may not be well defined. One likely contributor is that the score above is computed on raw, whitespace-split post text, while the topic keywords come from pre-processed text (lowercased, with emojis spelled out), so some keywords never match a token. We're going to see what happens when we search for posts related to a specific topic anyway.
- Topic diversity is meant to measure how distinct the topics are from each other; a higher value indicates less overlap. Here it is `1.0`, but `calculate_topic_diversity` divides the number of unique topic ids by the number of topic ids, which is always `1.0`, so it should not be read as evidence of 'perfect' separation. A more informative variant would measure the share of unique words among the topics' top keywords. Because we generated these documents per topic, a real-world dataset would be expected to score lower.
- If you are planning to use a model like this in production, you should pay
close attention to the algorithm that identifies the keywords and potentially
write your own custom algorithm. c-TF-IDF has a tendency to score words that are
"rare" in the corpus more highly. This can result in these words being returned
as keywords for the topic even though they appear in few documents. One
way to handle this is to set the `min_df` and `max_df` parameters of the
CountVectorizer model that is passed to the c-TF-IDF model.
- It is often useful to have human-readable labels. So, as part of the topic
model pipeline, this model calls OpenAI to generate a 3-word summary of the
topic. The prompt passes the set of keywords identified by c-TF-IDF and a set of
representative posts. The result is a human-readable label.
- The AI-generated labels depend strongly on the keywords associated with each
  topic. Notice that OpenAI tries to label the set of random posts assigned to
  `-1` based on the keywords associated with this outlier topic. That does not
  mean the topic is a coherent cluster, so this topic would normally be
  ignored.
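
A common definition of topic diversity is the share of unique words among the top-k keywords of all topics. The sketch below assumes the `{topic_id: [(word, score), ...]}` shape returned by `get_topics()` above. `keyword_topic_diversity` is a hypothetical helper, not part of this project or of BERTopic, and the sample data is made up.

```python
def keyword_topic_diversity(topics, top_k=10, skip_outliers=True):
    """Fraction of unique words among the top_k keywords of every topic."""
    words = []
    for topic_id, keywords in topics.items():
        if skip_outliers and topic_id == -1:
            continue  # -1 is the outlier bucket, not a real topic
        words.extend(word for word, _ in keywords[:top_k])
    return len(set(words)) / len(words) if words else 0.0

toy_topics = {
    -1: [("misc", 0.3), ("stuff", 0.3)],
    0: [("cats", 0.10), ("cat", 0.09), ("love", 0.07)],
    1: [("rail", 0.13), ("speed", 0.13), ("love", 0.05)],
}
diversity = keyword_topic_diversity(toy_topics, top_k=3)
print(diversity)
```

In the toy data the word "love" appears in two topics, so five of the six keywords are unique and the score drops below 1.0.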


## Step 4: Convert Keywords to Topic Embeddings and Search OpenSearch

In this step, we convert the keywords from the BERTopic model into embeddings using the same embedding model (`all-MiniLM-L6-v2`). 
We then use these embeddings to search OpenSearch for matching posts.
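
The `Searcher` class is not shown in this notebook. As a rough sketch of what a k-NN request against the `doc_embedding` field could look like, the snippet below builds a query body in OpenSearch's k-NN query format. `build_knn_query` is a hypothetical helper, and the three-element vector is a placeholder for a real 384-dimensional embedding.

```python
def build_knn_query(embedding, top_k=20, field="doc_embedding"):
    """Build an approximate k-NN search body for a single query vector."""
    return {
        "size": top_k,
        "_source": {"excludes": [field]},  # skip returning the stored vectors
        "query": {
            "knn": {
                field: {"vector": [float(x) for x in embedding], "k": top_k}
            }
        },
    }

query = build_knn_query([0.1, -0.2, 0.3], top_k=5)
# With a live client: opensearch_client.search(index="post_docs", body=query)
```

Casting each value to `float` matters when the embedding is a NumPy array, because the request body has to be JSON-serializable.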


In [9]:
keywords = topic_model.get_topic(-1)
text = " ".join([word for word, _ in keywords])
text

'characters universe therapy walk means marvel leaf_fluttering_in_wind grown crown creators'

In [10]:
# Function to convert keywords to embeddings using the SentenceTransformer model
def convert_keywords_to_embeddings(keywords, embedding_model):
    text = " ".join([word for word, _ in keywords])
    return embedding_model.encode(text, convert_to_numpy=True)

# Convert all topics' keywords to embeddings
topic_embeddings = {topic_id: convert_keywords_to_embeddings(topic_model.get_topic(topic_id), topic_modeler.embedding_model) for topic_id in topics}


In [11]:

# Step 4: Convert Keywords to Topic Embeddings and Search OpenSearch

from src.search import Searcher
import numpy as np

# Initialize the Searcher class
searcher = Searcher(index_name="post_docs")

# Search OpenSearch for posts matching each topic embedding
search_results = {}
for topic_id, embedding in topic_embeddings.items():
    # Search for similar posts
    # Limit the search to the same number of posts generated for each topic
    # Compare the top 20 similar documents to posts assigned to the topic by the
    # topic model.
    top_k = 20
    results = searcher.search_similar_documents(embedding, top_k=top_k)
    for result in results:
        # Add the topic id and its keywords to each result for visualization
        result['topic_id'] = topic_id
        result['keywords'] = [word for word, _ in topic_model.get_topic(topic_id)]
    search_results[topic_id] = results

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:src.search:Searching for similar documents in OpenSearch.
INFO:opensearch:POST http://localhost:9200/post_docs/_search [status:200 request:0.029s]
INFO:src.search:Found 20 similar documents.
INFO:src.search:Searching for similar documents in OpenSearch.
INFO:opensearch:POST http://localhost:9200/post_docs/_search [status:200 request:0.014s]
INFO:src.search:Found 20 similar documents.
INFO:src.search:Searching for similar documents in OpenSearch.
INFO:opensearch:POST http://localhost:9200/post_docs/_search [status:200 request:0.017s]
INFO:src.search:Found 20 similar documents.
INFO:src.search:Searching for similar documents in OpenSearch.
INFO:opensearch:POST http://localhost:9200/post_docs/_search [status:200 request:0.015s]
INFO:src.search:Found 20 similar documents.
INFO:src.search:Searching for similar d


## Step 5: Evaluate the Search Results

This step evaluates the performance of the search by comparing the posts retrieved from OpenSearch using topic embeddings with the posts assigned to each topic by the BERTopic model.
We calculate the number of matches and the match ratio to assess how well the search results align with the topic assignments.


In [12]:
topic_assignments = pd.read_csv("./output/topic_assignments.csv")
topic_assignments.head(3)

Unnamed: 0,post_id,text,topic_id
0,185bb492-e993-4f69-9b88-e55b59da7567,let's paws for a moment to appreciate the majesty of cats cat_face their grace and agility never fail to amaze me smiling_cat_with_heart-eyes catlove felinefun,0
1,25df52d2-1c88-4bf5-9330-57a8ec70252e,did you know that cats spend about 70 of their lives sleeping grinning_cat that's the dream life person_in_bed catnap lazycat,0
2,ea28e66b-66bb-438e-90d1-935a694c8309,whiskers are not just cute accessories for cats they are essential tools for their sensory perception cat let's hear it for whisker power paw_prints catfacts,0


In [13]:
# Group by 'topic_id' and collect the 'post_id' values for each topic
grouped = topic_assignments.groupby('topic_id')['post_id'].apply(list)
post_assignments = dict(zip(grouped.index, grouped))
post_assignments[-1]

['7b8d7e06-0a25-455a-846b-ba10e6bfe3e0',
 '98465010-0583-4d93-b071-89a0f8d1a21b']

In [14]:

# Step 5: Evaluate the Search Results

# Function to evaluate search results based on topic assignments
def evaluate_search_results(search_results, post_assignments, topic_ids):
    # Build the result inside the function so repeated calls do not share state
    evaluation = {}
    for topic_id in topic_ids:  # Evaluate every topic, including the outlier topic (-1)
        topic_search_result = search_results.get(topic_id, [])

        # Extract post ids returned in search results for each topic
        search_post_ids = [result['post_id'] for result in topic_search_result]

        # Extract post_ids assigned to each topic
        assigned_post_ids = post_assignments.get(topic_id, [])

        # Compare retrieved documents with actual topic assignments
        match_count = sum(1 for post in search_post_ids if post in assigned_post_ids)
        evaluation[topic_id] = {
            'retrieved_count': len(search_post_ids),
            'assigned_count': len(assigned_post_ids),
            'matches': match_count,
            'match_ratio': match_count / len(search_post_ids) if search_post_ids else 0
        }
    return evaluation

# Evaluate how well the search results match the posts assigned to each topic
evaluation_results = evaluate_search_results(search_results, post_assignments, topics)
evaluation_df = pd.DataFrame(evaluation_results).T


In [15]:
evaluation_df

topic_id,retrieved_count,assigned_count,matches,match_ratio
-1,20.0,2.0,2.0,0.1
0,20.0,30.0,19.0,0.95
1,20.0,25.0,19.0,0.95
2,20.0,24.0,19.0,0.95
3,20.0,23.0,18.0,0.9
4,20.0,21.0,18.0,0.9
5,20.0,20.0,19.0,0.95


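The match ratio is the share of the posts returned by the search for a topic that the topic model had also assigned to that topic (matches divided by retrieved posts). Only 2 posts were assigned to the outlier topic, so its ratio cannot exceed 2/20 = 0.1.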
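
Because `match_ratio` divides by the number of retrieved posts, it behaves like precision at k. A recall-style figure divides by the number of assigned posts instead. The sketch below computes both on made-up IDs; `precision_recall` is a hypothetical helper.

```python
def precision_recall(retrieved_ids, assigned_ids):
    """Return (precision, recall) of retrieved IDs against assigned IDs."""
    retrieved, assigned = set(retrieved_ids), set(assigned_ids)
    matches = len(retrieved & assigned)
    precision = matches / len(retrieved) if retrieved else 0.0
    recall = matches / len(assigned) if assigned else 0.0
    return precision, recall

# Mirrors the outlier row above: 20 retrieved, 2 assigned, both assigned posts found.
retrieved = [f"post_{i}" for i in range(20)]
assigned = ["post_3", "post_7"]
print(precision_recall(retrieved, assigned))  # (0.1, 1.0)
```

Reporting both numbers makes it clear when a low ratio is caused by a small topic rather than by poor retrieval.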

In [16]:
round(evaluation_df['match_ratio'].mean(), 2)

0.81

The mean match ratio of 0.81 includes the outlier topic; excluding it, the mean
is about 0.93.

Now let's review the search results returned for each topic embedding
qualitatively. Do they align with the topics originally used to generate the posts?

In [17]:
from src.generate_posts import topics as generated_topics

In [18]:
generated_topics = list(generated_topics.keys())
generated_topics

['Cats, cats, cats',
 'Music recommendations',
 'Social Activism',
 'San Francisco Fog',
 'California High Speed Rail',
 'Open Water Swimming']

To get a feel for the search results, let's display:
- The top five search results when searching for documents using the topic
  embedding.
- The mean search similarity score for the documents returned.
- The descriptive AI label assigned to each topic by OpenAI as part of the BERTopic pipeline.

In [19]:
# Display the top search results returned for each topic embedding.
topic_ids = list(topics.keys())
for topic_id in topic_ids[1:]:  # Skip the outlier cluster (-1), which is listed first
    topic_model_ai_label = topic_modeler.topic_model.get_topic_info(topic_id)['OpenAI'].values[0]
    print(f"Topic AI Label: {topic_model_ai_label}")
    print(f"Search Results for Topic Embedding: {topic_id}")

    mean_similarity_search_score = pd.DataFrame(search_results[topic_id])['score'].mean()
    # Adjust the mean search score to be between -1 and 1.
    # Cosine similarity returns a number between -1 and 1, but because OpenSearch relevance scores can't be below 0, the k-NN plugin adds 1 to get the final score.
    print(f"Mean Similarity Score: {round(mean_similarity_search_score - 1, 2)}")

    # Show the first five results with scores rescaled back to cosine similarity
    df = pd.DataFrame(search_results[topic_id])[['score','post_text']].head(5)
    df['score'] = df['score'].apply(lambda x: round(x - 1, 2))
    display(df)
    print("\n")

Topic AI Label: ['Feline Love Bond']
Search Results for Topic Embedding: 0
Mean Similarity Score: 0.63


Unnamed: 0,score,post_text
0,0.78,Let's paws for a moment to appreciate the majesty of cats 🐱 Their grace and agility never fail to amaze me! 😻 #CatLove #FelineFun
1,0.77,"The bond between a cat and its human is truly special and unique 🌟 It's a relationship built on trust, love, and mutual understanding 😻 #CatHumanBond"
2,0.75,Fluffy cats are like living clouds of softness and love 💕 Who can resist their charm and irresistible cuddles? 😻 #FluffyLove
3,0.75,"Every cat has its own unique purr-sonality 😺 Some are adventurous, others are cuddly, but all are special in their own way 🌟 #CatPurrsonality"
4,0.73,The way cats effortlessly navigate their surroundings with grace and agility is truly mesmerizing 🐾 They are the epitome of elegance in motion! 😻 #GracefulCats




Topic AI Label: ['Future Rail Connectivity']
Search Results for Topic Embedding: 1
Mean Similarity Score: 0.7


Unnamed: 0,score,post_text
0,0.87,Excited to see the progress made on the High Speed Rail project! This innovative infrastructure will redefine how we travel across California. 🚄🌉 #HighSpeedRail
1,0.83,California's commitment to the High Speed Rail project demonstrates bold leadership in advancing modern transportation solutions. Let's keep the momentum going! 🚄🌉 #HighSpeedRail
2,0.82,"Exciting news for California! The High Speed Rail project is making great progress, connecting major cities like never before. 🚄🌉 #HighSpeedRail #Infrastructure"
3,0.81,"As challenges arise, so does the determination to see the High Speed Rail project through to completion. Together, we can build a better future for California's transportation. 🚄🌉 #HighSpeedRail"
4,0.8,"Construction of the High Speed Rail is underway, shaping the future of public transportation in California. Stay tuned for updates on this transformative project! 🚄🚧 #HighSpeedRail"




Topic AI Label: ['Musical Discovery Journey']
Search Results for Topic Embedding: 2
Mean Similarity Score: 0.43


Unnamed: 0,score,post_text
0,0.62,🎧 Let the music be your guide as you embark on a journey of self-discovery and emotional exploration through the power of melodies and lyrics. 🌌🎵
1,0.58,🎧 Dive into a world of soulful melodies with this Must-Listen album that will uplift your spirits and soothe your soul. 🎵 #MusicRecommendation
2,0.56,🎧 Unwind after a long day with this chill playlist that will transport you to a state of relaxation and tranquility. 🌿🎵
3,0.52,⭐ Explore a hidden gem in the world of music with this Top Pick recommendation that deserves to be heard by music enthusiasts everywhere. 🎧🎶
4,0.52,👌 Need a mood booster? Look no further than this feel-good playlist that will brighten even the gloomiest of days. 🌈🎵




Topic AI Label: ['Open Water Swimmer']
Search Results for Topic Embedding: 3
Mean Similarity Score: 0.54


Unnamed: 0,score,post_text
0,0.86,🏅 Conquer the iconic Alcatraz swim and write your name in the history of open water endurance challenges. 🌊 #Alcatraz
1,0.81,"🏅 Alcatraz awaits those brave enough to swim its open waters, a true test of endurance and determination. 🌊 #Alcatraz"
2,0.72,🌊 Dive into the open water and let the waves carry you to new adventures! 🏊‍♂️ #OpenWater #Swim
3,0.71,"🏊‍♂️ Triathletes thrive in the open water, combining swimming with cycling and running for the ultimate challenge. 🏅 #Triathlon"
4,0.63,"🚩 Buoy markers guide the way, marking the path for swimmers braving the open water challenge. 🏅 #Endurance"




Topic AI Label: ['Social Justice Activism']
Search Results for Topic Embedding: 4
Mean Similarity Score: 0.42


Unnamed: 0,score,post_text
0,0.56,🤝 Building a strong community starts with understanding and respecting each other's differences. #Community #Diversity
1,0.53,🌈 Transgender rights are human rights. Let's support and uplift our transgender community. #TransGenderRights #Equality
2,0.53,🚫 Say no to discrimination in all its forms. Embrace diversity and celebrate uniqueness. #NoHate #Diversity
3,0.52,✊🏿 Black lives matter. Let's work towards ending systemic racism and inequality. #BlackLivesMatter #Equality
4,0.51,🌍 Let's join hands in solidarity to create a better world for all. #SocialActivism #Change #Solidarity




Topic AI Label: ['Foggy City Vibes']
Search Results for Topic Embedding: 5
Mean Similarity Score: 0.63


Unnamed: 0,score,post_text
0,0.74,"When the fog rolls in, it's like the city takes on a whole new persona. 🌁🎭 San Francisco becomes a stage where mist and light dance in harmony. #FoggyMagic"
1,0.73,Get ready to cozy up in San Francisco's iconic fog blanket! 🌁 Embrace the chilly embrace of Karl the Fog as he weaves his misty magic over the Golden Gate. 🌫️ #SFWeather
2,0.72,San Francisco's fog is a reminder that beauty can be found even in the gloomiest of days. 🌧️🌁 Let's appreciate the artistry of Karl the Fog as he paints the city in shades of gray. #BeautyInFog
3,0.72,"When the fog blankets the city, it's like a veil of anonymity that allows San Francisco to reinvent itself with each passing day. 🌁💭 Embrace the ever-changing nature of the city under Karl the Fog's watchful eye. #Reinvention"
4,0.72,"In the embrace of the fog, San Francisco takes on a timeless quality. 🌁⏳ Let's savor the moment and appreciate the ephemeral beauty of a misty day in the city. #TimelessSF"






## Conclusion

- The topic model was able to identify the topics in the sample posts. Given the size of the dataset, this is not a strong model: many of the randomly generated posts were assigned to a topic rather than flagged as outliers, which adds noise to the topics. However, it performs well enough to identify keywords and generate embeddings based on those keywords for each topic.
- The search results for each topic embedding were evaluated by counting how many of the posts returned by the search query were also assigned to that topic by the topic model. The search retrieved posts that matched the assigned topics with an average match ratio above 0.90 (not including the outlier topic).

- When configuring the pipeline for the topic model, consider the following:
    - Pre-process posts so that the unstructured text passed to the embedding model is handled consistently across documents. This may mean using headings, sub-headings and other contextual data to chunk your documents, or text processing such as removing numbers, converting emojis to text, image capturing, handling URLs, etc.
    - Keywords and representative posts matter more than overall fit when evaluating the topic model. To perform well in search, the topic embedding must represent the centroid of the topic, not the overall word distribution. The more informative the keywords are, the more precise the search results. Generally, you want to use the smallest set of keywords that represents the topic well. This may mean writing a custom algorithm to select representative features and generate an embedding.
    - When handling the representation step in the pipeline, consider pruning very common and very infrequent words so that they do not skew the labeling of the topic. If it would improve the topic model, you may want to do this at an earlier stage in the pipeline. However, if you do it earlier, remember that those words (which may become more frequent over time) will no longer be incorporated in the document embedding used for search.
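
To make the pruning idea concrete, the sketch below filters a vocabulary by document frequency, which is the effect that scikit-learn's `CountVectorizer` parameters `min_df` and `max_df` have. It is a pure-Python illustration with made-up documents, and `prune_vocabulary` is a hypothetical helper, not part of the project.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=2, max_df=0.8):
    """Keep words found in at least min_df docs and in at most max_df (a fraction) of docs."""
    n_docs = len(tokenized_docs)
    # Count each word once per document to get its document frequency.
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return sorted(
        word for word, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

docs = [
    ["cats", "love", "naps"],
    ["cats", "love", "fog"],
    ["cats", "rail", "fog"],
    ["cats", "rail", "alcatraz"],
]
print(prune_vocabulary(docs))  # ['fog', 'love', 'rail']
```

Here "cats" is dropped for appearing in every document, and "naps" and "alcatraz" are dropped for appearing in only one.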

## Teardown

In [20]:
from opensearchpy import OpenSearch 
from src.source import delete_all_documents
from src.index import delete_index

# Initialize OpenSearch client
opensearch_client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}], http_compress=True
)

# Delete any existing documents 
delete_all_documents(opensearch_client)

# Delete the index
delete_index(opensearch_client)

INFO:opensearch:POST http://localhost:9200/post_docs/_delete_by_query [status:200 request:0.040s]
INFO:opensearch:HEAD http://localhost:9200/post_docs [status:200 request:0.003s]
INFO:opensearch:DELETE http://localhost:9200/post_docs [status:200 request:0.051s]


Deleted 145 documents from index 'post_docs'.
Index 'post_docs' deleted successfully.
