# Topic Modeling for Metadata in RAG

**Authors**:
- Novan Parmonangan Simanjuntak (novan.p.simanjuntak@gdplabs.id)
- Surya Mahadi (made.r.s.mahadi@gdplabs.id)

**Reviewers**:



## References
[1] [GLAIR - Evaluate Relevance Between Retrieved Contexts and Ground Truth Contexts](https://docs.glair.ai/generative-internal/modules/evaluator/cookbook/retrieval-evaluator/retrieval-evaluation-methods/evaluate-relevance-between-retrieved-contexts-and-ground-truth-contexts)

[2] [LlamaIndex - Reciprocal Rerank Fusion Retriever](https://docs.llamaindex.ai/en/stable/examples/retrievers/reciprocal_rerank_fusion.html)

[3] [BERTopic](https://maartengr.github.io/BERTopic/index.html)

# Prepare Environment

Before we start, ensure you have a GitHub account with access to the GDP Labs GenAI SDK GitHub repository. Then, follow these steps to create a personal access token:
1. Log in to your [GitHub](https://github.com/) account.
2. Navigate to the [Personal Access Tokens](https://github.com/settings/tokens) page.
3. Select the `Generate new token` option. You can use the classic version instead of the beta version.
4. Fill in the required information, ensuring that you've checked the `repo` option to grant access to private repositories.
5. Save the newly generated token.

In [None]:
import getpass
import subprocess
import sys

def install_sdk_library() -> None:
    """Installs the `gdplabs_gen_ai` library from a private GitHub repository using a Personal Access Token.

    This function prompts the user to input their Personal Access Token for GitHub authentication. It then constructs
    the repository URL with the provided token and executes a subprocess to install the library via pip from the
    specified repository.

    Raises:
        subprocess.CalledProcessError: If the installation process returns a non-zero exit code.

    Note:
        The function utilizes `getpass.getpass()` to securely receive the Personal Access Token without echoing it.
    """
    token = getpass.getpass("Input Your Personal Access Token: ")
    repo_url_with_token = f"https://{token}@github.com/GDP-ADMIN/gen-ai-internal.git"
    cmd = ["pip", "install", f"gdplabs_gen_ai[eval] @ git+{repo_url_with_token}"]

    try:
        with subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                              text=True, bufsize=1, universal_newlines=True) as process:
            for line in process.stdout:
                sys.stdout.write(line)

            process.wait()  # Wait for the process to complete.
            if process.returncode != 0:
                raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
    except Exception as e:
        print(f"An error occurred: {e}.")

install_sdk_library()

Input Your Personal Access Token: ··········
Collecting gdplabs_gen_ai[eval]@ git+https://****@github.com/GDP-ADMIN/gen-ai-internal.git
  Cloning https://****@github.com/GDP-ADMIN/gen-ai-internal.git to /tmp/pip-install-c480lx3n/gdplabs-gen-ai_5c37e11798b64be186082f4b8d3d9c76
  Running command git clone --filter=blob:none --quiet 'https://****@github.com/GDP-ADMIN/gen-ai-internal.git' /tmp/pip-install-c480lx3n/gdplabs-gen-ai_5c37e11798b64be186082f4b8d3d9c76
  Resolved https://****@github.com/GDP-ADMIN/gen-ai-internal.git to commit 42b35c68bfbea218313d7234ad164138bc167dea
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting chromadb==0.4.3 (from gdplabs_gen_ai[eval]@ g

Install the other required dependencies.

In [None]:
!pip install bertopic rank-bm25 -q



# Configuration
Next, we choose the embedding models used for both topic modeling and semantic search. This notebook already defines embedding models for English and Indonesian tasks. You can modify this configuration to suit your needs, or check the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for the current state-of-the-art embedding models.

In [None]:
CONFIGURATION = {
    "english": {
        "TOPIC_MODELING_EMBEDDING_MODEL": "BAAI/bge-small-en-v1.5",
        "VECTOR_DATABASE_EMBEDDING_MODEL": "BAAI/bge-small-en-v1.5"
    },
    "indonesia": {
        "TOPIC_MODELING_EMBEDDING_MODEL": "firqaaa/indo-sentence-bert-base",
        "VECTOR_DATABASE_EMBEDDING_MODEL": "firqaaa/indo-sentence-bert-base"
    }
}

LANGUAGE = "english"

# Prepare Datasets
Next we need to prepare our datasets. In this notebook we will use ...

In [None]:
from beir.datasets.data_loader import GenericDataLoader
from beir import util

import logging
import pathlib, os

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = "/content/datasets"
data_path = util.download_and_unzip(url, out_dir)

corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")


In [None]:
def show(index):
  query_id = list(queries.keys())[index]
  query = queries[query_id]
  print("Query:", query)
  qrs = qrels[query_id]

  print("Relevant Corpus:")
  for corpus_id in qrs.keys():
    print("Title:", corpus[corpus_id]["title"])
    print("Text:", corpus[corpus_id]["text"])
    print()

In [None]:
show(3)

Query: 5% of perinatal mortality is due to low birth weight.
Relevant Corpus:
Title: Estimates of global prevalence of childhood underweight in 1990 and 2015.
Text: CONTEXT One key target of the United Nations Millennium Development goals is to reduce the prevalence of underweight among children younger than 5 years by half between 1990 and 2015. OBJECTIVE To estimate trends in childhood underweight by geographic regions of the world. DESIGN, SETTING, AND PARTICIPANTS Time series study of prevalence of underweight, defined as weight 2 SDs below the mean weight for age of the National Center for Health Statistics and World Health Organization (WHO) reference population. National prevalence rates derived from the WHO Global Database on Child Growth and Malnutrition, which includes data on approximately 31 million children younger than 5 years who participated in 419 national nutritional surveys in 139 countries from 1965 through 2002. MAIN OUTCOME MEASURES Linear mixed-effects modeling w

In [None]:
datasets = []
for corpus_id in corpus.keys():
  datasets.append(corpus[corpus_id]["text"])

## Define Evaluation Function
In this section, we define an evaluation function using BEIR, following our [documentation](https://docs.glair.ai/generative-internal/modules/evaluator/cookbook/retrieval-evaluator/retrieval-evaluation-methods/evaluate-relevance-between-retrieved-contexts-and-ground-truth-contexts).

In [None]:
from typing import List

from beir.retrieval.evaluation import EvaluateRetrieval
from llama_index.retrievers import BaseRetriever


def evaluate(retriever: BaseRetriever, k_values: List[int] = [1, 3, 5, 10]):
  results = {}
  for query_id in queries.keys():
    query = queries[query_id]

    answers = retriever.retrieve(query)
    result = {}
    for answer in answers:
      corpus_id = answer.metadata["id"]
      result[corpus_id] = answer.score

    results[query_id] = result

  ndcg, map_score, recall, precision = EvaluateRetrieval.evaluate(
    qrels, results, k_values
  )

  print(f"NDCG: {ndcg}")
  print(f"MAP: {map_score}")
  print(f"Recall: {recall}")
  print(f"Precision: {precision}")
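
To make the reported numbers easier to interpret, here is a minimal sketch of how Recall@k and Precision@k can be computed for a single query. The helper names (`top_k_ids`, `recall_at_k`, `precision_at_k`) and the toy data are hypothetical; BEIR's own `EvaluateRetrieval.evaluate` averages these values over all queries and also reports NDCG and MAP, which this sketch does not cover.

```python
from typing import Dict, List


def top_k_ids(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k document ids with the highest retrieval scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def recall_at_k(relevant: Dict[str, int], scores: Dict[str, float], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    relevant_ids = {doc_id for doc_id, label in relevant.items() if label > 0}
    if not relevant_ids:
        return 0.0
    hits = relevant_ids.intersection(top_k_ids(scores, k))
    return len(hits) / len(relevant_ids)


def precision_at_k(relevant: Dict[str, int], scores: Dict[str, float], k: int) -> float:
    """Fraction of the top k slots that are filled by relevant documents."""
    relevant_ids = {doc_id for doc_id, label in relevant.items() if label > 0}
    hits = relevant_ids.intersection(top_k_ids(scores, k))
    return len(hits) / k


# Toy example (made-up ids and scores): two relevant documents, four retrieved.
toy_qrels = {"d1": 1, "d4": 1}
toy_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.1}

for k in (1, 3, 4):
    print(k, recall_at_k(toy_qrels, toy_scores, k), precision_at_k(toy_qrels, toy_scores, k))
```

The `qrels` and `results` dictionaries passed to BEIR have the same shape as the toy data, nested one level deeper under each query id.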


# Extract Metadata

## Extract Topics
Next, we perform topic modeling to extract topics from our dataset. These topics will be used as metadata for document retrieval in the next step.

In [None]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer(CONFIGURATION[LANGUAGE]["TOPIC_MODELING_EMBEDDING_MODEL"])
embeddings = embedding_model.encode(datasets)


Here we define the hyperparameters for topic modeling.

In [None]:
from hdbscan import HDBSCAN
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired

vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=100, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
keybert_model = KeyBERTInspired()

In [None]:
from bertopic import BERTopic

representation_model = {
    "KeyBERT": keybert_model
}

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=10,
  verbose=False
)

Run the topic modeling.

In [None]:
topics, scores = topic_model.fit_transform(datasets, embeddings)

Let's see which topics the model extracted.

**Note: Topic `-1` means outlier, which we can ignore**

In [None]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | KeyBERT | Representative_Docs |
|---|---|---|---|---|---|---|
| 0 | -1 | 1811 | `-1_patients_cancer_cells_treatment` | [patients, cancer, cells, treatment, results, ... | [patients, mortality, treatment, clinical, res... | [Background CD19‐specific chimeric antigen rec... |
| 1 | 0 | 2105 | `0_cells_cell_dna_expression` | [cells, cell, dna, expression, protein, gene, ... | [chromatin, stem cells, histone, embryonic, ge... | [DNA damage encountered by DNA replication for... |
| 2 | 1 | 658 | `1_risk_patients_health_95` | [risk, patients, health, 95, years, study, wom... | [diabetes, risk factors, coronary, hypertensio... | [IMPORTANCE Understanding the major health pro... |
| 3 | 2 | 440 | `2_cells_cell_il_immune` | [cells, cell, il, immune, mice, cd4, antigen, ... | [interleukin, cytokine, cytokines, inflammasom... | [Inflammasome-mediated IL-1beta production is ... |
| 4 | 3 | 169 | `3_mice_adipose_adipose tissue_insulin` | [mice, adipose, adipose tissue, insulin, tissu... | [adipocytes, adipose tissue, adipose, insulin ... | [Adipose tissue macrophages (ATMs) infiltrate ... |
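
As a quick sanity check on the clustering, the topic counts from the output above can be used to see how much of the corpus ended up in the outlier topic. This is a small sketch: the counts are copied by hand from the topic overview, and the variable names are only illustrative.

```python
# Document counts per topic id, copied from the topic overview above.
topic_counts = {-1: 1811, 0: 2105, 1: 658, 2: 440, 3: 169}

total_documents = sum(topic_counts.values())
outlier_share = topic_counts[-1] / total_documents

print(f"{total_documents} documents, {outlier_share:.1%} assigned to the outlier topic")
```

A large outlier share is worth keeping in mind when tuning `min_cluster_size`, since it controls how many documents HDBSCAN leaves unclustered.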


Extract the topics of a single document.

In [None]:
document_index = 5
print("Document:")
print(datasets[document_index])
print("Topics:")
topic_model.get_topic(topics[document_index], full=True)['KeyBERT']

Document:
Glioblastomas are deadly cancers that display a functional cellular hierarchy maintained by self-renewing glioblastoma stem cells (GSCs). GSCs are regulated by molecular pathways distinct from the bulk tumor that may be useful therapeutic targets. We determined that A20 (TNFAIP3), a regulator of cell survival and the NF-kappaB pathway, is overexpressed in GSCs relative to non-stem glioblastoma cells at both the mRNA and protein levels. To determine the functional significance of A20 in GSCs, we targeted A20 expression with lentiviral-mediated delivery of short hairpin RNA (shRNA). Inhibiting A20 expression decreased GSC growth and survival through mechanisms associated with decreased cell-cycle progression and decreased phosphorylation of p65/RelA. Elevated levels of A20 in GSCs contributed to apoptotic resistance: GSCs were less susceptible to TNFalpha-induced cell death than matched non-stem glioma cells, but A20 knockdown sensitized GSCs to TNFalpha-mediated apoptosis. The

[('chromatin', 0.75457567),
 ('stem cells', 0.7359276),
 ('histone', 0.7301235),
 ('embryonic', 0.7173904),
 ('genome', 0.7101828),
 ('gene expression', 0.7081264),
 ('methylation', 0.6999913),
 ('genes', 0.6978488),
 ('cells', 0.69671434),
 ('kinase', 0.69142395)]

# Create Vector Database
In this notebook, we use the LlamaIndex [SimpleVectorStore](https://docs.llamaindex.ai/en/stable/examples/vector_stores/SimpleIndexDemo.html) as the vector database. You can swap it for another store; see the [LlamaIndex Vector Store Documentation](https://docs.llamaindex.ai/en/stable/module_guides/storing/customization.html).

In [None]:
from llama_index.schema import Document


documents_with_metadata = []
documents = []
documents_keywords = []
for text, topic_id, corpus_id in zip(datasets, topics, corpus.keys()):
  topic = topic_model.get_topic(topic_id, full=True)['KeyBERT']
  topic_names = [t[0] for t in topic]
  document_with_metadata = Document(text=text, metadata={"Topics": ", ".join(topic_names), "id": corpus_id, "topic_search": topic_names}, excluded_embed_metadata_keys=["id", "topic_search"])
  document = Document(text=text, metadata={"id": corpus_id}, excluded_embed_metadata_keys=["id"])
  document_keyword = Document(text="", metadata={"Topics": ", ".join(topic_names), "id": corpus_id, "topic_search": topic_names}, excluded_embed_metadata_keys=["id", "topic_search"])

  documents_with_metadata.append(document_with_metadata)
  documents.append(document)
  documents_keywords.append(document_keyword)

In [None]:
print("Document without metadata")
documents[0]

Document without metadata


Document(id_='a896251f-4f47-4bf5-a47c-e7c356d2c411', embedding=None, metadata={'id': '4983'}, excluded_embed_metadata_keys=['id'], excluded_llm_metadata_keys=[], relationships={}, hash='9cf022f40bd0dff56c819c077f2bb020e0453175f396f188332075370fd03cb6', text='Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and d

In [None]:
print("Document with metadata topics")
documents_with_metadata[0]

Document with metadata topics


Document(id_='cadcd445-b2c7-49ba-8a17-1be80708e576', embedding=None, metadata={'Topics': 'patients, mortality, treatment, clinical, results, hiv, lung, care, findings, risk', 'id': '4983', 'topic_search': ['patients', 'mortality', 'treatment', 'clinical', 'results', 'hiv', 'lung', 'care', 'findings', 'risk']}, excluded_embed_metadata_keys=['id', 'topic_search'], excluded_llm_metadata_keys=[], relationships={}, hash='06c433bc651b04bb3ffd1ee84161bccb7794137f923ffaa24049d57edbb70f2b', text='Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of p

In [None]:
print("Document with metadata topics only")
documents_keywords[0]

Document with metadata topics only


Document(id_='704bb1ca-af05-4bb9-b1f4-9ed774839eae', embedding=None, metadata={'Topics': 'patients, mortality, treatment, clinical, results, hiv, lung, care, findings, risk', 'id': '4983', 'topic_search': ['patients', 'mortality', 'treatment', 'clinical', 'results', 'hiv', 'lung', 'care', 'findings', 'risk']}, excluded_embed_metadata_keys=['id', 'topic_search'], excluded_llm_metadata_keys=[], relationships={}, hash='e7cfae06c56a250e8d7b4beea2b101bc4093eab60494c30af71cc9f97fc3dfb3', text='', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

We can also see what each document looks like when it is passed to the embedding model.

In [None]:
from llama_index.schema import MetadataMode


print("Document without metadata:")
print(documents[0].get_content(metadata_mode=MetadataMode.EMBED))

print("="*30)

print("Document with metadata:")
print(documents_with_metadata[0].get_content(metadata_mode=MetadataMode.EMBED))

print("="*30)

print("Document with metadata only:")
print(documents_keywords[0].get_content(metadata_mode=MetadataMode.EMBED))

Document without metadata:
Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and decreased toward term to 1.2 microm2/ms. In the posterior limb of the internal capsule, the mean apparent diffusion coefficients at both times were similar (1.2 versus 1.1 microm2/ms). Relative anisotropy was higher the closer birth 
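
The text sent to the embedding model is assembled from the templates visible in the `Document` repr above (`text_template`, `metadata_template`, and the metadata separator), after dropping the keys listed in `excluded_embed_metadata_keys`. The following is a rough sketch of that assembly in plain Python; `render_for_embedding` is a hypothetical helper that mimics the behavior and is not LlamaIndex's actual implementation.

```python
from typing import Any, Dict, List

# Template values as shown in the Document repr in this notebook.
TEXT_TEMPLATE = "{metadata_str}\n\n{content}"
METADATA_TEMPLATE = "{key}: {value}"
METADATA_SEPARATOR = "\n"


def render_for_embedding(text: str, metadata: Dict[str, Any], excluded_keys: List[str]) -> str:
    """Approximate the string that is embedded for a document."""
    # Keep only the metadata keys that are not excluded from embedding.
    visible = {key: value for key, value in metadata.items() if key not in excluded_keys}
    metadata_str = METADATA_SEPARATOR.join(
        METADATA_TEMPLATE.format(key=key, value=value) for key, value in visible.items()
    )
    if not metadata_str:
        return text
    return TEXT_TEMPLATE.format(metadata_str=metadata_str, content=text).strip()


example_metadata = {
    "Topics": "patients, mortality, treatment",
    "id": "4983",
    "topic_search": ["patients", "mortality", "treatment"],
}
print(render_for_embedding("Some abstract text.", example_metadata, ["id", "topic_search"]))
```

With this layout, the topic keywords are prepended to the abstract, so they influence the embedding, while `id` and `topic_search` stay available for lookups and filtering only.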

Next, we create two vector store indexes:
1. without metadata (for comparison)
2. with metadata

In [None]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings import HuggingFaceEmbedding


# Set the embedding model in the global context so we don't need to pass it every time
embed_model = HuggingFaceEmbedding(model_name=CONFIGURATION[LANGUAGE]["VECTOR_DATABASE_EMBEDDING_MODEL"])
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)
set_global_service_context(service_context)

LLM is explicitly disabled. Using MockLLM.




In [None]:
from llama_index import VectorStoreIndex


vector_index = VectorStoreIndex.from_documents(documents, show_progress=True)
vector_index_with_metadata = VectorStoreIndex.from_documents(documents_with_metadata, show_progress=True)


# Search Strategy - Full Text Search (BM25)
In this section, we explore full-text search and measure its performance.

## Baseline
In this section, we search without metadata to establish a baseline.

In [None]:
from llama_index.retrievers import BM25Retriever


bm25_retriever = BM25Retriever.from_defaults(
    docstore=vector_index.docstore, similarity_top_k=10
)

Next, we evaluate the baseline full-text search strategy.

In [None]:
evaluate(bm25_retriever)

NDCG: {'NDCG@1': 0.5, 'NDCG@3': 0.56765, 'NDCG@5': 0.58577, 'NDCG@10': 0.61585}
MAP: {'MAP@1': 0.47833, 'MAP@3': 0.54343, 'MAP@5': 0.55416, 'MAP@10': 0.56799}
Recall: {'Recall@1': 0.47833, 'Recall@3': 0.62056, 'Recall@5': 0.66278, 'Recall@10': 0.74944}
Precision: {'P@1': 0.5, 'P@3': 0.21778, 'P@5': 0.142, 'P@10': 0.08167}
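
For intuition about what the full-text retriever scores, here is a small, self-contained sketch of textbook Okapi BM25 over whitespace-tokenized toy documents. The function name, the toy corpus, and the parameter values `k1=1.5` and `b=0.75` are assumptions for illustration; the tokenization and IDF details used by `BM25Retriever` (the notebook installs `rank-bm25` for it) may differ.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "low birth weight increases perinatal mortality".split(),
    "adipose tissue macrophages in obese mice".split(),
    "birth cohort study of childhood underweight".split(),
]
toy_query = "perinatal mortality birth weight".split()

toy_scores = bm25_scores(toy_query, toy_docs)
best = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
print(toy_scores, "best document index:", best)
```

Because BM25 only rewards exact term overlap, a document that shares no tokens with the query scores zero, no matter how closely related it is in meaning.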


## Full-text search with metadata

In [None]:
bm25_retriever_with_metadata = BM25Retriever.from_defaults(
    docstore=vector_index_with_metadata.docstore, similarity_top_k=10
)

In [None]:
evaluate(bm25_retriever_with_metadata)

NDCG: {'NDCG@1': 0.5, 'NDCG@3': 0.56765, 'NDCG@5': 0.58577, 'NDCG@10': 0.61686}
MAP: {'MAP@1': 0.47833, 'MAP@3': 0.54343, 'MAP@5': 0.55416, 'MAP@10': 0.56836}
Recall: {'Recall@1': 0.47833, 'Recall@3': 0.62056, 'Recall@5': 0.66278, 'Recall@10': 0.75278}
Precision: {'P@1': 0.5, 'P@3': 0.21778, 'P@5': 0.142, 'P@10': 0.082}


# Search Strategy - Semantic Search

## Baseline

In [None]:
semantic_retriever = vector_index.as_retriever(similarity_top_k=10)

In [None]:
evaluate(semantic_retriever)

NDCG: {'NDCG@1': 0.56333, 'NDCG@3': 0.64971, 'NDCG@5': 0.67211, 'NDCG@10': 0.69436}
MAP: {'MAP@1': 0.54289, 'MAP@3': 0.62199, 'MAP@5': 0.63894, 'MAP@10': 0.64937}
Recall: {'Recall@1': 0.54289, 'Recall@3': 0.70167, 'Recall@5': 0.75828, 'Recall@10': 0.82189}
Precision: {'P@1': 0.56333, 'P@3': 0.25778, 'P@5': 0.17, 'P@10': 0.093}


## Semantic search with metadata

In [None]:
semantic_retriever_with_metadata = vector_index_with_metadata.as_retriever(similarity_top_k=10)

In [None]:
evaluate(semantic_retriever_with_metadata)

NDCG: {'NDCG@1': 0.57333, 'NDCG@3': 0.65291, 'NDCG@5': 0.66958, 'NDCG@10': 0.69696}
MAP: {'MAP@1': 0.55039, 'MAP@3': 0.62657, 'MAP@5': 0.63861, 'MAP@10': 0.65148}
Recall: {'Recall@1': 0.55039, 'Recall@3': 0.70183, 'Recall@5': 0.74667, 'Recall@10': 0.82689}
Precision: {'P@1': 0.57333, 'P@3': 0.25778, 'P@5': 0.166, 'P@10': 0.09367}


## Semantic search on metadata

In [None]:
vector_index_keywords = VectorStoreIndex.from_documents(documents_keywords, show_progress=True)


In [None]:
semantic_retriever_only_metadata = vector_index_keywords.as_retriever(similarity_top_k=10)

In [None]:
evaluate(semantic_retriever_only_metadata)

NDCG: {'NDCG@1': 0.0, 'NDCG@3': 0.0, 'NDCG@5': 0.0, 'NDCG@10': 0.00197}
MAP: {'MAP@1': 0.0, 'MAP@3': 0.0, 'MAP@5': 0.0, 'MAP@10': 0.0007}
Recall: {'Recall@1': 0.0, 'Recall@3': 0.0, 'Recall@5': 0.0, 'Recall@10': 0.00667}
Precision: {'P@1': 0.0, 'P@3': 0.0, 'P@5': 0.0, 'P@10': 0.00067}


# Search Strategy - Hybrid Search

## Metadata Filtering + Full-text search
Currently, the LlamaIndex `SimpleVectorStore` doesn't support complex metadata filtering, so we need to create our own `VectorStore`. In this section, we create a custom `VectorStore` that supports combining filters with `OR` and matching a value against list-valued metadata (an `IN`-style check).

In [None]:
#@title Custom Vector Store
from llama_index.vector_stores import SimpleVectorStore
from typing import Any, Callable, Dict, List, Mapping, Optional, cast
from llama_index.indices.query.embedding_utils import (
    get_top_k_embeddings,
    get_top_k_embeddings_learner,
    get_top_k_mmr_embeddings,
)
from llama_index.vector_stores.simple import LEARNER_MODES, MMR_MODE
from llama_index.vector_stores.types import (
    DEFAULT_PERSIST_DIR,
    DEFAULT_PERSIST_FNAME,
    MetadataFilters,
    MetadataFilter,
    VectorStore,
    VectorStoreQuery,
    VectorStoreQueryMode,
    VectorStoreQueryResult,
    FilterOperator,
    FilterCondition
)
import numpy as np


def _build_metadata_filter_fn(
    metadata_lookup_fn: Callable[[str], Mapping[str, Any]],
    metadata_filters: Optional[MetadataFilters] = None,
) -> Callable[[str], bool]:
    """Build metadata filter function."""
    # No filters at all (None or empty) means every node passes.
    if metadata_filters is None or not metadata_filters.filters:
        return lambda _: True

    def filter_fn(node_id: str) -> bool:
        metadata = metadata_lookup_fn(node_id)
        results = []
        for filter_ in metadata_filters.filters:
            metadata_value = metadata.get(filter_.key, None)
            if metadata_value is None:
                return False
            elif isinstance(metadata_value, list):
              results.append(filter_.value in metadata_value)
            elif isinstance(metadata_value, (int, float, str, bool)):
              results.append(metadata_value == filter_.value)

        if metadata_filters.condition == FilterCondition.OR:
          return np.any(results)
        else:
          return np.all(results)

    return filter_fn

class MySimpleVectorStore(SimpleVectorStore):
    def query(
        self,
        query,
        **kwargs,
    ):
        """Get nodes for response."""
        # Prevent metadata filtering on stores that were persisted without metadata.
        if (
            query.filters is not None
            and self._data.embedding_dict
            and not self._data.metadata_dict
        ):
            raise ValueError(
                "Cannot filter stores that were persisted without metadata. "
                "Please rebuild the store with metadata to enable filtering."
            )
        # Prefilter nodes based on the query filter and node ID restrictions.
        query_filter_fn = _build_metadata_filter_fn(
            lambda node_id: self._data.metadata_dict[node_id], query.filters
        )

        if query.node_ids is not None:
            available_ids = set(query.node_ids)

            def node_filter_fn(node_id: str) -> bool:
                return node_id in available_ids

        else:

            def node_filter_fn(node_id: str) -> bool:
                return True

        node_ids = []
        embeddings = []
        # TODO: consolidate with get_query_text_embedding_similarities
        for node_id, embedding in self._data.embedding_dict.items():
            if node_filter_fn(node_id) and query_filter_fn(node_id):
                node_ids.append(node_id)
                embeddings.append(embedding)

        query_embedding = cast(List[float], query.query_embedding)

        if query.mode in LEARNER_MODES:
            top_similarities, top_ids = get_top_k_embeddings_learner(
                query_embedding,
                embeddings,
                similarity_top_k=query.similarity_top_k,
                embedding_ids=node_ids,
            )
        elif query.mode == MMR_MODE:
            mmr_threshold = kwargs.get("mmr_threshold", None)
            top_similarities, top_ids = get_top_k_mmr_embeddings(
                query_embedding,
                embeddings,
                similarity_top_k=query.similarity_top_k,
                embedding_ids=node_ids,
                mmr_threshold=mmr_threshold,
            )
        elif query.mode == VectorStoreQueryMode.DEFAULT:
            top_similarities, top_ids = get_top_k_embeddings(
                query_embedding,
                embeddings,
                similarity_top_k=query.similarity_top_k,
                embedding_ids=node_ids,
            )
        else:
            raise ValueError(f"Invalid query mode: {query.mode}")

        return VectorStoreQueryResult(similarities=top_similarities, ids=top_ids)
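
The filtering rule implemented by `_build_metadata_filter_fn` can be illustrated without LlamaIndex. The sketch below uses plain tuples in place of `MetadataFilter` objects and a string in place of `FilterCondition`; the `matches` helper and the toy metadata are hypothetical and only mirror the logic above.

```python
from typing import Any, Dict, List, Tuple


def matches(metadata: Dict[str, Any], filters: List[Tuple[str, Any]], condition: str = "and") -> bool:
    """Return True if a node's metadata satisfies the (key, value) filters."""
    if not filters:
        return True  # nothing to filter on

    results = []
    for key, value in filters:
        stored = metadata.get(key)
        if stored is None:
            return False  # a missing key rejects the node outright
        if isinstance(stored, list):
            results.append(value in stored)  # IN-style check for list metadata
        else:
            results.append(stored == value)  # plain equality for scalars

    return any(results) if condition == "or" else all(results)


node_metadata = {"id": "4983", "topic_search": ["patients", "mortality", "treatment"]}
topic_filters = [("topic_search", "mortality"), ("topic_search", "chromatin")]

print(matches(node_metadata, topic_filters, condition="or"))   # one keyword matches
print(matches(node_metadata, topic_filters, condition="and"))  # not all keywords match
```

With `OR`, a node is kept as soon as any one of the query's topic keywords appears in its `topic_search` list, which is the behavior the topic retriever relies on.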

In [None]:
#@title Custom Retriever
from llama_index import QueryBundle
from llama_index.schema import NodeWithScore
from llama_index.retrievers import (
    BaseRetriever,
    VectorIndexRetriever,
    KeywordTableSimpleRetriever,
)


class TopicRetriever(VectorIndexRetriever):
    def __init__(
        self,
        topic_model,
        **kwargs: Any,
    ) -> None:
        self.topic_model = topic_model
        self.cache = {}
        super().__init__(**kwargs)

    def _create_filters(self, topics):
        return [MetadataFilter(key="topic_search", value=topic, operator=FilterOperator.EQ) for topic in topics]

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        # update the filter based on the query
        query = query_bundle.query_str

        # cache the result
        if query in self.cache:
          topics = self.cache[query]
        else:
          # predict the topic of the query, then look up that topic's KeyBERT keywords
          topic_id = self.topic_model.transform(query)[0][0]
          topics = [t[0] for t in self.topic_model.get_topic(topic_id, full=True)['KeyBERT']]
          self.cache[query] = topics

        filters = self._create_filters(topics)
        self._filters = MetadataFilters(filters=filters, condition=FilterCondition.OR)


        return super()._retrieve(query_bundle)

Next, we create our vector store and retriever.

In [None]:
from llama_index.storage.storage_context import StorageContext


my_vector_store = MySimpleVectorStore()
my_storage_context = StorageContext.from_defaults(vector_store=my_vector_store)

my_vector_index = VectorStoreIndex(documents_with_metadata, storage_context=my_storage_context)

In [None]:
topic_retriever = TopicRetriever(topic_model, index=my_vector_index, similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(
    docstore=vector_index_with_metadata.docstore, similarity_top_k=10
)

In [None]:
from llama_index.retrievers import QueryFusionRetriever


metadata_bm25_retriever = QueryFusionRetriever(
    [topic_retriever, bm25_retriever],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=False,
    verbose=True,
    llm=None
)

LLM is explicitly disabled. Using MockLLM.
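
The `reciprocal_rerank` mode merges the ranked lists of the individual retrievers using only each document's position, not its raw score. Below is a minimal sketch of reciprocal rank fusion; the function name, the toy rankings, and the constant `k = 60` (a value commonly used for this method) are assumptions, and the exact details inside `QueryFusionRetriever` may differ.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: float = 60.0) -> Dict[str, float]:
    """Fuse several ranked lists of document ids into one score per document."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Earlier positions contribute more; k dampens the gap between ranks.
            fused[doc_id] += 1.0 / (rank + k)
    # Return the documents ordered from highest to lowest fused score.
    return dict(sorted(fused.items(), key=lambda item: item[1], reverse=True))


# Toy rankings (made-up ids): one list per retriever, best match first.
topic_ranking = ["d3", "d1", "d7"]
bm25_ranking = ["d1", "d2", "d3"]

fused_scores = reciprocal_rank_fusion([topic_ranking, bm25_ranking])
print(list(fused_scores))
```

Documents that appear in both lists accumulate two contributions, so they tend to rise above documents that only one retriever returned, even if that retriever ranked them first.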


In [None]:
evaluate(metadata_bm25_retriever)

NDCG: {'NDCG@1': 0.27667, 'NDCG@3': 0.45326, 'NDCG@5': 0.49266, 'NDCG@10': 0.51025}
MAP: {'MAP@1': 0.26528, 'MAP@3': 0.40653, 'MAP@5': 0.42883, 'MAP@10': 0.43702}
Recall: {'Recall@1': 0.26528, 'Recall@3': 0.57444, 'Recall@5': 0.66844, 'Recall@10': 0.71767}
Precision: {'P@1': 0.27667, 'P@3': 0.20222, 'P@5': 0.14267, 'P@10': 0.07867}


## Metadata Filtering + Semantic Search

In [None]:
topic_retriever = TopicRetriever(topic_model, index=my_vector_index, similarity_top_k=10)
semantic_retriever_with_metadata = vector_index_with_metadata.as_retriever(similarity_top_k=10)

In [None]:
metadata_semantic_retriever = QueryFusionRetriever(
    [topic_retriever, semantic_retriever_with_metadata],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=False,
    verbose=True,
    llm=None
)

LLM is explicitly disabled. Using MockLLM.


In [None]:
evaluate(metadata_semantic_retriever)

NDCG: {'NDCG@1': 0.29667, 'NDCG@3': 0.46573, 'NDCG@5': 0.50375, 'NDCG@10': 0.54412}
MAP: {'MAP@1': 0.28317, 'MAP@3': 0.42043, 'MAP@5': 0.44401, 'MAP@10': 0.46262}
Recall: {'Recall@1': 0.28317, 'Recall@3': 0.57033, 'Recall@5': 0.66406, 'Recall@10': 0.78222}
Precision: {'P@1': 0.29667, 'P@3': 0.20778, 'P@5': 0.14667, 'P@10': 0.08733}


## Full-text search + Semantic search

In [None]:
bm25_retriever_with_metadata = BM25Retriever.from_defaults(
    docstore=vector_index_with_metadata.docstore, similarity_top_k=10
)
semantic_retriever_with_metadata = vector_index_with_metadata.as_retriever(similarity_top_k=10)

In [None]:
bm25_semantic_retriever = QueryFusionRetriever(
    [bm25_retriever_with_metadata, semantic_retriever_with_metadata],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=False,
    verbose=True,
    llm=None
)

LLM is explicitly disabled. Using MockLLM.


In [None]:
evaluate(bm25_semantic_retriever)

NDCG: {'NDCG@1': 0.57333, 'NDCG@3': 0.65539, 'NDCG@5': 0.6817, 'NDCG@10': 0.70352}
MAP: {'MAP@1': 0.54917, 'MAP@3': 0.62889, 'MAP@5': 0.64486, 'MAP@10': 0.65592}
Recall: {'Recall@1': 0.54917, 'Recall@3': 0.71556, 'Recall@5': 0.77744, 'Recall@10': 0.83722}
Precision: {'P@1': 0.57333, 'P@3': 0.25556, 'P@5': 0.16933, 'P@10': 0.09367}


## Metadata filtering + Full-text search + Semantic search

In [None]:
topic_retriever = TopicRetriever(topic_model, index=my_vector_index, similarity_top_k=10)
bm25_retriever_with_metadata = BM25Retriever.from_defaults(
    docstore=vector_index_with_metadata.docstore, similarity_top_k=10
)
semantic_retriever_with_metadata = vector_index_with_metadata.as_retriever(similarity_top_k=10)

In [None]:
metadata_bm25_semantic_retriever = QueryFusionRetriever(
    [topic_retriever, bm25_retriever_with_metadata, semantic_retriever_with_metadata],
    similarity_top_k=10,
    num_queries=1,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=False,
    verbose=True,
    llm=None
)

LLM is explicitly disabled. Using MockLLM.


In [None]:
evaluate(metadata_bm25_semantic_retriever)

NDCG: {'NDCG@1': 0.52667, 'NDCG@3': 0.61325, 'NDCG@5': 0.64952, 'NDCG@10': 0.67019}
MAP: {'MAP@1': 0.50472, 'MAP@3': 0.58541, 'MAP@5': 0.60725, 'MAP@10': 0.61696}
Recall: {'Recall@1': 0.50472, 'Recall@3': 0.67289, 'Recall@5': 0.76061, 'Recall@10': 0.82056}
Precision: {'P@1': 0.52667, 'P@3': 0.24333, 'P@5': 0.16667, 'P@10': 0.091}
