<a href="https://colab.research.google.com/github/prakul/MongoDB-AI-Resources/blob/main/%5Bnew_version%5D_llamaIndex%2BmongoDB_MetadataFiltering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip3 install python-dotenv


Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [None]:
!pip install llama-index
!pip install llama-index-vector-stores-mongodb
!pip install llama-index-embeddings-openai
!pip install pymongo
!pip install datasets
!pip install pandas


In [3]:
import os
from dotenv import load_dotenv
load_dotenv(override=True)

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
MONGO_URI = os.environ["MONGO_URI"]


In [4]:
from llama_index.core.settings import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small", dimensions=256)
llm = OpenAI()

Settings.llm = llm
Settings.embed_model = embed_model

In [None]:
# OPTIONAL - try this if llamaindex throws an OpenAI Authentication error
#import openai
#openai.api_key = OPENAI_API_KEY


In [5]:
import pymongo

from llama_index.core import SimpleDirectoryReader


In [None]:
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'


In [None]:

# Load the PDF into LlamaIndex docs format
uber_docs = SimpleDirectoryReader(input_files=["./data/10k/uber_2021.pdf"]).load_data()


In [None]:
print(uber_docs[0].__dict__)

{'id_': '32a146f4-1bf7-42d7-9a54-ed583993a996', 'embedding': None, 'metadata': {'page_label': '1', 'file_name': 'uber_2021.pdf', 'file_path': 'uber_2021.pdf', 'file_type': 'application/pdf', 'file_size': 4557617, 'creation_date': '2024-02-16', 'last_modified_date': '2024-02-16', 'last_accessed_date': '2024-02-16'}, 'excluded_embed_metadata_keys': ['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], 'excluded_llm_metadata_keys': ['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], 'relationships': {}, 'hash': '13c1b36110971e4dfd25578b6fe1576129cd93785e79613c1f602a96c0c68f08', 'text': 'Uber \n2021 \nAnnual Report', 'start_char_idx': None, 'end_char_idx': None, 'text_template': '{metadata_str}\n\n{content}', 'metadata_template': '{key}: {value}', 'metadata_seperator': '\n'}


In [11]:
# Setup the configurations for MongoDBAtlasVectorSearch
# This class is defined at https://github.com/jerryjliu/llama_index/blob/main/llama_index/vector_stores/mongodb.py

MONGODB_DB_NAME = "LlamaIndex_sample" # default is "default_db"
MONGODB_COLLECTION_NAME = "uber_sample_3" # default is "default_collection"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "LlamaIndex_vector_index_2" # default is "default"

In [10]:
mongodb_client = pymongo.MongoClient(MONGO_URI)
#mongodb_client.list_database_names()

In [None]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from llama_index.core import VectorStoreIndex, StorageContext

store = MongoDBAtlasVectorSearch(mongodb_client, db_name=MONGODB_DB_NAME, collection_name=MONGODB_COLLECTION_NAME, index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME)
storage_context = StorageContext.from_defaults(vector_store=store)


# Process the PDF and store in Atlas Vector Search

In [None]:
# Chunk the documents, embed each chunk, and write the results to the MongoDB collection
index = VectorStoreIndex.from_documents(
    uber_docs, storage_context=storage_context, show_progress=True
)


Parsing nodes:   0%|          | 0/307 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/410 [00:00<?, ?it/s]

After the steps above, the LlamaIndex nodes, together with their embeddings and metadata, are stored in MongoDB.

![picture](https://drive.google.com/uc?id=19k9gv-uoGfV7NngYPi4gk2a64l-GNCRu)

# Create an Index in Atlas Vector Search


Now go to the Search tab in Atlas and create a vector search index on your cluster. For details on how to define an Atlas Vector Search index, refer to the [documentation](https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings-for-vector-search) and this [sample blog](https://www.mongodb.com/developer/products/atlas/building-generative-ai-applications-vector-search-open-source-models/#semantic-search-for-movie-recommendations).

Name the index `LlamaIndex_vector_index_2` (the value of `ATLAS_VECTOR_SEARCH_INDEX_NAME`), and create it on the namespace `LlamaIndex_sample.uber_sample_3` (the database and collection configured above). Finally, enter the following definition in the JSON editor on MongoDB Atlas:

```
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 256,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "metadata.page_label"
    }
  ]
}

```
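In this example, `embedding` is the name of the field that contains the embedding vector, and `numDimensions` is 256 to match the `dimensions=256` setting of the embedding model configured earlier.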
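
As a minimal sketch, the same definition could be generated from Python so that `numDimensions` stays tied to the embedding size and each filterable metadata field gets its own `filter` entry. The helper name `build_vector_index_definition` is hypothetical; it is not part of LlamaIndex or PyMongo, and it only builds the JSON to paste into the Atlas editor.

```python
import json


def build_vector_index_definition(
    num_dimensions, filter_paths, vector_path="embedding", similarity="cosine"
):
    """Build an Atlas Vector Search index definition as a plain dict (hypothetical helper)."""
    fields = [
        {
            "type": "vector",
            "path": vector_path,
            "numDimensions": num_dimensions,
            "similarity": similarity,
        }
    ]
    # One "filter" entry per metadata field that should be usable as a filter
    fields.extend({"type": "filter", "path": path} for path in filter_paths)
    return {"fields": fields}


# 256 mirrors the dimensions=256 setting passed to OpenAIEmbedding in this notebook
definition = build_vector_index_definition(256, ["metadata.page_label"])
print(json.dumps(definition, indent=2))
```

The printed JSON has the same content as the definition shown above.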


## Initialize Index (Optional)

If the data is already stored in MongoDB, you can initialize the index directly from the vector store:

In [14]:
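from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

# Reuse the existing collection and Atlas Vector Search index; nothing is re-ingested here
vector_store = MongoDBAtlasVectorSearch(mongodb_client, db_name=MONGODB_DB_NAME, collection_name=MONGODB_COLLECTION_NAME, index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME)
index = VectorStoreIndex.from_vector_store(vector_store)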


## Performing Semantic Search

[Reference](https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/root.html)

In [17]:
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("What was Uber's revenue?")


In [19]:
for node in nodes:
    print(node)

Node ID: 681d2a66-97ef-486d-820c-25c1ec0efe8d
Text: Financial and Operational HighlightsYear Ended December 31,
Constant Currency (In millions, except percentages) 2020 2021 2020 to
2021 %Change 2020 to 2021 % Change Monthly Active Platform Consumers
(“MAPCs”) 93 118 27 %Trips  5,025 6,368 27 %Gross Bookings  $ 57,897 $
90,415 56 %53 % Revenue $ 11,139 $ 17,455 57 %54 % Net loss
attributable to ...
Score:  0.855

Node ID: 2b04e088-7d95-4693-95c6-9976fc387c56
Text: UBER TECHNOLOGIES, INC.CONSOLIDATED STATEMENTS OF  OPERATIONS(In
millions, except share amounts which are ref lected in thousands, and
per share amounts)Year Ended December 31, 2019 2020 2021 Revenue $
13,000 $ 11,139 $ 17,455 Costs and expenses Cost of revenue, exclusive
of dep reciation and amortization shown separately below6,061 5,154
9,351 ...
Score:  0.851

Node ID: 824a3448-ebfa-4b9c-9437-d18684414670
Text: Through our partnership with Arizona State University, over4,000
Drivers and their fami ly members have enrolled 

## Performing Semantic Search with Metadata filtering

In [20]:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Restrict retrieval to nodes whose page_label is exactly "131"
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="metadata.page_label", value="131")]
)
retriever = index.as_retriever(similarity_top_k=3, filters=filters)
nodes = retriever.retrieve("What was Uber's revenue?")

In [21]:
for node in nodes:
    print(node)

Node ID: e9ae9536-2a57-4a05-a840-48fe1601a18c
Text: Other Driver Classification MattersAdditionally, we  have
received other lawsuits and governmental inquiries in other
jurisdictions, and anticipate future claims, lawsuits, arbitration
proceedings,administrative  actions,  and  government  investigations
and  audits  challenging  our  classification  of  Drivers  as
independent  contractors  a...
Score:  0.770

Node ID: 0630edba-db9c-4b6f-8be1-751024049a5d
Text: Uber filed a proof of claim in the bankruptcy court, and Levando
wski additionally asserted a claim against Uber alleging that Uber
failed toperform its obligations under an agreement with Otto Trucking
, LLC. For these claims, Uber and Levandowski reached a confidential
settlement in principle that isscheduled  for an approval hearing with
the ...
Score:  0.759



## Performing RAG

In [None]:
response = index.as_query_engine().query("What was Uber's revenue?")
print(response)


Uber's revenue for the year ended December 31, 2021, was $17.5 billion.


In [None]:
print(response.source_nodes)

[NodeWithScore(node=TextNode(id_='681d2a66-97ef-486d-820c-25c1ec0efe8d', embedding=None, metadata={'page_label': '53', 'file_name': 'uber_2021.pdf', 'file_path': 'data/10k/uber_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1880483, 'creation_date': '2024-02-16', 'last_modified_date': '2024-02-16', 'last_accessed_date': '2024-02-16'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='f1ae6bfd-ad27-4da7-940f-fbedf7273e28', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '53', 'file_name': 'uber_2021.pdf', 'file_path': 'data/10k/uber_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1880483, 'creation_date': '2024-02-16', 'last_modified_date': '2024-02-16', 'last_accessed_date': '20

## Performing RAG with Metadata Filtering

Each LlamaIndex document contains a `metadata` field.
By default, it contains the following fields:

```
metadata={'page_label': '129', 'file_name': 'uber_2021.pdf', 'file_path': 'data/10k/uber_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1880483, 'creation_date': '2024-02-16', 'last_modified_date': '2024-02-16', 'last_accessed_date': '2024-02-16'}
```


The metadata can be customized by following the instructions at
https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_documents.html
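
As one possible sketch of such a customization, a function that maps a file path to extra metadata can be written in plain Python. The function name `uber_file_metadata` and the `company` and `fiscal_year` keys are hypothetical and chosen only for illustration; they assume file names of the form `<company>_<year>.pdf`, as in `uber_2021.pdf`.

```python
import os
import re


def uber_file_metadata(file_path):
    """Derive extra metadata from a file name such as 'uber_2021.pdf' (hypothetical helper)."""
    file_name = os.path.basename(file_path)
    metadata = {"file_name": file_name}
    # Assumed naming convention: <company>_<4-digit year>
    match = re.search(r"(?P<company>[a-z]+)_(?P<year>\d{4})", file_name.lower())
    if match:
        metadata["company"] = match.group("company")
        metadata["fiscal_year"] = match.group("year")
    return metadata


print(uber_file_metadata("./data/10k/uber_2021.pdf"))
# {'file_name': 'uber_2021.pdf', 'company': 'uber', 'fiscal_year': '2021'}
```

A function like this could then be passed to the reader, for example `SimpleDirectoryReader(input_files=[...], file_metadata=uber_file_metadata)`; the linked documentation describes how the argument behaves in your installed version. Any new field that should act as a filter, such as `metadata.fiscal_year`, would also need its own `filter` entry in the index definition.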


Any metadata field that you want to filter on must be included in the Atlas Vector Search index definition. In the example above, the field `metadata.page_label` is included through an entry of type `filter`:
```
{
  "type": "filter",
  "path": "metadata.page_label"
}
```
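
To make the role of the `filter` field more concrete, here is a rough sketch of the kind of `$vectorSearch` aggregation stage that a filtered query corresponds to on the MongoDB side. The helper `build_vector_search_stage`, the all-zero placeholder query vector, and the `numCandidates` heuristic are assumptions for illustration; the exact pipeline generated by the LlamaIndex integration may differ between versions.

```python
def build_vector_search_stage(
    index_name, query_vector, limit, metadata_filter=None,
    path="embedding", num_candidates=None,
):
    """Build a $vectorSearch aggregation stage as a plain dict (hypothetical helper)."""
    stage = {
        "index": index_name,
        "path": path,
        "queryVector": query_vector,
        # Assumed heuristic: consider 10x more candidates than results returned
        "numCandidates": num_candidates or limit * 10,
        "limit": limit,
    }
    if metadata_filter:
        # Only fields indexed with type "filter" can be referenced here
        stage["filter"] = metadata_filter
    return {"$vectorSearch": stage}


stage = build_vector_search_stage(
    index_name="LlamaIndex_vector_index_2",
    query_vector=[0.0] * 256,  # placeholder; a real query uses the embedded question
    limit=3,
    metadata_filter={"metadata.page_label": {"$eq": "131"}},
)
print(stage["$vectorSearch"]["filter"])
```

If `metadata.page_label` were missing from the index definition, Atlas would have no indexed filter field to evaluate this condition against.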

In the response above, the first source node, which contains the revenue figures, has `page_label` 53.

If we restrict retrieval to a different page, for example `page_label` 131, that node is no longer reachable and the query engine cannot find the answer:

In [None]:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Only nodes from page 131 are passed to the LLM as context
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="metadata.page_label", value="131")]
)
response = index.as_query_engine(filters=filters).query("What was Uber's revenue?")
print(response)

I'm sorry, but the given context does not provide any information about Uber's revenue.
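
The effect seen above can be reproduced with a toy example that needs neither MongoDB nor OpenAI. It is only a sketch of the idea: the three "nodes" and their 2-dimensional embeddings are made up, and the `retrieve` function is a hypothetical stand-in for the real retriever. The metadata filter narrows the candidate set, and similarity ranking then runs only on what is left.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Made-up nodes with made-up 2-d embeddings (not real data from the PDF)
nodes = [
    {"text": "revenue table", "page_label": "53", "embedding": [0.9, 0.1]},
    {"text": "legal proceedings", "page_label": "131", "embedding": [0.2, 0.8]},
    {"text": "settlement details", "page_label": "131", "embedding": [0.1, 0.9]},
]


def retrieve(query_embedding, top_k, page_label=None):
    """Filter by page_label first, then rank the remaining nodes by similarity."""
    candidates = [
        n for n in nodes if page_label is None or n["page_label"] == page_label
    ]
    ranked = sorted(
        candidates,
        key=lambda n: cosine(query_embedding, n["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


query = [1.0, 0.0]  # pretend this is the embedding of "What was Uber's revenue?"
unfiltered = retrieve(query, top_k=1)
filtered = retrieve(query, top_k=1, page_label="131")
print(unfiltered[0]["text"])  # revenue table
print(filtered[0]["text"])    # legal proceedings
```

With the filter in place, the best remaining match is unrelated to revenue, which mirrors why the query engine above reports that the context contains no revenue information.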
