[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/rag/building_RAG_with_LlamaIndex_and_MongoDB_Vector_Database.ipynb) 


In [None]:
!pip install llama-index
!pip install llama-index-vector-stores-mongodb
!pip install llama-index-embeddings-openai
!pip install pymongo
!pip install datasets
!pip install pandas

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk..."

In [None]:
import pandas as pd
from datasets import load_dataset

# Make sure you have a Hugging Face token (HF_TOKEN) in your development environment before running the code below
# How to get a token: https://huggingface.co/docs/hub/en/security-tokens

# https://huggingface.co/datasets/MongoDB/airbnb_embeddings
dataset = load_dataset("MongoDB/airbnb_embeddings")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset["train"])

dataset_df.head(5)

In [None]:
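# Drop the precomputed embeddings; new ones are generated with the OpenAI embedding model configured below
dataset_df = dataset_df.drop(columns=["text_embeddings"])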

In [None]:
from llama_index.core.settings import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

embed_model = OpenAIEmbedding(model="text-embedding-3-small", dimensions=256)
llm = OpenAI()

Settings.llm = llm
Settings.embed_model = embed_model

In [None]:
import json

from llama_index.core import Document
from llama_index.core.schema import MetadataMode

# Convert the DataFrame to a JSON string representation
documents_json = dataset_df.to_json(orient="records")

# Load the JSON string into a Python list of dictionaries
documents_list = json.loads(documents_json)

llama_documents = []

for document in documents_list:
    # Metadata values must be one of (str, int, float, None),
    # so serialize the nested fields to JSON strings
    for key in (
        "amenities",
        "images",
        "host",
        "address",
        "availability",
        "review_scores",
        "reviews",
        "image_embeddings",
    ):
        document[key] = json.dumps(document[key])

    # Create a Document object with the text and excluded metadata for llm and embedding models
    llama_document = Document(
        text=document["description"],
        metadata=document,
        excluded_llm_metadata_keys=[
            "_id",
            "transit",
            "minimum_nights",
            "maximum_nights",
            "cancellation_policy",
            "last_scraped",
            "calendar_last_scraped",
            "first_review",
            "last_review",
            "security_deposit",
            "cleaning_fee",
            "guests_included",
            "host",
            "availability",
            "reviews",
            "image_embeddings",
        ],
        excluded_embed_metadata_keys=[
            "_id",
            "transit",
            "minimum_nights",
            "maximum_nights",
            "cancellation_policy",
            "last_scraped",
            "calendar_last_scraped",
            "first_review",
            "last_review",
            "security_deposit",
            "cleaning_fee",
            "guests_included",
            "host",
            "availability",
            "reviews",
            "image_embeddings",
        ],
        metadata_template="{key}=>{value}",
        text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
    )

    llama_documents.append(llama_document)

# Observing an example of what the LLM and Embedding model receive as input
print(
    "\nThe LLM sees this: \n",
    llama_documents[0].get_content(metadata_mode=MetadataMode.LLM),
)
print(
    "\nThe Embedding model sees this: \n",
    llama_documents[0].get_content(metadata_mode=MetadataMode.EMBED),
)


The LLM sees this: 
 Metadata: listing_url=>https://www.airbnb.com/rooms/10006546
name=>Ribeira Charming Duplex
summary=>Fantastic duplex apartment with three bedrooms, located in the historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary building fully rehabilitated, without losing their original character.
space=>Privileged views of the Douro River and Ribeira square, our apartment offers the perfect conditions to discover the history and the charm of Porto. Apartment comfortable, charming, romantic and cozy in the heart of Ribeira. Within walking distance of all the most emblematic places of the city of Porto. The apartment is fully equipped to host 8 people, with cooker, oven, washing machine, dishwasher, microwave, coffee machine (Nespresso) and kettle. The apartment is located in a very typical area of the city that allows to cross with the most picturesque population of the city, welcoming, genuine and happy people that fills the streets with his outspok
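
The output above follows from the two templates passed to `Document`. As a rough sketch of that composition (plain Python, not LlamaIndex's own implementation), each non-excluded metadata entry is rendered with `metadata_template`, the entries are joined with newlines, and the result is placed into `text_template`. The helper name `render_document` and the placeholder values are hypothetical.

```python
def render_document(
    text,
    metadata,
    excluded_keys,
    metadata_template="{key}=>{value}",
    text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
):
    """Approximate how the templates combine metadata and text."""
    # Keep only the metadata entries that are not excluded for this consumer
    visible = {k: v for k, v in metadata.items() if k not in excluded_keys}
    # Render each entry, one per line
    metadata_str = "\n".join(
        metadata_template.format(key=k, value=v) for k, v in visible.items()
    )
    return text_template.format(metadata_str=metadata_str, content=text)


# The URL and name come from the output above; "transit" and the
# description text are placeholders used only for illustration.
sample = render_document(
    text="<description text>",
    metadata={
        "listing_url": "https://www.airbnb.com/rooms/10006546",
        "name": "Ribeira Charming Duplex",
        "transit": "<placeholder>",
    },
    excluded_keys=["transit"],
)
print(sample)
```

Because `transit` appears in the excluded keys, it is stored with the document but never reaches the rendered string.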

In [None]:
llama_documents[0]

In [None]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

parser = SentenceSplitter(chunk_size=5000)
nodes = parser.get_nodes_from_documents(llama_documents)

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode=MetadataMode.EMBED)
    )
    node.embedding = node_embedding
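
The loop above sends one embedding request per node. A possible alternative, sketched here under the assumption that the embedding model exposes a batch method (LlamaIndex embedding models provide `get_text_embedding_batch`), is to collect the node texts first and embed them in a single call. The helper `attach_embeddings` and the stand-in classes are hypothetical and exist only so the sketch runs without an API key.

```python
from typing import Callable, List, Sequence


def attach_embeddings(
    nodes: Sequence,
    to_text: Callable[[object], str],
    embed_batch: Callable[[List[str]], List[List[float]]],
) -> None:
    """Embed all node texts in one batch call and store each vector on its node."""
    texts = [to_text(node) for node in nodes]
    vectors = embed_batch(texts)
    if len(vectors) != len(texts):
        raise ValueError("embedding count does not match node count")
    for node, vector in zip(nodes, vectors):
        node.embedding = vector


class FakeNode:
    """Stand-in for a LlamaIndex node (illustration only)."""

    def __init__(self, text: str):
        self.text = text
        self.embedding = None


def fake_embed_batch(texts: List[str]) -> List[List[float]]:
    """Stand-in embedder: returns a one-dimensional vector per text."""
    return [[float(len(t))] for t in texts]


demo_nodes = [FakeNode("a"), FakeNode("bb")]
attach_embeddings(demo_nodes, lambda n: n.text, fake_embed_batch)
```

In the notebook, the equivalent call would pass `lambda n: n.get_content(metadata_mode=MetadataMode.EMBED)` as `to_text` and `embed_model.get_text_embedding_batch` as `embed_batch`.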

### MONGODB VECTOR DATABASE CONNECTION AND SETUP

MongoDB acts as both an operational and a vector database for the RAG system. 
MongoDB Atlas specifically provides a database solution that efficiently stores, queries and retrieves vector embeddings.

Creating a database and collection within MongoDB is made simple with MongoDB Atlas.

1. First, register for a [MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register). Existing users can sign in to MongoDB Atlas.
2. [Follow the instructions](https://www.mongodb.com/docs/atlas/tutorial/deploy-free-tier-cluster/) to deploy your first cluster, selecting Atlas UI as the procedure.
3. Create the database: `airbnb`.
4. Within the `airbnb` database, create the collection `listings_reviews`.
5. Create a [vector search index](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/#procedure/) named `vector_index` for the `listings_reviews` collection. This index enables the RAG application to retrieve records as additional context to supplement user queries via vector search. The index definition must declare the same number of dimensions as the embeddings generated earlier (256).
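
As a sketch of what the index definition for step 5 might look like, the snippet below builds it as a Python dictionary and prints the JSON to paste into the Atlas UI. It assumes the embeddings are stored in a field named `embedding` (the default used by the LlamaIndex MongoDB vector store) and that cosine similarity is wanted; adjust both if your setup differs. The dimension count matches the `dimensions=256` passed to `OpenAIEmbedding` earlier.

```python
import json

# Must match OpenAIEmbedding(model="text-embedding-3-small", dimensions=256)
EMBEDDING_DIMENSIONS = 256

vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",  # assumption: default embedding field name
            "numDimensions": EMBEDDING_DIMENSIONS,
            "similarity": "cosine",  # assumption: cosine similarity
        }
    ]
}

# JSON form for the Atlas UI index editor
print(json.dumps(vector_index_definition, indent=2))
```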

Follow MongoDB’s [steps to get the connection string](https://www.mongodb.com/docs/manual/reference/connection-string/) from the Atlas UI. After setting up the database and obtaining the Atlas cluster connection URI, securely store the URI within your development environment.

This guide uses Google Colab, which can store environment secrets securely and make them available inside the notebook. Specifically, the line `mongo_uri = userdata.get("MONGO_URI")` retrieves the URI from that secure storage.
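
Outside Colab, the `google.colab` module is not available. A minimal sketch of one way to handle both cases is shown below: it tries the Colab secret store first and falls back to an ordinary environment variable. The helper name `read_secret` is hypothetical and not part of any library.

```python
import os


def read_secret(name: str) -> str:
    """Return a secret from Colab's store if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    value = None
    if userdata is not None:
        try:
            value = userdata.get(name)
        except Exception:
            # The secret is missing or notebook access was not granted
            value = None

    # Fall back to a regular environment variable
    value = value or os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set")
    return value
```

With this helper, `mongo_uri = read_secret("MONGO_URI")` would work both in Colab and in a local environment where `MONGO_URI` is exported.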

In [None]:
import pymongo
from google.colab import userdata


def get_mongo_client(mongo_uri):
    """Establish a connection to MongoDB and verify it with a ping."""
    try:
        client = pymongo.MongoClient(
            mongo_uri, appname="devrel.showcase.rag_llamaindex_mongodb"
        )
        # MongoClient connects lazily, so ping the server to confirm the connection
        client.admin.command("ping")
        print("Connection to MongoDB successful")
        return client
    except pymongo.errors.ConnectionFailure as e:
        print(f"Connection failed: {e}")
        return None


mongo_uri = userdata.get("MONGO_URI")
if not mongo_uri:
    raise ValueError("MONGO_URI is not set in the Colab secrets")

mongo_client = get_mongo_client(mongo_uri)
if mongo_client is None:
    raise RuntimeError("Could not connect to MongoDB; check MONGO_URI")

DB_NAME = "airbnb"
COLLECTION_NAME = "listings_reviews"

db = mongo_client[DB_NAME]
collection = db[COLLECTION_NAME]

Connection to MongoDB successful


In [None]:
# To ensure we are working with a fresh collection
# delete any existing records in the collection
collection.delete_many({})

DeleteResult({'n': 0, 'electionId': ObjectId('7fffffff0000000000000014'), 'opTime': {'ts': Timestamp(1711370832, 1), 't': 20}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1711370832, 1), 'signature': {'hash': b'\xa3\xe8\xd6\xb8\xe2=\xae\xcc\x8f\xeap\xc6\xca\x14\x97KR\xde\xc0\x18', 'keyId': 7320226449804230661}}, 'operationTime': Timestamp(1711370832, 1)}, acknowledged=True)

In [None]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

vector_store = MongoDBAtlasVectorSearch(
    mongo_client,
    db_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    index_name="vector_index",
)
vector_store.add(nodes)

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store)

In [None]:
import pprint

from llama_index.core.response.notebook_utils import display_response

query_engine = index.as_query_engine(similarity_top_k=3)

query = "I want to stay in a place that's warm and friendly, and not too far from restaurants. Can you recommend a place? Include a reason as to why you've chosen your selection."

response = query_engine.query(query)
display_response(response)
pprint.pprint(response.source_nodes)

**`Final Response:`** I recommend the first option, "Good Home! Near Downtown Montreal, Unlimited Wifi," as it offers a warm and friendly atmosphere with the host treating guests like family members. Additionally, it is close to restaurants, pubs, malls, and other amenities, providing a convenient and welcoming stay for guests looking for a cozy and friendly environment near dining options.

[NodeWithScore(node=TextNode(id_='7bee48d7-9474-40de-9cc1-f7f018a380fa', embedding=None, metadata={'_id': 30104792, 'listing_url': 'https://www.airbnb.com/rooms/30104792', 'name': 'Good Home! Near Downtown Montreal, Unlimited Wifi', 'summary': 'You need to feel at home? Come to our home!! Cozy Room With Wifi In A Family Resident. Possibility of breakfast at $6 per day, per guest if desired. Home near restaurants, pubs, malls etc.', 'space': "Home away from home! Close to bus stops, Metro Angrignon, walking distance to nice restaurants e.g buffet, grocery stores etc. Close to Malls, downtown, movie theatres, recreation parks, Airport. Decent, peaceful, clean area in Lasalle. Parking available. 2 bedrooms in same house for hosting guests. I'm a down to earth peace-loving person with lovely family who will make your stay fun! Pick up from Trudeau Airport/bus stations is available for $50 fees. Please message host to discuss, thanks.", 'description': "You need to feel at home? Come to our 