# RAG System with LlamaIndex, OpenAI, and the MongoDB Vector Database

This notebook implements an end-to-end RAG system using the "POLM" (Python, OpenAI, LlamaIndex, MongoDB) AI stack. The AI stack, or GenAI stack, refers to the combination of models, databases, libraries, and frameworks used to build and develop modern applications with generative AI capabilities.

The components of the AI stack include models, orchestrators or integrators, and operational and vector databases. This project uses OpenAI's `text-embedding-3-small` model for embeddings, `LlamaIndex` as the orchestrator, and `MongoDB` as both the operational and the vector database.



The project follows these steps:

- Load the dataset from Hugging Face
- Create embeddings using the OpenAI embedding model
- Set up a vector database in MongoDB to store the vector embeddings
- Establish a connection to this database
- Create a vector search index for queries

 
Required libraries:
- `LlamaIndex`: a data framework that provides functionality for connecting data sources (files, PDFs, websites) to both closed-source LLMs (OpenAI, Cohere) and open-source LLMs (Llama).
- `LlamaIndex` for `MongoDB`: a LlamaIndex extension library that provides the methods needed to connect to and work with MongoDB Atlas.
- `LlamaIndex` for `OpenAI`: a LlamaIndex extension library that provides the methods needed to access OpenAI's embedding models.
- `PyMongo`: the Python library for interacting with MongoDB: connecting to a cluster and querying data stored in collections and documents.
- `Hugging Face datasets`: the Hugging Face library that provides access to many datasets.
- `Pandas`: for efficient data processing and analysis in Python.

In [None]:
%pip install llama-index
%pip install llama-index-vector-stores-mongodb
%pip install llama-index-embeddings-openai
%pip install pymongo
%pip install datasets
%pip install pandas
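
After installing, it can be useful to confirm that the packages are importable before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list of import names is an assumption based on the packages installed above.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, when a parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Top-level import names assumed for the packages installed above.
required = ["llama_index", "pymongo", "datasets", "pandas"]
print("Missing modules:", missing_modules(required))
```

An empty list means every module was found; any name that is printed points to an install step that needs to be repeated.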


In [1]:
import getpass
import os

In [2]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
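
When a notebook is re-run often, prompting for the key every time becomes tedious. The sketch below prompts only when the variable is not already set. The helper name `ensure_secret` and its `prompt` and `env` parameters are hypothetical and exist only to make the idea testable; it is not part of any of the libraries used here.

```python
import getpass
import os


def ensure_secret(name, prompt=getpass.getpass, env=os.environ):
    """Return env[name], asking for it interactively only if it is not set."""
    if not env.get(name):
        env[name] = prompt(f"{name}: ")
    return env[name]


# Example (interactive, so left commented out):
# ensure_secret("OPENAI_API_KEY")
```

The same helper could be reused for the MongoDB connection string requested later in the notebook.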

In [3]:
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/AIatMongoDB/embedded_movies

ds = load_dataset("MongoDB/embedded_movies")



In [4]:
# Convert the dataset to a pandas dataframe

dataset_df = pd.DataFrame(ds['train'])

dataset_df.head()

Unnamed: 0,plot,runtime,genres,fullplot,directors,writers,countries,poster,languages,cast,title,num_mflix_comments,rated,imdb,awards,type,metacritic,plot_embedding
0,Young Pauline is left a lot of money when her ...,199.0,[Action],Young Pauline is left a lot of money when her ...,"[Louis J. Gasnier, Donald MacKenzie]","[Charles W. Goddard (screenplay), Basil Dickey...",[USA],https://m.media-amazon.com/images/M/MV5BMzgxOD...,[English],"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",The Perils of Pauline,0,,"{'id': 4465, 'rating': 7.6, 'votes': 744}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[0.0007293965299999999, -0.026834568000000003,..."
1,A penniless young man tries to save an heiress...,22.0,"[Comedy, Short, Action]",As a penniless man worries about how he will m...,"[Alfred J. Goulding, Hal Roach]",[H.M. Walker (titles)],[USA],https://m.media-amazon.com/images/M/MV5BNzE1OW...,[English],"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...",From Hand to Mouth,0,TV-G,"{'id': 10146, 'rating': 7.0, 'votes': 639}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,,"[-0.022837115, -0.022941574000000003, 0.014937..."
2,"Michael ""Beau"" Geste leaves England in disgrac...",101.0,"[Action, Adventure, Drama]","Michael ""Beau"" Geste leaves England in disgrac...",[Herbert Brenon],"[Herbert Brenon (adaptation), John Russell (ad...",[USA],,[English],"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...",Beau Geste,0,,"{'id': 16634, 'rating': 6.9, 'votes': 222}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[0.00023330492999999998, -0.028511643000000003..."
3,"Seeking revenge, an athletic young man joins t...",88.0,"[Adventure, Action]",A nobleman vows to avenge the death of his fat...,[Albert Parker],"[Douglas Fairbanks (story), Jack Cunningham (a...",[USA],https://m.media-amazon.com/images/M/MV5BMzU0ND...,,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...",The Black Pirate,1,,"{'id': 16654, 'rating': 7.2, 'votes': 1146}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[-0.005927917, -0.033394486, 0.0015323418, -0...."
4,An irresponsible young millionaire changes his...,58.0,"[Action, Comedy, Romance]","The Uptown Boy, J. Harold Manners (Lloyd) is a...",[Sam Taylor],"[Ted Wilde (story), John Grey (story), Clyde B...",[USA],https://m.media-amazon.com/images/M/MV5BMTcxMT...,[English],"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...",For Heaven's Sake,0,PASSED,"{'id': 16895, 'rating': 7.6, 'votes': 918}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,,"[-0.0059373598, -0.026604708, -0.0070914757000..."


In [5]:
dataset_df.shape

(1500, 18)

In [6]:
dataset_df.columns

Index(['plot', 'runtime', 'genres', 'fullplot', 'directors', 'writers',
       'countries', 'poster', 'languages', 'cast', 'title',
       'num_mflix_comments', 'rated', 'imdb', 'awards', 'type', 'metacritic',
       'plot_embedding'],
      dtype='object')

In [7]:
# Remove rows where the plot or fullplot column is missing

dataset_df = dataset_df.dropna(subset=['plot', 'fullplot'])

print("\nNumber of missing values in each column after removal:")

print(dataset_df.isnull().sum())

# Drop the existing plot_embedding column: new embeddings will be created with the OpenAI "text-embedding-3-small" model

dataset_df = dataset_df.drop(columns=['plot_embedding'])

dataset_df.head()


Number of missing values in each column after removal:
plot                    0
runtime                14
genres                  0
fullplot                0
directors              12
writers                13
countries               0
poster                 78
languages               1
cast                    1
title                   0
num_mflix_comments      0
rated                 279
imdb                    0
awards                  0
type                    0
metacritic            893
plot_embedding          1
dtype: int64


Unnamed: 0,plot,runtime,genres,fullplot,directors,writers,countries,poster,languages,cast,title,num_mflix_comments,rated,imdb,awards,type,metacritic
0,Young Pauline is left a lot of money when her ...,199.0,[Action],Young Pauline is left a lot of money when her ...,"[Louis J. Gasnier, Donald MacKenzie]","[Charles W. Goddard (screenplay), Basil Dickey...",[USA],https://m.media-amazon.com/images/M/MV5BMzgxOD...,[English],"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",The Perils of Pauline,0,,"{'id': 4465, 'rating': 7.6, 'votes': 744}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,
1,A penniless young man tries to save an heiress...,22.0,"[Comedy, Short, Action]",As a penniless man worries about how he will m...,"[Alfred J. Goulding, Hal Roach]",[H.M. Walker (titles)],[USA],https://m.media-amazon.com/images/M/MV5BNzE1OW...,[English],"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...",From Hand to Mouth,0,TV-G,"{'id': 10146, 'rating': 7.0, 'votes': 639}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,
2,"Michael ""Beau"" Geste leaves England in disgrac...",101.0,"[Action, Adventure, Drama]","Michael ""Beau"" Geste leaves England in disgrac...",[Herbert Brenon],"[Herbert Brenon (adaptation), John Russell (ad...",[USA],,[English],"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...",Beau Geste,0,,"{'id': 16634, 'rating': 6.9, 'votes': 222}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,
3,"Seeking revenge, an athletic young man joins t...",88.0,"[Adventure, Action]",A nobleman vows to avenge the death of his fat...,[Albert Parker],"[Douglas Fairbanks (story), Jack Cunningham (a...",[USA],https://m.media-amazon.com/images/M/MV5BMzU0ND...,,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...",The Black Pirate,1,,"{'id': 16654, 'rating': 7.2, 'votes': 1146}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,
4,An irresponsible young millionaire changes his...,58.0,"[Action, Comedy, Romance]","The Uptown Boy, J. Harold Manners (Lloyd) is a...",[Sam Taylor],"[Ted Wilde (story), John Grey (story), Clyde B...",[USA],https://m.media-amazon.com/images/M/MV5BMTcxMT...,[English],"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...",For Heaven's Sake,0,PASSED,"{'id': 16895, 'rating': 7.6, 'votes': 918}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,
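
The filtering done by `dropna(subset=['plot', 'fullplot'])` above can be pictured in plain Python: a record survives only if every required field is present. This is an illustration of the idea on toy records, not how pandas implements it; the `drop_missing` helper and the sample data are hypothetical.

```python
def drop_missing(records, required_fields):
    """Keep only the records whose required fields are all present (not None)."""
    return [
        record
        for record in records
        if all(record.get(field) is not None for field in required_fields)
    ]


# Toy records standing in for dataset rows.
movies = [
    {"title": "A", "plot": "short plot", "fullplot": "long plot"},
    {"title": "B", "plot": None, "fullplot": "long plot"},
    {"title": "C", "plot": "short plot"},  # fullplot key is absent
]
kept = drop_missing(movies, ["plot", "fullplot"])
print([movie["title"] for movie in kept])
```

In the notebook, this step reduces the dataset from 1500 to 1452 rows.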


In [8]:
dataset_df.shape

(1452, 17)

In [9]:
df = dataset_df.head(100)

In [10]:
from llama_index.core.settings import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

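# Embedding model: text-embedding-3-small, shortened to 256 dimensions
embed_model = OpenAIEmbedding(model="text-embedding-3-small", dimensions=256)

# LLM with the integration's default settings
llm = OpenAI()

# Register both as the global defaults used by LlamaIndex
Settings.llm = llm
Settings.embed_model = embed_model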

import json
from llama_index.core import Document
from llama_index.core.schema import MetadataMode

# Convert the DataFrame to a JSON string representation
documents_json = df.to_json(orient='records')
# Load the JSON string into a Python list of dictionaries
documents_list = json.loads(documents_json)
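
The `dimensions=256` argument above asks the API for shortened embeddings. The shortening itself happens on the OpenAI side; the sketch below only illustrates the general idea, assuming a truncate-then-renormalize scheme, on a toy vector. The function name `shorten_embedding` is hypothetical.

```python
import math


def shorten_embedding(vector, dimensions):
    """Truncate a vector to `dimensions` entries and rescale it to unit length."""
    truncated = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        return truncated  # nothing to rescale
    return [x / norm for x in truncated]


# Toy 3-dimensional "embedding" shortened to 2 dimensions.
short = shorten_embedding([3.0, 4.0, 12.0], 2)
print(short)
```

Whatever dimension is chosen here has to match the dimension declared in the vector search index, since the index compares vectors of a fixed length.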

In [11]:
llama_documents = []

for document in documents_list:

  # Value for metadata must be one of (str, int, float, None)
  document["writers"] = json.dumps(document["writers"])
  document["languages"] = json.dumps(document["languages"])
  document["genres"] = json.dumps(document["genres"])
  document["cast"] = json.dumps(document["cast"])
  document["directors"] = json.dumps(document["directors"])
  document["countries"] = json.dumps(document["countries"])
  document["imdb"] = json.dumps(document["imdb"])
  document["awards"] = json.dumps(document["awards"])


  # Create a Document object with the text and excluded metadata for llm and embedding models
  llama_document = Document(
      text=document["fullplot"],
      metadata=document,
      excluded_llm_metadata_keys=["fullplot", "metacritic"],
      excluded_embed_metadata_keys=["fullplot", "metacritic", "poster", "num_mflix_comments", "runtime", "rated"],
      metadata_template="{key}=>{value}",
      text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
      )

  llama_documents.append(llama_document)

# Observing an example of what the LLM and Embedding model receive as input
print(
    "\nThe LLM sees this: \n",
    llama_documents[0].get_content(metadata_mode=MetadataMode.LLM),
)
print(
    "\nThe Embedding model sees this: \n",
    llama_documents[0].get_content(metadata_mode=MetadataMode.EMBED),
)


The LLM sees this: 
 Metadata: plot=>Young Pauline is left a lot of money when her wealthy uncle dies. However, her uncle's secretary has been named as her guardian until she marries, at which time she will officially take ...
runtime=>199.0
genres=>["Action"]
directors=>["Louis J. Gasnier", "Donald MacKenzie"]
writers=>["Charles W. Goddard (screenplay)", "Basil Dickey (screenplay)", "Charles W. Goddard (novel)", "George B. Seitz", "Bertram Millhauser"]
countries=>["USA"]
poster=>https://m.media-amazon.com/images/M/MV5BMzgxODk1Mzk2Ml5BMl5BanBnXkFtZTgwMDg0NzkwMjE@._V1_SY1000_SX677_AL_.jpg
languages=>["English"]
cast=>["Pearl White", "Crane Wilbur", "Paul Panzer", "Edward Jos\u00e8"]
title=>The Perils of Pauline
num_mflix_comments=>0
rated=>None
imdb=>{"id": 4465, "rating": 7.6, "votes": 744}
awards=>{"nominations": 0, "text": "1 win.", "wins": 1}
type=>movie
-----
Content: Young Pauline is left a lot of money when her wealthy uncle dies. However, her uncle's secretary has been named as
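
To make the effect of `metadata_template`, `text_template`, and the two exclusion lists more concrete, here is a plain-Python approximation of how the final string is assembled. It is a sketch under the assumption that metadata lines are joined with newlines, and it is not LlamaIndex's actual implementation; `render_document` and the toy metadata are hypothetical.

```python
def render_document(
    text,
    metadata,
    excluded_keys,
    metadata_template="{key}=>{value}",
    text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
):
    """Approximate how per-key metadata lines and the text are combined."""
    lines = [
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
        if key not in excluded_keys
    ]
    return text_template.format(metadata_str="\n".join(lines), content=text)


toy_metadata = {"title": "The Perils of Pauline", "runtime": 199.0, "fullplot": "..."}

# The LLM view excludes fewer keys than the embedding view.
llm_view = render_document("Young Pauline ...", toy_metadata, ["fullplot"])
embed_view = render_document("Young Pauline ...", toy_metadata, ["fullplot", "runtime"])
print(llm_view)
print(embed_view)
```

Keeping fields such as `runtime` or `poster` out of the embedding input means that only semantically useful text influences the vector, while the LLM can still see them when composing an answer.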

In [12]:
llama_documents[25]

Document(id_='a23efdd4-d62f-4c68-ad78-0961b46d69ad', embedding=None, metadata={'plot': 'A night club owner becomes infatuated with a torch singer and frames his best friend/manager for embezzlement when the chanteuse falls in love with him.', 'runtime': 95.0, 'genres': '["Action", "Drama", "Film-Noir"]', 'fullplot': 'Jefty, owner of a roadhouse in a backwoods town, hires sultry, tough-talking torch singer Lily Stevens against the advice of his manager Pete Morgan. Jefty is smitten with Lily, who in turn exerts her charms on the more resistant Pete. When Pete finally falls for her and she turns down Jefty\'s marriage proposal, they must face Jefty\'s murderous jealousy and his twisted plots to "punish" the two.', 'directors': '["Jean Negulesco"]', 'writers': '["Edward Chodorov (screen play)", "Margaret Gruen (story)", "Oscar Saul (story)"]', 'countries': '["USA"]', 'poster': 'https://m.media-amazon.com/images/M/MV5BMjc1ZTNkM2UtYzY3Yi00ZWZmLTljYmEtNjYxZDNmYzk2ZjkzXkEyXkFqcGdeQXVyMjUxODE0

In [13]:
from llama_index.core.node_parser import SentenceSplitter
import time
import openai

# Split the documents into nodes (chunks) using the default splitter settings
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(llama_documents)

access_count = 0
max_accesses = 3  # Pause after every 3 requests
sleep_time = 20   # Wait time in seconds

for node in nodes:
    while True:
        try:
            if access_count >= max_accesses:
                time.sleep(sleep_time)  # Wait before starting a new cycle
                access_count = 0  # Reset the request counter

            # Generate the embedding for this node
            node_embedding = embed_model.get_text_embedding(
                node.get_content(metadata_mode="all")
            )
            node.embedding = node_embedding
            access_count += 1
            break  # Leave the while loop once the call succeeds

        except openai.APIError as e:
            # API error (for example, a rate limit): wait, then retry this node
            print(f"OpenAI API returned an error: {e}. Waiting {sleep_time} seconds...")
            time.sleep(sleep_time)
            access_count = 0  # Reset the counter to start a new cycle
        except Exception as e:
            # Unexpected error: report it and skip this node (it is left without an embedding)
            print(f"Unexpected error: {e}")
            break
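
The loop above waits a fixed 20 seconds after an error. A common alternative is exponential backoff, where the wait grows after each failed attempt. The following is a minimal sketch of that technique, not what the notebook uses; `call_with_backoff` and its parameters are hypothetical names, and the `sleep` argument is injectable only so the function can be exercised without real delays.

```python
import random
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep,
                      retry_on=(Exception,)):
    """Call func(), retrying with exponentially growing, slightly jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles on each attempt; the jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

In this notebook it could wrap the embedding call, for example `call_with_backoff(lambda: embed_model.get_text_embedding(text), retry_on=(openai.APIError,))`, assuming `text` holds the node content.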



In [14]:
os.environ["URI"] = getpass.getpass("URI:")

In [15]:
from pymongo.mongo_client import MongoClient
from pymongo.errors import ConnectionFailure

# Establish the connection. MongoClient connects lazily, so a ping is
# needed to confirm that the cluster is actually reachable.
try:
    uri = os.environ["URI"]
    connect = MongoClient(uri)
    connect.admin.command("ping")
    print("MongoDB cluster is reachable")
    print(connect)
except ConnectionFailure as e:
    print("Could not connect to MongoDB")
    print(e)

MongoDB cluster is reachable
MongoClient(host=['ac-fx09sp5-shard-00-01.pnon21i.mongodb.net:27017', 'ac-fx09sp5-shard-00-00.pnon21i.mongodb.net:27017', 'ac-fx09sp5-shard-00-02.pnon21i.mongodb.net:27017'], document_class=dict, tz_aware=False, connect=True, retrywrites=True, w='majority', appname='myAtlasClusterEDU', authsource='admin', replicaset='atlas-mrk941-shard-0', tls=True)


In [16]:
from pymongo.mongo_client import MongoClient

# Create a new client and connect to the server
client = MongoClient(uri)
# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [17]:
mongo_client = MongoClient(uri)

DB_NAME="movies"
COLLECTION_NAME="movies_records"

db = mongo_client[DB_NAME]
collection = db[COLLECTION_NAME]

In [19]:
# To ensure we are working with a fresh collection 
# delete any existing records in the collection

collection.delete_many({})


DeleteResult({'n': 10, 'electionId': ObjectId('7fffffff0000000000000098'), 'opTime': {'ts': Timestamp(1725036858, 2), 't': 152}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1725036858, 11), 'signature': {'hash': b'^c\x18\xf8\xe25\xd0\xeb\xe6\xb8\xe0\xec\x07H\xda/N\xc9\xa9\xcb', 'keyId': 7342970547904446472}}, 'operationTime': Timestamp(1725036858, 2)}, acknowledged=True)

In [20]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

# `index_name` is deprecated in this integration; `vector_index_name` replaces it
vector_store = MongoDBAtlasVectorSearch(
    mongo_client,
    db_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    vector_index_name="vector_index",
)
vector_store.add(nodes)


['921eaac9-ee7f-4dc8-a68e-863c40cde387',
 '5f018535-1991-43d6-88e8-8b0d1923ff97',
 '87edeb48-75f2-4c17-889b-92bbe699b8ef',
 'e1830968-81a3-444e-9e0d-515056c2a888',
 '277a35ab-83f2-4777-9fc8-60d90d13dc94',
 '479a2e66-129c-4b55-9962-bd938c4d6b50',
 '819b64d8-dfa3-49c6-9a79-e64a3e5810c6',
 '670dda43-87e5-4498-b9c1-74edc69eedd7',
 '7bf21f28-1744-402d-971b-9619c8436e30',
 '8920b2ae-5f19-4b52-acbf-a3f66dcf0f41',
 '249592e4-7fee-4889-8e6f-5bdfb8d9e3fa',
 '38d9e71c-c6a9-4387-84d3-2cb781616723',
 '272edd0d-fd84-4310-8b5d-a048273b5ed2',
 'df6585d5-e4d0-46e9-9cb9-1a3f4c2fef42',
 'd6bf47dd-f92f-4512-932c-f497c4c667ff',
 '1ead5494-e96d-458b-804c-0ac9005c8022',
 '48e85dcf-c333-43f8-a1f3-13db30fd45fa',
 '1020b6f9-3593-481d-b7e6-1b38fecb7ed0',
 'd349d272-5ab5-4c0a-b81a-7ee946907a36',
 'cf08004c-37fe-4f39-8dfe-eb534f7b87de',
 '22410e59-4ae6-4e90-9370-ed205626eb6f',
 '09f637c2-1edd-4c65-a383-93999b94870a',
 '01044a48-7bb3-4fc7-8003-b4bcd4472429',
 'a19b2fe4-9382-4805-ac92-8c4f1fb879bb',
 '0b069e88-2818-
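
The vector store above refers to an index named `vector_index`, but the notebook does not show how that index is defined. Below is a sketch of what an Atlas Vector Search index definition for this data could look like, built as a plain Python dictionary. The field path `embedding` and the `cosine` similarity are assumptions (verify the field name against the documents actually stored in the collection); the 256 dimensions match the `dimensions=256` setting of the embedding model. The helper name `vector_index_definition` is hypothetical.

```python
def vector_index_definition(path="embedding", num_dimensions=256, similarity="cosine"):
    """Build an Atlas Vector Search index definition as a plain dictionary."""
    return {
        "fields": [
            {
                "type": "vector",            # marks this field as a vector field
                "path": path,                # document field holding the embedding (assumed)
                "numDimensions": num_dimensions,
                "similarity": similarity,    # assumed; other metrics are available
            }
        ]
    }


definition = vector_index_definition()
print(definition)
```

A definition like this can be entered in the Atlas UI's JSON editor or, with a recent PyMongo, passed to the collection's search index creation method; check the documentation for your driver version before relying on either. The index must exist and finish building before queries return results.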

In [24]:
import pprint
from llama_index.core.response.notebook_utils import display_response
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store)

query_engine = index.as_query_engine(similarity_top_k=3)
query = "Recommend a fantasy movie to watch with children and justify your choice."
response = query_engine.query(query)
display_response(response)
pprint.pprint(response.source_nodes)


**`Final Response:`** "Jungle Book" would be a great fantasy movie to watch with children, since it tells the story of Mowgli, a boy raised by wolves who tries to adapt to life in the human village. The narrative involves exciting adventures in the jungle, unusual friendships with animals, and valuable lessons about acceptance and understanding. In addition, the film has an Approved rating, which makes it suitable for children of all ages.

[NodeWithScore(node=TextNode(id_='40b06a90-855f-47e6-b4ba-c2473122bfbb', embedding=None, metadata={'plot': 'Period piece about a Brazil that is no more. This movie is the sequel to "God and the Devil in the Land of the Sun" (Deus e o diabo na terra do sol), and takes place 29 years after Antonio ...', 'runtime': 100.0, 'genres': '["Action", "Crime", "Drama"]', 'fullplot': 'Period piece about a Brazil that is no more. This movie is the sequel to "God and the Devil in the Land of the Sun" (Deus e o diabo na terra do sol), and takes place 29 years after Antonio das Mortes killed Corisco (the "Blond Devil"), last of the Cangaceiros. In "the old days", Antonio\'s function in life was exterminate these bandits, on account of his personal grudges against them. His life had been meaningless for the last 29 years, but now, a new challenge awaits him. When a Cangaceiro appears in Jardim Das Piranhas, the local Land Baron (Jofre Soares), an old man, does what seems obvious to him: he calls Antoni
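
The `similarity_top_k=3` setting means the three stored nodes most similar to the query embedding are retrieved and handed to the LLM as context. The sketch below shows that idea with a brute-force cosine similarity search over toy vectors. It is only an illustration: Atlas performs the search server-side with its own index structures, and the document ids and vectors here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=3):
    """Return the k (id, score) pairs most similar to query_vector."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in stored.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors standing in for stored node embeddings.
stored = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.1, 0.2, 0.9],
    "doc_d": [0.8, 0.3, 0.1],
}
results = top_k([1.0, 0.0, 0.0], stored, k=3)
print([doc_id for doc_id, _ in results])
```

The source nodes printed above play the role of `results` here: each `NodeWithScore` pairs a retrieved node with its similarity score.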