# RAG (Retrieval-Augmented Generation) with LangChain and OpenAI

This notebook demonstrates how to build a RAG pipeline with LangChain, OpenAI, and ChromaDB to answer questions based on real TV reviews. The flow includes:

- Initializing the OpenAI large language model (LLM)
- Loading and preparing the review data
- Splitting the texts into smaller chunks for better indexing
- Creating semantic embeddings with OpenAI
- Indexing the chunks in a vector store (ChromaDB)
- Running a semantic search for the chunks most relevant to a query
- Generating answers with the LLM, using context retrieved from the reviews

The goal is to illustrate how to combine semantic search and text generation to build QA systems whose answers are grounded in evidence.

In [None]:
# Install the dependencies required for embeddings and the vector store
#!pip install --quiet chromadb==0.4.12 tiktoken==0.4.0

In [None]:
# Import the main LangChain components for RAG
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain import LLMChain
from langchain.chains.question_answering import load_qa_chain

In [None]:
# Set the OpenAI API key and configure the environment
import getpass

# Prompt the user for the API key (input is not echoed)
api_key = getpass.getpass("Enter your API Key: ").strip()

import warnings
warnings.filterwarnings("ignore")

import os
os.environ["OPENAI_API_KEY"] = api_key
os.environ["OPENAI_API_BASE"] = "https://openai.vocareum.com/v1"

First, initialize your LLM

In [None]:
# Initialize the OpenAI language model (temperature=0 for more deterministic answers)
llm = OpenAI(model_name="gpt-3.5-turbo", temperature=0, max_tokens=1000)

Then, load reviews from tv-reviews.csv

In [None]:
# Load the review documents from the CSV file
loader = CSVLoader(file_path="data/tv-reviews.csv", source_column="Review Text")
docs = loader.load()
print(docs)

[Document(page_content="TV Name: Imagix Pro\nReview Title: Amazing Picture Quality\nReview Rating: 9\nReview Text: I recently purchased the Imagix Pro and I am blown away by its picture quality. The colors are vibrant and the images are crystal clear. It feels like I'm watching movies in a theater! The sound is also impressive, creating a truly immersive experience. Highly recommended!", metadata={'source': "I recently purchased the Imagix Pro and I am blown away by its picture quality. The colors are vibrant and the images are crystal clear. It feels like I'm watching movies in a theater! The sound is also impressive, creating a truly immersive experience. Highly recommended!", 'row': 0}), Document(page_content="TV Name: Imagix Pro\nReview Title: Impressive Features\nReview Rating: 8\nReview Text: The Imagix Pro is packed with impressive features that enhance my viewing experience. The smart functionality allows me to easily stream my favorite shows and movies. The remote control is u
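
The output above shows that each CSV row becomes one document whose text holds one `column: value` line per column, and that `source_column="Review Text"` puts the review text itself into `metadata['source']`. The sketch below reproduces that row-to-text step with the standard library only. The `rows_to_docs` helper and the inline sample row are hypothetical; they mimic the shape seen in the output and are not `CSVLoader`'s actual implementation.

```python
import csv
import io

# Hypothetical inline sample that mimics the columns seen in the output above
SAMPLE_CSV = (
    "TV Name,Review Title,Review Rating,Review Text\n"
    "Imagix Pro,Amazing Picture Quality,9,The colors are vibrant.\n"
)

def rows_to_docs(csv_text, source_column):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # One "column: value" line per column, joined with newlines
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        metadata = {"source": row[source_column], "row": i}
        docs.append({"page_content": content, "metadata": metadata})
    return docs

sample_docs = rows_to_docs(SAMPLE_CSV, source_column="Review Text")
print(sample_docs[0]["page_content"])
print(sample_docs[0]["metadata"])
```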

Split the documents you loaded into smaller chunks

In [None]:
# Split the documents into smaller chunks for better indexing
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
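
To build intuition for what a character-based splitter does, here is a simplified sketch: split on a separator, then greedily merge the pieces back together up to `chunk_size` characters. The `split_on_separator` helper is hypothetical and ignores `chunk_overlap`; it approximates the idea rather than reproducing `CharacterTextSplitter`. It assumes a blank-line separator (`"\n\n"`), which is the library's default as far as I know. Under that assumption, a short review with only single newlines stays in one chunk.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Keep growing the current chunk (an oversized single piece is kept whole)
            current = candidate
        else:
            # Adding this piece would exceed the limit: close the chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# A made-up review with single newlines only: no separator, so it stays in one chunk
toy_review = "TV Name: Imagix Pro\nReview Rating: 9\nReview Text: Great picture."
review_chunks = split_on_separator(toy_review, chunk_size=1000)

# Made-up paragraphs separated by blank lines, with a tiny chunk size to force a split
paragraph_chunks = split_on_separator("aaaa\n\nbbbb\n\ncccc", chunk_size=10)
print(len(review_chunks), paragraph_chunks)
```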

Now, initialize your embeddings

In [None]:
# Initialize the OpenAI embeddings model
embeddings = OpenAIEmbeddings()
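
An embeddings model maps each text to a fixed-length vector so that texts with similar meaning end up close together. The sketch below uses made-up 3-dimensional vectors (real embeddings have far more dimensions) to show cosine similarity, and checks that for unit-length vectors the squared L2 distance equals `2 - 2 * cosine`. The practical consequence is that, when embeddings are normalized to unit length, ranking by either metric gives the same order.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors standing in for real embeddings of three phrases
picture = normalize([0.9, 0.1, 0.2])
colors = normalize([0.8, 0.2, 0.1])
remote = normalize([0.1, 0.9, 0.3])

cos_close = cosine_similarity(picture, colors)
cos_far = cosine_similarity(picture, remote)
print(cos_close > cos_far)

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity
identity_holds = math.isclose(squared_l2(picture, colors), 2 - 2 * cos_close, abs_tol=1e-12)
print(identity_holds)
```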

Initialize your vector db with your embeddings model and populate it with your text chunks

In [None]:
# Populate the ChromaDB vector store with the chunks and their embeddings
db = Chroma.from_documents(texts, embeddings)

Query your vector database for the 5 most semantically similar chunks

In [None]:
# Define the query
query = """
    Based on the reviews in the context, tell me what people liked about the picture quality.
    Make sure you do not paraphrase the reviews, and only use the information provided in the reviews.
    """
# Retrieve the 5 chunks most semantically similar to the query
db_results = db.similarity_search(query, k=5)
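
Conceptually, a similarity search embeds the query and returns the `k` stored chunks whose vectors are nearest to it. The brute-force sketch below illustrates that ranking step with made-up 2-dimensional vectors and a hypothetical `top_k` helper. A real vector store would typically use an approximate nearest-neighbor index instead of scanning every vector.

```python
import heapq

def top_k(query_vec, indexed, k):
    """Return the k (text, distance) pairs closest to query_vec by squared L2 distance."""
    def dist(vec):
        return sum((q - x) ** 2 for q, x in zip(query_vec, vec))
    scored = ((text, dist(vec)) for text, vec in indexed)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Made-up (text, vector) pairs standing in for embedded review chunks
toy_index = [
    ("vibrant colors", [0.9, 0.1]),
    ("crystal clear images", [0.8, 0.2]),
    ("remote is easy to use", [0.1, 0.9]),
    ("fast delivery", [0.2, 0.7]),
]
toy_query_vec = [1.0, 0.0]  # made-up embedding of a picture-quality question

nearest = top_k(toy_query_vec, toy_index, k=2)
print([text for text, _ in nearest])
```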

Combined, they should provide enough information to answer our question about picture quality

In [None]:
# Generate the final answer using the LLM and the most relevant chunks
use_chain_helper = False
if use_chain_helper:
    # Option True: use LangChain's ready-made RetrievalQA chain
    # (this path retrieves chunks itself through the retriever instead of reusing db_results)
    rag = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=db.as_retriever())
    print(rag.run(query))
else:
    # Option False: build a custom prompt and pass the retrieved chunks to a QA chain
    prompt = PromptTemplate(
        template="{query}\nContext: {context}",
        input_variables=["query", "context"],
    )
    chain = load_qa_chain(llm, prompt=prompt, chain_type="stuff")
    print(chain.run(input_documents=db_results, query=query))

People liked the vibrant colors and crystal clear images of the Imagix Pro, with some mentioning that it feels like watching movies in a theater. The clarity and sharpness of the details were also praised, with one reviewer mentioning that it enhances their movie-watching experience. The colors were described as vibrant and realistic, making everything look stunning.
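
The `"stuff"` chain type places all retrieved chunks into a single prompt. A minimal sketch of that assembly step, reusing the template from the cell above: the `build_stuffed_prompt` helper, the blank-line separator, and the sample chunk texts are assumptions for illustration, not LangChain's internal code.

```python
TEMPLATE = "{query}\nContext: {context}"

def build_stuffed_prompt(query_text, chunk_texts, template=TEMPLATE, separator="\n\n"):
    """Join all chunk texts into one context string and fill the prompt template."""
    context = separator.join(chunk_texts)
    return template.format(query=query_text, context=context)

# Made-up chunk texts standing in for the page_content of retrieved documents
toy_prompt = build_stuffed_prompt(
    "What did people like about the picture quality?",
    ["The colors are vibrant.", "Images are crystal clear."],
)
print(toy_prompt)
```

Because every chunk is concatenated into one prompt, this strategy only works while the combined text fits in the model's context window, which is one reason to cap the number of retrieved chunks.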
