In this exercise, we'll re-implement the functionality of VectorstoreIndexCreator using lower-level LangChain abstractions. Our goal is to find out what people liked about the picture quality of our fictional TVs.

We will:
- Use a Loader to load our fictional reviews.
- Split the reviews into smaller chunks with a TextSplitter.
- Use Embeddings to turn the chunks into vectors.
- Store the vectors in a Chroma vector database.
- Run a semantic similarity search to find the vectors closest to our query.
- Send the text associated with those vectors to the LLM for summarization.

By splitting the reviews into smaller chunks and retrieving only the chunks that relate to picture quality, we feed the LLM just the text it needs to answer our question, which works around the model's limited context window.

In [1]:
import os

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate

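# The loader, splitter, and vector store live in their own packages in recent LangChain releases
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import Chroma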

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

from langchain.chains.combine_documents.base import DEFAULT_DOCUMENT_PROMPT, DEFAULT_DOCUMENT_SEPARATOR, DOCUMENTS_KEY
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import format_document



base_url = "https://openai.vocareum.com/v1"
api_key = os.environ.get("OPENAI_API_KEY")

First, initialize your LLM

In [3]:
# TODO: initialize your LLM
model_name = "gpt-4o"
temperature = 0.0
max_tokens = 2000

llm = ChatOpenAI(base_url=base_url, api_key=api_key, model=model_name, temperature=temperature, max_tokens=max_tokens)

Then, load reviews from tv-reviews.csv
- CSVLoader: [Documentation](https://python.langchain.com/docs/integrations/document_loaders/csv/),
[API Documentation](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.csv_loader.CSVLoader.html)

In [4]:
# TODO: load your documents
file_path = "./tv-reviews.csv"
loader = CSVLoader(file_path=file_path)
docs = loader.load()


for i, row in enumerate(docs):
    print(f"\n========== ROW {i} ==========")
    print("===== METADATA =====")
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"source: {row.metadata['source']}")
    print(f"row: {row.metadata['row']}")
    print("===== CONTENT =====")
    print(f"Type: {type(row.page_content)}")
    print(row.page_content)

[Document(metadata={'source': './tv-reviews.csv', 'row': 0}, page_content="TV Name: Imagix Pro\nReview Title: Amazing Picture Quality\nReview Rating: 9\nReview Text: I recently purchased the Imagix Pro and I am blown away by its picture quality. The colors are vibrant and the images are crystal clear. It feels like I'm watching movies in a theater! The sound is also impressive, creating a truly immersive experience. Highly recommended!"), Document(metadata={'source': './tv-reviews.csv', 'row': 1}, page_content="TV Name: Imagix Pro\nReview Title: Impressive Features\nReview Rating: 8\nReview Text: The Imagix Pro is packed with impressive features that enhance my viewing experience. The smart functionality allows me to easily stream my favorite shows and movies. The remote control is user-friendly and has convenient shortcuts. The slim design is sleek and fits perfectly in my living room. The only downside is that the sound could be better, but overall, I'm satisfied."), Document(metadat

Split the documents you loaded into smaller chunks
- CharacterTextSplitter: [Documentation](https://python.langchain.com/docs/how_to/character_text_splitter/),
[API Documentation](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html)
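To make the splitting step concrete, here is a minimal plain-Python sketch of the general idea behind a character-based splitter: split on a separator, then greedily merge the pieces until the next one would exceed `chunk_size`. The function `split_text` is a hypothetical illustration, not LangChain's implementation, and it ignores `chunk_overlap`.

```python
def split_text(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedy sketch: split on `separator`, then merge pieces up to `chunk_size` characters."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it is an oversized first piece that
            # cannot be split any further on this separator.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# A short review has no blank lines and fits in one chunk.
review = "TV Name: Imagix Pro\nReview Title: Amazing Picture Quality\nReview Rating: 9"
short_chunks = split_text(review, chunk_size=1000)

# Three 40-character paragraphs with a 100-character limit need two chunks.
long_text = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
long_chunks = split_text(long_text, chunk_size=100)

print(len(short_chunks), len(long_chunks))
```

Under these assumptions, a review that is shorter than `chunk_size` and contains no blank line comes back as a single chunk, which is consistent with the one-chunk-per-row output of the splitting cell.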

In [6]:
# TODO: use a Text Splitter to split the documents into chunks
chunk_size = 1000
chunk_overlap = 0

splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
split_docs = splitter.split_documents(docs)


for i, doc in enumerate(split_docs):
    print(f"\n========== DOC {i} ==========")
    print("===== METADATA =====")
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"source: {doc.metadata['source']}")
    print(f"row: {doc.metadata['row']}")
    print("===== CONTENT =====")
    print(f"Type: {type(doc.page_content)}")
    print(doc.page_content)


===== METADATA =====
source: ./tv-reviews.csv
row: 0
===== CONTENT =====
Type: <class 'str'>
TV Name: Imagix Pro
Review Title: Amazing Picture Quality
Review Rating: 9
Review Text: I recently purchased the Imagix Pro and I am blown away by its picture quality. The colors are vibrant and the images are crystal clear. It feels like I'm watching movies in a theater! The sound is also impressive, creating a truly immersive experience. Highly recommended!

===== METADATA =====
source: ./tv-reviews.csv
row: 1
===== CONTENT =====
Type: <class 'str'>
TV Name: Imagix Pro
Review Title: Impressive Features
Review Rating: 8
Review Text: The Imagix Pro is packed with impressive features that enhance my viewing experience. The smart functionality allows me to easily stream my favorite shows and movies. The remote control is user-friendly and has convenient shortcuts. The slim design is sleek and fits perfectly in my living room. The only downside is that the sound could be better, but overall, I'm 

Now, initialize your embeddings model, then create your vector database with that model and populate it with your text chunks.
- OpenAIEmbeddings: [Documentation](https://python.langchain.com/docs/integrations/text_embedding/openai/),
[API Documentation](https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html)
- Chroma: [Documentation](https://python.langchain.com/docs/integrations/vectorstores/chroma/),
[API Documentation](https://python.langchain.com/api_reference/chroma/vectorstores/langchain_chroma.vectorstores.Chroma.html)
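As a rough intuition for what an embeddings model does, the sketch below maps text to a fixed-length vector by counting words from a tiny vocabulary. This is a deliberately crude stand-in: `VOCABULARY` and `toy_embed` are hypothetical names, and a real model such as the one behind OpenAIEmbeddings produces dense vectors that capture meaning rather than word counts.

```python
import re
from collections import Counter

# Hypothetical toy vocabulary; each word is one dimension of the vector.
VOCABULARY = ["picture", "colors", "sound", "remote"]


def toy_embed(text: str) -> list[int]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in VOCABULARY]


vector = toy_embed("The picture is sharp and the colors are vibrant. Great picture!")
print(vector)
```

The point to carry over is only that every chunk, whatever its length, becomes a vector of the same size, which is what lets the vector database compare chunks with a query.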

In [8]:
# TODO: initialize your embeddings model
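# Pass the same endpoint and key as the LLM so embedding requests go to the same API
embeddings = OpenAIEmbeddings(base_url=base_url, api_key=api_key)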

# TODO: populate your vector database with the chunks
db = Chroma.from_documents(split_docs, embeddings)

Query your vector database for the 5 most semantically similar chunks. Start by defining the prompt and the query.
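Conceptually, a similarity search ranks the stored vectors by how close they are to the query vector and keeps the top k. The sketch below uses cosine similarity on hand-written 3-dimensional vectors; the vectors, labels, and helper names are made up for illustration, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the labels of the k vectors most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Hypothetical toy vectors standing in for embedded review chunks.
toy_vectors = {
    "picture quality review": [0.9, 0.1, 0.0],
    "sound review": [0.1, 0.9, 0.0],
    "remote control review": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

nearest = top_k(query_vector, toy_vectors, k=2)
print(nearest)
```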

In [None]:
system_prompt = (
    "Use the following pieces of context to answer the user's question.\n" 
    "If you don't know the answer, just say that you don't know, don't try to make up an answer.\n"
    "----------------\n"
    "{context}"
)

query = """
    Based on the reviews in the context, tell me what people liked about the picture quality.
    Make sure you do not paraphrase the reviews, and only use the information provided in the reviews.
"""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

- create_stuff_documents_chain: [Documentation](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html)
- create_retrieval_chain: [Documentation](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.retrieval.create_retrieval_chain.html)
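Stripped of LangChain types, "stuffing" documents amounts to joining the text of every retrieved chunk and substituting the result into the prompt's `{context}` placeholder. The sketch below shows that idea in plain Python; `Doc` and `stuff_documents` are hypothetical stand-ins, and the blank-line separator is an assumption chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document; only the text content is modeled."""
    page_content: str


def stuff_documents(docs: list[Doc], system_template: str, separator: str = "\n\n") -> str:
    """Join all chunk texts and place them into the template's {context} slot."""
    context = separator.join(doc.page_content for doc in docs)
    return system_template.format(context=context)


template = (
    "Use the following pieces of context to answer the user's question.\n"
    "----------------\n"
    "{context}"
)
sample_docs = [
    Doc("Review Text: The colors are vibrant."),
    Doc("Review Text: Every detail is sharp."),
]

system_message = stuff_documents(sample_docs, template)
print(system_message)
```

Because every chunk is pasted into a single prompt, this approach only works while the combined text fits in the context window, which is why retrieval limits the number of chunks first.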

In [9]:
def format_docs(inputs: dict) -> str:
    docs = list()
    for doc in inputs[DOCUMENTS_KEY]:
        docs.append(format_document(doc, DEFAULT_DOCUMENT_PROMPT))
    return DEFAULT_DOCUMENT_SEPARATOR.join(docs)

In [None]:
use_chain_helper = False
# Retrieve the 5 most similar chunks (the retriever's default is 4)
retriever = db.as_retriever(search_kwargs={"k": 5})

if use_chain_helper:
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever=retriever, combine_docs_chain=question_answer_chain)
else:
    # equivalent to "create_stuff_documents_chain"
    runnable = RunnablePassthrough.assign(**{DOCUMENTS_KEY: format_docs}).with_config(run_name="format_inputs")
    question_answer_chain = (runnable | prompt | llm | StrOutputParser()).with_config(run_name="stuff_documents_chain")

    # equivalent to "create_retrieval_chain"
    retrieval_docs = (lambda x: x["input"]) | retriever
    context = retrieval_docs.with_config(run_name="retrieve_documents")
    retrieve_documents = RunnablePassthrough.assign(**{DOCUMENTS_KEY: context})
    rag_chain = retrieve_documents.assign(answer=question_answer_chain).with_config(run_name="retrieval_chain")

output = rag_chain.invoke({"input": query})
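The chain above passes a dictionary through each step, and every step adds one key. The plain-Python sketch below mimics that data flow with made-up stand-ins (`fake_retriever`, `fake_answer_chain`) so the shape of the result is easy to see; it does not call LangChain or an LLM.

```python
def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever: always returns the same two chunks."""
    return ["chunk about colors", "chunk about clarity"]


def fake_answer_chain(state: dict) -> str:
    """Hypothetical answer step: reports how many chunks it received."""
    return f"answered using {len(state['context'])} chunks"


def retrieval_chain(inputs: dict) -> dict:
    """Copy the inputs, then add 'context' and 'answer' in that order."""
    state = dict(inputs)
    state["context"] = fake_retriever(state["input"])
    state["answer"] = fake_answer_chain(state)
    return state


result = retrieval_chain({"input": "What did people like about the picture quality?"})
print(list(result.keys()))
```

This is why the next cell can read both the retrieved chunks and the final answer from the same `output` dictionary.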

In [35]:
print("********** RETRIEVED CONTEXT **********")
for i, doc in enumerate(output['context']):
    doc_num = doc.metadata['row']
    doc_content = doc.page_content
    print(f"===== ROW {doc_num} =====")
    print(doc_content)

print("\n********** ANSWER **********")
print(output['answer'])

********** RETRIEVED CONTEXT **********
===== ROW 0 =====
TV Name: Imagix Pro
Review Title: Amazing Picture Quality
Review Rating: 9
Review Text: I recently purchased the Imagix Pro and I am blown away by its picture quality. The colors are vibrant and the images are crystal clear. It feels like I'm watching movies in a theater! The sound is also impressive, creating a truly immersive experience. Highly recommended!
===== ROW 2 =====
TV Name: Imagix Pro
Review Title: Unmatched Clarity
Review Rating: 10
Review Text: I cannot express enough how impressed I am with the clarity of the Imagix Pro. Every detail is so sharp and lifelike, it's like I can reach out and touch the images on the screen. The colors are vibrant and realistic, making everything look stunning. It truly enhances my movie-watching experience!
===== ROW 14 =====
TV Name: VisionMax Ultra
Review Title: Excellent Picture Clarity
Review Rating: 9
Review Text: The picture clarity of the VisionMax Ultra is simply outstanding. 

Combined, the retrieved chunks should provide enough information to answer our question about picture quality.

In [9]:
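# query your LLM with the query and the top 5 documents
top_docs = db.similarity_search(query, k=5)
print(question_answer_chain.invoke({"input": query, "context": top_docs}))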