
# 🎥 Build a RAG question answering system for movie recommendations

Our goal for this project is to build a movie question answering system using RAG.

We will use a dataset from the Internet Movie Database (IMDb).

Users will interact with the dataset by asking questions about movies, and the chatbot will retrieve relevant information from our dataset to answer those questions.


# 💻 Install `datasets` library to access IMDb dataset from Hugging Face Hub

[Hugging Face Hub](https://huggingface.co/docs/hub/en/index) is to machine learning what GitHub is to software development—a centralized platform that promotes open sharing, testing, and collaboration.

The models and datasets on the Hub are hosted as Git repositories, allowing for versioning and reproducibility.

The Hub provides a simple way for developers to discover, download, and use these pre-trained models and datasets through the `huggingface-hub` Python library, which is installed as a dependency of the `datasets` library you will be installing below.

In addition to models, the Hub also hosts a variety of machine learning applications and demos created by the community, called "Spaces".

[Here is documentation](https://huggingface.co/docs/datasets/en/index) on `datasets` if you want to read more about its capabilities.

In [None]:
%pip install -q -U datasets

# 💻 Generate a Hugging Face token

Generate a new token on Hugging Face with "Write" permissions using the instructions below. We will need it to download datasets via `huggingface-hub`.

[How to generate a new Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens)


# 💻 Add your Hugging Face token to your Google Colab Secrets

Use the sidebar on the left to add your Hugging Face token `HF_TOKEN` to your Google Colab secrets.

NOTE: The secrets are persisted for all future Colab sessions.
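Once the secret is stored, code can read it at runtime. The following is a minimal sketch, assuming the `google.colab.userdata` module is available when running in Colab and that a plain environment variable is used everywhere else. The helper name `read_secret` is hypothetical and not part of any library.

```python
import os


def read_secret(name: str, default: str = "") -> str:
    """Return a secret from Colab's secret store, else from the environment."""
    try:
        # Assumption: this import only succeeds inside a Google Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name, default)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: use the environment.
        return os.environ.get(name, default)


hf_token = read_secret("HF_TOKEN")
print("token found" if hf_token else "token not set")
```

Falling back to an environment variable keeps the same cell usable when the notebook is run outside Colab.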

![Screenshot](https://drive.google.com/uc?export=view&id=10U0nesFSgXdCR4ywPk18mHERG47T_rRt)

# 💻 Download an IMDb dataset from Hugging Face Hub

Use the `datasets` documentation to load the [ShubhamChoksi/IMDB_Movies](https://huggingface.co/datasets/ShubhamChoksi/IMDB_Movies) dataset.

[Datasets documentation](https://pypi.org/project/datasets/)

In [1]:
from datasets import load_dataset_builder, load_dataset # TODO

# dataset_builder = load_dataset_builder("ShubhamChoksi/IMDB_Movies")
# dataset_builder.info.features

# TODO: How do we use `datasets` to load the ShubhamChoksi/IMDB_Movies dataset?
dataset = load_dataset("ShubhamChoksi/IMDB_Movies", split="train")



# 💻 Store the IMDb dataset locally as a CSV file

We will be using [LangChain](https://www.langchain.com/) to build our RAG question answering system.

[LangChain](https://www.langchain.com/) is an open-source framework designed to simplify the development of applications powered by LLMs. It provides a set of tools, components, and interfaces that make it easier to build LLM-centric applications, allowing you to focus on the core functionality rather than the complexities of integrating language models.

To build our RAG system, we will first store the IMDb dataset in a local CSV file to ensure it is in a format we can pass to LangChain.


In [2]:
dataset_dict = dataset
dataset_dict.to_csv("../data/imdb.csv") # TODO - what method do we have to use to store imdb.csv from ShubhamChoksi/IMDB_Movies?

Creating CSV from Arrow format: 100%|██████████| 7/7 [00:00<00:00, 58.62ba/s]


10753100

# [Optional] 💻 Load the IMDb CSV data into a Pandas DataFrame


Pandas is a powerful open-source Python library for data manipulation and analysis.

It provides two main data structures:
* DataFrame
  * A two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)
* Series
  * A one-dimensional labeled array capable of holding data of any data type.

The DataFrame is the primary and most widely used data structure in Pandas. It is similar to a spreadsheet or a SQL table, with rows and columns.

Each column in a DataFrame can have a different data type, making it a flexible and powerful tool for working with diverse data.

We don't **need** to use a DataFrame for this example because we can convert the `dataset` to a CSV using `to_csv`. However, you will likely encounter a DataFrame while processing data for LLM applications, so we want to briefly introduce it here.

[Pandas DataFrame Documentation](https://pandas.pydata.org/docs/reference/io.html)

In [3]:
import pandas as pd # the Colab runtime will already have this library installed - no need to `pip install`

movies_dataframe = pd.read_csv("../data/imdb.csv") # TODO: How do we read imdb.csv as a dataframe?

movies_dataframe.head() # TODO: What method can we call to ensure we have actually loaded data?

Unnamed: 0,Name,rating,No_of_ratings,user,critics,Movie_Info
0,First Knight,6.0,77K,226,54.0,Mel Gibson was attached to this project at one...
1,First Man,7.3,198K,1.4K,496.0,Mark Armstrong and Rick Armstrong said that th...
2,First Man into Space,5.4,1.7K,40,31.0,The pilot in the stock footage sequences is Ch...
3,First of May,6.8,454,13,3.0,Charles Nelson Reilly survived the worst circu...
4,The First of May,6.8,454,13,3.0,Charles Nelson Reilly survived the worst circu...


# 💻 Install LangChain

In [4]:
%pip install -q -U langchain

Note: you may need to restart the kernel to use updated packages.


[LangChain Document Loaders documentation](https://python.langchain.com/docs/integrations/document_loaders/)

In [4]:
from langchain_community.document_loaders import CSVLoader # TODO: What do we have to import to load our imdb.csv file?

loader = CSVLoader(file_path="../data/imdb.csv") # TODO
data = loader.load() # TODO

len(data) # ensure we have actually loaded data into a format LangChain can recognize

6591
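`CSVLoader` produces one document per CSV row. The sketch below imitates that idea with the standard library only. It is an approximation rather than the library's implementation: the helper `rows_to_documents`, the dictionary layout, and the sample rows are all made up for illustration.

```python
import csv
import io

# Two made-up rows using a subset of the column names shown earlier.
SAMPLE_CSV = """Name,rating,Movie_Info
Movie A,6.0,placeholder trivia A
Movie B,7.3,placeholder trivia B
"""


def rows_to_documents(csv_text: str) -> list[dict]:
    """Turn each CSV row into a dict holding text content and row metadata."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # One "column: value" line per field, joined into a single text blob.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({"page_content": content, "metadata": {"row": row_index}})
    return documents


docs = rows_to_documents(SAMPLE_CSV)
print(len(docs))
print(docs[0]["page_content"])
```

Keeping the column name next to each value means the text that later gets embedded still says which field a value came from.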

# 💻 Chunk the loaded data to improve retrieval performance

In a RAG system, the model needs to be able to quickly and accurately retrieve relevant information from a knowledge base or other data sources to assist in generating high-quality responses. However, working with large, unstructured datasets can be computationally expensive and time-consuming, especially during the retrieval process.

By splitting the data into smaller, overlapping chunks, the RAG system can more efficiently search and retrieve the most relevant information to include in the generated response. This can lead to improved performance, as the model doesn't have to process the entire dataset at once and can focus on the most relevant parts of the data.
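To make "overlapping chunks" concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraph and line breaks); the function `chunk_text` is a hypothetical helper that only illustrates chunk size and overlap.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

The last three characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully present in at least one chunk.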

[LangChain `RecursiveCharacterTextSplitter` documentation](https://sj-langchain.readthedocs.io/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)


In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # TODO: How do we create a text splitter with 1000 character chunks and 100 character overlap?
    chunk_size = 1000,
    chunk_overlap = 100
)

chunked_documents = text_splitter.split_documents(data) # TODO: How do we chunk the data?

len(chunked_documents) # ensure we have actually split the data into chunks

19077

# 💻 Build a way for your RAG system to understand the relationship between words and their meanings

You can think of embeddings as secret codes that capture the essence of words, allowing your system to understand their true meanings and relationships. This semantic understanding is crucial for the system to provide precise and contextually relevant answers.

Embeddings are represented as dense, continuous vectors in a high-dimensional space and serve as a semantic map that guides your RAG system to the most relevant answers.

A good analogy is a compass: embeddings help your system navigate through a vast sea of information, delivering accurate and contextual responses to user queries.
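A small numeric sketch can show what "close in meaning" looks like in vector terms. The three-dimensional vectors below are hand-made for illustration and do not come from any real embedding model; cosine similarity is used here as one common way to compare embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings"; real models use far more dimensions.
toy_embeddings = {
    "spaceship": [0.9, 0.1, 0.0],
    "starship": [0.8, 0.2, 0.1],
    "romance": [0.0, 0.2, 0.9],
}

reference = toy_embeddings["spaceship"]
for word, vector in toy_embeddings.items():
    print(word, round(cosine_similarity(reference, vector), 3))
```

In this toy setup "starship" scores much closer to "spaceship" than "romance" does, which is the property a retriever relies on.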

# 💻 Use OpenAI embeddings to create a vector store

The first step in creating a vector store is to create embeddings from the data that you want the RAG system to be able to retrieve.

This is done using an embedding model, which transforms text data into a high-dimensional vector representation. Each piece of text (such as a document, paragraph, or sentence) is converted into a vector that captures its semantic meaning.

For this exercise, we will use OpenAI's embedding model.

In [None]:
%pip install -q -U langchain-openai

[LangChain `OpenAIEmbeddings` documentation](https://python.langchain.com/docs/integrations/text_embedding/openai/)

In [6]:
import os
# from google.colab import userdata
from langchain_openai import OpenAIEmbeddings

from dotenv import load_dotenv
load_dotenv('../.env')

# openai_api_key = userdata.get("OPENAI_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY", "")
embedding_model = OpenAIEmbeddings(model='text-embedding-3-large', openai_api_key=openai_api_key) # TODO: How do we create our embedding model?

# 💻 Create embedder

We will create our embedder using the `CacheBackedEmbeddings` class.

This class is designed to optimize the process of generating embeddings by caching the results of expensive embedding computations.

This caching mechanism prevents the need to recompute embeddings for the same text multiple times, which can be computationally expensive and time-consuming.
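The caching idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the `CacheBackedEmbeddings` implementation: the class `CachedEmbedder`, the `fake_embedding` function, and the in-memory dictionary (instead of files on disk) are all assumptions made for the example.

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function and reuse results for texts seen before."""

    def __init__(self, embed_fn, namespace: str):
        self.embed_fn = embed_fn
        self.namespace = namespace
        self.cache: dict[str, list[float]] = {}
        self.calls_to_model = 0

    def _key(self, text: str) -> str:
        # The namespace keeps vectors from different models apart.
        return self.namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self.cache:
            self.calls_to_model += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]


def fake_embedding(text: str) -> list[float]:
    # Stand-in for a paid API call; NOT a real embedding.
    return [float(len(text)), float(text.count(" "))]


embedder_sketch = CachedEmbedder(fake_embedding, namespace="fake-model-v1")
embedder_sketch.embed("First Man")
embedder_sketch.embed("First Man")  # second call is served from the cache
print(embedder_sketch.calls_to_model)
```

Only the first call reaches the (fake) model; the second is a dictionary lookup, which is where the time and cost savings come from.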

[LangChain walkthrough of caching embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/caching_embeddings/)

[`CacheBackedEmbeddings` documentation](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.cache.CacheBackedEmbeddings.html)

In [7]:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

store = LocalFileStore("./cache/") # TODO: How do we create a local file store for our cached embeddings?
embedder = CacheBackedEmbeddings.from_bytes_store(
    embedding_model, store, namespace=embedding_model.model
)  # TODO: How do we create our embedder?

# 💻 Create vector store using Facebook AI Similarity Search (FAISS)

[FAISS](https://ai.meta.com/tools/faiss/) is specifically designed for efficient similarity search in large datasets of high-dimensional vectors.

By using vector embeddings and storing them in a FAISS index, you can significantly reduce the computational cost associated with real-time embedding generation and similarity calculations.

Retrieval from a FAISS index is much faster than linear search across high-dimensional vectors, speeding up the response time of the system.

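You may have also noticed that we cache the embeddings on disk using [`LocalFileStore`](https://python.langchain.com/docs/integrations/stores/file_system/). In the cell below, the FAISS index itself is saved with `save_local` and reloaded with `load_local`.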

Saving the vector store locally ensures that the embeddings are persistent across sessions, reducing the need to recompute embeddings and rebuild the index each time the system is used.

[LangChain FAISS documentation](https://python.langchain.com/docs/integrations/vectorstores/faiss/)

In [12]:
%pip install -q faiss-cpu tiktoken

Note: you may need to restart the kernel to use updated packages.


In [11]:
from langchain_community.vectorstores import FAISS

# vector_store = FAISS.from_documents(chunked_documents, embedder) # TODO: How do we create our vector store using FAISS?
# vector_store.save_local('./store/faiss_index') # TODO: How do we save our vector store locally?

# Load our vector store from disk.
# Without `allow_dangerous_deserialization=True`, `load_local` raises:
#   ValueError: The de-serialization relies loading a pickle file.
#   Pickle files can be modified to deliver a malicious payload that results in execution of arbitrary code on your machine.
#   You will need to set `allow_dangerous_deserialization` to `True` to enable deserialization.
#   If you do this, make sure that you trust the source of the data.
#   For example, if you are loading a file that you created, and know that no one else has modified the file,
#   then this is safe to do. Do not set this to `True` if you are loading a file from an untrusted source (e.g., some random site on the internet).
vector_store = FAISS.load_local('./store/faiss_index', embedder, allow_dangerous_deserialization=True)

# 💻 Ask your RAG system a question!

Now that we have the embeddings for our chunked IMDb data saved locally in our vector store, we are ready to ask it a question.

To accomplish this task, we first transform a question like "What are some good sci-fi movies from the 1980s?" into a vector representation using our embedding model.

After that, we perform a similarity search to grab the relevant documents from our vector store.
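The two steps (embed the question, then rank stored vectors by closeness) can be sketched without any library. The example assumes a Euclidean (L2) distance, where a lower score means a closer match; the two-dimensional vectors, the `index` dictionary, and the helper `top_k` are made up for illustration and do not reflect how FAISS stores or searches data internally.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, indexed_vectors, k=2):
    """Return the k (distance, text) pairs closest to the query, nearest first."""
    scored = [(l2_distance(query_vector, vector), text)
              for text, vector in indexed_vectors.items()]
    return sorted(scored)[:k]


# Made-up 2-dimensional vectors standing in for embedded chunks.
index = {
    "space rescue mission": [0.9, 0.1],
    "alien invasion": [0.8, 0.3],
    "courtroom drama": [0.1, 0.9],
}

results = top_k([1.0, 0.0], index, k=2)
for distance, text in results:
    print(f"{distance:.3f}: {text}")
```

This brute-force loop compares the query against every stored vector, which is fine for three items; a dedicated index exists to make the same ranking practical at a much larger scale.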

[LangChain Text Embedding Models documentation](https://python.langchain.com/docs/modules/data_connection/text_embedding/)

[LangChain Vector Store documentation](https://python.langchain.com/docs/modules/data_connection/vectorstores/)

In [12]:
query = "What are some good sci-fi movies from the 1980s?"

embedded_query = embedder.embed_query(query) # TODO: How do we embed our query?
# print(embedded_query)
# TODO: How do we do a similarity search to find documents similar to our query?
similar_documents_with_score = vector_store.similarity_search_with_score_by_vector(
    embedded_query)

for page, score in similar_documents_with_score:
  # TODO: Print the similar documents that the similarity search returns?
  print(f'{score}: {page.page_content}')

1.0613569021224976: Movie_Info: The set dresser on this film was Bill Paxton, before he took to acting. He was employed by the film's Art Director, and future frequent collaborator, James Cameron.,When the captain engages the hyper-drive, she stands at the control panel. A montage of the crew members strapped into their seats follows, including the captain. When hyper-drive disengages a few seconds later, she is still standing at the control panel in exactly the same spot.,Quuhod: I live and die by the crystals.,End credits roll over the pyramid which glows blue as the wind can be heard blowing.,The 1986 UK Warner video release was cut by 14 secs by the BBFC to reduce the rape of Dameia by a giant maggot. The cuts were fully waived for the 2002 ILC DVD.,Edited into Not of This Earth (1988),A spacecraft travels to a distant planet to rescue the crew of another space ship that has crashed. The craft has damages in the landing and needs to be repaired. Baelon (Zalman King) commands the re

# 💻 Combine the retrieved data with the output of the LLM using [`Runnable`](https://python.langchain.com/docs/expression_language/interface/) interface

To understand the Runnable interface in [LangChain](https://python.langchain.com/docs/expression_language/interface/), let's use the analogy of a kitchen staff in a restaurant.

In a kitchen, you have different chefs who specialize in various tasks—there's a pastry chef, a grill chef, a sauce chef, and so on.

Each chef is responsible for preparing a specific part of the meal, and they must work in a certain order to ensure the dish comes out correctly.

The head chef oversees the process, ensuring that each part is ready at the right time and that everything comes together in the end.

In this analogy:
* Each chef represents a component in LangChain that implements the Runnable interface.
* The dish being prepared is the final output from the LangChain system, such as an answer to a user's question.
* The head chef is like the LangChain framework, which coordinates the execution of each Runnable component.

## Preparing a Multi-Course Meal

We are using LangChain to build a system that answers questions about movies.

We will have a sequence of Runnables that:

1. Retrieve documents related to the query (like finding the right ingredients)
2. Parse the documents to extract relevant information (like prepping the ingredients)
3. Generate a response based on the information (like cooking the ingredients to create a dish)
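The sequence above can be sketched as a toy pipeline in plain Python. This is not LangChain's implementation: the `Step` class and the three lambda stages are hypothetical stand-ins that only show how chaining with `|` passes each output to the next stage.

```python
class Step:
    """A toy stand-in for a Runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # Chaining builds a new step that feeds this output into the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stages mirroring the three numbered steps in the list.
retrieve = Step(lambda question: {"question": question,
                                  "context": ["doc about " + question]})
build_prompt = Step(
    lambda data: f"Context: {data['context'][0]} | Question: {data['question']}")
generate = Step(lambda prompt: prompt.upper())  # placeholder for an LLM call

chain = retrieve | build_prompt | generate
print(chain.invoke("aliens"))
```

Because every stage exposes the same `invoke` method, stages can be swapped or reordered without changing how the chain is called.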


# 💻 Create the components (chefs)

In [None]:
%pip install -q langchain_openai

[LangChain ChatPromptTemplate quick reference](https://python.langchain.com/docs/modules/model_io/prompts/quick_start/)

[LangChain `VectorStoreRetriever` documentation](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore/)

[LangChain `ChatOpenAI` documentation](https://python.langchain.com/docs/integrations/chat/openai/)

[LangChain `StrOutputParser` documentation](https://api.python.langchain.com/en/latest/output_parsers/langchain_core.output_parsers.string.StrOutputParser.html)

In [26]:
from langchain_core.runnables.base import RunnableSequence
from langchain_core.runnables.passthrough import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Create the components (chefs)
# TODO: How do we create a prompt template to send to our LLM that will incorporate the documents from our retriever with the question we ask the chat model?
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a movie recommendation chatbot."),
    ("user", "Given the {context}, answer the following question: {question}"),
])

# TODO: How do we create a retriever for our documents?
# as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5})
# as_retriever(search_type="mmr")
retriever = vector_store.as_retriever()

chat_model = ChatOpenAI(name='Movie Recommendation Bot',
                        api_key=openai_api_key,
                        # cache=True,
                        temperature=0.7,
                        verbose=True)  # TODO: How do we create a chat model / LLM?

parser = StrOutputParser() # TODO: How do we create a parser to parse the output of our LLM?

# 💻 Create the sequence (recipe)

[LangChain Runnable interface documentation](https://python.langchain.com/docs/expression_language/interface/)

In [27]:
runnable_chain = RunnableSequence(
    # TODO: How do we chain the output of our retriever, prompt, model and model output parser so that we can get a good answer to our query?
    {"context": retriever, "question": RunnablePassthrough(
    )} | prompt_template | chat_model | parser
)

print(runnable_chain.input_schema.schema())  # Show inferred input schema
print(runnable_chain.output_schema.schema())  # Show inferred output schema

{'title': 'RunnableParallel<context,question>Input', 'type': 'string'}
{'title': 'StrOutputParserOutput', 'type': 'string'}


# 💻 Execute the sequence (prepare the meal)

In [28]:
# Synchronous execution
# TODO: How do we execute our chain?
answer = runnable_chain.invoke(
    'What are some good sci-fi movies from the 1980s?')
print(answer)  # `invoke` returns the full answer as a single string

Some good sci-fi movies from the 1980s that you might enjoy are:

1. "Aliens" (1986) - Directed by James Cameron, this film continues the story of Ellen Ripley as she returns to the planet where she first encountered the deadly alien creatures.
   
2. "Blade Runner" (1982) - Directed by Ridley Scott, this film is a neo-noir sci-fi classic that explores themes of artificial intelligence and humanity.

3. "The Terminator" (1984) - Directed by James Cameron, this film features a time-traveling cyborg sent to the past to assassinate a key figure in a future war.

4. "Back to the Future" (1985) - Directed by Robert Zemeckis, this film follows the adventures of Marty McFly as he travels through time in a DeLorean with the help of eccentric scientist Doc Brown.

5. "Star Wars: Episode V - The Empire Strikes Back" (1980) - Directed by Irvin Kershner, this film is the second installment in the original Star Wars trilogy and is considered one of the best in the series.

These are just a few exam

In [37]:
output_chunks_2 = runnable_chain.stream(
    'What are some batman movies that were released in 2020?'
)

print(''.join(output_chunks_2))

Some Batman movies that were released in 2020 include "The Batman."


In [30]:
answer_3 = runnable_chain.invoke(
    'Are you sure that is the correct answer?'
)

print(answer_3)  # `invoke` returns a single string, so no join is needed

Based on the information provided, it seems like you are asking if "The Answer Man" is the correct answer. Yes, "The Answer Man" is a movie that matches the details you provided. It has a rating of 6.4, with a significant number of ratings from both users and critics. The movie's screenplay was featured in the 2007 Blacklist, and it explores themes of self-discovery and personal growth. If you are looking for a movie recommendation based on this information, "The Answer Man" could be a good choice for you.


In [34]:
# Asynchronous execution (e.g., for a better chatbot user experience)
output_stream = runnable_chain.astream('What are some batman movies that were released in 2020?') # TODO: How do we execute our chain asynchronously?

# Top-level `async for` works in notebooks; `end=''` keeps the chunks on one line
async for chunk in output_stream:
  print(chunk, end='', flush=True)

Some Batman movies that were released in 2020 include "The Batman."
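Streaming delivers the answer in small pieces as they are generated. The sketch below imitates that pattern with a stand-in async generator so it can run without an API key; `fake_token_stream` and `collect` are hypothetical helpers, and the word-by-word splitting is an assumption, not how a chat model tokenizes its output.

```python
import asyncio


async def fake_token_stream(answer: str):
    """Yield an answer word by word, imitating a streaming chat model."""
    for index, word in enumerate(answer.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if index == 0 else " " + word


async def collect(answer: str) -> str:
    """Print chunks as they arrive and return the reassembled text."""
    pieces = []
    async for chunk in fake_token_stream(answer):
        print(chunk, end="", flush=True)  # end="" keeps chunks on one line
        pieces.append(chunk)
    print()
    return "".join(pieces)


# In a notebook cell you could `await collect(...)` directly instead.
streamed = asyncio.run(collect("The chain streams its answer"))
```

Printing each chunk immediately lets a user start reading before the full answer is ready, while joining the pieces afterwards recovers the complete text.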



# 🎉 Congratulations on Completing Your Project!

You have successfully built a Retrieval Augmented Generation (RAG) question answering system for movie recommendations using the IMDb dataset.

This system allows users to interactively ask questions about movies, and it retrieves relevant information to provide insightful answers.

Great job on reaching this milestone!

From here, we can move on to our [Week 2: Introduction to Retrieval-Augmented Generation (RAG) Experiment Notebook](https://colab.research.google.com/drive/1pyzWvYDCKEmORN_yIsKwki6BQ5I9kXWK#scrollTo=Grd0XoS-tLgs).

# 📝 Submission

Submit your experiment notebook for Week 2 using the form [here](https://docs.google.com/forms/d/1l935d2L3YN3Kj3ovNf3CKWB2EyxvDMkYY_sYte-NYWI/edit).

Please make sure sharing permissions are turned on for everyone with the link.
