
# 🎥 Build a RAG question answering system for movie recommendations

Our goal for this project is to build a movie question answering system using retrieval-augmented generation (RAG).

We will use a dataset from the Internet Movie Database (IMDb).

Users will interact with the dataset by asking questions about movies, and the chatbot will retrieve relevant information from our dataset to answer those questions.


# 💻 Install `datasets` library to access IMDb dataset from Hugging Face Hub

[Hugging Face Hub](https://huggingface.co/docs/hub/en/index) is to machine learning what GitHub is to software development—a centralized platform that promotes open sharing, testing, and collaboration.

The models and datasets on the Hub are hosted as Git repositories, allowing for versioning and reproducibility.

The Hub provides a simple way for developers to discover, download, and use these pre-trained models and datasets through Python libraries such as `huggingface-hub` and `datasets`; you will install `datasets` below.

In addition to models, the Hub also hosts a variety of machine learning applications and demos created by the community, called "Spaces".

[Here is documentation](https://huggingface.co/docs/datasets/en/index) on `datasets` if you want to read more about its capabilities.

In [None]:
!pip install -q -U datasets


# 💻 Generate a Hugging Face token

Generate a new token on Hugging Face with "Write" permissions using the instructions below. We will need it to download datasets via `huggingface-hub`.

[How to generate a new Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens)


# 💻 Add your Hugging Face token to your Google Colab Secrets

Use the sidebar on the left to add your Hugging Face token `HF_TOKEN` to your Google Colab secrets.

NOTE: The secrets are persisted for all future Colab sessions.

![Screenshot](https://drive.google.com/uc?export=view&id=10U0nesFSgXdCR4ywPk18mHERG47T_rRt)

# 💻 Download an IMDb dataset from Hugging Face Hub

Use the `datasets` documentation to load the [ShubhamChoksi/IMDB_Movies](https://huggingface.co/datasets/ShubhamChoksi/IMDB_Movies) dataset.

[Datasets documentation](https://pypi.org/project/datasets/)

In [None]:
from datasets import # TODO

dataset = # TODO: How do we use `datasets` to load the ShubhamChoksi/IMDB_Movies dataset?

Downloading readme:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/10.8M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/6591 [00:00<?, ? examples/s]

# 💻 Store the IMDb dataset locally as a CSV file

We will be using [LangChain](https://www.langchain.com/) to build our RAG question answering system.

[LangChain](https://www.langchain.com/) is an open-source framework designed to simplify the development of applications powered by LLMs. It provides a set of tools, components, and interfaces that make it easier to build LLM-centric applications, allowing you to focus on the core functionality rather than the complexities of integrating language models.

To build our RAG system, we will first store the IMDb dataset in a local CSV file so that it is in a format we can pass to LangChain.
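To make the target format concrete, here is a minimal sketch of a CSV round trip using only Python's standard `csv` module. It is not how `datasets` writes the file; it only illustrates the layout (one header row, then one row per movie) that a CSV document loader later reads. The file name `imdb_sample.csv` is hypothetical, and the two rows are taken from the sample output shown later in this notebook.

```python
import csv
import os
import tempfile

# Two sample records, using column names that appear in the IMDb dataset.
rows = [
    {"Name": "First Knight", "rating": "6.0"},
    {"Name": "First Man", "rating": "7.3"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "imdb_sample.csv")  # hypothetical file name

    # Write a header row followed by one line per record.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["Name", "rating"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back; every value comes back as a string.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        loaded_rows = list(csv.DictReader(handle))

print(loaded_rows[0]["Name"], len(loaded_rows))
```

Note that CSV stores everything as text, so numeric columns such as `rating` need to be converted again after loading if you want to do arithmetic on them.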


In [None]:
dataset_dict = dataset
dataset_dict["train"] # TODO - what method do we have to use to store imdb.csv from ShubhamChoksi/IMDB_Movies?

Creating CSV from Arrow format:   0%|          | 0/7 [00:00<?, ?ba/s]

10753100

# [Optional] 💻 Load the IMDb CSV data into a Pandas DataFrame


Pandas is a powerful open-source Python library for data manipulation and analysis.

It provides two main data structures:
* DataFrame
  * A two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)
* Series
  * A one-dimensional labeled array capable of holding data of any data type.

The DataFrame is the primary and most widely used data structure in Pandas. It is similar to a spreadsheet or a SQL table, with rows and columns.

Each column in a DataFrame can have a different data type, making it a flexible and powerful tool for working with diverse data.

We don't **need** to use a DataFrame for this example because we can convert the `dataset` to a CSV using `to_csv`. However, you will likely encounter a DataFrame while processing data for LLM applications, so we want to briefly introduce it here.

[Pandas DataFrame Documentation](https://pandas.pydata.org/docs/reference/io.html)

In [None]:
import pandas as pd # the Colab runtime will already have this library installed - no need to `pip install`

movies_dataframe = # TODO: How do we read imdb.csv as a dataframe?

movies_dataframe # TODO: What method can we call to ensure we have actually loaded data?

| Unnamed: 0 | Name | rating | No_of_ratings | user | critics | Movie_Info |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | First Knight | 6.0 | 77K | 226 | 54.0 | Mel Gibson was attached to this project at one... |
| 1 | First Man | 7.3 | 198K | 1.4K | 496.0 | Mark Armstrong and Rick Armstrong said that th... |
| 2 | First Man into Space | 5.4 | 1.7K | 40 | 31.0 | The pilot in the stock footage sequences is Ch... |
| 3 | First of May | 6.8 | 454 | 13 | 3.0 | Charles Nelson Reilly survived the worst circu... |
| 4 | The First of May | 6.8 | 454 | 13 | 3.0 | Charles Nelson Reilly survived the worst circu... |


# 💻 Install LangChain

In [None]:
!pip install -q -U langchain


[LangChain Document Loaders documentation](https://python.langchain.com/docs/integrations/document_loaders/)

In [None]:
from langchain.document_loaders # TODO: What do we have to import to load our imdb.csv file?

loader = # TODO
data = # TODO

len(data) # ensure we have actually loaded data into a format LangChain can recognize

6591

# 💻 Chunk the loaded data to improve retrieval performance

In a RAG system, the model needs to be able to quickly and accurately retrieve relevant information from a knowledge base or other data sources to assist in generating high-quality responses. However, working with large, unstructured datasets can be computationally expensive and time-consuming, especially during the retrieval process.

Chunking addresses this by splitting each document into smaller, overlapping pieces, so the RAG system can search and retrieve only the most relevant passages to include in the generated response. This can improve performance because the model does not have to process the entire dataset at once and can focus on the most relevant parts of the data. The overlap helps preserve context that would otherwise be cut off at a chunk boundary.
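To see what overlapping chunks look like, here is a minimal sketch of fixed-size chunking in plain Python. It is a simplified sliding window, not the algorithm `RecursiveCharacterTextSplitter` uses (that class also tries to split on natural separators such as paragraphs and sentences). The function name `chunk_text` and its parameters are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk as long as it is no longer than the overlap.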

[LangChain `RecursiveCharacterTextSplitter` documentation](https://sj-langchain.readthedocs.io/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # TODO: How do we create a text splitter with 1000 character chunks and 100 character overlap?
)

chunked_documents = # TODO: How do we chunk the data?

len(chunked_documents) # ensure we have actually split the data into chunks

19077

# 💻 Build a way for your RAG system to understand the relationship between words and their meanings

You can think of embeddings as secret codes that capture the essence of words, allowing your system to understand their true meanings and relationships. This semantic understanding is crucial for the system to provide precise and contextually relevant answers.

Embeddings are represented as dense, continuous vectors in a high-dimensional space and serve as a semantic map that guides your RAG system to the most relevant answers.

A good analogy is a compass: embeddings help your system navigate through a vast sea of information, delivering accurate and contextual responses to user queries.

# 💻 Use OpenAI embeddings to create a vector store

The first step in creating a vector store is to create embeddings from the data that you want the RAG system to be able to retrieve.

This is done using an embedding model, which transforms text data into a high-dimensional vector representation. Each piece of text (such as a document, paragraph, or sentence) is converted into a vector that captures its semantic meaning.
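As a rough illustration of what "captures its semantic meaning" buys us, the sketch below compares hand-written three-dimensional vectors using cosine similarity. The vectors are made up for this example; real embedding models produce vectors with hundreds or thousands of dimensions, and the helper name `cosine_similarity` is our own.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: texts with similar meaning are given similar directions.
toy_embeddings = {
    "space opera": [0.9, 0.1, 0.0],
    "alien invasion": [0.8, 0.3, 0.1],
    "romantic comedy": [0.1, 0.2, 0.9],
}

query_vector = toy_embeddings["space opera"]
for text, vector in toy_embeddings.items():
    print(f"{text}: {cosine_similarity(query_vector, vector):.3f}")
```

Texts whose vectors point in a similar direction score close to 1, while unrelated texts score closer to 0. A vector store relies on this kind of comparison to decide which chunks are relevant.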

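For this exercise, we will use OpenAI's embedding model. The code below reads the API key from a Colab secret named `OPENAI_API_KEY`, so add that secret the same way you added `HF_TOKEN`.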

In [None]:
!pip install -q -U langchain-openai


[LangChain `OpenAIEmbeddings` documentation](https://python.langchain.com/docs/integrations/text_embedding/openai/)

In [None]:
import os
from google.colab import userdata
from langchain_openai import OpenAIEmbeddings

openai_api_key = userdata.get("OPENAI_API_KEY")
embedding_model = # TODO: How do we create our embedding model?

# 💻 Create embedder

We will create our embedder using the `CacheBackedEmbeddings` class.

This class is designed to optimize the process of generating embeddings by caching the results of expensive embedding computations.

This caching mechanism prevents the need to recompute embeddings for the same text multiple times, which can be computationally expensive and time-consuming.
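The idea behind the cache can be sketched in a few lines of plain Python. This is a toy illustration, not the `CacheBackedEmbeddings` implementation: it keeps results in an in-memory dictionary rather than a persistent store, and the names `ToyCachedEmbedder` and `fake_embedding` are hypothetical.

```python
import hashlib


class ToyCachedEmbedder:
    """Hypothetical stand-in that caches embeddings by a hash of the input text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # the "expensive" embedding function
        self._cache = {}           # maps text hash -> embedding vector
        self.misses = 0            # how many times we had to compute an embedding

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]


def fake_embedding(text):
    # Placeholder for an expensive API call: returns a tiny made-up vector.
    return [float(len(text)), float(text.count(" "))]


embedder_sketch = ToyCachedEmbedder(fake_embedding)
embedder_sketch.embed("First Man")
embedder_sketch.embed("First Man")     # served from the cache, no recomputation
embedder_sketch.embed("First Knight")
print(embedder_sketch.misses)
```

Only two of the three calls reach the underlying embedding function. With a paid embedding API, every cache hit is a request you do not have to wait or pay for.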

[LangChain walkthrough of caching embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/caching_embeddings/)

[`CacheBackedEmbeddings` documentation](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.cache.CacheBackedEmbeddings.html)

In [None]:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

store = # TODO: How do we create a local file store for our cached embeddings?
embedder = # TODO: How do we create our embedder?

# 💻 Create vector store using Facebook AI Similarity Search (FAISS)

[FAISS](https://ai.meta.com/tools/faiss/) is specifically designed for efficient similarity search in large datasets of high-dimensional vectors.

By using vector embeddings and storing them in a FAISS index, you can significantly reduce the computational cost associated with real-time embedding generation and similarity calculations.

FAISS provides optimized exact and approximate nearest-neighbor indexes, so retrieval is typically much faster than a naive linear scan over high-dimensional vectors, which speeds up the response time of the system.

You may have also noticed that we cache the embeddings using [`LocalFileStore`](https://python.langchain.com/docs/integrations/stores/file_system/). In the cell below, we also save the FAISS vector store itself to local disk.

Saving the vector store locally ensures that the embeddings are persistent across sessions, reducing the need to recompute embeddings and rebuild the index each time the system is used.

[LangChain FAISS documentation](https://python.langchain.com/docs/integrations/vectorstores/faiss/)

In [None]:
!pip install -q faiss-cpu tiktoken


In [None]:
from langchain_community.vectorstores import FAISS

vector_store = # TODO: How do we create our vector store using FAISS?

# TODO: How do we save our vector store locally?

# 💻 Ask your RAG system a question!

Now that we have the embeddings for our chunked IMDb data saved locally in our vector store, we are ready to ask it a question.

To accomplish this task, we first transform a question like "What are some good sci-fi movies from the 1980s?" into a vector representation using our embedding model.

After that, we perform a similarity search to grab the relevant documents from our vector store.
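The two steps above (embed the query, then rank stored documents by similarity) can be sketched without any external service. This is a brute-force toy, not FAISS and not OpenAI's model: `toy_embed` simply counts words from a small hypothetical vocabulary, and every name in the block is our own.

```python
import math

VOCABULARY = ["space", "alien", "robot", "love", "wedding"]  # hypothetical toy vocabulary


def toy_embed(text):
    """Embed text as word counts over VOCABULARY (a stand-in for a real embedding model)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]


def cosine_similarity(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def top_k(query, documents, k=2):
    """Return the k documents whose toy embeddings are most similar to the query."""
    query_vector = toy_embed(query)  # the query is embedded the same way as the documents
    scored = [(cosine_similarity(query_vector, toy_embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


documents = [
    "a robot falls in love at a wedding",
    "an alien robot crashes a space station",
    "two friends plan a wedding",
]

results = top_k("space alien movie", documents, k=1)
print(results)
```

The key design point is that the query and the documents must be embedded by the same model, otherwise their vectors live in different spaces and the similarity scores are meaningless.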

[LangChain Text Embedding Models documentation](https://python.langchain.com/docs/modules/data_connection/text_embedding/)

[LangChain Vector Store documentation](https://python.langchain.com/docs/modules/data_connection/vectorstores/)

In [None]:
query = "What are some good sci-fi movies from the 1980s?"

embedded_query = # TODO: How do we embed our query?
similar_documents = # TODO: How do we do a similarity search to find documents similar to our query?

for page in similar_documents:
  # TODO: Print the similar documents that the similarity search returns

'realistic' images from 1950s' speculative magazines (fictional and 'factual') but are neither as novel nor as effective the contemporaneous Czech sci-fi film 'Ikarie XB-1' (1963) (with which 'Andromeda Nebula' shares a number of tropes). Watching as I did, most of the back-story (which occurs in a futuristic 'socialist' world in which Earth is part of an enlightened galactic alliance) was lost on me but again the images were interesting (a mix of Soviet-style monumental architecture and neoclassical 'future-tropes'). The film is based on a novel by Ivan Yefremov and was originally intended to the first in a film series. Worth watching for the imagery but unless you speak Russian, I'd suggest investing in a DVD, waiting for a subtitled version to show up on-line, or skipping to the 'special effects' sequences.,
Compare this movie to the 1956 movie Forbidden Planet, and think about which one gives you a better 'futuristic' portrayal of how mankind has advanced in 'the future'. Even allo

# 💻 Combine the retrieved data with the output of the LLM using [`Runnable`](https://python.langchain.com/docs/expression_language/interface/) interface

To understand the Runnable interface in [LangChain](https://python.langchain.com/docs/expression_language/interface/), let's use the analogy of the kitchen staff in a restaurant.

In a kitchen, you have different chefs who specialize in various tasks—there's a pastry chef, a grill chef, a sauce chef, and so on.

Each chef is responsible for preparing a specific part of the meal, and they must work in a certain order to ensure the dish comes out correctly.

The head chef oversees the process, ensuring that each part is ready at the right time and that everything comes together in the end.

In this analogy:
* Each chef represents a component in LangChain that implements the Runnable interface.
* The dish being prepared is the final output from the LangChain system, such as an answer to a user's question.
* The head chef is like the LangChain framework, which coordinates the execution of each Runnable component.

## Preparing a Multi-Course Meal

We are using LangChain to build a system that answers questions about movies.

We will have a sequence of Runnables that:

1. Retrieve documents related to the query (like finding the right ingredients)
2. Parse the documents to extract relevant information (like prepping the ingredients)
3. Generate a response based on the information (like cooking the ingredients to create a dish)
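The way such a sequence is composed can be imitated with a few lines of plain Python. The sketch below is a toy, not LangChain's `Runnable` class: `Step` is a hypothetical wrapper that gives ordinary functions an `invoke` method and a `|` operator, and the retriever and model are canned stand-ins that return fixed values.

```python
class Step:
    """Hypothetical stand-in for a runnable component: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Chain two steps: run this one first, then feed its output to `other`.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Each "chef" does one job and hands its result to the next one.
retrieve = Step(lambda question: {"question": question, "context": ["Blade Runner (1982)"]})
build_prompt = Step(
    lambda inputs: f"Context: {', '.join(inputs['context'])}\nQuestion: {inputs['question']}"
)
fake_model = Step(lambda prompt: {"content": f"ANSWER based on -> {prompt}"})
parse = Step(lambda message: message["content"])

toy_chain = retrieve | build_prompt | fake_model | parse
answer = toy_chain.invoke("What are some good sci-fi movies from the 1980s?")
print(answer)
```

Because every step exposes the same small interface, steps can be swapped or reordered without touching the others, which is the main appeal of the pipe-style composition.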


# 💻 Create the components (chefs)

In [None]:
!pip install -q langchain_openai

[LangChain ChatPromptTemplate quick reference](https://python.langchain.com/docs/modules/model_io/prompts/quick_start/)

[LangChain `VectorStoreRetriever` documentation](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore/)

[LangChain `ChatOpenAI` documentation](https://python.langchain.com/docs/integrations/chat/openai/)

[LangChain `StrOutputParser` documentation](https://api.python.langchain.com/en/latest/output_parsers/langchain_core.output_parsers.string.StrOutputParser.html)

In [None]:
from langchain_core.runnables.base import RunnableSequence
from langchain_core.runnables.passthrough import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Create the components (chefs)
prompt_template = # TODO: How do we create a prompt template to send to our LLM that will incorporate the documents from our retriever with the question we ask the chat model?

retriever = # TODO: How do we create a retriever for our documents?

chat_model = # TODO: How do we create a chat model / LLM?

parser = # TODO: How do we create a parser to parse the output of our LLM?

# 💻 Create the sequence (recipe)

[LangChain Runnable interface documentation](https://python.langchain.com/docs/expression_language/interface/)

In [None]:
runnable_chain = (
    # TODO: How do we chain the output of our retriever, prompt, model and model output parser so that we can get a good answer to our query?
)

# 💻 Execute the sequence (prepare the meal)

In [None]:
# Synchronous execution
output_chunks = # TODO: How do we execute our chain?
print(''.join(output_chunks))

In the realm of science fiction movies from the 1980s, there are several standout films that have left a lasting impact on the genre. Here are a few recommendations that have stood the test of time:

1. "Blade Runner" (1982) - Directed by Ridley Scott, this neo-noir sci-fi film starring Harrison Ford is a visually stunning masterpiece that explores themes of humanity, identity, and artificial intelligence.

2. "The Terminator" (1984) - Directed by James Cameron, this action-packed sci-fi thriller introduced the world to the iconic character of the Terminator, played by Arnold Schwarzenegger. The film blends thrilling action sequences with a gripping storyline about a cyborg sent from the future to alter the course of history.

3. "E.T. the Extra-Terrestrial" (1982) - Directed by Steven Spielberg, this heartwarming sci-fi film follows the friendship between a young boy named Elliott and a gentle alien stranded on Earth. Filled with emotional depth and magical moments, "E.T." remains a b

In [None]:
# Asynchronous execution (e.g., for a better chatbot user experience)
import asyncio

output_stream = # TODO: How do we execute our chain asynchronously?

async for chunk in output_stream:
  print(chunk, end='', flush=True)  # end='' keeps the streamed chunks on one line


As a snooty movie critic, I must say that selecting the top three movies of the last 20 years is a daunting task, but I shall endeavor to do so with the utmost discernment and eloquence.

1. "Before Midnight": Richard Linklater's masterpiece in the "Before" trilogy stands out as one of the finest examples of indie filmmaking in recent memory. The evolution of the characters played by Julie Delpy and Ethan Hawke is a joy to behold, as they navigate the complexities of love and relationships with raw honesty and authenticity. The film's intimate portrayal of human emotions, coupled with its sharp writing and impeccable performances, make it a modern classic that resonates deeply with audiences.

2. "The General": This film deserves a place at the top of any list of great films for its timeless brilliance and innovative s
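To see the streaming pattern in isolation, here is a minimal sketch that uses an async generator in place of a real model. `fake_token_stream` and `collect` are hypothetical names, and the generator simply yields the words of a fixed string. The example ends with `asyncio.run(...)` so it works as a standalone script; inside a notebook, where an event loop is already running, you would write `await collect(...)` instead.

```python
import asyncio


async def fake_token_stream(text, delay=0.0):
    """Hypothetical stand-in for a streaming model: yields one word at a time."""
    for token in text.split(" "):
        await asyncio.sleep(delay)  # simulate waiting for the next token
        yield token + " "


async def collect(stream):
    """Print chunks as they arrive and return the full text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # end="" keeps chunks on the same line
        pieces.append(chunk)
    print()
    return "".join(pieces)


streamed_text = asyncio.run(collect(fake_token_stream("As a snooty movie critic")))
```

Printing each chunk as soon as it arrives is what makes a chatbot feel responsive: the user starts reading while the rest of the answer is still being generated.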

# 🎉 Congratulations on Completing Your Project!

You have successfully built a Retrieval-Augmented Generation (RAG) question answering system for movie recommendations using the IMDb dataset.

This system allows users to interactively ask questions about movies, and it retrieves relevant information to provide insightful answers.

Great job on reaching this milestone!

From here, we can move on to our [Week 2: Introduction to Retrieval-Augmented Generation (RAG) Experiment Notebook](https://colab.research.google.com/drive/1pyzWvYDCKEmORN_yIsKwki6BQ5I9kXWK#scrollTo=Grd0XoS-tLgs).

# 📝 Submission

Submit your experiment notebook for Week 2 using the form [here](https://docs.google.com/forms/d/1l935d2L3YN3Kj3ovNf3CKWB2EyxvDMkYY_sYte-NYWI/edit).

Please make sure sharing permissions are turned on for everyone with the link.
