# RAG for movie recommendation

In this tutorial we download a dataset of movie metadata from Kaggle and build a retrieval-augmented generation (RAG) pipeline for answering questions about movies. An LLM receives the user query together with information retrieved from a vector database of movies, so that it has the proper material for an informed response.

The process uses the titles and overviews of many movies. The overview texts are split into chunks that should each capture semantic information relevant to the movie. A vector database is then built from the embeddings of those chunks, produced by an embedding model. When a user query about movies comes in, the query is transformed into an embedding and the chunks whose embeddings are most similar to it are retrieved.

The retrieved chunks and the corresponding movie titles are then passed to the LLM, which uses them to form an informed response.

## Download data

First, download the dataset.

In [None]:
import pandas as pd
import os

In [None]:
os.makedirs("data", exist_ok=True)

csv_path = "data/movies_metadata.csv"

# Download and unzip the dataset only if it is not already on disk
if not os.path.isfile(csv_path):
    !wget https://www.kaggle.com/api/v1/datasets/download/rounakbanik/the-movies-dataset
    !mv the-movies-dataset data
    !unzip data/the-movies-dataset -d data/

# Load the dataset and keep the first 5000 movies
df = pd.read_csv(csv_path)
df = df.iloc[:5000, :]

# Display the first few rows of the dataframe
print(df.head())

   adult                              belongs_to_collection    budget  \
0  False  {'id': 10194, 'name': 'Toy Story Collection', ...  30000000   
1  False                                                NaN  65000000   
2  False  {'id': 119050, 'name': 'Grumpy Old Men Collect...         0   
3  False                                                NaN  16000000   
4  False  {'id': 96871, 'name': 'Father of the Bride Col...         0   

                                              genres  \
0  [{'id': 16, 'name': 'Animation'}, {'id': 35, '...   
1  [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...   
2  [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...   
3  [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...   
4                     [{'id': 35, 'name': 'Comedy'}]   

                               homepage     id    imdb_id original_language  \
0  http://toystory.disney.com/toy-story    862  tt0114709                en   
1                                   NaN   8844  tt0113497         



### Keep relevant data

We only need the movie titles and the overview texts. We could keep more columns, depending on the use case.

In [3]:
# keep only the title and overview columns
df = df[['original_title', 'overview']]
print(df.head())
print(df['overview'].iloc[0])

                original_title  \
0                    Toy Story   
1                      Jumanji   
2             Grumpier Old Men   
3            Waiting to Exhale   
4  Father of the Bride Part II   

                                            overview  
0  Led by Woody, Andy's toys live happily in his ...  
1  When siblings Judy and Peter discover an encha...  
2  A family wedding reignites the ancient feud be...  
3  Cheated on, mistreated and stepped on, the wom...  
4  Just when George Banks has recovered from his ...  
Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.
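If a use case needs more than titles and overviews, other columns can be kept as well. As a minimal sketch, the helper below extracts genre names from a value shaped like the `genres` column printed earlier. It assumes the column is stored in the CSV as a stringified Python list of dicts; `parse_genre_names` is a hypothetical helper, not part of the original notebook.

```python
import ast

def parse_genre_names(raw):
    """Return the genre names found in a stringified list of dicts.

    Returns an empty list when the value is missing or cannot be parsed.
    """
    if not isinstance(raw, str):
        return []  # e.g. NaN for a missing value
    try:
        items = ast.literal_eval(raw)  # safely parse the Python literal
    except (ValueError, SyntaxError):
        return []
    if not isinstance(items, list):
        return []
    return [
        item["name"]
        for item in items
        if isinstance(item, dict) and "name" in item
    ]

# Toy value in the same shape as the `genres` column shown above
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genre_names(sample))
```

To use it, the helper would have to be applied before the other columns are dropped, for example with `df['genres'].apply(parse_genre_names)`.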


## Get text embeddings

We need to initialize a model that produces embeddings from text. These models differ from the chat models we have seen so far: embedding models do not generate text (i.e., they do not predict the next token); instead, they produce a vector that encodes the semantic information of an input text. Ollama models that produce embeddings can be found here:

https://ollama.com/search?c=embedding

In [4]:
from ollama import embed, generate
from ollama import EmbedResponse

In [5]:
model_name = 'qwen3-embedding:0.6b'

In [6]:
# try embedding a sample text
response: EmbedResponse = embed(
    model=model_name,
    input='The sky is blue and the sun is shining.'
)
print(len(response.embeddings[0]))
print(response.embeddings[0])

1024
[-0.039944813, -0.039232314, -0.0017944131, 0.004507105, 0.038481183, -0.004437826, 0.008935359, -0.0011897972, -0.120160565, -0.041606613, 0.05052589, 0.07220101, -0.018513046, -0.002214915, -0.039557476, 0.08419371, -0.01714179, 0.085821226, 0.12917759, -0.053607576, -0.020850254, -0.0053504165, -0.0065490543, 0.07958131, -0.00094233203, 0.05159557, -0.0455688, 0.06217082, 0.073429264, 0.021033196, 0.045358114, 0.005637581, -0.0050079036, 0.027454425, 0.008957715, -0.006355414, 0.009720106, 0.046611644, 0.005960911, 0.014761418, 0.0031825167, -0.019159254, 0.011211738, 0.05141365, -0.0038887062, -0.068402745, -0.038023733, -0.029510127, -0.0744873, 0.021962335, 0.012063481, -0.030853705, 0.01894714, -0.00085537793, 0.012379256, 0.025397616, -0.0011478905, 0.0060302, 0.0028706945, -0.016214086, -0.056633197, 0.1063643, -0.0158135, 0.008174522, -0.03789282, 0.065569885, 0.005950511, 0.024345174, -0.07159921, -0.0002666929, -0.0013902251, 0.0015268439, -0.016283281, 0.022359703, -0
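To see how such vectors can be compared, here is a minimal sketch of cosine similarity in plain Python. The three toy vectors are made up for illustration and are not real model outputs; in practice the inputs would be vectors such as `response.embeddings[0]`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values)
v_sky = [0.9, 0.1, 0.0]
v_weather = [0.8, 0.2, 0.1]
v_finance = [0.0, 0.1, 0.9]

print(cosine_similarity(v_sky, v_weather))  # close in direction: higher score
print(cosine_similarity(v_sky, v_finance))  # far in direction: lower score
```

Texts with related meaning are expected to yield vectors that point in similar directions, which is the property the retrieval step relies on.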

## Database

We need to create a vector database that will host the embeddings of the text chunks extracted from the movie overviews. Chunking means splitting each overview into meaningful pieces (what counts as meaningful depends on the task at hand), possibly with some overlap between consecutive chunks.
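To make the roles of chunk size and overlap concrete, here is a minimal sketch of a character-based sliding window. It is a simplification for illustration only: `chunk_text` is a hypothetical helper, and a real splitter would normally also try to break at natural boundaries such as sentences or words.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks

for chunk in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
    print(chunk)
```

The overlap means that a phrase cut at a chunk boundary still appears whole in one of the two neighbouring chunks, at the cost of storing some text twice.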

### Splitting text

To chunk the texts, we can use one of the many splitters available in LangChain, depending on the structure and file type of the input (e.g., HTML, JSON, code).

https://docs.langchain.com/oss/python/integrations/splitters

For the overviews, we just need to split natural language texts. 

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)
texts = text_splitter.create_documents([df['overview'].iloc[0]])
print(len(texts))
print(texts)

4
[Document(metadata={}, page_content="Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto"), Document(metadata={}, page_content="Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against"), Document(metadata={}, page_content='Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo'), Document(metadata={}, page_content='owner, the duo eventually learns to put aside their differences.')]


In [9]:
# make a "chunks" column in the dataframe with the chunks split from the overview column
def split_overview(overview):
    if pd.isna(overview):
        return []
    return text_splitter.create_documents([str(overview)])

df['chunks'] = df['overview'].apply(split_overview)

In [10]:
# Flatten the dataframe for easier processing.
# After this, each chunk gets its own row, and the title and overview
# are duplicated across the rows of the same movie.
chunked_df = df.explode('chunks').reset_index(drop=True)
chunked_df.head()

Unnamed: 0,original_title,overview,chunks
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...","page_content='Led by Woody, Andy's toys live h..."
1,Toy Story,"Led by Woody, Andy's toys live happily in his ...",page_content='Buzz Lightyear onto the scene. A...
2,Toy Story,"Led by Woody, Andy's toys live happily in his ...",page_content='Woody plots against Buzz. But wh...
3,Toy Story,"Led by Woody, Andy's toys live happily in his ...","page_content='owner, the duo eventually learns..."
4,Jumanji,When siblings Judy and Peter discover an encha...,page_content='When siblings Judy and Peter dis...


In [11]:
import os

In [12]:
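# make an "embeddings" column in the dataframe with the embeddings of the chunks
def chunk_to_text(chunk):
    # The splitter returns LangChain Document objects; keep only their text.
    # Anything without a page_content attribute (e.g. NaN) is returned unchanged.
    return getattr(chunk, "page_content", chunk)

def encode_chunk(chunk):
    # Skip missing or empty chunks
    if not isinstance(chunk, str) or chunk.strip() == "":
        print("None embedding for chunk: ", chunk)
        return None
    response: EmbedResponse = embed(
        model=model_name,
        input=chunk
    )
    return response.embeddings[0]

# Check if embeddings have already been computed and saved - takes a long time otherwise
if os.path.isfile('data/chunked_df_embeddings.pkl'):
    chunked_df = pd.read_pickle('data/chunked_df_embeddings.pkl')
else:
    # Convert Document objects to plain strings so they can be embedded and stored
    chunked_df['chunks'] = chunked_df['chunks'].apply(chunk_to_text)
    chunked_df['embeddings'] = chunked_df['chunks'].apply(encode_chunk)
    # Drop rows where 'embeddings' is None
    chunked_df.dropna(subset=['embeddings'], inplace=True)
    chunked_df.to_pickle('data/chunked_df_embeddings.pkl')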

In [13]:
chunked_df.head()

Unnamed: 0,original_title,overview,chunks,embeddings
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ...","[0.04387291, -0.03370365, -0.008714972, 0.0193..."
1,Toy Story,"Led by Woody, Andy's toys live happily in his ...",But when circumstances separate Buzz and Woody...,"[0.051439185, -0.07076929, -0.010691837, 0.044..."
2,Jumanji,When siblings Judy and Peter discover an encha...,When siblings Judy and Peter discover an encha...,"[0.032043517, 0.016517127, -0.0054009426, -0.0..."
3,Jumanji,When siblings Judy and Peter discover an encha...,their living room. Alan's only hope for freedo...,"[0.023679784, -0.008297648, -0.0073013008, 0.0..."
4,Grumpier Old Men,A family wedding reignites the ancient feud be...,A family wedding reignites the ancient feud be...,"[-0.025049845, -0.009733561, -0.008092537, 0.0..."
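Embedding one chunk per call is slow for thousands of chunks. One possible speed-up is to send several chunks per request. The sketch below shows only the batching logic, with a stand-in embedder so that it runs without a model; `batched`, `embed_in_batches` and `fake_embed` are hypothetical names introduced for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed texts batch by batch and return one vector per text, in order."""
    vectors = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedder returned a wrong number of vectors")
        vectors.extend(batch_vectors)
    return vectors

def fake_embed(batch):
    # Stand-in embedder: one 1-dimensional "vector" per text (its length)
    return [[float(len(text))] for text in batch]

print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

With Ollama, `embed_fn` could be something like `lambda batch: embed(model=model_name, input=batch).embeddings`, assuming the installed client accepts a list of strings as `input`; this should be checked against the client version in use.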


### Vector database

We need to create a vector database from the dataframe. We can do it using ChromaDB.

In [14]:
import chromadb

In [15]:
# initialize the ChromaDB client and a vector collection
client = chromadb.Client()
collection = client.get_or_create_collection(name="movies")

In [16]:
# Add the embeddings to the collection
for idx, row in chunked_df.iterrows():
    collection.add(
        documents=[row['chunks']],
        metadatas=[{'title': row['original_title'], 'overview': row['overview']}],
        ids=[str(idx)],
        embeddings=[row['embeddings']]
    )

In [17]:
# Check an example entry from the collection
result = collection.get(ids=[str(0)])

In [18]:
print(result)

{'ids': ['0'], 'embeddings': None, 'documents': ["Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their"], 'uris': None, 'included': ['metadatas', 'documents'], 'data': None, 'metadatas': [{'title': 'Toy Story', 'overview': "Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."}]}


### Retrieve

Now we take a query, compute its embedding, and look up the closest results in the database (here, the top 5).
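Conceptually, this lookup ranks the stored vectors by their distance to the query vector and keeps the nearest ones. The sketch below does this by brute force on toy data. It uses squared Euclidean distance as an assumption for illustration; a vector database typically uses an approximate index and a configurable metric, so this is not how ChromaDB is implemented.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, items, k=5):
    """Return the k nearest (distance, item_id) pairs, nearest first.

    items is a list of (item_id, vector) pairs.
    """
    scored = ((squared_l2(query_vec, vec), item_id) for item_id, vec in items)
    return heapq.nsmallest(k, scored)

# Toy index with hypothetical ids and 2-dimensional vectors
toy_index = [
    ("a", [0.0, 0.0]),
    ("b", [1.0, 1.0]),
    ("c", [0.2, 0.1]),
]

for distance, item_id in top_k([0.0, 0.1], toy_index, k=2):
    print(item_id, distance)
```

A smaller distance means a closer match, which is why results are sorted in increasing order of distance.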

In [19]:
# get a query
query = "A movie about space exploration."

# make embedding for the query
query_embedding_response: EmbedResponse = embed(
    model=model_name,
    input=query
)
query_embedding = query_embedding_response.embeddings[0]

# perform a similarity search in the collection
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
)

In [20]:
print(results)

{'ids': [['1771', '9140', '7440', '2372', '6887']], 'embeddings': None, 'documents': [["Humanity finds a mysterious object buried beneath the lunar surface and sets off to find its origins with the help of HAL 9000, the world's most advanced super computer.", 'Travel alongside the astronauts as they deploy and repair the Hubble Space Telescope, soar above Venus and Mars, and find proof of new planets and the possibility of other life forming around distant stars.', 'Astronauts search for solutions to save a dying Earth by searching on Mars, only to have the mission go terribly awry.', 'An alien and a robot land on earth after World War II and tell mankind to be peaceful or face destruction. A classic science fiction film from Robert Wise with an exceptional message.', "to life. A science fiction thriller from the early 1990's with a star studded cast."]], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'overview': "Humanity finds a myste
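The raw result is a nested dictionary with one inner list per query. Before handing it to a chat model, it can help to flatten it and keep only one chunk per movie. The sketch below assumes the result layout printed above (`documents`, `metadatas` and `distances`, each holding one list per query, ordered from nearest to farthest); `flatten_hits` and the toy values are hypothetical.

```python
def flatten_hits(results):
    """Flatten a single-query result into a list of hit dicts.

    Keeps only the first (nearest) chunk for each movie title.
    """
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    hits = []
    seen_titles = set()
    for document, metadata, distance in zip(documents, metadatas, distances):
        title = metadata.get("title")
        if title in seen_titles:
            continue  # a nearer chunk of this movie was already kept
        seen_titles.add(title)
        hits.append({"title": title, "chunk": document, "distance": distance})
    return hits

# Toy result in the same shape as the printed output (made-up values)
toy_results = {
    "documents": [["chunk one", "chunk two", "chunk three"]],
    "metadatas": [[{"title": "Movie A"}, {"title": "Movie A"}, {"title": "Movie B"}]],
    "distances": [[0.5, 0.7, 0.9]],
}

for hit in flatten_hits(toy_results):
    print(hit)
```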

### Generate response

Based on the query and the retrieved results, we need to use a chat model to generate the response.
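The prompt can include the raw result dictionary as it is, but a more compact alternative is to format only the titles and the retrieved text. The sketch below shows one possible way to assemble such a prompt; `build_prompt`, the wording of the instruction and the toy hits are all hypothetical choices, not the only reasonable ones.

```python
def build_prompt(query, hits):
    """Build a prompt string from a query and a list of hit dicts.

    Each hit is expected to have a "title" and a "chunk" key.
    """
    context_lines = [f"- {hit['title']}: {hit['chunk']}" for hit in hits]
    context = "\n".join(context_lines) if context_lines else "(no context retrieved)"
    return (
        "You are an expert in movie suggestions. "
        "Answer the question using only the context below.\n\n"
        f"Question: {query}\n\n"
        f"Context:\n{context}\n"
    )

# Toy hits (made-up values)
toy_hits = [
    {"title": "Movie A", "chunk": "A crew travels to a distant planet."},
    {"title": "Movie B", "chunk": "Astronauts repair a space telescope."},
]

print(build_prompt("A movie about space exploration.", toy_hits))
```

Keeping the context short and readable leaves more of the model's context window for the answer and makes it easier to inspect what the model was actually given.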

In [21]:
# Select a chat model.
chat_model_name = 'qwen3-vl:4b'

output = generate(
  model=chat_model_name,
  prompt=f"""
  Instruction: You're an expert in movie suggestions. Your task is to analyze carefully the context and come up with an exhaustive answer to the following question:
  {query}
    You can use this context to help you:
    {results}
  """
)

In [22]:
print(output.response)

### Comprehensive Analysis of Space Exploration Movies Based on Your Query  

Your query specifically asks for **movies about space exploration**, and I’ve analyzed the provided context (including metadata, overviews, and similarity scores) to deliver an exhaustive, contextually grounded recommendation. Below, I’ve prioritized the **most relevant movies** from your dataset, ranked by similarity to the query (lowest distance = highest relevance), while clarifying why each qualifies as a space exploration film.  

---

#### ✅ **Top Recommendations (Ranked by Relevance)**  

##### **1. 2001: A Space Odyssey (Distance: 0.68)**  
- **Why it fits**: This is the **most iconic depiction of space exploration** in cinematic history. The film follows humanity’s journey to uncover the origins of a mysterious monolith buried beneath the Moon, featuring groundbreaking visuals of space travel (including the first-ever depiction of Earth from space), the AI HAL 9000, and profound philosophical