## Part 1: Setup and Data Loading

### 1.1 Install Required Packages

In [4]:
# Simplified for Colab
%pip install -q llama-index-core llama-index-embeddings-huggingface llama-index-llms-huggingface bitsandbytes accelerate sentence-transformers faiss-cpu

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following de

### 1.2 Import Libraries and Check GPU Availability

This block imports the libraries for the RAG pipeline and checks whether a GPU is available for acceleration.

In [6]:
import torch
import pandas as pd
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from llama_index.core import VectorStoreIndex, Document, Settings, StorageContext, load_index_from_storage
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
import os
import warnings
warnings.filterwarnings('ignore')

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Only query GPU properties when a GPU is present; these calls raise on CPU-only runtimes
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")


Using device: cuda
GPU: Tesla T4
CUDA Version: 12.8
GPU Memory: 15.64 GB


### 1.3 Load MistralAI Model with Quantization

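The Mistral tokenizer and model are loaded here. On a GPU the weights are loaded with 4-bit NF4 quantization via bitsandbytes; on CPU the model falls back to full 32-bit precision.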
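As a rough sanity check on why 4-bit loading matters on a 16 GB GPU, the arithmetic below estimates weight memory from the checkpoint size. It assumes the three safetensors shards of this checkpoint (about 4.94, 5.00, and 4.54 GB) store 16-bit weights and that every weight is quantized to 4 bits. In practice some layers stay in 16-bit and quantization constants add overhead, so treat the 4-bit figure as a lower bound.

```python
# Back-of-the-envelope weight-memory estimate (assumptions noted inline).
SHARD_SIZES_GB = [4.94, 5.00, 4.54]   # checkpoint shard sizes for Mistral-7B-Instruct-v0.2
GPU_MEMORY_GB = 15.64                 # Tesla T4 total memory reported above

fp16_weights_gb = sum(SHARD_SIZES_GB)   # assumes 2 bytes per parameter
params_billion = fp16_weights_gb / 2    # GB / (bytes per parameter) -> billions of parameters
nf4_weights_gb = params_billion * 0.5   # 4 bits = 0.5 byte per parameter (lower bound)

print(f"16-bit weights: {fp16_weights_gb:.2f} GB")
print(f"Estimated parameters: {params_billion:.2f} B")
print(f"4-bit weights (lower bound): {nf4_weights_gb:.2f} GB")
print(f"Headroom with 4-bit weights: {GPU_MEMORY_GB - nf4_weights_gb:.2f} GB")
```

At roughly 14.5 GB, the 16-bit weights alone would leave almost no room for activations and the KV cache on a T4, which is why the GPU branch quantizes.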

In [7]:
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"
if torch.cuda.is_available():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.float16,
    )
else:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float32,
        trust_remote_code=True,
    )


`torch_dtype` is deprecated! Use `dtype` instead!



### 1.4 Configure LlamaIndex LLM Wrapper

The loaded Mistral model and tokenizer are wrapped in LlamaIndex's HuggingFaceLLM interface, which also sets the context window and generation parameters.

In [8]:
# Configure LlamaIndex LLM wrapper
llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    context_window=4096,
    model_kwargs={"temperature": 0.1, "do_sample": True},
    generate_kwargs={"temperature": 0.1, "do_sample": True},
    device_map="auto",
)

print("LLM wrapper configured successfully!")

LLM wrapper configured successfully!
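The context window and `max_new_tokens` together bound how much retrieved text fits in a prompt. The arithmetic below is a simplified budget, assuming the prompt and the generated tokens share the 4096-token window and that each retrieved chunk is at most 512 tokens (the chunk size set in Section 2.1). LlamaIndex's actual accounting may differ in detail.

```python
CONTEXT_WINDOW = 4096   # context_window passed to HuggingFaceLLM
MAX_NEW_TOKENS = 512    # max_new_tokens passed to HuggingFaceLLM
CHUNK_SIZE = 512        # assumed upper bound per retrieved chunk (Settings.chunk_size)
TOP_K = 5               # chunks retrieved per query in this notebook

prompt_budget = CONTEXT_WINDOW - MAX_NEW_TOKENS   # tokens left for the prompt
worst_case_context = TOP_K * CHUNK_SIZE           # retrieved text at its largest
remaining = prompt_budget - worst_case_context    # left for instructions and the question

print(f"Prompt budget: {prompt_budget} tokens")
print(f"Worst-case retrieved context: {worst_case_context} tokens")
print(f"Left for template and query: {remaining} tokens")
```

Under these assumptions, five full-size chunks still leave about a thousand tokens for the prompt template and the question.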


### 1.5 Load IMDB Dataset

The IMDB dataset is loaded from CSV; its shape, column names, and first five rows are shown below.

In [9]:
# Load the IMDB dataset
df = pd.read_csv("IMDB_top_10000_07132023.csv")

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

Dataset shape: (9999, 12)

Columns: ['Title', 'Year', 'Genres', 'Certificate', 'Runtime', 'Rating', 'Metascore', 'Votes', 'Gross(Million)', 'Director', 'Stars', 'Summary']

First few rows:


| | Title | Year | Genres | Certificate | Runtime | Rating | Metascore | Votes | Gross(Million) | Director | Stars | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Drama | R | 142.0 | 9.3 | 82.0 | 2764512 | 28.34 | Frank Darabont | Tim Robbins, Morgan Freeman, Bob Gunton, Willi... | Over the course of several years, two convicts... |
| 1 | The Dark Knight | 2008 | Action, Crime, Drama | PG-13 | 152.0 | 9.0 | 84.0 | 2737769 | 534.86 | Christopher Nolan | Christian Bale, Heath Ledger, Aaron Eckhart, M... | When the menace known as the Joker wreaks havo... |
| 2 | Inception | 2010 | Action, Adventure, Sci-Fi | PG-13 | 148.0 | 8.8 | 74.0 | 2429452 | 292.58 | Christopher Nolan | Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio... | A thief who steals corporate secrets through t... |
| 3 | Fight Club | 1999 | Drama | R | 139.0 | 8.8 | 67.0 | 2201448 | 37.03 | David Fincher | Brad Pitt, Edward Norton, Meat Loaf, Zach Grenier | An insomniac office worker and a devil-may-car... |
| 4 | Forrest Gump | 1994 | Drama, Romance | PG-13 | 142.0 | 8.8 | 82.0 | 2150299 | 330.25 | Robert Zemeckis | Tom Hanks, Robin Wright, Gary Sinise, Sally Field | The history of the United States from the 1950... |


### 1.6 Data Exploration

Here, we perform exploratory data analysis: missing-value counts, summary statistics, and the genre distribution.

In [10]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

# Basic statistics
print("\n" + "="*50)
print("Dataset Statistics:")
print("="*50)
print(f"Total movies: {len(df)}")
print(f"Year range: {df['Year'].min()} - {df['Year'].max()}")
print(f"Rating range: {df['Rating'].min():.2f} - {df['Rating'].max():.2f}")
print(f"Average rating: {df['Rating'].mean():.2f}")
print(f"Median rating: {df['Rating'].median():.2f}")

# Genre analysis
all_genres = []
for genres in df['Genres'].dropna():
    all_genres.extend([g.strip() for g in str(genres).split(',')])
genre_counts = pd.Series(all_genres).value_counts()
print(f"\nTop 10 genres:")
print(genre_counts.head(10))

Missing values per column:
Title                0
Year                 0
Genres               0
Certificate        374
Runtime              2
Rating               0
Metascore         1949
Votes                0
Gross(Million)    2741
Director             0
Stars                3
Summary              0
dtype: int64

Total missing values: 5069

Dataset Statistics:
Total movies: 9999
Year range: 1950 - 2022
Rating range: 1.00 - 9.30
Average rating: 6.57
Median rating: 6.70

Top 10 genres:
Drama        5719
Comedy       3682
Action       2560
Crime        2079
Adventure    1707
Thriller     1704
Romance      1694
Horror       1283
Mystery      1077
Fantasy       736
Name: count, dtype: int64
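Because a movie can carry several genres, the counts above sum to more than the number of movies. The stdlib sketch below reproduces the same split-and-count logic on the five `Genres` values shown in the `df.head()` preview; `count_genres` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def count_genres(genre_strings):
    """Count genre labels across comma-separated genre strings."""
    counts = Counter()
    for genres in genre_strings:
        counts.update(g.strip() for g in genres.split(","))
    return counts

# Genres of the first five rows in the preview above
sample = [
    "Drama",
    "Action, Crime, Drama",
    "Action, Adventure, Sci-Fi",
    "Drama",
    "Drama, Romance",
]
counts = count_genres(sample)
print(counts.most_common(2))
print(sum(counts.values()), "genre labels across", len(sample), "movies")
```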


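Missing values are handled as follows: rows without a Title or Year are dropped, numeric columns are filled with the column median, and text columns are filled with "Unknown".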

In [11]:
# 1. Drop rows that are missing a Title or Year
df = df.dropna(subset=["Title", "Year"]).copy()

# 2. Fill numeric columns
numeric_cols = ["Rating", "Runtime", "Metascore", "Gross(Million)", "Votes"]
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")  # ensure numeric
        df[col] = df[col].fillna(df[col].median())

# 3. Fill text columns with "Unknown"
text_cols = ["Genres", "Certificate", "Director", "Stars", "Summary"]
for col in text_cols:
    if col in df.columns:
        df[col] = df[col].fillna("Unknown")

print("Missing values after cleaning:")
print(df.isnull().sum())


Missing values after cleaning:
Title             0
Year              0
Genres            0
Certificate       0
Runtime           0
Rating            0
Metascore         0
Votes             0
Gross(Million)    0
Director          0
Stars             0
Summary           0
dtype: int64
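Median imputation, as used for the numeric columns above, can be illustrated without pandas. The sketch below uses made-up values (not rows from the dataset) and a hypothetical helper `fill_with_median`; it also shows that the median is far less sensitive to a single outlier than the mean.

```python
from statistics import mean, median

def fill_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Made-up figures; None marks a missing value, 1000.0 is a deliberate outlier
values = [10.0, None, 20.0, 30.0, None, 1000.0]
filled = fill_with_median(values)
observed = [v for v in values if v is not None]

print("Filled:", filled)
print("Median of observed:", median(observed))
print("Mean of observed:", mean(observed))
```

One consequence worth keeping in mind: after imputation, filled values are indistinguishable from observed ones, so with 2,741 missing `Gross(Million)` entries, aggregate statistics on that column are pulled toward the median.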


### 1.7 Create Rich Text Representations

Here, a unified text representation is created for each movie by combining its metadata and summary; this text is what gets embedded.

In [12]:
def create_movie_text(row):
    title = str(row['Title']) if pd.notna(row['Title']) else "Unknown"
    year = int(row['Year']) if pd.notna(row['Year']) else "Unknown"
    genres = str(row['Genres']) if pd.notna(row['Genres']) else "Unknown"
    director = str(row['Director']) if pd.notna(row['Director']) else "Unknown"
    stars = str(row['Stars']) if pd.notna(row['Stars']) else "Unknown"
    summary = str(row['Summary']) if pd.notna(row['Summary']) else "No summary available"

    # Create rich text representation
    text = f"{title} ({year}) - {genres} | Director: {director} | Stars: {stars} | Summary: {summary}"

    # Add additional metadata if available
    if pd.notna(row['Rating']):
        text += f" | Rating: {row['Rating']}"
    if pd.notna(row['Runtime']):
        text += f" | Runtime: {row['Runtime']} minutes"
    if pd.notna(row['Certificate']):
        text += f" | Certificate: {row['Certificate']}"

    return text

# Create text representations for all movies
df['movie_text'] = df.apply(create_movie_text, axis=1)

# Display sample
print("Sample movie text representation:")
print("="*80)
print(df['movie_text'].iloc[0])
print("="*80)
print(f"\nTotal movies processed: {len(df)}")

Sample movie text representation:
The Shawshank Redemption (1994) - Drama | Director: Frank Darabont | Stars: Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler | Summary: Over the course of several years, two convicts form a friendship, seeking consolation and, eventually, redemption through basic compassion. | Rating: 9.3 | Runtime: 142.0 minutes | Certificate: R

Total movies processed: 9999


## Part 2: Vector Index Construction

### 2.1 Initialize Embedding Model
Here, the sentence-transformers embedding model is initialized and the LlamaIndex settings for embeddings and chunking are configured.

In [13]:
# Initialize sentence-transformers embedding model
# Using all-MiniLM-L6-v2 for good balance of speed and quality
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

print(f"Loading embedding model: {EMBED_MODEL_NAME}...")
embed_model = HuggingFaceEmbedding(model_name=EMBED_MODEL_NAME)
print("Embedding model loaded successfully!")

# Configure LlamaIndex Settings
Settings.embed_model = embed_model
Settings.llm = llm
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("LlamaIndex Settings configured!")

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2...



Embedding model loaded successfully!
LlamaIndex Settings configured!
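`chunk_size` and `chunk_overlap` control how long documents are split before embedding. The sketch below is a simplified sliding-window splitter over a token list, assuming fixed-size windows. LlamaIndex's default splitter is sentence-aware, so this only illustrates the size and overlap arithmetic, and `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

long_doc = list(range(1000))   # stand-in for 1000 tokens
short_doc = list(range(80))    # stand-in for a short movie description

long_chunks = chunk_tokens(long_doc, chunk_size=512, overlap=50)
short_chunks = chunk_tokens(short_doc, chunk_size=512, overlap=50)
print([len(c) for c in long_chunks], len(short_chunks))
```

Each movie text in this notebook is a few dozen words, well under 512 tokens, so every document should map to a single chunk.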


### 2.2 Create Document Objects
Here, we convert each movie's processed text and metadata into a LlamaIndex Document object to prepare for vector indexing.

In [14]:
# Create Document objects from processed movie texts
documents = []
for idx, row in df.iterrows():
    doc = Document(
        text=row['movie_text'],
        metadata={
            "title": row['Title'],
            "year": int(row['Year']) if pd.notna(row['Year']) else None,
            "genres": row['Genres'],
            "director": row['Director'],
            "rating": float(row['Rating']) if pd.notna(row['Rating']) else None,
            "index": idx
        }
    )
    documents.append(doc)

print(f"Created {len(documents)} document objects")
print(f"\nSample document metadata:")
print(documents[0].metadata)

Created 9999 document objects

Sample document metadata:
{'title': 'The Shawshank Redemption', 'year': 1994, 'genres': 'Drama', 'director': 'Frank Darabont', 'rating': 9.3, 'index': 0}


### 2.3 Build and Persist Vector Index
The vector store index is built from the documents and persisted to disk; on later runs the persisted index is loaded instead of being rebuilt.

In [15]:
# Check if index already exists
INDEX_DIR = "./movie_index"

if os.path.exists(INDEX_DIR):
    print(f"Loading existing index from {INDEX_DIR}...")
    storage_context = StorageContext.from_defaults(persist_dir=INDEX_DIR)
    index = load_index_from_storage(storage_context)
    print("Index loaded successfully!")
else:
    print("Building new vector index...")
    print("This may take several minutes...")

    # Build vector index
    index = VectorStoreIndex.from_documents(
        documents,
        show_progress=True
    )

    # Persist index to disk
    print(f"\nPersisting index to {INDEX_DIR}...")
    index.storage_context.persist(persist_dir=INDEX_DIR)
    print("Index built and persisted successfully!")

print(f"\nIndex type: {type(index)}")
print(f"Index stats: {len(documents)} documents indexed")

Building new vector index...
This may take several minutes...




Persisting index to ./movie_index...
Index built and persisted successfully!

Index type: <class 'llama_index.core.indices.vector_store.base.VectorStoreIndex'>
Index stats: 9999 documents indexed
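The load-or-build pattern used above can be shown in isolation. This stdlib sketch caches a JSON file instead of a vector index; `load_or_build` and `expensive_build` are hypothetical names. Like the cell above, it trusts whatever is on disk, so a stale cache has to be deleted by hand if the source data or the embedding model changes.

```python
import json
import os
import tempfile

def load_or_build(path, build_fn):
    """Return (data, status): load from `path` if it exists, else build and persist."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f), "loaded"
    data = build_fn()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data, "built"

build_calls = []

def expensive_build():
    build_calls.append(1)   # track how often the expensive step runs
    return {"documents": 3}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "index.json")
    first = load_or_build(cache_path, expensive_build)
    second = load_or_build(cache_path, expensive_build)

print(first[1], second[1], len(build_calls))
```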


### 2.4 Test Similarity Search
This cell checks that the index works by retrieving the five most similar movies for a test query.
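Conceptually, the retriever embeds the query and ranks documents by vector similarity. The sketch below does this in plain Python on toy 3-dimensional vectors, assuming cosine similarity as the metric (believed to be the default for LlamaIndex's in-memory vector store, but worth verifying for your version). The vectors, labels, and helper names are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (label, score) pairs with the highest similarity to the query."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings; real ones from all-MiniLM-L6-v2 have 384 dimensions
toy_docs = {
    "space movie": [0.9, 0.1, 0.0],
    "alien comedy": [0.7, 0.6, 0.1],
    "courtroom drama": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
results = top_k(query, toy_docs, k=2)
for label, score in results:
    print(f"{label}: {score:.4f}")
```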

In [16]:
# Test similarity search
retriever = index.as_retriever(similarity_top_k=5)

test_query = "movies about space and aliens"
print(f"Test query: '{test_query}'")
print("="*80)

results = retriever.retrieve(test_query)
for i, node in enumerate(results, 1):
    print(f"\nResult {i} (Score: {node.score:.4f}):")
    print(f"Title: {node.metadata.get('title', 'Unknown')}")
    print(f"Year: {node.metadata.get('year', 'Unknown')}")
    print(f"Text preview: {node.text[:200]}...")

Test query: 'movies about space and aliens'

Result 1 (Score: 0.5455):
Title: Aliens
Year: 1986
Text preview: Aliens (1986) - Action, Adventure, Sci-Fi | Director: James Cameron | Stars: Sigourney Weaver, Michael Biehn, Carrie Henn, Paul Reiser | Summary: Decades after surviving the Nostromo incident, Ellen R...

Result 2 (Score: 0.5367):
Title: Pixels
Year: 2015
Text preview: Pixels (2015) - Action, Comedy, Fantasy | Director: Chris Columbus | Stars: Adam Sandler, Kevin James, Michelle Monaghan, Peter Dinklage | Summary: When aliens misinterpret video feeds of classic arca...

Result 3 (Score: 0.5360):
Title: Galaxy Quest
Year: 1999
Text preview: Galaxy Quest (1999) - Adventure, Comedy, Sci-Fi | Director: Dean Parisot | Stars: Tim Allen, Sigourney Weaver, Alan Rickman, Tony Shalhoub | Summary: The alumni cast of a space opera television series...

Result 4 (Score: 0.5349):
Title: Arrival
Year: 2016
Text preview: Arrival (2016) - Drama, Mystery, Sci-Fi | Director: Denis Villeneuve | St

## Part 3: Semantic Query Implementation

### 3.1 Configure Query Engine for RAG
The LlamaIndex query engine is configured with the LLM, the number of chunks to retrieve, and the response mode.

In [17]:
semantic_query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",
    llm=llm,
)

print("Semantic query engine configured successfully!")

Semantic query engine configured successfully!


### 3.2 Implement RAG Pipeline with Source Attribution
Here the semantic query function is defined: it retrieves context, generates an answer, and returns the source movies' metadata.

In [18]:
def semantic_query(query: str, top_k: int = 5):
    try:
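        # Build the engine with the requested top_k so the answer and the
        # reported sources come from the same retrieval pass
        query_engine = index.as_query_engine(
            similarity_top_k=top_k,
            response_mode="compact",
            llm=llm,
        )

        # Generate the answer; the response carries the nodes it was grounded on
        response = query_engine.query(query)
        retrieved_nodes = response.source_nodes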

        # Extract source movies
        sources = []
        for node in retrieved_nodes:
            source_info = {
                "title": node.metadata.get('title', 'Unknown'),
                "year": node.metadata.get('year', 'Unknown'),
                "rating": node.metadata.get('rating', 'Unknown'),
                # score can be None, in which case float() would raise
                "similarity_score": float(node.score) if node.score is not None else None
            }
            sources.append(source_info)

        return {
            "query": query,
            "query_type": "semantic",
            "answer": str(response),
            "sources": sources,
            "num_sources": len(sources)
        }
    except Exception as e:
        return {
            "query": query,
            "query_type": "semantic",
            "answer": f"Error processing query: {str(e)}",
            "sources": [],
            "num_sources": 0,
            "error": str(e)
        }

print("Semantic query function defined!")

Semantic query function defined!


In [19]:
queries = [
    "Movies about dystopian futures and oppressive governments",
    "Films that explore complex father–son relationships",
    "Movies centered around survival in extreme environments",
    "Films that blend comedy with crime or heist elements",
    "Movies that revolve around unraveling mysteries or solving puzzles"
]

for q in queries:
    print("="*100)
    print("Query:", q)
    print(semantic_query(q))


Query: Movies about dystopian futures and oppressive governments


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'query': 'Movies about dystopian futures and oppressive governments', 'query_type': 'semantic', 'answer': '1. The Humanity Bureau (2017) - Sci-Fi, Thriller\n2. Tenet (2020) - Action, Sci-Fi, Thriller (Although not strictly a dystopian future, the film does depict a world on the brink of destruction)\n\nThese movies fit the query as they both involve dystopian futures and oppressive governments or systems. The Humanity Bureau is set in a future with economic recession and environmental problems, while the government in Tenet is trying to prevent the end of the world.', 'sources': [{'title': 'Tenet', 'year': 2020, 'rating': 7.3, 'similarity_score': 0.4675318167337076}, {'title': 'The Humanity Bureau', 'year': 2017, 'rating': 4.5, 'similarity_score': 0.4617433820656208}, {'title': 'Daylight', 'year': 1996, 'rating': 5.9, 'similarity_score': 0.46137373298685885}, {'title': 'In Time', 'year': 2011, 'rating': 6.7, 'similarity_score': 0.460579192933707}, {'title': 'In a Valley of Violence', 


{'query': 'Films that explore complex father–son relationships', 'query_type': 'semantic', 'answer': '1. Mom and Dad (2017) - A teenage girl and her younger brother must survive a wild 24 hours during which a mass hysteria of unknown origin causes parents to turn violently on their own kids. This film explores the complex relationship between a father and his children.\n2. To Father with Love (2016) - An intelligent son takes revenge on a powerful and rich businessman who is as smart as he is. This film explores the complex relationship between a father and his son.\n3. My Father the Hero (1994) - A teenage girl on vacation in the Bahamas with her divorced father tries to impress a potential boyfriend by saying that her father is actually her lover. This film explores the complex relationship between a father and his daughter.\n4. The Father (2020) - A man refuses all assistance from his daughter as he ages and begins to doubt his loved ones, his own mind and even the fabric of his rea


{'query': 'Movies centered around survival in extreme environments', 'query_type': 'semantic', 'answer': '1. The Survivalist (2015) - A survivalist lives off a small plot of land hidden deep in forest and his existence is threatened when two women seeking food and shelter discover his farm.\n2. Wilderness (2006) - Juvenile delinquents are sent to a small British island and have to fight for survival.\n3. Cadaver (2020) - A family of three attends a charitable event at a hotel, but when people start to disappear, they have to survive in the starving aftermath of a nuclear disaster.\n4. Wild Tales (2014) - Six short stories that explore the extremities of human behavior involving people in distress.\n5. Survive the Night (2020) - A disgraced doctor and his family are held hostage at their home by criminals on the run and they have to survive when a robbery-gone-awry requires them to seek immediate medical attention.', 'sources': [{'title': 'The Survivalist', 'year': 2015, 'rating': 6.4, 


{'query': 'Films that blend comedy with crime or heist elements', 'query_type': 'semantic', 'answer': "1. The Maiden Heist (2009) - A comedy centered on three museum security guards who devise a plan to steal back the artworks to which they have become attached after they are transferred to another museum.\n2. Tower Heist (2011) - A comedy about a group of hard-working guys who conspire to rob their wealthy employer's high-rise residence after finding out they've fallen victim to his Ponzi scheme.\n3. Fun with Dick and Jane (2005) - A comedy about an affluent couple who turn to a life of crime to make ends meet after losing all their money following a series of blunders.\n4. Setup (2011) - An action comedy about a group of friends who become involved in a potentially deadly diamond heist.\n5. Heist (2015) - A drama comedy about a father who, as a last resort, partners with a greedy co-worker to rob a casino.\n\nNote: The films are listed in the order they appear in the context informat


{'query': 'Movies that revolve around unraveling mysteries or solving puzzles', 'query_type': 'semantic', 'answer': '1. Reminiscence (2021) - Mystery, Romance, Sci-Fi\n2. Spiral (2021) - Crime, Horror, Mystery\n3. Enigma (2001) - Drama, Mystery, Romance\n4. The Thing (2011) - Horror, Mystery, Sci-Fi\n5. Divergent (2014) - Action, Adventure, Mystery\n\nThese movies involve mysteries or puzzles that the main characters aim to unravel or solve.', 'sources': [{'title': 'Reminiscence', 'year': 2021, 'rating': 5.9, 'similarity_score': 0.46238517315197436}, {'title': 'Spiral', 'year': 2021, 'rating': 5.2, 'similarity_score': 0.45754138723728516}, {'title': 'Enigma', 'year': 2001, 'rating': 6.4, 'similarity_score': 0.4529738654552854}, {'title': 'The Thing', 'year': 2011, 'rating': 6.2, 'similarity_score': 0.4511178903592493}, {'title': 'Divergent', 'year': 2014, 'rating': 6.6, 'similarity_score': 0.4489437467634806}], 'num_sources': 5}
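The raw dictionaries above are hard to scan. A small formatter such as the hypothetical `render_result` below prints the answer followed by a ranked source list. The sample input reuses two sources from the last query's output, with the answer text omitted and the scores rounded for brevity.

```python
def render_result(result: dict) -> str:
    """Render a semantic_query-style result dict as readable text."""
    lines = [f"Query: {result['query']}", f"Answer: {result['answer']}", "Sources:"]
    for rank, src in enumerate(result["sources"], 1):
        score = src.get("similarity_score")
        score_text = f"{score:.4f}" if score is not None else "n/a"
        lines.append(
            f"  {rank}. {src['title']} ({src['year']}), rating {src['rating']}, score {score_text}"
        )
    return "\n".join(lines)

sample = {
    "query": "Movies that revolve around unraveling mysteries or solving puzzles",
    "answer": "(answer text omitted)",
    "sources": [
        {"title": "Reminiscence", "year": 2021, "rating": 5.9, "similarity_score": 0.4624},
        {"title": "Spiral", "year": 2021, "rating": 5.2, "similarity_score": 0.4575},
    ],
}
rendered = render_result(sample)
print(rendered)
```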


## Part 4: Factual Query Implementation

### 4.1 Design Code Generation Prompt Template

The prompt template embeds the dataset schema, a set of rules, and pandas syntax examples to guide the LLM toward generating valid pandas code.

In [20]:
# Get dataset schema for code generation
def get_dataset_schema():
    """Get dataset schema information for code generation prompts."""
    schema_info = {
        "columns": list(df.columns),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "sample_values": {}
    }

    # Add sample values for key columns
    for col in ['Title', 'Year', 'Genres', 'Rating', 'Director', 'Gross(Million)', 'Runtime']:
        if col in df.columns:
            sample_vals = df[col].dropna().head(3).tolist()
            schema_info["sample_values"][col] = sample_vals

    return schema_info

schema = get_dataset_schema()

def create_code_generation_prompt(query: str):
    """Create a prompt for generating pandas code to answer factual queries."""
    prompt = f"""You are a data analyst. Given a pandas DataFrame named 'df' with the following schema:

Columns: {schema['columns']}

Data types:
{chr(10).join([f"  - {col}: {dtype}" for col, dtype in schema['dtypes'].items()])}

Sample values:
{chr(10).join([f"  - {col}: {vals}" for col, vals in schema['sample_values'].items()])}

IMPORTANT RULES:
1. The DataFrame is already loaded as 'df' - do NOT load it again
2. Use ONLY pandas and numpy - no other imports
3. Do NOT use file operations, system calls, or dangerous operations
4. Return ONLY valid Python code that can be executed directly
5. The code should answer the question: "{query}"
6. CRITICAL: You MUST assign the final result to a variable named 'result'
7. If the result is a single value, use .iloc[0] or .values[0] to extract it

PANDAS SYNTAX EXAMPLES:
- Filtering: df[df['Column'] > value] or df[df['Column'].str.contains('text', case=False, na=False)]
- Groupby: df.groupby('Column')['OtherColumn'].agg_function() NOT df['Column'].groupby('OtherColumn')
- Multiple conditions: df[(df['Col1'] > val1) & (df['Col2'] == val2)] - use parentheses around each condition
- String contains: df['Genres'].str.contains('Sci-Fi', case=False, na=False) - always use na=False
- Count rows: df[condition].shape[0] or len(df[condition])
- Max/Min: df.groupby('Col')['OtherCol'].max().idxmax() for finding the key with max value

COMMON MISTAKES TO AVOID:
- WRONG: df['Col'].groupby('OtherCol') - should be df.groupby('OtherCol')['Col']
- WRONG: (df['A'] > 1 & df['B'] > 2) - should be (df['A'] > 1) & (df['B'] > 2)
- WRONG: df[condition].all(1) on Series - use .all() without axis or proper boolean indexing
- WRONG: Missing na=False in str.contains() - always add na=False

Question: {query}

Generate pandas code to answer this question. Return ONLY the code, no explanations:"""

    return prompt



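Here, a set of regex substitutions rewrites common pandas mistakes in the generated code. The fixes are applied only when the first execution fails, after which the corrected code is run once more. The retry wrapper calls `execute_code_safely`, which is defined in Section 4.2, so both cells must be run before the first factual query.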

In [21]:
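import re  # used by the regex fixes below

def fix_common_pandas_errors(code: str):
    """Apply regex rewrites for common pandas mistakes in LLM-generated code."""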
    fixes = [
        (r"df\['([^']+)'\]\.groupby\('([^']+)'\)", r"df.groupby('\2')['\1']"),
        (r"df\[([^&|]+)\s*&\s*([^&|]+)\]", r"df[(\1) & (\2)]"),
        (r"df\[([^&|]+)\s*\|\s*([^&|]+)\]", r"df[(\1) | (\2)]"),
        (r"\.all\(1\)", r".all()"),
        # Add na=False only when the call does not already pass na=
        (r"\.str\.contains\((?![^)]*na=)([^)]+)\)", r".str.contains(\1, na=False)"),
    ]

    fixed_code = code
    for pattern, replacement in fixes:
        fixed_code = re.sub(pattern, replacement, fixed_code)

    return fixed_code

def execute_code_safely_with_retry(code: str, df: pd.DataFrame, max_retries: int = 1):
    result, error = execute_code_safely(code, df)

    # If error occurred and we have retries left, try fixing common errors
    if error and max_retries > 0:
        fixed_code = fix_common_pandas_errors(code)
        if fixed_code != code:
            print(f"Attempting to fix code error...")
            result, error = execute_code_safely(fixed_code, df)
            if not error:
                print(f"Code fixed successfully!")

    return result, error

print("Code fixing function added!")

Code fixing function added!
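Two of the rewrites can be checked in isolation. The stdlib sketch below applies the groupby pattern from `fix_common_pandas_errors` to a sample string, and uses plain integers to show why unparenthesized conditions go wrong: in Python, `&` binds more tightly than comparison operators. The sample inputs are made up for illustration.

```python
import re

# Same pattern/replacement pair as the first entry in `fixes`
GROUPBY_FIX = (r"df\['([^']+)'\]\.groupby\('([^']+)'\)", r"df.groupby('\2')['\1']")

bad = "result = df['Rating'].groupby('Director').mean()"
fixed = re.sub(GROUPBY_FIX[0], GROUPBY_FIX[1], bad)
print(fixed)

# Operator precedence: `3 > 1 & 2 > 0` parses as the chain `3 > (1 & 2) > 0`
unparenthesized = 3 > 1 & 2 > 0
parenthesized = (3 > 1) & (2 > 0)
print(unparenthesized, parenthesized)
```

With pandas Series the unparenthesized form typically raises an error rather than silently returning a wrong value, which is why the prompt asks for parentheses around each condition.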


### 4.2 Implement Execution Framework
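An execution environment is created that screens the generated code for dangerous patterns before running it. This is pattern-based screening rather than a true sandbox: the executed code still has access to Python's builtins.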
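The cell below screens code with regular expressions, which can both miss and over-match patterns. An alternative is to inspect the parsed syntax tree. The sketch here is not the notebook's validator; it is a minimal AST-based check under the assumption that only `pandas` and `numpy` imports are allowed and that a short, hypothetical list of builtin calls is banned. It does not make `exec` safe against a determined attacker.

```python
import ast

ALLOWED_IMPORTS = {"pandas", "numpy"}                        # assumed allow-list
BANNED_CALLS = {"eval", "exec", "open", "__import__", "compile", "input"}  # assumed deny-list

def find_violations(code: str):
    """Return a list of human-readable problems found in `code`."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"from {node.module} import ...")
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in BANNED_CALLS:
                problems.append(f"call to {node.func.id}()")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder attribute {node.attr}")
    return problems

print(find_violations("result = df['Rating'].mean()"))
print(find_violations("import os\nresult = open('x').read()"))
```

Working on the tree avoids false positives such as a column name that happens to contain the word `subprocess`, at the cost of a little more code.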

In [22]:
import re
import ast

def extract_code_from_response(response_text: str):
    """Extract Python code from LLM response (handles markdown code blocks)."""
    # Try to extract code from markdown code blocks
    code_block_pattern = r'```(?:python)?\n?(.*?)```'
    matches = re.findall(code_block_pattern, response_text, re.DOTALL)

    if matches:
        return matches[0].strip()

    # If no code blocks, try to find code-like content
    lines = response_text.strip().split('\n')
    code_lines = []
    in_code = False

    for line in lines:
        # Skip markdown headers and explanations
        if line.strip().startswith('#') or 'import' in line or 'df.' in line or 'pd.' in line or 'np.' in line:
            in_code = True
        if in_code and (line.strip() == '' or line.strip().startswith('#') or
                       any(keyword in line for keyword in ['import', 'df', 'pd', 'np', 'print', '='])):
            code_lines.append(line)

    if code_lines:
        return '\n'.join(code_lines).strip()

    return response_text.strip()

def validate_code_safety(code: str):
    """Validate that code doesn't contain dangerous operations."""
    dangerous_patterns = [
        r'__import__',
        r'eval\(',
        r'exec\(',
        r'open\(',
        r'file\(',
        r'input\(',
        r'raw_input\(',
        r'subprocess',
        r'os\.system',
        r'shell\s*=\s*True',
        r'import\s+os',
        r'import\s+subprocess',
        r'import\s+sys',
    ]

    for pattern in dangerous_patterns:
        if re.search(pattern, code, re.IGNORECASE):
            return False, f"Dangerous pattern detected: {pattern}"

    # Check for allowed imports only
    allowed_imports = ['pandas', 'pd', 'numpy', 'np']
    import_lines = re.findall(r'import\s+(\w+)', code)
    for imp in import_lines:
        if imp not in allowed_imports:
            return False, f"Disallowed import: {imp}"

    return True, "Code is safe"

def execute_code_safely(code: str, df: pd.DataFrame):
    """Execute pandas code safely and return results."""
    try:
        # Validate code safety
        is_safe, message = validate_code_safety(code)
        if not is_safe:
            return None, f"Code validation failed: {message}"

        # Create execution environment
        exec_globals = {
            'df': df,
            'pd': pd,
            'np': np,
            '__builtins__': __builtins__
        }

        # Execute code
        exec(code, exec_globals)

        # Try to capture result (if code assigns to a variable or returns something)
        # Check common result variable names
        result = None
        for var_name in ['result', 'answer', 'output', 'res']:
            if var_name in exec_globals:
                result = exec_globals[var_name]
                break

        # If no result variable, try to get last expression result
        # This is a simplified approach - in practice, you might need AST parsing
        if result is None:
            # Try to evaluate last line if it's an expression
            lines = code.strip().split('\n')
            last_line = lines[-1].strip()
            if not last_line.startswith('#') and '=' not in last_line:
                try:
                    result = eval(last_line, exec_globals)
                except Exception:
                    pass

        return result, None

    except Exception as e:
        return None, f"Execution error: {str(e)}"

print("Safe code execution framework implemented!")

Safe code execution framework implemented!
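The fallback that evaluates the last line skips any line containing `=`, which also excludes comparisons such as `==`, and it re-runs that line after `exec` has already executed it. One common alternative, sketched here with the hypothetical helper `run_and_capture`, splits off a trailing expression with `ast` so it is evaluated exactly once. It assumes the code has already passed validation and does no sandboxing of its own.

```python
import ast

def run_and_capture(code: str, env: dict):
    """Execute `code` in `env`; return the trailing expression's value, else env['result']."""
    tree = ast.parse(code)
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # Detach the final expression so it is evaluated once, separately
        last_expr = ast.Expression(tree.body.pop().value)
        exec(compile(tree, "<generated>", "exec"), env)
        return eval(compile(last_expr, "<generated>", "eval"), env)
    exec(compile(tree, "<generated>", "exec"), env)
    return env.get("result")

# Toy data standing in for the DataFrame
value = run_and_capture("top = max(ratings)\ntop == 9.3", {"ratings": [9.3, 9.0, 8.8]})
assigned = run_and_capture("result = len(ratings)", {"ratings": [1, 2]})
print(value, assigned)
```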


### 4.3 Implement Factual Query Pipeline with Result Formatting

The code generation and execution steps are orchestrated here, and the result is then formatted into a natural-language answer by the LLM.

In [23]:
def format_result_with_llm(query: str, result):
    """Format numerical/statistical result into natural language using LLM."""
    try:
        # Convert result to string representation
        if isinstance(result, (pd.DataFrame, pd.Series)):
            result_str = result.to_string()
        elif isinstance(result, np.ndarray):
            result_str = str(result.tolist())
        else:
            result_str = str(result)

        # Create formatting prompt
        format_prompt = f"""Convert the following data analysis result into a clear, natural language answer.

Question: {query}

Result:
{result_str}

Provide a concise, natural language answer (2-3 sentences maximum):"""

        # Use LLM to format the answer
        formatted_response = llm.complete(format_prompt)
        return str(formatted_response).strip()
    except Exception:
        # Fallback to simple formatting
        if isinstance(result, (pd.DataFrame, pd.Series)):
            return f"The result is:\n{result.to_string()}"
        else:
            return f"The answer is: {result}"

def factual_query(query: str):
    """
    Process factual/statistical queries using code generation and execution.

    Args:
        query: Natural language question requiring numerical computation

    Returns:
        dict: Answer, generated code, and metadata
    """
    try:
        # Generate code
        code_prompt = create_code_generation_prompt(query)
        code_response = llm.complete(code_prompt)
        generated_code = extract_code_from_response(str(code_response))

        # Execute code
        result, error = execute_code_safely_with_retry(generated_code, df)

        if error:
            return {
                "query": query,
                "query_type": "factual",
                "answer": f"Error executing code: {error}",
                "generated_code": generated_code,
                "error": error
            }

        if result is None:
            return {
                "query": query,
                "query_type": "factual",
                "answer": "Code executed but no result was captured. Please check the generated code.",
                "generated_code": generated_code,
                "raw_result": None
            }

        # Format result into natural language
        formatted_answer = format_result_with_llm(query, result)

        return {
            "query": query,
            "query_type": "factual",
            "answer": formatted_answer,
            "generated_code": generated_code,
            "raw_result": result
        }

    except Exception as e:
        return {
            "query": query,
            "query_type": "factual",
            "answer": f"Error processing query: {str(e)}",
            "generated_code": None,
            "error": str(e)
        }

print("Factual query function defined!")

Factual query function defined!
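
`format_result_with_llm` sends every result to the LLM, including empty tables and zero counts. The sketch below shows how such results could instead be answered with a fixed message, on the assumption that a deterministic reply is preferable when there is nothing to summarise. `describe_empty_result` is a hypothetical helper, not part of the notebook; it relies only on the boolean `.empty` attribute that pandas `DataFrame` and `Series` objects expose.

```python
def describe_empty_result(result):
    """Return a fixed message for empty results, or None if the result has content."""
    # pandas DataFrame and Series expose a boolean `.empty` attribute.
    if getattr(result, "empty", False) is True:
        return "No rows in the dataset match this query."
    # Plain containers and strings with zero length.
    if isinstance(result, (list, tuple, dict, set, str)) and len(result) == 0:
        return "No rows in the dataset match this query."
    # A count of exactly zero (bool is excluded so False is not read as a count).
    if isinstance(result, int) and not isinstance(result, bool) and result == 0:
        return "The count for this query is 0."
    return None
```

One possible use is to call this helper in `factual_query` just before `format_result_with_llm` and return its message whenever it is not `None`.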


## Part 5: Query Classification and Integration

### 5.1 Implement Query Classification

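The LLM classifies each incoming question as semantic or factual so that it can be routed to the matching pipeline. If the reply contains neither label, or the call fails, a keyword heuristic decides instead.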

In [24]:
def classify_query(query: str):
    """
    Classify a query as either 'semantic' or 'factual'.

    Semantic queries: Require conceptual understanding and reasoning
    Factual queries: Require numerical computations, statistics, aggregations, or filtering
    """
    classification_prompt = f"""Classify the following movie-related question into one of two categories:

1. SEMANTIC: Questions about movie themes, content, concepts, recommendations, or understanding
   Examples:
   - "What are some alien-related movies?"
   - "Which films explore time travel themes?"
   - "Recommend movies about redemption"
   - "What movies feature strong female protagonists?"

2. FACTUAL: Questions requiring numerical computations, statistics, aggregations, or filtering
   Examples:
   - "What's the average rating of James Bond movies?"
   - "Which director has the highest-grossing film?"
   - "How many sci-fi movies were released after 2010?"
   - "What is the median runtime of R-rated movies?"

Question: "{query}"

Respond with ONLY one word: either "semantic" or "factual":"""

    # Keyword heuristic used when the LLM reply is unclear or the call fails
    factual_keywords = ['average', 'mean', 'median', 'sum', 'count', 'how many',
                        'highest', 'lowest', 'maximum', 'minimum', 'total', 'statistics']

    def keyword_fallback():
        if any(keyword in query.lower() for keyword in factual_keywords):
            return 'factual'
        return 'semantic'

    try:
        response = llm.complete(classification_prompt)
        classification = str(response).strip().lower()

        # Validate classification
        if 'semantic' in classification:
            return 'semantic'
        elif 'factual' in classification:
            return 'factual'
        # Reply contained neither label
        return keyword_fallback()
    except Exception as e:
        print(f"Classification error: {e}")
        return keyword_fallback()

# Test classification
test_queries = [
    "What are some alien-related movies?",
    "What's the average rating of all movies?",
    "Recommend psychological thrillers",
    "How many movies were released in 2020?"
]

print("Testing Query Classification")
print("="*80)
for query in test_queries:
    qtype = classify_query(query)
    print(f"Query: '{query}'")
    print(f"Classification: {qtype.upper()}")
    print()

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Testing Query Classification


Query: 'What are some alien-related movies?'
Classification: SEMANTIC



Query: 'What's the average rating of all movies?'
Classification: FACTUAL



Query: 'Recommend psychological thrillers'
Classification: SEMANTIC

Query: 'How many movies were released in 2020?'
Classification: FACTUAL
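
The substring checks in `classify_query` have two edge cases: a reply such as "factual, not semantic" matches `'semantic'` first, and fallback keywords like `count` or `mean` also match inside `country` or `meaning`. The sketch below shows one way to tighten both using whole-word regular expressions. `parse_label` and `FACTUAL_KEYWORDS` are hypothetical names, not part of the notebook.

```python
import re

# Keywords taken from the notebook's fallback list.
FACTUAL_KEYWORDS = ("average", "mean", "median", "sum", "count", "how many",
                    "highest", "lowest", "maximum", "minimum", "total")

def parse_label(reply: str, query: str) -> str:
    """Map a raw LLM reply to 'semantic' or 'factual'."""
    # Take whichever label appears first in the reply, as a whole word.
    match = re.search(r"\b(semantic|factual)\b", reply.lower())
    if match:
        return match.group(1)
    # No usable label: fall back to whole-word keyword matching on the query.
    pattern = r"\b(" + "|".join(re.escape(k) for k in FACTUAL_KEYWORDS) + r")\b"
    return "factual" if re.search(pattern, query.lower()) else "semantic"
```

With this version, a query such as "Movies set in a foreign country" is not pushed to the factual pipeline by the `count` keyword when the LLM reply is unusable.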



### 5.2 Create Unified Question-Answering Interface

The classification and routing logic is wrapped in a single function that handles either type of question (semantic or factual).

In [25]:
def answer_question(query: str, verbose: bool = True):
    """
    Unified interface to answer any movie-related question.
    Automatically classifies the query and routes to appropriate pipeline.

    Args:
        query: Natural language question about movies
        verbose: Whether to print detailed information

    Returns:
        dict: Complete answer with metadata
    """
    import time

    start_time = time.time()

    # Classify query
    query_type = classify_query(query)

    if verbose:
        print(f"Query classified as: {query_type.upper()}")
        print("Processing...")

    # Route to appropriate pipeline
    if query_type == 'semantic':
        result = semantic_query(query)
    else:
        result = factual_query(query)

    # Add processing metadata
    processing_time = time.time() - start_time
    result['processing_time'] = processing_time
    result['classification'] = query_type

    return result

print("Unified question-answering interface created!")

Unified question-answering interface created!
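
The routing step can also be written as a lookup in a dictionary of pipelines, which makes it easy to test with stand-in functions and to add further query types. The sketch below is one such variant under stated assumptions: `route_query` and its parameters are hypothetical names, unknown labels fall back to a default pipeline, and `time.perf_counter()` is used in place of `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import time

def route_query(query, classify, pipelines, default="semantic"):
    """Classify `query`, run the matching pipeline, and attach metadata."""
    start = time.perf_counter()
    label = classify(query)
    # Unknown labels use the default pipeline instead of raising KeyError.
    if label not in pipelines:
        label = default
    # Copy the pipeline's dict so the caller's object is not mutated.
    result = dict(pipelines[label](query))
    result["classification"] = label
    result["processing_time"] = time.perf_counter() - start
    return result
```

In the notebook's terms, the call would look like `route_query(query, classify_query, {"semantic": semantic_query, "factual": factual_query})`.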


### 5.3 Display Function for Unified Results

The result is printed together with its metadata: the classification, the processing time, the retrieved sources for semantic queries, and the generated code for factual queries.

In [26]:
def display_result(result):
    """Display unified query result in a user-friendly format."""
    print("="*80)
    print(f"QUERY: {result['query']}")
    print(f"CLASSIFICATION: {result['classification'].upper()}")
    print(f"PROCESSING TIME: {result.get('processing_time', 0):.2f} seconds")
    print("="*80)
    print(f"\nANSWER:\n{result['answer']}")

    # Display sources for semantic queries
    if result['classification'] == 'semantic' and 'sources' in result:
        print(f"\nSOURCES ({result.get('num_sources', 0)} movies):")
        for i, source in enumerate(result['sources'], 1):
            print(f"  {i}. {source['title']} ({source['year']}) - Rating: {source['rating']}")

    # Display code for factual queries
    if result['classification'] == 'factual' and 'generated_code' in result and result['generated_code']:
        print("\nGENERATED CODE:")
        print("-"*80)
        print(result['generated_code'])
        print("-"*80)

    # Display errors if any
    if 'error' in result:
        print(f"\nERROR: {result['error']}")

    print("="*80)
    print()

print("Display function created!")

Display function created!


### 5.4 Comprehensive Testing with Diverse Queries

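The system is tested on 10 queries: 5 intended as semantic and 5 intended as factual. In the run below, the Avatar runtime question is classified as semantic and answered through retrieval rather than code generation.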

In [27]:
# Comprehensive test queries covering both types
comprehensive_queries = [
    # Semantic queries
    "What are some movies about artificial intelligence?",
    "Recommend movies with complex plot twists",
    "What films explore themes of justice and revenge?",
    "Which movies feature time loops or time manipulation?",
    "Tell me about The Dark Knight",

    # Factual queries
    "What's the average rating of Christopher Nolan movies?",
    "How many action movies are in the dataset?",
    "Which year had the highest average movie rating?",
    "What is the runtime of the movie Avatar?",
    "What are the top 5 highest rated movies?"
]

print("Comprehensive Testing of Unified System")
print("="*80)
print()

for i, query in enumerate(comprehensive_queries, 1):
    print(f"\nTest {i}/{len(comprehensive_queries)}")
    result = answer_question(query, verbose=False)
    display_result(result)

Comprehensive Testing of Unified System


Test 1/10


QUERY: What are some movies about artificial intelligence?
CLASSIFICATION: SEMANTIC
PROCESSING TIME: 27.20 seconds

ANSWER:
1. A.I. Artificial Intelligence (2001) - A highly advanced robotic boy longs to become "real" so that he can regain the love of his human mother.
2. Ex Machina (2014) - A young programmer is selected to participate in a ground-breaking experiment in synthetic intelligence by evaluating the human qualities of a highly advanced humanoid A.I.
3. Transcendence (2014) - A scientist's drive for artificial intelligence takes on dangerous implications when his own consciousness is uploaded into one such program.
4. Robot & Frank (2012) - An ex-jewel thief receives a gift from his son: a robot butler programmed to look after him. But soon the two companions try their luck as a heist team.
5. Robots (2005) - In a robot world, a young idealistic inventor travels to the big city to join his inspiration's company, only to find himself opposing its sinister new management.

SOU

QUERY: Recommend movies with complex plot twists
CLASSIFICATION: SEMANTIC
PROCESSING TIME: 29.47 seconds

ANSWER:
1. Circle (2015) - This movie involves fifty strangers who are held captive and forced to choose the one person among them who deserves to live. The plot twist comes when the audience realizes that the situation is not as it seems.
2. Twisted (2004) - This movie follows a police officer investigating a murder, only to find herself in the center of her own investigation as her former lovers start being murdered. The plot twist comes when the identity of the killer is revealed.
3. Unlocked (2017) - This movie follows a CIA interrogator who is lured into a ruse that puts London at risk of a biological attack. The plot twist comes when the true intentions of the antagonist are revealed.
4. Creepshow 2 (1987) - This movie contains three macabre tales from a boy's favorite comic book. The plot twists come in each of the three stories, adding to the complexity of the movie as a wh

QUERY: What films explore themes of justice and revenge?
CLASSIFICATION: SEMANTIC
PROCESSING TIME: 27.81 seconds

ANSWER:
1. Do Revenge (2022) - A comedy about two girls who seek revenge against their bullies.
2. Equilibrium (2002) - An action, drama, and sci-fi film where a man overthrows a system that suppresses emotions and feelings.
3. Vengeance (2022) - A comedy, music, and mystery film about a writer investigating the murder of a girl he hooked up with.
4. Poetic Justice (1993) - A drama and romance film about a grieving hairdresser on a road trip and dealing with loss and love.
5. Revenge (1990) - An action, crime, and drama film about a retired US Navy pilot who falls in love with a powerful businessman's wife and faces the consequences.

All of these films explore themes of justice and revenge in various ways. Some focus on seeking revenge against bullies or those who have wronged them, while others explore the concept of justice on a larger scale, such as overthrowing a syste

QUERY: Which movies feature time loops or time manipulation?
CLASSIFICATION: SEMANTIC
PROCESSING TIME: 23.53 seconds

ANSWER:
1. Synchronicity (2015) - A physicist invents a time machine and travels back to the past to uncover the truth.
2. Timecrimes (2007) - A man accidentally travels back in time nearly an hour and faces unforeseen consequences.
3. Boss Level (2020) - A former special forces agent is trapped in a time loop that constantly repeats the day of his murder.
4. Timecop (1994) - A security agency regulates time travel, and a shady politician tries to change the past to control the future.
5. Time Lapse (2014) - Three friends discover a mysterious machine that takes pictures twenty-four hours into the future, and they conspire to use it for personal gain, but disturbing and dangerous images begin to develop.

These movies feature time loops or time manipulation as significant elements in their plots.

SOURCES (5 movies):
  1. Synchronicity (2015) - Rating: 5.5
  2. Timecrim

QUERY: Tell me about The Dark Knight
CLASSIFICATION: SEMANTIC
PROCESSING TIME: 13.87 seconds

ANSWER:

The Dark Knight is a 2008 action, crime, and drama film directed by Christopher Nolan. The movie features Christian Bale, Heath Ledger, Aaron Eckhart, and Michael Caine in the lead roles. The story revolves around Batman, who faces one of the greatest psychological and physical tests of his ability to fight injustice when the menace known as the Joker wreaks havoc and chaos on the people of Gotham. The film has a rating of 9.0 and a runtime of 152 minutes.

SOURCES (5 movies):
  1. The Dark Knight (2008) - Rating: 9.0
  2. The Dark Knight Rises (2012) - Rating: 8.4
  3. A Knight's Tale (2001) - Rating: 6.9
  4. Black Knight (2001) - Rating: 4.9
  5. Last Knights (2015) - Rating: 6.2


Test 6/10


QUERY: What's the average rating of Christopher Nolan movies?
CLASSIFICATION: FACTUAL
PROCESSING TIME: 9.84 seconds

ANSWER:
The average rating of Christopher Nolan's movies is approximately 8.2, with a slight variation due to the rounding of decimal places.

GENERATED CODE:
--------------------------------------------------------------------------------
result = df[df['Director'].str.contains('Christopher Nolan', case=False, na=False)].Rating.mean()
--------------------------------------------------------------------------------


Test 7/10


QUERY: How many action movies are in the dataset?
CLASSIFICATION: FACTUAL
PROCESSING TIME: 9.72 seconds

ANSWER:
There are a total of 2560 action movies in the dataset. This large number underscores the popularity and enduring appeal of the action genre in cinema.

GENERATED CODE:
--------------------------------------------------------------------------------
result = len(df[df['Genres'].str.contains('Action', na=False)])
--------------------------------------------------------------------------------


Test 8/10


QUERY: Which year had the highest average movie rating?
CLASSIFICATION: FACTUAL
PROCESSING TIME: 10.64 seconds

ANSWER:
The year with the highest average movie rating was 1952. This finding may be surprising, as one might expect more recent years to have higher ratings due to advancements in film technology and increased competition. However, the data suggests that movies from 1952 received the highest average rating from viewers.

GENERATED CODE:
--------------------------------------------------------------------------------
result = df.groupby('Year')['Rating'].mean().idxmax()
--------------------------------------------------------------------------------


Test 9/10


QUERY: What is the runtime of the movie Avatar?
CLASSIFICATION: SEMANTIC
PROCESSING TIME: 4.20 seconds

ANSWER:
162.0 minutes or 2 hours and 42 minutes.

SOURCES (5 movies):
  1. Avatar (2009) - Rating: 7.9
  2. 9 (2009) - Rating: 7.0
  3. Evangelion: 3.0+1.01 Thrice Upon a Time (2021) - Rating: 8.0
  4. The Girl Who Leapt Through Time (2006) - Rating: 7.7
  5. Turning Red (2022) - Rating: 7.0


Test 10/10


QUERY: What are the top 5 highest rated movies?
CLASSIFICATION: FACTUAL
PROCESSING TIME: 25.23 seconds

ANSWER:
The top 5 highest rated movies according to the data are "The Shawshank Redemption" (1994) with a rating of 9.3, "The Chaos Class" (1975) with a rating of 9.2, "Ramayana: The Legend of Prince Rama" (1993) with a rating of 9.2, "The Godfather" (1972) with a rating of 9.2, and "Daman" (2022) with a rating of 9.1. These films span various genres and years, but all have received critical acclaim and high viewer ratings.

GENERATED CODE:
--------------------------------------------------------------------------------
result = df.sort_values(by=['Rating'], ascending=False).head(5)
--------------------------------------------------------------------------------
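
In the run above, "What is the runtime of the movie Avatar?" is listed among the factual queries but was classified as SEMANTIC. A small helper that compares predicted labels with expected ones makes such mismatches visible at a glance. The sketch below assumes the expected labels come from the comments in `comprehensive_queries`; `classification_report` is a hypothetical name, not part of the notebook.

```python
def classification_report(labeled_queries, classify):
    """Compare predicted labels with expected ones; return accuracy and mismatches."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    mismatches = []
    for query, expected in labeled_queries:
        predicted = classify(query)
        if predicted != expected:
            mismatches.append((query, expected, predicted))
    accuracy = 1 - len(mismatches) / len(labeled_queries)
    return accuracy, mismatches
```

Passing pairs such as `("Tell me about The Dark Knight", "semantic")` together with `classify_query` returns the share of queries routed as intended and the list of those that were not.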



### 5.5 Error Handling and Edge Cases

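The system is tested here with three edge cases: a release year outside the dataset, a rating threshold that no movie can exceed, and a vague plot description with no title.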

In [28]:
edge_case_queries = [
    "What movies were released in the year 3000?",
    "How many movies have a rating greater than 10?",
    "The one where the ship sinks",
]

print("Testing Error Handling and Edge Cases")
print("="*80)
print()

for query in edge_case_queries:
    print(f"\nEdge Case Query: '{query}'")
    try:
        result = answer_question(query, verbose=False)
        display_result(result)
    except Exception as e:
        print(f"Exception caught: {str(e)}")
        print("="*80)
        print()

Testing Error Handling and Edge Cases


Edge Case Query: 'What movies were released in the year 3000?'


QUERY: What movies were released in the year 3000?
CLASSIFICATION: FACTUAL
PROCESSING TIME: 9.75 seconds

ANSWER:
There are no movies released in the year 3000 based on the available data.

GENERATED CODE:
--------------------------------------------------------------------------------
result = df[df['Year'] == 3000]
--------------------------------------------------------------------------------


Edge Case Query: 'How many movies have a rating greater than 10?'


QUERY: How many movies have a rating greater than 10?
CLASSIFICATION: FACTUAL
PROCESSING TIME: 9.50 seconds

ANSWER:
Based on the data analyzed, no movies in the dataset have a rating higher than 10. The highest rating recorded was a 10, indicating that these movies received perfect scores.

GENERATED CODE:
--------------------------------------------------------------------------------
result = len(df[df['Rating'] > 10])
--------------------------------------------------------------------------------


Edge Case Query: 'The one where the ship sinks'


QUERY: The one where the ship sinks
CLASSIFICATION: SEMANTIC
PROCESSING TIME: 33.76 seconds

ANSWER:
1. Pitch Black (2000) - A transport ship crashes and leaves its crew stranded on a desert planet inhabited by bloodthirsty creatures that come out during an eclipse.
2. USS Indianapolis: Men of Courage (2016) - During World War II, an American navy ship is sunk by a Japanese submarine leaving 890 crewmen stranded in shark infested waters.
3. Ghost Ship (2002) - A salvage crew discovers a long-lost 1962 passenger ship floating lifeless in a remote region of the Bering Sea, and soon notices that its long-dead inhabitants may still be on board.

Of the given options, Pitch Black, USS Indianapolis: Men of Courage, and Ghost Ship all involve a ship sinking or crashing. Pitch Black is a sci-fi horror film where a transport ship crashes on a desert planet. USS Indianapolis: Men of Courage is a historical drama about a navy ship sunk by a Japanese submarine during World War II. Ghost Ship is a 