# Problem Statement

**Building a Movie Recommendation System with Chroma Vector DB**

**Scenario:** Develop a Python script that uses Chroma Vector DB to build a movie recommendation system. The system matches a user's free-text query against movie descriptions and can narrow the results by metadata.

**Tasks:**

1. For each movie, extract the description text.
2. Use any embedding model to generate an embedding for the description.
3. Insert the movie description, its embedding, and the metadata (genre and director) into the ChromaDB collection.
4. Based on the user query, form a proper filter expression.
5. Retrieve the top 5 results based on the user query.

**Dataset:**

[**Movies Dataset**](https://docs.google.com/spreadsheets/d/17zn3h5pDWTz1CEiouAc0c5KWpobk7OFg8D0wmO6QJoE/edit?usp=sharing)





# Step 1: Install Dependencies

In [None]:
!pip install -q chromadb langchain sentence-transformers pandas



# Step 2: Mount Google Drive

In [None]:
from google.colab import drive
import pandas as pd

# Mount Google Drive
try:
    drive.mount('/content/drive')
    print(" Google Drive mounted successfully.")
except Exception as e:
    print(" Error mounting Drive:", e)


Mounted at /content/drive
 Google Drive mounted successfully.


# Step 3: Load the Dataset

In [None]:
# Load dataset
try:
    file_path = "/content/drive/MyDrive/Kata-6/Movies-Dataset.csv"
    df = pd.read_csv(file_path)
    print(" Dataset loaded successfully!\n")

    # Normalize column names (important for later filtering)
    df.columns = df.columns.str.strip().str.lower()

    # Show first 5 rows in table form
    from IPython.display import display
    display(df.head())

except FileNotFoundError:
    print(" File not found. Please check your path.")
except pd.errors.EmptyDataError:
    print(" The file is empty.")
except Exception as e:
    print("Error loading dataset:", e)

 Dataset loaded successfully!



| Unnamed: 0 | title | description | genre | director |
|---|---|---|---|---|
| 0 | Inception | Inception is a 2010 science fiction action fil... | Science Fiction | Christopher Nolan |
| 1 | The Matrix | The Matrix is a 1999 science fiction action fi... | Science Fiction | The Wachowskis |
| 2 | Fight Club | Fight Club is a 1999 film directed by David Fi... | Drama | David Fincher |
| 3 | Interstellar | Interstellar is a 2014 epic science fiction fi... | Science Fiction | Christopher Nolan |
| 4 | The Silence of the Lambs | The Silence of the Lambs is a 1991 psychologic... | Horror | Jonathan Demme |


# Step 4: Prepare Documents for Embedding

In [None]:
from langchain_core.documents import Document

data = []

try:
    for _, row in df.iterrows():
        title = row.get("title", "")
        description = row.get("description", "No description available")
        genre = row.get("genre", "")
        director = row.get("director", "")

        metadata = {"title": title, "genre": genre, "director": director}
        doc = Document(page_content=str(description), metadata=metadata)
        data.append(doc)

    print(f"Successfully created {len(data)} documents for ChromaDB.")
except Exception as e:
    print(" Error preparing documents:", e)


Successfully created 32 documents for ChromaDB.
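The loop above passes each row's values straight into the metadata dictionary. If a CSV cell is empty, pandas yields `NaN`, and Chroma expects metadata values to be strings, numbers, or booleans. The following is a minimal sketch of a sanitizing step, assuming that replacing missing values with an empty string is acceptable for this dataset; `clean_metadata` is a hypothetical helper, not part of the notebook or of any library.

```python
import math


def clean_metadata(raw: dict) -> dict:
    """Return a copy of `raw` with missing values replaced by empty strings.

    Hypothetical helper: strings are stripped, numbers and booleans are kept,
    and anything else is converted with str().
    """
    cleaned = {}
    for key, value in raw.items():
        # pandas represents an empty cell as float('nan'); None can appear too.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            cleaned[key] = ""
        elif isinstance(value, str):
            cleaned[key] = value.strip()
        elif isinstance(value, (int, float, bool)):
            cleaned[key] = value
        else:
            cleaned[key] = str(value)
    return cleaned


example = clean_metadata(
    {"title": " Inception ", "genre": float("nan"), "director": None}
)
print(example)  # {'title': 'Inception', 'genre': '', 'director': ''}
```

In the loop, this could wrap the dictionary before it is handed to `Document`, for example `metadata = clean_metadata({"title": title, "genre": genre, "director": director})`.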


# Step 5: Generate Embeddings for Each Description

In [None]:
from sentence_transformers import SentenceTransformer

try:
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    print(" Embedding model loaded successfully.")
except Exception as e:
    print(" Error loading embedding model:", e)


 Embedding model loaded successfully.


# Now, generate the embeddings:

In [None]:
try:
    texts = [doc.page_content for doc in data]
    embeddings = embedding_model.encode(texts, show_progress_bar=True)
    print(f" Generated embeddings for {len(embeddings)} documents.")
except Exception as e:
    print(" Error generating embeddings:", e)


 Generated embeddings for 32 documents.
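Each description is now a numeric vector, and recommendations come from finding the stored vectors closest to the query vector. The sketch below illustrates that idea with cosine similarity on made-up three-dimensional vectors. It is a conceptual illustration only: the vector values and the names `cosine_similarity` and `rank_by_similarity` are assumptions, and a vector database uses its own index and configured distance metric rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=5):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors invented for illustration; real embeddings have hundreds of dimensions.
toy_vectors = {
    "space film": [0.9, 0.1, 0.0],
    "courtroom drama": [0.1, 0.9, 0.1],
    "robot thriller": [0.7, 0.2, 0.6],
}
toy_query = [1.0, 0.0, 0.2]

ranking = rank_by_similarity(toy_query, toy_vectors, top_k=2)
print([name for name, _ in ranking])  # ['space film', 'robot thriller']
```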


# Step 6: Store Data in Chroma Vector Database

In [None]:
import chromadb

try:
    client = chromadb.Client()
    collection = client.get_or_create_collection("movie_recommendations")
    print(" ChromaDB collection created or loaded successfully.")
except Exception as e:
    print(" Error creating ChromaDB collection:", e)


 ChromaDB collection created or loaded successfully.


# Now insert the movie data and embeddings into the collection:

In [None]:
try:
    collection.add(
        ids=[f"id_{i}" for i in range(len(data))],
        embeddings=embeddings.tolist(),
        documents=[doc.page_content for doc in data],
        metadatas=[doc.metadata for doc in data]
    )
    print(" All movie data inserted into ChromaDB collection successfully.")
except Exception as e:
    print(" Error inserting data into ChromaDB:", e)


 All movie data inserted into ChromaDB collection successfully.
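The insert cell derives IDs from row positions (`id_0`, `id_1`, ...), so an ID changes meaning if the dataset is reordered. As an optional variation, the sketch below derives an ID from the movie's own fields instead. The helper name `make_stable_id`, the `movie_` prefix, and the choice of a truncated SHA-1 digest are all assumptions made for illustration.

```python
import hashlib


def make_stable_id(title: str, director: str) -> str:
    """Build an ID that depends only on the movie's title and director.

    Hypothetical helper: case and surrounding whitespace are ignored so that
    minor formatting differences map to the same ID.
    """
    key = f"{title.strip().lower()}|{director.strip().lower()}"
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
    return f"movie_{digest}"


first = make_stable_id("Inception", "Christopher Nolan")
second = make_stable_id("  inception ", "CHRISTOPHER NOLAN")
print(first == second)  # True
```

With the notebook's `data` list, the IDs could be built as `[make_stable_id(doc.metadata["title"], doc.metadata["director"]) for doc in data]` and passed as `ids=`. This assumes no two rows share both title and director.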


# Step 7: Query the ChromaDB for Recommendations
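Task 4 asks for a filter expression. Chroma accepts metadata filters as a `where` dictionary on `collection.query`, which restricts the search before the top results are chosen, so a filtered query can still return up to five matches. The sketch below builds such a dictionary from user input. Because metadata matching is exact, it first maps the input to a stored value while ignoring case. The names `ALLOWED_FIELDS`, `resolve_value`, and `build_where` are hypothetical, and returning `None` (no filter) for unknown input is an assumed design choice.

```python
from typing import Iterable, Optional

# Metadata fields stored with each movie that may be filtered on.
ALLOWED_FIELDS = ("genre", "director")


def resolve_value(user_value: str, known_values: Iterable[str]) -> Optional[str]:
    """Map user input to a stored value, ignoring case and outer whitespace."""
    wanted = user_value.strip().lower()
    for value in known_values:
        if str(value).strip().lower() == wanted:
            return value
    return None


def build_where(field: str, user_value: str, known_values: Iterable[str]) -> Optional[dict]:
    """Return a Chroma-style `where` filter, or None if no filter applies."""
    field = field.strip().lower()
    if field not in ALLOWED_FIELDS:
        return None
    stored = resolve_value(user_value, known_values)
    if stored is None:
        return None
    # {"field": {"$eq": value}} selects records whose metadata field equals value.
    return {field: {"$eq": stored}}


genres = ["Science Fiction", "Drama", "Horror"]
print(build_where("genre", "science fiction", genres))
# {'genre': {'$eq': 'Science Fiction'}}
```

The result could then be passed along with the query embedding, for example `collection.query(query_embeddings=..., n_results=5, where=where_filter)`, with the list of known values taken from the corresponding DataFrame column.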

In [None]:
user_query = input(" Enter your movie preference query: ")

try:
    # Ask if user wants to filter by genre or director
    filter_choice = input("Would you like to filter by 'genre' or 'director'? (leave blank to skip): ").strip().lower()
    filter_value = ""

    if filter_choice in ["genre", "director"]:
        filter_value = input(f"Enter the {filter_choice} you want to search for: ").strip()

    # Prepare query embedding
    query_embedding = embedding_model.encode([user_query])

    # Query ChromaDB
    results = collection.query(
        query_embeddings=query_embedding.tolist(),
        n_results=5
    )

    print("\n Top 5 Recommended Movies:\n")

    count = 0
    for i in range(len(results['documents'][0])):
        meta = results['metadatas'][0][i]
        title = meta['title']
        genre = meta['genre']
        director = meta['director']

        # Apply genre/director filter if provided
        if filter_value:
            if filter_choice == "genre" and filter_value.lower() not in genre.lower():
                continue
            if filter_choice == "director" and filter_value.lower() not in director.lower():
                continue

        count += 1
        print(f" Title: {title}")
        print(f" Genre: {genre}")
        print(f" Director: {director}")
        print(f" Description: {results['documents'][0][i][:200]}...\n")

    if count == 0:
        print(" No matching movies found for your filter.")

except Exception as e:
    print(" Error during query or retrieval:", e)


 Enter your movie preference query: space adventure with robots
Would you like to filter by 'genre' or 'director'? (leave blank to skip): drama

 Top 5 Recommended Movies:

 Title: The Matrix
 Genre: Science Fiction
 Director: The Wachowskis
 Description: The Matrix is a 1999 science fiction action film written and directed by the Wachowskis. It depicts a dystopian future in which humanity is unknowingly trapped inside a simulated reality, the Matrix, ...

 Title: Interstellar
 Genre: Science Fiction
 Director: Christopher Nolan
 Description: Interstellar is a 2014 epic science fiction film directed by Christopher Nolan. It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, and Michael Caine. The film fo...

 Title: The Prestige
 Genre: Psychological Thriller
 Director: Christopher Nolan
 Description: The Prestige is a 2006 psychological thriller film directed by Christopher Nolan and written by Nolan and his brother Jonathan, based on Christopher Pr