# DeepSeek and MongoDB For Movie Recommendation System


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/rag/deepseek_r1_rag_pipeline_with_mongodb.ipynb)

[![View Article](https://img.shields.io/badge/View%20Article-blue)]()

## Install Libaries and Set Environment Variables

In [None]:
!pip install --quiet -U pymongo sentence-transformers datasets accelerate

In [2]:
import getpass
import os


# Function to securely get and set environment variables
def set_env_securely(var_name, prompt):
    value = getpass.getpass(prompt)
    os.environ[var_name] = value

## Step 1: Data Loading

In [3]:
# Load Dataset
import pandas as pd
from datasets import load_dataset

# https://huggingface.co/datasets/MongoDB/embedded_movies
dataset = load_dataset("MongoDB/embedded_movies")

# Convert the dataset to a pandas DataFrame
dataset_df = pd.DataFrame(dataset["train"])

sample_mflix.embedded_movies.json:   0%|          | 0.00/42.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [4]:
# Remove data point where plot column is missing
dataset_df = dataset_df.dropna(subset=["fullplot"])
print("\nNumber of missing values in each column after removal:")
print(dataset_df.isnull().sum())

# Remove the plot_embedding from each data point in the dataset as we are going to create new embeddings with an open-source embedding model from Hugging Face: gte-large
dataset_df = dataset_df.drop(columns=["plot_embedding"])


Number of missing values in each column after removal:
plot                    0
runtime                14
genres                  0
fullplot                0
directors              12
writers                13
countries               0
poster                 78
languages               1
cast                    1
title                   0
num_mflix_comments      0
rated                 279
imdb                    0
awards                  0
type                    0
metacritic            893
plot_embedding          1
dtype: int64


In [24]:
dataset_df.head()

Unnamed: 0,plot,runtime,genres,fullplot,directors,writers,countries,poster,languages,cast,title,num_mflix_comments,rated,imdb,awards,type,metacritic,embedding
0,Young Pauline is left a lot of money when her ...,199.0,[Action],Young Pauline is left a lot of money when her ...,"[Louis J. Gasnier, Donald MacKenzie]","[Charles W. Goddard (screenplay), Basil Dickey...",[USA],https://m.media-amazon.com/images/M/MV5BMzgxOD...,[English],"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",The Perils of Pauline,0,,"{'id': 4465, 'rating': 7.6, 'votes': 744}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[-0.06366512924432755, 0.05893123149871826, -0..."
1,A penniless young man tries to save an heiress...,22.0,"[Comedy, Short, Action]",As a penniless man worries about how he will m...,"[Alfred J. Goulding, Hal Roach]",[H.M. Walker (titles)],[USA],https://m.media-amazon.com/images/M/MV5BNzE1OW...,[English],"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...",From Hand to Mouth,0,TV-G,"{'id': 10146, 'rating': 7.0, 'votes': 639}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,,"[-0.04760091006755829, -0.008872468955814838, ..."
2,"Michael ""Beau"" Geste leaves England in disgrac...",101.0,"[Action, Adventure, Drama]","Michael ""Beau"" Geste leaves England in disgrac...",[Herbert Brenon],"[Herbert Brenon (adaptation), John Russell (ad...",[USA],,[English],"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...",Beau Geste,0,,"{'id': 16634, 'rating': 6.9, 'votes': 222}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[0.022996366024017334, 0.10801853239536285, -0..."
3,"Seeking revenge, an athletic young man joins t...",88.0,"[Adventure, Action]",A nobleman vows to avenge the death of his fat...,[Albert Parker],"[Douglas Fairbanks (story), Jack Cunningham (a...",[USA],https://m.media-amazon.com/images/M/MV5BMzU0ND...,,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...",The Black Pirate,1,,"{'id': 16654, 'rating': 7.2, 'votes': 1146}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[-0.07819894701242447, 0.11125769466161728, -0..."
4,An irresponsible young millionaire changes his...,58.0,"[Action, Comedy, Romance]","The Uptown Boy, J. Harold Manners (Lloyd) is a...",[Sam Taylor],"[Ted Wilde (story), John Grey (story), Clyde B...",[USA],https://m.media-amazon.com/images/M/MV5BMTcxMT...,[English],"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...",For Heaven's Sake,0,PASSED,"{'id': 16895, 'rating': 7.6, 'votes': 918}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,,"[-0.014855025336146355, 0.09593196213245392, -..."


## Step 2: Generating Embeddings

In [55]:
from sentence_transformers import SentenceTransformer

# Load the embedding model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


# Function to generate embeddings
def generate_embedding(text):
    return embedding_model.encode([text])[0].tolist()

In [6]:
dataset_df["embedding"] = dataset_df["fullplot"].apply(generate_embedding)

## Step 3: MongoDB (Operational and Vector Database)

MongoDB acts as both an operational and a vector database for the RAG system.
MongoDB Atlas specifically provides a database solution that efficiently stores, queries and retrieves vector embeddings.

Creating a database and collection within MongoDB is made simple with MongoDB Atlas.

1. First, register for a [MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register). For existing users, sign into MongoDB Atlas.
2. [Follow the instructions](https://www.mongodb.com/docs/atlas/tutorial/deploy-free-tier-cluster/). Select Atlas UI as the procedure to deploy your first cluster.

Follow MongoDB’s [steps to get the connection](https://www.mongodb.com/docs/manual/reference/connection-string/) string from the Atlas UI. After setting up the database and obtaining the Atlas cluster connection URI, securely store the URI within your development environment.


In [7]:
# Set MongoDB URI
set_env_securely("MONGO_URI", "Enter your MONGO URI: ")

Enter your MONGO URI: ··········


In [8]:
import pymongo


def get_mongo_client(mongo_uri):
    """Establish and validate connection to the MongoDB."""

    client = pymongo.MongoClient(
        mongo_uri, appname="devrel.showcase.rag.deepseek_rag_movies.python"
    )

    # Validate the connection
    ping_result = client.admin.command("ping")
    if ping_result.get("ok") == 1.0:
        # Connection successful
        print("Connection to MongoDB successful")
        return client
    else:
        print("Connection to MongoDB failed")
    return None


MONGO_URI = os.environ["MONGO_URI"]
if not MONGO_URI:
    print("MONGO_URI not set in environment variables")

In [9]:
mongo_client = get_mongo_client(MONGO_URI)

DB_NAME = "movies_database"
COLLECTION_NAME = "movies_collection"

# Create or get the database
db = mongo_client[DB_NAME]

# Create or get the collections
collection = db[COLLECTION_NAME]

Connection to MongoDB successful


In [10]:
collection.delete_many({})

DeleteResult({'n': 0, 'electionId': ObjectId('7fffffff000000000000003c'), 'opTime': {'ts': Timestamp(1738352202, 1), 't': 60}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1738352202, 1), 'signature': {'hash': b'\xe4\xa5\xe1\x04\xcd\xc6\xcf\x8aI\xe2\xbd:\xc5\xf6\xa1\xa1Jk\xf6\xea', 'keyId': 7421923411288391683}}, 'operationTime': Timestamp(1738352202, 1)}, acknowledged=True)

## Step 4: Data Ingestion

In [11]:
documents = dataset_df.to_dict("records")
collection.insert_many(documents)

print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed


## Step 5: Vector Index Creation

In [12]:
# The field containing the text embeddings on each document within the shipping_data collection
embedding_field_name = "embedding"
# MongoDB Atlas Vector Search index name
vector_search_index_name = "vector_index"

In [13]:
import time

from pymongo.operations import SearchIndexModel


def setup_vector_search_index(collection, index_definition, index_name="vector_index"):
    """
    Setup a vector search index for a MongoDB collection and wait for 30 seconds.

    Args:
    collection: MongoDB collection object
    index_definition: Dictionary containing the index definition
    index_name: Name of the index (default: "vector_index")
    """
    new_vector_search_index_model = SearchIndexModel(
        definition=index_definition, name=index_name, type="vectorSearch"
    )

    # Create the new index
    try:
        result = collection.create_search_index(model=new_vector_search_index_model)
        print(f"Creating index '{index_name}'...")

        # Sleep for 30 seconds
        print(f"Waiting for 30 seconds to allow index '{index_name}' to be created...")
        time.sleep(30)

        print(f"30-second wait completed for index '{index_name}'.")
        return result

    except Exception as e:
        print(f"Error creating new vector search index '{index_name}': {e!s}")
        return None

In [14]:
def create_vector_index_definition(dimensions):
    return {
        "fields": [
            {
                "type": "vector",
                "path": embedding_field_name,
                "numDimensions": dimensions,
                "similarity": "cosine",
            }
        ]
    }

In [15]:
DIMENSIONS = 384
vector_index_definition = create_vector_index_definition(dimensions=DIMENSIONS)

In [16]:
setup_vector_search_index(collection, vector_index_definition, "vector_index")

Creating index 'vector_index'...
Waiting for 30 seconds to allow index 'vector_index' to be created...
30-second wait completed for index 'vector_index'.


'vector_index'

## Step 6: Vector Search Function

In [48]:
def vector_search(user_query, top_k=150):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = generate_embedding(user_query)

    if query_embedding is None:
        return "Invalid query or embedding generation failed."

    # Define the vector search pipeline
    vector_search_stage = {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "embedding",
            "numCandidates": top_k,  # Number of candidate matches to consider
            "limit": 5,  # Return top 4 matches
        }
    }

    project_stage = {
        "$project": {
            "_id": 0,  # Exclude the _id field
            "fullplot": 1,  # Include the plot field
            "title": 1,  # Include the title field
            "genres": 1,  # Include the genres field
            "score": {"$meta": "vectorSearchScore"},  # Include the search score
        }
    }

    pipeline = [vector_search_stage, project_stage]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)

## Step 7: Semantic Search

In [44]:
query = "What are the some interesting action movies to watch that include business?"

get_knowledge = vector_search(query)

pd.DataFrame(get_knowledge).head()

print(f"\nTop 5 results for query '{query}':")

for result in get_knowledge:
    print(f"Title: {result['title']}, Score: {result['score']:.4f}")


Top 5 results for query 'What are the some interesting action movies to watch that include business?':
Title: Shanghai Express, Score: 0.7532
Title: Grindhouse, Score: 0.7137
Title: Crime Story, Score: 0.7058
Title: The Accidental Spy, Score: 0.6996
Title: Hand Gun, Score: 0.6962


## Step 8: Retrieval Augmented Generation(RA)

Load DeepSeek model from Hugging Face

In [None]:
# Load model directly
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", device_map="cuda"
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
)

model.to("cuda")

In [66]:
def rag_query(query):
    query = (
        "What are the some interesting action movies to watch that include business?"
    )

    get_knowledge = vector_search(query)

    combined_information = f"Query: {query}\nContinue to answer the query by using the Search Results:\n{get_knowledge}."

    # Moving tensors to GPU
    input_ids = tokenizer(combined_information, return_tensors="pt").to("cuda")
    response = model.generate(**input_ids, max_new_tokens=1000)

    return tokenizer.decode(response[0], skip_special_tokens=False)

In [67]:
print(
    rag_query(
        "What's a romantic movie that I can watch with my wife? Make your response concise"
    )
)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


<｜begin▁of▁sentence｜>Query: What are the some interesting action movies to watch that include business?
Continue to answer the query by using the Search Results:
[{'genres': ['Action', 'Comedy', 'Western'], 'fullplot': "Multi-genre flick (western, martial arts, comedy, adventure, etc.) with an all-star cast about a man who returns to his home town, buys everything in sight, and tries to improve its municipal (and his personal) profits by sabotaging a train so the passengers all have to stop in his town and spend lots o' money! Throw in various subplots involving some Japanese swordsmen, some bungling bankrobbers (one of whom is the head of security), and a gang of no-goods who try to mess up the town.", 'title': 'Shanghai Express', 'score': 0.7531864643096924}, {'genres': ['Action', 'Horror', 'Thriller'], 'fullplot': 'A double-bill of thrillers that recall both filmmakers\' favorite exploitation films. "Grindhouse" (a downtown movie theater in disrepair since its glory days as a movie 