# Movie Semantic Search Assignment

This notebook implements a semantic search engine for movie plots using SentenceTransformers.

**Student Name:** Amritanshu Darbari  
**Assignment:** Semantic Search on Movie Plots  
**Due Date:** August 26, 2025


## Section 1: Install and Import Libraries

First, we'll install and import all the necessary libraries for our semantic search system.

In [1]:
# Install required packages (run this cell if packages are not installed)
# !pip install sentence-transformers pandas scikit-learn numpy

# Import necessary libraries
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## Section 2: Load Movie Dataset

We'll load the movie dataset from the CSV file into a pandas DataFrame.

In [2]:
# Load the movies dataset
movies_df = pd.read_csv('movies.csv')

# Display basic information about the dataset
print(f"Dataset shape: {movies_df.shape}")
print(f"Columns: {list(movies_df.columns)}")
print("\nFirst 5 movies:")
print(movies_df.head())

Dataset shape: (20, 2)
Columns: ['title', 'plot']

First 5 movies:
                  title                                               plot
0   The Bourne Identity  A man with amnesia wakes up on a fishing boat ...
1   Mission: Impossible  Ethan Hunt, an agent of the Impossible Mission...
2         Casino Royale  James Bond's first mission as 007 leads him to...
3  The Spy Who Loved Me  British and Russian submarines carrying nuclea...
4    North by Northwest  An advertising executive is mistaken for a gov...
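
Before embedding, it may be worth confirming that every row has a usable plot. The following is a minimal sketch, assuming the `title` and `plot` columns shown above; the helper name `clean_movies` and the inline sample rows are hypothetical and are not part of `movies.csv`.

```python
import pandas as pd

def clean_movies(df):
    """Drop rows with missing or blank plots and duplicate titles (hypothetical helper)."""
    out = df.copy()
    # Treat missing and whitespace-only plots as empty
    plots = out["plot"].fillna("").astype(str).str.strip()
    out = out[plots != ""]
    # Keep the first occurrence of each title
    out = out.drop_duplicates(subset="title", keep="first")
    return out.reset_index(drop=True)

# Illustrative rows only (assumption: not taken from the real dataset)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "A"],
    "plot": ["A spy flees.", None, "   ", "Duplicate row."],
})
cleaned = clean_movies(sample)
print(cleaned)
```

In the notebook this could be applied as `movies_df = clean_movies(movies_df)` before the embedding step, so that the embedding rows and the DataFrame rows stay aligned.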


## Section 3: Create Embeddings using SentenceTransformers

We'll use the 'all-MiniLM-L6-v2' model to create embeddings for all movie plots.

In [3]:
# Initialize the SentenceTransformer model
# all-MiniLM-L6-v2 is a small sentence-embedding model (384-dimensional output) suited to semantic similarity tasks
model = SentenceTransformer('all-MiniLM-L6-v2')

print(f"Model loaded: {model}")
print(f"Model max sequence length: {model.max_seq_length}")

Model loaded: SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
Model max sequence length: 256
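
The reported maximum sequence length means that input beyond 256 tokens is truncated before encoding, so very long plots would only be partially represented. The sketch below is a rough pre-check under a stated assumption: it counts whitespace-separated words, which is only an approximate lower bound on the subword token count. The helper name `flag_long_texts` and the sample strings are hypothetical.

```python
def flag_long_texts(texts, max_tokens=256):
    """Return indices of texts whose whitespace word count exceeds max_tokens.

    Word count is only a rough lower bound on the subword token count, so this
    catches texts that are clearly too long but can miss borderline ones.
    """
    return [i for i, text in enumerate(texts) if len(text.split()) > max_tokens]

# Illustrative inputs (assumption: not real plots)
plots = ["A short plot.", "word " * 300]
print(flag_long_texts(plots))  # [1]
```

For an exact count, the model's own tokenizer would be the better tool; the word-count proxy is used here only to keep the sketch dependency-free.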


In [4]:
# Create embeddings for all movie plots
print("Creating embeddings for movie plots...")

# Extract plot descriptions
movie_plots = movies_df['plot'].tolist()

# Generate embeddings using the SentenceTransformer model
movie_embeddings = model.encode(movie_plots, show_progress_bar=True)

print(f"Created embeddings with shape: {movie_embeddings.shape}")
print(f"Each movie plot is represented by a {movie_embeddings.shape[1]}-dimensional vector")

Creating embeddings for movie plots...
Created embeddings with shape: (20, 384)
Each movie plot is represented by a 384-dimensional vector
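
The model printout above lists a `Normalize()` module as its final stage, which suggests the embeddings are scaled to unit length. For unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below illustrates that relationship with random stand-in vectors rather than real model output; the helper name `cosine` is hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity from its definition: a.b / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Stand-in vectors with the same width as the embeddings above
# (assumption: random data, not real model output)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Scale both vectors to unit length, as a normalization stage would
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# The two quantities agree up to floating-point error
print(np.isclose(cosine(a, b), np.dot(a_unit, b_unit)))
```

With real embeddings, the same check could be run as `np.linalg.norm(movie_embeddings, axis=1)` to confirm that each row has a norm close to 1.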


## Section 4: Implement Search Function

Now we'll implement the `search_movies` function that takes a query and returns the most similar movies.

In [5]:
def search_movies(query, top_n=5):
    """
    Search for movies based on a text query using semantic similarity.
    
    Args:
        query (str): The search query describing the type of movie desired
        top_n (int): Number of top results to return (default: 5)
    
    Returns:
        pandas.DataFrame: DataFrame containing top_n movies with columns:
                         - title: Movie title
                         - plot: Movie plot description  
                         - similarity_score: Cosine similarity score (range -1 to 1; higher means more similar)
    """
    
    # Step 1: Encode the search query using the same model
    query_embedding = model.encode([query])
    
    # Step 2: Calculate cosine similarity between query and all movie plots
    # cosine_similarity returns a 2D array of shape (1, n_movies); take the first row
    similarities = cosine_similarity(query_embedding, movie_embeddings)[0]
    
    # Step 3: Get indices of top_n most similar movies
    # argsort() returns indices that would sort the array
    # [::-1] reverses to get descending order (highest similarity first)
    # [:top_n] takes only the top_n results
    top_indices = np.argsort(similarities)[::-1][:top_n]
    
    # Step 4: Create result DataFrame with the top movies
    result_df = movies_df.iloc[top_indices].copy()
    
    # Step 5: Add similarity scores to the result
    result_df['similarity_score'] = similarities[top_indices]
    
    # Step 6: Reset index and return only required columns
    result_df = result_df.reset_index(drop=True)
    
    return result_df[['title', 'plot', 'similarity_score']]

print("search_movies function defined successfully!")

search_movies function defined successfully!
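
The ranking logic inside `search_movies` can also be isolated from the model and the DataFrame, which makes it easy to verify on small hand-made vectors. This is a minimal sketch, not the notebook's implementation: the function name `rank_by_cosine` and the 2-D toy vectors are assumptions chosen so the expected order is obvious.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs, top_n=5):
    """Return (indices, scores) of the top_n rows of doc_vecs most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity of the query against every row
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    scores = doc_vecs @ query_vec / norms
    # Sort ascending, reverse to descending, keep the first top_n
    order = np.argsort(scores)[::-1][:top_n]
    return order, scores[order]

# Toy "documents" (assumption: illustrative vectors, not embeddings)
docs = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
    [1.0, 1.0],   # 45 degrees away
])
idx, scores = rank_by_cosine([2.0, 0.0], docs, top_n=3)
print(idx, scores)
```

The opposite-direction vector would score -1 here, a reminder that cosine similarity is not confined to non-negative values even though scores for related text are usually positive in practice.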


## Section 5: Test the Search Function

Let's test our search function with the query 'spy thriller in Paris' and see the results.

In [6]:
# Test the search function with the specified query
test_query = 'spy thriller in Paris'
results = search_movies(test_query, top_n=5)

print(f"Search Results for: '{test_query}'")
print("=" * 50)

# Display results in a nice format
for idx, row in results.iterrows():
    print(f"\n{idx + 1}. {row['title']}")
    print(f"   Similarity Score: {row['similarity_score']:.4f}")
    print(f"   Plot: {row['plot'][:100]}...")

print("\n" + "=" * 50)
print("\nFull Results DataFrame:")
print(results)

Search Results for: 'spy thriller in Paris'
==================================================

1. The French Connection
   Similarity Score: 0.4318
   Plot: New York City detectives Jimmy Doyle and Buddy Russo hope to break a narcotics smuggling ring and ul...

2. Mission: Impossible
   Similarity Score: 0.4281
   Plot: Ethan Hunt, an agent of the Impossible Missions Force, is framed for the murders of his entire team....

3. The Conversation
   Similarity Score: 0.4129
   Plot: A paranoid surveillance expert has a crisis of conscience when he suspects that the couple he is spy...

4. The Departed
   Similarity Score: 0.3687
   Plot: An undercover cop and a police informant play a cat and mouse game with each other as they attempt t...

5. Three Days of the Condor
   Similarity Score: 0.3343
   Plot: A bookish CIA researcher finds all his co-workers dead and must outwit those responsible until he fi...

==================================================

Full Results DataFrame:
                      title  \
0     The French Connection   
1       Mission: Impossible   
2          The 

## Additional Testing

Let's test with a few more queries to see how well our semantic search works.

In [7]:
# Test with different queries
test_queries = [
    'government conspiracy',
    'undercover agent',
    'assassination plot',
    'cold war espionage'
]

for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 30)
    results = search_movies(query, top_n=3)
    for idx, row in results.iterrows():
        print(f"{idx + 1}. {row['title']} (Score: {row['similarity_score']:.3f})")


Query: 'government conspiracy'
------------------------------
1. The Pelican Brief (Score: 0.416)
2. Marathon Man (Score: 0.414)
3. All the President's Men (Score: 0.406)

Query: 'undercover agent'
------------------------------
1. North by Northwest (Score: 0.581)
2. Mission: Impossible (Score: 0.509)
3. The Departed (Score: 0.485)

Query: 'assassination plot'
------------------------------
1. Marathon Man (Score: 0.533)
2. The Parallax View (Score: 0.521)
3. The Pelican Brief (Score: 0.501)

Query: 'cold war espionage'
------------------------------
1. Tinker Tailor Soldier Spy (Score: 0.670)
2. Three Days of the Condor (Score: 0.479)
3. The Spy Who Loved Me (Score: 0.438)
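
Because the search always returns `top_n` rows regardless of how weak the matches are, one optional refinement is a minimum-score cutoff. The sketch below is only an illustration: the helper name `filter_by_score` is hypothetical, and the cutoff of 0.45 is an arbitrary value rather than a recommended one. The scores are copied from the 'cold war espionage' output above.

```python
def filter_by_score(results, min_score):
    """Keep (title, score) pairs whose score is at least min_score."""
    return [(title, score) for title, score in results if score >= min_score]

# Scores copied from the 'cold war espionage' output above
cold_war = [
    ("Tinker Tailor Soldier Spy", 0.670),
    ("Three Days of the Condor", 0.479),
    ("The Spy Who Loved Me", 0.438),
]

# 0.45 is an arbitrary illustrative cutoff, not a tuned value
kept = filter_by_score(cold_war, min_score=0.45)
print(kept)
```

On the DataFrame returned by `search_movies`, the equivalent filter would be `results[results['similarity_score'] >= 0.45]`.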


## Summary

We have successfully implemented a semantic search engine for movie plots using:

1. **SentenceTransformers**: Used the 'all-MiniLM-L6-v2' model to create embeddings
2. **Cosine Similarity**: Calculated similarity between query and movie plot embeddings
3. **Pandas**: Managed the movie dataset and results
4. **NumPy**: Handled array operations for similarity calculations

The search function returns movies ranked by semantic similarity, allowing users to find relevant movies even when the exact keywords don't match.