# Semantic Search for Movie Plots

This notebook implements a semantic search engine for movie plots. It uses `sentence-transformers` to embed each plot as a vector, then ranks movies by the cosine similarity between their plot embeddings and the embedding of a query.
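To make the ranking idea concrete, here is a minimal sketch of cosine similarity on hand-made vectors. The three-dimensional vectors and the `cosine` helper are illustrative assumptions, not part of the notebook's pipeline; real plot embeddings have many more dimensions.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings" chosen by hand for illustration
query = [1.0, 0.0, 1.0]
similar_plot = [2.0, 0.0, 2.0]    # same direction as the query, different length
unrelated_plot = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine(query, similar_plot))    # close to 1: vector length does not matter
print(cosine(query, unrelated_plot))  # 0: no shared direction
```

Because the score depends only on direction, a long plot and a short query can still match closely if their embeddings point the same way.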

### 1. Install and Import Libraries

In [None]:
# Install the required libraries
!pip install sentence-transformers pandas scikit-learn

# Import necessary libraries
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

### 2. Load the Dataset

In [None]:
# Load the movies.csv file into a pandas DataFrame
# (the file must contain a 'plot' column, which section 3 embeds)
df = pd.read_csv('movies.csv')
print("Dataset loaded successfully. Here are the first 5 rows:")
df.head()
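Since the embeddings are later matched to rows by position, it may help to drop rows with missing or blank plots before encoding. The sketch below is one possible way to do that; the `clean_plots` helper, the `title` column, and the sample rows are hypothetical stand-ins for the real `movies.csv`.

```python
import pandas as pd

# Hypothetical stand-in for movies.csv; the real file may have other columns
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "plot": ["A spy hides in Paris.", None, "   "],
})

def clean_plots(frame, text_column="plot"):
    """Return a copy without rows whose text column is missing or blank."""
    if text_column not in frame.columns:
        raise KeyError(f"expected a '{text_column}' column")
    cleaned = frame.dropna(subset=[text_column]).copy()
    cleaned[text_column] = cleaned[text_column].astype(str).str.strip()
    cleaned = cleaned[cleaned[text_column] != ""]
    # Reset the index so positional lookups (df.iloc) line up with the embeddings
    return cleaned.reset_index(drop=True)

print(clean_plots(sample))
```

In this notebook the call would be something like `df = clean_plots(df)`, run before the embeddings are created.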

### 3. Create Embeddings for Movie Plots

In [None]:
# Load the pre-trained Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

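# Create embeddings for the movie plots (one vector per row of df)
# Missing plots become empty strings so row positions stay aligned with df
# This may take a moment to run
print("Creating embeddings for movie plots...")
plots = df['plot'].fillna('').astype(str).tolist()
embeddings = model.encode(plots, convert_to_tensor=False)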
print("Embeddings created successfully.")
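One optional refinement, sketched here on random stand-in vectors rather than real model output: if every embedding is scaled to unit length, cosine similarity reduces to a plain dot product. The `l2_normalize` helper and the toy shapes are assumptions for illustration. `model.encode` also accepts a `normalize_embeddings=True` argument that should have the same effect at encoding time.

```python
import numpy as np

# Random stand-ins for real embeddings (4 "plots" and 1 "query", 8 dimensions each)
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 8))
toy_query = rng.normal(size=(8,))

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against division by zero

unit_embeddings = l2_normalize(toy_embeddings)
unit_query = l2_normalize(toy_query)

# For unit vectors, the dot product equals the cosine similarity
scores = unit_embeddings @ unit_query
print(scores)
```

Normalizing once up front means each search only needs a matrix-vector product, which can matter for larger collections.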

### 4. Implement the Search Function

In [None]:
def search_movies(query, top_n=5):
    """
    Searches for movies based on a query using semantic similarity.

    Args:
        query (str): The search query.
        top_n (int): The number of top results to return.

    Returns:
        pandas.DataFrame: The top N rows of df (all original columns) plus a
                          'similarity' column, ordered from most to least similar.
    """
    # Encode the query to get its embedding
    query_embedding = model.encode([query], convert_to_tensor=False)
    
    # Calculate cosine similarity between the query and all movie plots
    similarities = cosine_similarity(query_embedding, embeddings)[0]
    
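    # Get the indices of the top N most similar movies, best match first
    # (slicing after the reversal means top_n=0 returns no rows, not every row)
    top_indices = np.argsort(similarities)[::-1][:max(top_n, 0)]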
    
    # Create a result DataFrame
    result_df = df.iloc[top_indices].copy()
    result_df['similarity'] = similarities[top_indices]
    
    return result_df
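The ranking step sorts every similarity score even though only a few results are needed. As a possible alternative for large collections, the sketch below uses `np.argpartition` to pick the top candidates first and then sorts only those. The `top_k_indices` helper and the toy scores are hypothetical, not part of the notebook.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, ordered from highest to lowest."""
    scores = np.asarray(scores)
    k = max(0, min(k, scores.shape[0]))  # clamp k to the valid range
    if k == 0:
        return np.array([], dtype=int)
    # argpartition moves the k largest scores into the last k slots, unordered
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates, then reverse for descending order
    return candidates[np.argsort(scores[candidates])[::-1]]

toy_scores = np.array([0.12, 0.87, 0.45, 0.91, 0.30])
print(top_k_indices(toy_scores, 3))  # [3 1 2]
```

For a few thousand movies the full `np.argsort` is perfectly adequate; the partition approach mainly pays off when the number of plots is much larger than `top_n`.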

### 5. Test the Search Function

In [None]:
# Test the search function with the query 'spy thriller in Paris'
test_query = 'spy thriller in Paris'
results = search_movies(test_query, top_n=3)

print(f"Top {len(results)} results for the query: '{test_query}'")
results
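The search always returns `top_n` rows, even when nothing in the dataset is a good match. One way to handle that, sketched below on a made-up results table, is to drop rows under a minimum similarity. The `filter_by_similarity` helper, the `title` column, and the `0.3` cutoff are assumptions for illustration; a sensible threshold depends on the model and the data.

```python
import pandas as pd

# Made-up results table shaped like the output of search_movies
toy_results = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "similarity": [0.62, 0.41, 0.08],
})

def filter_by_similarity(results, min_similarity=0.3):
    """Keep only rows whose similarity is at least min_similarity."""
    return results[results["similarity"] >= min_similarity]

print(filter_by_similarity(toy_results))
```

Applied to this notebook, that would look like `filter_by_similarity(search_movies(test_query, top_n=3))`.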