# Movie Recommendation System using TF-IDF and Word2Vec

## Setup and Imports
First, import the necessary libraries and download the required NLTK data.

In [26]:
# Setup and Imports
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec
import string
import re

# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\haima\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\haima\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\haima\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Data Loading and Preprocessing
Load the IMDB movies dataset and define the preprocessing function.

In [34]:
# Cell 2: Data Loading and Preprocessing
# Load the dataset
df = pd.read_csv('imdb_movies.csv')

# Initialize preprocessing tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """
    Preprocess text following requirements:
    - Lowercase
    - Remove punctuation
    - Remove stopwords
    - Lemmatize
    """
    if isinstance(text, str):
        # Lowercase
        text = text.lower()
        # Remove punctuation (re.escape keeps '\', ']' and '^' literal inside the class)
        text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
        # Tokenize
        tokens = text.split()
        # Remove stopwords and lemmatize
        tokens = [lemmatizer.lemmatize(token) for token in tokens 
                 if token not in stop_words and token.isalnum()]
        return ' '.join(tokens)
    return ''

# Preprocess all movie plots
print("Preprocessing movie plots...")
df['processed_plot'] = df['overview'].fillna('').apply(preprocess_text)

# Display preprocessing results
print("\nPreprocessing Results Example:")
print(df[['title', 'overview', 'processed_plot']].head())

Preprocessing movie plots...

Preprocessing Results Example:
                      title  \
0  The Shawshank Redemption   
1             The Godfather   
2     The Godfather Part II   
3          Schindler's List   
4              12 Angry Men   

                                            overview  \
0  Imprisoned in the 1940s for the double murder ...   
1  Spanning the years 1945 to 1955, a chronicle o...   
2  In the continuing saga of the Corleone crime f...   
3  The true story of how businessman Oskar Schind...   
4  The defense and the prosecution have rested an...   

                                      processed_plot  
0  imprisoned 1940s double murder wife lover upst...  
1  spanning year 1945 1955 chronicle fictional it...  
2  continuing saga corleone crime family young vi...  
3  true story businessman oskar schindler saved t...  
4  defense prosecution rested jury filing jury ro...  
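
One detail worth checking when a regex character class is built from `string.punctuation`: unless the characters are escaped, the backslash in the set is read as an escape for the `]` that follows it, so a literal backslash in the text is never matched. A minimal sketch of the difference, using two hypothetical helper names that are not part of the notebook and a made-up sample string:

```python
import re
import string

# Hypothetical helpers, for illustration only.
def strip_punct_unescaped(text):
    # Character class built directly from string.punctuation
    return re.sub(f'[{string.punctuation}]', ' ', text)

def strip_punct_escaped(text):
    # re.escape makes every punctuation character literal inside the class
    return re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)

sample = r'sci-fi\horror (1979)!'  # made-up text containing a backslash
unescaped_result = strip_punct_unescaped(sample)
escaped_result = strip_punct_escaped(sample)

print(repr(unescaped_result))
print(repr(escaped_result))
```

Backslashes are rare in plot summaries, so the practical effect on this dataset is probably small, but the escaped form behaves predictably for any input.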


## Part 1: TF-IDF Vectorization and Similarity Computation

In [55]:
# TF-IDF Implementation
# Create and fit TF-IDF vectorizer
print("Training TF-IDF vectorizer...")
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_plot'])

def get_tfidf_recommendations(input_text, top_n=5):
    """Get movie recommendations using TF-IDF and cosine similarity."""
    # Preprocess input text
    processed_input = preprocess_text(input_text)
    
    # Transform input text using TF-IDF vectorizer
    input_vector = tfidf_vectorizer.transform([processed_input])
    
    # Calculate cosine similarity
    similarities = cosine_similarity(input_vector, tfidf_matrix).flatten()
    
    # Get top N similar movies
    movie_similarities = list(enumerate(similarities))
    movie_similarities = sorted(movie_similarities, key=lambda x: x[1], reverse=True)
    
    # Get recommendations
    recommendations = []
    seen_titles = set()
    
    for idx, score in movie_similarities:
        title = df.iloc[idx]['title']
        if title not in seen_titles:
            recommendations.append({
                'Title': title,
                'Year': str(df.iloc[idx]['release_date'])[:4],  # str() guards against missing dates
                'Similarity Score': score
            })
            seen_titles.add(title)
            
            if len(recommendations) == top_n:
                break
    
    return recommendations

# Test with multiple queries
test_queries = [
    "dreams and memory invasion",
    "space exploration and aliens",
    "superhero action adventure"
]

for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 50)
    recommendations = get_tfidf_recommendations(query)
    for i, rec in enumerate(recommendations, 1):
        print(f"{i}. {rec['Title']} ({rec['Year']}) - Score: {rec['Similarity Score']:.4f}")

Training TF-IDF vectorizer...

Query: 'dreams and memory invasion'
--------------------------------------------------
1. The Trust (2016) - Score: 0.2600
2. Eternal Sunshine of the Spotless Mind (2004) - Score: 0.2536
3. Total Recall (1990) - Score: 0.2474
4. Infinite (2021) - Score: 0.2339
5. Ben 10: Race Against Time (2008) - Score: 0.2235

Query: 'space exploration and aliens'
--------------------------------------------------
1. Space Chimps (2008) - Score: 0.3654
2. Alien: Romulus (2024) - Score: 0.2789
3. Lifeforce (1985) - Score: 0.2709
4. Space Pirate Captain Harlock (2013) - Score: 0.2672
5. Like Crazy (2016) - Score: 0.2583

Query: 'superhero action adventure'
--------------------------------------------------
1. Shazam! (2019) - Score: 0.2743
2. Max Steel (2016) - Score: 0.2727
3. Minnal Murali (2021) - Score: 0.2576
4. The Legend of Zorro (2005) - Score: 0.2263
5. X-Men (2000) - Score: 0.2182
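
The ranking step sorts every (index, score) pair, which is fine at this dataset size. When only the top few results are needed, `heapq.nlargest` from the standard library avoids a full sort. A minimal sketch on made-up scores; `top_n_indices` is a hypothetical helper and the numbers are illustrative, not taken from the dataset:

```python
import heapq

def top_n_indices(similarities, n=5):
    """Return (index, score) pairs for the n highest scores, best first."""
    return heapq.nlargest(n, enumerate(similarities), key=lambda pair: pair[1])

# Illustrative similarity scores for six hypothetical movies.
toy_scores = [0.10, 0.42, 0.05, 0.37, 0.29, 0.00]
top3 = top_n_indices(toy_scores, n=3)
print(top3)
```

Because duplicate titles are skipped after ranking, a real version would likely need to request somewhat more than `top_n` candidates.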


## Part 2: Word2Vec Based Recommendation System

In [64]:
# Word2Vec Implementation
print("Training Word2Vec model...")

# Prepare sentences for Word2Vec
sentences = [text.split() for text in df['processed_plot']]

# Train Word2Vec model with final optimized parameters
word2vec_model = Word2Vec(
    sentences, 
    vector_size=100,     # Reduced to focus on core semantics
    window=5,            # Standard window size
    min_count=2,         # Remove very rare words
    workers=4,
    sg=1,               # Skip-gram model
    epochs=20           # More training epochs
)

# Calculate average vectors for each movie
movie_vectors = []
for text in df['processed_plot']:
    words = text.split()
    # Get vectors for meaningful words
    word_vectors = [word2vec_model.wv[word] for word in words 
                   if word in word2vec_model.wv and len(word) > 2]
    
    if word_vectors:
        # Unweighted mean of the word vectors
        movie_vector = np.mean(word_vectors, axis=0)
    else:
        # No in-vocabulary words: fall back to a zero vector of matching size
        movie_vector = np.zeros(word2vec_model.wv.vector_size)
    movie_vectors.append(movie_vector)
movie_vectors = np.array(movie_vectors)

def get_word2vec_recommendations(input_text, top_n=5):
    """Get movie recommendations using Word2Vec and cosine similarity."""
    # Preprocess input text
    processed_input = preprocess_text(input_text)
    
    # Calculate input vector
    input_words = processed_input.split()
    input_vectors = [word2vec_model.wv[word] for word in input_words 
                    if word in word2vec_model.wv and len(word) > 2]
    
    if not input_vectors:
        return []
    
    input_vector = np.mean(input_vectors, axis=0).reshape(1, -1)
    
    # Calculate cosine similarity
    similarities = cosine_similarity(input_vector, movie_vectors).flatten()
    
    # Get top N similar movies
    movie_similarities = list(enumerate(similarities))
    movie_similarities = sorted(movie_similarities, key=lambda x: x[1], reverse=True)
    
    # Get recommendations
    recommendations = []
    seen_titles = set()
    
    for idx, score in movie_similarities:
        title = df.iloc[idx]['title']
        if title not in seen_titles and score > 0.3:  # Lower threshold for more diversity
            recommendations.append({
                'Title': title,
                'Year': str(df.iloc[idx]['release_date'])[:4],  # str() guards against missing dates
                'Similarity Score': score
            })
            seen_titles.add(title)
            
            if len(recommendations) == top_n:
                break
    
    return recommendations

# Test Word2Vec recommendations
print("\nTesting Word2Vec Recommendations:")
for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 50)
    recommendations = get_word2vec_recommendations(query)
    for i, rec in enumerate(recommendations, 1):
        print(f"{i}. {rec['Title']} ({rec['Year']}) - Score: {rec['Similarity Score']:.4f}")

Training Word2Vec model...

Testing Word2Vec Recommendations:

Query: 'dreams and memory invasion'
--------------------------------------------------
1. Thank You for Your Service (2017) - Score: 0.7043
2. The Best Years (2020) - Score: 0.7035
3. Universal Soldier (1992) - Score: 0.7021
4. Total Recall (1990) - Score: 0.7001
5. Paris or Perish (2013) - Score: 0.6979

Query: 'space exploration and aliens'
--------------------------------------------------
1. Space Chimps (2008) - Score: 0.8185
2. Meet Dave (2008) - Score: 0.8034
3. Dead Space: Downfall (2008) - Score: 0.7874
4. Lost in Space (1998) - Score: 0.7779
5. Lifeforce (1985) - Score: 0.7771

Query: 'superhero action adventure'
--------------------------------------------------
1. Max Steel (2016) - Score: 0.6936
2. My Little Pony: The Movie (2017) - Score: 0.6676
3. Aladdin (2019) - Score: 0.6607
4. The Scorpion King 3: Battle for Redemption (2012) - Score: 0.6567
5. PAW Patrol: Mighty Pups (2018) - Score: 0.6520
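
Each movie vector here is a plain mean of its word vectors, so frequent, uninformative words count as much as rare ones. One possible refinement, sketched below under assumptions, weights each word by a smoothed inverse document frequency (the same formula scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`). The toy vectors, the toy corpus, and the helper names `idf_weights` and `weighted_doc_vector` are all hypothetical and only illustrate the idea:

```python
import math
import numpy as np

# Toy 3-dimensional "embeddings"; the values are made up for illustration.
toy_vectors = {
    'space':  np.array([1.0, 0.0, 0.0]),
    'alien':  np.array([0.8, 0.2, 0.0]),
    'family': np.array([0.0, 1.0, 0.0]),
}

# Toy tokenized corpus, used only to compute document frequencies.
toy_docs = [['space', 'alien', 'family'], ['family'], ['family', 'space']]

def idf_weights(docs):
    """Smoothed inverse document frequency for every token in docs."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for token in set(doc):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

def weighted_doc_vector(tokens, vectors, weights, dim=3):
    """IDF-weighted mean of the vectors of all in-vocabulary tokens."""
    known = [t for t in tokens if t in vectors and t in weights]
    if not known:
        return np.zeros(dim)
    stacked = np.stack([vectors[t] for t in known])
    token_weights = np.array([weights[t] for t in known])
    return np.average(stacked, axis=0, weights=token_weights)

tokens = ['space', 'alien', 'family']
weights = idf_weights(toy_docs)
doc_vec = weighted_doc_vector(tokens, toy_vectors, weights)
plain_vec = np.mean([toy_vectors[t] for t in tokens], axis=0)

print("IDF-weighted:", doc_vec)
print("Plain mean:  ", plain_vec)
```

In this toy corpus `family` appears in every document, so the weighted vector leans further toward the rarer `space` and `alien` directions than the plain mean does. Whether this improves the recommendations on the real dataset would need to be tested.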


## Part 3: Comparison and Analysis of Both Methods

In [67]:
# Direct Comparison Function
def compare_methods(query):
    """Compare recommendations from both TF-IDF and Word2Vec methods."""
    print(f"\nQuery: '{query}'")
    
    print("\nTF-IDF Recommendations:")
    print("-" * 50)
    tfidf_recs = get_tfidf_recommendations(query)
    for i, rec in enumerate(tfidf_recs, 1):
        print(f"{i}. {rec['Title']} ({rec['Year']}) - Score: {rec['Similarity Score']:.4f}")
    
    print("\nWord2Vec Recommendations:")
    print("-" * 50)
    w2v_recs = get_word2vec_recommendations(query)
    for i, rec in enumerate(w2v_recs, 1):
        print(f"{i}. {rec['Title']} ({rec['Year']}) - Score: {rec['Similarity Score']:.4f}")

# Test both methods with a sample query
test_query = "dreams and memory invasion"
compare_methods(test_query)


Query: 'dreams and memory invasion'

TF-IDF Recommendations:
--------------------------------------------------
1. The Trust (2016) - Score: 0.2600
2. Eternal Sunshine of the Spotless Mind (2004) - Score: 0.2536
3. Total Recall (1990) - Score: 0.2474
4. Infinite (2021) - Score: 0.2339
5. Ben 10: Race Against Time (2008) - Score: 0.2235

Word2Vec Recommendations:
--------------------------------------------------
1. Thank You for Your Service (2017) - Score: 0.7043
2. The Best Years (2020) - Score: 0.7035
3. Universal Soldier (1992) - Score: 0.7021
4. Total Recall (1990) - Score: 0.7001
5. Paris or Perish (2013) - Score: 0.6979


## Analysis of Results

### Comparison of TF-IDF vs Word2Vec Results

#### TF-IDF Performance
- Shows more consistent thematic matching
- Better at handling specific plot elements
- Example: Found "Eternal Sunshine of the Spotless Mind" and "Total Recall" for the memory-related query
- Top-5 similarity scores range from about 0.22 to 0.37
- More reliable for direct keyword matching

#### Word2Vec Performance
- Shows broader semantic relationships
- Captures some thematic similarities
- Example: Found "Universal Soldier" and "Total Recall" for memory query
- Similarity scores tend to be higher (0.65-0.82)
- Better at finding related concepts even with different vocabulary

### Key Differences Observed

#### 1. Similarity Scores
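- TF-IDF: Lower scores (about 0.22-0.37)
- Word2Vec: Higher scores overall (0.65-0.82)
- TF-IDF scores seem more reliable for ranking: its top-5 scores are more spread out, while the Word2Vec scores for a given query differ by only a few hundredths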
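
Because the absolute scales differ, one rough way to compare how sharply each method separates its top results is the relative gap between the first and fifth score. The sketch below uses the scores reported above for the space query; `relative_spread` is a hypothetical helper, and this is only one of several possible measures:

```python
# Top-5 scores for the query 'space exploration and aliens',
# copied from the outputs shown earlier.
tfidf_scores = [0.3654, 0.2789, 0.2709, 0.2672, 0.2583]
w2v_scores = [0.8185, 0.8034, 0.7874, 0.7779, 0.7771]

def relative_spread(scores):
    """Gap between the best and worst score, as a fraction of the best score."""
    return (max(scores) - min(scores)) / max(scores)

tfidf_spread = relative_spread(tfidf_scores)
w2v_spread = relative_spread(w2v_scores)

print(f"TF-IDF relative spread:   {tfidf_spread:.3f}")
print(f"Word2Vec relative spread: {w2v_spread:.3f}")
```

A larger spread means the ranking depends less on tiny score differences, which is one reason raw scores from the two methods should not be compared directly.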

#### 2. Types of Matches
- TF-IDF: More literal matches based on shared words
- Word2Vec: More conceptual matches based on word relationships
- Example: Space query
  * TF-IDF found direct matches like "Space Chimps" (which Word2Vec also ranked first)
  * Word2Vec also found related titles like "Dead Space: Downfall" and "Lost in Space"
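
How different the two result lists are can be quantified with a simple set overlap. The sketch below copies the top-5 titles from the outputs above and counts the shared titles per query; `jaccard` is a hypothetical helper name:

```python
# Top-5 titles per query, copied from the outputs shown earlier.
tfidf_titles = {
    'dreams and memory invasion': [
        'The Trust', 'Eternal Sunshine of the Spotless Mind', 'Total Recall',
        'Infinite', 'Ben 10: Race Against Time'],
    'space exploration and aliens': [
        'Space Chimps', 'Alien: Romulus', 'Lifeforce',
        'Space Pirate Captain Harlock', 'Like Crazy'],
    'superhero action adventure': [
        'Shazam!', 'Max Steel', 'Minnal Murali',
        'The Legend of Zorro', 'X-Men'],
}
w2v_titles = {
    'dreams and memory invasion': [
        'Thank You for Your Service', 'The Best Years', 'Universal Soldier',
        'Total Recall', 'Paris or Perish'],
    'space exploration and aliens': [
        'Space Chimps', 'Meet Dave', 'Dead Space: Downfall',
        'Lost in Space', 'Lifeforce'],
    'superhero action adventure': [
        'Max Steel', 'My Little Pony: The Movie', 'Aladdin',
        'The Scorpion King 3: Battle for Redemption', 'PAW Patrol: Mighty Pups'],
}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

shared = {q: sorted(set(tfidf_titles[q]) & set(w2v_titles[q])) for q in tfidf_titles}
overlap = {q: jaccard(tfidf_titles[q], w2v_titles[q]) for q in tfidf_titles}

for q in tfidf_titles:
    print(f"{q}: shared={shared[q]}, Jaccard={overlap[q]:.2f}")
```

At most two of the five titles coincide for any query, so the two methods are surfacing largely different movies.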

#### 3. Consistency
- TF-IDF showed more consistent genre-appropriate matches
- Word2Vec sometimes included thematically distant movies
- TF-IDF better maintained genre boundaries

### Why the Differences?

#### TF-IDF Strengths
- Works well with specific vocabulary
- Good for plot-based similarity
- More predictable results
- Better for direct content matching

#### Word2Vec Strengths
- Captures word relationships
- Can find similar concepts
- More flexible with vocabulary
- Better for thematic similarity
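
The contrast between the two sets of strengths can be seen on a toy case: when a query and a plot share no words, any bag-of-words similarity is exactly zero, while averaged dense vectors can still be close. The sketch below uses raw counts instead of TF-IDF weights (the zero-overlap result is the same either way) and hand-written 2-d embeddings that are made up for illustration, not learned by Word2Vec:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def bow_vector(tokens, vocab):
    """Raw term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

query = ['astronaut', 'galaxy']
plot = ['spaceship', 'alien']

# Lexical view: no shared tokens, so the vectors are orthogonal.
vocab = sorted(set(query) | set(plot))
lexical_sim = cosine(bow_vector(query, vocab), bow_vector(plot, vocab))

# Dense view: made-up embeddings in which all four words point in a similar direction.
toy_embeddings = {
    'astronaut': [0.90, 0.10],
    'galaxy':    [0.80, 0.30],
    'spaceship': [0.85, 0.20],
    'alien':     [0.70, 0.40],
}

def mean_vector(tokens):
    """Component-wise mean of the toy embeddings for the given tokens."""
    vecs = [toy_embeddings[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

dense_sim = cosine(mean_vector(query), mean_vector(plot))

print(f"Bag-of-words similarity: {lexical_sim:.4f}")
print(f"Dense-vector similarity: {dense_sim:.4f}")
```

The same property cuts both ways: dense averages rarely produce low scores, which is consistent with the narrow, high score range seen for Word2Vec in this notebook.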

### Conclusion

For this movie recommendation task:
1. TF-IDF appears more reliable for plot-based recommendations
2. Word2Vec shows promise but needs refinement
3. TF-IDF's more conservative approach yields more relevant results
4. Both methods avoid duplicate titles by tracking the titles already returned (`seen_titles`)
5. A hybrid approach might be worth exploring in future iterations

The choice between methods depends on the use case:
- Use TF-IDF for specific plot-based searches
- Use Word2Vec for broader thematic searches
- Consider combining both for a more robust system
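
As a starting point for the combined approach, one simple option is to rescale each method's similarity scores to the 0-1 range and take a weighted sum. This is a minimal sketch under assumptions, not a tuned design: `min_max`, `hybrid_scores`, and the mixing weight `alpha` are hypothetical names, and the input scores are made up.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1]; returns zeros if all scores are equal."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def hybrid_scores(tfidf_sims, w2v_sims, alpha=0.5):
    """Weighted sum of rescaled scores; alpha is the weight given to TF-IDF."""
    return alpha * min_max(tfidf_sims) + (1 - alpha) * min_max(w2v_sims)

# Made-up similarity scores for four hypothetical movies.
toy_tfidf = [0.30, 0.00, 0.10, 0.20]
toy_w2v = [0.70, 0.80, 0.60, 0.75]

combined = hybrid_scores(toy_tfidf, toy_w2v, alpha=0.5)
ranking = np.argsort(combined)[::-1]  # indices from best to worst

print("Combined scores:", np.round(combined, 3))
print("Ranking:", ranking.tolist())
```

In the notebook, the inputs would be the full `similarities` arrays computed inside the two recommendation functions. Rescaling matters because the raw Word2Vec scores are much higher than the TF-IDF ones and would otherwise dominate the sum.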