# Movies Recommendation Engine:
---
## Project Overview:
 - This end-to-end Movie Recommendation System delivers personalized film suggestions by combining content-based filtering with a smart NLP-powered search using fuzzy string matching for enhanced user input handling.
 - It leverages metadata from the TMDB API, processes more than 8,000 movies (8,245 rows in the dataset used here), and extracts features such as genres, descriptions, language, cast, and keywords, which are vectorized with TF-IDF and compared using cosine similarity.
 - The system is deployed with an intuitive Streamlit interface, enabling users to:
   - Enter movie names (even approximate or partial)
   - Receive relevant recommendations with posters, ratings, and direct streaming links

The project demonstrates hands-on expertise in data preprocessing, natural language processing, similarity algorithms, and full-stack deployment, making it a strong addition to any Data Science portfolio.

## 🎬 This notebook will build the Movie Recommendation Engine by:
🧹 Textual Data Cleaning & Preprocessing:
  - Strip leading/trailing spaces, remove special characters, and lowercase the text for uniformity.
  - Construct a custom textual feature called “soup” by combining metadata such as genres, description, language, top cast, and keywords.
  - Apply lemmatization to reduce words to their base form, improving matching performance.

🔍 Smart Search with Fuzzy Matching:
  - Use rapidfuzz’s process.extractOne() to allow approximate string matching for user input.
  - Support partial titles, misspellings, and non-standard casing so that imperfect input still resolves to a movie.
  - Pass the closest matched title to the content-based recommendation engine to fetch results.
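
The approximate-matching idea can be sketched with the standard library alone. The example below uses `difflib.SequenceMatcher` as a stand-in for a dedicated fuzzy-matching scorer, so its scores will differ somewhat from the library used later in this notebook. The helper name `best_title_match` and the sample titles are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_title_match(query, titles):
    """Return (title, score) for the candidate most similar to `query`.

    Scores are on a 0-100 scale. SequenceMatcher is only a stand-in here;
    a dedicated fuzzy-matching scorer may produce different numbers.
    """
    query = query.lower().strip()
    best_title, best_score = None, -1.0
    for title in titles:
        score = 100 * SequenceMatcher(None, query, title).ratio()
        if score > best_score:
            best_title, best_score = title, score
    return best_title, best_score


# Hypothetical cleaned titles, in the same lowercase form as a cleaned-title column.
titles = ["havoc", "a working man", "a minecraft movie", "thunderbolts"]

match, score = best_title_match("Thunderbolt", titles)   # final "s" is missing
print(match, round(score, 1))

partial, partial_score = best_title_match("minecraft movi", titles)
print(partial, round(partial_score, 1))
```

Because a best candidate is always returned, a real search would likely also want a minimum score below which the input is treated as "not found".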

📊 TF-IDF Vectorization for Content-Based Filtering:
  - Vectorize the “soup” using TF-IDF so that terms distinctive to a movie carry more weight than common ones.
  - Compute cosine similarity on these vectors to identify textually similar movies.
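
To make the two steps concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It assumes whitespace tokenization, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2-normalized vectors; the notebook itself relies on scikit-learn, whose tokenization and stop-word handling differ. The toy soups and function names are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build L2-normalised TF-IDF vectors (as dicts) for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF: rarer terms receive larger weights.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """For unit-length sparse vectors, cosine similarity is the dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Toy "soups" (illustrative, not rows from the dataset).
soups = [
    "action crime thriller heist cop",
    "action thriller heist detective",
    "family comedy adventure fantasy",
]
vecs = tfidf_vectors(soups)
sim_01 = cosine(vecs[0], vecs[1])   # shares "action", "thriller", "heist"
sim_02 = cosine(vecs[0], vecs[2])   # shares no terms
print(round(sim_01, 3), round(sim_02, 3))
```

Movies that share no terms score exactly zero, while overlapping soups score somewhere between 0 and 1 depending on how rare the shared terms are.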

🎯 Recommendation Output:
  - For each search query, return the Top 10 recommended movies based on computed score.

---

In [131]:
# Import library:
import pandas as pd

# Load data:
movies = pd.read_csv("tmdb_movies.csv")

# Preview the data:
movies.head(10)       # First 10 rows

Unnamed: 0,id,title,language,description,genre_id,release_date,rating,poster_path,genres,top_cast,cast_profile_path,keywords,watch_link,languages
0,1197306,A Working Man,en,Levon Cade left behind a decorated military ca...,"28, 80, 53",26 Mar 2025,6.489,https://image.tmdb.org/t/p/original/6FRFIogh3z...,"Action, Crime, Thriller","Jason Statham, Jason Flemyng, Merab Ninidze, M...",https://image.tmdb.org/t/p/w200/whNwkEQYWLFJA8...,"based on novel or book, kidnapping, vigilante,...",https://www.themoviedb.org/,English
1,1471014,Van Gogh by Vincent,en,"In a career that lasted only ten years, Vincen...",99,26 Mar 2025,6.375,https://image.tmdb.org/t/p/original/z73X4WKZgh...,Documentary,"Zahra Ahmadi, Jack Etchells, Adam Woolley, Fra...",https://image.tmdb.org/t/p/w200/WlF6GWX98ilKoa...,other,https://www.themoviedb.org/,English
2,668489,Havoc,en,When a drug heist swerves lethally out of cont...,"28, 80, 53",24 Apr 2025,6.608,https://image.tmdb.org/t/p/original/r46leE6PSz...,"Action, Crime, Thriller","Tom Hardy, Jessie Mei Li, Timothy Olyphant, Fo...",https://image.tmdb.org/t/p/w200/d81K0RH8UX7tZj...,"winter, detective, rescue mission, shootout, d...",https://www.themoviedb.org/movie/668489-havoc/...,English
3,950387,A Minecraft Movie,en,Four misfits find themselves struggling with o...,"10751, 35, 12, 14",31 Mar 2025,6.189,https://image.tmdb.org/t/p/original/iPPTGh2OXu...,"Family, Comedy, Adventure, Fantasy","Jason Momoa, Jack Black, Sebastian Eugene Hans...",https://image.tmdb.org/t/p/w200/3troAR6QbSb6nU...,"friendship, surrealism, exploration, portal, m...",https://www.themoviedb.org/,English
4,986056,Thunderbolts*,en,After finding themselves ensnared in a death t...,"28, 12, 878",30 Apr 2025,7.615,https://image.tmdb.org/t/p/original/vnfgoohSwK...,"Action, Adventure, Science Fiction","Florence Pugh, Sebastian Stan, Julia Louis-Dre...",https://image.tmdb.org/t/p/w200/6Sjz9teWjrMY9l...,"villain, based on comic, aftercreditsstinger, ...",https://www.themoviedb.org/,English
5,1225915,Jewel Thief: The Heist Begins,hi,"In this high-octane battle of wits and wills, ...","28, 53",25 Apr 2025,6.793,https://image.tmdb.org/t/p/original/eujLbO0kf1...,"Action, Thriller","Saif Ali Khan, Jaideep Ahlawat, Nikita Dutta, ...",https://image.tmdb.org/t/p/w200/kzOy1DoCeLoKJ0...,other,https://www.themoviedb.org/movie/1225915/watch...,Hindi
6,324544,In the Lost Lands,en,A queen sends the powerful and feared sorceres...,"28, 14, 12",27 Feb 2025,6.323,https://image.tmdb.org/t/p/original/dDlfjR7gll...,"Action, Fantasy, Adventure","Milla Jovovich, Dave Bautista, Arly Jover, Ama...",https://image.tmdb.org/t/p/w200/usWnHCzbADijUL...,"witch, dystopia, sorcery, betrayal, based on s...",https://www.themoviedb.org/,English
7,822119,Captain America: Brave New World,en,After meeting with newly elected U.S. Presiden...,"28, 53, 878",12 Feb 2025,6.16,https://image.tmdb.org/t/p/original/pzIddUEMWh...,"Action, Thriller, Science Fiction","Anthony Mackie, Harrison Ford, Danny Ramirez, ...",https://image.tmdb.org/t/p/w200/eZSIDrtTzhvaby...,"hero, superhero, revenge, aftercreditsstinger,...",https://www.themoviedb.org/,English
8,1153714,Death of a Unicorn,en,A father and daughter accidentally hit and kil...,"27, 14, 35, 12",27 Mar 2025,6.2,https://image.tmdb.org/t/p/original/lXR32JepFw...,"Horror, Fantasy, Comedy, Adventure","Jenna Ortega, Paul Rudd, Will Poulter, Richard...",https://image.tmdb.org/t/p/w200/7oUAtVgZU0uLdU...,"dark comedy, unicorn, dark fantasy, horror com...",https://www.themoviedb.org/,English
9,1180906,Desert Dawn,en,A newly appointed small-town sheriff and his b...,"28, 80, 9648, 53",15 May 2025,0.0,https://image.tmdb.org/t/p/original/S21BfLrJSD...,"Action, Crime, Mystery, Thriller","Kellan Lutz, Cam Gigandet, Chad Michael Collins",https://image.tmdb.org/t/p/w200/pLzdFABlU6oS2B...,other,https://www.themoviedb.org/,English


In [132]:
# Data information:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8245 entries, 0 to 8244
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 8245 non-null   int64  
 1   title              8245 non-null   object 
 2   language           8245 non-null   object 
 3   description        8245 non-null   object 
 4   genre_id           8245 non-null   object 
 5   release_date       8245 non-null   object 
 6   rating             8245 non-null   float64
 7   poster_path        8245 non-null   object 
 8   genres             8245 non-null   object 
 9   top_cast           8245 non-null   object 
 10  cast_profile_path  8245 non-null   object 
 11  keywords           8245 non-null   object 
 12  watch_link         8245 non-null   object 
 13  languages          8245 non-null   object 
dtypes: float64(1), int64(1), object(12)
memory usage: 901.9+ KB


## Cleaning Textual Data

In [133]:
# Starting with cleaning Title data into a Cleaned_Title:
# This Cleaned Title column will be used to match the user input's movie to provide recommendation

import unicodedata
import re

def clean_title(title):
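    # Decompose accented characters and drop anything that is not ASCII
    title = unicodedata.normalize('NFKD', title).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # Lowercase, then keep only letters, digits and whitespace
    title = re.sub(r'[^a-z0-9\s]', '', title.lower())
    # Collapse whitespace runs; strip last so removed punctuation cannot leave edge spaces
    title = re.sub(r'\s+', ' ', title).strip()
    return title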

# Apply the function to a newly created column of cleaned title:
movies['title_clean'] = movies['title'].apply(clean_title)

# Preview:
movies.head(5)

Unnamed: 0,id,title,language,description,genre_id,release_date,rating,poster_path,genres,top_cast,cast_profile_path,keywords,watch_link,languages,title_clean
0,1197306,A Working Man,en,Levon Cade left behind a decorated military ca...,"28, 80, 53",26 Mar 2025,6.489,https://image.tmdb.org/t/p/original/6FRFIogh3z...,"Action, Crime, Thriller","Jason Statham, Jason Flemyng, Merab Ninidze, M...",https://image.tmdb.org/t/p/w200/whNwkEQYWLFJA8...,"based on novel or book, kidnapping, vigilante,...",https://www.themoviedb.org/,English,a working man
1,1471014,Van Gogh by Vincent,en,"In a career that lasted only ten years, Vincen...",99,26 Mar 2025,6.375,https://image.tmdb.org/t/p/original/z73X4WKZgh...,Documentary,"Zahra Ahmadi, Jack Etchells, Adam Woolley, Fra...",https://image.tmdb.org/t/p/w200/WlF6GWX98ilKoa...,other,https://www.themoviedb.org/,English,van gogh by vincent
2,668489,Havoc,en,When a drug heist swerves lethally out of cont...,"28, 80, 53",24 Apr 2025,6.608,https://image.tmdb.org/t/p/original/r46leE6PSz...,"Action, Crime, Thriller","Tom Hardy, Jessie Mei Li, Timothy Olyphant, Fo...",https://image.tmdb.org/t/p/w200/d81K0RH8UX7tZj...,"winter, detective, rescue mission, shootout, d...",https://www.themoviedb.org/movie/668489-havoc/...,English,havoc
3,950387,A Minecraft Movie,en,Four misfits find themselves struggling with o...,"10751, 35, 12, 14",31 Mar 2025,6.189,https://image.tmdb.org/t/p/original/iPPTGh2OXu...,"Family, Comedy, Adventure, Fantasy","Jason Momoa, Jack Black, Sebastian Eugene Hans...",https://image.tmdb.org/t/p/w200/3troAR6QbSb6nU...,"friendship, surrealism, exploration, portal, m...",https://www.themoviedb.org/,English,a minecraft movie
4,986056,Thunderbolts*,en,After finding themselves ensnared in a death t...,"28, 12, 878",30 Apr 2025,7.615,https://image.tmdb.org/t/p/original/vnfgoohSwK...,"Action, Adventure, Science Fiction","Florence Pugh, Sebastian Stan, Julia Louis-Dre...",https://image.tmdb.org/t/p/w200/6Sjz9teWjrMY9l...,"villain, based on comic, aftercreditsstinger, ...",https://www.themoviedb.org/,English,thunderbolts
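
A quick way to check the cleaning rules is to run them on a few awkward titles. The sketch below restates the steps as a standalone helper that trims whitespace as the final step; the name `normalize_title` and the sample titles are chosen for illustration and are not part of the dataset pipeline.

```python
import re
import unicodedata


def normalize_title(title):
    """Illustrative restatement of the title-cleaning steps (hypothetical helper name)."""
    # Decompose accented characters, then drop anything that is not ASCII.
    title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Lowercase and keep only letters, digits and whitespace.
    title = re.sub(r"[^a-z0-9\s]", "", title.lower())
    # Collapse whitespace runs and trim the edges last.
    return re.sub(r"\s+", " ", title).strip()


samples = ["Thunderbolts*", "Amélie", "  Spider-Man:   No Way Home ", "Barfi !"]
cleaned = [normalize_title(s) for s in samples]
print(cleaned)
```

Accents, punctuation, and irregular spacing all disappear, which is what lets a loosely typed query line up with the stored titles.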


In [134]:
# Create soup using genres, description, language, top cast and keywords:
movies['soup'] = movies['genres'] + ' ' + movies['description'] + ' ' + movies['language'] + ' ' + movies['top_cast'] + ' ' + movies['keywords']
# Double quotes outside so the f-string also parses on Python versions before 3.12
print(f"Generated Soup:\n{movies['soup'][0]}")

Generated Soup:
Action, Crime, Thriller Levon Cade left behind a decorated military career in the black ops to live a simple life working construction. But when his boss's daughter, who is like family to him, is taken by human traffickers, his search to bring her home uncovers a world of corruption far greater than he ever could have imagined. en Jason Statham, Jason Flemyng, Merab Ninidze, Maximilian Osinski, Cokey Falkow based on novel or book, kidnapping, vigilante, missing person, black ops, construction worker, criminal conspiracy, absurd


In [135]:
# Check:
movies['top_cast'][0]

'Jason Statham, Jason Flemyng, Merab Ninidze, Maximilian Osinski, Cokey Falkow'

In [136]:
# Clean and Lemmatize soup:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [137]:
# Initialize StopWords and lemmatizer:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define the function to clean the soup:
def clean_text(text,is_cast=False):
    # If it's cast data, just convert to lowercase and strip
    if is_cast:
        return ', '.join([actor.lower().strip() for actor in text.split(',')])
    
    # Otherwise, clean the text as usual
    tokens = nltk.word_tokenize(re.sub(r'\W',' ', text.lower()))
    filtered = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(filtered)

# Apply function to soup
movies['cleaned_soup'] = movies['soup'].apply(clean_text)
# Apply function to cast
movies['cleaned_top_cast'] = movies['top_cast'].apply(lambda x: clean_text(x, is_cast=True))

In [138]:
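# Flatten the cast list into a lowercase, space-separated string:
movies['cleaned_cast'] = movies['top_cast'].str.lower().str.strip().str.replace(', ', ' ', regex=False)

# Build the final soup. The cast already appears once in cleaned_soup,
# so appending it again gives cast names extra weight in TF-IDF:
movies['final_soup'] = movies['cleaned_soup'] + ' ' + movies['cleaned_cast']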

In [139]:
# Preview finalized soup:
movies['final_soup'][2]

'action crime thriller drug heist swerve lethally control jaded cop fight way corrupt city criminal underworld save politician son en tom hardy jessie mei li timothy olyphant forest whitaker justin cornwell winter detective rescue mission shootout dirty cop criminal underworld crooked politician estranged son aggressive insecure christmas grim drug deal night club brutal violence tom hardy jessie mei li timothy olyphant forest whitaker justin cornwell'

## Preprocessing Data:

In [81]:
# Apply TF-IDF Vectorization on Soup:

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the Vectorizer:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['final_soup'])

# Print output:
print(f"Successfully vectorized soup:\n{tfidf_matrix.shape}")

Successfully vectorized soup:
(8245, 33228)


## Computing Cosine Similarity using TF-IDF Vectorized Data

In [82]:
# Import required library:
from sklearn.metrics.pairwise import cosine_similarity

# Compute:
cosine_sim = cosine_similarity(tfidf_matrix,tfidf_matrix)

# Print output:
print(f"Computation Complete: {cosine_sim.shape}")

Computation Complete: (8245, 8245)
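
A dense matrix of this shape is large, which matters once it is saved to disk and reloaded. The arithmetic below is a rough estimate that assumes 64-bit floats for the stored values and, as a hypothetical alternative, a cast to 32-bit floats; the helper name is illustrative.

```python
# Rough memory estimate for a dense n x n similarity matrix.
n_movies = 8245          # rows in the dataset, taken from the shape printed above
bytes_float64 = 8        # assumption: values stored as 64-bit floats
bytes_float32 = 4        # assumption: matrix cast to 32-bit floats before saving


def dense_matrix_mib(n, bytes_per_value):
    """Size in MiB of an n x n dense matrix."""
    return n * n * bytes_per_value / 2**20


size64 = dense_matrix_mib(n_movies, bytes_float64)
size32 = dense_matrix_mib(n_movies, bytes_float32)
print(f"float64: {size64:.0f} MiB, float32: {size32:.0f} MiB")
```

Under these assumptions the matrix takes roughly 519 MiB, and half that at 32-bit precision, so the saved file may be worth shrinking before it is shipped with an app.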


In [98]:
# Save the cosine sim using NumPy:
import numpy as np
np.save('cosine_sim.npy',cosine_sim)
print("Saved!")

Saved!


In [140]:
# Save the data ready for recommendation engine:
movies.to_csv("movies_recommend.csv", index=False)
print("Saved!")

Saved!


# Content Based Recommendation Engine with smart search

In [193]:
# Saving movies_recommend.csv data as pkl file:
import pandas as pd
import pickle

# Load the CSV file:
movies_df = pd.read_csv("movies_recommend.csv")

# Save into pkl:
with open("movies_recommend.pkl", 'wb') as f:
    pickle.dump(movies_df, f)

print("Pickle file saved successfully.")

Pickle file saved successfully.


In [195]:
# Saving cosine similarity matrix as pkl file:

with open("cosine_sim.pkl", "wb") as f:
    pickle.dump(cosine_sim, f)

print("Precomputed Cosine sim is saved.")

Precomputed Cosine sim is saved.


In [1]:
import re
import pandas as pd
import numpy as np
# Load data:
movies = pd.read_csv("../Data/movies_recommended.csv")
# Load precomputed cosine sim:
cosine_sim = np.load("../Recommendation Engine/cosine_sim.npy")

In [3]:
# Save to pkl file:
import pickle
with open('movies_recommended.pkl', 'wb') as f:
    pickle.dump(movies, f)

print("Pickle file saved successfully")

Pickle file saved successfully


In [10]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8245 entries, 0 to 8244
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 8245 non-null   int64  
 1   title              8245 non-null   object 
 2   language           8245 non-null   object 
 3   description        8245 non-null   object 
 4   genre_id           8245 non-null   object 
 5   release_date       8245 non-null   object 
 6   rating             8245 non-null   float64
 7   poster_path        8245 non-null   object 
 8   genres             8245 non-null   object 
 9   top_cast           8245 non-null   object 
 10  cast_profile_path  8245 non-null   object 
 11  keywords           8245 non-null   object 
 12  watch_link         8245 non-null   object 
 13  languages          8245 non-null   object 
 14  title_clean        8245 non-null   object 
 15  soup               8245 non-null   object 
 16  cleaned_soup       8245 

# === Content-based recommendation function ===
---

In [10]:
movie_aliases = {
    "znmd": "Zindagi Na Milegi Dobara",
    "dch": "Dil Chahta Hai",
    "3idiots": "3 Idiots",
    "k3g": "Kabhi Khushi Kabhie Gham",
    "lagaan": "Lagaan",
    "tzp": "Taare Zameen Par",
    "bb": "Bajrangi Bhaijaan",
    "dangal": "Dangal",
    "aaa": "Andaz Apna Apna",
    "barfi": "Barfi!",
    "mi": "Mission Impossible",
    "tdk": "The Dark Knight",
    "inception": "Inception",
    "lotr": "The Lord of the Rings",
    "matrix": "The Matrix",
    "endgame": "Avengers: Endgame",
    "forrest": "Forrest Gump",
    "interstellar": "Interstellar",
    "jp": "Jurassic Park",
    "potc": "Pirates of the Caribbean"
}
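
The alias values are display titles with capitals and punctuation, whereas the lookup column holds cleaned lowercase titles. A minimal sketch of resolving an alias and then normalizing the result is shown below; the helper names `normalize` and `resolve_query` and the three-entry alias subset are assumptions for illustration.

```python
import re

# A small subset of the alias table above, for illustration.
aliases = {"tdk": "The Dark Knight", "barfi": "Barfi!", "endgame": "Avengers: Endgame"}


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def resolve_query(user_input, alias_table):
    """Expand a known alias, then return the query in normalized form."""
    key = normalize(user_input)
    expanded = alias_table.get(key, key)
    # Normalize again: alias values are display titles, not cleaned ones.
    return normalize(expanded)


print(resolve_query("TDK", aliases))
print(resolve_query("Endgame!", aliases))
print(resolve_query("some other film", aliases))
```

Normalizing after the lookup keeps the expanded title in the same form as the cleaned titles it will be compared against.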

In [11]:
from rapidfuzz import process, fuzz

def recommend_movies(user_input, top_n=10):
    try:
        # Validate user input
        if not isinstance(user_input, str) or not user_input.lower().strip():
            raise ValueError("User input must not be empty. Please add a movie to get recommendations.")
        
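        # Clean user input similarly to how titles were cleaned for `title_clean`
        def _clean(text):
            text = re.sub(r'[^a-z0-9\s]', '', text.lower())
            return re.sub(r'\s+', ' ', text).strip()

        user_input_clean = _clean(user_input)

        # Use alias if available; alias values are display titles, so clean them too
        if user_input_clean in movie_aliases:
            user_input_clean = _clean(movie_aliases[user_input_clean])

        # Find the closest title. Without a score_cutoff, extractOne returns None
        # only when the list of choices is empty.
        match_result = process.extractOne(user_input_clean, movies['title_clean'].to_list(), scorer=fuzz.ratio)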
        if match_result is None:
            raise ValueError(f"Movie {user_input} is not updated in the data. It will be added in future update of application.")
        
        best_match = match_result[0]
        print(f"Best Match is: {best_match}")

        # Get index of the best match
        idx = movies[movies['title_clean'] == best_match].index[0]

        # Ensure that idx is valid
        if idx < 0 or idx >= len(movies):
            raise IndexError(f"Index {idx} is out of range.")
        
        # Calculate similarity scores directly from the precomputed cosine_sim matrix
        sim_scores = list(enumerate(cosine_sim[idx]))      # Use the precomputed cosine similarity matrix

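        # Sort by similarity, highest first, and exclude the query movie by index
        # (slicing off the first entry could drop a different movie when scores tie)
        similar_movies_idx = sorted(
            ((i, score) for i, score in sim_scores if i != idx),
            key=lambda x: x[1], reverse=True
        )[:top_n]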

        # Prepare results
        results = []
        for i, _ in similar_movies_idx:
            movie_data = movies.loc[i]

            # Check if all necessary fields exist
            if any(field not in movie_data for field in ['title', 'top_cast','cast_profile_path', 'description', 'genres', 'languages', 'rating', 'poster_path', 'release_date', 'watch_link']):
                continue

            # Get trailer info if available:
            video_key = movie_data.get('video_key')
            trailer_url = f"https://www.youtube.com/watch?v={video_key}" if pd.notna(video_key) else None

            results.append({
                'Title': movies.loc[i,'title'],
                'Top Cast': movies.loc[i, 'top_cast'],
                'Cast Picture': movies.loc[i, 'cast_profile_path'],
                'Description': movies.loc[i,'description'],
                'Genre': movies.loc[i,'genres'],
                'Language': movies.loc[i,'languages'],
                'Release Date': movies.loc[i, 'release_date'],
                'Rating': movies.loc[i,'rating'],
                'Poster': movies.loc[i,'poster_path'],
                'Stream': movies.loc[i, 'watch_link'],
                'Trailer': trailer_url
            })


        return results
    
    except ValueError as ve:
        return {'Error': str(ve)}
    
    except IndexError as ie:
        return {'Error': f'Index error: {str(ie)}'}
    
    except Exception as e:
        return {'Error': f'An unexpected error occurred: {str(e)}'}
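
The ranking step can be examined in isolation on a toy similarity row. The sketch below, with the hypothetical helper `top_n_similar`, filters the query movie out by index before sorting, which stays correct even when another movie ties with it at a score of 1.0.

```python
def top_n_similar(sim_row, query_idx, top_n=10):
    """Return (index, score) pairs for the top_n highest scores, skipping the query itself."""
    candidates = [(i, s) for i, s in enumerate(sim_row) if i != query_idx]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]


# Toy similarity row for movie 2; movie 0 has identical text, so it also scores 1.0.
row = [1.0, 0.2, 1.0, 0.7, 0.0]
picks = top_n_similar(row, query_idx=2, top_n=3)
print(picks)
```

Sorting the full row and discarding the first element would, in this toy case, discard movie 0 and keep the query movie in its own recommendations.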

In [13]:
recommend_movies('dil dhadakne do',10)

Best Match is: dil dhadakne do


[{'Title': "Isn't It Romantic",
  'Top Cast': 'Rebel Wilson, Liam Hemsworth, Adam Devine, Priyanka Chopra Jonas, Betty Gilpin',
  'Cast Picture': 'https://image.tmdb.org/t/p/w200/yuyRg1WaY616Uux3vP9ONsUjQTS.jpg, https://image.tmdb.org/t/p/w200/7UIm9RoBnlqS1uLlbElAY8urdWD.jpg, https://image.tmdb.org/t/p/w200/8zU8zMs7cpjVzkBXis6I3wO3YeQ.jpg, https://image.tmdb.org/t/p/w200/stEZxIVAWFlrifbWkeULsD4LHnf.jpg, https://image.tmdb.org/t/p/w200/hBOviIHCVqbWyyPUoIxZohDl5SL.jpg',
  'Description': 'For a long time, Natalie, an Australian architect living in New York City, had always believed that what she had seen in rom-coms is all fantasy. But after thwarting a mugger at a subway station only to be knocked out while fleeing, Natalie wakes up and discovers that her life has suddenly become her worst nightmare—a romantic comedy—and she is the leading lady.',
  'Genre': 'Comedy, Fantasy, Romance',
  'Language': 'English',
  'Release Date': '13 Feb 2019',
  'Rating': 6.211,
  'Poster': 'https://image