# Hybrid Movie Recommendation System

This notebook implements a hybrid recommendation system that combines:
- **Content-based filtering**: Uses movie features (cast, crew, genres, keywords)
- **Collaborative filtering**: Uses user rating patterns

## Import Libraries

In [1]:
import pandas as pd
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import warnings
warnings.filterwarnings('ignore')

## Data Loading and Preprocessing

Load and merge movie datasets:
- TMDB 5000 Dataset (movies and credits)
- The Movies Dataset (ratings and links)

In [2]:
# Load datasets
tmdb_movies_df = pd.read_csv('../data/raw/tmdb_5000_movie_dataset/tmdb_5000_movies.csv')
tmdb_credits_df = pd.read_csv('../data/raw/tmdb_5000_movie_dataset/tmdb_5000_credits.csv')
ratings_df = pd.read_csv('../data/raw/the_movie_dataset/ratings.csv')
links_df = pd.read_csv('../data/raw/the_movie_dataset/links.csv')

# Rename columns for consistency
tmdb_credits_df.columns = ['id','title','cast','crew']

# Merge movie data with credits
tmdb_movies_df = tmdb_movies_df.merge(tmdb_credits_df, on='id')
tmdb_movies_df = tmdb_movies_df.rename(columns={'id': 'tmdbId'})

# Link ratings with TMDB IDs
ratings_with_tmdb_id_df = ratings_df.merge(links_df, on='movieId', how='inner')
ratings_with_tmdb_id_df = ratings_with_tmdb_id_df.dropna(subset=['tmdbId']).astype({'tmdbId': 'int'})

print(f"Final dataset size: {tmdb_movies_df.shape}")

Final dataset size: (4803, 23)
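
The linking step can be pictured on toy data without pandas. The sketch below mimics the inner merge on `movieId`, the removal of rows with no `tmdbId`, and the cast to integer. The rows and the helper name `link_ratings` are invented for illustration and are not part of the notebook.

```python
import math

# Toy stand-ins for ratings.csv and links.csv (all values are invented).
ratings = [
    {"userId": 1, "movieId": 10, "rating": 4.0},
    {"userId": 1, "movieId": 20, "rating": 3.5},
    {"userId": 2, "movieId": 30, "rating": 5.0},  # movieId 30 has no link row
]
links = {10: 862.0, 20: float("nan")}  # movieId -> tmdbId (NaN = missing)


def link_ratings(ratings, links):
    """Inner-join ratings to links, drop missing tmdbId, cast tmdbId to int."""
    linked = []
    for row in ratings:
        tmdb_id = links.get(row["movieId"])
        if tmdb_id is None or math.isnan(tmdb_id):
            continue  # no link row (inner join) or missing tmdbId (dropna)
        linked.append({**row, "tmdbId": int(tmdb_id)})
    return linked


linked_ratings = link_ratings(ratings, links)
print(linked_ratings)
```

Only the first rating survives: the second has a missing `tmdbId`, and the third has no matching row in `links`.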


## Data Filtering
    
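Keep only users with at least 50 ratings, then only movies with at least 100 ratings from those users. Restricting the data to users and movies with sufficient ratings helps improve recommendation quality.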

In [3]:
# Filter users: keep only users who have rated at least 50 movies
min_user_ratings = 50
filtered_users = ratings_with_tmdb_id_df['userId'].value_counts()
filtered_users = filtered_users[filtered_users >= min_user_ratings].index
final_dataset = ratings_with_tmdb_id_df[ratings_with_tmdb_id_df['userId'].isin(filtered_users)]

print(f"Users after filtering (min {min_user_ratings} ratings): {len(filtered_users)}")
print(f"Dataset shape after user filtering: {final_dataset.shape}")

# Filter movies: keep only movies with at least 100 ratings
min_movie_ratings = 100
filtered_movies = final_dataset['tmdbId'].value_counts()
filtered_movies = filtered_movies[filtered_movies >= min_movie_ratings].index
final_dataset = final_dataset[final_dataset['tmdbId'].isin(filtered_movies)]

print(f"Movies after filtering (min {min_movie_ratings} ratings): {len(filtered_movies)}")
print(f"Final dataset shape after movie filtering: {final_dataset.shape}")

Users after filtering (min 50 ratings): 103787
Dataset shape after user filtering: (22865938, 6)
Movies after filtering (min 100 ratings): 9828
Final dataset shape after movie filtering: (22421221, 6)
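
The two-stage filter can be traced by hand on a tiny example. This is a minimal sketch in plain Python using `collections.Counter` in place of `value_counts`. The rating pairs and the thresholds of 2 are invented so that the effect is visible on six rows.

```python
from collections import Counter

# Toy (userId, tmdbId) rating pairs; thresholds are tiny for illustration.
pairs = [
    (1, "A"), (1, "B"), (1, "C"),
    (2, "A"), (2, "B"),
    (3, "A"),
]
MIN_USER_RATINGS = 2
MIN_MOVIE_RATINGS = 2

# Step 1: keep users with enough ratings.
user_counts = Counter(user for user, _ in pairs)
kept_users = {u for u, n in user_counts.items() if n >= MIN_USER_RATINGS}
pairs_after_users = [(u, m) for u, m in pairs if u in kept_users]

# Step 2: keep movies with enough ratings among the remaining rows.
movie_counts = Counter(movie for _, movie in pairs_after_users)
kept_movies = {m for m, n in movie_counts.items() if n >= MIN_MOVIE_RATINGS}
final_pairs = [(u, m) for u, m in pairs_after_users if m in kept_movies]

print(sorted(kept_users))
print(sorted(kept_movies))
print(final_pairs)
```

Because movies are counted after the user filter, a movie's count reflects only ratings from retained users. Here movie `"A"` is counted twice rather than three times, since user 3 was already removed.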


## Content-Based Filtering Setup

Process movie features and create feature vectors for similarity calculation.

In [4]:
# Parse JSON strings in feature columns
features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    tmdb_movies_df[feature] = tmdb_movies_df[feature].apply(literal_eval)

def get_director(crew_list):
    """Extract director name from crew list"""
    for person in crew_list:
        if person['job'] == 'Director':
            return person['name']
    return None

def get_list(feature_list):
    """Extract top 3 names from feature list"""
    if isinstance(feature_list, list):
        names = [item['name'] for item in feature_list]
        return names[:3]
    return []

def clean_data(data):
    """Clean and normalize text data"""
    if isinstance(data, list):
        return [str.lower(item.replace(' ', '')) for item in data]
    elif isinstance(data, str):
        return str.lower(data.replace(' ', ''))
    else:
        return ''

# Extract director from crew
tmdb_movies_df['director'] = tmdb_movies_df['crew'].apply(get_director)

# Process features: get top 3 items and clean text
features_to_process = ['cast', 'keywords', 'genres', 'director']
for feature in ['cast', 'keywords', 'genres']:
    tmdb_movies_df[feature] = tmdb_movies_df[feature].apply(get_list)
    
for feature in features_to_process:
    tmdb_movies_df[feature] = tmdb_movies_df[feature].apply(clean_data)
    
def create_soup(row):
    """Combine all features into single text string"""
    return ' '.join(row['keywords']) + ' ' + ' '.join(row['cast']) + ' ' + \
           row['director'] + ' ' + ' '.join(row['genres'])

# Create feature soup for vectorization
tmdb_movies_df['soup'] = tmdb_movies_df.apply(create_soup, axis=1)

# Vectorize features and calculate similarity matrix
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(tmdb_movies_df['soup'])
content_cosine_sim = cosine_similarity(count_matrix, count_matrix)

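# Create title-to-index mapping for quick lookup
tmdb_movies_df = tmdb_movies_df.reset_index(drop=True)
titles = tmdb_movies_df['title_x']
indices = pd.Series(tmdb_movies_df.index, index=tmdb_movies_df['title_x'])
# Keep the first row for any repeated title so indices[title] is always a scalar
indices = indices[~indices.index.duplicated(keep='first')]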
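
To see what the similarity matrix measures, the sketch below computes cosine similarity between token-count vectors for three invented soups. It uses plain whitespace splitting, which only approximates `CountVectorizer` (no stop-word removal, no token pattern). The function name `cosine_from_soups` and the soup strings are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_from_soups(soup_a, soup_b):
    """Cosine similarity between two whitespace-tokenized count vectors."""
    counts_a, counts_b = Counter(soup_a.split()), Counter(soup_b.split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty soup has no direction to compare
    return dot / (norm_a * norm_b)


# Invented soups in the same style as the create_soup output.
soup_1 = "superhero robertdowneyjr josswhedon action"
soup_2 = "superhero robertdowneyjr jonfavreau action"
soup_3 = "romance paris woodyallen comedy"

sim_12 = cosine_from_soups(soup_1, soup_2)
sim_13 = cosine_from_soups(soup_1, soup_3)
print(round(sim_12, 2), round(sim_13, 2))  # 0.75 0.0
```

The first two soups share three of four tokens, giving 3 / (2 × 2) = 0.75. The third shares none, so its similarity is 0.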

## Content-Based Recommendation Function

In [5]:
def get_content_recommendations(title, cosine_sim=content_cosine_sim):
    """Get content-based recommendations using cosine similarity"""
    # Get movie index
    idx = indices[title]
    
    # Calculate similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get top 10 similar movies (excluding self)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    
    return titles.iloc[movie_indices]
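
Slicing `[1:11]` relies on the query movie sorting into position 0. If another movie has an identical soup and a lower row index, the stable sort places that movie first, and the query itself stays in the results. A hedged alternative is to filter the query out by index before ranking. The helper `top_k_excluding_self` and the similarity row below are invented for illustration.

```python
def top_k_excluding_self(similarity_row, query_index, k=10):
    """Rank indices by similarity, skipping the query index explicitly."""
    candidates = [
        (index, score)
        for index, score in enumerate(similarity_row)
        if index != query_index
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [index for index, _ in candidates[:k]]


# Invented similarity row for the movie at index 2; index 0 has an identical
# soup, so it ties with the query at 1.0.
row = [1.0, 0.2, 1.0, 0.7]

top_two = top_k_excluding_self(row, query_index=2, k=2)

# Slice-based ranking for comparison: sort everything, then drop position 0.
ranked_all = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
sliced_two = [index for index, _ in ranked_all[1:3]]

print(top_two)     # [0, 3]
print(sliced_two)  # [2, 3]
```

In this toy case the slice-based version returns the query movie (index 2) as its own recommendation, while the index-based version does not.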

## Collaborative Filtering Setup

Create a movie-by-user rating matrix (one row per movie, one column per user) and fit a KNN model on it for collaborative filtering.

In [6]:
# Create movie-by-user rating matrix (rows: tmdbId, columns: userId); unrated cells become 0
movie_to_user_df = final_dataset.pivot_table(index='tmdbId', columns='userId', values='rating').fillna(0)

# Convert to sparse matrix for memory efficiency
movie_to_user_sparse_matrix = csr_matrix(movie_to_user_df.values)

# Train KNN model for collaborative filtering
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(movie_to_user_sparse_matrix)

NearestNeighbors(algorithm='brute', metric='cosine')


## Collaborative Filtering Function

In [7]:
def get_collaborative_recommendations(movie_id):
    """Get collaborative recommendations using KNN on user ratings"""
    # Find similar movies based on user rating patterns.
    # Raises KeyError if movie_id was removed by the rating filters.
    # neighbor_indices avoids shadowing the global title-to-index `indices`.
    distances, neighbor_indices = model_knn.kneighbors(
        movie_to_user_df.loc[movie_id].values.reshape(1, -1),
        n_neighbors=11
    )

    # Extract similar movie IDs (exclude self, which is the first neighbor)
    similar_movies = [movie_to_user_df.index[i] for i in neighbor_indices.flatten()[1:]]

    # Return titles of the neighbors that are present in the TMDB data
    return tmdb_movies_df[tmdb_movies_df['tmdbId'].isin(similar_movies)][['title_x']]
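
What the brute-force KNN with `metric='cosine'` computes can be sketched in plain Python: each movie is a vector of user ratings, missing ratings count as 0, and the distance is 1 minus the cosine similarity. The ratings, movie IDs, and helper names below are invented for illustration. This is not how scikit-learn implements the search internally.

```python
import math


def cosine_distance(vec_a, vec_b):
    """1 - cosine similarity for sparse vectors stored as {userId: rating}."""
    dot = sum(rating * vec_b.get(user, 0.0) for user, rating in vec_a.items())
    norm_a = math.sqrt(sum(r * r for r in vec_a.values()))
    norm_b = math.sqrt(sum(r * r for r in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def nearest_movies(movie_vectors, movie_id, k):
    """Return the k closest other movies, nearest first."""
    query = movie_vectors[movie_id]
    distances = [
        (other_id, cosine_distance(query, vector))
        for other_id, vector in movie_vectors.items()
        if other_id != movie_id
    ]
    distances.sort(key=lambda pair: pair[1])
    return [other_id for other_id, _ in distances[:k]]


# Invented ratings: movie id -> {userId: rating}; absent users count as 0.
movie_vectors = {
    101: {1: 5.0, 2: 4.0},
    102: {1: 5.0, 2: 4.0, 3: 1.0},
    103: {3: 5.0},
    104: {1: 1.0, 3: 4.0},
}
neighbors = nearest_movies(movie_vectors, 101, k=2)
print(neighbors)  # [102, 104]
```

Movie 102 is rated almost identically by the same users, so it is nearest. Movie 103 shares no raters with 101, so its distance is 1 and it ranks last.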

## Hybrid Recommendation System

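Combines the two candidate lists with weighted scores: a title receives `w_content` if it appears in the content-based list and `w_collaborative` if it appears in the collaborative list, so titles found by both methods receive the sum and rank first.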

In [8]:
def hybrid_recommender(movie_title, w_content=0.5, w_collaborative=0.5):
    """Hybrid recommender combining content and collaborative filtering
    
    Args:
        movie_title: Target movie title
        w_content: Weight for content-based recommendations (0-1)
        w_collaborative: Weight for collaborative recommendations (0-1)
    
    Returns:
        List of recommended movie titles sorted by hybrid score
    """
    # Get content-based recommendations
    content_recs = get_content_recommendations(movie_title)

    # Get collaborative recommendations
    movie_id = tmdb_movies_df[tmdb_movies_df['title_x'] == movie_title]['tmdbId'].values[0]
    collaborative_recs = get_collaborative_recommendations(movie_id)

    # Calculate hybrid scores
    hybrid_scores = {}

    # Add content-based scores
    for title in content_recs:
        if title not in hybrid_scores:
            hybrid_scores[title] = w_content * 1.0

    # Add collaborative scores (boost if already in content recommendations)
    for title in collaborative_recs['title_x']:
        if title in hybrid_scores:
            hybrid_scores[title] += w_collaborative * 1.0
        else:
            hybrid_scores[title] = w_collaborative * 1.0

    # Sort by hybrid score
    sorted_recs = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)

    return [title for title, score in sorted_recs]
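
The scoring rule is easy to check on toy lists. The sketch below reproduces the same flat weighting in isolation. The function name `combine_flat_scores` and the movie titles are invented for illustration.

```python
def combine_flat_scores(content_titles, collaborative_titles,
                        w_content=0.5, w_collaborative=0.5):
    """Give each title its list's weight; titles in both lists get the sum."""
    scores = {}
    for title in content_titles:
        scores.setdefault(title, w_content)
    for title in collaborative_titles:
        scores[title] = scores.get(title, 0.0) + w_collaborative
    # sorted() is stable, so tied titles keep their insertion order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented candidate lists for illustration.
content_titles = ["Movie A", "Movie B", "Movie C"]
collaborative_titles = ["Movie C", "Movie D"]

ranked = combine_flat_scores(content_titles, collaborative_titles)
print(ranked)
# [('Movie C', 1.0), ('Movie A', 0.5), ('Movie B', 0.5), ('Movie D', 0.5)]
```

With equal weights, only membership in both lists separates titles. Every other title ties at 0.5 and keeps its insertion order, content-based titles first.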

## Example Usage

Test the hybrid recommender with 'The Avengers' using equal weights.

In [9]:
# Get hybrid recommendations with equal weights (50% content, 50% collaborative)
hybrid_recommender('The Avengers', w_content=0.5, w_collaborative=0.5)

['Iron Man 2',
 'Captain America: The First Avenger',
 'Captain America: The Winter Soldier',
 'Iron Man 3',
 'Iron Man',
 'Guardians of the Galaxy',
 'Avengers: Age of Ultron',
 'Captain America: Civil War',
 'The Incredible Hulk',
 'X-Men: The Last Stand',
 'The Dark Knight Rises',
 'X-Men: Days of Future Past',
 'X-Men: First Class',
 'Thor']