# Introduction
This notebook implements a **Content-Based Recommendation System** using the MovieLens dataset. 

**The Goal:** Recommend movies to a user based on the similarity of movie attributes (tags and genres). 

**The Approach:**
1. **Data Cleaning:** Aggregate user tags and format genres.
2. **Feature Engineering:** Combine textual data into a single "soup" of metadata.
3. **Vectorization:** Use **TF-IDF (Term Frequency-Inverse Document Frequency)** to convert text into numerical vectors.
4. **Similarity Calculation:** Use **Cosine Similarity** to find the closest vectors (movies) in multidimensional space.

In [23]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the datasets
# Note: Adjust these paths to wherever the MovieLens CSV files live (e.g., a Kaggle input directory)
links = pd.read_csv("./data/links.csv")
movies = pd.read_csv("./data/movies.csv")
tags = pd.read_csv("./data/tags.csv")

print("Movies shape:", movies.shape)
print("Tags shape:", tags.shape)
movies.head()

Movies shape: (9742, 3)
Tags shape: (3683, 4)


| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


## Data Preprocessing

We need to prepare the text data for vectorization.
1. **Tags:** `tags.csv` stores one row for each tag a user applied to a movie. We aggregate these rows so that each movie has a single string containing all of its tags.
2. **Genres:** The genres are pipe-separated (e.g., "Action|Adventure"). We replace the pipes with spaces so the vectorizer treats each genre as an individual word.
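
As a small illustration of the two steps above, the sketch below applies them to a hand-made pair of DataFrames. The rows in `toy_tags` and `toy_movies` are invented for this example and are not taken from MovieLens; only the column names mirror the real files.

```python
import pandas as pd

# Hypothetical toy data (not from MovieLens), shaped like tags.csv and movies.csv
toy_tags = pd.DataFrame({
    "userId": [1, 2, 3],
    "movieId": [10, 10, 20],
    "tag": ["Pixar", "fun", "Board Game"],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "genres": ["Animation|Comedy", "Adventure|Fantasy", "Drama"],
})

# One row per movie: join every tag for that movie into one lowercase string
toy_movie_tags = (
    toy_tags.groupby("movieId")["tag"]
    .agg(lambda s: " ".join(s.astype(str)).lower())
    .rename("tags")
    .reset_index()
)

# Replace the pipe separators with spaces so each genre becomes its own token
toy_movies["genres"] = (
    toy_movies["genres"].str.replace("|", " ", regex=False).str.lower()
)

# Left merge keeps movie 30, which has no tags; its NaN becomes an empty string
toy_df = toy_movies.merge(toy_movie_tags, on="movieId", how="left")
toy_df["tags"] = toy_df["tags"].fillna("")
print(toy_df)
```

Movie 10 ends up with the tag string `pixar fun`, while movie 30 keeps an empty tag string and is described by its genres alone.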

In [24]:
# 1. Aggregate Tags
# Group tags by movieId and join them with spaces into one lowercase string per movie.
# dropna() and astype(str) guard against missing or non-string tag values.
movie_tags = tags.groupby('movieId')['tag'].apply(lambda x: ' '.join(x.dropna().astype(str)).lower()).reset_index()
movie_tags.rename(columns={'tag': 'tags'}, inplace=True)

# 2. Clean Genres
# Remove '|' and convert to lowercase
movies['genres'] = movies['genres'].apply(lambda x: " ".join(x.split("|")).lower())

# 3. Merge Data
# Merge movies with tags. We use a left merge to keep movies that might not have user tags.
df = movies.merge(movie_tags, on="movieId", how="left")
df = df.merge(links, on="movieId", how="left")

# Fill NaN values in 'tags' with an empty string to avoid errors later
df['tags'] = df['tags'].fillna('')

# 4. Create the "Metadata Soup"
# Combine genres and tags into a single column for vectorization
# IMPORTANT: We add a space " " between tags and genres to prevent word merging
df['metadata'] = df['tags'] + " " + df['genres']

# Display the final dataframe structure
df[['movieId', 'title', 'metadata']].head()

| | movieId | title | metadata |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | pixar pixar fun adventure animation children c... |
| 1 | 2 | Jumanji (1995) | fantasy magic board game robin williams game a... |
| 2 | 3 | Grumpier Old Men (1995) | moldy old comedy romance |
| 3 | 4 | Waiting to Exhale (1995) | comedy drama romance |
| 4 | 5 | Father of the Bride Part II (1995) | pregnancy remake comedy |


## TF-IDF Vectorization & Cosine Similarity

We use **TF-IDF** to transform the `metadata` text into vectors. 
* **TF (Term Frequency):** How often a word appears in a specific movie's metadata.
* **IDF (Inverse Document Frequency):** Downweights words that appear frequently across *all* movies (like "comedy" or "film"), giving more importance to unique tags.

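We cap `max_features` at 5000 to bound the vocabulary size and keep the computation efficient. For this dataset the cap is never reached: the fitted matrix has only 1677 columns.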
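
To make the IDF effect concrete, here is a minimal sketch on three invented metadata strings (they are assumptions for illustration, not rows from the dataset). With scikit-learn's default settings the IDF is smoothed as `ln((1 + n) / (1 + df)) + 1`, so a term that appears in every document gets the lowest possible weight of 1.0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical metadata strings: "comedy" is in every document, "pixar" in only one
toy_docs = [
    "pixar comedy",
    "romance comedy",
    "heist comedy",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Map each vocabulary term to its learned IDF weight
vocab = toy_tfidf.vocabulary_
idf_by_term = {term: toy_tfidf.idf_[col] for term, col in vocab.items()}
for term, idf in sorted(idf_by_term.items()):
    print(f"{term:8s} idf={idf:.3f}")

# Within the first document, the rare term outweighs the common one
row = toy_matrix[0:1].toarray().ravel()
print("pixar weight :", round(float(row[vocab["pixar"]]), 3))
print("comedy weight:", round(float(row[vocab["comedy"]]), 3))
```

Both terms occur once in the first document, so the difference in their final weights comes entirely from the IDF factor.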

In [25]:
# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')

# Fit and transform the metadata
# This creates a matrix where rows are movies and columns are words (features)
vector_matrix = tfidf.fit_transform(df['metadata'])

print(f"Matrix Shape: {vector_matrix.shape}")

# Calculate Cosine Similarity
# This computes the similarity score (0 to 1) between every movie and every other movie
similarity = cosine_similarity(vector_matrix)

Matrix Shape: (9742, 1677)
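
One thing to keep in mind is that the full similarity matrix has 9742 × 9742 entries, which is roughly 759 MB when stored as 64-bit floats. Where that is too much memory, one possible alternative is to compute a single row of similarities on demand. The sketch below is not part of the original pipeline: `similarity_row` is a hypothetical helper and the toy strings are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical metadata strings standing in for df['metadata']
toy_docs = [
    "pixar animation comedy",
    "pixar animation adventure",
    "courtroom drama",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)


def similarity_row(matrix, index):
    """Cosine similarity of one row against all rows, as a 1-D array."""
    # Slicing (rather than integer indexing) keeps the query two-dimensional
    return cosine_similarity(matrix[index:index + 1], matrix).ravel()


row = similarity_row(toy_matrix, 0)
full = cosine_similarity(toy_matrix)

# The on-demand row matches the corresponding row of the full matrix
print(np.allclose(row, full[0]))
print(row.round(3))
```

The trade-off is that each lookup recomputes one row instead of reading it from a precomputed matrix, which is usually acceptable when only a few movies are queried.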


## Building the Recommendation Function

Now we define a function that:
1. Takes a movie title as input.
2. Finds the index of that movie in our DataFrame.
3. Retrieves the similarity scores for that movie.
4. Sorts the scores to find the top 5 matches.
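
Steps 3 and 4 can also be sketched with NumPy alone. The helper below, `top_k_indices`, is a hypothetical name introduced for this example. It excludes the query movie explicitly and uses `np.argpartition` so that only the top candidates are sorted rather than the whole row. The scores are made up.

```python
import numpy as np


def top_k_indices(scores, query_index, k=5):
    """Indices of the k highest scores, excluding query_index, best first."""
    scores = np.asarray(scores, dtype=float).copy()
    scores[query_index] = -np.inf  # never recommend the query itself
    k = min(k, len(scores) - 1)
    # argpartition places the k best candidates first, in no particular order
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k candidates, highest score first
    return candidates[np.argsort(-scores[candidates])]


# Hypothetical similarity row for the movie at position 2
toy_scores = np.array([0.10, 0.80, 1.00, 0.35, 0.80001, 0.00])
print(top_k_indices(toy_scores, query_index=2, k=3))  # prints [4 1 3]
```

Excluding the query by index, instead of assuming it sorts first, also covers the case where another movie has identical metadata and therefore the same score.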

In [26]:
def recommend(movie_title):
    # Check if movie exists in the database
    if movie_title not in df['title'].values:
        return f"Movie '{movie_title}' not found in the database."
    
    # Get the index of the movie
    movie_index = df[df["title"] == movie_title].index[0]
    
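    # Get similarity scores for this movie (higher means more similar)
    scores = similarity[movie_index]
    
    # Rank the other movies by similarity score (descending order).
    # We exclude the query movie by index rather than dropping the first result,
    # because movies with identical metadata tie with it and may sort ahead of it.
    movie_list = sorted(((i, s) for i, s in enumerate(scores) if i != movie_index),
                        key=lambda x: x[1], reverse=True)[:5]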
    
    print(f"Recommendations for '{movie_title}':")
    print("-" * 30)
    
    # Retrieve movie titles
    recommended_movies = []
    for i in movie_list:
        # i[0] is the row position of the recommended movie; look up its title
        title = df.iloc[i[0]]['title']
        recommended_movies.append(title)
        print(title)
        
    return recommended_movies

# Test the system
recommend("Toy Story (1995)")
print("\n")
recommend("Jumanji (1995)")

Recommendations for 'Toy Story (1995)':
------------------------------
Bug's Life, A (1998)
Toy Story 2 (1999)
Guardians of the Galaxy 2 (2017)
Antz (1998)
Adventures of Rocky and Bullwinkle, The (2000)


Recommendations for 'Jumanji (1995)':
------------------------------
Tomb Raider (2018)
Night at the Museum (2006)
Indian in the Cupboard, The (1995)
NeverEnding Story III, The (1994)
Escape to Witch Mountain (1975)


['Tomb Raider (2018)',
 'Night at the Museum (2006)',
 'Indian in the Cupboard, The (1995)',
 'NeverEnding Story III, The (1994)',
 'Escape to Witch Mountain (1975)']
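
Because `recommend` looks for an exact title match, a query such as "Toy Story" without the year returns the "not found" message. One way to soften this, sketched below under the assumption that close string matches are acceptable, is to suggest candidate titles with the standard library's `difflib`. The helper `suggest_titles` and the short title list are hypothetical and only illustrate the idea.

```python
import difflib

# Hypothetical list standing in for df['title'].tolist()
toy_titles = ["Toy Story (1995)", "Toy Story 2 (1999)", "Jumanji (1995)"]


def suggest_titles(query, titles, n=3, cutoff=0.6):
    """Return up to n titles that closely match query, ignoring case."""
    # Compare lowercase strings, but return the original spelling
    lowered = {t.lower(): t for t in titles}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=n, cutoff=cutoff
    )
    return [lowered[m] for m in matches]


print(suggest_titles("toy story", toy_titles))
print(suggest_titles("Jumanji", toy_titles))
```

A suggested title could then be passed to `recommend` unchanged, since it is guaranteed to be one of the exact strings in the title list.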