# Content Based Filtering

In this project, I build a content-based recommender system that suggests movies to users based on their individual genre preferences.
The goal is to develop a model that can recommend new movies by analyzing the genres of movies a user has already rated highly.

Unlike collaborative filtering methods, which rely on patterns in user-item interactions (user-user or item-item similarity), content-based filtering focuses solely on the attributes of the items themselves, in this case the movie genres. This approach is especially useful for:

- Personalized recommendations even with a small user base

- New users or items that don't have a lot of historical data (solving part of the cold-start problem)

- Clear, explainable suggestions based on a user’s known interests

In [None]:
import pandas as pd

### Load Data: Ratings and Movies Datasets

In this step, we load two datasets into memory using pandas.

- **ratings.csv** contains user ratings for movies (e.g., user IDs, movie IDs, and the rating given).

- **movies.csv** contains movie metadata (e.g., movie IDs, movie titles, and genres).

These datasets are essential for building a content-based recommendation system, as they provide both the users' preferences and the details about the movies to recommend.

In [None]:
ratings = pd.read_csv("data/raw/ratings.csv")
movies = pd.read_csv("data/raw/movies.csv")

In [None]:
movies.head(6)

In [None]:
ratings.head(6)

Next, we merge the two datasets on `movieId` so that each rating row also carries the movie's title and genres. Working with a single DataFrame also simplifies the search for missing values.

In [None]:
df = ratings.merge(movies, on="movieId")
df.head()
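
By default, `merge` performs an inner join, so any rating whose `movieId` has no match in `movies` is dropped without warning. The sketch below shows one way to surface such rows before relying on the merged frame. The toy frames and their values are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 99],   # movieId 99 has no metadata below
    "rating": [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie A", "Movie B"],
    "genres": ["Action|Comedy", "Drama"],
})

# A left join keeps every rating; `_merge` records where each row came from.
# validate="many_to_one" raises if movieId is not unique in toy_movies.
checked = toy_ratings.merge(
    toy_movies, on="movieId", how="left", indicator=True, validate="many_to_one"
)

# Ratings that found no matching movie
unmatched = checked[checked["_merge"] == "left_only"]
print(len(checked), "ratings,", len(unmatched), "without movie metadata")
```

If `unmatched` is empty on the real data, the inner join used above loses nothing.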

### Check for missing values

This step performs missing data analysis on the DataFrame df:

- **df.isnull().sum()** counts the number of missing (null) values in each column and prints the result. This helps identify which features may require cleaning or imputation.

- **df[df.isnull().any(axis=1)]** filters the rows where any column contains a missing value.
The resulting df_missing DataFrame is printed to inspect which rows are incomplete and might need special handling.

In [None]:
# Check for missing values in each column
print(df.isnull().sum())

# Create and display a DataFrame containing only the rows with missing values
df_missing = df[df.isnull().any(axis=1)]
print(df_missing)
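
To see what these two checks return, here is a minimal sketch on a small made-up frame (the column values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows (hypothetical values)
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", None, "Movie C"],
    "rating": [4.0, 3.0, np.nan],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()

# Rows in which at least one column is missing
incomplete_rows = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(incomplete_rows)
```

The first result is a Series indexed by column name; the second keeps the original row index, which makes it easy to locate the incomplete records in the full frame.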

In [None]:
df.info()

In [None]:
df.dtypes

### Clean and Transform the 'genres' Column

Here we prepare the data for content-based recommendation by cleaning and transforming the genres field:

- Convert title and genres to strings: Ensures consistency in case some entries were read incorrectly as other types (e.g., floats because of missing values).

- Split genres: The genres field contains multiple genres separated by | (pipe) characters. `.str.split("|")` transforms each entry into a list of genres.

- Explode the DataFrame: `df.explode("genres")` transforms the lists in genres into multiple rows — one row per (movie, genre) pair.

- This structure is much more useful for machine learning, allowing the recommender to treat each genre individually.

In [None]:
print(df["title"].dtype)
print(df["title"].nunique())
print(df["title"].unique()[:10])

print(df["genres"].dtype)
print(df["genres"].nunique())
print(df["genres"].unique()[:10]) 


In [None]:
# Convert 'title' and 'genres' columns to string type
df["title"] = df["title"].astype(str)
df["genres"] = df["genres"].astype(str)

# Split the pipe-separated 'genres' string into a list of genres
df["genres"] = df["genres"].str.split("|")

# Expand the DataFrame so each genre gets its own row
df = df.explode("genres")
df.head(6)
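
The effect of splitting and exploding is easiest to see on a tiny example. The sketch below uses a made-up two-movie frame, not the project data:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Movie A", "Movie B"],
    "genres": ["Action|Comedy", "Drama"],
})

# Split the pipe-separated string into a list of genres.
# A single-character pattern is treated as a literal, not a regex.
toy["genres"] = toy["genres"].str.split("|")

# One row per (movie, genre) pair; reset the index because
# explode() repeats the original index for each list element
exploded = toy.explode("genres").reset_index(drop=True)
print(exploded)
```

Movie A now occupies two rows (one for Action, one for Comedy), while Movie B keeps a single row.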

### Data Cleaning

In this stage, we clean the dataset by dropping irrelevant features, removing duplicates, handling missing genre information, and verifying the integrity of the movie IDs — all essential to ensure the recommender system is trained on reliable, structured data.

1. Dropping the timestamp column:

- The timestamp field from ratings is not relevant for a content-based recommender.

- Dropping it reduces unnecessary noise and memory usage.

2. Removing duplicates and resetting the index:

- `drop_duplicates()` ensures there are no repeated rows (e.g., identical movie-genre records).

- `reset_index(drop=True)` reindexes the DataFrame after removing duplicates to maintain a clean, continuous index.

3. Inspecting the unique genres:

- nunique() and unique() help verify how many distinct genres exist and what they are.

- This inspection helps detect invalid or placeholder entries.

4. Filtering out "(no genres listed)" entries:

- Some movies may not have assigned genres (marked as "(no genres listed)").

- These rows are identified and printed (filtered_df), along with their count (filtered_df.shape).

- Knowing the number of movies without genres informs later decisions — they may be excluded or handled separately since a content-based recommender relies on genre data.

5. Checking the number of unique movieIds:

- Verifying how many distinct movies exist after cleaning.

- Ensures the dataset hasn't lost too much data during the cleaning process.

In [None]:
# Drop the 'timestamp' column
df.drop("timestamp", axis=1, inplace=True)


# Remove duplicate rows and reset index
df = df.drop_duplicates()
df = df.reset_index(drop=True)
print(df.head())

In [None]:
print(df["genres"].nunique())
print(df["genres"].unique())

In [None]:
# Filter and display movies with no genres listed
filtered_df = df[df["genres"] == "(no genres listed)"]

print(filtered_df)

In [None]:
print(filtered_df.shape)
print("\n")
print(df["movieId"].nunique())

This step removes incomplete records that could negatively impact the content-based recommender.

Filter out all rows where the genres field is equal to "(no genres listed)". Since the recommender relies on genre information to compute similarity between movies, movies without any genres are not useful.

In [None]:
# Remove movies that have no genres listed
df = df[df["genres"] != "(no genres listed)"]

# Reset the index after removing rows
df.reset_index(drop=True, inplace=True)


print(df["movieId"].nunique())
print(df.shape)

### Feature engineering

This step transforms categorical genre information into a machine-readable format:

- `pd.get_dummies()` is used to perform one-hot encoding on the genres column:
    It creates a new binary column for each unique genre.
    If a movie belongs to a genre, the corresponding column will have a 1; otherwise, it will have a 0.

In [None]:
df_genres = pd.get_dummies(df, columns=["genres"], dtype=int)
print(df_genres.head())
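
Because the frame was exploded earlier, each encoded row carries exactly one genre. The sketch below, on made-up data, shows the resulting columns and one possible way to collapse them back into a single multi-hot row per movie. The `groupby(...).max()` step is an illustration, not something the notebook does at this point.

```python
import pandas as pd

# Toy exploded frame: movie 1 has two genres, movie 2 has one (hypothetical values)
toy = pd.DataFrame({
    "movieId": [1, 1, 2],
    "genres": ["Action", "Comedy", "Drama"],
})

# One binary column per genre, prefixed with "genres_"
one_hot = pd.get_dummies(toy, columns=["genres"], dtype=int)
print(one_hot)

# Collapse to one row per movie: a genre is 1 if any of the movie's rows has it
multi_hot = one_hot.groupby("movieId").max()
print(multi_hot)
```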

In [None]:
print(f"Number of ratings: {df_genres.shape[0]}")
print(f"\nUnique users: {df_genres['userId'].nunique()}")
print(f"\nUnique movies: {df_genres['movieId'].nunique()}")

In [None]:
print(df_genres.tail(1))
df_genres.tail(5)

### Build User Profiles Based on Genre Preferences

This block extracts genre-related columns and creates user profiles based on their past ratings:

1. Identify Genre Feature Columns:

    - `genre_cols` gathers all column names that start with "genres_" — these are the one-hot encoded genre columns created earlier.

    - A quick printout confirms the genre features and the dataset dimensions.

2. Define the User Profile Function:

    - get_user_profile(user_df) creates a personalized genre preference vector for each user:

        - Multiplies each genre column by the user's rating for that movie (giving more weight to movies the user rated highly).

        - Sums the weighted genres across all movies rated by the user.

        - Normalizes by the sum of the user's ratings to account for rating scale differences between users.

    - This results in a profile vector where each value represents the user's relative preference for a specific genre.

3. Apply to All Users:

    - `df_genres.groupby("userId").apply(get_user_profile)` applies this function across all users to build user profiles.

    - `user_profiles` now contains a weighted genre preference vector for each user — a crucial step for making personalized recommendations.

In [None]:
genre_cols = [col for col in df_genres.columns if col.startswith("genres_")]

# Quick view of genre columns and dataset size
print("Genre Columns:", genre_cols)
print("Dataset size:", df_genres.shape)

In [None]:
# Define a function to build a user's genre preference profile
def get_user_profile(user_df):
    # Weight genres by the user's movie ratings and normalize
    rated_genres = user_df[genre_cols].multiply(user_df["rating"], axis=0)
    return rated_genres.sum() / user_df["rating"].sum()

# Generate user profiles by applying the function to each user
user_profiles = df_genres.groupby("userId").apply(get_user_profile)

user_profiles.head(5)
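
A small worked example may help make the weighting concrete. The sketch below applies the same arithmetic to a single made-up user; `toy_genre_cols` and `toy_user_profile` are hypothetical names used only here.

```python
import pandas as pd

# One user who rated an Action/Comedy movie 4.0 and a Drama movie 2.0.
# After exploding, the first movie occupies two rows (hypothetical values).
toy = pd.DataFrame({
    "userId": [1, 1, 1],
    "rating": [4.0, 4.0, 2.0],
    "genres_Action": [1, 0, 0],
    "genres_Comedy": [0, 1, 0],
    "genres_Drama": [0, 0, 1],
})
toy_genre_cols = ["genres_Action", "genres_Comedy", "genres_Drama"]

def toy_user_profile(user_df):
    # Weight each genre indicator by the rating of its row
    weighted = user_df[toy_genre_cols].multiply(user_df["rating"], axis=0)
    # Divide by the sum of ratings so that profiles are comparable across users
    return weighted.sum() / user_df["rating"].sum()

profile = toy_user_profile(toy[toy["userId"] == 1])
print(profile)
```

The weighted sums are 4, 4 and 2, and the ratings sum to 10, so the profile is 0.4 for Action, 0.4 for Comedy and 0.2 for Drama. With exactly one genre per row, the entries of a profile always add up to 1.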

### Movie Recommendations: Generate Similarity Scores Using Cosine Similarity

This cell generates movie recommendations for a given user by calculating similarity between their genre preferences and available movies. It uses cosine similarity to measure how similar the user’s preferences are to each movie’s genre profile:

1. Prepare the Movie Feature Set:

    - `movie_features` contains each movie's genre vector and title, with exactly one row per `movieId`.

2. Define the recommend_movies Function:

    - User Vector: For a given user, their genre profile (a vector of genre preferences) is extracted from `user_profiles` and reshaped to match the expected input for the cosine similarity function.

    - Movie Vectors: The genre columns from `movie_features` are used to build a set of vectors representing each movie's genre.

    - Cosine Similarity: The `cosine_similarity` function computes how similar each movie is to the user’s genre profile.

    - Exclusion of Already Rated Movies: The function filters out movies the user has already rated by checking their movie IDs (`already_rated`).

    - Sorting and Ranking: The movies are sorted by their similarity score, and the top n most similar movies are returned as recommendations.

3. Generate Recommendations:

    - The `recommend_movies` function is called with user_id=1 and top_n=5 to recommend the top 5 movies for the user with ID 1.

4. Output:

    - The final recommendations are displayed, showing the movie title and their respective similarity score.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Prepare the movie feature set: one row per movieId with its title and genre columns.
# After explode(), each row carries a single genre, so the genre columns are aggregated
# with max() to recover all genres of a movie (drop_duplicates would keep only the first).
movie_features = df_genres.groupby("movieId").agg(
    {"title": "first", **{col: "max" for col in genre_cols}}
)

# Define a function to recommend movies based on user profile similarity
def recommend_movies(user_id, top_n=5):
    # Get the user's genre preference vector
    user_vector = user_profiles.loc[user_id].values.reshape(1, -1)

    # Get the movie genre vectors
    movie_vectors = movie_features[genre_cols].values

    # Calculate the cosine similarity between the user and all movies
    similarity_scores = cosine_similarity(user_vector, movie_vectors).flatten()

    # Attach the scores to a copy so the shared movie_features frame is not modified
    scored = movie_features.assign(similarity=similarity_scores)

    # Collect the IDs of movies the user has already rated
    already_rated = df_genres.loc[df_genres["userId"] == user_id, "movieId"]

    # Filter out movies the user has already rated
    recommendations = scored[~scored.index.isin(already_rated)]

    # Sort by similarity and return the top N movies
    return recommendations.sort_values("similarity", ascending=False).head(top_n)

recommendations = recommend_movies(user_id=1, top_n=5)
print(recommendations[['title', 'similarity']])
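
For reference, cosine similarity between a user vector $u$ and a movie vector $m$ is $u \cdot m / (\lVert u \rVert \, \lVert m \rVert)$. The sketch below computes it by hand with NumPy on made-up vectors and ranks the movies. It is meant to illustrate the calculation, not to replace `cosine_similarity`.

```python
import numpy as np

# Toy user profile over (Action, Comedy, Drama) and three hypothetical movies
user_vector = np.array([0.4, 0.4, 0.2])
movie_vectors = np.array([
    [1, 1, 0],   # movie 0: Action, Comedy
    [0, 0, 1],   # movie 1: Drama
    [1, 0, 1],   # movie 2: Action, Drama
])

# Dot product of every movie with the user vector
dots = movie_vectors @ user_vector

# Product of the vector lengths; assumes no all-zero vectors,
# which would cause a division by zero
norms = np.linalg.norm(movie_vectors, axis=1) * np.linalg.norm(user_vector)

similarity = dots / norms

# Movie indices ordered from most to least similar
ranking = np.argsort(-similarity)
print(similarity, ranking)
```

Movie 0 matches both of the user's strongest genres and ranks first, movie 2 matches one strong and one weak genre, and movie 1 matches only the weakest genre.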

### Predict Rating: Estimate a User's Rating for a Movie

This block defines a function to predict a user's rating for a movie based on the similarity between the user's preferences and the movie's genre profile:

1. Define the predict_rating Function:

    - User Vector: Extract the genre preferences vector for the user from `user_profiles` and reshape it into a 2D array suitable for the cosine similarity function.

    - Movie Vector: Similarly, extract the genre features for the specific movie from `movie_features` based on `movie_id`.

    - Cosine Similarity: Compute the cosine similarity score between the user's genre profile and the movie's genre profile. This score indicates how closely aligned the movie is with the user’s tastes.

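    - Predicted Rating: Multiply the similarity score by the sum of the entries of the user's genre profile vector (`user_vector.sum()`). Because each profile is normalized by the sum of the user's ratings and every exploded row carries exactly one genre, this sum equals 1, so the returned value is effectively the similarity score on a 0-to-1 scale rather than a value on the original rating scale.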

2. Predict Rating for Specific User and Movie:

    - The function is called for user ID 600 and movie ID 1, predicting the rating that user would give that movie based on their genre preferences.

3. Output:

    - The predicted rating is printed for the specified user and movie.

In [None]:
# Define a function to predict a user's rating for a specific movie
def predict_rating(user_id, movie_id):
    # Get the user's genre preference vector
    user_vector = user_profiles.loc[user_id].values.reshape(1, -1)
    
    # Get the movie's genre vector
    movie_vector = movie_features.loc[movie_id, genre_cols].values.reshape(1, -1)
    
    # Calculate the cosine similarity between the user and the movie
    similarity_score = cosine_similarity(user_vector, movie_vector).flatten()[0]
    
    # Predict the rating as the similarity score multiplied by the user's total preference
    predicted_rating = similarity_score * user_vector.sum()
    
    return predicted_rating

predicted_rating = predict_rating(user_id=600, movie_id=1)
print(f"Predicted Rating: {predicted_rating}")
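
The sketch below checks the scaling step on made-up vectors and then shows one possible way to map a similarity onto a rating range. The helper `rescale_to_rating` and the bounds 0.5 and 5.0 are assumptions chosen for illustration; they are not part of the notebook.

```python
import numpy as np

# Toy normalized user profile and a hypothetical Action/Comedy movie
user_vector = np.array([0.4, 0.4, 0.2])
movie_vector = np.array([1, 1, 0])

# Cosine similarity computed by hand
similarity = float(
    user_vector @ movie_vector
    / (np.linalg.norm(user_vector) * np.linalg.norm(movie_vector))
)

# Mirrors `similarity_score * user_vector.sum()`; the profile sums to 1 here,
# so the product equals the similarity itself
scaled_by_profile_sum = similarity * user_vector.sum()

def rescale_to_rating(similarity, min_rating, max_rating):
    """Hypothetical helper: linearly map a similarity in [0, 1] to a rating range."""
    return min_rating + similarity * (max_rating - min_rating)

# The rating bounds are an assumption; use the bounds of the actual data
estimate = rescale_to_rating(similarity, min_rating=0.5, max_rating=5.0)
print(scaled_by_profile_sum, estimate)
```

A linear rescaling like this only changes the scale of the score. Whether it tracks real ratings would need to be checked against held-out data.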
