# Project: Anime Recommendation Engines

## Delivering Personalized Recommendations Using Collaborative Filtering

**Motivation & Interest:**  
As a data science enthusiast and anime fan, I aimed to explore how recommendation engines drive user engagement on platforms like MyAnimeList. This project bridges technical machine learning concepts with practical, real-world applications.

**Why This Project?**  
- Business Impact: Recommendation systems influence approximately 75% of content consumption on streaming platforms such as Netflix ([Source: Netflix Recommendations: Beyond the 5 stars (Part 1)](https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429)) 
- Technical Expertise: Hands-on experience with core algorithms including similarity metrics and matrix factorization  
- Innovation: Combining user-based and item-based filtering with genre data to enhance recommendation quality  

**Key Questions & Approach:**  
1. How consistent are user ratings, and what behavioral patterns can be identified?  
   - Exploratory data analysis with user rating distributions and entropy measures  
2. How does item-based collaborative filtering perform for anime recommendations?  
   - Implementing item-based filtering for generating personalized recommendations
3. Can integrating genre similarity improve recommendation performance?  
   - Developing a hybrid model blending collaborative filtering with genre-based similarity, visualized via network graphs  


### Loading Packages and Initial Settings

We import the Python libraries needed for data handling, visualization, statistical analysis, and building the recommendation models. We also define a random seed constant so that any randomized steps can be made reproducible.

In [None]:
import os
from kaggle.api.kaggle_api_extended import KaggleApi
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.formula.api import logit
from scipy.stats import entropy
from scipy.sparse.linalg import svds
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import root_mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

sns.set_style("darkgrid")
SEED = 9

### Data Downloading and Importing

In this step, the MyAnimeList datasets are downloaded directly from Kaggle using the Kaggle API. The data is saved in a local directory and automatically unzipped for further use.  
Note: Authentication via the Kaggle API is required beforehand.

In [None]:
# Resolve the project root relative to the current working directory
cwd = Path.cwd()
project_root = cwd.parent

target_dir = project_root / "data"

api = KaggleApi()
api.authenticate()

dataset = "andrewgatchalian/myanimelist-user-ratings"
api.dataset_download_files(dataset, path=target_dir, unzip=True)

print(f"Dataset downloaded to: {target_dir}")

data_path = os.path.join(target_dir, "anime_data", "anime_user_ratings.csv")

df_user_ratings = pd.read_csv(data_path)

### Understanding User Rating Consistency: Patterns and Biases in Anime Preferences

This analysis explores how consistently users rate anime and examines the potential impact of rating behavior on recommendation quality. Certain users might display extreme rating patterns—such as assigning only maximum (10) or minimum (1) scores—which can distort similarity calculations in collaborative filtering models. By visualizing the distribution of ratings per user and calculating statistical measures like rating entropy, we aim to identify these outlier behaviors. Consistency in ratings is expected to improve recommendation accuracy, while inconsistent rating patterns may introduce bias into both user-based and item-based filtering methods. The results of this step will inform whether preprocessing—such as down-weighting or filtering inconsistent raters—can enhance the overall performance of our recommendation engine.

#### EDA: Distribution of User Rating Scores by User Status

We begin the exploratory analysis by visualizing how anime rating scores are distributed across different user statuses (e.g., completed, dropped, watching).  
Histograms and density plots reveal general scoring tendencies, while bar plots highlight average ratings by user status.  
This step helps identify potential biases, such as whether certain statuses are associated with systematically higher or lower ratings.

Key Takeaways:
- **Score Distribution**  
   - Most user ratings are in the high range (7–9).  
   - Very low ratings (<4) are rare.  
- **Score Distribution by User Status**  
   - **Completed** entries dominate across nearly all rating values, especially at round scores (6, 8, 10).  
   - **Watching**, **Dropped**, and **On Hold** appear far less frequently and have flatter distributions.  
- **Average Score by Status**  
   - **Watching** status has the highest average score.  
   - **Dropped** status has the lowest average.  
   - **Completed** and **On Hold** are close to each other.  

**Conclusion:** Users generally give positive ratings, especially when they finish or are currently watching a series. Shows that are dropped receive significantly lower ratings, as expected.

In [None]:
sns.histplot(data = df_user_ratings, x = "user_score", binwidth = 1)
plt.show()

sns.kdeplot(data = df_user_ratings, x = "user_score", hue = "user_status")
plt.show()

sns.barplot(data = df_user_ratings, x = "user_status", y = "user_score")
plt.show()

#### EDA: Rating Entropy

To measure how consistent each user is in their ratings, we calculate the **entropy** of their score distribution:  
- **High entropy** → diverse rating behavior (user gives a wide range of scores)  
- **Low entropy** → consistent rating behavior (user gives mostly the same score)  

We also store each user's total number of ratings and average score for context.  
Finally, we visualize entropy vs. total ratings to identify patterns, such as whether prolific users tend to be more consistent or varied in their ratings.

- Key Results:
    - **Mean entropy**: ~2.73 bits → most users display moderately diverse rating patterns.  
    - **Standard deviation**: ~0.20 bits → rating variety is fairly similar across the user base.  
    - **Distribution**: Most users fall between 2.63 and 2.84 bits, indicating a balanced mix of consistent and varied raters.

**Conclusion**: While most users rate across a range of scores, a small subset either spreads ratings almost evenly over the whole scale (very high entropy) or sticks to a single score (entropy near zero). Both extremes can influence collaborative filtering similarity calculations.


In [None]:
user_rating_counts = df_user_ratings.groupby(['user_id', 'user_score']).size().unstack(fill_value=0)

user_entropy = user_rating_counts.apply(entropy, axis=1, base=2)

user_stats = pd.DataFrame({
    'entropy': user_entropy,
    'total_ratings': df_user_ratings['user_id'].value_counts(),
    'avg_score': df_user_ratings.groupby('user_id')['user_score'].mean()
})

print(user_stats.sort_values(by = "entropy", ascending = False))

user_stats.plot.scatter(x='total_ratings', y='entropy', s=100, alpha=0.7)
plt.title('User Rating Behavior')
plt.xlabel('Number of Ratings')
plt.ylabel('Entropy')
plt.show()
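
To put the entropy values above in context, here is a minimal sketch on made-up rating counts. The three toy users are assumptions for illustration and are not taken from the dataset; a 10-point scale is assumed as well. A user who only ever gives one score has an entropy of 0 bits, and a user who uses all ten scores equally often reaches the maximum of log2(10), roughly 3.32 bits.

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical rating counts per score (1..10) for three toy users
single_score = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 20])  # only ever rates 10
two_scores = np.array([0, 0, 0, 0, 0, 0, 0, 0, 10, 10])   # splits evenly between 9 and 10
all_scores = np.ones(10)                                   # uses every score equally often

# scipy normalizes the counts to probabilities before computing the entropy
h_single = entropy(single_score, base=2)
h_two = entropy(two_scores, base=2)
h_all = entropy(all_scores, base=2)

print(h_single, h_two, h_all)
```

Measured against that ceiling, a mean of about 2.73 bits sits in the upper part of the possible range.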

#### EDA: Correlation Between User Scores and Episodes Watched

To examine whether higher ratings are associated with watching more episodes, we:  
1. Converted user scores into categorical values for clearer grouping in plots.  
2. Created a bar plot showing the average number of episodes watched for each rating score.  
3. Calculated the Pearson correlation coefficient between `user_score` and `user_eps_watched`.

This analysis helps determine if engagement (measured by episodes watched) aligns with rating behavior, which could influence recommendation quality and model assumptions.

**Key Results:** The bar plot below shows the average number of episodes watched for each rating score.  
While there is a general upward trend—higher ratings tend to be associated with more episodes watched—the overall **Pearson correlation coefficient is only `0.03`**, indicating a very weak linear relationship.

**Interpretation:**
- Users who give the highest scores (`9` and `10`) have, on average, watched significantly more episodes.
- However, this trend is not strong enough to imply that episode count is a reliable predictor of a user's rating.
- Ratings may be influenced by other factors such as story quality, personal preferences, or genre fit, rather than just viewing length.

In [None]:
df_user_ratings['user_score_category'] = df_user_ratings['user_score'].astype('category')

sns.barplot(data = df_user_ratings, x = "user_score_category", y = "user_eps_watched")
plt.xlabel('User Rating')
plt.ylabel('Amount of episodes watched')
plt.show()

print(df_user_ratings[["user_score", "user_eps_watched"]].corr())
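
The coefficient printed above is the Pearson correlation, which is the default method of `DataFrame.corr` in pandas. As a sketch on invented numbers (the toy values are assumptions, not rows from the dataset), the same quantity can be computed by hand from mean-centered values:

```python
import numpy as np
import pandas as pd

# Hypothetical scores and episode counts for five ratings
toy = pd.DataFrame({
    "user_score": [6, 7, 8, 9, 10],
    "user_eps_watched": [12, 12, 24, 13, 26],
})

# Pearson correlation as computed by pandas
r_pandas = toy["user_score"].corr(toy["user_eps_watched"])

# Same coefficient from the definition: covariance over the product of spreads
x = toy["user_score"] - toy["user_score"].mean()
y = toy["user_eps_watched"] - toy["user_eps_watched"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```

Because the coefficient only captures linear association across all individual ratings, group averages can show a visible trend while the coefficient itself stays close to zero.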

#### EDA: Most Popular Animes by Number of Ratings

To better understand the dataset, we aggregated ratings per anime title, computing both the **mean rating** and the **number of ratings** each title received.  
To focus on titles with enough user feedback for reliable analysis, we filtered for anime with **more than 100 ratings**.

We then explored the relationship between the **average rating** and the **number of ratings** using a joint plot.

- Results:
    - Popular titles often belong to well-known or highly publicized franchises.
    - The joint plot shows no strong correlation between **average rating** and **popularity** (number of ratings).
    - Highly rated shows are not necessarily the most popular, suggesting a need for recommendation algorithms to balance **quality** and **popularity**.


In [None]:
agg_ratings = df_user_ratings.groupby('title').agg(mean_rating = ('user_score', 'mean'),
                                                number_of_ratings = ('user_score', 'count')).reset_index()

agg_ratings_TOP100 = agg_ratings[agg_ratings['number_of_ratings'] > 100]
agg_ratings_TOP100.info()

print(agg_ratings_TOP100.sort_values(by = 'number_of_ratings', ascending = False).head())

sns.jointplot(x = 'mean_rating', y = 'number_of_ratings', data = agg_ratings_TOP100)

### Question 2: How does item-based collaborative filtering perform for anime recommendations?

The first analysis revealed that user rating behavior varies widely, and that analyzing entropy may help in categorizing users from consistent superfans to sophisticated critics. While these insights help understand data quality, they don’t answer the core practical question: How can we use the ratings by users for generating personalized anime recommendations? Building on the insights from Question 1, this section focuses on item-based collaborative filtering (IBCF) to recommend animes based on similarities between titles. Compared to user-based methods, IBCF provides greater stability and scalability by relying on item similarity rather than user similarity. This approach is common in industry and well-suited to our data.

#### Integrating Personal User Data for Tailored Recommendation Evaluation

To evaluate the recommendations on a real profile, we add the ratings from my own MyAnimeList account to the existing dataset. This allows us to generate and assess recommendations tailored to my viewing history and preferences. By merging my ratings with the broader user data, we can check how well the filtering approach delivers relevant anime suggestions for my profile.

In [None]:
data_path = os.path.join(target_dir, "anime_data", "personal_list.xlsx")

df_personal_account = pd.read_excel(data_path)

df_full = pd.concat([df_user_ratings, df_personal_account], ignore_index= True)

#### Data Preparation: Filtering, Deduplication, and User-Item Matrix Construction

To ensure reliable recommendations, we first filter the dataset to include only popular anime titles with a minimum of 100 user ratings. This step reduces noise and sparsity in the data. Next, we check for and identify any duplicate user-anime rating pairs to maintain data integrity. Finally, we transform the dataset into a user-item matrix with users as rows, anime titles as columns, and their corresponding ratings as values. This matrix format is essential for implementing collaborative filtering algorithms. We also calculate the sparsity of the matrix to understand the density of available rating data, which directly impacts recommendation quality.

In [None]:
min_ratings = 100

anime_stats = df_full.groupby('anime_id').agg(
    rating_count=('user_score', 'count'),
    avg_rating=('user_score', 'mean')
)
popular_anime = anime_stats[anime_stats['rating_count'] >= min_ratings].index

df_full = df_full[df_full['anime_id'].isin(popular_anime)]

duplicates = df_full.duplicated(subset=['user_id', 'title'], keep=False)

print(f"Found {duplicates.sum()} duplicate user-title pairs:")

print(df_full[duplicates].sort_values(['user_id', 'title']))

user_ratings_pivot = df_full.pivot_table(
    index='user_id', 
    columns='title', 
    values='user_score',
    aggfunc='last'
)

print(user_ratings_pivot.info())

number_of_empty = user_ratings_pivot.isnull().values.sum()
total_number = user_ratings_pivot.size
sparsity = number_of_empty/total_number

print(sparsity)
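
The pivot and sparsity steps above can be sketched on a tiny, made-up ratings table. The user IDs, titles, and scores below are hypothetical. The table deliberately contains one duplicate user-title pair to show that `aggfunc='last'` keeps the later row.

```python
import pandas as pd

# Hypothetical long-format ratings; (u1, A) appears twice
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u1"],
    "title": ["A", "B", "A", "C", "A"],
    "user_score": [8, 6, 9, 7, 10],
})

# Users as rows, titles as columns; duplicates resolve to the last rating seen
pivot = toy.pivot_table(
    index="user_id", columns="title", values="user_score", aggfunc="last"
)

# Share of user-title cells that hold no rating
sparsity = pivot.isnull().values.sum() / pivot.size

print(pivot)
print(sparsity)
```

Here 4 of the 9 cells are filled, so the sparsity is 5/9. Real rating matrices are typically far sparser, which is why the popularity filter is applied first.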

#### Normalizing User Ratings: Centering and Handling Missing Values

To prepare the data for similarity calculations, we first examine the shape of the user-item rating matrix. We then compute the average rating for each user to account for individual rating biases. By subtracting each user’s average from their ratings, we center the data around zero, so that positive values mean "better than this user's usual rating" and negative values the opposite. This makes ratings from generous and strict raters comparable in the similarity calculations. Finally, missing ratings are filled with zeros; after centering, a zero corresponds to the user's own average, so unrated titles act as neutral values in the matrix operations.

In [None]:
print(user_ratings_pivot.shape)

avg_ratings = user_ratings_pivot.mean(axis = 1)

print(avg_ratings)

user_ratings_centered = user_ratings_pivot.sub(avg_ratings, axis = 0)

user_ratings_centered.fillna(0, inplace=True)

print(user_ratings_centered)
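
A small sketch of the centering step, using two invented users (the names and scores are assumptions for illustration): one rates generously, the other strictly, yet after centering both show the same preference for title A.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix with missing ratings
toy = pd.DataFrame(
    {"A": [10, 4], "B": [8, np.nan], "C": [np.nan, 2]},
    index=["generous_user", "strict_user"],
)

# Per-user mean over the rated titles only (NaN is skipped)
user_means = toy.mean(axis=1)

# Subtract each user's mean row-wise, then treat unrated titles as neutral (0)
centered = toy.sub(user_means, axis=0).fillna(0)

print(centered)
```

The raw scores 10 and 4 for title A look very different, but both are one point above the respective user's average, which is what the centered matrix records.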

#### Item-Based Collaborative Filtering: Calculating Anime Similarities

In this step, we implement item-based collaborative filtering by computing similarity scores between anime titles. We first transpose the user-item rating matrix to create an item-user matrix, which allows us to analyze how similarly different anime are rated across users. Using cosine similarity, we calculate pairwise similarity scores between all anime titles. The resulting similarity matrix enables us to identify titles most similar to a given anime, such as “Shingeki no Kyojin.” These similarity scores form the foundation for generating personalized recommendations based on items similar to those a user has already rated highly.

**Results:** The cosine similarity scores reveal the anime titles most closely related to “Shingeki no Kyojin” based on user rating patterns. As expected, its direct sequels (“Season 2”, “Season 3 Part 2”, etc.) have the highest similarity scores, reflecting consistent user preferences across the series. Other top matches include thematically or stylistically similar titles such as “Vinland Saga Season 2” and “Kaguya-sama wa Kokurasetai.” These similarity relationships form the basis for recommending anime that align closely with a user's past favorites.

In [None]:
anime_ratings_pivot = user_ratings_centered.T

print(anime_ratings_pivot)

similarities_item = cosine_similarity(anime_ratings_pivot)

cosine_similarity_df_item = pd.DataFrame(similarities_item,
                                   index = anime_ratings_pivot.index,
                                   columns = anime_ratings_pivot.index)

# A bare expression in the middle of a cell is not displayed, so print it
print(cosine_similarity_df_item.head())

cosine_similarity_series_item = cosine_similarity_df_item.loc['Shingeki no Kyojin']
ordered_similarities_item = cosine_similarity_series_item.sort_values(ascending = False)
print(ordered_similarities_item.head(10))
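
To make the similarity measure concrete, the sketch below applies `cosine_similarity` to a tiny item-user matrix and checks one entry against the formula (dot product divided by the product of the vector norms). The titles, users, and centered ratings are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows are titles, columns are users; values are centered ratings (made up)
items = pd.DataFrame(
    [[1.0, -1.0, 2.0],
     [2.0, -2.0, 4.0],
     [-1.0, 1.0, 0.5]],
    index=["A", "A Season 2", "C"],
    columns=["u1", "u2", "u3"],
)

# Pairwise cosine similarity between all titles
sim = pd.DataFrame(cosine_similarity(items), index=items.index, columns=items.index)

# Manual computation for one pair
a = items.loc["A"].to_numpy()
c = items.loc["C"].to_numpy()
manual_a_c = a @ c / (np.linalg.norm(a) * np.linalg.norm(c))

print(sim.round(3))
```

"A" and "A Season 2" are rated in the same direction by every user, so their similarity is 1 regardless of the magnitude of the ratings. That is the same effect that places sequels at the top of the list above.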

### Question 3: Can we improve recommendations by blending genres with collaborative filtering?

While collaborative filtering techniques like user-based and item-based filtering offer valuable insights, they face challenges such as sparse user overlap and unreliable similarity scores for niche anime. To overcome these limitations, hybrid recommendation systems combine collaborative filtering with content-based features, including genres, studios, and themes. This integration helps mitigate cold-start problems for new or less-rated titles, improves the semantic relevance of recommendations by suggesting thematically related anime, and strengthens overall recommendation robustness. In this section, we develop a hybrid model that blends Item-Based Collaborative Filtering with genre similarity to provide more diverse and contextually meaningful anime recommendations.

#### Loading and Preparing Anime Genre Data

To enable genre-based recommendations in our hybrid model, we import and clean the dataset containing anime genres. We ensure unique anime entries by removing duplicates and then merge genre information with the existing anime titles in our main dataset. This merged genre data will serve as the content-based feature layer, complementing the collaborative filtering approach in the following steps.

In [None]:
data_path = os.path.join(target_dir, "anime_data", "anime_genres.csv")

df_anime_genre = pd.read_csv(data_path)

df_anime_genre = df_anime_genre.drop_duplicates(subset = "anime_id", keep = "first")

df_anime_genre = df_anime_genre.merge(
    df_full[["anime_id", "title"]].drop_duplicates(),
    on="anime_id",
    how="left")

df_anime_genre = df_anime_genre.drop(columns=["anime_id"])

df_anime_genre = df_anime_genre.set_index("title")

print(df_anime_genre.info())

#### Building the Content-Based Recommendation Matrix

To capture the similarity between anime titles based on their genres, we calculate the Jaccard similarity using genre data. This metric measures the overlap between genre sets for each anime, providing a content-based perspective complementary to collaborative filtering. The resulting similarity matrix enables us to identify closely related titles, as demonstrated with recommendations for *Shingeki no Kyojin*.

In [None]:
jaccard_distances = pdist(df_anime_genre.values, metric='jaccard')

jaccard_similarity_array = 1 - squareform(jaccard_distances)

# Label rows and columns with the anime titles from the genre table
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index = df_anime_genre.index, columns = df_anime_genre.index)

print(jaccard_similarity_df.head(5))

jaccard_similarity_series = jaccard_similarity_df.loc['Shingeki no Kyojin']

ordered_similarities = jaccard_similarity_series.sort_values(ascending = False)

print(ordered_similarities.head(20))
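
The Jaccard similarity of two titles is the number of shared genres divided by the number of genres that either title has. The sketch below assumes a one-hot genre table (one boolean column per genre), which is an assumption about the layout of the genre data; the titles and genres are invented.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical one-hot genre table
genres = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 0, 1]],
    index=["A", "B", "C"],
    columns=["Action", "Drama", "Fantasy", "Comedy"],
).astype(bool)

# pdist returns condensed Jaccard distances; squareform expands them to a matrix
dist = pdist(genres.values, metric="jaccard")
sim = pd.DataFrame(1 - squareform(dist), index=genres.index, columns=genres.index)

print(sim.round(3))
```

A and B share one genre (Action) out of three distinct genres between them, giving a similarity of 1/3, while A and C share none and score 0.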

#### Aligning Genre-Based and Item-Based Similarity Matrices

To combine collaborative filtering with genre-based similarities, we first identify the common anime titles present in both similarity matrices. We then align these matrices by selecting only the overlapping titles, ensuring consistency for the hybrid recommendation model.

In [None]:
common_titles = list(set(jaccard_similarity_df.index) & set(cosine_similarity_df_item.index))

jaccard_sim_aligned = jaccard_similarity_df.loc[common_titles, common_titles]
cosine_sim_aligned = cosine_similarity_df_item.loc[common_titles, common_titles]

print(jaccard_sim_aligned)
print(cosine_sim_aligned)

#### Normalizing Jaccard and Cosine Similarity Scores

To put the genre-based (Jaccard) and item-based (cosine) similarity scores on a comparable scale before blending them, we apply Min-Max scaling so that the values lie between 0 and 1. This keeps the weights in the hybrid model interpretable, since neither measure dominates simply because of its value range.

In [None]:
scaler = MinMaxScaler()

jaccard_sim_scaled = pd.DataFrame(
    scaler.fit_transform(jaccard_sim_aligned),
    index = jaccard_sim_aligned.index,
    columns = jaccard_sim_aligned.columns
)

cosine_sim_scaled = pd.DataFrame(
    scaler.fit_transform(cosine_sim_aligned),
    index = cosine_sim_aligned.index,
    columns = cosine_sim_aligned.columns
)

print(jaccard_sim_scaled)
print(cosine_sim_scaled)
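
One property of this step is worth keeping in mind: `MinMaxScaler` rescales each column independently, so a symmetric similarity matrix is generally no longer symmetric afterwards. The sketch below shows this on a made-up 3x3 matrix and contrasts it with a single global min-max rescaling. The global variant is an alternative offered here for comparison, not the method used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical symmetric similarity matrix
sim = np.array([
    [1.0, 0.2, 0.6],
    [0.2, 1.0, 0.4],
    [0.6, 0.4, 1.0],
])

# Column-wise scaling, as done by MinMaxScaler
column_scaled = MinMaxScaler().fit_transform(sim)

# Global scaling with one minimum and one maximum for the whole matrix
global_scaled = (sim - sim.min()) / (sim.max() - sim.min())

print(column_scaled.round(3))
print(global_scaled.round(3))
```

With column-wise scaling, the similarity of title 0 to title 2 and of title 2 to title 0 end up with different values. Whether that matters depends on how the matrix is read later, for example by column or by row.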

#### Constructing a Weighted Hybrid Recommendation System

We create a hybrid similarity matrix by combining the normalized item-based (cosine) and genre-based (Jaccard) similarity scores. Item-based similarity receives the higher weight (70%) on the assumption that rating patterns carry the stronger signal, while the genre-based component (30%) adds content context. This weighted blend aims to deliver more accurate and diverse anime recommendations.

In [None]:
hybrid_similarity_df = 0.7 * cosine_sim_scaled + 0.3 * jaccard_sim_scaled

print(hybrid_similarity_df)
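
To illustrate how the weights shape the result, the sketch below blends two invented similarity matrices with different weights. The helper `blend` and all values are hypothetical and only serve to show that the nearest neighbor of a title can change with the weighting.

```python
import pandas as pd

titles = ["A", "B", "C"]

# Hypothetical collaborative-filtering similarities: A is close to B
cf_sim = pd.DataFrame(
    [[1.0, 0.9, 0.2], [0.9, 1.0, 0.3], [0.2, 0.3, 1.0]],
    index=titles, columns=titles,
)

# Hypothetical genre similarities: A is close to C
genre_sim = pd.DataFrame(
    [[1.0, 0.1, 0.8], [0.1, 1.0, 0.4], [0.8, 0.4, 1.0]],
    index=titles, columns=titles,
)

def blend(cf, genre, cf_weight):
    """Weighted average of two similarity matrices with matching labels."""
    return cf_weight * cf + (1 - cf_weight) * genre

# Nearest neighbor of A (excluding A itself) under two weightings
top_cf_heavy = blend(cf_sim, genre_sim, 0.7)["A"].drop("A").idxmax()
top_genre_heavy = blend(cf_sim, genre_sim, 0.3)["A"].drop("A").idxmax()

print(top_cf_heavy, top_genre_heavy)
```

With a 70/30 split the rating-based neighbor B wins; with 30/70 the genre-based neighbor C takes over. The weight is therefore a tuning choice worth revisiting when evaluating the model.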

#### Generating Top-10 Anime Recommendations Using the Hybrid Model

Using the hybrid similarity matrix, we aggregate similarity scores for all animes the user *k3v1n* has watched. By summing these scores and excluding already rated titles, we generate a ranked list of top recommendations tailored to the user's preferences. This approach leverages both collaborative filtering and genre information to provide personalized and diverse anime suggestions.

**Conclusion:** The recommendations show strong alignment with the user’s past preferences while introducing genre-diverse titles to encourage discovery. The hybrid model produced a top recommendation list dominated by highly rated and thematically relevant anime such as Hunter x Hunter, Bleach: Sennen Kessen-hen, and Jujutsu Kaisen.

In [None]:
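# Titles the user has rated
watched_animes = user_ratings_pivot.columns[user_ratings_pivot.loc['k3v1n'].notna()].tolist()

# Only titles that survived the alignment step have a column in the hybrid matrix
watched_in_hybrid = [anime for anime in watched_animes if anime in hybrid_similarity_df.columns]

recommendation_scores = pd.Series(dtype=float)

# Sum the hybrid similarity of every title to each watched title
for anime in watched_in_hybrid:
    similar_animes = hybrid_similarity_df[anime]
    recommendation_scores = recommendation_scores.add(similar_animes, fill_value=0)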

recommendation_scores = recommendation_scores.drop(labels=watched_animes, errors='ignore')

top_recommendations = recommendation_scores.sort_values(ascending=False).head(10)

print("🎯 Top Recommendations for User k3v1n:")
print(top_recommendations)
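
The loop above adds up one similarity column per watched title. As a sketch on a made-up 4x4 hybrid matrix (titles and values are hypothetical), the same aggregation can be written as a column selection followed by a row-wise sum:

```python
import pandas as pd

titles = ["A", "B", "C", "D"]

# Hypothetical hybrid similarity matrix
hybrid = pd.DataFrame(
    [[1.0, 0.6, 0.5, 0.1],
     [0.6, 1.0, 0.4, 0.2],
     [0.5, 0.4, 1.0, 0.3],
     [0.1, 0.2, 0.3, 1.0]],
    index=titles, columns=titles,
)

# Titles the toy user has already rated
watched = ["A", "B"]

# Sum similarities to all watched titles, then drop the watched titles themselves
scores = hybrid[watched].sum(axis=1).drop(labels=watched)
ranked = scores.sort_values(ascending=False)

print(ranked)
```

Title C ranks first because it is similar to both watched titles. Note that a plain sum favors users' long watch lists and titles that are moderately similar to many items; dividing by the number of watched titles would give an average instead, without changing the ranking for a single user.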