<div align="center" style="position: relative; padding: 20px;">

  <h1 style="font-family: 'Comic Sans MS', cursive; font-size: 36px; color: #ff4d6d; 
             text-shadow: 2px 2px 5px black;">
    ✨ Anime Recommender System using Unsupervised Learning ✨
  </h1>

  <img src="https://th.bing.com/th/id/OIP.rcZi51S5pAjcUw2X_B9sxAHaHa?w=174&h=180&c=7&r=0&o=5&pid=1.7" alt="Anime Recommender" width="400" 
       style="border-radius: 15px; box-shadow: 0 0 15px rgba(255, 77, 109, 0.7); margin-top: 10px;">
  
  <p style="font-size: 18px; font-style: italic; color: #333;">
    Discover your next favorite anime with machine learning magic! 🎌✨
  </p>

  <div id="popup" style="position: absolute; top: -40px; right: 10px; 
                         background: #ff4d6d; color: white; padding: 10px 20px; 
                         border-radius: 5px; font-weight: bold; display: none;">
    New Recommendation! 🎉
  </div>

</div>

<script>
  setTimeout(() => {
    document.getElementById('popup').style.display = 'block';
  }, 2000);
</script>



<div align="center" style="border: 2px solid #ff4d6d; padding: 15px; border-radius: 10px; 
                           box-shadow: 0px 0px 10px rgba(255, 77, 109, 0.7); width: 80%;">
  
  <h2 style="font-family: 'Comic Sans MS', cursive; color: #ff4d6d;">
    📜 Table of Contents 🎌
  </h2>

</div>

- 📌 [**Introduction**](#1-introduction)  
  - 🔹 [Problem Statement](#11-problem-statement)  
  - 🔹 [Objectives](#12-objectives)  
- 📌 [**Importing Packages**](#2-importing-packages)  
- 📌 [**Data Loading and Inspection**](#3-data-loading-and-inspection)  
- 📌 [**Data Cleaning**](#4-data-cleaning)  
- 📌 [**Exploratory Data Analysis (EDA)**](#5-exploratory-data-analysis)  
- 📌 [**Data Preprocessing**](#6-data-preprocessing)  
- 📌 [**Model Development**](#7-model-development)  
- 📌 [**Model Evaluation**](#8-model-evaluation)  
- 📌 [**Model Deployment**](#9-model-deployment)  
- 📌 [**Conclusion and Recommendations**](#10-conclusion-and-recommendations)  
- 📌 [**References**](#11-references)  



<div align="center" style="border: 2px solid #ff4d6d; padding: 15px; border-radius: 10px; 
                           box-shadow: 0px 0px 10px rgba(255, 77, 109, 0.7); width: 80%;">
  
  <h2 style="font-family: 'Comic Sans MS', cursive; color: #ff4d6d;">
    🌟 1. Introduction 🌟
  </h2>

</div>

<a id="1-introduction"></a>  

### **1.1 Problem Statement**  
<a id="11-problem-statement"></a>  

With the rapid growth of anime streaming platforms, users often struggle to find new anime that align with their preferences. A **recommender system** can help solve this problem by analyzing user-anime interactions and suggesting relevant content.  

In this project, we will develop an **unsupervised learning-based anime recommender system** using clustering techniques. The goal is to provide **personalized recommendations** based on user behavior and anime attributes.  

### **1.2 Objectives**  
<a id="12-objectives"></a>  

✅ Develop an **unsupervised learning** model to recommend anime based on user preferences.  
✅ Perform **data cleaning, preprocessing, and exploratory data analysis (EDA)** to understand patterns in the dataset.  
✅ Implement **clustering algorithms** (e.g., K-Means, DBSCAN) for grouping similar anime.  
✅ Evaluate the performance of the recommendation system.  
✅ Submit predictions to the **Kaggle Challenge Leaderboard** for evaluation.  
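As a rough illustration of the clustering objective, the sketch below groups a few made-up titles by genre with K-Means. The toy titles, the choice of two clusters, and the one-hot genre encoding are assumptions for illustration only; they are not the features or settings used later in this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy catalogue; titles and genres are made up for illustration
toy_anime = pd.DataFrame({
    "name": ["Title A", "Title B", "Title C", "Title D"],
    "genre": ["Action, Adventure", "Action, Adventure",
              "Romance, Drama", "Romance, Drama"],
})

# One-hot encode the comma-separated genre strings
genre_matrix = toy_anime["genre"].str.get_dummies(sep=", ")

# Group the titles into two clusters based on their genre vectors
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
toy_anime["cluster"] = kmeans.fit_predict(genre_matrix)

print(toy_anime[["name", "cluster"]])
```

Titles that share a cluster label can then be treated as candidates to recommend alongside each other. The cluster numbers themselves are arbitrary.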
 



<div align="center" style="border: 2px solid #ff4d6d; padding: 15px; border-radius: 10px; 
                           box-shadow: 0px 0px 10px rgba(255, 77, 109, 0.7); width: 80%;">
  
  <h2 style="font-family: 'Comic Sans MS', cursive; color: #ff4d6d;">
    📦 2. Importing Packages 🚀
  </h2>

</div>

<a id="2-importing-packages"></a>  

In this section, we import the necessary packages to execute our data science workflow. These packages include essential tools for **data manipulation, visualization, preprocessing, and machine learning**.  




In [None]:


# Core libraries
import numpy as np
import pandas as pd
from datetime import datetime
from math import sqrt
from time import time  # timing execution

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot

# Notebook display helpers
from IPython.display import display, HTML

# Preprocessing, text features, and statistics
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Machine learning & dimensionality reduction
from sklearn import (manifold, decomposition, ensemble, discriminant_analysis,
                     random_projection, preprocessing, datasets)
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split

# Similarity & error metrics
from sklearn.metrics.pairwise import cosine_similarity, sigmoid_kernel
from sklearn.metrics import mean_squared_error

# Collaborative filtering (scikit-surprise)
from surprise import Dataset, Reader, SVD, BaselineOnly, CoClustering
from surprise.model_selection import train_test_split as surprise_train_test_split

# Optional imports, currently unused
# from scipy.sparse import csr_matrix
# import implicit

# Ignore warnings for clean output
import warnings
warnings.filterwarnings('ignore')


<div align="center" style="border: 2px solid #ff4d6d; padding: 15px; border-radius: 10px; 
                           box-shadow: 0px 0px 10px rgba(255, 77, 109, 0.7); width: 80%;">
  
  <h2 style="font-family: 'Comic Sans MS', cursive; color: #ff4d6d;">
    📂 3. Data Loading and Inspection 🧐
  </h2>

</div>

<a id="3-data-loading-and-inspection"></a>  

In this section, we will load the datasets required to build and evaluate our recommender system. By inspecting the structure of the data, we can gain insights into its **quality** and **completeness**. This will guide our **data cleaning process** to ensure consistency and accuracy in our analysis.




###  Data Loading  
The datasets are sourced from a **Kaggle competition** and stored as CSV files. We will read them into **Pandas DataFrames** for easier manipulation.


In [None]:

# Define the paths to the data files
anime_path = "Data/anime.csv"
ratings_path = "Data/train.csv"  
test_path = "Data/test.csv"
submission_path = "Data/submission.csv"  

# Load the datasets
anime_df = pd.read_csv(anime_path)
ratings_df = pd.read_csv(ratings_path)
test_df = pd.read_csv(test_path)
submission_df = pd.read_csv(submission_path)  

# Confirm data loading
print("\033[1;35mDatasets loaded successfully!\033[0m")
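A loading step like the one above can be paired with a quick column check so that a wrong or truncated file fails early. The sketch below is one possible way to do that; `load_and_check` and `EXPECTED_COLUMNS` are hypothetical helpers, the column names are the ones this notebook reads later, and an in-memory CSV stands in for the Kaggle files.

```python
import io
import pandas as pd

# Columns this notebook reads from each file (taken from the cells in this notebook)
EXPECTED_COLUMNS = {
    "anime": {"anime_id", "name", "genre", "type", "episodes", "rating", "members"},
    "ratings": {"user_id", "anime_id", "rating"},
}

def load_and_check(source, expected):
    """Read a CSV and raise a clear error if any expected column is missing."""
    df = pd.read_csv(source)
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return df

# In-memory stand-in for a ratings file so the sketch runs without the Kaggle data
fake_ratings_csv = io.StringIO("user_id,anime_id,rating\n1,20,8\n2,20,9\n")
toy_ratings = load_and_check(fake_ratings_csv, EXPECTED_COLUMNS["ratings"])
print(toy_ratings.shape)
```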

✅To avoid contaminating the original data, we can work on separate copies of the anime dataset (this step is currently commented out):

In [None]:
# Load the "anime.csv" file into two separate DataFrames
# anime_df1 = pd.read_csv(anime_path)
# anime_df2 = pd.read_csv(anime_path)

###  Data Inspection 
Data inspection is the process of understanding and assessing the structure, completeness, and quality of a dataset. This step ensures that the data is ready for further analysis by identifying missing or inconsistent values, understanding data types, and spotting any potential anomalies.
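The checks described above can be gathered into a single per-column summary. The following is a minimal sketch, assuming a hypothetical helper `inspect_frame` and a small made-up DataFrame; it is not part of the original workflow.

```python
import numpy as np
import pandas as pd

def inspect_frame(df):
    """Return one row per column with its dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),
    })

# Made-up data with one missing type and one missing rating
toy = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "type": ["TV", None, "Movie"],
    "rating": [8.5, np.nan, 7.0],
})
report = inspect_frame(toy)
print(report)
```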

✅Preview the anime Dataset:

In [None]:

print(f"Dataset Dimensions: {anime_df.shape}\n")
display(anime_df.head(2).style.set_properties(**{"border": "1.5px solid black"}))

✅Preview the training dataset:

In [None]:

print(f"Dataset Dimensions: {ratings_df.shape}\n")
display(ratings_df.head(2).style.set_properties(**{"border": "1.5px solid black"}))

✅Preview the test dataset:

In [None]:
print(f"Dataset Dimensions: {test_df.shape}\n")
display(test_df.head(2).style.set_properties(**{"border": "1.5px solid black"}))

✅Preview the submission dataset:

In [None]:
print(f"Dataset Dimensions: {submission_df.shape}\n")
display(submission_df.head(2).style.set_properties(**{"border": "1.5px solid black"}))

<div align="center" style="border: 2px solid #ff4d6d; padding: 15px; border-radius: 10px; 
                           box-shadow: 0px 0px 10px rgba(255, 77, 109, 0.7); width: 80%;">
  
  <h2 style="font-family: 'Comic Sans MS', cursive; color: #ff4d6d;">
    📂 4. Data Cleaning  🧹
  </h2>

</div>

<a id="4-data-cleaning"></a>  
Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing values in a dataset. It ensures that our data is accurate, consistent, and ready for analysis or modeling. Poor data quality can lead to misleading insights and inaccurate predictions, making this step crucial.  

In this section, we will:  
✅ Identify and handle **missing values**  
✅ Detect and remove **duplicate records**  
✅ Standardize **inconsistent formatting**  
✅ Address **outliers and anomalies**  
✅ Ensure data types are correctly assigned  

By the end of this process, we will have a clean and structured dataset that is ready for building our recommender system.  

---

###  Check and handle missing values

✅Check missing values:

In [None]:
def display_missing_values(df, name):
    """Display the number of missing values per column as a styled table."""
    missing = df.isnull().sum().to_frame(name="Missing Values")

    print(f"\n{name} Dataset:")
    display(
        missing.style
        .set_properties(**{"border": "1px solid black", "text-align": "center"})
        .set_table_styles([{
            "selector": "th",
            "props": [("background-color", "#f4f4f4"), ("font-weight", "bold")],
        }])
    )


print("🔍 Missing Values in Each Dataset:")
# Named `frames_to_check` so it does not shadow the `datasets` module imported from sklearn
frames_to_check = {
    "Anime": anime_df,
    "Ratings": ratings_df,
    "Test": test_df,
}

for name, df in frames_to_check.items():
    display_missing_values(df, name)



✅Display Info of Datasets:

In [None]:

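# DataFrame.info() prints its report directly and returns None,
# so it is called on its own rather than wrapped in print()
print("\nAnime Dataset Info:")
anime_df.info()
print("\nRatings Dataset Info:")
ratings_df.info()
print("\nTest Dataset Info:")
test_df.info()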

✅Handling Missing Values:

In [None]:
anime_df['rating'] = anime_df['rating'].fillna(anime_df.groupby('type')['rating'].transform('mean'))
anime_df['genre'] = anime_df['genre'].fillna('Unknown')
anime_df['name'] = anime_df['name'].fillna('Unknown')
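The first line of the cell above fills each missing rating with the mean rating of that anime's `type`. A small sketch on made-up data shows how `groupby(...).transform('mean')` lines up with the original rows. Rows whose `type` is itself missing do not belong to any group, so they may stay unfilled until `type` has been imputed.

```python
import numpy as np
import pandas as pd

# Made-up ratings with one gap per type
toy = pd.DataFrame({
    "type": ["TV", "TV", "TV", "Movie", "Movie"],
    "rating": [8.0, np.nan, 6.0, 9.0, np.nan],
})

# transform('mean') returns a Series aligned with the original rows,
# holding each row's group mean (missing values are skipped when averaging)
group_mean = toy.groupby("type")["rating"].transform("mean")
toy["rating_filled"] = toy["rating"].fillna(group_mean)
print(toy)
```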


✅Cleaning Text Data (train_name & train_genre):

In [None]:
# anime_df1['train_genre'] = (
#     anime_df1['genre'] + " " 
# )
# anime_df2['train_name'] = (
#     anime_df2['name'] + " ")

In [None]:
# import re
# def clean_text(text):
#     text = str(text) 
#     text = text.lower() 
#     text = re.sub(r'\d+', '', text)
#     text = re.sub(r'\s+', ' ', text)
#     text = re.sub(r'[^\w\s]', '', text)
#     return text
# anime_df1['cleaned_genre'] = anime_df1['train_genre'].apply(clean_text)
# anime_df2['cleaned_name'] = anime_df2['train_name'].apply(clean_text)


## Tokenization to find modes

In [None]:
# tokeniser = TreebankWordTokenizer()
# anime_df1['genre_tokens'] = anime_df1['cleaned_genre'].apply(tokeniser.tokenize)
# anime_df2['name_tokens'] = anime_df2['cleaned_name'].apply(tokeniser.tokenize)

In [None]:
# from collections import Counter

# # Function to get top N words in a category
# def get_top_genre(type1, n=3):
#     type_tokens = anime_df1[anime_df1['type'] == type1]['genre_tokens'].explode().tolist()
#     word_counts = Counter(type_tokens)
#     return word_counts.most_common(n)

# # Plot top 10 words for each category
# types= anime_df1['type'].unique()
# num_types = len(types)

# plt.figure(figsize=(16, 4 * (num_types // 3 + 1)))  # Adjust height dynamically

# for i, type1 in enumerate(types, 1):
#     plt.subplot((num_types // 3) + 1, 3, i)
#     top_words = get_top_genre(type1)
    
#     if not top_words:
#         continue  # Skip empty categories
    
#     words, counts = zip(*top_words)
#     sns.barplot(x=list(counts), y=list(words), palette='coolwarm')
    
#     plt.title(f'Top 3 Genres in {type1.capitalize()}')
#     plt.xlabel('Frequency')
#     plt.ylabel('Words')

# plt.tight_layout()
# plt.show()


✅The plot above displays the top 3 genres per type. The most frequent genre for each type is used as the mode for that type.

In [None]:
# Find the most frequent anime type (the mode of the 'type' column)
type_counts = anime_df['type'].value_counts()
most_frequent_type = type_counts.idxmax()
most_frequent_count = type_counts.max()
print(f"The most frequent type is '{most_frequent_type}' with {most_frequent_count} occurrences.")

✅TV is the most frequent type, so it is the mode we use to fill missing `type` values.

##  Replacing Nulls

In [None]:
# Replace null or empty genre entries with the most frequent genre (mode) of each type.
# Note: if 'genre' was already filled with 'Unknown' in the earlier cell,
# no nulls remain and this cell leaves the data unchanged.
genre_mode_by_type = {
    'Movie': 'Comedy',
    'TV': 'Comedy',
    'OVA': 'Hentai',
    'Special': 'Comedy',
    'Music': 'Music',
    'ONA': 'Comedy',
}
missing_genre = anime_df['genre'].isna() | (anime_df['genre'] == '')
fill_values = anime_df.loc[missing_genre, 'type'].map(genre_mode_by_type)
# Rows whose type is not in the mapping keep their original value
anime_df.loc[missing_genre, 'genre'] = fill_values.fillna(anime_df.loc[missing_genre, 'genre'])
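The per-type replacement genres above are written out by hand. As a hedged alternative, the same kind of mapping could be derived from the data. The sketch below uses a made-up DataFrame and a hypothetical variable `mode_genre_by_type`; the real dataset may produce different modes.

```python
import pandas as pd

# Made-up titles: three TV entries and two movies
toy = pd.DataFrame({
    "type": ["TV", "TV", "TV", "Movie", "Movie"],
    "genre": ["Comedy, Action", "Comedy", "Drama", "Drama, Romance", "Drama"],
})

# Split the comma-separated genres into one row per (anime, genre) pair
exploded = toy.assign(genre=toy["genre"].str.split(", ")).explode("genre")

# For each type, pick the genre that occurs most often
mode_genre_by_type = (
    exploded.groupby("type")["genre"]
    .agg(lambda genres: genres.value_counts().idxmax())
    .to_dict()
)
print(mode_genre_by_type)
```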

In [None]:
#Replace Null-type entries
anime_df['type'] = anime_df['type'].replace('', 'TV').fillna('TV')

In [None]:
#Replace Null-ratings entries
average_rating_per_type = anime_df.groupby('type')['rating'].transform('mean')
anime_df['rating'] = anime_df['rating'].fillna(average_rating_per_type)



In [None]:
anime_df.isnull().sum()

Yay! No missing values remain.

In [None]:
# anime_df = anime_df.drop_duplicates()


<div align="center" style="border: 2px solid #ff4d6d; padding: 15px; border-radius: 10px; 
                           box-shadow: 0px 0px 10px rgba(255, 77, 109, 0.7); width: 80%;">
  
  <h2 style="font-family: 'Comic Sans MS', cursive; color: #ff4d6d;">
    📊 5. Exploratory Data Analysis (EDA) 🔍
  </h2>

</div>

<a id="5-exploratory-data-analysis"></a>  
Exploratory Data Analysis (EDA) is the process of analyzing and visualizing datasets to uncover patterns, trends, and relationships. It helps us understand the structure of our data, detect anomalies, and generate insights before applying machine learning models.  

In this section, we will:  
✅ Summarize and visualize **key statistics**  
✅ Identify **missing values** and their distribution  
✅ Explore **data distributions** and **correlations**  
✅ Detect and handle **outliers and anomalies**  
✅ Understand the **relationships between features**  

By the end of this process, we will have a deeper understanding of our dataset, enabling us to make informed decisions for building our recommender system.  


✅Summary statistics


In [None]:
# Summary statistics for anime dataset
print("Summary Statistics for Anime Dataset:")
print(anime_df.describe())
print("\n")

# Summary statistics for ratings dataset
print("Summary Statistics for User Ratings (Train) Dataset:")
print(ratings_df.describe())
print("\n")

# Summary statistics for test dataset
print("Summary Statistics for Test Dataset:")
print(test_df.describe())
print("\n")

# Data info for anime dataset
# (info() prints its report directly and returns None, so it is not wrapped in print())
print("Data Info for Anime Dataset:")
anime_df.info()
print("\n")

# Data info for ratings dataset
print("Data Info for User Ratings (Train) Dataset:")
ratings_df.info()
print("\n")

# Data info for test dataset
print("Data Info for Test Dataset:")
test_df.info()
print("\n")

# Unique value counts for each dataset
print("Unique Value Counts in Anime Dataset:")
print(anime_df.nunique())
print("\n")

print("Unique Value Counts in User Ratings (Train) Dataset:")
print(ratings_df.nunique())
print("\n")

print("Unique Value Counts in Test Dataset:")
print(test_df.nunique())


##  User-related: User behavior analysis
This section will delve into how users interact with the anime data, providing insights for building an effective recommender system.

✅Distribution of ratings: this helps us understand user preferences and identify potential biases (e.g., ratings skewed towards high or low values).

In [None]:
# Visualize distribution of user ratings
plt.figure(figsize=(10, 6))
sns.histplot(ratings_df['rating'], bins=20, kde=False, color='skyblue')
plt.title('Distribution of User Ratings')
plt.xlabel('User Rating')
plt.ylabel('Frequency')
plt.grid(True)
plt.tight_layout()
plt.show()


We can see that the ratings are skewed towards the high end of the scale.


✅Number of ratings per user:

In [None]:
# Calculate the number of ratings per user
user_ratings_count = ratings_df.groupby('user_id')['rating'].count()

# Descriptive statistics for the distribution of user ratings
print("Descriptive Statistics for Number of Ratings per User:")
print(user_ratings_count.describe())

# Visualize the distribution of ratings per user
plt.figure(figsize=(10, 6))
sns.histplot(user_ratings_count, bins=50, kde=True, color='skyblue')
plt.title('Distribution of Ratings per User')
plt.xlabel('Number of Ratings')
plt.ylabel('Frequency')
plt.grid(True)
plt.tight_layout()
plt.show()



✅Number of unique users and anime:

In [None]:
num_users = ratings_df['user_id'].nunique()
num_anime = ratings_df['anime_id'].nunique()
print(f'Number of unique users: {num_users}')
print(f'Number of unique anime: {num_anime}')



✅Identify active and inactive users:

In [None]:
# Calculate the number of ratings per user
user_ratings_count = ratings_df.groupby('user_id')['rating'].count()
active_threshold = user_ratings_count.quantile(0.75)  # 75th percentile
inactive_threshold = user_ratings_count.quantile(0.25)  # 25th percentile
active_users = user_ratings_count[user_ratings_count > active_threshold].index
inactive_users = user_ratings_count[user_ratings_count < inactive_threshold].index
print(f"Number of active users: {len(active_users)}")
print(f"Number of inactive users: {len(inactive_users)}")


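We have 69,481 users in total, but only 34,614 of them (17,326 active + 17,288 inactive) fall into one of these two groups. The remaining 34,867 users have rating counts between the 25th and 75th percentiles, so they are classified as neither active nor inactive.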
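A toy example makes the gap concrete: with strict comparisons against the 25th and 75th percentiles, everyone between the two thresholds is left unlabelled. The counts below are made up for illustration.

```python
import pandas as pd

# Hypothetical rating counts for eight users
counts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], name="rating_count")

upper = counts.quantile(0.75)  # 75th percentile
lower = counts.quantile(0.25)  # 25th percentile

n_active = int((counts > upper).sum())
n_inactive = int((counts < lower).sum())
n_middle = len(counts) - n_active - n_inactive

print(f"active={n_active}, inactive={n_inactive}, in between={n_middle}")
```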

### ✅Categorizing Users Based on Activity Levels and Visualizing the Distribution

In [None]:

user_ratings_count = ratings_df.groupby('user_id')['rating'].count()
very_active_threshold = user_ratings_count.quantile(0.9)  # Top 10%
active_threshold = user_ratings_count.quantile(0.75)      # Top 25%
inactive_threshold = user_ratings_count.quantile(0.25)    # Bottom 25%
very_inactive_threshold = user_ratings_count.quantile(0.1)  # Bottom 10% (not used in the labels below)
user_activity = pd.DataFrame({'user_id': user_ratings_count.index})
user_activity['user_type'] = np.where(
    user_ratings_count > very_active_threshold, 'very_active',
    np.where(user_ratings_count > active_threshold, 'active',
             np.where(user_ratings_count < inactive_threshold, 'inactive', 'low_activity'))
)

print("User Type Distribution:")
print(user_activity['user_type'].value_counts())
user_type_order = ['inactive', 'low_activity', 'active', 'very_active']  # least to most active
plt.figure(figsize=(10, 6))
sns.countplot(x='user_type', data=user_activity, order=user_type_order, palette='pastel')
plt.title('Distribution of User Activity Levels')
plt.xlabel('User Type')
plt.ylabel('Count of Users')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()


#### ✅Comparing Number of Ratings and User Type

In [None]:
# Create a new DataFrame with user_id, rating_count, and user_type columns
user_activity = pd.DataFrame({
    'user_id': user_ratings_count.index,
    'rating_count': user_ratings_count.values
})

# Assign user types based on thresholds
user_activity['user_type'] = np.where(
    user_ratings_count > very_active_threshold, 'very_active',
    np.where(user_ratings_count > active_threshold, 'active',
             np.where(user_ratings_count < inactive_threshold, 'inactive', 'low_activity')))
user_type_order = ['inactive', 'low_activity', 'active', 'very_active']  # least to most active
plt.figure(figsize=(10, 6))
sns.boxplot(
    x='user_type',
    y='rating_count',
    data=user_activity,
    order=user_type_order,
    showmeans=True,
    palette='pastel'
)
plt.title('Comparing Number of Ratings and User Type')
plt.xlabel('User Type')
plt.ylabel('Number of Ratings')
plt.grid(True)
plt.tight_layout()
plt.show()


Here we compared the number of ratings submitted by the more active users (active and very_active) with those submitted by the less active users (low_activity and inactive). This helps reveal whether the groups differ systematically in how much they rate, which can be valuable for recommendations.
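The nested `np.where` calls used for the activity labels can also be written with `np.select`, which checks a list of conditions in order. This is a sketch on made-up counts, offered as an alternative formulation rather than the notebook's own code; the thresholds and label names mirror the ones above.

```python
import numpy as np
import pandas as pd

# Hypothetical rating counts for twenty users (1, 2, ..., 20)
counts = pd.Series(range(1, 21))

q25, q75, q90 = counts.quantile([0.25, 0.75, 0.90])

# Conditions are checked in order; the first match wins
conditions = [counts > q90, counts > q75, counts < q25]
labels = ["very_active", "active", "inactive"]
user_type = pd.Series(np.select(conditions, labels, default="low_activity"))

print(user_type.value_counts())
```

Users that match none of the conditions, i.e. those between the 25th and 75th percentiles, receive the default label `low_activity`.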

## Analyzing Anime Content Behavior: Insights into Genres, Ratings, and Types

In [None]:
# Visualize distribution of average anime ratings
plt.figure(figsize=(10, 6))
sns.histplot(data=anime_df, x='rating', bins=20, kde=False, color='salmon')
plt.title('Distribution of Average Anime Ratings', fontsize=14, fontweight='bold')
plt.xlabel('Average Anime Rating', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True)
plt.tight_layout()
plt.show()


✅Distribution of anime types

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(data=anime_df, x='type', order=anime_df['type'].value_counts().index, palette='pastel')
plt.title('Distribution of Anime Types', fontsize=14, fontweight='bold')
plt.xlabel('Anime Type', fontsize=12)
plt.ylabel('Count of Anime', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


 ✅Analysis of genres


In [None]:
genre_count = anime_df['genre'].str.get_dummies(sep=', ').sum()
plt.figure(figsize=(12, 6))
genre_count.sort_values(ascending=False).plot(kind='bar', color='lightblue')
plt.title('Distribution of Anime Genres', fontsize=14, fontweight='bold')
plt.xlabel('Genre', fontsize=12)
plt.ylabel('Count of Anime', fontsize=12)
plt.tight_layout()
plt.show()
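The genre counts above rely on `str.get_dummies`, which turns each comma-separated genre string into 0/1 indicator columns. A minimal sketch on made-up genre strings follows. The separator has to match the data exactly (`', '` here), otherwise genres would be split differently.

```python
import pandas as pd

# Made-up genre strings for three titles
toy_genres = pd.Series(["Action, Comedy", "Comedy", "Action, Drama"])

# Each distinct genre becomes a 0/1 column; a title can set several columns
genre_flags = toy_genres.str.get_dummies(sep=", ")
genre_counts = genre_flags.sum().sort_values(ascending=False)

print(genre_flags)
print(genre_counts)
```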


 ✅Top 10 genres

In [None]:
top_genres = anime_df['genre'].str.get_dummies(sep=', ').sum().sort_values(ascending=False).head(10)
plt.figure(figsize=(12, 6))
top_genres.plot(kind='bar', color='skyblue')
plt.title('Top 10 Anime Genres', fontsize=14, fontweight='bold')
plt.xlabel('Genre', fontsize=12)
plt.ylabel('Count of Anime', fontsize=12)
plt.tight_layout()
plt.show()


 ✅Top 10 anime by number of user ratings

In [None]:

top_anime_by_ratings = ratings_df['anime_id'].value_counts().head(10)
top_anime_titles = anime_df.set_index('anime_id').loc[top_anime_by_ratings.index]['name']
plt.figure(figsize=(12, 6))
sns.barplot(y=top_anime_titles, x=top_anime_by_ratings.values, orient='h', palette='coolwarm')
plt.title('Top 10 Anime by Number of User Ratings', fontsize=14, fontweight='bold')
plt.xlabel('Number of Ratings', fontsize=12)
plt.ylabel('Anime Title', fontsize=12)
plt.tight_layout()
plt.show()


 ✅Top 10 anime by average rating

In [None]:
# Calculate the top 10 anime by average rating from anime_df
top_anime_by_avg_rating = anime_df[['anime_id', 'name', 'rating']].drop_duplicates('anime_id') \
    .sort_values(by='rating', ascending=False).head(10)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(
    x='rating',
    y='name',
    data=top_anime_by_avg_rating,
    orient='h',
    palette='coolwarm'
)

# Titles and labels
plt.title('Top 10 Anime by Average Rating', fontsize=14, fontweight='bold')
plt.xlabel('Average Rating', fontsize=12)
plt.ylabel('Anime Title', fontsize=12)
plt.tight_layout()
plt.show()

# Display the top 10 list for reference
print("Top 10 Anime by Average Rating:")
print(top_anime_by_avg_rating[['name', 'rating']])

✅Top 10 anime by mean user rating

In [None]:
# Compute each anime's mean user rating and keep the top 10
top_rated_unique_animes = ratings_df.groupby('anime_id')['rating'].mean()
subset = pd.DataFrame(anime_df[['anime_id', 'name']])
subset['rating'] = subset['anime_id'].map(top_rated_unique_animes)
subset = subset.nlargest(10, 'rating')

# Visualize top 10 unique user rated animes
plt.figure(figsize=(12, 8))
plt.barh(
    subset['name'],
    subset['rating'],
    color='skyblue'
)
plt.title('Top 10 Unique Anime by User Rating')
plt.xlabel('User Rating')
plt.ylabel('Anime Title')
plt.show()



✅Top 10 genres by average anime rating

In [None]:
# Split the comma-separated genre strings so each anime contributes one row per genre
genre_exploded = anime_df.assign(genre=anime_df['genre'].str.split(', ')).explode('genre')

# Average the anime ratings within each genre and keep the 10 highest
top_genres = (
    genre_exploded.groupby('genre')['rating']
    .mean()
    .nlargest(10)
    .reset_index()
)

# Visualize
plt.figure(figsize=(12, 8))
plt.barh(
    top_genres['genre'],
    top_genres['rating'],
    color='lightblue'
)
plt.title('Top 10 Genres by Average Anime Rating')
plt.xlabel('Average Anime Rating')
plt.ylabel('Genre')
plt.gca().invert_yaxis()  # Highest-rated genre at the top
plt.show()

✅Correlation matrix

In [None]:
merged_df = pd.merge(ratings_df, anime_df[['anime_id', 'rating', 'episodes', 'members']], on='anime_id', how='left')
merged_df = merged_df.rename(columns={
    'rating_x': 'user_rating',  
    'rating_y': 'average_anime_rating' 
})
numeric_cols = ['user_rating', 'average_anime_rating', 'episodes', 'members']
merged_df['episodes'] = pd.to_numeric(merged_df['episodes'], errors='coerce')  
correlation_matrix = merged_df[numeric_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("Correlation Matrix:")
print(correlation_matrix)
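The `rating_x` and `rating_y` names renamed in the cell above come from pandas itself: when both frames in a merge share a non-key column, the default suffixes `_x` (left) and `_y` (right) are appended. A small sketch with made-up frames:

```python
import pandas as pd

# Made-up user ratings and anime metadata that both carry a 'rating' column
toy_ratings = pd.DataFrame({"user_id": [1, 2], "anime_id": [10, 10], "rating": [7, 9]})
toy_anime = pd.DataFrame({"anime_id": [10], "rating": [8.2], "members": [500]})

# Both frames have a 'rating' column, so pandas appends its default
# suffixes: '_x' for the left frame and '_y' for the right frame
merged = pd.merge(toy_ratings, toy_anime, on="anime_id", how="left")
raw_columns = merged.columns.tolist()
print(raw_columns)

merged = merged.rename(columns={"rating_x": "user_rating",
                                "rating_y": "average_anime_rating"})
print(merged)
```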

<div align="center" style="border: 2px solid #ff4d6d; padding: 15px; border-radius: 10px; 
                           box-shadow: 0px 0px 10px rgba(255, 77, 109, 0.7); width: 80%;">
  
  <h2 style="font-family: 'Comic Sans MS', cursive; color: #ff4d6d;">
    📊 6. Data Preprocessing
  </h2>

</div>

<a id="6-data-preprocessing"></a>
 
<b>Data preprocessing</b> is a crucial step in building any machine learning model. It transforms raw data into a clean, structured format suitable for analysis, through tasks such as encoding categorical variables, normalizing numeric features, and extracting features from text. Careful preprocessing keeps the inputs uniform and compatible with the later analyses, which in turn tends to improve model performance.
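For the normalization step, min-max scaling maps each value to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below checks `MinMaxScaler` against that formula on made-up member counts.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical member counts for three titles
members = np.array([[100.0], [550.0], [1000.0]])

toy_scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled = toy_scaler.fit_transform(members)

# Same result computed by hand: (x - min) / (max - min)
manual = (members - members.min()) / (members.max() - members.min())

print(scaled.ravel())
```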

In [None]:

df = merged_df.copy()  
numeric_cols = ['episodes', 'members']
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
scaler = MinMaxScaler(feature_range=(0, 1))
df['scaled_score'] = scaler.fit_transform(df[['average_anime_rating']])

# Function to preprocess features
def preprocess_features(df):
    # Combine relevant features for vectorization
    combined_features = df[["genre", "type", "episodes"]].fillna("").astype(str)
    combined_features = combined_features.apply(lambda x: ' '.join(x), axis=1)
    return combined_features


rec_data = anime_df.drop_duplicates(subset="name", keep="first").reset_index(drop=True)

# Build one text string per anime from its genre, type, and episode count
combined_features = preprocess_features(rec_data)

# Configure the TF-IDF vectorizer for the combined features
tfv = TfidfVectorizer(
    min_df=3,
    max_features=None,
    strip_accents="unicode",
    analyzer="word",
    token_pattern=r"\w{1,}",
    ngram_range=(1, 3),
    stop_words="english",
)
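To see what this vectorizer configuration produces, the sketch below fits it on three made-up feature strings shaped like `combined_features`. It uses `min_df=1` instead of the notebook's `min_df=3`, an assumption needed because the toy corpus is too small for any term to appear in three documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical "genre type episodes" strings, shaped like combined_features
toy_corpus = [
    "Action, Comedy TV 12",
    "Action, Drama Movie 1",
    "Comedy, Romance TV 24",
]

# min_df=1 here (an assumption for the toy data): the notebook's min_df=3
# would discard every term that appears in fewer than three documents
toy_tfv = TfidfVectorizer(
    min_df=1,
    strip_accents="unicode",
    analyzer="word",
    token_pattern=r"\w{1,}",
    ngram_range=(1, 3),
    stop_words="english",
)
toy_matrix = toy_tfv.fit_transform(toy_corpus)

print(toy_matrix.shape)
print(sorted(toy_tfv.vocabulary_)[:8])
```

With `ngram_range=(1, 3)`, the vocabulary holds single tokens as well as two- and three-token sequences such as `action comedy`.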


<div align="center" style="border: 2px solid #ff4d6d; padding: 15px; border-radius: 10px; 
                           box-shadow: 0px 0px 10px rgba(255, 77, 109, 0.7); width: 80%;">
  
  <h2 style="font-family: 'Comic Sans MS', cursive; color: #ff4d6d;">
    🤖 7. Model Development 🚀
  </h2>

</div>

<a id="7-model-development"></a>

### **Model Development**
Model development is the process of designing, training, and optimizing machine learning models to achieve high predictive accuracy. This stage involves selecting the appropriate algorithms, tuning hyperparameters, and evaluating model performance to ensure optimal results.  

#### **In this section, we will:**
✅ **Select and implement machine learning models** suitable for our recommendation system  
✅ **Train the models** using preprocessed data to learn patterns and relationships  
✅ **Optimize hyperparameters** to enhance performance and prevent overfitting  
✅ **Evaluate model performance** using appropriate metrics such as Log Loss, RMSE, or accuracy  
✅ **Compare different models** to identify the best-performing approach  

By the end of this process, we will have a well-trained and validated recommendation model, ready for deployment and further fine-tuning.  


In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel


# Fit and transform the combined features
tfidf_matrix = tfv.fit_transform(combined_features)

# Compute the sigmoid kernel
sig = sigmoid_kernel(tfidf_matrix, tfidf_matrix)

# Create a series with anime indices
indices = pd.Series(rec_data.index, index=rec_data["name"]).drop_duplicates()

# Recommendation Function
def give_rec(title, sig=sig):
    if title not in indices:
        return f"❌ '{title}' not found in the dataset!"
    
    idx = indices[title]
    sig_scores = list(enumerate(sig[idx]))
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)
    sig_scores = sig_scores[1:11]  # Top 10 recommendations (excluding itself)
    
    anime_indices = [i[0] for i in sig_scores]
    
    # Return the result as a formatted DataFrame
    return pd.DataFrame({
        "No": range(1, 11),
        "Anime Name": rec_data["name"].iloc[anime_indices].values,
        "Rating": rec_data["rating"].iloc[anime_indices].values
    }).set_index("No")

# Example usage
recommendations = give_rec("Naruto")
print(recommendations)
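The recommender above ranks titles by their sigmoid-kernel similarity in TF-IDF space. The sketch below reproduces the idea on three made-up titles. The helper `top_neighbors` is hypothetical, and it excludes the query title explicitly instead of assuming it ranks first, which is a small design difference from `give_rec`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Hypothetical titles and feature strings
titles = ["Title A", "Title B", "Title C"]
features = [
    "action adventure shounen",
    "action adventure fantasy",
    "romance drama school",
]

tfidf = TfidfVectorizer().fit_transform(features)

# sigmoid_kernel computes tanh(gamma * <x, y> + coef0) for every pair of rows;
# with the defaults, gamma = 1 / n_features and coef0 = 1
sim = sigmoid_kernel(tfidf, tfidf)

def top_neighbors(idx, k=2):
    """Return the indices of the k most similar rows, excluding idx itself."""
    order = np.argsort(-sim[idx])
    return [int(i) for i in order if i != idx][:k]

print([titles[i] for i in top_neighbors(0)])
```

Title B shares two of its three terms with Title A, so it ranks ahead of Title C, which shares none.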

✅Data Splitting


Here, we first build a simple baseline that predicts each anime's mean rating from `ratings_df` and writes a submission file. We then split `ratings_df` into training (80%) and validation (20%) sets to estimate the baseline's RMSE.

In [None]:
avg_ratings_df = ratings_df.groupby('anime_id')['rating'].mean()
avg_ratings_df

In [None]:
base_model = pd.merge(test_df, avg_ratings_df, on='anime_id', how='left')
base_model

In [None]:
base_model.isnull().sum()

In [None]:
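# Assign the result back; calling fillna(inplace=True) on a selected column may not update the DataFrame
base_model['rating'] = base_model['rating'].fillna(base_model['rating'].mean())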

In [None]:
base_model.isnull().sum()

In [None]:
base_model['ID'] = base_model['user_id'].astype(str) + '_' + base_model['anime_id'].astype(str)
base_model

In [None]:
base_model = base_model[['ID', 'rating']]
base_model


In [None]:
# Safety net: fill any remaining missing ratings with the mean prediction
base_model = base_model.fillna({'rating': base_model['rating'].mean()})



In [None]:
base_model.to_csv('first_model.csv', index=False)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

train, val = train_test_split(ratings_df, test_size=0.2, random_state=42)
avg_ratings_train = train.groupby('anime_id')['rating'].mean()
val_preds = pd.merge(val, avg_ratings_train, on='anime_id', how='left')
val_preds['rating_y'] = val_preds['rating_y'].fillna(train['rating'].mean())
rmse = np.sqrt(mean_squared_error(val_preds['rating_x'], val_preds['rating_y']))

print(f"Baseline RMSE: {rmse}")
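The baseline above predicts each anime's mean training rating and falls back to the global mean for anime not seen in training. The sketch below works through the same computation on made-up ratings, using `Series.map` instead of a merge (an assumption made for brevity), so the RMSE can be checked by hand.

```python
import math
import pandas as pd

# Hypothetical training and validation ratings
toy_train = pd.DataFrame({"anime_id": [1, 1, 2], "rating": [8, 10, 6]})
toy_val = pd.DataFrame({"anime_id": [1, 2, 3], "rating": [9, 8, 7]})

# Per-anime mean from the training data, with the global mean as a fallback
item_mean = toy_train.groupby("anime_id")["rating"].mean()
global_mean = toy_train["rating"].mean()
pred = toy_val["anime_id"].map(item_mean).fillna(global_mean)

# RMSE = sqrt(mean((actual - predicted)^2))
toy_rmse = math.sqrt(((toy_val["rating"] - pred) ** 2).mean())
print(f"Toy baseline RMSE: {toy_rmse:.4f}")
```

Here the predictions are 9, 6, and 8, the squared errors are 0, 4, and 1, and the RMSE is the square root of 5/3.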