# **Dataset Description**

The Anime dataset contains detailed information about various anime titles.

Each record represents a unique anime and includes the following attributes:

# **1. Anime ID**
A unique numerical identifier assigned to each anime entry in the dataset.

# **2. Anime Title**
The name/title of the anime.

Example: Naruto, Attack on Titan, Death Note, etc.

# **3. Type**
The broadcast type or format of the anime, such as:

*   TV
*   OVA (Original Video Animation)
*   Movie
*   Special
*   ONA (Original Net Animation)

This helps in understanding how the anime was released.






# **4. Genre**

A list of genre categories associated with each anime, often separated by commas.
Examples include:

*   Action
*   Comedy
*   Adventure
*   Fantasy
*   Romance
*   Supernatural
*   Drama

These genres help capture the theme and content of the anime.






# **5. Number of Episodes**

Total number of episodes in each anime.

Values may vary widely:



*  1 episode (Movie, OVA)
*   12/24 episodes (Seasonal anime)
*  300+ episodes (Long-running series like Naruto, One Piece)




# **6. Rating (Average User Rating)**

The average rating given by users (typically on a 1–10 scale).

This represents the overall user satisfaction and popularity.

# **7. Number of Members**

The number of community members who have interacted with or rated the anime on the platform.

This indicates the popularity and size of the fanbase.

# **Task 1: Data Preprocessing**

# **1. Load the Dataset**

In [6]:
import pandas as pd
import numpy as np

**Load the Anime dataset**

In [7]:
anime_df = pd.read_csv("/content/anime.csv")

**Display first few rows to understand the dataset**

In [6]:
anime_df.head()

| | anime_id | name | genre | type | episodes | rating | members |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |


**2. Explore Dataset Structure**

In [7]:
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [8]:
anime_df.describe(include='all')

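| | anime_id | name | genre | type | episodes | rating | members |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 12294.0 | 12294 | 12232 | 12269 | 12294.0 | 12064.0 | 12294.0 |
| unique | | 12292 | 3264 | 6 | 187.0 | | |
| top | | Saru Kani Gassen | Hentai | TV | 1.0 | | |
| freq | | 2 | 823 | 3787 | 5677.0 | | |
| mean | 14058.221653 | | | | | 6.473902 | 18071.34 |
| std | 11455.294701 | | | | | 1.026746 | 54820.68 |
| min | 1.0 | | | | | 1.67 | 5.0 |
| 25% | 3484.25 | | | | | 5.88 | 225.0 |
| 50% | 10260.5 | | | | | 6.57 | 1550.0 |
| 75% | 24794.5 | | | | | 7.18 | 9437.0 |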


**3. Check Missing Values**

In [9]:
# Check for missing values in each column
anime_df.isnull().sum()


| Column | Missing values |
| --- | --- |
| anime_id | 0 |
| name | 0 |
| genre | 62 |
| type | 25 |
| episodes | 0 |
| rating | 230 |
| members | 0 |


**4. Handle Missing Values**

In [8]:
anime_df['genre'] = anime_df['genre'].fillna('Unknown')

In [9]:
anime_df['type'] = anime_df['type'].fillna('Unknown')

In [10]:
anime_df['rating'] = anime_df['rating'].fillna(anime_df['rating'].mean())

In [11]:
anime_df['episodes'] = anime_df['episodes'].replace('Unknown', np.nan)
anime_df['episodes'] = anime_df['episodes'].astype(float)
anime_df['episodes'] = anime_df['episodes'].fillna(anime_df['episodes'].median())
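
The same three steps can also be sketched with `pd.to_numeric`, which coerces any non-numeric placeholder to NaN rather than only the literal string 'Unknown'. The toy frame `demo_df` below is hypothetical and merely stands in for `anime_df`; this is an alternative sketch, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for anime_df: 'episodes' arrives as strings
demo_df = pd.DataFrame({"episodes": ["1", "24", "Unknown", "64", "12"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
episodes_num = pd.to_numeric(demo_df["episodes"], errors="coerce")

# Fill the gaps with the median of the values that were observed
demo_df["episodes"] = episodes_num.fillna(episodes_num.median())

print(demo_df["episodes"].tolist())
```

The median is used here for the same reason as above: episode counts are heavily skewed, so the mean would be pulled upward by long-running series.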

**Verify that missing values are handled**

In [14]:
anime_df.isnull().sum()

| Column | Missing values |
| --- | --- |
| anime_id | 0 |
| name | 0 |
| genre | 0 |
| type | 0 |
| episodes | 0 |
| rating | 0 |
| members | 0 |


**5. Dataset Understanding After Cleaning**

In [15]:
print("Unique Types:", anime_df['type'].unique())
print("Number of genres:", anime_df['genre'].nunique())
print("Sample Genres:", anime_df['genre'].unique()[:10])


Unique Types: ['Movie' 'TV' 'OVA' 'Special' 'Music' 'ONA' 'Unknown']
Number of genres: 3265
Sample Genres: ['Drama, Romance, School, Supernatural'
 'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen'
 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen'
 'Sci-Fi, Thriller' 'Comedy, Drama, School, Shounen, Sports'
 'Action, Adventure, Shounen, Super Power'
 'Drama, Military, Sci-Fi, Space'
 'Drama, Fantasy, Romance, Slice of Life, Supernatural'
 'Drama, School, Shounen'
 'Action, Drama, Mecha, Military, Sci-Fi, Super Power']


In [16]:
anime_df['rating'].describe()

| Statistic | rating |
| --- | --- |
| count | 12294.0 |
| mean | 6.473902 |
| std | 1.017096 |
| min | 1.67 |
| 25% | 5.9 |
| 50% | 6.55 |
| 75% | 7.17 |
| max | 10.0 |


# **Task 2: Feature Extraction**

**1. Import Required Libraries for Feature Extraction**

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

**2. Select Features for Similarity**

In [13]:
# Selecting relevant features
selected_features = anime_df[['genre', 'type', 'episodes', 'rating', 'members']]

**3. Convert Categorical Features into Numerical**

**Genres → Multi-label encoded using CountVectorizer**

This converts genres like "Action, Fantasy" into multi-hot vectors.

**Type → One-Hot Encoding**

Example: TV → [1, 0, 0], Movie → [0, 1, 0], OVA → [0, 0, 1]
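
One detail that may be worth checking when encoding genres: with the token pattern `[^,]+` used in this notebook, the space after each comma stays attached to the token, so a genre at the start of a string and the same genre after a comma can end up as two different columns. The sketch below illustrates this on hypothetical toy strings and shows one possible alternative; `split_genres` is an illustrative helper, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical genre strings: the same two genres in a different order
toy_genres = ["Action, Fantasy", "Fantasy, Action"]

# Pattern used in this notebook: every run of non-comma characters is a token,
# so the leading space after a comma is kept as part of the token
raw_vec = CountVectorizer(token_pattern="[^,]+")
raw_vec.fit(toy_genres)
raw_vocab = sorted(raw_vec.vocabulary_)

# Possible alternative: split on commas and strip whitespace before counting
def split_genres(text):
    return [g.strip() for g in text.split(",") if g.strip()]

clean_vec = CountVectorizer(tokenizer=split_genres, token_pattern=None, binary=True)
clean_matrix = clean_vec.fit_transform(toy_genres).toarray()
clean_vocab = sorted(clean_vec.vocabulary_)

print(raw_vocab)     # tokens with and without a leading space
print(clean_vocab)   # one token per genre
print(clean_matrix)  # identical multi-hot rows for both strings
```

Switching tokenizers would change the number of feature columns and therefore the similarity scores, so any recorded outputs would need to be regenerated.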

# **4. Normalize Numerical Features**

We normalize the numerical columns:

*   episodes
*   rating
*   members

using MinMaxScaler, which scales each column to the range 0 to 1.
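
Min-max scaling maps each value x to (x − min) / (max − min). A minimal NumPy sketch of that formula follows; the member counts are hypothetical and chosen only to show how a skewed column behaves.

```python
import numpy as np

# Hypothetical member counts with one very popular title
members = np.array([5.0, 225.0, 1550.0, 9437.0, 1013917.0])

# Min-max scaling: (x - min) / (max - min)
scaled = (members - members.min()) / (members.max() - members.min())

print(np.round(scaled, 4))
```

With a long-tailed column like this, most scaled values sit very close to 0, which is one reason a log transform or a different scaler is sometimes considered.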






# **5. Build a Feature Extraction Pipeline**

In [14]:
genre_vectorizer = CountVectorizer(token_pattern='[^,]+')

In [15]:
type_encoder = OneHotEncoder()

In [16]:
scaler = MinMaxScaler()

In [17]:
preprocessor = ColumnTransformer(
    transformers=[
        ('genre', genre_vectorizer, 'genre'),
        ('type', type_encoder, ['type']),
        ('num', scaler, ['episodes', 'rating', 'members'])
    ]
)

# **6. Fit the Preprocessor and Transform the Dataset**

In [18]:
feature_matrix = preprocessor.fit_transform(selected_features)

In [19]:
feature_matrix = feature_matrix.toarray()

In [25]:
feature_matrix.shape

(12294, 93)

# **Task 3: Cosine Similarity & Recommendation System**

**1. Compute Cosine Similarity Matrix**

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
cosine_sim = cosine_similarity(feature_matrix)

In [28]:
cosine_sim.shape

(12294, 12294)
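
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for three hypothetical feature rows; the column layout in the comment is an assumption made for illustration only.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature rows: [action, comedy, drama, type_TV, rating_scaled]
anime_a = np.array([1.0, 1.0, 0.0, 1.0, 0.8])
anime_b = np.array([1.0, 1.0, 0.0, 1.0, 0.7])
anime_c = np.array([0.0, 0.0, 1.0, 0.0, 0.9])

print(round(cosine(anime_a, anime_b), 3))  # shared genres and type: close to 1
print(round(cosine(anime_a, anime_c), 3))  # only the rating overlaps: much lower
```

A dense 12294 × 12294 matrix of float64 values takes roughly 1.2 GB of memory, which is worth keeping in mind on small machines.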

**2. Create Anime Index Mapping for Easy Lookup**

In [22]:
# drop=True avoids adding the old index as an extra 'index' column
anime_df = anime_df.reset_index(drop=True)

In [23]:
title_to_index = pd.Series(anime_df.index, index=anime_df['name'])

**Note:** Replace 'name' with the column in your dataset representing anime titles (e.g., anime_df['title']).
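
The `describe` output earlier shows 12292 unique names for 12294 rows, so a few titles occur more than once. Looking up such a title in a mapping like `title_to_index` returns several positions instead of one. The sketch below reproduces this on a hypothetical miniature frame and shows one way to keep a single position per title; the `type` values are made up.

```python
import pandas as pd

# Hypothetical miniature of anime_df with one repeated title
demo_df = pd.DataFrame({
    "name": ["Naruto", "Saru Kani Gassen", "Bleach", "Saru Kani Gassen"],
    "type": ["TV", "OVA", "TV", "Movie"],
})

naive_map = pd.Series(demo_df.index, index=demo_df["name"])
print(type(naive_map["Saru Kani Gassen"]))  # a Series, not a single position

# Keep the first occurrence of each title so every lookup yields one integer
unique_map = naive_map[~naive_map.index.duplicated(keep="first")]
print(unique_map["Saru Kani Gassen"])
```

Keeping the first occurrence is an arbitrary choice; disambiguating by `anime_id` would be another option.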

**3. Recommendation Function Using Cosine Similarity**

In [24]:
def recommend_anime(title, similarity_threshold=0.3, top_n=10):
    """Recommends similar anime based on cosine similarity.

    Parameters:
    - title: The target anime title
    - similarity_threshold: Minimum similarity score to consider
    - top_n: Maximum number of recommendations

    Returns:
    - DataFrame containing recommended anime
    """
    if title not in title_to_index:
        return f"Anime '{title}' not found in dataset."

    # Get the index of the anime that matches the title
    idx = title_to_index[title]
    # Duplicate titles return several positions; use the first one
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]

    # Get the pairwise similarity scores of all other anime with that anime.
    # The target is excluded by position because it is not guaranteed to sort first.
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]

    # Sort the anime based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the `top_n` most similar anime
    sim_scores = sim_scores[:top_n]

    # Keep only the indices whose score meets the threshold
    anime_indices = [i[0] for i in sim_scores if i[1] >= similarity_threshold]

    # Return the top `top_n` similar anime
    return anime_df.iloc[anime_indices][['name', 'genre', 'type', 'rating', 'members']]

**4. Example Usage**

In [32]:
recommend_anime("Naruto", similarity_threshold=0.4, top_n=5)


| | name | genre | type | rating | members |
| --- | --- | --- | --- | --- | --- |
| 615 | Naruto: Shippuuden | Action, Comedy, Martial Arts, Shounen, Super P... | TV | 7.94 | 533578 |
| 175 | Katekyo Hitman Reborn! | Action, Comedy, Shounen, Super Power | TV | 8.37 | 258103 |
| 206 | Dragon Ball Z | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 8.32 | 375662 |
| 582 | Bleach | Action, Comedy, Shounen, Super Power, Supernat... | TV | 7.95 | 624055 |
| 588 | Dragon Ball Kai | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 7.95 | 116832 |


**Explanation:**



*   similarity_threshold controls how strict the similarity should be.
*   High threshold → fewer but more similar anime
*   Low threshold → larger list of recommendations
*   top_n controls the maximum size of the returned list.



**5. Experiment with Different Similarity Thresholds**

In [33]:
print("High similarity (0.6):")
display(recommend_anime("Naruto", similarity_threshold=0.6, top_n=10))

High similarity (0.6):


| | name | genre | type | rating | members |
| --- | --- | --- | --- | --- | --- |
| 615 | Naruto: Shippuuden | Action, Comedy, Martial Arts, Shounen, Super P... | TV | 7.94 | 533578 |
| 175 | Katekyo Hitman Reborn! | Action, Comedy, Shounen, Super Power | TV | 8.37 | 258103 |
| 206 | Dragon Ball Z | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 8.32 | 375662 |
| 582 | Bleach | Action, Comedy, Shounen, Super Power, Supernat... | TV | 7.95 | 624055 |
| 588 | Dragon Ball Kai | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 7.95 | 116832 |
| 1930 | Dragon Ball Super | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 7.4 | 111443 |
| 2615 | Medaka Box | Action, Comedy, Ecchi, Martial Arts, School, S... | TV | 7.21 | 110042 |
| 3038 | Tenjou Tenge | Action, Comedy, Ecchi, Martial Arts, School, S... | TV | 7.1 | 103449 |
| 1209 | Medaka Box Abnormal | Action, Comedy, Ecchi, Martial Arts, School, S... | TV | 7.63 | 66972 |
| 515 | Dragon Ball Kai (2014) | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 8.01 | 42666 |


In [34]:
print("Medium similarity (0.4):")
display(recommend_anime("Naruto", similarity_threshold=0.4, top_n=10))

Medium similarity (0.4):


| | name | genre | type | rating | members |
| --- | --- | --- | --- | --- | --- |
| 615 | Naruto: Shippuuden | Action, Comedy, Martial Arts, Shounen, Super P... | TV | 7.94 | 533578 |
| 175 | Katekyo Hitman Reborn! | Action, Comedy, Shounen, Super Power | TV | 8.37 | 258103 |
| 206 | Dragon Ball Z | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 8.32 | 375662 |
| 582 | Bleach | Action, Comedy, Shounen, Super Power, Supernat... | TV | 7.95 | 624055 |
| 588 | Dragon Ball Kai | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 7.95 | 116832 |
| 1930 | Dragon Ball Super | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 7.4 | 111443 |
| 2615 | Medaka Box | Action, Comedy, Ecchi, Martial Arts, School, S... | TV | 7.21 | 110042 |
| 3038 | Tenjou Tenge | Action, Comedy, Ecchi, Martial Arts, School, S... | TV | 7.1 | 103449 |
| 1209 | Medaka Box Abnormal | Action, Comedy, Ecchi, Martial Arts, School, S... | TV | 7.63 | 66972 |
| 515 | Dragon Ball Kai (2014) | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 8.01 | 42666 |


In [35]:
print("Low similarity (0.2):")
display(recommend_anime("Naruto", similarity_threshold=0.2, top_n=10))

Low similarity (0.2):


| | name | genre | type | rating | members |
| --- | --- | --- | --- | --- | --- |
| 615 | Naruto: Shippuuden | Action, Comedy, Martial Arts, Shounen, Super P... | TV | 7.94 | 533578 |
| 175 | Katekyo Hitman Reborn! | Action, Comedy, Shounen, Super Power | TV | 8.37 | 258103 |
| 206 | Dragon Ball Z | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 8.32 | 375662 |
| 582 | Bleach | Action, Comedy, Shounen, Super Power, Supernat... | TV | 7.95 | 624055 |
| 588 | Dragon Ball Kai | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 7.95 | 116832 |
| 1930 | Dragon Ball Super | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 7.4 | 111443 |
| 2615 | Medaka Box | Action, Comedy, Ecchi, Martial Arts, School, S... | TV | 7.21 | 110042 |
| 3038 | Tenjou Tenge | Action, Comedy, Ecchi, Martial Arts, School, S... | TV | 7.1 | 103449 |
| 1209 | Medaka Box Abnormal | Action, Comedy, Ecchi, Martial Arts, School, S... | TV | 7.63 | 66972 |
| 515 | Dragon Ball Kai (2014) | Action, Adventure, Comedy, Fantasy, Martial Ar... | TV | 8.01 | 42666 |
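
The three runs above return the same ten rows, which implies that all ten nearest neighbours of "Naruto" score at least 0.6, so none of the tested thresholds removes anything. The sketch below illustrates the interaction between the threshold and top_n on hypothetical scores; `filter_top` is an illustrative helper that mirrors the slice-then-filter logic of `recommend_anime`, not the function itself.

```python
def filter_top(scores, threshold, top_n):
    """Keep the top_n highest scores, then drop those below the threshold."""
    ranked = sorted(scores, reverse=True)[:top_n]
    return [s for s in ranked if s >= threshold]

# Hypothetical similarity scores for one target title
scores = [0.97, 0.95, 0.93, 0.90, 0.88, 0.71, 0.65, 0.40, 0.25]

for t in (0.2, 0.4, 0.6, 0.92):
    print(t, filter_top(scores, threshold=t, top_n=5))
```

In this toy case the threshold only starts to matter once it exceeds the lowest score inside the top_n slice, so thresholds closer to the observed scores would be needed to see a difference.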


# **Task 4: Evaluation of Recommendation System**

**Leave-One-Out Evaluation (LOO)**

**For each anime:**


1.  Treat the anime as the “target”.
2.  Hide one feature or rating.
3.  Generate recommendations.
4.  Check whether similar anime (same genre or high rating correlation) appear in recommendations.

However, this dataset has no explicit user–item interactions, so the evaluation uses a proxy for relevance instead:



**Evaluation Approach Used: Genre Overlap Method**

We consider the true relevant items for an anime as those that share at least one genre with it.



*   True Positives (TP): Recommended anime that share at least one genre with the target
*   False Positives (FP): Recommended anime that do not share any genre with the target
*   False Negatives (FN): Anime that share a genre with the target but were not recommended






**Then we compute:**

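**Precision**

$$\text{Precision} = \frac{TP}{TP + FP}$$

**Recall**

$$\text{Recall} = \frac{TP}{TP + FN}$$

**F1-Score**

$$F1 = 2 \times \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$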
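
A small worked example may help with reading these metrics under the genre-overlap definition. The counts below are hypothetical: ten recommendations that all share a genre with the target, in a catalogue where 2,000 titles share at least one genre with it. `precision_recall_f1` is an illustrative helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts: 10 relevant recommendations out of 2,000 relevant titles
tp, fp = 10, 0
fn = 2000 - tp

p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Because at most top_n items are ever recommended, recall cannot exceed top_n divided by the number of relevant titles, so it stays small whenever the relevant pool is large.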

# **1. Train-Test Split**
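We split the dataset into 80% train and 20% test.
For a content-based system, the split is used as follows:

*   Compute similarity from item features (in the code below, the similarity matrix is built from the full dataset, not only the train set)
*   Run recommendations for the anime in the test set
*   Compare the genres of each recommendation with the genres of the test anime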






In [25]:
from sklearn.model_selection import train_test_split

In [26]:
train_df, test_df = train_test_split(anime_df, test_size=0.2, random_state=42)

In [27]:
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# **2. Evaluation Function for Precision, Recall, F1-Score**

In [2]:
def evaluate_recommendation_system(test_data, top_n=10, threshold=0.3):
    precision_list = []
    recall_list = []
    f1_list = []

    for title in test_data['name']:
        recommendations = recommend_anime(title, similarity_threshold=threshold, top_n=top_n)

        if isinstance(recommendations, str) or recommendations.empty:
            continue

        target_anime_genres_str = anime_df[anime_df['name'] == title]['genre'].iloc[0]
        target_anime_genres = set(g.strip() for g in target_anime_genres_str.split(',') if g.strip())

        tp = 0
        fp = 0
        for _, rec_anime in recommendations.iterrows():
            rec_genres_str = rec_anime['genre']
            rec_genres = set(g.strip() for g in rec_genres_str.split(',') if g.strip())

            if target_anime_genres.intersection(rec_genres):
                tp += 1
            else:
                fp += 1

        # Calculate all potential relevant items (share at least one genre)
        # This can be computationally expensive if done for every title in test_data.
        # For now, we'll iterate through all anime to find 'relevant' ones not recommended.
        all_relevant_items_count = 0
        for _, other_anime in anime_df.iterrows():
            if other_anime['name'] == title: # Don't compare with itself
                continue
            other_genres_str = other_anime['genre']
            other_genres = set(g.strip() for g in other_genres_str.split(',') if g.strip())
            if target_anime_genres.intersection(other_genres):
                all_relevant_items_count += 1

        # False Negatives (FN): Relevant items that were not recommended
        # This is simplified. In a real system, FN would require knowing all truly relevant items.
        # Here, we assume 'all_relevant_items_count' represents the total pool of relevant items.
        fn = max(0, all_relevant_items_count - tp)

        # Precision (0 when nothing was recommended)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        precision_list.append(precision)

        # Recall (0 when there are no relevant items)
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        recall_list.append(recall)

        # F1-Score (0 when precision + recall is 0)
        if (precision + recall) > 0:
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            f1 = 0.0
        f1_list.append(f1)

    return {
        'Precision': np.mean(precision_list) if precision_list else 0.0,
        'Recall': np.mean(recall_list) if recall_list else 0.0,
        'F1-Score': np.mean(f1_list) if f1_list else 0.0
    }
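
The inner loop over `anime_df.iterrows()` re-parses every genre string for every test title, which can be slow on the full dataset. One possible speed-up is to parse the genre strings once and reuse the resulting sets. The sketch below shows the idea on a hypothetical four-row catalogue; `genre_sets` and `count_relevant` are illustrative names, and the target is excluded by position rather than by name.

```python
import pandas as pd

# Hypothetical miniature catalogue
demo_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["Action, Comedy", "Comedy, Drama", "Drama", "Action"],
})

# Parse each genre string once, outside the evaluation loop
genre_sets = [
    {g.strip() for g in text.split(",") if g.strip()}
    for text in demo_df["genre"]
]

def count_relevant(target_pos):
    """Number of other titles sharing at least one genre with the target."""
    target = genre_sets[target_pos]
    return sum(
        1 for pos, other in enumerate(genre_sets)
        if pos != target_pos and target & other
    )

print([count_relevant(i) for i in range(len(demo_df))])
```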

# **3. Run Evaluation**

In [28]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

# --- Task 1: Data Preprocessing ---
# 1. Load the Dataset
anime_df = pd.read_csv("/content/anime.csv")

# 4. Handle Missing Values
anime_df['genre'] = anime_df['genre'].fillna('Unknown')
anime_df['type'] = anime_df['type'].fillna('Unknown')
anime_df['rating'] = anime_df['rating'].fillna(anime_df['rating'].mean())
anime_df['episodes'] = anime_df['episodes'].replace('Unknown', np.nan)
anime_df['episodes'] = anime_df['episodes'].astype(float)
anime_df['episodes'] = anime_df['episodes'].fillna(anime_df['episodes'].median())

# --- Task 2: Feature Extraction ---
# 2. Select Features for Similarity
selected_features = anime_df[['genre', 'type', 'episodes', 'rating', 'members']]

# 5. Build a Feature Extraction Pipeline
genre_vectorizer = CountVectorizer(token_pattern='[^,]+')
type_encoder = OneHotEncoder()
scaler = MinMaxScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('genre', genre_vectorizer, 'genre'),
        ('type', type_encoder, ['type']),
        ('num', scaler, ['episodes', 'rating', 'members'])
    ]
)

# 6. Fit the Preprocessor and Transform the Dataset
feature_matrix = preprocessor.fit_transform(selected_features)
feature_matrix = feature_matrix.toarray()

# --- Task 3: Cosine Similarity & Recommendation System ---
# 1. Compute Cosine Similarity Matrix
cosine_sim = cosine_similarity(feature_matrix)

# 2. Create Anime Index Mapping for Easy Lookup
anime_df_reset = anime_df.reset_index() # Create a new df for mapping to avoid issues with original index
title_to_index = pd.Series(anime_df_reset.index, index=anime_df_reset['name'])

# 3. Recommendation Function Using Cosine Similarity
def recommend_anime(title, similarity_threshold=0.3, top_n=10):
    """Recommends similar anime based on cosine similarity.

    Parameters:
    - title: The target anime title
    - similarity_threshold: Minimum similarity score to consider
    - top_n: Maximum number of recommendations

    Returns:
    - DataFrame containing recommended anime
    """
    if title not in title_to_index:
        return f"Anime '{title}' not found in dataset."

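    idx = title_to_index[title]
    # Duplicate titles return several positions; use the first one
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    # Exclude the target by position; it is not guaranteed to sort first
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[:top_n]

    anime_indices = [i[0] for i in sim_scores if i[1] >= similarity_threshold]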
    return anime_df_reset.iloc[anime_indices][['name', 'genre', 'type', 'rating', 'members']]

# --- Task 4: Evaluation of Recommendation System ---
# 1. Train-Test Split
train_df, test_df = train_test_split(anime_df_reset, test_size=0.2, random_state=42)
# Reset index for consistency if needed, but not strictly necessary for this evaluation approach
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# 2. Evaluation Function for Precision, Recall, F1-Score
def evaluate_recommendation_system(test_data, top_n=10, threshold=0.3):
    precision_list = []
    recall_list = []
    f1_list = []

    for title in test_data['name']:
        recommendations = recommend_anime(title, similarity_threshold=threshold, top_n=top_n)

        if isinstance(recommendations, str) or recommendations.empty:
            # If anime not found or no recommendations, skip this title for evaluation
            continue

        target_anime_genres_str = anime_df_reset[anime_df_reset['name'] == title]['genre'].iloc[0]
        target_anime_genres = set(g.strip() for g in target_anime_genres_str.split(',') if g.strip())

        tp = 0
        fp = 0
        for _, rec_anime in recommendations.iterrows():
            rec_genres_str = rec_anime['genre']
            rec_genres = set(g.strip() for g in rec_genres_str.split(',') if g.strip())

            if target_anime_genres.intersection(rec_genres):
                tp += 1
            else:
                fp += 1

        # For FN, we assume 'all_relevant_items_count' represents the total pool of relevant items
        # This is an approximation based on genre overlap within the entire dataset.
        all_relevant_items_count = 0
        for _, other_anime in anime_df_reset.iterrows():
            if other_anime['name'] == title: # Don't compare with itself
                continue
            other_genres_str = other_anime['genre']
            other_genres = set(g.strip() for g in other_genres_str.split(',') if g.strip())
            if target_anime_genres.intersection(other_genres):
                all_relevant_items_count += 1

        fn = max(0, all_relevant_items_count - tp)

        # Assign each metric in both branches so a stale value is never reused
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        precision_list.append(precision)

        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        recall_list.append(recall)

        if (precision + recall) > 0:
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            f1 = 0.0
        f1_list.append(f1)

    return {
        'Precision': np.mean(precision_list) if precision_list else 0.0,
        'Recall': np.mean(recall_list) if recall_list else 0.0,
        'F1-Score': np.mean(f1_list) if f1_list else 0.0
    }

# 3. Run Evaluation
evaluation_results = evaluate_recommendation_system(test_df, top_n=10, threshold=0.3)

print("Evaluation Results:")
print(f"Precision: {evaluation_results['Precision']:.4f}")
print(f"Recall   : {evaluation_results['Recall']:.4f}")
print(f"F1-Score : {evaluation_results['F1-Score']:.4f}")


Evaluation Results:
Precision: 0.9937
Recall   : 0.0047
F1-Score : 0.0090


# **4. Interpretation of Results**

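**High Precision, Low Recall**

This is the pattern observed above (precision 0.9937, recall 0.0047).

*   Recommendations are accurate
*   But the model misses many relevant items

Solution: lower the similarity threshold or increase top_n. Note that under the genre-overlap definition, recall is capped at top_n divided by the number of titles sharing a genre with the target, so it stays small for any short recommendation list.

**Low Precision, High Recall**

Indicates:

*   Many relevant anime included
*   But also many unrelated recommendations

Solution: increase the threshold or reduce top_n.

**Balanced F1-score**

Indicates a good tradeoff.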




# **5. Areas for Improvement**

**Based on evaluation metrics, the recommendation system can be enhanced by:**

**1. Using TF-IDF instead of CountVectorizer for genre text**
This reduces the weight of very common genres (e.g., “Action”).

**2. Including more features**

*   Synopsis/Description
*   Studio
*   Popularity Rank

**3. Hybrid Approaches**

Combine:

*   Content-based similarity
*   Collaborative filtering (if user-item ratings are available)

**4. Better Normalization**

*   Try StandardScaler or L2 normalization.
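
The first suggestion above (TF-IDF for genres) can be sketched as follows. The genre strings are hypothetical, with "Action" appearing in every title, and `split_genres` is an illustrative helper rather than part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def split_genres(text):
    """Split a comma-separated genre string into stripped genre names."""
    return [g.strip() for g in text.split(",") if g.strip()]

# Hypothetical genre strings: "Action" appears in every title
toy_genres = ["Action, Comedy", "Action, Drama", "Action, Mecha"]

tfidf = TfidfVectorizer(tokenizer=split_genres, token_pattern=None)
weights = tfidf.fit_transform(toy_genres).toarray()
vocab = tfidf.vocabulary_  # maps lower-cased genre -> column position

first_row = weights[0]
print("action:", round(first_row[vocab["action"]], 3))
print("comedy:", round(first_row[vocab["comedy"]], 3))
```

In the first row the ubiquitous genre receives a lower weight than the rarer one, which is the effect described in suggestion 1. Whether this improves recommendations on the real data would still need to be measured.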














# **1. Difference Between User-Based and Item-Based Collaborative Filtering**

**User-Based Collaborative Filtering (UBCF)**
This method recommends items to a user based on the preferences of similar users.

**How it works:**


*   Find users who have similar tastes/preferences based on their past interactions (ratings, likes, etc.).
*   Recommend items that similar users liked but the target user has not interacted with.

**Example**:



*   If User A and User B both like “Naruto” and “One Piece”,
and User B also likes “Bleach”,
→ Recommend “Bleach” to User A.

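**Pros:**

*   Easy to implement.
*   Works well when many users have similar rating patterns.

**Cons:**

*   Does not scale well (a huge user base makes similarity computation slow).
*   User preferences change over time, so user similarities go stale; new users with no history also cannot be matched (cold start problem).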
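
The user-based idea can be sketched on a tiny rating matrix. All ratings below are hypothetical, and treating 0 as "not rated" inside the cosine computation is a simplification made for brevity.

```python
import numpy as np

# Hypothetical ratings (rows = users A, B, C; columns = titles; 0 = not rated)
titles = ["Naruto", "One Piece", "Bleach", "Death Note"]
ratings = np.array([
    [9.0, 8.0, 0.0, 0.0],   # User A (target)
    [9.0, 9.0, 8.0, 0.0],   # User B
    [2.0, 0.0, 3.0, 9.0],   # User C
])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = 0
others = [u for u in range(len(ratings)) if u != target]

# Find the user whose rating vector is most similar to the target's
sims = {u: cosine(ratings[target], ratings[u]) for u in others}
neighbor = max(sims, key=sims.get)

# Suggest titles the nearest neighbour rated but the target has not
suggestions = [
    titles[i] for i in range(len(titles))
    if ratings[target, i] == 0 and ratings[neighbor, i] > 0
]
print(neighbor, suggestions)
```

This mirrors the example above: User B is closest to User A, so the title only User B has rated is suggested to User A.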

**Item-Based Collaborative Filtering (IBCF)**
This method recommends items that are similar to what the user already likes.

**How it works:**

*   Compute similarity between items (e.g., movies, anime).
*   Recommend items that are similar to the ones the user has interacted with.

**Example:**

*   If anime “Naruto” is similar to “One Piece” and “Bleach”,
and the user watched “Naruto”,
→ Recommend “One Piece” or “Bleach”.

**Pros:**

*   More stable (items don’t change as often as users).
*   Scales better for large datasets.

**Cons:**

*   Requires enough interaction data to compute item similarities.
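
The item-based variant compares columns of the rating matrix instead of rows. The sketch below uses hypothetical ratings and, as a simplification, again treats 0 as "not rated" inside the cosine computation.

```python
import numpy as np

titles = ["Naruto", "One Piece", "Bleach", "Death Note"]

# Hypothetical ratings (rows = users, columns = titles; 0 = not rated)
ratings = np.array([
    [9.0, 8.0, 8.0, 0.0],
    [8.0, 9.0, 7.0, 2.0],
    [2.0, 0.0, 3.0, 9.0],
    [9.0, 9.0, 0.0, 3.0],
])

# Item-item cosine similarity: compare columns instead of rows
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

# Rank the other titles by similarity to the one the user watched
watched = titles.index("Naruto")
ranked = np.argsort(-item_sim[watched])
similar = [titles[i] for i in ranked if i != watched]
print(similar)
```

Because the item-item matrix depends only on the catalogue, it can be computed once and reused for every user, which is part of why this variant tends to scale better.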

**Key Difference Summary (Easy to Remember):**


| Feature     | User-Based CF                 | Item-Based CF                   |
| ----------- | ----------------------------- | ------------------------------- |
| Focus       | Similar users                 | Similar items                   |
| Basis       | User behavior                 | Item similarity                 |
| Stability   | Less stable (users change)    | More stable (items remain same) |
| Scalability | Poor for large user bases     | Better scalability              |
| Example     | “People like you also liked…” | “Similar to what you watched…”  |


















# **2. What is Collaborative Filtering, and How Does It Work?**

**Definition:**

Collaborative Filtering (CF) is a recommendation technique that makes predictions about what a user will like based on the preferences of many other users.

**How Collaborative Filtering Works**

**There are two main steps:**

**Step 1: Identify Similarity**

Collaborative filtering finds similarity in:


*  User behavior (ratings, views)
*  Item interactions

**This can be done using:**


*   Cosine similarity
*   Pearson correlation
*   Jaccard similarity
*   Matrix factorization (SVD, ALS), which learns latent factors for users and items rather than computing a similarity directly
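
Two of the measures listed above can be sketched in a few lines. The ratings and watch lists are hypothetical; Pearson correlation is computed as the cosine similarity of mean-centred rating vectors, and Jaccard similarity as the overlap between two sets of titles.

```python
import numpy as np

# Hypothetical ratings by two users for the same five titles
user_a = np.array([9.0, 8.0, 7.0, 3.0, 2.0])
user_b = np.array([8.0, 9.0, 6.0, 4.0, 1.0])

# Pearson correlation: cosine similarity of the mean-centred vectors
a_c = user_a - user_a.mean()
b_c = user_b - user_b.mean()
pearson = float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))

# Jaccard similarity: shared titles divided by all titles either user watched
watched_a = {"Naruto", "One Piece", "Bleach"}
watched_b = {"Naruto", "Bleach", "Death Note"}
jaccard = len(watched_a & watched_b) / len(watched_a | watched_b)

print(round(pearson, 3), jaccard)
```

Pearson is often preferred for explicit ratings because centring removes each user's rating bias, while Jaccard suits implicit data such as watched or not watched.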

**Step 2: Make Predictions**

Once similarity is identified:

* Recommend items liked by similar users → User-Based CF
* Recommend items similar to past items → Item-Based CF  



# **Key Advantages of Collaborative Filtering**


*  No need for item metadata (genres, descriptions)
*  Learns directly from user behavior
*   Works well when many users interact with many items

**Limitations**


*  Cold start problem (new users or new items)

*  Sparsity problem (too few interactions)
*   Scalability in user-based CF for very large datasets









