Step 1: Import Libraries

In [30]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity


Step 2: Load and Explore the Dataset

In [33]:
# Load the anime dataset
anime_df = pd.read_csv("anime.csv")  # Replace with your actual file path

# Check the basic info
print(anime_df.info())

# Preview the data
print(anime_df.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None
   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Advent

Step 3: Data Preprocessing

In [36]:
# Copy original to preserve
anime_df_cleaned = anime_df.copy()

In [38]:
# Fill missing genres with "Unknown"
anime_df_cleaned["genre"] = anime_df_cleaned["genre"].fillna("Unknown")

In [40]:
# Drop rows with missing ratings
anime_df_cleaned = anime_df_cleaned.dropna(subset=["rating"])

In [42]:
# Convert 'episodes' to numeric (errors="coerce" turns non-numeric values into NaN)
anime_df_cleaned["episodes"] = pd.to_numeric(anime_df_cleaned["episodes"], errors='coerce')

In [44]:
# Fill missing episodes with the median
anime_df_cleaned["episodes"] = anime_df_cleaned["episodes"].fillna(anime_df_cleaned["episodes"].median())

Step 4: Select Top 2000 Anime by Popularity

In [47]:
# Sort by 'members' and pick top 2000
top_anime_df = anime_df_cleaned.sort_values(by="members", ascending=False).head(2000).reset_index(drop=True)


Step 5: Genre Feature Extraction (One-Hot Encoding)

In [50]:
# Convert genre string to a list
top_anime_df["genre_list"] = top_anime_df["genre"].apply(lambda x: [g.strip() for g in x.split(",")])

In [52]:
# One-hot encode the genre list
genre_dummies = top_anime_df["genre_list"].explode().str.get_dummies().groupby(level=0).sum()
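
To see what the chained call above produces, here is a minimal sketch on a hypothetical three-row series (the genre lists are made up for illustration, not taken from anime.csv). `explode` gives each genre its own row while repeating the original index, `str.get_dummies` turns each row into 0/1 indicator columns, and `groupby(level=0).sum()` folds the rows back into one per anime.

```python
import pandas as pd

# Hypothetical toy data, not taken from anime.csv
toy = pd.Series([["Action", "Drama"], ["Drama"], ["Action", "Sci-Fi"]])

exploded = toy.explode()                          # one row per (anime, genre) pair
indicators = exploded.str.get_dummies()           # one 0/1 column per distinct genre
toy_dummies = indicators.groupby(level=0).sum()   # back to one row per anime

print(toy_dummies)
```

Each row of the result has a 1 under every genre that anime belongs to and 0 elsewhere, which is the shape `genre_dummies` has for the real data.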


Step 6: Normalize Ratings

In [55]:
# Normalize the ratings to 0–1 scale
scaler = MinMaxScaler()
normalized_ratings = scaler.fit_transform(top_anime_df[["rating"]])
normalized_ratings_df = pd.DataFrame(normalized_ratings, columns=["rating"])
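
Min-max scaling maps each rating x to (x - min) / (max - min), so the lowest rating in the selection becomes 0 and the highest becomes 1. A minimal sketch in plain Python, using made-up ratings rather than values from the dataset:

```python
# Hypothetical ratings for illustration only
ratings = [6.0, 7.5, 9.0]

lo, hi = min(ratings), max(ratings)
# Same formula MinMaxScaler applies with its default feature_range=(0, 1)
scaled = [(r - lo) / (hi - lo) for r in ratings]

print(scaled)  # [0.0, 0.5, 1.0]
```

Because the scaler is fitted on the top 2000 titles only, the 0 and 1 endpoints are relative to that subset, not to the full dataset.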


Step 7: Combine Features

In [58]:
# Combine genre one-hot vectors with normalized rating
feature_matrix = pd.concat([genre_dummies, normalized_ratings_df], axis=1)


Step 8: Compute Cosine Similarity

In [63]:
# Compute cosine similarity matrix
cosine_sim = cosine_similarity(feature_matrix)
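
Cosine similarity between two feature vectors a and b is their dot product divided by the product of their lengths. The sketch below works this out by hand for three hypothetical rows laid out as [Action, Drama, Sci-Fi, normalized rating]; the values are invented for illustration and `cosine` is a helper defined here, not part of scikit-learn.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical rows: [Action, Drama, Sci-Fi, normalized rating]
anime_a = [1, 0, 1, 0.9]
anime_b = [1, 0, 1, 0.5]   # same genres as anime_a, lower rating
anime_c = [0, 1, 0, 0.9]   # same rating as anime_a, different genre

print(round(cosine(anime_a, anime_b), 3))
print(round(cosine(anime_a, anime_c), 3))
```

In this toy case the pair with shared genres scores far higher than the pair with equal ratings. Each genre column can contribute as much as the single rating column, so genre overlap tends to dominate the similarity.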


Step 9: Define Recommendation Function

In [70]:
# Anime title list for indexing
anime_titles = top_anime_df["name"]

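def recommend_anime(title, top_n=5, threshold=0.5):
    """Return up to top_n titles whose cosine similarity to `title` is >= threshold.

    If the title is not among the top 2000, a message string is returned instead of a list.
    """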
    if title not in anime_titles.values:
        return f"Anime titled '{title}' not found in the top 2000 list."
    
    idx = anime_titles[anime_titles == title].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Filter out self and apply threshold
    sim_scores = [(i, score) for i, score in sim_scores if i != idx and score >= threshold]
    
    # Sort by similarity score
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top_n]
    
    # Return anime names
    recommendations = anime_titles.iloc[[i for i, _ in sim_scores]]
    return recommendations.tolist()


Step 10: Example Usage

In [73]:
# Get recommendations for a sample anime
results = recommend_anime("Steins;Gate", top_n=5, threshold=0.5)
print("Recommended Anime:", results)


Recommended Anime: ['Steins;Gate Movie: Fuka Ryouiki no Déjà vu', 'Steins;Gate: Oukoubakko no Poriomania', 'Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero', 'Under the Dog', 'Final Fantasy: The Spirits Within']


Conclusion

* Preprocessed the data by handling missing values and converting genres into a machine-readable format.
* Extracted features using one-hot encoding for genres and min-max normalized ratings.
* Computed similarity between anime using cosine similarity.
* Recommended similar anime for a given title based on genre and rating similarity.
* Limited the data to the 2000 most popular anime (by member count) to keep the pairwise similarity matrix (2000 x 2000) small and fast to compute.
* Because it relies only on item attributes (content-based filtering), this system can recommend relevant anime without any user history, making it useful for new users or cold-start situations.




Interview Questions:

1. Can you explain the difference between user-based and item-based collaborative filtering?


User-Based Collaborative Filtering (UBCF)

Idea: if two users rate many items similarly, they are likely to have similar preferences.

How it works:

* Find users who are similar to the target user (based on rating patterns).

* Recommend items that those similar users liked but the target user hasn’t seen yet.

* Example: if User A and User B both liked "Naruto" and "One Piece", and User B also liked "Bleach", then "Bleach" might be recommended to User A.

Item-Based Collaborative Filtering (IBCF)

Idea: if two items are rated similarly by many users, the items are likely to be similar.

How it works:

* Find items similar to those the user has liked or rated highly.

* Recommend similar items the user hasn’t interacted with yet.

* Example: if many users who liked "Attack on Titan" also liked "Death Note", then "Death Note" might be recommended to someone who liked "Attack on Titan".
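
Both approaches can be sketched on the same ratings matrix: user-based CF compares rows, item-based CF compares columns. The matrix below is hypothetical (0 means not rated), `cosine_matrix` is a helper written for this example, and treating unrated entries as 0 in the cosine is a simplification for illustration.

```python
import numpy as np

# Hypothetical ratings: rows = users A, B, C; columns = items; 0 = not rated
items = ["Naruto", "One Piece", "Bleach"]
ratings = np.array([
    [5.0, 4.0, 0.0],   # User A has not rated Bleach
    [5.0, 5.0, 4.0],   # User B
    [1.0, 2.0, 5.0],   # User C
])

def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

user_sim = cosine_matrix(ratings)     # 3 x 3, user-to-user (rows)
item_sim = cosine_matrix(ratings.T)   # 3 x 3, item-to-item (columns)

print("User A vs B:", round(user_sim[0, 1], 3))
print("User A vs C:", round(user_sim[0, 2], 3))
print("Naruto vs One Piece:", round(item_sim[0, 1], 3))
print("Naruto vs Bleach:", round(item_sim[0, 2], 3))
```

In this toy data User A is closer to User B than to User C, so a user-based recommender would look at User B's ratings to suggest "Bleach" to User A. An item-based recommender would instead start from the item-to-item matrix.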

2. What is collaborative filtering, and how does it work?

Collaborative Filtering (CF) is a popular recommendation technique that suggests items to users based on past interactions (like ratings, views, or purchases) from many users.



Types:
* User-Based CF: recommend items liked by similar users.
* Item-Based CF: recommend items similar to what the user liked.
* Memory-Based CF: compute similarities directly from the raw interaction data (user-based and item-based CF both fall into this category).
* Model-Based CF: learn a model from the interactions (e.g., matrix factorization, deep learning).

How Collaborative Filtering Works (Steps):

* Collect data: user-item interaction data (ratings, clicks, views, purchases).

* Build user-item matrix: rows = users, columns = items, values = ratings or interactions.

* Compute similarity: between users (for user-based CF) or between items (for item-based CF).

* Generate recommendations: for a given user, predict ratings or preference scores for unseen items, then recommend the top-N items with the highest predicted scores.
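
The steps above can be sketched end to end for item-based CF. The assumptions are all illustrative: the ratings matrix is invented, 0 marks an unseen item, `recommend` is a hypothetical helper, and the predicted score is a similarity-weighted average of the user's existing ratings, which is one common choice among several.

```python
import numpy as np

# Collect data / build the user-item matrix (hypothetical values, 0 = unseen)
items = ["Attack on Titan", "Death Note", "Naruto", "Bleach"]
R = np.array([
    [5.0, 5.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [1.0, 0.0, 5.0, 4.0],
    [5.0, 0.0, 2.0, 0.0],   # target user: has not seen Death Note or Bleach
])

# Compute similarity: item-to-item cosine over the columns of R
norms = np.linalg.norm(R, axis=0, keepdims=True)
item_sim = (R.T @ R) / (norms.T @ norms)

def recommend(user_idx, top_n=2):
    """Rank unseen items by a similarity-weighted average of the user's ratings."""
    user = R[user_idx]
    rated = user > 0
    scores = {}
    for j in np.flatnonzero(~rated):
        weights = item_sim[j, rated]
        if weights.sum() == 0:
            continue  # item shares no raters with anything this user rated
        scores[items[j]] = float(weights @ user[rated] / weights.sum())
    # Generate recommendations: top-N unseen items by predicted score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend(3))
```

For the target user, "Death Note" ranks above "Bleach" because it is rated most like "Attack on Titan", which this user rated highly, while "Bleach" is rated most like "Naruto", which the user rated low.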