# **Task 1 : Data Preprocessing**

**Load the dataset into a suitable data structure (e.g., pandas DataFrame).**

In [65]:
import pandas as pd

In [66]:
df= pd.read_csv("/content/anime.csv")
df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


**Handle missing values, if any.**

In [67]:
df.isnull().sum()

Unnamed: 0,0
anime_id,0
name,0
genre,62
type,25
episodes,0
rating,230
members,0


In [68]:
df['genre'].fillna(df['genre'].mode()[0], inplace=True)
df['type'].fillna(df['type'].mode()[0], inplace=True)
df['rating'].fillna(df['rating'].mean(), inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['genre'].fillna(df['genre'].mode()[0], inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['type'].fillna(df['type'].mode()[0], inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we ar

In [69]:
df.isnull().sum()

Unnamed: 0,0
anime_id,0
name,0
genre,0
type,0
episodes,0
rating,0
members,0


**Explore the dataset to understand its structure and attributes.**

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [71]:
# df['episodes'].unique()

In [72]:
# Check for non-numeric values in the 'episodes' column
non_numeric_episodes = df[~df['episodes'].str.isdigit()]['episodes'].unique()
print("Non-numeric values in 'episodes' column:", non_numeric_episodes)

# Handle non-numeric values (e.g., replace with NaN and then convert to numeric)
df['episodes'] = pd.to_numeric(df['episodes'], errors='coerce')

# Fill NaN values that resulted from coercion (e.g., with the mode or a placeholder)
df['episodes'].fillna(df['episodes'].mode()[0], inplace=True)

# Convert the 'episodes' column to integer type
df['episodes'] = df['episodes'].astype(int)

# Verify the changes
display(df.info())

Non-numeric values in 'episodes' column: ['Unknown']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  int64  
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 672.5+ KB


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['episodes'].fillna(df['episodes'].mode()[0], inplace=True)


None

In [73]:
df.describe()

Unnamed: 0,anime_id,episodes,rating,members
count,12294.0,12294.0,12294.0,12294.0
mean,14058.221653,12.067757,6.473902,18071.34
std,11455.294701,46.25039,1.017096,54820.68
min,1.0,1.0,1.67,5.0
25%,3484.25,1.0,5.9,225.0
50%,10260.5,2.0,6.55,1550.0
75%,24794.5,12.0,7.17,9437.0
max,34527.0,1818.0,10.0,1013917.0


# **Task 2 : Feature Extraction**

**Decide on the features that will be used for computing similarity (e.g., genres, user ratings).**

For computing similarity between anime, we can consider the following features:

* Genre: Anime with similar genres are likely to be enjoyed by the same users.
* Rating: Higher rated anime might be similar in quality or appeal to a broader audience.
* Members: The number of members who have added the anime to their list can indicate popularity and potentially similarity in terms of mass appeal.
We will use these features to build a recommendation system.

In [74]:
selected_features = ['genre', 'rating', 'members']
display(df[selected_features].head())

Unnamed: 0,genre,rating,members
0,"Drama, Romance, School, Supernatural",9.37,200630
1,"Action, Adventure, Drama, Fantasy, Magic, Mili...",9.26,793665
2,"Action, Comedy, Historical, Parody, Samurai, S...",9.25,114262
3,"Sci-Fi, Thriller",9.17,673572
4,"Action, Comedy, Historical, Parody, Samurai, S...",9.16,151266


**Convert categorical features into numerical representations if necessary.**

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert 'genre' into numerical representation using TF-IDF
tfidf = TfidfVectorizer()
genre_matrix = tfidf.fit_transform(df['genre'])

# The shape of result matrix
print("Shape of genre matrix:", genre_matrix.shape)

Shape of genre matrix: (12294, 47)


**Normalize numerical features if required.**

In [76]:
from sklearn.preprocessing import MinMaxScaler

# Normalize numerical features ('rating' and 'members')
mm = MinMaxScaler()
df[['rating', 'members']] = mm.fit_transform(df[['rating', 'members']])

# Display the first few rows of the scaled data
display(df[['rating', 'members']].head())

Unnamed: 0,rating,members
0,0.92437,0.197872
1,0.911164,0.78277
2,0.909964,0.112689
3,0.90036,0.664325
4,0.89916,0.149186


# **Task 3 : Recommendation Syatem**

**Design a function to recommend anime based on cosine similarity.**

In [77]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Combine the genre matrix and scaled numerical features
# We need to convert the sparse genre_matrix to a dense array before concatenating
feature_matrix = np.hstack((genre_matrix.toarray(), df[['rating', 'members']].values))

# Calculate cosine similarity between anime
cosine_sim = cosine_similarity(feature_matrix)

def recommend_anime(anime_title, cosine_sim=cosine_sim, df=df, threshold=0.5):

  # Get the index of the anime that matches the title

    idx = df[df['name'] == anime_title].index[0]

     # Get the pairwise similarity scores for all anime with that anime
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the anime based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar anime (excluding the anime itself)
    sim_scores = sim_scores[1:11]

    # Get the anime indices
    anime_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar anime
    return df[['name', 'genre', 'rating', 'members']].iloc[anime_indices]

**Given a target anime, recommend a list of similar anime based on cosine similarity scores.**

In [78]:
# Example
print(recommend_anime('Naruto'))

                                                   name  \
615                                  Naruto: Shippuuden   
206                                       Dragon Ball Z   
346                                         Dragon Ball   
1472        Naruto: Shippuuden Movie 4 - The Lost Tower   
1573  Naruto: Shippuuden Movie 3 - Hi no Ishi wo Tsu...   
486                            Boruto: Naruto the Movie   
1343                                        Naruto x UT   
2997  Naruto Soyokazeden Movie: Naruto to Mashin to ...   
1103  Boruto: Naruto the Movie - Naruto ga Hokage ni...   
588                                     Dragon Ball Kai   

                                                  genre    rating   members  
615   Action, Comedy, Martial Arts, Shounen, Super P...  0.752701  0.526252  
206   Action, Adventure, Comedy, Fantasy, Martial Ar...  0.798319  0.370503  
346   Adventure, Comedy, Fantasy, Martial Arts, Shou...  0.779112  0.311760  
1472  Action, Comedy, Martial Arts, Sh

**Experiment with different threshold values for similarity scores to adjust the recommendation list size.**

In [79]:
# Experiment with different threshold values
print("Recommendations with threshold = 0.7:")
display(recommend_anime('Naruto', threshold=0.7))

print("\nRecommendations with threshold = 0.11:")
display(recommend_anime('Naruto', threshold=0.5))

print("\nRecommendations with threshold = 0.9:")
display(recommend_anime('Naruto', threshold=0.9))

Recommendations with threshold = 0.7:


Unnamed: 0,name,genre,rating,members
615,Naruto: Shippuuden,"Action, Comedy, Martial Arts, Shounen, Super P...",0.752701,0.526252
206,Dragon Ball Z,"Action, Adventure, Comedy, Fantasy, Martial Ar...",0.798319,0.370503
346,Dragon Ball,"Adventure, Comedy, Fantasy, Martial Arts, Shou...",0.779112,0.31176
1472,Naruto: Shippuuden Movie 4 - The Lost Tower,"Action, Comedy, Martial Arts, Shounen, Super P...",0.703481,0.083362
1573,Naruto: Shippuuden Movie 3 - Hi no Ishi wo Tsu...,"Action, Comedy, Martial Arts, Shounen, Super P...",0.69988,0.082364
486,Boruto: Naruto the Movie,"Action, Comedy, Martial Arts, Shounen, Super P...",0.763505,0.07366
1343,Naruto x UT,"Action, Comedy, Martial Arts, Shounen, Super P...",0.709484,0.023138
2997,Naruto Soyokazeden Movie: Naruto to Mashin to ...,"Action, Comedy, Martial Arts, Shounen, Super P...",0.653061,0.024824
1103,Boruto: Naruto the Movie - Naruto ga Hokage ni...,"Action, Comedy, Martial Arts, Shounen, Super P...",0.721489,0.016632
588,Dragon Ball Kai,"Action, Adventure, Comedy, Fantasy, Martial Ar...",0.753902,0.115224



Recommendations with threshold = 0.11:


Unnamed: 0,name,genre,rating,members
615,Naruto: Shippuuden,"Action, Comedy, Martial Arts, Shounen, Super P...",0.752701,0.526252
206,Dragon Ball Z,"Action, Adventure, Comedy, Fantasy, Martial Ar...",0.798319,0.370503
346,Dragon Ball,"Adventure, Comedy, Fantasy, Martial Arts, Shou...",0.779112,0.31176
1472,Naruto: Shippuuden Movie 4 - The Lost Tower,"Action, Comedy, Martial Arts, Shounen, Super P...",0.703481,0.083362
1573,Naruto: Shippuuden Movie 3 - Hi no Ishi wo Tsu...,"Action, Comedy, Martial Arts, Shounen, Super P...",0.69988,0.082364
486,Boruto: Naruto the Movie,"Action, Comedy, Martial Arts, Shounen, Super P...",0.763505,0.07366
1343,Naruto x UT,"Action, Comedy, Martial Arts, Shounen, Super P...",0.709484,0.023138
2997,Naruto Soyokazeden Movie: Naruto to Mashin to ...,"Action, Comedy, Martial Arts, Shounen, Super P...",0.653061,0.024824
1103,Boruto: Naruto the Movie - Naruto ga Hokage ni...,"Action, Comedy, Martial Arts, Shounen, Super P...",0.721489,0.016632
588,Dragon Ball Kai,"Action, Adventure, Comedy, Fantasy, Martial Ar...",0.753902,0.115224



Recommendations with threshold = 0.9:


Unnamed: 0,name,genre,rating,members
615,Naruto: Shippuuden,"Action, Comedy, Martial Arts, Shounen, Super P...",0.752701,0.526252
206,Dragon Ball Z,"Action, Adventure, Comedy, Fantasy, Martial Ar...",0.798319,0.370503
346,Dragon Ball,"Adventure, Comedy, Fantasy, Martial Arts, Shou...",0.779112,0.31176
1472,Naruto: Shippuuden Movie 4 - The Lost Tower,"Action, Comedy, Martial Arts, Shounen, Super P...",0.703481,0.083362
1573,Naruto: Shippuuden Movie 3 - Hi no Ishi wo Tsu...,"Action, Comedy, Martial Arts, Shounen, Super P...",0.69988,0.082364
486,Boruto: Naruto the Movie,"Action, Comedy, Martial Arts, Shounen, Super P...",0.763505,0.07366
1343,Naruto x UT,"Action, Comedy, Martial Arts, Shounen, Super P...",0.709484,0.023138
2997,Naruto Soyokazeden Movie: Naruto to Mashin to ...,"Action, Comedy, Martial Arts, Shounen, Super P...",0.653061,0.024824
1103,Boruto: Naruto the Movie - Naruto ga Hokage ni...,"Action, Comedy, Martial Arts, Shounen, Super P...",0.721489,0.016632
588,Dragon Ball Kai,"Action, Adventure, Comedy, Fantasy, Martial Ar...",0.753902,0.115224


# **Task 4 : Evaluation**

**Split the dataset into training and testing sets.**

In [80]:
from sklearn.model_selection import train_test_split

# Split in training and testing
train, test= train_test_split(df, test_size=0.2, random_state=42)

# Check recall, precision, f1
true_titles= test['name'].sample(20, random_state=42).values
y_true, y_pred= [], []

for t in true_titles:
  recs= recommend_anime(t, threshold=0.5)
  if isinstance(recs, str):
    continue
  rec_names= recs['name'].values

  for rec in rec_names:
    y_true.append(1 if rec in test["name"].values else 0)
    y_pred.append(1)

**Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.**

In [81]:
# Calculate Matrics
from sklearn.metrics import precision_score, recall_score, f1_score

print(f"Precision: {precision_score(y_true, y_pred, zero_division=0)}")
print(f"Recall: {recall_score(y_true, y_pred, zero_division=0)}")
print(f"F1-score: {f1_score(y_true, y_pred, zero_division=0)}")

Precision: 0.225
Recall: 1.0
F1-score: 0.3673469387755102


# **Interview Question**

**1. Can you explain the difference between user-based and item-based collaborative filtering?**

Ans:- User-Based vs. Item-Based Collaborative Filtering

**1) User-Based Collaborative Filtering**:
* Concept: This approach recommends items to a user based on the preferences of similar users. It finds users who have similar tastes to the active user and then recommends items that those similar users liked but the active user hasn't seen or rated yet.
* How it works:
1) Find users who are similar to the active user (e.g., users who have rated items similarly).
2) Identify items that those similar users have liked or interacted with.
3) Recommend these items to the active user, excluding items the active user has already seen or rated.
* Pros: Can recommend novel items that the active user might not have discovered otherwise.
* Cons: Can be computationally expensive for systems with a large number of users, as finding similar users can be time-consuming. The recommendations can also change frequently as user preferences evolve.

**2) Item-Based Collaborative Filtering**:

* Concept: This approach recommends items to a user based on the similarity between items. It finds items that are similar to the items the active user has liked or interacted with and recommends those similar items.
* How it works:
!) Find items that are similar to the items the active user has liked or interacted with (e.g., items that are often rated similarly by many users).
2) Recommend these similar items to the active user, excluding items the active user has already seen or rated.
* Pros: Generally more scalable than user-based filtering for systems with a large number of users, as item similarity is often more stable than user preferences. Recommendations tend to be more consistent.
* Cons: May not recommend as diverse a range of items as user-based filtering, as it focuses on items similar to those the user already likes.

In summary, user-based filtering focuses on finding similar users and recommending items they like, while item-based filtering focuses on finding similar items and recommending those. The choice between the two often depends on the specific dataset, the number of users and items, and the desired characteristics of the recommendations.

**2. What is collaborative filtering, and how does it work?**

Ans:- Collaborative Filtering is a technique used by recommendation systems to predict a user's interest in an item by collecting preferences from many users. The core idea is that if a user A has the same opinion as a user B on a set of items, user A is more likely to have the same opinion as B on another item than a randomly chosen user B.

There are generally two types of collaborative filtering:
1) **User-Based Collaborative Filtering**: (As explained previously) This method finds users with similar tastes and recommends items liked by those similar users.
2) **Item-Based Collaborative Filtering**: (As explained previously) This method finds items that are similar to the items a user has liked and recommends those similar items.

**How it works (General Principle)**:

Collaborative filtering works by building a model from a user's past behavior (items they have bought, rated, clicked, etc.) and similar decisions made by other users. This model is then used to predict items that the user might like.

The process typically involves:

1) Data Collection: Gathering data on user interactions with items (e.g., ratings, purchase history, browsing behavior).
2) Finding Similarities: Calculating the similarity between users (in user-based filtering) or between items (in item-based filtering) based on their interaction data. Similarity can be calculated using various metrics like cosine similarity, Pearson correlation, etc.
3) Generating Recommendations:
* User-Based: For an active user, find the top N similar users. Then, identify items that these similar users liked but the active user hasn't interacted with. Rank these items based on a weighted average of the similar users' ratings or interactions.
* Item-Based: For an active user, identify items they have liked or interacted with. Then, find items that are similar to these liked items. Rank these similar items based on the similarity scores and the user's interaction with the related liked item.
4) Filtering: Exclude items that the active user has already seen or interacted with.

Collaborative filtering is widely used in e-commerce (e.g., "customers who bought this also bought..."), streaming services (e.g., recommending movies or music), and social media platforms.