In [3]:

import pandas as pd
df = pd.read_csv("C:\\Users\\moham\\Desktop\\anime.csv")
df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


In [5]:
# Handle missing values, if any.
# Explore the dataset to understand its structure and attributes.

# Check for missing values in the DataFrame
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

# Handle missing values (e.g., replace with mean, median, or drop rows/columns)
# Example: Replace missing values in 'rating' with the mean rating
df['rating'] = df['rating'].fillna(df['rating'].mean())

# Explore the dataset's structure and attributes
print("\nData types of each column:\n", df.dtypes)
print("\nDescriptive statistics of numerical columns:\n", df.describe())
print("\nUnique values in 'genre' column:\n", df['genre'].unique())
print("\nNumber of unique anime in the dataset:", df['name'].nunique())

Missing values in each column:
 anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

Data types of each column:
 anime_id      int64
name         object
genre        object
type         object
episodes     object
rating      float64
members       int64
dtype: object

Descriptive statistics of numerical columns:
            anime_id        rating       members
count  12294.000000  12294.000000  1.229400e+04
mean   14058.221653      6.473902  1.807134e+04
std    11455.294701      1.017096  5.482068e+04
min        1.000000      1.670000  5.000000e+00
25%     3484.250000      5.900000  2.250000e+02
50%    10260.500000      6.550000  1.550000e+03
75%    24794.500000      7.170000  9.437000e+03
max    34527.000000     10.000000  1.013917e+06

Unique values in 'genre' column:
 ['Drama, Romance, School, Supernatural'
 'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen'
 'Action, Comedy, Historical, Parody, Samur

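The missing-value step above can be sketched without chained `inplace` calls. This is a minimal sketch on a made-up toy table; the helper name `fill_missing` and the choice to label missing `genre` and `type` values as "Unknown" are assumptions for illustration, not something the original cell does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the anime table (values are made up for illustration)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["Action, Comedy", None, "Drama", "Action"],
    "type": ["TV", "Movie", None, "OVA"],
    "rating": [8.0, np.nan, 6.0, 7.0],
})

def fill_missing(frame):
    """Return a copy with missing values filled, without chained inplace calls."""
    out = frame.copy()
    # Assign the result back instead of calling fillna(..., inplace=True) on a column
    out["rating"] = out["rating"].fillna(out["rating"].mean())
    # Assumption: label missing categorical values explicitly rather than dropping rows
    out[["genre", "type"]] = out[["genre", "type"]].fillna("Unknown")
    return out

clean = fill_missing(toy)
print(clean.isnull().sum())
```

Working on a copy keeps the raw table available for comparison, which may help when deciding between filling and dropping rows.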


In [7]:
# Feature Extraction:
# Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
# Convert categorical features into numerical representations if necessary.
# Normalize numerical features if required.

# Select relevant features for similarity computation
features = ['genre', 'rating', 'episodes', 'type']

# Create a new DataFrame with selected features
anime_features = df[features].copy()

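# Convert categorical features (genre, type) into numerical representation using one-hot encoding.
# Note: each full comma-separated genre string is treated as a single category here,
# so two anime that share only some of their genres get no overlapping genre column.
anime_features = pd.get_dummies(anime_features, columns=['genre', 'type'], dummy_na=True)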

# Convert 'episodes' column to numeric type, replacing invalid values with NaN
anime_features['episodes'] = pd.to_numeric(anime_features['episodes'], errors='coerce')

# Fill NaN values in 'episodes' with 0.
# This assumes that NaN in episodes means the anime is a movie or has an unknown number of episodes.
anime_features['episodes'] = anime_features['episodes'].fillna(0)

# Normalize numerical features (rating, episodes) using Min-Max scaling
for column in ['rating', 'episodes']:
    anime_features[column] = (anime_features[column] - anime_features[column].min()) / (
        anime_features[column].max() - anime_features[column].min()
    )


print(anime_features.head())

     rating  episodes  genre_Action  genre_Action, Adventure  \
0  0.924370  0.000550         False                    False   
1  0.911164  0.035204         False                    False   
2  0.909964  0.028053         False                    False   
3  0.900360  0.013201         False                    False   
4  0.899160  0.028053         False                    False   

   genre_Action, Adventure, Cars, Comedy, Sci-Fi, Shounen  \
0                                              False        
1                                              False        
2                                              False        
3                                              False        
4                                              False        

   genre_Action, Adventure, Cars, Mecha, Sci-Fi, Shounen, Sports  \
0                                              False               
1                                              False               
2                                           

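One alternative to encoding each whole genre string as a single category is to give every individual genre its own indicator column. The sketch below uses pandas' `Series.str.get_dummies` on a made-up toy table; it is an illustration under that assumption, not the encoding used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy data (made up): comma-separated genre strings, one of them missing
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Comedy", "Action, Drama", None],
})

# Split each string on ", " and create one 0/1 column per individual genre.
# A missing genre is replaced by "" first, which yields an all-zero row.
genre_dummies = toy["genre"].fillna("").str.get_dummies(sep=", ")
print(genre_dummies)

# Titles A and B now overlap on "Action", so their cosine similarity is non-zero
a = genre_dummies.loc[0].to_numpy()
b = genre_dummies.loc[1].to_numpy()
overlap = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(overlap)
```

With this encoding, partial genre overlap contributes to the similarity score, whereas whole-string one-hot columns only match on identical genre lists.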


In [9]:
# Design a function to recommend anime based on cosine similarity.
# Given a target anime, recommend a list of similar anime based on cosine similarity scores.
# Experiment with different threshold values for similarity scores to adjust the recommendation list size.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def recommend_anime(anime_name, anime_features, df, threshold=0.7):
  """
  Recommends anime based on cosine similarity.

  Args:
    anime_name: The name of the anime to find similar recommendations for.
    anime_features: DataFrame containing features for anime.
    df: Original DataFrame containing anime data (for retrieving anime names).
    threshold: Minimum cosine similarity score for an anime to be considered similar.

  Returns:
    A list of recommended anime names.
  """

  if anime_name not in df['name'].values:
    return f"Anime '{anime_name}' not found in the dataset."

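  # Look up the target by row position (not index label) so that the .iloc
  # calls below stay correct when `df` is a subset whose labels are not 0..n-1.
  target_anime_index = np.flatnonzero(df['name'].values == anime_name)[0]
  target_anime_features = anime_features.iloc[[target_anime_index]]

  cosine_similarities = cosine_similarity(target_anime_features, anime_features)

  # Keep every row at or above the threshold, excluding the target itself
  similar_anime_indices = np.where(cosine_similarities >= threshold)[1]
  recommended_anime_indices = [index for index in similar_anime_indices if index != target_anime_index]

  recommended_anime_names = [df['name'].iloc[index] for index in recommended_anime_indices]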

  return recommended_anime_names


# Example usage:
anime_to_recommend = "Naruto"
recommendations = recommend_anime(anime_to_recommend, anime_features, df, threshold=0.7)

print(f"Recommendations for '{anime_to_recommend}':")
if isinstance(recommendations, str):
  print(recommendations)
else:
  for anime in recommendations:
    print(anime)


Recommendations for 'Naruto':
Naruto: Shippuuden
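Because a fixed threshold can return a very short list (one title in the run above) or a very long one, a ranked top-N variant is one way to control the list size directly. The following is a minimal sketch on made-up toy features; `recommend_top_n` and its parameters are hypothetical names, and unlike `recommend_anime` it returns an empty list for an unknown title.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_top_n(anime_name, features, names, n=5):
    """Return the n most similar titles as (name, score) pairs, best first.

    `features` is a numeric DataFrame and `names` a Series, aligned row by row.
    """
    matches = np.flatnonzero(names.to_numpy() == anime_name)
    if matches.size == 0:
        return []
    target = matches[0]
    scores = cosine_similarity(features.iloc[[target]], features)[0]
    order = np.argsort(-scores, kind="stable")  # highest similarity first
    order = order[order != target][:n]          # drop the query title itself
    return [(names.iloc[i], float(scores[i])) for i in order]

# Toy data (made up): rows are titles, columns are already-numeric features
toy_names = pd.Series(["A", "B", "C", "D"])
toy_features = pd.DataFrame(
    [[1.0, 0.0, 1.0], [1.0, 0.0, 0.9], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
)
top_two = recommend_top_n("A", toy_features, toy_names, n=2)
print(top_two)
```

Returning the scores alongside the names makes it easier to see how a given threshold would have cut the same list.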


In [11]:
# Evaluation:
# Split the dataset into training and testing sets.
# Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
# Analyze the performance of the recommendation system and identify areas of improvement.

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Split the dataset into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Create features for the training and testing sets
train_anime_features = anime_features.loc[train_df.index]
test_anime_features = anime_features.loc[test_df.index]

# Generate recommendations for the test set
true_recommendations = []
predicted_recommendations = []

for anime_name in test_df['name'].unique():
    # Use training data to generate recommendations. A test title that is absent
    # from train_df makes recommend_anime return a "not found" string, which is skipped.
    recommendations = recommend_anime(anime_name, train_anime_features, train_df)
    # Keep only non-empty recommendation lists
    if isinstance(recommendations, list) and recommendations:
        true_recommendations.append(anime_name)
        predicted_recommendations.append(recommendations)

# Flatten the lists for evaluation
flat_predicted_recommendations = [item for sublist in predicted_recommendations for item in sublist]

# Calculate precision, recall, and F1-score (assuming a binary classification: recommended or not)

# Build both label vectors over the same list of test titles so that y_true and
# y_pred have equal length, as the sklearn metric functions require.
# Note: this is only a rough proxy, because the dataset has no ground-truth relevance labels.
if true_recommendations:
    test_names = test_df['name'].unique()
    titles_with_recommendations = set(true_recommendations)
    recommended_titles = set(flat_predicted_recommendations)

    y_true = [1 if anime in titles_with_recommendations else 0 for anime in test_names]
    y_pred = [1 if anime in recommended_titles else 0 for anime in test_names]

    precision = precision_score(y_true, y_pred, average='binary', zero_division=0)
    recall = recall_score(y_true, y_pred, average='binary', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='binary', zero_division=0)

    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-score: {f1:.4f}")
else:
    print("No recommendations found for the test set.")


# Analyze the performance and identify areas of improvement:
# * Low precision: Many irrelevant items are being recommended. Consider using a higher similarity threshold or improving the feature set.
# * Low recall: Many relevant items are not being recommended. Consider exploring alternative similarity measures or incorporating user preferences.
# * Low F1-score: A balanced approach is needed to improve both precision and recall.


# Further improvements:
# * Experiment with different similarity measures (e.g., Jaccard similarity, Euclidean distance).
# * Incorporate user ratings or preferences to personalize recommendations.
# * Explore collaborative filtering techniques for recommendation.
# * Build a more comprehensive feature set using text mining techniques.
# * Implement a ranking mechanism for the recommendations.


No recommendations found for the test set.
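The empty result above follows from looking up test titles in the training split, where most of them do not exist. One possible alternative, sketched here under stated assumptions, is to skip the split and score every title's nearest neighbours with precision@k, using "shares at least one genre" as a stand-in for relevance. The function name `precision_at_k`, the relevance proxy, and the toy data are all assumptions; the toy features are deliberately not derived from genre, since scoring genre-based features against a genre-based proxy would be circular.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def precision_at_k(features, genres, k=2):
    """Mean fraction of each title's top-k neighbours that share a genre with it.

    Assumption: "shares at least one genre" stands in for relevance.
    """
    sims = cosine_similarity(features)
    np.fill_diagonal(sims, -np.inf)  # never recommend a title to itself
    genre_sets = [set(g.split(", ")) for g in genres]
    hits = []
    for i, row in enumerate(sims):
        top_k = np.argsort(-row, kind="stable")[:k]
        hits.append(np.mean([bool(genre_sets[i] & genre_sets[j]) for j in top_k]))
    return float(np.mean(hits))

# Toy data (made up): two numeric features per title plus genre labels
toy_features = pd.DataFrame([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
toy_genres = ["Action, Comedy", "Action", "Drama", "Drama, Romance"]

p_at_1 = precision_at_k(toy_features, toy_genres, k=1)
p_at_2 = precision_at_k(toy_features, toy_genres, k=2)
print(p_at_1, p_at_2)
```

Recall and F1 could be derived in the same way once a relevance definition is fixed; the choice of proxy matters more than the metric itself.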


In [None]:
# Interview Questions:

#=======================================================================================

# 1. Can you explain the difference between user-based and item-based collaborative filtering?

#=======================================================================================

### Answer: User-Based Collaborative Filtering:

# * Focuses on finding users similar to the target user.
# * Recommends items that similar users have liked or rated highly.
# * Works well when users have clear preferences and similar tastes.
# * Can be computationally expensive if the user base is large.


### Item-Based Collaborative Filtering:

# * Focuses on finding items similar to the items the target user has liked or rated highly.
# * Recommends items that are similar to those the user has already interacted with.
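# * Item similarity is computed from how users rated or interacted with the items, not from item attributes such as genre (using attributes would be content-based filtering).
# * Often scales better than user-based collaborative filtering, because item-item similarities tend to change slowly and can be precomputed.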


# In summary:
# User-based: Find users like you and recommend what they like.
# Item-based: Find items like what you like and recommend those.
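The two views in the summary can be illustrated on a small user-item rating matrix: comparing rows gives user-user similarity, and comparing columns gives item-item similarity. The matrix below is made up, and treating 0 as "not rated" inside the cosine computation is a simplifying assumption.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item rating matrix (made up): rows = users, columns = items, 0 = not rated
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

# User-based view: compare users by their rating rows
user_similarity = cosine_similarity(ratings)    # shape (3, 3)

# Item-based view: compare items by their rating columns
item_similarity = cosine_similarity(ratings.T)  # shape (4, 4)

print(np.round(user_similarity, 2))
print(np.round(item_similarity, 2))
```

Both matrices come from the same ratings; only the axis being compared changes.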



In [None]:
#  2. What is collaborative filtering, and how does it work?

#==========================================================================================================


# Answer: Collaborative filtering is a technique used in recommender systems to predict a user's preferences
# from the recorded preferences of many users, by finding either similar users or similarly rated items.
# It relies on the idea that if two users had similar tastes in the past, they are likely to have similar tastes in the future.

# There are two main types of collaborative filtering:

# 1. User-based Collaborative Filtering:
#    - Find users similar to the target user based on their past interactions (e.g., ratings, purchases, views).
#    - Recommend items that similar users have liked or rated highly.
#    - Works well when users have clear preferences and similar tastes.
#    - Can be computationally expensive for large user bases.

# 2. Item-based Collaborative Filtering:
#    - Find items similar to the items the target user has liked or rated highly.
#    - Recommend items that are similar to those the user has already interacted with.
#    - Item similarity is computed from how users rated or interacted with the items, not from item attributes such as genre.
#    - Often scales better than user-based collaborative filtering, because item-item similarities tend to change slowly and can be precomputed.


# Example Scenario:

# If a user likes "Naruto" and "Bleach," user-based filtering would find other users who also like these anime. The system would then recommend other anime that those similar users also enjoyed.

# Item-based filtering would identify anime that tend to be liked by the same users who like "Naruto" and "Bleach," and recommend those. Matching on genre, themes, or target audience instead, as the cosine-similarity recommender earlier in this notebook does, is content-based filtering rather than collaborative filtering.

# In summary:

# User-based: Find users like you and recommend what they like.
# Item-based: Find items like what you like and recommend those.
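To make "how does it work" concrete, here is a minimal sketch of a user-based prediction: the unknown rating is estimated as a similarity-weighted average of the ratings other users gave to that item. The function name `predict_user_based`, the toy matrix, and the convention that 0 means "not rated" are assumptions for illustration; practical systems usually add mean-centering and neighbour selection.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def predict_user_based(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Zeros in `ratings` mean "not rated" (an assumption of this toy setup).
    """
    sims = cosine_similarity(ratings)[user]
    rated = ratings[:, item] > 0  # users who actually rated the item
    rated[user] = False           # exclude the target user
    if not rated.any() or sims[rated].sum() == 0:
        return float("nan")
    return float(sims[rated] @ ratings[rated, item] / sims[rated].sum())

# Toy user-item rating matrix (made up): rows = users, columns = items
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

predicted = predict_user_based(ratings, user=0, item=2)
print(predicted)
```

User 0 resembles user 1 far more than user 2, so the estimate is pulled toward user 1's low rating for the item.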