Data Preprocessing

In [21]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df = pd.read_csv('anime.csv')

In [3]:
missing_values = df.isnull().sum()

In [4]:
if 'rating' in missing_values:
    df['rating'].fillna(df['rating'].mean(), inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['rating'].fillna(df['rating'].mean(), inplace=True)


In [5]:
# Display the first 5 rows of the DataFrame
print(df.head())

   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17   
4  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   

   members  
0   200630  
1   793665  
2   114262  
3   673572  
4   151266  


In [6]:
# Display the last 5 rows of the DataFrame
print(df.tail())

       anime_id                                               name   genre  \
12289      9316       Toushindai My Lover: Minami tai Mecha-Minami  Hentai   
12290      5543                                        Under World  Hentai   
12291      5621                     Violence Gekiga David no Hoshi  Hentai   
12292      6133  Violence Gekiga Shin David no Hoshi: Inma Dens...  Hentai   
12293     26081                   Yasuji no Pornorama: Yacchimae!!  Hentai   

        type episodes  rating  members  
12289    OVA        1    4.15      211  
12290    OVA        1    4.28      183  
12291    OVA        4    4.88      219  
12292    OVA        1    4.98      175  
12293  Movie        1    5.46      142  


In [7]:
# Display the shape of the DataFrame (number of rows, number of columns)
print(df.shape)

(12294, 7)


In [8]:
# Display information about the DataFrame including the data types of each column and memory usage
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None


In [9]:
# Display basic statistical details like percentile, mean, standard deviation etc. of the numerical columns
print(df.describe())

           anime_id        rating       members
count  12294.000000  12294.000000  1.229400e+04
mean   14058.221653      6.473902  1.807134e+04
std    11455.294701      1.017096  5.482068e+04
min        1.000000      1.670000  5.000000e+00
25%     3484.250000      5.900000  2.250000e+02
50%    10260.500000      6.550000  1.550000e+03
75%    24794.500000      7.170000  9.437000e+03
max    34527.000000     10.000000  1.013917e+06


In [10]:
# Display the number of unique values in each column
print(df.nunique())

anime_id    12294
name        12292
genre        3264
type            6
episodes      187
rating        599
members      6706
dtype: int64


Feature Extraction

In [11]:
print(df['genre'].dtype)

object


In [12]:
df['genre'] = df['genre'].astype('str')

In [13]:
# Split the comma-separated genres into a list of genres
df['genre'] = df['genre'].str.split(', ')

# Convert the list of genres to a one-hot encoded DataFrame
genre_df = df['genre'].str.get_dummies(sep=', ')
genre_df 

# Concatenate the original DataFrame with the one-hot encoded DataFrame
df = pd.concat([df, genre_df], axis=1)
df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,'Adventure','Adventure'],'Cars',...,['Slice of Life',['Slice of Life'],['Space'],['Sports'],['Super Power',['Supernatural'],['Thriller'],['Vampire'],['Yaoi'],['nan']
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil...",TV,64,9.26,793665,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.25,114262,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9253,Steins;Gate,"[Sci-Fi, Thriller]",TV,24,9.17,673572,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.16,151266,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,[Hentai],OVA,1,4.15,211,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,5543,Under World,[Hentai],OVA,1,4.28,183,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,5621,Violence Gekiga David no Hoshi,[Hentai],OVA,4,4.88,219,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,[Hentai],OVA,1,4.98,175,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit the scaler to the 'rating' column and transform it
df['rating'] = scaler.fit_transform(df[['rating']])
df['rating']

0        0.924370
1        0.911164
2        0.909964
3        0.900360
4        0.899160
           ...   
12289    0.297719
12290    0.313325
12291    0.385354
12292    0.397359
12293    0.454982
Name: rating, Length: 12294, dtype: float64

Recommendation System:

In [18]:
# Get the column names of the one-hot encoded DataFrame
genre_columns = genre_df.columns

# Include these columns in the features
features = list(genre_columns) + ['rating']

# Compute the cosine similarity matrix
similarity_matrix = cosine_similarity(df[features])

In [19]:
def recommend_anime(target_anime, similarity_matrix, threshold=0.5):
    # Get the index of the target anime in the DataFrame
    index = df[df['name'] == target_anime].index[0]

    # Get the corresponding row in the similarity matrix
    similarity_scores = similarity_matrix[index]

    # Get the indices of anime with similarity score above the threshold
    indices = np.where(similarity_scores > threshold)[0]

    # Get the names of the similar anime
    similar_anime = df.iloc[indices]['name']

    return similar_anime

In [22]:
# Recommend anime similar to 'Kimi no Na wa.'
print(recommend_anime('Kimi no Na wa.', similarity_matrix, threshold=0.7))

0                                   Kimi no Na wa.
10                            Clannad: After Story
16                         Shigatsu wa Kimi no Uso
60                              Hotarubi no Mori e
323                        ef: A Tale of Melodies.
337                                   Kanon (2006)
370      Clannad: Mou Hitotsu no Sekai, Tomoyo-hen
675                                Hana yori Dango
766                     Maria-sama ga Miteru: Haru
894                               Momo e no Tegami
897                                         Orange
937                    Mahouka Koukou no Rettousei
1111         Aura: Maryuuin Kouga Saigo no Tatakai
1201                Angel Beats!: Another Epilogue
1389                                 Orange: Mirai
1494                                      Harmonie
1616                                           Air
1659                              Futatsu no Spica
1771                              Strawberry Panic
1959                           

Evaluation:

In [23]:
from sklearn.model_selection import train_test_split

# Assuming 'df' is your DataFrame and 'features' is a list of feature column names
features = list(genre_columns) + ['rating']  # replace with your actual feature column names

# Split the dataset into training and testing sets
train_df, test_df = train_test_split(df[features], test_size=0.2, random_state=42)

 Evaluating a recommendation system typically involves computing metrics like precision@k, recall@k, and F1-score@k, where ‘k’ is the number of recommendations made. However, these metrics require true positive, false positive, and false negative values which are not directly available in a recommendation system. You might need to define these values based on your specific situation.

Analyze the performance of the recommendation system and identify areas of improvement:

This step involves looking at the evaluation metrics and identifying where the model is performing well and where it’s not. For example, if the precision is low, it means that many of the recommended items are not relevant. In this case, you might want to improve the model by using more relevant features or by tuning the model’s parameters.

 * Precision@k: This is the proportion of recommended items in the top-k set that are relevant. It’s calculated as (number of recommended items that are relevant) / (number of recommended items). High precision means that more of the recommended items are relevant, but it doesn’t take into account how many relevant items were not recommended.
 * Recall@k: This is the proportion of relevant items that are in the top-k recommendations. It’s calculated as (number of recommended items that are relevant) / (total number of relevant items). High recall means that most of the relevant items were recommended, but it doesn’t take into account how many irrelevant items were also recommended.
 * F1-score@k: This is the harmonic mean of precision and recall, and it tries to balance these two metrics. An F1-score is high if both precision and recall are high.