# ANIME RECOMMENDER SYSTEM - DATA CLEANING
## Anime Recommender System based on content and collaborative filtering

### In this notebook, we process the dataset and export the cleaned data to new CSV files for use in the next step.

### Objective: 
- merge anime and anime_with_synopsis
- clean and process data
- export cleaned dataframe to csv files

## TABLE OF CONTENTS
- THE DATASET
- DATA CLEANING AND MERGING DATASET
    - merge anime and anime_with_synopsis
    - split comma separated list columns
    - handling data types, invalid, and null data
- EXPORT CLEANED DATA

In [1]:
# basic library
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# THE DATASET

The dataset is sourced from myanimelist.net (MAL), one of the biggest anime and manga community sites, where users can list, rate, and discuss anime and manga.

The dataset was scraped and uploaded to Kaggle by Hernan Valdivieso: https://www.kaggle.com/hernan4444/anime-recommendation-database-2020

![img](img/2_mal.png)

The dataset consists of 5 CSV files. However, I decided to use only 3 of them ('anime_with_synopsis.csv', 'anime.csv', 'rating_complete.csv'). I'm not using 'animelist.csv' because of its size and my computation limits, and I'm not using 'watching_status.csv' because it only encodes the watching status used in 'animelist.csv'.

Note that 'rating_complete.csv' is a subset of 'animelist.csv' that keeps only the rows where the user has completed the anime and given it a rating, whereas 'animelist.csv' also contains users who have not given a rating, or who are still watching or have dropped the anime. This makes 'rating_complete.csv' good enough for content-based and collaborative filtering recommender systems.

In [2]:
# load data
anime_with_synopsis = pd.read_csv('dataset/anime_with_synopsis.csv')
anime_df = pd.read_csv('dataset/anime.csv')
rating_df = pd.read_csv('dataset/rating_complete.csv')

In [3]:
anime_with_synopsis.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,sypnopsis
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ..."
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","Vash the Stampede is the man with a $$60,000,0..."
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",ches are individuals with special powers like ...
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",It is the dark century and the people are suff...


In [4]:
anime_with_synopsis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16214 entries, 0 to 16213
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   MAL_ID     16214 non-null  int64 
 1   Name       16214 non-null  object
 2   Score      16214 non-null  object
 3   Genres     16214 non-null  object
 4   sypnopsis  16206 non-null  object
dtypes: int64(1), object(4)
memory usage: 633.5+ KB


'anime_with_synopsis' is a dataset representing anime and their properties. It contains 5 columns/features:
- MAL_ID    : Unique anime identifier given by MAL
- Name      : Anime title
- Score     : Average score of the anime from user ratings
- Genres    : Comma separated list of the anime's genres
- sypnopsis : String with the synopsis of the anime (the column name is spelled this way in the source file)

In [5]:
anime_df.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,...,Score-10,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,...,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",Unknown,...,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",Trigun,トライガン,TV,26,"Apr 1, 1998 to Sep 30, 1998",Spring 1998,...,50229.0,75651.0,86142.0,49432.0,15376.0,5838.0,1965.0,664.0,316.0,533.0
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),TV,26,"Jul 2, 2002 to Dec 24, 2002",Summer 2002,...,2182.0,4806.0,10128.0,11618.0,5709.0,2920.0,1083.0,353.0,164.0,131.0
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",Beet the Vandel Buster,冒険王ビィト,TV,52,"Sep 30, 2004 to Sep 29, 2005",Fall 2004,...,312.0,529.0,1242.0,1713.0,1068.0,634.0,265.0,83.0,50.0,27.0


In [6]:
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17562 entries, 0 to 17561
Data columns (total 35 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MAL_ID         17562 non-null  int64 
 1   Name           17562 non-null  object
 2   Score          17562 non-null  object
 3   Genres         17562 non-null  object
 4   English name   17562 non-null  object
 5   Japanese name  17562 non-null  object
 6   Type           17562 non-null  object
 7   Episodes       17562 non-null  object
 8   Aired          17562 non-null  object
 9   Premiered      17562 non-null  object
 10  Producers      17562 non-null  object
 11  Licensors      17562 non-null  object
 12  Studios        17562 non-null  object
 13  Source         17562 non-null  object
 14  Duration       17562 non-null  object
 15  Rating         17562 non-null  object
 16  Ranked         17562 non-null  object
 17  Popularity     17562 non-null  int64 
 18  Members        17562 non-n

'anime_df' is a dataset representing anime and their properties. It contains 35 columns/features, summarized as:
- MAL_ID    : Unique anime identifier given by MAL
- Name      : Anime title
- Score     : Average score of the anime from user ratings
- Genres    : Comma separated list of the anime's genres
- English name  : Anime title in English
- Japanese name : Anime title in Japanese
- Type      : Anime type (TV, movie, OVA)
- Episodes  : Number of episodes
- Aired     : Broadcast date
- Premiered : Season in which the anime was first broadcast
- Producers : Comma separated list of the anime's producers
- Licensors : Comma separated list of the licensors/broadcast owners of the anime
- Studios   : Comma separated list of the studios that produced the anime
- Source    : Adaptation source of the anime (manga, light novel, etc.)
- Duration  : Duration of the anime per episode
- Rating    : Age rating (PG, G, etc.)
- Ranked    : Rank position based on Score
- Popularity: Popularity position based on the number of users who have added the anime to their list
- Members   : Number of community members of the anime
- Favorites : Number of users who added this anime to their favorites
- (anime watching status) number of users with a status of: Watching, On-Hold, Dropped, Plan to watch
- (score) number of users who gave the anime each score from 1 to 10

In [7]:
rating_df.head()

Unnamed: 0,user_id,anime_id,rating
0,0,430,9
1,0,1004,5
2,0,3010,7
3,0,570,7
4,0,2762,9


In [8]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57633278 entries, 0 to 57633277
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   user_id   int64
 1   anime_id  int64
 2   rating    int64
dtypes: int64(3)
memory usage: 1.3 GB


'rating_df' represents the ratings given by users to anime. It contains 3 columns/features:
- user_id   : Unique identifier for each user given by MAL
- anime_id  : ID of the anime that the user rated
- rating    : Rating given to the anime by the user (scale of 1-10)
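
Since the `info()` output above reports about 1.3 GB for three int64 columns, one option worth considering is storing them in narrower integer types. The following is a minimal sketch on a toy frame, not part of the original pipeline; the names `toy_ratings` and `compact_dtypes` are made up for illustration, and the choice of int32 for the ids is an assumption that should be checked against the real maximum id values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for rating_df (values copied from the head() output above).
toy_ratings = pd.DataFrame(
    {
        "user_id": [0, 0, 0, 0, 0],
        "anime_id": [430, 1004, 3010, 570, 2762],
        "rating": [9, 5, 7, 7, 9],
    },
    dtype="int64",
)

# Assumption: ratings are 1-10, so int8 is enough; ids are assumed to fit in int32.
compact_dtypes = {"user_id": "int32", "anime_id": "int32", "rating": "int8"}
toy_compact = toy_ratings.astype(compact_dtypes)

# Compare the memory used by the column data only (index excluded).
bytes_before = toy_ratings.memory_usage(index=False).sum()
bytes_after = toy_compact.memory_usage(index=False).sum()
print(bytes_before, bytes_after)
```

The same mapping could also be passed to `pd.read_csv` through its `dtype` parameter, so the large frame is never materialized as int64 in the first place.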

# DATA CLEANING
After loading and checking the dataset, we can see a problem: some countable/numerical columns have the object data type instead of a numerical one. Let's find out why and clean the data. There are also some comma separated list columns that we will process.

## JOINING anime_with_synopsis and anime_df
Both datasets represent anime data, so let's combine them into one full anime dataset.

In [9]:
anime_full = pd.merge(anime_df, anime_with_synopsis, on='MAL_ID', how='outer')
anime_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17562 entries, 0 to 17561
Data columns (total 39 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MAL_ID         17562 non-null  int64 
 1   Name_x         17562 non-null  object
 2   Score_x        17562 non-null  object
 3   Genres_x       17562 non-null  object
 4   English name   17562 non-null  object
 5   Japanese name  17562 non-null  object
 6   Type           17562 non-null  object
 7   Episodes       17562 non-null  object
 8   Aired          17562 non-null  object
 9   Premiered      17562 non-null  object
 10  Producers      17562 non-null  object
 11  Licensors      17562 non-null  object
 12  Studios        17562 non-null  object
 13  Source         17562 non-null  object
 14  Duration       17562 non-null  object
 15  Rating         17562 non-null  object
 16  Ranked         17562 non-null  object
 17  Popularity     17562 non-null  int64 
 18  Members        17562 non-n

In [10]:
# drop duplicated features (Name_y, Score_y, Genres_y)
anime_full = anime_full.drop(['Name_y', 'Score_y', 'Genres_y'], axis=1)

# rename columns (Name_x, Score_x, Genres_x)
anime_full = anime_full.rename(columns={"Name_x": "Name", "Score_x": "Score", "Genres_x": "Genres"})
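
As an alternative to merging everything and then dropping the `_y` columns, one could merge only the column that is actually new. The sketch below uses toy frames (`toy_anime`, `toy_synopsis`) with placeholder synopsis text. It assumes that every `MAL_ID` in the synopsis file also appears in the main file, which is what the unchanged row count of 17562 after the outer merge suggests, so a left join is used here instead of an outer join.

```python
import pandas as pd

# Toy stand-ins for anime_df and anime_with_synopsis.
toy_anime = pd.DataFrame({
    "MAL_ID": [1, 5, 6],
    "Name": ["Cowboy Bebop", "Cowboy Bebop: Tengoku no Tobira", "Trigun"],
    "Score": ["8.78", "8.39", "8.24"],
})
toy_synopsis = pd.DataFrame({
    "MAL_ID": [1, 6],
    "Name": ["Cowboy Bebop", "Trigun"],
    "Score": ["8.78", "8.24"],
    "sypnopsis": ["placeholder synopsis A", "placeholder synopsis B"],
})

# Select only the key and the new column, so no Name_x/Name_y pairs are created.
# validate="one_to_one" raises if MAL_ID is duplicated on either side.
toy_full = toy_anime.merge(
    toy_synopsis[["MAL_ID", "sypnopsis"]],
    on="MAL_ID",
    how="left",
    validate="one_to_one",
)
print(toy_full.columns.tolist())
```

Rows without a matching synopsis (here `MAL_ID` 5) simply get NaN in `sypnopsis`, and no rename step is needed afterwards.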

## Split comma separated list columns
The comma separated list columns are: 'Genres', 'Producers', 'Licensors', 'Studios'

In [11]:
# import itertools
# import collections

# # function to split:
# def split(col):
#     anime_full[col].replace({'Unknown': np.nan}, inplace=True)
#     anime_full[col] = anime_full[col].fillna('')

#     # split
#     anime_full[col] = anime_full[col].apply(lambda x: x.split(', '))

In [12]:
# col_list = ['Genres', 'Producers', 'Licensors', 'Studios']
# for i in col_list:
#     split(i)
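
For reference, here is a small sketch of how such a split could look with pandas string methods. It runs on a toy frame rather than on `anime_full`, and the row with `anime_id` 999 is a made-up example of an 'Unknown' entry; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "anime_id": [1, 5, 999],
    "Genres": ["Action, Adventure, Comedy", "Action, Drama", "Unknown"],
})

# Treat 'Unknown' as missing, then split on ", " into Python lists.
genres = toy["Genres"].replace("Unknown", np.nan)
toy["genre_list"] = genres.str.split(", ")

# Long form: one row per (anime, genre) pair; missing genres are dropped.
long_form = toy[["anime_id", "genre_list"]].explode("genre_list").dropna()

# Multi-hot form: one indicator column per genre, built from the delimited string.
multi_hot = genres.str.get_dummies(sep=", ")
print(multi_hot.columns.tolist())
```

The long form is convenient for counting genres, while the multi-hot form is closer to what a content-based similarity computation would typically consume.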

In [13]:
anime_full.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,...,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1,sypnopsis
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,...,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0,"In the year 2071, humanity has colonized sever..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",Unknown,...,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0,"other day, another bounty—such is the life of ..."
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",Trigun,トライガン,TV,26,"Apr 1, 1998 to Sep 30, 1998",Spring 1998,...,75651.0,86142.0,49432.0,15376.0,5838.0,1965.0,664.0,316.0,533.0,"Vash the Stampede is the man with a $$60,000,0..."
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),TV,26,"Jul 2, 2002 to Dec 24, 2002",Summer 2002,...,4806.0,10128.0,11618.0,5709.0,2920.0,1083.0,353.0,164.0,131.0,ches are individuals with special powers like ...
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",Beet the Vandel Buster,冒険王ビィト,TV,52,"Sep 30, 2004 to Sep 29, 2005",Fall 2004,...,529.0,1242.0,1713.0,1068.0,634.0,265.0,83.0,50.0,27.0,It is the dark century and the people are suff...


## CLEANING DATA TYPES AND INVALID VALUES
There are some odd data types. For example, 'Score' in anime_full should be numeric (float) instead of string/object, since the data is numerical. I suspect a string is being used to represent NULL, so let's check.

In [14]:
print(sorted(pd.unique(anime_full['Score'])))

['1.85', '2.01', '2.18', '2.23', '2.26', '2.3', '2.35', '2.5', '2.61', '2.63', '2.66', '2.77', '2.78', '3.02', '3.09', '3.1', '3.15', '3.19', '3.23', '3.24', '3.26', '3.27', '3.28', '3.32', '3.35', '3.38', '3.39', '3.41', '3.42', '3.47', '3.51', '3.52', '3.55', '3.58', '3.6', '3.61', '3.67', '3.69', '3.72', '3.76', '3.79', '3.8', '3.81', '3.82', '3.84', '3.87', '3.89', '3.91', '3.92', '3.96', '4.0', '4.01', '4.04', '4.05', '4.06', '4.08', '4.12', '4.13', '4.16', '4.17', '4.18', '4.19', '4.2', '4.21', '4.22', '4.24', '4.25', '4.26', '4.28', '4.29', '4.31', '4.33', '4.34', '4.35', '4.36', '4.37', '4.38', '4.39', '4.4', '4.41', '4.42', '4.43', '4.44', '4.46', '4.47', '4.48', '4.49', '4.5', '4.51', '4.52', '4.53', '4.54', '4.56', '4.57', '4.58', '4.59', '4.6', '4.61', '4.62', '4.63', '4.64', '4.65', '4.66', '4.67', '4.68', '4.69', '4.7', '4.71', '4.72', '4.73', '4.74', '4.75', '4.76', '4.77', '4.78', '4.79', '4.8', '4.81', '4.82', '4.83', '4.84', '4.85', '4.86', '4.87', '4.88', '4.89', '4.

As expected, we found a string: 'Unknown'.
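
Scanning a long sorted list by eye works, but the offending values can also be isolated directly. The sketch below is one possible way, shown on a toy Series (`toy_score`) rather than on the real column: parse with `errors="coerce"` and keep only the entries that failed to parse.

```python
import pandas as pd

# Toy stand-in for anime_full['Score'] before cleaning.
toy_score = pd.Series(["8.78", "8.39", "Unknown", "6.98", "Unknown"])

# Values that cannot be parsed as numbers become NaN.
parsed = pd.to_numeric(toy_score, errors="coerce")

# Keep entries that were present in the input but failed to parse, and count them.
bad_tokens = toy_score[parsed.isna() & toy_score.notna()].value_counts()
print(bad_tokens)
```

Applied to a real column, this would list every non-numeric token with its frequency, which helps confirm that 'Unknown' is the only placeholder in use.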

In [15]:
def toNum(col):
    # replace 'Unknown' with NaN; assigning the result back avoids
    # relying on an inplace change to a column selected from the frame
    cleaned = anime_full[col].replace('Unknown', np.nan)

    # convert to float
    anime_full[col] = pd.to_numeric(cleaned, downcast='float')

In [16]:
col_list = ['Score', 'Episodes', 'Ranked']
for i in col_list:
    toNum(i)

There is still a lot of data containing the string 'Unknown'; let's convert those values to NaN.

In [17]:
# replace every remaining 'Unknown' with NaN across all columns
anime_full = anime_full.replace('Unknown', np.nan)
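
Another way to handle this placeholder, sketched here under assumptions rather than taken from the notebook, is to declare it as a missing-value marker when the file is read. The in-memory CSV below is a toy with a made-up second row; `na_values` is a standard `pd.read_csv` parameter.

```python
import io
import pandas as pd

# Toy CSV; the second data row is hypothetical and only shows the placeholder.
toy_csv = io.StringIO(
    "MAL_ID,Name,Score,Episodes\n"
    "1,Cowboy Bebop,8.78,26\n"
    "99999,Hypothetical Title,Unknown,Unknown\n"
)

# 'Unknown' is parsed as NaN, so numeric columns get a numeric dtype directly.
toy_df = pd.read_csv(toy_csv, na_values=["Unknown"])
print(toy_df.dtypes)
```

Note that a list passed to `na_values` applies to every column, so a title literally named 'Unknown' would also become NaN; passing a dict keyed by column name restricts the rule to selected columns.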

'sypnopsis' has a similar problem with a string representing null data: 'No synopsis information has been added to this title. Help improve our database by adding a synopsis here .'. Let's convert it to NaN as well.

In [18]:
anime_full['sypnopsis'] = anime_full['sypnopsis'].replace('No synopsis information has been added to this title. Help improve our database by adding a synopsis here .', np.nan)
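
An exact-match replacement misses the placeholder if its whitespace or trailing text differs slightly. As a hedged sketch, the toy example below matches on the leading sentence instead; the third entry is a hypothetical variant invented for illustration, and `PLACEHOLDER_PREFIX` is a name introduced here, not something from the dataset.

```python
import numpy as np
import pandas as pd

PLACEHOLDER_PREFIX = "No synopsis information has been added to this title."

toy_syn = pd.Series([
    "In the year 2071, humanity has colonized sever...",
    "No synopsis information has been added to this title. "
    "Help improve our database by adding a synopsis here .",
    "  No synopsis information has been added to this title.  ",  # hypothetical variant
    None,
])

# Strip surrounding whitespace, then flag entries that start with the placeholder.
# na=False makes missing entries count as "not a placeholder".
is_placeholder = toy_syn.str.strip().str.startswith(PLACEHOLDER_PREFIX, na=False)

# Replace flagged entries with NaN and keep everything else unchanged.
cleaned = toy_syn.mask(is_placeholder, np.nan)
print(cleaned.isna().tolist())
```

Whether this looser match is needed depends on the data; counting `is_placeholder.sum()` against the exact-match count would show if any variants exist.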

In [19]:
# renaming MAL_ID to anime_id on anime_full
anime_full.rename(columns={'MAL_ID': 'anime_id'}, inplace=True)
# renaming Rating to age_rating to avoid confusion with rating (user rating)
anime_full.rename(columns={'Rating': 'age_rating'}, inplace=True)

In [20]:
# create new anime df with a subset of the most important columns (to decrease computation)
feat = ['anime_id', 'Name', 'Score', 'Genres', 'Type', 'Studios', 'Source', 'age_rating', 'sypnopsis'] 

anime_subset = anime_full[feat].copy()
anime_subset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17562 entries, 0 to 17561
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   anime_id    17562 non-null  int64  
 1   Name        17562 non-null  object 
 2   Score       12421 non-null  float32
 3   Genres      17499 non-null  object 
 4   Type        17525 non-null  object 
 5   Studios     10483 non-null  object 
 6   Source      13995 non-null  object 
 7   age_rating  16874 non-null  object 
 8   sypnopsis   15497 non-null  object 
dtypes: float32(1), int64(1), object(7)
memory usage: 1.3+ MB


In [21]:
# new anime df with only titles and id (to decrease computation)
feat = ['anime_id', 'Name']

anime_titles = anime_subset[feat].copy()
anime_titles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17562 entries, 0 to 17561
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   anime_id  17562 non-null  int64 
 1   Name      17562 non-null  object
dtypes: int64(1), object(1)
memory usage: 411.6+ KB


# EXPORT CLEANED DATA
Export the cleaned data to new CSV files.

In [22]:
anime_full.to_csv('dataset/exported_dataset/anime_full.csv', index=False)

In [23]:
anime_subset.to_csv('dataset/exported_dataset/anime_subset.csv', index=False)

In [24]:
anime_titles.to_csv('dataset/exported_dataset/anime_titles.csv', index=False)

In [25]:
rating_df.to_csv('dataset/exported_dataset/rating_df.csv', index=False)
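
Because CSV stores no type information, it can be worth checking what the next notebook will actually get back. The sketch below is a toy round trip using a temporary directory, with made-up titles; it is not run against the real export paths.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for anime_subset, with a float32 column that contains a NaN.
toy_subset = pd.DataFrame({
    "anime_id": [1, 2],
    "Name": ["Title A", "Title B"],
    "Score": np.array([8.78, np.nan], dtype="float32"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_subset.csv")
    toy_subset.to_csv(out_path, index=False)  # NaN is written as an empty field
    reloaded = pd.read_csv(out_path)          # empty fields are read back as NaN

print(reloaded.dtypes)
```

In this sketch the missing value survives the round trip, but the float32 downcast does not: the reloaded `Score` column comes back as float64 unless a `dtype` is given to `pd.read_csv`.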