# Exploratory Data Analysis (EDA)

## Introduction

The main goal of this project is to build a web app, **Top 10 Animes Across Years & Recommendations**. The app is meant to guide people who are interested in Japanese animation and are looking for a place to start watching.

Building the app requires anime data. In this part, we perform **EDA** on the collected data.


*Note: Throughout the project, "anime" means Japanese animation.*

# Imports

### Import necessary libraries and packages

To start the EDA, we first import the relevant Python libraries and modules.

In [183]:
import pandas as pd
import numpy as np
import ast

# setting max columns display to None in order to display all columns in the data
pd.set_option('display.max_columns', None)

### Load the dataset

The data contains details about animes like title, score, genres, studio, year, season, etc.

The dataset `anime_data_2006_2022.csv` is loaded as `animes`.

In [184]:
animes = pd.read_csv('./anime_data_2006_2022.csv')

# Data Discovering

Let's view the first five rows of `animes`.

In [185]:
animes.head()

Unnamed: 0,mal_id,url,approved,titles,title,title_english,title_japanese,title_synonyms,type,source,episodes,status,airing,duration,rating,score,scored_by,rank,popularity,members,favorites,synopsis,background,season,year,producers,licensors,studios,genres,explicit_genres,themes,demographics,images.jpg.image_url,images.jpg.small_image_url,images.jpg.large_image_url,images.webp.image_url,images.webp.small_image_url,images.webp.large_image_url,trailer.youtube_id,trailer.url,trailer.embed_url,trailer.images.image_url,trailer.images.small_image_url,trailer.images.medium_image_url,trailer.images.large_image_url,trailer.images.maximum_image_url,aired.from,aired.to,aired.prop.from.day,aired.prop.from.month,aired.prop.from.year,aired.prop.to.day,aired.prop.to.month,aired.prop.to.year,aired.string,broadcast.day,broadcast.time,broadcast.timezone,broadcast.string
0,853,https://myanimelist.net/anime/853/Ouran_Koukou...,True,"[{'type': 'Default', 'title': 'Ouran Koukou Ho...",Ouran Koukou Host Club,Ouran High School Host Club,桜蘭高校ホスト部,"['Ohran Koko Host Club', 'Ouran Koukou Hosutob...",TV,Manga,26.0,Finished Airing,False,23 min per ep,PG-13 - Teens 13 or older,8.16,672279.0,413.0,128,1110503,34094,Haruhi Fujioka is a studious girl who has rece...,In addition to this anime and the source manga...,spring,2006.0,"[{'mal_id': 29, 'type': 'anime', 'name': 'VAP'...","[{'mal_id': 102, 'type': 'anime', 'name': 'Fun...","[{'mal_id': 4, 'type': 'anime', 'name': 'Bones...","[{'mal_id': 4, 'type': 'anime', 'name': 'Comed...",[],"[{'mal_id': 81, 'type': 'anime', 'name': 'Cros...","[{'mal_id': 25, 'type': 'anime', 'name': 'Shou...",https://cdn.myanimelist.net/images/anime/2/719...,https://cdn.myanimelist.net/images/anime/2/719...,https://cdn.myanimelist.net/images/anime/2/719...,https://cdn.myanimelist.net/images/anime/2/719...,https://cdn.myanimelist.net/images/anime/2/719...,https://cdn.myanimelist.net/images/anime/2/719...,NcC5VCE2Its,https://www.youtube.com/watch?v=NcC5VCE2Its,https://www.youtube.com/embed/NcC5VCE2Its?enab...,https://img.youtube.com/vi/NcC5VCE2Its/default...,https://img.youtube.com/vi/NcC5VCE2Its/sddefau...,https://img.youtube.com/vi/NcC5VCE2Its/mqdefau...,https://img.youtube.com/vi/NcC5VCE2Its/hqdefau...,https://img.youtube.com/vi/NcC5VCE2Its/maxresd...,2006-04-05T00:00:00+00:00,2006-09-27T00:00:00+00:00,5,4,2006,27.0,9.0,2006.0,"Apr 5, 2006 to Sep 27, 2006",Wednesdays,00:50,Asia/Tokyo,Wednesdays at 00:50 (JST)
1,918,https://myanimelist.net/anime/918/Gintama,True,"[{'type': 'Default', 'title': 'Gintama'}, {'ty...",Gintama,Gintama,銀魂,"['Gin Tama', 'Silver Soul', 'Yorinuki Gintama-...",TV,Manga,201.0,Finished Airing,False,24 min per ep,PG-13 - Teens 13 or older,8.94,391827.0,15.0,139,1052162,58905,Edo is a city that was home to the vigor and a...,Several games based on Gintama have been relea...,spring,2006.0,"[{'mal_id': 16, 'type': 'anime', 'name': 'TV T...","[{'mal_id': 376, 'type': 'anime', 'name': 'Sen...","[{'mal_id': 14, 'type': 'anime', 'name': 'Sunr...","[{'mal_id': 1, 'type': 'anime', 'name': 'Actio...",[],"[{'mal_id': 57, 'type': 'anime', 'name': 'Gag ...","[{'mal_id': 27, 'type': 'anime', 'name': 'Shou...",https://cdn.myanimelist.net/images/anime/10/73...,https://cdn.myanimelist.net/images/anime/10/73...,https://cdn.myanimelist.net/images/anime/10/73...,https://cdn.myanimelist.net/images/anime/10/73...,https://cdn.myanimelist.net/images/anime/10/73...,https://cdn.myanimelist.net/images/anime/10/73...,,,,,,,,,2006-04-04T00:00:00+00:00,2010-03-25T00:00:00+00:00,4,4,2006,25.0,3.0,2010.0,"Apr 4, 2006 to Mar 25, 2010",Thursdays,18:00,Asia/Tokyo,Thursdays at 18:00 (JST)
2,889,https://myanimelist.net/anime/889/Black_Lagoon,True,"[{'type': 'Default', 'title': 'Black Lagoon'},...",Black Lagoon,Black Lagoon,BLACK LAGOON,[],TV,Manga,12.0,Finished Airing,False,23 min per ep,R - 17+ (violence & profanity),8.03,487522.0,573.0,155,993437,18516,Salaryman Rokurou Okajima spends his days tryi...,Black Lagoon was released on DVD by Geneon Ent...,spring,2006.0,"[{'mal_id': 31, 'type': 'anime', 'name': 'Gene...","[{'mal_id': 102, 'type': 'anime', 'name': 'Fun...","[{'mal_id': 11, 'type': 'anime', 'name': 'Madh...","[{'mal_id': 1, 'type': 'anime', 'name': 'Actio...",[],"[{'mal_id': 50, 'type': 'anime', 'name': 'Adul...","[{'mal_id': 42, 'type': 'anime', 'name': 'Sein...",https://cdn.myanimelist.net/images/anime/1906/...,https://cdn.myanimelist.net/images/anime/1906/...,https://cdn.myanimelist.net/images/anime/1906/...,https://cdn.myanimelist.net/images/anime/1906/...,https://cdn.myanimelist.net/images/anime/1906/...,https://cdn.myanimelist.net/images/anime/1906/...,d4EbGC7fKnQ,https://www.youtube.com/watch?v=d4EbGC7fKnQ,https://www.youtube.com/embed/d4EbGC7fKnQ?enab...,https://img.youtube.com/vi/d4EbGC7fKnQ/default...,https://img.youtube.com/vi/d4EbGC7fKnQ/sddefau...,https://img.youtube.com/vi/d4EbGC7fKnQ/mqdefau...,https://img.youtube.com/vi/d4EbGC7fKnQ/hqdefau...,https://img.youtube.com/vi/d4EbGC7fKnQ/maxresd...,2006-04-09T00:00:00+00:00,2006-06-25T00:00:00+00:00,9,4,2006,25.0,6.0,2006.0,"Apr 9, 2006 to Jun 25, 2006",Sundays,02:35,Asia/Tokyo,Sundays at 02:35 (JST)
3,849,https://myanimelist.net/anime/849/Suzumiya_Har...,True,"[{'type': 'Default', 'title': 'Suzumiya Haruhi...",Suzumiya Haruhi no Yuuutsu,The Melancholy of Haruhi Suzumiya,涼宮ハルヒの憂鬱,['Suzumiya Haruhi no Yuuutsu'],TV,Light novel,14.0,Finished Airing,False,23 min per ep,PG-13 - Teens 13 or older,7.83,475447.0,888.0,179,909771,15321,If a survey were conducted to see if people be...,Suzumiya Haruhi no Yuuutsu aired in a nonlinea...,spring,2006.0,"[{'mal_id': 104, 'type': 'anime', 'name': 'Lan...","[{'mal_id': 102, 'type': 'anime', 'name': 'Fun...","[{'mal_id': 2, 'type': 'anime', 'name': 'Kyoto...","[{'mal_id': 46, 'type': 'anime', 'name': 'Awar...",[],"[{'mal_id': 23, 'type': 'anime', 'name': 'Scho...",[],https://cdn.myanimelist.net/images/anime/1470/...,https://cdn.myanimelist.net/images/anime/1470/...,https://cdn.myanimelist.net/images/anime/1470/...,https://cdn.myanimelist.net/images/anime/1470/...,https://cdn.myanimelist.net/images/anime/1470/...,https://cdn.myanimelist.net/images/anime/1470/...,,,,,,,,,2006-04-03T00:00:00+00:00,2006-07-03T00:00:00+00:00,3,4,2006,3.0,7.0,2006.0,"Apr 3, 2006 to Jul 3, 2006",Mondays,00:00,Asia/Tokyo,Mondays at 00:00 (JST)
4,934,https://myanimelist.net/anime/934/Higurashi_no...,True,"[{'type': 'Default', 'title': 'Higurashi no Na...",Higurashi no Naku Koro ni,Higurashi: When They Cry,ひぐらしのなく頃に,"['When the Cicadas Cry', 'The Moment the Cicad...",TV,Visual novel,26.0,Finished Airing,False,24 min per ep,R - 17+ (violence & profanity),7.88,391899.0,809.0,229,797940,21247,Keiichi Maebara has just moved to the quiet li...,Geneon Entertainment USA initially licensed an...,spring,2006.0,"[{'mal_id': 31, 'type': 'anime', 'name': 'Gene...","[{'mal_id': 376, 'type': 'anime', 'name': 'Sen...","[{'mal_id': 37, 'type': 'anime', 'name': 'Stud...","[{'mal_id': 14, 'type': 'anime', 'name': 'Horr...",[],"[{'mal_id': 58, 'type': 'anime', 'name': 'Gore...",[],https://cdn.myanimelist.net/images/anime/12/19...,https://cdn.myanimelist.net/images/anime/12/19...,https://cdn.myanimelist.net/images/anime/12/19...,https://cdn.myanimelist.net/images/anime/12/19...,https://cdn.myanimelist.net/images/anime/12/19...,https://cdn.myanimelist.net/images/anime/12/19...,,,,,,,,,2006-04-05T00:00:00+00:00,2006-09-27T00:00:00+00:00,5,4,2006,27.0,9.0,2006.0,"Apr 5, 2006 to Sep 27, 2006",Wednesdays,01:30,Asia/Tokyo,Wednesdays at 01:30 (JST)


Check the shape of the data.

In [186]:
animes.shape

(16337, 59)

List the columns from the data.

In [187]:
animes.columns

Index(['mal_id', 'url', 'approved', 'titles', 'title', 'title_english',
       'title_japanese', 'title_synonyms', 'type', 'source', 'episodes',
       'status', 'airing', 'duration', 'rating', 'score', 'scored_by', 'rank',
       'popularity', 'members', 'favorites', 'synopsis', 'background',
       'season', 'year', 'producers', 'licensors', 'studios', 'genres',
       'explicit_genres', 'themes', 'demographics', 'images.jpg.image_url',
       'images.jpg.small_image_url', 'images.jpg.large_image_url',
       'images.webp.image_url', 'images.webp.small_image_url',
       'images.webp.large_image_url', 'trailer.youtube_id', 'trailer.url',
       'trailer.embed_url', 'trailer.images.image_url',
       'trailer.images.small_image_url', 'trailer.images.medium_image_url',
       'trailer.images.large_image_url', 'trailer.images.maximum_image_url',
       'aired.from', 'aired.to', 'aired.prop.from.day',
       'aired.prop.from.month', 'aired.prop.from.year', 'aired.prop.to.day',
       'ai

# Data Cleaning

### Dropping the unnecessary columns

Remove irrelevant columns from `animes`.

In [188]:
columns_to_drop = [
    'approved',
    'titles',
    'year',
    'explicit_genres',
    'images.jpg.small_image_url',
    'images.jpg.large_image_url',
    'images.webp.image_url',
    'images.webp.small_image_url',
    'images.webp.large_image_url',
    'trailer.youtube_id',
    'trailer.embed_url',
    'trailer.images.image_url',
    'trailer.images.small_image_url',
    'trailer.images.medium_image_url',
    'trailer.images.large_image_url',
    'trailer.images.maximum_image_url',
    'aired.from',
    'aired.to',
    'aired.prop.from.day',
    'aired.prop.from.month',
    'aired.prop.to.day',
    'aired.prop.to.month',
    'aired.prop.to.year',
    'broadcast.day', 
    'broadcast.time', 
    'broadcast.timezone'
]

animes.drop(columns=columns_to_drop, inplace=True)

### Renaming Columns

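Let's rename some columns so they are easier to read. Note that `aired.prop.from.year` becomes the new `year` column, replacing the original `year` column dropped above.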

In [189]:
col_mapping = {
    'images.jpg.image_url': 'image_url',
    'trailer.url': 'trailer_url',
    'aired.prop.from.year' : 'year',
    'aired.string': 'aired_date',
    'broadcast.string': 'broadcast_day_and_time'
}

animes.rename(columns=col_mapping, inplace=True)

### Finding numeric columns

Check which columns are stored as `float64`. Integer columns such as `mal_id` are numeric too, but they are not selected here.

In [190]:
animes.select_dtypes(include='float64').columns

Index(['episodes', 'score', 'scored_by', 'rank'], dtype='object')
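
The selection above is limited to `float64`. As a minimal sketch, assuming we wanted every numeric column instead, `select_dtypes(include='number')` also picks up the integer columns. The toy frame below only mirrors a few column names from the dataset and is not part of the notebook's pipeline.

```python
import pandas as pd

# Toy frame standing in for `animes`; only a handful of columns are mirrored.
sample = pd.DataFrame({
    'mal_id': [853, 918],
    'title': ['Ouran High School Host Club', 'Gintama'],
    'score': [8.16, 8.94],
    'airing': [False, False],
})

# float64 only, as in the cell above
float_cols = sample.select_dtypes(include='float64').columns.tolist()
# all numeric dtypes (integers and floats; booleans are not included)
numeric_cols = sample.select_dtypes(include='number').columns.tolist()

print(float_cols)
print(numeric_cols)
```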

In [191]:
# float_cols = ['episodes', 'scored_by', 'rank']

# for col in float_cols:
#     # animes[col] = animes[col].astype(int)

#     # second method won't work as there are NaN values in the columns
#     # animes[col] = pd.to_numeric(animes[col], errors='coerce').astype(int)

#     # animes[col] = animes[col].fillna(0).astype(int)
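
The conversion above is left commented out because a plain integer column cannot hold `NaN`. One possible alternative, which the notebook itself does not use, is pandas' nullable `Int64` dtype. The sketch below assumes that dtype is acceptable downstream; the series is a made-up stand-in for the `episodes` column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the `episodes` column, including a missing value.
episodes = pd.Series([26.0, 201.0, np.nan], name='episodes')

# 'Int64' (capital I) is the nullable integer dtype: NaN becomes <NA>
# instead of forcing the whole column to float or filling with 0.
episodes_int = episodes.astype('Int64')

print(episodes_int)
```

Compared with `fillna(0)`, this keeps "unknown" distinct from a real value of zero.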

### Modifying values in some columns

Some columns hold their values as the string representation of a list of dictionaries.

We convert each of these into a comma-separated string that is easier to read and use.

Function: `custom_str_list_to_desrired_str(value, key):`
- The function takes two parameters: `value` and `key`.
    - `value` is the cell value from the column being converted.
    - `key` is the dictionary key whose value is extracted from each dictionary in `value`.
- The function parses the `value` string into an actual list and returns the `key` values joined into one comma-separated string.

In [192]:
def custom_str_list_to_desrired_str(value, key):
    # parse the string representation into an actual list of dictionaries
    data_list = ast.literal_eval(value)
    # take the requested key from every dictionary and join the values with commas
    return ','.join(item[key] for item in data_list)
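
To make the transformation concrete, here is a small worked example. The input string imitates the shape of a `genres` cell seen in the preview; the exact dictionaries are illustrative rather than copied from the dataset.

```python
import ast

def custom_str_list_to_desrired_str(value, key):
    # parse the string into a list of dicts, then join the chosen key's values
    data_list = ast.literal_eval(value)
    return ','.join(item[key] for item in data_list)

# Illustrative cell value: a list of dictionaries stored as one string.
raw = "[{'mal_id': 4, 'type': 'anime', 'name': 'Comedy'}, {'mal_id': 22, 'type': 'anime', 'name': 'Romance'}]"

names = custom_str_list_to_desrired_str(raw, 'name')
empty = custom_str_list_to_desrired_str('[]', 'name')

print(names)
print(repr(empty))
```

An empty list yields an empty string, which is why some columns contain `''` after this step.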

Apply the `custom_str_list_to_desrired_str()` function to the columns that need converting.

In [193]:
need_to_modify_columns = [
    'producers',
    'licensors',
    'studios',
    'genres',
    'themes',
    'demographics'
]

for col in need_to_modify_columns:
    animes[col] = animes[col].apply(custom_str_list_to_desrired_str, args=('name', ))

# title_synonyms is a list of plain strings, so the parsed list can be joined directly
animes['title_synonyms'] = animes['title_synonyms'].apply(lambda titles: ','.join(ast.literal_eval(titles)))

### Checking total null values

Let's count the `NaN` values in every column that has at least one.

In [194]:
columns_with_nulls = animes.columns[animes.isnull().any()]
col_sum = []
for col in columns_with_nulls:
    col_sum.append(animes[col].isnull().sum())

null_having_columns = list(zip(columns_with_nulls, col_sum))
null_having_columns

null_having_columns_df = pd.DataFrame(null_having_columns, columns=['column_name', 'total_null_values'])
null_having_columns_df.set_index('column_name', inplace=True)
null_having_columns_df.sort_values(by='total_null_values' ,ascending=False, inplace=True)
null_having_columns_df

column_name,total_null_values
background,14489
season,12676
trailer_url,12597
broadcast_day_and_time,11304
title_english,9202
score,5550
scored_by,5550
rank,3607
synopsis,3091
rating,147
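
The same summary can be built more compactly. The following is a sketch of one equivalent approach on a made-up frame, not the notebook's own code: `isnull().sum()` counts nulls for every column at once, and the result is then filtered and sorted.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values.
sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'score': [8.1, np.nan, np.nan],
    'rating': ['PG-13', None, 'G'],
})

# Count nulls per column, keep only columns that have any, largest first.
null_counts = sample.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

# Shape the result like the table above.
null_df = (
    null_counts.rename('total_null_values')
    .rename_axis('column_name')
    .to_frame()
)
print(null_df)
```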


### Filling empty strings

Some columns need a value in every row because they will be used as categories to filter the data.

Here, we replace empty strings with `Others`.

In [195]:
cols_having_empty_string = ['producers', 'licensors', 'studios', 'genres', 'themes', 'demographics']
animes[cols_having_empty_string] = animes[cols_having_empty_string].replace('', 'Others')
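
As a small illustration on made-up data, `replace('', 'Others')` only swaps cells whose entire value is the empty string; non-empty cells are left untouched.

```python
import pandas as pd

# Made-up frame: empty strings mark rows with no genre or theme.
sample = pd.DataFrame({
    'genres': ['Comedy,Romance', ''],
    'themes': ['', 'School'],
})

cols = ['genres', 'themes']
sample[cols] = sample[cols].replace('', 'Others')

print(sample)
```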

### Handling missing values

Many values in the `season` column are missing (12,676 nulls in the table above). Therefore, we recompute the season of every row from the start date in the `aired_date` column and overwrite `season`.

In [196]:
# extracting from aired_date column to get start_date
date_pattern = r'(\w{3} \d{1,2}, \d{4})'
animes['start_date'] = animes['aired_date'].str.extract(date_pattern)
animes['start_date'] = pd.to_datetime(animes['start_date'])

def get_season(month):
    if 4 <= month <= 6:
        return 'Spring'
    elif 7 <= month <= 9:
        return 'Summer'
    elif 10 <= month <= 12:
        return 'Fall'
    else:
        return 'Winter'

# Apply the function to update a 'season' column
animes['season'] = animes['start_date'].dt.month.apply(get_season)
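
One caveat: every comparison with `NaN` is false, so `get_season` as written returns `'Winter'` for rows whose start date could not be parsed. The sketch below shows a hypothetical variant, `get_season_or_nan`, that keeps such rows missing instead. It is an assumption about what might be preferable, not the notebook's behavior, and the third sample string is made up to represent an `aired_date` without a full date.

```python
import numpy as np
import pandas as pd

def get_season_or_nan(month):
    # Hypothetical variant of get_season: an unknown month stays missing.
    if pd.isna(month):
        return np.nan
    if 4 <= month <= 6:
        return 'Spring'
    elif 7 <= month <= 9:
        return 'Summer'
    elif 10 <= month <= 12:
        return 'Fall'
    return 'Winter'

# Two parsable strings and one made-up string with no full date.
aired = pd.Series(['Apr 5, 2006 to Sep 27, 2006', 'Jan 10, 2007', '2008 to ?'])

# Same regex as above; expand=False returns a Series rather than a DataFrame.
start = pd.to_datetime(
    aired.str.extract(r'(\w{3} \d{1,2}, \d{4})', expand=False),
    format='%b %d, %Y',
)
seasons = start.dt.month.apply(get_season_or_nan)

print(seasons)
```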

### Changing title value

We want `title` to hold the English title whenever `title_english` has a value; otherwise `title` is left as it is.

In [197]:
animes['title'] = np.where(
    animes['title_english'].notnull(),
    animes['title_english'],
    animes['title']
)
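
For comparison, the same result can be obtained with `fillna`, which reads as "use the English title, falling back to the original title". This is a sketch of an equivalent formulation on a two-row sample, not a change to the notebook.

```python
import numpy as np
import pandas as pd

# Two-row sample: one row has an English title, the other does not.
sample = pd.DataFrame({
    'title': ['Ouran Koukou Host Club', 'Major S2'],
    'title_english': ['Ouran High School Host Club', np.nan],
})

# Formulation used above.
via_where = np.where(
    sample['title_english'].notnull(),
    sample['title_english'],
    sample['title'],
)

# Equivalent formulation: fill missing English titles from `title`.
via_fillna = sample['title_english'].fillna(sample['title'])

print(list(via_where))
print(via_fillna.tolist())
```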

# Data Structuring

To improve readability, the `year` column is moved so that it sits directly before the `season` column.

In [198]:
animes.insert(animes.columns.get_loc('season'), 'year', animes.pop('year'))

Some rows have a `year` of `2005`. We filter those rows out because they fall outside the project range (2006 to 2022).

In [199]:
animes[animes['year'] == 2005]

Unnamed: 0,mal_id,url,title,title_english,title_japanese,title_synonyms,type,source,episodes,status,airing,duration,rating,score,scored_by,rank,popularity,members,favorites,synopsis,background,year,season,producers,licensors,studios,genres,themes,demographics,image_url,trailer_url,aired_date,broadcast_day_and_time,start_date
377,558,https://myanimelist.net/anime/558/Major_S2,Major S2,,メジャー (第2シリーズ),,TV,Manga,26.0,Finished Airing,False,25 min per ep,PG-13 - Teens 13 or older,8.19,45031.0,373.0,2506,71263,315,Gorou Honda has finally returned to Mifune Eas...,,2005,Fall,"Shogakukan-Shueisha Productions,NHK",Others,Studio Hibari,Sports,Team Sports,Shounen,https://cdn.myanimelist.net/images/anime/6/739...,,"Dec 10, 2005 to Jun 10, 2006",Unknown,2005-12-10
422,2285,https://myanimelist.net/anime/2285/DICE,D.I.C.E.,,ディノブレイカー,"DICE,DNA Integrated Cybernetic Enterprises,DIN...",TV,Original,40.0,Finished Airing,False,25 min per ep,PG - Children,6.34,2389.0,7550.0,8361,4940,19,DICE (DNA Integrated Cybernetic Enterprises) i...,,2005,Fall,Others,Bandai Entertainment,Xebec,"Action,Adventure,Sci-Fi",Mecha,Kids,https://cdn.myanimelist.net/images/anime/1/252...,,"Dec 6, 2005 to Sep 19, 2006",Unknown,2005-12-06
473,27451,https://myanimelist.net/anime/27451/Porong_Por...,Pororo the Little Penguin 2,Pororo the Little Penguin 2,뽀롱뽀롱 뽀로로 2,,TV,Original,52.0,Finished Airing,False,5 min per ep,G - All Ages,5.71,215.0,10499.0,15262,552,0,A direct continuation of the previous season t...,,2005,Fall,Others,Others,Others,Others,Others,Kids,https://cdn.myanimelist.net/images/anime/12/66...,,"Dec 3, 2005 to May 28, 2006",Unknown,2005-12-03
498,44167,https://myanimelist.net/anime/44167/Meng_Li_Ren,The Dreaming Girl,The Dreaming Girl,夢裡人,,TV,Unknown,26.0,Finished Airing,False,22 min per ep,G - All Ages,,,16044.0,18665,266,1,,,2005,Fall,Others,Others,Others,Drama,Others,Others,https://cdn.myanimelist.net/images/anime/1775/...,,"Dec 12, 2005 to Jan 6, 2006",Unknown,2005-12-12
523,53910,https://myanimelist.net/anime/53910/Minami_no_...,Minami no Shima no Chiisana Hikouki Birdy,,南の島の小さな飛行機 バーディー,,TV,Original,105.0,Finished Airing,False,10 min per ep,PG - Children,,,16158.0,23150,74,0,"This is Bird Paradise Island, a small island i...",,2005,Fall,NHK,Others,Studio Deen,Adventure,Anthropomorphic,Kids,https://cdn.myanimelist.net/images/anime/1085/...,,"Dec 26, 2005 to Apr 2, 2007",Saturdays at 17:50 (JST),2005-12-26


In [200]:
# keep every row whose year is not 2005; .copy() avoids chained-assignment warnings later
animes = animes[animes['year'] != 2005].copy()

# validate that the year 2005 is no longer in the data
animes['year'].unique()

array([2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
       2017, 2018, 2019, 2020, 2021, 2022])

Function: `is_subset(value, check_list):`
- The function takes two parameters: `value` and `check_list`.
    - `value` is the comma-separated string from the column being checked.
    - `check_list` is the list of categories to look for in `value`.
- The function splits the `value` string into a list and returns `True` if `any` of its items appears in `check_list`, otherwise `False`. Despite the name, it tests for overlap rather than for a strict subset.

In [201]:
def is_subset(value, check_list):
    # split the comma-separated string and test each item against check_list
    value_list = value.split(',')
    return any(item in check_list for item in value_list)
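
A short worked example of how `is_subset` behaves. The genre strings are in the same comma-separated form produced earlier; matching is on whole items, not substrings.

```python
def is_subset(value, check_list):
    # True if any comma-separated item of `value` appears in `check_list`
    return any(item in check_list for item in value.split(','))

has_comedy = is_subset('Action,Comedy,Sci-Fi', ['Comedy'])
has_hentai = is_subset('Action,Comedy', ['Hentai'])
# 'Sci' is not an item of the list, so a partial name does not match
partial_match = is_subset('Sci-Fi', ['Sci'])

print(has_comedy, has_hentai, partial_match)
```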

We exclude every title tagged with the `Hentai` genre (R-18+ content), which is not suitable for the app.

In [202]:
animes = animes[~(animes['genres'].apply(is_subset, args=(['Hentai'],)))]

# Data Exploring

Let's explore the data again after cleaning, transforming and structuring.

In [203]:
animes.head()

Unnamed: 0,mal_id,url,title,title_english,title_japanese,title_synonyms,type,source,episodes,status,airing,duration,rating,score,scored_by,rank,popularity,members,favorites,synopsis,background,year,season,producers,licensors,studios,genres,themes,demographics,image_url,trailer_url,aired_date,broadcast_day_and_time,start_date
0,853,https://myanimelist.net/anime/853/Ouran_Koukou...,Ouran High School Host Club,Ouran High School Host Club,桜蘭高校ホスト部,"Ohran Koko Host Club,Ouran Koukou Hosutobu,Our...",TV,Manga,26.0,Finished Airing,False,23 min per ep,PG-13 - Teens 13 or older,8.16,672279.0,413.0,128,1110503,34094,Haruhi Fujioka is a studious girl who has rece...,In addition to this anime and the source manga...,2006,Spring,"VAP,Hakusensha,Nippon Television Network",Funimation,Bones,"Comedy,Romance","Crossdressing,Reverse Harem,School",Shoujo,https://cdn.myanimelist.net/images/anime/2/719...,https://www.youtube.com/watch?v=NcC5VCE2Its,"Apr 5, 2006 to Sep 27, 2006",Wednesdays at 00:50 (JST),2006-04-05
1,918,https://myanimelist.net/anime/918/Gintama,Gintama,Gintama,銀魂,"Gin Tama,Silver Soul,Yorinuki Gintama-san",TV,Manga,201.0,Finished Airing,False,24 min per ep,PG-13 - Teens 13 or older,8.94,391827.0,15.0,139,1052162,58905,Edo is a city that was home to the vigor and a...,Several games based on Gintama have been relea...,2006,Spring,"TV Tokyo,Aniplex,Dentsu,Trinity Sound,Audio Hi...","Sentai Filmworks,Crunchyroll",Sunrise,"Action,Comedy,Sci-Fi","Gag Humor,Historical,Parody,Samurai",Shounen,https://cdn.myanimelist.net/images/anime/10/73...,,"Apr 4, 2006 to Mar 25, 2010",Thursdays at 18:00 (JST),2006-04-04
2,889,https://myanimelist.net/anime/889/Black_Lagoon,Black Lagoon,Black Lagoon,BLACK LAGOON,,TV,Manga,12.0,Finished Airing,False,23 min per ep,R - 17+ (violence & profanity),8.03,487522.0,573.0,155,993437,18516,Salaryman Rokurou Okajima spends his days tryi...,Black Lagoon was released on DVD by Geneon Ent...,2006,Spring,"Geneon Universal Entertainment,Shogakukan-Shue...","Funimation,Geneon Entertainment USA",Madhouse,Action,"Adult Cast,Organized Crime",Seinen,https://cdn.myanimelist.net/images/anime/1906/...,https://www.youtube.com/watch?v=d4EbGC7fKnQ,"Apr 9, 2006 to Jun 25, 2006",Sundays at 02:35 (JST),2006-04-09
3,849,https://myanimelist.net/anime/849/Suzumiya_Har...,The Melancholy of Haruhi Suzumiya,The Melancholy of Haruhi Suzumiya,涼宮ハルヒの憂鬱,Suzumiya Haruhi no Yuuutsu,TV,Light novel,14.0,Finished Airing,False,23 min per ep,PG-13 - Teens 13 or older,7.83,475447.0,888.0,179,909771,15321,If a survey were conducted to see if people be...,Suzumiya Haruhi no Yuuutsu aired in a nonlinea...,2006,Spring,"Lantis,Kadokawa Shoten,Kadokawa Pictures Japan...","Funimation,Bandai Entertainment,Kadokawa Pictu...",Kyoto Animation,"Award Winning,Comedy,Mystery,Sci-Fi",School,Others,https://cdn.myanimelist.net/images/anime/1470/...,,"Apr 3, 2006 to Jul 3, 2006",Mondays at 00:00 (JST),2006-04-03
4,934,https://myanimelist.net/anime/934/Higurashi_no...,Higurashi: When They Cry,Higurashi: When They Cry,ひぐらしのなく頃に,"When the Cicadas Cry,The Moment the Cicadas Cry",TV,Visual novel,26.0,Finished Airing,False,24 min per ep,R - 17+ (violence & profanity),7.88,391899.0,809.0,229,797940,21247,Keiichi Maebara has just moved to the quiet li...,Geneon Entertainment USA initially licensed an...,2006,Spring,"Geneon Universal Entertainment,Frontier Works,...","Sentai Filmworks,Geneon Entertainment USA",Studio Deen,"Horror,Mystery,Supernatural","Gore,Psychological",Others,https://cdn.myanimelist.net/images/anime/12/19...,,"Apr 5, 2006 to Sep 27, 2006",Wednesdays at 01:30 (JST),2006-04-05


In [204]:
animes.shape

(15450, 34)

### Checking data types of columns

The `info()` method lets us check the data type of each column easily.

In [205]:
animes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15450 entries, 0 to 16336
Data columns (total 34 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   mal_id                  15450 non-null  int64         
 1   url                     15450 non-null  object        
 2   title                   15450 non-null  object        
 3   title_english           6944 non-null   object        
 4   title_japanese          15394 non-null  object        
 5   title_synonyms          15450 non-null  object        
 6   type                    15450 non-null  object        
 7   source                  15450 non-null  object        
 8   episodes                15317 non-null  float64       
 9   status                  15450 non-null  object        
 10  airing                  15450 non-null  bool          
 11  duration                15450 non-null  object        
 12  rating                  15303 non-null  object     

### Finding the descriptive statistics

The `describe()` method gives the descriptive statistics of the numeric columns in the dataset.

In [206]:
animes.describe()

Unnamed: 0,mal_id,episodes,score,scored_by,rank,popularity,members,favorites,year,start_date
count,15450.0,15317.0,9902.0,9902.0,12725.0,15450.0,15450.0,15450.0,15450.0,13365
mean,35225.594498,14.280734,6.481127,42791.47,10020.081965,12947.674628,52651.29,593.087055,2015.141942,2015-12-13 18:11:39.797979904
min,356.0,1.0,2.33,101.0,2.0,1.0,17.0,0.0,2006.0,2006-01-01 00:00:00
25%,27893.0,1.0,5.81,350.0,4639.0,5179.25,184.0,0.0,2012.0,2012-07-21 00:00:00
50%,38250.5,3.0,6.47,2349.0,9965.0,13546.0,898.0,1.0,2016.0,2016-07-11 00:00:00
75%,47205.25,13.0,7.17,20968.75,15382.0,20146.75,16305.75,24.0,2019.0,2019-10-02 00:00:00
max,57879.0,1664.0,9.09,2754086.0,20440.0,26089.0,3880951.0,223131.0,2022.0,2023-10-28 00:00:00
std,15331.733933,34.649701,0.918981,144576.6,6062.907524,8020.960102,195173.3,5153.888009,4.57407,


Function: `get_category_count_from_str(col):`
- The function takes one parameter: `col`.
    - `col` is a column of the data holding comma-separated strings.
- The function splits each string on commas and counts how often each resulting category occurs in the `col` column.
- The function returns a `Series` of category counts, sorted from most to least frequent.

In [207]:
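def get_category_count_from_str(col):
    # one row per category: split on commas, stack into a single Series, trim spaces
    cat_df = col.str.split(',', expand=True).stack().str.strip()
    # one indicator column per category
    cat_df = pd.get_dummies(cat_df, prefix='', prefix_sep='')

    # summing each indicator column gives the number of occurrences
    cat_counts = cat_df.sum(axis=0)

    return cat_counts.sort_values(ascending=False)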
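
A shorter way to get such counts, shown here as a sketch rather than the notebook's method, is to `explode` the split lists and call `value_counts`. The helper name `count_categories` is hypothetical and the sample values are made up; on the real columns it should give the same counts, though the order of tied categories may differ.

```python
import pandas as pd

def count_categories(col):
    # Hypothetical helper: split on commas, one row per item, then count.
    return col.str.split(',').explode().str.strip().value_counts()

# Made-up genre strings in the same comma-separated form.
genres = pd.Series(['Comedy,Romance', 'Action,Comedy,Sci-Fi', 'Action'])

counts = count_categories(genres)
print(counts)
```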

In [208]:
get_category_count_from_str(animes['genres'])

Comedy           4744
Fantasy          3571
Others           3236
Action           2967
Adventure        2009
Sci-Fi           1611
Drama            1454
Slice of Life    1217
Romance          1170
Supernatural      959
Mystery           584
Avant Garde       556
Ecchi             545
Sports            431
Horror            290
Suspense          206
Gourmet           133
Award Winning     130
Boys Love         102
Girls Love         65
Erotica            56
dtype: int64

In [209]:
get_category_count_from_str(animes['themes'])

Others               5896
Music                3033
School               1478
Historical            787
Anthropomorphic       678
Parody                592
Mecha                 588
Super Power           476
Military              412
Adult Cast            370
Mythology             359
Martial Arts          344
Harem                 317
Psychological         282
Strategy Game         267
Idols (Female)        264
Space                 262
CGDCT                 215
Isekai                211
Mahou Shoujo          202
Detective             175
Team Sports           169
Racing                166
Gore                  161
Gag Humor             157
Iyashikei             148
Idols (Male)          147
Workplace             145
Samurai               133
Educational           124
Vampire               115
Video Game            112
Time Travel            99
Performing Arts        86
Survival               80
Otaku Culture          79
Reincarnation          73
Reverse Harem          68
Visual Arts 

In [210]:
get_category_count_from_str(animes['demographics'])

Others     9979
Kids       3477
Shounen    1043
Seinen      641
Shoujo      209
Josei       115
dtype: int64

### Making titles unique

Next, we deduplicate the `title` column: duplicate titles are replaced with their synonyms, and empty or missing titles fall back to the English title and then the Japanese title.

In [211]:
# changing duplicate titles with title synonyms and null titles with english and japanese
dup_mask = animes['title'].duplicated()
animes.loc[dup_mask, 'title'] = animes.loc[dup_mask, 'title_synonyms']

# empty titles: fall back to the English title, then the Japanese title
for fallback in ['title_english', 'title_japanese']:
    empty_mask = animes['title'] == ''
    animes.loc[empty_mask, 'title'] = animes.loc[empty_mask, fallback]

# null titles: same fallback order
for fallback in ['title_english', 'title_japanese']:
    null_mask = animes['title'].isnull()
    animes.loc[null_mask, 'title'] = animes.loc[null_mask, fallback]

In [212]:
animes[animes['title'].isnull()]

Unnamed: 0,mal_id,url,title,title_english,title_japanese,title_synonyms,type,source,episodes,status,airing,duration,rating,score,scored_by,rank,popularity,members,favorites,synopsis,background,year,season,producers,licensors,studios,genres,themes,demographics,image_url,trailer_url,aired_date,broadcast_day_and_time,start_date


In [213]:
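# titles that are still duplicated (every occurrence after the first) get a '_<mal_id>' suffix
dummy = animes.loc[animes['title'].duplicated(), ['title', 'mal_id']]
dummy['title'] = dummy['title'] + '_' + dummy['mal_id'].astype(str)
# the assignment aligns on the index: first occurrences are absent from `dummy`,
# so their title becomes NaN here and is refilled in the next cell
animes.loc[animes['title'].duplicated(keep=False), 'title'] = dummy['title']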

In [214]:
animes.loc[animes['title'].isnull(), 'title'] = animes.loc[animes['title'].isnull(), 'title_english']
animes.loc[animes['title'].isnull(), 'title'] = animes.loc[animes['title'].isnull(), 'title_japanese']

In [215]:
animes[animes['title'].duplicated(keep=False)]

Unnamed: 0,mal_id,url,title,title_english,title_japanese,title_synonyms,type,source,episodes,status,airing,duration,rating,score,scored_by,rank,popularity,members,favorites,synopsis,background,year,season,producers,licensors,studios,genres,themes,demographics,image_url,trailer_url,aired_date,broadcast_day_and_time,start_date


In [216]:
animes[animes['title'].isnull()]

Unnamed: 0,mal_id,url,title,title_english,title_japanese,title_synonyms,type,source,episodes,status,airing,duration,rating,score,scored_by,rank,popularity,members,favorites,synopsis,background,year,season,producers,licensors,studios,genres,themes,demographics,image_url,trailer_url,aired_date,broadcast_day_and_time,start_date
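
For reference, the suffixing idea can be shown on a toy frame. This sketch is a simplified alternative, not the notebook's exact procedure: it appends `_<mal_id>` to every occurrence of a duplicated title in one step, so no title is left null along the way. The titles and ids are made up.

```python
import pandas as pd

# Toy frame with one duplicated title.
sample = pd.DataFrame({
    'mal_id': [1, 2, 3],
    'title': ['Alpha', 'Beta', 'Alpha'],
})

# keep=False marks every occurrence of a duplicated title, including the first.
dup_mask = sample['title'].duplicated(keep=False)
sample.loc[dup_mask, 'title'] = (
    sample.loc[dup_mask, 'title'] + '_' + sample.loc[dup_mask, 'mal_id'].astype(str)
)

print(sample['title'].tolist())
```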


# Exporting the DataFrame

Let's export the cleaned DataFrame as a CSV file for use in the next part.

In [217]:
animes.to_csv('anime_data_2006_2022_cleaned.csv', index=False)
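
When the file is read back later, date columns come in as plain strings unless told otherwise. The sketch below illustrates this with a one-row sample and a temporary file (the file name `sample_cleaned.csv` is made up); `parse_dates` restores the `start_date` column as datetimes.

```python
import os
import tempfile

import pandas as pd

# One-row sample with a datetime column, like `start_date` in `animes`.
sample = pd.DataFrame({
    'mal_id': [853],
    'title': ['Ouran High School Host Club'],
    'start_date': pd.to_datetime(['2006-04-05']),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_cleaned.csv')
    sample.to_csv(path, index=False)
    # parse_dates converts the column back to datetime on load
    reloaded = pd.read_csv(path, parse_dates=['start_date'])

print(reloaded.dtypes)
```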

# Summary

In this notebook, we carried out the EDA process: discovering, cleaning, structuring and validating the data. We then explored how the data is structured and which genres, themes and demographics occur most often, and finally exported the cleaned data as a CSV file.

In the next part, we will construct an interactive web application.