# Importing `pandas`

In [1]:
import pandas as pd

# Reading data and checking for anomalies

In [2]:
df1 = pd.read_json("anime_full_details.json")
df2 = pd.read_json("anime_list.json")
# Join on both columns: a title or a link may repeat on its own, but the (title, link) pair is expected to be unique.
df_merged = pd.merge(left=df1, right=df2, how='right', on=['title', 'link'], suffixes=(None, '_y'))
# Both tables carry a `score` column; keep the one from df1 and drop the suffixed copy from df2.
df_merged = df_merged.drop('score_y', axis=1)
print(f"Full anime data (df1) has size of {df1.shape}.")
print(f"Anime links data (df2) has size of {df2.shape}.")
print(f"Merged table (df_merged) has a size of {df_merged.shape}.")

Full anime data (df1) has size of (28209, 23).
Anime links data (df2) has size of (28210, 4).
Merged table (df_merged) has a size of (28210, 24).
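
A quick way to sanity-check a two-column join like the one above is to ask pandas to report where each row came from. The sketch below uses small made-up tables (`details` and `listing` are hypothetical stand-ins, not the real JSON files) together with the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the two scraped tables (hypothetical data, not the real files).
details = pd.DataFrame({
    "title": ["A", "B"],
    "link": ["https://example.com/1", "https://example.com/2"],
    "score": [9.1, 8.0],
})
listing = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "link": [
        "https://example.com/1",
        "https://example.com/2",
        "https://example.com/2",  # same (title, link) pair listed twice
        "https://example.com/3",  # no matching row in `details`
    ],
    "rank": [1, 2, 3, 4],
})

merged = pd.merge(
    left=details,
    right=listing,
    how="right",
    on=["title", "link"],
    indicator=True,          # adds a `_merge` column: 'both', 'left_only' or 'right_only'
    validate="one_to_many",  # raises MergeError if `details` has duplicate key pairs
)

merge_counts = merged["_merge"].value_counts()
print(merged.shape)
print(merge_counts)
```

Rows tagged `right_only` exist only in the right-hand table, so every column coming from the left table is `NaN` for them. Whether that pattern occurs in the real data is something to check, not something this sketch shows.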



`keep=False` marks every occurrence of a duplicated value, not just the repeats, which makes the duplication easier to inspect.
Apparently the title below was scraped twice during the scraping process: it appears at both rank 550 and rank 551.


In [3]:
df_merged[df_merged['link'].duplicated(keep=False)]

Unnamed: 0,title,synopsis,type,episodes,status,aired,premiered,broadcast,producers,licensors,...,themes,duration,rating,score,ranked,popularity,members,favorites,link,rank
549,Zankyou no Terror,"Painted in red, the word ""VON"" is all that is ...",TV,\n 11\n,\n Finished Airing\n,"\n Jul 11, 2014 to Sep 26, 2014\n",Summer 2014,[\n Fridays at 00:50 (JST)\n ],"[Aniplex, Dentsu, Fuji TV, Tohokushinsha Film ...",[Funimation],...,[],\n 22 min. per ep.\n,\n R - 17+ (violence & profanity)\n,8.08,\n #551,\n #122\n,"\n 1,220,926\n","\n 22,796\n",https://myanimelist.net/anime/23283/Zankyou_no...,550
550,Zankyou no Terror,"Painted in red, the word ""VON"" is all that is ...",TV,\n 11\n,\n Finished Airing\n,"\n Jul 11, 2014 to Sep 26, 2014\n",Summer 2014,[\n Fridays at 00:50 (JST)\n ],"[Aniplex, Dentsu, Fuji TV, Tohokushinsha Film ...",[Funimation],...,[],\n 22 min. per ep.\n,\n R - 17+ (violence & profanity)\n,8.08,\n #551,\n #122\n,"\n 1,220,926\n","\n 22,796\n",https://myanimelist.net/anime/23283/Zankyou_no...,551
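
To make the effect of `keep=False` concrete, here is a minimal sketch on a made-up `links` Series (hypothetical values) comparing it with the default `keep='first'`.

```python
import pandas as pd

# Hypothetical link column with one value that appears twice.
links = pd.Series(["a", "b", "b", "c"], name="link")

first_kept = links.duplicated()              # default keep='first': only the repeat is flagged
all_flagged = links.duplicated(keep=False)   # every member of a duplicate group is flagged

print(first_kept.tolist())   # [False, False, True, False]
print(all_flagged.tolist())  # [False, True, True, False]
```

With the default, filtering would show only one of the two rows, which hides the fact that the original is otherwise identical to its copy.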


Let's check the shape of `df_merged` again, this time after dropping the duplicates.

In [4]:
print(f"Full anime data (df1) has size of {df1.shape}.")
print(f"Anime links data (df2) has size of {df2.shape}.")
print(f"Merged table (df_merged) has a size of {df_merged.shape}.")

df_merged = df_merged.drop_duplicates(subset=['title', 'link'])

print(f"Merged table (df_merged) after dropping duplicates has a size of {df_merged.shape}.")

Full anime data (df1) has size of (28209, 23).
Anime links data (df2) has size of (28210, 4).
Merged table (df_merged) has a size of (28210, 24).
Merged table (df_merged) after dropping duplicates has a size of (28209, 24).


As we can see, the duplicate row was removed, so `df_merged` now has the same number of rows as the full anime data `df1`.

# Cleaning

Let's look at the very first title to work out a cleaning approach that scales to the entire dataframe.

In [5]:
df_merged.head(1)

Unnamed: 0,title,synopsis,type,episodes,status,aired,premiered,broadcast,producers,licensors,...,themes,duration,rating,score,ranked,popularity,members,favorites,link,rank
0,Sousou no Frieren,During their decade-long quest to defeat the D...,TV,\n 28\n,\n Finished Airing\n,"\n Sep 29, 2023 to Mar 22, 2024\n",Fall 2023,[\n Fridays at 23:00 (JST)\n ],"[Aniplex, Dentsu, Shogakukan-Shueisha Producti...",[Crunchyroll],...,[],\n 24 min. per ep.\n,\n PG-13 - Teens 13 or older\n,9.31,\n #1,\n #157\n,"\n 1,060,746\n","\n 65,118\n",https://myanimelist.net/anime/52991/Sousou_no_...,1


In [6]:
df_merged.head(1).T # Transpose for better visibility

Unnamed: 0,0
title,Sousou no Frieren
synopsis,During their decade-long quest to defeat the D...
type,TV
episodes,\n 28\n
status,\n Finished Airing\n
aired,"\n Sep 29, 2023 to Mar 22, 2024\n"
premiered,Fall 2023
broadcast,[\n Fridays at 23:00 (JST)\n ]
producers,"[Aniplex, Dentsu, Shogakukan-Shueisha Producti..."
licensors,[Crunchyroll]


- The `broadcast` column was scraped as a list, which tells me the original scraper logic used the `.getall()` method. I have replaced it with `.get()` for next time.

- There are a lot of `\n`s. Let's use the `.apply` method to remove them.

## Clean the `broadcast` column

In [7]:
# Unwrap the single-element lists into plain strings (assumes at most one entry per list).
df_merged['broadcast'] = df_merged['broadcast'].explode()
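
The `explode()` call above keeps the index aligned only while each list holds at most one element. The sketch below uses a made-up `broadcast` Series (hypothetical values, including a two-element list that may never occur in the real data) to show how `explode()` and the alternative `.str[0]` differ in that case.

```python
import pandas as pd

# Hypothetical column: a one-element list, a missing value, and a two-element list.
broadcast = pd.Series([["Fridays at 23:00 (JST)"], None, ["Mon", "Tue"]])

exploded = broadcast.explode()  # one row per list element, so index label 2 repeats
first_item = broadcast.str[0]   # first element of each list, index unchanged

print(exploded.index.tolist())    # [0, 1, 2, 2]
print(first_item.index.tolist())  # [0, 1, 2]
```

If a multi-element list did show up, the exploded Series would no longer line up one-to-one with the DataFrame rows, whereas `.str[0]` would silently keep only the first entry. Which behaviour is preferable depends on what the extra entries mean.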

## Clean the `\n`s from the data.

In [8]:
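# Remove newline characters from every string cell; lists, numbers and NaN pass through unchanged.
for col in df_merged.columns:
    df_merged[col] = df_merged[col].apply(lambda x: x.replace("\n", "") if isinstance(x, str) else x)
df_merged.head()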

Unnamed: 0,title,synopsis,type,episodes,status,aired,premiered,broadcast,producers,licensors,...,themes,duration,rating,score,ranked,popularity,members,favorites,link,rank
0,Sousou no Frieren,During their decade-long quest to defeat the D...,TV,28,Finished Airing,"Sep 29, 2023 to Mar 22, 2024",Fall 2023,Fridays at 23:00 (JST),"[Aniplex, Dentsu, Shogakukan-Shueisha Producti...",[Crunchyroll],...,[],24 min. per ep.,PG-13 - Teens 13 or older,9.31,#1,#157,1060746,65118,https://myanimelist.net/anime/52991/Sousou_no_...,1


Looks clean. Now we can proceed with further analysis of the data.
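
Removing `\n` alone leaves any spaces that surrounded it in the raw strings. As a minimal sketch, assuming that padding is unwanted, the helper below (`clean_text` is a hypothetical name, and `raw` is made-up data) also strips leading and trailing whitespace.

```python
import pandas as pd

# Made-up rows that mimic the raw scraped values.
raw = pd.DataFrame({
    "episodes": ["\n 28\n", "\n 11\n"],
    "producers": [["Aniplex", "Dentsu"], ["Aniplex"]],  # lists are left untouched
    "score": [9.31, 8.08],
})


def clean_text(value):
    """Drop newlines and surrounding whitespace from strings; pass other values through."""
    if isinstance(value, str):
        return value.replace("\n", "").strip()
    return value


cleaned = raw.copy()
for col in cleaned.columns:
    cleaned[col] = cleaned[col].map(clean_text)

print(cleaned["episodes"].tolist())  # ['28', '11']
```

The values are still strings after this step; converting columns such as `episodes` to numbers would be a separate decision.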

# Finding missing values

In [11]:
df_merged.isna().sum()

title              0
synopsis         665
type            5155
episodes         665
status           665
aired            665
premiered      22281
broadcast      20159
producers        665
licensors        665
studios          665
source         20987
genres           665
demographic      665
themes           665
duration         665
rating           665
score          10526
ranked           665
popularity       665
members          665
favorites        665
link               0
rank               0
dtype: int64

TODO: the count 665 appears in too many columns to be a coincidence. Figure out which rows these are.
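
One possible way to start on this TODO is to test whether the repeated count comes from the same rows being empty in every affected column. The sketch below runs on a made-up `frame` (hypothetical values, with 2 playing the role of 665); the names `suspect_cols` and `all_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_merged (hypothetical values).
frame = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "synopsis": ["text", np.nan, np.nan, "text"],
    "status": ["Finished", np.nan, np.nan, "Airing"],
    "score": [9.3, np.nan, np.nan, np.nan],
})

# Columns that share the suspicious missing count (2 here, 665 in the notebook).
na_counts = frame.isna().sum()
suspect_cols = na_counts[na_counts == 2].index.tolist()

# Rows where every suspect column is missing at the same time.
all_missing = frame[suspect_cols].isna().all(axis=1)

print(suspect_cols)                              # ['synopsis', 'status']
print(int(all_missing.sum()))                    # 2
print(frame.loc[all_missing, "title"].tolist())  # ['B', 'C']
```

If the number of such rows equals the shared count, the gaps belong to one group of titles, and their `link` values would be the natural next thing to inspect.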