# Data Collection and Cleaning

We pull our data from the API of an anime tracking site called [AniList](https://anilist.co). The API uses GraphQL, an alternative to REST: the client spells out exactly which fields it wants and how they are nested, and the server returns data in that same structure. The API documentation is linked [here](https://anilist.gitbook.io/anilist-apiv2-docs/). There is a limit of 90 API calls per minute.
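
A limit of 90 calls per minute means that requests spaced at least 60/90 ≈ 0.67 seconds apart stay under it. The following is a minimal sketch of that idea, not part of the original notebook: `Throttle` is a hypothetical helper (it does not come from `requests` or from AniList), and its clock and sleep functions are injectable only so that it can be exercised without actually waiting.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum gap between successive calls."""

    def __init__(self, calls_per_minute=90, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / calls_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first one

    def wait(self):
        """Sleep just long enough to respect the minimum gap; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # record when this call is considered to have happened
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` on a shared `Throttle()` instance immediately before each `requests.post(...)` would space the calls out accordingly.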

In [None]:
import requests
import json
import pandas as pd
import numpy as np

## Querying Anime Data

In [None]:
query = '''
query ($id: Int, $page: Int, $perPage: Int, $seasonYear: Int) {
    Page (page: $page, perPage: $perPage) {
        pageInfo {
            total
            currentPage
            lastPage
            hasNextPage
            perPage
        }
        media (id: $id, seasonYear: $seasonYear, isAdult: false, sort: [POPULARITY_DESC]) {
            id
            title {
                romaji
                english
            }
            popularity
            averageScore
            studios(sort: [FAVOURITES], isMain: true) {
              nodes {
                name
              }  
            }
            tags {
              name
              rank
            }
            genres
            episodes
            format
            description(asHtml: false)
            season
            seasonYear
            favourites
            source
            duration
            siteUrl
            staff (sort: [RELEVANCE], page: 1, perPage: 5) {
              edges {
                role
                node {
                  name {
                    full
                  }
                }
              }
            }
        }
    }
}
'''
url = 'https://graphql.anilist.co'

variables = {
    'seasonYear': 1991,
    'page': 1,
    'perPage': 50
}

response = requests.post(url, json={'query': query, 'variables': variables})

In [None]:
results = json.loads(response.text)
anime_df = pd.json_normalize(results['data']['Page'], "media")
anime_df.head()

,id,popularity,averageScore,tags,genres,episodes,format,description,season,seasonYear,favourites,source,duration,siteUrl,title.romaji,title.english,studios.nodes,staff.edges
0,1029,25525,72,"[{'name': 'Female Protagonist', 'rank': 92}, {...","[Drama, Romance, Slice of Life]",1,MOVIE,"Taeko Okajima is a typical ""office lady"" in a ...",SUMMER,1991,697,MANGA,118,https://anilist.co/anime/1029,Omoide Poro Poro,Only Yesterday,[{'name': 'Studio Ghibli'}],"[{'role': 'Original Creator', 'node': {'name':..."
1,898,21115,66,"[{'name': 'Martial Arts', 'rank': 80}, {'name'...","[Action, Adventure, Comedy, Fantasy, Sci-Fi]",1,MOVIE,"After defeating Freeza, Goku returns to Earth ...",SUMMER,1991,149,MANGA,48,https://anilist.co/anime/898,Dragon Ball Z: Tobikkiri no Saikyou Tai Saikyou,Dragon Ball Z: Cooler's Revenge,[{'name': 'Toei Animation'}],"[{'role': 'Original Creator', 'node': {'name':..."
2,897,17622,60,"[{'name': 'Shounen', 'rank': 79}, {'name': 'Su...","[Action, Adventure, Comedy, Fantasy, Sci-Fi]",1,MOVIE,A Super Namekian named Slug comes to invade Ea...,SPRING,1991,87,MANGA,52,https://anilist.co/anime/897,Dragon Ball Z: Super Saiyajin da Son Goku,Dragon Ball Z: Lord Slug,[{'name': 'Toei Animation'}],"[{'role': 'Original Creator', 'node': {'name':..."
3,795,11194,77,"[{'name': 'Tragedy', 'rank': 98}, {'name': 'Sc...","[Drama, Psychological]",39,TV,"Before leaving her cram school, Nanako Misonō ...",SUMMER,1991,498,MANGA,25,https://anilist.co/anime/795,Onii-sama e...,Dear Brother,[{'name': 'Tezuka Productions'}],"[{'role': 'Original Creator', 'node': {'name':..."
4,2000,8185,68,"[{'name': 'Artificial Intelligence', 'rank': 9...","[Action, Comedy, Drama, Mecha, Sci-Fi]",1,MOVIE,The Z Project was intended to give the new gen...,SUMMER,1991,86,ORIGINAL,79,https://anilist.co/anime/2000,Roujin Z,Roujin Z,[{'name': 'APPP'}],"[{'role': 'Original Creator', 'node': {'name':..."
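
Indexing straight into `results['data']['Page']` raises an opaque `KeyError` or `TypeError` when a request fails. GraphQL servers conventionally report failures in a top-level `errors` list, so a small guard can surface the message instead. This is a sketch under that assumption: `extract_media` is a hypothetical helper, and the exact error payload AniList sends is not taken from the notebook.

```python
def extract_media(payload):
    """Return the list of media records from a decoded response, or raise on errors."""
    errors = payload.get("errors")
    if errors:
        # each error is assumed to be a dict with a "message" key; fall back to str()
        messages = "; ".join(
            str(e.get("message", e)) if isinstance(e, dict) else str(e)
            for e in errors
        )
        raise RuntimeError(f"GraphQL request failed: {messages}")
    data = payload.get("data") or {}
    page = data.get("Page") or {}
    return page.get("media") or []
```

The returned list of dictionaries has the same shape as `results['data']['Page']['media']` in the cell above.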


The data above comes from a single call, which returns one page of at most 50 anime from the year 1991, sorted by popularity. We repeat the call for every year from 1992 to 2021, pulling three pages of up to 50 anime each. The raw responses are JSON text, which we parse into very long, deeply nested dictionaries. We then use Python's `pandas` library to flatten each parsed response into a dataframe and concatenate it to the dataframes collected so far.
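
The query already requests `pageInfo.hasNextPage`, so the paging could also stop early for years with fewer than three pages. Below is a sketch of that variant, not the notebook's own loop: `iter_media` and `fake_fetch_page` are hypothetical names, and the stub returns made-up records so the example runs without touching the API.

```python
def iter_media(fetch_page, year, max_pages=3):
    """Yield media records for one year until pages run out or max_pages is reached."""
    page = 1
    while page <= max_pages:
        result = fetch_page(year, page)
        yield from result["media"]
        if not result["pageInfo"]["hasNextPage"]:
            break
        page += 1


def fake_fetch_page(year, page):
    """Stand-in for the API call: two pages of made-up records per year."""
    records = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    return {"pageInfo": {"hasNextPage": page < 2}, "media": records[page]}


rows = list(iter_media(fake_fetch_page, 1992))
print([r["id"] for r in rows])  # [1, 2, 3]
```

In real use, `fetch_page` would wrap the `requests.post` call and return the decoded `Page` object; the collected records can then be flattened in one go, since `pd.json_normalize` also accepts a list of dictionaries.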

In [None]:
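import time

# url = 'https://graphql.anilist.co'

frames = [anime_df]  # start from the 1991 page pulled above
for year in range(1992,2022):
  for page in range(1,4):
    variables = {
        'seasonYear': year,
        'page': page,
        'perPage': 50
    }
    print((year,page))
    response = requests.post(url, json={'query': query, 'variables': variables})
    results = json.loads(response.text)
    frames.append(pd.json_normalize(results['data']['Page'], "media"))
    time.sleep(0.7)  # stay under the limit of 90 API calls per minute

# concatenate once instead of growing the dataframe on every iteration
anime_df = pd.concat(frames)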

(1992, 1)
(1992, 2)
(1992, 3)
(1993, 1)
(1993, 2)
(1993, 3)
(1994, 1)
(1994, 2)
(1994, 3)
(1995, 1)
(1995, 2)
(1995, 3)
(1996, 1)
(1996, 2)
(1996, 3)
(1997, 1)
(1997, 2)
(1997, 3)
(1998, 1)
(1998, 2)
(1998, 3)
(1999, 1)
(1999, 2)
(1999, 3)
(2000, 1)
(2000, 2)
(2000, 3)
(2001, 1)
(2001, 2)
(2001, 3)
(2002, 1)
(2002, 2)
(2002, 3)
(2003, 1)
(2003, 2)
(2003, 3)
(2004, 1)
(2004, 2)
(2004, 3)
(2005, 1)
(2005, 2)
(2005, 3)
(2006, 1)
(2006, 2)
(2006, 3)
(2007, 1)
(2007, 2)
(2007, 3)
(2008, 1)
(2008, 2)
(2008, 3)
(2009, 1)
(2009, 2)
(2009, 3)
(2010, 1)
(2010, 2)
(2010, 3)
(2011, 1)
(2011, 2)
(2011, 3)
(2012, 1)
(2012, 2)
(2012, 3)
(2013, 1)
(2013, 2)
(2013, 3)
(2014, 1)
(2014, 2)
(2014, 3)
(2015, 1)
(2015, 2)
(2015, 3)
(2016, 1)
(2016, 2)
(2016, 3)
(2017, 1)
(2017, 2)
(2017, 3)
(2018, 1)
(2018, 2)
(2018, 3)
(2019, 1)
(2019, 2)
(2019, 3)
(2020, 1)
(2020, 2)
(2020, 3)
(2021, 1)
(2021, 2)
(2021, 3)


In [None]:
# setting the anime's AniList ID as our index

anime_df = anime_df.drop_duplicates(subset=['id']).set_index('id',drop=True)
anime_df

id,popularity,averageScore,tags,genres,episodes,format,description,season,seasonYear,favourites,source,duration,siteUrl,title.romaji,title.english,studios.nodes,staff.edges
1029,25525,72.0,"[{'name': 'Female Protagonist', 'rank': 92}, {...","[Drama, Romance, Slice of Life]",1.0,MOVIE,"Taeko Okajima is a typical ""office lady"" in a ...",SUMMER,1991.0,697,MANGA,118.0,https://anilist.co/anime/1029,Omoide Poro Poro,Only Yesterday,[{'name': 'Studio Ghibli'}],"[{'role': 'Original Creator', 'node': {'name':..."
898,21115,66.0,"[{'name': 'Martial Arts', 'rank': 80}, {'name'...","[Action, Adventure, Comedy, Fantasy, Sci-Fi]",1.0,MOVIE,"After defeating Freeza, Goku returns to Earth ...",SUMMER,1991.0,149,MANGA,48.0,https://anilist.co/anime/898,Dragon Ball Z: Tobikkiri no Saikyou Tai Saikyou,Dragon Ball Z: Cooler's Revenge,[{'name': 'Toei Animation'}],"[{'role': 'Original Creator', 'node': {'name':..."
897,17622,60.0,"[{'name': 'Shounen', 'rank': 79}, {'name': 'Su...","[Action, Adventure, Comedy, Fantasy, Sci-Fi]",1.0,MOVIE,A Super Namekian named Slug comes to invade Ea...,SPRING,1991.0,87,MANGA,52.0,https://anilist.co/anime/897,Dragon Ball Z: Super Saiyajin da Son Goku,Dragon Ball Z: Lord Slug,[{'name': 'Toei Animation'}],"[{'role': 'Original Creator', 'node': {'name':..."
795,11194,77.0,"[{'name': 'Tragedy', 'rank': 98}, {'name': 'Sc...","[Drama, Psychological]",39.0,TV,"Before leaving her cram school, Nanako Misonō ...",SUMMER,1991.0,498,MANGA,25.0,https://anilist.co/anime/795,Onii-sama e...,Dear Brother,[{'name': 'Tezuka Productions'}],"[{'role': 'Original Creator', 'node': {'name':..."
2000,8185,68.0,"[{'name': 'Artificial Intelligence', 'rank': 9...","[Action, Comedy, Drama, Mecha, Sci-Fi]",1.0,MOVIE,The Z Project was intended to give the new gen...,SUMMER,1991.0,86,ORIGINAL,79.0,https://anilist.co/anime/2000,Roujin Z,Roujin Z,[{'name': 'APPP'}],"[{'role': 'Original Creator', 'node': {'name':..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117002,12821,69,"[{'name': 'Female Protagonist', 'rank': 86}, {...","[Drama, Mahou Shoujo, Mystery, Psychological, ...",8.0,TV,The second season of <i>Magia Record: Mahou Sh...,SUMMER,2021.0,133,VIDEO_GAME,24.0,https://anilist.co/anime/117002,Magia Record: Mahou Shoujo Madoka☆Magica Gaide...,Magia Record: Puella Magi Madoka Magica Side S...,[{'name': 'Shaft'}],"[{'role': 'Original Creator', 'node': {'name':..."
130713,12783,75,"[{'name': 'Monster Boy', 'rank': 60}, {'name':...",[Fantasy],3.0,OVA,Just prior to Chise becoming a part-time stude...,SUMMER,2021.0,139,MANGA,23.0,https://anilist.co/anime/130713,Mahoutsukai no Yome: Nishi no Shounen to Seira...,The Ancient Magus' Bride: The Boy from the Wes...,[{'name': 'Studio Kafka'}],"[{'role': 'Original Creator', 'node': {'name':..."
131584,12237,58,"[{'name': 'Vampire', 'rank': 96}, {'name': 'Id...","[Music, Supernatural]",12.0,TV,Beautiful immortals have gathered in Harajuku ...,FALL,2021.0,199,ORIGINAL,24.0,https://anilist.co/anime/131584,Visual Prison,VISUAL PRISON,[{'name': 'A-1 Pictures'}],"[{'role': 'Original Creator', 'node': {'name':..."
123494,12207,56,"[{'name': 'Football', 'rank': 97}, {'name': 'S...","[Drama, Sports]",13.0,TV,With no soccer accomplishments to speak of dur...,SPRING,2021.0,83,MANGA,24.0,https://anilist.co/anime/123494,Sayonara Watashi no Cramer,"Farewell, My Dear Cramer",[{'name': 'LIDENFILMS'}],"[{'role': 'Original Creator', 'node': {'name':..."


In [None]:
# making a copy of the data so that if there is a mistake in our cleaning process, we will not have to make the API calls again

df = anime_df.copy()

## Data cleaning

In [None]:
# removing specials and music videos

df = df[~df['format'].isin(["SPECIAL","MUSIC"])]

In [None]:
# replace NA sources with "Unknown"

df['source'] = df['source'].fillna("Unknown")

In [None]:
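# One Piece, Crayon Shin-chan, Detective Conan, Boruto, Pokemon 2019 and Dragon Quest are still airing,
# so their missing values (such as an open-ended episode count) are filled with infinity instead of being dropped later.

still_airing = [966, 21, 235, 97938, 112153, 114099]
df.loc[still_airing] = df.loc[still_airing].fillna(np.inf)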

In [None]:
# English titles preferred if they exist

df['title'] = df['title.english'].fillna(df['title.romaji'])
df.drop(['title.romaji','title.english'],axis=1,inplace=True)

In [None]:
# dropping the remaining observations with any NaN values

df = df.dropna()

In [None]:
# filter out anime that contain no info on studios

df = df[df['studios.nodes'].apply(lambda x: len(x) > 0)]

In [None]:
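# if several main studios worked on an anime, keep the last one listed;
# the query sorts studios by FAVOURITES (assumed ascending), so this should be the most favourited studio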

df['studio'] = df['studios.nodes'].apply(lambda x: x[-1]["name"])
df = df.drop('studios.nodes',axis=1)
df['studio'].head()

id
1029         Studio Ghibli
898         Toei Animation
897         Toei Animation
795     Tezuka Productions
2000                  APPP
Name: studio, dtype: object

In [None]:
# creating a function that converts the list of tag dictionaries into a list of tag names,
# keeping only tags with a rank of 70 or higher

def get_tags(arr):
  res = []
  for dic in arr:
    if dic['rank'] >= 70:
      res.append(dic['name'])
  return res

get_tags(df.tags.iloc[0])

['Female Protagonist',
 'Iyashikei',
 'Rural',
 'Philosophy',
 'Coming of Age',
 'Family Life',
 'Biographical',
 'Seinen']

In [None]:
# cleaning up tags: from a list of dictionaries to a list of tag names

df['tags_cleaned'] = df['tags'].apply(get_tags)
df = df.drop('tags',axis=1)

In [None]:
# cleaning staff names in a similar manner

def get_staff_names(x):
  staff = []
  for item in x:
    staff.append(item['node']['name']['full'])
  return staff

In [None]:
df['staff'] = df['staff.edges'].apply(get_staff_names)
df['staff']

id
1029      [Hotaru Okamoto, Yuko Tone, Isao Takahata, Nor...
898       [Akira Toriyama, Mitsuo Hashimoto, Yasuyuki Fu...
897       [Akira Toriyama, Mitsuo Hashimoto, Minoru Maed...
795       [Riyoko Ikeda, Osamu Dezaki, Tomoko Konparu, H...
2000      [Katsuhiro Ootomo, Hisashi Eguchi, Hiroyuki Ki...
                                ...                        
117002    [Magica Quartet, Ume Aoki, Gekidan Inu Curry, ...
130713    [Kore Yamazaki, Kazuaki Terasawa, Hirotaka Kat...
131584    [Noriyasu Agematsu, Ikumi Katagiri, Takeshi Fu...
123494    [Naoshi Arakawa, Seiki Takuno, Natsuko Takahas...
109946    [Mark Millar, Leinil Yu, Mark Millar, Leinil Y...
Name: staff, Length: 3178, dtype: object

In [None]:
df = df.drop('staff.edges',axis=1)
df.head()

id,popularity,averageScore,genres,episodes,format,description,season,seasonYear,favourites,source,duration,siteUrl,title,studio,tags_cleaned,staff
1029,25525,72.0,"[Drama, Romance, Slice of Life]",1.0,MOVIE,"Taeko Okajima is a typical ""office lady"" in a ...",SUMMER,1991.0,697,MANGA,118.0,https://anilist.co/anime/1029,Only Yesterday,Studio Ghibli,"[Female Protagonist, Iyashikei, Rural, Philoso...","[Hotaru Okamoto, Yuko Tone, Isao Takahata, Nor..."
898,21115,66.0,"[Action, Adventure, Comedy, Fantasy, Sci-Fi]",1.0,MOVIE,"After defeating Freeza, Goku returns to Earth ...",SUMMER,1991.0,149,MANGA,48.0,https://anilist.co/anime/898,Dragon Ball Z: Cooler's Revenge,Toei Animation,"[Martial Arts, Shounen, Super Power, Aliens]","[Akira Toriyama, Mitsuo Hashimoto, Yasuyuki Fu..."
897,17622,60.0,"[Action, Adventure, Comedy, Fantasy, Sci-Fi]",1.0,MOVIE,A Super Namekian named Slug comes to invade Ea...,SPRING,1991.0,87,MANGA,52.0,https://anilist.co/anime/897,Dragon Ball Z: Lord Slug,Toei Animation,"[Shounen, Super Power]","[Akira Toriyama, Mitsuo Hashimoto, Minoru Maed..."
795,11194,77.0,"[Drama, Psychological]",39.0,TV,"Before leaving her cram school, Nanako Misonō ...",SUMMER,1991.0,498,MANGA,25.0,https://anilist.co/anime/795,Dear Brother,Tezuka Productions,"[Tragedy, School, Ojou-sama, Primarily Female ...","[Riyoko Ikeda, Osamu Dezaki, Tomoko Konparu, H..."
2000,8185,68.0,"[Action, Comedy, Drama, Mecha, Sci-Fi]",1.0,MOVIE,The Z Project was intended to give the new gen...,SUMMER,1991.0,86,ORIGINAL,79.0,https://anilist.co/anime/2000,Roujin Z,APPP,"[Artificial Intelligence, Primarily Adult Cast...","[Katsuhiro Ootomo, Hisashi Eguchi, Hiroyuki Ki..."


In [None]:
# number of anime we pulled

len(df)

3178

In [None]:
# downloading our dataframe for usage.

# change version number if any changes are made to the data

#v2: filtered out less relevant tags (< 70)
#v3: switched to mainly English titles

df.to_csv("anime-1991-2021_v3.csv")
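
One thing to keep in mind when loading this file later: CSV has no list type, so list-valued columns such as `genres`, `tags_cleaned` and `staff` are written as their text representation and come back as strings. A minimal sketch of restoring them with the standard library follows, using a small in-memory excerpt shaped like the saved file rather than the file itself; only three columns are shown for brevity.

```python
import ast
import csv
import io

# Two rows in the shape the CSV takes on disk: lists become quoted text.
sample_text = """id,title,genres
1029,Only Yesterday,"['Drama', 'Romance', 'Slice of Life']"
795,Dear Brother,"['Drama', 'Psychological']"
"""

rows = []
for row in csv.DictReader(io.StringIO(sample_text)):
    # literal_eval safely parses the Python-literal text back into a list
    row["genres"] = ast.literal_eval(row["genres"])
    rows.append(row)

print(rows[0]["genres"])
```

With `pandas`, the same conversion can be applied while reading by passing `ast.literal_eval` for the relevant columns through the `converters` argument of `pd.read_csv`.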