2/13/2022
The scripts and files contained within encompass my capstone project for the Google Data Analytics Professional Certificate. It combines my interest in television viewership data and allows such data to be explored with a visualization. Data will be:
- Downloaded via a website API
- Cleaned, transformed, and manipulated into tables that can be used for reporting here
- Loaded onto the kaggle website here
- Dashboard created via Tableau to explore here
The MyAnimeList website has a vast repository of ratings and rankings of viewership data and will be my data source for the project. However, data could just as easily come from Netflix viewer data, car purchase preferences, or amazon ratings.
To access the API I needed to setup an account on MyAnimeList and create an API application in the API panel of your profile. For more information, see https://myanimelist.net/forum/?topicid=1973077.
Set your working directory to wherever you’d like in the
WORKINDIRECTORY
section.
Additionally, I’ll also be loading in my token for accessing the API
(which will be deleted on posting this code). If you need to use this
script to download data from the API, simply enter your clientID in the
CLIENTIDHERE
section.
wd1 = "WORKINGDIRECTORY"
setwd(wd1)
clientID <- 'CLIENTIDHERE'
These three packages may take a few minutes to load, but will allow us to access the API for the MyAnimeList website and carry out some manipulations of the JSON format so that we can put the data into dataframes.
Additionally, we’ll use the tidyverse to manipulate/transform data.
install.packages(c("httr", "jsonlite", "tidyverse"))
Next, we want to load the three libraries:
library(httr)
library(jsonlite)
library(tidyverse)
While I will briefly touch on some of the specifics of the API, more information can be found at https://myanimelist.net/apiconfig/references/api/v2 - the official API website and this github page https://github.com/SuperMarcus/myanimelist-api-specification
The url https://api.myanimelist.net/v2/anime
allows you to query the
API with a specific string. The parameters are:
- q - which is the query string.
- limit - which limits the results.
A completed string would look like:
https://api.myanimelist.net/v2/anime?q=hi_score&limit=4
.
Using that string, I’m searching on ‘hi_score’ and limiting results to 4. Lets look at the output of the request and the format:
res <- GET('https://api.myanimelist.net/v2/anime?q=hi_score&limit=4', add_headers(`X-MAL-CLIENT-ID`= clientID))
stop_for_status(res)
data = fromJSON(rawToChar(res$content))
names(data)
## [1] "data" "paging"
glimpse(data)
## List of 2
## $ data :'data.frame': 4 obs. of 1 variable:
## ..$ node:'data.frame': 4 obs. of 3 variables:
## .. ..$ id : int [1:4] 39570 38422 21877 10958
## .. ..$ title : chr [1:4] "High Score Girl II" "High Score Girl: Extra Stage" "High Score Girl" "High Score"
## .. ..$ main_picture:'data.frame': 4 obs. of 2 variables:
## $ paging:List of 1
## ..$ next: chr "https://api.myanimelist.net/v2/anime?offset=4&q=hi_score&limit=4"
You may notice that I needed to add my credentials in the add headers
section. This needs to be completed for each request.
The output is made up of two elements. The data and paging elements.
Within data there is:
- node.id - the MyAnimeList (mal) id
- node.title - title of the anime
- Pictures in both medium and large format
- node.main_picture.medium
- node.main_picture.large
Paging is made up of the next API request. In this case, since we
limited the number of results to 4, it uses an offset
in order to
return the next four.
I’ll be utilizing the flatten
function within the httr
library to
construct dataframes of subsections of the API call:
df <- as.data.frame(flatten(data$data))
knitr::kable(df, "pipe")
id | title | main_picture.medium | main_picture.large |
---|---|---|---|
39570 | High Score Girl II | https://api-cdn.myanimelist.net/images/anime/1560/99904.jpg | https://api-cdn.myanimelist.net/images/anime/1560/99904l.jpg |
38422 | High Score Girl: Extra Stage | https://api-cdn.myanimelist.net/images/anime/1057/111384.jpg | https://api-cdn.myanimelist.net/images/anime/1057/111384l.jpg |
21877 | High Score Girl | https://api-cdn.myanimelist.net/images/anime/1668/91345.jpg | https://api-cdn.myanimelist.net/images/anime/1668/91345l.jpg |
10958 | High Score | https://api-cdn.myanimelist.net/images/anime/6/36265.jpg | https://api-cdn.myanimelist.net/images/anime/6/36265l.jpg |
The same API call (as above) also allows you to query on a specific
anime as long as you have the identifier to incorporate into the request
(e.g. https://api.myanimelist.net/v2/anime/21877
).
To obtain additional information, you need to use the fields
parameter.
res <- GET('https://api.myanimelist.net/v2/anime/21877?fields=id,title,main_picture,alternative_titles,start_date,end_date,synopsis,mean,rank,popularity,num_list_users,num_scoring_users,nsfw,created_at,updated_at,media_type,status,genres,num_episodes,start_season,broadcast,source,average_episode_duration,rating,pictures,background,related_anime,related_manga,recommendations,studios,statistics', add_headers(`X-MAL-CLIENT-ID`= clientID))
stop_for_status(res)
data = fromJSON(rawToChar(res$content))
names(data)
## [1] "id" "title"
## [3] "main_picture" "alternative_titles"
## [5] "start_date" "end_date"
## [7] "synopsis" "mean"
## [9] "rank" "popularity"
## [11] "num_list_users" "num_scoring_users"
## [13] "nsfw" "created_at"
## [15] "updated_at" "media_type"
## [17] "status" "genres"
## [19] "num_episodes" "start_season"
## [21] "broadcast" "source"
## [23] "average_episode_duration" "rating"
## [25] "pictures" "background"
## [27] "related_anime" "related_manga"
## [29] "recommendations" "studios"
## [31] "statistics"
glimpse(data)
## List of 31
## $ id : int 21877
## $ title : chr "High Score Girl"
## $ main_picture :List of 2
## ..$ medium: chr "https://api-cdn.myanimelist.net/images/anime/1668/91345.jpg"
## ..$ large : chr "https://api-cdn.myanimelist.net/images/anime/1668/91345l.jpg"
## $ alternative_titles :List of 3
## ..$ synonyms: list()
## ..$ en : chr "Hi Score Girl"
## ..$ ja : chr "<U+30CF><U+30A4><U+30B9><U+30B3><U+30A2><U+30AC><U+30FC><U+30EB>"
## $ start_date : chr "2018-07-14"
## $ end_date : chr "2018-09-29"
## $ synopsis : chr "The year is 1991, and arcade video games are the latest craze. Becoming a professional gamer is a far-fetched d"| __truncated__
## $ mean : num 7.8
## $ rank : int 877
## $ popularity : int 869
## $ num_list_users : int 215304
## $ num_scoring_users : int 108656
## $ nsfw : chr "white"
## $ created_at : chr "2013-12-31T08:49:46+00:00"
## $ updated_at : chr "2022-02-14T19:41:44+00:00"
## $ media_type : chr "tv"
## $ status : chr "finished_airing"
## $ genres :'data.frame': 5 obs. of 2 variables:
## ..$ id : int [1:5] 4 11 22 23 42
## ..$ name: chr [1:5] "Comedy" "Game" "Romance" "School" ...
## $ num_episodes : int 12
## $ start_season :List of 2
## ..$ year : int 2018
## ..$ season: chr "summer"
## $ broadcast :List of 2
## ..$ day_of_the_week: chr "saturday"
## ..$ start_time : chr "00:30"
## $ source : chr "manga"
## $ average_episode_duration: int 1470
## $ rating : chr "pg_13"
## $ pictures :'data.frame': 3 obs. of 2 variables:
## ..$ medium: chr [1:3] "https://api-cdn.myanimelist.net/images/anime/8/57555.jpg" "https://api-cdn.myanimelist.net/images/anime/1668/91345.jpg" "https://api-cdn.myanimelist.net/images/anime/1252/93554.jpg"
## ..$ large : chr [1:3] "https://api-cdn.myanimelist.net/images/anime/8/57555l.jpg" "https://api-cdn.myanimelist.net/images/anime/1668/91345l.jpg" "https://api-cdn.myanimelist.net/images/anime/1252/93554l.jpg"
## $ background : chr ""
## $ related_anime :'data.frame': 1 obs. of 3 variables:
## ..$ node :'data.frame': 1 obs. of 3 variables:
## .. ..$ id : int 38422
## .. ..$ title : chr "High Score Girl: Extra Stage"
## .. ..$ main_picture:'data.frame': 1 obs. of 2 variables:
## ..$ relation_type : chr "sequel"
## ..$ relation_type_formatted: chr "Sequel"
## $ related_manga : list()
## $ recommendations :'data.frame': 10 obs. of 2 variables:
## ..$ node :'data.frame': 10 obs. of 3 variables:
## .. ..$ id : int [1:10] 34280 14741 35860 12467 35968 47257 14813 1596 48926 4224
## .. ..$ title : chr [1:10] "Gamers!" "Chuunibyou demo Koi ga Shitai!" "Karakai Jouzu no Takagi-san" "Nazo no Kanojo X" ...
## .. ..$ main_picture:'data.frame': 10 obs. of 2 variables:
## ..$ num_recommendations: int [1:10] 7 6 4 2 2 2 2 2 2 1
## $ studios :'data.frame': 1 obs. of 2 variables:
## ..$ id : int 7
## ..$ name: chr "J.C.Staff"
## $ statistics :List of 2
## ..$ status :List of 5
## .. ..$ watching : chr "11196"
## .. ..$ completed : chr "133937"
## .. ..$ on_hold : chr "5321"
## .. ..$ dropped : chr "8614"
## .. ..$ plan_to_watch: chr "57215"
## ..$ num_list_users: int 216283
The data obtained incorporates all of the fields that I’ve listed in the
fields
section.
You’ll notice that this response is made up of lists of lists. I can’t create a flattened data frame at this point because the lists are made up of differing sizes.
as.data.frame(flatten(data))
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0, 5, 3, 10
Because of this, I’ll create tables off of some sections of the call,
then set those fields to NULL,
so that I can create a dataframe on the
remaining fields (this will make more sense below).
I’ll use this API request to obtain the bulk of the information that will be used to build tables.
Please note: the ID for the anime must be known to use this request. This leads me to:
This API request will allow you to obtain a list of anime identifiers and other details (including rank) for the following categories:
- all
- airing
- upcoming
- tv
- ova
- movie
- special
- bypopularity
- favorite
Here is a sample of a request and the output:
res <- GET('https://api.myanimelist.net/v2/anime/ranking?ranking_type=all&limit=4', add_headers(`X-MAL-CLIENT-ID`= clientID))
stop_for_status(res)
data = fromJSON(rawToChar(res$content))
names(data)
## [1] "data" "paging"
str(data$data)
## 'data.frame': 4 obs. of 2 variables:
## $ node :'data.frame': 4 obs. of 3 variables:
## ..$ id : int 5114 48583 38524 9253
## ..$ title : chr "Fullmetal Alchemist: Brotherhood" "Shingeki no Kyojin: The Final Season Part 2" "Shingeki no Kyojin Season 3 Part 2" "Steins;Gate"
## ..$ main_picture:'data.frame': 4 obs. of 2 variables:
## .. ..$ medium: chr "https://api-cdn.myanimelist.net/images/anime/1223/96541.jpg" "https://api-cdn.myanimelist.net/images/anime/1948/120625.jpg" "https://api-cdn.myanimelist.net/images/anime/1517/100633.jpg" "https://api-cdn.myanimelist.net/images/anime/5/73199.jpg"
## .. ..$ large : chr "https://api-cdn.myanimelist.net/images/anime/1223/96541l.jpg" "https://api-cdn.myanimelist.net/images/anime/1948/120625l.jpg" "https://api-cdn.myanimelist.net/images/anime/1517/100633l.jpg" "https://api-cdn.myanimelist.net/images/anime/5/73199l.jpg"
## $ ranking:'data.frame': 4 obs. of 1 variable:
## ..$ rank: int 1 2 3 4
knitr::kable(as.data.frame(flatten(data$data)), "pipe")
id | title | main_picture.medium | main_picture.large | rank |
---|---|---|---|---|
5114 | Fullmetal Alchemist: Brotherhood | https://api-cdn.myanimelist.net/images/anime/1223/96541.jpg | https://api-cdn.myanimelist.net/images/anime/1223/96541l.jpg | 1 |
48583 | Shingeki no Kyojin: The Final Season Part 2 | https://api-cdn.myanimelist.net/images/anime/1948/120625.jpg | https://api-cdn.myanimelist.net/images/anime/1948/120625l.jpg | 2 |
38524 | Shingeki no Kyojin Season 3 Part 2 | https://api-cdn.myanimelist.net/images/anime/1517/100633.jpg | https://api-cdn.myanimelist.net/images/anime/1517/100633l.jpg | 3 |
9253 | Steins;Gate | https://api-cdn.myanimelist.net/images/anime/5/73199.jpg | https://api-cdn.myanimelist.net/images/anime/5/73199l.jpg | 4 |
As you can see, the format for this response allows creation of a dataframe. This is because the lists are all of the same size. It also will be used as the basis for obtaining the list of all of the IDs to pull detailed data.
The API retrieves values from a point in time. Since I plan on maintaining this somewhat frequently (possibly as often as every two weeks), I will incorporate a time field setup as below:
tm_ky_de <- '2/11/22' # MANUAL ADJUSTMENT EACH LOAD
tm_ky <- 2 # MANUAL ADJUSTMENT EACH LOAD
tm_entry <- data.frame(tm_ky = tm_ky, tm_ky_de = tm_ky_de, stringsAsFactors = FALSE)
Right now this incorporates hard-coded dates, I will update this in the future to generate these automatically.
Next I will pull the ranking information into a table. Because this table will download relatively fast, I will download all categories of the ranking system by looping through each (even though I am only interestd in “tv”).
ranking_types <- c('all', 'airing', 'upcoming', 'tv', 'ova', 'movie', 'special', 'bypopularity', 'favorite')
This list comprises all of the ranking types that I will cycle through.
This API call has a limit of 500 listings per request, as a result, I will not only loop through the categories, but also the paging of each category.
for (items in ranking_types){ # I've broken this loop into the first call for the category, then it will cycle through each of the subsequent page requests before reaching the end, then it will cycle onto the next category.
get_call <- paste0('https://api.myanimelist.net/v2/anime/ranking?ranking_type=',items, '&limit=500')
res <- GET(get_call, add_headers(`X-MAL-CLIENT-ID`= clientID))
stop_for_status(res)
data = fromJSON(rawToChar(res$content))
temp_load <- flatten(data$data)
temp_load$main_picture <- NULL # I won't obtain the pictures for this round, I'll grab those in the API call where I grab all of the other attributes for the anime.
temp_load <- as.data.frame(temp_load)
temp_load$rank_category <- items # to be able to differentiate between the categories an additional column will be added that lists the category
if (exists('rank_table')){ # if the rank table already exists, I'll bind a new row, otherwise the variable will be initialized.
rank_table <- rbind(rank_table, temp_load)
} else {
rank_table <- temp_load
}
temp_load <- NULL
i = 0
# This will cycle through all of the subsequent pages of the API call until there are no more for the category
while (!is.null(data$paging[["next"]])){
i = i + 500
print(i)
res <- GET(data$paging[["next"]], add_headers(`X-MAL-CLIENT-ID`= clientID))
data = fromJSON(rawToChar(res$content))
temp_load <- flatten(data$data)
temp_load$main_picture <- NULL
temp_load <- as.data.frame(temp_load)
temp_load$rank_category <- items
rank_table <- rbind(rank_table, temp_load)
temp_load <- NULL
}
print(paste('The calls for the', items, 'have completed'))
}
# add the time key to the data frame
rank_table <- cbind(tm_ky = tm_ky, rank_table)
Here are the first few lines from the table constructed:
knitr::kable(head(rank_table), "pipe")
tm_ky | id | title | rank | rank_category |
---|---|---|---|---|
2 | 5114 | Fullmetal Alchemist: Brotherhood | 1 | all |
2 | 48583 | Shingeki no Kyojin: The Final Season Part 2 | 2 | all |
2 | 38524 | Shingeki no Kyojin Season 3 Part 2 | 3 | all |
2 | 9253 | Steins;Gate | 4 | all |
2 | 28977 | Gintama° | 5 | all |
2 | 42938 | Fruits Basket: The Final | 6 | all |
First, I need the list of IDs that correspond with my area of interest, “tv.”
mal_id_to_pull <- rank_table %>%
filter(rank_category == 'tv') %>%
distinct(id)
Next, I’ll construct a list of the fields that I want to pull from the API. These fields will be the basis for the database.
fields_to_pull <- c('id', 'title','alternative_titles', 'main_picture', 'start_date', 'end_date', 'synopsis', 'mean', 'media_type','media_type', 'status', 'num_episodes', 'start_season', 'rating', 'studios', 'nsfw','genres', 'rank', 'popularity', 'num_scoring_users', 'statistics')
There are five initial tables that I will be building. They are:
anime_table
- information about the anime (e.g. title)anime_syn_table
- synonyms for titlesanime_genres_table
- genres that are associated with the titleanime_studios_table
- studios that produced the titleanime_ranking_table
- ranking information
Below are three caveats to the data that gets loaded. I’ll build these
criteria into if
statements so that only data that meets these
conditions gets loaded:
- The title has a mean - this is the score on a scale of 0-10. The score for the anime is one of the building blocks of my visualization, so I don’t want to include anything that doesn’t have one.
- The title has a rank - this is the method that MyAnimeList uses to ensure that non-adult titles do not make their way into the rankings. I’d like my application to have a wide audience so I will use this condition as well as the next criteria to ensure that I exclude adult-oriented material.
- The rating is not
r+
- to exclude adult-oriented material.
To create the five tables above, I need to cycle through each of their IDs, request the information through the API, load each record as a row, then move onto the next ID.
Please note: I’ve built a break into the code below so that only the first 400 IDs are pulled in. If running for the entire list you will need to remove that if statement at the end of the loop.
i <- 0
for (j in 1:nrow(mal_id_to_pull)) { # to cycle through all of my IDs
get_call <- paste0('https://api.myanimelist.net/v2/anime/', mal_id_to_pull[j,],'?fields=', paste(fields_to_pull,collapse = ","))
res <- GET(get_call, add_headers(`X-MAL-CLIENT-ID`= clientID))
stop_for_status(res)
data = fromJSON(rawToChar(res$content))
if (!is.null(data$mean) & !is.null(data$rank)) { # the two conditions that I mentioned above
if (is.null(data$rating)) { # this is the last condition mentioned above
rating ="N/A"
} else {
rating = data$rating
}
if (rating != "r+") {
# Load the SYNONYMS table
if (length(data$alternative_titles$synonyms)!=0){
syn_table_load <- as.data.frame(cbind(tm_ky = tm_ky, id = data$id, synonyms = data$alternative_titles$synonyms), stringsAsFactors = FALSE)
if (exists('anime_syn_table')){
anime_syn_table <- dplyr::bind_rows(anime_syn_table, syn_table_load)
} else {
anime_syn_table <- syn_table_load
}
syn_table_load <- NULL
}
data$alternative_titles$synonyms <- NULL # this is the first case where I need to set a field to NULL so that I can create a dataframe based of remaining fields in the request
# Load the GENRES table
if (length(data$genres$id)!=0){
genres_table_load <- as.data.frame(cbind(tm_ky = tm_ky, id = data$id, genres_id = data$genres$id, genres_de = data$genres$name), stringsAsFactors = FALSE)
if (exists('anime_genres_table')){
anime_genres_table <- dplyr::bind_rows(anime_genres_table, genres_table_load)
} else {
anime_genres_table <- genres_table_load
}
genres_table_load <- NULL
}
data$genres <- NULL
# Load the STUDIOS table
if (length(data$studios$id)!=0){
studios_table_load <- as.data.frame(cbind(tm_ky = tm_ky, id = data$id, studio_id = data$studios$id, studio_de = data$studios$name), stringsAsFactors = FALSE)
if (exists('anime_studios_table')){
anime_studios_table <- dplyr::bind_rows(anime_studios_table, studios_table_load)
} else {
anime_studios_table <- studios_table_load
}
studios_table_load <- NULL
}
data$studios <- NULL
# Load the RANKING table, mean, rank, popularity, num_scoring_users, statistics
if (length(data$statistics$num_list_users)!=0){
ranking_table_load <- as.data.frame(cbind(tm_ky = tm_ky, id = data$id, mean = data$mean, rank = data$rank, popularity = data$popularity, num_scoring_users = data$num_scoring_users, statistics.watching = data$statistics$status$watching, statistics.completed = data$statistics$status$completed, statistics.on_hold = data$statistics$status$on_hold, statistics.dropped = data$statistics$status$dropped, statistics.plan_to_watch = data$statistics$status$plan_to_watch, statistics.num_scoring_users = data$statistics$num_list_users), stringsAsFactors = FALSE)
if (exists('anime_ranking_table')){
anime_ranking_table <- dplyr::bind_rows(anime_ranking_table, ranking_table_load)
} else {
anime_ranking_table <- ranking_table_load
}
ranking_table_load <- NULL
}
data$statistics <- NULL
data$mean <- NULL
data$rank <- NULL
data$popularity <- NULL
data$num_scoring_users <- NULL
# Load the anime table
for (k in 1:length(data)) { #cycles through each item in data and checks to see if its empty, if so, it will set it to NULL
if (k <= length(data)) { # need to add this because if I delete columns my original loop will extend beyond boundaries
if (length(data[[k]])==0) {
data[k] <- NULL
}
}
}
temp_load <- as.data.frame(data, stringsAsFactors = FALSE)
if (exists('anime_table')) {
anime_table <- dplyr::bind_rows(anime_table, temp_load)
} else {
anime_table <- temp_load
}
temp_load <- NULL
}
}
i = i + 1
if (i %% 500 == 0) { # to print to the console the status of every 500 requests
print(paste('Finished loading', i, 'items'))
print("Waiting for 5 minutes before proceeding") # I've built in a pause into the requests, otherwise, the server will deny requests
Sys.sleep(60)
print("4 minutes remaining")
Sys.sleep(60)
print("3 minutes remaining")
Sys.sleep(60)
print("2 minutes remaining")
Sys.sleep(60)
print("1 minute remaining")
Sys.sleep(60)
print("Starting up again")
}
if (j == 400) { # REMOVE THIS IF RUNNING FOR ENTIRE LIST
break
}
}
print("done collecting detailed information")
anime_table <- cbind(tm_ky = tm_ky, anime_table) %>%
mutate(across(c(tm_ky, id), as.character)) # alll other tables come across as characters in these two fields. Since I'm building additional tables off of these tables, I'll need to temporarily convert them into character fields.
print("Job done!")
I’d like to make the tables smaller if possible, so I’m going to create
some lookup tables for the anime_genres_table
and the
anime_studios_table
genres_l <- anime_genres_table %>%
distinct(tm_ky, genres_id, genres_de) %>%
arrange(genres_id)
knitr::kable(head(genres_l), "pipe")
tm_ky | genres_id | genres_de |
---|---|---|
2 | 1 | Action |
2 | 10 | Fantasy |
2 | 11 | Game |
2 | 13 | Historical |
2 | 14 | Horror |
2 | 15 | Kids |
This looks good, we can remove the description from the anime_genres_table now:
anime_genres_table <- anime_genres_table %>%
select(tm_ky, id, genres_id) %>%
arrange(tm_ky, id, genres_id)
knitr::kable(head(anime_genres_table))
tm_ky | id | genres_id |
---|---|---|
2 | 1 | 1 |
2 | 1 | 2 |
2 | 1 | 24 |
2 | 1 | 29 |
2 | 1 | 4 |
2 | 1 | 8 |
I want to make a studios lookup, similar to the genres:
studios_l <- anime_studios_table %>%
distinct(tm_ky, studio_id, studio_de) %>%
arrange(tm_ky, studio_id)
knitr::kable(head(studios_l))
tm_ky | studio_id | studio_de |
---|---|---|
2 | 1 | Studio Pierrot |
2 | 10 | Production I.G |
2 | 101 | Studio Hibari |
2 | 103 | Tatsunoko Production |
2 | 1075 | C-Station |
2 | 11 | Madhouse |
Now I need to remove the description from the anime_studios_table
anime_studios_table <- anime_studios_table %>%
select(tm_ky, id,studio_id) %>%
arrange(tm_ky, id, studio_id)
knitr::kable(head(anime_studios_table))
tm_ky | id | studio_id |
---|---|---|
2 | 1 | 14 |
2 | 10030 | 7 |
2 | 10049 | 37 |
2 | 10087 | 43 |
2 | 10162 | 10 |
2 | 10165 | 2 |
At this point, I have created the five main tables as well as their lookup tables. An enhancement that I’d like to build in is having a demographic table.
Demographics (Kids, Shoujo, Shounen, Seinen, Josei) have been built into the genres field.
See below for examples of genres_de that contains a demographic element:
anime_table %>%
left_join(anime_genres_table) %>%
inner_join(genres_l) %>%
filter(genres_de == "Shounen") %>%
select(!synopsis) %>%
head() %>%
knitr::kable("pipe")
## Joining, by = c("tm_ky", "id")
## Joining, by = c("tm_ky", "genres_id")
tm_ky | id | title | main_picture.medium | main_picture.large | alternative_titles.en | alternative_titles.ja | start_date | end_date | media_type | status | num_episodes | start_season.year | start_season.season | rating | nsfw | genres_id | genres_de |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 5114 | Fullmetal Alchemist: Brotherhood | https://api-cdn.myanimelist.net/images/anime/1223/96541.jpg | https://api-cdn.myanimelist.net/images/anime/1223/96541l.jpg | Fullmetal Alchemist: Brotherhood | <U+92FC><U+306E><U+932C><U+91D1><U+8853><U+5E2B> FULLMETAL ALCHEMIST | 2009-04-05 | 2010-07-04 | tv | finished_airing | 64 | 2009 | spring | r | white | 27 | Shounen |
2 | 48583 | Shingeki no Kyojin: The Final Season Part 2 | https://api-cdn.myanimelist.net/images/anime/1948/120625.jpg | https://api-cdn.myanimelist.net/images/anime/1948/120625l.jpg | Attack on Titan: The Final Season Part 2 | <U+9032><U+6483><U+306E><U+5DE8><U+4EBA> The Final Season Part 2 | 2022-01-10 | NA | tv | currently_airing | 0 | 2022 | winter | r | white | 27 | Shounen |
2 | 38524 | Shingeki no Kyojin Season 3 Part 2 | https://api-cdn.myanimelist.net/images/anime/1517/100633.jpg | https://api-cdn.myanimelist.net/images/anime/1517/100633l.jpg | Attack on Titan Season 3 Part 2 | <U+9032><U+6483><U+306E><U+5DE8><U+4EBA> Season3 Part.2 | 2019-04-29 | 2019-07-01 | tv | finished_airing | 10 | 2019 | spring | r | white | 27 | Shounen |
2 | 28977 | Gintama° | https://api-cdn.myanimelist.net/images/anime/3/72078.jpg | https://api-cdn.myanimelist.net/images/anime/3/72078l.jpg | Gintama Season 4 | <U+9280><U+9B42>° | 2015-04-08 | 2016-03-30 | tv | finished_airing | 51 | 2015 | spring | pg_13 | white | 27 | Shounen |
2 | 9969 | Gintama’ | https://api-cdn.myanimelist.net/images/anime/4/50361.jpg | https://api-cdn.myanimelist.net/images/anime/4/50361l.jpg | Gintama Season 2 | <U+9280><U+9B42>’ | 2011-04-04 | 2012-03-26 | tv | finished_airing | 51 | 2011 | spring | pg_13 | white | 27 | Shounen |
2 | 11061 | Hunter x Hunter (2011) | https://api-cdn.myanimelist.net/images/anime/1337/99013.jpg | https://api-cdn.myanimelist.net/images/anime/1337/99013l.jpg | Hunter x Hunter | HUNTER×HUNTER(<U+30CF><U+30F3><U+30BF><U+30FC>×<U+30CF><U+30F3><U+30BF><U+30FC>) | 2011-10-02 | 2014-09-24 | tv | finished_airing | 148 | 2011 | fall | pg_13 | white | 27 | Shounen |
The different demographics that I want to capture and their descriptions are: 15 - kids, 25 - Shoujo, 27 - Shounen, 42 - Seinen, 43 - Josei
Lets construct a new demographic table and an associated lookup table;
anime_demo_table <- anime_genres_table %>%
filter(genres_id %in% c(15,25,27,42,43)) %>%
rename(demo_id = genres_id)
knitr::kable(head(anime_demo_table))
tm_ky | id | demo_id |
---|---|---|
2 | 10030 | 27 |
2 | 10049 | 27 |
2 | 10162 | 43 |
2 | 10165 | 27 |
2 | 10271 | 42 |
2 | 10379 | 25 |
demo_l <- genres_l %>%
filter(genres_id %in% c(15,25,27,42,43)) %>%
rename(demo_id = genres_id, demo_de = genres_de)
knitr::kable(head(demo_l))
tm_ky | demo_id | demo_de |
---|---|---|
2 | 15 | Kids |
2 | 25 | Shoujo |
2 | 27 | Shounen |
2 | 42 | Seinen |
2 | 43 | Josei |
Now, we need to remove those ids from the genres and genres lookup tables.
anime_genres_table <- anime_genres_table %>%
filter(!genres_id %in% c(15,25,27,42,43))
genres_l <- genres_l %>%
filter(!genres_id %in% c(15,25,27,42,43))
After reviewing the data and attempting to create some initial views and
filters in Tableau, its become apparent that I need to include columns
that collapse the values in the demographic, genres, synonyms, and
studios tables into single columns within the anime_table
table.
To do that, I’ll create temporary dataframes that contain the ID and comma delimited values, then join those to the main table so that the additional columns can be added.
temp_demo_table <- anime_demo_table %>%
inner_join(demo_l, by='demo_id') %>%
group_by(id) %>%
summarise(demo_de=paste(demo_de, collapse = ","))
temp_genres_table <- anime_genres_table %>%
inner_join(genres_l, by='genres_id') %>%
group_by(id) %>%
summarise(genres_de=paste(genres_de, collapse = ","))
temp_studios_table <- anime_studios_table %>%
inner_join(studios_l, by='studio_id') %>%
group_by(id) %>%
summarise(studios_de=paste(studio_de, collapse = ","))
temp_syn_table <- anime_syn_table %>%
group_by(id) %>%
summarise(synonyms=paste(synonyms, collapse = ","))
Here is what one of those tables looks like:
knitr::kable(head(temp_genres_table), "pipe")
id | genres_de |
---|---|
1 | Action,Adventure,Sci-Fi,Space,Comedy,Drama |
10030 | Romance,Comedy,Drama |
10049 | Action,Supernatural,Demons |
10087 | Action,Fantasy,Supernatural |
10162 | Slice of Life |
10165 | School,Slice of Life,Comedy |
Now, I need to join those temp tables and add those columns to the main table.
While I’m at it, I’m also going to create a new column
alternative title.
This columns will contain the english name by
default, and the anime title if that value is null. Similar to a
coalesce
SQL call.
anime_table <- anime_table %>%
left_join(temp_demo_table, by='id') %>%
left_join(temp_genres_table, by='id') %>%
left_join(temp_studios_table, by='id') %>%
left_join(temp_syn_table, by='id') %>%
mutate(alternative_title = case_when(
is.null(alternative_titles.en) ~ title,
trimws(alternative_titles.en) == '' ~ title,
!is.null(trimws(alternative_titles.en)) ~ alternative_titles.en
))
anime_table %>%
select(!synopsis) %>%
head() %>%
knitr::kable("pipe")
tm_ky | id | title | main_picture.medium | main_picture.large | alternative_titles.en | alternative_titles.ja | start_date | end_date | media_type | status | num_episodes | start_season.year | start_season.season | rating | nsfw | demo_de | genres_de | studios_de | synonyms | alternative_title |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 5114 | Fullmetal Alchemist: Brotherhood | https://api-cdn.myanimelist.net/images/anime/1223/96541.jpg | https://api-cdn.myanimelist.net/images/anime/1223/96541l.jpg | Fullmetal Alchemist: Brotherhood | <U+92FC><U+306E><U+932C><U+91D1><U+8853><U+5E2B> FULLMETAL ALCHEMIST | 2009-04-05 | 2010-07-04 | tv | finished_airing | 64 | 2009 | spring | r | white | Shounen | Action,Fantasy,Adventure,Military,Comedy,Drama | Bones | Hagane no Renkinjutsushi: Fullmetal Alchemist,Fullmetal Alchemist (2009),FMA,FMAB | Fullmetal Alchemist: Brotherhood |
2 | 48583 | Shingeki no Kyojin: The Final Season Part 2 | https://api-cdn.myanimelist.net/images/anime/1948/120625.jpg | https://api-cdn.myanimelist.net/images/anime/1948/120625l.jpg | Attack on Titan: The Final Season Part 2 | <U+9032><U+6483><U+306E><U+5DE8><U+4EBA> The Final Season Part 2 | 2022-01-10 | NA | tv | currently_airing | 0 | 2022 | winter | r | white | Shounen | Action,Fantasy,Super Power,Military,Mystery,Drama | MAPPA | Shingeki no Kyojin Season 4,Attack on Titan Season 4 | Attack on Titan: The Final Season Part 2 |
2 | 38524 | Shingeki no Kyojin Season 3 Part 2 | https://api-cdn.myanimelist.net/images/anime/1517/100633.jpg | https://api-cdn.myanimelist.net/images/anime/1517/100633l.jpg | Attack on Titan Season 3 Part 2 | <U+9032><U+6483><U+306E><U+5DE8><U+4EBA> Season3 Part.2 | 2019-04-29 | 2019-07-01 | tv | finished_airing | 10 | 2019 | spring | r | white | Shounen | Action,Fantasy,Super Power,Military,Mystery,Drama | Wit Studio | NA | Attack on Titan Season 3 Part 2 |
2 | 9253 | Steins;Gate | https://api-cdn.myanimelist.net/images/anime/5/73199.jpg | https://api-cdn.myanimelist.net/images/anime/5/73199l.jpg | Steins;Gate | STEINS;GATE | 2011-04-06 | 2011-09-14 | tv | finished_airing | 24 | 2011 | spring | pg_13 | white | NA | Sci-Fi,Psychological,Suspense,Drama | White Fox | NA | Steins;Gate |
2 | 28977 | Gintama° | https://api-cdn.myanimelist.net/images/anime/3/72078.jpg | https://api-cdn.myanimelist.net/images/anime/3/72078l.jpg | Gintama Season 4 | <U+9280><U+9B42>° | 2015-04-08 | 2016-03-30 | tv | finished_airing | 51 | 2015 | spring | pg_13 | white | Shounen | Action,Historical,Parody,Samurai,Sci-Fi,Comedy | Bandai Namco Pictures | Gintama’ (2015) | Gintama Season 4 |
2 | 42938 | Fruits Basket: The Final | https://api-cdn.myanimelist.net/images/anime/1085/114792.jpg | https://api-cdn.myanimelist.net/images/anime/1085/114792l.jpg | <U+30D5><U+30EB><U+30FC><U+30C4><U+30D0><U+30B9><U+30B1><U+30C3><U+30C8> The Final | 2021-04-06 | 2021-06-29 | tv | finished_airing | 13 | 2021 | spring | pg_13 | white | Shoujo | Romance,Slice of Life,Supernatural,Drama | TMS Entertainment | Fruits Basket 3rd Season,Fruits Basket (2019) 3rd Season,Furuba | Fruits Basket: The Final |
temp_demo_table <- NULL
temp_genres_table <- NULL
temp_studios_table <- NULL
temp_syn_table <- NULL
Now we need to export the tables
write.csv(anime_demo_table, "temp_anime_demo_table.csv", row.names = FALSE)
write.csv(anime_genres_table, "temp_anime_genres_table.csv", row.names = FALSE)
write.csv(anime_ranking_table, "temp_anime_ranking_table.csv", row.names = FALSE)
write.csv(anime_studios_table, "temp_anime_studios_table.csv", row.names = FALSE)
write.csv(anime_syn_table, "temp_anime_syn_table.csv", row.names = FALSE)
write.csv(anime_table, "temp_anime_table.csv", row.names = FALSE)
write.csv(demo_l, "temp_demo_l.csv", row.names = FALSE)
write.csv(genres_l, "temp_genres_l.csv", row.names = FALSE)
write.csv(rank_table, "temp_rank_table.csv", row.names = FALSE)
write.csv(studios_l, "temp_studios_l.csv", row.names = FALSE)
write.csv(tm_entry, "temp_tm_ky.csv", row.names = FALSE)
anime_demo_table <- NULL
anime_genres_table <- NULL
anime_ranking_table <- NULL
anime_studios_table <- NULL
anime_syn_table <- NULL
anime_table <- NULL
demo_l <- NULL
genres_l <- NULL
rank_table <- NULL
studios_l <- NULL
tm_entry <- NULL
That is everything that is needed to extract the base tables. Next, I will go through the process for cleaning the data and loading them into their final table format.