3/12/2022
I currently have a Tableau dashboard located here. This dashboard is an exploratory view of top-ranked and popular anime titles, based primarily on data from http://myanimelist.net. Its purpose is to help the end user find their next anime to watch. What the tool lacks, however, is the streaming platform that currently carries each title.
To add that information to the dashboard, I need to either locate or scrape the lists of anime titles available on the various platforms.
The most popular platforms for viewing anime titles are:
- Crunchyroll
- Funimation
- Netflix
- Hulu
- Amazon Prime
- HIDIVE
This list is not all-inclusive and is in no particular order.
Set your working directory to wherever you’d like by replacing WORKINGDIRECTORY below with your own path.
wd1 <- "WORKINGDIRECTORY"  # placeholder: replace with the path you want to use
setwd(wd1)
The tidyverse contains the rvest library, which will be used to scrape the data, while dplyr and tidyr will be used to manipulate and transform the data.
install.packages(c("tidyverse","textutils"))
Next, we want to load three libraries:
library(tidyverse)
library(rvest)
library(textutils)
Although rvest is a component of the tidyverse, it doesn’t load automatically with library(tidyverse), so you’ll need to load it separately. The textutils library lets you decode HTML special characters into their expected values.
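As a quick illustration of what textutils contributes, the sketch below decodes a couple of HTML entities with HTMLdecode(). The sample strings are made up for illustration and are not taken from the HIDIVE page.

```r
library(textutils)

# Hypothetical strings containing HTML entities (illustrative only)
raw_titles <- c("Sword &amp; Shield", "Level &lt; 9")

# HTMLdecode() replaces entities such as &amp; and &lt; with the characters they stand for
clean_titles <- HTMLdecode(raw_titles)
print(clean_titles)
```

Scraped titles that contain an ampersand are the most likely place for entities like these to show up, so decoding is worth applying before matching titles across sources.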
Let’s start with HIDIVE:
This is one of the large streaming platforms that I will be scraping as an add-on to my anime suggestion Tableau dashboard. Thankfully, HIDIVE lists all of its titles on a single webpage: https://www.hidive.com/tv.
HIDIVE <- read_html('https://www.hidive.com/tv')
title_bucket <- HIDIVE %>%
  html_element(xpath = '/html/body/div[1]/div')
The code above isolates the bucket that I want to pull all of my information from.
When reviewing the HTML code on the page, it appears that the anime titles can be found within h2 headers, images within default-img divs, and date information within cell divs.
Let’s grab that data:
anime_title <- title_bucket %>%
  html_elements('h2') %>%
  html_text(trim = TRUE)
anime_premiere <- title_bucket %>%
  html_elements('div.cell') %>%
  html_attr('data-premieredt')
anime_release <- title_bucket %>%
  html_elements('div.cell') %>%
  html_attr('data-releasedt')
anime_nextair <- title_bucket %>%
  html_elements('div.cell') %>%
  html_attr('data-nextairdate')
anime_img <- title_bucket %>%
  html_elements('div.default-img') %>%
  html_elements('img') %>%
  html_attr('data-src')
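To see how these selectors behave without hitting the live site, here is a minimal sketch that runs the same kind of calls against a small mock page built with rvest's minimal_html(). The class and attribute names mirror the ones used above; the titles, dates, and image paths are invented.

```r
library(rvest)

# Mock markup: structure assumed from the description above, values invented
page <- minimal_html('
  <div class="cell" data-premieredt="1/1/2020 12:00:00 AM">
    <div class="default-img"><img data-src="//example.com/a.jpg"></div>
    <h2> Show A </h2>
  </div>
  <div class="cell" data-premieredt="2/2/2021 12:00:00 AM">
    <div class="default-img"><img data-src="//example.com/b.jpg"></div>
    <h2>Show B</h2>
  </div>')

# Text content of every h2, with surrounding whitespace trimmed
titles <- page %>% html_elements("h2") %>% html_text(trim = TRUE)

# An attribute value read from every div with class "cell"
premieres <- page %>% html_elements("div.cell") %>% html_attr("data-premieredt")

# A descendant selector reaches the img inside each default-img div
images <- page %>% html_elements("div.default-img img") %>% html_attr("data-src")
```

Each vector comes back with one entry per tile, in document order, which is what makes it possible to line them up as columns later.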
Each anime tile also carries a badge indicating whether the anime is exclusive to HIDIVE or is dubbed. I want to grab that information as well.
However, there aren’t consistent div containers that would let me align the badge attributes with the titles. For that reason, I will select the overall bucket for each title, housed within cell divs, then run a for loop to check whether each bucket contains a badge.
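anime_cell <- title_bucket %>%
  html_elements('div.cell')
anime_badge <- c()
for (item in anime_cell) {
  # Badge text for this tile; character(0) when the tile has no badge
  rowVal <- item %>%
    html_elements('div.top-badge') %>%
    html_text(trim = TRUE)
  # Record NA for tiles without a badge so the vector stays aligned with the titles
  if (identical(rowVal, character(0))) {
    rowVal <- NA
  }
  anime_badge <- append(anime_badge, rowVal)
}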
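As a possible alternative to the for loop, rvest's html_element() (singular) returns exactly one result per input node and yields NA where nothing matches. The sketch below shows this on mock markup; the class names follow the ones above, the content is invented, and it is not the approach used in this post.

```r
library(rvest)

# Mock tiles: the second one has no badge
page <- minimal_html('
  <div class="cell"><div class="top-badge"> Dubbed </div><h2>Show A</h2></div>
  <div class="cell"><h2>Show B</h2></div>
  <div class="cell"><div class="top-badge">Exclusive</div><h2>Show C</h2></div>')

cells <- page %>% html_elements("div.cell")

# One result per cell; cells without a matching badge come back as NA
badges <- cells %>%
  html_element("div.top-badge") %>%
  html_text(trim = TRUE)

print(badges)
```

Note that html_element() keeps only the first match in each cell, so a tile carrying more than one badge would need different handling.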
Let’s check the length of each vector to see whether they can be combined:
length(anime_title)
## [1] 436
length(anime_premiere)
## [1] 436
length(anime_release)
## [1] 436
length(anime_nextair)
## [1] 436
length(anime_img)
## [1] 436
length(anime_badge)
## [1] 436
They are all the same length, so I can combine and export them in the next step.
Let’s make it a data frame and export the table:
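The same check can be done in one step rather than by reading each length by eye. A minimal sketch, using short placeholder vectors in place of the scraped ones:

```r
# Placeholder vectors standing in for the scraped columns (values invented)
cols <- list(
  anime_title = c("Show A", "Show B", "Show C"),
  anime_img   = c("a.jpg", "b.jpg", "c.jpg"),
  anime_badge = c("Dubbed", NA, "Exclusive")
)

# lengths() returns the length of each list element as a named vector
col_lengths <- lengths(cols)

# TRUE only when every column has the same number of entries
same_length <- length(unique(col_lengths)) == 1

# Stop early if the columns would not line up in a data frame
stopifnot(same_length)
```

Failing fast here is useful because data.frame() can silently recycle shorter vectors when their lengths happen to divide evenly.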
hidive_data <- data.frame(
  anime_title = anime_title,
  anime_premiere = anime_premiere,
  anime_release = anime_release,
  anime_nextair = anime_nextair,
  anime_img = anime_img,
  anime_badge = anime_badge
)
knitr::kable(head(hidive_data))
anime_title | anime_premiere | anime_release | anime_nextair | anime_img | anime_badge |
---|---|---|---|---|---|
100 Sleeping Princes & the Kingdom of Dreams | 7/5/2018 12:00:00 AM | 7/5/2018 12:00:00 AM | | //static.hidive.com/titles/OSP/OSP_01_MASTER_300x169.jpg | Exclusive |
A Little Snow Fairy Sugar | 10/2/2001 12:00:00 AM | 10/2/2001 12:00:00 AM | | //static.hidive.com/titles/LSF/LSF_01_MASTER_300x169.jpg | Dubbed |
Action Heroine Cheer Fruits | 7/6/2017 12:00:00 AM | 7/6/2017 5:00:00 PM | | //static.hidive.com/titles/ACF/action-heroine-cheer-fruits_ACF_MASTER_300x169.jpg | NA |
After the Rain | 1/12/2018 12:00:00 AM | 3/31/2021 1:00:00 PM | | //static.hidive.com/titles/ATR/after-the-rain_ATR_01_MASTER_300x169.jpg | Dubbed |
Ahiru no Sora | 10/2/2019 12:00:00 AM | 9/30/2019 5:00:00 PM | | //static.hidive.com/titles/ANS/ahiru-no-sora_ANS_01_MASTER_300x169_01.jpg | Dubbed |
Akame ga Kill! | 7/6/2014 12:00:00 AM | 7/6/2014 12:00:00 PM | | //static.hidive.com/titles/AGK/AGK_MASTER_300x169.jpg | Dubbed |
write.csv(hidive_data, 'hidivetitles.csv', row.names = FALSE)
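The exported dates are still plain strings, and the image paths are protocol-relative (they start with //). If the dashboard needs real dates and full URLs, one way to convert them is sketched below in base R. The sample values are copied from the table above; the UTC time zone and the https: prefix are assumptions, and the %p (AM/PM) format code depends on the session locale.

```r
# Sample values copied from the preview table above
premiere_raw <- c("7/5/2018 12:00:00 AM", "1/12/2018 12:00:00 AM")
img_raw <- c(
  "//static.hidive.com/titles/OSP/OSP_01_MASTER_300x169.jpg",
  "//static.hidive.com/titles/ATR/after-the-rain_ATR_01_MASTER_300x169.jpg"
)

# Parse month/day/year with a 12-hour clock; the time zone is assumed, not known
premiere_time <- as.POSIXct(premiere_raw,
                            format = "%m/%d/%Y %I:%M:%S %p",
                            tz = "UTC")
premiere_date <- as.Date(premiere_time, tz = "UTC")

# Prefix protocol-relative paths with an assumed https: scheme
img_url <- ifelse(startsWith(img_raw, "//"), paste0("https:", img_raw), img_raw)
```

Values that fail to parse come back as NA, so it is worth checking for NA in the result before relying on the converted column.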