# PC Session 2

**Author:**
[Helge Liebert](https://hliebert.github.io/)

## Scraping the WHO snake antivenom database

## Requirements

In [1]:
## Load libraries
library(rvest)
library(xml2)

Loading required package: xml2



## Simple approach

Collect the option values of the form, submit the form once per value, and extract the required information from each resulting page. Collect all results in a single data frame.

In [None]:
## The site
site <- "http://apps.who.int/bloodproducts/snakeantivenoms/database/SearchFrm.aspx"

In [None]:
## Initiate session on site
session <- html_session(site)

In [None]:
## Get all drop down options, value for submission and text
options <- html_nodes(session, css = "#ddlCountry > option")
countries <- data.frame(
  value  = html_attr(options, "value"),
  option = html_text(options)
)
countries <- countries[-1, ] # Trim first line

In [None]:
## Empty data frame to be filled
data <- data.frame(matrix(nrow = 0, ncol = 5))

In [None]:
## Get snake venom data for all countries in the list
for (opt in countries$value[1:2]) {

  ## display some information
  print(paste0(which(opt == countries$value), "/",
               nrow(countries), " ",
               countries[countries$value == opt, "option"]))

  ## set option and submit form
  form <- html_form(html_node(session, "#form1"))
  form <- set_values(form, "ddlCountry" = opt)
  newpage <- submit_form(session, form)

  ## Collect the snake table for this country
  snakeinfo <- html_node(newpage, css = "#SnakesGridView")
  snakeinfo <- html_table(snakeinfo, fill = TRUE, header = TRUE)
  snakeinfo$country <- countries[countries$value == opt, "option"]

  ## Append to data
  data <- rbind(data, snakeinfo[-1, ])
}

## Build header, add to data frame, write to file
header <- c("link", "category", "common name", "species name", "country")
names(data) <- header
write.table(data, "Data/who.csv", sep = ";", row.names = FALSE)


In [None]:
## Look at data
head(data)
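
Calling `rbind()` inside the loop copies the growing data frame on every iteration. A minimal sketch of a common alternative, using toy data: collect one data frame per country in a list and bind them once at the end. `fetch_country()` and the country names are hypothetical stand-ins for the form submission above, not part of the original code.

```r
## Hypothetical stand-in for the form submission and table extraction
fetch_country <- function(name) {
  data.frame(species = paste("species of", name),
             country = name,
             stringsAsFactors = FALSE)
}

## Placeholder country names (assumption, for illustration only)
country_names <- c("Afghanistan", "Albania")

## One data frame per country, collected in a list
results <- lapply(country_names, fetch_country)

## Bind all pieces in a single step
combined <- do.call(rbind, results)
```

This keeps everything in memory, so it only helps with speed, not with memory use.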

## Variant: write data to disk immediately

A better alternative to collecting the data in a data.frame (or any other in-memory object) is to write each result to disk immediately. This mitigates the risk of exhausting memory. A CSV file is sufficient for small data; otherwise use a database.
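
For the database route, a minimal sketch using SQLite could look like the following. It assumes the `DBI` and `RSQLite` packages are installed. The table name `snakes`, the temporary file, and the toy rows are placeholders, not part of the original notebook.

```r
library(DBI)

## Assumption: RSQLite is installed; the database file is a throwaway temp file
con <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))

## Toy rows standing in for one country's scraped table
snakeinfo <- data.frame(
  category    = "1",
  common_name = "example snake",
  country     = "Exampleland",
  stringsAsFactors = FALSE
)

## Creates the table on the first call and appends on later calls
dbWriteTable(con, "snakes", snakeinfo, append = TRUE)
dbWriteTable(con, "snakes", snakeinfo, append = TRUE)

## Check how many rows have been stored so far
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM snakes")$n

dbDisconnect(con)
```

Inside the scraping loop, the `dbWriteTable()` call would take the place of the `write.table(..., append = TRUE)` call.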

In [2]:
## The site, main info: "http://apps.who.int/bloodproducts/snakeantivenoms/database/"
site <- "http://apps.who.int/bloodproducts/snakeantivenoms/database/SearchFrm.aspx"

## Initiate session on site
session <- html_session(site)

## Get all drop down options, value for submission and text
options <- html_nodes(session, css = "#ddlCountry > option")
countries <- data.frame(
  value  = html_attr(options, "value"),
  option = html_text(options)
)
countries <- countries[-1, ] # Trim first line

In [None]:
## Build header and write to file
header <- c("link", "category", "common name", "species name", "country")
write.table(t(header), "Data/who.csv", sep = ";",
            col.names = FALSE, row.names = FALSE)

In [None]:
## Get snake venom data for all countries in the list
for (opt in countries$value) {
## for (opt in countries$value[1:2]) {

  ## display some information
  print(paste0(which(opt == countries$value), "/",
               nrow(countries), " ",
               countries[countries$value == opt, "option"]))

  ## set option and submit form
  form <- html_form(html_node(session, "#form1"))
  form <- set_values(form, "ddlCountry" = opt)
  newpage <- submit_form(session, form)

  ## Collect the snake table for this country
  snakeinfo <- html_node(newpage, css = "#SnakesGridView")
  snakeinfo <- html_table(snakeinfo, fill = TRUE, header = TRUE)
  snakeinfo$country <- countries[countries$value == opt, "option"]

  ## append to file
  write.table(snakeinfo, "Data/who.csv", sep = ";", append = TRUE,
              col.names = FALSE, row.names = FALSE)

}

## Alternative: construct query URL directly

As an alternative to form submission, construct the request URL directly in the following form. To find it, inspect the requests in the network monitor of the browser's developer tools.

https://apps.who.int/bloodproducts/snakeantivenoms/database/SnakeAntivenomListFrm.aspx?@CountryID=2
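
A minimal sketch of building such URLs in R. The helper name `build_query_url()` is hypothetical, and it is an unverified assumption that the `@CountryID` parameter takes the same values as the `value` attribute of the drop-down options.

```r
## Base of the query URL, taken from the example above
base_url <- paste0("https://apps.who.int/bloodproducts/snakeantivenoms/",
                   "database/SnakeAntivenomListFrm.aspx")

## Hypothetical helper: append the country id as the query parameter
build_query_url <- function(country_id) {
  paste0(base_url, "?@CountryID=", country_id)
}

## Assumption: ids match the drop-down values (e.g. 2, or 211 for Zimbabwe)
urls <- vapply(c(2, 211), build_query_url, character(1))

## Each URL could then be read directly, e.g.:
## page <- read_html(urls[1])
```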


## Problem: Embedded JavaScript

The scraped data are incomplete. For some countries the results table continues on further pages, and the links to those pages are JavaScript calls rather than ordinary URLs, so the session cannot follow them.

In [None]:
## Wait, what is this? Data as they should be vs. data as they are
system("head -n 20 Data/who-complete.csv")
system("head -n 20 Data/who-not-complete.csv")
system("tail -n 20 Data/who-complete.csv")
system("tail -n 20 Data/who-not-complete.csv")

In [None]:
##  Check for Zimbabwe/check link in Browser
form <- html_form(html_node(session, "#form1"))
form <- set_values(form, "ddlCountry" = 211)
zbpage <- submit_form(session, form)
snakeinfo <- html_node(zbpage, css = "#SnakesGridView")

In [None]:
## These links contain JavaScript and can't be followed
p2link <- html_node(zbpage, css = "#SnakesGridView a") %>% html_text()
zbpage2 <- follow_link(zbpage, p2link)

In [None]:
## Same problem when jumping to the href attribute directly
p2link <- html_node(zbpage, css = "#SnakesGridView a") %>% html_attr("href")
zbpage2 <- jump_to(zbpage, p2link)

In [3]:
## Solutions:
## -- Reverse engineer the internal API and construct a specific POST? Patch the forms?
## -- Or use Selenium.
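
As a starting point for the first option, here is a minimal sketch of pulling the postback arguments out of such a link. It assumes the `href` values follow the usual ASP.NET WebForms pattern `javascript:__doPostBack('target','argument')`, which has not been verified for this site. The helper `parse_postback()` and the example `href` are hypothetical.

```r
## Hypothetical helper: extract event target and argument from a postback href
parse_postback <- function(href) {
  pattern <- "^javascript:__doPostBack\\('([^']*)','([^']*)'\\)$"
  m <- regmatches(href, regexec(pattern, href))[[1]]
  ## Full match plus two capture groups are expected
  if (length(m) != 3) return(NULL)
  list(target = m[2], argument = m[3])
}

## Assumed example of a pager link (not copied from the site)
example_href <- "javascript:__doPostBack('SnakesGridView','Page$2')"
pb <- parse_postback(example_href)

## In WebForms, these two values are normally sent as the form fields
## __EVENTTARGET and __EVENTARGUMENT, together with the hidden fields of
## the page (e.g. __VIEWSTATE), in a POST request to the same page.
```

Whether replaying such a POST is enough to reach the second page would have to be checked against the requests shown in the browser's developer tools.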