## Web scraping with XPATHs
Now that we've covered the low-hanging fruit ("it has an API, and a client", "it has an API") it's time to talk about what to do when a website doesn't have any access mechanisms at all - when you have to rely on web scraping. This chapter will introduce you to the rvest web-scraping package, and build on your previous knowledge of XML manipulation and XPATHs.

### Reading HTML
The first step in web scraping is reading the HTML in. This can be done with read_html(), a function from xml2, which is imported by rvest. It accepts a single URL, and returns a parsed document that we can query further on.

We're going to experiment with that by grabbing Hadley Wickham's Wikipedia page with rvest, and then printing it just to see what the structure looks like.

In [1]:
# Load rvest
library(rvest)

# Hadley Wickham's Wikipedia page
test_url <- "https://en.wikipedia.org/wiki/Hadley_Wickham"

# Read the URL stored as "test_url" with read_html()
test_xml <- read_html(test_url)

# Print test_xml
test_xml

"package 'rvest' was built under R version 3.6.3"Loading required package: xml2
"package 'xml2' was built under R version 3.6.3"

{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
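
read_html() is not limited to URLs: it also accepts a literal string of HTML, which is handy for experimenting without a network connection. A minimal sketch, assuming a made-up page (toy_page and its contents are hypothetical, not part of the Wikipedia example):

```r
library(rvest)

# A tiny, hypothetical HTML page supplied as a string instead of a URL
toy_page <- read_html("<html><head><title>Toy page</title></head><body><p>Hello</p></body></html>")

# Printing shows the same kind of structure as the Wikipedia page: a head and a body
toy_page

# The result is an xml2 document object
class(toy_page)
```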

### Extracting nodes by XPATH
Now you've got an HTML page read into R. Great! But how do you get individual, identifiable pieces of it?

The answer is to use html_node(), which extracts an individual chunk of HTML from an HTML document (the first one that matches, if there are several). There are a couple of ways of identifying and filtering nodes, and for now we're going to use XPATHs: path expressions that select particular pieces of an HTML document.

These can be retrieved using a browser gadget we'll talk about later. In the meantime, the XPATH for the information box in the page you just downloaded is stored as test_node_xpath. We're going to retrieve the box from the HTML doc with html_node(), using test_node_xpath as the xpath argument.

In [2]:
# XPATH matching any element whose class attribute includes "vcard"
test_node_xpath <- "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"vcard\", \" \" ))]"

# Use html_node() to grab the node with the XPATH stored as `test_node_xpath`
node <- html_node(x = test_xml, xpath = test_node_xpath)

# Print the first element of the result (the internal pointer to the node)
print(node[1])

$node
<pointer: 0x000000005978c370>
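
The XPATH above looks intimidating, but it is a common idiom for "any element whose class list includes this word". The sketch below tries it on a small made-up document (toy_doc, class_xpath and the HTML are hypothetical) and contrasts html_node(), which returns only the first match, with html_nodes(), which returns all of them:

```r
library(rvest)

# Hypothetical miniature page, used only for illustration
toy_doc <- read_html('<html><body>
  <table class="infobox vcard"><tr><th>Ada</th></tr></table>
  <table class="vcardish"><tr><td>not a match</td></tr></table>
  <div class="vcard">second match</div>
</body></html>')

# Padding @class with spaces means "vcard" must appear as a whole word,
# so class="vcardish" is not selected
class_xpath <- "//*[contains(concat(' ', @class, ' '), ' vcard ')]"

first_match <- html_node(toy_doc, xpath = class_xpath)   # first matching node only
all_matches <- html_nodes(toy_doc, xpath = class_xpath)  # every matching node

html_name(first_match)
length(all_matches)
```

Here the first match is the table and there are two matches in total, which is why an XPATH is better thought of as a query than as a unique identifier.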



### Extracting names
The first thing we'll grab is the name of the element we just extracted. We can do this with html_name(). The element is an HTML table, with the tag `<table>...</table>`, so we'd expect the name to be, well, table.

In [4]:
# Extract the name of node (the table element extracted earlier)
element_name <- html_name(node)

# Print the name
print(element_name)

[1] "table"


### Extracting values
Just knowing the type of HTML object a node is can be helpful, but it isn't much use on its own. What we really want is to extract the actual text stored within the node.

We can do that with (shocker) html_text(), another convenient rvest function that accepts a node and passes back the text inside it. For this we'll want a node within the extracted element: specifically, the one containing the page title. The xpath value for that node is stored as second_xpath_val.

Using this xpath value, extract the node we want from within node (the table element), and then use html_text() to extract the text, before printing it.

In [9]:
# XPATH for the element with class "fn"; the leading "." keeps the search inside the node it is applied to
second_xpath_val <- ".//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"fn\", \" \" ))]"

# Extract the element of node referred to by second_xpath_val and store it as page_name
page_name <- html_node(x = node, xpath = second_xpath_val)

# Extract the text from page_name
page_title <- html_text(page_name)

# Print page_title
page_title
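
One thing worth knowing when applying an XPATH to a node rather than to a whole document: an expression starting with // searches the entire document, even when it is called on a sub-node, whereas one starting with .// searches only beneath the current node. A minimal sketch on a made-up page (toy_doc, toy_table and the simplified class test are all hypothetical):

```r
library(rvest)

# Hypothetical page with one "fn" element outside the table and one inside it
toy_doc <- read_html('<html><body>
  <span class="fn">Outside the table</span>
  <table class="vcard"><tr><th class="fn">Inside the table</th></tr></table>
</body></html>')

toy_table <- html_node(toy_doc, xpath = "//table")

# "//" starts from the document root, so it finds the <span> first
absolute_hit <- html_node(toy_table, xpath = "//*[@class='fn']")

# ".//" starts from toy_table, so it only sees the <th>
relative_hit <- html_node(toy_table, xpath = ".//*[@class='fn']")

html_text(absolute_hit)
html_text(relative_hit)
```

On a page where the only matching element happens to sit inside the table, both forms give the same answer, which is why the difference is easy to miss.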

### Extracting tables
The data from Wikipedia that we've been playing around with can be extracted bit by bit and cleaned up manually, but since it's a table, we have an easier way of turning it into an R object. rvest contains the function html_table() which, as the name suggests, extracts tables. It accepts a node containing a table object, and outputs a data frame.

Let's use it now: take the table we've extracted, and turn it into a data frame.

In [11]:
# Turn node (the table element) into a data frame and assign it to wiki_table
wiki_table <- html_table(node)

# Print wiki_table
wiki_table

| Hadley Wickham | Hadley Wickham.1 |
|---|---|
|  |  |
| Born | (1979-10-14) 14 October 1979 (age 41)Hamilton, New Zealand |
| Alma mater | Iowa State University, University of Auckland |
| Known for | R packages |
| Awards | John Chambers Award (2006) Fellow of the American Statistical Association (2015) |
| Scientific career | Scientific career |
| Fields | Statistics Data science R (programming language) |
| Thesis | Practical tools for exploring data and models (2008) |
| Doctoral advisors | Di Cook Heike Hofmann |
|  |  |
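
To see what html_table() does without the clutter of a real page, here is a small sketch on a hand-written table. The HTML and the names toy_table and toy_df are made up for illustration; the header row uses <th> cells so that it is picked up as column names.

```r
library(rvest)

# Hypothetical two-column table with a header row
toy_table <- html_node(
  read_html('<table>
    <tr><th>key</th><th>value</th></tr>
    <tr><td>Born</td><td>Hamilton</td></tr>
    <tr><td>Fields</td><td>Statistics</td></tr>
  </table>'),
  xpath = "//table"
)

# Convert the table node into a data frame
toy_df <- html_table(toy_table)

names(toy_df)
nrow(toy_df)
```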


### Cleaning a data frame
In the last exercise, we looked at extracting tables with html_table(). The resulting data frame was pretty clean, but had two problems: first, the column names weren't descriptive, and second, there were empty rows (the first and the last).

In this exercise we're going to look at fixing both of those problems. First, column names. Column names can be cleaned up with the colnames() function. You call it on the object you want to rename, and then assign a vector of new names to that call.

The empty rows, meanwhile, can be removed with the subset() function. subset() takes an object and a condition, and keeps the rows where the condition is true. For example, if you have a data frame df containing a column x, you could run

subset(df, x != "")

to remove all rows from df where the column x is an empty string ("").

In [13]:
# Rename the columns of wiki_table
colnames(wiki_table) <- c("key", "value")

# Remove the empty rows from wiki_table
cleaned_table <- subset(wiki_table, key != "")

# Print cleaned_table
cleaned_table

|   | key | value |
|---|---|---|
| 2 | Born | (1979-10-14) 14 October 1979 (age 41)Hamilton, New Zealand |
| 3 | Alma mater | Iowa State University, University of Auckland |
| 4 | Known for | R packages |
| 5 | Awards | John Chambers Award (2006) Fellow of the American Statistical Association (2015) |
| 6 | Scientific career | Scientific career |
| 7 | Fields | Statistics Data science R (programming language) |
| 8 | Thesis | Practical tools for exploring data and models (2008) |
| 9 | Doctoral advisors | Di Cook Heike Hofmann |
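
Notice that the row numbers in the cleaned table start at 2: subset() keeps the original row names rather than renumbering. The sketch below reproduces the renaming and filtering steps on a small made-up data frame (toy_df and its values are hypothetical) and shows one way to reset the numbering, should you want to:

```r
# Hypothetical data frame with an empty first and last row
toy_df <- data.frame(
  V1 = c("", "Born", "Fields", ""),
  V2 = c("", "Hamilton", "Statistics", ""),
  stringsAsFactors = FALSE
)

# Give the columns descriptive names
colnames(toy_df) <- c("key", "value")

# Keep only the rows where key is not an empty string
toy_clean <- subset(toy_df, key != "")

# The surviving rows keep their original row names
rownames(toy_clean)

# Optional: renumber the rows from 1
rownames(toy_clean) <- NULL
toy_clean
```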
