## CSS Web Scraping and Final Case Study

CSS path-based web scraping is a far more pleasant alternative to using XPaths. You'll start this chapter by learning about CSS and how to leverage it for web scraping. Then, you'll work through a final case study that combines everything you've learnt so far to write a function that queries an API, parses the response, and returns data in a tidy form.

### Using CSS to scrape nodes
As mentioned in the video, CSS is a way to add design information to HTML that instructs the browser on how to display the content. You can leverage these design instructions to identify content on the page.

You've already used html_node(), but it's more common with CSS selectors to use html_nodes() since you'll often want more than one node returned. Both functions allow you to specify a css argument to use a CSS selector, instead of specifying the xpath argument.

What do CSS selectors look like? Try these examples to see a few possibilities.

In [2]:
# Load rvest
library(rvest)

# Hadley Wickham's Wikipedia page
test_url <- "https://en.wikipedia.org/wiki/Hadley_Wickham"

# Read the URL stored as "test_url" with read_html()
test_xml <- read_html(test_url)

# Print test_xml
test_xml

# Select the table elements
html_nodes(test_xml, css = "table")

# Select elements with class = "infobox"
html_nodes(test_xml, css = ".infobox")

# Select the element with id = "firstHeading"
html_nodes(test_xml, css = "#firstHeading")

{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...

{xml_nodeset (3)}
[1] <table class="infobox biography vcard" style="width:22em"><tbody>\n<tr><t ...
[2] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" style ...
[3] <table class="nowraplinks hlist navbox-inner" style="border-spacing:0;bac ...

{xml_nodeset (1)}
[1] <table class="infobox biography vcard" style="width:22em"><tbody>\n<tr><t ...

{xml_nodeset (1)}
[1] <h1 id="firstHeading" class="firstHeading" lang="en">Hadley Wickham</h1>
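
The same three selector types can be tried without a network connection. The sketch below assumes a small made-up page defined inline (`demo_html` and its contents are hypothetical, not part of the Wikipedia example) and also shows how `html_node()` differs from `html_nodes()`.

```r
library(rvest)

# A small hypothetical page, defined inline so the example is self-contained
demo_html <- read_html('
<html><body>
  <h1 id="firstHeading">Demo page</h1>
  <table class="infobox"><tr><td>A</td></tr></table>
  <table class="navbox"><tr><td>B</td></tr></table>
</body></html>')

# Tag selector: every <table> element
tables <- html_nodes(demo_html, css = "table")
length(tables)

# Class selector: elements with class = "infobox"
infobox <- html_nodes(demo_html, css = ".infobox")

# ID selector: the element with id = "firstHeading"
heading <- html_nodes(demo_html, css = "#firstHeading")
html_text(heading)

# html_node() returns only the first match instead of a set of nodes
first_table <- html_node(demo_html, css = "table")
html_attr(first_table, "class")
```

Here `html_nodes()` finds both tables, while `html_node()` keeps only the first one.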

### Scraping names
You might have noticed in the previous exercise that, to select elements with a certain class, you add a . in front of the class name. If you need to select an element based on its id, you add a # in front of the id name.

For example if this element was inside your HTML document:

    <h1 class="heading" id="intro">
      Introduction
    </h1>

You could select it by its class using the CSS selector ".heading", or by its id using the CSS selector "#intro".

Once you've selected an element with a CSS selector, you can get the element's tag name with html_name(), just like you did with XPath selectors.

In [3]:
# Extract element with class infobox
infobox_element <- html_nodes(test_xml, css = ".infobox")

# Get tag name of infobox_element
element_name <- html_name(infobox_element)

# Print element_name
print(element_name)

[1] "table"
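
To make the class and id selectors concrete, here is a minimal sketch that wraps the `<h1>` element from the text in a tiny inline document (the wrapper and the variable names `doc`, `by_class` and `by_id` are assumptions for illustration). Both selectors should land on the same element.

```r
library(rvest)

# The example element from the text, wrapped in a minimal document
doc <- read_html(
  '<html><body><h1 class="heading" id="intro">Introduction</h1></body></html>'
)

# Select the element by its class, then by its id
by_class <- html_node(doc, css = ".heading")
by_id <- html_node(doc, css = "#intro")

# Both selections refer to an <h1> element
html_name(by_class)
html_name(by_id)

# The node selected by id still carries its class attribute
html_attr(by_id, "class")
```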


### Scraping text
You can also get the text contents of a node selected with a CSS selector, using html_text().

Can you put the pieces together to get the page title like you did in Chapter 4?

In [4]:
# Extract element with class fn
page_name <- html_node(x = infobox_element, css = ".fn")

# Get contents of page_name
page_title <- html_text(page_name)

# Print page_title
page_title
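
The two-step selection above (infobox first, then `.fn` inside it) can also be written as a single descendant selector. The sketch below uses a hypothetical inline page rather than Wikipedia, and the `trim = TRUE` argument of `html_text()` to strip surrounding whitespace.

```r
library(rvest)

# Hypothetical page with a name element nested inside an infobox
demo_html <- read_html('
<html><body>
  <table class="infobox">
    <tr><th><span class="fn"> Ada Example </span></th></tr>
    <tr><td>Statistics</td></tr>
  </table>
</body></html>')

# Two steps: select the infobox, then the .fn element inside it
infobox <- html_node(demo_html, css = ".infobox")
name_node <- html_node(infobox, css = ".fn")
title_two_step <- html_text(name_node, trim = TRUE)

# One step: a descendant selector (".fn" somewhere inside ".infobox")
title_one_step <- html_text(
  html_node(demo_html, css = ".infobox .fn"),
  trim = TRUE
)

# Both approaches give the same text
identical(title_two_step, title_one_step)
```

The two-step form is handy when you need the intermediate node again later, as the case study does with the infobox.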

### API calls
Your first step is to use the Wikipedia API to get the page contents for a specific page. We'll continue to work with the Hadley Wickham page, but as your last exercise, you'll make it more general.

To get the content of a page from the Wikipedia API you need to use a parameter-based URL. The URL you want is

https://en.wikipedia.org/w/api.php?action=parse&page=Hadley%20Wickham&format=xml

which specifies that you want the parsed content (i.e., the HTML) for the "Hadley Wickham" page, and that the API response should be XML.

In this exercise you'll make the request with GET() and parse the XML response with content().

In [5]:
# Load httr
library(httr)

# The API url
base_url <- "https://en.wikipedia.org/w/api.php"

# Set query parameters
query_params <- list(action = "parse", 
  page = "Hadley Wickham", 
  format = "xml")

# Get data from API
resp <- GET(url = base_url, query = query_params)
    
# Parse response
resp_xml <- content(resp)
resp_xml

{xml_document}
<api>
[1] <parse title="Hadley Wickham" pageid="41916270" revid="993139613" display ...
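
If you want to check which URL a `query` list turns into before sending anything, one option is httr's `modify_url()`, which builds the URL without making a request. This is a sketch for inspection only; the variable names `full_url` and `round_trip` are illustrative.

```r
library(httr)

base_url <- "https://en.wikipedia.org/w/api.php"
query_params <- list(action = "parse",
  page = "Hadley Wickham",
  format = "xml")

# Build the full URL from the base URL and the query parameters,
# without sending a request
full_url <- modify_url(base_url, query = query_params)
full_url

# Parse the URL back into its parts to confirm the parameters survived
round_trip <- parse_url(full_url)$query
round_trip$page
```

In a real script you might also call `stop_for_status(resp)` after `GET()` so that a failed request raises an error instead of being parsed.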

### Extracting information
Now that we have a response from the API, we need to extract the HTML for the page from it. It turns out the HTML is stored in the contents of the XML response.

Take a look by using xml_text() to pull out the text from the XML response:

xml_text(resp_xml)

In this exercise, you'll read this text as HTML, then extract the relevant nodes to get the infobox and page title.

In [7]:
# Load rvest
library(rvest)

# Read page contents as HTML
page_html <- read_html(xml_text(resp_xml))

# Extract infobox element
infobox_element <- html_node(page_html, css = ".infobox")

# Extract page name element from infobox
page_name <- html_node(infobox_element, css = ".fn")

# Extract page name as text
page_title <- html_text(page_name)
page_title
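
The "HTML inside XML" idea can be seen offline with a simplified stand-in for the API response. The XML string below is an assumption made up for illustration (the real response has more attributes and content); it only mimics the shape in which escaped HTML sits as text inside the XML.

```r
library(xml2)
library(rvest)

# Simplified, hypothetical stand-in for the API response:
# the page HTML is stored as escaped text inside the XML
fake_response <- read_xml(
  '<api><parse title="Ada Example"><text>&lt;table class="infobox"&gt;&lt;tr&gt;&lt;th&gt;&lt;span class="fn"&gt;Ada Example&lt;/span&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/table&gt;</text></parse></api>'
)

# xml_text() unescapes the text, giving back a plain HTML string
html_string <- xml_text(fake_response)
html_string

# That string can then be parsed as HTML and queried with CSS selectors
page_html <- read_html(html_string)
demo_title <- html_text(html_node(page_html, css = ".infobox .fn"))
demo_title
```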

### Normalising information
Now it's time to put together the information in a nice format. You've already seen you can use html_table() to parse the infobox into a data frame. But one piece of important information is missing from that table: who the information is about!

In this exercise, you'll parse the infobox into a data frame, and add a row for the full name of the subject.

In [9]:
# Your code from earlier exercises
wiki_table <- html_table(infobox_element)
colnames(wiki_table) <- c("key", "value")
cleaned_table <- subset(wiki_table, key != "")

# Create a dataframe for full name
name_df <- data.frame(key = "Full name", value = page_title)

# Combine name_df with cleaned_table
wiki_table2 <- rbind(name_df, cleaned_table)

# Print wiki_table2
wiki_table2

key,value
Full name,Hadley Wickham
Born,"(1979-10-14) 14 October 1979 (age 41)Hamilton, New Zealand"
Alma mater,"Iowa State University, University of Auckland"
Known for,R packages
Awards,John Chambers Award (2006) Fellow of the American Statistical Association (2015)
Scientific career,Scientific career
Fields,Statistics Data science R (programming language)
Thesis,Practical tools for exploring data and models (2008)
Doctoral advisors,Di Cook Heike Hofmann
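
The same clean-up steps can be traced on a tiny table. This sketch assumes a made-up two-column infobox with one empty row (all names and values are hypothetical), and sets `stringsAsFactors = FALSE` so the result does not depend on the R version's default.

```r
library(rvest)

# Hypothetical two-column infobox with one empty row
demo_html <- read_html('
<table class="infobox">
  <tr><td>Born</td><td>Hamilton</td></tr>
  <tr><td></td><td></td></tr>
  <tr><td>Fields</td><td>Statistics</td></tr>
</table>')

infobox <- html_node(demo_html, css = ".infobox")

# Parse the table; header = FALSE because this table has no header row
raw_table <- as.data.frame(html_table(infobox, header = FALSE))
colnames(raw_table) <- c("key", "value")

# Drop rows whose key is empty
cleaned <- subset(raw_table, key != "")

# Add a row identifying the subject, then combine
name_df <- data.frame(key = "Full name", value = "Ada Example",
  stringsAsFactors = FALSE)
result <- rbind(name_df, cleaned)
result
```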


### Reproducibility
Now you've figured out the process for requesting and parsing the infobox for the Hadley Wickham page, it's time to turn it into a function that does the same thing for anyone.

You've already done all the hard work! In the sample script we've just copied all your code from the previous three exercises, with only one change: we've wrapped it in the function definition syntax, and chosen the name get_infobox() for this function.

It doesn't quite work yet, because the argument title isn't used inside the function. In this exercise you'll fix that, then test it out with some other personalities.

In [16]:
library(httr)
library(rvest)
library(xml2)

get_infobox <- function(title){
  base_url <- "https://en.wikipedia.org/w/api.php"
  
  # Use the title argument as the page to request
  query_params <- list(action = "parse", 
    page = title, 
    format = "xml")
  
  resp <- GET(url = base_url, query = query_params)
  resp_xml <- content(resp)
  
  page_html <- read_html(xml_text(resp_xml))
  infobox_element <- html_node(x = page_html, css = ".infobox")
  page_name <- html_node(x = infobox_element, css = ".fn")
  page_title <- html_text(page_name)
  
  wiki_table <- html_table(infobox_element)
  colnames(wiki_table) <- c("key", "value")
  cleaned_table <- subset(wiki_table, key != "")
  name_df <- data.frame(key = "Full name", value = page_title)
  wiki_table <- rbind(name_df, cleaned_table)
  
  wiki_table
}

# Test get_infobox with "Hadley Wickham"
get_infobox(title = "Hadley Wickham")

# Try get_infobox with "Ross Ihaka"
get_infobox(title = "Ross Ihaka")

# Try get_infobox with "Grace Hopper"
get_infobox(title = "Grace Hopper")

# Try get_infobox with "Donald Trump"
get_infobox(title = "Donald Trump")



key,value
Full name,Hadley Wickham
Born,"(1979-10-14) 14 October 1979 (age 41)Hamilton, New Zealand"
Alma mater,"Iowa State University, University of Auckland"
Known for,R packages
Awards,John Chambers Award (2006) Fellow of the American Statistical Association (2015)
Scientific career,Scientific career
Fields,Statistics Data science R (programming language)
Thesis,Practical tools for exploring data and models (2008)
Doctoral advisors,Di Cook Heike Hofmann


key,value
Full name,Ross Ihaka
Ihaka at the 2010 New Zealand Open Source Awards,Ihaka at the 2010 New Zealand Open Source Awards
Born,"1954 (age 66–67)Waiuku, New Zealand"
Alma mater,"University of AucklandUniversity of California, Berkeley"
Known for,R programming language
Awards,Pickering Medal (2008)
Scientific career,Scientific career
Fields,Statistical computing
Institutions,University of Auckland
Thesis,Ruaumoko (1985)


key,value
Full name,Grace Murray Hopper
Photograph from 1984,Photograph from 1984
Born,"Grace Brewster Murray(1906-12-09)December 9, 1906New York City, U.S."
Died,"January 1, 1992(1992-01-01) (aged 85)Arlington, Virginia, U.S."
Alma mater,"Vassar College (BA)Yale University (MS, Ph.D.)"
Military career,Military career
Place of burial,Arlington National Cemetery
Allegiance,United States of America
Service/branch,United States Navy
Years of service,"1943–1966, 1967–1971, 1972–1986"


key,value
Full name,Donald Trump
45th President of the United States,45th President of the United States
Incumbent,Incumbent
"Assumed office January 20, 2017","Assumed office January 20, 2017"
Vice President,Mike Pence
Preceded by,Barack Obama
Personal details,Personal details
Born,"Donald John Trump (1946-06-14) June 14, 1946 (age 74)Queens, New York City"
Political party,"Republican (1987–1999, 2009–2011, 2012–present)"
Other politicalaffiliations,Reform (1999–2001) Democratic (2001–2009) Independent (2011–2012)
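
get_infobox() assumes every page has an element with class "infobox". For pages where that may not hold, one possible guard is to check for the infobox before parsing it. The helper below, `extract_infobox_title()`, is a hypothetical sketch (not part of the exercise) that works on already-parsed HTML so it can be tried without calling the API.

```r
library(rvest)

# Hypothetical helper: return the infobox title,
# or NULL when the page has no element with class "infobox"
extract_infobox_title <- function(page_html) {
  infoboxes <- html_nodes(page_html, css = ".infobox")
  if (length(infoboxes) == 0) {
    return(NULL)
  }
  # Use the first infobox and pull the text of its .fn element
  html_text(html_node(infoboxes[[1]], css = ".fn"), trim = TRUE)
}

# Made-up pages: one with an infobox, one without
with_box <- read_html(
  '<table class="infobox"><tr><th><span class="fn">Ada Example</span></th></tr></table>'
)
without_box <- read_html("<p>No infobox here</p>")

title_found <- extract_infobox_title(with_box)
title_missing <- extract_infobox_title(without_box)

title_found
is.null(title_missing)
```

A similar length check could be placed inside get_infobox() before the call to html_table(), returning NULL or raising an informative error as you prefer.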
