## Handling JSON and XML

Sometimes data arrives as a TSV or tidy plain-text output. Sometimes it's XML or JSON. This chapter walks you through what JSON and XML are, how to convert them into R objects, and how to extract data from them. You'll practice by examining the revision history for a Wikipedia article retrieved from the Wikipedia API using httr, xml2 and jsonlite.

### Parsing JSON
While JSON is a useful format for sharing data, your first step will often be to parse it into an R object, so you can manipulate it with R.

The content() function in httr retrieves the content from a request. Its as argument specifies the type of output to return. You've already seen that as = "text" returns the content as a character string, which is useful for checking that the content is what you expect.

If you don't specify as, the default as = "parsed" is used. In this case content() guesses the type of the content from the response header and chooses an appropriate parsing function. For JSON this function is fromJSON() from the jsonlite package. If you know your response is JSON, you may want to use fromJSON() directly.
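
To make the parsing step concrete without a live response, here is a minimal sketch that calls fromJSON() on a made-up JSON string (the field names and values are illustrative, not real API output). It also shows the simplifyVector argument, which controls whether arrays of records become a data frame or stay as nested lists. content() appears to turn this simplification off, which would explain why its parsed output is a deeply nested list.

```r
library(jsonlite)

# Made-up JSON resembling a small slice of a revisions response
json_text <- '{"revisions": [
  {"user": "alice", "timestamp": "2015-01-14T17:12:45Z"},
  {"user": "bob",   "timestamp": "2015-02-24T16:34:31Z"}
]}'

# Default behaviour: an array of records is simplified into a data frame
simplified <- fromJSON(json_text)
class(simplified$revisions)

# simplifyVector = FALSE keeps the nesting as plain lists
nested <- fromJSON(json_text, simplifyVector = FALSE)
nested$revisions[[1]]$user
```

Which form is more convenient depends on the data: the simplified form is handy for regular records, while the nested form is more predictable when records have differing fields.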

To practice, you'll retrieve some revision history from the Wikipedia API, check it is JSON, then parse it into a list two ways.

In [None]:
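## Do not run: rev_history() is shown for reference only.
## It returns saved responses read from local .rds files.

rev_history <- function(title, format = "json"){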
  if (title != "Hadley Wickham") {
    stop('rev_history() only works for `title = "Hadley Wickham"`')
  }
  
  if (format == "json"){
    resp <- readRDS("had_rev_json.rds")
  } else if (format == "xml"){
    resp <- readRDS("had_rev_xml.rds")
  } else {
    stop('Invalid format supplied, try "json" or "xml"')
  }
  resp  
}


# Load httr for http_type() and content()
library(httr)

# Get revision history for "Hadley Wickham"
resp_json <- rev_history("Hadley Wickham")

# Check http_type() of resp_json
http_type(resp_json)

# Examine returned text with content()
content(resp_json, as = "text" )

# Parse response with content()
content(resp_json, as = "parsed")

# Parse returned text with fromJSON()
library(jsonlite)
fromJSON(content(resp_json, as = "text" ))

# The output from content() is long and hard to read.
# Don't worry: that is just the nature of nested data, and you'll learn a couple of tricks
# for dealing with it next. For now, it helps to know that this response contains 5 revisions.



### Manipulating parsed JSON
As you saw in the video, the output from parsing JSON is a list. One way to extract relevant data from that list is to use a package specifically designed for manipulating lists, rlist.

rlist provides two particularly useful functions for selecting and combining elements from a list: list.select() and list.stack(). list.select() extracts sub-elements by name from each element in a list. For example, using the parsed movies data from the video (movies_list), we might ask for the title and year elements from each element:

list.select(movies_list, title, year)

The result is still a list; that is where list.stack() comes in. It stacks the elements of a list into a data frame.

list.stack(
    list.select(movies_list, title, year)
)

In [None]:
## do not run

# Load rlist
library(rlist)

# Examine output of this code
str(content(resp_json), max.level = 4)

# Store revision list
revs <- content(resp_json)$query$pages$`41916270`$revisions

# Extract the user and timestamp elements
user_time <- list.select(revs, user, timestamp)

# Print user_time
print(user_time)

# Stack to turn into a data frame
list.stack(
    list.select(revs, user, timestamp)
)

# rlist is designed to make working with lists easy, so if you find you are
# working with JSON data a lot, you should explore more of its functionality.

### Reformatting JSON
Of course you don't have to use rlist. You can achieve the same thing by using functions from base R or the tidyverse. In this exercise you'll repeat the task of extracting the username and timestamp using the dplyr package which is part of the tidyverse.

Conceptually, you'll take the list of revisions, stack them into a data frame, then pull out the relevant columns.

dplyr's bind_rows() function takes a list and turns it into a data frame. Then you can use select() to extract the relevant columns. And of course you can use the %>% (pipe) operator to chain them all together.

In [None]:
## do not run 

# Load dplyr
library(dplyr)

# Pull out revision list
revs <- content(resp_json)$query$pages$`41916270`$revisions

# Extract user and timestamp
revs %>%
  bind_rows() %>%           
  select(user, timestamp)
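
This section mentions that base R can do the same job. A minimal sketch of that route follows, using a made-up revisions list in place of the parsed response (the users, timestamps and comments are invented for illustration). It follows the same idea of picking fields from each element and then stacking the results.

```r
# Made-up revisions list with the same shape as the parsed JSON
revs <- list(
  list(user = "alice", timestamp = "2015-01-14T17:12:45Z", comment = "fix typo"),
  list(user = "bob", timestamp = "2015-02-24T16:34:31Z", comment = "add link")
)

# Keep only the fields of interest from each revision, as a one-row data frame
rows <- lapply(revs, function(rev) {
  data.frame(user = rev$user, timestamp = rev$timestamp, stringsAsFactors = FALSE)
})

# Stack the one-row data frames into a single data frame
user_time <- do.call(rbind, rows)
user_time
```

This version assumes every revision has both fields. If some are missing, rev$user would be NULL and data.frame() would fail, so real data may need a check first.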

### Examining XML documents
Just as with JSON, you should first verify that the response is indeed XML, using http_type() and examining the result of content(r, as = "text"). Then you can turn the response into an XML document object with read_xml().

One benefit of using the XML document object is the set of functions available to help you explore and manipulate the document. For example, xml_structure() prints a representation of the XML document that emphasizes the hierarchical structure by displaying the elements without the data.
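
To see what this looks like on something small, here is a sketch that builds an XML document from a literal string and prints its structure. The document is made up for illustration and only loosely imitates the nesting of an API response.

```r
library(xml2)

# Made-up XML with a nesting similar to a revisions response
doc <- read_xml(
  '<api><query><pages><page title="Example"><revisions>
     <rev user="alice">first</rev>
     <rev user="bob">second</rev>
   </revisions></page></pages></query></api>'
)

# Name of the root element
root_name <- xml_name(doc)
root_name

# Print the element hierarchy without the text content
xml_structure(doc)
```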

In this exercise you'll grab the same revision history you've been working with as XML, and take a look at it with xml_structure().

In [None]:
# Load xml2
# install.packages("xml2")
library(xml2)

# Get XML revision history
resp_xml <- rev_history("Hadley Wickham", format = "xml")

# Check response is XML 
http_type(resp_xml)

# Examine returned text with content()
rev_text <- content(resp_xml, as = "text")
rev_text

# Turn rev_text into an XML document
rev_xml <- read_xml(rev_text)

# Examine the structure of rev_xml
xml_structure(rev_xml)

### Extracting XML data
XPATHs are designed to specify nodes in an XML document. Remember /node_name specifies nodes at the current level that have the tag node_name, whereas //node_name specifies nodes at any level below the current level that have the tag node_name.

xml2 provides the function xml_find_all() to extract nodes that match a given XPATH. For example, xml_find_all(rev_xml, "/api") will find all the nodes at the top level of the rev_xml document that have the tag api. Try running that in the console. You'll get a nodeset of one node because there is only one node that satisfies that XPATH.

The object returned from xml_find_all() is a nodeset (think of it like a list of nodes). To actually get data out of the nodes in the nodeset, you'll have to explicitly ask for it with xml_text() (or xml_double() or xml_integer()).
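
The difference between / and // is easiest to see on a small document. The sketch below uses a made-up XML string (not real API output) and compares the two forms, then pulls the text out of the matching nodes.

```r
library(xml2)

# Made-up document: two rev nodes nested several levels deep
doc <- read_xml(
  '<api><query><pages><page><revisions>
     <rev>first text</rev>
     <rev>second text</rev>
   </revisions></page></pages></query></api>'
)

# "/api" matches the single node at the top level
top <- xml_find_all(doc, "/api")
length(top)

# "/rev" matches nothing, because rev is not at the top level
none <- xml_find_all(doc, "/rev")
length(none)

# "//rev" matches rev nodes at any depth
any_level <- xml_find_all(doc, "//rev")

# A nodeset holds nodes, not data; ask for the text explicitly
rev_text <- xml_text(any_level)
rev_text
```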

Use what you know about the location of the revisions data in the returned XML document to extract just the content of the revisions.

In [None]:
## do not run 
# Find all nodes using XPATH "/api/query/pages/page/revisions/rev"
xml_find_all(rev_xml, "/api/query/pages/page/revisions/rev")

# Find all rev nodes anywhere in document
rev_nodes <- xml_find_all(rev_xml, "//rev")

# Use xml_text() to get text from rev_nodes
xml_text(rev_nodes)

### Extracting XML attributes
Not all the useful data will be in the content of a node; some might be in the attributes of a node. To extract attributes from a nodeset, xml2 provides xml_attrs() and xml_attr().

xml_attrs() takes a nodeset and returns all of the attributes for every node in the nodeset. xml_attr() takes a nodeset and an additional argument attr to extract a single named attribute from each node in the nodeset.

In this exercise you'll grab the user and anon attributes for each revision. You'll see xml_find_first() in the sample code. It works just like xml_find_all() but it only extracts the first node it finds.
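
As a small illustration of these functions, the sketch below works on made-up revision nodes (the users and timestamps are invented). It assumes, as in this example, that anon is present only on some nodes. In that case xml_attr() returns NA for the nodes that lack it.

```r
library(xml2)

# Made-up revisions: only the second rev carries an anon attribute
doc <- read_xml(
  '<revisions>
     <rev user="alice" timestamp="2015-01-14T17:12:45Z">a</rev>
     <rev user="127.0.0.1" anon="" timestamp="2015-02-24T16:34:31Z">b</rev>
   </revisions>'
)

rev_nodes <- xml_find_all(doc, "//rev")
first_rev <- xml_find_first(doc, "//rev")

# All attributes of one node, as a named character vector
all_attrs <- xml_attrs(first_rev)
all_attrs

# One attribute across every node in the nodeset
users <- xml_attr(rev_nodes, "user")
users

# Missing attributes come back as NA, so is.na() flags their absence
anon <- xml_attr(rev_nodes, "anon")
is_anon <- !is.na(anon)
is_anon
```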

In [None]:
## DO NOT RUN
# All rev nodes
rev_nodes <- xml_find_all(rev_xml, "//rev")

# The first rev node
first_rev_node <- xml_find_first(rev_xml, "//rev")

# Find all attributes with xml_attrs()
xml_attrs(first_rev_node)

# Find user attribute with xml_attr()
xml_attr(first_rev_node, "user")

# Find user attribute for all rev nodes
xml_attr(rev_nodes, "user")

# Find anon attribute for all rev nodes
xml_attr(rev_nodes, "anon")

### Wrapup: returning nice API output
How might all this work together? A useful API function will retrieve results from an API and return them in a useful form. In Chapter 2, you finished up by writing a function that retrieved data from an API and relied on content() to convert it to a useful form. To write a more robust API function, you shouldn't rely on content() but should instead parse the data yourself.

To finish up this chapter you'll do exactly that: write get_revision_history(), which retrieves the XML data for the revision history of a page on Wikipedia, parses it, and returns it in a nice data frame.

So that you can focus on the parts of the function that parse the return object, you'll see your function calls rev_history() to get the response from the API. You can assume this function returns the raw response and follows the best practices you learnt in Chapter 2, like using a user agent, and checking the response status.

In [None]:
# do not run 

get_revision_history <- function(article_title){
  # Get raw revision response
  rev_resp <- rev_history(article_title, format = "xml")
  
  # Turn the content() of rev_resp into XML
  rev_xml <- read_xml(content(rev_resp, "text"))
  
  # Find revision nodes
  rev_nodes <- xml_find_all(rev_xml, "//rev")

  # Parse out usernames
  user <- xml_attr(rev_nodes, "user")
  
  # Parse out timestamps
  timestamp <- readr::parse_datetime(xml_attr(rev_nodes, "timestamp"))
  
  # Parse out content
  content <- xml_text(rev_nodes)
  
  # Return data frame 
  data.frame(user = user,
    timestamp = timestamp,
    content = substr(content, 1, 40))
}

# Call function for "Hadley Wickham"
get_revision_history("Hadley Wickham")
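
Because get_revision_history() depends on rev_history(), it is hard to try outside the course environment. One option is to separate the parsing step into a function that takes XML text, so it can be exercised on a small sample. The sketch below does that. parse_revisions() and the sample XML are hypothetical, and base R's as.POSIXct() stands in for readr::parse_datetime() to keep the example dependent on xml2 alone.

```r
library(xml2)

# Hypothetical helper: parse revision XML text into a data frame
parse_revisions <- function(xml_text_input) {
  rev_xml <- read_xml(xml_text_input)
  rev_nodes <- xml_find_all(rev_xml, "//rev")

  data.frame(
    user = xml_attr(rev_nodes, "user"),
    # Assumes ISO 8601 timestamps in UTC, as in the made-up sample below
    timestamp = as.POSIXct(
      xml_attr(rev_nodes, "timestamp"),
      format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"
    ),
    # Keep only the first 40 characters of each revision's content
    content = substr(xml_text(rev_nodes), 1, 40),
    stringsAsFactors = FALSE
  )
}

# Made-up sample standing in for the text of an API response
sample_xml <- '<api><query><pages><page title="Example"><revisions>
  <rev user="alice" timestamp="2015-01-14T17:12:45Z">First draft of the article text.</rev>
  <rev user="bob" timestamp="2015-02-24T16:34:31Z">Second revision with a small correction.</rev>
</revisions></page></pages></query></api>'

revisions <- parse_revisions(sample_xml)
revisions
```

Keeping retrieval and parsing apart means the parsing logic can be checked against saved or hand-written XML without touching the network.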