# Pulling Data from APIs

APIs (application programming interfaces) are hosted on web servers. When you type www.google.com in your browser's address bar, your computer is actually asking the www.google.com server for a webpage, which it then returns to your browser. APIs work much the same way, except instead of your web browser asking for a webpage, your program asks for data. This data is usually returned in JSON format.

To retrieve data, we make a request to a web server. The server then replies with our data. In R, we'll use the `httr` package to do this.

## PatentsView Data

The PatentsView platform is built on data derived from the US Patent and Trademark Office (USPTO) bulk data to link inventors, their organizations, locations, and overall patenting activity. The PatentsView API provides programmatic access to longitudinal data and metadata on patents, inventors, companies, and geographic locations since 1976.

To access the API, we use the `GET` function from `httr`. To tell R what to access, we need to specify the URL of the API endpoint.

PatentsView has several API endpoints. An endpoint is a server route that is used to retrieve different data from the API. You can think of the endpoints as just specifying what types of data you want. Examples of PatentsView API endpoints are shown here: http://www.patentsview.org/api/doc.html

Many times, we need to request a key from the data provider in order to access an API. For example, if you wanted to access the Twitter API, then you would need to get a Twitter developer account and access token (see https://developer.twitter.com/en/docs/basics/authentication/overview/oauth). Currently no key is necessary to access the PatentsView API.

## Motivating Question

We will use the `httr` package to retrieve information about the patents that have been granted to inventors at University of Maryland, using the PatentsView API, then use the `jsonlite` package to convert it into a usable format (that is, a dataframe). This notebook goes over getting the data, customizing the query to get the data that you need, and formatting the data once you've gotten it from the API.

## Accessing the PatentsView API

When you ask a website or portal for information, this is called making a request. That is exactly what the `httr` package has been designed to do. However, we need to provide a query URL according to the format defined by PatentsView. The details on how to do that are explained at [this link](https://www.patentsview.org/api/query-language.html).

### Starting out

We'll first start out by making sure the appropriate packages are installed. We use the `httr` package for getting the data, and the `jsonlite` package for converting it into a usable form.

In [1]:
if (!requireNamespace("httr", quietly = TRUE)) {
  install.packages("httr")
}
if (!requireNamespace("jsonlite", quietly = TRUE)) {
  install.packages("jsonlite")
}

library('httr')
library('jsonlite')

### Building the Request

Let's first try a simple example. We're going to be bringing in data about the patents that were awarded to University of Maryland, without any additional specifications. We follow the instructions detailed on the website at <https://www.patentsview.org/api/patent.html>, starting with the base URL, then including, in a list, the query parameters. 

In [2]:
url <- 'https://www.patentsview.org/api/patents/query'
request <- GET(url, query = list(q = '{"assignee_organization":"university of maryland"}'))

`GET` sends the request to the URL you specify, with the query parameters attached, and returns a response object that holds the status code, the headers, and the content sent back by the server.

In [5]:
class(request)
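The query string above is hand-written JSON, where a stray quote or brace is easy to miss. As a minimal sketch, the same string can be generated from an R list with `toJSON` from `jsonlite`; the variable names `q_json` and `q_string` are only illustrative.

```r
library(jsonlite)

# Describe the filter as an R list, then serialize it to JSON.
# auto_unbox = TRUE writes length-one vectors as scalars ("x") instead of arrays (["x"]).
q_json <- toJSON(list(assignee_organization = "university of maryland"),
                 auto_unbox = TRUE)

# toJSON() returns an object of class "json"; convert it to a plain string.
q_string <- as.character(q_json)
print(q_string)

# The string can then be passed exactly as before:
# request <- GET(url, query = list(q = q_string))
```

Building the string this way should produce the same query as the hand-written version, while leaving the quoting and escaping to `jsonlite`.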

### Check the response code

Before you do anything with the response, it’s a good idea to check its status code, which tells you whether the request succeeded.

The following are the response codes for the PatentsView API:

**200** - the query parameters are all valid; the results will be in the body of the response.

**400** - the query parameters are not valid, typically either because they are not in valid JSON format, or a specified field or value is not valid; the “status reason” in the header will contain the error message.

**500** - there is an internal error with the processing of the query; the “status reason” in the header will contain the error message.



In [3]:
request$status_code

If you see `200` above, that means we're good to go! If not, double check to make sure nothing was changed from the original code.
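If you would rather have the code stop on a failed request than check the number by eye, a small helper can do it. The sketch below uses only the three codes listed above; `describe_status` and `check_status` are hypothetical helper names, not part of `httr`.

```r
# Map a status code to a short description (codes taken from the list above).
describe_status <- function(code) {
  switch(as.character(code),
         "200" = "OK: results are in the body of the response",
         "400" = "Bad request: invalid JSON, field, or value in the query",
         "500" = "Server error: internal error while processing the query",
         paste("Unexpected status code:", code))
}

# Stop with a readable message unless the request succeeded.
check_status <- function(code) {
  if (code != 200) {
    stop(describe_status(code), call. = FALSE)
  }
  invisible(TRUE)
}

check_status(200)              # passes silently
print(describe_status(400))
```

In the notebook this would be called as `check_status(request$status_code)`. The `httr` package also provides `http_error()` and `stop_for_status()`, which work on the response object directly.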

### Accessing the Content

After a web server returns a response, the content you need is in the body of that response, usually as JSON. To work with it in R, you parse that JSON into R objects.

JSON (JavaScript Object Notation) is a way to encode data structures like lists as strings so that they are easily readable by machines. JSON is the primary format in which data is passed back and forth to APIs, and most API servers will send their responses in JSON format.

Let's take a quick peek at the content.

In [6]:
head(request$content)

[1] 7b 22 70 61 74 65

This is in raw format, but we actually want characters. We can convert it using the `rawToChar` function.

In [7]:
raw_content <- rawToChar(request$content)
str(raw_content)

 chr "{\"patents\":[{\"patent_id\":\"10002228\",\"patent_number\":\"10002228\",\"patent_title\":\"Method for binding "| __truncated__


This is better, and you can start to see some of the data that you want, but it's not in a nice format for working with. We use the `fromJSON` function to convert it into an R list, which we can then work with.

In [8]:
patent_data <- fromJSON(raw_content)
str(patent_data)

List of 3
 $ patents           :'data.frame':	25 obs. of  3 variables:
  ..$ patent_id    : chr [1:25] "10002228" "10006019" "10010596" "10014561" ...
  ..$ patent_number: chr [1:25] "10002228" "10006019" "10010596" "10014561" ...
  ..$ patent_title : chr [1:25] "Method for binding site identification by molecular dynamics simulation (silcs: site identification by ligand c"| __truncated__ "Methods for recovery of leaf proteins" "Bacterial live vector vaccines expressing chromosomally-integrated foreign antigens" "Systems, methods, and devices for health monitoring of an energy storage device" ...
 $ count             : int 25
 $ total_patent_count: int 1243


Note that this list has three things: a dataframe containing the actual patent information, a count of the number of patents that were returned as part of this query, and a total patent count, signifying how many total patents there were that satisfied our criteria. In this case, since University of Maryland had 1243 patents, that's the number in `total_patent_count`.

Let's take the actual patent data and separate it out from the list.

In [10]:
patent_df <- patent_data$patents
head(patent_df)

| | patent_id | patent_number | patent_title |
|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` |
| 1 | 10002228 | 10002228 | Method for binding site identification by molecular dynamics simulation (silcs: site identification by ligand competitive saturation) |
| 2 | 10006019 | 10006019 | Methods for recovery of leaf proteins |
| 3 | 10010596 | 10010596 | Bacterial live vector vaccines expressing chromosomally-integrated foreign antigens |
| 4 | 10014561 | 10014561 | Systems, methods, and devices for health monitoring of an energy storage device |
| 5 | 10015616 | 10015616 | Sparse decomposition of head related impulse responses with applications to spatial audio rendering |
| 6 | 10016413 | 10016413 | Combination dopamine antagonist and opiate receptor antagonist treatment of addictive behavior |
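The whole chain from raw bytes to a dataframe can be tried without a network connection. The sketch below uses a hand-built sample shaped like the response above (two patents only; the counts are sample values), with `charToRaw` standing in for the bytes that `request$content` holds.

```r
library(jsonlite)

# Hand-built sample with the same three top-level elements as the real response.
sample_json <- '{"patents":[
  {"patent_id":"10002228","patent_number":"10002228"},
  {"patent_id":"10006019","patent_number":"10006019"}
],"count":2,"total_patent_count":1243}'

# charToRaw() mimics the raw bytes stored in request$content.
sample_raw <- charToRaw(sample_json)
print(head(sample_raw))   # 7b 22 70 61 74 65, the bytes for {"pate

# Same two steps as above: raw bytes -> string -> R list.
parsed <- fromJSON(rawToChar(sample_raw))
patents <- parsed$patents

str(parsed, max.level = 1)
print(patents)
```

Because `fromJSON` turns a JSON array of records into a dataframe, `parsed$patents` can be used directly, just like `patent_data$patents`.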


### Adding More Options

Above we were able to pull data with the default information on the patents (`patent_id`, `patent_number`, `patent_title`). It might be useful to know additional information on patents, such as patent type and application date.

Let's look for those variables in the API Endpoint (http://www.patentsview.org/api/patent.html), and add those fields to our query using the `f` parameter. The patent type variable is called `patent_type`. The application date variable is called `app_date`.


In [20]:
qry <- list(q = '{"assignee_organization":"university of maryland"}', 
            f='["patent_id", "patent_title","patent_type","app_date"]')
request <- GET(url, query = qry)
request$status_code

In [21]:
df <- fromJSON(rawToChar(request$content))$patents
head(df)

| | patent_id | patent_title | patent_type | applications |
|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<list>` |
| 1 | 10002228 | Method for binding site identification by molecular dynamics simulation (silcs: site identification by ligand competitive saturation) | utility | 2015-11-19, 14/945792 |
| 2 | 10006019 | Methods for recovery of leaf proteins | utility | 2017-12-13, 15/840857 |
| 3 | 10010596 | Bacterial live vector vaccines expressing chromosomally-integrated foreign antigens | utility | 2016-08-11, 15/234703 |
| 4 | 10014561 | Systems, methods, and devices for health monitoring of an energy storage device | utility | 2014-08-14, 14/912113 |
| 5 | 10015616 | Sparse decomposition of head related impulse responses with applications to spatial audio rendering | utility | 2015-06-08, 14/732864 |
| 6 | 10016413 | Combination dopamine antagonist and opiate receptor antagonist treatment of addictive behavior | utility | 2013-09-12, 14/025434 |


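We can see that we have these additional variables now. Note that the application date does not appear as a top-level `app_date` column; it is nested inside the `applications` list column.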

In [24]:
table(df$patent_type)


utility 
     25 
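Since the application date sits inside the `applications` list column, it takes one more step to get it into a regular column. The sketch below works on a hand-built sample that mirrors the nested structure shown above; the inner key name `app_id` is an assumption, so check `str(df$applications[[1]])` on the real data first.

```r
library(jsonlite)

# Hand-built sample mirroring the nested output above.
# "app_id" is an assumed key name; "app_date" is the field requested in the query.
sample_json <- '{"patents":[
  {"patent_id":"10002228","applications":[{"app_date":"2015-11-19","app_id":"14/945792"}]},
  {"patent_id":"10006019","applications":[{"app_date":"2017-12-13","app_id":"15/840857"}]}
]}'
df_sample <- fromJSON(sample_json)$patents

# Each element of the list column is a small dataframe; take its first app_date.
df_sample$app_date <- as.Date(vapply(
  df_sample$applications,
  function(a) a$app_date[1],
  character(1)
))

print(df_sample[, c("patent_id", "app_date")])
```

Converting to `Date` makes it straightforward to sort or filter patents by when they were filed.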

### Getting All University of Maryland Patents

Notice above that we only got 25 patents, even though `total_patent_count` reported 1243. This is because the API returns 25 results per page by default. We can ask for more by setting `per_page` in the options parameter `o`.

In [12]:
qry <- list(q = '{"assignee_organization":"university of maryland"}', 
            f='["patent_title","patent_year", "patent_abstract"]',
            o='{"per_page":2000}')
request <- GET(url, query = qry)
request$status_code

Because the 1243 matching patents fit within the 2000 results we requested per page, this single request returns all of the patents that were awarded to University of Maryland.

In [13]:
umd_patents <- fromJSON(rawToChar(request$content))$patents
str(umd_patents)

'data.frame':	1243 obs. of  3 variables:
 $ patent_title   : chr  "Method for binding site identification by molecular dynamics simulation (silcs: site identification by ligand c"| __truncated__ "Methods for recovery of leaf proteins" "Bacterial live vector vaccines expressing chromosomally-integrated foreign antigens" "Systems, methods, and devices for health monitoring of an energy storage device" ...
 $ patent_year    : chr  "2018" "2018" "2018" "2018" ...
 $ patent_abstract: chr  "The invention describes an explicit solvent all-atom molecular dynamics methodology (SILCS: Site Identification"| __truncated__ "A novel method for processing soluble plant leaf proteins is described. While leaf proteins are considered pote"| __truncated__ "Bacterial live vector vaccines represent a vaccine development strategy that offers exceptional flexibility. In"| __truncated__ "A health monitoring device includes an ultrasound source and an ultrasound sensor. The ultrasound source can be"| __truncat
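For a query with more matches than a single page can hold, the results would have to be requested page by page. The sketch below only builds the options strings for each page; it assumes the API accepts a `page` option alongside `per_page`, which should be checked against the PatentsView documentation. `page_options` is a hypothetical helper name.

```r
library(jsonlite)

# Build one "o" (options) JSON string per page needed to cover `total` results.
# Assumption: the API accepts a "page" option alongside "per_page".
page_options <- function(total, per_page) {
  n_pages <- ceiling(total / per_page)
  vapply(seq_len(n_pages), function(p) {
    as.character(toJSON(list(page = p, per_page = as.integer(per_page)),
                        auto_unbox = TRUE))
  }, character(1))
}

opts <- page_options(total = 1243, per_page = 500)
print(opts)

# Each string could then be sent in turn and the pages combined, for example
# (q_string and f_string stand for the q and f strings used above):
# pages <- lapply(opts, function(o) {
#   r <- GET(url, query = list(q = q_string, f = f_string, o = o))
#   fromJSON(rawToChar(r$content))$patents
# })
# all_patents <- rbind_pages(pages)
```

With 1243 patents and 500 per page, this yields three option strings. A useful sanity check afterwards is to compare the number of rows collected with `total_patent_count`.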

## What Now?

Now that we have the patent data, what can we do with it? One example is text analysis using the patent abstracts. We might be interested in what kinds of patents were awarded to University of Maryland and the topics they cover, because while there is a `patent_type` field, it doesn't actually give us that much information. To find out, we might want to take the abstracts and extract meaning from them. Reading over 1000 abstracts would be quite time-consuming, so instead we can use something called **topic modeling**.

## Exercises

**1. How might we change the code above to pull patent data for patents that were awarded to University of Michigan?** 

**2. Using the information in <https://www.patentsview.org/api/patent.html>, try adding some more fields and pulling some more data.**