Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
289 lines (199 sloc) 12.5 KB
title author date output
README
Paul Oldham
29 July 2016
github_document html_document
default
default
knitr::opts_chunk$set(echo = TRUE)

opsrdev

Travis-CI Build Status

This is the early development version of the opsr R package to access the European Patent Office Open Patent Services API

The main aim of the package is to provide access to the OPS service and, more importantly, to transform the data into data frames or the new tibbles (simpler data frames) that we can perform useful patent analysis with. There is still quite a way to go and if you would like to contribute feel welcome.

opsrdev is not on CRAN. To install it use:

devtools::install_github("poldham/opsrdev")

##Walkthrough

Here is a quick walkthrough on how to use opsrdev. We will use the topic of drones as an example.

opsr uses r packages in the tidyverse as well as some additional packages. You don't need to worry about that but for this walkthrough we need to install or load the following.

install.packages("dplyr") #for chaining using %>% 
install.packages("readr") #to write the files

Load dplyr to use %>% in interactive mode and opsrdev.

library(opsrdev) # the opsr package
library(dplyr)
library(readr) 

OPS will allow us to make limited use of the service without authenticating, but they plan to turn that off in future. It is a good idea to register for the service here, then create an App in the developer portal, go to MyApps and copy down the key and secret.

You may find it easiest to save the key and the secret in your environment using:

key <- "yourkey"
secret <- "yoursecret"

ops_auth() is a simple function

Authenticate (not needed for this walkthrough as the dataset is small but good practice)

ops_auth("yourkey", "yoursecret")

Test a query for the number of results with ops_count()

We can only fetch 2000 results at a time from the OPS service. So, we need to find out whether we have 2000 or more results.

Get a Count for Patent Biblios (front pages)

Patent biblios include the title, abstract, applicant and inventor fields.

library(opsrdev)
ops_count("drones") # biblio search is the default for ops_count() and "biblio" does not need to be specified.

Get a Count for Titles

library(opsrdev)
ops_count("drones", "ti") # titles, all years

Get a Count for the Titles and Abstract

library(opsrdev)
ops_count("drones", "ta") # title and abstract, all years

Use a shorter year range if you have too many results (over 2000). Note that the year is the publication year.

library(opsrdev)
ops_count("drones", "ta", start = 1990, end = 2015) # 1990 to 2015

An example of over 2000 results would be for something slightly ridiculous like "dna" (the maximum that will be displayed appears to be 10,000 rather than the actual number of results which will be many more).

library(opsrdev)
ops_count("dna", "ta", start = 1990, end = 2015)

So, bear in mind that it is important to think about your search strategy. Will address how to deal with more than 2000 results at a time below.

Retrieve the data from OPS

In practice retrieving data from OPS is a multi-step process.

Step 1. Define how many results a query will return. Step 2. Generate URLs to retrieve the results (100 per url) Step 3. Fetch the results Step 4. Transform into a data.frame (in future a tibble)

We will illustrate these steps and then show how they are combined.

###Step 2: Generate URLs

library(opsrdev)
tmp <- ops_urls(query = "drones", type = "ta", start = 1990, end = 2015)
tmp

ops_urls() takes a query, a type (ti for title, ta for title and abstract, bilio - the default) and start and end for year or date ranges (YYYYMMDD format).

###Step 3 Fetch the data (as JSON)

library(opsrdev)
tmp1 <- ops_iterate(tmp, service = "biblio", timer = 10)
tmp1

ops_iterate uses lapply and a timer to fetch the data for each URL using GET (see ops_get). ops_iterate is a generic function that will work with a range of services (such as retrieving claims or descriptions based on patent numbers)

###Step 4 Transform into a data.frame

We normally want to view patent bibliographic information in a data frame (or in future the simplified tibble). Parsing the JSON data is the job of ops_biblio()

library(opsrdev)
tmp2 <- ops_biblio(tmp1)
head(tmp2)

ops_biblio uses the aggregate() function in its final step and this generates a warning.

Note that some issues require clarification in future work (notably the number of applicants/inventors per record). At present ops_biblio appears to parse only one of the inventor or applicant names from a sequence. This therefore requires more work to capture and concatenate the names in a record. The same is true in the case of IPC/CPC codes where the sequence data requires preprocessing and concatenating to provide one field

#Pulling it together with ops_fetch_biblio

ops_fetch_biblio() brings the above together into one function that will return a data frame straight from a query.

library(opsrdev)
drones <- ops_fetch_biblio(query = "drones", type = "ta", service = "biblio", start = 1990, end = 2015, timer = 10)

Note that at the moment this only works with the "biblio" service (the front page bibliographies rather than the numbers or other services). In future we will make this function generic.

If you see a warning or set of warnings such as:

There were 50 or more warnings (use warnings() to see the first 50)

This is expected behaviour arising from the use of aggregate() to generate the data frame. In RStudio preview (at the time of writing) you may see these warnings spelled out as:

non-missing arguments, returning NAno non-missing arguments

We will attempt to address this issue in future updates but it is expected behaviour at present.

View(drones)

We can then write the data to a file.

write_csv(drones, "drones.csv")

##Retrieving more than 2000 results

Bearing in mind that patent databases like the Lens and Patentscope allow you to download a data table with up to 10,000 results at a time, the OPS maximum of 2000 is rather frustrating. This will maybe change now that the USPTO is taking a lead with open data services. The 2,000 limit per query does however focus our attention on the important question of why we are seeking more than

Let's imagine that we are interested in the results of searches of patent titles and abstracts that contain the words pizza between 1990 and 2015. We can do this by specifying the start and end years. We could also specify this as start = 19900101 end = 20151231 for finer control. Note that the format is YYYYMMDD.

qtotal <- ops_count("pizza", "ta", start = 1990, end = 2015) # 2986

This generates 2,986 results between the specified period. We now need to break these into smaller chunks.

qtotal_1990_2000 <- ops_count("pizza", "ta", start = 1990, end = 2000) # 884
qtotal_2001_2010 <- ops_count("pizza", "ta", start = 2001, end = 2010) # 1412
qtotal_2011_2015 <- ops_count("pizza", "ta", start = 2011, end = 2015) # 690

We can see that it will take us three queries to arrive at the total number of results across the period (2000 to 2015 returned 2093 results which is too high).

We now have three sets of queries that will be needed to arrive to the 2981 results between 1990 and 2015. Each of these sets will need to be retrieved in smaller sets of 100 records at a time (the most that can be called at one time).

In practice this means that we will have to generate a set of URLs containing 100 items at a time. For example, to retrieve the 888 results between 1990 and 2000 we will need to use 9 URLs (888/100 = 8.88) and for 2001 to 2010 we will need 15 (14.12).

We can use ops_urls to generate the necessary urls for us.

library(opsrdev)
library(dplyr)
urls_1990_2000 <- ops_urls("pizza", type = "ta", start = 1990, end = 2000) %>% print()
urls_2001_2010 <- ops_urls("pizza", "ta", start = 2001, end = 2010) %>% print()
urls_2011_2015 <- ops_urls("pizza", "ta", start = 2011, end = 2015) 4%>% print()

We now have three sets of URLs, each specifying a range of 100 results upto the rounded maximum for the date range.

The final element is to create a for loop that will loop over each of our URLs and place them into an object where we can store the results and parse them. Thanks to this very useful R-bloggers article on Accessing APIs from R by Christoph Waldhauser we can actually do this quite easily. The key to this is:

  1. To create an empty holder to store the data we receive back from OPS that is the same length as our list of URLs. We will call that ops_results.
  2. Setting a sensible sleep time before each GET request is sent to OPS. To avoid violating the fair use charter and receiving a 403 (blocking message), the minimum value must be 6 seconds (to match the 10 calls per minute limit). But it may make sense to make this quite a lot longer and go and make a cup of tea while it runs.
ops_loop <- function(x, timer = 10) {
ops_results <- vector(mode = "list", length = length(x))
for(i in 1:length(ops_results)) {
  my_urls <- x[[i]]
  raw_results <- httr::GET(url = my_urls, httr::content_type("plain/text"), httr::accept("application/json"))
  content <- httr::content(raw_results)
  ops_results[[i]] <- content
  message("getting_data")
  Sys.sleep(time = timer)
}
return(ops_results)
}

Before we run this we need to authenticate as we will be retrieving more than a trivial number of results.

library(opsrdev)
ops_auth(key, secret)

Note that in addition to the restriction on the number of results per query, there is also a fair use restriction on the amount of data that is called (with a 2 gigabyte per week limit). Depending on the time of day there may be more pressure on the service than at others. You can check the status of the service before trying to retrieve multiple sets of data.

library(opsrdev)
ops_status()

ops_status provides two pieces of information. In the first line is your authentication status (200 means good). The second line shows the status of the different services (green is good, red is bad). It makes sense to run larger queries when the service is green as there is less likelihood of being booted from the system.

Note that the amount of data that comes back (54 MB for the first query) is not trivial if you plan on using OPS a lot.

pizza1 <- ops_loop(urls_1990_2000, timer = 20)

We can then run the second query

pizza2 <- ops_loop(urls_2001_2010, timer = 20)
pizza3 <- ops_loop(urls_2011_2015, timer = 20)

Once we have the data we can use combine() from dplyr to create one list of data for a call to process in ops_biblio() to produce our complete data frame

library(dplyr)
complete <- combine(test, pizza2, pizza3) %>% 
  ops_biblio()
complete

#Regularizing the publication number

At present the publication_country, the publication_number and publication_kind are separate fields. It makes sense to unite them so they can be use for lookups with other database services.

library(tidyr)
complete <- unite(complete, publication_number_full, c(publication_country, publication_number, publication_kind), sep = "", remove = FALSE)

##Round Up

opsr is presently at an early stage of development but progress is being made. The OPS service is not intended for large scale bulk downloads and it is quite easy to bump up against the service restrictions and receive a 403 message. But, OPS is a valuable source of free patent data and offers a range of other services (e.g. family and legal status) along with the ability to retrieve description and claims for some jurisdictions that are not available for free anywhere else.