-
Notifications
You must be signed in to change notification settings - Fork 16
/
README.Rmd
289 lines (199 loc) · 12.5 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
---
title: "README"
author: "Paul Oldham"
date: "29 July 2016"
output:
github_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# opsrdev
[![Travis-CI Build Status](https://travis-ci.org/poldham/opsrdev.svg?branch=master)](https://travis-ci.org/poldham/opsrdev)
This is the early development version of the opsr R package to access the [European Patent Office Open Patent Services API](http://www.epo.org/searching-for-patents/technical/espacenet/ops.html#tab1)
The main aim of the package is to provide access to the OPS service and, more importantly, to transform the data into data frames or the new tibbles (simpler data frames) that we can perform useful patent analysis with. There is still quite a way to go and if you would like to contribute feel welcome.
opsrdev is not on CRAN. To install it use:
```{r eval=FALSE}
devtools::install_github("poldham/opsrdev")
```
##Walkthrough
Here is a quick walkthrough on how to use opsrdev. We will use the topic of drones as an example.
opsr uses r packages in the tidyverse as well as some additional packages. You don't need to worry about that but for this walkthrough we need to install or load the following.
```{r install_drones, eval=FALSE}
install.packages("dplyr") #for chaining using %>%
install.packages("readr") #to write the files
```
Load dplyr to use %>% in interactive mode and opsrdev.
```{r library, eval=FALSE}
library(opsrdev) # the opsr package
library(dplyr)
library(readr)
```
OPS will allow us to make limited use of the service without authenticating, but they plan to turn that off in future. It is a good idea to register for the service [here](http://www.epo.org/searching-for-patents/technical/espacenet/ops.html#tab1), then create an App in the developer portal, go to MyApps and copy down the key and secret.
You may find it easiest to save the key and the secret in your environment using:
```{r key_secret, eval=FALSE}
key <- "yourkey"
secret <- "yoursecret"
```
`ops_auth()` is a simple function
Authenticate (not needed for this walkthrough as the dataset is small but good practice)
```{r auth, eval=FALSE}
ops_auth("yourkey", "yoursecret")
```
## Test a query for the number of results with ops_count()
We can only fetch 2000 results at a time from the OPS service. So, we need to find out whether we have 2000 or more results.
### Get a Count for Patent Biblios (front pages)
Patent biblios include the title, abstract, applicant and inventor fields.
```{r count}
library(opsrdev)
ops_count("drones") # biblio search is the default for ops_count() and "biblio" does not need to be specified.
```
### Get a Count for Titles
```{r title}
library(opsrdev)
ops_count("drones", "ti") # titles, all years
```
### Get a Count for the Titles and Abstract
```{r ta}
library(opsrdev)
ops_count("drones", "ta") # title and abstract, all years
```
Use a shorter year range if you have too many results (over 2000). Note that the year is the publication year.
```{r year_limit}
library(opsrdev)
ops_count("drones", "ta", start = 1990, end = 2015) # 1990 to 2015
```
An example of over 2000 results would be for something slightly ridiculous like "dna" (the maximum that will be displayed appears to be 10,000 rather than the actual number of results which will be many more).
```{r large_example}
library(opsrdev)
ops_count("dna", "ta", start = 1990, end = 2015)
```
So, bear in mind that it is important to think about your search strategy. Will address how to deal with more than 2000 results at a time below.
## Retrieve the data from OPS
In practice retrieving data from OPS is a multi-step process.
Step 1. Define how many results a query will return.
Step 2. Generate URLs to retrieve the results (100 per url)
Step 3. Fetch the results
Step 4. Transform into a data.frame (in future a tibble)
We will illustrate these steps and then show how they are combined.
###Step 2: Generate URLs
```{r urls}
library(opsrdev)
tmp <- ops_urls(query = "drones", type = "ta", start = 1990, end = 2015)
tmp
```
ops_urls() takes a query, a type (ti for title, ta for title and abstract, bilio - the default) and start and end for year or date ranges (YYYYMMDD format).
###Step 3 Fetch the data (as JSON)
```{r iterate, eval=FALSE}
library(opsrdev)
tmp1 <- ops_iterate(tmp, service = "biblio", timer = 10)
tmp1
```
ops_iterate uses lapply and a timer to fetch the data for each URL using GET (see ops_get). ops_iterate is a generic function that will work with a range of services (such as retrieving claims or descriptions based on patent numbers)
###Step 4 Transform into a data.frame
We normally want to view patent bibliographic information in a data frame (or in future the simplified tibble). Parsing the JSON data is the job of ops_biblio()
```{r biblio, eval=FALSE, warning=FALSE}
library(opsrdev)
tmp2 <- ops_biblio(tmp1)
head(tmp2)
```
ops_biblio uses the aggregate() function in its final step and this generates a warning.
Note that some issues require clarification in future work (notably the number of applicants/inventors per record). At present ops_biblio appears to parse only one of the inventor or applicant names from a sequence. This therefore requires more work to capture and concatenate the names in a record. The same is true in the case of IPC/CPC codes where the sequence data requires preprocessing and concatenating to provide one field
#Pulling it together with ops_fetch_biblio
ops_fetch_biblio() brings the above together into one function that will return a data frame straight from a query.
```{r fetch, warning=FALSE, eval=FALSE}
library(opsrdev)
drones <- ops_fetch_biblio(query = "drones", type = "ta", service = "biblio", start = 1990, end = 2015, timer = 10)
```
Note that at the moment this only works with the "biblio" service (the front page bibliographies rather than the numbers or other services). In future we will make this function generic.
If you see a warning or set of warnings such as:
`There were 50 or more warnings (use warnings() to see the first 50)`
This is expected behaviour arising from the use of `aggregate()` to generate the data frame. In RStudio preview (at the time of writing) you may see these warnings spelled out as:
`non-missing arguments, returning NAno non-missing arguments`
We will attempt to address this issue in future updates but it is expected behaviour at present.
```{r view, eval=FALSE}
View(drones)
```
We can then write the data to a file.
```{r write, eval=FALSE}
write_csv(drones, "drones.csv")
```
##Retrieving more than 2000 results
Bearing in mind that patent databases like the Lens and Patentscope allow you to download a data table with up to 10,000 results at a time, the OPS maximum of 2000 is rather frustrating. This will maybe change now that the USPTO is taking a lead with [open data services](http://www.uspto.gov/learning-and-resources/open-data-and-mobility). The 2,000 limit per query does however focus our attention on the important question of why we are seeking more than
Let's imagine that we are interested in the results of searches of patent titles and abstracts that contain the words pizza between 1990 and 2015. We can do this by specifying the start and end years. We could also specify this as start = 19900101 end = 20151231 for finer control. Note that the format is YYYYMMDD.
```{r eval=FALSE}
qtotal <- ops_count("pizza", "ta", start = 1990, end = 2015) # 2986
```
This generates 2,986 results between the specified period. We now need to break these into smaller chunks.
```{r qtotal}
qtotal_1990_2000 <- ops_count("pizza", "ta", start = 1990, end = 2000) # 884
qtotal_2001_2010 <- ops_count("pizza", "ta", start = 2001, end = 2010) # 1412
qtotal_2011_2015 <- ops_count("pizza", "ta", start = 2011, end = 2015) # 690
```
We can see that it will take us three queries to arrive at the total number of results across the period (2000 to 2015 returned 2093 results which is too high).
We now have three sets of queries that will be needed to arrive to the 2981 results between 1990 and 2015. Each of these sets will need to be retrieved in smaller sets of 100 records at a time (the most that can be called at one time).
In practice this means that we will have to generate a set of URLs containing 100 items at a time. For example, to retrieve the 888 results between 1990 and 2000 we will need to use 9 URLs (888/100 = 8.88) and for 2001 to 2010 we will need 15 (14.12).
We can use ops_urls to generate the necessary urls for us.
```{r url_set}
library(opsrdev)
library(dplyr)
urls_1990_2000 <- ops_urls("pizza", type = "ta", start = 1990, end = 2000) %>% print()
urls_2001_2010 <- ops_urls("pizza", "ta", start = 2001, end = 2010) %>% print()
urls_2011_2015 <- ops_urls("pizza", "ta", start = 2011, end = 2015) 4%>% print()
```
We now have three sets of URLs, each specifying a range of 100 results upto the rounded maximum for the date range.
The final element is to create a `for loop` that will loop over each of our URLs and place them into an object where we can store the results and parse them. Thanks to this very useful [R-bloggers article on Accessing APIs from R by Christoph Waldhauser](http://www.r-bloggers.com/accessing-apis-from-r-and-a-little-r-programming/) we can actually do this quite easily. The key to this is:
1. To create an empty holder to store the data we receive back from OPS that is the same length as our list of URLs. We will call that `ops_results`.
2. Setting a sensible sleep time before each GET request is sent to OPS. To avoid violating the fair use charter and receiving a 403 (blocking message), the minimum value must be 6 seconds (to match the 10 calls per minute limit). But it may make sense to make this quite a lot longer and go and make a cup of tea while it runs.
```{r ops_loop, eval=FALSE}
ops_loop <- function(x, timer = 10) {
ops_results <- vector(mode = "list", length = length(x))
for(i in 1:length(ops_results)) {
my_urls <- x[[i]]
raw_results <- httr::GET(url = my_urls, httr::content_type("plain/text"), httr::accept("application/json"))
content <- httr::content(raw_results)
ops_results[[i]] <- content
message("getting_data")
Sys.sleep(time = timer)
}
return(ops_results)
}
```
Before we run this we need to authenticate as we will be retrieving more than a trivial number of results.
```{r auth1, eval=FALSE}
library(opsrdev)
ops_auth(key, secret)
```
Note that in addition to the restriction on the number of results per query, there is also a fair use restriction on the amount of data that is called (with a 2 gigabyte per week limit). Depending on the time of day there may be more pressure on the service than at others. You can check the status of the service before trying to retrieve multiple sets of data.
```{r status}
library(opsrdev)
ops_status()
```
ops_status provides two pieces of information. In the first line is your authentication status (200 means good). The second line shows the status of the different services (green is good, red is bad). It makes sense to run larger queries when the service is green as there is less likelihood of being booted from the system.
Note that the amount of data that comes back (54 MB for the first query) is not trivial if you plan on using OPS a lot.
```{r get1, eval=FALSE}
pizza1 <- ops_loop(urls_1990_2000, timer = 20)
```
We can then run the second query
```{r get2, eval=FALSE}
pizza2 <- ops_loop(urls_2001_2010, timer = 20)
```
```{r get3, eval=FALSE}
pizza3 <- ops_loop(urls_2011_2015, timer = 20)
```
Once we have the data we can use `combine()` from dplyr to create one list of data for a call to process in `ops_biblio()` to produce our complete data frame
```{r complete, eval=FALSE, warning=FALSE}
library(dplyr)
complete <- combine(test, pizza2, pizza3) %>%
ops_biblio()
complete
```
#Regularizing the publication number
At present the publication_country, the publication_number and publication_kind are separate fields. It makes sense to unite them so they can be use for lookups with other database services.
```{r publicationnumber, eval=FALSE}
library(tidyr)
complete <- unite(complete, publication_number_full, c(publication_country, publication_number, publication_kind), sep = "", remove = FALSE)
```
##Round Up
`opsr` is presently at an early stage of development but progress is being made. The OPS service is not intended for large scale bulk downloads and it is quite easy to bump up against the service restrictions and receive a 403 message. But, OPS is a valuable source of free patent data and offers a range of other services (e.g. family and legal status) along with the ability to retrieve description and claims for some jurisdictions that are not available for free anywhere else.