# Notebook 8.1 RESTful GBIF

This notebook will introduce you to methods for parsing data from the web by accessing websites that have RESTful APIs. In the next notebook we will continue to other methods for parsing data from the web as well. 

These notebooks will involve two core aspects: (1) working with RESTful APIs, which we discussed briefly in lecture; and (2) parsing HTML text. The first is much easier, since the data that we query from the web is returned in a form that is easy to interpret and *meant* to be parsed and analyzed, whereas the second requires us to learn a little about how the internet works, which is also useful to know. So let's get started. You'll need to install the two libraries below. 

In [1]:
# conda install beautifulsoup4
# conda install requests

In [1]:
import toyplot
import requests
import pandas as pd


### The design of REST APIs. 
The idea behind REST APIs is that data on a server (like a webpage) can be accessed with a consistent type of argument in the form of a URL. The query returns data in a form that is easy to analyze (usually JSON or XML), as opposed to messy HTML that needs to be parsed. Many websites have REST APIs, but some are much easier to use than others. 

### Limits
Many REST APIs have limits on the way that you can use them. For example, the REST APIs for Twitter, GitHub, and Reddit require that you log in to access data, and some sites will throttle how many requests you can make per hour. Although sites with a REST API are usually intended to be searched, the tools we will learn here can be used to search almost any website. It is important to know that some sites will intentionally block you if you try too aggressively to scrape data from them. For example, Google Scholar and GenBank both have limits on the number of queries you can make per hour. 
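
One simple way to stay within such limits is to pause between requests. The sketch below is not part of `requests` or of any site's API: `Throttle` and its arguments are hypothetical names chosen for illustration, and the injected `clock` and `sleep` functions are there only so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Allow at most one call every `min_interval` seconds (illustrative helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Pause if the previous call was less than `min_interval` ago."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# demonstration with a fake clock, so nothing actually sleeps here
fake_now = [0.0]
slept = []

def fake_clock():
    return fake_now[0]

def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds

throttle = Throttle(2.0, clock=fake_clock, sleep=fake_sleep)
throttle.wait()      # first call: no pause
throttle.wait()      # immediately after: pauses for the full interval
fake_now[0] += 5.0
throttle.wait()      # 5 seconds later: no pause needed
print(slept)         # [2.0]
```

In real use you would create `Throttle(2.0)` with the default clock and call `throttle.wait()` right before each web request. The interval of 2 seconds is an arbitrary example, not a documented limit of any site.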

### Good REST APIs
A good REST API will have good directions for its usage. Two very good examples are the [USGS BISON API](https://bison.usgs.gov/#api) and the [Global Biodiversity Information Facility (GBIF) API](https://www.gbif.org/developer/summary). We'll focus on the latter in this notebook. The GBIF database is an international effort to collect all observation data on plants, animals, and fungi into a single place where it can be searched. It is actually a conglomeration of many separate databases, with data from museums and similar institutions all over the world. These APIs are free to use, but they ask that you cite them if the data ends up being used in a publication. 

### What does GBIF do?
GBIF can be used to find specimen collection records, or other types of observation data, stored in museum-type databases. The website offers a convenient way to request taxa by name, and to select specific research criteria. For example, if we wanted to find all specimens of bumblebees (genus *Bombus*) that were collected between 1910 and 1920 we could request this through the website. It will draw a nice map with their locations and you can download a table with the coordinates of where they were collected. This is actually one of the best databases around, since it organizes the data so that it is easy to download. Nevertheless, we'll use it as our example to learn REST APIs. 

Even though GBIF has a very nice web interface, it is obviously often more efficient to be able to query this database programmatically, instead of having to type each name we wish to search, and click on several buttons. This can provide a much more powerful way of applying filters over many different types of searches. That is the idea behind REST APIs and the reason why GBIF provides one. 

### The base-url
The base URL is the web address of the API. This is simply a string that we will add arguments to in order to request that particular types of data be returned to us from the database. For GBIF this is the following, which we'll store as a string for now. This base URL is given to us right at the top of the [GBIF API documentation](https://www.gbif.org/developer/occurrence). You can see that it looks much like any other web address. 

In [2]:
baseurl = "http://api.gbif.org/v1/occurrence/search?"

### How to query GBIF
As you can see in the URL below, an API query just has additional arguments added to the baseurl. The string below searches for records with the name *Bombus*, which is the genus for bumblebees. We've added just the 'query' option 'q' and the name. We'll see next how to make more complex queries. But first, copy the URL below (without the quotation marks around it) and paste it into a web browser. This will show you what the returned data looks like. It might look a bit different depending on which browser you are using (I recommend Firefox or Chrome), but the underlying data is the same, and is called JSON data.

In [13]:
search_url = "http://api.gbif.org/v1/occurrence/search?q=Bombus"
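
To see how such a URL is put together, it can help to take it apart. The sketch below uses only Python's standard-library `urllib.parse` module (not `requests`) to split `search_url` into its components and to read the query arguments back as key:value pairs.

```python
from urllib.parse import parse_qs, urlsplit

search_url = "http://api.gbif.org/v1/occurrence/search?q=Bombus"

# split the URL into its named components
parts = urlsplit(search_url)
print(parts.scheme)   # http
print(parts.netloc)   # api.gbif.org
print(parts.path)     # /v1/occurrence/search
print(parts.query)    # q=Bombus

# everything after the "?" is the query: a set of key=value pairs
query = parse_qs(parts.query)
print(query)          # {'q': ['Bombus']}
```

Each value comes back as a list because a key may legally appear more than once in a query string.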

### JSON format
We will be using the `requests` library to get data from online, but before we do, let's talk a bit about how the data will be structured so we know what to expect. The data that you should see in your browser now is called JSON formatted data.  You'll notice that this format is almost identical to what a Python dictionary looks like. It is composed of key:value pairs. This will make it particularly easy to work with. 
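
As a small illustration of how closely JSON maps onto Python objects, the sketch below parses a shortened copy of the kind of text GBIF returns for this query, using the standard-library `json` module. The long `results` list has been emptied here for brevity, which is an edit made for this example.

```python
import json

# shortened response text; the "results" list is emptied for brevity
text = '{"offset":0,"limit":20,"endOfRecords":false,"count":1048303,"results":[]}'

# JSON objects become dicts, arrays become lists, false becomes False
parsed = json.loads(text)
print(type(parsed).__name__)    # dict
print(parsed["limit"])          # 20
print(parsed["endOfRecords"])   # False
print(parsed["results"])        # []

# going the other way, with indentation for readability
print(json.dumps(parsed, indent=2))
```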

### Requests

[Documentation](http://docs.python-requests.org/en/master/user/quickstart/)

The `requests` package works a lot like an automated web browser. We've used `requests` briefly in the past, but now we'll start to use it more effectively. The main function we will call is `.get()`, which sends a GET request to the web address and returns a `Response` object. We will then access attributes and functions of the `Response` instance to see if our request worked, and to parse the resulting text from it. Let's try this on our `search_url` string defined above. 

In [14]:
# create a Response instance from a request
response = requests.get(search_url)


In [15]:
# check that your request worked (200 = worked; other codes = it did not)
response.status_code 

200

In [16]:
# this raises an HTTPError exception if the request failed (else returns None)
response.raise_for_status()
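
Status codes other than 200 also carry information. As a rough guide, codes from 400 to 599 signal an error, and those are the codes for which `raise_for_status()` raises an exception. The sketch below uses the standard-library `http.HTTPStatus` enum to turn a code into its standard phrase. `describe_status` is a hypothetical helper written for this example, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of a standard HTTP status code."""
    # HTTPStatus raises ValueError for codes it does not know
    phrase = HTTPStatus(code).phrase
    # codes from 400 to 599 signal client or server errors
    outcome = "error" if 400 <= code < 600 else "ok"
    return f"{code} {phrase}: {outcome}"


for code in (200, 404, 429, 500):
    print(describe_status(code))
```

A 429 ("Too Many Requests") is the code a server typically sends when you have exceeded its request limit.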

### Parse a Response
When we've used `requests` before, we parsed the results as plain text, since that was usually the easiest format to work with. In this case we are going to access the data a bit differently, as JSON. This is available from the `Response` object just as easily as the text is. The raw text is not very readable or easy to parse, whereas the JSON-converted result can be accessed and searched much more easily. 

In [19]:
# first 500 characters of the .text string
response.text[:500]

'{"offset":0,"limit":20,"endOfRecords":false,"count":1048303,"results":[{"key":131330107,"datasetKey":"86aba78e-f762-11e1-a439-00145eb45e9a","publishingOrgKey":"8595cd50-87c0-11dc-bb35-b8a03c50a862","publishingCountry":"BR","protocol":"DIGIR","lastCrawled":"2013-09-07T07:05:27.000+0000","crawlId":1,"extensions":{},"basisOfRecord":"PRESERVED_SPECIMEN","taxonKey":1340307,"kingdomKey":1,"phylumKey":54,"classKey":216,"orderKey":1457,"familyKey":4334,"genusKey":1340278,"speciesKey":1340307,"scientific'

In [18]:
# or, get results as a dictionary (JSON converted)
rdict = response.json()

# get some quick info on the dictionary items
list(rdict.keys())

['facets', 'count', 'limit', 'offset', 'results', 'endOfRecords']

### Parsing the results
In GBIF our response can be parsed from JSON into a dictionary, which has the six keys shown above. These are explained in the API docs, and correspond to information about what records are available for our query. However, it did not return *all* of the data for those records to us yet. That would be too easy. Instead, databases usually limit the amount of data returned by each request, both to reduce the bandwidth needed to send the data and to make responses faster. For GBIF the default number, shown under the "limit" key, is 20. The default starting position, shown under "offset", is 0. The total number of records is in "count". So for *Bombus*, as we show below, there are 1,048,303 records, but only records 1-20 have been returned to us so far.

In [20]:
## how many records are there for this query
rdict["count"]

1048303

In [21]:
## how many records were returned
rdict["limit"]

20

In [24]:
## starting from which record
rdict["offset"]

0
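
These three numbers are enough to work out which slice of the full result set a response holds. The sketch below does that bookkeeping with the values shown above. `paging_summary` is a hypothetical helper name used only for this illustration.

```python
def paging_summary(count, offset, n_returned):
    """Summarize which records a response holds and how many remain."""
    first = offset + 1             # 1-based position of the first record
    last = offset + n_returned     # 1-based position of the last record
    remaining = count - last       # records not yet retrieved
    return first, last, remaining


first, last, remaining = paging_summary(count=1048303, offset=0, n_returned=20)
print(f"records {first}-{last} returned, {remaining} still to fetch")
```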

### So where's the data?
It's stored under the `results` key, and is returned as a list of dictionaries, where each dictionary is a record with lots of information. Below I show the first record from our search. 

In [25]:
# here is the first record, it's also a dictionary
rdict["results"][0]

{'basisOfRecord': 'PRESERVED_SPECIMEN',
 'catalogNumber': '6416',
 'class': 'Insecta',
 'classKey': 216,
 'collectionCode': 'CEPANN',
 'continent': 'SOUTH_AMERICA',
 'country': 'Brazil',
 'countryCode': 'BR',
 'crawlId': 1,
 'datasetKey': '86aba78e-f762-11e1-a439-00145eb45e9a',
 'decimalLatitude': 0.0,
 'decimalLongitude': 0.0,
 'extensions': {},
 'facts': [],
 'family': 'Apidae',
 'familyKey': 4334,
 'gbifID': '131330107',
 'genericName': 'Bombus',
 'genus': 'Bombus',
 'genusKey': 1340278,
 'geodeticDatum': 'WGS84',
 'identifiers': [],
 'institutionCode': 'USP',
 'issues': ['ZERO_COORDINATE',
  'GEODETIC_DATUM_ASSUMED_WGS84',
  'COUNTRY_COORDINATE_MISMATCH'],
 'key': 131330107,
 'kingdom': 'Animalia',
 'kingdomKey': 1,
 'lastCrawled': '2013-09-07T07:05:27.000+0000',
 'lastInterpreted': '2018-02-04T02:00:19.037+0000',
 'license': 'http://creativecommons.org/licenses/by/4.0/legalcode',
 'order': 'Hymenoptera',
 'orderKey': 1457,
 'phylum': 'Arthropoda',
 'phylumKey': 54,
 'protocol': 'D

In [26]:
# or to see it prettier, convert records to a DataFrame
pd.DataFrame(rdict["results"]).head()

| | basisOfRecord | catalogNumber | class | classKey | collectionCode | continent | country | countryCode | county | crawlId | ... | publishingOrgKey | recordedBy | relations | scientificName | species | speciesKey | specificEpithet | stateProvince | taxonKey | taxonRank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PRESERVED_SPECIMEN | 6416 | Insecta | 216 | CEPANN | SOUTH_AMERICA | Brazil | BR | | 1 | ... | 8595cd50-87c0-11dc-bb35-b8a03c50a862 | - | [] | Bombus morio (Swederus, 1787) | Bombus morio | 1340307.0 | morio | | 1340307 | SPECIES |
| 1 | PRESERVED_SPECIMEN | 9002 | Insecta | 216 | CEMeC | | Brazil | BR | Milagres | 1 | ... | 8595cd50-87c0-11dc-bb35-b8a03c50a862 | | [] | Bombus morio (Swederus, 1787) | Bombus morio | 1340307.0 | morio | Bahia | 1340307 | SPECIES |
| 2 | UNKNOWN | 43781 | Insecta | 216 | 2007_Mongol | | | | | 12 | ... | be11c6a0-7cf5-11dc-92cb-b8a03c50a862 | | [] | Bombus margreiteri Skorikov | Bombus margreiteri | 9055570.0 | margreiteri | | 9055570 | SPECIES |
| 3 | PRESERVED_SPECIMEN | 91583 | Insecta | 216 | ZEN | | | | | 17 | ... | 4c415e40-1e21-11de-9e40-a0d6ecebb8bf | | [] | Bombus Latreille, 1802 | | | | | 1340278 | GENUS |
| 4 | PRESERVED_SPECIMEN | 94807 | Insecta | 216 | ZEN | | | | | 17 | ... | 4c415e40-1e21-11de-9e40-a0d6ecebb8bf | | [] | Bombus Latreille, 1802 | | | | | 1340278 | GENUS |
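
Notice that not every record has every field: the records identified only to genus have no `species` value, for example. When working with the raw dictionaries rather than a DataFrame, indexing a missing key raises a `KeyError`, so `dict.get()` is the safer choice. The sketch below uses two trimmed-down records with values taken from the table above. `pick` is a hypothetical helper name.

```python
# two trimmed-down records using values from the table above
records = [
    {
        "scientificName": "Bombus morio (Swederus, 1787)",
        "species": "Bombus morio",
        "country": "Brazil",
        "taxonRank": "SPECIES",
    },
    {
        "scientificName": "Bombus Latreille, 1802",
        "taxonRank": "GENUS",
    },
]


def pick(record, keys, missing=None):
    """Return only the requested keys, filling absent ones with `missing`."""
    return {key: record.get(key, missing) for key in keys}


for rec in records:
    print(pick(rec, ["species", "country", "taxonRank"]))
```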


### Building a request
Here we add more arguments to further filter the results. To see which options are available, you can either look at the results from our calls so far, or you can read further into the API docs. API docs are sometimes incomplete, though, so it can be useful to learn to infer which options are possible from looking at the results. A more complex search is accomplished by building a URL that has more key:value pairs, each appended to the end of the URL and separated by an "&" symbol. For large searches this becomes difficult to write out by hand, and that is where `requests` comes in handy. Here we pass the additional arguments as a simple Python dictionary to the `params` argument of `requests.get()`. Below I show the URL for when we add the requirement that a record have coordinate data, and for when we add additional arguments to raise the limit on the number of records returned. The maximum number of records returned per request (the `limit`) is 300. To get records beyond that, you need to increment the `offset` and make another request.

In [15]:
# add requirement that the record have coordinate data
res = requests.get(
    url=baseurl, 
    params={
        "q": "Bombus", 
        "hasCoordinate": "true",
    }
)
res.url

'http://api.gbif.org/v1/occurrence/search?hasCoordinate=true&q=Bombus'

In [27]:
# request records 0-100
res = requests.get(
    url=baseurl, 
    params={
        "q": "Bombus", 
        "hasCoordinate": "true",
        "offset": "0", 
        "limit": "100"
    }
)
res.url

'http://api.gbif.org/v1/occurrence/search?hasCoordinate=true&limit=100&q=Bombus&offset=0'
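
To make the encoding step less mysterious, here is a rough standard-library sketch of the same idea: `urllib.parse.urlencode` turns a dictionary into a query string joined by "&". It is meant as an illustration of the concept rather than a description of how `requests` is implemented. The arguments may come out in a different order than in the URL shown above, but the order of query arguments does not change their meaning, as the comparison at the end shows.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

baseurl = "http://api.gbif.org/v1/occurrence/search?"
params = {"q": "Bombus", "hasCoordinate": "true", "offset": "0", "limit": "100"}

# join key=value pairs with "&" and append them to the base URL
built = baseurl + urlencode(params)
print(built)

# the URL reported by requests in the cell above
shown = "http://api.gbif.org/v1/occurrence/search?hasCoordinate=true&limit=100&q=Bombus&offset=0"


def query_of(url):
    """Return the query arguments of a URL as a dictionary."""
    return parse_qs(urlsplit(url).query)


# same arguments, regardless of the order they appear in
print(query_of(built) == query_of(shown))   # True
```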

### A complex search
Here I request all *Bombus* records from 1900 to 1910 that are associated with a preserved specimen (as opposed to HUMAN_OBSERVATION or FOSSIL_SPECIMEN), have spatial data, and are in the US. The 'count' shows us that there are 4,977 records meeting these requirements. We requested the max of 300 records per request: the first call below returns records 1-300, and the second, with the offset set to 300, returns records 301-600. 

In [33]:
res = requests.get(
    url=baseurl, 
    params={
        "q": "Bombus", 
        "year": "1900,1910", 
        "basisOfRecord": "PRESERVED_SPECIMEN",
        "hasCoordinate": "true",
        "hasGeospatialIssue": "false",
        "country": "US",
        "offset": "0",
        "limit": "300"
    },
)
r2dict = res.json()
r2dict["count"]

4977

In [34]:
res = requests.get(
    url=baseurl, 
    params={
        "q": "Bombus", 
        "year": "1900,1910", 
        "basisOfRecord": "PRESERVED_SPECIMEN",
        "hasCoordinate": "true",
        "hasGeospatialIssue": "false",
        "country": "US",
        "offset": "300",
        "limit": "300"
    },
)
r1dict = res.json()
r1dict["count"]

4977
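
With 4,977 matching records and at most 300 per request, it is easy to work out in advance which offsets are needed to cover everything. A minimal sketch (`page_offsets` is a hypothetical helper name):

```python
def page_offsets(count, limit=300):
    """Offsets needed to cover `count` records, `limit` records at a time."""
    return list(range(0, count, limit))


offsets = page_offsets(4977, limit=300)
print(len(offsets))    # 17 requests in total
print(offsets[:3])     # [0, 300, 600]
print(offsets[-1])     # 4800
```

The last request, starting at offset 4800, returns only the remaining 177 records.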

### Combining many searches
If we wanted to collect all records for a given search then we need to increment the "offset" argument until we reach the end of the records. Each is returned as a list of dictionaries, so we can just join all of those lists together and return them. 

In [35]:
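def get_all_records(searchparams):
    "iterate over offsets until the end of records is reached"
    # work on a copy so the caller's dictionary is not modified
    params = dict(searchparams)
    params.setdefault("limit", "300")
    limit = int(params["limit"])
    offset = int(params.get("offset", 0))
    data = []
    
    while True:
        # make request for the current offset; raise on HTTP errors
        params["offset"] = str(offset)
        res = requests.get(url=baseurl, params=params)
        res.raise_for_status()
        
        # concatenate data 
        idata = res.json()
        data += idata["results"]
        
        # stop when end of records is reached
        if idata["endOfRecords"]:
            break
        
        # otherwise move on to the next block of records
        offset += limit
        
    return data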

In [36]:
# make params dictionary
searchparams = {
    "q": "Bombus", 
    "year": "1900,1910", 
    "basisOfRecord": "PRESERVED_SPECIMEN",
    "hasCoordinate": "true",
    "hasGeospatialIssue": "false",
    "country": "US",
    "offset": "0",
    "limit": "300"
}

# call function to search over all offset values until end
data = get_all_records(searchparams)
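
The paging logic can be tried out without touching the network by swapping the web request for a stand-in function. The sketch below is an illustration under that assumption: `collect_pages`, `fake_fetch`, and `FAKE_RECORDS` are hypothetical names, and the fake "server" only mimics the `results` and `endOfRecords` keys of a GBIF response.

```python
def collect_pages(fetch, params, limit=300):
    """Call fetch(query) repeatedly, advancing the offset, and pool the results."""
    query = dict(params, limit=str(limit))   # a copy: the caller's dict is untouched
    offset = int(query.get("offset", 0))
    records = []
    while True:
        query["offset"] = str(offset)
        page = fetch(query)
        records.extend(page["results"])
        if page["endOfRecords"]:
            return records
        offset += limit


# a stand-in for the server: 7 fake records served a few at a time
FAKE_RECORDS = [{"key": i} for i in range(7)]
calls = []

def fake_fetch(query):
    """Return one 'page' of FAKE_RECORDS in a GBIF-like dictionary."""
    offset, limit = int(query["offset"]), int(query["limit"])
    calls.append(offset)
    return {
        "results": FAKE_RECORDS[offset:offset + limit],
        "endOfRecords": offset + limit >= len(FAKE_RECORDS),
    }


params = {"q": "Bombus", "offset": "0"}
records = collect_pages(fake_fetch, params, limit=3)
print(len(records))   # 7
print(calls)          # [0, 3, 6]
print(params)         # unchanged: {'q': 'Bombus', 'offset': '0'}
```

With the real API, `fetch` could be a small function that calls `requests.get(baseurl, params=query)` and returns the `.json()` of the response.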

### The full data

In [37]:
# convert to a data frame
df = pd.DataFrame(data)

In [38]:
# keys (columns) in the dataframe (there are many!)
list(df.columns)

['accessRights',
 'associatedReferences',
 'associatedTaxa',
 'basisOfRecord',
 'bibliographicCitation',
 'catalogNumber',
 'class',
 'classKey',
 'collectionCode',
 'collectionID',
 'continent',
 'coordinatePrecision',
 'coordinateUncertaintyInMeters',
 'country',
 'countryCode',
 'county',
 'crawlId',
 'datasetID',
 'datasetKey',
 'datasetName',
 'dateIdentified',
 'day',
 'decimalLatitude',
 'decimalLongitude',
 'disposition',
 'dynamicProperties',
 'elevation',
 'elevationAccuracy',
 'endDayOfYear',
 'eventDate',
 'eventRemarks',
 'extensions',
 'facts',
 'family',
 'familyKey',
 'fieldNotes',
 'gbifID',
 'genericName',
 'genus',
 'genusKey',
 'geodeticDatum',
 'georeferenceProtocol',
 'georeferenceRemarks',
 'georeferenceSources',
 'georeferencedBy',
 'higherClassification',
 'higherGeography',
 'http://unknown.org/recordEnteredBy',
 'http://unknown.org/recordId',
 'identifiedBy',
 'identifier',
 'identifiers',
 'individualCount',
 'infraspecificEpithet',
 'institutionCode',
 'ins

In [40]:
# view just the columns we're interested in for now.
sdf = df[["species", "year", "decimalLatitude", "decimalLongitude"]]
sdf.head()

| | species | year | decimalLatitude | decimalLongitude |
|---|---|---|---|---|
| 0 | Bombus ternarius | 1908 | 45.06167 | -83.43278 |
| 1 | Bombus ternarius | 1909 | 39.75556 | -105.22056 |
| 2 | Bombus sylvicola | 1907 | 53.865 | -166.527 |
| 3 | Bombus pensylvanicus | 1909 | 37.72722 | -89.21667 |
| 4 | Bombus vagans | 1907 | 42.17 | -88.29 |


In [41]:
# how many records?
sdf.shape

(4977, 4)

In [50]:
# which unique species?
sdf.species.unique()

array(['Bombus ternarius', 'Bombus sylvicola', 'Bombus pensylvanicus',
       'Bombus vagans', 'Bombus terricola', 'Bombus fervidus',
       'Bombus balteatus', 'Bombus variabilis', 'Bombus centralis',
       'Bombus huntii', 'Bombus auricomus', 'Bombus nevadensis',
       'Bombus impatiens', 'Bombus rufocinctus', 'Bombus appositus',
       'Bombus fraternus', 'Bombus griseocollis', 'Bombus ashtoni',
       'Bombus insularis', 'Bombus vosnesenskii', 'Bombus vandykei',
       'Bombus bifarius', 'Bombus affinis', 'Bombus borealis',
       'Bombus morrisoni', 'Bombus citrinus', 'Bombus perplexus',
       'Bombus fernaldae', 'Bombus melanopygus', 'Bombus weisi',
       'Bombus hyperboreus', 'Bombus mixtus', 'Bombus flavifrons',
       'Bombus sandersoni', 'Bombus crotchii', 'Bombus bimaculatus',
       'Bombus caliginosus', 'Bombus occidentalis', nan,
       'Bombus cockerelli', 'Bombus lapponicus', 'Bombus californicus',
       'Bombus suckleyi', 'Bombus patagiatus', 'Bombus sitkensis',
 

In [48]:
# exclude nan/null and count species
mask = sdf.species.notnull()
df.species[mask].unique().size

48

In [53]:
# plot the number of each species in order (hover over bars for names)
sp_counts = df.species[mask].value_counts()
toyplot.bars(sp_counts, height=350, title=sp_counts.index);
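
If you are curious what the counting step is doing, the same idea can be sketched without pandas using `collections.Counter`. The list below holds the `species` values from the first five rows of `sdf` shown earlier, plus one missing value (`None`) added here purely for illustration.

```python
from collections import Counter

# 'species' values from the first five rows, plus one missing value
# (None) added for illustration
species = [
    "Bombus ternarius", "Bombus ternarius", "Bombus sylvicola",
    "Bombus pensylvanicus", "Bombus vagans", None,
]

# skip missing values, then count how often each name occurs
counts = Counter(name for name in species if name is not None)

# most_common() sorts from most to least frequent
for name, n in counts.most_common():
    print(name, n)
```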

## Assignment: 

### Task 1: 
Write a Class object called `Records` that can be given a taxon name query and a range of years and will return a class instance with all results from GBIF for the queried range using the same params from our example above, except allowing the 'q' and 'year' arguments to vary. You can reuse the code above to create the core functions for your object. When finished, you should be able to use the object in the following way. This means that after writing your Class object, test it using the code below and make sure it gives proper results. 

In [28]:
# fill in the params dictionary, write functions to update it
# with the entered arguments to __init__, write functions
# to get_all_records and store results as a dataframe.
# have all functions run during init so that the initialized
# object calls the request and returns a full dataframe. 

class Records:
    def __init__(self, q=None, interval=None):
        
        self.q = q
        self.interval = interval
        self.params = {}
    
    def _get_all_records(self):
        "..."
        pass
    

In [None]:
## create instance by entering query and a range of years as integers
rec = Records(q="Bombus", interval=(1950, 1955))

## access all of the returned records as a dataframe 
## (here asking for the shape to see how many records there are)
rec.df.shape


### Task 2: 
Once you have tested your `Records` class in this notebook and it is working, copy it to a new `.py` file in a text editor, name it `records.py`, and put it in a new folder called `records/`. Then add an `__init__.py` file and a `setup.py` file and structure this directory so that it can be imported as a Python package. If you need a review on how to do this, look back at lecture 5 and the assignment from that lecture (https://github.com/programming-for-bio/5-Packaging/blob/master/Notebooks/nb-5.2-packaging.ipynb). This will be very similar to the 'helloworld' package that we wrote. When you are finished, install the package with pip and try importing it like below. Then push the `records` folder to your GitHub account as a new repo named records.  

In [None]:
# import your library and access the Records class object from it
import records
rec = records.Records("Bombus", interval=(1990, 2000))


### Task 3: 
I will post a working example on Friday so that you can *update or fix your code before the next class to ensure your code is working*. Then push your updated/fixed code to GitHub before the next class. You should hopefully be able to get it working, since you've done all of these tasks before. We are going to continue adding new functions to this library over the next two weeks, so if you get stuck on this assignment ask for help and try to keep up, because it will be very difficult if you get behind. 