# The records library

After completing assignment 8 you should now have a working `records` Python library that can be imported and that contains a `Records` class object that can be used to query GBIF using `requests`. Here I'll explain my implementation of the `Records` class so that you can compare yours to it. To complete your assignment you will review your own code and then update it to match the latest version of the `records` library in the course repository. That version has much more code than was originally assigned, and it is explained below.

### First, an example implementation
According to the instructions for the assignment, you should be able to import your library and execute the code below to perform a GBIF query in which the results are returned and stored in a pandas dataframe that is accessible from your class instance. We'll start with a pretty small query in terms of the time interval so that it will run quickly. 

In [1]:
import records

In [2]:
# get results for a query
rec = records.Records("Bombus", (1900, 1905))

In [3]:
# show number of results in dataframe
rec.df.shape

(2053, 103)

### Location of the code-review repository
You can find my version of the assignment code in a GitHub repo [here](https://github.com/programming-for-bio/records/tree/assignment). This link is to a `branch` of the records repository that I have named `assignment`, and in which the code has been left at the point where I had completed the assignment. If you look at the `master` branch of the repository you will see that the code has since advanced quite significantly. Here I'm using a branch as a way of storing a snapshot in time of the code (a commit). Another way we could store a snapshot would be by using `git tags` as described [in the git docs](https://git-scm.com/book/en/v2/Git-Basics-Tagging). It's simply a way of putting a more informative name on a particular commit in your history. 

### Reviewing the code

In the `assignment` branch take a look at the file `records/records.py` to see my version of the `Records` class object. I tried to keep this object very simple. Compare your implementation to this one. Think about readability and redundancy. I wrote this code in Sublime Text using the pycodestyle and pyflakes linters, which helped by providing hints on styling the code. You'll notice that none of the lines is longer than 80 characters, which helps to make them easily readable. As usual, there are multiple ways in which this code could have been written and it would still function just as well. For example, the params dictionary does not necessarily need to be defined in the `__init__` function, but I chose to define it there. Some of the design decisions I made have to do with planning for future extensions to this code, as we'll see in a later section of this notebook. If your code did not work then take time to compare it to the version in the repository to find what the difference was. Modify your code as needed and test it again.

### Common differences between my code and submitted assignments
Some common differences that I saw between students' versions of the code and mine included redundancy from defining some terms multiple times, and errors in updating the offset or limit parameters in the dictionary. Some of these did not cause problems while others might have. Another useful function that many people did not include in their code was a method to check whether the API query was successful. If you do not check this then the `.json()` call will typically raise an error when the URL request is bad, but it will not give a very informative reason why. So you will want to check the Response object to see whether the `.get()` call was successful before parsing the JSON data. There are several ways to check that your URL was good using the `requests` Response object, such as the `.status_code` attribute, the `.raise_for_status()` method, or the `.ok` attribute. Example below.

In [4]:
import pandas as pd
import numpy as np
import requests

In [5]:
# example to check the success of a request
response = requests.get("http://api.gbif.org/v1/occurrence/search?q=Bombus&year=1900")
response.ok

True
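
The offset and limit bookkeeping mentioned above is easy to get wrong, so here is a minimal sketch of a paging loop. It does not contact GBIF: `fake_search` is a hypothetical stand-in that mimics the `results` and `endOfRecords` fields of a GBIF response, and the page size of 3 is chosen only to keep the example short. The `Records` class in the repository may organize this differently.

```python
def fake_search(params, total=8):
    """Hypothetical stand-in for the API: returns one page of fake records."""
    start = params["offset"]
    stop = min(start + params["limit"], total)
    return {
        "results": [{"id": i} for i in range(start, stop)],
        "endOfRecords": stop >= total,
    }


def get_all_records(params):
    """Collect every page by advancing the offset by the page size."""
    # copy so that the caller's dictionary is not modified
    params = dict(params, offset=0)
    records = []
    while True:
        page = fake_search(params)
        records.extend(page["results"])
        if page["endOfRecords"]:
            break
        # move the offset forward by one full page; the limit stays fixed
        params["offset"] += params["limit"]
    return records


all_records = get_all_records({"q": "Bombus", "limit": 3})
print(len(all_records))
```

The key point is that only the offset changes between requests, and it always grows by the value of the limit.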

#### A note on exception handling 

The code below has a bad URL string, so it will **raise an exception**. We entered `basisOfRecord` as a search parameter but gave it a value that is not acceptable (that is, not one of the allowed values such as "PRESERVED_SPECIMEN" or "FOSSIL_SPECIMEN"). An exception includes two things, a *traceback* and a *message*. The traceback shows you a series of lines of code and tries to hint at where in the code the problem arose. Interpreting tracebacks takes some practice, especially when running code that you did not write, as in the case of `requests`. The traceback will print a number of lines of the code inside `requests` that ran into the error. At some point, it will also include the line of your code that made the request. The traceback follows the logical progression as one function calls another and then another. At the very end you will find the **message**, and this is often the most useful part. Below we can see that there was an `HTTPError`, and it tells us that the requested URL was not good. When you see this you should examine the URL and try to figure out what is wrong. Here we would see that 1900 is a bad value for `basisOfRecord` and correct the query accordingly.

In [6]:
# example of a bad request 
response = requests.get("http://api.gbif.org/v1/occurrence/search?q=Bombus&basisOfRecord=1900")
response.raise_for_status()

HTTPError: 400 Client Error: Bad Request for url: http://api.gbif.org/v1/occurrence/search?q=Bombus&basisOfRecord=1900

In [7]:
# corrected request 
response = requests.get("http://api.gbif.org/v1/occurrence/search?q=Bombus&year=1900")
response.raise_for_status()
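
If you would rather have your program report the problem and carry on than stop with a traceback, you can catch the exception. The sketch below is one way to do that. `check_response` is a hypothetical helper name, and the two `Response` objects are built by hand with a chosen status code so that the example runs without contacting GBIF.

```python
import requests


def check_response(response):
    """Return True if the request succeeded, otherwise report why it failed."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as err:
        # the exception message names the status code and the URL
        print(f"request failed: {err}")
        return False
    return True


# hand-built responses standing in for real requests.get() results
good = requests.Response()
good.status_code = 200

bad = requests.Response()
bad.status_code = 400
bad.url = "http://api.gbif.org/v1/occurrence/search?q=Bombus&basisOfRecord=1900"

print(check_response(good))
print(check_response(bad))
```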

### Additions to the `records` library 

We will proceed on Monday to use our library to run some more complex analyses. But first, I'll have you update your code to include a bunch of new features so that you can see the code, try to interpret it, and even test it out before class. I also explain the new code in `records` in great detail below. If we call our original assignment version of the library 0.1, then let's call this new update version 0.2. You can find version 0.2 of the `records` library [here](https://github.com/programming-for-bio/records). This is the master branch of the same repository where the assignment branch is saved.

### Completing your code-review
To complete your code review you should update your repository to match the master branch of my `records` repository. You could do this by adding my master branch as a new remote for your git repository and then doing git pull, or, if that feels too complicated, you can also just copy and paste the code into a text editor. The latter might be better as it gives you a chance to examine the code line by line. Update your repository by the time of class on Monday and read through the description below so you have a good understanding of the code. 

### Additions to `records.py` explained

I'll explain each of the new features below that I incorporated into `records.py`. Feel free to try out the code for yourself to see how it works. We'll also work with it in class. 

**1.** I added a `kwargs` argument to the `__init__` function. We haven't covered this yet; it is a special way of passing arguments to a function in Python. Using kwargs is particularly useful for class objects that store large amounts of information in dictionaries, like our `Records` class object. `**kwargs` allows you to pass an arbitrary number of additional keyword arguments to a function, which the function receives as a dictionary. If your object is already storing information in a dictionary, like the params attribute of the `Records` object, then it can be easily updated using kwargs. This is especially convenient for our API example, since there are many possible arguments to the API and we don't want to create an entry for every one of them in our `__init__` function. Instead, the user can simply pass in a dictionary of additional arguments as `kwargs`, and we update the params dictionary with the kwargs values. The example below shows how this works using `.update`, a built-in dictionary method that copies the keys and values of another dictionary into this one, overwriting the keys they share while keeping any keys of its own that are not present in the other. In other words, it's good for updating a dictionary that is full of default arguments, like the one in our `Records` object. Examples are shown below.

In [8]:
# an example class that takes one required argument, 'q', and an optional argument **kwargs
class Obj:
    def __init__(self, q, **kwargs):
        
        # the default set of params
        self.params = {
            "q": q,
            "a": 1, 
            "b": 2,
            "c": 3,
        }
        
        # update params using kwargs
        self.params.update(kwargs)
        

In [9]:
# no kwargs argument so the params dictionary stays at its defaults except for 'q' 
instance = Obj("Bombus")
instance.params

{'a': 1, 'b': 2, 'c': 3, 'q': 'Bombus'}

In [10]:
# kwargs allows updating the params dict, and even adding new key:val pairs to it
instance = Obj("Bombus", **{'a': 100, 'd': 200, 'e': 300})
instance.params

{'a': 100, 'b': 2, 'c': 3, 'd': 200, 'e': 300, 'q': 'Bombus'}

In [11]:
# not the same as above. Without the **, the dict is passed as a single
# keyword argument named 'kwargs' and is stored under that key.
instance = Obj("Bombus", kwargs={'a': 100, 'd': 200, 'e': 300})
instance.params

{'a': 1,
 'b': 2,
 'c': 3,
 'kwargs': {'a': 100, 'd': 200, 'e': 300},
 'q': 'Bombus'}
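
The dictionary-with-`**` form above is equivalent to writing the extra arguments as ordinary keyword arguments, which is how you will most often see `kwargs` used. The sketch below redefines the same toy `Obj` class so that it runs on its own.

```python
class Obj:
    def __init__(self, q, **kwargs):
        # the default set of params
        self.params = {"q": q, "a": 1, "b": 2, "c": 3}
        # inside the function, kwargs is an ordinary dict of the extra arguments
        self.params.update(kwargs)


# keyword arguments are collected into kwargs...
by_keyword = Obj("Bombus", a=100, d=200)

# ...which gives the same result as unpacking a dictionary with **
by_dict = Obj("Bombus", **{"a": 100, "d": 200})

print(by_keyword.params == by_dict.params)
print(by_keyword.params)
```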

#### Example using `kwargs` with the updated Records object
Since our v.0.2 Records object has a `kwargs` argument we can now add whatever additional arguments we want to the API query. Our default params dictionary sets the "country" to "US". We can update this using `kwargs` to search other countries. 

In [24]:
# find all records of Bombus from 1935 up to 2010 from China
rec = records.Records("Bombus", (1935, 2010), **{"country": "CN"})
rec.df[["species", "country", "year"]].head(10)

| | species | country | year |
|---|---|---|---|
| 0 | Bombus imitator | China | 1938 |
| 1 | Bombus breviceps | China | 1937 |
| 2 | Bombus mearnsi | China | 1938 |
| 3 | Bombus mearnsi | China | 1938 |
| 4 | Bombus mearnsi | China | 1938 |
| 5 | Bombus ladakhensis | China | 1996 |
| 6 | Bombus imitator | China | 1938 |
| 7 | Bombus eximius | China | 1938 |
| 8 | Bombus imitator | China | 1938 |
| 9 | Bombus flavescens | China | 1938 |


In [25]:
# find all records of Bombus from 1935-1936 without country designated
rec = records.Records("Bombus", (1935, 1936), **{"country": None})
rec.df[["species", "country", "year"]].head(10)

| | species | country | year |
|---|---|---|---|
| 0 | Bombus bifarius | United States | 1936 |
| 1 | Bombus centralis | United States | 1935 |
| 2 | Bombus terricola | United States | 1935 |
| 3 | Bombus rufocinctus | United States | 1935 |
| 4 | Bombus jonellus | Norway | 1935 |
| 5 | Bombus soroeensis | Norway | 1936 |
| 6 | Bombus lucorum | Norway | 1935 |
| 7 | Bombus hortorum | Norway | 1936 |
| 8 | Bombus pascuorum | Norway | 1936 |
| 9 | Bombus balteatus | Norway | 1936 |


**2.** I also added a special type of function called a `property` to our class. This function, called `sdf`, has a decorator above it (`@property`), which means that it will act a bit differently from a normal function. Essentially, it is accessed like an attribute, without parentheses, but its value is not fixed: it is computed each time it is accessed, so in that sense it works like a function. Here we simply use it to select a few columns from the `.df` dataframe so that we have a quick and easy way to view a small number of columns instead of the more than 100 columns in the dataframe. See the docstring in the code for more details, or search online for property decorators in Python. I'm showing this feature so that you know that decorators exist, but they are an optional tool, used here mostly for stylistic reasons.


In [29]:
# show the first 10 records using the 'sdf' view, which shows only a few columns
rec.sdf.head(10)

| | species | year | country | stateProvince |
|---|---|---|---|---|
| 0 | Bombus bifarius | 1936 | United States | Idaho |
| 1 | Bombus centralis | 1935 | United States | Oregon |
| 2 | Bombus terricola | 1935 | United States | Oregon |
| 3 | Bombus rufocinctus | 1935 | United States | Maine |
| 4 | Bombus jonellus | 1935 | Norway | Rogaland |
| 5 | Bombus soroeensis | 1936 | Norway | Rogaland |
| 6 | Bombus lucorum | 1935 | Norway | Rogaland |
| 7 | Bombus hortorum | 1936 | Norway | Rogaland |
| 8 | Bombus pascuorum | 1936 | Norway | Hordaland |
| 9 | Bombus balteatus | 1936 | Norway | Rogaland |
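
To make the idea of a property concrete, here is a minimal sketch that is unrelated to GBIF. The `Table` class and its `short` property are hypothetical names, and a plain list of dictionaries stands in for a dataframe. The point is that `short` is accessed without parentheses, like an attribute, yet it is recomputed from the current data on every access.

```python
class Table:
    def __init__(self, rows):
        # each row is a dict of column name -> value
        self.rows = rows

    @property
    def short(self):
        """A reduced view of the rows that keeps only two columns."""
        keep = ("species", "year")
        return [{key: row[key] for key in keep} for row in self.rows]


table = Table([
    {"species": "Bombus bifarius", "year": 1936, "country": "United States"},
])

# no parentheses: a property is accessed like an attribute
print(table.short)

# the value is computed on access, so it reflects later changes to the data
table.rows.append(
    {"species": "Bombus jonellus", "year": 1935, "country": "Norway"}
)
print(len(table.short))
```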


**3.** The biggest change to the code is that I created a new class object called `Epochs`, which is used to create `Records` instances across a range of year intervals. I thought this would be a nice design since the `Records` objects are focused on performing individual web searches whereas the `Epochs` object is focused on doing analyses across those objects. We could have had just one class object do both things; splitting the code up in this way is simply a design decision.

The `Epochs` object takes as arguments the query, the starting year, the ending year, and the interval size, as well as kwargs. Below is an example. I then explain the code for the Epochs class in more detail. 

In [30]:
# create an Epochs instance to sample 7 3-year intervals of bumblebees from Canada
ep = records.Epochs("Bombus", 1900, 1921, 3, **{"country": "CA"})

In [31]:
# look at view of records (sdf of Epochs shows "epoch" column as well)
ep.sdf.head(10)

| | species | year | epoch | country | stateProvince |
|---|---|---|---|---|---|
| 0 | Bombus fervidus | 1900 | 1900 | Canada | Ontario |
| 1 | Bombus borealis | 1900 | 1900 | Canada | Manitoba |
| 2 | Bombus vagans | 1900 | 1900 | Canada | New Brunswick |
| 3 | Bombus sandersoni | 1901 | 1900 | Canada | Ontario |
| 4 | Bombus occidentalis | 1901 | 1900 | Canada | British Columbia |
| 5 | Bombus bifarius | 1901 | 1900 | Canada | British Columbia |
| 6 | Bombus fervidus | 1901 | 1900 | Canada | British Columbia |
| 7 | Bombus terricola | 1901 | 1900 | Canada | Ontario |
| 8 | Bombus bifarius | 1901 | 1900 | Canada | British Columbia |
| 9 | Bombus bifarius | 1901 | 1900 | Canada | British Columbia |


#### The Epochs class explained. 
Each section below is describing a part of the Epochs class in order:

(1) In the code, I use `range` to create the starting years of the intervals that we will pass to the `Records` class objects.

```python
# make a range of epochs
epochs = range(start, end, epochsize)

```

(2) Here I use a dictionary comprehension to create a dictionary of `Records` objects, one for each interval, and pass kwargs in to them as well.

```python
# get Records objects across the epoch range, keyed by starting year
rdicts = {
    i: Records(q, (i, i + epochsize), **kwargs) for i in epochs
}
```
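
Steps (1) and (2) can be tried without making any web requests. In this sketch a tuple of years stands in for each `Records` object, and the start, end, and interval values are the ones from the `Epochs` example earlier. Note that `range` stops before its end value.

```python
start, end, epochsize = 1900, 1921, 3

# range stops before `end`, so the last epoch starts at 1918
epochs = range(start, end, epochsize)
print(list(epochs))

# dictionary comprehension: keys are epoch starts, values are year intervals
# (a tuple stands in here for the Records object built in the real code)
intervals = {i: (i, i + epochsize) for i in epochs}
print(intervals[1900])
print(len(intervals))
```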

(3) I add a new column called 'epoch' to the `.df` of each `Records` object, and store in it the epoch that we used in our search query. This uses *broadcasting*, which allows us to enter a single value and have it fill the entire column of data. Broadcasting was covered in our reading about numpy. Feel free to review the chapters of the Data Science Handbook on numpy and pandas; it's super useful stuff.

```python
# store epoch value in each dataframe, this is just the start interval
for epoch in rdicts:
    rdicts[epoch].df["epoch"] = epoch
```

(4) Check that some results were actually returned; if not, we skip the next part. This statement asks whether the dictionary `rdicts` is empty or not by simply asking `if rdicts`, since an empty dictionary evaluates to False.

```python
# if rdicts, then build dataframe, otherwise skip it. 
if rdicts:
    ...
```

(5) Again, we use list-comprehension to access multiple items from the rdicts dictionary. Here I access the `.df` attribute from each Records instance, which provides a list of dataframes to the `pd.concat()` function which will then concatenate them to create one large dataframe. 

```python
# concatenate all dataframes into one
self.df = pd.concat([i.df for i in rdicts.values()])
```
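
Steps (3) and (5) can be sketched with two tiny hand-made dataframes in place of real query results. The species names and years below are made up for illustration.

```python
import pandas as pd

# two small dataframes standing in for the .df of two Records objects
frames = {
    1900: pd.DataFrame({"species": ["A", "B"], "year": [1901, 1900]}),
    1903: pd.DataFrame({"species": ["C"], "year": [1904]}),
}

# broadcasting: a single value fills the whole new column
for epoch in frames:
    frames[epoch]["epoch"] = epoch

# concatenate the list of dataframes into one
df = pd.concat([frame for frame in frames.values()])
print(df)

# each dataframe keeps its own row labels, so the index repeats
print(df.index.tolist())
```

The repeated row labels are one reason to reset the index after concatenating.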

(6) Last, I sort the dataframe. The syntax here is worth noting. This is a very clean and readable way to apply multiple transformations to a pandas dataframe. Remember that we can chain together multiple function calls by using the dot attributes of the objects. If we wrap the whole chain in parentheses then we can spread it over multiple lines. Here I call `sort_values` with the argument `by='year'`, and then I call `reset_index` so that the index (row numbers) is reset to run from 0 to N-1. If you don't reset the index after sorting, each row keeps its old index label, although the rows will be rearranged. This is because you might be interested in what the index order was before sorting. Here we're not interested, so we reset the index and use `drop=True`, which means discard the old index. Try this out on a dataframe of your own: see how using a different value for the `by="column name"` argument changes the sorting, and how `drop=True` versus `drop=False` changes the results. Remember, these functions return a new dataframe rather than modifying the existing one. To save the result we need to store it as a variable. I assign it to `.df` so that it overwrites the old unsorted dataframe.

```python
# sort values by year, and reset index without keeping old index
self.df = (
    self.df
    .sort_values(by="year")
    .reset_index(drop=True)
    )

# or, this is same as above, a bit less easily readable but more compact.
self.df = self.df.sort_values(by="year").reset_index(drop=True)
```
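
Here is a small sketch of that experiment on a made-up dataframe. It compares `drop=True` with `drop=False` and confirms that the original dataframe is left unchanged unless the result is assigned.

```python
import pandas as pd

df = pd.DataFrame({"species": ["A", "B", "C"], "year": [1903, 1901, 1902]})

# sorting rearranges the rows but each row keeps its old index label
sorted_df = df.sort_values(by="year")
print(sorted_df.index.tolist())

# drop=True discards the old labels and renumbers from 0
dropped = sorted_df.reset_index(drop=True)
print(dropped.index.tolist())

# drop=False keeps the old labels as a new column named "index"
kept = sorted_df.reset_index(drop=False)
print(kept.columns.tolist())

# the original dataframe is unchanged because nothing was assigned back to it
print(df.year.tolist())
```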


### Now that we have our data, we can write analysis functions for the class
If the next part of the code is overly confusing then you may want to review the Data Science Handbook chapter on using `pandas` dataframes, and also refer to the pandas documentation. Pandas is an incredibly powerful and elegant library, but it takes time to learn. Here we'll be using some of the more useful functions in the library for manipulating data in dataframes. These tricks are generally useful for all sorts of computations. The core workflow we will be using is a `.groupby()` function call followed by an `.apply()` function call. The first is used to find all records in the dataframe that share some common element, for example `groupby("species")`, and the second is used to apply an arbitrary function to each group. This could be a built-in function or a numpy function like `sum`, `mean`, or `unique`, or it can be a function that you write yourself. Below are several examples.

In [32]:
# calling groupby by itself returns a groupby object (we still need to do something)
ep.sdf.groupby("stateProvince")

<pandas.core.groupby.DataFrameGroupBy object at 0x000002C9985D55F8>

In [33]:
# Each group is a dataframe in itself, and you can select columns of it.
# You can see that .year now returns groups of pandas Series objects
ep.sdf.groupby("stateProvince").year

<pandas.core.groupby.SeriesGroupBy object at 0x000002C996AD0D30>

In [34]:
# apply calls a function on every group. Let's calculate the median collection
# year of the records from each province in our data set. Finally I call
# astype to convert the result to ints.
ep.sdf.groupby("stateProvince").year.apply(np.median).astype(int)

stateProvince
Alberta                      1914
British Columbia             1915
Manitoba                     1913
New Brunswick                1918
Newfoundland                 1920
Newfoundland and Labrador    1907
Northwest Territories        1903
Nova Scotia                  1908
Nunavut                      1915
Ontario                      1913
Prince Edward Island         1909
Quebec                       1917
Sakatchewan                  1909
Saskatchewan                 1914
Vancouver Island             1913
Yukon                        1919
Yukon Territory              1916
Name: year, dtype: int32

#### Typo fixes
Note that there are typos and inconsistent names in some of the records above. For example, at least one of the records has its state listed as "Sakatchewan". We can fix this easily by selecting the cells that contain this term and replacing them with the correct one. The equality statements below find the rows with matching values, and the column name is given as the column index. To select cells by row index and column name we use the `.loc[ind, col]` syntax.

In [35]:
# fix some typos
ep.df.loc[ep.df.stateProvince=="Sakatchewan", "stateProvince"] = "Saskatchewan"
ep.df.loc[ep.df.stateProvince=="Yukon", "stateProvince"] = "Yukon Territory"
ep.df.loc[ep.df.stateProvince=="Newfoundland", "stateProvince"] = "Newfoundland and Labrador"
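
The same pattern on a made-up dataframe shows the two pieces separately: the boolean mask that picks the rows, and the `.loc` assignment that changes only the chosen column in those rows.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["A", "B", "C"],
    "stateProvince": ["Sakatchewan", "Ontario", "Saskatchewan"],
})

# the equality test returns one True/False value per row
mask = df.stateProvince == "Sakatchewan"
print(mask.tolist())

# .loc[rows, column] assigns only to the selected cells
df.loc[mask, "stateProvince"] = "Saskatchewan"
print(df.stateProvince.tolist())
```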

In [36]:
# filter out NA values, group by province, select years, and apply median function
(ep.df[ep.df.stateProvince.notna() & ep.df.year.notna()]
 .groupby("stateProvince")
 .year
 .apply(np.median)
 .astype(int)
)

stateProvince
Alberta                      1914
British Columbia             1915
Manitoba                     1913
New Brunswick                1918
Newfoundland and Labrador    1907
Northwest Territories        1903
Nova Scotia                  1908
Nunavut                      1915
Ontario                      1913
Prince Edward Island         1909
Quebec                       1917
Saskatchewan                 1914
Vancouver Island             1913
Yukon Territory              1919
Name: year, dtype: int32

In [37]:
# define a custom function
def count_unique_values(series):
    "counts the number of unique species"
    return np.unique(series).size

# call apply on my function
(ep.df[ep.df.species.notna()]
 .groupby("epoch")
 .species
 .apply(count_unique_values)
)

epoch
1900    21
1903    28
1906    29
1909    31
1912    38
1915    33
1918    35
Name: species, dtype: int64

### Calculating Simpson's diversity
[Simpson's diversity index](https://en.wikipedia.org/wiki/Diversity_index#Simpson_index) is a common measure of species diversity in a geographic region. The Simpson index itself is the probability that two individuals sampled at random from a community are the same species; diversity is usually reported as a transformation of it, such as its complement, so that larger values mean greater diversity, which appears to be the case for the values below. Look at the code to see how it is calculated. The equation is quite simple. Below is a function written for the `Epochs` class that can calculate Simpson's diversity for whichever groupby column name the user enters. We'll use this quite a bit later.

In [38]:
# calculate diversity in space
ep.simpsons_diversity(by="stateProvince")

stateProvince
Alberta                      0.919960
British Columbia             0.885258
Manitoba                     0.863905
New Brunswick                0.799114
Newfoundland and Labrador    0.730159
Northwest Territories        0.834467
Nova Scotia                  0.775656
Nunavut                      0.602716
Ontario                      0.889101
Prince Edward Island         0.444444
Quebec                       0.842583
Saskatchewan                 0.821285
Vancouver Island                  NaN
Yukon Territory              0.859568
Name: species, dtype: float64

In [39]:
# calculate diversity in time
ep.simpsons_diversity(by="epoch")

epoch
1900    0.917793
1903    0.917234
1906    0.917822
1909    0.924506
1912    0.943998
1915    0.943072
1918    0.923183
Name: species, dtype: float64
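
As a rough sketch of the calculation, the function below computes the complement form, one minus the sum of squared species proportions, which is the probability that two individuals drawn at random (with replacement) belong to different species. This is an assumption about which variant is intended: the `simpsons_diversity` method in the repository may use a different formula (for example, a finite-sample correction), so check the code there. The function name and the example species are made up.

```python
from collections import Counter


def simpson_diversity(species):
    """Return 1 - sum(p_i ** 2), where p_i is the proportion of species i."""
    counts = Counter(species)
    total = sum(counts.values())
    return 1 - sum((n / total) ** 2 for n in counts.values())


# one species only: no diversity
print(simpson_diversity(["A", "A", "A"]))

# two species in a 1:2 ratio: 1 - (1/9 + 4/9) = 4/9
print(round(simpson_diversity(["A", "B", "B"]), 4))
```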

### Saving and loading data
The Records and Epochs class objects are really nice for working with data from GBIF. All of the analyses we do with them should be reproducible, since we start by downloading the data directly and other users can run the same code. Still, there may be instances where we download a large set of records and then want to save them, so that if we return later to work on them again we do not need to repeat the long and slow download step. The most convenient format for saving these data is CSV (comma-separated values). In the code you can see that I wrote a function called `load_epochs_from_csv`, which simply uses pandas to load a CSV and set it as the `.df` attribute of an empty Epochs class object. Below is an example usage. To make the `load_epochs_from_csv` function accessible from the `records` library I had to add it to the `__init__.py` file. The same goes for the `Epochs` class object.

In [None]:
# save data to disk
ep.df.to_csv("../data/Bombus-data.csv")

# load data from disk into an Epochs class object
ep = records.load_epochs_from_csv("../data/Bombus-data.csv")

# show that it worked
ep.df.shape
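
For reference, a round trip through CSV with plain pandas looks like the sketch below. It writes to a temporary directory instead of `../data`, uses a made-up dataframe, and calls `pd.read_csv` directly; it is not the code of `load_epochs_from_csv`. One detail worth knowing is that `to_csv` also writes the index by default, so the sketch reads it back with `index_col=0`.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"species": ["A", "B"], "year": [1900, 1901]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.csv")

    # to_csv writes the index as an unnamed first column by default
    df.to_csv(path)

    # index_col=0 tells pandas to use that first column as the index again
    loaded = pd.read_csv(path, index_col=0)

print(loaded.shape)
print(loaded.columns.tolist())
```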