# Migration Data Download

Get occurrence data from the Global Biodiversity Information Facility
(GBIF)

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Import packages</div></div><div class="callout-body-container callout-body"><p>In the imports cell, we’ve included some packages that you will need.
Add imports for packages that will help you:</p>
<ol type="1">
<li>Work with reproducible file paths</li>
<li>Work with tabular data</li>
</ol></div></div>

In [1]:
%%bash
pip install pygbif



In [2]:
import os
from pathlib import Path

import earthpy
import geopandas as gpd
import pandas as pd

In [3]:
import time
import zipfile
from getpass import getpass
from glob import glob
import shutil

import pygbif.occurrences as occ
import pygbif.species as species
import requests

For this challenge, you will need to download some data to the computer
you’re working on. We suggest using the `earthpy` library we develop to
manage your downloads, since it encapsulates many best practices for:

1.  Where to store your data
2.  Dealing with archived data like .zip files
3.  Avoiding version control problems
4.  Making sure your code works cross-platform
5.  Avoiding duplicate downloads

If you’re working on one of our assignments through GitHub Classroom, it
also lets us build in some handy defaults so that you can see your data
files while you work.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Create a project folder</div></div><div class="callout-body-container callout-body"><p>The code below will help you get started with making a project
directory</p>
<ol type="1">
<li>Replace <code>'your-project-directory-name-here'</code> with a
<strong>descriptive</strong> name</li>
<li>Run the cell</li>
<li>The code should have printed out the path to your data files. Check
that your data directory exists and has data in it using the terminal or
your Finder/File Explorer.</li>
</ol></div></div>

> **File structure**
>
> These days, a lot of people find their files by searching for them or
> selecting from a `Bookmarks` or `Recents` list. Even if you don’t use
> it, your computer also keeps files in a **tree** structure of folders.
> Put another way, you can organize and find files by travelling along a
> unique **path**, e.g. `My Drive` \> `Documents` \>
> `My awesome project` \> `A project file` where each subsequent folder
> is **inside** the previous one. This is convenient because all the
> files for a project can be in the same place, and both people and
> computers can rapidly locate files they want, provided they remember
> the path.
>
> You may notice that when Python prints out a file path like this, the
> folder names are **separated** by a `/` or `\` (depending on your
> operating system). This character is called the **file separator**,
> and it tells you that the next piece of the path is **inside** the
> previous one.
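
As a small illustration of the path idea above, the sketch below builds a path with `pathlib` and takes it apart again. The folder and file names are made up for the example, and nothing is created on disk.

```python
from pathlib import PurePosixPath

# Hypothetical folder names, joined with the / operator
example_path = (
    PurePosixPath('My Drive') / 'Documents' / 'My awesome project' / 'data.csv')

# Each part of the path is inside the previous one
path_parts = example_path.parts

# The parent is the folder that contains the file
parent_folder = example_path.parent

print(example_path)
print(path_parts)
print(parent_folder.name)
```

`PurePosixPath` always uses `/` as the file separator, which keeps the example the same on every computer. In your own code, `Path` picks the separator that matches your operating system.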

In [4]:
# Create data directory
project = earthpy.Project(
    dirname='bigcat_migration_data')

# Convert the project directory to a Path object
project_dir = Path(project.project_dir)

# Create the data folder if it does not already exist
project_dir.mkdir(parents=True, exist_ok=True)

project_dir

PosixPath('/workspaces/data/bigcat_migration_data')

### STEP 1: Register and log in to GBIF

You will need a [GBIF account](https://www.gbif.org/) to complete this
challenge. You can use your GitHub account to authenticate with GBIF.
Then, run the following code to enter your credentials for the rest of
your session.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-error"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div></div><div class="callout-body-container callout-body"><p>This code is <strong>interactive</strong>, meaning that it will
<strong>ask you for a response</strong>! The prompt can sometimes be
hard to see if you are using VSCode – it appears at the
<strong>top</strong> of your editor window.</p></div></div>

> **Tip**
>
> If you need to save credentials across multiple sessions, you can
> consider loading them in from a file like a `.env`…but make sure to
> add it to .gitignore so you don’t commit your credentials to your
> repository!
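
One possible way to follow the tip above is sketched below, using only the standard library. The `load_env_file` helper and the `EXAMPLE_GBIF_USER` variable are hypothetical names for this example, and the demo writes a throwaway file with a placeholder value rather than real credentials.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(env_path):
    """Read KEY=VALUE lines from a file into os.environ (hypothetical helper)."""
    loaded = {}
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, value = line.split('=', 1)
        # Drop surrounding whitespace and optional quotes
        loaded[key.strip()] = value.strip().strip('"\'')
    os.environ.update(loaded)
    return loaded


# Demonstrate with a throwaway file and a placeholder value
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_env = Path(tmp_dir) / '.env'
    demo_env.write_text('# demo only\nEXAMPLE_GBIF_USER=my-username\n')
    demo_loaded = load_env_file(demo_env)

print(demo_loaded)
```

Packages such as `python-dotenv` offer a more complete version of this idea. Either way, the `.env` file belongs in `.gitignore`.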

> **Warning**
>
> Your email address **must** match the email you used to sign up for
> GBIF!

> **Tip**
>
> If you accidentally enter your credentials wrong, you can set
> `reset=True` instead of `reset=False`.

In [5]:
####--------------------------####
#### DO NOT MODIFY THIS CODE! ####
####--------------------------####
# This code ASKS for your credentials 
# and saves them for the rest of the session.
# NEVER put your credentials into your code!!!!

# GBIF needs a username, password, and email 
# All 3 need to match the account
reset = False

# Request and store username
if (not ('GBIF_USER'  in os.environ)) or reset:
    os.environ['GBIF_USER'] = input('GBIF username:')

# Securely request and store password
if (not ('GBIF_PWD'  in os.environ)) or reset:
    os.environ['GBIF_PWD'] = getpass('GBIF password:')
    
# Request and store account email address
if (not ('GBIF_EMAIL'  in os.environ)) or reset:
    os.environ['GBIF_EMAIL'] = input('GBIF email:')

### STEP 2: Get the taxon key from GBIF

One of the tricky parts about getting occurrence data from GBIF is that
species often have multiple names in different contexts. Luckily, GBIF
also provides a Name Backbone service that will translate scientific and
colloquial names into unique identifiers. GBIF calls these identifiers
**taxon keys**.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Put the scientific name of your species (here,
<code>'Puma concolor'</code>) into the correct location in the code below.</li>
<li>Examine the object you get back from the species query. What part of
it do you think might be the taxon key?</li>
<li>Extract and save the taxon key</li>
</ol></div></div>

In [6]:
# Look up the selected species in the GBIF backbone
backbone = species.name_backbone(name='Puma concolor')

# Save the unique identifier (taxon key)
species_key = backbone['usageKey']

species_key


2435099
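
The lookup can return a result even when GBIF does not recognize a name, so it can help to check the match before using the key. The sketch below is one way to do that. The `extract_taxon_key` helper is hypothetical, and `example_match` is a trimmed, illustrative dictionary that assumes the same field names as the `backbone` object above (`usageKey`, plus a `matchType` field), not a real API response.

```python
def extract_taxon_key(match):
    """Return the usageKey from a backbone match, or raise if nothing matched."""
    if match.get('matchType') == 'NONE' or 'usageKey' not in match:
        raise ValueError('No backbone match found; check the spelling of the name')
    return match['usageKey']


# Trimmed, illustrative response; a real one has many more fields
example_match = {
    'usageKey': 2435099,
    'canonicalName': 'Puma concolor',
    'rank': 'SPECIES',
    'matchType': 'EXACT',
}

example_key = extract_taxon_key(example_match)
print(example_key)
```

The exact shape of the response can differ between `pygbif` versions, so it is worth printing `backbone` once and confirming the field names before relying on them.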

### STEP 3: Download data from GBIF

Downloading GBIF data is a multi-step process. However, we’ve provided
you with a chunk of code that handles the API communications and caches
the download. You’ll still need to customize your search.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Submit a request to GBIF</div></div><div class="callout-body-container callout-body"><ol type="1">
<li><p>Replace <code>csv_file_pattern</code> with a string that will
match <strong>any</strong> <code>.csv</code> file when used in the
<code>.rglob()</code> method. HINT: the character <code>*</code>
matches any number of any characters except the file separator
(e.g. <code>/</code> on UNIX systems)</p></li>
<li><p>Add parameters to the GBIF download function,
<code>occ.download()</code> to limit your query to:</p>
<ul>
<li>observations of your species (using its taxon key)</li>
<li>from a single year</li>
<li>with spatial coordinates.</li>
</ul></li>
<li><p>Then, run the download. <strong>This can take a few
minutes</strong>. You can check your downloads by logging on to the <a
href="https://www.gbif.org/user/download">GBIF website</a>.</p></li>
</ol></div></div>

In [7]:
# Only download once
### set file name for download
gbif_pattern = os.path.join(project_dir, '*.csv')

### double check that there isn't already a file that matches this pattern.
### if it already exists, skip the whole conditional
### and go straight to the line: gbif_path = glob(gbif_pattern)[0]
if not glob(gbif_pattern):

    ### only submit a download request to GBIF once
    ### if GBIF_DOWNLOAD_KEY is not defined in our environment, make the download request
    if 'GBIF_DOWNLOAD_KEY' not in os.environ:

        ### submit a query to GBIF
        gbif_query = occ.download([

            ### add your species key here
            f"taxonKey = {species_key}",

            ### filter out results that are missing coordinates
            "hasCoordinate = True",

            ### choose a year to include
            "year = 2023",
        ])
        os.environ['GBIF_DOWNLOAD_KEY'] = gbif_query[0]

    # Wait for the download to build
    download_key = os.environ['GBIF_DOWNLOAD_KEY']

    ### use the occurrence command module in pygbif to get the metadata
    wait = occ.download_meta(download_key)['status']

    ### check if the status of the download = "SUCCEEDED"
    ### wait and loop through until it finishes
    while wait != 'SUCCEEDED':
        ### stop instead of looping forever if the download cannot finish
        if wait in ('FAILED', 'KILLED', 'CANCELLED'):
            raise RuntimeError(f'GBIF download {download_key} ended: {wait}')

        ### don't want to re-query the API in the loop too frequently
        time.sleep(5)
        wait = occ.download_meta(download_key)['status']

    # Download GBIF data when it's ready
    download_info = occ.download_get(
        os.environ['GBIF_DOWNLOAD_KEY'], 
        path=project_dir)

    # Unzip GBIF data using the zipfile package
    with zipfile.ZipFile(download_info['path']) as download_zip:
        download_zip.extractall(path=project_dir)

# Find the extracted .csv file path (take the first result)
gbif_path = glob(gbif_pattern)[0]

INFO:Your download key is 0005625-251025141854904
INFO:Download file size: 99832 bytes
INFO:On disk at /workspaces/data/bigcat_migration_data/0005625-251025141854904.zip
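
The waiting step in the cell above is a general pattern: ask for a status, stop on success or failure, otherwise pause and ask again. The sketch below isolates that pattern so it can be tried without contacting GBIF. The `wait_for_status` helper, its parameters, and the canned list of statuses are all assumptions made for this example.

```python
import time


def wait_for_status(get_status, poll_seconds=5, timeout_seconds=1800):
    """Poll get_status() until it returns 'SUCCEEDED' (hypothetical helper)."""
    failed = {'FAILED', 'KILLED', 'CANCELLED'}
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == 'SUCCEEDED':
            return status
        if status in failed:
            raise RuntimeError(f'Download ended with status {status}')
        if time.monotonic() > deadline:
            raise TimeoutError('Download did not finish in time')
        # Pause so the API is not queried too often
        time.sleep(poll_seconds)


# Stand-in for the real status lookup: a canned sequence of statuses
fake_statuses = iter(['PREPARING', 'RUNNING', 'SUCCEEDED'])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

In the notebook, the status lookup would be something like `lambda: occ.download_meta(download_key)['status']`. Passing the lookup in as a function keeps the waiting logic separate from the API call, which makes it easy to test.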


You might notice that the GBIF data filename isn’t very
**descriptive**…at this point, you may want to clean up your data
directory so that you know what the file is later on!

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace ‘your-gbif-filename’ with a <strong>descriptive</strong>
name.</li>
<li>Run the cell</li>
<li>Check your data folder. Is it organized the way you want?</li>
</ol></div></div>

In [8]:
# Give the download a descriptive name
gbif_path = project.project_dir / 'taxon_gbif'

# Find and name the path of the recent download
original_gbif_path = Path('/workspaces/data/bigcat_migration_data'
'/0005625-251025141854904.csv')

# Move file to descriptive path
shutil.move(original_gbif_path, gbif_path)

PosixPath('/workspaces/data/bigcat_migration_data/taxon_gbif')
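
If you would rather build the descriptive name from what you searched for, one option is sketched below. The `descriptive_filename` helper and its naming scheme are hypothetical, so adapt them to whatever convention suits your project.

```python
import re


def descriptive_filename(scientific_name, year, suffix='.csv'):
    """Build a lowercase, underscore-separated file name (hypothetical helper)."""
    # Replace every run of non-alphanumeric characters with one underscore
    slug = re.sub(r'[^a-z0-9]+', '_', scientific_name.lower()).strip('_')
    return f'{slug}_gbif_{year}{suffix}'


example_name = descriptive_filename('Puma concolor', 2023)
print(example_name)
```

Keeping the `.csv` suffix means the `'*.csv'` pattern in the download cell can still find the renamed file, so re-running the notebook skips the download.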

### STEP 4: Load the GBIF data into Python

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Load GBIF data</div></div><div class="callout-body-container callout-body"><p>Just like you did when wrangling your data from the data subset,
you’ll need to load your GBIF data and convert it to a GeoDataFrame.</p></div></div>

In [9]:
# Load the GBIF data as dataframe
gbif_df = pd.read_csv(
    gbif_path,
    delimiter='\t',
    index_col='gbifID',
    usecols=['gbifID','month','decimalLatitude','decimalLongitude']
)

# Convert data to GDF
gbif_gdf = (
    gpd.GeoDataFrame(
        gbif_df, 
        geometry=gpd.points_from_xy(
            gbif_df.decimalLongitude, 
            gbif_df.decimalLatitude), 
        crs="EPSG:4326")
    # Select the desired columns
    [['month','geometry']]
)

# Check results
gbif_gdf.total_bounds


array([-128.604153,  -56.459571,   83.61557 ,   56.116539])
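
The `delimiter='\t'` argument above is needed because the columns in the download are separated by tabs rather than commas, even though the file name ends in `.csv`. The sketch below shows the same idea with the standard library and adds a basic range check on the coordinates. The two sample rows are invented for the example and are not real GBIF records.

```python
import csv
import io

# Invented sample rows in the same tab-separated layout; not real GBIF records
sample_text = (
    'gbifID\tmonth\tdecimalLatitude\tdecimalLongitude\n'
    '1001\t3\t40.01\t-105.27\n'
    '1002\t7\t-33.45\t-70.66\n'
)

# Split on tabs rather than commas
sample_rows = list(csv.DictReader(io.StringIO(sample_text), delimiter='\t'))

# Coordinates arrive as text, so convert before checking their range
in_range = all(
    -90 <= float(row['decimalLatitude']) <= 90
    and -180 <= float(row['decimalLongitude']) <= 180
    for row in sample_rows
)

print(sample_rows[0]['gbifID'], in_range)
```

A similar check on your own data, alongside `total_bounds`, can help you spot records with coordinates that fall far outside the area where you expect the species.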

### STEP -1: Wrap up

Don’t forget to store your variables so you can use them in other
notebooks! Replace `var1` and `var2` with the variables you want to
save, separated by spaces.

In [None]:
%store backbone species_key original_gbif_path 
%store gbif_path gbif_df gbif_gdf
%store project project_dir

Stored 'backbone' (dict)
Stored 'species_key' (int)
Stored 'original_gbif_path' (PosixPath)
Stored 'gbif_path' (PosixPath)
Stored 'gbif_df' (DataFrame)
Stored 'gbif_gdf' (GeoDataFrame)


Finally, be sure to `Restart` and `Run all` to make sure your notebook
works all the way through!