# Migration Data Download

Get occurrence data from the Global Biodiversity Information Facility
(GBIF)

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Import packages</div></div><div class="callout-body-container callout-body"><p>In the imports cell, we’ve included some packages that you will need.
Add imports for packages that will help you:</p>
<ol type="1">
<li>Work with reproducible file paths</li>
<li>Work with tabular data</li>
</ol></div></div>

In [1]:
# Import packages for working with files and directories
import pathlib
import os

# Import tool for timing functions
import time

# Import tool to extract csv from GBIF zipfiles
import zipfile

# Allows for secure input of GBIF password
from getpass import getpass

# Import tool for finding files by pattern
from glob import glob

# Import tool for HTTP request - authenticate GBIF account to download data
import requests

# Import package for working with geospatial vector data
import geopandas as gpd

# Import package for working with tabular data
import pandas as pd

# Import tools for downloading species occurrence data from GBIF
import pygbif.occurrences as occ

# Import tools for looking up species names and info
import pygbif.species as species



For this challenge, you will need to download some data to the computer
you’re working on. We suggest using the `earthpy` library we develop to
manage your downloads, since it encapsulates many best practices for:

1.  Where to store your data
2.  Dealing with archived data like .zip files
3.  Avoiding version control problems
4.  Making sure your code works cross-platform
5.  Avoiding duplicate downloads

If you’re working on one of our assignments through GitHub Classroom, it
also lets us build in some handy defaults so that you can see your data
files while you work.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Create a project folder</div></div><div class="callout-body-container callout-body"><p>The code below will help you get started with making a project
directory</p>
<ol type="1">
<li>Replace <code>'your-project-directory-name-here'</code> with a
<strong>descriptive</strong> name</li>
<li>Run the cell</li>
<li>The code should have printed out the path to your data files. Check
that your data directory exists and has data in it using the terminal or
your Finder/File Explorer.</li>
</ol></div></div>

> **File structure**
>
> These days, a lot of people find their files by searching for them or
> selecting from a `Bookmarks` or `Recents` list. Even if you don’t use
> it, your computer also keeps files in a **tree** structure of folders.
> Put another way, you can organize and find files by travelling along a
> unique **path**, e.g. `My Drive` \> `Documents` \>
> `My awesome project` \> `A project file` where each subsequent folder
> is **inside** the previous one. This is convenient because all the
> files for a project can be in the same place, and both people and
> computers can rapidly locate files they want, provided they remember
> the path.
>
> You may notice that when Python prints out a file path like this, the
> folder names are **separated** by a `/` or `\` (depending on your
> operating system). This character is called the **file separator**,
> and it tells you that the next piece of the path is **inside** the
> previous one.
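
The idea of a path as a chain of nested folders can be sketched with `pathlib`. The folder names below are illustrative only, and the two "pure" path classes are used just to show how the same path would be written with each operating system's separator.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Build a path piece by piece; each piece is inside the previous one.
# The folder names here are illustrative, not required names.
project_dir = Path.home() / "earth-analytics" / "data" / "my-project"

# .parts lists the folders along the path, from the root down
last_three = project_dir.parts[-3:]
print(last_three)

# The same relative path written with each operating system's separator
pieces = ("earth-analytics", "data", "my-project")
posix_style = str(PurePosixPath(*pieces))      # uses /
windows_style = str(PureWindowsPath(*pieces))  # uses \
print(posix_style)
print(windows_style)
```

Building paths from pieces, rather than typing the separator yourself, is what lets the same code run on different operating systems.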

In [2]:
# Create data directory
trumpeter_swan_dir = os.path.join(

    # Home directory
    pathlib.Path.home(),

    # Earth analytics data directory
    'earth-analytics',
    'data',
    
    # Project directory
    'trumpeter-swan-directory')

### Make the directory
os.makedirs(trumpeter_swan_dir, exist_ok = True)

### Define directory name for gbif data
gbif_dir = os.path.join(trumpeter_swan_dir, 'gbif')

### Make the directory
os.makedirs(gbif_dir, exist_ok = True)

### STEP 1: Register and log in to GBIF

You will need a [GBIF account](https://www.gbif.org/) to complete this
challenge. You can use your GitHub account to authenticate with GBIF.
Then, run the following code to enter your credentials for the rest of
your session.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-error"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div></div><div class="callout-body-container callout-body"><p>This code is <strong>interactive</strong>, meaning that it will
<strong>ask you for a response</strong>! The prompt can sometimes be
hard to see if you are using VSCode – it appears at the
<strong>top</strong> of your editor window.</p></div></div>

> **Tip**
>
> If you need to save credentials across multiple sessions, you can
> consider loading them in from a file like a `.env`…but make sure to
> add it to .gitignore so you don’t commit your credentials to your
> repository!
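
One way to follow the tip above is sketched below using only the standard library. The helper `load_env_file` and the file name `.env` are assumptions made for illustration; a dedicated package such as `python-dotenv` handles quoting and edge cases more thoroughly.

```python
import os
from pathlib import Path


def load_env_file(env_path):
    """Read KEY=VALUE lines from a file into os.environ.

    Hypothetical helper for illustration. Blank lines and lines starting
    with '#' are skipped, and variables that are already set are kept.
    Returns the names (never the values) of the variables it added.
    """
    loaded = []
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if key not in os.environ:
            os.environ[key] = value
            loaded.append(key)
    return loaded


# Example usage with a file kept out of version control
# (list `.env` in your .gitignore)
env_file = Path(".env")
if env_file.exists():
    print("Loaded:", load_env_file(env_file))
```

Because the helper never overwrites an existing variable, credentials entered interactively in the current session still take precedence.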

> **Warning**
>
> Your email address **must** match the email you used to sign up for
> GBIF!

> **Tip**
>
> If you accidentally enter your credentials wrong, you can set
> `reset=True` instead of `reset=False`.

In [3]:
####--------------------------####
#### DO NOT MODIFY THIS CODE! ####
####--------------------------####
# This code ASKS for your credentials 
# and saves them for the rest of the session.
# NEVER put your credentials into your code!!!!

# GBIF needs a username, password, and email 
# All 3 need to match the account
reset = False

# Request and store username
if (not ('GBIF_USER'  in os.environ)) or reset:
    os.environ['GBIF_USER'] = input('GBIF username:')

# Securely request and store password
if (not ('GBIF_PWD'  in os.environ)) or reset:
    os.environ['GBIF_PWD'] = getpass('GBIF password:')
    
# Request and store account email address
if (not ('GBIF_EMAIL'  in os.environ)) or reset:
    os.environ['GBIF_EMAIL'] = input('GBIF email:')

In [4]:
'GBIF_PWD' in os.environ

True
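
To confirm all three credentials at once without ever printing their values, a check along these lines may help. The helper name `missing_credentials` is hypothetical; the variable names are the ones set in the credentials cell.

```python
import os

REQUIRED_VARS = ("GBIF_USER", "GBIF_PWD", "GBIF_EMAIL")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report only the names, never the values
missing = missing_credentials()
print("Missing:", missing if missing else "none")
```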

### STEP 2: Get the taxon key from GBIF

One of the tricky parts about getting occurrence data from GBIF is that
species often have multiple names in different contexts. Luckily, GBIF
also provides a Name Backbone service that will translate scientific and
colloquial names into unique identifiers. GBIF calls these identifiers
**taxon keys**.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Put the scientific name of your species (here, <code>Cygnus buccinator</code>) into
the correct location in the code below.</li>
<li>Examine the object you get back from the species query. What part of
it do you think might be the taxon key?</li>
<li>Extract and save the taxon key</li>
</ol></div></div>

In [5]:
### Grab the species info
backbone = species.name_backbone(name= 'Cygnus buccinator')

### Look at the species info
backbone

{'usageKey': 2498345,
 'scientificName': 'Cygnus buccinator Richardson, 1831',
 'canonicalName': 'Cygnus buccinator',
 'rank': 'SPECIES',
 'status': 'ACCEPTED',
 'confidence': 99,
 'matchType': 'EXACT',
 'kingdom': 'Animalia',
 'phylum': 'Chordata',
 'order': 'Anseriformes',
 'family': 'Anatidae',
 'genus': 'Cygnus',
 'species': 'Cygnus buccinator',
 'kingdomKey': 1,
 'phylumKey': 44,
 'classKey': 212,
 'orderKey': 1108,
 'familyKey': 2986,
 'genusKey': 8996942,
 'speciesKey': 2498345,
 'class': 'Aves'}

In [6]:
### Pull out the species key
species_key = backbone['usageKey']

### Look at species key
species_key

2498345
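
The backbone response also carries match metadata (`matchType`, `confidence`) that is worth checking before trusting the key, since a misspelled name can produce a fuzzy match or no match. The sketch below works on an abbreviated copy of the response shown above so it runs offline. The helper `extract_taxon_key`, the confidence threshold, and the assumption that a failed lookup reports `matchType` as `'NONE'` are illustrative.

```python
def extract_taxon_key(backbone, min_confidence=90):
    """Return the usageKey from a name-backbone response.

    Hypothetical helper: raises ValueError when there is no match or
    when the match confidence is below an (arbitrary) threshold.
    """
    # Assumption: an unmatched name comes back without a usageKey
    if backbone.get("matchType") == "NONE" or "usageKey" not in backbone:
        raise ValueError("No match found in the GBIF backbone")
    if backbone.get("confidence", 0) < min_confidence:
        raise ValueError(
            f"Low-confidence match ({backbone.get('confidence')}); "
            "check the spelling of the name"
        )
    return backbone["usageKey"]


# Abbreviated copy of the response shown above
example_backbone = {
    "usageKey": 2498345,
    "canonicalName": "Cygnus buccinator",
    "rank": "SPECIES",
    "status": "ACCEPTED",
    "confidence": 99,
    "matchType": "EXACT",
}

example_key = extract_taxon_key(example_backbone)
print(example_key)  # → 2498345
```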

### STEP 3: Download data from GBIF

Downloading GBIF data is a multi-step process. However, we’ve provided
you with a chunk of code that handles the API communications and caches
the download. You’ll still need to customize your search.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Submit a request to GBIF</div></div><div class="callout-body-container callout-body"><ol type="1">
<li><p>Replace <code>csv_file_pattern</code> with a string that will
match <strong>any</strong> <code>.csv</code> file when used in the
<code>.rglob()</code> method. HINT: the character <code>*</code>
represents any number of any characters except the file separator
(e.g. <code>/</code> on UNIX systems)</p></li>
<li><p>Add parameters to the GBIF download function,
<code>occ.download()</code> to limit your query to:</p>
<ul>
<li>observations of <span data-__quarto_custom="true"
data-__quarto_custom_type="Shortcode"
data-__quarto_custom_context="Inline"
data-__quarto_custom_id="8"></span></li>
<li>from <span data-__quarto_custom="true"
data-__quarto_custom_type="Shortcode"
data-__quarto_custom_context="Inline"
data-__quarto_custom_id="9"></span></li>
<li>with spatial coordinates.</li>
</ul></li>
<li><p>Then, run the download. <strong>This can take a few
minutes</strong>. You can check your downloads by logging on to the <a
href="https://www.gbif.org/user/download">GBIF website</a>.</p></li>
</ol></div></div>
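
The hint about `*` can be checked on a throwaway folder. This sketch creates a few empty files in a temporary directory (the file names are made up) and compares a non-recursive search with a recursive one.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "gbif").mkdir()
    (root / "top.csv").touch()
    (root / "gbif" / "nested.csv").touch()
    (root / "gbif" / "notes.txt").touch()

    # glob: '*' stops at the file separator, so only the top-level file matches
    shallow = sorted(p.name for p in root.glob("*.csv"))

    # rglob: applies the same pattern in the folder and every subfolder
    deep = sorted(p.name for p in root.rglob("*.csv"))

print(shallow)  # → ['top.csv']
print(deep)     # → ['nested.csv', 'top.csv']
```

Which of the two you want depends on where the archive is extracted: a non-recursive pattern only finds files sitting directly in the folder you search.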

In [7]:
# Only download once
### Set file name for download
gbif_pattern = os.path.join(trumpeter_swan_dir, '*.csv')

### Double check that there isn't already a file that matches this pattern.
### If it already exists, skip the whole conditional
### And go straight to the line: gbif_path = glob(gbif_pattern)[0]
if not glob(gbif_pattern):

    ### Only submit a download request to GBIF once
    ### If GBIF_DOWNLOAD_KEY is not defined in our environment, make the download request
    if 'GBIF_DOWNLOAD_KEY' not in os.environ:

        ### Submit a query to GBIF
        gbif_query = occ.download([

            ### Add your species key here
            f"speciesKey = {species_key}",

            ### Filter out results that are missing coordinates
            "hasCoordinate = True",

            ### Choose a year to include
            "year = 2022",
        ])
        os.environ['GBIF_DOWNLOAD_KEY'] = gbif_query[0]

    # Wait for the download to build
    download_key = os.environ['GBIF_DOWNLOAD_KEY']

    ### Use the occurrence command module in pygbif to get the metadata
    wait = occ.download_meta(download_key)['status']

    ### Check if the status of the download = "SUCCEEDED"
    ### Wait and loop through until it finishes
    while wait != 'SUCCEEDED':

        ### Don't want to re-query the API in the loop too frequently
        time.sleep(5)
        wait = occ.download_meta(download_key)['status']

    # Download GBIF data when it's ready
    download_info = occ.download_get(
        os.environ['GBIF_DOWNLOAD_KEY'], 
        path=trumpeter_swan_dir)

    # Unzip GBIF data using the zipfile package
    with zipfile.ZipFile(download_info['path']) as download_zip:
        download_zip.extractall(path=trumpeter_swan_dir)

# Find the extracted .csv file path (take the first result)
gbif_path = glob(gbif_pattern)[0]
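
The waiting loop in the cell above keeps polling until the status is `SUCCEEDED`, so it would never finish if the request failed on the GBIF side. A bounded version might look like the sketch below. The helper `wait_for_download`, the timeout, and the set of failure status names are assumptions; the status function is passed in so the example can run offline with made-up statuses.

```python
import time


def wait_for_download(get_status, poll_seconds=5, timeout_seconds=1800):
    """Poll get_status() until it returns 'SUCCEEDED'.

    Hypothetical helper: raises RuntimeError on a failure status and
    TimeoutError if the download is not ready before the deadline.
    """
    # Assumed names for statuses that will never become 'SUCCEEDED'
    failure_statuses = {"FAILED", "KILLED", "CANCELLED"}
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "SUCCEEDED":
            return status
        if status in failure_statuses:
            raise RuntimeError(f"GBIF download ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("Timed out waiting for the GBIF download")
        # Pause so the API is not queried too frequently
        time.sleep(poll_seconds)


# Stand-in for a real status lookup so the sketch runs offline
fake_statuses = iter(["PREPARING", "RUNNING", "SUCCEEDED"])
result = wait_for_download(lambda: next(fake_statuses), poll_seconds=0)
print(result)  # → SUCCEEDED
```

In the notebook, the status function could be `lambda: occ.download_meta(download_key)['status']`, which is the same call the cell above uses.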

You might notice that the GBIF data filename isn’t very
**descriptive**…at this point, you may want to clean up your data
directory so that you know what the file is later on!

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace ‘your-gbif-filename’ with a <strong>descriptive</strong>
name.</li>
<li>Run the cell</li>
<li>Check your data folder. Is it organized the way you want?</li>
</ol></div></div>

In [8]:
# Current path to the CSV file (first match)
gbif_path = glob(gbif_pattern)[0]

# New file name (for example, 'trumpeter_swan_2022.csv')
new_name = os.path.join(trumpeter_swan_dir, 'trumpeter_swan_2022.csv')

# Rename the file
# Only rename if the new file doesn't already exist
if os.path.exists(gbif_path) and not os.path.exists(new_name):
    os.rename(gbif_path, new_name)

# Update gbif_path to point to the renamed file
gbif_path = new_name

# Check the renamed CSV file location
print("CSV renamed to:", gbif_path)

CSV renamed to: C:\Users\nymve\earth-analytics\data\trumpeter-swan-directory\trumpeter_swan_2022.csv


### STEP 4: Load the GBIF data into Python

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Load GBIF data</div></div><div class="callout-body-container callout-body"><p>Just like you did when wrangling your data from the data subset,
you’ll need to load your GBIF data and convert it to a GeoDataFrame.</p></div></div>

In [9]:
# Load the GBIF data
# Read the renamed CSV into a DataFrame
swan_gbif_df = pd.read_csv(
    gbif_path,
    # GBIF files are tab-separated
    delimiter='\t',  
    # Helps avoid dtype warnings for large files 
    low_memory=False, 
    # Set index
    index_col='gbifID', 
    # Select columns needed for analysis
    usecols=['gbifID', 'decimalLatitude', 'decimalLongitude', 'month']
)

# Display first few rows of the dataframe to check data
swan_gbif_df.head()

| gbifID | decimalLatitude | decimalLongitude | month |
|---|---|---|---|
| 4646825409 | 43.62827 | -79.32917 | 1 |
| 4323802234 | 40.91399 | -81.59443 | 5 |
| 4254074916 | 43.2658 | -79.781044 | 1 |
| 4262046537 | 41.394104 | -80.91398 | 4 |
| 4285827586 | 42.273277 | -83.6934 | 1 |
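
To see what `delimiter`, `index_col`, and `usecols` each do, the same call can be tried on a tiny in-memory sample. The two rows come from the output above; the extra `species` column is added here only to show a column being dropped by `usecols`.

```python
import io

import pandas as pd

# Tab-separated text standing in for a GBIF download
sample_tsv = (
    "gbifID\tspecies\tdecimalLatitude\tdecimalLongitude\tmonth\n"
    "4646825409\tCygnus buccinator\t43.62827\t-79.32917\t1\n"
    "4323802234\tCygnus buccinator\t40.91399\t-81.59443\t5\n"
)

sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    # Split fields on tabs instead of commas
    delimiter="\t",
    # Use the occurrence ID as the row label
    index_col="gbifID",
    # Keep only these columns; 'species' is left out
    usecols=["gbifID", "decimalLatitude", "decimalLongitude", "month"],
)

print(sample_df.shape)          # → (2, 3)
print(list(sample_df.columns))  # → ['decimalLatitude', 'decimalLongitude', 'month']
```

The index column must be listed in `usecols` as well, which is why `gbifID` appears in both arguments.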


In [13]:
# Convert GBIF dataframe into GeoDataFrame
swan_gbif_gdf = (
    gpd.GeoDataFrame(
        swan_gbif_df, 
        geometry=gpd.points_from_xy(
            swan_gbif_df.decimalLongitude, 
            swan_gbif_df.decimalLatitude), 
        crs="EPSG:4326")
    # Select the desired columns
    [['month', 'geometry']]
)

# Display GBIF Geodataframe 
swan_gbif_gdf

| gbifID | month | geometry |
|---|---|---|
| 4646825409 | 1 | POINT (-79.32917 43.62827) |
| 4323802234 | 5 | POINT (-81.59443 40.91399) |
| 4254074916 | 1 | POINT (-79.78104 43.2658) |
| 4262046537 | 4 | POINT (-80.91398 41.3941) |
| 4285827586 | 1 | POINT (-83.6934 42.27328) |
| ... | ... | ... |
| 3823266844 | 3 | POINT (5.85 53.2) |
| 3996487242 | 9 | POINT (-110.48186 44.67639) |
| 3822871585 | 1 | POINT (5.85 53.2) |
| 3826456857 | 3 | POINT (5.85 53.2) |


In [14]:
# Check results
swan_gbif_gdf.total_bounds

array([-165.58243 ,   26.07226 ,   14.369705,   71.301346])
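
The four numbers returned by `total_bounds` are ordered `(minx, miny, maxx, maxy)`, that is, minimum longitude, minimum latitude, maximum longitude, maximum latitude. The plain-Python sketch below computes the same kind of bounding box for a handful of points copied from the GeoDataFrame output above, and adds a simple range check that can catch swapped longitude and latitude columns.

```python
# (longitude, latitude) pairs copied from the GeoDataFrame output above
points = [
    (-79.32917, 43.62827),
    (-81.59443, 40.91399),
    (5.85, 53.2),
    (-110.48186, 44.67639),
]

xs = [x for x, _ in points]  # longitudes
ys = [y for _, y in points]  # latitudes

# Same ordering as total_bounds: (minx, miny, maxx, maxy)
bounds = (min(xs), min(ys), max(xs), max(ys))
print(bounds)  # → (-110.48186, 40.91399, 5.85, 53.2)

# Valid EPSG:4326 coordinates fall in these ranges
in_range = all(-180 <= x <= 180 and -90 <= y <= 90 for x, y in points)
print(in_range)  # → True
```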

### STEP -1: Wrap up

Don’t forget to store your variables so you can use them in other
notebooks! Replace `var1` and `var2` with the variables you want to save,
separated by spaces.

In [15]:
%store swan_gbif_df swan_gbif_gdf

Stored 'swan_gbif_df' (DataFrame)
Stored 'swan_gbif_gdf' (GeoDataFrame)


Finally, be sure to `Restart` and `Run all` to make sure your notebook
works all the way through!