# GIATAR Dataset update

The scripts can be run individually, in sequence, from the command line, or through this notebook, which offers a more interactive interface.

The dataset is meant to be updated:
1. periodically, when data sources (SInAS, EPPO, DAISIE) publish new lists of species
2. regularly (e.g., monthly or every six months) to incorporate new invasive species records from EPPO and GBIF

The update process can also be used to rebuild the dataset from scratch, though this takes several days because of the many API calls involved.

In [None]:
# Navigate from the tutorials folder to the root folder
# Add that directory to the path

import os
import sys

os.chdir("..")
sys.path.append(os.getcwd())

## Monthly (or multi-monthly) update

This process updates only the occurrence data/first records, not the underlying species lists. This is a good way to incorporate new reports and species observations of ongoing invasions.

In [None]:
# Run: 2_new_gbif_obs.py to get GBIF observations since the last update

!python data_update/2_new_gbif_obs.py

In [None]:
# Run: 3b_get_monthly_eppo_reports.py to get EPPO reports since the last update

!python data_update/3b_get_monthly_eppo_reports.py

In [None]:
# Run: 4_consolidate_all_occurence.py to incorporate these new records into the all_records and first_records datasets

!python data_update/4_consolidate_all_occurence.py
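
The three cells above can also be run as one sequence outside the notebook. The following is a minimal sketch, assuming it is started from the repository root; the helper names `build_commands` and `run_in_order` are hypothetical and not part of the GIATAR scripts. It stops at the first script that fails, so a failed download is not followed by a consolidation step.

```python
import subprocess
import sys

# Script paths are taken from the cells above; the helper names are hypothetical.
MONTHLY_SCRIPTS = [
    "data_update/2_new_gbif_obs.py",
    "data_update/3b_get_monthly_eppo_reports.py",
    "data_update/4_consolidate_all_occurence.py",
]


def build_commands(scripts, python=sys.executable):
    """Return one [interpreter, script] command per script, preserving order."""
    return [[python, script] for script in scripts]


def run_in_order(scripts):
    """Run each script in turn and stop at the first one that exits with an error."""
    for command in build_commands(scripts):
        print("Running:", " ".join(command))
        # check=True raises subprocess.CalledProcessError on a non-zero exit code
        subprocess.run(command, check=True)


# Usage, from the repository root:
# run_in_order(MONTHLY_SCRIPTS)
```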

## Full update/re-create dataset

The same process is used both to update the full dataset (including species lists and trait data) and to create it from scratch.

To update the species lists, first save the species list file from the source as described, then run the script. If there are no updates to a source, just run the script. Running it even when nothing has changed prevents species that are already in the dataset from being treated as new species.

### Create .env file

In [None]:
# Optional: Create .env file.
# If this is the first time you are running these scripts/creating the dataset, you will need to create a .env file.

# Where the .csv files are being stored (data_dir)

drive_letter = "Y:" 
data_dir = "/GIATAR/dataset/"

# Auth token for EPPO API
# Anyone can register on EPPO (https://data.eppo.int/user/login) to get a token

eppo_token = "INSERT_TOKEN" 

# Year to start collecting GBIF records

base_obs_year = 1970

# Store information about last updates

gbif_obs_last_update = "2025-05-14"
eppo_report_last_update = "2025-05-14"

# The with statement closes the file on exit, so no explicit close() is needed
with open(".env", "w") as f:
    f.write(f"DATA_PATH='{drive_letter + data_dir}'\n")
    f.write(f"EPPO_TOKEN='{eppo_token}'\n")
    f.write(f"BASE_OBS_YEAR='{base_obs_year}'\n")
    f.write(f"GBIF_OBS_UPDATED='{gbif_obs_last_update}'\n")
    f.write(f"EPPO_REP_UPDATED='{eppo_report_last_update}'\n")
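
Before starting a long run, it can help to confirm that the `.env` file contains every key written by the cell above. This is a minimal sketch using only the standard library; the function names `parse_env` and `missing_keys` are hypothetical, and the GIATAR scripts may load the file differently (for example with a dotenv package).

```python
# Keys written by the cell above
REQUIRED_KEYS = {
    "DATA_PATH",
    "EPPO_TOKEN",
    "BASE_OBS_YEAR",
    "GBIF_OBS_UPDATED",
    "EPPO_REP_UPDATED",
}


def parse_env(text):
    """Parse KEY='value' lines into a dict, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop the surrounding quotes added when the file was written
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values):
    """Return the required keys that are absent, in alphabetical order."""
    return sorted(REQUIRED_KEYS - set(values))


# Usage, from the repository root after the file has been written:
# with open(".env") as f:
#     settings = parse_env(f.read())
# print(missing_keys(settings))
# print(settings["EPPO_TOKEN"] == "INSERT_TOKEN")  # True means the placeholder was never replaced
```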

### Download GBIF taxonomic backbone

To create the dataset for the first time, you will also need to download the GBIF taxonomic backbone. 

Go to https://www.gbif.org/occurrence/download and select the Download tab. Select “Species List” (the last option). 

You should get an email notification when your download is available. Save the file as `species lists/by_database/gbif_all_small.csv`.

### Download species lists from each source

- Download the latest SInAS list and records from https://zenodo.org/records/10038256 (if available) and save as `species lists/by_database/SInAS_AlienSpeciesDB_2.5_FullTaxaList.csv` and `species lists/by_database/SInAS_AlienSpeciesDB_2.5.csv`
- Download the CABI-ISC species list from https://www.cabidigitallibrary.org/journal/cabicompendium/isdt#. Select and then unselect a filter option to display the full list. Download as CSV and save to `species lists/by_database/ISCSearchResults.csv`. Remove any header rows above the column names and make sure the columns are named "Scientific name", "Common name", "Coverage", "URL".
- Download the Bayer flat file from the EPPO data services dashboard, https://data.eppo.int/user/ (see Bayer flat file: https://data.eppo.int/documentation/bayer). Save all files to `species lists/by_database/EPPO-main/`
- Download the latest version of input_taxon.csv from https://github.com/trias-project/daisie-checklist/tree/master/data/raw and save to `species lists/by_database/input_taxon.csv`.
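
The CABI file needs manual cleanup, so a quick header check before running the species list scripts can catch a leftover title row. The sketch below is illustrative only: `header_matches` is a hypothetical helper, and the `utf-8-sig` encoding in the usage comment is an assumption about how the export is saved.

```python
import csv
import io

# Column names required for ISCSearchResults.csv, as listed above
EXPECTED_CABI_COLUMNS = ["Scientific name", "Common name", "Coverage", "URL"]


def header_matches(csv_file, expected=EXPECTED_CABI_COLUMNS):
    """Return True if the first row of an open CSV file equals the expected column names."""
    reader = csv.reader(csv_file)
    first_row = next(reader, [])  # an empty file yields an empty header
    return [name.strip() for name in first_row] == list(expected)


# In-memory example: a file that still has an extra title row fails the check
print(header_matches(io.StringIO("Search results\nScientific name,Common name,Coverage,URL\n")))

# Usage with the real file (encoding is an assumption):
# path = "species lists/by_database/ISCSearchResults.csv"
# with open(path, newline="", encoding="utf-8-sig") as f:
#     print(header_matches(f))
```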

### Run all of the species list scripts 

Regardless of whether new data is available for each source, run the scripts for all sources to prevent existing species from being treated as new species.

In [None]:
! python data_update/0b_get_sinas_species_list.py

In [None]:
! python data_update/0c_get_cabi_species_list.py

In [None]:
! python data_update/0d_get_eppo_species_list.py

In [None]:
! python data_update/0e_get_daisie_species_list.py

### Check all species for matches in the GBIF backbone

This step checks all of the species lists against the GBIF taxonomic backbone using GBIF's names API. This step can take some time depending on how many species are being searched for (minutes to hours for updating, ~10 hours to match the full original species lists).
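
To illustrate what a single name lookup involves, here is a minimal sketch against GBIF's public species match endpoint (`https://api.gbif.org/v1/species/match`). It is not taken from `1a_new_species_gbif_match.py`, which may use a different client or endpoint; the helper names are hypothetical, and treating any `matchType` other than `"NONE"` as a match is a simplifying assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(name):
    """Build the request URL for one scientific name."""
    return f"{GBIF_MATCH_URL}?{urlencode({'name': name})}"


def is_matched(result):
    """Treat a response as matched if it has a usageKey and matchType is not "NONE"."""
    return result.get("matchType", "NONE") != "NONE" and "usageKey" in result


def match_species(name, timeout=30):
    """Query GBIF for one name and return the parsed JSON response (needs network access)."""
    with urlopen(build_match_url(name), timeout=timeout) as response:
        return json.load(response)


# Usage (one API call per species, which is why long lists take hours):
# result = match_species("Lymantria dispar")
# print(is_matched(result), result.get("usageKey"))
```

Names that come back unmatched in a check like this correspond to the cases the follow-up script `1a2_check_unfound_gbif_keys.py` is meant to revisit.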

In [None]:
! python data_update/1a_new_species_gbif_match.py

In [None]:
# If there are unmatched species, run 1a2_check_unfound_gbif_keys.py

! python data_update/1a2_check_unfound_gbif_keys.py

### Check the invasive/alien species status of species in each list

Some sources (EPPO, CABI) contain non-invasive and non-alien species (e.g., host species, natural enemies). This script checks the invasive species status against information from all four sources.

In [None]:
! python data_update/1b_new_species_check_invasive.py

### Consolidate the same species across lists

In [None]:
! python data_update/1c_combine_species_lists.py

## Acquire species records

Obtain and consolidate first reports and species occurrence records from GBIF, EPPO, DAISIE, ASFR, and CABI.

- Get GBIF species observations: This step involves many API calls, so it may take several hours when updating the dataset (with more recent records or new species) and several days (~1 week) for the first construction of the dataset.
- Get EPPO species distributions and species reports: This step involves web 
- Process the DAISIE data (formatting)
- Consolidate the data from the different sources
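
The consolidation step can be pictured as reducing all records to one first record per species and location. The sketch below assumes that a first record is the earliest reported year for each pair; the field names `species`, `location`, and `year` and the placeholder values are hypothetical and do not reflect the actual GIATAR column names or data.

```python
def first_records(records):
    """Keep the earliest year for each (species, location) pair."""
    earliest = {}
    for record in records:
        key = (record["species"], record["location"])
        year = record["year"]
        if key not in earliest or year < earliest[key]:
            earliest[key] = year
    # Sort by species, then location, for a stable output order
    return [
        {"species": species, "location": location, "first_year": year}
        for (species, location), year in sorted(earliest.items())
    ]


# Made-up placeholder records pooled from several sources
all_records = [
    {"species": "sp_A", "location": "X", "year": 2001},
    {"species": "sp_A", "location": "X", "year": 1995},
    {"species": "sp_B", "location": "X", "year": 2010},
]
for row in first_records(all_records):
    print(row)
```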

In [None]:
! python data_update/2_new_gbif_obs.py

In [None]:
! python data_update/3a_get_eppo_species_report.py

In [None]:
! python data_update/3c_get_eppo_species_dist.py

Download all files in the "raw" directory of the DAISIE GitHub repository, https://github.com/trias-project/daisie-checklist/tree/master/data/raw, and save them to `DAISIE data/raw`.

In [None]:
! python data_update/3d_process_daisie_data.py

In [None]:
! python data_update/4_consolidate_all_occurence.py

## Get additional traits data from EPPO

CABI trait data is provided with the dataset as a static dataset.

Trait data from EPPO can be obtained via an API, so it can be updated easily for new species or as new information about species becomes available. The script is run only for new species when updating the dataset, or for all species when re-creating the dataset. Querying this data for all species can take 6+ hours.


In [None]:
! python data_update/5_eppo_api_update.py

That's it! You should now have a complete or updated dataset. Please let us know if you run into any issues or questions: https://github.com/ncsu-landscape-dynamics/GIATAR-dataset/issues