In [1]:
import os, sys

os.chdir("..")

# Global Invasive and Alien Traits and Records (GIATAR) Dataset - query functions tutorial

Welcome to the tutorial for the query functions supplied with the GIATAR dataset!
The file containing the query functions, ```GIATAR_query_functions.py```, is available in the queries folder of the supporting GitHub repository and in the queries folder of the released dataset folder. This tutorial should be stored alongside it.

These functions simplify querying and joining the dataset and typically return pandas dataframes to make analysis easier. While the dataset contains considerably more information than is accessible through these tools, we hope they will cover the most common operations for dataset users. 
## Environment
The environment.yml file supplied with the code for this project will suffice for the query functions here. However, it contains a complete suite of packages for dataset updating, some of which may be tricky to install. These query functions rely mostly on basic Python packages (pandas, numpy, etc.), with the exception of pygbif, which can be installed with pip. 
## Setting paths
### to functions
We suggest putting the ```GIATAR_query_functions.py``` file (available in the queries folder of the DOI released dataset, or on the project GitHub ```https://github.com/ncsu-landscape-dynamics/GIATAR-dataset``` for the most current version) in your project directory. 

### to data
```GIATAR_query_functions.py``` contains data_path as the first line of code in the file. Set this to the dataset directory of GIATAR. If you prefer, you can call ```create_dotenv(pathtodata)``` to create a .env file that permanently sets this path.

## usageKeys
Unique ID keys for species in the dataset are referred to as usageKeys, following the structure and naming of usageKeys from GBIF. Where possible, we have retained GBIF usageKeys as unique IDs for taxa; otherwise we have generated unique IDs that won't overlap with GBIF usageKeys.
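
Because the query functions accept usageKeys as either strings or integers, it can help to normalize keys to one type before comparing them or joining tables on them. The helper below is a minimal sketch of that idea; `normalize_usage_key` is a hypothetical name and is not part of ```GIATAR_query_functions.py```.

```python
def normalize_usage_key(key):
    """Return a usageKey as a plain string (hypothetical helper, not part of GIATAR)."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(key, bool):
        raise TypeError("usageKey must be a string or an int")
    if isinstance(key, int):
        return str(key)
    if isinstance(key, str):
        return key.strip()
    raise TypeError("usageKey must be a string or an int")


# The same key given as an int and as a string compares equal after normalizing.
print(normalize_usage_key(3190653) == normalize_usage_key("3190653"))
```

Storing keys as strings avoids assuming that every generated ID can be parsed as a number.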

## functions
### get_species_name(usageKey) 
 returns species name as string - takes usageKey as string or int

### get_usageKey(species_name) 
 returns usageKey as string - takes species name as string

### get_all_species() 
 returns list of all species names in dataset - no inputs

### check_species_exists(species_name) 
 returns True or False - takes species name as string

### get_first_introductions(usageKey, check_exists=False, ISO3_only=False, import_additional_native_info=True) 


returns dataframe of first introductions - takes usageKey as string or int

check_exists=True will raise a KeyError if the species is not in the dataset

ISO3_only=True will return only records whose location is a 3-character ISO3 code. Other location info includes bioregions or other geonyms

import_additional_native_info=True will import additional native range info, first by checking whether native range info for a particular country is available from sources that reported later than the first introduction, and second by importing native range info from the file of native range info unique to GIATAR
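
With check_exists=True, a missing species surfaces as a KeyError, which a caller can catch. The sketch below shows one way to turn that into a None return value. `fetch_or_none` and `fake_fetch` are hypothetical names; `fake_fetch` only stands in for a GIATAR query function so that the example runs without the dataset.

```python
def fetch_or_none(fetch, usage_key):
    """Call fetch(usage_key, check_exists=True); return None if the species is missing."""
    try:
        return fetch(usage_key, check_exists=True)
    except KeyError:
        return None


def fake_fetch(usage_key, check_exists=False):
    """Stand-in for a GIATAR query function (assumption, for illustration only)."""
    known = {"3190653": ["record"]}
    if check_exists and str(usage_key) not in known:
        raise KeyError(usage_key)
    return known.get(str(usage_key), [])


print(fetch_or_none(fake_fetch, 3190653))  # known key: returns the stored records
print(fetch_or_none(fake_fetch, "0"))      # unknown key: KeyError is caught, returns None
```

In a real session, the first argument would be a query function such as `gqf.get_first_introductions`.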

### get_all_introductions(usageKey, check_exists=False, ISO3_only=True) 

returns dataframe of all introductions - takes usageKey as string or int 

check_exists=True will raise a KeyError if the species is not in the dataset

ISO3_only=True will return only records whose location is a 3-character ISO3 code. Other location info includes bioregions or other geonyms

import_additional_native_info=True will import additional native range info, first by checking whether native range info for a particular country is available from sources that reported later than the first introduction, and second by importing native range info from the file of native range info unique to GIATAR
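
To make the ISO3_only distinction concrete, the sketch below separates ISO3-formatted values from other geonyms using a simple format check. This is an illustration only and does not show how the GIATAR functions filter internally. `looks_like_iso3` is a hypothetical helper, and it tests the format (three uppercase letters) rather than whether the code is actually assigned to a country.

```python
import re

# Three uppercase ASCII letters, e.g. "DNK". Format check only.
ISO3_PATTERN = re.compile(r"[A-Z]{3}")


def looks_like_iso3(location):
    """True if the value is formatted like an ISO3 code (hypothetical helper)."""
    return isinstance(location, str) and ISO3_PATTERN.fullmatch(location) is not None


# Location values of the kinds that appear in the outputs of this tutorial.
locations = ["DNK", "Europe", "XK", "ZAF", None]
iso3_like = [loc for loc in locations if looks_like_iso3(loc)]
print(iso3_like)
```

Values such as 'Europe' or the two-letter 'XK' fail the check and would need separate handling.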

### get_ecology(species_name) 
 returns dictionary of dataframes of ecology info - takes species name as string. Ecology info returned by this function includes rainfall, air temperature, climate, latitude/altitude, water temperature and whether a pest uses wood packaging.

 Ecological info is variously formatted for different species - e.g. air temperature might include max, min, range or other info. We recommend spending time with the outputs to find the information you want. 
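
Since the result is a dictionary of dataframes whose contents vary by species, a quick overview of what came back can be a useful first step. The sketch below summarizes any such dictionary. `summarize_tables` is a hypothetical helper, and the toy dictionary (including its key name) is an assumption that stands in for a real result.

```python
import pandas as pd


def summarize_tables(tables):
    """Return one summary row per table in a dict of DataFrames (hypothetical helper)."""
    rows = [
        {"table": name, "n_rows": len(df), "columns": ", ".join(map(str, df.columns))}
        for name, df in tables.items()
    ]
    return pd.DataFrame(rows, columns=["table", "n_rows", "columns"])


# Toy stand-in for a dictionary of ecology tables (assumed key name).
toy = {
    "CABI_airtemp": pd.DataFrame(
        {
            "Parameter": ["Mean annual temperature (ºC)"],
            "Lower limit": [5.0],
            "Upper limit": [37.0],
        }
    ),
}
print(summarize_tables(toy))

# An empty dictionary gives an empty summary rather than an error.
print(summarize_tables({}).empty)
```

The empty case matters because a species with no ecology records returns an empty dictionary.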



### get_hosts_and_vectors(species_name) 
returns dictionary of dataframes of host and vector info - takes species name as string.

This tool returns hosts (plant hosts for herbivorous insects, animal hosts for diseases and parasites) and vectors (either zoonotic or plant vectors - mostly for diseases).



### get_species_list(kingdom=None, phylum=None, taxonomic_class=None, order=None, family=None, genus=None)
 returns list of usageKeys matching taxonomic criteria - takes kingdom, phylum, taxonomic_class, order, family, genus as strings. This function can help select a group of organisms in the dataset matching the search criteria. Note that <code>class</code> is a reserved keyword in Python, so we refer to the taxonomic grouping as taxonomic_class
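
The optional-criteria pattern can be sketched on a toy table: every argument left as None is ignored, and a row must match all remaining criteria. This is an illustration of the idea, not the GIATAR implementation. `filter_taxa` and the field names of the toy rows are assumptions; the usageKeys are the ones shown for these species in this tutorial.

```python
def filter_taxa(rows, **criteria):
    """Return usageKeys of rows that match every criterion that is not None."""
    active = {rank: value for rank, value in criteria.items() if value is not None}
    return [
        row["usageKey"]
        for row in rows
        if all(row.get(rank) == value for rank, value in active.items())
    ]


# Toy taxonomy rows (assumed layout, for illustration only).
taxa = [
    {"usageKey": "1341976", "kingdom": "Animalia", "taxonomic_class": "Insecta", "order": "Hymenoptera"},
    {"usageKey": "2080592", "kingdom": "Animalia", "taxonomic_class": "Insecta", "order": "Hemiptera"},
    {"usageKey": "3190653", "kingdom": "Plantae", "taxonomic_class": "Magnoliopsida", "order": "Sapindales"},
]

print(filter_taxa(taxa, taxonomic_class="Insecta"))
print(filter_taxa(taxa, kingdom="Animalia", order="Hemiptera", family=None))
```

A parameter named `class` would be a syntax error in the function definition, which is why the longer name is used.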



### get_native_ranges(usageKey, ISO3=None) 
returns dataframe of native ranges - takes usageKey as string or int.

The GIATAR dataset stores native range information in several ways. Some better-studied species have native information as a binary true/false at the country level. Many other species have native range information stored only as biogeographic zones, e.g. Palearctic. We provide functionality to map this biogeographic zone data to presence-absence true/false values using a crosswalk, which is available in the native ranges subfolder of the dataset.

When the user calls <code>get_native_ranges()</code> and ISO3 is set to None, the function returns all available information about the species. 
If the user wishes to use the native-range to country-presence crosswalk, they should provide ISO3 as a Python list of ISO3 standard country codes, e.g. <code>['USA','CHN']</code>. The function will then use the crosswalk to provide true/false information on the native status of the species if biogeographic information is available. When both biogeographic information and more specific country-level native information are available, the function defaults to the more specific country-level true/false info. 

ISO3=list returns dataframe of native ranges and True or False if species is native to ISO3 - takes a list of ISO3 codes for countries as input. See examples below for context.
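
The precedence rule described above (country-level records win over crosswalked bioregions) can be sketched as follows. This is a toy illustration under stated assumptions, not the GIATAR implementation: `native_status` is a hypothetical function, the crosswalk is a hand-made dictionary, and treating a country as native when any containing region is native is an assumed policy.

```python
def native_status(iso3_codes, country_records, region_records, crosswalk):
    """Resolve native status per country; country-level records override regions."""
    result = {}
    for code in iso3_codes:
        if code in country_records:
            # A specific country record takes precedence over any region.
            result[code] = country_records[code]
            continue
        # Otherwise fall back to every region that contains this country.
        statuses = [
            native
            for region, native in region_records.items()
            if code in crosswalk.get(region, set())
        ]
        # Assumed policy: native if any containing region is native; None if unknown.
        result[code] = any(statuses) if statuses else None
    return result


country_records = {"DNK": False, "ITA": True}   # country-level true/false
region_records = {"Europe": True}               # region-level true/false
crosswalk = {"Europe": {"DNK", "ITA", "FRA"}}   # toy crosswalk, not the GIATAR file

print(native_status(["DNK", "FRA", "USA"], country_records, region_records, crosswalk))
```

Here DNK keeps its country-level value even though its region is marked native, FRA inherits the region value, and USA has no information.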

In [2]:
import query_functions.python.GIATAR_query_functions as gqf
import pandas as pd



In [3]:
## Basic operations
all_species = gqf.get_all_species()
all_species

['Cirsium wallichii',
 'Fusarium solani',
 'Acridotheres cristatellus',
 'Marteilia',
 'Macrorhynchia philippina',
 'Peronospora aquilegiicola',
 'Perkinsus chesapeaki',
 'Carthamus oxyacanthus',
 'Ficus retusa',
 'Polyscias fruticosa',
 'Thunbergia erecta',
 'Butia capitata',
 'Helianthus debilis',
 'Begonia nelumbiifolia',
 'Leucophyllum frutescens',
 'Garcinia dulcis',
 'Epiphyllum oxypetalum',
 'Trirachys sartus',
 'Polyscias balfouriana',
 'Mycobacterium avium paratuberculosis',
 'Ludwigia palustris',
 'Galphimia gracilis',
 'Terminalia muelleri',
 'Inga acreana',
 'Malvaviscus penduliflorus',
 'Laelia rubescens',
 'Bonamia ostreae',
 'Acroceras zizanioides',
 'Galphimia glauca',
 'Bambusa longispiculata',
 'Tuberose mild mottle virus',
 'Euphorbia neriifolia',
 'Euphorbia leucocephala',
 'Inga mucuna',
 'Senna italica',
 'Emilia praetermissa',
 'Ixora casei',
 'Lycorma delicatula',
 'Ixora coccinea',
 'Peronosclerospora philippinensis',
 'Salmo salar',
 'Agave vivipara',
 'Bothri

In [4]:
gqf.check_species_exists("Ailanthus altissima")
gqf.get_usageKey("Ailanthus altissima")


'3190653'

In [5]:
# Let's pull some introduction records
gqf.get_first_introductions("Apis mellifera", import_additional_native_info=True)

Unnamed: 0,usageKey,ISO3,year,Source,Reference,Native,Type
42245,1341976,ABW,2007.0,GBIF,Counts API,False,First report
42246,1341976,AFG,1975.0,GBIF,Counts API,False,First report
42247,1341976,AGO,1970.0,GBIF,Counts API,True,First report
42248,1341976,AIA,2017.0,GBIF,Counts API,False,First report
42249,1341976,ALA,2014.0,GBIF,Counts API,True,First report
...,...,...,...,...,...,...,...
42464,1341976,XK,2007.0,GBIF,Counts API,,First report
42465,1341976,YEM,1982.0,GBIF,Counts API,True,First report
42466,1341976,ZAF,1974.0,GBIF,Counts API,True,First report
42467,1341976,ZMB,1972.0,GBIF,Counts API,True,First report


In [6]:
gqf.get_ecology("Thrips tabaci")

{}

In [7]:
gqf.get_hosts_and_vectors('Icerya purchasi')

{'CABI_tohostPlants':      usageKey   code       section                         Plant name  \
 2715  2080592  28432  tohostPlants                   Acacia (wattles)   
 2716  2080592  28432  tohostPlants                     Acacia confusa   
 2717  2080592  28432  tohostPlants   Acacia dealbata (acacia bernier)   
 2718  2080592  28432  tohostPlants              Acalypha (Copperleaf)   
 2719  2080592  28432  tohostPlants    Albizia julibrissin (silk tree)   
 ...       ...    ...           ...                                ...   
 2779  2080592  28432  tohostPlants                    Syringa (lilac)   
 2780  2080592  28432  tohostPlants             Ulex europaeus (gorse)   
 2781  2080592  28432  tohostPlants   Vaccinium corymbosum (blueberry)   
 2782  2080592  28432  tohostPlants  Virgilia capensis (snowdrop tree)   
 2783  2080592  28432  tohostPlants                   Viscum cruciatum   
 
              Family  Context                              References  
 2715       Fabac

In [8]:
# The introduction functions include native range information, but a user might want more detail on the sources of that information
gqf.get_native_ranges("Apis mellifera")
# we include 



Unnamed: 0,ISO3,Source,Native,Reference,bioregion,DAISIE_region
0,DNK,DAISIE,False,Nobanis (2006),,
1,DNK,DAISIE,False,Nobanis (2006),,
2,PRT,DAISIE,False,,,
3,GRL,DAISIE,False,Nobanis (2006),,
4,Europe,DAISIE,True,Nobanis (2006),,
5,Southwestern Europe,DAISIE,True,Nobanis (2006),,
6,Southeastern Europe,DAISIE,True,Nobanis (2006),,
7,Northern Europe,DAISIE,True,Nobanis (2006),,
8,Middle Europe,DAISIE,True,Nobanis (2006),,
9,ITA,DAISIE,True,Nobanis (2006),,


In [9]:
gqf.get_common_names("Pancratium maritimum")

   DAISIE_idspecies                   name language  source usageKey
0                 5    Murphy's Threadwort  English  DAISIE  8180725
1                 6      Long's Threadwort  English  DAISIE  7433870
2                 7        Great Crestwort  English  DAISIE  7425480
3                 8     Southern Crestwort  English  DAISIE  6096602
4                 9  Micheli's Balloonwort  English  DAISIE  5286305
in daisie


{'DAISIE_vernacular':       DAISIE_idspecies          name language  source usageKey
 1349              2502  Sea Daffodil  English  DAISIE  2853283,
 'EPPO_names':        codeEPPO    nameid isolang isocountry                 fullname  \
 45685     PNZMA  132617.0      la        NaN     Pancratium maritimum   
 45686     PNZMA  423614.0      bg        NaN        морски панкрациум   
 45687     PNZMA  423615.0      ca        NaN         assutzena blanca   
 45688     PNZMA  423616.0      ca        NaN        assutzena d'arena   
 45689     PNZMA  423617.0      ca        NaN         assutzena marina   
 45690     PNZMA  423618.0      ca        NaN              ceba marina   
 45691     PNZMA  423619.0      ca        NaN           lliri de Canet   
 45692     PNZMA  423621.0      ca        NaN             lliri de mar   
 45693     PNZMA  423622.0      ca        NaN          lliri de platja   
 45694     PNZMA  423620.0      ca        NaN  lliri de Santa Cristina   
 45695     PNZMA  4236

In [11]:
gqf.get_usageKey("Pancratium maritimum")

'2853283'

In [12]:
gqf.get_trait_table_list()

['CABI_rainfall',
 'CABI_airtemp',
 'CABI_climate',
 'CABI_environments',
 'CABI_latitude_altitude',
 'CABI_natural_enemies',
 'CABI_water_tolerances',
 'CABI_wood_packaging',
 'CABI_host_plants',
 'CABI_pathway_vectors',
 'CABI_vectorsAndIntermediateHosts',
 'DAISIE_habitats',
 'CABI_impact_summary',
 'CABI_latitude_altitude_ranges',
 'CABI_symptoms_signs',
 'CABI_threatened_species',
 'EPPO_hosts',
 'EPPO_names',
 'DAISIE_pathways',
 'DAISIE_vectors',
 'DAISIE_vernacular']

In [13]:
gqf.get_trait_table('CABI_airtemp')

Unnamed: 0,usageKey,code,section,Parameter,Lower limit,Upper limit
0,3517956,9630,toairTemperature,Mean annual temperature (ºC),5.0,37.0
1,2096154,109097,toairTemperature,Mean annual temperature (ºC),0.0,45.0
2,10826565,119196,toairTemperature,Absolute minimum temperature (ºC),17.0,
3,10826565,119196,toairTemperature,Mean annual temperature (ºC),17.0,37.0
4,10304386,3651,toairTemperature,Absolute minimum temperature (ºC),15.0,
...,...,...,...,...,...,...
718,3152253,55771,toairTemperature,Mean minimum temperature of coldest month (ºC),-1.0,
719,2705869,55773,toairTemperature,Absolute minimum temperature (ºC),0,
720,2705869,55773,toairTemperature,Mean annual temperature (ºC),25,40.0
721,2705869,55773,toairTemperature,Mean maximum temperature of hottest month (ºC),30,40.0
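
A trait table like the one above can be narrowed to one species or one parameter with ordinary pandas filtering. The sketch below rebuilds a few of the rows shown above as a toy dataframe so that it runs without the dataset. The mixed formatting of the limit columns in the output suggests they may not always be numeric, so the sketch stores the lower limits as text (an assumption) and converts both limit columns before use. The integer dtype of usageKey is also an assumption.

```python
import pandas as pd

MEAN_ANNUAL = "Mean annual temperature (ºC)"
ABS_MIN = "Absolute minimum temperature (ºC)"
MAX_HOTTEST = "Mean maximum temperature of hottest month (ºC)"

# Toy copy of four rows from the output above; lower limits stored as text on purpose.
airtemp = pd.DataFrame(
    {
        "usageKey": [3517956, 2705869, 2705869, 2705869],
        "Parameter": [MEAN_ANNUAL, ABS_MIN, MEAN_ANNUAL, MAX_HOTTEST],
        "Lower limit": ["5.0", "0", "25", "30"],
        "Upper limit": [37.0, None, 40.0, 40.0],
    }
)

# Coerce both limit columns to floats; unparseable values become NaN.
for col in ["Lower limit", "Upper limit"]:
    airtemp[col] = pd.to_numeric(airtemp[col], errors="coerce")

# All rows for one parameter across species.
mean_annual = airtemp[airtemp["Parameter"] == MEAN_ANNUAL]

# All parameters for one species, indexed by parameter name.
wide = airtemp[airtemp["usageKey"] == 2705869].set_index("Parameter")[
    ["Lower limit", "Upper limit"]
]
print(wide)
```

Indexing by parameter name makes it easy to see which limits are missing for a given species.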
