# Filter T0 invasive species from the nameserver data

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
plt.style.use('seaborn-notebook')

This notebook selects the invasive species, as defined by the [t0-values in the aggregated checklist](https://github.com/inbo/alien-species-checklist/tree/master/data/processed), from the verified set of matched names provided by the [INBO nameserver](https://github.com/inbo/data-publication/tree/master/datasets/name-server/data/processed). The resulting subset defines the set of identifiers used to query the INBO database.

## Get verified nameserver names

Reading in the Recorder-verified names:

In [3]:
recorder_names = pd.read_csv('../../name-server/data/processed/verified-recommended-nameserver-names.tsv', delimiter='\t')

In [4]:
recorder_names.head()

Unnamed: 0,nbn_recommendedTaxonVersionKey,nameMatchValidation,nbn_scientificName,nbn_taxonGroup,nbn_kingdom,gbifapi_kingdom,gbifapi_usageKey,gbifapi_scientificName,gbifapi_canonicalName,gbifapi_status,gbifapi_rank,gbifapi_matchType,gbifapi_note,gbifapi_acceptedKey,gbifapi_acceptedScientificName
0,NHMSYS0000456996,ok,Caylusea,bloemplant,Plantae,Plantae,7275943,Caylusea A. St.-Hil.,Caylusea,ACCEPTED,GENUS,EXACT,,7275943,Caylusea A. St.-Hil.
1,NHMSYS0000900079,ok,Listrognathus mactator,insect - vliesvleugelige (Hymenoptera),Animalia,Animalia,1306714,"Listrognathus mactator (Thunberg, 1822)",Listrognathus mactator,ACCEPTED,SPECIES,EXACT,,1306714,"Listrognathus mactator (Thunberg, 1822)"
2,NHMSYS0000356730,ok,Dinocheirus,pseudoschorpioen (Pseudoscorpiones),Animalia,Animalia,2127154,"Dinocheirus Chamberlin, 1929",Dinocheirus,ACCEPTED,GENUS,EXACT,,2127154,"Dinocheirus Chamberlin, 1929"
3,NBNSYS0100000437,ok,Anurophorus satchelli,springstaart (Collembola),Animalia,Animalia,5166810,"Anurophorus satchelli Goto, 1956",Anurophorus satchelli,ACCEPTED,SPECIES,EXACT,,5166810,"Anurophorus satchelli Goto, 1956"
4,NHMSYS0000309359,ok,Anthelia juratzkana,levermos,Plantae,Plantae,5710214,Anthelia juratzkana (Limpr.) Trevis.,Anthelia juratzkana,ACCEPTED,SPECIES,EXACT,,5710214,Anthelia juratzkana (Limpr.) Trevis.


## Get t0 alien species checklist names

Reading in the species checklist:

In [5]:
alien_species_checklist = pd.read_csv('https://raw.githubusercontent.com/inbo/alien-species-checklist/master/data/processed/aggregated-checklist.tsv', delimiter='\t')

In [6]:
alien_species_checklist['datasetName'].unique()

array(['plants', 'wrims', 'harmonia | plants | t0', 'harmonia | plants',
       'fishes', 'harmonia', 'harmonia | t0', 't0', 'fishes | harmonia',
       'macroinvertebrates | wrims', 'plants | t0', 'macroinvertebrates',
       'macroinvertebrates | t0 | wrims', 'fishes | wrims',
       'fishes | harmonia | t0', 'fishes | harmonia | wrims',
       'macroinvertebrates | t0',
       'harmonia | macroinvertebrates | t0 | wrims', 't0 | wrims',
       'harmonia | wrims', 'harmonia | plants | t0 | wrims'], dtype=object)

All rows in which `t0` is part of the `datasetName` should be taken into account:

In [7]:
t0_alien_species_checklist = alien_species_checklist.loc[alien_species_checklist['datasetName'].str.contains('t0'), :]
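The substring filter above works for the dataset names listed, since none of them merely contains `t0` as part of a longer name. A stricter variant, sketched below on a small made-up frame, splits the pipe-delimited `datasetName` into tokens and tests for an exact match. It also tolerates missing values, which is an assumption about the data and not something observed in the checklist. The helper name `has_dataset` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy stand-in for the aggregated checklist (values are illustrative only)
checklist = pd.DataFrame({
    "gbifapi_acceptedKey": [1, 2, 3, 4],
    "datasetName": ["harmonia | plants | t0", "plants", "t0", None],
})


def has_dataset(names: pd.Series, dataset: str) -> pd.Series:
    """Return a boolean mask: True where `dataset` is one of the pipe-delimited tokens."""
    tokens = names.fillna("").str.split("|")
    return tokens.apply(lambda parts: dataset in [part.strip() for part in parts])


mask = has_dataset(checklist["datasetName"], "t0")
t0_rows = checklist.loc[mask]
```

On the real checklist this should select the same rows as `str.contains('t0')`, as long as no other dataset name contains the substring `t0`.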

In [8]:
t0_alien_species_checklist.head()

Unnamed: 0,gbifapi_acceptedScientificName,gbifapi_acceptedKey,kingdom,datasetName,euConcernStatus,firstObservationYearBE,firstObservationYearFL,invasionStage,habitat,nativeRange,introductionPathway,presenceBE,presenceFL,presenceWA,presenceBR,gbifapi_scientificName,index
11,Acer negundo L.,3189866,Plantae,harmonia | plants | t0,under consideration,1955.0,,established | invasive,terrestrial | to be determined by experts,N. America,escape > horticulture | escape > to be determi...,present,present | to be determined by experts,present | to be determined by experts,present | to be determined by experts,Acer negundo L.,23 | 24 | 2498 | 9552
88,"Alopochen aegyptiaca (Linnaeus, 1766)",2498252,Animalia,harmonia | t0,under consideration,1984.0,,invasive,freshwater,Africa,escape > pet/aquarium/terrarium species (inclu...,present,present | to be determined by experts,present | to be determined by experts,present | to be determined by experts,"Alopochen aegyptiaca (Linnaeus, 1766)",30 | 9553
93,Alternanthera philoxeroides (Mart.) Griseb.,3084923,Plantae,t0,under consideration,,,,,,,absent,to be determined by experts,to be determined by experts,to be determined by experts,Alternanthera philoxeroides (Mart.) Griseb.,9554
133,"Ameiurus melas (Rafinesque, 1820)",2340977,Animalia,t0,under consideration,,,,,,,absent,to be determined by experts,to be determined by experts,to be determined by experts,"Ameiurus melas (Rafinesque, 1820)",9555
225,Asclepias syriaca L.,3170247,Plantae,plants | t0,under consideration,1987.0,,unknown,to be determined by experts,N. America,escape > horticulture,present,present | to be determined by experts,absent | to be determined by experts,present | to be determined by experts,Asclepias syriaca L.,431 | 9556


In [9]:
len(t0_alien_species_checklist)

148

## Derive nameserver identifiers corresponding to alien species

We want the list of `gbifapi_acceptedKey` values from the t0 set of invasive species and, for each accepted key, the corresponding nameserver identifiers. First, check the data types of the relevant columns in the nameserver table:

In [10]:
recorder_names[['nbn_recommendedTaxonVersionKey', 'gbifapi_acceptedKey']].dtypes

nbn_recommendedTaxonVersionKey    object
gbifapi_acceptedKey                int64
dtype: object
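The dtype check matters because a merge on key columns of different types (for instance integers on one side and strings on the other) either fails or silently matches nothing. A minimal sketch of comparing and aligning the key dtypes before merging, using two toy frames whose contents are assumptions for illustration:

```python
import pandas as pd

# Toy frames: the right-hand keys were (hypothetically) read as strings
left = pd.DataFrame({"gbifapi_acceptedKey": [3189866, 2498252]})
right = pd.DataFrame({"gbifapi_acceptedKey": ["3189866", "2498252"]})

# Compare the dtypes of the join columns before merging
same_dtype_before = left["gbifapi_acceptedKey"].dtype == right["gbifapi_acceptedKey"].dtype

# Align the right-hand column to the integer type of the left-hand column
right["gbifapi_acceptedKey"] = right["gbifapi_acceptedKey"].astype("int64")
same_dtype_after = left["gbifapi_acceptedKey"].dtype == right["gbifapi_acceptedKey"].dtype
```

In this notebook the nameserver column is already `int64`, so no conversion is needed as long as the checklist column is read as an integer as well.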

In [11]:
nameserver_identifiers_for_t0 = pd.merge(t0_alien_species_checklist[['gbifapi_acceptedScientificName', 'gbifapi_acceptedKey']],
                                         recorder_names[['nbn_recommendedTaxonVersionKey', 'gbifapi_acceptedKey']],
                                         how='left',
                                         on='gbifapi_acceptedKey')

In [12]:
nameserver_identifiers_for_t0.head()

Unnamed: 0,gbifapi_acceptedScientificName,gbifapi_acceptedKey,nbn_recommendedTaxonVersionKey
0,Acer negundo L.,3189866,NBNSYS0000014604
1,"Alopochen aegyptiaca (Linnaeus, 1766)",2498252,NHMSYS0001689380
2,Alternanthera philoxeroides (Mart.) Griseb.,3084923,
3,"Ameiurus melas (Rafinesque, 1820)",2340977,NHMSYS0000544615
4,Asclepias syriaca L.,3170247,INBSYS0000005932
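Because this is a left join, checklist species without a nameserver match keep an empty `nbn_recommendedTaxonVersionKey`, as for *Alternanthera philoxeroides* above. One way to list those species explicitly is the `indicator` option of `pd.merge`. The sketch below uses two rows taken from the output above; the variable names `checklist`, `names` and `unmatched` are illustrative.

```python
import pandas as pd

# Two checklist rows copied from the output above
checklist = pd.DataFrame({
    "gbifapi_acceptedScientificName": [
        "Acer negundo L.",
        "Alternanthera philoxeroides (Mart.) Griseb.",
    ],
    "gbifapi_acceptedKey": [3189866, 3084923],
})

# Only one of the two keys is known to the nameserver
names = pd.DataFrame({
    "nbn_recommendedTaxonVersionKey": ["NBNSYS0000014604"],
    "gbifapi_acceptedKey": [3189866],
})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
merged = pd.merge(checklist, names, how="left",
                  on="gbifapi_acceptedKey", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only",
                       "gbifapi_acceptedScientificName"].tolist()
```

Note that a left join can also return more rows than the checklist contains when several nameserver identifiers share one accepted key, so comparing `len(merged)` with the number of checklist rows is a quick sanity check.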


These identifiers are needed to:
* set up the query that extracts data from Recorder (the INBO database)
* annotate the datasets extracted from Recorder with the corresponding GBIF `acceptedKey`

In [13]:
nameserver_identifiers_for_t0.to_csv('../data/vocabularies/nameserver-species-identifiers-for-t0.tsv', 
                                     index=False, sep='\t', na_rep='')

## Create a query from the template by filling in the identifiers of the species and surveys

In [14]:
# Read the INBO identifiers written in the previous step
species = pd.read_csv('../data/vocabularies/nameserver-species-identifiers-for-t0.tsv', delimiter='\t')
# Build a quoted, comma-separated list of identifiers for the query template,
# skipping species without a nameserver match
identifier_string = "','".join(species['nbn_recommendedTaxonVersionKey'].dropna().tolist())
identifier_string = "'" + identifier_string + "'"

In [15]:
surveys = pd.read_csv('../data/vocabularies/nameserver-survey-identifiers-for-t0s.tsv', delimiter='\t')
survey_string = ','.join(surveys['survey_id'].dropna().tolist())

In [16]:
with open('../mapping/dwc-occurrence-template.sql') as query:
    template_query = query.read()

In [17]:
with open('../mapping/dwc-occurrence.sql', 'w') as query:
    query.write(template_query.format(identifier_string, survey_string))
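The template file itself is not shown here, so the following is only a sketch of how the fill-in step works, assuming the template uses positional `str.format` placeholders. The SQL text, table name, column names and survey ids are hypothetical; only the identifier values come from the output earlier in this notebook.

```python
import pandas as pd

# Identifiers with one missing value, as produced by the left join
keys = pd.Series(["NBNSYS0000014604", None, "NHMSYS0001689380"])

# Quote each identifier and join with commas: 'A','B'
identifier_string = "'" + "','".join(keys.dropna().tolist()) + "'"

# Hypothetical survey ids, already joined into a comma-separated string
survey_string = "12,34"

# Hypothetical template: {0} receives the species identifiers, {1} the survey ids
template_query = (
    "SELECT * FROM occurrences "
    "WHERE taxon_key IN ({0}) AND survey_id IN ({1})"
)

filled_query = template_query.format(identifier_string, survey_string)
```

Any literal curly braces in a template processed with `str.format` have to be doubled (`{{` and `}}`), otherwise they are interpreted as placeholders.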

DONE!