In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
plt.style.use('seaborn-notebook')

# Filter invasive species from the recorder data

This notebook selects the invasive species, as defined by the [t0 values in the aggregated checklist](https://github.com/inbo/alien-species-checklist/tree/master/data/processed), from the [verified set of matched names as defined by the INBO nameserver](https://raw.githubusercontent.com/inbo/data-publication/master/datasets/name_server/data/processed/recorder_sql_unique_names_matched_verified.tsv). The resulting subset defines the set of identifiers to query in the INBO database.

### Recorder verified names

Reading in the recorder verified names:

In [6]:
recorder_matched_verified = pd.read_csv("https://raw.githubusercontent.com/inbo/data-publication/master/datasets/name_server/data/processed/recorder_sql_unique_names_matched_verified.tsv",
                                       delimiter="\t")

In [8]:
recorder_matched_verified.head()

Unnamed: 0,INBO_identifier,canonicalName,class,classKey,confidence,family,familyKey,genus,genusKey,kingdom,...,rank,scientificName,species,speciesKey,gbif_status,synonym,usageKey,acceptedKey,acceptedScientificName,nameMatchValidation
0,INBSYS0000005533,Cotoneaster villosulus,Magnoliopsida,220.0,100,Rosaceae,5015.0,Cotoneaster,3025563.0,Plantae,...,SPECIES,Cotoneaster villosulus (Rehder & E.H. Wilson) ...,Cotoneaster acutifolius,3025770.0,SYNONYM,True,3025564,3025565,Cotoneaster acutifolius var. villosulus Rehd. ...,ok: SYNONYM probably valid
1,NBNSYS0000022430,Leuctra hippopus,Insecta,216.0,100,Leuctridae,2998.0,Leuctra,2001760.0,Animalia,...,SPECIES,"Leuctra hippopus Kempny, 1899",Leuctra hippopus,2001976.0,ACCEPTED,False,2001976,2001976,"Leuctra hippopus Kempny, 1899",ok
2,NHMSYS0020110590,Taraxacum amarellum,Magnoliopsida,220.0,100,Asteraceae,3065.0,Taraxacum,8322495.0,Plantae,...,SPECIES,Taraxacum amarellum Kirschner & Štepánek,Taraxacum amarellum,5699378.0,DOUBTFUL,False,5699378,5699378,Taraxacum amarellum Kirschner & Štepánek,ok: DOUBTFUL
3,BFN0017900000007,Hyloniscus riparius,Malacostraca,229.0,100,Trichoniscidae,5764.0,Hyloniscus,2208506.0,Animalia,...,SPECIES,"Hyloniscus riparius (Koch, 1838)",Hyloniscus riparius,2208537.0,ACCEPTED,False,2208537,2208537,"Hyloniscus riparius (Koch, 1838)",ok
4,BMSSYS0000000001,Abrothallus cetrariae,,,99,Abrothallaceae,8401066.0,Abrothallus,2584324.0,Fungi,...,SPECIES,"Abrothallus cetrariae I. Kotte, 1909",Abrothallus cetrariae,7824496.0,ACCEPTED,False,7824496,7824496,"Abrothallus cetrariae I. Kotte, 1909",ok


### t0 alien species checklist

Reading in the species checklist:

In [9]:
alien_species_checklist = pd.read_csv("https://raw.githubusercontent.com/inbo/alien-species-checklist/master/data/processed/aggregated-checklist.tsv", 
                                      delimiter="\t")

In [12]:
alien_species_checklist["datasetName"].unique()

array(['plants', 'wrims', 'harmonia | plants | t0', 'harmonia | plants',
       'fishes', 'harmonia', 'harmonia | t0', 't0', 'fishes | harmonia',
       'macroinvertebrates | wrims', 'plants | t0', 'macroinvertebrates',
       'macroinvertebrates | t0 | wrims', 'fishes | wrims',
       'fishes | harmonia | t0', 'fishes | harmonia | wrims',
       'macroinvertebrates | t0',
       'harmonia | macroinvertebrates | t0 | wrims', 't0 | wrims',
       'harmonia | wrims', 'harmonia | plants | t0 | wrims'], dtype=object)

All rows for which `t0` is part of the `datasetName` should be included:

In [18]:
t0_alien_species_checklist = alien_species_checklist.loc[alien_species_checklist["datasetName"].str.contains("t0"), :]
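The selection above relies on a substring match, which works for the `datasetName` values listed earlier. As a hedged alternative, the sketch below splits the pipe-separated labels and tests for an exact `t0` token, so a label that merely contains the characters `t0` would not be selected. The toy data frame, its `name` column and the `has_label` helper are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-in for the checklist; the rows are illustrative, not real data
checklist = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "datasetName": ["harmonia | plants | t0", "plants", "t0", "fishes | wrims"],
})


def has_label(dataset_name, label="t0"):
    """Return True if `label` is one of the pipe-separated dataset labels."""
    return label in [part.strip() for part in dataset_name.split("|")]


# Boolean mask based on exact label matching instead of substring matching
has_t0 = checklist["datasetName"].apply(has_label)
t0_rows = checklist.loc[has_t0, :]
```

For the labels present in the aggregated checklist both approaches should select the same rows; the exact-token variant only matters if similar-looking labels are added later.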

In [19]:
t0_alien_species_checklist.head()

Unnamed: 0,gbifapi_acceptedScientificName,gbifapi_acceptedKey,kingdom,datasetName,euConcernStatus,firstObservationYearBE,firstObservationYearFL,invasionStage,habitat,nativeRange,introductionPathway,presenceBE,presenceFL,presenceWA,presenceBR,gbifapi_scientificName,index
11,Acer negundo L.,3189866,Plantae,harmonia | plants | t0,under consideration,1955.0,,established | invasive,terrestrial | to be determined by experts,N. America,escape > horticulture | escape > to be determi...,present,present | to be determined by experts,present | to be determined by experts,present | to be determined by experts,Acer negundo L.,23 | 24 | 2498 | 9552
88,"Alopochen aegyptiaca (Linnaeus, 1766)",2498252,Animalia,harmonia | t0,under consideration,1984.0,,invasive,freshwater,Africa,escape > pet/aquarium/terrarium species (inclu...,present,present | to be determined by experts,present | to be determined by experts,present | to be determined by experts,"Alopochen aegyptiaca (Linnaeus, 1766)",30 | 9553
93,Alternanthera philoxeroides (Mart.) Griseb.,3084923,Plantae,t0,under consideration,,,,,,,absent,to be determined by experts,to be determined by experts,to be determined by experts,Alternanthera philoxeroides (Mart.) Griseb.,9554
133,"Ameiurus melas (Rafinesque, 1820)",2340977,Animalia,t0,under consideration,,,,,,,absent,to be determined by experts,to be determined by experts,to be determined by experts,"Ameiurus melas (Rafinesque, 1820)",9555
225,Asclepias syriaca L.,3170247,Plantae,plants | t0,under consideration,1987.0,,unknown,to be determined by experts,N. America,escape > horticulture,present,present | to be determined by experts,absent | to be determined by experts,present | to be determined by experts,Asclepias syriaca L.,431 | 9556


In [23]:
len(t0_alien_species_checklist)

147

### Derivation of the nameserver identifiers corresponding to alien species

The goal is a list of the `gbifapi_acceptedKey` values from the t0 set of invasive species, together with the corresponding nameserver identifiers for each accepted key.

In [33]:
recorder_matched_verified[["INBO_identifier", "acceptedKey"]].dtypes

INBO_identifier    object
acceptedKey         int64
dtype: object

In [38]:
nameserver_identifiers_for_t0 = pd.merge(t0_alien_species_checklist[["gbifapi_acceptedScientificName", "gbifapi_acceptedKey"]], 
                                         recorder_matched_verified[["INBO_identifier", "acceptedKey"]], 
                                         how='left', 
                                         left_on="gbifapi_acceptedKey", 
                                         right_on="acceptedKey")

In [43]:
nameserver_identifiers_for_t0 = nameserver_identifiers_for_t0.drop("acceptedKey", axis=1)

In [47]:
nameserver_identifiers_for_t0.head()

Unnamed: 0,gbifapi_acceptedScientificName,gbifapi_acceptedKey,INBO_identifier
0,Acer negundo L.,3189866,NBNSYS0000014604
1,"Alopochen aegyptiaca (Linnaeus, 1766)",2498252,
2,Alternanthera philoxeroides (Mart.) Griseb.,3084923,
3,"Ameiurus melas (Rafinesque, 1820)",2340977,NHMSYS0000544615
4,Asclepias syriaca L.,3170247,INBSYS0000005932
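Because the merge is a left join, checklist species without a match in the nameserver remain in the table with an empty `INBO_identifier`, as for *Alopochen aegyptiaca* above. A minimal sketch of how such unmatched keys could be listed explicitly uses the `indicator` option of `pd.merge`. The small data frames below reuse three keys from the output above and are otherwise a simplification.

```python
import pandas as pd

# Three accepted keys taken from the t0 checklist output above
checklist = pd.DataFrame({"gbifapi_acceptedKey": [3189866, 2498252, 2340977]})

# Nameserver records for two of them (2498252 has no match)
names = pd.DataFrame({
    "INBO_identifier": ["NBNSYS0000014604", "NHMSYS0000544615"],
    "acceptedKey": [3189866, 2340977],
})

# indicator=True adds a `_merge` column telling where each row came from
merged = pd.merge(checklist, names, how="left",
                  left_on="gbifapi_acceptedKey", right_on="acceptedKey",
                  indicator=True)

# Keys present in the checklist only, i.e. without a nameserver identifier
unmatched = merged.loc[merged["_merge"] == "left_only",
                       "gbifapi_acceptedKey"].tolist()
```

Note that a left join can also return several rows for one checklist key when the nameserver holds multiple identifiers for the same accepted key.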


This table is needed to:
* set up the query to get data out of recorder (INBO database)
* enrich the datasets coming from recorder with the corresponding GBIF `acceptedKey`

In [55]:
nameserver_identifiers_for_t0.to_csv("../data/vocabularies/nameserver_species_identifiers_for_t0.tsv", 
                                     index=False, sep='\t', na_rep="")

### Create a query from the template by filling in the identifiers of the species and surveys

In [93]:
# Read the INBO_identifiers
species = pd.read_csv("../data/vocabularies/nameserver_species_identifiers_for_t0.tsv", delimiter="\t")
identifier_string = "','".join(species["INBO_identifier"].dropna().tolist())
identifier_string = "'" + identifier_string + "'"

In [94]:
surveys = pd.read_csv("../data/vocabularies/nameserver_survey_identifiers_for_t0s.tsv", delimiter="\t")
survey_string = ",".join(surveys["survey_id"].dropna().tolist())
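The two cells above build comma-separated strings for use in SQL `IN (...)` clauses: quoted for the text identifiers, unquoted for the surveys. A hedged sketch of a small helper that covers both cases is given below. The `sql_in_list` name and the numeric survey values are hypothetical; the helper also converts values to strings and drops duplicates, which the original cells do not do.

```python
import pandas as pd


def sql_in_list(values, quote=True):
    """Join the non-missing, unique values into a comma-separated string."""
    cleaned = values.dropna().astype(str).drop_duplicates().tolist()
    if quote:
        # Wrap each value in single quotes for text columns
        cleaned = ["'" + value + "'" for value in cleaned]
    return ",".join(cleaned)


# Identifiers taken from the output above, with one missing value
identifiers = pd.Series(["NBNSYS0000014604", None,
                         "NHMSYS0000544615", "INBSYS0000005932"])
identifier_string = sql_in_list(identifiers)

# Hypothetical numeric survey identifiers, joined without quotes
survey_ids = pd.Series([12, 15, 12])
survey_string = sql_in_list(survey_ids, quote=False)
```

The explicit string conversion avoids a `TypeError` from `str.join` in case the survey identifiers are read as integers.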

In [95]:
with open("../mapping/dwc-occurrence_template.sql") as query:
    template_query = query.read()

In [96]:
with open("../mapping/dwc-occurrence.sql", "w") as query:
    query.write(template_query.format(identifier_string, survey_string))
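The content of `dwc-occurrence_template.sql` is not shown here. As an illustration of the mechanism only, the sketch below fills a made-up template with two positional placeholders using `str.format`; the table and column names and the survey values are assumptions, not the actual INBO query.

```python
# Hypothetical template: {0} receives the species identifiers,
# {1} the survey identifiers
template_query = (
    "SELECT * FROM occurrences\n"
    "WHERE taxon_id IN ({0})\n"
    "  AND survey_id IN ({1})"
)

identifier_string = "'NBNSYS0000014604','NHMSYS0000544615'"
survey_string = "12,15"

# Positional arguments are substituted in order
filled_query = template_query.format(identifier_string, survey_string)
print(filled_query)
```

Since `str.format` interprets curly braces as placeholders, any literal `{` or `}` in a template would need to be doubled (`{{`, `}}`).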

DONE!