# Demo

This Jupyter Notebook demonstrates the changes to the BIM species database and how the taxonomic information can be maintained.

## Setup

Load functions and packages:

In [193]:
import os
import sys
module_path = os.path.abspath(os.path.join('../scripts'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [194]:
import sqlalchemy as db
import logging
import gbif_match
import vernacular_names
import exotic_status
import populate_annex_scientificname
from helpers import execute_sql_from_file, get_database_connection, get_config, setup_log_file

Define location of log file:

In [195]:
LOG_FILE_PATH = "./logs/transform_db.log"
setup_log_file(LOG_FILE_PATH)

Connect to (a copy of) the BIM database:

In [196]:
conn = get_database_connection()

Get access to the configuration details (server address, demo mode, etc.) stored in the config file `config.ini`:

In [197]:
config = get_config()

Is demo mode active?

In [198]:
demo = config.getboolean('demo_mode', 'demo')
demo

True
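
As an aside, the lookup above can be reproduced with the standard-library `configparser`. The sketch below assumes that `get_config()` returns a `ConfigParser`-like object (its implementation is not shown in this notebook) and parses a made-up inline configuration instead of `config.ini`. The names `SAMPLE_CONFIG`, `read_demo_flag` and `demo_flag` are hypothetical.

```python
import configparser

# Hypothetical stand-in for config.ini: only the section and option names
# ('demo_mode', 'demo') are taken from the cell above.
SAMPLE_CONFIG = """
[demo_mode]
demo = True
"""


def read_demo_flag(config_text: str) -> bool:
    """Parse an INI-formatted string and return the demo flag as a boolean."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # getboolean accepts values such as 'True', 'yes', 'on', '1' (case-insensitive)
    return parser.getboolean('demo_mode', 'demo')


demo_flag = read_demo_flag(SAMPLE_CONFIG)
print(demo_flag)
```

Using `getboolean` rather than `get` avoids the pitfall that the string `'False'` is truthy in Python.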

Define the location of the annex file and of its demo version, which contains a small but significant subset of annex names:

In [199]:
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.abspath('')))
# Full file with all names in official annexes
ANNEX_FILE_PATH = os.path.join(__location__, "../data/raw/official_annexes.csv")
# Annex demo version
ANNEX_FILE_PATH_DEMO = os.path.join(__location__, "../data/raw/official_annexes_demo.csv")

Define dataset key of the [_Global Register of Introduced and Invasive Species - Belgium_](https://www.gbif.org/dataset/6d9e952f-948c-4483-9807-575348147c7e):

In [200]:
GRIIS_DATASET_UUID = "6d9e952f-948c-4483-9807-575348147c7e"

Finally, define a SQLAlchemy connection to show changes of the database in this demo:

In [201]:
user = config.get('database', 'user')
pwd = config.get('database', 'password')
host = config.get('database', 'host')
port = config.get('database', 'port')
dbname = config.get('database', 'dbname')
db_conn = f'postgresql://{user}:{pwd}@{host}:{port}/{dbname}'
db.create_engine(db_conn)

Engine(postgresql://postgres:***@localhost:5433/postgres)
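
The f-string above works as long as the credentials contain no characters with a special meaning in URLs (such as `@` or `/`). A minimal sketch of one way to guard against that, using only the standard library, is given below. The function name `build_postgres_url` and the sample credentials are hypothetical and not part of the project.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, dbname):
    """Build a PostgreSQL connection URL, percent-encoding user and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{dbname}"
    )


# Made-up credentials: the '@' and '/' in the password would otherwise break the URL
example_url = build_postgres_url("postgres", "p@ss/word", "localhost", "5433", "postgres")
print(example_url)
```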

In [202]:
%load_ext sql
%sql $db_conn

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


'Connected: postgres@postgres'

## Create the new tables

Create the following tables:

1. `scientificname`: table with scientific names
2. `taxonomy`: taxonomy backbone of all scientific names. Table entirely populated with information from GBIF Backbone
3. `rank`: taxon ranks used in `taxonomy` 
4. `annexscientificname`: all names (scientific names or expressions) contained in official annexes
5. `vernacularname`: vernacular names of all taxa in `taxonomy`. Table entirely populated with information from GBIF
6. `vernacularnamesource`: title and datasetKey of the datasets containing the vernacular names in `vernacularname`

In [203]:
message = "Step 1: create the new tables"
print(message)
logging.info(message)
execute_sql_from_file(conn, 'create_new_tables.sql')

Step 1: create the new tables


<cursor object at 0x0000000009ED3900; closed: 0>

These tables can be dropped and recreated if errors occur in any of the following steps. See step 0 in the main migration script [`transform_db.py`](https://github.com/inbo/speciesbim/blob/annexscientificname/scripts/transform_db.py).

## Populate the `scientificname` table based on the actual content

We populate the `scientificname` table with taxa in `taxon`. From `taxon` we select the fields:
1. `id`
2. `acceptedname`
3. `scientificnameauthorship`

and we store them as:
1. `deprecatedTaxonId`
2. `scientificName`
3. `authorship`

We select only the taxa in use, i.e. taxa which are used in any of the linked tables.
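
The selection logic lives in `populate_scientificname.sql`, which is not reproduced in this notebook. As a rough illustration only, the sketch below mimics the idea on an in-memory SQLite database: copy the three fields from `taxon`, but only for rows referenced by another table. The table `linked_table`, its column `taxonid` and the row `Unused name` are hypothetical; the real database has several linked tables and runs on PostgreSQL.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE taxon (
        id INTEGER PRIMARY KEY, acceptedname TEXT, scientificnameauthorship TEXT);
    -- hypothetical table standing in for any table that references taxon
    CREATE TABLE linked_table (id INTEGER PRIMARY KEY, taxonid INTEGER);
    CREATE TABLE scientificname (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        deprecatedTaxonId INTEGER, scientificName TEXT, authorship TEXT);
    INSERT INTO taxon VALUES (1, 'Godronia cassandrae vaccinii', 'J.W. Groves');
    INSERT INTO taxon VALUES (2, 'Unused name', NULL);
    INSERT INTO taxon VALUES (40758, 'Elachista', NULL);
    INSERT INTO linked_table (taxonid) VALUES (1), (40758), (40758);
""")

# Copy only the taxa that are referenced at least once
conn_demo.execute("""
    INSERT INTO scientificname (deprecatedTaxonId, scientificName, authorship)
    SELECT t.id, t.acceptedname, t.scientificnameauthorship
    FROM taxon t
    WHERE EXISTS (SELECT 1 FROM linked_table l WHERE l.taxonid = t.id)
    ORDER BY t.id
""")

rows_in_use = conn_demo.execute(
    "SELECT deprecatedTaxonId, scientificName, authorship "
    "FROM scientificname ORDER BY id"
).fetchall()
print(rows_in_use)
```

`EXISTS` is used instead of a join so that a taxon referenced several times is still copied only once.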

In [204]:
message = "Step 2: populate the scientificname table based on the actual content"
print(message)
logging.info(message)
execute_sql_from_file(conn, 'populate_scientificname.sql',
                      {'limit': config.get('transform_db', 'scientificnames-limit')})

Step 2: populate the scientificname table based on the actual content


<cursor object at 0x0000000009ED3820; closed: 0>

Preview `scientificname` table:

In [205]:
%sql SELECT * FROM biodiv.scientificname LIMIT 10

 * postgresql://postgres:***@localhost:5433/postgres
10 rows affected.


id,taxonomyId,deprecatedTaxonId,scientificName,authorship,lastMatched,matchConfidence,matchType
1,,40758,Elachista,,,,
2,,1,Godronia cassandrae vaccinii,J.W. Groves,,,
3,,2,Phoma acuta phlogis,"(Roum.) Boerema et al., 1994",,,
4,,3,Puccinia sessilis convallariae-digraphidis,"Boerema & Hamers, 1988",,,
5,,4,Puccinia sessilis narcissi-orchidacearum,"Boerema & Kesteren, 1980",,,
6,,5,Aecidium rumicis form. acetosae,Oudem.,,,
7,,6,Amanita excelsa form. excelsa,,,,
8,,7,Amanita excelsa form. spissa,(Fr.),,,
9,,8,Amanita rubescens form. annulosulfurea,(Gillet) J.E. Lange,,,
10,,9,Amanita rubescens form. rubescens,,,,


Number of names in the `scientificname` table:

In [206]:
%sql SELECT COUNT(*) from biodiv.scientificname

 * postgresql://postgres:***@localhost:5433/postgres
1 rows affected.


count
6000


## Populate the `annexscientificname` table based on official annexes

As in the previous step, we populate the `annexscientificname` table with all names (scientific names or expressions) listed in the official annexes. These are stored in an external file: [`official_annexes.csv`](https://github.com/inbo/speciesbim/blob/master/data/raw/official_annexes.csv). Where possible, typos were corrected and names were simplified.

In this demo we use a small but significant subset of these names: [`official_annexes_demo.csv`](https://github.com/inbo/speciesbim/blob/master/data/raw/official_annexes_demo.csv).

In [207]:
message = "Step 3: populate the scientificnameannex table based on official annexes"
print(message)
logging.info(message)
if not demo:
    populate_annex_scientificname.populate_annex_scientificname(conn, config_parser=config,
                                                                annex_file=ANNEX_FILE_PATH)
else:
    populate_annex_scientificname.populate_annex_scientificname(conn, config_parser=config,
                                                                annex_file=ANNEX_FILE_PATH_DEMO)

Step 3: populate the scientificnameannex table based on official annexes
Columns in C:\Users\damiano_oldoni\Documents\INBO\repositories\speciesbim\notebooks\../data/raw/official_annexes_demo.csv: annex_code, scientific_name_original, scientific_name_corrected, authorship, page_number, remarks
Number of taxa listed in official annexes and ordinances: 15

Total number of taxa inserted in annexscientificname: 15
Table annexscientificname populated in 1s.


Preview the `annexscientificname` table:

In [208]:
%sql SELECT * FROM biodiv.annexscientificname

 * postgresql://postgres:***@localhost:5433/postgres
15 rows affected.


id,scientificnameId,scientificNameOriginal,scientificName,authorship,remarks,annexCode
1,,Falco peregrinus,Falco peregrinus,,,BXL-ORD-2012_Annex II.1
2,,Aconitum corsicum Gayer (Aconitum napellus subsp. corsicum),Aconitum napellus subsp. corsicum,,"Removed Aconitum corsicum Gayer, synonym of Aconitum napellus subsp. corsicum",BXL-ORD-2012_Annex II.2
3,,Culcita macrocarpa C. Presl,Culcita macrocarpa,C. Presl,authorship present,BXL-ORD-2012_Annex II.2
4,,Valeriana repens,,,Multiple authorships: Valeriana repens Wall. (synonym of Valeriana hardwickei Wall.) and Valeriana repens Host (synonym of Valeriana excelsa subsp. excelsa),BXL-ORD-2012_Annex II.3
5,,Martes Martes,Martes martes,,decapitalized specific epithet,BXL-ORD-2012_Annex II.4
6,,Leuciscus microlepis,Squalius microlepis,,"Changed to its related accepted taxon within Squalius genus. Leuciscus microlepis Bleeker, 1853 is a proparte synonym and classified in two different genera on GBIF, but the other accepted taxon is very unlikely to be true: Amblypharyngodon microlepis (Bleeker, 1853)",BXL-ORD-2012_Annex II.5
7,,Sus scofra,Sus scrofa,,scofra to scrofa,BXL-ORD-2012_Annex III
8,,Rana (Pelophylax) ridibunda,Rana ridibunda,,removed (Pelophylax),BXL-ORD-2012_Annex IV
9,,Fallopia japonica,Fallopia japonica,,,BXL-ORD-2012_Annex IV
10,,Pulsatilla grandis Wend. (Pulsatilla vulgaris subsp. grandis (Wend.) Zamels,Pulsatilla grandis,Wender.,"removed Pulsatilla vulgaris subsp. grandis (Wend.) Zamels, synonym of Pulsatilla grandis subsp. grandis",EUR-CON-BER_Annex I
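
The `remarks` column documents the kind of corrections applied to the original names. How the annex file was prepared is not described in this notebook, so the sketch below is only an illustration of how two of these corrections (removing a subgenus in parentheses and decapitalizing the specific epithet) could be expressed in code. Both function names are hypothetical.

```python
import re


def drop_subgenus(name: str) -> str:
    """Remove a parenthesised subgenus directly after the genus,
    e.g. 'Rana (Pelophylax) ridibunda' -> 'Rana ridibunda'."""
    return re.sub(r"^(\S+)\s+\([^)]*\)\s+", r"\1 ", name)


def lowercase_specific_epithet(name: str) -> str:
    """Lowercase the first letter of the second word, leaving the genus untouched,
    e.g. 'Martes Martes' -> 'Martes martes'."""
    parts = name.split(" ")
    if len(parts) >= 2:
        parts[1] = parts[1][:1].lower() + parts[1][1:]
    return " ".join(parts)


for original in ["Rana (Pelophylax) ridibunda", "Martes Martes", "Falco peregrinus"]:
    print(original, "->", lowercase_specific_epithet(drop_subgenus(original)))
```

Corrections such as `Sus scofra` to `Sus scrofa`, or the choice of an accepted taxon for `Leuciscus microlepis`, require taxonomic judgement and cannot be automated this way.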


## Populate `taxonomy` table with matches to GBIF Backbone and corresponding backbone tree

In this step all scientific names in the `scientificname` table are evaluated against the [_GBIF Backbone Taxonomy_](https://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c), or simply _GBIF Backbone_.
If a match occurs, the taxon and its parent taxa (up to the kingdom) are added to `taxonomy`. In case of a synonym, the corresponding accepted taxon is added as well.

In this demo, we will focus on a small subset of names:
- _Elachista_: no match to GBIF Backbone will be found
- _Triturus alpestris Laurenti, 1768_: synonym of _Ichthyosaura alpestris (Laurenti, 1768)_
- _Fallopia japonica_: exotic and synonym of _Reynoutria japonica_
- _Trentepholia_: accepted genus
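
The "parents first" insertion that appears in the log of this step can be sketched as follows. This is not the code of `gbif_match`: the dictionary `BACKBONE` is a hypothetical in-memory stand-in for lookups against GBIF (keys and names are copied from the output in this notebook, no network call is made), and synonym handling is left out.

```python
# Hypothetical stand-in for GBIF Backbone lookups: GBIF key -> (name, parent key)
BACKBONE = {
    1: ("Animalia", None),
    54: ("Arthropoda", 1),
    216: ("Insecta", 54),
    811: ("Diptera", 216),
    7634: ("Tipulidae", 811),
    9792230: ("Trentepholia (Mongoma)", 7634),
}


def add_taxon(gbif_key, taxonomy):
    """Insert the taxon with `gbif_key` into `taxonomy`, parents first.

    Returns the local id of the taxon, whether newly inserted or already present.
    """
    for row in taxonomy:
        if row["gbifId"] == gbif_key:
            return row["id"]  # already present: nothing to insert
    name, parent_key = BACKBONE[gbif_key]
    # A root taxon has no parent; otherwise make sure the parent exists first
    parent_id = add_taxon(parent_key, taxonomy) if parent_key is not None else None
    new_id = len(taxonomy) + 1
    taxonomy.append(
        {"id": new_id, "gbifId": gbif_key, "scientificName": name, "parentId": parent_id}
    )
    return new_id


taxonomy_rows = []
add_taxon(9792230, taxonomy_rows)
for row in taxonomy_rows:
    print(row)
```

Because parents are inserted before children, every `parentId` refers to a row that already exists, and a second taxon sharing part of the tree reuses the existing rows.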

In [209]:
message = "Step 4: populate taxonomy table with matches to GBIF Backbone and related backbone tree " +\
          "and update scientificname table"
print(message)
logging.info(message)
gbif_match.gbif_match(conn, config_parser=config, unmatched_only=False)

Step 4: populate taxonomy table with matches to GBIF Backbone and related backbone tree and update scientificname table
Number of taxa in scientificname table: 4 (demo mode)
Match names (scientificName + authorship) to GBIF Backbone (demo mode)
Timestamp used for this (whole) match process: 2020-10-06 22:32:13.516000
Try matching the "Elachista" name...
No match found for Elachista (id: 1).
Add match information (and taxonomiyId, if a match was found) to scientificname for Elachista (id: 1).
Try matching the "Trentepholia" name...
Recursively adding the taxon with GBIF key 9792230 (Trentepholia (Mongoma)) to the taxonomy table
According to GBIF, this is *not* a root taxon, we'll insert parents first
    Recursively adding the taxon with GBIF key 7634 (Tipulidae) to the taxonomy table
    According to GBIF, this is *not* a root taxon, we'll insert parents first
        Recursively adding the taxon with GBIF key 811 (Diptera) to the taxonomy table
        According to GBIF, this is *not*

    Recursively adding the taxon with GBIF key 2889173 (Reynoutria japonica Houtt.) to the taxonomy table
    According to GBIF, this is *not* a root taxon, we'll insert parents first
        Recursively adding the taxon with GBIF key 8420120 (Reynoutria Houtt.) to the taxonomy table
        This taxon already appears in the taxonomy table
        Taxon Reynoutria Houtt. already present in taxonomy (id = 19).
    According to GBIF, this is *not* a synonym (no accepted taxon to insert)
    Taxon Reynoutria japonica Houtt. inserted in taxonomy (id = 20, parentId = 19).
Taxon Fallopia japonica (Houtt.) Ronse Decraene inserted in taxonomy (id = 21, parentId = 19, acceptedId = 20).
Add match information (and taxonomiyId, if a match was found) to scientificname for Fallopia japonica (Houtt.) Ronse Decr. (id: 3355).
Number of matched names: 3/4 (75.00%).
Total number of insertions in the taxonomy table: 64
Match to GBIF Backbone performed in 13s.


In [210]:
%sql SELECT * FROM biodiv.taxonomy

 * postgresql://postgres:***@localhost:5433/postgres
21 rows affected.


id,gbifId,scientificName,rankId,acceptedId,parentId,exotic_be
1,1,Animalia,6,,,
2,54,Arthropoda,5,,1.0,
3,216,Insecta,4,,2.0,
4,811,Diptera,3,,3.0,
5,7634,Tipulidae,2,,4.0,
6,9792230,Trentepholia (Mongoma),1,,5.0,
7,44,Chordata,5,,1.0,
8,131,Amphibia,4,,7.0,
9,953,Caudata,3,,8.0,
10,6750,Salamandridae,2,,9.0,


When there is a match, the `taxonomyId` is populated in `scientificname` to make a connection between the two tables.

In [211]:
%%sql 
SELECT * FROM biodiv.scientificname 
WHERE "scientificName" IN (
    'Elachista', -- no match to GBIF Backbone will be found
    'Triturus alpestris', -- synonym of Ichthyosaura alpestris
    'Fallopia japonica', -- exotic and synonym of Reynoutria japonica
    'Trentepholia' -- accepted genus
)

 * postgresql://postgres:***@localhost:5433/postgres
4 rows affected.


id,taxonomyId,deprecatedTaxonId,scientificName,authorship,lastMatched,matchConfidence,matchType
24,6.0,40884,Trentepholia,,2020-10-06 22:32:13.516000+00:00,94,EXACT
1,,40758,Elachista,,2020-10-06 22:32:13.516000+00:00,99,NONE
95,13.0,1333,Triturus alpestris,"Laurenti, 1768",2020-10-06 22:32:13.516000+00:00,99,EXACT
3355,21.0,21322,Fallopia japonica,(Houtt.) Ronse Decr.,2020-10-06 22:32:13.516000+00:00,100,EXACT


Every time existing names are improved or new names are added, this step can be repeated using the parameter `unmatched_only=True` in `gbif_match()`. However, we suggest updating the entire table (`unmatched_only=False`) at least once a year in order to pick up taxonomic changes from the GBIF Backbone.
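
To make the difference between the two modes concrete, here is a small sketch of the selection it implies. It assumes that "unmatched" means "no `taxonomyId` yet"; the exact criterion used by `gbif_match()` is not shown in this notebook, and the function `names_to_match` is hypothetical.

```python
def names_to_match(rows, unmatched_only=True):
    """Return the rows to send to the matching step.

    Assumption: a name counts as unmatched when its taxonomyId is None.
    """
    if unmatched_only:
        return [row for row in rows if row["taxonomyId"] is None]
    return list(rows)


# Values copied from the scientificname preview above
sample_names = [
    {"id": 1, "scientificName": "Elachista", "taxonomyId": None},
    {"id": 24, "scientificName": "Trentepholia", "taxonomyId": 6},
    {"id": 95, "scientificName": "Triturus alpestris", "taxonomyId": 13},
]

print([row["scientificName"] for row in names_to_match(sample_names)])
print(len(names_to_match(sample_names, unmatched_only=False)))
```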

This step also populates the `rank` table:

In [212]:
%sql SELECT * FROM biodiv.rank

 * postgresql://postgres:***@localhost:5433/postgres
7 rows affected.


id,name
1,GENUS
2,FAMILY
3,ORDER
4,CLASS
5,PHYLUM
6,KINGDOM
7,SPECIES


## Vernacular names

In this step we look up all vernacular names recorded at GBIF for all taxa in `taxonomy`. This is done for French, Dutch and English, and the names are stored in the table `vernacularname` and its auxiliary table `vernacularnamesource`.
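
The split between the two tables can be sketched as follows: names are filtered by language, and each distinct source gets one row in a lookup table that the names refer to. The record structure, both function names and the German record are hypothetical, and the ids are illustrative and will not match those in the database.

```python
def filter_by_language(records, languages):
    """Keep only the records whose 'language' is one of `languages`."""
    wanted = set(languages)
    return [rec for rec in records if rec["language"] in wanted]


def assign_source_ids(records):
    """Map each distinct source title to an integer id, in order of first appearance."""
    source_ids = {}
    for rec in records:
        source_ids.setdefault(rec["source"], len(source_ids) + 1)
    return source_ids


# Names and sources taken from the log of this step, plus one made-up German record
sample_records = [
    {"name": "Animals", "language": "en", "source": "Phthiraptera.info"},
    {"name": "animals", "language": "en",
     "source": "Integrated Taxonomic Information System (ITIS)"},
    {"name": "animaux", "language": "fr",
     "source": "Integrated Taxonomic Information System (ITIS)"},
    {"name": "Tiere", "language": "de", "source": "Hypothetical source"},
    {"name": "dieren", "language": "nl", "source": "Belgian Species List"},
]

kept = filter_by_language(sample_records, ["fr", "nl", "en"])
sources = assign_source_ids(kept)
print([rec["name"] for rec in kept])
print(sources)
```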

In [213]:
message = "Step 5: populate vernacular names from GBIF for each entry in the taxonomy table"
print(message)
logging.info(message)
# list of 2-letters language codes (ISO 639-1)
languages = ['fr', 'nl', 'en']
vernacular_names.populate_vernacular_names(conn, config_parser=config, empty_only=False, filter_lang=languages)

Step 5: populate vernacular names from GBIF for each entry in the taxonomy table
We'll now load vernacular names for 21 entries in the taxonomy table. Languages: fr, nl, en
Now saving 'Animals'(en) for taxon with ID: 1 (source: Phthiraptera.info)
Now saving 'animals'(en) for taxon with ID: 1 (source: Integrated Taxonomic Information System (ITIS))
Now saving 'animaux'(fr) for taxon with ID: 1 (source: Integrated Taxonomic Information System (ITIS))
Now saving 'dieren'(nl) for taxon with ID: 1 (source: Belgian Species List)
Now saving 'animals'(en) for taxon with ID: 1 (source: World Register of Introduced Marine Species (WRiMS))
Now saving 'animals'(en) for taxon with ID: 1 (source: World Register of Marine Species)
Now saving 'animaux'(fr) for taxon with ID: 1 (source: World Register of Introduced Marine Species (WRiMS))
Now saving 'animaux'(fr) for taxon with ID: 1 (source: World Register of Marine Species)
Now saving 'dieren'(nl) for taxon with ID: 1 (source: World Register of Marin

Now saving 'duizendknoopfamilie'(nl) for taxon with ID: 18 (source: Belgian Species List)
Now saving 'knotweed'(en) for taxon with ID: 18 (source: Integrated Taxonomic Information System (ITIS))
Now saving 'renouées'(fr) for taxon with ID: 18 (source: Integrated Taxonomic Information System (ITIS))
Now saving 'Polygonacées'(fr) for taxon with ID: 18 (source: Database of Vascular Plants of Canada (VASCAN))
Now saving 'buckwheat family'(en) for taxon with ID: 18 (source: Database of Vascular Plants of Canada (VASCAN))
Now saving 'duizendknoopfamilie'(nl) for taxon with ID: 18 (source: World Register of Marine Species)
Now saving 'knotweed'(en) for taxon with ID: 19 (source: GRIN Taxonomy)
Now saving 'Japanese knotweed'(en) for taxon with ID: 20 (source: GRIN Taxonomy)
Now saving 'Mexican-bamboo'(en) for taxon with ID: 20 (source: GRIN Taxonomy)
Now saving 'Japanese knotweed'(en) for taxon with ID: 20 (source: Database of Vascular Plants of Canada (VASCAN))
Now saving 'renouée du Japon'(f

Show the tables `vernacularname` and `vernacularnamesource`:

In [214]:
%sql SELECT * FROM biodiv.vernacularname

 * postgresql://postgres:***@localhost:5433/postgres
108 rows affected.


id,taxonomyId,language,name,source
1,1,en,Animals,1
2,1,en,animals,2
3,1,fr,animaux,2
4,1,nl,dieren,4
5,1,en,animals,5
6,1,en,animals,6
7,1,fr,animaux,5
8,1,fr,animaux,6
9,1,nl,dieren,6
10,1,nl,dieren,5


In [215]:
%sql SELECT * FROM biodiv.vernacularnamesource

 * postgresql://postgres:***@localhost:5433/postgres
12 rows affected.


id,datasetKey,datasetTitle
1,71667154-257d-4d8e-a2a5-711aaf9b2d74,Phthiraptera.info
2,9ca92552-f23a-41a8-a140-01abaa31c931,Integrated Taxonomic Information System (ITIS)
4,39653f3e-8d6b-4a94-a202-859359c164c5,Belgian Species List
5,0a2eaf0c-5504-4f48-a47f-c94229029dc8,World Register of Introduced Marine Species (WRiMS)
6,2d59e5db-57ad-41ff-97d6-11f5fb264527,World Register of Marine Species
12,c33ce2f2-c3cc-43a5-a380-fe4526d63650,The Paleobiology Database
63,4dd32523-a3a3-43b7-84df-4cda02f15cf7,Checklist Dutch Species Register - Nederlands Soortenregister
64,1bd42c2b-b58a-4a01-816b-bec8c8977927,EUNIS Biodiversity Database
87,3f8a1297-3259-4700-91fc-acc4170b27ce,Database of Vascular Plants of Canada (VASCAN)
90,66dd0960-2d7d-46ee-a491-87b9adcfe7b1,GRIN Taxonomy


As for the previous step, we recommend updating these tables using the `empty_only=True` parameter in `populate_vernacular_names()` every time names are added or improved.

## Add exotic status of taxa in `taxonomy`

The exotic status (`True` or `False`) for all taxa in `taxonomy` is filled by consulting the GBIF checklist
[_Global Register of Introduced and Invasive Species - Belgium_](https://www.gbif.org/dataset/6d9e952f-948c-4483-9807-575348147c7e):

In [216]:
message = "Step 7: populate field exotic_be (values: True or False) from GRIIS checklist for each entry in " \
          "taxonomy table "
print(message)
logging.info(message)
exotic_status.populate_is_exotic_be_field(conn, config_parser=config, exotic_status_source=GRIIS_DATASET_UUID)


Step 7: populate field exotic_be (values: True or False) from GRIIS checklist for each entry in taxonomy table 
We'll now retrieve the GBIF checklist containing the exotic taxa in Belgium, datasetKey: 6d9e952f-948c-4483-9807-575348147c7e.
Retrieved 2891 exotic taxa in 35s.
We'll now update exotic_be field for 21 taxa of the taxonomy table.
Taxon Reynoutria japonica Houtt. (gbifId: 2889173) is exotic in Belgium.
    Taxon Fallopia japonica (Houtt.) Ronse Decraene (gbifId: 5334357) is exotic in Belgium.
2 exotic taxa found in taxonomy.
Field exotic_be updated for 21 taxa in taxonomy in 0.04s.


Exotic taxa:

In [217]:
%sql SELECT * FROM biodiv.taxonomy WHERE exotic_be IS TRUE

 * postgresql://postgres:***@localhost:5433/postgres
2 rows affected.


id,gbifId,scientificName,rankId,acceptedId,parentId,exotic_be
20,2889173,Reynoutria japonica Houtt.,7,,19,True
21,5334357,Fallopia japonica (Houtt.) Ronse Decraene,7,20.0,19,True


This step should be repeated every time the `taxonomy` table changes.
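
The update itself can be pictured as a set lookup. The sketch below assumes that taxa are compared by GBIF key, as the `gbifId` values in the log suggest; how `populate_is_exotic_be_field()` actually compares taxa is not shown in this notebook, and the function `flag_exotic` is hypothetical.

```python
def flag_exotic(taxonomy_rows, exotic_gbif_keys):
    """Return copies of the rows with exotic_be set to True or False.

    Assumption: a taxon is exotic when its gbifId is among the checklist keys.
    """
    keys = set(exotic_gbif_keys)
    return [{**row, "exotic_be": row["gbifId"] in keys} for row in taxonomy_rows]


# Rows and keys copied from the outputs above
sample_taxonomy = [
    {"id": 19, "gbifId": 8420120, "scientificName": "Reynoutria Houtt."},
    {"id": 20, "gbifId": 2889173, "scientificName": "Reynoutria japonica Houtt."},
    {"id": 21, "gbifId": 5334357,
     "scientificName": "Fallopia japonica (Houtt.) Ronse Decraene"},
]

flagged = flag_exotic(sample_taxonomy, [2889173, 5334357])
print([(row["id"], row["exotic_be"]) for row in flagged])
```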