# GAP range map evaluation: Worm-eating Warbler
This notebook evaluates the GAP range data for this species against occurrence data retrieved from databases such as GBIF via APIs.  The primary results are maps for visualization, columns added to a GAP range data .csv file downloaded from ScienceBase, and documentation of the decisions made and the data used.

See the README.md file in this repository for more information.

### Evaluation Parameters
This process requires decisions about how to filter records and which evaluation settings to use.  These decisions are documented in a database that can be queried in the other steps and consulted later.

### Occurrence Record Retrieval
The user identifies the sqlite db containing occurrence records that they want to use and the code connects with it.

### GAP Known Range Data Evaluation
GAP ranges exist in table form in a database and on ScienceBase.  Ranges can be compared to occurrence circles to find HUCs where GAP was correct about species' presence and where it was wrong about absence.  The results of those comparisons can be saved in columns in the range tables.

### Summary of Results
Summaries of interest are provided, such as how many HUCs were validated.

### Automated Range Delineation?
The occurrence record database populated for range evaluation could also be a source for range delineation: either by an expert or with an automated process.  Spatialite has a concave hull function that can be deployed.  I generated seasonal, yearly, and monthly range maps with that process, but they were of poor quality. 

### General Setup
Some parameters need to be declared, including a unique name for this evaluation.  

In [7]:
eval_id = 'tws2019'
gap_id = 'bwewax'
summary_name = 'wormie1' # a short, memorable name to use for file names etc.
inDir = '/users/nmtarr/documents/ranges/inputs/'
outDir = '/users/nmtarr/documents/ranges/outputs/'
occ_db = outDir + "bwewax0GBIFr13GBIFf4.sqlite"
wsw_db = '/users/nmtarr/code/occurrence-record-wrangler/parameters.sqlite'
parameters_db = '/users/nmtarr/code/GAP-range-evaluation/evaluations.sqlite'

In [1]:
%matplotlib inline
import sqlite3
import pprint
import pandas as pd
#import geopandas as gpd
pd.set_option('display.width', 600)
pd.set_option('display.max_colwidth', 60)
pd.set_option('display.max_rows', 100)
from IPython.display import Image
import config
import repo_functions as functions
from pygbif import occurrences
import matplotlib.pyplot as plt

## Evaluation Parameters

Evaluation parameters need to be set and justified in the cells within this section.  Values that are entered here will be used to update cells within the evaluations table stored in evaluations.sqlite. The decisions about what values to use are primarily documented here, not in the evaluations database.

Note that the evaluation ID and species' GAP code are set in the cell above, not here.
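
As a minimal sketch of how values like those set in this section might be written to the evaluations table, the example below uses an in-memory SQLite database in place of evaluations.sqlite. The table layout, the column names (`years`, `months`, `method`, `min_count`), and the helper `update_evaluation` are assumptions for illustration; the actual schema may differ.

```python
import sqlite3

# Hypothetical stand-in for evaluations.sqlite; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE evaluations (
                    evaluation_id TEXT PRIMARY KEY,
                    years TEXT, months TEXT, method TEXT, min_count INTEGER)""")
conn.execute("INSERT INTO evaluations (evaluation_id) VALUES (?)", ("tws2019",))


def update_evaluation(conn, evaluation_id, **params):
    """Write parameter values to the row for one evaluation (hypothetical helper)."""
    allowed = {"years", "months", "method", "min_count"}
    unknown = set(params) - allowed
    if unknown:
        raise ValueError("Unknown parameter(s): {0}".format(sorted(unknown)))
    # Column names come from the allow-list above; values are bound with "?".
    assignments = ", ".join("{0} = ?".format(name) for name in params)
    values = list(params.values()) + [evaluation_id]
    conn.execute("UPDATE evaluations SET {0} WHERE evaluation_id = ?".format(assignments),
                 values)
    conn.commit()


update_evaluation(conn, "tws2019", years="2014, 2015, 2016, 2017, 2018",
                  months="5,6,7", method="proportion in polygon", min_count=2)
row = conn.execute("SELECT * FROM evaluations WHERE evaluation_id = ?",
                   ("tws2019",)).fetchone()
print(row)
```

Binding values with `?` placeholders avoids quoting problems in free-text fields such as notes, while the allow-list keeps column names from being injected into the statement.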

### Years

In [None]:
years = "2014, 2015, 2016, 2017, 2018"

A fair number of records exist for this species, so I chose 5 recent years.

### Months

In [None]:
months = "5,6,7"

The species does not winter in the US, so only summer months are relevant.  The choice in months was informed by the Birds of North America online species account.
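
The year and month settings are stored as comma-separated strings, so they need to be parsed before they can be compared with record dates. The sketch below shows one way that could be done; the helper names and the sample dates are made up for illustration and are not part of this repository.

```python
from datetime import date

years = "2014, 2015, 2016, 2017, 2018"
months = "5,6,7"


def parse_int_list(text):
    """Turn a comma-separated string such as "5,6,7" into a set of integers."""
    return {int(part) for part in text.split(",") if part.strip()}


def in_season(record_date, years, months):
    """True when a record's date falls in an accepted year and month."""
    return (record_date.year in parse_int_list(years)
            and record_date.month in parse_int_list(months))


# Made-up dates for illustration only.
dates = [date(2016, 6, 12), date(2016, 9, 1), date(2012, 5, 30)]
kept = [d for d in dates if in_season(d, years, months)]
print(kept)
```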

### Evaluation Method

In [5]:
method = "proportion in polygon"

#### Minimum Count

In [6]:
min_count = 2

### 

In [5]:
connjup = sqlite3.connect(wsw_db)
cursorjup = connjup.cursor()
"""
table_names = cursorjup.execute("SELECT DISTINCT table_name FROM column_descriptions;").fetchall()
for table in table_names:
    table = table[0]
    tbl_desc = cursorjup.execute("SELECT description FROM table_descriptions WHERE table_name='{0}';".format(table)).fetchone()
    print("\n","-"*80,"\n\n",table.upper(),"TABLE\n",str(tbl_desc[0]),"\n")
    print(" COLUMNS")
    print(" Id, Column Name, Data Type, Not Null, Default, Unique")
    pprint.pprint(cursorjup.execute("PRAGMA table_info('{0}');".format(table)).fetchall())
    df = cursorjup.execute("SELECT * FROM column_descriptions WHERE table_name='{0}'".format(table)).fetchall()
    for row in df:
        print('\n' + row[1] + ' -- ' + row[2])
"""


# Occurrence Record Retrieval
Data can be retrieved through APIs, but it needs to be filtered.  Many options for filtering exist and are data source specific, so decisions have to be made about how to filter.  In this framework, I am proposing that filters be treated as unique entities (__filter sets__) that are stored and documented in the rng_eval_params database.  Doing so provides a way to link data sets used for range evaluation (or delineation) back to the decisions made when acquiring them.  Filter sets would be documented in tables specific to the data source and step of filtering; so far, gbif_requests and gbif_filters are such tables.  This example uses filter sets __'r001'__ (request filter) and __'f001'__ (post-request filter).
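
To illustrate what applying a post-request filter set could look like, here is a small sketch in plain Python. The uncertainty threshold and the omitted bases mirror the GBIFf4 filter set printed in this section, and `basisOfRecord` and `coordinateUncertaintyInMeters` are GBIF field names. The function `passes_filter`, the dictionary layout, and the sample records are assumptions for illustration, as is the choice to keep records that report no uncertainty.

```python
# A hypothetical post-request filter set, loosely modeled on a row of gbif_filters.
filter_set = {
    "max_coordinate_uncertainty": 5000,  # meters
    "bases_omit": {"PRESERVED_SPECIMEN", "FOSSIL_SPECIMEN"},
}


def passes_filter(record, filter_set):
    """Return True when a record satisfies the filter set."""
    if record.get("basisOfRecord") in filter_set["bases_omit"]:
        return False
    uncertainty = record.get("coordinateUncertaintyInMeters")
    # Assumption: records without a reported uncertainty are kept.
    if uncertainty is not None and uncertainty > filter_set["max_coordinate_uncertainty"]:
        return False
    return True


# Made-up records for illustration only.
records = [
    {"occ_id": 1, "basisOfRecord": "HUMAN_OBSERVATION", "coordinateUncertaintyInMeters": 30},
    {"occ_id": 2, "basisOfRecord": "PRESERVED_SPECIMEN", "coordinateUncertaintyInMeters": 10},
    {"occ_id": 3, "basisOfRecord": "HUMAN_OBSERVATION", "coordinateUncertaintyInMeters": 12000},
    {"occ_id": 4, "basisOfRecord": "HUMAN_OBSERVATION"},
]
kept = [r["occ_id"] for r in records if passes_filter(r, filter_set)]
print(kept)
```

Keeping the filter set as data rather than as hard-coded conditions is what allows it to be stored in a table and referenced by id.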

In [8]:
df1 = pd.read_sql_query(sql="SELECT * FROM gbif_requests WHERE request_id = ?", con=connjup, params=(request_id,))
print("REQUEST FILTER SET")
print(df1.loc[0])

REQUEST FILTER SET
request_id                     GBIFr16
source                            GBIF
lat_range                        27,41
lon_range                      -91,-75
years_range                  2014,2019
months_range                      1,12
geoissue                         False
coordinate                        True
continent                         None
creator                        N. Tarr
notes           From the last 5 years.
Name: 0, dtype: object


In [9]:
df2 = pd.read_sql_query(sql="SELECT * FROM gbif_filters WHERE filter_id = ?", con=connjup, params=(filter_id,))
print("POST REQUEST FILTER SET")
print(df2.loc[0])

POST REQUEST FILTER SET
filter_id                                                                          GBIFf4
dataset                                                                              GBIF
collection_codes_omit                                                                None
institutions_omit                                                                    None
has_coordinate_uncertainty                                                              0
max_coordinate_uncertainty                                                           5000
bases_omit                                            PRESERVED_SPECIMEN, FOSSIL_SPECIMEN
protocols_omit                                                                       None
sampling_protocols_omit                                                              None
issues_omit                   GEODETIC_DATUM_INVALID, INDIVIDUAL_COUNT_INVALID, MULTIM...
creator                                                                     

In [10]:
# Run a script that retrieves and filters
%run 'retrieve_occurrences.py'

"""
Needs to connect to a database and then create a shapefile that can be displayed below.
"""

SELECT gbif_id, common_name, scientific_name,
                    detection_distance_meters, gap_id
             FROM species_concepts
             WHERE species_id = 'bwewax0';


IndexError: list index out of range

At this point, records have been retrieved, filtered, buffered, and stored in a database.  They are displayed below on a map with the GAP range map.

In [None]:
gap_range2 = "{0}{1}_range_4326".format(inDir, gap_id)

shp1 = {'file': gap_range2, 'column': None, 'alias': 'GAP range map',
        'drawbounds': False, 'linewidth': .5, 'linecolor': 'y',
        'fillcolor': 'y', 'marker':'s'}

shp2 = {'file': '{0}{1}_circles'.format(outDir, summary_name), 'column': None,
        'alias': 'Occurrence records', 'drawbounds': True, 'linewidth': .75, 'linecolor': 'k',
        'fillcolor': None, 'marker':'o'}

# Display occurrence polygons
title="Worm-eating Warbler ({0})".format(years)
functions.MapShapefilePolygons(map_these=[shp1, shp2], title=title)

# GAP Known Range Data Evaluation
The first step in using occurrence records to evaluate the GAP range is to build another database that holds the GAP 12-digit HUCs and the species' range and that supports the necessary spatial queries.  The GAP range is retrieved from ScienceBase; the HUCs would be too if they were available as a shapefile.

In [1]:
#%run 'make_range_evaluation_db.py'

As with the filter sets, parameters for evaluation have to be decided upon.  I am proposing that evaluation parameter sets also be documented as unique entities in a database (i.e., rng_eval_params).  Each evaluation can be given a unique id that can be used in documentation, in file naming, and in the names of the columns that will be added to the GAP range table to record the results of the evaluation.  In this example, the evaluation_id is __eval_gbif1__.  Its definition is printed below.
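
The idea of naming result columns after the evaluation id can be sketched as follows. The table name `new_range` and the `_cnt` suffix follow the queries used later in this notebook; the helper `add_evaluation_columns` and the in-memory table are hypothetical and are not the code in 'eval_gbif1.py'.

```python
import re
import sqlite3


def add_evaluation_columns(conn, table, evaluation_id):
    """Add count and agreement columns named after an evaluation id."""
    # Identifiers cannot be bound as SQL parameters, so validate before formatting.
    for name in (table, evaluation_id):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError("Unsafe identifier: {0!r}".format(name))
    conn.execute("ALTER TABLE {0} ADD COLUMN {1}_cnt INTEGER".format(table, evaluation_id))
    conn.execute("ALTER TABLE {0} ADD COLUMN {1} INTEGER".format(table, evaluation_id))
    conn.commit()


# Hypothetical stand-in for the species' range database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE new_range (strHUC12RNG TEXT PRIMARY KEY, intGAPPresence INTEGER)")
add_evaluation_columns(conn, "new_range", "eval_gbif1")
columns = [row[1] for row in conn.execute("PRAGMA table_info('new_range')")]
print(columns)
```

Because each evaluation gets its own pair of columns, several evaluations can be recorded side by side in the same range table.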

In [2]:
df3 = pd.read_sql_query(sql="SELECT * FROM evaluations WHERE evaluation_id = 'eval_gbif1'", con=connjup)
df3.loc[0, 'years'] = df3.loc[0, 'years'][0:4] + '-' + df3.loc[0, 'years'][-4:]
print("\nEVALUATION PARAMETERS")
print(df3.loc[0])

NameError: name 'pd' is not defined

In [None]:
#%run 'eval_gbif1.py'
connr = sqlite3.connect(outDir + gap_id + '_range.sqlite')
df4 = pd.read_sql_query(sql="SELECT strHUC12RNG AS HUC12RNG, "
                                    "intGAPOrigin AS Origin, intGAPPresence AS Presence, "
                                    "intGAPReproduction AS Reproduction,"
                                    "intGAPSeason AS Season, eval_gbif1_cnt, eval_gbif1, "
                                    "validated_presence AS validated_pres FROM new_range WHERE eval_gbif1_cnt >=0", con=connr)
df4.set_index(["HUC12RNG"], inplace=True)
print("Tabular results of the evaluation")
print(df4)

In [None]:
print("Mapped results of the evaluation.")
shp3 = {'file': '{0}{1}_eval_gbif1'.format(outDir, gap_id), 'column': 'eval_gbif1',
        'alias': 'eval_gbif1', 'column_colors': {1: 'b', 0: 'r'}, 
        'value_alias': {1:'Agreement', 0:'Disagreement'}, 'drawbounds': False, 
        'marker': "s"}
title="Worm-eating Warbler -- eval_gbif1"
functions.MapShapefilePolygons(map_these=[shp1, shp3], title=title)

In [None]:
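conn_occ = sqlite3.connect(occ_db)
curs_occ = conn_occ.cursor()
# Each group is a set of records that share a coordinate and a date-time
dups0 = curs_occ.execute("SELECT COUNT(occ_id) FROM occurrences GROUP BY geom_xy4326, occurrenceDate;").fetchall()
dups1 = [x[0] for x in dups0]
dups2 = [x for x in dups1 if x > 1]
print(str(len(dups2)) + ' sets of records were duplicates based on xy coordinate and date-time')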
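
The duplicate check can also be pushed into SQL with a `HAVING` clause, which makes it easy to report both the number of duplicate groups and the number of surplus records. This is a sketch against a made-up in-memory table; the text values in `geom_xy4326` are stand-ins for geometry, and the real occurrence database may store that column differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrences "
             "(occ_id INTEGER PRIMARY KEY, geom_xy4326 TEXT, occurrenceDate TEXT)")
# Made-up records: three share a location and date, one is unique.
conn.executemany("INSERT INTO occurrences VALUES (?, ?, ?)", [
    (1, "POINT(-82.1 35.5)", "2016-06-01"),
    (2, "POINT(-82.1 35.5)", "2016-06-01"),
    (3, "POINT(-82.1 35.5)", "2016-06-01"),
    (4, "POINT(-80.0 36.0)", "2017-05-20"),
])

# Each returned row is one group of records sharing a location and date-time.
groups = conn.execute("""SELECT COUNT(occ_id) FROM occurrences
                         GROUP BY geom_xy4326, occurrenceDate
                         HAVING COUNT(occ_id) > 1""").fetchall()
duplicate_groups = len(groups)
# Records beyond the first in each group are the surplus copies.
surplus_records = sum(n - 1 for (n,) in groups)
print(duplicate_groups, surplus_records)
```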

After occurrence circles are attributed to HUCs, the results can be recorded in the species' range map table in terms of whether the two data sets agreed and whether they validate the GAP range data for any HUCs.  For each evaluation, columns are added for 1) how many records could be attributed to each HUC, 2) whether there is agreement at that HUC (1 for yes, 0 for no, 'None' for no data for that HUC), and 3) whether the GAP range has been validated by the evaluation.
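
The three result columns could be derived per HUC roughly as sketched below. The rules are assumptions made for illustration: a HUC is only scored when it has at least `min_count` records, and it counts as validated only when GAP listed it and the records agree. The function `evaluate_huc` and the HUC codes are hypothetical, and the evaluation script may apply different logic.

```python
def evaluate_huc(record_count, gap_present, min_count=2):
    """Return (count, agreement, validated) for one HUC.

    agreement: 1 when enough records fall in a HUC that GAP listed,
               0 when enough records fall in a HUC that GAP did not list,
               None when there are too few records to say either way.
    validated: 1 when agreement is 1, otherwise None (an assumed rule).
    """
    if record_count < min_count:
        return record_count, None, None
    if gap_present:
        return record_count, 1, 1
    return record_count, 0, None


# Made-up HUC codes mapped to (record count, whether GAP lists the HUC).
hucs = {
    "030101010101": (5, True),
    "030101010102": (3, False),
    "030101010103": (1, True),
}
results = {huc: evaluate_huc(count, present) for huc, (count, present) in hucs.items()}
for huc, row in results.items():
    print(huc, row)
```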

# Summary of Results

### How many records were available in the occurrence database?

### How many of the records were attributable to a HUC?

In [None]:
conn_rng = sqlite3.connect(outDir + gap_id + '_range.sqlite')
curs_rng = conn_rng.cursor()
hucable = curs_rng.execute("SELECT SUM(eval_gbif1_cnt) FROM new_range WHERE eval_gbif1_cnt >=0").fetchall()[0]
print(str(hucable[0]) + " records were attributable to a HUC.")

### How many HUCs had records attributed to them?

In [None]:
containers = curs_rng.execute("SELECT COUNT({0}_cnt) FROM new_range WHERE {0}_cnt >=0".format(eval_id)).fetchall()[0]
print(str(containers[0]) + " HUCs 'contained' records.")

### How many records were not used because of the minimum count?

In [None]:
conn_rng = sqlite3.connect(outDir + gap_id + '_range.sqlite')
curs_rng = conn_rng.cursor()
ones = curs_rng.execute("SELECT SUM(eval_gbif1_cnt) FROM new_range WHERE eval_gbif1_cnt = 1").fetchall()[0]
print(str(ones[0]) + " HUCs had occurrences but were not validated because they didn't meet the minimum.")

### How many HUCs were validated?

In [None]:
validated = curs_rng.execute("SELECT COUNT({0}) FROM new_range WHERE {0} = 1".format(eval_id)).fetchall()[0]
print(str(validated[0]) + " HUCs were validated.")

### How many HUCs did GAP appear to omit?

In [None]:
missed = curs_rng.execute("SELECT COUNT({0}) FROM new_range WHERE {0} = 0".format(eval_id)).fetchall()[0]
print(str(missed[0]) + " HUCs were missed.")

### What was the maximum number of occurrences attributable to a single HUC?

In [None]:
maxi = curs_rng.execute("SELECT MAX(eval_gbif1_cnt) FROM new_range").fetchall()[0]
print("The maximum number of records attributed to a HUC was " + str(maxi[0]))
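
The summary questions in this section can also be answered in a single pass over the range table. The sketch below runs against a made-up in-memory copy of `new_range` that has only the columns needed here; in SQLite a comparison such as `eval_gbif1 = 1` evaluates to 1, 0, or NULL, so summing it counts the matching rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE new_range "
             "(strHUC12RNG TEXT, eval_gbif1_cnt INTEGER, eval_gbif1 INTEGER)")
# Made-up rows: counts and agreement values are for illustration only.
conn.executemany("INSERT INTO new_range VALUES (?, ?, ?)", [
    ("A", 5, 1), ("B", 3, 0), ("C", 1, None), ("D", None, None)])

labels = ["records attributed", "HUCs with records", "HUCs validated",
          "HUCs GAP omitted", "max records in a HUC"]
values = conn.execute("""
    SELECT SUM(eval_gbif1_cnt),    -- NULL counts are ignored
           COUNT(eval_gbif1_cnt),  -- counts only non-NULL rows
           SUM(eval_gbif1 = 1),
           SUM(eval_gbif1 = 0),
           MAX(eval_gbif1_cnt)
    FROM new_range""").fetchone()
summary = dict(zip(labels, values))
for label, value in summary.items():
    print("{0}: {1}".format(label, value))
```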

# Next Steps
This is just a starting point that needs scrutiny.  It is currently hard-coded for a single species, so deploying it would require a redesign to accommodate large numbers of species, multiple users, many more occurrence records, and optimal methods for evaluation and range delineation, among other things.