N. Tarr 8/2/2018

Python 2.7

Description: Matches SGCN species to GAP species where possible.  Matching is attempted first on ITIS TSN; species that remain unmatched are then tried on scientific name and finally on common name.  Matches are inserted into the SGCN-GAP_matches collection in the bis database.  Unmatched SGCN species are printed at the end.

Notes: Running this code requires access to the gapproduction package, which requires arcpy and a connection to the NCSU GAP databases.

The matching relies on the GAP species name-to-ITIS crosswalk from the NCSU GAP databases, which is read here from a csv file.  Each GAP species is linked to an ITIS TSN.  However, some GAP species are not recognized by ITIS, or a direct match would require extensive review of the literature.  In such cases the GAP species may be linked to a "parent" TSN (e.g. for the genus).  These cases created some one-to-many matches (one SGCN to many GAP) that were sometimes resolved by matching on both TSN and scientific name.

In [1]:
import pandas as pd
pd.set_option('display.width', 1000)
import sys
sys.path.append("T:/BCB/")
import BISdbConfig
import pprint
execfile("T:/Scripts/AppendPaths27.py")
execfile("T:/Scripts/AppendGAPAnalysis.py")
import gapproduction as gp
import pymongo

Functions and variables for connecting to the bis database

In [2]:
host="54.91.95.139"
port=27017
username=BISdbConfig.user
password=BISdbConfig.password

def _connect_mongodb(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = pymongo.MongoClient(mongo_uri)
    else:
        conn = pymongo.MongoClient(host, port)
        
    return conn[db]

def get_collection_cursor(db, collection, host, port, username, password, query=None):
    """ Read from Mongo """

    # Avoid a shared mutable default; an empty filter returns every document
    if query is None:
        query = {}

    # Connect to MongoDB
    db = _connect_mongodb(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)
    
    return cursor
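
The connection helper above interpolates the username and password straight into the URI. As a hedged aside, PyMongo expects credentials in a URI to be percent-escaped, so a password containing characters such as `@`, `:` or `/` may fail to parse. The sketch below shows one way to escape them with the standard library; `build_mongo_uri` and the sample credentials are hypothetical and are not part of the original notebook.

```python
# Sketch only: percent-escape credentials before building the MongoDB URI.
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2.7


def build_mongo_uri(host, port, username, password, db):
    """Return a MongoDB URI with the username and password percent-escaped."""
    return 'mongodb://%s:%s@%s:%s/%s' % (
        quote_plus(username), quote_plus(password), host, port, db)


# Hypothetical credentials, chosen because they contain reserved characters.
uri = build_mongo_uri("54.91.95.139", 27017, "us@er", "p:w/d", "bis")
```

The resulting string could be passed to `pymongo.MongoClient` in place of the hand-formatted `mongo_uri`.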

## Create dictionaries to hold SGCN-GAP matches and non-matches

In [3]:
matches = {}
non_matches = {}

## Read in GAP Scientific Names, TSN codes, and species codes

In [4]:
# Read in csv of gap species, species codes, and TSN codes and notes
gapSpecies = pd.read_csv("T:/SALCC/GAPspecies2.csv")
gapSpecies.drop(["scientific_name", "common_name"],inplace=True, axis=1)
gapSpecies.drop(columns=["Unnamed: 0"], inplace=True)
gapSpecies['intITIScode'].fillna(0, inplace=True)
gapSpecies['intITIScode'] = [int(i) for i in gapSpecies['intITIScode']]

# Print first 5 records
gapSpecies.head()

,strUC,strSciName,strComName,intITIScode,intGapITISmatch,strGapITISmatch,memTaxaNotes
0,rLWWHx,Aspidoscelis gypsi,Little White Whiptail,174021,6.0,ITIS does not recognize this species concept. ...,Aspidoscelis gypsi (Little White Whiptail) is ...
1,rDENIx,Hypsiglena chlorophaea,Desert Nightsnake,174233,6.0,ITIS does not recognize this species concept. ...,Hypsiglena chlorophaea (Desert Nightsnake) is ...
2,rCHNIx,Hypsiglena jani,Chihuahuan Nightsnake,174233,6.0,ITIS does not recognize this species concept. ...,Hypsiglena jani (Chihuahuan Nightsnake) is rec...
3,rWNLIx,Xantusia wigginsi,Wiggins' Night Lizard,174092,6.0,ITIS does not recognize this species concept. ...,"Based on mitochondrial and nuclear DNA data, S..."
4,mPYRAc,Brachylagus idahoensis pop. 2,Pygmy Rabbit (Columbia Basin Population),552521,5.0,ITIS does not recognize this sub-population co...,NatureServe recognizes a distinct subpopulatio...
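
The first rows above already show how "parent" TSNs produce one-to-many links: Hypsiglena chlorophaea and Hypsiglena jani share TSN 174233. A minimal sketch of finding such shared TSNs is given below, using plain tuples copied from the table instead of the `gapSpecies` DataFrame; the variable names are illustrative only.

```python
from collections import defaultdict

# First three rows of the table above: (species code, scientific name, TSN).
gap_rows = [
    ("rLWWHx", "Aspidoscelis gypsi", 174021),
    ("rDENIx", "Hypsiglena chlorophaea", 174233),
    ("rCHNIx", "Hypsiglena jani", 174233),
]

# Index GAP species by TSN so each lookup is a single dictionary access.
by_tsn = defaultdict(list)
for code, sciname, tsn in gap_rows:
    by_tsn[tsn].append((code, sciname))

# TSNs linked to more than one GAP species are the one-to-many cases.
shared = {tsn: rows for tsn, rows in by_tsn.items() if len(rows) > 1}
```

Here `shared` holds only TSN 174233, with both Hypsiglena species listed under it.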


## Read in SGCN Scientific Names and TSN codes
Define the search query that filters unwanted records out of the bis database.  Aquatic species, invertebrates, and species listed only outside the lower 48 states (e.g. only in AK) are not needed.

In [5]:
# Get a list of state names for the query.  We will want to exclude non-lower 48 state species.
lower48s = list(set(gp.dictionaries.stateDict_To_Abbr.keys()) - set(["Northern Mariana Islands", "Alaska", "Puerto Rico", 
                                                           "American Samoa", "Guam", "Virgin Islands", "Hawaii"]))

# Get a list of taxonomic groups GAP modeled (terrestrial)
terrSpecies = ["Birds", "Reptiles", "Amphibians", "Mammals"]

query = {"State Summary.2015.State List": { "$in": lower48s }, 
         "State Summary.2005.State List": { "$in": lower48s },
         "Taxonomic Group": {"$in": terrSpecies}
         }
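
To make the filter's behavior concrete, the sketch below mimics the query in plain Python. It assumes (this is not confirmed by the notebook) that each document stores its state lists as arrays nested under "State Summary", and it covers only the `$in` behavior used here. The documents are invented for illustration. Under that assumption, every condition must hold at once, so a species needs a lower-48 state in both the 2005 and the 2015 list, while a species listed in Alaska and also in a lower-48 state still passes.

```python
def get_path(doc, dotted):
    """Follow a dotted path through nested dictionaries; None if absent."""
    for key in dotted.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def matches_in(value, allowed):
    """Simplified `$in`: a scalar must be in `allowed`; a list needs one member in it."""
    if value is None:
        return False
    values = value if isinstance(value, list) else [value]
    return any(v in allowed for v in values)


def passes(doc, query):
    """All top-level conditions must hold (implicit AND)."""
    return all(matches_in(get_path(doc, path), cond["$in"])
               for path, cond in query.items())


lower48_demo = ["North Carolina", "Virginia"]  # shortened stand-in for lower48s
query_demo = {
    "State Summary.2015.State List": {"$in": lower48_demo},
    "State Summary.2005.State List": {"$in": lower48_demo},
    "Taxonomic Group": {"$in": ["Birds", "Reptiles", "Amphibians", "Mammals"]},
}

# Hypothetical documents, not taken from the bis database.
both_years = {"State Summary": {"2015": {"State List": ["Alaska", "Virginia"]},
                                "2005": {"State List": ["Virginia"]}},
              "Taxonomic Group": "Birds"}
only_2015 = {"State Summary": {"2015": {"State List": ["Virginia"]}},
             "Taxonomic Group": "Birds"}
alaska_only = {"State Summary": {"2015": {"State List": ["Alaska"]},
                                 "2005": {"State List": ["Alaska"]}},
               "Taxonomic Group": "Mammals"}

results = [passes(d, query_demo) for d in (both_years, only_2015, alaska_only)]
```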

## Attempt to match with GAP

### TSN matches
Process each document in the collection.  For each document (record), check whether the SGCN TSN is associated with a GAP species.  If it is, collect that record as a match.  Note that some TSN matches are one-to-many (one SGCN TSN to many GAP species, sometimes because of subspecies).  In those cases, attempt to find a match on both TSN and scientific name.

In [6]:
# Get a cursor for the SGCN Synthesis collection from bis
SGCNcursor = get_collection_cursor(db='bis', collection='SGCN Synthesis', query=query, host=host, port=port, 
                                   username=username, password=password)

# Build the set of GAP TSNs once rather than on every iteration
gapTSNs = set(gapSpecies.intITIScode)

# Loop through SGCNs and collect matches on TSN
for x in SGCNcursor:
    tsn = int(x['Taxonomic Authority ID'][-6:])
    sciname = x['Taxonomy'][-1]['name']
    # If the tsn occurs in the GAP taxonomy, then proceed, otherwise collect as a non-match
    if tsn in gapTSNs:
        GAPtsnDF = gapSpecies[gapSpecies['intITIScode'] == tsn]
        # One-to-one TSN: If there's one occurrence of the tsn in GAP taxonomy, then collect as a match
        if len(GAPtsnDF) == 1:
            spInfo = {}
            spInfo["SGCN_id"] = sciname
            spInfo["SGCN-ITIS_TSN"] = tsn
            spInfo["SGCN_common_name"] = x['Common Name']
            spInfo["GAP-ITIS_TSN"] = tsn
            spInfo["GAP_scientific_name"] = GAPtsnDF["strSciName"].unique()[0]
            spInfo["GAP_common_name"] = gp.gapdb.NameCommon(GAPtsnDF["strUC"].unique()[0])
            spInfo["GAP_species_code"] = GAPtsnDF["strUC"].unique()[0]
            spInfo["GAP-ITIS_match_notes"] = GAPtsnDF["strGapITISmatch"].unique()[0]
            spInfo["GAP_taxa_notes"] = GAPtsnDF["memTaxaNotes"].unique()[0]
            spInfo["SGCN-GAP_match_basis"] = "TSN" 
            spInfo["SGCN-GAP_match_notes"] = "exact match"
            matches[sciname] = spInfo
        # One-to-many TSN: If the tsn occurs more than once in GAP taxonomy, match on scientific name and tsn,
        # provided that the sgcn scientific name occurs in GAP taxonomy.
        elif sciname in GAPtsnDF['strSciName'].unique():
            DF1 = GAPtsnDF[GAPtsnDF['strSciName'] == sciname]
            spInfo = {}
            spInfo["SGCN_id"] = sciname
            spInfo["SGCN-ITIS_TSN"] = tsn
            spInfo["SGCN_common_name"] = x['Common Name']
            spInfo["GAP-ITIS_TSN"] = tsn
            spInfo["GAP_scientific_name"] = DF1["strSciName"].unique()[0]
            spInfo["GAP_common_name"] = gp.gapdb.NameCommon(DF1["strUC"].unique()[0])
            spInfo["GAP_species_code"] = DF1["strUC"].unique()[0]
            spInfo["GAP-ITIS_match_notes"] = DF1["strGapITISmatch"].unique()[0]
            spInfo["GAP_taxa_notes"] = DF1["memTaxaNotes"].unique()[0]
            spInfo["SGCN-GAP_match_basis"] = "TSN and scientific name" 
            spInfo["SGCN-GAP_match_notes"] = "exact match"
            matches[sciname] = spInfo
        # If the tsn occurs in the GAP taxonomy but not with the sgcn scientific name, count as a mismatch.  These cases
        # don't seem to occur.
        else:
            spInfo = {}
            spInfo["SGCN_scientific_name"] = sciname
            spInfo["SGCN-ITIS_TSN"] = tsn
            spInfo["SGCN_common_name"] = x['Common Name']
            non_matches[sciname] = spInfo
    # Collect as a mismatch because the tsn isn't in the GAP taxonomy.
    else:
        spInfo = {}
        spInfo["SGCN_scientific_name"] = sciname
        spInfo["SGCN-ITIS_TSN"] = tsn
        spInfo["SGCN_common_name"] = x['Common Name']
        non_matches[sciname] = spInfo
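
The branching in the loop above can be summarized in two small helpers, sketched here under stated assumptions. `parse_tsn` assumes the 'Taxonomic Authority ID' string ends with the TSN digits (the loop's `[-6:]` slice additionally assumes exactly six digits and would misread a TSN of any other length); the example ID is hypothetical. `resolve_gap_row` restates the one-to-one and one-to-many rules using plain dictionaries in place of the `gapSpecies` DataFrame. Neither function is part of the original notebook.

```python
import re


def parse_tsn(authority_id):
    """Return the trailing run of digits in an authority ID as an int, else None."""
    found = re.search(r"(\d+)\s*$", authority_id)
    return int(found.group(1)) if found else None


def resolve_gap_row(tsn, sciname, gap_rows):
    """Pick the GAP row for an SGCN record, or (None, None) if unresolved.

    gap_rows is a list of dicts with 'intITIScode' and 'strSciName' keys.
    """
    candidates = [r for r in gap_rows if r["intITIScode"] == tsn]
    # One-to-one TSN: accept the single candidate.
    if len(candidates) == 1:
        return candidates[0], "TSN"
    # One-to-many TSN: accept only a candidate with the same scientific name.
    named = [r for r in candidates if r["strSciName"] == sciname]
    if named:
        return named[0], "TSN and scientific name"
    # No candidate, or several candidates and no name agreement.
    return None, None


rows = [
    {"strUC": "rLWWHx", "strSciName": "Aspidoscelis gypsi", "intITIScode": 174021},
    {"strUC": "rDENIx", "strSciName": "Hypsiglena chlorophaea", "intITIScode": 174233},
    {"strUC": "rCHNIx", "strSciName": "Hypsiglena jani", "intITIScode": 174233},
]

# Hypothetical ID format; only the trailing digits matter to parse_tsn.
tsn = parse_tsn("http://example.org/itis?tsn=174233")
row, basis = resolve_gap_row(tsn, "Hypsiglena jani", rows)
```

With these inputs, `row` is the Hypsiglena jani entry and `basis` is "TSN and scientific name".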

Show the number of records in each dictionary

In [7]:
print("SGCN Synthesis documents: " + str(len(list(get_collection_cursor(db='bis', collection='SGCN Synthesis', 
                                                                        query=query, host=host, port=port,
                                                                        username=username, password=password)))))
print("GAP species: 1719")
print("Matches: " + str(len(matches.keys())))
print("Non-matches: " + str(len(non_matches.keys())))

SGCN Synthesis documents: 1425
GAP species: 1719
Matches: 1062
Non-matches: 363


### Match on scientific names
See if any non-matches can be matched on scientific name.

In [8]:
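# Loop through non-matches and collect matches on scientific name.
# Iterate over a copy of the keys because entries are deleted inside the loop.
gapSciNames = set(gapSpecies.strSciName)
for x in list(non_matches.keys()):
    if x in gapSciNames and x not in matches.keys():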
        print(x)
        GAPsciDF = gapSpecies[gapSpecies['strSciName'] == x]
        spInfo = {}
        spInfo["SGCN_id"] = x
        spInfo["SGCN-ITIS_TSN"] = non_matches[x]["SGCN-ITIS_TSN"]
        spInfo["SGCN_common_name"] = non_matches[x]['SGCN_common_name']
        spInfo["GAP-ITIS_TSN"] = GAPsciDF["intITIScode"].unique()[0]
        spInfo["GAP_scientific_name"] = GAPsciDF["strSciName"].unique()[0]
        spInfo["GAP_common_name"] = gp.gapdb.NameCommon(GAPsciDF["strUC"].unique()[0])
        spInfo["GAP_species_code"] = GAPsciDF["strUC"].unique()[0]
        spInfo["GAP-ITIS_match_notes"] = GAPsciDF["strGapITISmatch"].unique()[0]
        spInfo["GAP_taxa_notes"] = GAPsciDF["memTaxaNotes"].unique()[0]
        spInfo["SGCN-GAP_match_basis"] = "scientific name" 
        spInfo["SGCN-GAP_match_notes"] = "exact match"
        matches[x] = spInfo
        #pprint.pprint(spInfo)
        del non_matches[x]

Onychoprion fuscatus
Onychoprion anaethetus
Chroicocephalus philadelphia
Leucophaeus atricilla
Grus canadensis
Tringa semipalmata
Kinosternon subrubrum
Hydrocoloeus minutus
Thalasseus maximus
Leucophaeus pipixcan
Cratogeomys castanops
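
Each matching pass builds the same eleven-field record by hand. A helper along the following lines could remove that repetition; it is a sketch, not the notebook's code. `build_match_record` is a hypothetical name, the GAP row is passed as a plain dictionary, and the GAP common name is passed in explicitly because `gp.gapdb.NameCommon` is only available with the gapproduction package. The example values are taken from the Ornate Box Turtle record printed later in this notebook, with `None` standing in for the missing taxa note.

```python
def build_match_record(sgcn_sciname, sgcn_tsn, sgcn_common, gap_row,
                       gap_common, basis, notes="exact match"):
    """Assemble one SGCN-GAP match record from an SGCN entry and a GAP row."""
    return {
        "SGCN_id": sgcn_sciname,
        "SGCN-ITIS_TSN": sgcn_tsn,
        "SGCN_common_name": sgcn_common,
        "GAP-ITIS_TSN": gap_row["intITIScode"],
        "GAP_scientific_name": gap_row["strSciName"],
        "GAP_common_name": gap_common,
        "GAP_species_code": gap_row["strUC"],
        "GAP-ITIS_match_notes": gap_row["strGapITISmatch"],
        "GAP_taxa_notes": gap_row["memTaxaNotes"],
        "SGCN-GAP_match_basis": basis,
        "SGCN-GAP_match_notes": notes,
    }


gap_row = {
    "strUC": "rOBTUx",
    "strSciName": "Terrapene ornata",
    "intITIScode": 173778,
    "strGapITISmatch": "ITIS valid match.",
    "memTaxaNotes": None,  # stand-in for the missing (nan) note
}

record = build_match_record("Terrapene ornata ornata", 208604, "Ornate Box Turtle",
                            gap_row, "Ornate Box Turtle", "common name")
```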


Show the number of records in each dictionary

In [9]:
print("SGCN Synthesis documents: " + str(len(list(get_collection_cursor(db='bis', collection='SGCN Synthesis', 
                                                                        query=query, host=host, port=port,
                                                                        username=username, password=password)))))
print("GAP species: 1719")
print("Matches: " + str(len(matches.keys())))
print("Non-matches: " + str(len(non_matches.keys())))

SGCN Synthesis documents: 1425
GAP species: 1719
Matches: 1073
Non-matches: 352


### Match on common names
See if any non-matches can be matched on common name.

In [10]:
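# Loop through non-matches and collect matches on common name.
# Iterate over a copy of the keys because entries are deleted inside the loop.
gapComNames = set(gapSpecies.strComName)
for x in list(non_matches.keys()):
    comname = non_matches[x]['SGCN_common_name']
    if comname in gapComNames and x not in matches.keys():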
        print("\n\t\t" + comname)
        GAPcomDF = gapSpecies[gapSpecies['strComName'] == comname]
        spInfo = {}
        spInfo["SGCN_id"] = x
        spInfo["SGCN-ITIS_TSN"] = non_matches[x]["SGCN-ITIS_TSN"]
        spInfo["SGCN_common_name"] = comname
        spInfo["GAP-ITIS_TSN"] = GAPcomDF["intITIScode"].unique()[0]
        spInfo["GAP_scientific_name"] = GAPcomDF["strSciName"].unique()[0]
        spInfo["GAP_common_name"] = gp.gapdb.NameCommon(GAPcomDF["strUC"].unique()[0])
        spInfo["GAP_species_code"] = GAPcomDF["strUC"].unique()[0]
        spInfo["GAP-ITIS_match_notes"] = GAPcomDF["strGapITISmatch"].unique()[0]
        spInfo["GAP_taxa_notes"] = GAPcomDF["memTaxaNotes"].unique()[0]
        spInfo["SGCN-GAP_match_basis"] = "common name" 
        spInfo["SGCN-GAP_match_notes"] = "exact match"
        matches[x] = spInfo
        pprint.pprint(spInfo)
        del non_matches[x]


		Blackburnian Warbler
{'GAP-ITIS_TSN': 178904,
 'GAP-ITIS_match_notes': 'ITIS valid match.',
 'GAP_common_name': u'Blackburnian Warbler',
 'GAP_scientific_name': 'Dendroica fusca',
 'GAP_species_code': 'bBLBWx',
 'GAP_taxa_notes': nan,
 'SGCN-GAP_match_basis': 'common name',
 'SGCN-GAP_match_notes': 'exact match',
 'SGCN-ITIS_TSN': 950037,
 'SGCN_common_name': u'Blackburnian Warbler',
 'SGCN_id': u'Setophaga fusca'}

		Ornate Box Turtle
{'GAP-ITIS_TSN': 173778,
 'GAP-ITIS_match_notes': 'ITIS valid match.',
 'GAP_common_name': u'Ornate Box Turtle',
 'GAP_scientific_name': 'Terrapene ornata',
 'GAP_species_code': 'rOBTUx',
 'GAP_taxa_notes': nan,
 'SGCN-GAP_match_basis': 'common name',
 'SGCN-GAP_match_notes': 'exact match',
 'SGCN-ITIS_TSN': 208604,
 'SGCN_common_name': u'Ornate Box Turtle',
 'SGCN_id': u'Terrapene ornata ornata'}

		Orange-crowned Warbler
{'GAP-ITIS_TSN': 178856,
 'GAP-ITIS_match_notes': 'ITIS valid match.',
 'GAP_common_name': u'Orange-crowned Warbler',
 'GAP_scient

{'GAP-ITIS_TSN': 176259,
 'GAP-ITIS_match_notes': 'ITIS valid match.',
 'GAP_common_name': u'Yellow Rail',
 'GAP_scientific_name': 'Coturnicops noveboracensis',
 'GAP_species_code': 'bYERAx',
 'GAP_taxa_notes': nan,
 'SGCN-GAP_match_basis': 'common name',
 'SGCN-GAP_match_notes': 'exact match',
 'SGCN-ITIS_TSN': 176260,
 'SGCN_common_name': u'Yellow Rail',
 'SGCN_id': u'Coturnicops noveboracensis noveboracensis'}

		Coal Skink
{'GAP-ITIS_TSN': 208882,
 'GAP-ITIS_match_notes': 'ITIS valid match.',
 'GAP_common_name': u'Coal Skink',
 'GAP_scientific_name': 'Plestiodon anthracinus',
 'GAP_species_code': 'rCOSKx',
 'GAP_taxa_notes': nan,
 'SGCN-GAP_match_basis': 'common name',
 'SGCN-GAP_match_notes': 'exact match',
 'SGCN-ITIS_TSN': 173962,
 'SGCN_common_name': u'Coal Skink',
 'SGCN_id': u'Eumeces anthracinus'}

		American Kestrel
{'GAP-ITIS_TSN': 175622,
 'GAP-ITIS_match_notes': 'ITIS valid match.',
 'GAP_common_name': u'American Kestrel',
 'GAP_scientific_name': 'Falco sparverius',
 'GA

{'GAP-ITIS_TSN': 178917,
 'GAP-ITIS_match_notes': 'ITIS valid match.',
 'GAP_common_name': u"Kirtland's Warbler",
 'GAP_scientific_name': 'Dendroica kirtlandii',
 'GAP_species_code': 'bKIWAx',
 'GAP_taxa_notes': nan,
 'SGCN-GAP_match_basis': 'common name',
 'SGCN-GAP_match_notes': 'exact match',
 'SGCN-ITIS_TSN': 950030,
 'SGCN_common_name': u"Kirtland's Warbler",
 'SGCN_id': u'Setophaga kirtlandii'}

		Evening Grosbeak
{'GAP-ITIS_TSN': 179173,
 'GAP-ITIS_match_notes': 'ITIS valid match.',
 'GAP_common_name': u'Evening Grosbeak',
 'GAP_scientific_name': 'Coccothraustes vespertinus',
 'GAP_species_code': 'bEVGRx',
 'GAP_taxa_notes': nan,
 'SGCN-GAP_match_basis': 'common name',
 'SGCN-GAP_match_notes': 'exact match',
 'SGCN-ITIS_TSN': 179175,
 'SGCN_common_name': u'Evening Grosbeak',
 'SGCN_id': u'Hesperiphona vespertina'}

		Oregon Slender Salamander
{'GAP-ITIS_TSN': 573579,
 'GAP-ITIS_match_notes': 'ITIS valid match.',
 'GAP_common_name': u'Oregon Slender Salamander',
 'GAP_scientific_

Show the number of records in each dictionary

In [11]:
print("SGCN Synthesis documents: " + str(len(list(get_collection_cursor(db='bis', collection='SGCN Synthesis', 
                                                                        query=query, host=host, port=port,
                                                                        username=username, password=password)))))
print("GAP species: 1719")
print("Matches: " + str(len(matches.keys())))
print("Non-matches: " + str(len(non_matches.keys())))

SGCN Synthesis documents: 1425
GAP species: 1719
Matches: 1146
Non-matches: 279
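
The common-name pass compares strings exactly, so names that differ only in capitalization or spacing may be missed (several SGCN common names in the non-matches listing are lowercase, e.g. 'lesser long-nosed bat'). A possible extension, not used in this notebook, is to compare normalized names. The sketch below uses two GAP rows from the table near the top; the SGCN spelling is hypothetical. Because this loosens the "exact match" criterion, records found this way would deserve a different match note and manual review.

```python
def normalize_name(name):
    """Lowercase a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def index_by_common_name(gap_rows):
    """Map normalized common name -> list of GAP rows carrying that name."""
    index = {}
    for row in gap_rows:
        index.setdefault(normalize_name(row["strComName"]), []).append(row)
    return index


gap_rows = [
    {"strUC": "rDENIx", "strComName": "Desert Nightsnake"},
    {"strUC": "rCHNIx", "strComName": "Chihuahuan Nightsnake"},
]
index = index_by_common_name(gap_rows)

# Hypothetical SGCN spelling that an exact comparison would not match.
hits = index.get(normalize_name("desert  nightsnake"), [])
```

A name that maps to more than one GAP row is ambiguous and is better left as a non-match.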


# Insert matches into bis database as a collection

Build a list of records to insert

In [12]:
# Add _id values to matches dictionaries
for j in matches:
    matches[j]["_id"] = j

# Put matches dictionaries into a list for insertion into the SGCN-GAP_matches collection
# NOTE: only the first 25 records are selected; inserting any more throws an error.
inserts = [matches[x] for x in matches.keys()][:25]
pprint.pprint(inserts)

[{'GAP-ITIS_TSN': 554145,
  'GAP-ITIS_match_notes': 'ITIS valid match.',
  'GAP_common_name': u'Stilt Sandpiper',
  'GAP_scientific_name': 'Calidris himantopus',
  'GAP_species_code': 'bSTSAx',
  'GAP_taxa_notes': nan,
  'SGCN-GAP_match_basis': 'TSN',
  'SGCN-GAP_match_notes': 'exact match',
  'SGCN-ITIS_TSN': 554145,
  'SGCN_common_name': u'Stilt Sandpiper',
  'SGCN_id': u'Calidris himantopus',
  '_id': u'Calidris himantopus'},
 {'GAP-ITIS_TSN': 550252,
  'GAP-ITIS_match_notes': 'ITIS valid match.',
  'GAP_common_name': u'Southern Torrent Salamander',
  'GAP_scientific_name': 'Rhyacotriton variegatus',
  'GAP_species_code': 'aSTOSx',
  'GAP_taxa_notes': nan,
  'SGCN-GAP_match_basis': 'TSN',
  'SGCN-GAP_match_notes': 'exact match',
  'SGCN-ITIS_TSN': 550252,
  'SGCN_common_name': u'Southern Torrent Salamander',
  'SGCN_id': u'Rhyacotriton variegatus',
  '_id': u'Rhyacotriton variegatus'},
 {'GAP-ITIS_TSN': 180008,
  'GAP-ITIS_match_notes': 'ITIS valid match.',
  'GAP_common_name': u'Bi

Insert the records
### NOTE: The insert fails unless the inserts list holds fewer than 26 records.  The cause is unknown.

In [13]:
# Get a connection to the database so records can be inserted
mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, "bis")
client = pymongo.MongoClient(mongo_uri)
bisdb = client.bis
collection = bisdb['SGCN-GAP_matches']

# Insert the records
y = collection.insert_many(inserts)

# Close the connection explicitly
client.close()
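
One possible cause of the failed insert, which this notebook does not confirm, is value types. Records matched on scientific or common name take their GAP TSN from `.unique()[0]`, which returns a NumPy scalar, not a plain Python integer, and BSON encoding can reject such scalars. The sketch below converts any value that exposes an `.item()` method (as NumPy scalars do) to a native Python value before insertion. `FakeInt64` is a stand-in so the example runs without NumPy, and `to_native` and `clean_record` are hypothetical helper names.

```python
def to_native(value):
    """Convert a NumPy-style scalar to a plain Python value via .item()."""
    item = getattr(value, "item", None)
    return item() if callable(item) else value


def clean_record(record):
    """Return a copy of a match record with every value converted."""
    return dict((key, to_native(val)) for key, val in record.items())


class FakeInt64(object):
    """Stand-in for numpy.int64 so the sketch runs without NumPy."""

    def __init__(self, n):
        self.n = n

    def item(self):
        return self.n


record = {"_id": "Terrapene ornata ornata", "GAP-ITIS_TSN": FakeInt64(173778)}
cleaned = clean_record(record)
```

If this is the cause, passing `[clean_record(r) for r in inserts]` to `insert_many` would be one thing to try. A second run of the insert would still fail on duplicate `_id` values unless the collection is emptied first.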

# Non-matches

In [14]:
pprint.pprint(non_matches)

{u'Acris blanchardi': {'SGCN-ITIS_TSN': 774220,
                       'SGCN_common_name': u"Blanchard's cricket frog",
                       'SGCN_scientific_name': u'Acris blanchardi'},
 u'Agkistrodon contortrix mokasen': {'SGCN-ITIS_TSN': 174297,
                                     'SGCN_common_name': u'Northern Copperhead',
                                     'SGCN_scientific_name': u'Agkistrodon contortrix mokasen'},
 u'Agkistrodon piscivorus leucostoma': {'SGCN-ITIS_TSN': 209504,
                                        'SGCN_common_name': u'Western Cottonmouth',
                                        'SGCN_scientific_name': u'Agkistrodon piscivorus leucostoma'},
 u'Amazona viridigenalis': {'SGCN-ITIS_TSN': 177806,
                            'SGCN_common_name': u'Red-crowned Amazon',
                            'SGCN_scientific_name': u'Amazona viridigenalis'},
 u'Ambystoma macrodactylum croceum': {'SGCN-ITIS_TSN': 195693,
                                      'SGCN_common_na

                                      'SGCN_scientific_name': u'Lampropeltis triangulum taylori'},
 u'Lanius ludovicianus mearnsi': {'SGCN-ITIS_TSN': 178522,
                                  'SGCN_common_name': u'San Clemente Loggerhead Shrike',
                                  'SGCN_scientific_name': u'Lanius ludovicianus mearnsi'},
 u'Lanius ludovicianus migrans': {'SGCN-ITIS_TSN': 178516,
                                  'SGCN_common_name': u'Migrant loggerhead shrike',
                                  'SGCN_scientific_name': u'Lanius ludovicianus migrans'},
 u'Leptonycteris yerbabuenae': {'SGCN-ITIS_TSN': 202347,
                                'SGCN_common_name': u'lesser long-nosed bat',
                                'SGCN_scientific_name': u'Leptonycteris yerbabuenae'},
 u'Lepus americanus cascadensis': {'SGCN-ITIS_TSN': 727811,
                                   'SGCN_common_name': u'Sierra Nevada snowshoe hare',
                                   'SGCN_scientific_name': 

                                  'SGCN_scientific_name': u'Sitta carolinensis aculeata'},
 u'Sorex cinereus fontinalis': {'SGCN-ITIS_TSN': 710033,
                                'SGCN_common_name': u'Maryland shrew',
                                'SGCN_scientific_name': u'Sorex cinereus fontinalis'},
 u'Sorex dispar blitchi': {'SGCN-ITIS_TSN': 710042,
                           'SGCN_common_name': u'Long-tailed Or Rock Shrew',
                           'SGCN_scientific_name': u'Sorex dispar blitchi'},
 u'Sorex hoyi winnemana': {'SGCN-ITIS_TSN': 710056,
                           'SGCN_common_name': u'southern pygmy shrew',
                           'SGCN_scientific_name': u'Sorex hoyi winnemana'},
 u'Sorex palustris punctulatus': {'SGCN-ITIS_TSN': 710086,
                                  'SGCN_common_name': u'southern water shrew',
                                  'SGCN_scientific_name': u'Sorex palustris punctulatus'},
 u'Spilogale gracilis amphialus': {'SGCN-ITIS_TSN': 727403