N. Tarr 7/19/2018

Python 2.7

Description: Match SGCN (Species of Greatest Conservation Need) species to GAP species, beginning with matches on ITIS TSN codes.  Running this code requires access to the gapproduction package, which requires arcpy and a connection to the NCSU GAP databases.

Note: This notebook needs the GAP species name-ITIS crosswalk from the NCSU GAP databases.  Here, it is read from a csv file.  Each GAP species is linked to an ITIS TSN.  However, some GAP species are not recognized by ITIS, and for others a direct match would require extensive literature review.  In such cases the GAP species may be linked to a "parent" TSN (e.g. the TSN for the genus).
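
To make the shared-TSN situation concrete, here is a minimal sketch that groups GAP species by TSN.  The species codes, names, and TSNs are copied from the crosswalk preview printed further down in this notebook; the variable and function names are hypothetical and are not part of gapproduction.

```python
from collections import defaultdict

# (GAP species code, GAP scientific name, ITIS TSN), copied from the crosswalk preview
crosswalk_rows = [
    ("rLWWHx", "Aspidoscelis gypsi", 174021),
    ("rDENIx", "Hypsiglena chlorophaea", 174233),
    ("rCHNIx", "Hypsiglena jani", 174233),
    ("rWNLIx", "Xantusia wigginsi", 174092),
    ("mPYRAc", "Brachylagus idahoensis pop. 2", 552521),
]

def species_by_tsn(rows):
    """Return {TSN: [GAP species codes]} for the given crosswalk rows."""
    grouped = defaultdict(list)
    for code, _name, tsn in rows:
        grouped[tsn].append(code)
    return dict(grouped)

def shared_tsns(rows):
    """Return only the TSNs that more than one GAP species is linked to."""
    return {tsn: codes for tsn, codes in species_by_tsn(rows).items()
            if len(codes) > 1}

print(shared_tsns(crosswalk_rows))
```

In this sample both Hypsiglena species are linked to TSN 174233, so an SGCN record carrying that TSN cannot be assigned to a single GAP species on TSN alone.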

In [22]:
import pandas as pd
import bis
import sys
import requests
import pprint
sys.path.append('C:/Program Files (x86)/ArcGIS/Desktop10.4/bin64')
sys.path.append('C:/Program Files (x86)/ArcGIS/Desktop10.4/ArcPy')
sys.path.append('C:/Program Files (x86)/ArcGIS/Desktop10.4/ArcToolBox/Scripts')
sys.path.append('P:/Proj3/USGap/Scripts/GAPAnalysis')
sys.path.append('P:/Proj3/USGap/Scripts/GAPProduction')
sys.path.append('P:/Proj3/USGap/Scripts/')
import gapproduction as gp
import pymongo

Functions and variables for connecting to the bis database

In [23]:
host="54.91.95.139"
port=27017
username=raw_input("username: ")
# Read the password with getpass so it is not echoed into the notebook output
import getpass
password=getpass.getpass("password: ")

def _connect_mongodb(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = pymongo.MongoClient(mongo_uri)
    else:
        conn = pymongo.MongoClient(host, port)
        
    return conn[db]

def get_collection_cursor(db, collection, host, port, username, password, query=None):
    """ Read from Mongo """

    # Connect to MongoDB
    db = _connect_mongodb(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection; an empty filter returns every document
    cursor = db[collection].find(query or {})
    
    return cursor
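
One caveat about the URI built in `_connect_mongodb`: the username and password are interpolated as-is, so credentials containing reserved characters such as "@", ":" or "/" can yield a URI that does not parse as intended.  A minimal sketch of percent-encoding them first is shown here.  `build_mongo_uri` is a hypothetical helper, not part of pymongo or bis, and the credentials are made up.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2.7

def build_mongo_uri(username, password, host, port, db):
    """Build a MongoDB URI with percent-encoded credentials (hypothetical helper)."""
    return "mongodb://%s:%s@%s:%s/%s" % (
        quote_plus(username), quote_plus(password), host, port, db)

# Made-up credentials, chosen to contain reserved characters
print(build_mongo_uri("user", "p@ss:w/rd", "localhost", 27017, "bis"))
```

The resulting string could be passed to `pymongo.MongoClient` in place of the hand-built `mongo_uri`.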



## Create dictionaries to hold SGCN-GAP matches and non-matches

In [24]:
matches = {}
non_matches = {}

## Read in GAP Scientific Names, TSN codes, and species codes
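Read in as a table with fields "strUC" (GAP species code), "strSciName" (GAP scientific name), and "intITIScode" (ITIS TSN), plus the GAP-ITIS match fields and taxa notes.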

In [25]:
# Read in csv of GAP species, species codes, and TSN codes
gapSpecies = pd.read_csv("T:/SALCC/GAPspecies2.csv")
# Drop name columns that are not needed for matching
gapSpecies.drop(["scientific_name", "common_name", "strComName"], inplace=True, axis=1)
# Drop the unnamed column; "columns=" already implies axis=1
gapSpecies.drop(columns=["Unnamed: 0"], inplace=True)
# Species without a TSN get 0 so the column can be stored as integers
gapSpecies['intITIScode'] = gapSpecies['intITIScode'].fillna(0).astype(int)

# Print first 5 records
gapSpecies.head()

Unnamed: 0,strUC,strSciName,intITIScode,intGapITISmatch,strGapITISmatch,memTaxaNotes
0,rLWWHx,Aspidoscelis gypsi,174021,6.0,ITIS does not recognize this species concept. ...,Aspidoscelis gypsi (Little White Whiptail) is ...
1,rDENIx,Hypsiglena chlorophaea,174233,6.0,ITIS does not recognize this species concept. ...,Hypsiglena chlorophaea (Desert Nightsnake) is ...
2,rCHNIx,Hypsiglena jani,174233,6.0,ITIS does not recognize this species concept. ...,Hypsiglena jani (Chihuahuan Nightsnake) is rec...
3,rWNLIx,Xantusia wigginsi,174092,6.0,ITIS does not recognize this species concept. ...,"Based on mitochondrial and nuclear DNA data, S..."
4,mPYRAc,Brachylagus idahoensis pop. 2,552521,5.0,ITIS does not recognize this sub-population co...,NatureServe recognizes a distinct subpopulatio...


## Read in SGCN Scientific Names and TSN codes, attempt to match with GAP
Define the search query for filtering out unwanted records from the bis database

In [26]:
query = {"Taxonomic Group": {"$in": ["Birds", "Reptiles", "Amphibians", "Mammals"]}}
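
For readers unfamiliar with MongoDB filters, the following plain-Python sketch mimics what this `$in` query selects when "Taxonomic Group" holds a single string.  The sample documents are invented for illustration (including the "Fish" group name), and `passes_filter` is a hypothetical helper.

```python
WANTED_GROUPS = ["Birds", "Reptiles", "Amphibians", "Mammals"]

def passes_filter(doc, groups=WANTED_GROUPS):
    """Mimic {"Taxonomic Group": {"$in": groups}} for one document."""
    # Documents without the field are not selected
    return doc.get("Taxonomic Group") in groups

# Invented documents, for illustration only
sample_docs = [
    {"Common Name": "example bird", "Taxonomic Group": "Birds"},
    {"Common Name": "example fish", "Taxonomic Group": "Fish"},
    {"Common Name": "example without a group"},
]

kept = [d["Common Name"] for d in sample_docs if passes_filter(d)]
print(kept)
```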

Process each document in the collection.  For each document (record), first check whether the SGCN TSN is associated with a GAP species.  If exactly one GAP species shares that TSN, record the pair as a match.  Note that some TSNs match one-to-many (one SGCN TSN to several GAP species, sometimes because of subspecies).  Those cases aren't resolved here; they count as non-matches.
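
The rule described above can be tried without a database connection.  This is a minimal sketch under stated assumptions: `classify_by_tsn` and the output keys are hypothetical, the two GAP TSNs come from the crosswalk preview, and the SGCN names and the TSN 999999 are placeholders.

```python
def classify_by_tsn(sgcn_records, gap_codes_by_tsn):
    """Split SGCN records into matches and non-matches on TSN.

    sgcn_records: list of (scientific name, TSN) tuples
    gap_codes_by_tsn: {TSN: [GAP species codes linked to that TSN]}
    """
    matches, non_matches = {}, {}
    for sciname, tsn in sgcn_records:
        gap_codes = gap_codes_by_tsn.get(tsn, [])
        if len(gap_codes) == 1:
            # Exactly one GAP species shares the TSN: a match
            matches[sciname] = {"TSN": tsn, "GAP_species_code": gap_codes[0]}
        else:
            # Zero or several GAP species share the TSN: left unresolved
            non_matches[sciname] = {"TSN": tsn, "GAP_candidates": list(gap_codes)}
    return matches, non_matches

# TSNs 174021 and 174233 are from the crosswalk preview; the rest is placeholder data
gap_codes_by_tsn = {174021: ["rLWWHx"], 174233: ["rDENIx", "rCHNIx"]}
sgcn_records = [("Species A", 174021), ("Species B", 174233), ("Species C", 999999)]

matched, unmatched = classify_by_tsn(sgcn_records, gap_codes_by_tsn)
print(sorted(matched))
print(sorted(unmatched))
```

Keeping the candidate GAP codes on each non-match is a design choice made for this sketch; it makes the one-to-many cases easier to review by hand later.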

In [None]:
# Get a cursor for the SGCN Synthesis collection from bis
SGCNcursor = get_collection_cursor(db='bis', collection='SGCN Synthesis', query=query, host=host, port=port, 
                                   username=username, password=password)

# Loop through SGCN's and collect matches on TSN
for x in SGCNcursor:
    # NOTE: assumes the TSN is the last six characters of the authority ID
    tsn = int(x['Taxonomic Authority ID'][-6:])
    sciname = x['Taxonomy'][-1]['name']
    # Start a fresh record for every document so no values carry over between iterations
    spInfo = {}
    spInfo["SGCN_scientific_name"] = sciname
    spInfo["SGCN-ITIS_TSN"] = tsn
    spInfo["SGCN_common_name"] = x['Common Name']
    # GAP rows that share this TSN (empty if the TSN is not in the GAP table)
    GAPtsnDF = gapSpecies[gapSpecies['intITIScode'] == tsn]
    if len(GAPtsnDF) == 1:
        spInfo["GAP-ITIS_TSN"] = tsn
        spInfo["GAP_scientific_name"] = GAPtsnDF["strSciName"].unique()[0]
        spInfo["GAP_common_name"] = gp.gapdb.NameCommon(GAPtsnDF["strUC"].unique()[0])
        spInfo["GAP_species_code"] = GAPtsnDF["strUC"].unique()[0]
        spInfo["GAP-ITIS_match_notes"] = GAPtsnDF["strGapITISmatch"].unique()[0]
        spInfo["GAP_taxa_notes"] = GAPtsnDF["memTaxaNotes"].unique()[0]
        spInfo["notes"] = ""
        matches[sciname] = spInfo
    else:
        # No GAP species, or more than one, shares this TSN
        non_matches[sciname] = spInfo
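
The loop takes the TSN from the last six characters of 'Taxonomic Authority ID', which only works when the TSN has exactly six digits.  A minimal sketch of a more tolerant parser is given here, assuming the TSN is whatever run of digits ends the string.  The format of the real field is not shown in this notebook, so the example strings and the helper name `trailing_tsn` are hypothetical.

```python
import re

def trailing_tsn(authority_id):
    """Return the run of digits at the end of the string as an int, or None."""
    found = re.search(r"(\d+)\s*$", authority_id)
    return int(found.group(1)) if found else None

# Hypothetical ID strings; the real 'Taxonomic Authority ID' format may differ
six_digits = "https://example.org/tsn=174021"
seven_digits = "https://example.org/tsn=1234567"

print(trailing_tsn(six_digits))
print(trailing_tsn(seven_digits))
# Fixed-width slicing silently drops the leading digit of a seven-digit value
print(int(seven_digits[-6:]))
```

Returning None when no digits are found lets the caller decide how to record such a document instead of raising inside the loop.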


# Display results
Show the number of records in each dictionary

In [None]:
print("SGCN Synthesis documents: " + str(len(list(get_collection_cursor(db='bis', collection='SGCN Synthesis', 
                                                                        query=query, host=host, port=port,
                                                                        username=username, password=password)))))
print("GAP species: 1719")
print("Matches: " + str(len(matches.keys())))
print("Non-matches: " + str(len(non_matches.keys())))

# Non-matches

In [None]:
pprint.pprint(non_matches)

# Matches

In [None]:
pprint.pprint(matches)