# GAP Range and Presence Coding Reference
N. Tarr
11/7/22

This document supports the development and demonstration of rules for GAP range and presence code assignment.  Multiple types of information are assessed (opinion, occurrence data, and previous presence codes), and they sometimes conflict, so rules for handling cases of disagreement are needed.  In the future, rules for handling model predictions would be incorporated as well.

In this notebook, a dataframe of all possible combinations of codes is built, and then a presence column is filled out according to clearly specified rules.  The results are then tested and can be referenced for understanding presence code assignment.

Additionally, opinions must be reconciled and cleaned up prior to application.  The rules and assumptions for those processes are also presented in this document.

__Sections__
1. Information Hierarchy -- A dataframe is built that specifies rules of dominance for various information sources.
2. Opinion Processing -- Cleaning and reconciliation processes are detailed and a dataframe is built that shows the opinion weight values for all expert rank and confidence values, as well as which would win or lose against a previous map.  Expert opinion confidence values are collected from the expert when they submit opinions.
3. Coding Rules -- First, a dataframe with all types of information combinations is built.  2011-2015 is used as an example time period.  Second, that dataframe is populated according to rules.  Finally, tests are run to ensure all cases were accounted for and certain values are present where they should be.


__Notes__
* Rules are applied here to a single time period (2015) but in the range compiler script, multiple periods are assessed.

* Rules are applied in python here for clarity, but SQL is used in the range compiler for better speed.


## Section 1: Information Hierarchy
The following table illustrates the information hierarchy that the range compilation processes adhere to.  Documented refers to documented presence or range based upon observational data.  Opinion is collected from experts or GAP staff.  The temporal grain of the processes means that compiled data from previous time periods or GAP version 1 range data may be available. 

In [2]:
# Show the source dominance hierarchy
import pandas as pd
pd.options.display.max_rows = 500
pd.options.display.max_columns = 10
pd.options.display.width = 200
import random
import itertools
import numpy as np
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
display(HTML("<style>.output_result { max-width:100% !important; }</style>"))
display(HTML("<style>.prompt { display:none !important; }</style>"))

# Silence future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Make a table of pairwise comparisons
sources = ["opinion", "documented", "version 1", "previous time period"]

vs = []
for p in list(itertools.permutations(sources,2)):
    p = list(p)
    p.sort()
    if p not in vs:
        vs.append(p)
df3 = pd.DataFrame(vs, columns=['Source1', 'Source2'])

# Identify winners
# Documented occurrence always wins
df3.loc[df3["Source1"] == 'documented', "Dominant"] = "documented"

# Version 1 always loses
df3.loc[df3["Source2"] == 'version 1', "Dominant"] = df3["Source1"]

# previous time period vs opinion depends on opinion score (rank*(confidence/10))
df3.loc[(df3["Source1"] == "opinion") &
          (df3["Source2"] == "previous time period"),
          "Dominant"] = "opinion where rank*(confidence/10) > 2"

df3

| | Source1 | Source2 | Dominant |
|---|---|---|---|
| 0 | documented | opinion | documented |
| 1 | opinion | version 1 | opinion |
| 2 | opinion | previous time period | opinion where rank*(confidence/10) > 2 |
| 3 | documented | version 1 | documented |
| 4 | documented | previous time period | documented |
| 5 | previous time period | version 1 | previous time period |
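
The hierarchy in the table can also be read as a small decision function.  The following is a minimal sketch, not the range compiler's implementation: the function name and arguments are hypothetical, and letting the previous time period win whenever the opinion weight does not exceed the threshold is an assumption inferred from the table.

```python
SOURCES = {"opinion", "documented", "version 1", "previous time period"}


def dominant_source(source_a, source_b, opinion_weight=None, threshold=2):
    """Return the dominant source of a pair, following the hierarchy table.

    Hypothetical helper for illustration only.
    """
    pair = {source_a, source_b}
    if len(pair) != 2 or not pair <= SOURCES:
        raise ValueError("two different, known sources are required")

    # Documented occurrence always wins
    if "documented" in pair:
        return "documented"

    # Version 1 always loses
    if "version 1" in pair:
        return (pair - {"version 1"}).pop()

    # Only opinion vs. previous time period remains; it depends on the
    # opinion weight, rank*(confidence/10)
    if opinion_weight is None:
        raise ValueError("opinion_weight is required for this comparison")
    if opinion_weight > threshold:
        return "opinion"
    # Assumption: the previous period wins when opinion is too weak
    return "previous time period"


print(dominant_source("opinion", "previous time period", opinion_weight=4.5))
```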


## Section 2: Opinion Processing
Expert opinion is stored in a stand-alone database with tables for presence, summer, winter, and year-round.  Each table has a row for each opinion, stored at the level of the spatial unit and year.  As opinion is collected, duplication and conflicts can arise that need to be addressed before application.  Collecting expert confidence and rank scores provides a way to systematically and consistently resolve conflicts by calculating a weight for each opinion.  Weight is calculated as rank*(confidence/10) and is scaled between 0 and 10.
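
As a quick illustration of the weight calculation, here is a minimal sketch.  The function name is hypothetical, and the 1 to 10 bounds on rank and confidence are an assumption based on the value ranges used in the tables later in this document.

```python
def opinion_weight(rank, confidence):
    """Return rank * (confidence / 10) for one opinion.

    Assumes rank and confidence are each scored from 1 to 10.
    """
    for name, value in (("rank", rank), ("confidence", confidence)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10")
    return rank * (confidence / 10)


# A top-ranked, fully confident expert gets the maximum weight
print(opinion_weight(10, 10))  # 10.0
# A mid-ranked expert with low confidence gets a much smaller weight
print(opinion_weight(5, 4))
```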

As opinions are read from the opinions database and inserted into the opinions table of the output database, they are cleaned in the following ways: 

1. Duplicated records (based on all fields) are dropped.
2. Where an expert entered two opinions for the same spatiotemporal unit, older records are dropped and only the most recent is kept.
3. If experts with the same rank submitted conflicting opinions but with the same level of confidence, then the conflicting records are dropped.
4. If multiple experts submitted opinions, the one with the highest rank is kept and the others are dropped.  If rank is tied, the one with higher confidence is kept.
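
The four cleaning steps can be sketched with pandas as follows.  This is an illustration under stated assumptions rather than the range compiler's code (which uses SQL): the column names `expert`, `unit`, `year`, `status`, `rank`, `confidence`, and `entered` are hypothetical, and the example records are made up.

```python
import pandas as pd


def clean_opinions(opinions):
    """Apply the four cleaning steps to a table of opinions.

    All column names are hypothetical and used only for illustration.
    """
    key = ["unit", "year"]

    # 1. Drop records that are duplicated across all fields
    df = opinions.drop_duplicates()

    # 2. Keep only each expert's most recent opinion per unit-year
    df = (df.sort_values("entered")
            .drop_duplicates(subset=["expert"] + key, keep="last"))

    # 3. Drop conflicting opinions that share rank and confidence
    tie = key + ["rank", "confidence"]
    n_status = df.groupby(tie)["status"].transform("nunique")
    df = df[n_status == 1]

    # 4. Keep the highest rank; break rank ties with confidence
    df = (df.sort_values(["rank", "confidence"], ascending=False)
            .drop_duplicates(subset=key, keep="first"))
    return df.sort_values(key).reset_index(drop=True)


example = pd.DataFrame(
    [("A", 1, 2015, "present", 5, 8, "2020-01-01"),
     ("A", 1, 2015, "present", 5, 8, "2020-01-01"),   # exact duplicate
     ("A", 1, 2015, "absent", 5, 8, "2021-01-01"),    # newer entry by A
     ("B", 1, 2015, "present", 9, 6, "2020-06-01"),   # higher rank
     ("C", 2, 2015, "present", 4, 7, "2020-03-01"),   # conflicts with D
     ("D", 2, 2015, "absent", 4, 7, "2020-04-01")],
    columns=["expert", "unit", "year", "status", "rank", "confidence",
             "entered"])
print(clean_opinions(example))
```

In this example only expert B's opinion for unit 1 survives: expert A's records are outranked, and the records from C and D cancel each other out.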

The next step in processing opinion records is to expand and adjust the status and weight associated with each spatial unit-year record.  These actions are based upon the following logic:
* Range-present implies presence-present
* Presence-absent implies range-absent
* Range-absent does not imply presence-absent
* Year-round present implies present during summer and winter, but summer or winter range present does not imply year-round present.

The following table identifies which types of opinions could be applied to each of the map types that are compiled by GAP.

In [4]:
# Display a table showing when various opinions are applied
maps = ["presence", "summer", "winter", "year-round"]
status = ["present", "absent"]

# Make a dataframe with map as row index and map-status combinations as hierarchical columns
df = pd.DataFrame(index=maps, columns=pd.MultiIndex.from_product([maps, status]))
df.fillna("", inplace=True)

# Name the indices
df.index.name = "Map"
df.columns.names = ["Opinion", ""]

# Fill whole columns with the appropriate values
df["presence", "absent"] = "X"
df["year-round", "present"] = "X"

# Fill individual cells with the appropriate values
df.loc["presence", ("presence", "present")] = "X"
df.loc["presence", ("summer", "present")] = "X"
df.loc["presence", ("winter", "present")] = "X"
df.loc["presence", ("year-round", "present")] = "X"
df.loc["summer", ("summer", "absent")] = "X"
df.loc["summer", ("summer", "present")] = "X"
df.loc["summer", ("year-round", "absent")] = "X"
df.loc["winter", ("winter", "absent")] = "X"
df.loc["winter", ("winter", "present")] = "X"
df.loc["winter", ("year-round", "absent")] = "X"
df.loc["year-round", ("year-round", "absent")] = "X"

print("Table showing which types of opinions are applied to each type of map")
df.T

Table showing which types of opinions are applied to each type of map


| Opinion | Status | Map: presence | Map: summer | Map: winter | Map: year-round |
|---|---|---|---|---|---|
| presence | present | X | | | |
| presence | absent | X | X | X | X |
| summer | present | X | X | | |
| summer | absent | | X | | |
| winter | present | X | | X | |
| winter | absent | | | X | |
| year-round | present | X | X | X | X |
| year-round | absent | | X | X | X |
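
The same table can be expressed as a lookup that returns the maps an opinion applies to.  This is a minimal sketch; the function and constant names are hypothetical.

```python
MAPS = ("presence", "summer", "winter", "year-round")


def applicable_maps(opinion_map, status):
    """Return the set of maps that an opinion of the given type applies to."""
    if opinion_map not in MAPS or status not in ("present", "absent"):
        raise ValueError("unknown opinion type or status")

    if opinion_map == "presence":
        # Presence-absent implies absence from every range map
        return set(MAPS) if status == "absent" else {"presence"}

    if opinion_map in ("summer", "winter"):
        # Range-present implies presence-present;
        # range-absent does not imply presence-absent
        if status == "present":
            return {"presence", opinion_map}
        return {opinion_map}

    # Year-round present implies summer and winter present (and presence);
    # year-round absent is applied to the three range maps only
    if status == "present":
        return set(MAPS)
    return {"summer", "winter", "year-round"}


print(sorted(applicable_maps("summer", "present")))  # ['presence', 'summer']
```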


The logic specified above, which underlies the preceding table, also determines how opinion conflicts and inferences are handled.  An opinion about presence or range can, in some cases, transfer to other seasons or to presence.  Those transfers can generate conflicts or fill data gaps.  Therefore, we designed a process to adjust opinions and weights according to the table below, which includes all potential presence and seasonal range pairs.

In [5]:
# Detail adjustments in a table
columns = ["Presence", "Range", "Adjusted_Presence", "Adjusted_range"]

# Create a dataframe
df = pd.DataFrame(columns=columns)

# Add rows
df.loc[0] = ["present", "present", "present (higher weight)", "present (higher weight)"]
df.loc[1] = ["present", "absent", "present", "absent"]
df.loc[2] = ["present", "null", "present", "null"]
df.loc[3] = ["absent", "present", "higher weight", "higher weight"]
df.loc[4] = ["absent", "absent", "absent (presence)", "absent (higher weight)"]
df.loc[5] = ["absent", "null", "absent", "absent"]
df.loc[6] = ["null", "present", "present", "present"]
df.loc[7] = ["null", "absent", "null", "absent"]
df.loc[8] = ["null", "null", "null", "null"]

df

| | Presence | Range | Adjusted_Presence | Adjusted_range |
|---|---|---|---|---|
| 0 | present | present | present (higher weight) | present (higher weight) |
| 1 | present | absent | present | absent |
| 2 | present | null | present | null |
| 3 | absent | present | higher weight | higher weight |
| 4 | absent | absent | absent (presence) | absent (higher weight) |
| 5 | absent | null | absent | absent |
| 6 | null | present | present | present |
| 7 | null | absent | null | absent |
| 8 | null | null | null | null |
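
One way to read the adjustment table is as a function of a presence opinion and a range opinion, each carrying a weight.  The sketch below is an interpretation, not the compiler's logic.  Its assumptions: opinions are `(status, weight)` tuples with `None` for no opinion, an inferred opinion inherits the weight of the opinion it was inferred from, and a tie in row 3 (presence absent vs. range present) leaves both opinions unchanged because the table does not address ties.

```python
def adjust(presence=None, range_=None):
    """Adjust a (presence, range) opinion pair following the table.

    Each argument is a (status, weight) tuple, or None for no opinion.
    Returns the adjusted (presence, range) pair.
    """
    if presence is None and range_ is None:
        return None, None

    if presence is None:
        # Range-present implies presence-present (row 6);
        # range-absent implies nothing about presence (row 7)
        if range_[0] == "present":
            return ("present", range_[1]), range_
        return None, range_

    p_status, p_weight = presence
    if range_ is None:
        # Presence-absent implies range-absent (row 5);
        # presence-present implies nothing about range (row 2)
        if p_status == "absent":
            return presence, ("absent", p_weight)
        return presence, None

    r_status, r_weight = range_
    top = max(p_weight, r_weight)
    if p_status == "present" and r_status == "present":
        return ("present", top), ("present", top)      # row 0
    if p_status == "present" and r_status == "absent":
        return presence, range_                         # row 1
    if p_status == "absent" and r_status == "absent":
        return presence, ("absent", top)                # row 4

    # Row 3: presence absent vs. range present, higher weight wins both
    if p_weight == r_weight:
        return presence, range_  # assumption: ties are left unresolved
    winner = presence if p_weight > r_weight else range_
    return winner, winner


print(adjust(("absent", 6.0), ("present", 4.0)))
```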


Opinion has to be weighted when a previous presence code is available in addition to opinion.  This section shows the relationship among weight, expert rank, and confidence.  A threshold opinion weight of 3 is used for winning vs. a previous presence code.

In [6]:
# Display tables that show relationships between rank and confidence
# Make a dataframe of scores for pairwise rank-confidence values
v = [1,2,3,4,5,6,7,8,9,10]
v.reverse()
df5 = pd.DataFrame(columns=v, index=v)
for i in df5.index:
    for c in df5.columns:
        df5.loc[i, c] = i*c/10
df5.index.name = "Confidence"
df5.columns.name = "Expert Rank"
print("Rounded Opinion Weights")
print(df5.round().astype(int))

# Make a dataframe that shows the outcome of a conflict between opinion and a
# previous presence code: weights above 8 (coded 10), weights from 2 through 8
# (coded 1), and weights below 2 (coded 0).
m = df5 > 8
df7 = df5.mask(m, 10)
m = df7 < 2
df7.mask(m, 0, inplace = True)
m = (df7 >= 2) & (df7 <= 8)
df7.mask(m, 1, inplace = True)
print("\nOutcome of a Conflict With a Previous Presence Code")
print(df7.replace({1: "suspected", 0: "defer", 10: "likely"}))

Rounded Opinion Weights
Expert Rank  10  9   8   7   6   5   4   3   2   1 
Confidence                                         
10           10   9   8   7   6   5   4   3   2   1
9             9   8   7   6   5   4   3   2   1   0
8             8   7   6   5   4   4   3   2   1   0
7             7   6   5   4   4   3   2   2   1   0
6             6   5   4   4   3   3   2   1   1   0
5             5   4   4   3   3   2   2   1   1   0
4             4   3   3   2   2   2   1   1   0   0
3             3   2   2   2   1   1   1   0   0   0
2             2   1   1   1   1   1   0   0   0   0
1             1   0   0   0   0   0   0   0   0   0

Outcome of a Conflict With a Previous Presence Code
Expert Rank         10         9          8          7          6          5          4          3          2      1 
Confidence                                                                                                           
10              likely     likely  suspected  suspected  suspec
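
For a single opinion, the masks applied in the cell above reduce to a three-way classification.  The sketch below mirrors those masks (below 2, 2 through 8, above 8); the function name is hypothetical, and the cut points are taken from that cell rather than from the range compiler itself.

```python
def conflict_outcome(weight):
    """Classify an opinion weight in a conflict with a previous presence code."""
    if weight > 8:
        return "likely"      # opinion wins and yields a "likely" code
    if weight < 2:
        return "defer"       # opinion defers to the previous code
    return "suspected"       # opinion wins and yields a "suspected" code


# Tally outcomes over every rank-confidence combination (1-10 each)
tally = {}
for rank in range(1, 11):
    for confidence in range(1, 11):
        outcome = conflict_outcome(rank * confidence / 10)
        tally[outcome] = tally.get(outcome, 0) + 1
print(tally)
```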

## Section 3: Coding Rules

### Presence Codes
__1: Confirmed Present__ -- Presence is documented with sufficient occurrence data

__2: Likely Present__ -- There is strong evidence to suggest the species' presence, but recent presence is not documented.

__3: Suspected Present__ -- There is compelling reason to believe that the species may be present.

__4: Suspected Absent__ -- There is compelling reason to believe that the species is absent.

__5/Null/NoData: Likely Absent__ -- There is strong evidence to suggest the species' absence, or no reason to suspect its presence.


### Range Codes
__1: Confirmed Range__ -- Range is documented with sufficient occurrence data

__2: Likely Range__ -- There is strong evidence to suggest the spatial unit is within the species' range, but it is not confirmed.

__3: Suspected Range__ -- There is compelling reason to suggest the spatial unit is within the species' range.

__4: Suspected Non-range__ -- There is compelling reason to suggest the spatial unit is not within the species' range.

__5/Null/NoData: Likely Non-range__ -- There is strong evidence to suggest the spatial unit is not within the species' range, but it is not confirmed.


### Rules
Rules are applied at the level of the individual spatial unit.  The rules below are written for presence codes and still need to be extended to cover range codes.


Start with 2001v1 Codes,
* If a 2001v1 code exists, use that as a default for the first period.

* Old legend values 1,2,3 become new legend value 3 (suspected present).

* Old legend values 4,5 become new legend value 4 (suspected absent).

If a Code from the Previous Period is Available for the Spatial Unit,
* If documented in previous time step, code as suspected present (3).

* If coded as present (documented, likely, or suspected; 1, 2, or 3) in previous time step, code as suspected present (3).
    
* If coded as suspected absent (4) in the previous time step, code as suspected absent (4).

* If coded as likely absent (5) in the previous time step, code as likely absent (5).

If Expert Opinion is Available for the Spatial Unit,
* If opinion weight is high enough, use opinion to overwrite null values and codes from previous periods, including 2001v1.  Weights between 2 and 8 yield suspected present or absent, whereas weights above 8 yield likely present or absent.

If Occurrence Records from the Spatial Unit are Available,
* If summed record weight is high enough (>9), presence is documented.
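
The occurrence-record rule can be sketched as a grouped sum.  This is an illustration only: the `unit` and `weight` column names and the example records are hypothetical, and how individual record weights are derived is outside the scope of this document.

```python
import pandas as pd


def documented_units(records, threshold=9):
    """Return spatial units whose summed record weight exceeds the threshold."""
    totals = records.groupby("unit")["weight"].sum()
    return sorted(totals[totals > threshold].index)


records = pd.DataFrame({
    "unit": ["a", "a", "a", "b", "c"],
    "weight": [4, 3, 5, 9, 10],
})

# Unit "a" sums to 12 and unit "c" to 10; unit "b" sums to exactly 9,
# which does not exceed the threshold
print(documented_units(records))  # ['a', 'c']
```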


### Range behaviors
The above rules play out in the following ways:

* A spatial unit for which there is never any occurrence data or expert opinion will remain coded as it was in 2001 version 1.

* If presence is documented in one period, then the unit will be coded as suspected present in each subsequent period until expert opinion of absence is recorded for the unit or presence is documented again.

* If an expert registers her opinion that a species is absent in a unit, but then presence is documented with occurrence data, then the expert's opinion will be over-ridden and the unit will be coded as documented present.

* If a spatial unit is coded suspected present in one period, but expert opinion with sufficient weight indicates absence in the next period, then the unit's code will transition from suspected present to suspected or likely absent.
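
These behaviors can be traced for one spatial unit with a scalar version of the rules.  The sketch below is not the compiler's code: the function name, the period labels, and the example inputs are hypothetical, and the weight cut points (`> 2` and `> 8`) are copied from the `rules` function later in this section.

```python
def assign_presence(previous=None, opinion=None, weight=None, documented=False):
    """Assign a presence code to one spatial unit for one period.

    previous: presence code from the prior period (1-5) or None
    opinion: 1 (present), 0 (absent), or None
    weight: opinion weight, rank*(confidence/10), or None
    documented: True if occurrence records document presence
    """
    code = None

    # Previous period: present codes carry forward as suspected present
    if previous in (1, 2, 3):
        code = 3
    elif previous in (4, 5):
        code = previous

    if opinion in (0, 1):
        # Opinion fills the gap when nothing else is available
        if previous is None and not documented:
            code = 3 if opinion == 1 else 4
        # Opinion with enough weight overwrites the previous period's code
        if weight is not None and weight > 2:
            code = 3 if opinion == 1 else 4
        if weight is not None and weight > 8:
            code = 2 if opinion == 1 else 5

    # Documented presence overrides everything else
    if documented:
        code = 1
    return code


# Hypothetical inputs for four consecutive periods
periods = [
    ("period 1", {"documented": True}),
    ("period 2", {}),
    ("period 3", {"opinion": 0, "weight": 9}),
    ("period 4", {"opinion": 0, "weight": 9, "documented": True}),
]

previous = None
history = []
for label, inputs in periods:
    previous = assign_presence(previous=previous, **inputs)
    history.append(previous)
    print(label, previous)
```

The unit moves from documented (1) to suspected present (3), then to likely absent (5) once a heavily weighted opinion of absence arrives, and back to documented (1) when occurrence records override that opinion.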


### Make a Table of Possible Value Combinations
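Opinion weights can range from 0.1 to 10, but only the values 2 and 9 are included here to limit the table size, along with a threshold of 3 for dominance over a prior presence code.  A weight of 2 is subordinate to a prior presence code, whereas a weight of 9 is not.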

In [3]:
# Make and display a table of possible combinations of values
# Possible values
documented = [1,pd.NA]
last_period = [1,2,3,4,5,pd.NA]
status = [0,1,pd.NA]
GAP2001 = [1,2,3,4,5,6,7,pd.NA]
opinion_score = [2,9]
confidence = list(range(1,11,1))
rank = list(range(1,11,1))

# Make a table with all combinations, use 2015 as an example.
# Opinion score is rank*(confidence/10), but only 2 and 9 are used here to
# reduce table size. 2 would be subordinate to a past code, 9 would not be.
columns = ["presence_2015v2", "documented_2015v2",
           "opinion_2015", "opinion_score",
           "presence_2010v2"]

# Build one row per combination.  DataFrame.append was removed in pandas 2.0,
# so rows are collected in a list and converted to a dataframe in one step.
rows = [{"documented_2015v2" : doc,
         "opinion_2015" : sta,
         "opinion_score" : opi,
         "presence_2010v2" : las}
        for doc in documented
        for sta in status
        for las in last_period
        for opi in opinion_score]
df1 = pd.DataFrame(rows, columns=columns)

# Some values may not be in the 2001v1 maps, remove those
#df1 = df1[df1["presence_2001v1"].isin([2,3,5,6,7]) == False]

    
# If no opinion, then the score should be NA as well.
df1.loc[df1["opinion_2015"].isnull() == True, 'opinion_score'] = pd.NA

# Remove rows with all Null values
dftmp1 = (df1[(df1["opinion_score"].isnull() == True)
          & (df1["opinion_2015"].isnull() == True)
          #& (df1["presence_2001v1"].isnull() == True)
          & (df1["presence_2010v2"].isnull() == True)
          & (df1["documented_2015v2"].isnull() == True)
          ])

df1.drop(dftmp1.index, inplace=True)

# Drop duplicate rows
df1.drop_duplicates(inplace=True)

print(df1.astype("Int64"))

    presence_2015v2  documented_2015v2  opinion_2015  opinion_score  presence_2010v2
0              <NA>                  1             0              2                1
1              <NA>                  1             0              9                1
2              <NA>                  1             0              2                2
3              <NA>                  1             0              9                2
4              <NA>                  1             0              2                3
5              <NA>                  1             0              9                3
6              <NA>                  1             0              2                4
7              <NA>                  1             0              9                4
8              <NA>                  1             0              2                5
9              <NA>                  1             0              9                5
10             <NA>                  1             0             

### Populate Presence Values

For the first time period, the SQL equivalent of the following would be run:
#If a 2001v1 code exists, use that as a start
df["presence_2015v2"] = df["presence_2001v1"]

#Old legend values 1,2,3 become new legend value 3
df.loc[df["presence_2015v2"].isin([1,2,3]) == True, 'presence_2015v2'] = 3

#Old legend values 4,5 become new legend value 4
df.loc[df["presence_2015v2"].isin([4,5]) == True, 'presence_2015v2'] = 4

In [6]:
# Function to shorten field names to fit screen width upon printing
def fit_print(df):
    '''Changes column name to make print width narrower.'''
    print(df.rename({#"presence_2001v1": "2001v1",
                     "presence_2010v2": "pres_2010",
                     "opinion_score": "opinion_weight",
                     "opinion_2015": "opinion",
                     "documented_2015v2": "documented",
                     "presence_2015v2": "presence"}, axis=1))

# Put rules into a function here
def rules(df):
    # --------------------- Previous Period Code ------------------------------
    # If documented in previous time step, code as 3 (placement in this section
    # is necessary or opinion can overwrite it).
    df.loc[df["presence_2010v2"] == 1, 'presence_2015v2'] = 3
    
    # If coded as present in previous time step, code as 3
    df.loc[df["presence_2010v2"].isin([2,3]) == True, 'presence_2015v2'] = 3

    # If coded as suspected absent in previous time step, code as 4
    df.loc[df["presence_2010v2"].isin([4,]) == True, 'presence_2015v2'] = 4

    # If coded as likely absent in previous time step, code as 5
    df.loc[df["presence_2010v2"].isin([5,]) == True, 'presence_2015v2'] = 5


    # --------------------------- Opinion -------------------------------------
    # If opinion with any score exists, but all else is null,
    # base the presence value on it.
    # Suspected present
    df.loc[(df["opinion_2015"] == 1)
           #& (df["presence_2001v1"].isnull() == True)
           & (df["presence_2010v2"].isnull() == True)
           & (df["documented_2015v2"].isnull() == True), 'presence_2015v2'] = 3
    
    # Suspected absent
    df.loc[(df["opinion_2015"] == 0)
           #& (df["presence_2001v1"].isnull() == True)
           & (df["presence_2010v2"].isnull() == True)
           & (df["documented_2015v2"].isnull() == True), 'presence_2015v2'] = 4
    
    # If opinion with a high enough score exists, use it to overwrite null
    # values and codes from previous periods (including 2001v1)
    # Suspected present
    df.loc[(df["opinion_score"] > 2) &
            (df["opinion_2015"] == 1), 'presence_2015v2'] = 3

    # Suspected absent
    df.loc[(df["opinion_score"] > 2) &
            (df["opinion_2015"] == 0), 'presence_2015v2'] = 4
    
    # Likely present
    df.loc[(df["opinion_score"] > 8) &
            (df["opinion_2015"] == 1), 'presence_2015v2'] = 2

    # Likely absent
    df.loc[(df["opinion_score"] > 8) &
            (df["opinion_2015"] == 0), 'presence_2015v2'] = 5
    

    # ------------------------ Model Predictions ------------------------------
    # Results of model predictions would be placed here to assign likely
    # present and likely absent.
    
    # ------------------------ Occurrence Records -----------------------------
    # A SQL rule that codes units documented in a previous time period as 3
    # is intentionally left out because it is problematic:
    # UPDATE presence SET presence_{0} = 3 WHERE documented_pre{1}=1;
    
    # If documented with records, presence is documented
    df.loc[df["documented_2015v2"] == 1, 'presence_2015v2'] = 1

    # Make values integers with pd.NA
    df = df.astype("Int64")

    return df

# Apply rules to fill out presence column
df2 = rules(df1)

# Save to file
df2.to_csv("T:/RangeMaps/presence_coding_matrix.csv")
print(df2)

    presence_2015v2  documented_2015v2  opinion_2015  opinion_score  presence_2010v2
0                 1                  1             0              2                1
1                 1                  1             0              9                1
2                 1                  1             0              2                2
3                 1                  1             0              9                2
4                 1                  1             0              2                3
5                 1                  1             0              9                3
6                 1                  1             0              2                4
7                 1                  1             0              9                4
8                 1                  1             0              2                5
9                 1                  1             0              9                5
10                1                  1             0             

### Tests
1. There shouldn't be any cases with null presence values for the period

In [7]:
# Run test 1
testdf = df2
nulls = testdf[testdf["presence_2015v2"].isnull() == True]
if len(nulls) == 0:
    print('Test 1: pass')
else:
    print('Test 1: fail')
    print(nulls)

Test 1: pass


2. All of the potential codes for each source should still be present

In [8]:
# Run test 2
def check_values(column, OK_values, df = testdf):
    '''Compares a column's unique values with the allowed values.
    Returns 1 if any value is unexpected or missing, otherwise 0.'''
    r = 0
    vals = list(df[column].unique())

    # Are all column values OK?
    violations = set(vals) - set(OK_values)
    if len(violations) != 0:
        print("Test 2: failed on {0}".format(column))
        print("\t violations: " + str(violations))
        r = 1
    # Are all values represented?
    missing = set(OK_values) - set(vals)
    if len(missing) != 0:
        print("Test 2: failed on {0}".format(column))
        print("\t missing: " + str(missing))
        r = 1
    return r

# Collect the result of every check so that an early failure is not
# overwritten by a later pass
results = []

# Presence 2015 values should be 1, 2, 3, 4, or 5
# Remove NA from last_period
fifteen_values = [x for x in last_period if pd.isnull(x) == False]
results.append(check_values("presence_2015v2", fifteen_values))

# Presence 2010 values should be 1, 2, 3, 4, 5, or NA
results.append(check_values("presence_2010v2", last_period))

# Documented values should be 1 or NA
results.append(check_values("documented_2015v2", documented))

# Opinion values should be 1, 0, or NA
results.append(check_values("opinion_2015", status))

# Opinion score should be 2, 9, or pd.NA
results.append(check_values("opinion_score", [2,9,pd.NA]))

# 2001v1 values should be 3 or 4 (or null)
#results.append(check_values("presence_2001v1", [1.0, 4.0, pd.NA]))

# Report a pass of test 2
if sum(results) == 0:
    print("Test 2: pass")

Test 2: pass


3. If documented present, then presence code should be 1

In [9]:
# Run test 3
documented_rows = testdf[testdf["documented_2015v2"] == 1]
matches = (documented_rows["presence_2015v2"]
           == documented_rows["documented_2015v2"])
# Every documented row must have a presence code of 1
if matches.all():
    print("Test 3: pass")
else:
    print("Test 3: failed!!!!!!!!")

Test 3: pass
