# iNaturalist API Example: Finding observations with disagreeing IDs
- Link: https://jumear.github.io/stirpy/lab?path=iNat_obs_with_disagreeing_ids.ipynb
- GitHub Repo: https://github.com/jumear/stirpy

In the [iNatForum](https://forum.inaturalist.org/), folks often ask how to filter for observations with disagreements, and there is even a [Feature Request](https://forum.inaturalist.org/t/provide-a-way-to-filter-observations-by-disputed-ids/6698) to implement a basic form of this kind of filter.

Under the hood, the [Explore](https://www.inaturalist.org/observations) and [Identify](https://www.inaturalist.org/observations/identify) pages get results from the [`GET /v1/observations`](https://api.inaturalist.org/v1/docs/#!/Observations/get_observations) API endpoint. Although that endpoint provides an `identifications[i].disagreement` field in its response, which indicates whether or not an observation's identification is a disagreement, there is no filter parameter that returns only observations with at least one identification where `identifications[i].disagreement=true`.
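
Without such a filter, the check has to be made client-side on observation records that have already been fetched. The following is only a minimal sketch of that idea and is not part of this notebook's workflow: `has_disagreement` is a hypothetical helper name, and the sample records are made up and contain only the fields needed for the illustration.

```python
# Hypothetical helper: True if any identification on an observation record
# is flagged as a disagreement.
def has_disagreement(obs):
    return any(ident.get('disagreement') is True for ident in obs.get('identifications') or [])

# Made-up records shaped like a small subset of a GET /v1/observations response.
sample_observations = [
    {'id': 1, 'identifications': [{'id': 10, 'disagreement': False}, {'id': 11, 'disagreement': True}]},
    {'id': 2, 'identifications': [{'id': 12, 'disagreement': None}]},
    {'id': 3, 'identifications': []},
]

# keep only the IDs of observations with at least one disagreeing identification
disputed = [obs['id'] for obs in sample_observations if has_disagreement(obs)]
print(disputed)
```

Because this approach means fetching every candidate observation first and discarding most of them, it scales poorly for large result sets.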

If we move away from observations, it is possible to use [`GET /v1/identifications`](https://api.inaturalist.org/v1/docs/#!/Identifications/get_identifications) to find *identifications* where `disagreement=true`. The problem is that there are no user interfaces in the system which display identifications data from this endpoint in a human-friendly way (although there is at least one [third-party tool](https://jumear.github.io/stirfry/iNatAPIv1_identifications) that can fill this gap). However, it is possible to get the observation IDs from those identification records, and then display those observations by passing an `id=[comma-separated list of observation IDs]` parameter to the Explore and Identify pages. This notebook provides an example of how to do that in a somewhat automated way.
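
The core of that idea can be sketched in a few lines. This is a simplified illustration using made-up identification records (only the `observation.id` field matters here), and `identify_url_for` is a hypothetical helper rather than one of the functions defined in the cells below.

```python
IDENTIFY_URL = 'https://www.inaturalist.org/observations/identify'

# Hypothetical helper: collect the observation IDs referenced by a list of
# identification records and build an Identify page URL for them.
def identify_url_for(identifications):
    # dict.fromkeys() removes duplicates while preserving order
    # (one observation can have several disagreeing identifications)
    obs_ids = list(dict.fromkeys(ident['observation']['id'] for ident in identifications))
    return f"{IDENTIFY_URL}?id={','.join(map(str, obs_ids))}"

# Made-up records shaped like a small subset of a GET /v1/identifications response.
sample_identifications = [
    {'id': 501, 'disagreement': True, 'observation': {'id': 1001}},
    {'id': 502, 'disagreement': True, 'observation': {'id': 1002}},
    {'id': 503, 'disagreement': True, 'observation': {'id': 1001}},
]

print(identify_url_for(sample_identifications))
```

The de-duplication step is an addition in this sketch; it keeps the URL shorter when several identifications point to the same observation.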

One limitation of this workflow is that the `GET /v1/identifications` endpoint provides fewer [available filter parameters](https://api.inaturalist.org/v1/docs/#!/Identifications/get_identifications) than the `GET /v1/observations` endpoint. For example, it is possible to filter observations by project, but it is not possible to filter identifications by project. However, since we are effectively going through `GET /v1/observations` in the end, we can apply those additional observation filter parameters at that stage.
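
As a rough sketch of that last step, extra observation filters can be appended to the query string alongside `id`. The helper name `identify_url_with_filters` and the project slug are placeholders for illustration; `project_id` is used here only because project filtering is the case mentioned above.

```python
from urllib.parse import urlencode

IDENTIFY_URL = 'https://www.inaturalist.org/observations/identify'

# Hypothetical helper: build an Identify page URL for a list of observation IDs,
# optionally adding more observation filter parameters.
def identify_url_with_filters(obs_ids, extra_filters=None):
    query = {'id': ','.join(map(str, obs_ids))}
    query.update(extra_filters or {})
    # safe=',' keeps the commas in the ID list readable instead of encoding them as %2C
    return f'{IDENTIFY_URL}?{urlencode(query, safe=",")}'

# 'my-example-project' is a placeholder, not a real project.
url = identify_url_with_filters([1001, 1002], {'project_id': 'my-example-project'})
print(url)
```

The same effect can be had by manually adding `&project_id=...` to one of the URLs that this notebook prints.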

Another limitation is that identification records don't seem to have recorded `disagreement=true` prior to 2018-01-03. So if the disagreement occurred before that date, it will not be picked up by this workflow. A similar limitation is that it is possible to withdraw or replace an initial identification where `disagreement=true` in a way that subsequent identifications will not be recorded with `disagreement=true` and so cannot be picked up by this workflow. Finding disagreements in either of these cases requires less efficient client-side filtering of observation records, which will not be covered in this example.

Although this notebook was created with the intent of finding observations with disagreements, the basic concept of getting identifications and then displaying the observations associated with them can be applied to other purposes as well (e.g., filtering for observations where a specific user made an identification of a specific taxon).
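
For instance, that second use case might start from a different identifications query. The sketch below only builds the request URL. It assumes that `user_id` and `taxon_id` are accepted as filter parameters by `GET /v1/identifications` (check the API documentation to confirm), and the values shown are placeholders.

```python
from urllib.parse import urlencode

API_IDENTIFICATIONS_URL = 'https://api.inaturalist.org/v1/identifications'

# Placeholder values: substitute a real user and a real taxon ID.
alt_params = {'per_page': 200, 'user_id': 'some_user', 'taxon_id': 12345}

# build the request URL that would return that user's identifications of that taxon
alt_url = f'{API_IDENTIFICATIONS_URL}?{urlencode(alt_params)}'
print(alt_url)
```

In this notebook, the equivalent change is to edit the `req_params_string` value in the request parameters cell.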

In [None]:
# load required modules
from urllib.parse import parse_qs # used for parsing URL parameters
import asyncio # used for asynchronous fetching
import math # used for a ceiling method
#from datetime import datetime # used to convert string datetimes into actual datetimes

# use Pyodide's pyfetch module if possible, but fall back to urllib3 outside of Pyodide
try:
    from pyodide.http import pyfetch # Pyodide's fetch function (asynchronous)
    use_pyfetch=True
except ImportError:
    #!pip install urllib3
    import urllib3 # fall back to urllib3 if pyfetch isn't available. it can be made asynchronous using asyncio.to_thread().
    use_pyfetch=False

In [None]:
# define custom functions used for getting data

# function to turn a parameter string into a dict
def params_to_dict(params_string):
    params_dict = parse_qs(params_string)
    for p, v in params_dict.items():
        if v: # iNaturalist handles multiple values for the same parameter using comma separated values. since parse_qs doesn't handle that situation, this section will handle it.
            v = [(vv.split(',') if vv else vv) for vv in v]
            params_dict[p] = [vvv for vv in v for vvv in vv]
    return params_dict

# function to combine a base url with a set of parameters. (there's a urlencode method in urllib.parse, but it's easier to get exactly what I need using this custom code.)
# iNaturalist parameters are sometimes passed through the endpoint URL path rather than through the query string. so this handles that specific case.
def url_with_params(url_base, params=None):
    if params is None:
        params = {}
    url = url_base
    for p, v in params.items():
        pv = ','.join(v)
        if url.find(pp:=f'{{{p}}}') >= 0:
            url = url.replace(pp, pv)
        else:
            s = '?' if url.find('?') < 0 else '&'
            url += f'{s}{p}={pv}'
    return url

# basic function to fetch from API and convert response to JSON
async def fetch_data(url, method='GET', use_authorization=False, delay=0):
    await asyncio.sleep(delay)
    req_headers = {}
    if use_authorization and jwt:
        req_headers = req_headers_base.copy() # make a copy
        req_headers['Authorization'] = jwt
    if use_pyfetch:
        response = await pyfetch(url, method=method, headers=req_headers)
        data = await response.json()
    else:
        response = await asyncio.to_thread(urllib3.request, method, url, headers=req_headers)
        data = response.json()
    print(f'Fetch complete: {method} {url}')
    return data

# function to GET total_results (count) from the API
async def get_total_results(endpoint, params=None, use_authorization=False, delay=0):
    if params is None:
        params = {}
    rp = params.copy() # make a copy
    rp.pop('per_page', None) # remove per_page parameter, if it exists
    rp['per_page'] = ['0'] # set this to 0, since we need only the count, not the actual records
    data = await fetch_data(url_with_params(endpoint['url'], rp), use_authorization=use_authorization, delay=delay)
    total_results = int(data['total_results'])
    print(f'Total records: {total_results}')
    return total_results

# function to GET a single page of results from the API
# additional parsing and additional filtering before and after the parsing can happen here, too
# can be called directly but generally is intended to be called through get_results
# note that the parse_function can be an async function (in case we need to get additional information from the API during parsing)
# but if the parse_function gets additional data from the API for every page, then the delay used by get_results may need to be tweaked to keep within request limits
# (ideally, cases where additional data will be needed from the API for every page should be handled after all pages have been retrieved)
async def get_results_single_page(endpoint, params=None, use_authorization=False, parse_function=None, pre_parse_filter_function=None, post_parse_filter_function=None, delay=0):
    if params is None:
        params = {}
    rp = params.copy() # make a copy
    data = await fetch_data(url_with_params(endpoint['url'], rp), use_authorization=use_authorization, delay=delay)
    results = data.get('results',[])
    if pre_parse_filter_function:
        results = list(filter(pre_parse_filter_function, results))
    if parse_function:
        results = parse_function(results) # call once, whether or not the function is async
        if asyncio.iscoroutine(results): # an async parse_function returns a coroutine, which still needs to be awaited
            results = await results
    if post_parse_filter_function:
        results = list(filter(post_parse_filter_function, results))
    return results

# function to GET results from the API
# if get_all_pages=True, then get all records, up to the limit that the API endpoint provides.
# query pages in parallel, with each page having an incrementally delayed start.
# (iNaturalist wants you to limit requests to ~1 req/second.)
async def get_results(endpoint, params=None, get_all_pages=False, use_authorization=False, parse_function=None, pre_parse_filter_function=None, post_parse_filter_function=None):
    if params is None:
        params = {}
    results = []
    if (page_key := endpoint.get('page_key')):
        if not (page_key_values := params.get(page_key)):
            print(f'Cannot query from this endpoint without values for {page_key} parameter')
            return None
        max_per_page = endpoint['max_per_page']
        total_key_values = len(page_key_values)
        page_sets = [page_key_values[i:i+max_per_page] for i in range(0, total_key_values, max_per_page)]
        print(f'There are {total_key_values} {page_key} values, requiring {len(page_sets)} API requests to retrieve. Retrieving {"all sets" if get_all_pages else "only the first set"}...')
        async with asyncio.TaskGroup() as tg: # available in Python 3.11+
            tasks = []
            for i in (range(len(page_sets) if get_all_pages else 1)):
                rp = params.copy() # make a copy
                rp[page_key] = page_sets[i]
                tasks.append(tg.create_task(get_results_single_page(endpoint, params=rp, use_authorization=use_authorization, parse_function=parse_function, pre_parse_filter_function=pre_parse_filter_function, post_parse_filter_function=post_parse_filter_function, delay=i)))
        for t in tasks:
            results += t.result()
    else:
        max_page = math.ceil(endpoint['max_records'] / endpoint['max_per_page']) if get_all_pages else 1
        if get_all_pages:
            # when getting all pages, make a small query first to find how many total records there are.
            # this allows us to calculate how many requests we need to make in total.
            # if total records exceeds the maximum that the API will return, then retrieve only up to the maximum.
            total_results = await get_total_results(endpoint, params, use_authorization)
            total_pages = math.ceil(total_results / endpoint['max_per_page'])
            if total_pages < max_page:
                max_page = total_pages
            print(f'Pages to retrieve: {max_page}')
        async with asyncio.TaskGroup() as tg: # available in Python 3.11+
            tasks = []
            for i in range(max_page):
                rp = params.copy() # make a copy
                if get_all_pages:
                    # if getting all pages, remove per_page and page parameters if they exist in the base params
                    # and then set per_page = max and increment page for each request
                    rp.pop('per_page', None)
                    rp.pop('page', None)
                    rp['per_page'] = [str(endpoint['max_per_page'])] # set this to the max if we're getting all pages
                    rp['page'] = [str(i+1)]
                tasks.append(tg.create_task(get_results_single_page(endpoint, params=rp, use_authorization=use_authorization, parse_function=parse_function, pre_parse_filter_function=pre_parse_filter_function, post_parse_filter_function=post_parse_filter_function, delay=i)))
        for t in tasks:
            results += t.result()
    print(f'Total records retrieved: {len(results)}')
    return results

# function to string together a list of observation ids into sets of up to a max number of observations per set
# the original intended use case is to create URLs linking to the iNaturalist Explore or Identify page, filtered for specific observations
def obs_ids_to_sets(obs_ids, max_set_size=500, separator=',', prefix=''):
    obs_id_sets = []
    for i in range(0, len(obs_ids), max_set_size):
        obs_id_string = prefix + separator.join(map(str, obs_ids[i:i+max_set_size]))
        obs_id_sets.append(obs_id_string)
        print(f'Set {i // max_set_size + 1}: {obs_id_string}')
    return obs_id_sets

In [None]:
# define the parameters needed for your request
req_params_string = 'per_page=200&disagreement=true&place_id=110679' # remember: these are filter parameters for identifications, not observations.
req_params = params_to_dict(req_params_string) # also splits comma-separated values, which parse_qs alone does not do
req_headers_base = {'Content-Type': 'application/json', 'Accept': 'application/json'}

# to make authorized calls, set jwt to the "api_token" value from https://www.inaturalist.org/users/api_token.
# the JWT is valid for 24 hours. it can be used to do / access anything your iNat account can access. so keep it safe, and don't share it.
# you will also have to set use_authorization=True when making your API request below.
jwt = None

# define endpoints
endpoint_get_ids = {
    'method': 'GET',
    'url': 'https://api.inaturalist.org/v1/identifications',
    'max_records': 10000,
    'max_per_page': 200,
}

In [None]:
# main execution section -- part 1

# get identifications, filtered by the parameters defined in the request parameters (req_params)
ids = await get_results(endpoint_get_ids, req_params, get_all_pages=False, use_authorization=False)

In [None]:
# main execution section -- part 2

# extract the observation ids associated with the identifications
obs_ids = [oi['observation']['id'] for oi in ids]

# string together the observation IDs, along with a prefix, to create links to iNaturalist
# these will be printed in the cell output below. click on the URLs in the output to open a browser tab/window to that URL.
obs_id_sets = obs_ids_to_sets(obs_ids, prefix='https://www.inaturalist.org/observations/identify?id=')

In [None]:
# optional execution section
# this can be used to fetch and accumulate additional identifications after part 1 has already run, without having to change the main request parameters
# to use the code below, set get_more_ids = True before running.

# if you order by id when you get identifications (this is the default behavior if you don't specify an order_by parameter), 
# then it should be possible to work around the max 10000 record limit of the API by using the id_above or id_below parameters.
# I purposely am not automating this process completely (because I don't want to make it too easy to accidentally get a ton of data),
# but I'm including this bit of code here to provide an idea of how to do it.
get_more_ids = False
if get_more_ids and ids:
    rp = dict(req_params) # make a copy
    if rp.get('order_by',['id']) == ['id']: # this only works if the records were sorted by id
        if rp.get('order',['desc']) == ['asc']:
            max_id = max([i.get('id') for i in ids])
            print(f'getting additional identifications for id_above={max_id}')
            rp.pop('id_above', None) # remove id_above parameter, if it exists
            rp['id_above'] = [str(max_id)] # set this to the max_id so that the records we get will have ids above those of the identifications we already have
        else:
            min_id = min([i.get('id') for i in ids])
            print(f'getting additional identifications for id_below={min_id}')
            rp.pop('id_below', None) # remove id_below parameter, if it exists
            rp['id_below'] = [str(min_id)] # set this to the min_id so that the records we get will have ids below those of the identifications we already have
        ids += await get_results(endpoint_get_ids, rp, get_all_pages=False, use_authorization=False)
        print(f'identifications accumulated: {len(ids)}')