# Processing Steps and Discussion
As part of this experiment, I looked at what information we might retrieve via the very usable [Macrostrat API](http://macrostrat.org/api). A number of its routes could bring useful context to the point locations in the NDC, and we would eventually want to run this as an "enhancer" on the entire catalog. For now, I worked up a process that uses one of the simplest API routes, one set up for mobile applications, to return basic geologic map unit context for surface geology. This gives us rock type, geologic formation, and age, which we can format into keywords (tags) on the ScienceBase Items, exposing the terms for faceted search and other uses. It also sets up a conversation about semantic alignment with the shorthand subsurface geologic formation and age values present in much of the interval information from the CRC records, which I will also digest into tags from a different scheme/vocabulary.

As with web scraping, I ran into some hiccups accessing the Macrostrat API, though fewer. Because some point coordinates in the data are identical and we only need to retrieve data once for each unique point, I create document stubs for only the unique coordinates, put them in a ledger to be filled, and run through them in a loop until every order is filled. For convenience, I use geohash2 to decode the hashes to latitude and longitude for the ledger, since the Macrostrat API route I'm using takes them as separate parameters.
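To make the decoding step concrete, here is a minimal, dependency-free sketch of what turning a geohash back into a latitude/longitude pair involves, followed by a toy ledger built from it. The notebook itself relies on geohash2 for this; the function name `decode_geohash` and the sample hashes are assumptions for illustration only.

```python
# Standard geohash base-32 alphabet (no a, i, l, o)
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode_geohash(geohash):
    """Return the (latitude, longitude) at the center of a geohash cell."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    is_lon = True  # bits alternate, starting with longitude
    for char in geohash:
        value = BASE32.index(char)
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            rng = lon_range if is_lon else lat_range
            mid = (rng[0] + rng[1]) / 2
            # A 1 bit keeps the upper half of the range, a 0 bit the lower half
            if bit:
                rng[0] = mid
            else:
                rng[1] = mid
            is_lon = not is_lon
    lat = (lat_range[0] + lat_range[1]) / 2
    lon = (lon_range[0] + lon_range[1]) / 2
    return lat, lon


# Illustrative hashes only; the duplicate collapses to a single ledger entry
sample_hashes = ["ezs42", "9xj64", "ezs42"]
ledger = []
for h in sorted(set(sample_hashes)):
    lat, lon = decode_geohash(h)
    # Stored as [longitude, latitude], the same order used in the ledger
    ledger.append({"coordinates_geohash": h, "coordinates": [lon, lat]})
```

Unlike this sketch, which returns floats at the cell center, geohash2's `decode` returns values rounded to the precision the hash actually supports.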

# Dependencies
This code requires the Requests, PyMongo, and Geohash2 packages, all installed from conda-forge in my case. Processing data point by point, even at this relatively small scale, really does require a database of some kind to be manageable. I use MongoDB here, but it could be anything. The Macrostrat API route used is described at its own path [here](https://macrostrat.org/api/mobile).

In [1]:
import requests
from pymongo import MongoClient
from datetime import datetime
import geohash2

mongo_ndc = MongoClient()

In [8]:
%%time
# Get coordinates from cores
all_coords = [i["coordinates_geohash"] for i in mongo_ndc.crc.cores_raw.find({"coordinates_geohash": {"$ne": None}},{"coordinates_geohash": 1})]
# Extend to add coordinates from cuttings
all_coords.extend([i["coordinates_geohash"] for i in mongo_ndc.crc.cuttings_raw.find({"coordinates_geohash": {"$ne": None}},{"coordinates_geohash": 1})])
print("Total records with point coordinates:", len(all_coords))
# Reduce the list by limiting to unique geohashes
unique_coords = list(set(all_coords))
# Build the ledger
unique_coords_ledger = [
    {
        "coordinates_geohash": c,
        # geohash2.decode returns (latitude, longitude); store as [longitude, latitude]
        "coordinates": [
            geohash2.decode(c)[1],
            geohash2.decode(c)[0]
        ]
    } for c in unique_coords
]
print("Total unique geohashes:", len(unique_coords_ledger))
# Insert the ledger into MongoDB to be filled
mongo_ndc.crc.gmu_context.insert_many(unique_coords_ledger)

Total records with point coordinates: 69899
Total unique geohashes: 58544
CPU times: user 2.65 s, sys: 36.1 ms, total: 2.69 s
Wall time: 3.27 s


<pymongo.results.InsertManyResult at 0x1040cf648>

# Surficial Geology Context
The Macrostrat team did the heavy lifting of integrating multiple resolutions of geologic maps from different national and global sources and setting up an API for retrieving this basic surface geology context for (nearly) any point location. Information retrieved from this API that may be of interest in the National Digital Catalog for narrowing search in a faceted manner includes rock type, geologic formation, and stratigraphic units. The following function handles the basic process of requesting this information for a set of coordinates, stamping the result with a datetime, and returning the information for our ledger.

In [9]:
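def macrostrat_context(coordinates):
    # coordinates is a (longitude, latitude) pair, matching the order stored in the ledger
    api = f"https://macrostrat.org/api/mobile/point?lat={coordinates[1]}&lng={coordinates[0]}"

    try:
        # A timeout keeps a stalled request from hanging the whole run
        r = requests.get(api, headers={"accept": "application/json"}, timeout=30).json()
    except (requests.exceptions.RequestException, ValueError):
        # Network problem or non-JSON response; leave this point unfilled
        return None

    if "success" in r and "data" in r["success"]:
        return {
            "date_retrieved": datetime.utcnow().isoformat(),
            "data": r["success"]["data"]
        }

    return None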

Once everything is prepped, we can run through the ledger and request what we need from the Macrostrat API. We could multithread this and get it done faster, but I don't know what rate limits or denial-of-service thresholds may be in place for the API, so I run the requests one at a time. If we end up wanting to run this as a production enhancer for every point coordinate in the NDC, we would need a different kind of process, as this approach really won't scale to that level. Depending on how the geologic map unit polygon data is set up (e.g., something like an Elasticsearch index), we should be able to pass what might be a very large collection of geohashed points, find all polygons containing those points, and return the relevant properties from the index. In the meantime, we can look at the available information as tags within ScienceBase and discuss its utility.
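Since the API's thresholds are unknown, one cautious option is to retry a failed request with an increasing pause rather than hitting the service again immediately. The sketch below is an assumption about how that could look, not something this notebook does: `fetch_with_backoff`, its parameters, and the stand-in `flaky_fetch` are all hypothetical names, and a function like `macrostrat_context` could be passed in as `fetch`.

```python
import time


def fetch_with_backoff(fetch, coordinates, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(coordinates) until it returns a result or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch(coordinates)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x ... the base delay before trying again
            sleep(base_delay * 2 ** attempt)
    return None


# Demonstration with a stand-in fetcher that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch(coordinates):
    calls["count"] += 1
    return None if calls["count"] < 3 else {"data": "ok"}


# Record the requested delays instead of actually sleeping
delays = []
result = fetch_with_backoff(flaky_fetch, (-105.0, 40.0), sleep=delays.append)
```

Passing `sleep` as a parameter keeps the pacing testable without real waiting; in a real run the default `time.sleep` would apply.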

In [None]:
# Fill each unfilled ledger entry. Entries whose request fails stay unfilled,
# so re-running this cell retries only those.
for item in list(mongo_ndc.crc.gmu_context.find({"date_retrieved": {"$exists": False}})):
    context = macrostrat_context(tuple(item["coordinates"]))
    if context is not None:
        mongo_ndc.crc.gmu_context.update_one({"_id": item["_id"]}, {"$set": context})
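For the bulk alternative discussed earlier (finding the map unit polygons that contain each point, then returning their properties), the core operation is a point-in-polygon test. The following is a rough sketch using ray casting on simple rings; the map units, their names, and their coordinates are entirely hypothetical, and a production setup would use a spatial index rather than a nested loop like this.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of (lon, lat) vertices?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the east of the point
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical map units as simple rectangles of (lon, lat) vertices
map_units = [
    {"name": "hypothetical unit A",
     "ring": [(-106.0, 39.0), (-104.0, 39.0), (-104.0, 41.0), (-106.0, 41.0)]},
    {"name": "hypothetical unit B",
     "ring": [(-104.0, 39.0), (-102.0, 39.0), (-102.0, 41.0), (-104.0, 41.0)]},
]

points = [(-105.0, 40.0), (-103.0, 40.0), (-100.0, 40.0)]

# For each point, collect the names of every unit that contains it
matches = {
    p: [u["name"] for u in map_units if point_in_polygon(p[0], p[1], u["ring"])]
    for p in points
}
```

Each point is checked against every polygon here, which is fine for a handful of units but is exactly the cost a polygon index would avoid at catalog scale.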