# Processing Steps and Discussion
An ArcGIS Server in the USGS infrastructure hosts a [MapServer service](https://my.usgs.gov/arcgis/rest/services/crcwc/crcwc/MapServer) for the "CRC Well Catalog (crcwc)." Along with some supporting layers for a map tool, the service exposes two layers, cores and cuttings, that carry the one really useful bit of information we can't get anywhere else: the internal database id property tied to the CRC Library Number (called "libno" in cores and "chlibno" in cuttings). We need the id value to assemble the landing page URL for each core/cutting record, both as a reference to include in the final ScienceBase Items and as the starting point for scraping related information on thin sections, "analysis" files, and photos.

We could also make use of the geometry returned by the MapServer, as it presumably represents fully validated point locations for the core/cutting items. The raw data initially downloaded from the web site shows at least one issue: latitude and longitude values are recorded with differing precision, and they do not match the coordinates in the MapServer. At this point, however, we really don't know what happened between the actual database records and the ArcGIS instantiation, and the precision issues don't affect what we need to do here.
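Because the query asks for `f=geojson`, each record comes back as a GeoJSON Feature with a `properties` object and a point `geometry` whose coordinates are ordered longitude, latitude. As a minimal sketch, assuming that shape, one feature could be flattened as follows. The helper name `flatten_feature` and the sample values are made up for illustration and are not taken from the service.

```python
def flatten_feature(feature, libno_field="libno"):
    """Flatten one GeoJSON point feature into a simple dict.

    libno_field is "libno" for cores and "chlibno" for cuttings.
    """
    props = feature.get("properties") or {}
    geometry = feature.get("geometry") or {}
    # GeoJSON point coordinates are ordered [longitude, latitude]
    coords = geometry.get("coordinates") or [None, None]
    return {
        "id": props.get("id"),
        "libno": props.get(libno_field),
        "longitude": coords[0],
        "latitude": coords[1],
    }


# Hypothetical feature; the values are invented, not real CRC data
sample_feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-105.1, 39.7]},
    "properties": {"id": 123, "libno": "A001"},
}

flat = flatten_feature(sample_feature)
```

Keeping the coordinates alongside the id makes it possible to compare them later against the values in the raw download.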

Using the ArcGIS MapServer query service, we can pull batches of 1000 records at a time, limited to just the two properties we need, and cache those in MongoDB for later assembly. In this notebook, I laid out a function that handles the layer-specific logic for each service layer, the parameters required by the ArcGIS MapServer REST API, and the HTTP request. The workflow loops over the two layers, assembles the full recordsets, and loads the data to my local MongoDB instance for later processing.
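The batching idea can be sketched independently of the network: keep advancing the offset by the number of features received until a page comes back empty. The sketch below uses a stand-in `fake_fetch` function and invented records so it runs offline; `fetch_all` and `fake_fetch` are hypothetical names, not part of the notebook or the ArcGIS API.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect features page by page until an empty page is returned."""
    records = []
    offset = 0
    while True:
        page = fetch_page(offset=offset, record_count=page_size)
        features = page.get("features", [])
        if not features:
            break
        records.extend(features)
        # Advance by what was actually returned, not by the requested size
        offset += len(features)
    return records


# Stand-in for the MapServer: 2500 invented records served in slices
fake_records = [{"properties": {"id": i}} for i in range(2500)]


def fake_fetch(offset, record_count):
    return {"features": fake_records[offset:offset + record_count]}


all_records = fetch_all(fake_fetch)
```

Advancing by the number of features actually returned keeps the loop correct even if the server caps a page below the requested size.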

# Dependencies
The code requires the Python Requests package and PyMongo client, both installed from the Conda-Forge distributions in my case. The MongoDB use here is completely optional. It is a convenience for what I am doing in the data assembly process, but this same idea could be executed with different types of databases or local files read into memory.
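As one example of the local-file alternative, the features could be cached in a JSON Lines file using only the standard library. This is a rough sketch under that assumption; the function names, file name, and demo records are hypothetical.

```python
import json
import os
import tempfile


def save_features_jsonl(features, path):
    """Write one JSON-encoded feature per line."""
    with open(path, "w", encoding="utf-8") as f:
        for feature in features:
            f.write(json.dumps(feature) + "\n")


def load_features_jsonl(path):
    """Read the features back into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Invented records standing in for MapServer features
demo_features = [
    {"type": "Feature", "geometry": None, "properties": {"id": 1, "libno": "A001"}},
    {"type": "Feature", "geometry": None, "properties": {"id": 2, "libno": "A002"}},
]

demo_path = os.path.join(tempfile.mkdtemp(), "cores_from_mapserver.jsonl")
save_features_jsonl(demo_features, demo_path)
reloaded = load_features_jsonl(demo_path)
```

A line-per-record file can be appended to batch by batch, which roughly mirrors how the records accumulate during paging.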

In [1]:
import requests
from pymongo import MongoClient

mongo_ndc = MongoClient()

In [2]:
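def crcwc_items_from_mapserver(sample_type="core", record_count=1000, offset=0):
    """Query one batch of records from the crcwc MapServer as GeoJSON."""
    # Each sample type lives in its own layer with its own library number field
    if sample_type == "core":
        layer = 0
        fields = "id,libno"
    elif sample_type == "cutting":
        layer = 1
        fields = "id,chlibno"
    else:
        raise ValueError(f"Unknown sample_type: {sample_type!r}")

    # Query string parameters; "where=0%3D0" is the URL-encoded form of 0=0,
    # which matches every record
    params = [
        "where=0%3D0",
        f"outFields={fields}",
        "returnGeometry=true",
        "returnIdsOnly=false",
        "returnCountOnly=false",
        "returnZ=false",
        "returnM=false",
        "returnDistinctValues=false",
        f"resultOffset={offset}",
        f"resultRecordCount={record_count}",
        "returnExtentsOnly=false",
        "f=geojson"
    ]

    ags_url = f"https://my.usgs.gov/arcgis/rest/services/crcwc/crcwc/MapServer/{layer}/query?{'&'.join(params)}"

    response = requests.get(ags_url)
    # Fail early on an HTTP error instead of trying to parse an error page
    response.raise_for_status()

    return response.json()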

In [3]:
%%time
for sample_type in ["core", "cutting"]:
    offset = 0
    crc_records = list()
    server_response = crcwc_items_from_mapserver(sample_type=sample_type, offset=offset)
    # Keep requesting batches until the service returns an empty feature list
    while len(server_response["features"]) > 0:
        crc_records.extend(server_response["features"])
        offset += len(server_response["features"])
        server_response = crcwc_items_from_mapserver(sample_type=sample_type, offset=offset)

    # insert_many raises an error on an empty list, so only insert when there are records
    if crc_records:
        mongo_ndc.crc[f'{sample_type}s_from_mapserver'].insert_many(crc_records)

CPU times: user 2.76 s, sys: 353 ms, total: 3.11 s
Wall time: 11min 41s
