# Processing Steps and Discussion
The "raw" data I used to build new ScienceBase Items for the Core Research Center (CRC) cores and cuttings collections came from downloads from the CRC Well Catalog web application. The CSV data downloaded this way is relatively harmonized between the two collections and represents the best online source for the data at this time. After experimenting with various ways of reading files into memory for all processing steps, I opted to use a local MongoDB instance in Docker as an assembly point. Information assembled from the Macrostrat API on geologic map unit context is keyed on latitude and longitude coordinates pulled from the raw data. To support the later assembly of these related properties, I run a process here to generate a geohash string from the coordinates. This gives me a single value per coordinate pair to operate against in later steps.
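To illustrate what the geohash step produces, here is a minimal pure-Python sketch of the standard geohash encoding. It is not the geohash2 implementation that the notebook actually calls; the function name `encode_geohash` and its `precision` parameter are assumptions made for illustration only.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def encode_geohash(lat, lng, precision=12):
    """Encode a latitude/longitude pair as a geohash string (illustrative sketch)."""
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lng = True  # bits alternate between axes, starting with longitude
    while len(chars) < precision:
        value, rng = (lng, lng_range) if use_lng else (lat, lat_range)
        mid = (rng[0] + rng[1]) / 2
        if value > mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(encode_geohash(57.64911, 10.40744, precision=5))  # u4pru
```

Identical coordinates always encode to the same string, which is what makes the geohash usable as a join key, and nearby points tend to share a common prefix.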

This code executes the following steps for each collection:

1. Read the CSV file from a dynamic web server response into a Pandas dataframe
2. Geohash the coordinates into a new field and add two additional preset values (a string for the type of collection and the ID of the target ScienceBase Item that the records are destined for)
3. Load the resulting dataframe as a list of dictionaries to MongoDB collections for the "raw" input data
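The three steps above can be sketched without pandas or MongoDB, using only the standard library, to show the shape of the documents that end up in the "raw" collections. This is a simplified stand-in, not the notebook's code: `prepare_records`, the `hasher` argument, and the "Well Name" sample column are hypothetical, and the final database insert is left out.

```python
import csv
import io

def prepare_records(csv_text, collection_name, parent_id, hasher):
    """Turn CSV text into a list of dictionaries ready to insert as documents."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Empty cells become None (the notebook converts pandas NaN to None)
        doc = {key: (value if value != "" else None) for key, value in row.items()}
        lat, lng = doc.get("Latitude"), doc.get("Longitude")
        # Only hash when both coordinates are present
        if lat is not None and lng is not None:
            doc["coordinates_geohash"] = hasher(float(lat), float(lng))
        else:
            doc["coordinates_geohash"] = None
        doc["crc_collection_name"] = collection_name
        doc["sb_parent_id"] = parent_id
        records.append(doc)
    return records

# Hypothetical two-row sample; "Well Name" is an invented column for illustration
sample_csv = "Well Name,Latitude,Longitude\nExample A,39.7,-105.1\nExample B,,\n"

# Stand-in hasher; the notebook uses geohash2.encode here
def fake_hasher(lat, lng):
    return f"{lat:.1f}_{lng:.1f}"

docs = prepare_records(sample_csv, "core", "4f4e49dae4b07f02db5e0486", fake_hasher)
for doc in docs:
    print(doc)
```

The second sample row has no coordinates, so its `coordinates_geohash` stays `None` rather than receiving a misleading hash.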

The raw data downloaded and prepped here repeats the core or cutting metadata once for every borehole interval identified for that sample. For ScienceBase purposes, we want to identify only one physical sample item for each core and cutting metadata record, grouping the interval information together into an array. Because we brought the data together into MongoDB in this step, we can accomplish that grouping at the end using aggregation in the database.
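As a rough sketch of that grouping, the block below shows one possible `$group` stage alongside a pure-Python equivalent that can be run without a database. The field names `library_number`, `well_name`, `min_depth`, and `max_depth` are hypothetical, since the actual CRC column names are not shown in this section, and the pipeline is an assumed shape rather than the one used later in this work.

```python
# Assumed shape of an aggregation pipeline that groups intervals per sample;
# all field names here are hypothetical.
group_pipeline = [
    {"$group": {
        "_id": "$library_number",
        "well_name": {"$first": "$well_name"},
        "intervals": {"$push": {"min_depth": "$min_depth", "max_depth": "$max_depth"}},
    }}
]

def group_intervals(records):
    """Pure-Python equivalent of the $group stage above."""
    grouped = {}
    for rec in records:
        # The first record seen for a key supplies the shared metadata ($first)
        item = grouped.setdefault(
            rec["library_number"],
            {"_id": rec["library_number"], "well_name": rec["well_name"], "intervals": []},
        )
        # Every record contributes one interval to the array ($push)
        item["intervals"].append(
            {"min_depth": rec["min_depth"], "max_depth": rec["max_depth"]}
        )
    return list(grouped.values())

# Invented sample rows: one sample with two intervals, one with a single interval
rows = [
    {"library_number": "A001", "well_name": "Example A", "min_depth": "100", "max_depth": "150"},
    {"library_number": "A001", "well_name": "Example A", "min_depth": "200", "max_depth": "260"},
    {"library_number": "B002", "well_name": "Example B", "min_depth": "50", "max_depth": "75"},
]

items = group_intervals(rows)
for item in items:
    print(item["_id"], len(item["intervals"]))
```

Against a running database, a pipeline like this would be passed to the collection's `aggregate` method, for example `mongo_ndc.crc.cores_raw.aggregate(group_pipeline)`.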

# Dependencies
This code requires the geohash2 and pandas packages (installed from conda-forge distributions in my case), along with pymongo to connect to MongoDB. I use a local MongoDB instance run in Docker from an image on DockerHub, with no authentication. This sets up a barebones instance that does what I need it to do and can then go away. The same process could be run with a variety of other approaches, from other databases to local file storage.

In [1]:
from pymongo import MongoClient
import geohash2
import pandas as pd

mongo_ndc = MongoClient()

In [2]:
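def geohash_coords(lat, lng):
    # Missing values read by pandas arrive as NaN rather than None,
    # so check both coordinates with pd.isna before encoding
    if pd.isna(lat) or pd.isna(lng):
        return None
    return geohash2.encode(float(lat), float(lng))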

In [3]:
%%time
# URL for the raw cores data download from the CRC web site
cores_raw_url = "https://my.usgs.gov/crcwc/search/cores?f=csv&extension=csv&offset=0&max=50&county=&format=&section=&wellname=&formation=&type=&operator=&cuttings=true&search=Search&cores=true&field=&crclibrarynumber=&townshipnumber=&state=&apinumber=&rangenumber=&fieldsorting=%2Btwnnum&fieldsorting=%2Blibnum&fieldsorting=%2Bmindepth"

# Read raw data CSV into dataframe
df_cores_raw = pd.read_csv(cores_raw_url, dtype=str)

# Add geohashed coordinates and two default properties to dataframe
df_cores_raw["coordinates_geohash"] = df_cores_raw.apply(lambda x: geohash_coords(x["Latitude"], x["Longitude"]), axis=1)
df_cores_raw["sb_parent_id"] = "4f4e49dae4b07f02db5e0486"
df_cores_raw["crc_collection_name"] = "core"

# Convert NaN values to None, then load the rows as a list of documents into the MongoDB collection
mongo_ndc.crc.cores_raw.insert_many(df_cores_raw.where((pd.notnull(df_cores_raw)), None).to_dict('records'))

CPU times: user 1.46 s, sys: 105 ms, total: 1.56 s
Wall time: 53.4 s


<pymongo.results.InsertManyResult at 0x11c352f88>

In [4]:
%%time
# URL for the raw cuttings data download from the CRC web site
cuttings_raw_url = "https://my.usgs.gov/crcwc/search/cuttings?f=csv&extension=csv&offset=0&max=50&county=&format=&section=&wellname=&formation=&type=&operator=&cuttings=true&search=Search&cores=true&field=&crclibrarynumber=&townshipnumber=&state=&apinumber=&rangenumber=&fieldsorting=%5B%2Btwnnum%2C+%2Blibnum%2C+%2Bmindepth%5D"

# Read raw data CSV into dataframe
df_cuttings_raw = pd.read_csv(cuttings_raw_url, dtype=str)

# Add geohashed coordinates and two default properties to dataframe
df_cuttings_raw["coordinates_geohash"] = df_cuttings_raw.apply(lambda x: geohash_coords(x["Latitude"], x["Longitude"]), axis=1)
df_cuttings_raw["sb_parent_id"] = "4f4e49d8e4b07f02db5df2d2"
df_cuttings_raw["crc_collection_name"] = "cutting"

# Convert NaN values to None, then load the rows as a list of documents into the MongoDB collection
mongo_ndc.crc.cuttings_raw.insert_many(df_cuttings_raw.where((pd.notnull(df_cuttings_raw)), None).to_dict('records'))

CPU times: user 4.89 s, sys: 316 ms, total: 5.21 s
Wall time: 2min 57s


<pymongo.results.InsertManyResult at 0x11e507e08>