# Processing Steps and Discussion
The MapServer data gave us the ID values needed to construct landing page URLs for the cores and cuttings. The landing pages contain information on thin sections, documents, and photos. Of the 62K total items, about 10K have one or more related artifacts, based on the "T/F" values in the core/cutting metadata. We can put the URLs generated from the internal id property into a list and pass them to a function that uses BeautifulSoup to pull the relevant pieces into a data structure, which we can then stitch into our item information.

When I first tried to run this, I attempted to do everything in real time, assembling the final items on the fly. I kept running into HTTP disconnects and other hiccups. These happen for all kinds of reasons, but they are a pain when building something like this: essentially, a distributed data system that uses the web and linking parameters as a relational database. The process routinely blows up with a connection aborted error, potentially because an edge firewall device doesn't like receiving so many connections.

Included in the raw data properties are boolean indicators of whether or not a given core/cutting record has related photos, "analysis" files, and/or thin sections. Now that we have everything in MongoDB, it makes sense to leverage a bit of the aggregation framework to pull together the records we want to operate against, those that will have something usable on the web pages to scrape.

## Dependencies
This notebook requires the Requests, BeautifulSoup4, and PyMongo packages, all from Conda-Forge distributions in my case. I also use the built-in datetime module to stamp each extracted record with the date/time it was scraped. This could help determine when an update should be run in the future. More than other steps in this workflow, this one involves some relatively heavy MongoDB dependencies. Once we start assembling our data in the local Docker-based MongoDB, it ends up being much more efficient to conduct certain operations at the database instead of pulling the data down and then pushing it back in some other form.

In [10]:
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient
from datetime import datetime

mongo_ndc = MongoClient()

## Web Scraping Function
BeautifulSoup is one of a number of available tools for scraping data from web pages. In examining the HTML content of the CRC core and cutting "report" pages, there are a couple of usable hooks that let us zero in on the sections of the dynamically rendered pages that hold links to photos and documents or tables of thin section information. This function is, of course, completely specific to this particular use case. However, it may be a useful pattern for other cases on the way toward providing a more usable data structure behind the various collections we are working with in the data preservation pursuit.

In [2]:
def extract_crc_landing_page(crcwc_url):
    target_schemas = {
        "intervals": ['Min Depth', 'Max Depth', 'Age', 'Formation'],
        "thin_sections": ['Sequence', 'Min Depth', 'Max Depth', 'View']
    }
    
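    # The 30-second timeout is an assumed value; without one, a stalled
    # connection can hang indefinitely.
    r = requests.get(crcwc_url, timeout=30)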
    
    soup = BeautifulSoup(r.content, "html.parser")
    
    data_structures = {
        "crcwc_url": crcwc_url,
        "date_scraped": datetime.utcnow().isoformat()
    }
    for index, table in enumerate(soup.findAll("table",{"class":"report2"})):
        first_row = table.find("tr")
        labels = [i.text for i in first_row.findAll("td", {"class": "label"})]
        target_data = list(target_schemas.keys())[list(target_schemas.values()).index(labels)]
        data_structures[target_data] = list()

        for row in [r for r in table.findAll("tr")][1:]:
            d_this = dict()
            for i, col in enumerate(row.findAll("td")):
                anchor = col.find("a")
                if anchor:
                    this_data = anchor.get("href")
                else:
                    this_data = col.text
                d_this[labels[i]] = this_data
            data_structures[target_data].append(d_this)

    photos = list()
    documents = list()

    for section in soup.findAll("div",{"class":"report2"}):
        photos.extend(list(set([i.get('href') for i in section.findAll("a",{"title":"see photo"})])))
        documents.extend(list(set([i.get('href') for i in section.findAll("a",{"title":"download analysis document"})])))
        
    if len(photos) > 0:
        data_structures["photos"] = photos
        
    if len(documents) > 0:
        data_structures["documents"] = documents
        
    # data_structures always holds at least the URL and the scrape timestamp,
    # so it is never empty and can be returned directly.
    return data_structures
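
The line that maps a table's header labels back to a schema name is the densest part of the function, and it raises a `ValueError` if a page ever presents a table whose header is not in `target_schemas`. Here is a minimal sketch of a more forgiving lookup, assuming it is acceptable to skip unrecognized tables. `match_schema` is a hypothetical helper name, not something from the original notebook.

```python
def match_schema(labels, target_schemas):
    """Return the schema name whose column labels equal `labels`, or None."""
    for name, columns in target_schemas.items():
        if columns == labels:
            return name
    return None


target_schemas = {
    "intervals": ["Min Depth", "Max Depth", "Age", "Formation"],
    "thin_sections": ["Sequence", "Min Depth", "Max Depth", "View"],
}

# Header labels as they would be read from the first row of a table.
print(match_schema(["Sequence", "Min Depth", "Max Depth", "View"], target_schemas))
# An unrecognized header yields None instead of raising.
print(match_schema(["Unexpected", "Header"], target_schemas))
```

Inside the table loop, a `None` result could be handled with `continue` so that one odd table does not abort the whole page.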

## Aggregations to Pull Usable Landing Page Links
The following two aggregation pipelines implement the logic needed to build lists of cores and cuttings that will have something harvestable on their web pages, along with the links to those pages built from the identifier values we pulled earlier from MapServer. To deal with the vagaries of the web and the potential need to restart this process multiple times, I set up what is essentially a ledger of "orders" to be filled by the web scraping routine. Each ledger entry has a URL and a CRC Library Number to go after. An order is filled when the web scraping function runs, pulls its information together, and updates the MongoDB record with a date stamp and the resulting information. We can keep running it as long as any record lacks a date stamp. The pipelines here finish by writing their results into two different collections, but we really only need one "ledger" to operate against. I finish this part of the process by copying all the items from the second collection into the first and dropping the second collection.

In [3]:
core_url_pipeline = [
    {
        u"$match": {
            u"$or": [
                {
                    u"Photos": u"T"
                },
                {
                    u"Thin Sec": u"T"
                },
                {
                    u"Analysis": u"T"
                }
            ]
        }
    }, 
    {
        u"$group": {
            u"_id": u"$Lib Num"
        }
    }, 
    {
        u"$lookup": {
            u"from": u"cores_from_mapserver",
            u"localField": u"_id",
            u"foreignField": u"properties.libno",
            u"as": u"mapserver_data"
        }
    }, 
    {
        u"$project": {
            u"_id": 0.0,
            u"Lib Num": u"$_id",
            u"crcwc_url": {
                u"$concat": [
                    u"https://my.usgs.gov/crcwc/core/report/",
                    {
                        u"$toString": {
                            u"$arrayElemAt": [
                                u"$mapserver_data.id",
                                0.0
                            ]
                        }
                    }
                ]
            }
        }
    }, 
    {
        u"$out": u"web_page_info"
    }
]

cutting_url_pipeline = [
    {
        u"$match": {
            u"$or": [
                {
                    u"Thin Sec": u"T"
                },
                {
                    u"Analysis": u"T"
                }
            ]
        }
    }, 
    {
        u"$group": {
            u"_id": u"$Lib Num"
        }
    }, 
    {
        u"$lookup": {
            u"from": u"cuttings_from_mapserver",
            u"localField": u"_id",
            u"foreignField": u"properties.chlibno",
            u"as": u"mapserver_data"
        }
    }, 
    {
        u"$project": {
            u"_id": 0.0,
            u"Lib Num": u"$_id",
            u"crcwc_url": {
                u"$concat": [
                    u"https://my.usgs.gov/crcwc/cutting/report/",
                    {
                        u"$toString": {
                            u"$arrayElemAt": [
                                u"$mapserver_data.id",
                                0.0
                            ]
                        }
                    }
                ]
            }
        }
    }, 
    {
        u"$out": u"web_page_info_cuttings"
    }
]
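
For readers less familiar with the aggregation framework, the `$match`, `$group`, `$lookup`, and `$project` stages amount to a filter, de-duplicate, join, and string-concatenate sequence. The plain-Python sketch below mirrors the core pipeline on a few made-up records. The sample values and the `build_ledger` helper are illustrative assumptions, and the layout of the MapServer documents (`id` at the top level, `libno` under `properties`) is inferred from the pipeline itself.

```python
BASE_URL = "https://my.usgs.gov/crcwc/core/report/"
FLAGS = ("Photos", "Thin Sec", "Analysis")


def build_ledger(raw_records, mapserver_records, base_url=BASE_URL):
    # $match + $group: unique library numbers with at least one "T" flag.
    lib_nums = []
    for rec in raw_records:
        has_artifact = any(rec.get(flag) == "T" for flag in FLAGS)
        if has_artifact and rec["Lib Num"] not in lib_nums:
            lib_nums.append(rec["Lib Num"])

    # $lookup: index MapServer records by library number, keeping a single
    # match per number, as $arrayElemAt with index 0 does.
    ids_by_libno = {}
    for feature in mapserver_records:
        ids_by_libno.setdefault(feature["properties"]["libno"], feature["id"])

    # $project: concatenate the base URL with the stringified internal id.
    # A library number with no MapServer match gets a null URL, because
    # $concat returns null when one of its arguments is null.
    ledger = []
    for lib_num in lib_nums:
        internal_id = ids_by_libno.get(lib_num)
        url = None if internal_id is None else base_url + str(internal_id)
        ledger.append({"Lib Num": lib_num, "crcwc_url": url})
    return ledger


# Made-up sample data: two rows for the same core, one core with no artifacts.
raw = [
    {"Lib Num": "A001", "Photos": "T", "Thin Sec": "F", "Analysis": "F"},
    {"Lib Num": "A001", "Photos": "F", "Thin Sec": "T", "Analysis": "F"},
    {"Lib Num": "B002", "Photos": "F", "Thin Sec": "F", "Analysis": "F"},
]
mapserver = [{"id": 1234, "properties": {"libno": "A001"}}]

for entry in build_ledger(raw, mapserver):
    print(entry)
```

The sketch also makes one edge case visible: ledger entries whose library number has no MapServer counterpart end up with an empty URL, which may be worth checking for before scraping.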

The following code block runs both ledger-building pipelines to create the new collections, merges them into the one final collection, and drops the one we don't need to keep.

In [4]:
%%time
mongo_ndc.crc.cores_raw.aggregate(core_url_pipeline)
mongo_ndc.crc.cuttings_raw.aggregate(cutting_url_pipeline)
mongo_ndc.crc.web_page_info.insert_many([i for i in mongo_ndc.crc.web_page_info_cuttings.find({},{"_id": 0})])
mongo_ndc.crc.web_page_info_cuttings.drop()

CPU times: user 47.1 ms, sys: 10.3 ms, total: 57.4 ms
Wall time: 2min 11s


With the prep work done, the rest is reasonably simple: query the ledger for an entry not yet date stamped, run the scraper on it, and update the ledger with the extracted information, repeating until no unstamped entries remain. This code block had to be restarted a number of times to fully complete.

In [12]:
item = mongo_ndc.crc.web_page_info.find_one({"date_scraped": {"$exists": False}}, sort=[('crcwc_url', -1)])
while item is not None:
    mongo_ndc.crc.web_page_info.update_one({"_id": item["_id"]},{"$set": extract_crc_landing_page(item["crcwc_url"])})
    item = mongo_ndc.crc.web_page_info.find_one({"date_scraped": {"$exists": False}}, sort=[('crcwc_url', -1)])
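
Because an unhandled exception in the scraper stops the loop above, every dropped connection means a manual restart. Here is a hedged sketch of one alternative: retry each order a few times with a growing pause, then move on if it keeps failing. It uses an in-memory list in place of the MongoDB collection so that it runs on its own. `drain_ledger`, `scrape`, `max_attempts`, and `base_delay` are hypothetical names, and the retry policy is an assumption rather than what this notebook does.

```python
import time


def drain_ledger(ledger, scrape, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Fill every order lacking a date stamp; return the URLs that kept failing."""
    failed = []
    for order in ledger:
        if "date_scraped" in order:
            continue  # already filled on an earlier run
        for attempt in range(max_attempts):
            try:
                order.update(scrape(order["crcwc_url"]))
                break
            except Exception:
                # In real use, catch something narrower, such as
                # requests.exceptions.RequestException.
                if attempt < max_attempts - 1:
                    # Wait 1x, 2x, 4x ... the base delay before retrying.
                    sleep(base_delay * 2 ** attempt)
        else:
            # The inner loop never hit `break`, so every attempt failed.
            failed.append(order["crcwc_url"])
    return failed


calls = {}


def flaky_scrape(url):
    # Stand-in for extract_crc_landing_page: fails on the first call per URL.
    calls[url] = calls.get(url, 0) + 1
    if calls[url] == 1:
        raise ConnectionError("connection aborted")
    return {"crcwc_url": url, "date_scraped": "2020-01-01T00:00:00"}


# Placeholder URLs; the second order is already filled and gets skipped.
ledger = [
    {"crcwc_url": "https://example.org/report/1"},
    {"crcwc_url": "https://example.org/report/2", "date_scraped": "2019-12-31T00:00:00"},
]
failed = drain_ledger(ledger, flaky_scrape, sleep=lambda seconds: None)
print(failed)
print(ledger[0]["date_scraped"])
```

Orders that exhaust their attempts stay unstamped, so a later run picks them up again, which keeps the ledger idea intact.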