# Exploratory Scraping

The data structures of the [browse results](https://numismatics.org/ocre/results?q=&start=0) and the [canonical URI](https://numismatics.org/ocre/id/ric.1(2).aug.1A) pages for individual coins do not appear to be consistent. To determine the data fields required for the database, exploratory scraping will be conducted in Jupyter Notebooks and Python scripts.

In [1]:
from bs4 import BeautifulSoup
from pathlib import Path

In [5]:
import constants as c

## Browse Results

The "browse results" are available [here](https://numismatics.org/ocre/results?q=&start=0) and provide the top-level data about coins in the database: coin name, canonical URI, date, denomination, mint, obverse (heads), reverse (tails), and number of coins. Not all coins have all eight data fields, and it is unclear whether any other fields are available. Therefore, an exploratory scrape will be conducted to determine all available fields.
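One way to survey the available fields is to count how often each field label occurs across result records. The sketch below is only an illustration: the `SAMPLE_HTML` markup, the `result-doc` class, and the `<dt>` labels are hypothetical stand-ins and have not been verified against the real OCRE pages, and `count_field_labels` is a helper name introduced here. It uses the built-in `html.parser` so that it runs without `lxml`.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real OCRE layout may differ.
SAMPLE_HTML = """
<div class="result-doc">
  <dl><dt>Date</dt><dd>value A</dd><dt>Mint</dt><dd>value B</dd></dl>
</div>
<div class="result-doc">
  <dl><dt>Date</dt><dd>value C</dd><dt>Denomination</dt><dd>value D</dd></dl>
</div>
"""


def count_field_labels(html: str) -> Counter:
    """Count in how many records each field label appears."""
    parsed = BeautifulSoup(html, "html.parser")
    labels = Counter()
    for record in parsed.find_all("div", class_="result-doc"):
        # Use a set so a label is counted at most once per record.
        seen = {dt.get_text(strip=True) for dt in record.find_all("dt")}
        labels.update(seen)
    return labels


field_counts = count_field_labels(SAMPLE_HTML)
print(field_counts)
```

Labels that appear in fewer records than the total would mark optional fields, which could inform which database columns must allow missing values.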

A sample file for local experimentation is defined below.

In [13]:
# Actual path ('start=X' value changes with incrementing pages)
# path = "https://numismatics.org/ocre/results?q=objectType_facet%3A%22Coin%22&start=0"

In [3]:
!ls ./../data/ | grep -E "ocre_browse_results_sample"

ocre_browse_results_sample.html


In [8]:
EXAMPLE_FILE = "ocre_browse_results_sample.html"
PATH_SAMPLE = c.DATA_FOLDER / EXAMPLE_FILE

In [9]:
with open(PATH_SAMPLE, "r", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, "lxml")

In [12]:
soup

<!DOCTYPE HTML>
<html lang="en"><head profile="http://a9.com/-/spec/opensearch/1.1/"><title>Online Coins of the Roman Empire: Browse Collection</title><link href="http://numismatics.org/ocre/feed/?q=objectType_facet:%22Coin%22" rel="alternate" type="application/atom+xml"/><link href="http://numismatics.org/ocre/query.csv/?q=objectType_facet:%22Coin%22" rel="alternate" type="text/csv"/><link href="http://numismatics.org/ocre/query.kml/?q=objectType_facet:%22Coin%22" rel="alternate" type="application/vnd.google-earth.kml+xml"/><link href="http://numismatics.org/ocre/opensearch.xml" rel="search" title="Example Search for http://numismatics.org/ocre/" type="application/opensearchdescription+xml"/><meta content="41269" name="totalResults"/><meta content="0" name="startIndex"/><meta content="20" name="itemsPerPage"/><link href="https://numismatics.org/themes/ocre/images/favicon.png" rel="shortcut icon" type="image/x-icon"/><meta content="width=device-width, initial-scale=1" name="viewport"/>

The first step is to extract the page ID, the start coin ID, the end coin ID, and the total number of coins.

In [63]:
records_and_page = soup.find("div", class_="paging_div row")
display_records_data = (
    records_and_page
        .find("div", class_="col-md-6")
        .text
        .strip()
        .split()
)
# Tokens 2, 4, and 6 of the split text hold the start ID, end ID, and total count.
# max_coin_id will only need to be extracted on the first page.
coin_start_id, coin_end_id, max_coin_id = [
    int(item) for item in display_records_data[2:6+1:2]
]
print([coin_start_id, coin_end_id, max_coin_id])

[1, 20, 41269]
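Slicing by token position depends on the exact wording of the paging text. As a possible alternative, the sketch below pulls the first three integers out of the text with a regular expression. The `sample_text` wording is an assumption inferred from the token positions used above, and `parse_record_range` is a hypothetical helper name.

```python
import re


def parse_record_range(text: str) -> tuple[int, int, int]:
    """Return (start_id, end_id, total) from the paging text."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) < 3:
        raise ValueError(f"Expected three numbers in paging text, got: {text!r}")
    start_id, end_id, total = numbers[:3]
    return start_id, end_id, total


# Assumed wording of the paging text; the real page may phrase it differently.
sample_text = "Displaying records 1 to 20 of 41269 total results."
print(parse_record_range(sample_text))
```

This version raises an error instead of silently returning wrong values if the paging text ever contains fewer than three numbers.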


In [62]:
# page_id from navigation bar
page_id = int(
    records_and_page
        .find("div", class_="col-md-6 page-nos")
        .text.strip()
)
page_id

1

The `max_coin_id` value can be used to calculate the number of pages to iterate through. Twenty coins appear on each page. If the total number of coins is not evenly divisible by twenty, then an extra page will have to be scraped.

In [38]:
(max_coin_id // 20) + int(max_coin_id % 20 != 0)

2064
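The same page count can be turned into the list of URLs to scrape. The sketch below is a minimal illustration, assuming that the `start` query parameter is a zero-based record offset that advances by the page size (consistent with the `startIndex` and `itemsPerPage` meta tags in the sample file, but not confirmed for every page). `page_count` and `page_urls` are hypothetical helper names.

```python
BASE_URL = "https://numismatics.org/ocre/results"
QUERY = "objectType_facet%3A%22Coin%22"
ITEMS_PER_PAGE = 20  # matches the itemsPerPage meta tag in the sample file


def page_count(total_records: int, per_page: int = ITEMS_PER_PAGE) -> int:
    """Ceiling division: a partial final page still counts as a page."""
    return (total_records + per_page - 1) // per_page


def page_urls(total_records: int, per_page: int = ITEMS_PER_PAGE):
    """Yield one URL per page, assuming 'start' is a zero-based record offset."""
    for start in range(0, total_records, per_page):
        yield f"{BASE_URL}?q={QUERY}&start={start}"


urls = list(page_urls(41269))
print(page_count(41269), urls[0], urls[-1], sep="\n")
```

Generating the URLs lazily keeps the page loop separate from the fetching code, so a delay between requests can be added later without changing this logic.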