# Exploratory Scraping

The data structures of the [browse results](https://numismatics.org/ocre/results?q=&start=0) and the [canonical URI](https://numismatics.org/ocre/id/ric.1(2).aug.1A) pages for individual coins do not appear to be consistent. To determine the data fields required for the database, exploratory scraping will be conducted in Jupyter Notebooks and Python scripts.

In [1]:
from bs4 import BeautifulSoup
import requests
from pathlib import Path
from collections import deque
from pprint import pprint
import re

In [2]:
import constants as c

## Browse Results

The "browse results" are available [here](https://numismatics.org/ocre/results?q=&start=0) and contain the top-level data about coins in the database: coin name, canonical URI, date, denomination, mint, obverse (heads), reverse (tails), and number of coins. Not all coins have all eight data fields, and it is not certain whether any other fields are available. Therefore, an exploratory scrape will be conducted to determine all possible fields.

A sample file for local experimentation is defined below.

In [3]:
# Actual path ('start=X' value changes with incrementing pages)
# path = "https://numismatics.org/ocre/results?q=objectType_facet%3A%22Coin%22&start=0"

In [4]:
!ls ./../data/ | grep -E "ocre_browse_results_sample"

ocre_browse_results_sample.html
ocre_browse_results_sample_some_coins_no_text.html


In [5]:
EXAMPLE_FILE = "ocre_browse_results_sample.html"
PATH_SAMPLE = c.DATA_FOLDER / EXAMPLE_FILE

In [6]:
with open(PATH_SAMPLE, "r", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, "lxml")

In [7]:
soup.prettify()[:10_000]

'<!DOCTYPE HTML>\n<html lang="en">\n <head profile="http://a9.com/-/spec/opensearch/1.1/">\n  <title>\n   Online Coins of the Roman Empire: Browse Collection\n  </title>\n  <link href="http://numismatics.org/ocre/feed/?q=objectType_facet:%22Coin%22" rel="alternate" type="application/atom+xml"/>\n  <link href="http://numismatics.org/ocre/query.csv/?q=objectType_facet:%22Coin%22" rel="alternate" type="text/csv"/>\n  <link href="http://numismatics.org/ocre/query.kml/?q=objectType_facet:%22Coin%22" rel="alternate" type="application/vnd.google-earth.kml+xml"/>\n  <link href="http://numismatics.org/ocre/opensearch.xml" rel="search" title="Example Search for http://numismatics.org/ocre/" type="application/opensearchdescription+xml"/>\n  <meta content="41269" name="totalResults"/>\n  <meta content="0" name="startIndex"/>\n  <meta content="20" name="itemsPerPage"/>\n  <link href="https://numismatics.org/themes/ocre/images/favicon.png" rel="shortcut icon" type="image/x-icon"/>\n  <meta content="

The first step is to extract the page ID, start coin ID, end coin ID, and maximum coin ID.

In [8]:
records_and_page = soup.find("div", class_="paging_div row")
display_records_data = (
    records_and_page
        .contents[0]
        .text
        .strip()
        .split()
)
# max_coin_id will only need to be extracted on the first page
coin_start_id, coin_end_id, max_coin_id = [
    int(item) for item in display_records_data[2:6+1:2]
]
print([coin_start_id, coin_end_id, max_coin_id])

[1, 20, 41269]
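The slice-based extraction above depends on the exact position of each token in the paging text. As a slightly more defensive sketch, the integers can be pulled out with a regular expression instead. The sample string below is an assumption inferred from the slice indices used above; the actual wording on the page may differ, and `parse_record_counts` is a hypothetical helper name.

```python
import re


def parse_record_counts(text):
    """Return (start, end, total) from the paging text, or None if not found."""
    # Accept digits with optional thousands separators, e.g. "41,269".
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if len(numbers) < 3:
        return None
    return tuple(numbers[:3])


# Hypothetical paging text; the wording is assumed, not copied from the site.
sample_text = "Displaying records 1 to 20 of 41269 total results."
print(parse_record_counts(sample_text))
```

Returning `None` when fewer than three numbers are found makes a layout change on the site visible to the caller instead of raising an `IndexError` deep inside the scrape.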


In [9]:
# page_id from navigation bar
page_id = int(
    records_and_page
        .find("div", class_="col-md-6 page-nos")
        .text.strip()
)
page_id

1

The `max_coin_id` can be used to calculate the number of pages to iterate through. Twenty coins appear on each page. If the maximum number of coins is not evenly divisible by twenty, then an extra page will have to be scraped.

In [10]:
(max_coin_id//20) + int(max_coin_id%20 != 0)

2064
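The same page count can be expressed as a ceiling division, which also makes it easy to list the `start=` values for each page. This is a minimal sketch; `page_start_values` is a hypothetical helper, and the page size of 20 is taken from the `itemsPerPage` meta tag seen in the sample file.

```python
import math

ITEMS_PER_PAGE = 20  # matches the "itemsPerPage" meta tag in the sample file


def page_start_values(total_records, per_page=ITEMS_PER_PAGE):
    """Return the 'start' query value for every results page."""
    n_pages = math.ceil(total_records / per_page)
    return [page * per_page for page in range(n_pages)]


starts = page_start_values(41269)
print(len(starts), starts[:3], starts[-1])
```

For 41,269 records this gives 2,064 pages, with the last page starting at record offset 41,260.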

Next, the content of the page needs to be extracted.

- 20 coins (or fewer) per page
- Each coin is in `<div class="row result-doc">` tags.
- An example of a coin is below.

![](./../images/browse_page_results_one_coin.png)

- The coin image and number of objects, etc. on the right is available through `<div class="col-md-5 col-lg-4 pull-right">`.
- The data on the left are available through `<div class="col-md-7 col-lg-8">`.

In [11]:
all_page_coins = soup.find_all("div", class_="row result-doc")
len(all_page_coins)

20

In [12]:
all_page_coins[0]

<div class="row result-doc"><div class="col-md-12"><h4><a href="id/ric.1(2).aug.1A">RIC I (second edition) Augustus 1A</a></h4></div><div class="col-md-5 col-lg-4 pull-right"><a class="thumbImage" href="https://numismatics.org/collectionimages/19001949/1944/1944.100.39025.obv.width350.jpg" id="http://numismatics.org/collection/1944.100.39025" rel="gallery" title="Obverse of 1944.100.39025: American Numismatic Society"><img class="side-thumbnail" src="https://numismatics.org/collectionimages/19001949/1944/1944.100.39025.obv.width175.jpg"/></a><a class="thumbImage" href="https://numismatics.org/collectionimages/19001949/1944/1944.100.39025.rev.width350.jpg" id="http://numismatics.org/collection/1944.100.39025" rel="gallery" title="Reverse of 1944.100.39025: American Numismatic Society"><img class="side-thumbnail" src="https://numismatics.org/collectionimages/19001949/1944/1944.100.39025.rev.width175.jpg"/></a><a class="thumbImage" href="https://numismatics.org/collectionimages/19001949/1

The title of each coin is accessed as follows.

In [13]:
soup_title = all_page_coins[0].find("div", class_="col-md-12").find("a")
soup_title

<a href="id/ric.1(2).aug.1A">RIC I (second edition) Augustus 1A</a>

In [14]:
soup_title["href"]

'id/ric.1(2).aug.1A'

(The full link is https://numismatics.org/ocre/id/ric.1(2).aug.1A, where everything before "id" is the home page, http://numismatics.org/ocre/.)
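One way to build the full link from the relative `href` is `urllib.parse.urljoin` from the standard library. This is a sketch only; `OCRE_HOME` and `full_uri` are names introduced here for illustration.

```python
from urllib.parse import urljoin

# The trailing slash matters: without it, urljoin would replace "ocre"
# instead of appending to it.
OCRE_HOME = "https://numismatics.org/ocre/"


def full_uri(relative_href, base=OCRE_HOME):
    """Resolve a relative href from the browse results against the home page."""
    return urljoin(base, relative_href)


print(full_uri("id/ric.1(2).aug.1A"))
```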

In [15]:
soup_title.text.strip()

'RIC I (second edition) Augustus 1A'

The right-hand division, which holds the coin images and the text under them, is extracted below. To determine how best to extract the number of objects found, further exploratory scraping is required.

- [x] Extract all text from right sides, review the text, and determine best way to extract only number of objects found.

In [16]:
all_page_coins[0].find("div", class_="col-md-5 col-lg-4 pull-right").text.strip()

'objects: 22; hoard: 1'

In [17]:
bool(all_page_coins[0].find("div", class_="col-md-5 col-lg-4 pull-right").text.strip())

True

There are instances when there is no text in the right area. The text of `find("div", class_="col-md-5 col-lg-4 pull-right")` can be converted into a boolean to test whether there is anything in the area. If there is no text, the boolean conversion returns `False`.

In [18]:
!ls ./../data/ | grep -E "ocre_browse_results_sample"

ocre_browse_results_sample.html
ocre_browse_results_sample_some_coins_no_text.html


In [19]:
EXAMPLE_FILE_NT = "ocre_browse_results_sample_some_coins_no_text.html"
PATH_SAMPLE_NT = c.DATA_FOLDER / EXAMPLE_FILE_NT

with open(PATH_SAMPLE_NT, "r", encoding="UTF-8") as f:
    soup_nt = BeautifulSoup(f, "lxml")
    
all_page_coins_nt = soup_nt.find_all("div", class_="row result-doc")
empty_contents = (
    all_page_coins_nt[3]
        .find("div", class_="col-md-5 col-lg-4 pull-right")
        .text.strip()
)
empty_contents

''

In [20]:
bool(empty_contents)

False

Calling the string method `str.find()` with the substring "object" matches text containing either "object" or "objects". If the method returns -1, the substring was not found.

In [21]:
!head -5 ./../data/unique_object_count_text.txt

"hoard: 1", first appears on page 5 at API start value of 80
"object: 1", first appears on page 1 at API start value of 0
"object: 1; hoard: 1", first appears on page 6 at API start value of 100
"object: 1; hoards: 3", first appears on page 8 at API start value of 140
"objects: 10", first appears on page 1 at API start value of 0


In [22]:
path_unique_text = c.DATA_FOLDER / "unique_object_count_text.txt"
with open(path_unique_text, "r", encoding="UTF-8") as f:
    unique_text = f.readlines()
    
unique_text = [
    line.split(sep=",", maxsplit=1)[0].replace("\"", "")
    for line in unique_text
]
find_results = [text.find("object") for text in unique_text]
[(text, res) for text, res in zip(unique_text, find_results)][:10]

[('hoard: 1', -1),
 ('object: 1', 0),
 ('object: 1; hoard: 1', 0),
 ('object: 1; hoards: 3', 0),
 ('objects: 10', 0),
 ('objects: 101', 0),
 ('objects: 102', 0),
 ('objects: 103', 0),
 ('objects: 104', 0),
 ('objects: 105', 0)]

In [23]:
unique_text[:10]

['hoard: 1',
 'object: 1',
 'object: 1; hoard: 1',
 'object: 1; hoards: 3',
 'objects: 10',
 'objects: 101',
 'objects: 102',
 'objects: 103',
 'objects: 104',
 'objects: 105']

In [24]:
[
    int(text.split(sep=";", maxsplit=1)[0].split(maxsplit=1)[1])
    for text in unique_text
][:10]

[1, 1, 1, 1, 10, 101, 102, 103, 104, 105]
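Note that the expression above takes the first number in the text, so for `'hoard: 1'` it returns the hoard count rather than an object count. As a sketch of one alternative, a regular expression can match only the number that follows "object:" or "objects:". Returning 0 when no object count is present is an assumption made here; `None` may be a better choice for the database.

```python
import re

# Matches "object: 1" and "objects: 22", but not "hoard: 1" or "hoards: 3".
OBJECT_COUNT_RE = re.compile(r"\bobjects?:\s*(\d+)")


def object_count(text):
    """Return the number after 'object:'/'objects:', or 0 when absent."""
    match = OBJECT_COUNT_RE.search(text)
    return int(match.group(1)) if match else 0


samples = ["hoard: 1", "object: 1; hoards: 3", "objects: 22; hoard: 1", ""]
print([object_count(s) for s in samples])
```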

The left side is accessible by `find("div", class_="col-md-7 col-lg-8")`. A list of the `dt` tags and a list of the `dd` tags can be created with `find_all()`. The two lists can be iterated through together using `zip()`.

In [25]:
all_left_data = all_page_coins[0].find("div", class_="col-md-7 col-lg-8").contents[0]
all_dt = all_left_data.find_all("dt")
all_dd = all_left_data.find_all("dd")
for dt, dd in zip(all_dt, all_dd):
    print(f"{dt.text}:\t{dd.text}")

Date:	25 BCE - 23 BCE
Denomination:	Quinarius
Mint:	Emerita
Obverse:	AVGVST: Head of Augustus, bare, left
Reverse:	P CARISI LEG: Victory standing right, placing wreath on trophy with dagger and sword at base


## Canonical URI

Canonical URI examples are available [here](https://numismatics.org/ocre/id/ric.1(2).aug.1A), [here](https://numismatics.org/ocre/id/ric.8.ar.261), and [here](https://numismatics.org/ocre/id/ric.8.sir.52).

The basic structure of these pages is:

- Coin title
- Canonical URI (url)
- Typological Description (section)
    - Date Range
    - Object Type
    - Manufacture
    - Denomination
    - Material
    - Authority (subsection)
        - Authority
        - Issuer
    - Geographic (subsection)
        - Mint
        - Region
    - Obverse (subsection)
        - Legend
        - Type
        - Portrait
    - Reverse (subsection)
        - Legend
        - Type
        - Deity
- Examples of this type (section)
    - "Div"s with different fields
    - Need to exploratory scrape
- Quantitative Analysis (section)
    - Average Axis
    - Average Diameter
    - Average Weight

Unfortunately, there are several variations between these three examples and between other URI pages reviewed. The variations identified are below:

- Fields in the section and subsections vary. I will need to perform exploratory scraping to determine all possible fields.
- The "Examples of this type" section has several divisions. Some coins do not have this section, so I will need to check whether this div exists. I will also need to verify the unique fields and unique values for this section.
- The "Quantitative Analysis" section may not exist. I will need to check whether this div exists and verify the fields of this section.

In [26]:
!ls ./../data/

number_of_header_lines.txt
ocre_browse_results_sample.html
ocre_browse_results_sample_some_coins_no_text.html
ocre_canonical_uri_sample.html
ocre_canonical_uri_sample_example_pagination.html
ocre_canonical_uri_sample_only_first_section.html
ocre_canonical_uri_sample_subtype_section.html
unique_browse_fields.csv
unique_object_count_text.txt
unique_section_text.txt
unique_sections.txt
unique_typological_fields.txt


In [27]:
paths_uri_sample_d = {
    "general": c.DATA_FOLDER / "ocre_canonical_uri_sample.html",
    "pagination": c.DATA_FOLDER / "ocre_canonical_uri_sample_example_pagination.html",
    "one_section": c.DATA_FOLDER / "ocre_canonical_uri_sample_only_first_section.html",
    "subtype_section": c.DATA_FOLDER / "ocre_canonical_uri_sample_subtype_section.html",
}

### URI Page Header

There are four types of headers: general, with only one section, with a subtype section, and with parent types.

The general header has the coin name, the coin's canonical URI, and a line listing the sections of the page. See the image below. The header is in a `<div class="col-md-12">` tag. The title is in an `h1` tag, the URI is in the first `p` tag, and the sections of the page are in the second (index 1) `p` tag. Loop through the `a` tags in this `p` tag to get the sections. The coin title and URI are not required, as they are taken from the summary scrape.

![](./../images/uri_header_general.png)

In [28]:
with open(paths_uri_sample_d["general"], "r", encoding="UTF-8") as f:
    soup_general = BeautifulSoup(f, "lxml")

soup_sections = (
    soup_general
        .body
        .find_all("div", class_="col-md-12")[1]
        .find_all("p")[1]
        .contents
)
sections = [
    " ".join(item.text.strip().split())
    for item in soup_sections if item.name == "a"
]
sections = ["Typological Description"] + sections
print(sections)
print(f"Number of sections: {len(sections)}")

['Typological Description', 'Examples of this type', 'Quantitative Analysis']
Number of sections: 3


When a header does not have any sections, indicating that there is only a Typological Description section, the second `p` tag is blank.

![](./../images/uri_header_one_section.png)

In [29]:
with open(paths_uri_sample_d["one_section"], "r", encoding="UTF-8") as f:
    soup_one_section = BeautifulSoup(f, "lxml")

soup_sections = (
    soup_one_section
        .body
        .find_all("div", class_="col-md-12")[1]
        .find_all("p")[1]
        .contents
)
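if soup_sections:
    # The sections "p" tag has content: collect the text of each section link
    sections = [
        " ".join(item.text.strip().split())
        for item in soup_sections if item.name == "a"
    ]
else:
    # The sections "p" tag is empty: only the Typological Description exists
    sections = []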

sections = ["Typological Description"] + sections
print(sections)
print(f"Number of sections: {len(sections)}")

['Typological Description']
Number of sections: 1


When the header has a parent type, the parent type is in the third (index 2) `p` tag of the list returned by `find_all("p")`. Thus, `... .find_all("div", class_="col-md-12")[1].find_all("p")[1] ...` still selects the correct item with the sections.

![](./../images/uri_header_parent_types.png)

In [30]:
soup_sections = (
    soup_one_section
        .body
        .find_all("div", class_="col-md-12")[1]
        .find_all("p")
)
soup_sections

[<p><strong>Canonical URI: </strong><code><a href="http://numismatics.org/ocre/id/ric.2_3(2).hdn.4" title="http://numismatics.org/ocre/id/ric.2_3(2).hdn.4">http://numismatics.org/ocre/id/ric.2_3(2).hdn.4</a></code></p>,
 <p></p>,
 <p>Parent Type: <a href="http://numismatics.org/ocre/id/ric.2_3(2).hdn.4-6" rel="skos:broader">ric.2_3(2).hdn.4-6</a></p>]

In [31]:
soup_sections[1]

<p></p>

In [32]:
soup_sections[2]

<p>Parent Type: <a href="http://numismatics.org/ocre/id/ric.2_3(2).hdn.4-6" rel="skos:broader">ric.2_3(2).hdn.4-6</a></p>

When there is a subtype section, this section can be ignored as there is no need for this section for later ML purposes.

![](./../images/uri_header_subtypes_section.png)

In [33]:
with open(paths_uri_sample_d["subtype_section"], "r", encoding="UTF-8") as f:
    soup_subtype_section = BeautifulSoup(f, "lxml")

soup_sections = (
    soup_subtype_section
        .body
        .find_all("div", class_="col-md-12")[1]
        .find_all("p")[1]
        .contents
)
[
    " ".join(item.text.strip().split())
    for item in soup_sections
    if item.name == "a" and item["href"] != "#subtypes"
]

['Examples of this type', 'Quantitative Analysis']

Exploratory scraping of the header section will involve:

- Determining the number of tags in `find_all("div", class_="col-md-12")[1]`
- Making sure the sections line is in the same place on all pages
- Determining all sections in a page from the sections line

In [34]:
all_header_items = (
    soup_subtype_section
        .body
        .find_all("div", class_="col-md-12")[1]
        .contents
)
all_header_items

['\n',
 <h1 id="object_title" lang="en" property="skos:prefLabel">RIC II, Part 3 (second edition) Hadrian 4-6</h1>,
 '\n',
 <p><strong>Canonical URI: </strong><code><a href="http://numismatics.org/ocre/id/ric.2_3(2).hdn.4-6" title="http://numismatics.org/ocre/id/ric.2_3(2).hdn.4-6">http://numismatics.org/ocre/id/ric.2_3(2).hdn.4-6</a></code></p>,
 '\n',
 <p><a href="#subtypes">Subtypes</a> | <a href="#examples">Examples of this type</a> | <a href="#metrical">Quantitative
                      Analysis</a></p>,
 '\n']

In [35]:
# If item is text, item.name returns None
all_tags = [item for item in all_header_items if item.name]
all_tags

[<h1 id="object_title" lang="en" property="skos:prefLabel">RIC II, Part 3 (second edition) Hadrian 4-6</h1>,
 <p><strong>Canonical URI: </strong><code><a href="http://numismatics.org/ocre/id/ric.2_3(2).hdn.4-6" title="http://numismatics.org/ocre/id/ric.2_3(2).hdn.4-6">http://numismatics.org/ocre/id/ric.2_3(2).hdn.4-6</a></code></p>,
 <p><a href="#subtypes">Subtypes</a> | <a href="#examples">Examples of this type</a> | <a href="#metrical">Quantitative
                      Analysis</a></p>]

In [36]:
# Number of tags in header
len(all_tags)

3

In [37]:
all_tags[2].contents

[<a href="#subtypes">Subtypes</a>,
 ' | ',
 <a href="#examples">Examples of this type</a>,
 ' | ',
 <a href="#metrical">Quantitative
                      Analysis</a>]

In [38]:
[
    (item["href"], " ".join(item.text.strip().split()))
    for item in all_tags[2].contents
    if item.name
]

[('#subtypes', 'Subtypes'),
 ('#examples', 'Examples of this type'),
 ('#metrical', 'Quantitative Analysis')]

In [39]:
# Sections + Typological section
all_tags[2].text

'Subtypes | Examples of this type | Quantitative\n                     Analysis'

In [40]:
# Clean sections + Typological section
[" ".join(item.split()) for item in all_tags[2].text.split(" | ")]

['Subtypes', 'Examples of this type', 'Quantitative Analysis']

In [41]:
# URI
all_tags[1].find("a")["href"]

'http://numismatics.org/ocre/id/ric.2_3(2).hdn.4-6'

Results from the exploratory scrape of URI page headers:

- The number of tags in the header is consistently three.
- The sections line is consistently the third tag (index 2).
- The following sections occur:
    - "" (blank) (This indicates there is only the Typological Description section)
    - Annotations (This section can be skipped)
    - Examples of this type
    - Quantitative Analysis
    - Subtypes (This section can be skipped)
- There are several pages with inconsistent section lines related to the "|" symbol. (The code below shows how to process the text.)
    - "Subtypes |"
    - "Subtypes | | Annotations"
    - "| Annotations"

In [42]:
example_section_lines = (
    "",
    "Examples of this type | Quantitative Analysis",
    "Examples of this type | Quantitative Analysis | Annotations",
    "Subtypes |",
    "Subtypes | Examples of this type | Quantitative Analysis",
    "Subtypes | Examples of this type | Quantitative Analysis | Annotations",
    "Subtypes | | Annotations",
    "| Annotations"
)

In [43]:
for example in example_section_lines:
    mod_line = example.split("|")
    mod_line = [item.strip() for item in mod_line]
    mod_line = [item for item in mod_line if len(item) > 0]
    mod_line = ["Typological Description"] + mod_line
    print(mod_line)

['Typological Description']
['Typological Description', 'Examples of this type', 'Quantitative Analysis']
['Typological Description', 'Examples of this type', 'Quantitative Analysis', 'Annotations']
['Typological Description', 'Subtypes']
['Typological Description', 'Subtypes', 'Examples of this type', 'Quantitative Analysis']
['Typological Description', 'Subtypes', 'Examples of this type', 'Quantitative Analysis', 'Annotations']
['Typological Description', 'Subtypes', 'Annotations']
['Typological Description', 'Annotations']
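The loop above can be wrapped in a small function for reuse during the full scrape. This is a sketch; `parse_section_line` is a hypothetical name, and it additionally collapses internal whitespace so that a line break inside "Quantitative Analysis" (as seen in the raw header text earlier) does not leak into the section name.

```python
def parse_section_line(line):
    """Split a '|'-separated sections line into a clean list of section names."""
    # " ".join(part.split()) trims each part and collapses internal whitespace.
    parts = [" ".join(part.split()) for part in line.split("|")]
    # Empty parts come from stray separators such as "Subtypes | | Annotations".
    return ["Typological Description"] + [part for part in parts if part]


print(parse_section_line("Subtypes | | Annotations"))
print(parse_section_line("| Annotations"))
print(parse_section_line("Examples of this type | Quantitative\n        Analysis"))
```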


In [44]:
pages = (
    "http://numismatics.org/ocre/id/ric.1(2).aug.17",
    "http://numismatics.org/ocre/id/ric.1(2).aug.1A",
    "http://numismatics.org/ocre/id/ric.2.tr.315",
    "http://numismatics.org/ocre/id/ric.2_3(2).hdn.45-47",
    "http://numismatics.org/ocre/id/ric.2_3(2).hdn.2-3",
    "http://numismatics.org/ocre/id/ric.4.gor_iii.209",
    "http://numismatics.org/ocre/id/ric.5.gall(2).22",
    "http://numismatics.org/ocre/id/ric.3.ant.184"
)

for page in pages:
    response = requests.get(page)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    soup_header = soup.body.find_all("div", class_="col-md-12")[1].contents
    all_tags = [item for item in soup_header if item.name]
    print(all_tags[2].contents)
    print(
        [
            (item["href"], " ".join(item.text.strip().split()))
            for item in all_tags[2].contents
            if item.name
        ]
    )
    print()
    print()

[]
[]


[<a href="#examples">Examples of this type</a>, ' | ', <a href="#metrical">Quantitative
                     Analysis</a>]
[('#examples', 'Examples of this type'), ('#metrical', 'Quantitative Analysis')]


[<a href="#examples">Examples of this type</a>, ' | ', <a href="#metrical">Quantitative
                     Analysis</a>, ' | ', <a href="#annotations">Annotations</a>]
[('#examples', 'Examples of this type'), ('#metrical', 'Quantitative Analysis'), ('#annotations', 'Annotations')]


[<a href="#subtypes">Subtypes</a>, ' | \n               ']
[('#subtypes', 'Subtypes')]


[<a href="#subtypes">Subtypes</a>, ' | ', <a href="#examples">Examples of this type</a>, ' | ', <a href="#metrical">Quantitative
                     Analysis</a>]
[('#subtypes', 'Subtypes'), ('#examples', 'Examples of this type'), ('#metrical', 'Quantitative Analysis')]


[<a href="#subtypes">Subtypes</a>, ' | ', <a href="#examples">Examples of this type</a>, ' | ', <a href="#metrical">Quantitative
        

### Typological Description section

In [45]:
!ls ./../images/

browse_page_results_one_coin.png
canonical_uri_with_example_pagination.png
uri_header_general.png
uri_header_one_section.png
uri_header_parent_types.png
uri_header_subtypes_section.png
uri_typological_section.png


The general structure of the Typological Description section is a "no header" section followed by several subsections. The section is contained in the `<div class="metadata_section">` tag. The name of the section is in the `h3` tag. The fields, values, and subsection names are in the `<ul>` tag.

![](./../images/uri_typological_section.png)

In [46]:
soup_general.find("div", class_="metadata_section").h3

<h3>Typological Description</h3>

In [47]:
typological_raw_data = (
    soup_general.find("div", class_="metadata_section").ul.contents
)
typological_raw_data = [item for item in typological_raw_data if item.name]
typological_raw_data[:5]

[<li><b>Date Range: </b>25 BC - 23 BC
                      </li>,
 <li><b>Object Type: </b><a href='../results?q=objectType_facet:"Coin"'>Coin</a><a class="external_link" href="http://nomisma.org/id/coin" rel="nmo:representsObjectType" target="_blank"><span class="glyphicon glyphicon-new-window"></span></a></li>,
 <li><b>Manufacture: </b><a href='../results?q=manufacture_facet:"Struck"'>Struck</a><a class="external_link" href="http://nomisma.org/id/struck" rel="nmo:hasManufacture" target="_blank"><span class="glyphicon glyphicon-new-window"></span></a></li>,
 <li><b>Denomination: </b><a href='../results?q=denomination_facet:"Quinarius"'>Quinarius</a><a class="external_link" href="http://nomisma.org/id/quinarius" rel="nmo:hasDenomination" target="_blank"><span class="glyphicon glyphicon-new-window"></span></a></li>,
 <li><b>Material: </b><a href='../results?q=material_facet:"Silver"'>Silver</a><a class="external_link" href="http://nomisma.org/id/ar" rel="nmo:hasMaterial" target="_blank

In [48]:
for item in typological_raw_data:
    print(item, end="\n\n")

<li><b>Date Range: </b>25 BC - 23 BC
                     </li>

<li><b>Object Type: </b><a href='../results?q=objectType_facet:"Coin"'>Coin</a><a class="external_link" href="http://nomisma.org/id/coin" rel="nmo:representsObjectType" target="_blank"><span class="glyphicon glyphicon-new-window"></span></a></li>

<li><b>Manufacture: </b><a href='../results?q=manufacture_facet:"Struck"'>Struck</a><a class="external_link" href="http://nomisma.org/id/struck" rel="nmo:hasManufacture" target="_blank"><span class="glyphicon glyphicon-new-window"></span></a></li>

<li><b>Denomination: </b><a href='../results?q=denomination_facet:"Quinarius"'>Quinarius</a><a class="external_link" href="http://nomisma.org/id/quinarius" rel="nmo:hasDenomination" target="_blank"><span class="glyphicon glyphicon-new-window"></span></a></li>

<li><b>Material: </b><a href='../results?q=material_facet:"Silver"'>Silver</a><a class="external_link" href="http://nomisma.org/id/ar" rel="nmo:hasMaterial" target="_blank"><spa

The contents of this section can be accessed using BeautifulSoup's `stripped_strings` generator. It yields each text string with leading and trailing whitespace removed and skips strings that are only whitespace. I will use a Python `deque` object so that items can be removed from the left without harming performance.

In [49]:
stripped_str = soup_general.find("div", class_="metadata_section").ul.stripped_strings
stripped_str = deque(stripped_str)
stripped_str

deque(['Date Range:',
       '25 BC - 23 BC',
       'Object Type:',
       'Coin',
       'Manufacture:',
       'Struck',
       'Denomination:',
       'Quinarius',
       'Material:',
       'Silver',
       'Authority',
       'Authority:',
       'Augustus',
       'Issuer:',
       'P. Carisius',
       'Geographic',
       'Mint:',
       'Emerita',
       'Region:',
       'Lusitania',
       'Obverse',
       'Legend:',
       'AVGVST',
       'Type:',
       'Head of Augustus, bare, left',
       'Portrait:',
       'Augustus',
       'Reverse',
       'Legend:',
       'P CARISI LEG',
       'Type:',
       'Victory standing right, placing wreath on trophy with dagger and sword at base',
       'Deity:',
       'Victory'])

In [50]:
fields_values_d = dict()
curr_subsection = str()
while len(stripped_str) > 0:
    item = stripped_str.popleft()
    item = item.replace(" ", "_").lower()
    if ":" in item:
        # field name
        # pop next item as value
        key = (
            ((curr_subsection + "_") if curr_subsection else "") 
            + item.replace(":", "")
        )
        value = stripped_str.popleft()
        if key not in fields_values_d.keys():
            fields_values_d[key] = value
    else:
        # header
        # update curr_subsection
        curr_subsection = item

pprint(fields_values_d)

{'authority_authority': 'Augustus',
 'authority_issuer': 'P. Carisius',
 'date_range': '25 BC - 23 BC',
 'denomination': 'Quinarius',
 'geographic_mint': 'Emerita',
 'geographic_region': 'Lusitania',
 'manufacture': 'Struck',
 'material': 'Silver',
 'object_type': 'Coin',
 'obverse_legend': 'AVGVST',
 'obverse_portrait': 'Augustus',
 'obverse_type': 'Head of Augustus, bare, left',
 'reverse_deity': 'Victory',
 'reverse_legend': 'P CARISI LEG',
 'reverse_type': 'Victory standing right, placing wreath on trophy with dagger '
                 'and sword at base'}


Confirming this will work for the other sample URI pages.

There is an issue with the "Pagination" example due to the text "(uncertain)" in the Denomination field. This text appears on a new line after the denomination value. The code below will be altered to skip over lines that are "(uncertain)".

In [51]:
with open(paths_uri_sample_d["pagination"], "r", encoding="UTF-8") as f:
    soup_pagination = BeautifulSoup(f, "lxml")

In [52]:
uri_samples = (
    ("Pagination", soup_pagination),
    ("One Section", soup_one_section),
    ("Subtype Section", soup_subtype_section),
)
for desc, soup in uri_samples:
    data_strings = soup.find("div", class_="metadata_section").ul.stripped_strings
    data_strings = deque(data_strings)
    
    d = dict()
    subsection = str()
    while len(data_strings) > 0:
        item = data_strings.popleft()
        item = item.replace(" ", "_").lower()
        
        if item == "(uncertain)":
            continue
            
        if ":" in item:
            key = (
                ((subsection + "_") if subsection else "_") 
                + item.replace(":", "")
            )
            value = data_strings.popleft()
            value = value.replace("\n", " ")
            value = re.sub(" +", " ", value)
            if key not in d.keys():
                d[key] = value
            else:
                key_count = len([k for k in d.keys() if key in k]) + 1
                d[key + f"{key_count}"] = value
        else:
            subsection = item

    print(f"Typological Description fields in {desc} file:")
    pprint(d)
    print()    

Typological Description fields in Pagination file:
{'_date_range': 'AD 330 - AD 331',
 '_denomination': 'AE2',
 '_denomination2': 'AE3',
 '_manufacture': 'Struck',
 '_material': 'Bronze',
 '_object_type': 'Coin',
 'authority_authority': 'Constantine I',
 'geographic_mint': 'Trier',
 'geographic_region': 'Gallia',
 'obverse_deity': 'Constantinopolis',
 'obverse_legend': 'CONSTAN-TINOPOLIS',
 'obverse_type': 'Bust of Constantinopolis, laureate, helmeted, wearing '
                 'imperial cloak, left, holding reversed spear in right hand',
 'reverse_deity': 'Victory',
 'reverse_mintmark': '-/-//TRP•',
 'reverse_officinamark': 'P',
 'reverse_officinamark2': 'S',
 'reverse_type': 'Victory, winged, draped, standing left on prow, holding '
                 'spear in right hand and shield in left hand'}

Typological Description fields in One Section file:
{'_date': 'AD 117',
 '_denomination': 'Denarius',
 '_manufacture': 'Struck',
 '_material': 'Silver',
 '_object_type': 'Coin',
 'authority

Exploratory scraping of the Typological Description section with the method above ran into several issues: (1) missing values in field-value pairs, (2) non-subsection fields appearing after all of the subsections, and (3) symbol fields in the "Reverse" subsection containing non-standard text. The process above may be too simplistic to account for these issues. A more detailed approach to scraping this section will be undertaken below.
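Issue (1) can be illustrated without the site: when a field label has no value, the flat label/value pairing consumes the next label as the value and every later pair shifts. The sketch below adds a guard that only consumes the next string when it is not itself a label. The sample strings are hypothetical, `pair_fields` is a name introduced here, and the guard is only partial: a value-less label followed directly by a subsection header would still be paired incorrectly, which is one reason a tag-based approach is more robust.

```python
from collections import deque


def pair_fields(strings):
    """Pair 'Field:' labels with values; a label followed by a label gets None."""
    queue = deque(strings)
    pairs = []
    subsection = ""
    while queue:
        item = queue.popleft()
        if item.endswith(":"):
            # Only consume the next string as a value if it is not a label.
            has_value = bool(queue) and not queue[0].endswith(":")
            value = queue.popleft() if has_value else None
            pairs.append((subsection, item[:-1], value))
        else:
            # Strings without a trailing colon are treated as subsection headers.
            subsection = item
    return pairs


# Hypothetical strings mimicking a subsection with one missing value.
sample = ["Reverse", "Mintmark:", "OfficinaMark:", "P", "Deity:", "Victory"]
for row in pair_fields(sample):
    print(row)
```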

[RIC V Probus 651](http://numismatics.org/ocre/id/ric.5.pro.651) is one URI page causing issue (1) above. In the "Reverse" subsection, one of the `MintMark` fields is missing a value, which throws off the scraping. The code below is a potential solution to this issue.

In [53]:
url_issue_01 = "http://numismatics.org/ocre/id/ric.5.pro.651"

response = requests.get(url_issue_01)
response.raise_for_status()

soup_issue_01 = BeautifulSoup(response.text, "lxml")
data_raw = soup_issue_01.find("div", class_="metadata_section").ul
data_raw = [item for item in data_raw if item.name]

unique_fields_d = dict()
row_field_counts_d = dict()
for item in data_raw:
    all_tags = [i for i in item if i.name]
    
    if all_tags[0].name == "b":
        field, value = item.text.strip().split(": ", maxsplit=1)
        field = "_" + field.lower().replace(" ", "_")
        value = re.sub(" +", " ", value.replace("\n", " "))
        
        # TODO: repeat
        if field not in row_field_counts_d.keys():
            row_field_counts_d[field] = 1
        else:
            row_field_counts_d[field] += 1
        
        if field not in unique_fields_d.keys():
            unique_fields_d[field] = (value, "coin_id", "path_uri")
        else:
            if row_field_counts_d[field] > 1:
                key_idx = row_field_counts_d[field]
                unique_fields_d[field + f"{key_idx}"] = (value, "coin_id", "path_uri")
    elif all_tags[0].name == "h4":
        section_name = all_tags[0].text.strip().lower().replace(" ", "_")
        for li in item.find_all("li"):
            if li.contents[0].name == "b":
                field = li.contents[0].text.strip().lower().replace(" ", "_").replace(":", "")
                field = section_name + "_" + field
                value = (
                    re.sub(
                        " +", " ", li.contents[1].text.strip().replace("\n", " ")
                    )
                    if len(li.contents) > 1
                    else None
                )
                
                # TODO: repeat
                if field not in row_field_counts_d.keys():
                    row_field_counts_d[field] = 1
                else:
                    row_field_counts_d[field] += 1

                if field not in unique_fields_d.keys():
                    unique_fields_d[field] = (value, "coin_id", "path_uri")
                else:
                    if row_field_counts_d[field] > 1:
                        key_idx = row_field_counts_d[field]
                        unique_fields_d[field + f"{key_idx}"] = (value, "coin_id", "path_uri")

# pprint(row_field_counts_d)
pprint(unique_fields_d)

{'_date_range': ('AD 276 - AD 282', 'coin_id', 'path_uri'),
 '_denomination': ('Antoninianus', 'coin_id', 'path_uri'),
 '_manufacture': ('Struck', 'coin_id', 'path_uri'),
 '_material': ('Silver', 'coin_id', 'path_uri'),
 '_object_type': ('Coin', 'coin_id', 'path_uri'),
 'authority_authority': ('Probus', 'coin_id', 'path_uri'),
 'geographic_mint': ('Siscia', 'coin_id', 'path_uri'),
 'geographic_region': ('Pannonia', 'coin_id', 'path_uri'),
 'obverse_legend': ('IMP C M AVR PROBVS AVG', 'coin_id', 'path_uri'),
 'obverse_portrait': ('Probus', 'coin_id', 'path_uri'),
 'obverse_type': ('Bust of Probus, radiate, draped, right or bust of Probus, '
                  'radiate, draped, cuirassed, right or bust of Probus, '
                  'radiate, cuirassed, right or bust of Probus, helmeted, '
                  'radiate, cuirassed, left, holding spear in right hand and '
                  'shield in left hand',
                  'coin_id',
                  'path_uri'),
 'reverse_deity': ('Co
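
The duplicate-key bookkeeping marked `# TODO: repeat` appears twice in the cell above. A minimal sketch of how it might be factored into one helper follows. The name `add_field`, the use of `collections.Counter`, and the sample values are assumptions for illustration, not part of the original notebook. The numbering scheme is unchanged: the first occurrence keeps the bare key, and later ones get a numeric suffix starting at 2.

```python
from collections import Counter


def add_field(fields, counts, field, value):
    """Store value under field, suffixing repeated keys with their count."""
    counts[field] += 1
    # First occurrence keeps the bare key; repeats become field2, field3, ...
    key = field if counts[field] == 1 else f"{field}{counts[field]}"
    fields[key] = value
    return key


fields, counts = {}, Counter()
# Hypothetical values; None stands in for a field that has no value.
add_field(fields, counts, "reverse_mintmark", "first value")
add_field(fields, counts, "reverse_mintmark", None)
add_field(fields, counts, "reverse_legend", "some legend")
print(fields)
```

Calling the helper from both the `b` branch and the `h4` branch would remove the repeated block without changing which keys are produced.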

[RIC II, Part 1 (second edition) Vespasian 787](http://numismatics.org/ocre/id/ric.2_1(2).ves.787) is one URI page that causes issue (2) above: the `denomination` field follows the "Reverse" section. The code from issue (1) is reused for the new URI page, and it also handles issue (2).

In [54]:
url_issue_02 = "http://numismatics.org/ocre/id/ric.2_1(2).ves.787"

response = requests.get(url_issue_02)
response.raise_for_status()

soup_issue_02 = BeautifulSoup(response.text, "lxml")
data_raw = soup_issue_02.find("div", class_="metadata_section").ul
data_raw = [item for item in data_raw if item.name]

unique_fields_d = dict()
row_field_counts_d = dict()
for item in data_raw:
    all_tags = [i for i in item if i.name]
    
    if all_tags[0].name == "b":
        field, value = item.text.strip().split(": ", maxsplit=1)
        field = "_" + field.lower().replace(" ", "_")
        value = re.sub(" +", " ", value.replace("\n", " "))
        
        # TODO: repeat
        if field not in row_field_counts_d.keys():
            row_field_counts_d[field] = 1
        else:
            row_field_counts_d[field] += 1
        
        if field not in unique_fields_d.keys():
            unique_fields_d[field] = (value, "coin_id", "path_uri")
        else:
            if row_field_counts_d[field] > 1:
                key_idx = row_field_counts_d[field]
                unique_fields_d[field + f"{key_idx}"] = (value, "coin_id", "path_uri")
    elif all_tags[0].name == "h4":
        section_name = all_tags[0].text.strip().lower().replace(" ", "_")
        for li in item.find_all("li"):
            if li.contents[0].name == "b":
                field = li.contents[0].text.strip().lower().replace(" ", "_").replace(":", "")
                field = section_name + "_" + field
                value = (
                    re.sub(
                        " +", " ", li.contents[1].text.strip().replace("\n", " ")
                    )
                    if len(li.contents) > 1
                    else None
                )
                
                # TODO: repeat
                if field not in row_field_counts_d.keys():
                    row_field_counts_d[field] = 1
                else:
                    row_field_counts_d[field] += 1

                if field not in unique_fields_d.keys():
                    unique_fields_d[field] = (value, "coin_id", "path_uri")
                else:
                    if row_field_counts_d[field] > 1:
                        key_idx = row_field_counts_d[field]
                        unique_fields_d[field + f"{key_idx}"] = (value, "coin_id", "path_uri")

# pprint(row_field_counts_d)
pprint(unique_fields_d)

{'_date': ('AD 75', 'coin_id', 'path_uri'),
 '_denomination': ('Aureus', 'coin_id', 'path_uri'),
 '_manufacture': ('Struck', 'coin_id', 'path_uri'),
 '_material': ('Gold', 'coin_id', 'path_uri'),
 '_object_type': ('Coin', 'coin_id', 'path_uri'),
 'authority_authority': ('Vespasian', 'coin_id', 'path_uri'),
 'geographic_mint': ('Rome', 'coin_id', 'path_uri'),
 'geographic_region': ('Italy', 'coin_id', 'path_uri'),
 'obverse_legend': ('CAES AVG F DOMIT COS III', 'coin_id', 'path_uri'),
 'obverse_portrait': ('Domitian', 'coin_id', 'path_uri'),
 'obverse_type': ('Head of Domitian, laureate, right', 'coin_id', 'path_uri'),
 'reverse_deity': ('Spes', 'coin_id', 'path_uri'),
 'reverse_legend': ('PRINCEPS IVVENTVT', 'coin_id', 'path_uri'),
 'reverse_type': ('Spes standing, left holding flower in right and raising '
                  'skirt with left',
                  'coin_id',
                  'path_uri')}
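
Field names and values in the cell above are cleaned with chained `replace` calls plus `re.sub(" +", " ", ...)`. As a sketch of an alternative, the two helpers below collapse any run of whitespace, including newlines and tabs, in a single pass. The names `normalize_key` and `clean_value` are assumptions, and unlike the original code `normalize_key` strips only trailing colons.

```python
import re


def normalize_key(label, section=""):
    """Turn a label such as 'Date Range:' into a key such as '_date_range'."""
    key = label.strip().rstrip(":").lower().replace(" ", "_")
    # An empty section reproduces the leading underscore of top-level fields.
    return f"{section}_{key}"


def clean_value(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return re.sub(r"\s+", " ", text).strip()


print(normalize_key("Date Range:"))
print(normalize_key("Legend:", "reverse"))
print(clean_value("Spes standing, left holding flower in right\n   and raising skirt with left"))
```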


[RIC X Libius Severus 2715](http://numismatics.org/ocre/id/ric.10.lib_sev_w.2715) is one URI page that causes issue (3) above. In its "Reverse" subsection, the `Symbol` fields contain an image inline with the text, which breaks the original process.

In [55]:
url_issue_03 = "http://numismatics.org/ocre/id/ric.10.lib_sev_w.2715"

response = requests.get(url_issue_03)
response.raise_for_status()

soup_issue_03 = BeautifulSoup(response.text, "lxml")
data_raw = soup_issue_03.find("div", class_="metadata_section").ul
data_raw = [item for item in data_raw if item.name]

unique_fields_d = dict()
row_field_counts_d = dict()
for item in data_raw:
    all_tags = [i for i in item if i.name]
    
    if all_tags[0].name == "b":
        field, value = item.text.strip().split(": ", maxsplit=1)
        field = "_" + field.lower().replace(" ", "_")
        value = re.sub(" +", " ", value.replace("\n", " "))
        
        # TODO: repeat
        if field not in row_field_counts_d.keys():
            row_field_counts_d[field] = 1
        else:
            row_field_counts_d[field] += 1
        
        if field not in unique_fields_d.keys():
            unique_fields_d[field] = (value, "coin_id", "path_uri")
        else:
            if row_field_counts_d[field] > 1:
                key_idx = row_field_counts_d[field]
                unique_fields_d[field + f"{key_idx}"] = (value, "coin_id", "path_uri")
    elif all_tags[0].name == "h4":
        section_name = all_tags[0].text.strip().lower().replace(" ", "_")
        for li in item.find_all("li"):
            if li.contents[0].name == "b":
                
                # Debug: cleaning the full <li> text recovers the Symbol value
                field_ = li.contents[0].text.strip().lower().replace(" ", "_").replace(":", "")
                value_ = re.sub(
                    " +", " ", li.text.strip().replace(" - ", " ").replace(",", "", 1)
                )
                if "symbol" in field_:
                    print(field_)
                    print(value_)
                    
                
                field = li.contents[0].text.strip().lower().replace(" ", "_").replace(":", "")
                field = section_name + "_" + field
                value = (
                    re.sub(
                        " +", " ", li.contents[1].text.strip().replace("\n", " ")
                    )
                    if len(li.contents) > 1
                    else None
                )
                
                # TODO: repeat
                if field not in row_field_counts_d.keys():
                    row_field_counts_d[field] = 1
                else:
                    row_field_counts_d[field] += 1

                if field not in unique_fields_d.keys():
                    unique_fields_d[field] = (value, "coin_id", "path_uri")
                else:
                    if row_field_counts_d[field] > 1:
                        key_idx = row_field_counts_d[field]
                        unique_fields_d[field + f"{key_idx}"] = (value, "coin_id", "path_uri")

# pprint(row_field_counts_d)
pprint(unique_fields_d)

symbol
Symbol: Monogram 1 (Libius Severus) consists of C, E, V, and R
symbol
Symbol: Monogram 2 (Libius Severus) consists of C, E, V, R, and A
{'_date_range': ('AD 461 - AD 465', 'coin_id', 'path_uri'),
 '_denomination': ('AE4', 'coin_id', 'path_uri'),
 '_manufacture': ('Struck', 'coin_id', 'path_uri'),
 '_material': ('Bronze', 'coin_id', 'path_uri'),
 '_object_type': ('Coin', 'coin_id', 'path_uri'),
 'authority_authority': ('Libius Severus', 'coin_id', 'path_uri'),
 'geographic_mint': ('Rome', 'coin_id', 'path_uri'),
 'geographic_region': ('Italy', 'coin_id', 'path_uri'),
 'obverse_legend': ('D N LIB SEVER P A or D N LB (rev.S)[ ] or [ ]RVS AV or [ '
                    ']RVS P A or [ ]RVS P AV or [ ]RVS P AC or [ ] P F AV or D '
                    'N [ ]',
                    'coin_id',
                    'path_uri'),
 'obverse_portrait': ('Libius Severus', 'coin_id', 'path_uri'),
 'obverse_type': ('Bust of Libius Severus, pearl-diademed, draped, cuirassed, '
                  'rig
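
The debug lines print the cleaned text of the whole `<li>`, so each value still carries its `Symbol:` label. One way to separate the two is sketched below, under the assumption that the label always ends at the first colon. `split_label` is a hypothetical helper, not part of the original notebook.

```python
def split_label(text):
    """Split 'Label: value' at the first colon; value is None when empty."""
    label, _, value = text.partition(":")
    value = value.strip()
    return label.strip().lower().replace(" ", "_"), value or None


# The first string is taken from the debug output above; the second is a
# hypothetical label with no value.
print(split_label("Symbol: Monogram 1 (Libius Severus) consists of C, E, V, and R"))
print(split_label("MintMark:"))
```

Returning `None` for an empty value matches how the main loop treats a field that has a label but no content.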

### Quantitative Analysis Section

The Quantitative Analysis section is found inside `<div class="row" id="metrical">`, and the data is available in `<div class="col-md-12">` and `<dl class="dl-horizontal">`.

In [56]:
uri_samples_full = (
    ("General", soup_general),
    ("Pagination", soup_pagination),
    ("One Section", soup_one_section),
    ("Subtype Section", soup_subtype_section),
)

In [65]:
for desc, soup in uri_samples_full:
    unique_fields_d = dict()
    row_field_counts_d = dict()
    soup_analysis_section = soup.find("div", class_="row", id="metrical")
    
    # Skip samples that have no Quantitative Analysis section
    if not soup_analysis_section:
        continue
        
    section_title = soup_analysis_section.div.h3.text.strip()
    section_title = re.sub(" +", " ", section_title)
    section_title = section_title.replace("\n", "")
    
    soup_data = soup_analysis_section.find("dl", class_="dl-horizontal")
    all_dt = soup_data.find_all("dt")
    all_dd = soup_data.find_all("dd")
    
    for dt, dd in zip(all_dt, all_dd):
        field = dt.text.strip().lower().replace(" ", "")
        field = "average_" + field
        value = float(dd.text.strip())
        
        if field not in row_field_counts_d.keys():
            row_field_counts_d[field] = 1
        else:
            row_field_counts_d[field] += 1
        
        if field not in unique_fields_d.keys():
            unique_fields_d[field] = value
        else:
            if row_field_counts_d[field] > 1:
                key_idx = row_field_counts_d[field]
                unique_fields_d[field + f"{key_idx}"] = value
    
    print(f"Quantitative Analysis section for {desc} sample:")
    print(section_title)
    pprint(unique_fields_d)
    print()

Quantitative Analysis section for General sample:
Quantitative Analysis
{'average_axis': 5.0, 'average_diameter': 13.66, 'average_weight': 1.61}

Quantitative Analysis section for Pagination sample:
Quantitative Analysis
{'average_axis': 7.85, 'average_diameter': 16.59, 'average_weight': 2.04}

Quantitative Analysis section for Subtype Section sample:
Quantitative Analysis
{'average_axis': 6.29, 'average_diameter': 18.25, 'average_weight': 3.04}
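
The loop above calls `float(dd.text.strip())` directly, which raises `ValueError` if a value is empty or non-numeric. A hedged sketch of a more tolerant conversion follows. `parse_measurements` is a hypothetical helper that takes plain `(label, text)` pairs rather than tags. The labels and the final non-numeric entry are assumptions, while the numbers come from the General sample output.

```python
def parse_measurements(pairs):
    """Map (label, text) pairs to {'average_<label>': float or None}."""
    parsed = {}
    for label, text in pairs:
        key = "average_" + label.strip().lower().replace(" ", "")
        try:
            parsed[key] = float(text.strip())
        except ValueError:
            # Keep the field but flag the value as unparsable.
            parsed[key] = None
    return parsed


sample = [
    ("Axis", "5.0"),
    ("Diameter", "13.66"),
    ("Weight", "1.61"),
    ("Thickness", "n/a"),  # hypothetical non-numeric entry
]
print(parse_measurements(sample))
```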

