# Exploratory Scraping

The data structures of the [browse results](https://numismatics.org/ocre/results?q=&start=0) and the [canonical URI](https://numismatics.org/ocre/id/ric.1(2).aug.1A) pages for individual coins do not appear to be consistent. To determine the data fields required for the database, exploratory scraping will be conducted in Jupyter Notebooks and Python scripts.

In [1]:
from bs4 import BeautifulSoup
from pathlib import Path

In [2]:
import constants as c

## Browse Results

The "browse results" are available [here](https://numismatics.org/ocre/results?q=&start=0) and contain the top-level data about coins in the database. Eight fields have been observed so far: coin name, canonical URI, date, denomination, mint, obverse (heads), reverse (tails), and number of coins. Not all coins have all eight fields, and it is unclear whether other fields exist. Therefore, an exploratory scrape will be conducted to determine all of the fields that are available.

A sample file for local experimentation is defined below.

In [3]:
# Actual path ('start=X' value changes with incrementing pages)
# path = "https://numismatics.org/ocre/results?q=objectType_facet%3A%22Coin%22&start=0"

In [4]:
!ls ./../data/ | grep -E "ocre_browse_results_sample"

ocre_browse_results_sample.html
ocre_browse_results_sample_some_coins_no_text.html


In [5]:
EXAMPLE_FILE = "ocre_browse_results_sample.html"
PATH_SAMPLE = c.DATA_FOLDER / EXAMPLE_FILE

In [6]:
with open(PATH_SAMPLE, "r", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, "lxml")

In [7]:
soup

<!DOCTYPE HTML>
<html lang="en"><head profile="http://a9.com/-/spec/opensearch/1.1/"><title>Online Coins of the Roman Empire: Browse Collection</title><link href="http://numismatics.org/ocre/feed/?q=objectType_facet:%22Coin%22" rel="alternate" type="application/atom+xml"/><link href="http://numismatics.org/ocre/query.csv/?q=objectType_facet:%22Coin%22" rel="alternate" type="text/csv"/><link href="http://numismatics.org/ocre/query.kml/?q=objectType_facet:%22Coin%22" rel="alternate" type="application/vnd.google-earth.kml+xml"/><link href="http://numismatics.org/ocre/opensearch.xml" rel="search" title="Example Search for http://numismatics.org/ocre/" type="application/opensearchdescription+xml"/><meta content="41269" name="totalResults"/><meta content="0" name="startIndex"/><meta content="20" name="itemsPerPage"/><link href="https://numismatics.org/themes/ocre/images/favicon.png" rel="shortcut icon" type="image/x-icon"/><meta content="width=device-width, initial-scale=1" name="viewport"/>

The first step is to extract the page ID, start coin ID, and end coin ID.

In [8]:
records_and_page = soup.find("div", class_="paging_div row")
display_records_data = (
    records_and_page
        .contents[0]
        .text
        .strip()
        .split()
)
# max_coin_id will only need to be extracted on the first page
coin_start_id, coin_end_id, max_coin_id = [
    int(item) for item in display_records_data[2:6+1:2]
]
print([coin_start_id, coin_end_id, max_coin_id])

[1, 20, 41269]


In [9]:
# page_id from navigation bar
page_id = int(
    records_and_page
        .find("div", class_="col-md-6 page-nos")
        .text.strip()
)
page_id

1
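The page header shown in the `soup` output above also carries `totalResults`, `startIndex`, and `itemsPerPage` as `<meta>` tags, so the same numbers might be read without splitting the display text. The following is a minimal sketch under that assumption. `SAMPLE_HEAD` is a trimmed copy of the header from the sample file, `paging_metadata` is a hypothetical helper name, and `html.parser` is used only so the sketch does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <head> from the sample browse page shown above
SAMPLE_HEAD = """
<html><head>
<meta content="41269" name="totalResults"/>
<meta content="0" name="startIndex"/>
<meta content="20" name="itemsPerPage"/>
</head><body></body></html>
"""

def paging_metadata(soup):
    """Return the paging <meta> values as ints (None when a tag is missing)."""
    values = {}
    for key in ("totalResults", "startIndex", "itemsPerPage"):
        # attrs= is required because "name" is also find()'s first parameter
        tag = soup.find("meta", attrs={"name": key})
        values[key] = int(tag["content"]) if tag is not None else None
    return values

meta = paging_metadata(BeautifulSoup(SAMPLE_HEAD, "html.parser"))
print(meta)
# {'totalResults': 41269, 'startIndex': 0, 'itemsPerPage': 20}
```

Whether every results page carries these tags has not been verified, so the text-splitting approach is worth keeping as a fallback.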

The `max_coin_id` can be used to calculate the number of pages to iterate through. Twenty coins appear on each page. If the maximum number of coins is not evenly divisible by twenty, then one extra page will have to be scraped.

In [10]:
(max_coin_id//20) + int(max_coin_id%20 != 0)

2064
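The page count translates directly into the `start` values for the results URL. The sketch below assumes that `start` increases by 20 per page, as the commented-out path near the top of the notebook suggests. The helper names (`page_count`, `start_offsets`, `page_url`) are hypothetical.

```python
import math

ITEMS_PER_PAGE = 20  # coins per browse page
BASE_URL = "https://numismatics.org/ocre/results"
QUERY = "objectType_facet%3A%22Coin%22"

def page_count(total_records, per_page=ITEMS_PER_PAGE):
    """Number of pages needed, rounding up for a partial last page."""
    return math.ceil(total_records / per_page)

def start_offsets(total_records, per_page=ITEMS_PER_PAGE):
    """The 'start' value for each page: 0, 20, 40, ..."""
    return range(0, total_records, per_page)

def page_url(start):
    """Build the browse URL for one page (assumed URL pattern)."""
    return f"{BASE_URL}?q={QUERY}&start={start}"

offsets = start_offsets(41269)
print(page_count(41269))     # 2064
print(offsets[-1])           # 41260
print(page_url(offsets[-1]))
```

Using `range` keeps the offsets lazy, so the full list of 2064 URLs never needs to be held in memory.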

Next, the content of the page needs to be extracted.

- 20 coins (or fewer) per page
- Each coin is in `<div class="row result-doc">` tags.
- An example of a coin is below.

![](./../images/browse_page_results_one_coin.png)

- The coin image and number of objects, etc. on the right is available through `<div class="col-md-5 col-lg-4 pull-right">`.
- The data on the left are available through `<div class="col-md-7 col-lg-8">`.

In [11]:
all_page_coins = soup.find_all("div", class_="row result-doc")
len(all_page_coins)

20

In [12]:
all_page_coins[0]

<div class="row result-doc"><div class="col-md-12"><h4><a href="id/ric.1(2).aug.1A">RIC I (second edition) Augustus 1A</a></h4></div><div class="col-md-5 col-lg-4 pull-right"><a class="thumbImage" href="https://numismatics.org/collectionimages/19001949/1944/1944.100.39025.obv.width350.jpg" id="http://numismatics.org/collection/1944.100.39025" rel="gallery" title="Obverse of 1944.100.39025: American Numismatic Society"><img class="side-thumbnail" src="https://numismatics.org/collectionimages/19001949/1944/1944.100.39025.obv.width175.jpg"/></a><a class="thumbImage" href="https://numismatics.org/collectionimages/19001949/1944/1944.100.39025.rev.width350.jpg" id="http://numismatics.org/collection/1944.100.39025" rel="gallery" title="Reverse of 1944.100.39025: American Numismatic Society"><img class="side-thumbnail" src="https://numismatics.org/collectionimages/19001949/1944/1944.100.39025.rev.width175.jpg"/></a><a class="thumbImage" href="https://numismatics.org/collectionimages/19001949/1

Accessing the titles of each coin is done through the following.

In [13]:
soup_title = all_page_coins[0].find("div", class_="col-md-12").find("a")
soup_title

<a href="id/ric.1(2).aug.1A">RIC I (second edition) Augustus 1A</a>

In [14]:
soup_title["href"]

'id/ric.1(2).aug.1A'

(The full link is https://numismatics.org/ocre/id/ric.1(2).aug.1A, where everything before "id" is the home page, https://numismatics.org/ocre/.)
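Because the `href` is relative, the full canonical URI has to be rebuilt from the home page. One way to do this is a minimal sketch with the standard library's `urljoin`. `OCRE_HOME` is a name assumed here for illustration.

```python
from urllib.parse import urljoin

OCRE_HOME = "https://numismatics.org/ocre/"  # trailing slash matters for urljoin

relative_href = "id/ric.1(2).aug.1A"
full_url = urljoin(OCRE_HOME, relative_href)
print(full_url)
# https://numismatics.org/ocre/id/ric.1(2).aug.1A
```

Without the trailing slash on the base, `urljoin` would replace the last path segment (`ocre`) instead of appending to it.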

In [15]:
soup_title.text.strip()

'RIC I (second edition) Augustus 1A'

The right division, including the text under the coin images, is accessed below. To determine how best to extract the number of objects found, further exploratory scraping is required.

- [x] Extract all text from right sides, review the text, and determine best way to extract only number of objects found.

In [16]:
all_page_coins[0].find("div", class_="col-md-5 col-lg-4 pull-right").text.strip()

'objects: 22; hoard: 1'

In [17]:
bool(all_page_coins[0].find("div", class_="col-md-5 col-lg-4 pull-right").text.strip())

True

There are instances when there is no text in the right area. The text of `find("div", class_="col-md-5 col-lg-4 pull-right")` can be converted to a boolean to test whether the area contains anything. If there is no text, the boolean conversion returns `False`.

In [18]:
!ls ./../data/ | grep -E "ocre_browse_results_sample"

ocre_browse_results_sample.html
ocre_browse_results_sample_some_coins_no_text.html


In [19]:
EXAMPLE_FILE_NT = "ocre_browse_results_sample_some_coins_no_text.html"
PATH_SAMPLE_NT = c.DATA_FOLDER / EXAMPLE_FILE_NT

with open(PATH_SAMPLE_NT, "r", encoding="UTF-8") as f:
    soup_nt = BeautifulSoup(f, "lxml")
    
all_page_coins_nt = soup_nt.find_all("div", class_="row result-doc")
empty_contents = (
    all_page_coins_nt[3]
        .find("div", class_="col-md-5 col-lg-4 pull-right")
        .text.strip()
)
empty_contents

''

In [20]:
bool(empty_contents)

False

The string method `str.find()` with the substring "object" matches any text containing either "object" or "objects" and returns the index of the first match. If the method returns -1, the substring was not found.

In [23]:
!head -5 ./../data/unique_object_count_text.txt

"hoard: 1", first appears on page 5 at API start value of 80
"object: 1", first appears on page 1 at API start value of 0
"object: 1; hoard: 1", first appears on page 6 at API start value of 100
"object: 1; hoards: 3", first appears on page 8 at API start value of 140
"objects: 10", first appears on page 1 at API start value of 0


In [39]:
path_unique_text = c.DATA_FOLDER / "unique_object_count_text.txt"
with open(path_unique_text, "r", encoding="UTF-8") as f:
    unique_text = f.readlines()
    
unique_text = [
    line.split(sep=",", maxsplit=1)[0].replace("\"", "")
    for line in unique_text
]
find_results = [text.find("object") for text in unique_text]
[(text, res) for text, res in zip(unique_text, find_results)][:10]

[('hoard: 1', -1),
 ('object: 1', 0),
 ('object: 1; hoard: 1', 0),
 ('object: 1; hoards: 3', 0),
 ('objects: 10', 0),
 ('objects: 101', 0),
 ('objects: 102', 0),
 ('objects: 103', 0),
 ('objects: 104', 0),
 ('objects: 105', 0)]
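Since `str.find()` only reports whether "object" is present, a regular expression is one option for pulling out the number itself. The sketch below assumes the only labels are `object(s)` and `hoard(s)`, which is all that appears in the lines shown above. Other labels may exist in the full file. `parse_counts` is a hypothetical helper name.

```python
import re

# Matches "object: 1", "objects: 22", "hoard: 1", "hoards: 3"
COUNT_PATTERN = re.compile(r"(objects?|hoards?):\s*(\d+)")

def parse_counts(text):
    """Return object and hoard counts from the text under the coin images.

    Missing labels (including an empty string) default to 0.
    """
    counts = {"objects": 0, "hoards": 0}
    for label, number in COUNT_PATTERN.findall(text):
        key = "objects" if label.startswith("object") else "hoards"
        counts[key] = int(number)
    return counts

print(parse_counts("objects: 22; hoard: 1"))  # {'objects': 22, 'hoards': 1}
print(parse_counts("hoard: 1"))               # {'objects': 0, 'hoards': 1}
print(parse_counts(""))                       # {'objects': 0, 'hoards': 0}
```

Defaulting to 0 folds the empty-text case into the same code path, so no separate boolean check is needed before parsing.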

The left side is accessible through `find("div", class_="col-md-7 col-lg-8")`. Lists of the `dt` tags and `dd` tags can be created with `find_all()` and then iterated through in pairs using `zip()`.

In [21]:
all_left_data = all_page_coins[0].find("div", class_="col-md-7 col-lg-8").contents[0]
all_dt = all_left_data.find_all("dt")
all_dd = all_left_data.find_all("dd")
for dt, dd in zip(all_dt, all_dd):
    print(f"{dt.text}:\t{dd.text}")

Date:	25 BCE - 23 BCE
Denomination:	Quinarius
Mint:	Emerita
Obverse:	AVGVST: Head of Augustus, bare, left
Reverse:	P CARISI LEG: Victory standing right, placing wreath on trophy with dagger and sword at base


## Canonical URI

Canonical URI examples are available [here](https://numismatics.org/ocre/id/ric.1(2).aug.1A), [here](https://numismatics.org/ocre/id/ric.8.ar.261), and [here](https://numismatics.org/ocre/id/ric.8.sir.52).

The basic structure of these pages is:

- Coin title
- Canonical URI (url)
- Typological Description (section)
    - Date Range
    - Object Type
    - Manufacture
    - Denomination
    - Material
    - Authority (subsection)
        - Authority
        - Issuer
    - Geographic (subsection)
        - Mint
        - Region
    - Obverse (subsection)
        - Legend
        - Type
        - Portrait
    - Reverse (subsection)
        - Legend
        - Type
        - Deity
- Examples of this type (section)
    - "Div"s with different fields
    - Needs exploratory scraping
- Quantitative Analysis (section)
    - Average Axis
    - Average Diameter
    - Average Weight

Unfortunately, there are several variations between these three examples and between other URI pages reviewed. The variations identified are below:

- Fields in the section and subsections vary. I will need to perform exploratory scraping to determine all possible fields.
- The "Examples of this type" section has several divisions, and some coins do not have this section at all. I will need to check whether this div exists and verify the unique fields and unique values for this section.
- The "Quantitative Analysis" section may not exist. I will need to check whether this div exists and verify the fields of this section.
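The existence checks can rely on `find()` returning `None` when nothing matches. The sketch below illustrates only that pattern. The `id` values (`examples`, `quant`) and the sample markup are placeholders invented for illustration, not taken from the real pages, and will need to be replaced once the actual page structure has been inspected.

```python
from bs4 import BeautifulSoup

# Placeholder markup: these ids are hypothetical, not from the real pages
PAGE_ALL_SECTIONS = """
<div id="typology">Typological Description</div>
<div id="examples">Examples of this type</div>
<div id="quant">Quantitative Analysis</div>
"""
PAGE_TYPOLOGY_ONLY = '<div id="typology">Typological Description</div>'

# section name -> hypothetical div id
OPTIONAL_SECTIONS = {"examples": "examples", "quantitative": "quant"}

def sections_present(soup):
    """Report which optional sections exist on a canonical URI page."""
    return {
        name: soup.find("div", id=div_id) is not None
        for name, div_id in OPTIONAL_SECTIONS.items()
    }

full = sections_present(BeautifulSoup(PAGE_ALL_SECTIONS, "html.parser"))
partial = sections_present(BeautifulSoup(PAGE_TYPOLOGY_ONLY, "html.parser"))
print(full)     # {'examples': True, 'quantitative': True}
print(partial)  # {'examples': False, 'quantitative': False}
```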

### Through Typological Description Section