# Obtaining Coin Data and Images from the Portable Antiquities Scheme Database

This notebook demonstrates how to obtain coin data and images from the Portable Antiquities Scheme (PAS) database using web scraping techniques. Scraping is needed because restrictions have been imposed on machine queries of the API that I built between 2006 and 2015. The Scheme/British Museum now uses Cloudflare to protect its API endpoints, and the JavaScript challenge it presents makes it difficult to access the data programmatically. This can be bypassed, and here's how.

## Using Python to download data and images

To do this, I used the Python library cloudscraper in a virtual environment and wrote a script that downloads the JSON, converts it to CSV and then fetches the images. First off, set up your virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
```
You're now ready to go!

## Code

For this example, I am going to make a slight change to the actual script I ran, to make it easier to manage in the notebook environment.
Instead of fetching all records attached to Reece Period 1, I am going to cap the run at 4 pages by setting the `pagination` variable to 4.
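
For the full run, the page count comes from the commented-out line in the cell below, which is a ceiling division of the total record count by the page size. A minimal sketch of that arithmetic follows; the function name `page_count` is hypothetical, and the page size of 20 is an assumption inferred from this notebook loading 80 records over 4 pages.

```python
def page_count(total_results: int, results_per_page: int) -> int:
    """Return the number of pages needed to hold every result (ceiling division)."""
    if results_per_page <= 0:
        raise ValueError("results_per_page must be positive")
    return (total_results + results_per_page - 1) // results_per_page


# 3305 is the total reported in this notebook's run;
# 20 per page is an assumption (80 records were loaded from 4 pages).
print(page_count(3305, 20))  # 166
```

Adding `results_per_page - 1` before the integer division rounds up, so a final, partly filled page is still counted.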

In [1]:
pip install cloudscraper

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [4]:
import cloudscraper
import json
import pandas as pd
import time

# Create a scraper instance
# This handles the Cloudflare challenges automatically
scraper = cloudscraper.create_scraper()

# Define the base URL
url_base = 'https://finds.org.uk/database/search/results/broadperiod/ROMAN/reeceID/1/format/json'

# Set a user-agent to mimic a real browser
# Cloudscraper will add other necessary headers automatically
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Accept': 'application/json'
}

# Make the initial request and get the total number of pages
print("Fetching metadata from the first page...")
response = scraper.get(url_base, headers=headers)
json_data = json.loads(response.text)

total_results = json_data['meta']['totalResults']
results_per_page = json_data['meta']['resultsPerPage']
# pagination = (total_results + results_per_page - 1) // results_per_page
pagination = 4
print(f"Total records: {total_results}")
print(f"Total pages to scrape: {pagination}")

all_data = []

# Process the first page
records = json_data['results']
df = pd.DataFrame(records)
all_data.append(df)

# Loop through the remaining pages
for i in range(2, pagination + 1):
    url_download = f"{url_base}/page/{i}"
    print(f"Scraping page {i}/{pagination}...")
    
    try:
        response_paged = scraper.get(url_download, headers=headers)
        paged_json = json.loads(response_paged.text)
        records_paged = paged_json['results']
        df_paged = pd.DataFrame(records_paged)
        all_data.append(df_paged)
    except Exception as e:
        print(f"An error occurred on page {i}: {e}")
        time.sleep(5)
    
    time.sleep(1)

# Concatenate all dataframes and save to CSV
final_df = pd.concat(all_data, ignore_index=True)
final_df.to_csv('./data/reece1.csv', index=False, na_rep='')

print("Data successfully scraped and saved to ./data/reece1.csv")

Fetching metadata from the first page...
Total records: 3305
Total pages to scrape: 4
Scraping page 2/4...
Scraping page 3/4...
Scraping page 4/4...
Data successfully scraped and saved to ./data/reece1.csv


Let's see what we got back from the JSON API after converting it to a DataFrame and saving it as CSV. Run the next command to preview the first ten rows.

In [5]:
pd.read_csv('./data/reece1.csv').head(10)

Unnamed: 0,findIdentifier,id,old_findID,objecttype,broadperiod,description,notes,periodFrom,periodTo,fromdate,...,length,currentLocation,treasure,rally,TID,note,fromsubperiod,tosubperiod,subperiodFrom,subperiodTo
0,finds-1233746,1233746,GAT-06A665,COIN,ROMAN,A silver Roman denarius of Augustus (27&nbsp;B...,Until 28/08/2025 this find was grouped under I...,21,21,-27.0,...,,,,,,,,,,
1,finds-1230870,1230870,YORYM-493963,COIN,ROMAN,A silver Roman denarius of Tiberius (AD 14-37)...,,21,21,36.0,...,,,,,,,,,,
2,finds-1228996,1228996,SUR-609809,COIN,ROMAN,A&nbsp;worn silver Roman Republican denarius i...,Recorded from details emailed by the finder.,21,21,-48.0,...,,,,,,,,,,
3,finds-1227572,1227572,NMS-78A76D,COIN,ROMAN,A Roman&nbsp;silver&nbsp;Republican denarius o...,,21,21,32.0,...,,,,,,,,,,
4,finds-1226906,1226906,SUR-4C2E85,COIN,ROMAN,An extremely worn silver Roman Republican Dena...,Recorded from details emailed by the finder.,21,21,-108.0,...,,,,,,,,,,
5,finds-1226546,1226546,NMS-E7CC18,COIN,ROMAN,Copper alloy as or dupondius of uncertain Juli...,,21,21,37.0,...,,,,,,,,,,
6,finds-1225966,1225966,BERK-8F4653,COIN,ROMAN,A worn silver Roman Republican denarius of Mar...,,21,21,-32.0,...,,,,,,,,,,
7,finds-1225964,1225964,BERK-8F1AC6,COIN,ROMAN,A silver Roman Republican denarius of&nbsp;P. ...,,21,21,-42.0,...,,,,,,,,,,
8,finds-1225829,1225829,SF-683A10,COIN,ROMAN,A complete silver Roman denarius issued by the...,,21,21,-17.0,...,,,,,,,,,,
9,finds-1225730,1225730,KENT-55AA42,COIN,ROMAN,A silver Roman republican denarius of L. Treba...,,21,21,-135.0,...,,,,,,,,,,
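
The `description` values in the preview contain HTML entities such as `&nbsp;`. If you want plain text, something like the following sketch could be applied to that column. The helper name `clean_text` is hypothetical, and it only uses the standard library; with pandas it could be applied as `df['description'].map(clean_text)`.

```python
import html


def clean_text(value):
    """Decode HTML entities and turn non-breaking spaces into ordinary spaces."""
    if not isinstance(value, str):
        return value  # leave missing values and numbers untouched
    # html.unescape turns "&nbsp;" into the character U+00A0, which we then replace
    return html.unescape(value).replace("\xa0", " ")


# Sample text shortened from the preview above
print(clean_text("A&nbsp;worn silver Roman Republican denarius"))
# A worn silver Roman Republican denarius
```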


We now want to extract and save images from the database. Each image URL can be built from two fields in the CSV, `imagedir` and `filename`, plus the base URL 'https://finds.org.uk/'. The next script downloads the images from the database, again using cloudscraper.
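
URLs always use forward slashes, so it can be safer to join the three parts explicitly rather than rely on the operating system's path separator. The sketch below shows one way to do that. The helper name `build_image_url` and the `imagedir` value are hypothetical; the filename is taken from this notebook's download output.

```python
def build_image_url(base_url: str, imagedir: str, filename: str) -> str:
    """Join the parts with single forward slashes, whatever slashes they already carry."""
    parts = [base_url.rstrip("/"), imagedir.strip("/"), filename.lstrip("/")]
    return "/".join(part for part in parts if part)


# "images/example/" is a made-up directory used only for illustration
print(build_image_url("https://finds.org.uk/", "images/example/", "GAT-06A665_68b06c8915771.jpg"))
# https://finds.org.uk/images/example/GAT-06A665_68b06c8915771.jpg
```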

In [4]:
import os
import cloudscraper
import pandas as pd
import time
import warnings
from urllib3.exceptions import NotOpenSSLWarning

warnings.simplefilter('ignore', NotOpenSSLWarning)

from requests.exceptions import HTTPError

# Create a scraper instance to handle Cloudflare
scraper = cloudscraper.create_scraper()

# Define the base URL and the local directory for images
base_url = 'https://finds.org.uk/'
output_dir = './data/downloaded_images'

# Create the output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Created directory: {output_dir}")

# Read the CSV file
try:
    df = pd.read_csv('./data/reece1.csv')
    print(f"Loaded {len(df)} records from reece1.csv.")
except FileNotFoundError:
    print("Error: The file './data/reece1.csv' was not found. Please ensure it's in the ./data directory.")
    exit()

# Initialize an empty DataFrame to store 404 errors
error_log_df = pd.DataFrame(columns=['old_findID', 'imagedir', 'filename', 'error_message'])

# Get the list of all unique image directories and filenames
images_to_download = df[['old_findID', 'imagedir', 'filename']].dropna().drop_duplicates()

print(f"Found {len(images_to_download)} unique images to download.")

# Loop through the unique image paths and download each file
for index, row in images_to_download.iterrows():
    old_findID = row['old_findID']
    imagedir = row['imagedir']
    filename = row['filename']
    
    # Construct the full image URL with forward slashes (os.path.join is meant for file paths)
    full_url = f"{base_url.rstrip('/')}/{str(imagedir).strip('/')}/{filename}"

    # Define the local path to save the image
    local_path = os.path.join(output_dir, filename)

    # Skip if the file already exists
    if os.path.exists(local_path):
        print(f"Skipping: {filename} (already exists)")
        continue

    print(f"Downloading: {filename}")
    try:
        # Get the image content
        response = scraper.get(full_url, stream=True)
        response.raise_for_status() # Raise an exception for bad status codes
        
        # Save the image to the local file
        with open(local_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"Successfully downloaded {filename}.")
        
    except HTTPError as e:
        if e.response.status_code == 404:
            new_row = pd.DataFrame([{'old_findID': old_findID, 'imagedir': imagedir, 'filename': filename, 'error_message': '404 - Not Found'}])
            error_log_df = pd.concat([error_log_df, new_row], ignore_index=True)
            print(f"Failed to download {filename} (404 Not Found). Logged to CSV.")
        else:
            print(f"Failed to download {filename} from {full_url}. Reason: {e}")
    
    except Exception as e:
        print(f"An unexpected error occurred with {filename}: {e}")
        
    # Be a polite scraper and add a short delay between requests
    time.sleep(1)

# Save the 404 error log to a CSV file
if not error_log_df.empty:
    error_log_df.to_csv('./data/404_errors.csv', index=False)
    print("\n404 errors have been logged to './data/404_errors.csv'.")
else:
    print("\nNo 404 errors were found during the download process.")

print("\nDownload process complete.")

Loaded 80 records from reece1.csv.
Found 78 unique images to download.
Skipping: GAT-06A665_68b06c8915771.jpg (already exists)
Skipping: YORYM-493963_68a59124800a5.jpg (already exists)
Skipping: SUR-609809_688773ef6844e.jpg (already exists)
Skipping: SUR-4C2E85_6874db2471a8b.jpg (already exists)
Skipping: BERK-8F4653_687611bc28843.jpg (already exists)
Skipping: BERK-8F1AC6_687612a110ec4.jpg (already exists)
Skipping: SF-683A10_68762c8c57dc5.jpg (already exists)
Skipping: KENT-55AA42_68655af514f86.jpg (already exists)
Skipping: WAW-932FCA_685a792a0d8a2.jpg (already exists)
Skipping: SWYOR-91E15B_6862c07573466.jpg (already exists)
Skipping: HAMP-905009_68590bb613a11.jpg (already exists)
Skipping: LIN-561371_68a6c5f4ac9a4.jpg (already exists)
Skipping: BUC-134FE1_68513593a9599.jpg (already exists)
Skipping: WMID-C303A6_686d16ce9fc4d.jpg (already exists)
Skipping: LVPL-167290_685ec780e1c94.jpg (already exists)
Skipping: ESS-F092BB_6842e05912ec8.jpg (already exists)
Skipping: SF-5C3305_6866

Some records are missing geographical coordinates because of limitations in the original data source, so we need to handle these gaps before analysis. We're going to fill them in by geocoding and write a new geocoded CSV file.
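
Before geocoding, it can help to see which rows are affected. The following is a small standard-library sketch, not the script used here. The helper name `rows_missing_coords` is hypothetical and the sample rows are invented; the column names `fourFigureLat` and `fourFigureLon` are the ones used in this notebook.

```python
import csv
import io


def rows_missing_coords(rows, lat_key="fourFigureLat", lon_key="fourFigureLon"):
    """Return the rows where either coordinate is absent or blank."""
    def blank(value):
        return value is None or str(value).strip() == ""
    return [row for row in rows if blank(row.get(lat_key)) or blank(row.get(lon_key))]


# Tiny in-memory stand-in for reece1.csv; these rows are invented for illustration
sample_csv = io.StringIO(
    "old_findID,fourFigureLat,fourFigureLon\n"
    "EXAMPLE-1,52.1,1.08\n"
    "EXAMPLE-2,,\n"
    "EXAMPLE-3,53.3,\n"
)
rows = list(csv.DictReader(sample_csv))
missing = rows_missing_coords(rows)
print([row["old_findID"] for row in missing])  # ['EXAMPLE-2', 'EXAMPLE-3']
```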

First make sure to install the required packages if you haven't already:

```bash
pip install pandas geopy
```

In [5]:
pip install pandas geopy

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [8]:
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut, GeocoderUnavailable
import time
import os

# Define the input and output filenames
input_file = './data/reece1.csv'
output_file = './data/geocoded.csv'

# Check if the input file exists
if not os.path.exists(input_file):
    print(f"Error: The file '{input_file}' was not found. Please ensure it's in the same directory.")
    exit()

# Load the data
try:
    # Use a common encoding like 'utf-8' or 'latin1'
    df = pd.read_csv(input_file, encoding='utf-8')
    print(f"Loaded {len(df)} records from '{input_file}'.")
except Exception as e:
    print(f"Error reading the CSV file: {e}")
    exit()

# Initialize the geocoder with a custom user agent
# A user agent is required by many services for proper identification
geolocator = Nominatim(user_agent="geocoding_script_for_roman_coins")

# Find records that are missing either latitude or longitude
missing_coords_df = df[(df['fourFigureLat'].isnull()) | (df['fourFigureLon'].isnull())]
print(f"Found {len(missing_coords_df)} records missing geocoordinates.")

# Check if there are records to geocode
if len(missing_coords_df) == 0:
    print("No missing coordinates to geocode. Saving the original file as 'geocoded.csv'.")
    df.to_csv(output_file, index=False)
    exit()

# Iterate through the records with missing coordinates, using knownas for parish or further details
records_geocoded = 0
for index, row in missing_coords_df.iterrows():
    parish = str(row['knownas']).strip() if pd.notnull(row['knownas']) else ''
    county = str(row['county']).strip() if pd.notnull(row['county']) else ''

    # Skip records that have nothing to search for
    if not parish and not county:
        print(f"Skipping record at index {index}: No parish or county information available.")
        continue

    # Construct the query string from the non-empty parts. Adding 'United Kingdom' improves accuracy.
    query = ", ".join(part for part in (parish, county, "United Kingdom") if part)

    try:
        location = geolocator.geocode(query, timeout=10)
        
        if location:
            # Update the original DataFrame with the new coordinates
            df.loc[index, 'fourFigureLat'] = location.latitude
            df.loc[index, 'fourFigureLon'] = location.longitude
            records_geocoded += 1
            print(f"Geocoded record {index+1}: Found coordinates for '{query}' - Lat: {location.latitude}, Lon: {location.longitude}")
        else:
            print(f"Could not find coordinates for '{query}'.")

    except (GeocoderTimedOut, GeocoderUnavailable) as e:
        print(f"Geocoding service error for '{query}': {e}. Retrying after a short delay.")
        time.sleep(5)  # Pause to avoid rate limiting
        location = geolocator.geocode(query, timeout=10)  # Try one more time, with the same timeout
        if location:
            df.loc[index, 'fourFigureLat'] = location.latitude
            df.loc[index, 'fourFigureLon'] = location.longitude
            records_geocoded += 1
            print(f"Geocoded record {index+1}: Found coordinates for '{query}' - Lat: {location.latitude}, Lon: {location.longitude}")
        else:
            print(f"Retry failed. Could not find coordinates for '{query}'.")

    # Be polite and add a short delay between requests to avoid rate limiting
    time.sleep(1)

print(f"\nGeocoding complete. Successfully geocoded {records_geocoded} records.")

# Save the final, updated DataFrame to a new CSV file
df.to_csv(output_file, index=False)
print(f"Updated data saved to '{output_file}'.")

Loaded 80 records from './data/reece1.csv'.
Found 22 records missing geocoordinates.
Geocoded record 1: Found coordinates for 'Llanfachraeth, Isle of Anglesey, United Kingdom' - Lat: 53.3141694, Lon: -4.5388991
Geocoded record 9: Found coordinates for 'Baylham, Suffolk, United Kingdom' - Lat: 52.1282329, Lon: 1.0819174
Could not find coordinates for 'Near Bratoft, Lincolnshire, United Kingdom'.
Geocoded record 19: Found coordinates for 'Eyke, Suffolk, United Kingdom' - Lat: 52.1166526, Lon: 1.3845169
Geocoded record 21: Found coordinates for 'Bedfield, Suffolk, United Kingdom' - Lat: 52.2532916, Lon: 1.2515435
Geocoded record 41: Found coordinates for 'Bedfield, Suffolk, United Kingdom' - Lat: 52.2532916, Lon: 1.2515435
Geocoded record 42: Found coordinates for 'Great Barton, Suffolk, United Kingdom' - Lat: 52.2709279, Lon: 0.770001
Geocoded record 43: Found coordinates for 'North Elmham, Norfolk, United Kingdom' - Lat: 52.7614104, Lon: 0.9262684
Geocoded record 49: Found coordinates f
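
One query in the output above, 'Near Bratoft, Lincolnshire, United Kingdom', returned no match. A possible refinement is to tidy the place name before querying. The sketch below drops a leading "Near" and skips empty parts. The helper name `build_query` is hypothetical, and whether this helps Nominatim find a given place is an assumption that would need checking against real results.

```python
import re


def build_query(place: str, county: str, country: str = "United Kingdom") -> str:
    """Build a geocoder query from the non-empty parts, dropping a leading 'Near'."""
    place = re.sub(r"^\s*near\s+", "", place or "", flags=re.IGNORECASE).strip()
    county = (county or "").strip()
    parts = [part for part in (place, county) if part]
    if not parts:
        return ""  # nothing to search for
    return ", ".join(parts + [country])


print(build_query("Near Bratoft", "Lincolnshire"))
# Bratoft, Lincolnshire, United Kingdom
```

Returning an empty string when both parts are blank lets the caller skip the request entirely.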

You now have a script that geocodes missing coordinates in your dataset. Next, we convert the geocoded CSV to Linked Pasts GeoJSON format.
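
Before the full conversion, here is a stripped-down sketch of the core step: turning one CSV row into a GeoJSON Point feature. The function name `row_to_feature` and the identifier `finds-0000000` are made up for illustration; the coordinates are the ones returned for Baylham in the geocoding output. Note that GeoJSON lists longitude before latitude.

```python
import json


def row_to_feature(row: dict) -> dict:
    """Turn one CSV row (all values are strings) into a minimal GeoJSON Point feature."""
    lat = float(row["fourFigureLat"])
    lon = float(row["fourFigureLon"])
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},  # longitude first
        "properties": {"findIdentifier": row.get("findIdentifier")},
    }


# The identifier is invented; the coordinates come from the geocoding output
row = {"findIdentifier": "finds-0000000", "fourFigureLat": "52.1282329", "fourFigureLon": "1.0819174"}
collection = {"type": "FeatureCollection", "features": [row_to_feature(row)]}
print(json.dumps(collection, indent=2))
```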

In [9]:
import csv
import json

def convert_csv_to_geojson(csv_file, geojson_file):
    """
    Converts a CSV file with lat/lon coordinates into a GeoJSON file.
    """
    geojson = {
        "type": "FeatureCollection",
        "indexing": {
            "@context": "https://schema.org/",
            "@type": "Dataset",
            "name": "Roman Republican Coins from the Portable Antiquities Scheme",
            "description": "An enriched dataset of Roman Republican coins from the Portable Antiquities Scheme",
            "license": "https://creativecommons.org/licenses/by/4.0/",
            "identifier": "https://finds.org.uk/database/search/results/broadperiod/ROMAN/reeceID/1/"
        },
        "features": []
    }

    try:
        with open(csv_file, mode='r', encoding='utf-8') as file:
            reader = csv.DictReader(file)
            for row in reader:
                # Create a new, cleaned row dictionary
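                # Create a new, cleaned row dictionary (the descriptions contain literal '&nbsp;' entities as well as non-breaking spaces)
                cleaned_row = {key: value.replace('&nbsp;', ' ').replace('\xa0', ' ') if isinstance(value, str) else value for key, value in row.items()}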
                
                try:
                    # Clean up and validate coordinates
                    lat_str = cleaned_row.get('fourFigureLat', '').strip()
                    lon_str = cleaned_row.get('fourFigureLon', '').strip()

                    # Skip records with empty or invalid coordinates
                    if not lat_str or not lon_str:
                        print(f"Skipping row with findIdentifier '{cleaned_row.get('findIdentifier', 'N/A')}' due to missing coordinates.")
                        continue

                    lat = float(lat_str)
                    lon = float(lon_str)

                    # Create a GeoJSON Feature
                    # Format 'created' date to YYYY
                    created_raw = cleaned_row.get('created')
                    created_year = None
                    if created_raw:
                        try:
                            created_year = str(created_raw).strip()[:4]
                            if not created_year.isdigit():
                                created_year = None
                        except Exception:
                            created_year = None
                    
                    feature = {
                        "@id": f"https://finds.org.uk/database/artefacts/record/id/{cleaned_row.get('id')}",
                        "type": "Feature",
                        "geometry": {
                            "type": "Point",
                            "coordinates": [lon, lat]
                        },
                        "properties": {
                            "findIdentifier": cleaned_row.get('findIdentifier'),
                            "oldFindID": cleaned_row.get('old_findID'),
                            "objecttype": cleaned_row.get('objecttype'),
                            "broadperiod": cleaned_row.get('broadperiod'),
                            "description": cleaned_row.get('description'),
                            "county": cleaned_row.get('county'),
                            "district": cleaned_row.get('district'),
                            "parish": cleaned_row.get('parish'),
                            "knownas": cleaned_row.get('knownas'),
                            "ruler": cleaned_row.get('rulerName'),
                            "moneyer": cleaned_row.get('moneyerName'),
                            "denomination": cleaned_row.get('denominationName'),
                            "mint": cleaned_row.get('mintName'),
                            "manufacture": cleaned_row.get('manufactureTerm'),
                            "rrcType": cleaned_row.get('rrcType'),
                            "rrcID": cleaned_row.get('rrcID'),
                            "ricID": cleaned_row.get('ricID'),
                            "reeceID": cleaned_row.get('reeceID'),
                            "nomismaIssuer": cleaned_row.get('rulerNomisma'),
                            "nomismaMint": cleaned_row.get('nomismaMintID'),
                            "pleiadesID": cleaned_row.get('pleiadesID'),
                            "issuerDbPedia": cleaned_row.get('rulerDbpedia'),
                            "metal": cleaned_row.get('metal'),
                            "materialTerm": cleaned_row.get('materialTerm'),
                            "weight": cleaned_row.get('weight'),
                            "date_from": cleaned_row.get('fromdate'),
                            "date_to": cleaned_row.get('todate'),
                            "institution": cleaned_row.get('institution'),
                            "created": created_year,
                        }
                    }
                    filename = cleaned_row.get('filename', '').strip()
                    if filename:
                        baseurl = 'https://republican-coins.museologi.st/images/'
                        depiction_url = baseurl + filename
                        oldfindID = cleaned_row.get('old_findID', '').strip()
                        feature['depictions'] = [
                            {
                                "@id": depiction_url,
                                "thumbnail": depiction_url,
                                "label": f"A depiction of {oldfindID}"
                            }
                        ]
                    description = cleaned_row.get('description', '').strip()
                    if description:
                        feature['descriptions'] = [
                            {
                                "value": description
                            }
                        ]
                    # Collect type links in one list so a RIC entry does not overwrite an RRC entry
                    types = []
                    rrc_id = cleaned_row.get('rrcID', '').strip()
                    if rrc_id:
                        types.append({
                            "identifier": f"https://numismatics.org/crro/id/{rrc_id.lower()}",
                            "label": f"Nomisma RRC type: {rrc_id.lower()}"
                        })
                    ric_id = cleaned_row.get('ricID', '').strip()
                    if ric_id:
                        types.append({
                            "identifier": f"https://numismatics.org/ocre/id/{ric_id.lower()}",
                            "label": f"Nomisma RIC type: {ric_id.lower()}"
                        })
                    if types:
                        feature['types'] = types
                    # Add 'when' key only if both 'fromdate' and 'todate' are present
                    from_date = cleaned_row.get('fromdate', '').strip().split('.')[0]
                    if from_date:
                        to_date = cleaned_row.get('todate', '').strip().split('.')[0]
                        if to_date:
                            feature['when'] = {
                                "timespans": [
                                    {
                                        "start": {
                                            "in": f"{from_date}" if from_date else ""
                                        },
                                        "end": {
                                            "in": f"{to_date}" if to_date else "",
                                        }
                                    }
                                ],
                                "periods": [
                                    {
                                        "name": "Roman Republican 510 BC - 27 BC",
                                        "uri": "http://n2t.net/ark:/99152/p08m57h65c8"
                                    }
                                ],
                                "label": "for a century during the Roman period",
                                "certainty": "certain",
                                "duration": "P100Y"
                            }

                    links = []
                    pleiades_id = cleaned_row.get('pleiadesID', '').strip()
                    if pleiades_id and pleiades_id.replace('.', '', 1).isdigit():
                        pleiades_id = str(int(float(pleiades_id)))
                    nomisma_mint_id = cleaned_row.get('nomismaMintID', '').strip()
                    moneyer_id = cleaned_row.get('moneyerID', '').strip()
                    dbpedia_issuer = cleaned_row.get('rulerDbpedia', '').strip()
                    nomisma_issuer = cleaned_row.get('rulerNomisma', '').strip()
                    nomisma_reece_id = cleaned_row.get('reeceID', '').strip()

                    if nomisma_issuer:
                        links.append({
                            "identifier": f"https://nomisma.org/id/{nomisma_issuer}",
                            "type": "seeAlso",
                            "label": f"Nomisma ruler {nomisma_issuer}"
                        })
                    if pleiades_id:
                        links.append({
                            "identifier": f"https://pleiades.stoa.org/places/{pleiades_id}",
                            "type": "seeAlso",
                            "label": f"Pleiades place {pleiades_id}"
                        })
                    if nomisma_mint_id:
                        links.append({
                            "identifier": f"https://nomisma.org/id/{nomisma_mint_id}",
                            "type": "seeAlso",
                            "label": f"Nomisma mint {nomisma_mint_id}"
                        })
                    if nomisma_reece_id:
                        links.append({
                            "identifier": f"https://nomisma.org/id/reece{nomisma_reece_id}",
                            "type": "seeAlso",
                            "label": f"Nomisma Reece Period {nomisma_reece_id}"
                        })
                    if moneyer_id:
                        links.append({
                            "identifier": f"https://nomisma.org/id/{moneyer_id}",
                            "type": "seeAlso",
                            "label": f"Nomisma moneyer {moneyer_id}"
                        })
                    if dbpedia_issuer:
                        links.append({
                            "identifier": f"https://dbpedia.org/resource/{dbpedia_issuer}",
                            "type": "seeAlso",
                            "label": f"DBpedia resource for {dbpedia_issuer}"
                        })
                    if links:
                        feature['links'] = links

                    geojson['features'].append(feature)
                except (ValueError, TypeError) as e:
                    print(f"Skipping row with findIdentifier '{cleaned_row.get('findIdentifier', 'N/A')}' due to invalid coordinate data: {e}")
                    continue

        with open(geojson_file, 'w', encoding='utf-8') as f:
            json.dump(geojson, f, indent=2, ensure_ascii=False)
            
        print(f"\nSuccessfully converted {len(geojson['features'])} valid records to {geojson_file}")
            
    except FileNotFoundError:
        print(f"Error: The file '{csv_file}' was not found.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
            
if __name__ == "__main__":
    input_csv = "./data/geocoded.csv"
    output_geojson = "./data/republican.geojson"
    convert_csv_to_geojson(input_csv, output_geojson)


Skipping row with findIdentifier 'finds-1223903' due to missing coordinates.
Skipping row with findIdentifier 'finds-1204869' due to missing coordinates.
Skipping row with findIdentifier 'finds-1204740' due to missing coordinates.
Skipping row with findIdentifier 'finds-1198637' due to missing coordinates.

Successfully converted 76 valid records to ./data/republican.geojson


And there you go: that's how you create Linked Pasts GeoJSON files from your CSV data, ready to use in a Peripleo instance.
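
Before loading the file elsewhere, a quick sanity check can catch obvious problems. The sketch below is only an illustration: the function name `check_feature_collection` is hypothetical, and it tests nothing beyond the top-level type and the coordinate ranges, so it will not notice latitude and longitude that have been swapped but still fall within range.

```python
def check_feature_collection(collection: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if collection.get("type") != "FeatureCollection":
        problems.append("top-level type is not FeatureCollection")
    for i, feature in enumerate(collection.get("features", [])):
        coords = (feature.get("geometry") or {}).get("coordinates", [])
        if len(coords) != 2:
            problems.append(f"feature {i}: expected [lon, lat]")
            continue
        lon, lat = coords
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            problems.append(f"feature {i}: coordinates out of range ({lon}, {lat})")
    return problems


# Invented sample: the second feature has an impossible latitude
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [1.08, 52.13]}},
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [1.08, 152.0]}},
    ],
}
for problem in check_feature_collection(sample):
    print(problem)
```

To run it on the real output, load `./data/republican.geojson` with `json.load` and pass the resulting dictionary to the function.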