# British Museum – Production Place Cleaning (Nicky)

This notebook cleans the **Production place** field in the British Museum sculpture dataset.  

The aim is to turn highly inconsistent free-text location entries into clean, usable modern country names for analysis and visualisation.

**Goals for this notebook:**

1. Load the British Museum sculpture CSV into pandas.  
2. Explore the `Production place`, `Find spot`, and `Culture` fields.  
3. Clean raw place strings by:
   - removing prefixes such as “Made in”
   - standardising spelling and formatting
   - stripping noise words and punctuation  
4. Extract meaningful place names from complex or multi-valued entries.  
5. Map cities and ancient regions to **modern countries** using a controlled vocabulary.  
6. Backfill missing locations using:
   - `Find spot` (where clearly geographic)
   - `Culture` as a **last resort** (e.g. “Greek” → Greece, “Roman” → Italy)  
7. Create final fields:
   - `ProductionPlace_Raw`
   - `ProductionPlace_Cleaned`
   - `ProductionCountry`
   - `ProductionPlace_Inferred` (True/False)  
8. Keep steps transparent and reproducible, and preserve original values for auditability.

**Inference priority order:**  
To avoid overstating the certainty of any location, I'm using the following order when determining the final production country:

1. Cleaned `Production place`  
2. `Find spot` (only when clearly a geographic location)  
3. `Culture` (only as a last resort for broad cultural-geographic clues)

This makes sure the final dataset remains consistent, traceable, and usable for visualisations such as heatmaps.


## 1. Load the British Museum sculpture dataset

In this section I'm loading the British Museum sculpture CSV into pandas so that I can explore the location related fields.

The notebook is stored in the `notebooks` folder and the data files are in the `data` folder at the project root. That means the relative path from this notebook to the CSV goes up one level (`..`) and then into `data`.

For this cleaning work I'm using the version of the British Museum data that contains only sculpture records on display:

- file name: `british_museum_dataset_raw.csv`
- location: `../data/british_museum_dataset_raw.csv`

Once the data is loaded I'll:

1. Check the number of rows and columns.  
2. Print all column names to confirm the exact labels for `Production place`, `Find spot`, and `Culture`.  
3. Preview the first few rows of those three columns.


In [1]:
import pandas as pd

# Path to the British Museum sculpture dataset.
# The notebook is in /notebooks and the CSV is in /data,
# so we go up one folder ("..") to reach the project root, then into /data.
bm_file_path = "../data/british_museum_dataset_raw.csv"

# Load the dataset into a pandas DataFrame
bm_df = pd.read_csv(bm_file_path)

# Basic checks to confirm everything loaded correctly
print("Rows, columns:", bm_df.shape)
print("\nColumn names:")
print(bm_df.columns.tolist())

# Preview the key columns for place cleaning
columns_to_preview = [col for col in bm_df.columns if col.lower() in ["production place", "find spot", "culture"]]
bm_df[columns_to_preview].head(10)


Rows, columns: (4458, 47)

Column names:
['Image', 'Object type', 'Museum number', 'Title', 'Denomination', 'Escapement', 'Description', 'Producer name', 'School/style', 'State', 'Authority', 'Ethnic name (made by)', 'Ethnic name (assoc)', 'Culture', 'Production date', 'Production place', 'Find spot', 'Materials', 'Ware', 'Type series', 'Technique', 'Dimensions', 'Inscription', 'Curators Comments', 'Bib references', 'Location', 'Exhibition history', 'Condition', 'Subjects', 'Assoc name', 'Assoc place', 'Assoc events', 'Assoc titles', 'Acq name (acq)', 'Acq name (finding)', 'Acq name (excavator)', 'Acq name (previous)', 'Acq date', 'Acq notes (acq)', 'Acq notes (exc)', 'Dept', 'BM/Big number', 'Reg number', 'Add ids', 'Cat no', 'Banknote serial number', 'Joined objects']


Unnamed: 0,Culture,Production place,Find spot
0,Archaic Greek,Made in: South Ionia (historic),Excavated/Findspot: Sanctuary of Apollo (Naukr...
1,Classical Greek,,Excavated/Findspot: Athens
2,Apulian (Greek),Made in: Taranto,Excavated/Findspot: Taranto
3,Classical Greek,,Excavated/Findspot: Parthenon
4,26th Dynasty,,Excavated/Findspot: Tell Nabasha
5,Third Intermediate; Late Period,,Found/Acquired: Egypt
6,Late Period,,Found/Acquired: Egypt
7,Third Intermediate; 26th Dynasty,,Found/Acquired: Egypt
8,26th Dynasty,,Excavated/Findspot: Tell Nabasha
9,Ancient Egypt,,Found/Acquired: Egypt


## 2. Explore the location-related fields

Before cleaning anything, I'll inspect the structure and variety of the raw data.

I want to understand:
- how many unique values appear in `Production place`
- how many unique values appear in `Find spot`
- how often each field is missing
- what prefixes and patterns appear
- how many objects have no location data at all

This will help me design the cleaning rules and prevent mistakes later on.
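
The last question in the list (objects with no location data at all) can be answered with a row-wise missing-value check. The sketch below is a minimal illustration on a toy frame (`toy_df`, with invented rows) rather than on `bm_df`, so the printed numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for bm_df: invented rows, not values taken from the dataset.
toy_df = pd.DataFrame({
    "Production place": ["Made in: Taranto", None, None, None],
    "Find spot": ["Excavated/Findspot: Taranto", "Found/Acquired: Egypt", None, None],
    "Culture": ["Apulian (Greek)", "Late Period", "Classical Greek", None],
})

location_cols = ["Production place", "Find spot"]

# True where BOTH location fields are missing for a row.
no_location = toy_df[location_cols].isna().all(axis=1)

# Of those rows, the ones that still carry a Culture label
# (candidates for the last-resort Culture fallback).
culture_only = no_location & toy_df["Culture"].notna()

print("Rows with no location data:", int(no_location.sum()))
print("...of which have a Culture value:", int(culture_only.sum()))
```

Running the same two lines against `bm_df` would show how much of the dataset depends on the Culture fallback alone.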


In [2]:
# Count missing values
print("Missing Production place:", bm_df["Production place"].isna().sum())
print("Missing Find spot:", bm_df["Find spot"].isna().sum())
print("Missing Culture:", bm_df["Culture"].isna().sum())

# Unique value counts (just the first few for inspection)
print("\nSample unique Production place values:")
print(bm_df["Production place"].dropna().unique()[:20])

print("\nSample unique Find spot values:")
print(bm_df["Find spot"].dropna().unique()[:20])

print("\nSample unique Culture values:")
print(bm_df["Culture"].dropna().unique()[:20])


Missing Production place: 2617
Missing Find spot: 933
Missing Culture: 911

Sample unique Production place values:
['Made in: South Ionia (historic)' 'Made in: Taranto' 'Made in: Egypt'
 'Made in: Lyveden (historic) (?);  Made in: Stanion (?)'
 'Made in: Benin City' 'Made in: Germany' 'Made in: England (probably)'
 'Made in: East Greece' 'Made in: Egypt (?)' 'Made in: Nuremberg'
 'Made in: Oya district'
 'Original from: North West Palace;  Factory in: Spode Works'
 'Original from: Palace of Sargon II (gateway);  Factory in: Spode Works'
 'Made in: England' 'Made in: China;  Made in: Beijing (city) (?)'
 'Factory in: Etruria (historic - England)'
 "Factory in: Faubourg Saint-Denis, rue du;  Retailed in: Échelle, rue de l' (corner of);  Retailed in: Carrousel, rue du (corner of)"
 'Factory in: Chelsea (London)' 'Factory in: Derby (Derbyshire)'
 'Made in: Boeotia']

Sample unique Find spot values:
["Excavated/Findspot: Sanctuary of Apollo (Naukratis) (Attributed by Petrie to the 'second t

In [3]:
# A. Identify the most common prefixes (before the colon)
import re

def get_prefix(value):
    if pd.isna(value):
        return None
    match = re.match(r"([^:]+):", value)
    if match:
        return match.group(1).strip()
    return None

bm_df['prod_prefix'] = bm_df["Production place"].apply(get_prefix)

print("Production place prefixes (top 20):")
print(bm_df['prod_prefix'].value_counts().head(20))


# B. Count semicolon entries
print("\nProduction place entries containing semicolons:", 
      bm_df["Production place"].dropna().str.contains(";").sum())


# C. Identify the raw place components after prefixes (first 20 examples)
def get_after_prefix(value):
    if pd.isna(value):
        return None
    parts = value.split(":", 1)
    if len(parts) > 1:
        return parts[1].strip()
    return None

bm_df['prod_after_prefix'] = bm_df["Production place"].apply(get_after_prefix)

print("\nSample cleaned values after prefix removal:")
print(bm_df['prod_after_prefix'].dropna().unique()[:20])


Production place prefixes (top 20):
prod_prefix
Made in          1686
Factory in         97
Original from      54
Made for            2
Lustred in          1
Painted in          1
Name: count, dtype: int64

Production place entries containing semicolons: 61

Sample cleaned values after prefix removal:
['South Ionia (historic)' 'Taranto' 'Egypt'
 'Lyveden (historic) (?);  Made in: Stanion (?)' 'Benin City' 'Germany'
 'England (probably)' 'East Greece' 'Egypt (?)' 'Nuremberg' 'Oya district'
 'North West Palace;  Factory in: Spode Works'
 'Palace of Sargon II (gateway);  Factory in: Spode Works' 'England'
 'China;  Made in: Beijing (city) (?)' 'Etruria (historic - England)'
 "Faubourg Saint-Denis, rue du;  Retailed in: Échelle, rue de l' (corner of);  Retailed in: Carrousel, rue du (corner of)"
 'Chelsea (London)' 'Derby (Derbyshire)' 'Boeotia']


## 3. Clean the raw Production place values

The aim here is to extract a single meaningful place name from each raw string. This involves:
- removing prefixes such as "Made in:", "Factory in:" and similar
- splitting semicolon entries and keeping the first usable part
- removing question marks, parentheses and extra notes
- normalising spacing
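
Keeping only the first part of a semicolon entry discards the other parts and their prefixes. As a possible alternative, the sketch below parses every part of an entry into `(prefix, place)` pairs, so that a later step could prefer, say, `Made in` over `Original from`. The helper name `split_place_entries` is hypothetical and is not part of the notebook.

```python
import re

def split_place_entries(value):
    """Return a list of (prefix, place) tuples for one raw place string."""
    if not isinstance(value, str):
        return []  # covers None and NaN
    entries = []
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split "Made in: Taranto" into prefix and place; prefix is None if no colon.
        if ":" in part:
            prefix, place = part.split(":", 1)
            prefix = prefix.strip()
        else:
            prefix, place = None, part
        # Drop bracketed notes and question marks, then collapse whitespace.
        place = re.sub(r"\(.*?\)", "", place).replace("?", "")
        place = re.sub(r"\s+", " ", place).strip()
        if place:
            entries.append((prefix, place))
    return entries

print(split_place_entries("Made in: China;  Made in: Beijing (city) (?)"))
print(split_place_entries("Original from: North West Palace;  Factory in: Spode Works"))
```

One thing to watch with any version of this cleaning: dropping bracketed notes also drops disambiguators, so `Etruria (historic - England)` becomes `Etruria`, the same string as the ancient Italian region.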


In [4]:
import re

def clean_production_place(value):
    if pd.isna(value):
        return None
    
    # 1. Split on semicolon and take the first part
    first_part = value.split(";")[0].strip()
    
    # 2. Remove known prefixes (anything before the first colon)
    # e.g., "Made in: Taranto" -> "Taranto"
    if ":" in first_part:
        first_part = first_part.split(":", 1)[1].strip()
    
    # 3. Remove anything in parentheses
    first_part = re.sub(r"\(.*?\)", "", first_part).strip()
    
    # 4. Remove question marks first, then normalise spaces,
    #    so that removing a "?" cannot leave a double space behind
    first_part = first_part.replace("?", "")
    first_part = re.sub(r"\s+", " ", first_part).strip()
    
    # 5. Empty strings become None
    return first_part if first_part else None


# Apply to create a new column
bm_df["ProductionPlace_Cleaned"] = bm_df["Production place"].apply(clean_production_place)

# Preview
bm_df[["Production place", "ProductionPlace_Cleaned"]].head(15)


Unnamed: 0,Production place,ProductionPlace_Cleaned
0,Made in: South Ionia (historic),South Ionia
1,,
2,Made in: Taranto,Taranto
3,,
4,,
5,,
6,,
7,,
8,,
9,,


### 3.1 Inspect the cleaned ProductionPlace_Cleaned values

Now that I have a first pass at cleaning `Production place`, I want to see the most common cleaned values. This will help me:
- spot obvious issues in the cleaning function
- identify key cities and regions for mapping to modern countries
- see how many generic or unclear locations I will need to handle separately


In [5]:
# How many entries now have a cleaned production place?
non_null_cleaned = bm_df["ProductionPlace_Cleaned"].notna().sum()
print("Non null cleaned ProductionPlace_Cleaned:", non_null_cleaned)

# Top 100 most common cleaned places
pd.set_option("display.max_rows", 200)
print("\nTop 100 cleaned production places:")
bm_df["ProductionPlace_Cleaned"].value_counts().head(100)


Non null cleaned ProductionPlace_Cleaned: 1841

Top 100 cleaned production places:


ProductionPlace_Cleaned
Athens                          169
Cyprus                          142
Italy                           115
Boeotia                         101
China                            65
Egypt                            57
Taranto                          47
India                            45
Campania                         36
Japan                            32
Cyrenaica                        30
Crete                            27
Rhodes                           26
Corinth                          25
Etruria                          24
Greece                           23
Sicily                           23
Myrina                           22
Meissen                          22
Dehua                            18
One Hundred Column Hall          18
Derby                            17
Palace of Darius                 17
London                           16
Java                             16
Gaul                             13
Gandhara                         13
Demo

In [6]:
PLACE_TO_COUNTRY = {
    # Greece and Greek world
    "Athens": "Greece",
    "Boeotia": "Greece",
    "Crete": "Greece",
    "Rhodes": "Greece",
    "Corinth": "Greece",
    "Greece": "Greece",
    "Cyclades": "Greece",
    "Attica": "Greece",
    "Laconia": "Greece",
    "Eretria": "Greece",
    "East Greece": "Greece",

    # Italian peninsula and Roman world
    "Italy": "Italy",
    "Taranto": "Italy",
    "Campania": "Italy",
    "Sicily": "Italy",
    "Etruria": "Italy",
    "Puglia": "Italy",
    "Umbria": "Italy",
    "Canosa di Puglia": "Italy",
    "Centuripe": "Italy",
    "Chiusi": "Italy",
    "Rome": "Italy",
    "Capua": "Italy",
    "Medma": "Italy",
    "Vulci": "Italy",
    "Paestum": "Italy",
    "Sardinia": "Italy",
    "Florence": "Italy",
    "Locri Epizephyrii": "Italy",

    # China
    "China": "China",
    "Dehua": "China",
    "Jingdezhen": "China",
    "Beijing": "China",
    "Zhangzhou": "China",
    "Shanxi": "China",
    "Henan": "China",

    # Egypt
    "Egypt": "Egypt",
    "Nile Delta": "Egypt",
    "Alexandria": "Egypt",

    # India (including historical regions)
    "India": "India",
    "Tamil Nadu": "India",
    "Orissa": "India",
    "Bengal": "India",
    "Portuguese India": "India",
    "Karnataka": "India",

    # Iran (ancient Persia)
    "Iran": "Iran",
    "Persepolis": "Iran",
    "Palace of Darius": "Iran",
    "Apadana": "Iran",
    "One Hundred Column Hall": "Iran",
    "Susa": "Iran",

    # Turkey / Anatolia
    "Turkey": "Turkey",
    "Myrina": "Turkey",
    "Knidos": "Turkey",
    "Smyrna": "Turkey",
    "Halicarnassus": "Turkey",
    "Çanakkale": "Turkey",
    "Lycia": "Turkey",
    "Ionia": "Turkey", 

    # United Kingdom
    "Derby": "United Kingdom",
    "Staffordshire": "United Kingdom",
    "London": "United Kingdom",
    "Chelsea": "United Kingdom",
    "Cambridge": "United Kingdom",
    "Fulham": "United Kingdom",
    "Bow": "United Kingdom",
    "England": "United Kingdom",

    # Nigeria
    "Nigeria": "Nigeria",
    "Benin City": "Nigeria",

    # Libya
    "Cyrenaica": "Libya",

    # Pakistan / Afghanistan region
    "Gandhara": "Pakistan",
    "Pakistan": "Pakistan",
    "Kashmir": "Pakistan", 

    # Iraq
    "Iraq": "Iraq",
    "Nineveh": "Iraq",

    # Syria
    "Palmyra": "Syria",

    # Palestine
    "Jericho": "Palestine",

    # Indonesia
    "Java": "Indonesia",

    # DR Congo
    "Democratic Republic of Congo": "Democratic Republic of Congo",

    # Mozambique
    "Maputo": "Mozambique",

    # Russia
    "Moscow": "Russia",

    # Yemen
    "Yemen": "Yemen",

    # Gabon
    "Gabon": "Gabon",

    # Sri Lanka
    "Sri Lanka": "Sri Lanka",

    # Nepal
    "Nepal": "Nepal",

    # France
    "Gaul": "France",

    # Myanmar / Burma region
    # (None in top 100)

    # Korea
    "Korea": "South Korea",   # BM default position unless specified North

    # Tibet
    "Tibet": "China",  
}

PLACE_TO_COUNTRY.update({
    # Simple modern countries that appeared unmapped
    "Germany": "Germany",
    "Tunisia": "Tunisia",
    "France": "France",
    "Cameroon": "Cameroon",
    "Mexico": "Mexico",
    "Japan": "Japan",
    "Canada": "Canada",
    "Peru": "Peru",
    "Mali": "Mali",
    "Ethiopia": "Ethiopia",
    "Bangladesh": "Bangladesh",
    "Ghana": "Ghana",
    "Uganda": "Uganda",
    "Zimbabwe": "Zimbabwe",
    "Lesotho": "Lesotho",
    "Austria": "Austria",
    "Netherlands": "Netherlands",
    "Belgium": "Belgium",
    "Burma": "Myanmar",
    "Bhutan": "Bhutan",
    "South Korea": "South Korea",
    "Republic of Congo": "Republic of Congo",
    "South Sudan": "South Sudan",
    "Tanzania": "Tanzania",
})


## 5. Map cleaned Production places to modern countries

Now that I have a cleaned list of place names, I’ll map each one to a modern country using the `PLACE_TO_COUNTRY` dictionary. This will give me a `ProductionCountry` field that can be used directly for heat map visualisation.

I'll also create a column to show whether the country was successfully mapped or not, so I can review any remaining values and extend the dictionary where needed.
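
When extending the dictionary, it may help to rank the unmapped places by frequency so that the most common gaps are filled first. The sketch below is a minimal illustration using `Series.map` with a dictionary; `toy_lookup` and the `cleaned` series are invented stand-ins for `PLACE_TO_COUNTRY` and `ProductionPlace_Cleaned`, and the counts are not real.

```python
import pandas as pd

# Tiny illustrative subset of a place -> country lookup (not the full dictionary).
toy_lookup = {"Athens": "Greece", "Taranto": "Italy", "Egypt": "Egypt"}

# Made-up cleaned values standing in for bm_df["ProductionPlace_Cleaned"].
cleaned = pd.Series(["Athens", "Cyprus", "Cyprus", "Taranto", None, "Meissen", "Cyprus"])

# Series.map with a dict returns NaN for values that are not keys of the dict.
country = cleaned.map(toy_lookup)

# Places that are present in the data but did not map, most frequent first.
unmapped_counts = cleaned[cleaned.notna() & country.isna()].value_counts()
print(unmapped_counts)

# Share of non-missing places that received a country.
coverage = country.notna().sum() / cleaned.notna().sum()
print(f"Mapped {coverage:.0%} of non-missing places")
```

Sorting by count rather than scanning an alphabetical or first-seen list makes a frequent but unmapped value stand out immediately.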


In [7]:
def map_place_to_country(place):
    if place is None:
        return None
    return PLACE_TO_COUNTRY.get(place, None)

# Apply the mapping
bm_df["ProductionCountry"] = bm_df["ProductionPlace_Cleaned"].apply(map_place_to_country)

# Flag values that didn't map
bm_df["CountryMapped"] = bm_df["ProductionCountry"].notna()

bm_df[["ProductionPlace_Cleaned", "ProductionCountry"]].head(100)


Unnamed: 0,ProductionPlace_Cleaned,ProductionCountry
0,South Ionia,
1,,
2,Taranto,Italy
3,,
4,,
5,,
6,,
7,,
8,,
9,,


In [8]:
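# Unique cleaned place names with no mapped country.
# Note: the array includes None (rows with no Production place),
# so the count is one higher than the number of real place names.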
unmapped = bm_df[bm_df["ProductionCountry"].isna()]["ProductionPlace_Cleaned"].unique()

print("Number of unmapped place names:", len(unmapped))
unmapped[:200]  # preview first 200


Number of unmapped place names: 164


array(['South Ionia', None, 'Lyveden', 'Nuremberg', 'Oya district',
       'North West Palace', 'Palace of Sargon II',
       'Faubourg Saint-Denis, rue du', 'Flanders', 'Stoke-on-Trent',
       'Sèvres', "Shaw's Brow", "Herald's wall", 'Royal Buttress',
       'Cerveteri', 'Québec', 'Mauretania', 'Apulia', 'Cyprus',
       'Thanjavur', 'Aeolis', 'Kanchipuram', 'Hwidiem', 'Meissen',
       'Troas', 'Lunéville', 'Mikawachi', 'Liaoning', 'Urartu',
       'Hyogo-ken', 'Allier Valley', 'Unterweissbach', 'Yixing',
       'Clifton Junction', 'Maningrida', 'Plymouth', 'Numan', 'Lapethus',
       'Qurigua', 'Central Java', 'Mexico City', 'Rano Kao',
       'Easter Island', 'Doccia', 'Kerala', 'Tanba', 'Inukjuaq', 'Samos',
       'Arita', 'East Coast', 'Nymphenburg Palace', 'Naukratis', 'Nome',
       'Fulda', 'Worcester', 'Bihar', 'Bristol', 'Carthage', 'Konark',
       'Tokyo-to', 'Yambio', 'Kakiemon Kiln', 'Buddam', 'New Canton',
       'Krishnanagar', 'Capo di Monte', 'Allier', "Côte d'Ivoi

## 6. Clean the Find spot field

The *Find spot* column contains excavation or acquisition information in a variety of formats:

- entries prefixed with `Excavated/Findspot:`
- entries prefixed with `Found/Acquired:`
- `Excavated/Findspot:` entries with long descriptions in brackets
- multiple locations separated by semicolons

Before I can use these locations as a fallback for mapping to modern countries, I will need to extract a clean place name for each row.

The aim here is to:
- remove all prefixes before the first colon  
- drop any bracketed commentary or extra notes  
- keep only the first place listed (if there are multiple entries)  
- remove artefacts like question marks and double spaces  

This will produce a new column called **FindSpot_Cleaned**, which can then be mapped to modern countries using the same `PLACE_TO_COUNTRY` dictionary as the Production place field.

After this, I'll be able to combine:
1. Production country (preferred)
2. Find spot country (fallback)
3. Culture-based inference (last resort)

This will give me a consistent and mappable country field for visualisation.
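
The three-level priority can be expressed as a chain of `combine_first` calls, which keep the first non-missing value per row. This is a minimal sketch on invented rows. `CultureCountry` and `FinalCountry` are only built later in the notebook, and `CountrySource` is a hypothetical extra column added here for traceability.

```python
import pandas as pd

# Made-up rows; the first three column names follow the notebook.
toy = pd.DataFrame({
    "ProductionCountry": ["Italy", None, None, None],
    "FindSpotCountry":   ["Italy", "Egypt", None, None],
    "CultureCountry":    ["Italy", "Egypt", "Greece", None],
})

priority = ["ProductionCountry", "FindSpotCountry", "CultureCountry"]

# Take Production first, then fill gaps from Find spot, then from Culture.
toy["FinalCountry"] = (
    toy["ProductionCountry"]
    .combine_first(toy["FindSpotCountry"])
    .combine_first(toy["CultureCountry"])
)

def first_source(row):
    """Name of the first priority column that holds a value, or None."""
    for col in priority:
        if pd.notna(row[col]):
            return col
    return None

# Hypothetical audit column: which field supplied the final country.
toy["CountrySource"] = toy[priority].apply(first_source, axis=1)

# Inferred = a country exists but it did not come from Production place.
toy["ProductionPlace_Inferred"] = (
    toy["FinalCountry"].notna() & toy["ProductionCountry"].isna()
)
print(toy)
```

Recording the source alongside the value keeps the inference auditable: a heatmap can then be filtered to Production-only rows if the fallbacks turn out to be too speculative.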


In [9]:
import re
import pandas as pd

def clean_find_spot(value):
    """Clean the Find spot field to extract a simple place name."""
    if pd.isna(value):
        return None

    # 1. Keep only the first entry if there are multiple separated by semicolons
    first_part = value.split(";")[0].strip()

    # 2. Remove prefix before colon (e.g. 'Excavated/Findspot:')
    if ":" in first_part:
        first_part = first_part.split(":", 1)[1].strip()

    # 3. Remove bracketed notes (e.g. '(historic)', '(probably)')
    first_part = re.sub(r"\(.*?\)", "", first_part).strip()

    # 4. Remove question marks
    first_part = first_part.replace("?", "")

    # 5. Normalise spaces
    first_part = re.sub(r"\s+", " ", first_part).strip()

    return first_part if first_part else None

# Apply the cleaning function
bm_df["FindSpot_Cleaned"] = bm_df["Find spot"].apply(clean_find_spot)

# Preview the results
bm_df[["Find spot", "FindSpot_Cleaned"]].head(15)


Unnamed: 0,Find spot,FindSpot_Cleaned
0,Excavated/Findspot: Sanctuary of Apollo (Naukr...,Sanctuary of Apollo
1,Excavated/Findspot: Athens,Athens
2,Excavated/Findspot: Taranto,Taranto
3,Excavated/Findspot: Parthenon,Parthenon
4,Excavated/Findspot: Tell Nabasha,Tell Nabasha
5,Found/Acquired: Egypt,Egypt
6,Found/Acquired: Egypt,Egypt
7,Found/Acquired: Egypt,Egypt
8,Excavated/Findspot: Tell Nabasha,Tell Nabasha
9,Found/Acquired: Egypt,Egypt


## 7. Map cleaned Find spot locations to modern countries

Now that I have a simplified version of the Find spot location, I can map these values to modern countries using the same `PLACE_TO_COUNTRY` dictionary that I used for the Production place field.

This creates a new column called `FindSpotCountry`.  
This country value will act as a fallback when the Production place is missing or cannot be reliably mapped.


In [10]:
def map_place_to_country(place):
    if place is None:
        return None
    return PLACE_TO_COUNTRY.get(place, None)

# Apply mapping to FindSpot_Cleaned
bm_df["FindSpotCountry"] = bm_df["FindSpot_Cleaned"].apply(map_place_to_country)

# Preview results
bm_df[["FindSpot_Cleaned", "FindSpotCountry"]].head(20)


Unnamed: 0,FindSpot_Cleaned,FindSpotCountry
0,Sanctuary of Apollo,
1,Athens,Greece
2,Taranto,Italy
3,Parthenon,
4,Tell Nabasha,
5,Egypt,Egypt
6,Egypt,Egypt
7,Egypt,Egypt
8,Tell Nabasha,
9,Egypt,Egypt


### 7.1 Identify unmapped Find spot values

Now that the `FindSpot_Cleaned` column has been standardised and passed through the `map_place_to_country()` function, I need to identify which Find spots did **not** map to a modern country.

This helps in two ways:

1. **Expanding the PLACE_TO_COUNTRY dictionary**  
   Many Find spots are well known archaeological sites (for example Tell Nabasha, Saqqara, Naukratis, Delphi).  
   These can be reliably assigned to a modern country.

2. **Deciding which Find spots should remain unmapped**  
   Some entries represent:  
   - very broad regions (for example North America or Asia)  
   - ambiguous entries  
   - excavation context labels that are not actual geographic places (for example Deposit D&E or Eye Temple)

These should remain `None` and will not be used for country-level visualisation.

The next step is to inspect the list of unmapped values and classify them into:
- Sites that can be confidently mapped to a modern country  
- Sites best left unmapped  
- Sites that duplicate information already captured in Production place  
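
One way to make this classification repeatable is a small triage helper backed by hand-maintained lists. The sketch below is illustrative only: the set and dictionary names are hypothetical, their entries are just the examples mentioned above, and the third category (duplicates of Production place) is not handled because it needs row-level context.

```python
# Hypothetical hand-maintained lists; entries are the examples named in the text.
NON_GEOGRAPHIC = {"Deposit D&E", "Eye Temple"}
TOO_BROAD = {"North America", "Asia"}
KNOWN_SITES = {
    "Tell Nabasha": "Egypt",
    "Saqqara": "Egypt",
    "Naukratis": "Egypt",
    "Delphi": "Greece",
}

def triage_find_spot(place):
    """Return (category, country) for one cleaned Find spot value."""
    if place is None:
        return ("missing", None)
    if place in KNOWN_SITES:
        return ("map", KNOWN_SITES[place])
    if place in NON_GEOGRAPHIC or place in TOO_BROAD:
        return ("leave_unmapped", None)
    # Anything not yet classified is queued for manual review.
    return ("needs_review", None)

for value in ["Saqqara", "Eye Temple", "Asia", "Fikellura grave 34", None]:
    print(value, "->", triage_find_spot(value))
```

Keeping an explicit `leave_unmapped` category records that a value was reviewed and deliberately skipped, which is different from a value nobody has looked at yet.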


In [11]:
unmapped_findspots = bm_df[bm_df["FindSpotCountry"].isna()]["FindSpot_Cleaned"].unique()

print("Number of unmapped Find spot values:", len(unmapped_findspots))
unmapped_findspots[:200]


Number of unmapped Find spot values: 661


array(['Sanctuary of Apollo', 'Parthenon', 'Tell Nabasha', 'Tarsus',
       'Saqqara', 'Town', 'Amarna, el-', 'Higham', 'Twywell', None,
       'Hexham', 'Alhambra Palace', 'Samarra', 'Great Mosque',
       'Dar al-Khilafa', 'Sanctuary of Aphrodite', 'Naukratis', 'Coptos',
       'Fikellura grave 34', 'Naples', 'Tanagra',
       'Sanctuary of Artemis Orthia', 'Vienne', 'Sopron', 'Porta Latina',
       'Temple of Athena Polias', 'Thapsus', 'Villa Casali',
       'Oxyrhynchus', 'Isis Tomb', 'Herakleia Lynkestis', 'Arles',
       'Vesuvius, Mount', "Hadrian's Villa", 'Cumae', 'Torre Annunziata',
       'Bari', 'Thebes', 'Via Appia', 'Erechtheion', 'Royal Buttress',
       'Volterra', 'Kamiros', 'Fiddleford', 'Siraf',
       'Papatislures grave 14', 'Bull Wharf', 'Ephesus', 'Toronto',
       'Oualata', 'Orvieto', 'Gela', 'Eye Temple',
       "J S Fry & Sons' Factories", 'Troy', 'Sitka', 'Kyrenia',
       'Cheapside', 'Sanctuary of Artemis', 'Southbroom',
       'Bank of England', 'Fasano',

In [12]:
# Get unique Find spots that are actual strings (exclude None/empty)
unique_findspots = sorted(
    [x for x in unmapped_findspots if isinstance(x, str)]
)

len(unique_findspots), unique_findspots[:50]


(660,
 ['Abu Habba',
  'Abu Simbel',
  'Abydos',
  'Acropolis',
  'Aegean Region',
  'Aegina',
  'Afghanistan',
  'Agrigento',
  'Ain Sakhri',
  'Aix-en-Provence',
  'Akhnur',
  'Akra',
  'Alacahöyük',
  'Alambra',
  'Albano',
  'Alexandretta',
  'Algallarin',
  'Alhambra Palace',
  'Allahabad',
  'Altoetting',
  'Amarna, el-',
  'Amathus',
  'Amelia',
  'Amlash',
  'Amman',
  'Amra, el-',
  'Anaphe',
  'Anchorage',
  'Andhra Pradesh',
  'Andravida',
  'Annecy',
  'Antakya',
  'Antalya',
  'Anthedon',
  'Anyang',
  'Apiaí',
  'Apoon',
  'Aquila',
  'Arezzo',
  'Argos',
  'Arles',
  'Armant',
  'Armento',
  'Arpachiyah',
  'Asasif',
  'Ashdown',
  'Ashwell',
  'Asklepieion',
  'Astana',
  'Aston'])

In [13]:
FINDSPOT_TO_COUNTRY = {
    # Egypt
    "Abydos": "Egypt",
    "Amarna, el-": "Egypt",
    "Armant": "Egypt",
    "Asasif": "Egypt",
    "Cairo": "Egypt",
    "Biban el-Muluk": "Egypt", 
    "Fayum": "Egypt",
    "Asyut": "Egypt",
    "Oxyrhynchus": "Egypt",
    "Coptos": "Egypt",
    "Saqqara": "Egypt",
    "Tell Nabasha": "Egypt",
    "Thebes": "Egypt",
    "Temple of Osiris-Khentimentu": "Egypt",

    # Greece and Greek world
    "Acropolis": "Greece",
    "Aegina": "Greece",
    "Amathus": "Cyprus",
    "Anaphe": "Greece",
    "Argos": "Greece",
    "Delphi": "Greece",
    "Olympia": "Greece",
    "Melos": "Greece",
    "Samos": "Greece",
    "Thespiae": "Greece",
    "Corfu": "Greece",
    "Paramythia": "Greece",
    "Locri": "Italy",  
    "Tanagra": "Greece",
    "Aegean Region": "Greece",  
    "Cythera": "Greece",
    "Pangaeus, Mount": "Greece",
    "Temple of Demeter": "Greece",  

    # Italy
    "Agrigento": "Italy",  # Agrigento is in Sicily, which is part of modern Italy
    "Albano": "Italy",
    "Amelia": "Italy",
    "Armento": "Italy",
    "Arezzo": "Italy",
    "Naples": "Italy",
    "Cumae": "Italy",
    "Torre Annunziata": "Italy",
    "Bari": "Italy",
    "Brindisi": "Italy",
    "Calvi": "Italy",
    "Pistoia": "Italy",
    "Pompeii": "Italy",
    "Volterra": "Italy",
    "Gela": "Italy",
    "Ruvo": "Italy",
    "Tharros": "Italy",

    # Turkey / Near East
    "Alacahöyük": "Turkey",
    "Antakya": "Turkey",
    "Antalya": "Turkey",
    "Bornova": "Turkey",
    "Carchemish": "Turkey",
    "Kyme": "Turkey",
    "Toprakkale": "Turkey",
    "Tarsus": "Turkey",
    "Troas": "Turkey",
    "Ephesus": "Turkey",
    "Babylon": "Iraq",
    "Abu Habba": "Iraq",
    "Tell Al-'Ubaid": "Iraq",
    "Samarra": "Iraq",
    "Dar al-Khilafa": "Iraq",

    # South Asia
    "Andhra Pradesh": "India",
    "Allahabad": "India",
    "Jaipur": "India",
    "Delhi": "India",
    "Konark": "India",
    "Nilgiri Hills": "India",
    "Punjab": "India",
    "Harappa": "Pakistan",
    "Mohenjo-daro": "Pakistan",
    "Rawalpindi": "Pakistan",
    "Peshawar District": "Pakistan",
    "Bambhra-ka-thul": "Pakistan",
    "Takht-i Kuwad": "Afghanistan", 
    "Dabar Kot": "Pakistan",
    "Kausambi": "India",

    # East Asia
    "Anyang": "China",
    "Luoyang": "China",
    "Changsha": "China",
    "Inner Mongolia": "China",
    "Astana": "Kazakhstan",

    # Americas
    "Anchorage": "United States",
    "Juneau": "United States",
    "Sitka": "United States",
    "Phoenix": "United States",
    "Guatemala": "Guatemala",
    "Palenque": "Mexico",
    "La Huasteca": "Mexico",
    "Easter Island": "Chile",
    "Orongo": "Chile",
    "Rano Kao": "Chile",
    "Vancouver Island": "Canada",
    "Toronto": "Canada",
    "Québec": "Canada",

    # Oceania
    "New Zealand": "New Zealand",

    # Europe (other)
    "Aix-en-Provence": "France",
    "Annecy": "France",
    "Arles": "France",
    "Saint-Didier": "France",
    "Mâcon": "France",
    "Sopron": "Hungary",
    "Colchester": "United Kingdom",
    "Winchester": "United Kingdom",
    "Hounslow": "United Kingdom",
    "Ashdown": "United Kingdom",
    "Aston": "United Kingdom",
}


In [14]:
def map_findspot_to_country(place):
    if place is None:
        return None
    # Try the dedicated Find spot dictionary first
    if place in FINDSPOT_TO_COUNTRY:
        return FINDSPOT_TO_COUNTRY[place]
    # Fall back to the production place dictionary if shared names exist
    return PLACE_TO_COUNTRY.get(place, None)

bm_df["FindSpotCountry"] = bm_df["FindSpot_Cleaned"].apply(map_findspot_to_country)

unmapped_findspots = bm_df[bm_df["FindSpotCountry"].isna()]["FindSpot_Cleaned"].unique()
len(unmapped_findspots)


553

## 8. Culture-based fallback mapping

So far I've created two country level fields:

- `ProductionCountry` from the cleaned **Production place** field  
- `FindSpotCountry` from the cleaned **Find spot** field  

Some objects still have no mapped country because both of these sources are either missing or too ambiguous. As a final fallback, I'll use the **Culture** field to infer a likely modern country *only when* the other two have failed.

The key ideas are:

- Culture is used as a **regional proxy**, not as proof of the exact place of manufacture  
- Only cultures with a clear, well recognised geographic association are mapped (for example Egyptian dynasties, Greek periods, Roman, specific Chinese dynasties)  
- Broad or cross regional labels (for example *Bronze Age*, *Neolithic*, *Medieval*, *Islamic*) are left as `None` and not forced into a country

The Culture mapping therefore supports the visualisation by filling some remaining gaps, while avoiding over-interpretation of the evidence.


### 8.1 Inspect unmapped Culture values

Before building the mapping, I briefly inspected the unique values in the `Culture` field for objects that still had no country from Production place or Find spot. This was only an exploratory step, to understand:

- how many different Culture labels were in use  
- which ones had an obvious geographic focus  
- which ones were broad chronological or archaeological terms that should *not* be mapped

This inspection confirmed that a conservative approach was needed. Many entries describe periods or stylistic phases rather than specific regions, so only a subset of Culture values are suitable for mapping to modern countries.


In [15]:
# Look only at objects that still have no country from Production or Find spot
unmapped_mask = bm_df["ProductionCountry"].isna() & bm_df["FindSpotCountry"].isna()

unmapped_cultures = sorted(
    bm_df.loc[unmapped_mask, "Culture"].dropna().unique()
)

len(unmapped_cultures), unmapped_cultures[:50]


(182,
 ['11th Dynasty',
  '12th Dynasty',
  '17th Dynasty',
  '18th Dynasty',
  '18th Dynasty; 19th Dynasty',
  '18th Dynasty; Meroitic',
  '19th Dynasty',
  '20th Dynasty',
  '20th Dynasty; Late Period',
  '21st Dynasty',
  '22nd Dynasty',
  '25th Dynasty',
  '26th Dynasty',
  '30th Dynasty',
  '4th Dynasty',
  '5th Dynasty',
  '6th Dynasty',
  'Abbasid dynasty',
  'Achaemenid',
  'Ammonite',
  'Ancient Egypt',
  'Anglo-Saxon',
  'Apulian (Greek)',
  'Archaic Greek',
  'Archaic Greek; Classical Greek',
  'Archaic Greek; East Greek',
  'Archaic Period (Etruscan)',
  'Assyrian',
  'Aztec',
  'Aztec; Toltec',
  'Badarian',
  'Bronze Age',
  'Buddhist',
  'Bunsei Era',
  'Celtic',
  'Chalcolithic',
  'Chiriquí',
  'Chola',
  'Christian',
  'Classic Maya',
  'Classic Maya; Olmec',
  'Classic Veracruz',
  'Classical Greek',
  'Classical Greek; Archaic Greek',
  'Classical Greek; Cypro-Classical I',
  'Classical World',
  'Cypriot',
  'Cypro-Archaic',
  'Cypro-Archaic I',
  'Cypro-Archaic I;

### 8.2 Define and apply a conservative Culture → Country mapping

The Culture field contains a mixture of cultural labels, dynasties, time periods, and broad archaeological terms. Not all of these can or should be linked to a single modern country.

To avoid over-interpreting the data, I'll use a **conservative mapping** strategy:

- Only cultures with a clear and widely recognised geographic association are mapped (for example *Archaic Greek*, *Classical Greek*, *Hellenistic* → Greece; Egyptian dynasties and periods → Egypt; *Roman* → Italy; specific Chinese dynasties → China; *Anglo-Saxon* → United Kingdom)  
- Broad or cross regional terms (for example *Bronze Age*, *Neolithic*, *Medieval*, *Islamic*) are left unmapped as `None`  
- Mixed labels are only mapped where there is a strong, defensible association with one modern country, otherwise they also remain unmapped

These rules are encoded in the `CULTURE_TO_COUNTRY` dictionary and applied to create a `CultureCountry` column. `CultureCountry` is then used as a **third-level fallback** after `ProductionCountry` and `FindSpotCountry` when constructing the final `FinalCountry` field.
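
To illustrate the rule for mixed labels, one stricter option is to split a label on `;` and accept a country only when every part maps to the same one. This is a hypothetical sketch and not the method used in this notebook, which lists mixed labels explicitly in `CULTURE_TO_COUNTRY`. The names `SINGLE_LABELS` and `map_mixed_label` are made up for the example, and the small dictionary is for demonstration only.

```python
# Hypothetical single-label lookup: a small subset for demonstration only.
SINGLE_LABELS = {
    "Archaic Greek": "Greece",
    "Classical Greek": "Greece",
    "Hellenistic": "Greece",
    "Roman": "Italy",
    "Ptolemaic": "Egypt",
}


def map_mixed_label(culture):
    """Return a country only if every ';'-separated part maps to the same one."""
    if not isinstance(culture, str):
        return None  # covers None and NaN
    parts = [part.strip() for part in culture.split(";") if part.strip()]
    countries = {SINGLE_LABELS.get(part) for part in parts}
    # Exactly one distinct, non-missing country means all parts agree.
    if len(countries) == 1 and None not in countries:
        return countries.pop()
    return None


examples = ["Archaic Greek; Classical Greek", "Roman; Hellenistic", "Bronze Age"]
results = {label: map_mixed_label(label) for label in examples}
print(results)
```

Under this rule `Roman; Hellenistic` would stay unmapped because its parts disagree, so the result can differ from an explicit dictionary entry for the same label. The two approaches are not interchangeable.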


In [16]:
# Conservative Culture → Country mapping.
# Only include cultures where the link to a modern country is very clear.
CULTURE_TO_COUNTRY = {
    # --- Greek world (mapped to modern Greece) ---
    "Greek": "Greece",
    "Archaic Greek": "Greece",
    "Archaic Greek; Classical Greek": "Greece",
    "Classical Greek": "Greece",
    "Classical Greek; Archaic Greek": "Greece",
    "Classical Greek; Hellenistic": "Greece",
    "Classical Greek; Western Greek": "Greece",
    "Geometric Greek": "Greece",
    "Early Geometric": "Greece",
    "Hellenistic": "Greece",
    "Hellenistic; Classical Greek": "Greece",
    "Hellenistic; Western Greek": "Greece",
    "Western Greek": "Greece",
    "Attic": "Greece",
    "Attic; Hellenistic": "Greece",
    "Boeotian": "Greece",
    "Boeotian; Hellenistic": "Greece",
    "Corinthian": "Greece",
    "Ionian": "Greece",
    "Mycenaean": "Greece",
    "Late Helladic III": "Greece",
    "Late Helladic IIIA2": "Greece",
    "Late Helladic IIIA2; Late Helladic IIIB1": "Greece",
    "Late Helladic IIIA; Late Helladic IIIB": "Greece",
    "Late Helladic IIIB": "Greece",
    "Orientalising Period": "Greece", 
    "Rhodian": "Greece",


    # --- Magna Graecia / southern Italy (Greek colonies in Italy) ---
    "Apulian (Greek)": "Italy",
    "Apulian (Greek); Hellenistic": "Italy",
    "Campanian": "Italy",
    "Paestan": "Italy",
    "Sicilian": "Italy",
 
    # --- Cypriot traditions (mapped to modern Cyprus) ---
    "Cypriot": "Cyprus",
    "Cypro-Archaic": "Cyprus",
    "Cypro-Archaic I": "Cyprus",
    "Cypro-Archaic II": "Cyprus",
    "Cypro-Archaic I; Cypro-Archaic II": "Cyprus",
    "Cypro-Archaic II; Archaic Greek": "Cyprus",
    "Cypro-Archaic II; Cypro-Classical I": "Cyprus",
    "Cypro-Classical": "Cyprus",
    "Cypro-Classical I": "Cyprus",
    "Cypro-Classical II": "Cyprus",
    "Cypro-Classical I; Hellenistic": "Cyprus",
    "Cypro-Classical II; Hellenistic": "Cyprus",
    "Cypro-Classical; Cypro-Archaic II": "Cyprus",
    "Cypro-Geometric": "Cyprus",
    "Late Cypriot II": "Cyprus",
    "Late Cypriot II; Late Cypriot III": "Cyprus",
    "Late Cypriot III": "Cyprus",
    "Middle Cypriot": "Cyprus",
    "Early Cypriot III; Middle Cypriot I": "Cyprus",

    # --- Egyptian dynasties and periods (mapped to modern Egypt) ---
    "Ancient Egypt": "Egypt",
    "Badarian": "Egypt",
    "Naqada I": "Egypt",
    "Naqada II": "Egypt",
    "Naqada II; Predynastic": "Egypt",
    "Naqada III": "Egypt",
    "Predynastic": "Egypt",
    "Late Predynastic; 1st Dynasty": "Egypt",
    "Early Dynastic (Egypt)": "Egypt",
    "Early Dynastic II; Early Dynastic III": "Egypt",
    "Early Dynastic III": "Egypt",
    "1st Dynasty": "Egypt",
    "1st Dynasty; 2nd Dynasty": "Egypt",
    "1st Dynasty; 2nd Dynasty; Naqada II": "Egypt",
    "3rd Dynasty": "Egypt",
    "4th Dynasty": "Egypt",
    "4th Dynasty; 3rd Dynasty": "Egypt",
    "5th Dynasty": "Egypt",
    "6th Dynasty": "Egypt",
    "9th Dynasty": "Egypt",
    "11th Dynasty": "Egypt",
    "11th Dynasty; 12th Dynasty": "Egypt",
    "12th Dynasty": "Egypt",
    "17th Dynasty": "Egypt",
    "18th Dynasty": "Egypt",
    "18th Dynasty; 19th Dynasty": "Egypt",
    "18th Dynasty; 22nd Dynasty": "Egypt",
    "19th Dynasty": "Egypt",
    "19th Dynasty; 20th Dynasty": "Egypt",
    "20th Dynasty": "Egypt",
    "20th Dynasty; Late Period": "Egypt",
    "21st Dynasty": "Egypt",
    "22nd Dynasty": "Egypt",
    "25th Dynasty": "Egypt",
    "26th Dynasty": "Egypt",
    "27th Dynasty": "Egypt",
    "30th Dynasty": "Egypt",
    "30th Dynasty; Ptolemaic": "Egypt",
    "Old Kingdom": "Egypt",
    "Middle Kingdom": "Egypt",
    "Middle Kingdom; First Intermediate": "Egypt",
    "First Intermediate": "Egypt",
    "First Intermediate; Middle Kingdom": "Egypt",
    "Second Intermediate": "Egypt",
    "New Kingdom": "Egypt",
    "Ramesside": "Egypt",
    "Third Intermediate": "Egypt",
    "Third Intermediate; Late Period": "Egypt",
    "Third Intermediate; 26th Dynasty": "Egypt",
    "Late Period": "Egypt",
    "Late Period; 22nd Dynasty": "Egypt",
    "Late Period; Ptolemaic": "Egypt",
    "Ptolemaic": "Egypt",
    "Ptolemaic; 26th Dynasty": "Egypt",
    "Ptolemaic; Hellenistic": "Egypt",
    "Ptolemaic; Roman": "Egypt",

    # --- Roman / Italic / Etruscan (mapped to modern Italy) ---
    "Roman": "Italy",
    "Roman Imperial": "Italy",
    "Roman Imperial; Hellenistic": "Italy",
    "Roman Period": "Italy",
    "Roman Republican": "Italy",
    "Roman; Classical Greek; Hellenistic": "Italy",
    "Roman; Hellenistic": "Italy",
    "Gallo-Roman; Roman Imperial": "Italy",
    "Graeco-Roman": "Italy",
    "Italic": "Italy",
    "Etruscan": "Italy",
    "Etruscan; Faliscan": "Italy",
    "Etrusco-Campanian; Archaic Period (Etruscan)": "Italy",
    "Etrusco-Campanian; Classical Period (Etruscan)": "Italy",
    "Etrusco-Latin; Archaic Period (Etruscan); Classical Period (Etruscan)": "Italy",
    "Etrusco-Latin; Hellenistic Period (Etruscan)": "Italy",
    "Umbrian": "Italy",
    "Umbrian; Etruscan": "Italy",
    "Samnite": "Italy",

    # --- Mesopotamian / ancient Near East (mapped mainly to modern Iraq / Iran) ---
    "Assyrian": "Iraq",
    "Neo-Assyrian": "Iraq",
    "Middle Assyrian": "Iraq",
    "Old Babylonian": "Iraq",
    "Old Babylonian; Neo-Assyrian": "Iraq",
    "Old Babylonian; Third Dynasty of Ur": "Iraq",
    "Middle Babylonian": "Iraq",
    "Late Babylonian": "Iraq",
    "Third Dynasty of Ur": "Iraq",
    "Third Dynasty of Ur; Isin-Larsa": "Iraq",
    "Lagash II": "Iraq",
    "Mesopotamian": "Iraq",
    "Ubaid": "Iraq",
    "Halaf": "Iraq",
    "Late Uruk": "Iraq",

    # Elamite and Iranian imperial traditions
    "Middle Elamite": "Iran",
    "Neo-Elamite": "Iran",
    "Achaemenid": "Iran",
    "Parthian": "Iran",
    "Parthian; Sasanian": "Iran",
    "Sasanian": "Iran",
    "Qajar dynasty": "Iran",
    "Islamic; Qajar dynasty": "Iran",

    # --- South Asia (mapped to India / Pakistan / Sri Lanka) ---
    "Gupta": "India",
    "Pala": "India",
    "Mughal dynasty": "India",
    "Paramara": "India",
    "Vijayanagara": "India",
    "Chola": "India",
    "Hoysala": "India",
    "Nayaka": "India",
    "Nolamba Dynasty": "India",
    "Eastern Ganga Dynasty": "India",
    "Kushan": "India", 
    "Indus Valley Civilisation": "Pakistan",  # conservative choice; could be debated
    "Zhob Culture": "Pakistan",
    "Anuradhapura": "Sri Lanka",
    "Buddhist; Anuradhapura": "Sri Lanka",
    "Kandyan; Buddhist": "Sri Lanka",

    # --- Chinese dynasties and periods (mapped to modern China) ---
    "Hongshan culture": "China",
    "Neolithic; Shang dynasty": "China",
    "Shang dynasty": "China",
    "Shang dynasty; Anyang": "China",
    "Shang dynasty; Western Zhou dynasty": "China",
    "Western Zhou dynasty": "China",
    "Zhou dynasty": "China",
    "Eastern Zhou dynasty": "China",
    "Warring States period": "China",
    "Warring States period; Western Han dynasty": "China",
    "Han dynasty": "China",
    "Western Han dynasty": "China",
    "Northern Qi dynasty": "China",
    "Northern Wei dynasty": "China",
    "Six Dynasties": "China",
    "Six Dynasties; Northern Wei dynasty": "China",
    "Six Dynasties; Sui dynasty": "China",
    "Sui dynasty": "China",
    "Sui dynasty; Tang dynasty": "China",
    "Tang dynasty": "China",
    "Tang dynasty; Buddhist": "China",
    "Tang dynasty; Liao dynasty": "China",
    "Tang dynasty; 唐代": "China",
    "Liao dynasty": "China",
    "Song dynasty": "China",
    "Song dynasty; Buddhist": "China",
    "Song dynasty; Jin dynasty": "China",
    "Song dynasty; Yuan dynasty": "China",
    "Jin dynasty": "China",
    "Jin dynasty; Yuan dynasty": "China",
    "Yuan dynasty": "China",
    "Yuan dynasty; Five Dynasties": "China",
    "Ming dynasty": "China",
    "Ming dynasty; Chenghua": "China",
    "Ming dynasty; Chongzhen": "China",
    "Ming dynasty; Hongwu": "China",
    "Ming dynasty; Jiajing": "China",
    "Ming dynasty; Qing dynasty": "China",
    "Ming dynasty; Tianqi": "China",
    "Ming dynasty; Wanli": "China",
    "Ming dynasty; Yongle": "China",
    "Qing dynasty": "China",
    "Qing dynasty; Kangxi": "China",
    "Kangxi; Qing dynasty": "China",
    "Qianlong": "China",
    "Qianlong; Qing dynasty": "China",
    "Qing dynasty; Qianlong": "China",
    "Qing dynasty; Republic": "China",
    "Dali Kingdom": "China",
    "Gaochang kingdom; 高昌王國": "China",

    # --- Japan (mapped to modern Japan) ---
    "Jomon Period": "Japan",
    "Heian Period": "Japan",
    "Kamakura Period": "Japan",
    "Kamakura Period; Muromachi Period": "Japan",
    "Edo Period": "Japan",
    "Edo Period; Meiji Era": "Japan",
    "Edo Period; Reiwa Era": "Japan",
    "Meiji Era": "Japan",
    "Taisho Era; Showa Era": "Japan",
    "Genroku Era": "Japan",
    "Kan'ei Era": "Japan",
    "Bunsei Era": "Japan",

    # --- Korea (mapped to modern South Korea) ---
    "Goryeo Dynasty": "South Korea",
    "Joseon Dynasty": "South Korea",
    "Unified Silla Dynasty": "South Korea",

    # --- Andes / Mesoamerica (mapped to modern Peru / Mexico etc.) ---
    "Inca": "Peru",
    "Moche": "Peru",
    "Aztec": "Mexico",
    "Aztec; Toltec": "Mexico",
    # Classic Maya is cross-border (the Maya area also spans Guatemala, Belize,
    # Honduras and El Salvador); kept as Mexico for simplicity.
    "Classic Maya": "Mexico",
    "Classic Maya; Olmec": "Mexico",
    "Classic Veracruz": "Mexico",
    "Late Classic; Classic Veracruz": "Mexico",
    "Pre-Columbian; Classic Veracruz": "Mexico",
    "Huaxtec": "Mexico",
    "Olmec": "Mexico",
    "Mixtec; Olmec": "Mexico",
    "Chiriquí": "Panama",

    # --- Other clearly locatable cultures ---
    "Anglo-Saxon": "United Kingdom",
    "Early Anglo-Saxon": "United Kingdom",
    "Romano-British": "United Kingdom",
    "Celtic": None,  # spread widely, so left unmapped
    "Germanic; Roman": None,  # too mixed to map conservatively

    # --- Sudan / Nubian traditions ---
    "Napatan": "Sudan",
    "Meroitic": "Sudan",
    "Meroitic; Roman Period": "Sudan",
    "Kushite; Meroitic": "Sudan",
    "Kushite; Napatan": "Sudan",
}


def map_culture_to_country(culture: str) -> str | None:
    """
    Map a Culture value to a modern country using a conservative dictionary.

    - If the culture is not in the CULTURE_TO_COUNTRY dictionary, return None.
    - Broad or ambiguous culture labels are deliberately left unmapped.
    """
    if culture is None:
        return None

    # Look up the culture in the dictionary. If it is not present, return None.
    return CULTURE_TO_COUNTRY.get(culture, None)


# Apply the culture mapping to create / update the CultureCountry column
bm_df["CultureCountry"] = bm_df["Culture"].apply(map_culture_to_country)

# Check how many rows still have no mapped country from ANY source
still_unmapped = bm_df[
    bm_df["ProductionCountry"].isna()
    & bm_df["FindSpotCountry"].isna()
    & bm_df["CultureCountry"].isna()
]

print("Rows with no mapped country from production, find spot or culture:", still_unmapped.shape[0])
print("Total rows:", bm_df.shape[0])


Rows with no mapped country from production, find spot or culture: 546
Total rows: 4458


In [17]:
bm_df[
    bm_df["ProductionCountry"].isna()
    & bm_df["FindSpotCountry"].isna()
    & bm_df["CultureCountry"].notna()
][["Culture", "CultureCountry"]].head(20)


| | Culture | CultureCountry |
|---|---|---|
| 0 | Archaic Greek | Greece |
| 3 | Classical Greek | Greece |
| 21 | Anglo-Saxon | United Kingdom |
| 43 | Classical Greek | Greece |
| 66 | Roman | Italy |
| 68 | Orientalising Period | Greece |
| 69 | Roman | Italy |
| 70 | Roman | Italy |
| 75 | Roman | Italy |
| 76 | Classical Greek | Greece |


## 9. Results of the Culture-based fallback mapping

The Culture mapping is computed for every row, but `CultureCountry` only supplies the final country when both `ProductionCountry` and `FindSpotCountry` are missing. This ensures Culture is used strictly as a last resort.

The initial preview confirms that the dictionary is behaving correctly:

- Greek cultural periods (for example Archaic Greek, Classical Greek, Hellenistic) → **Greece**  
- Roman and Romano-Hellenistic labels → **Italy**  
- Anglo-Saxon → **United Kingdom**  
- Other clearly located cultures in the dictionary (for example Cypriot, Ming dynasty, Classic Maya, Inca) map in the same way

Broad chronological, religious or cross-regional terms (Bronze Age, Medieval, Islamic, Neolithic, Early Christian, etc.) correctly remain unmapped, since their geographic scope is too wide to assign a modern country with confidence.

This conservative approach avoids over-assigning locations and keeps the spatial analysis as accurate and defensible as possible.

Below is the code that re-applies the Culture mapping and counts how many rows still have no country from any of the three sources.


In [18]:
# --- Function to map Culture to modern country ---
def map_culture_to_country(culture):
    """
    Returns a modern country inferred from the Culture field.

    Uses the CULTURE_TO_COUNTRY dictionary built earlier.
    Only maps cultures with clearly bounded geographic identity.
    Returns None for broad or ambiguous cultural labels.
    """
    if culture is None:
        return None
    return CULTURE_TO_COUNTRY.get(culture, None)


# Apply the Culture mapping
bm_df["CultureCountry"] = bm_df["Culture"].apply(map_culture_to_country)


# --- Diagnostic check: How many rows still lack any mapped country? ---
still_unmapped_after_culture = bm_df[
    bm_df["ProductionCountry"].isna()
    & bm_df["FindSpotCountry"].isna()
    & bm_df["CultureCountry"].isna()
]

print("Rows still unmapped after Culture pass:", still_unmapped_after_culture.shape[0])


# Optional: preview some Culture mappings
bm_df[["Culture", "CultureCountry"]].dropna().head(20)


Rows still unmapped after Culture pass: 546


| | Culture | CultureCountry |
|---|---|---|
| 0 | Archaic Greek | Greece |
| 1 | Classical Greek | Greece |
| 2 | Apulian (Greek) | Italy |
| 3 | Classical Greek | Greece |
| 4 | 26th Dynasty | Egypt |
| 5 | Third Intermediate; Late Period | Egypt |
| 6 | Late Period | Egypt |
| 7 | Third Intermediate; 26th Dynasty | Egypt |
| 8 | 26th Dynasty | Egypt |
| 9 | Ancient Egypt | Egypt |
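
To see which Culture labels are left unmapped and how often each occurs, a frequency count is a quick check. The sketch below uses a toy DataFrame (`toy_df`, with made-up values) rather than `bm_df`, so the counts it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for bm_df; the values are illustrative, not taken from the dataset.
toy_df = pd.DataFrame({
    "Culture": ["Roman", "Bronze Age", "Bronze Age", "Islamic", None],
    "CultureCountry": ["Italy", None, None, None, None],
})

# Rows that have a Culture label but no mapped country.
unmapped = toy_df[toy_df["Culture"].notna() & toy_df["CultureCountry"].isna()]

# Most frequent unmapped labels first.
unmapped_counts = unmapped["Culture"].value_counts()
print(unmapped_counts)
```

Run against the real data, a list like this would show whether any frequent, clearly locatable label is still missing from the dictionary.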


## 10. Combine Production, Find spot and Culture into a final country field

At this point I have three separate columns that can supply a modern country:

- `ProductionCountry` – derived from the cleaned **Production place** field  
- `FindSpotCountry` – derived from the cleaned **Find spot** field  
- `CultureCountry` – inferred from the **Culture** field using a conservative mapping

These don't all carry the same weight. For the purposes of visualisation and interpretation, I'm using the following priority order:

1. `ProductionCountry` (strongest evidence: where the object was made)  
2. `FindSpotCountry` (fallback: where the object was found or acquired)  
3. `CultureCountry` (last resort: cultural tradition as a regional proxy)

I'll combine these three into a single `FinalCountry` column using this hierarchy. I'll also produce a small summary showing how many objects were assigned a country from each source, and how many remain unknown.


In [19]:
import pandas as pd

def choose_final_country(row):
    """
    Choose a single country value for each object using a priority order:

    1. ProductionCountry  – preferred, reflects place of manufacture.
    2. FindSpotCountry    – fallback when production is missing.
    3. CultureCountry     – last resort when neither of the above is available.

    Returns the first non-null value in that order, or None if all are missing.
    """
    if pd.notna(row["ProductionCountry"]):
        return row["ProductionCountry"]
    if pd.notna(row["FindSpotCountry"]):
        return row["FindSpotCountry"]
    if pd.notna(row["CultureCountry"]):
        return row["CultureCountry"]
    return None


# Create the combined FinalCountry column
bm_df["FinalCountry"] = bm_df.apply(choose_final_country, axis=1)


# --- Simple diagnostic summary of where the country came from ---

# Booleans for each source being used
from_production = bm_df["ProductionCountry"].notna()
from_findspot = bm_df["ProductionCountry"].isna() & bm_df["FindSpotCountry"].notna()
from_culture = (
    bm_df["ProductionCountry"].isna()
    & bm_df["FindSpotCountry"].isna()
    & bm_df["CultureCountry"].notna()
)
unknown = bm_df["FinalCountry"].isna()

summary_counts = {
    "Total rows": int(len(bm_df)),
    "Using ProductionCountry": int(from_production.sum()),
    "Using FindSpotCountry": int(from_findspot.sum()),
    "Using CultureCountry": int(from_culture.sum()),
    "Still unknown": int(unknown.sum()),
}

summary_counts


{'Total rows': 4458,
 'Using ProductionCountry': 1448,
 'Using FindSpotCountry': 1124,
 'Using CultureCountry': 1340,
 'Still unknown': 546}
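
The same priority order can also be expressed without a row-wise `apply`. The sketch below is an alternative under assumptions, not the code used above: it runs on a toy DataFrame (`toy_df`, made-up values), and `FinalCountrySource` is a hypothetical helper column that records which field supplied each value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bm_df; values are illustrative only.
toy_df = pd.DataFrame({
    "ProductionCountry": ["Italy", None, None, None],
    "FindSpotCountry": ["Italy", "Egypt", None, None],
    "CultureCountry": ["Italy", "Egypt", "Greece", None],
})

# combine_first keeps the caller's value and fills its gaps from the argument,
# so chaining it reproduces the Production > Find spot > Culture priority.
toy_df["FinalCountry"] = (
    toy_df["ProductionCountry"]
    .combine_first(toy_df["FindSpotCountry"])
    .combine_first(toy_df["CultureCountry"])
)

# Record which column supplied the value (hypothetical helper column).
# np.select uses the first condition that is True for each row.
conditions = [
    toy_df["ProductionCountry"].notna(),
    toy_df["FindSpotCountry"].notna(),
    toy_df["CultureCountry"].notna(),
]
labels = ["Production place", "Find spot", "Culture"]
toy_df["FinalCountrySource"] = np.select(conditions, labels, default="Unknown")

print(toy_df)
```

Keeping the source next to the country makes it easy to filter out culture-inferred rows later, for example when a visualisation should only show directly recorded locations.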

## 11. Summary of country assignment sources

Using the three-step hierarchy, each object is assigned a modern country from the first available source in this order:

1. `ProductionCountry`
2. `FindSpotCountry`
3. `CultureCountry`

The distribution across sources is:

- **ProductionCountry:** 1448 objects  
- **FindSpotCountry:** 1124 objects  
- **CultureCountry:** 1340 objects  
- **Unknown:** 546 objects  

This means that around 88% of records (3912 of 4458) have been assigned a modern country for the purposes of visualisation, with the remaining 12% (546) left as Unknown. Given the variability and ambiguity of the original data, I think this is a reasonable and defensible level of coverage without over-interpreting the evidence.
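
The percentages can be reproduced from the counts reported in the summary. This is a small arithmetic check; the dictionary simply repeats the numbers shown in the output, and the variable names `assigned`, `coverage_pct` and `unknown_pct` are my own.

```python
# Counts copied from the summary output.
summary_counts = {
    "Total rows": 4458,
    "Using ProductionCountry": 1448,
    "Using FindSpotCountry": 1124,
    "Using CultureCountry": 1340,
    "Still unknown": 546,
}

total = summary_counts["Total rows"]
assigned = total - summary_counts["Still unknown"]

# The three sources should account for every assigned row.
source_keys = ("Using ProductionCountry", "Using FindSpotCountry", "Using CultureCountry")
assert assigned == sum(summary_counts[key] for key in source_keys)

coverage_pct = round(100 * assigned / total, 1)
unknown_pct = round(100 * summary_counts["Still unknown"] / total, 1)

print(f"Assigned: {assigned} of {total} ({coverage_pct}%)")
print(f"Unknown: {summary_counts['Still unknown']} of {total} ({unknown_pct}%)")
```

This gives 87.8% assigned and 12.2% unknown, which matches the rounded figures quoted in the text.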


In [20]:
# Quick preview of the final structure
bm_df[[
    "Culture", "Production place", "Find spot",
    "ProductionPlace_Cleaned", "FindSpot_Cleaned",
    "ProductionCountry", "FindSpotCountry", "CultureCountry",
    "FinalCountry"
]].head(100)


| | Culture | Production place | Find spot | ProductionPlace_Cleaned | FindSpot_Cleaned | ProductionCountry | FindSpotCountry | CultureCountry | FinalCountry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Archaic Greek | Made in: South Ionia (historic) | Excavated/Findspot: Sanctuary of Apollo (Naukr... | South Ionia | Sanctuary of Apollo | | | Greece | Greece |
| 1 | Classical Greek | | Excavated/Findspot: Athens | | Athens | | Greece | Greece | Greece |
| 2 | Apulian (Greek) | Made in: Taranto | Excavated/Findspot: Taranto | Taranto | Taranto | Italy | Italy | Italy | Italy |
| 3 | Classical Greek | | Excavated/Findspot: Parthenon | | Parthenon | | | Greece | Greece |
| 4 | 26th Dynasty | | Excavated/Findspot: Tell Nabasha | | Tell Nabasha | | Egypt | Egypt | Egypt |
| 5 | Third Intermediate; Late Period | | Found/Acquired: Egypt | | Egypt | | Egypt | Egypt | Egypt |
| 6 | Late Period | | Found/Acquired: Egypt | | Egypt | | Egypt | Egypt | Egypt |
| 7 | Third Intermediate; 26th Dynasty | | Found/Acquired: Egypt | | Egypt | | Egypt | Egypt | Egypt |
| 8 | 26th Dynasty | | Excavated/Findspot: Tell Nabasha | | Tell Nabasha | | Egypt | Egypt | Egypt |
| 9 | Ancient Egypt | | Found/Acquired: Egypt | | Egypt | | Egypt | Egypt | Egypt |


In [21]:
# Finally, we'll export the fully cleaned dataset with all the helper columns

# Choosing the output path 
output_path = "../data/bm_cleaned_with_countries.csv"

# Export the full dataframe
bm_df.to_csv(output_path, index=False, encoding="utf-8")

output_path


'../data/bm_cleaned_with_countries.csv'
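
After exporting, a quick round trip can confirm that the file reloads with the same shape and that unknown countries come back as missing values. The sketch below is illustrative only: it writes a toy DataFrame (`toy_df`, made-up values) to a temporary folder instead of touching the real output file.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the exported dataframe; values are illustrative only.
toy_df = pd.DataFrame({
    "Culture": ["Roman", "Bronze Age"],
    "FinalCountry": ["Italy", None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_export.csv")
    toy_df.to_csv(toy_path, index=False, encoding="utf-8")
    # Read the file back before the temporary folder is removed.
    reloaded = pd.read_csv(toy_path, encoding="utf-8")

same_shape = reloaded.shape == toy_df.shape
unknown_after_reload = int(reloaded["FinalCountry"].isna().sum())

print("Same shape after reload:", same_shape)
print("Unknown countries after reload:", unknown_after_reload)
```

Missing values are written as empty fields and read back as NaN, so the count of unknown rows should match the figure before export.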