# Victoria & Albert Museum – Production Place Cleaning (Diana)

This notebook cleans the **production place** information in the V&A sculpture dataset.

The aim is to turn inconsistent free text in `_primaryPlace` into:

- clean place names
- modern country names
- a simple flag showing whether a country was successfully inferred.

Final key columns:

- `ProductionPlace_Raw` – original `_primaryPlace` value
- `ProductionPlace_Cleaned` – cleaned place name
- `ProductionCountry` – mapped modern country (from cleaned place)
- `ProductionPlace_Inferred` – True if a country was successfully inferred
- `CountryMapped` – helper flag, True where `ProductionCountry` is non-null (identical to `ProductionPlace_Inferred` in this notebook)
- `FinalCountry` – same as `ProductionCountry` for this dataset (no fallback fields)


In [12]:
import pandas as pd
import re

# Path assuming the notebook is in /notebooks and data is in /data
va_file_path = "../data/victoria_albert_museum_dataset_raw.csv"

va_df = pd.read_csv(va_file_path)

print("Rows, columns:", va_df.shape)
print("\nColumn names:")
print(va_df.columns.tolist())

va_df.head()

Rows, columns: (3826, 16)

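Column names:
['accessionNumber', 'accessionYear', 'systemNumber', 'objectType', '_primaryTitle', '_primaryPlace', '_primaryMaker__name', '_primaryMaker__association', '_primaryDate', '_primaryImageId', '_sampleMaterial', '_sampleTechnique', '_sampleStyle', '_currentLocation__displayName', '_objectContentWarning', '_imageContentWarning']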

Unnamed: 0,accessionNumber,accessionYear,systemNumber,objectType,_primaryTitle,_primaryPlace,_primaryMaker__name,_primaryMaker__association,_primaryDate,_primaryImageId,_sampleMaterial,_sampleTechnique,_sampleStyle,_currentLocation__displayName,_objectContentWarning,_imageContentWarning
0,LOAN:MAKOWER.24:1-2016,2016.0,O1407602,Sculpture,Sculpture,London,Nan Nan Liu,designer and maker,2015,2024NX9357,silver,forging,,"Silver, Room 68",False,False
1,C.205&A-1984,1984.0,O250154,Sculpture,,Great Britain,"Best-Devereux, Tatiana",makers,1983-1984,,,,20TH GB,"Glass, Room 131",False,False
2,M.8-2024,2024.0,O1769534,Sculpture,Chockstones,Eryri National Park,Rauni Higson,designed and made,2021,,sterling silver,casting,,"Silver, Room 66",False,False
3,IS.174-1991,1991.0,O24959,Sculpture,Sculpture,Cambodia,Unknown,,12th century,2006AU3440,sandstone,carving,Angkorian,"The Himalayas and South-East Asia, Room 47a",False,False
4,IS.140-1961,1961.0,O82516,Sculpture,sculpture,"Dvaravati kingdom, Thailand",Unknown,,7th century to 8th century,2006BF2729,sandstone,Sculpture,Dvaravati,"The Himalayas and South-East Asia, Room 47a",False,False


## 1. Identify production place column

In the V&A dataset, the production place field is stored in the
`_primaryPlace` column.

I’ll copy this into a dedicated `ProductionPlace_Raw` column so I can
clean it while keeping the original value for auditability.


In [13]:
# Explicitly define the production place column to avoid NameError
production_col = "_primaryPlace"

# Copy to a raw column for cleaning
va_df["ProductionPlace_Raw"] = va_df[production_col]

va_df[["ProductionPlace_Raw"]].head(10)


Unnamed: 0,ProductionPlace_Raw
0,London
1,Great Britain
2,Eryri National Park
3,Cambodia
4,"Dvaravati kingdom, Thailand"
5,Seto
6,Ulm
7,Italy
8,Japan
9,Kyoto


In [14]:
print("Unique production places:", va_df["ProductionPlace_Raw"].nunique())

print("\nTop 30 raw production places:")
va_df["ProductionPlace_Raw"].value_counts().head(30)


Unique production places: 492

Top 30 raw production places:


ProductionPlace_Raw
London                                        409
England                                       312
Florence                                      207
France                                        193
Italy                                         182
Paris                                         164
Germany                                       163
Rome                                           92
China                                          65
Venice                                         56
Padua                                          45
Florence (city)                                44
Pakistan                                       41
Nuremberg                                      38
Netherlands                                    37
Great Britain                                  37
Milan                                          31
Spain                                          30
Lombardy                                       30
Peshawar                      

## 2. Clean `ProductionPlace_Raw` to `ProductionPlace_Cleaned`

Cleaning rules:

1. If multiple places are separated by semicolons, keep the **first**.
2. Remove any prefix before a colon (e.g. `"Made in: London"` → `"London"`).
3. Strip out bracketed notes like `"(city)"`, `"(probably)"`, `"(historic)"`.
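4. Remove question marks and every character other than ASCII letters, digits, spaces, commas, apostrophes and hyphens, then normalise spaces. Accented letters fall under this rule too, so `Würzburg` becomes `W rzburg`.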

This gives a simpler place name that we can then map to a modern country.


In [15]:
def clean_production_place(value):
    """Clean V&A _primaryPlace into a simple place name."""
    if pd.isna(value):
        return None

    text = str(value).strip()

    # 1. Only keep first item if semicolon-separated
    text = text.split(";")[0].strip()

    # 2. Remove prefix before colon, if there is one
    if ":" in text:
        text = text.split(":", 1)[1].strip()

    # 3. Remove anything in parentheses
    text = re.sub(r"\(.*?\)", "", text).strip()

    # 4. Remove question marks and odd characters, normalise spaces
    text = text.replace("?", "")
    # Keep ASCII letters, digits, spaces, commas, apostrophes and hyphens;
    # everything else (including accented letters such as ü or è) becomes a space
    text = re.sub(r"[^A-Za-z0-9 ,'\-]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()

    return text if text else None

# Apply cleaning
va_df["ProductionPlace_Cleaned"] = va_df["ProductionPlace_Raw"].apply(clean_production_place)

va_df[["ProductionPlace_Raw", "ProductionPlace_Cleaned"]].head(20)


Unnamed: 0,ProductionPlace_Raw,ProductionPlace_Cleaned
0,London,London
1,Great Britain,Great Britain
2,Eryri National Park,Eryri National Park
3,Cambodia,Cambodia
4,"Dvaravati kingdom, Thailand","Dvaravati kingdom, Thailand"
5,Seto,Seto
6,Ulm,Ulm
7,Italy,Italy
8,Japan,Japan
9,Kyoto,Kyoto
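
The ASCII-only character filter in step 4 of the function above replaces accented letters with spaces, so a name like `Würzburg` comes out as `W rzburg`. Below is a minimal sketch of a Unicode-aware variant, assuming it is acceptable to keep any Unicode letter or digit. `clean_place_unicode` is a hypothetical name, not part of this notebook. It uses only the standard library, so missing values are represented as `None` rather than NaN.

```python
import re

def clean_place_unicode(value):
    """Hypothetical variant of clean_production_place that keeps accented letters."""
    if value is None:
        return None
    text = str(value).strip()
    # 1. Keep only the first item of a semicolon-separated list
    text = text.split(";")[0].strip()
    # 2. Drop any prefix before the first colon
    if ":" in text:
        text = text.split(":", 1)[1].strip()
    # 3. Remove bracketed notes such as "(city)" or "(probably)"
    text = re.sub(r"\(.*?\)", "", text)
    # 4. For str patterns in Python 3, \w matches Unicode letters and digits.
    #    It also matches the underscore, so that is replaced explicitly.
    text = re.sub(r"[^\w ,'\-]|_", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text or None

# Made-up inputs, chosen to exercise each rule
for raw in ["Würzburg", "Sèvres (probably)", "Made in: London?", "Paris; Lyon"]:
    print(repr(raw), "->", repr(clean_place_unicode(raw)))
```

Swapping this in would change the cleaned values and therefore the counts reported in the later cells, so those cells would need to be re-run.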


In [16]:
non_null_cleaned = va_df["ProductionPlace_Cleaned"].notna().sum()
print("Non-null cleaned places:", non_null_cleaned)

print("\nTop 40 cleaned places:")
va_df["ProductionPlace_Cleaned"].value_counts().head(40)


Non-null cleaned places: 3674

Top 40 cleaned places:


ProductionPlace_Cleaned
London                    409
England                   312
Florence                  251
France                    193
Italy                     182
Paris                     164
Germany                   163
Rome                       94
China                      65
Venice                     59
Padua                      49
Pakistan                   41
Nuremberg                  38
Great Britain              37
Netherlands                37
Milan                      35
Lombardy                   30
Spain                      30
Peshawar                   27
Augsburg                   25
Salisbury cathedral        24
Salisbury Cathedral        23
Meissen                    22
Cologne                    21
Xinjiang                   20
Britain                    20
Xinjiang Uygur Zizhiqu     19
Japan                      19
India                      18
Mathura                    18
Brussels                   18
Jingdezhen                 18
Madrid          
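
The counts above list `Salisbury cathedral` and `Salisbury Cathedral` as separate places, because the cleaning step does not touch letter case. Here is a small sketch for spotting such pairs, assuming names that differ only by case refer to the same place. `find_case_variants` is a hypothetical helper and the sample list is made up.

```python
from collections import defaultdict

def find_case_variants(places):
    """Group place names that differ only by letter case."""
    groups = defaultdict(set)
    for place in places:
        if place is not None:
            groups[place.casefold()].add(place)
    # Keep only the groups that contain more than one distinct spelling
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}

sample = ["Salisbury cathedral", "Salisbury Cathedral", "London", "London"]
print(find_case_variants(sample))
```

On the real data this could be called as `find_case_variants(va_df["ProductionPlace_Cleaned"].dropna().unique())`.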

## 3. Map cleaned places to modern countries

Now I map each `ProductionPlace_Cleaned` value to a modern country
using a controlled dictionary `PLACE_TO_COUNTRY`.

The dictionary is aligned with the one used for the British Museum
data, with some additions based on common V&A places
(e.g. Florence, Venice, Padua, Sèvres, Delft, Antwerp, etc.).

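Anything not in the dictionary stays as `None`, so we can review
and extend the dictionary later without hiding uncertainty. The lookup
is an exact, case-sensitive string match, so a cleaned value has to
equal a key character for character to be mapped.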


In [17]:
PLACE_TO_COUNTRY = {
    # United Kingdom / Great Britain
    "London": "United Kingdom",
    "England": "United Kingdom",
    "Great Britain": "United Kingdom",
    "Birmingham": "United Kingdom",
    "Southall": "United Kingdom",
    "Ivychurch": "United Kingdom",
    "Acton": "United Kingdom",
    "Liverpool": "United Kingdom",
    "Manchester": "United Kingdom",
    "Leeds": "United Kingdom",
    "Edinburgh": "United Kingdom",
    "Glasgow": "United Kingdom",
    "Bristol": "United Kingdom",
    "Nottingham": "United Kingdom",
    "Sheffield": "United Kingdom",
    "Leicester": "United Kingdom",

    # Italy
    "Italy": "Italy",
    "Florence": "Italy",
    "Rome": "Italy",
    "Venice": "Italy",
    "Padua": "Italy",
    "Pisa": "Italy",
    "Naples": "Italy",
    "Siena": "Italy",
    "Mantua": "Italy",
    "Genoa": "Italy",
    "Perugia": "Italy",
    "Bologna": "Italy",
    "Turin": "Italy",
    "Milan": "Italy",
    "Verona": "Italy",
    "Aquila": "Italy",
    "Amalfi": "Italy",
    "Apulia": "Italy",

    # France
    "France": "France",
    "Paris": "France",
    "Sèvres": "France",
    "Limoges": "France",
    "Nancy": "France",
    "Lyon": "France",
    "Lorraine": "France",
    "Juvisy": "France",

    # Germany
    "Germany": "Germany",
    "Nuremberg": "Germany",
    "Berlin": "Germany",
    "Dresden": "Germany",
    "Würzburg": "Germany",  # not reached as written: cleaning turns "ü" into a space ("W rzburg")
    "Hildesheim": "Germany",
    "Lower Rhine Germany": "Germany",
    "Aachen": "Germany",

    # Netherlands / Low Countries
    "Netherlands": "Netherlands",
    "Amsterdam": "Netherlands",
    "Delft": "Netherlands",
    "Haarlem": "Netherlands",
    "Limburg Netherlands": "Netherlands",
    "Southern Netherlands": "Netherlands",  # historically closer to present-day Belgium; worth reviewing
    "Almere": "Netherlands",

    # Belgium
    "Belgium": "Belgium",
    "Antwerp": "Belgium",
    "Antwerp City": "Belgium",

    # Spain / Portugal
    "Spain": "Spain",
    "Barcelona": "Spain",
    "Madrid": "Spain",
    "Portugal": "Portugal",
    "Lisbon": "Portugal",

    # USA
    "United States": "United States",
    "United States of America": "United States",
    "New York": "United States",
    "Boston": "United States",
    "Chicago": "United States",

    # Other European
    "Austria": "Austria",
    "Sweden": "Sweden",
    "Russia": "Russia",
    "Moscow": "Russia",
    "St Petersburg": "Russia",
    "Switzerland": "Switzerland",
    "Denmark": "Denmark",
    "Norway": "Norway",
    "Poland": "Poland",
    "Gdansk": "Poland",

    # Asia – India / Pakistan / Afghanistan
    "India": "India",
    "Andhra Pradesh": "India",
    "Tamil Nadu": "India",
    "Karnataka": "India",
    "Uttar Pradesh": "India",
    "Bihar": "India",
    "Kausambi": "India",
    "Goa": "India",
    "Pakistan": "Pakistan",
    "Kashmir": "Pakistan",
    "Swat Valley": "Pakistan",
    "Afghanistan": "Afghanistan",

    # China & East Asia
    "China": "China",
    "Dehua": "China",
    "Jingdezhen": "China",
    "Canton": "China",
    "Beijing": "China",
    "Tibet": "China",   # V&A convention often treats it within PRC
    "Japan": "Japan",
    "Tokyo": "Japan",
    "Kyoto": "Japan",
    "Kanazawa": "Japan",
    "Kanagawa": "Japan",
    "Kasama": "Japan",
    "Java": "Indonesia",

    # Middle East & N. Africa
    "Egypt": "Egypt",
    "Cairo": "Egypt",
    "Jerusalem": "Israel",    # modern state mapping
    "Ar Raqqah": "Syria",
    "Alexandria": "Egypt",

    # Other countries
    "Cambodia": "Cambodia",
    "Thailand": "Thailand",
    "Dvaravati kingdom Thailand": "Thailand",  # not reached as written: the cleaned value keeps its comma
    "Turkey": "Turkey",
    "Eryri National Park": "United Kingdom",  # in Wales
}

def normalise_key(place):
    """Normalise cleaned place names into dictionary keys."""
    if place is None:
        return None
    p = place.strip()
    # Collapse known variants into the keys defined above.
    # Names that are already valid keys need no entry here.
    replacements = {
        "Florence city": "Florence",
        "Antwerp city": "Antwerp City",
    }
    return replacements.get(p, p)

def map_place_to_country(place):
    if place is None:
        return None
    key = normalise_key(place)
    return PLACE_TO_COUNTRY.get(key, None)

va_df["ProductionCountry"] = va_df["ProductionPlace_Cleaned"].apply(map_place_to_country)

va_df[["ProductionPlace_Cleaned", "ProductionCountry"]].head(20)


Unnamed: 0,ProductionPlace_Cleaned,ProductionCountry
0,London,United Kingdom
1,Great Britain,United Kingdom
2,Eryri National Park,United Kingdom
3,Cambodia,Cambodia
4,"Dvaravati kingdom, Thailand",
5,Seto,
6,Ulm,
7,Italy,Italy
8,Japan,Japan
9,Kyoto,Japan
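
Row 4 stays unmapped because the cleaned value `Dvaravati kingdom, Thailand` still contains a comma and so does not equal any dictionary key. One possible fallback, sketched here under the assumption that the broader region tends to come last in comma-separated values, is to try the full string first and then each comma-separated part from right to left. `lookup_with_comma_fallback` and `demo_mapping` are hypothetical names used only for illustration.

```python
def lookup_with_comma_fallback(place, mapping):
    """Try the full cleaned name first, then each comma-separated part, last part first."""
    if place is None:
        return None
    if place in mapping:
        return mapping[place]
    parts = [part.strip() for part in place.split(",")]
    for part in reversed(parts):
        # Skip empty parts left behind by a trailing comma
        if part and part in mapping:
            return mapping[part]
    return None

# Tiny stand-in for PLACE_TO_COUNTRY, just for this illustration
demo_mapping = {"Thailand": "Thailand", "Japan": "Japan"}

for cleaned in ["Dvaravati kingdom, Thailand", "Lopburi, Thailand,", "Seto"]:
    print(cleaned, "->", lookup_with_comma_fallback(cleaned, demo_mapping))
```

A fallback like this trades precision for coverage: a part that happens to match a key for a different place would be mapped silently, so the newly mapped rows would still need a manual review.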


In [18]:
unmapped_places = (
    va_df[va_df["ProductionCountry"].isna()]["ProductionPlace_Cleaned"]
    .dropna()
    .unique()
)

print("Number of unmapped cleaned places:", len(unmapped_places))
unmapped_places[:50]


Number of unmapped cleaned places: 364


array(['Dvaravati kingdom, Thailand', 'Seto', 'Ulm', 'Sahri Bahlol',
       'Eton', 'Gifu', 'Finland', 'Mathura', 'W rzburg', 'Zelezny Brod',
       'Meissen', 'Saint-Gilles-du-Gard', 'Nova Scotia', 'Cumbria',
       'Gandhara', 'Nov Bor', 'Helsinki', 'Cologne', 'Seattle', 'Dublin',
       'Mashiko', 'Lopburi, Thailand', 'Wolverhampton', 'Shanghai',
       'Lopburi, Thailand,', 'Cortona', 'Prague', 'Scotland',
       'Stoke-on-Trent', 'Shan', 'Tours', 'Sari Dheri', 'Roskilde',
       's Hertogenbosch', 'South Korea', 'Burma', 'United Kingdom',
       'Bohemia', 'Los Angeles', 'Thebarton', 'New Jersey', 'Kosta',
       'Caterham', 'Breda', 'Oslo', 'Yixing', 'Chiba', 'Sm land',
       'Pilchuck', 'Takaoka'], dtype=object)
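
The array above lists unmapped places in order of first appearance, not by how many rows they affect. Some of them, such as `Meissen` and `Cologne`, are also among the most frequent cleaned places. Here is a sketch for ranking unmapped places by row count, so the dictionary can be extended where it matters most. `rank_unmapped` is a hypothetical helper, the sample rows are made up, and the membership test ignores `normalise_key` for brevity.

```python
from collections import Counter

def rank_unmapped(cleaned_places, mapping):
    """Count how many rows each unmapped place accounts for, most frequent first."""
    counts = Counter(
        place for place in cleaned_places
        if place is not None and place not in mapping
    )
    return counts.most_common()

# Made-up rows and a one-entry stand-in for PLACE_TO_COUNTRY
demo_mapping = {"London": "United Kingdom"}
demo_rows = ["London", "Meissen", "Cologne", "Meissen", None, "Cologne", "Meissen"]
print(rank_unmapped(demo_rows, demo_mapping))
```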

## 4. Create `ProductionPlace_Inferred` and `FinalCountry`

For the V&A dataset I only have a production place field, so:

- `FinalCountry` is just `ProductionCountry`.
- `ProductionPlace_Inferred` is **True** when a modern country has
  been successfully inferred from the place; **False** otherwise.
- `CountryMapped` is the same True/False helper flag.

If I later add fields such as find spot or culture, I can upgrade this
logic to follow the same 3-step priority as the British Museum notebook.


In [19]:
# FinalCountry (for consistency with BM notebook)
va_df["FinalCountry"] = va_df["ProductionCountry"]

# Helper: True where a country was successfully inferred from the place
va_df["CountryMapped"] = va_df["ProductionCountry"].notna()

# ProductionPlace_Inferred = True if we managed to infer a country
va_df["ProductionPlace_Inferred"] = va_df["CountryMapped"]

# Quick check
va_df[[
    "ProductionPlace_Raw",
    "ProductionPlace_Cleaned",
    "ProductionCountry",
    "FinalCountry",
    "CountryMapped",
    "ProductionPlace_Inferred"
]].head(30)


Unnamed: 0,ProductionPlace_Raw,ProductionPlace_Cleaned,ProductionCountry,FinalCountry,CountryMapped,ProductionPlace_Inferred
0,London,London,United Kingdom,United Kingdom,True,True
1,Great Britain,Great Britain,United Kingdom,United Kingdom,True,True
2,Eryri National Park,Eryri National Park,United Kingdom,United Kingdom,True,True
3,Cambodia,Cambodia,Cambodia,Cambodia,True,True
4,"Dvaravati kingdom, Thailand","Dvaravati kingdom, Thailand",,,False,False
5,Seto,Seto,,,False,False
6,Ulm,Ulm,,,False,False
7,Italy,Italy,Italy,Italy,True,True
8,Japan,Japan,Japan,Japan,True,True
9,Kyoto,Kyoto,Japan,Japan,True,True
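
If fallback fields are added later, `FinalCountry` could take the first available value from an ordered list of candidates. The sketch below assumes the order production place, then find spot, then culture. That order and the `findspot_country` and `culture_country` names are assumptions for illustration, since this dataset has no such fields.

```python
def first_available(*candidates):
    """Return the first candidate that is not None, or None if all are missing."""
    for value in candidates:
        if value is not None:
            return value
    return None

# Hypothetical per-row values; only the production country exists in this notebook
rows = [
    {"production_country": "Italy", "findspot_country": None, "culture_country": None},
    {"production_country": None, "findspot_country": "Pakistan", "culture_country": None},
    {"production_country": None, "findspot_country": None, "culture_country": None},
]

for row in rows:
    final = first_available(
        row["production_country"], row["findspot_country"], row["culture_country"]
    )
    print(final)
```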


In [20]:
summary = {
    "Total rows": len(va_df),
    "With cleaned place": int(va_df["ProductionPlace_Cleaned"].notna().sum()),
    "With mapped country": int(va_df["ProductionCountry"].notna().sum()),
    "Without mapped country": int(va_df["ProductionCountry"].isna().sum()),
}

summary


{'Total rows': 3826,
 'With cleaned place': 3674,
 'With mapped country': 2605,
 'Without mapped country': 1221}
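
The summary figures can be turned into coverage rates and split into the two reasons a row has no country: no usable place text at all, or a place that is not yet in the dictionary. This is a short sketch using only the numbers printed above.

```python
summary = {
    "Total rows": 3826,
    "With cleaned place": 3674,
    "With mapped country": 2605,
    "Without mapped country": 1221,
}

total = summary["Total rows"]
mapped = summary["With mapped country"]
with_place = summary["With cleaned place"]

# Mapped and unmapped rows should add up to the total
assert mapped + summary["Without mapped country"] == total

no_place = total - with_place               # rows with no usable place text
place_but_no_country = with_place - mapped  # candidates for extending the dictionary

print(f"Mapped share of all rows: {mapped / total:.1%}")
print(f"Mapped share of rows with a place: {mapped / with_place:.1%}")
print("Rows without a cleaned place:", no_place)
print("Rows with a place but no country:", place_but_no_country)
```

Of the 1221 rows without a country, 152 have no cleaned place and 1069 have a place that the dictionary does not cover.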

In [21]:
output_path = "../data/va_cleaned_places_with_countries.csv"
va_df.to_csv(output_path, index=False, encoding="utf-8")

output_path


'../data/va_cleaned_places_with_countries.csv'