# Combined sculpture dataset

This notebook creates a standardised sculpture dataset that brings together records from the British Museum and the V&A. The aim is to produce a single, comparable dataset that follows the integration schema agreed by the project team.

The workflow incorporates:
- Cleaned date fields (`StartDate`, `EndDate`, `MidpointDate`)
- Cleaned production place values and mapped modern countries
- Original acquisition information taken from the raw museum files
- Object type, culture, materials and techniques where available

The final output is a single CSV with the following columns:

- RecordID  
- Museum  
- LocalID  
- AcqDate  
- ObjectType  
- ItemDate  
- StartDate  
- EndDate  
- MidpointDate  
- ItemPlace  
- ModernCountry  
- Culture  
- ItemMaterial  
- ItemTechnique  
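
The column list above can also be written down once in code and used to check any table before it is exported. This is only a minimal sketch: `SCHEMA_COLUMNS` and `check_schema` are hypothetical names introduced here for illustration, not part of the project code, and the example frame is toy data.

```python
import pandas as pd

# Target columns, in the order listed above.
SCHEMA_COLUMNS = [
    "RecordID", "Museum", "LocalID", "AcqDate", "ObjectType", "ItemDate",
    "StartDate", "EndDate", "MidpointDate", "ItemPlace", "ModernCountry",
    "Culture", "ItemMaterial", "ItemTechnique",
]


def check_schema(df):
    """Return a list of problems; an empty list means the frame matches the schema."""
    problems = []
    missing = [c for c in SCHEMA_COLUMNS if c not in df.columns]
    extra = [c for c in df.columns if c not in SCHEMA_COLUMNS]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if not missing and not extra and list(df.columns) != SCHEMA_COLUMNS:
        problems.append("columns are present but in a different order")
    return problems


# Toy frame for illustration only: the last schema column is left out on purpose.
example = pd.DataFrame(columns=SCHEMA_COLUMNS[:-1])
print(check_schema(example))
```

Running a check like this just before writing the CSV would catch a misspelt or forgotten column early.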


In [4]:
import pandas as pd
from pathlib import Path

data_dir = Path("..") / "data"

# -------------------------
# British Museum datasets
# -------------------------

# Cleaned BM files
bm_dates_path = data_dir / "bm_cleaned_dates.csv"
bm_places_path = data_dir / "bm_cleaned_with_countries.csv"

# Raw BM dataset
bm_raw_path = data_dir / "british_museum_dataset_raw.csv"

# Load BM cleaned datasets
bm_dates = pd.read_csv(bm_dates_path)
bm_places = pd.read_csv(bm_places_path)

# Load BM raw dataset (for acquisition date and other fields)
bm_raw = pd.read_csv(bm_raw_path)


# -------------------------
# V&A datasets
# -------------------------

# Cleaned V&A files
va_dates_path = data_dir / "va_cleaned_dates.csv"
va_places_path = data_dir / "va_cleaned_places_with_countries.csv"

# Raw V&A dataset
va_raw_path = data_dir / "victoria_albert_museum_dataset_raw.csv"

# Load V&A cleaned datasets
va_dates = pd.read_csv(va_dates_path)
va_places = pd.read_csv(va_places_path)

# Load V&A raw dataset
va_raw = pd.read_csv(va_raw_path)


bm_dates.head(), bm_places.head(), bm_raw.head(), va_dates.head(), va_places.head(), va_raw.head()




(       Museum number                     Production date prod_date_primary  \
 0   No: 1886,0401.45  520BC-490BC (circa); 550BC - 500BC       520bc-490bc   
 1  No: 1816,0610.321                         420BC-400BC       420bc-400bc   
 2    No: 1893,0315.1                       300BC (circa)             300bc   
 3   No: 1843,0531.26                         447BC-432BC       447bc-432bc   
 4        No: EA20577                                 NaN               NaN   
 
    start_year  end_year  midpoint_year date_parse_status          Culture  \
 0      -520.0    -490.0         -505.0   parsed_bc_range    Archaic Greek   
 1      -420.0    -400.0         -410.0   parsed_bc_range  Classical Greek   
 2      -300.0    -300.0         -300.0  parsed_bc_single  Apulian (Greek)   
 3      -447.0    -432.0         -440.0   parsed_bc_range  Classical Greek   
 4      -664.0    -525.0         -594.5      no_date_text     26th Dynasty   
 
                   Production place  \
 0  Made in: So

## Step 1  Understanding the cleaned datasets before combining them

Before merging the datasets, I need to confirm exactly which fields appear in each file for both the British Museum and the V&A.

What we currently have:

- `bm_cleaned_dates.csv`  
  Contains the cleaned British Museum date fields, including:
  - `start_year`
  - `end_year`
  - `midpoint_year`
  - parsed date status columns and related helpers  

- `bm_cleaned_with_countries.csv`  
  Contains the cleaned British Museum production place fields, including:
  - `ProductionPlace_Cleaned`
  - `FinalCountry`
  - and supporting place parsing columns  

- `va_cleaned_dates.csv`  
  Contains the cleaned V&A date fields, including:
  - `start_year`
  - `end_year`
  - `midpoint_year`
  - parsed date status columns and related helpers  

- `va_cleaned_places_with_countries.csv`  
  Contains the cleaned V&A production place fields, including:
  - `ProductionPlace_Cleaned`
  - `FinalCountry`
  - and supporting place parsing columns  

These files also include shared identifying fields and object metadata, such as:

- For the British Museum:
  - `Museum number`
  - `Production date`
  - object level details  

- For the V&A:
  - `systemNumber`
  - `Production date`
  - object level details  

Before attempting any merges, I need to **inspect the columns of each dataset** to check:

1. Which fields are shared between the date and place files for each museum  
2. Which fields are unique to each file  
3. Which columns should be used as merge keys  
   - For the British Museum this is most likely `Museum number`  
   - For the V&A this is most likely `systemNumber`  

This inspection step helps to avoid accidental overwriting and ensures the datasets are combined safely and intentionally.
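
Alongside the column comparison, it can help to check how well a candidate key behaves before relying on it. The sketch below is an illustration under assumptions: `key_report` is a hypothetical helper, and the frame is toy data rather than the museum files.

```python
import pandas as pd


def key_report(df, key):
    """Summarise how suitable `key` is as a merge key for `df`."""
    values = df[key]
    return {
        "rows": len(df),
        "missing": int(values.isna().sum()),
        "unique_values": int(values.nunique()),
        # Rows whose key value occurs more than once in the frame
        "duplicated_rows": int(values.duplicated(keep=False).sum()),
    }


# Toy data for illustration only: one repeated key and one missing key.
toy = pd.DataFrame({"Museum number": ["A1", "A2", "A2", None]})
print(key_report(toy, "Museum number"))
```

A key with no missing values and no duplicated rows in both files can be merged one to one; anything else deserves a closer look first.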

Once the column structure is clear, the next step will be to:

- Merge the British Museum date and place files using `Museum number`  
- Merge the V&A date and place files using `systemNumber`  
- Confirm that all cleaned date and place fields appear in the combined dataframes  
- Then bring in acquisition and other fields from the raw source files

The next code cell will output the columns of all four cleaned datasets so they can be compared.


In [5]:
# -----------------------------------------
# Inspect columns and compare structures
# -----------------------------------------

def compare_columns(df1, df2, name1, name2):
    print(f"{name1} columns ({len(df1.columns)}):")
    print(sorted(df1.columns))
    print("\n---\n")

    print(f"{name2} columns ({len(df2.columns)}):")
    print(sorted(df2.columns))
    print("\n---\n")

    # Column comparisons
    common = sorted(set(df1.columns) & set(df2.columns))
    only_df1 = sorted(set(df1.columns) - set(df2.columns))
    only_df2 = sorted(set(df2.columns) - set(df1.columns))

    print(f"Common columns between {name1} and {name2}:")
    for col in common:
        print(" -", col)

    print(f"\nColumns only in {name1}:")
    for col in only_df1:
        print(" -", col)

    print(f"\nColumns only in {name2}:")
    for col in only_df2:
        print(" -", col)

    print("\n" + "=" * 80 + "\n")


# Compare British Museum files
compare_columns(bm_dates, bm_places, "BM dates", "BM places")

# Compare V&A files
compare_columns(va_dates, va_places, "V&A dates", "V&A places")


BM dates columns (10):
['Culture', 'Find spot', 'Museum number', 'Production date', 'Production place', 'date_parse_status', 'end_year', 'midpoint_year', 'prod_date_primary', 'start_year']

---

BM places columns (56):
['Acq date', 'Acq name (acq)', 'Acq name (excavator)', 'Acq name (finding)', 'Acq name (previous)', 'Acq notes (acq)', 'Acq notes (exc)', 'Add ids', 'Assoc events', 'Assoc name', 'Assoc place', 'Assoc titles', 'Authority', 'BM/Big number', 'Banknote serial number', 'Bib references', 'Cat no', 'Condition', 'CountryMapped', 'Culture', 'CultureCountry', 'Curators Comments', 'Denomination', 'Dept', 'Description', 'Dimensions', 'Escapement', 'Ethnic name (assoc)', 'Ethnic name (made by)', 'Exhibition history', 'FinalCountry', 'Find spot', 'FindSpotCountry', 'FindSpot_Cleaned', 'Image', 'Inscription', 'Joined objects', 'Location', 'Materials', 'Museum number', 'Object type', 'Producer name', 'Production date', 'Production place', 'ProductionCountry', 'ProductionPlace_Cleaned',

## Step 2  Choosing merge keys for each museum

The column comparison confirms that the cleaned date and place files for each museum have complementary structures.

For the British Museum:

- Shared identifier and context fields include:
  - `Museum number`
  - `Production date`
  - `Production place`
  - `Culture`
  - `Find spot`
- Cleaned date fields only appear in `bm_cleaned_dates.csv`:
  - `start_year`, `end_year`, `midpoint_year`
  - `prod_date_primary`
  - `date_parse_status`
- Cleaned place and country fields only appear in `bm_cleaned_with_countries.csv`:
  - `ProductionPlace_Cleaned`
  - `FinalCountry`
  - `FindSpot_Cleaned`
  - `FindSpotCountry`
  - `ProductionCountry`
  - and other object level metadata including `Acq date`, `Object type`, `Materials` and `Technique`

This makes `Museum number` the natural merge key for the British Museum files, although the column comparison alone does not show whether its values are unique.

For the V&A:

- The only shared identifier between the two cleaned files is:
  - `systemNumber`
- Cleaned date fields only appear in `va_cleaned_dates.csv`:
  - `start_year`, `end_year`, `midpoint_year`
  - `prod_date_primary`
  - `date_parse_status`
- Cleaned place and country fields only appear in `va_cleaned_places_with_countries.csv`:
  - `ProductionPlace_Cleaned`
  - `FinalCountry`
  - `ProductionCountry`
  - plus object level fields such as `accessionNumber`, `accessionYear`, `objectType` and the primary title and place fields

This confirms that `systemNumber` is a suitable merge key for the V&A files.

The next step is to create one combined cleaned dataframe for each museum by merging:

- `bm_dates` and `bm_places` on `Museum number`
- `va_dates` and `va_places` on `systemNumber`
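
One way to make a merge fail loudly when a key is not unique is the `validate` argument of `DataFrame.merge`. The sketch below uses toy frames, not the museum data, and only illustrates the behaviour: with `validate="one_to_one"`, pandas raises `pandas.errors.MergeError` if the key repeats on either side.

```python
import pandas as pd

# Toy frames for illustration only; "B2" is repeated in both.
dates = pd.DataFrame({
    "Museum number": ["A1", "B2", "B2"],
    "start_year": [-520, 101, 501],
})
places = pd.DataFrame({
    "Museum number": ["A1", "B2", "B2"],
    "FinalCountry": ["Greece", "Greece", "United Kingdom"],
})

try:
    dates.merge(places, on="Museum number", how="left", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError as err:
    # Raised because the key is not unique in the inputs
    merge_ok = False
    print("Merge key is not unique:", err)
```

Whether to stop on the error or to merge anyway and inspect the result afterwards is a design choice; the row count comparison used in this notebook is the second approach.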


In [7]:

# British Museum: merge on Museum number
bm_combined = bm_dates.merge(
    bm_places,
    on="Museum number",
    how="left",
    suffixes=("_dates", "_places")
)

print("BM rows in bm_dates:", len(bm_dates))
print("BM rows in bm_places:", len(bm_places))
print("BM rows in bm_combined:", len(bm_combined))
print("\n" + "=" * 80 + "\n")

# V&A: merge on systemNumber
va_combined = va_dates.merge(
    va_places,
    on="systemNumber",
    how="left",
    suffixes=("_dates", "_places")
)

print("V&A rows in va_dates:", len(va_dates))
print("V&A rows in va_places:", len(va_places))
print("V&A rows in va_combined:", len(va_combined))

bm_combined.head(), va_combined.head()


BM rows in bm_dates: 4458
BM rows in bm_places: 4458
BM rows in bm_combined: 4522


V&A rows in va_dates: 3826
V&A rows in va_places: 3826
V&A rows in va_combined: 3826


(       Museum number               Production date_dates prod_date_primary  \
 0   No: 1886,0401.45  520BC-490BC (circa); 550BC - 500BC       520bc-490bc   
 1  No: 1816,0610.321                         420BC-400BC       420bc-400bc   
 2    No: 1893,0315.1                       300BC (circa)             300bc   
 3   No: 1843,0531.26                         447BC-432BC       447bc-432bc   
 4        No: EA20577                                 NaN               NaN   
 
    start_year  end_year  midpoint_year date_parse_status    Culture_dates  \
 0      -520.0    -490.0         -505.0   parsed_bc_range    Archaic Greek   
 1      -420.0    -400.0         -410.0   parsed_bc_range  Classical Greek   
 2      -300.0    -300.0         -300.0  parsed_bc_single  Apulian (Greek)   
 3      -447.0    -432.0         -440.0   parsed_bc_range  Classical Greek   
 4      -664.0    -525.0         -594.5      no_date_text     26th Dynasty   
 
             Production place_dates  \
 0  Made in: So

## Step 3  Checking the merge results

The two merges behaved differently.

### V&A merge
The V&A merge using `systemNumber` behaved as expected.

- The number of rows in `va_combined` matches the number of rows in the cleaned date file.
- No unexpected duplication occurred.
- This confirms that `systemNumber` functions reliably as a unique identifier across the cleaned V&A datasets.

### British Museum merge
The British Museum merge using `Museum number` resulted in an **increase in row count** in the combined dataframe.  
This means that the merge produced more rows than either input file, which indicates that at least some identifiers appear multiple times in one or both datasets.

This behaviour can be entirely legitimate in museum collections where:

- a single museum number covers multi part objects  
- fragments or joined objects share a registration number  
- acquisitions contain several items under one number  

However, it can also signal a structural issue such as:

1. A one to many match between the dates and places datasets  
2. A many to many match creating a cartesian expansion  
3. Formatting or whitespace variations in the key that cause rows to match, or fail to match, unexpectedly  
4. Duplicate identifiers that need to be reviewed before integration  

Before moving forward, I need to confirm:

- whether any `Museum number` values appear more frequently in the merged dataset than in either input dataset  
- whether the expansion reflects real multi part objects or accidental duplication  
- whether the merge key (`Museum number`) is functioning as intended  

The next step is to compare the frequency of each Museum number across:

- `bm_cleaned_dates.csv`  
- `bm_cleaned_with_countries.csv`  
- the merged `bm_combined` dataframe  

This will identify which objects expanded during the merge and whether that expansion is expected.
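
To rule out formatting or whitespace variation in the key, the identifiers can be normalised and the number of distinct values compared before and after. This is a sketch under assumptions: `normalise_key` is a hypothetical helper and the series is toy data.

```python
import pandas as pd


def normalise_key(series):
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return (
        series.astype("string")
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
    )


# Toy identifiers for illustration only: the second differs from the first only in spacing.
toy = pd.Series(["No: EA20577", "No:  EA20577 ", "No: 1904,1010.1"])
before = toy.nunique()
after = normalise_key(toy).nunique()
print("distinct keys before:", before, "after:", after)
```

If the two counts are equal on the real files, whitespace is not what is driving the expansion.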


In [8]:
# Check how often each Museum number appears in the BM datasets

dates_counts = bm_dates["Museum number"].value_counts()
places_counts = bm_places["Museum number"].value_counts()
combined_counts = bm_combined["Museum number"].value_counts()

print("BM – Museum numbers duplicated in bm_dates:",
      (dates_counts > 1).sum())
print("BM – Museum numbers duplicated in bm_places:",
      (places_counts > 1).sum())
print("BM – Museum numbers duplicated in bm_combined:",
      (combined_counts > 1).sum())


BM – Museum numbers duplicated in bm_dates: 12
BM – Museum numbers duplicated in bm_places: 12
BM – Museum numbers duplicated in bm_combined: 12


In [9]:
# Sanity check for V&A as well

va_dates_counts = va_dates["systemNumber"].value_counts()
va_places_counts = va_places["systemNumber"].value_counts()
va_combined_counts = va_combined["systemNumber"].value_counts()

print("\nV&A – systemNumbers duplicated in va_dates:",
      (va_dates_counts > 1).sum())
print("V&A – systemNumbers duplicated in va_places:",
      (va_places_counts > 1).sum())
print("V&A – systemNumbers duplicated in va_combined:",
      (va_combined_counts > 1).sum())



V&A – systemNumbers duplicated in va_dates: 0
V&A – systemNumbers duplicated in va_places: 0
V&A – systemNumbers duplicated in va_combined: 0


In [10]:
# Identify Museum numbers whose count increased after the merge

# Combine counts into a single comparison table
bm_counts = (
    pd.DataFrame({
        "dates": dates_counts,
        "places": places_counts,
        "combined": combined_counts,
    })
    .fillna(0)
)

# Maximum count from the two input files
bm_counts["max_input"] = bm_counts[["dates", "places"]].max(axis=1)

# Museum numbers where the merged count is higher than either input
expanded_records = bm_counts[bm_counts["combined"] > bm_counts["max_input"]]

print("Number of Museum numbers with higher counts after merge:",
      len(expanded_records))

expanded_records.head()


Number of Museum numbers with higher counts after merge: 12


Museum number,dates,places,combined,max_input
No: EA20577,7,7,49,7
"No: 1904,1010.1",2,2,4,2
No: EA41672,2,2,4,2
"No: 1960,0411.2",2,2,4,2
"No: 1923,0306.1",2,2,4,2


In [11]:
# Inspect a small selection of expanded Museum numbers
from IPython.display import display

if not expanded_records.empty:
    sample_ids = expanded_records.index.tolist()[:5]  # first few problematic ones
    print("Sample Museum numbers with expanded counts:", sample_ids)

    # A bare expression inside an if-block is not rendered by Jupyter, so display it explicitly
    display(bm_combined[bm_combined["Museum number"].isin(sample_ids)].head(20))
else:
    print("No Museum numbers expanded during the merge.")



Sample Museum numbers with expanded counts: ['No: EA20577', 'No: 1904,1010.1', 'No: EA41672', 'No: 1960,0411.2', 'No: 1923,0306.1']


## Step 4  Understanding the expanded rows and how I am going to fix them

The diagnostic checks showed that the V&A merge behaved normally, but the British Museum merge produced a noticeable increase in row count. A small number of Museum numbers appeared many more times in the merged dataset than in either of the two cleaned input files.

For example, `No: EA20577` appeared:

- 7 times in the dates file  
- 7 times in the places file  
- but 49 times in the merged dataset  

This pattern immediately suggests a structure issue rather than new information. After checking how pandas handles merges and reviewing several explanations online, the cause became clear. If both input tables contain duplicate values for the merge key, pandas does not know which rows belong together, so it creates every possible pairing of the matching rows.

One explanation described it like this:

> “If the key appears multiple times in both tables, the merge creates every possible combination of matching rows.”

For `EA20577`, this means:

- 7 date rows  
- 7 place rows  
- producing **7 × 7 = 49 merged rows**

which is exactly what I saw.
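
The same behaviour can be reproduced on toy data. The frames below are invented for illustration and only share the key value with the real record; the `date_row` and `place_row` columns are placeholders.

```python
import pandas as pd

key = "No: EA20577"

# Seven rows with the same key on each side (toy data)
dates = pd.DataFrame({"Museum number": [key] * 7, "date_row": range(7)})
places = pd.DataFrame({"Museum number": [key] * 7, "place_row": range(7)})

# Every date row is paired with every place row that has the same key
merged = dates.merge(places, on="Museum number", how="left")
print(len(dates), len(places), len(merged))  # 7 7 49
```

The same arithmetic would account for the overall growth from 4458 to 4522 rows if the other eleven duplicated numbers each appear twice in both files: 42 extra rows for `No: EA20577` plus 11 × 2 extra rows for the rest gives 64.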

The important question is whether these expanded rows actually contain different information. If the merged rows are identical across all meaningful fields, then they represent true duplicates and should be removed. If they differ - for example different date interpretations or different inferred places - then those differences should be preserved.

To handle this safely and transparently, the next step is to identify:

1. which expanded rows are genuinely distinct combinations of dates and places  
2. which expanded rows are simply duplicates created by the merge and can be removed  

This allows me to keep legitimate variation while collapsing only the repeated rows that add no information.


In [12]:
# Identify which Museum numbers have expanded during the BM merge

# How many times each Museum number appears in each input and in the merged data
bm_dates_counts = bm_dates["Museum number"].value_counts()
bm_places_counts = bm_places["Museum number"].value_counts()
bm_combined_counts = bm_combined["Museum number"].value_counts()

bm_merge_counts = (
    pd.DataFrame({
        "dates": bm_dates_counts,
        "places": bm_places_counts,
        "combined": bm_combined_counts,
    })
    .fillna(0)
)

# For each Museum number, what is the maximum count seen in either input file?
bm_merge_counts["max_input"] = bm_merge_counts[["dates", "places"]].max(axis=1)

# Expanded IDs = keys where the merged dataset has more rows than either input
expanded_ids = bm_merge_counts[
    bm_merge_counts["combined"] > bm_merge_counts["max_input"]
].index.tolist()

print("Museum numbers with expanded rows after merge:", len(expanded_ids))
print("Examples:", expanded_ids[:10])

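# Quick look at the merged rows for the first expanded Museum number
from IPython.display import display

if expanded_ids:
    example_id = expanded_ids[0]
    print("\nExample expanded Museum number:", example_id)
    # A bare expression inside an if-block is not rendered by Jupyter, so display it explicitly
    display(bm_combined[bm_combined["Museum number"] == example_id].head(20))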


Museum numbers with expanded rows after merge: 12
Examples: ['No: EA20577', 'No: 1904,1010.1', 'No: EA41672', 'No: 1960,0411.2', 'No: 1923,0306.1', 'No: 1948,0414.1', 'No: EA41549', 'No: 1924,0515.3', 'No: EA65372', 'No: WB.261']

Example expanded Museum number: No: EA20577


In [13]:
# Remove only true duplicates from the merged BM dataset

# Columns that define whether two merged rows are meaningfully different
dedupe_cols = [
    "Museum number",
    "Production date_dates",
    "start_year",
    "end_year",
    "midpoint_year",
    "Culture_dates",
    "ProductionPlace_Cleaned",
    "FinalCountry",
    "FindSpot_Cleaned",
    "FindSpotCountry",
]

# Apply de-duplication
bm_combined_deduped = bm_combined.drop_duplicates(subset=dedupe_cols).copy()

print("Rows before de-duplication:", len(bm_combined))
print("Rows after de-duplication:", len(bm_combined_deduped))

# Re-check counts for the expanded Museum numbers
remaining_counts = (
    bm_combined_deduped["Museum number"].value_counts()[expanded_ids]
)

print("\nRemaining rows per expanded Museum number (after removing true duplicates):")
print(remaining_counts)

# Show the remaining rows for every expanded Museum number
print("\nAll rows for expanded Museum numbers:\n")

problem_rows = bm_combined_deduped[
    bm_combined_deduped["Museum number"].isin(expanded_ids)
].copy()

# Sort for easier manual checking against the original CSVs
problem_rows = problem_rows.sort_values(
    by=["Museum number", "start_year", "FindSpot_Cleaned", "ProductionPlace_Cleaned"]
)

cols_to_show = [
    "Museum number",
    "Production date_dates",
    "start_year",
    "end_year",
    "midpoint_year",
    "Culture_dates",
    "FindSpot_Cleaned",
    "FindSpotCountry",
    "ProductionPlace_Cleaned",
    "FinalCountry",
]

problem_rows[cols_to_show]


Rows before de-duplication: 4522
Rows after de-duplication: 4459

Remaining rows per expanded Museum number (after removing true duplicates):
Museum number
No: EA20577        1
No: 1904,1010.1    4
No: EA41672        1
No: 1960,0411.2    4
No: 1923,0306.1    4
No: 1948,0414.1    4
No: EA41549        4
No: 1924,0515.3    1
No: EA65372        1
No: WB.261         1
No: EA2312         4
No: EA41550        1
Name: count, dtype: int64

All rows for expanded Museum numbers:



,Museum number,Production date_dates,start_year,end_year,midpoint_year,Culture_dates,FindSpot_Cleaned,FindSpotCountry,ProductionPlace_Cleaned,FinalCountry
972,"No: 1904,1010.1",2ndC(mid),101.0,200.0,150.0,Roman,Guilden Morden,,,United Kingdom
971,"No: 1904,1010.1",2ndC(mid),101.0,200.0,150.0,Roman,Paramythia,Greece,,Greece
2928,"No: 1904,1010.1",6thC-7thC,501.0,700.0,600.0,Early Anglo-Saxon,Guilden Morden,,,United Kingdom
2927,"No: 1904,1010.1",6thC-7thC,501.0,700.0,600.0,Early Anglo-Saxon,Paramythia,Greece,,Greece
1515,"No: 1923,0306.1",16thC (Vijayanagara period),1501.0,1600.0,1550.0,,Blandford Forum,,,
1514,"No: 1923,0306.1",16thC (Vijayanagara period),1501.0,1600.0,1550.0,,India,India,Deccan,India
4023,"No: 1923,0306.1",,,,,Iron Age,Blandford Forum,,,
4022,"No: 1923,0306.1",,,,,Iron Age,India,India,Deccan,India
2691,"No: 1924,0515.3",2ndC BC-1stC BC,-200.0,-1.0,-100.0,Hellenistic,Trikomo,,Cyprus,Greece
3820,"No: 1948,0414.1",350BC (around),-350.0,-350.0,-350.0,Classical Greek,,,Japan,Japan


## Step 5  Manually resolving ambiguous British Museum records

The merge diagnostics showed that twelve Museum numbers expanded during the merge. After removing true duplicates, six of them collapsed to a single row, while the other six still had four rows each:

- `No: 1904,1010.1`
- `No: 1923,0306.1`
- `No: 1948,0414.1`
- `No: 1960,0411.2`
- `No: EA2312`
- `No: EA41549`

Looking at the merged rows for these IDs, it is clear that each Museum number is being used for two different objects rather than one:

- **1904,1010.1**  
  - one Roman object (2nd century) associated with Paramythia, Greece  
  - one Early Anglo Saxon object (6th–7th century) associated with Guilden Morden, UK  

- **1923,0306.1**  
  - one 16th century (Vijayanagara period) object associated with the Deccan, India  
  - one Iron Age object associated with Blandford Forum  

- **1948,0414.1**  
  - one Japanese figure associated with Japan  
  - one Greek object associated with Greece  

- **1960,0411.2**  
  - one object dated to around 800 AD  
  - one object dated to around 600 BC  

- **EA2312**  
  - one record dated to the Roman Period  
  - one record dated to an earlier Egyptian phase  

- **EA41549**  
  - one New Kingdom record  
  - one 19th Dynasty record  

In each case the four merged rows represent all possible combinations of two date profiles and two place/findspot profiles. Two of those combinations describe coherent objects; the other two are crossed pairings that do not make sense historically.

Because there are only six such Museum numbers, and because the original documentation clearly points to two distinct objects in each case, I resolve them as follows:

- keep **one row for each distinct date profile** (for example, Roman vs Early Anglo Saxon)  
- within each date profile, keep the row that has the **most complete place and country information**  
- drop the remaining rows for that Museum number

This reflects the underlying object histories more accurately and avoids carrying forward impossible date and place combinations. It also illustrates a wider documentation issue: nominally “unique” museum numbers may, in practice, be reused for multiple objects. For this project, assigning our own `RecordID` to each row in the combined dataset helps separate these cases explicitly.


In [14]:
# Resolve the ambiguous BM records (those still on four rows) by keeping one row
# per distinct date profile and choosing the row with the most complete place information.

# 1. Subset the problematic Museum numbers
problem_mask = bm_combined_deduped["Museum number"].isin(expanded_ids)
problem_rows = bm_combined_deduped[problem_mask].copy()

# 2. Define the "date profile" key and the fields that count as "information"
date_key_cols = [
    "Museum number",
    "Production date_dates",
    "start_year",
    "end_year",
    "midpoint_year",
    "Culture_dates",
]

info_cols = [
    "FindSpot_Cleaned",
    "FindSpotCountry",
    "ProductionPlace_Cleaned",
    "FinalCountry",
]

# 3. Compute an information score: how many of the place/country fields are non-null
problem_rows["info_non_null"] = problem_rows[info_cols].notna().sum(axis=1)

# 4. For each Museum number + date profile, keep the row with the highest info score
problem_rows_sorted = problem_rows.sort_values(
    date_key_cols + ["info_non_null"],
    ascending=[True, True, True, True, True, True, False],
)

problem_rows_reduced = (
    problem_rows_sorted
    .drop_duplicates(subset=date_key_cols, keep="first")
    .drop(columns=["info_non_null"])
)

# 5. Combine the reduced problem rows back with the unaffected BM records
bm_combined_clean = pd.concat(
    [
        bm_combined_deduped[~problem_mask],
        problem_rows_reduced,
    ],
    ignore_index=True,
)

print("Rows in bm_combined_deduped:", len(bm_combined_deduped))
print("Rows in bm_combined_clean:", len(bm_combined_clean))

# Check that each Museum number that previously had four rows now has exactly two
final_counts = bm_combined_clean["Museum number"].value_counts()[expanded_ids]
print("\nRows per problematic Museum number after cleaning:")
print(final_counts)

# Quick sanity check for one example
bm_combined_clean[bm_combined_clean["Museum number"] == "No: 1904,1010.1"]


Rows in bm_combined_deduped: 4459
Rows in bm_combined_clean: 4447

Rows per problematic Museum number after cleaning:
Museum number
No: EA20577        1
No: 1904,1010.1    2
No: EA41672        1
No: 1960,0411.2    2
No: 1923,0306.1    2
No: 1948,0414.1    2
No: EA41549        2
No: 1924,0515.3    1
No: EA65372        1
No: WB.261         1
No: EA2312         2
No: EA41550        1
Name: count, dtype: int64


,Museum number,Production date_dates,prod_date_primary,start_year,end_year,midpoint_year,date_parse_status,Culture_dates,Production place_dates,Find spot_dates,...,Joined objects,prod_prefix,prod_after_prefix,ProductionPlace_Cleaned,ProductionCountry,CountryMapped,FindSpot_Cleaned,FindSpotCountry,CultureCountry,FinalCountry
4429,"No: 1904,1010.1",2ndC(mid),2ndc,101.0,200.0,150.0,parsed_century_single,Roman,,Excavated/Findspot: Paramythia (near),...,,,,,,False,Paramythia,Greece,Italy,Greece
4430,"No: 1904,1010.1",6thC-7thC,6thc-7thc,501.0,700.0,600.0,parsed_century_range,Early Anglo-Saxon,,Found/Acquired: Guilden Morden,...,,,,,,False,Paramythia,Greece,Italy,Greece
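
In the output above, the Early Anglo-Saxon row for `No: 1904,1010.1` keeps the Guilden Morden text in `Find spot_dates` but carries Paramythia, Greece in `FindSpot_Cleaned`, so the completeness rule appears to have kept a crossed pairing here. One possible alternative, sketched below on toy data, is to pair repeated keys by their order of appearance instead of choosing rows after the merge. This relies on an assumption that would need checking: that both cleaned files list the rows for a given Museum number in the same order. The `occurrence` column and the key `X1` are hypothetical.

```python
import pandas as pd

# Toy frames for illustration only, loosely modelled on No: 1904,1010.1.
dates = pd.DataFrame({
    "Museum number": ["X1", "X1"],
    "Culture": ["Roman", "Early Anglo-Saxon"],
})
places = pd.DataFrame({
    "Museum number": ["X1", "X1"],
    "FindSpot_Cleaned": ["Paramythia", "Guilden Morden"],
})

# Number each repeat of a key within its own file: 0, 1, 2, ...
dates["occurrence"] = dates.groupby("Museum number").cumcount()
places["occurrence"] = places.groupby("Museum number").cumcount()

# Merging on key + occurrence pairs the first row with the first row, and so on
paired = dates.merge(
    places,
    on=["Museum number", "occurrence"],
    how="left",
    validate="one_to_one",
)
print(paired[["Culture", "FindSpot_Cleaned"]])
```

With this approach the merge never creates crossed pairings, but it is only as reliable as the row-order assumption, so a manual check of the affected records is still worthwhile.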


## Step 6  Building the standardised British Museum table

With the British Museum dates and places safely merged and the ambiguous cases resolved, I can now map the BM fields into the shared integration schema.

For the BM data I will use:

- `Museum`  
  - fixed value `"BM"`
- `LocalID`  
  - `Museum number`
- `AcqDate`  
  - `Acq date` from the BM places file
- `ObjectType`  
  - `Object type`
- `ItemDate`  
  - cleaned production date text from the dates file (`Production date_dates`)
- `StartDate`, `EndDate`, `MidpointDate`  
  - numeric cleaned date fields: `start_year`, `end_year`, `midpoint_year`
- `ItemPlace`  
  - cleaned production place: `ProductionPlace_Cleaned`
- `ModernCountry`  
  - mapped country: `FinalCountry`
- `Culture`  
  - `Culture_dates` from the dates file
- `ItemMaterial`  
  - `Materials`
- `ItemTechnique`  
  - `Technique`

This produces a BM-only table that already follows the final column structure. Later, I will stack the V&A table under this one and then assign a new `RecordID` to each row in the combined dataset.
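
The mapping above can also be expressed as a dictionary, which keeps the source-to-schema pairs in one place. This is a sketch rather than the code used in this notebook: `BM_FIELD_MAP` and `to_schema` are hypothetical names, the toy frame stands in for the merged BM data, and missing source columns are filled with `pd.NA` here.

```python
import pandas as pd

# Source column -> schema column, following the BM mapping listed above.
BM_FIELD_MAP = {
    "Museum number": "LocalID",
    "Acq date": "AcqDate",
    "Object type": "ObjectType",
    "Production date_dates": "ItemDate",
    "start_year": "StartDate",
    "end_year": "EndDate",
    "midpoint_year": "MidpointDate",
    "ProductionPlace_Cleaned": "ItemPlace",
    "FinalCountry": "ModernCountry",
    "Culture_dates": "Culture",
    "Materials": "ItemMaterial",
    "Technique": "ItemTechnique",
}


def to_schema(df, field_map, museum):
    """Build a schema-shaped table; source columns that are absent become missing values."""
    out = pd.DataFrame(index=df.index)
    out["Museum"] = museum
    for source, target in field_map.items():
        out[target] = df[source] if source in df.columns else pd.NA
    return out


# Toy frame for illustration only: most source columns are deliberately absent.
toy = pd.DataFrame({
    "Museum number": ["No: 1"],
    "Object type": ["figure"],
    "start_year": [-520.0],
})
result = to_schema(toy, BM_FIELD_MAP, "BM")
print(list(result.columns))
```

A second dictionary for the V&A fields would let both museums go through the same function, which makes differences between the two mappings easy to see side by side.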


In [15]:
# Helper: safely get a column if it exists, otherwise return empty strings
def get_col(df, name):
    if name in df.columns:
        return df[name]
    return ""

# Build the standardised British Museum table from the cleaned merged data
bm_standardised = pd.DataFrame({
    "Museum": "BM",
    "LocalID": bm_combined_clean["Museum number"],
    "AcqDate": get_col(bm_combined_clean, "Acq date"),
    "ObjectType": get_col(bm_combined_clean, "Object type"),
    "ItemDate": bm_combined_clean["Production date_dates"],
    "StartDate": bm_combined_clean["start_year"],
    "EndDate": bm_combined_clean["end_year"],
    "MidpointDate": bm_combined_clean["midpoint_year"],
    "ItemPlace": bm_combined_clean["ProductionPlace_Cleaned"],
    "ModernCountry": bm_combined_clean["FinalCountry"],
    "Culture": get_col(bm_combined_clean, "Culture_dates"),
    "ItemMaterial": get_col(bm_combined_clean, "Materials"),
    "ItemTechnique": get_col(bm_combined_clean, "Technique"),
})

print("Rows in bm_standardised:", len(bm_standardised))
bm_standardised.head()


Rows in bm_standardised: 4447


,Museum,LocalID,AcqDate,ObjectType,ItemDate,StartDate,EndDate,MidpointDate,ItemPlace,ModernCountry,Culture,ItemMaterial,ItemTechnique
0,BM,"No: 1886,0401.45",,acroterion,520BC-490BC (circa); 550BC - 500BC,-520.0,-490.0,-505.0,South Ionia,Greece,Archaic Greek,marble,painted
1,BM,"No: 1816,0610.321",1816.0,acroterion,420BC-400BC,-420.0,-400.0,-410.0,,Greece,Classical Greek,marble,
2,BM,"No: 1893,0315.1",,acroterion,300BC (circa),-300.0,-300.0,-300.0,Taranto,Italy,Apulian (Greek),limestone,
3,BM,"No: 1843,0531.26",1843.0,acroterion,447BC-432BC,-447.0,-432.0,-440.0,,Greece,Classical Greek,marble,
4,BM,No: EA11468,1848.0,amulet; figure,,-1069.0,-332.0,-700.5,,Egypt,Third Intermediate; Late Period,glazed composition,glazed


## Step 7  Building the standardised V&A table

Now that the British Museum data is mapped into the shared schema, I can do the same for the V&A dataset. The V&A merged file (`va_combined`) already contains cleaned dates and cleaned production places, so I can map its fields into the same column structure as the BM table.

This will give me a V&A table with the standardised fields needed for the combined cross-museum dataset.


In [16]:
# Build the standardised V&A table from the merged V&A data

va_standardised = pd.DataFrame({
    "Museum": "VAM",
    "LocalID": va_combined["systemNumber"],
    "AcqDate": get_col(va_combined, "accessionYear"),
    "ObjectType": get_col(va_combined, "objectType"),
    "ItemDate": va_combined["Production date"],
    "StartDate": va_combined["start_year"],
    "EndDate": va_combined["end_year"],
    "MidpointDate": va_combined["midpoint_year"],
    "ItemPlace": va_combined["ProductionPlace_Cleaned"],
    "ModernCountry": va_combined["FinalCountry"],
    "Culture": get_col(va_combined, "culture"),          # will be blank if no such column
    "ItemMaterial": get_col(va_combined, "_sampleMaterial"),
    "ItemTechnique": get_col(va_combined, "_sampleTechnique"),
})

print("Rows in va_standardised:", len(va_standardised))
va_standardised.head()


Rows in va_standardised: 3826


,Museum,LocalID,AcqDate,ObjectType,ItemDate,StartDate,EndDate,MidpointDate,ItemPlace,ModernCountry,Culture,ItemMaterial,ItemTechnique
0,VAM,O1407602,2016.0,Sculpture,2015,2015.0,2015.0,2015.0,London,United Kingdom,,silver,forging
1,VAM,O250154,1984.0,Sculpture,1983-1984,1983.0,1984.0,1984.0,Great Britain,United Kingdom,,,
2,VAM,O1769534,2024.0,Sculpture,2021,2021.0,2021.0,2021.0,Eryri National Park,United Kingdom,,sterling silver,casting
3,VAM,O24959,1991.0,Sculpture,12th century,1101.0,1200.0,1150.0,Cambodia,Cambodia,,sandstone,carving
4,VAM,O82516,1961.0,Sculpture,7th century to 8th century,601.0,700.0,650.0,"Dvaravati kingdom, Thailand",,,sandstone,Sculpture


## Converting the V&A cleaned date fields from floats to integers

When inspecting the standardised V&A table, I noticed that the cleaned date fields (`StartDate`, `EndDate`, `MidpointDate`) were stored as floats, which means whole years appeared with a `.0` (for example 1101.0 instead of 1101).  

Since the British Museum cleaned dates are stored as integers, I need to convert the V&A date columns to integers as well to keep the final dataset consistent. I use the nullable `Int64` dtype so that any missing values remain as empty cells rather than turning into zeros or causing the cast to fail.


In [17]:
# Convert V&A cleaned date columns from float to nullable integer (Int64)
date_cols = ["StartDate", "EndDate", "MidpointDate"]

for col in date_cols:
    va_standardised[col] = (
        pd.to_numeric(va_standardised[col], errors="coerce")
          .astype("Int64")
    )

va_standardised.head()


Unnamed: 0,Museum,LocalID,AcqDate,ObjectType,ItemDate,StartDate,EndDate,MidpointDate,ItemPlace,ModernCountry,Culture,ItemMaterial,ItemTechnique
0,VAM,O1407602,2016.0,Sculpture,2015,2015,2015,2015,London,United Kingdom,,silver,forging
1,VAM,O250154,1984.0,Sculpture,1983-1984,1983,1984,1984,Great Britain,United Kingdom,,,
2,VAM,O1769534,2024.0,Sculpture,2021,2021,2021,2021,Eryri National Park,United Kingdom,,sterling silver,casting
3,VAM,O24959,1991.0,Sculpture,12th century,1101,1200,1150,Cambodia,Cambodia,,sandstone,carving
4,VAM,O82516,1961.0,Sculpture,7th century to 8th century,601,700,650,"Dvaravati kingdom, Thailand",,,sandstone,Sculpture
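
To show why the nullable `Int64` dtype is used rather than a plain integer cast, here is a small sketch on a toy series (the values are made up for illustration). The `.0` disappears and the missing entry stays missing.

```python
import pandas as pd

# Toy float column with one missing year (hypothetical values)
years = pd.Series([1101.0, None, 650.0])

# Same conversion as in the cell above, applied to the toy series
as_int = pd.to_numeric(years, errors="coerce").astype("Int64")

print(as_int.dtype)          # nullable integer dtype
print(as_int.isna().sum())   # the missing value is preserved
print(as_int.tolist())
```

A plain `astype(int)` would raise an error on the missing value, which is why the capitalised, nullable `Int64` is the safer choice here.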


## Step 8  Combining the BM and V&A tables

With both museums mapped into the same schema, I can now concatenate the British Museum and V&A tables into a single combined sculpture dataset. After this, I'll assign a new sequential `RecordID` so that each row has a stable, project-specific unique identifier.


In [18]:
# Combine BM and V&A standardised tables

combined_standardised = pd.concat(
    [bm_standardised, va_standardised],
    ignore_index=True
)

print("Rows in bm_standardised:", len(bm_standardised))
print("Rows in va_standardised:", len(va_standardised))
print("Total rows in combined_standardised:", len(combined_standardised))

combined_standardised.head()


Rows in bm_standardised: 4447
Rows in va_standardised: 3826
Total rows in combined_standardised: 8273


Unnamed: 0,Museum,LocalID,AcqDate,ObjectType,ItemDate,StartDate,EndDate,MidpointDate,ItemPlace,ModernCountry,Culture,ItemMaterial,ItemTechnique
0,BM,"No: 1886,0401.45",,acroterion,520BC-490BC (circa); 550BC - 500BC,-520.0,-490.0,-505.0,South Ionia,Greece,Archaic Greek,marble,painted
1,BM,"No: 1816,0610.321",1816.0,acroterion,420BC-400BC,-420.0,-400.0,-410.0,,Greece,Classical Greek,marble,
2,BM,"No: 1893,0315.1",,acroterion,300BC (circa),-300.0,-300.0,-300.0,Taranto,Italy,Apulian (Greek),limestone,
3,BM,"No: 1843,0531.26",1843.0,acroterion,447BC-432BC,-447.0,-432.0,-440.0,,Greece,Classical Greek,marble,
4,BM,No: EA11468,1848.0,amulet; figure,,-1069.0,-332.0,-700.5,,Egypt,Third Intermediate; Late Period,glazed composition,glazed
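
The preview above shows the BM date values with a trailing `.0` after concatenation, and one BM midpoint of `-700.5`. A midpoint that falls on a half year cannot be stored losslessly in an integer column, so it is worth finding such rows before deciding how the date columns should be typed in the final file. A quick check, sketched on a hypothetical `toy` frame rather than the real table:

```python
import pandas as pd

# Toy stand-in for the combined table (hypothetical values)
toy = pd.DataFrame({
    "LocalID": ["a", "b", "c", "d"],
    "MidpointDate": [-505.0, -700.5, 1150.0, None],
})

mid = toy["MidpointDate"]

# Rows whose midpoint is present but is not a whole year
non_whole = toy[mid.notna() & (mid % 1 != 0)]

print(non_whole)
```

Running the same filter on `combined_standardised["MidpointDate"]` would list the records that need a rounding decision (or a float column) before export.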


## Step 9  Assigning project-specific RecordIDs

Because some museum numbers are reused for more than one object, I assign a new `RecordID` for this project. Each row in the combined dataset receives a sequential five-digit identifier:

- `00001` for the first row  
- `00002` for the second  
- and so on  

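This `RecordID` is unique within the combined BM and V&A sculpture dataset. Because it is derived from row position, it provides a stable reference for analysis and integration only as long as the input files and their row order do not change between runs.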


In [19]:
# Assign a new 5 digit RecordID across the combined dataset

combined_standardised = combined_standardised.reset_index(drop=True)

combined_standardised["RecordID"] = (
    combined_standardised.index + 1
).astype(str).str.zfill(5)

# Reorder columns to match the specified output schema
column_order = [
    "RecordID",
    "Museum",
    "LocalID",
    "AcqDate",
    "ObjectType",
    "ItemDate",
    "StartDate",
    "EndDate",
    "MidpointDate",
    "ItemPlace",
    "ModernCountry",
    "Culture",
    "ItemMaterial",
    "ItemTechnique",
]

combined_standardised = combined_standardised[column_order]

combined_standardised.head()


Unnamed: 0,RecordID,Museum,LocalID,AcqDate,ObjectType,ItemDate,StartDate,EndDate,MidpointDate,ItemPlace,ModernCountry,Culture,ItemMaterial,ItemTechnique
0,00001,BM,"No: 1886,0401.45",,acroterion,520BC-490BC (circa); 550BC - 500BC,-520.0,-490.0,-505.0,South Ionia,Greece,Archaic Greek,marble,painted
1,00002,BM,"No: 1816,0610.321",1816.0,acroterion,420BC-400BC,-420.0,-400.0,-410.0,,Greece,Classical Greek,marble,
2,00003,BM,"No: 1893,0315.1",,acroterion,300BC (circa),-300.0,-300.0,-300.0,Taranto,Italy,Apulian (Greek),limestone,
3,00004,BM,"No: 1843,0531.26",1843.0,acroterion,447BC-432BC,-447.0,-432.0,-440.0,,Greece,Classical Greek,marble,
4,00005,BM,No: EA11468,1848.0,amulet; figure,,-1069.0,-332.0,-700.5,,Egypt,Third Intermediate; Late Period,glazed composition,glazed
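
As a sanity check on the identifier scheme, the sketch below applies the same `zfill(5)` expression to a hypothetical three-row `toy` frame and confirms that the IDs are zero-padded strings, all five characters long and unique.

```python
import pandas as pd

# Toy stand-in for the combined table (hypothetical values)
toy = pd.DataFrame({"LocalID": ["x", "y", "z"]})

# Same expression as in the cell above: 1-based position, zero-padded to 5 characters
toy["RecordID"] = (toy.index + 1).astype(str).str.zfill(5)

ids = toy["RecordID"].tolist()
all_unique = toy["RecordID"].is_unique
all_five_chars = bool(toy["RecordID"].str.len().eq(5).all())

print(ids)
print(all_unique, all_five_chars)
```

The same two checks (`is_unique` and the length test) can be run on `combined_standardised["RecordID"]` before export. Note that five digits leave room for 99,999 rows; beyond that `zfill(5)` no longer pads and the IDs grow to six characters.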


## Step 10  Exporting the final combined dataset

The final step is to export the standardised BM and V&A data into a single CSV file. This file follows the agreed schema and includes a unique five-digit `RecordID` for every object.

The output file is named `combined_collections_dataset.csv`.


In [20]:
# Export the final combined dataset
output_path = data_dir / "combined_collections_dataset.csv"

combined_standardised.to_csv(output_path, index=False)

output_path


WindowsPath('../data/combined_collections_dataset.csv')
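
One thing to watch when the exported file is read back: by default pandas parses a column like `RecordID` as an integer, which drops the leading zeros. The sketch below demonstrates this on a small in-memory CSV (the two rows are made up) and shows that passing `dtype={"RecordID": str}` keeps the identifiers intact.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for the exported file (hypothetical rows)
csv_text = "RecordID,Museum\n00001,BM\n00002,VAM\n"

# Default parsing infers an integer column and loses the zero padding
default_read = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to string preserves the five-digit form
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"RecordID": str})

print(default_read["RecordID"].tolist())   # [1, 2]
print(as_text["RecordID"].tolist())        # ['00001', '00002']
```

The same `dtype` argument applies when loading `combined_collections_dataset.csv` in later notebooks. Spreadsheet software may strip the zeros in a similar way unless the column is imported as text.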

## Summary of work completed

- Loaded the cleaned British Museum and V&A date and place datasets.

- Identified that the V&A merge behaved as expected, while the British Museum merge produced expanded rows due to duplicated Museum numbers in both input tables.

- Inspected all duplicated Museum numbers and confirmed which represented real object variation versus accidental duplication.

- Removed true duplicate rows and manually retained only meaningful variants for the five problematic BM identifiers.

- Built a standardised export schema covering both museums and mapped all fields accordingly.

- Converted V&A cleaned date fields from floats to nullable integers (`Int64`) for consistency with the British Museum dataset.

- Created separate standardised tables for BM and V&A, then concatenated them into one combined dataset.

- Assigned a new unique five-digit `RecordID` to each row to avoid relying on inconsistent museum numbering.

- Exported the final output as `combined_collections_dataset.csv`.
