# Victoria & Albert Museum – Date Cleaning (Diana)

This notebook cleans the **Production date** column for the Victoria & Albert Museum (V&A) sculpture dataset so it can be aligned with the British Museum dataset and used for combined timeline analysis.

The aim is to replicate the cleaning pipeline developed for the British Museum dataset (Nicky’s notebook), using the same approach, the same column names, and the same sequence of steps. This ensures the two datasets can be merged confidently and compared on a shared chronological scale.

---

## Goals for this notebook

1. Load the V&A sculpture dataset into pandas.
2. Explore the V&A’s `_primaryDate` field (the equivalent of the British Museum “Production date” column).
3. Clean the raw date strings into a standardised, consistent text format.
4. Extract a primary date string for each object (`prod_date_primary`).
5. Convert cleaned dates into numeric year fields, plus a status flag recording how each date was parsed:
  - `start_year`
  - `end_year`
  - `midpoint_year`
  - `date_parse_status`
6. Keep the original raw columns unchanged so:
  - the raw V&A data is preserved, and
  - the cleaning steps are transparent, reviewable, and easy for the group to follow.
7. Export a cleaned V&A dates CSV that is structurally aligned with `bm_cleaned_dates.csv` from the British Museum.


### Load the V&A Dataset

The V&A CSV sits in the same `/data` directory as the British Museum dataset. To keep the cleaning pipeline reusable, I rename the V&A's `_primaryDate` column to match Nicky's column name (`Production date`).

This gives both datasets the same starting point for cleaning and parsing.

In [2]:
import pandas as pd
import numpy as np
import re

# Path to the V&A dataset
va_file_path = "../data/victoria_albert_museum_dataset_raw.csv"

# Load into pandas
va_df = pd.read_csv(va_file_path)

print("Rows, columns:", va_df.shape)
print("Column names:", va_df.columns.tolist())

# Rename _primaryDate to match the British Museum cleaning pipeline.
# The working column is called 'Production date' in both notebooks.
va_df = va_df.rename(columns={"_primaryDate": "Production date"})

va_df[["Production date"]].head(10)

Rows, columns: (3826, 16)


| | Production date |
|---|---|
| 0 | 2015 |
| 1 | 1983-1984 |
| 2 | 2021 |
| 3 | 12th century |
| 4 | 7th century to 8th century |
| 5 | 1980 |
| 6 | ca. 1470 - ca. 1480 |
| 7 | 2012 |
| 8 | 1988 |
| 9 | 1988 |


### Explore the V&A `Production date` Column

While the V&A dataset is structured differently from the British Museum dataset, the date formats show similar inconsistencies, including:
- **Single years**
    - `"1895"`, `"1750"`
- **Year ranges**
    - `"1880–1890"`, `"1200-1250"`
- **Decades**
    - `"1860s"`, `"1950s–1960s"`
- **Centuries**
    - `"15th century"`, `"18th century"`
- **Century ranges**
    - `"17th century – 18th century"`
- **“Circa” and uncertainty formats**
    - `"ca. 1500"`, `"c. 1850"`, `"about 1800"`
- **Mixed labels and curator notes**
    - `"Signed and dated 1784"`
    - `"Late 19th century (restored)"`

Before building any parsing logic, I check how many rows contain dates, and inspect a random sample to understand the patterns.

In [3]:
missing_dates = va_df["Production date"].isna().sum()
print("Missing Production dates:", missing_dates)

va_df["Production date"].dropna().sample(20, random_state=42).tolist()

Missing Production dates: 89


['2001',
 '1501-1504',
 'ca. 1697-1715',
 'ca. 1750-1780',
 '1540-1560',
 'ca. 1325-50',
 '1988',
 '8th century',
 '1635',
 'ca. 1475-1490',
 'ca. 1340-1370',
 '3rd century-4th century',
 'ca. 1484 - ca. 1490',
 '1827',
 'ca. 1767',
 'early to mid nineteenth century',
 '2nd century-3rd century',
 '1970',
 '1961',
 '1741-1742']

This helps identify similarities and differences with the British Museum dataset and informs how much of Nicky’s cleaning functions can be reused.
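
One rough way to quantify these patterns before writing any parsing rules is to bucket the raw strings with a few regular expressions. The sketch below is illustrative only: `rough_format` and its pattern list are hypothetical helpers, not part of Nicky's notebook, and the input is a handful of values copied from the random sample above.

```python
import re
from collections import Counter

# Illustrative patterns only; order matters because the first match wins.
ROUGH_PATTERNS = [
    ("single_year", re.compile(r"^\d{3,4}$")),
    ("year_range", re.compile(r"^\d{3,4}\s*-\s*\d{2,4}$")),
    ("circa", re.compile(r"^(ca\.|c\.|circa|about)\s")),
    ("century", re.compile(r"century")),
]


def rough_format(raw):
    """Return a coarse label for a raw date string (hypothetical helper)."""
    text = raw.strip().lower()
    for label, pattern in ROUGH_PATTERNS:
        if pattern.search(text):
            return label
    return "other"


# A few values taken from the random sample shown above.
sample = [
    "2001", "1501-1504", "ca. 1697-1715", "ca. 1325-50",
    "8th century", "3rd century-4th century",
    "early to mid nineteenth century", "ca. 1484 - ca. 1490",
]

counts = Counter(rough_format(value) for value in sample)
print(counts)
```

On the full dataset, the same helper could be applied with something like `va_df["Production date"].dropna().map(rough_format).value_counts()`. The labels are deliberately coarse and are only meant to guide which rules to write first.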

### Cleaning the Raw `Production date` Strings

Like the British Museum dataset, the V&A dates contain:
-  uncertainty words (circa, c., ca., about, after, before)
-  century words (“18th century”)
-  inconsistent spacing and hyphens
-  BC/AD formats
-  multiple date ranges in one string
-  text in brackets (usually curator notes, not dates)

To make the numeric parsing step consistent, I apply a cleaning function that:
1. Lowercases and trims whitespace
2. Removes everything in brackets
3. Removes uncertainty words like circa, about, ca., early, late
4. Normalises “century” → “c” for easier parsing
5. Converts “bce/ce” to “bc/ad”
6. Standardises spaces and hyphens
7. Preserves valid BC and AD formats

The output column is named `prod_date_clean`, aligned to Nicky’s notebook.

In [4]:
# Cleaning the data

def clean_production_date(value):
    """
    Clean the raw 'Production date' text while preserving BC/AD correctly.
    Works for both British Museum and V&A formats.
    """
    if pd.isna(value):
        return None

    text = str(value).strip().lower()

    # 1. Remove anything in brackets (usually curator notes or uncertainty)
    text = re.sub(r"\([^)]*\)", "", text)

    # Handle 'ca.' and 'c.' (common V&A shorthand for 'circa')
    text = text.replace("ca.", " ")
    text = text.replace("c.", " ")

    # 2. Remove common 'fluff' words that do not affect the actual date
    fluff_words = [
        "circa", "about", "approx", "approximately",
        "early", "late",
        "possibly", "probably",
        "after", "before",
        "made", "dated",
        "ca"
    ]
    for word in fluff_words:
        text = re.sub(rf"\b{word}\b", "", text)

    # 2b. Remove 'c' / 'c.' when used alone as circa, but keep 'bc'
    text = re.sub(r"\bc\.\b", "", text)
    text = re.sub(r"\bc\b", "", text)

    # 3. Simplify the word 'century' to 'c'
    text = re.sub(r"\bcentury\b", "c", text)

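    # 3b. Standardise BCE/CE to BC/AD (whole era tokens only, so other
    #     words that happen to contain "ce" are left intact)
    text = re.sub(r"(?<![a-z])bce\b", "bc", text)
    text = re.sub(r"(?<![a-z])ce\b", "ad", text)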

    # 4. Clean extra spaces
    text = re.sub(r"\s+", " ", text).strip()

    # 5. Remove spaces before BC/AD (e.g. "500 bc" -> "500bc")
    text = re.sub(r"\s*bc", "bc", text)
    text = re.sub(r"\s*ad", "ad", text)

    # 6. Standardise hyphens (e.g. "500 bc - 350 bc" -> "500bc-350bc")
    text = re.sub(r"\s*-\s*", "-", text)

    text = text.strip()
    return text if text else None
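
The cleaning function above standardises ASCII hyphens only, so en dashes (as in `"1880–1890"`) and the word "to" (as in `"7th century to 8th century"`) pass through unchanged. A possible pre-step is sketched below. `normalise_range_separators` is a hypothetical helper, not part of Nicky's pipeline, and it assumes that "to" between two date parts always marks a range.

```python
import re


def normalise_range_separators(text):
    """Hypothetical pre-step: map en/em dashes and ' to ' onto a plain hyphen."""
    if text is None:
        return None
    # En dash (U+2013) and em dash (U+2014) become ASCII hyphens.
    text = re.sub(r"[\u2013\u2014]", "-", text)
    # A standalone 'to' between two date parts becomes a hyphen.
    text = re.sub(r"\s+to\s+", "-", text)
    # Collapse spaces around hyphens, as the main cleaning function does.
    text = re.sub(r"\s*-\s*", "-", text)
    return text.strip()


print(normalise_range_separators("1880\u20131890"))        # 1880-1890
print(normalise_range_separators("7th c to 8th c"))        # 7th c-8th c
print(normalise_range_separators("17th c \u2013 18th c"))  # 17th c-18th c
```

If a step like this were added, strings such as `"7th c-8th c"` should reach the century-range branch of `parse_prod_date` rather than being read as a single century. The status counts shown later in this notebook would change, so the parsing cells would need to be re-run and reviewed.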

### Extract the Primary Date Segment

Some V&A entries contain multiple date expressions, e.g.:

- `"1850-1870; restored later"`
- `"18th century; possibly 19th century"`

For consistency with the British Museum pipeline, I follow Nicky's rule: keep only the first date segment before any semicolon.
This is saved into a new column:

- `prod_date_primary`

This reduces complexity during numeric parsing and ensures consistent behaviour across both museums.

In [5]:

def select_primary_date_segment(text):
    """
    If there are multiple parts separated by ';', keep only the first one.
    This matches the British Museum pipeline rule.
    """
    if text is None:
        return None
    first_part = text.split(";")[0].strip()
    return first_part if first_part else None


### Parsing Cleaned Dates into Numeric Year Ranges

I reuse the same parsing functions Nicky created for:

- single AD years
- single BC years
- ranges (BC and AD)
- century formats (“15thc”)
- century ranges (“17thc–18thc”)
- decades (“1950s”)
- decade ranges (“1950s–1960s”)

Each parsed date produces:
- `start_year`
- `end_year`
- `date_parse_status` (e.g., “parsed_ad_single”, “parsed_century_range”, “unparsed”)

This ensures full alignment with the British Museum’s cleaned output.

In [6]:

def century_to_range(n, era="AD"):
    """
    Convert a century number into an approximate year range.
    AD: 7thC -> 601 to 700
    BC: 3rdC BC -> -300 to -201
    """
    n = int(n)
    if era.upper() == "AD":
        start = (n - 1) * 100 + 1
        end = n * 100
    else:  # BC
        start = - (n * 100)
        end = - ((n - 1) * 100 + 1)
    return start, end


def parse_single_century(part):
    """
    Parse a single century expression like:
    - '7thc'
    - '3rdcbc'
    - '1stcad'
    Returns (start_year, end_year) or (None, None).
    """
    if part is None:
        return None, None

    text = part.strip().lower()

    # Detect BC/AD from suffix (if present)
    era = "AD"
    if text.endswith("bc"):
        era = "BC"
        text = text[:-2].strip()
    elif text.endswith("ad"):
        era = "AD"
        text = text[:-2].strip()

    # Remove the 'c' for century if present
    text = text.replace("c", "")

    # Extract the century number
    match = re.search(r"\d+", text)
    if not match:
        return None, None

    century_num = int(match.group(0))
    return century_to_range(century_num, era=era)

def parse_prod_date(text):
    """
    Parse a cleaned primary date string into start_year, end_year, and status.
    Handles:
    - single years (AD and BC)
    - year ranges (AD and BC)
    - BC prefixed ranges like 'bc1069-747'
    - century expressions and ranges
    - decade formats like '1860s' or '1950s-1960s'
    """
    if text is None:
        return None, None, "no_date_text"

    t = text.strip().lower()

    # 1. BC prefix range like 'bc1069-747'
    match = re.match(r"^bc(\d+)-(\d+)$", t)
    if match:
        start = -int(match.group(1))
        end = -int(match.group(2))
        return start, end, "parsed_bc_range_prefix"

    # 2. BC range like '520bc-490bc'
    match = re.match(r"^(\d+)bc-(\d+)bc$", t)
    if match:
        start = -int(match.group(1))
        end = -int(match.group(2))
        return start, end, "parsed_bc_range"

    # 3. Single BC year '300bc'
    match = re.match(r"^(\d+)bc$", t)
    if match:
        year = -int(match.group(1))
        return year, year, "parsed_bc_single"

    # 4. AD year range '1250-1350'
    match = re.match(r"^(\d+)-(\d+)$", t)
    if match:
        start = int(match.group(1))
        end = int(match.group(2))
        return start, end, "parsed_ad_range"

    # 5. Single AD year '1892'
    match = re.match(r"^(\d+)$", t)
    if match:
        year = int(match.group(1))
        return year, year, "parsed_ad_single"

    # 6. Decade range '1950s-1960s'
    match = re.match(r"^(\d{3,4})s-(\d{3,4})s$", t)
    if match:
        start_decade = int(match.group(1))
        end_decade = int(match.group(2))
        start = start_decade
        end = end_decade + 9  # e.g. 1960s -> 1969
        return start, end, "parsed_decade_range"

    # 7. Single decade '1860s'
    match = re.match(r"^(\d{3,4})s$", t)
    if match:
        start = int(match.group(1))
        end = start + 9
        return start, end, "parsed_decade_single"

    # 8. Century formats ('7thc', '3rdcbc', '16thc-17thc')
    if "c" in t:
        if "-" in t:
            left, right = t.split("-", 1)
            left_start, left_end = parse_single_century(left)
            right_start, right_end = parse_single_century(right)
            if left_start is not None and right_start is not None:
                start = min(left_start, right_start)
                end = max(left_end, right_end)
                return start, end, "parsed_century_range"
            else:
                return None, None, "unparsed_century_range"
        else:
            c_start, c_end = parse_single_century(t)
            if c_start is not None:
                return c_start, c_end, "parsed_century_single"
            else:
                return None, None, "unparsed_century"

    # If nothing matched
    return None, None, "unparsed"


def refine_primary_date(text):
    """
    Small fixes to catch awkward 'ad' and century endings, e.g.
    - 'ad200-300' -> '200-300'
    - 'AD 120 - 140' -> '120-140'
    - '20th .' -> '20thc'
    - '1778 )' -> '1778'
    """
    if text is None:
        return None

    t = str(text).strip().lower()

    # Turn 'ad 120-140' or 'ad200-300' into '120-140' / '200-300'
    t = re.sub(r"\bad\s*(\d+)", r"\1", t)

    # Fix '20th .' style endings to '20thc'
    t = re.sub(r"\b(\d+)(st|nd|rd|th)\s*\.?\s*$", r"\1\2c", t)

    # Remove trailing junk like ')' or '.'
    t = re.sub(r"[^\w]+$", "", t)

    t = t.strip()
    return t if t else None
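
One V&A pattern in the random sample, `'ca. 1325-50'`, abbreviates the end year. After cleaning it becomes `'1325-50'`, which the AD range rule reads literally as start 1325 and end 50. The sketch below shows one way such strings could be expanded before parsing. `expand_short_end_year` is a hypothetical helper, not part of Nicky's notebook, and it assumes the end year shares the leading digits of the start year.

```python
import re


def expand_short_end_year(text):
    """Hypothetical helper: '1325-50' -> '1325-1350'; other strings unchanged."""
    if text is None:
        return None
    match = re.match(r"^(\d{3,4})-(\d{1,3})$", text)
    if not match:
        return text
    start, end = match.group(1), match.group(2)
    if len(end) >= len(start):
        return text  # end year is already written in full
    # Borrow the leading digits of the start year, e.g. '13' + '50'.
    expanded = start[: len(start) - len(end)] + end
    # Only accept the expansion if it yields a forward-running range.
    if int(expanded) < int(start):
        return text
    return f"{start}-{expanded}"


print(expand_short_end_year("1325-50"))    # 1325-1350
print(expand_short_end_year("1983-1984"))  # 1983-1984
print(expand_short_end_year("200-300"))    # 200-300
```

A helper like this would sit between `refine_primary_date` and `parse_prod_date`. Whether to apply it is a group decision, since it changes the parsed values for any abbreviated ranges in the data.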


I then calculate `midpoint_year` in the same way as Nicky’s notebook: the start year plus half the span, rounded to a whole year.

In [7]:
def compute_midpoint(row):
    s = row["start_year"]
    e = row["end_year"]
    if pd.isna(s) or pd.isna(e):
        return None
    return int(round(s + (e - s) / 2))
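
One detail worth knowing when reading the midpoints: Python's built-in `round` rounds exact halves to the nearest even number, so a span with an odd number of years between start and end does not always round up. The sketch below repeats the same arithmetic on plain numbers; `midpoint` is just an illustrative stand-in for `compute_midpoint`.

```python
def midpoint(start, end):
    """Same arithmetic as compute_midpoint, on plain numbers."""
    return int(round(start + (end - start) / 2))


print(midpoint(1983, 1984))  # 1983.5 rounds to 1984 (even)
print(midpoint(1984, 1985))  # 1984.5 also rounds to 1984 (even)
print(midpoint(1470, 1480))  # 1475
print(midpoint(-300, -201))  # -250.5 rounds to -250 (even)
```

For timeline plots a difference of one year is unlikely to matter, but it explains why, for example, a 12th-century object (1101 to 1200) gets a midpoint of 1150 rather than 1151.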

### Applying Cleaning Functions

I then apply the cleaning functions and preview the first rows to check that the cleaning behaved as expected.

In [8]:
# 1. Clean the raw 'Production date' text
va_df["prod_date_clean"] = va_df["Production date"].apply(clean_production_date)

# 2. Extract the first date (before any ';')
va_df["prod_date_primary"] = va_df["prod_date_clean"].apply(select_primary_date_segment)

# 3. Additional refinement for tricky 'ad' / 'century' cases
va_df["prod_date_primary"] = va_df["prod_date_primary"].apply(refine_primary_date)

# Quick preview to check the cleaning
va_df[["Production date", "prod_date_clean", "prod_date_primary"]].head(20)

| | Production date | prod_date_clean | prod_date_primary |
|---|---|---|---|
| 0 | 2015 | 2015 | 2015 |
| 1 | 1983-1984 | 1983-1984 | 1983-1984 |
| 2 | 2021 | 2021 | 2021 |
| 3 | 12th century | 12th c | 12th c |
| 4 | 7th century to 8th century | 7th c to 8th c | 7th c to 8th c |
| 5 | 1980 | 1980 | 1980 |
| 6 | ca. 1470 - ca. 1480 | 1470-1480 | 1470-1480 |
| 7 | 2012 | 2012 | 2012 |
| 8 | 1988 | 1988 | 1988 |
| 9 | 1988 | 1988 | 1988 |


### Parse into Numeric Years

After parsing, I preview the numeric columns to check that they match the structure of Nicky's dataset.

In [9]:
# Parse the cleaned primary date string
parsed = va_df["prod_date_primary"].apply(parse_prod_date)
va_df["start_year"], va_df["end_year"], va_df["date_parse_status"] = zip(*parsed)

# Compute midpoint_year
va_df["midpoint_year"] = va_df.apply(compute_midpoint, axis=1)

# Preview results, aligned with bm_cleaned_dates.csv structure
va_df[[
    "Production date",
    "prod_date_primary",
    "start_year",
    "end_year",
    "midpoint_year",
    "date_parse_status"
]].head(20)


| | Production date | prod_date_primary | start_year | end_year | midpoint_year | date_parse_status |
|---|---|---|---|---|---|---|
| 0 | 2015 | 2015 | 2015.0 | 2015.0 | 2015.0 | parsed_ad_single |
| 1 | 1983-1984 | 1983-1984 | 1983.0 | 1984.0 | 1984.0 | parsed_ad_range |
| 2 | 2021 | 2021 | 2021.0 | 2021.0 | 2021.0 | parsed_ad_single |
| 3 | 12th century | 12th c | 1101.0 | 1200.0 | 1150.0 | parsed_century_single |
| 4 | 7th century to 8th century | 7th c to 8th c | 601.0 | 700.0 | 650.0 | parsed_century_single |
| 5 | 1980 | 1980 | 1980.0 | 1980.0 | 1980.0 | parsed_ad_single |
| 6 | ca. 1470 - ca. 1480 | 1470-1480 | 1470.0 | 1480.0 | 1475.0 | parsed_ad_range |
| 7 | 2012 | 2012 | 2012.0 | 2012.0 | 2012.0 | parsed_ad_single |
| 8 | 1988 | 1988 | 1988.0 | 1988.0 | 1988.0 | parsed_ad_single |
| 9 | 1988 | 1988 | 1988.0 | 1988.0 | 1988.0 | parsed_ad_single |


### Check Parsing Coverage

As in Nicky’s British Museum notebook, it is useful to see how many values fall into each parsing category. This shows:

- how many rows were successfully parsed,
- how many rows are missing dates entirely, and
- how many required special handling (centuries, decades, BC ranges, etc.).

In [10]:
va_df["date_parse_status"].value_counts(dropna=False)

date_parse_status
parsed_ad_single          1756
parsed_ad_range           1361
parsed_century_single      345
parsed_century_range       176
no_date_text                89
unparsed                    53
unparsed_century            15
parsed_decade_single        10
parsed_bc_range              8
parsed_bc_single             7
unparsed_century_range       5
parsed_decade_range          1
Name: count, dtype: int64

This gives the team visibility on:
- how complete the cleaned dates are
- how many remain ambiguous
- whether we need additional rules for V&A-specific formats
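
To turn the counts above into headline figures, the sketch below copies the printed `value_counts()` output into a plain dictionary and computes simple coverage shares. Treating every status that starts with `parsed_` as a success is my own simplification.

```python
# Status counts copied from the value_counts() output above.
status_counts = {
    "parsed_ad_single": 1756,
    "parsed_ad_range": 1361,
    "parsed_century_single": 345,
    "parsed_century_range": 176,
    "no_date_text": 89,
    "unparsed": 53,
    "unparsed_century": 15,
    "parsed_decade_single": 10,
    "parsed_bc_range": 8,
    "parsed_bc_single": 7,
    "unparsed_century_range": 5,
    "parsed_decade_range": 1,
}

total = sum(status_counts.values())
parsed = sum(n for status, n in status_counts.items() if status.startswith("parsed_"))
missing = status_counts["no_date_text"]
unparsed = total - parsed - missing

print(f"Total rows: {total}")
print(f"Parsed: {parsed} ({parsed / total:.1%} of all rows)")
print(f"Parsed share of rows with date text: {parsed / (total - missing):.1%}")
print(f"Still unparsed: {unparsed}")
```

A `parsed_` status only means a rule matched, not that the interpretation is right. Row 4 in the preview above ("7th century to 8th century") is labelled `parsed_century_single` with a range of 601 to 700, so a manual spot check of each category is still worthwhile.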

### Result: Aligned Structure with British Museum Dataset

After running this notebook, the V&A dataset contains the same key cleaned date fields as the British Museum dataset:
- `prod_date_clean`
- `prod_date_primary`
- `start_year`
- `end_year`
- `midpoint_year`
- `date_parse_status`

The British Museum file also includes:

- `Museum number`
- `Culture`
- `Production place`
- `Find spot`

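For the V&A, the intention is to include the closest equivalents where possible (e.g. object ID and place fields). The export in this notebook includes only the object ID (`systemNumber`); equivalent place and culture columns still need to be identified from the actual V&A schema.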

This means both museums’ data can now be:
- combined
- plotted on chronological timelines
- compared historically
- used for group visualisation work

The cleaning logic is transparent, reproducible, and matches the structure of Nicky’s British Museum notebook.

### Export Columns Used to Match British Museum Columns

In [11]:
export_columns = [
    # Object identifier
    "systemNumber",

    # Date fields
    "Production date",        # raw text
    "prod_date_primary",      # cleaned main date string
    "start_year",             # numeric start
    "end_year",               # numeric end
    "midpoint_year",          # numeric midpoint
    "date_parse_status",      # parsing method / status
]

# Filter the DataFrame
va_export = va_df[export_columns].copy()

# Save to CSV
output_path = "../data/va_cleaned_dates.csv"
va_export.to_csv(output_path, index=False)

print("Export complete!")
print("File saved to:", output_path)
print("Rows exported:", len(va_export))

Export complete!
File saved to: ../data/va_cleaned_dates.csv
Rows exported: 3826


In [12]:
for col in export_columns:
    print(col, col in va_df.columns)

systemNumber True
Production date True
prod_date_primary True
start_year True
end_year True
midpoint_year True
date_parse_status True
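
The check above confirms that every export column exists in `va_df`. A complementary check is to compare the export columns with the British Museum file. In the sketch below, `bm_columns` is an assumption built from the fields named earlier in this notebook, not read from `bm_cleaned_dates.csv`, so the real header may differ.

```python
# V&A export columns, as defined in export_columns above.
va_columns = [
    "systemNumber", "Production date", "prod_date_primary",
    "start_year", "end_year", "midpoint_year", "date_parse_status",
]

# Assumed British Museum columns, based on the fields named in this notebook;
# the real header of bm_cleaned_dates.csv may differ.
bm_columns = [
    "Museum number", "Culture", "Production place", "Find spot",
    "Production date", "prod_date_clean", "prod_date_primary",
    "start_year", "end_year", "midpoint_year", "date_parse_status",
]

shared = [c for c in va_columns if c in bm_columns]
only_va = [c for c in va_columns if c not in bm_columns]
only_bm = [c for c in bm_columns if c not in va_columns]

print("Shared:", shared)
print("Only in V&A export:", only_va)
print("Only in BM file:", only_bm)
```

Under these assumptions the date fields line up, while the identifier columns differ (`systemNumber` versus `Museum number`) and `prod_date_clean` is not part of the V&A export. Both points would need a decision before the two files are merged.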


## Summary

This notebook applies a parallel cleaning and parsing process to the V&A sculpture dataset so that it can be aligned with Nicky’s British Museum date-cleaning notebook.

The work involved:

- Standardising a wide range of inconsistent date formats used in the V&A `_primaryDate` field.
- Converting BC and AD dates onto a single numeric timeline.
- Generating `start_year`, `end_year`, and `midpoint_year` for each object.
- Flagging how each date string was interpreted using `date_parse_status`.
- Exporting a cleaned V&A dates file (`va_cleaned_dates.csv`) whose structure mirrors `bm_cleaned_dates.csv` for the British Museum.

Both raw and cleaned date fields are retained to support transparency, manual review, and any future changes.