# British Museum – Production Date Cleaning (Nicky)

This notebook cleans the **Production date** field in the British Museum sculpture dataset.  
The aim is to turn extremely inconsistent date strings into usable numerical values that can be plotted on a continuous timeline.

**Goals for this notebook:**

1. Load the British Museum sculpture CSV into pandas.  
2. Explore the `Production date` field to understand the range of formats.  
3. Clean raw date strings by:
   - removing noise words (circa, about, late, after, possibly, etc)
   - standardising case and spacing
   - reducing semicolon entries to the primary date
4. Create cleaned text fields:  
   - `prod_date_clean`
   - `prod_date_primary`
5. Parse the cleaned date strings into numerical values:
   - BC dates converted to negative numbers  
   - AD dates kept as positive numbers  
   - Centuries and decade ranges converted into approximate spans  
6. Create final numerical fields:
   - `start_year`
   - `end_year`
   - `midpoint_year`
   - `date_parse_status` (explains how each value was interpreted)
7. Handle missing or unparseable dates by analysing the `Culture` field:
   - identify cultures with well defined historical ranges  
   - map these cultures to approximate date spans  
   - apply these ranges only when context supports it (via Culture, Find spot, or Production place)
8. Leave ambiguous cases unmapped to avoid inaccurate assumptions.
9. Keep all steps transparent and reproducible.

The original `Production date` field is kept unchanged, and all cleaned fields are stored separately so that:
- the dataset remains fully auditable  
- every transformation can be traced  
- visualisation work (timelines, comparisons) uses consistent numerical values  


In [3]:
import pandas as pd

# Path to the British Museum dataset.
# The notebook is in /notebooks and the CSV is in /data,
# so we go up one folder ("..") to reach the project root, then into /data.
bm_file_path = "../data/british_museum_dataset_raw.csv"

# Load the dataset into a pandas DataFrame
bm_df = pd.read_csv(bm_file_path)

# Basic checks to confirm everything loaded correctly
print("Rows, columns:", bm_df.shape)
print("Column names:", bm_df.columns.tolist())

# Preview the key columns for date cleaning
bm_df[["Production date", "Culture"]].head(10)


Rows, columns: (4458, 47)
Column names: ['Image', 'Object type', 'Museum number', 'Title', 'Denomination', 'Escapement', 'Description', 'Producer name', 'School/style', 'State', 'Authority', 'Ethnic name (made by)', 'Ethnic name (assoc)', 'Culture', 'Production date', 'Production place', 'Find spot', 'Materials', 'Ware', 'Type series', 'Technique', 'Dimensions', 'Inscription', 'Curators Comments', 'Bib references', 'Location', 'Exhibition history', 'Condition', 'Subjects', 'Assoc name', 'Assoc place', 'Assoc events', 'Assoc titles', 'Acq name (acq)', 'Acq name (finding)', 'Acq name (excavator)', 'Acq name (previous)', 'Acq date', 'Acq notes (acq)', 'Acq notes (exc)', 'Dept', 'BM/Big number', 'Reg number', 'Add ids', 'Cat no', 'Banknote serial number', 'Joined objects']


Unnamed: 0,Production date,Culture
0,520BC-490BC (circa); 550BC - 500BC,Archaic Greek
1,420BC-400BC,Classical Greek
2,300BC (circa),Apulian (Greek)
3,447BC-432BC,Classical Greek
4,,26th Dynasty
5,,Third Intermediate; Late Period
6,,Late Period
7,,Third Intermediate; 26th Dynasty
8,,26th Dynasty
9,,Ancient Egypt


## Explore the `Production date` column

Before writing any cleaning functions, it's helpful to understand the overall shape of the mess in the `Production date` column.

Instead of printing every unique value (there are more than a thousand), I am starting by checking:
- how many rows are missing a production date,
- how many rows are missing a `Culture` value,
- and a small sample of 20 date values chosen at random.

The sample is not meant to show every possible format, but to give us a quick sense of the different patterns we will need to clean later (for example BC ranges, single years, centuries, mixed BC/AD ranges, 'circa', 'late/early', and values containing semicolons).



In [4]:
# How many rows are missing a Production date?
missing_dates = bm_df["Production date"].isna().sum()
print("Missing Production dates:", missing_dates)

# How many rows are missing a Culture value?
missing_culture = bm_df["Culture"].isna().sum()
print("Missing Culture values:", missing_culture)

# Show 20 random date values to see the range of formatting issues
bm_df["Production date"].dropna().sample(20, random_state=42).tolist()


Missing Production dates: 787
Missing Culture values: 911


['480BC-460BC (circa)',
 '645BC-635BC',
 '350BC (circa)',
 '600BC (circa)',
 '645BC-635BC',
 '1stC-3rdC',
 '645BC-635BC',
 '1892',
 '640BC-620BC',
 '350BC (circa)',
 '1830 (circa)',
 '350BC (circa)',
 '2ndC BC(late)-1stC BC (circa)',
 'late 17thC- early 18thC',
 '193-200 (circa)',
 '438BC-432BC',
 '704BC-681BC (circa)',
 '450BC (circa)',
 '1stC-3rdC (circa)',
 '589BC-570BC']

## Identify common patterns in the `Production date` values

Now that I’ve checked a small sample of the date values, the next step is to understand the *types of patterns* that appear in this column. 

There are more than 1,200 unique raw strings in the `Production date` field, far too many to read through one by one. Instead, I want to group dates by their general structure. 

This helps identify the kinds of cases we need to handle in the cleaning function, such as:
- single years (e.g. `1892`)
- single BC years (e.g. `350BC`)
- simple ranges (e.g. `420BC-400BC`)
- century formats (e.g. `1stC`, `3rdC BC`)
- century ranges (e.g. `1stC-3rdC`)
- BC/AD mixed ranges (e.g. `1stC BC-1stC AD`)
- dates with uncertainty words (`circa`, `late`, `early`)
- multiple dates separated by semicolons

By looking at the structure of each value instead of each individual string, we can design a cleaning function that handles every type consistently.


In [5]:
# Extract unique non-null Production date strings
unique_dates = bm_df["Production date"].dropna().unique()

print("Total unique date formats:", len(unique_dates))

# Show the first 50 unique examples to understand the range of patterns
print("\nFirst 50 unique date values:")
for val in unique_dates[:50]:
    print("-", val)


Total unique date formats: 1203

First 50 unique date values:
- 520BC-490BC (circa); 550BC - 500BC
- 420BC-400BC
- 300BC (circa)
- 447BC-432BC
- BC1069-747 (c.)
- 500 BC - 350 BC
- 7thC
- 300BC (after)
- 1250-1350 (circa)
- 16thC-17thC (?)
- 13thC
- 16thC - 19thC
- 1275-1300 (about)
- 9thc-19thc
- 9thc-10thc
- 9thc
- 595BC-550BC (circa); 590BC (circa)
- 595BC-550BC (circa)
- 550BC (circa)
- 300BC-200BC
- 1617-1620 (Previously said to be 1617-1620, due to incorrect identification of mark.); 1585-1590 (Given Nuremberg assay mark in use of 1576/77-1591 and reattribution to Christoph II Ritter, this object must be dated earlier than previously thought.)
- 450BC (circa)
- 4thC BC(late)
- 1971
- 1130-1140 (circa)
- 1882 (made)
- 10thC(late) (?); 11thC(early) (?)
- 1700-1799
- 1879 (modelled)
- 1779-1789
- 1750-1752 (circa)
- 1830 (circa)
- 1stC BC-1stC
- 1779 (circa)
- 1stC-2ndC
- 140-150
- 440 BC - 400 BC (circa)
- 250BC-150BC (circa)
- 217
- 1863; 1874
- 30BC-10BC
- 350BC (circa)
- 700BC-6
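
Reading the first few dozen strings by eye works for a first pass. To group the values by structure more systematically, one option is to collapse each string to a rough signature and count the signatures. The sketch below is an illustration only: `date_shape` is a hypothetical helper (not part of this notebook's pipeline), and the sample list reuses a few values already printed in this notebook.

```python
import re
from collections import Counter

def date_shape(value):
    """Hypothetical helper: reduce a raw date string to a rough structural signature."""
    s = str(value).lower()
    s = re.sub(r"\([^)]*\)", "(note)", s)  # bracketed notes -> fixed placeholder
    s = re.sub(r"\d+", "N", s)             # any run of digits -> N
    s = re.sub(r"\s+", " ", s).strip()     # collapse whitespace
    return s

# A few raw values taken from the outputs in this notebook
sample = [
    "520BC-490BC (circa)",
    "420BC-400BC",
    "447BC-432BC",
    "1892",
    "7thC",
    "9thc-10thc",
    "1830 (circa)",
]

shape_counts = Counter(date_shape(v) for v in sample)
for shape, count in shape_counts.most_common():
    print(f"{count:>3}  {shape}")
```

On the full column the same idea could be applied with `bm_df["Production date"].dropna().map(date_shape).value_counts()`, which would list the most common structures first.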

## Summary of common date patterns in the British Museum dataset

After reviewing a sample of unique values in the `Production date` column, I identified a set of recurring pattern types. These patterns cover the great majority of the values, even though there are over 1200 unique raw strings; a few extra formats (decades such as `1860s` and `AD`-prefixed ranges) only surface later, when the unparsed rows are inspected. Cleaning will follow these pattern groups rather than every individual value.

### **1. BC year ranges**
Examples:
- `520BC-490BC`
- `420BC-400BC`
- `447BC-432BC`
- `350BC-340BC`

These will be converted to negative start and end years.

### **2. Single BC years**
Examples:
- `300BC (circa)`
- `550BC (circa)`

We remove text like “circa” and convert to a single negative year.

### **3. BC ranges with unusual formatting**
Examples:
- `BC1069-747 (c.)`
- `500 BC - 350 BC`

These need punctuation and spacing cleaned before parsing.

### **4. AD year ranges**
Examples:
- `1250-1350 (circa)`
- `1275-1300`

Straightforward numeric extraction once extra words are removed.

### **5. AD single years**
Examples:
- `1892`
- `1830 (circa)`

Converted directly into start and end years that are the same.

### **6. Single centuries (AD)**
Examples:
- `7thC`
- `13thC`
- `9thc`

Converted into approximate year ranges (e.g. 7thC → 601–700).

### **7. Century ranges (AD)**
Examples:
- `16thC-17thC`
- `9thc-10thc`

We convert each century separately, then take the earliest start and latest end.

### **8. Century ranges with “early” or “late”**
Examples:
- `late 17thC - early 18thC`

Words like “early” and “late” are removed before converting.

### **9. Century ranges in BC**
Examples:
- `2ndC BC - 1stC BC`

Handled the same way as AD centuries but converted to negative years.

### **10. Multiple date values separated by semicolons**
Examples:
- `520BC-490BC (circa); 550BC - 500BC`

We take the **first** date or range consistently across the dataset.

### **11. Long curator notes embedded in the date**
Examples:
- `1617-1620 (Previously said to be...)`
- `1585-1590 (Given Nuremberg assay mark...)`

We remove everything after the main date or date range.

---

These categories will guide the cleaning function. The goal is to remove unnecessary text, standardise the date formats, and convert each value into numeric `start_year`, `end_year`, and `midpoint_year` fields.
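
The century rules in patterns 6 to 9 come down to simple arithmetic. Written out for a century number $n$, following the same convention as the `century_to_range` helper defined later in this notebook (BC years stored as negative numbers):

$$
\text{AD: } \bigl[\,100(n-1)+1,\; 100n\,\bigr]
\qquad
\text{BC: } \bigl[\,-100n,\; -\bigl(100(n-1)+1\bigr)\,\bigr]
$$

For example, the 7th century AD gives $[601, 700]$ and the 3rd century BC gives $[-300, -201]$. For a range such as `16thC-17thC`, each side is converted separately and the span runs from the smaller start to the larger end, here $[1501, 1700]$.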


## First cleaning step for `Production date`

The raw `Production date` values include a lot of extra words and punctuation such as:

- uncertainty or descriptive words (`circa`, `about`, `late`, `early`, `possibly`, `after`, etc)
- long curator notes in brackets
- extra spaces and inconsistent hyphens

Before trying to turn the dates into numbers, I am creating a simple cleaning function that:

1. Converts the text to lowercase.
2. Removes anything in brackets `(...)` (these are usually notes, not core dates).
3. Removes common “fluff” words such as `circa`, `about`, `late`, `early`, `made`, `dated`.
4. Standardises `bce` to `bc` and `ce` to `ad`.
5. Tidies spaces and hyphens so formats like `500 BC - 350 BC` become `500bc-350bc`.

This does **not** yet calculate years. It just produces a cleaner text version of the date in a new column called `prod_date_clean` so the later parsing step is easier to write and understand.


In [6]:
import pandas as pd
import re   # Using Python's regular expression module to clean text safely.
            # 're' lets us remove things like bracketed notes, words like 'circa',
            # and standalone 'c' without accidentally breaking parts of valid terms like 'bc'.

def clean_production_date(value):
    """
    Clean the raw 'Production date' text while preserving BC/AD correctly.
    This removes uncertainty words, long notes, punctuation variations,
    and normalises the structure before numerical parsing.
    """
    if pd.isna(value):
        return None

    text = str(value).strip().lower()

    # 1. Remove anything inside brackets (usually curator notes or uncertainty)
    text = re.sub(r"\([^)]*\)", "", text)

    # 2. Remove common 'fluff' words that do not affect the actual date
    fluff_words = [
        "circa", "about", "approx", "approximately",
        "early", "late",
        "possibly", "probably",
        "after", "before",
        "made", "dated"
    ]
    for word in fluff_words:
        text = re.sub(rf"\b{word}\b", "", text)

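    # 2b. Handle 'c' and 'c.' meaning circa, but only as separate words,
    # so we don't break valid terms like "bc".
    # Note: the first pattern only fires when a word character follows the dot
    # (e.g. 'c.1850'); otherwise the second pattern removes the 'c' and leaves the dot.
    text = re.sub(r"\bc\.\b", "", text)
    text = re.sub(r"\bc\b", "", text)

    # 3. Standardise BCE/CE to BC/AD
    # Caveat: str.replace works on substrings, so "ce" -> "ad" also rewrites
    # words that contain "ce" (for example "century" becomes "adntury").
    text = text.replace("bce", "bc")
    text = text.replace("ce", "ad")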

    # 4. Clean extra spaces
    text = re.sub(r"\s+", " ", text).strip()

    # 5. Remove spaces before BC/AD (e.g. "500 bc" -> "500bc")
    text = re.sub(r"\s*bc", "bc", text)
    text = re.sub(r"\s*ad", "ad", text)

    # 6. Standardise hyphens (e.g. "500 bc - 350 bc" -> "500bc-350bc")
    text = re.sub(r"\s*-\s*", "-", text)

    text = text.strip()

    return text if text else None

# Apply the improved cleaning function
bm_df["prod_date_clean"] = bm_df["Production date"].apply(clean_production_date)

# Preview the output
bm_df[["Production date", "prod_date_clean"]].head(20)


Unnamed: 0,Production date,prod_date_clean
0,520BC-490BC (circa); 550BC - 500BC,520bc-490bc ; 550bc-500bc
1,420BC-400BC,420bc-400bc
2,300BC (circa),300bc
3,447BC-432BC,447bc-432bc
4,,
5,,
6,,
7,,
8,,
9,,


## Extract a single primary date value

Some `Production date` values contain more than one date or range, separated by semicolons (for example `520bc-490bc ; 550bc-500bc`). 

For the purposes of this project, I'm using a simple and consistent rule:

- If there are multiple dates in a single field, I keep **only the first date or range**.

This makes it easier to convert the dates into numeric start and end years, and it keeps the logic transparent for the rest of the team. The result is stored in a new column called `prod_date_primary`.


In [7]:
def select_primary_date_segment(text):
    """
    Take a cleaned date string (prod_date_clean) and,
    if there are multiple parts separated by ';', keep only the first one.
    """
    if text is None:
        return None

    # Split on ';' and take the first part
    first_part = text.split(";")[0]

    # Strip extra spaces just in case
    first_part = first_part.strip()

    return first_part if first_part else None

# Create a new column with just the primary date segment
bm_df["prod_date_primary"] = bm_df["prod_date_clean"].apply(select_primary_date_segment)

# Preview a few examples where multiple dates existed
bm_df[["Production date", "prod_date_clean", "prod_date_primary"]].head(10)


Unnamed: 0,Production date,prod_date_clean,prod_date_primary
0,520BC-490BC (circa); 550BC - 500BC,520bc-490bc ; 550bc-500bc,520bc-490bc
1,420BC-400BC,420bc-400bc,420bc-400bc
2,300BC (circa),300bc,300bc
3,447BC-432BC,447bc-432bc,447bc-432bc
4,,,
5,,,
6,,,
7,,,
8,,,
9,,,
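
Keeping only the first segment is simple, but the alternatives are dropped without a trace. If the team later wants to see which rows were affected, one possible approach is to return the discarded segments alongside the primary one. This is a sketch under that assumption; `split_date_segments` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def split_date_segments(text):
    """
    Return (primary, alternatives) for a cleaned date string.
    Unlike select_primary_date_segment, empty segments are skipped,
    so the primary value is the first non-empty segment.
    """
    if text is None:
        return None, []
    parts = [p.strip() for p in text.split(";")]
    parts = [p for p in parts if p]
    if not parts:
        return None, []
    return parts[0], parts[1:]

primary, alternatives = split_date_segments("520bc-490bc ; 550bc-500bc")
print(primary)        # first segment, used for parsing
print(alternatives)   # remaining segments, kept only for auditing
```

Storing `len(alternatives)` in an extra column would be enough to filter for rows where a second date was available.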


## Convert cleaned date strings into numeric year ranges

Now that I have a cleaned and standardised date string in `prod_date_primary`, the next step is to convert each value into a numeric `start_year` and `end_year`.

This requires identifying which *pattern* the date represents. Examples include:

- single BC years (e.g. `300bc`)
- single AD years (e.g. `1830`)
- BC ranges (e.g. `520bc-490bc`)
- AD ranges (e.g. `1250-1350`)
- century formats (e.g. `7thc`, `13thc`)
- BC centuries (e.g. `3rdc bc`)
- century ranges (e.g. `16thc-17thc`)
- mixed BC/AD century ranges

The parsing function will:
1. Detect the correct pattern using simple string rules.
2. Extract the necessary numbers.
3. Convert BC dates into negative values.
4. Return `start_year`, `end_year`, and a `date_parse_status` flag so that the process is fully transparent for the team.


In [8]:
import pandas as pd
import numpy as np
# numpy is imported here for its missing-value marker (np.nan).
# The functions below return None for missing years; pandas stores these as NaN
# once the results are assigned to the numeric start_year/end_year columns.

import re            
# Using Python's regular expression module to detect patterns (BC/AD, centuries, ranges)
# and clean date text safely.


def century_to_range(n, era="AD"):
    """
    Convert a century number into an approximate year range.
    AD example: 7thC -> 601 to 700
    BC example: 3rdC BC -> -300 to -201
    """
    n = int(n)
    if era.upper() == "AD":
        start = (n - 1) * 100 + 1
        end = n * 100
    else:  # BC
        start = - (n * 100)
        end = - ((n - 1) * 100 + 1)
    return start, end


def parse_single_century(part):
    """
    Parse a single century expression like:
    - '7thc'
    - '3rdcbc'
    - '1stcad'
    Returns (start_year, end_year) or (None, None) if it cannot be parsed.
    """
    if part is None:
        return None, None

    text = part.strip().lower()

    # Detect BC/AD from suffix (if present)
    era = "AD"
    if text.endswith("bc"):
        era = "BC"
        text = text[:-2].strip()
    elif text.endswith("ad"):
        era = "AD"
        text = text[:-2].strip()

    text = text.replace("c", "")  # remove 'c' for 'century'

    # Extract the century number
    match = re.search(r"\d+", text)
    if not match:
        return None, None

    century_num = int(match.group(0))
    return century_to_range(century_num, era=era)


def parse_prod_date(text):
    """
    Parse a cleaned primary date string into start_year, end_year, and status.
    Handles:
    - single years (AD and BC)
    - year ranges (AD and BC)
    - BC prefixed ranges like 'bc1069-747'
    - century expressions and ranges
    """
    if text is None:
        return None, None, "no_date_text"

    t = text.strip().lower()

    # 1. BC prefix range like 'bc1069-747'
    match = re.match(r"^bc(\d+)-(\d+)$", t)
    if match:
        start = -int(match.group(1))
        end = -int(match.group(2))
        return start, end, "parsed_bc_range_prefix"

    # 2. BC range like '520bc-490bc'
    match = re.match(r"^(\d+)bc-(\d+)bc$", t)
    if match:
        start = -int(match.group(1))
        end = -int(match.group(2))
        return start, end, "parsed_bc_range"

    # 3. Single BC year '300bc'
    match = re.match(r"^(\d+)bc$", t)
    if match:
        year = -int(match.group(1))
        return year, year, "parsed_bc_single"

    # 4. AD year range '1250-1350'
    match = re.match(r"^(\d+)-(\d+)$", t)
    if match:
        start = int(match.group(1))
        end = int(match.group(2))
        return start, end, "parsed_ad_range"

    # 5. Single AD year '1892'
    match = re.match(r"^(\d+)$", t)
    if match:
        year = int(match.group(1))
        return year, year, "parsed_ad_single"

    # 6. Century formats ('7thc', '3rdcbc', '16thc-17thc')
    if "c" in t:
        if "-" in t:
            left, right = t.split("-", 1)
            left_start, left_end = parse_single_century(left)
            right_start, right_end = parse_single_century(right)
            if left_start is not None and right_start is not None:
                start = min(left_start, right_start)
                end = max(left_end, right_end)
                return start, end, "parsed_century_range"
            else:
                return None, None, "unparsed_century_range"
        else:
            c_start, c_end = parse_single_century(t)
            if c_start is not None:
                return c_start, c_end, "parsed_century_single"
            else:
                return None, None, "unparsed_century"

    # If nothing matched:
    return None, None, "unparsed"


# Apply the parser to the primary cleaned date
parsed_results = bm_df["prod_date_primary"].apply(parse_prod_date)

bm_df["start_year"], bm_df["end_year"], bm_df["date_parse_status"] = zip(*parsed_results)


# Calculate midpoint
def compute_midpoint(row):
    s = row["start_year"]
    e = row["end_year"]
    if pd.isna(s) or pd.isna(e):
        return None
    return int(round(s + (e - s) / 2))

bm_df["midpoint_year"] = bm_df.apply(compute_midpoint, axis=1)

# Preview the output
bm_df[[
    "Production date",
    "prod_date_primary",
    "start_year",
    "end_year",
    "midpoint_year",
    "date_parse_status"
]].head(20)


Unnamed: 0,Production date,prod_date_primary,start_year,end_year,midpoint_year,date_parse_status
0,520BC-490BC (circa); 550BC - 500BC,520bc-490bc,-520.0,-490.0,-505.0,parsed_bc_range
1,420BC-400BC,420bc-400bc,-420.0,-400.0,-410.0,parsed_bc_range
2,300BC (circa),300bc,-300.0,-300.0,-300.0,parsed_bc_single
3,447BC-432BC,447bc-432bc,-447.0,-432.0,-440.0,parsed_bc_range
4,,,,,,no_date_text
5,,,,,,no_date_text
6,,,,,,no_date_text
7,,,,,,no_date_text
8,,,,,,no_date_text
9,,,,,,no_date_text
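
One detail of `compute_midpoint` worth knowing: Python's built-in `round` rounds exact halves to the nearest even integer, so a midpoint ending in .5 can round either down or up. The sketch below repeats the same formula in a standalone helper (named `midpoint` here purely for illustration) and applies it to spans that appear in this notebook.

```python
def midpoint(start, end):
    """Same formula as compute_midpoint, without the DataFrame row."""
    return int(round(start + (end - start) / 2))

examples = {
    (-447, -432): midpoint(-447, -432),   # exact midpoint is -439.5
    (601, 700): midpoint(601, 700),       # exact midpoint is 650.5
    (1950, 1969): midpoint(1950, 1969),   # exact midpoint is 1959.5
}

for (start, end), mid in examples.items():
    print(f"{start} to {end}: midpoint {mid}")
```

This is why the span 447BC-432BC shows a midpoint of -440 in the preview. At the scale of a museum timeline the half-year difference is negligible, but it is useful to know when spot-checking values.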


## Check parsing coverage

Now that the dates have been parsed into numeric start and end years, I want to check how many rows were successfully parsed and how many still need attention. This helps to show how complete the cleaned date field is before moving on to using `Culture` as a fallback for missing dates.


In [9]:
# How many rows fall into each parsing status?
status_counts = bm_df["date_parse_status"].value_counts(dropna=False)
print(status_counts)

# Optional: look at a few examples that were not parsed
bm_df[bm_df["date_parse_status"].str.startswith("unparsed")].head(10)[
    ["Production date", "prod_date_primary", "date_parse_status"]
]


date_parse_status
parsed_bc_range           1243
no_date_text               787
parsed_century_single      677
parsed_century_range       527
parsed_bc_single           455
parsed_ad_range            379
parsed_ad_single           344
unparsed                    45
parsed_bc_range_prefix       1
Name: count, dtype: int64


Unnamed: 0,Production date,prod_date_primary,date_parse_status
228,1950s-1960s,1950s-1960s,unparsed
232,1950s-1960s,1950s-1960s,unparsed
236,late 16th century,16thadntury,unparsed
241,1950s-1960s,1950s-1960s,unparsed
323,1860s (before),1860s,unparsed
329,1890s,1890s,unparsed
597,1980s,1980s,unparsed
733,1860s (before),1860s,unparsed
761,1880s (?),1880s,unparsed
768,18th century - 19th century,18thadntury-19thadntury,unparsed
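
The garbled `16thadntury` values in this preview come from the blanket `ce` to `ad` replacement in the first cleaning function, which also rewrites the `ce` at the start of `century`. A minimal sketch of a stricter replacement is shown here, assuming era markers only ever appear as standalone tokens or directly after a number. `normalise_era` is a hypothetical name, not a function used in this notebook.

```python
import re

def normalise_era(text):
    """Rewrite 'bce' -> 'bc' and 'ce' -> 'ad' only when they act as era markers."""
    # (?<![a-z]) : not preceded by a letter, so 'since' or 'once' are untouched
    # \b         : must end at a word boundary, so 'century' is untouched
    text = re.sub(r"(?<![a-z])bce\b", "bc", text)
    text = re.sub(r"(?<![a-z])ce\b", "ad", text)
    return text

blanket = "late 16th century".replace("ce", "ad")   # substring replacement
strict = normalise_era("late 16th century")         # boundary-aware replacement

print(blanket)
print(strict)
print(normalise_era("200ce"), "|", normalise_era("100 bce"))
```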


## Handle decade-based dates (e.g. `1860s`, `1950s-1960s`)

Some rows in the dataset use decades instead of specific years, for example:
- `1860s`
- `1890s`
- `1950s-1960s`

To keep things simple and consistent, I am treating decades as year ranges:

- `1860s` → 1860 to 1869  
- `1950s-1960s` → 1950 to 1969  

These will be parsed into `start_year` and `end_year` in the same way as the other date formats and marked with their own `date_parse_status` values.


In [10]:
import pandas as pd
import numpy as np   # numpy works with pandas to represent missing numeric values (np.nan)
import re            # regular expressions for detecting patterns in date strings


def clean_production_date(value):
    """
    Clean the raw 'Production date' text while preserving BC/AD correctly.
    This removes uncertainty words, long notes, 'century' wording,
    and normalises the structure before numerical parsing.
    """
    if pd.isna(value):
        return None

    text = str(value).strip().lower()

    # 1. Remove anything inside brackets (usually curator notes or uncertainty)
    text = re.sub(r"\([^)]*\)", "", text)

    # 2. Remove common 'fluff' words that do not affect the actual date
    fluff_words = [
        "circa", "about", "approx", "approximately",
        "early", "late",
        "possibly", "probably",
        "after", "before",
        "made", "dated"
    ]
    for word in fluff_words:
        text = re.sub(rf"\b{word}\b", "", text)

    # 2b. Handle 'c' and 'c.' meaning circa, but only as separate words,
    # so we don't break valid terms like "bc".
    text = re.sub(r"\bc\.\b", "", text)
    text = re.sub(r"\bc\b", "", text)

    # 2c. Simplify the word 'century' to 'c' so that
    # '18th century - 19th century' becomes '18th c - 19th c'
    text = re.sub(r"\bcentury\b", "c", text)

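    # 3. Standardise BCE/CE to BC/AD
    # 'ce' is only rewritten when it is a standalone era marker (optionally
    # attached to a number), so words that merely contain "ce" are left alone.
    text = text.replace("bce", "bc")
    text = re.sub(r"(?<![a-z])ce\b", "ad", text)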

    # 4. Clean extra spaces
    text = re.sub(r"\s+", " ", text).strip()

    # 5. Remove spaces before BC/AD (e.g. "500 bc" -> "500bc")
    text = re.sub(r"\s*bc", "bc", text)
    text = re.sub(r"\s*ad", "ad", text)

    # 6. Standardise hyphens (e.g. "500 bc - 350 bc" -> "500bc-350bc")
    text = re.sub(r"\s*-\s*", "-", text)

    text = text.strip()

    return text if text else None


# Re-apply cleaning to get an updated cleaned column
bm_df["prod_date_clean"] = bm_df["Production date"].apply(clean_production_date)

# Recreate primary segment (first date if there are semicolons)
def select_primary_date_segment(text):
    if text is None:
        return None
    first_part = text.split(";")[0].strip()
    return first_part if first_part else None

bm_df["prod_date_primary"] = bm_df["prod_date_clean"].apply(select_primary_date_segment)


def century_to_range(n, era="AD"):
    """
    Convert a century number into an approximate year range.
    AD: 7thC -> 601 to 700
    BC: 3rdC BC -> -300 to -201
    """
    n = int(n)
    if era.upper() == "AD":
        start = (n - 1) * 100 + 1
        end = n * 100
    else:  # BC
        start = - (n * 100)
        end = - ((n - 1) * 100 + 1)
    return start, end


def parse_single_century(part):
    """
    Parse a single century expression like:
    - '7thc'
    - '3rdcbc'
    - '1stcad'
    Returns (start_year, end_year) or (None, None).
    """
    if part is None:
        return None, None

    text = part.strip().lower()

    # Detect BC/AD from suffix (if present)
    era = "AD"
    if text.endswith("bc"):
        era = "BC"
        text = text[:-2].strip()
    elif text.endswith("ad"):
        era = "AD"
        text = text[:-2].strip()

    # Remove the 'c' for century if present
    text = text.replace("c", "")

    # Extract the century number
    match = re.search(r"\d+", text)
    if not match:
        return None, None

    century_num = int(match.group(0))
    return century_to_range(century_num, era=era)


def parse_prod_date(text):
    """
    Parse a cleaned primary date string into start_year, end_year, and status.
    Handles:
    - single years (AD and BC)
    - year ranges (AD and BC)
    - BC prefixed ranges like 'bc1069-747'
    - century expressions and ranges
    - decade formats like '1860s' or '1950s-1960s'
    """
    if text is None:
        return None, None, "no_date_text"

    t = text.strip().lower()

    # 1. BC prefix range like 'bc1069-747'
    match = re.match(r"^bc(\d+)-(\d+)$", t)
    if match:
        start = -int(match.group(1))
        end = -int(match.group(2))
        return start, end, "parsed_bc_range_prefix"

    # 2. BC range like '520bc-490bc'
    match = re.match(r"^(\d+)bc-(\d+)bc$", t)
    if match:
        start = -int(match.group(1))
        end = -int(match.group(2))
        return start, end, "parsed_bc_range"

    # 3. Single BC year '300bc'
    match = re.match(r"^(\d+)bc$", t)
    if match:
        year = -int(match.group(1))
        return year, year, "parsed_bc_single"

    # 4. AD year range '1250-1350'
    match = re.match(r"^(\d+)-(\d+)$", t)
    if match:
        start = int(match.group(1))
        end = int(match.group(2))
        return start, end, "parsed_ad_range"

    # 5. Single AD year '1892'
    match = re.match(r"^(\d+)$", t)
    if match:
        year = int(match.group(1))
        return year, year, "parsed_ad_single"

    # 6. Decade range '1950s-1960s'
    match = re.match(r"^(\d{3,4})s-(\d{3,4})s$", t)
    if match:
        start_decade = int(match.group(1))
        end_decade = int(match.group(2))
        start = start_decade
        end = end_decade + 9  # e.g. 1960s -> 1969
        return start, end, "parsed_decade_range"

    # 7. Single decade '1860s'
    match = re.match(r"^(\d{3,4})s$", t)
    if match:
        start = int(match.group(1))
        end = start + 9
        return start, end, "parsed_decade_single"

    # 8. Century formats ('7thc', '3rdcbc', '16thc-17thc')
    if "c" in t:
        if "-" in t:
            left, right = t.split("-", 1)
            left_start, left_end = parse_single_century(left)
            right_start, right_end = parse_single_century(right)
            if left_start is not None and right_start is not None:
                start = min(left_start, right_start)
                end = max(left_end, right_end)
                return start, end, "parsed_century_range"
            else:
                return None, None, "unparsed_century_range"
        else:
            c_start, c_end = parse_single_century(t)
            if c_start is not None:
                return c_start, c_end, "parsed_century_single"
            else:
                return None, None, "unparsed_century"

    # If nothing matched
    return None, None, "unparsed"


# Apply the parser to the primary cleaned date
parsed_results = bm_df["prod_date_primary"].apply(parse_prod_date)
bm_df["start_year"], bm_df["end_year"], bm_df["date_parse_status"] = zip(*parsed_results)


# Recalculate midpoint
def compute_midpoint(row):
    s = row["start_year"]
    e = row["end_year"]
    if pd.isna(s) or pd.isna(e):
        return None
    return int(round(s + (e - s) / 2))

bm_df["midpoint_year"] = bm_df.apply(compute_midpoint, axis=1)

# Quick check of parsing statuses
bm_df["date_parse_status"].value_counts(dropna=False)


date_parse_status
parsed_bc_range           1243
no_date_text               787
parsed_century_single      680
parsed_century_range       534
parsed_bc_single           455
parsed_ad_range            379
parsed_ad_single           344
parsed_decade_single        19
unparsed                    11
parsed_decade_range          5
parsed_bc_range_prefix       1
Name: count, dtype: int64
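
A caveat about the century branch of `parse_prod_date`: it is entered whenever the string contains the letter `c`, and that includes any leftover string containing `bc`. A hypothetical mixed range such as `50bc-50` (not a value confirmed to be in the dataset) would therefore be read as the 50th century BC to the 50th century AD. One way to guard against this, sketched here with assumed helper names `is_century_token` and `is_century_expression`, is to require an ordinal suffix before the `c`.

```python
import re

# Ordinal number, then 'c', then an optional era suffix: '7thc', '3rdcbc', '1stcad'
CENTURY_TOKEN = re.compile(r"^\d+(st|nd|rd|th)\s*c\s*(bc|ad)?$")

def is_century_token(part):
    """True if a single side of a range looks like a century expression."""
    return bool(CENTURY_TOKEN.match(part.strip().lower()))

def is_century_expression(text):
    """True if every side of the (optional) range is a century token."""
    return all(is_century_token(p) for p in text.split("-"))

for candidate in ["7thc", "3rdcbc", "1stcbc-1stc", "16thc-17thc", "50bc-50"]:
    print(candidate, "->", is_century_expression(candidate))
```

Checking `is_century_expression(t)` instead of `"c" in t` would send strings like `50bc-50` to the `unparsed` bucket, where they can be reviewed by hand.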

## Inspect the remaining `unparsed` date values

Most of the *Production date* values have now been cleaned and parsed, but there are still a small number left with the status `unparsed`. Before deciding what to do with them, I want to look at these entries directly.

Seeing the original `Production date` next to the cleaned `prod_date_primary` will help me understand:

- why these particular values did not match any of the patterns  
- whether they can be fixed with one more small rule  
- or whether they are too unclear or inconsistent to parse automatically  

This step will help me decide whether to update the parser again or leave these few entries as genuinely unparseable.


In [11]:
# Look at the rows where the date could not be parsed
unparsed_rows = bm_df[bm_df["date_parse_status"] == "unparsed"]

# Show the original and cleaned date text so we can see what is going on
unparsed_rows[["Production date", "prod_date_primary"]].head(20)


Unnamed: 0,Production date,prod_date_primary
926,AD200-300 (?),ad200-300
1644,AD100-250,ad100-250
1696,20th C.,20th .
1952,54-59 (Accession Portrait style (II)),54-59 )
2479,20th C.,20th .
3072,AD 120 - 140 (after a Greek original of about ...,ad 120-140
3092,AD 120 - 140 (after a Greek original of about ...,ad 120-140
3103,AD50-100,ad50-100
3173,AD50-100,ad50-100
3712,"1778 (terracotta modelled (see production, not...",1778 )


## Summary of the remaining `unparsed` values

After checking the `unparsed` rows, most of them are actually valid dates that narrowly missed the parser's patterns. The issues fall into a few small categories:

1. AD ranges written with an `AD` prefix (e.g. `AD100-250`), which none of the parser patterns accept  
2. Centuries written as "20th C." where the cleaning step removed the "c" as if it meant circa, leaving `20th .`  
3. `AD`-prefixed ranges with a space after the prefix (e.g. `AD 120 - 140`), which are left as `ad 120-140`  
4. Notes that contain nested brackets, where the bracket removal leaves a stray `)` behind (e.g. `54-59 )`)

These can be fixed with a small refinement step applied to `prod_date_primary` after cleaning, so that:
- "20th ." becomes "20thc"
- trailing non-alphanumeric characters are removed
- AD ranges written as "ad100-250" are treated the same as "100-250"
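
The stray `)` in values such as `54-59 )` appears because the pattern `\([^)]*\)` stops at the first closing bracket, so a note that itself contains brackets is only partly removed. An alternative to trimming the leftovers afterwards, sketched here as an assumption rather than what this notebook does, is to remove innermost brackets repeatedly until none are left. `strip_brackets` is a hypothetical helper.

```python
import re

# Matches a bracket pair that contains no other brackets
INNERMOST = re.compile(r"\([^()]*\)")

def strip_brackets(text):
    """Remove bracketed notes, including nested ones, by peeling innermost pairs."""
    previous = None
    while previous != text:
        previous = text
        text = INNERMOST.sub("", text)
    return text.strip()

raw = "54-59 (Accession Portrait style (II))"
single_pass = re.sub(r"\([^)]*\)", "", raw)   # pattern used in the cleaning function
nested_safe = strip_brackets(raw)

print(repr(single_pass))
print(repr(nested_safe))
```

Unbalanced brackets, for example a note cut off before its closing `)`, would still leave residue, so a final trailing-character clean-up remains a sensible safety net.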


In [13]:
# Small refinement step for the remaining 'unparsed' dates

def refine_primary_date(text):
    """
    Make small fixes to the cleaned primary date string to catch
    a few awkward cases that were still unparsed, such as:
    - values starting with 'ad' (e.g. 'ad100-250', 'ad 120-140')
    - centuries like '20th C.' that became '20th .'
    - stray brackets or punctuation at the end of the string
    """
    if text is None:
        return None

    t = str(text).strip().lower()

    # 1. Turn 'ad100' or 'ad 100' into '100', and 'ad 100-250' into '100-250'
    #    This helps the parser treat them like normal AD ranges.
    t = re.sub(r"\bad\s*(\d+)", r"\1", t)

    # 2. Fix '20th C.' / '20th .' style centuries to '20thc'
    #    This makes them consistent with other 'Nthc' patterns.
    t = re.sub(r"\b(\d+)(st|nd|rd|th)\s*\.?\s*$", r"\1\2c", t)

    # 3. Remove any stray non-letter/non-digit characters at the very end
    #    e.g. '54-59 )' -> '54-59', '1778 )' -> '1778'
    t = re.sub(r"[^\w]+$", "", t)

    t = t.strip()
    return t if t else None


# Apply the refinement to the primary cleaned date column
bm_df["prod_date_primary"] = bm_df["prod_date_primary"].apply(refine_primary_date)

# Re-run the parser on the refined primary dates
parsed_results = bm_df["prod_date_primary"].apply(parse_prod_date)
bm_df["start_year"], bm_df["end_year"], bm_df["date_parse_status"] = zip(*parsed_results)

# Recalculate midpoint using the existing helper
bm_df["midpoint_year"] = bm_df.apply(compute_midpoint, axis=1)

# Check parsing statuses again to see how many 'unparsed' are left
bm_df["date_parse_status"].value_counts(dropna=False)


date_parse_status
parsed_bc_range           1243
no_date_text               787
parsed_century_single      681
parsed_century_range       534
parsed_bc_single           456
parsed_ad_range            387
parsed_ad_single           345
parsed_decade_single        19
parsed_decade_range          5
parsed_bc_range_prefix       1
Name: count, dtype: int64
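
As a quick sanity check, the three regex fixes can be tried in isolation on the sample values from the `unparsed` table. This is a minimal standalone sketch: `refine_demo` is a hypothetical, condensed copy of `refine_primary_date` written only for illustration, and the inputs are the cleaned strings shown earlier.

```python
import re

def refine_demo(text):
    """Condensed, illustrative copy of the three refinement steps."""
    t = str(text).strip().lower()
    # 1. Drop a leading 'ad' before a number: 'ad100-250' -> '100-250'
    t = re.sub(r"\bad\s*(\d+)", r"\1", t)
    # 2. Restore the century marker: '20th .' -> '20thc'
    t = re.sub(r"\b(\d+)(st|nd|rd|th)\s*\.?\s*$", r"\1\2c", t)
    # 3. Strip trailing punctuation and spaces: '54-59 )' -> '54-59'
    t = re.sub(r"[^\w]+$", "", t)
    return t.strip() or None

samples = ["ad200-300", "ad 120-140", "20th .", "54-59 )", "1778 )"]
refined = {s: refine_demo(s) for s in samples}
for before, after in refined.items():
    print(f"{before!r:>14} -> {after!r}")
```

Each sample should come out in a form the existing parser already recognises (a plain range, a single year, or an `Nthc` century).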

## Explore rows with missing dates (`no_date_text`)

Now that all the date strings that can be parsed have been handled, the only remaining gaps are rows where the *Production date* field was genuinely empty or too incomplete to clean. These rows are labelled as `no_date_text`.

The next step is to look at these missing-date rows and see whether the `Culture` column gives us any useful historical information. Some cultures (for example "26th Dynasty" or "Ming Dynasty") have well-established date ranges that could potentially be used as a fallback.

However, some culture labels are too broad or ambiguous on their own, such as “Late Period”, “Islamic”, or “Ancient Egypt”. In these cases, it’s important to check the other contextual fields in the dataset, especially **Production place** and **Find spot**. These can confirm which geographical area the record belongs to and help determine which historical chronology the culture term is referring to. This avoids assigning a misleading or incorrect date range.

Before doing any mapping, I will:

1. Filter the rows with `no_date_text`.  
2. Check how many of these missing-date rows also have a value in the `Culture` column.  
3. Look briefly at the `Production place` and `Find spot` values for ambiguous cultures.  
4. List the cultures that appear in these rows so that I can decide which ones are suitable for approximate date mapping.

This will give a clearer picture of how many missing dates can be reasonably inferred using cultural and geographical context, and which ones should be left as unknown.


In [14]:
# Rows where no date could be parsed (genuinely missing dates)
missing_dates = bm_df[bm_df["date_parse_status"] == "no_date_text"]

# How many total?
total_missing = len(missing_dates)
print("Total rows with no date text:", total_missing)

# How many of these have a Culture value?
missing_with_culture = missing_dates[missing_dates["Culture"].notna()]
print("Rows with missing date but *with* a Culture value:", len(missing_with_culture))

# For culture-based inference, it helps to see Culture alongside Production place and Find spot
print("\nSample of missing-date rows with Culture, Production place, and Find spot:")
missing_with_culture[["Culture", "Production place", "Find spot"]].head(20)


Total rows with no date text: 787
Rows with missing date but *with* a Culture value: 709

Sample of missing-date rows with Culture, Production place, and Find spot:


Unnamed: 0,Culture,Production place,Find spot
4,26th Dynasty,,Excavated/Findspot: Tell Nabasha
5,Third Intermediate; Late Period,,Found/Acquired: Egypt
6,Late Period,,Found/Acquired: Egypt
7,Third Intermediate; 26th Dynasty,,Found/Acquired: Egypt
8,26th Dynasty,,Excavated/Findspot: Tell Nabasha
9,Ancient Egypt,,Found/Acquired: Egypt
10,26th Dynasty,,Excavated/Findspot: Tell Nabasha
11,26th Dynasty,,Excavated/Findspot: Tell Nabasha
12,Late Period,,Found/Acquired: Egypt
13,Late Period,,Excavated/Findspot: Tarsus


## Early observations from missing-date rows

After looking at a sample of rows with missing Production date values, most of the Culture entries point clearly to Egyptian historical periods. These include “26th Dynasty”, “Third Intermediate”, and “Late Period”, all of which have well-established date ranges in Egyptology. The associated Find spot and Production place fields (for example “Tell Nabasha”, “Saqqara”, “Amarna”, and “Egypt”) mostly confirm that these objects belong to an Egyptian context, so mapping these cultures to approximate date ranges is fairly safe. One exception in the sample above is row 13, a “Late Period” object with the Find spot “Tarsus”, which lies in modern Turkey rather than Egypt; rows like this are worth checking manually.

However, a few culture labels are too broad or ambiguous to map automatically. For example, “Islamic” can refer to many different centuries and regions, so this one will not be assigned a date. “Ancient Egypt” is also extremely broad (spanning several millennia), so this will need to be reviewed manually later if we want to infer a date range from context.

To avoid introducing incorrect assumptions, it helps to look at the “Production place” and “Find spot” fields when culture labels are unclear. These fields give geographical information that confirms which historical chronology the culture term is referring to.
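
One way to make this check systematic is to count Find spot values per ambiguous Culture label. The sketch below uses a small hypothetical DataFrame (`toy`) in place of the real missing-date rows, and the `ambiguous` shortlist is my own assumption rather than a fixed list from the dataset. The same pattern could be repeated for `Production place`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the missing-date subset
toy = pd.DataFrame({
    "Culture": ["Late Period", "Late Period", "Late Period",
                "Islamic", "Ancient Egypt"],
    "Find spot": [
        "Found/Acquired: Egypt",
        "Found/Acquired: Egypt",
        "Excavated/Findspot: Tarsus",
        None,
        "Found/Acquired: Egypt",
    ],
})

# Assumed shortlist of labels that need a context check
ambiguous = ["Late Period", "Islamic", "Ancient Egypt"]

subset = toy[toy["Culture"].isin(ambiguous)].copy()
# Make missing find spots visible instead of silently dropping them
subset["Find spot"] = subset["Find spot"].fillna("(not recorded)")

# Number of rows for each Culture / Find spot combination
context_counts = subset.groupby(["Culture", "Find spot"]).size()
print(context_counts)
```

A label whose rows all share one region is a better mapping candidate than one spread across several regions or with no find spot recorded.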

Before building the mapping table, the next step is to group all missing-date rows by Culture and see how many records fall into each one. This will show which cultures are suitable for approximate date mapping and which should remain as genuinely unknown.


In [15]:
# Group missing-date rows (where Culture is present) by Culture
culture_counts = missing_with_culture["Culture"].value_counts()

print("Unique cultures among missing-date rows:")
culture_counts


Unique cultures among missing-date rows:


Culture
Late Period                  89
18th Dynasty                 73
Romano-British               48
21st Dynasty                 44
Ancient Egypt                39
                             ..
Naqada II; Predynastic        1
Late Period; 22nd Dynasty     1
Kushite; Napatan              1
Moche                         1
1st Dynasty                   1
Name: count, Length: 85, dtype: int64

## Planning the culture-based date mapping

Now that I have a full list of all Culture values linked to missing Production dates, the next step is to decide which of these cultures can be safely assigned an approximate historical date range. This will allow us to fill some of the gaps for objects where the date field is empty but the cultural context is clear.

The cultures in this dataset fall into three broad groups:

#### 1. Egyptian dynasties and periods (safe to map)
A large number of missing-date records belong to clearly defined Egyptian historical periods such as the 18th Dynasty, 26th Dynasty, Late Period, or the Third Intermediate Period. These have well-established date ranges in Egyptology, and the related “Find spot” and “Production place” fields confirm that these objects come from Egyptian sites.  
Because of this, assigning approximate start and end years is considered safe and appropriate for analysis.

These labels also include some combined values (for example “Third Intermediate; Late Period”, or “19th Dynasty; 20th Dynasty”). For these, I will take the earliest known start date and the latest known end date to create a single broad range.

#### 2. Culture labels that are too broad to map automatically
Some cultures are too general or span too many centuries to be mapped reliably using a single date range. For example:

- **Ancient Egypt** (covers roughly 3000 years)
- **Islamic** (varies widely by region and period)
- Broad prehistoric terms such as “Neolithic”, “Bronze Age”, or “Iron Age”, where meanings differ by region

These will be kept as unknown dates for now. If needed later, they can be reviewed manually and assigned approximate ranges object by object.

#### 3. Cultures outside Egypt that require careful research
A small number of cultures (for example “Romano-British”, “Natufian”, “Ammonite”) belong to entirely different historical traditions with different chronological frameworks. Since this project focuses on a consistent and cautious approach, these will not be assigned dates unless we have clear evidence about which regional chronology they refer to.

---

### What happens next
The next step is to build a mapping table that assigns approximate start and end years for the main **Egyptian dynasties and major periods**, since these represent the largest and most reliable group.  
Cultures that are ambiguous or too broad will be left unmapped so they do not introduce inaccurate data into the analysis.

Once the mapping table is created, I will apply it to the missing-date rows and fill in the start, end, and midpoint fields where appropriate.


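### Culture → Date Range Mapping Table (mainly Egyptian periods)

These ranges come from standard Egyptology chronologies. They are approximate, but consistent enough for timeline analysis.  
Only cultures that are clearly Egyptian and appear in the missing-date rows are listed here. The code dictionary below additionally maps two non-Egyptian cultures with clearly defined periods: “Classic Maya” (AD 250 to 900) and “Aztec” (AD 1325 to 1521).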

#### Egyptian Dynasties
- **17th Dynasty** → 1650 to 1550 BC  
- **18th Dynasty** → 1550 to 1292 BC  
- **19th Dynasty** → 1292 to 1189 BC  
- **20th Dynasty** → 1189 to 1069 BC  
- **21st Dynasty** → 1069 to 945 BC  
- **26th Dynasty** → 664 to 525 BC  
- **30th Dynasty** → 380 to 343 BC  

#### Egyptian Periods
- **Old Kingdom** → 2686 to 2181 BC  
- **Middle Kingdom** → 2055 to 1650 BC  
- **New Kingdom** → 1550 to 1069 BC  
- **Late Period** → 664 to 332 BC  
- **Third Intermediate Period** → 1069 to 664 BC  
- **Ptolemaic** → 332 to 30 BC  

#### Combined Labels (take earliest start and latest end)
- **“Third Intermediate; Late Period”** → 1069 to 332 BC  
- **“19th Dynasty; 20th Dynasty”** → 1292 to 1069 BC  
- **“30th Dynasty; Ptolemaic”** → 380 to 30 BC  
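
The "earliest start, latest end" rule can also be computed rather than typed by hand. The following is a minimal sketch under that assumption: `combined_range` and `single_ranges` are hypothetical names, and the dictionary holds only a subset of the single-label ranges listed above.

```python
# Subset of the single-label ranges listed above (BC years as negatives)
single_ranges = {
    "19th Dynasty": (-1292, -1189),
    "20th Dynasty": (-1189, -1069),
    "26th Dynasty": (-664, -525),
    "30th Dynasty": (-380, -343),
    "Late Period": (-664, -332),
    "Third Intermediate": (-1069, -664),
    "Ptolemaic": (-332, -30),
}

def combined_range(label, ranges=single_ranges):
    """Return (earliest start, latest end) for a ';'-separated label.

    Returns None if any part of the label has no known range, so that
    partly ambiguous labels stay unmapped.
    """
    parts = [p.strip() for p in label.split(";")]
    if not all(p in ranges for p in parts):
        return None
    starts = [ranges[p][0] for p in parts]
    ends = [ranges[p][1] for p in parts]
    return min(starts), max(ends)

for label in ["Third Intermediate; Late Period",
              "19th Dynasty; 20th Dynasty",
              "Third Intermediate; 26th Dynasty",
              "Late Period; Islamic"]:
    print(label, "->", combined_range(label))
```

With this approach, a label such as “Third Intermediate; 26th Dynasty” (which appears in the missing-date rows but not in the hand-written list) would resolve to a span as well. Whether such a broad span is useful for the timeline remains a judgement call.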

#### Labels we will *not* map
These are too broad or ambiguous:
- Ancient Egypt  
- Islamic  
- Romano-British  
- Natufian  
- Ammonite  
- Any prehistoric or regional periods like Late Bronze Age, Iron Age, etc.

These will remain unmapped and marked as missing.


In [16]:
# Dictionary mapping Culture → (start_year, end_year)
# BC dates are negative, AD/CE dates are positive.

culture_date_map = {

    # Egyptian dynasties
    "17th Dynasty": (-1650, -1550),
    "18th Dynasty": (-1550, -1292),
    "19th Dynasty": (-1292, -1189),
    "20th Dynasty": (-1189, -1069),
    "21st Dynasty": (-1069, -945),
    "26th Dynasty": (-664, -525),
    "30th Dynasty": (-380, -343),

    # Egyptian major periods
    "Old Kingdom": (-2686, -2181),
    "Middle Kingdom": (-2055, -1650),
    "New Kingdom": (-1550, -1069),
    "Late Period": (-664, -332),
    "Third Intermediate": (-1069, -664),
    "Third Intermediate Period": (-1069, -664),
    "Ptolemaic": (-332, -30),

    # Combined Egyptian labels
    "Third Intermediate; Late Period": (-1069, -332),
    "19th Dynasty; 20th Dynasty": (-1292, -1069),
    "30th Dynasty; Ptolemaic": (-380, -30),

    # Non-Egyptian but clearly defined periods
    "Classic Maya": (250, 900),       # AD 250–900
    "Aztec": (1325, 1521),            # AD 1325–1521
}


In [17]:
# Apply culture-based date mapping to no-date rows
def map_dates_from_culture(row):
    if pd.notna(row["start_year"]):  # If date already parsed, keep it
        return row["start_year"], row["end_year"], row["midpoint_year"]
    
    culture = row["Culture"]
    if culture in culture_date_map:
        start, end = culture_date_map[culture]
        mid = start + (end - start) / 2
        return start, end, mid
    
    return row["start_year"], row["end_year"], row["midpoint_year"]

bm_df["start_year"], bm_df["end_year"], bm_df["midpoint_year"] = zip(
    *bm_df.apply(map_dates_from_culture, axis=1)
)
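
Because culture-mapped rows keep the status `no_date_text`, it may help downstream users to record separately where each date came from. The sketch below shows one possible way to do that on a hypothetical toy DataFrame. The `date_source` column and its labels are assumptions for illustration and are not part of the exported dataset.

```python
import pandas as pd

# Two entries copied from the mapping dictionary above
culture_date_map = {"26th Dynasty": (-664, -525), "Late Period": (-664, -332)}

# Hypothetical toy rows: only the third one already has a parsed date
toy = pd.DataFrame({
    "Culture": ["26th Dynasty", "Islamic", "Late Period", None],
    "start_year": [float("nan"), float("nan"), -600.0, float("nan")],
    "end_year": [float("nan"), float("nan"), -500.0, float("nan")],
})

# Rows with no parsed date, and those among them with a mapped Culture
needs_date = toy["start_year"].isna()
mappable = needs_date & toy["Culture"].isin(list(culture_date_map))

# Record provenance before filling anything in
toy["date_source"] = "parsed"
toy.loc[needs_date, "date_source"] = "unknown"
toy.loc[mappable, "date_source"] = "culture_inferred"

# Fill start and end years only for the mappable rows
mapped = [culture_date_map[c] for c in toy.loc[mappable, "Culture"]]
toy.loc[mappable, "start_year"] = [start for start, _ in mapped]
toy.loc[mappable, "end_year"] = [end for _, end in mapped]
toy["midpoint_year"] = (toy["start_year"] + toy["end_year"]) / 2

print(toy)
```

Keeping inferred dates distinguishable from parsed ones means a timeline can show them differently or exclude them.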


In [18]:
# Look at the exact Culture values for rows with missing dates
unique_missing_cultures = missing_with_culture["Culture"].dropna().unique()

print("Number of unique Culture values among missing-date rows:", len(unique_missing_cultures))
unique_missing_cultures[:50]


Number of unique Culture values among missing-date rows: 85


array(['26th Dynasty', 'Third Intermediate; Late Period', 'Late Period',
       'Third Intermediate; 26th Dynasty', 'Ancient Egypt', 'Islamic',
       'Abbasid dynasty', 'Classical World', 'Roman', 'New Kingdom',
       'Ramesside', '18th Dynasty', 'Late Predynastic; 1st Dynasty',
       'Romano-British', '21st Dynasty', 'Naqada II',
       '1st Dynasty; 2nd Dynasty', 'Late Period; Ptolemaic',
       '6th Dynasty', 'Roman Period', '19th Dynasty', 'Naqada I',
       'Meroitic', 'Rhodian', 'Naqada III', 'Prehistoric',
       'First Intermediate', '20th Dynasty; Late Period', 'Hellenistic',
       'Classic Maya', '12th Dynasty', 'Predynastic', 'Napatan',
       'Ptolemaic', 'Eskimo; Arctic', 'Mesopotamian', 'Chiriquí',
       'Etruscan', 'Old Kingdom', '25th Dynasty', '4th Dynasty', 'Aztec',
       '1st Dynasty', 'Naqada II; Predynastic', 'Moche', 'Neo-Assyrian',
       'Lagash II', 'Badarian', '11th Dynasty', 'Post-Medieval'],
      dtype=object)

In [19]:
# 1. Total rows originally labelled as no_date_text
no_date_mask = bm_df["date_parse_status"] == "no_date_text"
total_no_date = no_date_mask.sum()
print("Total rows with date_parse_status == 'no_date_text':", total_no_date)

# 2. How many of those now have a start_year filled in
filled_from_culture = bm_df[no_date_mask & bm_df["start_year"].notna()]
print("Rows with no_date_text but now have a mapped start_year:", len(filled_from_culture))

# 3. How many still have no date at all
still_missing = bm_df[no_date_mask & bm_df["start_year"].isna()]
print("Rows still with no date after culture mapping:", len(still_missing))

# 4. Quick sample of rows that were filled from Culture
print("\nSample of rows filled from Culture mapping:")
filled_from_culture[["Culture", "start_year", "end_year", "midpoint_year"]].head(10)


Total rows with date_parse_status == 'no_date_text': 787
Rows with no_date_text but now have a mapped start_year: 369
Rows still with no date after culture mapping: 418

Sample of rows filled from Culture mapping:


Unnamed: 0,Culture,start_year,end_year,midpoint_year
4,26th Dynasty,-664.0,-525.0,-594.5
5,Third Intermediate; Late Period,-1069.0,-332.0,-700.5
6,Late Period,-664.0,-332.0,-498.0
8,26th Dynasty,-664.0,-525.0,-594.5
10,26th Dynasty,-664.0,-525.0,-594.5
11,26th Dynasty,-664.0,-525.0,-594.5
12,Late Period,-664.0,-332.0,-498.0
13,Late Period,-664.0,-332.0,-498.0
14,Late Period,-664.0,-332.0,-498.0
17,26th Dynasty,-664.0,-525.0,-594.5


## Export cleaned British Museum dataset

Exporting the cleaned dataset to the `data` folder so it can be used by the data integration team and for analysis.


In [20]:
# Export cleaned British Museum dataset (date fields + key metadata)

# Select the columns to keep
export_columns = [
    "Museum number",          # BM unique ID
    "Production date",        # raw text
    "prod_date_primary",      # cleaned text
    "start_year",             # numeric
    "end_year",               # numeric
    "midpoint_year",          # numeric
    "date_parse_status",      # shows how it was parsed
    "Culture",                # useful for provenance and mapping
    "Production place",       # needed for origin mapping later
    "Find spot"               # fallback for origin
]

# Create export DataFrame
bm_export = bm_df[export_columns].copy()

# Save to CSV in the shared data folder
output_path = "../data/bm_cleaned_dates.csv"
bm_export.to_csv(output_path, index=False)

print("Export complete!")
print("File saved to:", output_path)
print("Rows exported:", len(bm_export))


Export complete!
File saved to: ../data/bm_cleaned_dates.csv
Rows exported: 4458
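
Before handing the file to other teams, a short round-trip check can confirm that negative (BC) years and empty dates survive the CSV export. This is only a sketch on a hypothetical three-row frame written to an in-memory buffer; the museum numbers are made up, and the real check would read `../data/bm_cleaned_dates.csv` instead.

```python
import io
import pandas as pd

toy_export = pd.DataFrame({
    "Museum number": ["EX-1", "EX-2", "EX-3"],   # hypothetical IDs
    "start_year": [-664.0, 1778.0, float("nan")],
    "end_year": [-525.0, 1778.0, float("nan")],
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy_export.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Among dated rows, every start year should be <= its end year
dated = reloaded.dropna(subset=["start_year", "end_year"])
ranges_ok = bool((dated["start_year"] <= dated["end_year"]).all())

print("Rows reloaded:", len(reloaded))
print("Dated rows:", len(dated))
print("All ranges ordered:", ranges_ok)
```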


## Summary

This notebook covers the full cleaning process for the British Museum *Production date* column.  
The work involved:

- standardising over a thousand inconsistent date formats,
- converting BC and AD dates into a single numerical scale,
- generating start, end, and midpoint values for each object, and
- using Culture information (when historically specific and supported by contextual fields) to fill missing dates.

Using culture-based mapping recovered dates for **369 objects** out of the 787 with no date text.  
A further **418 objects** remain without dates, because no Culture value was recorded (78 rows), the Culture label was too broad, or the label is not covered by the mapping table. These entries have been deliberately left unmapped to avoid adding unreliable data.

Both the raw and cleaned date fields have been retained in the dataset to support transparency and manual review.
  
This notebook focuses solely on date cleaning. Work on other columns will be completed in separate notebooks for clarity.
