# British Museum – Materials and Technique fields (quick audit)

In this notebook I'll take a focused look at the **Materials** and **Technique** fields in the British Museum sculpture dataset.

The aim here is not to do heavy cleaning, but to answer two questions:

1. Have these fields already been entered using a reasonably standardised vocabulary?
2. If not, what kinds of inconsistencies or variants would I need to normalise?

I'll:

- Load the British Museum dataset used in the other cleaning notebooks  
- Inspect a sample of the raw Materials and Technique fields  
- Look at unique values and common combinations  
- Decide whether any additional cleaning is needed, or whether a light normalisation step is enough

If the vocabularies already look consistent, I'll just document that and keep any changes minimal.


In [2]:
import pandas as pd

bm_file_path = "../data/british_museum_dataset_raw.csv"

# 1. Load the dataset into a pandas DataFrame
bm_df = pd.read_csv(bm_file_path)

# 2. Basic shape check: how many rows and columns?
print("Rows, columns:", bm_df.shape)

# 3. Print column names so we can confirm the exact labels
print("\nColumn names:")
print(list(bm_df.columns))

# 4. Preview a few rows of materials and techniques
cols_to_preview = [c for c in bm_df.columns if c.lower() in ["materials", "techniques"]]
print("\nColumns to preview:", cols_to_preview)

bm_df[cols_to_preview].head(50)

Rows, columns: (4458, 47)

Column names:
['Image', 'Object type', 'Museum number', 'Title', 'Denomination', 'Escapement', 'Description', 'Producer name', 'School/style', 'State', 'Authority', 'Ethnic name (made by)', 'Ethnic name (assoc)', 'Culture', 'Production date', 'Production place', 'Find spot', 'Materials', 'Ware', 'Type series', 'Technique', 'Dimensions', 'Inscription', 'Curators Comments', 'Bib references', 'Location', 'Exhibition history', 'Condition', 'Subjects', 'Assoc name', 'Assoc place', 'Assoc events', 'Assoc titles', 'Acq name (acq)', 'Acq name (finding)', 'Acq name (excavator)', 'Acq name (previous)', 'Acq date', 'Acq notes (acq)', 'Acq notes (exc)', 'Dept', 'BM/Big number', 'Reg number', 'Add ids', 'Cat no', 'Banknote serial number', 'Joined objects']

Columns to preview: ['Materials']


Unnamed: 0,Materials
0,marble
1,marble
2,limestone
3,marble
4,glazed composition
5,glazed composition
6,glazed composition
7,glazed composition
8,glazed composition
9,glazed composition


## 2. Checking for multi value strings in Materials and Technique

Although a quick visual scan of the CSV suggests that both fields sometimes contain multiple values separated by semicolons, I want to confirm this properly. This will tell me:

- how many records use multiple materials  
- how many records use multiple techniques  
- whether everything is consistently separated with semicolons  
- whether the British Museum appears to be using controlled vocabularies

This will help me decide whether any cleaning or normalisation is actually needed.


In [3]:
# Check for multi value rows using semicolons
multi_materials = bm_df["Materials"].fillna("").str.contains(";")
multi_techniques = bm_df["Technique"].fillna("").str.contains(";")

print(f"Total rows: {len(bm_df)}")
print(f"Rows with multiple materials: {multi_materials.sum()}")
print(f"Rows with multiple techniques: {multi_techniques.sum()}")

# Show a few examples
print("\nExample multi-material rows:")
display(bm_df.loc[multi_materials, ["Materials"]].head(20))

print("\nExample multi-technique rows:")
display(bm_df.loc[multi_techniques, ["Technique"]].head(20))


Total rows: 4458
Rows with multiple materials: 504
Rows with multiple techniques: 719

Example multi-material rows:


Unnamed: 0,Materials
38,pottery; terracotta
39,pottery; terracotta
50,lacquer; jade; bamboo; glass; ivory; metal; ag...
51,sycomore fig wood; clay; vegetal
52,porcelain; stoneware
57,terracotta; gold
94,gold; enamel
96,marble; bronze
97,bronze; silver
132,porcelain; gold



Example multi-technique rows:


Unnamed: 0,Technique
13,pierced; glazed
16,glazed; mould-made; incised
23,solid-cast; pierced
27,applied; slip-decorated; handmade (?)
32,mould-made; carved
33,carved; mould-made; painted
34,painted; carved; moulded
38,mould-made; painted
39,mould-made; painted; wheel-made
40,painted; hand-modelled


## 3. Checking for uncertainty markers in Materials and Technique

While reviewing the multi-value examples above, I noticed that one of the Technique entries included a **"(?)"** marker (for example: *handmade (?)*). This made me wonder whether similar markers appear elsewhere in the Technique field or even in the Materials field.

A quick check online suggested that **"(?)" is a standard uncertainty marker in museum cataloguing**, used when a cataloguer is not fully confident about a material or technique. Since it carries real meaning, it should not be removed automatically. Even so, I want to confirm how many objects contain this kind of marker so I can decide whether it has any impact on later analysis.


In [4]:
# Look for "(?)" uncertainty markers in Materials and Technique

import re

# Masks: only True where the string actually contains "(?)"
materials_mask = bm_df["Materials"].str.contains(r"\(\?\)", na=False)
technique_mask = bm_df["Technique"].str.contains(r"\(\?\)", na=False)

# Filtered series (no NaNs unless they genuinely contain "(?)", which they won't)
materials_q = bm_df.loc[materials_mask, "Materials"]
technique_q = bm_df.loc[technique_mask, "Technique"]

# Counts for each field
materials_q_count = len(materials_q)
technique_q_count = len(technique_q)

print("Explicit '(?)' uncertainty markers found:")
print(f"- Materials: {materials_q_count} rows")
print(f"- Technique: {technique_q_count} rows")

# Show sample rows for inspection
materials_q.head(20), technique_q.head(20)


Explicit '(?)' uncertainty markers found:
- Materials: 0 rows
- Technique: 13 rows


(Series([], Name: Materials, dtype: object),
 27         applied; slip-decorated; handmade (?)
 206     hand-modelled (?); mould-made; burnished
 357      biscuit-fired; slip-moulded (?); glazed
 1087            biscuit-fired; press-moulded (?)
 1467                         painted; gilded (?)
 1483                       pierced; polished (?)
 1605           wheel-made (?); handmade; painted
 1994                       polished (?); incised
 2546      mould-made (?); hand-modelled; painted
 2548                           hand-modelled (?)
 2557      handmade; incised; inlaid (?); painted
 3272                                 incised (?)
 3933                                 painted (?)
 Name: Technique, dtype: object)

## 4. Conclusions on the Materials and Technique fields

Both the **Materials** and **Technique** columns appear to be already well structured. The British Museum seems to be using controlled vocabularies, because:

- Entries are clean and consistently formatted  
- There are no prefixes or extraneous text to remove  
- Multi value entries use a standard semicolon separator  
- Spelling and terminology look stable across the dataset  

I also checked for uncertainty markers such as **"(?)"**, after spotting one in a multi value Technique entry. There are no instances of "(?)" in the Materials field, and only a very small number in the Technique field (13 objects in total). In each case the marker is used in the expected cataloguing sense, to indicate uncertainty about a particular technique. Since it carries genuine meaning rather than being a formatting problem, it makes more sense to keep it than to strip it out.

Taken together, this suggests that **no corrective cleaning is required** for either column. Leaving the fields as they are also avoids creating inconsistencies when we later merge the British Museum data with the cleaned V&A dataset.

There is still optional organisational work that could support analysis:

- Many objects contain multiple materials or techniques (Materials: 504 rows, Technique: 719 rows)  
- These multi value strings can be split into lists during analysis if we want to filter or group objects by material or technique  
- Splitting them is only needed for exploration or visualisation, not for cleaning  

Overall, the main cleaning choice here is simply to **retain the original fields exactly as provided**, acknowledging that they are already standardised and ready for downstream use.
