# Mind the Gap: Metadata Completion for Wellcome Catalogue
**Authors:** Louis Larcher, Arthur Taieb and Cassio Manuguera

**Abstract**  
The Wellcome Collection is a vast and culturally diverse set of materials, ranging from medical and ethnographic objects to historical manuscripts, books and journals, spanning dozens of cultures and over 50 languages. Its holdings include everyday items as well as rare documents and artworks, making it a rich resource for understanding health, culture and the human experience across time.

Yet despite the breadth and value of these holdings, many items suffer from missing or incomplete metadata (such as unknown dates, origins or creators), which limits how effectively they can be catalogued, searched or interpreted. Our project aims to address this challenge by developing machine-learning methods to predict or approximate these missing fields using the available textual descriptions and, when possible, images. By providing archivists with reliable, data-driven estimates, we seek to help enrich the Wellcome Collection’s records and improve access to its diverse cultural heritage.



In [1]:
# IMPORT ALL THE NEEDED FILES AND LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import sys
from pathlib import Path
from IPython.display import display, HTML

sys.path.insert(0, './scripts/') 
from loading import *

%matplotlib inline
%load_ext autoreload
%autoreload 2



## Exploratory Data Analysis 

To start this project we will first load the data and perform some basic analysis 
to understand the data we are working with.
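The `load_wellcome_data` helper lives in `scripts/loading` and is not shown here. As a rough idea of what such a loader might do, here is a minimal sketch, assuming the file is gzip-compressed JSON lines (one work per line). The function name `read_jsonl_gz` is hypothetical and is not the project's actual implementation.

```python
import gzip
import json
from itertools import islice

import pandas as pd


def read_jsonl_gz(path, n_samples=None):
    """Read up to n_samples lines from a gzip-compressed JSON-lines file.

    Hypothetical helper for illustration; not the project's loader.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        # islice stops after n_samples lines; None means read the whole file
        lines = islice(fh, n_samples)
        records = [json.loads(line) for line in lines if line.strip()]
    # json_normalize flattens nested objects into dotted column names
    return pd.json_normalize(records)
```

Reading only the first `n_samples` lines keeps memory use bounded, which helps when exploring a large dump on a laptop.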

In [2]:
raw_dataset = load_wellcome_data(n_samples=100000)

Using existing file: C:\Users\Surface\Desktop\EPFL\MASTER\MA1_2025_2026\ML\WellcomeML\data\works.json.gz
Loading 100000 samples from works.json.gz...


Parsing JSON lines: 100%|██████████| 100000/100000 [00:13<00:00, 7377.24it/s]



✓ Loaded 100,000 works
✓ DataFrame shape: (100000, 45)

Missing values (count):
genre_ids                  100000
production_date_to         100000
production_date_from       100000
precededBy_title            99838
precededBy_id               99838
succeededBy_title           99836
succeededBy_id              99836
partOf_id                   99365
issn                        98249
alternativeTitles           94187
edition                     93747
lettering                   90729
referenceNumber             87362
production_function         86240
description                 85337
isbn                        82311
wellcome_library_number     81644
thumbnail_url               69123
partOf_title                65230
genres                      42400
production_agents           32710
availability_status         27918
subject_ids                 21621
subjects                    21621
production_places           20158
notes                       16229
note_types                  16229

We can see that many fields are almost entirely empty, so we drop them and work with fewer fields. We arbitrarily choose to remove the fields with more than 80% missing values; the only exception is the description field, because it can provide a lot of information when present.

In [5]:
threshold = 0.80
missing_pct = raw_dataset.isnull().sum() / len(raw_dataset)
cols_to_keep = missing_pct[missing_pct <= threshold].index.tolist()

filtered = raw_dataset[cols_to_keep]
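A threshold filter on its own drops every column above 80% missing, and `description` is about 85% missing in the sample above, so keeping it needs an explicit exception. A minimal sketch of one way to do that, assuming a hypothetical `always_keep` allow-list and a toy frame with made-up values standing in for `raw_dataset`:

```python
import pandas as pd

# Toy frame standing in for raw_dataset (values are made up for illustration)
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
    "description": [None] * 9 + ["a short text"],                      # 90% missing
    "issn": [None] * 10,                                               # 100% missing
    "genres": ["x", None, "y", None, "z", "x", "y", None, "z", "x"],   # 30% missing
})

threshold = 0.80
always_keep = ["description"]  # hypothetical allow-list for the exception

# Fraction of missing values per column
missing_pct = toy.isnull().mean()

# Keep a column if it is under the threshold OR explicitly allow-listed
keep_mask = (missing_pct <= threshold) | missing_pct.index.isin(always_keep)
cols_to_keep = missing_pct[keep_mask].index.tolist()

filtered_toy = toy[cols_to_keep]
print(cols_to_keep)
```

With this toy data, `issn` is dropped while `description` survives despite exceeding the threshold.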

In [None]:
# print some basic stats about the dataset
# how many missing values per column
missing_values = filtered.isnull().sum()
print("Missing values per column:\n")
print(missing_values[missing_values > 0])

# shape of the dataset
print(f"\nShape of the filtered dataset: {filtered.shape}")

# show the full content of a randomly picked row from the raw dataset, without truncation
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

display(raw_dataset.sample(1).T)


Missing values per column:

physicalDescription     16122
production_date         15100
production_places       20158
production_agents       32710
contributors            15709
contributor_roles       15709
contributor_ids         15709
subjects                21621
subject_ids             21621
genres                  42400
languages                9702
language_ids             9702
sierra_system_number     6763
notes                   16229
note_types              16229
thumbnail_url           69123
availability_status     27918
partOf_title            65230
dtype: int64

Shape of the filtered dataset: (100000, 28)


| Field | Value |
|---|---|
| Unnamed: 0 | 5629 |
| id | kzpbwk5s |
| title | A thousand notable things of sundrie sortes : vvhereof some are wonderfull, some strange, some pleasant, diuers necessary, a great sort profitable, and many very precious. |
| alternativeTitles | Thousand notable things, of sundry sortes; Thousannd notable things of sundrie sortes. |
| workType | Books |
| workType_id | a |
| description | |
| physicalDescription | 6 unnumbered pages, 210, 171-174 pages, 20 unnumbered pages |
| lettering | |
| edition | |
| production_date | 1627 |


__Our dataset only contains columns that interest us.__ 

#### We now have to preprocess the data.
The goal is to predict missing values in the dataset using the other fields as input, so we do not need to fill in NaNs in the target fields. We still have to apply standard preprocessing steps such as encoding categorical variables, vectorizing text and normalizing numeric features.
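To make these three steps concrete, here is a minimal sketch using scikit-learn. It assumes TF-IDF for text, one-hot encoding for categories and standard scaling for numbers; these are illustrative choices, not necessarily the ones used later in this project. The toy rows and the `n_pages` column are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only; n_pages is a hypothetical numeric feature
toy = pd.DataFrame({
    "title": [
        "A thousand notable things",
        "Anatomy of the human body",
        "Notes on surgery",
    ],
    "workType": ["Books", "Books", "Archives and manuscripts"],
    "n_pages": [240.0, 512.0, 36.0],
})

preprocess = ColumnTransformer([
    # A plain string column name passes a 1D column, as text vectorizers expect
    ("text", TfidfVectorizer(), "title"),
    # Categories unseen at fit time are encoded as all zeros instead of raising
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workType"]),
    # Rescale numeric features to zero mean and unit variance
    ("num", StandardScaler(), ["n_pages"]),
])

X = preprocess.fit_transform(toy)
print(X.shape)
```

Bundling the steps in one `ColumnTransformer` means the same fitted transformation can be reused on held-out rows with `preprocess.transform`, which avoids leaking test-set statistics into training.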