Oral history archives preserve irreplaceable records of lived experience, yet most remain computationally inaccessible due to the prohibitive cost of manual transcription. We present a transcription and enrichment dataset derived from the Nelson Mandela Foundation's audiovisual archive, a major repository of liberation-era oral history. Using WhisperX automatic speech recognition and GPT-4o-powered annotation, we processed 1,204 recordings across seven archival collections spanning South Africa's transition from apartheid to democracy. The resulting dataset, Ukuvula (Zulu: "to open"), comprises transcriptions derived from 557.2 hours of archival audio, organized into 17,185 transcript windows (120 seconds each) containing approximately 4.08 million words, enriched with named entities, thematic classifications, and linguistic quality scores. The dataset enables longitudinal discourse analysis of liberation rhetoric, study of multilingual content in English-Afrikaans-isiXhosa political discourse, and text-based NLP benchmarking on transcripts from acoustically heterogeneous archival recordings. All transcription outputs, enrichment annotations, and associated metadata are publicly available at https://archive.nelsonmandela.org/index.php/ukv.
Figure 1. The Ukuvula pipeline transforms 557.2 hours of archival audio across seven Nelson Mandela Foundation collections into structured, AI-enriched research outputs through five stages: (a) Archival audio corpus comprising 1,204 recordings in MP3/WAV formats spanning the apartheid and post-apartheid era; (b) Audio pre-processing including metadata extraction, acoustic feature analysis, quality triage, speaker diarization (pyannote.audio 3.1), and voice activity detection; (c) Speech transcription via WhisperX large-v2 with forced alignment, confidence reconstruction, post-processing, redundancy cleaning, and 120 s fixed-window aggregation; (d) AI enrichment powered by GPT-4o performing named entity extraction (31,686 entities across five categories), thematic clustering (20 predefined themes), quality evaluation, summary generation, and geographic analysis; (e) Research outputs including 17,185 provenance-rich transcript segments (4.08 M words), an entity catalog, thematic index, quality profiles, and an interactive semantic search interface.
Figure 2. Distribution of recording durations across the archive (N=1,213; mean 28.0 min, median 15.3 min). The heavy right tail reflects multi-hour event recordings; the left-tail mass captures brief archival fragments.
Figure 3. Estimated unique speaker counts per recording via pyannote.audio speaker diarization. Modal cases are single-speaker and two-speaker recordings; the tail of 3–6 speakers reflects multi-participant discussions.
Figure 4. Global geographic footprint of the top 100 location entities by distinct file count. Spatial concentration in Southern Africa reflects regional organizational activity; transnational points (London, United States) indicate diplomatic and advocacy networks.
Figure 5. Multi-panel views for two archival recordings showing waveform, spectrogram, proxy speaker trajectory, and sample transcription. Left: VTS_09_1 (Mandela at 90, WAV); Right: Maharaj_Tape1_Side2 (The Authorised Portrait, MP3).
# 1. Create environment
conda env create -f environment.yml
conda activate nmf
# 2. Configure credentials (copy template, fill in values)
cp .env.example .env
# ⚠️ .env contains secrets — do NOT commit it to source control
# 3. Authenticate with Azure (required for GPT features)
az login
# 4. Run the transcription pipeline
bash run_pipeline.sh large-v2 "data/nmf_recordings/<Collection>"

conda env create -f environment.yml
conda activate nmf

Requires Python 3.10-3.11 and conda. PyTorch and cuDNN are installed via conda channels (pytorch, nvidia) for reliable GPU setup. The remaining dependencies are installed via requirements.txt, referenced from environment.yml.
Note: The pipeline script (run_pipeline.sh) expects a conda environment named nmf. Using conda for PyTorch ensures cuDNN libraries are correctly placed in $CONDA_PREFIX/lib/ without manual symlinks.
Copy .env.example to .env and fill in the required values:
| Variable | Required For | Description |
|---|---|---|
| `AZURE_OPENAI_ENDPOINT` | GPT features (Steps 6-8) | Your Azure OpenAI resource URL |
| `AZURE_OPENAI_MODEL` | GPT features | Deployment name (default: `gpt-4o`) |
| `HUGGINGFACE_TOKEN` | Speaker diarization | HuggingFace access token for pyannote models |
GPT features use Azure AD authentication (DefaultAzureCredential). Run az login before use.
The pipeline runs sequentially. Steps 1-4 produce the core transcription corpus. Step 5 is optional redundancy cleaning. Steps 6-8 add GPT-powered enrichment. Step 9 generates analytics.
Inventory all audio/video files and extract technical metadata (duration, format, codec).
python src/analysis/generate_metadata.py # Basic inventory (ffprobe)
python src/analysis/generate_metadata.py --all # Full: includes acoustic features

Output: results/audiovisual_metadata.csv
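Under the hood, metadata extraction shells out to ffprobe. A minimal sketch of parsing its JSON output (the helper name `parse_ffprobe_json` and the trimmed sample payload are illustrative, not the script's actual code; field names follow ffprobe's `-show_format`/`-show_streams` output):

```python
import json

def parse_ffprobe_json(raw: str) -> dict:
    """Extract duration, container format, and audio codec from ffprobe JSON.

    Assumes `raw` came from:
      ffprobe -v quiet -print_format json -show_format -show_streams <file>
    """
    probe = json.loads(raw)
    fmt = probe.get("format", {})
    # Pick the first audio stream, if any
    audio = next((s for s in probe.get("streams", [])
                  if s.get("codec_type") == "audio"), {})
    return {
        "duration_s": float(fmt.get("duration", 0.0)),
        "format": fmt.get("format_name", "unknown"),
        "codec": audio.get("codec_name", "unknown"),
    }

# Trimmed example payload for illustration
sample = ('{"format": {"duration": "912.4", "format_name": "wav"}, '
          '"streams": [{"codec_type": "audio", "codec_name": "pcm_s16le"}]}')
print(parse_ffprobe_json(sample))
```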
Run WhisperX on a collection directory. Each file produces per-segment transcripts with confidence scores and timestamps.
# Single collection
python src/pipeline/create_transcription_main.py \
--input_dir "data/nmf_recordings/<Collection>" \
--model_size large-v2 \
--use_gpu true
# Or use the wrapper script (handles CUDA setup)
bash run_pipeline.sh large-v2 "data/nmf_recordings/<Collection>"

Key options: --chunk_duration (default: 120s), --min_confidence (default: 0.4),
--enable_diarization (default: false), --language (default: en).
Output: transcription_outputs/<Collection>/<file>.csv (per-file transcripts)
Consolidate per-file transcripts into collection-level CSVs with 2-minute windows.
python src/aggregation/aggregate_transcriptions.py

Output: results/aggregated_transcriptions/<Collection>.csv
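The 2-minute windowing can be sketched in a few lines (simplified: the actual script also carries confidence scores and provenance columns):

```python
from collections import defaultdict

def window_segments(segments, window_s=120):
    """Group (start_s, end_s, text) ASR segments into fixed windows by start time."""
    windows = defaultdict(list)
    for start, end, text in segments:
        windows[int(start // window_s)].append(text)
    # Join each window's texts in temporal order
    return {idx: " ".join(texts) for idx, texts in sorted(windows.items())}

segs = [(0.0, 4.2, "Comrades,"),
        (115.0, 119.8, "we gather today."),
        (121.3, 126.0, "The struggle continues.")]
print(window_segments(segs))
# window 0 holds the first two segments; window 1 holds the third
```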
Merge all collection-level CSVs into a single corpus file.
python src/aggregation/create_final_transcriptions.py

Output: results/final_transcriptions.csv
Remove mechanical repetitions and filler stretches while preserving original text.
python src/enrichment/clean_redundancy_transcriptions.py --input results/final_transcriptions.csv
# Dry run first to inspect metrics
python src/enrichment/clean_redundancy_transcriptions.py --input results/final_transcriptions.csv --dry-run

Output: results/final_transcriptions.cleaned.csv
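Conceptually, redundancy cleaning collapses immediate word and phrase repetitions, a common ASR artifact on noisy archival audio. A rough regex-based sketch (the real script also reports metrics and preserves the original text alongside the cleaned copy):

```python
import re

def collapse_repeats(text: str, max_phrase_words: int = 3) -> str:
    """Collapse immediate repetitions of up to N-word phrases.

    e.g. "we we we shall overcome shall overcome" -> "we shall overcome"
    """
    # Try longer phrases first so "a b a b" collapses before "a a"
    for n in range(max_phrase_words, 0, -1):
        phrase = r"\b((?:\w+\s+){%d}\w+)" % (n - 1)
        text = re.sub(phrase + r"(?:\s+\1\b)+", r"\1", text)
    return text

print(collapse_repeats("we we we shall overcome shall overcome"))
```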
Extract named entities (PERSON, ORGANIZATION, LOCATION, EVENT, DATE_TIME) using GPT-4o.
python src/enrichment/extract_names_records_match_with_gpt.py

Output: results/entities_with_records_gpt.csv, results/entity_extraction_analysis_gpt.json
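Model output benefits from post-validation against the five fixed categories before it enters the entity catalog. A hedged sketch (the `text`/`type` field names are illustrative, not necessarily the script's schema):

```python
import json

# The five entity categories used by the pipeline
ENTITY_TYPES = {"PERSON", "ORGANIZATION", "LOCATION", "EVENT", "DATE_TIME"}

def validate_entities(raw: str) -> list[dict]:
    """Keep only well-formed entities whose category is in the controlled set."""
    entities = json.loads(raw)
    return [e for e in entities
            if isinstance(e.get("text"), str) and e.get("type") in ENTITY_TYPES]

# "PLACE" is not a valid category, so the second entity is dropped
raw = ('[{"text": "Nelson Mandela", "type": "PERSON"}, '
       '{"text": "Robben Island", "type": "PLACE"}]')
print(validate_entities(raw))
```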
Generate thematic summaries for each archival collection.
python src/enrichment/generate_collection_summaries.py

Output: results/gpt_based_collection_summary.csv
Classify transcription segments into curated thematic clusters.
python src/enrichment/gpt_clustering_mandela.py

Output: results/gpt_cluster_results.csv, results/final_classification_results.json
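Because GPT occasionally paraphrases a theme name, snapping free-text labels back onto the controlled vocabulary keeps the cluster index consistent. A sketch using difflib (the theme names below are illustrative, not the pipeline's actual 20-theme list):

```python
import difflib

THEMES = ["Liberation Struggle", "Prison Years", "Negotiations",
          "Reconciliation", "Education"]  # illustrative subset

def snap_to_theme(label: str, themes=THEMES, cutoff=0.6):
    """Map a free-text model label onto the closest predefined theme.

    Returns None when nothing in the controlled vocabulary is close enough.
    """
    match = difflib.get_close_matches(label, themes, n=1, cutoff=cutoff)
    return match[0] if match else None

print(snap_to_theme("liberation struggles"))
```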
Generate exploratory statistics and visualizations from the metadata and transcriptions.
python src/analysis/metadata_analysis.py

Output: Figures in results/metadata_analysis/
| Script | Purpose |
|---|---|
| `src/analysis/create_people_recording_counts_summary.py` | Heuristic (non-GPT) person mention counts |
| `src/enrichment/aggregate_transcriptions_scope_note.py` | GPT-generated scope-and-content notes per collection |
| `src/analysis/estimate_unique_speakers.py` | Speaker diarization using pyannote.audio |
| `src/analysis/compute_collection_quality_stats.py` | Collection-level quality statistics for tables |
| `src/analysis/quality_evaluation.py` | GPT-based multi-dimensional transcription quality scoring |
.
├── src/
│ ├── config.py # Shared pipeline configuration
│ ├── azure_openai_utils.py # Shared Azure OpenAI client setup
│ ├── pipeline/ # Core ASR transcription
│ │ ├── create_transcription_main.py # Main transcription orchestrator
│ │ ├── transcriber.py # WhisperX model loading and inference
│ │ ├── audio_utils.py # Audio loading, resampling, normalization
│ │ ├── postprocess.py # Segment cleaning and confidence synthesis
│ │ └── save_utils.py # Output persistence (CSV/JSON/TXT)
│ ├── enrichment/ # GPT-powered enrichment
│ │ ├── extract_names_records_match_with_gpt.py # GPT entity extraction
│ │ ├── generate_collection_summaries.py # GPT collection summaries
│ │ ├── gpt_clustering_mandela.py # GPT thematic clustering
│ │ ├── aggregate_transcriptions_scope_note.py # GPT scope-and-content notes
│ │ └── clean_redundancy_transcriptions.py # Redundancy cleaning
│ ├── analysis/ # Analytics and quality
│ │ ├── metadata_analysis.py # Exploratory analytics
│ │ ├── quality_evaluation.py # GPT-based quality scoring
│ │ ├── compute_collection_quality_stats.py # Collection quality statistics
│ │ ├── generate_metadata.py # Audio/video metadata extraction
│ │ ├── estimate_unique_speakers.py # Speaker diarization (pyannote)
│ │ └── create_people_recording_counts_summary.py # Person mention counts
│ └── aggregation/ # Corpus building
│ ├── aggregate_transcriptions.py # Collection-level consolidation
│ └── create_final_transcriptions.py # Corpus-level merge
├── run_pipeline.sh # Pipeline wrapper with CUDA setup
├── setup_environment.sh # Conda environment setup
├── requirements.txt # Pinned Python dependencies
├── environment.yml # Conda environment specification
├── .env.example # Environment variable template
├── figures/ # Figures for README and documentation
│ ├── ukuvula_overview.png # Pipeline overview diagram
│ ├── top_locations_world_overlay_pub.png # Geographic entity map
│ ├── metadata_analysis/ # Corpus characterization plots
│ └── examples/ # Illustrative recording panels
├── LICENSE # MIT License
└── THIRD_PARTY_NOTICES.md # Third-party license attributions
| Directory | Contents |
|---|---|
| `data/` | Input audio/video files |
| `transcription_outputs/` | Per-file transcription CSVs |
| `results/` | All derived artifacts (aggregated transcripts, entities, clusters, figures) |
| `logs/` | Pipeline execution logs |
All pipeline parameters are defined in src/config.py with sensible defaults.
Override via environment variables or CLI arguments:
| Parameter | Default | Env Override | Description |
|---|---|---|---|
| Model size | `large-v2` | `WHISPER_MODEL_SIZE` | WhisperX model variant |
| Language | `en` | `WHISPER_LANGUAGE` | Transcription language |
| Batch size | `32` | `WHISPER_BATCH_SIZE` | Inference batch size (tune for GPU memory) |
| Confidence threshold | `0.4` | — | Minimum segment confidence |
| Chunk duration | `120s` | — | Audio chunk length for processing |
| VAD method | `pyannote` | `WHISPER_VAD_METHOD` | Voice activity detection backend |
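The override pattern can be sketched as follows (the helper name `cfg` is illustrative; src/config.py may structure this differently):

```python
import os

def cfg(name: str, env_var: str, default):
    """Resolve a pipeline parameter: an environment override wins over the default.

    The override string is coerced to the default's type (e.g. int for batch size).
    """
    raw = os.environ.get(env_var)
    return type(default)(raw) if raw is not None else default

os.environ["WHISPER_BATCH_SIZE"] = "16"  # e.g. a smaller GPU
batch_size = cfg("batch_size", "WHISPER_BATCH_SIZE", 32)   # -> 16 (overridden)
model_size = cfg("model_size", "WHISPER_MODEL_SIZE", "large-v2")  # -> default
print(batch_size, model_size)
```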
- Per-file try/except with logging; failures never abort the batch.
- Chunk fallback when zero segments detected (prevents empty outputs).
- Confidence safety: always retains first non-empty segment per file.
- GPT scripts checkpoint intermediates every 500-1000 segments.
- Graceful degradation when optional dependencies (diarization, alignment) are missing.
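The confidence-safety rule in particular can be sketched as follows (simplified segment schema, illustrative only):

```python
def filter_segments(segments, min_confidence=0.4):
    """Drop low-confidence segments, but never return an empty transcript.

    If every segment falls below the threshold, keep the first non-empty
    segment so no file ends up with a blank output.
    """
    kept = [s for s in segments if s["confidence"] >= min_confidence]
    if not kept:
        kept = [s for s in segments if s["text"].strip()][:1]
    return kept

# All segments are below threshold, so the first non-empty one is retained
noisy = [{"text": "", "confidence": 0.2},
         {"text": "inaudible murmur", "confidence": 0.12}]
print(filter_segments(noisy))
```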
| Item | Status |
|---|---|
| Environment file | environment.yml / requirements.txt (pinned) |
| Deterministic flag | WHISPERX_CUDA_PATCH=1 for cuDNN determinism |
| Random seeds | Set (e.g., 42) in clustering and sampling |
| Provenance columns | Collection/file/window tracked everywhere |
| Intermediate artifacts | GPT classification and entity checkpoints retained |
| Corpus rebuild | src/aggregation/create_final_transcriptions.py |
- No raw audio is redistributed; transcripts may contain personal names.
- API credentials are loaded from environment variables (.env), never hardcoded.
- Review transcripts for PII before any public release of derived data.
@article{nmf2025ukuvula,
  title   = {Ukuvula: Making Nelson Mandela's Legacy Accessible
             through Large-Scale Transcription},
  author  = {TBD},
  year    = {2025},
  journal = {Scientific Data (under preparation)}
}

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the FAQ or contact opencode@microsoft.com.
This project is licensed under the MIT License.
Copyright (c) Microsoft Corporation. All rights reserved.
See THIRD_PARTY_NOTICES.md for third-party license attributions.