DEERIN Prototypes

Daniel edited this page Apr 13, 2026 · 3 revisions

Status: Year 1 development complete (March 2026). Year 2 development and AIMS workshop integration in progress.

The DEERIN prototype (Data Exploration, Enrichment, Retrieval and Interaction) is structured as a set of four chained minimum viable products (MVPs). Each MVP addresses a distinct use case drawn from research with UK moving image archive partners, and together they form a modular pipeline for AI-assisted workflows useful to moving image archives.

Why MVPs?

The MVP approach was chosen deliberately. These are stripped-down implementations (sparse interfaces, core features only) designed to communicate ideas, test assumptions, and invite structured feedback before further resources are committed. They serve as a concrete intermediate output of the project and as a cut-off point from which to steer the project into its second year.

The Pipeline

The four MVPs are chained. Together they reflect a larger transformation in how audiovisual material can be processed and made meaningful, as data that undergoes:

[!Tip] Atomisation → Densification → Integration

  • MVP 1 breaks material into meaningful fragments (segmentation)
  • MVPs 2 and 3 add layers of metadata to those fragments (context and description)
  • MVP 4 integrates all of the above to enable new forms of discovery and creative reuse

The connective logic across all four is: atomisation → densification → integration.

The shared pre-processing backbone is FrameSense, an open-source command-line tool developed by King's Digital Lab that extracts frame-level and audio data from video collections for downstream indexing.


What it does: Automatically breaks archival video into semantically coherent segments, each with generated metadata including timecodes, summaries, topics, and programme classifications.

Why build it: ISSA archive partners have a large cataloguing backlog of broadcast material, and human viewing and processing is the main bottleneck. This prototype tests whether small, locally runnable models can produce usable segment boundaries and descriptive metadata at scale.

Core pipeline:

  1. Frame extraction (OpenCV, ~0.5–1 FPS)
  2. Audio transcription with timestamps (Whisper)
  3. Frame captioning using a vision-language model (Moondream2 by default; the back end can be swapped)
  4. Caption and transcript alignment
  5. Boundary detection using LLM reasoning (Gemma / Qwen)
  6. Segment merging, summarisation, and classification
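
Steps 1 and 4 above can be illustrated with a minimal sketch. It assumes frame captions are keyed by their timestamp in seconds and that the transcript is a list of timed segments, as Whisper-style tools typically produce. The names `TranscriptSegment`, `AlignedFrame`, `sample_timestamps`, `align`, and the `window_s` parameter are hypothetical and are not taken from the DEERIN codebase.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class TranscriptSegment:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class AlignedFrame:
    timestamp: float
    caption: str
    speech: list[str] = field(default_factory=list)


def sample_timestamps(duration_s: float, fps: float = 0.5) -> list[float]:
    """Timestamps (seconds) at which frames would be sampled at `fps`."""
    step = 1.0 / fps
    count = math.ceil(duration_s / step)
    return [round(i * step, 3) for i in range(count)]


def align(
    captions: dict[float, str],
    transcript: list[TranscriptSegment],
    window_s: float = 1.0,  # assumed tolerance, not a project setting
) -> list[AlignedFrame]:
    """Attach to each captioned frame the speech overlapping [t - window, t + window]."""
    aligned = []
    for t in sorted(captions):
        lo, hi = t - window_s, t + window_s
        speech = [seg.text for seg in transcript if seg.start <= hi and seg.end >= lo]
        aligned.append(AlignedFrame(t, captions[t], speech))
    return aligned
```

The aligned records pair what is seen with what is said at each sampled moment, which is the kind of input an LLM could reason over when proposing segment boundaries in step 5.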

Key finding from development: Smaller models (e.g. Gemma 3) produce many false positives in boundary detection. Larger models (Qwen 3 20B, Gemini Pro) yield substantially better results. Processing a 30-minute video on an RTX 4090 takes approximately 2 hours end-to-end.
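
The timing reported above implies a processing cost of roughly four times real time. The short sketch below makes that arithmetic explicit and shows how it might be extrapolated to a backlog. It assumes cost scales linearly with footage length, which the source does not claim, and the function names are hypothetical.

```python
def realtime_factor(video_minutes: float, processing_minutes: float) -> float:
    """Processing time divided by footage duration."""
    return processing_minutes / video_minutes


def estimate_gpu_hours(footage_hours: float, factor: float) -> float:
    """Rough GPU-hours for a backlog, assuming linear scaling (an assumption)."""
    return footage_hours * factor


# 30 minutes of video processed in about 2 hours (120 minutes)
factor = realtime_factor(30, 120)
```

With this factor, `estimate_gpu_hours(footage_hours, factor)` gives a first-order figure for any backlog size on comparable hardware.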

Output: JSON segments with start/end timecodes, summaries, and structured classification fields, plus an evaluation.html tool that allows human review of segments against the source video.
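
As an illustration of what one such segment record might look like, the sketch below serialises a segment to JSON. The field names (`start`, `end`, `summary`, `topics`, `programme_type`) and the `HH:MM:SS.mmm` timecode format are assumptions made for this example; the actual schema is not specified on this page.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


def to_timecode(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


@dataclass
class Segment:
    # Hypothetical fields, chosen to mirror the description above
    start: str
    end: str
    summary: str
    topics: list[str]
    programme_type: str


def segment_to_json(segment: Segment) -> str:
    """Serialise one segment as pretty-printed JSON."""
    return json.dumps(asdict(segment), ensure_ascii=False, indent=2)
```

Keeping timecodes as formatted strings makes the JSON easy to read in a review tool; storing raw seconds alongside them would be an equally reasonable choice.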


What it does: Two complementary search interfaces over an indexed video collection — one map-based (searching spoken place names extracted from audio), one visual (semantic search over keyframe embeddings).

Why build it: Archive material from our ISSA partners is geographically relevant to their respective communities. Internal and external archive users want to find material relating to specific places, and the names and boundaries of those places are historically contingent. This prototype tests whether place references spoken in audio, combined with visual similarity search over frames, can support place-as-entry-point discovery.

Core pipeline:

  • Audio track → speech-to-text (Whisper) → LLM place extraction → geocoding via Nominatim → map index
  • Shot detection → middle-frame extraction → VLM captioning + visual embeddings → shot index
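
The geocoding step in the first chain can be sketched as follows. The example builds a query for the public Nominatim search endpoint and reads coordinates from its JSON response, using only parameters that Nominatim documents (`q`, `format`, `limit`, `countrycodes`). The helper names and the default restriction to `gb` are assumptions for illustration. The network request itself is left out; Nominatim's usage policy asks for an identifying User-Agent and a low request rate.

```python
from __future__ import annotations

from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(place: str, country_codes: str = "gb") -> str:
    """Build a Nominatim search URL for a place name extracted from speech."""
    params = {
        "q": place,
        "format": "jsonv2",
        "limit": 1,
        "countrycodes": country_codes,  # assumed default for UK archives
    }
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def parse_first_hit(results: list[dict]) -> tuple[float, float] | None:
    """Return (lat, lon) of the first result, or None if nothing matched."""
    if not results:
        return None
    # Nominatim returns coordinates as strings
    return float(results[0]["lat"]), float(results[0]["lon"])
```

Returning `None` for unmatched names keeps ambiguous or historical place names visible for human review instead of silently dropping them from the map index.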

Output/interfaces:

  • Map search (places.html): All geocoded place mentions plotted on a map. Hovering a marker loads the corresponding video moment with auto-generated clip summary.
  • Semantic keyframe search (shots.html): All shots browsable by keyword or visual similarity (embedding search). Filterable by auto-generated facets (topic, visual category).
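
The embedding search behind the keyframe interface can be approximated by ranking shots on cosine similarity to a query vector. This is a minimal pure-Python sketch under the assumption that each shot has a fixed-length embedding stored against a shot ID; `cosine` and `search` are hypothetical names, and a real index would more likely use a vector library or database.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query: list[float],
    index: dict[str, list[float]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Return the k shot IDs most similar to the query, best first."""
    scored = [(shot_id, cosine(query, vec)) for shot_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The query vector can come from a text prompt or from another keyframe, provided both are embedded in the same space as the indexed shots.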

Key finding from development: Place names are distributed across audio and visual channels in complementary ways — spoken references tend to name areas; visual content tends to show place types. Both channels are needed for effective geographic retrieval.


What it does: Generates timestamped audio description (AD) text for archival video clips using vision-language models, producing output suitable for accessibility use and as rich descriptive metadata.

Why build it: AI-generated audio description has the potential both to improve accessibility for blind and visually impaired users and to accelerate descriptive cataloguing. This prototype tests what current open-weight VLMs can do on this task, including comparison with specialist AD models.

Approach: Vision-language models (Qwen 3/3.5) are applied to sampled frames or video clips to generate scene-level descriptions timed to gaps in dialogue. Output is a timestamped sequence of description sentences, visualised against source video frames.
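
Timing descriptions to gaps in dialogue depends on first finding those gaps. The sketch below derives them from speech intervals such as transcript timestamps. The function name `dialogue_gaps` and the two-second minimum are assumptions for illustration, not the project's settings.

```python
from __future__ import annotations


def dialogue_gaps(
    speech: list[tuple[float, float]],
    duration_s: float,
    min_gap_s: float = 2.0,  # assumed threshold for a usable gap
) -> list[tuple[float, float]]:
    """Return (start, end) intervals with no speech lasting at least min_gap_s."""
    gaps = []
    cursor = 0.0  # end of the latest speech seen so far
    for start, end in sorted(speech):
        if start - cursor >= min_gap_s:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping speech intervals
    if duration_s - cursor >= min_gap_s:
        gaps.append((cursor, duration_s))
    return gaps
```

Each returned interval is a candidate slot for one description sentence; the length of the slot bounds how much text can be spoken within it.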

Key finding from development: The gap between current VLM capability and professional AD standards is larger than initially anticipated. Existing specialist models (e.g. DANTE-AD, AutoAD III) benchmark against human-authored AD as ground truth; their work is pushing the state of the art in automated AD while revealing significant ceilings in narrative depth and contextual interpretation. This is an active research area, and ISSA's position here has been that of a constructive, critical tester and evaluator more than a builder.


MVP 4 — Algorithmic Editing and Creative Reuse

What it does: Combines the outputs of MVPs 1–3 to prototype interfaces for intent-guided discovery and creative reuse of archival material — imagining possible interfaces for users to find, assemble, and contextualise material through semantic queries, mood or theme, and visual pattern matching.

Why design it: The outputs of the upstream MVPs — segments, place metadata, visual embeddings, audio descriptions — are themselves rich data. MVP 4 explores what becomes possible when these are integrated: new routes into collections, serendipitous discovery across holdings, and tools that could support creative practitioners, educators, and researchers in working with archival material in new ways.

Current status: The most conceptual of the four MVPs. Interface sketches and design concepts were presented at the Demonstrator (March 2026); no full working implementation exists yet. Key technical dependencies are the outputs of MVPs 1–3 being consistently available and indexed.

Concept directions explored:

  • Generative editing interfaces: assemblies generated by geo-walks, visual similarity, or speech patterns
  • Conversational/chat-based retrieval over indexed collections
  • Session memory and collection-saving to build context over time
  • Cross-collection discovery (surfacing related material across partner archives)

Challenges identified: Rights and licensing complexity is highest here — archives identify this as the space where legal questions become unavoidable. Cross-archive data federation is a policy and institutional alignment challenge as much as a technical one.


Feedback and Development Status

All four MVPs were demonstrated to archive partners at the ISSA Demonstrator event on 25 March 2026. Partners from NLS, NLW, NIS, YFA, and NWFA participated in structured working sessions to give feedback and steer future development.

Analysis of this feedback is documented in the internal Demonstrator Working Sessions Analysis report (April 2026). Key cross-cutting findings include:

  • Precision is the dominant concern across all four MVPs — archives want accurate retrieval, not just discovery
  • Human oversight and intervention (the ability to correct and validate AI outputs) is a non-negotiable expectation
  • Export and usability of outputs matter — tools that cannot easily integrate with existing workflows will frustrate
  • Sensitivity flagging is a shared concern across MVPs 2, 3, and 4, yet something the KCL team did not initially consider for the MVPs
  • The MVP1→MVP3 chain is already intuited by archives as a desirable processing workflow

Development priorities for Year 2 will be shaped by this feedback in dialogue with the KDL RSE team.


Related Pages

  • FrameSense — pre-processing tool used across MVP 2 and other pipelines
  • Technology Review — survey of AI tools and methods informing prototype design
  • Use cases ― real challenges faced by archivists, technicians, and users across national and regional archives in the UK