# Tracking and tracing audiovisual reuse: Introducing the Video Reuse Detector

## Tomas  Skotare [![orcid](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/ORCID_ID) 
Humlab, Umeå University.

## Pelle  Snickars [![orcid](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0000-0001-5122-1549) 
Department of Art and Cultural Sciences, Lund University.

## Maria  Eriksson [![orcid](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0000-0002-7534-4268) 
Department of Arts, Media, Philosophy, University of Basel.

The cultural reuse and reappropriation of audiovisual content has been a recurring topic of research in the humanities, not least in studies of remix cultures. An open question that remains, however, is how artificial intelligence and machine learning may help scholars study the reuse and reappropriation of audiovisual heritage. In this article, we introduce the Video Reuse Detector (VRD) – a methodological toolkit for identifying visual similarities in audiovisual archives with the help of machine learning. Designed to assist in the study of the “social life” and “cultural biographies” (<cite data-cite="1971321/6PBSUNC7"></cite>, <cite data-cite="1971321/N7PZTCJU"></cite>) of video clips, the VRD helps explore how the meaning of historic footage changes when it circulates and is recycled or cross-referenced in video productions over time. The toolkit uses machine learning techniques (specifically, convolutional neural networks) combined with tools for performing similarity searches (specifically, the Faiss library) to detect copies and near-copies in audiovisual archives. It also assembles a series of tools for trimming and preparing datasets and for filtering and visualizing matching results (such as introducing similarity thresholds, filtering based on sequential matches of frames, and viewing the final matching results). Inspired by the “visual turn” in digital history and digital humanities research, the article introduces the basic logic and rationale behind the VRD, exemplifies how the toolkit works, and discusses how the digitization of audiovisual archives opens up new ways of exploring the reuse of historic moving images.

# Introduction

This section will provide a general introduction to the article, discuss previous research concerning cultural reuse, and provide insights into state-of-the-art uses of machine learning and CNNs in historical research, focusing on research that deals with still and moving images.

Chapter 3 will then guide the reader through the practical application of the toolkit in textual/narrative/descriptive form, while ***chapter 4 consists of a technical demo that showcases how the toolkit works. It is this demo that should be tested during the technical check.*** 

In chapter 5, we present a more complex demonstration of what the toolkit can do. Here, however, we work with datasets that cannot be publicly released for copyright and practical reasons. Thus, the case study discussed in chapter 5 will be presented in the form of images and text. While not directly reproducible for readers, we hope that this study will illustrate the potentials and pitfalls of using CNNs to study cultural reuse.

# VRD introduction

The VRD is a methodological toolkit for identifying visual similarities in audiovisual archives with the help of machine learning. It has been assembled because of the lack of open-source solutions for audiovisual copy detection and is meant to help archivists and humanistic scholars study video reuse. The toolkit was originally developed within the research project European History Reloaded: Curation and Appropriation of Digital Audiovisual Heritage, funded by the JPI Cultural Heritage project, EU Horizon 2020 research and innovation program. Its main developer is Tomas Skotare, with assistance from Maria Eriksson and Pelle Snickars. The toolkit is open source, supports all video encodings in the FFmpeg library, and is built to be used in Jupyter Notebook. It can be downloaded as a Docker container () and the source code is openly available on GitHub ().

In what follows, we introduce how the toolkit functions. We also present a demo where the VRD is applied to two videos that are openly available on [Archive.org](https://archive.org/). The first video, entitled [*The Eagle Has Landed: The Flight of Apollo 11*](https://archive.org/details/gov.archives.arc.45017) (license [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/)), contains footage from the first moon landing in July 1969, and was released by the U.S. National Aeronautics and Space Administration (NASA) in the same year. The second video, entitled [*The Moon and Us*](https://archive.org/details/journey-through-the-solar-system-episode-06-the-moon-us) (license [Attribution-NonCommercial-NoDerivs 4.0 International](https://creativecommons.org/licenses/by-nc-nd/4.0/)), consists of Episode 6 from the documentary series *Journey Through the Solar System*, which was also produced by NASA. First released in 1983, the series contains footage from various Apollo missions, including Apollo 11. Both clips are roughly 30 minutes long.

The VRD performs its similarity searches in four main steps. Below, we explain how those steps work.

## Step 1. Extract frames

To begin with, the VRD includes tools for dividing audiovisual content into still frames. Digital videos generally contain 24-30 frames per second and the VRD is instructed to extract one frame per second of video by default. Rather than analyzing audiovisual content in motion, it is these still images that constitute the VRD's main object of analysis. This means that the core of the VRD toolkit can be used to study *all* digital images (incl. photographs, computer animations etc.), although we present a toolkit that has been customized to deal with *moving* digital images. 
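As a rough illustration of this step, the sketch below builds an FFmpeg command that samples one frame per second of video. It is not the VRD's own extraction code: the helper name, the video file name, and the output naming pattern are hypothetical, and only the `fps` video filter is standard FFmpeg usage.

```python
from pathlib import Path


def build_ffmpeg_command(video_path, frames_dir, fps=1):
    """Return an FFmpeg argument list that samples `fps` frames per second.

    Hypothetical helper for illustration; not part of the VRD API.
    """
    # e.g. frames/apollo11/frame_00001.jpg (the naming pattern is an assumption)
    out_pattern = Path(frames_dir) / Path(video_path).stem / "frame_%05d.jpg"
    return [
        "ffmpeg",
        "-i", str(video_path),   # input video
        "-vf", f"fps={fps}",     # keep `fps` frames per second of video
        str(out_pattern),        # numbered still images
    ]


# The file name below is made up for the example.
command = build_ffmpeg_command("media/demo/apollo11.mp4", "frames", fps=1)
print(" ".join(command))
# To actually extract frames (requires FFmpeg and an existing output folder):
# import subprocess; subprocess.run(command, check=True)
```

At one frame per second, a 30-minute video yields roughly 1,800 still frames.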

## Step 2. Produce fingerprints

Second, the VRD uses a so-called Convolutional Neural Network – or CNN – to extract the key visual features found in each frame. We call these extracted visual features 'fingerprints'. CNNs constitute the most commonly used technique for studying visual imagery with the help of artificial intelligence. Modelled to imitate the connectivity pattern of neurons in the visual cortex of animals, neural networks are currently used in areas such as facial recognition (), medical imaging (), and autonomous driving (). Important for our purposes here, CNNs are specialized in identifying visual similarities, which makes them suitable for studying cultural reuse.

While the detailed technical workings of individual CNNs differ, neural networks are broadly designed according to multiple layers of analysis and abstraction. Each layer in a CNN will process an input and produce an output, which is passed on to the next layer. For instance, one layer in a CNN may observe how pixels are spatially arranged in an image and search for areas with a high contrast between nearby pixels (a good marker for what is visually unique in a picture), while another layer might focus on reducing what information is stored about pixel contrasts (instructing the model to “forget” all areas in a picture with a lower pixel contrast than a given value, for example). In this way, the CNN produces a successively smaller and hopefully more precise “map” of the analyzed image. Somewhere before the final layer of a CNN is reached, the network will produce a compressed interpretation of the key visual characteristics of images. It is then common for the remaining layers in a CNN to analyze what appears in the image, for instance by recognizing faces and objects.
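The layered logic described above can be sketched in a few lines of plain Python. This is a toy illustration only: real CNNs learn their filters from training data, whereas the contrast filter, the cut-off value of 50, and the tiny example image below are all hand-picked assumptions.

```python
def horizontal_contrast(image):
    """Convolution-like layer: absolute difference between adjacent pixels."""
    return [
        [abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
        for row in image
    ]


def suppress_low(feature_map, minimum):
    """'Forget' responses below `minimum` by setting them to zero."""
    return [[v if v >= minimum else 0 for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Pooling layer: keep the strongest response in each size x size block."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[y + dy][x + dx]
                for dy in range(size)
                for dx in range(size)
            )
            for x in range(0, width - size + 1, size)
        ]
        for y in range(0, height - size + 1, size)
    ]


# A 4x5 greyscale "image": dark on the left, bright on the right.
image = [
    [10, 10, 200, 200, 200],
    [10, 10, 200, 200, 200],
    [10, 12, 200, 205, 200],
    [10, 10, 200, 200, 200],
]

contrast = horizontal_contrast(image)   # 4x4 map of local contrast
filtered = suppress_low(contrast, 50)   # drop weak contrasts
pooled = max_pool(filtered, size=2)     # 2x2 summary of the image
print(pooled)  # [[190, 0], [190, 0]]
```

The 20 original pixel values are reduced to four numbers that still record where the strong vertical edge sits, which is the sense in which each layer yields a smaller "map" of the image.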

In our case, the VRD will apply a CNN to process individual thumbnails but stop when a compressed yet sufficiently complex interpretation of the key visual features of the image has been produced. Again, we call these key visual features fingerprints and the VRD will use them to find patterns of similarity across videos, while disregarding the analysis done in the remaining layers of the neural net. In more detail, the data in fingerprints consists of Euclidean vectors that mirror the visual characteristics found in the original frames. A vector is a mathematical quantity that discloses the magnitude (or length) and direction of a geometric object. In other words, fingerprints are mathematical abstractions that no longer carry any immediate visual resemblance to their original frames. They do, however, carry important information about what is visually unique in each frame in ways that make them recognizable, even if content has been modified or distorted. For instance, images can be recognized as similar even if someone has adjusted their color, resolution, or composition. This is useful for studying cultural remix practices, for example.

The current version of the VRD is based on TensorFlow 2, which includes the open-source Keras API. We use version 2.9.0 of the Keras API, which makes 11 pre-trained convolutional neural networks available for use. All of these networks can easily be applied in the VRD. Initially, we worked with the neural network ResNet50 (see Snickars et al. 2023) but later switched to a network called EfficientNetB4, which uses significantly less memory. EfficientNet was first released in 2019 and exists in multiple versions (Tan and Le 2020). We have found that version B4 works well for the purpose of identifying video reuse, although we recommend exploring the other CNNs as well. The current version of the VRD applies EfficientNetB4 as is (that is, without any re-training) and extracts its fingerprints from layer 462 out of 476 in the network.

## Step 3. Calculate similarity neighbours

After saving the extracted fingerprints in a database, the VRD applies its third step, in which the Faiss library is used to calculate the closest similarity neighbours for each fingerprint. Faiss is a software library that specializes in large-scale similarity searches and was developed by Facebook AI Research in 2018 (see Faiss.ai). It is considered to be one of the most efficient open-source tools for conducting large-scale similarity searches. For instance, Faiss can be run on Graphics Processing Units (GPUs), which provides significant advantages in terms of speed. While Faiss can be used for any sort of similarity search, the VRD uses it to identify similarities between frame fingerprints.

Faiss uses the extracted fingerprints to index and calculate the visually most similar “neighbours” for each video frame. In more detail, all fingerprints are first added to a Faiss index. Each fingerprint is then compared with all other fingerprints in the index, producing a list of similarity neighbours whose length is defined by the user. Requesting more similarity neighbours requires more processing time, meaning that the number of neighbours that the VRD is instructed to find is always a tradeoff between time and accuracy.

With the help of the Faiss library, all fingerprint neighbours are then assigned a so-called 'distance metric.' This is a value that indicates how closely related their visual features are, according to Faiss. A low distance metric value (or short distance) indicates high visual similarity and a high distance metric value (or long distance) indicates low visual similarity. The distance metric 0.0 represents the absolute closest similarity Faiss can ascribe to two compared fingerprints, and essentially corresponds to the distance that a fingerprint would have to itself (i.e. an absolute match).
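To make the notion of similarity neighbours and distance metrics concrete, the following sketch performs a brute-force nearest-neighbour search in plain Python. It uses squared Euclidean distance, which is what Faiss's exact L2 index reports; whether the VRD relies on exactly this index type and metric is an assumption here, and the three-dimensional toy fingerprints are invented for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbours(fingerprints, k):
    """For each fingerprint, return its k closest entries as (distance, index).

    Brute force: every fingerprint is compared with every fingerprint
    in the collection, itself included.
    """
    results = []
    for query in fingerprints:
        distances = [
            (squared_l2(query, other), j)
            for j, other in enumerate(fingerprints)
        ]
        results.append(sorted(distances)[:k])
    return results


# Toy 3-dimensional "fingerprints" (real ones have far more dimensions).
fingerprints = [
    [1.0, 0.0, 2.0],   # frame 0
    [1.0, 0.5, 2.0],   # frame 1: slightly altered copy of frame 0
    [9.0, 7.0, 0.0],   # frame 2: visually unrelated
]

neighbours = nearest_neighbours(fingerprints, k=2)
print(neighbours[0])  # [(0.0, 0), (0.25, 1)]
```

Frame 0 is its own closest neighbour at distance 0.0, the altered copy follows at a short distance, and the unrelated frame (distance 117.0) falls outside the two requested neighbours.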

To save time and memory and to minimize the display of poor similarity matches, the VRD is set to only export the 250 closest neighbours for each analyzed fingerprint. It is important to note that this threshold is applied ***before*** the VRD runs another filter that removes all matches from the same video. This can cause problems if a video contains a lot of still images or slow visual movements, since long sequences of frames from the same video could then be given a very low distance metric. In such cases, a frame's 250 closest neighbours may be occupied by lots of frames from the same video, while other relevant matches are filtered out. While it would be preferable to filter matches from the same video *before* distance metrics are calculated and exported, the Faiss library does not currently support this feature (see for example https://github.com/facebookresearch/faiss/issues/40). It is, however, possible to adjust the 250 nearest neighbour threshold if desired.
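The effect of the filtering order can be illustrated with a small, self-contained sketch. The function names and the candidate list are invented for the example, and the second function shows the preferable ordering that, as noted above, is not what the current pipeline does.

```python
def top_k_then_filter(neighbours, query_video, k):
    """Keep the k closest neighbours first, then drop same-video matches."""
    closest = sorted(neighbours)[:k]
    return [n for n in closest if n[1] != query_video]


def filter_then_top_k(neighbours, query_video, k):
    """Drop same-video matches first, then keep the k closest neighbours."""
    other = [n for n in neighbours if n[1] != query_video]
    return sorted(other)[:k]


# Candidate neighbours for one frame of video "A": (distance, video, frame).
# Frames 1-4 of "A" form a near-still sequence, hence the very low distances.
candidates = [
    (1.0, "A", 1), (1.5, "A", 2), (2.0, "A", 3), (2.5, "A", 4),
    (40.0, "B", 17), (45.0, "B", 18),
]

print(top_k_then_filter(candidates, "A", k=4))  # []
print(filter_then_top_k(candidates, "A", k=4))  # [(40.0, 'B', 17), (45.0, 'B', 18)]
```

With the first ordering, the four available slots are used up by frames from the same video and the matches in video "B" are lost; raising the neighbour limit is the practical workaround.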

The distance metrics produced by the Faiss library constitute a core element of the VRD's evaluation of visual similarity, although it is important to note that these metrics are dynamic and will fluctuate depending on which dataset is processed. For instance, the quality of the source material and the number of images/frames in the dataset will affect how distance metrics are distributed. This means that there is no absolute value or threshold that indicates what distance metric value corresponds to a "correct" or "actual" instance of reuse. Instead, any use of the VRD will always involve manually exploring how each dataset's unique distance metric values correspond to actual visual likenesses in the source material.
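One simple way to begin such an exploration is to bin the distance values and look at how they are distributed. The sketch below is a minimal standard-library illustration with made-up distances; the function name and bin width are assumptions and it does not reproduce the VRD's own histogram tool.

```python
from collections import Counter


def distance_histogram(distances, bin_width):
    """Count how many distances fall into each bin of width `bin_width`."""
    counts = Counter(int(d // bin_width) * bin_width for d in distances)
    return dict(sorted(counts.items()))


# Made-up distances: a small cluster of close matches and a larger
# cluster of unrelated frames.
distances = [40, 55, 70, 90, 410, 430, 450, 470, 480, 520, 560]

histogram = distance_histogram(distances, bin_width=100)
for lower_edge, count in histogram.items():
    print(f"{lower_edge:>4}-{lower_edge + 99:<4} {'#' * count}")
```

In this invented dataset, the empty range between 100 and 400 suggests a candidate threshold, which would then need to be checked against the actual frames.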

## Step 4. Filter matching results

When the distance metrics for each fingerprint's similarity neighbours have been calculated, the VRD reaches its fourth and final step of image processing. In this step, it is possible to narrow down the search results with the help of two major filtering features. To begin with, it is possible to implement a general distance metric threshold before the final matching results are shown. For instance, the VRD may be instructed to only show fingerprint neighbours with a distance metric below the value 30 000. If this threshold is accurate (again, manual exploration is always necessary here), it should greatly reduce the number of non-matching frames that are shown.

Second, the VRD includes a feature for applying what we call 'sequential filtering'. We define a sequence as an instance where two or more sequential fingerprints (i.e. frames that were extracted one after the other from the original file) from two videos have been given a distance metric below a specified value. If frames 1-6 in Video X and frames 11-16 in Video Y are each given a distance metric below the threshold 20 000, for example, this may be defined as a sequential match. Sequential filtering is used to identify instances when longer chunks of moving images have been reused, and we assume that such chunks are more interesting to users than individual matching frames. Furthermore, we have found that sequential matches are generally of better quality (i.e. more indicative of actual reuse) than individual matching frames, since there is a higher likelihood that actual cases of reuse have been found when at least two frames in a row have been assigned a low distance metric. Sequential filtering is implemented by deciding the minimum sequential length (or duration in seconds/frames) of shown fingerprint neighbours. It is also possible to instruct the VRD to 'tolerate' that a given number of frames within a sequence deviate from the assigned distance metric threshold.
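The idea behind sequential filtering can be sketched as follows. This is a simplified illustration under stated assumptions, not the VRD's `SequenceFinder` implementation: the function, its input format of `(frame_x, frame_y, distance)` tuples, and the exact handling of skipped frames are hypothetical, although the parameter names echo those used in the demo.

```python
from collections import defaultdict


def find_sequences(matches, max_distance, shortest_sequence=2, allow_skip=0):
    """Group frame matches into runs that advance in step in both videos.

    `matches` holds (frame_x, frame_y, distance) tuples for one video pair.
    Returns (start_x, start_y, length_in_frames) tuples.
    """
    # Matches that advance in lockstep share the offset frame_y - frame_x.
    by_offset = defaultdict(list)
    for frame_x, frame_y, distance in matches:
        if distance < max_distance:
            by_offset[frame_y - frame_x].append(frame_x)

    sequences = []
    for offset, frames in by_offset.items():
        frames.sort()
        start = previous = frames[0]
        for frame in frames[1:] + [None]:  # None marks the end of the list
            # A gap of more than `allow_skip` missing frames ends the run.
            if frame is None or frame - previous > allow_skip + 1:
                length = previous - start + 1
                if length >= shortest_sequence:
                    sequences.append((start, start + offset, length))
                start = frame
            previous = frame
    return sorted(sequences)


# Frames 1-6 of Video X against frames 11-16 of Video Y (values invented).
matches = [
    (1, 11, 12000), (2, 12, 15000), (3, 13, 9000),
    (4, 14, 26000),                  # above the threshold
    (5, 15, 11000), (6, 16, 14000),
    (9, 40, 5000),                   # isolated single-frame match
]

print(find_sequences(matches, max_distance=20000, allow_skip=1))
# [(1, 11, 6)]
print(find_sequences(matches, max_distance=20000, allow_skip=0))
# [(1, 11, 3), (5, 15, 2)]
```

Tolerating one deviating frame merges the two shorter runs into a single six-frame sequence, while the isolated match at frame 9 is discarded in both cases because it is shorter than the minimum sequence length.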

To summarize, the VRD performs its similarity search by processing audiovisual content in three main steps: original video files are converted to still frames, still frames are converted to fingerprints, and fingerprints are plotted as similarity neighbours. Alternatively, one could describe this process as a matter of compressing, converting, and abstracting 'raw' audiovisual content into frames, frames into vectors, and vectors into distance metrics. In a fourth step, the VRD offers tools for narrowing down the search results, including features for implementing distance metric thresholds and applying sequential filtering. The final matching results are displayed in diagrams, tables, and frame previews. On the whole, these matching results are meant to function as a guide that points users towards videos that might be interesting to study manually in more detail. In other words, we strongly advise against exporting and using diagrams, tables, and thumbnail comparisons as absolute proof of video reuse and instead emphasize the need to double-check the VRD's search results. In short, we recommend approaching the search results as aids for navigating large video datasets. As will become evident in our explorations of reuse in the SF-archive, this is not least due to the weaknesses and pitfalls that the toolkit brings with it.

# VRD demo

In what follows, we demonstrate the technical functioning of the VRD.

## Similarity search

We begin by importing a series of necessary modules. 

In [None]:
import numpy as np
import pandas as pd
import os
# we disable TensorFlow warnings as they are verbose
# if things do not work, this suppression should be removed
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from IPython.display import display, HTML

from vrd import neural_networks, vrd_project, faiss_helper
from vrd.neural_networks import NeuralNetworks
from vrd.sequence_finder import SequenceFinder

We also need to install ffmpeg since it is a prerequisite for ffmpeg-python.

In [None]:
!apt update
!apt install -y ffmpeg

Next, we start a new project and give it a name. We also tell the VRD where to find the video files we want to work with and choose to apply the neural network EfficientNetB4 from the Keras API.

In [None]:
# set project configuration
demo_project = vrd_project.VRDProject(
    name = 'demo_project', 
    project_base_path="./vrd_projects/",
    video_path = 'media/demo/', 
    network = neural_networks.NeuralNetworks.efficientnetb4,
)

We then extract one frame per second from each video file. These frames are saved in the project directory, under the frames subdirectory.

In [None]:
demo_project.initialize_frame_extractor()

Next, we use EfficientNetB4 to create a fingerprint for each extracted frame and save these fingerprints in a database. The database is saved as 'database_file' in the project directory. We extract the fingerprints from layer 463 in EfficientNetB4. Note that it is important to delete any pre-existing databases and start over if another CNN is used. This is done by specifying force_recreate=True. 

In [None]:
# create fingerprint
demo_project.override_network_default_layer = 463
demo_project.populate_fingerprint_database(force_recreate=False)

Next, we import all fingerprints to the Faiss index and save them as a faiss_index in the project directory. If this has already been done once, the VRD will fetch the saved index. Note that if any changes are made to the source material (i.e. the videos, frames, or fingerprints), it is necessary to recreate the index by setting force_recreate to True. Batches of the Faiss index are saved separately in the "neighbour_batch" subdirectory in the project directory.

In [None]:
demo_project.initialize_faiss_index(force_recreate=False)

We instruct Faiss to calculate the closest similarity neighbours for each fingerprint. Faiss will also assign a distance metric to every compared fingerprint pair. Here, we choose to consider the 250 closest similarity neighbours for each analyzed fingerprint.

In [None]:
# calculate similarity neighbours
demo_project.neighbours_considered = 250
demo_project.initialize_neighbours(force_recreate=True)


## Analyze distance metric distribution

Before we apply some final filters, we plot the distribution of distance metric values in a histogram to get an idea of where it might be suitable to place a distance metric threshold.

In [None]:
demo_project.neighbours.get_distance_histogram();

## Configure filters

Finally, we configure some filters to narrow down the search results. We apply a shortest sequence threshold (in seconds), set a maximum distance metric threshold for all found sequential matches, and decide how many frames should be allowed to deviate from this threshold in a sequence (see the 'allow skip' configuration).

In [None]:
# configure filters
SHORTEST_SEQUENCE = 2
MAX_DISTANCE      = 125
ALLOW_SKIP        = 2


### Nothing to change below this line! ###

finder = SequenceFinder(
    demo_project.neighbours, 
    max_distance=MAX_DISTANCE)

finder.filter_matches_from_same_video()

sequences = finder.find_sequences(
    shortest_sequence=SHORTEST_SEQUENCE, 
    allow_skip=ALLOW_SKIP,
    combine_overlap=True
)

## Output matching results

We then have a look at the matching results in the form of a table.

In cases where only two files are involved, two identical bars will be shown, since both files are involved in every overlap.

In [None]:
display(demo_project.show_most_reused_files(sequences))

Next, we have a look at the sequential matching results in the form of frame thumbnails.

In [None]:
finder.show_notebook_sequence(sequences, show_limit=100, show_shift=False, frame_resize=(30, 30))

## Potentials and pitfalls

Here, we will discuss the potentials and pitfalls of using the tool.

# VRD case study

Here, we will present the results of the case study. 

# Bibliography

<div class="cite2c-biblio"></div>