# Tracking and tracing audiovisual reuse: Introducing the Video Reuse Detector

## anonym


## anonym

## anonym

[![cc-by](https://licensebuttons.net/l/by/4.0/88x31.png)](https://creativecommons.org/licenses/by/4.0/) 
©<AUTHOR or ORGANIZATION / FUNDER>. Published by De Gruyter in cooperation with the University of Luxembourg Centre for Contemporary and Digital History. This is an Open Access article distributed under the terms of the [Creative Commons Attribution License CC-BY](https://creativecommons.org/licenses/by/4.0/)


In [None]:
from IPython.display import Image, display

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
  tf.config.experimental.set_memory_growth(gpu, True)

display(Image("./media/notebook/sample_footprint.jpg"))

# THIS OPTIONAL STEP EXPORTS PLOTS AS PNGS
import plotly.io as pio
pio.renderers.default = "png"
import kaleido

!whoami

The cultural reuse and reappropriation of audiovisual content has been a recurring topic of research in the humanities, not least in studies of remix cultures. An open question that remains, however, is how artificial intelligence and machine learning may help scholars study the reuse and reappropriation of audiovisual heritage. 
In this article, we introduce the Video Reuse Detector (VRD), a methodological toolkit for identifying visual similarities in audiovisual archives with the help of machine learning. Designed to assist in the study of the “social life” and “cultural biographies” (<cite data-cite="1971321/6PBSUNC7"></cite>, <cite data-cite="1971321/N7PZTCJU"></cite>) of video clips, the VRD helps explore how the meaning of historic footage changes when it circulates and is recycled or cross-referenced in video productions through time. The toolkit uses machine learning techniques (specifically, convolutional neural networks), combined with tools for performing similarity searches (specifically, the Faiss library), to detect copies in audiovisual archives. It also assembles a series of tools for trimming and preparing datasets and for filtering and visualizing matching results (such as introducing similarity thresholds, filtering based on sequential matches of frames, and viewing the final matching results). Inspired by the “visual turn” in digital history and digital humanities research, the article introduces the basic logic and rationale behind the VRD, exemplifies how the toolkit works, and discusses how the digitization of audiovisual archives opens up new ways of exploring the reuse of historic moving images. 

Digital methods, Machine learning, Cultural reuse, Video archives

Excerpts from this article have been published in Eriksson, Skotare & Snickars. 2022. 'Understanding Gardar Sahlberg with neural nets: On algorithmic reuse of the Swedish SF archive' in Journal of Scandinavian Cinema, Volume 12(3). p. 225 - 247.

# Introduction

On February 28, 2023, the Swedish YouTuber RosaMannen (the Pink Man) uploaded an episode from the TV documentary series *Sweden, a long time ago* on YouTube. Originally broadcast on Swedish public television in 1997, the 18-minute clip dealt with the history of Swedish handicrafts, ranging from knitting to the tying of brooms and the production of metal nails. Unspectacular in its execution, the episode featured black-and-white footage of farmers preparing wool and cutting grass, and of craftsmen building wooden ladders. Since 2006, RosaMannen (also known as Daniel Johansson) has uploaded tens of thousands of similar TV clips to his YouTube channel, which now provides an impressive overview of Swedish television history. Alongside episodes from the TV series *Sweden, a long time ago*, his channel features an eclectic mix of cooking shows from the early 1980’s and 90’s, children’s shows, televised concerts by popular Scandinavian artists, investigative journalism series, historic news reports, and much, much more. These videos have accumulated millions of views and roughly 40 000 subscribers now follow RosaMannen’s channel. Indeed, Johansson has become something of a national video celebrity, whose persistent commitment to digitizing either purchased or donated home-recorded VHS-tapes has made a vast number of Swedish television shows available to the wider public. 

RosaMannen’s way of digitizing Swedish TV history and making it available on YouTube does not involve remixing or modifying any original content, but his channel is illustrative of amateur archivist practices and a flourishing online video culture. During the past decades, a massive body of audiovisual heritage has been digitized and made available online, not just by television enthusiasts such as Johansson, but by a wide range of cultural heritage institutions. This has radically transformed public access to cultural heritage and fostered new ways of engaging with the audiovisual past. As digitization increases the visibility of historic video content and allows it to circulate online, audiovisual memories are constantly revived, rewritten, and renegotiated. Not least, this includes breathing new life into seemingly mundane and low-key video clips (such as the episode on Swedish handicrafts) which otherwise would likely have drifted far into oblivion. Accompanied by ample user commentaries that interweave nostalgia and personal memories with discussions surrounding nationalism, economics, and politics, the videos on RosaMannen’s YouTube channel are an excellent example of how digital technologies help rewrite and remediate cultural memories. 

Importantly, the digitization of audiovisual heritage has also opened up new ways of studying video archives at scale, as moving image collections can now be explored through computational means. Much as the field of literature has expanded to engage with “distant reading” in the large-scale analysis of text (<cite data-cite="1971321/F98YWULX"></cite>), scholars have highlighted the potential of what has been described as “distant viewing” in the context of audiovisual scholarship (<cite data-cite="1971321/DHJJL7VR"></cite>). For instance, previous studies have explored audiovisual content using algorithmic object and facial recognition techniques (ibid.), geographic mapping technologies (<cite data-cite="1971321/DRRKS77U"></cite>), and automatic color extraction and analysis (<cite data-cite="1971321/6C4DNEI4"></cite>). This growing body of research has inspired researchers to speak of a “visual turn” in the digital humanities, which takes advantage of how “computer vision techniques offer us a macroscopic perspective on thousands of images” (<cite data-cite="1971321/XVEENKLM"></cite>, p.24).

In this article, we explore how filmic reuse can be tracked and traced through algorithmic means. Starting from a selection of television documentary series that are available on RosaMannen's YouTube channel, we explore how digital tools can assist in studying video reuse in historic compilation documentaries. In particular, we set out to explore how footage in some of RosaMannen's digitized television series can be traced back to the so-called SF archive, which contains some of the earliest recordings in Swedish film history. The SF archive includes roughly 5,500 newsreels and film fragments, predominantly from the silent period of film history. Around the millennium, the entire SF archive was digitized in a collaborative effort between Radio Sweden and the Swedish National Archive of Recorded Sound and Moving Images. In our experiment, we work with roughly 1400 of the oldest digitized clips found within the archive, to explore how reuse has taken place between the archive and the selected television series. Four different TV series from the 1980's, 90's and early 2000's were chosen, based on RosaMannen's digitization efforts: (1.) *Förväntningarnas tid* (The Time of Expectations) from 1985, a series in four hour-long episodes; (2.) *Guldkorn* (Gold Nuggets) from the SF archive, six hour-long episodes from 1985 and six from 2002; (3.) *Sverige för länge sedan* (Sweden, A Long Time Ago), a TV series from 1984 in ten episodes, each around twenty minutes long; and (4.) *Hundra svenska år* (A Hundred Swedish Years), eight hour-long episodes from 1999. Since all these TV series focus on Swedish 20th century history, we knew that they would likely include reused content from the SF archive. However, we were unaware of the scale and precise details of what such practices of reuse would look like. 

To assist us in unraveling reuse between the SF archive and the selected TV series, we introduce the Video Reuse Detector, or VRD, a toolkit for detecting visual similarities in audiovisual datasets. The VRD makes it possible to ask questions such as: How is video content recycled, cross-referenced, and re-used in video archives throughout history? And in what ways can the “social life” and “cultural biographies” (<cite data-cite="1971321/6PBSUNC7"></cite>, <cite data-cite="1971321/N7PZTCJU"></cite>) of audiovisual content be traced? These questions are aimed at following the histories and career paths of individual video clips and exploring how their cultural value and meaning transform as they circulate in space and time. As Wolfgang Ernst notes, automated techniques for identifying and processing cultural content provide a unique opportunity to side-step traditional metadata (i.e. textual/semantic descriptions of cultural content), to instead perform searches *within* content itself (<cite data-cite="1971321/FGUTFYAY"></cite>). Thereby, they also introduce a radically new way of navigating and interacting with historic archives.

In the text that follows, we first provide an overview of previous research concerning cultural reuse. We also discuss previous uses of audiovisual analysis tools within digital history and the digital humanities. This overview shows that even though scholars have started to use AI to analyze audiovisual archives, there is a lack of open source toolkits that are customized for studying video reuse. Next, we introduce the Video Reuse Detector, an audiovisual content identification toolkit developed at Humlab, Umeå University. Using machine learning models (convolutional neural nets) and indexation libraries (Faiss), the VRD allows for studying how audiovisual content is reused, remixed and re-appropriated within a given archive. We make use of the Journal of Digital History's notebook setup to first provide a detailed technical demonstration of how the VRD functions. Here, we apply the toolkit to two demonstration videos: [The Eagle Has Landed: The Flight of Apollo 11](https://archive.org/details/gov.archives.arc.45017), which contains original footage from the first moon landing in July 1969 and was released by the U.S. National Aeronautics and Space Administration (NASA) in the same year, and [The Moon and Us](https://archive.org/details/journey-through-the-solar-system-episode-06-the-moon-us), which consists of Episode 6 from the documentary series *Journey Through the Solar System* that was also produced by NASA and contains footage from the Apollo 11 mission. The data and code needed to run this toolkit demo on any computer are openly available on Github, making it possible for anyone to reproduce the demo and test the toolkit by themselves. 

After this demonstration, we show how the VRD can also be of help in the analysis of larger audiovisual datasets, namely the aforementioned SF archive and TV series. In this case, however, the original datasets used could not be made openly available online for copyright and practical reasons. Hence, the second part of the article is closed and non-interactive, yet aims to illustrate the broader potential in applying the VRD to larger source materials. Our demonstration of the VRD toolkit is thus simultaneously an attempt to introduce a method for using machine learning to study video reuse and an attempt to explore how reuse took place between the SF archive and historic Swedish television documentaries in the 80's, 90's and early 2000's. Here, we also reflect upon the double remediation taking place as RosaMannen later picks up on and reuses the same video content on YouTube.

## Studying reuse with neural nets

The reuse, remix, and reappropriation of cultural content (such as the production of compilation documentaries or the circulation of cultural video content online) has been a recurring topic of investigation in the humanities. Central to the field is exploring how cultural memory and identities are (re)shaped through the circulation of historic content, including things such as images, texts, and films. As David Jay Bolter and Richard Grusin note, cultural memory is fundamentally built on the "remediation" and repurposing of already existing memory-matters (<cite data-cite="1971321/8WKJL6UD"></cite>). Through acts of reuse, memories become created, stabilized, and consolidated; and "just as there is no cultural memory prior to mediation there is no mediation without remediation: all representations of the past draw on available media technologies, on existent media products, on patterns of representation and medial aesthetics" (Erll and Rigney 2009, p. 4). To borrow from Aleida Assmann: the medium and media of cultural memories therefore matter (<cite data-cite="1971321/PW6NUM4D"></cite>) and by studying how specific cultural objects (letters, photographs, moving images etc.) have been copied and reused, we can gain important clues regarding the transformation of cultural ideas and identities over time. This is true regardless of whether we are interested in the history of Shakespeare's manuscripts, the afterlife of home recorded VHS tapes on YouTube, or the circulation of viral TikTok memes. Acts of copying and/or reusing cultural materials can be seen as deeply *productive* and performative events where new meanings are always added and produced. To pick up and reuse a fragment from a particular cultural work implies writing a new chapter in the work's afterlife; to embed it with new meanings. Importantly, it also implies forgetting and ignoring other things, namely those fragments that are *not* reproduced. This selective process is at the heart of cultural reuse and adds a political dimension to the practice.

The historic study of cultural reuse has taken many forms––stretching from philology in literature (<cite data-cite="1971321/SQBG3YDT"></cite>), to the study of remix and sampling in popular culture (<cite data-cite="1971321/IQDXCEY4"></cite>, <cite data-cite="1971321/QBIZTPES"></cite>). In the particular case of television history, many have also pointed out the special role of TV in shaping cultural memory, creating nostalgia, and reproducing historic narratives (<cite data-cite="1971321/HDLKG7JB"></cite>, <cite data-cite="1971321/L2CHQEY6"></cite>, <cite data-cite="1971321/F3L4G94K"></cite>). Yet the research field is also faced with methodological challenges. To this date, scholars of cultural reuse have mainly had to follow traces in individual documents (such as references to previous works), or rely on textual metadata. For example, studying how a particular photograph has been reused through time may have involved searching through archives using metadata keywords, and manually studying interesting finds. This is highly time consuming and implies relying on textual descriptions that may be of poor quality and level of detail. As a result of digitization, however, it becomes possible to 'look into' and search through the actual contents in archives at a scale that has not been possible before. Drawing inspiration from fields such as digital forensics (<cite data-cite="1971321/AVXYWLKN"></cite>), we can begin to use computational techniques to investigate the origins and authenticity of cultural objects and search for visual similarities in large archives. 

To this date, automatic content recognition techniques have been amply explored and developed in various computer science fields such as machine vision and information processing/retrieval. Early approaches for detecting image reuse included algorithmic techniques for finding geometric shapes, watermarking (which drew on steganographic techniques to encode messages in images to make them recognizable) (<cite data-cite="1971321/WU3IZPYP"></cite>), as well as hashing and feature detection technologies like ORB, or [Oriented FAST and Rotated BRIEF](https://docs.opencv.org/3.4/d1/d89/tutorial_py_orb.html), which finds visual similarities by identifying pixel intensities and keypoints. A problem with many of these early techniques, however, was that they were slow in processing large amounts of data. Furthermore, they were often unable to detect visual similarities if content had been modified. While moving images that have been copied in full (minute by minute, frame by frame) are often fairly easy to recognize, partial reuses of video content (where short segments of videos are copied), or instances where videos have been heavily remixed or manipulated (for example by changing their color, sharpness, composition or playback speed) present a more difficult task. To face this problem, the field of machine vision and image retrieval has increasingly turned to artificial intelligence and machine learning, where techniques such as Convolutional Neural Networks (or CNN's) provide radically new ways of searching for visual similarities in datasets. 

This is also true in commercial contexts, where the automatic detection of copied cultural content is crucial to efforts to enforce copyright and restrict the circulation of pirated content online. In 2018, for instance, Google claimed that it had invested more than $100 million in building Content ID, an automated system for detecting copyright abuse which scans and monitors reuse in *all* videos that are uploaded on YouTube (<cite data-cite="1971321/6PUB4HEY"></cite>). Google claims that Content ID is responsible for handling more than 98 percent of the copyright disputes that take place on YouTube, meaning that the technology plays a key role in controlling how content enters and exits one of the world’s largest websites. Elsewhere, businesses like Audible Magic have also developed advanced systems for monitoring how copyright protected content circulates online. Specialized in the real-time identification and classification of recorded speech, videos, and music, Audible Magic claims to have the capacity to identify how 25 million “media assets” stemming from 1000 video suppliers and 140,000 record labels appear in television broadcasts, on streaming platforms, in archived TV, and in motion pictures worldwide (Audible Magic, 2020). Would it be possible to adopt similar techniques as those developed in computer science and the rights industry to study cultural reuse?

Recently, scholars in the humanities and social sciences have begun to show how machine learning can be productively applied to visual archives to answer historic research questions. While not immediately concerned with the study of cultural reuse, such research has illustrated how artificial intelligence can assist in navigating and exploring digital archives in radically new ways. For instance, scholars have used artificial intelligence to identify scenes in collections of historic press photos (<cite data-cite="1971321/U9S22YVC"></cite>), and to explore gender representations in millions of historic advertisements (<cite data-cite="1971321/UN4WYWZW"></cite>), as well as nearly 3400 issues of Times Magazine (<cite data-cite="1971321/5BEURW8Z"></cite>). Others have used machine learning to conduct portrait studies in historic photographic datasets (<cite data-cite="1971321/Y5C3N4AT"></cite>), and used AI to explore how iconic photographs are (and aren't) memorized (<cite data-cite="1971321/GFJWVYNX"></cite>). 

A number of projects have also developed methodological toolkits for exploring audiovisual content with the help of AI. This source material is often more complex to work with than still images, because of its temporal dimension and the large size of datasets. Consider, for instance, that digital videos generally contain 24-30 frames or images per second, which means that 10 minutes of video alone may contain 14 400-18 000 images. This poses problems in terms of access to computational power. In the United States, however, the [Media Ecology Project](https://mediaecology.dartmouth.edu/wp/) at Dartmouth College has prototyped a machine vision search engine for identifying objects and actions in archival video, and at the University of Richmond, the [Distant Viewing Lab](https://distantviewing.org/) has released a Python package for analysing visual culture on a large scale, including features like facial recognition. In Germany, the project [Visual Information Retrieval in Video Archives](https://projects.tib.eu/en/viva/projekt/) has also been established to support media archives by developing a ‘videomining software’ that can for example be used to classify events (<cite data-cite="1971321/VST2JWDJ"></cite>), person relations (<cite data-cite="1971321/EUA6HDDW"></cite>) and video scenes based on geolocation (<cite data-cite="1971321/ER4QZFVE"></cite>).

Similarly, [The Sensory Moving Image Archive](https://sensorymovingimagearchive.humanities.uva.nl) project in the Netherlands has experimented with stylometric film analysis (i.e. exploring characteristic stylistic features of films and filmmakers) and released a so-called [Nearest Neighbour Browser](https://isis-data.science.uva.nl/nanne/SEMIA/#662692_1) that allows for exploring visual similarities in datasets (although without using specific images as input). The toolkit [PixPlot](https://github.com/YaleDHLab/pix-plot), developed by scholars at the Digital Humanities lab at Yale, does something similar, as it allows for two-dimensional clustering and visualization of visual similarities in tens of thousands of images. Another example is the software Snoop, which has been developed by the French Institut national de l'audiovisuel (Ina) in collaboration with the French National Institute for Research in Digital Science and Technology (Inria) since 2003. Snoop, which previously went under the name Signatur, is a visual search engine that applies the open source machine learning framework PyTorch to find visual similarities in datasets (<cite data-cite="1971321/RZCYR4B3"></cite>). The toolkit has been used to explore and enrich visual archives by researchers connected to Ina/Inria, as well as cultural heritage institutions such as Musée d’Orsay and the Bibliothèque nationale de France. It has also been used in citizen science projects such as [PlantNet](https://plantnet.org/en/), where it assists in identifying plants in photographic images. However, Snoop is a [proprietary software](https://hal.science/hal-02096036v1) that is not openly available for use.

In the toolkits mentioned above, the developed software has mainly been aimed at using machine learning to extract semantic metadata and textual interpretations of what appears in images (including scene-, object-, or facial recognition), or to find *general* visual similarities in datasets. For example, the toolkits are often geared towards finding examples of when, say, astronauts appear in images, so that several depictions of astronauts can quickly be retrieved and studied. In our case, however, we are not interested in finding examples of how astronauts appear in datasets *in general*. Rather, we want to explore how a *specific* depiction of a *specific* astronaut appears in a dataset, say footage of Neil Armstrong taking his first steps on the moon. Toolkits that are specifically developed to find this kind of visual similarity are less common.

# Video Reuse Detector

As a response to the lack of open source software for audiovisual copy detection, we introduce the VRD, a methodological toolkit for identifying visual similarities in audiovisual archives with the help of machine learning. The VRD is meant to help archivists and humanistic scholars study video reuse and was originally developed within the research project European History Reloaded: Curation and Appropriation of Digital Audiovisual Heritage, funded by the JPI Cultural Heritage project, EU Horizon 2020 research and innovation program. Its main developer is Tomas Skotare, with assistance from Maria Eriksson and Pelle Snickars. The toolkit handles all video formats supported by the FFmpeg framework, and is built to be used in Jupyter Notebooks. It is free to use and download, and the source code is openly available on Github ().

When assembling the VRD, the following rationale has guided our work:

- **All applied software solutions are open source.** In other words, they are licensed to be openly available for anyone to use, study, change and distribute without cost.

- **All applied software solutions can be exchanged at any time.** This is ensured by the modularity of the VRD’s Jupyter Notebook and its associated Github page which is open for anyone to tweak and adjust.

- **No data is exchanged with commercial actors.** Aside from sharing information with Jupyter Notebook and local storage systems chosen by the user, the VRD does not transmit any data to third-party actors (including those who developed parts of its code).

In what follows, we present a demonstration of how the toolkit functions by applying the VRD to two videos that are openly available on [Archive.org](https://archive.org/). 






The first video, entitled [*The Eagle Has Landed: The Flight of Apollo 11*](https://archive.org/details/gov.archives.arc.45017), contains footage from the first moon landing in July 1969, and was released by the U.S. National Aeronautics and Space Administration (NASA) in the same year. 


The second video, entitled [The Moon and Us](https://archive.org/details/journey-through-the-solar-system-episode-06-the-moon-us), consists of Episode 6 from the documentary series *Journey Through the Solar System*, which was also produced by NASA. First released in 1983, the series contains footage from various Apollo missions, including Apollo 11.


Both clips are roughly 30 minutes long and are used in compliance with their respective licenses. Before uploading the videos on Github for this demo, we compressed them with HEVC in order to stay within Github's allowed file size of 100 MB/file.

The VRD toolkit is designed to process data in seven main steps. Technical details regarding these steps can be found in the hermeneutics layer.

## Step 1. Pre-process videos

To begin with, the VRD includes tools for dividing audiovisual content into still frames. Digital videos generally contain 24-30 frames per second and the VRD is instructed to extract one frame per second of video by default. It is these still images that constitute the VRD's main object of analysis.
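The effect of sampling one frame per second can be illustrated with some simple arithmetic. The sketch below is not part of the VRD code base; the function name `frame_counts` and its parameters are hypothetical and only serve to show how much the sampling reduces the number of images to analyze.

```python
def frame_counts(duration_seconds, native_fps, sampled_fps=1):
    """Return (frames in the source video, frames kept after sampling)."""
    total_frames = int(duration_seconds * native_fps)
    sampled_frames = int(duration_seconds * sampled_fps)
    return total_frames, sampled_frames


# A 30-minute clip, roughly the length of each demo video
for fps in (24, 30):
    total, sampled = frame_counts(30 * 60, fps)
    print(f"{fps} fps: {total} frames in total, {sampled} kept at one frame per second")
```

For a clip of this length, the default setting thus leaves 1,800 still images per video, regardless of the frame rate of the source file.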

We begin by importing a series of necessary modules.  

In [None]:
# import modules
import numpy as np
import pandas as pd
import os
import sys
sys.path.insert(0, './script/')
# we disable tensorflow warnings as they are verbose
# if things do not work, this suppression should be removed
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from IPython.display import display, HTML

from vrd import neural_networks, vrd_project, faiss_helper
from vrd.neural_networks import NeuralNetworks
from vrd.sequence_finder import SequenceFinder



We also install ffmpeg since it is a prerequisite for ffmpeg-python.

In [None]:
!apt update
!apt install -y ffmpeg
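
On systems where ffmpeg is already installed, the installation step can be skipped. As a minimal sketch (the helper name `ffmpeg_available` is our own and not part of the VRD), one could check whether the executable is on the PATH before installing anything:

```python
import shutil


def ffmpeg_available(executable="ffmpeg"):
    """Return True if the named executable can be found on the PATH."""
    return shutil.which(executable) is not None


if ffmpeg_available():
    print("ffmpeg found, no installation needed")
else:
    print("ffmpeg not found, install it before extracting frames")
```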


We start a new project and give it a name. We also tell the VRD where to find the video files we want to work with and choose to apply the neural network EfficientNetB4 from the Keras API. We decide to export vectors/fingerprints from a layer called block7b_se_squeeze in EfficientNetB4.

In [None]:
# set project configuration
demo_project = vrd_project.VRDProject(
    name = 'demo_project', 
    project_base_path="./vrd_projects/",
    video_path = 'media/demo/', 
    network = neural_networks.NeuralNetworks.efficientnetb4,
    stop_at_layer='block7b_se_squeeze'
)

We extract one frame per second from each video file. These frames are saved in the project directory, under the frames subdirectory.

In [None]:
# extract frames
demo_project.initialize_frame_extractor(force_recreate=True)

# ^^^^ TODO (Tomas): add a count function here that shows how many frames have been extracted (suggested by Pelle after reading through the notebook). Only if it does not involve too much work.
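
To make this step more concrete, the sketch below shows how one frame per second could be requested from ffmpeg with its `fps` video filter, and how the resulting frames could be counted. It is an illustration under stated assumptions, not the VRD's own implementation: the helper names, the `frame_%05d.jpg` file pattern, and the flat directory of `.jpg` files are all hypothetical, and the VRD's actual project layout may differ.

```python
from pathlib import Path


def build_ffmpeg_command(video_path, frames_dir, fps=1):
    """Build an ffmpeg command that writes `fps` frames per second as JPEGs."""
    # Hypothetical naming pattern: frame_00001.jpg, frame_00002.jpg, ...
    output_pattern = str(Path(frames_dir) / "frame_%05d.jpg")
    return ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}", output_pattern]


def count_frames(frames_dir, suffix=".jpg"):
    """Count extracted frame files in a directory (0 if it does not exist)."""
    directory = Path(frames_dir)
    if not directory.is_dir():
        return 0
    return sum(1 for path in directory.iterdir() if path.suffix.lower() == suffix)


# The command is only built here; running it requires ffmpeg, for example:
# subprocess.run(build_ffmpeg_command("video.mp4", "frames"), check=True)
print(build_ffmpeg_command("video.mp4", "frames"))
print(count_frames("frames"))
```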

## Step 2. Produce fingerprints

Second, the VRD uses a so-called Convolutional Neural Network (CNN) to extract the key visual features found in each frame. We call these extracted visual features 'fingerprints'. CNN's constitute a common technique for studying visual imagery with the help of artificial intelligence. Modelled to imitate the connectivity pattern of neurons in the visual cortex of animals, neural networks are used in areas such as facial recognition, medical imaging, and autonomous driving. We decided to work with CNN’s after doing multiple tests with more traditional tools for visual content recognition, including image hashing and video fingerprinting with the help of [ORB](https://docs.opencv.org/3.4/d1/d89/tutorial_py_orb.html) (Oriented FAST and rotated BRIEF), a method for extracting and comparing visual features in images. CNN’s quickly outperformed ORB at identifying visual similarities in video content, however, both in terms of accuracy and processing speed.

It is beyond the scope of this article to explain how CNN's work in depth (for more information on this, a good place to start is [Wikipedia](https://en.wikipedia.org/wiki/Convolutional_neural_network)), but we make a couple of remarks regarding the technology's basic technical structure. While the detailed technical workings of individual CNN's differ, neural networks are broadly designed according to multiple layers of analysis and abstraction. Each layer in a CNN will process an input and produce an output, which is passed on to the next layer. For instance, one layer in a CNN may observe how pixels are spatially arranged in an image and search for areas with a high contrast between nearby pixels (a good marker for what is visually unique in a picture), while another layer might focus on reducing what information is stored about pixel contrasts (instructing the model to “forget” all areas in a picture with a lower pixel contrast than a given value, for example). In this way, the CNN produces a successively smaller and hopefully more precise map of the analyzed image. Somewhere before the final layer of a CNN is reached, the network will produce a highly compressed interpretation of the key visual characteristics of images. It is then common for the remaining layers in a CNN to classify what appears in the image, for instance by recognizing faces and objects.
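The layered logic described above can be sketched with a toy example. Note that this is only an illustration: the three functions below are hand-written stand-ins (their names are our own), whereas the filters in a real CNN are learned during training and are far more numerous and complex.

```python
import numpy as np


def contrast_layer(image):
    """Absolute intensity difference between horizontally adjacent pixels."""
    return np.abs(np.diff(image.astype(float), axis=1))


def forget_low_contrast(feature_map, threshold):
    """Zero out every value below the threshold (the 'forgetting' step)."""
    return np.where(feature_map >= threshold, feature_map, 0.0)


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    rows = feature_map.shape[0] // size * size
    cols = feature_map.shape[1] // size * size
    trimmed = feature_map[:rows, :cols]
    blocks = trimmed.reshape(rows // size, size, cols // size, size)
    return blocks.max(axis=(1, 3))


# A 4x5 greyscale image: dark on the left, bright on the right
image = np.array([
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
    [10, 12, 10, 200, 200],
    [10, 10, 10, 200, 200],
])

layer_1 = contrast_layer(image)              # shape (4, 4)
layer_2 = forget_low_contrast(layer_1, 50)   # small differences are removed
layer_3 = max_pool(layer_2)                  # shape (2, 2)
print(layer_3)
```

Each step passes its output on to the next, and the final 2x2 map only retains the strong edge between the dark and bright halves of the image, a much smaller description than the original 20 pixels.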

In our case, the VRD applies a CNN to process individual frames but stops when a compressed yet sufficiently complex interpretation of the key visual features of an image has been produced. Again, we call these compressed interpretations fingerprints and the VRD will use them to find patterns of similarity across videos. In more detail, the VRD will export data in the form of vectors from a given layer in a CNN. These vectors mirror the visual characteristics found in the original frames. A vector is a mathematical quantity that discloses the magnitude (or length) and direction of a geometric object. In other words, we can say that fingerprints are mathematical abstractions that carry information about what is visually unique in each frame. What is good about fingerprints is that their detailed yet highly abstracted information can be used to recognize visual similarities even if a video has been modified or distorted. For instance, frames can be recognized as similar even if someone has adjusted their color, resolution, or composition. This is useful when studying cultural remix practices, for example. 
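
Why vectors are useful for comparison can be sketched as follows. The example assumes cosine similarity as the measure and uses made-up four-dimensional fingerprints; both are assumptions for the sake of illustration, as real fingerprints are much longer and the measure used in a given setup may differ.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical fingerprints, invented for this illustration
original = np.array([0.90, 0.10, 0.40, 0.00])
modified = np.array([0.92, 0.09, 0.43, 0.01])   # same frame, slightly altered
unrelated = np.array([0.00, 0.80, 0.10, 0.70])  # a different frame

print(cosine_similarity(original, modified))    # close to 1
print(cosine_similarity(original, unrelated))   # much lower
```

A slightly altered frame yields a vector pointing in almost the same direction as the original, which is what allows modified copies to be recognized as matches.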

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "type":"image",
            "source": ["Example of what a processed frame can look like when extracted from an \
intermediate layer in a CNN, before it is converted into vectors. \
In this example, the CNN ResNet50 was used."
            ]
        }
    }
}

display(Image("./media/notebook/layer_example.jpg"), metadata=metadata)

When the VRD extracts its vectors/fingerprints from a given intermediate layer in a CNN (something that is otherwise done to visualize the learning process of a CNN, for example), the remaining layers of the neural net are not evaluated (more about this in the [Keras documentation](https://keras.io/getting_started/faq/#how-can-i-obtain-the-output-of-an-intermediate-layer-feature-extraction)). We partly do this to save time/computing power, and partly because we are not interested in classifying the content in images. Instead, we essentially use a CNN to compress and vectorize video frames. It is important to understand that this is not how CNNs are normally used. Therefore, we also know comparatively little about which CNN and extraction layer are best suited for the task. Commonly, the performance of CNNs is evaluated based on tests such as the [ImageNet Large Scale Visual Recognition Challenge](https://www.image-net.org/challenges/LSVRC/), where a neural network's capacity to recognize objects/faces in images is compared against human beings' ways of performing the same task. Tests such as the ImageNet Challenge are not designed to evaluate a CNN's ability to compress and vectorize frames, however. When looking at previous evaluations of CNNs and deciding what neural network to use (see [Keras documentation](https://keras.io/api/applications/)), we have therefore mainly searched for CNNs that are relatively new and perform well in overall tests.
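
The idea of stopping at an intermediate layer can be sketched conceptually as follows. This is a plain-Python stand-in under our own assumptions, not Keras code and not the VRD's implementation: the three "layers" and their names are hypothetical, and the point is only to show that layers after the extraction point are never evaluated.

```python
# Conceptual sketch in plain Python; the layer names and functions are
# hypothetical and do not correspond to Keras or to the VRD's source code.

calls = []  # records which layers were actually evaluated

def make_layer(name, fn):
    def layer(values):
        calls.append(name)
        return fn(values)
    return name, layer

network = [
    make_layer("contrast", lambda v: [abs(b - a) for a, b in zip(v, v[1:])]),
    make_layer("squeeze", lambda v: [max(v[i:i + 2]) for i in range(0, len(v), 2)]),
    make_layer("classify", lambda v: "edge" if max(v) > 5 else "flat"),
]

def run_until(network, stop_after, values):
    """Evaluate layers in order and return the output of `stop_after`."""
    for name, layer in network:
        values = layer(values)
        if name == stop_after:
            return values
    raise ValueError(f"no layer named {stop_after!r}")

fingerprint = run_until(network, "squeeze", [0, 0, 0, 9, 9, 9, 0])
print(fingerprint)   # a short vector, not a class label
print(calls)         # the "classify" layer was never evaluated
```

In Keras, the equivalent step is usually expressed by building a second model whose output is the chosen intermediate layer, as described in the documentation linked above.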

The current version of VRD is based on TensorFlow2, which includes the open source Keras API. [TensorFlow2](https://www.tensorflow.org/) is an open-source machine learning platform and software library for working with machine learning and artificial intelligence. It was originally developed by Google and can be used for multiple purposes, including the training and inference of deep neural networks such as CNNs. The [Keras API](https://keras.io/about/) is a deep learning API that runs on top of the TensorFlow platform. Keras functions as an interface for solving machine learning problems and provides building blocks for using machine learning methods. We use version 2.11.0 of the Keras API which makes 11 pre-trained convolutional neural networks available for use. All of these networks are open source and can easily be applied in the VRD. We mainly decided to work with pre-trained neural nets since it was unfeasible for our small research team to train a new network from scratch. Furthermore, we found that many of the neural nets in the Keras API succeeded in producing vectors that were compressed yet detailed enough to study video reuse, without needing to be re-trained or fine-tuned. Hence, we decided to work with the networks directly in their original form. Re-training the networks on relevant datasets would likely further improve the performance of the VRD, however.

Initially, we mainly worked with the neural network [ResNet50](https://keras.io/api/applications/resnet/) (see Snickars et al. 2023) but later switched to a network called [EfficientNetB4](https://keras.io/api/applications/efficientnet/), which is newer and performs better in accuracy tests. EfficientNet was first released by engineers at Google Research's so-called Brain Team in 2019 (<cite data-cite="1971321/K5A3FXNI"></cite>), and at the time of this article's writing, it existed in seven different versions (see [Keras documentation](https://keras.io/api/applications/efficientnet/)). We have found that version B4, which lies in the mid-range in the tradeoff between accuracy and speed among the EfficientNet versions, works well for the purpose of identifying video reuse. The current version of the VRD applies EfficientNetB4 *as is* (that is, without any re-training) and extracts its fingerprints from a layer called 'block7b_se_squeeze'. This layer is found towards the end of the network model (at the time of this article's writing, layer 463 out of 476). When deciding where to extract our fingerprint vectors, we wanted to find a layer that contained complex interpretations of the visual features in frames, yet produced vectors that were sufficiently compressed to work with large datasets. In addition, we wanted to find a layer where the neural network had not started to apply its object classification too strictly. In our tests, 'block7b_se_squeeze' appeared to live up to these qualifications. Importantly, however, we recommend exploring the use of other layers (and CNNs) when using the VRD, as we have not performed any comprehensive performance tests of all CNNs in the Keras API and their respective layers.

We use EfficientNetB4 to create a fingerprint for each extracted frame and use these to populate the fingerprint database. The database is saved as 'database_file' in the project directory. Note that it is important to delete any pre-existing databases and start over if another CNN is used. This is done by specifying force_recreate=True. 

In [None]:
# create fingerprints
demo_project.populate_fingerprint_database(force_recreate = False)

## Step 3. Calculate similarity neighbours

After saving extracted fingerprints in a database, the VRD applies its third step, where the so-called [Faiss library](https://faiss.ai/) is used to calculate the closest similarity neighbours for each fingerprint. Faiss is an open source software library that specializes in large-scale similarity searches. First released by Facebook AI Research in 2017, it efficiently clusters dense vectors and conducts large-scale similarity searches. For instance, Faiss can be run on Graphical Processing Units (GPUs), which provides significant advantages in terms of speed. As an open [Faiss manual](https://www.pinecone.io/learn/faiss-tutorial/) explains, Faiss will index a given set of vectors and then use another vector (called the query vector) to search for the most similar vectors within the index. Faiss allows for determining which vectors are similar by measuring the Euclidean distance between *all* given points within an index, using an index type called IndexFlatL2. FlatL2 performs a so-called *exhaustive* search and is very accurate in its evaluation of vector similarity, but slow since it matches/compares *every* point in the index (in our case, fingerprints) against the other, one by one. To speed up this process, the Faiss library contains several methods for optimizing similarity searches, although these will always be implemented at the cost of accuracy.

For instance, it is possible to optimize the speed of a Faiss similarity search by partitioning the index (i.e. limiting the search scope to *approximate* similarity, rather than calculating an absolute similarity), or by way of so-called quantization, which involves compressing the size of vectors (more about this in the following [Faiss tutorial](https://www.pinecone.io/learn/faiss-tutorial/)). When Faiss has determined (or approximated) the similarities found within an index, it will assign a so-called distance metric to each compared pair of vectors. This is a value that indicates how closely related their features are, according to Faiss. A low distance metric value (or short distance) indicates high similarity and a high distance metric value (or long distance) indicates low similarity. The distance metric 0.0 represents the absolute closest similarity Faiss can ascribe to two compared vectors, and essentially corresponds to the distance that a vector would have to itself (i.e. an absolute match).
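
To make the notion of an exhaustive search and a distance metric concrete, the following minimal sketch compares a query vector against every vector in a small list. It is written in plain Python under our own assumptions and does not call the Faiss API; the function names are hypothetical. Faiss's flat L2 index is documented as reporting squared Euclidean distances, which is why the sketch omits the square root.

```python
# Plain-Python sketch of an exhaustive L2 search; this is not the Faiss API.

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exhaustive_search(index_vectors, query, k):
    """Compare `query` against every indexed vector and return the k
    closest as (distance, position) pairs, nearest first."""
    scored = [(squared_l2(vec, query), pos)
              for pos, vec in enumerate(index_vectors)]
    return sorted(scored)[:k]

fingerprints = [
    [1.0, 0.0, 2.0],   # position 0
    [1.0, 0.5, 2.0],   # position 1, close to position 0
    [9.0, 9.0, 9.0],   # position 2, far from both
]

neighbours = exhaustive_search(fingerprints, fingerprints[0], k=2)
print(neighbours)
```

The query vector's closest neighbour is itself, at distance 0.0, followed by the nearly identical vector at position 1. The cost of this approach grows with the number of indexed vectors, since every one of them is visited for every query.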

While Faiss can be used for any sort of similarity search, the VRD uses it to identify visual similarities between frame fingerprints. In particular, the VRD will apply IndexFlatL2 to perform an exhaustive search and compare all fingerprints against each other, without optimizing the similarity search. While this is costly in terms of speed/processing power, it allows the VRD to later find sequential matches in the analyzed videos, a feature that is central for how the toolkit works. It is possible for VRD users to override the use of IndexFlatL2 and instead use the Faiss library's optimization methods. However, this implies that the VRD's current structure for outputting final matching results in the form of sequential matches will be lost.

To save time/memory and minimize the display of poor similarity matches, the Faiss index comes with a default setting that only shows the 100 closest neighbours for each analyzed vector. While it is generally desired to limit the number of neighbours shown for each vector, this threshold comes with drawbacks because of how the VRD is built. More specifically, it is important to note that the threshold is applied ***before*** the VRD runs another filter that removes all matches from the same video. This can cause problems if a video contains a lot of still images or slow visual movements, since long sequences of frames from the same video could then be given a very low distance metric. In such cases, a frame's 100 closest neighbours may be occupied by lots of frames from the same video, while other relevant matching results are "pushed" out of the top 100 list. When all matches from the same video are filtered out from the top 100 similarity neighbours, important matches could thus be lost. While it would be preferable to filter matches from the same video *before* distance metrics are calculated and exported, the Faiss library unfortunately does not support this feature at the moment (a problem that has also been noted by [others on Github](https://github.com/facebookresearch/faiss/issues/40), for example). It is, however, possible to adjust the 100 nearest neighbour threshold to reduce the risk of filtering out interesting matching results. This is done in the VRD, as the threshold is increased from 100 to 250 by default.
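
The effect of applying the neighbour limit before the same-video filter can be shown with a small, made-up example. The sketch below uses plain Python and hypothetical file names and distances; it mirrors the order of operations described above rather than the VRD's actual code.

```python
# Hypothetical records: (video name, frame number, distance to the query frame).
candidates = [
    ("video_a.mp4", 11, 1.0),
    ("video_a.mp4", 12, 2.0),
    ("video_a.mp4", 13, 3.0),
    ("video_b.mp4", 40, 4.0),   # the match we actually care about
    ("video_c.mp4", 7, 90.0),
]

def neighbours_from_other_videos(candidates, query_video, k):
    """Keep the k closest candidates first, then drop same-video matches,
    mirroring the order of operations described in the text."""
    top_k = sorted(candidates, key=lambda c: c[2])[:k]
    return [c for c in top_k if c[0] != query_video]

small_k = neighbours_from_other_videos(candidates, "video_a.mp4", k=3)
larger_k = neighbours_from_other_videos(candidates, "video_a.mp4", k=4)
print(small_k)    # nothing left: all three slots were taken by video_a
print(larger_k)   # the match from video_b survives
```

With a limit of three, the query frame's own video occupies every slot and the relevant match is lost. Raising the limit recovers it, at the cost of storing more neighbours per fingerprint.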

The distance metrics produced by the Faiss library constitute a core element of the VRD's evaluation of visual similarity, although it is important to note one final thing, namely that these metrics are dynamic and will change/fluctuate depending on which dataset is processed. For instance, the quality of the source material and the number of images/frames in the dataset will affect how distance metrics are distributed. Likewise, the distribution of distance metrics is highly affected by which neural network and neural network layer is used. This means that there is no absolute value or threshold that indicates what distance metric value corresponds to a "correct" or "actual" instance of reuse. Instead, any use of the VRD will always involve manually exploring how each project's unique distance metric values correspond to actual visual likenesses in the source material.

We instruct Faiss to output the 250 closest neighbours for each fingerprint. These will be saved in the "neighbour_batch" subdirectory in the project directory. The VRD will apply Faiss IndexFlatL2 as is (i.e. perform an exhaustive search where all fingerprints in the index are compared against each other). To change this setting and make use of the Faiss library's similarity search optimization, changes have to be made to the source code. Note, however, that this will imply that the VRD's way of finding sequential matches is lost.

We create a Faiss index of the fingerprints and save it as faiss_index in the project directory. If this has already been done once, the VRD will fetch the saved index. Note that if any changes are made to the source material (i.e. the videos, frames, or fingerprints) or the selected CNN model, it is necessary to recreate the index by setting force_recreate to True.

In [None]:
# index fingerprints using Faiss
demo_project.initialize_faiss_index(force_recreate = True)

In [None]:
# calculate similarity neighbours
demo_project.neighbours_considered = 250
demo_project.initialize_neighbours(force_recreate = True)

## Step 4. Analyze distance metric distribution

To assist in determining how distance metric values should be treated, it is possible to view a distance metric visualisation (histogram) that displays how metrics are distributed within a dataset. This may help in figuring out where to place a distance metric threshold and filter out irrelevant matching results. If the threshold is placed too low, interesting matching results may get lost. If the threshold is placed too high, there is a risk of being shown a high number of uninteresting matches.

We create a histogram to get an idea of where it might be suitable to place a distance metric threshold. When using the tool, we have found that placing the threshold around the point where the histogram indicates a sharp increase in found neighbours is a good start. 
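
As a rough illustration of this rule of thumb, the sketch below bins a handful of made-up distance values and reports the bin where the count of neighbours rises most sharply. The bin width, the sample values, and the "largest jump" rule are our own illustrative assumptions, not VRD defaults, and the result should only ever be treated as a starting point for manual exploration.

```python
# Sketch of one way to read a threshold off a histogram; the bin width and
# the "largest jump" rule are illustrative assumptions, not VRD defaults.

def histogram(distances, bin_width):
    """Count distances per bin; returns a list indexed by bin number."""
    counts = [0] * (int(max(distances) // bin_width) + 1)
    for d in distances:
        counts[int(d // bin_width)] += 1
    return counts

def suggest_threshold(counts, bin_width):
    """Return the lower edge of the bin with the largest increase in
    neighbours compared to the bin before it."""
    jumps = [counts[i] - counts[i - 1] for i in range(1, len(counts))]
    steepest = jumps.index(max(jumps)) + 1
    return steepest * bin_width

# Made-up distances: a few close matches, then a dense mass of non-matches.
distances = [5, 12, 18, 27, 41, 43, 44, 46, 47, 48, 49, 52, 55, 58]
counts = histogram(distances, bin_width=10)
suggested = suggest_threshold(counts, bin_width=10)
print(counts)
print(suggested)
```

In this toy dataset the sharp rise begins at the bin starting at 40, so a threshold somewhere around that value would be a reasonable first guess.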

In [None]:
from IPython.display import Image, display

metadata={
    "jdh": {
        "module": "object",
        "object": {
            "type":"image",
            "source": ["Suggestion of where to set the threshold in the \
analysis of a distance metric histogram."
            ]
        }
    }
}

display(Image("./media/notebook/distance_illustration.jpg"), metadata=metadata)

In [None]:
# get distance histogram
demo_project.neighbours.get_distance_histogram();

## Step 5.  Filter matching results

In the final step of the VRD's image processing, it is possible to narrow down the search results with the help of two major filtering features. To begin with, it is possible to implement a general distance metric threshold before the final matching results are shown. For instance, the VRD may be instructed to only show fingerprint neighbours with a distance metric below the value 30 000. If this threshold is accurate (again, manual exploration is always necessary here), it should greatly reduce the number of shown non-matching frames.

Second, the VRD includes a feature for applying what we call 'sequential filtering'. We define a sequence as an instance where two or more sequential fingerprints (i.e. frames that were extracted one after the other from the original file) from two videos have been given a distance metric below a specified value. If frames 1-6 in Video X and frames 11-16 in Video Y are each given a distance metric below the threshold 20 000, for example, this may be defined as a sequential match.

Sequential filtering is used to identify instances when longer chunks of moving images have been reused and we assume that such chunks are more interesting to users than individual matching frames. Furthermore, we have found that sequential matches are generally more indicative of actual reuse than individual matching frames, since there is a higher likelihood that actual cases of reuse have been found when at least two frames in a row have been assigned a low distance metric. Sequential filtering is implemented by deciding the minimum sequential length, or duration in seconds/frames. It is also possible to instruct the VRD to 'tolerate' that a given number of frames within a sequence deviates from the assigned distance metric threshold.
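
The logic of sequential filtering with a tolerance for deviating frames can be sketched as follows. This is a simplified illustration under our own assumptions, not the VRD's SequenceFinder implementation: it takes a list of distances for frame pairs that are already aligned (frame i in Video X against frame i plus a fixed offset in Video Y) and borrows the parameter names max_distance, shortest_sequence, and allow_skip for readability.

```python
def find_runs(distances, max_distance, shortest_sequence, allow_skip):
    """Return (start, length) for each run of aligned frame pairs whose
    distance stays at or below max_distance, tolerating up to allow_skip
    deviating pairs inside a run."""
    runs = []
    start = last_good = None
    skipped = 0

    def close_run():
        # A run ends at the last pair that was under the threshold.
        length = last_good - start + 1
        if length >= shortest_sequence:
            runs.append((start, length))

    for i, d in enumerate(distances):
        if d <= max_distance:
            if start is None:
                start, skipped = i, 0
            last_good = i
        elif start is not None:
            skipped += 1
            if skipped > allow_skip:
                close_run()
                start = None
    if start is not None:
        close_run()
    return runs

# Made-up distances, one per extracted frame pair.
distances = [10, 12, 300, 11, 9, 14, 400, 500, 13]
tolerant = find_runs(distances, max_distance=125, shortest_sequence=5, allow_skip=1)
strict = find_runs(distances, max_distance=125, shortest_sequence=5, allow_skip=0)
print(tolerant)
print(strict)
```

When one deviating frame is tolerated, the first six frame pairs form a single sequence. Without any tolerance, the same material falls apart into fragments that are too short to be reported.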

Importantly, the VRD's sequential filtering also includes an option called 'combine overlap'. This option is a best-effort attempt at combining overlapping sequences into one. Overlapping sequences are an issue that may occur if several seconds in a reused video look very similar. For instance, this can involve reused videos with still footage, or slowly changing scenes such as panoramic overviews of landscapes.

For example, let's imagine that a scene depicting an ocean has been reused in two videos. In the reused scene we see a clear blue sky, a horizon, and moving waves but nothing else is happening in the picture. In such an example, several overlapping sequences may be found relating to the same footage. For instance, the VRD may find that frames 5-10 in Video 1 look very similar to frames 20-25 in Video 2. But the VRD may also find that frames 6-11 in Video 1 and frames 21-26 in Video 2 have a high visual similarity. This creates an overlapping sequence match with a slight time shift (i.e. starting one or two seconds before or after). If such overlapping sequences are not merged into one, users are faced with unnecessary clutter when looking at the sequential matching results. Hence, our solution is to combine overlapping sequences into one single sequence.

In order to achieve this, we search for sequences that overlap (i.e. contain the same frames), and merge them by starting at the first start time and ending at the last end time. This method can combine multiple different sequences into one, even if any one sequence does not overlap with all others. In the best-case scenario, this will leave us with a longer overall sequence that contains frames from two videos that are perfectly aligned. However, in the current implementation, a cumulative error from merging multiple overlapping sequences can in some cases result in a combined sequence that is significantly degraded compared to the original sequences. This occurs when the actual overlap has an offset of several seconds. As a consequence, the distance metric (which is calculated from the mean of the frame-by-frame distance) can become misleadingly large, or even in some cases undefined. While this issue is currently unresolved, we believe that it can be mitigated by performing a final matching step where the sequences are realigned to better match. A preview of this feature can be viewed by specifying show_shift=True as an argument to the show_notebook_sequence function.
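
The merging step can be sketched as a simple sweep over sorted sequences. The code below is a minimal illustration under our own assumptions rather than the VRD's implementation: each sequence is represented as a hypothetical tuple of inclusive frame ranges in both videos, and a sequence is only compared against the most recently merged one.

```python
def merge_overlapping(sequences):
    """Merge sequences whose frame ranges overlap in both videos.
    Each sequence is (start_1, end_1, start_2, end_2), ends inclusive."""
    merged = []
    for s1, e1, s2, e2 in sorted(sequences):
        if merged:
            m1, n1, m2, n2 = merged[-1]
            # Sorted order guarantees s1 >= m1, so only the end needs checking.
            overlaps = s1 <= n1 and s2 <= n2 and e2 >= m2
            if overlaps:
                merged[-1] = (m1, max(n1, e1), min(m2, s2), max(n2, e2))
                continue
        merged.append((s1, e1, s2, e2))
    return merged

sequences = [
    (5, 10, 20, 25),    # frames 5-10 in Video 1 match frames 20-25 in Video 2
    (6, 11, 21, 26),    # the same footage, shifted by one frame
    (11, 14, 26, 29),   # overlaps the second sequence but not the first
    (40, 45, 60, 65),   # unrelated match elsewhere in the videos
]

combined = merge_overlapping(sequences)
print(combined)
```

The first three sequences are chained into one even though the third does not overlap the first directly. In this example all sequences share the same offset between the two videos; when they do not, the merged ranges no longer line up frame by frame, which is the source of the degraded distance values mentioned above.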

To summarize, the VRD thus performs its similarity search by processing audiovisual content in three main steps: as original video files are converted to still frames, as still frames are converted to fingerprints, and as fingerprints are plotted as similarity neighbours. Alternatively, one could describe this process as a matter of compressing, converting, and abstracting 'raw' audiovisual content to frames, frames to vectors, and vectors to distance metrics. Furthermore, the VRD contains tools for narrowing down the search results, including features for implementing distance metric thresholds and applying sequential filtering. The final matching results are displayed in diagrams, tables, and frame previews. On the whole, these matching results are meant to function as a guide that points users towards videos that might be interesting to study manually in more detail. In other words, we strongly advise against exporting and using diagrams, tables, and thumbnail comparisons as absolute proof of video reuse and instead emphasize the need to double-check the VRD's search results. The VRD should be approached as an assistance tool in navigating large video datasets. As will become evident in our explorations of reuse in the SF-archive, this is not least due to the weaknesses and pitfalls that the toolkit brings with it.

We configure some filters to narrow down the search results. We set a maximum distance metric threshold for all found sequential matches, filter out all fingerprint matches from the same video, and decide on a shortest sequence threshold (in seconds). Furthermore, we decide how many frames should be allowed to deviate from this threshold in a sequence (see 'allow skip' configuration) and combine all overlapping sequences into one.

In [None]:
# configure filters

demo_finder = SequenceFinder(
    demo_project.neighbours, 
    max_distance=125)

demo_finder.filter_matches_from_same_video()

demo_sequences = demo_finder.find_sequences(
    shortest_sequence=5, 
    allow_skip=2,
    combine_overlap=True
)

## Step 6. Output matching results

Final matching results are shown in the form of frame thumbnails. The longest found sequence is shown first. It is possible to limit the number of shown sequences and customize the frame size. To study the matching results, go to the article's hermeneutics layer.

We adjust the frame size and number of shown sequences and have a look at the final matching results.

In [None]:
demo_finder.show_notebook_sequence(demo_sequences,show_limit=10, frame_resize=(70,70))

In the matching results above, we see that the VRD finds 12 sequences (including sequence no. 0) that are at least 5 seconds long. According to the previous filtering configurations, each pair of frames has been assigned a distance metric value below 125 and a maximum of 2 frames per sequence are allowed to deviate from this threshold.

The longest found sequence is 26 frames (or seconds) long and consists of a clip where Neil Armstrong slowly climbs down a ladder and takes his first steps on the surface of the moon (see sequence no. 0). Vaguely discernible due to low resolution and a dark shadow cast by the Eagle shuttle, we see Armstrong wearing a white space suit in the lower left corner.

The second longest sequence (no. 1) depicts the Eagle shuttle's descent towards the moon, and includes panoramic scenes of the moon surface. The top left corner of the image is black, since the camera's view is partly blocked by the aircraft. Upon closer inspection, however, we find that the shown frames are not 100% identical. This illustrates the VRD's (or rather, the Faiss library's) particular ways of dealing with videos that contain slowly changing scenes. While the toolkit has, indeed, found frames with a high visual similarity, it is clear that we are seeing a matching result that appears to be a few seconds "off". This is simply due to the fact that the frames are similar enough to receive a comparatively low distance metric value. Theoretically, it would be possible to deal with this problem by lowering the distance metric threshold, so that only closer matches are accepted. However, this would likely imply that other interesting matches could get lost. In cases such as this, we therefore recommend accepting the somewhat imperfect matching result.

In the same sequence, we also notice that the VRD has matched several near-black frames (see the eight frames furthest to the right). Notably, this issue also occurs in sequence no. 9, where a series of near-black frames has been given a low distance metric value. While these matches are *technically correct* (indeed, the frames do have a very high visual similarity), the matching results are practically irrelevant for the purpose of studying video reuse. If matching single-color or near-single-color frames is a big problem in our dataset, we can deal with this problem by running an additional filter called "remove_unwanted_sequences" (more about this soon).

When looking at the remaining matching results, we see that footage of the Eagle shuttle's descent towards the moon can also be found in three other sequences (no. 3, 4 and 11). This is due to the fact that the documentary *The Moon and Us* only reuses parts of the original footage. More precisely, we find that footage from the Eagle's descent was shown for 20 seconds in the TV-series episode *The Moon and Us*, starting 6 min. and 23 sec. into the film. In the original video *The Eagle Has Landed*, however, footage from the Eagle's descent is shown for roughly 2 minutes and 34 seconds (starting at 09 min. and 3 sec.). Furthermore, very similar footage is shown for 1 min. and 8 sec., starting 18 min. and 20 sec. into *The Eagle Has Landed*. This time, however, it is not the Eagle's descent, but *ascent* that is shown, as the shuttle begins its journey back to earth.

What has happened in our matching results is that the same sequence of frames from *The Moon and Us* has been matched with different scenes depicting the Eagle's descent to and ascent from the moon in *The Eagle Has Landed*. This issue is difficult to correct with any filter configurations, since the footage does share many visual similarities. However, it can be helpful to know that when similar frames are found in different sequences, this can be an indication that similar footage appears several times within the same original video. Alternatively, it can indicate that footage has been cut/shortened in the process of reuse.

In the final sequence that includes footage of the Eagle's descent (sequence no. 11 shown at the very bottom of the matching results), the matching results are simply inaccurate. It is difficult to determine precisely why these matched frames have been given such a low distance metric value. This is a recurring problem with both CNNs and the Faiss index, given that the algorithmic models are largely 'black boxed' and do not contain any features for explaining *how* the models have come to a particular decision. One can speculate, however, that it has to do with the existence of vertical lines in the images.

The third longest sequence found in the matching results (no. 2, 17 seconds long) shows a series of still images that were taken during the collection of rock samples and the placement of scientific equipment on the moon. Still photographs taken from the same scientific excursion can be found in three of the other sequences (no. 5, 6 and 10). In the longest two of these sequences, the matching results are accurate, although in the third sequence we once again see an example of how very similar footage has been mismatched.

Of the remaining sequences, the first (no. 7) is 8 seconds long and depicts Neil Armstrong's and Buzz Aldrin's view from the Eagle shuttle. The second sequence (no. 8) is 4 seconds long, and depicts footprints on the moon surface.

## Step 7. Run additional filters

After having studied the first set of found sequences, we can choose to apply some additional filters and rearrange the sorting of the final output. For instance, it is possible to remove unwanted sequences from the matching results, thereby 'cleaning' the dataset from uninteresting frames. This can for example be useful if long sequences of black (or near-black) frames are found in the search results. Likewise, it can be helpful if frames with text have been excessively matched, as this is a commonly occurring problem.

It is also possible to change the sorting of the matching results, only show matching results for one or several selected videos, and re-adjust the distance metric threshold, shortest sequence threshold, and allowed skipped frames within each sequence.

To remove unwanted sequences, we fill out an Excel template called "demo_unwanted_frames.xlsx" which is placed in the demo project folder, containing the columns `Video`, `Start time`, and `Duration`. We copy-paste the information to be entered in the columns directly into the Excel spreadsheet from the text shown above each sequence preview. The "Video" column should contain information about the video name (including any extension such as .avi or .mpg). The "Start time" column should include information about the start time in HH:MM:SS format (such as `01:55:22` for 1 hour, 55 minutes and 22 seconds into the video file), and the "Duration" column should state the duration of the sequence to ignore in seconds (e.g. `120` for 2 minutes). There is no problem with additional columns, as these will simply be ignored. When the Excel spreadsheet is finished and saved, we run the remove_unwanted_sequences filter.
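
To clarify how the three columns are meant to be read, the sketch below converts a few made-up rows into time ranges expressed in seconds. It is an illustration under our own assumptions and not the code behind remove_unwanted_sequences: the rows are written as plain Python dictionaries instead of being read from the spreadsheet, and the file names and values are hypothetical.

```python
# Rows as they might be read from the spreadsheet; the file names and values
# are made up, and extra columns are ignored as described in the text.
rows = [
    {"Video": "example_a.mpg", "Start time": "00:06:23", "Duration": 20},
    {"Video": "example_b.avi", "Start time": "01:55:22", "Duration": 120,
     "Note": "credits"},
]

def to_seconds(timestamp):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def unwanted_ranges(rows):
    """Return (video, first_second, end_second) for every row,
    with end_second exclusive."""
    ranges = []
    for row in rows:
        start = to_seconds(row["Start time"])
        ranges.append((row["Video"], start, start + int(row["Duration"])))
    return ranges

ranges = unwanted_ranges(rows)
print(ranges)
```

Any matched sequence that falls within one of these ranges in the named video would then be a candidate for removal.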

In [None]:
before_filtering = set(demo_sequences)
SequenceFinder.remove_unwanted_sequences(demo_sequences, demo_project, 'demo_unwanted_frames.xlsx')
after_filter = set(demo_sequences)

When the filter has been applied, we can double-check what sequences were removed below. The filtered sequences will not appear again in later results.

In [None]:
demo_finder.show_notebook_sequence(before_filtering.difference(after_filter), show_limit=10, frame_resize=(100,100))

As can be seen in the code cell above, two matched sequences were removed from the dataset.

Aside from removing unwanted sequences, it is possible to change the sorting of the matching results. Initially, matching results are sorted according to the length of sequences, but it is possible to adjust this to instead sort according to the lowest/shortest median distance metric value per sequence. We can also re-adjust the distance metric threshold, shortest sequence threshold, and allowed skipped frames within each sequence, or filter the search results so that only matches from one or several selected videos are shown. The prompts for applying additional filters can be found below.

In [None]:
# Additional filters
import numpy as np  # needed by sort_by_lowest_distance

def sort_by_lowest_distance(sequences, finder):
    new_seq_order = np.argsort([finder.get_sequence_mean_distance(*x)[0] for x in sequences])
    new_sequence = [sequences[x] for x in new_seq_order]
    return new_sequence

def sort_by_longest_duration(sequences):
    return sorted(sequences, key = lambda x: x[2], reverse=True)

def filter_minimum_duration(sequences, minimum_duration):
    return [(v1,v2,dur) for v1,v2,dur in sequences if dur >= minimum_duration]

def filter_maximum_duration(sequences, maximum_duration):
    return [(v1,v2,dur) for v1,v2,dur in sequences if dur <= maximum_duration]

def sort_by_shortest_duration(sequences):
    longest_duration = sort_by_longest_duration(sequences)
    longest_duration.reverse()
    return longest_duration

def contains_video(sequences: list , video: str, finder: SequenceFinder):
    frames = finder.neigh.frames
    indexes = frames.get_index_from_video_name(video)
    return [(v1,v2,dur) for v1,v2,dur in sequences if (v1 in indexes) or (v2 in indexes)]

To try them out we recommend you download the VRD and demo dataset on your local computer. For instance, if you want to show sequences that are a maximum of 5 seconds, sorted by shortest duration first, you run the code below. You can also decide to run more filters at the same time. 

In [None]:
def demonstration_filter(sequences):
    sequences = filter_maximum_duration(sequences, 5)
    sequences = sort_by_shortest_duration(sequences)
    return sequences

demo_finder.show_notebook_sequence(demo_sequences,show_limit=10, frame_resize=(100,100), sort_order=demonstration_filter)

## Potentials and pitfalls

Unsurprisingly, the VRD toolkit works very well for finding visual similarities in video collections that have a high resolution and similar color/contrast scheme. The less the original materials have been modified, edited, and/or remixed, the easier it is for the toolkit to find relevant similarity matches. There are, however, a couple of instances where the toolkit performs less well. One such instance is when the VRD is applied to source materials that contain lots of textual and symbolic overlays, such as subtitles, news show banners, and/or TV-channel symbols. Another problem is videos that contain a lot of single-color (or near-single-color) frames. This is because text, symbols and (near-)single-color frames tend to receive very low distance metric values. While this is often technically correct (since the pictures often do look very similar), it may cause problems in the search for cultural reuse.

For instance, finding out that several black frames share visual similarities is of little use-value in research situations. Likewise, it is comparatively uninteresting to find out that frames with text look alike, especially when the text in such images does not contain the same words or even language, which is often the case. Similar problems also arise with symbols that are mistaken for other symbols, even though they share few visual similarities. If a recurring problem in the matching results is images with texts, symbols, or (near-)single-color frames, we suggest using the function "remove_unwanted_sequences" to clean the dataset (see hermeneutics layer). This can be applied by manually selecting precisely what sequences to remove/ignore in the matching results, or by applying a general rule for ignoring frames in all videos. For instance, if a recurring problem in the matching results is frames showing video introductions and aftertexts (sections that often contain text, symbols, and black frames), it might be worth considering removing a fixed number of frames from the beginning and end of each analyzed video. However, this will always come at the expense of losing possibly interesting matching results.

The VRD's embeddedness in Jupyter Notebooks also brings with it some potentials and pitfalls. While Jupyter Notebook provides excellent opportunities to document workflows and comment on code, the way the software keeps state in memory carries some drawbacks. More precisely, Jupyter Notebook stores information about executed code during each session, so that it is possible to work in a non-linear way, jumping back and forth between code cells without necessarily re-running the entire notebook (<cite data-cite="1971321/B4MKWQ9W"></cite> p. 58). This can be very useful, but it also means that if code is executed and later deleted from a notebook during a session, the results of the deleted code will remain in memory and continue to influence the analysis. As a consequence, other researchers may have difficulties reproducing the results. To ensure that deleted code does not influence the analysis, it is possible to re-run the entire notebook regularly. However, this can cause problems when the VRD is applied to large datasets that may take several hours or days to process. 
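
One inexpensive way to spot this kind of hidden state, without re-running anything, is to inspect the execution counters that a saved notebook file stores for each code cell. The sketch below is a heuristic and not part of the VRD: it assumes that a clean top-to-bottom run numbers the code cells 1, 2, 3, and so on, and flags every cell that deviates from that pattern. The function name is hypothetical, and empty code cells (which receive no counter) would also be flagged.

```python
import json


def cells_deviating_from_clean_run(notebook_json: str):
    """Return (cell position, execution count) for code cells that break the 1, 2, 3, ... pattern."""
    notebook = json.loads(notebook_json)
    flagged = []
    expected = 1
    for position, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown cells carry no execution count
        count = cell.get("execution_count")
        if count != expected:
            flagged.append((position, count))
        expected += 1
    return flagged


# Toy notebook: the last two code cells were executed in reverse order.
toy_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["Intro"]},
        {"cell_type": "code", "execution_count": 1, "source": ["a = 1"]},
        {"cell_type": "code", "execution_count": 3, "source": ["b = a + 1"]},
        {"cell_type": "code", "execution_count": 2, "source": ["a = 10"]},
    ]
})
flagged_cells = cells_deviating_from_clean_run(toy_notebook)
print(flagged_cells)
```

A gap in the numbering suggests that a cell was re-run or that an executed cell was later deleted, which are exactly the situations in which results may depend on code that is no longer visible.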

# ^^^^TOMAS! Consider whether you want to add something here about how data is cached... 

Aside from these practical issues, a couple of broader pitfalls and ethical issues regarding the use of CNNs must also be considered. We will return to these issues in section XXXX. For now, however, let us have a look at an example of how the VRD can be used to study audiovisual reuse based on a larger dataset.

# Finding reuse in the SF archive

From the late 1960s until the 1990s, the SF-archive was reused in thousands of Swedish TV-programs and TV-series. As the production of television became digital, so did the SF-archive. Roughly X years later, Sahlberg's films were also shown on Swedish television, recorded on a VHS tape, and eventually digitized by the aforementioned Rosa Mannen. Made publicly available on YouTube, the 

To illustrate how the VRD toolkit can also be used at scale, we analyze 1,400 digitized videos from the old SF-archive mentioned in the introduction to this article, and compare them with 34 episodes taken from four different TV-series, all with a focus on Swedish 20th century history. The videos in the SF-archive vary in length, but are usually between five and fifteen minutes long. Hence, our SF dataset contains some 10,000 hours of moving image material. 

In contrast, the 34 TV episodes amount to roughly 30 hours of broadcasting time. Most (but not all) of the TV-series can be found online, particularly on RosaMannen’s YouTube channel. The TV-series were primarily selected because they are compilation films produced in a way similar to the one that director Gardar Sahlberg pioneered in the early 1960s. After Sahlberg started to work at Radio Sweden, the public service broadcaster made an exceptional cultural heritage endeavour in preserving the SF-archive, migrating the old footage (some nitrate prints) to polyester-based safety film and also cataloguing the material. Correspondingly, Sahlberg started producing a number of TV-programmes compiled from the SF-archive. Later documentary TV-producers adopted a similar compilation strategy, also evident in the four TV-series we have selected.

The first TV series, entitled *The Time of Expectations* (with an unknown producer), was broadcast on Swedish Television during autumn 1985. The series consisted of four episodes that depicted the economic and political developments in Sweden following World War II, and explored the landscape of opportunities that opened up when the country emerged as one of few European countries with an undamaged industry (and population) due to the country’s political neutrality. The series is primarily based on footage from the SF-archive but also contains interviews and talking heads, featuring well-known Swedish intellectuals and public figures such as novelist Per Olov Enquist and the political scientist Gunnar Myrdal. 

The second TV-series, *Gold Nuggets*, sometimes with the addition “from SF” (Swedish Film Industry) or from a particular year, was a long-running series broadcast on public service television in Sweden from the mid-1980s until the early 2000s. All episodes were produced by Jan Bergman. He was in many ways a disciple of Sahlberg, and like him worked at the film archive at Swedish Television for a number of years. Bergman’s *Gold Nuggets* combined a deep knowledge of the SF-archive, often detecting and inserting unusual footage, with a more traditional style of filmmaking. Characteristic of his TV-series is the combination of his own personal narration with the typical fast-paced voice-over taken from the original newsreels in the SF-archive. The dataset includes six hour-long episodes of *Gold Nuggets* from 1985 and six from 2002.

Bergman, together with Britta Emmer and Ebbe Schön, was also the producer of the third selected TV-series, entitled *Sweden, A Long Time Ago*, broadcast in 1984 in ten short episodes. The series, which depicts different aspects of Sweden and its history, is an illustrative example of how instrumental the SF-archive was during the 1980s. The series (in the form of individual films) was produced for the Swedish Educational Broadcasting Company (a subsidiary of the public broadcaster) and was clearly intended as a kind of filler for school-TV. “Compiled of material from the SF-archive” was explicitly stated in all of the films' closing credits. As in *Gold Nuggets*, Bergman’s voice-of-god narration sutured the footage. 

If the series *Sweden, A Long Time Ago* was in all likelihood produced fairly rapidly (a budget compilation of sorts, with its ten short episodes), the opposite is true of the final TV series that was selected for our experiment, namely the eight hour-long episodes of *A Hundred Swedish Years* (1999). The latter series was directed by Olle Häger, a long-time TV-producer with a career at public service television spanning three decades. Häger and his narrator Hans Villius are arguably the best-known documentary filmmakers in Sweden, having worked both with still photography productions and films made from the SF-archive. *A Hundred Swedish Years* is usually regarded as one of the finest historical documentaries compiled from the SF-archive, surpassing even Sahlberg. The production was a lavish one with a major budget, and also included numerous interviews with elderly Swedes (recorded in the mid-1990s). The eight episodes each have a distinct theme (politics, fashion, labor market, children, technology, vacation, monarchy, and foreign policy), giving the series a dynamic setting. To prepare his TV-production, Häger worked for years with the SF-archive, and even reprinted some footage with a higher resolution. *A Hundred Swedish Years* was broadcast in prime time during the autumn of 1999, and was later also commercially released on both VHS and DVD.

In our exploration of the reuse taking place between the SF archive and these four TV series, we perform two different similarity searches. First, we compare all selected TV series episodes against each other to see if they contain any internal examples of reuse. Does any particular footage appear to have been reused multiple times in the TV documentaries? For instance, can we find examples of Jan Bergman and Olle Häger reusing the same footage? Second, we compare the TV archive against the SF archive, ignoring any potential examples of reuse that might be discovered within the TV archive and SF archive respectively. Here, we wish to re-trace the steps of the TV documentary filmmakers, and map their choice of what historic footage to reuse.

## Reuse in the TV dataset

Within the TV dataset, we find that 

In order to arrive at this result, the matching

We start a new project called "tv_series", extract one frame per second from the TV dataset, and decide to apply the neural network EfficientNetB4 in our analysis. 

In [None]:
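# Folder that holds the source videos for all projects
video_source_base_folder = "/home/jovyan/work/videos/"
# Set up the TV series project, using EfficientNetB4 as the neural network
TV_dataset_project = vrd_project.VRDProject(
    name = 'tv_series', 
    project_base_path="/home/jovyan/work/notebooks/projects/",
    video_path = f'{video_source_base_folder}/tv_series', 
    network = neural_networks.NeuralNetworks.efficientnetb4,
    additional_video_extensions = ['.webm', '.mpg'])
# Prepare the frame extractor for the videos found in video_path
TV_dataset_project.initialize_frame_extractor()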

We then apply the selected neural net to compress and vectorize the frames into fingerprints and use Faiss to first index and then calculate the most similar neighbours for each fingerprint.

In [None]:
TV_dataset_project.populate_fingerprint_database()
TV_dataset_project.initialize_faiss_index()
TV_dataset_project.initialize_neighbours()
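
To make the idea of "most similar neighbours" concrete, the following sketch performs the same kind of search exhaustively on toy vectors. It is an illustration only: it replaces Faiss with plain Python loops, the two-dimensional "fingerprints" are made up, and the use of squared Euclidean distance is an assumption, since this section does not state which metric the VRD's index uses.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two fingerprints."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbours(fingerprints: List[Sequence[float]],
                       k: int) -> List[List[Tuple[int, float]]]:
    """For every fingerprint, return its k closest other fingerprints as (index, distance)."""
    result = []
    for i, query in enumerate(fingerprints):
        # Compare against every other fingerprint; a frame never matches itself.
        candidates = [(squared_l2(query, other), j)
                      for j, other in enumerate(fingerprints) if j != i]
        candidates.sort()
        result.append([(j, distance) for distance, j in candidates[:k]])
    return result


# Toy fingerprints: frames 0 and 1 are near each other, as are frames 2 and 3.
toy_fingerprints = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
toy_neighbours = nearest_neighbours(toy_fingerprints, k=1)
print(toy_neighbours)
```

The exhaustive loop compares every frame with every other frame, which quickly becomes impractical for large collections; this is the kind of workload that a dedicated similarity search library is designed to handle.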

To get a sense of where to apply a distance metric threshold, we have a look at the initial distance metric distribution in a histogram.

In [None]:
TV_dataset_project.neighbours.get_distance_histogram();

Based on this histogram, we decide to begin by applying a distance metric threshold of 120 and decide to filter out all matching frames that have origins in the same video. We also decide to proceed by only searching for sequential matches that are at least five seconds long with an allowed skip rate of two frames per sequence. We merge overlapping sequences into one. As can be seen in the statistics below the code cell, this reduces the number of found sequences from 14 571 to 1 302. 

In [None]:
TV_dataset_finder = SequenceFinder(
    TV_dataset_project.neighbours, 
    max_distance=120)
TV_dataset_finder.filter_matches_from_same_video()
TV_dataset_sequences = TV_dataset_finder.find_sequences(
    shortest_sequence=5, 
    allow_skip=2,
    combine_overlap=True
)
len(TV_dataset_sequences)
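
The logic behind searching for sequential matches can be sketched as follows. This is a simplified stand-in, not the code behind `SequenceFinder`: it assumes that matches are given as (frame in video A, frame in video B, distance) triples with one frame per second, it interprets the skip setting as the maximum number of consecutive missing frames, and it leaves out the merging of overlapping sequences.

```python
from typing import Iterable, List, Tuple


def find_sequences_sketch(matches: Iterable[Tuple[int, int, float]],
                          max_distance: float,
                          shortest_sequence: int,
                          allow_skip: int) -> List[Tuple[int, int, int]]:
    """Return (start_a, start_b, duration) for runs of frames that match in parallel."""
    # Step 1: keep only frame pairs below the distance threshold.
    pairs = {(a, b) for a, b, distance in matches if distance <= max_distance}
    sequences = []
    visited = set()
    # Step 2: follow each run forward, one frame at a time on both sides.
    for a, b in sorted(pairs):
        if (a, b) in visited:
            continue
        visited.add((a, b))
        length, step, skipped = 1, 1, 0
        while skipped <= allow_skip:
            candidate = (a + step, b + step)
            if candidate in pairs:
                visited.add(candidate)
                length = step + 1  # the run extends up to this confirmed match
                skipped = 0
            else:
                skipped += 1
            step += 1
        # Step 3: discard runs that are too short.
        if length >= shortest_sequence:
            sequences.append((a, b, length))
    return sequences


# Toy matches: one run with a single weak frame (distance 200) and one isolated match.
toy_matches = [(0, 100, 50.0), (1, 101, 60.0), (2, 102, 55.0), (3, 103, 200.0),
               (4, 104, 70.0), (5, 105, 65.0), (10, 50, 30.0)]
toy_sequences = find_sequences_sketch(toy_matches, max_distance=120,
                                      shortest_sequence=5, allow_skip=2)
print(toy_sequences)
```

With skipping allowed, the weak frame in the middle does not break the run, so the toy data yields a single six-second sequence. Setting `allow_skip=0` splits it into two runs that are both too short to be kept.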

We also decide that we are not interested in analyzing sequences that come from the same TV series. For instance, one TV series production team may have reused the same video segment several times in different episodes, but such examples of reuse are uninteresting to us, since we want to explore potential instances of reuse among *different* TV series producers. To do this, we apply a customized filter called `not_from_same_tv_series`. This filter will be applied in all subsequent displays of matching results.

In [None]:
def not_from_same_tv_series(sequences, project):
    # NOTE: This example only works as we know that the first word is different in each included
    # TV series, and is not a general solution.
    frames = project.frame_extractor
    
    new_sequence = []
    for start1, start2, dur in sequences:
        vid1 = frames.get_video_name_from_index(start1)
        vid2 = frames.get_video_name_from_index(start2)
        if vid1.split()[0] != vid2.split()[0]:
            new_sequence.append((start1,start2,dur))        
    return new_sequence
TV_dataset_sequences = not_from_same_tv_series(TV_dataset_sequences, TV_dataset_project)
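
As the comment in the cell above points out, comparing only the first word of each file name is specific to this dataset. A slightly more general variant is sketched below. It rests on an assumption that should be checked against the actual files: that every video is named `<series>_<episode>.<extension>`, as is the case for the file names that appear in this notebook. The helper names are hypothetical.

```python
def series_name(video_filename: str) -> str:
    """Return the part of the file name before the first underscore, normalized."""
    return video_filename.split("_", 1)[0].strip().lower()


def from_different_series(video_1: str, video_2: str) -> bool:
    """True if the two file names carry different series names."""
    return series_name(video_1) != series_name(video_2)


# File names taken from this notebook.
episode_a = "Förväntningarnas tid_Vägen genom krisen.webm"
episode_b = "Hundra svenska år_Nu har jag kastat min blå overall.mkv"
episode_c = "Hundra svenska år_Folkhemmet tur och retur.webm"

print(from_different_series(episode_a, episode_b))
print(from_different_series(episode_b, episode_c))
```

Unlike the first-word comparison, this variant would still distinguish two series whose titles happen to begin with the same word.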

Next, we look at the remaining matching results in the form of a stacked bar plot, where the reuse count shown on the Y-axis refers to the number of found sequences that are at least 5 seconds long. 

In [None]:
TV_dataset_project.show_most_reused_files(TV_dataset_sequences, size=(1200, 800))

As can be seen in the bar plot, it appears as if footage from primarily two videos has been excessively reused within the TV dataset. This concerns the episode *Vägen genom krisen* (The path through the crisis) from the TV series *Förväntningarnas tid* (The time of expectations) and the episode *Nu har jag kastat min blå overall* (Now I have discarded my blue overalls) from the TV series *Hundra Svenska år* (A Hundred Swedish Years). 

We proceed by also looking at the matching results in the form of thumbnail previews. The quality of these matches will give hints regarding how well the distance metric threshold was specified in the `SequenceFinder` step above. If we find many sequences that do not match, perhaps the distance metric threshold is set too high. Similarly, if all matches look good but a bit short, perhaps the distance metric threshold is set too low. In other words, we can use this data to adjust the previous filtering settings according to our specific dataset. As noted in the previous demo section, the distribution of distance metric values will, for example, be impacted by the choice of neural net, the layer from which the fingerprints are extracted, the size of the dataset, and the quality of the source material. We recommend finding a threshold that is slightly too high (rather than too low), to avoid missing relevant matches. Invalid matches can then be filtered out using other tools. In this notebook, we limit the number of shown sequences to 10 in the example below. During our analysis, however, the 200 longest sequences were analyzed. 

In [None]:
TV_dataset_finder.show_notebook_sequence(TV_dataset_sequences,  show_limit=10, frame_resize=(70,70))

When studying these thumbnails, we decide to keep working with a distance metric threshold of 120, but notice that a special type of frame seems to distort the matching results. These are frames showing talking heads, that is, shots of people's heads and shoulders as they are talking to the camera. As can be seen in several examples, such frames have been given a comparatively low distance metric, even though they depict entirely different people. We decide to filter out these sequences, even though this means that actual reuse of 'talking heads' may get lost. This is done by adding the unwanted sequences to an Excel (.xlsx) template entitled "unwanted_frames" and running the `remove_unwanted_sequences` function. We also find some black (or near-black) sequences, as well as sequences containing text, in the matching results and add these to the list of unwanted frames. As can be seen in the statistics after the code cell below, this reduces the number of found sequences from 402 to 105. 

In [None]:
before_set = set(TV_dataset_sequences)
SequenceFinder.remove_unwanted_sequences(TV_dataset_sequences, TV_dataset_project, 'TV_unwanted_frames.xlsx')
after_set = set(TV_dataset_sequences)
print(f'Before removal: {len(before_set)} sequences.\nAfter removal: {len(after_set)} sequences.\n({(len(before_set)-len(after_set)) / len(before_set)*100:.2f}% removed)')
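
Conceptually, this removal step amounts to dropping every sequence that touches a frame range marked as unwanted. The sketch below illustrates that idea on toy data. It is not the implementation of `remove_unwanted_sequences`, and it does not reproduce the layout of the Excel template: representing the unwanted material as (first frame, last frame) index ranges is an assumption made for the example.

```python
from typing import List, Tuple

Sequence3 = Tuple[int, int, int]   # (start_1, start_2, duration)
FrameRange = Tuple[int, int]       # (first_frame, last_frame), inclusive


def overlaps(start: int, duration: int, unwanted: List[FrameRange]) -> bool:
    """True if the frames start..start+duration-1 touch any unwanted range."""
    end = start + duration - 1
    return any(start <= last and first <= end for first, last in unwanted)


def drop_unwanted(sequences: List[Sequence3], unwanted: List[FrameRange]) -> List[Sequence3]:
    """Keep only sequences where neither side overlaps an unwanted range."""
    return [(s1, s2, dur) for s1, s2, dur in sequences
            if not overlaps(s1, dur, unwanted) and not overlaps(s2, dur, unwanted)]


# Toy data: frames 300-310 and 55-56 are marked as unwanted (e.g. talking heads).
toy_found = [(0, 100, 6), (20, 300, 5), (50, 400, 8)]
toy_unwanted = [(300, 310), (55, 56)]
toy_kept = drop_unwanted(toy_found, toy_unwanted)
print(f"Before removal: {len(toy_found)} sequences. After removal: {len(toy_kept)} sequences.")
```

Note that a sequence is removed as soon as either of its two sides overlaps an unwanted range, which is why genuine reuse of, for instance, a talking-head shot would also disappear from the results.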

To double-check which sequences were removed in the previous step, we run the code below. This will show a sample of the sequences that were filtered out by `remove_unwanted_sequences` (limited to 10 in this notebook by `show_limit`).

In [None]:
TV_dataset_finder.show_notebook_sequence(before_set.difference(after_set), show_limit=10, show_shift=False, frame_resize=(70,70))

As can be seen in the previews above, the Excel spreadsheet did indeed include a large number of so-called talking heads, as well as black or near-black frames.

Next, we have a look at our updated matching results in the form of a stacked bar plot.

In [None]:
TV_dataset_project.show_most_reused_files(not_from_same_tv_series(TV_dataset_sequences, TV_dataset_project), size=(1200, 800))

As can be seen in the bar plot, the number of found sequences is now greatly reduced. Still, however, we find that the same two TV series episodes seem to be reused most often (i.e. the episodes *Vägen genom krisen* and *Nu har jag kastat min blå overall*). 

We have a look at the remaining matching results in the form of thumbnail previews once more. The longest found sequence will be shown first, and we decide to only show the 200 longest sequences found.

In [None]:
TV_dataset_finder.show_notebook_sequence(TV_dataset_sequences, show_limit=10, show_shift=False, frame_resize=(70,70))

When looking at these thumbnails, we notice that a few unwanted sequences remain in the matching results. If desired, we could have gone back to add these to the unwanted_frames Excel spreadsheet and filtered the matching results once more. For instance, sequences 60-65 appear to show mis-matched newspaper pages, and in sequence 103, clouds appear to have been mixed up. In this case, however, we decide that the dataset is sufficiently clean to facilitate the rest of our analysis. 

Next, we decide to drill deeper into the matching results by having a closer look at reuse based on some specific TV series episodes. Using the latest bar plot, we begin by looking at the episode with the most identified instances of reuse (i.e. the video entitled "Förväntningarnas tid_Vägen genom krisen.webm").

In [None]:
TV_dataset_finder.show_notebook_sequence(contains_video(TV_dataset_sequences, 'Förväntningarnas tid_Vägen genom krisen.webm', TV_dataset_finder), show_limit=10, show_shift=False, frame_resize=(70,70))

We also have a closer look at all matching results for four more clips that can be found furthest to the left in the bar plot above.

In [None]:
from collections import Counter

# Count how often each video appears on either side of a found sequence
df = TV_dataset_finder.get_sequence_dataframe(TV_dataset_sequences)
c = Counter()
for column in ('Video 1', 'Video 2'):
    c.update(df[column])
# display(pd.DataFrame([{'Video':name, 'Count':count} for name, count in c.most_common()[:10]]))
# Print the names of the ten videos that occur in the most sequences
for name, count in c.most_common(10):
    print(name)

TV_dataset_finder.show_notebook_sequence(contains_video(TV_dataset_sequences, 'Hundra svenska år_Nu har jag kastat min blå overall.mkv', TV_dataset_finder), show_limit=10, show_shift=False, frame_resize=(70,70))

In [None]:
TV_dataset_finder.show_notebook_sequence(contains_video(TV_dataset_sequences, 'Hundra svenska år_Folkhemmet tur och retur.webm', TV_dataset_finder), show_limit=10, show_shift=False, frame_resize=(70,70))

In [None]:
TV_dataset_finder.show_notebook_sequence(contains_video(TV_dataset_sequences, 'Hundra svenska år_Jag har varit med om allt som blivit nytt.webm', TV_dataset_finder), show_limit=10, show_shift=False, frame_resize=(70,70))

In [None]:
TV_dataset_finder.show_notebook_sequence(contains_video(TV_dataset_sequences, 'Förväntningarnas tid_Öppnade gränser.webm', TV_dataset_finder), show_limit=10, show_shift=False, frame_resize=(70,70))

## Reuse in the TV and SF dataset

Next, we want to find visual similarities between the TV dataset and SF dataset, without searching for examples of reuse within the SF dataset as such. This is because we are not interested in studying internal reuse in the SF archive (although this could have been a goal of the experiment). Instead, we focus our efforts on exploring which video clips from the SF archive have been picked up and reused in the selected television series. This will save us time since the SF dataset is considerably larger than the TV dataset and we won't have to match and compare all SF videos against each other.

To perform this kind of similarity search, we first need to create two sets of indexed fingerprints. We will then combine the datasets and use Faiss to calculate the closest similarity neighbours between the two. Since the TV dataset has already been fingerprinted and indexed once, we can reuse our previous data. However, we still need to process the videos from the SF dataset.

We begin by starting a second project called "sf_archive" and extract one frame per second from the SF dataset. Again, we decide to apply the neural network EfficientNetB4 in our analysis.

# ^^^^TOMAS! clean code below before article is submitted?

In [None]:
from pathlib import Path
SF_dataset_project = vrd_project.VRDProject(
    name = 'sf_archive', 
    project_base_path=f"/home/jovyan/work/notebooks/projects/",
    video_path = f'{video_source_base_folder}/from_pelle',
    network = neural_networks.NeuralNetworks.efficientnetb4,
    additional_video_extensions = ['.webm', '.mpg'])
SF_dataset_project.initialize_frame_extractor()
# Only for this project: 
# Keep the sf2 subdirectory, discard the rest.
fixed_frames = [x for x in SF_dataset_project.frame_extractor.all_images if x.startswith('/home/jovyan/work/notebooks/projects/sf_archive/frames/sf/sf2')]
file_names = set()
for f in fixed_frames:
    p = Path(f)
    file_names.add(p.name.split('_frame_')[0])    
SF_dataset_project.frame_extractor.video_list = list(file_names)
SF_dataset_project.frame_extractor.all_images = fixed_frames

Next, we compress and vectorize the extracted frames into fingerprints using the selected CNN.

In [None]:
# NOTE! THIS CODE SHOULD BE RUN IN THE FINAL VERSION OF THE ARTICLE, BUT WE SKIP
# RUNNING IT WHILE WE ARE STILL FINISHING THE ARTICLE
# SF_dataset_project.populate_fingerprint_database()

We index the fingerprints using Faiss. 

In [None]:
SF_dataset_project.initialize_faiss_index(force_recreate=False)

We then combine the two projects using the `combine_projects` feature. This uses Faiss to calculate the most similar neighbours between the indexed fingerprints in the SF dataset and the TV dataset.

In [None]:
combined_project = vrd_project.combine_projects(SF_dataset_project, TV_dataset_project, name='reference_vs_tv_series')
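
The restriction to matches *between* the two datasets can be illustrated with a small filter. The sketch below is not how `combine_projects` works internally: it assumes a combined index in which the first `n_reference` frame indices belong to the reference (SF) dataset and all later indices belong to the TV dataset, and it keeps only the pairs that straddle that boundary.

```python
from typing import List, Tuple

Match = Tuple[int, int, float]  # (frame index, frame index, distance)


def cross_dataset_only(pairs: List[Match], n_reference: int) -> List[Match]:
    """Keep pairs where exactly one of the two frames belongs to the reference dataset."""
    return [(a, b, distance) for a, b, distance in pairs
            if (a < n_reference) != (b < n_reference)]


# Toy index: frames 0-4 are reference frames, frames 5-9 are TV frames.
toy_pairs = [(0, 1, 5.0),   # reference vs reference: ignored
             (2, 7, 9.0),   # reference vs TV: kept
             (6, 8, 3.0),   # TV vs TV: ignored
             (9, 1, 4.0)]   # TV vs reference: kept
toy_cross = cross_dataset_only(toy_pairs, n_reference=5)
print(toy_cross)
```

In practice, the saving comes from never computing the within-dataset comparisons in the first place, rather than from filtering them out afterwards as this toy example does.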

When the similarity neighbours have been calculated, we display a histogram of the distribution of distance metric values and use it to determine a reasonable threshold for finding sequences.

In [None]:
combined_project.neighbours.get_distance_histogram();

Based on the data shown in this histogram, we decide to apply a distance metric threshold of 120 and filter out all matching results from the same video. We also decide to only search for sequences that are at least five seconds long, and allow a skip rate of two frames per sequence. As can be seen in the statistics below, this reduces our dataset from 10 670 to 1 983 found sequences.

In [None]:
combined_project_finder = SequenceFinder(combined_project.neighbours, max_distance=120)
combined_project_finder.filter_matches_from_same_video()
combined_project_sequences = combined_project_finder.find_sequences(shortest_sequence=5, allow_skip=2,combine_overlap=True)
len(combined_project_sequences)

Using the same rationale as before, we filter out the same sequences that were labelled as unwanted in the TV dataset. Since we will only match the TV dataset against the SF dataset (and not search for any examples of reuse within the SF dataset as such), the removal of these frames should be enough to 'clean' the search results. Note, however, that if any "real" reuse has occurred in the unwanted frames (for instance, reuse of footage containing talking heads), these sequences will be lost from the search results. As can be seen in the statistics below the code cell, this reduces the number of found sequences from 1 983 to 1 553. 

In [None]:
before_set_combined = set(combined_project_sequences)
SequenceFinder.remove_unwanted_sequences(combined_project_sequences, combined_project, 'TV_unwanted_frames.xlsx')
after_set_combined = set(combined_project_sequences)
print(f'Before removal: {len(before_set_combined)} sequences.\nAfter removal: {len(after_set_combined)} sequences.\n({(len(before_set_combined)-len(after_set_combined)) / len(before_set_combined)*100:.2f}% removed)')

To double-check which sequences were removed in the previous step, we output a sample of the sequences that were removed using the unwanted frames functionality (limited to 10 in this notebook by `show_limit`).

In [None]:
combined_project_finder.show_notebook_sequence(before_set_combined.difference(after_set_combined), show_limit=10, show_shift=False, frame_resize=(70,70))

When looking at these thumbnails, we notice that at least one accurate match seems to have been removed. This concerns the first sequence shown above (sequence no. 0). Other than that, however, the removed sequences do seem to contain a large number of wrongly matched talking heads or otherwise uninteresting frames.

We have a look at the remaining results in the form of a stacked bar plot.

In [None]:
combined_project.show_most_reused_files(combined_project_sequences, video_list=TV_dataset_project.frame_extractor.video_list)

As can be seen in this plot, footage from the video SF2789.1 appears to have been reused most often, as over 35 sequences from the video have been identified in 11 of the TV series episodes. More than 15 matching sequences have also been found in the videos entitled SF2410A-C.1, SF2764.1, SF2554.1, SF2183.1, and SF2831A-C.1. We take note of these video clips and will return to them later. 

We also output the matching results in the form of thumbnail previews, sorted with the longest found sequence shown first. We limit the number of shown matching results to the 200 longest sequences. To enlarge the thumbnail previews, it is possible to double-click on the sequences.

In [None]:
combined_project_finder.show_notebook_sequence(combined_project_sequences, show_limit=10, frame_resize=(70,70))

When browsing through these thumbnail previews, we do find that most of the sequences look correct. In some cases, there is a slight time shift in the identified frames, but overall, it seems as if the VRD has successfully found several instances of reuse. Some notable exceptions, however, include sequences no. 36, 42, 44, 48, 53, 61, 62, 105, 114, and 121, where crowds of people have been mis-matched. We also see examples of talking heads that remain in the search results (sequence 37). Other mis-matched sequences include frames showing ocean horizons (sequences 86, 106, and 107), sailing boats (sequence 110), and snow landscapes (sequence 122). We also find examples of frames that have likely been mis-matched because the frames are dark or near-black (such as sequences 58 and 126). 

We add these and other found unwanted sequences to a new Excel spreadsheet called "Unwanted_frames_2" and save this document for later.

To further inspect the search results, we also have a look at the 200 longest sequences from the 5 most reused videos, as seen in the stacked bar plot above. We begin by looking at the matching results for the video SF2789.1, that is, the video clip with the most identified instances of reused sequences.

In [None]:
combined_project_finder.show_notebook_sequence(contains_video(combined_project_sequences, 'SF2789.1.mpg', combined_project_finder), show_limit=10, show_shift=False, frame_resize=(70,70))

In this case, we find that a particular sequence from the video SF2789.1.mpg has been excessively mis-matched in the search results. As it turns out, this sequence shows Prince Wilhelm of Norway and Sweden reading a poem called "While the boat is drifting" ("Medan båten glider"), which is introduced as providing an evocation of northern Scandinavia's "sparkling mountain lakes and mile-wide desert forests with their strange and enchanting mysticism". The original video features several shots showing the upper body of Prince Wilhelm, who was otherwise famous for publishing chronicles about his (colonialism-inspired) travels, covering topics such as Hindu eroticism and primate wildlife in Central Africa. This illustrates how the matching results from the bar plot should be analyzed with care, since the most reused video sequences do not necessarily correspond to the sequences with the highest *quality* in matches. On the contrary, video sequences that are identified as being excessively reused by the VRD may be of poor quality (i.e. contain footage that easily gets mis-matched). 

We add the sequences showing Prince Wilhelm and other talking heads to the Excel spreadsheet "Unwanted_frames_2" and continue by studying the second most reused video clip in the same way, zooming in on the matching results for the video SF2410A-C.1, which depicts a [military exercise](https://smdb.kb.se/catalog/search?q=SF2410&x=0&y=0) performed by the Swedish Navy in 1918.

In [None]:
combined_project_finder.show_notebook_sequence(contains_video(combined_project_sequences, 'SF2410A-C.1.mpg', combined_project_finder), show_limit=10, show_shift=False, frame_resize=(70,70))

In this case, we find that frames depicting an ocean horizon have been excessively matched against ocean horizons in other TV series episodes. While this is technically correct (since the ocean horizons do look very much alike), it is comparatively uninteresting for our purposes here. However, skimming through the matching results, we do find two interesting sequences. The first, sequence no. depicts an explosion on the ocean horizon, although with a slight time shift. In sequence no. 9, we also see a correct match between frames that include shots taken from a boat.

We add the remaining uninteresting ocean horizon sequences to the "Unwanted_frames_2" spreadsheet. We also look through the matching results for the videos SF2764.1, SF2554.1, and SF2831A-C.1 and add unwanted frames found there to the same Excel document. In the interest of space, however, these matching results are not displayed in this notebook. 

When the updated Excel sheet is finished, we apply the remove_unwanted_sequences function once more and output a sample of what the removed sequences looked like, in order to double-check the filtering.

In [None]:
before_set_combined = set(combined_project_sequences)
SequenceFinder.remove_unwanted_sequences(combined_project_sequences, combined_project, 'TV_unwanted_frames_2.xlsx')
after_set_combined = set(combined_project_sequences)
print(f'Before removal: {len(before_set_combined)} sequences.\nAfter removal: {len(after_set_combined)} sequences.\n({(len(before_set_combined)-len(after_set_combined)) / len(before_set_combined)*100:.2f}% removed)')
combined_project_finder.show_notebook_sequence(before_set_combined.difference(after_set_combined), show_limit=10, show_shift=False, frame_resize=(70,70))

We have a look at the remaining matching results in the form of a bar plot once more.

In [None]:
combined_project.show_most_reused_files(combined_project_sequences, video_list=TV_dataset_project.frame_extractor.video_list)

We look at the remaining matching results in the form of thumbnail previews. Once more, the longest found sequence will be shown first and we limit the number of shown sequences to 200.

In [None]:
combined_project_finder.show_notebook_sequence(combined_project_sequences, show_limit=10, frame_resize=(70,70))

In [None]:
# FIND CLIPS THAT APPEAR IN MORE THAN 3
## MARIA: I know this is not exactly what you described, but these are the 5 SF videos that are matched with the most TV series videos, in descending order of the number of TV series videos.

Lastly, we have a look at thumbnail previews for the five SF clips that have been matched with the largest number of different TV videos, showing the most widely reused clip first.

In [None]:
# One row per found sequence, with the names of the two matched videos
df = combined_project_finder.get_sequence_dataframe(combined_project_sequences)
# For each SF video, count the distinct TV videos it was matched with and keep the top five
grouped = df.groupby('Video 2')['Video 1'].nunique().sort_values(ascending=False).head(5)
print('SF video, and number of unique TV videos it was matched with:')
display(grouped)
# Show the matching rows and thumbnail previews for each of these SF videos
for video in grouped.keys():
    display(df[df['Video 2'] == video])
    combined_project_finder.show_notebook_sequence(contains_video(combined_project_sequences, video, combined_project_finder), show_limit=10, show_shift=False, frame_resize=(70,70))
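
The difference between counting found sequences (as in the bar plots) and counting distinct partner videos (as in the cell above) can be shown on toy data without pandas. The video names below are placeholders and the helper name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def distinct_partners(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """rows holds one (tv_video, sf_video) pair per found sequence.

    Returns (sf_video, number of distinct TV videos), most widely reused first.
    """
    partners = defaultdict(set)
    for tv_video, sf_video in rows:
        partners[sf_video].add(tv_video)
    counts = {sf_video: len(tv_videos) for sf_video, tv_videos in partners.items()}
    # Sort by descending count, then by name to keep the order deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


toy_rows = [("tv_a", "sf_1"), ("tv_a", "sf_1"), ("tv_b", "sf_1"), ("tv_c", "sf_1"),
            ("tv_a", "sf_2"), ("tv_b", "sf_2"), ("tv_a", "sf_3")]
toy_ranking = distinct_partners(toy_rows)
print(toy_ranking)
```

Here `sf_1` appears in four sequences but in only three different TV videos, which is why the two ways of counting can rank clips differently.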

## Discussion

## Notes on ethics

Aside from these practical issues, a couple of broader pitfalls and ethical issues regarding the use of CNNs must also be considered. CNNs and other machine learning technologies have been critiqued for relying on the underpaid work of so-called 'click-workers' who tag and prepare the training datasets that make artificial intelligence seem intelligent (<cite data-cite="1971321/RZIC9T2H"></cite>). CNNs have also been widely critiqued for being biased, primarily because of problems surrounding their training datasets. As previously mentioned, most CNNs (including all neural nets in the VRD) are trained and developed using the so-called ImageNet challenge. ImageNet is a database consisting of over 14 million images collected from the Internet, which have been classified by human beings according to a taxonomy called WordNet that was developed by scholars at Princeton University in the 1980s (<cite data-cite="1971321/FXZYJ9Z8"></cite>). This database is used to teach CNNs how to analyze visual content, as well as to backtrack and fact-check the model's interpretations. Yet previous research has found that several sociodemographic groups (such as women, people of color, LGBTQ+ communities, and people with disabilities) are underrepresented in the dataset (ibid.). As a result, CNNs trained on ImageNet have been shown to carry racial, sexual, and heteronormative biases, as well as biases relating to functionality. Similarly, images from non-Western countries have been found to be heavily underrepresented in the ImageNet database, meaning that CNNs trained on ImageNet have difficulties recognizing non-Western environments (<cite data-cite="1971321/ZFWA7ZZI"></cite>). For instance, this includes having problems recognizing basic non-Western household environments such as kitchens/bathrooms and the everyday objects, furniture, and utensils found therein (ibid.). 
Others have found that CNNs tend to perform "a capitalist and product-focused reading of the world", as the ImageNet dataset and WordNet taxonomy are biased towards commerce and contemporary consumer products (<cite data-cite="1971321/JP2386CT"></cite>).

In addition to dataset biases, convolutional neural networks also carry a series of "perceptual biases" that concern their fundamental ways of representing the world (<cite data-cite="1971321/YVM62ZRV"></cite>). For instance, CNNs are trained to make classifications, and classifications by necessity involve reducing and limiting interpretations of what exists (<cite data-cite="1971321/SP5DNFWD"></cite>). As Kate Crawford and Trevor Paglen argue, each layer in a CNN is “infused with politics” in the sense that it is programmed to assume that fixed, universal, and consistent concepts exist, including, for example, concepts such as 'man' or 'woman' (<cite data-cite="1971321/VKPZKVVC"></cite>). This assumption leaves little room for hesitation and interpretative complexity. Furthermore, CNNs rely on the assumption that it is possible to capture the visual 'essence' of images through statistics, an assumption that could be critiqued from a philosophical standpoint (ibid.). CNNs are also "biased towards a distributed, entangled, deeply non-human way of representing the world", which often makes it difficult to understand how a CNN draws its conclusions (<cite data-cite="1971321/YVM62ZRV"></cite>). This has implications for the possibility of providing transparency (i.e., explaining precisely how a CNN makes decisions), as well as for the ability to hold models and researchers accountable for their interpretations.

The VRD does not use machine learning to reproduce semantic binary gender divisions, or to output normative textual classifications of the things and humans that appear in images. However, this does not mean that the toolkit escapes the general biases that are associated with CNNs. There is every reason to assume that the VRD will be better at recognizing and processing certain kinds of content than others because of how its underlying CNNs have been trained and developed. In particular, there are good reasons to assume that the toolkit will be best at processing visual imagery that depicts contemporary, commercial, Western environments. It is also important to recognize that the VRD's ability to find similarities in visual datasets relies on the labour of thousands of underpaid clickworkers, whose work tagging and preparing training datasets has been crucial to the development of CNNs.

# Concluding remarks

## Areas of future development

At the moment, we have identified several areas where the VRD toolkit could be developed further. For example, adding more features for analyzing the effects of applied filters (for instance in the form of statistics showing lost matching results) could provide more transparency and help in finding the right parameters for each filter. 
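As an illustration of what such filter statistics could look like, the following minimal sketch counts how many matches are kept and lost at different threshold values. It is not part of the VRD: the `Match` class, the `filter_report` function, the file names, and the distance values are all hypothetical, and the sketch assumes that a smaller distance means a closer visual match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One hypothetical frame-level match between two videos."""
    source_video: str
    target_video: str
    distance: float  # assumed convention: smaller means more similar


def filter_report(matches, max_distances):
    """For each threshold, count how many matches are kept and how many are lost."""
    total = len(matches)
    report = []
    for limit in max_distances:
        kept = sum(1 for m in matches if m.distance <= limit)
        lost = total - kept
        report.append({
            "max_distance": limit,
            "kept": kept,
            "lost": lost,
            "lost_share": lost / total if total else 0.0,
        })
    return report


# Made-up example data, for illustration only.
matches = [
    Match("a.mp4", "b.mp4", 50.0),
    Match("a.mp4", "c.mp4", 120.0),
    Match("b.mp4", "c.mp4", 180.0),
    Match("a.mp4", "d.mp4", 260.0),
]

for row in filter_report(matches, max_distances=[100.0, 200.0, 300.0]):
    print(row)
```

Comparing the `lost` column across thresholds gives a rough sense of how aggressive a filter is before committing to a parameter value.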

More efforts could also go into:

- Developing OCR and black-frame filters
- Redesigning the database structure that underlies the toolkit
- Making it possible to apply filters *before* CNNs and the Faiss library are applied
- Conducting rigorous tests of the toolkit


Given the possibility to embed complex plots, graphs, tables, and visualizations in Jupyter Notebooks, there are also many opportunities to develop how matching results are shown. In many cases, however, these solutions need to be customized to the specific dataset and research questions that are explored. For instance, the possibility of presenting matching results in relevant graphic visualizations depends heavily on the size of the dataset, which makes it difficult to embed general features in toolkits such as the VRD. One of the great advantages of Jupyter Notebooks, however, is that users have every opportunity to add and remove features from the VRD toolkit by themselves. A good place to start is the [Plotly](https://plotly.com/python/) library, which is included in the VRD Docker environment. In the case study presented in the next chapter, we provide an example of what a customized application of the VRD could look like. 
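As a starting point for such customization, the sketch below counts how often each video appears in a list of matching pairs and draws the counts as a Plotly bar chart. The helper names `count_matches_per_video` and `plot_match_counts`, as well as the example pairs, are hypothetical and do not belong to the VRD API; how matching results are actually exported from the toolkit is left open here.

```python
from collections import Counter


def count_matches_per_video(match_pairs):
    """Count how many matches each video takes part in, most-matched first."""
    counts = Counter()
    for source, target in match_pairs:
        counts[source] += 1
        counts[target] += 1
    # Sort by descending count, then by name, to get a stable bar order.
    return dict(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


def plot_match_counts(counts):
    """Build a Plotly bar chart from a {video: count} mapping."""
    # Imported here so the counting helper also works without Plotly installed.
    import plotly.graph_objects as go

    fig = go.Figure(data=[go.Bar(x=list(counts.keys()), y=list(counts.values()))])
    fig.update_layout(
        title="Matches per video",
        xaxis_title="Video",
        yaxis_title="Number of matches",
    )
    return fig


# Made-up example pairs, for illustration only.
pairs = [
    ("a.mp4", "b.mp4"),
    ("a.mp4", "c.mp4"),
    ("b.mp4", "c.mp4"),
    ("a.mp4", "d.mp4"),
]
counts = count_matches_per_video(pairs)
print(counts)
```

In a notebook cell, `plot_match_counts(counts).show()` would render the chart. For large datasets, the counts would probably need to be truncated or aggregated first.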

# Bibliography

<div class="cite2c-biblio"></div>