This repository provides the implementation of the entity, property, and relation extraction methods introduced in our paper Event and Entity Extraction from Generated Video Captions (CD-MAKE 2023) by Johannes Scherer, Ansgar Scherp and Deepayan Bhowmik. We proposed a framework that combines video captioning and NLP methods to extract semantic metadata solely from automatically generated video captions. As metadata, we considered entities, the entities’ properties, relations between entities, and the video category.
You can test the extraction methods on custom text and on captioned events (see Usage). This repository does not contain the implementation of our video classification method based on generated captioned video events, the scripts that we used to evaluate our extraction methods, the trained models of the Dense Video Captioning methods that we employed (see References), or the captioned events they generated.
Create a conda environment.
conda create -n Video2Metadata python=3.7
conda activate Video2Metadata
Install spaCy with NeuralCoref from source (see huggingface/neuralcoref#310).
cd src
git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
pip install -e .
cd ../../
If installing spaCy with NeuralCoref from source fails, the following installation may work instead (see huggingface/neuralcoref#209). Note that spaCy 2.1.0 is much slower.
pip install spacy==2.1.0
pip install neuralcoref
Download the spaCy language model of your choice.
python -m spacy download en_core_web_lg
Install NLTK, which provides access to WordNet, to validate (compound) nouns, verbs, adjectives and adverbs.
conda install -c anaconda nltk
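After the installation steps above, it can help to confirm that the packages are importable before running the extraction scripts. The following is a minimal sketch that uses only the standard library; the helper `check_environment` is hypothetical and not part of this repository, and it only checks importability, not version compatibility.

```python
import importlib.util

# Import names of the packages installed in the steps above.
# Assumption: the language model chosen was en_core_web_lg.
REQUIRED = ("spacy", "neuralcoref", "nltk", "en_core_web_lg")


def check_environment(packages=REQUIRED):
    """Return a dict mapping each package name to True if it can be located."""
    status = {}
    for name in packages:
        # find_spec returns None when a top-level module cannot be found.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If NLTK later reports that the WordNet corpus itself is missing, it can usually be fetched from a Python session with `nltk.download('wordnet')`.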
Apply the semantic metadata extraction methods to custom text. For example, the following command
python extract_from_text.py --text "A man is standing in front of a fridge. He opens it and takes out a red glass."
results in the output
Input: A man is standing in front of a fridge. He opens it and takes out a red glass
Detected Sentences:
A man is standing in front of a fridge.
He opens it and takes out a red glass.
Entities:
fridge
front
glass
man
Entity-Property Pairs:
glass [red]
Relations:
(man, standing, ['in'], front)
(man, takes, ['out'], glass)
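If the printed output needs to be processed further, it can be turned into structured data. The sketch below is not part of this repository: `parse_output` is a hypothetical helper that assumes the output layout is exactly as shown above (section headers ending in a colon, one item per line) and that multiple properties or prepositions would be comma-separated, which the example above does not confirm.

```python
import re

SECTION_HEADERS = ("Detected Sentences", "Entities", "Entity-Property Pairs", "Relations")

# Matches a relation line such as "(man, standing, ['in'], front)".
RELATION_RE = re.compile(r"^\((.+?), (.+?), \[(.*?)\], (.+)\)$")
# Matches an entity-property line such as "glass [red]".
PROPERTY_RE = re.compile(r"^(.+?) \[(.*)\]$")

SAMPLE_OUTPUT = """Input: A man is standing in front of a fridge. He opens it and takes out a red glass
Detected Sentences:
A man is standing in front of a fridge.
He opens it and takes out a red glass.
Entities:
fridge
front
glass
man
Entity-Property Pairs:
glass [red]
Relations:
(man, standing, ['in'], front)
(man, takes, ['out'], glass)
"""


def parse_output(text):
    """Group the printed lines by section and split pairs and relations into tuples."""
    result = {header: [] for header in SECTION_HEADERS}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and line[:-1] in SECTION_HEADERS:
            current = line[:-1]
            continue
        if current is None:
            continue  # skips the leading "Input: ..." line
        if current == "Relations":
            match = RELATION_RE.match(line)
            if match:
                subj, verb, preps, obj = match.groups()
                # Assumption: prepositions are quoted and comma-separated.
                prep_list = [p.strip().strip("'\"") for p in preps.split(",") if p.strip()]
                result[current].append((subj, verb, prep_list, obj))
        elif current == "Entity-Property Pairs":
            match = PROPERTY_RE.match(line)
            if match:
                props = [p.strip() for p in match.group(2).split(",") if p.strip()]
                result[current].append((match.group(1), props))
        else:
            result[current].append(line)
    return result


if __name__ == "__main__":
    parsed = parse_output(SAMPLE_OUTPUT)
    for subj, verb, preps, obj in parsed["Relations"]:
        print(subj, verb, " ".join(preps), obj)
```

Parsing printed text is fragile; if the scripts expose the extracted values directly, using those would be preferable to this approach.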
To apply the semantic metadata extraction methods to captioned events (including temporal information) instead of plain text, add an example consisting of sentences and temporal segments to the list of examples in extract_from_captioned_events.py (the examples presented in the paper are already included there), then run:
python extract_from_captioned_events.py
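To illustrate what "sentences and temporal segments" means, here is a minimal sketch of one way to pair each caption with its segment and validate the result. The layout with parallel `sentences` and `timestamps` lists, the `CaptionedEvent` class, and the `build_events` helper are all assumptions made for illustration; the actual structure expected by extract_from_captioned_events.py may differ, so check the examples in that file. The timestamps below are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CaptionedEvent:
    """One generated caption together with its temporal segment in seconds."""
    sentence: str
    start: float
    end: float


def build_events(example: Dict[str, list]) -> List[CaptionedEvent]:
    """Pair sentences with segments, validate them, and sort chronologically."""
    sentences = example["sentences"]
    timestamps = example["timestamps"]
    if len(sentences) != len(timestamps):
        raise ValueError("each sentence needs exactly one temporal segment")
    events = []
    for sentence, (start, end) in zip(sentences, timestamps):
        if start < 0 or end < start:
            raise ValueError(f"invalid segment [{start}, {end}] for: {sentence!r}")
        events.append(CaptionedEvent(sentence.strip(), float(start), float(end)))
    return sorted(events, key=lambda e: (e.start, e.end))


# Hypothetical example; the timestamps are illustrative, not real data.
EXAMPLE = {
    "sentences": [
        "He opens it and takes out a red glass.",
        "A man is standing in front of a fridge.",
    ],
    "timestamps": [[4.5, 9.0], [0.0, 4.5]],
}

if __name__ == "__main__":
    for event in build_events(EXAMPLE):
        print(f"[{event.start:.1f}s - {event.end:.1f}s] {event.sentence}")
```

Sorting by start time keeps the sentences in the order in which the events occur, which matters when pronouns such as "He" or "it" refer back to entities in an earlier caption.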
The Dense Video Captioning (DVC) models that we used for testing our framework:
- End-to-End Dense Video Captioning with Masked Transformer
- End-to-End Dense Video Captioning with Parallel Decoding
Johannes Scherer, Ansgar Scherp and Deepayan Bhowmik