Research on using audio data to detect checkworthy claims in political events (debates, interviews, and speeches). We build neural models that use audio alone or both text and audio. The results show that an audio model can boost the performance of a strong textual model when the two are combined.
The research was carried out at Sofia University (Bulgaria), Faculty of Mathematics and Informatics.
The paper describing this work can be found at:
- https://ieeexplore.ieee.org/document/10447064 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- https://arxiv.org/pdf/2306.05535.pdf
In short, this work contains:
- A multimodal dataset (text and audio in English) - 48 hours of speech, 34,489 sentences. Labels are at the sentence level - whether each sentence contains a checkworthy claim or not.
- Variants of audio segments (audio files for each sentence):
- Original audio.
- With reduced noise (using noisereduce).
- With generated speech (using FastSpeech 2).
- Variant of the dataset with a single speaker (Donald Trump, see checkworthy-by-speaker).
- Real-world data - the multimodal dataset is based on the one from the CLEF Check-That! 2021 Challenge (see this paper, page 7 (Task 1B) and the event links).
- Addressing class skewness of the train dataset via oversampling (duplicating the checkworthy claims 15 or 30 times, referred to as '15x' (example) and '30x' (example)) and undersampling (removing non-checkworthy samples until they match the number of checkworthy ones, referred to as '1:1' or 'train-balanced-1' (example)). The variant without changes to the train dataset is referred to as 'Without changes' or 'as-is' (example).
- Textual baselines:
- N-gram baseline from the CLEF Check-That! 2021 Challenge (reference).
- Counts of named entities from different categories (feedforward neural network).
- BERT-base uncased (paper, model).
- Fine-tuning audio models:
- Knowledge alignment - training an audio model, in a teacher-student setup, to represent its input the same way a fine-tuned textual model represents the corresponding text.
- Training models on audio features:
- MFCC (extracted with openSMILE, GRU and Transformer Encoder).
- L3-Net (extracted with openl3, GRU and Transformer Encoder).
- i-vector (extracted with kaldi, feedforward neural network).
- Interspeech ComParE 2013 (extracted with openSMILE, feedforward neural network).
- Interspeech ComParE 2016 (extracted with openSMILE, feedforward neural network).
Mean Average Precision (MAP) is used for evaluation - the same metric as in the CLEF Check-That! 2021 Challenge (paper). The higher the MAP score, the better.
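For reference, here is a minimal sketch of how MAP can be computed, assuming per-event lists of gold labels and model scores and averaging the Average Precision over the events; the function and variable names are illustrative:
# Minimal MAP sketch: average the Average Precision over the events
# (debates, speeches, interviews). Names are illustrative.
from sklearn.metrics import average_precision_score

def mean_average_precision(events):
    # events: list of (labels, scores) pairs, one pair per event;
    # labels are 0/1 checkworthiness annotations, scores are model outputs
    ap_per_event = [average_precision_score(labels, scores) for labels, scores in events]
    return sum(ap_per_event) / len(ap_per_event)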
This section contains the results from the textual baselines and the most notable results using audio. The best epoch is chosen according to the MAP score on the dev dataset, and the model from that epoch is evaluated on the test dataset. The tables below are sorted by MAP(test) in descending order.
Row # | Model Type | Model | Train dataset variant | Audio segments variant | MAP(test) |
---|---|---|---|---|---|
1 | Early fusion ensemble | BERT & HuBERT (rows 5 & 9) | Without changes | Original | 0.3817 |
2 | Late fusion ensemble | BERT & HuBERT (rows 5 & 9) | 15x | Original | 0.3758 |
3 | Early fusion ensemble | BERT & aligned data2vec (rows 5 & 6) | Without changes | Original | 0.3735 |
4 | Late fusion ensemble | BERT & aligned data2vec (rows 5 & 6) | 30x | Original | 0.3724 |
5 | Textual | BERT | 1:1 | N/A | 0.3715 |
6 | Aligned audio model | data2vec-audio | Without changes | Original | 0.2999 (best epoch on dev: 8) |
7 | Aligned audio model | wav2vec 2.0 | Without changes | Original | 0.2996 (best epoch on dev: 10) |
8 | Aligned audio model | HuBERT | Without changes | Original | 0.2787 (best epoch on dev: 12) |
9 | Audio | HuBERT | 30x | Original | 0.2526 |
10 | Textual | n-gram baseline | 15x | N/A | 0.2392 |
11 | Audio | wav2vec 2.0 | 15x | Original | 0.2365 |
12 | Audio | data2vec-audio | 30x | Reduced noise | 0.2330 |
13 | Textual | FNN with named entities | 15x | N/A | 0.2228 |
Results on the single-speaker subset (see checkworthy-by-speaker):
Row # | Model Type | Model | Audio segments variant | MAP(test) |
---|---|---|---|---|
1 | Audio | wav2vec 2.0 | Reduced noise | 0.3427 |
2 | Textual | BERT | N/A | 0.3267 |
3 | Textual | n-gram baseline | N/A | 0.2693 |
4 | Audio | HuBERT | Original | 0.2478 |
5 | Textual | FNN with named entities | N/A | 0.2193 |
6 | Audio | data2vec-audio | Reduced noise | 0.2129 |
- Create a directory for this project and create a virtual environment in it (assuming Python 3 is used):
mkdir checkworthy-research && cd checkworthy-research
python3 -m venv .
source ./bin/activate
- Clone this git repository and cd into it.
- Install the dependencies:
pip install -r requirements.txt
The first time, also download the en_core_web_sm model from spacy.
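Assuming the standard spaCy CLI, the model can be downloaded with:
python -m spacy download en_core_web_sm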
- Export the project root as environment variable:
export PROJECT_ROOT=/<...>/checkworthy-research/audio-checkworthiness-detection
- cd into the scripts directory.
- Download the full transcripts (CLEF Check-That! 2021 challenge, Task 1B) by running:
python 01-download-full-transcripts.py --data_dir ${PROJECT_ROOT}/data
Sentences with speaker SYSTEM are not part of the multimodal dataset - several of them are marked as checkworthy although they are not (e.g. applause). One may check which these are by running:
python 02-extract-system-sentences.py --data_dir ${PROJECT_ROOT}/data
and having a look at the files in data/clef2021/task-1b-english/full/v1/system-sentences.
- Retrieve the audio for the events (see event links)
Put the audio files in ${PROJECT_ROOT}/data/clef2021/task-1b-english/reduced/v1/02-audios. Expected directory structure:
02-audios
|- dev
|- 20170228_Trump_Congress_joint_session
|- audio-1.wav
| ...
| ...
|- test
|- 20170512_Trump_NBC_holt_interview
|- audio-1.wav
| ...
| ...
|- train
| ...
| ...
|- 20160907_NBC_commander_in_chief_forum
|- audio-1.wav
|- audio-2.wav
| ...
| ...
Note: Not all of the event recordings were found, and some are incomplete. Hence, we have two datasets:
- The original CLEF Check-That! 2021 dataset - called full or original.
- The multimodal dataset (which is a subset of the full dataset) - also called reduced.
- Apply begin-end audio alignment
This is required because the audio files do not always cover all the lines of a transcript - in some cases the audio may start at, say, line 50 of the transcript; in other cases it may not cover the transcript all the way to the end.
With this step we align the audio files with their transcripts. The begin-end audio alignments can be found here. Run the following script to align the audio files and the transcripts:
python 05-apply-begin-end-manual-alignment.py --data_dir ${PROJECT_ROOT}/data
It takes about 8 minutes on a MacBook Pro with an Intel Core i9-9880H processor and 32GB of RAM.
- Word-level text-audio alignment
No action is required - this alignment is already built. The audio-text aligner used is Gentle. The word-level alignment is available here.
- Sentence-level text-audio alignment
The data is annotated at the sentence level, with two classes - 0 (the sentence is not checkworthy) and 1 (the sentence is checkworthy). We leverage the word-level alignment to build a text-audio alignment at the sentence level. Run the following:
python 10-build-sentence-level-alignment.py --data_dir ${PROJECT_ROOT}/data > output-building-sentence-level-alignment.txt
Remarks:
- The script runs quickly (about a minute), but its output is verbose, so it is handy to redirect it to a file in case one would like to inspect it.
- The line numbers in the output are those from the full dataset (the original line numbers).
- We filter out sentences with speaker SYSTEM.
- If the end of a sentence is not found, we traverse up to 3 words into the next sentence to find an ending for the previous one (see the sketch below).
- Sentence-level segments shorter than one second are skipped, as some models cannot process audio that short.
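A rough sketch of that end-of-sentence fallback (the names are hypothetical, not the actual script's internals):
# Hypothetical sketch of the fallback: if no aligned end is found in the
# current sentence, borrow a boundary from the first few aligned words of
# the next sentence.
def find_sentence_end(words, next_words, max_lookahead=3):
    for word in reversed(words):
        if word.get("end") is not None:      # last aligned word of this sentence
            return word["end"]
    for word in next_words[:max_lookahead]:  # traverse up to 3 words ahead
        if word.get("start") is not None:
            return word["start"]
    return None                              # no boundary found -> the sentence is skipped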
- Cut sentence-level audio segments
Run the following:
python 12-chop-audio-segments.py --data_dir ${PROJECT_ROOT}/data
Remarks:
- The script takes less than 5 minutes to run.
- Each segment is resampled to 16 kHz, converted to mono (single channel), and has its sample width set to 16 bits (see the sketch below).
- The audio segment file names contain a line number - this is the corresponding line number in the reduced dataset.
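A minimal sketch of that post-processing for a single segment, assuming pydub (an assumption - the actual script may use a different library; file names and boundaries are illustrative):
# Cut one sentence and normalise it to 16 kHz, mono, 16-bit samples.
from pydub import AudioSegment

start_ms, end_ms = 12_000, 15_500                 # illustrative sentence boundaries
audio = AudioSegment.from_wav("audio-1.wav")
segment = audio[start_ms:end_ms]                  # cut the sentence-level segment
segment = segment.set_frame_rate(16000)           # resample to 16 kHz
segment = segment.set_channels(1)                 # convert to mono
segment = segment.set_sample_width(2)             # 16-bit samples
segment.export("sentence-42.wav", format="wav")   # illustrative file name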
- Reducing noise
This is performed using:
python 13-reduce-noise.py --data_dir ${PROJECT_ROOT}/data
This takes approximately two and a half hours.
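For reference, a minimal sketch of what noise reduction with noisereduce looks like for a single segment (file names are illustrative):
# Reduce noise in one sentence-level segment with noisereduce.
import noisereduce as nr
import soundfile as sf

audio, sample_rate = sf.read("sentence-42.wav")
reduced = nr.reduce_noise(y=audio, sr=sample_rate)
sf.write("sentence-42-rn.wav", reduced, sample_rate)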
- Generating speech
The results with generated speech were never the worst, but also never the best. As this step is time-consuming, it is up to the reader whether to invest in it. Run the following to generate those audio segments:
python 14-generate-speech.py --data_dir ${PROJECT_ROOT}/data --target_dir_name ${PROJECT_ROOT}/data/clef2021/task-1b-english/reduced/v1/10-audio-segments-gs
Output from our run.
- Extracting reduced dataset stats
One may find stats about the full dataset in this paper, page 7 (Task 1B).
For convenience, a file with stats about the reduced dataset is also present in this repository and could be found here.
One may extract the stats using:
python 15-reduced-dataset-stats.py --data_dir ${PROJECT_ROOT}/data > reduced-dataset-stats.txt
Remarks:
- It takes about half a minute for the script to run.
- By default, stats about the number of tokens are extracted using the distilbert-base-uncased tokenizer, but that can be changed via the --tokenizer_name command-line option (see the sketch below).
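A minimal sketch of the token counting, assuming a Hugging Face tokenizer (the example sentence is illustrative):
# Count tokens in a sentence with the default tokenizer; swap the model
# name to mimic the --tokenizer_name option.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
num_tokens = len(tokenizer.tokenize("We have the best economy in the history of our country."))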
- Extract MFCC, ComParE 2013, and ComParE 2016 features with openSMILE
Download openSMILE from here.
Here is a template command to use:
python 16-build-openSMILE-features.py \
--data_dir ${PROJECT_ROOT}/data \
--segments_dir_name <segments-dir> \
--target_dir_name <features-dir-name>/<features-type-dir-name> \
--openSMILE_dir <path-to-opensmile>/opensmile-3.0 \
--feature_set <feature-set>
It is executed for every variant of the audio segments (original, reduced noise, and generated speech) and for every type of features - MFCC, ComParE 2013, and ComParE 2016.
And here are the values for the placeholders:
Placeholder | Value |
---|---|
<segments-dir> | For original audio use 08-audio-segments. For reduced noise: 09-audio-segments-rn. For generated speech: 10-audio-segments-gs. |
<features-dir-name> | For original audio use features. For reduced noise: features-rn. For generated speech: features-gs. |
<features-type-dir-name> | For MFCC use opensmile-mfcc. For ComParE 2013: compare-2013. For ComParE 2016: compare-2016. |
<path-to-opensmile> | The path where you extracted openSMILE, up to (but not including) bin. |
<feature-set> | For MFCC use mfcc. For ComParE 2013: compare-2013. For ComParE 2016: compare-2016. |
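As an alternative to the script above, the opensmile Python package can compute equivalent functionals for a single segment; this is only a sketch, not what the repository's script uses:
# Alternative sketch with the opensmile Python package (not the
# SMILExtract-based script above): ComParE 2016 functionals for one segment.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("sentence-42.wav")  # one-row pandas DataFrame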
- Extracting i-vectors
First, prepare Kaldi. See the instructions in this script and then run it.
Template for the next command, executed for every audio segments variant (original, reduced noise, and generated speech):
python 18-extract-ivectors.py \
--data_dir ${PROJECT_ROOT}/data \
--segments_dir_name <segments-dir> \
--target_dir_name <features-dir-name>/ivectors \
--kaldi_dir <path-to-kaldi> \
--mfcc_config ${PROJECT_ROOT}/kaldi-configs/mfcc.conf
The placeholders <segments-dir> and <features-dir-name> are analogous to those in the openSMILE feature extraction. <path-to-kaldi> is chosen by the reader as described in this script.
- Extracting L3-Net features
Command template:
python 19-extract-openl3-embeddings.py \
--data_dir ${PROJECT_ROOT}/data \
--segments_dir_name <segments-dir> \
--target_dir_name <features-dir-name>/openl3
The placeholders <segments-dir> and <features-dir-name> are analogous to those in the openSMILE feature extraction. If in doubt about the --target_dir_name, check where the features are read from afterwards (during training) in this script.
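A minimal sketch of what the openl3 extraction does for a single segment (the file name is illustrative):
# Extract L3-Net embeddings for one sentence-level segment with openl3.
import openl3
import soundfile as sf

audio, sample_rate = sf.read("sentence-42.wav")
embeddings, timestamps = openl3.get_audio_embedding(audio, sample_rate)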
- Oversampling
Run the following:
python 20-duplicate-train-checkworthy.py --data_dir ${PROJECT_ROOT}/data --num_duplicates 15
and then:
python 20-duplicate-train-checkworthy.py --data_dir ${PROJECT_ROOT}/data --num_duplicates 30
- Undersampling
Run the following:
python 21-balance-dataset.py --data_dir ${PROJECT_ROOT}/data
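An illustrative pandas sketch of the two sampling steps above (the file name and the 'label' column are assumptions, not the repository's exact format):
# Illustrative sketch of oversampling (15x) and undersampling (1:1).
import pandas as pd

train = pd.read_csv("train.tsv", sep="\t")
positives = train[train["label"] == 1]            # checkworthy sentences
negatives = train[train["label"] == 0]            # non-checkworthy sentences

# 15x: the checkworthy sentences appear 15 times in total
train_15x = pd.concat([train] + [positives] * 14, ignore_index=True)

# 1:1: keep as many non-checkworthy sentences as there are checkworthy ones
train_balanced = pd.concat(
    [positives, negatives.sample(n=len(positives), random_state=42)],
    ignore_index=True,
)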
- Create the single speaker subset
Run the following:
python 22-filter-trump.py --data_dir ${PROJECT_ROOT}/data
- Extract textual features - counts of named entities
Run the following:
python 24-ner.py --data_dir ${PROJECT_ROOT}/data
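A minimal sketch of the named-entity counting, assuming spaCy with the en_core_web_sm model installed earlier (the example sentence is illustrative):
# Count named entities per category for one sentence.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama spoke in Chicago on Tuesday about NAFTA.")
entity_counts = Counter(ent.label_ for ent in doc.ents)  # e.g. PERSON, GPE, DATE, ORG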
- Generate the data files
All the features we have extracted so far are at the sentence level and are stored in files. The final data files used during training contain paths to those feature files. Run the following commands:
python build-complete-reduced-data-files.py --data_dir ${PROJECT_ROOT}/data --train_dir_name train --dev_dir_name dev --test_dir_name test
python build-complete-reduced-data-files.py --data_dir ${PROJECT_ROOT}/data --train_dir_name train-15x --dev_dir_name dev --test_dir_name test
python build-complete-reduced-data-files.py --data_dir ${PROJECT_ROOT}/data --train_dir_name train-30x --dev_dir_name dev --test_dir_name test
python build-complete-reduced-data-files.py --data_dir ${PROJECT_ROOT}/data --train_dir_name train-balanced-1 --dev_dir_name dev --test_dir_name test
python build-complete-reduced-data-files.py --data_dir ${PROJECT_ROOT}/data --train_dir_name trump-train --dev_dir_name trump-dev --test_dir_name trump-test
- Train scalers
In this step, scalers for the different feature types are prepared. Run the following:
python train-scalers.py --data_dir ${PROJECT_ROOT}/data --target_dir ${PROJECT_ROOT}/trained-scalers
The output from our run can be found here.
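An illustrative sketch of what such a scaler looks like, assuming scikit-learn's StandardScaler and a saved feature matrix (both are assumptions; see train-scalers.py for the actual choices):
# Hypothetical sketch: fit a scaler on the train-set features of one
# feature type and persist it for later use during training.
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

train_features = np.load("compare-2016-train.npy")   # hypothetical feature matrix
scaler = StandardScaler().fit(train_features)
joblib.dump(scaler, "compare-2016-scaler.joblib")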
- Train models
Run the train.py script to train a model. Refer to the experiments in training-results for hyperparameter values. For example, these are the values used for fine-tuning BERT on the multimodal dataset (without changes to the train dataset for addressing skewness). And these are the values when fine-tuning HuBERT on the same format of the train dataset (without changes, 'as-is').
Use the alignment.py script for the knowledge alignment procedure in teacher-student mode (a sketch of the idea follows below). Note that the textual model must be fine-tuned in advance.
Use the extract-vector-representations.py script for extracting vector representations of already fine-tuned textual or audio models.
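A minimal PyTorch sketch of the alignment idea referenced above (the MSE loss and the function names are assumptions, not necessarily what alignment.py implements):
# Teacher-student alignment sketch: push the audio (student) representation
# of a sentence towards the frozen, fine-tuned textual (teacher) one.
import torch

mse = torch.nn.MSELoss()

def alignment_step(audio_model, text_model, audio_batch, text_batch, optimizer):
    with torch.no_grad():
        teacher_repr = text_model(text_batch)   # fine-tuned BERT, kept frozen
    student_repr = audio_model(audio_batch)     # e.g. HuBERT / wav2vec 2.0 / data2vec
    loss = mse(student_repr, teacher_repr)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()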
- Inspect ranking performed by a model
Here is an example command for inspecting the actually checkworthy claims and the ranks a model assigned to them. It also displays the top-ranked and bottom-ranked sentences. Run the following:
python inspect-ranking.py \
--data_dir ${PROJECT_ROOT}/data \
--actual_dir_name test \
--predictions_dir ${PROJECT_ROOT}/training-results/features/early-fusion/bert-and-hubert/best-as-is/epoch-2/test-predictions