Audio Checkworthiness Detection

Research on using audio data to detect checkworthy claims in political events (debates, interviews, and speeches). Neural models that use either audio alone or both text and audio are provided. The results show that an audio model can boost the performance of a strong textual one when the two are combined.

The research was carried out at Sofia University (Bulgaria), Faculty of Mathematics and Informatics.

Paper

The paper describing this work can be found at:

Short Description of the Project

In short, this work contains:

  • A multimodal dataset (text and audio in English) - 48 hours of speech, 34,489 sentences. Labels are at the sentence level - whether each sentence contains a checkworthy claim or not.
  • Variants of the audio segments (audio files for each sentence) - original, reduced noise, and generated speech.
  • A variant of the dataset with a single speaker (Donald Trump, see checkworthy-by-speaker).
  • Real-world data - the multimodal dataset is based on the one from the CLEF CheckThat! 2021 challenge (see this paper, page 7 (Task 1B) and the event links).
  • Addressing the class skew of the train dataset via oversampling (duplicating checkworthy claims 15 or 30 times, referred to as '15x' (example) and '30x' (example)) and undersampling (removing non-checkworthy samples until they match the number of checkworthy ones, referred to as '1:1' or 'train-balanced-1' (example)). The variant with an unmodified train dataset is referred to as 'Without changes' or 'as-is' (example).
  • Textual baselines - an n-gram model, a feedforward network over counts of named entities, and fine-tuned BERT.
  • Fine-tuning audio models - wav2vec 2.0, HuBERT, and data2vec-audio.
  • Knowledge alignment - training an audio model to represent its input the same way a fine-tuned textual model represents the corresponding text, in a teacher-student mode.
  • Training models on audio features:
    • MFCC (extracted with openSMILE; models: GRU and Transformer Encoder).
    • L3-Net (extracted with openl3; models: GRU and Transformer Encoder).
    • i-vector (extracted with Kaldi; model: feedforward neural network).
    • Interspeech ComParE 2013 (extracted with openSMILE; model: feedforward neural network).
    • Interspeech ComParE 2016 (extracted with openSMILE; model: feedforward neural network).

Metric

Mean Average Precision (MAP) is used - the same metric as in the CLEF CheckThat! 2021 challenge (paper). The higher the MAP score, the better.
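As an illustration of how this metric can be computed (a conceptual sketch, not the official challenge scorer; it assumes per-event lists of gold labels and model scores):

# Conceptual sketch of MAP for this task: sentences are ranked per event by the
# model's checkworthiness score, average precision is computed per event, and
# the per-event scores are averaged. Not the official CLEF CheckThat! scorer.
from sklearn.metrics import average_precision_score

def mean_average_precision(events):
    """events: list of (labels, scores) pairs, one pair per event.

    labels: 1 if a sentence contains a checkworthy claim, else 0.
    scores: the model's predicted checkworthiness scores for the same sentences.
    """
    average_precisions = [average_precision_score(labels, scores)
                          for labels, scores in events]
    return sum(average_precisions) / len(average_precisions)

# Toy example with two events of three sentences each.
print(mean_average_precision([
    ([1, 0, 0], [0.9, 0.2, 0.1]),
    ([0, 1, 1], [0.3, 0.8, 0.6]),
]))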

Best Results

This section contains the results of the textual baselines and the most notable results obtained using audio. The best epoch is chosen according to the MAP score on the dev dataset, and the model at that epoch is used for evaluation on the test dataset. The tables below are sorted by MAP(test) in descending order.

Results With Multiple Speakers

| Row # | Model Type | Model | Train dataset variant | Audio segments variant | MAP(test) |
|-------|------------|-------|-----------------------|------------------------|-----------|
| 1 | Early fusion ensemble | BERT & HuBERT (rows 5 & 9) | Without changes | Original | 0.3817 |
| 2 | Late fusion ensemble | BERT & HuBERT (rows 5 & 9) | 15x | Original | 0.3758 |
| 3 | Early fusion ensemble | BERT & aligned data2vec (rows 5 & 6) | Without changes | Original | 0.3735 |
| 4 | Late fusion ensemble | BERT & aligned data2vec (rows 5 & 6) | 30x | Original | 0.3724 |
| 5 | Textual | BERT | 1:1 | N/A | 0.3715 |
| 6 | Aligned audio model | data2vec-audio | Without changes | Original | 0.2999 (best epoch on dev: 8) |
| 7 | Aligned audio model | wav2vec 2.0 | Without changes | Original | 0.2996 (best epoch on dev: 10) |
| 8 | Aligned audio model | HuBERT | Without changes | Original | 0.2787 (best epoch on dev: 12) |
| 9 | Audio | HuBERT | 30x | Original | 0.2526 |
| 10 | Textual | n-gram baseline | 15x | N/A | 0.2392 |
| 11 | Audio | wav2vec 2.0 | 15x | Original | 0.2365 |
| 12 | Audio | data2vec-audio | 30x | Reduced noise | 0.2330 |
| 13 | Textual | FNN with named entities | 15x | N/A | 0.2228 |

Results With Single Speaker

| Row # | Model Type | Model | Audio segments variant | MAP(test) |
|-------|------------|-------|------------------------|-----------|
| 1 | Audio | wav2vec 2.0 | Reduced noise | 0.3427 |
| 2 | Textual | BERT | N/A | 0.3267 |
| 3 | Textual | n-gram baseline | N/A | 0.2693 |
| 4 | Audio | HuBERT | Original | 0.2478 |
| 5 | Textual | FNN with named entities | N/A | 0.2193 |
| 6 | Audio | data2vec-audio | Reduced noise | 0.2129 |

Set Up the Project

  1. Create a directory for this project and create a virtual environment in it (assuming Python 3 is used):
mkdir checkworthy-research && cd checkworthy-research
python3 -m venv .
source ./bin/activate
  2. Clone this Git repository and cd into it, for example:
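git clone https://github.com/petar-iv/audio-checkworthiness-detection.git
cd audio-checkworthiness-detection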

  3. Install dependencies:

pip install -r requirements.txt

On first setup, also download the en_core_web_sm model for spaCy:
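python -m spacy download en_core_web_sm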

  4. Export the project root as an environment variable:
export PROJECT_ROOT=/<...>/checkworthy-research/audio-checkworthiness-detection
  5. cd into the scripts directory

  6. Download the full transcripts (CLEF CheckThat! 2021 challenge, task 1B) by running:

python 01-download-full-transcripts.py --data_dir ${PROJECT_ROOT}/data

Sentences with speaker SYSTEM are not part of the multimodal dataset - several such entries are marked as checkworthy although they are not (e.g., applause). One may check which these are by running:

python 02-extract-system-sentences.py --data_dir ${PROJECT_ROOT}/data

and having a look at the files in data/clef2021/task-1b-english/full/v1/system-sentences.

  7. Retrieve the audio for the events (see the event links)

Put the audio files in ${PROJECT_ROOT}/data/clef2021/task-1b-english/reduced/v1/02-audios. Expected directory structure:

02-audios
|- dev
   |- 20170228_Trump_Congress_joint_session
      |- audio-1.wav
   | ...
   | ...
|- test
   |- 20170512_Trump_NBC_holt_interview
      |- audio-1.wav
   | ...
   | ...
|- train
   | ...
   | ...
   |- 20160907_NBC_commander_in_chief_forum
      |- audio-1.wav
      |- audio-2.wav
   | ...
   | ...

Note: Not all of the event recordings were found, and some are incomplete. Hence, there are two datasets:

  • The original CLEF CheckThat! 2021 dataset - called full or original.
  • The multimodal dataset (a subset of the full dataset) - also called reduced.
  8. Apply begin-end audio alignment

This is required because the audio files do not always cover all the lines of a transcript - in some cases the audio may start at, say, line 50 of the transcript; in other cases, it may not cover the transcript all the way to the end.

In this step we align the audio files with their transcripts. The begin-end audio alignments can be found here. Run the following script to align the audio files with the transcripts:

python 05-apply-begin-end-manual-alignment.py --data_dir ${PROJECT_ROOT}/data

It takes about 8 minutes on a MacBook Pro with an Intel Core i9-9880H processor and 32GB of RAM.

  9. Word-level text-audio alignment

No action is required - this alignment is already built. The audio-text aligner used is Gentle. The word-level alignment is available here.

  10. Sentence-level text-audio alignment

The data is annotated at the sentence level. There are two classes - 0 (the sentence is not checkworthy) and 1 (the sentence is checkworthy). We leverage the word-level alignment to build a sentence-level text-audio alignment. Run the following:

python 10-build-sentence-level-alignment.py --data_dir ${PROJECT_ROOT}/data > output-building-sentence-level-alignment.txt

Remarks:

  • The script runs quickly (about a minute), but its output is verbose, so it is handy to redirect it to a file in case one would like to inspect it later.
  • The line numbers in the output are those from the full dataset, i.e., the original line numbers.
  • We filter out sentences with speaker SYSTEM.
  • If the end of a sentence is not found, we traverse up to 3 words into the next sentence to find the ending of the previous one.
  • Sentence-level segments shorter than one second are skipped, as some models cannot process audio that short (see the sketch after this list).
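A minimal sketch of that span computation, assuming each sentence comes with word-level alignments carrying start and end timestamps (the tuple layout below is illustrative, not the exact format produced by the repository's scripts):

MIN_DURATION_SECONDS = 1.0

def sentence_segment(aligned_words):
    """aligned_words: list of (word, start_sec, end_sec) tuples for one sentence.

    Returns the (start, end) of the sentence-level audio segment,
    or None if the segment is too short to be used.
    """
    if not aligned_words:
        return None
    start = aligned_words[0][1]
    end = aligned_words[-1][2]
    if end - start < MIN_DURATION_SECONDS:
        return None  # some audio models cannot handle segments this short
    return start, end

print(sentence_segment([("this", 10.0, 10.2), ("is", 10.2, 10.4), ("short", 10.4, 10.8)]))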
  11. Cut sentence-level audio segments

Run the following:

python 12-chop-audio-segments.py --data_dir ${PROJECT_ROOT}/data

Remarks:

  • The script takes less than 5 minutes to run.
  • Each segment is resampled to 16 kHz, converted to mono (single channel), and stored with a sample width of 16 bits (see the sketch after this list).
  • The audio segment file names contain a line number - this is the corresponding line number in the reduced dataset.
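As an illustration of that conversion, here is a minimal sketch using pydub; the repository's script may use a different library, and the file names are placeholders:

from pydub import AudioSegment

# Normalise one sentence-level segment: 16 kHz, mono, 16-bit samples (2 bytes).
segment = AudioSegment.from_wav("segment-original.wav")
segment = segment.set_frame_rate(16000).set_channels(1).set_sample_width(2)
segment.export("segment-16khz-mono.wav", format="wav")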
  12. Reducing noise

This is performed using:

python 13-reduce-noise.py --data_dir ${PROJECT_ROOT}/data

This takes approximately 2 and a half hours.
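The exact noise-reduction method is not described here; as a rough, assumed illustration, spectral-gating noise reduction on a single segment with the noisereduce package looks like this (not necessarily what 13-reduce-noise.py does):

import noisereduce as nr
import soundfile as sf

# Assumption: spectral-gating noise reduction applied per audio segment.
audio, sample_rate = sf.read("segment-16khz-mono.wav")
cleaned = nr.reduce_noise(y=audio, sr=sample_rate)
sf.write("segment-reduced-noise.wav", cleaned, sample_rate)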

  13. Generating speech

The results with generated speech were never the worst, but also never the best. Since speech generation is time-consuming, it is up to the reader to decide whether to invest in this approach. Run the following to generate those audio segments:

python 14-generate-speech.py --data_dir ${PROJECT_ROOT}/data --target_dir_name ${PROJECT_ROOT}/data/clef2021/task-1b-english/reduced/v1/10-audio-segments-gs

Output from our run.

  14. Extracting reduced dataset stats

One may find stats about the full dataset in this paper, page 7 (Task 1B).

For convenience, a file with stats about the reduced dataset is also present in this repository and can be found here.

One may extract the stats using:

python 15-reduced-dataset-stats.py --data_dir ${PROJECT_ROOT}/data > reduced-dataset-stats.txt

Remarks:

  • It takes about half a minute for the script to run.
  • By default, statistics about the number of tokens are computed with the distilbert-base-uncased tokenizer; this can be changed via the --tokenizer_name command-line option.
  15. Extract MFCC, ComParE 2013, and ComParE 2016 features with openSMILE

Download openSMILE from here.

Here is a template command to use:

python 16-build-openSMILE-features.py \
   --data_dir ${PROJECT_ROOT}/data \
   --segments_dir_name <segments-dir> \
   --target_dir_name <features-dir-name>/<features-type-dir-name> \
   --openSMILE_dir <path-to-opensmile>/opensmile-3.0 \
   --feature_set <feature-set>

The command is executed for every variant of the audio segments (original, reduced noise, and generated speech) and for every feature type - MFCC, ComParE 2013, and ComParE 2016.

And here are the values for the placeholders:

| Placeholder | Value |
|-------------|-------|
| <segments-dir> | For original audio use 08-audio-segments. For reduced noise: 09-audio-segments-rn. For generated speech: 10-audio-segments-gs. |
| <features-dir-name> | For original audio use features. For reduced noise: features-rn. For generated speech: features-gs. |
| <features-type-dir-name> | For MFCC use opensmile-mfcc. For ComParE 2013: compare-2013. For ComParE 2016: compare-2016. |
| <path-to-opensmile> | The path where you have extracted openSMILE, up to (but not including) 'bin'. |
| <feature-set> | For MFCC use mfcc. For ComParE 2013: compare-2013. For ComParE 2016: compare-2016. |
  16. Extracting i-vectors

First, prepare Kaldi. See the instructions in this script and then run it.

Template for the next command, executed for every audio segment variant (original, reduced noise, and generated speech):

python 18-extract-ivectors.py \
   --data_dir ${PROJECT_ROOT}/data \
   --segments_dir_name <segments-dir> \
   --target_dir_name <features-dir-name>/ivectors \
   --kaldi_dir <path-to-kaldi> \
   --mfcc_config ${PROJECT_ROOT}/kaldi-configs/mfcc.conf

The placeholders <segments-dir> and <features-dir-name> are analogous to those in the openSMILE feature extraction. <path-to-kaldi> is chosen by the reader as described in this script.

  17. Extracting L3-Net features

Command template:

python 19-extract-openl3-embeddings.py \
   --data_dir ${PROJECT_ROOT}/data \
   --segments_dir_name <segments-dir> \
   --target_dir_name <features-dir-name>/openl3

The placeholders <segments-dir> and <features-dir-name> are analogous to those in the openSMILE feature extraction.

If in doubt about the --target_dir_name, check where the features are read from afterwards (during training) in this script.
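For reference, extracting L3-Net embeddings for a single segment with openl3 looks roughly like the following; the parameter values are illustrative assumptions, not necessarily those used by 19-extract-openl3-embeddings.py:

import openl3
import soundfile as sf

# Compute L3-Net embeddings for one sentence-level segment.
audio, sample_rate = sf.read("segment-16khz-mono.wav")
embeddings, timestamps = openl3.get_audio_embedding(
    audio, sample_rate,
    content_type="env",   # assumption; the library default is "music"
    embedding_size=512,   # assumption; the library default is 6144
)
print(embeddings.shape)   # (number of frames, embedding size)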

  18. Oversampling

Run the following:

python 20-duplicate-train-checkworthy.py --data_dir ${PROJECT_ROOT}/data --num_duplicates 15

and then:

python 20-duplicate-train-checkworthy.py --data_dir ${PROJECT_ROOT}/data --num_duplicates 30
  19. Undersampling

Run the following:

python 21-balance-dataset.py --data_dir ${PROJECT_ROOT}/data
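Conceptually, the two scripts above do something like the following; the sketch operates on a hypothetical pandas DataFrame with a 'label' column, whereas the actual scripts work on the repository's own data files:

import pandas as pd

def oversample(train, num_duplicates):
    """Append num_duplicates extra copies of the checkworthy (label == 1) rows."""
    checkworthy = train[train["label"] == 1]
    return pd.concat([train] + [checkworthy] * num_duplicates, ignore_index=True)

def undersample(train):
    """Keep only as many non-checkworthy rows as there are checkworthy ones (1:1)."""
    checkworthy = train[train["label"] == 1]
    non_checkworthy = train[train["label"] == 0].sample(n=len(checkworthy), random_state=42)
    return pd.concat([checkworthy, non_checkworthy], ignore_index=True)

# Toy example: one checkworthy and three non-checkworthy sentences.
train = pd.DataFrame({"sentence": ["a", "b", "c", "d"], "label": [1, 0, 0, 0]})
print(len(oversample(train, 15)), len(undersample(train)))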
  20. Create the single-speaker subset

Run the following:

python 22-filter-trump.py --data_dir ${PROJECT_ROOT}/data
  21. Extract textual features - counts of named entities

Run the following:

python 24-ner.py --data_dir ${PROJECT_ROOT}/data
  22. Generate the data files

All the features extracted so far are at the sentence level and are stored in files. The final data files used during training contain the paths to those feature files. Run the following commands:

python build-complete-reduced-data-files.py --data_dir ${PROJECT_ROOT}/data --train_dir_name train --dev_dir_name dev --test_dir_name test
python build-complete-reduced-data-files.py --data_dir ${PROJECT_ROOT}/data --train_dir_name train-15x --dev_dir_name dev --test_dir_name test
python build-complete-reduced-data-files.py --data_dir ${PROJECT_ROOT}/data --train_dir_name train-30x --dev_dir_name dev --test_dir_name test
python build-complete-reduced-data-files.py --data_dir ${PROJECT_ROOT}/data --train_dir_name train-balanced-1 --dev_dir_name dev --test_dir_name test
python build-complete-reduced-data-files.py --data_dir ${PROJECT_ROOT}/data --train_dir_name trump-train --dev_dir_name trump-dev --test_dir_name trump-test
  23. Train scalers

In this step, scalers for the different feature types are prepared. Run the following:

python train-scalers.py --data_dir ${PROJECT_ROOT}/data --target_dir ${PROJECT_ROOT}/trained-scalers

Output from our run can be found here.
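For intuition, training a scaler for a single feature type amounts to something like the sketch below, using scikit-learn's StandardScaler and joblib; the script's actual choice of scaler and serialization format may differ:

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for a (num_sentences, num_features) matrix of one feature type
# loaded from the extracted feature files.
train_features = np.random.randn(1000, 128)

scaler = StandardScaler().fit(train_features)
joblib.dump(scaler, "example-feature-scaler.joblib")

# During training and evaluation, the same scaler is applied to dev/test features:
# dev_scaled = scaler.transform(dev_features)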

  24. Train models

Run the train.py script to train a model. Refer to the experiments in training-results for hyperparameter values. For example, these are the values used for fine-tuning BERT on the multimodal dataset (without changes to the train dataset to address class skew), and these are the values used for fine-tuning HuBERT on the same variant of the train dataset (without changes, 'as-is').

Use the alignment.py script for the knowledge alignment procedure in teacher-student mode; note that the textual (teacher) model must be fine-tuned in advance. The idea is sketched below.
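A conceptual sketch of the alignment objective, assuming the teacher (textual) and student (audio) models produce sentence-level embeddings of the same dimensionality; the loss and the dimensions are illustrative, see alignment.py for what is actually done:

import torch
import torch.nn.functional as F

def alignment_loss(student_audio_embedding, teacher_text_embedding):
    """Pull the student (audio) representation towards the frozen
    teacher (textual) representation of the same sentence."""
    return F.mse_loss(student_audio_embedding, teacher_text_embedding)

# Dummy sentence-level embeddings standing in for the two models' outputs.
teacher_text_embedding = torch.randn(8, 768)  # produced by the frozen, fine-tuned teacher
student_audio_embedding = torch.randn(8, 768, requires_grad=True)

loss = alignment_loss(student_audio_embedding, teacher_text_embedding)
loss.backward()  # gradients flow only into the student side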

Use the extract-vector-representations.py script to extract vector representations from already fine-tuned textual or audio models.

  25. Inspect the ranking performed by a model

The following example command inspects the actually checkworthy claims and the ranks a model assigned to them; it also displays the top-ranked and bottom-ranked sentences:

python inspect-ranking.py \
   --data_dir ${PROJECT_ROOT}/data \
   --actual_dir_name test \
   --predictions_dir ${PROJECT_ROOT}/training-results/features/early-fusion/bert-and-hubert/best-as-is/epoch-2/test-predictions
