This is a recipe for data based on AMI [1] used in our ICASSP 2022 paper [2] on using unsupervised sound separation with mixture invariant training (MixIT) [3] for adaptation of speech separation models to real-world meeting data.
Specifically, we provide a CSV and code that together form an exact recipe to recreate the synthetic AMI evaluation set used in [2].
First, download the AMI corpus from the official AMI download page. Under scenario meetings, select a, b, c, and d for all Edinburgh, IDIAP, and TNO meetings. Under non-scenario meetings, also select all Edinburgh meetings. Under "Audio related media streams", select "Individual headsets" and "Microphone array". The total estimated download size should be 185 GB.
Then run the following command, where the bash variables AMI_DIRECTORY and OUTPUT_DIRECTORY are set to the path of the downloaded AMI dataset and the desired output directory, respectively:
python3 make_synthetic_ami.py -a ${AMI_DIRECTORY} -o ${OUTPUT_DIRECTORY}
A "segment" is a section of the meeting that is single-speaker, as indicated by the AMI annotations. The wav file path from which a segment is extracted is given by wav_<bg,fg>
, and the start and end times of the segments are given by seg_start_<bg,fg>
and seg_end_<bg,fg>
. For each segment, we use least-squares to estimate the best linear time-invariant finite impulse response (FIR) filter that maps single-speaker headset audio to distant microphone audio. This provides clean reverberant versions of the anechoic headset audio, which can then be mixed together.
This filtering procedure also provides an estimate of the background noise. Given headset audio $x$, distant microphone audio $y$, and the estimated FIR filter $\hat{h}$, the background noise estimate is $n = y - \hat{h} * x$, where $*$ denotes convolution.
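As an illustration, here is a minimal NumPy/SciPy sketch of the least-squares FIR estimation and the resulting noise estimate. The filter length and the absence of regularization are assumptions for this sketch; the actual script may differ.

```python
import numpy as np
from scipy.signal import fftconvolve

def estimate_fir_filter(x, y, filter_len=512):
    """Least-squares FIR filter h such that (h * x) approximates y.

    x: single-speaker headset audio, shape (num_samples,).
    y: distant microphone audio, shape (num_samples,).
    filter_len: assumed filter length; the actual script may use a
      different length and/or regularization.
    """
    num_samples = min(len(x), len(y))
    # Convolution matrix: column tau holds x delayed by tau samples,
    # so that X @ h equals the first num_samples samples of conv(x, h).
    X = np.zeros((num_samples, filter_len))
    for tau in range(filter_len):
        X[tau:, tau] = x[:num_samples - tau]
    h, _, _, _ = np.linalg.lstsq(X, y[:num_samples], rcond=None)
    return h

def noise_estimate(x, y, h):
    """Background noise estimate n = y - h * x."""
    x_img = fftconvolve(x, h)[:len(y)]  # reverberant image of the headset source
    return y - x_img
```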
To construct the synthetic AMI mixtures with their corresponding references, we extract shorter "clips" from the segments described above. The offset of a clip within a segment is given by offset_<bg,fg>, and the duration of a clip from this offset is given by duration_<bg,fg>. Each synthetic AMI example is constructed from two clips: a "background" clip, which is always 5 seconds long, and a "foreground" clip, which has duration less than or equal to 5 seconds. For the background clip, two sources are created: reverberant filtered headset audio of the background speaker and the corresponding background noise estimate. For the foreground clip, one source is created: reverberant filtered headset audio of the foreground speaker, which is mixed into the background clip at an offset in seconds given by shift_fg.
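Below is a minimal sketch of how a clip is located within its wav file and how the foreground is mixed into the background, assuming 16 kHz AMI audio and soundfile for I/O; in the actual pipeline, the clips being mixed are the reverberant filtered versions described above.

```python
import soundfile as sf

SR = 16000  # AMI audio sample rate

def load_clip(wav_path, seg_start, offset, duration):
    """Read a clip located at seg_start + offset seconds within the wav."""
    start = int(round((seg_start + offset) * SR))
    clip, _ = sf.read(wav_path, start=start, frames=int(round(duration * SR)))
    return clip

def mix_foreground(bg_clip, fg_clip, shift_fg):
    """Add the foreground clip into the 5-second background clip.

    Assumes shift_fg + duration_fg <= 5 seconds, so the foreground fits.
    """
    mixture = bg_clip.copy()
    start = int(round(shift_fg * SR))
    mixture[start:start + len(fg_clip)] += fg_clip
    return mixture
```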
The outputs of the make_synthetic_ami.py script are a folder for each example, where each folder contains 3 wav files:
- receiver_audio.wav: single-channel audio of the distant microphone.
- source_images.wav: 3-channel audio of the reverberant sources, in order of imperfect background noise, background source, foreground source.
- source_audio.wav: 3-channel audio of the original headset audio. The first source is all-zeros, the second source is the original background headset audio, and the third source is the original foreground headset audio.
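As a quick sanity check, the reverberant source images should sum (at least approximately) to the received mixture. A minimal sketch, assuming soundfile for I/O; the example folder name is hypothetical:

```python
import numpy as np
import soundfile as sf

example_dir = 'synthetic_ami/example_0000'  # hypothetical path
mixture, sr = sf.read(f'{example_dir}/receiver_audio.wav')  # (num_samples,)
images, _ = sf.read(f'{example_dir}/source_images.wav')     # (num_samples, 3)
sources, _ = sf.read(f'{example_dir}/source_audio.wav')     # (num_samples, 3)

# Maximum deviation between the mixture and the sum of source images.
print(np.max(np.abs(mixture - images.sum(axis=-1))))
```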
The CSV provides the mixing recipe, and contains the following columns:
- wav_bg: string, relative path to background headset wav.
- seg_start_bg: float, start time in seconds of background segment.
- seg_end_bg: float, end time in seconds of background segment.
- offset_bg: float, offset of background clip within segment.
- duration_bg: float, duration in seconds of background clip (always 5).
- wav_fg: string, relative path to foreground headset wav.
- seg_start_fg: float, start time in seconds of foreground segment.
- seg_end_fg: float, end time in seconds of foreground segment.
- offset_fg: float, offset of foreground clip within segment.
- duration_fg: float, duration in seconds of foreground clip.
- shift_fg: float, shift of fg clip relative to bg clip in seconds.
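For convenience, here is a short pandas snippet that reads the CSV and converts one row's foreground fields into sample indices; the CSV filename is a placeholder, and a 16 kHz sample rate is assumed:

```python
import pandas as pd

SR = 16000
df = pd.read_csv('synthetic_ami.csv')  # placeholder filename

row = df.iloc[0]
# Sample range of the foreground clip inside its headset wav.
fg_start = int(round((row['seg_start_fg'] + row['offset_fg']) * SR))
fg_stop = fg_start + int(round(row['duration_fg'] * SR))
print(row['wav_fg'], fg_start, fg_stop)
```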
The CSV is released under a Creative Commons license (CC-BY 4.0).
Code is under Apache 2.0 license.