# Reproducible experimental protocol

This notebook builds the database with all the information we need to perform domain-adversarial speech activity detection.

## Requirements

### Python packages

- pyannote.audio
- pyannote.core
- pyannote.database
- pandas

### Datasets

- `ldc2019e31`: [Second DIHARD Challenge Development Data](https://coml.lscp.ens.fr/dihard/)
- `ldc2019e32`: [Second DIHARD Challenge Evaluation Data](https://coml.lscp.ens.fr/dihard/)
- `musan`: [A corpus of MUsic, Speech, And Noise](https://www.openslr.org/17/) 

In [64]:
# where ldc2019e31 dataset has been downloaded
ldc2019e31 = '/vol/corpora1/data/ldc/ldc2019e31/LDC2019E31_Second_DIHARD_Challenge_Development_Data'

# where ldc2019e32 dataset has been downloaded 
ldc2019e32 = '/vol/corpora1/data/ldc/ldc2019e32/LDC2019E32_Second_DIHARD_Challenge_Evaluation_Data_V1.1'

# where MUSAN has been downloaded from https://www.openslr.org/17/
musan = '/vol/corpora4/musan'

# where github.com/hbredin/DomainAdversarialVoiceActivityDetection has been cloned
ROOT = '/vol/work1/bredin/jsalt/DomainAdversarialVoiceActivityDetection'

In [65]:
# create 'database' sub-directory that is meant to store audio and reference files
!mkdir -p {ROOT}/database/DIHARD

In [77]:
# define utility functions

from pyannote.core import Timeline
from pyannote.core import Annotation
from typing import TextIO

def write_rttm(file: TextIO, reference: Annotation):
    """Write reference annotation to "rttm" file

    Parameters
    ----------
    file : file object
    reference : `pyannote.core.Annotation`
        Reference annotation
    """

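    # one line per speaker turn: segment, track name (unused) and speaker label
    for segment, _, label in reference.itertracks(yield_label=True):
        line = (
            f'SPEAKER {reference.uri} 1 {segment.start:.3f} {segment.duration:.3f} '
            f'<NA> <NA> {label} <NA> <NA>\n'
        )
        file.write(line)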

def write_uem(file: TextIO, uem: Timeline):
    """Write evaluation map to "uem" file

    Parameters
    ----------
    file : file object
    uem : `pyannote.core.Timeline`
        Evaluation timeline
    """

    for s in uem:
        line = f'{uem.uri} 1 {s.start:.3f} {s.end:.3f}\n'
        file.write(line)
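
For reference, the line formats written by the two helpers above can be reproduced without `pyannote.core`. The sketch below relies on two hypothetical stand-alone functions, `rttm_line` and `uem_line`, and on made-up values; it is only meant to show what ends up on disk.

```python
def rttm_line(uri: str, start: float, duration: float, label: str) -> str:
    """Format one speaker turn like `write_rttm` does (hypothetical helper)."""
    return (
        f'SPEAKER {uri} 1 {start:.3f} {duration:.3f} '
        f'<NA> <NA> {label} <NA> <NA>\n'
    )


def uem_line(uri: str, start: float, end: float) -> str:
    """Format one evaluated region like `write_uem` does (hypothetical helper)."""
    return f'{uri} 1 {start:.3f} {end:.3f}\n'


# made-up values, for illustration only
print(rttm_line('dev/DH_0001', 0.5, 1.25, 'speaker_A'), end='')
print(uem_line('dev/DH_0001', 0.0, 10.0), end='')
```

Note that an RTTM line stores a start time and a duration, whereas a UEM line stores a start time and an end time.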

## Preparing the DIHARD dataset

The development and evaluation subsets use the same file names: `DH_0001` to `DH_0192` exist in both subsets.  
To avoid any confusion in `pyannote.database`, we create one symbolic link per subset so that `dev/DH_0001` can be distinguished from `tst/DH_0001`.

In [10]:
!ln --symbolic {ldc2019e31}/data/single_channel/flac {ROOT}/database/DIHARD/dev
!ln --symbolic {ldc2019e32}/data/single_channel/flac {ROOT}/database/DIHARD/tst

ln: failed to create symbolic link '/home/lavechin/Bureau/DomainAdversarialVoiceActivityDetection/database/DIHARD/dev/flac': File exists
ln: failed to create symbolic link '/home/lavechin/Bureau/DomainAdversarialVoiceActivityDetection/database/DIHARD/tst/flac': File exists
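
Running the `ln` cell a second time fails with "File exists" once the links are in place. A minimal sketch of a link creation that can be re-run safely is given below; `link_subset` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def link_subset(source: Path, link: Path) -> bool:
    """Create `link` pointing at `source`; return False if `link` already exists."""
    if link.is_symlink() or link.exists():
        # nothing to do: the link (or a file with that name) is already there
        return False
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(source, target_is_directory=True)
    return True
```

Assuming the variables defined at the top of this notebook, it could be called as `link_subset(Path(ldc2019e31) / 'data/single_channel/flac', Path(ROOT) / 'database/DIHARD/dev')`, and likewise for `tst`.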


In [21]:
from pandas import read_csv

# load list of test files (and their domain)
# (sep=r'\s+' splits on any whitespace, like the deprecated delim_whitespace=True)
tst = read_csv(f'{ldc2019e32}/docs/sources.tbl',
               sep=r'\s+',
               names=['uri', 'language', 'domain', 'source'],
               index_col='uri').filter(like='DH', axis=0)

# load list of development files (and their domain)
dev = read_csv(f'{ldc2019e31}/docs/sources.tbl',
               sep=r'\s+',
               names=['uri', 'language', 'domain', 'source'],
               index_col='uri').filter(like='DH', axis=0)

# obtain list of domains
dihard_domains = sorted(dev.domain.unique())

The next cell will create four files per (domain, subset) pair:
- `{domain}.{subset}.txt` contains list of files
- `{domain}.{subset}.rttm` contains manual annotation
- `{domain}.{subset}.uem` contains unpartitioned evaluation map (uem)
- `{domain}.domain.{subset}.txt` contains file-to-domain mapping

In [25]:
from pyannote.database.util import load_rttm
from pyannote.database.util import load_uem
from pyannote.audio.features.utils import get_audio_duration
from pyannote.core import Segment

# split ldc2019e31 into training set (two thirds) and development set (one third)

# for each domain in ldc2019e31
for domain, files in dev.groupby('domain'):
    
    # load unpartitioned evaluation map (uem)
    uems = load_uem(f'{ldc2019e31}/data/single_channel/uem/{domain}.uem')
    
    # create four files per (domain, subset) pair
    # {domain}.{subset}.txt contains list of files
    # {domain}.{subset}.rttm contains manual annotation
    # {domain}.{subset}.uem contains unpartitioned evaluation map (uem)
    # {domain}.domain.{subset}.txt contains file-to-domain mapping
    with open(f'{ROOT}/database/DIHARD/{domain}.dev.txt', 'w') as uris_dev, \
         open(f'{ROOT}/database/DIHARD/{domain}.trn.txt', 'w') as uris_trn, \
         open(f'{ROOT}/database/DIHARD/{domain}.dev.rttm', 'w') as rttm_dev, \
         open(f'{ROOT}/database/DIHARD/{domain}.trn.rttm', 'w') as rttm_trn, \
         open(f'{ROOT}/database/DIHARD/{domain}.dev.uem', 'w') as uem_dev, \
         open(f'{ROOT}/database/DIHARD/{domain}.trn.uem', 'w') as uem_trn, \
         open(f'{ROOT}/database/DIHARD/{domain}.domain.dev.txt', 'w') as domain_dev, \
         open(f'{ROOT}/database/DIHARD/{domain}.domain.trn.txt', 'w') as domain_trn:
        
        # for each file in current domain
        for i, (uri, file) in enumerate(files.iterrows()):
            
            duration = get_audio_duration({'audio': f'{ROOT}/database/DIHARD/dev/{uri}.flac'})
            # ugly hack to avoid rounding errors: this has the effect of not considering 
            # the last millisecond of each file
            duration -= 0.001
            support = Segment(0, duration)
            
            # i = 0 ==> dev
            # i = 1 ==> trn
            # i = 2 ==> trn
            # i = 3 ==> dev
            # i = 4 ==> trn
            # i = 5 ==> trn
            # i = 6 ==> dev 
            # ...
            f_uris = uris_trn if i % 3 else uris_dev
            f_uris.write(f'dev/{uri}\n')
            
            # dump domain to disk
            f_domain = domain_trn if i % 3 else domain_dev
            f_domain.write(f'dev/{uri} {domain}\n')
            
            # load and crop reference (cf above hack)
            reference = load_rttm(f'{ldc2019e31}/data/single_channel/rttm/{uri}.rttm')[uri]
            reference.uri = f'dev/{uri}'
            reference = reference.crop(support, mode='intersection')
            
            # dump reference to disk
            f_rttm = rttm_trn if i % 3 else rttm_dev
            write_rttm(f_rttm, reference)
            
            # load and crop unpartitioned evaluation map
            uem = uems[uri]
            uem.uri = f'dev/{uri}'
            uem = uem.crop(support, mode='intersection')
            
            # dump uem to disk
            f_uem = uem_trn if i % 3 else uem_dev
            write_uem(f_uem, uem)

# same as above but applied to ldc2019e32 that is used entirely for test
for domain, files in tst.groupby('domain'):
    
    uems = load_uem(f'{ldc2019e32}/data/single_channel/uem/{domain}.uem')

    with open(f'{ROOT}/database/DIHARD/{domain}.tst.txt', 'w') as f_uris, \
         open(f'{ROOT}/database/DIHARD/{domain}.tst.rttm', 'w') as f_rttm, \
         open(f'{ROOT}/database/DIHARD/{domain}.tst.uem', 'w') as f_uem, \
         open(f'{ROOT}/database/DIHARD/{domain}.domain.tst.txt', 'w') as f_domain:

        for i, (uri, file) in enumerate(files.iterrows()):
            
            duration = get_audio_duration({'audio': f'{ROOT}/database/DIHARD/tst/{uri}.flac'})
            duration -= 0.001
            support = Segment(0, duration)
            
            f_uris.write(f'tst/{uri}\n')
            
            f_domain.write(f'tst/{uri} {domain}\n')
            
            reference = load_rttm(f'{ldc2019e32}/data/single_channel/rttm/{uri}.rttm')[uri]
            reference.uri = f'tst/{uri}'
            reference = reference.crop(support, mode='intersection')

            write_rttm(f_rttm, reference)
            
            uem = uems[uri]
            uem.uri = f'tst/{uri}'
            uem = uem.crop(support, mode='intersection')

            write_uem(f_uem, uem)
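
The two-thirds / one-third split applied to `ldc2019e31` above can be checked in isolation. The sketch below uses a hypothetical helper, `assign_subset`, and made-up file identifiers; it only reproduces the `i % 3` rule, not the file handling.

```python
def assign_subset(index: int) -> str:
    """Return 'dev' for every third file (index 0, 3, 6, ...) and 'trn' otherwise."""
    return 'trn' if index % 3 else 'dev'


# hypothetical file identifiers, for illustration only
uris = [f'DH_{n:04d}' for n in range(1, 8)]
split = {uri: assign_subset(i) for i, uri in enumerate(uris)}
print(split)
```

Because the counter restarts for each domain, every domain contributes its first file to the development set, and roughly one third of its files overall.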

Create `database.yml`:

In [30]:
import yaml

database_yml = {
    'Databases': {
        'DIHARD': f'{ROOT}/database/DIHARD/{{uri}}.flac',
        'MUSAN': f'{musan}/{{uri}}.wav',
    },
    'Protocols': {
        'DIHARD': {'SpeakerDiarization': {}},
        'X': {'SpeakerDiarization': {}}
    }
}

for domain in dihard_domains:
    database_yml['Protocols']['DIHARD']['SpeakerDiarization'][f'{domain}'] = {}
    for subset, short in {'train': 'trn', 'development': 'dev', 'test': 'tst'}.items():
        database_yml['Protocols']['DIHARD']['SpeakerDiarization'][f'{domain}'][subset] = {
            'uris': f'{ROOT}/database/DIHARD/{domain}.{short}.txt',
            'annotation': f'{ROOT}/database/DIHARD/{domain}.{short}.rttm',
            'annotated': f'{ROOT}/database/DIHARD/{domain}.{short}.uem',
            'domain': f'{ROOT}/database/DIHARD/{domain}.domain.{short}.txt',
        }
    
    all_but_domain = sorted(set(dihard_domains) - {domain})
    database_yml['Protocols']['X']['SpeakerDiarization'][f'DIHARD_LeaveOneDomainOut_{domain}'] = {}
    for subset in ['train', 'development']:
        database_yml['Protocols']['X']['SpeakerDiarization'][f'DIHARD_LeaveOneDomainOut_{domain}'][subset] = {
            f'DIHARD.SpeakerDiarization.{other_domain}': [subset] for other_domain in all_but_domain
        }
    database_yml['Protocols']['X']['SpeakerDiarization'][f'DIHARD_LeaveOneDomainOut_{domain}']['test'] = {
        f'DIHARD.SpeakerDiarization.{domain}': ['test']
    }   
    
database_yml['Protocols']['X']['SpeakerDiarization']['DIHARD_Official'] = {
    subset: {
        f'DIHARD.SpeakerDiarization.{domain}': [subset] for domain in dihard_domains
    } for subset in ['train', 'development', 'test']
}

with open(f'{ROOT}/database.yml', 'w') as f:
    f.write(yaml.dump(database_yml, 
                      default_flow_style=False))
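
To make the leave-one-domain-out entries easier to picture, here is a small sketch that builds the same kind of mapping for three placeholder domains. The helper `leave_one_domain_out` and the domain names are illustrative assumptions, not the actual DIHARD domains.

```python
def leave_one_domain_out(domains, held_out):
    """Mirror the structure built above for one held-out domain (illustrative helper)."""
    others = sorted(set(domains) - {held_out})
    # train and development subsets aggregate every domain except the held-out one
    protocol = {
        subset: {f'DIHARD.SpeakerDiarization.{d}': [subset] for d in others}
        for subset in ['train', 'development']
    }
    # the test subset only contains the held-out domain
    protocol['test'] = {f'DIHARD.SpeakerDiarization.{held_out}': ['test']}
    return protocol


# placeholder domain names, not the actual DIHARD domains
toy_domains = ['domain_a', 'domain_b', 'domain_c']
protocol = leave_one_domain_out(toy_domains, 'domain_b')
for subset, content in protocol.items():
    print(subset, content)
```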

Setting the `PYANNOTE_DATABASE_CONFIG` environment variable to `{ROOT}/database.yml` makes the following `pyannote.database` protocols available:

- `X.SpeakerDiarization.DIHARD_Official` is the official protocol for `DIHARD2` 
- `X.SpeakerDiarization.DIHARD_LeaveOneDomainOut_{domain}` uses all domains but `{domain}` in the training and development sets, and only `{domain}` in the test set.
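
As a minimal sketch, the variable can also be set from Python. The value of `ROOT` is copied from the top of this notebook, and `leave_one_out_name` is a hypothetical helper that only builds a protocol name.

```python
import os

# same value as ROOT at the top of this notebook; adjust to your own clone
ROOT = '/vol/work1/bredin/jsalt/DomainAdversarialVoiceActivityDetection'

# set the variable before importing pyannote.database, to be on the safe side
os.environ['PYANNOTE_DATABASE_CONFIG'] = f'{ROOT}/database.yml'


def leave_one_out_name(domain: str) -> str:
    """Build the protocol name for a held-out domain (hypothetical helper)."""
    return f'X.SpeakerDiarization.DIHARD_LeaveOneDomainOut_{domain}'


# 'some_domain' is a placeholder, not an actual DIHARD domain
print(leave_one_out_name('some_domain'))
```

With `pyannote.database` installed, such a name would typically be passed to its `get_protocol` function; the exact call may depend on the installed version.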

Once you are done with the data preparation step, you can go back to [the main README](../README.md) to run the experiments.