# Preparing M2D pre-training data

This notebook prepares the pre-training data, namely the log-mel spectrogram (LMS) files and the list files, for the M2D pre-training used in the ablation study of the paper.

### 1. Prepare M2D pre-training on AudioSet

Please make sure you can pre-train an M2D model on AudioSet files. Find the details [here](https://github.com/nttcslab/m2d#3-pre-training-from-scratch).

### 2. Aggregate the respiratory sound data beforehand

Run [`Collecting_Respiratory_Data.ipynb`](Collecting_Respiratory_Data.ipynb) to aggregate all the respiratory sound files from ICBHI 2017, COUGHVID, and HF_Lung into the `resp_wav` folder.



### 3. Convert the respiratory sound data files into log-mel spectrogram files

Once [`Collecting_Respiratory_Data.ipynb`](Collecting_Respiratory_Data.ipynb) has finished running, convert the .wav files into log-mel spectrogram (LMS) files in .npy format as follows.

Use [`wav_to_lms.py`](https://github.com/nttcslab/m2d/blob/master/wav_to_lms.py) from your local copy of the M2D repository:

```sh
    python wav_to_lms.py /your/local/resp_wav /your/m2d/data/resp_lms
```

The goal is to create the LMS files under the `data/resp_lms` folder in your local copy of `m2d`.

After the conversion, you should have the following folders. The files under them are used for pre-training an M2D model.

```
    /your/m2d/data/resp_lms
        coughvid
        HF_Lung_V2
        ICBHI2017
```
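Before building the list files, it can help to confirm that the conversion produced files in every dataset folder. The following is a minimal sketch, not part of the M2D repository; the helper name `count_lms_files` is hypothetical, and it only assumes the folder layout shown above.

```python
from collections import Counter
from pathlib import Path


def count_lms_files(lms_root):
    """Count .npy files under each first-level subfolder of lms_root.

    Hypothetical helper for a quick sanity check; it does not inspect
    the contents of the files.
    """
    root = Path(lms_root)
    counts = Counter()
    for f in root.glob('**/*.npy'):
        parts = f.relative_to(root).parts
        # Skip stray files placed directly under lms_root.
        if len(parts) > 1:
            # The first path component is the dataset folder name.
            counts[parts[0]] += 1
    return dict(counts)


# Example call (replace the path with your own location):
# print(count_lms_files('/your/m2d/data/resp_lms'))
```

If one of the three folders is missing from the result or has an unexpectedly low count, rerun the conversion before continuing.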


In [2]:
import IPython

from IPython import get_ipython
ipython = get_ipython()
ipython.run_line_magic('reload_ext', 'autoreload')
ipython.run_line_magic('autoreload', '2')
ipython.run_line_magic('matplotlib', 'inline')

import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pandas as pd

### 4. Configure your data path

In the following cell, set `base` to the path of your M2D data folder.

In [None]:
# EDIT ME: YOUR M2D DATA PATH
base = '/your/m2d/data'

In [4]:
# Make the list of files under resp_lms.
files = [str(f.relative_to(base)) for f in Path(base).glob('resp_lms/ICBHI2017/**/*.npy')]
files += [str(f.relative_to(base)) for f in Path(base).glob('resp_lms/HF_Lung_V2/**/*.npy')]
files += [str(f.relative_to(base)) for f in Path(base).glob('resp_lms/coughvid/**/*.npy')]

df = pd.DataFrame({'file_name': sorted(files)})
df['dataset'] = df.file_name.apply(lambda x: str(x).split('/')[1])
df.groupby('dataset').count()

| dataset | file_name |
|---|---|
| HF_Lung_V2 | 3839 |
| ICBHI2017 | 539 |
| coughvid | 7054 |


### Make the Resp file list

In [5]:
# Oversample ICBHI2017
ICBHI2017 = df[df.dataset == 'ICBHI2017']
d = pd.concat([df] + [ICBHI2017]*5)
d.groupby('dataset').count()

| dataset | file_name |
|---|---|
| HF_Lung_V2 | 3839 |
| ICBHI2017 | 3234 |
| coughvid | 7054 |
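The oversampling step appends five extra copies of the ICBHI2017 rows to the original list, so each ICBHI2017 file appears six times. The arithmetic can be checked with a small standalone sketch; the variable names below are illustrative, and the counts are the ones printed by this notebook.

```python
# File counts per dataset before oversampling (from the output above).
counts = {'HF_Lung_V2': 3839, 'ICBHI2017': 539, 'coughvid': 7054}

# Number of extra ICBHI2017 copies appended by pd.concat([df] + [ICBHI2017]*5).
EXTRA_COPIES = 5

oversampled = dict(counts)
# One original copy plus the extra copies.
oversampled['ICBHI2017'] = counts['ICBHI2017'] * (1 + EXTRA_COPIES)

total_rows = sum(oversampled.values())
print(oversampled['ICBHI2017'])  # 3234
print(total_rows)                # 14127
```

The total of 14127 rows is the size of the respiratory list that the later cells repeat.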


### Load the AudioSet file list

We assume that the `files_audioset.csv` file lists the .npy files of AudioSet, and it contains about 2 million files.
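As a rough check of the list format, the sketch below counts the entries of a list file by top-level folder. It assumes the file has a single `file_name` column holding paths relative to `base`, with AudioSet entries starting with `audioset_lms`, which is what the cells in this notebook rely on. The helper name `summarize_file_list` is hypothetical.

```python
import csv


def summarize_file_list(csv_path):
    """Count list-file entries by their top-level folder.

    Assumes a CSV with a `file_name` column of relative paths such as
    `audioset_lms/...` or `resp_lms/...`.
    """
    counts = {}
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            top = row['file_name'].split('/')[0]
            counts[top] = counts.get(top, 0) + 1
    return counts


# Example call (replace the path with your own location):
# print(summarize_file_list('/your/m2d/data/files_audioset.csv'))
```

The same helper can be applied to the list files written later in this notebook to confirm the mix of AudioSet and respiratory entries.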

In [6]:
# Load AudioSet data list
asdf = pd.read_csv(base + '/files_audioset.csv')
len(asdf)  # --> about 2 million files (2005132 in our case)

2005132

### Create the list: Resp only

In [7]:
total = d[['file_name']]
total.to_csv(base + '/files_R_F_M_1.csv', index=None)
print('AudioSet:', sum(total.file_name.str.startswith('audioset_lms')), 'Resp:', sum(total.file_name.str.startswith('resp_lms')))

AudioSet: 0 Resp: 14127


### Create the lists: AudioSet (AS) + Resp 100K to 500K

In [8]:
total = pd.concat([asdf] + [d[['file_name']]] * 7)
total.to_csv(base + '/files_A_S_R_F_M_1.csv', index=None)
print('AudioSet:', sum(total.file_name.str.startswith('audioset_lms')), 'Resp 100K:', sum(total.file_name.str.startswith('resp_lms')))

total = pd.concat([asdf] + [d[['file_name']]] * 15)
total.to_csv(base + '/files_A_S_R_F_M_2.csv', index=None)
print('AudioSet:', sum(total.file_name.str.startswith('audioset_lms')), 'Resp 200K:', sum(total.file_name.str.startswith('resp_lms')))

total = pd.concat([asdf] + [d[['file_name']]] * 22)
total.to_csv(base + '/files_A_S_R_F_M_3.csv', index=None)
print('AudioSet:', sum(total.file_name.str.startswith('audioset_lms')), 'Resp 300K:', sum(total.file_name.str.startswith('resp_lms')))

total = pd.concat([asdf] + [d[['file_name']]] * 29)
total.to_csv(base + '/files_A_S_R_F_M_4.csv', index=None)
print('AudioSet:', sum(total.file_name.str.startswith('audioset_lms')), 'Resp 400K:', sum(total.file_name.str.startswith('resp_lms')))

total = pd.concat([asdf] + [d[['file_name']]] * 36)
total.to_csv(base + '/files_A_S_R_F_M_5.csv', index=None)
print('AudioSet:', sum(total.file_name.str.startswith('audioset_lms')), 'Resp 500K:', sum(total.file_name.str.startswith('resp_lms')))


AudioSet: 2005132 Resp 100K: 98889
AudioSet: 2005132 Resp 200K: 211905
AudioSet: 2005132 Resp 300K: 310794
AudioSet: 2005132 Resp 400K: 409683
AudioSet: 2005132 Resp 500K: 508572
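The respiratory counts above follow from repeating the 14127-row oversampled list a whole number of times. The sketch below reproduces that arithmetic; `RESP_ROWS` and `REPEATS` are illustrative names, and the repeat factors are the ones used in the cell above.

```python
# Rows in the oversampled respiratory list (see the Resp-only list).
RESP_ROWS = 14127

# Repeat factor used for each files_A_S_R_F_M_<n>.csv list.
REPEATS = {1: 7, 2: 15, 3: 22, 4: 29, 5: 36}

# Respiratory rows contributed to each list file.
resp_rows = {n: RESP_ROWS * k for n, k in REPEATS.items()}

for n, rows in resp_rows.items():
    print(f'files_A_S_R_F_M_{n}.csv: {rows} respiratory rows')
```

Because the list is repeated whole, the totals are multiples of 14127 and only approximate the nominal 100K to 500K targets.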
