# Preparing pre-training data

This notebook prepares the pre-training data (LMS files and list files) for the M2D pre-training used in the paper's ablation study.

You need to:
- Prepare M2D pre-training on AudioSet in advance.
- Collect respiratory sound data files from ICBHI 2017, COUGHVID, and HF_Lung in advance.
- Convert the respiratory sound data files into log-mel spectrogram files.
- Set the path `base` in the cell marked `EDIT ME`.
- Run all the cells.

Running all the cells creates the data list files `files_A_S_R_F_M_1.csv` to `files_A_S_R_F_M_5.csv` under `base`.


### Converting respiratory sound data files into log-mel spectrogram files

Make sure you have collected the respiratory sound data files in a local folder.
Then use M2D's `wav_to_lms.py` to convert them into log-mel spectrogram (LMS) files in an output folder.

```sh
    python wav_to_lms.py /your/local/resp_wav /your/m2d/data/resp_lms
```

This example converts the files under `/your/local/resp_wav` and writes the results to `/your/m2d/data/resp_lms`, which should then contain the following folders.


```
    /your/m2d/data/resp_lms
        coughvid
        HF_Lung_V2
        ICBHI2017
```
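Before building the lists, it can help to confirm that the conversion produced files in each expected subfolder. The following is a minimal sketch, not part of M2D: the helper name `count_lms_files` is hypothetical, and the subfolder names are taken from the listing above.

```python
from pathlib import Path

def count_lms_files(lms_root, subfolders=('coughvid', 'HF_Lung_V2', 'ICBHI2017')):
    """Return {subfolder: number of .npy files found recursively under it}.

    A missing subfolder is reported with a count of 0.
    """
    root = Path(lms_root)
    counts = {}
    for name in subfolders:
        folder = root / name
        # Only search folders that exist; otherwise report zero files.
        counts[name] = sum(1 for _ in folder.glob('**/*.npy')) if folder.is_dir() else 0
    return counts

# Placeholder path from the example above; replace it with your own LMS folder.
print(count_lms_files('/your/m2d/data/resp_lms'))
```

A count of 0 for any subfolder suggests that the corresponding dataset was not converted or was written somewhere else.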


In [1]:
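from IPython import get_ipython

# Enable auto-reloading of edited modules and inline plots.
# `run_line_magic` replaces the deprecated `ipython.magic`.
ipython = get_ipython()
ipython.run_line_magic('reload_ext', 'autoreload')
ipython.run_line_magic('autoreload', '2')
ipython.run_line_magic('matplotlib', 'inline')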

import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pandas as pd



In [None]:
# EDIT ME: YOUR M2D DATA PATH
base = '/your/m2d/data'

# List the LMS files under resp_lms, as paths relative to `base`.
files = [str(f.relative_to(base)) for f in Path(base).glob('resp_lms/ICBHI2017/**/*.npy')]
files += [str(f.relative_to(base)) for f in Path(base).glob('resp_lms/HF_Lung_V2/**/*.npy')]
files += [str(f.relative_to(base)) for f in Path(base).glob('resp_lms/coughvid/**/*.npy')]

df = pd.DataFrame({'file_name': sorted(files)})
# The dataset name is the second path component, e.g. 'resp_lms/ICBHI2017/...' -> 'ICBHI2017'.
df['dataset'] = df.file_name.apply(lambda x: str(x).split('/')[1])
# Count the files per dataset.
df.groupby('dataset').count()


| dataset | file_name |
|---|---|
| HF_Lung_V2 | 3839 |
| ICBHI2017 | 539 |
| coughvid | 7054 |


In [3]:
# Make the basic list of respiratory audio files.
# The ICBHI2017 rows are appended 5 more times, so they appear 6 times in total (539 -> 3234).
ICBHI2017 = df[df.dataset == 'ICBHI2017']
d = pd.concat([df] + [ICBHI2017]*5)
d.groupby('dataset').count()

| dataset | file_name |
|---|---|
| HF_Lung_V2 | 3839 |
| ICBHI2017 | 3234 |
| coughvid | 7054 |
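The cell above repeats the ICBHI2017 rows by concatenating extra copies of that subset onto the full frame. A minimal sketch of the same pattern on toy data (the file names below are made up for illustration):

```python
import pandas as pd

# Toy stand-in for `df`: one ICBHI2017 file and two coughvid files (hypothetical names).
toy = pd.DataFrame({
    'file_name': [
        'resp_lms/ICBHI2017/a.npy',
        'resp_lms/coughvid/b.npy',
        'resp_lms/coughvid/c.npy',
    ],
    'dataset': ['ICBHI2017', 'coughvid', 'coughvid'],
})

# Select the subset to repeat, then append 5 extra copies to the full frame.
minority = toy[toy.dataset == 'ICBHI2017']
oversampled = pd.concat([toy] + [minority] * 5)

# Each ICBHI2017 row now appears 1 + 5 = 6 times; other rows are unchanged.
counts = oversampled.groupby('dataset').file_name.count().to_dict()
print(counts)
```

With the real data, the same factor of 6 turns the 539 ICBHI2017 files into 3234 list entries.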


In [4]:
# Load AudioSet data list
asdf = pd.read_csv(base + '/files_audioset.csv')
len(asdf)

2005132

## 100K to 500K

In [5]:
# Append the respiratory list `d` (14,127 entries) 7, 15, 22, 29, or 36 times to the
# AudioSet list, giving roughly 100K to 500K respiratory entries.
# In each printed line, the value after 'Total:' is the number of AudioSet entries and
# the value after the size label is the number of respiratory entries.
total = pd.concat([asdf] + [d[['file_name']]] * 7)
total.to_csv(base + '/files_A_S_R_F_M_1.csv', index=False)
print('Total:', sum(total.file_name.str.startswith('audioset_lms')), '100K:', sum(total.file_name.str.startswith('resp_lms')))

total = pd.concat([asdf] + [d[['file_name']]] * 15)
total.to_csv(base + '/files_A_S_R_F_M_2.csv', index=False)
print('Total:', sum(total.file_name.str.startswith('audioset_lms')), '200K:', sum(total.file_name.str.startswith('resp_lms')))

total = pd.concat([asdf] + [d[['file_name']]] * 22)
total.to_csv(base + '/files_A_S_R_F_M_3.csv', index=False)
print('Total:', sum(total.file_name.str.startswith('audioset_lms')), '300K:', sum(total.file_name.str.startswith('resp_lms')))

total = pd.concat([asdf] + [d[['file_name']]] * 29)
total.to_csv(base + '/files_A_S_R_F_M_4.csv', index=False)
print('Total:', sum(total.file_name.str.startswith('audioset_lms')), '400K:', sum(total.file_name.str.startswith('resp_lms')))

total = pd.concat([asdf] + [d[['file_name']]] * 36)
total.to_csv(base + '/files_A_S_R_F_M_5.csv', index=False)
print('Total:', sum(total.file_name.str.startswith('audioset_lms')), '500K:', sum(total.file_name.str.startswith('resp_lms')))


Total: 2005132 100K: 98889
Total: 2005132 200K: 211905
Total: 2005132 300K: 310794
Total: 2005132 400K: 409683
Total: 2005132 500K: 508572
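The respiratory counts printed above follow directly from the size of the basic list `d` and the repeat factors used in the cell. A short sketch of the arithmetic, using only the numbers shown in this notebook:

```python
# Per-dataset counts of the basic respiratory list `d`, copied from the output of In [3].
resp_counts = {'HF_Lung_V2': 3839, 'ICBHI2017': 3234, 'coughvid': 7054}
base_size = sum(resp_counts.values())  # 14127 entries in `d`

# Repeat factors used for each list file, copied from the cell above.
repeats = {'100K': 7, '200K': 15, '300K': 22, '400K': 29, '500K': 36}

# Number of respiratory entries in each list = size of `d` times the repeat factor.
sizes = {label: base_size * n for label, n in repeats.items()}
for label, size in sizes.items():
    print(label, size)
```

The size labels are nominal: the actual counts range from 98,889 to 508,572 because `d` can only be repeated a whole number of times.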
