**Dataset Name:** Egyptian Fruit Bat <br>
**Paper under which this dataset was shared**: [An annotated dataset of Egyptian fruit bat vocalizations across varying contexts and during vocal ontogeny](https://www.nature.com/articles/sdata2017143) <br>
**Lead Researcher**: [Yossi Yovel](http://www.yossiyovel.com/) <br>
**Dataset License**: [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/) <br>
**Code License**: to be linked to

**Please note**: This notebook documents the steps we took to preprocess the dataset. It is for informational purposes only. You do not need to run the code and can safely navigate to 01_Download_The_Data_And_Construct_Dataloaders.ipynb.

# Data Preprocessing - The Egyptian Fruit Bat Dataset

The original data can be found on [figshare](https://figshare.com/collections/An_annotated_dataset_of_Egyptian_fruit_bat_vocalizations_across_varying_contexts_and_during_vocal_ontogeny/3666502).

It is not provided as a single archive, but rather as a collection of links to 65 webpages from which each piece of the data can be downloaded.

Let us download the audio files along with the provided annotations.

In [1]:
!mkdir data_preprocessing

In [2]:
download_urls = [
    'https://ndownloader.figshare.com/files/8879599',
    'https://ndownloader.figshare.com/files/8879602',
    'https://ndownloader.figshare.com/files/8879608',
    'https://ndownloader.figshare.com/files/8879611',
    'https://ndownloader.figshare.com/files/8879617',
    'https://ndownloader.figshare.com/files/8879623',
    'https://ndownloader.figshare.com/files/8879632',
    'https://ndownloader.figshare.com/files/8879641',
    'https://ndownloader.figshare.com/files/8879653',
    'https://ndownloader.figshare.com/files/8879659',
    'https://ndownloader.figshare.com/files/8879662',
    'https://ndownloader.figshare.com/files/8879674',
    'https://ndownloader.figshare.com/files/8879683',
    'https://ndownloader.figshare.com/files/8879179',
    'https://ndownloader.figshare.com/files/8879287',
    'https://ndownloader.figshare.com/files/8879338',
    'https://ndownloader.figshare.com/files/8879392',
    'https://ndownloader.figshare.com/files/8879404',
    'https://ndownloader.figshare.com/files/8879425',
    'https://ndownloader.figshare.com/files/8879428',
    'https://ndownloader.figshare.com/files/8879431',
    'https://ndownloader.figshare.com/files/8879521',
    'https://ndownloader.figshare.com/files/8879533',
    'https://ndownloader.figshare.com/files/8879536',
    'https://ndownloader.figshare.com/files/8879545',
    'https://ndownloader.figshare.com/files/8879548',
    'https://ndownloader.figshare.com/files/8879554',
    'https://ndownloader.figshare.com/files/8879572',
    'https://ndownloader.figshare.com/files/8879578',
    'https://ndownloader.figshare.com/files/8879596',
    'https://ndownloader.figshare.com/files/7379008'
]

In [3]:
import subprocess

for url in download_urls:
    subprocess.run(
        ["curl", "-O", "-J", "-L", url],
        cwd="data_preprocessing",
        check=True,
    )

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 2897M  100 2897M    0     0  7446k      0  0:06:38  0:06:38 --:--:-- 5439k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 3498M  100 3498M    0     0  10.7M      0  0:05:26  0:05:26 --:--:-- 10.9M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 2862M  100 2862M    0     0  10.4M      0  0:04:33  0:04:33 --:--:-- 14.2M
  % Total    % Received % Xferd  Average Speed   Tim

KeyboardInterrupt: 
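
The run above was interrupted partway through. One way to make the download restartable, sketched here under assumptions, is to record each URL once its transfer has finished and to skip recorded URLs on the next run. The names `fetch_with_curl`, `download_pending` and the log file `completed_urls.txt` are illustrative and not part of the original notebook; the curl flags are the ones used in the cell above.

```python
import subprocess
from pathlib import Path


def fetch_with_curl(url, dest_dir):
    """Download one URL into dest_dir with the same curl flags as the cell above."""
    subprocess.run(["curl", "-O", "-J", "-L", url], cwd=dest_dir, check=True)


def download_pending(urls, dest_dir, fetch=fetch_with_curl, log_name="completed_urls.txt"):
    """Fetch every URL not yet listed in the log; return the URLs fetched in this call.

    The log file name is a hypothetical convention chosen for this sketch.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    log_path = dest / log_name
    done = set(log_path.read_text().split()) if log_path.exists() else set()

    fetched = []
    for url in urls:
        if url in done:
            continue  # finished in an earlier run
        fetch(url, str(dest))
        # Record the URL only after the transfer returned without error.
        with log_path.open("a") as log:
            log.write(url + "\n")
        fetched.append(url)
    return fetched
```

Calling `download_pending(download_urls, "data_preprocessing")` again after an interruption would then only request the URLs that had not completed. A partially written file left behind by the interrupted transfer may still need to be removed by hand first.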

Now that we have downloaded the data, let us extract it.

In [8]:
!mkdir -p extracted
!cd data_preprocessing && 7z x files222.zip -o../extracted


7-Zip [64] 17.05 : Copyright (c) 1999-2021 Igor Pavlov : 2017-08-28
p7zip Version 17.05 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,10 CPUs LE)

Scanning the drive for archives:
1 file, 3001287451 bytes (2863 MiB)

Extracting archive: files222.zip
--
Path = files222.zip
Type = zip
Physical Size = 3001287451

[extraction progress output omitted]
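
The cell above extracts a single archive. A minimal sketch for extracting every downloaded archive in one pass is shown here, using Python's standard `zipfile` module instead of `7z`. It assumes that all archives are ordinary zip files with a `.zip` extension, as `files222.zip` is; the function name `extract_all_archives` is illustrative.

```python
import zipfile
from pathlib import Path


def extract_all_archives(archive_dir, output_dir):
    """Extract every *.zip in archive_dir into output_dir; return the member names."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    extracted = []
    # Sorting makes the processing order reproducible across runs.
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out)
            extracted.extend(zf.namelist())
    return extracted
```

With the directory layout used in this notebook, the call would be `extract_all_archives("data_preprocessing", "extracted")`.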

Having extracted the data, let us look at the annotations.

In [9]:
import pandas as pd

In [11]:
anno = pd.read_csv('Annotations_Original.csv')

anno.head()

Unnamed: 0,FileID,Emitter,Addressee,Context,Emitter pre-vocalization action,Addressee pre-vocalization action,Emitter post-vocalization action,Addressee post-vocalization action,Start sample,End sample
0,7,118,0,9,2,2,3,3,1,336720
1,11,0,0,11,0,0,0,0,1,787280
2,12,118,0,12,2,2,3,3,1,566096
3,15,0,0,12,0,0,0,0,1,402256
4,20,0,0,12,0,0,0,0,1,394064


To go from the `FileID` identifier to an actual filename, we will need the `FileInfo.csv` reference file.

In [12]:
!cd data_preprocessing && curl -O -J -L https://ndownloader.figshare.com/files/8900695

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 30.1M  100 30.1M    0     0  8680k      0  0:00:03  0:00:03 --:--:-- 13.7M


The `FileInfo.csv` file is not a proper CSV file, but the only information that matters to us is the mapping from a `FileID` to a filename. Let us read this information in.

In [13]:
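file_id2filename = {}

with open('data_preprocessing/FileInfo.csv') as f:
    for line in f:
        # The first field is the FileID and the third is the file name.
        file_id, _, file_name, *_ = line.split(',')
        try:
            file_id2filename[int(file_id)] = file_name
        except ValueError:
            # Lines whose first field is not an integer (e.g. a header) are skipped.
            pass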
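
Because non-numeric lines are skipped silently, it may be worth checking that every `FileID` used in the annotations actually received an entry in the mapping; otherwise a later dictionary lookup would raise a `KeyError`. The helper `unmapped_file_ids` and the toy values below are illustrative only.

```python
def unmapped_file_ids(annotation_file_ids, file_id2filename):
    """Return the sorted FileIDs that occur in the annotations but not in the mapping."""
    return sorted(set(annotation_file_ids) - set(file_id2filename))


# Toy values for illustration; in the notebook one would pass
# anno["FileID"] and the file_id2filename dictionary built above.
toy_mapping = {7: "a.WAV", 11: "b.WAV"}
toy_ids = [7, 11, 11, 12]
print(unmapped_file_ids(toy_ids, toy_mapping))  # prints [12]
```

An empty list means every annotated example can be resolved to a file name.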

In [14]:
import librosa
import glob

Before we move any further, let's confirm all the files indeed were recorded with the same sample rate.

In [16]:
%%time

srs = set()
for path in glob.glob('extracted/*'):
    _, sr = librosa.core.load(path, sr=None)
    srs.add(sr)

srs

CPU times: user 9.17 s, sys: 5.25 s, total: 14.4 s
Wall time: 24.5 s


{250000}

This confirms that all the files were recorded with a sample rate of 250,000 Hz (250 kHz).
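
With a single sample rate, the `Start sample` and `End sample` columns translate directly into durations. As a small worked example, the first annotation row shown earlier spans samples 1 to 336720, which is roughly 1.35 seconds of audio. The function name below is illustrative, and the one-sample difference between inclusive and exclusive end conventions is ignored.

```python
SAMPLE_RATE_HZ = 250_000  # the single rate found by the check above


def segment_duration_seconds(start_sample, end_sample, sample_rate=SAMPLE_RATE_HZ):
    """Approximate duration of an annotated segment in seconds."""
    return (end_sample - start_sample) / sample_rate


# First row of the annotations: Start sample = 1, End sample = 336720.
print(round(segment_duration_seconds(1, 336720), 2))  # prints 1.35
```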

To make the dataset easier to work with, we will iterate over the annotated examples, slice out the relevant part of each recording, and write it to a standalone WAV file.

The naming convention we adopt is that the index of the annotation row for a given example forms the stem of the file name.
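
If several annotation rows point at the same recording, loading that recording once and cutting all of its segments in one go avoids repeated reads. The sketch below shows only the bookkeeping, on plain Python lists; the helper names and the toy rows are illustrative, and the third toy row is invented to show two segments from one file.

```python
from collections import defaultdict


def group_segments_by_file(rows):
    """Group (row_index, file_id, start, end) tuples by file_id."""
    grouped = defaultdict(list)
    for row_index, file_id, start, end in rows:
        grouped[file_id].append((row_index, start, end))
    return dict(grouped)


def cut_segments(signal, segments):
    """Return {'<row_index>.wav': signal[start:end]} for each segment."""
    return {f"{row_index}.wav": signal[start:end] for row_index, start, end in segments}


# Toy data: a ten-sample "recording" and two segments that share FileID 7.
toy_rows = [(0, 7, 2, 5), (1, 11, 0, 3), (2, 7, 6, 8)]
toy_signal = list(range(10))

grouped = group_segments_by_file(toy_rows)
print(cut_segments(toy_signal, grouped[7]))  # prints {'0.wav': [2, 3, 4], '2.wav': [6, 7]}
```

In the real pipeline, `toy_signal` would be the array returned by loading the recording, and each slice would be written to the `audio` directory under its row-index name.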

In [17]:
!mkdir audio

In [None]:
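%%time
# Note from mahika: we don't need to run this since we already have the audio files from the 10k zip file
import soundfile as sf  # librosa.output.write_wav was removed in librosa 0.8.0

for idx, example in anno.iterrows():
    path = file_id2filename[example["FileID"]]
    x, sr = librosa.load(f'extracted/{path}', sr=None)  # sr=None keeps the native sample rate
    # Cut out the annotated segment and write it as a standalone float WAV file.
    segment = x[example['Start sample']:example['End sample']]
    sf.write(f'audio/{idx}.wav', segment, sr, subtype='FLOAT')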

CPU times: user 697 μs, sys: 1.89 ms, total: 2.59 ms
Wall time: 4.81 ms


	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


FileNotFoundError: [Errno 2] No such file or directory: 'extracted/120601002132055008.WAV'
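
A `FileNotFoundError` like the one above is what one would expect when only some of the archives have been downloaded and extracted. A minimal sketch for finding out in advance which source recordings are available is shown here; the function name `split_by_availability` is illustrative.

```python
from pathlib import Path


def split_by_availability(file_names, source_dir):
    """Split file names into (present, missing) depending on whether they exist in source_dir."""
    source = Path(source_dir)
    present, missing = [], []
    for name in file_names:
        if (source / name).is_file():
            present.append(name)
        else:
            missing.append(name)
    return present, missing
```

Passing the values of `file_id2filename` together with `"extracted"` would show how many recordings still need to be fetched before the export loop can run to completion.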

Let us now amend the annotations to include the correct file names.

In [19]:
anno['File Name'] = [f'{idx}.wav' for idx in anno.index]

We could now remove the columns that are no longer needed; here the drop is left commented out so that they stay available.

In [None]:
# Note from mahika: keeping these just in case
# anno.drop(columns=['FileID', 'Start sample', 'End sample'], inplace=True)

In [20]:
anno.head()

Unnamed: 0,FileID,Emitter,Addressee,Context,Emitter pre-vocalization action,Addressee pre-vocalization action,Emitter post-vocalization action,Addressee post-vocalization action,Start sample,End sample,File Name
0,7,118,0,9,2,2,3,3,1,336720,0.wav
1,11,0,0,11,0,0,0,0,1,787280,1.wav
2,12,118,0,12,2,2,3,3,1,566096,2.wav
3,15,0,0,12,0,0,0,0,1,402256,3.wav
4,20,0,0,12,0,0,0,0,1,394064,4.wav


Let's save the new annotations file.

In [21]:
anno.to_csv('annotations_filenames.csv', index=False)
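
To illustrate how the saved file ties each label row to its audio clip, here is a small reader built on the standard `csv` module. It is a sketch under assumptions: the function name `iter_examples` is illustrative, and it relies only on the `File Name` column added above and on the clips living in an `audio` directory.

```python
import csv
from pathlib import Path


def iter_examples(annotations_csv, audio_dir="audio"):
    """Yield (audio_path, row) pairs, where row is a dict of the annotation columns."""
    with open(annotations_csv, newline="") as f:
        for row in csv.DictReader(f):
            yield Path(audio_dir) / row["File Name"], row
```

For the first row shown above, `next(iter_examples("annotations_filenames.csv"))` would pair the path `audio/0.wav` with that row's labels. Note that `csv.DictReader` returns every value as a string.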

This data has multiple labels that can be used for training. It is not apparent what splits would work best for this data. As such, we will not partition the dataset into a train and validation set.

We can now archive the data for distribution.

In [None]:
# Note from mahika: we don't need to run this
!zip -1qr egyptian_fruit_bats.zip annotations_filenames.csv audio