# Pre-prepare data for noise reduction

This notebook collects the raw data (clean voice recordings and noises) and organizes it in the layout required by the next step: creating the actual training data for the DNN.

## Download dataset

In [1]:
!wget https://github.com/karoldvl/ESC-50/archive/master.zip

--2021-05-26 10:22:20--  https://github.com/karoldvl/ESC-50/archive/master.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/karolpiczak/ESC-50/archive/master.zip [following]
--2021-05-26 10:22:20--  https://github.com/karolpiczak/ESC-50/archive/master.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/karolpiczak/ESC-50/zip/master [following]
--2021-05-26 10:22:20--  https://codeload.github.com/karolpiczak/ESC-50/zip/master
Resolving codeload.github.com (codeload.github.com)... 140.82.113.10
Connecting to codeload.github.com (codeload.github.com)|140.82.113.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip              [   <=>              ] 615

In [2]:
!wget https://www.openslr.org/resources/12/dev-clean.tar.gz

--2021-05-26 10:23:04--  https://www.openslr.org/resources/12/dev-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 337926286 (322M) [application/x-gzip]
Saving to: ‘dev-clean.tar.gz’


2021-05-26 10:23:17 (25.2 MB/s) - ‘dev-clean.tar.gz’ saved [337926286/337926286]



## Uncompressing datasets

In [3]:
!unzip -q master.zip -d .

In [4]:
!tar -xf dev-clean.tar.gz

## Unpacking files from LibriSpeech to `clean_voice`

In [5]:
import os
from pathlib import Path

def make_dir(path):
  # Create the directory (including missing parents) and report the outcome
  try:
    os.makedirs(path)
  except OSError:
    print("Creation of the directory %s failed" % path)
  else:
    print("Successfully created the directory %s " % path)

In [6]:
from shutil import copyfile
import os

# Create output dir
output_dir = './clean_voice'
make_dir(output_dir)

# Copy all .flac files
for path in Path('LibriSpeech').rglob('*.flac'):
  copyfile(path, Path(output_dir, path.name))

Successfully created the directory ./clean_voice 
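
Because every `.flac` file is copied under its bare file name, two files with the same name in different LibriSpeech subfolders would silently overwrite each other. A minimal sketch of a check for that, using only the standard library; `find_name_collisions` is a hypothetical helper, not part of the original notebook:

```python
from collections import Counter
from pathlib import Path

def find_name_collisions(root, pattern="*.flac"):
    """Return file names that occur more than once under root (hypothetical helper)."""
    counts = Counter(path.name for path in Path(root).rglob(pattern))
    return sorted(name for name, n in counts.items() if n > 1)

# Only run the check if the extracted dataset is present
if Path("LibriSpeech").is_dir():
    print(find_name_collisions("LibriSpeech"))
```

An empty list means that flattening the directory tree into `clean_voice` loses no files.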


In [7]:
!ls ./clean_voice | head


## Unpacking files from ESC to `noises`

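At first I uploaded `esc50.csv` from `docs/` to Google Colab manually; cloning the repository, as below, makes it available at `noise_reduction/docs/esc50.csv` without that step.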

In [9]:
!git clone https://github.com/karlosos/noise_reduction

Cloning into 'noise_reduction'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 52 (delta 8), reused 47 (delta 7), pack-reused 0
Unpacking objects: 100% (52/52), done.


In [10]:
import pandas as pd
df = pd.read_csv('noise_reduction/docs/esc50.csv')

In [11]:
df.head()

| Unnamed: 0 | filename | fold | target | category | esc10 | src_file | take |
|---|---|---|---|---|---|---|---|
| 0 | 1-100032-A-0.wav | 1 | 0 | dog | True | 100032 | A |
| 1 | 1-100038-A-14.wav | 1 | 14 | chirping_birds | False | 100038 | A |
| 2 | 1-100210-A-36.wav | 1 | 36 | vacuum_cleaner | False | 100210 | A |
| 3 | 1-100210-B-36.wav | 1 | 36 | vacuum_cleaner | False | 100210 | B |
| 4 | 1-101296-A-19.wav | 1 | 19 | thunderstorm | False | 101296 | A |


### Available sounds

In [12]:
import numpy as np
np.unique(list(df['category']))

array(['airplane', 'breathing', 'brushing_teeth', 'can_opening',
       'car_horn', 'cat', 'chainsaw', 'chirping_birds', 'church_bells',
       'clapping', 'clock_alarm', 'clock_tick', 'coughing', 'cow',
       'crackling_fire', 'crickets', 'crow', 'crying_baby', 'dog',
       'door_wood_creaks', 'door_wood_knock', 'drinking_sipping',
       'engine', 'fireworks', 'footsteps', 'frog', 'glass_breaking',
       'hand_saw', 'helicopter', 'hen', 'insects', 'keyboard_typing',
       'laughing', 'mouse_click', 'pig', 'pouring_water', 'rain',
       'rooster', 'sea_waves', 'sheep', 'siren', 'sneezing', 'snoring',
       'thunderstorm', 'toilet_flush', 'train', 'vacuum_cleaner',
       'washing_machine', 'water_drops', 'wind'], dtype='<U16')

In [13]:
chosen_noises = ['mouse_click', 'keyboard_typing', 'chirping_birds']

In [14]:
filenames = []
for category_name in chosen_noises:
  filenames += list(df.loc[df['category'] == category_name]['filename'])

print(f"Chosen {len(filenames)} files.")

Chosen 120 files.
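
The same selection can be written without the explicit loop by using `Series.isin`. A small sketch on a made-up frame (`df_demo` and its rows are illustrative, not taken from `esc50.csv`); note that it returns files in the frame's row order rather than grouped by category:

```python
import pandas as pd

# Illustrative stand-in for the ESC-50 metadata
df_demo = pd.DataFrame({
    "filename": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "category": ["dog", "mouse_click", "rain", "keyboard_typing"],
})
chosen = ["mouse_click", "keyboard_typing"]

# Keep the rows whose category is one of the chosen noises
selected = df_demo.loc[df_demo["category"].isin(chosen), "filename"].tolist()
print(f"Chosen {len(selected)} files.")
```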


### Copy chosen noises from `ESC-50-master` to `./noises`

In [15]:
# Create output dir
output_dir = './noises'
make_dir(output_dir)

# Copy the chosen .wav noise files
for name in filenames:
  copyfile(Path('./ESC-50-master/audio/', name), Path(output_dir, name))

Successfully created the directory ./noises 


## Split data into `train` and `test` folders

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
clean_voices = list(Path('./clean_voice').glob('*.*'))
noises = list(Path('./noises').glob('*.*'))
clean_voices_train, clean_voices_test = train_test_split(clean_voices, test_size=0.2, random_state=42)
noises_train, noises_test = train_test_split(noises, test_size=0.2, random_state=42)
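
A quick way to convince yourself that the split behaves as expected is to run it on placeholder paths and check the sizes and the absence of overlap. The file names below are hypothetical stand-ins for the 120 noise files; with `test_size=0.2` this should leave 96 files for training and 24 for testing:

```python
from pathlib import Path
from sklearn.model_selection import train_test_split

# Hypothetical placeholder paths, one per noise file
demo_files = [Path(f"noise_{i:03d}.wav") for i in range(120)]

demo_train, demo_test = train_test_split(demo_files, test_size=0.2, random_state=42)

# No file may end up in both subsets
overlap = set(demo_train) & set(demo_test)
print(len(demo_train), len(demo_test), len(overlap))
```

Fixing `random_state` keeps the split identical between runs, so the `train` and `test` folders stay reproducible.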

In [18]:
train_clean_voices_dir = './data/train/clean_voice/'
make_dir(train_clean_voices_dir)

for path in clean_voices_train:
  copyfile(path, Path(train_clean_voices_dir, path.name))

Successfully created the directory ./data/train/clean_voice/ 


In [19]:
test_clean_voices_dir = './data/test/clean_voice/'
make_dir(test_clean_voices_dir)

for path in clean_voices_test:
  copyfile(path, Path(test_clean_voices_dir, path.name))

Successfully created the directory ./data/test/clean_voice/ 


In [20]:
train_noises_dir = './data/train/noise/'
make_dir(train_noises_dir)

for path in noises_train:
  copyfile(path, Path(train_noises_dir, path.name))

Successfully created the directory ./data/train/noise/ 


In [21]:
test_noises_dir = './data/test/noise/'
make_dir(test_noises_dir)

for path in noises_test:
  copyfile(path, Path(test_noises_dir, path.name))

Successfully created the directory ./data/test/noise/ 
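
The four copy cells above follow the same pattern, so they could be folded into one function. A minimal sketch, assuming the standard library only; `copy_into` is a hypothetical helper name, and unlike `make_dir` it does not complain when the directory already exists:

```python
from pathlib import Path
from shutil import copyfile

def copy_into(paths, dest_dir):
    """Copy each file in paths into dest_dir, creating it if needed (hypothetical helper)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for path in paths:
        copyfile(path, dest / Path(path).name)
    return dest

# Usage with the notebook's variables (commented out so the block runs on its own):
# copy_into(clean_voices_train, './data/train/clean_voice/')
# copy_into(noises_test, './data/test/noise/')
```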


# Data preparation

In this step we create augmented audio by artificially adding noise to the clean voice recordings. A single long audio track is created for each of the clean voice, the noise, and the noisy voice (voice with noise added); a spectrogram is also computed for each track.
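
To make the idea concrete, here is a rough sketch of mixing a clean signal with noise at a chosen signal-to-noise ratio and computing a magnitude spectrogram with NumPy. It is an illustration under assumptions, not what `create_dataset.py` actually does: the synthetic signals, the SNR-based scaling, and the frame and hop sizes are all made up for the example.

```python
import numpy as np

def mix_at_snr(voice, noise, snr_db):
    """Scale noise so that the voice-to-noise power ratio equals snr_db, then add it."""
    voice_power = np.mean(voice ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(voice_power / (noise_power * 10 ** (snr_db / 10)))
    return voice + scale * noise

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Magnitude of a short-time Fourier transform with a Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Rows are frequency bins, columns are time frames
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Synthetic one-second signals standing in for real recordings
rng = np.random.default_rng(0)
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
voice = np.sin(2 * np.pi * 220 * t)
noise = rng.normal(size=t.shape)

noisy = mix_at_snr(voice, noise, snr_db=10)
spec = magnitude_spectrogram(noisy)
print(spec.shape)
```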

In [25]:
!python noise_reduction/create_dataset.py

Successfully created the directory ./data/train/timeseries/ 
Successfully created the directory ./data/train/combined_sound/ 
Successfully created the directory ./data/train/spectogram/ 
Successfully created the directory ./data/test/timeseries/ 
Successfully created the directory ./data/test/combined_sound/ 
Successfully created the directory ./data/test/spectogram/ 


## Save data to Google Drive

In [27]:
!zip -qr data.zip data/

In [23]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [28]:
# Create the target folder on Drive if it does not exist yet
!mkdir -p /content/gdrive/MyDrive/noise_reduction/
!cp data.zip /content/gdrive/MyDrive/noise_reduction/development_data.zip