<a href="https://colab.research.google.com/github/kaindad/masters-thesis/blob/main/audio-dataset-curation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Upload the clean Mendeley data samples



The audio samples should be in `.wav` format, mono, and 1 second long. Bit rate and bit depth should not matter. Samples shorter than 1 second will be padded with 0s, and samples longer than 1 second will be truncated to 1 second. The exact name of each `.wav` file does not matter, as the files will be read, mixed with background noise, and saved to separate files with auto-generated names. The directory name does matter (it is used to determine the name of the class during neural network training).
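
The padding and truncation rule can be illustrated with a minimal sketch. The function name `fit_to_length` is hypothetical, and appending the zeros at the end of the sample is an assumption; the curation script may place the padding differently.

```python
def fit_to_length(samples, sample_rate=16000, seconds=1.0):
    """Return a copy of `samples` that is exactly `seconds` long.

    Hypothetical helper for illustration only; zeros are appended
    at the end, which is an assumption about where padding goes.
    """
    target_len = int(sample_rate * seconds)
    # Drop anything past the target length
    fitted = list(samples[:target_len])
    # Fill the remainder (if any) with zeros, i.e. silence
    fitted.extend([0] * (target_len - len(fitted)))
    return fitted

# Tiny example with a 4 Hz "sample rate" so the lists stay readable
print(fit_to_length([1, 2, 3], sample_rate=4))           # short input is padded
print(fit_to_length([1, 2, 3, 4, 5, 6], sample_rate=4))  # long input is truncated
```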

Right-click on each keyword directory and upload all of your samples. Your directory structure should look like the following:

```
/
|- chicken-data-healthy-combined-clean.wav
|- chicken-data-noise-combined-clean.wav
|- chicken-data-unhealthy-combined-clean.wav
```
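
Before running the cells that follow, it may help to confirm that the upload finished. The sketch below assumes the three file names listed above and checks the current working directory; `find_missing_files` is a hypothetical helper, not part of any library.

```python
from pathlib import Path

# File names taken from the directory listing above
EXPECTED_FILES = [
    "chicken-data-healthy-combined-clean.wav",
    "chicken-data-noise-combined-clean.wav",
    "chicken-data-unhealthy-combined-clean.wav",
]

def find_missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]

missing = find_missing_files(".")
print("Missing files:", missing if missing else "none")
```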




In [1]:
pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [None]:
import os
from pydub import AudioSegment
import math

class WavFileSplitter():
    def __init__(self, source_filename):
        self.current_directory = os.getcwd()  # Get the current directory
        self.source_filename = source_filename
        self.source_filepath = os.path.join(self.current_directory, source_filename)

        self.audio_segment = AudioSegment.from_wav(self.source_filepath)

    def _calculate_audio_duration(self):
        return self.audio_segment.duration_seconds

    def _export_audio_slice(self, start_second, end_second, output_filename, destination_directory):
        start_time = start_second * 1000  # Convert to milliseconds
        end_time = end_second * 1000  # Convert to milliseconds
        audio_slice = self.audio_segment[start_time:end_time]
        audio_slice.export(os.path.join(destination_directory, output_filename), format="wav")

    def split_audio_into_intervals(self, seconds_per_slice, output_prefix):
        destination_directory = os.path.join(self.current_directory, output_prefix)
        if not os.path.exists(destination_directory):
            os.makedirs(destination_directory)  # Create the directory if it doesn't exist

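        total_seconds = math.ceil(self._calculate_audio_duration())
        # Number slices consecutively (prefix_1.wav, prefix_2.wav, ...) regardless of slice length
        for index, start in enumerate(range(0, total_seconds, seconds_per_slice), start=1):
            slice_filename = f"{output_prefix}_{index}.wav"
            # pydub clamps a slice that runs past the end, so the final slice may be shorter
            self._export_audio_slice(start, start + seconds_per_slice, slice_filename, destination_directory)
            print(f"Exported: {slice_filename}")
        # Report completion once, after the loop, even when the duration is not a multiple of the slice length
        print('All slices exported successfully')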

# Example usage:
source_filename = "unhealthy-combined-clean.wav"
audio_splitter = WavFileSplitter(source_filename)
audio_splitter.split_audio_into_intervals(1, "unhealthy")  # Split the WAV file into 1-second slices saved as unhealthy/unhealthy_1.wav, unhealthy/unhealthy_2.wav, ...
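
Because the total duration is rounded up with `math.ceil`, the last slice can be shorter than the requested interval. The following sketch uses only the standard-library `wave` module to list any slices in the `unhealthy` directory that are not 1 second long. The helper name `wav_duration_seconds` and the 1 ms tolerance are assumptions made for illustration.

```python
import os
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

slice_directory = "unhealthy"  # Directory created by the splitter above
if os.path.isdir(slice_directory):
    for name in sorted(os.listdir(slice_directory)):
        if name.endswith(".wav"):
            duration = wav_duration_seconds(os.path.join(slice_directory, name))
            # Report slices that deviate from 1 second by more than 1 ms
            if abs(duration - 1.0) > 1e-3:
                print(f"{name}: {duration:.3f} s")
```

Short slices are not necessarily a problem, since samples shorter than 1 second are padded during curation, but a mostly silent final sample may be worth discarding.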

In [None]:
import shutil
import os

# Define the directory to be compressed and the output compressed filename
source_directory = os.path.join(os.getcwd(), "unhealthy")
output_filename = "unhealthy_compressed.zip"

# Compress the directory
shutil.make_archive(output_filename[:-4], 'zip', source_directory)

print(f"'{source_directory}' has been compressed to '{output_filename}' in the current directory.")


'/content/unhealthy' has been compressed to 'unhealthy_compressed.zip' in the current directory.


# Keyword Spotting Dataset Curation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ShawnHymel/ei-keyword-spotting/blob/master/ei-audio-dataset-curation.ipynb)

Use this tool to download the Google Speech Commands Dataset, combine it with your own keywords, mix in some background noise, and upload the curated dataset to Edge Impulse. From there, you can train a neural network to classify spoken words and upload it to a microcontroller to perform real-time keyword spotting.

 1. Upload samples of your own keyword (optional)
 2. Adjust parameters in the Settings cell (you will need an [Edge Impulse](https://www.edgeimpulse.com/) account)
 3. Run the rest of the cells! ('shift' + 'enter' on each cell)



In [None]:
### Install a pinned Node.js version (16.18.1) with the n version manager
!npm cache clean -f
!npm install -g n
!n 16.18.1

In [None]:
pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [None]:
### Install required packages and tools
!python -m pip install soundfile
!npm install -g --unsafe-perm edge-impulse-cli

In [None]:
### Settings (You probably do not need to change these)
BASE_DIR = "/content"
OUT_DIR = "keywords_curated"
GOOGLE_DATASET_FILENAME = "speech_commands_v0.02.tar.gz"
GOOGLE_DATASET_URL = "http://download.tensorflow.org/data/" + GOOGLE_DATASET_FILENAME
GOOGLE_DATASET_DIR = "google_speech_commands"
CUSTOM_KEYWORDS_FILENAME = "main.zip"
CUSTOM_KEYWORDS_URL = "https://github.com/ShawnHymel/custom-speech-commands-dataset/archive/" + CUSTOM_KEYWORDS_FILENAME
CUSTOM_KEYWORDS_DIR = "custom_keywords"
CUSTOM_KEYWORDS_REPO_NAME = "custom-speech-commands-dataset-main"
CURATION_SCRIPT = "dataset-curation.py"
CURATION_SCRIPT_URL = "https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/" + CURATION_SCRIPT
UTILS_SCRIPT_URL = "https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/utils.py"
NUM_SAMPLES = 1500    # Target number of samples to mix and send to Edge Impulse
WORD_VOL = 1.0        # Relative volume of word in output sample
BG_VOL = 0.1          # Relative volume of noise in output sample
SAMPLE_TIME = 1.0     # Time (seconds) of output sample
SAMPLE_RATE = 16000   # Sample rate (Hz) of output sample
BIT_DEPTH = "PCM_16"  # Options: [PCM_16, PCM_24, PCM_32, PCM_U8, FLOAT, DOUBLE]
BG_DIR = "_background_noise_"
TEST_RATIO = 0.2      # 20% reserved for test set, rest is for training
EI_INGEST_TEST_URL = "https://ingestion.edgeimpulse.com/api/test/data"
EI_INGEST_TRAIN_URL = "https://ingestion.edgeimpulse.com/api/training/data"
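
How `WORD_VOL` and `BG_VOL` are applied is decided by the curation script, which is not shown in this notebook. One plausible reading, offered here as an assumption rather than a description of that script, is a weighted sum of the two waveforms followed by clipping. The function `mix_with_background` is hypothetical.

```python
def mix_with_background(word, background, word_vol=1.0, bg_vol=0.1):
    """Mix two equal-length float waveforms using relative volumes.

    Illustrative assumption only: the real curation script may
    scale, normalize, or clip differently.
    """
    if len(word) != len(background):
        raise ValueError("word and background must have the same length")
    mixed = [word_vol * w + bg_vol * b for w, b in zip(word, background)]
    # Clip to [-1.0, 1.0] so the result stays a valid float waveform
    return [max(-1.0, min(1.0, sample)) for sample in mixed]

word = [0.5, -0.5, 0.9, 0.0]    # Keyword samples (floats in [-1, 1])
noise = [1.0, 1.0, 1.0, -1.0]   # Background noise samples
print(mix_with_background(word, noise))
```

With the defaults above, the background contributes one tenth of its amplitude, so the keyword stays dominant in the mixed sample.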

In [None]:
### Download Google Speech Commands Dataset
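# Use %cd (not !cd): !cd runs in a throwaway subshell and does not change the notebook's working directory
%cd {BASE_DIR}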
!wget {GOOGLE_DATASET_URL}
!mkdir {GOOGLE_DATASET_DIR}
!echo "Extracting..."
!tar xfz {GOOGLE_DATASET_FILENAME} -C {GOOGLE_DATASET_DIR}

In [None]:
### Pull out background noise directory
%cd {BASE_DIR}
!mv "{GOOGLE_DATASET_DIR}/{BG_DIR}" "{BG_DIR}"

In [None]:
### (Optional) Download custom dataset. Uncomment the code in this cell if you want to use my custom dataset

## Download, extract, and move dataset to separate directory
# %cd {BASE_DIR}
# !wget {CUSTOM_KEYWORDS_URL}
# !echo "Extracting..."
# !unzip -q {CUSTOM_KEYWORDS_FILENAME}
# !mv "{CUSTOM_KEYWORDS_REPO_NAME}/{CUSTOM_KEYWORDS_DIR}" "{CUSTOM_KEYWORDS_DIR}"

In [None]:
### User Settings (do change these)

# Location of your custom keyword samples (e.g. "/content/custom_keywords").
# Leave blank ("") for no custom keywords. Set to the CUSTOM_KEYWORDS_DIR
# variable to use samples from my custom-speech-commands-dataset repo.
CUSTOM_DATASET_PATH = ""

# Edge Impulse > your_project > Dashboard > Keys
EI_API_KEY = "ei_e544..."

# Comma-separated words. Each must match the name of a directory that contains samples.
# Recommended: use 2 keywords for microcontroller demo
TARGETS = "go, stop"
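
Since each target must match a directory name, stray spaces or misspellings in `TARGETS` are worth catching early. The sketch below shows one way to split the string into clean keyword names. How the curation script itself parses the string is not shown here, so `parse_targets` is a hypothetical helper.

```python
def parse_targets(targets):
    """Split a comma-separated string into keyword names without surrounding spaces."""
    return [word.strip() for word in targets.split(",") if word.strip()]

keywords = parse_targets("go, stop")
print(keywords)
```

Each resulting name can then be compared against the directory names in the dataset before starting the curation step.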

In [None]:
### Download curation and utils scripts
!wget {CURATION_SCRIPT_URL}
!wget {UTILS_SCRIPT_URL}

In [None]:
### Perform curation and mixing of samples with background noise
%cd {BASE_DIR}
!python {CURATION_SCRIPT} \
  -t "{TARGETS}" \
  -n {NUM_SAMPLES} \
  -w {WORD_VOL} \
  -g {BG_VOL} \
  -s {SAMPLE_TIME} \
  -r {SAMPLE_RATE} \
  -e {BIT_DEPTH} \
  -b "{BG_DIR}" \
  -o "{OUT_DIR}" \
  "{GOOGLE_DATASET_DIR}" \
  "{CUSTOM_DATASET_PATH}"

In [None]:
### Use CLI tool to send curated dataset to Edge Impulse

%cd {BASE_DIR}

# Imports
import os
import random

# Seed with system time
random.seed()

# Go through each category in our curated dataset
for dir in os.listdir(OUT_DIR):

  # Create list of files for one category
  paths = []
  for filename in os.listdir(os.path.join(OUT_DIR, dir)):
    paths.append(os.path.join(OUT_DIR, dir, filename))

  # Shuffle and divide into test and training sets
  random.shuffle(paths)
  num_test_samples = int(TEST_RATIO * len(paths))
  test_paths = paths[:num_test_samples]
  train_paths = paths[num_test_samples:]

  # Create arguments list (as a string) for CLI call
  test_paths = ['"' + s + '"' for s in test_paths]
  test_paths = ' '.join(test_paths)
  train_paths = ['"' + s + '"' for s in train_paths]
  train_paths = ' '.join(train_paths)

  # Send test files to Edge Impulse
  !edge-impulse-uploader \
    --category testing \
    --label {dir} \
    --api-key {EI_API_KEY} \
    --silent \
    {test_paths}

  # Send training files to Edge Impulse
  !edge-impulse-uploader \
    --category training \
    --label {dir} \
    --api-key {EI_API_KEY} \
    --silent \
    {train_paths}
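
The shuffle-and-split step in the loop above can also be written as a small standalone function, which makes it easier to test and, with a fixed seed, to reproduce. This is a sketch: `split_train_test` and its `seed` parameter are hypothetical additions, while the cell above seeds from system time.

```python
import random

def split_train_test(paths, test_ratio=0.2, seed=None):
    """Shuffle `paths` and return (train_paths, test_paths).

    The first int(test_ratio * len(paths)) shuffled items become the
    test set, mirroring the slicing used in the upload loop.
    """
    shuffled = list(paths)
    # A dedicated Random instance avoids touching the global random state
    random.Random(seed).shuffle(shuffled)
    num_test = int(test_ratio * len(shuffled))
    return shuffled[num_test:], shuffled[:num_test]

example_paths = [f"go_{i}.wav" for i in range(10)]
train, test = split_train_test(example_paths, test_ratio=0.2, seed=42)
print(len(train), "training files,", len(test), "test files")
```

Passing `seed=None` keeps the original behavior of a different split on every run.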