# Auditory Classification of Syllabic Structures (ACSS)

Classifying diphones using Recurrent Neural Networks

## Setup

In [1]:
include("../src/Audios.jl")
include("../src/Model.jl")

using ..Audios, ..Model
using Flux, Plots, Random, StatsBase
Plots.PythonPlotBackend()
Random.seed!(1234)

TaskLocalRNG()

## Data preparation

### Import audios

Audios are located in the `sounds/` folder in WAV format. Each file corresponds to a diphone, as spoken by one of the speakers available in the Apple Text-to-Speech software. These diphones were created by [Magnuson et al. (2020)](https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.12823) for their EARSHOT model, and are openly available on the [GitHub repository](https://github.com/maglab-uconn/EARSHOT), following this [link](https://drive.google.com/file/d/1pujVHSPtwXWZiQeutFJwxdsz1mz0Lddi/view). File names are in the form `X_Y_Z.WAV`, where:

- `X` indicates the structure of the diphone (`CV` for consonant-vowel dipnes, `VC` for vowel-consonant diphones).
- `Y` indicates the orthographic form of the diphone in English, which was used to feed the Text-to-Speech software
- `Z` indicates the name of speaker used to generate the diphone.

In total, 7,190 audios are available, which correspond to 719 unique diphones, each generated by 10 different speakers (five female, five male).

In [2]:
# list WAV files
audio_path = "../sounds/raw/"
wav_files = readdir(audio_path, join = true)[contains.(lowercase.(readdir(audio_path)), ".wav")]
wav_names = basename.(wav_files)
N = length(wav_names)

# unique diphones
unique_diphones = unique(getindex.(split.(wav_names, "_"), 2));
unique_speakers = unique(getindex.(split.(wav_names, "_"), 3));

### Extract features

#### Training dataset

In [3]:
indices = 1:length(unique_diphones)
train_indices = sample(indices, 5_000)
X = make_features(wav_files; index = train_indices)
y, labels = make_targets(wav_names; index = train_indices)
train_data = zip(X, y)

┌ Info: Importing WAV files...
└ @ Main.Model c:\Users\Gonzalo\Documents\projects\diphone-classification\src\Model.jl:21
[32mProgress: 100%|█████████████████████████████████████████| Time: 0:00:02[39m[K
┌ Info: Generating spectrograms...
└ @ Main.Model c:\Users\Gonzalo\Documents\projects\diphone-classification\src\Model.jl:23
[32mProgress: 100%|█████████████████████████████████████████| Time: 0:00:01[39m[K
┌ Info: Preprocessing data...
└ @ Main.Model c:\Users\Gonzalo\Documents\projects\diphone-classification\src\Model.jl:25
[32mProgress: 100%|█████████████████████████████████████████| Time: 0:00:01[39m[K


zip(LinearAlgebra.Transpose{Float32, Matrix{Float32}}[[-2.900915 -0.8033451 … 1.4656874 0.2373703; -1.1848588 -0.32224324 … 0.34310904 0.7141051; … ; -0.6419591 -1.160352 … 0.18467367 0.33294165; -0.90498763 -12.676145 … -1.1325258 0.68805784], [1.9413339 1.8679272 … -0.14133209 -0.43237373; 2.4651058 1.9721726 … -1.3515939 0.6343603; … ; -1.0980818 -1.8162167 … -0.5514691 1.1922574; -1.07537 -2.843637 … 0.554736 0.029145826], [-2.1069002 -1.8607394 … -0.08539538 0.6992973; -1.4829526 -0.72676253 … -1.260625 1.0913529; … ; -1.0974041 -1.6183134 … -1.6887677 0.19816586; -1.3642161 -2.2249534 … 1.1705172 1.464256], [4.0061393 3.0491338 … 1.6732551 0.9387228; 4.3048925 4.7448483 … 1.0974582 -1.5076638; … ; -0.4282731 0.024752703 … 0.12192191 -0.3813027; -0.83939445 -2.6623583 … 0.37163132 0.21108212], [-2.005648 -1.3736727 … -0.560224 -1.1128218; -0.73847336 -0.70448685 … -1.6712902 -1.6800766; … ; -1.3198649 -1.3577572 … 0.29409704 0.007222439; -1.7639633 -1.7111619 … 0.17442471 -1.31883

In [4]:
prop_train = 0.85
indices = 1:length(unique_diphones)
n_train = convert(Int32, floor(length(indices) * prop_train))
train_diphones = sample(unique_diphones, n_train, replace = false)
test_diphones = unique_diphones[findall(i -> i ∉ train_diphones, unique_diphones)]

diphones = getindex.(split.(wav_names, "_"), 2)
train_idx = findall(i -> i ∈ train_diphones, diphones)
test_idx = findall(i -> i ∈ test_diphones, diphones)

println("Training dataset ($(round(prop_train, sigdigits = 2))): $(length(train_diphones)) diphones, $(length(train_idx)) audios")
println("Test dataset ($(round(1-prop_train, sigdigits = 2))): $(length(test_diphones)) diphones, $(length(test_idx)) audios")

Training dataset (0.85): 611 diphones, 6110 audios
Test dataset (0.15): 108 diphones, 1080 audios


In [5]:
X_train = make_features(wav_files, index = train_idx)
X_test = make_features(wav_files, index = test_idx)

┌ Info: Importing WAV files...
└ @ Main.Model c:\Users\Gonzalo\Documents\projects\diphone-classification\src\Model.jl:21
[32mProgress: 100%|█████████████████████████████████████████| Time: 0:00:02[39m[K
┌ Info: Generating spectrograms...
└ @ Main.Model c:\Users\Gonzalo\Documents\projects\diphone-classification\src\Model.jl:23
[32mProgress: 100%|█████████████████████████████████████████| Time: 0:00:01[39m[K
┌ Info: Preprocessing data...
└ @ Main.Model c:\Users\Gonzalo\Documents\projects\diphone-classification\src\Model.jl:25
[32mProgress: 100%|█████████████████████████████████████████| Time: 0:00:02[39m[K
┌ Info: Importing WAV files...
└ @ Main.Model c:\Users\Gonzalo\Documents\projects\diphone-classification\src\Model.jl:21
[32mProgress: 100%|█████████████████████████████████████████| Time: 0:00:00[39m[K
┌ Info: Generating spectrograms...
└ @ Main.Model c:\Users\Gonzalo\Documents\projects\diphone-classification\src\Model.jl:23
[32mProgress: 100%|████████████████████████████

1080-element Vector{LinearAlgebra.Transpose{Float32, Matrix{Float32}}}:
 [0.9646915 1.3297795 … -0.38702404 0.32790098; 1.1354706 1.1554209 … 0.21656851 -0.5294764; … ; -0.5941241 -0.6987119 … 0.26957914 0.08746557; -1.2231112 -0.9572643 … 1.2210736 -1.0890306]
 [2.2616997 3.263589 … -0.7430939 -0.29446977; 2.4100785 3.4706445 … 1.1167059 0.4489199; … ; -0.7192092 -0.68366736 … 0.3474237 1.5776587; -1.1455209 -0.97700554 … -0.05596022 1.305211]
 [3.7113996 1.2494646 … 0.562361 -1.2853897; 4.10461 3.5755372 … -1.2824167 -1.780012; … ; -0.3754916 -1.0473464 … 1.3707763 0.3758522; -1.7845855 -1.3765903 … -0.032090828 -0.5596054]
 [1.1161231 1.244833 … 0.70537734 -1.5409138; 2.4377081 1.8188325 … 0.6683324 -1.1745634; … ; -0.8191212 -0.49735883 … 1.2708838 0.88218075; -1.1817685 -0.77553195 … 1.2344519 0.20007922]
 [0.6057954 1.9065223 … -0.021743685 1.6887599; 1.5503292 2.0412514 … -0.26031935 0.9686876; … ; -0.94257265 -0.66432685 … -0.55308574 1.1084069; -1.2491556 -0.9403461 … -0.50983

In [6]:
y_train, labels_train = make_targets(wav_files, index = train_idx)
y_test, labels_test = make_targets(wav_files, index = test_idx)

(Int32[1, 1, 1, 1, 1, 1, 1, 1, 1, 1  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0], SubString{String}["CV", "CV", "CV", "CV", "CV", "CV", "CV", "CV", "CV", "CV"  …  "VC", "VC", "VC", "VC", "VC", "VC", "VC", "VC", "VC", "VC"])

## Model structure

In [27]:
args = Args(epochs = 10)

n_input = size(X[1], 1)
n_hidden = trunc(Int, n_input / 2)
n_output = length(unique(y)) - 1

# model structure
model = Chain(
	LSTM(n_input => n_hidden),
	Flux.Dropout(0.1),
	Dense(n_hidden => n_output),
	sigmoid,
)



Chain(
  Recur(
    LSTMCell(257 => 128),               [90m# 197_888 parameters[39m
  ),
  Dropout(0.1),
  Dense(128 => 0),                      [90m# 0 parameters[39m[36m  (all zero)[39m
  NNlib.σ,
) [90m        # Total: 7 trainable arrays, [39m197_888 parameters,
[90m          # plus 2 non-trainable, 256 parameters, summarysize [39m773.492 KiB.

### Initial parameters and predictions

In [9]:
params₀ = Flux.params(model)

probs = [model(i) for i in X_test]
preds = [p .>= 0.5 for p in probs]
last_pred = [X_test[end] for (x) in preds]
acc = mean(last_pred .== y_test)

probs₀, preds₀, acc₀ = make_predictions(model, X_test, y_test)

println("Pre-training accuracy: $(acc₀)")



BoundsError: BoundsError: attempt to access 0×51 BitMatrix at index [0]

## Training

┌ Info: Initialising model...
└ @ Main c:\Users\Gonzalo\Documents\projects\diphone-classification\docs\index.ipynb:11


UndefVarError: UndefVarError: `model` not defined

In [None]:
# plot training history
plot(acc_hist)
plot!(loss_hist)

### Post training parameters and predictions


In [None]:
p₁ = Flux.params(model)
probs₁, preds₁, acc₁ = make_predictions(model, X, y)
probs_test, preds_test, acc_test = make_predictions(model, X_test, y_test)