# Final Project - LSTM

**Course**: ANLY590

**Group Members**: Yuan Liu, Guiming Xu, Kuiyu Zhu, Yuxuan Yao

**NetID**: yl1130, gx26, kz175, yy560

**Dataset**: https://www.kaggle.com/cse0031/speech-representation-and-data-exploration

The dataset contains information files and a folder of audio files. The labels to predict at test time are Yes, No, Up, Down, Left, Right, On, Off, Stop, and Go; everything else should be classified as either unknown or silence.
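One way to express this labeling rule is a small mapping from a folder name to a target class. The sketch below is illustrative only: the names `TARGET_WORDS`, `SILENCE_FOLDER`, and `to_target_label` are hypothetical, and treating a `_background_noise_` folder as the source of silence clips is an assumption that the rest of this notebook does not rely on.

```python
# Ten target words named in the task description.
TARGET_WORDS = {"yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"}

# Assumption: silence clips come from a folder with this name.
SILENCE_FOLDER = "_background_noise_"


def to_target_label(folder_name: str) -> str:
    """Map a folder name to one of the target words, 'silence', or 'unknown'."""
    name = folder_name.lower()
    if name in TARGET_WORDS:
        return name
    if name == SILENCE_FOLDER:
        return "silence"
    return "unknown"


print(to_target_label("Yes"))    # yes
print(to_target_label("house"))  # unknown
```

Note that the model trained in this notebook treats every word folder as its own class instead of collapsing the extra words into a single unknown class.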

This notebook trains an LSTM model and uses it for simple prediction and recognition of audio files.

### Import Packages

In [2]:
import os
import librosa   #for audio processing
import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile #for audio processing
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import LabelEncoder
from keras.layers import Dense, Dropout, Flatten, Conv1D, Conv2D, MaxPooling2D, Input, MaxPooling1D, Activation, BatchNormalization, Reshape, TimeDistributed
from keras.models import Model, Sequential
from keras.layers import LSTM  # keras.layers.recurrent is not available in newer Keras releases
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import backend as K
from matplotlib import pyplot 
import random
import soundfile as sf

In [3]:
train_audio_path = 'D:/590/final_project/train/audio/'
samples, sample_rate = librosa.load(train_audio_path+'yes/0a7c2a8d_nohash_0.wav', sr = 16000)

In [4]:
samples

array([ 0.00042725, -0.00021362, -0.00042725, ...,  0.00057983,
        0.00061035,  0.00082397], dtype=float32)

In [5]:
sample_rate

16000

In [6]:
ipd.Audio(samples, rate=sample_rate)
print(sample_rate)

16000


As we can see, the audio file is loaded as a 1D array, and its sample rate is 16000 Hz.

To reduce computing time, we downsample the audio to 8000 Hz.
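To make the saving concrete: halving the sample rate halves the number of values in a one-second clip. The sketch below uses a synthetic sine wave and a crude average-then-decimate step purely for illustration. It is not the resampling algorithm that `librosa` uses, and the helper name `downsample_by_two` is hypothetical.

```python
import numpy as np


def downsample_by_two(signal: np.ndarray) -> np.ndarray:
    """Average each pair of neighbouring samples (a crude low-pass), keeping one value per pair."""
    usable = len(signal) - (len(signal) % 2)  # drop a trailing odd sample, if any
    return signal[:usable].reshape(-1, 2).mean(axis=1)


# Synthetic one-second 440 Hz tone at 16000 Hz (a stand-in for a real clip).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)

half = downsample_by_two(tone)
print(len(tone), len(half))  # 16000 8000
```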

In [7]:
# Keyword arguments work across librosa versions (newer releases require them)
samples = librosa.resample(samples, orig_sr=sample_rate, target_sr=8000)
ipd.Audio(samples, rate=8000)

Listening to the audio file, we can hear "yes" clearly.

**Due to GitHub's restrictions on dataset size, we could not upload our dataset, so this audio cannot be played here.**

In [8]:
labels=os.listdir(train_audio_path)

# Count the .wav files in each label folder
temp=[]
for label in labels:
    waves = [f for f in os.listdir(train_audio_path + '/'+ label) if f.endswith('.wav')]
    temp.append(len(waves))

# Use a fixed list of 26 word labels for the rest of the notebook
labels=["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "wow", "house", "happy", "tree", "dog", "cat", "bird",
        "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

In [9]:
# Collect the duration (in seconds) of every clip
temp=[]
for label in labels:
    waves = [f for f in os.listdir(train_audio_path + '/'+ label) if f.endswith('.wav')]
    for wav in waves:
        sample_rate, samples = wavfile.read(train_audio_path + '/' + label + '/' + wav)
        temp.append(float(len(samples)/sample_rate))

Before modeling, we downsample all audio files to 8000 Hz and keep only the clips that are exactly one second (8000 samples) long.

In [10]:
train_audio_path = 'D:/590/final_project/train/audio/'
audios = []
allLabel = []
for label in labels:
    print(label)
    waves = [f for f in os.listdir(train_audio_path + '/'+ label) if f.endswith('.wav')]
    for wav in waves:
        samples, sample_rate = librosa.load(train_audio_path + '/' + label + '/' + wav, sr = 16000)
        samples = librosa.resample(samples, orig_sr=sample_rate, target_sr=8000)
        # Keep only clips that are exactly one second long at 8000 Hz
        if len(samples) == 8000:
            audios.append(samples)
            allLabel.append(label)

yes
no
up
down
left
right
on
off
stop
go
wow
house
happy
tree
dog
cat
bird
one
two
three
four
five
six
seven
eight
nine
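The loop above keeps only clips that contain exactly 8000 samples after resampling, that is, clips that last one second. The following sketch illustrates how clip length and sample rate interact with that filter. The function name and the example clip lengths are hypothetical, and the rounding rule (length scaled by the rate ratio, rounded up) is an assumption about how the resampler sizes its output.

```python
import math


def resampled_length(n_samples: int, orig_sr: int = 16000, target_sr: int = 8000) -> int:
    """Approximate length after resampling (assumes length scales with the rate ratio, rounded up)."""
    return math.ceil(n_samples * target_sr / orig_sr)


# Hypothetical clip lengths at 16000 Hz: one full second and two shorter recordings.
clip_lengths = [16000, 14861, 11146]

kept = [n for n in clip_lengths if resampled_length(n) == 8000]
print(kept)  # [16000]
```

Shorter recordings are dropped rather than padded, so the training set is smaller than the number of files on disk.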


Encode labels

In [11]:
le = LabelEncoder()
y=le.fit_transform(allLabel)
classes= list(le.classes_)
from keras.utils import to_categorical  # np_utils is not available in recent Keras releases
y=to_categorical(y, num_classes=len(labels))
audios = np.array(audios).reshape(-1,8000,1)
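For readers unfamiliar with the encoding step, the sketch below reproduces the idea with plain NumPy on a toy label list: labels are mapped to integer ids in sorted order (as `LabelEncoder` does) and then expanded into one-hot rows. The toy labels and the helper name `one_hot_encode` are hypothetical.

```python
import numpy as np


def one_hot_encode(labels):
    """Return (sorted class names, one-hot matrix) for a list of string labels."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    ids = np.array([index[name] for name in labels])
    one_hot = np.eye(len(classes), dtype=np.float32)[ids]
    return classes, one_hot


toy_classes, y_toy = one_hot_encode(["yes", "no", "yes", "up"])
print(toy_classes)  # ['no', 'up', 'yes']
print(y_toy.shape)  # (4, 3)
```

Because the class order is alphabetical, index `i` of a prediction corresponds to `classes[i]`, not to the order of the `labels` list.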

### Modeling

Split the data into training and validation sets

In [12]:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(np.array(audios),np.array(y),stratify=y,test_size = 0.2,random_state=777,shuffle=True)
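The `stratify=y` argument keeps the class proportions the same in both subsets. The sketch below illustrates the idea with a simplified per-class split on toy data. It is not scikit-learn's implementation, and the helper name `stratified_split_indices` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split_indices(labels, test_size=0.2, seed=777):
    """Split indices so each class contributes about `test_size` of its items to the validation set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Toy, imbalanced label list: 50 "yes" clips and 10 "no" clips.
toy_labels = ["yes"] * 50 + ["no"] * 10
train_idx, val_idx = stratified_split_indices(toy_labels)

val_counts = Counter(toy_labels[i] for i in val_idx)
print(val_counts["yes"], val_counts["no"])  # 10 2
```

Both classes give up 20% of their clips to the validation set, so the 5:1 imbalance is preserved on each side of the split.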

In [13]:
# Build the network with the functional API (no Sequential container is needed)
inputs = Input(shape=(8000,1))

x = Conv1D(8,13, padding='same', activation='relu', strides=1)(inputs)
x = MaxPooling1D(strides =1)(x)
x = BatchNormalization()(x)

x = Conv1D(16,11, padding='same', activation='relu', strides=1)(x)
x = BatchNormalization()(x)

x = Conv1D(32,9, padding='same', activation='relu', strides=1)(x)
x = MaxPooling1D(strides =1)(x)
x = BatchNormalization()(x)

# x = Conv1D(64,7, padding='same', activation='relu', strides=1)(x)
# x = BatchNormalization()(x)

#x = Reshape(12, 10)(x)

x = LSTM(32, return_sequences = True)(x)
x = LSTM(32, return_sequences = True)(x)

x = Dense(128)(x)
x = TimeDistributed(Dense(64))(x)
x = Activation('relu')(x)
x = Flatten()(x)
x = Dense(len(labels))(x)
outputs = Activation('softmax')(x)

# `learning_rate` replaces the deprecated `lr` argument; decay=0.0 was the default and is omitted
adam = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model  = Model(inputs, outputs)
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 8000, 1)]         0         
_________________________________________________________________
conv1d (Conv1D)              (None, 8000, 8)           112       
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 7999, 8)           0         
_________________________________________________________________
batch_normalization (BatchNo (None, 7999, 8)           32        
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 7999, 16)          1424      
_________________________________________________________________
batch_normalization_1 (Batch (None, 7999, 16)          64        
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 7999, 32)         
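The shapes and parameter counts shown in the summary can be checked by hand. The sketch below recomputes only the complete rows shown above. The helper names are hypothetical, and the formulas assume Keras defaults: one bias per filter, `pool_size=2` with `'valid'` padding, and four parameters per channel in batch normalization.

```python
def conv1d_params(kernel_size: int, in_channels: int, filters: int) -> int:
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def pooled_length(length: int, pool_size: int = 2, stride: int = 1) -> int:
    """Output length of a 'valid' pooling layer."""
    return (length - pool_size) // stride + 1


def batchnorm_params(channels: int) -> int:
    """gamma, beta, moving mean and moving variance for each channel."""
    return 4 * channels


print(conv1d_params(13, 1, 8))   # 112  (conv1d)
print(pooled_length(8000))       # 7999 (max_pooling1d)
print(batchnorm_params(8))       # 32   (batch_normalization)
print(conv1d_params(11, 8, 16))  # 1424 (conv1d_1)
print(batchnorm_params(16))      # 64   (batch_normalization_1)
```

Because the pooling layers use a stride of 1, each one shortens the sequence by only one step, so the LSTM layers still process nearly 8000 time steps per clip.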

In [15]:
model.compile(loss="categorical_crossentropy", optimizer = adam, metrics = ['accuracy'])

In [16]:
#es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10, min_delta=0.0001) 
#mc = ModelCheckpoint('best_model.hdf5', monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
history = model.fit(x_train, y_train ,epochs=1, batch_size=32, validation_data=(x_val,y_val))



### Results

In [17]:
# plt.plot(history.history['accuracy'])
# plt.plot(history.history['val_accuracy'])
# plt.title('LSTM model accuracy')
# plt.ylabel('accuracy')
# plt.xlabel('epoch')
# plt.legend(['train', 'test'], loc='lower right')
# plt.show()

# plt.plot(history.history['loss'])
# plt.plot(history.history['val_loss'])
# plt.title('LSTM model loss')
# plt.ylabel('loss')
# plt.xlabel('epoch')
# plt.legend(['train', 'test'], loc='upper right')
# plt.show()
score = model.evaluate(x_val, y_val, verbose=1)
print("The test accuracy: {acc}".format(acc = score[1]))

The test accuracy: 0.42608442902565
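To turn the model's softmax output into a word, the index of the largest probability is looked up in the encoder's class list, and accuracy is the fraction of clips where that index matches the target. The sketch below shows both steps on made-up values. `decode_prediction`, `accuracy`, the three-class list, and the probabilities are hypothetical; in this notebook the probabilities would come from `model.predict` and the names from `classes`.

```python
import numpy as np


def decode_prediction(probabilities: np.ndarray, class_names: list) -> str:
    """Return the class name with the highest predicted probability."""
    return class_names[int(np.argmax(probabilities))]


def accuracy(pred_probs: np.ndarray, one_hot_targets: np.ndarray) -> float:
    """Fraction of rows where the predicted class matches the target class."""
    predicted = np.argmax(pred_probs, axis=1)
    actual = np.argmax(one_hot_targets, axis=1)
    return float(np.mean(predicted == actual))


toy_classes = ["no", "up", "yes"]
toy_probs = np.array([[0.1, 0.2, 0.7],
                      [0.6, 0.3, 0.1]])
toy_targets = np.array([[0, 0, 1],
                        [0, 1, 0]])

print(decode_prediction(toy_probs[0], toy_classes))  # yes
print(accuracy(toy_probs, toy_targets))              # 0.5
```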
