I strongly recommend going through the article [here](https://www.analyticsvidhya.com/blog/2019/07/learn-build-first-speech-to-text-model-python/) to understand the basics of signal processing before implementing the speech-to-text model.

**Understanding the Problem Statement for our Speech-to-Text Project**

Let’s understand the problem statement of our project before we move into the implementation part.

We might be on the verge of having too many screens around us. It seems like every day, new versions of common objects are “re-invented” with built-in wifi and bright touchscreens. A promising antidote to our screen addiction is voice interfaces. 

TensorFlow recently released the Speech Commands Dataset. It includes 65,000 one-second-long utterances of 30 short words, spoken by thousands of different people. We’ll build a speech recognition system that understands simple spoken commands.

You can download the dataset from [here](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge).

**Implementing the Speech-to-Text Model in Python**

The wait is over! It’s time to build our own Speech-to-Text model from scratch.

**Import the libraries**

First, import all the necessary libraries into our notebook. LibROSA and SciPy are the Python libraries used for processing audio signals.

In [1]:
import os
import librosa
import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile
import warnings

warnings.filterwarnings("ignore")

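If this cell fails with `ModuleNotFoundError: No module named 'librosa'`, the library is not installed in the notebook's environment; install it (for example with `pip install librosa`) and run the cell again.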

In [3]:
train_audio_path = '/Users/jproza/Downloads/genres'
labels = ['rock', 'blues', 'metal', 'pop']
all_wave = []
all_label = []
for label in labels:
    print(label)
    label_dir = os.path.join(train_audio_path, label)
    waves = [f for f in os.listdir(label_dir) if f.endswith('.wav')]
    print(waves)
    for wav in waves:
        # librosa.load already resamples to the requested rate, so no separate resample call is needed
        samples, sample_rate = librosa.load(os.path.join(label_dir, wav), sr=44100)
        all_wave.append(samples)
        all_label.append(label)

rock
['rock.00011.wav', 'rock.00005.wav', 'rock.00039.wav', 'rock.00038.wav', 'rock.00004.wav', 'rock.00010.wav', 'rock.00006.wav', 'rock.00012.wav', 'rock.00013.wav', 'rock.00007.wav', 'rock.00003.wav', 'rock.00017.wav', 'rock.00016.wav', 'rock.00002.wav', 'rock.00028.wav', 'rock.00014.wav', 'rock.00000.wav', 'rock.00001.wav', 'rock.00015.wav', 'rock.00029.wav', 'rock.00099.wav', 'rock.00072.wav', 'rock.00066.wav', 'rock.00067.wav', 'rock.00073.wav', 'rock.00098.wav', 'rock.00065.wav', 'rock.00071.wav', 'rock.00059.wav', 'rock.00058.wav', 'rock.00070.wav', 'rock.00064.wav', 'rock.00048.wav', 'rock.00060.wav', 'rock.00074.wav', 'rock.00075.wav', 'rock.00061.wav', 'rock.00049.wav', 'rock.00088.wav', 'rock.00077.wav', 'rock.00063.wav', 'rock.00062.wav', 'rock.00076.wav', 'rock.00089.wav', 'rock.00090.wav', 'rock.00084.wav', 'rock.00053.wav', 'rock.00047.wav', 'rock.00046.wav', 'rock.00052.wav', 'rock.00085.wav', 'rock.00091.wav', 'rock.00087.wav', 'rock.00093.wav', 'rock.00044.wav', 'roc

Convert the output labels to integer codes with `LabelEncoder`. Since this is a multi-class classification problem, these integer codes still need to be converted to one-hot vectors before they are used with a categorical cross-entropy loss:

In [264]:
# Integer-encode the genre labels (this step does not one-hot encode them yet)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(all_label)
print(y.shape)
classes = list(le.classes_)
print(y)
print(all_label)

(400,)
[3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'roc
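The one-hot step mentioned above can be sketched with plain NumPy. This is a minimal illustration rather than the notebook's own code: `one_hot` is a hypothetical helper, and the toy labels simply reuse the integer codes shown in the output (LabelEncoder assigns them in alphabetical order of the class names).

```python
import numpy as np

def one_hot(int_labels, num_classes):
    """Turn integer class ids into one-hot rows (illustrative helper, not part of the notebook)."""
    int_labels = np.asarray(int_labels)
    encoded = np.zeros((int_labels.size, num_classes), dtype=np.float32)
    # Put a 1 in the column that matches each label
    encoded[np.arange(int_labels.size), int_labels] = 1.0
    return encoded

# Toy labels using the same codes as above: blues=0, metal=1, pop=2, rock=3
toy_y = [3, 0, 1, 2]
toy_one_hot = one_hot(toy_y, num_classes=4)
print(toy_one_hot)
```

Each row has exactly one non-zero entry, in the column of its class.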

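Conv1D expects a 3D input of shape (samples, time steps, channels). Converting the list to an array is the first step, but note the shape printed below: `(400,)` means NumPy stored 400 clips of varying length as separate objects instead of building a 2D matrix, so the clips must be brought to a common length before they can be reshaped to 3D: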

In [261]:
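# dtype=object keeps clips of different lengths as separate arrays (recent NumPy versions require it for ragged input)
all_wave = np.array(all_wave, dtype=object)
print(all_wave.shape)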

(400,)
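The `(400,)` shape suggests the clips differ in length. One common way to deal with that, sketched here under assumptions, is to trim or zero-pad every clip to a fixed number of samples and only then reshape to 3D. `pad_or_trim` is a hypothetical helper, and the toy clips and `target_len = 6` are made up for illustration.

```python
import numpy as np

def pad_or_trim(clip, target_len):
    """Return `clip` cut or zero-padded to exactly `target_len` samples (hypothetical helper)."""
    clip = np.asarray(clip, dtype=np.float32)
    if clip.size >= target_len:
        return clip[:target_len]
    return np.pad(clip, (0, target_len - clip.size))

# Three toy "clips" of different lengths stand in for the real waveforms
toy_waves = [np.ones(5), np.ones(3), np.ones(8)]
target_len = 6  # assumed value, for illustration only

fixed = np.stack([pad_or_trim(w, target_len) for w in toy_waves])  # 2D: (clips, samples)
batch_3d = fixed.reshape(-1, target_len, 1)                        # 3D: (clips, samples, channels)
print(fixed.shape, batch_3d.shape)
```

Whether trimming, padding, or extracting features first is the better choice depends on the data; this only shows the mechanics of getting from ragged clips to a 3D array.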


**Split into train and validation set**

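Next, we split the data into a training set and a validation set. Note that the call below uses `test_size = 0.5`, so half of the 400 clips are held out for validation; set `test_size = 0.2` for an 80/20 split: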


In [281]:
from sklearn.model_selection import train_test_split

# Show the integer-encoded labels before splitting
print(np.array(y))

# Hold out half of the clips for validation; the fixed random_state makes the split reproducible
x_tr, x_val, y_tr, y_val = train_test_split(all_wave, np.array(y), test_size=0.5, random_state=777, shuffle=True)

print(x_tr)
print(np.array(y_tr))

[3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
[array([-0.0560428 , -0.04250243,  0.00845611, ..., -0.01611531,
       -0.00319551,  0.00383932], dtype=float32)
 array([-0.044938  , -0.03814394, -0.00990358, ...,  0.47097152,
        0

**Model Architecture for this problem**

We will build the speech-to-text model using Conv1D layers. Conv1D is a convolutional layer that performs the convolution along only one dimension, which for audio is the time axis.
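To make "convolution along one dimension" concrete, here is a small NumPy sketch of what a single filter does with stride 1 and no padding. It is a simplified illustration (one channel, one filter, no bias or activation), and the signal and kernel values are arbitrary. Like Keras, it slides the kernel without flipping it.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` with stride 1 and no padding ('valid')."""
    signal = np.asarray(signal, dtype=np.float32)
    kernel = np.asarray(kernel, dtype=np.float32)
    out_len = signal.size - kernel.size + 1
    # Each output value is the dot product of the kernel with one window of the signal
    return np.array([np.dot(signal[i:i + kernel.size], kernel) for i in range(out_len)])

toy_signal = [0, 1, 2, 3, 4, 5]
toy_kernel = [1, 0, -1]  # arbitrary example weights
features = conv1d_valid(toy_signal, toy_kernel)
print(features)  # six inputs and a width-3 kernel give four outputs
```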

**Model building**

Let us implement the model using the Keras functional API.

In [282]:
from keras.layers import Dense, Dropout, Flatten, Conv1D, Input, MaxPooling1D
from keras.models import Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import backend as K
K.clear_session()

# Each input is a sequence of 200 time steps with a single channel
inputs = Input(shape=(200, 1))

#First Conv1D layer
conv = Conv1D(8,13, padding='valid', activation='relu', strides=1)(inputs)
conv = MaxPooling1D(3)(conv)
conv = Dropout(0.3)(conv)

#Second Conv1D layer
conv = Conv1D(16, 11, padding='valid', activation='relu', strides=1)(conv)
conv = MaxPooling1D(3)(conv)
conv = Dropout(0.3)(conv)

#Third Conv1D layer
conv = Conv1D(32, 9, padding='valid', activation='relu', strides=1)(conv)
conv = MaxPooling1D(3)(conv)
conv = Dropout(0.3)(conv)

#Fourth Conv1D layer
#conv = Conv1D(64, 7, padding='valid', activation='relu', strides=1)(conv)
#conv = MaxPooling1D(3)(conv)
#conv = Dropout(0.3)(conv)

#Flatten layer
conv = Flatten()(conv)

#Dense Layer 1
conv = Dense(256, activation='relu')(conv)
conv = Dropout(0.3)(conv)

#Dense Layer 2
conv = Dense(200, activation='relu')(conv)
conv = Dropout(0.3)(conv)

# One softmax unit per class
outputs = Dense(len(classes), activation='softmax')(conv)

model = Model(inputs, outputs)
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 200, 1)            0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 188, 8)            112       
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 62, 8)             0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 62, 8)             0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 52, 16)            1424      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 17, 16)            0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 17, 16)            0   
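The output shapes and parameter counts in the summary can be checked by hand. The sketch below applies the standard formulas for a stride-1 `'valid'` Conv1D and for MaxPooling1D with its default stride; the helper names are made up for this example. The first four lengths and both parameter counts match the rows shown above, while the last two lengths are derived from the formulas because the summary is cut off before the third block.

```python
def conv_out_len(length, kernel_size):
    """Output length of a stride-1 Conv1D with padding='valid'."""
    return length - kernel_size + 1

def pool_out_len(length, pool_size):
    """Output length of MaxPooling1D with its default stride (equal to pool_size)."""
    return length // pool_size

def conv_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

length = 200
shape_trace = []
for kernel_size in (13, 11, 9):        # kernel sizes of the three Conv1D layers
    length = conv_out_len(length, kernel_size)
    shape_trace.append(length)
    length = pool_out_len(length, 3)   # each Conv1D is followed by MaxPooling1D(3)
    shape_trace.append(length)

print(shape_trace)
print(conv_params(13, 1, 8), conv_params(11, 8, 16))
```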

Define the loss function to be categorical cross-entropy, since this is a multi-class classification problem (this loss expects one-hot encoded labels):

In [283]:
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
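For intuition, categorical cross-entropy can be computed by hand on a tiny batch. This NumPy sketch is a simplified stand-in for what the framework computes, and the probabilities are made up. It shows that confident, correct predictions give a lower loss than a uniform guess over four classes.

```python
import numpy as np

def categorical_crossentropy(y_true_one_hot, y_pred_probs, eps=1e-7):
    """Mean over the batch of -sum(y_true * log(y_pred))."""
    y_pred_probs = np.clip(y_pred_probs, eps, 1.0)  # avoid log(0)
    per_sample = -np.sum(y_true_one_hot * np.log(y_pred_probs), axis=1)
    return per_sample.mean()

# Two samples, four classes; the true classes are index 3 and index 0
y_true = np.array([[0, 0, 0, 1], [1, 0, 0, 0]], dtype=np.float32)
confident = np.array([[0.05, 0.05, 0.05, 0.85], [0.90, 0.04, 0.03, 0.03]])
unsure = np.full((2, 4), 0.25)

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The uniform guess yields a loss of log(4), which is a useful baseline when reading training logs for a four-class problem.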

Early stopping and model checkpointing are callbacks that stop training once the validation loss stops improving and save the best model seen so far:

In [284]:
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10, min_delta=0.0001)
# Newer Keras versions report this metric as 'val_accuracy' instead of 'val_acc'
mc = ModelCheckpoint('best_model.hdf5', monitor='val_acc', verbose=1, save_best_only=True, mode='max')
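The `patience` and `min_delta` arguments are easier to reason about with a small simulation. The function below is a simplified sketch of the early-stopping rule, not Keras's actual implementation: a validation loss counts as an improvement only if it beats the best value by more than `min_delta`, and training stops after `patience` epochs in a row without improvement. The loss values are invented.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:   # counts as an improvement
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

toy_val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.70, 0.70]
stop_at = early_stopping_epoch(toy_val_losses, patience=3)
print(stop_at)
```

The brief improvement at epoch 5 resets the counter, so the run continues for three more epochs before stopping.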

Let us train the model with a batch size of 10 and evaluate the performance on the holdout set:

**Diagnostic plot**

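I’m going to lean on visualization again to understand the performance of the model over the training epochs. The cell below runs the training and keeps the per-epoch metrics in `history`, which is the data such a plot is drawn from: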

In [289]:
from keras.utils import to_categorical

# Stack the clips into a 3D array of shape (clips, 200, 1); this requires every clip to hold exactly 200 samples
x_tr_3d = np.stack(x_tr).reshape(-1, 200, 1)
x_val_3d = np.stack(x_val).reshape(-1, 200, 1)

# One-hot encode the labels to the width of the model's softmax output
num_outputs = model.output_shape[-1]
y_tr_1h = to_categorical(y_tr, num_classes=num_outputs)
y_val_1h = to_categorical(y_val, num_classes=num_outputs)

# Pass the validation set and the callbacks so that val_loss is available for early stopping
history = model.fit(x_tr_3d, y_tr_1h, epochs=100, batch_size=10, callbacks=[es, mc], validation_data=(x_val_3d, y_val_1h))

[2 1 1 0 1 2 1 3 0 3 2 0 0 3 3 1 3 1 0 1 3 1 1 2 0 2 3 3 0 0 1 2 3 0 0 2 2
 3 3 0 0 3 2 2 3 1 1 2 2 0 1 1 3 3 3 3 0 1 2 3 2 0 1 1 3 3 1 3 1 3 3 2 3 1
 0 1 0 1 2 2 3 2 2 0 1 2 2 1 0 2 0 2 2 0 1 0 2 3 2 2 2 2 3 3 2 2 0 3 1 2 3
 0 3 3 3 3 2 0 1 0 3 0 3 0 3 1 0 1 1 3 2 0 2 1 0 2 3 2 0 3 3 2 1 0 1 1 1 1
 3 1 3 3 3 3 0 1 0 1 2 3 0 0 2 2 0 1 3 1 2 2 2 2 2 2 3 2 3 2 0 3 3 1 2 2 2
 2 2 3 0 1 1 2 0 0 0 3 3 3 2 0]


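ValueError: all input arrays must have the same shape

This is the error `np.stack` raises when its inputs differ in shape. Here it most likely points to the clips having different lengths, so they need to be trimmed or padded to a common length before training.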

In [219]:
target_dir = '/Users/jproza/Downloads/genres'
# Create the target directory (and any missing parents) if it does not exist yet
os.makedirs(target_dir, exist_ok=True)
model.save(os.path.join(target_dir, 'modelo.h5'))
model.save_weights(os.path.join(target_dir, 'pesos.h5'))

**Loading the best model**

In [None]:
from keras.models import load_model
model=load_model('/Users/jproza/Downloads/genres/modelo.h5')
model.load_weights('/Users/jproza/Downloads/genres/pesos.h5')

In [None]:
def predict2(audio):
    # Shape the clip as (1, time steps, 1): one sample with one channel.
    # The clip length must match the input length the model was built with.
    audio = np.array(audio, dtype=np.float32).reshape(1, -1, 1)
    print(audio.shape)
    probs = model.predict(audio)[0]
    answer = np.argmax(probs)
    return classes[answer]

Define the function that predicts the label for the given audio:

In [None]:
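def predict(audio):
    audio = np.array(audio, dtype=np.float32)
    # Conv1D expects (batch, time steps, channels); the clip length must match the model's input length
    prob = model.predict(audio.reshape(1, -1, 1))
    index = np.argmax(prob[0])
    print(index)
    return classes[index]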
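The last step of the function above, mapping the highest-probability index back to a class name, can be tried in isolation. In this sketch the probabilities are made up, and `genre_names` is an illustrative stand-in for `classes`, listed in the alphabetical order that LabelEncoder assigns.

```python
import numpy as np

genre_names = ['blues', 'metal', 'pop', 'rock']   # stand-in for `classes`
toy_probs = np.array([[0.10, 0.15, 0.05, 0.70]])  # made-up softmax output for one clip

# argmax picks the most probable class; the index is then looked up in the name list
predicted_index = int(np.argmax(toy_probs[0]))
predicted_label = genre_names[predicted_index]
print(predicted_index, predicted_label)
```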

Prediction time! Load a sample clip and listen to it before making a prediction:

In [None]:
sr = 44100
# librosa.load resamples to the requested rate, so no extra resample step is needed
samples, sample_rate = librosa.load('/Users/jproza/Enjoy the Silence/blues.00001.wav', sr=sr)
ipd.Audio(samples, rate=sr)

In [None]:
print("Text:",predict2(samples))