An introduction to Speech Signal Processing in Python:


# 1. Audio Datasets
## 1.1. [ESC-50: Dataset for Environmental Sound Classification](https://github.com/karolpiczak/ESC-50)
### 1.1 `Notebook`
[Simple training tutorial](https://colab.research.google.com/github/fastaudio/fastaudio/blob/master/docs/ESC50:%20Environmental%20Sound%20Classification.ipynb)

## 1.2. https://www2.cs.uic.edu/~i101/SoundFiles/

## 1.3. https://urbansounddataset.weebly.com/download-urbansound8k.html

## 1.4 https://www.epidemicsound.com/sound-effects/body/?_us=adwords&_usx=11367246386_&gclid=CjwKCAjw49qKBhAoEiwAHQVTo9Vr6EwFqyz1nFx9dpnddQTWMdNd4DVVlMndtOJun4LCowkcZnBKaRoCB5kQAvD_BwE



Dataset
http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/8kHz_16bit/

## LibROSA
LibROSA is a Python library for processing audio data.
LibROSA can be used to:
1. Extract acoustic features
2. Create speech spectrograms
3. Compute the Fourier transform of speech signals

A spectrogram makes the structure of a speech signal visible. It displays time on the horizontal axis, frequency on the vertical axis, and amplitude (i.e., the amount of sound energy) as either darkness or coloration, so it can be read as a heat map of the signal strength, or “loudness”, at each frequency over time.
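To make the axes of a spectrogram concrete, here is a minimal NumPy-only sketch. The helper name `spectrogram_db`, the frame size of 512 samples, the hop of 128 samples and the 1 kHz test tone are all assumptions chosen for illustration; this is not how LibROSA implements it internally.

```python
import numpy as np

def spectrogram_db(signal, n_fft=512, hop=128):
    """Return a (frequency bins x frames) magnitude spectrogram in decibels."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Cut the signal into overlapping frames and taper each one
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T  # rows: frequency, columns: time
    return 20 * np.log10(magnitude + 1e-10)            # small offset avoids log(0)

spec_sr = 8000                                          # hypothetical sampling rate
spec_t = np.arange(spec_sr) / spec_sr                   # 1 second of time stamps
spec_tone = np.sin(2 * np.pi * 1000 * spec_t)           # 1 kHz test tone
spec_db = spectrogram_db(spec_tone)
spec_freqs = np.fft.rfftfreq(512, d=1 / spec_sr)
loudest_hz = spec_freqs[spec_db.mean(axis=1).argmax()]  # row with the most energy
print(spec_db.shape, loudest_hz)
```

Each column of `spec_db` is one short time slice and each row is one frequency bin, so plotting the array as an image gives the heat map described above.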


## Installing Librosa

In [None]:
pip install librosa



Using librosa you can load .wav files and audio stored in several other formats. To read an audio file, you just need to pass its file path to the librosa.load() function.
This function returns:
1. An array of amplitudes.
2. The sampling rate, i.e. the sampling frequency of the returned array.

When loading an audio file, please note:

1. By default, librosa.load() resamples the audio to 22050 Hz.
2. If the argument sr=None, the function loads your audio file at its original sampling rate.
3. You can specify a custom sampling rate as per your requirement; the function will upsample or downsample the signal for you.

In [None]:
!wget -nc https://www2.cs.uic.edu/~i101/SoundFiles/StarWars3.wav

In [None]:
import librosa
x, sr = librosa.load('StarWars3.wav', sr=None)
# x is a NumPy array of amplitudes.
# sr is the sampling rate; with sr=None it is the file's original rate (without it, librosa resamples to 22050 Hz)
print(x)
print(sr)
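The relationship between the amplitude array, the sampling rate and the duration of a clip can be checked without librosa. The sketch below uses only the standard-library `wave` module and NumPy; the file name `tone_demo.wav` and the 440 Hz tone are hypothetical choices made for the example.

```python
import wave
import numpy as np

wav_sr = 16000                                          # hypothetical sampling rate
wav_t = np.arange(int(0.5 * wav_sr)) / wav_sr           # half a second
wav_tone = 0.5 * np.sin(2 * np.pi * 440 * wav_t)
pcm = (wav_tone * 32767).astype('<i2')                  # 16-bit little-endian PCM

# Write a mono 16-bit WAV file
with wave.open("tone_demo.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(wav_sr)
    f.writeframes(pcm.tobytes())

# Read it back: we get raw samples plus the file's own sampling rate
with wave.open("tone_demo.wav", "rb") as f:
    native_sr = f.getframerate()
    raw = np.frombuffer(f.readframes(f.getnframes()), dtype='<i2')

amplitudes = raw.astype(np.float32) / 32768.0           # scale to [-1, 1)
duration_s = len(amplitudes) / native_sr                # samples / (samples per second)
print(native_sr, len(amplitudes), duration_s)
```

The duration in seconds is always the number of samples divided by the sampling rate, which is why resampling a clip changes the length of the array but not the length of the sound.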

## Visualizing Audio:

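You can plot the audio array using **librosa.display.waveshow** (named `waveplot` in older librosa releases):

In [None]:

#%matplotlib inline
import matplotlib.pyplot as plt
import librosa.display
plt.figure(figsize=(14, 5))
# waveshow replaces waveplot, which newer librosa releases no longer provide
librosa.display.waveshow(x, sr=sr)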

In [None]:
x, sr = librosa.load('StarWars3.wav', sr=44000)
# Passing sr explicitly resamples the audio on load, here to 44 kHz (44000 Hz).
# x is the resampled digitized audio signal and sr is its sample rate.

print(x)
print(sr)
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)  # waveshow replaces waveplot in newer librosa releases


## Playing an audio signal


In [None]:
import IPython.display as ipd
ipd.Audio('StarWars3.wav')


In [None]:
import numpy as np
sr = 22050
# choose the sample rate e.g., 44000
T = 5.0    # seconds
t = np.linspace(0, T, int(T*sr), endpoint=False)
x = 0.5*np.sin(2*np.pi*220*t)


#playing generated audio
ipd.Audio(x, rate=sr) # load a NumPy array

import soundfile as sf
sf.write('example.wav', x, sr)

plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)  # waveshow replaces waveplot in newer librosa releases

# Pre-processing methods for audio signals

## Normalization
Using normalization techniques we can adjust the volume of audio signals to a standard set level.

In [None]:
!pip install scikit-learn
!sudo apt-get install build-essential swig
!pip install auto-sklearn==0.11.1

In [None]:
import sklearn.preprocessing
# minmax_scale maps the signal linearly so that its minimum becomes 0 and its maximum becomes 1
def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)

# Plotting the normalized signal (red) over the original waveform
librosa.display.waveshow(x, sr=sr, alpha=0.4)
librosa.display.waveshow(normalize(x), sr=sr, alpha=0.2, color='r')

#plt.plot(normalize(x), color='r')
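As a rough illustration of what such a normalization does to the sample values, the sketch below implements min-max scaling and peak normalization directly in NumPy. The function names and the four-sample `norm_demo` array are hypothetical; peak normalization is included only as a common alternative for adjusting volume and is an addition to the cell above.

```python
import numpy as np

def minmax_normalize(signal):
    """Map the signal linearly onto [0, 1]: (x - min) / (max - min)."""
    signal = np.asarray(signal, dtype=float)
    span = signal.max() - signal.min()
    if span == 0:                       # constant signal: avoid dividing by zero
        return np.zeros_like(signal)
    return (signal - signal.min()) / span

def peak_normalize(signal, peak=1.0):
    """Scale the signal so that its largest absolute value equals `peak`."""
    signal = np.asarray(signal, dtype=float)
    largest = np.abs(signal).max()
    return signal if largest == 0 else signal * (peak / largest)

norm_demo = np.array([-0.2, 0.0, 0.1, 0.4])
print(minmax_normalize(norm_demo))
print(peak_normalize(norm_demo))
```

Min-max scaling shifts the waveform so it is no longer centered on zero, which is fine for plotting but changes the signal; peak normalization keeps the waveform centered and only changes its level.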

## Pre-emphasis
Pre-emphasis is a filtering step that is usually applied before extracting features. It boosts the signal’s high-frequency components relative to its low-frequency components, which compensates for the natural drop in energy at high frequencies in speech.

In [None]:
!wget -nc https://www.soundsnap.com/male_auctioneer_pa_voice_introducing_and_selling_sgi_server


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import librosa.display
y, sr = librosa.load('male_auctioneer_pa_voice_introducing_and_selling_sgi_server', offset=30, duration=1)
y_filt = librosa.effects.preemphasis(y)
# and plot the results for comparison
S_orig = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
S_preemph = librosa.amplitude_to_db(np.abs(librosa.stft(y_filt)), ref=np.max)

# draw each spectrogram in its own figure so the second plot does not overwrite the first
plt.figure()
librosa.display.specshow(S_orig, y_axis='log', x_axis='time')
plt.title('Original signal')
plt.figure()
librosa.display.specshow(S_preemph, y_axis='log', x_axis='time')
plt.title('Pre-emphasized signal')
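A pre-emphasis filter is commonly written as the first-order difference y[n] = x[n] - a * x[n-1]. The sketch below assumes that form with a = 0.97 and measures its effect on two synthetic tones; the helper names and the tone frequencies are hypothetical, and the handling of the very first sample may differ from what `librosa.effects.preemphasis` does.

```python
import numpy as np

def preemphasis_sketch(signal, coef=0.97):
    """First-order filter y[n] = x[n] - coef * x[n-1]; the first sample is kept as is."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - coef * signal[:-1])

def rms(values):
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(values ** 2)))

pre_sr = 8000                                           # hypothetical sampling rate
pre_t = np.arange(pre_sr) / pre_sr
low_tone = np.sin(2 * np.pi * 100 * pre_t)              # low-frequency component
high_tone = np.sin(2 * np.pi * 3000 * pre_t)            # high-frequency component

gain_low = rms(preemphasis_sketch(low_tone)) / rms(low_tone)
gain_high = rms(preemphasis_sketch(high_tone)) / rms(high_tone)
print(f"gain at 100 Hz: {gain_low:.3f}, gain at 3000 Hz: {gain_high:.3f}")
```

The printed gains show the tilt the filter introduces: the 3000 Hz tone comes out stronger than it went in, while the 100 Hz tone is strongly attenuated.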

Working with Speech Signals:
# Framing

A speech signal is a non-stationary signal (a signal whose frequency content varies over time). To analyze it, we need to treat it as stationary over short stretches, so we divide the speech signal into short frames. Generally speaking, we choose small intervals (e.g., about 20 to 30 milliseconds) for each frame, because the shape of the human vocal tract is approximately constant over such short periods. If we choose an interval shorter than 20 milliseconds, we won't have enough samples to obtain a reasonable estimate of the signal's frequency components, while in a frame longer than 30 milliseconds the frequency components may fluctuate too much for the frame to be considered stationary [3-4].

### A script for creating frames from an audio signal

In [None]:
import numpy as np


def framing(sig, fs=16000, win_len=0.025, win_hop=0.01):
    """
    transform a signal into a series of overlapping frames.

    Args:
        sig            (array) : a mono audio signal (Nx1) from which to compute features.
        fs               (int) : the sampling frequency of the signal we are working with.
                                 Default is 16000.
        win_len        (float) : window length in sec.
                                 Default is 0.025.
        win_hop        (float) : step between successive windows in sec.
                                 Default is 0.01.

    Returns:
        array of frames (num_frames x frame_length).
    """
    # compute frame length and frame step (convert from seconds to whole samples)
    frame_length = int(round(win_len * fs))
    frame_step = int(round(win_hop * fs))
    signal_length = len(sig)
    frames_overlap = frame_length - frame_step

    # number of complete frames, and the samples left over after the last complete frame
    num_frames = np.abs(signal_length - frames_overlap) // np.abs(frame_length - frames_overlap)
    rest_samples = np.abs(signal_length - frames_overlap) % np.abs(frame_length - frames_overlap)

    # Pad Signal to make sure that all frames have equal number of samples
    # without truncating any samples from the original signal
    if rest_samples != 0:
        pad_signal_length = int(frame_step - rest_samples)
        z = np.zeros((pad_signal_length))
        pad_signal = np.append(sig, z)
        num_frames += 1
    else:
        pad_signal = sig

    # make sure to use integers as indices
    num_frames = int(num_frames)

    # compute indices: row k holds the sample positions of frame k
    idx1 = np.tile(np.arange(0, frame_length), (num_frames, 1))
    idx2 = np.tile(np.arange(0, num_frames * frame_step, frame_step),
                   (frame_length, 1)).T
    indices = idx1 + idx2
    frames = pad_signal[indices.astype(np.int32, copy=False)]
    return frames
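The frame arithmetic can be checked on a small example. The sketch below is an independent, simplified take on the same idea using `np.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later); the variable names and the one-second random test signal are hypothetical.

```python
import math
import numpy as np

fr_fs, fr_win_len, fr_win_hop = 16000, 0.025, 0.010
fr_length = int(round(fr_win_len * fr_fs))   # 25 ms -> 400 samples
fr_step = int(round(fr_win_hop * fr_fs))     # 10 ms -> 160 samples
fr_sig = np.random.default_rng(0).standard_normal(fr_fs)   # 1 second of noise

# One frame for the first window, plus one per hop needed to cover the rest
fr_count = 1 + math.ceil(max(len(fr_sig) - fr_length, 0) / fr_step)
fr_padded_len = (fr_count - 1) * fr_step + fr_length
fr_padded = np.pad(fr_sig, (0, fr_padded_len - len(fr_sig)))   # zero-pad the tail

# Every window of fr_length samples, then keep one window per hop
fr_frames = np.lib.stride_tricks.sliding_window_view(fr_padded, fr_length)[::fr_step]
print(fr_frames.shape)
```

With a 25 ms window and a 10 ms hop, consecutive frames share 15 ms of samples, and one second of audio at 16 kHz yields 99 frames of 400 samples once the tail is zero-padded.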

# Windowing
Note that when we extract frames from a speech signal, each frame generally contains a non-integer number of periods of the underlying waveform. This leads to an issue known as spectral leakage in signal processing lingo: energy from one frequency spreads into neighboring frequency bins, so the spectrum of the frame no longer represents the signal's frequency components correctly. To reduce this effect, we use windowing: each frame is multiplied by a window function that tapers the waveform towards zero at the frame borders [4-5].

In [None]:
import numpy as np


def windowing(frames, frame_len, win_type="hamming", beta=14):
    """
    generate and apply a window function to avoid spectral leakage.

    Args:
      frames  (array) : array including the overlapping frames.
      frame_len (int) : frame length.
      win_type  (str) : type of window to use.
                        Default is "hamming"
      beta    (float) : shape parameter of the Kaiser window.
                        Default is 14.

    Returns:
      windowed frames.
    """
    if   win_type == "hamming" : windows = np.hamming(frame_len)
    elif win_type == "hanning" : windows = np.hanning(frame_len)
    elif win_type in ("bartlet", "bartlett"): windows = np.bartlett(frame_len)
    elif win_type == "kaiser"  : windows = np.kaiser(frame_len, beta)
    elif win_type == "blackman": windows = np.blackman(frame_len)
    else: raise ValueError("unknown window type: " + str(win_type))
    # the window is broadcast over the rows, so every frame is tapered the same way
    windowed_frames = frames * windows
    return windowed_frames
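The effect of a window on spectral leakage can be measured on a single synthetic frame. In the sketch below, the 1010 Hz tone is chosen so that the 400-sample frame holds 25.25 periods, and `far_leakage` is a hypothetical helper that reports how much spectral power lies far away from the peak.

```python
import numpy as np

leak_fs = 16000
leak_frame_len = 400                                   # 25 ms at 16 kHz
leak_n = np.arange(leak_frame_len)
leak_frame = np.sin(2 * np.pi * 1010 * leak_n / leak_fs)   # 25.25 periods per frame

def far_leakage(frame, guard=5):
    """Fraction of spectral power lying more than `guard` bins away from the peak."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    peak = power.argmax()
    bins = np.arange(len(power))
    return power[np.abs(bins - peak) > guard].sum() / power.sum()

leak_rect = far_leakage(leak_frame)                                   # no window
leak_hamming = far_leakage(leak_frame * np.hamming(leak_frame_len))   # tapered frame
print(f"rectangular: {leak_rect:.5f}, hamming: {leak_hamming:.7f}")
```

The tapered frame keeps far less power away from the tone's own frequency, at the cost of a slightly wider main peak.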

Overlapping frames
Windowing attenuates the samples towards the beginning and the end of each frame, so with disjoint frames those samples would contribute little to the analysis and the frequency representation would be incomplete. To address the issue, we take overlapping frames instead of disjoint frames. The overlap between frames is generally chosen to be about 10-15 ms.
The code below is a modified version of a script at [6].

In [None]:
import numpy
import matplotlib.pyplot as plt
import scipy.io.wavfile   # This library is used for reading the .wav file

fs, signal = scipy.io.wavfile.read('w1.wav')  # input wav file, change here
# fs = sampling frequency; signal is the numpy array holding the samples of the wav file (assumed mono here)
length = len(signal)  # the number of samples in the wav file, not its length in time
window_hop_length = 0.01  # 10 ms, change here
overlap = int(fs * window_hop_length)  # step between frame starts, in samples
print("overlap=", overlap)
window_size = 0.025  # 25 ms, change here
framesize = int(window_size * fs)
print("framesize=", framesize)
number_of_frames = length // overlap  # integer division: the frame count must be an int
nfft_length = framesize  # length of DFT, change here
print("number of frames are =", number_of_frames)
# 2D matrix with one row per frame and one column per sample of the frame
frames = numpy.zeros((number_of_frames, framesize))
for k in range(number_of_frames):
    for i in range(framesize):
        if (k * overlap + i) < length:
            frames[k][i] = signal[k * overlap + i]
        else:
            frames[k][i] = 0  # zero-pad frames that run past the end of the signal
# complex matrix to store the DFT of each frame
fft_matrix = numpy.zeros((number_of_frames, framesize), dtype=complex)
# real matrix to store the power spectrum
abs_fft_matrix = numpy.zeros((number_of_frames, framesize))
for k in range(number_of_frames):
    fft_matrix[k] = numpy.fft.fft(frames[k])  # computes the DFT
    # power spectrum, scaled by the largest magnitude in the frame
    abs_fft_matrix[k] = abs(fft_matrix[k]) * abs(fft_matrix[k]) / max(abs(fft_matrix[k]))
t = range(len(abs_fft_matrix))  # frame index; this segment plots the power spectrum obtained above
plt.plot(t, abs_fft_matrix)  # one curve per frequency bin
plt.ylabel('power')
plt.xlabel('time (frame index)')
plt.show()


1. https://appliedmachinelearning.blog/2017/06/14/voice-gender-detection-using-gmms-a-python-primer/
Python For Audio Signal Processing
2. https://appliedmachinelearning.blog/2017/06/14/voice-gender-detection-using-gmms-a-python-primer/
3. Introduction to Speech Processing https://wiki.aalto.fi/display/ITSP/Windowing
4. Spectral leakage and windowing https://superkogito.github.io/blog/SpectralLeakageWindowing.html
5. https://deerishi.wordpress.com/2013/09/23/signal-processing-using-python-part-1/
https://jonathan-hui.medium.com/speech-recognition-gmm-hmm-8bb5eff8b196

Audio Features and Feature extraction from audio signal:
Audio signals carry many kinds of information.
However, we need to extract the characteristics relevant to the problem we are trying to solve. Features that can be extracted from audio signals include:

 1. Spectrogram
 2. Spectral Centroid
 3. Spectral Rolloff
 4. Spectral Bandwidth
 5. Zero-Crossing Rate
 6. Mel-Frequency Cepstral Coefficients(MFCCs)

Spectrogram

In [None]:
X = librosa.stft(x)
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()

Spectral Centroid

The spectral centroid indicates where the center of mass of the spectrum is located, i.e. the magnitude-weighted mean of the frequencies present in the sound.

In [None]:
import sklearn.preprocessing
spectral_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)[0]
print(spectral_centroids.shape)  # e.g. (775,); the length depends on the clip
# Computing the time variable for visualization
plt.figure(figsize=(12, 4))
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames, sr=sr)
# Normalising the spectral centroid for visualisation
def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)
# Plotting the Spectral Centroid along the waveform
librosa.display.waveshow(x, sr=sr, alpha=0.4)  # waveshow replaces waveplot in newer librosa releases
plt.plot(t, normalize(spectral_centroids), color='b')
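For a single frame, the center of mass of the spectrum can be computed by hand as a magnitude-weighted mean of the bin frequencies. The sketch below is a simplified, assumed version of that calculation (one frame, no window) and not librosa's implementation; the test tones are chosen to fall exactly on FFT bins so that the results are easy to check.

```python
import numpy as np

def spectral_centroid_hz(frame, sample_rate):
    """Magnitude-weighted mean frequency of one frame."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sample_rate)
    return float((freqs * magnitude).sum() / magnitude.sum())

cen_sr = 8000                                           # hypothetical sampling rate
cen_k = np.arange(512)                                  # one 512-sample frame
one_tone = np.sin(2 * np.pi * 1000 * cen_k / cen_sr)
two_tones = one_tone + np.sin(2 * np.pi * 3000 * cen_k / cen_sr)

centroid_one = spectral_centroid_hz(one_tone, cen_sr)
centroid_two = spectral_centroid_hz(two_tones, cen_sr)
print(centroid_one, centroid_two)
```

A single 1000 Hz tone has its centroid at 1000 Hz; adding an equally strong 3000 Hz tone pulls the centroid to the midpoint between the two.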

Spectral Rolloff
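Spectral rolloff is the frequency below which a specified percentage of the total spectral energy lies (85% by default in librosa). It gives a rough measure of where the high-frequency content of the signal tails off.

In [None]:
spectral_rolloff = librosa.feature.spectral_rolloff(y=x+0.01, sr=sr)[0]
plt.figure(figsize=(12, 4))
librosa.display.waveshow(x, sr=sr, alpha=0.4)  # waveshow replaces waveplot in newer librosa releases
plt.plot(t, normalize(spectral_rolloff), color='r')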
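One common way to define a roll-off frequency is as the lowest frequency bin at which the cumulative spectral magnitude reaches a chosen fraction of the total. The sketch below assumes that definition with a fraction of 0.85 and works on one unwindowed frame; the helper name and test tones are hypothetical and the code is a simplification, not librosa's implementation.

```python
import numpy as np

def spectral_rolloff_hz(frame, sample_rate, roll_percent=0.85):
    """Lowest bin frequency at which the cumulative magnitude reaches roll_percent of the total."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sample_rate)
    cumulative = np.cumsum(magnitude)
    idx = np.searchsorted(cumulative, roll_percent * cumulative[-1])
    return float(freqs[idx])

roll_sr = 8000                                          # hypothetical sampling rate
roll_k = np.arange(512)
roll_low = np.sin(2 * np.pi * 1000 * roll_k / roll_sr)
roll_high = np.sin(2 * np.pi * 3000 * roll_k / roll_sr)

rolloff_balanced = spectral_rolloff_hz(roll_low + roll_high, roll_sr)
rolloff_mostly_low = spectral_rolloff_hz(roll_low + 0.1 * roll_high, roll_sr)
print(rolloff_balanced, rolloff_mostly_low)
```

When both tones are equally strong, 85% of the magnitude is only reached at the 3000 Hz tone; when the high tone is weak, the roll-off drops to the 1000 Hz tone.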

### Spectral Bandwidth
The spectral bandwidth measures how widely the spectrum is spread around its centroid. librosa computes a p-th order spread; for p = 2 (the default) it behaves like a magnitude-weighted standard deviation of the frequencies.

In [None]:
spectral_bandwidth_1 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr)[0]
spectral_bandwidth_2 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr, p=3)[0]
spectral_bandwidth_3 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr, p=4)[0]
spectral_bandwidth_4 = librosa.feature.spectral_bandwidth(y=x+0.01, sr=sr, p=5)[0]

plt.figure(figsize=(15, 9))
librosa.display.waveshow(x, sr=sr, alpha=0.4)  # waveshow replaces waveplot in newer librosa releases
plt.plot(t, normalize(spectral_bandwidth_1), color='b', label='p = 2')
plt.plot(t, normalize(spectral_bandwidth_2), color='r', label='p = 3')
plt.plot(t, normalize(spectral_bandwidth_3), color='g', label='p = 4')
plt.plot(t, normalize(spectral_bandwidth_4), color='y', label='p = 5')
plt.legend()

Zero-Crossing Rate

A very simple way of measuring the smoothness of a signal is to count the number of zero crossings within a segment of that signal. A voiced signal oscillates slowly (for example, a 100 Hz tone crosses zero 200 times per second, twice per cycle), whereas an unvoiced fricative can have 3000 zero crossings per second.

In [None]:
# Plot the signal:
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)  # waveshow replaces waveplot in newer librosa releases
# Zooming in
n0 = 9000
n1 = 9100
plt.figure(figsize=(14, 5))
plt.plot(x[n0:n1])
plt.grid()

In [None]:
zero_crossings = librosa.zero_crossings(x[n0:n1], pad=False)
print(sum(zero_crossings))  # e.g. 16; the count depends on the signal
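Counting zero crossings needs nothing more than a comparison of the signs of neighboring samples. The sketch below is a NumPy-only illustration with a hypothetical helper `count_zero_crossings`; the 100 Hz tone stands in for a slowly oscillating voiced sound and white noise stands in for a noisy, fricative-like sound.

```python
import numpy as np

def count_zero_crossings(signal):
    """Count sign changes between consecutive samples."""
    return int(np.count_nonzero(np.diff(np.signbit(signal))))

zc_sr = 8000                                            # hypothetical sampling rate
zc_t = np.arange(zc_sr) / zc_sr                         # exactly 1 second
# 100 Hz tone; the small phase offset keeps samples from landing exactly on zero
voiced_like = np.sin(2 * np.pi * 100 * zc_t + 0.1)
noise_like = np.random.default_rng(0).standard_normal(zc_sr)

zc_voiced = count_zero_crossings(voiced_like)
zc_noise = count_zero_crossings(noise_like)
print(zc_voiced, zc_noise)
```

The tone crosses zero twice per cycle, while the noise changes sign on roughly every other sample, so the zero-crossing count separates the two kinds of signal clearly.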

Mel-Frequency Cepstral Coefficients(MFCCs)

The Mel-frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10–20) that concisely describe the overall shape of the spectral envelope. They are widely used to model the characteristics of the human voice.

In [None]:
mfccs = librosa.feature.mfcc(y=x, sr=sr)  # keyword arguments are required by newer librosa releases
print(mfccs.shape)

plt.figure(figsize=(15, 7))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')

Chroma feature
A chroma feature or vector is typically a 12-element feature vector indicating how much energy of each pitch class, {C, C#, D, D#, E, …, B}, is present in the signal. In short, it provides a robust way to describe a similarity measure between music pieces.

In [None]:
chromagram = librosa.feature.chroma_stft(y=x, sr=sr)  # keyword arguments are required by newer librosa releases
plt.figure(figsize=(15, 5))
librosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', cmap='coolwarm')
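The idea of folding a spectrum into 12 pitch classes can be sketched in a few lines. The code below is a simplified illustration and not librosa's `chroma_stft`: it assumes equal temperament with A4 = 440 Hz, maps each FFT bin to its nearest semitone, and sums the magnitudes per pitch class. All names in it are hypothetical.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class_index(freq_hz):
    """Nearest equal-tempered pitch class (0 = C, ..., 11 = B), assuming A4 = 440 Hz."""
    midi = np.round(69 + 12 * np.log2(np.asarray(freq_hz, dtype=float) / 440.0))
    return midi.astype(int) % 12

def chroma_sketch(frame, sample_rate):
    """12-element vector: share of spectral magnitude falling into each pitch class."""
    magnitude = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sample_rate)
    chroma = np.zeros(12)
    # Skip the 0 Hz bin (log2(0) is undefined) and accumulate magnitude per class
    np.add.at(chroma, pitch_class_index(freqs[1:]), magnitude[1:])
    return chroma / chroma.sum()

chroma_sr = 8000                                        # hypothetical sampling rate
chroma_k = np.arange(4096)
a4_frame = np.sin(2 * np.pi * 440 * chroma_k / chroma_sr)
chroma_vec = chroma_sketch(a4_frame, chroma_sr)
strongest = PITCH_CLASSES[int(chroma_vec.argmax())]
print(strongest, PITCH_CLASSES[int(pitch_class_index(261.63))])
```

Because octaves are folded together, a 440 Hz tone and an 880 Hz tone both land in the "A" class, which is what makes chroma features robust to changes of octave and timbre.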

# Data Augmentation Methods for Speech Data
Generally speaking, data augmentation aims to increase the size of a dataset by providing multiple variations of each data sample: new samples are created by making small changes to the original samples. The main advantage of data augmentation is that it helps to reduce overfitting and to train supervised classifiers that generalize better.
To create new data samples, we can employ the following methods:
1. Noise Injection
2. Shifting Time
3. Changing Pitch
4. Changing Speed
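Three of these methods can be sketched with NumPy alone. The function names, the noise factor and the test tone below are hypothetical choices for illustration. Note that the simple interpolation used for the speed change also shifts the pitch; changing pitch or speed independently needs a more elaborate method, for which librosa offers `librosa.effects.pitch_shift` and `librosa.effects.time_stretch`.

```python
import numpy as np

aug_rng = np.random.default_rng(0)

def add_noise(signal, noise_factor=0.005):
    """Noise injection: add white Gaussian noise scaled by noise_factor."""
    return signal + noise_factor * aug_rng.standard_normal(len(signal))

def shift_time(signal, shift_samples):
    """Shift right by shift_samples (left if negative), filling the gap with zeros."""
    shifted = np.roll(signal, shift_samples)
    if shift_samples > 0:
        shifted[:shift_samples] = 0
    elif shift_samples < 0:
        shifted[shift_samples:] = 0
    return shifted

def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 shortens (speeds up) the clip."""
    n_out = int(round(len(signal) / rate))
    positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(positions, np.arange(len(signal)), signal)

aug_sr = 8000                                           # hypothetical sampling rate
aug_clip = np.sin(2 * np.pi * 220 * np.arange(aug_sr) / aug_sr)

noisy = add_noise(aug_clip)
shifted = shift_time(aug_clip, 100)
faster = change_speed(aug_clip, 2.0)
print(len(noisy), len(shifted), len(faster))
```

Each function returns a new array and leaves the original clip untouched, so several augmented variants can be generated from the same source sample.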

## Articles
1. [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%253A+arxiv%252FQSXk+%2528ExcitingAds%2521+cs+updates+on+arXiv.org%2529)
## Tutorials
i. [Data Augmentation for Speech Recognition](https://towardsdatascience.com/data-augmentation-for-speech-recognition-e7c607482e78)

ii. https://dev.to/makcedward/data-augmentation-for-speech-recognition-bfc
iii. Data Augmentation in Python: Everything You Need to Know https://neptune.ai/blog/data-augmentation-in-python
iv. https://www.kaggle.com/CVxTz/audio-data-augmentation

v. https://project-awesome.org/faroit/awesome-python-scientific-audio#data-augmentation

## Github
a. https://github.com/iver56/audiomentations

b. https://github.com/SuperKogito/pydiogment

c. https://muda.readthedocs.io/en/latest/
## Libraries
1. audiomentations https://pypi.org/project/audiomentations/

# What is speech recognition?
Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability which enables a program to process human speech into a written format. While it’s commonly confused with voice recognition, speech recognition focuses on the translation of speech from a verbal format to a text one whereas voice recognition just seeks to identify an individual user’s voice.

# What is speaker recognition?

A speaker recognition system (also called a voice recognition system, not to be confused with speech recognition) aims to identify an individual from the characteristics of their voice. In other words, a speaker recognition system answers the question "***Who is speaking?***". A speaker recognition system can take the form of **Speaker Identification** or **Speaker Verification**.

A **Speaker Identification** system aims to determine from which of the registered speakers a given utterance comes.

A **Speaker Verification** system can accept or reject the identity claimed by a speaker.

Speaker verification can be either ***text-dependent*** or ***text-independent***.

*Text-dependent* speaker verification means speakers need to use the same passphrase during both the enrollment and verification phases.

*Text-independent* verification means speakers can speak freely, in everyday language, during the enrollment and verification phases.
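The difference between identification and verification can be illustrated with a toy example. The sketch below assumes that each speaker has already been reduced to a fixed-length voice embedding by some feature extractor; the three-dimensional vectors, the speaker names, the cosine-similarity score and the threshold of 0.7 are all hypothetical and only serve to show the two decision rules.

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two embeddings, 1.0 meaning identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(claimed_embedding, test_embedding, threshold=0.7):
    """Verification: accept or reject one claimed identity."""
    return cosine_similarity(claimed_embedding, test_embedding) >= threshold

def identify(enrolled, test_embedding):
    """Identification: pick the best-matching speaker among all enrolled ones."""
    return max(enrolled, key=lambda name: cosine_similarity(enrolled[name], test_embedding))

# Hypothetical enrollment data: one embedding per registered speaker
enrolled_speakers = {
    "alice": np.array([1.0, 0.1, 0.0]),
    "bob": np.array([0.0, 1.0, 0.2]),
}
utterance = np.array([0.9, 0.2, 0.05])                  # embedding of a new utterance

print(identify(enrolled_speakers, utterance))
print(verify(enrolled_speakers["alice"], utterance))
print(verify(enrolled_speakers["bob"], utterance))
```

Identification compares the utterance against every registered speaker and returns the closest one, whereas verification compares it against a single claimed identity and returns a yes/no decision.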


References:
A Tutorial on Text-Independent Speaker Verification


### More Python codes
1. [Speaker_Recogniton_Verification.ipynb](https://colab.research.google.com/github/NVIDIA/NeMo/blob/r1.0.0rc1/tutorials/speaker_recognition/Speaker_Recognition_Verification.ipynb)

2. [Speaker Recognition](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/audio/ipynb/speaker_recognition_using_cnn.ipynb)

3. [Speaker_Diarization_Inference.ipynb](https://colab.research.google.com/github/NVIDIA/NeMo/blob/r1.0.0rc1/tutorials/speaker_recognition/Speaker_Diarization_Inference.ipynb#scrollTo=oNVGEzW2f0mF)

4. [Identifying speakers with voice recognition](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781787125193/9/ch09lvl1sec61/identifying-speakers-with-voice-recognition)

5. [Speech Recognition.ipynb](https://github.com/aravindpai/Speech-Recognition/blob/master/Speech%20Recognition.ipynb)

### Tutorial
[1]. [Speaker recognition](http://www.scholarpedia.org/article/Speaker_recognition)

[2]. [A Tutorial on Text-Independent Speaker Verification](https://link.springer.com/content/pdf/10.1155/S1110865704310024.pdf)

[3]. [Real-Time Speaker Identification and Verification](https://observatoriouniversidadesoaf.com/BUENAS_PRACTICAS/uoc_tesla_trust_based_authentication/bibliografia_nueva/Speaker%20Recognition.pdf)

[4]. [A tutorial on speaker verification](http://cslt.riit.tsinghua.edu.cn/mediawiki/images/c/cb/131104-ivector-microsoft-wj.pdf)

[5]. [An Overview of Automatic Speaker Verification System](https://link.springer.com/chapter/10.1007/978-981-10-7245-1_59)

[6]. [SpeechRecognition](https://pypi.org/project/SpeechRecognition/)

[7]. [Easy Speech-to-Text with Python](https://www.kdnuggets.com/2020/06/easy-speech-text-python.html)

[8]. [Speech Recognition in Python— The Complete Beginner’s Guide](https://sonsuzdesign.blog/2020/08/14/speech-recognition-in-python-the-complete-beginners-guide/)

[9]. [Recognition: a review of the different deep learning approaches](https://theaisummer.com/speech-recognition/)