# Speech Enhancement

## Introduction

Imagine you have a piece of audio in which you cannot clearly hear or understand the speaker because of background noise. This report presents a way of removing that noise from the signal (as long as it is white or colored noise) with a method called Subspace-Based Speech Enhancement.

## Brief Explanation of the Method


The main idea behind Subspace-Based Speech Enhancement is to treat the noisy signal as a space formed by two subspaces: the noise subspace and the speech subspace. Through a series of calculations, illustrated in the [Implementation](#Implementation) section, one can obtain an estimate of the speech subspace that is (almost) free of noise.



![Noisy,Noise and Clean Signals](img/noisyformula.jpg)

_noisy space_: y,
_noise subspace_:n,
_speech subspace_:x.

_Note 1_: The noisy signal space and its eigenvectors are sometimes referred to as "Z", especially in the code.  
_Note 2_: The method works with both white and colored noise.
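
To make the additive model concrete, here is a minimal sketch that builds a noisy signal y from a clean signal x and white noise n at a chosen SNR. The helper name `add_white_noise`, the 200 Hz test tone and the 8 kHz sampling rate are illustrative assumptions, not part of the original project.

```python
import numpy as np

def add_white_noise(clean, snr_db, seed=0):
    """Return (noisy, noise): clean plus white Gaussian noise at the requested SNR in dB."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(clean))
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that clean_power / noise_power equals 10^(snr_db / 10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    noise = scale * noise
    return clean + noise, noise

# Toy "speech": a 200 Hz tone sampled at 8 kHz (placeholder for a real recording)
fs = 8000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 200 * t)

y, n = add_white_noise(x, snr_db=10)
measured_snr = 10 * np.log10(np.mean(x ** 2) / np.mean(n ** 2))
```

Here `measured_snr` recovers the requested 10 dB, because the noise is scaled using its measured power rather than its nominal variance.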

## Implementation

In this section, the code is presented and its main blocks are explained, to make clear how the method works and how it is implemented.

#### Noise Covariance Matrix
Under the imports there is a comment section that describes the main parameters used to obtain the enhanced speech. 
Then the noisy signal file is read and normalized. Finally, in line 17, the implementation starts. 

The first step of the subspace-based method is to find the covariance matrix of the noise. For this, we need samples of the _noisy signal_ that contain only noise (no speech). Finding them is usually called VAD (voice activity detection), and it can be implemented in different ways. 

Our implementation first computes the power of the noisy signal and divides the signal into frames whose bounds are defined by the parameters **a1** and **b1**. The mean energy of each frame is then computed and compared to a threshold. If the energy of a frame is below the threshold, the frame is treated as noise.

Now that we have our noise samples, we can compute the covariance matrix of the noise, as shown in the last line of the code.



In [38]:
import numpy as np
from scipy.io import wavfile

# This function returns the denoised signal using the subspace method as an array
#   wav_noisy is the path to the noisy signal
#   K is the length of the windows used in the process
#   T is the number of windows considered when computing the covariance matrix
#   mu is the parameter that tunes the enhancement. Higher values of mu give better enhancement but also more distortion
#   threshold is the energy threshold used for voice activity detection and for computing the noise covariance matrix
#def subspace_enhance(wav_noisy,K,T,mu,threshold):
K = 80
T = 10
mu = 10
threshold = 0.1
# Open the noisy file and normalize it (dividing by 32768 assumes 16-bit PCM samples)
fs_noisy, noisy_signal = wavfile.read('Audio/noisy_SNR_10.wav')
noisy_signal = np.divide(noisy_signal, 32768)

# Noise covariance
Energy = noisy_signal*noisy_signal
n_max_noise = int(2*(len(noisy_signal)-K)/K)
Somme = np.zeros((K,K))
p = 0
for n in range(0, n_max_noise):
    # Frames of K samples with 50 % overlap
    a1 = int(n*K/2)
    b1 = int((n*K/2)+K)
    EnergyFrame = np.mean(Energy[a1:b1])
    # A frame whose mean energy is below the threshold is treated as noise
    if EnergyFrame < threshold:
        Zn = noisy_signal[a1:b1]
        Somme = np.add(Somme, np.outer(Zn, Zn))
        p += 1
# Without any noise frame the average below would divide by zero
if p == 0:
    raise ValueError("No frame below the energy threshold; try a higher threshold")
Noise_covariance_Matrix = np.divide(Somme, p)
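
To see which frames the energy test labels as noise before trusting the covariance estimate, the frame decision can be inspected on its own. The sketch below is a hedged illustration: `frame_energies` is a hypothetical helper, K is assumed to be even, and the synthetic signal (half a second of noise followed by a tone) and the threshold of 0.01 are made up for the demonstration.

```python
import numpy as np

def frame_energies(signal, K):
    """Mean energy of each length-K frame, with frames overlapping by 50 %."""
    hop = K // 2                                   # assumes K is even
    n_frames = int(2 * (len(signal) - K) / K)      # same frame count as the cell above
    return np.array([np.mean(signal[n * hop:n * hop + K] ** 2)
                     for n in range(n_frames)])

# Synthetic test signal: noise only in the first half, noise plus a tone in the second half
rng = np.random.default_rng(0)
fs = 8000
noise = 0.01 * rng.standard_normal(fs)
speech = np.zeros(fs)
speech[fs // 2:] = 0.5 * np.sin(2 * np.pi * 200 * np.arange(fs // 2) / fs)
signal = speech + noise

energies = frame_energies(signal, K=80)
is_noise = energies < 0.01                         # hypothetical threshold
first_speech_frame = int(np.argmin(is_noise))      # index of the first frame kept as speech
```

Plotting `energies` against the threshold is a quick way to check whether quiet speech segments are being counted as noise.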
    
   

***

#### Noisy Covariance Matrix:
The next step is to compute the covariance matrix of the noisy signal, for which we use the proposed formula shown in __add article name__.

![Noisy Covariance](img/noisycovariance.jpg)


In [40]:
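# Enhancement
length = len(noisy_signal)
n_max = int(2*(length-K)/K)
Enhance_signal = np.zeros((length,1))
offset = int(T*K/2)

# Judging from the commented-out function wrapper, the steps in the following cells
# belong inside this loop (one pass per frame)
for n in range(T, n_max-T):
    a = int(n*K/2)
    b = int((n*K/2)+K)
    sample = noisy_signal[a:b]
    Somme = np.zeros((K,K))
    comp = 0
    # Average the outer products of the T*K windows around the current frame
    for o in range(a-offset, a+offset):
        Zn = noisy_signal[o:o+K]
        Somme = np.add(Somme, np.outer(Zn, Zn))
        comp = comp+1
    Noisy_Covariance = np.divide(Somme, comp)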
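
The inner loop above can also be written without an explicit Python loop. The following sketch is one possible vectorized variant, not the author's code: `local_covariance` is a hypothetical name, the input is random data, and `sliding_window_view` requires NumPy 1.20 or later. The explicit loop is kept alongside it as a reference.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_covariance(signal, a, K, T):
    """Average of z z^T over the T*K length-K windows that start around sample a."""
    offset = int(T * K / 2)
    windows = sliding_window_view(signal, K)   # row o is signal[o:o+K]
    Z = windows[a - offset:a + offset]         # shape (2*offset, K)
    return Z.T @ Z / Z.shape[0]

rng = np.random.default_rng(0)
signal = rng.standard_normal(4000)             # stand-in for noisy_signal
K, T = 80, 10
offset = int(T * K / 2)
a = offset                                     # start of the first frame the loop processes
R_fast = local_covariance(signal, a, K, T)

# Reference: the explicit loop used in the cell above
R_loop = np.zeros((K, K))
for o in range(a - offset, a + offset):
    Zn = signal[o:o + K]
    R_loop += np.outer(Zn, Zn)
R_loop /= 2 * offset
```

Both versions compute the same matrix; the vectorized one replaces T*K outer products per frame with a single matrix product.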
       


#### Matrix that diagonalizes Noise and Noisy Covariance Matrix

The variable "sigma" holds a matrix whose eigenvectors diagonalize both the noise and the noisy covariance matrices; its eigenvectors and eigenvalues are used to compute the estimator necessary to find the speech subspace. 
Inside the for loop, we count the eigenvalues that are greater than zero, as these are the ones needed to build the estimator. Afterwards, the eigenvalues are sorted in descending order (with the quicksort algorithm), as suggested in the article. 

In [42]:
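# sigma = inv(noise covariance) * noisy covariance - I
sigma = np.matmul(np.linalg.pinv(Noise_covariance_Matrix), Noisy_Covariance) - np.eye(K)
Eig, V1 = np.linalg.eig(sigma)
# Count the positive eigenvalues
Number_positive_Eig = 0
for z in Eig:
    if z > 0:
        Number_positive_Eig += 1
# Sort eigenvalues and eigenvectors in descending order
order = np.argsort(Eig, axis=-1, kind='quicksort')
order = order[::-1]
V = V1[:,order]
V = -V
Val = Eig[order]
# Keep only the positive eigenvalues
Val = Val[0:Number_positive_Eig]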
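
The joint diagonalization can be checked numerically on a small example. The sketch below swaps in a different routine from the cell above: it uses `scipy.linalg.eigh` on the pair of covariance matrices instead of `np.linalg.eig` on their product, which gives the same eigenvalues when the noise covariance is symmetric positive definite. The matrices `Rn`, `Rx` and `Rz` are random stand-ins introduced only for this illustration.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
K = 8
noise_samples = rng.standard_normal((500, K))
Rn = noise_samples.T @ noise_samples / 500   # stand-in noise covariance (positive definite)
B = rng.standard_normal((K, 3))
Rx = B @ B.T                                 # stand-in speech covariance of rank 3
Rz = Rx + Rn                                 # noisy covariance

# Generalized eigenproblem Rz v = w Rn v, i.e. the eigenvectors of inv(Rn) @ Rz
w, V = eigh(Rz, Rn)
lam = w - 1.0                                # eigenvalues of sigma = inv(Rn) @ Rz - I
order = np.argsort(lam)[::-1]                # descending order, as in the cell above
lam, V = lam[order], V[:, order]

noise_diag = V.T @ Rn @ V                    # identity matrix
noisy_diag = V.T @ Rz @ V                    # diagonal matrix with entries lam + 1
n_positive = int(np.sum(lam > 1e-8))         # dimension of the speech subspace

# Cross-check against the route taken in the cell above
reference = np.sort(np.real(np.linalg.eigvals(np.linalg.pinv(Rn) @ Rz - np.eye(K))))[::-1]
```

In this toy case only three eigenvalues are clearly positive, matching the rank of the stand-in speech covariance; the remaining ones are zero up to rounding error, which is why the count uses a small tolerance instead of a strict comparison with zero.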

       

#### Optimal Linear Estimator (Hopt)

Finally, we are able to obtain the estimator _Hopt_. This is where the $\mu$ parameter comes into play: as mentioned in the parameter comments, it controls the trade-off between speech distortion and residual noise, so one can only be reduced at the expense of the other. 
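
Written out, the code cells before and after this point amount to the expressions below. The symbols $R_n$ and $R_z$ are shorthand introduced here for `Noise_covariance_Matrix` and `Noisy_Covariance`, $z$ is the current noisy frame (`sample`) and $\hat{x}$ is the enhanced frame. This is a reading of the code, not a quotation from the article.

$$
\Sigma = R_n^{-1} R_z - I, \qquad \Sigma V = V \Lambda, \qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_K)
$$

$$
H_{opt} = V^{-T} \operatorname{diag}(g_1, \dots, g_K)\, V^{T}, \qquad
g_k = \begin{cases} \dfrac{\lambda_k}{\lambda_k + \mu} & \text{if } \lambda_k > 0 \\ 0 & \text{otherwise} \end{cases}, \qquad
\hat{x} = H_{opt}\, z
$$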



In [44]:
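# Diagonal gain matrix: lambda/(lambda+mu) for the positive eigenvalues, zero elsewhere
Q1 = np.zeros((K, K))
for w in range(0, Number_positive_Eig):
    Q1[w,w] = Val[w]/(Val[w]+mu)
# Hopt = inv(V^T) * Q1 * V^T
Hopt = np.matmul(np.linalg.pinv(np.transpose(V)), Q1)
Hopt = np.matmul(Hopt, np.transpose(V))
# Apply the estimator to the current frame, window it and overlap-add it into the output
Enhance = np.matmul(Hopt, sample.reshape(-1, 1))
Hamming = np.hamming(K)
Enhance = np.multiply(Enhance, Hamming.reshape(-1, 1))
Enhance_signal[a:b] = np.add(Enhance_signal[a:b], Enhance)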
    
   # return fs_noisy, Enhance_signal
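
Because the cells above are split across the notebook, it can help to see the steps assembled into one per-frame loop. The following is a hedged sketch of that assembly, not the author's original function. It takes an array instead of a file path, assumes K is even, and replaces `np.linalg.eig` plus `pinv` with `scipy.linalg.eigh` on the pair (noisy covariance, noise covariance). That swap yields the same estimator when the noise covariance is positive definite and the eigenvalues are distinct, and it keeps every quantity real.

```python
import numpy as np
from scipy.linalg import eigh

def subspace_enhance(noisy_signal, K=80, T=10, mu=10, threshold=0.1):
    """Sketch of the full per-frame pipeline; returns the enhanced signal."""
    x = np.asarray(noisy_signal, dtype=float)
    hop = K // 2                                   # assumes K is even
    n_max = int(2 * (len(x) - K) / K)
    offset = T * hop

    # 1. Noise covariance from the frames whose mean energy is below the threshold
    frames = np.stack([x[n * hop:n * hop + K] for n in range(n_max)])
    noise = frames[np.mean(frames ** 2, axis=1) < threshold]
    if len(noise) == 0:
        raise ValueError("no frame fell below the energy threshold")
    Rn = noise.T @ noise / len(noise)

    enhanced = np.zeros(len(x))
    window = np.hamming(K)
    for n in range(T, n_max - T):
        a = n * hop
        # 2. Covariance of the noisy signal around the current frame
        Z = np.stack([x[o:o + K] for o in range(a - offset, a + offset)])
        Rz = Z.T @ Z / len(Z)
        # 3. Generalized eigenproblem Rz v = w Rn v; the eigenvalues of sigma are w - 1
        w, V = eigh(Rz, Rn)
        lam = w - 1.0
        # 4. Gains lambda / (lambda + mu) on the positive eigenvalues, zero elsewhere
        gains = np.where(lam > 0, lam / (lam + mu), 0.0)
        # eigh scales V so that V^T Rn V = I, hence inv(V^T) = Rn V
        Hopt = Rn @ V @ np.diag(gains) @ V.T
        # 5. Filter the frame, window it and overlap-add it into the output
        enhanced[a:a + K] += window * (Hopt @ x[a:a + K])
    return enhanced

# Synthetic check: white noise everywhere, a 200 Hz tone in the second half
rng = np.random.default_rng(0)
fs = 8000
noisy = 0.05 * rng.standard_normal(fs)
noisy[fs // 2:] += 0.5 * np.sin(2 * np.pi * 200 * np.arange(fs // 2) / fs)
out = subspace_enhance(noisy, K=16, T=4, mu=10, threshold=0.01)
```

The small K and T in the usage example only keep the demonstration fast; they are not recommended settings.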





#### Write the Enhanced Signal to an Audio File



In [46]:
#fs, enhanced_signal = subspace_enhance('Audio/noisy_SNR_10.wav', 80, 10, 10, 0.1)
wavfile.write('Audio/denoised_SNR_10.wav',fs_noisy,Enhance_signal)
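
The call above writes a floating-point array, which `scipy.io.wavfile.write` stores as a floating-point WAV file. Some audio players may not open such files, so one option is to convert the signal to 16-bit PCM first. The helper `to_int16` below is a hypothetical name, and the 440 Hz tone merely stands in for `Enhance_signal`.

```python
import io
import numpy as np
from scipy.io import wavfile

def to_int16(signal):
    """Scale a float signal in [-1, 1] to 16-bit PCM, clipping anything outside that range."""
    clipped = np.clip(np.ravel(signal), -1.0, 1.0)
    return (clipped * 32767).astype(np.int16)

# Stand-in for Enhance_signal: one second of a 440 Hz tone with shape (N, 1)
fs = 16000
demo = 0.3 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs).reshape(-1, 1)

# An in-memory buffer keeps the example self-contained; a file path works the same way
buffer = io.BytesIO()
wavfile.write(buffer, fs, to_int16(demo))
buffer.seek(0)
rate, data = wavfile.read(buffer)
```

Reading the buffer back returns the sampling rate and an `int16` array, the same format as the input file that was normalized by 32768 at the start.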

### Results

To analyse the results given by the code, we present the clean signal, the noisy signal and the enhanced (estimated) signal in four different ways.

1. [TimeDomain](#Time-Domain)

2. [FrequencyDomain](#Frequency-Domain)

3. [Spectrogram](#Spectrogram)

4. [Audio](#Audio)

#### Time-Domain
Looking at the noisy signal, we can observe that its shape differs slightly from that of the clean signal and that the noise adds power where there was none before.

If we compare the clean signal with the enhanced signal, we can see that there is practically no difference, which suggests that the noise has been properly removed. The amplitude of the enhanced signal is slightly smaller at almost all points in time, which may be because neither our estimator nor the way we selected the noise samples for the noise covariance is perfect. This last point would also explain why the last part of the signal is completely lost. As we can see in the noisy signal, the last part of the speech is almost indistinguishable from the noise. When the VAD (voice activity detection) was performed, the energy of the signal at that point was below the threshold, so it was considered noise: it was removed from the "noiseless subspace" and added to the "noise subspace".

![timeDomain](img/timeDomain.jpg)

#### Frequency-Domain

Just as in the time domain, we observe that some power has been lost, but that overall the frequencies that belong to the clean signal also appear in the enhanced one. 
In addition, seeing how the noise in the noisy signal is spread homogeneously across all frequencies, we can confirm that our noise is white. 

![Spectrum](img/Spectrum.jpg)
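
The flatness that identifies white noise can also be checked numerically. This sketch estimates the power spectral density of synthetic white noise with Welch's method and measures how far the estimate deviates from a flat line; the sampling rate, the segment length and the generated noise are assumptions made for illustration, not the signals used in this report.

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)
fs = 16000                                # assumed sampling rate
white = rng.standard_normal(10 * fs)      # stand-in for a noise-only segment

freqs, psd = welch(white, fs=fs, nperseg=256)
# Skip the two lowest bins (affected by mean removal) and the Nyquist bin
inner = psd[2:-1]
spread_db = 10 * np.log10(inner.max() / inner.min())
```

For white noise the spread between the strongest and the weakest frequency bin stays small; colored noise would show a clear slope or peaks instead.
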
#### Spectrogram
The spectrogram shows how strong the signal is at a given frequency and at a given point in time. Yellow means that the signal is strong, while blue means that there is almost no power. 
With these concepts in mind, we can analyze the following figure:
![Spectrogram](img/SpectrogramBig.jpg)
**Observations:**
As we can see, the first figure represents the clean signal, with all the power concentrated in a certain time interval and strongest at the lower frequencies. The second figure is the noisy signal; so much yellow is spread uniformly because the noise is more or less constant at all frequencies and at all times. With some effort we can still make out the pattern of our original signal, but on the whole it is confusing for the eyes and, more importantly, for the ear, as will be shown. 
Finally, the third subplot shows the signal obtained through speech enhancement. Compared with the one above, we can see that even though there is still some noise, the situation is much better, as its power has dropped from yellow (-80 dB) to light blue (-100 dB).
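
For readers who want to reproduce this kind of figure, a spectrogram in dB can be computed as sketched below. The code that produced the figures above is not shown in this report, so this is only one possible way to do it; the 1 kHz test tone, the sampling rate and the segment length are assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                     # assumed sampling rate
t = np.arange(fs) / fs
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)      # stand-in signal: a 1 kHz tone

f, times, Sxx = spectrogram(tone, fs=fs, nperseg=256)
Sxx_db = 10 * np.log10(Sxx + 1e-12)            # small offset avoids log(0)
peak_freq = f[np.argmax(Sxx.mean(axis=1))]     # frequency bin with the most power
```

The array `Sxx_db` has one row per frequency and one column per time frame, so it can be drawn with, for example, matplotlib's `pcolormesh(times, f, Sxx_db)`.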

#### Audio

Here we have the noisy signal with an SNR of 5 dB:

In [52]:
import IPython.display as ipd
ipd.Audio('Testing_Audio/input5.wav')

Here we have speech enhancement with spectral subtraction:

In [51]:
import IPython.display as ipd
ipd.Audio('Testing_Audio/spectre5.wav')

Here we have the cleaned-up signal obtained with the subspace method (the "Enhanced Signal"):

In [50]:
import IPython.display as ipd
ipd.Audio('Testing_Audio/subspace5.wav')

### Neural network Analysis

Since checking the results visually is not enough to know objectively how good our speech enhancement is, we use a neural network to tell us the degree of confidence with which the word _"No"_ spoken in the audio can be recognized. 

The graph shows three signals: the noisy one in orange, the signal obtained with spectral subtraction in blue and, in green, the enhanced signal obtained with the subspace method. The _y_ axis shows the level of confidence as a percentage, while the _x_ axis represents the SNR. In this case we used an SNR of 5 dB, so this is the point we will analyze. 


As can be seen, in the noisy signal the word _"No"_ is barely recognizable, as it scarcely reaches 20 % confidence, which means that the neural network is very unsure of the speech contained in the noisy signal. For the _"single noise channel removal signal"_ the confidence is higher (around 30 %), but this is still a very low value, especially compared with the degree of confidence obtained with the method we used. With it, we go from 20 % confidence to 90 %, which means that the neural network is almost sure that the word spoken in the audio is _"No"_.


![Scores](img/Scores.png)



_Note_: This neural network was developed by Alexis Mermet (alexis.mermet@epfl.ch).

### Playing with parameter $\mu$



As explained in the first comments of the source code, the parameter $\mu$ is essential in our speech enhancement method. By changing its value, we can obtain a signal with more distortion and less residual noise, or with less distortion and more residual noise. 
#### Confidence for other $\mu$ values 
- High values of $\mu$ reduce the residual noise, but create distortion in the speech. 
- Low values of $\mu$ reduce the distortion, at the cost of more residual noise. 
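
The trade-off follows directly from the gain applied to each eigenvalue, lambda / (lambda + mu), as used when building Q1 in the implementation. A small numeric sketch, with hypothetical eigenvalues chosen only for illustration:

```python
import numpy as np

def gains_for(eigenvalues, mu):
    """Gain applied to each positive eigenvalue of sigma for a given mu."""
    return eigenvalues / (eigenvalues + mu)

eigenvalues = np.array([20.0, 5.0, 1.0, 0.2])   # hypothetical positive eigenvalues
table = {mu: gains_for(eigenvalues, mu) for mu in (1, 10, 50)}

for mu, gains in table.items():
    print(mu, np.round(gains, 3))
```

A gain close to 1 passes a component almost unchanged (little distortion, more residual noise), while a gain close to 0 suppresses it. Raising $\mu$ pushes every gain towards 0, and the components with small eigenvalues, which are dominated by noise, are attenuated the most.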
##### Parameter $\mu$ = 1
The following image shows the confidence when we use $\mu$ = 1, which means that we have low distortion, but the noise isn't removed as well. Despite this, the level of confidence is still the best of the three cases, and the improvement is more significant for low SNRs (up to 15 dB). 

_ADD AUDIO FILE_
![mu1](img/mu1.JPG)

##### Parameter $\mu$ = 50
As in the previous case, we have changed $\mu$, only this time instead of making it smaller we have multiplied it by 5 with respect to the value used at the beginning. Now, in exchange for less noise, we get more distortion, which can be heard in the audio. 
From the curve we can see that we get better confidence with higher $\mu$ values, which suggests that the neural network copes better with distortion than with noise. However, this changes at very high SNR levels, where the curve drops below those of the noisy signal and of the single channel noise removal signal.

_ADD AUDIO FILE_ 
![mu50](img/mu50.JPG)

## References

_A Subspace Approach for Enhancing Speech Corrupted by Colored Noise_ Yi Hu and Philipos C. Loizou, Member, IEEE, IEEE Signal Processing Letters, Vol. 9, No. 7, July 2002 

_Subspace-Based Speech Enhancement and Implementation_ Kang Liu and Genke Yang, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China, 200240

_A Signal Subspace Approach for Speech Enhancement_ Yariv Ephraim, Fellow, IEEE, and Harry L. Van Trees, Life Fellow, IEEE, IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 4, July 1995 