# Vocal Melody Extraction with XGBoost (Vocalyze)

#### Aim 
* The aim of this project is to extract the <b>vocal</b> melody of a song, discarding the background instrumentals
<br> <br>
* Given an audio sample, the ideal output classifies each vocal region as one of the 12 musical notes
<br> <br>
* I achieve this by training an initial model to identify vocal regions, then training a second model to deduce the melody from these vocal regions using regression
<br> <br>
* *This project has applications in melody generation, speech denoising and voice recognition*
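
The step from a predicted frequency to one of the 12 notes is not implemented in this notebook. A minimal sketch of how it could look, assuming equal temperament with A4 = 440 Hz (the tuning reference and the `freq_to_note` helper are illustrative assumptions, not project code):

```python
import math

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def freq_to_note(freq_hz, a4_hz=440.0):
    """Map a frequency in Hz to one of 12 note names (hypothetical helper).

    Returns None for non-positive frequencies, which this notebook
    treats as non-vocal frames.
    """
    if freq_hz <= 0:
        return None
    # MIDI note number: 69 is A4, with 12 semitones per octave
    midi = int(round(69 + 12 * math.log2(freq_hz / a4_hz)))
    return NOTE_NAMES[midi % 12]

print(freq_to_note(440.0))   # A
print(freq_to_note(261.63))  # C
print(freq_to_note(0.0))     # None
```

Taking the MIDI number modulo 12 drops the octave, so 220 Hz and 880 Hz also map to A.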


## Data Preparation and Preprocessing

In [192]:
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio as play
import librosa
from librosa import display
plt.rcParams['figure.figsize'] = (15,8)

In [193]:
#Function to denoise the audio
def denoise(array,sample_rate,filter_intensity):
    S_full, phase = librosa.magphase(librosa.stft(array))
    S_filter = librosa.decompose.nn_filter(S_full,
                                       aggregate=np.median,
                                       metric='cosine',
                                       width=int(librosa.time_to_frames(2, sr=sample_rate)))
    S_filter = np.minimum(S_full, S_filter)

    margin_i, margin_v = 2, filter_intensity
    power = 2
    mask_i = librosa.util.softmask(S_filter,
                                   margin_i * (S_full - S_filter),
                                   power=power)
    mask_v = librosa.util.softmask(S_full - S_filter,
                                   margin_v * S_filter,
                                   power=power)
    S_foreground = mask_v * S_full
    S_background = mask_i * S_full
    
    y_foreground = librosa.istft(S_foreground * phase)
      
    return y_foreground

In [194]:
#Function for getting truth values of each audio sample in dataset
def get_truth(index):
    # Reference files are zero-padded to two digits, e.g. train01REF.txt
    path = 'datasets/mirex05TrainFiles/train{:02d}REF.txt'.format(index)
    with open(path) as f:
        content = f.readlines()

    # Each line is tab-separated; the second column holds the frequency
    # (a value of 0 is treated as non-vocal later on)
    for i in range(len(content)):
        split = content[i].replace('\n','\t').split('\t')
        content[i] = float(split[1])

    return content
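
The same parsing and the later vocal/non-vocal labelling can be sketched with NumPy. The sample text below is made up and assumes a two-column, tab-separated layout (time, then frequency); `load_frequencies` is a hypothetical helper, not part of the notebook:

```python
import io
import numpy as np

# Made-up sample in the assumed layout: time (s) <tab> frequency (Hz)
sample = "0.00\t0.0\n0.01\t0.0\n0.02\t220.5\n0.03\t221.0\n"

def load_frequencies(file_obj):
    """Return the second column (frequency) of a tab-separated reference file."""
    table = np.loadtxt(file_obj, delimiter='\t', ndmin=2)
    return table[:, 1]

freqs = load_frequencies(io.StringIO(sample))
voiced = (freqs > 0).astype(int)  # 1 = vocal frame, 0 = non-vocal frame
print(voiced)  # [0 0 1 1]
```

Passing an open file handle instead of the `StringIO` object would work the same way.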


### Preprocessing and collection

* Preprocessing starts by denoising the audio with a nearest-neighbours filter
* MFCCs (Mel-frequency cepstral coefficients), FFT (Fast Fourier Transform) magnitudes and RMS (root-mean-square) energy are then extracted as features for every 0.01 seconds of the audio
* 20 seconds of audio were used from each audio sample
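
The framing described above can be sketched with NumPy alone. This assumes a 22050 Hz sample rate (the default used by `librosa.load`), so a 0.01 s frame is about 220 samples. The sine wave and the `frame_features` helper are illustrative; MFCCs are left out and a single RMS value per frame is used to keep the example dependency-free:

```python
import numpy as np

SR = 22050    # assumed sample rate (librosa.load default)
FRAME = 220   # roughly 0.01 s at 22050 Hz

def frame_features(frame):
    """RMS energy followed by the magnitude spectrum of one frame."""
    rms = np.sqrt(np.mean(frame ** 2))
    # Keep the first half of the spectrum; for real input the rest mirrors it
    magnitude = np.abs(np.fft.fft(frame)[:len(frame) // 2])
    return np.concatenate(([rms], magnitude))

# One second of a 440 Hz sine wave as stand-in audio
t = np.arange(SR) / SR
signal = np.sin(2 * np.pi * 440 * t)

first = frame_features(signal[:FRAME])
n_frames = len(signal) // FRAME
print(first.shape)  # (111,)
print(n_frames)     # 100
```

Each frame yields 1 RMS value plus 110 spectrum bins here; the notebook's feature vectors are longer because they also include the MFCCs.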

In [212]:
data = []
targetbinary = []
target = []

audiopath = 'datasets/mirex05TrainFiles/train{}.wav'
truthpath = 'datasets/mirex05TrainFiles/train{}REF.txt'

for i in range(1,14):
    if i >= 10:
        song, sr = librosa.load(audiopath.format(str(i)))
    else:
        song, sr = librosa.load(audiopath.format('0'+str(i)))
    
    #filtering signal
    filtered = denoise(song,sr,25)[:librosa.time_to_samples(24)]
    truth_vals = get_truth(i)[:2001]
    
    for c in range(2000):
        array = filtered[220*c:220*(c+1)]
        transform = np.abs(np.fft.fft(array)[:int(len(array)/2)])
        # NOTE: librosa.load resamples to 22050 Hz by default, while the MFCC call assumes sr=44000
        rms = librosa.feature.rms(y=array,frame_length=220,hop_length=110).flatten()
        mfcc = librosa.feature.mfcc(y=array,n_fft=220,sr=44000,n_mfcc=20).flatten()
        concat = np.concatenate((rms,mfcc,transform))

        data.append(concat)
        if truth_vals[c] > 0:
            targetbinary.append(1)
        else:
            targetbinary.append(0)
        target.append(truth_vals[c])
data = np.array(data)
target = np.array(target)
targetbin = np.array(targetbinary)

#NOTE: target = frequencies
#      targetbin = classes(0 and 1)



###  Normalisation

In [213]:
def normalise(x,y):
    # Min-max scale the features and targets to the range [0, 1].
    # The feature statistics come from the full training matrix `data`.
    global data
    Max = np.max(data)
    Min = np.min(data)
    yMax = np.max(y)
    yMin = np.min(y)

    X = (x - Min) / (Max - Min)
    Y = (y - yMin) / (yMax - yMin)
    return X,Y

In [214]:
cx,cy = normalise(data,targetbin)

In [215]:
x,y = normalise(data,target)
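
Min-max scaling maps values into [0, 1] using the minimum and maximum of the training data, and the same two statistics have to be reused on any new data. A minimal sketch with made-up numbers (`fit_minmax` and `apply_minmax` are hypothetical helpers, not functions from this notebook):

```python
import numpy as np

def fit_minmax(train):
    """Return the global minimum and maximum of the training data."""
    return float(np.min(train)), float(np.max(train))

def apply_minmax(x, lo, hi):
    """Scale x with statistics that were computed on the training data."""
    return (x - lo) / (hi - lo)

train = np.array([[-4.0, 0.0], [6.0, 1.0]])
lo, hi = fit_minmax(train)

scaled = apply_minmax(train, lo, hi)
print(scaled.min(), scaled.max())  # 0.0 1.0

# New data is scaled with the training statistics, not its own
new = apply_minmax(np.array([1.0]), lo, hi)
print(new)  # [0.5]
```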

## Model Training

In [216]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score,confusion_matrix
from xgboost import XGBClassifier,XGBRegressor

#### Vocal region classification

In [217]:
cxtr,cxte,cytr,cyte = train_test_split(cx,cy,test_size= 0.2)
xgb = XGBClassifier(n_estimators=300,max_leaves=0)
initial = xgb.fit(cxtr,cytr)
initial.score(cxtr,cytr)

0.999951923076923

In [218]:
initial.score(cxte,cyte)

0.8317307692307693

In [219]:
pred = initial.predict(cxte)

In [220]:
confusion_matrix(cyte,pred)

array([[1373,  580],
       [ 295, 2952]], dtype=int64)
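
The confusion matrix above can be unpacked into the usual summary metrics. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the entries are `[[TN, FP], [FN, TP]]`, with class 1 meaning a vocal frame. A short sketch using the numbers printed above:

```python
import numpy as np

cm = np.array([[1373, 580],
               [295, 2952]])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted vocal frames that are vocal
recall = tp / (tp + fn)      # share of vocal frames that were found
print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # 0.832 0.836 0.909
```

Recall on vocal frames is higher than precision: most errors are non-vocal frames labelled as vocal (580, against 295 missed vocal frames).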

#### Frequency/Melody Prediction

In [221]:
xtr,xte,ytr,yte = train_test_split(x,y,test_size= 0.2)
xgb_reg = XGBRegressor(n_estimators=300)
model = xgb_reg.fit(xtr,ytr)
model.score(xtr,ytr)

0.9628042837991148

In [222]:
model.score(xte,yte)

0.41397264385202237
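
For a regressor, `score` returns the coefficient of determination R² rather than a classification accuracy. A minimal sketch of the computation with made-up values, to show what scores of 1.0 and 0.0 mean (the `r2` helper is illustrative):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([0.0, 100.0, 200.0, 300.0])

print(r2(y_true, y_true))                                # 1.0
print(r2(y_true, np.full_like(y_true, y_true.mean())))   # 0.0
```

A test score of about 0.41 therefore means the model explains less than half of the variance in the target frequencies, while always predicting the mean would score 0.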

## Tests

In [223]:
def x_normalise(x):
    # Min-max scale new features with the statistics of the training matrix `data`
    global data
    Max = np.max(data)
    Min = np.min(data)

    X = (x - Min) / (Max - Min)
    return X

In [224]:
def preprocess_and_predict(signal,sr):
    global initial
    preprocessed = []
    filtered = denoise(signal,sr,100)
    n_frames = int((len(signal)/sr) * 100)  # number of 0.01 s frames
    for c in range(n_frames):
        array = filtered[220*c:220*(c+1)]
        transform = np.abs(np.fft.fft(array)[:int(len(array)/2)])
        rms = librosa.feature.rms(y=array,frame_length=220,hop_length=110).flatten()
        mfcc = librosa.feature.mfcc(y=array,n_fft=220,sr=44000,n_mfcc=20).flatten()
        concat = np.concatenate((rms,mfcc,transform))
        preprocessed.append(concat)
    preprocessed = np.array(preprocessed)
    normalised = x_normalise(preprocessed)
    prediction = initial.predict(normalised)
    return np.int32(prediction)

In [225]:
def combine_with_signal(song,prediction):
    predsong = []
    for x in range(len(prediction)):
        predsong.append(song[220*x:220*(x+1)]*(prediction[x]))
    predsong=np.array(predsong).flatten()
    return predsong
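
The same masking can be written without a Python loop by repeating each frame-level prediction 220 times and multiplying it with the signal. A sketch under that assumption (`mask_signal` is a hypothetical helper, and the all-ones signal is only for illustration):

```python
import numpy as np

def mask_signal(song, prediction, frame=220):
    """Zero out every frame whose prediction is 0 (hypothetical helper)."""
    mask = np.repeat(prediction, frame)   # one mask value per sample
    n = min(len(song), len(mask))         # guard against a length mismatch
    return song[:n] * mask[:n]

song = np.ones(660)
masked = mask_signal(song, np.array([1, 0, 1]))
print(masked.reshape(3, 220).sum(axis=1).tolist())  # [220.0, 0.0, 220.0]
```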

#### Identifying Vocal regions

In [226]:
audiopath = 'Bill Withers - Aint No Sunshine.mp3'
song,sr = librosa.load(audiopath,duration=30)
pred1 = preprocess_and_predict(song,sr)
combined = combine_with_signal(song,pred1)



In [227]:
play(song,rate=sr)

In [228]:
play(combined,rate=sr)

#### Predicting Melody/Frequencies

<b>No tests yet</b>

## Issues

All issues are tracked in the Issues section of this repository.

1. The melody/frequency prediction model performs poorly with the provided FFT, MFCC and RMS features, reaching a test R² score of only 0.41. A score of 0.80 is the aim.
2. The vocal region prediction model performs fairly well with the provided features, with a test accuracy of 83%. An accuracy of 92% is the aim.
3. Preprocessing could be made faster.
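
One possible direction for the preprocessing speed issue is to compute the FFT and RMS features for all frames at once instead of frame by frame. A rough sketch with NumPy, assuming non-overlapping 220-sample frames as in the notebook. The `batch_features` helper and the random test signal are illustrative, MFCCs are not covered, it uses a single RMS value per frame (unlike the notebook's `librosa.feature.rms` call), and any speed-up has not been measured here:

```python
import numpy as np

def batch_features(signal, frame=220):
    """RMS and magnitude spectrum for every frame in one vectorised pass."""
    n_frames = len(signal) // frame
    # Drop the incomplete tail and view the signal as (n_frames, frame)
    frames = signal[:n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1, keepdims=True))
    magnitude = np.abs(np.fft.fft(frames, axis=1)[:, :frame // 2])
    return np.hstack((rms, magnitude))

rng = np.random.default_rng(0)
signal = rng.standard_normal(2205)   # stand-in for 0.1 s of audio at 22050 Hz
features = batch_features(signal)
print(features.shape)  # (10, 111)
```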