# 1) Problem formulation

* Under supervised learning, we have so far only looked at multi-class classification, i.e. out of `n` labels, each instance is assigned to exactly one label.

* In this advanced part, we look at multi-target classification, i.e. each instance is assigned a label for several target attributes at the same time.

* Using the MLEnd Hums and Whistles dataset, we will build a machine learning pipeline that takes as input a `Potter`, `Frozen` or `Showman` audio segment and predicts two target attributes: its song label (Potter, Frozen or Showman) and its genre (Soundtrack, Show Tune or Pop).
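To make the difference between the two settings concrete, the following minimal sketch shows how the targets are laid out in each case. The four label values are hypothetical and only illustrate the encoding used later in this report.

```python
import numpy as np

# Hypothetical labels for four recordings, using the encoding described in this report:
# song:  0 = Potter, 1 = Frozen, 2 = Showman
# genre: 0 = Soundtrack, 1 = Show Tune, 2 = Pop
y_song = np.array([0, 1, 2, 1])
y_genre = np.array([0, 1, 2, 1])

# Multi-class: a single target column with one label per instance
print(y_song.shape)   # (4,)

# Multi-target: one column per target attribute
Y = np.column_stack([y_song, y_genre])
print(Y.shape)        # (4, 2)
```

A multi-target classifier is trained on the two-column array `Y` rather than on a single label vector.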

# 2) Machine Learning pipeline

* The ML pipeline consists of the following stages:

1) The input is the 5 features extracted from each audio file by the function `getXy`, and the output is a model trained on the training set, which is then evaluated on the validation set.

2) The features fed into the pipeline are scaled by the class `StandardScaler`. The `StandardScaler` is fitted to the features in the training set and then standardises them (zero mean, unit variance). When the pipeline is later used to predict labels for the validation or test dataset, the already fitted scaler is used to transform those features.

3) Finally, the model specified in the pipeline is fitted to the training dataset, and the same model is then used to predict the labels for the validation or test dataset and to evaluate its performance.
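The stages above can be sketched on synthetic data as follows. This is only an illustration under assumptions: the random arrays stand in for the real audio features, a single target is used for brevity, and the names `X_tr`, `y_tr` and `X_val` are hypothetical.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for the 5 audio features (not the real data)
X_tr = rng.normal(loc=100.0, scale=20.0, size=(60, 5))
y_tr = rng.integers(0, 3, size=60)
X_val = rng.normal(loc=100.0, scale=20.0, size=(20, 5))

pipe = Pipeline([("scaler", StandardScaler()),
                 ("classifier", LogisticRegression())])
pipe.fit(X_tr, y_tr)  # fits the scaler and the classifier on the training data only

# The fitted scaler reuses the training statistics when transforming new data
scaler = pipe.named_steps["scaler"]
X_val_scaled = scaler.transform(X_val)
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))

# predict() applies the same scaling before calling the classifier
print(pipe.predict(X_val).shape)
```

Because the scaler is part of the pipeline, the validation data never influences the mean and standard deviation used for scaling.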

# 3) Transformation stage

* The inputs are the audio samples. From each audio sample, using the function `getXy`, we extract the following features (output):

1) Power <br>
2) Pitch-mean <br>
3) Pitch-std <br>
4) Voiced_fr <br>
5) Tempo

# 4) Modelling

* The following models will be implemented:<br>
1) Logistic Regression Classifier<br>
2) K-Nearest Neighbour (KNN)<br>
3) Support Vector Machine (SVM) <br>
4) Random Forest Classifier

# 5) Methodology

* For this problem, we decided to use only the files whose names are in the correct format. Using regular expressions, we obtained the audio segments with correctly formatted file names, and then shuffled them so that, after the split, neither set (train or test) is dominated by samples of one class. We then used the function `getXy` to obtain a NumPy array containing the 5 audio features used as predictors (`X`) and the values of our two target attributes, `y1` (song label) and `y2` (genre label). For the target attribute `y1` (song), `y1=0` if the song is Potter, `y1=1` if it is Frozen and `y1=2` if it is Showman. Similarly, for the target attribute `y2` (genre), `y2=0` if the genre is `Soundtrack`, `y2=1` if it is `Show Tune` and `y2=2` if it is `Pop`. We then created the dataframe `df`, consisting of 5 feature columns and 2 target columns, `y1` and `y2`.


* After that, we split the dataframe `df` into two sets called `train` and `test` using the function `train_test_split` from the scikit-learn library with `test_size=0.3`, so the `test` set contains 30% of the instances in `df` (150 of 497). We then split `train` into `X_train` and `y_train`, and `test` into `X_test` and `y_test`.


* We then perform K-fold cross-validation on the training set `train` using the function `cross_validation_accuracy`, which reports the mean training accuracy and the mean validation accuracy. We compare the mean validation accuracy of the different models and choose the model with the highest value.


* After selecting the best classifier, we train the chosen model on the `train` set and test it on the `test` set.



# 6) Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os, sys, re, pickle, glob
#import urllib.request
#import zipfile

#import IPython.display as ipd
from tqdm import tqdm
import librosa


In [2]:
# directory_to_extract_to = r'C:\Users\HEET\Desktop\Principles of Machine Learning\Project\Potter\sample'
# zip_path = r'C:\Users\HEET\Desktop\Principles of Machine Learning\Project\Potter\Potter_1.zip'
# with zipfile.ZipFile(zip_path, 'r') as zip_ref:
#     zip_ref.extractall(directory_to_extract_to)

sample_path = r'C:\Users\HEET\Desktop\Principles of Machine Learning\Project\Potter\sample\*.wav'
Potter_files = glob.glob(sample_path)

# Here we are using regular expressions to find those files which are in correct format
# After finding the file which is in correct format 
# we append it to the empty list Potter_correct_format_files and Potter_correct_format
import re
Potter_correct_format_files =[]
Potter_correct_format=[]
for file in Potter_files:
    a = file.split('\\')[-1]
    x = re.search(r"^[S](.|..|...)_[a-zA-Z](..|......)_\d_[P]......(wav|Wav|WAV)",a)
    if x is not None:
        Potter_correct_format.append(x.string)
        Potter_correct_format_files.append(file)
        
len(Potter_correct_format_files)


165
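In the pattern used above, each `.` matches any character, so the filter is fairly permissive. The following is a sketch of a stricter check, assuming the intended convention is `S<participant number>_<hum or whistle>_<digit>_<song>.wav`. The pattern and the helper `is_valid_name` are hypothetical and are not what produced the file counts reported in this notebook.

```python
import re

# Hypothetical stricter pattern: S<1-3 digits>_<hum|whistle>_<digit>_<song>.<wav>
FILENAME_RE = re.compile(
    r"S\d{1,3}_(?:[Hh]um|[Ww]histle)_\d_(?:Potter|Frozen|Showman)\.(?:wav|Wav|WAV)"
)

def is_valid_name(name):
    """Return True if the whole file name matches the assumed convention."""
    return FILENAME_RE.fullmatch(name) is not None

names = ["S100_hum_1_Potter.wav", "S94_whistle_2_Showman.wav",
         "S9_hum_3_Showman.WAV", "S12-hum-1-Frozen.wav", "hum_1_Potter.wav"]
print([n for n in names if is_valid_name(n)])
# ['S100_hum_1_Potter.wav', 'S94_whistle_2_Showman.wav', 'S9_hum_3_Showman.WAV']
```

Using `fullmatch` also rejects names with extra characters after the extension, which `re.search` with only a leading `^` anchor does not.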

In [3]:
# directory_to_extract_to = r'C:\Users\HEET\Desktop\Principles of Machine Learning\Project\Frozen\sample'
# zip_path = r'C:\Users\HEET\Desktop\Principles of Machine Learning\Project\Frozen\Frozen_1.zip'
# with zipfile.ZipFile(zip_path, 'r') as zip_ref:
#     zip_ref.extractall(directory_to_extract_to)

sample_path = r'C:\Users\HEET\Desktop\Principles of Machine Learning\Project\Frozen\sample\*.wav'
Frozen_files = glob.glob(sample_path)

import re
Frozen_correct_format_files=[]
Frozen_correct_format=[]
for file in Frozen_files:
    a = file.split('\\')[-1]
    x = re.search(r"^[S](.|..|...)_[a-zA-Z](..|......)_\d_[F]......(wav|Wav|WAV)",a)
    if x is not None:
        Frozen_correct_format.append(x.string)
        Frozen_correct_format_files.append(file)
        
len(Frozen_correct_format_files)

165

In [4]:
# directory_to_extract_to = r'C:\Users\HEET\Desktop\Principles of Machine Learning\Project\Showman\sample'
# zip_path = r'C:\Users\HEET\Desktop\Principles of Machine Learning\Project\Showman\Showman_1.zip'
# with zipfile.ZipFile(zip_path, 'r') as zip_ref:
#     zip_ref.extractall(directory_to_extract_to)

sample_path = r'C:\Users\HEET\Desktop\Principles of Machine Learning\Project\Showman\sample\*.wav'
Showman_files = glob.glob(sample_path)

import re
Showman_correct_format_files=[]
Showman_correct_format=[]
for file in Showman_files:
    a = file.split('\\')[-1]
    x = re.search(r"^[S](.|..|...)_[a-zA-Z](..|......)_\d_[S].......(wav|Wav|WAV)",a)
    if x is not None:
        Showman_correct_format.append(x.string)
        Showman_correct_format_files.append(file)        
        
len(Showman_correct_format_files)

167

In [5]:
MLENDHW_table = []

# Each song belongs to exactly one genre in this dataset
song_groups = [(Potter_correct_format, "Soundtrack"),
               (Frozen_correct_format, "Show Tune"),
               (Showman_correct_format, "Pop")]

for file_names, genre in song_groups:
    for file in file_names:
        # File names follow <participant>_<interpretation>_<number>_<song>.wav
        file_name = file.split('\\')[-1]
        parts = file_name.split('_')
        participant_ID, interpretation_type, interpretation_number = parts[0], parts[1], parts[2]
        song = parts[3].split('.')[0]
        MLENDHW_table.append([file_name,participant_ID,interpretation_type,interpretation_number, song, genre])

MLENDHW_df = pd.DataFrame(MLENDHW_table,columns=['file_id','participant','interpretation','number','song','genre']).set_index('file_id') 

# a dataframe containing only the files whose file_id is in the correct format
MLENDHW_df

| file_id | participant | interpretation | number | song | genre |
|---|---|---|---|---|---|
| S100_hum_1_Potter.wav | S100 | hum | 1 | Potter | Soundtrack |
| S100_hum_2_Potter.wav | S100 | hum | 2 | Potter | Soundtrack |
| S101_hum_1_Potter.wav | S101 | hum | 1 | Potter | Soundtrack |
| S101_hum_2_Potter.wav | S101 | hum | 2 | Potter | Soundtrack |
| S102_hum_2_Potter.wav | S102 | hum | 2 | Potter | Soundtrack |
| ... | ... | ... | ... | ... | ... |
| S94_whistle_2_Showman.wav | S94 | whistle | 2 | Showman | Pop |
| S97_hum_3_Showman.wav | S97 | hum | 3 | Showman | Pop |
| S97_hum_4_Showman.wav | S97 | hum | 4 | Showman | Pop |
| S9_hum_3_Showman.wav | S9 | hum | 3 | Showman | Pop |


In [6]:
from sklearn.utils import shuffle
files = Potter_correct_format_files + Frozen_correct_format_files + Showman_correct_format_files

# We shuffle the files so that, after the split, neither set is dominated by samples of one label
files = shuffle(files,random_state=0)


In [7]:
def getPitch(x,fs,winLen=0.02):
  #winLen = 0.02 
  p = winLen*fs
  frame_length = int(2**int(p-1).bit_length())
  hop_length = frame_length//2
  f0, voiced_flag, voiced_probs = librosa.pyin(y=x, fmin=80, fmax=450, sr=fs,
                                                 frame_length=frame_length,hop_length=hop_length)
  return f0,voiced_flag

In [8]:
def getXy(files,labels_file, scale_audio=False, onlySingleDigit=False):
  X,y1,y2 =[],[],[]
  for file in tqdm(files):
    fileID = file.split('\\')[-1]

    # Encode the song and genre names as the integer labels 0, 1 and 2
    song_labels = {'Potter': 0, 'Frozen': 1, 'Showman': 2}
    genre_labels = {'Soundtrack': 0, 'Show Tune': 1, 'Pop': 2}
    y_1 = song_labels[labels_file.loc[fileID, 'song']]
    y_2 = genre_labels[labels_file.loc[fileID, 'genre']]

    fs = None # sr=None keeps each file's native sampling rate (librosa's default is 22050 Hz)
    x, fs = librosa.load(file,sr=fs)
    if scale_audio: x = x/np.max(np.abs(x))
    f0, voiced_flag = getPitch(x,fs,winLen=0.02)
      
    power = np.sum(x**2)/len(x)
    pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0))<1 else 0
    pitch_std  = np.nanstd(f0) if np.mean(np.isnan(f0))<1 else 0
    voiced_fr = np.mean(voiced_flag)
    tempo = librosa.beat.tempo(y=x, sr=fs)[0]  # pass the signal by keyword, as newer librosa versions expect

    xi = [power,pitch_mean,pitch_std,voiced_fr,tempo]
    X.append(xi)
    y1.append(y_1)
    y2.append(y_2)

  return np.array(X),np.array(y1),np.array(y2)

* After running the above code, we obtained 3 NumPy arrays, `X`, `y1` and `y2`, which we saved to the files `X_advanced.npy`, `y1.npy` and `y2.npy`.
* They are loaded below.
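The saving step is not shown in this notebook. A minimal sketch of how such arrays can be written and read back, assuming `np.save` was used, is given below. The small placeholder arrays and the temporary directory are hypothetical and only keep the example self-contained.

```python
import os
import tempfile
import numpy as np

# Small placeholder arrays standing in for the outputs of getXy
X_demo = np.array([[0.03, 187.1, 48.2, 0.80, 123.0],
                   [0.04, 282.7, 45.3, 0.79, 132.5]])
y1_demo = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp:
    path_X = os.path.join(tmp, "X_advanced.npy")
    path_y1 = os.path.join(tmp, "y1.npy")
    np.save(path_X, X_demo)      # write each array to its own .npy file
    np.save(path_y1, y1_demo)
    X_back = np.load(path_X)     # read the arrays back
    y1_back = np.load(path_y1)

print(np.array_equal(X_demo, X_back), np.array_equal(y1_demo, y1_back))
```

Caching the features this way avoids re-running the slow pitch and tempo extraction every time the notebook is opened.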

In [9]:
X = np.load("X_advanced.npy")
y1 = np.load("y1.npy")
y2 = np.load("y2.npy")

In [10]:
df = pd.DataFrame(X,columns = ["Power","Pitch_mean","Pitch_std","Voiced_fr","Tempo"])
df["y1"] = y1
df["y2"] = y2
df["y1"] = df["y1"].astype("category")
df["y2"] = df["y2"].astype("category")
df

| | Power | Pitch_mean | Pitch_std | Voiced_fr | Tempo | y1 | y2 |
|---|---|---|---|---|---|---|---|
| 0 | 0.034483 | 187.109414 | 48.174835 | 0.801676 | 123.046875 | 0 | 0 |
| 1 | 0.037909 | 282.691676 | 45.326666 | 0.791667 | 132.512019 | 1 | 1 |
| 2 | 0.069781 | 208.004928 | 39.000433 | 0.695428 | 132.512019 | 1 | 1 |
| 3 | 0.044952 | 188.459884 | 27.973096 | 0.798455 | 166.708669 | 2 | 2 |
| 4 | 0.018885 | 165.580954 | 23.057535 | 0.809588 | 112.347147 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 492 | 0.065192 | 379.606907 | 45.292738 | 0.621200 | 126.048018 | 1 | 1 |
| 493 | 0.053954 | 409.766027 | 26.274146 | 0.817330 | 139.674831 | 1 | 1 |
| 494 | 0.048838 | 138.307890 | 36.794285 | 0.821128 | 132.512019 | 0 | 0 |
| 495 | 0.006714 | 312.389485 | 50.907812 | 0.613213 | 112.500000 | 0 | 0 |


* In the above dataframe, under `y1`, the label `0` corresponds to the song `Potter`, label `1` to the song `Frozen` and label `2` to the song `Showman`. Similarly, under `y2`, the label `0` corresponds to the genre `Soundtrack`, label `1` to the genre `Show Tune` and label `2` to the genre `Pop`.

In [18]:
df.describe()

| | Power | Pitch_mean | Pitch_std | Voiced_fr | Tempo |
|---|---|---|---|---|---|
| count | 497.0 | 497.0 | 497.0 | 497.0 | 497.0 |
| mean | 0.038745 | 268.063734 | 40.371996 | 0.751324 | 127.682924 |
| std | 0.030188 | 99.324571 | 16.764239 | 0.117189 | 18.121737 |
| min | 0.001254 | 98.48153 | 9.823539 | 0.185645 | 90.666118 |
| 25% | 0.020171 | 175.896162 | 28.833734 | 0.690784 | 117.453835 |
| 50% | 0.030166 | 244.99804 | 35.243253 | 0.767737 | 126.048018 |
| 75% | 0.047424 | 370.729465 | 50.17796 | 0.833448 | 135.999178 |
| max | 0.25734 | 431.863787 | 124.472669 | 0.967474 | 198.768029 |


* Looking at the above statistics, we can see that the maximum values of `Power` and `Voiced_fr` do not exceed 1, whereas the maximum values of `Pitch_mean`, `Pitch_std` and `Tempo` all exceed 100. The features are therefore on very different scales, so we need to normalise them.

* We will use `df` to obtain the training (`train`) and test (`test`) sets.

* We will then further split the `train` into `X_train` and `y_train`. Similarly, we will split `test` into `X_test` and `y_test`.
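To illustrate what normalising the features does, the sketch below standardises the first row of the dataframe using the means and standard deviations from the `df.describe()` output. This is only an approximation of what happens in the pipeline: `describe()` reports the sample standard deviation over all 497 rows, whereas `StandardScaler` is fitted on the training data only and uses the population standard deviation, so the exact values will differ slightly.

```python
import numpy as np

# Column means and standard deviations copied from the df.describe() output
# (order: Power, Pitch_mean, Pitch_std, Voiced_fr, Tempo)
mean = np.array([0.038745, 268.063734, 40.371996, 0.751324, 127.682924])
std = np.array([0.030188, 99.324571, 16.764239, 0.117189, 18.121737])

# First row of the dataframe shown earlier
row = np.array([0.034483, 187.109414, 48.174835, 0.801676, 123.046875])

# z-score: subtract the column mean and divide by the column standard deviation
z = (row - mean) / std
print(np.round(z, 2))
```

After standardisation all five values are of comparable magnitude, so distance-based models such as KNN and SVM are no longer dominated by the features with large raw values.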



In [11]:
# Here we will obtain training and testing set
from sklearn.model_selection import train_test_split

train,test = train_test_split(df, test_size=0.3, random_state=0)# test_size is 30% of the data size
train.shape,test.shape

((347, 7), (150, 7))
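The split above relies on random shuffling to keep the classes roughly balanced between the two sets. An alternative, which is not used in this notebook, is to pass `stratify` to `train_test_split` so that the class proportions are preserved. The following sketch shows this on synthetic data, where `X_demo` and `y_demo` are hypothetical stand-ins for the features and the song labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 30 instances, 5 features, 3 equally frequent song labels
X_demo = np.arange(150, dtype=float).reshape(30, 5)
y_demo = np.repeat([0, 1, 2], 10)

# stratify=y_demo keeps the class proportions the same in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Number of instances per class in the training and test sets
print(np.bincount(y_tr), np.bincount(y_te))
```

With a small dataset such as this one, stratification can reduce the chance that the test accuracy is distorted by an unlucky class distribution.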

In [13]:
# We further split the training set into training set of predictors(X) and labels(y1 and y2)
X_train = train.iloc[: , :5]
y_train = train.iloc[:  , 5:]

# We further split the test set into testing set of predictors(X) and labels(y1 and y2)
X_test = test.iloc[: , :5]
y_test = test.iloc[: , 5:]

* We will now perform K-fold cross-validation on the training set using the function `cross_validation_accuracy`, which gives us the mean training and validation accuracy for each model.

In [15]:
def cross_validation_accuracy(classifier):
    training_accuracy = []
    validation_accuracy = []
    
    # Here we have the Pipeline (Pipeline, StandardScaler and KFold are imported in the next cell, before the first call)
    pipe = Pipeline([("scaler",StandardScaler()),
                     ("classifier",classifier)
                    ])
    print("The steps of the Pipe are:\n" ,pipe.steps,"\n")
    
    kf = KFold(n_splits = 10)
    
    # kf.split(train) gives us the training and validation row positions for each fold;
    # we use these positions to select the fold's training and validation data from X_train and y_train
    for train_index, val_index in kf.split(train):
        X_t, y_t = X_train.iloc[train_index], y_train.iloc[train_index]
        X_v, y_v = X_train.iloc[val_index], y_train.iloc[val_index]
        
        # the scaler and the classifier are re-fitted on the training part of every fold
        pipe.fit(X_t,y_t)
        
        # we are adding the training accuracy of each fold to the empty list training_accuracy
        training_accuracy.append(pipe.score(X_t , y_t))
        
        # we are adding the validation accuracy of each fold to the empty list validation_accuracy
        validation_accuracy.append(pipe.score(X_v, y_v))
        
        
    print("Mean Training Accuracy: ", np.mean(training_accuracy))
    print("Mean Validation Accuracy: ", np.mean(validation_accuracy))

# 7) Results

* Using `MultiOutputClassifier`, we fit one classifier per target attribute. This lets us use classifiers that do not natively support multi-target classification.
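The sketch below illustrates this on synthetic data and also shows how the accuracy is computed. As far as we understand scikit-learn's implementation, `MultiOutputClassifier.score` counts an instance as correct only when all of its targets are predicted correctly, so the accuracies reported in this section are exact-match accuracies over song and genre together. The arrays `X_demo` and `Y_demo` are random stand-ins, not the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
# Synthetic features and two target columns (not the real data)
X_demo = rng.normal(size=(90, 5))
Y_demo = np.column_stack([rng.integers(0, 3, size=90),
                          rng.integers(0, 3, size=90)])

clf = MultiOutputClassifier(LogisticRegression()).fit(X_demo, Y_demo)
Y_pred = clf.predict(X_demo)

# Accuracy of each target column on its own
per_target = (Y_pred == Y_demo).mean(axis=0)
# Fraction of instances for which both targets are predicted correctly
exact_match = np.all(Y_pred == Y_demo, axis=1).mean()

print(len(clf.estimators_))                                # one fitted classifier per target
print(np.isclose(clf.score(X_demo, Y_demo), exact_match))  # score() is the exact-match accuracy
print(per_target, exact_match)
```

Reporting the per-target accuracies alongside the exact-match accuracy can show whether one of the targets is harder to predict than the other.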

In [16]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

# i) Logistic Regression Classifier

In [17]:
cross_validation_accuracy(MultiOutputClassifier(LogisticRegression()))

The steps of the Pipe are:
 [('scaler', StandardScaler()), ('classifier', MultiOutputClassifier(estimator=LogisticRegression()))] 

Mean Training Accuracy:  0.47325305152781194
Mean Validation Accuracy:  0.46042016806722685


* We performed cross-validation on the `train` set: in each fold, we fitted the Logistic Regression classifier to the fold's training portion and obtained the training and validation accuracy.

* We obtained a mean training accuracy of 47.33% and a mean validation accuracy of 46.04%, which is very poor.

# ii) K-Nearest Neighbor(KNN) 

In [18]:
cross_validation_accuracy(MultiOutputClassifier(KNeighborsClassifier(7)))

The steps of the Pipe are:
 [('scaler', StandardScaler()), ('classifier', MultiOutputClassifier(estimator=KNeighborsClassifier(n_neighbors=7)))] 

Mean Training Accuracy:  0.6778600393217008
Mean Validation Accuracy:  0.5818487394957982


* We performed cross-validation on the `train` set: in each fold, we fitted the KNN classifier to the fold's training portion and obtained the training and validation accuracy.

* We obtained a mean training accuracy of 67.79% and a mean validation accuracy of 58.18%, which is poor, but in terms of mean validation accuracy it is better than the Logistic Regression classifier.

# iii) Support Vector Machine(SVM)

In [19]:
model  = svm.SVC(C=1)
cross_validation_accuracy(MultiOutputClassifier(model))

The steps of the Pipe are:
 [('scaler', StandardScaler()), ('classifier', MultiOutputClassifier(estimator=SVC(C=1)))] 

Mean Training Accuracy:  0.6887544032112722
Mean Validation Accuracy:  0.5931932773109243


* We performed cross-validation on the `train` set: in each fold, we fitted the Support Vector Machine model to the fold's training portion and obtained the training and validation accuracy.

* We obtained a mean training accuracy of 68.88% and a mean validation accuracy of 59.32%, which is poor, but in terms of mean validation accuracy it is better than the Logistic Regression classifier and KNN.

# iv) Random Forest Classifier

In [20]:
RF = RandomForestClassifier(n_estimators=150,max_features="sqrt",random_state=0)  # "auto" meant "sqrt" for classifiers and was removed in scikit-learn 1.3
cross_validation_accuracy(MultiOutputClassifier(RF))

The steps of the Pipe are:
 [('scaler', StandardScaler()), ('classifier', MultiOutputClassifier(estimator=RandomForestClassifier(n_estimators=150,
                                                       random_state=0)))] 

Mean Training Accuracy:  1.0
Mean Validation Accuracy:  0.6076470588235294


* We performed cross-validation on the `train` set: in each fold, we fitted the Random Forest classifier to the fold's training portion and obtained the training and validation accuracy.

* We obtained a mean training accuracy of 100% and a mean validation accuracy of 60.76%, which is still quite poor, but in terms of mean validation accuracy it is better than all the other models that we have considered. The large gap between the training and validation accuracy indicates that the model overfits the training data.

* Now we will train the RandomForestClassifier on training set `train` and test it on the testing set `test`.

In [21]:
RFC = RandomForestClassifier(n_estimators=150,max_features="sqrt",random_state=0)  # "auto" meant "sqrt" for classifiers and was removed in scikit-learn 1.3
pipe = Pipeline([("scaler",StandardScaler()),
                     ("classifier",MultiOutputClassifier(RFC))
                    ])

pipe.fit(X_train,y_train)

print("Test Accuracy:",pipe.score(X_test,y_test))

Test Accuracy: 0.5333333333333333


# 8) Conclusions

* We evaluated 4 classifiers: `Logistic Regression classifier`, `K-Nearest Neighbour (KNN) classifier`, `Support Vector Machine` and `Random Forest classifier`. We performed 10-fold cross-validation on the training set `train`, fitting each model to the training portion of every fold and computing the validation accuracy for that fold. Using `cross_validation_accuracy`, we obtained the mean training and validation accuracy for each of these models. The Random Forest classifier gave the highest mean validation accuracy (60.76%).

* After that, we trained the Random Forest classifier on the 5 features extracted from 70% of the original data, i.e. 347 audio samples, and finally tested the model on the test dataset, which consisted of the 5 features extracted from the remaining 150 audio samples. This gave a `test accuracy` of `53.33%`, which is poor.

* One of the reasons for such low accuracies may be that the 5 features are not sufficient to predict the song and its genre. We could therefore try to improve the accuracies by adding more features such as `Zero Crossing Rate` and `Mel-Frequency Cepstral Coefficients (MFCC)`. We should also inspect whether the features are correlated and, if they are, consider applying Principal Component Analysis (PCA).
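The correlation check and PCA step suggested above could look roughly like the following sketch. It uses synthetic data in which one column is deliberately correlated with another, and the 95% variance threshold is an arbitrary assumption, so it only illustrates the mechanics rather than a result for this dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features: column 1 is deliberately correlated with column 0
base = rng.normal(size=(200, 1))
X_demo = np.hstack([base,
                    base + 0.1 * rng.normal(size=(200, 1)),
                    rng.normal(size=(200, 3))])

# Pairwise correlation between the feature columns
corr = np.corrcoef(X_demo, rowvar=False)
print(np.round(corr[0, 1], 2))

# PCA is sensitive to scale, so the features are standardised first
X_scaled = StandardScaler().fit_transform(X_demo)
# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

In the actual pipeline, the PCA step would be placed between the scaler and the classifier so that it is fitted on the training portion of each fold only.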

