# MLE - Exercise 2 - Comparative Experimentation
## Andreas Kocman (se19m024)

## Assignment
This exercise follows very much the style of the previous exercise - you shall do experiments with different data sets and classifiers. Again, you can do the exercise alone, or in a group of two.

The datasets to use are
* The datasets from exercise 3, i.e. Iris, Optical Digits (and, if you are in a group, also Breast Cancer)
* Either the music or the image data set - decided by your matriculation number modulo 2, 0 means music, 1 means image (If you are doing this exercise in a group, then you shall take both data sets)

The classifiers & parameters to use are
* All the classifiers & parameters from exercise 3
* Decision trees, you shall have two setups: one fully grown tree, and one setting for a pruned or pre-pruned tree.
   * (If you are a group, you shall try a total of four settings: two unpruned trees using two different split criteria, and two setups for different amounts of (pre)-pruning the tree.)
* Random Forests, using two different settings for the number of trees
   * (If you are in a group, also vary the number of attributes that are used in each split; use three different values resp. computation methods (sqrt, log, fraction, ...); this should give you a total of 6 runs: (2 number of trees) x (3 number of attributes))
* SVMs: just use the default settings, but use both SVC and LinearSVC classifiers (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html, http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
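
The single-person classifier setups listed above can be sketched roughly as follows. This is a minimal illustration on Iris, not the configuration used later in this notebook: the concrete values (`max_depth=3`, 10 and 100 trees, `random_state=0`) are assumptions chosen only to make the example small and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative settings only; the concrete parameter values are assumptions.
setups = {
    "tree (fully grown)": DecisionTreeClassifier(random_state=0),
    "tree (pre-pruned, max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=0),
    "forest (10 trees)": RandomForestClassifier(n_estimators=10, random_state=0),
    "forest (100 trees)": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC (defaults)": SVC(),
    "LinearSVC (defaults)": LinearSVC(),
}

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for every setup
mean_accuracy = {}
for name, clf in setups.items():
    scores = cross_val_score(clf, X, y, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

`LinearSVC` may emit a convergence warning on unscaled data; the scores are still computed.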

### Image Dataset
We will use the "Fruit Image Dataset", originally provided at http://data.vicos.si/datasets/FIDS30/, but with an edited version linked from Moodle (some images had an encoding not compatible with e.g. python libraries). Your task is to classify images into the category of fruit (a total of 30 defined categories) they belong to.

As this is image data, feature extraction is a requirement before we can actually learn anything. As you shouldn't spend too much time on that, there is demo code on how to work with this data available, linked from the course main page.

This code generates a set of 4 different features, all rather simple and based on histograms of colours (i.e. counts on how often a certain colour appears). You shall work with all four of them, and likely will see very different results.

### Music Dataset
We will use the dataset provided by George Tzanetakis, called "gtzan". This dataset contains 1,000 songs, 100 songs for each of 10 genres, and the task is therefore to predict the genre of a song; to limit file size, the songs are only 30-second snippets, sampled at only 22 kHz. You can download the dataset from the Moodle main page, or at http://kronos.ifs.tuwien.ac.at/GTZANmp3_22khz.zip. As this is copyrighted material, please do not redistribute it!

As this is audio data, feature extraction is a requirement before we can actually learn anything. Therefore, there is demo code on how to work with this data available, linked from the course main page.

This code generates different features, ranging from very simple ones containing just beats per minute (BPM) to more advanced ones based on advanced signal processing. You shall work with all of them, and will likely see very different results.

### Working in a group
If you work in a group, as partially written above, your scope will be extended

* More datasets: both music & image datasets
* More parameter variations
* More evaluation: for the music and image datasets, you shall also add an analysis of the confusion matrix. It is sufficient to provide one confusion matrix per feature set; you can select either the best classifier that you had on that feature set, or other interesting ones.
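
A confusion matrix analysis of the kind described above could look roughly like this. The sketch uses the Digits dataset as a stand-in for a music or image feature set, and the classifier and its settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Digits is only a stand-in here for one extracted feature set.
X, y = load_digits(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Every sample is predicted exactly once, by a model that did not see it in training
y_pred = cross_val_predict(clf, X, y, cv=5)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y, y_pred)
print(cm)

# Find the most frequent confusion (largest off-diagonal cell)
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
true_cls, pred_cls = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)
print("Most frequent confusion: true class", true_cls, "predicted as", pred_cls)
```

Using `cross_val_predict` keeps the matrix consistent with a cross-validated evaluation, since each row sum equals the number of samples of that class.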

### Links for python
* http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
* http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

### Non-python feature extractors
#### Java
* For images, you can use a port of OpenCV, namely the OpenCV bindings (http://docs.opencv.org/2.4/doc/tutorials/introduction/desktop_java/java_dev_intro.html), or a different implementation in Java: https://github.com/bytedeco/javacv. The sample code should then be quite similar to the one in Python.
* For music: jAudio (http://jmir.sourceforge.net/index_jAudio.html) offers a GUI for extracting features. It is best to use BPM (strongest beat), MFCCs and Chroma, and their derivatives, i.e. the statistics that are also used in the sample code. jAudio should be able to generate ARFF files for WEKA.

#### C#
* For images, you should find OpenCV bindings for C# as well

## Sources used
* scikit-learn documentation

## Solution

### Datasets to Use
Matriculation number: SE19M024%2 = 0 - using music dataset

### Helper Functions for Solution and Data Analysis

In [4]:
# global Imports
import pandas as pd
import numpy as np

#sk learn imports
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer

#Data reporting
from IPython.display import display

# Global definitions:
averaging_approach = 'macro'
zero_division_approach = 0
number_of_folds = 5
scoring = {'Accuracy': make_scorer(accuracy_score),
            'Precision': make_scorer(precision_score, average=averaging_approach, zero_division=zero_division_approach),
            'Recall': make_scorer(recall_score, average=averaging_approach, zero_division=zero_division_approach)}

# Helper functions
def parse_k_fold_results(results):
    return "m: " + str(np.average(results)) + " std: " + str(np.std(results))

def parse_argument_tuple_as_string(argumentsTuple):
    return "max Depth: " + str(argumentsTuple[0])  + \
           ", min Samples: " + str(argumentsTuple[1])

def calculate_results_cross_validate(dataset_name, classifier_used, classifier_name, data, target):
    scores = cross_validate(classifier_used, data, target,
                            scoring = scoring,
                            cv = number_of_folds,
                            error_score = 0)

    return pd.Series({
        'dataset': dataset_name,
        'classifier': classifier_name,
        'arguments': str(classifier_used),
        'mean_accuracy': np.average(scores.get('test_Accuracy')),
        'mean_precision': np.average(scores.get('test_Precision')),
        'mean_recall': np.average(scores.get('test_Recall')),
        'accuracy': parse_k_fold_results(scores.get('test_Accuracy')),
        'precision': parse_k_fold_results(scores.get('test_Precision')),
        'recall': parse_k_fold_results(scores.get('test_Recall'))
    })

def print_results(array, column_for_max, ascending=False):
    df = pd.DataFrame(array)
    df = df.sort_values(by=[column_for_max], ascending=ascending)
    display('Results', df)

    best = df.iloc[df[column_for_max].argmax()]
    display(best)
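
As a quick check of the multi-metric scoring used by these helpers, the following sketch runs `cross_validate` with the same kind of scorer dictionary on Iris. The classifier (`KNeighborsClassifier` with 5 neighbours) is an arbitrary choice for illustration; the point is that every key `Name` in the dictionary shows up as `test_Name` in the result.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Same structure as the scoring dictionary defined in the helper cell
scoring = {
    'Accuracy': make_scorer(accuracy_score),
    'Precision': make_scorer(precision_score, average='macro', zero_division=0),
    'Recall': make_scorer(recall_score, average='macro', zero_division=0),
}

X, y = load_iris(return_X_y=True)
scores = cross_validate(KNeighborsClassifier(n_neighbors=5), X, y,
                        scoring=scoring, cv=5)

# One array with 5 entries (one per fold) for each metric
for key in ('test_Accuracy', 'test_Precision', 'test_Recall'):
    print(key, "m:", scores[key].mean(), "std:", scores[key].std())
```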


### Dataset Extraction Music

In [5]:
# We need to construct our data set; unfortunately, we don't simply have a "loadGTZanDataSet()" function in SK-learn...
# So we need to
## Download our data set & extract it (one-time effort)
## Run an audio feature extraction
## Create the ground truth (label assignment, target, ...)


# path to our audio folder
# For the first run, download the audio files from http://kronos.ifs.tuwien.ac.at/GTZANmp3_22khz.zip, and unzip them to your folder
imagePath="F:\\Informatik\\tw_mle_exercise4\\mp3\\"


# Find all songs in that folder; there are like 1.000 different ways to do this in Python, we chose this one :-)
import glob, os
print(os.getcwd())
os.chdir(imagePath)
fileNames = glob.glob("*/*.mp3")
numberOfFiles=len(fileNames)
targetLabels=[]

print ("Found " + str(numberOfFiles) + " files\n")

# The first step - create the ground truth (label assignment, target, ...)
# For that, iterate over the files, and obtain the class label for each file
# Basically, the class name is in the full path name, so we simply use that
for fileName in fileNames:
    # the genre is the name of the parent directory; os.path.dirname uses the platform's path separator
    targetLabels.append(os.path.dirname(fileName))

# sk-learn can only handle labels in numeric format - we have them as strings though...
# Thus we use the LabelEncoder, which does a mapping to Integer numbers
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(targetLabels) # this basically finds all unique class names, and assigns them to the numbers
print ("Found the following classes: " + str(list(le.classes_)))

# now we transform our labels to integers
target = le.transform(targetLabels)
print ("Transformed labels (first elements: " + str(target[0:150]))

# If we want to find again the label for an integer value, we can do something like this:
# print(list(le.inverse_transform([0, 8, 1])))

print ("... done label encoding")

F:\Informatik\tw_mle_exercise4\mp3
Found 1000 files

Found the following classes: ['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae', 'rock']
Transformed labels (first elements: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1]
... done label encoding
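
The label handling above can be tried without the audio files. The sketch below uses a few made-up file names (hypothetical, following the `genre/genre.NNNNN.mp3` layout of the dataset) and shows the round trip through `LabelEncoder`.

```python
import os
from sklearn.preprocessing import LabelEncoder

# Hypothetical file names in the same layout as the extracted archive
file_names = [
    os.path.join("blues", "blues.00001.mp3"),
    os.path.join("rock", "rock.00042.mp3"),
    os.path.join("classical", "classical.00007.mp3"),
    os.path.join("blues", "blues.00002.mp3"),
]

# The class label is the name of the directory that contains the file
labels = [os.path.basename(os.path.dirname(name)) for name in file_names]

le = LabelEncoder()
encoded = le.fit_transform(labels)        # classes are sorted alphabetically
decoded = le.inverse_transform(encoded)   # back to the original strings

print(list(le.classes_))  # ['blues', 'classical', 'rock']
print(list(encoded))      # [0, 2, 1, 0]
print(list(decoded))      # ['blues', 'rock', 'classical', 'blues']
```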


In [6]:
# Before we extract the features, let's plot some information on a demo song, to illustrate what we are doing

import matplotlib.pyplot as plt
import librosa
import librosa.display  # import the submodule; `from librosa import display` would shadow IPython's display()
import numpy as np


demoSongName = fileNames[1]
demoSongPath = imagePath + demoSongName
demoSongPath = demoSongPath.replace("\\", "/")
print(demoSongPath)
print ("Showing demo feature extraction on song " + demoSongPath)

y, sr = librosa.load(demoSongPath)

# compute the tempo
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo = librosa.beat.tempo(onset_envelope=onset_env, sr=sr)
print ("The song has " + str(tempo) + " beats per minute")

# plot the wave form
plt.figure()
plt.subplot(3, 1, 1)
librosa.display.waveplot(y, sr=sr)
plt.title(demoSongPath)

y_harm, y_perc = librosa.effects.hpss(y)
plt.figure()
plt.subplot(3, 1, 3)
librosa.display.waveplot(y_harm, sr=sr, alpha=0.25)
librosa.display.waveplot(y_perc, sr=sr, color='r', alpha=0.5)
plt.title(demoSongPath + ': Harmonic + Percussive')
plt.tight_layout()


# Plot the power spectrum
plt.figure(figsize=(12, 8))
D = librosa.amplitude_to_db(librosa.stft(y), ref=np.max)
plt.subplot(4, 2, 1)
librosa.display.specshow(D, y_axis='linear')
plt.colorbar(format='%+2.0f dB')
plt.title('Linear-frequency power spectrogram')

# Plot Chroma
plt.figure()
C = librosa.feature.chroma_cqt(y=y, sr=sr)
plt.subplot(4, 2, 5)
librosa.display.specshow(C, y_axis='chroma')
plt.colorbar()
plt.title('Chromagram')

# Plot tempogram
plt.figure()
plt.subplot(4, 2, 8)
Tgram = librosa.feature.tempogram(y=y, sr=sr)
librosa.display.specshow(Tgram, x_axis='time', y_axis='tempo')
plt.colorbar()
plt.title('Tempogram')
plt.tight_layout()

plt.show()

F:/Informatik/tw_mle_exercise4/mp3/blues/blues.00001.mp3
Showing demo feature extraction on song F:/Informatik/tw_mle_exercise4/mp3/blues/blues.00001.mp3




NoBackendError: 
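
The `NoBackendError` above comes from the audioread package, which librosa can fall back to when decoding MP3 files, and it indicates that no usable decoder was found. A likely cause (an assumption about this machine, not something the notebook confirms) is that FFmpeg is not installed or not on the `PATH`. A small check along these lines can be run before the extraction; the helper name `find_decoder` is made up for this sketch.

```python
import shutil

def find_decoder(candidates=("ffmpeg", "avconv")):
    """Return the path of the first decoder executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None

decoder = find_decoder()
if decoder is None:
    print("No ffmpeg/avconv found on PATH; MP3 decoding will likely fail.")
else:
    print("Decoder found: " + decoder)
```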

In [None]:
# Now we do the actual feature extraction
import datetime
from collections import deque
import progressbar

import numpy as np
import scipy.stats as st  # the private scipy.stats.stats module is deprecated


# This is a helper function that computes the differences between adjacent array values
def differences(seq):
    iterable = iter(seq)
    prev = next(iterable)
    for element in iterable:
        yield element - prev
        prev = element

# This is a helper function that computes various statistical moments over a series of values, including mean, median, var, min, max, skewness and kurtosis (a total of 7 values)
def statistics(numericList):
    return [np.mean(numericList), np.median(numericList), np.var(numericList), np.float64(st.skew(numericList)), np.float64(st.kurtosis(numericList)), np.min(numericList), np.max(numericList)]



print ("Extracting features using librosa" + " (" + str(datetime.datetime.now()) + ")")

# compute some features based on BPMs, MFCCs, Chroma
data_bpm=[]
data_bpm_statistics=[]
data_mfcc=[]
data_chroma=[]

# This takes a bit, so let's show it with a progress bar
with progressbar.ProgressBar(max_value=len(fileNames)) as bar:
    for indexSample, fileName in enumerate(fileNames):
        # Load the audio as a waveform `y`, store the sampling rate as `sr`
        y, sr = librosa.load(fileName)

        # run the default beat tracker
        tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
        # from this, we simply use the tempo as BPM feature
        data_bpm.append([tempo])

        # Then we compute a few statistics on the beat timings
        beat_times = librosa.frames_to_time(beat_frames, sr=sr)
        # from the timings, compute the time differences between the beats
        beat_intervals = np.array(deque(differences(beat_times)))

        # And from this, take some statistics
        # There might be a few files where the beat timings are not determined properly; we ignore them, resp. give them 0 values
        if len(beat_intervals) < 1:
            print ("Errors with beat interval in file " + fileName + ", index " + str(indexSample) + ", using 0 values instead")
            data_bpm_statistics.append([tempo, 0, 0, 0, 0, 0, 0, 0])
        else:
            bpm_statisticsVector=[]
            bpm_statisticsVector.append(tempo) # we also include the raw value of tempo
            for stat in statistics(beat_intervals):  # in case the timings are ok, we actually compute the statistics
                bpm_statisticsVector.append(stat) # and append it to the vector, which finally has 1 + 7 features
            data_bpm_statistics.append(bpm_statisticsVector)

        # Next feature are MFCCs; we take 12 coefficients; for each coefficient, we have around 40 values per second
        mfccs=librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)
        mfccVector=[]
        for mfccCoefficient in mfccs: # we transform this time series by taking again statistics over the values
            mfccVector.append(statistics(mfccCoefficient))

        # Finally, this vector should have 12 * 7 features
        data_mfcc.append(np.array(mfccVector).flatten())


        # Last feature set - chroma (which is roughly similar to actual notes)
        chroma=librosa.feature.chroma_stft(y=y, sr=sr)
        chromaVector=[]
        for chr in chroma: # similar to before, we get a number of time-series
            chromaVector.append(statistics(chr)) # and we resolve that by taking statistics over the time series
        # Finally, this vector should have 12 * 7 features
        data_chroma.append(np.array(chromaVector).flatten())

        bar.update(indexSample)

print (".... done" + " (" + str(datetime.datetime.now()) + ")")

# Finally, we do classification
# These are our feature sets; we will use each of them individually to train classifiers
trainingSets = [data_bpm, data_bpm_statistics, data_chroma, data_mfcc ]

from sklearn.model_selection import cross_val_score
from sklearn import neighbors
from sklearn import naive_bayes
from sklearn import tree
from sklearn import ensemble
from sklearn import svm

# set up a number of classifiers
classifiers = [neighbors.KNeighborsClassifier(),
               naive_bayes.GaussianNB(),
               tree.DecisionTreeClassifier(),
               ensemble.RandomForestClassifier(),
               svm.SVC(),
               svm.LinearSVC(),
              ]

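for indexDataset, train in enumerate(trainingSets):
    for indexClassifier, classifier in enumerate(classifiers):
        # do the actual classification: cross-validated accuracy per feature set and classifier
        print ("Classifying feature set " + str(indexDataset) + " with " + type(classifier).__name__ + " ...")
        scores = cross_val_score(classifier, train, target, cv=number_of_folds)
        print ("Accuracy: " + parse_k_fold_results(scores))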
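
The aggregation step of the feature extraction (turning a time series per coefficient into seven statistics) can be checked without any audio. The sketch below uses random numbers as a stand-in for MFCCs, so the values themselves are meaningless; only the resulting shapes matter. The function name `summarize` is made up for this example.

```python
import numpy as np
from scipy import stats

def summarize(series):
    """Mean, median, variance, skewness, kurtosis, min and max of a 1-D series."""
    series = np.asarray(series, dtype=float)
    return [np.mean(series), np.median(series), np.var(series),
            float(stats.skew(series)), float(stats.kurtosis(series)),
            np.min(series), np.max(series)]

# Synthetic stand-in: 12 coefficients x 400 frames
rng = np.random.default_rng(0)
fake_mfccs = rng.normal(size=(12, 400))

# One row of 7 statistics per coefficient, flattened into a single feature vector
feature_vector = np.array([summarize(row) for row in fake_mfccs]).flatten()
print(feature_vector.shape)  # (84,)

# The beat-interval computation is an adjacent difference, i.e. np.diff
beat_times = np.array([0.5, 1.0, 1.6])
beat_intervals = np.diff(beat_times)
print(len(beat_intervals))  # 2
```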

### Calculation Functions


#### k-NN Calculation

In [None]:
from sklearn import neighbors

def calculate_knn(dataset_name, data, target):
    knn_results = []

    n_neighbors = range(1, 10)  # k = 1..9

    for n in n_neighbors:
        knn_classifier = neighbors.KNeighborsClassifier(n_neighbors=n)
        result = calculate_results_cross_validate(dataset_name,
                                                  knn_classifier,
                                                  "knn",
                                                  data,
                                                  target)
        knn_results.append(result)
    return knn_results


#### Bayes Calculation

In [None]:
from sklearn import naive_bayes

def calculate_bayes(dataset_name, data, target):
    bayes_results = []

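    # GaussianNB suits the continuous features used here; CategoricalNB expects integer-encoded categories
    classifier = naive_bayes.GaussianNB()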
    result = calculate_results_cross_validate(dataset_name,
                                              classifier,
                                              "bayes",
                                              data,
                                              target)
    bayes_results.append(result)

    return bayes_results

#### Perceptron Calculation

In [None]:
from sklearn import linear_model

def calculate_perceptron(dataset_name, data, target):
    perceptron_results=[]
    classifier = linear_model.Perceptron()
    result = calculate_results_cross_validate(dataset_name,
                                              classifier,
                                              "perceptron",
                                              data,
                                              target)
    perceptron_results.append(result)
    return perceptron_results

#### Decision Tree Calculation

In [None]:
from sklearn.tree import DecisionTreeClassifier

def calculate_decision_tree(dataset_name, data, target):
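    # Parameters for the decision tree: one fully grown tree and two pre-pruned variants
    classifiers = [
        DecisionTreeClassifier(),                      # fully grown (no depth or leaf-size limit)
        DecisionTreeClassifier(max_depth = 5),         # pre-pruned by limiting the depth
        DecisionTreeClassifier(min_samples_leaf = 50)  # pre-pruned by requiring at least 50 samples per leaf
        ]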
    decision_tree_results = []

    for classifier in classifiers:
        result = calculate_results_cross_validate(dataset_name,
                                                  classifier,
                                                  "decision tree",
                                                  data,
                                                  target)
        decision_tree_results.append(result)
    return decision_tree_results
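
To see what the pruning settings do to the tree itself, the following sketch compares the size of a fully grown tree with a pre-pruned one (`max_depth`) and a post-pruned one (cost-complexity pruning via `ccp_alpha`) on Iris. The values `max_depth=2` and `ccp_alpha=0.02` are assumptions picked for illustration, not the settings evaluated in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

trees = {
    "fully grown": DecisionTreeClassifier(random_state=0),
    "pre-pruned (max_depth=2)": DecisionTreeClassifier(max_depth=2, random_state=0),
    "post-pruned (ccp_alpha=0.02)": DecisionTreeClassifier(ccp_alpha=0.02, random_state=0),
}

# Fit each tree on the full data and record how large it became
node_counts = {}
for name, clf in trees.items():
    clf.fit(X, y)
    node_counts[name] = clf.tree_.node_count
    print(name, "nodes:", clf.tree_.node_count, "depth:", clf.get_depth())
```

Smaller trees are not automatically better; the cross-validated scores decide which setting generalises best on a given feature set.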

#### Random Forest Calculation

In [None]:
from sklearn.ensemble import RandomForestClassifier

def calculate_random_forest(dataset_name, data, target):

    # Parameters for the random forest: number of trees (10, 60, 110, 160)
    arguments = range(10,200,50)
    random_forest_results = []

    for argument in arguments:
        classifier = RandomForestClassifier(n_estimators=argument)
        result = calculate_results_cross_validate(dataset_name,
                                                  classifier,
                                                  "random forest",
                                                  data,
                                                  target)
        random_forest_results.append(result)
    return random_forest_results

#### SVM Calculation

In [None]:
from sklearn import svm

def calculate_svm(dataset_name, data, target):
    svm_results = []

    classifiers = [
        svm.SVC(),
        svm.LinearSVC()
    ]

    for classifier in classifiers:
        result = calculate_results_cross_validate(dataset_name,
                                                  classifier,
                                                  "svm",
                                                  data,
                                                  target)
        svm_results.append(result)
    return svm_results

In [None]:
### Load Datasets
from sklearn import datasets as sk_datasets

# Iris and Digits
datasets = [{'name': 'iris', 'data': sk_datasets.load_iris()},
            {'name': 'digits', 'data': sk_datasets.load_digits()}]

for dataset in datasets:
    overall_results_dataset = []
    name = dataset['name']
    data = dataset['data']
    overall_results_dataset.extend(calculate_knn(name, data.data, data.target))
    #overall_results_dataset.extend(calculate_bayes(name, data.data, data.target))
    overall_results_dataset.extend(calculate_perceptron(name, data.data, data.target))
    overall_results_dataset.extend(calculate_decision_tree(name, data.data, data.target))
    #overall_results_dataset.extend(calculate_random_forest(name, data.data, data.target))
    overall_results_dataset.extend(calculate_svm(name, data.data, data.target))
    overall_results_dataset = pd.DataFrame(overall_results_dataset)
    dataset['result'] = overall_results_dataset

## Iris Dataset

In [None]:
display(datasets[0]['name'], datasets[0]['result'])

## Digits Dataset

In [None]:
display(datasets[1]['name'], datasets[1]['result'])