# MLE - Exercise 2 - Comparative Experimentation
## Andreas Kocman (se19m024)

## Assignment
This exercise closely follows the style of the previous exercise: you shall run experiments with different data sets and classifiers. Again, you can do the exercise alone, or in a group of two.

The datasets to use are
* The datasets from exercise 3, i.e. Iris, Optical Digits (and if you are in a group, then also Breast Cancer)
* Either the music or the image data set - decided by your matriculation number modulo 2, 0 means music, 1 means image (If you are doing this exercise in a group, then you shall take both data sets)

The classifiers & parameters to use are
* All the classifiers & parameters from exercise 3
* Decision trees, you shall have two setups: one fully grown tree, and one setting for a pruned or pre-pruned tree.
   * (If you are a group, you shall try a total of four settings: two unpruned trees using two different split criteria, and two setups for different amounts of (pre)-pruning the tree.)
* Random Forests, using two different settings for the number of trees
   * (If you are in a group, also vary the number of attributes that are used in each split; use three different values resp. computation methods (sqrt, log, fraction, ...); this should give you a total of 6 runs: (2 number of trees) x (3 number of attributes))
* SVMs: just use the default settings, but use both SVC and LinearSVC classifiers (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html, http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
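
The classifier setups listed above can be sketched as follows. This is a minimal illustration on Iris, not the solution used later in this notebook: the concrete values (`max_depth=3`, `min_samples_leaf=5`, 10 and 100 trees) and the `setups` dictionary are assumptions chosen only to show one fully grown tree, one pre-pruned tree, two forest sizes and the two SVM classifiers with default settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative settings only; the concrete parameter values are assumptions.
setups = {
    # No depth or leaf-size limit: the tree grows until all leaves are pure
    "tree, fully grown": DecisionTreeClassifier(random_state=0),
    # Pre-pruning: stop growing early via depth and leaf-size limits
    "tree, pre-pruned": DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0),
    "forest, 10 trees": RandomForestClassifier(n_estimators=10, random_state=0),
    "forest, 100 trees": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC (defaults)": SVC(),
    "LinearSVC (defaults)": LinearSVC(),
}

# Mean accuracy over 5 cross-validation folds for every setup
results = {name: cross_val_score(clf, X, y, cv=5).mean() for name, clf in setups.items()}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Depending on the scikit-learn version, `LinearSVC` may print a convergence or deprecation warning with default settings; the scores are still computed.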

### Image Dataset
We will use the "Fruit Image Dataset", originally provided at http://data.vicos.si/datasets/FIDS30/, but with an edited version linked from Moodle (some images had an encoding not compatible with e.g. python libraries). Your task is to classify images into the category of fruit (a total of 30 defined categories) they belong to.

As this is image data, feature extraction is a requirement before we can actually learn anything. As you shouldn't spend too much time on that, there is demo code on how to work with this data available, linked from the course main page.

This code generates a set of four different features, all rather simple and based on histograms of colours (i.e. counts of how often a certain colour appears). You shall work with all four of them, and will likely see very different results.
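
To illustrate what a colour-histogram feature looks like, here is a minimal sketch using only NumPy. It is not the course demo code: the function name `colour_histogram`, the choice of 8 bins per channel and the random stand-in image are all assumptions made for this example.

```python
import numpy as np

def colour_histogram(image, bins=8):
    """Concatenate one normalised histogram per colour channel (hypothetical helper)."""
    features = []
    for channel in range(image.shape[2]):
        # Count how many pixels fall into each intensity bin of this channel
        counts, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        # Normalise so that images of different sizes are comparable
        features.append(counts / counts.sum())
    return np.concatenate(features)

# Random stand-in for a real 32x32 RGB image (assumption: 8-bit channels)
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

features = colour_histogram(fake_image)
print(features.shape)
```

Each image is reduced to a fixed-length vector (here 3 channels x 8 bins), which is the form the classifiers below expect.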

### Music Dataset
We will use the dataset provided by George Tzanetakis, called "gtzan". This dataset contains 1,000 songs (100 songs for each of 10 genres), and the task is therefore to predict the genre of a song; to limit file size, the songs are only 30-second snippets, sampled at only 22 kHz. You can download the dataset from the Moodle main page, or at http://kronos.ifs.tuwien.ac.at/GTZANmp3_22khz.zip. As this is copyrighted material, please do not redistribute it!

As this is audio data, feature extraction is a requirement before we can actually learn anything. Therefore, there is demo code on how to work with this data available, linked from the course main page.

This code generates different features, ranging from very simple ones containing just beats per minute (BPM) to more advanced ones based on signal processing. You shall work with all of them, and will likely see very different results.

### Working in a group
If you work in a group, your scope is extended, as partially described above:

* More datasets: both music & image datasets
* More parameter variations
* More evaluation: for the music and image datasets, you shall also add an analysis of the confusion matrix. It is sufficient to provide one confusion matrix per feature set; you can select either the best classifier that you had on that feature set, or other interesting ones.
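
A confusion matrix of the kind asked for above could be produced as in the following sketch. It uses the Optical Digits data bundled with scikit-learn as a stand-in for a music or image feature set, and a 3-nearest-neighbour classifier; both choices are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data set (assumption); replace with one extracted feature set
X, y = load_digits(return_X_y=True)

# Out-of-fold predictions: every sample is predicted by a model that did not see it
y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y, y_pred)
print(cm)
```

Large off-diagonal entries show which pairs of classes the classifier confuses most often.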

### Links for python
* http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
* http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

### Non-python feature extractors
#### Java
* For images, you can use a port of OpenCV, the OpenCV bindings (http://docs.opencv.org/2.4/doc/tutorials/introduction/desktop_java/java_dev_intro.html) or a different implementation in Java: https://github.com/bytedeco/javacv. The sample code should then be quite similar to the one in Python.
* For music: jAudio (http://jmir.sourceforge.net/index_jAudio.html) offers a GUI for extracting features. It is best to use BPM (strongest beat), MFCCs and Chroma, and their derivatives, i.e. the statistics that are also used in the sample code. jAudio should be able to generate ARFF files for WEKA.

#### C#
* For images, you should be able to find OpenCV bindings for C# as well

## Sources used
* scikit-learn documentation

## Solution

### Helper Functions for Solution and Data Analysis

In [None]:
# global Imports
import pandas as pd
import numpy as np

# scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer

#Data reporting
from IPython.display import display

# Global definitions:
overall_results_vote = []
overall_results_amazon = []
averaging_approach = 'macro'
zero_division_approach = 0
number_of_folds = 5
scoring = {'Accuracy': make_scorer(accuracy_score),
            'Precision': make_scorer(precision_score, average=averaging_approach, zero_division=zero_division_approach),
            'Recall': make_scorer(recall_score, average=averaging_approach, zero_division=zero_division_approach)}

# Helper functions
def parse_k_fold_results(results):
    return "m: " + str(np.average(results)) + " std: " + str(np.std(results))

def parse_argument_tuple_as_string(argumentsTuple):
    return "max Depth: " + str(argumentsTuple[0])  + \
           ", min Samples: " + str(argumentsTuple[1])

def calculate_results_holdout(classifier_used, X_train, X_test, y_train, y_test):
    classifier_used.fit(X_train, y_train)

    # predict the test set on our trained classifier
    y_test_predicted = classifier_used.predict(X_test)

    acc = metrics.accuracy_score(y_test, y_test_predicted)
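    # Multi-class targets need an explicit averaging mode; reuse the global settings
    recall = metrics.recall_score(y_test, y_test_predicted, average=averaging_approach, zero_division=zero_division_approach)
    precision = metrics.precision_score(y_test, y_test_predicted, average=averaging_approach, zero_division=zero_division_approach)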

    return pd.Series({
            'classifier': str(classifier_used),
            'arguments': "",
            'accuracy':acc,
            'precision':precision,
            'recall':recall
        })

def calculate_results_cross_validate(classifier_used, description_used, data, target):
    scores = cross_validate(classifier_used, data, target,
                            scoring = scoring,
                            cv = number_of_folds,
                            error_score = 0)

    return pd.Series({
            'classifier': str(classifier_used),
            'arguments': description_used,
            'mean_accuracy': np.average(scores.get('test_Accuracy')),
            'mean_precision': np.average(scores.get('test_Precision')),
            'mean_recall': np.average(scores.get('test_Recall')),
            'accuracy': parse_k_fold_results(scores.get('test_Accuracy')),
            'precision': parse_k_fold_results(scores.get('test_Precision')),
            'recall': parse_k_fold_results(scores.get('test_Recall'))
        })

def print_results(array, column_for_max, ascending=False):
    df = pd.DataFrame(array)
    df = df.sort_values(by=[column_for_max], ascending=ascending)
    display('Results', df)

    best = df.iloc[df[column_for_max].argmax()]
    display(best)
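
The multi-metric cross-validation used by the helpers above can be tried in isolation with the following sketch. It repeats the same scorer layout (macro averaging, `zero_division=0`, 5 folds) on Iris with a 5-nearest-neighbour classifier; the data set, the classifier and the `summary` dictionary are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Same scorer layout as the helper cell: macro averaging, zero_division=0
scoring = {
    "Accuracy": make_scorer(accuracy_score),
    "Precision": make_scorer(precision_score, average="macro", zero_division=0),
    "Recall": make_scorer(recall_score, average="macro", zero_division=0),
}

scores = cross_validate(KNeighborsClassifier(n_neighbors=5), X, y, scoring=scoring, cv=5)

# cross_validate prefixes every metric name with "test_"
summary = {
    name: (np.mean(scores["test_" + name]), np.std(scores["test_" + name]))
    for name in scoring
}
for name, (mean, std) in summary.items():
    print(f"{name}: m: {mean:.3f} std: {std:.3f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two settings is larger than the fold-to-fold variation.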

### Calculation Functions


#### k-NN Calculation

In [None]:
from sklearn import neighbors

def calculate_knn(data, target):
    knn_results = []

    n_neighbors = range(1,10,1)

    for n in n_neighbors:
        knn_classifier = neighbors.KNeighborsClassifier(n)
        description = "N = " + str(n)
        result = calculate_results_cross_validate(knn_classifier,
                                                  description,
                                                  data,
                                                  target)
        knn_results.append(result)
    return knn_results


#### Bayes Calculation

In [None]:
from sklearn import naive_bayes

def calculate_bayes(data, target):
    bayes_results = []

    alphas = np.arange(0.1,5,1)

    for alpha in alphas:
        classifier = naive_bayes.CategoricalNB(alpha = alpha)
        description = "Alpha = " + str(alpha)
        result = calculate_results_cross_validate(classifier,
                                                  description,
                                                  data,
                                                  target)
        bayes_results.append(result)

    return bayes_results
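
`CategoricalNB` models every feature as a categorical variable with non-negative integer codes, so continuous features such as those of Iris are usually discretised first. The following is a minimal sketch of one way to do that, assuming a `KBinsDiscretizer` with 5 uniform bins; the pipeline, the bin count and the use of `min_categories` are assumptions for illustration and not part of the function above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)

n_bins = 5  # assumption: number of categories per feature
model = make_pipeline(
    # Map every continuous feature to ordinal bin indices 0 .. n_bins - 1
    KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform"),
    # min_categories ensures bins unseen in a training fold are still known at test time
    CategoricalNB(alpha=1.0, min_categories=n_bins),
)

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Because the discretiser is part of the pipeline, the bin edges are learned on the training folds only.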

#### Perceptron Calculation

In [None]:
from sklearn import linear_model

def calculate_perceptron(data, target):
    perceptron_results=[]
    classifier = linear_model.Perceptron()
    description = "No additional args."
    result = calculate_results_cross_validate(classifier,
                                              description,
                                              data,
                                              target)
    perceptron_results.append(result)
    return perceptron_results

#### Decision Tree Calculation

In [None]:
from sklearn import tree
import itertools

def calculate_decision_tree(data, target):
    # Parameters for the decision tree
    max_depth_arguments = range(1, 10, 2)
    min_samples_leaf_arguments = [2,20,50,100]
    argumentTuples = list(itertools.product(max_depth_arguments,
                                            min_samples_leaf_arguments))
    decision_tree_results = []

    for argumentTuple in argumentTuples:
        max_depth = argumentTuple[0]
        min_samples_leaf = argumentTuple[1]

        classifier = tree.DecisionTreeClassifier(criterion = 'gini',
                                                 max_depth = max_depth,
                                                 min_samples_leaf = min_samples_leaf,
                                                 splitter = 'best')
        #result = calculate_results_holdout(classifier, X_train, X_test, y_train, y_test)
        result = calculate_results_cross_validate(classifier,
                                                  parse_argument_tuple_as_string(argumentTuple),
                                                  data,
                                                  target)
        decision_tree_results.append(result)
    return decision_tree_results

#### SVM Calculation

In [None]:
from sklearn import svm
import itertools

def calculate_svm(data, target):
    kernels = ["poly", "rbf"]  # other options: "linear", "sigmoid"
    gammas = [0.001, "scale", "auto"]
    cs = [100]
    degrees = range(1, 10, 1)  # only used by the "poly" kernel; ignored by "rbf"

    argumentTuples = list(itertools.product(kernels,
                                            gammas,
                                            cs,
                                            degrees))
    svm_results = []

    # Unpack into singular names so the parameter lists above are not shadowed
    for kernel, gamma, c, degree in argumentTuples:
        classifier = svm.SVC(kernel = kernel, gamma = gamma, C = c, degree = degree)

        #result = calculate_results_holdout(classifier, X_train, X_test, y_train, y_test)
        # Include every varied parameter so each result row is uniquely identifiable
        result = calculate_results_cross_validate(classifier,
                                                  "Kernel: " + kernel + ", Gamma: " + str(gamma) + ", C: " + str(c) + ", Degree: " + str(degree),
                                                  data,
                                                  target)
        svm_results.append(result)
    return svm_results
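
Since `SVC` only uses `degree` for the polynomial kernel, the full Cartesian product above repeats identical `rbf` runs. A possible way to build a grid without these duplicates is sketched below; the `grid` list and the fixed placeholder degree of 3 for `rbf` are assumptions for illustration.

```python
import itertools

kernels = ["poly", "rbf"]
gammas = [0.001, "scale", "auto"]
degrees = range(1, 10)

grid = []
for kernel, gamma in itertools.product(kernels, gammas):
    # degree is ignored by every kernel except "poly", so vary it only there
    kernel_degrees = degrees if kernel == "poly" else [3]
    for degree in kernel_degrees:
        grid.append({"kernel": kernel, "gamma": gamma, "C": 100, "degree": degree})

full_size = len(kernels) * len(gammas) * len(degrees)
print(f"{len(grid)} distinct configurations instead of {full_size}")
```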