Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says "YOUR ANSWER HERE" or `YOUR CODE HERE` and remove the `raise NotImplementedError()` lines. 

Code blocks starting with a `# tests` comment provide unit tests which have to run without errors in order to get full points. Be aware that there might be further 'secret' tests to check correct implementation! I.e. the provided unit tests are necessary but not sufficient for full points!

You are always welcome to add **additional plots, tests, or debug outputs**.
However, make sure to: **1) not break the automated tests**, and **2) switch off any excessive debug output** when you submit your notebook!

Please add your name and student ID below:

In [1]:
NAME = "Marko Kadic"
STUDENT_ID = "12045128"

In [2]:
assert len(NAME) > 0, "Enter your name!"
assert len(STUDENT_ID) > 0, "Enter your student ID!"

# Intelligent Audio and Music Analysis Assignment 7

This assignment accounts for the last 50 points of the 3rd and last assignment block (100 points total)

Assignment is mainly **free form**, the goal is to apply what has been practiced so far. For implementing assignment 7, best practice is to follow the code structures from previous assignments and reuse as much code as possible (this makes it easier for us to review it). You can use any libraries, however, we recommend you use: madmom, librosa, pyTorch, etc. (the libraries we have used so far).


### GPU Support
Our JupyterHub, unfortunately, does not yet provide GPU support. Nevertheless, this assignemnt can be run as-is on JupyterHub, however training of the neural network will take a long time.

In order to speed up training if you are in a hurry, you can run this notebook on any local machine with GPU and cuda support, or alternatively use infrastructure like [Google colab](https://colab.research.google.com/) and drive, if you have a google account.

Simply upload your solved notebook and necessary other files, like output model file, back to JupyterHub for your submission.

In [3]:
import os
# This code block enables this notebook to run on google colab.
try:
    from google.colab import drive
    print('Running in colab...\n===================')
    COLAB = True
    !pip install madmom torch==1.4.0 torchvision==0.5.0 librosa --upgrade
    print('Installed dependencies!\n=======================')

    if not os.path.exists('data'):
        print('Downloading data...\n===================')
        !mkdir data
        !cd data
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.audio.1.zip?download=1
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.audio.2.zip?download=1
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.audio.3.zip?download=1
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.audio.4.zip?download=1
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.audio.5.zip?download=1
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.audio.6.zip?download=1
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.audio.7.zip?download=1
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.audio.8.zip?download=1
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.doc.zip?download=1
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.error.zip?download=1
        !wget https://zenodo.org/record/45739/files/TUT-acoustic-scenes-2016-development.meta.zip?download=1
            
        !wget https://zenodo.org/record/165995/files/TUT-acoustic-scenes-2016-evaluation.audio.1.zip?download=1
        !wget https://zenodo.org/record/165995/files/TUT-acoustic-scenes-2016-evaluation.audio.2.zip?download=1
        !wget https://zenodo.org/record/165995/files/TUT-acoustic-scenes-2016-evaluation.audio.3.zip?download=1
        !wget https://zenodo.org/record/165995/files/TUT-acoustic-scenes-2016-evaluation.doc.zip?download=1
        !wget https://zenodo.org/record/165995/files/TUT-acoustic-scenes-2016-evaluation.meta.zip?download=1
            
        !for file in *.*; do mv $file ${file%?download=1}; done
        
        !unzip "*.zip"
        !rm *.zip
        !cd ..

    print('===================\nMake sure you activated GPU support: Edit->Notebook settings->Hardware acceleration->GPU\n==================')
except:
    print('=======================\nNOT running in colab...\n=======================')
    COLAB = False

Running in colab...
Collecting madmom
  Downloading madmom-0.16.1.tar.gz (20.0 MB)
[K     |████████████████████████████████| 20.0 MB 1.2 MB/s 
[?25hCollecting torch==1.4.0
  Downloading torch-1.4.0-cp37-cp37m-manylinux1_x86_64.whl (753.4 MB)
[K     |████████████████████████████████| 753.4 MB 7.3 kB/s 
[?25hCollecting torchvision==0.5.0
  Downloading torchvision-0.5.0-cp37-cp37m-manylinux1_x86_64.whl (4.0 MB)
[K     |████████████████████████████████| 4.0 MB 34.8 MB/s 
Collecting librosa
  Downloading librosa-0.9.1-py3-none-any.whl (213 kB)
[K     |████████████████████████████████| 213 kB 48.0 MB/s 
Collecting mido>=1.2.8
  Downloading mido-1.2.10-py2.py3-none-any.whl (51 kB)
[K     |████████████████████████████████| 51 kB 5.8 MB/s 
Building wheels for collected packages: madmom
  Building wheel for madmom (setup.py) ... [?25l[?25hdone
  Created wheel for madmom: filename=madmom-0.16.1-cp37-cp37m-linux_x86_64.whl size=20935901 sha256=0e774c4b0958f0f48602b1fc31a0f00bcb461c4f77164

## Audio Scene Classification

Your task is to implement a solution to an auditory scene detection challenge, precisely the DCASE 2016 Acoustic Scene Classification task. Details about the challenge are provided on the [task website](http://dcase.community/challenge2016/task-acoustic-scene-classification).
1. You are **free in choosing the strategy that you apply** and can also reuse and modify your implementation of previous assignments, e.g., by modifying the architecture to handle clips of 30 seconds length.
2. **Follow the given evaluation strategies of the task**, in particular wrt. development and evaluation datasets and cross validation settings.
3. Consider **reducing the amount of data** in a reasonable way, if necessary.
4. **Compare your results** to the numbers reported on the task website and comment on you main findings.

Remark: The goal is not to outperform the state of the art, but to experiment with a classification task in the general audio domain. Therefore, you can apply your existing solutions from the music domain and reflect upon the capabilities and limitations of your approach.

The overall goal of this assignment is to implement the method in an elegant way and present your implementation in this notebook:
1. **Illustrate your chosen architecture** e.g. by printing the individual layers and the shapes of the forward function if you choose a neural network approach (as we have done in previous assignments).
2. **Use plots** to showcase features and evaluation results.
3. Output your **final performance** and set it into context.

The rough distribution of points is as follows:
* 10 Points data preprocessing and data handling
* 10 Points machine learning architecture (e.g. neural network and data loader)
* 10 Points training method and evaluation
* 10 Points results and conclusion
* 10 Points overall presentation throughout the notebook


# Task 1: Data Processing (10 Points)

If you work on JupyterHub, find the audiofiles in the shared folder as indicated in the cell below.
Think about **reasonable features** to use and extract them for the audio files.
The DCASE dataset is already split into **a development and an evaluation** set. The idea is to only use the evaluation set **once** at the very end when you are confident about your trained system.
Only use the development set to draw your train/valid/test splits from.
The dataset comes with **predefined splits** for four-fold cross-validation. Feel free to use your own training setup, but read and **follow the guidelines** that come in the documentation of the dataset!

**Note**: Check the readme files in the dataset folder for more details!!

In [7]:
import os
import numpy as np

# get dataset path
dataset_path = os.path.join(os.environ['HOME'], 'shared', 'data', 'assignment_7')
if os.path.exists('data'):
    dataset_path = '.'

development_path = os.path.join(dataset_path, 'TUT-acoustic-scenes-2016-development')
evaluation_path = os.path.join(dataset_path, 'TUT-acoustic-scenes-2016-evaluation')

development_audio_path = os.path.join(development_path, 'audio')
development_annotation_file = os.path.join(development_path, 'meta.txt')
development_error_file = os.path.join(development_path, 'error.txt')
split_definition_path = os.path.join(development_path, 'evaluation_setup')

evaluation_annotation_file = os.path.join(evaluation_path, 'meta.txt')
evaluation_audio_path = os.path.join(evaluation_path, 'audio')

data_file_clip_info = os.path.join(dataset_path, 'clip_info_final.csv')
data_file_annotations = os.path.join(dataset_path, 'annotations_final.csv')

# collect list of audio files:
development_audio_files = [af for af in os.listdir(development_audio_path) if af.endswith('.wav')]
evaluation_audio_files = [af for af in os.listdir(evaluation_audio_path) if af.endswith('.wav')]

dev_audio_total_count = len(development_audio_files)
eval_audio_total_count = len(evaluation_audio_files)

print(f'Total number of development audio files: {dev_audio_total_count}')
print(f'Total number of evaluation audio files: {eval_audio_total_count}')

Total number of development audio files: 1170
Total number of evaluation audio files: 390


In [21]:
print(len(development_audio_files))

print(development_audio_files[0])


1170
a059_90_120.wav


### 1.1 Implementation

In [29]:
from librosa.filters import mel
# Put your data handling code here. 
# You can add additional cells below this one for structuring the notebook.
# Feel free to add markdown cells / plots / tests / etc. if it helps your presentation.
import librosa
from pathlib import Path

# YOUR CODE HERE
audio_len = 30
sample_rate = 44100
fft_frame_size = 512
num_mel_bands = 96
hop_size = 256

exp_feat_shape = (96, 1366)

# HERE I HAVE EXTRACTED THE MFCC FEATURES OF EACH AUDIO FILE

def extract_features(file_list, file_path):
    features = []  # list containing features for each audio file
    file_names = []  # audio file names for features (indices should correspond!), cut path and file extension, only leave names
    # iterate over files and extract features...
    # YOUR CODE HERE
    
    for file in file_list:
        try:
          signal_audio, sr  = librosa.load(file_path + file, sr = sample_rate)
          mel_spectrogram = librosa.feature.melspectrogram(y=signal_audio, sr=sample_rate,n_fft = fft_frame_size,
                                                          hop_length = hop_size, n_mels=num_mel_bands)
          print(len(mel_spectrogram))
          mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_spectrogram))
          print(len(mfcc))
          features.append(mfcc)
          file_names.append(Path(file).stem)
        except Exception as e: 
          print(e)
          continue
    
    features = np.asarray(features)
    print(features.shape)
    
    return np.asarray(features), np.asarray(file_names)

In [None]:
dev_features, dev_feat_files = extract_features(development_audio_files, development_audio_path + "/")

eval_features, eval_feat_files = extract_features(evaluation_audio_files, evaluation_audio_path + "/")

In [31]:
print(len(dev_feat_files))

1170


In [43]:
import pandas as pd

# HERE I AM CREATING ANNOTATION LISTS FOR THE AUDIO SAMPLE FILES
# EACH MFCC HAS AN ANNOTATION AT THE SAME INDEX AFTER THIS

annotations_development = pd.read_csv(development_annotation_file, sep="\t", header=None)
annotations_evaluation = pd.read_csv(evaluation_annotation_file, sep="\t", header=None)

dev_annot = {}
eval_annot = {}

for i, annot in enumerate(annotations_development[0]):
  dev_annot[annot] = annotations_development[1][i]

for i, annot in enumerate(annotations_evaluation[0]):
  eval_annot[annot] = annotations_evaluation[1][i]  


targets_dev = []
targets_eval = []

for audio_file in development_audio_files:
  targets_dev.append(dev_annot["audio/" + audio_file])

for audio_file in evaluation_audio_files:
  targets_eval.append(eval_annot["audio/" + audio_file])


print(len(targets_dev))
print(len(targets_eval))

1170
390
1170
city_center
1170
390


### 1.2 Discussion

Write down what choices you made regarding data structuring and feature extraction, feel free to refer to code/plots/etc. in cells above.



For the data extraction i have calculated the Mel-Frequency-Capstrol-Coefficients (librosa.feature.mfcc) of the 30 second clips provided.

After that I have created a list of targets that provides the correct classification of each clip.

I will further use the MFCC features to try to classify the clips with a Support Vector Machine and a KNN classifier, to create a classical
Machine Learning (feature based) approach.

## Task 2: Machine Learning Approach (10 Points)

Implement your audio scene classification method here. You are free to use any approach you find appropriate. As a hint: the easiest way to succeed is to adapt the neural network approach from assignment 6 (or 5), since convolutional neural networks have been shown to work very well for this task, and you can start with a running code base.

### 2.1 Implementation

In [59]:
# Implement your machine learning architecture in the cells below. 
# You can add additional cells below this one for structuring the notebook.
# Feel free to add markdown cells / plots / tests / etc. if it helps your presentation.

# YOUR CODE HERE

# SVM CLASSIFIER APPROACH
# SUPPORT VECTOR MACHINE (SEPARATOR) - CHAINED FOR THE MULTI CLASS PROBLEM

from sklearn import svm
import pickle

###############################################################
# TRAINING THE SVM
feature_vector_train = dev_features


nsamples, nx, ny = feature_vector_train.shape
d2_train_dataset = feature_vector_train.reshape((nsamples,nx*ny))

# TARGETS
y_train = targets_dev


# CREATE THE  MULTI-CLASS SVM
# ONE-VS-ONE implementation
clf = svm.SVC(decision_function_shape='ovo')


cnn_tagger_0.model  finalized_model_knn.sav  name_cache.npy
feat_cache.npy	    finalized_model.sav
(1170, 20, 5168)


(1170, 20, 5168)
0.3717948717948718


In [78]:
####################################

#  KNN CLASSICIER APPROACH
# k-Nearest-Neighbours (Hedging classifier)

from sklearn.neighbors import KNeighborsClassifier


feature_vector_train = dev_features


nsamples, nx, ny = feature_vector_train.shape
d2_train_dataset = feature_vector_train.reshape((nsamples,nx*ny))

# TARGETS
y_train = targets_dev

neigh = KNeighborsClassifier(n_neighbors = 15)

185
205
0.47435897435897434


### 2.2 Discussion
Write down your choices and findings. Feel free to refer to code/plots/etc. in cells above.

Here I have created 2 classifiers. A simple KNN classifier, and a SVM Classifier. The Support Vector machine is a separator and is commonly a binary classifier. This is why we use the scikit "ovo" implementation of the classifier to link multiple SVM-s to classify for the 15 classes we have in our problem.
I am expecting these classifiers to perform worse than the Neural Network approach, but I am hoping they won't be much worse.

## Task 3: Training, Inference, and Evaluation (10 Points)

Depending on your choices for the machine learning model, implement the appropriate code to train and test it.
For developing and training the model only use the development set. 

### 3.1 Implementation

In [None]:
# Put your trainin and evaluation code in the cells below.
# You can add additional cells below this one for structuring the notebook.
# Feel free to add markdown cells / plots / tests / etc. if it helps your presentation.

# YOUR CODE HERE


In [None]:
# TRAINING THE SVM CLASSIFIER
clf.fit(d2_train_dataset, y_train)

# SAVE THE MODEL
!ls "/content/drive/MyDrive/audio/"

mod_path = "/content/drive/MyDrive/audio/"

filename = 'finalized_model.sav'

pickle.dump(clf, open(mod_path + filename, 'wb'))

In [65]:
# EVALUATING THE SUPPORT VECTOR MACHINE
from sklearn.metrics import precision_score

# Calculate Precision, Recall and F-measure
# SVM EVALUATION

filename = "/content/drive/MyDrive/audio/finalized_model.sav"

clf_loaded_model = pickle.load(open(filename, 'rb'))

###############################################################
# TESTING THE SVM
feature_vector_test = eval_features
nsamples, nx, ny = feature_vector_test.shape
d2_test_dataset = feature_vector_test.reshape((nsamples,nx*ny))

# TARGETS
y_test = targets_eval

print(feature_vector_train.shape)


#TEST
outputs_svm = clf_loaded_model.predict(d2_test_dataset)

prec = precision_score(outputs_svm, y_test, average='micro')

print(prec)

(1170, 20, 5168)
0.3717948717948718


In [79]:
# TRAINING THE KNN CLASSIFIER
neigh.fit(d2_train_dataset, y_train)

# SAVE THE MODEL
!ls "/content/drive/MyDrive/audio/"

mod_path = "/content/drive/MyDrive/audio/"

filename = 'finalized_model_knn.sav'

pickle.dump(neigh, open(mod_path + filename, 'wb'))

cnn_tagger_0.model  finalized_model_knn.sav  name_cache.npy
feat_cache.npy	    finalized_model.sav


In [81]:
# EVALUATING THE KNN CLASSIFIER
# Calculate Precision, Recall and F-measure
# KNN Evaluation

filename = "/content/drive/MyDrive/audio/finalized_model_knn.sav"
clf_loaded_model = pickle.load(open(filename, 'rb'))


loaded_model_neigh = pickle.load(open(filename, 'rb'))

###############################################################
# TESTING THE KNN
feature_vector_test = eval_features
nsamples, nx, ny = feature_vector_test.shape
d2_test_dataset = feature_vector_test.reshape((nsamples,nx*ny))

# TARGETS
y_test = targets_eval

#print(feature_vector_train.shape)


#TEST
outputs_knn = loaded_model_neigh.predict(d2_test_dataset)

prec = precision_score(outputs_knn, y_test, average='micro')

print(prec)

0.4717948717948718


### 3.2 Discussion
Write down your choices and findings. Feel free to refer to code/plots/etc. in cells above.

Here I have created and Traines the clssifiers. The SVM training took some time, about 10-15 mintues. After that I have calculated the precision of eeach classifier. 
The results are defeating, to say the least.
Classical classifiers obviosly perform severly worse than a Neural Network Approach for the problem of Audio scene Classification.

SVM precision: 37%

KNN precision: 47%

## Task 4: Results and Conclusion (10 Points)

Use the code cells below to calculate the final performance of the developed approach on the evaluation part of the dataset. 

In [74]:
# Put the evaluation code on the evaulation dataset in these code cells.
# You can add additional cells below this one for structuring the notebook.
# Feel free to add markdown cells / plots / tests / etc. if it helps your presentation.

# YOUR CODE HERE
from sklearn.metrics import f1_score

# SVM F1 SCORE

values_numbers = {}

cnt = 0
for val in y_test:
  if val not in values_numbers:
    values_numbers[val] = cnt
    cnt += 1


y_true = []
y_pred = []

for out in outputs_svm:
  y_pred.append(values_numbers[out])

for an in y_test:
  y_true.append(values_numbers[an])


# F1 SCORE
f1_macro = f1_score(y_true, y_pred, average='macro')

f1_micro = f1_score(y_true, y_pred, average='micro')

f1_weigh = f1_score(y_true, y_pred, average='weighted')

print("SVM F1 MACRO: " + str(f1_macro) + " SVM F1 MACRO: " + str(f1_macro) + " SVM F1 WEIGH: " + str(f1_weigh))

SVM F1 MACRO: 0.3241436442734849 SVM F1 MACRO: 0.3241436442734849 SVM F1 WEIGH: 0.3241436442734849


In [82]:
# KNN F1 SCORE

values_numbers = {}

cnt = 0
for val in y_test:
  if val not in values_numbers:
    values_numbers[val] = cnt
    cnt += 1


y_true = []
y_pred = []

for out in outputs_knn:
  y_pred.append(values_numbers[out])

for an in y_test:
  y_true.append(values_numbers[an])




# F1 SCORE
f1_macro = f1_score(y_true, y_pred, average='macro')

f1_micro = f1_score(y_true, y_pred, average='micro')

f1_weigh = f1_score(y_true, y_pred, average='weighted')

print("SVM F1 MACRO: " + str(f1_macro) + " SVM F1 MACRO: " + str(f1_macro) + " SVM F1 WEIGH: " + str(f1_weigh))

SVM F1 MACRO: 0.4203283396604311 SVM F1 MACRO: 0.4203283396604311 SVM F1 WEIGH: 0.4203283396604312


### Task 4.2 Discussion

Compare your performance to the ones shown on the DCASE website, and discuss possible reasons for performance differences.
Discuss your approach in the context of the other methods presented on the DCASE website.

We can see by the precision and F1-scores we are getting (0.32 for the SVM, and 0.42 for the KNN) classiciers that my method with the MFCC features and the classical ML approach severly underperforms compared even to the lowes displayed classifier on the Challenge Website (62%).

Conclusion: I have confirmed my suspicion that the Neural Network approaches work better for audio classification, because they integrate the most important part of audio processing - the switch to the frequency domain by utilizing spectrograms, and then work out better features than the MFCC.

## Task 5: Overall Presentation (10 Points)

Make sure your notebook **clearly presents your chosen approach** to the problem solution. If necessary, revisit the individual tasks and **add plots, outputs, code comments**, etc. to clearly explain what is going on.

You do not need to overdo it (no endless prints or plots that bloat the notebook) - less is sometimes more - as a goal think about your peers in the lecture and make it so that they could easily understand what is going on in the notebook. Exemplary plots with overall metrics are usually a nice compromise.

## Congratulations, you are done!

Reminder:
Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).