# Applied Machine Learning (2020), exercises


## General instructions for all exercises

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Follow the instructions and fill in your solution under the line marked by the tag

> YOUR CODE HERE
  
Having written the answer, execute the code cell by pressing the `Shift-Enter` key combination. The code runs and may print some information under the code cell. The focus automatically moves to the next cell, which you can execute by pressing `Shift-Enter` again, until you reach the code cell that tests your solution. Execute that cell and follow the feedback: usually it either says that the solution seems acceptable or reports some errors. You can go back to your solution, modify it, and repeat the process until you are satisfied. Then proceed to the next task.
   
Repeat the process for all tasks.

The notebook may also contain manually graded answers. Write your manually graded answer under the line marked by the tag:

> YOUR ANSWER HERE

Manually graded tasks may be text, pseudocode, or mathematical formulas. You can write formulas in $\LaTeX$ syntax by enclosing the formula in dollar signs (`$`); for example, `$f(x)=2 \pi / \alpha$` will produce $f(x)=2 \pi / \alpha$.

When you have passed the tests in the notebook and are ready to submit your solutions, download the whole notebook using the menu `File -> Download as -> Notebook (.ipynb)`. Save the file on your hard disk and submit it in [Moodle](https://moodle.uwasa.fi) under the corresponding exercise.

Your solution should be executable Python code. Use the existing code as an example of Python programming, and consult the many Python programming resources available on the Internet if necessary.


In [None]:
NAME = ""
Student_number = ""

---

# ICAT3190, Module 5, Exercise 6


## Sound sample classification

The following sound sample contains wind turbine noise, tractor noise and bird noise. The samples are separated so that only one noise source dominates at a time. Your task is to build a classification algorithm that can separate the noise sources from each other.

These recordings were made in Honkajoki, near the Kirkkokallio wind park, during the [WindSoMe](https://osuva.uwasa.fi/handle/10024/11290) project.

You can listen to the sound sample using the audio player created by the following cell.

In [None]:
from IPython.display import Audio, display

# Create an audio player widget for the recording and show it
audio = Audio(filename='data/sample.wav')
display(audio)


### Feature extraction

The first step in classification is to generate features. Here the whole 180-second sound signal is divided into 360 pieces (0.5 s each), and many audio features are computed for each piece. Since we did not know in advance which audio features would matter most, we computed (almost) all the ones we knew and let a machine learning algorithm choose the best. The following features were calculated:

| Feature # (end exclusive) | Explanation |
| ----- | :---- |
| 0-128 | 128 Mel spectral coefficients (these describe how the sound energy is distributed over a psychoacoustically relevant spectrum) |
| 128-168 | 40 Mel cepstral coefficients |
| 168-175 | 7 Spectral contrast coefficients |
| 175-178 | Three polynomial coefficients (when fitting a third order polynomial to the data) |
| 178-181 | Three LPC filter model coefficients |
| 181 | RMS value of the signal |
| 182 | Zero crossing rate |
| 183 | Spectral centroid |
| 184 | Spectral bandwidth |
| 185 | Spectral flatness |
| 186 | True sound class |

The sound classes are:
 1. Wind turbine noise
 2. Bird calls
 3. Tractor noise
 4. Rain
 5. Wind

Most of the features were calculated with [librosa](https://librosa.org/).
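
To make the idea of per-piece features concrete, here is a minimal NumPy-only sketch that splits a signal into 360 pieces and computes two of the simpler features from the table (RMS value and zero crossing rate). It is not the code used to produce the data file: the sampling rate `sr` and the synthetic test signal are assumptions made for illustration, and the zero crossing rate is a simple textbook definition rather than the librosa implementation.

```python
import numpy as np

sr = 8000          # assumed sampling rate in Hz (not stated in the exercise)
duration = 180     # signal length in seconds, as in the exercise
n_pieces = 360     # number of pieces, i.e. 0.5 s per piece

rng = np.random.default_rng(0)
t = np.arange(sr * duration) / sr
# Synthetic stand-in signal: a 100 Hz tone plus a little noise
signal = np.sin(2 * np.pi * 100 * t) + 0.1 * rng.normal(size=t.size)

# One row per piece
pieces = signal.reshape(n_pieces, -1)

# RMS value of each piece
rms = np.sqrt(np.mean(pieces ** 2, axis=1))

# Zero crossing rate: fraction of adjacent sample pairs whose sign differs
signs = np.signbit(pieces)
zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

# Feature matrix with one row per piece and one column per feature
features = np.column_stack([rms, zcr])
print(features.shape)
```

Each row of `features` plays the role of one row in the pre-calculated feature table, only with two columns instead of 186.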
 
 The pre-calculated features are available in the data file, and you can read it with the following code:

In [None]:
import pandas as pd
Data=pd.read_hdf('data/Features.hdf')
print(Data.shape)
Data.head()

## Task 1: Visualize the data using PCA

It is often a good idea to first visualize the data in PCA coordinates to see whether the classes are nicely separated. This helps in deciding how advanced a classifier is needed.

Scale the data, calculate the PCA, and plot the data in the first two principal components, coloring the samples by class label.
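
The effect of scaling can be illustrated on synthetic data. The sketch below is not a solution for the exercise data: the array `X`, the labels `y` and the feature scales are made up for illustration. It shows that without standardization a single large-scale feature dominates the principal components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 3 classes, 6 features on very different scales
n_per_class = 40
scales = np.array([1.0, 1.0, 1.0, 1000.0, 0.01, 50.0])
centers = rng.normal(size=(3, 6)) * 3.0
X = np.vstack([(c + rng.normal(size=(n_per_class, 6))) * scales for c in centers])
y = np.repeat([1, 2, 3], n_per_class)

# Standardize each feature to zero mean and unit variance, then project to 2 components
X_scaled = StandardScaler().fit_transform(X)
pc = PCA(n_components=2).fit_transform(X_scaled)

# For comparison: PCA on the raw, unscaled data
pc_raw = PCA(n_components=2).fit_transform(X)

print(pc.shape)
print("largest |score| with scaling:   ", np.abs(pc).max())
print("largest |score| without scaling:", np.abs(pc_raw).max())
```

A scatter plot of `pc[:, 0]` against `pc[:, 1]`, colored by `y`, then gives the kind of visualization the task asks for.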

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Store your principal components as pc
#pc= 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert('pc' in locals()), "No pc variable found! Did you forget to store your principal components as pc?"

if abs(pc.max())>50:
    print("You apparently forgot to standardize the variables before PCA?")
    assert(False)

## Task 2


### Classification using an ExtraTrees classifier


Train an `ExtraTreesClassifier` to classify the samples directly in the feature space. Do not use PCA as a preprocessor. Use 25% of the samples in the training set. Report the accuracy in the training set, with cross-validation (5 folds), and in the test set.

This problem is not really difficult, so try to use shallow decision trees, no deeper than 3. 

Print the accuracy score in the training set, in cross-validation, and in the test set. Also calculate the confusion matrix as variable `M` and print it.

Also plot the `.feature_importances_` attribute of the predictor, for example using the `plt.stem()` function. Read from the documentation what it means and how this information could be used.

You can try adjusting at least the following parameters of the classifier: `n_estimators`, `max_depth`, and `min_samples_leaf`. Give your classifier only as many degrees of freedom as it needs for good classification results, and no more, to avoid overly complex decision boundaries and possible problems in generalization.
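
The workflow described above can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not a solution for the exercise data: the dataset from `make_classification` and all parameter values are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (hypothetical): 360 samples, 20 features, 3 classes
X, y = make_classification(n_samples=360, n_features=20, n_informative=5,
                           n_redundant=2, n_classes=3, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)

# 25% of the samples go to the training set, the rest to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.25, stratify=y, random_state=0)

# Shallow trees keep the decision boundaries simple
predictor = ExtraTreesClassifier(n_estimators=100, max_depth=3,
                                 min_samples_leaf=2, random_state=0)
predictor.fit(X_train, y_train)

train_score = predictor.score(X_train, y_train)
cv_score = cross_val_score(predictor, X_train, y_train, cv=5).mean()
test_score = predictor.score(X_test, y_test)
M = confusion_matrix(y_test, predictor.predict(X_test))

print(f"train={train_score:.3f}  cv={cv_score:.3f}  test={test_score:.3f}")
print(M)

# Indices of the five features the ensemble relied on most
top5 = np.argsort(predictor.feature_importances_)[::-1][:5]
print(top5)
```

Ranking the features like this is one possible use of the importances: a model could be retrained on only the highest-ranked features to see whether the accuracy holds up with a much smaller input.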

In [None]:
from sklearn.ensemble import ExtraTreesClassifier


#train_score=
#test_score=
#cv_score=
#M=

# YOUR CODE HERE
raise NotImplementedError()
plt.figure(figsize=(10,3))
plt.stem(predictor.feature_importances_)


In [None]:
import sklearn.ensemble
print(predictor)
assert(type(predictor)==sklearn.ensemble.ExtraTreesClassifier), "The classifier is of wrong type"
assert(test_score>0.9), "The predictor cannot classify well enough"
assert(np.diag(M).sum()/M.sum()>0.9), "Confusion matrix does not look good"

## Task 3

### Classification using a gradient boosted tree classifier

Train a `GradientBoostingClassifier` to classify the samples directly in the feature space. Do not use PCA as a preprocessor. Use 25% of the samples in the training set. Report the accuracy in the training set, with cross-validation (5 folds), and in the test set. Also calculate the confusion matrix as variable `M` and print it.

Also plot the `.feature_importances_` attribute of the predictor, for example using the `plt.stem()` function. Read from the documentation what it means and how this information could be used.

You can try adjusting at least the following parameters of the classifier: `n_estimators`, `max_depth`, `min_samples_leaf`, and `learning_rate`. Give your classifier only as many degrees of freedom as it needs for good classification results, and no more, to avoid overly complex decision boundaries and possible problems in generalization.
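
One way to judge how many degrees of freedom are needed is to compare training, cross-validation and test accuracy for a few settings. The sketch below does this for two values of `max_depth` on synthetic data. The dataset and every parameter value are hypothetical choices for illustration, not recommended settings for the exercise data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (hypothetical): 360 samples, 20 features, 3 classes
X, y = make_classification(n_samples=360, n_features=20, n_informative=5,
                           n_redundant=2, n_classes=3, n_clusters_per_class=1,
                           class_sep=2.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.25, stratify=y, random_state=1)

results = {}
for depth in (1, 3):
    model = GradientBoostingClassifier(n_estimators=50, max_depth=depth,
                                       min_samples_leaf=2, learning_rate=0.1,
                                       random_state=0)
    # Cross-validation refits clones of the model on folds of the training set
    cv = cross_val_score(model, X_train, y_train, cv=5).mean()
    model.fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), cv,
                      model.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}  "
          f"cv={results[depth][1]:.3f}  test={results[depth][2]:.3f}")
```

A training accuracy well above the cross-validation accuracy suggests the model has more freedom than the data supports. If the simpler setting reaches a similar cross-validation score, it is usually the safer choice.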


In [None]:
from sklearn.ensemble import GradientBoostingClassifier


#train_score=
#test_score=
#cv_score=
#M=

# YOUR CODE HERE
raise NotImplementedError()
plt.figure(figsize=(10,3))
plt.stem(predictor.feature_importances_)


In [None]:
import sklearn.ensemble
print(predictor)
assert(type(predictor)==sklearn.ensemble.GradientBoostingClassifier), "The classifier is of wrong type"
assert(test_score>0.9), "The predictor cannot classify well enough"
assert(np.diag(M).sum()/M.sum()>0.9), "Confusion matrix does not look good"