# Applied Machine Learning (2022), exercises


## General instructions for all exercises

Follow the instructions and fill in your solution under the line marked by the tag

> YOUR CODE HERE

Also remove the line

> raise NotImplementedError()

**Do not change other areas of the document**, since it may disturb the autograding of your results!
  
Having written the answer, execute the code cell by pressing the `Shift-Enter` key combination. The code is run, and it may print some information under the code cell. The focus automatically moves to the next cell, which you can execute by pressing `Shift-Enter` again, until you have reached the code cell that tests your solution. Execute it and follow the feedback. Usually it either says that the solution seems acceptable or reports some errors. You can go back to your solution, modify it, and repeat the steps until you are satisfied. Then proceed to the next task.
   
Repeat the process for all tasks.

The notebook may also contain manually graded answers. Write your manually graded answer under the line marked by the tag:

> YOUR ANSWER HERE

Manually graded answers are text in Markdown format. They may contain text, pseudocode, or mathematical formulas. You can write formulas with $\LaTeX$ syntax by enclosing the formula in dollar signs (`$`); for example, `$f(x)=2 \pi / \alpha$` produces $f(x)=2 \pi / \alpha$.

When you have passed the tests in the notebook and are ready to submit your solutions, download the whole notebook using the menu `File -> Download as -> Notebook (.ipynb)`. Save the file on your hard disk and submit it in [Moodle](https://moodle.uwasa.fi) or EUNICE Moodle under the corresponding exercise.

Your solution should be executable Python code. Use the existing code as an example of Python programming, and consult the numerous Python programming materials on the Internet if necessary.


# Classification exercises


## Satellite image classification

The following satellite image was taken over Vaasa on 2 June 2021. The image was acquired by the European Sentinel-2 satellite using its multispectral imaging device (MSI). The multispectral camera acquires the image in 13 different wavelength bands instead of the three (RGB) of a normal camera. These images can be searched and downloaded using the [Copernicus Open Access Hub](https://scihub.copernicus.eu/dhus/), and preprocessed using ESA's [SNAP](http://step.esa.int/main/download/) tool. This data was, however, downloaded by Cem using his extraordinary [satellite data tool](https://cemmozzy.users.earthengine.app/view/test).


The bands used are 

| Band number | Band name | Wavelength | Region | Remarks |
| ----------- | --------- | ---------- | ------ | ------- |
|  1 | B1  |   443 nm | Violet     | Chlorophyll-a |
|  2 | B2  |   490 nm | Cyan       | |
|  3 | B3  |   560 nm | Green      | |
|  4 | B4  |   665 nm | Red        | Chlorophyll-a |
|  5 | B5  |   705 nm | Red        | |
|  6 | B6  |   740 nm | Red        | |
|  7 | B7  |   783 nm | Deep red   | |
|  8 | B8  |   842 nm | NIR        | |
|  9 | B8A |   865 nm | NIR        | |
| 10 | B9  |   945 nm | NIR        | |
| 11 | B10 |  1375 nm | SWIR       | |
| 12 | B11 |  1610 nm | SWIR       | |
| 13 | B12 |  2190 nm | SWIR       | |
| 14 | -   |  -       | -          | Empty |
| 15 | -   |  -       | -          | Empty |
| 16 | -   |  -       | -          | Empty |

The channels listed above can be used to create a natural-looking RGB image, as shown below.

![Palosaari](Palosaari.png)

Even though only three channels are used for the RGB image, all 13 can be useful features for land type and crop classification.

## Training data 

The training data is obtained from the [Dynamic World land use dataset](https://www.dynamicworld.app).

The labelled areas are:

| Segment no. | Segment name | Segment color |
| ----------- | ------------ | ------------- |
|  0 | Water   | Blue |
|  1 | Trees   | Green |
|  2 | Grass   | Light green |
|  3 | Crops   | Brownish yellow |
|  4 | Shrub | Yellow |
|  5 | Flooded vegetation | Lilac |
|  6 | Built up area | Red |
|  7 | Bare ground | Gray |
|  8 | Snow & Ice | Lilac |


## Task 1


### Read the data
Open the data, which is stored in two 32-bit TIFF images. Reading them requires the `imageio` library with LZW compression support. Neither is installed by default, so both need to be installed first.

The data is stored in the following images:
 - `20210602_s2.tif`: The Sentinel-2 spectral image, 800x4817 pixels, each pixel containing 16 32-bit channels.
 - `20210602_dw.tif`: The Dynamic World land use classification data, 800x4817 pixels, each pixel containing one 8-bit integer with a value in the range 0..15.


First install the required libraries `imageio` and `imagecodecs` with pip.

The opening of the images can be achieved like this:


    import imageio as iio
    Is2 = iio.imread('20210602_s2.tif')
    Idw = iio.imread('20210602_dw.tif')

Select the Palosaari area by taking a subset of the image, using only the pixels with $y\in[200:800]$ and $x\in[500:1000]$, and store it as `I`.

Select the labels covering the same area from the Dynamic World image `Idw` and store them as a row vector `l`.

Plot the image using only channels 1, 2 and 3 to see which area it covers, and plot a histogram of the Dynamic World area labels to see how many samples there are from each class in the image. Scale the image so that its values are floating-point numbers between 0 and 1 to display it properly.
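The subsetting, scaling and label-counting steps can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the arrays `demo_image` and `demo_labels`, their sizes and their random values are hypothetical stand-ins, not the real Sentinel-2 or Dynamic World data.

```python
import numpy as np

# Hypothetical stand-ins: a 100 x 120 pixel image with 16 channels and a label map
rng = np.random.default_rng(0)
demo_image = rng.uniform(0, 3000, size=(100, 120, 16)).astype(np.float32)
demo_labels = rng.integers(0, 9, size=(100, 120)).astype(np.uint8)

# Rows (y) are the first axis and columns (x) the second axis
demo_subset = demo_image[20:80, 50:100, :]
demo_l = demo_labels[20:80, 50:100].ravel()   # flatten the labels to one dimension

# Pick three channels and scale them linearly to the range 0..1 for display
demo_rgb = demo_subset[:, :, 1:4]
demo_rgb = (demo_rgb - demo_rgb.min()) / (demo_rgb.max() - demo_rgb.min())

# Count how many pixels carry each label (the numbers behind a histogram)
values, counts = np.unique(demo_l, return_counts=True)
print(demo_subset.shape, demo_rgb.shape)
print(dict(zip(values.tolist(), counts.tolist())))
```

With matplotlib imported as `plt`, `plt.imshow(demo_rgb)` would display the scaled array and `plt.hist(demo_l)` would draw the label histogram. Whether three selected channels appear in red, green, blue order depends on how the data is stored, so the channel order may need to be reversed.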

In [None]:
#!pip install imageio
#!pip install imagecodecs

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Import the imageio-library which is also capable of reading 32 bit scientific TIFF images
import imageio as iio

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#Tests

points=0
if ('Is2' in globals()) and ('l' in globals()):
    points+=1
else:
    print("Please define Is2 and l")
points

## Task 2

Construct the design matrix `X`, label vector `y` and split the data to training and testing sets.

Reshape the image data `I` so that it has only one spatial dimension and keeps only the first 13 features. Use the `reshape` function of NumPy arrays for this purpose. Store the result as the design matrix `X`.

Check that your label vector `l` is already a row vector. If it is not, you can use the `reshape` or `ravel` function of NumPy arrays to convert it to a row vector. Store the label vector as `y`.

Use the sklearn function `train_test_split` to randomly split the data into a training set and a testing set. Normally it is good to use quite a large training set, but since we are now using the nearest neighbours method, which is slow with large training sets, we will exceptionally split the data so that 1% is used for training and 99% for testing. The function splits both the design matrix `X` and the label vector `y` at the same time, which makes it very convenient to use. Store your training set in the variables `X_train` and `y_train`, and your testing set in the variables `X_test` and `y_test`.
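The reshaping and splitting described above can be sketched on synthetic data. The arrays and the `demo_`/`d` prefixed names below are hypothetical and chosen so that they do not collide with the variables the tests expect; `random_state` is fixed only to make the sketch repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical image block: 60 rows, 50 columns, 16 channels, plus a label map
rng = np.random.default_rng(0)
demo_image = rng.normal(size=(60, 50, 16))
demo_labels = rng.integers(0, 9, size=(60, 50)).astype(np.uint8)

# Merge the two spatial dimensions into one and keep the first 13 channels
demo_X = demo_image.reshape(-1, demo_image.shape[-1])[:, :13]
demo_y = demo_labels.ravel()

# train_size=0.01 keeps 1% of the samples for training, the rest for testing
dX_train, dX_test, dy_train, dy_test = train_test_split(
    demo_X, demo_y, train_size=0.01, random_state=0
)
print(dX_train.shape, dX_test.shape, dy_train.shape, dy_test.shape)
```

Because `reshape` and `ravel` both traverse the pixels in the same row-major order, row `i` of `demo_X` and element `i` of `demo_y` refer to the same pixel.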

In [None]:
from sklearn.model_selection import train_test_split

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#Tests

points=0
if ('X_train' in globals()) and ('X_test' in globals()) and ('y_train' in globals()) and ('y_test' in globals()):
    points+=0.5
if (X_train.shape == (3000,13)) and (len(y_train)==3000):
    points+=0.5

assert(y.dtype==np.uint8)
points

## Task 2 (continued)

Analyze the complexity of the data by plotting the training set using its first two principal components.

 - Transform the training set data into the PCA domain and store the result as `pc`
 - Plot a scatter plot of the first two principal components, where the color is the label number, that is, the integer representation of `y` (the classes) of the training set
 - Use the `s` (size) parameter to make the dots in the scatter plot smaller, so that the data is easier to see.
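A minimal sketch of these steps on random data is shown below. The data, the `demo_` names, the colormap and the marker size are assumptions made for illustration; whether to scale the features before PCA is a separate choice that is not made here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical data: 300 samples with 13 features and integer labels 0..8
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(300, 13))
demo_y = rng.integers(0, 9, size=300)

# Project the samples onto the first two principal components
demo_pc = PCA(n_components=2).fit_transform(demo_X)

# Color each point by its integer label; a small s keeps dense clouds readable
plt.scatter(demo_pc[:, 0], demo_pc[:, 1], c=demo_y, s=5, cmap="tab10")
plt.colorbar(label="class")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
```

Classes that overlap heavily in such a plot are usually the ones a classifier finds hard to separate, although two components show only part of the variance.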

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#Tests

assert(pc.shape==(3000, 2)), "Something is wrong with PCA"
assert(max(y_train)<9), "Something is wrong with integer labels"


## Task 3

Define a KNN classifier that assigns each pixel of the image to a land use class in a similar way to the Dynamic World land use map.

- Create a processing pipeline using a standard scaler and a KNN classifier, and name the pipeline `predictor`
- Train the pipeline using the training data
- Check that it passes the tests for accuracy on the training set
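The general shape of such a pipeline can be sketched as follows, assuming synthetic clustered data from `make_blobs`. The step names `"scaler"` and `"knn"`, the value `n_neighbors=5` and the `demo_` names are illustrative choices, not requirements of the exercise.

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 clusters in a 13-dimensional feature space
demo_X, demo_y = make_blobs(n_samples=300, centers=3, n_features=13, random_state=0)

# Each step is a (name, estimator) pair; the scaler runs before the classifier
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# fit() fits the scaler, transforms the data, then fits the classifier
demo_pipeline.fit(demo_X, demo_y)
demo_score = demo_pipeline.score(demo_X, demo_y)
print(len(demo_pipeline.steps), demo_score)
```

Scaling matters for KNN because the method compares distances, so a feature with a large numeric range would otherwise dominate the others.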


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
import sklearn.metrics as metrics


#from sklearn ...

#predictor= ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#Tests

# Testing the accuracy on the training set
yh=predictor.predict(X_train)
train_score=metrics.accuracy_score(y_true=y_train, y_pred=yh)
if len(predictor.steps)<2:
    print("The predictor is not a pipeline. Did you forget scaling?")
assert(len(predictor.steps)>=2)
print(train_score)
assert(train_score > 0.8)


## Task 4: Evaluation of the predictor

Having trained the predictor, now evaluate its performance using cross-validation and the test set. You may use the `cross_val_score` function from the `sklearn.model_selection` module and `accuracy_score` from the `sklearn.metrics` module.

Also print the confusion matrix to see which areas are misclassified. Use the `confusion_matrix` function from the `sklearn.metrics` module.
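The evaluation calls can be sketched on synthetic data as below. The data, the 50/50 split, `cv=5` and the `demo_`/`d` prefixed names are assumptions for illustration only. Note that `cross_val_score` returns one score per fold, so the sketch takes the mean to get a single number.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 4 clusters in 13 dimensions, split into two halves
demo_X, demo_y = make_blobs(n_samples=400, centers=4, n_features=13, random_state=1)
dX_train, dX_test, dy_train, dy_test = train_test_split(
    demo_X, demo_y, train_size=0.5, random_state=1
)

demo_model = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Cross-validation refits copies of the model on each fold of the training set
demo_cv_scores = cross_val_score(demo_model, dX_train, dy_train, cv=5)
demo_cv = demo_cv_scores.mean()

# Fit once on the whole training set, then score on both sets
demo_model.fit(dX_train, dy_train)
demo_train = accuracy_score(dy_train, demo_model.predict(dX_train))
demo_test = accuracy_score(dy_test, demo_model.predict(dX_test))
print(demo_cv, demo_train, demo_test)
```

A training score clearly above the cross-validation and test scores would be one sign of overfitting.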


In [None]:
#from sklearn ...
#
#cv_score= ...
#train_score= ...
#test_score= ...


# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#Tests
assert(cv_score > 0.85)
assert(train_score > 0.85)
assert(test_score > 0.85)

## Task 5: Confusion matrix

Make a confusion matrix of the classifier for the training set, store it as `CM`, and print it on the screen for further analysis.
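How `confusion_matrix` is laid out can be seen from a small hand-made example. The label arrays below are hypothetical. By default the matrix has one row and one column per class that occurs in the true or predicted labels, so its size depends on the data unless `labels` is given.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels; class 2 never occurs
demo_true = np.array([0, 0, 1, 1, 3, 3, 3])
demo_pred = np.array([0, 1, 1, 1, 3, 0, 3])

# Rows are true classes and columns are predicted classes
demo_cm = confusion_matrix(demo_true, demo_pred)
print(demo_cm)

# Passing labels explicitly fixes the size, also for classes absent from the data
demo_cm_full = confusion_matrix(demo_true, demo_pred, labels=[0, 1, 2, 3])
print(demo_cm_full.shape)
```

The diagonal holds the correctly classified samples, and an off-diagonal entry in row `i`, column `j` counts samples of class `i` that were predicted as class `j`.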

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
if 'CM' not in globals():
    print("Please store your confusion matrix as variable CM")
assert(CM.shape==(8,8))

## Task 6: Interpretation of the results

1. What is your opinion of the accuracy you achieved?
1. Can you see signs of overfitting? Why or why not?
1. Which samples were misclassified? To which classes were they assigned?


YOUR ANSWER HERE