# Classification exercises


## Satellite image classification

The following satellite image was acquired over Vaasa on 2 June 2021. It was captured by the European Sentinel-2 satellite using its multispectral instrument (MSI). The multispectral camera records the image in 13 different wavelength bands instead of the three (RGB) of a normal camera. These images can be searched and downloaded using [Copernicus Open Access Hub](https://scihub.copernicus.eu/dhus/), and preprocessed using ESA's [SNAP](http://step.esa.int/main/download/) tool. This data, however, was downloaded by Cem using his extraordinary [satellite data tool](https://cemmozzy.users.earthengine.app/view/test).


The bands used are:

| Band number | Band name | Wavelength | Region | Remarks |
| ----------- | --------- | -----------| ------ | ------- |
|  1 | B1  |   443 nm | Violet     | Chlorophyll-A |
|  2 | B2  |   490 nm | Cyan       | |
|  3 | B3  |   560 nm | Green      | |
|  4 | B4  |   665 nm | Red        | Chlorophyll-A |
|  5 | B5  |   705 nm | Red        | |
|  6 | B6  |   740 nm | Red        | |
|  7 | B7  |   783 nm | Deep red   | |
|  8 | B8  |   842 nm | NIR        | |
|  9 | B8A |   865 nm | NIR        | |
| 10 | B9  |   945 nm | NIR        | |
| 11 | B10 |  1375 nm | SWIR       | |
| 12 | B11 |  1610 nm | SWIR       | |
| 13 | B12 |  2190 nm | SWIR       | |
| 14 | -   |  -       | -          | Empty |
| 15 | -   |  -       | -          | Empty |
| 16 | -   |  -       | -          | Empty |

The channels listed above can be used to create a natural-looking RGB image, as shown below.

![Palosaari](Palosaari.png)

Even though only three channels are used for the RGB image, all 13 can be useful features for land type and crop classification.

## Training data 

The training data is obtained from the [Dynamic World land use dataset](https://www.dynamicworld.app).

The labelled areas are:

| Segment no. | Segment name | In Finnish | Segment color |
| ----------- | ------------ | ---------- | ------------- |
|  0 | Water  | Vettä   | Blue |
|  1 | Trees  | Puita tai metsää  | Green |
|  2 | Grass  | Ruohikkoa  | Light green |
|  3 | Crops  | Viljelysmaata  | Brownish yellow |
|  4 | Shrub  | Pensaikkoa | Yellow |
|  5 | Flooded vegetation | Tulva-alue | Purple |
|  6 | Built-up area | Rakennettu alue | Red |
|  7 | Bare ground | Paljas maa | Gray |
|  8 | Snow & Ice  | Lunta ja jäätä | Purple |


## Task 1


### Read the data
The data is stored in two 32-bit TIFF images. These images need to be opened using the imageio library with LZW compression support. (If you are using your own environment, install `imageio` and `imagecodecs` first.)

The data is stored in the following images:
 - `20210602_s2.tif`: The Sentinel-2 spectral image, 800x4817 pixels, each pixel containing 16 32-bit channels.
 - `20210602_dw.tif`: The Dynamic World land use classification data, 800x4817 pixels, each pixel containing one 8-bit integer. Values range from 0 to 15.

The images can be opened like this:

```python
import imageio as iio
Is2 = iio.v2.imread('20210602_s2.tif')
Idw = iio.v2.imread('20210602_dw.tif')
```

The image is a three-dimensional matrix, where the x- and y-dimensions are the pixel locations and the z-dimension is the color channel number.

![satelliteimage](satelliteimage.svg)

Select the Palosaari area by taking a subset of the image using only pixels from $y\in[200:800]$ and $x\in[500:1000]$, and store it as `I`.

Select the labels covering the same area from the Dynamic World image `Idw`, and store them as a row vector `l`.

Plot the image using only channels 1, 2 and 3 to see which area it covers, and plot a histogram of the area labels from Dynamic World to see how many samples there are from the different land cover classes in the image. Scale the image so that its values are floating-point numbers between 0 and 1 to display it properly.
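The slicing, flattening and scaling steps described above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the arrays `img` and `labels_img`, their sizes and the window limits are made up, and the channel order `[3, 2, 1]` assumes that channel index 0 holds band B1, so that indices 3, 2 and 1 correspond to red, green and blue.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the real images (assumption: same layout,
# rows x columns x channels, but smaller and filled with random values).
rng = np.random.default_rng(0)
img = rng.uniform(0, 3000, size=(100, 120, 16)).astype(np.float32)
labels_img = rng.integers(0, 9, size=(100, 120)).astype(np.uint8)

# Take a rectangular window: rows (y) first, then columns (x), all channels.
sub = img[20:80, 30:90, :]

# Cut the same window from the label image and flatten it to one dimension.
lab = labels_img[20:80, 30:90].ravel()

# Pick three channels and rescale them linearly to the range 0..1.
rgb = sub[:, :, [3, 2, 1]]
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min())

# Show the image next to a histogram of the label values.
fig, (ax_img, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
ax_img.imshow(rgb)
ax_img.set_title("Three-channel composite")
ax_hist.hist(lab, bins=np.arange(10) - 0.5, rwidth=0.8)
ax_hist.set_xlabel("Label")
ax_hist.set_ylabel("Pixel count")
```

In a notebook the figure is rendered automatically; in a plain script, `plt.show()` would be needed after the last line. Real satellite data often has a few very bright pixels, so clipping the values before scaling may give a better-looking image than the plain min-max scaling used here.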

In [3]:
#!pip install imageio
#!pip install imagecodecs

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Import the imageio library, which can also read 32-bit scientific TIFF images
import imageio as iio

In [1]:
# YOUR CODE HERE
raise NotImplementedError()

NotImplementedError: 

In [7]:
#Tests

points=0
if ('Is2' in globals()) and ('l' in globals()):
    points+=1
else:
    print("Please define Is2 and l")
### BEGIN HIDDEN TESTS
if Is2.shape==(800, 4817, 16):
    points +=0.5
else:
    print("Wrong shape for Is2", Is2.shape)
if len(l)==300000:
    points +=0.5
else:
    print("Wrong length for l", len(l))
### END HIDDEN TESTS
points

2.0

## Task 2

Construct the design matrix `X` and the label vector `y`, and split the data into training and testing sets.

Reshape the image data `I` so that it has only one spatial dimension, and keep only the first 13 channels as features. Use the `reshape` function of numpy arrays for this purpose. Store the result in the design matrix `X`.

Check that your label vector `l` is already a row vector. If it is not, you can use the `reshape` or `ravel` functions of numpy arrays to convert it to a row vector. Store it as the label vector `y`.

Use the sklearn function `train_test_split` to randomly split the data into a testing set and a training set. Normally it is good to use quite a large training set, but since we are now using the nearest neighbours method, which is slow with large training sets, we will exceptionally split the data so that 1% is used as the training set and 99% for testing. The function splits both the design matrix `X` and the label vector `y` at the same time, which makes its use very convenient. Store your training set in the variables `X_train` and `y_train`, and your testing set in the variables `X_test` and `y_test`.
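As a rough sketch of the reshape and split steps, the following example uses a small random cube instead of the real image. The names `cube`, `features` and `lab`, the array sizes and the `random_state` value are assumptions made for illustration; the exercise does not say which random state, if any, is expected.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic image cube (rows x columns x channels) and matching labels.
rng = np.random.default_rng(0)
cube = rng.normal(size=(40, 50, 16)).astype(np.float32)
lab = rng.integers(0, 9, size=40 * 50).astype(np.uint8)

# Keep the first 13 channels and merge the two spatial axes into one,
# giving one row per pixel and one column per channel.
features = cube[:, :, :13].reshape(-1, 13)

# Split features and labels together: 1% for training, the rest for testing.
F_train, F_test, t_train, t_test = train_test_split(
    features, lab, train_size=0.01, random_state=0
)

print(features.shape, F_train.shape, F_test.shape)
```

Because `reshape` and `ravel` both walk the array row by row by default, row `i` of `features` and element `i` of a label vector flattened from the same window refer to the same pixel.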

In [2]:
from sklearn.model_selection import train_test_split

# YOUR CODE HERE
raise NotImplementedError()

NotImplementedError: 

In [3]:
#Tests

points=0
if ('X_train' in globals()) and ('X_test' in globals()) and ('y_train' in globals()) and ('y_test' in globals()):
    points+=0.5
if (X_train.shape == (3000,13)) and (len(y_train)==3000):
    points+=0.5

assert(y.dtype==np.uint8)
points

NameError: name 'X_train' is not defined

## Task 2

Analyze the complexity of the data by plotting the training set using the first two principal components.

 - Transform the training set data into the PCA domain and store the result as `pc`
 - Plot a scatter plot of the first two principal components, where the color is the label number, i.e. the integer representation of `y` (the classes) of the training set
 - Use the `s` (size) parameter to make the dots in the scatter plot smaller, to see the data better.
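The steps listed above could look roughly like this on synthetic data. The three artificial classes, the variable names `data`, `classes` and `proj`, and the `tab10` colormap are assumptions for illustration only; the real training set has more classes and a different structure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Three synthetic classes in a 13-dimensional feature space.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 13))
classes = rng.integers(0, 3, size=300)
data = centers[classes] + rng.normal(size=(300, 13))

# Project the data onto its first two principal components.
pca = PCA(n_components=2)
proj = pca.fit_transform(data)

# Scatter plot coloured by the integer class label, with small dots.
fig, ax = plt.subplots()
points = ax.scatter(proj[:, 0], proj[:, 1], c=classes, s=5, cmap="tab10")
fig.colorbar(points, ax=ax, label="Class")
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
```

If the classes form clearly separated clusters in such a plot, a simple classifier is likely to work well; heavily overlapping clusters suggest a harder problem.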

In [4]:
# YOUR CODE HERE
raise NotImplementedError()


NotImplementedError: 

In [11]:
#Tests

assert(pc.shape==(3000, 2)), "Something is wrong with PCA"
assert(max(y_train)<9), "Something is wrong with integer labels"

### BEGIN HIDDEN TESTS
print(y_train[0])
# It should be 6 = built
assert(y_train[0]==6)
### END HIDDEN TESTS

0


AssertionError: 

## Task 3

Define a KNN classifier which assigns each pixel of the image to the correct land use class, in a similar way to the Dynamic World land use map.

- Create a processing pipeline using a standard scaler and a KNN classifier, and name the pipeline `predictor`
- Train the pipeline using the training data
- Check if it passes the tests for accuracy on the training set
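A minimal sketch of such a pipeline, trained on synthetic data, is shown below. The step names `"scaler"` and `"knn"`, the choice of `n_neighbors=5` and the variable names are assumptions; the exercise leaves these choices open.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic classes with 13 features each.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 13))
classes = rng.integers(0, 3, size=300)
data = centers[classes] + rng.normal(size=(300, 13))

# Scaling comes first so that every feature contributes comparably
# to the distances that the KNN step computes.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
model.fit(data, classes)

# Accuracy on the data the model was trained on.
train_acc = model.score(data, classes)
print(train_acc)
```

Putting the scaler inside the pipeline means that the scaling parameters are learned from the training data only and are reused automatically whenever `predict` is called on new data.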


In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
import sklearn.metrics as metrics


#from sklearn ...

#predictor= ...

# YOUR CODE HERE
raise NotImplementedError()

NotImplementedError: 

In [13]:
#Tests

# Test the accuracy on the training set
yh=predictor.predict(X_train)
train_score=metrics.accuracy_score(y_true=y_train, y_pred=yh)
if len(predictor.steps)<2:
    print("The predictor is not a pipeline. Did you forget scaling?")
assert(len(predictor.steps)>=2)
print(train_score)
assert(train_score > 0.8)


0.9393333333333334


## Task 4: Evaluation of the predictor

Having trained the predictor, now evaluate its performance using cross-validation and the test set. You may use the `cross_val_score` function from the `sklearn.model_selection` module and `accuracy_score` from the `sklearn.metrics` module. Store the results as `cv_score`, `train_score` and `test_score`; since `cross_val_score` returns one score per fold, use for example their mean as `cv_score`.

Also print the confusion matrix to see which areas are misclassified. Use the `confusion_matrix` function from the `sklearn.metrics` module.
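The evaluation described above can be sketched as follows on synthetic data. The data, the 50/50 split, the five folds and all variable names are assumptions chosen to keep the example small; they are not the settings of the exercise.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic classes with 13 features each.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 13))
classes = rng.integers(0, 3, size=600)
data = centers[classes] + rng.normal(size=(600, 13))

F_train, F_test, t_train, t_test = train_test_split(
    data, classes, train_size=0.5, random_state=0
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Cross-validation refits a copy of the model on each fold of the
# training set and returns one accuracy value per fold.
fold_scores = cross_val_score(model, F_train, t_train, cv=5)
cv_mean = fold_scores.mean()

# Fit once on the whole training set, then score both sets.
model.fit(F_train, t_train)
train_acc = accuracy_score(t_train, model.predict(F_train))
test_acc = accuracy_score(t_test, model.predict(F_test))

print(cv_mean, train_acc, test_acc)
```

A training accuracy that is clearly higher than both the cross-validation and the test accuracy is a typical sign of overfitting, while three similar values suggest that the model generalizes.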


In [6]:
#from sklearn ...
#
#cv_score= ...
#train_score= ...
#test_score= ...



# YOUR CODE HERE
raise NotImplementedError()


NotImplementedError: 

In [15]:
#Tests
assert(cv_score > 0.85)
assert(train_score > 0.85)
assert(test_score > 0.85)

## Task 5: Confusion matrix

Make a confusion matrix of the classifier for the training set, store it as `CM` and print it for further analysis.
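To show how a confusion matrix is read, here is a small hand-made example. The label vectors `true_labels` and `pred_labels` are invented for illustration and have nothing to do with the satellite data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented true and predicted labels for three classes.
true_labels = np.array([0, 0, 1, 1, 2, 2, 2])
pred_labels = np.array([0, 1, 1, 1, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(true_labels, pred_labels)
print(cm)

# The diagonal holds the correct predictions, so the row sum minus the
# diagonal gives the number of misclassified samples per true class.
errors_per_class = cm.sum(axis=1) - np.diag(cm)
print(errors_per_class)
```

By default the matrix has one row and column for every label that occurs in either vector. The `labels` parameter of `confusion_matrix` can be used to fix the set and order of classes explicitly.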

In [7]:
# YOUR CODE HERE
raise NotImplementedError()

NotImplementedError: 

In [17]:
if 'CM' not in globals():
    print("Please store your confusion matrix as variable CM")
assert(CM.shape==(8,8))

AssertionError: 

## Task 6: Interpretation of the results

1. Describe the precision of the classifier.
2. Is it probable that the classifier suffers from overfitting? Why or why not?
3. Which land cover type had the most misclassified samples?


### YOUR ANSWER HERE