<a href="https://colab.research.google.com/github/peterliu502/GEO1001_hw02/blob/master/5386586_5360684.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis report for GEO1001 HW01 
---

## Authors  
1. First author: [<img src="https://avatars3.githubusercontent.com/u/59593272?s=400&u=ba1618be6d5e354f0bd7685ff405bdec6d18c101&v=4" align = "left" width = "25" height = "25" />](https://github.com/peterliu502)  
   * Name: Zhenyu Liu  
   * Student Number: 5386586  
  
2. Second author: [<img src="https://avatars2.githubusercontent.com/u/47234206?s=400&u=3f54e18f68e48f985db9f0ef1a8eb3a3ca189b1e&v=4" align = "left" style = "float:left" width = "25" height = "25" />](https://github.com/Ziyan-Wu)
   * Name: Ziyan Wu  
   * Student Number: 5360684  

## Source data 
The source data contains `Sentinel-2` images for the area around Delft on May 30th 2020, with 10m, 20m and 60m resolutions.

## Python packages  
This assignment uses the following Python packages:  
1. `numpy`
2. `scikit-learn`
3. `matplotlib`  
4. `rasterio` 

In [1]:
import numpy as np
import rasterio as rio
from rasterio.windows import Window
import rasterio.plot as rp
from matplotlib import pyplot as plt
from matplotlib import colors as mc
from sklearn import cluster
from sklearn import mixture

## Open files and preprocess  

In [2]:
def open_raster(resolution, band):
    """Read one Sentinel-2 band and rescale its values to the range 0-1."""
    with rio.open(
            './GRANULE/L2A_T31UET_A025788_20200530T105134/IMG_DATA/R' + resolution
            + 'm/T31UET_20200530T105031_B' + band + '_' + resolution + 'm.jp2',
            driver="JP2OpenJPEG") as ds:
        if resolution == '10':
            # get the row and column number of the top-left pixel of the subset
            row, col = ds.index(601200, 5773695)
            # crop a 700 x 500 pixel window and rescale its values linearly to 0-1
            return rp.adjust_band(ds.read(1, window=Window(col, row, 700, 500)))
        else:
            # read the whole tile and rescale its values linearly to 0-1
            return rp.adjust_band(ds.read(1))

### Open data  

To make full use of the multi-band data, this assignment first uses all available bands (4 bands for the 10m data and 11 bands for the 60m data).

However, the classification results of the 60m data (especially with KMeans) show that the high-dimensional data makes some ground-object features inconspicuous, which leads to poor classification. To highlight the problem, two commonly used band combinations, band 4, 3, 2 and band 8, 4, 3, are imported as references. In the 60m data, band 8A takes the place of band 8.
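
As a rough illustration of how a band combination becomes the input of a clustering method, the sketch below stacks single-band arrays into a `(rows, cols, bands)` cube and flattens it into one sample per pixel. The `bands` dictionary, its random values and the helper `stack_bands` are made up for this example and stand in for the real rasters.

```python
import numpy as np

# Hypothetical stand-in for the real rasters: one small 2D array per band.
rng = np.random.default_rng(0)
bands = {name: rng.random((4, 5)) for name in ["02", "03", "04", "8A"]}


def stack_bands(bands, names):
    """Stack the selected bands into a (rows, cols, n_bands) array."""
    return np.dstack([bands[name] for name in names])


cube_432 = stack_bands(bands, ["04", "03", "02"])
cube_843 = stack_bands(bands, ["8A", "04", "03"])

# Each pixel becomes one sample with one feature per band.
pixels_432 = cube_432.reshape(-1, cube_432.shape[2])
print(cube_432.shape, pixels_432.shape)
```

The order of the bands inside a combination only changes the column order of the pixel table, so it should not affect distance-based clustering.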

In [3]:
# all bands of the 10m data, stacked to a (rows, cols, bands) array
ds_10_all_subset = [open_raster('10', band) for band in ["02", "03", "04", "08"]]
ds_10 = np.dstack(ds_10_all_subset)
# all bands of the 60m data
ds_60_all_subset = [open_raster('60', band) for band in ["01", "02", "03", "04", "05", "06", "07", "8A", "09", "11", "12"]]
ds_60 = np.dstack(ds_60_all_subset)

# band 4, 3, 2 of the 60m data
ds_60_432 = np.dstack([open_raster('60', band) for band in ["04", "03", "02"]])
# band 8, 4, 3 of the 60m data (band 8A is used in place of band 8)
ds_60_843 = np.dstack([open_raster('60', band) for band in ["8A", "04", "03"]])

## KMeans
KMeans is a commonly used clustering method. Its basic idea is to assign points to K clusters so that each point is nearer to its own `cluster center` (`cluster mean`) than to any other cluster center.  
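
The idea can be sketched with a single iteration of the classic assign-then-update scheme. This is only a minimal illustration on four made-up points, not the way `scikit-learn` implements KMeans; the function name `kmeans_step` is hypothetical.

```python
import numpy as np


def kmeans_step(points, centers):
    """One assignment step followed by one center update."""
    # squared distance from every point to every center: shape (n_points, k)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # each point joins the cluster with the nearest center
    labels = d2.argmin(axis=1)
    # each center moves to the mean of its points (kept as-is if it has none)
    new_centers = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(len(centers))
    ])
    return labels, new_centers


# made-up data: two points near the origin, two points near (10, 10)
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = np.array([[1.0, 1.0], [9.0, 9.0]])
labels, centers = kmeans_step(points, centers)
print(labels)
print(centers)
```

Repeating the step until the labels stop changing gives the final clustering.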

A key step of KMeans clustering is choosing the `K-value` (`n_clusters`) in advance. This assignment uses the `SSE` (`sum of squared errors`) curve to find the `optimal K-value`. In general, SSE decreases as the K-value increases: while the K-value is below the optimum, SSE drops sharply; once it exceeds the optimum, SSE decreases very slowly. The goal is therefore to find the turning point of the SSE curve, which marks the optimal K-value.
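
In this report the turning point is read from the plotted curve. As a possible numeric aid, the sketch below picks the K-value with the largest second difference of the SSE curve. Both this heuristic and the SSE values are assumptions made for the example, not results of this assignment.

```python
import numpy as np


def find_elbow(k_values, sse):
    """Return the k where the SSE curve bends most sharply.

    Heuristic (an assumption of this sketch): choose the interior point
    with the largest second difference, i.e. the biggest change in slope.
    """
    sse = np.asarray(sse, dtype=float)
    # second difference at the interior points k_values[1:-1]
    curvature = sse[:-2] - 2 * sse[1:-1] + sse[2:]
    return k_values[int(np.argmax(curvature)) + 1]


k_values = list(range(3, 11))
# made-up SSE values: steep drop up to k=6, slow decline afterwards
sse = [900.0, 700.0, 500.0, 300.0, 280.0, 265.0, 255.0, 250.0]
print(find_elbow(k_values, sse))
```

Such a heuristic can be sensitive to noise in the curve, so a visual check of the plot remains useful.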

### KMeans Preprocess

#### Create KMeans classifier

In [4]:
# create a KMeans classifier
def kmeans_classifier(ds):
    # store the SSE of each result
    sse = []
    # flatten the (rows, cols, bands) raster to a (pixels, bands) table
    ds_1d = ds.reshape((-1, ds.shape[2]))
    ds_img_cl_list = []
    for n_clusters in range(3, 11):
        # create a KMeans classifier object
        ds_cl = cluster.KMeans(n_clusters=n_clusters)
        # fit the model to the data
        ds_cl.fit(ds_1d)
        # get the cluster label of each pixel
        ds_img_cl = ds_cl.labels_
        # reshape labels back to a 2d (rows, cols) image
        ds_img_cl = ds_img_cl.reshape(ds.shape[:2])
        # inertia_ is the SSE of the fitted model
        sse.append(ds_cl.inertia_)
        ds_img_cl_list.append(ds_img_cl)
    return ds_img_cl_list, sse

#### Create classification image and SSE curve generation functions

In [5]:
# SSE curve generation function
def sse_value(arr):
    plt.figure(figsize=(5, 5))
    X = range(3, 11)
    plt.xlabel('k')
    plt.ylabel('SSE')
    plt.plot(X, arr, 'o-')
    plt.show()


# classification image generation function
def plot_classification(ds_list, resolution):
    if resolution == 10:
        ds_fig = plt.figure(figsize=(30, 10))
    else:
        ds_fig = plt.figure(figsize=(30, 15))
    for elm in range(8):
        # plot the classification image
        ds_fig.add_subplot(2, 4, elm + 1)
        # set the custom color map to represent the different classes in image
        cmap = mc.LinearSegmentedColormap.from_list(
            "", ["white", "purple", "black", "green", "darkgreen", "yellow",
                 "magenta", "red", "blue"])
        plt.imshow(ds_list[elm], cmap=cmap)
        plt.title("n_clusters="+str(elm + 3))
    plt.show()

### Plot 60m data


#### Plot classification images of 60m data using all bands

In [6]:
[ds_img_60, sse_60] = kmeans_classifier(ds_60)

sse_value(sse_60)

In [None]:
plot_classification(ds_img_60, 60)

For the classification using all bands, the optimal number of clusters is 6.  
* For `water bodies`:  
  1. The most serious problem is that the classification does not distinguish the `estuary` area from the `deep sea` area. For `water bodies`, the all-bands data performs far worse than the 3-band data, which is why band 4, 3, 2 and band 8, 4, 3 are imported as references.

* For `cities and towns`:  
  1. KMeans classifies `cities and towns` objects very well in the all-bands data. For example, the `urban` area and the `greenhouse` area are clearly distinguished.  

* For `vegetative cover`:  
  1. The clustering is relatively clear and not broken.

#### Plot classification images of 60m data using band 4, 3, 2

In [None]:
[ds_img_60_432, sse_60_432] = kmeans_classifier(ds_60_432)

sse_value(sse_60_432)

In [None]:
plot_classification(ds_img_60_432, 60)

For the classification using band 4, 3, 2, the optimal number of clusters is also 6.  
* For `water bodies`:
  1. It classifies well, distinguishing the `estuary` area from the `deep sea` area.  
* For `cities and towns`:
  1. The clusters of `cities and towns` do not have a clear geometric outline.  
  2. The differences between different `cities and towns` objects (such as the `urban` area and the `greenhouse` area) are not well shown.  
* For `vegetative cover`:
  1. The band 4, 3, 2 classification images show a variety of `vegetation`.  
  2. `Vegetation` clusters are difficult to distinguish from `cities and towns` clusters.  

#### Plot classification images of 60m data using band 8, 4, 3

In [None]:
[ds_img_60_843, sse_60_843] = kmeans_classifier(ds_60_843)
sse_value(sse_60_843)

In [None]:
plot_classification(ds_img_60_843, 60)

For the classification using band 8, 4, 3, the optimal number of clusters is 6.  
* For `water bodies`:  
  1. It also cannot separate `water bodies` into the `estuary` area and the `deep sea` area very well.  
* For `cities and towns`:
  1. The clusters of `cities and towns` do not have a clear geometric outline.
  2. It shows the differences between different `cities and towns` objects.
* For `vegetative cover`:
  1. The band 8, 4, 3 classification images show a variety of `vegetation`.  
  2. The clustering is relatively clear and not broken.

### Plot 10m data

#### Plot classification images of 10m data using all bands


In [None]:
[ds_img_10, sse_10] = kmeans_classifier(ds_10)
sse_value(sse_10)

In [None]:
plot_classification(ds_img_10, 10)

According to the SSE curve, the optimal number of clusters is 6.
* For `cities and towns`:
  1. The geometric outline of the `cities and towns` area is clear.
  2. The `road network` is relatively obvious.
  3. The `building` clustering is not particularly broken.

* For `vegetative cover`:
  1. The classification result shows 4 categories of `vegetation` clusters.
  2. There are `misclassifications` in the `farmland` clusters, but not many.  

* For `water bodies`:
  1. The classification is very accurate.

## GMM (Gaussian Mixture Model) 

`GMM` (`Gaussian Mixture Model`) is a probabilistic clustering method, which assumes that all data samples are generated by a mixture of several multivariate Gaussian distributions. Clustering with GMM is the `inverse process` of generating data samples from a GMM: given the number of clusters (`K-value` or `n_components`), the parameters of each mixture component are estimated from the given data set.  
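
To make the probabilistic view concrete, the sketch below computes, for a one-dimensional mixture with known parameters, the probability that each sample belongs to each component. The parameters and sample values are made up, and the helper `responsibilities` is hypothetical. Taking the most probable component per sample gives a hard label, which is the kind of label the classification images in this report show.

```python
import numpy as np


def responsibilities(x, weights, means, stds):
    """Posterior probability of each component for each sample (1-D case)."""
    x = np.asarray(x, dtype=float)[:, None]
    # Gaussian density of every sample under every component
    dens = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    # weight the densities by the mixture weights and normalise per sample
    weighted = weights * dens
    return weighted / weighted.sum(axis=1, keepdims=True)


# made-up mixture: two equally weighted components centred at 0 and 4
weights = np.array([0.5, 0.5])
means = np.array([0.0, 4.0])
stds = np.array([1.0, 1.0])

resp = responsibilities([0.0, 2.0, 4.0], weights, means, stds)
labels = resp.argmax(axis=1)
print(resp.round(3))
print(labels)
```

A sample halfway between the two means gets equal probability for both components, which is information a hard KMeans assignment does not provide.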

As with KMeans, determining the optimal K-value is also critical for GMM. According to the [official documentation](https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html#sphx-glr-auto-examples-mixture-plot-gmm-selection-py) of `scikit-learn`, `BIC` (`Bayesian Information Criterion`), an information-theoretic criterion, is an effective tool for finding the optimal K-value: the model with the lowest BIC is preferred.
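
BIC balances goodness of fit against model size: BIC = -2 ln(L) + p ln(n), where L is the likelihood, p the number of free parameters and n the number of samples. The sketch below spells this out for a GMM with full covariance matrices (the default covariance type of `GaussianMixture`). The function names are hypothetical, and the log-likelihood is expected to be the total over all samples.

```python
import math


def gmm_n_parameters(n_components, n_features):
    """Number of free parameters of a GMM with full covariance matrices."""
    mean_params = n_components * n_features
    # each symmetric covariance matrix has d * (d + 1) / 2 free entries
    cov_params = n_components * n_features * (n_features + 1) // 2
    # the mixture weights sum to 1, so one of them is not free
    weight_params = n_components - 1
    return mean_params + cov_params + weight_params


def bic(log_likelihood, n_components, n_features, n_samples):
    """BIC = -2 * ln(L) + p * ln(n); lower values are better."""
    p = gmm_n_parameters(n_components, n_features)
    return -2.0 * log_likelihood + p * math.log(n_samples)


# a 6-component model on 3 bands
print(gmm_n_parameters(6, 3))
```

Because the parameter count grows with the square of the number of bands, the penalty term is much larger for the 11-band data than for a 3-band combination.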

### Create GMM classifier

In [None]:
def gmm_classifier(ds):
    # store the BIC of each result
    bic = []
    # flatten the (rows, cols, bands) raster to a (pixels, bands) table
    ds_1d = ds.reshape((-1, ds.shape[2]))
    gmm_img_list = []
    for n_components in range(3, 11):
        # start = time()
        # create a Gaussian Mixture Model classifier object
        gmm = mixture.GaussianMixture(n_components=n_components, max_iter=1000)
        # fit the model to the data
        gmm.fit(ds_1d)
        # get the most probable class of each pixel
        gmm_label = gmm.predict(ds_1d)
        # reshape labels back to a 2d (rows, cols) image
        gmm_img = gmm_label.reshape(ds.shape[:2])
        bic.append(gmm.bic(ds_1d))
        gmm_img_list.append(gmm_img)
        # print("time gmm" + str(n_components - 2) + ":" + str(time() - start))
    return gmm_img_list, bic

def bic_value(arr):
    plt.figure(figsize=(5, 5))
    X = range(3, 11)
    plt.xlabel('n')
    plt.ylabel('BIC')
    plt.plot(X, arr, 'o-')
    plt.show()

### Plot 60m data

#### Plot classification images of 60m data using all bands

In [None]:
ds_img_60_gmm, bic_60 = gmm_classifier(ds_60)
bic_value(bic_60)

In [None]:
plot_classification(ds_img_60_gmm, 60)

According to the plot, the optimal number of clusters for the all-bands data is 6. 

* For `water bodies`:
  1. `Water bodies` are successfully separated into the `estuary` area and the `deep sea` area.  
* For `cities and towns`:
  1. `Cities and towns` objects are classified very well in the all-bands data.
* For `vegetative cover`:  
  1. The all-bands classification images show a variety of `vegetation`.
  2. The clustering is relatively clear and not broken.

#### Plot classification images of 60m data using band 4, 3, 2

In [None]:
ds_img_60_432_gmm, bic_60_432 = gmm_classifier(ds_60_432)
bic_value(bic_60_432)

In [None]:
plot_classification(ds_img_60_432_gmm, 60)

According to the plot, the optimal number of clusters for band 4, 3, 2 is 5.

* For `water bodies`:
  1. `Water bodies` are successfully separated into the `estuary` area and the `deep sea` area.  
* For `cities and towns`:
  1. `Cities and towns` objects are classified very well with this band combination.
* For `vegetative cover`:  
  1. The clustering is relatively clear and not broken.
  2. The number of `vegetative cover` categories is small.

#### Plot classification images of 60m data using band 8, 4, 3

In [None]:
ds_img_60_843_gmm, bic_60_843 = gmm_classifier(ds_60_843)
bic_value(bic_60_843)

In [None]:
plot_classification(ds_img_60_843_gmm, 60)

According to the plot, the optimal number of clusters for band 8, 4, 3 is 5.

* For `water bodies`:
  1. `Water bodies` are successfully separated into the `estuary` area and the `deep sea` area.  
* For `cities and towns`:
  1. The `cities and towns` area is a little fragmented.  
* For `vegetative cover`:  
  1. The clustering is relatively clear and not broken.
  2. The number of `vegetative cover` categories is small.

### Plot 10m data using all bands

In [None]:
[ds_img_10_gmm, bic_10] = gmm_classifier(ds_10)
bic_value(bic_10)

In [None]:
plot_classification(ds_img_10_gmm, 10)

According to the plot, the optimal number of clusters for the 10m data is 6.

* For `cities and towns`:
  1. The geometric outline of the `cities and towns` area is clear.
  2. The `road network` is obvious.
  3. The `building` clustering is a little broken.
  4. Many `farmland` clusters are not distinguished from `cities and towns` clusters.

* For `vegetative cover`:
  1. The classification result shows 4 categories of `vegetation` clusters.
  2. `Misclassification` is relatively serious.
* For `water bodies`:
  1. The classification is very accurate.

## Summary

### Evaluation of the results of each classification
This assignment uses KMeans and GMM to classify the 60m raster data (all bands, band 4, 3, 2 and band 8, 4, 3) and the 10m raster data (all bands). The evaluation is shown in the following table:

| classification method | band combination | resolution | water bodies | cities and towns | vegetative cover | road network |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| KMeans | all bands (11) | 60m | poor | average | average | n/a |
| KMeans | band 4, 3, 2 | 60m | good | poor | average | n/a |
| KMeans | band 8, 4, 3 | 60m | poor | average | good | n/a |
| GMM | all bands (11) | 60m | good | good | good | n/a |
| GMM | band 4, 3, 2 | 60m | good | good | average | n/a |
| GMM | band 8, 4, 3 | 60m | good | average | average | n/a |
| KMeans | all bands (4) | 10m | good | good | average | average |
| GMM | all bands (4) | 10m | good | average | average | good |

### Compare KMeans and GMM

#### 60m data

* For all bands: 

  1. GMM gives results similar to KMeans for `vegetative cover` and `water bodies` until the cluster number is increased to 8.  
  2. For `cities and towns` objects, the clusters generated by KMeans are more fragmented than those generated by GMM.

* For band 4, 3, 2:

  1. GMM and KMeans perform equally well in `water bodies` classification.  
  2. For `vegetative cover` and `cities and towns` objects, the GMM classification is a little better.

* For band 8, 4, 3:

  1. The GMM classification is much better than the KMeans classification.

* Conclusion:  
  1. When processing the 60m data (especially with all bands), GMM is much slower than KMeans.   
  2. KMeans and GMM perform similarly in `water` classification, with little difference under the same band combination.    
  3. GMM is generally better than KMeans at `vegetative cover` and `cities and towns` classification.  
  4. **The best classification image of 60m data is the one using GMM with all bands and setting 6 clusters.**

#### 10m data

* For `cities and towns`:  
  1. KMeans generates more complete and geometrically clearer `cities and towns` clusters than GMM.
  2. The road network in the GMM classification images is clearer.

* For `vegetative cover`:  
  1. Both GMM and KMeans misclassify part of the `farmland` clusters as `cities and towns` clusters. However, the problem is more serious with GMM.

* For `water bodies`:
  1. The two methods produce similar results in water areas. Both are fairly accurate.  

* Conclusion:  
  1. **The best classification image of 10m data is the one using KMeans with all bands and setting 5 clusters.**