<a href="https://colab.research.google.com/github/junyi2022/musa-650-remote-sensing/blob/main/assignments/HW2/HW2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MUSA 650 Homework 2: Supervised Land Use Classification with Google Earth Engine

In this assignment, you will use Google Earth Engine via Python to implement multi-class land cover classification. You will hand-label Landsat 8 satellite images which you will then use to train a random forest model. Along the way, you will consider practical remote sensing issues like cloud cover, class imbalances, and feature selection.

Submit a single Jupyter Notebook containing code, narrative text, visualizations, and answers to each question. Please also upload your classification results as a GeoTIFF and your accuracy assessment as a CSV file.

**Disclaimer:** I consulted the following AI tool to revise code and answer questions for this project.

- DeepSeek. (n.d.). DeepSeek artificial intelligence system. Retrieved from https://www.deepseek.com

**Note:** The interactive `geemap` output cannot be rendered on GitHub because the `state` key is missing from `metadata.widgets`. GitHub reports the notebook as invalid, but it still opens and runs in Colab.

## 1. Setup

`geemap` has many [tutorials](https://geemap.org/tutorials/#geemap-tutorials) available. This notebook specifically references [#32 Machine Learning with Earth Engine - Supervised Classification](https://geemap.org/notebooks/32_supervised_classification/); the accompanying video is available [here](https://www.youtube.com/watch?v=qWaEfgWi21o).

In [79]:
# Import required libraries
import ee
import geemap
import ipywidgets as widgets
from IPython.display import display
import leafmap

import rasterio
import matplotlib.pyplot as plt
import numpy as np

import geopandas as gpd

Google Earth Engine requires authentication before use. Instructions can be found [here](https://developers.google.com/earth-engine/guides/auth). The project is a Google Cloud project set up in a Google Cloud account. There is also a notebook authenticator [here](https://code.earthengine.google.com/client-auth?scopes=https%3A//www.googleapis.com/auth/earthengine%20https%3A//www.googleapis.com/auth/cloud-platform%20https%3A//www.googleapis.com/auth/devstorage.full_control&request_id=jHMQOVzUM-B-pUwoeKSCPjAqmuPK90lbh-Z2xFjR55o&tc=n8BD6km8I2vhYIau8ww5Hrztwrd5Wulp0qdijy5YqII&cc=Yusop5Cp9Vxq3z_wUl9rzbY_q2YP5o1JUMM4lyLIvJs).

In [5]:
ee.Authenticate()
ee.Initialize(project='ee-musa-remote-sensing')

Create an interactive map. Multiple basemaps are available.

In [None]:
import os

os.environ["ROADMAP"] = 'https://mt1.google.com/vt/lyrs=m&x={x}&y={y}&z={z}'
os.environ["SATELLITE"] = 'https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}'
os.environ["TERRAIN"] = 'https://mt1.google.com/vt/lyrs=p&x={x}&y={y}&z={z}'
os.environ["HYBRID"] = 'https://mt1.google.com/vt/lyrs=y&x={x}&y={y}&z={z}'

Map = geemap.Map()
Map.add_basemap("ROADMAP")
Map

## 2. Data Collection and Feature Engineering

### 2.1 Collecting and Labeling Training Data

Using the [interactive `geemap` interface](https://www.youtube.com/watch?v=VWh5PxXPZw0) or another approach (e.g., QGIS, ArcGIS, a GeoJSON file, etc.), create at least 100 samples (points or polygons) for each of the following four classes: urban, bare, water, and vegetation. (Again, we encourage you to work in pairs or groups of three to generate these hand labels.) Use visual cues and manual inspection to ensure that the samples are accurate. Assign each class a unique label (e.g., 0 for urban, 1 for bare, 2 for water, and 3 for vegetation) and merge the labeled samples into a single dataset. You are free to propose any labels you like, as long as 1) you include at least 4 classes, and 2) you justify why they are appropriate for a remote sensing task (for example, including a label for ice cream shops wouldn't make sense, because those can't be detected from aerial imagery).

#### 2.1.1 Collecting Data

The region of interest (ROI) for this notebook is Chicago. We define a rectangle around Chicago as the area of focus. We start by adding data to the map: Landsat 8 Collection 2 imagery from 2023, from which we select the scene with the least cloud cover.

The band information is shown in the tables below:

**Landsat 8 (OLI & TIRS) Band Designations**

| Band    | Name                          | Wavelength (µm) | Spatial Resolution (m) | Common Applications |
|---------|-------------------------------|-----------------|------------------------|---------------------|
| **SR_B1** | Coastal/Aerosol               | 0.433–0.453     | 30                     | Coastal water mapping, aerosol studies |
| **SR_B2** | Blue                          | 0.450–0.515     | 30                     | Water body penetration, soil/vegetation discrimination |
| **SR_B3** | Green                         | 0.525–0.600     | 30                     | Healthy vegetation detection, urban areas |
| **SR_B4** | Red                           | 0.630–0.680     | 30                     | Chlorophyll absorption (vegetation health) |
| **SR_B5** | Near-Infrared (NIR)           | 0.845–0.885     | 30                     | Biomass content, water body delineation |
| **SR_B6** | Shortwave Infrared 1 (SWIR 1) | 1.560–1.660     | 30                     | Moisture content, snow/cloud discrimination |
| **SR_B7** | Shortwave Infrared 2 (SWIR 2) | 2.100–2.300     | 30                     | Soil/rock differentiation, vegetation stress |

**Thermal Bands (TIRS)**

| Band     | Name                          | Wavelength (µm) | Spatial Resolution (m) | Common Applications |
|----------|-------------------------------|-----------------|------------------------|---------------------|
| **ST_B10** | Thermal Infrared 1 (TIRS 1)   | 10.60–11.19    | 100 (resampled to 30)  | Surface temperature, urban heat islands |
| **ST_B11** | Thermal Infrared 2 (TIRS 2)   | 11.50–12.51    | 100 (resampled to 30)  | Surface temperature, volcanic activity |

In [34]:
chicago_region = ee.Geometry.Rectangle([-89.0914, 41.1428, -87.4011, 42.4773])
Map.addLayer(chicago_region, {}, "Chicago Region")

In [39]:
# Chicago point
point = ee.Geometry.Point([-87.7719, 41.8799])

image = (
    ee.ImageCollection("LANDSAT/LC08/C02/T1_L2")
    .filterBounds(point)
    .filterDate("2023-01-01", "2023-12-31")
    .sort("CLOUD_COVER")
    .first()
    .select("SR_B[1-7]")
    .clip(chicago_region)
)

print(image.getInfo())

{'type': 'Image', 'bands': [{'id': 'SR_B1', 'data_type': {'type': 'PixelType', 'precision': 'int', 'min': 0, 'max': 65535}, 'dimensions': [4753, 5008], 'origin': [1716, 1260], 'crs': 'EPSG:32616', 'crs_transform': [30, 0, 272985, 0, -30, 4742715]}, {'id': 'SR_B2', 'data_type': {'type': 'PixelType', 'precision': 'int', 'min': 0, 'max': 65535}, 'dimensions': [4753, 5008], 'origin': [1716, 1260], 'crs': 'EPSG:32616', 'crs_transform': [30, 0, 272985, 0, -30, 4742715]}, {'id': 'SR_B3', 'data_type': {'type': 'PixelType', 'precision': 'int', 'min': 0, 'max': 65535}, 'dimensions': [4753, 5008], 'origin': [1716, 1260], 'crs': 'EPSG:32616', 'crs_transform': [30, 0, 272985, 0, -30, 4742715]}, {'id': 'SR_B4', 'data_type': {'type': 'PixelType', 'precision': 'int', 'min': 0, 'max': 65535}, 'dimensions': [4753, 5008], 'origin': [1716, 1260], 'crs': 'EPSG:32616', 'crs_transform': [30, 0, 272985, 0, -30, 4742715]}, {'id': 'SR_B5', 'data_type': {'type': 'PixelType', 'precision': 'int', 'min': 0, 'max': 

In [40]:
vis_params = {"min": 0, "max": 65535, "bands": ["SR_B4", "SR_B3", "SR_B2"]}  # max is the upper bound of the data type reported by image.getInfo()

Map.centerObject(point, 8)
Map.addLayer(image, vis_params, "Landsat-8")
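The `max` of 65535 used above is the upper bound of the 16-bit integer type, not the range that the reflectance values actually occupy, so the true-color layer may look dark. As a rough sketch, assuming the scale factor (0.0000275) and offset (-0.2) documented for Landsat Collection 2 Level-2 surface reflectance bands, digital numbers can be converted to reflectance before choosing display limits. The sample digital numbers are illustrative, and `dn_to_reflectance` is a hypothetical helper name.

```python
import numpy as np

# Scale factor and offset documented for Landsat Collection 2 Level-2 SR bands.
SR_SCALE = 0.0000275
SR_OFFSET = -0.2


def dn_to_reflectance(dn):
    """Convert surface reflectance digital numbers to unitless reflectance."""
    return np.asarray(dn, dtype=np.float64) * SR_SCALE + SR_OFFSET


# Illustrative digital numbers for the red, green, and blue bands of one pixel.
sample_dn = np.array([8662, 9278, 8255])
reflectance = dn_to_reflectance(sample_dn)
print(reflectance.round(4))
```

On this scale, a display range of roughly 0 to 0.3 is a common starting point for true-color composites; in raw digital numbers that corresponds to about 7273 to 18182.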

Check image properties.

In [21]:
ee.Date(image.get("system:time_start")).format("YYYY-MM-dd").getInfo()

'2023-08-31'

In [22]:
image.get("CLOUD_COVER").getInfo()

0.16

#### 2.1.2 Labeling Data

There are multiple ways to label the data. This notebook uses the National Land Cover Database (NLCD) layer as a base to generate training points. The point values are reclassified into **0 for urban, 1 for bare, 2 for water, and 3 for vegetation**. These four categories are used for the remote sensing model.

In [91]:
nlcd = ee.Image("USGS/NLCD/NLCD2016").select("landcover").clip(chicago_region)
Map.addLayer(nlcd, {}, "NLCD")

Generate 5000 training points with the corresponding NLCD values.

In [72]:
# Make the training dataset.
points = nlcd.sample(
    **{
        "region": chicago_region,
        "scale": 30,
        "numPixels": 5000,
        "seed": 0,
        "geometries": True,  # Set this to False to ignore geometries
    }
)

Reclassify the NLCD values into 4 categories.

In [74]:
# reclassification rules
reclass_rules = {
    # Original NLCD values : New class
    21: 0, 22: 0, 23: 0, 24: 0,  # Urban (Developed)
    31: 1, 52: 1,                 # Bare (Barren/Shrub)
    11: 2, 12: 2, 90: 2, 95: 2,   # Water (Water/Wetlands)
    41: 3, 42: 3, 43: 3,          # Vegetation (Forests)
    71: 3, 81: 3, 82: 3           # Vegetation (Grasslands/Crops)
}

# Convert the rules to Earth Engine Dictionary
reclass_dict = ee.Dictionary(reclass_rules)

# Function to reclassify each feature
def reclassify_feature(feature):
    original_value = ee.Number(feature.get('landcover'))
    new_value = reclass_dict.get(original_value, -1)  # -1 for unmapped values
    return feature.set('class', new_value)

# Apply the reclassification
reclassified_points = points.map(reclassify_feature)

# Get class distribution (skipped if the collection is empty)
reclassified_count = reclassified_points.size().getInfo()
if reclassified_count > 0:
    class_dist = reclassified_points.aggregate_histogram('class').getInfo()
    print("\nClass distribution:")
    for cls, count in sorted(class_dist.items()):
        print(f"Class {cls}: {count} samples")


Class distribution:
Class 0: 1549 samples
Class 1: 21 samples
Class 2: 736 samples
Class 3: 2694 samples
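The distribution above is heavily imbalanced: class 1 (bare) has only 21 samples, well short of the 100 samples per class that the assignment asks for. A minimal sketch of that check, using the counts printed above; `MIN_SAMPLES`, `class_names`, and `short_classes` are names introduced here for illustration.

```python
# Counts copied from the class distribution printed above.
class_dist = {"0": 1549, "1": 21, "2": 736, "3": 2694}
class_names = {"0": "urban", "1": "bare", "2": "water", "3": "vegetation"}
MIN_SAMPLES = 100  # per-class minimum stated in the assignment

total = sum(class_dist.values())
short_classes = []
for cls, count in sorted(class_dist.items()):
    share = count / total
    print(f"Class {cls} ({class_names[cls]}): {count} samples, {share:.1%} of total")
    if count < MIN_SAMPLES:
        short_classes.append(cls)

print("Classes below the minimum:", short_classes)
```

One possible remedy is stratified sampling, for example with `ee.Image.stratifiedSample`, which requests a number of points per class instead of sampling uniformly across the region.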


In [75]:
print(reclassified_points.size().getInfo())

5000


In [70]:
print(reclassified_points.first().getInfo())

{'type': 'Feature', 'geometry': {'type': 'Point', 'coordinates': [-88.69960787030843, 41.66436276289306]}, 'id': '0', 'properties': {'class': 3, 'landcover': 82}}


Visualize the points on the interactive geemap.

In [96]:
# Define a function to set visual properties for each feature
def set_style(feature):
    class_value = ee.Number(feature.get('class'))
    color = ee.String(ee.Dictionary({
        '0': '#DF6149',
        '1': '#FEDC7B',
        '2': '#33576E',
        '3': '#498B6D'
    }).get(class_value.format()))

    return feature.set('style', {
        'color': color,
        'pointSize': ee.Number(2),
        'opacity': ee.Number(0.8)
    })

# Apply styling and add to map
styled_points = reclassified_points.map(set_style)
Map.addLayer(styled_points.style(**{'styleProperty': 'style'}), {}, 'Reclassed Points')

Map


### 2.2 Feature Engineering

For possible use in the model, calculate and add the following spectral indices:

- **NDVI** (Normalized Difference Vegetation Index)
- **NDBI** (Normalized Difference Built-up Index)
- **MNDWI** (Modified Normalized Difference Water Index)

Additionally, add elevation and slope data from a DEM. Normalize all image bands to a 0 to 1 scale for consistent model input.
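The three indices are all normalized differences of two bands, and the normalization step is a min-max rescale. The sketch below shows both with NumPy on illustrative reflectance values (not taken from the Chicago scene), assuming the usual Landsat 8 band roles: green = SR_B3, red = SR_B4, NIR = SR_B5, SWIR 1 = SR_B6. The helper names are hypothetical.

```python
import numpy as np


def normalized_difference(a, b):
    """Return (a - b) / (a + b), with 0 where the denominator is 0."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = a + b
    return np.divide(a - b, denom, out=np.zeros_like(denom), where=denom != 0)


def min_max_normalize(x):
    """Rescale an array to the 0 to 1 range; a constant array maps to 0."""
    x = np.asarray(x, dtype=np.float64)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


# Illustrative reflectances for three pixels: vegetation, built-up, water.
green = np.array([0.06, 0.12, 0.05])  # SR_B3
red = np.array([0.04, 0.14, 0.03])    # SR_B4
nir = np.array([0.40, 0.20, 0.02])    # SR_B5
swir1 = np.array([0.18, 0.26, 0.01])  # SR_B6

ndvi = normalized_difference(nir, red)       # (NIR - Red) / (NIR + Red)
ndbi = normalized_difference(swir1, nir)     # (SWIR1 - NIR) / (SWIR1 + NIR)
mndwi = normalized_difference(green, swir1)  # (Green - SWIR1) / (Green + SWIR1)

print("NDVI :", ndvi.round(3))
print("NDBI :", ndbi.round(3))
print("MNDWI:", mndwi.round(3))
print("NDVI scaled to 0-1:", min_max_normalize(ndvi).round(3))
```

In Earth Engine the same ratios can be computed with `ee.Image.normalizedDifference`, passing the band pair in the same order, for example `["SR_B5", "SR_B4"]` for NDVI.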

For bonus points, consider adding [kernel filters](https://google-earth-engine.com/Advanced-Image-Processing/Neighborhood-based-Image-Transformation/) (e.g., edge detection, smoothing) to see if they improve model performance.
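To make the idea of a kernel filter concrete, the following sketch applies a 3x3 mean (smoothing) kernel and a Sobel edge detector to a tiny illustrative band using plain NumPy. The `filter3x3` helper and the 5x5 test band are assumptions made for this example, not part of the assignment data.

```python
import numpy as np


def filter3x3(band, kernel):
    """Apply a 3x3 kernel (as cross-correlation) with edge-replicated padding."""
    band = np.asarray(band, dtype=np.float64)
    padded = np.pad(band, 1, mode="edge")
    rows, cols = band.shape
    out = np.zeros((rows, cols))
    # Accumulate the nine shifted, weighted copies of the padded band.
    for dr in range(3):
        for dc in range(3):
            out += kernel[dr, dc] * padded[dr:dr + rows, dc:dc + cols]
    return out


mean_kernel = np.full((3, 3), 1.0 / 9.0)  # smoothing
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# Illustrative 5x5 band with a vertical edge between columns 1 and 2.
band = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

smoothed = filter3x3(band, mean_kernel)
edges = np.hypot(filter3x3(band, sobel_x), filter3x3(band, sobel_y))

print(smoothed[2].round(3))
print(edges[2])
```

The smoothed band blurs the step across neighboring columns, while the edge magnitude is non-zero only next to the step. Earth Engine offers analogous built-ins such as `ee.Kernel.square` and `ee.Kernel.sobel`, applied with `ee.Image.convolve`.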


## 3. Model Training and Evaluation

### 3.1 Model Training

Split your data into a training dataset (70%) and a validation dataset (30%). Train and evaluate a random forest model using the training set with all engineered features.
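A 70/30 split amounts to shuffling the sample indices once with a fixed seed and cutting the shuffled order at 70%. The sketch below shows this with NumPy for 5000 samples; `train_validation_split` is a hypothetical helper, and the seed value is arbitrary.

```python
import numpy as np


def train_validation_split(n_samples, train_fraction=0.7, seed=0):
    """Return shuffled index arrays for the training and validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]


train_idx, val_idx = train_validation_split(5000)
print(len(train_idx), len(val_idx))
```

In Earth Engine the same idea is commonly expressed by adding a random column with `ee.FeatureCollection.randomColumn` and filtering on a 0.7 threshold, so that no sample ends up in both sets.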

After training, analyze [variable importance scores](https://stackoverflow.com/questions/74519767/interpreting-variable-importance-from-random-forest-in-gee) to justify each feature's inclusion. Identify which features are most influential in the classification. Report the final features that you keep in your model.
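Raw importance scores are easier to compare once they are expressed as shares of the total and sorted. The sketch below does that for a dictionary of scores. The numbers are hypothetical placeholders, NOT results from this notebook; in Earth Engine, the scores for a trained random forest would typically come from the `importance` entry of the dictionary returned by the classifier's `explain()` method.

```python
# Hypothetical raw importance scores, for illustration only.
# They are NOT results produced by this notebook.
raw_importance = {"SR_B4": 30.0, "SR_B5": 45.0, "SR_B6": 15.0, "NDVI": 60.0}

# Express each score as a percentage of the total.
total = sum(raw_importance.values())
relative = {name: 100.0 * score / total for name, score in raw_importance.items()}

# Rank features from most to least important.
ranked = sorted(relative.items(), key=lambda item: item[1], reverse=True)
for name, pct in ranked:
    print(f"{name}: {pct:.1f}%")
```

Features with a consistently small share are candidates for removal, although correlated features can split importance between them, so a low score alone is weak evidence.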

In [45]:
# Use these bands for prediction.
bands = ["SR_B1", "SR_B2", "SR_B3", "SR_B4", "SR_B5", "SR_B6", "SR_B7"]


# This property stores the original NLCD land cover codes (not the reclassified 'class' values).
label = "landcover"

# Overlay the points on the imagery to get training.
training = image.select(bands).sampleRegions(
    **{"collection": points, "properties": [label], "scale": 30}
)

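# Train an SVM classifier (libsvm) with default parameters.
# The assignment calls for a random forest; ee.Classifier.smileRandomForest
# is the Earth Engine classifier for that.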
trained = ee.Classifier.libsvm().train(training, label, bands)

In [46]:
print(training.first().getInfo())

{'type': 'Feature', 'geometry': None, 'id': '0_0', 'properties': {'SR_B1': 8008, 'SR_B2': 8255, 'SR_B3': 9278, 'SR_B4': 8662, 'SR_B5': 21390, 'SR_B6': 13542, 'SR_B7': 10215, 'landcover': 82}}


In [None]:
# Classify the image with the same bands used for training.
result = image.select(bands).classify(trained)

# Display the classified image.
Map.addLayer(result, {}, "classified")
Map

In [49]:
class_values = nlcd.get("landcover_class_values").getInfo()
class_values

[11,
 12,
 21,
 22,
 23,
 24,
 31,
 41,
 42,
 43,
 51,
 52,
 71,
 72,
 73,
 74,
 81,
 82,
 90,
 95]

In [50]:
class_palette = nlcd.get("landcover_class_palette").getInfo()
class_palette

['476ba1',
 'd1defa',
 'decaca',
 'd99482',
 'ee0000',
 'ab0000',
 'b3aea3',
 '68ab63',
 '1c6330',
 'b5ca8f',
 'a68c30',
 'ccba7d',
 'e3e3c2',
 'caca78',
 '99c247',
 '78ae94',
 'dcd93d',
 'ab7028',
 'bad9eb',
 '70a3ba']

In [51]:
landcover = result.set("classification_class_values", class_values)
landcover = landcover.set("classification_class_palette", class_palette)

In [55]:
Map.addLayer(landcover, {}, "Land cover")

In [53]:
print("Change layer opacity:")
cluster_layer = Map.layers[-1]
cluster_layer.interact(opacity=(0, 1, 0.1))

Change layer opacity:



In [None]:
Map.add_legend(builtin_legend="NLCD")
Map

### 3.2 Accuracy Assessment

Use the trained model to classify the Landsat 8 image, creating a land cover classification map with classes for urban, bare, water, and vegetation (or whatever classes you have chosen).

Using the validation data, generate a confusion matrix and calculate the overall accuracy, precision, and recall. Which classes were confused most often with each other? Why do you think this was?
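Overall accuracy, precision, and recall all follow directly from the confusion matrix. The sketch below computes them with NumPy, assuming rows are actual classes and columns are predicted classes. The matrix values are hypothetical and are not results from this notebook; `classification_metrics` is a name introduced here.

```python
import numpy as np


def classification_metrics(cm):
    """Overall accuracy plus per-class precision and recall.

    Rows are assumed to be actual classes and columns predicted classes.
    """
    cm = np.asarray(cm, dtype=np.float64)
    correct = np.diag(cm)
    predicted_totals = cm.sum(axis=0)  # column sums
    actual_totals = cm.sum(axis=1)     # row sums
    overall = correct.sum() / cm.sum()
    precision = np.divide(correct, predicted_totals,
                          out=np.zeros_like(correct), where=predicted_totals > 0)
    recall = np.divide(correct, actual_totals,
                       out=np.zeros_like(correct), where=actual_totals > 0)
    return overall, precision, recall


# Hypothetical 4-class matrix (urban, bare, water, vegetation).
cm = [[40, 5, 0, 5],
      [10, 5, 0, 5],
      [0, 0, 45, 5],
      [5, 0, 5, 90]]

overall, precision, recall = classification_metrics(cm)
print("Overall accuracy:", round(overall, 3))
print("Precision:", precision.round(3))
print("Recall:", recall.round(3))
```

The largest off-diagonal cells show which pairs of classes are confused most often. In Earth Engine, `ee.ConfusionMatrix` exposes comparable quantities through `accuracy()`, `consumersAccuracy()` (precision), and `producersAccuracy()` (recall).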

Visually compare your landcover data for your ROI with the corresponding [landcover data from the European Space Agency](https://developers.google.com/earth-engine/datasets/catalog/ESA_WorldCover_v200). Do your classifications agree? If not, do you notice any patterns in the types of landcover where they differ, or any particular features in the imagery that are hard for your model to recognize (e.g., sand, water, or asphalt)?
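Alongside the visual comparison, a simple agreement rate can be computed once the WorldCover codes are mapped onto the same four classes. The sketch below does this with NumPy on tiny illustrative grids. The mapping is an assumption that mirrors the NLCD reclassification rules used earlier (shrubland counted as bare, wetland as water), and `remap` is a hypothetical helper.

```python
import numpy as np

# Assumed mapping from ESA WorldCover codes to this notebook's four classes.
worldcover_to_class = {
    50: 0,                # built-up -> urban
    20: 1, 60: 1,         # shrubland, bare / sparse vegetation -> bare
    80: 2, 90: 2,         # permanent water, herbaceous wetland -> water
    10: 3, 30: 3, 40: 3,  # tree cover, grassland, cropland -> vegetation
}


def remap(codes, mapping, default=-1):
    """Replace each code with its mapped class; unmapped codes get `default`."""
    codes = np.asarray(codes)
    out = np.full(codes.shape, default, dtype=np.int64)
    for source, target in mapping.items():
        out[codes == source] = target
    return out


# Illustrative 3x3 grids, not real data.
worldcover = np.array([[50, 50, 10], [80, 40, 60], [30, 50, 80]])
predicted = np.array([[0, 0, 3], [2, 3, 0], [3, 1, 2]])

reference = remap(worldcover, worldcover_to_class)
valid = reference >= 0  # ignore pixels with no mapped class
agreement = (reference[valid] == predicted[valid]).mean()
print(f"Agreement: {agreement:.1%}")
```

Mapping the disagreeing pixels (where `reference != predicted`) back onto the map can reveal whether the differences cluster in particular land cover types. In Earth Engine the lookup step corresponds to `ee.Image.remap`.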

Export the classified image as a GeoTIFF and the confusion matrix and accuracy metrics to a CSV file for documentation.
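For the CSV part of the export, the standard library is enough. The sketch below writes a confusion matrix (rows = actual, columns = predicted) with per-class recall and overall accuracy to any file-like object. The matrix values are hypothetical, and `write_accuracy_csv` is a name introduced for this example.

```python
import csv
import io


def write_accuracy_csv(handle, class_names, cm):
    """Write the confusion matrix, per-class recall, and overall accuracy as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["actual_class", *class_names, "recall"])
    correct = 0
    total = 0
    for i, (name, row) in enumerate(zip(class_names, cm)):
        row_total = sum(row)
        recall = row[i] / row_total if row_total else 0.0
        writer.writerow([name, *row, f"{recall:.3f}"])
        correct += row[i]
        total += row_total
    overall = correct / total if total else 0.0
    writer.writerow(["overall_accuracy", f"{overall:.3f}"])


class_names = ["urban", "bare", "water", "vegetation"]
# Hypothetical matrix, not a result from this notebook.
cm = [[40, 5, 0, 5], [10, 5, 0, 5], [0, 0, 45, 5], [5, 0, 5, 90]]

buffer = io.StringIO()
write_accuracy_csv(buffer, class_names, cm)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a handle such as `open("accuracy_assessment.csv", "w", newline="")` (the filename is a placeholder). For the GeoTIFF, `geemap.ee_export_image` or `ee.batch.Export.image.toDrive` are two options to consider.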

## 4. Reflection Questions

What limitations did you run into when completing this assignment? What might you do differently if you repeated it, or what might you change if you had more time and/or resources?

What was the impact of feature engineering? Which layers most contributed to the model? Did you expect this? Why or why not?

Did you find it difficult to create the training data by hand? Did you notice any issues with class imbalance? If so, how might you resolve this in the future? (Hint: consider a different sampling technique.)

Did your model perform better on one class than another? Why? Can you think of a reason that this might be good or bad depending on the context?
