# Expanding the Use of Satellite Imagery with Machine Learning  
##### Nikolay Zhechev, SoftUni

## Abstract  

Satellite imagery provides a unique view of planet Earth and makes it possible to see features that were never seen before. In recent years satellite imagery has become even more essential. We are now able to see astonishing views in detail, with resolution narrowing to just a few meters.  
Collected images serve many purposes, from commercial and educational uses to providing insights into how our Earth is evolving and changing. One of the most fascinating aspects of Earth imagery is that we can distinguish meaningful patterns over time, which helps with predictions of future outcomes. Today humanity has millions of images collected over the past few decades, a volume that is growing by around 80 TB/day and that can help shape our future. Two considerations arise: the vast amount of data and the need for a reproducible method to perform predictions. Machine learning can address both, providing a reproducible and structured way of detecting patterns and important features in satellite imagery and enabling in-depth model selection and tuning for specific needs. This text covers how satellite image data can be exported, which techniques are good and approachable, and how the data can be fed to a machine learning model in order to evaluate results, determine improvements, ensure reproducibility and plan further steps.
The text aims to expand the reader's knowledge and build a solid understanding of how machine learning concepts can be applied to satellite imagery. Both machine learning and specific satellite imagery concepts are looked at in more detail.

## Introduction  
Manual investigation of satellite imagery can be very time consuming as well as very difficult. It requires specific software and an understanding of land and maps. How can this process be made more efficient? What further value can be generated from a more streamlined and effective processing method? Machine learning algorithms can answer most of these questions. They should provide a good and stable process for image prediction, pattern recognition and much more.

> from: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-023-00772-x  
In recent times, Deep Learning, a sophisticated tool in the field of machine learning, has demonstrated its effectiveness in the realm of computer vision and subsequently, in remote sensing as well. The conventional machine learning tools such as Support Vector Machine (SVM) and Random Forest (RF) which are shallow-structured, have major limitations that are addressed by these advanced machine learning algorithms. Prominent deep learning models such as Deep Belief Net (DBN), Stacked Auto-Encoder (SAE), and deep Convolutional Neural Network (CNN) have shown promising results in several remote sensing applications, including segmentation, object detection, and classification. These models are characterized by deep architecture, multi-layered interconnected channels, and a high capacity to learn features.  
Nonetheless, the application of these transformers is computationally expensive, and their efficiency decreases exponentially with the size of the image, thereby requiring significant computational resources.

This text presents approaches based on CNN and SVM. Both methods are used with different datasets in an attempt to understand how results differ with the data and how these models work and interpret data.

Some use cases include:
- agriculture, forestry, and sustainability;
- understanding local waterway pollution, illegal land use, or mass migrations;
- wildfire and flood prediction;
- glacier and water change prediction;
- crop and food prediction;
- poverty prediction;
- infrastructure and urban planning.

Spatial resolution: how many meters each pixel covers. A higher resolution unlocks some use cases and gives better performance for most algorithms.

Instruments available: different instruments capture different spectral bands, from both the visible and invisible spectrum. The usefulness of each band varies with the use case.

Temporal resolution: the time between two visits to a given spot on Earth. Note that many satellite systems include several satellites; in that case the global temporal resolution is usually equal to that of one satellite divided by the number of satellites.

Radiometric resolution: the number of possible values the instrument can record. The higher it is, the more precise the measurements are.

![title](Images/classified_plots.png)

Spectral, temporal, and spatial resolution are major features of remote sensing images and important parameters to consider during remote sensing image classification:

1. Spectral resolution refers to the different wavelengths of electromagnetic radiation that are captured.
2. Temporal resolution is the time interval between image acquisitions.
3. Spatial resolution is the size of a pixel on the ground.

These parameters play a critical role in identifying different land cover types and monitoring changes in land cover over time. They are complex features of remote sensing images, and an efficient system must process them effectively to achieve accurate classification.

There are also different types of remote sensing images, depending on the nature of the capturing device. They are categorised into optical, thermal, hyper-spectral, and SAR images:

1. Optical images capture the visible and near-infrared regions of the electromagnetic spectrum and are the most commonly used remote sensing data for land cover classification.
2. Thermal images capture the thermal radiation emitted by the Earth’s surface and are used to detect temperature variations.
3. Hyper-spectral images capture a wide range of spectral bands with narrow bandwidths, allowing for the identification of more subtle spectral signatures.
4. SAR images use microwave radiation and can penetrate clouds and vegetation, making them useful for detecting changes in surface features.

## Methods

Let’s assume you want to train a machine learning model to identify objects in an image it’s never encountered. The first step in training this supervised machine learning model is to annotate and label a collection of images, called a training dataset. Annotation involves manually identifying and marking the regions of interest in an image.

For example, we would annotate each image by outlining what is present in the image and assigning them the corresponding class label. This annotated dataset becomes the foundation for training the model. Once the training dataset is labeled, relevant features need to be extracted from the images. Feature extraction involves identifying and capturing important characteristics or patterns that distinguish one class from another.



There are various techniques available for feature extraction in image processing, ranging from simple methods like color histograms and texture descriptors to more advanced approaches like convolutional neural networks (CNNs).

The training process of a supervised ML model, like a CNN, involves several steps. First, the data needs to be preprocessed to ensure consistency and quality. This may involve resizing the images, normalizing pixel values, and augmenting the dataset by applying transformations like rotations or flips to increase its diversity.
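
The normalization and augmentation steps can be sketched with NumPy. This is a minimal illustration, not the preprocessing used in the implementation sections: the random `image` stands in for a real tile, 8-bit input is assumed, and resizing is left out because it normally relies on an image library.

```python
import numpy as np

def normalize(image):
    """Scale 8-bit pixel values to the [0, 1] range."""
    return image.astype(np.float32) / 255.0

def augment(image):
    """Return the image plus flipped and rotated copies (H, W, C layout)."""
    return [
        image,
        np.fliplr(image),      # horizontal flip
        np.flipud(image),      # vertical flip
        np.rot90(image, k=1),  # 90 degree rotation in the image plane
    ]

rng = np.random.default_rng(0)
# Stand-in for a real satellite tile (assumption: square 64x64 RGB, uint8).
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

batch = np.stack([normalize(a) for a in augment(image)])
print(batch.shape)  # one original plus three augmented copies
```

Each augmented copy keeps the label of the original image, which is why flips and rotations are a cheap way to enlarge a labeled dataset.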



The architecture of CNNs tries to mimic the structure of neurons in the human visual system. It is composed of multiple layers, each responsible for detecting a specific feature in the data. A typical CNN is made of a combination of four main layers: 

- Convolutional layers  
- Rectified Linear Unit (ReLU for short)  
- Pooling layers  
- Fully connected layers
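
To make the role of each layer concrete, here is a toy forward pass through the four layer types, written in plain NumPy. It is a sketch under simplifying assumptions (one channel, one random kernel, random untrained weights, four hypothetical output classes) and is not the CNN used later in this text.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of one channel with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified Linear Unit: negative activations become zero."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Keep the maximum of every non-overlapping size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def dense(x, weights, bias):
    """Fully connected layer on the flattened feature map."""
    return x.ravel() @ weights + bias

rng = np.random.default_rng(0)
image = rng.random((8, 8))                         # toy single-band image
kernel = rng.standard_normal((3, 3))               # untrained filter
features = max_pool(relu(conv2d(image, kernel)))   # 8x8 -> 6x6 -> 3x3
weights = rng.standard_normal((features.size, 4))  # 4 hypothetical classes
scores = dense(features, weights, np.zeros(4))
print(features.shape, scores.shape)
```

In a real network the kernel and dense weights are learned during training, and many kernels are applied in parallel to produce many feature maps.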

Spectral indices, which are features computed from two or
more spectral bands, are commonly used in place of or to
supplement the bands. The most commonly used index is the
normalized difference vegetation index (NDVI), which
compares the values of the red and near-infrared (NIR) band
using this formula:  

$$NDVI = \frac{NIR - RED}{NIR + RED}$$

NDVI quantifies vegetation by measuring the difference between near-infrared (which vegetation strongly reflects) and red light (which vegetation absorbs).
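
A short NumPy sketch of the NDVI formula follows. The reflectance values are made up for illustration, and returning 0 where both bands are zero is a choice of this sketch, not part of the index definition.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized difference vegetation index, computed per pixel."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    denom = nir + red
    out = np.zeros_like(denom)
    # Divide only where the denominator is non-zero; other pixels stay 0.
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

# Made-up reflectances: dense vegetation, bare soil, water, no data.
nir = np.array([[0.50, 0.30], [0.05, 0.0]])
red = np.array([[0.10, 0.30], [0.15, 0.0]])
print(ndvi(nir, red))
```

Values close to 1 indicate dense vegetation, values near 0 indicate bare surfaces, and negative values are typical of water.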

Unsupervised learning does not rely on labeled data but instead aims to discover hidden patterns, structures, or relationships within the data itself.

The purpose of unsupervised learning in image analysis is to uncover meaningful structures and insights from unlabeled image data. By utilizing unsupervised learning techniques, valuable information can be extracted and a deeper understanding is gained of the underlying characteristics of images.

Unsupervised learning can help identify clusters of similar images, discover patterns or textures that are characteristic of certain image classes, and detect anomalies or outliers within the data.

### Supervised Classification
#### Land Cover Mapping 
For implementation see section 1.1, subsection 1.1.1  

Categorize areas into land cover types such as urban, forest, or water, which enables monitoring of land cover changes over time. The classification uses CART (Classification and Regression Tree) in Earth Engine.  
Data:  
- Collect images from Landsat 8 (LANDSAT/LC08/C02/T1_L2).
- Scale and mask each image.
- Create a cloud-free composite by taking the median value for each pixel across all images.
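
The idea behind the median composite can be shown outside Earth Engine with a small NumPy stack. The values are invented, and masked (cloudy) pixels are represented as NaN, which is an assumption of this sketch.

```python
import numpy as np

# Toy time series of one band with shape (time, height, width).
# NaN marks pixels that were masked out as cloud or shadow.
stack = np.array([
    [[0.10, 0.20], [np.nan, 0.40]],
    [[0.12, np.nan], [0.30, 0.42]],
    [[0.90, 0.22], [0.32, np.nan]],  # 0.90 imitates a bright cloud that escaped the mask
])

# Per-pixel median over time, ignoring masked observations.
composite = np.nanmedian(stack, axis=0)
print(composite)
```

The median is robust to occasional outliers, so the bright 0.90 observation does not leak into the composite the way it would with a mean.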


Preprocess:
- Remove unwanted pixels.
- Select visible, near-infrared (NIR), shortwave-infrared (SWIR), and thermal bands as input features for the classification model.
- Scale and offset.
- Apply mask.
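
The scale, offset and mask steps can be sketched as follows. The constants are the scale factors published for Landsat Collection 2 Level-2 products; the digital numbers and the `keep` mask are hypothetical, and in practice the mask would be derived from the quality band.

```python
import numpy as np

SR_SCALE, SR_OFFSET = 0.0000275, -0.2    # optical surface reflectance bands (SR_B*)
ST_SCALE, ST_OFFSET = 0.00341802, 149.0  # thermal band (ST_B10), result in Kelvin

def scale_optical(dn):
    """Convert stored digital numbers to surface reflectance."""
    return dn * SR_SCALE + SR_OFFSET

def scale_thermal(dn):
    """Convert stored digital numbers to surface temperature in Kelvin."""
    return dn * ST_SCALE + ST_OFFSET

def apply_mask(values, keep):
    """Replace unwanted pixels with NaN so later steps can ignore them."""
    return np.where(keep, values, np.nan)

dn = np.array([7273, 20000, 43636])   # example digital numbers
keep = np.array([True, True, False])  # hypothetical mask: last pixel is cloudy
print(apply_mask(scale_optical(dn), keep))
print(scale_thermal(np.array([44000])))
```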


Training:  
- Uses a predefined FeatureCollection (demo_landcover_labels) containing points with known land cover classes.  
- Each point has a landcover property that stores numeric labels (e.g., 0 = Urban, 1 = Forest, 2 = Water).


Classifier:  
- Uses a CART (Classification and Regression Tree) model. CART splits data based on decision rules to classify pixels into categories.  


Applies the trained classifier to the entire image, assigning each pixel a land cover class (e.g., 0 = Urban, 1 = Forest, 2 = Water).
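
As a local analogue of this workflow, the sketch below trains scikit-learn's `DecisionTreeClassifier` (an optimized CART implementation) on synthetic labeled pixels and then classifies every pixel of a small image. This is not the Earth Engine code from section 1.1.1: the band means, noise level and image size are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Made-up mean reflectance per class for three bands (red, NIR, SWIR).
centers = np.array([
    [0.30, 0.35, 0.40],  # 0 = Urban
    [0.05, 0.45, 0.20],  # 1 = Forest
    [0.03, 0.02, 0.01],  # 2 = Water
])

# Labeled training points: 100 noisy samples per class.
labels = np.repeat(np.arange(3), 100)
pixels = centers[labels] + rng.normal(0.0, 0.02, size=(300, 3))

cart = DecisionTreeClassifier(random_state=0).fit(pixels, labels)

# Classify a whole 20x20 image: flatten to (n_pixels, n_bands), predict, reshape.
image = centers[rng.integers(0, 3, size=(20, 20))]
classified = cart.predict(image.reshape(-1, 3)).reshape(20, 20)
print(classified.shape, np.unique(classified))
```

The flatten, predict, reshape pattern is the per-pixel equivalent of applying a trained classifier to an entire image.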

Visualization:  
- Set the map coordinates to predefined ones.
- Add the input image, visualizing the cloud-free composite with the RGB bands.
- Display the classified image with land cover classes:
    - Orange: Urban (class 0)
    - Green: Forest (class 1)
    - Blue: Water (class 2)
 
<img src="Images/ee_pred_1.png" width="600" />
fig. 01



#### Identify areas of forest loss  
For implementation see section 1.1, subsection 1.1.2  


The workflow processes Landsat 8 surface reflectance data to create a cloud-free composite, prepares training data, trains a classifier, and then classifies the image into forested and non-forested areas to identify deforestation.  

FeatureCollection:
- Combines polygons into a dataset with a class property (1 for forest, 0 for non-forest).

Classifier:  
- SVM classifier with a radial basis function (RBF) kernel.
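
The RBF kernel measures similarity between two pixels as $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$. The sketch below computes it with NumPy; the pixel values and `gamma=10.0` are illustrative assumptions, since the text does not state the kernel parameters that were used.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF similarity between the rows of a and the rows of b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dist)

# Made-up (red, NIR) reflectances: two forest pixels and one non-forest pixel.
samples = np.array([
    [0.04, 0.45],  # forest
    [0.05, 0.50],  # forest
    [0.30, 0.32],  # non-forest
])

K = rbf_kernel(samples, samples, gamma=10.0)
print(np.round(K, 3))
```

Pixels with similar spectra get a kernel value close to 1, which is what lets the SVM draw a non-linear boundary between forest and non-forest in band space.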

  
The trained classifier is applied to the entire composite image, labeling each pixel as either forest (1) or non-forest (0).  
The composite is displayed with bands SR_B4 (red), SR_B3 (green), and SR_B2 (blue), which are in the visible spectrum.

Classification Result:  
- Displays the classified map with colors:  
    - Orange: Non-forest (class 0)  
    - Green: Forest (class 1)
 

<img src="Images/ee_pred_2.png" width="600" />
fig. 02


#### Land Cover Analysis  
For implementation see section 1.1, subsection 1.1.3  


Identify areas of forest, water, urban land, etc., and detect land cover changes over time for environmental monitoring. A Random Forest (RF) classifier is used to train and validate a land cover classification model. 

- Define latitude and longitude.
- Retrieve Landsat 8 Tier 1 Level 2 data for the defined ROI (region of interest).
- Load the MODIS International Geosphere-Biosphere Programme (IGBP) land cover classification as labels for training; it contains 17 land cover classes (e.g., Forest, Urban, Water).
- Select specific bands, SR_B2 (Blue) to SR_B7 (SWIR2), for analysis.
- Take the median value for each pixel (median composite) to create cloud-free images.

Combine the input image bands and MODIS labels into a single dataset. Randomly sample 5,000 points across the ROI to create a FeatureCollection of training data, where each point has:  
- Landsat band values (SR_B2 to SR_B7) as features.
- MODIS land cover label (LC_Type1) as the target.

The data is then split into training (70%) and validation (30%) sets.
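
One common way to make such a split is to attach a uniform random number to every sample and threshold it at 0.7. The sketch below does this with NumPy on stand-in data; the random features and labels are placeholders, not the sampled Landsat and MODIS values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 5000

features = rng.random((n_points, 6))         # stand-in for SR_B2..SR_B7
labels = rng.integers(0, 17, size=n_points)  # stand-in for LC_Type1

# One uniform value per point; points below 0.7 go to training.
random_column = rng.random(n_points)
is_train = random_column < 0.7

X_train, y_train = features[is_train], labels[is_train]
X_val, y_val = features[~is_train], labels[~is_train]
print(len(X_train), len(X_val))
```

Because the threshold is applied to random values, the training set holds roughly, not exactly, 70% of the points.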

Random Forest Classifier:  
- Algorithm: Constructs an ensemble of decision trees.
- Features: Reflectance values from bands SR_B2 to SR_B7.
- Target (classProperty): MODIS land cover label (LC_Type1).
- The forest uses 10 trees to improve accuracy over a single decision tree.
- Outputs a classified map where each pixel is assigned a land cover class (0–16).

Model evaluation:
- Training accuracy: a confusion matrix compares predicted vs. true labels on the training set.
- Validation accuracy: a confusion matrix on the held-out validation set summarizes true positives, false positives, etc.
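
How overall accuracy follows from a confusion matrix can be shown with a few lines of NumPy. The labels below are a made-up three-class example, not the model's actual predictions.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix

def overall_accuracy(matrix):
    """Share of samples on the diagonal, i.e. correctly classified."""
    return np.trace(matrix) / matrix.sum()

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1])

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
print(matrix)
print(overall_accuracy(matrix))
```

Off-diagonal cells show which classes are confused with each other, which a single accuracy number hides.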

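Outputs a classified land cover map.

- Training size: 3515
- Training overall accuracy: 0.9513888888888888
- Validation overall accuracy: 0.6299259851970395

The gap between training and validation accuracy suggests that the model does not generalize as well to unseen points as it fits the training points.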

<img src="Images/ee_pred_3.png" width="600" />

fig. 03


<img src="Images/matrixs_pred.png" width="600" />

fig. 04



#### SVM Image Classifier with TensorFlow  
For implementation see section 1.1, subsection 1.1.4  

|               | precision      | recall     | f1-score      | support      |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| cloudy | 1.00 | 1.00 | 1.00 | 268 |
| desert | 1.00 | 1.00 | 1.00 | 256 |
| green_area |  0.88 | 0.93 | 0.91 |305 |
| water | 0.92 | 0.88 | 0.90 | 297 |
| accuracy |   |   | 0.95 | 1126 |
| macro avg | 0.95  | 0.95 | 0.95 | 1126 |
| weighted avg | 0.95  | 0.95 | 0.95 | 1126 |

<img src="Images/classification_images.png" width="600" />

fig. 05

#### Pattern Identification
For implementation see section 1.2, subsection 1.2.1  

Clustering does not rely on labeled training data. Instead, it groups data points based on their inherent characteristics (here, spectral similarity). K-means clustering is used: the algorithm divides the data into k clusters by minimizing the variance within each cluster, which is equivalent to maximizing the variance between clusters. This is useful for identifying patterns or features such as vegetation zones, water bodies, or urban areas in remote sensing images without prior labels.  

Each pixel contains spectral information from the bands in the image.

K-means Clustering:
- Here we specify k=15 clusters.
- The algorithm minimizes intra-cluster variance, ensuring pixels in the same cluster are spectrally similar.
- The trained clusterer is applied to the entire image to assign each pixel to one of the 15 clusters.
- The output is a single-band image where each pixel’s value corresponds to its cluster ID (an integer between 0 and 14).
  
Training:
- The clusterer is trained using the sampled pixel data, allowing it to learn the spectral characteristics of the region.

The map shows 15 distinct clusters representing different spectral characteristics of the landscape. The result is not tied to specific land cover types but provides insight into natural groupings in the data.
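
The train-on-a-sample, apply-to-the-image pattern can be sketched with a plain NumPy K-means. This is an illustration of the algorithm, not the Earth Engine clusterer: the random 4-band image, the sample size of 300 and the fixed 20 iterations are assumptions.

```python
import numpy as np

def assign(pixels, centers):
    """Index of the nearest cluster center for every pixel."""
    dist = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(pixels, k, n_iter=20, seed=0):
    """Basic K-means: alternate between assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(pixels, centers)
        for c in range(k):
            members = pixels[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return centers

rng = np.random.default_rng(1)
image = rng.random((30, 30, 4))  # toy image with 4 spectral bands
pixels = image.reshape(-1, 4)

# Train on a random sample of pixels, then apply to the whole image.
sample = pixels[rng.choice(len(pixels), size=300, replace=False)]
centers = kmeans(sample, k=15)
cluster_image = assign(pixels, centers).reshape(30, 30)
print(cluster_image.shape, cluster_image.min(), cluster_image.max())
```

The output is a single-band array of cluster IDs, matching the structure of the clustered map described above.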

<img src="Images/ee_pred_4.png" width="600" />

fig. 06

### Supervised Classification
For implementation see section 2

#### Data preprocess and masking
section 2.1

Building: #3C1098  
Land (unpaved area): #8429F6  
Road: #6EC1E4  
Vegetation: #FEDD3A  
Water: #E2A929  
Unlabeled: #9B9B9B  

| Satellite Image             |  Mask |
|-----------------------------|-------------------------|
| <img src="Datasets/Segmentation_Tasks/Semantic%20segmentation%20dataset/Tile%201/images/image_part_009.jpg" width="300" /> | <img src="Datasets/Segmentation_Tasks/Semantic%20segmentation%20dataset/Tile%201/masks/image_part_009.png" width="300" /> |
| <img src="Datasets/Segmentation_Tasks/Semantic%20segmentation%20dataset/Tile%205/images/image_part_007.jpg" width="300" /> | <img src="Datasets/Segmentation_Tasks/Semantic%20segmentation%20dataset/Tile%205/masks/image_part_007.png" width="300" /> |



Large images are broken into smaller patches and normalized so that the dataset is ready for patch-based processing and semantic segmentation.

For image patches:
- Min-max normalization is applied using `minmaxscaler`.
- Each normalized patch is stored in `image_dataset`.

For mask patches:
- Patches are stored directly in `mask_dataset`.
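
A NumPy-only sketch of the patching and normalization steps is shown below. It is an approximation of the approach, with several assumptions: the patch size of 256 is hypothetical, edges that do not fill a whole patch are cropped, and normalization is done per patch rather than per feature column as a scikit-learn scaler would do.

```python
import numpy as np

PATCH = 256  # hypothetical patch size

def to_patches(image, size=PATCH):
    """Crop to a multiple of `size` and cut into non-overlapping patches."""
    h = (image.shape[0] // size) * size
    w = (image.shape[1] // size) * size
    image = image[:h, :w]
    patches = [image[r:r + size, c:c + size]
               for r in range(0, h, size)
               for c in range(0, w, size)]
    return np.stack(patches)

def min_max(patch):
    """Rescale a patch to the [0, 1] range."""
    patch = patch.astype(np.float32)
    lo, hi = patch.min(), patch.max()
    return (patch - lo) / (hi - lo) if hi > lo else np.zeros_like(patch)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(600, 800, 3), dtype=np.uint8)  # stand-in tile
mask = rng.integers(0, 6, size=(600, 800), dtype=np.uint8)        # stand-in mask

image_dataset = np.stack([min_max(p) for p in to_patches(image)])
mask_dataset = to_patches(mask)  # masks are stored without normalization
print(image_dataset.shape, mask_dataset.shape)
```

Masks are left untouched because their values encode classes, so rescaling them would destroy the labels.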

#### One-hot encoding
section 2.2

- using classes.json  

Each class (e.g., building, land, road) is represented by a specific color. During one-hot encoding of mask images, these RGB values can be used to identify pixels corresponding to each class and map them to class indices or binary masks.  

For each hex color, e.g. #3C1098, extract the red ('3C'), green ('10'), and blue ('98') components and convert them to integers using int(hex, 16).  
The output for each class is an array of three integers representing the RGB values.

[ 60  16 152]  
[132  41 246]  
[110 193 228]  
[254 221  58]  
[226 169  41]  
[155 155 155]  
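
The conversion can be reproduced with a small helper. The function name `hex_to_rgb` and the `classes` dictionary are introduced here for illustration; the hex codes are the ones listed in the legend above.

```python
import numpy as np

def hex_to_rgb(color):
    """Convert a '#RRGGBB' string to an array of three integers."""
    color = color.lstrip('#')
    return np.array([int(color[i:i + 2], 16) for i in (0, 2, 4)])

classes = {
    "Building": "#3C1098",
    "Land": "#8429F6",
    "Road": "#6EC1E4",
    "Vegetation": "#FEDD3A",
    "Water": "#E2A929",
    "Unlabeled": "#9B9B9B",
}

for name, hex_color in classes.items():
    print(name, hex_to_rgb(hex_color))
```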

Models expect masks in the form of class indices (e.g., 0 for water, 1 for land).

For example:  
[226, 169, 41] corresponds to class 0 (water),  
[155, 155, 155] corresponds to class 5 (unlabeled),  
[254, 221, 58] corresponds to class 4 (vegetation),  
[110, 193, 228] corresponds to class 2 (road)  
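
The mapping from RGB masks to class indices and then to one-hot vectors can be sketched as follows. The indices for water, land, road, vegetation and unlabeled follow the examples above; the index 3 for building is inferred as the only remaining value and should be treated as an assumption.

```python
import numpy as np

CLASS_COLORS = {
    0: (226, 169, 41),   # water
    1: (132, 41, 246),   # land
    2: (110, 193, 228),  # road
    3: (60, 16, 152),    # building (index inferred, not stated in the text)
    4: (254, 221, 58),   # vegetation
    5: (155, 155, 155),  # unlabeled
}

def rgb_to_label(mask):
    """Turn an (H, W, 3) RGB mask into an (H, W) array of class indices."""
    label = np.zeros(mask.shape[:2], dtype=np.uint8)
    for index, color in CLASS_COLORS.items():
        label[np.all(mask == color, axis=-1)] = index
    return label  # note: colors not in the table silently stay at index 0

def one_hot(label, n_classes=len(CLASS_COLORS)):
    """Turn class indices into one-hot vectors along a new last axis."""
    return np.eye(n_classes, dtype=np.float32)[label]

mask = np.array([
    [(226, 169, 41), (155, 155, 155)],
    [(254, 221, 58), (110, 193, 228)],
], dtype=np.uint8)

labels = rgb_to_label(mask)
encoded = one_hot(labels)
print(labels)
print(encoded.shape)
```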


#### Architecture
section 2.3


The Jaccard coefficient is defined as:


$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$


Here:

- $|A \cap B|$: The intersection, already computed.
- $|A \cup B|$: The union, calculated as:

  $$|A \cup B| = |A| + |B| - |A \cap B|$$


Adding $1.0$ to the numerator and denominator is a smoothing term to avoid division by zero when both `y_true` and `y_pred` are empty (i.e., when the intersection and union are zero).
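
A NumPy version of the smoothed Jaccard coefficient is sketched below. The function name `jaccard_coef` and the `smooth` parameter are illustrative; the training code would need the same computation expressed with the deep learning framework's tensor operations.

```python
import numpy as np

def jaccard_coef(y_true, y_pred, smooth=1.0):
    """Smoothed intersection over union of two binary (or soft) masks."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - intersection
    return (intersection + smooth) / (union + smooth)

a = [1, 1, 0, 0]
b = [1, 0, 1, 0]
print(jaccard_coef(a, b))                        # intersection 1, union 3
print(jaccard_coef([0, 0, 0, 0], [0, 0, 0, 0]))  # both masks empty
```

With both masks empty the smoothing term makes the score 1.0 instead of an undefined 0/0, which matches the intuition that two empty masks agree perfectly.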


The model follows the U-Net architecture: https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/  

![u-net-arch-img-link](Images/u-net-architecture.png)


-  The convolution filter count (the first parameter) is changed from the original architecture to suit our use case; for every subsequent layer the number of filters is doubled.
-  The dropout parameter can be changed and experimented with for the use case.

#### Define Loss Function 
section 2.4

Dice loss and focal loss are combined into the total loss:  
Total Loss = Dice Loss + (1 * Focal Loss)  


Class weights are used to balance the loss function in cases where some classes are more prevalent than others in the dataset (i.e., imbalanced datasets).  
Dice loss: Measures the overlap between predicted segmentation and ground truth, focusing on overlap quality to improve segmentation performance. `class_weights` adjusts the contribution of each class.  
Focal loss: Emphasizes learning on hard-to-classify examples, addressing class imbalance.
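
The combination can be sketched in NumPy using one common formulation of each loss. This is not the exact implementation used for training: the equal class weights, `alpha=0.25`, `gamma=2.0` and the weighted-mean form of the Dice score are assumptions of this sketch.

```python
import numpy as np

def dice_loss(y_true, y_pred, class_weights, eps=1e-7):
    """1 minus the class-weighted mean Dice score; last axis holds classes."""
    axes = tuple(range(y_true.ndim - 1))  # sum over everything except classes
    intersection = np.sum(y_true * y_pred, axis=axes)
    totals = np.sum(y_true, axis=axes) + np.sum(y_pred, axis=axes)
    dice = (2.0 * intersection + eps) / (totals + eps)
    return 1.0 - np.sum(class_weights * dice) / np.sum(class_weights)

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Categorical focal loss: down-weights pixels that are already easy."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return np.mean(-alpha * y_true * (1.0 - p) ** gamma * np.log(p))

def total_loss(y_true, y_pred, class_weights):
    return dice_loss(y_true, y_pred, class_weights) + 1.0 * focal_loss(y_true, y_pred)

n_classes = 6
weights = np.full(n_classes, 1.0 / n_classes)  # hypothetical equal weights

# One-hot ground truth for a 2x2 mask, and two candidate predictions.
y_true = np.eye(n_classes)[np.array([[0, 5], [4, 2]])]
perfect = y_true.copy()
uniform = np.full_like(y_true, 1.0 / n_classes)

print(total_loss(y_true, perfect, weights))
print(total_loss(y_true, uniform, weights))
```

A perfect prediction drives both terms to approximately zero, while an uninformative uniform prediction is penalized by both the overlap term and the focal term.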


#### Model compilation
section 2.5

Model: "functional"


#### Evaluation
section 2.6


|   |   |
|-----------------------------|-------------------------|
<img src="Images/loss_1_segm.png" width="350" /> | <img src="Images/loss_intersect_segm.png" width="350" />



#### Generate predictions
section 2.7

Compare test mask and predicted mask images  

<img src="Images/segm_pred_1.png" width="650" />
<img src="Images/segm_pred_2.png" width="650" />
<img src="Images/segm_pred_3.png" width="650" />
<img src="Images/segm_pred_4.png" width="650" />

## Conclusion

## References