# Train a Semantic Segmentation Model using Segmentation-Models-PyTorch

[![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/opengeos/geoai/blob/main/docs/examples/train_segmentation_model.ipynb)

This notebook demonstrates how to train semantic segmentation models for detecting objects such as buildings using the [segmentation-models-pytorch](https://smp.readthedocs.io) library. Unlike instance segmentation with Mask R-CNN, this approach treats the task as pixel-level classification: every pixel is assigned a class label (in the simplest case, building vs. background), and individual objects are not separated.

## Install packages
Make sure the `geoai-py` package is installed before running this notebook.


In [1]:
%pip install geoai-py


## Import libraries

In [2]:
import geoai

## Download sample data

We'll use the same dataset as the Mask R-CNN example for consistency.

In [3]:
train_raster_url = (
    "https://huggingface.co/datasets/mikadishen/vexcel_google_images/resolve/main/vexcel%20teste_full_ref.tif"
)
train_vector_url = "https://huggingface.co/datasets/mikadishen/vexcel_google_images/resolve/main/buildings_tree_vicente_piresfinal.geojson"
test_raster_url = (
    "https://huggingface.co/datasets/mikadishen/vexcel_google_images/resolve/main/vexcel_treino_ref.tif"
)

In [4]:
train_raster_path = geoai.download_file(train_raster_url)
train_vector_path = geoai.download_file(train_vector_url)
test_raster_path = geoai.download_file(test_raster_url)

Downloading vexcel%20teste_full_ref.tif: 100%|██████████| 119M/119M [00:05<00:00, 24.0MB/s]
Downloading buildings_tree_vicente_piresfinal.geojson: 2.24MB [00:00, 3.89MB/s]
Downloading vexcel_treino_ref.tif: 100%|██████████| 121M/121M [00:05<00:00, 24.4MB/s]


## Visualize sample data

In [5]:
geoai.get_raster_info(train_raster_path)

{'driver': 'GTiff',
 'width': 8370,
 'height': 4970,
 'count': 3,
 'dtype': 'uint8',
 'crs': 'EPSG:4326',
 'transform': Affine(4.759976051036239e-07, 0.0, -48.0366707557608,
        0.0, -4.759976051036239e-07, -15.812011875489855),
 'bounds': BoundingBox(left=-48.0366707557608, bottom=-15.81437758358722, right=-48.03268665580609, top=-15.812011875489855),
 'resolution': (4.759976051036239e-07, 4.759976051036239e-07),
 'nodata': None,
 'band_stats': [{'band': 1,
   'min': 0.0,
   'max': 255.0,
   'mean': 122.79148804415502,
   'std': 70.02226270049813},
  {'band': 2,
   'min': 0.0,
   'max': 255.0,
   'mean': 124.2646285598898,
   'std': 57.07915377851708},
  {'band': 3,
   'min': 0.0,
   'max': 255.0,
   'mean': 116.69543382156739,
   'std': 57.831991876684576}]}

In [6]:
geoai.view_vector_interactive(train_vector_path, tiles=train_raster_url)

In [7]:
geoai.view_raster(test_raster_url)

## Create training data

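We'll export the same training tiles as in the Mask R-CNN example: 512 × 512 pixel tiles with a stride of 256 pixels, so adjacent tiles overlap by 50%.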

In [8]:
out_folder = "buildings"
tiles = geoai.export_geotiff_tiles(
    in_raster=train_raster_path,
    out_folder=out_folder,
    in_class_data=train_vector_path,
    tile_size=512,
    stride=256,
    buffer_radius=0,
)


Raster info for vexcel%20teste_full_ref.tif:
  CRS: EPSG:4326
  Dimensions: 8370 x 4970
  Resolution: (4.759976051036239e-07, 4.759976051036239e-07)
  Bands: 3
  Bounds: BoundingBox(left=-48.0366707557608, bottom=-15.81437758358722, right=-48.03268665580609, top=-15.812011875489855)
Loaded 810 features from buildings_tree_vicente_piresfinal.geojson
Vector CRS: EPSG:4326
Found 6 unique classes: ['dark_roof' 'mixte_roof' 'light_roof' 'clay_roof' 'forest' 'crown']


Generated: 608, With features: 607: 100%|██████████| 608/608 [00:15<00:00, 38.68it/s]



------- Export Summary -------
Total tiles exported: 608
Tiles with features: 607 (99.8%)
Average feature pixels per tile: 129559.2
Output saved to: buildings

------- Georeference Verification -------
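
The tile count in the export summary can be checked with a little arithmetic. The sketch below is not taken from the `geoai` source. It assumes the exporter steps a 512-pixel window across the raster in 256-pixel strides and adds one final window when the last stride does not reach the edge (a ceiling rather than a floor). The function name `count_tiles` is hypothetical. Under that assumption, the 8370 × 4970 raster yields the 608 tiles reported above.

```python
import math


def count_tiles(width, height, tile_size, stride):
    """Estimate how many sliding-window tiles cover a raster.

    Assumption: a partial step at the right/bottom edge still produces a tile,
    hence math.ceil. This matches the count in the export log but is not
    confirmed against the library's implementation.
    """
    cols = math.ceil((width - tile_size) / stride) + 1
    rows = math.ceil((height - tile_size) / stride) + 1
    return cols, rows, cols * rows


cols, rows, total = count_tiles(8370, 4970, tile_size=512, stride=256)
print(cols, rows, total)  # 32 19 608
```

Halving the stride roughly quadruples the number of tiles, so the stride is the main lever for trading disk space and training time against sample count.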


## Train semantic segmentation model

Now we'll train a semantic segmentation model using the new `train_segmentation_model` function. This function supports various architectures from `segmentation-models-pytorch`:

- **Architectures**: `unet`, `unetplusplus`, `deeplabv3`, `deeplabv3plus`, `fpn`, `pspnet`, `linknet`, `manet`, `segformer`
- **Encoders**: `resnet34`, `resnet50`, `efficientnet-b0`, `mobilenet_v2`, etc.

For more details, please refer to the [segmentation-models-pytorch documentation](https://smp.readthedocs.io/en/latest/models.html).

### Example 1: U-Net with ResNet34 encoder


In [9]:
# Train U-Net model
geoai.train_segmentation_model(
    images_dir=f"{out_folder}/images",
    labels_dir=f"{out_folder}/labels",
    output_dir=f"{out_folder}/unet_models",
    architecture="unet",
    encoder_name="resnet34",
    encoder_weights="imagenet",
    num_channels=3,
    num_classes=2,  # background and building
    batch_size=8,
    num_epochs=100,
    learning_rate=0.001,
    val_split=0.2,
    verbose=True,
)

Using device: cuda
Found 608 image files and 608 label files
Training on 486 images, validating on 122 images
Checking image sizes for compatibility...
All sampled images have the same size: (512, 512)
No resizing needed.
Testing data loader...
Data loader test passed.


config.json:   0%|          | 0.00/156 [00:00<?, ?B/s]

KeyboardInterrupt: 

### Example 2: SegFormer with ResNet152 encoder

In [10]:
geoai.train_segmentation_model(
    images_dir=f"{out_folder}/images",
    labels_dir=f"{out_folder}/labels",
    output_dir=f"{out_folder}/segformer_models",
    architecture="segformer",
    encoder_name="resnet152",
    encoder_weights="imagenet",
    num_channels=3,
    num_classes=6,
    batch_size=6,  # Smaller batch size for more complex model
    num_epochs=100,
    learning_rate=0.0005,
    val_split=0.2,
)

Using device: cuda
Found 608 image files and 608 label files
Training on 486 images, validating on 122 images
Checking image sizes for compatibility...
All sampled images have the same size: (512, 512)
No resizing needed.
Testing data loader...
Data loader test passed.


config.json:   0%|          | 0.00/156 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/241M [00:00<?, ?B/s]

Starting training with segformer + resnet152
Model parameters: 59,474,246
Epoch: 0, Batch: 0/81, Loss: 1.9077, Time: 2.29s
Epoch: 0, Batch: 10/81, Loss: 1.0248, Time: 8.04s
Epoch: 0, Batch: 20/81, Loss: 0.5073, Time: 8.09s
Epoch: 0, Batch: 30/81, Loss: 0.5913, Time: 8.12s
Epoch: 0, Batch: 40/81, Loss: 0.8282, Time: 8.19s
Epoch: 0, Batch: 50/81, Loss: 0.6960, Time: 8.26s
Epoch: 0, Batch: 60/81, Loss: 0.5608, Time: 8.32s
Epoch: 0, Batch: 70/81, Loss: 0.6314, Time: 8.40s
Epoch: 0, Batch: 80/81, Loss: 0.5669, Time: 8.46s
Epoch 1/50: Train Loss: 0.7647, Val Loss: 0.8508, Val IoU: 0.3721, Val Dice: 0.4625
Saving best model with IoU: 0.3721
Epoch: 1, Batch: 0/81, Loss: 0.5043, Time: 1.35s
Epoch: 1, Batch: 10/81, Loss: 0.4340, Time: 8.54s
Epoch: 1, Batch: 20/81, Loss: 0.6724, Time: 8.59s
Epoch: 1, Batch: 30/81, Loss: 0.4574, Time: 8.65s
Epoch: 1, Batch: 40/81, Loss: 0.5520, Time: 8.69s
Epoch: 1, Batch: 50/81, Loss: 0.5463, Time: 8.73s
Epoch: 1, Batch: 60/81, Loss: 0.4564, Time: 8.83s
Epoch: 1,

KeyboardInterrupt: 
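
The split sizes and batch counts in the logs above follow from `val_split` and `batch_size`. The sketch below reproduces them under two assumptions that are not confirmed against the `geoai` source: the validation set size is rounded up, and the last partial batch is kept. The helper name `split_and_batches` is hypothetical.

```python
import math


def split_and_batches(n_images, val_split, batch_size):
    """Return (n_train, n_val, n_batches) for a simple hold-out split."""
    # Assumption: the validation size is rounded up to a whole image.
    n_val = math.ceil(n_images * val_split)
    n_train = n_images - n_val
    # Assumption: the final, possibly smaller, batch is not dropped.
    n_batches = math.ceil(n_train / batch_size)
    return n_train, n_val, n_batches


print(split_and_batches(608, 0.2, 6))  # Example 2 settings: (486, 122, 81)
print(split_and_batches(608, 0.2, 8))  # Example 1 settings: (486, 122, 61)
```

The first result matches the "Training on 486 images, validating on 122 images" line and the `x/81` batch counter in the Example 2 log.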

## Run inference

Now we'll use the trained model to make predictions on the test image.

In [None]:
# Define paths
masks_path = "vexcel_treino_semantic_prediction3.tif"
model_path = f"{out_folder}/segformer_models/best_model.pth"

In [None]:
# Run semantic segmentation inference
geoai.semantic_segmentation(
    input_path=test_raster_path,
    output_path=masks_path,
    model_path=model_path,
    architecture="segformer",
    encoder_name="resnet152",
    num_channels=3,
    num_classes=6,
    window_size=512,
    overlap=256,
    batch_size=4,
)
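
To make the `window_size` and `overlap` parameters concrete, here is a minimal sketch of how overlapping inference windows could be laid out along one raster axis. It illustrates the general sliding-window technique and is not the `geoai` implementation. The function name `window_origins` and the choice to clamp the last window to the raster edge are assumptions.

```python
def window_origins(length, window_size, overlap):
    """Return start offsets of overlapping windows along one axis."""
    stride = window_size - overlap  # step between consecutive windows
    if length <= window_size:
        return [0]
    origins = list(range(0, length - window_size + 1, stride))
    # Assumption: shift the final window back so it ends exactly at the edge.
    if origins[-1] != length - window_size:
        origins.append(length - window_size)
    return origins


# A hypothetical 1200-pixel axis with the settings used above.
print(window_origins(1200, window_size=512, overlap=256))  # [0, 256, 512, 688]
```

With a 256-pixel overlap, most pixels are covered by more than one window, which gives an implementation the option of blending predictions near window borders.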

## Vectorize masks

Convert the predicted mask to vector format for better visualization and analysis.

In [None]:
output_vector_path = "vexcel_treino_semantic_prediction.geojson"
gdf = geoai.orthogonalize(masks_path, output_vector_path, epsilon=2)

## Add geometric properties

In [None]:
gdf_props = geoai.add_geometric_properties(gdf, area_unit="m2", length_unit="m")

## Visualize results

In [None]:
geoai.view_raster(masks_path, nodata=0, basemap=test_raster_url, backend="ipyleaflet")

In [None]:
geoai.view_vector_interactive(gdf_props, column="area_m2", tiles=test_raster_url)

In [None]:
# Keep only polygons larger than 50 square meters
gdf_filtered = gdf_props[gdf_props["area_m2"] > 50]

In [None]:
geoai.view_vector_interactive(gdf_filtered, column="area_m2", tiles=test_raster_url)

In [None]:
geoai.create_split_map(
    left_layer=gdf_filtered,
    right_layer=test_raster_url,
    left_args={"style": {"color": "red", "fillOpacity": 0.2}},
    basemap=test_raster_url,
)

## Model Performance Analysis

Let's examine the training curves and model performance:

In [None]:
geoai.plot_performance_metrics(
    history_path=f"{out_folder}/unet_models/training_history.pth",
    figsize=(15, 5),
    verbose=True,
)

![image](https://github.com/user-attachments/assets/9355446f-f9ba-4818-aedb-4bb5dee56813)

## Performance Metrics

**IoU (Intersection over Union)** and the **Dice score** are two widely used metrics for evaluating the similarity between two binary masks, most often in image segmentation tasks. They are related but not identical.

---

### 🔸 **Definitions**

#### **IoU (Jaccard Index)**

$$
\text{IoU} = \frac{|A \cap B|}{|A \cup B|}
$$

* Measures the overlap between predicted region $A$ and ground truth region $B$ relative to their union.
* Ranges from 0 (no overlap) to 1 (perfect overlap).

#### **Dice Score (F1 Score for Sets)**

$$
\text{Dice} = \frac{2|A \cap B|}{|A| + |B|}
$$

* Measures the same overlap, but normalizes by the sum of the two region sizes, so the intersection is counted twice in the numerator.
* Also ranges from 0 to 1.

---

### 🔸 **Key Differences**

| Metric   | Formula                     | Error weighting                          | Behavior                                                  |
| -------- | --------------------------- | ---------------------------------------- | --------------------------------------------------------- |
| **IoU**  | $\frac{TP}{TP + FP + FN}$   | FP and FN equally; TP counted once       | Never higher than Dice; drops faster for partial overlaps |
| **Dice** | $\frac{2TP}{2TP + FP + FN}$ | FP and FN equally; TP counted twice      | Never lower than IoU; more lenient on partial overlaps    |

> TP: True Positive, FP: False Positive, FN: False Negative
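
As a worked example of the formulas in the table, the sketch below counts TP, FP, and FN for two small binary masks and derives both metrics from them. It uses plain Python lists to stay dependency-free. The function names are hypothetical, and returning 1.0 when both masks are empty is a convention chosen here, not something stated in the source.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def iou_score(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    denom = tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    return tp / denom if denom else 1.0


def dice_score(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    return 2 * tp / denom if denom else 1.0


pred = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
truth = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]

print(confusion_counts(pred, truth))      # (2, 2, 2)
print(round(iou_score(pred, truth), 4))   # 0.3333
print(round(dice_score(pred, truth), 4))  # 0.5
```

The same half-overlapping masks score 0.33 on IoU and 0.5 on Dice, which shows how Dice is the more lenient of the two for partial overlaps.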

---

### 🔸 **Relationship**

Dice and IoU are mathematically related:

$$
\text{Dice} = \frac{2 \cdot \text{IoU}}{1 + \text{IoU}} \quad \text{or} \quad \text{IoU} = \frac{\text{Dice}}{2 - \text{Dice}}
$$
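
The conversion holds exactly for a single pair of masks, but not for averaged scores. This may be why the Epoch 1 values logged earlier (Val IoU 0.3721, Val Dice 0.4625) do not satisfy it: converting 0.3721 gives about 0.542. A plausible explanation, not confirmed against the training code, is that the logged metrics are averages over classes or batches. The sketch below uses hypothetical helper names and made-up per-sample IoU values to show the effect.

```python
def dice_from_iou(iou):
    return 2 * iou / (1 + iou)


def iou_from_dice(dice):
    return dice / (2 - dice)


# Exact for a single score, and the two conversions invert each other.
print(round(dice_from_iou(0.5), 4))                 # 0.6667
print(round(iou_from_dice(dice_from_iou(0.5)), 4))  # 0.5

# Illustrative per-sample IoU values (not from the training run).
ious = [0.2, 0.8]
dices = [dice_from_iou(v) for v in ious]

mean_iou = sum(ious) / len(ious)
mean_dice = sum(dices) / len(dices)

print(round(mean_dice, 4))                # 0.6111
print(round(dice_from_iou(mean_iou), 4))  # 0.6667, not equal to the mean Dice
```

Because the conversion is nonlinear, averaging and converting do not commute, so mean IoU and mean Dice should be read as separate summaries.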

---

### 🔸 **When to Use What**

* **IoU**: Common in object detection and semantic segmentation benchmarks (e.g., COCO, Pascal VOC).
* **Dice**: Common in medical imaging and frequently used as a training loss (Dice loss) when the foreground class occupies only a small fraction of the image.