# 01 - Data Exploration
## Exploring the Sentinel-1, Sentinel-2, and Ground Truth Data

This notebook will:
- Load and analyze the ground truth data (1285 points)
- Read the Sentinel-1 and Sentinel-2 images and display their metadata
- Visualize the individual Sentinel-2 bands and spectral indices
- Check the label distribution and the locations of the points
- Extract and visualize a few sample patches

In [1]:
import sys
sys.path.append('..')  # Add parent directory to path

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # Use non-interactive backend for faster processing
import matplotlib.pyplot as plt
import seaborn as sns
import rasterio
from rasterio.plot import show
from pathlib import Path
import os

# Import project modules
from src.config import *
from src.utils import read_geotiff, normalize_image, validate_sentinel2_ranges, coords_to_pixel, mask_raster_with_boundary
from src.preprocessing import load_ground_truth, extract_patch_at_point

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Disable interactive plotting
plt.ioff()

# Create figures directory if it doesn't exist
FIGURES_DIR = Path(PROJECT_ROOT) / 'figures'
FIGURES_DIR.mkdir(exist_ok=True)

print("="*60)
print("FAST MODE: Figures will be saved without display")
print(f"Output directory: {FIGURES_DIR}")
print("="*60)
print(f"Project root: {PROJECT_ROOT}")
print(f"Data directory: {DATA_DIR}")

FAST MODE: Figures will be saved without display
Output directory: d:\ninhhaidang\25-26_HKI_DATN_21021411_DangNH\notebooks\..\figures
Project root: d:\ninhhaidang\25-26_HKI_DATN_21021411_DangNH\notebooks\..
Data directory: d:\ninhhaidang\25-26_HKI_DATN_21021411_DangNH\notebooks\..\data


## 1. Load and analyze Ground Truth

In [2]:
# Load ground truth CSV
df = load_ground_truth()

# Display first few rows
print("\nFirst 10 rows:")
df.head(10)

Loaded 1285 ground truth points
  - No deforestation (0): 650
  - Deforestation (1): 635

First 10 rows:


   id  label              x          y
0   1      1  495551.110218  1054432.0
1   2      1  495451.786737  1054588.0
2   3      1  495391.097161  1054524.0
3   4      1  495664.635289  1054353.0
4   5      1  495610.798117  1054299.0
5   6      1  496706.522336  1054520.0
6   7      1  496726.584917  1054502.0
7   8      1  496685.963731  1054462.0
8   9      1  496636.045419  1054433.0
9  10      1  496666.873913  1054412.0


In [3]:
# Basic statistics
print("Dataset info:")
print(df.info())
print("\nBasic statistics:")
print(df.describe())

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1285 entries, 0 to 1284
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      1285 non-null   int64  
 1   label   1285 non-null   int64  
 2   x       1285 non-null   float64
 3   y       1285 non-null   float64
dtypes: float64(2), int64(2)
memory usage: 40.3 KB
None

Basic statistics:
                id        label              x             y
count  1285.000000  1285.000000    1285.000000  1.285000e+03
mean    643.000000     0.494163  504952.989734  1.001523e+06
std     371.091857     0.500161   19869.198606  3.423688e+04
min       1.000000     0.000000  476428.116678  9.483253e+05
25%     322.000000     0.000000  489448.961958  9.720617e+05
50%     643.000000     0.000000  496832.139123  9.915942e+05
75%     964.000000     1.000000  521610.656136  1.033932e+06
max    1285.000000     1.000000  545098.055002  1.054588e+06


## 2. Load and analyze Sentinel-2 imagery

In [4]:
# Load Sentinel-2 images
print("Loading Sentinel-2 2024...")
s2_2024, s2_2024_profile, s2_2024_transform = read_geotiff(SENTINEL2_2024)

print("\nLoading Sentinel-2 2025...")
s2_2025, s2_2025_profile, s2_2025_transform = read_geotiff(SENTINEL2_2025)

# Print information
print(f"\nS2 2024 shape: {s2_2024.shape}")
print(f"S2 2024 dtype: {s2_2024.dtype}")
print(f"S2 2024 CRS: {s2_2024_profile['crs']}")
print(f"S2 2024 transform: {s2_2024_transform}")

print(f"\nS2 2025 shape: {s2_2025.shape}")
print(f"S2 2025 dtype: {s2_2025.dtype}")

# Check for NaN values
print(f"\nS2 2024 has NaN: {np.isnan(s2_2024).any()}")
print(f"S2 2025 has NaN: {np.isnan(s2_2025).any()}")

# Value ranges
print(f"\nS2 2024 value range: [{np.nanmin(s2_2024):.2f}, {np.nanmax(s2_2024):.2f}]")
print(f"S2 2025 value range: [{np.nanmin(s2_2025):.2f}, {np.nanmax(s2_2025):.2f}]")

Loading Sentinel-2 2024...

Loading Sentinel-2 2025...

S2 2024 shape: (7, 10917, 12547)
S2 2024 dtype: float32
S2 2024 CRS: EPSG:32648
S2 2024 transform: | 10.00, 0.00, 465450.00|
| 0.00,-10.00, 1055820.00|
| 0.00, 0.00, 1.00|

S2 2025 shape: (7, 10917, 12547)
S2 2025 dtype: float32

S2 2024 has NaN: True
S2 2025 has NaN: True

S2 2024 value range: [-1.00, 1.00]
S2 2025 value range: [-1.00, 1.00]


In [5]:
# Check value range for each band individually
print("\n" + "="*60)
print("DETAILED VALUE RANGE FOR EACH BAND (RAW FILE)")
print("="*60)

for i, band_name in enumerate(S2_BANDS):
    min_2024 = np.nanmin(s2_2024[i])
    max_2024 = np.nanmax(s2_2024[i])
    mean_2024 = np.nanmean(s2_2024[i])
    
    min_2025 = np.nanmin(s2_2025[i])
    max_2025 = np.nanmax(s2_2025[i])
    mean_2025 = np.nanmean(s2_2025[i])
    
    print(f"\n{band_name}:")
    print(f"  2024: [{min_2024:7.4f}, {max_2024:7.4f}]  mean={mean_2024:7.4f}")
    print(f"  2025: [{min_2025:7.4f}, {max_2025:7.4f}]  mean={mean_2025:7.4f}")

# Check for outliers in B12 2025
print("\n" + "="*60)
print("CHECKING B12 2025 FOR OUTLIERS")
print("="*60)
b12_2025 = s2_2025[3]  # B12 is band index 3
outliers = b12_2025[b12_2025 > 1.0]
outliers_valid = outliers[~np.isnan(outliers)]
print(f"Number of pixels with B12 > 1.0: {len(outliers_valid):,}")
if len(outliers_valid) > 0:
    print(f"Max value: {np.max(outliers_valid):.6f}")
    print(f"Values > 1.0: {np.unique(outliers_valid)[:10]}")  # Show first 10 unique values

print("\n" + "="*60)


DETAILED VALUE RANGE FOR EACH BAND (RAW FILE)

B4:
  2024: [ 0.0001,  0.8136]  mean= 0.0475
  2025: [ 0.0001,  0.6636]  mean= 0.0560

B8:
  2024: [ 0.0001,  0.7592]  mean= 0.2179
  2025: [ 0.0001,  0.6240]  mean= 0.2102

B11:
  2024: [ 0.0013,  0.5830]  mean= 0.0993
  2025: [ 0.0004,  0.6433]  mean= 0.1117

B12:
  2024: [ 0.0002,  0.9826]  mean= 0.0529
  2025: [ 0.0005,  1.0000]  mean= 0.0607

NDVI:
  2024: [-1.0000,  1.0000]  mean= 0.5631
  2025: [-1.0000,  1.0000]  mean= 0.5105

NBR:
  2024: [-1.0000,  1.0000]  mean= 0.5690
  2025: [-1.0000,  1.0000]  mean= 0.5113

NDMI:
  2024: [-1.0000,  1.0000]  mean= 0.3401
  2025: [-1.0000,  1.0000]  mean= 0.2717

CHECKING B12 2025 FOR OUTLIERS
Number of pixels with B12 > 1.0: 0
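The NDVI, NBR, and NDMI bands are stored precomputed in the raster stack, and the notebook does not show how they were produced. As a minimal sketch, assuming the standard normalized-difference definitions for Sentinel-2 (NDVI from B8 and B4, NBR from B8 and B12, NDMI from B8 and B11), they could be recomputed from the reflectance bands as below. The helper `normalized_difference` and the toy reflectance values are hypothetical and not part of the project code.

```python
import numpy as np

def normalized_difference(a, b):
    """Return (a - b) / (a + b), with NaN where the denominator is zero."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    denom = a + b
    out = np.full(denom.shape, np.nan, dtype=np.float32)
    # Only divide where the denominator is non-zero; other cells stay NaN
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

# Toy reflectance values (assumed, for illustration only)
b4 = np.array([0.05, 0.10], dtype=np.float32)   # red
b8 = np.array([0.30, 0.10], dtype=np.float32)   # NIR
b11 = np.array([0.15, 0.10], dtype=np.float32)  # SWIR1
b12 = np.array([0.08, 0.10], dtype=np.float32)  # SWIR2

ndvi = normalized_difference(b8, b4)    # vegetation index
nbr = normalized_difference(b8, b12)    # burn ratio
ndmi = normalized_difference(b8, b11)   # moisture index
print(ndvi, nbr, ndmi)
```

Comparing such recomputed values against bands 4-6 of the stack would be one way to confirm which formulas the export used.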



### Apply Forest Boundary Mask

Apply a mask derived from the boundary shapefile so that only pixels inside the forest study area are kept. Pixels outside the boundary are set to NaN.

In [6]:
# Apply forest boundary mask to all imagery
print("\nApplying forest boundary mask...")
print(f"Boundary shapefile: {FOREST_BOUNDARY}")

# Mask Sentinel-2 data
s2_2024_masked, s2_mask = mask_raster_with_boundary(s2_2024, s2_2024_transform, FOREST_BOUNDARY)
s2_2025_masked, _ = mask_raster_with_boundary(s2_2025, s2_2025_transform, FOREST_BOUNDARY)

# Replace original with masked versions
s2_2024 = s2_2024_masked
s2_2025 = s2_2025_masked

print(f"Masked S2 2024 - Valid pixels: {np.sum(s2_mask):,} / {s2_mask.size:,} ({np.sum(s2_mask)/s2_mask.size*100:.2f}%)")
print(f"S2 2024 after masking - value range: [{np.nanmin(s2_2024):.2f}, {np.nanmax(s2_2024):.2f}]")
print("Forest boundary mask applied to Sentinel-2 imagery ✓")


Applying forest boundary mask...
Boundary shapefile: d:\ninhhaidang\25-26_HKI_DATN_21021411_DangNH\notebooks\..\data\raw\boundary\forest_boundary.shp




Masked S2 2024 - Valid pixels: 17,016,424 / 136,975,599 (12.42%)
S2 2024 after masking - value range: [-1.00, 1.00]
Forest boundary mask applied to Sentinel-2 imagery ✓
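The internals of `mask_raster_with_boundary` are not shown in this notebook. The sketch below illustrates only the final step, under the assumption that the boundary has already been rasterized into a 2D boolean array (`True` inside the forest): the mask is broadcast across the band axis and everything outside becomes NaN. The name `apply_boolean_mask` and the toy arrays are hypothetical.

```python
import numpy as np

def apply_boolean_mask(stack, inside):
    """Set pixels outside the mask to NaN in every band of a (bands, rows, cols) stack."""
    stack = np.asarray(stack, dtype=np.float32)
    inside = np.asarray(inside, dtype=bool)
    # Broadcast the (rows, cols) mask across the band axis
    return np.where(inside[np.newaxis, :, :], stack, np.nan).astype(np.float32)

# Toy 2-band, 2x3 raster and a mask keeping two of the six pixels
stack = np.arange(2 * 2 * 3, dtype=np.float32).reshape(2, 2, 3)
inside = np.array([[True, False, False],
                   [False, True, False]])

masked = apply_boolean_mask(stack, inside)
valid_fraction = inside.sum() / inside.size
print(masked)
print(f"Valid pixels: {valid_fraction:.2%}")
```

The valid-pixel percentage here is computed the same way as in the cell above, as the number of `True` mask cells divided by the mask size.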


In [7]:
# Check value ranges AFTER applying boundary mask (compare with QGIS)
print("\n" + "="*60)
print("VALUE RANGES AFTER BOUNDARY MASK (Compare with QGIS)")
print("="*60)

for i, band_name in enumerate(S2_BANDS):
    min_2024 = np.nanmin(s2_2024[i])
    max_2024 = np.nanmax(s2_2024[i])
    
    min_2025 = np.nanmin(s2_2025[i])
    max_2025 = np.nanmax(s2_2025[i])
    
    print(f"\n{band_name}:")
    print(f"  2024: [{min_2024:7.4f}, {max_2024:7.4f}]")
    print(f"  2025: [{min_2025:7.4f}, {max_2025:7.4f}]")

print("\n" + "="*60)
print("Expected from QGIS (Forest area only):")
print("="*60)
print("S2 2024:")
print("  B4:   [0.0003, 0.3164]")
print("  B8:   [0.0003, 0.5732]")
print("  B11:  [0.0051, 0.4538]")
print("  B12:  [0.0046, 0.3662]")
print("\nS2 2025:")
print("  B4:   [0.0001, 0.4436]")
print("  B8:   [0.0001, 0.5464]")
print("  B11:  [0.0068, 0.5130]")
print("  B12:  [0.0052, 0.9560]")
print("="*60)


VALUE RANGES AFTER BOUNDARY MASK (Compare with QGIS)

B4:
  2024: [ 0.0001,  0.8136]
  2025: [ 0.0001,  0.6636]

B8:
  2024: [ 0.0001,  0.7592]
  2025: [ 0.0001,  0.6240]

B11:
  2024: [ 0.0013,  0.5830]
  2025: [ 0.0004,  0.6433]

B12:
  2024: [ 0.0002,  0.9826]
  2025: [ 0.0005,  1.0000]

NDVI:
  2024: [-1.0000,  1.0000]
  2025: [-1.0000,  1.0000]

NBR:
  2024: [-1.0000,  1.0000]
  2025: [-1.0000,  1.0000]

NDMI:
  2024: [-1.0000,  1.0000]
  2025: [-1.0000,  1.0000]

Expected from QGIS (Forest area only):
S2 2024:
  B4:   [0.0003, 0.3164]
  B8:   [0.0003, 0.5732]
  B11:  [0.0051, 0.4538]
  B12:  [0.0046, 0.3662]

S2 2025:
  B4:   [0.0001, 0.4436]
  B8:   [0.0001, 0.5464]
  B11:  [0.0068, 0.5130]
  B12:  [0.0052, 0.9560]
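In the output above, the post-mask ranges of the four reflectance bands are the same as the raw ranges printed earlier, and they differ from the QGIS reference values. The notebook does not explain the difference. A small helper like the one below, a sketch with the hypothetical name `compare_ranges` and an arbitrarily chosen tolerance, makes such mismatches explicit instead of leaving them to a visual comparison. The numbers are copied from the 2024 output.

```python
def compare_ranges(observed, expected, tol=1e-3):
    """Return the band names whose (min, max) differ from the reference by more than tol."""
    mismatched = []
    for band, (exp_min, exp_max) in expected.items():
        obs_min, obs_max = observed[band]
        if abs(obs_min - exp_min) > tol or abs(obs_max - exp_max) > tol:
            mismatched.append(band)
    return mismatched

# Ranges printed by the notebook after masking (2024)
observed_2024 = {
    'B4': (0.0001, 0.8136),
    'B8': (0.0001, 0.7592),
    'B11': (0.0013, 0.5830),
    'B12': (0.0002, 0.9826),
}
# Reference ranges reported from QGIS for the forest area (2024)
expected_2024 = {
    'B4': (0.0003, 0.3164),
    'B8': (0.0003, 0.5732),
    'B11': (0.0051, 0.4538),
    'B12': (0.0046, 0.3662),
}

print(compare_ranges(observed_2024, expected_2024))
```

If every band is flagged, it may be worth checking whether the mask returned by the masking helper actually changed the number of finite pixels in the arrays.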


In [8]:
# Visualize individual bands for both 2024 and 2025
# Use appropriate colormaps and value ranges for each band type
band_names = S2_BANDS  # Get band names from config

colormaps = {
    'B4': 'Reds',      # Red band
    'B8': 'YlOrRd',    # NIR band
    'B11': 'YlOrBr',   # SWIR1 band
    'B12': 'viridis',  # SWIR2 band - perceptually uniform, colorblind-friendly
    'NDVI': 'RdYlGn',  # Vegetation index
    'NBR': 'RdYlGn',   # Burn ratio
    'NDMI': 'RdYlBu'   # Moisture index
}

# Value ranges for each band type
# B4, B8, B11, B12: [0, 1]
# NDVI, NBR, NDMI: [-1, 1]
vmin_values = [0, 0, 0, 0, -1, -1, -1]
vmax_values = [1, 1, 1, 1,  1,  1,  1]

fig, axes = plt.subplots(2, 7, figsize=(20, 6))

# 2024 bands
for i, band_name in enumerate(band_names):
    cmap = colormaps.get(band_name, 'viridis')
    im = axes[0, i].imshow(s2_2024[i], cmap=cmap, 
                           vmin=vmin_values[i],
                           vmax=vmax_values[i])
    axes[0, i].set_title(f'{band_name} 2024', fontsize=10)
    axes[0, i].axis('off')
    plt.colorbar(im, ax=axes[0, i], fraction=0.046, pad=0.04)

# 2025 bands
for i, band_name in enumerate(band_names):
    cmap = colormaps.get(band_name, 'viridis')
    im = axes[1, i].imshow(s2_2025[i], cmap=cmap,
                           vmin=vmin_values[i],
                           vmax=vmax_values[i])
    axes[1, i].set_title(f'{band_name} 2025', fontsize=10)
    axes[1, i].axis('off')
    plt.colorbar(im, ax=axes[1, i], fraction=0.046, pad=0.04)

fig.suptitle('Sentinel-2 All Bands - 2024 vs 2025\n(Spectral bands: [0,1], Indices: [-1,1])', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
# Save figure
plt.savefig(FIGURES_DIR / '01_s2_all_bands_comparison.png', dpi=300, bbox_inches='tight')
print(f"✓ Saved: {FIGURES_DIR / '01_s2_all_bands_comparison.png'}")
plt.close()

✓ Saved: d:\ninhhaidang\25-26_HKI_DATN_21021411_DangNH\notebooks\..\figures\01_s2_all_bands_comparison.png


## 3. Load and analyze Sentinel-1 imagery

In [9]:
# Load Sentinel-1 images
print("Loading Sentinel-1 2024...")
s1_2024, s1_2024_profile, s1_2024_transform = read_geotiff(SENTINEL1_2024)

print("\nLoading Sentinel-1 2025...")
s1_2025, s1_2025_profile, s1_2025_transform = read_geotiff(SENTINEL1_2025)

# Use both VV and VH bands
s1_2024_vv = s1_2024[0:1, :, :]  # VV band
s1_2024_vh = s1_2024[1:2, :, :]  # VH band
s1_2025_vv = s1_2025[0:1, :, :]  # VV band
s1_2025_vh = s1_2025[1:2, :, :]  # VH band

# Print information
print(f"\nS1 2024 full shape: {s1_2024.shape}")
print(f"S1 2024 VV shape: {s1_2024_vv.shape}")
print(f"S1 2024 VH shape: {s1_2024_vh.shape}")
print(f"S1 2024 dtype: {s1_2024.dtype}")

print(f"\nS1 2025 full shape: {s1_2025.shape}")
print(f"S1 2025 VV shape: {s1_2025_vv.shape}")
print(f"S1 2025 VH shape: {s1_2025_vh.shape}")

# Check for NaN values
print(f"\nS1 2024 VV has NaN: {np.isnan(s1_2024_vv).any()}")
print(f"\nS1 2024 VH has NaN: {np.isnan(s1_2024_vh).any()}")
print(f"S1 2025 VV has NaN: {np.isnan(s1_2025_vv).any()}")
print(f"S1 2025 VH has NaN: {np.isnan(s1_2025_vh).any()}")

# Value ranges
print(f"\nS1 2024 VV value range: [{np.nanmin(s1_2024_vv):.2f}, {np.nanmax(s1_2024_vv):.2f}]")
print(f"S1 2024 VH value range: [{np.nanmin(s1_2024_vh):.2f}, {np.nanmax(s1_2024_vh):.2f}]")
print(f"S1 2025 VV value range: [{np.nanmin(s1_2025_vv):.2f}, {np.nanmax(s1_2025_vv):.2f}]")
print(f"S1 2025 VH value range: [{np.nanmin(s1_2025_vh):.2f}, {np.nanmax(s1_2025_vh):.2f}]")

Loading Sentinel-1 2024...

Loading Sentinel-1 2025...

S1 2024 full shape: (2, 10917, 12547)
S1 2024 VV shape: (1, 10917, 12547)
S1 2024 VH shape: (1, 10917, 12547)
S1 2024 dtype: float32

S1 2025 full shape: (2, 10917, 12547)
S1 2025 VV shape: (1, 10917, 12547)
S1 2025 VH shape: (1, 10917, 12547)

S1 2024 VV has NaN: True

S1 2024 VH has NaN: True
S1 2025 VV has NaN: True
S1 2025 VH has NaN: True

S1 2024 VV value range: [-54.78, 29.71]
S1 2024 VH value range: [-57.92, 14.67]
S1 2025 VV value range: [-48.09, 28.61]
S1 2025 VH value range: [-58.19, 13.40]


In [10]:
# Apply forest boundary mask to Sentinel-1
print("\nApplying forest boundary mask to Sentinel-1...")

# Mask Sentinel-1 data (both VV and VH bands)
s1_2024_masked, _ = mask_raster_with_boundary(s1_2024, s1_2024_transform, FOREST_BOUNDARY)
s1_2025_masked, _ = mask_raster_with_boundary(s1_2025, s1_2025_transform, FOREST_BOUNDARY)

# Replace original with masked versions
s1_2024 = s1_2024_masked
s1_2025 = s1_2025_masked

# Update VV and VH bands with masked versions
s1_2024_vv = s1_2024[0:1, :, :]
s1_2024_vh = s1_2024[1:2, :, :]
s1_2025_vv = s1_2025[0:1, :, :]
s1_2025_vh = s1_2025[1:2, :, :]

print(f"S1 2024 VV after masking - value range: [{np.nanmin(s1_2024_vv):.2f}, {np.nanmax(s1_2024_vv):.2f}]")
print(f"S1 2024 VH after masking - value range: [{np.nanmin(s1_2024_vh):.2f}, {np.nanmax(s1_2024_vh):.2f}]")
print("Forest boundary mask applied to Sentinel-1 imagery ✓")


Applying forest boundary mask to Sentinel-1...




S1 2024 VV after masking - value range: [-54.78, 29.71]
S1 2024 VH after masking - value range: [-57.92, 14.67]
Forest boundary mask applied to Sentinel-1 imagery ✓


In [11]:
# Visualize Sentinel-1 VV and VH polarizations
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# VV 2024
im1 = axes[0, 0].imshow(s1_2024_vv[0], cmap='gray', 
                        vmin=np.nanpercentile(s1_2024_vv, 2),
                        vmax=np.nanpercentile(s1_2024_vv, 98))
axes[0, 0].set_title('Sentinel-1 VV - 2024-02-04', fontsize=12, fontweight='bold')
axes[0, 0].axis('off')
plt.colorbar(im1, ax=axes[0, 0], fraction=0.046)

# VH 2024
im2 = axes[0, 1].imshow(s1_2024_vh[0], cmap='gray',
                        vmin=np.nanpercentile(s1_2024_vh, 2),
                        vmax=np.nanpercentile(s1_2024_vh, 98))
axes[0, 1].set_title('Sentinel-1 VH - 2024-02-04', fontsize=12, fontweight='bold')
axes[0, 1].axis('off')
plt.colorbar(im2, ax=axes[0, 1], fraction=0.046)

# VV 2025
im3 = axes[1, 0].imshow(s1_2025_vv[0], cmap='gray',
                        vmin=np.nanpercentile(s1_2025_vv, 2),
                        vmax=np.nanpercentile(s1_2025_vv, 98))
axes[1, 0].set_title('Sentinel-1 VV - 2025-02-22', fontsize=12, fontweight='bold')
axes[1, 0].axis('off')
plt.colorbar(im3, ax=axes[1, 0], fraction=0.046)

# VH 2025
im4 = axes[1, 1].imshow(s1_2025_vh[0], cmap='gray',
                        vmin=np.nanpercentile(s1_2025_vh, 2),
                        vmax=np.nanpercentile(s1_2025_vh, 98))
axes[1, 1].set_title('Sentinel-1 VH - 2025-02-22', fontsize=12, fontweight='bold')
axes[1, 1].axis('off')
plt.colorbar(im4, ax=axes[1, 1], fraction=0.046)

fig.suptitle('Sentinel-1 SAR - VV and VH Polarizations', fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
# Save figure
plt.savefig(FIGURES_DIR / '02_s1_vv_vh_comparison.png', dpi=300, bbox_inches='tight')
print(f"✓ Saved: {FIGURES_DIR / '02_s1_vv_vh_comparison.png'}")
plt.close()

✓ Saved: d:\ninhhaidang\25-26_HKI_DATN_21021411_DangNH\notebooks\..\figures\02_s1_vv_vh_comparison.png


## 4. Visualize ground truth points on the imagery

In [12]:
# Convert coordinates to pixel positions
df['col'] = 0
df['row'] = 0

for idx, row in df.iterrows():
    col, row_val = coords_to_pixel(row['x'], row['y'], s2_2024_transform)
    df.at[idx, 'col'] = col
    df.at[idx, 'row'] = row_val

print("Converted coordinates to pixel positions")
print(df[['x', 'y', 'col', 'row', 'label']].head())

Converted coordinates to pixel positions
               x             y   col  row  label
0  495551.110218  1.054432e+06  3010  138      1
1  495451.786737  1.054588e+06  3000  123      1
2  495391.097161  1.054524e+06  2994  129      1
3  495664.635289  1.054353e+06  3021  146      1
4  495610.798117  1.054299e+06  3016  152      1
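The implementation of `coords_to_pixel` is not shown here. For a north-up raster such as this one (10 m pixels, origin at x = 465450, y = 1055820 according to the transform printed in section 2), the conversion reduces to an offset and a division, so it can be vectorized over all points instead of looping with `iterrows`. The sketch below assumes there are no rotation terms and that fractional positions are floored. The function name `xy_to_rowcol` is hypothetical.

```python
import numpy as np

def xy_to_rowcol(x, y, pixel_size, x_origin, y_origin):
    """Map projected coordinates to integer (row, col) for a north-up raster."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Columns grow eastwards from the left edge
    col = np.floor((x - x_origin) / pixel_size).astype(np.int64)
    # Rows grow southwards from the top edge
    row = np.floor((y_origin - y) / pixel_size).astype(np.int64)
    return row, col

# First two ground truth points (y rounded as displayed in the table)
xs = np.array([495551.110218, 495451.786737])
ys = np.array([1054432.0, 1054588.0])

rows, cols = xy_to_rowcol(xs, ys, pixel_size=10.0,
                          x_origin=465450.0, y_origin=1055820.0)
print(rows, cols)
```

For these two points the result agrees with the `row` and `col` values in the output above.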


In [13]:
# Extract NDVI and other spectral indices from masked S2 data
ndvi_2024 = s2_2024[4]  # NDVI band
ndvi_2025 = s2_2025[4]

print(f"NDVI 2024 extracted: shape={ndvi_2024.shape}, range=[{np.nanmin(ndvi_2024):.3f}, {np.nanmax(ndvi_2024):.3f}]")
print(f"NDVI 2025 extracted: shape={ndvi_2025.shape}, range=[{np.nanmin(ndvi_2025):.3f}, {np.nanmax(ndvi_2025):.3f}]")

NDVI 2024 extracted: shape=(10917, 12547), range=[-1.000, 1.000]
NDVI 2025 extracted: shape=(10917, 12547), range=[-1.000, 1.000]


In [14]:
# Plot ground truth points on NDVI images for both years
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

# Get point subsets
deforestation_points = df[df['label'] == 1]
no_deforestation_points = df[df['label'] == 0]

# NDVI 2024
im1 = axes[0].imshow(ndvi_2024, cmap='RdYlGn', vmin=-1, vmax=1)
axes[0].scatter(no_deforestation_points['col'], no_deforestation_points['row'],
                c='blue', s=3, alpha=0.7, label='No Deforestation', edgecolors='white', linewidths=0.3)
axes[0].scatter(deforestation_points['col'], deforestation_points['row'],
                c='red', s=3, alpha=0.7, label='Deforestation', edgecolors='white', linewidths=0.3)
axes[0].set_title('Ground Truth Points on NDVI 2024', fontsize=12, fontweight='bold')
axes[0].legend(loc='upper right', fontsize=10, markerscale=3)
axes[0].axis('off')
plt.colorbar(im1, ax=axes[0], fraction=0.046, label='NDVI')

# NDVI 2025
im2 = axes[1].imshow(ndvi_2025, cmap='RdYlGn', vmin=-1, vmax=1)
axes[1].scatter(no_deforestation_points['col'], no_deforestation_points['row'],
                c='blue', s=3, alpha=0.7, label='No Deforestation', edgecolors='white', linewidths=0.3)
axes[1].scatter(deforestation_points['col'], deforestation_points['row'],
                c='red', s=3, alpha=0.7, label='Deforestation', edgecolors='white', linewidths=0.3)
axes[1].set_title('Ground Truth Points on NDVI 2025', fontsize=12, fontweight='bold')
axes[1].legend(loc='upper right', fontsize=10, markerscale=3)
axes[1].axis('off')
plt.colorbar(im2, ax=axes[1], fraction=0.046, label='NDVI')

plt.tight_layout()
# Save figure
plt.savefig(FIGURES_DIR / '03_ground_truth_points.png', dpi=300, bbox_inches='tight')
print(f"✓ Saved: {FIGURES_DIR / '03_ground_truth_points.png'}")
plt.close()

✓ Saved: d:\ninhhaidang\25-26_HKI_DATN_21021411_DangNH\notebooks\..\figures\03_ground_truth_points.png


In [15]:
# Extract spectral indices for later use
# (Not creating visualization to save time)

# 2024 indices (order: B4, B8, B11, B12, NDVI, NBR, NDMI)
ndvi_2024 = s2_2024[4]  # NDVI is band 5
nbr_2024 = s2_2024[5]   # NBR is band 6
ndmi_2024 = s2_2024[6]  # NDMI is band 7

# 2025 indices
ndvi_2025 = s2_2025[4]
nbr_2025 = s2_2025[5]
ndmi_2025 = s2_2025[6]

print("Spectral indices extracted successfully")
print(f"  NDVI 2024 range: [{np.nanmin(ndvi_2024):.3f}, {np.nanmax(ndvi_2024):.3f}]")
print(f"  NDVI 2025 range: [{np.nanmin(ndvi_2025):.3f}, {np.nanmax(ndvi_2025):.3f}]")

Spectral indices extracted successfully
  NDVI 2024 range: [-1.000, 1.000]
  NDVI 2025 range: [-1.000, 1.000]


In [16]:
# Normalize images for patch extraction
print("Validating and normalizing images...")
# Sentinel-1: Use minmax normalization for both VV and VH (dB values -> [0, 1])
s1_2024_norm = normalize_image(s1_2024, method='minmax')  # Both VV and VH
s1_2025_norm = normalize_image(s1_2025, method='minmax')  # Both VV and VH

# Sentinel-2: Validate ranges (no scaling, just clip outliers)
# B4, B8, B11, B12: keep [0, 1], NDVI, NBR, NDMI: keep [-1, 1]
s2_2024_norm = validate_sentinel2_ranges(s2_2024)
s2_2025_norm = validate_sentinel2_ranges(s2_2025)

print("Done!")
print(f"S1 2024 range (VV, VH): [{np.nanmin(s1_2024_norm):.3f}, {np.nanmax(s1_2024_norm):.3f}]")
print(f"S2 2024 bands 0-3 range: [{np.nanmin(s2_2024_norm[0:4]):.3f}, {np.nanmax(s2_2024_norm[0:4]):.3f}]")
print(f"S2 2024 bands 4-6 range: [{np.nanmin(s2_2024_norm[4:7]):.3f}, {np.nanmax(s2_2024_norm[4:7]):.3f}]")

Validating and normalizing images...
Done!
S1 2024 range (VV, VH): [0.000, 1.000]
S2 2024 bands 0-3 range: [0.000, 0.983]
S2 2024 bands 4-6 range: [-1.000, 1.000]
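The two helpers used above come from `src.utils` and their code is not shown. As a rough sketch of what the comments describe, the functions below min-max scale a stack while ignoring NaN, and clip Sentinel-2 values to their expected ranges without rescaling. Scaling each band separately is an assumption (the project helper may scale globally), and the names `minmax_per_band` and `clip_s2_ranges` are hypothetical.

```python
import numpy as np

def minmax_per_band(stack):
    """Scale each band of a (bands, rows, cols) stack to [0, 1], ignoring NaN."""
    stack = np.asarray(stack, dtype=np.float32)
    lo = np.nanmin(stack, axis=(1, 2), keepdims=True)
    hi = np.nanmax(stack, axis=(1, 2), keepdims=True)
    # Avoid division by zero for constant bands
    span = np.where(hi > lo, hi - lo, 1.0)
    return (stack - lo) / span

def clip_s2_ranges(stack, n_reflectance=4):
    """Clip reflectance bands to [0, 1] and index bands to [-1, 1]; NaN is preserved."""
    out = np.array(stack, dtype=np.float32, copy=True)
    out[:n_reflectance] = np.clip(out[:n_reflectance], 0.0, 1.0)
    out[n_reflectance:] = np.clip(out[n_reflectance:], -1.0, 1.0)
    return out

# Toy single-band SAR stack in dB with one NaN pixel
s1_toy = np.array([[[-20.0, -10.0], [np.nan, 0.0]]], dtype=np.float32)
s1_toy_norm = minmax_per_band(s1_toy)

# Toy 7-band Sentinel-2 stack with out-of-range values
s2_toy = np.full((7, 1, 1), -1.5, dtype=np.float32)
s2_toy_clipped = clip_s2_ranges(s2_toy)
print(s1_toy_norm)
print(s2_toy_clipped.ravel())
```

Clipping rather than rescaling keeps the Sentinel-2 values physically interpretable, which matches the intent stated in the cell comments.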


In [17]:
# Extract and visualize patches for both classes and both years
num_samples = 3
patch_size = PATCH_SIZE

# Get sample points
deforestation_samples = df[df['label'] == 1].sample(n=num_samples, random_state=42)
no_deforestation_samples = df[df['label'] == 0].sample(n=num_samples, random_state=42)

fig, axes = plt.subplots(4, num_samples, figsize=(15, 18))

# Extract and plot deforestation patches - 2024 and 2025
for i, (idx, row) in enumerate(deforestation_samples.iterrows()):
    x, y = row['x'], row['y']
    
    patch = extract_patch_at_point(
        s1_2024_norm, s1_2025_norm, s2_2024_norm, s2_2025_norm,
        x, y, s2_2024_transform, patch_size
    )
    
    if patch is not None:
        # Show NDVI 2024 (band 4) - range [-1, 1]
        axes[0, i].imshow(patch[4], cmap='RdYlGn', vmin=-1, vmax=1)
        axes[0, i].set_title(f'Deforestation\nID: {row["id"]}\n2024', fontsize=9)
        axes[0, i].axis('off')
        
        # Show NDVI 2025 (band 13) - range [-1, 1]
        axes[1, i].imshow(patch[13], cmap='RdYlGn', vmin=-1, vmax=1)
        axes[1, i].set_title(f'Deforestation\nID: {row["id"]}\n2025', fontsize=9)
        axes[1, i].axis('off')

# Extract and plot no deforestation patches - 2024 and 2025
for i, (idx, row) in enumerate(no_deforestation_samples.iterrows()):
    x, y = row['x'], row['y']
    
    patch = extract_patch_at_point(
        s1_2024_norm, s1_2025_norm, s2_2024_norm, s2_2025_norm,
        x, y, s2_2024_transform, patch_size
    )
    
    if patch is not None:
        # Show NDVI 2024 (band 4) - range [-1, 1]
        axes[2, i].imshow(patch[4], cmap='RdYlGn', vmin=-1, vmax=1)
        axes[2, i].set_title(f'No Deforestation\nID: {row["id"]}\n2024', fontsize=9)
        axes[2, i].axis('off')
        
        # Show NDVI 2025 (band 13) - range [-1, 1]
        axes[3, i].imshow(patch[13], cmap='RdYlGn', vmin=-1, vmax=1)
        axes[3, i].set_title(f'No Deforestation\nID: {row["id"]}\n2025', fontsize=9)
        axes[3, i].axis('off')

# Add row labels
fig.text(0.02, 0.87, 'Deforestation\n2024', fontsize=11, fontweight='bold', 
         va='center', ha='center', bbox=dict(boxstyle='round', facecolor='red', alpha=0.3))
fig.text(0.02, 0.63, 'Deforestation\n2025', fontsize=11, fontweight='bold',
         va='center', ha='center', bbox=dict(boxstyle='round', facecolor='red', alpha=0.3))
fig.text(0.02, 0.37, 'No Deforestation\n2024', fontsize=11, fontweight='bold',
         va='center', ha='center', bbox=dict(boxstyle='round', facecolor='green', alpha=0.3))
fig.text(0.02, 0.13, 'No Deforestation\n2025', fontsize=11, fontweight='bold',
         va='center', ha='center', bbox=dict(boxstyle='round', facecolor='green', alpha=0.3))

fig.suptitle(f'Sample {patch_size}x{patch_size} Patches (NDVI) - 2024 vs 2025 Comparison', 
             fontsize=14, fontweight='bold')
plt.tight_layout(rect=[0.05, 0, 1, 0.98])
# Save figure
plt.savefig(FIGURES_DIR / '04_sample_patches.png', dpi=300, bbox_inches='tight')
print(f"✓ Saved: {FIGURES_DIR / '04_sample_patches.png'}")
plt.close()

✓ Saved: d:\ninhhaidang\25-26_HKI_DATN_21021411_DangNH\notebooks\..\figures\04_sample_patches.png
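The cell above checks `if patch is not None`, which suggests that `extract_patch_at_point` returns `None` when a patch cannot be extracted. Its code is not shown, so the following is only a sketch of the windowing step under that assumption: cut a square window centred on a pixel and return `None` if the window would leave the raster. The name `extract_window` is hypothetical.

```python
import numpy as np

def extract_window(stack, row, col, size):
    """Return a (bands, size, size) window centred on (row, col), or None if it leaves the raster."""
    half = size // 2
    r0, c0 = row - half, col - half
    r1, c1 = r0 + size, c0 + size
    _, n_rows, n_cols = stack.shape
    # Reject windows that would extend past any raster edge
    if r0 < 0 or c0 < 0 or r1 > n_rows or c1 > n_cols:
        return None
    return stack[:, r0:r1, c0:c1]

# Toy 18-channel raster
toy_stack = np.zeros((18, 100, 100), dtype=np.float32)
centre_patch = extract_window(toy_stack, row=50, col=50, size=64)
edge_patch = extract_window(toy_stack, row=5, col=50, size=64)
print(centre_patch.shape, edge_patch)
```

A real implementation may also want to reject windows that contain too many NaN pixels, since the masked rasters are NaN outside the forest boundary.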


In [18]:
# Visualize multi-temporal comparison for a single patch
sample_point = df[df['label'] == 1].sample(n=1, random_state=42).iloc[0]
x, y = sample_point['x'], sample_point['y']

patch = extract_patch_at_point(
    s1_2024_norm, s1_2025_norm, s2_2024_norm, s2_2025_norm,
    x, y, s2_2024_transform, patch_size
)

if patch is not None:
    fig, axes = plt.subplots(2, 5, figsize=(20, 8))
    
    # 2024 bands (0-8: S2 bands [0-6] + S1 VV [7] + S1 VH [8])
    # B4 (Red) - spectral band [0, 1]
    axes[0, 0].imshow(patch[0], cmap='Reds', vmin=0, vmax=1)
    axes[0, 0].set_title('B4 (Red) 2024')
    axes[0, 0].axis('off')
    
    # B8 (NIR) - spectral band [0, 1]
    axes[0, 1].imshow(patch[1], cmap='YlOrRd', vmin=0, vmax=1)
    axes[0, 1].set_title('B8 (NIR) 2024')
    axes[0, 1].axis('off')
    
    # NDVI - index [-1, 1]
    axes[0, 2].imshow(patch[4], cmap='RdYlGn', vmin=-1, vmax=1)
    axes[0, 2].set_title('NDVI 2024')
    axes[0, 2].axis('off')
    
    # S1 VV - normalized [0, 1]
    axes[0, 3].imshow(patch[7], cmap='gray', vmin=0, vmax=1)
    axes[0, 3].set_title('S1 VV 2024')
    axes[0, 3].axis('off')
    
    # S1 VH - normalized [0, 1]
    axes[0, 4].imshow(patch[8], cmap='gray', vmin=0, vmax=1)
    axes[0, 4].set_title('S1 VH 2024')
    axes[0, 4].axis('off')
    
    # 2025 bands (9-17: S2 bands [9-15] + S1 VV [16] + S1 VH [17])
    # B4 (Red) - spectral band [0, 1]
    axes[1, 0].imshow(patch[9], cmap='Reds', vmin=0, vmax=1)
    axes[1, 0].set_title('B4 (Red) 2025')
    axes[1, 0].axis('off')
    
    # B8 (NIR) - spectral band [0, 1]
    axes[1, 1].imshow(patch[10], cmap='YlOrRd', vmin=0, vmax=1)
    axes[1, 1].set_title('B8 (NIR) 2025')
    axes[1, 1].axis('off')
    
    # NDVI - index [-1, 1]
    axes[1, 2].imshow(patch[13], cmap='RdYlGn', vmin=-1, vmax=1)
    axes[1, 2].set_title('NDVI 2025')
    axes[1, 2].axis('off')
    
    # S1 VV - normalized [0, 1]
    axes[1, 3].imshow(patch[16], cmap='gray', vmin=0, vmax=1)
    axes[1, 3].set_title('S1 VV 2025')
    axes[1, 3].axis('off')
    
    # S1 VH - normalized [0, 1]
    axes[1, 4].imshow(patch[17], cmap='gray', vmin=0, vmax=1)
    axes[1, 4].set_title('S1 VH 2025')
    axes[1, 4].axis('off')
    
    fig.suptitle(f'Multi-temporal Patch (18 channels) - ID: {sample_point["id"]} (Label: {sample_point["label"]})',
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    # Save figure
    plt.savefig(FIGURES_DIR / '05_multitemporal_patch.png', dpi=300, bbox_inches='tight')
    print(f"✓ Saved: {FIGURES_DIR / '05_multitemporal_patch.png'}")
    plt.close()

✓ Saved: d:\ninhhaidang\25-26_HKI_DATN_21021411_DangNH\notebooks\..\figures\05_multitemporal_patch.png


## 6. Summary

From this data exploration we can conclude:

1. **Ground Truth**: 1285 points, roughly balanced between the two classes (650 no deforestation, 635 deforestation)
2. **Sentinel-2**: 7 bands, made up of 4 spectral bands (B4, B8, B11, B12) and 3 indices (NDVI, NBR, NDMI)
3. **Sentinel-1**: 2 bands, the VV and VH polarizations from SAR imagery
4. **Patch size**: 64x64 or 128x128 pixels
5. **Total input channels**: 18 (2 time periods × (7 S2 + 2 S1))

**Channel structure per patch**:
- S2 2024: 7 bands (B4, B8, B11, B12, NDVI, NBR, NDMI), channels 0-6
- S1 2024: 2 bands (VV, VH), channels 7-8
- S2 2025: 7 bands (B4, B8, B11, B12, NDVI, NBR, NDMI), channels 9-15
- S1 2025: 2 bands (VV, VH), channels 16-17
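The plotting cells index the patch with hard-coded numbers such as `patch[4]` and `patch[13]`. A name-to-index lookup built from the channel order listed above makes those indices easier to verify. This is a sketch: the names `S2_NAMES`, `S1_NAMES`, and `channel_names` are hypothetical and only mirror the order described in this notebook.

```python
S2_NAMES = ['B4', 'B8', 'B11', 'B12', 'NDVI', 'NBR', 'NDMI']
S1_NAMES = ['VV', 'VH']

def channel_names(years=(2024, 2025)):
    """List channel names in patch order: per year, the S2 bands followed by the S1 bands."""
    names = []
    for year in years:
        names += [f'{band}_{year}' for band in S2_NAMES]
        names += [f'{band}_{year}' for band in S1_NAMES]
    return names

names = channel_names()
index_of = {name: i for i, name in enumerate(names)}

print(len(names))
print(index_of['NDVI_2024'], index_of['NDVI_2025'])
print(index_of['VV_2024'], index_of['VH_2025'])
```

With such a lookup, `patch[index_of['NDVI_2025']]` would replace `patch[13]` and keep working if the channel order ever changes.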

**Next steps**:
- Extract patches for all ground truth points
- Split the data into train/val/test sets
- Train deep learning models
- Evaluate and compare the results
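For the split step, one possible approach is a stratified split that preserves the class ratio in each subset. The sketch below uses only NumPy. The 70/15/15 fractions, the seed, and the function name `stratified_split` are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np

def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Return sorted index arrays (train, val, test) with per-class proportions preserved."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut them into three parts
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.append(idx[:n_train])
        val.append(idx[n_train:n_train + n_val])
        test.append(idx[n_train + n_val:])
    return tuple(np.sort(np.concatenate(part)) for part in (train, val, test))

# Same class counts as the ground truth: 650 zeros and 635 ones
labels = np.array([0] * 650 + [1] * 635)
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because neighbouring ground truth points can overlap once patches are cut around them, a spatially blocked split may be worth considering instead of a purely random one.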