# Glacier Segmentation - HKH Pretraining

This notebook runs Phase 0: HKH dataset pretraining.

**Expected runtime:** 2-3 hours on dual T4

**Expected MCC:** 0.75-0.78

**Output:** `weights/hkh_pretrain_v1/best_checkpoint.pth` (~44 MB)

## 1. Setup Environment

In [None]:
# Clone repository
!git clone https://github.com/observer04/gchack2_v2.git
%cd gchack2_v2

In [None]:
# Install dependencies
!pip install -q segmentation-models-pytorch albumentations timm scikit-learn rasterio geopandas ttach scikit-image opencv-python PyYAML tqdm

In [None]:
# Verify GPU setup
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
    print(f"  Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.1f} GB")

## 2. Download Competition Data

In [None]:
# Download competition data
!wget https://www.glacier-hack.in/train2.zip
!unzip train2.zip -d /kaggle/working/

In [None]:
# Organize competition data
import os
import shutil
from pathlib import Path

comp_data_dir = Path('/kaggle/working/gchack2_v2/data/competition')
comp_data_dir.mkdir(parents=True, exist_ok=True)

# Move extracted data to proper location
train_source = Path('/kaggle/working/Train')
if train_source.exists():
    for band_dir in train_source.iterdir():
        dest = comp_data_dir / band_dir.name
        if dest.exists():
            shutil.rmtree(dest)
        shutil.move(str(band_dir), str(dest))
    print("✓ Competition data organized")
else:
    print(f"⚠️ {train_source} not found; check that train2.zip extracted as expected")

# Verify structure
print("\nData structure:")
for item in sorted(comp_data_dir.iterdir()):
    if item.is_dir():
        count = len(list(item.glob('*.tif')))
        print(f"  {item.name}: {count} files")
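
As a quick sanity check on the layout printed above, the sketch below compares file counts across band folders. It assumes each band directory should hold one `.tif` per tile, which this notebook does not state explicitly. `band_file_counts` and `counts_are_consistent` are hypothetical helper names, not part of the repository.

```python
from pathlib import Path


def band_file_counts(root):
    """Map each sub-directory name under `root` to its number of .tif files."""
    root = Path(root)
    return {
        d.name: len(list(d.glob("*.tif")))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


def counts_are_consistent(counts):
    """True when at least one band exists and all bands have the same count.

    Assumption: every band folder should contain the same number of tiles.
    """
    return len(counts) > 0 and len(set(counts.values())) == 1
```

For example, `counts_are_consistent(band_file_counts(comp_data_dir))` returning `False` would suggest an incomplete extraction or a band folder that was moved only partially.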

## 3. Download HKH Dataset

**Note:** The HKH dataset is ~29.4 GB compressed.

This may take 10-15 minutes on Kaggle.
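
Because the archive is large, it may be worth checking free disk space before starting the download. The sketch below uses `shutil.disk_usage` from the standard library. The helper name `has_enough_space` and the `extract_factor` multiplier are assumptions for illustration, since the extracted size is not stated here.

```python
import shutil


def has_enough_space(path, archive_gb, extract_factor=2.0):
    """Return True if `path` has room for the archive plus its extracted data.

    extract_factor is an assumed multiplier covering the tarball and its
    extracted contents; adjust it once the real extracted size is known.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= archive_gb * extract_factor
```

On Kaggle this could be called as `has_enough_space('/kaggle/working', 29.4)` before running the download cell.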

In [None]:
# Download HKH dataset (29.4 GB - includes patches, polygons, images)
!mkdir -p /kaggle/working/gchack2_v2/data/hkh/raw
%cd /kaggle/working/gchack2_v2/data/hkh/raw

# Download from Azure (fastest for Kaggle)
# Alternative mirrors: GCP or AWS (see KAGGLE_GUIDE.md)
!wget -O hkh_patches.tar.gz https://lilawildlife.blob.core.windows.net/lila-wildlife/icimod-glacier-mapping/hkh_patches.tar.gz

# Extract dataset
!tar -xzf hkh_patches.tar.gz
!ls -lh

%cd /kaggle/working/gchack2_v2
print("✓ HKH dataset downloaded (29.4 GB)")

## 4. Preprocess HKH Dataset

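Prepare the raw HKH data (512×512 tiles) for training. No manual conversion is needed: the cell below only inspects the download and creates the output directories, while the `HKHDataset` class handles band selection and mask conversion at training time.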

In [None]:
# Preprocess HKH Dataset
# HKH has 15 channels but competition has 5 - we need to extract matching bands

import numpy as np
from pathlib import Path
from tqdm import tqdm

# The extracted HKH data structure varies by download
# Check what we actually got
hkh_raw = Path('/kaggle/working/gchack2_v2/data/hkh/raw')
print("Contents of HKH download:")
!ls -lh /kaggle/working/gchack2_v2/data/hkh/raw/

# Create processed directories
proc_dir = Path('/kaggle/working/gchack2_v2/data/hkh/processed')
(proc_dir / 'images').mkdir(parents=True, exist_ok=True)
(proc_dir / 'masks').mkdir(parents=True, exist_ok=True)

print("\n⚠️ PREPROCESSING NOTE:")
print("The HKH dataset uses numpy format with 15 channels.")
print("Our implementation (HKHDataset class) will:")
print("  - Automatically select 5 matching bands: [B1, B2, B3, B5, B6_high]")
print("  - Convert 2-channel masks to 4-class format")
print("  - Handle this during training - NO manual preprocessing needed!")
print("\nReady to train with HKHDataset class.")
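
To illustrate what the band selection and mask conversion described above might look like, here is a minimal NumPy sketch. It is not the repository's `HKHDataset` code. The channel indices in `BAND_INDICES`, the channel-last array layout, and the meaning of the two mask channels (clean ice, then debris) are all assumptions.

```python
import numpy as np

# Hypothetical positions of [B1, B2, B3, B5, B6_high] in the 15-channel array;
# check the HKH band order before relying on these indices.
BAND_INDICES = [0, 1, 2, 4, 6]


def select_bands(patch, band_indices=BAND_INDICES):
    """Pick bands from an (H, W, 15) patch and return a (5, H, W) float32 array."""
    selected = patch[..., band_indices]             # (H, W, 5)
    return np.moveaxis(selected, -1, 0).astype(np.float32)


def masks_to_classes(mask, threshold=0.5):
    """Convert an (H, W, 2) mask into an (H, W) map of class indices.

    Assumed mapping: 0 = background, 1 = glacier (channel 0), 2 = debris
    (channel 1). Debris wins where both channels are set.
    """
    classes = np.zeros(mask.shape[:2], dtype=np.int64)
    classes[mask[..., 0] > threshold] = 1
    classes[mask[..., 1] > threshold] = 2
    return classes
```

Under these assumptions a two-channel mask can only yield classes 0 to 2, so the fourth class (Lake) would never appear in HKH labels.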

## 5. Run HKH Pretraining

Train Boundary-Aware U-Net on HKH dataset.

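**Expected time:** 2-3 hours. At 60 epochs × ~90 sec/epoch, the training passes alone account for about 1.5 hours; the rest of the estimate presumably covers validation and checkpointing.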

In [None]:
# Run training
!python src/training/train.py \
    --config configs/hkh_pretrain_kaggle.yaml \
    --experiment_name hkh_pretrain_v1

## 6. Evaluate Results

In [None]:
# Check final metrics
import torch

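# map_location lets the checkpoint load in a CPU-only session.
# weights_only=False is needed on PyTorch >= 2.6 if the checkpoint stores
# non-tensor objects; only use it for checkpoints you produced yourself.
checkpoint = torch.load('weights/hkh_pretrain_v1/best_checkpoint.pth',
                        map_location='cpu', weights_only=False)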

print("\n" + "="*80)
print("HKH Pretraining Results")
print("="*80)

metrics = checkpoint['metrics']
print(f"\nBest Epoch: {checkpoint['epoch']}")
print(f"Validation MCC: {metrics['mcc']:.4f}")
print(f"Mean IoU: {metrics['mean_iou']:.4f}")
print(f"Macro F1: {metrics['macro_f1']:.4f}")

print("\nPer-class IoU:")
class_names = ['Background', 'Glacier', 'Debris', 'Lake']
for i, name in enumerate(class_names):
    iou = metrics.get(f'class_{i}', 0)
    print(f"  {name:12}: {iou:.4f}")

print("="*80)

# Check file size
import os
size_mb = os.path.getsize('weights/hkh_pretrain_v1/best_checkpoint.pth') / 1e6
print(f"\nCheckpoint size: {size_mb:.1f} MB")

# Success criteria
print("\n✓ Success Criteria:")
print(f"  MCC ≥ 0.75: {'✓ PASS' if metrics['mcc'] >= 0.75 else '✗ FAIL'}")
print(f"  Per-class IoU > 0.60: {'✓ PASS' if all(metrics.get(f'class_{i}', 0) > 0.60 for i in range(4)) else '✗ FAIL'}")
print(f"  Size < 50MB: {'✓ PASS' if size_mb < 50 else '✗ FAIL'}")
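
For reference, the reported MCC and per-class IoU can both be derived from a confusion matrix. The sketch below uses the standard multiclass definitions in plain NumPy. It is not necessarily how `src/training/train.py` computes them, and the function names are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    idx = num_classes * y_true + y_pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    cm = cm.astype(np.float64)
    c = np.trace(cm)        # correctly classified samples
    s = cm.sum()            # total samples
    t = cm.sum(axis=1)      # true count per class
    p = cm.sum(axis=0)      # predicted count per class
    denom = np.sqrt((s ** 2 - (p ** 2).sum()) * (s ** 2 - (t ** 2).sum()))
    return float((c * s - (t * p).sum()) / denom) if denom > 0 else 0.0


def per_class_iou(cm):
    """IoU per class: TP / (TP + FP + FN); NaN for classes that never occur."""
    tp = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    return np.where(union > 0, tp / np.maximum(union, 1), np.nan)
```

Unlike pixel accuracy, MCC accounts for every cell of the confusion matrix, which makes it less forgiving when a model predicts only the dominant background class.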

## 7. Download Checkpoint

In [None]:
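# Make the checkpoint available for download.
# google.colab exists only on Colab; on Kaggle, copy the file to the top of
# /kaggle/working so it is easy to find in the notebook's Output panel.
import shutil

ckpt_path = 'weights/hkh_pretrain_v1/best_checkpoint.pth'
try:
    from google.colab import files
    files.download(ckpt_path)
except ImportError:
    shutil.copy(ckpt_path, '/kaggle/working/')
    print("\n✓ Checkpoint copied to /kaggle/working/")

print("Upload this file to your GitHub repo or use it for competition fine-tuning.")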