# GeoLifeCLEF 2025 Model Notebook

This notebook walks through an end-to-end approach to the GeoLifeCLEF 2025 competition: data acquisition, preprocessing, model training, and submission generation. Each section has a descriptive header, and the code cells are commented.

## 1. Setup and Data Acquisition

The necessary libraries (e.g., PyTorch, pandas, rasterio) are installed, and the GeoLifeCLEF 2025 dataset is downloaded from Kaggle using the Kaggle API. Authentication with Kaggle is required; ensure your API token is configured correctly. The dataset includes training and testing metadata (GLC25_PA_metadata_train.csv, GLC25_PA_metadata_test.csv) and environmental rasters (Landsat, Bioclim, Sentinel).

In [1]:
!pip install rasterio tqdm numpy pandas albumentations kaggle kagglehub scikit-learn scikit-image matplotlib seaborn

Collecting rasterio
  Downloading rasterio-1.4.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.1 kB)
Collecting pandas
  Downloading pandas-2.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting albumentations
  Downloading albumentations-2.0.8-py3-none-any.whl.metadata (43 kB)
Collecting kaggle
  Downloading kaggle-1.7.4.5-py3-none-any.whl.metadata (16 kB)
Collecting kagglehub
  Downloading kagglehub-0.3.12-py3-none-any.whl.metadata (38 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (17 kB)
Collecting scikit-image
  Downloading scikit_image-0.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB)
Collecting matplotlib
  Downloading matplotlib-3.10.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)

### 1.1 Kaggle Authentication

Authenticates with Kaggle to allow programmatic access to datasets and competitions. Ensure your Kaggle API token is configured correctly.

In [2]:
import kagglehub
kagglehub.login()


### 1.2 Download Competition Data

Downloads the GeoLifeCLEF 2025 competition dataset. This may take some time due to the size of the dataset, as it includes various environmental rasters and observation data.

In [3]:
geolifeclef_2025_path = kagglehub.competition_download('geolifeclef-2025')

print('Data source import complete.')

Downloading from https://www.kaggle.com/api/v1/competitions/data/download-all/geolifeclef-2025...


100%|██████████| 3.73G/3.73G [01:47<00:00, 37.4MB/s]

Extracting files...





Data source import complete.


## 2. Data Loading and Initial Exploration

This section loads the downloaded data and performs an initial exploration to understand its structure and content. We load the presence-absence metadata files `GLC25_PA_metadata_train.csv` and `GLC25_PA_metadata_test.csv`; the training file also carries the `speciesId` column used as the training target.

In [None]:
import os
import timm
import torch
import rasterio
import numpy as np
import pandas as pd
import seaborn as sns
import torch.nn as nn
import albumentations as A
import torch.nn.functional as F
import imageio.v3 as imageio
import matplotlib.pyplot as plt
import torchvision.models as models
import matplotlib.image as mpimg
import torchvision.transforms as transforms

from PIL import Image
from tqdm.notebook import tqdm
from torchmetrics import F1Score
from torch.utils.data import Dataset, DataLoader
from torch.optim.lr_scheduler import CosineAnnealingLR
from albumentations.pytorch import ToTensorV2

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import precision_recall_fscore_support

In [None]:
train_metadata = pd.read_csv(f"{geolifeclef_2025_path}/GLC25_PA_metadata_train.csv")
test_metadata = pd.read_csv(f"{geolifeclef_2025_path}/GLC25_PA_metadata_test.csv")

In [11]:
unique, counts = np.unique(train_metadata.speciesId.values, return_counts=True)
print(len(unique))

new_unique, new_counts = [], []
for u, c in zip(unique, counts):
    if c > 5:
        new_unique.append(u)
        new_counts.append(c)
unique = np.array(new_unique)
counts = np.array(new_counts)
print(len(unique))


5016
3425


In [12]:
num_classes = len(unique)
num_surveys = len(np.unique(train_metadata.surveyId.values))

species_dict = {}
for i in range(num_classes):
    species_dict[unique[i]] = i

## 3. Dataset, Preprocessing, and Data Loading

The training and testing metadata are loaded from CSV files. The training metadata contains species IDs and survey IDs. Initial exploration reveals 5016 unique species, but only 3425 have more than 5 observations. The model focuses on these 3425 species, and a dictionary maps their IDs to indices for use in training.
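
The filtering and index mapping described above can be sketched on toy data. This is a minimal illustration, not the notebook's exact code: the species IDs, the `min_count` value, and the `multi_hot` helper are hypothetical, and a boolean mask stands in for the explicit loop.

```python
import numpy as np

# Toy stand-in for train_metadata.speciesId (hypothetical values)
species_ids = np.array([10, 10, 10, 42, 42, 7, 7, 7, 7])
min_count = 2  # the notebook keeps species with more than 5 occurrences

unique, counts = np.unique(species_ids, return_counts=True)
kept = unique[counts > min_count]  # boolean mask instead of a Python loop
species_to_index = {int(s): i for i, s in enumerate(kept)}

def multi_hot(survey_species, mapping):
    """Return a 0/1 vector with a 1 for every retained species in the survey."""
    label = np.zeros(len(mapping), dtype=np.float32)
    for s in survey_species:
        if s in mapping:  # species filtered out earlier are skipped
            label[mapping[s]] = 1.0
    return label

label = multi_hot([10, 42, 7], species_to_index)
print(species_to_index, label)
```

Species that fall below the count threshold (42 in this toy case) have no index, so they never appear in the label vector.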

In [13]:
def construct_patch_path(data_path, survey_id):
    """Construct the patch file path from survey_id as '<data_path>/CD/AB/XXXXABCD.tiff'."""
    path = data_path
    for d in (str(survey_id)[-2:], str(survey_id)[-4:-2]):
        path = os.path.join(path, d)

    path = os.path.join(path, f"{survey_id}.tiff")

    return path

def quantile_normalize(band, low=2, high=98):
    sorted_band = np.sort(band.flatten())
    quantiles = np.percentile(sorted_band, np.linspace(low, high, len(sorted_band)))
    normalized_band = np.interp(band.flatten(), sorted_band, quantiles).reshape(band.shape)
    
    min_val, max_val = np.min(normalized_band), np.max(normalized_band)
    
    # Prevent division by zero if min_val == max_val
    if max_val == min_val:
        return np.zeros_like(normalized_band, dtype=np.float32)  # Return an array of zeros

    # Perform normalization (min-max scaling)
    return ((normalized_band - min_val) / (max_val - min_val)).astype(np.float32)

class TrainDataset(Dataset):
    def __init__(self, bioclim_data_dir, landsat_data_dir, sentinel_data_dir, metadata, transform=None):
        self.transform = transform
        self.sentinel_transform = A.Compose([
            A.Rotate(limit=(-10, 10)),
            A.RandomBrightnessContrast(brightness_limit=(-0.05, 0.05), contrast_limit=(-0.05, 0.05), p=0.3),
            A.Normalize(mean=(0.5, 0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5, 0.5), max_pixel_value=1),
            ToTensorV2(),
        ])
      
        self.bioclim_data_dir = bioclim_data_dir
        self.landsat_data_dir = landsat_data_dir
        self.sentinel_data_dir = sentinel_data_dir
        self.metadata = metadata
        self.metadata = self.metadata.dropna(subset="speciesId").reset_index(drop=True)
        self.metadata['speciesId'] = self.metadata['speciesId'].astype(int)
        self.label_dict = self.metadata.groupby('surveyId')['speciesId'].apply(list).to_dict()
        
        self.metadata = self.metadata.drop_duplicates(subset="surveyId").reset_index(drop=True)

    def __len__(self):
        return len(self.metadata)

    def __getitem__(self, idx):
        
        
        survey_id = self.metadata.surveyId[idx]
        
        landsat_sample = torch.nan_to_num(torch.load(os.path.join(self.landsat_data_dir, f"GLC25-PA-train-landsat-time-series_{survey_id}_cube.pt")))
        bioclim_sample = torch.nan_to_num(torch.load(os.path.join(self.bioclim_data_dir, f"GLC25-PA-train-bioclimatic_monthly_{survey_id}_cube.pt")))
        
        
        tiff_path = construct_patch_path(self.sentinel_data_dir, survey_id)
        with rasterio.open(tiff_path) as dataset:
            sentinel_sample = dataset.read(out_dtype=np.float32)  # Read all bands
            sentinel_sample = np.array([quantile_normalize(band) for band in sentinel_sample])  # Apply quantile normalization
        sentinel_sample = np.transpose(sentinel_sample, (1, 2, 0)) 

        species_ids = self.label_dict.get(survey_id, [])  # Get list of species IDs for the survey ID
        label = torch.zeros(num_classes)  # Initialize label tensor
        for species_id in species_ids:
            label_id = species_id
            if label_id in species_dict.keys():
                label[species_dict[label_id]] = 1 # Set the corresponding class index to 1 for each species
        
        if isinstance(landsat_sample, torch.Tensor):
            landsat_sample = landsat_sample.permute(1, 2, 0)  # Change tensor shape from (C, H, W) to (H, W, C)
            landsat_sample = landsat_sample.numpy()  # Convert tensor to numpy array
            
        if isinstance(bioclim_sample, torch.Tensor):
            bioclim_sample = bioclim_sample.permute(1, 2, 0)  # Change tensor shape from (C, H, W) to (H, W, C)
            bioclim_sample = bioclim_sample.numpy()  # Convert tensor to numpy array   
        
        if self.transform:
            landsat_sample = self.transform(landsat_sample)
            bioclim_sample = self.transform(bioclim_sample)
            sentinel_sample = self.sentinel_transform(image=sentinel_sample)['image']
        
        
        return landsat_sample, bioclim_sample, sentinel_sample, label, survey_id
    
class TestDataset(TrainDataset):
    def __init__(self, bioclim_data_dir, landsat_data_dir, sentinel_data_dir, metadata, transform=None):
        self.transform = transform
        self.sentinel_transform = A.Compose([
            A.Normalize(mean=(0.5, 0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5, 0.5), max_pixel_value=1),
            ToTensorV2(),
        ])
      
        self.bioclim_data_dir = bioclim_data_dir
        self.landsat_data_dir = landsat_data_dir
        self.sentinel_data_dir = sentinel_data_dir
        self.metadata = metadata
        
    def __getitem__(self, idx):
        
        survey_id = self.metadata.surveyId[idx]
        landsat_sample = torch.nan_to_num(torch.load(os.path.join(self.landsat_data_dir, f"GLC25-PA-test-landsat_time_series_{survey_id}_cube.pt")))
        bioclim_sample = torch.nan_to_num(torch.load(os.path.join(self.bioclim_data_dir, f"GLC25-PA-test-bioclimatic_monthly_{survey_id}_cube.pt")))
        
        
        tiff_path = construct_patch_path(self.sentinel_data_dir, survey_id)
        with rasterio.open(tiff_path) as dataset:
            sentinel_sample = dataset.read(out_dtype=np.float32)  # Read all bands
            sentinel_sample = np.array([quantile_normalize(band) for band in sentinel_sample])  # Apply quantile normalization
        sentinel_sample = np.transpose(sentinel_sample, (1, 2, 0)) 
            
        if isinstance(landsat_sample, torch.Tensor):
            landsat_sample = landsat_sample.permute(1, 2, 0)  # Change tensor shape from (C, H, W) to (H, W, C)
            landsat_sample = landsat_sample.numpy()  # Convert tensor to numpy array
        if isinstance(bioclim_sample, torch.Tensor):
            bioclim_sample = bioclim_sample.permute(1, 2, 0)  # Change tensor shape from (C, H, W) to (H, W, C)
            bioclim_sample = bioclim_sample.numpy()  # Convert tensor to numpy array   
        
        if self.transform:
            landsat_sample = self.transform(landsat_sample)
            bioclim_sample = self.transform(bioclim_sample)
            sentinel_sample = self.sentinel_transform(image=sentinel_sample)['image']
        
        return landsat_sample, bioclim_sample, sentinel_sample, survey_id
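
To make the directory layout used by `construct_patch_path` concrete, here is a small standalone sketch. The `patch_path` name, the root folder, and the survey ID are made up for illustration; only the "last two digits, then the two digits before them" convention comes from the helper above.

```python
import os

def patch_path(root, survey_id, ext=".tiff"):
    """Build '<root>/CD/AB/<survey_id><ext>', where the ID ends in ...ABCD."""
    s = str(survey_id)
    return os.path.join(root, s[-2:], s[-4:-2], f"{s}{ext}")

# Hypothetical survey ID ending in 8575: folder '75', then sub-folder '85'
print(patch_path("PA-train", 3018575))
```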

In [None]:
# Dataset and DataLoader
batch_size = 256

transform = transforms.Compose([
    transforms.ToTensor(),
])

# Load Training metadata
train_landsat_data_path = f"{geolifeclef_2025_path}/SateliteTimeSeries-Landsat/cubes/PA-train/"
train_bioclim_data_path = f"{geolifeclef_2025_path}/BioclimTimeSeries/cubes/PA-train/"
train_sentinel_data_path=f"{geolifeclef_2025_path}/SatelitePatches/PA-train/"
train_metadata_path = f"{geolifeclef_2025_path}/GLC25_PA_metadata_train.csv"
train_metadata = pd.read_csv(train_metadata_path)
dataset_alpine = TrainDataset(train_bioclim_data_path, train_landsat_data_path, train_sentinel_data_path, train_metadata, transform=transform)
train_loader = DataLoader(dataset_alpine, batch_size=batch_size, shuffle=True, num_workers=4)

# Load Test metadata
test_landsat_data_path = f"{geolifeclef_2025_path}/SateliteTimeSeries-Landsat/cubes/PA-test/"
test_bioclim_data_path = f"{geolifeclef_2025_path}/BioclimTimeSeries/cubes/PA-test/"
test_sentinel_data_path = f"{geolifeclef_2025_path}/SatelitePatches/PA-test/"
test_metadata_path = f"{geolifeclef_2025_path}/GLC25_PA_metadata_test.csv"
test_metadata = pd.read_csv(test_metadata_path)
test_dataset = TestDataset(test_bioclim_data_path, test_landsat_data_path, test_sentinel_data_path, test_metadata, transform=transform)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=4)

## 4. Model Architecture Setup

The model is a multimodal ensemble that combines three Swin Transformers:

- Landsat time series: processed with a Swin Transformer (6 input channels).
- Bioclimatic monthly data: processed with a Swin Transformer (4 input channels).
- Sentinel patches: processed with a pre-trained Swin Transformer (4 input channels).

Features from each transformer are projected to 1000 dimensions, concatenated, and passed through a classification head to predict the 3425 species.
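
The fusion step can be illustrated with a shape-only sketch in NumPy. This is not the model itself: random matrices stand in for the learned projection layers, random vectors stand in for the 768-dimensional backbone outputs, and the `project` helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, backbone_dim, proj_dim = 2, 768, 1000

def project(features, out_dim):
    """Random linear projection standing in for a learned linear layer."""
    w = rng.standard_normal((features.shape[1], out_dim)) * 0.01
    return features @ w

# Stand-ins for the pooled outputs of the three Swin backbones
landsat, bioclim, sentinel = (
    rng.standard_normal((batch, backbone_dim)) for _ in range(3)
)

# Late fusion: project each modality, then concatenate along the feature axis
fused = np.concatenate(
    [project(f, proj_dim) for f in (landsat, bioclim, sentinel)], axis=1
)
print(fused.shape)  # (2, 3000)
```

The concatenated width of 3000 is what the first layer of the classification head receives.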

In [None]:
import torch.nn.functional as F

class MultimodalEnsemble(nn.Module):
    def __init__(self, num_classes):
        super(MultimodalEnsemble, self).__init__()
        
        self.landsat_norm = nn.LayerNorm([6,4,21]) # Normalize Landsat data
        self.landsat_model = models.swin_t(weights=None) # Initialize Swin Transformer model for Landsat data
        self.landsat_model.features[0][0] = nn.Conv2d(6, 96, kernel_size=(4, 4), stride=(4, 4)) # Change input channels to 6
        self.landsat_model.head = nn.Identity() # Remove the classification head
        
        self.bioclim_norm = nn.LayerNorm([4,19,12])# Normalize Bioclim data
        self.bioclim_model = models.swin_t(weights=None)# Initialize Swin Transformer model for Bioclim data
        self.bioclim_model.features[0][0] = nn.Conv2d(4, 96, kernel_size=(4, 4), stride=(4, 4))# Change input channels to 4
        self.bioclim_model.head = nn.Identity()# Remove the classification head
        
        self.sentinel_model = models.swin_t(weights="IMAGENET1K_V1")# Initialize Swin Transformer model for Sentinel data
        self.sentinel_model.features[0][0] = nn.Conv2d(4, 96, kernel_size=(4, 4), stride=(4, 4))# Change input channels to 4
        self.sentinel_model.head = nn.Identity() # Remove the classification head
        
        self.proj1 = nn.Sequential(# Project Landsat features
            nn.Linear(768, 1000),
            nn.BatchNorm1d(1000),
            nn.GELU(),
            nn.Dropout(0.2)
        )
        self.proj2 = nn.Sequential(# Project Bioclim features
            nn.Linear(768, 1000),
            nn.BatchNorm1d(1000),
            nn.GELU(),
            nn.Dropout(0.2)
        )
        self.proj3 = nn.Sequential(# Project Sentinel features
            nn.Linear(768, 1000),
            nn.BatchNorm1d(1000),
            nn.GELU(),
            nn.Dropout(0.2)
        )
        
        self.label = nn.Sequential( # Final classification head
            nn.Linear(3000, 4096),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(4096, num_classes),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(num_classes, num_classes),
        )
        
    def forward(self, x, y, z): # x: Landsat data, y: Bioclim data, z: Sentinel data
        
        x = self.landsat_norm(x)
        x = self.landsat_model(x)
        x = self.proj1(x)
        
        y = self.bioclim_norm(y)
        y = self.bioclim_model(y)
        y = self.proj2(y)
        
        z = self.proj3(self.sentinel_model(z))
        
        
        xyz = torch.cat((x, y, z), dim=1)
        out = self.label(xyz)
        return out

In [16]:
def set_seed(seed):
    torch.manual_seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

set_seed(69)

In [17]:
# Use CUDA when available; otherwise fall back to the CPU so `device` is always defined
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"DEVICE = {device.type.upper()}")

model = MultimodalEnsemble(num_classes).to(device)

DEVICE = CUDA


Downloading: "https://download.pytorch.org/models/swin_t-704ceda3.pth" to /root/.cache/torch/hub/checkpoints/swin_t-704ceda3.pth
100%|██████████| 108M/108M [00:00<00:00, 133MB/s] 


## 5. Hyperparameters

In [None]:

learning_rate = 8e-5
num_epochs = 3
positive_weigh_factor = 1.0

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
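# `verbose` is deprecated in recent PyTorch releases; the scheduler state is printed after each epoch instead.
# T_max=25 exceeds num_epochs, so only the first part of the cosine decay is used.
scheduler = CosineAnnealingLR(optimizer, T_max=25)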
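
As a rough check of the schedule configured above, the closed-form cosine annealing rule can be evaluated by hand. The `cosine_lr` helper is hypothetical and assumes a minimum learning rate of zero, which is the scheduler's default.

```python
import math

def cosine_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Closed-form cosine annealing: decays from base_lr to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# Learning rate after 0..3 scheduler steps with the notebook's settings
for epoch in range(4):
    print(epoch, f"{cosine_lr(8e-5, epoch, 25):.3e}")
```

With a period of 25 and only 3 epochs, the learning rate stays above 96% of its initial value for the whole run, so the schedule behaves almost like a constant rate here.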



## 6. Model Training

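The model is trained for 3 epochs using the AdamW optimizer and a cosine annealing learning rate scheduler. The loss function is `BCEWithLogitsLoss` with a positive-weight factor that can be raised above 1.0 to up-weight present species and counter class imbalance; at the configured value of 1.0, the loss is equivalent to unweighted binary cross-entropy. The model weights are saved after each epoch.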
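
The effect of the positive-weight factor can be checked with a small NumPy re-implementation of the weighted loss. This is a sketch of the standard formula, not PyTorch's implementation; the `bce_with_logits` helper, the logits, and the targets are made up for illustration.

```python
import numpy as np

def bce_with_logits(logits, targets, pos_weight=1.0):
    """Mean binary cross-entropy on logits; positive terms are scaled by pos_weight."""
    log_sig = -np.logaddexp(0.0, -logits)           # log(sigmoid(x)), numerically stable
    log_one_minus_sig = -np.logaddexp(0.0, logits)  # log(1 - sigmoid(x))
    loss = -(pos_weight * targets * log_sig + (1.0 - targets) * log_one_minus_sig)
    return loss.mean()

logits = np.array([[2.0, -1.0, 0.5], [-0.3, 1.2, -2.0]])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

plain = bce_with_logits(logits, targets)
factor_one = bce_with_logits(logits, targets, pos_weight=targets * 1.0)
factor_five = bce_with_logits(logits, targets, pos_weight=targets * 5.0)
print(plain, factor_one, factor_five)
```

Because the weight only multiplies terms where the target is 1, `targets * 1.0` reproduces the unweighted loss, while a factor of 5 increases the penalty on missed positives. With all logits at zero the loss equals ln 2 ≈ 0.693, which is consistent with the first logged batch loss.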

In [None]:
print(f"Training for {num_epochs} epochs started.")

for epoch in range(num_epochs):
    model.train()
    
    for batch_idx, (data1, data2, data3, targets, _) in enumerate(train_loader):

        data1 = data1.to(device)
        data2 = data2.to(device)
        data3 = data3.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()
        outputs = model(data1, data2, data3)

        pos_weight = targets*positive_weigh_factor  
        criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        loss = criterion(outputs, targets)

        loss.backward()
        optimizer.step()

        if batch_idx % 128 == 0:
            print(f"Epoch {epoch+1}/{num_epochs}, Batch {batch_idx}/{len(train_loader)}, Loss: {loss.item()}")

    model.eval()
    torch.save(model.state_dict(), f"{epoch}-multimodal-model.pth")
    
    scheduler.step()
    print("Scheduler:",scheduler.state_dict())

# Save the trained model
model.eval()
torch.save(model.state_dict(), "multimodal-model.pth")

Training for 3 epochs started.


Epoch 1/3, Batch 0/348, Loss: 0.6932763457298279


## 7. Generating Predictions on the Test Set

In [None]:
model.eval()  # Inference mode: disables dropout and uses BatchNorm running statistics

with torch.no_grad():
    surveys = []
    predictions_list = []
    top_k_indices = None
    for batch_idx, (data1, data2, data3, surveyID) in enumerate(test_loader):

        data1 = data1.to(device)
        data2 = data2.to(device)
        data3 = data3.to(device)

        outputs = model(data1, data2, data3)
        predictions = torch.sigmoid(outputs).cpu().numpy()
        
        batch_top_predictions = []
        for el in predictions:
            answ = np.array(list(np.nonzero(el > 0.18)[0]))
            
            if len(answ) < 14:
                answ = np.array(np.argsort(-el)[:14])
            batch_top_predictions.append(answ)
        
        batch_top_unique = [unique[el] for el in batch_top_predictions]
        
        predictions_list.extend(batch_top_unique)
        
        surveys.extend(surveyID.cpu().numpy())

    top_k_indices = predictions_list
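
The selection rule in the loop above (keep every class above a probability threshold, but fall back to the top-ranked classes when too few pass) can be isolated into a small function. The `select_species` name and the toy probabilities are illustrative; the defaults mirror the 0.18 threshold and the minimum of 14 species used in the cell.

```python
import numpy as np

def select_species(probs, threshold=0.18, min_species=14):
    """Return class indices above the threshold, or the top `min_species` if too few pass."""
    picked = np.nonzero(probs > threshold)[0]
    if len(picked) < min_species:
        picked = np.argsort(-probs)[:min_species]  # highest probabilities first
    return picked

probs = np.array([0.05, 0.40, 0.10, 0.25, 0.02])
above = select_species(probs, threshold=0.18, min_species=1)     # threshold path
fallback = select_species(probs, threshold=0.50, min_species=3)  # top-k fallback
print(above, fallback)
```

The returned values are class indices; in the notebook they are mapped back to species IDs through the `unique` array.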

## 8. Producing the Submission CSV

In [None]:
data_concatenated = [' '.join(map(lambda x: str(int(x)), row)) for row in top_k_indices]


pd.DataFrame(
    {'surveyId': surveys,
     'predictions': data_concatenated,
    }).to_csv("submission.csv", index = False)
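
For reference, the file layout produced by the cell above (one row per survey, with the predicted species IDs joined by spaces) can be reproduced with the standard library alone. The survey and species IDs below are hypothetical.

```python
import csv
import io

surveys = [101, 102]              # hypothetical survey IDs
predicted = [[12, 7, 530], [42]]  # hypothetical species IDs per survey

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["surveyId", "predictions"])
for survey_id, species in zip(surveys, predicted):
    # Species IDs are written as space-separated integers in a single column
    writer.writerow([survey_id, " ".join(str(int(s)) for s in species)])

csv_text = buffer.getvalue()
print(csv_text)
```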