## Simple baseline with Landsat Cubes — ResNet18 + Binary Cross Entropy [0.26424]

To demonstrate the potential of other data such as Landsat cubes, we provide a straightforward baseline that is based on a modified ResNet18 and Binary Cross Entropy but still ranks highly on the leaderboard. The model should learn the relationship between the [R, G, B, NIR, SWIR1, and SWIR2] values at a given location and the species composition there.

There is considerable room to improve on this baseline, so we encourage you to experiment with various techniques, architectures, losses, etc.

#### **Have Fun!**

In [1]:
import os
import torch
import tqdm
import numpy as np
import pandas as pd
import torchvision.models as models
import torchvision.transforms as transforms
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim.lr_scheduler import CosineAnnealingLR
from sklearn.metrics import precision_recall_fscore_support

## Data description

Satellite time series data includes over 20 years of Landsat satellite imagery extracted from [Ecodatacube](https://stac.ecodatacube.eu/).
The data was acquired through the Landsat satellite program and pre-processed by Ecodatacube to produce raster files scaled to the entire European continent and projected into a unique CRS.

Since the original rasters require a high amount of disk space, we extracted the data points from each spectral band corresponding to all PA and PO locations (i.e., GPS coordinates) and aggregated them in (i) CSV files and (ii) data cubes as tensor objects. Each data point corresponds to the mean value of Landsat's observations at the given location for three months before the given time; e.g., the value of a time series element under column 2012_4 will represent the mean value for that element from October 2012 to December 2012.
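
As a minimal sketch of the column naming described above, the helper below decodes a name such as `2012_4` into its year and month range. It assumes plain calendar quarters, which is inferred only from the single `2012_4` example; the function name `quarter_column_to_months` is hypothetical and not part of the dataset tooling.

```python
def quarter_column_to_months(column: str) -> tuple:
    """Decode a column name such as '2012_4' into (year, first_month, last_month).

    Assumption: quarter q covers calendar months 3*(q-1)+1 .. 3*q.
    """
    year_str, quarter_str = column.split("_")
    year, quarter = int(year_str), int(quarter_str)
    if not 1 <= quarter <= 4:
        raise ValueError(f"quarter must be in 1..4, got {quarter}")
    first_month = 3 * (quarter - 1) + 1
    return year, first_month, first_month + 2


print(quarter_column_to_months("2012_4"))  # (2012, 10, 12)
```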

In this notebook, we will work with just the cubes. The cubes are structured as follows.
**Shape**: `(n_bands, n_quarters, n_years)` where:
- `n_bands` = 6 comprising [`red`, `green`, `blue`, `nir`, `swir1`, `swir2`]
- `n_quarters` = 4 
    - *Quarter 1*: December 2 of previous year until March 20 of current year (winter season proxy),
    - *Quarter 2*: March 21 until June 24 of current year (spring season proxy),
    - *Quarter 3*: June 25 until September 12 of current year (summer season proxy),
    - *Quarter 4*: September 13 until December 1 of current year (fall season proxy).
- `n_years` = 21 (ranging from 2000 to 2020)

The datacubes can simply be loaded as tensors using PyTorch with the following command:

```python
import torch
torch.load('path_to_file.pt')
```
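
To make the axis order concrete, here is a small sketch that indexes a cube by band name, quarter, and year. It uses a synthetic tensor in place of a real file, and the `BANDS`, `YEARS`, and `cube_value` names are illustrative assumptions rather than part of the dataset API.

```python
import torch

BANDS = ["red", "green", "blue", "nir", "swir1", "swir2"]
YEARS = list(range(2000, 2021))  # 21 years, 2000..2020

# Synthetic stand-in for torch.load('path_to_file.pt'); real cubes hold Landsat values.
cube = torch.arange(6 * 4 * 21, dtype=torch.float32).reshape(6, 4, 21)


def cube_value(cube, band, quarter, year):
    """Return the scalar for one band, one quarter (1-4), and one year."""
    return cube[BANDS.index(band), quarter - 1, YEARS.index(year)]


print(tuple(cube.shape))  # (6, 4, 21)
nir_q3_2020 = cube_value(cube, "nir", quarter=3, year=2020)
red_series_q2 = cube[BANDS.index("red"), 1, :]  # 21 yearly values for quarter 2
print(red_series_q2.shape[0])  # 21
```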

**References:**
- *Traceability (lineage): This dataset is a seasonally aggregated and gapfilled version of the Landsat GLAD analysis-ready data product presented by Potapov et al., 2020 ( https://doi.org/10.3390/rs12030426 ).*
- *Scientific methodology: The Landsat GLAD ARD dataset was aggregated and harmonized using the eumap python package (available at https://eumap.readthedocs.io/en/latest/ ). The full process of gapfilling and harmonization is described in detail in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3 ).*
- *Ecodatacube.eu: Analysis-ready open environmental data cube for Europe (https://doi.org/10.21203/rs.3.rs-2277090/v3).*

## Prepare custom dataset loader

We have to slightly update the Dataset to provide the relevant data in the appropriate format.

In [2]:
class TrainDataset(Dataset):
    def __init__(self, data_dir, metadata, subset, transform=None):
        self.subset = subset
        self.transform = transform
        self.data_dir = data_dir
        self.metadata = metadata
        self.metadata = self.metadata.dropna(subset="speciesId").reset_index(drop=True)
        self.metadata['speciesId'] = self.metadata['speciesId'].astype(int)
        self.label_dict = self.metadata.groupby('surveyId')['speciesId'].apply(list).to_dict()
        self.metadata = self.metadata.drop_duplicates(subset="surveyId").reset_index(drop=True)

    def __len__(self):
        return len(self.metadata)

    def __getitem__(self, idx):
        survey_id = self.metadata.surveyId[idx]
        lonlat = torch.tensor(self.metadata.loc[idx, ["lon","lat"]].values.astype(np.float32))
        sample = torch.nan_to_num(torch.load(os.path.join(self.data_dir, f"GLC25-PA-{self.subset}-landsat-time-series_{survey_id}_cube.pt"), weights_only=True))

        species_ids = self.label_dict.get(survey_id, [])  # Get list of species IDs for the survey ID
        # Multi-hot label; num_classes is a global that must be defined before the loader is iterated
        label = torch.zeros(num_classes).scatter(0, torch.tensor(species_ids), torch.ones(len(species_ids)))

        # ToTensor expects a channels-last (H, W, C) numpy array and returns a (C, H, W) tensor
        if isinstance(sample, torch.Tensor):
            sample = sample.permute(1, 2, 0)  # (bands, quarters, years) -> (quarters, years, bands)
            sample = sample.numpy()  # Convert tensor to numpy array

        if self.transform:
            sample = self.transform(sample)

        return sample, lonlat, label, survey_id
    
class TestDataset(TrainDataset):
    def __init__(self, data_dir, metadata, subset, transform=None):
        self.subset = subset
        self.transform = transform
        self.data_dir = data_dir
        self.metadata = metadata
        
    def __getitem__(self, idx):
        survey_id = self.metadata.surveyId[idx]
        lonlat = torch.tensor(self.metadata.loc[idx, ["lon","lat"]].values.astype(np.float32))
        sample = torch.nan_to_num(torch.load(os.path.join(self.data_dir, f"GLC25-PA-{self.subset}-landsat_time_series_{survey_id}_cube.pt"), weights_only=True))
        if isinstance(sample, torch.Tensor):
            sample = sample.permute(1, 2, 0)  # Change tensor shape from (C, H, W) to (H, W, C)
            sample = sample.numpy()
        if self.transform:
            sample = self.transform(sample)
        return sample, lonlat, survey_id
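
The label construction in `TrainDataset` turns a list of species IDs into a multi-hot vector. The sketch below shows the same idea on a toy problem with 10 classes; `multi_hot` is a hypothetical helper, and it uses index assignment with an explicit `torch.long` dtype so that an empty ID list also works.

```python
import torch


def multi_hot(species_ids, num_classes):
    """Return a float vector with 1.0 at every listed species index."""
    label = torch.zeros(num_classes)
    index = torch.tensor(species_ids, dtype=torch.long)
    label[index] = 1.0
    return label


print(multi_hot([2, 5, 7], num_classes=10).tolist())
# [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
print(multi_hot([], num_classes=10).sum().item())  # 0.0
```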

### Load metadata and prepare data loaders

In [3]:
# Dataset and DataLoader
batch_size = 256
transform = transforms.Compose([
    transforms.ToTensor()
])

# Load Training metadata
path_data = "/home/gt/DATA/geolifeclef-2025"
train_data_path = os.path.join(path_data, "SateliteTimeSeries-Landsat/cubes/PA-train/")
train_metadata_path = os.path.join(path_data, "GLC25_PA_metadata_train.csv")
train_metadata = pd.read_csv(train_metadata_path)
train_dataset = TrainDataset(train_data_path, train_metadata, subset="train", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)

# Load Test metadata
test_data_path = os.path.join(path_data, "SateliteTimeSeries-Landsat/cubes/PA-test/")
test_metadata_path = os.path.join(path_data, "GLC25_PA_metadata_test.csv")
test_metadata = pd.read_csv(test_metadata_path)
test_dataset = TestDataset(test_data_path, test_metadata, subset="test", transform=transform)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=4)

In [4]:
# num_classes = 11255 # Number of all unique classes within the PO and PA data.
# print(next(iter(train_loader)))

## Define and initialize the ModifiedResNet18 model

To utilize the Landsat cubes, which have a shape of [6,4,21] (BANDs, QUARTERs, and YEARs), some minor adjustments must be made to the vanilla ResNet-18: the first convolution is replaced by a 3x3, stride-1 convolution with 6 input channels, and the initial max-pooling is removed. Note that this is just one way to ensure compatibility with the unusual tensor shape, and experimentation is encouraged.

In [5]:
class ModifiedResNet18(nn.Module):
    def __init__(self, num_classes, lonlat):
        super(ModifiedResNet18, self).__init__()

        self.norm_input = nn.LayerNorm([6,4,21])
        self.resnet18 = models.resnet18(weights=None)
        # We have to modify the first convolutional layer to accept 6 channels (one per Landsat band) instead of 3
        self.resnet18.conv1 = nn.Conv2d(6, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.resnet18.maxpool = nn.Identity()
        self.ln = nn.LayerNorm(1000)
        self.lonlat = lonlat
        if self.lonlat==True:
            self.fc = nn.Linear(1002, num_classes)
        else:
            self.fc = nn.Linear(1000, num_classes)

    def forward(self, x, lonlat=None):
        x = self.norm_input(x)
        x = self.resnet18(x)
        x = self.ln(x)
        if self.lonlat != False:
            x = torch.concat([x, lonlat], -1)
        x = self.fc(x)
        return x
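
A quick shape check illustrates why the stem is changed. This sketch compares a default-style ResNet stem (7x7 convolution with stride 2 followed by 3x3 max-pooling with stride 2) with the modified one on a dummy batch. It uses plain `torch.nn` layers instead of the full network, and the variable names are illustrative.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 6, 4, 21)  # dummy batch: (batch, bands, quarters, years)

# Default-style stem, given 6 input channels here so that it accepts the cube.
default_stem = nn.Sequential(
    nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
# Modified stem as in ModifiedResNet18: 3x3 conv, stride 1, no pooling.
modified_stem = nn.Sequential(
    nn.Conv2d(6, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.Identity(),
)

default_shape = tuple(default_stem(x).shape)
modified_shape = tuple(modified_stem(x).shape)
print(default_shape)   # (1, 64, 1, 6)
print(modified_shape)  # (1, 64, 4, 21)
```

The default-style stem would collapse the 4 quarters into a single row before any residual block sees them, whereas the modified stem keeps the full 4 x 21 grid.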

In [6]:
def set_seed(seed):
    # Set seed for PyTorch's random number generator
    torch.manual_seed(seed)
    # Set seed for numpy
    np.random.seed(seed)
    # Set seed for CUDA if available
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
        # Make cuDNN deterministic and disable its benchmark auto-tuner for reproducibility
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

set_seed(69)

In [7]:
# Check if cuda is available
device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("DEVICE = CUDA")
num_classes = 11255 # Number of all unique classes within the PO and PA data.
model = ModifiedResNet18(num_classes, False).to(device)

DEVICE = CUDA


In [13]:
from torchsummary import summary
summary(model, (6, 4, 21))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
         LayerNorm-1             [-1, 6, 4, 21]           1,008
            Conv2d-2            [-1, 64, 4, 21]           3,456
       BatchNorm2d-3            [-1, 64, 4, 21]             128
              ReLU-4            [-1, 64, 4, 21]               0
          Identity-5            [-1, 64, 4, 21]               0
            Conv2d-6            [-1, 64, 4, 21]          36,864
       BatchNorm2d-7            [-1, 64, 4, 21]             128
              ReLU-8            [-1, 64, 4, 21]               0
            Conv2d-9            [-1, 64, 4, 21]          36,864
      BatchNorm2d-10            [-1, 64, 4, 21]             128
             ReLU-11            [-1, 64, 4, 21]               0
       BasicBlock-12            [-1, 64, 4, 21]               0
           Conv2d-13            [-1, 64, 4, 21]          36,864
      BatchNorm2d-14            [-1, 64

## Training Loop

Nothing special, just a standard PyTorch training loop.

In [9]:
# Hyperparameters
learning_rate = 0.0002
num_epochs = 20
positive_weigh_factor = 1.0

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = CosineAnnealingLR(optimizer, T_max=25, verbose=True)
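
Because `T_max=25` is larger than `num_epochs = 20`, the cosine schedule does not reach its minimum by the end of training. The sketch below evaluates the closed-form cosine annealing formula with `eta_min = 0` (the PyTorch default) in plain Python; `cosine_lr` is a hypothetical helper, not the scheduler itself.

```python
import math


def cosine_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Closed-form cosine annealing: learning rate after `epoch` scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


base_lr, t_max = 0.0002, 25
for epoch in (0, 10, 20, 25):
    print(epoch, f"{cosine_lr(base_lr, epoch, t_max):.3e}")
```

Under these assumptions the learning rate after 20 epochs is roughly 1.9e-5, about a tenth of the initial value, rather than 0. Setting `T_max` equal to `num_epochs` would complete the full half-cosine.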



In [10]:
print(f"Training for {num_epochs} epochs started.")

for epoch in range(num_epochs):
    model.train()
    for batch_idx, (data, lonlat, targets, _) in tqdm.tqdm(enumerate(train_loader), total=len(train_loader), leave=False):
        data = data.to(device)
        lonlat = lonlat.to(device)
        # print(lonlat.dtype)
        targets = targets.to(device)
        optimizer.zero_grad()
        outputs = model(data)
        pos_weight = targets*positive_weigh_factor  # Positive entries are weighted by positive_weigh_factor (1.0 here, i.e. no re-weighting)
        criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        # if batch_idx % 175 == 0:
        #     print(f"Epoch {epoch+1}/{num_epochs}, Batch {batch_idx}/{len(train_loader)}, Loss: {loss.item()}")
    
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}")
    scheduler.step()
    #print("Scheduler:",scheduler.state_dict())

# Save the trained model
model.eval()
torch.save(model.state_dict(), "resnet18-with-landsat-cubes.pth")

Training for 20 epochs started.

Epoch 1/20, Loss: 0.006783004850149155
Epoch 2/20, Loss: 0.005892917048186064
Epoch 3/20, Loss: 0.005638463422656059
Epoch 4/20, Loss: 0.004626089241355658
Epoch 5/20, Loss: 0.004750856198370457
Epoch 6/20, Loss: 0.004376672208309174
Epoch 7/20, Loss: 0.0044020479544997215
Epoch 8/20, Loss: 0.0039253863506019115
Epoch 9/20, Loss: 0.0038529299199581146
Epoch 10/20, Loss: 0.003782301675528288
Epoch 11/20, Loss: 0.0038399151526391506
Epoch 12/20, Loss: 0.003735095029696822
Epoch 13/20, Loss: 0.003572053276002407
Epoch 14/20, Loss: 0.0036347825080156326
Epoch 15/20, Loss: 0.003613216569647193
Epoch 16/20, Loss: 0.0031437657307833433
Epoch 17/20, Loss: 0.003282710677012801
Epoch 18/20, Loss: 0.003230009227991104
Epoch 19/20, Loss: 0.0028856401331722736


## Test Loop

Again, nothing special, just a standard inference.

In [11]:
with torch.no_grad():
    all_predictions = []
    surveys = []
    top_k_indices = None
    for data, lonlat, surveyID in tqdm.tqdm(test_loader, total=len(test_loader)):
        data = data.to(device)
        lonlat = lonlat.to(device)
        outputs = model(data, lonlat)
        predictions = torch.sigmoid(outputs).cpu().numpy()

        # Select top-25 values as predictions
        top_25 = np.sort(np.argsort(-predictions, axis=1)[:, :25])
        if top_k_indices is None:
            top_k_indices = top_25
        else:
            top_k_indices = np.concatenate((top_k_indices, top_25), axis=0)

        surveys.extend(surveyID.cpu().numpy())

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:09<00:00,  5.91it/s]
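
The top-k selection can be checked on a tiny example. This sketch uses a made-up 2 x 4 probability matrix and k = 2 instead of 25; it mirrors the `argsort`/`sort` pattern from the test loop.

```python
import numpy as np

probs = np.array([[0.1, 0.9, 0.3, 0.8],
                  [0.7, 0.2, 0.6, 0.05]])  # made-up per-class probabilities
k = 2

# argsort on the negated scores ranks classes from most to least probable;
# the outer sort puts the k selected class indices in ascending order.
top_k = np.sort(np.argsort(-probs, axis=1)[:, :k], axis=1)
print(top_k.tolist())  # [[1, 3], [0, 2]]
```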


## Save prediction file! 🎉🥳🙌🤗

In [12]:
data_concatenated = [' '.join(map(str, row)) for row in top_k_indices]

pd.DataFrame(
    {'surveyId': surveys,
     'predictions': data_concatenated,
    }).to_csv("submission.csv", index = False)
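
As a sanity check of the output format, the sketch below builds a two-row submission from made-up survey IDs and predictions and inspects the resulting CSV text. The values are illustrative only.

```python
import pandas as pd

surveys = [101, 102]               # made-up survey IDs
top_k_indices = [[1, 3], [0, 2]]   # made-up predicted class indices

data_concatenated = [" ".join(map(str, row)) for row in top_k_indices]
csv_text = pd.DataFrame(
    {"surveyId": surveys, "predictions": data_concatenated}
).to_csv(index=False)  # returns the CSV as a string when no path is given

lines = csv_text.splitlines()
print(lines)  # ['surveyId,predictions', '101,1 3', '102,0 2']
```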