# ClimateHack.AI 2023: Training a Basic Model

Thank you for participating in ClimateHack.AI 2023! 

Your contributions could help cut carbon emissions by up to 100 kilotonnes per year in Great Britain alone. We look forward to seeing what you build over the course of the competition!

In this Jupyter notebook, you will train your first model for the challenge using historical solar PV data and HRV satellite imagery.

## Installing packages

Before you can get started, you will need to install a number of packages to allow you to work with the data and submit to the platform. If you do not already have these packages installed, you can uncomment the lines below to do so! You will also need to [install PyTorch](https://pytorch.org/get-started/locally/).

In [None]:
# %pip install zarr xarray gcsfs fsspec dask cartopy ocf-blosc2 torchinfo tqdm
# %pip install lightning
# %pip install -U doxa-cli
# !git clone https://github.com/nhat-vo/getting-started-2023.git && mv getting-started-2023/* .

## Importing packages

Here, we import a number of packages we will need to train our first model.

In [None]:
import os
from datetime import datetime, time, timedelta
import matplotlib.pyplot as plt
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import xarray as xr
from torch.utils.data import DataLoader, IterableDataset
from torchinfo import summary
import json
import math
from tqdm import tqdm
import lightning as L
from lightning.pytorch.callbacks import DeviceStatsMonitor
from ocf_blosc2 import Blosc2


plt.rcParams["figure.figsize"] = (20, 12)

## Downloading a month of data

While streaming the Zarr-format datasets directly from Hugging Face was adequate for some initial data exploration in `1_data.ipynb`, it will most likely not be fast enough for training. Since there is so much data available, we can get started by downloading just a single month of PV and HRV satellite imagery data.

In [None]:
if not os.path.exists("data"):
    os.makedirs("data/pv/2020", exist_ok=True)
    os.makedirs("data/satellite-hrv/2020", exist_ok=True)

    !curl -L https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main/pv/metadata.csv --output data/pv/metadata.csv
    !curl -L https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main/pv/2020/7.parquet --output data/pv/2020/7.parquet
    !curl -L https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main/satellite-hrv/2020/7.zarr.zip --output data/satellite-hrv/2020/7.zarr.zip
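
If you later want more than one month of data, the download URLs appear to follow a regular pattern. The sketch below builds them for a chosen year and set of months. `build_download_urls` is a hypothetical helper (it is not part of the starter code), and it assumes every month follows the same `pv/<year>/<month>.parquet` and `satellite-hrv/<year>/<month>.zarr.zip` layout as July 2020 above, so check that the files exist before relying on it.

```python
BASE_URL = "https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main"


def build_download_urls(year, months):
    """Map each remote URL to the local path it should be saved to.

    Assumes every month uses the same layout as the July 2020 files.
    """
    urls = {f"{BASE_URL}/pv/metadata.csv": "data/pv/metadata.csv"}
    for month in months:
        for remote in (
            f"pv/{year}/{month}.parquet",
            f"satellite-hrv/{year}/{month}.zarr.zip",
        ):
            urls[f"{BASE_URL}/{remote}"] = f"data/{remote}"
    return urls


# Print the equivalent curl commands for July 2020
for url, path in build_download_urls(2020, [7]).items():
    print(f"curl -L {url} --output {path}")
```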

## Loading the data

In [None]:
pv = pd.read_parquet("data/pv/2020/7.parquet").drop("generation_wh", axis=1)
pv

In [None]:
hrv = xr.open_dataset("data/satellite-hrv/2020/7.zarr.zip", engine="zarr", chunks="auto")
hrv

As part of the challenge, you can make use of satellite imagery, numerical weather prediction and air quality forecast data in a `[128, 128]` region centred on each solar PV site. In order to help you out, we have pre-computed the indices corresponding to each solar PV site and included them in `indices.json`, which we can load directly. For more information, take a look at the [challenge page](https://doxaai.com/competition/climatehackai-2023).


In [None]:
with open("indices.json") as f:
    blobs = json.load(f)
    site_locations = {
        v: {int(r): (int(blob[r][0]), int(blob[r][1])) for r in blob}
        for v, blob in blobs.items()
    }
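
To make the resulting structure concrete, here is a minimal sketch of the same conversion applied to a tiny in-memory sample. The site IDs and coordinates are made up for illustration, and `parse_site_locations` is a hypothetical name for the dictionary comprehension used above. Only the `"hrv"` key is taken from this notebook.

```python
import json

# Hypothetical sample mirroring the structure parsed above.
# The site IDs and coordinates are invented for illustration.
sample = json.loads('{"hrv": {"2607": [300, 150], "2660": [410, 220]}}')


def parse_site_locations(blobs):
    """Convert JSON string keys and list values into int keys and (x, y) tuples."""
    return {
        source: {int(site): (int(xy[0]), int(xy[1])) for site, xy in sites.items()}
        for source, sites in blobs.items()
    }


locations = parse_site_locations(sample)
x, y = locations["hrv"][2607]  # look up a site by its integer ID
print(x, y)
```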

### Defining a PyTorch Dataset

To get started, we will define a simple `IterableDataset` that shows how to slice into the PV and HRV data using `pandas` and `xarray`, respectively. You will have to modify this if you wish to incorporate non-HRV data, weather forecasts and air quality forecasts into your training regimen. If you have any questions, feel free to ask on the [ClimateHack.AI Community Discord server](https://discord.gg/HTTQ8AFjJp)!

**Note**: `site_locations` contains indices for the non-HRV, weather forecast and air quality forecast data as well as for the HRV data!

There are many more advanced strategies you could implement to load data in training, particularly if you want to pre-prepare training batches in advance or use multiple workers to improve data loading times.

In [None]:
class ChallengeDataset(IterableDataset):
    def __init__(self, pv, hrv, site_locations, start_date, end_date):
        super().__init__()
        self.pv = pv
        self.hrv = hrv
        self._site_locations = site_locations

        self.start_date = start_date
        self.end_date = end_date
        self.start_time = time(8)
        self.end_time = time(17)

    def _get_image_times(self):
        date = self.start_date
        while date < self.end_date:
            current_time = datetime.combine(date, self.start_time)
            while current_time.time() < self.end_time:
                yield current_time
                current_time += timedelta(minutes=60)

            date += timedelta(days=1)
            
    
    # def __len__(self):
    #     start = datetime.combine(self.start_date, self.start_time) 
    #     end = datetime.combine(self.end_date, self.end_time) 
    #     time_count = int((end - start).total_seconds()) // 3600
    #     return time_count * len(self._site_locations["hrv"])

    def __iter__(self):
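        for timestamp in self._get_image_times():
            first_hour = slice(str(timestamp), str(timestamp + timedelta(minutes=55)))

            # Use the DataFrame stored on the dataset rather than the notebook-level `pv`
            pv_features = self.pv.xs(first_hour, drop_level=False)  # type: ignore
            pv_targets = self.pv.xs(
                slice(  # type: ignore
                    str(timestamp + timedelta(hours=1)),
                    str(timestamp + timedelta(hours=4, minutes=55)),
                ),
                drop_level=False,
            )

            # Load the hour of HRV imagery once and reuse it for every site.
            # No padding is applied, so sites too close to the image edge yield
            # a wrong-shaped crop and are skipped by the shape assertion below.
            hrv_data = self.hrv["data"].sel(time=first_hour).to_numpy()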

            for site in self._site_locations["hrv"]:
                try:
                    # Get solar PV features and targets
                    site_features = pv_features.xs(site, level=1).to_numpy().squeeze(-1)
                    site_targets = pv_targets.xs(site, level=1).to_numpy().squeeze(-1)
                    assert site_features.shape == (12,) and site_targets.shape == (48,)

                    # Get a 128x128 HRV crop centred on the site over the previous hour
                    x, y = self._site_locations["hrv"][site]
                    hrv_features = hrv_data[:, y - 64 : y + 64, x - 64 : x + 64, 0]
                    assert hrv_features.shape == (12, 128, 128)

                    # How might you adapt this for the non-HRV, weather and aerosol data?
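                except (KeyError, ValueError, AssertionError):
                    # Skip sites with missing PV readings or incomplete HRV crops
                    print("ignoring site", site)
                    continue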

                yield site_features, hrv_features, site_targets
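
The shape assertions in the dataset encode the windowing: each example pairs 12 readings from the first hour with 48 target readings from the following four hours. The sketch below enumerates those timestamps using only the standard library. It assumes a five-minute sampling interval (implied by 12 readings per hour), and `window_timestamps` is a hypothetical helper rather than part of the starter code.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=5)  # assumed sampling interval (12 readings per hour)


def window_timestamps(start):
    """Return the feature and target timestamps for one example starting at `start`."""
    # Features: start .. start + 55 minutes (12 steps)
    features = [start + i * STEP for i in range(12)]
    # Targets: start + 1 hour .. start + 4 hours 55 minutes (48 steps)
    targets = [start + timedelta(hours=1) + i * STEP for i in range(48)]
    return features, targets


features, targets = window_timestamps(datetime(2020, 7, 1, 8))
print(features[0], features[-1])
print(targets[0], targets[-1])
```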

In [None]:
def worker_init_fn(worker_id):
    worker_info = torch.utils.data.get_worker_info()
    dataset = worker_info.dataset  # the dataset copy in this worker process
    overall_start = dataset.start_date
    overall_end = dataset.end_date
    days_count = (overall_end - overall_start).days

    # configure the dataset to only process the split workload
    per_worker = int(math.ceil(days_count / float(worker_info.num_workers)))
    date_delta = timedelta(days=per_worker)
    dataset.start_date = overall_start + worker_id * date_delta
    dataset.end_date = min(dataset.start_date + date_delta, overall_end)
    print(f"Worker {worker_id} processing {dataset.start_date} to {dataset.end_date}")
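
The date arithmetic in `worker_init_fn` can be checked on its own, without starting any worker processes. The following sketch reproduces the split as a plain function. `split_date_range` is a hypothetical name, and unlike the function above it also clamps each shard's start to the overall end date so that surplus workers receive an empty range.

```python
import math
from datetime import datetime, timedelta


def split_date_range(start, end, num_workers):
    """Return one (start, end) pair per worker, covering [start, end) in whole days."""
    days = (end - start).days
    per_worker = math.ceil(days / num_workers)
    shards = []
    for worker_id in range(num_workers):
        # Clamp so that workers beyond the end of the range get an empty shard
        shard_start = min(start + timedelta(days=worker_id * per_worker), end)
        shard_end = min(shard_start + timedelta(days=per_worker), end)
        shards.append((shard_start, shard_end))
    return shards


shards = split_date_range(datetime(2020, 7, 1), datetime(2020, 7, 25), 4)
for worker_id, (shard_start, shard_end) in enumerate(shards):
    print(worker_id, shard_start.date(), shard_end.date())
```

Note that `worker_init_fn` only takes effect if it is passed to the loader together with a worker count, for example `DataLoader(train_data, batch_size=BATCH_SIZE, num_workers=4, worker_init_fn=worker_init_fn)`. With the default `num_workers=0`, it is never called.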

## Defining a model

To make a PyTorch-based submission to the DOXA AI platform, you need to upload the code defining your model, your trained model weights, and some code to run your model. As a result, if you want to experiment with different model architectures using this notebook, you will need to edit the model in `submission/model.py` and re-import it here.

Here is the small convolutional neural network you are initially given in `submission/model.py`. You will absolutely be able to improve upon this!


In [None]:
# Import the model defined in `submission/model.py`

from submission.model import Model

In [None]:
class ModelModule(L.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.criterion = nn.L1Loss()
        # self.example_input_array = [torch.Tensor(1, 12), torch.Tensor(1, 12, 128, 128)]

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        pv_features, hrv_features, pv_targets = batch
        predictions = self.model(
            pv_features,
            hrv_features,
        )

        loss = self.criterion(predictions, pv_targets)

        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        # this is the validation loop
        pv_features, hrv_features, pv_targets = batch
        predictions = self.model(
            pv_features,
            hrv_features,
        )

        loss = self.criterion(predictions, pv_targets)
        self.log("val_loss", loss)

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer
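
The module above trains with `nn.L1Loss`, which with its default `"mean"` reduction is the mean absolute error between predictions and targets. As a rough illustration of what that number means, here is the same calculation in plain Python on made-up values. `mean_absolute_error` is a hypothetical helper for explanation only and is not used by the training loop.

```python
def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over all elements."""
    assert len(predictions) == len(targets), "inputs must have the same length"
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)


# Made-up values for three time steps
mae = mean_absolute_error([0.10, 0.50, 0.30], [0.20, 0.30, 0.30])
print(mae)
```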

## Train a model

In [None]:
BATCH_SIZE = 32

train_data = ChallengeDataset(pv, hrv, site_locations=site_locations, start_date=datetime(2020, 7, 1), end_date=datetime(2020, 7, 25))
val_data = ChallengeDataset(pv, hrv, site_locations=site_locations, start_date=datetime(2020, 7, 25), end_date=datetime(2020, 8, 1))
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, pin_memory=True)
val_loader = DataLoader(val_data, batch_size=BATCH_SIZE, pin_memory=True)
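
Because `ChallengeDataset` is an `IterableDataset` without a `__len__`, the loaders cannot report how many batches an epoch contains. A rough upper bound can be worked out by hand: one example per site for each of the nine hourly start times (08:00 to 16:00) on each day. The sketch below does that arithmetic. `max_batches_per_epoch` is a hypothetical helper, the site count of 100 is a placeholder, and the real number of batches will be lower because sites with missing data are skipped.

```python
import math
from datetime import datetime


def max_batches_per_epoch(start_date, end_date, num_sites, batch_size, hours_per_day=9):
    """Upper bound on batches: one example per site per hourly start time per day."""
    days = (end_date - start_date).days
    max_examples = days * hours_per_day * num_sites
    # The final batch may be partial, so round up
    return math.ceil(max_examples / batch_size)


# 100 sites is a placeholder; in the notebook, use len(site_locations["hrv"])
upper_bound = max_batches_per_epoch(
    datetime(2020, 7, 1), datetime(2020, 7, 25), num_sites=100, batch_size=32
)
print(upper_bound)
```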

In [None]:
model = ModelModule(Model())
print(summary(model.model, input_size=[(1, 12), (1, 12, 128, 128)]))

In [None]:
EPOCHS = 2

# accelerator="auto" uses a GPU when one is available and falls back to the CPU otherwise
trainer = L.Trainer(accelerator="auto", max_epochs=EPOCHS, profiler="advanced", callbacks=[DeviceStatsMonitor()])
trainer.fit(model=model, train_dataloaders=train_loader, val_dataloaders=val_loader)

In [None]:
# Save your model
torch.save(model.model.state_dict(), "submission/model.pt")
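
Before uploading, it can help to confirm that the files this notebook mentions are actually in place. The sketch below checks only for `submission/model.py` and `submission/model.pt`, the two files named here. `missing_submission_files` is a hypothetical helper, and the platform may expect other files that this check does not know about.

```python
from pathlib import Path


def missing_submission_files(directory="submission", required=("model.py", "model.pt")):
    """Return the required files that are absent or empty in the submission directory."""
    root = Path(directory)
    return [
        name
        for name in required
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


missing = missing_submission_files()
print("Ready to upload" if not missing else f"Missing or empty: {missing}")
```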

## Submitting to the DOXA AI platform

Congratulations, **you have trained your first model for ClimateHack.AI 2023**! 🥳

Why not try making a submission to the platform?

First, make sure you have enrolled for the competition on the [ClimateHack.AI 2023 competition page](https://doxaai.com/competition/climatehackai-2023). You will need to be signed in with a DOXA AI account registered with your university email address so that we can verify your eligibility.

You can then sign in with the CLI using the following command:

In [None]:
!doxa login

Finally, you can upload your submission to the platform by running the following cell:

In [None]:
!doxa upload submission

If everything went well, you will soon appear on the [competition scoreboard](https://doxaai.com/competition/climatehackai-2023/scoreboard) once your model has been evaluated! 😎

## Next steps

Well done for reaching the end of this Jupyter notebook! By now, you will have loaded and explored the data, trained a basic model, and joined other competition participants on the [competition scoreboard](https://doxaai.com/competition/climatehackai-2023/scoreboard)!

To get started, we used a very simple model architecture, but this model most likely does not have a sufficiently rich representation to properly solve our problem. How might you be able to improve on this? Which model architectures would be best suited to this problem? Would you want to train a model from scratch, as we have done here, or possibly fine-tune a pre-trained computer vision model? Check out the resources on the [competition page](https://doxaai.com/competition/climatehackai-2023) for ideas on where to go from here.

Additionally, we only used historical PV and HRV data, but perhaps you might be able to get more mileage out of the other data sources available to you, such as non-HRV satellite imagery, the DWD weather forecast data or even the aerosol data. If you do decide to incorporate more data, what **data engineering** work would you have to perform so that you can train effectively on a large quantity of data?

**We want to hear about your approaches**! If you develop anything interesting, let us know on the [ClimateHack.AI Community Discord server](https://discord.gg/HTTQ8AFjJp) and start a conversation!