# Workshop 2: Introduction to Hydrogen imports, Offshore Hubs & `Snakemake` workflows

:::{note}
If you have not yet set up Python on your computer, you can execute this tutorial in your browser via [Google Colab](https://colab.research.google.com/). Click on the rocket in the top right corner and launch "Colab". If that doesn't work, download the `.ipynb` file and import it into [Google Colab](https://colab.research.google.com/).

Then install the following packages by executing the following command in a Jupyter cell at the top of the notebook.

```sh
!pip install pypsa atlite pandas geopandas xarray matplotlib hvplot geoviews plotly highspy holoviews folium mapclassify
```
:::

## H2 imports

## Offshore Hubs

## The `Snakemake` tool

<img src="snakemake_logo.png" width="300px" />

The `Snakemake` workflow management system is a tool to create reproducible and scalable data analyses.
Workflows are described via a human-readable, Python-based language. They can be seamlessly scaled to server, cluster, grid and cloud environments without the need to modify the workflow definition.

Snakemake follows the [GNU Make](https://www.gnu.org/software/make) paradigm: workflows are defined in terms of so-called `rules` that define how to create a set of output files from a set of input files. Dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of jobs that can be automatically parallelized.
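To make the idea of automatic dependency resolution concrete, here is a minimal sketch in plain Python. It is not how Snakemake is implemented; it only illustrates the principle that a rule depends on whichever rule produces one of its input files. The rule and file names (`filter`, `sort`, `report`, `raw.txt`, ...) are hypothetical placeholders.

```python
from graphlib import TopologicalSorter

# Hypothetical rule table: rule name -> (input files, output files).
rules = {
    "report": (["sorted.txt"], ["report.txt"]),
    "sort": (["filtered.txt"], ["sorted.txt"]),
    "filter": (["raw.txt"], ["filtered.txt"]),
}


def build_dag(rules):
    """Map each rule to the set of rules that produce one of its inputs."""
    producer = {out: name for name, (_, outs) in rules.items() for out in outs}
    return {
        name: {producer[f] for f in ins if f in producer}
        for name, (ins, _) in rules.items()
    }


def execution_order(rules):
    """Return the rules in an order that respects all dependencies."""
    return list(TopologicalSorter(build_dag(rules)).static_order())


print(execution_order(rules))
# → ['filter', 'sort', 'report']
```

Note that the order in which the rules are written down does not matter: the dependency graph is derived purely from matching file names, which is also what allows independent jobs to run in parallel.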

:::{note}
Documentation for this package is available at https://snakemake.readthedocs.io/. You can also check out a [slide deck Snakemake Tutorial](https://slides.com/johanneskoester/snakemake-tutorial) by Johannes Köster (2024).

Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J., 2021. Sustainable data analysis with Snakemake. F1000Res 10, 33.
:::


### A minimal Snakemake example

To check out how this looks in practice, we've prepared a minimal Snakemake example workflow that processes some data. The workflow consists of two simple rules:
- `filter_data`
- `sort_data`

<img src="minimal_workflow.png" width="100px" />

We will first need to download the raw data file used in this minimal example into our working directory:

In [23]:
from urllib.request import urlretrieve

urls = {
    "data_raw.csv": "https://storage.googleapis.com/open-tyndp-data-store/workshop-02/data_raw.csv",
}
for name, url in urls.items():
    print(f"Retrieving {name} from GCP storage.")
    urlretrieve(url, name)
print("Done")

Retrieving data_raw.csv from GCP storage.
Done


#### The `Snakefile` and `rules`

The rules need to be defined in a so-called `Snakefile` that sits in your current working directory. For our minimal example, the `Snakefile` looks like this:

> ```Snakemake
> # SPDX-FileCopyrightText: Open Energy Transition gGmbH
> #
> # SPDX-License-Identifier: MIT
> 
> rule all:
>     input:
>         "data_sorted.csv"
> 
> rule filter_data:
>     input:
>         "data_raw.csv"
>     output:
>         "data_filtered.csv"
>     script:
>         "filter_data.py"
> 
> rule sort_data:
>     input:
>         "data_filtered.csv"
>     output:
>         "data_sorted.csv"
>     script:
>         "sort_data.py"
> 
> ```
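The scripts referenced by the `script:` directives are not shown here. As a rough idea of what such a script could look like, the following is a hypothetical stand-in for `filter_data.py`, not the workshop's actual script. It assumes a numeric column named `value` and a threshold of `0.0`, both invented for illustration. What it does rely on is that Snakemake makes a `snakemake` object available inside Python scripts run via `script:`, with `snakemake.input` and `snakemake.output` holding the paths declared in the rule.

```python
import csv


def filter_rows(src, dst, column="value", threshold=0.0):
    """Copy rows whose `column` exceeds `threshold` from src to dst.

    The column name and threshold are hypothetical defaults.
    """
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row[column]) > threshold:
                writer.writerow(row)


# When run through a `script:` directive, Snakemake injects a `snakemake`
# object. Outside Snakemake the name is undefined and this branch is skipped.
if "snakemake" in globals():
    filter_rows(snakemake.input[0], snakemake.output[0])
```

Keeping the logic in a plain function and only reading the paths from the `snakemake` object at the bottom makes the script easy to test outside of a workflow.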

#### Calling a workflow

You can then execute the workflow by asking for the target file `data_sorted.csv` or any intermediate file. For an actual (non-dry) local run, Snakemake also needs to know how many cores it may use, here set via `--cores 1`:
```
snakemake data_sorted.csv --cores 1
```

Alternatively, you can execute the workflow by calling the rule that produces those files:
```
snakemake sort_data --cores 1
```

Or you can call the common rule `all`, which can be used to execute the whole workflow. It takes the final workflow output as its input and thus requires all rules it depends on to be run as well:
```
snakemake all --cores 1
```

The `-n` flag executes a `dry-run`. It is recommended to always execute a `dry-run` before actually running the workflow. A `dry-run` does not execute any job; it only prints the jobs that would be run, together with the reason for each, so that you can inspect the planned workflow first.
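A dry-run reports, for each job, a reason why it needs to run. The classic Make-style logic behind such a decision can be sketched as follows: a job runs if one of its outputs is missing or if one of its inputs is newer than its outputs. This is a simplified illustration under the assumption that only file existence and modification times matter; recent Snakemake versions can also take other triggers into account, such as changes to a rule's code or parameters. The function name `needs_run` and the reason strings are made up for this sketch.

```python
import os


def needs_run(inputs, outputs):
    """Return a reason string if the job must run, otherwise None.

    Assumes that all input files already exist.
    """
    if not outputs:
        return None
    missing = [f for f in outputs if not os.path.exists(f)]
    if missing:
        return "Missing output files: " + ", ".join(missing)
    # An input modified after the oldest output makes the outputs stale.
    oldest_output = min(os.path.getmtime(f) for f in outputs)
    updated = [f for f in inputs if os.path.getmtime(f) > oldest_output]
    if updated:
        return "Updated input files: " + ", ".join(updated)
    return None
```

For example, `needs_run(["data_raw.csv"], ["data_filtered.csv"])` returns a "Missing output files" reason as long as `data_filtered.csv` has not been created yet.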

Let's try this out and investigate the output:

In [3]:
! snakemake all -n

Set parameter Username
Academic license - for non-commercial use only - expires 2026-04-14
host: MacBook-Pro-181.home
Building DAG of jobs...
Job stats:
job            count
-----------  -------
all                1
filter_data        1
sort_data          1
total              3

[Wed Jun 11 14:55:10 2025]
rule filter_data:
    input: data_raw.csv
    output: data_filtered.csv
    jobid: 2
    reason: Missing output files: data_filtered.csv
    resources: tmpdir=<TBD>

[Wed Jun 11 14:55:10 2025]
rule sort_data:
    input: data_filtered.csv
    output: data_sorted.csv
    jobid: 1
    reason: Missing output files: data_sorted.csv; Input files updated by another job: data_filtered.csv
    resources: tmpdir=<TBD>

[Wed Jun 11 14:55:10 2025]
rule all:
    input: data_sorted.csv
    jobid: 0
    reason: Input files updated by another job: data_sorted.csv
    resources: tmpdir=<TBD>

#### Visualizing the `DAG` of a workflow

You can also visualize the `DAG` of jobs using the `--dag` flag and the Graphviz `dot` command. This will not run the workflow but only create the visualization:
```
snakemake all --dag | dot -Tsvg > dag.svg
```
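The pipe works because `--dag` writes a textual graph description in Graphviz's DOT language, which `dot` then renders into an image. As a rough illustration of that text format, the sketch below writes a DOT description for the three jobs of the minimal example by hand. The helper `to_dot` is hypothetical, and Snakemake's real output carries additional attributes such as labels and styling.

```python
def to_dot(edges, name="workflow"):
    """Render a list of (upstream, downstream) job pairs as DOT text."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# Job dependencies of the minimal example workflow.
edges = [("filter_data", "sort_data"), ("sort_data", "all")]
print(to_dot(edges))
```

Saving the printed text to a file and passing it to `dot -Tsvg` would produce a small graph with two arrows.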

In [None]:
! snakemake all --dag | dot -Tpng > dag.png

Building DAG of jobs...


Alternatively, you can visualize a filegraph like the figure above, which also includes information about the inputs and outputs of each rule.

You can reproduce the figure from above with the following command:
```
snakemake all --filegraph | dot -Tsvg > filegraph.svg
```

In [None]:
! snakemake all --filegraph | dot -Tsvg > filegraph.svg

Building DAG of jobs...


### Task 1: Executing a workflow with Snakemake

a) For our minimal example, execute a `dry-run` to produce the intermediate file `data_filtered.csv`.

b) Execute the workflow and investigate what happens if you try to execute the workflow again.

c) Delete the final output file `data_sorted.csv` and investigate what happens if you try to execute the workflow again.

d) Import the raw input data file `data_raw.csv` using pandas and save it again overwriting the original file. Investigate what happens if you try to execute the workflow again.

e) Finally, open the `Snakefile` and add a second rule that filters the file `data_raw_2.csv` using the same script as the `filter_data` rule. Add the output of this new rule as a second input to the `sort_data` rule.

### Using Snakemake to launch the open-TYNDP workflow

Let's start by cloning the `open-tyndp` GitHub repository into our working directory...

In [1]:
! git clone https://github.com/open-energy-transition/open-tyndp.git

Cloning into 'open-tyndp'...
remote: Enumerating objects: 31153, done.
remote: Counting objects: 100% (580/580), done.
remote: Compressing objects: 100% (292/292), done.
remote: Total 31153 (delta 470), reused 309 (delta 288), pack-reused 30573 (from 3)
Receiving objects: 100% (31153/31153), 107.08 MiB | 32.96 MiB/s, done.
Resolving deltas: 100% (23997/23997), done.


We now need to change our working directory to this new directory:

In [3]:
import os

In [4]:
os.chdir('open-tyndp')

Let's check that we are indeed in the new directory now:

In [5]:
os.getcwd()

'/Users/daniel/Desktop/Work/OET/Projects/open-tyndp/code/open-tyndp-workshops/open-tyndp-workshops/open-tyndp'

We can now use Snakemake to call some of the rules to produce outputs with the `open-tyndp` PyPSA model. 

We will use the test configuration file and schedule a dry-run with `-n` as we only want to investigate the DAG of the workflow:

In [None]:
! snakemake -call all --configfile config/test/config.tyndp.yaml -n

Set parameter Username
Academic license - for non-commercial use only - expires 2026-04-14
Config file config/config.default.yaml is extended by additional config specified via the command line.
Config file config/plotting.default.yaml is extended by additional config specified via the command line.
Config file config/config.private.yaml is extended by additional config specified via the command line.
The flag 'directory' used in rule clean_tyndp_demand is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
host: MacBook-Pro-181.home
Building DAG of jobs...
Job stats:
job                                   

:::{note}
If you are executing this notebook on your local machine, you can also use the `conda` package manager to install the `open-tyndp` environment and run the workflow instead of dry-runs:
```
conda env create --file envs/<YourSystemOS>-pinned.yaml
```
:::

### Task 2: Adjusting the Open-TYNDP workflow with the configuration file

a) Make some changes in the configuration file and call another **dry-run** of the `open-tyndp` model to see how the workflow changes.
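To build some intuition for what "extending" the default configuration means, the sketch below merges a set of overrides into a nested dictionary of defaults, assuming that nested keys are merged recursively and that override values win. The helper `deep_update` and all configuration keys shown (`run`, `name`, `shared`, `solver`) are hypothetical and do not necessarily exist in the `open-tyndp` configuration.

```python
import copy


def deep_update(base, overrides):
    """Return a copy of `base` with `overrides` merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_update(merged[key], value)
        else:
            # Otherwise the override replaces the default value.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults and overrides, for illustration only.
defaults = {"run": {"name": "default", "shared": False}, "solver": "highs"}
overrides = {"run": {"name": "test"}}

print(deep_update(defaults, overrides))
# → {'run': {'name': 'test', 'shared': False}, 'solver': 'highs'}
```

Only the keys that appear in the overrides change; everything else keeps its default value, which is why a test configuration file can stay short.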