# Architecture of PyPSA-Earth

This notebook describes the architecture of the PyPSA-Earth software.
In particular, the document will go through
1. **Introduction to the workflow**: the software and its special features
2. **Organization of the package**: the structure of the folder and relevant files
3. **Snakemake**: the tool used to automate the workflow
4. **Configuration file**: flexibility and customizability in a single file

## 1. Introduction to the workflow

### 1.1. Overview and scope

The PyPSA-Earth model aims to be the first open-source global energy system model with data in high spatial and temporal resolution.
The main objective of the package is to address socio-economic and technical challenges, supporting policy makers and technical experts in developing electrification master plans, planning renewable energy projects and designing policies.

This is particularly important in developing countries: PyPSA-Earth aims to provide tools that support a fast transition of any economy to a modern, green and strong industrial and commercial powerhouse, based on the will of its policy makers and population.

![PyPSA africa introduction](./images/pypsa_africa_intro.png "PyPSA africa introduction")

### 1.2. Special features

In this subsection, we recall the special features that PyPSA-Earth provides, to better explain why specific architectural approaches have been selected.

#### 1.2.1. Application-oriented features

First of all, PyPSA-Earth has been and is being developed to address the following technically oriented requirements:

- **Investment and dispatch optimization**: investment planning and dispatch analyses are a must-have for energy models to support policy makers and utilities to properly address the technical, social and economic implications of the future needs of the system, while coping with security and adequacy requirements.
  PyPSA-Earth shall provide a unique co-planning and dispatch tool for supporting investment decisions in a robust and verifiable manner to facilitate the energy analysis also for entities with limited resources.
  In particular, the tool will enable a co-planning and co-dispatch of multi-regional areas, thus supporting cooperation among different countries, which is a remarkable novelty.
  By using accurate dispatch analyses, utilities can also understand the bottlenecks of the current system and plan ahead.

  *To cope with this requirement, specific scripts and 'rules' shall be developed to perform such analyses*
  <br><br>

- **Scenario analysis**: scenarios play a significant role in addressing future needs and keeping the system reliable, adequate and resilient.
  Demand growth scenarios as well as the production of renewable sources, accounting for climate-change effects, shall be easy to address.
  
  *To cope with this requirement, configuration parameters shall be available to users to perform the desired analyses*
  <br><br>

- **Adjustable resolution and grid modelling**: the tool shall make it easy to perform detailed analyses focused on specific small regions (e.g. spatial resolution of provinces for country-specific analyses), as well as to analyze entire continents using a coarser spatial resolution (e.g. spatial resolution of regions for continent-wide analyses).
  Such adjustable parameters shall cover both the spatial scope and resolution, such as which countries the analysis focuses on, and the temporal resolution, including the specific time horizon of the optimization.
  
  *To cope with this requirement, configuration parameters shall be available to calibrate the spatial and temporal resolution, as well as the region of interest*
  <br><br>

- **Multi-horizon optimization** (see [PyPSA-EUR-SEC](https://pypsa-eur-sec.readthedocs.io/en/latest/)): building on previous experience, the model shall be able to perform multi-year and multi-horizon analyses to successfully support policy-oriented measures and cope with the socio-economic dynamics of the system.
  This is particularly important in developing countries, where population growth is significant and socio-economic transitions can be fast.
  
  *To cope with this requirement, configuration parameters and scripts shall be available to users to perform multi-horizon analyses*
  <br><br>

- **Sector-coupled optimization** (see [PyPSA-EUR-SEC](https://pypsa-eur-sec.readthedocs.io/en/latest/)): energy is the backbone of any modern economy; as such, PyPSA-Earth shall provide analyses of the whole energy system rather than focusing on electricity only.
  However, electricity is and will remain a major energy vector; the first development of PyPSA-Earth therefore focuses on electricity, to be extended later to multiple energy vectors.
  
  *To cope with this requirement, scripts shall be available to users to perform sector-coupled analyses*.
  <br><br>

#### 1.2.2. User-oriented features

  PyPSA-Earth, however, is not only a technically oriented tool: its development has also been driven by the goal of a good user experience.

- **Easy to learn and use**: barriers to understanding and using the code shall be minimized
  
  *To cope with this requirement, PyPSA-Earth shall be developed in Python*. <img src="./images/Python_logo.png" alt="Python" height="40"/>
  <br><br>

- **Accessible, transparent, flexible for incorporating new features and data**
  
  *To cope with this requirement, PyPSA-Earth shall be open-source with a modular structure based on several files, each with a clear scope*.
  <br><br>
- **Customizability**: the code shall be easy to customize: add new features, new methodologies, new data, etc.

  *To cope with this requirement, each file of PyPSA-Earth shall have an internal modular structure composed of simple functions, one per feature*.
  <br><br>

- **Support parallel co-editing**: since the PyPSA-Earth community aims to grow large, parallel co-editing shall be supported from the beginning.

  *To cope with this requirement, PyPSA-Earth is available on GitHub and has a file-wise and architecture-wise modular structure to avoid conflicts*.  <img src="./images/Github_logo.png" alt="GitHub" height="80"/>
  <br><br>

- **Automated reproducible workflow**: building and solving the model involves many different steps and methodologies.
  To simplify the user experience and enable replicability, the entire workflow shall be automated.

  *To cope with this requirement, PyPSA-Earth relies on snakemake, a workflow tool that simplifies the development and use of procedures composed of multiple steps*.  <img src="./images/snakemake_logo.png" alt="Snakemake" height="25"/>
  <br><br>

### 1.3. The Workflow

The workflow is structured in 6 main parts as summarized below.
In particular, the entire workflow is automated and executed using `snakemake`.

![Workflow overview](./images/workflow_overview.png "Workflow overview")

1. **Download data**: the first phase prepares the main raw data needed to execute the code.
   In this preliminary phase, PyPSA-Earth automatically downloads open-source datasets and a pre-compiled data bundle needed for the execution.
   In particular, some input data apply to multiple countries, whereas others are country-specific.
   Where possible, data are downloaded automatically from the original source; when that is not possible, a processed data bundle is downloaded from Google Drive to fill the gap.
   <br><br>

2. **Filter data**: most raw data contain NaN values and the structure of the dataset does not comply with the PyPSA requirements.
   Data are properly cleaned so that the rest of the workflow can successfully process them in a clean and repeatable way.
   <br><br>

3. **Populate data**: the cleaned data are then processed to generate the time series and inputs for building the specific PyPSA model that is used for the simulations.
   In this phase, different methodologies are executed depending on the configuration file.
   <br><br>

4. **Create network model**: once all inputs are generated, the network model is created.
   <br><br>

5. **Solve model**: the model is solved and results are retrieved.
   <br><br>

6. **Summary and plots**: finally, result summaries and plots are generated [not yet ready].
   <br><br>

### 1.4. The Work Packages

In PyPSA-Earth, five main work packages are in place to tackle the continent-wide challenges:

- **WP1. Demand modelling**: to characterize the demand profiles for each location, both in space and time
- **WP2. Conventional generator modelling**: to define the type of generators that are installed in each substation
- **WP3. RES modelling**: to characterize the available renewable production in space and time, given technical, social, and land constraints
- **WP4. Land coverage constraint modelling**: to identify limits and constraints on where assets can or cannot be installed
- **WP5. Network and substation modelling**: to draw the model of the network that feeds the PyPSA modelling suite

These work packages are complemented and further enhanced by the last WP of the PyPSA meets Earth initiative, which focuses on filling the gaps in the available data.

- **WP6. Data creation and validation**: when data are scarce,
  we aim to fill the gaps and update the datasets by using high-resolution maps. The AI-based detection is tackled in the package
  [detect_energy](https://github.com/pypsa-meets-earth/detect-energy)

## 2. Organization of the package

The PyPSA-Earth package is organized in folders, each with a specific scope.
The following subsections describe the organization of the main folder and its elements in detail.



### 2.1. Folders

- ``data``: Includes input data that are used by the ``snakemake`` rules.

  - ``data``: includes mainly raw data as downloaded from online databases.
      

    - ``data\copernicus``: it contains the raw land cover data as available from the [Copernicus database](https://land.copernicus.eu/). It is used in the `build_renewable_profiles` rule to quantify which land areas are available for the installation of renewable resources, e.g. renewable assets may not be installed on arable land.
    - ``data\eez``: it contains the dataset of the Exclusive Economic Zones (EEZ) available from [Marine Regions](https://www.marineregions.org/downloads.php).
      This file is used in the rule `build_shapes` to identify the marine region of each country and provide the shapes of the maritime regions, which can be used, for example, to estimate the offshore renewable potential.
    - ``data\gadm``: it contains the shapes of the administrative zones by country (e.g. regions, districts, provinces, ...), depending on the level of resolution requested in the configuration file.
      The data in this folder are automatically populated by the `build_shapes` rule, which downloads them from the [gadm website](https://gadm.org/data.html)
    - ``data\GDP``: raster dataset of the Gross Domestic Product (GDP) by arcs of the world, as available from [DRYAD](https://datadryad.org/stash/dataset/doi:10.5061/dryad.dk1j0)
    - ``data\gebco``: elevation dataset as available from [GEBCO](https://www.gebco.net/data_and_products/gridded_bathymetry_data/); similarly to the copernicus data, it is used in the `build_renewable_profiles` rule
    - ``data\hydrobasins``: datasets of polygons describing the shapes of watershed boundaries and basins, as available from [HydroBASINS](https://www.hydrosheds.org/page/hydrobasins).
      These data are used in the `build_renewable_profiles` rule to estimate the renewable production of hydro sources, when the rule is run for the hydro resource
    - ``data\landcover``: it contains the shapes of the world's protected areas, which are needed to identify the areas where no (renewable) assets can be installed.
      These data are used in the `build_renewable_profiles` rule
    - ``data\WorldPop``: raster dataset of the population by arc, as automatically downloaded by the `build_shapes` rule from [WorldPop](https://www.worldpop.org/)

  - ``resources\osm\raw``: includes a series of folders that contain the raw OpenStreetMap (OSM) data as downloaded from Geofabrik and processed by the `download_osm_data` rule.
      

    The raw OSM data in *.pbf* format are stored in the folder ``scripts\osm\{region}\pbf``, where `region` denotes a continent, e.g. Africa, or a macro region, e.g. "central_america", to which the countries of interest belong. The pbf files contain the entire OSM data for each country; the specific network information related to generators, substations, lines and cables is extracted and stored as geojson files in the folder ``scripts\osm\{region}\Elements``. All network data (generators, substations, lines and cables) for each country are stored in geojson files in the folder ``scripts\osm``.

  - ``resources\osm\clean``: includes the cleaned OSM network data that are the output of the `clean_osm_data` rule (script `osm_data_cleaning.py`), which processes the raw OSM data to obtain cleaned datasets of all the network assets, namely generators, substations, lines and cables.
      

  - ``resources\base_network``: this folder contains the output of the ``osm_build_network`` rule, that is, the cleaned OSM network data after additional refinement, so as to create a solid and robust network model.
      
    
  - `data\costs.csv`: csv file containing the costs of the technologies

  - ``data\ssp2-2.6``: it contains the demand data scenarios obtained by using the [GEGIS model](https://github.com/niclasmattsson/GlobalEnergyGIS).
    The folder contains several subfolders, one for each yearly scenario (for the years 2030, 2040, 2050 and 2100), using different data sources (era5_2011, era5_2013 and era5_2018) and for selected regions (Africa, Asia and SouthAmerica), for the purpose of this package.
    
- ``resources``: Stores intermediate results of the workflow which can be picked up again by subsequent rules.
      
- ``cutouts``: it contains the weather data cutouts used to compute the time series of renewable energy production for each area of the considered region.
  The files are generated using [Atlite](https://github.com/PyPSA/atlite).
  <br><br>

- ``scripts``: Includes all the Python scripts executed by the ``snakemake`` rules; each script typically corresponds to its namesake rule and vice versa.
      
  ![Folder scripts](./images/scripts_folder.png "Folder scripts")
  <br><br>

- ``notebooks``: Stores useful notebooks to investigate the raw data, the intermediate data and the results, as well as to generate plots.
      
  ![Folder notebooks](./images/notebooks_folder.png "Folder notebooks")

  <br><br>
- ``networks``: Stores intermediate, unsolved stages of the PyPSA network that describes the energy system model.
      
  ![Folder networks](./images/networks_folder.png "Folder networks")

- ``results``: Stores the solved PyPSA network data, summary files and plots.
      
  ![Folder results](./images/results_folder.png "Folder results")

- ``test``: folder that contains the configuration files used in the [Continuous Integration (CI) on GitHub](https://docs.github.com/en/actions/automating-builds-and-tests/about-continuous-integration). Whenever a commit or a pull request occurs on GitHub, the CI process starts to verify that the code is not broken. In our case, the CI executes a `snakemake` workflow using `test/config.test1.yaml` as configuration file and verifies that the entire workflow runs without issues.
      
  ![Folder test](./images/tests_folder.png "Folder test")

- ``envs``: Stores the conda environment files to successfully run the workflow.
      
  ![Folder envs](./images/envs_folder.png "Folder envs")
  
- ``benchmarks``: Stores ``snakemake`` benchmarks.
- ``logs``: Stores log files about solving, including the solver output, console output and the output of a memory logger.

### 2.2. Package files

![Organization files](./images/organization_files.png "Organization files")

- `Snakefile`: this file contains the list of rules for the `snakemake` procedure.
- `config{any}.yaml`
  - `config.yaml`: this is the main configuration file used to guide the entire workflow procedure, from the countries of interest to the methodologies used for the demand and resource assessment.
  - `config.default.yaml`: when no `config.yaml` file is found, then `snakemake` automatically reads this file and creates a new copy with name `config.yaml` to run the workflow.
  - `config.tutorial.yaml`: this is a tutorial configuration file used for teaching purposes and simple debugging.
- `.gitignore`: this file indicates which folders should and should not be tracked by `git`.
  The `git` tool, in fact, makes it easy to track the differences between the locally stored copy of the repository and the version of the package stored on GitHub. Since we work on developing an automated workflow that can reproduce the results from the same inputs, git shall track only the workflow procedures and ignore the data-related inputs.
- `.prettierignore`: similarly to `.gitignore`, this file tells the `prettier` code formatter to ignore specific files: in this package, all `yaml` files are ignored
- `readthedocs.yaml`: configuration file to guide the requirements for building the documentation

## 3. Snakemake

[Snakemake](https://snakemake.readthedocs.io/en/stable/) is a "workflow management system to create reproducible and scalable data analyses".

In PyPSA-Earth it is a major dependency: it automatically creates an ordered set of actions that generate an output, by means of `rules`.
In PyPSA-Earth, the file named `Snakefile` contains the definition of all the rules used in the package; in the following, we use parts of that file as examples to explain how snakemake works.

### 3.1. What is a rule?

A snakemake `rule` declares a specific action.
Each rule has a `name` and a series of `directives` that characterize its needs.
In this subsection, we give a general introduction to rules; for more information, please check out the [snakemake documentation](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html).

To describe a sample rule, we focus on the rule `clean_osm_data` that performs the data cleaning of the raw input data downloaded from OpenStreetMap.

``` python
rule clean_osm_data:
    input:
        cables="resources/osm/raw/africa_all_raw_cables.geojson",
        generators="resources/osm/raw/africa_all_raw_generators.geojson",
        lines="resources/osm/raw/africa_all_raw_lines.geojson",
        substations="resources/osm/raw/africa_all_raw_substations.geojson",
        country_shapes='resources/shapescountry_shapes.geojson',
        offshore_shapes='resources/shapesoffshore_shapes.geojson',
    output:
        generators="resources/osm/clean/africa_all_generators.geojson",
        generators_csv="resources/osm/clean/africa_all_generators.csv",
        lines="resources/osm/clean/africa_all_lines.geojson",
        substations="resources/osm/clean/africa_all_substations.geojson",
    log: "logs/clean_osm_data.log"
    script: "scripts/osm_data_cleaning.py"
```

In particular, it is worth noticing that:

- Each rule is specified with the keyword `rule`, followed by its name. The depicted example focuses on the rule named `clean_osm_data`.

- After the rule definition, indentation is used to declare the so-called `directives`. In the example, the directives are `input`, `output`, `log` and `script`
  - the directive `input` specifies the input data needed for the execution of the code
  - the directive `output` specifies the data produced by the rule
  - the directive `script` specifies the (Python) script that is executed for the rule, which uses the inputs and produces the outputs as specified by the corresponding directives. In this example, the rule named `clean_osm_data` executes the script `osm_data_cleaning.py` located in the folder `scripts`. **Note: this implies that the current folder when using snakemake shall be the main parent directory of the package**
  - the directive `log` indicates the file where the logging output is stored

- Directives such as `input` and `output` may contain a list of entries; commas are used to separate the different entries
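When a rule uses the `script` directive, Snakemake makes an object named `snakemake` available inside the Python script, so the script can refer to files by the names given in the rule (e.g. `snakemake.input.lines`). The sketch below mimics that object with a `SimpleNamespace` so that it runs outside a workflow; the stand-in object and the helper `describe_io` are hypothetical and do not reproduce the actual content of `osm_data_cleaning.py`.

```python
from types import SimpleNamespace

# Stand-in for the object that Snakemake injects into a script
# (assumption: only the attributes used in this sketch are mimicked).
snakemake = SimpleNamespace(
    input=SimpleNamespace(
        lines="resources/osm/raw/africa_all_raw_lines.geojson",
        substations="resources/osm/raw/africa_all_raw_substations.geojson",
    ),
    output=SimpleNamespace(
        lines="resources/osm/clean/africa_all_lines.geojson",
        substations="resources/osm/clean/africa_all_substations.geojson",
    ),
)


def describe_io(smk):
    """Pair each named input with the output of the same name (hypothetical helper)."""
    return {
        name: (path, getattr(smk.output, name))
        for name, path in vars(smk.input).items()
    }


io_map = describe_io(snakemake)
for name, (src, dst) in io_map.items():
    print(f"{name}: {src} -> {dst}")
```

Because the script only sees named inputs and outputs, the file paths can be changed in the rule without touching the script.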

### 3.2. How are rules chained?

As stated above, a rule defines the inputs and outputs of a given action and specifies how to perform that action.
If we scout the Snakefile of PyPSA-Earth, however, a large number of rules are defined: how are they executed?

**Snakemake chains the execution of the rules depending on the input-output relationships**.

To clarify the above, let's now consider the rule `download_osm_data` that automatically downloads raw data from OpenStreetMap (OSM).

``` python
rule download_osm_data:
    output:
        cables="resources/osm/raw/africa_all_raw_cables.geojson",
        generators="resources/osm/raw/africa_all_raw_generators.geojson",
        generators_csv="resources/osm/raw/africa_all_raw_generators.csv",
        lines="resources/osm/raw/africa_all_raw_lines.geojson",
        substations="resources/osm/raw/africa_all_raw_substations.geojson",
    log: "logs/download_osm_data.log"
    script: "scripts/osm_pbf_power_data_extractor.py"
```

According to what was stated in the previous section:
- the name of the rule is `download_osm_data`
- the Python script related to the rule is `osm_pbf_power_data_extractor.py` in the folder `scripts`
- the directive `output` states a list of outputs
- **there is no `input` directive**: this means that no inputs are required to execute this rule!

Now, let's compare the outputs of the rule `download_osm_data` with the inputs of `clean_osm_data`.
It turns out that the outputs of the rule `download_osm_data` are needed in `clean_osm_data`, thus the latter rule shall be executed after the former.

![Comparing input-output](./images/compare_rules_chaining.png "Comparing input-output")

**Snakemake automatically detects these chaining elements and executes the rules accordingly.**

In particular, when multiple cores are available, rules may be executed in parallel if the input-output chaining allows it.
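The chaining idea can be illustrated with a small, self-contained sketch: each rule is linked to the rules that produce its inputs, and a topological sort gives a valid execution order. This is only an illustration under simplifying assumptions (a reduced rule table, no wildcards); Snakemake itself works backwards from the requested targets, and the function `execution_order` is hypothetical.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Simplified rule table: only one file per rule is kept for brevity.
rules = {
    "download_osm_data": {
        "input": [],
        "output": ["resources/osm/raw/africa_all_raw_lines.geojson"],
    },
    "clean_osm_data": {
        "input": ["resources/osm/raw/africa_all_raw_lines.geojson"],
        "output": ["resources/osm/clean/africa_all_lines.geojson"],
    },
}


def execution_order(rules):
    """Order rules so that every producer runs before its consumers."""
    # Which rule creates each file?
    producer = {f: name for name, r in rules.items() for f in r["output"]}
    # Map each rule to the rules that create its inputs; inputs without a
    # producer are assumed to exist on disk already.
    deps = {
        name: {producer[f] for f in r["input"] if f in producer}
        for name, r in rules.items()
    }
    return list(TopologicalSorter(deps).static_order())


order = execution_order(rules)
print(order)
```

Rules that do not depend on each other have no fixed relative order in such a graph, which is what allows them to run in parallel.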

### 3.3. How to run snakemake

The practical use of snakemake is described in detail in the notebook `how_to_execute_the_workflow.ipynb`.
In general, commands have the following form and are run in the conda environment:

`snakemake <options>`

### 3.4. Verify the workflow chaining (dry-run and dag)

#### 3.4.1. Dry-run

To verify the list of rules that snakemake would execute, you can use the option `--dry-run` or `-n`.
With this option snakemake performs a *dry-run*: it identifies the chaining and the rules to run, but it does not execute them.

For example, the following command checks the workflow needed to produce the output of the rule `clean_osm_data`
``` batch
    snakemake --dry-run clean_osm_data
```
the following output is produced:

![Dry-run snakemake](./images/dry_run_snakemake.png "Dry-run snakemake")

It is worth noticing that to obtain the results of `clean_osm_data`, the `download_osm_data` is also executed, as discussed previously.

#### 3.4.2. Directed Acyclic Graph (DAG)

To clarify and verify the specific ordered chain that is generated by snakemake, it is possible to use the option `--dag`, which generates the description of the [Directed Acyclic Graph (DAG)](https://en.wikipedia.org/wiki/Directed_acyclic_graph) representing the ordered list of rules executed by the `snakemake` tool.

For example, the DAG of the execution shown in 3.4.1. can be obtained by using the following code
``` batch
    snakemake --dag clean_osm_data
```

and the result is shown in the following picture

![DAG snakemake](./images/dag_snakemake.png "DAG snakemake")

To visualize the DAG as an image, it is possible to combine snakemake with the [Graphviz tool](https://graphviz.org/), using the following command:

``` batch
snakemake --dag clean_osm_data | dot -Tpng -o dag_image_snakemake.png
```

The output is shown below and highlights the dependencies of the different rules

![DAG plot snakemake](./images/dag_image_snakemake.png "DAG plot snakemake")
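The `--dag` option prints the graph as text in the Graphviz DOT language, which is why it can be piped to `dot`. As a rough illustration of what such a description looks like, the sketch below renders the two-rule chain discussed above as DOT text; the function `to_dot` is hypothetical, and the real snakemake output additionally contains node identifiers and styling attributes.

```python
def to_dot(edges):
    """Render (producer, consumer) rule pairs as Graphviz DOT text."""
    lines = ["digraph workflow {"]
    for producer, consumer in edges:
        # One directed edge per dependency: the producer runs first.
        lines.append(f'    "{producer}" -> "{consumer}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot([("download_osm_data", "clean_osm_data")])
print(dot_text)
```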

### 3.5. Wildcards

In some cases, a rule shall apply to a variety of different files that share a common naming structure.
To generalize the application of a rule, the so-called `wildcards` can be used.
When wildcards are used, snakemake automatically resolves multiple file names according to where the wildcard is placed.

To clarify the above, let's consider the rule `simplify_network` used in the PyPSA-Earth workflow and let's notice the wildcard `simpl`.
When enclosed in curly braces, `{simpl}` can resolve to multiple values.

``` python
rule simplify_network:
    input:
        network='networks/elec.nc',
        tech_costs=COSTS,
        regions_onshore="resources/regions_onshore.geojson",
        regions_offshore="resources/regions_offshore.geojson"
    output:
        network='networks/elec_s{simpl}.nc',
        regions_onshore="resources/regions_onshore_elec_s{simpl}.geojson",
        regions_offshore="resources/regions_offshore_elec_s{simpl}.geojson",
        busmap='resources/busmap_elec_s{simpl}.csv'
    log: "logs/simplify_network/elec_s{simpl}.log"
    benchmark: "benchmarks/simplify_network/elec_s{simpl}"
    threads: 1
    resources: mem=4000
    script: "scripts/simplify_network.py"
```

For example, the string `networks/elec_s{simpl}.nc` can match both `networks/elec_sA.nc` and `networks/elec_s6.nc`.
Therefore, if the workflow needs multiple files with the structure `networks/elec_s{simpl}.nc`, snakemake automatically executes the `simplify_network` rule multiple times to generate all the needed outputs, one for each value of the wildcard.


**Note:** the value of a wildcard can be constrained using Python regular expressions, as specified in the keyword `wildcard_constraints` available at the beginning of the Snakefile:

``` python
wildcard_constraints:
    simpl="[a-zA-Z0-9]*|all",
    clusters="[0-9]+m?|all",
    ll="(v|c)([0-9\.]+|opt|all)|all",
    opts="[-+a-zA-Z0-9\.]*",
```
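To see how a constraint restricts the file names that a pattern can match, the sketch below converts a pattern containing wildcards into a regular expression with the standard `re` module. It is an illustration only: the helper `pattern_to_regex` is hypothetical and is not snakemake's implementation, and the fallback expression used for unconstrained wildcards is an assumption of this sketch.

```python
import re


def pattern_to_regex(pattern, constraints):
    """Turn 'a_{name}.nc' into a regex with one named group per wildcard."""
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text between wildcards must match exactly.
        regex += re.escape(pattern[pos:m.start()])
        name = m.group(1)
        # Unconstrained wildcards fall back to ".+" (assumption of this sketch).
        regex += f"(?P<{name}>{constraints.get(name, '.+')})"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    return re.compile(regex)


constraints = {"simpl": r"[a-zA-Z0-9]*|all"}
rx = pattern_to_regex("networks/elec_s{simpl}.nc", constraints)

for filename in [
    "networks/elec_s6.nc",
    "networks/elec_sA.nc",
    "networks/elec_s-6.nc",
]:
    match = rx.fullmatch(filename)
    print(filename, "->", match.group("simpl") if match else "no match")
```

The last file name is rejected because `-` is not allowed by the constraint on `simpl`.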

## 4. Configuration file

The configuration file `config.yaml` contains all the options that describe the specific workflow that snakemake shall run.
When that file is not available, it is created by copying the default file `config.default.yaml`.
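The copy-if-missing behaviour described above can be sketched in a few lines of Python. This is a minimal sketch of the idea, not the code used in the package; the function name `ensure_config` is hypothetical, and the demonstration runs in a temporary folder so that no real configuration file is touched.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_config(folder):
    """Create config.yaml from config.default.yaml if it does not exist yet."""
    config = Path(folder) / "config.yaml"
    if not config.exists():
        shutil.copyfile(Path(folder) / "config.default.yaml", config)
    return config


# Demonstration in a throw-away directory with a minimal default file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.default.yaml").write_text('countries: ["NG", "BJ"]\n')
    created = ensure_config(tmp)
    copied_text = created.read_text()

print(copied_text)
```

An existing `config.yaml` is never overwritten, so user edits survive subsequent runs.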

In the following, we discuss specific keywords that are useful for the discussion and for getting familiar with the code.

### 4.1. Keyword `countries`

The keyword `countries` in the configuration file specifies the countries of interest for the optimization.
Its value shall be a list of two-letter (alpha-2) codes, each one representing a country according to the [ISO nomenclature](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes). Region-specific values are also accepted.

For example, the default configuration file is executed on Nigeria ("NG") and Benin ("BJ"), that have been selected as good representatives for testing all the workflow features.
``` yaml
countries: ["NG", "BJ"]
```

Region-specific codes can also be used to execute the workflow on groups of countries. Currently, "Africa" has been tested most extensively and tests on other regions are ongoing.
``` yaml
countries: ["africa"]
```
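A quick sanity check of the `countries` list can catch common mistakes, such as writing a full country name instead of its code. The sketch below only checks the shape of each entry (two uppercase letters) and does not verify that the code is actually assigned in ISO 3166; the set `KNOWN_REGIONS` and the function `check_countries` are hypothetical and do not reflect the list of regions supported by the package.

```python
import re

# Hypothetical set of region labels; the labels actually supported are
# defined by the package, not by this sketch.
KNOWN_REGIONS = {"africa"}


def check_countries(entries):
    """Split entries into alpha-2 codes, region labels and unrecognized values."""
    codes, regions, unknown = [], [], []
    for entry in entries:
        if re.fullmatch(r"[A-Z]{2}", entry):
            codes.append(entry)
        elif entry.lower() in KNOWN_REGIONS:
            regions.append(entry)
        else:
            unknown.append(entry)
    return codes, regions, unknown


result = check_countries(["NG", "BJ", "africa", "Nigeria"])
print(result)
```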

### 4.2. Keyword `enable`

The keyword `enable` specifies whether or not to activate specific rules in the workflow.
The keyword contains a list of sub-keywords, each one having the value true or false.
Each sub-keyword name corresponds to the name of a rule; when its value is true, the corresponding rule is enabled and included in the workflow if needed.

``` yaml
enable:
  # prepare_links_p_nom: false
  retrieve_databundle: true
  download_osm_data: true
  # retrieve_cutout: true
  # retrieve_natura_raster: true
  # If "build_cutout" : true # requires cds API key https://cds.climate.copernicus.eu/api-how-to
  # More information https://atlite.readthedocs.io/en/latest/introduction.html#datasets
  build_cutout: true
  build_natura_raster: true  # If True, then build_natura_raster can be run
  # custom_busmap: false
```

For example, the `retrieve_databundle` sub-keyword is set to `true` in the example above.
This means that if snakemake does not find all the outputs of the `retrieve_databundle*` scripts, the scripts are automatically run.
Vice versa, if the same keyword is set to false, the rule is not included in the workflow.
Be aware, however, that the outputs of the script are needed by the workflow; thus, if you set the value to false, be sure to have downloaded the data beforehand.
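The effect of these flags can be sketched as a simple lookup on the configuration. The snippet below uses a plain Python dictionary with the same keys as the example above; the helper `is_enabled` is hypothetical, and treating a missing or commented-out key as disabled is an assumption of this sketch rather than the documented behaviour of the package.

```python
# Excerpt of the configuration as a Python dict (same keys as the example above).
config = {
    "enable": {
        "retrieve_databundle": True,
        "download_osm_data": True,
        "build_cutout": True,
        "build_natura_raster": True,
    }
}


def is_enabled(config, rule_name, default=False):
    """Return the flag of a rule, or `default` when the key is absent."""
    return bool(config.get("enable", {}).get(rule_name, default))


active = [name for name in config["enable"] if is_enabled(config, name)]
print(active)
# "custom_busmap" is commented out in the example, so the default applies.
print(is_enabled(config, "custom_busmap"))
```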


## 5. Environment file

To use PyPSA-Earth and install all its major package dependencies, environment files are provided in the folder `envs`.

![Folder envs](./images/envs_folder.png "Folder envs")

In particular, two main environment files are provided:

- `environment.yaml`: it contains the conda environment `pypsa-earth`, which is the main environment needed to run the PyPSA-Earth model. This is the main reference
- `environment.docs.yaml`: it contains the conda environment used to build the documentation locally. Users willing to contribute to the documentation (stored in the `doc` folder) shall install and use this environment when building the documentation.