# Running PyAeroval

PyAeroval is the most used feature of AeroTools. The point of PyAeroval is that the user, through a "simple" configuration file, can evaluate any combination of model and observation data, and display the result on the [Aeroval web page](https://aeroval.met.no/).

The configuration (config) file can either be a JSON file that is run with the command `pya aeroval <file path>`, or a Python file that calls PyAeroval with a configuration dictionary. In this tutorial we will use the latter method, as I think Python files are easier to work with.

We will start by looking at the structure of the configuration dictionary, and then we will look at how to run PyAeroval with said dictionary.

## Configuration Dictionary

The config dictionary is, as you can guess, a Python dictionary. It consists of three "parts":

1. Global options: A set of options for the evaluation, e.g. name, PI, output directory, etc.
2. Model config: A dictionary containing information about the models to be used
3. Observation config: A dictionary containing information about the observations to be used

All of this is contained in one dictionary of the form
```
CONFIG = {
    <global option key 1>: <global option value 1>,
    <global option key 2>: <global option value 2>,
    .
    .
    .
    <global option key N>: <global option value N>,
    "model_cfg": <dict with model configs>,
    "obs_cfg": <dict with observation configs>,
}
```
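
As a concrete but purely illustrative miniature of that layout, the sketch below builds such a dictionary with placeholder values. The option values and the empty model and observation entries are assumptions for illustration only; the real contents are built step by step in the following sections.

```python
# Illustrative skeleton only: the values are placeholders, not a working setup
CONFIG = {
    # Global options
    "proj_id": "workshop",
    "exp_id": "emep",
    "periods": ["2010"],
    # Model and observation configs are dictionaries of dictionaries
    "model_cfg": {"EMEP": {}},
    "obs_cfg": {"EBAS-m": {}},
}

# Everything that is not model_cfg or obs_cfg is a global option
global_options = {
    key: value
    for key, value in CONFIG.items()
    if key not in ("model_cfg", "obs_cfg")
}
print(sorted(global_options))
```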

We will see how this is done below.

### Global Options

Below you can see a typical set of options used in most of the runs I do with PyAeroval. We will start by defining some paths and names that should be unique for your configuration.

In [1]:
output_dir = "<output_dir>/data"        # Replace <output_dir> with your own output folder
coldata_dir = "<output_dir>/coldata"    # Folder for the colocated data files
exp_pi = "<your name>"
proj_id = "workshop"
exp_id = "emep"

In [2]:
CFG = dict(
    # Output directories
    json_basedir=output_dir,
    coldata_basedir=coldata_dir,

    # Run options
    reanalyse_existing=True,        # if True, existing colocated data files will be deleted
    raise_exceptions=True,          # if True, the analysis will stop whenever an error occurs 
    clear_existing_json=False,      # if True, deletes previous output before running

    # Map Options
    add_model_maps=False,           # Adds a plot of the whole map. Very slow!!!
    only_model_maps=False,          # Adds only plot above, without any other evaluation
    filter_name="ALL-wMOUNTAINS",   # Regional filter for analysis
    map_zoom="Europe",              # Zoom level. For EMEP, Europe is typically used
    regions_how="country",          # Calculates statistics for different regions. Typically "country" is used, but that does not work for satellite data

    # Time and Frequency Options
    ts_type="monthly",              # Colocation frequency (no statistics in higher resolution can be computed)
    freqs=["monthly", "yearly"],    # Frequencies that are evaluated
    main_freq="monthly",            # Frequency that is displayed when opening webpage
    periods=["2010"],               # List of years or periods of years that are evaluated. E.g. "2005" or "2001-2020"
    

    # Statistical Options
    obs_remove_outliers=False,
    model_remove_outliers=False,
    colocate_time=True,
    zeros_to_nan=False,
    weighted_stats=True,
    annual_stats_constrained=True,
    harmonise_units=True,
    resample_how={"vmro3max": {"daily": {"hourly": "max"}}}, # How to handle Ozone. Used all the time in EMEP

    # Experiment Metadata
    exp_pi=exp_pi,
    proj_id=proj_id,
    exp_id=exp_id,
    exp_name="Evaluation of EMEP data",
    exp_descr=("Evaluation of EMEP data"),
    public=True,
)

We will add one more option before looking at the models and observations: `min_num_obs`. This sets how many data points are needed when averaging to a coarser frequency. If that constraint is not met, the coarser data point is set to `NaN`.

In [4]:
DEFAULT_RESAMPLE_CONSTRAINTS = dict(
    yearly=dict(monthly=9),
    monthly=dict(
        daily=1,
        weekly=1,
    ),
    daily=dict(hourly=1),
)

CFG["min_num_obs"] = DEFAULT_RESAMPLE_CONSTRAINTS
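
To make the constraint concrete, here is a minimal sketch of the idea in plain Python. It is not PyAeroval's implementation: the helper `yearly_mean` and the sample values are hypothetical, and only the threshold of 9 months is taken from the dictionary above.

```python
import math

MIN_MONTHS_PER_YEAR = 9  # same value as yearly=dict(monthly=9) above


def yearly_mean(monthly_values, min_count=MIN_MONTHS_PER_YEAR):
    """Average monthly values to one yearly value, or NaN if too few are valid."""
    valid = [v for v in monthly_values if not math.isnan(v)]
    if len(valid) < min_count:
        return math.nan
    return sum(valid) / len(valid)


nan = math.nan
enough = [1.0] * 9 + [nan] * 3    # 9 valid months: constraint met
too_few = [1.0] * 8 + [nan] * 4   # 8 valid months: constraint not met

print(yearly_mean(enough))    # 1.0
print(yearly_mean(too_few))   # nan
```

With a threshold of 9, a year with three missing months still gets a yearly value, while a year with four missing months does not.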

### Setting up Filters

In many, or maybe even most, cases we want to filter our observations. So before we start with models and observations, we are going to define our filters.

The most general, and most used filters are longitude, latitude and altitude. Below we can see that we have a range of accepted values for these. This is our base filter.

EBAS (see below) needs some extra filters, like the data level, and whether or not to set flagged data to NaN. The EBAS filter consists of these options together with the base filter.

EEA (see below) expects other filters. These describe where the stations are located, e.g. whether they are rural or near cities. They are added to the base filter to make the EEA filter dictionary.

In [1]:
EEA_RURAL_FILTER = {
    "station_classification": ["background"],
    "area_classification": [
        "rural",
        "rural-nearcity",
        "rural-regional",
        "rural-remote",
    ],
}

BASE_FILTER = {
    "latitude": [30, 82],
    "longitude": [-30, 90],
    "altitude": [-200, 5000],
}

EBAS_FILTER = {
    **BASE_FILTER,
    "data_level": [None, 2],
    "set_flags_nan": True,
}

EEA_FILTER = {
    **BASE_FILTER,
    **EEA_RURAL_FILTER,
}
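
Because the filters are plain dictionaries, it is easy to check what a merged filter contains. The sketch below repeats two of the definitions so that it runs on its own, and adds a hypothetical helper, `within_base_filter`, that mimics how a range filter could be interpreted (both bounds are assumed to be inclusive). The actual filtering happens inside PyAerocom and may differ in detail.

```python
BASE_FILTER = {
    "latitude": [30, 82],
    "longitude": [-30, 90],
    "altitude": [-200, 5000],
}
EEA_RURAL_FILTER = {
    "station_classification": ["background"],
    "area_classification": ["rural", "rural-nearcity", "rural-regional", "rural-remote"],
}

# ** unpacking copies the keys of both dictionaries into a new one
EEA_FILTER = {**BASE_FILTER, **EEA_RURAL_FILTER}
print(sorted(EEA_FILTER))


def within_base_filter(station, base_filter=BASE_FILTER):
    """Hypothetical check: is every coordinate inside its [min, max] range?"""
    return all(
        low <= station[key] <= high
        for key, (low, high) in base_filter.items()
    )


# Made-up station coordinates, for illustration only
inside = {"latitude": 60.0, "longitude": 10.0, "altitude": 200}
outside = {"latitude": 10.0, "longitude": 10.0, "altitude": 200}
print(within_base_filter(inside))    # True
print(within_base_filter(outside))   # False
```

Note that if the same key appears in both dictionaries, the one unpacked last wins, so the order of the `**` entries matters when filters overlap.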

### Model Configuration

While the main data format of AeroTools is Aerocom3 model data, I will focus on reading EMEP data, since we are the EMEP group. We start by defining a dictionary with all the information needed to define the model:

- A model ID. This should be named EMEP.
- The folder where the data is found. This folder should follow the scheme we looked at in the last tutorial.
- Which reader should be used for this data.

In [8]:
folder_EMEP = "/lustre/storeB/project/fou/kl/emep/People/danielh/projects/pyaerocom/workshop/emep/mod/2020"

EMEP = dict(
        model_id="EMEP",
        model_data_dir=folder_EMEP,
        gridded_reader_id={"model": "ReadMscwCtm"}, # Tells pyaerocom to use the EMEP reader instead of the default aerocom reader
    )

We can now add this dictionary to a common *model dictionary*, and then add that to our configuration dictionary.

In [10]:
MODELS = {
    "EMEP": EMEP,
}

CFG["model_cfg"] = MODELS

### Observations Configuration

Observations can either be observation networks already defined by PyAerocom, or custom networks. The problem with observations is that we don't have any conventions like those of Aerocom, so providing custom observations is not as easy as with models.

To solve this problem, the AeroTools team has created [Pyaro](https://github.com/metno/pyaro), an interface for creating observation readers that work with PyAerocom. How to create such readers is far outside the scope of this workshop. We will instead look at how to use a Pyaro reader made for reading PMF EBAS data. While we only show how to read PMF EBAS data, the method is general and, with small changes, applicable to many other types of data.

But before looking at that, we will look at two networks found in PyAerocom: EBAS and EEA.

EBAS and EEA are the two most used observation networks in AeroTools. This means that they are "known" by PyAerocom. By referring to the IDs of the networks, PyAerocom knows exactly how to read the data.

In [None]:
EEA = dict(
    obs_id="EEAAQeRep.v2",          # ID of EEA in AeroTools
    obs_vars=[
        "concpm10",                 # Variables to be used
    ],
    web_interface_name="EEA-rural", # The name which is shown on the web interface
    obs_vert_type="Surface",
    ts_type="monthly",              # Frequency of read observations. Evaluation can not be finer than this, for this network
    obs_filters=EEA_FILTER,         # Filters we made before
)

EBAS = dict(
    obs_id="EBASMC",                # ID of EBAS in AeroTools
    web_interface_name="EBAS-m",    # The name which is shown on the web interface
    obs_vars=[
        "concpm10",                 # Variables to be used
    ],
    obs_vert_type="Surface",        # Observation level
    ts_type="monthly",              # Frequency of read observations. Evaluation can not be finer than this, for this network
    obs_filters=EBAS_FILTER,        # Filters we made before
)

As we can see, this is a bit more complicated than the model configs, but with my comments the options should be more or less understandable. Note that again `obs_id` has to be a known network, and as with the models, a catalog of networks is in the works.

In the global options we defined a `min_num_obs`. We can do that here as well. If PyAeroval finds this option in the observation config, as well as in the global options, the one found in the observation config will be prioritized. There are a handful of such options, e.g. outlier limits.
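
As a sketch of what such an override could look like, the snippet below gives one observation entry its own, stricter `min_num_obs`. The entry `EBAS_STRICT`, the threshold values, and the helper `effective_min_num_obs` are hypothetical; the helper only illustrates the precedence rule described above and is not how PyAeroval resolves options internally.

```python
GLOBAL_CFG = {"min_num_obs": {"yearly": {"monthly": 9}}}

# Hypothetical observation entry with its own, stricter constraint
EBAS_STRICT = {
    "obs_id": "EBASMC",
    "obs_vars": ["concpm10"],
    "min_num_obs": {"yearly": {"monthly": 11}},
}

# Entry without an override: falls back to the global setting
EEA_DEFAULT = {"obs_id": "EEAAQeRep.v2", "obs_vars": ["concpm10"]}


def effective_min_num_obs(obs_entry, global_cfg):
    """Return the observation-level setting if present, else the global one."""
    return obs_entry.get("min_num_obs", global_cfg["min_num_obs"])


print(effective_min_num_obs(EBAS_STRICT, GLOBAL_CFG))
print(effective_min_num_obs(EEA_DEFAULT, GLOBAL_CFG))
```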

We are now left with our third way of defining an observation: Pyaro. For this we need an extra step, because Pyaro needs yet another configuration. We start by making that.

In [19]:
from pyaerocom.io.pyaro.pyaro_config import PyaroConfig

data_id = "nilupmfebas"
url = "/lustre/storeB/project/fou/kl/emep/People/danielh/projects/pyaerocom/workshop/emep/obs/EIMPs_winter2017-2018_data/EIMPs_winter_2017_2018_ECOC_Levo/"

config = PyaroConfig(
    name="pmf",
    data_id=data_id,
    filename_or_obj_or_url=url,
    filters={
        "variables": {
            "include": [
                "pm10#elemental_carbon#ug C m-3",
                "pm10#organic_carbon#ug C m-3",
            ]
        }
    },
    name_map={
        "pm10#elemental_carbon#ug C m-3": "elementalcarbon",
        "pm10#organic_carbon#ug C m-3": "organiccarbon",
    },
)

This may seem like we are adding another level of complexity on top of the already complex PyAeroval configuration, but it makes developing new observation readers much easier. The options in this configuration are

- data_id: name of the reader needed to read the data
- name: unique name chosen by the user. Readers with the same *data_id* might have to read from different sources, and therefore a unique name is needed
- filename_or_obj_or_url: the aptly named path to where the reader can find the data
- filters: there are multiple filters in Pyaro. The most important are *variables/include* and *variables/exclude*. See the [docs](https://pyaro.readthedocs.io/en/latest/) for more on filters
- name_map: each reader might have different names for variables. Here you can make a map between those names and the Aerocom names

We can now add this to an observation config.

In [20]:
PYARO = dict(
        obs_id=config.name,                                     # Must be set to the name found in the config
        obs_config=config,                                      # The pyaro config
        web_interface_name="Pyaro-m",                           # Name that is displayed on the webpage
        obs_vars=["elementalcarbon", "organiccarbon"],        # List of variables that is to be evaluated
        obs_vert_type="Surface",                                # Observation level
        ts_type="monthly",                                      # Frequency of read observations. Evaluation can not be finer than this, for this network
    )


We see that this is quite similar to our other observation configs, the only difference being `obs_id` and `obs_config`. Instead of defining the network with the `obs_id`, as we do for the other configs, we use `obs_config` to define it. Note that for PyAeroval to keep track of this observation, `obs_id` must be the same as the **name** defined in the Pyaro config. The way I've done this above is the easiest and safest.

We can now add all of this to our main configuration.

In [22]:
OBS_CFG = {
    "Pyaro-m": PYARO,
    "EBAS-m": EBAS,
    "EEA-m": EEA
}

CFG["obs_cfg"] = OBS_CFG
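
Before starting a long run, it can be worth printing a quick overview of the observation setup. The sketch below uses trimmed-down stand-ins for the entries defined above (only `obs_id` and `obs_vars` are kept) and lists every variable together with the entries that provide it. This is a convenience check of my own, not a part of PyAeroval.

```python
# Trimmed-down stand-ins for the observation entries defined above
OBS_CFG = {
    "Pyaro-m": {"obs_id": "pmf", "obs_vars": ["elementalcarbon", "organiccarbon"]},
    "EBAS-m": {"obs_id": "EBASMC", "obs_vars": ["concpm10"]},
    "EEA-m": {"obs_id": "EEAAQeRep.v2", "obs_vars": ["concpm10"]},
}

# Map each variable to the observation entries that provide it
networks_per_var = {}
for obs_name, entry in OBS_CFG.items():
    for var in entry["obs_vars"]:
        networks_per_var.setdefault(var, []).append(obs_name)

for var, networks in sorted(networks_per_var.items()):
    print(f"{var}: {', '.join(networks)}")
```

A variable that shows up here but is missing from the model output is a common reason for an entry to produce no results, so this overview is a cheap first thing to compare against the model files.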

## Running PyAeroval

We have now finally come to the point where we can run this config. While this can be done in this notebook, I recommend using the accompanying Python file instead. Here we will look at the last piece of code needed to run the configuration dictionary.

In [None]:
from pyaerocom.aeroval import EvalSetup, ExperimentProcessor
from pyaerocom import const

print(const.CACHEDIR)               # Prints where to find the caching folder. Not needed but this folder should be emptied now and then, so I like to see where it is


stp = EvalSetup(**CFG)              # Makes a setup object from the dict, that PyAeroval can use
ana = ExperimentProcessor(stp)      # Makes an experiment object
res = ana.run()                     # Runs the experiment


# Example of how to run an experiment with only certain models, obs, and variables. Often used to rerun only the parts of the experiment that have changed
# res = ana.run(model_name=["EMEP"], obs_name=["EEA-m"], var_list=["concpm10"])



### Running On PPI

Since a PyAeroval evaluation can be quite resource heavy, we recommend either submitting it as a job, or running the Python script on a compute node. This can be done as follows:

In [None]:
%%bash
# on PPI
qlogin -l h_rss=8G,mem_free=8G,h_data=8G # Logs into a compute node with 8GB memory

module load /modules/MET/rhel8/user-modules/fou-kl/aerotools/pya-v2024.05.1.EMEP.conda 

cd <path to the python script>
python pyaeroval_config.py

Or, when you are using a module that is not Anaconda based (see [tutorial 1](../1/what_is_aerotools.ipynb)):

In [None]:
%%bash
# on PPI
qlogin -l h_rss=8G,mem_free=8G,h_data=8G # Logs into a compute node with 8GB memory

module load /modules/MET/rhel8/user-modules/fou-kl/aerotools/pya-v2024.05.1.EMEP

cd <path to the python script>
pya_python pyaeroval_config.py

## Afterword

There are other options, both for the model and observation setups, as well as for the global setup. I've tried to make this config file as slim as possible, while still keeping it useful. It might seem complex and daunting, but once you have a working config file, you will for the most part just copy that file, only changing the model and observation setups.