# Prototype data generation

In this notebook we lay out an example workflow for the prototype cubing engine, which is the end goal of MS19 in the BmD project.

## Setup

To use the modules written for this project, we temporarily add the `src` directory to `sys.path`, so that no package installation is needed.

In [1]:
import sys
from pathlib import Path

# Add the src directory to sys.path
sys.path.append(str(Path().resolve().parents[1] / "src"))

from datasource.gbif import sql
from cube import bmd

INFO:Note: NumExpr detected 22 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO:NumExpr defaulting to 16 threads.


## Prototype Area of interest

For the prototype we will look at one of the Natura2000 sites in Belgium, the Sonian Forest and the surrounding sites. In the W3 T1 documentation for the BmD project a complete description of this area is given but we will provide a short overview in this notebook as well.

<div style="float: right; width: 50%; margin-left: 15px; text-align: center;">
  <img src="img/AOI.png" alt="Sonian Forest" style="width:100%;">
  <div style="font-size: 90%; color: gray; margin-top: 4px;">
    <em>Figure 1:</em> Sonian Forest and its surrounding Natura2000 areas
  </div>
</div>

<p>
  The Sonian Forest and its surrounding areas are characterized by a set of habitats, which can be found on the 
  <a href="https://natura2000.eea.europa.eu/" target="_blank">Natura2000</a> website. The areas of interest are:
</p>

<ol>
  <li>
    <strong>La Forêt de Soignes avec lisières et domaines boisés avoisinants et la Vallée de la Woluwe</strong> – 
    <em>complexe Forêt de Soignes - Vallée de la Woluwe</em>
    <ul>
      <li><strong>Area code:</strong> BE1000001</li>
      <li><strong>Protected under:</strong> the Habitats Directive</li>
      <li><strong>Area:</strong> 2066 ha</li>
      <li><strong>Protected:</strong> 5 species & 8 habitats</li>
    </ul>
  </li>
  <li>
    <strong>Sonian Forest</strong>
    <ul>
      <li><strong>Area code:</strong> BE2400008</li>
      <li><strong>Protected under:</strong> the Habitats Directive</li>
      <li><strong>Area:</strong> 2066 ha</li>
      <li><strong>Protected:</strong> 3 species & 9 habitats</li>
    </ul>
  </li>
  <li>
    <strong>Vallées de l'Argentine et de la Lasne</strong>
    <ul>
      <li><strong>Area code:</strong> BE31002C0</li>
      <li><strong>Protected under:</strong> both Birds and Habitats Directives</li>
      <li><strong>Area:</strong> 821.45 ha</li>
      <li><strong>Protected:</strong> 16 species & 14 habitats</li>
    </ul>
  </li>
</ol>
<p>
  Each habitat is characterized by a set of species that are indicative of its health. In total there are 
  <strong>211 species of interest</strong> for this area, which are described within the file 
  <code>prototypeNames.csv</code> located in the prototype script directory under the 
  <code>inp</code> folder.
</p>

<p>
  In addition to this, we also provide a file containing the invasive species that are known within the country. 
  This list, the <em>Global Register of Introduced and Invasive Species - Belgium</em>, can be found on 
  <a href="https://www.gbif.org/dataset/6d9e952f-948c-4483-9807-575348147c7e" target="_blank">GBIF</a> and is 
  accessible as a dataset containing a Darwin Core (DwC) archive.
</p>

## GBIF 

### GBIF data gathering

#### Prototype

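In this section we match the species of interest against the GBIF taxonomic backbone and generate the query used to request the occurrence data for the prototype area.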

In [14]:
#Bbox formatted with long_min, lat_min, long_max, lat_max for the areas of interest
aoi_bbox = (4.171371,50.684060,4.743004,50.877911)
#Path and filename to the species of interest
species_oi_path = "inp"
species_oi_file = "prototypeNames.csv"
#Path and filename to the invasive species
species_inv_path = "inp/dwca-unified-checklist-v1.14"
species_inv_file = "taxon.txt"
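
Because the bounding box is a plain tuple, its element order is easy to get wrong. The following is a minimal sketch of a sanity check, assuming the `(long_min, lat_min, long_max, lat_max)` order stated in the comment above and coordinates in decimal degrees. The helper name `check_bbox` is hypothetical and not part of the project modules.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def check_bbox(bbox: BBox) -> BBox:
    """Validate a (long_min, lat_min, long_max, lat_max) box in decimal degrees.

    Hypothetical helper: it only catches swapped min/max values and
    coordinates outside the valid longitude/latitude ranges.
    """
    long_min, lat_min, long_max, lat_max = bbox
    if not (-180 <= long_min < long_max <= 180):
        raise ValueError(f"invalid longitude range: {long_min}..{long_max}")
    if not (-90 <= lat_min < lat_max <= 90):
        raise ValueError(f"invalid latitude range: {lat_min}..{lat_max}")
    return bbox


# Same values as the prototype area of interest defined above.
aoi_bbox = check_bbox((4.171371, 50.684060, 4.743004, 50.877911))
```

Note that such a check cannot detect longitude and latitude being swapped as a pair when both values happen to be valid for either axis, as is the case for Belgium.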

In [3]:
species_oi_df, mismatch_oi_df = sql.fetch_taxon_info(species_oi_file,   
                                                     inp_path=species_oi_path,
                                                     out_file="species_oi.csv",
                                                     out_path="out",
                                                     mismatch_file="species_mm.csv",
                                                     keep_higherrank=False)
species_oi_keys = species_oi_df["acceptedUsageKey"].values

'NONE' and 'HIGHERRANK' matches encountered while searching through the GBIF taxonomic backbone:
The following lookup names (Lotus uliginosus, Ranunculus nemorosus, Chara sp, Lathyrus montanus, Salix alba, Picea abie, Carex sp.) resulted in 'NONE' or 'HIGHERRANK' type match. Potential reasons can be found in the mismatch_df under the key 'note'


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  taxonomic_df["acceptedUsageKey"].fillna(taxonomic_df["usageKey"], inplace=True)
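
The message above refers to the match type that a GBIF backbone lookup reports for each name, where `NONE` and `HIGHERRANK` indicate that no species-level match was found. As an illustration only, the sketch below separates such records from usable ones and falls back to `usageKey` when no `acceptedUsageKey` is present, mirroring the `fillna` line shown in the warning. It is not the implementation of `sql.fetch_taxon_info`: the function name, the meaning given to `keep_higherrank`, and the sample records (names and keys) are all made up for the example.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_matches(
    records: List[Record], keep_higherrank: bool = False
) -> Tuple[List[Record], List[Record]]:
    """Split lookup results into usable matches and mismatches (hypothetical helper)."""
    rejected = {"NONE"} if keep_higherrank else {"NONE", "HIGHERRANK"}
    matched: List[Record] = []
    mismatched: List[Record] = []
    for rec in records:
        if rec.get("matchType") in rejected:
            mismatched.append(rec)
            continue
        rec = dict(rec)  # copy so the input records are left untouched
        if rec.get("acceptedUsageKey") is None:
            # No separate accepted key reported: use the usage key itself.
            rec["acceptedUsageKey"] = rec.get("usageKey")
        matched.append(rec)
    return matched, mismatched


# Made-up records, for illustration only.
sample = [
    {"name": "Example species A", "matchType": "EXACT", "usageKey": 101, "acceptedUsageKey": None},
    {"name": "Example species B", "matchType": "EXACT", "usageKey": 102, "acceptedUsageKey": 202},
    {"name": "Example sp.", "matchType": "HIGHERRANK", "usageKey": 103, "acceptedUsageKey": None},
    {"name": "Exmple speces", "matchType": "NONE", "usageKey": None, "acceptedUsageKey": None},
]
matched, mismatched = split_matches(sample)
accepted_keys = [rec["acceptedUsageKey"] for rec in matched]
```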


In [7]:
species_oi_keys

array([ 5301213,  3189103,  2926901,  9118014,  7693914,  4567182,
        5333408,  6027328,  2432452,  5376075,  3040249,  3033464,
        2727294,  8214667,  2890668,  2432427,  2704505,  5410564,
        2706163,  8229935,  2475532,  5333411,  5386897,  7270598,
        3152379,  2876213,  3128547,  3191374,  2926557,  5347824,
        7883344,  3033377,  5374901,  2891147,  2701418,  5304283,
        5334277,  3033289,  2878688,  3039454,  5341478,  4408732,
        1704195,  5290081,  3923253,  5414992,  7702019,  5347644,
        5329192,  6027388,  2914642,  5301200,  2888960,  5371781,
        3133702,  3087634,  5403296,  8165236,  2704951,  3113650,
        2687943,  3033675,  8397184,  2882833,  3188736,  2685484,
        3112651,  3172427,  2888808,  2913288,  5290194,  3172049,
        8365713,  2673131,  4567215,  3029627,  2700934,  2701939,
        2676091,  2722051,  3040227,  3033665,  3034620,  2914547,
        5284517,  7589456,  3152047,  3190715,  5334208,  5347

In [8]:
len(species_oi_keys)

204

In [15]:
sql.generate_json_query(species_oi_keys, aoi_bbox, 1980, 2020,
                        out_file="gbif_prototype_query.json", out_path="out", notificationAddress=["niels.billiet@plantentuinmeise.be"])
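
To give an idea of what such a query file may contain, here is a minimal sketch that builds a request body in the shape accepted by the GBIF SQL download API (`sendNotification`, `notificationAddresses`, `format`, `sql`). The function name, the selected columns, and the exact SQL are assumptions made for illustration; the file actually written by `sql.generate_json_query` may differ, and the SQL below has not been validated against the GBIF service.

```python
import json
from typing import Dict, Iterable, Sequence


def build_sql_download_request(
    species_keys: Iterable[int],
    bbox: Sequence[float],
    start_year: int,
    end_year: int,
    notification_addresses: Sequence[str],
) -> Dict[str, object]:
    """Build a request body for a GBIF SQL download (illustrative sketch)."""
    long_min, lat_min, long_max, lat_max = bbox
    key_list = ", ".join(str(int(key)) for key in species_keys)
    query = (
        'SELECT speciesKey, "year", decimalLatitude, decimalLongitude, '
        "COUNT(*) AS occurrences "
        "FROM occurrence "
        f"WHERE speciesKey IN ({key_list}) "
        f"AND decimalLongitude >= {long_min} AND decimalLongitude <= {long_max} "
        f"AND decimalLatitude >= {lat_min} AND decimalLatitude <= {lat_max} "
        f'AND "year" >= {start_year} AND "year" <= {end_year} '
        "AND occurrenceStatus = 'PRESENT' "
        'GROUP BY speciesKey, "year", decimalLatitude, decimalLongitude'
    )
    return {
        "sendNotification": True,
        "notificationAddresses": list(notification_addresses),
        "format": "SQL_TSV_ZIP",
        "sql": query,
    }


payload = build_sql_download_request(
    [5301213, 3189103],  # two of the keys listed above
    (4.171371, 50.684060, 4.743004, 50.877911),
    1980,
    2020,
    ["user@example.org"],  # placeholder address
)
payload_json = json.dumps(payload, indent=2)
```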

#### Belgium angiosperm

For Belgium we will generate GBIF data for all the invasive species recorded in the country. The list is available on GBIF as the [Global Register of Introduced and Invasive Species - Belgium](https://www.gbif.org/dataset/6d9e952f-948c-4483-9807-575348147c7e). This example illustrates the use of a Darwin Core archive provided by GBIF.

In [17]:
#Bbox formatted with long_min, lat_min, long_max, lat_max for the areas of interest
be_bbox = (2.392821,49.503853,6.457763,51.564469)
#Path and filename to the species of interest
species_oi_path = "inp/Global Register of Introduced and Invasive Species - Belgium"
species_oi_file = "taxon.txt"

In [12]:
speciesKeys = sql.extract_keys_dwc(species_oi_file, species_oi_path)
print(speciesKeys)

['1002621', '1003567', '10071055', '1007534', '1008610', '1008612', '1008955', '1010644', '10108775', '1013526', '1014565', '1016841', '1017419', '10269496', '1031394', '1031400', '1031512', '1031520', '1031524', '1031564', '1031677', '1031680', '1031684', '1031685', '1031737', '1031742', '1031743', '1032262', '1032377', '10329298', '10411852', '1043717', '1043978', '1045323', '1047536', '10545407', '10578411', '10629881', '10646747', '10676000', '10701161', '10730110', '10755213', '10773250', '10786206', '10797854', '10800064', '10801424', '10852031', '10857535', '10902460', '10920460', '10933479', '10937982', '10944522', '10948891', '10953644', '1095946', '10966302', '10972541', '10986555', '11007246', '11055820', '11064584', '11104870', '11107889', '1111797', '11136683', '11141765', '11162940', '1119292', '11205618', '11335341', '1133603', '1152186', '11528335', '11794300', '11844792', '11921804', '1194885', '1194989', '1195013', '12132283', '12187916', '12205713', '12218002', '1224
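
As a rough illustration of this step, the sketch below reads a tab-separated `taxon.txt` and collects the numeric GBIF keys. It assumes that the `taxonID` column holds either a bare key or a URL ending in the key; that column layout, the function name, and the sample rows are assumptions, and `sql.extract_keys_dwc` may work differently.

```python
import csv
import io
from typing import List, TextIO


def extract_gbif_keys(handle: TextIO, column: str = "taxonID") -> List[str]:
    """Collect unique numeric keys from one column of a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    keys = set()
    for row in reader:
        value = (row.get(column) or "").strip().rstrip("/")
        key = value.rsplit("/", 1)[-1]  # keep the part after the last slash
        if key.isdigit():
            keys.add(key)
    # Keys are strings, so this is a lexicographic sort.
    return sorted(keys)


# Made-up sample rows standing in for a real taxon.txt.
sample_file = io.StringIO(
    "taxonID\tscientificName\n"
    "https://www.gbif.org/species/10071055\tExample species C\n"
    "https://www.gbif.org/species/1002621\tExample species A\n"
    "1003567\tExample species B\n"
    "not-a-key\tExample species D\n"
)
sample_keys = extract_gbif_keys(sample_file)
```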

In [18]:
sql.generate_json_query(speciesKeys, be_bbox, 1980, 2020,
                        out_file="gbif_beAngiosperm_query.json", out_path="out", notificationAddress=["niels.billiet@plantentuinmeise.be"])

#### Submitting the Query

Executing this code generates a .json file that is stored in the output directory. This json file can subsequently be used to call the GBIF SQL API through the `gbifCube.sh` bash script. 
1) Make sure that the bash script has permission to be executed from the shell. To check this, run `ls -l gbifCube.sh` in the shell; if the file is executable, its permission string contains the `x` character. If the script does not have the right permission on your system, run `chmod +x gbifCube.sh` in your shell.
2) To run this bash script, the GBIF credentials should be added to `~/.bashrc`. Note that `export` is lowercase and that bash does not allow spaces around `=`.
```
export GBIF_USERNAME="yourUserName"
export GBIF_EMAIL="yourAccountEmail"
export GBIF_PASSWORD="yourPassword"
```
3) Executing the `gbifCube.sh` script returns a message that ends in a multi-digit string. This string is the download key, which is subsequently used to download the files.
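
For readers who prefer to stay in Python, the submission step can be sketched with the standard library. This is an assumption about what a script such as `gbifCube.sh` does, not a transcription of it: the request is a POST of the json file to the GBIF occurrence download endpoint with HTTP basic authentication, and the response body is the download key. The function names are hypothetical.

```python
import base64
import json
import urllib.request

GBIF_DOWNLOAD_URL = "https://api.gbif.org/v1/occurrence/download/request"


def build_download_request(query_path: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for a GBIF download."""
    with open(query_path, "rb") as handle:
        body = handle.read()
    json.loads(body)  # fail early if the file is not valid JSON
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(GBIF_DOWNLOAD_URL, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    request.add_header("Authorization", f"Basic {token}")
    return request


def submit(request: urllib.request.Request) -> str:
    """Send the request and return the response text (the download key)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8").strip()


# Usage, with the credentials taken from the environment variables above:
#   import os
#   request = build_download_request(
#       "out/gbif_prototype_query.json",
#       os.environ["GBIF_USERNAME"],
#       os.environ["GBIF_PASSWORD"],
#   )
#   download_key = submit(request)
```

Building and sending are kept separate so that the request can be inspected before anything is submitted to GBIF.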

For the scripts above the datasets can be found on GBIF
1) Prototype species of interest - https://www.gbif.org/occurrence/download/0006714-250827131500795 (DOI 10.15468/dl.ez24w2)
2) Invasive plant species in Belgium - https://www.gbif.org/occurrence/download/0006715-250827131500795 (DOI 10.15468/dl.g3vgpc)

## BmD Cube generation

### Formatting the yaml file

The yaml file is formatted to include all possible parameters that can be fed into the cubing engine. A template for `param.yaml` can be found in the `config/` folder in the root of the project. We will go through each component of the parameter file and discuss in depth how to use it.

<div style="border:2px solid #4CAF50; border-radius:8px; padding:10px;">

```yaml
    cube_dir: "cube_path/"
    cube_name: "cube_name"
```
</div>

The first segment of the yaml parameter file describes:
* `cube_dir`, the output directory of the cubing engine. Generated files are written to this directory.
* `cube_name`, the name given to the exported data tree file.

<div style="border:2px solid #4CAF50; border-radius:8px; padding:10px;">
    
```yaml
    spatial:
      # select the method which should be employed
      method: "bbox"
      bbox:
        long_min: 0
        long_max: 0
        lat_min: 0
        lat_max: 0
      polygon:
        shapefile_path: "/shapefile/filename.shp"
```
</div>

The spatial segment details all the geospatial parameters that are used within the cubing engine
* `method`, method that needs to be employed during the extraction of the subset of data. Currently only the bbox method is supported but future versions of the cubing engine will support polygon subsetting
* `bbox`, parameters needed to use the bbox method
    * `long_min, lat_min`, the minimal longitude and latitude value used in the construction of the bbox. This corresponds to the lower left corner of the bbox
    * `long_max, lat_max`, the maximal longitude and latitude value used in the construction of the bbox. This corresponds to the upper right corner of the bbox
* `shapefile_path`, when support is implemented this should point towards the shapefile which the user wishes to utilize during subsetting
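
The way the `spatial` segment maps onto a bounding box can be sketched as follows, assuming the yaml file has already been parsed into a dictionary. The function name is hypothetical and the cubing engine may handle this differently; the sketch returns the box in the `(long_min, lat_min, long_max, lat_max)` order used earlier in this notebook and rejects the polygon method, which is not supported yet.

```python
from typing import Any, Dict, Tuple


def bbox_from_spatial(spatial: Dict[str, Any]) -> Tuple[float, float, float, float]:
    """Turn a parsed `spatial` segment into (long_min, lat_min, long_max, lat_max)."""
    method = spatial.get("method")
    if method == "bbox":
        box = spatial["bbox"]
        return (
            float(box["long_min"]),
            float(box["lat_min"]),
            float(box["long_max"]),
            float(box["lat_max"]),
        )
    if method == "polygon":
        raise NotImplementedError("polygon subsetting is not supported yet")
    raise ValueError(f"unknown spatial method: {method!r}")


# Dictionary as it could look after parsing the yaml segment above,
# filled in with the prototype area of interest.
spatial_cfg = {
    "method": "bbox",
    "bbox": {"long_min": 4.171371, "long_max": 4.743004, "lat_min": 50.684060, "lat_max": 50.877911},
    "polygon": {"shapefile_path": "/shapefile/filename.shp"},
}
prototype_bbox = bbox_from_spatial(spatial_cfg)
```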

<div style="border:2px solid #4CAF50; border-radius:8px; padding:10px;">
    
```yaml
    layers:
      gbif:
        enabled: true
        type:
          occurences: true
          absences: false
        species_paths: "/directory/species_list.csv"
        taxonomic:
          highest_rank: "kingdom" #What is the highest taxonomic rank required
          lowest_rank: "species" #What is the lowest taxonomic rank required
        time:
          start_year: #Number
          end_year: #Number
          start_month: #Number between 1 and 12
          end_month: #Number between 1 and 12
        defaultCoordinateUncertainty: 1000
        cubing:
          enabled: true
          cubingGrid: "EEA"
        selection_issues: #Selection flags that need to be taken into account when selecting records
          spatial:
            hasCoordinate: true 
            zeroCoordinate: true
            invalidCoordinate: true 
            countryMismatch: true
        file:
          file_path: "gbif_filepath/"
          file_name: "gbif_filename.csv"
          sep: "\t"
```
</div>

The layers section of the yaml file currently contains only GBIF and CHELSA as available layers.

The GBIF layer template is already provided, but it has not yet been fully integrated into the prototype of the cubing engine. In particular, the engine does not yet collect the biodiversity data itself: it does not translate the parameters in the yaml file into a SQL query, submit that query, and then process the returned GBIF data automatically. Instead, the user must provide a csv file obtained from the GBIF SQL API using the functions discussed in the previous section of this notebook. The location of that file is registered in the `file` subsection of the GBIF segment.
* `file_path`, points to the directory where the csv file is stored
* `file_name`, gives the name of the csv file
* `sep`, documents the separator symbol used in this file. By default, GBIF returns tab-separated files.

<div style="border:2px solid #4CAF50; border-radius:8px; padding:10px;">
    
```yaml
    chelsa:
        enabled: true
        chelsa_month:
          enabled: true
          time:
            start_year: 1979
            start_month: 1
            end_year: 2020
            end_month: 12
            year_range: [1979, 2020]  # Optional: for fallback or metadata only
          variables:
            include_all: true
            included: []       # Ignored if include_all is true
            excluded: []
          source:
            base_url: "https://os.zhdk.cloud.switch.ch/chelsav2/GLOBAL/monthly"
            version: "V.2.1"  # Optional data version tag
          metadata:
            chelsa_month:
              label: "CHELSA Monthly Climate Data"
              description: "Monthly high-resolution climate variables from CHELSA V2.1."
              year_range: "1979-2020"
              available_variables:
                clt: "Cloud cover (%)"
                cmi: "Climatic Moisture Index"
                hurs: "Relative humidity at 2m (%)"
                pet: "Potential Evapotranspiration (mm)"
                pr: "Precipitation (mm)"
                rsds: "Surface downwelling shortwave radiation (W/m²)"
                sfcWind: "Surface wind speed (m/s)"
                tas: "Mean air temperature at 2m (°C)"
                tasmax: "Daily maximum air temperature at 2m (°C)"
                tasmin: "Daily minimum air temperature at 2m (°C)"
                vpd: "Vapor Pressure Deficit (kPa)"
```
</div>

The chelsa segment groups all the different CHELSA layers. Each of them follows a similar structure, although deviations may be present depending on the dimensions that each one has.
* `enabled`, general keyword that signals whether the layer should be included in the end product. This keyword takes either `true` or `false` as its value.
* Some parameters generally follow this structure:
    * `include_all`, boolean that signals whether all available options for the given parameter should be used.
    * `included`, if `include_all` is false, then we check which options should be included. Depending on the parameter, each entry in the list should be a string (variable names and year ranges) or an integer (year, month).
    * `excluded`, if `include_all` is false and `included` is empty, then we check which options should be excluded from all available options. Depending on the parameter, each entry in the list should be a string (variable names and year ranges) or an integer (year, month).
* Some parameters, depending on the layer, take specific inputs or have standard values that should remain unchanged unless changes occur within CHELSA itself:
    * for reference climatological CHELSA data, the `year_range` is set to a standard value of "1981-2010".
    * for the monthly CHELSA data, `start_year`, `end_year`, `start_month` and `end_month` are required. These should be integer values that fall within the `year_range`, which has a standard value of [1979, 2020] (and should normally remain unchanged).
    * each chelsa layer has a `source` subsection containing the information necessary to generate the URLs that point to the files on the S3 bucket. The `base_url` points to the base address, which is extended to generate the URLs of the specific files. The `version` is linked to the version of the data. Neither should be modified by the user unless the address of the S3 bucket changes or the version of the data gets updated.
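
The precedence between `include_all`, `included` and `excluded` described above can be sketched as a small function. This is an illustration of the stated rules only, not the engine's code; the function name and the error raised for unknown entries are assumptions.

```python
from typing import List, Sequence


def resolve_selection(
    available: Sequence[str],
    include_all: bool = True,
    included: Sequence[str] = (),
    excluded: Sequence[str] = (),
) -> List[str]:
    """Resolve which options to use, following the precedence described above."""
    options = list(available)
    if include_all:
        # `included` and `excluded` are ignored in this case.
        return options
    if included:
        unknown = [name for name in included if name not in options]
        if unknown:
            raise ValueError(f"unknown entries in 'included': {unknown}")
        return [name for name in options if name in included]
    # include_all is false and nothing is explicitly included:
    # start from everything and drop the excluded entries.
    return [name for name in options if name not in excluded]


# Variable names taken from the `available_variables` block above.
variables = ["clt", "cmi", "hurs", "pet", "pr", "rsds", "sfcWind", "tas", "tasmax", "tasmin", "vpd"]
dev_selection = resolve_selection(variables, include_all=False, included=["tas", "tasmin", "tasmax"])
no_wind = resolve_selection(variables, include_all=False, excluded=["sfcWind"])
```

Returning the result in the order of `available` keeps the output stable regardless of how the user orders the lists in the yaml file.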

### Using the cubing engine

In what follows we use the prototype engine to generate cubes for 3 different scenarios. The first is the dev cube that was used during development. This cube is the smallest possible for the prototype area and runs relatively quickly. The other 2 scenarios generate significantly larger cubes, either because more data is requested or because the spatial extent is much larger. The parameter files are:
* param_dev_cube, the data cube used during development of the cubing engine. This is a limited data cube where we sample each layer with a restricted set of variables. This cube can be included in the GitHub repository to serve as an example for the end user.
* param_prototype, the full data cube for the prototype region. This will include all possible parameters for all possible layers in the given prototype.
* param_beAngiosperm, a data cube that will be produced for all the angiosperm observations within Belgium.

The examples cover different scales of requested data, going from small to much larger.

#### Cubing engine - Dev test parameters

In [2]:
dev_cube = bmd.bmd_cube()
dev_cube.generate_bmd_data("param_dev_cube.yaml", "../../config")
dev_cube.construct_datatree()

-----Retrieving monthly CHELSA data for variable 'tas'-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.14item/s]


-----Retrieving monthly CHELSA data for variable 'tasmin'-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.48item/s]


-----Retrieving monthly CHELSA data for variable 'tasmax'-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.66item/s]


-----Retrieving Reference Climatology CHELSA data for variable 'bio1' in reference period 1981-2010-----
Complete
-----Retrieving Reference Climatology CHELSA data for variable 'bio2' in reference period 1981-2010-----
Complete
-----Retrieving Reference Climatology CHELSA data for variable 'bio3' in reference period 1981-2010-----
Complete
-----Retrieving monthly Reference Climatology CHELSA data for variable 'tas' in reference period 1981-2010-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.06item/s]


-----Retrieving monthly Reference Climatology CHELSA data for variable 'tasmin' in reference period 1981-2010-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.04item/s]


-----Retrieving monthly Reference Climatology CHELSA data for variable 'tasmax' in reference period 1981-2010-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.37s/item]


-----Retrieving Simulation (period) CHELSA data for variable 'bio1'-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.22item/s]


-----Retrieving Simulation (period) CHELSA data for variable 'bio2'-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.71item/s]


-----Retrieving Simulation (period) CHELSA data for variable 'bio3'-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.50item/s]


-----Retrieving Simulation (monthly) CHELSA data for variable 'tas'-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.20item/s]


-----Retrieving Simulation (monthly) CHELSA data for variable 'tasmin'-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:23<00:00,  5.87s/item]


-----Retrieving Simulation (monthly) CHELSA data for variable 'tasmax'-----


Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.06item/s]


#### Cubing engine - Prototype Complete

In [2]:
prototype_cube = bmd.bmd_cube()
prototype_cube.generate_bmd_data("param_prototype.yaml", "../../config")
prototype_cube.construct_datatree()

-----Retrieving monthly CHELSA data for variable 'clt'-----


Processing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 480/480 [04:16<00:00,  1.87item/s]


DEBUG batch_process_urls output: <class 'list'> [(array([4.18387099, 4.20887098, 4.23387097, 4.25887096, 4.28387095,
       4.30887094, 4.33387093, 4.35887092, 4.38387091, 4.4088709 ,
       4.43387089, 4.45887088, 4.48387087, 4.50887086, 4.53387085,
       4.55887084, 4.58387083, 4.60887082, 4.63387081, 4.6588708 ,
       4.68387079, 4.70887078, 4.73387077]), array([50.865411  , 50.84041102, 50.81541102, 50.79041103, 50.76541105,
       50.74041105, 50.71541106, 50.69041108]), array([[4849, 4853, 4858, 4852, 4836, 4815, 4855, 4849, 4813, 4833, 4817,
        4810, 4820, 4833, 4857, 4836, 4829, 4827, 4818, 4846, 4845, 4837,
        4831],
       [4848, 4845, 4857, 4855, 4843, 4858, 4865, 4810, 4851, 4868, 4835,
        4836, 4853, 4855, 4868, 4859, 4858, 4833, 4843, 4835, 4831, 4851,
        4856],
       [4876, 4870, 4861, 4856, 4872, 4905, 4821, 4930, 4948, 4889, 4865,
        4951, 4973, 4923, 4908, 4894, 4867, 4829, 4881, 4824, 4896, 4923,
        4945],
       [4888, 4885, 4873, 48

Processing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 480/480 [04:42<00:00,  1.70item/s]


DEBUG batch_process_urls output: <class 'list'> [0, (array([4.17553767, 4.183871  , 4.19220433, 4.20053767, 4.208871  ,
       4.21720433, 4.22553767, 4.233871  , 4.24220433, 4.25053767,
       4.258871  , 4.26720433, 4.27553767, 4.283871  , 4.29220433,
       4.30053767, 4.308871  , 4.31720433, 4.32553767, 4.333871  ,
       4.34220433, 4.35053767, 4.358871  , 4.36720433, 4.37553767,
       4.383871  , 4.39220433, 4.40053767, 4.408871  , 4.41720433,
       4.42553767, 4.433871  , 4.44220433, 4.45053767, 4.458871  ,
       4.46720433, 4.47553767, 4.483871  , 4.49220433, 4.50053767,
       4.508871  , 4.51720433, 4.52553767, 4.533871  , 4.54220433,
       4.55053767, 4.558871  , 4.56720433, 4.57553767, 4.583871  ,
       4.59220433, 4.60053766, 4.608871  , 4.61720433, 4.62553766,
       4.633871  , 4.64220433, 4.65053766, 4.658871  , 4.66720433,
       4.67553766, 4.683871  , 4.69220433, 4.70053766, 4.708871  ,
       4.71720433, 4.72553766, 4.733871  , 4.74220433]), array([50.87374433,

TypeError: 'int' object is not subscriptable
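
The second debug dump above shows a result list that starts with the integer `0` followed by tuples, and indexing an integer the way a tuple is indexed raises exactly this `TypeError`. The sketch below shows one way to guard against that, under the assumption (not confirmed by the output) that `0` is a placeholder for a URL that could not be processed. The function name is hypothetical, and plain lists stand in for the arrays.

```python
from typing import Any, List, Sequence, Tuple


def split_batch_results(results: Sequence[Any]) -> Tuple[List[tuple], List[int]]:
    """Separate well-formed (long, lat, values) tuples from anything else.

    Returns the valid tuples and the positions of the invalid entries,
    so the corresponding inputs can be retried or reported.
    """
    valid: List[tuple] = []
    failed_positions: List[int] = []
    for position, item in enumerate(results):
        if isinstance(item, tuple) and len(item) == 3:
            valid.append(item)
        else:
            failed_positions.append(position)
    return valid, failed_positions


# Made-up results mimicking the shape seen in the debug output.
results = [
    0,
    ([4.18, 4.19], [50.87, 50.86], [[4849, 4853], [4848, 4845]]),
    ([4.18, 4.19], [50.87, 50.86], [[4876, 4870], [4888, 4885]]),
]
valid, failed_positions = split_batch_results(results)
```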

#### Cubing engine - Invasive species in Belgium

In [None]:
beInv_cube = bmd.bmd_cube()
beInv_cube.generate_bmd_data("param_beInvasive.yaml", "../../config")
beInv_cube.construct_datatree()

#### Using the datatree

The individual datasets in the data tree are accessed in a similar way to dictionaries in Python. An overview of the tree can be obtained by calling the print function on the tree itself.

In [30]:
print(dev_cube.data_tree)

DataTree('prototype_dev_cube', parent=None)
├── DataTree('static')
│   ├── DataTree('chelsa_clim_ref_period')
│   │       Dimensions:  (lat: 23, long: 69)
│   │       Coordinates:
│   │         * lat      (lat) float64 184B 50.87 50.87 50.86 50.85 ... 50.71 50.7 50.69
│   │         * long     (long) float64 552B 4.176 4.184 4.192 4.201 ... 4.726 4.734 4.742
│   │       Data variables:
│   │           bio1     (lat, long) uint16 3kB 2837 2837 2837 2836 ... 2832 2832 2832 2831
│   │           bio2     (lat, long) uint16 3kB 71 71 71 71 71 71 71 ... 71 71 71 71 71 71
│   │           bio3     (lat, long) float32 6kB 3.3 3.3 3.3 3.29 ... 3.27 3.27 3.27 3.26
│   ├── DataTree('chelsa_clim_ref_month')
│   │       Dimensions:  (months: 2, lat: 23, long: 69)
│   │       Coordinates:
│   │         * months   (months) int64 16B 1 2
│   │         * lat      (lat) float64 184B 50.87 50.87 50.86 50.85 ... 50.71 50.7 50.69
│   │         * long     (long) float64 552B 4.176 4.184 4.192 4.201 ... 4.726 

Iterating over the tree yields its branches. Alternatively, the `children` attribute of a node gives the branches originating from that node.

In [36]:
for branch in dev_cube.data_tree:
    print(f" dev_cube branch '{branch}'")
    for leaf in dev_cube.data_tree[branch]:
        print(f"Branch '{branch}' has a leaf '{leaf}'")

 dev_cube branch 'static'
Branch 'static' has a leaf 'chelsa_clim_ref_period'
Branch 'static' has a leaf 'chelsa_clim_ref_month'
Branch 'static' has a leaf 'chelsa_clim_sim_period'
Branch 'static' has a leaf 'chelsa_clim_sim_month'
 dev_cube branch 'dynamic'
Branch 'dynamic' has a leaf 'chelsa_month'
Branch 'dynamic' has a leaf 'gbif_occurences'


The data can be accessed through the `.ds` attribute of the leaf of interest.

In [40]:
dev_cube.data_tree["static"]["chelsa_clim_sim_month"].ds

From this dataset we can subsequently get the data associated with a data variable by indexing it like a dictionary.

In [41]:
dev_cube.data_tree["static"]["chelsa_clim_sim_month"].ds["tas"]

Operations on this data array work similarly to how we select and filter data in pandas. Working with data arrays is explained in depth in the xarray beginner tutorials.

#### Exporting and importing the datatree

Generated data trees can be exported and imported using the bmd_cube functionality

In [42]:
dev_cube.export_tree()

In [45]:
imported_tree = dev_cube.import_tree("out/prototype_cubing/", "prototype_dev_cube.nc")

In [46]:
print(dev_cube.data_tree)

DataTree('None', parent=None)
├── DataTree('static')
│   ├── DataTree('chelsa_clim_ref_period')
│   │       Dimensions:  (lat: 23, long: 69)
│   │       Coordinates:
│   │         * lat      (lat) float64 184B 50.87 50.87 50.86 50.85 ... 50.71 50.7 50.69
│   │         * long     (long) float64 552B 4.176 4.184 4.192 4.201 ... 4.726 4.734 4.742
│   │       Data variables:
│   │           bio1     (lat, long) uint16 3kB ...
│   │           bio2     (lat, long) uint16 3kB ...
│   │           bio3     (lat, long) float32 6kB ...
│   ├── DataTree('chelsa_clim_ref_month')
│   │       Dimensions:  (months: 2, lat: 23, long: 69)
│   │       Coordinates:
│   │         * months   (months) int64 16B 1 2
│   │         * lat      (lat) float64 184B 50.87 50.87 50.86 50.85 ... 50.71 50.7 50.69
│   │         * long     (long) float64 552B 4.176 4.184 4.192 4.201 ... 4.726 4.734 4.742
│   │       Data variables:
│   │           tas      (months, lat, long) float64 25kB ...
│   │           tasmin   (mo

In [47]:
print(imported_tree)

DataTree('None', parent=None)
├── DataTree('static')
│   ├── DataTree('chelsa_clim_ref_period')
│   │       Dimensions:  (lat: 23, long: 69)
│   │       Coordinates:
│   │         * lat      (lat) float64 184B 50.87 50.87 50.86 50.85 ... 50.71 50.7 50.69
│   │         * long     (long) float64 552B 4.176 4.184 4.192 4.201 ... 4.726 4.734 4.742
│   │       Data variables:
│   │           bio1     (lat, long) uint16 3kB ...
│   │           bio2     (lat, long) uint16 3kB ...
│   │           bio3     (lat, long) float32 6kB ...
│   ├── DataTree('chelsa_clim_ref_month')
│   │       Dimensions:  (months: 2, lat: 23, long: 69)
│   │       Coordinates:
│   │         * months   (months) int64 16B 1 2
│   │         * lat      (lat) float64 184B 50.87 50.87 50.86 50.85 ... 50.71 50.7 50.69
│   │         * long     (long) float64 552B 4.176 4.184 4.192 4.201 ... 4.726 4.734 4.742
│   │       Data variables:
│   │           tas      (months, lat, long) float64 25kB ...
│   │           tasmin   (mo