# NcML dataset creation to start working with the available data

## Tools for accessing and processing climate data: Case study with R

This worked example uses the `climate4R` framework. For more information, see ["climate4R: An R-based Framework for Climate Data Access, Post-processing and Bias Correction"](https://www.sciencedirect.com/science/article/pii/S1364815218303049).


![c4R](https://github.com/SantanderMetGroup/climate4R/blob/devel/man/figures/climate4R_2.png?raw=true)

------------

## Parameter setting
### Set the paths



In [1]:
work.folder.name <- "ncml"

In [2]:
rootdir <- "/home/jovyan"
data.dir <- file.path(rootdir, "IMPETUS4CHANGE", "data")
work.dir <- file.path(rootdir, "work", work.folder.name)
data.dir.obs <- file.path(data.dir, "BSC/CERRA/daily_mean")
data.dir.pred <- file.path(data.dir, "ESGF/CMIP6/DCPP/EC-Earth-Consortium/EC-Earth3/dcppA-hindcast")

### Other settings

In [3]:
var <- "tas"

## Library loading

In [4]:
library(loadeR)
library(magrittr)

Loading required package: rJava

Loading required package: loadeR.java

Java version 22x amd64 by N/A detected

NetCDF Java Library v4.6.0-SNAPSHOT (23 Apr 2015) loaded and ready

Loading required package: climate4R.UDG

climate4R.UDG version 0.2.6 (2023-06-26) is loaded

Please use 'citation("climate4R.UDG")' to cite this package.

loadeR version 1.8.1 (2023-06-22) is loaded


Get the latest stable version (1.8.2) using <devtools::install_github(c('SantanderMetGroup/climate4R.UDG','SantanderMetGroup/loadeR'))>

Please use 'citation("loadeR")' to cite this package.



## NcML creation

### Observational reference

In this section, a single entry point (an NcML file) is created for all the NetCDF files corresponding to the selected variable (object `var`) in the observational dataset (CERRA). To do so, the files in `data.dir.obs` are first listed using `pattern = sprintf("%s_", var)`:

In [5]:
lf.obs <- list.files(data.dir.obs, pattern = sprintf("%s_", var), recursive = TRUE)
head(lf.obs)
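
The `pattern` argument of `list.files` is a regular expression matched against file names, so an unanchored `"tas_"` also picks up any other file whose name contains that string. The sketch below illustrates the difference with an anchored pattern. It uses a temporary directory and made-up file names, which are assumptions for illustration and not the actual CERRA layout.

```r
# Hypothetical file names in a temporary directory (not the real CERRA layout)
var <- "tas"
tmp <- file.path(tempdir(), "pattern_demo")
dir.create(file.path(tmp, "tas"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(tmp, "tasmax"), recursive = TRUE, showWarnings = FALSE)
demo.files <- c(
    "tas/tas_day_1984.nc",
    "tas/tas_day_1985.nc",
    "tas/tas_readme.txt",
    "tasmax/tasmax_day_1984.nc"
)
invisible(file.create(file.path(tmp, demo.files)))

# Unanchored pattern, as used in the notebook: any name containing "tas_"
loose <- list.files(tmp, pattern = sprintf("%s_", var), recursive = TRUE)

# Anchored pattern: the name must start with "tas_" and end in ".nc"
strict <- list.files(tmp, pattern = sprintf("^%s_.*\\.nc$", var), recursive = TRUE)

print(loose)
print(strict)
```

In this toy case the unanchored pattern also returns the text file, while the anchored one keeps only the NetCDF files of the selected variable.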

Next, the directory shared by the data files is extracted.

In [6]:
var.folder.obs <- unique(dirname(lf.obs))
var.dir.obs <- file.path(data.dir.obs, var.folder.obs)
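
`file.path(data.dir.obs, var.folder.obs)` yields a single path only if every listed file sits in the same folder. A small guard can make that assumption explicit. The relative paths below are hypothetical stand-ins for the output of `list.files`.

```r
# Hypothetical relative paths standing in for the output of list.files()
lf.demo <- c("tas/tas_day_1984.nc", "tas/tas_day_1985.nc", "tas/tas_day_1986.nc")

# Folder part of each path, collapsed to its distinct values
folders <- unique(dirname(lf.demo))

# Stop early if the files are spread over more than one folder
stopifnot(length(folders) == 1)
print(folders)
```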

Next, the directory that will hold the NcML is defined and created.

In [7]:
ncml.dir.obs <- file.path(work.dir, "CERRA")

In [8]:
if (! dir.exists(ncml.dir.obs))
    dir.create(ncml.dir.obs, recursive = TRUE)

Define the full path (including filename) of the NcML:

In [9]:
ncml.filename.obs <- sprintf("%s/%s.ncml", ncml.dir.obs, var.folder.obs)

Create and save the NcML files using function `makeAggregatedDataset`:

In [10]:
makeAggregatedDataset(
    source.dir = var.dir.obs,
    ncml.file = ncml.filename.obs,
    aggr.dim = "time"
) %>% suppressMessages
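
For orientation, an NcML file is a small XML document that tells the NetCDF-Java library how to present several files as one dataset. The sketch below writes a minimal hand-made `joinExisting` aggregation along `time` to a temporary file. It is only an illustration of the format: the source directory is a placeholder, and the file actually written by `makeAggregatedDataset` may differ in its details.

```r
# Placeholder source directory (an assumption; replace with a real folder)
source.dir.demo <- "/path/to/netcdf/files"

# Minimal NcML: join all *.nc files in the folder along the existing "time" dimension
ncml.lines <- c(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">',
    '  <aggregation type="joinExisting" dimName="time">',
    sprintf('    <scan location="%s" suffix=".nc" subdirs="false"/>', source.dir.demo),
    '  </aggregation>',
    '</netcdf>'
)

# Write the sketch to a temporary file and print it back
ncml.demo <- file.path(tempdir(), "demo.ncml")
writeLines(ncml.lines, ncml.demo)
cat(readLines(ncml.demo), sep = "\n")
```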

The NcML file has now been created in the indicated path. Use function `dataInventory` to run an inventory and extract the relevant information:

In [11]:
di <- dataInventory(ncml.filename.obs)

[2024-07-18 11:39:54.0065] Doing inventory ...

[2024-07-18 11:39:54.562135] Retrieving info for 'tas' (0 vars remaining)

[2024-07-18 11:39:54.640501] Done.



For instance, the date range:

In [12]:
di$tas$Dimensions$time$Date_range

... or the full structure of the inventory:

In [13]:
str(di)

List of 1
 $ tas:List of 7
  ..$ Description: chr "2 metre temperature"
  ..$ DataType   : chr "float"
  ..$ Shape      : int [1:3] 13452 1113 2631
  ..$ Units      : chr "K"
  ..$ DataSizeMb : num 157566
  ..$ Version    : logi NA
  ..$ Dimensions :List of 3
  .. ..$ time:List of 4
  .. .. ..$ Type      : chr "Time"
  .. .. ..$ TimeStep  : chr "24.0 hours"
  .. .. ..$ Units     : chr "hours since 1984-9-1 00:00:00"
  .. .. ..$ Date_range: chr "1984-09-01T10:30:00Z - 2021-06-30T10:30:00Z"
  .. ..$ lat :List of 5
  .. .. ..$ Type       : chr "Lat"
  .. .. ..$ Units      : chr "degrees_north"
  .. .. ..$ Values     : num [1:1113] 19.8 19.8 19.9 19.9 20 ...
  .. .. ..$ Shape      : int 1113
  .. .. ..$ Coordinates: chr "lat"
  .. ..$ lon :List of 5
  .. .. ..$ Type       : chr "Lon"
  .. .. ..$ Units      : chr "degrees_east"
  .. .. ..$ Values     : num [1:2631] -57.7 -57.6 -57.6 -57.5 -57.5 ...
  .. .. ..$ Shape      : int 2631
  .. .. ..$ Coordinates: chr "lon"
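
The inventory can double as a quick sanity check on the aggregation. The sketch below copies the `Date_range` and `Shape` values from the output above, then verifies that the number of daily steps matches the length of the time dimension. It also recomputes the data size under the assumption of 4 bytes per `float` value and 1 MB = 10^6 bytes, an assumption that happens to reproduce `DataSizeMb`.

```r
# Values copied from the str(di) output of the observational inventory
date.range <- "1984-09-01T10:30:00Z - 2021-06-30T10:30:00Z"
shape <- c(13452, 1113, 2631)   # time, lat, lon

# Split the range into its two ends and keep the date part (first 10 characters)
ends <- strsplit(date.range, " - ", fixed = TRUE)[[1]]
dates <- as.Date(substr(ends, 1, 10))

# Number of daily steps, counting both ends of the range
n.days <- as.integer(diff(dates)) + 1L

# Size in MB, assuming 4 bytes per value and 1 MB = 10^6 bytes
size.mb <- prod(shape) * 4 / 1e6

print(n.days == shape[1])
print(round(size.mb))
```

A mismatch between `n.days` and the first element of `Shape` would point to missing or duplicated time steps in the aggregated files.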


### Decadal predictions

Creating the NcML files for the decadal hindcast follows the same steps, except that one NcML is created for each initialization.

In [14]:
lf <- list.files(data.dir.pred, recursive = TRUE, pattern = sprintf("%s_.*hindcast", var))
head(lf)

In [15]:
tail(lf)

Unlike the case with observational data in the previous section, here we obtain several shared directories, each corresponding to a different initialization (object `dir.inits`).

In [16]:
dir.inits <- unique(dirname(lf))
head(dir.inits)

In [17]:
tail(dir.inits)
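
To see how `unique(dirname(...))` separates the initializations, the sketch below groups a handful of paths by folder and counts the files in each group. The folder and file names are hypothetical and much simpler than the real ESGF directory structure.

```r
# Hypothetical relative paths: one folder per initialization (start year)
lf.demo <- c(
    "s1960/tas_day_s1960_19601101-19611031.nc",
    "s1960/tas_day_s1960_19611101-19621031.nc",
    "s1961/tas_day_s1961_19611101-19621031.nc",
    "s1961/tas_day_s1961_19621101-19631031.nc"
)

# One entry per initialization folder
dir.inits.demo <- unique(dirname(lf.demo))

# Files grouped by folder, and the number of files in each group
files.by.init <- split(basename(lf.demo), dirname(lf.demo))
n.files <- lengths(files.by.init)

print(dir.inits.demo)
print(n.files)
```

Unequal file counts across initializations would be worth inspecting before building the NcML files.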

Next, the path for the NcMLs is defined and created:

In [18]:
ncml.dir.pred <- file.path(work.dir, "EC-Earth3/dcppA-hindcast")
if (! dir.exists(ncml.dir.pred))
    dir.create(ncml.dir.pred, recursive = TRUE)

`makeAggregatedDataset` is applied in a loop, pointing to a different initialization in each iteration:

In [19]:
for (d in dir.inits)
    makeAggregatedDataset(
        source.dir = sprintf("%s/%s", data.dir.pred, d), 
        ncml.file = sprintf("%s/%s.ncml", ncml.dir.pred, gsub("/", "_", d)),
        aggr.dim = "time"
    ) %>% suppressMessages
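
The `gsub("/", "_", d)` call flattens each nested folder name into a single file name, so all NcML files land directly in `ncml.dir.pred`. The sketch below shows the effect on two hypothetical folder names and checks that the flattened names stay unique, since duplicates would cause one NcML to overwrite another.

```r
# Hypothetical nested folder names, one per initialization
dir.inits.demo <- c("s1960-r1i1p1f1/day/tas", "s1961-r1i1p1f1/day/tas")
ncml.dir.demo <- file.path(tempdir(), "ncml_demo")

# Replace every "/" so each NcML sits directly inside ncml.dir.demo
ncml.names <- sprintf("%s/%s.ncml", ncml.dir.demo, gsub("/", "_", dir.inits.demo))

# Flattened names must be unique, otherwise files would be overwritten
stopifnot(!anyDuplicated(basename(ncml.names)))
print(basename(ncml.names))
```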

In [20]:
ncml.filenames.pred <- list.files(ncml.dir.pred, full.names = TRUE)
head(ncml.filenames.pred)

Perform a data inventory to extract the relevant information. For instance, for the first initialization:

In [21]:
di <- dataInventory(ncml.filenames.pred[1])

[2024-07-18 11:40:02.591065] Doing inventory ...

[2024-07-18 11:40:02.631567] Retrieving info for 'tas' (0 vars remaining)

[2024-07-18 11:40:02.691593] Done.



In [22]:
str(di)

List of 1
 $ tas:List of 7
  ..$ Description: chr "Near-Surface Air Temperature"
  ..$ DataType   : chr "float"
  ..$ Shape      : int [1:3] 4017 256 512
  ..$ Units      : chr "K"
  ..$ DataSizeMb : num 2106
  ..$ Version    : logi NA
  ..$ Dimensions :List of 3
  .. ..$ time:List of 4
  .. .. ..$ Type      : chr "Time"
  .. .. ..$ TimeStep  : chr "1.0 days"
  .. .. ..$ Units     : chr "days since 1850-01-01 00:00:00"
  .. .. ..$ Date_range: chr "1960-11-01T12:00:00Z - 1971-10-31T12:00:00Z"
  .. ..$ lat :List of 5
  .. .. ..$ Type       : chr "Lat"
  .. .. ..$ Units      : chr "degrees_north"
  .. .. ..$ Values     : num [1:256] -89.5 -88.8 -88.1 -87.4 -86.7 ...
  .. .. ..$ Shape      : int 256
  .. .. ..$ Coordinates: chr "lat"
  .. ..$ lon :List of 5
  .. .. ..$ Type       : chr "Lon"
  .. .. ..$ Units      : chr "degrees_east"
  .. .. ..$ Values     : num [1:512] 0 0.703 1.406 2.109 2.812 ...
  .. .. ..$ Shape      : int 512
  .. .. ..$ Coordinates: chr "lon"



***
Note: Repeat the operations in this notebook for additional variables if needed.
***