Specifying Input Data
=====================

When we construct a :py:class:`~filtering.filtering.LagrangeFilter` object, the ``filenames``, ``variables`` and ``dimensions`` arguments are passed straight through to OceanParcels. There are some examples of how these arguments should be constructed in the OceanParcels tutorial, but we will summarise some of the important takeaways here.
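
To see how these pieces fit together, here is a minimal sketch of a complete construction. The import path, the argument ordering and the keyword arguments beyond the dictionaries described below (``sample_variables``, ``mesh``, ``window_size``) are assumptions based on typical usage, and the file and variable names are placeholders; check the :py:class:`~filtering.filtering.LagrangeFilter` documentation for the definitive signature::

    from filtering import LagrangeFilter

    filenames = "data_file.nc"
    variables = {"U": "UVEL", "V": "VVEL"}
    dimensions = {"lon": "X", "lat": "Y", "time": "T"}

    f = LagrangeFilter(
        "filtered",                    # name used for the filtering output
        filenames, variables, dimensions,
        sample_variables=["U", "V"],   # assumed: variables to sample along trajectories
        mesh="flat",                   # assumed: Cartesian (metre-based) coordinates
        window_size=3 * 86400,         # assumed: temporal filtering window, in seconds
    )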

Filenames dictionary
--------------------

The ``filenames`` argument is more properly called the ``filenames_or_dataset`` argument in the :py:class:`~filtering.filtering.LagrangeFilter` initialiser. We'll start by describing the more common use case: providing filenames rather than a dataset. In all cases where you provide filenames, the files should be in the NetCDF format. In the simplest case, all your data is in a single file::

filenames = "data_file.nc"

The filenames can contain wildcard characters, for example::

filenames = "data_directory/output*/diags.nc"

If your variables are in separate files, you can pass a dictionary::

    filenames = {
      "U": "u_velocity.nc",
      "V": "v_velocity.nc",
      "rho": "diags.nc",
    }

Finally, you can pass a dictionary of dictionaries, separating the files containing latitude, longitude, depth and variable data. This is particularly useful when your data is on a B- or C-grid, as :ref:`detailed below <bcgrid-data>`. The format of the dictionaries follows, noting that the ``depth`` entry is not required if you're only using two-dimensional data::

    filenames = {
      "U": {"lat": "mask.nc", "lon": "mask.nc", "depth": "depth.nc", "data": "u_velocity.nc"},
      "V": {"lat": "mask.nc", "lon": "mask.nc", "depth": "depth.nc", "data": "v_velocity.nc"},
    }

Dataset input
-------------

As an alternative to passing filenames, an xarray dataset can be given to the ``filenames_or_dataset`` argument. One use for this functionality is to provide synthetic data, without requiring that it first be written to a file.

Another use for dataset input is to provide more flexibility with your input data. In particular, you are able to leverage dask for on-the-fly computations, such as the :ref:`dynamic data masking example <masking example>`. Note that the default behaviour of :py:func:`xarray.open_dataset` is to use a single chunk for a file. For large datasets, this will both take an extremely long time and use an excessive amount of memory. Ensure the dataset is opened with a sensible ``chunks`` dictionary.

A complication that comes up when using data from a dataset is that we don't handle some forms of datetime object particularly well. This is especially the case when using a standard or proleptic Gregorian calendar, which loads as a numpy-specific datetime object. In these cases, tell xarray not to decode the time data into these objects by passing ``use_cftime=True`` to :py:func:`xarray.open_dataset`.
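
As a brief sketch of both of the points above (sensible chunking and ``use_cftime``), with a placeholder filename, dimension names and chunk sizes::

    import xarray as xr

    ds = xr.open_dataset(
        "model_output.nc",                                      # placeholder filename
        chunks={"time": 1, "yu_ocean": 500, "xu_ocean": 500},   # placeholder chunk sizes
        use_cftime=True,  # decode times to cftime objects rather than numpy datetimes
    )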

Finally, your data is likely to be spread across multiple files, with different dimension names for variables. The per-variable file mapping for grid data described above isn't possible with dataset input, so you will have to combine multiple datasets with :py:func:`xarray.merge`. Other useful functions for massaging your data into a conforming format are demonstrated in the :ref:`loading data through xarray example <xarray example>`.
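
For instance, a minimal sketch of combining two velocity files, assuming hypothetical file and dimension names::

    import xarray as xr

    ds_u = xr.open_dataset("u_velocity.nc", chunks={"time": 1}, use_cftime=True)
    ds_v = xr.open_dataset("v_velocity.nc", chunks={"time": 1}, use_cftime=True)

    # harmonise a differently-named time dimension, then combine into one dataset
    ds = xr.merge([ds_u, ds_v.rename({"Time": "time"})])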

Variables dictionary
--------------------

OceanParcels uses particular names for the velocity components and dimensions of data. These names may differ from those actually used within your files. The first bridge between these two conventions is the ``variables`` dictionary. This is a map between a variable name used within OceanParcels and the name within the data files themselves. Note that if you have extra data beyond just the velocity components, it still requires an entry in ``variables``, for example::

variables = {"U": "UVEL", "V": "VVEL", "P": "PHIHYD", "RHO": "RHOAnoma"}

This mapping defines the usual U and V velocity components, and the additional P and RHO variables, named PHIHYD and RHOAnoma in the source data files, respectively.

Dimensions dictionary
---------------------

The other bridge between conventions relates to the dimensions of the data. There are two considerations here: the first is simply to inform OceanParcels of the latitude, longitude, depth and time dimensions within the data. The second is to redefine the data locality of the variables, which is required when using :ref:`B- or C-grid interpolation <bcgrid-data>`.

If all data is on the same grid, i.e. an Arakawa A-grid, ``dimensions`` can be a single dictionary mapping the OceanParcels dimension names ``lat``, ``lon``, ``time`` and ``depth`` to those found within the data files. As before, ``depth`` isn't required for two-dimensional data. However, if your data is three-dimensional and you're choosing a single depth level with the index mechanism below, ``depth`` must still be present in the ``dimensions`` dictionary, for example::

dimensions = {"lon": "X", "lat": "Y", "time": "T", "depth": "Zmd000200"}

It is also possible to separately specify the dimensions for each of the variables defined in the ``variables`` dictionary. This is often used when variables have different spatial staggering::

    dimensions = {
      "U":   {"lon": "xu_ocean", "lat": "yu_ocean", "time": "time"},
      "V":   {"lon": "xu_ocean", "lat": "yu_ocean", "time": "time"},
      "RHO": {"lon": "xt_ocean", "lat": "yt_ocean", "time": "time"},
    }

Index dictionary
----------------

In some cases, we might want to restrict the extent of the data that OceanParcels sees. This is different from using :py:func:`~filtering.filtering.LagrangeFilter.seed_subdomain`, which still uses the full domain for advection but restricts the region that is seeded for filtering. This functionality is most useful because filtering is performed on two-dimensional slices: providing a full three-dimensional data file can cause problems. Instead of requiring a pre-processing step to split out separate vertical levels, we can tell OceanParcels to consider only a particular level by its index through the ``indices`` dictionary. This is an optional argument to the :py:class:`~filtering.filtering.LagrangeFilter` initialiser. For example, to use only the surface data (for a file where the indices increase downwards)::

indices = {"depth": [0]}

.. _bcgrid-data:

B- and C-grid data
------------------

Compared to the Arakawa A-grid, where all variables are collocated within a grid cell, the B- and C-grid conventions stagger the variables relative to one another. In particular, on a B-grid, velocity is defined on cell corners, and tracers are taken as a cell mean. This means that velocity is interpolated bilinearly, as you may expect. The behaviour with three-dimensional data is more complicated, but we will not discuss it here because the filtering library is aimed at two-dimensional slices.

OceanParcels assumes that C-grid velocity data is constant along cell faces. The U component is defined on the eastern face of a cell, and the V component on the northern face. To interpolate in this manner, OceanParcels needs the grid information for velocities to refer to the corner of a cell. Perhaps confusingly, this means that although U and V are staggered relative to each other, they need to have the same grid information in ``dimensions``. OceanParcels assumes the NEMO grid convention, where U[i, j] is on the cell edge between corners [i, j-1] and [i, j]. Similarly, V[i, j] is on the edge between corners [i-1, j] and [i, j]. If your data doesn't follow this convention, new coordinate data will need to be generated for interpolation to work correctly. More detail is available in the indexing documentation.
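
For a NEMO-style C-grid dataset, the input specification might look like the sketch below. The file, variable and coordinate names (``mask.nc``, ``uo``, ``glamf`` and so on) are illustrative assumptions, not requirements of the library; substitute the names from your own grid files::

    # U and V deliberately point at the same corner ("f-point") coordinates,
    # even though the velocity components themselves are staggered
    filenames = {
        "U": {"lat": "mask.nc", "lon": "mask.nc", "data": "u_velocity.nc"},
        "V": {"lat": "mask.nc", "lon": "mask.nc", "data": "v_velocity.nc"},
    }
    variables = {"U": "uo", "V": "vo"}
    dimensions = {
        "U": {"lon": "glamf", "lat": "gphif", "time": "time_counter"},
        "V": {"lon": "glamf", "lat": "gphif", "time": "time_counter"},
    }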

Output grid
-----------

The underlying :doc:`algorithm <algorithm>` involves seeding particles at all gridpoints in order to sample the fields of interest. With the potential staggering mentioned above in mind, this could mean running the filtering advection with three times the number of points. Additionally, we can specify variables on arbitrary grids to be sampled along particle trajectories, which could increase the advection time and memory consumption further. Instead, we anticipate that a given filtering workflow will seed particles on a single grid, leveraging interpolation for other staggering schemes.

By default, the first grid defined within the OceanParcels :py:class:`~parcels.fieldset.FieldSet` will be used for seeding the filtering particles, and therefore as the final location of the filtered data. Usually, this will be the U velocity field, but the :py:func:`~filtering.filtering.LagrangeFilter.set_particle_grid` method can be used to modify this after creation of the filtering object. This looks up a field by name from OceanParcels, so it needs to be called with a key from the ``variables`` dictionary, as opposed to the variable name within your data files. Using the example variables from before, to set particle seeding and output on the rho grid::

variables = {"U": "UVEL", "V": "VVEL", "P": "PHIHYD", "RHO": "RHOAnoma"}
f = LagrangeFilter(...)
f.set_particle_grid("RHO")