# Input and Target Data

This section describes the structure of the files that hold the input data (satellite observations and ancillary data) and the target data. This information is relevant for users who want to develop their own data loaders or evaluate external retrievals. For everyone else, the ``satrain`` package also provides ready-to-use PyTorch datasets that avoid manual data handling.

## Input Data

The SatRain dataset distinguishes a total of seven input sources. The data are stored in separate files named ``<prefix>_<timestamp>.nc``, with a unique ``prefix`` identifying the data source and a shared timestamp corresponding to the median time of the passive microwave (PMW) overpass. The different input sources are:

 - PMW observations: ``atms`` and ``gmi``, depending on the base sensor
 - Geostationary Vis and IR: ``geo`` for the geo observations closest to the PMW overpass and ``geo_t`` for the time-resolved observations.
 - Geostationary IR: ``geo_ir`` for the Geo-IR observations closest to the PMW overpass and ``geo_ir_t`` for the time-resolved observations.
 - Ancillary data: ``ancillary``
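Because files from different sources share a timestamp, the naming scheme can be used to collect all inputs that belong to one overpass. The following is a minimal sketch of that idea, not part of the ``satrain`` API: the helper names ``split_input_filename`` and ``group_by_timestamp`` are hypothetical, the example timestamps are made up, and the code assumes that the timestamp itself contains no underscore.

```python
from collections import defaultdict
from pathlib import Path

# The seven input prefixes listed above.
INPUT_PREFIXES = {"atms", "gmi", "geo", "geo_t", "geo_ir", "geo_ir_t", "ancillary"}


def split_input_filename(path):
    """Split a '<prefix>_<timestamp>.nc' file name into (prefix, timestamp)."""
    stem = Path(path).stem
    # Assumption: the timestamp contains no underscore, so the LAST underscore
    # separates prefix and timestamp (prefixes such as 'geo_ir_t' contain some).
    prefix, sep, timestamp = stem.rpartition("_")
    if not sep or prefix not in INPUT_PREFIXES:
        raise ValueError(f"Unrecognized input file name: {path}")
    return prefix, timestamp


def group_by_timestamp(paths):
    """Map each timestamp to a {prefix: path} dict of files from one overpass."""
    scenes = defaultdict(dict)
    for path in paths:
        prefix, timestamp = split_input_filename(path)
        scenes[timestamp][prefix] = Path(path)
    return dict(scenes)


# Hypothetical file names; the timestamp format is illustrative only.
example_files = [
    "gmi_20200101120000.nc",
    "geo_ir_t_20200101120000.nc",
    "ancillary_20200101120000.nc",
]
scenes = group_by_timestamp(example_files)
print(sorted(scenes["20200101120000"]))
```

Splitting on the last underscore rather than the first keeps multi-part prefixes such as ``geo_ir_t`` intact.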

Below, we take a look at the file structure of these input data.


## GMI and ATMS Observations

The GMI and ATMS files contain the variables ``observations`` and ``earth_incidence_angle``, which hold the PMW observations and the corresponding earth-incidence angles. Because the data loaded below are in gridded format, their spatial dimensions are latitude and longitude. For the on-swath data, the spatial dimensions are ``scan`` and ``pixel``.

> **Note**: The PMW files also record the source GPM L1C file and the start and end of the extracted scan range, which can help when extracting matching results from external retrieval algorithms.

```python
import xarray as xr
from satrain.data import get_files

training_files = get_files("gmi", split="training", input_data=["gmi"], geometry="gridded", subset="xs")
# Load the first training file from the 'gmi' files in the training dataset.
xr.load_dataset(training_files["gmi"][0])
```
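The scan range mentioned in the note above can be used to cut the matching scans out of an external retrieval that covers the full orbit. The sketch below illustrates this with plain Python lists. The names ``scan_start`` and ``scan_end`` are hypothetical stand-ins for whatever the PMW files actually call these values, and treating the end index as inclusive is an assumption that should be checked against the file metadata.

```python
def extract_scan_range(swath_rows, scan_start, scan_end):
    """Return scans scan_start..scan_end of a full-orbit field.

    Assumption: both indices are zero-based and scan_end is inclusive.
    """
    if not 0 <= scan_start <= scan_end < len(swath_rows):
        raise ValueError("Scan range lies outside the orbit.")
    return swath_rows[scan_start:scan_end + 1]


# Stand-in for an external retrieval on a full orbit: 10 scans x 3 pixels,
# where every value simply encodes the index of its scan.
external_retrieval = [[float(scan)] * 3 for scan in range(10)]

subset = extract_scan_range(external_retrieval, scan_start=4, scan_end=6)
print(len(subset), subset[0][0], subset[-1][0])
```

With array-based data the same slice would be applied along the ``scan`` dimension of the on-swath geometry.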

The ATMS files are only available when the ``base_sensor`` is ATMS. They contain a different number of channels than the GMI files.

```python
import xarray as xr
from satrain.data import get_files

training_files = get_files("atms", split="training", input_data=["atms"], geometry="gridded", subset="xs")
# Load the first training file from the 'atms' files in the training dataset.
xr.load_dataset(training_files["atms"][0])
```

## Geo Observations

The Geo Vis and IR files store all observations in a single variable ``observations``. Visible channels are stored as reflectances in percent and thermal IR channels in kelvin (K). The ``geo_t`` files have an additional dimension that holds all 10-minute observations within an hour of the overpass.

> **Note**: Since the Vis/IR observations for different testing domains are derived from different instruments, the number of channels and their spectral coverage vary across the testing domains.

```python
import xarray as xr
from satrain.data import get_files

training_files = get_files("gmi", split="training", input_data=["geo"], geometry="gridded", subset="xs")
# Load the first GEO training data file.
xr.load_dataset(training_files["geo"][0])
```
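The relation between the single-time ``geo`` files and the time-resolved ``geo_t`` files can be pictured as a nearest-neighbor lookup along the time axis. The sketch below shows such a lookup on a made-up axis of 10-minute steps. The helper ``nearest_step`` and all times are hypothetical, and the code makes no claim about how the dataset itself selects the closest observation.

```python
from datetime import datetime, timedelta


def nearest_step(times, overpass_time):
    """Return the index of the time step closest to the overpass time."""
    return min(range(len(times)), key=lambda i: abs(times[i] - overpass_time))


# Hypothetical time axis: seven 10-minute steps from 11:30 to 12:30.
start = datetime(2020, 1, 1, 11, 30)
times = [start + timedelta(minutes=10 * i) for i in range(7)]

# Hypothetical median overpass time.
overpass = datetime(2020, 1, 1, 12, 4)

idx = nearest_step(times, overpass)
print(idx, times[idx])
```

On an ``xarray`` dataset, ``Dataset.sel`` with ``method="nearest"`` performs an equivalent lookup along a time coordinate. The name of that coordinate in the ``geo_t`` files is not stated here and should be read from the loaded dataset.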

## Geo-IR Observations

The Geo-IR files contain globally merged observations from the IR window channels around $11\,\mu\mathrm{m}$. The multi-timestep files (``geo_ir_t``) contain 16 half-hourly time steps centered on the overpass time.


```python
import xarray as xr
from satrain.data import get_files

training_files = get_files("gmi", split="training", input_data=["geo_ir"], geometry="gridded", subset="xs")
# Load the first Geo-IR training data file.
xr.load_dataset(training_files["geo_ir"][0])
```

## Ancillary Data

The ancillary data files contain ERA5 data, the GPROF surface type, and the surface elevation, each in separate variables.



```python
import xarray as xr
from satrain.data import get_files

training_files = get_files("gmi", split="training", input_data=["ancillary"], geometry="gridded", subset="xs")
# Load the first ancillary training data file.
xr.load_dataset(training_files["ancillary"][0])
```

## Target Data

Finally, the target files contain the precipitation estimates as well as the radar-quality index (RQI), the gauge-correction factor (GCF), and the precipitation type fraction.


```python
import xarray as xr
from satrain.data import get_files

# Target data is always included, so no input data needs to be requested.
training_files = get_files("gmi", split="training", input_data=[], geometry="gridded", subset="xs")
# Load the first target training data file.
xr.load_dataset(training_files["target"][0])
```
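A common use of the radar-quality index is to discard precipitation estimates where the radar measurements are unreliable. The following sketch shows one way this could look, using nested lists in place of the gridded fields. The helper ``mask_low_quality``, the threshold of 0.8, and all values are made up for illustration and are not taken from the dataset or the ``satrain`` package.

```python
import math


def mask_low_quality(precip, rqi, min_rqi):
    """Replace precipitation values whose RQI is below min_rqi with NaN."""
    return [
        [p if q >= min_rqi else math.nan for p, q in zip(p_row, q_row)]
        for p_row, q_row in zip(precip, rqi)
    ]


# Made-up 2 x 2 fields standing in for the gridded target variables.
precip = [[0.0, 1.2], [3.4, 0.5]]
rqi = [[1.0, 0.4], [0.9, 0.95]]

# Hypothetical quality threshold; choose one that suits the application.
masked = mask_low_quality(precip, rqi, min_rqi=0.8)
n_valid = sum(not math.isnan(v) for row in masked for v in row)
print(n_valid)
```

Masking with NaN rather than dropping values keeps the grid shape intact, so the masked field can still be compared pixel by pixel with a retrieval.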