# Getting started

The easiest way to get started with the SatRain dataset is through the ``satrain`` Python package. The full dataset can be
downloaded manually from [rain.atmos.colostate.edu](https://rain.atmos.colostate.edu/ipwgml/satrain/), but the ``satrain`` package automates both the download and the loading of the training and testing data.


## Installing the ``satrain`` package

To install the latest version of ``satrain``, use the following command:

```
pip install "satrain[complete] @ git+https://github.com/satrain/satrain"
```

> **Note**: The above command installs all dependencies required to run the examples included here. If
> you would rather avoid the extra dependencies, use ``pip install git+https://github.com/satrain/satrain`` for a minimal installation.

## Accessing the Data

Accessing the SatRain data from within Python is as easy as:

```python
from satrain import get_files

# Get a dictionary containing the training files for the GMI, ancillary, and precipitation reference data.
training_files = get_files(
    base_sensor="gmi",
    split="training",
    input_data=["gmi", "ancillary"],
    geometry="gridded",
    subset="s"
)
list(training_files.keys())
```

```
['gmi', 'ancillary', 'target']
```

This function will automatically download GMI observations, ancillary data, and the corresponding precipitation estimates and return a dictionary containing the paths of those files. This is all that is needed to get started using the data. Feel free to skip ahead to the neural net [examples](examples) to get started building precipitation retrievals using the dataset. Read on to learn how to configure where the SatRain data is stored on your machine.
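
As a quick sanity check after the download, it can help to count the files under each key. The following is a minimal sketch, assuming that each value in the returned dictionary is a list of file paths. The helper ``summarize_files`` and the file names in ``example_files`` are hypothetical and only stand in for the real output of ``get_files``.

```python
from pathlib import Path


def summarize_files(files):
    """Return {key: (number of files, sorted list of file suffixes)}.

    Assumes ``files`` maps names such as 'gmi' or 'target' to lists of paths.
    """
    summary = {}
    for key, paths in files.items():
        suffixes = sorted({Path(p).suffix for p in paths})
        summary[key] = (len(paths), suffixes)
    return summary


# Stand-in for the dictionary returned by get_files; these file names are made up.
example_files = {
    "gmi": [Path("gmi/gmi_0001.nc"), Path("gmi/gmi_0002.nc")],
    "ancillary": [Path("ancillary/anc_0001.nc"), Path("ancillary/anc_0002.nc")],
    "target": [Path("target/target_0001.nc"), Path("target/target_0002.nc")],
}

for key, (count, suffixes) in summarize_files(example_files).items():
    print(f"{key}: {count} files, suffixes {suffixes}")
```

With real data, pass ``training_files`` from the example above instead of ``example_files``.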

## Configuring the Data Location

The ``satrain`` package expects the data files of the SatRain dataset to be located in a dedicated directory called the ``satrain_data_path``. ``satrain`` tries to keep track of this path between
invocations to avoid downloading data multiple times.

After a fresh install, the ``data_path`` will be initialized to the current working directory from which the first download is invoked.
To set an explicit data path, you can use the ``satrain config set_data_path`` command. This will create a ``satrain`` configuration file in the
current user's configuration directory, which will allow the setting to persist for subsequent use of the ``satrain`` package. A configuration file storing the
``data_path`` will also be created when the first data download is invoked.

Alternatively, the ``data_path`` can be set using the ``SATRAIN_DATA_PATH`` environment
variable. The path in ``SATRAIN_DATA_PATH`` takes precedence over the setting in the user's configuration
file.
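
The lookup order described above can be sketched in a few lines of Python. This is an illustration of the precedence only, not the ``satrain`` implementation: the function ``resolve_data_path`` and its arguments are hypothetical, and ``configured_path`` stands in for the value stored in the user's configuration file.

```python
import os
from pathlib import Path


def resolve_data_path(environ, configured_path=None, cwd=None):
    """Illustrative lookup order: environment variable, config file, working directory.

    Returns a (path, source) tuple so the caller can see where the value came from.
    """
    env_value = environ.get("SATRAIN_DATA_PATH")
    if env_value:
        # The environment variable wins over everything else.
        return Path(env_value), "environment variable"
    if configured_path:
        # Otherwise use the value from the (hypothetical) configuration file.
        return Path(configured_path), "configuration file"
    # After a fresh install, fall back to the current working directory.
    fallback = Path(cwd) if cwd is not None else Path.cwd()
    return fallback, "current working directory"


path, source = resolve_data_path(os.environ, configured_path="/data/satrain")
print(f"data path: {path} (from {source})")
```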

The ``satrain config show`` command can be used to find out the value of the ``satrain`` data path
and how it is derived:

```
satrain config show
```

## Optional: Manual data download using the ``satrain`` command

```{note}
Most functionality of the ``satrain`` package that requires access to the SatRain dataset will automatically download the required files. Therefore, the following steps are not
strictly required to get started using the package.
```

Use the ``satrain`` command-line interface to download parts or all of the SatRain dataset:

```
satrain download --data_path /path/to/store/data --sensors gmi --subset s --splits training,validation,testing --geometries gridded
```

This will download the gridded SatRain training, validation, and testing data of the ``s`` subset for the GMI sensor. The ``satrain download`` command takes the following options:

 - ``--sensors`` A comma-separated list of the sensors for which to download the data.
 - ``--subset`` The size of the subset to download. Choose ``xl`` for the full dataset.
 - ``--splits`` A comma-separated list of the data splits to download. Available options are ``training``, ``validation``, ``testing``, and ``evaluation``.
 - ``--geometries`` A comma-separated list of the data geometries. Available options are ``gridded`` for gridded
   observations and ``on_swath`` for the data on the PMW swath.
 - ``--inputs`` A comma-separated list of the input data to download. Available options are ``ancillary``, ``geo``, ``geo_ir`` and ``pmw``.
 - ``--formats`` A comma-separated list of the data formats to download. Available options are ``spatial`` for 2D
   training scenes and ``tabular`` for tabular data.
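
When downloads are scripted, the command line can be assembled from Python lists so that the comma-separated values are built consistently. The following is a minimal sketch using only the options shown in the example command above. The helper ``build_download_command`` is hypothetical and not part of ``satrain``.

```python
import shlex


def build_download_command(data_path, sensors, subset, splits, geometries):
    """Build the argument list for ``satrain download`` from Python values."""
    return [
        "satrain", "download",
        "--data_path", str(data_path),
        "--sensors", ",".join(sensors),
        "--subset", subset,
        "--splits", ",".join(splits),
        "--geometries", ",".join(geometries),
    ]


command = build_download_command(
    data_path="/path/to/store/data",
    sensors=["gmi"],
    subset="s",
    splits=["training", "validation", "testing"],
    geometries=["gridded"],
)

# Print the command for inspection; nothing is downloaded here.
print(shlex.join(command))
```

To run the download, pass the list to ``subprocess.run(command, check=True)`` in an environment where the ``satrain`` command is installed.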
   
   
## Listing available files

The ``satrain list_files`` command lists the files on the local machine
that ``satrain`` is aware of. After a successful download, it should show a
table listing the relative location of each dataset and the number of files it
comprises.
 
```