# Recipe Tutorial

This tutorial describes how to create a recipe from scratch.


## Step 1: Get to know your source data

If you are developing a new recipe, you are probably starting from an existing
dataset. The first step is to get to know that dataset. For this tutorial,
our example will be the _NOAA Optimum Interpolation Sea Surface Temperature
(OISST) v2.1_. The authoritative website describing the data is
<https://www.ncdc.noaa.gov/oisst/optimum-interpolation-sea-surface-temperature-oisst-v21>.
This website contains links to the actual data files on the
[data access](https://www.ncdc.noaa.gov/oisst/data-access) page. We will use the
_AVHRR-Only_ version of the data and follow the corresponding link to the
[Gridded netCDF Data](https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/).
Browsing through the directories, we can see that there is one file per day. The
very first day of the dataset is stored at the following URL:

```text
https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc
```

From this example, we can work out the pattern of the file naming conventions.
But first, let's just download one of the files and open it up.


In [None]:
! curl -O https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc

In [None]:
import xarray as xr

ds = xr.open_dataset("oisst-avhrr-v02r01.19810901.nc")
ds

We can see there are four data variables, all with dimension
`(time, zlev, lat, lon)`. There is a _dimension coordinate_ for each dimension,
and no _non-dimension coordinates_. Each file in the sequence presumably has the
same `zlev`, `lat`, and `lon`, but we expect `time` to be different in each one.
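
To make the distinction between dimension and non-dimension coordinates concrete, here is a minimal sketch on a tiny synthetic dataset. The variable name `sst`, the coordinate values, and the 2x3 grid are stand-ins chosen for illustration; they are not read from the OISST file.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for one daily file: a single variable on a tiny grid.
toy = xr.Dataset(
    data_vars={
        "sst": (
            ("time", "zlev", "lat", "lon"),
            np.zeros((1, 1, 2, 3), dtype="float32"),
        )
    },
    coords={
        "time": np.array(["1981-09-01"], dtype="datetime64[ns]"),
        "zlev": [0.0],
        "lat": [-0.125, 0.125],
        "lon": [0.125, 0.375, 0.625],
    },
)

# A dimension coordinate shares its name with a dimension;
# any other coordinate is a non-dimension coordinate.
dimension_coords = sorted(name for name in toy.coords if name in toy.dims)
non_dimension_coords = sorted(name for name in toy.coords if name not in toy.dims)

print("dimension coordinates:", dimension_coords)
print("non-dimension coordinates:", non_dimension_coords)
```

Running the same two comprehensions against `ds` should give the same kind of split for the real file.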

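Let's also check the total size of the dataset in the file. Note that
`ds.nbytes` reports the uncompressed, in-memory size, which can be larger than
the compressed file on disk.


In [None]:
print(f"Dataset size is {ds.nbytes / 1e6:.1f} MB")

This size is important because it will help us define the _chunk size_
Pangeo Forge will use to build up the target dataset.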
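
As a rough illustration of how the per-file size feeds into a chunk size, the sketch below computes how many whole input files fit into a target chunk. The helper name `files_per_chunk` and both numbers are hypothetical placeholders, not measurements and not part of the Pangeo Forge API; substitute `ds.nbytes` and your own target.

```python
def files_per_chunk(bytes_per_file: int, target_chunk_bytes: int) -> int:
    """Return how many whole input files fit in one target chunk (at least 1)."""
    if bytes_per_file <= 0 or target_chunk_bytes <= 0:
        raise ValueError("sizes must be positive")
    return max(1, target_chunk_bytes // bytes_per_file)


# Placeholder numbers for illustration only.
example_bytes_per_file = 16_000_000  # hypothetical size of one daily file in memory
example_target_chunk = 100_000_000   # hypothetical target chunk of 100 MB

n_files = files_per_chunk(example_bytes_per_file, example_target_chunk)
chunk_mb = n_files * example_bytes_per_file / 1e6
print(f"{n_files} files per chunk, about {chunk_mb:.0f} MB per chunk")
```

The `max(1, ...)` guard covers the case where a single file is already larger than the target, in which case each chunk holds one file.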


## Step 2: Pick a Recipe class

For our first recipe, we will want to use a pre-defined Recipe class from Pangeo
Forge.

By examining the {doc}`recipes` documentation page, we see that our scenario is
a good case for the {class}`pangeo_forge.recipe.NetCDFtoZarrSequentialRecipe`
class. Let's examine its documentation string in our notebook.


In [None]:
from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe
NetCDFtoZarrSequentialRecipe?

## Step 3: Define Recipe parameters

Our chosen class has only two required parameters: `input_urls` and
`sequence_dim`.

`input_urls` is a list of URLs pointing to the data. To populate this, we need
to explicitly create this list based on what we know about the file naming
conventions. Let's look again at the first URL:

```text
https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc
```

From this we deduce the following format string.


In [None]:
input_url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)
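
A quick way to check the deduced pattern is to format it for the first day and compare the result with the URL we already know. The sketch below repeats the pattern so that it is self-contained and uses only the standard library; the names `known_first_url` and `candidate` are introduced here for illustration.

```python
from datetime import date

# Same pattern as above, repeated so this snippet runs on its own.
input_url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)

# The URL of the first file, copied from the data directory listing.
known_first_url = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc"
)

first_day = date(1981, 9, 1)
candidate = input_url_pattern.format(
    yyyymm=first_day.strftime("%Y%m"), yyyymmdd=first_day.strftime("%Y%m%d")
)

# If the pattern is right, formatting the first day reproduces the known URL.
assert candidate == known_first_url
print(candidate)
```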

Now we need a sequence of datetimes. Pandas is the easiest way to get this. At
the time of writing, the latest available data is from 2021-01-05.


In [None]:
import pandas as pd

dates = pd.date_range("1981-09-01", "2021-01-05", freq="D")
input_urls = [
    input_url_pattern.format(
        yyyymm=day.strftime("%Y%m"), yyyymmdd=day.strftime("%Y%m%d")
    )
    for day in dates
]
print(f"Found {len(input_urls)} files!")
input_urls[-1]
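
Before handing thousands of URLs to a recipe, it can be worth sanity-checking the list. The following is a minimal standard-library sketch, assuming the same date range and pattern as above; the helper names `build_daily_urls` and `check_urls` are hypothetical and not part of Pangeo Forge. It checks only the shape of the URLs, not whether each file exists on the server.

```python
import re
from datetime import date, timedelta

input_url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)


def build_daily_urls(start: date, end: date) -> list[str]:
    """Format one URL per calendar day from start to end, inclusive."""
    n_days = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(n_days))
    return [
        input_url_pattern.format(
            yyyymm=d.strftime("%Y%m"), yyyymmdd=d.strftime("%Y%m%d")
        )
        for d in days
    ]


def check_urls(urls: list[str]) -> None:
    """Raise AssertionError if any URL breaks the expected naming pattern."""
    name_re = re.compile(r"/(\d{6})/oisst-avhrr-v02r01\.(\d{8})\.nc$")
    assert len(set(urls)) == len(urls), "duplicate URLs"
    for url in urls:
        match = name_re.search(url)
        assert match is not None, f"unexpected URL shape: {url}"
        # The month directory must agree with the date in the file name.
        assert match.group(2).startswith(match.group(1)), f"month mismatch: {url}"


urls = build_daily_urls(date(1981, 9, 1), date(2021, 1, 5))
check_urls(urls)
print(f"{len(urls)} URLs passed the naming checks")
```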

That's a lot of files!

The remaining parameter is `sequence_dim`, the dimension along which the files
are concatenated. Here it is `"time"`. We can now instantiate our recipe.


In [None]:
recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls, sequence_dim="time"
)
recipe
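
To build some intuition for what `sequence_dim="time"` means, the sketch below joins two tiny synthetic one-day datasets along `time` with plain `xr.concat`. This only illustrates the idea of a sequence dimension; it is not a description of how the recipe class works internally, and the helper `toy_day`, the variable name `sst`, and the grid values are made up for the example.

```python
import numpy as np
import xarray as xr


def toy_day(day: str, value: float) -> xr.Dataset:
    """Build a tiny one-timestep dataset standing in for one daily file."""
    return xr.Dataset(
        data_vars={
            "sst": (
                ("time", "lat", "lon"),
                np.full((1, 2, 3), value, dtype="float32"),
            )
        },
        coords={
            "time": np.array([day], dtype="datetime64[ns]"),
            "lat": [-0.125, 0.125],
            "lon": [0.125, 0.375, 0.625],
        },
    )


# Each input has the same lat/lon grid but a different time value.
days = [toy_day("1981-09-01", 1.0), toy_day("1981-09-02", 2.0)]

# Joining along "time" stacks the inputs into one longer time axis.
combined = xr.concat(days, dim="time")
print(dict(combined.sizes))
```

Only the `time` axis grows; `lat` and `lon` keep their original lengths because they are identical in every input.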