# Defining a Recipe (Intro Tutorial Part 1)

Welcome to the Pangeo Forge introduction tutorial!

This tutorial is split into three parts:
1. Defining a recipe
2. Running a recipe locally
3. Setting up a recipe to run in the cloud

Throughout this tutorial we are going to convert NOAA OISST data, stored as netCDF files, to Zarr. OISST is a global, gridded sea surface temperature dataset at daily, 1/4 degree resolution. By the end of this tutorial sequence you will have converted some OISST data to Zarr, be able to access a sample on your computer, and have seen how to propose the recipe for cloud deployment!

Here we tackle **Part 1 - Defining a Recipe**. We will assume that you already have `pangeo-forge-recipes` installed.

## Steps to Creating a Recipe

The three major pieces of creating a recipe are:

1. Creating a generalized URL pattern
2. Defining a `FilePattern` object
3. Defining a Recipe Class object

We will talk about each of these steps in turn.

### Where should I write this code?
Eventually, all of the code defining the recipe will need to go in a file called `recipe.py`. If you typically work in a text editor and want to start that way from the beginning that is great. It is also totally fine to work on your recipe code in a Jupyter Notebook and then copy the final code to a single `.py` file later. The choice between the two is personal preference.

## Understanding the URL Pattern for OISST


### Explore the structure

In order to create our Recipe, we have to understand how the data are organized on the server.
Like many datasets, OISST is available over the internet via the HTTP protocol.
We can browse the files at this URL:

<https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/>

By putting the URL into a web browser, we can explore the organization of the dataset.
We need to understand this organization in order to build our Recipe.

The link above shows folders grouped by month. Within each month folder there is one file per day. We could represent the file structure that OISST follows like this:

https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/
```
 │
 ├──198109/
 │   ├──oisst-avhrr-v02r01.19810901.nc
 │  ...
 │   └──oisst-avhrr-v02r01.19810930.nc
...
 └──202201/
      ├──oisst-avhrr-v02r01.20220101.nc
     ...
      └──oisst-avhrr-v02r01.20220131.nc
```

The important takeaways from this structure exploration are:
- 1 file = 1 day
- Folders separate months

### A single URL

By putting together the full URL for a single file we can see that the OISST dataset for December 9th, 1981 would be accessed using the URL:

[https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198112/oisst-avhrr-v02r01.19811209.nc](https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198112/oisst-avhrr-v02r01.19811209.nc)

Copying and pasting that URL into a web browser will download that single file to your computer.

If we had only a few files, we could manually type out the URL for each of them.
But that isn't practical when we have thousands of files.
We need to understand the _pattern_.

### A generalized URL pattern
We can generalize the URL to say that OISST datasets are accessed using a URL of the format:

`https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/{year-month}/oisst-avhrr-v02r01.{year-month-day}.nc`

where `{year-month}` and `{year-month-day}` change for each file. Of the three dimensions of this dataset - latitude, longitude and time - the individual files are split up by time.
Our goal is to combine, or _concatenate_, these files into a single Zarr dataset.
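
As a minimal sketch of what this generalization means in code, the snippet below fills in the two changing pieces for a single date using only the Python standard library. The constant `URL_TEMPLATE` and the placeholder names `yyyymm` and `yyyymmdd` are illustrative choices made for this sketch; they are not part of Pangeo Forge.

```python
from datetime import date

# Illustrative template; the placeholder names are assumptions made for this sketch.
URL_TEMPLATE = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/"
    "v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)

day = date(1981, 12, 9)
single_url = URL_TEMPLATE.format(
    yyyymm=day.strftime("%Y%m"),      # month folder, here 198112
    yyyymmdd=day.strftime("%Y%m%d"),  # date stamp in the file name, here 19811209
)
print(single_url)
```

The printed URL should be the same as the December 9th, 1981 URL shown in the previous section.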

### Why does this matter so much?

A Pangeo Forge {class}`FilePattern <pangeo_forge_recipes.patterns.FilePattern>` is built on the following premises:

1. We want to combine many individual small files into a larger dataset along one or more dimensions using either "concatenate" or "merge" style operations.
2. The individual files are accessible by URL and organized in a predictable way.
3. There is some kind of mapping between the dimensions of the combination process and the actual URLs.

Knowing the generalized structure of the OISST URL leads us to the next step of creating a recipe - defining a `FilePattern`.

## Creating a `FilePattern` object

```{note}
`FilePattern`s are probably the most abstract part of Pangeo Forge.
It may take some time and experience to become comfortable with the `FilePattern` concept.
```

In order to define a `FilePattern` we need to:

- Define the {ref}`Combine Dimensions` and associated `keys`
- Write a Python function that maps these `keys` to actual URLs

Let's start with the Combine Dimension.


### Define Combine Dimension

This File Pattern is pretty straightforward.
There is only one Combine Dimension: time.
There is one file per day, and we want to concatenate the files in time.

We start by creating an index of every day covered by the dataset.
The easiest way to do this is with the Pandas `date_range` function.


In [1]:
import pandas as pd

dates = pd.date_range('1981-09-01', '2022-02-01', freq='D')
# print the first 4 dates
dates[:4]

DatetimeIndex(['1981-09-01', '1981-09-02', '1981-09-03', '1981-09-04'], dtype='datetime64[ns]', freq='D')
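
The size of this index can be cross-checked with plain date arithmetic. The sketch below uses only the standard library rather than pandas, and assumes both endpoints are included in the range, which is the default behavior of `pd.date_range`. The variable names are illustrative.

```python
from datetime import date

first_day = date(1981, 9, 1)
last_day = date(2022, 2, 1)

# Both endpoints are part of the range, hence the +1.
n_days = (last_day - first_day).days + 1
print(n_days)  # 14764
```

With one file per day, this is also the number of files the recipe will need to process.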

These will be the `keys` for our `ConcatDim`.
We now define a {class}`ConcatDim <pangeo_forge_recipes.patterns.ConcatDim>` as follows:

In [2]:
from pangeo_forge_recipes.patterns import ConcatDim

time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
time_concat_dim

ConcatDim(name='time', nitems_per_file=1)

The `nitems_per_file=1` option is a hint we can give to Pangeo Forge. It means, "we know there is only one timestep in each file".
Providing this hint is not necessary, but it makes some things more efficient down the line.

### Define Format Function

Next we need to write a function that takes a single key and translates it to a URL.
This is just a standard Python function.

```{caution}
If you're not comfortable with writing Python functions, this may be a good time to review
the [official Python tutorial](https://docs.python.org/3/tutorial/controlflow.html#defining-functions)
on this topic.
```

There are a couple of important things to note about this function:

- It must have the _same number of arguments as the number of Combine Dimensions_. In our case, this is just one.
- The name of the argument must match the `name` of the Combine Dimension. In our case, this is `time`.

So we need to write a function that takes `time` as its argument and returns the correct URL for the OISST file for that date.
A very useful helper for this is the [strftime](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) method,
which is available on each item in the `dates` index.
For example:

In [3]:
dates[0].strftime('%Y')

'1981'

Armed with this, we can now write our function:

In [4]:
def make_url(time):
    yyyymm = time.strftime('%Y%m')
    yyyymmdd = time.strftime('%Y%m%d')
    return (
        'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
        f'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
    )

Let's test it out:

In [5]:
make_url(dates[0])

'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc'

It looks good! 🤩 
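
A single test date does not exercise the month boundary, where the folder name has to change along with the file name. The sketch below repeats `make_url` so that it is self-contained and checks a few days around the end of September 1981. It uses `datetime.date` objects as a stand-in for the pandas timestamps in `dates`; both provide `strftime`, so the function behaves the same way. The sample dates and helper variable names are illustrative.

```python
from datetime import date, timedelta


def make_url(time):
    yyyymm = time.strftime('%Y%m')
    yyyymmdd = time.strftime('%Y%m%d')
    return (
        'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
        f'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
    )


# Four consecutive days that straddle the September/October boundary.
start = date(1981, 9, 29)
sample_days = [start + timedelta(days=i) for i in range(4)]

for day in sample_days:
    url = make_url(day)
    folder, filename = url.split('/')[-2:]
    # The file name has the form oisst-avhrr-v02r01.YYYYMMDD.nc
    stamp = filename.split('.')[1]
    # The folder (YYYYMM) should be a prefix of the file's date stamp (YYYYMMDD).
    assert stamp.startswith(folder), url
    print(folder, filename)
```

If the folder and the date stamp ever disagreed, the assertion would fail and report the offending URL.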

We are now ready to make our `FilePattern`.

### Define the `FilePattern`

We now have the two ingredients needed for our {class}`FilePattern <pangeo_forge_recipes.patterns.FilePattern>`.
At this point, it's pretty simple:

In [6]:
from pangeo_forge_recipes.patterns import FilePattern

pattern = FilePattern(make_url, time_concat_dim)
pattern

<FilePattern {'time': 14764}>

This object contains everything Pangeo Forge needs to know about where the data are coming from and how they should be combined.

### Optional: Iterating through a `FilePattern`

This step is optional, but if you want to interact with the `FilePattern` object a bit more (for example, for debugging), you can iterate through it using `.items()`.
To keep the output concise, we use an `if` statement to stop the iteration after the first three file paths.

In [7]:
for index, url in pattern.items():
    print(index)
    print(url)
    # Stop after the 3rd filepath (September 3rd, 1981)
    if '19810903' in url:
        break

Index({DimIndex(name='time', index=0, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)})
https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc
Index({DimIndex(name='time', index=1, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)})
https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810902.nc
Index({DimIndex(name='time', index=2, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)})
https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810903.nc


The `index` is an object used internally by Pangeo Forge. The URL corresponds to the actual file we want to download.
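
Another way to look at only the first few items of a long iterable is `itertools.islice` from the standard library. The sketch below does not use Pangeo Forge at all: `fake_items` is a hypothetical stand-in that yields `(position, url)` pairs in the same spirit as `pattern.items()`, so that the example runs on its own. With a real `FilePattern`, the assumption is that `pattern.items()` could be passed to `islice` in the same way, since `islice` accepts any iterable.

```python
from datetime import date, timedelta
from itertools import islice


def make_url(time):
    yyyymm = time.strftime('%Y%m')
    yyyymmdd = time.strftime('%Y%m%d')
    return (
        'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
        f'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
    )


def fake_items():
    """Hypothetical stand-in for pattern.items(): lazily yields (position, url) pairs."""
    day = date(1981, 9, 1)
    position = 0
    while True:
        yield position, make_url(day)
        day += timedelta(days=1)
        position += 1


# islice stops after three items, so the rest of the sequence is never generated.
first_three = list(islice(fake_items(), 3))
for position, url in first_three:
    print(position, url)
```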

## Defining a Recipe Class object

Now that we have our `FilePattern` object, defining the [Recipe Class](https://pangeo-forge.readthedocs.io/en/latest/recipe_user_guide/recipes.html) is quick. In this tutorial we want to convert our dataset to Zarr, so we will use the `XarrayZarrRecipe` class. Instantiating the class looks like this:

In [8]:
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=10)
recipe

XarrayZarrRecipe(file_pattern=<FilePattern {'time': 14764}>, storage_config=StorageConfig(target=FSSpecTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7fc218887610>, root_path='/var/folders/dt/n99tg72n61v8d22px_jm78yh0000gn/T/tmpsx8yd6wz/f28yA6FS'), cache=CacheFSSpecTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7fc218887610>, root_path='/var/folders/dt/n99tg72n61v8d22px_jm78yh0000gn/T/tmpsx8yd6wz/WH9JoNEB'), metadata=MetadataTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7fc218887610>, root_path='/var/folders/dt/n99tg72n61v8d22px_jm78yh0000gn/T/tmpsx8yd6wz/kH1FPsrZ')), inputs_per_chunk=10, target_chunks={}, cache_inputs=True, copy_input_to_local_file=False, consolidate_zarr=True, consolidate_dimension_coordinates=True, xarray_open_kwargs={}, xarray_concat_kwargs={}, delete_input_encoding=True, process_input=None, process_chunk=None, lock_timeout=None, subset_inputs={}, open_input_with_fsspec_reference=False)

The arguments are:
1. the `FilePattern` object
2. `inputs_per_chunk` - indicates how many input files should go into a single chunk of the Zarr store

More complex recipes may use additional arguments, but for this tutorial these two are all we need.
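
To get a feel for what `inputs_per_chunk=10` implies, here is a rough back-of-the-envelope calculation in plain Python. The file count comes from the `FilePattern` shown earlier. The grid size (720 x 1440 points for a global 1/4 degree grid) and the 4 bytes per value are assumptions made for this sketch, not values taken from the tutorial, and the result is an uncompressed estimate for a single variable.

```python
import math

n_files = 14764        # length of the time dimension reported by the FilePattern
inputs_per_chunk = 10  # value passed to the recipe
nitems_per_file = 1    # one time step per file

time_steps_per_chunk = inputs_per_chunk * nitems_per_file
# The last chunk is only partially filled, so round up.
n_chunks = math.ceil(n_files / inputs_per_chunk)

# Assumptions (not from the tutorial): 180 / 0.25 = 720 latitudes,
# 360 / 0.25 = 1440 longitudes, and 4-byte floating point values.
n_lat, n_lon, bytes_per_value = 720, 1440, 4
chunk_megabytes = time_steps_per_chunk * n_lat * n_lon * bytes_per_value / 1e6

print(n_chunks)         # 1477
print(chunk_megabytes)  # roughly 41.5 MB per variable, before compression
```

Under these assumptions, a larger `inputs_per_chunk` would give fewer but larger chunks, and a smaller value would give more but smaller ones.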

## End of Part 1
And there you have it - your first recipe object! Inside that object is all the information about the dataset that is needed to run the data conversion. Pretty compact!

In Part 2 of the tutorial, we will use our recipe object, `recipe`, to convert some data locally.

### Code Summary
The code written in Part 1 could all be written together as:

In [12]:
import pandas as pd

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

dates = pd.date_range('1981-09-01', '2022-02-01', freq='D')

def make_url(time):
    yyyymm = time.strftime('%Y%m')
    yyyymmdd = time.strftime('%Y%m%d')
    return (
        'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
        f'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
    )

time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
pattern = FilePattern(make_url, time_concat_dim)
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=10)