# Defining a Recipe (Intro Tutorial Part 1)

Welcome to the Pangeo Forge introduction tutorial!

This tutorial is split into three parts:
1. Defining a recipe
2. Running a recipe locally
3. Setting up a recipe to run in the cloud

Throughout this tutorial we are going to convert NOAA OISST data stored in netCDF to Zarr. OISST is a global, gridded ocean sea surface temperature dataset at daily 1/4 degree resolution. By the end of this tutorial sequence you will have converted some OISST data to Zarr, be able to access a sample on your computer, and know how to propose the recipe for cloud deployment!

Here we tackle **Part 1 - Defining a Recipe**. We will assume that you already have `pangeo-forge-recipes` installed.

## Steps to Creating a Recipe

The three major pieces of creating a recipe are:

1. Creating a generalized URL pattern
2. Defining a `FilePattern` object
3. Defining a Recipe Class object

We will talk about each of these steps in turn.

### Where should I write this code?
Eventually, all of the code defining the recipe will need to go in a file called `recipe.py`. If you typically work in a text editor and want to start that way from the beginning, that is great. It is also totally fine to work on your recipe code in a Jupyter Notebook and then copy the final code to a single `.py` file later. The choice between the two is a matter of personal preference.

## A Generalized URL Pattern for OISST

### Explore the structure

Like many datasets, OISST is available via an HTTP URL.

[https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/](https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/)

By putting the URL into a web browser we can explore the organization of OISST. This is important because Pangeo Forge relies on the organization of a dataset URL to scale data access.

The link above shows data by month. Within each month there is data for individual days. We could represent the file structure that OISST is following like this:

https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/
```
 │
 ├──198109/
 │   ├──oisst-avhrr-v02r01.19810901.nc
 │  ...
 │   └──oisst-avhrr-v02r01.19810930.nc
...
 └──202201/
      ├──oisst-avhrr-v02r01.20220101.nc
     ...
      └──oisst-avhrr-v02r01.20220131.nc
```

The important takeaways from this structure exploration are:
- 1 file = 1 day
- Folders separate months

### A single URL

By putting together the full URL for a single file we can see that the OISST dataset for December 9th, 1981 would be accessed using the URL:

[https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198112/oisst-avhrr-v02r01.19811209.nc](https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198112/oisst-avhrr-v02r01.19811209.nc)

Copying and pasting that URL into a web browser will download that single file to your computer.

### A generalized URL pattern
We can generalize the URL to say that OISST datasets are accessed using a URL of the format:

`https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/{year-month}/oisst-avhrr-v02r01.{year-month-day}.nc`

where `{year-month}` and `{year-month-day}` change for each file. Of the three dimensions of this dataset - latitude, longitude and time - the individual files are split up by time.
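
To see that the generalized pattern really does describe a concrete file, here is a minimal sketch that uses only the Python standard library. The helper name `split_oisst_url` and the regular expression are illustrative assumptions, not part of Pangeo Forge. It pulls the month folder and the day stamp out of a URL and confirms that the folder name matches the first six digits of the file's date.

```python
import re

BASE = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/"
    "v2.1/access/avhrr/"
)

# Hypothetical pattern for this tutorial: a six-digit month folder followed by
# a file name that carries an eight-digit day stamp.
URL_RE = re.compile(re.escape(BASE) + r"(\d{6})/oisst-avhrr-v02r01\.(\d{8})\.nc")


def split_oisst_url(url):
    """Return (year_month, year_month_day) parsed from an OISST file URL."""
    match = URL_RE.fullmatch(url)
    if match is None:
        raise ValueError(f"URL does not follow the expected pattern: {url}")
    return match.group(1), match.group(2)


url = BASE + "198112/oisst-avhrr-v02r01.19811209.nc"
year_month, year_month_day = split_oisst_url(url)
print(year_month, year_month_day)

# The folder name is the first six digits of the file's day stamp.
print(year_month_day.startswith(year_month))
```

If a URL ever failed this check, it would be a sign that the dataset does not follow the predictable layout the rest of this tutorial relies on.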

### Why does this matter so much?
A Pangeo Forge `FilePattern` is built on two premises: 1) datasets that are accessible by URL are organized in a predictable way, and 2) the URL organization tells us something about the structure of the dataset. Knowing the generalized structure of the OISST URL leads us to the next step of creating a recipe - defining a `FilePattern`.

## Defining a `FilePattern` object

There are several different ways to define a [`FilePattern`](https://pangeo-forge.readthedocs.io/en/latest/recipe_user_guide/file_patterns.html) in Pangeo Forge. In this tutorial we are going to use `pattern_from_file_sequence`. The input to the `pattern_from_file_sequence` function is a list of URLs for the files of the dataset that we want to convert. In other words, our goal is to create a list that looks like this:

```
["https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/199006/oisst-avhrr-v02r01.19900601.nc", "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/199006/oisst-avhrr-v02r01.19900602.nc", "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/199006/oisst-avhrr-v02r01.19900603.nc", ... ]
```

To create the list of URLs we will use the library `pandas` and Python format strings.

### Create a format string & function

First, let's use `pd.date_range()` to create a list of dates. We will put in the beginning and ending dates of the dataset - in this case September 1st, 1981 through February 1st, 2022. We will also use `freq='D'` because OISST is a daily dataset.

In [1]:
import pandas as pd

In [10]:
dates = pd.date_range('1981-09-01', '2022-02-01', freq='D')
# print the first 10 dates
print(dates[:10])

DatetimeIndex(['1981-09-01', '1981-09-02', '1981-09-03', '1981-09-04',
               '1981-09-05', '1981-09-06', '1981-09-07', '1981-09-08',
               '1981-09-09', '1981-09-10'],
              dtype='datetime64[ns]', freq='D')
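
Both endpoints passed to `pd.date_range()` are included in the result, so the number of dates is the number of days between the start and end plus one. As a quick cross-check that does not depend on `pandas`, the same count can be computed with the standard library (the variable names here are illustrative):

```python
from datetime import date

start = date(1981, 9, 1)
end = date(2022, 2, 1)

# Both endpoints are included, hence the "+ 1".
n_days = (end - start).days + 1
print(n_days)  # 14764
```

Since OISST has one file per day, this is also the number of files that the date range above refers to.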


Great, we have our dates in a list. Now we want to create a Python format string so that we can use this list of dates to programmatically create all of our URLs.

For OISST the format string for our generalized URL looks like this:

In [3]:
input_url_pattern = (
    'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
    'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
    )

We have substituted the parts of the URL that change for each file with named `{}` placeholders. The placeholders allow us to use Python's `.format()` method for strings and the `.strftime()` method for Python dates to create each file's URL. For example, we could create the URL for the first day of the dataset like this:

In [4]:
print('formatting OISST URL for the date ', dates[0])

input_url_pattern.format(
    yyyymm=dates[0].strftime('%Y%m'), yyyymmdd=dates[0].strftime('%Y%m%d')
)

formatting OISST URL for the date  1981-09-01 00:00:00


'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc'

`%Y%m` and `%Y%m%d` are the format specifiers that describe `{year-month}` and `{year-month-day}` as they appeared in the generalized OISST URL. You can view a reference for `.strftime()` format specifiers [here](https://strftime.org/).
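
To make the specifiers concrete, here is a small sketch using the standard library `datetime` class, whose `.strftime()` method accepts the same specifiers as the `pandas` timestamps used in this tutorial. The date is the December 9th, 1981 example from earlier.

```python
from datetime import datetime

day = datetime(1981, 12, 9)

# %Y = four-digit year, %m = zero-padded month, %d = zero-padded day of month
year_month = day.strftime("%Y%m")
year_month_day = day.strftime("%Y%m%d")

print(year_month)      # 198112
print(year_month_day)  # 19811209
```

The zero padding matters here: the folder and file names in the OISST URL always use two digits for the month and the day.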

We programmatically created a URL string for the first file!

### Create a list of all the dataset urls

Now let's put these pieces together. We will put our 3 parts
1. the `dates` sequence
2. the `input_url_pattern` format string
3. the `.format()` function

into a list comprehension to create the URLs for every file in our date range.

In [5]:
input_urls = [
    input_url_pattern.format(
        yyyymm=day.strftime('%Y%m'), yyyymmdd=day.strftime('%Y%m%d')
    )
    for day in dates
]

In [9]:
# Print the first 3 urls
print(input_urls[:3])

['https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc', 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810902.nc', 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810903.nc']


There we have it! A list of URLs for the full OISST dataset. By changing the dates in the `pd.date_range()` call we could instead generate data access URLs for any subset of the dataset, such as a single month.
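
The date range is the only thing that decides which files end up in the list. The sketch below wraps the same format string in a hypothetical helper, `make_input_urls` (a name invented for this example, not a Pangeo Forge function), and builds the dates with the standard library instead of `pandas` so that it runs without extra dependencies.

```python
from datetime import date, timedelta

input_url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/"
    "v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)


def make_input_urls(start, end):
    """Build one URL per day from start to end, both inclusive (hypothetical helper)."""
    n_days = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(n_days))
    return [
        input_url_pattern.format(
            yyyymm=day.strftime("%Y%m"), yyyymmdd=day.strftime("%Y%m%d")
        )
        for day in days
    ]


# January 1982 plus the first day of February.
january_urls = make_input_urls(date(1982, 1, 1), date(1982, 2, 1))
print(len(january_urls))  # 32
print(january_urls[-1])
```

A short range like this one is handy while developing a recipe, because it keeps the number of files to download small.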

### Create the File Pattern object

Now we return to our `pattern_from_file_sequence` function. Calling it looks like this:

In [7]:
from pangeo_forge_recipes.patterns import pattern_from_file_sequence

In [8]:
pattern = pattern_from_file_sequence(input_urls, 'time', nitems_per_file=1)

The arguments are:
* `input_urls` - the list of file urls we just created
* `'time'` - the name of the dimension along which the files are concatenated. Because each file of OISST is a new day of data with the same variable (sea surface temperature) and spatial extent, `time` is the input for OISST.
* `nitems_per_file` - specifies how many timesteps each file contains (1 day per file for OISST)

When we print `pattern` we see that it is a `FilePattern` object with 14764 timesteps, one for each daily file in our date range.

In [11]:
pattern

<FilePattern {'time': 14764}>

### Inspecting a `FilePattern`

In [13]:
for index, url in pattern.items():
    print(index, url)

Index({DimIndex(name='time', index=0, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)}) https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc
Index({DimIndex(name='time', index=1, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)}) https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810902.nc
Index({DimIndex(name='time', index=2, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)}) https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810903.nc
Index({DimIndex(name='time', index=3, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)}) https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810904.nc
Index({DimIndex(name='time', index=4, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)}) htt

When we iterated through the `FilePattern` we saw that each item has two things:
1. An index object
2. A dataset URL

Together, these two things describe the files of the dataset (with #2) and how those files relate to one another (with #1).
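
As a plain-Python analogy (not how `FilePattern` is implemented), the pairing of an index with a URL can be pictured with `enumerate`. The dictionary keys below simply mirror the `name`, `index` and `sequence_len` fields visible in the printed output; everything else is an assumption made for illustration.

```python
base = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/"
    "v2.1/access/avhrr/198109/"
)
urls = [base + f"oisst-avhrr-v02r01.198109{day:02d}.nc" for day in (1, 2, 3)]

# Pair each URL with its position along the "time" dimension.
pairs = [
    ({"name": "time", "index": position, "sequence_len": len(urls)}, url)
    for position, url in enumerate(urls)
]

for index, url in pairs:
    print(index, url)
```

The index records where a file sits in the sequence, which is the information needed to concatenate the files in the right order.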

## Defining a Recipe Class object

Now that we have our `FilePattern` object the [Recipe Class](https://pangeo-forge.readthedocs.io/en/latest/recipe_user_guide/recipes.html) comes pretty quickly. In this tutorial we want to convert our dataset to Zarr, so we will use the `XarrayZarrRecipe` class. Implementing the class looks like this:

In [17]:
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

In [18]:
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=10)

The arguments are:
1. the `FilePattern` object
2. `inputs_per_chunk` - indicates how many files should go into a single chunk of the zarr store

In more complex recipes additional arguments may be used, but for this tutorial these two are all we need.
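
To get a feel for what `inputs_per_chunk=10` means in numbers, here is a back-of-the-envelope sketch. It assumes that files are grouped into chunks in order along `time` and that the last chunk simply holds whatever is left over; `count_chunks` is a hypothetical helper written for this example, not a Pangeo Forge function.

```python
def count_chunks(n_files, inputs_per_chunk):
    """Return (number of chunks, number of files in the last chunk)."""
    full, remainder = divmod(n_files, inputs_per_chunk)
    n_chunks = full + (1 if remainder else 0)
    # If the files divide evenly, the last chunk is a full one.
    last_chunk = remainder or min(n_files, inputs_per_chunk)
    return n_chunks, last_chunk


# 14764 daily files, 10 files per chunk
print(count_chunks(14764, 10))  # (1477, 4)

# A one-month subset of 32 daily files
print(count_chunks(32, 10))  # (4, 2)
```

Because each OISST file holds one timestep, ten inputs per chunk corresponds to ten days of data per chunk under these assumptions.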

## End of Part 1
And there you have it - your first recipe object! Inside that object is all the information about the dataset that is needed to run the data conversion. Pretty compact!

In part 2 of the tutorial, we will use our recipe object, `recipe`, to convert some data locally.

### Code Summary
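The code written in part 1 could all be written together as follows. This summary uses a shorter date range (January 1st through February 1st, 1982, which is 32 daily files) instead of the full dataset: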

In [None]:
import pandas as pd

from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

In [None]:
input_url_pattern = (
    'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
    'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
    )

dates = pd.date_range('1982-01-01', '1982-02-01', freq='D')
input_urls = [
    input_url_pattern.format(
        yyyymm=day.strftime('%Y%m'), yyyymmdd=day.strftime('%Y%m%d')
    )
    for day in dates
]

pattern = pattern_from_file_sequence(input_urls, 'time', nitems_per_file=1)
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=10)