# Introduction Tutorial (Part 1 - Creating a Recipe)

Welcome to the Pangeo Forge introduction tutorial!

This tutorial has two parts:
1. creating and testing a recipe locally
2. setting up a recipe to run in the cloud

We will assume that you already have `pangeo-forge-recipes` installed. If you have an environment set up in which `import pangeo_forge_recipes` runs without error in Python, you are ready to go.

For this tutorial we are going to convert NOAA OISST to Zarr. OISST is a sea surface temperature dataset that is originally distributed as netCDF files. By the end of this tutorial sequence you will have converted some OISST data to Zarr and be able to access it on your computer!

## Steps to Creating a Recipe

The two major pieces of creating a recipe are:
1. Defining a File Pattern
2. Defining a Recipe Class

We will talk about each of these steps in turn.

We will define both of these pieces in a single file, which we will call `recipe.py`. For now you can also do this in a Jupyter Notebook if you prefer. To start, create a new file with this name and open up your favorite IDE or text editor to get coding.

## The OISST Dataset

Like many datasets, OISST is available via an HTTP URL that can be explored. By putting the URL https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/ into a web browser we can explore how the OISST files are organized. That page displays a listing that begins like this:

```
Parent Directory	 	-	 
198109/	2020-05-15 17:08	-	 
198110/	2020-05-15 17:08	-	 
198111/	2020-05-15 17:08	-	 
198112/	2020-05-15 17:08	-	 
198201/	2020-05-15 17:08	-	 
198202/	2020-05-15 17:08	-	 
198203/	2020-05-15 17:08	-	 
198204/	2020-05-15 17:08	-	 
198205/	2020-05-15 17:08	-	 
198206/	2020-05-15 17:08	-	
```

We see that each folder on this page is named with a 4-digit year followed by a 2-digit month. Clicking on a particular month (for example `198112/`) we see:

```
Parent Directory	 	-	 
oisst-avhrr-v02r01.19811201.nc	2020-05-15 11:14	1.6M	 
oisst-avhrr-v02r01.19811202.nc	2020-05-15 11:15	1.6M	 
oisst-avhrr-v02r01.19811203.nc	2020-05-15 11:07	1.6M	 
oisst-avhrr-v02r01.19811204.nc	2020-05-15 11:07	1.6M	 
oisst-avhrr-v02r01.19811205.nc	2020-05-15 11:08	1.6M	 
oisst-avhrr-v02r01.19811206.nc	2020-05-15 11:07	1.6M	 
oisst-avhrr-v02r01.19811207.nc	2020-05-15 11:08	1.6M	 
oisst-avhrr-v02r01.19811208.nc	2020-05-15 11:08	1.6M	 
oisst-avhrr-v02r01.19811209.nc	2020-05-15 11:10	1.6M	 
```

Combining the directory URL with a single file name, we can see that the OISST file for December 9th, 1981 is available at:
https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198112/oisst-avhrr-v02r01.19811209.nc

Copying and pasting that URL into a web browser will download that single file to your computer.
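The naming convention above can be checked in a few lines of standard-library Python. The sketch below is only an illustration: it parses the date out of the file name shown above and confirms that the folder name matches the file's year and month. The variable names are our own and are not part of Pangeo Forge.

```python
from datetime import datetime

# Folder and file name copied from the directory listings above.
folder = "198112"
filename = "oisst-avhrr-v02r01.19811209.nc"

# The date is the second-to-last dot-separated field of the file name.
date_field = filename.split(".")[-2]
file_date = datetime.strptime(date_field, "%Y%m%d")

print(file_date.date())                      # 1981-12-09
print(file_date.strftime("%Y%m") == folder)  # True
```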

Pangeo Forge File Patterns are built on the premise that datasets accessible by URL will be organized in a predictable way that also tells us something about the structure of the dataset. This leads us to the next step of creating a recipe - defining the file pattern.

## Defining a File Pattern

Exploring the remote server showed us that the URL of each OISST file follows the format:

`https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/<year-month>/oisst-avhrr-v02r01.<year-month-day>.nc`

There are several different ways to define File Patterns in Pangeo Forge, but in this tutorial we are going to use `pattern_from_file_sequence`. `pattern_from_file_sequence` takes as an input a list of urls to the files of the dataset that we want to convert. To create this list we will use the library `pandas` and Python format strings.

### Programmatically create a dataset url

First, let's use `pd.date_range()` to create a sequence of dates. We will define just one month of data (January 1982) for this tutorial, but in practice you could define the date range to be the entire range of the dataset. We will also use `freq='D'` because OISST is a daily dataset.

In [1]:
import pandas as pd

In [5]:
dates = pd.date_range('1982-01-01', '1982-01-31', freq='D')
print(dates[:10])

DatetimeIndex(['1982-01-01', '1982-01-02', '1982-01-03', '1982-01-04',
               '1982-01-05', '1982-01-06', '1982-01-07', '1982-01-08',
               '1982-01-09', '1982-01-10'],
              dtype='datetime64[ns]', freq='D')
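When choosing a range, keep in mind that `pd.date_range()` includes both the start and the end date. A quick way to confirm how many days a range covers is to check its length, as in this small sketch (the variable names `january` and `same_range` are ours, chosen for illustration):

```python
import pandas as pd

# Both the start date and the end date are included in the result.
january = pd.date_range('1982-01-01', '1982-01-31', freq='D')
print(len(january))                 # 31

# The same range, expressed as a start date plus a number of periods.
same_range = pd.date_range('1982-01-01', periods=31, freq='D')
print(january.equals(same_range))   # True
```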


Great, we have our dates. Now we want a Python format string that we can fill in with each of the dates in `dates`. For OISST, that URL template looks like this:

In [7]:
input_url_pattern = (
    'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
    'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
)

We have substituted the parts of the URL that change with named `{}` placeholders. The placeholders allow us to use Python's `.format()` string method, together with the `.strftime()` method of datetime objects, to create each file's URL. For example, we could create the URL for the first day in `dates` like this:

In [8]:
input_url_pattern.format(
    yyyymm=dates[0].strftime('%Y%m'), yyyymmdd=dates[0].strftime('%Y%m%d')
)

'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198201/oisst-avhrr-v02r01.19820101.nc'

We created a URL for the first file! You can view a reference for the `.strftime()` format specifiers [here](https://strftime.org/).
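As an aside, Python date and datetime objects also accept `strftime` specifiers directly inside a format field, so a template can take a single date argument instead of two pre-formatted strings. The sketch below uses a shortened, hypothetical template (`short_pattern`) rather than the full OISST URL, just to keep the output readable:

```python
from datetime import date

day = date(1981, 12, 9)

# The two specifiers used in this tutorial.
print(day.strftime('%Y%m'))     # 198112
print(day.strftime('%Y%m%d'))   # 19811209

# The same specifiers placed inside the format fields themselves.
short_pattern = '{day:%Y%m}/oisst-avhrr-v02r01.{day:%Y%m%d}.nc'
print(short_pattern.format(day=day))
# 198112/oisst-avhrr-v02r01.19811209.nc
```

The tutorial keeps the two-placeholder form because it makes each substituted value explicit.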

### Create a list of all the dataset urls

Now let's put these pieces together (the `dates` sequence, the `input_url_pattern` format string, and the `.format()` method) in a list comprehension to create the URLs for all the files in one month of the OISST dataset. Here we redefine `dates` to cover June 1990.

In [10]:
dates = pd.date_range('1990-06-01', '1990-06-30', freq='D')
input_urls = [
    input_url_pattern.format(
        yyyymm=day.strftime('%Y%m'), yyyymmdd=day.strftime('%Y%m%d')
    )
    for day in dates
]

In [16]:
print(input_urls[:5])

['https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/199006/oisst-avhrr-v02r01.19900601.nc', 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/199006/oisst-avhrr-v02r01.19900602.nc', 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/199006/oisst-avhrr-v02r01.19900603.nc', 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/199006/oisst-avhrr-v02r01.19900604.nc', 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/199006/oisst-avhrr-v02r01.19900605.nc']


There we have it! A list of URLs for one month of the OISST dataset.
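Before handing the list to Pangeo Forge, it can be worth a quick sanity check. The sketch below wraps the URL construction in a hypothetical helper, `make_url` (our own name, not part of `pangeo-forge-recipes`), and verifies that June 1990 yields 30 unique URLs:

```python
import pandas as pd

input_url_pattern = (
    'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
    'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
)


def make_url(day):
    """Fill the URL template for a single day (illustrative helper)."""
    return input_url_pattern.format(
        yyyymm=day.strftime('%Y%m'), yyyymmdd=day.strftime('%Y%m%d')
    )


dates = pd.date_range('1990-06-01', '1990-06-30', freq='D')
input_urls = [make_url(day) for day in dates]

print(len(input_urls))                           # 30
print(len(set(input_urls)) == len(input_urls))   # True
print(input_urls[-1].endswith('19900630.nc'))    # True
```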

### Create the File Pattern object

Now we return to our `pattern_from_file_sequence` function. Calling it looks like this:

In [13]:
from pangeo_forge_recipes.patterns import pattern_from_file_sequence

In [14]:
pattern = pattern_from_file_sequence(input_urls, 'time', nitems_per_file=1)

The inputs are:
* `input_urls` - the list of file URLs we just created
* `'time'` - the name of the dimension along which the files are concatenated (each file holds a different time)
* `nitems_per_file=1` - specifies that each OISST file contains a single timestep

In [15]:
pattern

<FilePattern {'time': 30}>

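One way to think about this object: with `nitems_per_file=1`, each file contributes exactly one position along the `time` dimension, so the pattern behaves like a lookup from a position in time to a URL. The plain-Python sketch below illustrates only that idea. It is not how `FilePattern` is implemented, and `index_to_url` is a name made up for this example.

```python
# Rebuild the 30 June 1990 URLs with a simplified template (illustration only).
base = (
    'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
    'v2.1/access/avhrr/199006/oisst-avhrr-v02r01.199006{dd:02d}.nc'
)
urls = [base.format(dd=dd) for dd in range(1, 31)]

# Conceptual view: position along the time dimension -> file URL.
index_to_url = dict(enumerate(urls))

print(len(index_to_url))                      # 30
print(index_to_url[0].rsplit('/', 1)[-1])     # oisst-avhrr-v02r01.19900601.nc
print(index_to_url[29].rsplit('/', 1)[-1])    # oisst-avhrr-v02r01.19900630.nc
```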

## Defining a Recipe Class

Now that we have our File Pattern object, the Recipe Class comes pretty easily. In this tutorial we want to convert our dataset to Zarr, so we will use the `XarrayZarrRecipe` class. All we need to provide to create the object is:
1. the FilePattern object
2. the number of input files that should go into a single chunk

In [17]:
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

In [18]:
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=10)

In [19]:
recipe

XarrayZarrRecipe(file_pattern=<FilePattern {'time': 30}>, inputs_per_chunk=10, target_chunks={}, target=None, input_cache=None, metadata_cache=None, cache_inputs=True, copy_input_to_local_file=False, consolidate_zarr=True, consolidate_dimension_coordinates=True, xarray_open_kwargs={}, xarray_concat_kwargs={}, delete_input_encoding=True, process_input=None, process_chunk=None, lock_timeout=None, subset_inputs={}, open_input_with_fsspec_reference=False)
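To get a feel for what `inputs_per_chunk=10` means here: our 30 one-timestep inputs are grouped ten at a time, giving three chunks along `time`. The sketch below reproduces only that arithmetic in plain Python, under the assumption that inputs are grouped in order. It does not call into `pangeo-forge-recipes`, and the variable names are ours.

```python
n_inputs = 30           # one input file per day of June 1990
inputs_per_chunk = 10   # same value passed to the recipe above

# Assumption for illustration: inputs are grouped sequentially.
input_indices = list(range(n_inputs))
chunks = [
    input_indices[start:start + inputs_per_chunk]
    for start in range(0, n_inputs, inputs_per_chunk)
]

print(len(chunks))                     # 3
print(chunks[0])                       # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(chunks[-1][0], chunks[-1][-1])   # 20 29
```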


And there you have your first recipe object! This object holds all the information about the dataset that it needs to run.

In part 2 of the tutorial we will use our recipe to convert some data locally.