# Defining a `FilePattern` (Intro Tutorial Part 1)

Welcome to the Pangeo Forge introduction tutorial!

This tutorial is split into three parts:
1. Defining a `FilePattern`
2. Defining a recipe and running it locally
3. Setting up a recipe to run in the cloud

Throughout this tutorial we are going to convert NOAA OISST data stored in netCDF files to Zarr. OISST is a global, gridded ocean sea surface temperature dataset at daily 1/4 degree resolution. By the end of this tutorial sequence you will have converted some OISST data to Zarr, be able to access a sample on your computer, and see how to propose the recipe for cloud deployment!

Here we tackle **Part 1 - Defining a `FilePattern`**. We will assume that you already have `pangeo-forge-recipes` installed.

## Steps to Creating a `FilePattern`

The steps to creating a `FilePattern` are:

1. Understand the URL Pattern for OISST
1. Create a Generalized URL
1. Define the **Combine Dimension** object
1. Create a Format Function
1. Define a `FilePattern`

We will talk about each of these in turn.

### Where should I write this code?
Eventually, all of the code defining the recipe will need to go in a file called `recipe.py`. If you typically work in a text editor and want to start that way from the beginning, that is great. It is also totally fine to work on your recipe code in a Jupyter Notebook and then copy the final code to a single `.py` file later. The choice between the two is a matter of personal preference.

## Understand the URL Pattern for OISST


### Explore the structure

In order to create our Recipe, we have to understand how the data are organized on the server.
Like many datasets, OISST is available over the internet via the HTTP protocol.
We can browse the files at this URL:

<https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/>

By putting the URL into a web browser, we can explore the organization of the dataset.
We need to understand this organization in order to build our Recipe.

The link above shows folders grouped by month. Within each month's folder there is one file per day. We could represent the file structure like this:

![OISST file structure](../images/OISST_URL_structure.png)

The important takeaways from this structure exploration are:
- 1 file = 1 day
- Folders separate months

### A single URL

By putting together the full URL for a single file we can see that the OISST dataset for December 9th, 1981 would be accessed using the URL:

[https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198112/oisst-avhrr-v02r01.19811209.nc](https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198112/oisst-avhrr-v02r01.19811209.nc)

Copying and pasting that URL into a web browser will download that single file to your computer.

If we only have a few files, we can type out the URL for each of them by hand.
But that isn't practical when we have thousands of files.
We need to understand the _pattern_.

## Create a Generalized URL
We can generalize the URL to say that OISST datasets are accessed using a URL of the format:

`https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/{year-month}/oisst-avhrr-v02r01.{year-month-day}.nc`

where `{year-month}` and `{year-month-day}` change for each file. Of the three dimensions of this dataset - latitude, longitude and time - the individual files are split up by time.
Our goal is to combine, or _concatenate_, these files along the time dimension into a single Zarr dataset.

![OISST file structure conversion](../images/OISST_structure_conversion.png)
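As a minimal sketch of what "filling in" the generalized URL means, the snippet below uses plain Python string formatting. The placeholder names `yyyymm` and `yyyymmdd` and the helper `fill_template` are illustrative stand-ins for `{year-month}` and `{year-month-day}`; they are not part of Pangeo Forge.

```python
from datetime import date

# Generalized URL; the placeholder names are illustrative.
URL_TEMPLATE = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/"
    "v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)


def fill_template(day):
    """Return the URL for one day by filling in both placeholders."""
    return URL_TEMPLATE.format(
        yyyymm=day.strftime("%Y%m"),
        yyyymmdd=day.strftime("%Y%m%d"),
    )


# Two consecutive days that straddle a month boundary land in different folders.
url_sep30 = fill_template(date(1981, 9, 30))
url_oct01 = fill_template(date(1981, 10, 1))
print(url_sep30)
print(url_oct01)
```

Only the two placeholders change from file to file; everything else in the URL stays fixed.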

### Why does this matter so much?

A Pangeo Forge {class}`FilePattern <pangeo_forge_recipes.patterns.FilePattern>` is built on the premise that

1. We want to combine many individual small files into a larger dataset along one or more dimensions using either "concatenate" or "merge" style operations.
2. The individual files are accessible by URL and organized in a predictable way.
3. There is some kind of correspondence, or mapping, between the dimensions of the combination process and the actual URLs.

Knowing the generalized structure of the OISST URL leads us to start building the pieces of a `FilePattern`.

## About the `FilePattern` object

```{note}
`FilePattern`s are probably the most abstract part of Pangeo Forge.
It may take some time and experience to become comfortable with the `FilePattern` concept.
```

The goal of the `FilePattern` is to describe how the files in the Generalized URL should be organized when they are combined into a single Zarr store.

In order to define a `FilePattern` we need to:
1. Know the dimension of the data that will be used to combine the files. In the case of OISST the dimension is time.
2. Define the values of the dimension that correspond to each file. These are called the `keys`.
3. Create a function that converts the `keys` to the specific URL for each file. We call this the Format Function.

The first two pieces together are called the **Combine Dimension**. Let's start by defining that.


## Define the **Combine Dimension**

The **Combine Dimension** describes the relationship between files. In this dataset we only have one combine dimension: time. There is one file per day, and we want to concatenate the files in time. We will use the Pangeo Forge object `ConcatDim()`.

We also want to define the values of time that correspond to each file. These are called the `keys`. For OISST this means creating a list of every day covered by the dataset. The easiest way to do this is with the Pandas `date_range` function.

In [2]:
import pandas as pd

dates = pd.date_range('1981-09-01', '2022-02-01', freq='D')
# print the first 4 dates
dates[:4]

DatetimeIndex(['1981-09-01', '1981-09-02', '1981-09-03', '1981-09-04'], dtype='datetime64[ns]', freq='D')

These will be the `keys` for our **Combine Dimension**.
We now define a {class}`ConcatDim <pangeo_forge_recipes.patterns.ConcatDim>` object as follows:

In [3]:
from pangeo_forge_recipes.patterns import ConcatDim

time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
time_concat_dim

ConcatDim(name='time', nitems_per_file=1)

The `nitems_per_file=1` option is a hint we can give to Pangeo Forge. It means, "we know there is only one timestep in each file".
Providing this hint is not necessary, but it makes some things more efficient down the line.

## Define a Format Function

Next we need to write a function that takes a single key (here representing one day) and translates it into a URL to a data file.
This is just a standard Python function.

```{caution}
If you're not comfortable with writing Python functions, this may be a good time to review
the [official Python tutorial](https://docs.python.org/3/tutorial/controlflow.html#defining-functions)
on this topic.
```

So we need to write a function that takes a date as its argument and returns the correct URL for the OISST file with that date.

![Format Function Flow](../images/Format_function.png)

A very useful helper for this is the [strftime](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) method,
which is available on each item in the `dates` array.
For example:

In [3]:
dates[0].strftime('%Y')

'1981'
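Other format codes can be combined in the same call. The sketch below uses the standard library's `datetime.date` rather than the pandas timestamps in `dates`, on the assumption that the format codes behave the same way for both. Note that `%m` and `%d` are zero-padded, which is what the OISST folder and file names require.

```python
from datetime import date

day = date(1981, 9, 1)

year = day.strftime("%Y")           # four-digit year
year_month = day.strftime("%Y%m")   # year followed by the zero-padded month
full_date = day.strftime("%Y%m%d")  # year, zero-padded month, zero-padded day

print(year, year_month, full_date)
```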

Armed with this, we can now write our function:

In [4]:
def make_url(time):
    yyyymm = time.strftime('%Y%m')
    yyyymmdd = time.strftime('%Y%m%d')
    return (
        'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
        f'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
    )

Let's test it out:

In [5]:
make_url(dates[0])

'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc'

It looks good! 🤩

Before we move on, there are a couple of important things to note about this function:

- It must have the _same number of arguments as the number of Combine Dimensions_. In our case, this is just one; a pattern with two Combine Dimensions would need a function with two arguments.
- The name of the argument must match the `name` of the Combine Dimension. In our case, this is `time`.

These are ideas that will become increasingly relevant as you approach more complex datasets. For now, keep them in mind and we can move on to make our `FilePattern`.
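To make the two rules more concrete, here is a plain-Python sketch of a Format Function for a hypothetical dataset with two dimensions, `time` and `variable`. It does not use Pangeo Forge at all, and the `dimensions` dictionary, the keys, and the `example.com` URLs are invented for illustration. It only shows why one argument per dimension, named after that dimension, lets each combination of keys be passed to the function by name.

```python
import inspect
from itertools import product

# Hypothetical combine dimensions: each name is mapped to its keys.
dimensions = {
    "time": ["2000-01", "2000-02"],
    "variable": ["temperature", "salinity"],
}


def make_url(time, variable):
    """One argument per dimension, each named after its dimension."""
    # example.com is a placeholder, not a real data server.
    return f"https://example.com/{variable}/{variable}_{time}.nc"


# The argument names must match the dimension names exactly.
arg_names = list(inspect.signature(make_url).parameters)
assert sorted(arg_names) == sorted(dimensions)

# Call the function once per combination of keys, passing the keys by name.
urls = [
    make_url(**dict(zip(dimensions, keys)))
    for keys in product(*dimensions.values())
]
for url in urls:
    print(url)
```

Two keys in each of two dimensions produce four URLs. Renaming an argument, or dropping one, would make the keyword call fail.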

## Define the `FilePattern`

We now have the two ingredients we need for our {class}`FilePattern <pangeo_forge_recipes.patterns.FilePattern>`.
1. the Format Function
2. the **Combine Dimension** (`ConcatDim`, in this case)

At this point, it's pretty quick:

In [6]:
from pangeo_forge_recipes.patterns import FilePattern

pattern = FilePattern(make_url, time_concat_dim)
pattern

<FilePattern {'time': 14764}>

```{note}
You'll notice that we are using a function as an argument to another function here. If that pattern is new to you, that's alright. It is a very powerful technique, so it is used fairly often in Pangeo Forge.
```

The `FilePattern` object contains everything Pangeo Forge needs to know about where the data are coming from and how they should be combined. This is huge progress toward making a recipe!
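The `14764` in the `FilePattern` output above is the number of keys in the time dimension, one per daily file. As a quick cross-check using only the standard library, we can count the days from 1981-09-01 through 2022-02-01, treating both endpoints as included (which is assumed here to match the range built with `pd.date_range`):

```python
from datetime import date

start = date(1981, 9, 1)
end = date(2022, 2, 1)

# Both endpoints are included, hence the + 1.
n_days = (end - start).days + 1
print(n_days)
```

If the count in your own pattern differs from what you expect, the date range is a good first place to look.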

To summarize our process, we made a `ConcatDim` object, our **Combine Dimension**, which specifies `"time"` as the axis of concatenation and lists the dates. The Format Function converts the dates to URLs, and the `FilePattern` object keeps track of the URLs and how they relate to each other.

### Iterating through a `FilePattern`

While not necessary for the recipe, if you want to interact with the `FilePattern` object a bit more (for example, for debugging), you can iterate through it using `.items()`.
To keep the output concise, we use an `if` statement to stop the iteration after a few file paths.

In [7]:
for index, url in pattern.items():
    print(index)
    print(url)
    # Stop after the 3rd filepath (September 3rd, 1981)
    if '19810903' in url:
        break

Index({DimIndex(name='time', index=0, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)})
https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc
Index({DimIndex(name='time', index=1, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)})
https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810902.nc
Index({DimIndex(name='time', index=2, sequence_len=14764, operation=<CombineOp.CONCAT: 2>)})
https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810903.nc


The `index` is an object used internally by Pangeo Forge. The URL corresponds to the actual file we want to download.
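An alternative to checking the URL text is to cap the number of iterations with `itertools.islice` from the standard library. The sketch below uses a hypothetical stand-in generator, `fake_items`, with placeholder `example.com` URLs so that it runs without Pangeo Forge. The assumption is that `pattern.items()` could be passed to `islice` in the same way.

```python
from itertools import islice


def fake_items():
    """Hypothetical stand-in for pattern.items(): yields (index, url) pairs."""
    for i in range(14764):
        yield i, f"https://example.com/file_{i}.nc"


# Take only the first three pairs without walking the whole sequence.
first_three = list(islice(fake_items(), 3))
for index, url in first_three:
    print(index, url)
```

This avoids depending on a particular date string appearing in the URL.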

## End of Part 1
And there you have it - your first `FilePattern` object! That object describes 1) all of the URLs to the files that we are planning to convert as well as 2) how we want each of the files to be organized in the output object. Pretty compact!

In Part 2 of the tutorial, we will move on to creating a recipe object, and then use it to convert some data locally.

### Code Summary
The code written in Part 1 could all be written together as:

In [12]:
import pandas as pd

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

dates = pd.date_range('1981-09-01', '2022-02-01', freq='D')

def make_url(time):
    yyyymm = time.strftime('%Y%m')
    yyyymmdd = time.strftime('%Y%m%d')
    return (
        'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
        f'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
    )

time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
pattern = FilePattern(make_url, time_concat_dim)