# Setting up a recipe to run in the cloud

Welcome to the Pangeo Forge introduction tutorial! This is the 3rd part in a sequence, the flow of which is described {doc}`here </introduction_tutorial/index>`.

## Outline Part 3

We are at an exciting point: transitioning to Pangeo Forge Cloud. In this part of the tutorial we set up our recipe, which we have so far only run in a limited compute environment on a small section of data, to run at scale in the cloud. To do that we will need to:

1. Fork the `staged-recipes` repo
2. Add the recipe files: a `.py` file and a `meta.yml` file
3. Make a PR to the `staged-recipes` repo


### A note for sandbox users
If you have been using the Pangeo Forge Sandbox for the first two parts, that's great. To complete this part of the tutorial you will have to complete step 1 locally, then download the files you make in step 2 so that you can make the PR in step 3.

## Fork the `staged-recipes` repo

[`staged-recipes`](https://github.com/pangeo-forge/staged-recipes) is a repository that serves as a staging ground for recipes. It is where recipes get reviewed before they are run. Once a recipe has been run, its code is moved to a dedicated repository for that recipe, called a **feedstock**.

You can fork a repo through the web browser or the GitHub CLI. Check out the [GitHub docs](https://docs.github.com/en/get-started/quickstart/fork-a-repo) for the steps.

## Add the recipe files

Within `staged-recipes`, recipe files should go in a new folder for your dataset in the `recipes` subdirectory. The name of the new folder will become the name of the feedstock repository, the repository where the recipe code will live after the data have been processed.

In the example below we call the folder `oisst`, so the feedstock will be called `oisst-feedstock`. The final file structure we are creating is this:

```
staged-recipes/recipes/
                └──oisst/
                   ├──recipe.py
                   └──meta.yml
```
The folder name (`oisst` here) will vary with the name of your dataset.
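
The layout above can also be created programmatically. The following is a minimal sketch, not part of the Pangeo Forge tooling: the helper name `scaffold_recipe_folder` is hypothetical, and it runs in a temporary directory that stands in for your local clone of your `staged-recipes` fork.

```python
import tempfile
from pathlib import Path


def scaffold_recipe_folder(repo_root: Path, dataset_name: str) -> Path:
    """Create recipes/<dataset_name>/ with empty recipe.py and meta.yml files."""
    folder = repo_root / "recipes" / dataset_name
    folder.mkdir(parents=True, exist_ok=True)
    for filename in ("recipe.py", "meta.yml"):
        (folder / filename).touch()
    return folder


# A temporary directory stands in for a local clone of the fork (assumption).
with tempfile.TemporaryDirectory() as tmp:
    folder = scaffold_recipe_folder(Path(tmp) / "staged-recipes", "oisst")
    created = sorted(path.name for path in folder.iterdir())
    # The folder name determines the name of the future feedstock repository.
    feedstock_name = f"{folder.name}-feedstock"

print(created)
print(feedstock_name)
```

In practice you would point `repo_root` at your real clone and then fill in the two files as described in the following sections.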

### Copy the recipe code into a single `.py` file

Within the `oisst` folder create a file called `recipe.py` and copy the recipe creation code from the first two parts of this tutorial. We don't have to copy any of the code we used for local testing - the cloud automation will take care of testing and scaling the processing on the cloud infrastructure. We will call this file `recipe.py` the **recipe module**. For OISST it should look like:

```python
import pandas as pd

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

dates = pd.date_range('1981-09-01', '2022-02-01', freq='D')

def make_url(time):
    yyyymm = time.strftime('%Y%m')
    yyyymmdd = time.strftime('%Y%m%d')
    return (
        'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
        f'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
    )

time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
pattern = FilePattern(make_url, time_concat_dim)

recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=2)
```

Another step, complete!
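
Before moving on, it can be reassuring to check what the recipe module will ask the cloud to process. The sketch below reuses `make_url` from the recipe module, but as a simplification it swaps the pandas `date_range` for the standard library's `datetime.date`, which supports the same `strftime` calls. It only builds URLs and counts days; it does not download anything.

```python
from datetime import date


def make_url(time):
    # Same URL template as in the recipe module.
    yyyymm = time.strftime('%Y%m')
    yyyymmdd = time.strftime('%Y%m%d')
    return (
        'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/'
        f'v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc'
    )


# Same start and end dates as the recipe's daily date range.
start, end = date(1981, 9, 1), date(2022, 2, 1)

# One file per day, counting both endpoints.
n_inputs = (end - start).days + 1

first_url = make_url(start)
last_url = make_url(end)

print(n_inputs)
print(first_url)
print(last_url)
```

The number of inputs printed here is the number of daily files the full recipe covers, which is why this job is better suited to cloud infrastructure than to a local test run.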

## Create a `meta.yml` file

The `meta.yml` is a YAML file. YAML is a common language used for writing configuration files. `meta.yml` contains two important things:
1. metadata about the recipe 
2. the **bakery**, designating the cloud infrastructure where the recipe will be run and stored.

Here we will walk through each field of the `meta.yml`. A template of `meta.yml` is also available [here](https://github.com/pangeo-forge/sandbox/blob/main/recipe/meta.yaml). 


### `title` and `description`

These fields describe the dataset. Both are free-form text.

```{code-block} yaml
:lineno-start: 1
title: "NOAA Optimum Interpolated SST"
description: "1/4 degree daily gap filled sea surface temperature (SST)"
```

````{admonition} Full File Preview
:class: dropdown
```{code-block} yaml
:emphasize-lines: 1, 2

title: "NOAA Optimum Interpolated SST"
description: "1/4 degree daily gap filled sea surface temperature (SST)"
```
````

### `pangeo_forge_version`

This is the version of the `pangeo_forge_recipes` library that you used to create the recipe. It's important to track in case someone wants to run your recipe in the future. Conda users can find this information with `conda list`.

```{code-block} yaml
:lineno-start: 3
pangeo_forge_version: "0.6.2"
```
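
If you are not using conda, the installed version can also be looked up from Python. This is a small sketch using the standard library's `importlib.metadata`; it assumes the library is installed under the distribution name `pangeo-forge-recipes`, and the helper name `installed_version` is only illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str = "pangeo-forge-recipes"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("pangeo-forge-recipes is not installed in this environment")
else:
    # This is the value to copy into the pangeo_forge_version field.
    print(f'pangeo_forge_version: "{found}"')
```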

````{admonition} Full File Preview
:class: dropdown
```{code-block} yaml
:lineno-start: 1
:emphasize-lines: 3

title: "NOAA Optimum Interpolated SST"
description: "1/4 degree daily gap filled sea surface temperature (SST)"
pangeo_forge_version: "0.6.2"
```
````

### `recipes` section

The `recipes` section lists the recipes contained in the **recipe module** (`recipe.py`). This feels a bit repetitive in the case of OISST, but becomes relevant when someone defines multiple recipes in the same recipe module, for example with different chunking schemes.

```{code-block} yaml
:lineno-start: 4
recipes:
  - id: noaa-oisst-avhrr-only
    object: "recipe:recipe"
```
The id `noaa-oisst-avhrr-only` is the name that we are giving our recipe. It is a string that we, as the maintainers, choose.
The entry `recipe:recipe` describes where the recipe Python object is. We are telling it that our recipe object is in a file called `recipe`, inside of a variable called `recipe`. Unless there is a specific reason to deviate, `recipe:recipe` is a good convention here.
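
To make the `module:variable` convention concrete, here is a minimal sketch of how such a string could be resolved to a Python object. It is an illustration only and not necessarily how Pangeo Forge Cloud does it: the helper `load_object` is hypothetical, and a temporary folder containing a one-line `recipe.py` stands in for your real recipe folder.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_object(spec: str, folder: Path):
    """Resolve a 'module:variable' string against .py files in `folder`."""
    module_name, variable_name = spec.split(":")
    module_path = folder / f"{module_name}.py"
    module_spec = importlib.util.spec_from_file_location(module_name, module_path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)  # runs the code in recipe.py
    return getattr(module, variable_name)


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # A placeholder string stands in for the real XarrayZarrRecipe object.
    (folder / "recipe.py").write_text("recipe = 'stand-in for the OISST recipe'\n")
    resolved = load_object("recipe:recipe", folder)

print(resolved)
```

The part before the colon selects the file (`recipe.py`) and the part after it selects the variable inside that file, so a module that defined a second variable, say `recipe_b`, would be referenced as `recipe:recipe_b`.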

````{admonition} Full File Preview
:class: dropdown
```{code-block} yaml
:lineno-start: 1
:emphasize-lines: 4-6

title: "NOAA Optimum Interpolated SST"
description: "1/4 degree daily gap filled sea surface temperature (SST)"
pangeo_forge_version: "0.6.2"
recipes:
  - id: noaa-oisst-avhrr-only
    object: "recipe:recipe"
```
````

### `provenance` section

Provenance explains the origin of the dataset. The core information about provenance is the `providers` field, whose entries follow the Provider Object defined in the STAC Metadata Specification. See the [STAC Provider docs](https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#provider-object) for more details.

```{code-block} yaml
:lineno-start: 7
provenance:
  providers:
    - name: "NOAA NCEI"
      description: "National Oceanic and Atmospheric Administration National Centers for Environmental Information"
      roles:
        - producer
        - licensor
      url: https://www.ncdc.noaa.gov/oisst
  license: "CC-BY-4.0"
```
One field to highlight is the `license` field, described in the STAC docs [here](https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#license). It is important to locate the licensing information of the dataset and provide it in the `meta.yml`.

````{admonition} Full File Preview
:class: dropdown
```{code-block} yaml
:lineno-start: 1
:emphasize-lines: 7-15

title: "NOAA Optimum Interpolated SST"
description: "1/4 degree daily gap filled sea surface temperature (SST)"
pangeo_forge_version: "0.6.2"
recipes:
  - id: noaa-oisst-avhrr-only
    object: "recipe:recipe"
provenance:
  providers:
    - name: "NOAA NCEI"
      description: "National Oceanic and Atmospheric Administration National Centers for Environmental Information"
      roles:
        - producer
        - licensor
      url: https://www.ncdc.noaa.gov/oisst
  license: "CC-BY-4.0"
```
````

### `maintainers` section

This is information about you, the recipe creator! Multiple maintainers can be listed. The required fields are `name` and `github` username; `orcid` and `email` may also be included.

```{code-block} yaml
:lineno-start: 16
maintainers:
  - name: "Dorothy Vaughan"
    orcid: "9999-9999-9999-9999"
    github: dvaughan0987
```

````{admonition} Full File Preview
:class: dropdown
```{code-block} yaml
:lineno-start: 1
:emphasize-lines: 16-19

title: "NOAA Optimum Interpolated SST"
description: "1/4 degree daily gap filled sea surface temperature (SST)"
pangeo_forge_version: "0.6.2"
recipes:
  - id: noaa-oisst-avhrr-only
    object: "recipe:recipe"
provenance:
  providers:
    - name: "NOAA NCEI"
      description: "National Oceanic and Atmospheric Administration National Centers for Environmental Information"
      roles:
        - producer
        - licensor
      url: https://www.ncdc.noaa.gov/oisst
  license: "CC-BY-4.0"
maintainers:
  - name: "Dorothy Vaughan"
    orcid: "9999-9999-9999-9999"
    github: dvaughan0987
```
````

### `bakery` section

**Bakeries** are where the work gets done on Pangeo Forge Cloud. A single bakery is a set of cloud infrastructure hosted by a particular institution or group.

Selecting a `bakery` is how you choose where the recipe will be run and hosted. The [Pangeo Forge website](https://pangeo-forge.org/) will (very soon!) host a full list of available bakeries.

```{code-block} yaml
:lineno-start: 20
bakery:
  id: "pangeo-ldeo-nsf-earthcube"
```

````{admonition} Full File Preview
:class: dropdown
```{code-block} yaml
:lineno-start: 1
:emphasize-lines: 20, 21

title: "NOAA Optimum Interpolated SST"
description: "1/4 degree daily gap filled sea surface temperature (SST)"
pangeo_forge_version: "0.6.2"
recipes:
  - id: noaa-oisst-avhrr-only
    object: "recipe:recipe"
provenance:
  providers:
    - name: "NOAA NCEI"
      description: "National Oceanic and Atmospheric Administration National Centers for Environmental Information"
      roles:
        - producer
        - licensor
      url: https://www.ncdc.noaa.gov/oisst
  license: "CC-BY-4.0"
maintainers:
  - name: "Dorothy Vaughan"
    orcid: "9999-9999-9999-9999"
    github: dvaughan0987
bakery:
  id: "pangeo-ldeo-nsf-earthcube"
```
````

And that is the `meta.yml`! Between the `meta.yml` and `recipe.py` we have now put together all the files we need for cloud processing.
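
Before opening the PR, a quick self-check of the metadata can catch omissions early. The sketch below is not the official validation performed by Pangeo Forge; it assumes that every section walked through above is expected at the top level, and that each maintainer needs `name` and `github`. The metadata is written as a Python dictionary mirroring `meta.yml` so the example needs no YAML parser, and the helper `find_problems` is hypothetical.

```python
# Assumption: the sections covered in this tutorial are all expected.
REQUIRED_TOP_LEVEL = (
    "title",
    "description",
    "pangeo_forge_version",
    "recipes",
    "provenance",
    "maintainers",
    "bakery",
)
REQUIRED_MAINTAINER_KEYS = ("name", "github")


def find_problems(meta: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [
        f"missing top-level field: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in meta
    ]
    for index, maintainer in enumerate(meta.get("maintainers", [])):
        for key in REQUIRED_MAINTAINER_KEYS:
            if key not in maintainer:
                problems.append(f"maintainer {index} is missing: {key}")
    return problems


# Dictionary mirror of the meta.yml built in this tutorial.
meta = {
    "title": "NOAA Optimum Interpolated SST",
    "description": "1/4 degree daily gap filled sea surface temperature (SST)",
    "pangeo_forge_version": "0.6.2",
    "recipes": [{"id": "noaa-oisst-avhrr-only", "object": "recipe:recipe"}],
    "provenance": {
        "providers": [
            {
                "name": "NOAA NCEI",
                "roles": ["producer", "licensor"],
                "url": "https://www.ncdc.noaa.gov/oisst",
            }
        ],
        "license": "CC-BY-4.0",
    },
    "maintainers": [
        {
            "name": "Dorothy Vaughan",
            "orcid": "9999-9999-9999-9999",
            "github": "dvaughan0987",
        }
    ],
    "bakery": {"id": "pangeo-ldeo-nsf-earthcube"},
}

problems = find_problems(meta)
print(problems or "no problems found")
```

The automated checks that run on the PR remain the authoritative test; this kind of local check only shortens the feedback loop.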

## Make a PR to the `staged-recipes` repo

At this point you should have created two files, `recipe.py` and `meta.yml`, and they should be in the new folder you created for your dataset in `staged-recipes/recipes`.

It's time to submit the changes as a Pull Request. Creating the Pull Request on GitHub is what officially submits your recipe for review before it is run. If you have opened an issue for your dataset, you can reference it in the Pull Request. Otherwise, provide a few notes about the dataset and hit submit!

## After the PR

With the PR in, all the steps to stage the recipe are complete! At this point a Pangeo Forge Bot will perform some automated steps, such as checking syntax and required fields. [This recipe PR](https://github.com/pangeo-forge/staged-recipes/pull/66#issuecomment-1048578240) is an example of the Bot in action. The Bot, and possibly a Pangeo Forge Maintainer, will guide you through any changes your recipe needs before merge.

Merging the PR will kick off a series of automated steps to begin the processing. These include:

- creating a feedstock repository
- setting up the necessary bakery infrastructure
- deploying the recipe

The relevant information about the recipe run will be communicated directly in the PR. If you are interested in learning more about how your recipe is processed, check out the {doc}`/cloud_automation_user_guide/index`.

## End of the Introduction Tutorial

Congratulations, you've completed the introduction tutorial!

From here, we hope you are excited to try writing your own recipe. As you write, you may find additional documentation helpful, such as the {doc}`/recipe_user_guide/index` or the more advanced {doc}`/tutorials/index`. You can also open issues in [`pangeo_forge_recipes`](https://github.com/pangeo-forge/pangeo-forge-recipes).

Happy ARCO building!