# Introduction

This self-guided, step-by-step Jupyter notebook shows you how to run the Python scripts saved in the ```/scripts``` directory of the [precog-data-intake](https://github.com/precog-ocean/precog-data-intake) repository.
The scripts will be run from within this Jupyter notebook for illustrative purposes.


To run each ```.py``` script, run the corresponding cell in this notebook and follow the in-cell prompts as you would at any ```Terminal prompt```.

You can also run the equivalent commands at any ```Terminal prompt``` with slightly different syntax (shown below).

## Step 1. Check / Activate Python Environment
Type the following command to activate the virtual environment from within your Jupyter notebook server:

In [11]:
%%bash
source .venv/bin/activate

The command above is similar to running the following on the ```Terminal prompt```:
```bash 
source .venv/bin/activate
```
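
Note that a `%%bash` cell runs in its own subprocess, so an activation performed there does not carry over to the notebook kernel. To confirm which interpreter the kernel itself is using, a minimal sketch with only the standard library might look like this (the helper name `in_virtualenv` is hypothetical and not part of the repository):

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the installation it was created from.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the notebook server was probably started outside `.venv`.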

## Step 2. Create directory to save search results

Create a directory named ```test_search``` under your Desktop.

PS. This is equivalent to running ```> mkdir ~/Desktop/test_search ``` on a ```Terminal prompt```

In [12]:
%%bash
mkdir -p ~/Desktop/test_search
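
If you prefer to stay in Python, the same directory can be created with `pathlib`. This is only a sketch; the helper name `make_results_dir` is hypothetical:

```python
from pathlib import Path

def make_results_dir(base: Path, name: str = "test_search") -> Path:
    """Create base/name (and any missing parents) and return the path."""
    target = base / name
    # parents=True and exist_ok=True mirror the behaviour of `mkdir -p`
    target.mkdir(parents=True, exist_ok=True)
    return target

if __name__ == "__main__":
    print(make_results_dir(Path.home() / "Desktop"))
```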

## Step 3. ESGF Catalogue sweep for ESM outputs of interest

Now run the next cell in the notebook to execute the program ```intake_CatalogueSearch.py``` and follow the in-prompt instructions. When asked to provide `variable_ids`, enter `['expc', 'epc100']`.

PS. This is equivalent to running the command ```> python scripts/intake_CatalogueSearch.py``` on a ```Terminal prompt```

In [None]:
import os.path
%run scripts/intake_CatalogueSearch.py

Now look at the ```path``` you indicated and inspect the files created. You should have the following:

- `ESGF_search_<datetime_stamp>.xlsx` ==> A Dataframe with the raw search results from all ESGF nodes.
- `ESGF_search_<datetime_stamp>_<varstamp>.log` ==> A text file with the log of the ESGF sweep. It also contains the results of the grid consistency tests and of the time stamp continuity checks on the files.
- `DF_Downloadable_<varstamp>.xlsx` ==> A Dataframe with the filtered and tested URLs for the variables you searched for.
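
A quick way to confirm that these files exist is to glob for them. The sketch below assumes the file name patterns listed above and the default directory from Step 2; the helper name `find_search_outputs` is hypothetical:

```python
from pathlib import Path

# Glob patterns taken from the file names listed above
PATTERNS = {
    "raw search results": "ESGF_search_*.xlsx",
    "sweep log": "ESGF_search_*.log",
    "downloadable URLs": "DF_Downloadable_*.xlsx",
}

def find_search_outputs(directory: Path) -> dict[str, list[str]]:
    """Map each expected output type to the matching file names in `directory`."""
    return {
        label: sorted(p.name for p in directory.glob(pattern))
        for label, pattern in PATTERNS.items()
    }

if __name__ == "__main__":
    # Assumes the default directory created in Step 2
    results_dir = Path.home() / "Desktop" / "test_search"
    for label, names in find_search_outputs(results_dir).items():
        print(f"{label}: {names or 'none found'}")
```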


For instance, if you inspect the log file, it shows that complete PI and Historical runs for both `epc100` and `expc` were found for the following CMIP6 models:
```
- GFDL-ESM4
- GISS-E2-1-G
- IPSL-CM6A-LR
- MPI-ESM-1-2-HAM
- MPI-ESM1-2-HR
- MPI-ESM1-2-LR
- UKESM1-0-LL
```

The log also reports test results in a readable format. For example, model `GISS-E2-1-G-CC` provides `expc` on the native grid (i.e., `gn`) for the piControl run but lacks `expc` outputs on the native grid for the Historical run.

```
No complete set of variables for model GISS-E2-1-G-CC for variable ['expc', 'epc100'] in either grid. Test returned:
INFO - ####################################################################################################
INFO - grid_label       var_test   variable_ids  has_all_variables        run          model
INFO -         gn  [True, False] [expc, epc100]              False  piControl GISS-E2-1-G-CC
INFO -         gr [False, False] [expc, epc100]              False  piControl GISS-E2-1-G-CC
INFO -         gn  [False, True] [expc, epc100]              False historical GISS-E2-1-G-CC
INFO -         gr [False, False] [expc, epc100]              False historical GISS-E2-1-G-CC
INFO - ####################################################################################################
```
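
To collect every model flagged this way without reading the whole log by eye, you could scan it with a regular expression. This is a minimal sketch that assumes the log line format shown above; the helper name `models_without_complete_set` is hypothetical, and other versions of the scripts may word the message differently:

```python
import re

# Pattern based on the log line shown above; other log versions may differ
INCOMPLETE = re.compile(r"No complete set of variables for model (\S+) for variable")

def models_without_complete_set(log_text: str) -> list[str]:
    """Return the unique model names flagged as incomplete, in order of appearance."""
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(INCOMPLETE.findall(log_text)))

sample = (
    "No complete set of variables for model GISS-E2-1-G-CC for variable "
    "['expc', 'epc100'] in either grid. Test returned:\n"
)
print(models_without_complete_set(sample))  # ['GISS-E2-1-G-CC']
```

On a real log, pass the file contents instead of `sample`, for example `Path(log_file).read_text()`.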

### Step 3.1

For the sake of illustration, open ```DF_Downloadable_expc_epc100.xlsx``` and delete all rows but the top 5. Otherwise, running the data-fetching program in [Step 4](#Step-4.-Fetch-the-data) will trigger a 719 GB download to your disk.

You can do so by running this short program below:

In [None]:
import pandas as pd
from pathlib import Path

def keep_top_five_rows(path: Path, sheet_name: str) -> None:
    # Read the first sheet into a DataFrame
    df = pd.read_excel(path, sheet_name=0)  # sheet_name=0 = first sheet
    # Keep only the first 5 rows
    df_top5 = df.head(5)
    # Overwrite the original file (no index column)
    df_top5.to_excel(path, sheet_name=sheet_name, index=False)
    # Use the function argument rather than relying on a global variable
    print(f"File {path} has been trimmed.")

if __name__ == "__main__":
    # Path to your file
    excel_path = input("Now either drag onto terminal or type path to Dataframe with the Filtered ESGF search results:")
    excel_path = Path(excel_path.strip())  # strip needed as dragging onto terminal adds a trailing 'space'
    keep_top_five_rows(excel_path, sheet_name='Sheet1')

You can adapt the logic of the cell above to apply custom filters, combining multiple criteria to produce new Dataframe files before passing the `DF_Downloadable_<varstamp>.xlsx` Dataframe to the data-fetching program `intake_OceanVarsDL.py`.

You could, for instance, download only `UKESM1-0-LL` by tweaking the code to:

In [56]:
import pandas as pd
from pathlib import Path

def filter_model(path: Path, sheet_name: str, model: str) -> None:
    # Read the first sheet into a DataFrame
    df = pd.read_excel(path, sheet_name=0)  # sheet_name=0 = first sheet
    # Name of the new file: original name plus the model as a suffix
    fname = f"{path.stem}_{model}.xlsx"
    # Keep only the model you want
    df_model = df[df['source_id'] == model]
    # Export the filtered file next to the original (pathlib avoids needing `import os`)
    df_model.to_excel(path.parent / fname, sheet_name=sheet_name, index=False)
    # Use the function argument rather than relying on a global variable
    print(f"File {path} has been trimmed.")

if __name__ == "__main__":
    # Path to your file
    excel_path = input("Now either drag onto terminal or type path to Dataframe with the Filtered ESGF search results:")
    excel_path = Path(excel_path.strip())  # strip needed as dragging onto terminal adds a trailing 'space'
    filter_model(excel_path, sheet_name='Sheet1', model='UKESM1-0-LL')

File /Users/leonardobertini/Desktop/DF_Downloadable_expc_epc100.xlsx has been trimmed.
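
Combining several criteria follows the same pattern. The sketch below works on a small in-memory table rather than the real spreadsheet. Only `source_id` is known from the cells above; the column names `experiment_id` and `variable_id` and the helper `filter_rows` are assumptions for illustration, so check them against the headers in your own file:

```python
import pandas as pd

def filter_rows(df: pd.DataFrame, **criteria) -> pd.DataFrame:
    """Keep rows that match every criterion; each value is a list of allowed entries."""
    mask = pd.Series(True, index=df.index)
    for column, allowed in criteria.items():
        # isin() accepts a list, so one criterion can admit several values
        mask &= df[column].isin(allowed)
    return df[mask]

# Toy table standing in for the spreadsheet; apart from source_id,
# the column names here are assumptions
df = pd.DataFrame({
    "source_id": ["UKESM1-0-LL", "UKESM1-0-LL", "GFDL-ESM4"],
    "experiment_id": ["historical", "piControl", "historical"],
    "variable_id": ["expc", "epc100", "expc"],
})

subset = filter_rows(df, source_id=["UKESM1-0-LL"], experiment_id=["historical"])
print(subset)
```

The filtered table can then be written out with `to_excel`, as in the cells above.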


## Step 4. Fetch the data

The next step is to run the downloader script.

The program will download the files listed in the filtered ```DF_Downloadable_<varstamp>.xlsx``` Dataframe.

You can indicate where you'd like files to be downloaded, or keep ```~/Desktop/test_search```, created in [Step 2](#Step-2.-Create-directory-to-save-search-results), as your default.

Downloads will trigger in parallel, and files will be organised under a directory tree that has a directory named ```CMIP6``` at the top.

Run the following cell:

PS. This is equivalent to running the command ```> python scripts/intake_OceanVarsDL.py``` on a ```Terminal prompt```


In [None]:
%run scripts/intake_OceanVarsDL.py

When the program finishes running, a folder ```CMIP6``` should have been created within your ```download_path``` with the data organised per model.

## Step 5. Fetch Grid cell measures (`areacello` and `volcello`)

Run the next script to fetch corresponding grid cell measures ```areacello``` and ```volcello``` for the downloaded ESM outputs.

The program will fetch the grid cell measures and create a new Dataframe ```DF_Downloadable_<cellmeasure_stamp>.xlsx``` in the chosen ```download_path```.

Then you'll be prompted to indicate the path to this newly created dataframe, and the cell measure downloads will trigger in parallel.

Files will be organised under a directory tree that has a directory ```CMIP6``` at the top.

Run the following cell:

PS. This is equivalent to running the command ```> python scripts/intake_CellMeasuresDL.py``` on a ```Terminal prompt```

In [None]:
%run scripts/intake_CellMeasuresDL.py

## Step 6. Check files

That's it. Now inspect `download_path` to check if the files were downloaded.
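
Rather than browsing the tree by hand, you could count the downloaded files and add up their sizes. This is a sketch under two assumptions: that the data files use the `.nc` (NetCDF) extension, and that they sit somewhere below the `CMIP6` directory. The helper name `summarise_downloads` is hypothetical:

```python
from pathlib import Path

def summarise_downloads(download_path: Path, suffix: str = ".nc") -> tuple[int, int]:
    """Return (number of files, total size in bytes) found under download_path/CMIP6."""
    root = download_path / "CMIP6"
    # rglob walks the whole tree, so no particular sub-directory layout is assumed
    files = [p for p in root.rglob(f"*{suffix}") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)

if __name__ == "__main__":
    # Assumes the default directory created in Step 2
    count, size = summarise_downloads(Path.home() / "Desktop" / "test_search")
    print(f"{count} files, {size / 1e9:.2f} GB")
```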