# Get and configure the CLM files

>>> TODO: Intro about the CLM model, the three types of CLM files ("vegm", "vegp", "drv_clm") and how to see what CLM configurations are available 

### 1.  Setup 

In all examples you will need to import the following packages and register your pin in order to have access to the HydroData datasets. <<< for this workbook registering is probably not necessary, but we can keep it anyway?

Refer to the [getting started](https://hydroframesubsettools.readthedocs.io/en/latest/getting_started.html) instructions for creating your pin if you have not done this already.

In [10]:
from subsettools.subsettools import (
    huc_to_ij, 
    latlon_to_ij, 
    config_clm,
)
from hf_hydrodata import grid, gridded
from parflow import Run
import numpy as np

gridded.register_api_pin("your_email", "your_pin")

### 2. Define your area of interest

To customize the CLM files for our subsetting domain, we first need to calculate the grid bounds. We show how to do that in three different ways below.

We use i,j indices to define the subset of the national files that you would like to extract. The [`latlon_to_ij`](https://hydroframesubsettools.readthedocs.io/en/latest/autoapi/subsettools/subsettools/index.html#subsettools.subsettools.latlon_to_ij) function translates a bounding box in lat-lon coordinates to i,j indices in whatever grid system we select. It returns a tuple `(imin, jmin, imax, jmax)` of grid indices that define a bounding box containing our region (or point) of interest. (Note: `imin`, `jmin`, `imax` and `jmax` are the west, south, east and north boundaries respectively.)

Here we will show how to define a subset extent for (1) a single point of interest, (2) a user specified bounding box, and (3) a bounding box that surrounds a user specified HUC. 

**IMPORTANT NOTE**: *The i,j indices found in this step are based on whatever grid you select (e.g. `conus1` or `conus2`). It's very important that the grid you use in this step is the same as the grid of the data files (static input and forcing) you are subsetting, or you will end up subsetting a different location than you expect. The grids are shown below and described in [Yang et al. 2023](https://www.sciencedirect.com/science/article/pii/S0022169423012362)*

![CONUS domains](CONUS1_2_domain.jpg)
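One way to guard against the grid mismatch described in the note above is a small check that runs before any subsetting. The sketch below is only an illustration: `grid_of_dataset` and `check_same_grid` are hypothetical helpers, not part of subsettools, and they assume that a dataset name starts with its grid name (as `conus1_baseline_mod` does in this notebook).

```python
def grid_of_dataset(dataset_name: str) -> str:
    """Guess the grid from a dataset name such as "conus1_baseline_mod".

    Assumption: the dataset name begins with the grid name followed by "_".
    This convention is inferred from the examples in this notebook only.
    """
    return dataset_name.split("_", 1)[0].lower()


def check_same_grid(bounds_grid: str, dataset_name: str) -> None:
    """Raise ValueError when the bounds and the dataset appear to use different grids."""
    dataset_grid = grid_of_dataset(dataset_name)
    if bounds_grid.lower() != dataset_grid:
        raise ValueError(
            f"bounds were computed on '{bounds_grid}' but dataset "
            f"'{dataset_name}' looks like a '{dataset_grid}' dataset"
        )


# Passes silently because both sides refer to the conus1 grid.
check_same_grid("conus1", "conus1_baseline_mod")
```

Calling `check_same_grid("conus2", "conus1_baseline_mod")` would raise a `ValueError` instead of letting the mismatch go unnoticed.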

#### 2.1 Defining bounds to extract data for a single point
To extract data for a single point we use the same bounding box function as we would to extract a larger domain but just repeat the point values as the upper and lower bounds.

In [5]:
lat = 39.8379
lon = -74.3791
# Since we want to subset only a single location, both lat-lon bounds are defined by this point:
latlon_bounds = ([lat, lon],[lat, lon])
ij_column_bounds = latlon_to_ij(latlon_bounds=latlon_bounds, grid="conus2")
print(f"bounding box: {ij_column_bounds}")

bounding box: (4057, 1915, 4057, 1915)


#### 2.2 Defining bounds for a box defined by lat-lon bounds
To extract a bounding box, provide two `[lat, lon]` points that give the upper and lower latitude and longitude bounds of the area of interest, together with the grid system that you would like to use.

In [6]:
ij_box_bounds = latlon_to_ij(latlon_bounds=[[37.91, -91.43], [37.34, -90.63]], grid="conus1")
print(f"bounding box: {ij_box_bounds}")

bounding box: (2285, 436, 2358, 495)
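To get a feel for the size of the subset, the tuple printed above can be unpacked into a number of grid cells in each direction. This is a minimal sketch with a hypothetical helper, `bounds_shape`. It assumes the upper bounds are inclusive, which is suggested by the single-point example (where `imin == imax`) but is not confirmed here; set `inclusive_upper=False` if your version of subsettools treats them as exclusive.

```python
def bounds_shape(ij_bounds, inclusive_upper=True):
    """Return (nj, ni): the number of rows and columns covered by the bounds.

    ij_bounds is a tuple (imin, jmin, imax, jmax) as returned by latlon_to_ij.
    inclusive_upper is an assumption about how the upper bounds are defined.
    """
    imin, jmin, imax, jmax = ij_bounds
    extra = 1 if inclusive_upper else 0
    ni = imax - imin + extra  # cells in the west-east direction
    nj = jmax - jmin + extra  # cells in the south-north direction
    return nj, ni


print(bounds_shape((2285, 436, 2358, 495)))    # box bounds from the cell above
print(bounds_shape((4057, 1915, 4057, 1915)))  # single-point bounds
```

The result is ordered `(nj, ni)` so that it lines up with the row-then-column shape of a NumPy array.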


#### 2.3 Defining bounds for a HUC watershed
The subsettools [`huc_to_ij`](https://hydroframesubsettools.readthedocs.io/en/latest/autoapi/subsettools/subsettools/index.html#subsettools.subsettools.huc_to_ij) function returns a tuple `(imin, jmin, imax, jmax)` of grid indices that define a bounding box containing any HUC. You can provide 2, 4, 6, 8 or 10-digit HUCs.  For help finding your HUC you can refer to the [USGS HUC picker](https://water.usgs.gov/wsc/map_index.html).

In [9]:
ij_huc_bounds = huc_to_ij(huc_list=["14050002"], grid="conus2")
print(f"bounding box: {ij_huc_bounds}")

bounding box: (1225, 1738, 1347, 1811)


### 3. Configuring CLM with the `config_clm` function 
We will now use the `config_clm` function (API reference [here](https://hydroframesubsettools.readthedocs.io/en/edit-docs/autoapi/subsettools/subsettools/index.html#subsettools.subsettools.config_clm)) to tailor our CLM files to our subsetting domain. We will pass the function the grid bounds (here the `ij_box_bounds` that we calculated), start and end dates for our ParFlow simulation, a dataset to get our CLM files from, and the directory path where the files will be written. We can also pass a timezone argument; otherwise it defaults to UTC. The function returns a dictionary in which the keys are `"vegp"`, `"vegm"` and `"drv_clm"` and the values are the file paths where the CLM files were written.

>>> TODO: We should have a link to the CLM datasets - what is the equivalent of conus1_baseline_mod for conus2?

**NOTE:** *If you choose to provide a timezone while setting up a ParFlow simulation, it should be consistent across the functions `subset_press_init`, `subset_forcing` and `config_clm`.*
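To see why the timezone needs to be consistent, it helps to look at how far the same start date moves when it is read in two different timezones. The sketch below uses only the Python standard library and a hypothetical fixed offset of UTC-5; it does not show how subsettools applies the timezone internally.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Hypothetical local timezone used only for illustration (fixed UTC-5 offset).
LOCAL = timezone(timedelta(hours=-5), name="UTC-05:00")


def to_utc(date_string, tz):
    """Interpret midnight of date_string in timezone tz and convert it to UTC."""
    local_midnight = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=tz)
    return local_midnight.astimezone(UTC)


start_read_as_utc = to_utc("2005-10-01", UTC)
start_read_as_local = to_utc("2005-10-01", LOCAL)

# Difference between the two readings of the same start date.
shift = start_read_as_local - start_read_as_utc
print(start_read_as_utc.isoformat())
print(start_read_as_local.isoformat())
print(shift)
```

If one function read the start date as UTC and another as local time, the inputs they prepare would be offset from each other by `shift`.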

In [12]:
file_paths = config_clm(
    ij_box_bounds, 
    start="2005-10-01", 
    end="2006-10-01", 
    dataset="conus1_baseline_mod",
    write_dir="/home/ga6/subsettools_example",
)

processing vegp
copied vegp
processing vegm
subset vegm
processing drv_clm
copied drv_clmin
edited drv_clmin
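Since `config_clm` returns a dictionary of file paths, a quick check that each path points to an existing file can catch problems before the ParFlow run is set up. The helper below, `missing_clm_files`, is a hypothetical sketch and not part of subsettools. The demonstration builds a stand-in dictionary with placeholder file names in a temporary directory so that it runs without HydroData access; in the notebook you would pass the `file_paths` returned above.

```python
import tempfile
from pathlib import Path

EXPECTED_KEYS = ("vegp", "vegm", "drv_clm")


def missing_clm_files(file_paths):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    for key in EXPECTED_KEYS:
        path = file_paths.get(key)
        if path is None:
            problems.append(f"{key}: no entry in the dictionary")
        elif not Path(path).is_file():
            problems.append(f"{key}: {path} does not exist")
    return problems


# Stand-in for the dictionary returned by config_clm (placeholder file names).
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = {}
    for key in EXPECTED_KEYS:
        demo_file = Path(tmp) / f"{key}_placeholder.txt"
        demo_file.write_text("placeholder\n")
        demo_paths[key] = str(demo_file)
    demo_problems = missing_clm_files(demo_paths)

print(demo_problems)
```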


### 4. Getting the CLM input files with the HydroData API

We can also get the CLM files using the HydroData API. We will use the `get_raw_file` function to get the `vegp` and `drv_clm` files. These are small text files, and `get_raw_file` fetches them from HydroData and writes them to the given directory path. Note that `get_raw_file` copies the files as they are; unlike `config_clm`, it does not modify them.

In [20]:
# Get the vegp file and write it to filepath:
gridded.get_raw_file(
    filepath="/home/ga6/subsettools_example",
    dataset="conus1_baseline_mod",
    file_type="vegp",
)

# Get the drv_clm file and write it to filepath:
gridded.get_raw_file(
    filepath="/home/ga6/subsettools_example",
    dataset="conus1_baseline_mod",
    file_type="drv_clm",
)
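Because `get_raw_file` copies a file unchanged while `config_clm` edits it, comparing the two versions is one way to see what was changed. The sketch below uses `difflib` from the standard library; `changed_lines` is a hypothetical helper, and the two demo files contain made-up placeholder settings rather than real `drv_clm` content. It assumes the raw and configured copies were written to different directories so that one does not overwrite the other.

```python
import difflib
import tempfile
from pathlib import Path


def changed_lines(raw_path, configured_path):
    """Return the removed ("-") and added ("+") lines between two text files."""
    raw = Path(raw_path).read_text().splitlines()
    configured = Path(configured_path).read_text().splitlines()
    diff = list(
        difflib.unified_diff(raw, configured, lineterm="", n=0)
    )
    # The first two entries are the "---"/"+++" file headers; hunk markers
    # start with "@@" and are filtered out by the prefix test.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Demo with placeholder content (not real CLM driver settings).
with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "raw.txt"
    configured_file = Path(tmp) / "configured.txt"
    raw_file.write_text("setting_a 1\nsetting_b 2\n")
    configured_file.write_text("setting_a 1\nsetting_b 5\n")
    demo_changes = changed_lines(raw_file, configured_file)

print(demo_changes)
```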

>>> About the vegm file: we could use the get_raw_file to copy the entire vegm file and write it to write_dir. But subsetting the vegm is very tricky. The get_numpy function would return an array of subset data but even that is pretty tricky to put back into vegm format. What is the goal in using the API to get the vegm file? Knowing this will help a lot to decide how to do this in code.

### 5. Setting up a single column CLM run with the config function

>>> How is this different than the above - I could just replace the huc_bounds with the single point bounds. Maybe I am misunderstanding this.

### 6. Cite the data sources

Please make sure to cite all data sources that you use. The `get_catalog_entry` function (API reference [here](https://maurice.princeton.edu/hydroframe/docs/hf_hydrodata.gridded.html#hf_hydrodata.gridded.get_catalog_entry)) will return a dictionary of metadata based on our filters, from which we will select the `paper_dois` and `dataset_dois` keys.

In [35]:
metadata = gridded.get_catalog_entry(
    dataset="conus1_baseline_mod",
    variable="clm_run",
    file_type="drv_clm",
)
print("Paper DOIs:", metadata['paper_dois'])
print("Dataset DOIs:", metadata['dataset_dois'])

Paper DOIs: 10.5194/gmd-14-7223-2021
Dataset DOIs: 
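The DOIs printed above can be turned into resolvable links for a reference list. This is a minimal sketch with a hypothetical helper, `doi_links`. It assumes that a field holding several DOIs separates them with commas or whitespace, which is not confirmed by the output shown here, and it returns an empty list for an empty field such as the dataset DOIs above.

```python
import re


def doi_links(doi_field):
    """Convert a DOI string (or an iterable of DOI strings) into https://doi.org links.

    Assumption: multiple DOIs in one string are separated by commas or whitespace.
    """
    if isinstance(doi_field, str):
        candidates = re.split(r"[,\s]+", doi_field)
    else:
        candidates = list(doi_field or [])
    return [
        f"https://doi.org/{doi.strip()}"
        for doi in candidates
        if doi and doi.strip()
    ]


print(doi_links("10.5194/gmd-14-7223-2021"))
print(doi_links(""))
```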
