<img src="Images/HSP2.png" />
This Jupyter Notebook Copyright 2017 by RESPEC, INC.  All rights reserved.

$\textbf{HSP}^{\textbf{2}}\ \text{and}\ \textbf{HSP2}\ $ Copyright 2017 by RESPEC INC. and released under this [License](LegalInformation/License.txt)

# TUTORIAL 3: Running HSP$^2$

## Introduction

This tutorial demonstrates how to run $\textbf{HSP}^\textbf{2}$ . It also provides an example workflow to demonstrate additional capabilities of $\textbf{HSP}^\textbf{2}$ .

**Tutorial Contents**

 + Section 1: [Running HSP$^2$](#section1)
 + Section 2: [Tools for Running HSP$^2$](#section2)
    + [SAVE tables to select time series to save](#usersave)
    + [Save All Calculations](#saveall)
    + [Check the HDF File  ](#checkstuff)
    + [Make the HDF5 file Smaller](#makesmall)
 + Section 3: [Techniques for Efficient Simulation](#section3)


### Required Python imports  and setup

In [None]:
import os
import site
site.addsitedir(os.path.dirname(os.getcwd()))  # adds the parent directory (the HSP2 software location) to the path; works on any OS

hdfname = 'TutorialData/tutorial.h5'

import shutil
import numpy as np
import pandas as pd
pd.options.display.max_rows    = 18
pd.options.display.max_columns = 10
pd.options.display.float_format = '{:.2f}'.format  # display 2 digits after the decimal point

import HSP2
import HSP2tools

HSP2tools.reset_tutorial()    # make a new copy of the tutorial's data
HSP2tools.versions()          # display version information below

## Section 1: Run $\textbf{HSP}^\textbf{2}$ <a id='section1'></a>

This tutorial assumes that the HDF5 file representing the watershed exists. Tutorial 4 discusses how to create the watershed HDF5 file from the legacy UCI and WDM files.  Alternatively, a future $\textbf{HSP}^\textbf{2}$ GUI (graphical user interface) tool could be used to create an HDF5 file directly for a new watershed.


Here is how to run $\textbf{HSP}^\textbf{2}$ :

In [None]:
HSP2.run(hdfname)

That's all there is!

The first time HSP2 is run in a Python session, Numba performs a just-in-time (JIT) compilation. This takes about the same time regardless of whether the run is for a simple or a complex watershed. Subsequent runs are much faster. 

Go back and rerun the cell above to see the difference.

Many factors, such as background system tasks, will alter the run time. Saving more calculated results than the default will also increase the run time.
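The difference between the first and later runs can be measured rather than eyeballed. The following is a minimal sketch, not part of HSP2: `timed` is a hypothetical helper, and `stand_in_run` is a stand-in that merely imitates a one-time setup cost. In the notebook, `timed(HSP2.run, hdfname)` would take its place.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for HSP2.run (an assumption for illustration only): it pays a
# one-time cost on the first call, loosely imitating JIT compilation.
_state = {'compiled': False}

def stand_in_run(hdfname):
    if not _state['compiled']:
        time.sleep(0.2)              # simulated one-time compilation cost
        _state['compiled'] = True
    return hdfname

_, first = timed(stand_in_run, 'TutorialData/tutorial.h5')
_, second = timed(stand_in_run, 'TutorialData/tutorial.h5')
print(f'first call: {first:.3f} s, second call: {second:.3f} s')
```

Because background tasks affect the run time, a single pair of measurements is only indicative. Repeating the second call a few times gives a steadier number.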

**NOTE** The pink warning messages above can be ignored. The bug causing this in the h5py library has been fixed, but not yet released.

####  Examine the HDF5 file.

Use HDFView or HDFCompass to examine the **RUN_INFO** HDF5 directory. It contains the run log (as printed above during the run). It also contains the table **SOFTWARE_VERSION_TABLE**, which shows the same table shown in the cell *Required Python imports and setup* above.

The time series calculated during the run are saved in the **RESULTS** group.

Only the time series marked to be saved in the associated **SAVE** tables are actually stored.  By default, time series that can be trivially computed are not saved.

**Section Summary**

 + Demonstrated how to run HSP2

## Section 2: Tools for Running HSP$^2$ <a id='section2'></a>

### SAVE tables are used to select time series to save to the HDF5 file<a id='usersave'></a>
By default, the minimum number of time series necessary to run the watershed is automatically saved. Users can modify the entries in the SAVE tables to suit their needs.

This cell shows the names of the calculated time series saved by default for PERLND SNOW for segment P001 from the run above:

In [None]:
df = pd.read_hdf(hdfname, '/RESULTS/PERLND_P001/SNOW')
df.columns

Assume the user wants to also save the ALBEDO timeseries.

Read the associated SAVE table:

In [None]:
dfsave= pd.read_hdf(hdfname, '/PERLND/SNOW/SAVE')
dfsave

This changes a single value for a specific segment:

In [None]:
dfsave.loc['P001', 'ALBEDO'] = True
dfsave

Or, if you wanted to save ALBEDO for all segments:

In [None]:
dfsave.ALBEDO = True
dfsave

Of course, in the case of test10, there is only one PERLND segment. But this shows the technique.
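To make the difference between the two assignments visible, here is a small sketch on an in-memory table with several segments. The table is hypothetical: the segment IDs and the column names other than ALBEDO are chosen only for illustration and are not read from the tutorial file.

```python
import pandas as pd

# Hypothetical SAVE table: one row per segment, one boolean column per
# time series name. Everything starts out as "do not save".
dfsave_demo = pd.DataFrame(False,
                           index=['P001', 'P002', 'P003'],
                           columns=['ALBEDO', 'PACK', 'SNOWF'])

# Change a single cell: only segment P002 saves ALBEDO.
single = dfsave_demo.copy()
single.loc['P002', 'ALBEDO'] = True

# Change a whole column: every segment saves ALBEDO.
every = dfsave_demo.copy()
every['ALBEDO'] = True

print(single)
print(every)
```

The `.loc[row, column]` form touches one cell, while assigning to the column sets the flag for every segment at once.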

Now put the SAVE table back into the HDF5 file.

In [None]:
dfsave.to_hdf(hdfname, '/PERLND/SNOW/SAVE', data_columns=True, format='t')

Now rerun the simulation and check which time series were saved.

In [None]:
HSP2.run(hdfname)

In [None]:
df = pd.read_hdf(hdfname, '/RESULTS/PERLND_P001/SNOW')
df.columns
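When the column list is long, comparing it by eye is error-prone. A small sketch of one way to list only the newly saved names follows. `added_columns` is a hypothetical helper, and the two lists are made-up stand-ins for `list(df.columns)` captured before and after the rerun.

```python
def added_columns(before, after):
    """Return the names in `after` that are not in `before`, in `after`'s order."""
    seen = set(before)
    return [name for name in after if name not in seen]

# Hypothetical column lists, standing in for list(df.columns) from the
# two runs; the real names come from the RESULTS table.
before = ['PACK', 'SNOWF', 'WYIELD']
after = ['ALBEDO', 'PACK', 'SNOWF', 'WYIELD']

print(added_columns(before, after))
```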

### Save All Calculations <a id='saveall'></a>

A shortcut is provided to save all calculations. It does NOT modify the SAVE tables.

In [None]:
HSP2.run(hdfname, saveall=True)

You will see that this increases the run time and makes the HDF5 file larger.

Now look at the columns saved:

In [None]:
df = pd.read_hdf(hdfname, 'RESULTS/PERLND_P001/SNOW')
df.columns

### Check the HDF File  <a id='checkstuff'></a>

$\textbf{HSP}^\textbf{2}$  maintains all the HSPF run-time warning and error messages. However, it does not perform all the HSPF checks for the consistency of flags and data (such as the rules on FTables.)

An HDF5 file automatically created from (working) legacy UCI and WDM files should be correct. But if the user creates the HDF5 file directly or modifies an existing HDF5 file, errors might be introduced.

This tool performs these checks before the simulation is run, whenever the user wants to confirm the integrity of the HDF5 flags and data.

    checkHDF(hdfname)


In [None]:
HSP2tools.checkHDF(hdfname)

If no errors are printed, everything passes.

This tool is still in development to add additional checking.

### Make the HDF5 file Smaller <a id='makesmall'></a>

**NOTE** HDF5 version 1.10.0 was released in the spring of 2016. It has the capability to reclaim space dynamically, so you might not need the following process once this is available in the Python libraries. 

When running calculations over time, HDF5 files grow in size because of unrecovered space or overestimated space needs within the file. It is a good idea to occasionally repack the HDF5 file to make it smaller.

The HDF Group provides a utility, h5repack, to repack HDF5 files. PyTables (used internally by Pandas) includes another utility to repack HDF5 files, ptrepack.

The utility is called as 

     ptrepack inputfile outputfile
     
It is a command-line executable (.exe on Windows) rather than a Python module.

In [None]:
!ptrepack TutorialData\tutorial.h5  TutorialData\tutorial_repacked.h5

Now look at the sizes before and after.

In [None]:
%ls TutorialData
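The `%ls` listing depends on the operating system. As a portable alternative, the sizes can be read from Python. This is a minimal sketch: `size_report` is a hypothetical helper, and the file names are the ones used in the cells above.

```python
import os

def size_report(paths):
    """Return {path: size in MiB} for every path that exists."""
    return {p: os.path.getsize(p) / 2**20 for p in paths if os.path.exists(p)}

files = ['TutorialData/tutorial.h5', 'TutorialData/tutorial_repacked.h5']
for path, mib in size_report(files).items():
    print(f'{path}: {mib:.2f} MiB')
```

Files that do not exist are skipped silently, so an empty listing means the paths should be checked.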

The repacked HDF5 file still runs.

In [None]:
HSP2.run('TutorialData/tutorial_repacked.h5')

This run is usually a bit faster since the initial setup time is shorter for the tighter **/CONTROL** tables.  The setup time is not a large percentage of the run time for larger watersheds - so this isn't too significant.

$\textbf{HSP}^\textbf{2}$ does spend a significant percentage of its time writing the complete computed time series, whereas HSPF typically writes only daily or monthly time series to its HBN files.

## Section 3: Techniques for Efficient Simulation<a id='section3'></a>

### First, don't duplicate the timeseries data
Frequently, you will create multiple Notebooks for a single watershed for initial data processing tasks 
and to try different exploratory analyses (such as determining the impact of changing parameter values). 

#### It is not necessary to duplicate the time series data.

$\textbf{HSP}^\textbf{2}$ makes it easy to keep all of a watershed's timeseries in just one master HDF5 file. Other simulations of this watershed
can just access the time series from that master file. This can save significant storage and ensures (from the **QA/QC** perspective) that all simulations are using the same data.  

It is also possible to store timeseries data in an intranet-accessible repository of one or more HDF5 files. This data may include data from many projects
and span longer time intervals than those used for a specific project. Then all simulation HDF5 files use the same repository.

#### Assume the *MASTER* watershed model HDF5 has been created.

For this example, we will assume the tutorial.h5 is the "master" HDF5 file and contains the time series for this watershed.

Now create a Notebook for a new simulation study.  (Actually, for this tutorial, just make a copy of the usual tutorial HDF5 file to serve as the new study's file.)

In [None]:
myNotebook = 'TutorialData/myNotebook.h5'
shutil.copyfile(hdfname, myNotebook)

For this example, the **/TIMESERIES** directory is removed because the time series will be read from the master file. The **/RESULTS** directory is also removed to make the HDF5 file smaller. This is a common practice in large-scale calibration simulations when many (perhaps thousands) of individual simulation files are required.

This is the first example of deleting data from an HDF5 file in these tutorials. **Note**, there is **no** warning (such as "Are you really sure you want to delete this?"). Deleting a group (directory) in an HDF5 file will delete **ALL** data and groups below it.

The **del** statement is the standard Python way to delete an object, and it also works on groups in an HDF5 store.

In [None]:
with pd.HDFStore(myNotebook) as store:   # the store is closed automatically on exit
    del store['/TIMESERIES']             # deletes the group and everything below it
    del store['/RESULTS']

Now we need to point the simulation HDF5 file, myNotebook.h5, to the timeseries data in the master HDF5
file.

This is easily done by reading the EXT_SOURCES table and putting the name of the master file (full path if not located in the same directory)
into the **SVOL** column. The original **\*** in that column indicates the data is found in the same HDF5 file as the EXT_SOURCES table.

(Of course, the time series may be distributed across any number of HDF5 files. Just put the appropriate HDF5 name in each row of the table.)

In [None]:
df = pd.read_hdf(myNotebook, '/CONTROL/EXT_SOURCES')
df.head()

Now change the source of the data from the current HDF5 file (designated by an asterisk in the SVOL column) to the other HDF5 file.

In [None]:
df['SVOL'] = hdfname
df
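The assignment above points every row at the same file. When the time series are spread over several HDF5 files, the rows can be set individually. The sketch below uses a simplified, hypothetical EXT_SOURCES table: the SVOLNO and TMEMN columns, the time series IDs, and the file names are all assumptions for illustration, and the real table has more columns.

```python
import pandas as pd

# Hypothetical, simplified EXT_SOURCES table. '*' in SVOL means the data
# is in the same HDF5 file as the table itself.
ext = pd.DataFrame({
    'SVOL':   ['*', '*', '*', '*'],
    'SVOLNO': ['TS39', 'TS131', 'TS122', 'TS140'],
    'TMEMN':  ['PREC', 'PREC', 'GATMP', 'PETINP'],
})

# Hypothetical mapping from time series ID to the file that holds it.
source_file = {
    'TS39':  'precip.h5',
    'TS131': 'precip.h5',
    'TS122': 'temperature.h5',
}

# Look up each row's file; rows without a mapping keep their '*' marker.
ext['SVOL'] = ext['SVOLNO'].map(source_file).fillna(ext['SVOL'])
print(ext)
```

Keeping the mapping in one dictionary makes it easy to review which file each time series comes from.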

Save it back to the HDF5 file for later use.

In [None]:
df.to_hdf(myNotebook, '/CONTROL/EXT_SOURCES', data_columns=True, format='table')

Use HDFView or HDFCompass to view the myNotebook.h5 file. Note that the /TIMESERIES directory has been deleted. The /RESULTS directory has been deleted to prevent the user from mistakenly assuming that its results were computed from data in myNotebook.h5.  Good **QA/QC** practice would be to copy from the master and delete both directories in one script to ensure they stay in sync.

Run the myNotebook.h5 simulation to show that it works when pointing to timeseries data in the master, tutorial.h5.

In [None]:
HSP2.run(myNotebook)

Note: the total run time is essentially identical for runs with local data or for runs fetching their data from a different HDF5 file.

### Second, repack the file periodically to eliminate wasted space

The HDF5 library currently does not compact an HDF5 file after new tables are added, tables are deleted, or existing tables are appended to. After a series of such operations, the HDF5 file will grow large.

Section 2 discussed how to repack files. Remember to do this periodically for all HDF5 files.

**NOTE** HDF5 version 1.10.0 was released in the spring of 2016. It has the capability to reclaim space dynamically and you might not need to do this when this is available in the Python libraries.

### Third, use lower precision for storage and don't duplicate timebase information

Don't use too much precision if you want to save storage. All $\textbf{HSP}^\textbf{2}$  calculations are performed
in double precision. Currently, computed time series are stored in single precision.
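As a rough illustration of this trade-off, the sketch below compares the memory footprint of the same synthetic series in double and single precision. The data is made up for the example (it is not HSP2 output), and the sizes are in-memory array sizes, which only approximate what ends up in the HDF5 file.

```python
import numpy as np

n = 8760                                   # hourly values for one non-leap year
values64 = np.linspace(0.0, 1.0, n)        # float64, as used for calculations
values32 = values64.astype(np.float32)     # float32, as used for storage

print('float64 bytes:', values64.nbytes)
print('float32 bytes:', values32.nbytes)

# Largest absolute difference introduced by the conversion to single precision.
max_err = np.max(np.abs(values64 - values32.astype(np.float64)))
print('largest rounding difference:', max_err)
```

Single precision halves the storage and keeps roughly seven significant decimal digits.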

$\textbf{HSP}^\textbf{2}$ puts all the computed time series for a single activity (like IWATER) into one table since they share a common timebase (index). If you create your own modules for $\textbf{HSP}^\textbf{2}$, avoid storing each result as a Pandas Series with its own timebase.

$\textbf{HSP}^\textbf{2}$  stores FLAG data as 64-bit integers and stores INITIALIZATIONS, PARAMETERS, MONTHLY, and FTABLE tables in double precision.

Perhaps the precision of $\textbf{HSP}^\textbf{2}$ items could become a user configuration option; we are looking for user feedback.

### Fourth, use data compression

HDF5 has compression options to save storage.  Each dataset in an HDF5 file can specify its own compression algorithm and associated compression factor (if appropriate) or specify no compression (the default).

By default, no compression is specified by $\textbf{HSP}^\textbf{2}$.

HDF5 packing tools, h5repack.exe and ptrepack.exe, can also apply a global compression algorithm and compression factor when they repack an HDF5 file.

However, we have not been able to register a standard compression algorithm, BLOSC, to HDFView and still view the data correctly. Since viewing the HDF5 files in HDFView or HDFCompass is valuable to these tutorials, we will continue to not compress the HDF5 files.  

But for your own projects, you should consider compression.

**NOTE:** HDF5 supports both lossy and lossless compression algorithms.

**NOTE:** Data is compressed and decompressed on the fly by the internals of HDF5. As a user, you don't need to do anything.
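For your own projects, compression can be applied while repacking. The sketch below only builds the command line for ptrepack using its `--complevel` and `--complib` options. `ptrepack_command` is a hypothetical helper, the output file name is made up, and zlib at level 5 is just an example setting, not a recommendation from this tutorial.

```python
def ptrepack_command(src, dst, complevel=None, complib=None):
    """Build the argument list for a ptrepack call, optionally with compression."""
    cmd = ['ptrepack']
    if complevel is not None:
        cmd.append(f'--complevel={complevel}')   # 0 (none) to 9 (maximum)
    if complib is not None:
        cmd.append(f'--complib={complib}')       # for example zlib or blosc
    cmd += [src, dst]
    return cmd

cmd = ptrepack_command('TutorialData/tutorial.h5',
                       'TutorialData/tutorial_compressed.h5',
                       complevel=5, complib='zlib')
print(' '.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` when ptrepack is on the PATH. Choosing zlib (HDF5's built-in deflate filter) rather than BLOSC sidesteps the HDFView problem noted above.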

### Fifth, save the computed data in another format

Currently, $\textbf{HSP}^\textbf{2}$ saves DataFrame tables using these options:

``` data_columns=True, format='table'```

These options result in tables appearing like those in these tutorials when viewed 
with HDFView or HDFCompass.  These options are necessary if the data is to be appended or queried.

However, they use considerably more space since they require a B-tree to find the noncontiguous data blocks. They are slower to read and write as well.

Perhaps, for real-world use (not tutorials), the non-table (fixed) format should be considered.  (The I/O time is a significant portion of the $\textbf{HSP}^\textbf{2}$  run time.)




### Sixth, save project documentation to HDF5

Documentation created during the watershed project can be saved to HDF5.

Version control data for MATLAB files (.m), Python files (.py), and IPython Notebooks (.ipynb) can also
be saved to the HDF5 file.  (Mercurial is the recommended version control tool.)

Other documents such as scanned data and PDF files can be saved to HDF5.

**NOTE:** Documents will not be viewable with HDFView (except plain text). You can see the dataset as raw bytes (which is not very useful.) But they can be extracted from the HDF5 file and viewed normally.

**EXAMPLE**

Your directory contains a PDF file, JasonWEFTEC 2014.pdf.  We will save this file into the tutorial.h5 file and then restore it.

The cells starting with "!" are Windows command lines.  If you are on a Linux or Mac computer, you will need to use appropriate commands to start a PDF viewer.

Check that the PDF file is available:

In [None]:
filepath = 'TutorialData/JasonWEFTEC 2014.pdf'
os.listdir('TutorialData')

In [None]:
HSP2tools.save_document(hdfname, filepath)

Use HDFView to see that the document has been added. Delete JasonWEFTEC 2014.pdf from the directory, and check that it is gone.

In [None]:
os.remove(filepath)
os.listdir('TutorialData')

The file does not exist.

Now restore the document.

In [None]:
HSP2tools.restore_document(hdfname, 'TutorialData/JasonWEFTEC 2014.pdf')

In [None]:
os.listdir('TutorialData')

You can check that the file was not damaged by the round-trip by viewing it at this [link](TutorialData/JasonWEFTEC 2014.pdf) or opening it in your favorite PDF viewer.
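Viewing the file is a visual check. A checksum comparison is a stricter one. The sketch below is not part of HSP2tools: `sha256_of` is a hypothetical helper, and a plain file copy stands in for the save and restore steps so that the example runs on its own. In the notebook, the digest would be computed on the PDF before it is deleted and again after it is restored.

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, 'original.bin')
    restored = os.path.join(tmp, 'restored.bin')
    with open(original, 'wb') as f:
        f.write(b'example document bytes' * 1000)
    shutil.copyfile(original, restored)      # stand-in for save + restore
    original_digest = sha256_of(original)
    same = original_digest == sha256_of(restored)

print('round trip preserved the file:', same)
```

Matching digests indicate that the restored file is byte-for-byte identical to the original.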

**SECTION SUMMARY**

 + Demonstrated using data from the Master or Repository HDF5 files rather than duplicating it.
 + Demonstrated packing HDF5 files to make them smaller
 + Discussed HDF5 dataset compression
 + Demonstrated saving a document into an HDF5 file.
 + Demonstrated restoring a document from the HDF5 file into the current directory.
