# High-throughput Scripts Basics

- - -

**Lucas M. Hale**, [lucas.hale@nist.gov](mailto:lucas.hale@nist.gov?Subject=ipr-demo), *Materials Science and Engineering Division, NIST*.

**Chandler A. Becker**, [chandler.becker@nist.gov](mailto:chandler.becker@nist.gov?Subject=ipr-demo), *Office of Data and Informatics, NIST*.

**Zachary T. Trautt**, [zachary.trautt@nist.gov](mailto:zachary.trautt@nist.gov?Subject=ipr-demo), *Materials Measurement Science Division, NIST*.

Version: 2017-03-20

[Disclaimers](http://www.nist.gov/public_affairs/disclaimer.cfm) 
 
- - -

## Introduction

Running calculations in a high-throughput manner means setting up and performing multiple calculations all at once as opposed to doing them individually. In the iprPy Framework, this is accomplished using scripts located in the [high-throughput directory](../../../high-throughput).

The available high-throughput scripts are:

- __build__ copies all reference records in the library directory into a database. 

- __prepare__ generates multiple instances for a specific calculation as well as corresponding incomplete records.

- __runner__ cycles through all prepared calculations in a given run directory, executes them, and uploads the complete records to the database. 

- __check__ accesses a database and prints a list of all records of a specific record style. Useful for a quick check.

- __clean__ resets calculations that failed. 

- __destroy__ permanently deletes all records of a given record style from a database.

- - -

## The input parameter files

Here is a quick explanation of the input parameter files.

### Formatting rules

The input parameter files for calculations follow some very simple rules.

1.	Each line is read separately, and divided into whitespace delimited terms.

2.	Blank lines are allowed.

3.	Comments are allowed by starting terms with #. The # term and any subsequent terms on the line are ignored. 

4.	The first term in each line is a variable name.

5.	All remaining (non-comment) terms are collected together as a complete value that is assigned to that variable name.

6.	Any variable names without values are ignored.

7.  For all but the prepare scripts, each variable name can appear at most one time.

8.  For the prepare scripts, some variables can be assigned multiple values by repeating the variable name on different lines. Which variables allow this is specific to the prepare script.

### Formatting example

Script:
    
    #This is a comment and will be ignored
    
    firstvariable    singleterm
    
    secondvariable   multiple terms   using    spaces
    thirdvariable    term #with comments
    thirdvariable    again
    thirdvariable
    
    fourthvariable
    
Gets interpreted as a Python dictionary:
    
    {'firstvariable':  'singleterm',
     'secondvariable': 'multiple terms using spaces',
     'thirdvariable':  ['term', 'again']}
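
The rules above can be sketched as a short parser. This is a minimal illustration, not the parser that iprPy itself uses: the function name `parse_input` and the `allow_multiple` flag are hypothetical, and the flag is a simplification, since the actual prepare scripts decide per variable whether repeated names are allowed.

```python
def parse_input(text, allow_multiple=False):
    """Parse input-parameter text into a dict (illustrative sketch only)."""
    params = {}
    for line in text.splitlines():
        # Rule 1: split each line into whitespace-delimited terms
        terms = line.split()
        # Rule 3: drop the first term starting with '#' and everything after it
        for i, term in enumerate(terms):
            if term.startswith('#'):
                terms = terms[:i]
                break
        # Rules 2 and 6: skip blank lines and variable names without values
        if len(terms) < 2:
            continue
        # Rules 4 and 5: first term is the name, the rest form one value
        name, value = terms[0], ' '.join(terms[1:])
        if name not in params:
            params[name] = value
        elif not allow_multiple:
            # Rule 7: outside of prepare scripts a name may appear only once
            raise ValueError(name + ' appears more than once')
        elif isinstance(params[name], list):
            params[name].append(value)
        else:
            # Rule 8: repeated names collect their values into a list
            params[name] = [params[name], value]
    return params


example = """
#This is a comment and will be ignored

firstvariable    singleterm

secondvariable   multiple terms   using    spaces
thirdvariable    term #with comments
thirdvariable    again
thirdvariable

fourthvariable
"""
result = parse_input(example, allow_multiple=True)
print(result)
```

Applied to the example script, this sketch produces the same dictionary as shown above.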

### Input script parameters unique to the high-throughput scripts

#### Database access

Specifies access to the records database to use. Used by all of the scripts.

- __database__: the style and location of the database.

- __database_user__: database user name (if needed).

- __database_pswd__: database password (if needed).

- __database_cert__: path to a certification file (if needed).

#### Directory paths

Specifies the directories that the scripts read from and write to.

- __run_directory__: path to the local directory where the calculation instances will be added. Used by prepare, runner and clean.

- __library_directory__: path to the iprPy/library directory. Only used by build and only necessary if the build script is moved. 
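
As a rough illustration of which scripts use which of these parameters, the sketch below checks an already parsed parameter dictionary before a script is launched. The table and the function name `missing_parameters` are hypothetical, and treating the listed parameters as strictly required is an assumption based only on the descriptions above; the "if needed" database parameters and `library_directory` are left out as optional.

```python
# Parameters each script is assumed to need, following the descriptions above.
# This table is an illustration, not something defined by iprPy.
ASSUMED_REQUIRED = {
    'build':   ['database'],
    'prepare': ['database', 'run_directory'],
    'runner':  ['database', 'run_directory'],
    'check':   ['database'],
    'clean':   ['database', 'run_directory'],
    'destroy': ['database'],
}


def missing_parameters(script, params):
    """Return the assumed-required parameter names absent from params."""
    return [name for name in ASSUMED_REQUIRED[script] if name not in params]


# Hypothetical parsed input for a runner that lacks a run directory
params = {'database': 'placeholder-style placeholder-location'}
print(missing_parameters('runner', params))
```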

- - -
## Build script

The build script copies all reference records in the library directory into a database. This effectively tests access to the database and makes the reference records available to the prepare and runner scripts. It only needs to be run once per database each time new records in the library need to be added (i.e., when first setting up, and whenever the library in iprPy is updated). The library location only needs to be supplied if the build script is moved to a different directory.

Executing build is simple:

1. Create an input parameter file with database access information (and library location).

2. In a terminal, cd to the high-throughput/build directory, and enter the command
    
        python build.py [inputscript]

- - -

## Prepare scripts

Prepare scripts interact with a database and create multiple instances of a given calculation. For simplicity, each prepare script is associated with only one calculation, but common input parameter file structures and parameter names are used throughout.

Each prepare script is designed to do effectively the same thing:

1. All variable parameter values are read in from an input parameter script. This script provides information for accessing a records database, conditions for running the calculations, and terms outlining which calculation parameter values to use.

2. The database is accessed and a list is built of all existing records for the calculation being prepared. 

3. Parameter sets for the calculation are built by looping over combinations of parameter values as specified by the prepare script's input parameters.

4. Each unique parameter set is compared against the list of existing records to determine if it is new, or has been executed before.

5. If the parameter set is new, a corresponding incomplete record and calculation instance are created.  The record is added to the database, while the calculation instance is placed in the specified run directory.

6. The calculation instance is a folder consisting of the calculation's script, a filled in calculation script parameter file, and copies of any other files and records necessary to run the calculation. If the calculation instance needs a record from another calculation as input, the other calculation's record is copied as it currently is (complete or incomplete). 
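
Steps 3 through 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of any actual prepare script: records are stood in for by plain dictionaries, the parameter names and values are hypothetical, and a full Cartesian product is assumed even though each real prepare script defines its own rules for combining values.

```python
import itertools


def prepare(parameter_values, existing_records):
    """Return the parameter sets that are not yet in existing_records."""
    names = sorted(parameter_values)
    new_sets = []
    # Step 3: loop over every combination of the supplied values
    for combo in itertools.product(*(parameter_values[n] for n in names)):
        parameter_set = dict(zip(names, combo))
        # Step 4: skip sets that match an existing record
        if parameter_set in existing_records:
            continue
        # Step 5: a real script would add an incomplete record to the
        # database and create a calculation instance here
        existing_records.append(parameter_set)
        new_sets.append(parameter_set)
    return new_sets


# Hypothetical parameter names and values
parameter_values = {'potential': ['potA', 'potB'], 'temperature': ['0', '300']}
existing = [{'potential': 'potA', 'temperature': '0'}]
new_sets = prepare(parameter_values, existing)
print(len(new_sets))
```

Because the list of existing records is read once and then only updated locally, a second prepare script running at the same time would not see these additions, which is the source of the concurrency issue described in the notes below.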


#### NOTES:

- All calculations prepared in a specific run directory must be associated with only one database. This is because the runner script has to associate each calculation instance to a database record.  

- Running multiple prepare scripts at the same time for the same database may cause issues as the list of existing records is only populated at the beginning of the script. This could lead to certain calculation instances not being prepared, or being prepared multiple times.

- Runner scripts can be started for a given run directory while a prepare script is still adding calculations to the run directory, but there is a chance that calculations may fail if a runner starts a calculation that is in the process of being prepared.  

### Finding the prepare scripts

The prepare scripts can be found in the [iprPy/high-throughput/prepare](../../../high-throughput/prepare) directory. The scripts are divided into folders with the appropriate calculation names. 

### Contents of the prepare script folders

With the common design of calculations, the contents of each calculation folder are similar:

-	__prepare\_[calcname].py__: The Python prepare script for the calculation called calcname.

-	__prepare\_[calcname].in__: An example version of the input parameter file that the prepare script reads.

### Preparing a calculation

1. Read the Notebook for the calculation that you care about to know the meanings of the script's parameter names.

2. Open the prepare\_[calcname].in file in a text editor and modify as needed.

3. In a terminal, cd to the prepare folder for the calculation you want and enter the command:
        
        python prepare_[calcname].py prepare_[calcname].in
        
4. Wait (seriously, this can take some time as considerable file access and copying is going on).

5. You can check on the progress by exploring the corresponding records in the database, or by counting the number of folders in the run directory.
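
Counting the folders in the run directory, as mentioned in the last step, takes only a few lines. The sketch below assumes that every subfolder of the run directory is one calculation instance; the function name and the folder names in the demonstration are hypothetical.

```python
import os
import tempfile


def count_instances(run_directory):
    """Count the subfolders of run_directory, one per calculation instance."""
    with os.scandir(run_directory) as entries:
        return sum(1 for entry in entries if entry.is_dir())


# Demonstration on a temporary stand-in for a run directory
with tempfile.TemporaryDirectory() as run_directory:
    for name in ('calc-1', 'calc-2', 'calc-3'):  # hypothetical folder names
        os.mkdir(os.path.join(run_directory, name))
    prepared = count_instances(run_directory)
print(prepared)
```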

- - -

## Runner script

The runner script iterates over all calculation instances in a given run directory, executes them, and updates the corresponding records in the database when the calculation finishes. Some cool features of the runner script are:

- Multiple runners can operate on the same run directory at the same time. 

- A runner can be executed directly from a terminal, or submitted to a queuing system with a set number of cores.

- The runner queries the database for the complete version of any incomplete input records. If one doesn't exist, the runner tries to perform the reference calculation. This effectively allows for a calculation hierarchy to be handled.

- Runners can operate on different run directories (and even computers!) and access the same database. This allows for a heterogeneous distribution of computational resources to different runners.
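
For several runners to share one run directory, each calculation instance has to be claimed by exactly one of them. How the runner script does this is not described here; the sketch below shows one common technique, exclusive creation of a lock file, purely as an illustration. The function name, the lock file name, and the runner identifiers are all hypothetical, and whether exclusive creation is reliable depends on the file system in use.

```python
import os
import tempfile


def try_claim(calc_dir, runner_id):
    """Return True if this runner claimed calc_dir, False if already claimed."""
    lock_path = os.path.join(calc_dir, 'claimed.lock')  # hypothetical file name
    try:
        # O_EXCL makes creation fail if the lock file already exists
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, 'w') as lock:
        lock.write(runner_id)
    return True


# Demonstration: two runners compete for the same calculation instance
with tempfile.TemporaryDirectory() as run_directory:
    calc_dir = os.path.join(run_directory, 'calc-1')
    os.mkdir(calc_dir)
    first = try_claim(calc_dir, 'runner-A')
    second = try_claim(calc_dir, 'runner-B')
print(first, second)
```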

#### NOTES:

- All calculations prepared in a specific run directory must be associated with only one database. This is because the runner script has to associate each calculation instance to a database record.  

- Runner scripts can be started for a given run directory while a prepare script is still adding calculations to the run directory, but there is a chance that calculations may fail if a runner starts a calculation that is in the process of being prepared.  

- When submitting a runner to a cluster for MPI LAMMPS simulations, it's up to you to make certain that the number of cores in the mpi_command calculation input parameter matches the number of cores assigned to the runner. Because of this, it is helpful to have a different run directory for each number of cores.

### Setting up a runner

Each runner only needs to know which run directory to operate on, and how to access a database. This information can be given in an input parameter script. Personally, I like to have a different input script for each combination, then simply access it on submission.

Starting a runner from a terminal. 

1. cd to the high-throughput/runner directory and enter the command:
    
        python runner.py [runnerscript]
        
Submitting a runner to a cluster.

1. cd to the high-throughput/runner directory and enter the command:

        [submissioncommand] iprPy_runner [runnerscript]





- - -
## Check script

The check script accesses a database and/or run directory and checks on the status.

- For a run directory, it returns the current number of calculation instance folders.

- For a database, it provides the number of complete, incomplete and errored calculations for a given record style.
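
The database summary amounts to a tally of records by status. The sketch below is an illustration only: it assumes that records have already been fetched as dictionaries with a hypothetical `status` key taking the values `complete`, `incomplete`, or `error`, which may not match how iprPy records actually store their status.

```python
from collections import Counter


def summarize(records):
    """Tally records by their (hypothetical) 'status' value."""
    counts = Counter(record['status'] for record in records)
    # Report all three categories, including those with no records
    return {status: counts.get(status, 0)
            for status in ('complete', 'incomplete', 'error')}


# Hypothetical records for one record style
records = [
    {'name': 'calc-1', 'status': 'complete'},
    {'name': 'calc-2', 'status': 'incomplete'},
    {'name': 'calc-3', 'status': 'complete'},
]
summary = summarize(records)
print(summary)
```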

- - -
## Clean script

The clean script resets calculations that failed. It reads the database access information and the run_directory from its input parameter file.