# 1. Set up a persistent server for QCFractal

A long-running server instance is preferable. Otherwise, you can set up a server for a few days, run your calculations, back up the data, shut down the server, and later restart it from the backup.
A sample Slurm script for a two-week server job looks like this:

```
#! /usr/bin/bash
#SBATCH --partition=partition_name
#SBATCH -t 14-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=10gb
#SBATCH --export ALL
## USAGE

#echo $(hostname) > hostfile
source $HOME/.bashrc

## All this script needs is a QCFractal environment, so this conda env should work
## https://github.com/openforcefield/qca-dataset-submission/blob/master/devtools/prod-envs/qcarchive-user-submit.yaml
conda activate qcf-user-submit

# These steps initialize and then start the QCFractal server
qcfractal-server init --base-folder "/tmp/$SLURM_JOBID" --max-active-services 300 --query-limit 100000

qcfractal-server start --base-folder "/tmp/$SLURM_JOBID"

```


After starting the server, you can submit your collection of molecules with the following Python code:

```
from qcportal.client import FractalClient
from openff.qcsubmit.datasets import load_dataset

# Load the dataset from file
dataset = load_dataset("dataset.json.bz2")

# The host name can be the node name (hpc3-l18-03 in this example). The default
# port is 7777 unless another port was specified during server initialization.
host_name_with_default_port = "hpc3-l18-03:7777"

client = FractalClient(host_name_with_default_port, verify=False)

submission = dataset.submit(client)
```
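
Before submitting, it can save time to confirm that the server port is reachable from the machine running the client. The following is a minimal sketch using only the standard library. The helper name `server_is_reachable` and the timeout value are illustrative and not part of QCFractal. A successful TCP connection only shows that something is listening on that port, not that it is a healthy QCFractal server.

```python
import socket


def server_is_reachable(address: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to 'host:port' can be opened.

    If no port is given, 7777 (the default port mentioned above) is assumed.
    """
    host, sep, port = address.rpartition(":")
    if not sep:
        host, port = address, "7777"
    try:
        # create_connection resolves the host name and opens the socket
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and DNS failures;
        # ValueError covers a port that is not a number
        return False
```

For example, `server_is_reachable("hpc3-l18-03:7777")` could be called right before creating the `FractalClient`.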

To run a local instance instead, see this earlier blog post, https://openforcefield.org/community/news/science-updates/ff-training-example-2021-07-01/, which uses:
```
from qcfractal import FractalSnowflakeHandler
from qcportal.client import FractalClient

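# Start a temporary local server with 16 cores and connect a client to it
local_fractal_instance = FractalSnowflakeHandler(ncores=16)
local_fractal_client = FractalClient(local_fractal_instance)
# torsion_drive_dataset is the dataset prepared earlier in the blog post
submission = torsion_drive_dataset.submit(local_fractal_client)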
```

# 2. After the dataset is submitted, the server waits for workers to pick up the tasks

Workers are spawned from program-specific conda environments, chosen according to the program in use. Our tested production environments are available at https://github.com/openforcefield/qca-dataset-submission/tree/master/devtools/prod-envs.


The sample Slurm script below runs a Psi4 QM job. It creates 5 workers (concurrent tasks, set with `--tasks-per-worker`) with 8 cores each, and 180 GB of memory distributed among the 5 workers; `--cores-per-worker 40` and `--memory-per-worker 180` are the totals shared by those tasks. You can submit as many workers as your allocation allows, which speeds up the computation because the jobs are distributed among all workers. The verbose Slurm output shows the job success rate; if the error rate is high, the managers can be stopped and the outputs of the failed jobs inspected.


    
```
#! /usr/bin/bash
#SBATCH --partition=free
#SBATCH -t 3-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1
#SBATCH --mem=180gb
#SBATCH --export ALL

## USAGE
#echo $(hostname) > hostfile
source $HOME/.bashrc
conda activate qcarchive-worker-openff-psi4
#host="$(cat ./host)"

qcfractal-manager --verbose --fractal-uri "hpc3-l18-03:7777" --verify False --tasks-per-worker 5 --cores-per-worker 40 --memory-per-worker 180 --update-frequency 5

```
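
The resource split described above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the manager divides its cores and memory evenly among its concurrent tasks; the function name `per_task_resources` is illustrative and not part of QCFractal.

```python
def per_task_resources(cores_per_worker, memory_per_worker_gb, tasks_per_worker):
    """Split a manager's total cores and memory evenly over its concurrent tasks."""
    if tasks_per_worker < 1:
        raise ValueError("tasks_per_worker must be at least 1")
    cores = cores_per_worker // tasks_per_worker        # whole cores per task
    memory_gb = memory_per_worker_gb / tasks_per_worker  # memory per task in GB
    return cores, memory_gb


# Values from the qcfractal-manager command in the script above
cores, memory_gb = per_task_resources(40, 180, 5)
print(f"{cores} cores and {memory_gb:g} GB per task")  # 8 cores and 36 GB per task
```

Running the same check before changing `--tasks-per-worker` helps confirm that each task still gets enough memory for the systems and basis sets in the dataset.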


Please make sure to use the right conda environment for each program when spawning workers for a Psi4, xtb, OpenMM, or ANI job.

The configuration can also be supplied in a YAML file such as:

```
common:
 adapter: pool
 max_workers: 5
 tasks_per_worker: 1
 cores_per_worker: 48
 memory_per_worker: 180
 retries: 5
 scratch_directory: /tmp/$SLURM_JOB_ID

server:
 fractal_uri: hpc3-l18-03:7777

manager:
 manager_name: local_server
 queue_tag:
    - openff
 log_file_prefix: '../logs/$SLURM_JOB_ID.log'
 update_frequency: 30
 test: False

```
It can be passed to `qcfractal-manager` as follows:
```
qcfractal-manager --verbose --config-file qm-config.${SLURM_JOBID}.yaml
```
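
The file name in the command above suggests one config file per Slurm job. Whether `qcfractal-manager` expands `$SLURM_JOB_ID` inside the YAML itself is not stated here, so one option is to fill in the job ID before launching the manager. The sketch below does this with `string.Template` from the standard library. The shortened template and the helper names `render_config` and `write_job_config` are illustrative assumptions, not part of QCFractal.

```python
import os
from string import Template

# Shortened, illustrative template; a real one would hold the full config shown above
CONFIG_TEMPLATE = Template("""\
common:
 adapter: pool
 scratch_directory: /tmp/$SLURM_JOB_ID

manager:
 log_file_prefix: '../logs/$SLURM_JOB_ID.log'
""")


def render_config(template, job_id):
    """Replace every $SLURM_JOB_ID placeholder with the given job ID."""
    return template.substitute(SLURM_JOB_ID=job_id)


def write_job_config(job_id, directory="."):
    """Write qm-config.<job_id>.yaml into the directory and return its path."""
    path = os.path.join(directory, f"qm-config.{job_id}.yaml")
    with open(path, "w") as handle:
        handle.write(render_config(CONFIG_TEMPLATE, job_id))
    return path
```

Inside a Slurm job, `write_job_config(os.environ["SLURM_JOB_ID"])` would produce the file that the `qcfractal-manager` command above expects.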

# 3. Creating a backup of the data

A backup of the whole database, with any number of collections, can be created by running the following command on the host:
```
qcfractal-server backup
```
This creates a `qcfractal_default.bak` file in the directory where the command is executed. The backup can be loaded into a restarted server instance by running
```
qcfractal-server restore
```
after initializing the server but before starting it.
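
Because each backup run in the same directory targets the same file name, it may be worth keeping dated copies somewhere outside the job's working directory. The following is a minimal sketch using only the standard library; the function name `archive_backup` and the `backups` directory are illustrative choices, not part of QCFractal.

```python
import shutil
import time
from pathlib import Path


def archive_backup(backup_file="qcfractal_default.bak", archive_dir="backups"):
    """Copy the backup file to archive_dir under a timestamped name."""
    source = Path(backup_file)
    if not source.is_file():
        raise FileNotFoundError(f"backup file not found: {source}")
    target_dir = Path(archive_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # e.g. qcfractal_default-20240131-154500.bak
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{source.stem}-{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

Calling `archive_backup()` right after `qcfractal-server backup` keeps one copy per backup, so an older state can still be restored if a later backup turns out to be incomplete.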


# 4. Error cycling

Some calculations commonly fail. Causes include the partition you are running on (on a pre-emptible queue, jobs may be cancelled before completion), the Slurm configuration (larger systems or basis sets may run out of memory), and failures to reach convergence. Once you have identified the error, some of these failed jobs can be rerun with the following code:

```
from collections import defaultdict

import pandas as pd
from qcportal.client import FractalClient

# With no arguments, FractalClient connects to the public QCArchive server.
# For your own server pass its address, e.g. FractalClient("hpc3-l18-03:7777", verify=False)
client = FractalClient()

ds = client.get_collection("TorsionDriveDataset", "Biaryl Torsion Drives")

# Names of all specifications in the dataset
specs_list = list(ds.list_specifications().to_dict()['Description'].keys())

status_dict = defaultdict(dict)
num_complete = 0
num_error = 0
num_incomplete = 0
num_running = 0
num_nan = 0
err_recs = []  # errored optimization records, reused later to inspect the errors

# Let's check one specification from the list.
# For torsion scans the errored jobs are optimizations at the grid points, so we check
# the optimization history of each torsiondrive record and restart the errored records.
spec = "project_default"
for entry in ds.data.records.values():
    td_rec = ds.get_record(entry.name, specification=spec)
    optrecs_hist = []
    for key, value in td_rec.dict()['optimization_history'].items():
        # only grid points with more than one optimization are inspected here
        if len(value) > 1:
            print(key, value)
            optrecs_hist.extend(td_rec.get_history(key))
    for item in optrecs_hist:
        if item.status == 'COMPLETE':
            num_complete += 1
        elif item.status == 'ERROR':
            err_recs.append(item)
            # restart the failed calculation with this line
            client.modify_tasks(operation='restart', base_result=item.id)
            num_error += 1
        elif item.status == 'INCOMPLETE':
            num_incomplete += 1
        elif item.status == 'RUNNING':
            num_running += 1
        else:
            num_nan += 1

status_dict[spec] = {"COMPLETE": num_complete, "ERROR": num_error, "INCOMPLETE": num_incomplete, "RUNNING": num_running, "NaN": num_nan}

df = pd.DataFrame(status_dict)
```
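
The per-status counters can also be collected in one pass with `collections.Counter`. The sketch below is self-contained: plain status strings stand in for the `status` values of the optimization records, and the function name `tally_statuses` is illustrative.

```python
from collections import Counter

KNOWN_STATUSES = ("COMPLETE", "ERROR", "INCOMPLETE", "RUNNING")


def tally_statuses(statuses):
    """Count statuses, grouping anything unexpected under the 'NaN' key."""
    counts = Counter(statuses)
    summary = {name: counts.get(name, 0) for name in KNOWN_STATUSES}
    summary["NaN"] = sum(
        number for name, number in counts.items() if name not in KNOWN_STATUSES
    )
    return summary


# Stand-in data; in practice this would be [rec.status for rec in optrecs_hist]
example = ["COMPLETE", "ERROR", "COMPLETE", "RUNNING", None]
print(tally_statuses(example))
# {'COMPLETE': 2, 'ERROR': 1, 'INCOMPLETE': 0, 'RUNNING': 1, 'NaN': 1}
```

The resulting dictionary has the same keys as the one stored in `status_dict`, so it can be passed to `pd.DataFrame` in the same way.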

## To look at the error codes of failed jobs

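```
import lzma

errored_manager = []
error_types = []
# err_recs holds the errored optimization records collected in the previous cell
for err_rec in err_recs:
    # the manager name prefix identifies which manager ran the failed task
    errored_manager.append(err_rec.manager_name.split('-')[0])
    # fetch the stored error message and decompress it
    kv = client.query_kvstore(err_rec.error)
    out = lzma.decompress(kv[list(kv.keys())[0]].data).decode()
    print("-----------")
    # convert escaped sequences (such as a literal "\n") into real characters
    error_typ = bytes(str(out), 'utf-8').decode("unicode_escape")
    error_types.append(error_typ)
    print(error_typ, err_rec.id)
```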
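
With many failed jobs, printing every message gets hard to read. One way to get an overview is to group the messages in `error_types` by their last non-empty line, which in a Python traceback usually names the exception. The sketch below is self-contained; the function name `summarize_errors` and the sample messages are made up for illustration and are not real QCFractal output.

```python
from collections import Counter


def summarize_errors(error_messages):
    """Count error messages by their last non-empty line."""
    last_lines = []
    for message in error_messages:
        lines = [line.strip() for line in message.splitlines() if line.strip()]
        last_lines.append(lines[-1] if lines else "<empty>")
    return Counter(last_lines)


# Made-up messages standing in for the contents of error_types
sample_messages = [
    "Traceback (most recent call last):\n  ...\nMemoryError: out of memory",
    "Traceback (most recent call last):\n  ...\nConvergenceError: not converged",
    "Traceback (most recent call last):\n  ...\nMemoryError: out of memory",
]
for error, count in summarize_errors(sample_messages).most_common():
    print(count, error)
```

A dominant memory-related entry would point to the Slurm configuration, while scattered convergence failures are more likely specific to individual molecules.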