This is part of the supporting information for the paper  
*ParAMS: Parameter Fitting for Atomistic and Molecular Models* (DOI: *123123*)  
The full documentation can be found at https://www.scm.com/doc.trunk/params/index.html

# Preprocessing Examples
This notebook demonstrates how ParAMS can be used to generate and process training data, and how it can help with selecting suitable models and optimizers.

In [1]:
from scm.params import *
from scm.params import __version__
print(f"ParAMS version: {__version__}")

ParAMS version: 0.4.1


Load the job collection and training set:

In [2]:
jc = JobCollection('../data/jobcollection.yml')
ts = DataSet('../data/trainingset.yml')

## Modifying the Data Set
Let's look at how different subsets of the training set can be created.

In [3]:
print(f"Total training set size: {len(ts)}")
print(f"Number of jobs associated with the training set: {len(ts.jobids)}")
print("Available extractors:\n")
print('\n'.join(sorted(ts.extractors)))

Total training set size: 4875
Number of jobs associated with the training set: 231
Available extractors:

angles
charges
dihedral
distance
energy
forces
hessian
pes
pesscan_angle
pesscan_dihedral
pesscan_distance
stresstensor
vibfreq


### Generating random subsets:

In [4]:
ts1, ts2, ts3 = ts.split(0.2, 0.2, 0.6)
print(f"Subset sizes: {len(ts1)}, {len(ts2)}, {len(ts3)}")

Subset sizes: 975, 975, 2925
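To make the idea behind such a split concrete, here is a minimal pure-Python sketch. It is not the ParAMS implementation: `split_by_fractions` is a hypothetical helper, and we assume the subsets are disjoint and together cover the whole set, which is consistent with the sizes printed above.

```python
import random

def split_by_fractions(items, fractions, seed=0):
    """Shuffle `items` and cut the result into consecutive, disjoint chunks."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    # Cumulative boundaries make sure no entry is lost to rounding.
    bounds, cumulative = [0], 0.0
    for frac in fractions:
        cumulative += frac
        bounds.append(round(cumulative * n))
    bounds[-1] = n  # the last chunk always ends at the last entry
    return [shuffled[a:b] for a, b in zip(bounds, bounds[1:])]

entries = list(range(4875))  # stand-in for the 4875 training set entries
parts = split_by_fractions(entries, (0.2, 0.2, 0.6))
sizes = [len(p) for p in parts]
print(f"Subset sizes: {sizes}")
```

Fixing the seed makes the split reproducible, which is useful when a training/validation split has to be reported alongside the fitted parameters.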


### Filtering by property:

In [5]:
ts_energy   = ts.energy()
ts_distance = ts.distance()
print(f"Training set contains {len(ts_energy)} energy  entries")
print(f"Training set contains {len(ts_distance)} distance entries")

Training set contains 219 energy  entries
Training set contains 94 distance entries
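Conceptually, filtering by property means keeping only the entries whose expression calls a given extractor. The sketch below imitates this on plain strings of the form printed elsewhere in this notebook. It does not use the ParAMS API: the regular expression, the helper names and the `distance(...)` entry are illustrative assumptions.

```python
import re

# Matches an extractor call and its first argument, e.g. energy("hshBase")
CALL = re.compile(r'(\w+)\("([^"]+)"')

def extractors_of(expression):
    """Names of all extractors called in an expression string."""
    return {name for name, _ in CALL.findall(expression)}

def jobids_of(expression):
    """Job IDs referenced by an expression string."""
    return {jobid for _, jobid in CALL.findall(expression)}

def filter_by_extractor(expressions, extractor):
    """Keep only the expressions that call the given extractor."""
    return [e for e in expressions if extractor in extractors_of(e)]

expressions = [
    '+energy("hsh-SH1.15")/1-energy("hshBase")/1',  # as printed in this notebook
    '+energy("hsh-SH7.5")/1-energy("hshBase")/1',   # as printed in this notebook
    'distance("hshBase",0,1)',                      # hypothetical entry
]

energy_only = filter_by_extractor(expressions, "energy")
jobs = set().union(*(jobids_of(e) for e in energy_only))
print(len(energy_only), sorted(jobs))
```

Because several entries can share a job (here `hshBase`), a subset typically needs fewer jobs than it has entries.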


## Calculating Reference Values
Although the loaded training set already contains reference data provided by [Mueller and Hartke](https://doi.org/10.1021/acs.jctc.6b00461), this will not be the case when starting a parameterization from scratch.
In that case, reference values can be calculated with any AMS engine. For the sake of speed, we will use [MOPAC](http://openmopac.net/) to calculate our reference results.
In a real application, a higher level of theory would usually be a better source of reference data.

First, remove all reference values and mark MOPAC as the new source:

In [6]:
for entry in ts_energy:
    entry.reference = None               # discard the stored reference value
    entry.metadata['Source'] = 'MOPAC'   # record where the new values will come from
print(ts_energy[:5])

[---
Expression: +energy("hsh-SH1.15")/1-energy("hshBase")/1
Weight: 1
Unit: kcal/mol, 627.5096
Sigma: 1.0
Source: MOPAC
, ---
Expression: +energy("hsh-SH7.5")/1-energy("hshBase")/1
Weight: 1
Unit: kcal/mol, 627.5096
Sigma: 1.0
Source: MOPAC
, ---
Expression: +energy("hsh-SH1.05")/1-energy("hshBase")/1
Weight: 1
Unit: kcal/mol, 627.5096
Sigma: 1.0
Source: MOPAC
, ---
Expression: +energy("hsh-SH2.5")/1-energy("hshBase")/1
Weight: 1
Unit: kcal/mol, 627.5096
Sigma: 1.0
Source: MOPAC
, ---
Expression: +energy("hsh-SH1.7")/1-energy("hshBase")/1
Weight: 1
Unit: kcal/mol, 627.5096
Sigma: 1.0
Source: MOPAC
]


To run jobs, ParAMS uses the `Settings` class from the [PLAMS](https://github.com/SCM-NV/PLAMS) library.
A `Settings` object describes the input of an AMS calculation through a Python interface. Let us set up one that represents a calculation with MOPAC (the `_ipython_canary_method_should_not_exist_` entry in the output can be ignored):

In [7]:
from scm.plams import Settings
s = Settings()
s.input.mopac

_ipython_canary_method_should_not_exist_: 	<empty Settings>

The jobs can now be run with the newly created `Settings` instance:

In [9]:
results = jc.run(s, jobids=ts_energy.jobids) # We limit the calculation of jobs to the ones in our `ts_energy`
assert all(r.ok() for r in results.values())

All that is left to do is to tell the data set to extract the reference values in question from the `results` object and store them:

In [11]:
ts_energy.calculate_reference(results)
print(ts_energy[:5])

[---
Expression: +energy("hsh-SH1.15")/1-energy("hshBase")/1
Weight: 1
Unit: kcal/mol, 627.5096
ReferenceValue: 12.162557590982274
Sigma: 1.0
Source: MOPAC
, ---
Expression: +energy("hsh-SH7.5")/1-energy("hshBase")/1
Weight: 1
Unit: kcal/mol, 627.5096
ReferenceValue: 182.08253572794663
Sigma: 1.0
Source: MOPAC
, ---
Expression: +energy("hsh-SH1.05")/1-energy("hshBase")/1
Weight: 1
Unit: kcal/mol, 627.5096
ReferenceValue: 32.08357349265368
Sigma: 1.0
Source: MOPAC
, ---
Expression: +energy("hsh-SH2.5")/1-energy("hshBase")/1
Weight: 1
Unit: kcal/mol, 627.5096
ReferenceValue: 102.35434747486192
Sigma: 1.0
Source: MOPAC
, ---
Expression: +energy("hsh-SH1.7")/1-energy("hshBase")/1
Weight: 1
Unit: kcal/mol, 627.5096
ReferenceValue: 22.776021077786087
Sigma: 1.0
Source: MOPAC
]
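Each entry above is an energy difference between two jobs. The sketch below shows how such a value could be assembled by hand, assuming that the number after the unit name (627.5096) is the factor that converts the engine's atomic units (hartree) to kcal/mol. The two job energies are made-up numbers, not MOPAC results.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5096  # factor listed in the "Unit" field above

# Hypothetical total energies in hartree (illustrative values only)
energies = {
    "hsh-SH1.15": -11.50,
    "hshBase":    -11.52,
}

def relative_energy(job, base, energies, factor=HARTREE_TO_KCAL_PER_MOL):
    """Evaluate +energy(job)/1 - energy(base)/1 and convert the unit."""
    return (energies[job] / 1 - energies[base] / 1) * factor

value = relative_energy("hsh-SH1.15", "hshBase", energies)
print(f"Relative energy: {value:.4f} kcal/mol")
```

Storing the conversion factor with each entry keeps the reference values in a convenient unit while the engine results stay in atomic units.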


## Model Selection

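Different empirical models can be compared in much the same way as reference data is generated: the jobs are run with each model, and the results are passed to the data set.
The loss value of the data set then measures how well each model reproduces the reference values.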

In [13]:
# Compare the loss between MOPAC and MOPAC
fx = ts_energy.evaluate(results)
print(f"The MOPAC-MOPAC loss is {fx:.3e}. Amazing!")

The MOPAC-MOPAC loss is 0.000e+00. Amazing!


For a more realistic example, let's compare the original training set references to MOPAC and [UFF](https://pubs.acs.org/doi/10.1021/ja00051a040):

In [15]:
ts_energy = ts.energy() # Restore the subset's reference from the parent

# UFF:
s = Settings()
s.input.forcefield
results_uff = jc.run(s, jobids=ts_energy.jobids)
assert all(r.ok() for r in results_uff.values())

# Calculate the loss f(x) for both models
fx_mopac = ts_energy.evaluate(results)
fx_uff   = ts_energy.evaluate(results_uff)

print(f"The MH-MOPAC loss is {fx_mopac:.3e}")
print(f"The MH-UFF   loss is {fx_uff:.3e}")

The MH-MOPAC loss is 1.476e+05
The MH-UFF   loss is 1.466e+06


By default, the loss function $L$ that is evaluated when `.evaluate()` is called is the [sum of squared errors](https://en.wikipedia.org/wiki/Residual_sum_of_squares) (also known as the residual sum of squares).
This can be easily changed by providing the `loss` argument to `.evaluate()`:

In [16]:
fx_mopac = ts_energy.evaluate(results,     loss='rmse')
fx_uff   = ts_energy.evaluate(results_uff, loss='rmse')

print(f"The MH-MOPAC RMSE loss is {fx_mopac:.3e}")
print(f"The MH-UFF   RMSE loss is {fx_uff:.3e}")

The MH-MOPAC RMSE loss is 2.596e+01
The MH-UFF   RMSE loss is 8.183e+01
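The two loss values are closely related. The sketch below uses the textbook definitions of the sum of squared errors and the root-mean-square error on made-up numbers. It ignores per-entry weights and sigmas (all entries shown above have a weight of 1 and a sigma of 1.0), so it is an illustration rather than the ParAMS implementation. As a sanity check, it also converts the sum-of-squared-errors values printed above into an RMSE, using the 219 energy entries.

```python
import math

def sse(reference, predicted):
    """Sum of squared errors between two equally long sequences."""
    return sum((r - p) ** 2 for r, p in zip(reference, predicted))

def rmse(reference, predicted):
    """Root-mean-square error: square root of the mean squared error."""
    return math.sqrt(sse(reference, predicted) / len(reference))

# Made-up values, for illustration only
ref  = [1.0, 2.0, 4.0]
pred = [1.0, 3.0, 2.0]
print(f"SSE  = {sse(ref, pred):.3f}")
print(f"RMSE = {rmse(ref, pred):.3f}")
print(f"Identical data gives SSE = {sse(ref, ref):.3f}")

# Consistency check with the numbers printed in this notebook:
n_energy = 219  # number of energy entries in the training set
rmse_mopac_from_sse = math.sqrt(1.476e5 / n_energy)
rmse_uff_from_sse   = math.sqrt(1.466e6 / n_energy)
print(f"MOPAC: {rmse_mopac_from_sse:.2f}, UFF: {rmse_uff_from_sse:.2f}")
```

Both conversions agree with the RMSE values printed above to within the rounding of the reported losses. Unlike the sum of squared errors, the RMSE does not grow with the number of entries and has the same unit as the data (here kcal/mol), which makes it easier to compare across data sets of different size.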


## Optimizer Selection
todo