# Creating a basic Cluster Expansion

In [48]:
import random
import numpy as np
from monty.serialization import loadfn, dumpfn
from pymatgen.core.structure import Structure
from smol.cofe import ClusterSubspace, StructureWrangler, ClusterExpansion, RegressionData

In [49]:
# load the prim structure
prim_path = '/Users/myless/Packages/structure_maker/Entries/v1_6cr1_6ti1_6w1_6zr_prim_struct_dos.json'
lno_prim = Structure.from_file(prim_path)

# load the fitting data
entry_path = '/Users/myless/Packages/structure_maker/Entries/vcrtiwzr_entries.json'
lno_entries = loadfn(entry_path)

### 0) The prim structure
The prim structure defines the **configurational space** for the Cluster Expansion.
The **configurational space** is defined by the site **compositional spaces** and the crystal symmetries of the prim structure.
The occupancy of each site determines its **compositional space**. Sites are **active** if they have compositional degrees of freedom.


Active sites have fractional compositions. Vacancies are allowed in sites where the composition does not sum to one.

In the prim structure used in this notebook, both sites are active and no vacancies are allowed:

0. Is active. The allowed species are: Zr, Ti, V, Cr and W.
1. Is active. The allowed species are: Zr, Ti, V, Cr and W.
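
The rule relating occupancies to allowed species can be sketched in plain Python. This is only an illustration of the idea, not how `smol` or `pymatgen` implement it: the helper names `allowed_species` and `is_active` and the tolerance are assumptions, the V-rich occupancies are the rounded values printed for the prim in this notebook, and the half-occupied and fully occupied sites are hypothetical examples.

```python
# Occupancies as printed for the two-site prim (rounded to three decimals).
bcc_site = {"Zr": 0.016, "Ti": 0.016, "V": 0.938, "Cr": 0.016, "W": 0.016}
# Hypothetical half-occupied site, included only to show the vacancy case.
half_filled_site = {"Li+": 0.5}
# Hypothetical fully occupied single-species site: no degrees of freedom.
fixed_site = {"O2-": 1.0}


def allowed_species(occupancy, tol=0.01):
    """Return the species a site may host, adding a vacancy if needed."""
    species = sorted(occupancy)
    # A total occupancy clearly below one leaves room for vacancies.
    if sum(occupancy.values()) < 1 - tol:
        species.append("Vacancy")
    return species


def is_active(occupancy, tol=0.01):
    """A site is active when more than one species (or a vacancy) is allowed."""
    return len(allowed_species(occupancy, tol)) > 1


prim_sites = [bcc_site, bcc_site]
n_active = sum(is_active(site) for site in prim_sites)

# Upper bound on the configurations of a 64-site supercell, ignoring
# symmetry and any composition constraint.
n_configs = len(allowed_species(bcc_site)) ** 64

print(f"active sites in prim: {n_active}")
print(f"allowed on half-filled site: {allowed_species(half_filled_site)}")
print(f"configurations of a 64-site supercell: {n_configs:.3e}")
```

The count of configurations grows exponentially with supercell size, which is why a fitted model is used instead of computing every configuration directly.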

In [50]:
print(lno_prim)

Full Formula (Zr0.03124 Ti0.03124 V1.875 Cr0.03124 W0.03124)
Reduced Formula: Zr0.03124Ti0.03124V1.875Cr0.03124W0.03124
abc   :   3.010000   3.010000   3.010000
angles:  90.000000  90.000000  90.000000
pbc   :       True       True       True
Sites (2)
  #  SP                                                a    b    c
---  ----------------------------------------------  ---  ---  ---
  0  Zr:0.016, Ti:0.016, V:0.938, Cr:0.016, W:0.016  0    0    0
  1  Zr:0.016, Ti:0.016, V:0.938, Cr:0.016, W:0.016  0.5  0.5  0.5


In [28]:
dos_prim_path = '/Users/myless/Packages/structure_maker/Entries/v1_6cr1_6ti1_6w1_6zr_dos_atom.cif'
#lno_prim = loadfn(prim_path)
dos_lno_prim = Structure.from_file(dos_prim_path)
print(dos_lno_prim)

Full Formula (Zr0.40624 Ti0.40624 V0.375 Cr0.40624 W0.40624)
Reduced Formula: Zr0.40624Ti0.40624V0.375Cr0.40624W0.40624
abc   :   3.010000   3.010000   3.010000
angles:  90.000000  90.000000  90.000000
pbc   :       True       True       True
Sites (2)
  #  SP                                                a    b    c
---  ----------------------------------------------  ---  ---  ---
  0  Zr:0.203, Ti:0.203, V:0.188, Cr:0.203, W:0.203  0    0    0
  1  Zr:0.203, Ti:0.203, V:0.188, Cr:0.203, W:0.203  0.5  0.5  0.5


In [22]:
print(lno_entries[5].structure)

Full Formula (Zr3 Ti4 V54 Cr2 W1)
Reduced Formula: Zr3Ti4V54Cr2W
abc   :  10.426946  10.426946  10.426946
angles: 109.471221 109.471221 109.471221
pbc   :       True       True       True
Sites (64)
  #  SP           a         b         c
---  ----  --------  --------  --------
  0  Zr    0.747789  0.505238  0.998582
  1  Zr    0.748675  0.995196  0.744172
  2  Zr    0.004     0.251744  0.757563
  3  Ti    0.492537  0.750033  0.245792
  4  Ti    0.004746  0.506798  0.250713
  5  Ti    0.248151  0.749428  0.25362
  6  Ti    0.249438  0.252882  0.747868
  7  V     0.7481    0.998085  0.999193
  8  V     0.49802   0.995576  0.993446
  9  V     0.750516  0.249375  0.99747
 10  V     0.747525  0.003099  0.246838
 11  V     0.252457  0.002326  0.008767
 12  V     0.497079  0.25159   0.999444
 13  V     0.494628  0.992056  0.246885
 14  V     0.752804  0.258033  0.252946
 15  V     0.742702  0.990747  0.489547
 16  V     0.997308  0.996602  0.003284
 17  V     0.24659   0.247662  0.998459
 18

### 1) The cluster subspace
The `ClusterSubspace` represents all the orbits (groups of equivalent clusters) that will be considered when fitting the cluster expansion. Its main purpose is to compute the **correlation functions** for each included orbit given a structure in the compositional space defined by the prim.

To compute the correlation functions, the given structure must match the prim structure in a "crystallographic" sense, while allowing for compositional degrees of freedom in the "active" sites.

A cluster subspace is most easily created by providing:
1. The prim structure representing the configurational space.
2. A set of diameter cutoffs for each size of orbit we want to consider.
3. A type of site basis function to use.

The code offers more options for fine-tuning the subspace. See other notebooks for advanced use cases.
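
To get a feel for what a pair diameter cutoff selects, the neighbor shells of an ideal BCC lattice can be enumerated directly. The sketch below is independent of `smol`; it assumes the cubic lattice parameter of 3.01 Å printed for the prim and a 7 Å pair cutoff, and simply counts lattice points by distance from the origin.

```python
import itertools
import math

a = 3.01      # cubic lattice parameter in angstrom, as printed for the prim
cutoff = 7.0  # pair diameter cutoff in angstrom (assumed for this sketch)

# BCC lattice points are cube corners plus body centers (in units of a).
shells = {}
n = 3  # scanning +/- 3 cells is enough to cover a 7 angstrom sphere
for i, j, k in itertools.product(range(-n, n + 1), repeat=3):
    for offset in (0.0, 0.5):
        x, y, z = i + offset, j + offset, k + offset
        distance = a * math.sqrt(x * x + y * y + z * z)
        if 0 < distance <= cutoff:
            key = round(distance, 4)
            shells[key] = shells.get(key, 0) + 1

print("distance (A)   neighbors")
for distance in sorted(shells):
    print(f"{distance:10.4f}   {shells[distance]:5d}")
```

The shortest distances produced this way (2.6067, 3.0100, 4.2568, ...) are the same values that appear as pair cluster diameters in the orbit summary, and the largest shell inside the cutoff sits near 6.73 Å, which is the pair cutoff the subspace reports.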

In [51]:
subspace = ClusterSubspace.from_cutoffs(
    lno_prim,
    # cutoffs are maximum cluster diameters keyed by cluster size;
    # this choice includes orbits of 2 and 3 sites.
    cutoffs={2: 7, 3: 6},
    basis='sinusoid', # sets the site basis type, default is indicator
    supercell_size='num_sites'
)

# supercell_size specifies the method to determine the supercell size
# when trying to match a structure.
# (See pymatgen.analysis.structure_matcher.StructureMatcher for more info)

print(subspace) # single site and empty orbits are always included.

Basis/Orthogonal/Orthonormal : sinusoid/True/False
       Unit Cell Composition : Cr0.03124 Ti0.03124 W0.03124 Zr0.03124 V1.875
            Number of Orbits : 23
No. of Correlation Functions : 681
             Cluster Cutoffs : 2: 6.73, 3: 5.21
              External Terms : []
Orbit Summary
 ------------------------------------------------------------------------
 |  ID     Degree    Cluster Diameter    Multiplicity    No. Functions  |
 |   0       0             NA                 0                1        |
 |   1       1            0.0000              2                4        |
 |   2       2            2.6067              8               10        |
 |   3       2            3.0100              6               10        |
 |   4       2            4.2568              12              10        |
 |   5       2            4.9915              24              10        |
 |   6       2            5.2135              8               10        |
 |   7       2            6.0200         

#### 1.1) Computing a correlation vector.
The correlation vector of a specific structure is its feature vector: it is used to train the cluster expansion and to predict target values.
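
As a toy illustration of what a correlation vector contains, consider a periodic one-dimensional binary chain with site variables of +1 and -1. This is a deliberate simplification and an assumption of this sketch: the notebook uses a sinusoid basis on a five-component BCC lattice, and the function name `toy_correlations` is hypothetical.

```python
def toy_correlations(spins, max_neighbor=2):
    """Correlation vector of a periodic 1D chain of +1/-1 site variables.

    Entries are: empty cluster, point cluster, then one pair cluster
    per neighbor distance up to max_neighbor.
    """
    n = len(spins)
    corr = [1.0]                 # the empty cluster is always 1
    corr.append(sum(spins) / n)  # point term: average site variable
    for r in range(1, max_neighbor + 1):
        # average product over all symmetrically equivalent pairs
        pair_sum = sum(spins[i] * spins[(i + r) % n] for i in range(n))
        corr.append(pair_sum / n)
    return corr


ordered = [1, -1] * 4  # perfectly alternating chain
uniform = [1] * 8      # single-species chain

print(toy_correlations(ordered))
print(toy_correlations(uniform))
```

Each entry is an average of basis function products over all equivalent clusters, so the vector has a fixed length regardless of the size of the structure.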

In [52]:
#from pymatgen.io.cif import CifWriter
structure = lno_entries[1].structure
#writer = CifWriter(structure)
#writer.write_file('testfor_poscar.cif')

#print(structure)
corr = subspace.corr_from_structure(structure)

print(f'The correlation vector for a structure with'
      f' composition {structure.composition} is: '
      f'\n{corr}')

The correlation vector for a structure with composition Zr3 Ti4 V47 Cr5 W5 is: 
[ 1.00000000e+00  5.66995793e-01 -3.70873814e-01 -1.84183293e-01
  6.33314983e-01  3.08007956e-01 -2.27934447e-01 -1.08398437e-01
  3.62085068e-01  1.51152532e-01  7.91306018e-02 -2.33651634e-01
  4.16014193e-02 -1.25468957e-01  3.95722468e-01  3.47182571e-01
 -2.07320919e-01 -1.00260417e-01  3.62334753e-01  1.26485512e-01
  6.01512809e-02 -2.24189107e-01 -1.38492378e-02 -1.07595003e-01
  3.94347822e-01  3.08482884e-01 -1.99013782e-01 -1.19791667e-01
  3.36509974e-01  1.23998754e-01  7.09777787e-02 -2.24189107e-01
  6.39129491e-02 -1.12729096e-01  3.77303329e-01  3.00875850e-01
 -2.07118920e-01 -9.94466146e-02  3.56407399e-01  1.32717900e-01
  6.50981103e-02 -2.35471351e-01  3.08298796e-02 -1.13459936e-01
  4.02763870e-01  3.10313307e-01 -2.18646786e-01 -8.88671875e-02
  3.35577136e-01  1.61555341e-01  5.70077083e-02 -2.46753595e-01
  2.33567966e-04 -1.19728867e-01  3.75554034e-01  4.17260254e-01
 -2.141663

### 2) The structure wrangler
The `StructureWrangler` is a class used to create and organize the data that will be used to train (and possibly test) the cluster expansion. It makes sure that all the supplied structures appropriately match the prim structure, and obtains the information needed to correctly normalize target properties (such as energy) for training.

Training data is added to a `StructureWrangler` using `ComputedStructureEntry` instances from `pymatgen`.

Matching relaxed structures can be a tricky problem, especially for ionic systems with vacancies. See the notebook on structure matching for tips on how to tweak parameters.
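
When several structures map onto the same correlation vector, the wrangler warns about duplicates. A minimal, library-independent sketch of how such duplicates could be found on a plain feature matrix follows; the helper `find_duplicate_rows`, the rounding precision, and the toy matrix are all assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicate_rows(feature_matrix, decimals=9):
    """Group row indices whose features agree after rounding."""
    groups = defaultdict(list)
    for index, row in enumerate(feature_matrix):
        key = tuple(round(value, decimals) for value in row)
        groups[key].append(index)
    # keep only groups that contain more than one row
    return [indices for indices in groups.values() if len(indices) > 1]


# Hypothetical feature matrix: rows 0 and 2 are identical.
toy_features = [
    [1.0, 0.50, -0.25],
    [1.0, 0.25, 0.10],
    [1.0, 0.50, -0.25],
    [1.0, 0.00, 0.30],
]

duplicate_groups = find_duplicate_rows(toy_features)
# keep the first member of each group and drop the rest
to_drop = {i for group in duplicate_groups for i in group[1:]}

print(duplicate_groups)
print(sorted(to_drop))
```

Which member of a duplicate group to keep is a modelling decision; keeping the lowest-energy structure is a common choice, but that is not something this sketch decides.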

In [53]:
# filter out the duplicates flagged by the StructureWrangler warnings
duplicates = {4, 37, 26, 45, 80, 100, 22, 21, 116, 120, 20, 15, 130, 133, 137, 142, 28, 12, 7, 1, 27, 19, 215, 17, 222, 226}
# remove these entries from lno_entries
filtered_lno_entries = [entry for i, entry in enumerate(lno_entries) if i not in duplicates]

In [54]:
wrangler = StructureWrangler(subspace)

# the energy is taken directly from the ComputedStructureEntry
# any additional properties can also be added, see notebook on
# training data preparation for an example.
for entry in filtered_lno_entries:
    wrangler.add_entry(entry, verbose=True)
# The verbose flag will print structures that fail to match.

print(f'\nTotal structures that match {wrangler.num_structures}/{len(filtered_lno_entries)}')

 Index 4 - Zr3 Ti4 V47 Cr5 W5 energy=-586.59943751
Index 28 - Zr3 Ti4 V47 Cr5 W5 energy=-586.75713089
 Consider adding more terms to the clustersubspace or filtering duplicates.
 Index 2 - Zr3 Ti4 V54 Cr2 W1 energy=-568.5767083
Index 32 - Zr3 Ti4 V54 Cr2 W1 energy=-568.57691724
 Consider adding more terms to the clustersubspace or filtering duplicates.
 Index 14 - Zr1 Ti3 V53 Cr6 W1 energy=-575.29416208
Index 188 - Zr1 Ti3 V53 Cr6 W1 energy=-575.29682178
 Consider adding more terms to the clustersubspace or filtering duplicates.



Total structures that match 204/204


### 3) Training

Training a cluster expansion is one of the most critical steps. This is how you get **effective cluster interactions (ECI's)**. To do so you need an estimator class that implements some form of regression model. In this case we will use simple least squares regression using the `LinearRegression` estimator from `scikit-learn`.

In `smol` the coefficients from the fit are not exactly the ECI's but the ECI times the multiplicity of their orbit.
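
The relation between fitted coefficients and ECIs can be written out for a handful of orbits. The numbers below are hypothetical, and the sketch is a simplification: it divides each coefficient by its orbit multiplicity only, ignores any additional per-function factors a library may apply, and takes the multiplicity of the empty cluster as 1 so that the constant term is unchanged.

```python
# Hypothetical fitted coefficients, one per correlation function.
coefs = [-9.0, 0.4, 0.16, -0.06]
# Orbit multiplicity for each function: empty cluster (taken as 1 here),
# point cluster, then two pair orbits.
multiplicities = [1, 2, 8, 6]

# ECI = coefficient / multiplicity of the orbit it belongs to.
eci = [c / m for c, m in zip(coefs, multiplicities)]

for c, m, e in zip(coefs, multiplicities, eci):
    print(f"coef {c:8.3f}  multiplicity {m:2d}  ECI {e:8.4f}")
```

Comparing ECIs rather than raw coefficients removes the effect of how many equivalent clusters each orbit has, which makes interaction strengths of different orbits comparable.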

In [57]:
from sklearn.linear_model import LinearRegression
# Set fit_intercept to False because we already do this using
# the empty cluster.
estimator = LinearRegression(fit_intercept=False)
estimator.fit(wrangler.feature_matrix, wrangler.get_property_vector('energy'))
coefs = estimator.coef_

#### 3.1) Check the quality of the fit
There are many ways to evaluate the quality of a fit. The simplest involve standard training set prediction error metrics. But when evaluating a CE more seriously we need to consider further metrics and how the CE will be used.
Here we will just look at the in-sample root mean squared error (RMSE) and max error.
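
For reference, both metrics are simple to write out by hand. The sketch below uses made-up energies and predictions; the function names `rmse_score` and `max_abs_error` are chosen for this example and are not part of any library.

```python
import math


def rmse_score(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def max_abs_error(y_true, y_pred):
    """Largest absolute residual over the data set."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical energies and predictions in eV/prim.
energies = [-18.0, -17.5, -17.0, -16.5]
predicted = [-18.001, -17.5, -16.998, -16.5]

print(f"RMSE {1e3 * rmse_score(energies, predicted):.3f} meV/prim")
print(f"MAX  {1e3 * max_abs_error(energies, predicted):.3f} meV/prim")
```

The RMSE summarizes the typical error, while the max error exposes the single worst structure; an in-sample RMSE alone says nothing about how the model performs on structures it has not seen.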

In [58]:
from sklearn.metrics import mean_squared_error, max_error

train_predictions = np.dot(wrangler.feature_matrix, coefs)

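# take the square root explicitly; the `squared` argument of
# mean_squared_error is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(
    wrangler.get_property_vector('energy'), train_predictions
))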
maxer = max_error(wrangler.get_property_vector('energy'), train_predictions)

print(f'RMSE {1E3 * rmse} meV/prim')
print(f'MAX {1E3 * maxer} meV/prim')

RMSE 0.24358002079547741 meV/prim
MAX 2.4676490422521624 meV/prim




### 4) The cluster expansion
Now we can use the above work to create the `ClusterExpansion`. The cluster expansion can be used to predict the fitted property for new structures, either for testing quality or for simulations such as in Monte Carlo.
Note that when using the `predict` function, the cluster expansion will have to match the given structure if it has not seen it before.
We will also store the details of the regression model used to fit the cluster expansion by using a `RegressionData` object.
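
Conceptually, a normalized prediction is the dot product of a correlation vector with the fitted coefficients, and the total energy of a supercell follows by multiplying with the number of prim cells it contains. The sketch below illustrates both steps in plain Python: the correlation vector and coefficients are hypothetical, while the per-prim energy, the 2-site prim, and the 64-site supercell are taken from the outputs shown in this notebook.

```python
def predict_normalized(corr, coefs):
    """Energy per prim cell as the dot product of features and coefficients."""
    return sum(x * c for x, c in zip(corr, coefs))


# Hypothetical three-term expansion, for illustration only.
toy_corr = [1.0, 0.5, -0.25]
toy_coefs = [-18.0, 0.1, 0.2]
toy_energy = predict_normalized(toy_corr, toy_coefs)

# Converting a normalized prediction to a supercell total.
prim_sites = 2                         # sites in the prim cell
supercell_sites = 64                   # sites in the training structures
energy_per_prim = -17.977993924788162  # eV/prim, as printed by the notebook
n_prims = supercell_sites // prim_sites
total_energy = energy_per_prim * n_prims

print(f"toy prediction: {toy_energy:.3f} eV/prim")
print(f"supercell total: {total_energy:.4f} eV for {n_prims} prim cells")
```

The resulting total of roughly -575.30 eV is of the same magnitude as the energies the wrangler reports for structures of composition Zr1 Ti3 V53 Cr6 W1, which is a quick sanity check on the normalization.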

In [59]:
reg_data = RegressionData.from_sklearn(
    estimator, wrangler.feature_matrix,
    wrangler.get_property_vector('energy')
)


expansion = ClusterExpansion(
    subspace, coefficients=coefs, regression_data=reg_data
)

structure = random.choice(wrangler.structures)
prediction = expansion.predict(structure, normalized=True)

print(
    f'The predicted energy for a structure with composition '
    f'{structure.composition} is {prediction} eV/prim.\n'
)
print(f'The fitted coefficients are:\n{expansion.coefs}\n')
print(f'The effective cluster interactions are:\n{expansion.eci}\n')
print(expansion)

The predicted energy for a structure with composition Zr1 Ti3 V53 Cr6 W1 is -17.977993924788162 eV/prim.

The fitted coefficients are:
[ 7.23489324e+08 -8.95777515e+08 -6.96237965e+06 -5.34381972e+07
 -1.32586356e+09  1.90074947e+09  4.41618696e+08  7.70734921e+08
 -1.43651724e+09 -1.29388397e+08  3.49401269e+08 -1.04535397e+09
 -2.95531403e+08 -1.80049155e+09  2.34386234e+08 -5.55153424e+08
 -1.14354644e+08  6.39771906e+08  8.72366808e+08  6.50922186e+08
  1.01053369e+08  1.21627556e+09 -1.00755863e+08 -2.41914035e+08
 -3.77168980e+08  6.90992394e+08 -2.45854249e+08  1.94712825e+08
 -4.08056780e+08 -3.79064835e+08  5.62790177e+07 -3.83404956e+08
  1.37457563e+08  1.13344419e+08  7.01303890e+08  4.86611465e+08
  1.33827784e+09 -2.02079605e+08 -3.38271241e+08  5.03759036e+08
  1.51566381e+08  3.57329452e+08 -5.29400676e+08 -1.18857474e+08
  3.66088047e+08  1.58106182e+08 -2.02812401e+08  5.11242701e+07
 -3.19427305e+08 -1.43423031e+08  9.49369787e+07 -8.74871347e+07
 -1.10648038e+07  7.

### 5) Saving your work
All core classes in `smol` are `MSONables` and so can be saved using their `as_dict` methods or better yet with `monty.serialization.dumpfn`.

Currently there is also a convenience function in `smol` that will save all of your work in a standardized way. Work saved with the `save_work` function is stored as a dictionary with standardized names for the classes. Since a workflow should only contain one of each core class, the function will complain if you give it two objects of the same class (e.g. two wranglers).
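
The `as_dict` / `from_dict` pattern behind this kind of serialization can be illustrated with the standard library alone. The class `ToyExpansion` below is hypothetical and only mimics the pattern; it is not a `smol` or `monty` class, and the `"@class"` key is merely written in the style that `monty` uses.

```python
import json


class ToyExpansion:
    """Hypothetical object that round-trips through a plain dictionary."""

    def __init__(self, coefs, cutoffs):
        self.coefs = list(coefs)
        self.cutoffs = dict(cutoffs)

    def as_dict(self):
        # JSON object keys must be strings, so integer keys are converted.
        return {
            "@class": type(self).__name__,
            "coefs": self.coefs,
            "cutoffs": {str(k): v for k, v in self.cutoffs.items()},
        }

    @classmethod
    def from_dict(cls, d):
        cutoffs = {int(k): v for k, v in d["cutoffs"].items()}
        return cls(d["coefs"], cutoffs)


original = ToyExpansion([-9.0, 0.2], {2: 7, 3: 6})
text = json.dumps(original.as_dict())  # serialize to a JSON string
restored = ToyExpansion.from_dict(json.loads(text))

print(text)
print(restored.coefs, restored.cutoffs)
```

Storing the class name alongside the data is what lets a loader decide which `from_dict` to call when reading a file that contains several kinds of objects.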

In [47]:
from smol.io import save_work
import os 

#file_path = 'v4cr4ti_fin_work.mson'
expansions_path = '/Users/myless/Packages/structure_maker/Expansions'
file_path = 'un_fixed_vcrtizrw_fin_work.mson'
# we can save the subspace as well, but since both the wrangler
# and the expansion have it, there is no need to do so.
save_work(os.path.join(expansions_path,file_path), wrangler, expansion)

#### 5.1) Loading previously saved work

In [12]:
from smol.io import load_work

# load from the same full path that was passed to save_work
work = load_work(os.path.join(expansions_path, file_path))
for name, obj in work.items():
    print(f'{name}: {type(obj)}\n')

StructureWrangler: <class 'smol.cofe.wrangling.wrangler.StructureWrangler'>

ClusterExpansion: <class 'smol.cofe.expansion.ClusterExpansion'>

