### Program written by Pablo Sánchez-Palencia, 2022
Built on code previously written by Scott Midgley

Scope: Ingest VASP energies from a .csv file and generate many-body tensor representation (MBTR) descriptors from POSCAR structure files. The output is saved as a .pkl file, ready for machine learning models.

In [1]:
### USER INPUT REQUIRED ###

# Windows path
repodir = r'C:\Users\pablo\OneDrive\Documentos\GitHub\GeSn2N4_ML'
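
The cells below build file paths by concatenating strings with hard-coded separators. As a minimal sketch of an alternative, the same paths can be composed with `pathlib`. `PureWindowsPath` is used here only so the example behaves identically on any operating system; on the actual machine `pathlib.Path` would be the natural choice. The variable names `data_dir`, `energies_csv` and `structures_dir` are illustrative and do not appear in the notebook.

```python
from pathlib import PureWindowsPath

# Same repository root as in the cell above.
repodir = PureWindowsPath(r"C:\Users\pablo\OneDrive\Documentos\GitHub\GeSn2N4_ML")

# The "/" operator joins path components with the correct separator.
data_dir = repodir / "repository_data"
energies_csv = data_dir / "vasp-energies.csv"
structures_dir = data_dir / "structure_files"

print(energies_csv)
print(structures_dir)
```

Path objects can be passed directly to `pd.read_csv`, `open` and `os.listdir`.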

In [2]:
# Import modules.
import pandas as pd
import os
import numpy as np
from dscribe.descriptors import MBTR
from ase.io import read
from numpy.linalg import eig
import time
import matplotlib.pyplot as plt

In [3]:
# Read DFT derived energies from .csv file to data frame.
energies = pd.read_csv(repodir + "\\repository_data\\vasp-energies.csv", header=None)
energies.columns = ['tag','inv','SCF', 'BGE']

In [4]:
# Read the list of structure indices that have DFT data (one index per line).
with open(repodir + '\\repository_data\\gga_structures_list.txt', "r") as obj_file:
    file_check = obj_file.read().splitlines()

In [5]:
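# Select the structure files whose positions in the directory listing appear in file_check.
# Note: os.listdir returns entries in arbitrary order.
dirs = os.listdir(repodir + '\\repository_data\\structure_files')
files = [dirs[int(idx)] for idx in file_check]  # idx avoids shadowing the built-in str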
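
Because `os.listdir` does not guarantee any ordering, selecting files by position can give different results on different machines. The sketch below shows one way to make the selection reproducible by sorting the listing first. It assumes that the indices in the list file refer to the alphabetically sorted listing, which the notebook does not state; the helper `select_by_index` and the file names are hypothetical.

```python
import os
import tempfile


def select_by_index(directory, indices):
    """Return the file names at the given positions of the sorted listing."""
    names = sorted(os.listdir(directory))
    return [names[int(i)] for i in indices]


# Demonstration on a temporary directory with made-up file names,
# created out of order on purpose.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("POSCAR_002", "POSCAR_000", "POSCAR_001"):
        open(os.path.join(tmp, name), "w").close()
    selected = select_by_index(tmp, ["0", "2"])

print(selected)
```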

In [6]:
energies['tag']=file_check

In [7]:
mbtr_list = []
start=time.time()
for i,f in enumerate(files):
    struct = read(repodir + '/repository_data/structure_files/'+f)
    mbtr = MBTR(
        species=["Sn", "Ge", "N"],
        # k1 term: distribution of atomic numbers.
        k1={
            "geometry": {"function": "atomic_number"},
            "grid": {"min": 0, "max": 50, "n": 20, "sigma": 0.1},
        },
        # k2 term: distribution of inverse pair distances with exponential weighting.
        k2={
            "geometry": {"function": "inverse_distance"},
            "grid": {"min": 0, "max": 1, "n": 100, "sigma": 0.1},
            "weighting": {"function": "exp", "scale": 0.5, "threshold": 1e-3},
        },
        periodic=True,
        normalization="l2_each",
        flatten=True,
        sparse=False,
    )
    
    
    #k3={
    #    "geometry": {"function": "cosine"},
    #    "grid": {"min": -1, "max": 1, "n": 100, "sigma": 0.1},
    #    "weighting": {"function": "exp", "scale": 0.5, "threshold": 1e-3},
    #},
        
    fitted_tensor = mbtr.create([struct])
    mbtr_list.append(fitted_tensor[0])
    #plt.plot(fitted_tensor[0])
    #print(fitted_tensor[0].shape)
    if i%100==0: print("ITER CHECKER: Structure",str(i).zfill(4)," charged")
print('Number of matrices read: ', len(mbtr_list))
print(f"Runtime of the program is {(time.time() - start)/60} minutes")

ITER CHECKER: Structure 0000  charged
ITER CHECKER: Structure 0100  charged
ITER CHECKER: Structure 0200  charged
ITER CHECKER: Structure 0300  charged
ITER CHECKER: Structure 0400  charged
ITER CHECKER: Structure 0500  charged
ITER CHECKER: Structure 0600  charged
ITER CHECKER: Structure 0700  charged
ITER CHECKER: Structure 0800  charged
ITER CHECKER: Structure 0900  charged
ITER CHECKER: Structure 1000  charged
Number of matrices read:  1013
Runtime of the program is 20.333827590942384 minutes
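
To give a feel for what the `k2` term configured above encodes, the following is a simplified illustration, not DScribe's implementation. Each pair distance r contributes a Gaussian of width `sigma` centred at 1/r on the grid, weighted by exp(-scale * r). The sketch ignores periodic images, the separate channels per pair of species, the weighting threshold and the `l2_each` normalization. It uses only the standard library so that it runs without DScribe; the function name `k2_spectrum` and the example distances are hypothetical.

```python
import math


def k2_spectrum(distances, grid_min=0.0, grid_max=1.0, n=100, sigma=0.1, scale=0.5):
    """Gaussian-broadened, exponentially weighted distribution of 1/r values."""
    step = (grid_max - grid_min) / (n - 1)
    grid = [grid_min + j * step for j in range(n)]
    spectrum = [0.0] * n
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    for r in distances:
        centre = 1.0 / r                # geometry function: inverse distance
        weight = math.exp(-scale * r)   # exponential weighting: far pairs count less
        for j, x in enumerate(grid):
            spectrum[j] += weight * norm * math.exp(-0.5 * ((x - centre) / sigma) ** 2)
    return grid, spectrum


# Made-up pair distances (in angstrom): two pairs at 2.0 and one at 4.0.
grid, spectrum = k2_spectrum([2.0, 2.0, 4.0])
peak = grid[spectrum.index(max(spectrum))]
print(len(spectrum), round(peak, 2))
```

The grid defaults mirror the `min`, `max`, `n` and `sigma` values of the `k2` block in the cell above, so the output has 100 points, with the largest peak near 1/r of the most common short distance.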


In [8]:
# Add MBTR descriptors to the data frame with DFT energies.
# .copy() makes ener an independent frame, which avoids pandas' SettingWithCopyWarning.
ener = energies.iloc[:len(mbtr_list)].copy()
ener["MBTR"] = mbtr_list


In [9]:
# Shuffle data frame (optional).
ener = ener.sample(frac=1,random_state=38)

In [10]:
# Save data frame to .pkl file.
ener.to_pickle('../input_data_mbtr.pkl')
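
Since each row stores its MBTR vector as a single array in one column, a downstream model will usually need the vectors stacked into a 2D feature matrix. The sketch below shows this on a toy data frame with made-up values rather than on the saved file. Using `SCF` as the regression target is an assumption made for illustration; the notebook does not say which column is the target.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the saved data frame; all values are invented.
toy = pd.DataFrame({
    "tag": ["0", "1", "2"],
    "SCF": [-1.0, -2.0, -3.0],
    "MBTR": [np.zeros(4), np.ones(4), np.full(4, 2.0)],
})

# Stack the per-structure vectors into an (n_structures, n_features) matrix.
X = np.vstack(toy["MBTR"].tolist())
y = toy["SCF"].to_numpy()

print(X.shape, y.shape)
```

With the real file, `toy` would be replaced by the result of `pd.read_pickle` on the saved .pkl file.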