### Program written by Pablo Sánchez-Palencia, 2022
Built on code previously written by Scott Midgley

Scope: To ingest VASP energies from a .csv file and generate sine matrix eigenspectra from POSCAR structure files. The output is saved as a .pkl file, ready for machine learning models.

In [2]:
# Import modules.
import pandas as pd
import numpy as np
from pymatgen.io.ase import AseAtomsAdaptor as AAA
from matminer.featurizers import structure as sf
from dscribe.descriptors import SineMatrix
from ase.io import read
import time
import os

In [3]:
# Read DFT derived energies from .csv file to data frame.
energies = pd.read_csv('../../repository_data/vasp-energies.csv', header=None)
energies.columns = ['tag','inv','SCF', 'BGE']
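
As a small illustration of the header-less read above, the sketch below parses an in-memory CSV. The two sample rows are made up for the example; only the four column names come from this notebook. Passing `names=` to `read_csv` has the same effect as assigning `energies.columns` afterwards.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real values live in vasp-energies.csv.
sample_csv = io.StringIO("1,0.25,-310.5,1.20\n2,0.50,-311.0,1.35\n")

# header=None tells pandas the first row is data; names= labels the columns.
sample_energies = pd.read_csv(
    sample_csv, header=None, names=["tag", "inv", "SCF", "BGE"]
)
print(sample_energies.dtypes)
```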

In [4]:
# Read the list of structures that have DFT data.
with open('../../repository_data/gga_structures_list.txt', "r") as obj_file:
    file_check = obj_file.read().splitlines()

In [5]:
# Select the structure files that have DFT data; the descriptors are generated in a later cell.
dirs = os.listdir('../../repository_data/structure_files')
files = [dirs[int(idx)] for idx in file_check]
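
`os.listdir` returns entries in arbitrary order, so indexing into its result by position can pick different files on different machines. The sketch below sorts the listing first. It assumes the indices in `gga_structures_list.txt` refer to the alphabetically sorted directory listing, which this notebook does not confirm, so check that before adopting it. `select_files` and the file names are hypothetical.

```python
import os
import tempfile

def select_files(directory, indices):
    """Return file names picked by position from a sorted directory listing."""
    names = sorted(os.listdir(directory))  # same order on every platform
    return [names[int(i)] for i in indices]

# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["POSCAR_b", "POSCAR_a", "POSCAR_c"]:
        open(os.path.join(tmp, name), "w").close()
    picked = select_files(tmp, ["0", "2"])

print(picked)
```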

In [6]:
energies['tag']=file_check

In [7]:
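# Featurize each structure with the DScribe and matminer sine matrix implementations.
sm_dscribe_list = []
sm_matminer_list = []
sm_ds = SineMatrix(n_atoms_max=56, permutation="eigenspectrum")
sm_mm = sf.SineCoulombMatrix()
start_time = time.time()
for i, f in enumerate(files):
    # Read the structure and enforce periodic boundaries in all three directions.
    struct = read('../../repository_data/structure_files/' + f)
    struct.set_pbc([True, True, True])

    # DScribe: eigenspectrum of the sine matrix; keep only the real part.
    dscribe_matrix = sm_ds.create([struct])
    dscribe_matrix = np.real(dscribe_matrix)
    sm_dscribe_list.append(dscribe_matrix)

    # matminer: convert the ASE Atoms object to a pymatgen Structure, then featurize.
    struct = AAA.get_structure(struct)
    matminer_matrix = sm_mm.fit([struct])
    featurized_structure = matminer_matrix.featurize(struct)
    # Store the matminer features sorted in descending order.
    sm_matminer_list.append(np.sort(featurized_structure)[::-1])

    # Progress report every 200 structures.
    if i % 200 == 0:
        print("ITER CHECKER: Structure", str(i).zfill(4), " charged")

print('Number of matrices read: ', len(sm_dscribe_list))
print("--- %s minutes ---" % ((time.time() - start_time)/60))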


ITER CHECKER: Structure 0000  charged
ITER CHECKER: Structure 0200  charged
ITER CHECKER: Structure 0400  charged
ITER CHECKER: Structure 0600  charged
ITER CHECKER: Structure 0800  charged
ITER CHECKER: Structure 1000  charged
Number of matrices read:  1013
--- 4.161735506852468 minutes ---
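
To show what an eigenspectrum descriptor of this kind contains, here is a minimal NumPy sketch of a sine matrix eigenspectrum. It assumes the usual definition: diagonal entries 0.5·Z_i^2.4 and off-diagonal entries Z_i·Z_j divided by the norm of B·sin²(π·B⁻¹·(r_i − r_j)), with eigenvalues sorted by decreasing absolute value and zero-padded. The function name `sine_matrix_eigenspectrum` and the two-atom test cell are hypothetical, and the sketch has not been checked against the DScribe or matminer output produced above.

```python
import numpy as np

def sine_matrix_eigenspectrum(numbers, positions, cell, n_atoms_max):
    """Eigenvalues of the sine matrix, sorted by decreasing magnitude and zero-padded."""
    numbers = np.asarray(numbers, dtype=float)
    positions = np.asarray(positions, dtype=float)
    cell = np.asarray(cell, dtype=float)  # rows are lattice vectors (ASE convention)
    inv_cell = np.linalg.inv(cell)
    n = len(numbers)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i, j] = 0.5 * numbers[i] ** 2.4
            else:
                # Fractional separation, then the periodic sine-squared term.
                frac = (positions[i] - positions[j]) @ inv_cell
                vec = (np.sin(np.pi * frac) ** 2) @ cell
                matrix[i, j] = numbers[i] * numbers[j] / np.linalg.norm(vec)
    eigs = np.linalg.eigvalsh(matrix)        # the matrix is real and symmetric
    eigs = eigs[np.argsort(-np.abs(eigs))]   # largest magnitude first
    padded = np.zeros(n_atoms_max)
    padded[:n] = eigs
    return padded

# Two hydrogen-like atoms (Z = 1) half a lattice vector apart in a unit cubic cell.
spectrum = sine_matrix_eigenspectrum(
    numbers=[1, 1],
    positions=[[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]],
    cell=np.eye(3),
    n_atoms_max=4,
)
print(spectrum)  # approximately [1.5, -0.5, 0, 0]
```

Padding to a fixed `n_atoms_max` gives every structure a feature vector of the same length, which is what lets the results be stacked into one array later.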


In [8]:
# Add the sine matrix eigenspectra to the data frame with the DFT energies.
# .copy() makes ener an independent frame and avoids pandas' SettingWithCopyWarning.
ener = energies.iloc[:len(sm_dscribe_list)].copy()
ener["Sine_ds"] = sm_dscribe_list
ener["Sine_mm"] = sm_matminer_list


In [9]:
# Shuffle data frame (optional).
ener = ener.sample(frac=1,random_state=38)
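
The shuffle above is a seeded permutation of the rows: `frac=1` keeps every row, and a fixed `random_state` reproduces the same order on each run. A small sketch on a made-up frame (the `toy` data is hypothetical) shows both properties, and shows that the original index is kept, so the original order can be restored with `sort_index()`.

```python
import pandas as pd

# Hypothetical stand-in for the energies frame.
toy = pd.DataFrame({"tag": ["a", "b", "c", "d", "e"],
                    "SCF": [1.0, 2.0, 3.0, 4.0, 5.0]})

shuffled_once = toy.sample(frac=1, random_state=38)
shuffled_again = toy.sample(frac=1, random_state=38)

# Same seed, same permutation.
same_order = shuffled_once.index.equals(shuffled_again.index)

# The index labels travel with the rows, so the shuffle is reversible.
restored = shuffled_once.sort_index()
print(same_order, restored.equals(toy))
```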

In [10]:
# Save data frame to .pkl file.
ener.to_pickle('../inputalt_data_sm.pkl')
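
A downstream model will typically need the per-structure vectors as one 2D array. The sketch below shows one possible way to read a pickle like the one saved above and stack an array-valued column. The toy frame, the file name `toy_features.pkl`, and the choice of `SCF` as the target are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame: three structures, each with a length-4 feature vector.
toy = pd.DataFrame({"SCF": [-1.0, -2.0, -3.0]})
toy["Sine_ds"] = [np.full(4, float(k)) for k in range(3)]

# Round-trip through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_features.pkl")
    toy.to_pickle(path)
    loaded = pd.read_pickle(path)

# Stack the per-row arrays into a (n_structures, n_features) matrix.
X = np.vstack(loaded["Sine_ds"].to_list())
y = loaded["SCF"].to_numpy()
print(X.shape, y.shape)
```

`np.vstack` also accepts rows of shape `(1, n_features)`, so the same call should work if each stored entry is a one-row 2D array.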