### Program written by Scott Midgley, 2021
Built on code previously written by Scott Midgley.

Scope: Ingest VASP energies from a .csv file and generate Ewald sum matrix eigenspectra from the POSCAR structure files. The output is saved as a .pkl file, ready for machine learning models.

In [1]:
### USER INPUT REQUIRED ###

# Windows path
repodir = r'C:\Users\pablo\OneDrive\Documentos\GitHub\GeSn2N4_ML'
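
The cells below build file paths by concatenating `repodir` with hard-coded separators (backslashes in some places, forward slashes in others). As a minimal sketch of a more portable alternative, the paths could be derived from one root with `pathlib`. The helper name `repo_paths` and the dictionary keys are hypothetical and not part of the original notebook; the file and folder names are the ones used in the later cells.

```python
from pathlib import Path


def repo_paths(repodir):
    """Return the data paths used by this notebook, built from one root.

    Hypothetical helper (an assumption, not the author's code): pathlib
    inserts the correct separator for the current operating system.
    """
    data = Path(repodir) / "repository_data"
    return {
        "energies": data / "vasp-energies.csv",
        "structure_list": data / "gga_structures_list.txt",
        "structures": data / "structure_files",
    }


# Illustrative root; replace with the real repository location.
paths = repo_paths("GeSn2N4_ML")
print(paths["energies"])
```

With this approach only the value passed to `repo_paths` needs to change between machines.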

In [2]:
# Import modules.
import pandas as pd
import os
import numpy as np
from pymatgen.io.ase import AseAtomsAdaptor as AAA
from pymatgen.analysis.ewald import EwaldSummation
from dscribe.descriptors import EwaldSumMatrix
from numpy.linalg import eig
from ase.io import read
import time

In [3]:
# Read DFT-derived energies from the .csv file into a data frame.
energies = pd.read_csv(repodir + "\\repository_data\\vasp-energies.csv", header=None)
energies.columns = ['tag','inv','SCF', 'BGE']

In [4]:
# Read the list of structure indices (one per line) for which DFT data is available.
with open(repodir + '\\repository_data\\gga_structures_list.txt', "r") as obj_file:
    file_check = obj_file.read().splitlines()

In [5]:
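# Select the structure files named in file_check by their position in the directory listing.
# Note: os.listdir() does not guarantee any ordering, so this selection depends on the listing order.
dirs = os.listdir(repodir + '\\repository_data\\structure_files')
files = [dirs[int(idx)] for idx in file_check]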
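
Because `os.listdir` returns entries in arbitrary order, indexing into its result can pick different files on different machines. A minimal sketch of a deterministic variant is shown below. It assumes the indices in `gga_structures_list.txt` refer to the alphabetically sorted listing, which the notebook does not state; the function name `select_by_index` and the toy file names are hypothetical.

```python
def select_by_index(names, index_strings):
    """Pick entries from the sorted listing using indices given as strings.

    Assumption: the indices were generated against a sorted listing.
    """
    ordered = sorted(names)
    return [ordered[int(s)] for s in index_strings]


# Toy listing in arbitrary order, standing in for os.listdir(...).
listing = ["POSCAR_2", "POSCAR_0", "POSCAR_1"]
chosen = select_by_index(listing, ["2", "0"])
print(chosen)
```

If the indices were in fact produced against an unsorted listing, sorting would select different files, so this should be checked against the original data before use.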

In [6]:
# Tag each energy row with its structure index.
energies['tag'] = file_check

In [7]:
# Eigenspectra from DScribe (em_dscribe_list) and from pymatgen's EwaldSummation (em_matminer_list).
em_dscribe_list = []
em_matminer_list = []
# Formal oxidation states assigned to each element for the pymatgen Ewald sum.
oss = {'Sn': 4, 'Ge': 4, 'N': -3}
em_ds = EwaldSumMatrix(n_atoms_max=56, permutation="eigenspectrum", flatten=True)
start_time = time.time()
for i, f in enumerate(files):
    struct = read(repodir + '/repository_data/structure_files/' + f)
    struct.set_pbc([True, True, True])
    # DScribe: Ewald sum matrix eigenspectrum.
    dscribe_matrix = em_ds.create([struct])
    dscribe_matrix = np.real(dscribe_matrix)
    em_dscribe_list.append(dscribe_matrix)

    # pymatgen: eigenvalues of the Ewald total energy matrix, sorted in ascending order.
    struct = AAA.get_structure(struct)
    struct.add_oxidation_state_by_element(oss)
    ewald = EwaldSummation(struct)
    matminer_matrix = ewald.total_energy_matrix
    matminer_matrix, _ = eig(matminer_matrix)  # eigenvectors are not needed
    matminer_matrix = np.real(matminer_matrix)
    matminer_matrix = np.sort(matminer_matrix)
    em_matminer_list.append(matminer_matrix)
    
    if i%200==0: print("ITER CHECKER: Structure",str(i).zfill(4)," charged")
    
print('Number of matrices read: ', len(em_dscribe_list))
print("--- %s minutes ---" % ((time.time() - start_time)/60))



ITER CHECKER: Structure 0000  charged
ITER CHECKER: Structure 0200  charged
ITER CHECKER: Structure 0400  charged
ITER CHECKER: Structure 0600  charged
ITER CHECKER: Structure 0800  charged
ITER CHECKER: Structure 1000  charged
Number of matrices read:  1013
--- 4.629091723759969 minutes ---
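
The pymatgen branch of the loop above reduces each energy matrix to a sorted list of eigenvalues. The sketch below illustrates that reduction on a toy matrix using NumPy only. It is not the notebook's exact procedure: it assumes the matrix is real and symmetric, so `numpy.linalg.eigvalsh` can be used in place of `eig` followed by `np.real` and `np.sort`, and it zero-pads to a fixed length so that structures with fewer atoms would still give equal-length vectors. The function name `sorted_eigenspectrum` and the parameter `n_max` are hypothetical.

```python
import numpy as np


def sorted_eigenspectrum(matrix, n_max):
    """Return the ascending eigenvalues of a symmetric matrix, zero-padded to n_max.

    Assumption: `matrix` is real and symmetric, so eigvalsh applies and
    already returns real eigenvalues in ascending order.
    """
    eigvals = np.linalg.eigvalsh(matrix)
    padded = np.zeros(n_max)
    padded[: eigvals.size] = eigvals
    return padded


# Toy 2x2 symmetric matrix standing in for an Ewald energy matrix.
toy = np.array([[2.0, 1.0],
                [1.0, 2.0]])
spectrum = sorted_eigenspectrum(toy, n_max=3)
print(spectrum)
```

The eigenspectrum does not depend on the order in which atoms are listed, which is why it is a convenient fixed-length input for machine learning models.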


In [8]:
# Add the Ewald matrix eigenspectra to the data frame with the DFT energies.
# .copy() makes ener an independent frame, which avoids pandas' SettingWithCopyWarning.
ener = energies.iloc[:len(em_dscribe_list)].copy()
ener["Ewald_ds"] = em_dscribe_list
ener["Ewald_mm"] = em_matminer_list
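
Assigning new columns to a slice of a data frame can trigger pandas' `SettingWithCopyWarning`, because pandas cannot tell whether the slice is a view or a copy of the parent frame. A minimal sketch of the usual remedy, taking an explicit `.copy()` before adding columns, is shown below on toy data. The frame contents and the name `subset` are illustrative and not taken from the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the energies frame and the list of eigenspectra.
energies = pd.DataFrame({"tag": ["a", "b", "c"], "SCF": [-1.0, -2.0, -3.0]})
spectra = [np.array([0.1, 0.2]), np.array([0.3, 0.4])]

# An explicit copy makes the subset independent of the parent frame,
# so adding columns neither warns nor touches `energies`.
subset = energies.iloc[: len(spectra)].copy()
subset["Ewald_ds"] = spectra

print(subset)
```

Each cell of the new column holds a whole NumPy array, which is the same layout the notebook pickles for later use.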


In [9]:
# Shuffle data frame (optional).
ener = ener.sample(frac=1,random_state=38)

In [10]:
# Save data frame to .pkl file.
ener.to_pickle('../input_data_em.pkl')
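
As a hedged sketch of how a downstream script might consume the saved file, the example below writes a toy frame with `to_pickle`, reads it back with `pandas.read_pickle`, and stacks the per-row arrays into a feature matrix. The toy data, the temporary path, and the names `X` and `y` are assumptions made for illustration. The real file stores the DScribe arrays as returned by `create`, so their shape may need flattening first.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with one array per row, mimicking the layout of the saved data.
frame = pd.DataFrame({"tag": ["a", "b"], "SCF": [-1.0, -2.0]})
frame["Ewald_ds"] = [np.array([0.1, 0.2]), np.array([0.3, 0.4])]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input_data_em.pkl")
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# Stack the per-row arrays into a 2D feature matrix; energies are the target.
X = np.vstack(list(restored["Ewald_ds"]))
y = restored["SCF"].to_numpy()
print(X.shape, y.shape)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.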