### <p style="font-family: Arial; color: gold; font-weight: bold;">**Created by Tom Tan on 8.30.2024** </p>
##### The idea is to create one notebook for each prefix. You only need to define the prefix in the following cell, and the notebook will do the rest.

***
# **1. Imports**

In [1]:
import sys
import pandas as pd

import get_properties_functions_for_WI as gp

import importlib

***
# **2. Import the atom map from preprocess notebook**
### <p style="font-family: Arial; color: gold; font-weight: bold;"> **!!!User input required: change the prefix to the one you want to handle.** </p>

In [2]:
prefix = "pyrd"

file_name = prefix + "_atom_map.xlsx"

atom_map_df = pd.read_excel(
    file_name, "Sheet1", index_col=0, header=0, engine="openpyxl"
)

display(atom_map_df.head())

df = atom_map_df.copy()  # df is what properties will be appended to; .copy() keeps the original atom map unchanged

Unnamed: 0,log_name,C3,C4,C5,N1,C6,C7,C1,C2
0,pyrd10_conf-1,C10,C5,C6,N7,C8,C9,C4,C5
1,pyrd10_conf-10,C10,C5,C6,N7,C8,C9,C4,C5
2,pyrd10_conf-11,C10,C5,C6,N7,C8,C9,C4,C5
3,pyrd10_conf-12,C10,C9,C8,N7,C6,C5,C4,C5
4,pyrd10_conf-13,C10,C5,C6,N7,C8,C9,C4,C5
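
Plain assignment such as `df = atom_map_df` binds a second name to the same DataFrame, whereas `DataFrame.copy()` produces an independent object. A minimal sketch with a made-up two-row atom map (the names and values are illustrative only):

```python
import pandas as pd

# Hypothetical atom map used only for illustration
atom_map = pd.DataFrame({"log_name": ["a_conf-1", "a_conf-2"], "N1": ["N7", "N7"]})

alias = atom_map               # same object under a second name
independent = atom_map.copy()  # separate object with its own data

independent["dipole"] = [1.0, 2.0]  # does not touch atom_map
alias["polar"] = [3.0, 4.0]         # also shows up in atom_map

print(list(atom_map.columns))
```

If the original atom map has to stay untouched while properties are appended, the `.copy()` form is the one to use.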


In [3]:
# testing code for subprocess and goodvibes

importlib.reload(gp)
import subprocess

log_file = "pyrd1_conf-1.log"

# Construct command-line arguments for the new version of goodvibes
cmd_args = [
    sys.executable, "-m",
    "goodvibes", 
    log_file,
    "--spc", "link",
    "-t", str(298.15)
]

# Run the goodvibes command and capture the output
result = subprocess.run(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

# print the output
print(result.stdout)

   GoodVibes v3.2 2024/09/03 06:42:48
   Citation: Luchini, G.; Alegre-Requena, J. V.; Funes-Ardoiz, I.; Paton, R. S. F1000Research, 2020, 9, 291.
   GoodVibes version 3.2 DOI: 10.12688/f1000research.22758.1

o  Requested: --spc link -t 298.15 

   Temperature = 298.15 Kelvin   Pressure = 1 atm
   All energetic values below shown in Hartree unless otherwise specified.

   Using vibrational scale factor 1.0 for B3LYP/6-31G(d,p) level of theory

   Entropic quasi-harmonic treatment: frequency cut-off value of 100.0 wavenumbers will be applied.
   QS = Grimme: Using a mixture of RRHO and Free-rotor vibrational entropies.
   REF: Grimme, S. Chem. Eur. J. 2012, 18, 9955-9964

   Combining final single point energy with thermal corrections.

   Structure                                       E_SPC             E        ZPE         H_SPC        T.S     T.qh-S      G(T)_SPC   qh-G(T)_SPC
   *********************************************************************************************************
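
When a command run through `subprocess` prints nothing useful, the return code and stderr usually say why. The sketch below wraps `subprocess.run` in a small helper; `run_and_report` is a hypothetical name, and the demonstration call runs a trivial Python one-liner rather than GoodVibes so that it works without any log file.

```python
import subprocess
import sys


def run_and_report(cmd_args):
    """Run a command, return its stdout, and raise with stderr if it fails."""
    result = subprocess.run(cmd_args, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"command failed with exit code {result.returncode}:\n{result.stderr}"
        )
    return result.stdout


# Stand-in command; swap in the GoodVibes argument list to check a real run
output = run_and_report([sys.executable, "-c", "print('ok')"])
print(output)
```

Raising on a non-zero exit code keeps a failed run from silently producing an empty table further down the notebook.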

# **3. Define Properties to Collect**
### <p style="font-family: Arial; color: gold"> !!!User input required: edit or comment out the property blocks to select what you want to collect. </p>

In [4]:
# this box has functions to choose from
df = atom_map_df.copy()  # work on a copy so the atom map itself stays unchanged

# ---------------GoodVibes Energies---------------
# uses the GoodVibes 2021 Branch (Jupyter Notebook Compatible)
# calculates the quasi harmonic corrected G(T) and single point corrected G(T) as well as other thermodynamic properties
# inputs: dataframe, temperature
df = gp.get_goodvibes_e(df, 298.15)

# ---------------Frontier Orbitals-----------------
# E(HOMO), E(LUMO), mu(chemical potential or negative of molecular electronegativity), eta(hardness/softness), omega(electrophilicity index)
df = gp.get_frontierorbs(df)

# ---------------Polarizability--------------------
# Exact polarizability
df = gp.get_polarizability(df)

# ---------------Dipole----------------------------
# Total dipole moment magnitude in Debye
df = gp.get_dipole(df)

# ---------------Volume----------------------------
# Molar volume
# requires the Gaussian keyword = "volume" in the .com file
df = gp.get_volume(df)

# ---------------SASA------------------------------
# Uses morfeus to calculate the solvent accessible surface area and the volume under the SASA
df = gp.get_SASA(df)

# ---------------NBO-------------------------------
# natural charge from NBO
# requires the Gaussian keyword = "pop=nbo7" in the .com file
nbo_list = ["C1", "C2", "N1"]
df = gp.get_nbo(df, nbo_list)

# ---------------NMR-------------------------------
# isotropic NMR shift
# requires the Gaussian keyword = "nmr=giao" in the .com file
nmr_list = ["C1", "C2", "N1"]
df = gp.get_nmr(df, nmr_list)

# ---------------Distance--------------------------
# distance between 2 atoms
dist_list_of_lists = [["C1", "C2"]]
df = gp.get_distance(df, dist_list_of_lists)

# ---------------Angle-----------------------------
# angle between 3 atoms
# angle_list_of_lists = [["C5", "N1", "C1"]]
# df = gp.get_angles(df, angle_list_of_lists)

# ---------------Dihedral--------------------------
# dihedral angle between 4 atoms
# dihedral_list_of_lists = [["C4", "C5", "N1", "C1"], ["C2", "C1", "N1", "C5"]]
# df = gp.get_dihedral(df, dihedral_list_of_lists)

# ---------------Vbur Scan-------------------------
# uses morfeus to calculate the buried volume at a series of radii (including hydrogens)
# inputs: dataframe, list of atoms, start_radius, end_radius, and step_size
# if you only want a single radius, put the same value for start_radius and end_radius (keep step_size > 0)
vbur_list = ["C1", "C2"]
df = gp.get_vbur_scan(df, vbur_list, 2, 4, 0.5)

# ---------------Sterimol morfeus------------------
# uses morfeus to calculate Sterimol L, B1, and B5 values
# NOTE: this is much faster than the corresponding DBSTEP function (recommendation: use as default/if you don't need Sterimol2Vec)
sterimol_list_of_lists = [["C1", "C2"]]
df = gp.get_sterimol_morfeus(df, sterimol_list_of_lists)

# ---------------Buried Sterimol-------------------
# uses morfeus to calculate Sterimol L, B1, and B5 values within a given sphere of radius r_buried
# atoms outside the sphere + 0.5 vdW radius are deleted and the Sterimol vectors are calculated
# for more information: https://kjelljorner.github.io/morfeus/sterimol.html
# inputs: dataframe, list of atom pairs, r_buried
sterimol_list_of_lists = [["C1", "C2"]]
df = gp.get_buried_sterimol(df, sterimol_list_of_lists, 5.5)

# ---------------Sterimol DBSTEP-------------------
# uses DBSTEP to calculate Sterimol L, B1, and B5 values
# default grid point spacing (0.05 Angstrom) is used (can use custom spacing or vdw radii in the get_properties_functions script)
# more info here: https://github.com/patonlab/DBSTEP
# NOTE: this takes longer than the morfeus function (recommendation: only use this if you need Sterimol2Vec)
# sterimol_list_of_lists = [["N1", "C1"], ["N1", "C5"]]
# df = gp.get_sterimol_dbstep(df, sterimol_list_of_lists)

# ---------------Sterimol2Vec----------------------
# uses DBSTEP to calculate Sterimol Bmin and Bmax values at intervals from 0 to end_radius, with a given step_size
# default grid point spacing (0.05 Angstrom) is used (can use custom spacing or vdw radii in the get_properties_functions script)
# more info here: https://github.com/patonlab/DBSTEP
# inputs: dataframe, list of atom pairs, end_radius, and step_size
# sterimol2vec_list_of_lists = [["N1", "C1"], ["N1", "C5"]]
# df = gp.get_sterimol2vec(df, sterimol2vec_list_of_lists, 1, 1.0)

# ---------------Pyramidalization------------------
# uses morfeus to calculate pyramidalization based on the 3 atoms in closest proximity to the defined atom
# collects values based on two definitions of pyramidalization
# details on these values can be found here: https://kjelljorner.github.io/morfeus/pyramidalization.html
pyr_list = ["C1", "C2"]
df = gp.get_pyramidalization(df, pyr_list)

# ---------------Plane Angle-----------------------
# !plane angle between 2 planes (each defined by 3 atoms, 6 atoms in total)
# planeangle_list_of_lists = [["N1", "C1", "C5"], ["C2", "C3", "C4"]]
# df = gp.get_planeangle(df, planeangle_list_of_lists)

# --------------LP energy - custom from first cell---------------
lp_list = ["N1"]
df = gp.get_one_lp_energy(df, lp_list)

# ---------------Time----------------------------------
# returns the total CPU time and total Wall time (not per subjob) because we are pioneers
# if used in summary df, will give the average (not Boltzmann average) in the Boltzmann average column
# df = gp.get_time(df)

# ---------------ChelpG----------------------------
# ChelpG ESP charge
# requires the Gaussian keyword = "pop=chelpg" in the .com file
# a_list = ["C1", "C2", "C3", "C4", "C5", "N1"]
# df = gp.get_chelpg(df, a_list)

# ---------------Hirshfeld-------------------------
# Hirshfeld charge, CM5 charge, Hirshfeld atom dipole
# requires the Gaussian keyword = "pop=hirshfeld" in the .com file
# a_list = ["C1", "C2", "C3", "C4", "C5", "N1"]
# df = gp.get_hirshfeld(df, a_list)

pd.options.display.max_columns = None
display(df)

Goodvibes function has completed
Frontier orbitals function has completed
Polarizability function has completed
Dipole function has completed
Volume function has completed
SASA function has completed
NBO function has completed for ['C1', 'C2', 'N1']
NMR function has completed for ['C1', 'C2', 'N1']
Distance function has completed for [['C1', 'C2']]
Vbur scan function has completed for ['C1', 'C2'] from 2 to 4
Morfeus Sterimol function has completed for [['C1', 'C2']]
Morfeus Buried Sterimol function has completed for [['C1', 'C2']]
Pyramidalization function has completed for ['C1', 'C2']
{'NBO_LP_occupancy_N1': '1.91767', 'NBO_LP_energy_N1': '-0.37802'}
{'NBO_LP_occupancy_N1': '1.91692', 'NBO_LP_energy_N1': '-0.38179'}
{'NBO_LP_occupancy_N1': '1.91761', 'NBO_LP_energy_N1': '-0.38014'}
{'NBO_LP_occupancy_N1': '1.91605', 'NBO_LP_energy_N1': '-0.37270'}
{'NBO_LP_occupancy_N1': '1.91703', 'NBO_LP_energy_N1': '-0.37341'}
{'NBO_LP_occupancy_N1': '1.91721', 'NBO_LP_energy_N1': '-0.37952'}
{'N

Unnamed: 0,log_name,C3,C4,C5,N1,C6,C7,C1,C2,E_spc (Hartree),ZPE(Hartree),H_spc(Hartree),T*S,T*qh_S,G(T)_spc(Hartree),qh_G(T)_spc(Hartree),T,HOMO,LUMO,μ,η,ω,polar_iso(Debye),polar_aniso(Debye),dipole(Debye),volume(Bohr_radius³/mol),SASA_surface_area(Å²),SASA_volume(Å³),SASA_sphericity,NBO_charge_C1,NBO_charge_C2,NBO_charge_N1,NMR_shift_C1,NMR_shift_C2,NMR_shift_N1,distance_C1_C2(Å),%Vbur_C1_2.0Å,%Vbur_C2_2.0Å,%Vbur_C1_2.5Å,%Vbur_C2_2.5Å,%Vbur_C1_3.0Å,%Vbur_C2_3.0Å,%Vbur_C1_3.5Å,%Vbur_C2_3.5Å,%Vbur_C1_4.0Å,%Vbur_C2_4.0Å,Sterimol_L_C1_C2(Å)_morfeus,Sterimol_B1_C1_C2(Å)_morfeus,Sterimol_B5_C1_C2(Å)_morfeus,Buried_Sterimol_L_C1_C2_5.0(Å),Buried_Sterimol_B1_C1_C2_5.0(Å),Buried_Sterimol_B5_C1_C2_5.0(Å),pyramidalization_Gavrish_C1(°),pyramidalization_Agranat-Radhakrishnan_C1,pyramidalization_Gavrish_C2(°),pyramidalization_Agranat-Radhakrishnan_C2,NBO_LP_occupancy_N1,NBO_LP_energy_N1
0,pyrd10_conf-1,C10,C5,C6,N7,C8,C9,C4,C5,-2939.790206,0.165409,-2939.614581,0.046949,0.045717,-2939.66153,-2939.660298,298.15,-0.31256,-0.0024,-0.15748,0.31016,0.03998,115.274,54.7674,3.9559,1402.36,332.622291,490.744031,0.904557,-0.42677,-0.06744,-0.41008,153.8044,27.2832,-155.3399,1.50887,97.433497,97.507748,85.509181,90.837023,72.330341,81.398056,61.685846,69.576966,51.030352,56.381122,6.835629,1.7,5.086542,6.835629,1.7,5.086542,5.770135,0.785172,0.15096,0.02269814,1.91767,-0.37802
1,pyrd10_conf-10,C10,C5,C6,N7,C8,C9,C4,C5,-2939.788079,0.164826,-2939.612616,0.048883,0.047134,-2939.6615,-2939.65975,298.15,-0.31435,-0.00487,-0.15961,0.30948,0.04116,118.136,59.8668,2.5503,1556.174,351.152486,509.060557,0.878014,-0.41686,-0.05899,-0.40809,148.3617,26.2665,-156.0897,1.50796,97.766012,97.40767,85.478252,89.857078,71.117029,78.791716,59.283568,65.415347,48.236065,51.509062,6.831096,1.717514,5.110089,6.831096,1.717514,5.110089,5.815587,0.789785,0.186357,0.02802021,1.91692,-0.38179
2,pyrd10_conf-11,C10,C5,C6,N7,C8,C9,C4,C5,-2939.786373,0.164812,-2939.610889,0.049836,0.047669,-2939.660725,-2939.658558,298.15,-0.31568,-0.00515,-0.160415,0.31053,0.04143,117.866,77.3913,3.7713,1428.498,349.012508,505.510645,0.879286,-0.41693,-0.05802,-0.40802,151.4002,28.669,-155.5752,1.51518,97.979081,96.694215,84.934562,88.191822,70.535049,77.259945,58.923663,64.291372,47.998262,50.838713,6.850442,1.981456,4.877439,6.850442,1.981456,4.877439,6.120653,0.820322,0.0,2.126168e-12,1.91761,-0.38014
3,pyrd10_conf-12,C10,C9,C8,N7,C6,C5,C4,C5,-2939.785643,0.164912,-2939.610342,0.047464,0.045869,-2939.657806,-2939.656211,298.15,-0.30007,0.00107,-0.1495,0.30114,0.03711,112.983,40.5954,3.6925,1360.037,326.071908,486.759133,0.917727,-0.41876,-0.0605,-0.40999,148.2874,27.3206,-154.4481,1.50807,97.475465,98.217975,84.786431,93.078526,71.066746,84.550991,60.363283,72.598887,49.854525,59.057476,6.825855,1.801792,5.159819,6.825855,1.801792,5.159819,5.883747,0.796737,0.191586,0.02880821,1.91605,-0.3727
4,pyrd10_conf-13,C10,C5,C6,N7,C8,C9,C4,C5,-2939.785283,0.164883,-2939.609998,0.047559,0.045919,-2939.657556,-2939.655916,298.15,-0.30025,0.00406,-0.148095,0.30431,0.03604,112.885,40.264,3.8187,1385.241,326.441895,487.458728,0.917565,-0.41881,-0.06088,-0.4094,147.4085,27.0522,-153.2876,1.50877,97.478693,98.234117,84.762013,93.135499,71.07792,84.555647,60.332999,72.621599,49.84793,59.066787,6.825299,1.795265,5.163457,6.825299,1.795265,5.163457,5.881311,0.796492,0.18922,0.0284504,1.91703,-0.37341
5,pyrd10_conf-2,C10,C5,C6,N7,C8,C9,C4,C5,-2939.790837,0.165481,-2939.615161,0.046805,0.045663,-2939.661966,-2939.660825,298.15,-0.31278,-0.00182,-0.1573,0.31096,0.03979,115.041,51.3399,1.2501,1319.042,332.252283,490.460294,0.905216,-0.42638,-0.06855,-0.41074,154.3357,27.5887,-155.703,1.50874,97.40767,97.523889,85.496158,90.838651,72.271678,81.452063,61.641004,69.680628,50.983412,56.485476,6.833027,1.7,5.107027,6.833027,1.7,5.107027,5.77021,0.785193,0.147321,0.02214917,1.91721,-0.37952
6,pyrd10_conf-3,C10,C5,C6,N7,C8,C9,C4,C5,-2939.788707,0.165016,-2939.613186,0.048456,0.046701,-2939.661642,-2939.659887,298.15,-0.31572,-0.00873,-0.162225,0.30699,0.04286,116.186,40.996,2.6361,1259.341,343.092394,502.027671,0.890345,-0.4192,-0.06652,-0.40826,149.7514,26.9238,-155.2305,1.50939,97.449638,97.875775,84.522724,91.478383,70.390718,81.533075,58.835143,68.765724,47.924943,54.720765,6.835672,1.800451,6.040484,6.835672,1.800451,6.040484,5.844328,0.792699,0.151475,0.02277429,1.91672,-0.38334
7,pyrd10_conf-4,C10,C5,C6,N7,C8,C9,C4,C5,-2939.789215,0.164918,-2939.613811,0.048504,0.046649,-2939.662315,-2939.66046,298.15,-0.3121,-0.0021,-0.1571,0.31,0.03981,116.613,62.9578,1.1396,1376.219,345.432443,504.026205,0.886659,-0.42659,-0.05856,-0.41,150.1875,25.0232,-155.3274,1.50668,97.717588,97.452867,86.547728,89.821266,73.507338,78.769368,62.52446,65.618012,51.51876,52.32799,6.831488,1.708585,3.717173,6.831488,1.708585,3.717173,5.748912,0.782806,0.173134,0.02603178,1.91715,-0.37962
8,pyrd10_conf-5,C10,C5,C6,N7,C8,C9,C4,C5,-2939.788868,0.164879,-2939.613488,0.048621,0.046707,-2939.662109,-2939.660195,298.15,-0.312,-0.00211,-0.157055,0.30989,0.0398,116.685,66.1732,3.7145,1579.731,345.722444,504.417478,0.886373,-0.42652,-0.05861,-0.40865,149.8082,25.0692,-156.541,1.50684,97.746643,97.459323,86.645397,89.879867,73.613491,78.819651,62.587356,65.641889,51.568416,52.300059,6.829789,1.7,3.633848,6.829789,1.7,3.633848,5.746898,0.782583,0.181175,0.02723807,1.91728,-0.37916
9,pyrd10_conf-6,C10,C5,C6,N7,C8,C9,C4,C5,-2939.788363,0.165011,-2939.612842,0.048428,0.046707,-2939.66127,-2939.659549,298.15,-0.31553,-0.0072,-0.161365,0.30833,0.04223,116.244,41.7352,2.9331,1379.845,343.112399,502.232049,0.890534,-0.41901,-0.06679,-0.40647,148.8705,27.0598,-157.1935,1.50947,97.459323,97.685305,84.626905,91.240721,70.366508,81.394331,58.796706,68.68361,47.898175,54.673437,6.832713,1.740492,6.034927,6.832713,1.740492,6.034927,5.832236,0.791471,0.16269,0.02445633,1.91722,-0.38336


## 3.1 Save collected properties to Excel and pickle file

In [5]:
# save the pandas dataframe to a xlsx file
with pd.ExcelWriter(prefix + "_extracted_properties.xlsx") as writer:
    df.to_excel(writer)
# save the pandas dataframe to a pickle file
df.to_pickle(prefix + "_extracted_properties.pkl")
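
The pickle file keeps column dtypes intact, so it can be reloaded later without re-parsing a spreadsheet. A minimal round-trip sketch, using a throwaway frame and a temporary directory (both are stand-ins for the real data and paths):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Throwaway frame standing in for the extracted-properties table
demo = pd.DataFrame({"log_name": ["x_conf-1"], "dipole(Debye)": [3.9559]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_extracted_properties.pkl"
    demo.to_pickle(path)             # write the frame
    restored = pd.read_pickle(path)  # read it back

print(restored.equals(demo))
```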

# **4. Post-processing**

In [6]:
import re
import pandas as pd
import numpy as np
from tabulate import tabulate

In [7]:
# for numerically named compounds, prefix is any text common to all BEFORE the number and suffix is common to all AFTER the number
# this is a template for our files that are all named "AcXXX_clust-X.log" or "AcXXX_conf-X.log"
prefix = "pyrd"
suffix = "_"

# columns that provide atom mapping information are dropped; listing them is not needed if they contain cells that cannot be converted to float
# (which is the case for labels like C1, C2, C3, C4, C5, N1, but not for pure numbers like 1, 2, 3, 4, 5, 6)
atom_columns_to_drop = ["C3", "C4", "C5", "N1", "C1", "C2"]

# title of the column for the energy you want to use for boltzmann averaging and lowest E conformer determination
energy_col_header = "G(T)_spc(Hartree)"

### Option to import an Excel sheet if you're using properties or energies collected outside of this notebook

##### If you would like to use post-processing functionality (i.e. Boltzmann averaging, lowest E conformers, etc.) you can read in a dataframe with properties (e.g. QikProp properties) or energies (e.g. if you don't/can't run linked jobs) collected outside of this notebook. 

In [8]:
df = pd.read_excel(
    prefix + "_extracted_properties.xlsx",
    "Sheet1",
    index_col=0,
    header=0,
    engine="openpyxl",
)

display(df.head())

Unnamed: 0,log_name,C3,C4,C5,N1,C6,C7,C1,C2,E_spc (Hartree),ZPE(Hartree),H_spc(Hartree),T*S,T*qh_S,G(T)_spc(Hartree),qh_G(T)_spc(Hartree),T,HOMO,LUMO,μ,η,ω,polar_iso(Debye),polar_aniso(Debye),dipole(Debye),volume(Bohr_radius³/mol),SASA_surface_area(Å²),SASA_volume(Å³),SASA_sphericity,NBO_charge_C1,NBO_charge_C2,NBO_charge_N1,NMR_shift_C1,NMR_shift_C2,NMR_shift_N1,distance_C1_C2(Å),%Vbur_C1_2.0Å,%Vbur_C2_2.0Å,%Vbur_C1_2.5Å,%Vbur_C2_2.5Å,%Vbur_C1_3.0Å,%Vbur_C2_3.0Å,%Vbur_C1_3.5Å,%Vbur_C2_3.5Å,%Vbur_C1_4.0Å,%Vbur_C2_4.0Å,Sterimol_L_C1_C2(Å)_morfeus,Sterimol_B1_C1_C2(Å)_morfeus,Sterimol_B5_C1_C2(Å)_morfeus,Buried_Sterimol_L_C1_C2_5.0(Å),Buried_Sterimol_B1_C1_C2_5.0(Å),Buried_Sterimol_B5_C1_C2_5.0(Å),pyramidalization_Gavrish_C1(°),pyramidalization_Agranat-Radhakrishnan_C1,pyramidalization_Gavrish_C2(°),pyramidalization_Agranat-Radhakrishnan_C2,NBO_LP_occupancy_N1,NBO_LP_energy_N1
0,pyrd10_conf-1,C10,C5,C6,N7,C8,C9,C4,C5,-2939.790206,0.165409,-2939.614581,0.046949,0.045717,-2939.66153,-2939.660298,298.15,-0.31256,-0.0024,-0.15748,0.31016,0.03998,115.274,54.7674,3.9559,1402.36,332.622291,490.744031,0.904557,-0.42677,-0.06744,-0.41008,153.8044,27.2832,-155.3399,1.50887,97.433497,97.507748,85.509181,90.837023,72.330341,81.398056,61.685846,69.576966,51.030352,56.381122,6.835629,1.7,5.086542,6.835629,1.7,5.086542,5.770135,0.785172,0.15096,0.02269814,1.91767,-0.37802
1,pyrd10_conf-10,C10,C5,C6,N7,C8,C9,C4,C5,-2939.788079,0.164826,-2939.612616,0.048883,0.047134,-2939.6615,-2939.65975,298.15,-0.31435,-0.00487,-0.15961,0.30948,0.04116,118.136,59.8668,2.5503,1556.174,351.152486,509.060557,0.878014,-0.41686,-0.05899,-0.40809,148.3617,26.2665,-156.0897,1.50796,97.766012,97.40767,85.478252,89.857078,71.117029,78.791716,59.283568,65.415347,48.236065,51.509062,6.831096,1.717514,5.110089,6.831096,1.717514,5.110089,5.815587,0.789785,0.186357,0.02802021,1.91692,-0.38179
2,pyrd10_conf-11,C10,C5,C6,N7,C8,C9,C4,C5,-2939.786373,0.164812,-2939.610889,0.049836,0.047669,-2939.660725,-2939.658558,298.15,-0.31568,-0.00515,-0.160415,0.31053,0.04143,117.866,77.3913,3.7713,1428.498,349.012508,505.510645,0.879286,-0.41693,-0.05802,-0.40802,151.4002,28.669,-155.5752,1.51518,97.979081,96.694215,84.934562,88.191822,70.535049,77.259945,58.923663,64.291372,47.998262,50.838713,6.850442,1.981456,4.877439,6.850442,1.981456,4.877439,6.120653,0.820322,0.0,2.126168e-12,1.91761,-0.38014
3,pyrd10_conf-12,C10,C9,C8,N7,C6,C5,C4,C5,-2939.785643,0.164912,-2939.610342,0.047464,0.045869,-2939.657806,-2939.656211,298.15,-0.30007,0.00107,-0.1495,0.30114,0.03711,112.983,40.5954,3.6925,1360.037,326.071908,486.759133,0.917727,-0.41876,-0.0605,-0.40999,148.2874,27.3206,-154.4481,1.50807,97.475465,98.217975,84.786431,93.078526,71.066746,84.550991,60.363283,72.598887,49.854525,59.057476,6.825855,1.801792,5.159819,6.825855,1.801792,5.159819,5.883747,0.796737,0.191586,0.02880821,1.91605,-0.3727
4,pyrd10_conf-13,C10,C5,C6,N7,C8,C9,C4,C5,-2939.785283,0.164883,-2939.609998,0.047559,0.045919,-2939.657556,-2939.655916,298.15,-0.30025,0.00406,-0.148095,0.30431,0.03604,112.885,40.264,3.8187,1385.241,326.441895,487.458728,0.917565,-0.41881,-0.06088,-0.4094,147.4085,27.0522,-153.2876,1.50877,97.478693,98.234117,84.762013,93.135499,71.07792,84.555647,60.332999,72.621599,49.84793,59.066787,6.825299,1.795265,5.163457,6.825299,1.795265,5.163457,5.881311,0.796492,0.18922,0.0284504,1.91703,-0.37341


## 4.1 Generating a list of compounds that have conformational ensembles

**ONLY RUN THE AUTOMATED OR THE MANUAL CELL, NOT BOTH**

**AUTOMATED:** if your compounds are named consistently, this section generates your compound list from the shared naming structure

In [9]:
compound_list = []

for index, row in df.iterrows():
    log_file = row["log_name"]  # read file name from df
    prefix_and_compound = log_file.split(
        str(suffix)
    )  # splits to get "AcXXX" (entry 0); the "clust-X" part (entry 1) is not used
    compound = prefix_and_compound[0].split(
        str(prefix)
    )  # splits again to get "XXX" (entry 1); the empty string "" (entry 0) is not used
    compound_list.append(compound[1])

compound_list = list(
    set(compound_list)
)  # removes duplicate structures that result from having several conformers of each
compound_list.sort(
    key=lambda x: int(re.search(r"\d+", x).group())
)  # sorts by the first integer in each name (numeric, not alphabetical, order)
print(compound_list)

# this should generate a list that looks like this: ['24', '27', '34', '48']

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13']
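
The same list can be built with a single regular expression instead of two `split` calls. This is an alternative sketch, not the notebook's method; `extract_compound_ids` is a hypothetical helper, and it assumes purely numeric compound identifiers sitting between the prefix and the suffix.

```python
import re


def extract_compound_ids(log_names, prefix, suffix):
    """Return the unique numeric IDs found between prefix and suffix, sorted numerically."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)" + re.escape(suffix))
    ids = set()
    for name in log_names:
        match = pattern.match(name)
        if match:  # names that do not fit the pattern are skipped
            ids.add(match.group(1))
    return sorted(ids, key=int)


names = ["pyrd10_conf-1", "pyrd2_conf-3", "pyrd10_conf-12", "pyrd1_conf-1"]
print(extract_compound_ids(names, "pyrd", "_"))  # ['1', '2', '10']
```

Unlike the split-based loop, names that do not match the pattern are skipped rather than raising an `IndexError`.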


## 4.2 Post-processing to get properties for each compound

##### Changes made on 8/30/2024 <br> 1. Avoid the divide-by-zero error in the Boltzmann averaging: the original code had the order of the if/else blocks reversed, which caused the error. <br> 2. Clean the data by removing columns that contain cells that cannot be converted to float. <br> 3. Collect all values into a row before concatenating them into the final dataframe. The original modified individual cells, which fragmented the dataframe and raised a performance warning.
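
The data-cleaning step in item 2 can be sketched on its own: try to convert every column to numbers and drop the ones that fail. The frame below is invented for illustration, and `keep_numeric_columns` is a hypothetical helper name.

```python
import pandas as pd


def keep_numeric_columns(frame, keep=("log_name",)):
    """Convert columns to numeric where possible; drop those that cannot be converted."""
    cleaned = frame.copy()
    for column in frame.columns:
        if column in keep:
            continue  # label columns are kept as text
        try:
            cleaned[column] = pd.to_numeric(cleaned[column])
        except (ValueError, TypeError):
            cleaned = cleaned.drop(columns=column)
    return cleaned


demo = pd.DataFrame(
    {"log_name": ["x_conf-1", "x_conf-2"], "C6": ["C8", "C6"], "HOMO": ["-0.31", "-0.30"]}
)
print(list(keep_numeric_columns(demo).columns))  # ['log_name', 'HOMO']
```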

In [10]:
all_df_master = pd.DataFrame(columns=[])
properties_df_master = pd.DataFrame(columns=[])

for compound in compound_list:
    # defines the common start to all files using the input above
    substring = str(prefix) + str(compound) + str(suffix)

    # makes a data frame for one compound at a time for post-processing
    valuesdf = df[df["log_name"].str.startswith(substring)]
    valuesdf = valuesdf.drop(columns=atom_columns_to_drop)
    valuesdf = valuesdf.reset_index(
        drop=True
    )  # you must re-index otherwise the 2nd, 3rd, etc. compounds fail

    # try to convert every column except "log_name" to numbers; drop any column that cannot be converted
    for column in valuesdf:
        # exclude column "log_name"
        if column == "log_name":
            continue
        try:
            valuesdf[column] = pd.to_numeric(valuesdf[column])
        except (ValueError, TypeError):
            # the column holds non-numeric text (e.g. atom labels), so it is removed
            valuesdf = valuesdf.drop(columns=column)

    # define columns that won't be included in summary properties or are treated differently because they don't make sense to Boltzmann average
    non_boltz_columns = [
        "G(Hartree)",
        "∆G(Hartree)",
        "∆G(kcal/mol)",
        "e^(-∆G/RT)",
        "Mole Fraction",
    ]  # don't Boltzmann average columns containing these strings in the column label
    reg_avg_columns = [
        "CPU_time_total(hours)",
        "Wall_time_total(hours)",
    ]  # don't Boltzmann average these either; a plain average is taken in case that is helpful
    gv_extra_columns = [
        "E_spc (Hartree)",
        "H_spc(Hartree)",
        "T",
        "T*S",
        "T*qh_S",
        "ZPE(Hartree)",
        "qh_G(T)_spc(Hartree)",
        "G(T)_spc(Hartree)",
    ]
    gv_extra_columns.remove(str(energy_col_header))

    # calculate the summary properties based on all conformers (Boltzmann Average, Minimum, Maximum, Boltzmann Weighted Std)
    valuesdf["∆G(Hartree)"] = (
        valuesdf[energy_col_header] - valuesdf[energy_col_header].min()
    )
    valuesdf["∆G(kcal/mol)"] = valuesdf["∆G(Hartree)"] * 627.5
    valuesdf["e^(-∆G/RT)"] = np.exp(
        (valuesdf["∆G(kcal/mol)"] * -1000) / (1.987204 * 298.15)
    )  # R is in cal/(K*mol)
    valuesdf["Mole Fraction"] = valuesdf["e^(-∆G/RT)"] / valuesdf["e^(-∆G/RT)"].sum()
    values_boltz_row = []
    values_min_row = []
    values_max_row = []
    values_boltz_stdev_row = []
    values_range_row = []
    values_exclude_columns = []

    for column in valuesdf:
        if "log_name" in column:
            values_boltz_row.append("Boltzmann Averages")
            values_min_row.append("Ensemble Minimum")
            values_max_row.append("Ensemble Maximum")
            values_boltz_stdev_row.append("Boltzmann Standard Deviation")
            values_range_row.append("Ensemble Range")
            values_exclude_columns.append(column)  # used later to build final dataframe
        elif any(phrase in column for phrase in non_boltz_columns) or any(
            phrase in column for phrase in gv_extra_columns
        ):
            values_boltz_row.append("")
            values_min_row.append("")
            values_max_row.append("")
            values_boltz_stdev_row.append("")
            values_range_row.append("")
        elif any(phrase in column for phrase in reg_avg_columns):
            values_boltz_row.append(
                valuesdf[column].mean()
            )  # intended to print the average CPU/wall time in the boltz column
            values_min_row.append("")
            values_max_row.append("")
            values_boltz_stdev_row.append("")
            values_range_row.append("")
        else:
            valuesdf[column] = pd.to_numeric(
                valuesdf[column]
            )  # guards against the error where the float (Mole Fraction) cannot be multiplied by a string (property)
            values_boltz_row.append(
                (valuesdf[column] * valuesdf["Mole Fraction"]).sum()
            )
            values_min_row.append(valuesdf[column].min())
            values_max_row.append(valuesdf[column].max())
            values_range_row.append(valuesdf[column].max() - valuesdf[column].min())

            # this section generates the weighted std deviation (weighted by mole fraction)
            # formula: https://www.statology.org/weighted-standard-deviation-excel/

            boltz = (valuesdf[column] * valuesdf["Mole Fraction"]).sum()  # number
            delta_values_sq = []

            # makes a list of the "deviation" for each conformer
            for index, row in valuesdf.iterrows():
                value = row[column]
                delta_value_sq = (value - boltz) ** 2
                delta_values_sq.append(delta_value_sq)

            # w is list of weights (i.e. mole fractions)
            w = list(valuesdf["Mole Fraction"])
            # !swap the order here to avoid division by zero error
            if (
                len(w) == 1
            ):  # if there is only one conformer in the ensemble, set the weighted standard deviation to 0
                wstdev = 0
            else:
                # np.average(delta_values_sq, weights=w) is sum(delta_value_sq * mole fraction) / sum(mole fractions); the mole fractions sum to 1
                wstdev = np.sqrt(
                    (np.average(delta_values_sq, weights=w))
                    / (((len(w) - 1) / len(w)) * np.sum(w))
                )
            values_boltz_stdev_row.append(wstdev)

    valuesdf.loc[len(valuesdf)] = values_boltz_row
    valuesdf.loc[len(valuesdf)] = values_boltz_stdev_row
    valuesdf.loc[len(valuesdf)] = values_min_row
    valuesdf.loc[len(valuesdf)] = values_max_row
    valuesdf.loc[len(valuesdf)] = values_range_row

    # final output format is built here:
    explicit_order_front_columns = [
        "log_name",
        energy_col_header,
        "∆G(Hartree)",
        "∆G(kcal/mol)",
        "e^(-∆G/RT)",
        "Mole Fraction",
    ]

    # reorders the dataframe using front columns defined above
    valuesdf = valuesdf[
        explicit_order_front_columns
        + [
            col
            for col in valuesdf.columns
            if col not in explicit_order_front_columns
            and col not in values_exclude_columns
        ]
    ]

    # determine the index of the lowest energy conformer
    low_e_index = valuesdf[valuesdf["∆G(Hartree)"] == 0].index.tolist()
    # copy the row to a new_row with the name of the log changed to Lowest E Conformer
    new_row = pd.DataFrame(valuesdf.loc[low_e_index[0]]).T
    new_row["log_name"] = "Lowest E Conformer"
    # check if there is empty/Nan values in the row
    if new_row.isnull().values.any():
        print("There are empty values in the row")
        # print the entire row to see where the empty values are
        print(tabulate(new_row, headers="keys", tablefmt="pretty"))
        # drop the columns with empty values
        new_row = new_row.dropna(axis=1, how="any")
        print("The empty values have been dropped")
        print(tabulate(new_row, headers="keys", tablefmt="pretty"))
    valuesdf = pd.concat([valuesdf, new_row], ignore_index=True, axis=0)

    # ------------------------------EDIT THIS SECTION IF YOU WANT A SPECIFIC CONFORMER----------------------------------
    # if you want all properties for a conformer with a particular property (i.e. all properties for the Vbur_min conformer)
    # this template can be adjusted for min/max/etc.

    # find the index for the min or max column:
    ensemble_min_index = valuesdf[
        valuesdf["log_name"] == "Ensemble Minimum"
    ].index.tolist()
    # find the min or max value of the property (based on index above)
    # saves the value in a list (min_value) with one entry (this is why we call min_value[0])
    min_value = valuesdf.loc[ensemble_min_index, "%Vbur_C1_3.0Å"].tolist()
    vbur_min_index = valuesdf[valuesdf["%Vbur_C1_3.0Å"] == min_value[0]].index.tolist()
    # copy the row to a new_row with the name of the log changed to Property_min_conformer
    new_row = pd.DataFrame(valuesdf.loc[vbur_min_index[0]]).T
    new_row["log_name"] = "%Vbur_C1_3.0Å_min_Conformer"
    # check if there is empty/Nan values in the row
    if new_row.isnull().values.any():
        print("There are empty values in the row")
        # print the entire row to see where the empty values are
        print(tabulate(new_row, headers="keys", tablefmt="pretty"))
        # drop the columns with empty values
        new_row = new_row.dropna(axis=1, how="any")
        print("The empty values have been dropped")
        print(tabulate(new_row, headers="keys", tablefmt="pretty"))
    valuesdf = pd.concat([valuesdf, new_row], ignore_index=True, axis=0)
    # --------------------------------------------------------------------------------------------------------------------

    # appends the frame to the master output
    all_df_master = pd.concat([all_df_master, valuesdf])

    # drop all the individual conformers
    dropindex = valuesdf[valuesdf["log_name"].str.startswith(substring)].index
    valuesdf = valuesdf.drop(dropindex)
    valuesdf = valuesdf.reset_index(drop=True)

    # drop the columns created to determine the mole fraction, plus the extra GoodVibes and timing columns if present
    valuesdf = valuesdf.drop(columns=explicit_order_front_columns)
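    try:
        valuesdf = valuesdf.drop(columns=gv_extra_columns)
    except KeyError:
        # one or more GoodVibes columns are absent; nothing is dropped
        pass
    try:
        valuesdf = valuesdf.drop(columns=reg_avg_columns)
    except KeyError:
        # timing columns are absent unless get_time was run
        pass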

    # ---------------------THIS MAY NEED TO CHANGE DEPENDING ON HOW YOU LABEL YOUR COMPOUNDS------------------------------
    compound_name = prefix + str(compound)
    # --------------------------------------------------------------------------------------------------------------------

    properties_df = pd.DataFrame({"Compound_Name": [compound_name]})

    # builds a dataframe (for each compound) by adding summary properties as new columns
    for column in valuesdf:
        # the indexes need to match the values dataframe - display it to double check if you need to make changes
        # (uncomment the display(valuesdf) in row 124 of this cell)

        # create a list of headers for the properties_df
        # if you're collecting properties for a specific conformer, edit the headers to reflect that;
        # their order must match the row order of the log_name column in valuesdf
        headers = [
            f"{column}_Boltz",
            f"{column}_Boltz_stdev",
            f"{column}_min",
            f"{column}_max",
            f"{column}_range",
            f"{column}_low_E",
            f"{column}_Vbur_min",
        ]
        # Extract values for the current column from valuesdf and create a DataFrame
        row_dataframe = pd.DataFrame([valuesdf[column].values], columns=headers)
        # Display the DataFrame for verification
        # display(row_dataframe)
        # Concatenate the new DataFrame to the properties_df along the columns (axis=1)
        properties_df = pd.concat([properties_df, row_dataframe], axis=1)

    # concatenates the individual compound properties df into the master properties df
    properties_df_master = pd.concat([properties_df_master, properties_df], axis=0)

# Reset the index of the master DataFrames
all_df_master = all_df_master.reset_index(drop=True)
properties_df_master = properties_df_master.reset_index(drop=True)
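
The header loop above turns each property column of `valuesdf` (seven summary rows) into seven suffixed columns of a single-row frame. A minimal sketch of that reshape on toy data follows; the helper `widen`, the frame `toy_valuesdf`, the property names, and all numbers are hypothetical and only illustrate the idea, assuming the rows arrive in the same order as the suffix list.

```python
import pandas as pd

# Suffixes in the same order as the summary rows (mirrors the headers list above).
summary_labels = ["Boltz", "Boltz_stdev", "min", "max", "range", "low_E", "Vbur_min"]

# Toy stand-in for valuesdf once conformer rows and bookkeeping columns are gone:
# one row per summary statistic, one column per property (made-up numbers).
toy_valuesdf = pd.DataFrame(
    {
        "dipole": [1.0, 0.1, 0.8, 1.3, 0.5, 0.9, 1.1],
        "volume": [200.0, 2.0, 195.0, 206.0, 11.0, 198.0, 201.0],
    }
)


def widen(valuesdf, labels, compound_name):
    """Flatten the summary rows into one wide row per compound (hypothetical helper)."""
    if len(valuesdf) != len(labels):
        # A mismatch means a summary row is missing or an extra row slipped in.
        raise ValueError(f"expected {len(labels)} summary rows, got {len(valuesdf)}")
    wide = {"Compound_Name": compound_name}
    for column in valuesdf.columns:
        for label, value in zip(labels, valuesdf[column].tolist()):
            wide[f"{column}_{label}"] = value
    return pd.DataFrame([wide])


row = widen(toy_valuesdf, summary_labels, "pyrd1")
print(row.T)
```

Building the whole row as one dictionary avoids calling `pd.concat` once per property, and the explicit length check fails early if the number of summary rows ever drifts from the number of suffixes.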

***
# **5. Export the data**

In [11]:
# Print in tabulated version
print(tabulate(properties_df_master, headers="keys", tablefmt="pretty"))
print(tabulate(all_df_master, headers="keys", tablefmt="pretty"))

# get the first and last Compound_Name
first_compound = str(properties_df_master["Compound_Name"].iloc[0])
last_compound = str(properties_df_master["Compound_Name"].iloc[-1])

# Define the filename for the Excel file
filename = (
    prefix
    + "_properties_postprocessed_"
    + "for_"
    + first_compound
    + "_to_"
    + last_compound
    + ".xlsx"
)

# export to excel
with pd.ExcelWriter(filename, engine="xlsxwriter") as writer:
    all_df_master.to_excel(writer, sheet_name="All_Conformer_Properties", index=False)
    # automatically adjusts the width of the columns
    for column in all_df_master.columns:
        column_width = max(
            all_df_master[column].astype(str).map(len).max(), len(column)
        )
        col_idx = all_df_master.columns.get_loc(column)
        writer.sheets["All_Conformer_Properties"].set_column(
            col_idx, col_idx, column_width
        )
    properties_df_master.to_excel(writer, sheet_name="Summary_Properties", index=False)
    # automatically adjusts the width of the columns
    for column in properties_df_master.columns:
        column_width = max(
            properties_df_master[column].astype(str).map(len).max(), len(column)
        )
        col_idx = properties_df_master.columns.get_loc(column)
        writer.sheets["Summary_Properties"].set_column(col_idx, col_idx, column_width)

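
The export cell repeats the same column-width calculation for both sheets. As a sketch, that calculation could be factored into one function; `column_widths` and the toy frame below are hypothetical names, not part of the notebook, and the `padding` parameter is an added assumption.

```python
import pandas as pd


def column_widths(df, padding=0):
    """Return {column position: width}, using the longer of header and longest cell."""
    widths = {}
    for idx, column in enumerate(df.columns):
        header_len = len(str(column))
        if df.empty:
            # No cells to measure, so fall back to the header length alone.
            longest_cell = 0
        else:
            longest_cell = int(df[column].astype(str).map(len).max())
        widths[idx] = max(header_len, longest_cell) + padding
    return widths


toy = pd.DataFrame({"Compound_Name": ["pyrd1", "pyrd10"], "x": [1.25, 10.5]})
widths = column_widths(toy)
empty_widths = column_widths(pd.DataFrame({"abc": []}))
print(widths)
```

Inside the `pd.ExcelWriter` block, each `(idx, width)` pair from the returned mapping could then be passed to the worksheet's `set_column(idx, idx, width)`, once per sheet.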

In [12]:
# Define filenames for the pickle files
pkl_filename_all = (
    prefix
    + "_properties_postprocessed"
    + "_all_conformer_properties"
    + "_for_"
    + prefix
    + ".pkl"
)
pkl_filename_summary = (
    prefix
    + "_properties_postprocessed"
    + "_summary_properties"
    + "_for_"
    + prefix
    + ".pkl"
)

# Save to pickle
all_df_master.to_pickle(pkl_filename_all)
properties_df_master.to_pickle(pkl_filename_summary)
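
To confirm that a pickle written this way reloads unchanged, a round-trip check along the following lines may help. It is a sketch on a toy frame written to a temporary directory; `toy_summary` and its values are made up, and only the file-name pattern is taken from the cell above.

```python
import tempfile
from pathlib import Path

import pandas as pd

prefix = "pyrd"  # plays the same role as the notebook's prefix

# Toy stand-in for properties_df_master (made-up values).
toy_summary = pd.DataFrame(
    {"Compound_Name": ["pyrd1", "pyrd2"], "dipole_Boltz": [1.0, 1.2]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same naming pattern as pkl_filename_summary, written to a scratch folder.
    pkl_path = (
        Path(tmp_dir)
        / f"{prefix}_properties_postprocessed_summary_properties_for_{prefix}.pkl"
    )
    toy_summary.to_pickle(pkl_path)
    reloaded = pd.read_pickle(pkl_path)

# DataFrame.equals compares shape, dtypes, and values.
round_trip_ok = reloaded.equals(toy_summary)
print(round_trip_ok)
```

Pickle keeps dtypes and the index exactly, which the Excel export does not guarantee, so downstream notebooks that continue the analysis would typically read the `.pkl` files with `pd.read_pickle`.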