In [84]:
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings("ignore")

In [85]:
# DFT Calculations
dft_calc = pd.read_csv("data/dft_calc.csv").replace(" ", 0)

# Elemental Properties
elemental_prop = pd.read_csv("data/elemental_properties.csv")
elemental_prop = elemental_prop.replace(" ", 0)

# Copy of dft_calc
df = dft_calc.copy()

## Data Setup

The dataset is derived from the work of Jacobs et al., who used density functional theory (DFT) to simulate 1,926 perovskite oxides and calculate their thermodynamic stability [1]. As shown below, the dataset contains 11 columns, which include the material composition, the ions in the A-site, B-site, and X-site, the number of elements, the energy above hull $E_{hull}$ in meV/atom, and the formation energy in meV/atom.

In [86]:
dft_calc.head()

Unnamed: 0,COMPOSITION,A_SITE_1,A_SITE_2,A_SITE_3,B_SITE_1,B_SITE_2,B_SITE_3,X_SITE,NUM_ELEMS,ENERGY_ABOVE_HULL,FORMATION_ENERGY
0,Ba1Sr7V8O24,Ba,Sr,,V,,,O,4,29.747707,-2.113335
1,Ba2Bi2Pr4Co8O24,Ba,Bi,Pr,Co,,,O,5,106.702335,-1.311863
2,Ba2Ca6Fe8O24,Ba,Ca,,Fe,,,O,4,171.608093,-1.435607
3,Ba2Cd2Pr4Ni8O24,Ba,Cd,Pr,Ni,,,O,5,284.89819,-0.868639
4,Ba2Dy6Fe8O24,Ba,Dy,,Fe,,,O,4,270.007913,-1.746806


In [87]:
dft_calc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1926 entries, 0 to 1925
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   COMPOSITION        1926 non-null   object 
 1   A_SITE_1           1926 non-null   object 
 2   A_SITE_2           1159 non-null   object 
 3   A_SITE_3           34 non-null     object 
 4   B_SITE_1           1926 non-null   object 
 5   B_SITE_2           1247 non-null   object 
 6   B_SITE_3           33 non-null     object 
 7   X_SITE             1926 non-null   object 
 8   NUM_ELEMS          1926 non-null   int64  
 9   ENERGY_ABOVE_HULL  1926 non-null   float64
 10  FORMATION_ENERGY   1926 non-null   float64
dtypes: float64(2), int64(1), object(8)
memory usage: 165.6+ KB


Another dataset, from the work of Ward et al., is used to supplement the `dft_calc` dataset [2]. As seen below, it contains chemical and physical properties for each element in the periodic table.

In [88]:
elemental_prop.head()

Unnamed: 0,SYMBOL,IONIC_RADIUS,MOD_OF_ELASTICITY,BP,MP,DENSITY,AT_WT,BCC_EFF_LAT_CNT,BCC_ENERGY,BCC_ENERGY_DIFF,...,IS_NONMETAL,ND_UNFILLED,ND_VALENCE,NF_UNFILLED,NF_VALENCE,NP_UNFILLED,NP_VALENCE,NS_UNFILLED,NS_VALENCE,N_UNFILLED
0,H,1.54,,20.28,13.81,0.0899,1.00797,3.589268,-2.135811,1.19548,...,1,0,0,0,0,0,0,1,1,1
1,He,0.0,,4.216,0.95,0.1785,4.0026,5.373995,-2.000673,-2.001808,...,1,0,0,0,0,0,0,0,2,0
2,Li,0.76,10.0,1615.0,453.7,0.53,6.941,6.416364,-1.865535,0.004352,...,0,0,0,0,0,0,0,1,1,1
3,Be,0.45,301.0,3243.0,1560.0,1.85,9.01218,4.997332,-3.655272,0.099767,...,0,0,0,0,0,0,0,0,2,0
4,B,0.23,441.0,4275.0,2365.0,2.34,10.811,4.60667,-4.966431,1.711267,...,0,0,0,0,0,5,1,0,2,5


In [89]:
elemental_prop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 82 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   SYMBOL                  110 non-null    object 
 1   IONIC_RADIUS            110 non-null    object 
 2   MOD_OF_ELASTICITY       81 non-null     float64
 3   BP                      110 non-null    object 
 4   MP                      110 non-null    object 
 5   DENSITY                 110 non-null    object 
 6   AT_WT                   110 non-null    object 
 7   BCC_EFF_LAT_CNT         110 non-null    float64
 8   BCC_ENERGY              110 non-null    float64
 9   BCC_ENERGY_DIFF         110 non-null    float64
 10  BCC_FERMI               110 non-null    float64
 11  BCC_MAG_MOM             110 non-null    float64
 12  BCC_VOLUME_PA           110 non-null    float64
 13  BCC_VOLUME_DIFF         110 non-null    float64
 14  GS_BANDGAP              110 non-null    fl

## Data pre-processing

Most numerical features in the `elemental_prop` dataset are encoded as strings and should be converted to a numeric data type that machine learning models can work with. Additionally, the ionization energy column (`ION_ERGY`) uses commas as thousands separators. The commas should be removed before converting from str to float. The column is shown below:

In [90]:
# Commas in the ionization energy column
elemental_prop["ION_ERGY"]

0          1312
1      2,372.30
2           520
3        899.40
4        800.60
         ...   
105           0
106           0
107           0
108           0
109           0
Name: ION_ERGY, Length: 110, dtype: object

In [91]:
# Removing commas on column `ION_ERGY`
elemental_prop["ION_ERGY"] = elemental_prop["ION_ERGY"].str.replace(",","")

# Creating list of numerical columns in elemental_prop that should be converted to float
col_names_no_symbol = [i for i in elemental_prop.columns if i != "SYMBOL"]

# Conversion to float
elemental_prop[col_names_no_symbol] = elemental_prop[col_names_no_symbol].astype("float")
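The cleaning step above can be sketched on a toy frame. This is a minimal illustration rather than the notebook's data: the frame `toy` and its values are made up, and `pd.to_numeric(..., errors="coerce")` is swapped in for `astype("float")` so that unparseable entries become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data mimicking string-encoded numeric columns
toy = pd.DataFrame({
    "SYMBOL": ["H", "He", "Li"],
    "ION_ERGY": ["1312", "2,372.30", "520"],
    "BP": ["20.28", "4.216", "1615"],
})

# Every column except the key column should become numeric
numeric_cols = [c for c in toy.columns if c != "SYMBOL"]

# Strip thousands separators, then convert each column to float
toy[numeric_cols] = toy[numeric_cols].apply(
    lambda s: pd.to_numeric(s.str.replace(",", "", regex=False), errors="coerce")
)

print(toy.dtypes)
```

Note that `.str.replace` returns NaN for entries that are not strings, so a column that mixes strings with integers (such as the zeros introduced by `.replace(" ", 0)`) may need to be cast with `astype(str)` first.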

## Dataset building

To build the actual dataset to be used for modeling, `dft_calc` and `elemental_prop` are joined using the elements in the different sites as keys. Take the first row of `dft_calc`, Ba1Sr7V8O24, as an example:

In [92]:
dft_calc.iloc[0,:]

COMPOSITION          Ba1Sr7V8O24
A_SITE_1                      Ba
A_SITE_2                      Sr
A_SITE_3                     NaN
B_SITE_1                       V
B_SITE_2                     NaN
B_SITE_3                     NaN
X_SITE                         O
NUM_ELEMS                      4
ENERGY_ABOVE_HULL      29.747707
FORMATION_ENERGY       -2.113335
Name: 0, dtype: object

It has Ba and Sr in its A-sites, and V in its B-sites. The O in its X-site can be disregarded since all materials in the dataset are perovskite oxides and have O in their X-sites. The elemental properties of Ba, Sr, and V are joined into `dft_calc`, with each column name prefixed by the site the element occupies (for example, `A1_IONIC_RADIUS`).

In [93]:
# List of suffixes to append to elemental properties
# ['A1', 'A2', 'A3', 'B1', 'B2', 'B3']
suffixes = "A1 A2 A3 B1 B2 B3".split()

# List of columns under dft_calc showing sites of elements
# To be used as key in merging 
# ['A_SITE_1', 'A_SITE_2', 'A_SITE_3', 'B_SITE_1', 'B_SITE_2', 'B_SITE_3']
site_names = list(dft_calc.columns[1:7])

# List of column names for symbols under placeholder dataframe 
# ['A1_SYMBOL', 'A2_SYMBOL', 'A3_SYMBOL', 'B1_SYMBOL', 'B2_SYMBOL', 'B3_SYMBOL']
symbol_names = [i+"_SYMBOL" for i in suffixes]

# Initializing empty list containing column names for 
# A-sites and B-sites
elemental_prop_col_names = []

# Populating elemental_prop_col_names
# Loops through all sites and creates site-specific column names for each
# elemental property
for i in range(6):
    placeholder_df = elemental_prop.copy()
    placeholder_df.columns = suffixes[i] + "_" + placeholder_df.columns.values
    dft_calc = pd.merge(dft_calc, placeholder_df, how="left", left_on=site_names[i], right_on=symbol_names[i])
    
    placeholder_df = placeholder_df.drop(columns=[suffixes[i] + "_SYMBOL"])
    elemental_prop_col_names.append(placeholder_df.columns)
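The site-by-site join can be sketched on two miniature tables. The frames `materials` and `props` below are hypothetical stand-ins for `dft_calc` and `elemental_prop`, and `DataFrame.add_prefix` is used as a shorthand for renaming the columns. Unlike the loop above, this sketch also drops the redundant symbol column after each merge.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
materials = pd.DataFrame({
    "COMPOSITION": ["Ba1Sr7V8O24", "Ba2Ca6Fe8O24"],
    "A_SITE_1": ["Ba", "Ba"],
    "A_SITE_2": ["Sr", "Ca"],
})
props = pd.DataFrame({
    "SYMBOL": ["Ba", "Sr", "Ca"],
    "AT_NUM": [56.0, 38.0, 20.0],
})

for site, prefix in [("A_SITE_1", "A1"), ("A_SITE_2", "A2")]:
    # Prefix every property column with the site label, e.g. A1_AT_NUM
    site_props = props.add_prefix(prefix + "_")
    # A left join keeps every material, even when a site is empty (NaN key)
    materials = materials.merge(
        site_props, how="left", left_on=site, right_on=prefix + "_SYMBOL"
    ).drop(columns=prefix + "_SYMBOL")

print(materials)
```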

Shown below is a list containing new column names for Site A1. In the case of Ba1Sr7V8O24, the columns below correspond to the elemental properties of Ba since it is the first element in its A-site. 

In [94]:
# elemental_prop_col_names is a list of pandas Index objects, one per site
elemental_prop_col_names[0]

Index(['A1_IONIC_RADIUS', 'A1_MOD_OF_ELASTICITY', 'A1_BP', 'A1_MP',
       'A1_DENSITY', 'A1_AT_WT', 'A1_BCC_EFF_LAT_CNT', 'A1_BCC_ENERGY',
       'A1_BCC_ENERGY_DIFF', 'A1_BCC_FERMI', 'A1_BCC_MAG_MOM',
       'A1_BCC_VOLUME_PA', 'A1_BCC_VOLUME_DIFF', 'A1_GS_BANDGAP',
       'A1_GS_EFF_LAT_CNT', 'A1_GS_ENERGY', 'A1_GS_MAG_MOM', 'A1_GS_VOLUME_PA',
       'A1_HH_IP', 'A1_HH_IR', 'A1_ICSD_VOLUME', 'A1_COV_RAD', 'A1_ION_ERGY',
       'A1_ATOM_RAD', 'A1_ELECT_AFF', 'A1_AT_RAD', 'A1_AT_VOL', 'A1_MEN_NUM',
       'A1_N_WS_THIRD', 'A1_1_ION_POT', 'A1_2_ION_POT', 'A1_3_ION_POT',
       'A1_CTE', 'A1_SP_HEAT_CAP', 'A1_THERMAL_COND', 'A1_CONDUCTIVITY',
       'A1_HEAT_OF_FUSION', 'A1_HEAT_OF_VAP', 'A1_ELECTRONEGATIVITY',
       'A1_AT_NUM', 'A1_PERIOD', 'A1_GRP', 'A1_VALENCE', 'A1_IS_HEXAGONAL',
       'A1_IS_BCC', 'A1_IS_CUBIC', 'A1_IS_FCC', 'A1_IS_ORTHO', 'A1_IS_RHOMBO',
       'A1_IS_MONO', 'A1_IS_TETRA', 'A1_IS_ALKALI', 'A1_IS_ALKALI_EARTH',
       'A1_IS_BORON', 'A1_IS_CARBON', 'A1_IS_CHALCO

Below is a function that uses RegEx to determine the number of atoms of each element in the formula of the material. For instance, Ba1Sr7V8O24 should yield the following:
- NUM_A1: 1 (Ba)
- NUM_A2: 7 (Sr)
- NUM_A3: NaN
- NUM_B1: 8 (V)
- NUM_B2: NaN
- NUM_B3: NaN

In [95]:
def num_of_sites(site):
    nums =[]
    
    # Loops through all rows of dft_calc
    for i in range(dft_calc.shape[0]):
        
        # Applies RegEx filter to COMPOSITION yielding a list of tuples
        # [('Ba', '1'), ('Sr', '7'), ('V', '8'), ('O', '24'), ('', '')] when i = 0
        matches = re.findall(r"(\D*)(\d*)", dft_calc["COMPOSITION"].iloc[i])

        # Proceed only when site is not NaN, note that most materials do not
        # have 3 elements in their A-site/B-site
        if type(dft_calc[site].iloc[i]) == str:
            
            # Looping through the list of tuples yielded above
            # to populate all sites
            for j in range(len(matches)):
                # When j = 0, i = 0, matches[j] = ('Ba', '1')
                # When j = 1, i = 0, matches[j] = ('Sr', '7')
                if matches[j][0] == dft_calc[site].iloc[i]:
                    nums.append(int(matches[j][1]))
                    # Each element appears once per formula, so stop searching
                    break
                
        # Appends NaN when site is unfilled
        else:
            nums.append(np.nan)
    return nums 
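The parsing step can also be tried on a single string. The helper `parse_composition` below is a hypothetical name, and its pattern is stricter than the one used in `num_of_sites`: it assumes every element symbol is one uppercase letter, optionally followed by one lowercase letter, and is always followed by an explicit count.

```python
import re

def parse_composition(formula):
    """Return a dict mapping each element symbol to its integer count."""
    # One uppercase letter, an optional lowercase letter, then the digits
    pairs = re.findall(r"([A-Z][a-z]?)(\d+)", formula)
    return {element: int(count) for element, count in pairs}

counts = parse_composition("Ba1Sr7V8O24")
print(counts)
```

Looking up a site's element in the returned dict (for example `counts["Sr"]`) gives the same number that `num_of_sites` collects for that site.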

In [96]:
num_sites = []

# Loops through all sites 
# Each iteration populates a site
# Results into a list of lists
for i in site_names:
    num_sites.append(num_of_sites(i))
    
# The first element in num_sites consists of 
# 1926 entries corresponding to the number of A1 atoms
# in each composition
num_sites[0]

[1,
 2,
 2,
 2,
 ...
 6,
 6,
 6,
 ...


In [97]:
# Generating names for new columns
# ['NUM_A1', 'NUM_A2', 'NUM_A3', 'NUM_B1', 'NUM_B2', 'NUM_B3']
num_col_names = ["NUM_" + i for i in suffixes]
num_col_names

# Populating dft_calc with list of lists created earlier
for i in range(6):
    dft_calc[num_col_names[i]] = num_sites[i]

The new columns for Ba1Sr7V8O24 are shown below.

In [98]:
dft_calc[num_col_names].iloc[0]

NUM_A1    1.0
NUM_A2    7.0
NUM_A3    NaN
NUM_B1    8.0
NUM_B2    NaN
NUM_B3    NaN
Name: 0, dtype: float64

### Adding structural parameters

Structural parameters unique to perovskite crystals, such as the Goldschmidt tolerance factor ($t$) and the octahedral factor ($\mu$), are added to the dataset. The tolerance factor $t$ is a dimensionless number calculated from the ionic radii of the elements comprising a perovskite crystal. It is expressed mathematically as

$$t=\frac{r_A + r_O}{\sqrt{2}(r_B + r_O)}$$

where $r_A$ is the radius of the A cation, $r_B$ is the radius of the B cation, and $r_O$ is the radius of the anion (oxygen in this case, with $r_O = 1.4$ Å).

The ideal and stable structure for a perovskite crystal is cubic, which has $t=1$. The value of $t$ indicates the likely structure:
- $t > 1$: Hexagonal or tetragonal (A too big or B too small)
- $0.9 < t < 1$: Cubic (A and B ideal)
- $0.71 < t < 0.9$: Orthorhombic/rhombohedral (A too small)

For perovskites with multiple ions in each site such as the ones in the dataset, the radius for each ion type should be averaged with respect to composition. The composition averaged radius of site S can be calculated using the formula below:

$$r_S = \sum_{i=0}^{2} \frac{n_i}{n_{tot}} r_i $$

where $n_i$ is the number of the i-th ion, and $n_{tot}$ is the total number of atoms in site $S$ (always 8 in this case). 
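As a worked instance of this formula, Ba1Sr7V8O24 has one Ba and seven Sr ions on the A-site and eight V ions on the B-site, so its composition-averaged radii are

$$r_A = \frac{1}{8} r_{Ba} + \frac{7}{8} r_{Sr}, \qquad r_B = \frac{8}{8} r_{V} = r_{V}$$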

Much like $t$, $\mu$ is also widely used to predict crystal stability. It is expressed as: 
$$\mu=\frac{r_B}{r_X}.$$

When $\mu>0.41$, the perovskite structure is said to be stable.

The bond lengths between the A, B, and X ions, taken as sums of ionic radii, are also added as new features:
$$AB=r_A+r_B$$
$$AO=r_A+r_O$$
$$BO=r_B+r_O$$
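The formulas in this section can be collected into a small function for a single material. This is a sketch under stated assumptions: `site_radius` and `structural_parameters` are hypothetical helper names, the radii in the usage line are placeholders chosen to keep the arithmetic simple (they are not values from `elemental_prop`), and the average divides by the summed counts, which equals 8 for every site in this dataset.

```python
import math

R_O = 1.4  # ionic radius of oxygen used in the text

def site_radius(radii, counts):
    """Composition-averaged radius: sum(n_i * r_i) / n_tot."""
    return sum(r * n for r, n in zip(radii, counts)) / sum(counts)

def structural_parameters(a_radii, a_counts, b_radii, b_counts, r_o=R_O):
    """Return t, mu, and the three bond-length features for one material."""
    r_a = site_radius(a_radii, a_counts)
    r_b = site_radius(b_radii, b_counts)
    return {
        "t": (r_a + r_o) / (math.sqrt(2) * (r_b + r_o)),
        "mu": r_b / r_o,
        "AB": r_a + r_b,
        "AO": r_a + r_o,
        "BO": r_b + r_o,
    }

# Placeholder radii: two A-site ions with counts 1 and 7, one B-site ion
params = structural_parameters([1.6, 1.2], [1, 7], [0.7], [8])
print(params)
```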

In [99]:
ionic_radius_names = [i for i in dft_calc if "_IONIC_RADIUS" in i]
ionic_radius_names

['A1_IONIC_RADIUS',
 'A2_IONIC_RADIUS',
 'A3_IONIC_RADIUS',
 'B1_IONIC_RADIUS',
 'B2_IONIC_RADIUS',
 'B3_IONIC_RADIUS']

In [100]:
# Generating names for new columns indicating ionic radii of 
# ions in different sites
# ['A1_IONIC_RADIUS',
#  'A2_IONIC_RADIUS',
#  'A3_IONIC_RADIUS',
#  'B1_IONIC_RADIUS',
#  'B2_IONIC_RADIUS',
#  'B3_IONIC_RADIUS']

ionic_radius_names = [i for i in dft_calc if "_IONIC_RADIUS" in i]
ionic_radius_names

# Empty list for Goldschmidt tolerance factor
gs = []

# Empty list for octahedral factor
of = []

# Empty list for A-B bond length
ab = []

# Empty list for A-O bond length
ao = []

# Empty list for B-O bond length
bo = []

# Empty list for A ions with highest composition
a_max = []

# Empty list for B ions with highest composition
b_max = []

# Loops through all rows of dft_calc
for i in range(dft_calc.shape[0]):
    
    # Takes list of ionic radii of A ions (r_i when S=A)
    ionic_radii_list_a = list(dft_calc[ionic_radius_names[0:3]].iloc[i].dropna())
    
    # Takes list of number of A ions (n_i when S=A)
    num_list_a = list(dft_calc[num_col_names[0:3]].iloc[i].dropna())
    
    # Adding which A ion has highest composition to a_max
    a_max.append(dft_calc[site_names[np.argmax(num_list_a)]].iloc[i])
    
    # Takes list of ionic radii of B ions (r_i when S=B)
    ionic_radii_list_b = list(dft_calc[ionic_radius_names[3:6]].iloc[i].dropna())
    
    # Takes list of number of B ions (n_i when S=B)
    num_list_b = list(dft_calc[num_col_names[3:6]].iloc[i].dropna())
    
    # Adding which B ion has highest composition to b_max
    b_max.append(dft_calc[site_names[np.argmax(num_list_b)+3]].iloc[i])
    
    a_sum = 0
    
    sam_list_a = len(ionic_radii_list_a)
    sam_list_b = len(ionic_radii_list_b)
    
    # Calculating composition-averaged radius when S=A
    for j in range(sam_list_a):
        a = ionic_radii_list_a[j] * num_list_a[j]
        a_sum = a_sum + a
    
    b_sum = 0
    
    # Calculating composition-averaged radius when S=B
    for k in range(sam_list_b):
        b = ionic_radii_list_b[k] * num_list_b[k]
        b_sum = b_sum + b

    # Calculating octahedral factor once the B-site sum is complete
    of_ = (b_sum/8)/1.4
        
    # Calculating t
    gs_tf = ((a_sum/8) + 1.4)/(np.sqrt(2)*((b_sum/8)+1.4))    
    
    # Appending t
    gs.append(gs_tf)
    # Appending octahedral factor
    of.append(of_)
    # Appending AB
    ab.append((a_sum/8)+(b_sum/8))
    # Appending AO
    ao.append((a_sum/8)+1.4)
    # Appending BO
    bo.append((b_sum/8)+1.4)
    
df["GOLDSCHMIDT_TF"] = gs
df["OCTAHEDRAL_FACTOR"] = of
df["A_B"] = ab
df["A_O"] = ao
df["B_O"] = bo
df["A_MAX"] = a_max
df["B_MAX"] = b_max

### Adding elemental properties of majority ions

In the previous section, structural parameters such as $t$, $\mu$, $AB$, $AO$, and $BO$ were added to the dataframe. Two features not yet discussed were also added: `A_MAX` and `B_MAX`, which record the ion with the highest count in the A-site and B-site, respectively. For instance, `A_MAX` and `B_MAX` for Ba1Sr7V8O24 are Sr (7) and V (8). These new columns are used as keys when the elemental properties of each majority ion are merged into the dataset.

In [101]:
# Preparing new column names
placeholder_df_amax = elemental_prop.copy()
placeholder_df_amax.columns = "A_MAX_" + placeholder_df_amax.columns.values
a_max_names = list(placeholder_df_amax.columns)

placeholder_df_bmax = elemental_prop.copy()
placeholder_df_bmax.columns = "B_MAX_" + placeholder_df_bmax.columns.values
b_max_names = list(placeholder_df_bmax.columns)

In [102]:
df = pd.merge(df, placeholder_df_amax, how="inner", left_on="A_MAX", right_on="A_MAX_SYMBOL")
df = df.drop(columns=["A_MAX", "A_MAX_SYMBOL"])
a_max_names.remove("A_MAX_SYMBOL")

In [103]:
df = pd.merge(df, placeholder_df_bmax, how="inner", left_on="B_MAX", right_on="B_MAX_SYMBOL")
df = df.drop(columns=["B_MAX", "B_MAX_SYMBOL"])
b_max_names.remove("B_MAX_SYMBOL")
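The majority-ion lookup behind `A_MAX` can be illustrated without a row loop. The frame `toy` below is hypothetical, and `DataFrame.idxmax` is an alternative to the `np.argmax` call used earlier, not the notebook's own method.

```python
import pandas as pd

# Hypothetical counts and symbols for two materials
toy = pd.DataFrame({
    "A_SITE_1": ["Ba", "Ba"],
    "A_SITE_2": ["Sr", None],
    "NUM_A1": [1.0, 8.0],
    "NUM_A2": [7.0, float("nan")],
})

count_cols = ["NUM_A1", "NUM_A2"]
site_cols = ["A_SITE_1", "A_SITE_2"]

# idxmax skips NaN and returns the label of the largest count in each row
winner = toy[count_cols].idxmax(axis=1)

# Translate the winning count column into its matching site column
site_of = dict(zip(count_cols, site_cols))
toy["A_MAX"] = [toy.at[row, site_of[col]] for row, col in winner.items()]

print(toy["A_MAX"].tolist())
```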

### Adding average, difference, and ratio of A and B majority ions

In [104]:
# Preparing new column names
placeholder_df_ab_avg = elemental_prop.iloc[:, 1:40].copy()
placeholder_df_ab_avg.columns = "AB_AVG_" + placeholder_df_ab_avg.columns.values
ab_avg_list_names = placeholder_df_ab_avg.columns

placeholder_df_diff = elemental_prop.iloc[:, 1:40].copy()
placeholder_df_diff.columns = "DIFF_" + placeholder_df_diff.columns.values
ab_diff_list_names = placeholder_df_diff.columns

placeholder_df_ratio = elemental_prop.iloc[:, 1:40].copy()
placeholder_df_ratio.columns = "RATIO_" + placeholder_df_ratio.columns.values
ratio_list_names = placeholder_df_ratio.columns

In [105]:
# Adding average columns 
for i in range(len(ab_avg_list_names)):
    df[ab_avg_list_names[i]] = (df[a_max_names[i]] + df[b_max_names[i]])/2

In [106]:
# Adding difference columns 
for i in range(len(ab_diff_list_names)):
    df[ab_diff_list_names[i]] = abs(df[a_max_names[i]] - df[b_max_names[i]])

In [107]:
# Taking ratio of A and B
for i in range(len(ratio_list_names)):
    df[ratio_list_names[i]] = np.divide(df[a_max_names[i]],
                                        df[b_max_names[i]],
                                        out=np.zeros_like(df[a_max_names[i]]), # Ensuring no division by zero
                                        where=df[b_max_names[i]] != 0)
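The zero-safe division used for the ratio columns is easier to see on plain arrays. The values below are made up for illustration. `np.divide` writes a quotient only where the `where` mask is true, so positions with a zero denominator keep whatever `out` already held, which is 0 here.

```python
import numpy as np

numerator = np.array([2.0, 3.0, 5.0])
denominator = np.array([4.0, 0.0, 10.0])

# Entries where the denominator is zero keep the value preset in `out`
ratio = np.divide(
    numerator,
    denominator,
    out=np.zeros_like(numerator),
    where=denominator != 0,
)

print(ratio)
```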

### Adding composition averaged properties, maximum, minimum, and range

In [None]:
a_prop_names = []
b_prop_names = []

a_wt_avg_names = []
b_wt_avg_names = []

for i in elemental_prop.columns[1:82]:
    triplet_props_a = [j + "_" + i for j in suffixes[0:3]]
    triplet_props_b = [j + "_" + i for j in suffixes[3:6]]
    
    a_prop_names.append(triplet_props_a)
    b_prop_names.append(triplet_props_b)
    
    a_wt_avg_names.append("A_WT_AVG_" + i)
    b_wt_avg_names.append("B_WT_AVG_" + i)
    

In [None]:
a_max_all_names = ["ALL_MAX_A_" + i for i in elemental_prop.columns[1:82]]
b_max_all_names = ["ALL_MAX_B_" + i for i in elemental_prop.columns[1:82]]

a_min_all_names = ["ALL_MIN_A_" + i for i in elemental_prop.columns[1:82]]
b_min_all_names = ["ALL_MIN_B_" + i for i in elemental_prop.columns[1:82]]

a_range_names = ["RANGE_A_" + i for i in elemental_prop.columns[1:82]]
b_range_names= ["RANGE_B_" + i for i in elemental_prop.columns[1:82]]

In [None]:
for j in range(81):
    
    a_vals = []
    b_vals = []
    a_max_alls = []
    b_max_alls = []
    a_min_alls = []
    b_min_alls = []
    a_ranges = []
    b_ranges = []
    
    for i in range(dft_calc.shape[0]):
        
        num_list_a = list(dft_calc[num_col_names[0:3]].iloc[i].dropna())
        num_list_b = list(dft_calc[num_col_names[3:6]].iloc[i].dropna())
        
        len_a = len(num_list_a)
        len_b = len(num_list_b)

        a_properties = list(dft_calc[a_prop_names[j]].iloc[i].dropna())
        b_properties = list(dft_calc[b_prop_names[j]].iloc[i].dropna())
        
        a_max_all = max(a_properties)
        b_max_all = max(b_properties)
        
        a_min_all = min(a_properties)
        b_min_all = min(b_properties)
        
        a_range = a_max_all - a_min_all
        b_range = b_max_all - b_min_all
        
        a = np.sum(np.multiply(a_properties, num_list_a))/(8)
        b = np.sum(np.multiply(b_properties, num_list_b))/(8)
        
        a_vals.append(a)
        b_vals.append(b)
        a_max_alls.append(a_max_all)
        b_max_alls.append(b_max_all)
        a_min_alls.append(a_min_all)
        b_min_alls.append(b_min_all)
        a_ranges.append(a_range)
        b_ranges.append(b_range)
        
    df[a_wt_avg_names[j]] = a_vals
    df[b_wt_avg_names[j]] = b_vals
    
    df[a_max_all_names[j]] = a_max_alls
    df[b_max_all_names[j]] = b_max_alls
    
    df[a_min_all_names[j]] = a_min_alls
    df[b_min_all_names[j]] = b_min_alls
    
    df[a_range_names[j]] = a_ranges
    df[b_range_names[j]] = b_ranges

KeyboardInterrupt: 
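The nested loop above indexes `dft_calc` one row at a time for each of the 81 properties, which is slow. A vectorized alternative for one property is sketched below under stated assumptions: the frame `toy` and the property name `RADIUS` are hypothetical, NaN marks an unfilled site, and the weighted average divides by the summed counts (8 for every site in this dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical two-material table with one property on up to three A-sites
toy = pd.DataFrame({
    "NUM_A1": [1.0, 8.0],
    "NUM_A2": [7.0, np.nan],
    "NUM_A3": [np.nan, np.nan],
    "A1_RADIUS": [1.6, 1.0],
    "A2_RADIUS": [1.2, np.nan],
    "A3_RADIUS": [np.nan, np.nan],
})

counts = toy[["NUM_A1", "NUM_A2", "NUM_A3"]].to_numpy()
values = toy[["A1_RADIUS", "A2_RADIUS", "A3_RADIUS"]].to_numpy()

# NaN-aware reductions handle all rows at once; unfilled sites are ignored
toy["A_WT_AVG_RADIUS"] = np.nansum(counts * values, axis=1) / np.nansum(counts, axis=1)
toy["ALL_MAX_A_RADIUS"] = np.nanmax(values, axis=1)
toy["ALL_MIN_A_RADIUS"] = np.nanmin(values, axis=1)
toy["RANGE_A_RADIUS"] = toy["ALL_MAX_A_RADIUS"] - toy["ALL_MIN_A_RADIUS"]

print(toy.iloc[:, -4:])
```

Repeating these four assignments for each property and for the B-site columns would cover the same features as the loop, with only one pass over the property names.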

---

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=42)

In [None]:
X_train

In [None]:
scaler = StandardScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
df.dtypes

In [None]:
df

---

[1] Jacobs, Ryan, et al. "Material discovery and design principles for stable, high activity perovskite cathodes for solid oxide fuel cells." Advanced Energy Materials 8.11 (2018): 1702708.

[2] Ward, Logan, et al. "A general-purpose machine learning framework for predicting properties of inorganic materials." npj Computational Materials 2.1 (2016): 1-7.
