# Make new data based on experimental data

This notebook demonstrates how we generated the data on which we will test our models.

In the original training data set there are $41$ unique A atoms and $55$ unique B atoms (verified below), all based on experimental data. For the test data we implement every eligible A+B combination whose oxidation numbers sum to VI; this serves as a larger test set consisting of both stable and unstable compounds. We use the same combinations as the article on predicting perovskites, but we are unable to verify that our data sets are equal, since their test set is not publicly available.
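The requirement of VI follows from charge neutrality: the three O$^{2-}$ ions in ABO$_3$ contribute $-6$, so the A and B cations together must contribute $+6$. A minimal sketch of such a filter is shown here, assuming a small hand-written table of common oxidation states. The dictionaries and the helper `eligible_pairs` are illustrative only and are not the data or the code used to build the 625-compound list.

```python
from itertools import product

# Illustrative subset of common oxidation states (assumption, not the notebook's data).
a_oxidation = {"Na": [1], "Ca": [2], "La": [3]}
b_oxidation = {"Nb": [5], "Ti": [4], "Al": [3], "Fe": [2, 3]}

def eligible_pairs(a_states, b_states, total=6):
    """Return sorted (A, B) pairs for which some oxidation states sum to `total`."""
    pairs = set()
    for (a, a_ox), (b, b_ox) in product(a_states.items(), b_states.items()):
        if any(x + y == total for x in a_ox for y in b_ox):
            pairs.add((a, b))
    return sorted(pairs)

# Build the ABO3 formula strings for the eligible pairs.
candidates = [f"{a}{b}O3" for a, b in eligible_pairs(a_oxidation, b_oxidation)]
print(candidates)
```

In a full version, compounds that already appear in the labeled training set would presumably be dropped from the candidate list afterwards.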

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [2]:
X = pd.read_csv('preprocessed_data/X.csv')
y = pd.read_csv('preprocessed_data/target.csv')
oldData = pd.read_csv('preprocessed_data/data.csv')
newData = pd.read_csv('../../data/625unlabeledABO3.csv', sep=r'\s+')  # raw string: whitespace-separated columns
newCompositions = newData["Compound"].values.tolist()

## Identifying the A atom and the B atom in the ABO3 formula

In [3]:
unique_A_atoms = X.MA
unique_A_atoms = list(set(unique_A_atoms.to_list()))
print(len(unique_A_atoms))


unique_B_atoms = X.MB
unique_B_atoms = list(set(unique_B_atoms.to_list()))
print(len(unique_B_atoms))

41
55


In [4]:
def findElements(Compositions):
    
    small_alphabet = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
    bigg_alphabet  = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","P","Q","R","S","T","U","V","W","X","Y","Z"]
    #capital O is not included (since it is oxygen)

    compoundA = []
    compoundB = []
    oxidationNr = False
    
    def addToCompound(temp):
        if len(compoundA)>len(compoundB):
            compoundB.append(temp)
        else:
            compoundA.append(temp)
        
    for i, compound in enumerate(Compositions):
        temp = ""
        for j, letter in enumerate(compound):
            #checking for oxidation number
            if letter == "(":
                oxidationNr = True
            
            if oxidationNr == True:
                temp += letter
                if letter == ")":
                    oxidationNr = False
                    addToCompound(temp)
                    temp = ""
                continue
            #checking for oxygen
            elif letter == "O":
                if len(compoundA)>len(compoundB): 
                    compoundB.append(temp)
                continue
            #skipping the 3 in O3
            elif letter =="3":
                continue
            #checking for elements in compound
            else:
                for bigLetter in bigg_alphabet: 
                    if letter == bigLetter:
                        for bigLetter2 in bigg_alphabet:
                            if compound[j-1] == bigLetter2:      
                                compoundA.append(temp)
                                temp=""
                        temp += letter
                for smallLetter in small_alphabet: 
                    if letter == smallLetter:
                        temp += letter
                        if compound[j+1]!="(":
                            addToCompound(temp)
                            temp = ""
    return compoundA, compoundB
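As an alternative to the character-by-character scan, the same split can be sketched with a regular expression. This is a minimal sketch, not the function used in this notebook. It assumes every formula has the form A, B, then `O3`, where each site is one element symbol optionally followed by a parenthesised oxidation number. The names `split_abo3` and `find_elements_regex` are hypothetical.

```python
import re

# One element symbol (capital letter plus optional lowercase letter),
# optionally followed by a parenthesised oxidation number such as "(2+)".
_SITE = r"[A-Z][a-z]?(?:\([^)]*\))?"
_ABO3 = re.compile(rf"^({_SITE})({_SITE})O3$")

def split_abo3(formula):
    """Split one ABO3 formula string into its (A, B) parts."""
    match = _ABO3.match(formula)
    if match is None:
        raise ValueError(f"not an ABO3 formula: {formula!r}")
    return match.group(1), match.group(2)

def find_elements_regex(compositions):
    """Return two parallel lists: the A parts and the B parts."""
    pairs = [split_abo3(c) for c in compositions]
    a_atoms = [a for a, _ in pairs]
    b_atoms = [b for _, b in pairs]
    return a_atoms, b_atoms

print(find_elements_regex(["ScYbO3", "ScYO3", "FeBO3", "AgNO3"]))
```

Unlike the loop above, this version raises an error on a formula it cannot parse, so a malformed entry cannot silently shift the A and B lists out of step.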

In [5]:
newAatoms, newBatoms = findElements(newCompositions)
newData["Aatoms"] = newAatoms
newData["Batoms"] = newBatoms
print(newCompositions[-5:])
print(newAatoms[-5:])
print(newBatoms[-5:])

['ScYbO3', 'ScYO3', 'ErLaO3', 'HoLaO3', 'TmLaO3']
['Sc', 'Sc', 'Er', 'Ho', 'Tm']
['Yb', 'Y', 'La', 'La', 'La']


The same function also works on the existing lists, such as the labeled data set.

In [6]:
oldCompositions = oldData["Compound"].values.tolist()
oldAatoms, oldBatoms = findElements(oldCompositions)
print(oldCompositions[45:50])
print(oldAatoms[45:50])
print(oldBatoms[45:50])

['FeBO3', 'FeMnO3', 'FeSiO3', 'FeSO3', 'FeTiO3']
['Fe', 'Fe', 'Fe', 'Fe', 'Fe']
['B', 'Mn', 'Si', 'S', 'Ti']


In [7]:
oldData["Aatoms"] = oldAatoms
oldData["Batoms"] = oldBatoms

In [8]:
oldData

Unnamed: 0,Compound,Perovskite,Cubic,rA,rB,MA,MB,dAO,dBO,rA/rO,rB/rO,t,Aatoms,Batoms
0,AgBiO3,-1,0,1.460,0.760,65,86,1.805,2.060,1.081,0.563,0.942,Ag,Bi
1,AgBrO3,-1,0,1.460,0.470,65,95,1.805,1.840,1.081,0.348,1.092,Ag,Br
2,AgNO3,-1,0,1.460,0.130,65,82,1.805,1.432,1.081,0.096,1.343,Ag,N
3,AgPO3,-1,0,1.460,0.380,65,83,1.805,1.604,1.081,0.281,1.149,Ag,P
4,AgSbO3,-1,0,1.460,0.600,65,85,1.805,1.942,1.081,0.444,1.019,Ag,Sb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
385,YMnO3,1,-1,1.196,0.645,12,52,2.014,1.732,0.886,0.478,0.902,Y,Mn
386,YNiO3,1,-1,1.196,0.600,12,61,2.014,1.750,0.886,0.444,0.923,Y,Ni
387,YScO3,1,-1,1.196,0.745,12,11,2.014,1.849,0.886,0.552,0.859,Y,Sc
388,YTiO3,1,-1,1.196,0.670,12,43,2.014,1.791,0.886,0.496,0.891,Y,Ti


In [9]:
#A-atoms first 
rA = np.zeros(len(newData["Aatoms"]))
MA = np.zeros(len(newData["Aatoms"]))
dAO = np.zeros(len(newData["Aatoms"]))
rArO = np.zeros(len(newData["Aatoms"]))

for i, newAtom in enumerate(newData["Aatoms"]):
    for j, oldAtom in enumerate(oldData["Aatoms"]):
        if newAtom == oldAtom: 
            rA[i]   = oldData["rA"][j]
            MA[i]   = oldData["MA"][j]
            dAO[i]  = oldData["dAO"  ][j]
            rArO[i] = oldData["rA/rO"][j]

rB = np.zeros(len(newData["Batoms"]))
MB = np.zeros(len(newData["Batoms"]))
dBO = np.zeros(len(newData["Batoms"]))
rBrO = np.zeros(len(newData["Batoms"]))

for i, newAtom in enumerate(newData["Batoms"]):
    for j, oldAtom in enumerate(oldData["Batoms"]):
        if newAtom == oldAtom: 
            rB[i]   = oldData["rB"][j]
            MB[i]   = oldData["MB"][j]
            dBO[i]  = oldData["dBO"  ][j]
            rBrO[i] = oldData["rB/rO"][j]
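The nested loops compare every new atom with every old row and leave a value at zero when no match exists. The same lookup can be sketched with a per-atom table and `Series.map`, which marks unmatched atoms as `NaN` instead. The small frames `old` and `new` below are stand-ins built from values shown in the tables of this notebook, and `"Xx"` is a made-up symbol used only to show the unmatched case.

```python
import pandas as pd

# Tiny stand-in frames using values from the tables in this notebook.
old = pd.DataFrame({
    "Aatoms": ["Ag", "Ag", "Y"],
    "rA": [1.460, 1.460, 1.196],
    "MA": [65, 65, 12],
    "dAO": [1.805, 1.805, 2.014],
    "rA/rO": [1.081, 1.081, 0.886],
})
new = pd.DataFrame({"Aatoms": ["Y", "Ag", "Xx"]})  # "Xx" is a made-up symbol with no match

# One row per A atom; keep="last" mirrors the loop, where the last match wins.
lookup = old.drop_duplicates("Aatoms", keep="last").set_index("Aatoms")

# Map each new A atom to its properties; atoms without a match become NaN.
for column in ["rA", "MA", "dAO", "rA/rO"]:
    new[column] = new["Aatoms"].map(lookup[column])

unmatched = new.loc[new["rA"].isna(), "Aatoms"].tolist()
print(new)
print("unmatched:", unmatched)
```

The B-site columns could be filled the same way with a second lookup indexed by the B atom.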

Finally, we need the tolerance factor, which is defined as
$$ t = \frac{r_A + r_O }{\sqrt{2} (r_B + r_O)} .$$
We find $r_O$ from the AgBiO$_3$ entry, for which $r_A = 1.460$ and
$$ \frac{r_A}{r_O} = 1.081 , $$

so that

$$ r_O = \frac{r_A}{1.081} = \frac{1.460}{1.081} \approx 1.350601295 . $$

In [10]:
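# r_O as entered for the results below. Note that 1.460/1.081 ≈ 1.350601, so this
# constant differs from that ratio in the fifth decimal (effect on t of order 1e-5 or smaller).
rO = 1.35061295
t = (rA + rO)/(np.sqrt(2)*(rB+rO))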
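As a quick sanity check of the formula, the tolerance factor of AgBiO$_3$ can be recomputed from its radii and compared with the tabulated value $t = 0.942$ in the labeled data. This is a minimal sketch, and the function name `tolerance_factor` is ours.

```python
import numpy as np

def tolerance_factor(r_a, r_b, r_o):
    """Tolerance factor t = (r_A + r_O) / (sqrt(2) * (r_B + r_O))."""
    return (r_a + r_o) / (np.sqrt(2) * (r_b + r_o))

# r_O recovered from the AgBiO3 row: r_A = 1.460 and r_A/r_O = 1.081.
r_o = 1.460 / 1.081

# AgBiO3 has r_A = 1.460 and r_B = 0.760 in the labeled data.
t_agbio3 = tolerance_factor(1.460, 0.760, r_o)
print(round(float(t_agbio3), 3))
```

The function also accepts NumPy arrays for `r_a` and `r_b`, so it applies to whole columns at once.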

### The final data 

In [13]:
finalTestData = pd.DataFrame({})
finalTestData["Compound"] = newData["Compound"]
finalTestData["Aatom"] = newAatoms
finalTestData["Batom"] = newBatoms

finalTestData["rA"] = rA
finalTestData["rB"] = rB

finalTestData["MA"] = MA.astype(int)
finalTestData["MB"] = MB.astype(int)

finalTestData["dAO"] = dAO
finalTestData["dBO"] = dBO

finalTestData["rA/rO"] = rArO
finalTestData["rB/rO"] = rBrO

finalTestData["t"] = t
finalTestData

Unnamed: 0,Compound,Aatom,Batom,rA,rB,MA,MB,dAO,dBO,rA/rO,rB/rO,t
0,AgIO3,Ag,I,1.460,0.950,65,96,1.805,2.003,1.081,0.704,0.863858
1,AgPaO3,Ag,Pa,1.460,0.780,65,18,1.805,2.110,1.081,0.578,0.932785
2,AgReO3,Ag,Re,1.460,0.580,65,54,1.805,1.860,1.081,0.430,1.029416
3,AgUO3,Ag,U,1.460,0.760,65,20,1.805,2.075,1.081,0.563,0.941624
4,AgWO3,Ag,W,1.460,0.620,65,51,1.805,1.890,1.081,0.459,1.008520
...,...,...,...,...,...,...,...,...,...,...,...,...
620,ScYbO3,Sc,Yb,0.870,0.868,11,39,1.849,1.954,0.644,0.643,0.707744
621,ScYO3,Sc,Y,0.870,0.900,11,12,1.849,2.014,0.644,0.667,0.697681
622,ErLaO3,Er,La,1.179,1.032,35,13,1.979,2.148,0.873,0.764,0.750733
623,HoLaO3,Ho,La,1.194,1.032,33,13,1.992,2.148,0.884,0.764,0.755185


In [14]:
finalTestData.to_csv("../../data/625TestData.csv",  index = False)