## Screening the Materials Project for Previously Unsynthesized Materials 
## Revision
#### Date: July 24, 2023
 
**Authors**: Matthew McDermott and Max Gallant

From Nathan: "what if you were to allow materials with E_hull < 20 meV/atom? Or < 50 meV/atom? And so on..."

This revision uses a cutoff of 50 meV/atom (see `new_metastable_cutoff` below).

#### Imports

In [1]:
from collections import OrderedDict
import re

import emmet
from emmet.core.summary import HasProps
import matplotlib.pyplot as plt
from monty.serialization import dumpfn, loadfn
from mp_api.client import MPRester
import pandas 
from pymatgen.core.composition import Composition, Element
import seaborn as sns

sns.set("talk",style="ticks")

%load_ext autoreload
%autoreload 2

In [2]:
radioactive = ["Ac","Th","Pa","U","Np","Pu","Tc"]  # all other radioactive elements are not on MP
expensive = ["Cs","Rb","Pm","Tm","Sc","Tl","Re","Ir","Os","Pd","Ru","Pt","Rh","Au"]
dangerous = ["Hg","As"]

exclude_elements = radioactive + expensive + dangerous

#### Get summary documents with MPRester

In [3]:
new_metastable_cutoff = 0.050  # eV/atom
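
Since the question above is phrased in meV/atom while the query takes eV/atom, a small helper can keep the unit conversion in one place when comparing cutoffs. This is a minimal sketch on made-up energies: the values in `e_hull_values` and the helper `count_within_cutoff` are hypothetical, and inclusive bounds are an assumption rather than a statement about the MP API.

```python
# Hypothetical energies above hull in eV/atom (illustrative values, not MP data).
e_hull_values = [0.0, 0.004, 0.018, 0.023, 0.036, 0.049, 0.072]


def count_within_cutoff(values, cutoff_mev):
    """Count entries with 0 <= E_hull <= cutoff, where the cutoff is in meV/atom."""
    cutoff_ev = cutoff_mev / 1000.0  # convert meV/atom to eV/atom
    return sum(1 for v in values if 0.0 <= v <= cutoff_ev)


# Compare a few candidate cutoffs (0, 20, and 50 meV/atom).
counts = {cutoff: count_within_cutoff(e_hull_values, cutoff) for cutoff in (0, 20, 50)}
print(counts)
```

In the notebook itself, the same comparison would mean rerunning the query with a different `new_metastable_cutoff`.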

In [4]:
with MPRester() as mpr:
    materials = mpr.summary.search(theoretical=True, 
                                   energy_above_hull=(0.000,new_metastable_cutoff), 
                                   exclude_elements=exclude_elements, 
                                   all_fields=False,
                                   deprecated=False,
                                   has_props=[HasProps("oxi_states")],
                                   fields=["material_id","composition","formation_energy_per_atom", 
                                          "energy_above_hull", "equilibrium_reaction_energy_per_atom"])

Retrieving SummaryDoc documents:   0%|          | 0/29605 [00:00<?, ?it/s]

#### Get oxidation state docs with MPRester

In [5]:
with MPRester() as mpr:
    oxi_states = mpr.oxidation_states.search([doc.material_id for doc in materials], fields=["material_id","average_oxidation_states", "structure"])

Retrieving OxidationStateDoc documents:   0%|          | 0/29605 [00:00<?, ?it/s]

#### Join and filter by compounds with pre-computed oxidation states
**WARNING**: This may take a few minutes when there are many materials (i.e., a large `e_hull` cutoff).

In [6]:
all_data = []
for d1 in materials:
    d2 = list(filter(lambda i: i.material_id == d1.material_id, oxi_states))[0]
    average_oxidation_states = d2.average_oxidation_states
    if average_oxidation_states:
        doc = d1.dict()
        del doc["fields_not_requested"]
        doc["average_oxidation_states"] = average_oxidation_states
        doc["structure"] = d2.structure
        all_data.append(doc)
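
The loop above rescans the full `oxi_states` list for every material, which is why the join can be slow. One possible alternative, sketched here on toy dictionaries rather than the real document objects, is to index one list by `material_id` first. The names `materials_toy`, `oxi_toy`, and `oxi_by_id` are hypothetical; with the real documents, attribute access (`doc.material_id`) would replace the dictionary lookups.

```python
# Toy stand-ins for the two result lists; the field names mirror the notebook.
materials_toy = [
    {"material_id": "mp-1", "energy_above_hull": 0.0},
    {"material_id": "mp-2", "energy_above_hull": 0.01},
    {"material_id": "mp-3", "energy_above_hull": 0.02},
]
oxi_toy = [
    {"material_id": "mp-2", "average_oxidation_states": {"Ba": 2.0, "F": -1.0}},
    {"material_id": "mp-1", "average_oxidation_states": {}},  # no oxidation states assigned
]

# Build the index once, so each lookup is a dictionary access instead of a list scan.
oxi_by_id = {d["material_id"]: d for d in oxi_toy}

joined = []
for m in materials_toy:
    o = oxi_by_id.get(m["material_id"])  # None if there is no matching document
    if o is None or not o["average_oxidation_states"]:
        continue  # skip entries without usable oxidation states
    joined.append({**m, "average_oxidation_states": o["average_oxidation_states"]})

print([d["material_id"] for d in joined])
```

Unlike the `list(filter(...))[0]` lookup, `dict.get` does not raise when a material has no matching oxidation-state document.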

In [7]:
len(all_data)  # this number will be lower than the original number

24278

#### Removing any duplicate formulas

In [8]:
seen_formulas = set()
all_data_no_duplicates = []

for d in sorted(all_data, key=lambda i: i["formation_energy_per_atom"]):
    formula = d["composition"].reduced_formula 
    if formula in seen_formulas:
        continue
    else:
        all_data_no_duplicates.append(d)
        seen_formulas.add(formula)

In [9]:
len(all_data_no_duplicates)

16406
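
Because the entries are sorted by formation energy before the loop, the first entry seen for each reduced formula is the lowest-energy one, and that is the entry kept. A minimal sketch with made-up rows (the formulas, IDs, and energies are hypothetical):

```python
# Hypothetical entries: two share a formula and differ in formation energy (eV/atom).
rows = [
    {"formula": "ABO3", "mpid": "mp-a", "e_f": -2.10},
    {"formula": "ABO3", "mpid": "mp-b", "e_f": -2.25},
    {"formula": "A2BO4", "mpid": "mp-c", "e_f": -1.90},
]

seen = set()
unique = []
# Ascending sort puts the most negative formation energy first.
for row in sorted(rows, key=lambda r: r["e_f"]):
    if row["formula"] not in seen:
        seen.add(row["formula"])
        unique.append(row)

print([r["mpid"] for r in unique])
```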

#### Filter by unreasonable oxidation states
If the oxidation state of any species is outside the range spanned by the element's `common_oxidation_states` property (see pymatgen), we discard the compound.

For convenience, we print each discarded composition along with the first offending species.
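
The range check described above can be sketched on its own. The table `COMMON_STATES` below is illustrative only (the notebook reads the values from pymatgen, and they may differ), and `is_reasonable` is a hypothetical helper name. Note that `int()` truncates toward zero, so a fractional state such as 3.25 is compared as 3.

```python
# Illustrative values only; the notebook takes these from pymatgen's
# `common_oxidation_states` property, which may list different states.
COMMON_STATES = {"Ti": (4,), "Fe": (2, 3), "O": (-2,)}


def is_reasonable(element, oxi_state):
    """Check whether the truncated oxidation state lies in [min, max] of the common states."""
    states = COMMON_STATES[element]
    return int(oxi_state) in range(min(states), max(states) + 1)


print(is_reasonable("Ti", 4.0))   # inside the range
print(is_reasonable("Ti", 3.25))  # truncated to 3, below the minimum
print(is_reasonable("Fe", 2.5))   # truncated to 2, inside the range
```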

In [10]:
filtered_data = []
for d in all_data_no_duplicates:
    species = set(d["structure"].species)
    bad = False
    for specie in species:
        common_oxidation_states = specie.element.common_oxidation_states
        if int(specie.oxi_state) not in range(min(common_oxidation_states), max(common_oxidation_states)+1):
            print(specie, d["composition"])
            bad=True
            break
    if not bad:
        filtered_data.append(d)

Yb2+ Yb2 F4
Pr2+ Ca18 Pr2 F40
Yb2+ Yb8 Al8 F40
Yb2+ Yb4 Hf4 O12
Ti3+ Ba4 Ca4 Ti4 F28
Ag2+ Ca6 Hf6 Ag2 F40
Sm2+ Ce4 Sm1 O9
Ti3+ Ca4 Ti4 F20
Ti3+ Sr8 Ti8 F40
La2+ La3 Ce4 O12
Ti3+ Ba4 Mg4 Ti4 F28
Ti2+ Na1 Hf2 Ti1 F11
V3+ Ba4 Ca4 V4 F28
Ti3+ La10 Ti10 O34
Ta4+ Y11 Ta5 O28
La2+ La20 Al3 Si9 O52
Tb4+ Tb1 Ce3 O8
Yb2+ Yb4 Ti4 O12
Ti3+ Sr2 Li2 Ti2 F12
La2+ Sr2 La18 Al1 Si11 O52
V4+ Zr15 V1 O32
Ti3+ Nd20 Ti20 O68
Ti3+ Na4 Sr4 Ti4 F24
V3+ Ba4 Mg4 V4 F28
La2+ La3 Ti4 O12
Pr4+ Ce3 Pr1 O8
Ti3+ Ba8 Na8 Ti8 F48
V4+ Zr11 V1 O24
Cr2+ Ba4 Cr2 F12
V3+ Ca2 V2 F10
Ti3+ Sm12 Ta4 Ti8 O42
Ti3+ La4 Mg1 Ti3 O12
Ti3+ Nd20 Ti18 Ga2 O68
Ti3+ Li1 La4 Ti3 O12
Ti3+ Pr20 Ti18 Ga2 O68
V3+ Ba4 Ca2 Mn2 V4 F28
V2+ Ba6 Al3 V6 F33
Nd2+ Nd3 Ti4 O12
Ho2+ Ca2 Ho2 Ti4 O12
V4+ Zr14 V2 O32
Pr4+ Ce2 Pr1 O6
Ti3.25+ Ca1 Y3 Ti4 O12
Ti3+ Nd18 Ti20 O60
Ti3+ Sr1 La2 Ti3 O9
Ti3.5+ Ca2 Y2 Ti4 O12
Cr2+ Ba6 Al3 Cr6 F33
Mo3+ Ba4 Ca4 Mo4 F28
Ti3.75+ Ca3 Y1 Ti4 O12
Cr2+ Zr1 Cr1 F6
Ti3+ Sr4 La1 Ti5 O15
Ti3+ Nd19 Ti20 O60
Pr2+ Ca2 Pr2 Ti4 O12
V3

In [11]:
filtered_data_df = []
for d in filtered_data:
    clean_d = {"formula":d["composition"].reduced_formula}
    clean_d["mpid"] = str(d["material_id"])
    clean_d["e_hull"] = d["energy_above_hull"]
    clean_d["e_f"] = round(d["formation_energy_per_atom"],4)
    e_d = d["equilibrium_reaction_energy_per_atom"]
    clean_d["e_d"] = round(e_d, 4) if e_d is not None else None  # keep a true 0.0 instead of dropping it
    clean_d["avg_oxi"] = d["average_oxidation_states"]
    clean_d["num_elems"] = len(d["composition"].elements)
    filtered_data_df.append(clean_d)
    
df = pandas.DataFrame(filtered_data_df)

In [12]:
len(filtered_data_df)

11948

#### Removing compounds which have other synthesized polymorphs

In [13]:
with MPRester() as mpr:
    previously_synthesized = mpr.summary.search(theoretical=False, 
                                                exclude_elements=exclude_elements, 
                                                all_fields=False,
                                                deprecated=False,
                                                has_props=[HasProps("oxi_states")],
                                                fields=["composition"])

Retrieving SummaryDoc documents:   0%|          | 0/32545 [00:00<?, ?it/s]

In [14]:
known_formulas = {d.composition.reduced_formula for d in previously_synthesized}

No electronegativity for He. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.
No electronegativity for Ar. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.
No electronegativity for Ne. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


In [15]:
filtered_data_df_2 = [d for d in filtered_data_df if d['formula'] not in known_formulas]

In [16]:
df = pandas.DataFrame(filtered_data_df_2)

In [17]:
len(df)

10635
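
Matching on reduced formula means that a candidate is dropped whenever any polymorph of the same composition is already known. The same pattern recurs in the later filtering steps, so a small helper that accepts several reference sets may be convenient. This is a sketch under assumptions: `drop_known` is a hypothetical name, and the formulas are placeholders rather than real screening results.

```python
def drop_known(rows, *known_sets):
    """Keep rows whose formula appears in none of the given reference sets."""
    known = set().union(*known_sets)  # merge the sets once for fast membership tests
    return [row for row in rows if row["formula"] not in known]


# Placeholder data for illustration only.
candidates = [{"formula": "A2BX4"}, {"formula": "NaCl"}, {"formula": "ABX3"}]
known_on_mp = {"NaCl", "MgO"}
known_elsewhere = {"ABX3"}

novel = drop_known(candidates, known_on_mp, known_elsewhere)
print([row["formula"] for row in novel])
```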

#### Removing binary compounds

In [18]:
df_2 = df[df["num_elems"]>2]
df_2

| | formula | mpid | e_hull | e_f | e_d | avg_oxi | num_elems |
|---|---|---|---|---|---|---|---|
| 0 | BaGdCrFeO6 | mp-1518724 | 0.000000 | -5.1543 | -2.4397 | {'Ba': 2.0, 'Gd': 3.0, 'Cr': 5.0, 'Fe': 2.0, '... | 5 |
| 3 | SrLa5F17 | mp-675492 | 0.022906 | -4.4354 | | {'Sr': 2.0, 'La': 3.0, 'F': -1.0} | 3 |
| 4 | BaLu2F8 | mp-1214962 | 0.000000 | -4.4284 | -0.0539 | {'Ba': 2.0, 'Lu': 3.0, 'F': -1.0} | 3 |
| 5 | BaY2F8 | mp-768240 | 0.000000 | -4.3848 | -0.0106 | {'Ba': 2.0, 'Y': 3.0, 'F': -1.0} | 3 |
| 6 | BaCaLu2F10 | mp-1228137 | 0.008439 | -4.3762 | | {'Ba': 2.0, 'Ca': 2.0, 'Lu': 3.0, 'F': -1.0} | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10624 | Zn3SbN3 | mp-1271278 | 0.000000 | -0.0506 | -0.0000 | {'Zn': 2.0, 'Sb': 3.0, 'N': -3.0} | 3 |
| 10625 | Zn4SnN4 | mp-1247422 | 0.035929 | -0.0441 | | {'Zn': 2.0, 'Sn': 4.0, 'N': -3.0} | 3 |
| 10627 | Zn2SbN3 | mp-1029334 | 0.020381 | -0.0298 | | {'Zn': 2.0, 'Sb': 5.0, 'N': -3.0} | 3 |
| 10630 | Bi2(CN2)3 | mp-1246532 | 0.023457 | 0.0235 | | {'Bi': 3.0, 'C': 4.0, 'N': -3.0} | 3 |


#### Filter by literature data

In [19]:
f1 = loadfn("solid-state_dataset_20200713.json.xz")
f2 = loadfn("sol-gel_dataset_20200713.json.xz")

In [20]:
def get_formulas_from_lit(f):
    """Collect reduced formulas of all precursors and targets in a literature dataset."""
    all_formulas = set()
    for r in f["reactions"]:
        for precursor in r["precursors"]:
            try:
                all_formulas.add(Composition(precursor["material_formula"]).reduced_formula)
            except Exception:
                pass  # skip formulas that pymatgen cannot parse
        try:
            all_formulas.add(Composition(r["target"]["material_formula"]).reduced_formula)
        except Exception:
            pass  # skip formulas that pymatgen cannot parse

    return all_formulas

In [21]:
all_lit_formulas = get_formulas_from_lit(f1) | get_formulas_from_lit(f2)

In [22]:
df_3 = df_2[[f not in all_lit_formulas for f in df_2["formula"]]]

In [23]:
len(df_3)

10059

#### Label common composition types

In [24]:
def get_formula_type_by_anion(formula):
    polyatomics={
        'PO3': 'phosphite',
        'PO4': 'phosphate',
        'HPO4': 'hydrogen phosphate',
        'H2PO4': 'dihydrogen phosphate',
        'BO3': 'orthoborate',
        'BO2': 'metaborate',
        'B2O5': 'diborate',
        'B3O7': 'triborate',
        'B4O7': 'tetraborate',
        'ClO4': 'perchlorate',
        'ClO3': 'chlorate',
        'ClO2': 'chlorite',
        'ClO': 'hypochlorite',
        'OCl': 'oxychloride',
        'BrO4': 'perbromate',
        'BrO3': 'bromate',
        'BrO2': 'bromite',
        'BrO': 'hypobromite',
        'IO4': 'periodate',
        'IO3': 'iodate',
        'IO2': 'iodite',
        'IO': 'hypoiodite',
        'NH4': 'ammonium',
        'NO2': 'nitrite',
        'NO3': 'nitrate',
        'SO3': 'sulfite',
        'SO4': 'sulfate',
        'HSO3': 'bisulfite',
        'HSO4': 'bisulfate',
        'S2O3': 'thiosulfate',
        'HS2O3': 'hydrogen thiosulfate',
        'C2O4': 'oxalate',
        'HC2O4': 'hydrogen oxalate',
        'OH': 'hydroxide',
        'CH3COO': 'acetate',
        'CO3': 'carbonate',
        'HCO3': 'bicarbonate',
        'CrO4': 'chromate',
        'Cr2O7': 'dichromate',
        'MnO4': 'permanganate',
        'CN': 'cyanide',
        'OCN': 'cyanate',
        'SCN': 'thiocyanate',
        'C8H4O4': 'phthalate',
        'C2H3O2': 'acetate',
        'HCO2': 'formate',
        'SiO4': 'silicate'
    }
    polyatomics = OrderedDict({k: polyatomics[k] for k in sorted(polyatomics, key=len, reverse=True)})
    anions = {"H": "hydride",
              "F": "fluoride",
              "Cl": "chloride",
              "Br": "bromide",
              "I": "iodide",
              "O": "oxide",
              "S": "sulfide",
              "Se": "selenide",
              "Te": "telluride",
              "N": "nitride",
              "P": "phosphide",
              "As": "arsenide",
              "Sb": "antimonide",
              "Bi": "bismuthide",
              "C": "carbide",
              "Si": "silicide",
              "Ge": "germanide",
              "Sn": "stannide",
              "Pb": "plumbide",
              "B": "boride",
              "Al": "aluminide",
              "Ga": "gallide",
              "In": "indide",
              "Tl": "thallide",
              "Ni": "nickelide",
              "Mo": "molybdenide"}
    
    for p, name in polyatomics.items():
        if p in formula:
            return name

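    # Otherwise, classify by the last element symbol in the formula.
    output = None
    try:
        output = anions[re.findall(r"([A-Z][a-z]?)(\d*\)?\d*$)", formula)[0][0]]
    except (IndexError, KeyError):
        print(formula)  # no regex match, or the last element is not a listed anion
        
    return output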
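
To see how the two-stage lookup behaves, here is a trimmed sketch of the same idea with much shorter tables. The names `POLYATOMICS`, `ANIONS`, and `classify` are hypothetical, and the regex drops the unused second capture group. Because the polyatomic test is a plain substring match, a formula string such as `BiClO` is labeled "hypochlorite" simply because it contains `ClO`, so these labels are best read as a heuristic.

```python
import re

# Trimmed, illustrative lookup tables (the notebook's dictionaries are much longer).
POLYATOMICS = {"SO4": "sulfate", "CO3": "carbonate", "ClO": "hypochlorite"}
ANIONS = {"F": "fluoride", "Cl": "chloride", "O": "oxide", "N": "nitride"}


def classify(formula):
    """Return a composition label, or None if nothing matches."""
    # Try the longest polyatomic keys first, so a longer key wins over a shorter one.
    for key in sorted(POLYATOMICS, key=len, reverse=True):
        if key in formula:  # plain substring test, as in the notebook
            return POLYATOMICS[key]
    # Otherwise use the last element symbol in the formula string.
    match = re.search(r"([A-Z][a-z]?)\d*\)?\d*$", formula)
    return ANIONS.get(match.group(1)) if match else None


examples = ["Tb2(SO4)3", "BaLu2F8", "BiClO", "NdSiAg"]
labels = {formula: classify(formula) for formula in examples}
print(labels)
```

A formula whose last element is not in the anion table (here `NdSiAg`) gets no label, which corresponds to the formulas the notebook prints as unclassified.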

In [25]:
df_3 = df_3.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
df_3["type"] = [get_formula_type_by_anion(f) for f in df_3["formula"]]  # the formulas printed below could not be classified

NdSiAg
LaSiAg
PrSiAg
CeSiAg
AlSi3W2
LiSiAg2

Desired composition types:

In [26]:
desired = ["oxide","sulfite","sulfate","bisulfate","silicate",
           "fluoride","chloride","bromide","orthoborate", "metaborate",
           "hydroxide","carbonate","tetraborate","phosphate","phosphite",
           "chlorate","chlorite","hypochlorite","oxychloride","bicarbonate",
          ]

In [27]:
df_4 = df_3[df_3["type"].isin(desired)]

In [28]:
df_4.sort_values("e_d").head(10)

| | formula | mpid | e_hull | e_f | e_d | avg_oxi | num_elems | type |
|---|---|---|---|---|---|---|---|---|
| 0 | BaGdCrFeO6 | mp-1518724 | 0.0 | -5.1543 | -2.4397 | {'Ba': 2.0, 'Gd': 3.0, 'Cr': 5.0, 'Fe': 2.0, '... | 5 | oxide |
| 7490 | Na3ErBr6 | mp-1210529 | 0.0 | -2.0638 | -0.9318 | {'Na': 1.0, 'Er': 3.0, 'Br': -1.0} | 3 | bromide |
| 7532 | Li3ErBr6 | mp-1222492 | 0.0 | -2.0536 | -0.9269 | {'Li': 1.0, 'Er': 3.0, 'Br': -1.0} | 3 | bromide |
| 4216 | Tb2(SO4)3 | mp-1208801 | 0.0 | -2.6597 | -0.3736 | {'Tb': 3.0, 'S': 6.0, 'O': -2.0} | 3 | sulfate |
| 4434 | Sm2(SO4)3 | mp-1208980 | 0.0 | -2.6233 | -0.373 | {'Sm': 3.0, 'S': 6.0, 'O': -2.0} | 3 | sulfate |
| 4195 | Dy2(SO4)3 | mp-1213482 | 0.0 | -2.6635 | -0.3698 | {'Dy': 3.0, 'S': 6.0, 'O': -2.0} | 3 | sulfate |
| 4156 | Ho2(SO4)3 | mp-768978 | 0.0 | -2.6695 | -0.3688 | {'Ho': 3.0, 'S': 6.0, 'O': -2.0} | 3 | sulfate |
| 4080 | Lu2(SO4)3 | mp-768453 | 0.0 | -2.6833 | -0.3559 | {'Lu': 3.0, 'S': 6.0, 'O': -2.0} | 3 | sulfate |
| 1423 | La5Mn5O16 | mp-531717 | 0.0 | -3.1633 | -0.3022 | {'La': 3.0, 'Mn': 3.4, 'O': -2.0} | 3 | oxide |
| 3356 | YbCO3 | mp-755213 | 0.0 | -2.815 | -0.3007 | {'Yb': 3.0, 'C': 3.0, 'O': -2.0} | 3 | carbonate |


In [29]:
df_4.value_counts("type")  # most are oxides!

type
oxide           5732
phosphate        743
fluoride         737
sulfate          237
chloride         196
silicate         116
orthoborate       98
bromide           85
phosphite         76
hypochlorite      55
carbonate         41
metaborate        26
chlorite          22
chlorate          16
sulfite            5
Name: count, dtype: int64

In [30]:
len(df_4)

8185

#### Removing any missed ICSD compounds
Formulas are checked against the Citrine ICSD dataset.

In [31]:
citrine = set(loadfn("icsd_compounds.json.gz"))  # a set makes the membership tests below fast
df_5 = df_4[[f not in citrine for f in df_4["formula"]]]

In [32]:
len(df_5)

5850

#### Removing compounds that appear in De Gruyter's Handbook of Inorganic Substances (2017)

In [33]:
de_gruyter = loadfn("de_gruyter_formulas.json.gz")
de_gruyter = set(de_gruyter)

In [34]:
df_6 = df_5[[f not in de_gruyter for f in df_5["formula"]]]

In [35]:
len(df_6)

4354
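
To put the filtering steps in perspective, the counts printed in the cells above can be collected into a simple funnel. The numbers are copied from the notebook outputs; the step descriptions are paraphrases, and the binary-removal and literature steps are merged because no separate count is printed for `df_2`.

```python
# (step description, entries remaining), copied from the cell outputs above.
funnel = [
    ("theoretical, E_hull <= 50 meV/atom", 29605),
    ("has average oxidation states", 24278),
    ("unique reduced formulas", 16406),
    ("reasonable oxidation states", 11948),
    ("no synthesized polymorph on MP", 10635),
    ("non-binary and absent from literature datasets", 10059),
    ("desired composition types", 8185),
    ("not in Citrine ICSD dataset", 5850),
    ("not in De Gruyter handbook", 4354),
]

initial = funnel[0][1]
previous = initial
summary = []
for step, count in funnel:
    removed = previous - count                      # entries dropped by this step
    retained_pct = round(100 * count / initial, 1)  # share of the initial query kept
    summary.append((step, count, removed, retained_pct))
    previous = count

for step, count, removed, retained_pct in summary:
    print(f"{step:<50} {count:>6} (-{removed}) {retained_pct}%")
```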

In [36]:
df_6.sort_values(["e_d", "e_hull"]).head(10)

| | formula | mpid | e_hull | e_f | e_d | avg_oxi | num_elems | type |
|---|---|---|---|---|---|---|---|---|
| 0 | BaGdCrFeO6 | mp-1518724 | 0.0 | -5.1543 | -2.4397 | {'Ba': 2.0, 'Gd': 3.0, 'Cr': 5.0, 'Fe': 2.0, '... | 5 | oxide |
| 1423 | La5Mn5O16 | mp-531717 | 0.0 | -3.1633 | -0.3022 | {'La': 3.0, 'Mn': 3.4, 'O': -2.0} | 3 | oxide |
| 988 | Eu5P3O13 | mp-1212818 | 0.0 | -3.3259 | -0.2512 | {'Eu': 2.2, 'P': 5.0, 'O': -2.0} | 3 | oxide |
| 8408 | Eu2PCl | mp-1225319 | 0.0 | -1.779 | -0.2175 | {'Eu': 2.0, 'P': -3.0, 'Cl': -1.0} | 3 | chloride |
| 57 | YbGd3F12 | mp-1215594 | 0.0 | -4.1427 | -0.2041 | {'Yb': 3.0, 'Gd': 3.0, 'F': -1.0} | 3 | fluoride |
| 191 | BaCeEuHfO6 | mp-1516655 | 0.0 | -3.847 | -0.1453 | {'Ba': 2.0, 'Ce': 4.0, 'Eu': 2.0, 'Hf': 4.0, '... | 5 | oxide |
| 895 | CaWF6 | mp-1395540 | 0.0 | -3.3691 | -0.1426 | {'Ca': 2.0, 'W': 4.0, 'F': -1.0} | 3 | fluoride |
| 6709 | HfZn(SO4)3 | mp-2713876 | 0.0 | -2.2481 | -0.1379 | {'Hf': 4.0, 'Zn': 2.0, 'S': 6.0, 'O': -2.0} | 4 | sulfate |
| 1294 | MgWF6 | mp-1407403 | 0.0 | -3.207 | -0.1286 | {'Mg': 2.0, 'W': 4.0, 'F': -1.0} | 3 | fluoride |
| 3049 | Hf2Si(SO6)2 | mp-2713341 | 0.0 | -2.8616 | -0.1188 | {'Hf': 4.0, 'Si': 4.0, 'S': 6.0, 'O': -2.0} | 4 | oxide |


#### Saving as Excel file
**Note**: this list has not been manually curated; some of these compounds may be removed after further searches of the recent literature.

In [37]:
df_6.sort_values(["e_d", "e_hull"]).to_excel("screened_targets_50meV.xlsx")