include-in-header:
  text: |
    <style>
    .cell-output-stdout {
      overflow-y: scroll;
      max-height: 400px;
    }
    </style>

# Retrieving Data from Chalcogenide Perovskites

::: {.callout-note title="Motivation"}
Here we want to generate an output that will allow us to train a composition-property prediction model.

- We want to ensure that the output formulas retrieved can be parsed into a pymatgen `Composition` object. This lets us reuse the extracted data with popular packages for this task, such as those leading in [MatBench](https://hackingmaterials.lbl.gov/matbench/).
- We want to control the stoichiometry of the compounds retrieved and ensure that the formulas fulfill the criterion of charge neutrality. For this, we will employ a constrained decoding approach, as exemplified in the [constrained decoding]() notebook.
- We want to focus on a given class of materials, in this case chalcogenide perovskite-based semiconductors. We will validate whether the retrieved formulas fulfill some stability criteria for being a perovskite. For this, we will employ an approach similar to the one in the [validation example]() notebook.
:::

## What are Chalcogenide Perovskites?

::: {.callout-tip title="Motivation"}
We aim to constrain our language model to output formulas that adhere to specific chemical and structural criteria, focusing on chalcogenide perovskite-based compounds. 

Chalcogenide perovskites are a subclass of perovskites where the anions are chalcogens (**sulfur**, **selenium**, or **tellurium**). 

These materials are of great interest due to their unique properties that make them suitable for energy applications, such as:

- **Optoelectronic Properties**: They exhibit favorable band gaps and strong light absorption, making them ideal for photovoltaic applications.
- **Stability**: Compared to their halide counterparts, chalcogenide perovskites often show enhanced chemical stability.
- **Flexibility in Composition**: The ability to tune the composition allows for optimization of properties for specific applications like solar cells, photodetectors, and LEDs.



### The Perovskite-type Structure

The perovskite-type structure is defined based on the [mineral CaTiO₃](https://en.wikipedia.org/wiki/Calcium_titanate). To be classified as a perovskite, a material must exhibit close structural similarity to this archetype. Breternitz and Schorr have clarified a [set of criteria](https://doi.org/10.1002/aenm.201802366) to determine whether a material should be called a perovskite.

### Key Criteria for Perovskites

1. **Stoichiometry**: ABX₃, or at least an A🟠: B🔵: X🔴 ratio of 1:1:3.
2. **B-cation Coordination**: Octahedral (or distorted octahedral) coordination.
3. **3D Network**: The [BX₆] octahedra should form an all-corner-sharing 3D network.

<img src="BaZrS3.png" width="320" height="300"/>

### Implementation

We will use [pymatgen](https://pymatgen.org/) to generate a list of ternary chalcogenide perovskites that fulfill at least the first criterion for sulfide and selenide compounds (the 1:1:3 stoichiometry one). 
The [`oxi_state_guesses` method](https://pymatgen.org/pymatgen.core.html#pymatgen.core.composition.Composition.oxi_state_guesses) of the pymatgen `Composition` object will ensure the stoichiometry is correct and that the compound is charge balanced. This is performed by using guessed oxidation states for the elemental combinations screened.
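
To make the charge-balance idea concrete, here is a minimal, library-free sketch. It is not pymatgen's algorithm: the `CANDIDATE_STATES` table and the `neutral_assignments` helper are hypothetical names with hand-picked oxidation states, used only to show which cation charges can balance three X²⁻ anions in an ABX₃ formula.

```python
from itertools import product

# Hand-picked oxidation states for illustration only (an assumption of this
# sketch; pymatgen uses its own, more complete tables)
CANDIDATE_STATES = {"Ba": [2], "Zr": [2, 3, 4], "La": [3], "Ti": [2, 3, 4]}

def neutral_assignments(cation_a, cation_b, anion_charge=-2, n_anions=3):
    """Return the (A, B) oxidation-state pairs that make A B X3 charge neutral."""
    return [
        (a, b)
        for a, b in product(CANDIDATE_STATES[cation_a], CANDIDATE_STATES[cation_b])
        if a + b + n_anions * anion_charge == 0
    ]

print(neutral_assignments("Ba", "Zr"))  # [(2, 4)]
print(neutral_assignments("La", "Ti"))  # [(3, 3)]
```

With the anion fixed at -2, the two cation charges must sum to +6, which is why only 2+/4+, 3+/3+, and 1+/5+ combinations can appear.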

In [2]:
from pymatgen.core import Composition

def generate_compositions(element_symbols, anions=['S', 'Se']):
    valid_compositions = []

    for i, cation1 in enumerate(element_symbols):
        for cation2 in element_symbols[i+1:]:  # Ensure unique pairs
            if cation1 != cation2:  # Ensure that two cations are not the same
                for anion in anions:
                    formula = f"{cation1}{cation2}{anion}3"
                    try:
                        comp = Composition(formula)
                        # Guess oxidation states
                        oxi_states_override = {anion: [-2]}
                        oxi_state_guesses = comp.oxi_state_guesses(oxi_states_override=oxi_states_override)

                        for guess in oxi_state_guesses:
                            if all(val is not None for val in guess.values()):
                                valid_compositions.append((comp.reduced_formula, guess))
                                break
                    except ValueError:
                        continue  # Skip invalid combinations

    return valid_compositions

element_symbols = ['Sr', 'Ba', 'Hf', 'Zr', 'Eu', 'Ti', 'La', 'Ce', 'Sm', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu']
valid_compositions = generate_compositions(element_symbols)

for formula, oxi_states in valid_compositions:
    print(f"Formula: {formula}, Oxidation States: {oxi_states}")

print(f"Number of valid compositions: {len(valid_compositions)}")

valid_formulas = [formula for formula, _ in valid_compositions]


Formula: SrHfS3, Oxidation States: {'Sr': 2.0, 'Hf': 4.0, 'S': -2.0}
Formula: SrHfSe3, Oxidation States: {'Sr': 2.0, 'Hf': 4.0, 'Se': -2.0}
Formula: SrZrS3, Oxidation States: {'Sr': 2.0, 'Zr': 4.0, 'S': -2.0}
Formula: SrZrSe3, Oxidation States: {'Sr': 2.0, 'Zr': 4.0, 'Se': -2.0}
Formula: SrTiS3, Oxidation States: {'Sr': 2.0, 'Ti': 4.0, 'S': -2.0}
Formula: SrTiSe3, Oxidation States: {'Sr': 2.0, 'Ti': 4.0, 'Se': -2.0}
Formula: SrCeS3, Oxidation States: {'Sr': 2.0, 'Ce': 4.0, 'S': -2.0}
Formula: SrCeSe3, Oxidation States: {'Sr': 2.0, 'Ce': 4.0, 'Se': -2.0}
Formula: SrTbS3, Oxidation States: {'Sr': 2.0, 'Tb': 4.0, 'S': -2.0}
Formula: SrTbSe3, Oxidation States: {'Sr': 2.0, 'Tb': 4.0, 'Se': -2.0}
Formula: BaHfS3, Oxidation States: {'Ba': 2.0, 'Hf': 4.0, 'S': -2.0}
Formula: BaHfSe3, Oxidation States: {'Ba': 2.0, 'Hf': 4.0, 'Se': -2.0}
Formula: BaZrS3, Oxidation States: {'Ba': 2.0, 'Zr': 4.0, 'S': -2.0}
Formula: BaZrSe3, Oxidation States: {'Ba': 2.0, 'Zr': 4.0, 'Se': -2.0}
Formula: BaTiS3, Oxi

We can turn the compositions into reduced formulas using pymatgen (note that we lose the oxidation state information in this example).

To enable constrained decoding, we will use one of the most popular packages for this task [`instructor`](https://jxnl.github.io/instructor/).
It is built on [`pydantic`]() and can leverage function calling and JSON-mode of the OpenAI API as well as other constrained sampling approaches.
This part of the notebook is based on the example provided in the [constrained decoding example notebook](index.ipynb).

In [2]:
from pydantic import BaseModel, Field
from typing import List, Optional, Literal
# import erdantic as erd
# import instructor
# from IPython.display import SVG
# from openai import OpenAI
# from dotenv import load_dotenv
# load_dotenv('../.env', override=True)

## Defining a data schema

For most constrained generation tasks, we need to define a data schema in a programmatic way.
The most common way to do so is to use `pydantic` data classes. For more information on how to define data classes, see the [pydantic documentation](https://pydantic-docs.helpmanual.io/) and our notebook on [constrained decoding](index.ipynb).

We want to extract valid compositions for chalcogenide perovskites, together with additional properties reported for them in the literature. In our schema, we will only allow compositions from our precompiled list.

We can now use `instructor` to "patch" the OpenAI API client to ensure that our output fulfils the schema.

In [7]:
# client = instructor.patch(OpenAI(), mode=instructor.Mode.MD_JSON)


# Create the model
class ChalcogenidePerovskite(BaseModel):
    name: str = Field(..., title="Name", description="Name of the chalcogenide perovskite.")
    bandgap: Optional[float] = Field(
        None,
        title="Bandgap",
        description="Bandgap of the chalcogenide perovskite. Must be greater or equal to 0.",
        ge=0,
    )
    formula: Optional[Literal[tuple(valid_formulas)]] = Field(
        None,
        title="Formula",
        description="Chemical formula of the chalcogenide perovskite.",
    )

# Example usage
example = ChalcogenidePerovskite(name="Example", bandgap=1.5, formula="BaZrS3")

from IPython.display import JSON

JSON(example.dict())


<IPython.core.display.JSON object>
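
The `Literal[tuple(valid_formulas)]` annotation is what restricts `formula` to the precompiled list. The following standard-library sketch shows the idea without pydantic: `valid_formulas` is a short stand-in list and `check_formula` is a hypothetical helper, so this mimics the constraint rather than reproducing pydantic's validation.

```python
from typing import Literal, get_args

# Stand-in for the precompiled list (assumed values for this sketch)
valid_formulas = ["BaZrS3", "SrZrS3", "BaHfS3"]

# Subscripting Literal with a tuple is equivalent to Literal["BaZrS3", "SrZrS3", ...]
FormulaLiteral = Literal[tuple(valid_formulas)]

def check_formula(formula):
    """Return the formula if it is one of the allowed literals, else raise."""
    allowed = get_args(FormulaLiteral)
    if formula not in allowed:
        raise ValueError(f"{formula!r} is not in the allowed list of formulas")
    return formula

print(get_args(FormulaLiteral))
print(check_formula("BaZrS3"))

try:
    check_formula("BaZrO3")
except ValueError as err:
    print(err)
```

In the pydantic model, the same membership test is applied automatically when an instance is created, so a formula outside the list is rejected before it reaches the dataset.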

## Parsing papers 

For this task, we want to retrieve the structure dataset from the review paper [Chalcogenide Perovskites: Tantalizing Prospects, Challenging Materials](https://onlinelibrary.wiley.com/doi/10.1002/adom.202101704).

First, we will use Nougat to extract the text from the paper. A detailed example of how to do this can be found in the [Nougat example notebook]().

## Validating the output to be a perovskite type structure

Above, we ensured that the output formulas follow the 1:1:3 stoichiometry criterion, which makes them candidates for being a perovskite.
Now, we aim to validate whether a given set of chemical formulas can yield a perovskite-type structure. We use the tolerance factor and octahedral factor as criteria for validation. The steps involved in the process are as follows:

### Introduction

A perovskite structure typically follows the formula ABX₃. The stability of such a structure can be assessed using:

1. **Tolerance Factor ($t$)**: This is a dimensionless number calculated from the ionic radii of the constituent ions. For a stable perovskite structure, the tolerance factor should ideally be in the range of 0.8 to 1.0.
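
For reference, the classical (Goldschmidt) tolerance factor and the octahedral factor mentioned above are commonly written as

$$
t = \frac{r_A + r_X}{\sqrt{2}\,(r_B + r_X)}, \qquad \mu = \frac{r_B}{r_X}
$$

A minimal sketch evaluating both for BaZrS₃ follows. The radii are Shannon values typed in by hand for illustration (Ba²⁺ at coordination XII, Zr⁴⁺ and S²⁻ at coordination VI) and should be checked against the radii table actually used; the function names are hypothetical.

```python
import math

# Shannon ionic radii in angstrom, entered by hand for this sketch (assumed values)
R_BA_XII = 1.61  # Ba2+, coordination number XII
R_ZR_VI = 0.72   # Zr4+, coordination number VI
R_S_VI = 1.84    # S2-, coordination number VI

def tolerance_factor(r_a, r_b, r_x):
    """Classical Goldschmidt tolerance factor."""
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))

def octahedral_factor(r_b, r_x):
    """Ratio of the B-cation radius to the anion radius."""
    return r_b / r_x

t = tolerance_factor(R_BA_XII, R_ZR_VI, R_S_VI)
mu = octahedral_factor(R_ZR_VI, R_S_VI)
print(f"t = {t:.3f}, mu = {mu:.3f}")  # t = 0.953, mu = 0.391
```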

### Methodology

To determine the tolerance factor and octahedral factor, we perform the following steps:

1. **Fetch Elemental Properties**: We retrieve ionic radii and electronegativities of the elements involved using the `pymatgen` library.
2. **Calculate Modified Tolerance Factor**: We use a modified formula for the tolerance factor that incorporates electronegativity differences:
   
   $$ 
   t^* = \frac{\left( \frac{\Delta \chi_{(A-X)}}{\Delta \chi_{(A-O)}} \right) (r_A + r_X)}{\sqrt{2} \left( \frac{\Delta \chi_{(B-X)}}{\Delta \chi_{(B-O)}} \right) (r_B + r_X)} 
   $$

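3. **Validation**: We validate the structure by checking whether the calculated tolerance factor falls within the acceptable range. Note that the `is_perovskite` function below accepts $0.8 \le t^* \le 1.1$, which is slightly wider than the ideal range quoted above.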
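
As a worked example of the modified formula, the sketch below evaluates $t^*$ for BaZrS₃ with hand-entered inputs. The Shannon radii and Pauling electronegativities are assumed values for illustration, and `modified_tolerance_factor` is a hypothetical stand-alone helper; the notebook's own function fetches these quantities from `mendeleev` and may therefore give slightly different numbers.

```python
import math

# Hand-entered inputs for this sketch (assumed values, not fetched from a library)
RADII = {"Ba": 1.61, "Zr": 0.72, "S": 1.84}            # angstrom
EN_PAULING = {"Ba": 0.89, "Zr": 1.33, "S": 2.58, "O": 3.44}

def modified_tolerance_factor(a, b, x):
    """Tolerance factor with electronegativity-difference ratios relative to oxygen."""
    ratio_a = (EN_PAULING[a] - EN_PAULING[x]) / (EN_PAULING[a] - EN_PAULING["O"])
    ratio_b = (EN_PAULING[b] - EN_PAULING[x]) / (EN_PAULING[b] - EN_PAULING["O"])
    numerator = ratio_a * (RADII[a] + RADII[x])
    denominator = math.sqrt(2) * ratio_b * (RADII[b] + RADII[x])
    return numerator / denominator

t_star = modified_tolerance_factor("Ba", "Zr", "S")
print(f"t* = {t_star:.3f}")
print("within 0.8-1.1:", 0.8 <= t_star <= 1.1)
```

With these inputs $t^*$ comes out a little above 1.0, so whether BaZrS₃ passes depends on the upper bound chosen for the acceptance window.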


Let's start by fetching the Shannon ionic radii of the species involved in the perovskite structure and storing them in our working dataframe.

First, we get a dictionary of the elements in the formula together with their guessed oxidation states.

In [13]:
# Given a formula, return a dict mapping each element symbol to its guessed oxidation state
def get_oxidation_states(formula):
    comp = Composition(formula)
    guesses = comp.oxi_state_guesses(oxi_states_override={'S': [-2], 'Se': [-2]})
    if not guesses:
        raise ValueError(f"No charge-balanced oxidation states found for {formula}")
    return guesses[0]

# test the function
dict_oxi = get_oxidation_states('ZrBaS3')
print(dict_oxi)

{'Zr': 4.0, 'Ba': 2.0, 'S': -2.0}


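Let's order the elements so that they follow the A, B, X sequence. The function below sorts the cations by oxidation state, so the lower-charged cation is taken as A (typically the larger cation), the higher-charged one as B (the transition metal), and the anion X comes last.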

In [14]:
def order_oxidation_states(oxidation_states):
    cations = {el: ox for el, ox in oxidation_states.items() if ox > 0}
    anion = {el: ox for el, ox in oxidation_states.items() if ox < 0}

    sorted_cations = {k: v for k, v in sorted(cations.items(), key=lambda item: item[1])}

    return {**sorted_cations, **anion}

dict_oxi = order_oxidation_states(dict_oxi)
print(dict_oxi)



{'Ba': 2.0, 'Zr': 4.0, 'S': -2.0}


In [15]:
from mendeleev.fetch import fetch_ionic_radii, fetch_table
import pandas as pd
import math

# Fetch the elements table to get the mapping from symbol to atomic number
elements_df = fetch_table("elements")
symbol_to_atomic_number = dict(zip(elements_df['symbol'], elements_df['atomic_number']))

def get_ionic_radius(symbol, charge, coordination_number):
    # Get the atomic number for the given symbol
    atomic_number = symbol_to_atomic_number.get(symbol)

    if atomic_number is None:
        return None

    # Fetch the ionic radii table
    ionic_radii_df = fetch_ionic_radii(radius="ionic_radius")

    # Check if the provided coordination number is a valid column
    if coordination_number not in ionic_radii_df.columns:
        return None

    # Filter the DataFrame for the specific atomic number and charge
    element_data = ionic_radii_df[
        (ionic_radii_df.index.get_level_values('atomic_number') == atomic_number) &
        (ionic_radii_df.index.get_level_values('charge') == charge)
    ]

    # Extract the ionic radius for the given coordination number
    if element_data.empty:
        return None

    ionic_radius = element_data[coordination_number].values

    if len(ionic_radius) > 0 and pd.notna(ionic_radius[0]):
        return ionic_radius[0]
    else:
        return None

def get_electronegativity(symbol):
    elements_df = fetch_table("elements")
    en_pauling = elements_df.set_index('symbol')['en_pauling'].to_dict()
    return en_pauling.get(symbol)

def calculate_modified_tolerance_factor(A_symbol, A_charge, B_symbol, B_charge, X_symbol='S', X_charge=-2):
    # Determine coordination numbers for A and B cations
    if A_charge == 2 and B_charge == 4:
        cn_A = 'XII'  # Typically 12 for A site cation in perovskites
        cn_B = 'VI'   # Typically 6 for B site cation in perovskites
    elif A_charge == 3 and B_charge == 3:
        cn_A = 'XII'
        cn_B = 'VI'
    elif A_charge == 1 and B_charge == 5:
        cn_A = 'XII'
        cn_B = 'VI'
    else:
        raise ValueError("Invalid charge combination for perovskite structure.")

    cn_X = 'VI'  # Typically 6 for anions in perovskites

    r_A = get_ionic_radius(A_symbol, A_charge, cn_A)
    r_B = get_ionic_radius(B_symbol, B_charge, cn_B)
    r_X = get_ionic_radius(X_symbol, X_charge, cn_X)

    # Ensure radii are retrieved successfully
    if r_A is None or r_B is None or r_X is None:
        print(f"Failed to retrieve ionic radii for: A({A_symbol}, {A_charge}), B({B_symbol}, {B_charge}), X({X_symbol}, {X_charge})")
        return None

    # Look up Pauling electronegativities before taking differences, so that
    # a missing value is reported instead of raising a TypeError
    en_A = get_electronegativity(A_symbol)
    en_B = get_electronegativity(B_symbol)
    en_X = get_electronegativity(X_symbol)
    en_O = get_electronegativity('O')

    if any(pd.isna(en) for en in (en_A, en_B, en_X, en_O)):
        print(f"Failed to retrieve electronegativity for: A({A_symbol}), B({B_symbol}), X({X_symbol})")
        return None

    chi_A_X = en_A - en_X
    chi_A_O = en_A - en_O
    chi_B_X = en_B - en_X
    chi_B_O = en_B - en_O

    t_star = (chi_A_X / chi_A_O * (r_A + r_X)) / (math.sqrt(2) * chi_B_X / chi_B_O * (r_B + r_X))
    return t_star

def is_perovskite(composition_str):
    # Get oxidation states of elements in the composition
    oxidation_states = get_oxidation_states(composition_str)
    oxidation_states = order_oxidation_states(oxidation_states)

    A_symbol, A_charge = list(oxidation_states.items())[0]
    B_symbol, B_charge = list(oxidation_states.items())[1]
    X_symbol, X_charge = list(oxidation_states.items())[2]

    t_star = calculate_modified_tolerance_factor(A_symbol, A_charge, B_symbol, B_charge, X_symbol, X_charge)

    if t_star is None:
        return False

    return 0.8 <= t_star <= 1.1

# Example usage
composition_str = 'BaZrS3'
is_perovskite_result = is_perovskite(composition_str)
print(f"Is {composition_str} a perovskite? {is_perovskite_result}")

Is BaZrS3 a perovskite? True


In [16]:
# test for formulas in valid_compositions building a dataframe

formulas = [formula for formula, _ in valid_compositions]

resultd_df = pd.DataFrame({'Formula': formulas, 'Is Perovskite': [is_perovskite(formula) for formula in formulas]})
resultd_df

Failed to retrieve ionic radii for: A(Zr, 2.0), B(Hf, 4.0), X(S, -2.0)
Failed to retrieve ionic radii for: A(Zr, 2.0), B(Hf, 4.0), X(Se, -2.0)
Failed to retrieve ionic radii for: A(Eu, 2.0), B(Hf, 4.0), X(S, -2.0)
Failed to retrieve ionic radii for: A(Eu, 2.0), B(Hf, 4.0), X(Se, -2.0)
Failed to retrieve ionic radii for: A(Ti, 2.0), B(Hf, 4.0), X(S, -2.0)
Failed to retrieve ionic radii for: A(Ti, 2.0), B(Hf, 4.0), X(Se, -2.0)
Failed to retrieve ionic radii for: A(La, 2.0), B(Hf, 4.0), X(S, -2.0)
Failed to retrieve ionic radii for: A(La, 2.0), B(Hf, 4.0), X(Se, -2.0)
Failed to retrieve ionic radii for: A(Sm, 2.0), B(Hf, 4.0), X(S, -2.0)
Failed to retrieve ionic radii for: A(Sm, 2.0), B(Hf, 4.0), X(Se, -2.0)
Failed to retrieve ionic radii for: A(Yb, 2.0), B(Hf, 4.0), X(S, -2.0)
Failed to retrieve ionic radii for: A(Yb, 2.0), B(Hf, 4.0), X(Se, -2.0)
Failed to retrieve ionic radii for: A(Eu, 2.0), B(Zr, 4.0), X(S, -2.0)
Failed to retrieve ionic radii for: A(Eu, 2.0), B(Zr, 4.0), X(Se, -2.0)

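|     | Formula | Is Perovskite |
|-----|---------|---------------|
| 0   | SrHfS3  | True          |
| 1   | SrHfSe3 | True          |
| 2   | SrZrS3  | True          |
| 3   | SrZrSe3 | True          |
| 4   | SrTiS3  | False         |
| ... | ...     | ...           |
| 209 | YbTmSe3 | False         |
| 210 | TmLuS3  | False         |
| 211 | TmLuSe3 | False         |
| 212 | YbLuS3  | False         |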


In [17]:
resultd_df['Is Perovskite'].value_counts(normalize=True)

Is Perovskite
False    0.761682
True     0.238318
Name: proportion, dtype: float64

In [160]:
formulas = [formula for formula, _ in valid_compositions]
print(formulas)

for formula in formulas:
    is_perovskite_result = is_perovskite(formula)
    print(f"Is {formula} a perovskite? {is_perovskite_result}")

['SrHfS3', 'SrHfSe3', 'SrZrS3', 'SrZrSe3', 'SrTiS3', 'SrTiSe3', 'SrCeS3', 'SrCeSe3', 'SrPrS3', 'SrPrSe3', 'SrTbS3', 'SrTbSe3', 'BaHfS3', 'BaHfSe3', 'BaZrS3', 'BaZrSe3', 'BaTiS3', 'BaTiSe3', 'BaCeS3', 'BaCeSe3', 'BaPrS3', 'BaPrSe3', 'BaTbS3', 'BaTbSe3', 'HfZrS3', 'HfZrSe3', 'EuHfS3', 'EuHfSe3', 'HfTiS3', 'HfTiSe3', 'LaHfS3', 'LaHfSe3', 'NdHfS3', 'NdHfSe3', 'SmHfS3', 'SmHfSe3', 'EuHfS3', 'EuHfSe3', 'YbHfS3', 'YbHfSe3', 'EuZrS3', 'EuZrSe3', 'ZrTiS3', 'ZrTiSe3', 'YZrS3', 'YZrSe3', 'LaZrS3', 'LaZrSe3', 'CeZrS3', 'CeZrSe3', 'PrZrS3', 'PrZrSe3', 'NdZrS3', 'NdZrSe3', 'PmZrS3', 'PmZrSe3', 'SmZrS3', 'SmZrSe3', 'EuZrS3', 'EuZrSe3', 'GdZrS3', 'GdZrSe3', 'TbZrS3', 'TbZrSe3', 'DyZrS3', 'DyZrSe3', 'HoZrS3', 'HoZrSe3', 'ErZrS3', 'ErZrSe3', 'TmZrS3', 'TmZrSe3', 'YbZrS3', 'YbZrSe3', 'LuZrS3', 'LuZrSe3', 'EuTiS3', 'EuTiSe3', 'EuYS3', 'EuYSe3', 'LaEuS3', 'LaEuSe3', 'CeEuS3', 'CeEuSe3', 'PrEuS3', 'PrEuSe3', 'NdEuS3', 'NdEuSe3', 'PmEuS3', 'PmEuSe3', 'SmEuS3', 'SmEuSe3', 'Eu2S3', 'Eu2Se3', 'EuGdS3', 'EuGdSe3

IndexError: list index out of range
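
One plausible cause of this `IndexError` (an assumption, since the traceback is truncated here) is a formula that reduces to fewer than three elements, such as `Eu2S3` in the list above: unpacking the third `(symbol, charge)` pair then fails. A minimal sketch of a guard, using a hypothetical helper `split_abx` that works on plain oxidation-state dictionaries:

```python
def split_abx(oxidation_states):
    """Return ((A, qA), (B, qB), (X, qX)), or None unless there are
    exactly two cations and one anion."""
    cations = sorted(
        ((el, ox) for el, ox in oxidation_states.items() if ox > 0),
        key=lambda item: item[1],  # lower-charged cation first, as in the notebook
    )
    anions = [(el, ox) for el, ox in oxidation_states.items() if ox < 0]
    if len(cations) != 2 or len(anions) != 1:
        return None
    return cations[0], cations[1], anions[0]

print(split_abx({"Zr": 4.0, "Ba": 2.0, "S": -2.0}))
print(split_abx({"Eu": 3.0, "S": -2.0}))  # None: only one cation
```

A caller could treat a `None` result as "not a ternary ABX₃ candidate" and skip the tolerance-factor calculation for that formula.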

In [133]:
import json
with open("shannon-radii.json") as f:
    out = f.read()

d = json.loads(out)

# Enter Element, Charge, Coordination and one of - r_crystal, r_ionic, spin, remark

print(d['Cd']['2']['VII']['r_ionic'])

1.03


In [None]:
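def get_ionic_radius_2(symbol, charge, coordination_number):
    # Charge keys in the JSON are strings such as '2', so cast float charges like 2.0 to int first
    return d[symbol][str(int(charge))][coordination_number]['r_ionic']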

# test


