# Ensuring correctness of formulas 

::: {.callout-note title="Motivation"}
Here we want to generate an output that will allow us to train a composition-property prediction model.

- We want to ensure that the output formulas retrieved can be parsed into a pymatgen `Composition` object. This is important to be able to reuse the extracted data with popular packages for these tasks, such as those leading in [MatBench](https://hackingmaterials.lbl.gov/matbench/).
- We want to ensure that the formulas fulfill the criteria of charge balance, so the stoichiometry is correct.
- We want to focus on a given class of materials, in this case, chalcogenide perovskite-based semiconductors.
:::

In [None]:
! pip install smact

## First SMACT example
Example of creating a list of all ternary Sr-Ba-S compositions whose stoichiometry fulfills the charge-balance criterion. For this we use the [SMACT](https://smact.readthedocs.io/en/latest/) package.

In [66]:
# Imports
import itertools
import multiprocessing
from datetime import datetime

from smact import Element, element_dictionary, ordered_elements
from smact.screening import smact_filter

# Define the elements we are interested in
element_symbols = ['Sr', 'Ba']
all_el = element_dictionary()  # A dictionary of all element objects

# Retrieve element objects
els = [all_el[symbol] for symbol in element_symbols]

# Generate all pairs of the selected elements and add S to each pair
combinations = itertools.combinations(els, 2)
sulfides_selenides = [(*pair, all_el['S']) for pair in combinations]

# Filter the compositions using smact_filter
start = datetime.now()
if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as p:
        result = p.map(smact_filter, sulfides_selenides)
print(f"Time taken to generate list:  {datetime.now() - start}")

# Flatten the list of lists
flat_list = [item for sublist in result for item in sublist]

# Print the number of compositions and a few examples
print(f"Number of compositions: --> {len(flat_list)} <--")
print("Each list entry looks like this:\n  elements, oxidation states, stoichiometries")
for i in flat_list[:5]:
    print(i)

# Convert compositions to reduced formulas using pymatgen
from pymatgen.core import Composition

def comp_maker(comp):
    form = []
    for el, ammt in zip(comp[0], comp[2]):
        form.append(el)
        form.append(ammt)
    form = "".join(str(e) for e in form)
    pmg_form = Composition(form).reduced_formula
    return pmg_form


pretty_formulas = [comp_maker(comp) for comp in flat_list]

print("Each list entry now looks like this: ")
for i in pretty_formulas[:5]:
    print(i)

# Store the results in a pandas DataFrame
import pandas as pd

new_data = pd.DataFrame({"pretty_formula": pretty_formulas})
# Drop any duplicate compositions
new_data = new_data.drop_duplicates(subset="pretty_formula")
new_data.describe()

# Display the dataframe
print(new_data.head())


Time taken to generate list:  0:00:01.614377
Number of compositions: --> 52 <--
Each list entry looks like this:
  elements, oxidation states, stoichiometries
(('Sr', 'Ba', 'S'), (1, 2, -2), (2, 1, 2))
(('Sr', 'Ba', 'S'), (1, 2, -2), (2, 2, 3))
(('Sr', 'Ba', 'S'), (1, 2, -2), (2, 3, 4))
(('Sr', 'Ba', 'S'), (1, 2, -2), (2, 4, 5))
(('Sr', 'Ba', 'S'), (1, 2, -2), (2, 5, 6))
Each list entry now looks like this: 
Ba(SrS)2
Ba2Sr2S3
Ba3(SrS2)2
Ba4Sr2S5
Ba5(SrS3)2
  pretty_formula
0       Ba(SrS)2
1       Ba2Sr2S3
2     Ba3(SrS2)2
3       Ba4Sr2S5
4     Ba5(SrS3)2
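
To make the charge-balance criterion concrete, here is a minimal pure-Python sketch that enumerates stoichiometries for a fixed set of oxidation states and keeps only the neutral, fully reduced ones. It is an illustration of the idea, not SMACT's implementation: `neutral_stoichiometries` and its `threshold` parameter are hypothetical names, and SMACT applies further checks that are not reproduced here.

```python
from functools import reduce
from itertools import product
from math import gcd


def neutral_stoichiometries(oxidation_states, threshold=8):
    """Yield reduced stoichiometry tuples whose total charge is zero.

    Each coefficient runs from 1 to `threshold` (an illustrative limit).
    """
    ranges = [range(1, threshold + 1)] * len(oxidation_states)
    for stoichs in product(*ranges):
        total_charge = sum(q * n for q, n in zip(oxidation_states, stoichs))
        # Keep only neutral combinations that cannot be reduced further
        if total_charge == 0 and reduce(gcd, stoichs) == 1:
            yield stoichs


# Oxidation states taken from the first entries printed above: Sr(+1), Ba(+2), S(-2)
candidates = list(neutral_stoichiometries((1, 2, -2)))
print(candidates[:5])
```

With these oxidation states, `(2, 1, 2)` is kept because 2·(+1) + 1·(+2) + 2·(−2) = 0, while its multiple `(4, 2, 4)` is dropped as a duplicate of the same reduced composition.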


## Targeting Chalcogenide Perovskites 

We want to constrain our language model to output formulas that follow certain chemical and structural criteria. In this case, we are looking for chalcogenide perovskite-based semiconductors.

The perovskite structure is defined based on the mineral CaTiO3, and perovskite structures must exhibit close structural similarity to this archetype. Breternitz and Schorr have clarified a [set of criteria](https://doi.org/10.1002/aenm.201802366) to determine whether a material should be called a perovskite.

Key criteria include:

1. **Stoichiometry**: ABX3, or at least an A:B:X ratio of 1:1:3.
2. **B-cation Coordination**: Octahedral (or distorted octahedra) coordination.
3. **3D Network**: The [BX6] octahedra should form an all-corner-sharing 3D network.
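
Of these criteria, only the first can be checked from a composition alone. The sketch below shows one way to test the 1:1:3 ratio on a stoichiometry tuple. `has_abx3_ratio` is a hypothetical helper, and it assumes the anion X is listed last, as in the SMACT tuples used in this notebook. The coordination and network criteria need structural data and are not covered.

```python
from functools import reduce
from math import gcd


def has_abx3_ratio(stoichs):
    """Return True if a ternary stoichiometry reduces to A:B:X = 1:1:3.

    Assumes positive integer coefficients with the anion X in last position.
    """
    if len(stoichs) != 3 or min(stoichs) <= 0:
        return False
    divisor = reduce(gcd, stoichs)
    return tuple(n // divisor for n in stoichs) == (1, 1, 3)


print(has_abx3_ratio((1, 1, 3)))  # ABX3
print(has_abx3_ratio((2, 2, 6)))  # same ratio, not reduced
print(has_abx3_ratio((2, 1, 2)))  # different ratio
```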


We will use [SMACT](https://smact.readthedocs.io/en/latest/) to generate a list of ternary chalcogenide compositions that fulfill at least the first criterion. In addition, the `smact_filter` function restricts the results to the requested stoichiometry and ensures that each composition is charge balanced using common oxidation states for the elemental combinations tested.

In [28]:
# Imports
from smact.screening import smact_filter
from smact import Element, Species, element_dictionary
import itertools


# Define the cations we are interested in: Sr, Ba, Hf, Zr, Eu and Ti
element_symbols = ['Sr', 'Ba', 'Hf', 'Zr', 'Eu', 'Ti']

# Retrieve the element objects
elements = [Element(symbol) for symbol in element_symbols]

# make combinations of 2 elements with S
combinations = itertools.combinations(elements, 2)
# Add S to each pair
combinations = [[*m, Element("S")] for m in combinations]

# Define the stoichiometry constraint for 1:1:3 compositions
stoichiometries = [[1], [1], [3]]

# Filter the compositions using smact_filter
comps = [smact_filter(comb, stoichs=stoichiometries, oxidation_states_set='pymatgen') for comb in combinations]

# Print the filtered compositions
for comp in comps:
    print(comp)

# flatten the list and print the composition

flat_list = [item for sublist in comps for item in sublist]

print(flat_list)


[]
[(('Sr', 'Hf', 'S'), (2, 4, -2), (1, 1, 3))]
[(('Sr', 'Zr', 'S'), (2, 4, -2), (1, 1, 3))]
[]
[(('Sr', 'Ti', 'S'), (2, 4, -2), (1, 1, 3))]
[(('Ba', 'Hf', 'S'), (2, 4, -2), (1, 1, 3))]
[(('Ba', 'Zr', 'S'), (2, 4, -2), (1, 1, 3))]
[]
[(('Ba', 'Ti', 'S'), (2, 4, -2), (1, 1, 3))]
[(('Hf', 'Zr', 'S'), (4, 2, -2), (1, 1, 3))]
[(('Hf', 'Eu', 'S'), (4, 2, -2), (1, 1, 3))]
[(('Hf', 'Ti', 'S'), (4, 2, -2), (1, 1, 3))]
[(('Zr', 'Eu', 'S'), (3, 3, -2), (1, 1, 3)), (('Zr', 'Eu', 'S'), (4, 2, -2), (1, 1, 3))]
[(('Zr', 'Ti', 'S'), (2, 4, -2), (1, 1, 3)), (('Zr', 'Ti', 'S'), (3, 3, -2), (1, 1, 3)), (('Zr', 'Ti', 'S'), (4, 2, -2), (1, 1, 3))]
[(('Eu', 'Ti', 'S'), (2, 4, -2), (1, 1, 3)), (('Eu', 'Ti', 'S'), (3, 3, -2), (1, 1, 3))]
[(('Sr', 'Hf', 'S'), (2, 4, -2), (1, 1, 3)), (('Sr', 'Zr', 'S'), (2, 4, -2), (1, 1, 3)), (('Sr', 'Ti', 'S'), (2, 4, -2), (1, 1, 3)), (('Ba', 'Hf', 'S'), (2, 4, -2), (1, 1, 3)), (('Ba', 'Zr', 'S'), (2, 4, -2), (1, 1, 3)), (('Ba', 'Ti', 'S'), (2, 4, -2), (1, 1, 3)), (('Hf', 'Z

We can turn the compositions into reduced formulas using pymatgen (note that the oxidation state information is lost in this step).

In [29]:
from pymatgen.core import Composition

def comp_maker(comp):
    form = []
    for el, ammt in zip(comp[0], comp[2]):
        form.append(el)
        form.append(ammt)
    form = "".join(str(e) for e in form)
    pmg_form = Composition(form).reduced_formula
    return pmg_form

pretty_formulas = [comp_maker(comp) for comp in flat_list]

# print the list of pretty formulas. Make a set to remove duplicates

chalcogenide_perovskite_formulas = list(set(pretty_formulas))
print(set(pretty_formulas))

{'HfZrS3', 'ZrTiS3', 'BaHfS3', 'SrHfS3', 'SrTiS3', 'SrZrS3', 'HfTiS3', 'BaZrS3', 'EuHfS3', 'EuZrS3', 'EuTiS3', 'BaTiS3'}
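
If the oxidation states are needed later, they can be kept next to the formula instead of being discarded. The following is a minimal sketch with a hypothetical helper, `to_record`. It builds the formula by plain concatenation in the order given by the tuple, so it is not a substitute for pymatgen's `reduced_formula`.

```python
def to_record(entry):
    """Convert an (elements, oxidation states, stoichiometries) tuple to a dict.

    The formula is a simple concatenation; coefficients of 1 are omitted.
    """
    elements, oxidation_states, stoichs = entry
    formula = "".join(
        f"{el}{n if n != 1 else ''}" for el, n in zip(elements, stoichs)
    )
    return {
        "formula": formula,
        "oxidation_states": dict(zip(elements, oxidation_states)),
    }


# One of the entries returned by smact_filter above
entry = (("Ba", "Zr", "S"), (2, 4, -2), (1, 1, 3))
record = to_record(entry)
print(record)
```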


To enable constrained decoding, we will use [`instructor`](https://jxnl.github.io/instructor/), one of the most popular packages for this task.
It is built on [`pydantic`](https://docs.pydantic.dev/) and can leverage function calling and the JSON mode of the OpenAI API as well as other constrained sampling approaches.
This part of the notebook is based on the [constrained decoding example notebook](index.ipynb).

In [30]:
from pydantic import BaseModel, Field
from typing import List, Optional, Literal
# import erdantic as erd
import instructor
from IPython.display import SVG
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv('../.env', override=True)

ModuleNotFoundError: No module named 'instructor'

## Defining a data schema

For most constrained generation tasks, we need to define a data schema in a programmatic way.
The most common way to do so is to use `pydantic` data classes.
Here is an example of a simple data schema for a recipe:

```python
from typing import List

from pydantic import BaseModel

class Recipe(BaseModel):
    title: str
    ingredients: List[str]
    instructions: List[str]
```

This schema can also be extended to include descriptions of different fields or to only allow certain values for specific fields. For example, we could add a field for the number of servings and only allow positive integers.

```python
from pydantic import BaseModel, Field
from typing import Literal, List

class Recipe(BaseModel):
    title: str
    ingredients: List[str]
    instructions: List[str]
    servings: int = Field(..., gt=0, description="The number of servings for this recipe")
    rating: Literal["easy", "medium", "hard"] = Field("easy", description="The difficulty level of this recipe")
```

To extract data on chalcogenide perovskites, a data schema could look like the following. The `formula` field is restricted to the formulas generated with SMACT above.

We can then use `instructor` to "patch" the OpenAI API client to ensure that the model output fulfils the schema.

In [84]:
# client = instructor.patch(OpenAI(), mode=instructor.Mode.MD_JSON)


# Create the model
class ChalcogenidePerovskite(BaseModel):
    name: str = Field(..., title="Name", description="Name of the chalcogenide perovskite.")
    bandgap: Optional[float] = Field(
        None,
        title="Bandgap",
        description="Bandgap of the chalcogenide perovskite. Must be greater or equal to 0.",
        ge=0,
    )
    formula: Optional[Literal[tuple(chalcogenide_perovskite_formulas)]] = Field(
        None,
        title="Formula",
        description="Chemical formula of the chalcogenide perovskite.",
    )

# Example usage
example = ChalcogenidePerovskite(name="Example", bandgap=1.5, formula="BaZrS3")
print(example)


name='Example' bandgap=1.5 formula='BaZrS3'


In [67]:
failing_example = ChalcogenidePerovskite(name="Failing Example", bandgap=1.9, formula="BaZrSe3")
print(failing_example)

ValidationError: 1 validation error for ChalcogenidePerovskite
formula
  Input should be 'SrZrS3', 'HfZrS3', 'SrHfS3', 'BaHfS3', 'BaZrS3', 'HfTiS3', 'BaTiS3', 'EuZrS3', 'ZrTiS3', 'EuTiS3', 'EuHfS3' or 'SrTiS3' [type=literal_error, input_value='BaZrSe3', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/literal_error
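
When many records are validated in a loop, an uncaught `ValidationError` stops the whole run. A minimal sketch of handling it is shown below, assuming `pydantic` is installed. `ALLOWED_FORMULAS`, `Perovskite` and `safe_parse` are hypothetical names used only for this illustration, and the formula list is a small subset of the one generated above.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError

# Hypothetical subset of the allowed formulas, for illustration only
ALLOWED_FORMULAS = ("BaZrS3", "SrZrS3", "BaHfS3")


class Perovskite(BaseModel):
    name: str
    formula: Optional[Literal[ALLOWED_FORMULAS]] = None


def safe_parse(data: dict) -> Optional[Perovskite]:
    """Return a validated model, or None if the data violates the schema."""
    try:
        return Perovskite(**data)
    except ValidationError as err:
        print(f"Rejected {data.get('formula')!r}: {len(err.errors())} validation error(s)")
        return None


accepted = safe_parse({"name": "Example", "formula": "BaZrS3"})
rejected = safe_parse({"name": "Failing Example", "formula": "BaZrSe3"})
```

Returning `None` is one possible design choice. Collecting the rejected inputs for later inspection would work equally well.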

## Collecting data from abstracts

In [26]:
from crossref.restful import Works
import json

# Initialize Works with a timeout
works = Works(timeout=90)

# Perform the search for sources on the topic of chalcogenide perovskites
try:
    query_result = works.query(bibliographic='chalcogenide perovskites bandgap').filter(has_abstract='true').sort('relevance').select(
        'DOI', 'title', 'author', 'type', 'publisher', 'issued', 'abstract').sample(10)

    # Convert the query results to a list
    results = [item for item in query_result]

    # Save results including their metadata in a JSON file
    with open('chalcogenide_perovskites_results.json', 'w') as file:
        json.dump(results, file, indent=4)

    # Print results
    print(results)

except Exception as e:
    print(f"An error occurred: {e}")


[{'DOI': '10.26434/chemrxiv.14502372.v1', 'abstract': '<jats:p>Mixed halide azetidinium lead perovskites AzPbBr<jats:sub>3-<jats:italic>x</jats:italic></jats:sub>X<jats:italic><jats:sub>x</jats:sub></jats:italic> (X = Cl or I) were obtained by mechanosynthesis. With varying halide composition from Cl- to Br- to I-; the chloride and bromide analogs both form in the hexagonal 6H polytype while the iodide adopts the 9R polytype. An intermediate 4H polytype is observed for mixed Br/I compositions. Overall the structure progresses from 6H to 4H to 9R perovskite polytype with varying halide composition. Rietveld refinement of the powder X-ray diffraction patterns revealed a linear variation in unit cell volume as a function of the average radius of the anion, which is not only observed within the solid solution of each polytype (according to Vegard’s law) but extends uniformly across all three polytypes. This is correlated with a progressive (linear) tuning of the bandgap from 3.41 to 2.00 e
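
As the output above shows, Crossref abstracts contain JATS XML markup such as `<jats:p>` and `<jats:sub>`. A minimal sketch for removing these tags before passing the text to a language model is given below. `strip_jats` is a hypothetical helper, and a regular expression is used only because the markup here is simple. Nested or unusual XML may call for a proper parser.

```python
import html
import re

# Matches opening and closing tags in the "jats:" namespace
JATS_TAG = re.compile(r"</?jats:[^>]*>")


def strip_jats(abstract: str) -> str:
    """Remove JATS tags, decode HTML entities and normalise whitespace."""
    text = JATS_TAG.sub("", abstract)
    text = html.unescape(text)
    return " ".join(text.split())


sample = "<jats:p>AzPbBr<jats:sub>3-<jats:italic>x</jats:italic></jats:sub> perovskites</jats:p>"
print(strip_jats(sample))
```

Note that removing the tags also removes the subscript information, so `AzPbBr<jats:sub>3</jats:sub>` becomes `AzPbBr3`.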

## Parsing papers 

## Retrieving Structured Data with an LLM