## Data annotation

Structured data extraction requires a target structure in which the extracted data will be stored.
This raises the question of how to choose a good data structure (and format).

While there is no general answer, there are a few points to consider: 


### Nested vs. flat
Some data formats are better suited than others to expressing dependencies and nesting. For instance, a tabular format like `csv` only allows columns to be specified, whereas a format like `yaml` or `json` can also express hierarchies.
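To make the difference concrete, the following minimal sketch flattens one nested record into CSV rows using only the Python standard library. The record layout and the helper name `flatten_reaction` are assumptions chosen for illustration; they are not part of any library.

```python
import csv
import io

# One nested reaction record, as it could be stored in JSON or YAML.
reaction = {
    "name": "Combustion of Methane",
    "reactants": [
        {"formula": "CH4", "amount": "1 mol"},
        {"formula": "O2", "amount": "2 mol"},
    ],
    "conditions": {"temperature": "298 K", "pressure": "1 atm"},
}


def flatten_reaction(record):
    """Return one flat row per reactant, repeating the shared fields."""
    rows = []
    for reactant in record["reactants"]:
        rows.append(
            {
                "reaction_name": record["name"],
                "reactant_formula": reactant["formula"],
                "reactant_amount": reactant["amount"],
                "temperature": record["conditions"]["temperature"],
                "pressure": record["conditions"]["pressure"],
            }
        )
    return rows


rows = flatten_reaction(reaction)

# Write the rows as CSV text; the name and conditions appear on every line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In the flat form, the reaction name and the conditions have to be repeated on every row, and the grouping of rows into one reaction is only implied by the repeated name.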

### Verbosity and human readability 

Some data formats (e.g., `xml`) come with more verbose boilerplate than others.
This can hamper readability.
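As a rough illustration, the sketch below serializes the same small record as JSON and as XML with the Python standard library and compares the length of the two strings. The record and the element name `reactant` are assumptions made for this example, and character count is only a crude proxy for readability.

```python
import json
import xml.etree.ElementTree as ET

# A single record to serialize in both formats.
record = {"formula": "CH4", "amount": "1 mol"}

# JSON: keys and values only, plus a few delimiters.
as_json = json.dumps(record)

# XML: every value is wrapped in an opening and a closing tag.
root = ET.Element("reactant")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
print(len(as_json), len(as_xml))
```

The XML string is longer because each field name appears twice, once in the opening tag and once in the closing tag.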

### Type annotations and documentation 

In general, it is best practice to annotate a data schema with as much information as possible, for example data types (e.g., `float` vs. `int` vs. `str`) and field descriptions.
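One way to attach such annotations, sketched here with the standard library `dataclasses` module, is to combine type hints with per-field metadata. The class `Conditions`, the metadata keys `description` and `unit`, and the helper `describe` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Conditions:
    # Each field carries a type hint plus free-form metadata.
    temperature: float = field(
        metadata={"description": "Reaction temperature", "unit": "K"}
    )
    pressure: float = field(
        metadata={"description": "Reaction pressure", "unit": "atm"}
    )


def describe(schema):
    """Return (name, type name, description, unit) for each field."""
    return [
        (f.name, f.type.__name__, f.metadata["description"], f.metadata["unit"])
        for f in fields(schema)
    ]


for row in describe(Conditions):
    print(row)
```

Keeping the unit in the metadata lets the value itself stay numeric, instead of mixing number and unit in a string such as `298 K`.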

### Example 

To model reactions, we can serialize data that follows a simple data model in different formats.

Description of our data model:

- A reaction has a name, reactants, products, and conditions.
- Each reactant and product has a chemical formula and amount.
- Conditions include temperature and pressure.

#### CSV example

```
reaction_name,reactant_formula,reactant_amount,product_formula,product_amount,temperature,pressure
Combustion of Methane,CH4,1 mol,CO2,1 mol,298 K,1 atm
Combustion of Methane,O2,2 mol,H2O,2 mol,298 K,1 atm
Photosynthesis,CO2,6 mol,C6H12O6,1 mol,300 K,1 atm
Photosynthesis,H2O,6 mol,O2,6 mol,300 K,1 atm
```

#### YAML example

```
reactions:
  - name: Combustion of Methane
    reactants:
      - formula: CH4
        amount: 1 mol
      - formula: O2
        amount: 2 mol
    products:
      - formula: CO2
        amount: 1 mol
      - formula: H2O
        amount: 2 mol
    conditions:
      temperature: 298 K
      pressure: 1 atm
  - name: Photosynthesis
    reactants:
      - formula: CO2
        amount: 6 mol
      - formula: H2O
        amount: 6 mol
    products:
      - formula: C6H12O6
        amount: 1 mol
      - formula: O2
        amount: 6 mol
    conditions:
      temperature: 300 K
      pressure: 1 atm
```

#### JSON example

```
{
  "reactions": [
    {
      "name": "Combustion of Methane",
      "reactants": [
        { "formula": "CH4", "amount": "1 mol" },
        { "formula": "O2", "amount": "2 mol" }
      ],
      "products": [
        { "formula": "CO2", "amount": "1 mol" },
        { "formula": "H2O", "amount": "2 mol" }
      ],
      "conditions": {
        "temperature": "298 K",
        "pressure": "1 atm"
      }
    },
    {
      "name": "Photosynthesis",
      "reactants": [
        { "formula": "CO2", "amount": "6 mol" },
        { "formula": "H2O", "amount": "6 mol" }
      ],
      "products": [
        { "formula": "C6H12O6", "amount": "1 mol" },
        { "formula": "O2", "amount": "6 mol" }
      ],
      "conditions": {
        "temperature": "300 K",
        "pressure": "1 atm"
      }
    }
  ]
}
```

#### XML example 

```
<reactions>
  <reaction>
    <name>Combustion of Methane</name>
    <reactants>
      <reactant>
        <formula>CH4</formula>
        <amount>1 mol</amount>
      </reactant>
      <reactant>
        <formula>O2</formula>
        <amount>2 mol</amount>
      </reactant>
    </reactants>
    <products>
      <product>
        <formula>CO2</formula>
        <amount>1 mol</amount>
      </product>
      <product>
        <formula>H2O</formula>
        <amount>2 mol</amount>
      </product>
    </products>
    <conditions>
      <temperature>298 K</temperature>
      <pressure>1 atm</pressure>
    </conditions>
  </reaction>
  <reaction>
    <name>Photosynthesis</name>
    <reactants>
      <reactant>
        <formula>CO2</formula>
        <amount>6 mol</amount>
      </reactant>
      <reactant>
        <formula>H2O</formula>
        <amount>6 mol</amount>
      </reactant>
    </reactants>
    <products>
      <product>
        <formula>C6H12O6</formula>
        <amount>1 mol</amount>
      </product>
      <product>
        <formula>O2</formula>
        <amount>6 mol</amount>
      </product>
    </products>
    <conditions>
      <temperature>300 K</temperature>
      <pressure>1 atm</pressure>
    </conditions>
  </reaction>
</reactions>
```

```{admonition} Data model, schema, format
*Data Model:* Represents the structure and organization of data, defining how data is stored, organized, and manipulated. In this case, it includes reactions, reactants, products, and conditions.

*Schema:* Defines the structure of data within a particular format, specifying the data types, constraints, and relationships. For example, an XML Schema (XSD) or JSON Schema can be used to validate the structure of XML or JSON data respectively.

*Format:* Refers to the way data is encoded and represented for storage or transmission. CSV, YAML, JSON, and XML are different formats that encode data in different ways.
```
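The distinction between data model and format can be checked with a small sketch: the same conditions record, encoded once as JSON and once as XML, parses into an identical Python dictionary. The snippet uses only the standard library, and the two input strings are illustrative fragments written for this example.

```python
import json
import xml.etree.ElementTree as ET

# The same conditions record encoded in two formats.
json_text = '{"temperature": "298 K", "pressure": "1 atm"}'
xml_text = (
    "<conditions>"
    "<temperature>298 K</temperature>"
    "<pressure>1 atm</pressure>"
    "</conditions>"
)

# Parse both encodings into plain Python dictionaries.
from_json = json.loads(json_text)
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}

# Both formats carry the same underlying data.
print(from_json == from_xml)
```

The format changes how the data is written down, while the data model (a set of conditions with a temperature and a pressure) stays the same.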

In practice, it is often most convenient to define a data schema in code. This has multiple advantages: 

- the data schema can be tracked with all other code using version control 
- there are existing routines for export in various formats 
- data can be conveniently accessed in code, e.g., via class attributes

`pydantic` is a library that makes it easy to define and validate data in Python.
It can also parse data from various formats and serialize it to various formats.

```python
from pydantic import BaseModel, Field
from typing import List
```

```python
class Reaction(BaseModel):
    reaction_name: str = Field(..., description="The name of the reaction")
    reactants: List[str] = Field(..., description="The reactants of the reaction")
    catalyst: List[str] = Field(..., description="The catalysts of the reaction")
    base: str = Field(..., description="The base of the reaction")
    solvent: str = Field(..., description="The solvent of the reaction")
    temperature: int = Field(..., description="The temperature of the reaction")
    temperature_unit: str = Field(..., description="The unit of the temperature")
    product: str = Field(..., description="The product of the reaction")
    rxn_yield: float = Field(..., description="The yield of the reaction")
```

We can now create an instance of the `Reaction` class from a dict:

```python
rxn_dict = {
    "reaction_name": "Buchwald-Hartwig reaction",
    "reactants": ["5-Bromo-m-xylene", "Benzylmethylamine"],
    "catalyst": ["Bis(dibenzylideneacetone)palladium(0)", "Tri(o-tolyl)phosphine"],
    "base": "Sodium tert-butoxide",
    "solvent": "Toluene",
    "temperature": 65,
    "temperature_unit": "°C",
    "product": "N-Benzyl-N-methyl(3,5-xylyl)amine",
    "rxn_yield": 88,
}
```

```python
rxn = Reaction(**rxn_dict)
```

We can also serialize the data to a JSON string, which can then be written to a file:

```python
rxn.model_dump_json()
```

```
'{"reaction_name":"Buchwald-Hartwig reaction","reactants":["5-Bromo-m-xylene","Benzylmethylamine"],"catalyst":["Bis(dibenzylideneacetone)palladium(0)","Tri(o-tolyl)phosphine"],"base":"Sodium tert-butoxide","solvent":"Toluene","temperature":65,"temperature_unit":"°C","product":"N-Benzyl-N-methyl(3,5-xylyl)amine","rxn_yield":88.0}'
```
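Beyond serialization, a model defined in code can also parse JSON back into an object, reject invalid values, and export a JSON Schema. The sketch below assumes pydantic version 2 (the version that provides `model_dump_json`) and uses a reduced, hypothetical model `MiniReaction` so that the block is self-contained.

```python
from pydantic import BaseModel, Field, ValidationError


class MiniReaction(BaseModel):
    # Reduced illustrative model with three of the fields used above.
    reaction_name: str = Field(..., description="The name of the reaction")
    temperature: int = Field(..., description="The temperature of the reaction")
    rxn_yield: float = Field(..., description="The yield of the reaction")


# Round trip: serialize to JSON text and parse it back into a model.
original = MiniReaction(
    reaction_name="Buchwald-Hartwig reaction", temperature=65, rxn_yield=88
)
restored = MiniReaction.model_validate_json(original.model_dump_json())

# A value that cannot be interpreted as an integer is rejected.
try:
    MiniReaction(
        reaction_name="Buchwald-Hartwig reaction", temperature="warm", rxn_yield=88
    )
    error_fields = []
except ValidationError as exc:
    error_fields = [err["loc"][0] for err in exc.errors()]

# The field types and descriptions end up in the exported JSON Schema.
schema = MiniReaction.model_json_schema()

print(restored == original)
print(error_fields)
print(schema["properties"]["temperature"])
```

Validation errors of this kind are useful during extraction, because they flag records in which a value does not match the declared type.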

```{admonition} Annotating data
:class: tip

Creating a test and validation set is crucial and can be time-consuming. Therefore, [using an automation tool is recommended](https://hamel.dev/notes/llm/finetuning/04_data_cleaning.html). [TeamTat](https://www.teamtat.org) can be a convenient choice for this.
```