
# Extracting a Custom Property

In [2]:
!pip install chemdataextractor


In [4]:
from chemdataextractor import Document
from chemdataextractor.model import Compound
from chemdataextractor.doc import Paragraph, Heading

## Reading a File

In [6]:
from chemdataextractor import Document
# Open the article in binary mode; the context manager closes the file afterwards.
with open('j.jnoncrysol.2017.07.006.xml', 'rb') as f:
    doc = Document.from_file(f)

IndexError: ignored

## Example Document

Let's create a simple example document with a single heading followed by a single paragraph:

In [None]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

What does this look like?

In [None]:
d

## Default Parsers

By default, ChemDataExtractor won't extract the boiling point property:

In [None]:
d.records.serialize()

[{'labels': ['3a'], 'names': ['2,4,6-trinitrotoluene'], 'roles': ['product']}]

## Defining a New Property Model

The first task is to define the schema of a new property, and add it to the `Compound` model:

In [None]:
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType

class BoilingPoint(BaseModel):
    value = StringType()
    units = StringType()
    
Compound.boiling_points = ListType(ModelType(BoilingPoint))

## Writing a New Parser

Next, write parsing rules that specify how to interpret the text and convert it into the model:

In [None]:
import re
from chemdataextractor.parse import R, I, W, Optional, merge

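# Raw strings keep the regex backslashes intact and avoid invalid-escape warnings.
# The prefix ("b.p." or "boiling point") is matched but hidden from the result.
prefix = (R(r'^b\.?p\.?$', re.I) | I(u'boiling') + I(u'point')).hide()
# Units are a degree sign plus an optional scale letter, merged into one string.
units = (W(u'°') + Optional(R(r'^[CFK]\.?$')))(u'units').add_action(merge)
# The value is an integer or decimal number.
value = R(r'^\d+(\.\d+)?$')(u'value')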
bp = (prefix + value + units)(u'bp')
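
To make the token-by-token nature of these rules easier to see, here is a minimal sketch that applies the same regular expressions with only the standard `re` module. It is an illustration, not how ChemDataExtractor is implemented: the `match_bp` helper is hypothetical, the token list is written by hand (an assumption about how the sentence is split), and only the abbreviated `b.p.` prefix is handled, not the two-token "boiling point" form.

```python
import re

# Token-level patterns copied from the rules above, written as raw strings.
PREFIX = re.compile(r'^b\.?p\.?$', re.I)
VALUE = re.compile(r'^\d+(\.\d+)?$')
DEGREE = '°'
SCALE = re.compile(r'^[CFK]\.?$')


def match_bp(tokens):
    """Return (value, units) for the first prefix + value + units run, else None.

    Hypothetical helper for illustration; not part of ChemDataExtractor.
    """
    for i in range(len(tokens) - 2):
        if not PREFIX.match(tokens[i]):
            continue
        if not VALUE.match(tokens[i + 1]):
            continue
        if tokens[i + 2] != DEGREE:
            continue
        units = DEGREE
        # The scale letter is optional, mirroring Optional(...) in the rule.
        if i + 3 < len(tokens) and SCALE.match(tokens[i + 3]):
            units += tokens[i + 3]
        return tokens[i + 1], units
    return None


# Hand-written tokenization: an assumption, not the library's tokenizer output.
tokens = ['yield', 'a', 'pale', 'yellow', 'solid',
          '(', 'b.p.', '240', '°', 'C', ')']
result = match_bp(tokens)
print(result)
```

Each pattern is anchored with `^` and `$` because it is tested against one whole token at a time rather than against the full sentence, which is why the degree sign and the scale letter need a merge step to become a single units string.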

In [None]:
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound


In [None]:
Paragraph.parsers = [BpParser()]

## Running the New Parser

In [None]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

d.records.serialize()

[{'boiling_points': [{'units': '°C', 'value': '240'}],
  'labels': ['3a'],
  'names': ['2,4,6-trinitrotoluene'],
  'roles': ['product']}]
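
Because the model stores `value` and `units` as strings, downstream code usually has to convert them before doing arithmetic. The following is a rough sketch of one way to do that on the serialized output shown above. The `to_kelvin` helper is hypothetical and not part of ChemDataExtractor, and it assumes the units string is a degree sign followed by `C`, `F`, or `K`, as the parser rules allow.

```python
def to_kelvin(value, units):
    """Convert a string value with units such as '°C' to a float in kelvin.

    Hypothetical helper for illustration; not part of ChemDataExtractor.
    """
    number = float(value)
    # Strip the degree sign and any trailing period allowed by the units rule.
    scale = units.lstrip('°').rstrip('.')
    if scale == 'C':
        return number + 273.15
    if scale == 'F':
        return (number - 32.0) * 5.0 / 9.0 + 273.15
    if scale == 'K':
        return number
    raise ValueError('Unrecognised temperature units: {!r}'.format(units))


# The serialized records copied from the output above.
records = [{'boiling_points': [{'units': '°C', 'value': '240'}],
            'labels': ['3a'],
            'names': ['2,4,6-trinitrotoluene'],
            'roles': ['product']}]

for record in records:
    # Records without a boiling point simply contribute nothing.
    for bp in record.get('boiling_points', []):
        print(record['names'][0], to_kelvin(bp['value'], bp['units']))
```

A units string with no scale letter, which the optional part of the rule permits, raises a `ValueError` here instead of being silently assigned a scale.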