
# Extracting a Custom Property

In [1]:
!pip install chemdataextractor2

Collecting chemdataextractor2
  Downloading chemdataextractor2-2.1.0.tar.gz (898 kB)
Collecting click==6.7
  Downloading click-6.7-py2.py3-none-any.whl (71 kB)
Collecting cssselect
  Downloading cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Collecting pdfminer.six
  Downloading pdfminer.six-20211012-py3-none-any.whl (5.6 MB)
Collecting requests==2.21.0
  Downloading requests-2.21.0-py2.py3-none-any.whl (57 kB)
Collecting python-crfsuite
  Downloading python_crfsuite-0.9.7-cp37-cp37m-manylinux1_x86_64.whl (743 kB)
Collecting tabledataextractor
  Downloading tabledataextractor-1.5.10.tar.gz (27 kB)
Collecting selenium==3.141.0
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904

In [2]:
!pip list -v | grep chem

chemdataextractor2            2.1.0                 /usr/local/lib/python3.7/dist-packages pip
jsonschema                    4.3.3                 /usr/local/lib/python3.7/dist-packages pip
SQLAlchemy                    1.4.31                /usr/local/lib/python3.7/dist-packages pip


In [3]:
# The chemdataextractor2 distribution installs its module under the name chemdataextractor
import chemdataextractor

In [None]:
from chemdataextractor import Document
from chemdataextractor.model import Compound
from chemdataextractor.doc import Paragraph, Heading

## Read File

In [None]:
from chemdataextractor import Document

# Open the article XML in binary mode; the with-block closes the file after parsing
with open('j.jnoncrysol.2017.07.006.xml', 'rb') as f:
    doc = Document.from_file(f)

IndexError: ignored

## Example Document

Let's create a simple example document with a single heading followed by a single paragraph:

In [None]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

Here is what the document looks like:

In [None]:
d

## Default Parsers

By default, ChemDataExtractor won't extract the boiling point property:

In [None]:
d.records.serialize()

[{'labels': ['3a'], 'names': ['2,4,6-trinitrotoluene'], 'roles': ['product']}]

## Defining a New Property Model

The first task is to define the schema of a new property, and add it to the `Compound` model:

In [None]:
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType

class BoilingPoint(BaseModel):
    value = StringType()
    units = StringType()
    
Compound.boiling_points = ListType(ModelType(BoilingPoint))
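
To make the intended record shape concrete, here is a minimal sketch that mirrors the schema with plain dataclasses instead of ChemDataExtractor models. The names `BoilingPointSketch` and `CompoundSketch` are hypothetical stand-ins used only for illustration. It shows a compound holding a list of boiling points, each with a string `value` and string `units`:

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class BoilingPointSketch:
    # Both fields are strings, like the StringType fields in the model above
    value: str
    units: str


@dataclass
class CompoundSketch:
    # Hypothetical stand-in for Compound: names plus a list of boiling points
    names: List[str] = field(default_factory=list)
    boiling_points: List[BoilingPointSketch] = field(default_factory=list)


record = CompoundSketch(
    names=["2,4,6-trinitrotoluene"],
    boiling_points=[BoilingPointSketch(value="240", units="°C")],
)

# asdict converts nested dataclasses into nested dicts and lists
serialized = asdict(record)
print(serialized)
```

The nested dictionary has the same overall layout as the serialized records shown elsewhere in this notebook: a list-valued `boiling_points` key whose entries carry `value` and `units`.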

## Writing a New Parser

Next, write parsing rules that specify how to interpret text and convert it into the model:

In [None]:
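import re
from chemdataextractor.parse import R, I, W, Optional, merge

# Raw strings keep the regex backslashes from being read as Python string escapes
prefix = (R(r'^b\.?p\.?$', re.I) | I('boiling') + I('point')).hide()  # "b.p." or "boiling point", hidden from the result
units = (W('°') + Optional(R(r'^[CFK]\.?$')))('units').add_action(merge)  # "°" plus optional C, F or K, merged into one token
value = R(r'^\d+(\.\d+)?$')('value')  # integer or decimal number
bp = (prefix + value + units)('bp')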
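
The rules above match token by token rather than against the raw string. As a rough illustration, the following sketch reproduces the same logic with the standard `re` module only, without ChemDataExtractor. The token list is an assumption about how the sentence might be split (in particular, that `°` and `C` arrive as separate tokens), and `prefix_length` and `match_bp` are hypothetical helper names:

```python
import re

PREFIX_ABBR = re.compile(r'^b\.?p\.?$', re.I)   # "bp", "b.p.", "B.P." ...
VALUE = re.compile(r'^\d+(\.\d+)?$')            # integer or decimal number
UNIT_LETTER = re.compile(r'^[CFK]\.?$')         # optional scale letter after "°"


def prefix_length(tokens, i):
    """Return how many tokens the prefix spans at position i (0 if none)."""
    if PREFIX_ABBR.match(tokens[i]):
        return 1
    if (tokens[i].lower() == 'boiling'
            and i + 1 < len(tokens)
            and tokens[i + 1].lower() == 'point'):
        return 2
    return 0


def match_bp(tokens):
    """Return (value, units) for the first prefix + value + units run, else None."""
    for i in range(len(tokens)):
        n = prefix_length(tokens, i)
        j = i + n  # index where the value token is expected
        if n == 0 or j + 1 >= len(tokens):
            continue
        if not VALUE.match(tokens[j]) or tokens[j + 1] != '°':
            continue
        units = '°'
        # Merge the optional scale letter into the units string
        if j + 2 < len(tokens) and UNIT_LETTER.match(tokens[j + 2]):
            units += tokens[j + 2]
        return tokens[j], units
    return None


tokens = ['(', 'b.p.', '240', '°', 'C', ')']  # assumed tokenization
result = match_bp(tokens)
print(result)
```

As in the grammar, the prefix only anchors the match and is not returned, while the degree sign and scale letter are joined into a single units string.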

In [None]:
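from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class BpParser(BaseParser):
    """Turn each match of the bp grammar into a Compound with a boiling point."""
    root = bp  # top-level grammar rule to match

    def interpret(self, result, start, end):
        # Read the first 'value' and 'units' entries from the tagged parse result
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound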
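
To show what the `interpret` step does with a tagged result, here is a small sketch using the standard library's `xml.etree.ElementTree` in place of the real parse result. The tree shape (a `bp` element with `value` and `units` children) is an assumption based on the names given to the grammar rules, and this `first` is a hypothetical stand-in for `chemdataextractor.utils.first`:

```python
import xml.etree.ElementTree as ET


def first(items, default=None):
    """Return the first item of an iterable, or default if it is empty."""
    return next(iter(items), default)


# Assumed shape of a parse result for "b.p. 240 °C" (the hidden prefix is absent)
result = ET.fromstring('<bp><value>240</value><units>°C</units></bp>')


def interpret_sketch(result):
    """Build a plain-dict record from the tagged children of the result tree."""
    value = first(el.text for el in result.findall('./value'))
    units = first(el.text for el in result.findall('./units'))
    return {'boiling_points': [{'value': value, 'units': units}]}


record = interpret_sketch(result)
print(record)
```

Taking only the first match of each tag keeps the record well-formed even if a rule happens to tag more than one token with the same name.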


In [None]:
Paragraph.parsers = [BpParser()]

## Running the New Parser

In [None]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

d.records.serialize()

[{'boiling_points': [{'units': '°C', 'value': '240'}],
  'labels': ['3a'],
  'names': ['2,4,6-trinitrotoluene'],
  'roles': ['product']}]