In [2]:
from chemdataextractor import Document
import chemdataextractor as cde

### Read in text from the end of the paper containing PCE info:
Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.

In [3]:
caption = Document("PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.")

In [4]:
caption

In [5]:
caption[0].records.serialize()

[{u'labels': [u'1'], u'names': [u'(PEA)2(MA)2[Pb3I10]']},
 {u'names': [u'(MA)[PbI3]']},
 {u'names': [u'PbI2']},
 {u'names': [u'PbCl2']},
 {u'names': [u'PbI2']}]

Next, try combining the caption and paragraph into one `Document` manually.

In [6]:
doc_w_cap = Document(cde.doc.text.Paragraph(u'Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.'),
                      cde.doc.text.Caption(u'PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.'))

In [7]:
doc_w_cap

In [8]:
doc_w_cap.records.serialize()

[{u'names': [u'fluorocarbons']},
 {u'names': [u'(MA)[PbI3]']},
 {u'names': [u'PbCl2']},
 {u'labels': [u'1'], u'names': [u'(PEA)2(MA)2[Pb3I10]']},
 {u'names': [u'PbI2']}]

## Try to build a custom model for PCE extraction
Following the example in the ChemDataExtractor git repo, `extracting_a_custom_property.ipynb`.

In [9]:
from chemdataextractor.model import Compound
from chemdataextractor.doc import Paragraph, Heading
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType

class PCE(BaseModel):
    value = StringType()
    units = StringType()
    
Compound.pce = ListType(ModelType(PCE))

In [10]:
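from chemdataextractor.parse import R, I, W, Optional

# Phrases that introduce a PCE value; hidden so they don't appear in the result.
prefix = (I(u'PCEs') | I(u'pce') | I(u'power') + I(u'conversion') + I(u'efficiency')).hide()
# Filler token between the prefix and the number, e.g. "up", "to", "of".
common_text = R(r'\D').hide()
units = (W(u'%') | I(u'percent'))(u'units')
# Stricter, anchored alternative: value = R(u'^\d+(\.\d+)?$')(u'value')
value = R(r'\d+(\.\d+)?')(u'value')
# Allow up to two filler tokens between the prefix and the value.
pce = (prefix + Optional(common_text) + Optional(common_text) + value + units)(u'pce')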
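
To sanity-check the shape of this grammar without ChemDataExtractor, here is a rough plain-`re` stand-in. It is only an approximation: it works on raw text rather than CDE tokens, and the names `PCE_PATTERN`, `find_pce`, and `sample` are made up for this sketch.

```python
import re

# Approximate regex version of the grammar (an assumption, not CDE's tokenizer):
# prefix, then up to two non-numeric filler words, then a number and a unit.
PCE_PATTERN = re.compile(
    r"\b(?:PCEs?|power\s+conversion\s+efficiency)\b"  # prefix
    r"(?:\s+[^\W\d]+){0,2}"                            # up to two filler words ("up to", "of")
    r"\s+(?P<value>\d+(?:\.\d+)?)"                     # numeric value
    r"\s*(?P<units>%|percent)",                        # units
    re.IGNORECASE,
)


def find_pce(text):
    """Return a list of (value, units) pairs found in text."""
    return [(m.group("value"), m.group("units")) for m in PCE_PATTERN.finditer(text)]


# Shortened version of the test paragraph.
sample = (
    "Solar cells containing 1 display PCEs up to 4.73 %. "
    "Though devices containing 2 have exceeded PCEs of 15 %, "
    "their moisture sensitivity remains a concern. "
    "The larger bandgap of 1 also affords a higher VOC value of 1.18 V."
)
print(find_pce(sample))  # [('4.73', '%'), ('15', '%')]
```

The VOC value is skipped because no PCE prefix precedes it, which is the behavior the `prefix` element is meant to enforce.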

In [11]:
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class PCEParser(BaseParser):
    root = pce

    def interpret(self, result, start, end):
        compound = Compound(
            pce=[
                PCE(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound


In [12]:
Paragraph.parsers.append(PCEParser())



In [13]:
Paragraph.parsers 

[<chemdataextractor.parse.cem.CompoundParser at 0x10cfb8710>,
 <chemdataextractor.parse.cem.ChemicalLabelParser at 0x10cfb8750>,
 <chemdataextractor.parse.nmr.NmrParser at 0x10cfb8790>,
 <chemdataextractor.parse.ir.IrParser at 0x10cfb8810>,
 <chemdataextractor.parse.uvvis.UvvisParser at 0x10cfb8850>,
 <chemdataextractor.parse.mp.MpParser at 0x10cfb8890>,
 <chemdataextractor.parse.tg.TgParser at 0x10cfb88d0>,
 <chemdataextractor.parse.context.ContextParser at 0x10cfb8910>,
 <__main__.PCEParser at 0x10a889250>]

In [14]:
doc_w_cap = Document(cde.doc.text.Paragraph(u'Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.'),
                      cde.doc.text.Caption(u'PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.'))
doc_w_cap.records.serialize()

[{u'pce': [{u'units': u'%', u'value': u'4.73'}]},
 {u'pce': [{u'units': u'%', u'value': u'15'}]},
 {u'names': [u'fluorocarbons']},
 {u'names': [u'(MA)[PbI3]']},
 {u'names': [u'PbCl2']},
 {u'labels': [u'1'], u'names': [u'(PEA)2(MA)2[Pb3I10]']},
 {u'names': [u'PbI2']}]

The parser now pulls both PCE values from the test string. It still doesn't perform as well as I would like, because it does not associate each PCE with a chemical species. This is especially a problem for the given example, where two PCE values are reported for two different materials. I guess we could just throw out papers that do this, though.

Will work this into a .py file tonight.
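
A minimal sketch of the "throw out papers that do this" idea, working only on the serialized records (plain lists of dicts), so it needs no ChemDataExtractor import. The helpers `split_records` and `is_ambiguous` are hypothetical names for this sketch, and the rule "more than one PCE value means ambiguous" is my assumption.

```python
def split_records(records):
    """Separate serialized records into PCE entries and compound entries."""
    pce_values = [p for r in records for p in r.get('pce', [])]
    compounds = [r for r in records if 'names' in r or 'labels' in r]
    return pce_values, compounds


def is_ambiguous(records):
    """True when more than one PCE value was found, so a value cannot be
    attributed to a single material without further context."""
    pce_values, _ = split_records(records)
    return len(pce_values) > 1


# Records copied from the serialized output of cell 14.
records = [
    {'pce': [{'units': '%', 'value': '4.73'}]},
    {'pce': [{'units': '%', 'value': '15'}]},
    {'names': ['fluorocarbons']},
    {'names': ['(MA)[PbI3]']},
    {'names': ['PbCl2']},
    {'labels': ['1'], 'names': ['(PEA)2(MA)2[Pb3I10]']},
    {'names': ['PbI2']},
]
print(is_ambiguous(records))  # True: two PCE values in one passage
```

A passage flagged this way could be skipped or set aside for manual review rather than silently assigned to one compound.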

In [15]:
output_dict = doc_w_cap.records.serialize()

#### Some older tests

In [40]:
doc_w_cap = Document(cde.doc.text.Paragraph(u'Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.'),
                      cde.doc.text.Caption(u'PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.'),
#                     cde.doc.text.Paragraph(u'The compound (PEA)2(MA)2[Pb3I10] has a pce 3 percent'),
                      cde.doc.text.Paragraph(u'Solar cells containing 1 display pce 4 percent'),
#                     cde.doc.text.Paragraph(u'The compound PbCl2 (pce 4 percent)'),
                    )
doc_w_cap.records.serialize()

[{u'pce': [{u'units': u'%', u'value': u'4.73'}]},
 {u'pce': [{u'units': u'%', u'value': u'15'}]},
 {u'names': [u'fluorocarbons']},
 {u'names': [u'(MA)[PbI3]']},
 {u'names': [u'PbCl2']},
 {u'pce': [{u'units': u'percent', u'value': u'4'}]},
 {u'labels': [u'1'], u'names': [u'(PEA)2(MA)2[Pb3I10]']},
 {u'names': [u'PbI2']}]

Still can't get the PCEs to associate with the chemical entities: each `pce` record comes back as a separate entry from the compound records.
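
One possible workaround is to do the association as a post-processing step outside the parser. The sketch below is a heuristic under strong assumptions: a sentence names its material with the phrase "containing <label>", and the PCE in that same sentence belongs to that label. The names `LABEL_RE`, `PCE_RE`, and `link_pce_to_labels` are hypothetical and not part of ChemDataExtractor.

```python
import re

# Assumed phrasing: "... containing 1 ..." or "... containing 2a ..."
LABEL_RE = re.compile(r"containing\s+(\d+\w?)")
# PCE prefix, any non-digit filler, then a number and a unit.
PCE_RE = re.compile(r"\bPCEs?\b\D*?(\d+(?:\.\d+)?)\s*(%|percent)", re.IGNORECASE)


def link_pce_to_labels(text, records):
    """Pair each PCE value with the compound label found in the same sentence."""
    # Map labels such as '1' to the compound names resolved for them.
    names_by_label = {
        label: rec.get('names', [])
        for rec in records for label in rec.get('labels', [])
    }
    linked = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        label = LABEL_RE.search(sentence)
        pce = PCE_RE.search(sentence)
        if label and pce:
            linked.append({
                'label': label.group(1),
                'names': names_by_label.get(label.group(1), []),
                'value': pce.group(1),
                'units': pce.group(2),
            })
    return linked


text = (
    "Solar cells containing 1 display PCEs up to 4.73 %. "
    "Though devices containing 2 have exceeded PCEs of 15 %, "
    "their moisture sensitivity remains a concern."
)
records = [
    {'labels': ['1'], 'names': ['(PEA)2(MA)2[Pb3I10]']},
    {'names': ['(MA)[PbI3]']},
]
for entry in link_pce_to_labels(text, records):
    print(entry)
```

On this input, label 1 picks up its compound name from the records, while label 2 comes back with an empty name list because no record carries that label. That matches the caption output above, where only compound 1 was given a label.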