In [1]:
from chemdataextractor import Document
from chemdataextractor.doc import Paragraph, Sentence
import chemdataextractor as cde

### Read in text from end of paper containing PCE info: 
Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.

In [2]:
caption = Document("PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.")

In [3]:
caption

In [4]:
caption[0].records.serialize()

[{'names': ['(PEA)2(MA)2[Pb3I10]'], 'labels': ['1']},
 {'names': ['(MA)[PbI3]']},
 {'names': ['PbI2']},
 {'names': ['PbCl2']},
 {'names': ['PbI2']}]

Will try combining caption and paragraph into one Document manually. 

In [5]:
doc_w_cap = Document(cde.doc.text.Paragraph(u'Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.'),
                      cde.doc.text.Caption(u'PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.'))

In [6]:
doc_w_cap

In [7]:
doc_w_cap.records.serialize()

[{'names': ['fluorocarbons']},
 {'names': ['(MA)[PbI3]']},
 {'names': ['PbCl2']},
 {'names': ['(PEA)2(MA)2[Pb3I10]'], 'labels': ['1']},
 {'names': ['PbI2']}]

## Try and build custom model for PCE extraction
Following the example in the ChemDataExtractor git repo, `extracting_a_custom_property.ipynb`.

In [8]:
from chemdataextractor.model import Compound
from chemdataextractor.doc import Paragraph, Heading
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType

class PCE(BaseModel):
    value = StringType()
    units = StringType()
    
Compound.pce = ListType(ModelType(PCE))

In [9]:
import re
from chemdataextractor.parse import R, I, W, Optional, merge, ZeroOrMore


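# Matches "PCEs", "pce", or "power conversion efficiency"; hidden from the result.
prefix = (I(u'PCEs') | I(u'pce') | I(u'power') + I(u'conversion') + I(u'efficiency')).hide()
# Filler token between the prefix and the number, e.g. "up" and "to".
common_text = R(r'\D').hide()
units = (W(u'%') | I(u'percent'))(u'units')
# value = R(r'^\d+(\.\d+)?$')(u'value')
value = R(r'\d+(\.\d+)?')(u'value')
# Prefix, up to two filler tokens, then a number and its units.
pce = (prefix + Optional(common_text) + Optional(common_text) + value + units)(u'pce')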

In [10]:
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class PCEParser(BaseParser):
    root = pce

    def interpret(self, result, start, end):
        compound = Compound(
            pce=[
                PCE(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
#         print('result = ',result)
#         print("result.xpath = ", result.xpath())
#         print("result.xpath('./value/text()') = {}".format(result.xpath('./value/text()')))
        yield compound


In [11]:
Paragraph.parsers.append(PCEParser())



In [12]:
Paragraph.parsers 

[<chemdataextractor.parse.cem.CompoundParser at 0x10edc45f8>,
 <chemdataextractor.parse.cem.ChemicalLabelParser at 0x10edc4630>,
 <chemdataextractor.parse.nmr.NmrParser at 0x10edc4668>,
 <chemdataextractor.parse.ir.IrParser at 0x10edc46a0>,
 <chemdataextractor.parse.uvvis.UvvisParser at 0x10edc46d8>,
 <chemdataextractor.parse.mp.MpParser at 0x10edc4710>,
 <chemdataextractor.parse.tg.TgParser at 0x10edc4748>,
 <chemdataextractor.parse.context.ContextParser at 0x10edc4780>,
 <__main__.PCEParser at 0x10e62b668>]

In [13]:
doc_w_cap = Document(cde.doc.text.Paragraph(u'Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.'),
                      cde.doc.text.Caption(u'PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.'))
doc_w_cap.records.serialize()

[{'pce': [{'value': '4.73', 'units': '%'}]},
 {'pce': [{'value': '15', 'units': '%'}]},
 {'names': ['fluorocarbons']},
 {'names': ['(MA)[PbI3]']},
 {'names': ['PbCl2']},
 {'names': ['(PEA)2(MA)2[Pb3I10]'], 'labels': ['1']},
 {'names': ['PbI2']}]

The parser now pulls both PCE values from the test string. It still doesn't perform as well as I would like, because it does not associate each PCE with a chemical species. This is especially a problem for the given example, where two PCE values are reported for two different materials. But I guess we could just throw out papers that do this.
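One way to think about the missing association is a nearest-preceding-label heuristic: within a sentence, pair the PCE value with the last compound label mentioned before it. The sketch below uses only the standard `re` module, not ChemDataExtractor, and everything in it (`PCE_RE`, `LABEL_RE`, `pair_pce_with_label`, and the "containing N" cue) is a hypothetical illustration tuned to this one test paragraph rather than a general solution.

```python
import re

# Hypothetical patterns for illustration only; not part of ChemDataExtractor.
# "PCE"/"PCEs", up to two filler words, then a number and a percent unit.
PCE_RE = re.compile(
    r"\bPCEs?\b(?:\s+\D\S*){0,2}?\s+(\d+(?:\.\d+)?)\s*(%|percent)",
    re.IGNORECASE,
)
# Assumes labels appear as "containing 1", "containing 2a", etc.
LABEL_RE = re.compile(r"\bcontaining\s+(\d+[a-z]?)\b")


def pair_pce_with_label(text):
    """Pair the first PCE in each sentence with the nearest preceding label."""
    pairs = []
    # Naive sentence split: whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        pce = PCE_RE.search(sentence)
        if pce is None:
            continue
        # Keep only labels that end before the PCE phrase starts.
        labels = [m for m in LABEL_RE.finditer(sentence) if m.end() <= pce.start()]
        label = labels[-1].group(1) if labels else None
        pairs.append({"label": label, "value": pce.group(1), "units": pce.group(2)})
    return pairs


text = (
    "Solar cells containing 1 display PCEs up to 4.73 %. "
    "Though devices containing 2 have exceeded PCEs of 15 %,2a-2c their "
    "moisture sensitivity remains a concern."
)
pairs = pair_pce_with_label(text)
for pair in pairs:
    print(pair)
```

The label would still need to be resolved to a name (for example `1` to `(PEA)2(MA)2[Pb3I10]`) using the label records the compound parser already produces.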

Will work this into a .py file tonight.

In [14]:
output_dict = doc_w_cap.records.serialize()

#### Some older tests

In [15]:
doc_w_cap = Document(cde.doc.text.Paragraph(u'Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.'),
                      cde.doc.text.Caption(u'PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.'),
#                     cde.doc.text.Paragraph(u'The compund (PEA)2(MA)2[Pb3I10] has a pce 3 percent'),
                    cde.doc.text.Paragraph(u'Solar cells containing 1 display pce 4 percent'),
#                     cde.doc.text.Paragraph(u'The compund PbCl2 (pce 4 percent)'),
                    )
doc_w_cap.records.serialize()

[{'pce': [{'value': '4.73', 'units': '%'}]},
 {'pce': [{'value': '15', 'units': '%'}]},
 {'names': ['fluorocarbons']},
 {'names': ['(MA)[PbI3]']},
 {'names': ['PbCl2']},
 {'pce': [{'value': '4', 'units': 'percent'}]},
 {'names': ['(PEA)2(MA)2[Pb3I10]'], 'labels': ['1']},
 {'names': ['PbI2']}]

### Testing a list of sentences

In [16]:
test_sentences = ['For example, when MAPbI3 was loaded on a mesoporous (mp)-TiO2 electrode by the sequential deposition of PbI2 and methylammonium iodide (MAI), a 15.0% power-conversion efficiency (PCE) was achieved under 1 sun illumination11.',
 'The Jsc, Voc and FF values obtained from the I–V curve of the reverse scan were 19.2 mA cm−2, 1.09 V and 0.69, respectively, yielding a PCE of 14.4% under standard AM 1.5 conditions.',
 'The average values from the J–V curves from the reverse and forward scans (Fig.\xa05a) exhibited a Jsc of 19.58 mA cm−2, Voc of 1.105 V, and FF of 76.2%, corresponding to a PCE of 16.5% under standard AM 1.5 G conditions.',
 'The best device also showed a very broad IPCE plateau of over 80% between 420 and 700 nm, as shown in Fig.\xa05b.',
 'One of these devices was certified by the standardized method in a photovoltaics calibration laboratory, confirming a PCE of 16.2% under AM 1.5 G full sun (Supplementary Fig.\xa06).',
 'In summary, we developed a solvent-engineering technology for the deposition of extremely uniform perovskite layers, and demonstrated a solution-processed perovskite solar cell with 16.5% PCE under standard conditions (AM 1.5 G radiation, 100 mW cm−2).']

In [17]:
Paragraph(test_sentences).records.serialize()

TypeError: Text must be a unicode string
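The error message suggests that `Paragraph` wants a single text string rather than a list. A minimal sketch of one workaround, assuming it is acceptable to join the sentences into one block of text first: `as_paragraph_text` is a hypothetical helper (not part of ChemDataExtractor), and the example sentences are shortened versions of the ones in `test_sentences`. It also swaps the non-breaking spaces (`\xa0`) for plain spaces.

```python
def as_paragraph_text(sentences):
    """Join sentences into one string, replacing non-breaking spaces."""
    cleaned = [s.replace(u"\xa0", u" ").strip() for s in sentences]
    return u" ".join(cleaned)


# Shortened stand-ins for the entries of test_sentences.
example_sentences = [
    u"The best device showed a broad IPCE plateau, as shown in Fig.\xa05b.",
    u"One of these devices was certified, confirming a PCE of 16.2%.",
]

paragraph_text = as_paragraph_text(example_sentences)
print(paragraph_text)
```

The resulting string could then be passed to `Paragraph(...)` in place of the list.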

In [20]:

Sentence.parsers = [PCEParser()]
[Sentence(sent).records.serialize() for sent in test_sentences]

[[], [], [], [], [], []]

In [21]:
Sentence(test_sentences[0]).records.serialize()

[]

In [23]:
Sentence('Solar cells containing 1 display PCEs up to 4.73 %.').records.serialize()

[{'pce': [{'value': '4.73', 'units': '%'}]}]

In [24]:
Sentence('corresponding to a PCE of 16.5% under standard AM 1.5 G conditions.').records.serialize()

[]
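The most visible difference between the sentence that parses and the ones that do not is the spacing: `4.73 %` versus `16.5%`. One plausible but unverified explanation is that the tokenizer keeps `16.5%` as a single token, so `value + units` never sees a number followed by a separate `%` token. Inspecting the tokens of the failing sentence would confirm or rule this out. The sketch below is a pure-Python stand-in for the grammar (it does not call ChemDataExtractor), run on two hand-written tokenizations; `match_pce` and the `FUSED` fallback are hypothetical.

```python
import re

PREFIXES = {"pce", "pces"}
NON_DIGIT = re.compile(r"\D")                # stand-in for common_text
NUMBER = re.compile(r"\d+(\.\d+)?")          # stand-in for value
UNITS = {"%", "percent"}
FUSED = re.compile(r"(\d+(?:\.\d+)?)(%)$")   # hypothetical fallback for "16.5%"


def match_pce(tokens, allow_fused=False):
    """Return (value, units) for the first PCE phrase in tokens, else None."""
    for i, tok in enumerate(tokens):
        if tok.lower() not in PREFIXES:
            continue
        # Allow 0, 1, or 2 filler tokens between the prefix and the number.
        for skip in range(3):
            j = i + 1 + skip
            fillers = tokens[i + 1:j]
            if j >= len(tokens) or not all(NON_DIGIT.match(f) for f in fillers):
                break
            if allow_fused:
                fused = FUSED.match(tokens[j])
                if fused:
                    return fused.group(1), fused.group(2)
            if (j + 1 < len(tokens) and NUMBER.match(tokens[j])
                    and tokens[j + 1].lower() in UNITS):
                return tokens[j], tokens[j + 1]
    return None


split_tokens = ["a", "PCE", "of", "16.5", "%", "under"]   # "%" as its own token
fused_tokens = ["a", "PCE", "of", "16.5%", "under"]       # "%" glued to the number

print(match_pce(split_tokens))
print(match_pce(fused_tokens))
print(match_pce(fused_tokens, allow_fused=True))
```

If the tokens do turn out to be fused, adding an alternative to `value + units` that accepts a single number-plus-percent token would be one way to handle both spellings.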

In [22]:
test_sentences[0]

'For example, when MAPbI3 was loaded on a mesoporous (mp)-TiO2 electrode by the sequential deposition of PbI2 and methylammonium iodide (MAI), a 15.0% power-conversion efficiency (PCE) was achieved under 1 sun illumination11.'

In [23]:
for sent in test_sentences: print(sent)

For example, when MAPbI3 was loaded on a mesoporous (mp)-TiO2 electrode by the sequential deposition of PbI2 and methylammonium iodide (MAI), a 15.0% power-conversion efficiency (PCE) was achieved under 1 sun illumination11.
The Jsc, Voc and FF values obtained from the I–V curve of the reverse scan were 19.2 mA cm−2, 1.09 V and 0.69, respectively, yielding a PCE of 14.4% under standard AM 1.5 conditions.
The average values from the J–V curves from the reverse and forward scans (Fig. 5a) exhibited a Jsc of 19.58 mA cm−2, Voc of 1.105 V, and FF of 76.2%, corresponding to a PCE of 16.5% under standard AM 1.5 G conditions.
The best device also showed a very broad IPCE plateau of over 80% between 420 and 700 nm, as shown in Fig. 5b.
One of these devices was certified by the standardized method in a photovoltaics calibration laboratory, confirming a PCE of 16.2% under AM 1.5 G full sun (Supplementary Fig. 6).
In summary, we developed a solvent-engineering technology for the deposition of ext

In [27]:
R(u'avs')(u'avs')

<chemdataextractor.parse.elements.Regex at 0x1a2619ed68>

# Try a smarter parser

### old parser


In [22]:

from chemdataextractor.model import Compound
from chemdataextractor.doc import Paragraph, Heading
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType

class PCE(BaseModel):
    value = StringType()
    units = StringType()
    
Compound.pce = ListType(ModelType(PCE))

import re
from chemdataextractor.parse import R, I, W, Optional, merge, ZeroOrMore


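# Matches "PCEs", "pce", or "power conversion efficiency"; hidden from the result.
prefix = (I(u'PCEs') | I(u'pce') | I(u'power') + I(u'conversion') + I(u'efficiency')).hide()
# Filler token between the prefix and the number, e.g. "up" and "to".
common_text = R(r'\D').hide()
units = (W(u'%') | I(u'percent'))(u'units')
# value = R(r'^\d+(\.\d+)?$')(u'value')
value = R(r'\d+(\.\d+)?')(u'value')
# Prefix, up to two filler tokens, then a number and its units.
pce = (prefix + Optional(common_text) + Optional(common_text) + value + units)(u'pce')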


from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class PCEParser(BaseParser):
    root = pce

    def interpret(self, result, start, end):
        compound = Compound(
            pce=[
                PCE(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
#         print('result = ',result)
#         print("result.xpath = ", result.xpath())
#         print("result.xpath('./value/text()') = {}".format(result.xpath('./value/text()')))
        yield compound


### new parser

I want to inherit from `CompoundParser` in `chemdataextractor/parse/cem.py`:
```
class CompoundParser(BaseParser):
    """Chemical name possibly with an associated label."""

    root = cem_phrase

    def interpret(self, result, start, end):
        # TODO: Parse label_type into label model object
        for cem_el in result.xpath('./cem'):
            c = Compound(
                names=cem_el.xpath('./name/text()'),
                labels=cem_el.xpath('./label/text()'),
                roles=[standardize_role(r) for r in cem_el.xpath('./role/text()')]
            )
            yield c
```

In [35]:
from chemdataextractor.parse import CompoundParser

from chemdataextractor.model import Compound


class PCE(BaseModel):
    value = StringType()
    units = StringType()

# Register the field only after PCE has been defined in this cell.
Compound.pce = ListType(ModelType(PCE))

class PCEParser(BaseParser):
    root = pce

    def interpret(self, result, start, end):
                
        compound = Compound(
#             name=first(result.xpath('./name/text()')),
#             labels=first(result.xpath('./label/text()')),
            pce=[
                PCE(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ],
            names=first(result.xpath('./name/text()')),
            labels=first(result.xpath('./label/text()')),
#             roles=[standardize_role(r) for r in cem_el.xpath('./role/text()')],
        )
#         print('result = ',result)
#         print("result.xpath = ", result.xpath())
#         print("result.xpath('./value/text()') = {}".format(result.xpath('./value/text()')))
        yield compound


In [36]:
Paragraph.parsers = [PCEParser()]

In [37]:
doc_w_cap = Document(cde.doc.text.Paragraph(u'Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.'),
                      cde.doc.text.Caption(u'PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.'))
doc_w_cap.records.serialize()

TypeError: 'NoneType' object is not iterable

I think I will need to manually build the parsing syntax to associate a material or label with the PCE.
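A possible contributor to the `TypeError` above: the `CompoundParser` listing passes the xpath result lists straight into `names` and `labels`, whereas the new parser wraps them in `first(...)`. Since the `pce` pattern contains no name or label elements, those lookups come back empty, and `first` presumably returns `None`, which a list field cannot iterate over. This is an inference from the listing, not something checked against the library source. The sketch below shows the guard in plain Python; `first` here is a stand-in with assumed behaviour, and `build_compound_kwargs` is a hypothetical helper.

```python
def first(items):
    """Stand-in for the utils helper: first item, or None if the list is empty."""
    return items[0] if items else None


def build_compound_kwargs(values, units, names, labels):
    """Build keyword arguments, leaving out list fields that have no matches."""
    kwargs = {"pce": [{"value": first(values), "units": first(units)}]}
    if names:
        kwargs["names"] = list(names)     # keep the whole list, not first(...)
    if labels:
        kwargs["labels"] = list(labels)
    return kwargs


# xpath results as the pce pattern would give them: no names or labels.
without_names = build_compound_kwargs(["4.73"], ["%"], [], [])
# What a combined pattern might give once names and labels are captured.
with_names = build_compound_kwargs(["4.73"], ["%"], ["(PEA)2(MA)2[Pb3I10]"], ["1"])

print(without_names)
print(with_names)
```

With this guard the parser would at least return the bare PCE records again, although the association with a material still needs a pattern that actually captures names or labels.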

Is it the compound parser doing the work?

In [39]:
from chemdataextractor.parse.cem import CompoundParser 

Paragraph.parsers = [CompoundParser()]

doc_w_cap = Document(cde.doc.text.Paragraph(u'Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.'),
                      cde.doc.text.Caption(u'PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.'))
doc_w_cap.records.serialize()



[{u'names': [u'fluorocarbons']},
 {u'names': [u'(MA)[PbI3]']},
 {u'names': [u'PbCl2']},
 {u'labels': [u'1'], u'names': [u'(PEA)2(MA)2[Pb3I10]']},
 {u'names': [u'PbI2']}]

Yes, it is. What I would like to do is just add on to this, but it seems to operate specifically on the pattern defined by `cem_phrase` in `chemdataextractor/parse/cem.py`.

I think I will have to build a specific pattern that combines `cem_phrase` with my `pce`.