## Notes

Run this notebook in the `new_env` environment (not the best name, oh well), since its dependencies require Python 3.6.

In [1]:
! which python

/home/christinehc/anaconda3/envs/py36/bin/python


In [2]:
! python -V

Python 3.6.8 :: Anaconda, Inc.


## Initialization

In [2]:
import io

from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

## Testing

In [3]:
def convert_pdf_to_txt(path):
    """Extract all text from the PDF at `path` and return it as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0       # 0 means no page limit
    caching = True
    pagenos = set()    # an empty set means all pages

    # The context manager closes the file even if a page fails to parse
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password,
                                      caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text

The PDF can be downloaded from the link on Slack; I used the Angewandte paper because it is shorter. Change the path below to point to wherever the PDF is saved on your machine.
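
One way to keep the path edit in a single place is sketched below. The names `DOWNLOAD_DIR`, `PDF_PATH`, `TXT_PATH`, and `check_pdf` are my own placeholders, not anything the notebook or its libraries define, and the `downloads` folder is only an assumption about where the file was saved.

```python
from pathlib import Path

# Hypothetical location: edit DOWNLOAD_DIR to wherever the PDF was saved.
DOWNLOAD_DIR = Path.home() / "downloads"
PDF_PATH = DOWNLOAD_DIR / "anie.201406466.pdf"
TXT_PATH = Path("angew.txt")


def check_pdf(path):
    """Return the path as a string, or raise a readable error if it is missing."""
    if not path.is_file():
        raise FileNotFoundError(
            "PDF not found at {}; edit DOWNLOAD_DIR".format(path)
        )
    return str(path)
```

With this in place, the cells that follow could pass `check_pdf(PDF_PATH)` instead of a hard-coded string, so a wrong path fails early with a clear message.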

In [4]:
with open('angew.txt', 'w') as f:
    f.write(convert_pdf_to_txt('/home/christinehc/downloads/anie.201406466.pdf'))

In [5]:
with open('angew.txt') as f:
    read_data = f.read()
    
read_data

'Angewandte\n\n.\n\nCommunications\n\nPerovskite Solar Cells\n\nDOI: 10.1002/anie.201406466\n\nA Layered Hybrid Perovskite Solar-Cell Absorber with Enhanced\nMoisture Stability**\nIan C. Smith, Eric T. Hoke, Diego Solis-Ibarra, Michael D. McGehee, and\nHemamala I. Karunadasa*\n\nAbstract: Two-dimensional hybrid perovskites are used as\nabsorbers in solar cells. Our first-generation devices containing\n(PEA)2(MA)2[Pb3I10] (1; PEA = C6H5(CH2)2NH3\n+, MA =\nCH3NH3\n+) show an open-circuit voltage of 1.18 V and\na power conversion efficiency of 4.73 %. The layered structure\nallows for high-quality films to be deposited through spin\ncoating and high-temperature annealing is not required for\ndevice fabrication. The 3D perovskite (MA)[PbI3] (2) has\nrecently been identified as a promising absorber for solar cells.\nHowever, its instability to moisture requires anhydrous proc-\nessing and operating conditions. Films of 1 are more moisture\nresistant than films of 2 and devices containing 1 
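
The extracted text keeps the PDF's line breaks, so words hyphenated at a line end come through split (for example `proc-\nessing` above). A minimal sketch of rejoining them before parsing follows; `rejoin_hyphenated` is a hypothetical helper of mine, and it only joins when both sides of the break are lowercase letters.

```python
import re


def rejoin_hyphenated(text):
    """Join words split by a hyphen at a line break, e.g. 'proc-\\nessing'."""
    # Require a lowercase letter on both sides so that breaks such as
    # 'Solar-\nCell' (capitalised continuation) are left untouched.
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


sample = "requires anhydrous proc-\nessing and operating conditions"
cleaned = rejoin_hyphenated(sample)
```

This is a heuristic: a genuinely hyphenated compound that happens to break at its hyphen (such as `high-\nquality`) would be merged too, so the result is worth spot-checking.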

## ChemDataExtractor

In [6]:
from chemdataextractor import Document

In [7]:
doc = Document(read_data)

In [8]:
doc.cems

[Span('(MA)[PbI3]', 13804, 13814),
 Span('Cl', 1354, 1356),
 Span('S', 20156, 20157),
 Span('TiO2', 10382, 10386),
 Span('nitromethane', 5138, 5150),
 Span('Pb', 15615, 15617),
 Span('perovskite (PEA)2(MA)2[Pb3I10]', 6022, 6052),
 Span('perovskite', 2022, 2032),
 Span('PEA', 411, 414),
 Span('perovskite', 17252, 17262),
 Span('perovskites', 15412, 15423),
 Span('perovskite (MA)[PbI3]', 5965, 5986),
 Span('Br', 1358, 1360),
 Span('PbI2', 14454, 14458),
 Span('perovskite', 1843, 1853),
 Span('Pb–I perovskites', 15485, 15501),
 Span('perovskite', 1452, 1462),
 Span('C6H5(CH2)2NH3', 3117, 3130),
 Span('H', 20885, 20886),
 Span('perovskites', 15775, 15786),
 Span('3 lead perovskite', 5252, 5269),
 Span('Pb', 5007, 5009),
 Span('CA 94305', 2600, 2608),
 Span('H', 17878, 17879),
 Span('H', 20250, 20251),
 Span('H', 20239, 20240),
 Span('N,N-dimethylformamide', 9585, 9606),
 Span('perovskites', 297, 308),
 Span('N = blue', 6280, 6288),
 Span('perovskite', 1168, 1178),
 Span('PbCl2', 10208, 102
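
The mention list above repeats the same text at different offsets and includes some obvious false positives (`CA 94305`, `N = blue`). A quick way to see which strings dominate is to count mentions by text. The sketch below uses plain `(text, start, end)` tuples copied from the output as stand-ins; with real `Span` objects I would assume the text is read from an attribute instead.

```python
from collections import Counter

# Stand-in (text, start, end) tuples copied from the output above.
mentions = [
    ("(MA)[PbI3]", 13804, 13814),
    ("perovskite", 2022, 2032),
    ("perovskite", 17252, 17262),
    ("perovskites", 15412, 15423),
    ("Pb", 15615, 15617),
    ("Pb", 5007, 5009),
    ("H", 20885, 20886),
    ("H", 17878, 17879),
    ("CA 94305", 2600, 2608),
]


def count_mentions(spans):
    """Count how many times each mention text occurs, ignoring offsets."""
    return Counter(text for text, _start, _end in spans)


counts = count_mentions(mentions)
```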

In [9]:
doc.abbreviation_definitions

[(['PCE'], ['power', 'conversion', 'efficiency'], None),
 (['USA'], ['University', 'Stanford', ',', 'CA', '94305'], None),
 (['USA'], ['University', ',', 'Stanford', ',', 'CA', '94305'], None),
 (['PXRD'], ['Powder', 'X-ray', 'diffraction'], None),
 (['PXRD'], ['Powder', 'X-ray', 'diffraction'], None),
 (['SEM'], ['scanning', 'electron', 'microscopy'], None),
 (['EQE'], ['external', 'quantum', 'efficiency'], None)]
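
Each entry above is a tuple of abbreviation tokens, long-form tokens, and a third element that is `None` here. A small sketch of collapsing that into a lookup table, keeping the first long form seen for each abbreviation, is below. The list is copied from the output, and `abbreviation_table` is a hypothetical helper, not part of ChemDataExtractor.

```python
# Copied from the doc.abbreviation_definitions output above.
abbreviation_definitions = [
    (['PCE'], ['power', 'conversion', 'efficiency'], None),
    (['USA'], ['University', 'Stanford', ',', 'CA', '94305'], None),
    (['USA'], ['University', ',', 'Stanford', ',', 'CA', '94305'], None),
    (['PXRD'], ['Powder', 'X-ray', 'diffraction'], None),
    (['PXRD'], ['Powder', 'X-ray', 'diffraction'], None),
    (['SEM'], ['scanning', 'electron', 'microscopy'], None),
    (['EQE'], ['external', 'quantum', 'efficiency'], None),
]


def abbreviation_table(definitions):
    """Map each abbreviation to the first long form seen for it."""
    table = {}
    for short_tokens, long_tokens, _third in definitions:
        # The third element (None in the output above) is ignored.
        short = " ".join(short_tokens)
        table.setdefault(short, " ".join(long_tokens))
    return table


table = abbreviation_table(abbreviation_definitions)
```

The `USA` entry shows why the result still needs a manual look: the "long form" is an address fragment, not a definition.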

In [10]:
doc.records

[<Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>, <Compound>]

Testing ChemDataExtractor's native PDF reader:

In [11]:
with open('/home/christinehc/downloads/anie.201406466.pdf', 'rb') as f:
    docpdf = Document.from_file(f)

In [12]:
docpdf

In [13]:
docpdf.cems

[Span('H', 17, 18),
 Span('hydrogens', 398, 407),
 Span('Solis', 117, 122),
 Span('Perovskite', 17, 27),
 Span('Lumin', 20, 25),
 Span('perovskites', 33, 44),
 Span('perovskites', 873, 884),
 Span('perovskite', 904, 914),
 Span('(PEA)2(MA)2[Pb3I10]', 84, 103),
 Span('Li', 8, 10),
 Span('Pb', 1199, 1201),
 Span('CH3NH3', 71, 77),
 Span('PbCl2', 701, 706),
 Span('S', 246, 247),
 Span('perovskites', 30, 41),
 Span('(MA)[PbI3]', 61, 71),
 Span('nitromethane', 1330, 1342),
 Span('(MA)[PbI3]', 449, 459),
 Span('(MA)[PbI3]', 641, 651),
 Span('PbI2', 338, 342),
 Span('perovskite', 51, 61),
 Span('(MA)[PbI3]', 100, 110),
 Span('(PEA)2(MA)2[Pb3I10]', 36, 55),
 Span('perovskite', 525, 535),
 Span('Perovskite', 0, 10),
 Span('CH3NH3', 175, 181),
 Span('PbI2', 95, 99),
 Span('Pb–I perovskite', 23, 38),
 Span('halide', 352, 358),
 Span('H', 59, 60),
 Span('Br', 279, 281),
 Span('.Keywords', 0, 9),
 Span('PEA', 140, 143),
 Span('perovskite', 835, 845),
 Span('spiro-OMeTAD', 1065, 1077),
 Span('tin', 

In [14]:
doc.records[0].serialize()

{'names': ['(MA)[PbX3]']}

In [15]:
doc.records.serialize()

[{'names': ['(MA)[PbX3]']},
 {'names': ['Cl']},
 {'names': ['tin']},
 {'names': ['Christoforo']},
 {'names': ['nitromethane']},
 {'names': ['acetone']},
 {'names': ['3 lead perovskite']},
 {'names': ['N = blue']},
 {'names': ['hydrogens']},
 {'names': ['(PEA)2(MA)2-']},
 {'names': ['2,2’,7,- 7’-tetrakis-(N,N-di-p-methoxyphenylamine)-9,9’-spirobifluor- ene ( spiro-OMeTAD )']},
 {'names': ['N,N-dimethylformamide']},
 {'names': ['spiro-OMeTAD']},
 {'names': ['Pb–I perovskite']},
 {'names': ['Pb–I perovskites']},
 {'names': ['fluorocarbons']},
 {'names': ['Lumin']},
 {'names': ['Li']},
 {'names': ['S']},
 {'names': ['Noh']},
 {'names': ['Solis']},
 {'names': ['C6H5(CH2)2NH3']},
 {'names': ['CH3NH3']},
 {'names': ['Br']},
 {'names': ['halide']},
 {'names': ['CA 94305', 'USA']},
 {'names': ['PEA']},
 {'names': ['TiO2']},
 {'names': ['PbCl2']},
 {'names': ['Pb']},
 {'names': ['N']},
 {'names': ['perovskite (MA)[PbI3]', '(MA)[PbI3]'], 'labels': ['2']},
 {'names': ['(PEA)2(MA)2[Pb3I10]', 'perov

In [16]:
docpdf.records.serialize()

[{'names': ['(MA)[PbX3]']},
 {'names': ['Cl']},
 {'names': ['tin']},
 {'names': ['I. Karunadasa']},
 {'names': ['Christoforo']},
 {'names': ['nitromethane']},
 {'names': ['acetone']},
 {'names': ['3 lead perovskite']},
 {'names': ['N = blue']},
 {'names': ['hydrogens']},
 {'names': ['(PEA)2(MA)2-']},
 {'names': ['2,2’,7,- 7’-tetrakis-(N,N-di-p-methoxyphenylamine)-9,9’-spirobifluor- ene ( spiro-OMeTAD )']},
 {'names': ['N,N-dimethylformamide']},
 {'names': ['spiro-OMeTAD']},
 {'names': ['Pb–I perovskite']},
 {'names': ['Pb–I perovskites']},
 {'names': ['fluorocarbons']},
 {'names': ['.Keywords']},
 {'names': ['Lumin']},
 {'names': ['Li']},
 {'names': ['S']},
 {'names': ['Noh']},
 {'names': ['Solis']},
 {'names': ['C6H5(CH2)2NH3']},
 {'names': ['CH3NH3']},
 {'names': ['Br']},
 {'names': ['halide']},
 {'names': ['CA 94305', 'USA']},
 {'names': ['PEA']},
 {'names': ['TiO2']},
 {'names': ['PbCl2']},
 {'names': ['Pb']},
 {'names': ['N']},
 {'names': ['perovskite (MA)[PbI3]', '(MA)[PbI3]'], '
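
The two serialized lists look nearly identical, so a set difference on the names makes the discrepancies easier to spot. The sketch below uses a few entries copied from the two outputs above; since both outputs are truncated, it only covers the visible portion. `name_set` is a hypothetical helper.

```python
def name_set(serialized_records):
    """Collect every name that appears in a list of serialized records."""
    return {
        name
        for record in serialized_records
        for name in record.get("names", [])
    }


# Subsets copied from the doc.records and docpdf.records outputs above.
from_text = [
    {'names': ['(MA)[PbX3]']}, {'names': ['Cl']}, {'names': ['tin']},
    {'names': ['Christoforo']}, {'names': ['fluorocarbons']},
    {'names': ['Lumin']},
]
from_pdf = [
    {'names': ['(MA)[PbX3]']}, {'names': ['Cl']}, {'names': ['tin']},
    {'names': ['I. Karunadasa']}, {'names': ['Christoforo']},
    {'names': ['fluorocarbons']}, {'names': ['.Keywords']},
    {'names': ['Lumin']},
]

only_in_pdf = sorted(name_set(from_pdf) - name_set(from_text))
only_in_text = sorted(name_set(from_text) - name_set(from_pdf))
```

On these entries, the extra names from the native PDF reader are an author name and a heading fragment, which suggests they are false positives.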

In [17]:
docpdf.elements

[Paragraph(id=None, references=[], text='Angewandte'),
 Paragraph(id=None, references=[], text='.'),
 Paragraph(id=None, references=[], text='Communications'),
 Paragraph(id=None, references=[], text='Perovskite Solar Cells'),
 Paragraph(id=None, references=[], text='DOI: 10.1002/anie.201406466'),
 Paragraph(id=None, references=[], text='A Layered Hybrid Perovskite Solar-Cell Absorber with Enhanced\nMoisture Stability**\nIan C. Smith, Eric T. Hoke, Diego Solis-Ibarra, Michael D. McGehee, and\nHemamala I. Karunadasa*'),
 Paragraph(id=None, references=[], text='Abstract: Two-dimensional hybrid perovskites are used as\nabsorbers in solar cells. Our first-generation devices containing\n(PEA)2(MA)2[Pb3I10] (1; PEA = C6H5(CH2)2NH3\n+, MA =\nCH3NH3\n+) show an open-circuit voltage of 1.18 V and\na power conversion efficiency of 4.73 %. The layered structure\nallows for high-quality films to be deposited through spin\ncoating and high-temperature annealing is not required for\ndevice fabricati

Adapted from CDE's documentation; I haven't experimented with it yet.

In [18]:
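# Imports as I recall them from the CDE documentation example;
# the module paths are an assumption and may differ between versions
from chemdataextractor.parse import R, I, Optional, merge, join
from chemdataextractor.parse.common import lrb, rrb, delim

# Define a very basic entity tagger
specifier = (I('PCEs') + I('temperature') + Optional(lrb | delim) + Optional(R(r'^T(C|c)(urie)?'))
             + Optional(rrb) | R(r'^T(C|c)(urie)?'))('specifier').add_action(join)
units = (R(r'^[CFK]\.?$'))('units').add_action(merge)
value = (R(r'^\d+(\.\,\d+)?$'))('value')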
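
To see what the `units` and `value` patterns would accept, they can be tried on single tokens with the standard `re` module. This is only a sketch, assuming the patterns are applied to one token's text at a time. `VALUE_EITHER_SEPARATOR` is a hypothetical alternative of mine, relevant only if the intent was to accept either `.` or `,` as the decimal separator.

```python
import re

UNITS = re.compile(r'^[CFK]\.?$')
VALUE_AS_WRITTEN = re.compile(r'^\d+(\.\,\d+)?$')
# Hypothetical alternative: a character class accepts '.' or ',' (not both).
VALUE_EITHER_SEPARATOR = re.compile(r'^\d+([\.,]\d+)?$')


def matches(pattern, token):
    """Return True if the pattern matches the token from its start."""
    return pattern.match(token) is not None


tokens = ['300', '4.73', '4,73', '4.,73']
as_written = [t for t in tokens if matches(VALUE_AS_WRITTEN, t)]
either = [t for t in tokens if matches(VALUE_EITHER_SEPARATOR, t)]
unit_hits = [t for t in ['K', 'K.', 'C', 'Kelvin'] if matches(UNITS, t)]
```

As written, the value pattern requires the literal sequence `.,` before the decimals, so it accepts `300` and `4.,73` but not `4.73`.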