# PDFDataExtractor Demo

PDFDataExtractor is a toolkit for automatically extracting semantic information from PDF files of scientific articles. It uses a template-based architecture and can extract information from articles published by the following publishers and journals:
* Elsevier
* Royal Society of Chemistry
* Advanced Materials family (Wiley)
* Angewandte
* Chemistry A European Journal
* American Chemical Society
* Springer (Temporarily unavailable)

## To install PDFDataExtractor, first clone the repository by running the following command in your terminal

In [None]:
git clone git@github.com:cat-lemonade/PDFDataExtractor.git

## Then, from inside the cloned directory, run the following command in your terminal (it is a shell command, not Python code):

In [1]:
python setup.py install

## Pass a single PDF file

### Import necessary module

In [2]:
from pdfdataextractor import Reader

In [2]:
path = r'../data/acs.jcim.6b00207.pdf'

In [3]:
file = Reader()

In [4]:
pdf = file.read_file(path)

Reading:  /Users/miao/Downloads/acs.jcim.6b00207.pdf
*** American Chemistry Society detected ***


### Test whether the PDF was read successfully

In [5]:
pdf.test()

PDF returned successfully


### Get Caption

In [6]:
pdf.caption()

{'figure 1': 'Figure 1. Overview of the complete information extraction system. Document Processors convert various input formats into a universal document model that consists of a single linear stream of elements such as paragraphs and tables that are each processed independently to extract information. This information is then merged to produce a single collection of chemical records for the overall document.',
 'figure 2': 'Figure 2. Natural language processing pipeline. Text is ﬁrst split into sentences and then into individual tokens. The part-of-speech tagger and entity recognizer outputs are combined to assign a single tag to each token, which is then parsed using a rule-based grammar to produce a tree structure. This tree structure is interpreted to extract individual chemical records for this sentence, which are then combined to resolve data with records interdependencies and produce uniﬁed records for depositing in a database. Tags shown: NN = noun, CD = cardinal number, VBZ 

### Get Keywords

In [28]:
pdf.keywords()  # Note: some articles, including this one, do not contain keywords.

''

### Get Title

In [29]:
pdf.title()

'ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientiﬁc Literature'

### Get DOI

In [30]:
pdf.doi()

'10.1021/acs.jcim.6b00207'

### Get Abstract

In [31]:
pdf.abstract()

'ABSTRACT: The emergence of “big data” initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientiﬁc literature. Since chemical information can be present in ﬁgures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientiﬁc documents that can be used to populate structured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for tokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved performance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus of chemistry art

### Get Journal

In [7]:
pdf.journal()

{'name': 'J. Chem. Inf. Model. 2016, 56, 1894−1904',
 'year': '2016',
 'volume': '56',
 'page': '1894-1904'}

### Get Journal name

In [33]:
pdf.journal('name')

'J. Chem. Inf. Model. 2016, 56, 1894−1904'

### Get Journal Year

In [34]:
pdf.journal('year')

'2016'

### Get Journal Volume

In [35]:
pdf.journal('volume')

'56'

### Get Journal Page

In [36]:
pdf.journal('page')

'1894-1904'
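
The accessors shown so far can be combined into a single record per article. The following is a minimal sketch, assuming `pdf` is the object returned by `Reader().read_file(path)` and that `journal()` returns a dictionary as in the output above; `collect_metadata` is a hypothetical helper and not part of PDFDataExtractor.

```python
def collect_metadata(pdf):
    """Gather the bibliographic fields into one flat dict.

    Hypothetical helper: `pdf` is assumed to expose title(), doi(),
    keywords() and journal() as demonstrated in this notebook.
    """
    journal = pdf.journal() or {}
    return {
        "title": pdf.title(),
        "doi": pdf.doi(),
        # keywords() returned '' for the article above, so map empty to None
        "keywords": pdf.keywords() or None,
        "journal": journal.get("name"),
        "year": journal.get("year"),
        "volume": journal.get("volume"),
        "page": journal.get("page"),
    }
```

Calling `collect_metadata(pdf)` on the article loaded earlier would give one dictionary that can be appended to a list or written to a file.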

### Get Plain Text

In [37]:
pdf.plaintext()

'Article\n\npubs.acs.org/jcim\n\nChemDataExtractor: A Toolkit for Automated Extraction of Chemical\nInformation from the Scientiﬁc Literature\nMatthew C. Swain and Jacqueline M. Cole*\n\nCavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.\n\nABSTRACT: The emergence of “big data” initiatives has led\nto the need for tools that can automatically extract valuable\nchemical information from large volumes of unstructured data,\nsuch as the scientiﬁc literature. Since chemical information can\nbe present in ﬁgures, tables, and textual paragraphs, successful\ninformation extraction often depends on the ability to interpret\nall of these domains simultaneously. We present a complete\ntoolkit for the automated extraction of chemical entities and\ntheir associated properties, measurements, and relationships\nfrom scientiﬁc documents that can be used to populate\nstructured chemical databases. Our system provides an extensible, chemistry-aware, natural la

### Get Section titles and corresponding text

In [38]:
pdf.section()

{'■ INTRODUCTION': ['Scientiﬁc results are typically communicated in the form of papers, patents, and theses that contain unstructured and semistructured data described by free-ﬂowing natural language that is not readily interpretable by machines. Yet, manual data abstraction by humans with expert knowledge is an expensive, labor-intensive, and error-prone process. With the continued growth of new publications, it is becoming increasingly diﬃcult to create and maintain up-to-date manually curated databases, and automated information extraction by machines is fast becoming a necessity.',
  'The chemistry literature presents an attractive and tractable target for this automated extraction as it is typically comprised of formulaic, data-rich language that is well-suited for machine analysis with the potential for high recall and precision. The extracted chemical information can be used to create and populate databases of chemical structures, properties, and observations, opening up new av
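
Since `section()` returns a dictionary that maps each section title to a list of paragraphs, a little post-processing can make it easier to work with. The sketch below assumes exactly that structure; `tidy_sections` is a hypothetical helper that strips the leading `■` marker from titles and joins each section's paragraphs into one string.

```python
def tidy_sections(sections):
    """Strip the leading '■' marker from titles and join paragraphs.

    Hypothetical helper: `sections` is assumed to map a title string
    to a list of paragraph strings, as in the output above.
    """
    tidy = {}
    for title, paragraphs in sections.items():
        # lstrip removes any leading '■' characters and spaces
        clean_title = title.lstrip("■ ").strip()
        tidy[clean_title] = "\n\n".join(paragraphs)
    return tidy
```

With the article above, `tidy_sections(pdf.section())` would be keyed by `'INTRODUCTION'` rather than `'■ INTRODUCTION'`.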

### Get References

In [5]:
for seq, ref in pdf.reference().items():
    print(seq)
    print(ref)

0
['National Science and Technology Council', ' Oﬃce of Science and Technology Policy. Materials Genome Initiative for Global Competitive- ness; 2011. ']
1
['Olivares-Amaya', ' R.; Amador-Bedolla', ' C.; Hachmann', ' J.; Atahan- Evrenk', ' S.; Sanchez-Carrera', ' R. S.; Vogt', ' L.; Aspuru-Guzik', ' A. Accelerated Computational Discovery of High-performance Materials for Organic Photovoltaics by Means of Cheminformatics. Energy Environ. Sci. 2011', ' 4', ' 4849−4861. ']
2
['Jain', ' A.; Ong', ' S. P.; Hautier', ' G.; Chen', ' W.; Richards', ' W. D.; Dacek', ' S.; Cholia', ' S.; Gunter', ' D.; Skinner', ' D.; Ceder', ' G.; Persson', ' K. A. Commentary: The Materials Project: A Materials Genome Approach to Accelerating Materials Innovation. APL Mater. 2013', ' 1', ' 011002. ']
3
['Tsuruoka', ' Y.; Tateishi', ' Y.; Kim', ' J.-D.; Ohta', ' T.; McNaught', ' J.; Ananiadou', ' S.; Tsujii', ' J. In Advances in Informatics; Bozanis', ' P.', ' Houstis', ' E. N.', ' Eds.; Springer Berlin Heidelbe
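
Each reference above is a list of fragments that look as if the original entry was split at commas. Under that assumption (inferred from the output, not confirmed by the toolkit), the fragments can be joined back into one string per reference. `join_reference` and `references_as_text` are hypothetical helpers.

```python
def join_reference(parts):
    """Rebuild one reference string from its fragments.

    Assumes the fragments were produced by splitting on commas.
    """
    return ",".join(parts).strip()


def references_as_text(references):
    """Map each sequence number to a single reference string."""
    return {seq: join_reference(parts) for seq, parts in references.items()}
```

For example, `references_as_text(pdf.reference())` would yield one readable string per sequence number instead of a list of pieces.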

## Pass multiple files at once

In [2]:
import glob

In [47]:
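def read_single(file):
    # Read one PDF and print its abstract.
    reader = Reader()
    pdf = reader.read_file(file)
    print(pdf.abstract())


def read_multiple(paths):
    # Process each PDF in turn, printing a separator after each abstract.
    for path in paths:
        read_single(path)
        print('-------------------', '\n')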


In [48]:
read_multiple(glob.glob(r'/Users/miao/Desktop/test/els/*.pdf'))

Reading:  /Users/miao/Desktop/test/els/6.pdf
*** Elsevier detected ***
For policymakers, planners, urban design practitioners and city service decision-makers who endeavour to create policies and take decisions to improve the function of cities, developing an understanding of cities, and the particular city in question, is important. However, in the ever-increasing ﬁeld of urban measurement and analysis, the challenges cities face are frequently presumed: crime and fear of crime, social inequality, environmental degradation, economic deterioration and disjointed governance. Although it may be that many cities share similar problems, it is unwise to assume that cities share the same challenges, to the same degree or in the same combination. And yet, diagnosing the challenges a city faces is often overlooked in preference for improving the understanding of known challenges. To address this oversight, this study evidences the need to diagnose urban challenges, introduces a novel mixed-met

*** Elsevier detected ***
Cities are increasingly challenged to improve their competitiveness. Performance indicators stand as an important element to interpret the success of the policy regime adopted by the municipality. Cities with a set of superior economic, social and environmental indicators have the potential to present better living conditions for their inhabitants. In this context, the aim of this research is to analyze whether the in- dicators published by Brazilian cities are aligned with the approach of a smart or sustainable city. The research used a set of 3150 data points regarding the performance of these cities. It analyzed the per- formance of the 150 best cities, divided into three groups of interest identiﬁed as small cities, medium- sized cities and big cities, on a set of 21 indicators. The set of identiﬁed indicators shows the attention of the cities to socioeconomic and information and communication technologies issues, thus revealing that Brazilian city manager
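
When processing many files, it can help to collect the results and keep going if a single PDF cannot be parsed. The sketch below is one way to do that; `extract_abstracts` is a hypothetical helper, and `read_file` stands for any callable that takes a path and returns an object with an `abstract()` method.

```python
def extract_abstracts(paths, read_file):
    """Return (abstracts, failures) for a list of PDF paths.

    Hypothetical helper: `read_file` is any callable that takes a path
    and returns an object with an abstract() method.
    """
    abstracts, failures = {}, {}
    for path in paths:
        try:
            abstracts[path] = read_file(path).abstract()
        except Exception as exc:
            # Record the problem and continue with the remaining files
            failures[path] = repr(exc)
    return abstracts, failures
```

To mirror the cells above, which create a new `Reader` for every file, it could be called as `extract_abstracts(glob.glob(pattern), lambda p: Reader().read_file(p))`, where `pattern` is the glob pattern for your PDF files.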

## Use PDFDataExtractor to perform chemistry-related extraction

### Pass the flag "chem=True" to have the function also extract chemistry-related information with ChemDataExtractor while it extracts metadata

In [3]:
file_test = r'/Volumes/Backup/PDE_papers/articles/Elesvier/dssc/The-effect-of-molecular-structure-on-the-properties-of-quinox_2020_Dyes-and-.pdf'

In [4]:
reader = Reader()

In [5]:
pdf = reader.read_file(file_test)

Reading:  /Volumes/Backup/PDE_papers/articles/Elesvier/dssc/The-effect-of-molecular-structure-on-the-properties-of-quinox_2020_Dyes-and-.pdf
*** Elsevier detected ***


### Pass True to 'chem'

In [6]:
r = pdf.abstract(chem=True)

### Show records

In [7]:
r.records.serialize()

[{'names': ['donor-π-bridge-acceptor-π']},
 {'names': ['quinoxaline']},
 {'names': ['deep red']}]
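
The serialized records are a list of dictionaries, each with a `names` list in the output above. A minimal sketch for flattening them into a list of unique names follows; `chemical_names` is a hypothetical helper, and it assumes every record is a dictionary that may or may not carry a `names` key.

```python
def chemical_names(serialized_records):
    """Flatten serialized records into unique names, keeping first-seen order.

    Hypothetical helper: records without a 'names' key are skipped.
    """
    seen = []
    for record in serialized_records:
        for name in record.get("names", []):
            if name not in seen:
                seen.append(name)
    return seen
```

Applied to the output above, `chemical_names(r.records.serialize())` would return the three names as a flat list. Note that entries such as `'deep red'` show that not every extracted name is an actual compound.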

## Things to note

### PDFDataExtractor uses ChemDataExtractor to perform all chemistry-related extraction. For more detailed use cases, please refer to http://chemdataextractor.org

## Known Issues

In ACS
* A few journals use two section-title styles at the same time: numbered titles and titles marked with ■. Because the two styles have very different font sizes, this can confuse the title filtering function. Reference extraction is not affected
* Extracted references might not be in order
* Parts of an extracted reference could be missing

In Elsevier
* Journal extraction is not yet robust, so journal information can be missing
* Unnumbered references can be messy

In RSC
* Title can be missing
* Journal year, volume and page numbers can be missing in certain articles
* Some section titles can be missed, but reference extraction remains reliable

In Advanced Materials family
* Reference entries can be mixed up
* Keywords can end up inside reference entries (roughly 1 in 20 cases)
* Some authors place their biographies at the very end of the article; this text is currently not excluded from the references

In Chemistry A European Journal (CAEJ)
* Keywords can be incomplete

In Angewandte
* Keywords might not be in order