# Examples using the various tools that make up PaperParser

In [2]:
import sys
sys.path.insert(0, '../paperparser/read_paper')
import extract_sentences
import sentence_classifier
import search_paper_for_perform_sentences

sys.path.insert(0, '../paperparser/parse')
import anneal
import order
import spincoat
import pce

import pandas as pd
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

## Reading a paper for synthesis information

The first step is to load a paper with `chemdataextractor`'s HTML reader, which `paperparser` wraps in its `read_paper.extract_sentences` module.

In [5]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read paper in HTML format as an input and store as a 
# chemdataextractor Document type.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
paper = extract_sentences.read_html_paper('journal_articles/Paper0.html')

`paperparser`'s `read_paper.extract_sentences` module also contains methods for extracting all sentences together with unique indices corresponding to their location in the paper.

In [6]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Extract all sentences from the document and keep track of
# each sentence's original location (element index in document
# and sentence index in element)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
X_sentences, sentences_record = extract_sentences.extract_all_sentences(paper)

### Finding sentences containing synthesis information

To avoid parsing the whole paper, `paperparser` uses a pre-trained SVM classifier to find sentences corresponding to synthesis. To use it, we load the pre-trained model from a `.pkl` file. We acknowledge the security issues with using `pkl`, and with more time we would like to implement `hdf5`.

In [7]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Load pre-trained model to extract relevant sentences from 
# paper that contains synthesis steps
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
syn_sen_model = joblib.load('syn_sen_model.pkl')

In [8]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Use model to classify sentences as pertaining to 
# synthesis or not.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(
    pred_data, 
    synthesis_sentences, 
    not_synthesis_sentences,
    ) = sentence_classifier.classify_sentences(
        syn_sen_model, 
        X_sentences,
        )

In [9]:
synthesis_sentences

['The spin-coated layer formed with the solvent mixture followed by the toluene drip is extremely uniform and transparent, and covers the full surface with low surface roughness.',
 'We see that the formation of the perovskite phase is accompanied by the complete transformation of the MAI–PbI2–DMSO at 130 °C, whereas both MAI–PbI2–DMSO and perovskite phases coexist at 100 °C.',
 'Accordingly, the formation of the intermediate phase is a critical factor for smoothing the surface via dropwise toluene application, which finally results in compact and uniform thin layers.',
 'Generally, the average value of the efficiency, determined from the forward and reverse scans should be widely accepted when the scanning delay time is longer than 40 ms (ref.\xa023), because an excessively long time to complete the measurement is impractical.',
 'For a deeper understanding of the dependence of the I–V parameters on both scan directions, we investigated the difference between the forward and reverse s

#### Organizing classified sentences into a dataframe 

In [10]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Arrange sentences to demonstrate indexing and 'Tag' for synthesis
# classification.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
df_sentences = pd.DataFrame({'Sentences':X_sentences, 'Element # in doc':[rec[0] for rec in sentences_record],
                             'Sentence_index_in_para':[rec[1] for rec in sentences_record], 'Tag':pred_data})
df_sentences

| | Sentences | Element # in doc | Sentence_index_in_para | Tag |
| --- | --- | --- | --- | --- |
| 0 | Solvent engineering for high-performance inorg... | 0 | 0 | 0.0 |
| 1 | Cookie Notice | 1 | 0 | 0.0 |
| 2 | We use cookies to personalise content and ads,... | 2 | 0 | 0.0 |
| 3 | We also share information about your use of ou... | 2 | 1 | 0.0 |
| 4 | You can manage your preferences in 'Manage Coo... | 2 | 2 | 0.0 |
| 5 | Close | 3 | 0 | 0.0 |
| 6 | OK | 4 | 0 | 0.0 |
| 7 | Manage Cookies | 5 | 0 | 0.0 |
| 8 | Your Privacy | 6 | 0 | 0.0 |
| 9 | Strictly Necessary Cookies | 7 | 0 | 0.0 |


In [15]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Show sentences tagged as containing synthesis information; should be
# equivalent to the 'synthesis_sentences' output from the classifier.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
synth_sent_df = df_sentences.loc[df_sentences['Tag'] == 1.0]
synth_sent_df

| | Sentences | Element # in doc | Sentence_index_in_para | Tag |
| --- | --- | --- | --- | --- |
| 124 | The spin-coated layer formed with the solvent ... | 92 | 15 | 1.0 |
| 152 | We see that the formation of the perovskite ph... | 96 | 2 | 1.0 |
| 160 | Accordingly, the formation of the intermediate... | 99 | 0 | 1.0 |
| 187 | Generally, the average value of the efficiency... | 100 | 21 | 1.0 |
| 188 | For a deeper understanding of the dependence o... | 103 | 0 | 1.0 |
| 217 | CH3NH3I (MAI) and CH3NH3Br (MABr) were first s... | 109 | 2 | 1.0 |
| 218 | The precipitate was recovered by evaporation a... | 109 | 3 | 1.0 |
| 220 | The resulting solution was coated onto the mp-... | 109 | 5 | 1.0 |
| 221 | During the second spin-coating step, the subst... | 109 | 6 | 1.0 |
| 223 | The substrate was dried on a hot plate at 100 ... | 109 | 8 | 1.0 |


### Parsing synthesis information from selected sentences

With the sentences containing the desired synthesis information extracted and classified, we can proceed to parse the data efficiently. Currently, specific parsers are implemented for spin-coating and annealing, but in the future many more can be added to capture all common synthesis steps.

As of the current implementation, each parser must be called individually from the `paperparser.parse` module.

In [20]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Run spincoat parser on each synthesis-containing sentence
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
spincoat_parse_results = [spincoat.parse_spincoat(syn_sentence) for syn_sentence in synthesis_sentences]
spincoat_parse_results

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [{'spin_coat': [{'spds': [{'spdvalue': '1,000', 'spdunits': 'r.p.m'},
      {'spdvalue': '5,000', 'spdunits': 'r.p.m'}],
     'times': [{'timevalue': '10', 'timeunits': 's'},
      {'timevalue': '20', 'timeunits': 's'}]}]}],
 [],
 [],
 [{'spin_coat': [{'spds': [{'spdvalue': '3,000', 'spdunits': 'r.p.m'}],
     'times': [{'timevalue': '30', 'timeunits': 's'}]}]}],
 []]

Sentences that match a parser's patterns return a list of nested dictionaries of synthesis parameters. Sentences that do not match return an empty list.
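To make the nested structure concrete, here is a minimal sketch that flattens results of this shape into one row per spin-coating step. The helper name `flatten_spincoat` is hypothetical (it is not part of `paperparser`), and pairing the n-th speed with the n-th time is an assumption about how the values should be read. The sample entries are copied from the output above.

```python
from itertools import zip_longest


def flatten_spincoat(parse_results):
    """Flatten per-sentence spin-coat results into a list of row dicts.

    Hypothetical helper, not part of paperparser. It assumes the n-th
    speed belongs with the n-th time, which the parser does not guarantee.
    """
    rows = []
    for sentence_idx, matches in enumerate(parse_results):
        for match in matches:  # an empty list contributes no rows
            for step in match.get('spin_coat', []):
                speeds = step.get('spds', [])
                times = step.get('times', [])
                # Pad the shorter list with empty dicts so nothing is dropped.
                for spd, time in zip_longest(speeds, times, fillvalue={}):
                    rows.append({
                        'sentence': sentence_idx,
                        'speed': spd.get('spdvalue'),
                        'speed_units': spd.get('spdunits'),
                        'time': time.get('timevalue'),
                        'time_units': time.get('timeunits'),
                    })
    return rows


# One non-match plus the two matches copied from the parser output above.
sample = [
    [],
    [{'spin_coat': [{'spds': [{'spdvalue': '1,000', 'spdunits': 'r.p.m'},
                              {'spdvalue': '5,000', 'spdunits': 'r.p.m'}],
                     'times': [{'timevalue': '10', 'timeunits': 's'},
                               {'timevalue': '20', 'timeunits': 's'}]}]}],
    [{'spin_coat': [{'spds': [{'spdvalue': '3,000', 'spdunits': 'r.p.m'}],
                     'times': [{'timevalue': '30', 'timeunits': 's'}]}]}],
]

rows = flatten_spincoat(sample)
for row in rows:
    print(row)
```

A flat list like this can be passed straight to `pd.DataFrame(rows)` if a tabular view is preferred.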

In [21]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Run anneal parser on each synthesis-containing sentence
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
anneal_parse_results = [anneal.parse_anneal(syn_sentence) for syn_sentence in synthesis_sentences]
anneal_parse_results

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [{'anneal': [{'temps': [{'tempvalue': '100', 'tempunits': '°C'}],
     'times': [{'timevalue': '10', 'timeunits': 'min'}]}]}],
 [],
 []]
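Because each parser is called on its own, a thin wrapper can run several parsers over the same sentences and keep only the hits. The sketch below is one possible way to organize this, not something `paperparser` provides: `run_parsers` is a hypothetical helper, and the two stand-in functions only imitate the return shape of `spincoat.parse_spincoat` and `anneal.parse_anneal` so the example runs without `paperparser` installed. The demo sentences are made up for illustration.

```python
def run_parsers(sentences, parsers):
    """Apply each parser to each sentence and keep only non-empty results.

    `parsers` maps a label to a callable that takes one sentence and
    returns a list (empty when nothing matched). Hypothetical helper.
    """
    hits = {}
    for idx, sentence in enumerate(sentences):
        for label, parse in parsers.items():
            result = parse(sentence)
            if result:  # skip empty lists
                hits.setdefault(idx, {})[label] = result
    return hits


# Stand-ins that mimic the parsers' return shape (NOT the real parsers).
def fake_spincoat(sentence):
    return [{'spin_coat': 'matched'}] if 'spin-coat' in sentence else []


def fake_anneal(sentence):
    return [{'anneal': 'matched'}] if 'hot plate' in sentence else []


# Made-up sentences for illustration only.
demo_sentences = [
    'The precipitate was recovered by evaporation.',
    'During the second spin-coating step, toluene was dripped.',
    'The substrate was dried on a hot plate.',
]
hits = run_parsers(demo_sentences,
                   {'spincoat': fake_spincoat, 'anneal': fake_anneal})
print(hits)
```

In the notebook, the same helper could be called as `run_parsers(synthesis_sentences, {'spincoat': spincoat.parse_spincoat, 'anneal': anneal.parse_anneal})`.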

#### Parsing the order of steps in synthesis 

`paperparser` contains a method, `order.syn_order`, that can be used to extract the order of the steps in a synthesis procedure. Let's try it on a specific paragraph from the paper loaded above.

In [23]:
a_paragraph = paper[109]
a_paragraph

In [24]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Find synthesis steps and return order
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
steps_order, steps_dict = order.syn_order(a_paragraph)
steps_dict

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  Searched in:
    - '/Users/chair/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Users/chair/anaconda/envs/direct/nltk_data'
    - '/Users/chair/anaconda/envs/direct/lib/nltk_data'
    - ''
**********************************************************************
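
The traceback above means the NLTK `punkt` tokenizer data is missing on this machine, so `order.syn_order` could not run. Below is a minimal sketch of a guard that downloads the resource only when it is absent. `ensure_resource` is a hypothetical helper; its `find` and `download` parameters exist only so the logic can be exercised without NLTK or a network connection. By default it uses `nltk.data.find` (NLTK's resource lookup) and `nltk.download` (the call named in the error message).

```python
def ensure_resource(path='tokenizers/punkt', name='punkt',
                    find=None, download=None):
    """Download an NLTK resource only if it is not already installed.

    Returns True when a download was triggered, False otherwise.
    Hypothetical helper, not part of paperparser or NLTK.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the function can be tested offline
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(path)  # raises LookupError when the resource is missing
    except LookupError:
        download(name)
        return True
    return False


# In the notebook, before calling order.syn_order:
# ensure_resource()
```

Calling `ensure_resource()` once before `order.syn_order(a_paragraph)` should clear this particular `LookupError`, assuming the machine has network access.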


## Parsing performance metrics

The goal of `paperparser` is to correlate synthesis parameters with the performance metrics of the resulting device. The second part is accomplished by a set of tools for extracting the performance metrics. First, we will perform a simple search of the paper for sentences containing a performance metric identifier ('PCE', 'JSC', 'VOC', etc.) along with numeric content.

For this early release of `paperparser`, only the PCE parser is fully functional, and we demo it below.

### Searching paper for relevant sentences to parse

The first step, as above, is to find the relevant sentences to pass to the parser, which keeps parsing fast. Luckily, performance metric information is easier to identify than synthesis information, so we do not need any fancy machine learning with a pre-trained model. Instead we implement a simple search with the method `search_paper_for_perform_sentences.list_perform_sents`, which looks for PCE information by default but is implemented with generality in mind. Check the docstring and source code for more info.

In [27]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Find sentences relevant to PCE 
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

relevant_sentences_to_pce = search_paper_for_perform_sentences.list_perform_sents(
    'journal_articles/Paper0.html'
    )

relevant_sentences_to_pce

['For example, when MAPbI3 was loaded on a mesoporous (mp)-TiO2 electrode by the sequential deposition of PbI2 and methylammonium iodide (MAI), a 15.0% power-conversion efficiency (PCE) was achieved under 1 sun illumination11.',
 'The Jsc, Voc and FF values obtained from the I–V curve of the reverse scan were 19.2 mA cm−2, 1.09 V and 0.69, respectively, yielding a PCE of 14.4% under standard AM 1.5 conditions.',
 'The average values from the J–V curves from the reverse and forward scans (Fig.\xa05a) exhibited a Jsc of 19.58 mA cm−2, Voc of 1.105 V, and FF of 76.2%, corresponding to a PCE of 16.5% under standard AM 1.5 G conditions.',
 'The best device also showed a very broad IPCE plateau of over 80% between 420 and 700 nm, as shown in Fig.\xa05b.',
 'One of these devices was certified by the standardized method in a photovoltaics calibration laboratory, confirming a PCE of 16.2% under AM 1.5 G full sun (Supplementary Fig.\xa06).',
 'In summary, we developed a solvent-engineering techn

Wow! Look at that output. It's looking real nice. A whole paper boiled down to just those sentences, and nearly all of them carry quantitative info on the PCE! I don't know about you, but I am impressed. Unfortunately, they don't come with index information for their location in the paper. That is sure to be added in a later version of `paperparser`, because it would be nice to know which section of the paper this information comes from.
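
One way the location information could be kept is to search the already-indexed sentences instead of re-reading the file. The sketch below is an assumption, not how `list_perform_sents` works internally: `find_metric_sentences` is a hypothetical helper that takes a sentence list plus a parallel list of `(element_index, sentence_index)` pairs, the same shape as `X_sentences` and `sentences_record` above, and keeps sentences that contain a metric keyword as a whole word and at least one digit.

```python
import re


def find_metric_sentences(sentences, records, keywords=('PCE',)):
    """Return (element_idx, sentence_idx, sentence) for metric sentences.

    A sentence qualifies if it contains any keyword as a whole word
    (optionally pluralized with 's') and at least one digit.
    Hypothetical helper, not part of paperparser.
    """
    keyword_re = re.compile(
        r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')s?\b')
    digit_re = re.compile(r'\d')
    found = []
    for sentence, (element_idx, sentence_idx) in zip(sentences, records):
        if keyword_re.search(sentence) and digit_re.search(sentence):
            found.append((element_idx, sentence_idx, sentence))
    return found


# Illustrative input; the sentences and index pairs are made up.
demo_sentences = [
    'One of these devices was certified, confirming a PCE of 16.2%.',
    'The best device showed a broad IPCE plateau of over 80%.',
    'The PCE depends strongly on film morphology.',
]
demo_records = [(10, 0), (10, 1), (11, 0)]

found = find_metric_sentences(demo_sentences, demo_records)
print(found)
```

The word boundary in the keyword pattern keeps `IPCE` from counting as `PCE`, and the digit check drops sentences that mention the metric without reporting a value.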

Back to business: these sentences can be fed to the PCE parser to extract values and relations.

In [29]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Parse relevant sentences for quantitative information on the PCE
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
parsed_pce_info = pce.parse_pce(relevant_sentences_to_pce)
parsed_pce_info

[[],
 [{'pce_pattern': [{'value': '14.4', 'units': '%'}]}],
 [{'pce_pattern': [{'value': '16.5', 'units': '%'}]}],
 [],
 [{'pce_pattern': [{'value': '16.2', 'units': '%'}]}],
 [{'pce_pattern': [{'value': '16.5', 'units': '%'}]}]]

The parser works as hoped for all except the first sentence, where the PCE is reported in the phrase `'15.0% power-conversion efficiency (PCE)'`. Although this pattern is incorporated into `parse.pce`, an unresolved bug keeps it from matching here.

The 4th sentence was correctly left unparsed because it reports `'a very broad IPCE plateau of over 80%'`, which is not info we want.

##### Another small demonstration
The PCE parser is designed to detect different patterns of phrasing.

In [10]:
pce.parse_pce(['Solar cells containing THE CHEMICALS display PCEs up to 4.73%.'])

[[{'pce_pattern': [{'value': '4.73', 'units': '%'}]}]]

In [11]:
pce.parse_pce(['Solar cells containing THE CHEMICALS display a 4.73% PCE.'])

[[{'pce_pattern': [{'value': '4.73', 'units': '%'}]}]]
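
The phrasings seen in this notebook can be illustrated with plain regular expressions. This is only a sketch of the idea and not how `parse.pce` is implemented; the two patterns and the name `extract_pce` are assumptions made for the example. The word boundary before `PCE` is what keeps `IPCE` from matching.

```python
import re

NUMBER = r'(\d+(?:\.\d+)?)'

# Keyword first: "PCE of 14.4%", "PCEs up to 4.73%"
KEYWORD_THEN_VALUE = re.compile(
    r'\bPCEs?\b(?:\s+(?:of|up to))?\s+' + NUMBER + r'\s*%')

# Value first: "a 4.73% PCE", "15.0% power-conversion efficiency (PCE)"
VALUE_THEN_KEYWORD = re.compile(
    NUMBER + r'\s*%\s+(?:power-conversion efficiency\s+\(PCE\)|PCE\b)')


def extract_pce(sentence):
    """Return PCE values (as strings) found by either pattern.

    Hypothetical regex-based stand-in, not the paperparser PCE parser.
    """
    values = KEYWORD_THEN_VALUE.findall(sentence)
    values += VALUE_THEN_KEYWORD.findall(sentence)
    return values


# Phrases taken from the sentences shown in this notebook.
examples = [
    'Solar cells containing THE CHEMICALS display PCEs up to 4.73%.',
    'Solar cells containing THE CHEMICALS display a 4.73% PCE.',
    'a 15.0% power-conversion efficiency (PCE) was achieved',
    'yielding a PCE of 14.4% under standard AM 1.5 conditions.',
    'a very broad IPCE plateau of over 80% between 420 and 700 nm',
]
for text in examples:
    print(extract_pce(text), '<-', text)
```

Each new phrasing needs its own pattern in this approach, so coverage grows only as fast as patterns are added.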

# Work for future development...

An obvious flaw in both the synthesis and performance metric parsers as implemented is that there is no way to associate each parsed step or value with a chemical identifier. This is critical when a paper contains information on more than one device (as the example paper in this notebook obviously does). ChemDataExtractor contains a full suite of tools for accomplishing exactly this task, but implementation is not simple and could not be accomplished by the required launch date of `paperparser` version 0.1.