# PaperParser Example Notebook

**PaperParser** is a toolkit for automatically extracting key parameters (synthesis, performance) from a scientific article.  In this example notebook, we demonstrate the main functionality of **PaperParser**, what each component of the package is built to do, and how to use the package from start to finish.

## 0.  Initialization

In [200]:
import sys
sys.path.insert(0, '../paperparser/read_paper')
import extract_sentences
import sentence_classifier
import search_paper_for_perform_sentences

sys.path.insert(0, '../paperparser/parse')
import anneal
import order
import spincoat
import pce

import pandas as pd
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

## 1.  Reading Synthesis Information from a Paper

PaperParser is designed to read through a scientific document, identify only the relevant synthesis-related sentences therein, and parse those sentences to extract synthetic steps and parameters.  The full process will be demonstrated below.

First, load a scientific paper through PaperParser's `read_paper.extract_sentences` module.  (The module is built on top of `chemdataextractor`'s .html reader.)  The paper is stored as a ChemDataExtractor Document type.  To demonstrate, we load an example paper provided in the PaperParser repo.

In [101]:
paper = extract_sentences.read_html_paper('journal_articles/Paper0.html')

PaperParser's `read_paper.extract_sentences` module can also extract all sentences and their location in the paper (as represented by unique indices).

In [3]:
X_sentences, sentences_record = extract_sentences.extract_all_sentences(paper)

### 1.1.  Identifying Synthesis Sentences in a Paper

Rather than read the whole paper, `paperparser` implements a pre-trained SVM classifier to find sentences corresponding to synthesis.  The pre-trained model has been stored as a `.pkl` file.

(Note: We acknowledge potential security issues with `pkl` files; future updates will implement `hdf5`.)
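Until a safer format is adopted, one common precaution is to compare a pickle file against a known checksum before loading it. The following is a minimal sketch using only the standard library. The helper names `sha256_of_file` and `is_trusted`, and the idea of distributing an expected digest alongside the model, are assumptions for illustration and are not part of PaperParser.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_trusted(path, expected_digest):
    """True only if the file's digest equals the digest we expect."""
    return sha256_of_file(path) == expected_digest


# Demonstration on a throwaway file (a stand-in for 'syn_sen_model.pkl').
with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as tmp:
    tmp.write(b"stand-in model bytes")
    demo_path = tmp.name

known_good = hashlib.sha256(b"stand-in model bytes").hexdigest()
matches = is_trusted(demo_path, known_good)   # digest agrees
mismatches = is_trusted(demo_path, "0" * 64)  # digest disagrees
print(matches, mismatches)
os.remove(demo_path)
```

A checksum only helps if the expected digest comes from a source you trust; it does not make unpickling an unknown file safe.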

In [4]:
syn_sen_model = joblib.load('syn_sen_model.pkl')

We can use the classifier on our example paper to identify synthesis and non-synthesis sentences.

In [6]:
pred_data, synthesis_sentences, not_synthesis_sentences = sentence_classifier.classify_sentences(syn_sen_model, X_sentences)

A preview of the classified sentences is shown below (for the full list, uncomment the last line and rerun).

In [12]:
print(f"First synthesis sentence: '{synthesis_sentences[0]}'",
      f"Number of synthesis sentences: {len(synthesis_sentences)}",
      f"Total number of sentences: {len(synthesis_sentences) + len(not_synthesis_sentences)}",
      sep="\n")
#synthesis_sentences

First synthesis sentence: 'The spin-coated layer formed with the solvent mixture followed by the toluene drip is extremely uniform and transparent, and covers the full surface with low surface roughness.'
Number of synthesis sentences: 12
Total number of sentences: 851


Nice!  We have now reduced the number of sentences that PaperParser has to read from 851 to only 12.

#### 1.1.1. Organizing Synthesis Sentences into a Dataframe

For user-friendliness, we have also arranged the sentences into a pandas dataframe.  Here the location indices are easier to read, and the `Tag` column marks each sentence as synthesis (=1) or non-synthesis (=0).

(To streamline this notebook, only the first 5 sentences are shown; uncomment the last line to see the full dataframe!)

In [15]:
df_sentences = pd.DataFrame({'Sentences':X_sentences, 'Element # in doc':[rec[0] for rec in sentences_record],
                             'Sentence_index_in_para':[rec[1] for rec in sentences_record], 'Tag':pred_data})
df_sentences.head()
#df_sentences

| | Sentences | Element # in doc | Sentence_index_in_para | Tag |
|---|---|---|---|---|
| 0 | Solvent engineering for high-performance inorg... | 0 | 0 | 0.0 |
| 1 | Cookie Notice | 1 | 0 | 0.0 |
| 2 | We use cookies to personalise content and ads,... | 2 | 0 | 0.0 |
| 3 | We also share information about your use of ou... | 2 | 1 | 0.0 |
| 4 | You can manage your preferences in 'Manage Coo... | 2 | 2 | 0.0 |


We can also use pandas to show only synthesis sentences.

In [17]:
synth_sent_df = df_sentences.loc[df_sentences['Tag'] == 1.0]
synth_sent_df.head()
#synth_sent_df

| | Sentences | Element # in doc | Sentence_index_in_para | Tag |
|---|---|---|---|---|
| 124 | The spin-coated layer formed with the solvent ... | 92 | 15 | 1.0 |
| 152 | We see that the formation of the perovskite ph... | 96 | 2 | 1.0 |
| 160 | Accordingly, the formation of the intermediate... | 99 | 0 | 1.0 |
| 187 | Generally, the average value of the efficiency... | 100 | 21 | 1.0 |
| 188 | For a deeper understanding of the dependence o... | 103 | 0 | 1.0 |


### 1.2. Parsing Synthesis Information from Selected Sentences

Once synthesis sentences have been identified, PaperParser is ready to begin parsing!  PaperParser currently recognizes only spin-coating and annealing parameters, but in the future, many more can be added to capture other synthesis steps.

As of the current implementation, each parser must be called individually from the `paperparser.parse` module.

In [27]:
spincoat_parse_results = [spincoat.parse_spincoat(syn_sentence) for syn_sentence in synthesis_sentences]
spincoat_parse_results

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [{'spin_coat': [{'spds': [{'spdvalue': '1,000', 'spdunits': 'r.p.m'},
      {'spdvalue': '5,000', 'spdunits': 'r.p.m'}],
     'times': [{'timevalue': '10', 'timeunits': 's'},
      {'timevalue': '20', 'timeunits': 's'}]}]}],
 [],
 [],
 [{'spin_coat': [{'spds': [{'spdvalue': '3,000', 'spdunits': 'r.p.m'}],
     'times': [{'timevalue': '30', 'timeunits': 's'}]}]}],
 []]

In [19]:
anneal_parse_results = [anneal.parse_anneal(syn_sentence) for syn_sentence in synthesis_sentences]
anneal_parse_results

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [{'anneal': [{'temps': [{'tempvalue': '100', 'tempunits': '°C'}],
     'times': [{'timevalue': '10', 'timeunits': 'min'}]}]}],
 [],
 []]

Sentences that match a parser pattern return synthesis parameters. Sentences that do not match any parser pattern return an empty list.
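Because most entries are empty lists and the values come back as strings with thousands separators, it can help to flatten the results into plain numbers. The sketch below works on a three-entry excerpt of the spin-coating output shown above, so the positions refer to the excerpt rather than to the full list of 12. The helper name `flatten_spincoat` is hypothetical and not part of PaperParser.

```python
# Three-entry excerpt of the spin-coating output shown above.
spincoat_parse_results = [
    [],
    [{'spin_coat': [{'spds': [{'spdvalue': '1,000', 'spdunits': 'r.p.m'},
                              {'spdvalue': '5,000', 'spdunits': 'r.p.m'}],
                     'times': [{'timevalue': '10', 'timeunits': 's'},
                               {'timevalue': '20', 'timeunits': 's'}]}]}],
    [{'spin_coat': [{'spds': [{'spdvalue': '3,000', 'spdunits': 'r.p.m'}],
                     'times': [{'timevalue': '30', 'timeunits': 's'}]}]}],
]


def to_number(text):
    """Convert a string such as '1,000' to a float."""
    return float(text.replace(',', ''))


def flatten_spincoat(results):
    """Yield (position, speeds, times) for every sentence with a match."""
    for position, sentence_result in enumerate(results):
        for record in sentence_result:
            for step in record.get('spin_coat', []):
                speeds = [to_number(s['spdvalue']) for s in step.get('spds', [])]
                times = [to_number(t['timevalue']) for t in step.get('times', [])]
                yield position, speeds, times


rows = list(flatten_spincoat(spincoat_parse_results))
for row in rows:
    print(row)
```

Units are left out here for brevity; a fuller version would carry `spdunits` and `timeunits` along with each value.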

#### 1.2.1. Capabilities and Flexibility of Synthesis Parsers

PaperParser's synthesis parsers have been designed to (somewhat) flexibly extract spin-coating and annealing parameters.

In [48]:
sc1 = "The substrate was spun at 12,000 rcf for 2 h."
sc2 = "The substrate was coated at 1,200 r.p.m for 10 minutes."
spincoat.parse_spincoat(sc1), spincoat.parse_spincoat(sc2)

([{'spin_coat': [{'spds': [{'spdvalue': '12,000', 'spdunits': 'rcf'}],
     'times': [{'timevalue': '2', 'timeunits': 'h'}]}]}],
 [{'spin_coat': [{'spds': [{'spdvalue': '1,200', 'spdunits': 'r.p.m'}],
     'times': [{'timevalue': '10', 'timeunits': 'minutes'}]}]}])

In [54]:
an1 = "The substrate was heated at 400°F for 20 seconds."
an2 = "The substrate was dried at 20 °C for 20 min"
anneal.parse_anneal(an1), anneal.parse_anneal(an2)

([{'anneal': [{'temps': [{'tempvalue': '400', 'tempunits': '°F'}],
     'times': [{'timevalue': '20', 'timeunits': 'seconds'}]}]}],
 [{'anneal': [{'temps': [{'tempvalue': '20', 'tempunits': '°C'}],
     'times': [{'timevalue': '20', 'timeunits': 'min'}]}]}])

The parsers also ignore unlikely values.

In [99]:
sc3 = "The substrate was spun at 1,000,000 r.c.f for 5 min" # Houston, we have a problem. A very fast one.
sc4 = "The substrate was spun at 2,000 rpm for 3,000 hours" # Doubtful that anyone spends 34% of a year spin-coating a film.
an3 = "The substrate was dried at 0 °C for 20 min" # Annealing at 0 °C might not do much for your film.
an4 = "The substrate was dried at 100°C for 200 min" # Annealing for 200 minutes might do too much for your film.
spincoat.parse_spincoat(sc3), spincoat.parse_spincoat(sc4), anneal.parse_anneal(an3), anneal.parse_anneal(an4)

([], [], [], [])
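The filtering idea above can be sketched with a regular expression plus a range check. This is not PaperParser's implementation (its grammar and cut-offs are not shown in this notebook). The function name `parse_anneal_sketch` and every threshold below are hypothetical, chosen only so that `an1` through `an4` behave as they do in the cells above.

```python
import re

# Hypothetical plausibility window; PaperParser's real limits may differ.
MIN_TEMP_C, MAX_TEMP_C = 10.0, 500.0   # annealing temperature, in °C
MAX_ANNEAL_MINUTES = 120.0             # annealing time, in minutes

# Conversion factors from each recognized time unit to minutes.
TO_MINUTES = {"seconds": 1 / 60, "s": 1 / 60,
              "minutes": 1.0, "min": 1.0,
              "hours": 60.0, "h": 60.0}

ANNEAL_PATTERN = re.compile(
    r"(?P<temp>\d+(?:\.\d+)?)\s*°\s*(?P<scale>[CF])\s+for\s+"
    r"(?P<time>\d+(?:\.\d+)?)\s*(?P<unit>seconds|s|minutes|min|hours|h)\b"
)


def parse_anneal_sketch(sentence):
    """Return temperature (°C) and time (min), or None if absent or implausible."""
    match = ANNEAL_PATTERN.search(sentence)
    if match is None:
        return None
    temp = float(match.group("temp"))
    if match.group("scale") == "F":
        temp = (temp - 32.0) * 5.0 / 9.0  # Fahrenheit to Celsius
    minutes = float(match.group("time")) * TO_MINUTES[match.group("unit")]
    if not (MIN_TEMP_C <= temp <= MAX_TEMP_C) or minutes > MAX_ANNEAL_MINUTES:
        return None
    return {"temp_C": round(temp, 1), "minutes": round(minutes, 2)}


an1 = "The substrate was heated at 400°F for 20 seconds."
an2 = "The substrate was dried at 20 °C for 20 min"
an3 = "The substrate was dried at 0 °C for 20 min"
an4 = "The substrate was dried at 100°C for 200 min"
results = [parse_anneal_sketch(s) for s in (an1, an2, an3, an4)]
print(results)
```

Converting everything to one unit before the range check keeps the thresholds in a single place, at the cost of discarding the units as written in the paper.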

#### 1.2.2. Parsing the Order of Steps in a Synthesis 

As an added feature, `paperparser` contains a method `order.syn_order` that can be used to extract the order of various steps in the synthesis procedure. Let's try it on a specific paragraph from the example paper.

In [28]:
a_paragraph = paper[109]
a_paragraph

PaperParser can then use an ordering method on the above paragraph to generate an ordered list of the synthesis steps contained within.

In [29]:
steps_order, steps_dict = order.syn_order(a_paragraph)
steps_dict

{0: [],
 1: ['spin-coat'],
 2: [],
 3: ['dry'],
 4: [],
 5: ['coat', 'spin-coat'],
 6: ['spin-coat', 'dry'],
 7: ['spin-coat'],
 8: ['dry'],
 9: ['spin-coat'],
 10: [],
 11: []}
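The dictionary above can be collapsed into a single ordered sequence of steps. The sketch below assumes that the integer keys index the sentences of the paragraph in reading order, which the output suggests but this notebook does not state explicitly. It uses only the `steps_dict` output and makes no assumption about the contents of `steps_order`.

```python
# Output copied from the cell above.
steps_dict = {0: [], 1: ['spin-coat'], 2: [], 3: ['dry'], 4: [],
              5: ['coat', 'spin-coat'], 6: ['spin-coat', 'dry'],
              7: ['spin-coat'], 8: ['dry'], 9: ['spin-coat'], 10: [], 11: []}

# Walk the keys in ascending order and keep each step with its position.
ordered_steps = [(position, step)
                 for position in sorted(steps_dict)
                 for step in steps_dict[position]]

# A compact, human-readable summary of the sequence.
summary = " -> ".join(step for _, step in ordered_steps)
print(summary)
```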

Pretty neat, right?

## 2.  Reading Device Performance from a Paper

The goal of `paperparser` is to correlate synthesis parameters with the performance of a given device.  For perovskite solar cells, performance is quantified using a number of parameters, including power conversion efficiency ($PCE$), short-circuit current ($J_{SC}$), and open-circuit voltage ($V_{OC}$).  (To learn more about how solar cell performance is measured, try some [nifty](http://depts.washington.edu/cmditr/modules/opv/physics_of_solar_cells.html) [online](https://www.ossila.com/pages/solar-cells-theory) [resources](https://en.wikipedia.org/wiki/Solar_cell_efficiency).)

To extract performance metrics, PaperParser uses a set of tools similar to those used in the synthesis section above.  First, PaperParser searches the paper for sentences containing a performance metric identifier ('$PCE$', '$J_{SC}$', '$V_{OC}$', _etc._) along with its corresponding value.  PaperParser then organizes the information into a user-friendly, readable format. 

(For version 0.1 of `paperparser`, only the PCE parser is fully functional and will be demonstrated below.  The other parsers are planned for future development.)

### 2.1. Identifying Device Performance Sentences in a Paper

As above, the first step is to find the relevant sentences so that the parser has less text to work through. Luckily, identifying performance metrics is easier than identifying syntheses, so no fancy-pants machine learning is required. Instead, PaperParser implements a simple search with the method `search_paper_for_perform_sentences.list_perform_sents`, which searches for PCE information by default.  However, the method is designed with generality in mind for future parser design. (See docstrings to learn more.)

In [30]:
relevant_sentences_to_pce = search_paper_for_perform_sentences.list_perform_sents('journal_articles/Paper0.html')
relevant_sentences_to_pce

['For example, when MAPbI3 was loaded on a mesoporous (mp)-TiO2 electrode by the sequential deposition of PbI2 and methylammonium iodide (MAI), a 15.0% power-conversion efficiency (PCE) was achieved under 1 sun illumination11.',
 'The Jsc, Voc and FF values obtained from the I–V curve of the reverse scan were 19.2 mA cm−2, 1.09 V and 0.69, respectively, yielding a PCE of 14.4% under standard AM 1.5 conditions.',
 'The average values from the J–V curves from the reverse and forward scans (Fig.\xa05a) exhibited a Jsc of 19.58 mA cm−2, Voc of 1.105 V, and FF of 76.2%, corresponding to a PCE of 16.5% under standard AM 1.5 G conditions.',
 'The best device also showed a very broad IPCE plateau of over 80% between 420 and 700 nm, as shown in Fig.\xa05b.',
 'One of these devices was certified by the standardized method in a photovoltaics calibration laboratory, confirming a PCE of 16.2% under AM 1.5 G full sun (Supplementary Fig.\xa06).',
 'In summary, we developed a solvent-engineering techn

Wow! Look at that output. A whole paper down to just those sentences, and almost all of them have quantitative info on the PCE! I don't know about you, but I am impressed.

Unfortunately, unlike the synthesis sentences, the PCE sentences are not labeled with their location indices. A future version of `paperparser` is planned to address this, so that each parameter can be traced to the section of the paper it comes from.
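The simple search described in this section can be sketched as a keyword filter that also requires a number in the sentence. This is only an illustration: the keyword list and the function name `find_candidate_sentences` are assumptions, and the real search in `search_paper_for_perform_sentences` may work differently. The last test sentence is made up for the example; the others are taken from outputs earlier in this notebook.

```python
import re

# Hypothetical keyword list; the real search terms may differ.
PCE_KEYWORDS = ("PCE", "power conversion efficiency",
                "power-conversion efficiency")


def find_candidate_sentences(sentences, keywords=PCE_KEYWORDS):
    """Keep sentences that mention a keyword and contain at least one digit."""
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        has_keyword = any(keyword.lower() in lowered for keyword in keywords)
        has_number = re.search(r"\d", sentence) is not None
        if has_keyword and has_number:
            hits.append(sentence)
    return hits


sentences = [
    "Cookie Notice",
    "The best device also showed a very broad IPCE plateau of over 80% "
    "between 420 and 700 nm, as shown in Fig.\xa05b.",
    "One of these devices was certified by the standardized method in a "
    "photovoltaics calibration laboratory, confirming a PCE of 16.2% under "
    "AM 1.5 G full sun (Supplementary Fig.\xa06).",
    "PCE is a common figure of merit.",  # made-up sentence: keyword, no number
]
hits = find_candidate_sentences(sentences)
print(len(hits))
```

Note that a plain substring match on "PCE" also accepts "IPCE", so a filter of this kind lets the IPCE sentence through. Whether the real search behaves this way is not confirmed here.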

### 2.2. Parsing Device Performance Information from Selected Sentences

Now these sentences can be fed to the PCE parser to extract values and relations.

In [31]:
parsed_pce_info = pce.parse_pce(relevant_sentences_to_pce)
parsed_pce_info

[[],
 [{'pce_pattern': [{'value': '14.4', 'units': '%'}]}],
 [{'pce_pattern': [{'value': '16.5', 'units': '%'}]}],
 [],
 [{'pce_pattern': [{'value': '16.2', 'units': '%'}]}],
 [{'pce_pattern': [{'value': '16.5', 'units': '%'}]}]]

The parser works as hoped on four of the six sentences. It misses the first sentence, where the PCE is reported in the phrase `'15.0% power-conversion efficiency (PCE)'`. Although this pattern is incorporated into `parse.pce`, the parser fails to match it here, which points to a bug that has not yet been tracked down.

The fourth sentence also returned an empty list. Closer inspection shows why:

In [201]:
relevant_sentences_to_pce[3]

'The best device also showed a very broad IPCE plateau of over 80% between 420 and 700 nm, as shown in Fig.\xa05b.'

The above sentence reports an IPCE (incident photon-to-current efficiency) value rather than a PCE, which is why its output is blank.  (Good parser.)

#### 2.2.1. Capabilities and Flexibility of Performance Parsers

The PCE parser is designed to detect different patterns of phrasing, as demonstrated below.

In [15]:
pce.parse_pce(['Solar cells containing THE CHEMICALS display PCEs up to 4.73%.'])

[[{'pce_pattern': [{'value': '4.73', 'units': '%'}]}]]

In [16]:
pce.parse_pce(['Solar cells containing THE CHEMICALS display a 4.73% PCE.'])

[[{'pce_pattern': [{'value': '4.73', 'units': '%'}]}]]
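To illustrate how several phrasings can be covered, here is a minimal regular-expression sketch that handles both orderings shown above (value after "PCE" and value before it). It is a stand-in written with Python's `re` module, not the grammar used by `pce.parse_pce`. The name `parse_pce_sketch` and the two patterns are assumptions, and the fact that the sketch matches a phrase says nothing about how the real parser treats it.

```python
import re

# A percentage such as '4.73%' or '16.5 %'.
NUMBER = r"(?P<value>\d+(?:\.\d+)?)\s*%"

PATTERNS = [
    # Value after the identifier: 'PCE of 14.4%', 'PCEs up to 4.73%'.
    re.compile(r"\bPCEs?\s+(?:of|up to)\s+" + NUMBER),
    # Value before the identifier: 'a 4.73% PCE',
    # '15.0% power-conversion efficiency'.
    re.compile(NUMBER + r"\s+(?:PCE\b|power[- ]conversion efficiency)"),
]


def parse_pce_sketch(sentence):
    """Return the first PCE value found, or an empty list if none matches."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return [{'value': match.group('value'), 'units': '%'}]
    return []


after = parse_pce_sketch(
    'Solar cells containing THE CHEMICALS display PCEs up to 4.73%.')
before = parse_pce_sketch(
    'Solar cells containing THE CHEMICALS display a 4.73% PCE.')
spelled_out = parse_pce_sketch(
    'a 15.0% power-conversion efficiency (PCE) was achieved under 1 sun illumination')
ipce = parse_pce_sketch(
    'The best device also showed a very broad IPCE plateau of over 80% '
    'between 420 and 700 nm, as shown in Fig.\xa05b.')
print(after, before, spelled_out, ipce)
```

The `\b` word boundary in front of `PCE` is what keeps "IPCE" from matching in the first pattern, so the IPCE sentence yields an empty list.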

## 3.  Future Work

**PaperParser** version 0.1 is a handy tool for the extraction of solar cell synthesis and performance information from often lengthy and arcane scientific papers on the subject.  Future improvements planned as of 0.1 include:

* More generalized parsers
    * Generalized parsers will enable expansion of PaperParser beyond perovskite literature to exciting frontiers!
* Increased flexibility for synthesis parsers
* Performance parsers for $J_{SC}$, $V_{OC}$, etc.
* Association of all parameters with chemical identifiers
    * Chemical identifier association (using ChemDataExtractor to recognize chemical names) allows syntheses and performance metrics to be specifically associated with certain formulations, making complete paper analysis a one-step, easy process!

The PaperParser team always welcomes suggestions, comments, or feedback (both positive and negative!).  We are a small (but dedicated) development team and look forward to improving PaperParser's functionality and user experience.

Thank you!

~ Team PaperParser (Christine, Harrison, Linnette, and Neel)