## Example notebook that uses all the functions in paperparser/read_paper

Given the file path of a paper, extract all of its sentences as a list.

Given a list of sentences tagged 0 or 1 (1 = synthesis sentence), train a support vector machine (SVM) classifier.

Given a trained model, classify a list of sentences as 0 or 1.

In [6]:
import sys
sys.path.insert(0, '../paperparser/read_paper')

In [7]:
import extract_sentences
import sentence_classifier

In [8]:
sys.path.insert(0, '../paperparser/parse')
import spincoat

### Data manipulation for training and testing datasets

In [9]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score 

In [10]:
# Load sample data.
# Manually identified (tagged) synthesis paragraphs: train_p[i] lists the
# element indices of the synthesis paragraphs in Paper<p[i]>.html.
train_p = [[117, 118, 119], [112], [117], [122, 125], [88],
           [142, 146], [130], [115], [123, 125], [105]]
p = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
syn_yes = []
syn_no = []
for paper_id, tagged_paragraphs in zip(p, train_p):
    paper = extract_sentences.read_html_paper('journal_articles/Paper' + str(paper_id) + '.html')
    sen_yes_arr, sen_no_arr = extract_sentences.extract_sentences_given_tag(paper, tagged_paragraphs)
    syn_yes.extend(sen_yes_arr)
    syn_no.extend(sen_no_arr)
# Label synthesis sentences 1 and all other sentences 0, then stack them.
Syn_sen = pd.DataFrame({'x': syn_yes, 'y': np.ones(len(syn_yes))})
Syn_not_sen = pd.DataFrame({'x': syn_no, 'y': np.zeros(len(syn_no))})
Train = [Syn_sen, Syn_not_sen]
train_data = pd.concat(Train, ignore_index=True)

In [11]:
# Hold out Paper0 as the test set; element 109 is its tagged synthesis paragraph.
t = [0]
test_p = [[109]]
syn_test_yes = []
syn_test_no = []
for paper_id, tagged_paragraphs in zip(t, test_p):
    paper = extract_sentences.read_html_paper('journal_articles/Paper' + str(paper_id) + '.html')
    sen_yes_arr, sen_no_arr = extract_sentences.extract_sentences_given_tag(paper, tagged_paragraphs)
    syn_test_yes.extend(sen_yes_arr)
    syn_test_no.extend(sen_no_arr)
# Label synthesis sentences 1 and all other sentences 0, then stack them.
Syn_test_sen = pd.DataFrame({'X': syn_test_yes, 'Y': np.ones(len(syn_test_yes))})
Syn_test_not_sen = pd.DataFrame({'X': syn_test_no, 'Y': np.zeros(len(syn_test_no))})
Test = [Syn_test_sen, Syn_test_not_sen]
test_data = pd.concat(Test, ignore_index=True)
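
The training and test cells above follow the same pattern: collect tagged and untagged sentences, label them 1 and 0, and stack them into one frame. A minimal sketch of that labelling step as a reusable helper is shown below. `build_labelled_frame` is a hypothetical name, not part of paperparser, and the toy sentences stand in for the output of `extract_sentences_given_tag`.

```python
import numpy as np
import pandas as pd


def build_labelled_frame(positive, negative, x_col="x", y_col="y"):
    """Stack positive (label 1.0) and negative (label 0.0) sentences into one DataFrame."""
    pos = pd.DataFrame({x_col: positive, y_col: np.ones(len(positive))})
    neg = pd.DataFrame({x_col: negative, y_col: np.zeros(len(negative))})
    return pd.concat([pos, neg], ignore_index=True)


# Toy sentences (illustrative only) standing in for extracted sentences.
toy_yes = ["The film was spin-coated at 4000 rpm for 30 s."]
toy_no = ["Perovskite solar cells have attracted attention.",
          "Figure 2 shows the spectra."]
toy_frame = build_labelled_frame(toy_yes, toy_no)
print(toy_frame)
```

With a helper like this, the training and test frames would differ only in the sentence lists and column names passed in.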

In [12]:
# Cast sentences and labels to plain lists of strings.
X_train = train_data['x'].astype(str).tolist()
Y_train = train_data['y'].astype(str).tolist()
X_test = test_data['X'].astype(str).tolist()
Y_test = test_data['Y'].astype(str).tolist()

### Train predictor model

In [13]:
syn_sen_model = sentence_classifier.train_predictor(X_train, Y_train)
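
The internals of `sentence_classifier.train_predictor` are not shown in this notebook. As a rough illustration of what a support vector machine sentence classifier can look like in scikit-learn, the sketch below chains a TF-IDF vectorizer with a linear SVM. The function name `train_toy_predictor`, the choice of TF-IDF features, and the toy sentences are all assumptions made for this example, not paperparser's actual implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def train_toy_predictor(sentences, labels):
    """Fit a TF-IDF + linear SVM pipeline (hypothetical stand-in, not paperparser code)."""
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(sentences, labels)
    return model


# Toy data (illustrative only); labels are strings, as in Y_train above.
toy_sentences = [
    "The solution was spin-coated at 4000 rpm for 30 s.",
    "The substrate was annealed on a hot plate at 100 C for 10 min.",
    "The precursor was dissolved in DMF and stirred at 60 C.",
    "Perovskite solar cells have attracted wide attention.",
    "The efficiency is shown in Figure 3.",
    "We thank the funding agency for support.",
]
toy_labels = ["1.0", "1.0", "1.0", "0.0", "0.0", "0.0"]

toy_model = train_toy_predictor(toy_sentences, toy_labels)
toy_pred = toy_model.predict(["The film was spin-coated at 2000 rpm and annealed at 100 C."])
print(toy_pred)
```

Because the pipeline takes raw strings, the fitted object can be called with `predict` on a list of sentences, which matches how `syn_sen_model.predict(X_test)` is used in this notebook.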

### Test model and measure accuracy

In [14]:
pred_data = syn_sen_model.predict(X_test)
print("Accuracy:", accuracy_score(Y_test, pred_data))

Accuracy: 0.9882491186839013


In [15]:
pred_data, synthesis_sentences, not_synthesis_sentences = sentence_classifier.classify_sentences(syn_sen_model, X_test)
synthesis_sentences

['CH3NH3I (MAI) and CH3NH3Br (MABr) were first synthesized by reacting 27.86 ml CH3NH2 (40% in methanol, Junsei Chemical) and 30 ml HI (57 wt% in water, Aldrich) or 44 ml HBr (48 wt% in water, Aldrich) in a 250 ml round-bottom flask at 0 °C for 4 h with stirring, respectively.',
 'The precipitate was recovered by evaporation at 55 °C for 1 h. MAI and MABr were dissolved in ethanol, recrystallized from diethyl ether, and dried at 60 °C in a vacuum oven for 24 h.',
 'The resulting solution was coated onto the mp-TiO2/bl-TiO2/FTO substrate by a consecutive two-step spin-coating process at 1,000 and 5,000 r.p.m for 10 and 20 s, respectively.',
 'During the second spin-coating step, the substrate (around 1 cm × 1 cm) was treated with toluene drop-casting.',
 'The substrate was dried on a hot plate at 100 °C for 10 min.',
 'A solution of poly(triarylamine) (15 mg, PTAA, EM Index, Mw\xa0 = \xa017,500 g mol−1) in toluene (1.5 ml) was mixed with 15 μl of a solution of lithium bistrifluoromethan

### Try on a paper

In [16]:
paper = extract_sentences.read_html_paper('journal_articles/Paper0.html')
X_sentences, sentences_record = extract_sentences.extract_all_sentences(paper)

In [17]:
pred_data, synthesis_sentences, not_synthesis_sentences = sentence_classifier.classify_sentences(syn_sen_model, X_sentences)
synthesis_sentences

['The spin-coated layer formed with the solvent mixture followed by the toluene drip is extremely uniform and transparent, and covers the full surface with low surface roughness.',
 'We see that the formation of the perovskite phase is accompanied by the complete transformation of the MAI–PbI2–DMSO at 130 °C, whereas both MAI–PbI2–DMSO and perovskite phases coexist at 100 °C.',
 'Accordingly, the formation of the intermediate phase is a critical factor for smoothing the surface via dropwise toluene application, which finally results in compact and uniform thin layers.',
 'Generally, the average value of the efficiency, determined from the forward and reverse scans should be widely accepted when the scanning delay time is longer than 40 ms (ref.\xa023), because an excessively long time to complete the measurement is impractical.',
 'For a deeper understanding of the dependence of the I–V parameters on both scan directions, we investigated the difference between the forward and reverse s

##### Find the paragraph of each tagged synthesis sentence

In [18]:
df_sentences = pd.DataFrame({'Sentences':X_sentences, 'Element # in doc':[rec[0] for rec in sentences_record],
                             'Sentence_index_in_para':[rec[1] for rec in sentences_record], 'Tag':pred_data})

In [19]:
df_sentences.loc[df_sentences['Tag'] == 1.0]

|     | Sentences | Element # in doc | Sentence_index_in_para | Tag |
|-----|-----------|------------------|------------------------|-----|
| 124 | The spin-coated layer formed with the solvent ... | 92 | 15 | 1.0 |
| 152 | We see that the formation of the perovskite ph... | 96 | 2 | 1.0 |
| 160 | Accordingly, the formation of the intermediate... | 99 | 0 | 1.0 |
| 187 | Generally, the average value of the efficiency... | 100 | 21 | 1.0 |
| 188 | For a deeper understanding of the dependence o... | 103 | 0 | 1.0 |
| 217 | CH3NH3I (MAI) and CH3NH3Br (MABr) were first s... | 109 | 2 | 1.0 |
| 218 | The precipitate was recovered by evaporation a... | 109 | 3 | 1.0 |
| 220 | The resulting solution was coated onto the mp-... | 109 | 5 | 1.0 |
| 221 | During the second spin-coating step, the subst... | 109 | 6 | 1.0 |
| 223 | The substrate was dried on a hot plate at 100 ... | 109 | 8 | 1.0 |
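
To see which paragraph collects the most tagged sentences, the tagged rows can be counted per document element. The sketch below does this on a small stand-in frame whose element numbers are copied from the rows displayed above; `toy_tagged` and `per_paragraph` are names introduced only for this example.

```python
import pandas as pd

# Element numbers and tags copied from the rows displayed above.
toy_tagged = pd.DataFrame({
    "Element # in doc": [92, 96, 99, 100, 103, 109, 109, 109, 109, 109],
    "Tag": [1.0] * 10,
})

# Count tagged sentences per document element, most-tagged paragraph first.
per_paragraph = (
    toy_tagged.loc[toy_tagged["Tag"] == 1.0]
    .groupby("Element # in doc")
    .size()
    .sort_values(ascending=False)
)
best_element = per_paragraph.idxmax()
print(per_paragraph.head(3))
print("Most-tagged element:", best_element)
```

Applied to the real `df_sentences`, the same grouping would point to the element worth inspecting with `paper.elements[...]`.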


Previously, with only 5 papers as training data, the classifier appeared to tag 6 of the 12 synthesis sentences correctly and falsely tagged 3 other sentences.

Now, with 10 papers as training data, it correctly tags 7 of the 12 and falsely tags 5 other sentences.
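
The two runs noted above can be compared more directly as precision and recall. The sketch below is a worked calculation from the counts in those notes, under the assumption that 12 is the number of true synthesis sentences in the paper; `precision_recall` is a helper written for this example.

```python
def precision_recall(true_positives, false_positives, total_positives):
    """Return (precision, recall) computed from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / total_positives
    return precision, recall


# Counts from the notes above; 12 true synthesis sentences is an assumption.
p5, r5 = precision_recall(6, 3, 12)    # 5 training papers
p10, r10 = precision_recall(7, 5, 12)  # 10 training papers
print(f"5 papers:  precision={p5:.2f}, recall={r5:.2f}")
print(f"10 papers: precision={p10:.2f}, recall={r10:.2f}")
```

Under that assumption, recall rises from 0.50 to 0.58 with more training papers, while precision falls from 0.67 to 0.58, so the larger training set finds one more synthesis sentence at the cost of two more false tags.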

In [20]:
paper.elements[109]

In [21]:
import joblib  # standalone package; sklearn.externals.joblib is gone from scikit-learn >= 0.23

In [23]:
joblib.dump(syn_sen_model, 'syn_sen_model.pkl')

['syn_sen_model.pkl']
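
A saved model can be restored later with `joblib.load`. The sketch below shows the dump and load round trip on a plain dictionary that stands in for the trained model, so that it runs without the paperparser data; the file name `toy_model.pkl` and the dictionary contents are illustrative only.

```python
import os
import tempfile

import joblib

# A plain dict stands in for the trained model so the sketch runs on its own.
toy_object = {"name": "syn_sen_model", "n_training_papers": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_object, model_path)   # returns the list of written file names
    restored = joblib.load(model_path)    # rebuilds the object from disk

print(restored == toy_object)
```

For the real model, `joblib.load('syn_sen_model.pkl')` would return an object that can be passed to `sentence_classifier.classify_sentences` in place of `syn_sen_model`. Since joblib files are pickle-based, they should only be loaded from trusted sources, ideally with the same scikit-learn version used to save them.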