## Example notebook that uses all the functions in paperparser/read_paper

Given the file path of a paper, extract all of its sentences as a list.

Given a list of sentences tagged 0 or 1, train a support vector machine classifier.

Given the trained model, classify a list of sentences as 0 or 1.

In [7]:
import sys
sys.path.insert(0, '../paperparser/read_paper')

In [2]:
import extract_sentences
import sentence_classifier

### Data manipulation for training and testing datasets

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score 

In [4]:
# Load sample data
# train_p[i] lists the manually identified/tagged synthesis paragraphs of paper p[i]
train_p = [[117, 118, 119], [112], [117], [122, 125], [88]]
p = [1, 2, 3, 4, 5]
syn_yes = []
syn_no = []
for paper_id, tagged_paragraphs in zip(p, train_p):
    paper = extract_sentences.read_html_paper('journal_articles/Paper' + str(paper_id) + '.html')
    sen_yes_arr, sen_no_arr = extract_sentences.extract_sentences_given_tag(paper, tagged_paragraphs)
    syn_yes.extend(sen_yes_arr)
    syn_no.extend(sen_no_arr)
# Label synthesis sentences 1 and all other sentences 0
Syn_sen = pd.DataFrame({'x': syn_yes, 'y': np.ones(len(syn_yes))})
Syn_not_sen = pd.DataFrame({'x': syn_no, 'y': np.zeros(len(syn_no))})
Train = [Syn_sen, Syn_not_sen]
train_data = pd.concat(Train, ignore_index=True)

In [5]:
# Held-out test paper (Paper0) and its manually tagged synthesis paragraph
t = [0]
test_p = [[109]]
syn_test_yes = []
syn_test_no = []
for paper_id, tagged_paragraphs in zip(t, test_p):
    paper = extract_sentences.read_html_paper('journal_articles/Paper' + str(paper_id) + '.html')
    sen_yes_arr, sen_no_arr = extract_sentences.extract_sentences_given_tag(paper, tagged_paragraphs)
    syn_test_yes.extend(sen_yes_arr)
    syn_test_no.extend(sen_no_arr)
# Note: the test frame uses upper-case column names 'X' and 'Y'
Syn_test_sen = pd.DataFrame({'X': syn_test_yes, 'Y': np.ones(len(syn_test_yes))})
Syn_test_not_sen = pd.DataFrame({'X': syn_test_no, 'Y': np.zeros(len(syn_test_no))})
Test = [Syn_test_sen, Syn_test_not_sen]
test_data = pd.concat(Test, ignore_index=True)
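
The training and test cells above repeat the same steps. A minimal sketch of a helper that stacks tagged and untagged sentences into one labeled DataFrame might look like the following; `build_labeled_frame` and its parameters are hypothetical and not part of paperparser, and the toy sentences stand in for the output of `extract_sentences_given_tag`.

```python
import numpy as np
import pandas as pd


def build_labeled_frame(positive, negative, text_col="x", label_col="y"):
    """Stack positive (label 1.0) and negative (label 0.0) sentences into one DataFrame."""
    sentences = list(positive) + list(negative)
    labels = np.concatenate([np.ones(len(positive)), np.zeros(len(negative))])
    return pd.DataFrame({text_col: sentences, label_col: labels})


# Toy sentences, made up for illustration only.
demo = build_labeled_frame(
    ["PbI2 was dissolved in DMF."],
    ["Solar cells are promising.", "See Fig. 1."],
)
```

Passing `text_col="X", label_col="Y"` would reproduce the column names used for the test frame.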

In [6]:
# Convert sentences and labels to plain lists of strings (labels become '1.0' / '0.0')
X_train = train_data['x'].astype(str).tolist()
Y_train = train_data['y'].astype(str).tolist()
X_test = test_data['X'].astype(str).tolist()
Y_test = test_data['Y'].astype(str).tolist()

### Train predictor model

In [7]:
syn_sen_model = sentence_classifier.train_predictor(X_train, Y_train)
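
The internals of `sentence_classifier.train_predictor` are not shown in this notebook. As a rough illustration of what a support vector machine sentence classifier could look like, the sketch below assumes a TF-IDF bag-of-words representation feeding a linear SVM in scikit-learn. The function name `train_svm_sketch` and the toy sentences are hypothetical, and the actual paperparser model may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def train_svm_sketch(sentences, labels):
    """Fit a TF-IDF + linear SVM pipeline (an assumption, not paperparser's actual model)."""
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(sentences, labels)
    return model


# Toy data, made up for illustration only.
toy_X = [
    "The solution was spin-coated at 5,000 r.p.m. and annealed at 100 °C.",
    "The precipitate was dissolved in ethanol and dried in a vacuum oven.",
    "Perovskite solar cells have attracted considerable attention.",
    "The results are summarized in Table 1.",
]
toy_y = ["1.0", "1.0", "0.0", "0.0"]  # string labels, in the same form as Y_train

toy_model = train_svm_sketch(toy_X, toy_y)
toy_pred = toy_model.predict(toy_X)  # predictions on the training sentences themselves
```

Wrapping the vectorizer and the classifier in one pipeline lets `predict` accept raw sentences, which is how `syn_sen_model.predict` is called in this notebook.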

### Test model and measure accuracy

In [8]:
pred_data = syn_sen_model.predict(X_test)
print("Accuracy:", accuracy_score(Y_test, pred_data))

Accuracy: 0.9894242068155111


In [9]:
pred_data, synthesis_sentences, not_synthesis_sentences = sentence_classifier.classify_sentences(syn_sen_model, X_test)
synthesis_sentences

['A 200–300-nm-thick mesoporous TiO2 (particle size: about 50 nm, crystalline phase: anatase) film was spin-coated onto the bl-TiO2/FTO substrate using home-made pastes14 and calcining at 500 °C for 1 h in air to remove organic components.',
 'CH3NH3I (MAI) and CH3NH3Br (MABr) were first synthesized by reacting 27.86 ml CH3NH2 (40% in methanol, Junsei Chemical) and 30 ml HI (57 wt% in water, Aldrich) or 44 ml HBr (48 wt% in water, Aldrich) in a 250 ml round-bottom flask at 0 °C for 4 h with stirring, respectively.',
 'The precipitate was recovered by evaporation at 55 °C for 1 h. MAI and MABr were dissolved in ethanol, recrystallized from diethyl ether, and dried at 60 °C in a vacuum oven for 24 h.',
 'The resulting solution was coated onto the mp-TiO2/bl-TiO2/FTO substrate by a consecutive two-step spin-coating process at 1,000 and 5,000 r.p.m for 10 and 20 s, respectively.',
 'During the second spin-coating step, the substrate (around 1 cm × 1 cm) was treated with toluene drop-castin

### Try on a paper

In [10]:
paper = extract_sentences.read_html_paper('journal_articles/Paper0.html')
X_sentences, sentences_record = extract_sentences.extract_all_sentences(paper)

In [11]:
pred_data, synthesis_sentences, not_synthesis_sentences = sentence_classifier.classify_sentences(syn_sen_model, X_sentences)
synthesis_sentences

['Furthermore, it was reported that the uniformity of the perovskite films depended on the thickness of the TiO2 compact layer, and modification of the spinning conditions could not achieve 100% surface coverage20.',
 'We see that the formation of the perovskite phase is accompanied by the complete transformation of the MAI–PbI2–DMSO at 130 °C, whereas both MAI–PbI2–DMSO and perovskite phases coexist at 100 °C.',
 'As shown in Fig.\xa02d, at the initial stage during spinning, the film is composed of MAI and PbI2 dissolved in the DMSO/GBL solvent mixture, whereas in the intermediate stage, the composition of the film is concentrated by the evaporation of GBL.',
 'A 200–300-nm-thick mesoporous TiO2 (particle size: about 50 nm, crystalline phase: anatase) film was spin-coated onto the bl-TiO2/FTO substrate using home-made pastes14 and calcining at 500 °C for 1 h in air to remove organic components.',
 'CH3NH3I (MAI) and CH3NH3Br (MABr) were first synthesized by reacting 27.86 ml CH3NH2 (4

##### To find the paragraph of each tagged synthesis sentence

In [12]:
df_sentences = pd.DataFrame({'Sentences':X_sentences, 'Element # in doc':[rec[0] for rec in sentences_record],
                             'Sentence_index_in_para':[rec[1] for rec in sentences_record], 'Tag':pred_data})

In [16]:
df_sentences.loc[df_sentences['Tag'] == 1.0]

| | Sentences | Element # in doc | Sentence_index_in_para | Tag |
|---|---|---|---|---|
| 103 | Furthermore, it was reported that the uniformi... | 90 | 90 | 1.0 |
| 152 | We see that the formation of the perovskite ph... | 96 | 96 | 1.0 |
| 154 | As shown in Fig. 2d, at the initial stage duri... | 97 | 97 | 1.0 |
| 216 | A 200–300-nm-thick mesoporous TiO2 (particle s... | 109 | 109 | 1.0 |
| 217 | CH3NH3I (MAI) and CH3NH3Br (MABr) were first s... | 109 | 109 | 1.0 |
| 218 | The precipitate was recovered by evaporation a... | 109 | 109 | 1.0 |
| 220 | The resulting solution was coated onto the mp-... | 109 | 109 | 1.0 |
| 221 | During the second spin-coating step, the subst... | 109 | 109 | 1.0 |
| 223 | The substrate was dried on a hot plate at 100 ... | 109 | 109 | 1.0 |


The classifier correctly tagged 6 of the 12 sentences in the synthesis paragraph (element 109) and falsely tagged 3 sentences from other paragraphs (elements 90, 96, and 97).
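
Because synthesis sentences are a small fraction of the paper, overall accuracy says little about how well they are found. Precision and recall for the positive class can be worked out directly from the counts above; the variable names in this short sketch are illustrative.

```python
# Counts read off the table above: 6 true positives, 3 false positives,
# and 12 - 6 = 6 synthesis sentences that the classifier missed.
true_pos, false_pos, false_neg = 6, 3, 6

precision = true_pos / (true_pos + false_pos)  # share of tagged sentences that are correct
recall = true_pos / (true_pos + false_neg)     # share of synthesis sentences that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.500 f1=0.571
```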

In [19]:
paper.elements[109]
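
Instead of hard-coding the element index, the paragraph with the most tagged sentences could be picked from the DataFrame. The sketch below assumes a frame with the same `Element # in doc` and `Tag` columns as `df_sentences`; the name `toy_df` and its rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_sentences (hypothetical rows).
toy_df = pd.DataFrame({
    "Element # in doc": [90, 96, 109, 109, 109, 110],
    "Tag": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
})

# Count tagged sentences per document element and take the most frequent one.
tagged_counts = toy_df.loc[toy_df["Tag"] == 1.0, "Element # in doc"].value_counts()
best_element = int(tagged_counts.idxmax())
print(best_element)  # 109 for the toy rows above
```

With the real data, the same two lines applied to `df_sentences` would give an index that could then be passed to `paper.elements[...]`.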