#Extracting Structure from Scientific Abstracts
###using an LSTM neural network


_Paul Willot_


In this notebook we will go through all the steps required to build an [LSTM](https://en.wikipedia.org/wiki/Long_short_term_memory "Long Short Term Memory") neural network that classifies the sentences of a scientific paper abstract.

**Summary:**
* [Extract dataset](#extract)
* [Pre-process](#pre-process)
* [Label analysis](#label analysis)
* [Choosing labels](#choosing labels)
* [Create train and test set](#create train)

In [2]:
%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py
%load_ext watermark

Installed watermark.py. To use it, type:
  %load_ext watermark
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark


In [7]:
%watermark -a 'Paul Willot' -mvp numpy,scipy,keras

Paul Willot 

CPython 2.7.6
IPython 3.2.0

numpy 1.9.2
scipy 0.16.0
keras 0.1.2

compiler   : GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)
system     : Darwin
release    : 14.3.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit


First, let's gather some data. 

We use the [PubMed](http://www.ncbi.nlm.nih.gov/pubmed) database of medical papers.

Specifically, we will focus on [structured abstracts](http://www.ncbi.nlm.nih.gov/pubmed/?term=hasstructuredabstract). Approximately 3 million are available; we will work on a reduced portion of them (500,000), but feel free to use a bigger corpus.

The easiest way to try this is to use the `toy_corpus.txt` and `tokenizer.pickle` included in the [project repo](https://github.com/m3at/Labelizer).

To work on a real dataset, I prepared the following files for convenience. To skip the downloads, jump to [link](#shortcut "intra link").

Download a **full corpus** (500,000 structured abstracts, 500 MB compressed)

In [5]:
#!wget https://www.dropbox.com/s/lhqe3bls0mkbq57/pubmed_result_548899.txt.zip -P ./data/
#!unzip -o ./data/pubmed_result_548899.txt.zip -d ./data/

Download a **toy corpus** (224 structured abstracts, 200 KB compressed)

*__Note:__ this file is already included in my GitHub repository*

In [6]:
#!wget https://www.dropbox.com/s/ujo1l8duu31js34/toy_corpus.txt.zip -P ./data/
#!unzip -o ./data/toy_corpus.txt.zip -d ./data/

Download a **lemmatized corpus** (preprocessed, 350 MB compressed)

In [7]:
#!wget https://www.dropbox.com/s/lmv88n1vpmp6c19/corpus_lemmatized.pickle.zip -P ./data/
#!unzip -o ./data/corpus_lemmatized.pickle.zip -d ./data/

Download **training and testing data** for the LSTM (preprocessed, vectorized and split, 100 MB compressed)

In [8]:
#!wget https://www.dropbox.com/s/0o7i0ejv4aqf6gs/training_4_BacObjMetCon.pickle.zip -P ./data/
#!unzip -o ./data/training_4_BacObjMetCon.pickle.zip -d ./data/

Bunch of imports

In [1]:
%matplotlib inline
from __future__ import absolute_import
from __future__ import print_function

# import local library
import tools
import prepare
import lemmatize
import analyze
import preprocess

import spacy
from spacy.en import English
#import nnlstm

<a id='extract'></a>
##Extract and parse the dataset

Separate the documents and isolate the abstracts.

In [2]:
data = prepare.extract_txt('data/toy_corpus.txt')

Exctracting from 'toy_corpus'...
224 documents exctracted - 1.9KB  [395.3KB/s]
Done. [0.00s]


Our data look like this:

In [3]:
print("%s\n[...]"%data[0][:800])


1. EJNMMI Res. 2014 Dec;4(1):75. doi: 10.1186/s13550-014-0075-x. Epub 2014 Dec 14.

Labeling galectin-3 for the assessment of myocardial infarction in rats.

Arias T(1), Petrov A, Chen J, de Haas H, Pérez-Medina C, Strijkers GJ, Hajjar RJ,
Fayad ZA, Fuster V, Narula J.

Author information: 
(1)Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine 
at Mount Sinai, One Gustave L. Levy Place, Box 1030, New York, NY, 10029, USA,
tvarias@cnic.es.

BACKGROUND: Galectin-3 is a ß-galactoside-binding lectin expressed in most of
tissues in normal conditions and overexpressed in myocardium from early stages of
heart failure (HF). It is an established biomarker associated with extracellular 
matrix (ECM) turnover during myocardial remodeling. The aim of this study is to
test t
[...]


In [4]:
abstracts = prepare.get_abstracts(data)

Working on 4 core...
1.4KB/s on each of the [4] core
Done. [0.34s]
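To give an idea of what isolating a structured abstract involves, here is a minimal sketch that splits a text into `(label, text)` pairs. It is not the code of `prepare.get_abstracts`: the function `split_sections`, the regular expression and the shortened sample text are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a section label is a run of capitals (spaces and "&"
# allowed) at the start of a line, followed by a colon.
LABEL_RE = re.compile(r"^([A-Z][A-Z &]+):\s*", re.MULTILINE)

def split_sections(abstract):
    """Return a list of (label, text) pairs found in a structured abstract."""
    parts = LABEL_RE.split(abstract)
    # parts = [preamble, label1, text1, label2, text2, ...]
    labels, texts = parts[1::2], parts[2::2]
    # collapse line breaks and repeated spaces inside each section
    return [(label, " ".join(text.split())) for label, text in zip(labels, texts)]

# Made-up, shortened sample in the style of the record shown above.
sample = (
    "BACKGROUND: Galectin-3 is a lectin expressed in most\n"
    "tissues.\n"
    "METHODS: We labeled galectin-3.\n"
    "RESULTS: Uptake was higher in infarcted rats.\n"
)
for label, text in split_sections(sample):
    print(label, "->", text)
```

A real parser also has to skip the journal, title and author blocks that precede the abstract, which this sketch ignores.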


Clean up by dropping the abstracts with an incorrect number of labels.

In [5]:
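def remove_err(datas,errs):
    # flatten the nested lists of error indices and sort them from highest to lowest,
    # so that deleting one item does not shift the indices still to be deleted
    err=sorted([item for subitem in errs for item in subitem],reverse=True)
    for e in err:
        # remove the faulty document from every list passed in
        for d in datas:
            del d[e]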

In [6]:
remove_err([abstracts],prepare.get_errors(abstracts))

In [7]:
print("Working on %d documents."%len(abstracts))

Working on 219 documents.
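The reason `remove_err` sorts the indices in reverse order is easier to see on a small list. The sketch below uses a hypothetical helper, `remove_indices`, on made-up data: deleting from the highest index down keeps the remaining indices valid.

```python
def remove_indices(items, bad_indices):
    """Delete the given positions from `items` in place."""
    # Highest index first: deleting position i only shifts the elements after i,
    # so the smaller indices still to be processed remain valid.
    for i in sorted(set(bad_indices), reverse=True):
        del items[i]

docs = ["a", "b", "c", "d", "e"]
remove_indices(docs, [1, 3])
print(docs)  # ['a', 'c', 'e']
```

Deleting in ascending order would remove `"b"` and then `"e"`, because `"d"` has moved to position 2 by the time index 3 is deleted.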


<a id='pre-process'></a>
#Pre-process
**Replacing numbers** with ##NB.

In [8]:
abstracts = prepare.filter_numbers(abstracts)

Filtering numbers...
Done. [0.04s]
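Number replacement can be approximated with a single regular expression. This is a sketch, not the implementation of `prepare.filter_numbers`: the name `filter_numbers_sketch` and the exact definition of a "number" are assumptions.

```python
import re

# Assumption: a "number" is a run of digits, optionally with "." or ","
# separators inside it (e.g. "12.5" or "1,000").
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def filter_numbers_sketch(text, token="##NB"):
    """Replace every number in `text` with a single placeholder token."""
    return NUMBER_RE.sub(token, text)

print(filter_numbers_sketch("Galectin-3 is a lectin; dose 12.5 mg (n=1,000)."))
```

Mapping all numbers to one token keeps the vocabulary small, since the classifier only needs to know that a number is present, not its value.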


For **correct sentence splitting**, let's train a tokenizer using the NLTK Punkt sentence tokenizer. This tokenizer uses an unsupervised algorithm to learn from a corpus how to split sentences.

In [10]:
tokenizer = prepare.create_sentence_tokenizer(abstracts)
# For a more general parser, use the one provided in NLTK:
#import nltk.data
#tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

abstracts_labeled = prepare.ex_all_labels(abstracts,tokenizer)

Loading sentence tokenizer...
Done. [0.34s]
Working on 4 core...
1.5KB/s on each of the [4] core
Done. [0.34s]


Now our data look like this:

In [11]:
abstracts_labeled[0][0]

[u'BACKGROUND',
 [u'Galectin-##NB is a \xdf-galactoside-binding lectin expressed in most of tissues in normal conditions and overexpressed in myocardium from early stages of heart failure (HF).',
  u'It is an established biomarker associated with extracellular  matrix (ECM) turnover during myocardial remodeling.',
  u'The aim of this study is to test the ability of (##NB)I-galectin-##NB (IG##NB) to assess cardiac remodeling in a model of myocardial infarction (MI) using imaging techniques. ']]

###Lemmatization

In [54]:
lemmatized = lemmatize.lemm(abstracts_labeled)

Working on 4 core...
Splitting datas... Done. [0.00s]
Lemmatizing...
Done. [0min 9s]


Let's save that

In [15]:
tools.dump_pickle(lemmatized,"data/fast_lemmatized.pickle")

Dumping...
Done. [0.04s]
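If you do not have the local `tools` module at hand, the standard `pickle` module is enough to save and reload the corpus. The two helpers below are hypothetical stand-ins, and I am assuming that `tools.dump_pickle` and `tools.load_pickle` do something comparable.

```python
import os
import pickle
import tempfile

def dump_pickle_sketch(obj, path):
    """Serialize `obj` to `path` with the most compact protocol available."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle_sketch(path):
    """Load back an object written by dump_pickle_sketch."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Made-up data shaped like one labeled section: [label, [sentences]]
example = [["BACKGROUND", ["first sentence", "second sentence"]]]
path = os.path.join(tempfile.mkdtemp(), "example.pickle")
dump_pickle_sketch(example, path)
restored = load_pickle_sketch(path)
print(restored == example)
```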


To directly load a lemmatized corpus

In [82]:
lemmatized = tools.load_pickle("data/corpus_lemmatized.pickle")

Loading 'data/corpus_lemmatized.pickle'...
Done. [1903.28s]


<a id='label analysis'></a>
#Label analysis
*Does not affect the corpus*; this step only helps to understand the data.

In [70]:
dic = analyze.create_dic_simple(lemmatized)

Copying corpus...Done. [0.01s]
Creating dictionary of labels...
Done. [0.00s]


In [71]:
print("Number of labels :",len(dic.keys()))
analyze.show_keys(dic,threshold=10)

Number of labels : 58
195______RESULTS
151______METHODS
146______BACKGROUND
117______CONCLUSIONS
91_______CONCLUSION
26_______INTRODUCTION
22_______OBJECTIVE
16_______MATERIALS AND METHODS
10_______OBJECTIVES
10_______PURPOSE
...
(48 other labels with less than 10 occurences)
...


In [72]:
primary_keyword=['AIM','BACKGROUND','INTRODUCTION','METHOD','RESULT','CONCLUSION','OBJECTIVE','DESIGN','FINDING','OUTCOME','PURPOSE']

In [73]:
analyze.regroup_keys(dic,primary_keyword)

Keys regrouped: 31


In [74]:
analyze.show_keys(dic,threshold=10)

212______CONCLUSION
200______RESULT
192______METHOD
149______BACKGROUND
33_______OBJECTIVE
26_______INTRODUCTION
10_______PURPOSE
...
(22 other labels with less than 10 occurences)
...
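The regrouping step can be pictured as merging every label that contains a primary keyword into that keyword. The sketch below works on plain label counts, which is a simplification: the real `dic` also stores the sentences, and `regroup_counts` is a hypothetical function, not the code of `analyze.regroup_keys`.

```python
def regroup_counts(counts, primary_keywords):
    """Merge every label that contains a primary keyword into that keyword."""
    merged = {}
    for label, n in counts.items():
        # The first keyword found inside the label wins; otherwise keep the label.
        target = next((k for k in primary_keywords if k in label), label)
        merged[target] = merged.get(target, 0) + n
    return merged

# Counts copied from the first listing above (frequent labels only).
counts = {"RESULTS": 195, "METHODS": 151, "MATERIALS AND METHODS": 16,
          "CONCLUSIONS": 117, "CONCLUSION": 91, "OBJECTIVE": 22, "OBJECTIVES": 10}
merged = regroup_counts(counts, ["METHOD", "RESULT", "CONCLUSION", "OBJECTIVE"])
for label in sorted(merged, key=merged.get, reverse=True):
    print(merged[label], label)
```

The totals come out slightly lower than in the notebook output because the rare labels (fewer than 10 occurrences) are not included in this toy input.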


In [75]:
keys_to_replace = [['INTRODUCTION','CONTEXT','PURPOSE'],
                  ['AIM','SETTING'],
                  ['FINDING','OUTCOME','DISCUSSION']]

replace_with =    ['BACKGROUND',
                  'METHOD',
                  'CONCLUSION']

In [76]:
analyze.replace_keys(dic,keys_to_replace,replace_with)

Keys regplaced: 8


In [77]:
analyze.show_keys(dic,threshold=10)

221______CONCLUSION
203______METHOD
200______RESULT
186______BACKGROUND
33_______OBJECTIVE
...
(16 other labels with less than 10 occurences)
...
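Replacing keys folds whole groups of labels into a target label. As before, this sketch uses plain counts and a hypothetical function, `replace_counts`; it only illustrates the idea behind `analyze.replace_keys`.

```python
def replace_counts(counts, keys_to_replace, replace_with):
    """Fold each group of labels into its target label."""
    merged = dict(counts)
    for group, target in zip(keys_to_replace, replace_with):
        for key in group:
            if key in merged:
                merged[target] = merged.get(target, 0) + merged.pop(key)
    return merged

# Counts copied from the regrouped listing above (frequent labels only).
counts = {"CONCLUSION": 212, "RESULT": 200, "METHOD": 192, "BACKGROUND": 149,
          "OBJECTIVE": 33, "INTRODUCTION": 26, "PURPOSE": 10}
keys_to_replace = [["INTRODUCTION", "CONTEXT", "PURPOSE"],
                   ["AIM", "SETTING"],
                   ["FINDING", "OUTCOME", "DISCUSSION"]]
replace_with = ["BACKGROUND", "METHOD", "CONCLUSION"]

merged = replace_counts(counts, keys_to_replace, replace_with)
print(merged["BACKGROUND"])  # 149 + 26 + 10 = 185
```

Labels missing from the input (here `CONTEXT`, `AIM` and the other rare ones) are simply skipped, which is why the result differs a little from the notebook output.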


<a id='choosing labels'></a>
#Choosing labels
_Does affect the corpus_

We can restrict our data to the abstracts whose labels match a **specific pattern**.

In [79]:
pattern = [
    ['BACKGROUND','BACKGROUNDS'],
    ['METHOD','METHODS'],
    ['RESULT','RESULTS'],
    ['CONCLUSION','CONCLUSIONS'],
]

In [80]:
sub_perfect = analyze.get_exactly(lemmatized,pattern=pattern,no_truncate=True)

Selecting abstracts...
91/219 match the pattern (41%)
Done. [0.00s]


In [81]:
sub_perfect = analyze.get_exactly(lemmatized,pattern=pattern,no_truncate=False)

Selecting abstracts...
98/219 match the pattern (44%)
Done. [0.00s]


In [82]:
print("%d abstracts labeled and ready for the next part!"%len(sub_perfect))

98 abstracts labeled and ready for the next part!
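Selecting by pattern boils down to comparing the sequence of labels of each abstract with the accepted variants at each position. The sketch below only covers strict matching; it does not try to reproduce what `no_truncate=False` does in `analyze.get_exactly`, and `matches_pattern` and the toy label lists are made up.

```python
def matches_pattern(labels, pattern):
    """True if the label sequence matches the pattern position by position."""
    if len(labels) != len(pattern):
        return False
    return all(label in accepted for label, accepted in zip(labels, pattern))

pattern = [
    ["BACKGROUND", "BACKGROUNDS"],
    ["METHOD", "METHODS"],
    ["RESULT", "RESULTS"],
    ["CONCLUSION", "CONCLUSIONS"],
]

# Toy label sequences, one per abstract.
abstract_labels = [
    ["BACKGROUND", "METHODS", "RESULTS", "CONCLUSIONS"],  # matches
    ["OBJECTIVE", "METHODS", "RESULTS", "CONCLUSION"],    # wrong first label
    ["BACKGROUND", "METHODS", "RESULTS"],                 # too short
]
selected = [labels for labels in abstract_labels if matches_pattern(labels, pattern)]
print("%d/%d match the pattern" % (len(selected), len(abstract_labels)))
```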


Or we can keep a **noisier dataset** and reduce it to a fixed set of labels.

In [83]:
dic = preprocess.create_dic(lemmatized,100)

Copying corpus...Done. [0.02s]
Creating dictionary of labels...
Done. [0.01s]


In [84]:
#primary_keyword=['AIM','BACKGROUND','METHOD','RESULT','CONCLUSION','OBJECTIVE','DESIGN','FINDINGS','OUTCOME','PURPOSE']
analyze.regroup_keys(dic,primary_keyword)

Keys regrouped: 31


In [85]:
#keys_to_replace = [['INTRODUCTION','BACKGROUND','AIM','PURPOSE','CONTEXT'],
#                  ['CONCLUSION']]

#replace_with =    ['OBJECTIVE',
#                  'RESULT']

analyze.replace_keys(dic,keys_to_replace,replace_with)

Keys regplaced: 8


In [86]:
dic = {key:dic[key] for key in ['BACKGROUND','RESULT','METHOD','CONCLUSION']}

In [87]:
analyze.show_keys(dic,threshold=10)

221______CONCLUSION
203______METHOD
200______RESULT
186______BACKGROUND


In [41]:
print("Sentences per label :",["%s %d"%(s,len(dic[s][1])) for s in dic.keys()])

Sentences per label : ['OBJECTIVE 56', 'METHOD 640', 'BACKGROUND 470', 'CONCLUSION 1392']


<a id='create train'></a>
#Creating train and test data

Let's format the data for the classifier.

_Reorder the labels for better readability_

In [89]:
classes_names = ['BACKGROUND', 'METHOD', 'RESULT','CONCLUSION']
dic.keys()

['CONCLUSION', 'RESULT', 'BACKGROUND', 'METHOD']

In [90]:
# train/test split
split = 0.8

# truncate the number of abstracts to consider for each label,
# -1 to set to the maximum while keeping the number of sentences per labels equal
raw_x_train, raw_y_train, raw_x_test, raw_y_test = preprocess.split_data(dic,classes_names,
                                                              split_train_test=split,
                                                              truncate=-1)
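To make the comment about `truncate=-1` concrete, here is a sketch of a balanced split: every class is cut down to the size of the smallest one, then divided into train and test parts. `split_balanced` and its toy input are assumptions, and the real `preprocess.split_data` may order or sample the sentences differently.

```python
import random

def split_balanced(sentences_by_label, class_names, split_train_test=0.8, seed=0):
    """Return x_train, y_train, x_test, y_test with equal sentences per class."""
    rng = random.Random(seed)
    # Keep as many sentences per class as the smallest class provides.
    n_per_class = min(len(sentences_by_label[name]) for name in class_names)
    n_train = int(n_per_class * split_train_test)
    x_train, y_train, x_test, y_test = [], [], [], []
    for class_id, name in enumerate(class_names):
        sentences = list(sentences_by_label[name])
        rng.shuffle(sentences)
        sentences = sentences[:n_per_class]
        x_train += sentences[:n_train]
        y_train += [class_id] * n_train
        x_test += sentences[n_train:]
        y_test += [class_id] * (n_per_class - n_train)
    return x_train, y_train, x_test, y_test

# Toy input: placeholder "sentences" with a different count per label.
toy = {
    "BACKGROUND": ["b%d" % i for i in range(10)],
    "METHOD": ["m%d" % i for i in range(12)],
    "RESULT": ["r%d" % i for i in range(15)],
    "CONCLUSION": ["c%d" % i for i in range(10)],
}
names = ["BACKGROUND", "METHOD", "RESULT", "CONCLUSION"]
x_train, y_train, x_test, y_test = split_balanced(toy, names)
print(len(x_train), len(x_test))  # 32 8
```

Balancing the classes this way throws data away, but it prevents the classifier from simply favoring the most frequent label.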

Vectorize the sentences.

In [91]:
X_train, y_train, X_test, y_test, feature_names, max_features, vectorizer = preprocess.vectorize_data(raw_x_train, raw_y_train, raw_x_test, raw_y_test)

Vectorizing the training set...Done. [0.06s]
Getting features...Done. [0.01s]
Creating order...Done. [0.05s]
Done. [0.12s]


In [92]:
print("Number of features : %d"%(max_features))

Number of features : 4532
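An LSTM consumes sequences of integers, so each sentence has to be turned into a list of word indices. I am not reproducing the internals of `preprocess.vectorize_data` here; the following is a plain-Python sketch of the general idea, with hypothetical helpers and made-up lemmatized sentences.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Map each word of the training set to an index, most frequent first."""
    counts = Counter(word for s in sentences for word in s.split())
    # Index 0 is reserved for unknown words; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def to_sequences(sentences, vocabulary):
    """Replace each word with its index, or 0 if it was never seen in training."""
    return [[vocabulary.get(word, 0) for word in s.split()] for s in sentences]

train = ["the patient be treat", "the result be significant"]
test = ["the patient improve"]

vocab = build_vocabulary(train)
seq_train = to_sequences(train, vocab)
seq_test = to_sequences(test, vocab)
print(seq_train)  # [[2, 3, 1, 6], [2, 4, 1, 5]]
print(seq_test)   # [[2, 3, 0]]
```

The vocabulary is built from the training set only, so that words appearing only in the test set are treated as unknown, as they would be on new data.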


Now let's save all this

In [94]:
tools.dump_pickle([X_train, y_train, X_test, y_test, feature_names, max_features, classes_names, vectorizer],"data/unpadded_4_BacObjMetCon.pickle")

Dumping...
Done. [0.38s]


and jump to the other notebook to train the LSTM.