# Paper Classification

Here we explain how to determine whether a paper reports a genome-wide association study (GWAS). We classify papers using a text-based binary classifier combined with a couple of hand-picked filters. First, let's run the classifier on the downloaded GWAS papers. We expect almost all of them to be classified as GWAS.

In [1]:
from bs4 import BeautifulSoup
import glob
import numpy as np
import pickle
import re

First, we extract features from each paper. The text feature consists of the article title, abstract, table captions, and table headers; if a paper has no abstract, the first 2000 characters of the body are used instead. We also extract three boolean filters: whether the paper was published in 2006 or later, whether an RSID appears anywhere in the tables or extracted text (weak filter), and whether an RSID appears in the paper's tables (strong filter).

In [2]:
def extract_features(paper):
    """
    input: name of paper
    output: [text, filters]
    """
    with open(paper) as f:
        soup = BeautifulSoup(f, 'xml')

        ### Text ###
        # Title and Abstract
        text = ""
        article_title = ""
        article_title_tag = soup.find('article-title')
        abstract_tag = soup.find('abstract')
        if article_title_tag is not None:
            text += article_title_tag.get_text().lower() + " "
            article_title = article_title_tag.get_text().lower() + " "
        if abstract_tag is not None:
            text += abstract_tag.get_text().lower()
        else:
            if soup.find('body') is not None:
                body = soup.find('body').get_text().lower()
                # get first 2000 characters from body
                limit = min(len(body), 2000)
                text += body[:limit]

        # Tables
        table_tags = soup.find_all('table-wrapper') + soup.find_all('table-wrap')
        table_titles = ""
        table_headers = ""
        table_data = ""
        for table_tag in table_tags:
            table_title_tag = table_tag.find('caption')
            if table_title_tag is not None:
                table_titles += table_title_tag.get_text().lower() + " "
            table_header = table_tag.find('thead')
            if table_header is not None:
                header_names = table_header.find_all('td')
                for name in header_names:
                    table_headers += name.get_text().lower() + " "
            table_body = table_tag.find('tbody')
            if table_body is not None:
                body_data = table_body.find_all('td')
                for data in body_data:
                    table_data += data.get_text().lower() + " "

        extracted_text = text + table_titles + table_headers

        ### Filters ###
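        # Guard against papers that have no <pub-date> element at all
        pub_date = soup.find('pub-date')
        year = pub_date.find('year') if pub_date is not None else None
        if year is not None:
            year_filter = int(year.get_text()) >= 2006
        else:
            # No publication year available: do not exclude the paper
            year_filter = True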
        
        # "rs" followed by one or more digits (the lazy "+?" only ever matched one digit)
        rsid_regex = re.compile(r'rs[0-9]+')
        table_rsid = re.findall(rsid_regex, table_data)
        all_rsid = re.findall(rsid_regex, extracted_text + ' ' + table_data)
        weak_rsid_filter = len(all_rsid) > 0
        strong_rsid_filter = len(table_rsid) > 0

        extracted_filters = [year_filter, weak_rsid_filter, strong_rsid_filter]

        return [extracted_text, extracted_filters]
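
To make the two RSID filters concrete, here is a minimal standalone sketch of the same check on toy strings. The helper name `rsid_filters` and the sample inputs are hypothetical and not part of the notebook; only the pattern and the weak/strong logic mirror `extract_features`.

```python
import re

# Same idea as the notebook's RSID check: "rs" followed by digits.
RSID_REGEX = re.compile(r'rs[0-9]+')

def rsid_filters(extracted_text, table_data):
    """Return (weak, strong) RSID filters for one paper's text fields."""
    table_rsid = RSID_REGEX.findall(table_data)
    all_rsid = RSID_REGEX.findall(extracted_text + ' ' + table_data)
    # weak: RSID anywhere; strong: RSID in the table bodies
    return len(all_rsid) > 0, len(table_rsid) > 0

# Hypothetical inputs, lowercased as in extract_features
print(rsid_filters("association of rs9939609 with bmi", "snp p-value"))  # (True, False)
print(rsid_filters("a study of obesity", "rs9939609 0.001"))             # (True, True)
print(rsid_filters("a study of obesity", "gene p-value"))                # (False, False)
```

A paper that passes the strong filter always passes the weak one, because the weak filter also searches the table data.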

In [3]:
X_text = []
X_filters = []

papers = glob.glob('../data/db/papers/*.xml')

for paper in papers:
    print(paper)
    features = extract_features(paper)
    X_text.append(features[0])
    X_filters.append(features[1])

../data/db/papers/17447842.xml
../data/db/papers/17658951.xml
../data/db/papers/17684544.xml
../data/db/papers/17903292.xml
../data/db/papers/17903293.xml
../data/db/papers/17903294.xml
../data/db/papers/17903295.xml
../data/db/papers/17903296.xml
../data/db/papers/17903297.xml
../data/db/papers/17903298.xml
../data/db/papers/17903300.xml
../data/db/papers/17903301.xml
../data/db/papers/17903302.xml
../data/db/papers/17903303.xml
../data/db/papers/17903304.xml
../data/db/papers/17903305.xml
../data/db/papers/17903306.xml
../data/db/papers/17903307.xml
../data/db/papers/17903308.xml
../data/db/papers/17997608.xml
../data/db/papers/18159244.xml
../data/db/papers/18262040.xml
../data/db/papers/18282107.xml
../data/db/papers/18369459.xml
../data/db/papers/18455228.xml
../data/db/papers/18464913.xml
../data/db/papers/18483556.xml
../data/db/papers/18604267.xml
../data/db/papers/18776929.xml
../data/db/papers/18823527.xml
../data/db/papers/18840781.xml
../data/db/papers/18846228.xml
../data/

To predict whether a paper is GWAS or not, we use an SVM classifier that has been trained on the extracted text of papers using a bag-of-words model.

In [4]:
with open('../data/classifiers/classifier.pkl', 'rb') as f:  # pickles must be read in binary mode
    clf = pickle.load(f)
predicted = clf.predict(X_text)
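
For readers unfamiliar with the term, a bag-of-words model represents a text by how often each token occurs, ignoring word order. The sketch below illustrates the idea with the standard library only. The tokenization rule and the helper name `bag_of_words` are assumptions for illustration; the notebook does not show how the pickled classifier vectorizes its input.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Map a text to token counts, ignoring word order."""
    # Assumed tokenization: lowercase runs of letters and digits
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    return Counter(tokens)

# Hypothetical snippet of extracted text
counts = bag_of_words("Genome-wide association study: a genome-wide scan")
print(counts['genome'], counts['wide'], counts['association'], counts['scan'])  # 2 2 1 1
```

Since `clf.predict` is called on raw strings here, the pickled object presumably performs a comparable vectorization step internally before the SVM is applied.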

Also, we filter out any paper published before 2006, because those papers are highly unlikely to be GWAS. (The first GWAS paper was published in 2005.)

In [5]:
filters = np.asarray(X_filters)
predicted = np.logical_and(predicted, filters[:,0])

Optionally, we can also filter out papers that have no extractable information. The weak filter excludes any paper that has no RSID in the extracted text or tables. The strong filter excludes any paper that has no RSID in the tables. We use the weak filter by default.

In [6]:
filter_type = 'weak'  # You can change this! The choices are None, 'weak', or 'strong'
if filter_type == 'weak':
    predicted = np.logical_and(predicted, filters[:,1])
elif filter_type == 'strong':
    predicted = np.logical_and(predicted, filters[:,2])
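
The year filter and the optional RSID filter can also be combined in one helper, which avoids repeating the same branches each time the classifier is run. This is a minimal sketch: the function name `apply_filters` and the toy data are hypothetical, and the column order is assumed to match the `[year, weak, strong]` list returned by `extract_features`.

```python
import numpy as np

# Column order matches extract_features: [year, weak RSID, strong RSID]
FILTER_COLUMNS = {'weak': 1, 'strong': 2}

def apply_filters(predicted, filters, filter_type='weak'):
    """AND classifier predictions with the year filter and an optional RSID filter."""
    filters = np.asarray(filters, dtype=bool)
    result = np.logical_and(predicted, filters[:, 0])
    if filter_type in FILTER_COLUMNS:  # None (or any other value) skips the RSID filter
        result = np.logical_and(result, filters[:, FILTER_COLUMNS[filter_type]])
    return result

# Toy data: three papers, all predicted GWAS by the classifier
toy_predicted = np.array([1, 1, 1])
toy_filters = [[True, True, True],    # passes every filter
               [True, True, False],   # RSID in the text only, not in tables
               [False, True, True]]   # published before 2006
print(apply_filters(toy_predicted, toy_filters, 'weak'))    # [ True  True False]
print(apply_filters(toy_predicted, toy_filters, 'strong'))  # [ True False False]
print(apply_filters(toy_predicted, toy_filters, None))      # [ True  True False]
```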

Write out the results.

In [7]:
def get_title(paper):
    with open(paper) as f:
        text = f.read()
        title_regex = re.compile('<article-title>.+?</article-title>')
        match = re.search(title_regex, text)
        if match is None:
            return ""
        title = match.group(0)[15:-16]
        title = title.replace('<italic>', '').replace('</italic>', '')
        return title

num_gwas_predicted = np.count_nonzero(predicted)
print("number GWAS predicted: {} out of {}".format(num_gwas_predicted, len(predicted)))
for i, p in enumerate(predicted):
    if p:
        print("{} - {}".format(papers[i], get_title(papers[i])))

number GWAS predicted: 297 out of 320
../data/db/papers/17658951.xml - Genome-Wide Association Scan Shows Genetic Variants in the FTO Gene Are Associated with Obesity-Related Traits
../data/db/papers/17684544.xml - Systematic Association Mapping Identifies NELL1 as a Novel IBD Disease Gene
../data/db/papers/17903292.xml - A genome-wide association for kidney function and endocrine-related traits in the NHLBI's Framingham Heart Study
../data/db/papers/17903293.xml - Genome-wide association with select biomarker traits in the Framingham Heart Study
../data/db/papers/17903294.xml - Genome-wide association and linkage analyses of hemostatic factors and hematological phenotypes in the Framingham Heart Study
../data/db/papers/17903295.xml - Genetic correlates of longevity and selected age-related phenotypes: a genome-wide association study in the Framingham Study
../data/db/papers/17903296.xml - Genome-wide association with bone mass and geometry in the Framingham Heart Study
../data/db/pape

Now let's see what happens when we run on a set of 100 random open-access papers. We should expect very few papers to be classified as GWAS, if any.

In [8]:
filter_type = 'weak'  # You can change this again! The choices are None, 'weak', or 'strong'

In [None]:
X_text = []
X_filters = []

papers = glob.glob('../data/db/non-gwas/*')

for paper in papers:
    features = extract_features(paper)
    X_text.append(features[0])
    X_filters.append(features[1])
    
with open('../data/classifiers/classifier.pkl', 'rb') as f:  # pickles must be read in binary mode
    clf = pickle.load(f)
predicted = clf.predict(X_text)

filters = np.asarray(X_filters)
predicted = np.logical_and(predicted, filters[:,0])
if filter_type == 'weak':
    predicted = np.logical_and(predicted, filters[:,1])
elif filter_type == 'strong':
    predicted = np.logical_and(predicted, filters[:,2])

num_gwas_predicted = np.count_nonzero(predicted)
print("number GWAS predicted: {} out of {}".format(num_gwas_predicted, len(predicted)))
for i, p in enumerate(predicted):
    if p:
        print("{} - {}".format(papers[i], get_title(papers[i])))
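
To compare the two runs, it can help to express each count as a fraction of the papers in its set. The sketch below uses the count printed by the first run on the known-GWAS papers (297 out of 320); the helper name `flagged_fraction` is hypothetical.

```python
def flagged_fraction(num_flagged, num_total):
    """Fraction of papers in a set that were predicted to be GWAS."""
    if num_total == 0:
        raise ValueError("empty paper set")
    return float(num_flagged) / num_total

# Known-GWAS set: count printed by the first run in this notebook
print("{:.1%}".format(flagged_fraction(297, 320)))  # 92.8%
```

On the known-GWAS set this fraction is the recall after filtering. On the random open-access set, the same fraction approximates the false-positive rate, assuming that set contains essentially no GWAS papers.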

We ran the classifier with weak filter on 1.37 million open-access papers, and identified 1431 of them as GWAS. The PMC IDs for these 1431 can be found at /data/db/predicted/weak. You can download those 1431 papers by running the following:

In [None]:
%%bash
cd ../data/db
make dl-predicted-papers

The predicted papers will be downloaded in XML format into /data/db/predicted-papers.