Use machine learning (linear SVMs on TF-IDF-weighted word n-grams) to predict, from its abstract or title, whether an article makes it into Nature/Science or PRL. This time we'll only look at articles in the physics.atom-ph section.

In [1]:
# Need to add the parent directory to sys.path to find 'metadataDB'
import sys
sys.path.append('../')

%matplotlib inline
import matplotlib.pyplot as plt 
import time
import numpy as np

# Natural language processing toolkit
# To use this, run nltk.download() and download 'stopwords'
from nltk.corpus import stopwords
s=stopwords.words('english') + ['']

# Machine learning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.pipeline import Pipeline
from sklearn import metrics

# SQL
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from metadataDB.declareDatabase import *
from sqlalchemy import or_, and_

engine = create_engine("sqlite:///../arXiv_metadata.db", echo=False)
Base.metadata.bind = engine
DBsession = sessionmaker(bind=engine)
session = DBsession()

In [2]:
query = session.query(Article_Category)\
                    .join(Category)\
                    .join(Article)\
                    .filter(Category.name.like('%atom-ph%'),
                            or_(Article.journal_ref.like('Physical Review Letters%'),
                                Article.journal_ref.like('Phys. Rev. Lett.%'),
                                Article.journal_ref.like('PRL%')))
# Run the query once and reuse the rows for both abstracts and titles
rowsPRL = query.all()
abstractPRL = [x.article.abstract for x in rowsPRL]
titlePRL = [x.article.title for x in rowsPRL]

In [3]:
query = session.query(Article_Category)\
                    .join(Category)\
                    .join(Article)\
                    .filter(Category.name.like('%atom-ph%'),
                            or_(Article.journal_ref.like('Nature%'),
                                Article.journal_ref.like('Nat.%'),
                                Article.journal_ref.like('Science%')))
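# Run the query once and reuse the rows for both abstracts and titles
rowsNatureScience = query.all()
abstractNatureScience = [x.article.abstract for x in rowsNatureScience]
titleNatureScience = [x.article.title for x in rowsNatureScience]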
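
The `like('Nature%')`-style filters above are prefix matches, so it helps to see what they accept. The sketch below emulates them in plain Python, assuming SQLite's default behaviour in which `LIKE` is case-insensitive for ASCII characters. The `journal_ref` strings and the helper `like_prefix` are hypothetical and are not rows from the database.

```python
def like_prefix(value, prefix):
    """Emulate SQL `value LIKE 'prefix%'` (ASCII case-insensitive, as in SQLite)."""
    return value.lower().startswith(prefix.lower())

NATURE_SCIENCE_PREFIXES = ['Nature', 'Nat.', 'Science']

# Hypothetical journal_ref strings, for illustration only
examples = [
    'Nature 500, 1 (2013)',
    'Nature Physics 9, 1 (2013)',
    'Nat. Commun. 4, 1 (2013)',
    'Science 340, 1 (2013)',
    'Phys. Rev. Lett. 110, 1 (2013)',
]

# Keep every reference that matches at least one Nature/Science prefix
matched = [ref for ref in examples
           if any(like_prefix(ref, p) for p in NATURE_SCIENCE_PREFIXES)]
print(matched)
```

Under these assumptions the "Nature/Science" class also picks up Nature-family journals such as Nature Physics, since only the start of the reference is checked.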

In [4]:
session.close()

In [5]:
# # Train with 80% of the data, test with 20%
# # First start with abstracts.

# indPRL = len(abstractPRL)*4/5
# indNatureScience = len(abstractNatureScience)*4/5

# train_abstract = abstractPRL[:indPRL] + abstractNatureScience[:indNatureScience]
# train_title = titlePRL[:indPRL] + titleNatureScience[:indNatureScience]
# train_target = [0]*indPRL + [1]*indNatureScience
# train_target_names = ['PRL']*indPRL + ['Nature']*indNatureScience

# test_abstract = abstractPRL[indPRL:] + abstractNatureScience[indNatureScience:]
# test_title = titlePRL[indPRL:] + titleNatureScience[indNatureScience:]
# test_target = [0]*len(abstractPRL[indPRL:]) + [1]*len(abstractNatureScience[indNatureScience:])
# test_target_names = ['PRL', 'Nature/Science']

In [6]:
# Train with 80% of the Nature/Science data, test with the remaining 20%.
# Hold out the same number of PRL articles so both classes are equally represented in the test set.

indNatureScience = len(abstractNatureScience) * 4 // 5  # integer division: used as a slice index
indPRL = len(abstractPRL) - (len(abstractNatureScience) - indNatureScience)

train_abstract = abstractPRL[:indPRL] + abstractNatureScience[:indNatureScience]
train_title = titlePRL[:indPRL] + titleNatureScience[:indNatureScience]
train_target = [0]*indPRL + [1]*indNatureScience
train_target_names = ['PRL']*indPRL + ['Nature']*indNatureScience

test_abstract = abstractPRL[indPRL:] + abstractNatureScience[indNatureScience:]
test_title = titlePRL[indPRL:] + titleNatureScience[indNatureScience:]
test_target = [0]*len(abstractPRL[indPRL:]) + [1]*len(abstractNatureScience[indNatureScience:])
test_target_names = ['PRL', 'Nature/Science']
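
To make the index arithmetic above concrete, here is a small sketch that plugs in the corpus sizes printed in this notebook (123 Nature/Science and 508 PRL abstracts). The function name `split_indices` is made up for this illustration.

```python
def split_indices(n_prl, n_nature_science):
    """Return (indPRL, indNatureScience) following the arithmetic of the cell above."""
    ind_ns = n_nature_science * 4 // 5   # first 80% of Nature/Science go to training
    n_test = n_nature_science - ind_ns   # the remaining 20% are held out
    ind_prl = n_prl - n_test             # hold out the same number of PRL articles
    return ind_prl, ind_ns

ind_prl, ind_ns = split_indices(n_prl=508, n_nature_science=123)
n_test_prl = 508 - ind_prl
n_test_ns = 123 - ind_ns
print("train: %d PRL, %d Nature/Science" % (ind_prl, ind_ns))
print("test:  %d PRL, %d Nature/Science" % (n_test_prl, n_test_ns))
```

This gives 483 PRL and 98 Nature/Science training articles and 25 of each for testing, which matches the support column in the classification reports. The training set is therefore imbalanced by about 5 to 1 in favour of PRL, even though the test set is balanced.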

In [14]:
print len(abstractNatureScience)
print len(abstractPRL)

123
508


In [7]:
#SVC(kernel='linear') is good
text_abstract_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,3))),
                              ('tfidf', TfidfTransformer()),
                              ('clf', SVC(kernel='linear'))])
text_abstract_clf.fit(train_abstract, train_target)
predict_abstract = text_abstract_clf.predict(test_abstract)
#print text_abstract_clf.predict(train_abstract)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 1 1 1 1 1 1 1 1 1 
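
To see what kind of features `CountVectorizer(ngram_range=(1,3))` hands to the classifier, the sketch below extracts word 1- to 3-grams by hand. It assumes the vectorizer's default settings (lowercasing, tokens of two or more word characters). `word_ngrams` is a hypothetical helper and the sentence is made up.

```python
import re

def word_ngrams(text, ngram_range=(1, 3)):
    """List word n-grams roughly the way CountVectorizer's default analyzer does."""
    # Default token pattern: runs of at least two word characters
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

features = word_ngrams("Here we demonstrate a quantum clock")
print(features)
```

Single-character tokens such as "a" are dropped, and multi-word phrases like "here we demonstrate" become features in their own right alongside the individual words.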

In [8]:
text_title_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2))),
                              ('tfidf', TfidfTransformer()),
                              ('clf', LinearSVC())])
text_title_clf.fit(train_title, train_target)
predict_title = text_title_clf.predict(test_title)
#print text_title_clf.predict(train_title)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

In [9]:
print predict_abstract
print predict_title

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0]


In [10]:
#SVC(kernel='linear')
print(metrics.classification_report(test_target, predict_abstract,
                                    target_names=test_target_names))

                precision    recall  f1-score   support

           PRL       0.52      1.00      0.68        25
Nature/Science       1.00      0.08      0.15        25

   avg / total       0.76      0.54      0.42        50
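
The numbers in this report follow directly from the confusion counts. The sketch below recomputes them by hand, assuming (as the printed `predict_abstract` suggests) that all 25 PRL test abstracts and 23 of the 25 Nature/Science test abstracts were labelled PRL. The helper `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# PRL as the positive class: all 25 found, 23 Nature/Science abstracts wrongly labelled PRL
prl = [round(x, 2) for x in prf(tp=25, fp=23, fn=0)]
# Nature/Science as the positive class: only 2 found, 23 missed
nature_science = [round(x, 2) for x in prf(tp=2, fp=0, fn=23)]

print(prl)
print(nature_science)
```

The perfect precision for Nature/Science therefore rests on just two positive predictions; the classifier labels almost everything PRL.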



In [11]:
#LinearSVC() (title classifier from In [8])
print(metrics.classification_report(test_target, predict_title,
                                    target_names=test_target_names))

                precision    recall  f1-score   support

           PRL       0.53      1.00      0.69        25
Nature/Science       1.00      0.12      0.21        25

   avg / total       0.77      0.56      0.45        50



In [15]:
# Invert the vectorizer's vocabulary once (column index -> n-gram)
# instead of scanning it for every lookup.
vocabulary = text_abstract_clf.named_steps['vect'].vocabulary_
index_to_ngram = dict((index, ngram) for ngram, index in vocabulary.items())

# coef_ is a sparse 1 x n_features matrix because the SVC was fit on sparse input
coefs = text_abstract_clf.named_steps['clf'].coef_.toarray().ravel()
# Stable sort, ascending: most negative weights (PRL) first
order = np.argsort(coefs, kind='mergesort')

print "Top 50 indicators of PRL:"
print ", ".join(index_to_ngram[i] for i in order[:50])
print ""
print "Top 50 indicators of Nature/Science:"
print ", ".join(index_to_ngram[i] for i in order[-50:][::-1])

Top 50 indicators of PRL:
shift, show, show that, between the, we show, hyperfine, collective, magnetic, trap, of the, molecules in, from the, alpha, temperature, method, and we, attosecond, in an, sample, in their, the ion, relaxation, of the two, gas, resonant, scattering, rydberg, we show that, energy, yb, of 10, electric, condensation, background, we, sensitive, polaritons, agreement with, propose, inelastic, the atom, of spin, ground, spin, fluorescence, shifts, channel, gravity, and demonstrate, dot

Top 50 indicators of Nature/Science:
quantum, here we, here, of, matter, been, of quantum, fundamental, clocks, processing, precision, such, quantum information, however, have, here we demonstrate, has, here we report, control, these, information, entanglement, system, information processing, such as, sensitivity, in, quantum system, environment, have been, studies, force, quantum information processing, physics, superconducting, atomic clocks, measurement, its, we demonstrate, and, 
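
The ranking above relies on the sign of the linear SVM weights: following the labelling used in the cell above, with labels 0 (PRL) and 1 (Nature/Science) the most negative weights pull toward PRL and the most positive toward Nature/Science. Here is a toy sketch of that ranking. The vocabulary and weights are made up and are not taken from the trained model.

```python
# Hypothetical n-gram -> column index mapping, shaped like CountVectorizer.vocabulary_
vocabulary = {'quantum': 0, 'we show': 1, 'here we': 2, 'hyperfine': 3}
# Hypothetical weights, one per column; negative favours class 0, positive class 1
coefs = [0.9, -0.6, 0.7, -0.4]

# Invert the mapping so a column index can be turned back into its n-gram
index_to_ngram = dict((index, ngram) for ngram, index in vocabulary.items())
# Column indices sorted from most negative to most positive weight
order = sorted(range(len(coefs)), key=lambda i: coefs[i])

prl_indicators = [index_to_ngram[i] for i in order[:2]]
nature_science_indicators = [index_to_ngram[i] for i in order[::-1][:2]]
print(prl_indicators)
print(nature_science_indicators)
```

Only the ordering matters here; the magnitudes of TF-IDF-based weights are not directly comparable across differently scaled features.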

In [13]:
print text_abstract_clf.predict([test_abstract[-4]])
print test_abstract[-4]

[0]
  Many modern theories predict that the fundamental constants depend on time,
position, or the local density of matter. We develop a spectroscopic method for
pulsed beams of cold molecules, and use it to measure the frequencies of
microwave transitions in CH with accuracy down to 3 Hz. By comparing these
frequencies with those measured from sources of CH in the Milky Way, we test
the hypothesis that fundamental constants may differ between the high and low
density environments of the Earth and the interstellar medium. For the fine
structure constant we find \Delta\alpha/\alpha = (0.3 +/- 1.1)*10^{-7}, the
strongest limit to date on such a variation of \alpha. For the
electron-to-proton mass ratio we find \Delta\mu/\mu = (-0.7 +/- 2.2) * 10^{-7}.
We suggest how dedicated astrophysical measurements can improve these
constraints further and can also constrain temporal variation of the constants.

