Can I predict the existence of subfields with some cool unsupervised learning algorithm? 

For starters, let's just use regular n-grams. A more advanced version would be to look for noun phrases or J&K POS tags.
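
To make "regular n-grams" concrete, here is a minimal standard-library sketch of what a word n-gram extractor with `ngram_range=(1,3)` and stop-word removal produces. The helper `word_ngrams` and the tiny `STOP_WORDS` set are made up for illustration; scikit-learn's own English stop-word list is much longer. As in `CountVectorizer`, stop words are dropped before the n-grams are formed.

```python
import re

# Tiny illustrative stop-word list (an assumption, not scikit-learn's list).
STOP_WORDS = {"the", "of", "a", "in", "is", "and", "we"}

def word_ngrams(text, n_min=1, n_max=3, stop_words=STOP_WORDS):
    """Return all word n-grams with n_min <= n <= n_max, after dropping stop words."""
    # Lowercase and keep tokens of two or more word characters.
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in stop_words]
    ngrams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

example = word_ngrams("We measure the electric dipole moment of the electron")
print(example)
```

Because stop words are removed first, n-grams can bridge the gap they leave: "moment of the electron" contributes the bigram "moment electron".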

In [1]:
# Need to add the parent directory to sys.path to find 'metadataDB'
import sys
sys.path.append('../../')

%matplotlib inline
# import matplotlib.pyplot as plt 
import time
import numpy as np
# import scipy as sp
import re
from collections import Counter
import itertools
import random
import copy

# Natural language processing toolkit
# To use this, run nltk.download() and download 'stopwords'
# from nltk.corpus import stopwords
# s=stopwords.words('english') + ['']

# Machine learning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
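try:
    from sklearn.model_selection import train_test_split, cross_val_score
except ImportError:  # scikit-learn < 0.18 only has the old module
    from sklearn.cross_validation import train_test_split, cross_val_score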
from sklearn.cluster import KMeans
from sklearn.decomposition import SparsePCA
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
# from sklearn import metrics

# SQL
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from metadataDB.declareDatabase import *
from sqlalchemy import or_, and_

engine = create_engine("sqlite:///../../arXiv_metadata.db", echo=False)
Base.metadata.bind = engine
DBsession = sessionmaker(bind=engine)
session = DBsession()

In [2]:
query = session.query(Article_Category)\
                    .join(Category)\
                    .join(Article)\
                    .filter(Category.name.like('%atom-ph%'),
                            )
#                             or_(Article.journal_ref.like('Physical Review Letters%'),
#                                           Article.journal_ref.like('Phys. Rev. Lett.%'),
#                                           Article.journal_ref.like('PRL%')))

# Clean up text
abstracts = [' '.join(x.article.abstract.split()) for x in query]

query = session.query(Article_Category)\
                    .join(Category)\
                    .join(Article)\
                    .filter(Category.name.like('%quant-ph%'),
                            )
#                             or_(Article.journal_ref.like('Physical Review Letters%'),
#                                           Article.journal_ref.like('Phys. Rev. Lett.%'),
#                                           Article.journal_ref.like('PRL%')))

# Clean up text
abstracts_general = [' '.join(x.article.abstract.split()) for x in query]

session.close_all()

In [3]:
print len(abstracts)
print len(abstracts_general)

9156
60594
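
The "clean up text" step relies on `str.split()` with no arguments, which splits on any run of whitespace (spaces, tabs, newlines) and discards leading and trailing whitespace. A quick check on a made-up string:

```python
# Made-up example string with double spaces, a newline, a tab and trailing spaces.
raw = "We study  cold atoms\nin an optical\tlattice.  "
clean = ' '.join(raw.split())
print(clean)
```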


Use KMeans to find interesting subfields.
See: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
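
As a rough picture of what the TF-IDF + KMeans pipeline does, here is a standard-library sketch on a toy corpus. The IDF formula and L2 normalization follow the `TfidfTransformer` defaults (`smooth_idf=True`, `norm='l2'`). The k-means part is plain Lloyd iteration started from hand-picked documents, which is an assumption made for reproducibility: scikit-learn uses k-means++ initialization with `n_init` restarts. All function names and the toy documents are invented for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """L2-normalized TF-IDF vectors (as dicts), with the smoothed IDF
    idf(t) = ln((1 + n) / (1 + df(t))) + 1."""
    counts = [Counter(doc.split()) for doc in docs]
    n = len(docs)
    # Document frequency: iterating over a Counter yields each term once.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        vec = {t: tf * (math.log((1.0 + n) / (1.0 + df[t])) + 1.0)
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors stored as dicts."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(vectors, init_indices, n_iter=10):
    """Plain Lloyd iterations, starting from the given documents as centroids."""
    centroids = [dict(vectors[i]) for i in init_indices]
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every document.
        labels = [min(range(len(centroids)),
                      key=lambda k: sq_dist(v, centroids[k]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = mean_vector(members)
    return labels, centroids

toy_docs = [
    "optical clock frequency shift",
    "clock transition frequency uncertainty",
    "ion trap laser cooling",
    "magneto optical trap cooling atoms",
]
toy_vectors = tfidf_vectors(toy_docs)
toy_labels, toy_centroids = kmeans(toy_vectors, init_indices=[0, 2])
print(toy_labels)
```

The two clock documents end up in one cluster and the two trapping documents in the other, even though "optical" appears on both sides.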

In [4]:
n_clusters = 10
# n_init=20 is used for the full run; reduce it to 10 for quicker tests.
clf_unsupervised = Pipeline([('vect', CountVectorizer(ngram_range=(1,3), stop_words='english')),
                             ('tfidf', TfidfTransformer()),
                             ('clf', KMeans(n_clusters=n_clusters, n_init=20, n_jobs=-1))])
start = time.time()
clf_unsupervised.fit(abstracts)
print time.time() - start

start = time.time()
predict = clf_unsupervised.predict(abstracts)
print time.time() - start

208.693447113
4.06625699997


In [5]:
# Most important chunks. See http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

order_centroids = clf_unsupervised.named_steps['clf'].cluster_centers_.argsort()[:, ::-1]
count_clusters = Counter(predict)

terms =  clf_unsupervised.named_steps['vect'].get_feature_names()
for i in range(n_clusters):
    print "Cluster %d (%i articles):" % (i, count_clusters[i])
    print ', '.join([terms[x] for x in order_centroids[i, :20]])
    print ''

Cluster 0 (746 articles):
alpha, hyperfine, structure, nuclear, proton, corrections, hydrogen, fine, mu, fine structure, variation, muonic, constant, results, calculations, qed, electron, relativistic, order, constants

Cluster 1 (724 articles):
magnetic, field, electric, magnetic field, dipole, fields, electric dipole, moment, edm, spin, electric field, dipole moment, electric dipole moment, magnetic fields, atomic, atoms, electron, state, nuclear, states

Cluster 2 (936 articles):
trap, cooling, atoms, ion, optical, ions, laser, trapped, trapping, atom, traps, beam, mot, quantum, loading, magneto, magnetic, single, magneto optical, cold

Cluster 3 (1388 articles):
quantum, density, energy, states, theory, systems, potential, method, spin, functions, body, non, function, state, wave, results, equation, interaction, particle, matrix

Cluster 4 (354 articles):
clock, frequency, optical, clocks, transition, shift, lattice, 10, clock transition, atomic, uncertainty, shifts, magic, laser, 
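
The ranking behind these term lists is simple: each centroid holds one weight per vocabulary term, and the terms are listed by descending weight. A pure-Python counterpart of `cluster_centers_.argsort()[:, ::-1]`, with made-up terms and weights (`top_terms`, `terms_demo` and `centroids_demo` are illustrative names only):

```python
def top_terms(centroids, terms, n_top=3):
    """For each centroid (a list of per-term weights), return the n_top
    terms with the largest weights, highest first."""
    result = []
    for weights in centroids:
        order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        result.append([terms[j] for j in order[:n_top]])
    return result

terms_demo = ["clock", "cooling", "frequency", "trap"]
centroids_demo = [
    [0.6, 0.0, 0.5, 0.1],  # a centroid dominated by clock-related terms
    [0.0, 0.7, 0.1, 0.4],  # a centroid dominated by trapping terms
]
print(top_terms(centroids_demo, terms_demo, n_top=2))
# prints [['clock', 'frequency'], ['cooling', 'trap']]
```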

In [6]:
import AbstractWriter
reload(AbstractWriter)


# Build the general abstract writer and fit it on the complete list of quant-ph abstracts (general vocabulary).
writer_general = AbstractWriter.AbstractWriter(ngram=5, randomize=True, seed=42)
start = time.time()
writer_general.fit(abstracts_general)
print time.time() - start



# # Get abstracts in cluster
# cluster = 8
# current_abstracts_iterator = (x for x, y in zip(abstracts, predict) if y==cluster)

# # Make a copy of the general AbstractWriter instance.
# writer = copy.copy(writer_general)
# start = time.time()
# writer.fit_specialized(current_abstracts_iterator)
# print time.time() - start

# fake_abstract = writer.write_abstract()
# print fake_abstract
# print ''
# print writer.find_similar(fake_abstract)

130.651701927


In [22]:
# Write fake abstracts for a given (precomputed) cluster.
def writeAbstractCluster(cluster, article_number=1, check_article=True, seed=42, specialized_weight=0.8):
    writer = AbstractWriter.AbstractWriter(ngram=writer_general.ngram,
                                           randomize=writer_general.randomize,
                                           seed=seed,
                                           maxWords=writer_general.maxWords)
    writer._data = dict(writer_general._data)
    writer._abstracts = list(writer_general._abstracts)
    writer._specialized_weight = specialized_weight
#     print writer._data
#     writer = copy.deepcopy(writer_general)
    start = time.time()
    current_abstracts_iterator = (x for x, y in zip(abstracts, predict) if y==cluster)
    writer.fit_specialized(current_abstracts_iterator)
    print time.time() - start
    
    for _ in range(article_number):
        start = time.time()
        fake_abstract = writer.write_abstract()
        print time.time() - start

        print 'New abstract: ' + fake_abstract
        print ''
        if check_article:
            print 'Existing abstract: ' + writer.find_similar(fake_abstract)
            print ''
    return writer
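
The source of `AbstractWriter` is not shown in this notebook, so the following is only a guess at the general idea: an n-gram Markov chain that mixes a general corpus with a cluster-specific one, preferring the specialized continuation with probability `specialized_weight`. The class `TinyMarkovWriter`, its parameters and the toy sentences are all hypothetical and should not be read as the module's actual implementation.

```python
import random
from collections import defaultdict

class TinyMarkovWriter(object):
    """Hypothetical stand-in for an n-gram text generator (not AbstractWriter)."""

    def __init__(self, order=2, specialized_weight=0.8, seed=42, max_words=30):
        self.order = order
        self.specialized_weight = specialized_weight
        self.max_words = max_words
        self._rng = random.Random(seed)
        # Each table maps a context (tuple of words) to the observed next words.
        self._general = defaultdict(list)
        self._specialized = defaultdict(list)

    def _add(self, table, texts):
        for text in texts:
            words = text.split()
            for i in range(len(words) - self.order):
                context = tuple(words[i:i + self.order])
                table[context].append(words[i + self.order])

    def fit(self, texts):
        self._add(self._general, texts)
        return self

    def fit_specialized(self, texts):
        self._add(self._specialized, texts)
        return self

    def write(self):
        # Start from a context seen in the specialized corpus if there is one.
        starts = sorted(self._specialized) or sorted(self._general)
        if not starts:
            raise ValueError("fit the writer on some text first")
        words = list(self._rng.choice(starts))
        while len(words) < self.max_words:
            context = tuple(words[-self.order:])
            special = self._specialized.get(context)
            general = self._general.get(context)
            if special and (not general or
                            self._rng.random() < self.specialized_weight):
                pool = special
            elif general:
                pool = general
            else:
                break  # dead end: no continuation was ever observed
            words.append(self._rng.choice(pool))
        return " ".join(words)

general_demo = [
    "we study the dynamics of cold atoms in an optical lattice",
    "we measure the frequency of the clock transition in an optical lattice clock",
]
special_demo = ["we measure the frequency shift of the clock transition"]

writer_demo = TinyMarkovWriter(order=2, seed=1).fit(general_demo)
writer_demo.fit_specialized(special_demo)
fake = writer_demo.write()
print(fake)
```

With a weight close to 1 the text stays near the cluster's own phrasing and falls back on the general corpus only when the specialized table has no continuation, which is one plausible reading of why the generated abstracts above drift between topics.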

In [11]:
# Write two abstracts per cluster.

In [12]:
for i in range(n_clusters):
    print "Cluster %d (%i articles):" % (i, count_clusters[i])
    print ', '.join([terms[x] for x in order_centroids[i, :20]])
    print ''
    writeAbstractCluster(i, 2)

Cluster 0 (746 articles):
alpha, hyperfine, structure, nuclear, proton, corrections, hydrogen, fine, mu, fine structure, variation, muonic, constant, results, calculations, qed, electron, relativistic, order, constants

3.32222795486
0.000500917434692
New abstract: We investigate the muonic hydrogen spectrum relevant to this transition using bound state QED with nonperturbative relativistic Dirac wave functions for a two dimensional conformal field theories exciting a state with a component in the DFS the relaxation time is quite short for quantum systems by numerical simulations. We consider this is one of the hard problems in present science and no method has been described for calculating this probability from measurements on a cluster state that is universal for different types of RDM statistics corresponding to different PDFs W t . Obtained general results are illustrated as applied to the phenomenon of weak values. Expressed in units of the trap frequency. This is reminiscent of 

In [23]:
i = 7
print "Cluster %d (%i articles):" % (i, count_clusters[i])
print ', '.join([terms[x] for x in order_centroids[i, :20]])
print ''
writeAbstractCluster(i, 8, check_article=False, seed=1492, specialized_weight=0.9)

Cluster 7 (637 articles):
laser, time, harmonic, pulse, ionization, field, pulses, attosecond, electron, dependent, time dependent, hhg, generation, high, strong, harmonic generation, strong field, high order, delay, dynamics

3.80051398277
0.000510931015015
New abstract: We experimentally disentangle the contributions of different quantum paths in high order harmonic generation of the laser driven atoms and molecules. The Coulomb singularities in the system have been removed by a regularization procedure. Action angle variables have been used to calculate semiclassical transition rates. Simple analytical expressions for the spinwave fidelity as a function of time delay. These modulations originate from the weak initial density modulations induced by the disorder and not from initial phase fluctuations thermal or quantum .

0.000612020492554
New abstract: We consider the time dynamics of the ionization process the formation of electronic wave packets and the development of new types of

<AbstractWriter.AbstractWriter instance at 0x17c1adab8>