Can I predict the existence of subfields with some cool unsupervised learning algorithm? 

For starters, let's just use regular n-grams. A more advanced version would be to look for noun phrases or J&K POS tags.
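
As a rough illustration of what "regular n-grams" means here, the sketch below lists every word 1-, 2-, and 3-gram of a phrase. The helper `word_ngrams` is a made-up name for illustration only; scikit-learn's `CountVectorizer` does its own tokenization and stop-word removal, so its vocabulary will differ in detail.

```python
def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max, grouped by length."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the text.
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams


# Hypothetical helper, not part of scikit-learn or this repository.
example = word_ngrams("Bose Einstein condensate")
print(example)
```

A three-word phrase yields three unigrams, two bigrams, and one trigram, which is why `ngram_range=(1,3)` makes the vocabulary grow quickly.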

In [22]:
# Need to add the parent directory to sys.path to find 'metadataDB'
import sys
sys.path.append('../../')

%matplotlib inline
# import matplotlib.pyplot as plt 
import time
import numpy as np
# import scipy as sp
import re
from collections import Counter
import itertools
import random
import copy

# Natural language processing toolkit
# To use this, run nltk.download() and download 'stopwords'
# from nltk.corpus import stopwords
# s=stopwords.words('english') + ['']

# Machine learning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.cluster import KMeans
from sklearn.decomposition import SparsePCA
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
# from sklearn import metrics

# SQL
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from metadataDB.declareDatabase import *
from sqlalchemy import or_, and_
from sqlalchemy import extract

engine = create_engine("sqlite:///../../arXiv_metadata.db", echo=False)
Base.metadata.bind = engine
DBsession = sessionmaker(bind=engine)
session = DBsession()

For the general vocabulary, use all articles created in 2013 or later.

In [23]:
query = session.query(Article_Category)\
                    .join(Category)\
                    .join(Article)\
                    .filter(Category.name.like('%atom-ph%'),
                            )
#                             or_(Article.journal_ref.like('Physics Review Letters%'),
#                                           Article.journal_ref.like('Phys. Rev. Lett.%'),
#                                           Article.journal_ref.like('PRL%')))

# Clean up text
abstracts = [' '.join(x.article.abstract.split()) for x in query]

query = session.query(Article).filter(extract('year', Article.created) > 2012)
#                             or_(Article.journal_ref.like('Physics Review Letters%'),
#                                           Article.journal_ref.like('Phys. Rev. Lett.%'),
#                                           Article.journal_ref.like('PRL%')))

# Clean up text
abstracts_general = [' '.join(x.abstract.split()) for x in query]

session.close_all()

In [24]:
# print len(abstracts)
print len(abstracts_general)

178828


Use KMeans to find interesting subfields.
See: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
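
For intuition, here is a minimal plain-Python sketch of the two alternating steps behind k-means (Lloyd's algorithm). It is an illustration under simplifying assumptions, not scikit-learn's implementation: the starting centroids are passed in by hand, whereas `KMeans` picks its own and repeats the whole procedure `n_init` times, keeping the best run. The function name `kmeans` and the toy points are made up for this example.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations: assign points to the nearest centroid, then recompute centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append([sum(dim) / float(len(members)) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster's centroid unchanged
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids


# Toy 2-D "documents" (made-up numbers): two near the first axis, two near the second.
toy_points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels, toy_centroids = kmeans(toy_points, [[1.0, 0.0], [0.0, 1.0]])
print(toy_labels)
```

In the real pipeline each point is a sparse TF-IDF vector with one dimension per n-gram, so the centroids can be read as weighted term lists.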

In [25]:
n_clusters = 10
# n_init=20 is used for the full run; reduce n_init to 10 for testing purposes.
clf_unsupervised = Pipeline([('vect', CountVectorizer(ngram_range=(1,3), stop_words='english')),
                             ('tfidf', TfidfTransformer()),
                             ('clf', KMeans(n_clusters=n_clusters, n_init=20, n_jobs=-1))])
start = time.time()
clf_unsupervised.fit(abstracts)
print time.time() - start

start = time.time()
predict = clf_unsupervised.predict(abstracts)
print time.time() - start

191.118264198
3.70438098907


In [26]:
# Most important n-grams (highest-weight terms) for each cluster centroid. See http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

order_centroids = clf_unsupervised.named_steps['clf'].cluster_centers_.argsort()[:, ::-1]
count_clusters = Counter(predict)

terms =  clf_unsupervised.named_steps['vect'].get_feature_names()
for i in range(n_clusters):
    print "Cluster %d (%i articles):" % (i, count_clusters[i])
    print ', '.join([terms[x] for x in order_centroids[i, :20]])
    print ''

Cluster 0 (926 articles):
ionization, laser, electron, field, pulse, time, pulses, harmonic, strong, attosecond, high, dependent, laser field, strong field, time dependent, hhg, generation, intense, energy, photon

Cluster 1 (1021 articles):
rydberg, atoms, bose, lattice, atom, condensate, quantum, bose einstein, einstein, interactions, state, spin, optical, interaction, gas, states, phase, atomic, dipole, dynamics

Cluster 2 (455 articles):
nuclear, electric, dipole, electric dipole, edm, moment, parity, hyperfine, relativistic, dipole moment, electron, electric dipole moment, calculations, moments, cluster, coupled cluster, atomic, spin, states, interaction

Cluster 3 (682 articles):
alpha, structure, proton, fine, fine structure, electron, mu, levels, transitions, variation, hydrogen, muonic, corrections, data, constant, calculations, results, fine structure constant, structure constant, transition

Cluster 4 (420 articles):
scattering, body, scattering length, length, range, resona
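
The term lists above come from sorting each centroid's weights in descending order (`cluster_centers_.argsort()[:, ::-1]`) and looking up the matching feature names. A plain-Python sketch of the same ranking for a single centroid, using a made-up vocabulary and made-up weights (`top_terms` is a hypothetical helper, not part of the notebook):

```python
def top_terms(centroid, terms, n_top=3):
    """Return the n_top terms with the largest centroid weights, heaviest first."""
    ranked = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in ranked[:n_top]]


# Made-up vocabulary and centroid weights, for illustration only.
toy_terms = ['atom', 'laser', 'pulse', 'rydberg']
toy_centroid = [0.10, 0.70, 0.45, 0.05]
print(top_terms(toy_centroid, toy_terms))
```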

In [27]:
import AbstractWriter
reload(AbstractWriter)


# Make the general abstract writer, transform complete list of abstracts (general vocabulary).
writer_general = AbstractWriter.AbstractWriter(ngram=5, randomize=True, seed=42)
start = time.time()
writer_general.fit(abstracts_general)
print time.time() - start



# # Get abstracts in cluster
# cluster = 8
# current_abstracts_iterator = (x for x, y in zip(abstracts, predict) if y==cluster)

# # Make a copy of the general AbstractWriter instance.
# writer = copy.copy(writer_general)
# start = time.time()
# writer.fit_specialized(current_abstracts_iterator)
# print time.time() - start

# fake_abstract = writer.write_abstract()
# print fake_abstract
# print ''
# print writer.find_similar(fake_abstract)

509.964752197


In [28]:
# Write fake abstracts for a given (precomputed) cluster: copy the general writer, specialize it on that cluster's abstracts, then generate.
def writeAbstractCluster(cluster, article_number=1, check_article=True, seed=42, specialized_weight=0.9):
    writer = AbstractWriter.AbstractWriter(ngram=writer_general.ngram,
                                           randomize=writer_general.randomize,
                                           seed=seed,
                                           maxWords=writer_general.maxWords)
    writer._data = dict(writer_general._data)
    writer._abstracts = list(writer_general._abstracts)
    writer._specialized_weight = specialized_weight
#     print writer._data
#     writer = copy.deepcopy(writer_general)
    start = time.time()
    current_abstracts_iterator = (x for x, y in zip(abstracts, predict) if y==cluster)
    writer.fit_specialized(current_abstracts_iterator)
    print time.time() - start
    
    for _ in range(article_number):
        start = time.time()
        fake_abstract = writer.write_abstract()
        print time.time() - start

        print 'New abstract: ' + fake_abstract
        print ''
        if check_article:
            print 'Existing abstract: ' + writer.find_similar(fake_abstract)
            print ''
    return writer
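
The internals of `AbstractWriter` are not shown in this notebook, so the following is only a guess at the kind of mechanism `specialized_weight` might control: a Markov-chain text generator that, at each step, prefers followers seen in the specialized (cluster) corpus with probability `weight` and otherwise falls back to the general corpus. All names here (`build_chain`, `write_text`, the toy sentences) are hypothetical and are not the real `AbstractWriter` API.

```python
import random
from collections import defaultdict


def build_chain(texts, order=2):
    """Map each (order - 1)-word context to the list of words that follow it."""
    chain = defaultdict(list)
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - order + 1):
            context = tuple(tokens[i:i + order - 1])
            chain[context].append(tokens[i + order - 1])
    return chain


def write_text(general, specialized, start, weight=0.9, max_words=20, seed=42):
    """Generate words, preferring the specialized chain with probability `weight`."""
    rng = random.Random(seed)
    words = list(start)
    context_len = len(start)
    while len(words) < max_words:
        context = tuple(words[-context_len:])
        in_special = context in specialized
        in_general = context in general
        if not (in_special or in_general):
            break  # dead end: no corpus has seen this context
        # Assumption: the weight is a per-step probability of using the specialized corpus.
        if in_special and (not in_general or rng.random() < weight):
            followers = specialized[context]
        else:
            followers = general[context]
        words.append(rng.choice(followers))
    return ' '.join(words)


# Toy corpora (made up): one "general" sentence and one "cluster" sentence.
general_chain = build_chain(["strong magnetic field"])
special_chain = build_chain(["strong laser field ionization"])
print(write_text(general_chain, special_chain, ('strong',), weight=0.9))
```

With `weight=1.0` this sketch follows the cluster corpus wherever it can, and with `weight=0.0` it only uses the cluster corpus when the general one has no continuation, which is one plausible reading of why a high `specialized_weight` would make the output sound more like the cluster.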

In [29]:
# Write two abstracts per cluster.

In [30]:
for i in range(n_clusters):
    print "Cluster %d (%i articles):" % (i, count_clusters[i])
    print ', '.join([terms[x] for x in order_centroids[i, :20]])
    print ''
    writeAbstractCluster(i, 2)

Cluster 0 (926 articles):
ionization, laser, electron, field, pulse, time, pulses, harmonic, strong, attosecond, high, dependent, laser field, strong field, time dependent, hhg, generation, intense, energy, photon

19.9766011238
0.0013530254364
New abstract: We study the evolution of the system in an analytic way. After that by additionally studying the system we make a choice of mathbf K in terms of Euler Gamma functions proving that it is simple and easy to use method to generate such random number sequences. Here we introduce a versatile pick up and drop off delay up to ## minutes in a single passage. The mean number of detected antineutrinos is ### # neutrino day to be compared with its experimental value of ##.# . While this difference could be employed as mechanisms for stabilization and control in systems that require reliable operation under chaos.

Existing abstract: We propose a reconstruction of the initial system of ordinary differential equations from a single observed var

In [32]:
i = 0
print "Cluster %d (%i articles):" % (i, count_clusters[i])
print ', '.join([terms[x] for x in order_centroids[i, :20]])
print ''
writeAbstractCluster(i, 8, check_article=False, seed=1492, specialized_weight=0.9)

Cluster 0 (926 articles):
ionization, laser, electron, field, pulse, time, pulses, harmonic, strong, attosecond, high, dependent, laser field, strong field, time dependent, hhg, generation, intense, energy, photon

14.6401560307
0.00608110427856
New abstract: Nonlinear optical methods are becoming ubiquitous in many areas of science. They are however intrinsically non linear which results in secondary non gaussianities orders of magnitude larger than predicted by the same model. This shows that at room temperature the values of the corresponding quantum operators. It also allows us to compute several properties exactly for large N . Phase transitions at critical values of parameters where multi band dispersion curves reduce to a universal function that can be implemented in a compact setup and allows us to quantitatively study the neutrinoless double beta # nu beta beta and # nu beta beta decays is estimated to be about # ## # g Hz # # the highest reported thus far with a dual species 

<AbstractWriter.AbstractWriter instance at 0x152bc2440>