Can I predict the existence of subfields with some cool unsupervised learning algorithm? 

To start, let's just use plain word n-grams. A more advanced version would look for noun phrases or J&K POS tags instead.
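
As a rough illustration of what "regular n-grams" means here, the sketch below extracts word n-grams using only the standard library. The helper `word_ngrams` is hypothetical (it is not part of this notebook); its token pattern mirrors the default of scikit-learn's `CountVectorizer`, but it skips stop-word removal.

```python
import re

def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    # Lowercase, then keep alphanumeric tokens of two or more characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("Quantum key distribution", 1, 3))
```

With `n_max=3`, a three-word phrase yields three unigrams, two bigrams, and one trigram, which is why the vocabulary grows quickly on a large corpus.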

In [1]:
# Need to add the parent directory to sys.path to find 'metadataDB'
import sys
sys.path.append('../../')

%matplotlib inline
# import matplotlib.pyplot as plt 
import time
import numpy as np
# import scipy as sp
import re
from collections import Counter
import itertools
import random
import copy

# Natural language processing toolkit
# To use this, run nltk.download() and download 'stopwords'
# from nltk.corpus import stopwords
# s=stopwords.words('english') + ['']

# Machine learning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
# Note: sklearn.cross_validation was renamed sklearn.model_selection in scikit-learn 0.18
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.cluster import KMeans
from sklearn.decomposition import SparsePCA
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
# from sklearn import metrics

# SQL
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from metadataDB.declareDatabase import *
from sqlalchemy import or_, and_

engine = create_engine("sqlite:///../../arXiv_metadata.db", echo=False)
Base.metadata.bind = engine
DBsession = sessionmaker(bind=engine)
session = DBsession()

In [2]:
query = session.query(Article_Category)\
                    .join(Category)\
                    .join(Article)\
                    .filter(Category.name.like('%quant-ph%'),
                            )
#                             or_(Article.journal_ref.like('Physics Review Letters%'),
#                                           Article.journal_ref.like('Phys. Rev. Lett.%'),
#                                           Article.journal_ref.like('PRL%')))

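# Collapse all runs of whitespace (including newlines) into single spaces
abstracts = [' '.join(x.article.abstract.split()) for x in query]

# Corpus for the general vocabulary; for now this uses the same quant-ph filter as above
query = session.query(Article_Category)\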
                    .join(Category)\
                    .join(Article)\
                    .filter(Category.name.like('%quant-ph%'),
                            )
#                             or_(Article.journal_ref.like('Physics Review Letters%'),
#                                           Article.journal_ref.like('Phys. Rev. Lett.%'),
#                                           Article.journal_ref.like('PRL%')))

# Collapse all runs of whitespace (including newlines) into single spaces
abstracts_general = [' '.join(x.article.abstract.split()) for x in query]

session.close_all()

In [3]:
print len(abstracts)
print abstracts[0]

60594
One-dimensional scattering problem admitting a complex, PT-symmetric short-range potential V(x) is considered. Using a Runge-Kutta-discretized version of Schroedinger equation we derive the formulae for the reflection and transmission coefficients and emphasize that the only innovation emerges in fact via a complexification of one of the potential-characterizing parameters.


Use KMeans to find interesting subfields.
See: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
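
For readers unfamiliar with k-means, here is a minimal sketch of the underlying idea (Lloyd's algorithm) in pure Python. It is only an illustration under simplifying assumptions: the function `kmeans` and the toy points are made up for this example, the initial centroids are passed in by hand, and scikit-learn's `KMeans` differs in several ways (k-means++ initialization, `n_init` restarts, sparse input).

```python
def kmeans(points, centroids, n_iter=10):
    """Toy Lloyd's algorithm on tuples of floats; returns (labels, centroids)."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(x) / float(len(members)) for x in zip(*members)))
            else:
                # Keep the old centroid if a cluster ends up empty
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

In the text setting, each "point" is the TF-IDF vector of one abstract, so a centroid is an average abstract and its largest components are the terms that characterize the cluster.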

In [4]:
n_clusters = 10
# Reduce n_init to 10 for testing purposes.
clf_unsupervised = Pipeline([('vect', CountVectorizer(ngram_range=(1,3), stop_words='english')),
                             ('tfidf', TfidfTransformer()),
                             ('clf', KMeans(n_clusters=n_clusters, n_init=10, n_jobs=-2))])
start = time.time()
clf_unsupervised.fit(abstracts)
print time.time() - start

start = time.time()
predict = clf_unsupervised.predict(abstracts)
print time.time() - start

2013.29995084
29.1351377964


In [5]:
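# Top terms per cluster, ranked by centroid weight.
# See http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

# argsort() sorts ascending, so reverse each row to put the heaviest terms first
order_centroids = clf_unsupervised.named_steps['clf'].cluster_centers_.argsort()[:, ::-1]
# Number of articles assigned to each cluster
count_clusters = Counter(predict)

terms = clf_unsupervised.named_steps['vect'].get_feature_names()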
for i in range(n_clusters):
    print "Cluster %d (%i articles):" % (i, count_clusters[i])
    print ', '.join([terms[x] for x in order_centroids[i, :20]])
    print ''

Cluster 0 (99 articles):
withdrawn, paper withdrawn, paper, paper withdrawn author, withdrawn author, author, paper withdrawn authors, withdrawn authors, withdrawn author crucial, author crucial, article withdrawn, error, authors, withdrawn author problems, author problems, crucial, ph, 20 11 97, withdrawn 20, withdrawn 20 11

Cluster 1 (9280 articles):
photon, cavity, optical, quantum, single, light, atoms, state, atom, photons, atomic, field, laser, mode, states, frequency, single photon, phase, scheme, coupling

Cluster 2 (6653 articles):
quantum, time, dynamics, systems, state, decoherence, environment, evolution, non, control, markovian, model, classical, bath, open, phase, equation, walk, quantum systems, study

Cluster 3 (14047 articles):
quantum, states, state, space, classical, information, systems, theory, measurement, operators, paper, operator, problem, non, phase, matrix, given, dimensional, new, group

Cluster 4 (4931 articles):
quantum, mechanics, quantum mechanics, theo
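
The term lists above come from sorting each centroid's components in descending order and mapping the leading indices back to vocabulary entries. A small pure-Python sketch of that ranking step follows; the helper `top_terms` and the toy vocabulary and weights are invented for illustration and are not taken from the fitted model.

```python
def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest centroid weights, largest first."""
    # Indices of the centroid components, sorted by descending weight
    order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in order[:n]]

# Toy vocabulary and two made-up centroids (one weight per vocabulary term)
terms = ["cavity", "photon", "protocol", "qkd"]
centroids = [[0.4, 0.9, 0.0, 0.1],
             [0.0, 0.1, 0.5, 0.8]]

for i, c in enumerate(centroids):
    print("Cluster %d: %s" % (i, ", ".join(top_terms(c, terms, n=2))))
```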

In [6]:
import AbstractWriter
reload(AbstractWriter)


# Build the general abstract writer and fit it on the complete list of abstracts (general vocabulary).
writer_general = AbstractWriter.AbstractWriter(ngram=5, randomize=True, seed=42)
start = time.time()
writer_general.fit(abstracts_general)
print time.time() - start



# # Get abstracts in cluster
# cluster = 8
# current_abstracts_iterator = (x for x, y in zip(abstracts, predict) if y==cluster)

# # Make a copy of the general AbstractWriter instance.
# writer = copy.copy(writer_general)
# start = time.time()
# writer.fit_specialized(current_abstracts_iterator)
# print time.time() - start

# fake_abstract = writer.write_abstract()
# print fake_abstract
# print ''
# print writer.find_similar(fake_abstract)

151.159186125
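
The `AbstractWriter` module itself is not shown in this notebook. Judging only from its `ngram`, `randomize`, and `seed` arguments, it is presumably some form of n-gram Markov-chain text generator; that is an assumption, not something the notebook confirms. Under that assumption, the idea can be sketched as below. The class `ToyWriter`, its methods, and the two-sentence corpus are all hypothetical and much simpler than whatever the real module does.

```python
import random
from collections import defaultdict

class ToyWriter(object):
    """Hypothetical word-level Markov-chain generator (NOT the notebook's AbstractWriter)."""

    def __init__(self, order=2, seed=42, max_words=50):
        self.order = order              # number of preceding words used as context
        self.max_words = max_words      # hard cap on the generated length
        self._rng = random.Random(seed)
        self._next = defaultdict(list)  # context tuple -> observed next words
        self._starts = []               # contexts that begin a training text

    def fit(self, texts):
        for text in texts:
            words = text.split()
            if len(words) <= self.order:
                continue  # too short to provide any transition
            self._starts.append(tuple(words[:self.order]))
            for i in range(len(words) - self.order):
                context = tuple(words[i:i + self.order])
                self._next[context].append(words[i + self.order])
        return self

    def write(self):
        # Start from the opening words of a random training text
        context = self._rng.choice(self._starts)
        out = list(context)
        while len(out) < self.max_words:
            candidates = self._next.get(context)
            if not candidates:
                break  # this context was only ever seen at the end of a text
            # Repeated entries make frequent continuations more likely
            out.append(self._rng.choice(candidates))
            context = tuple(out[-self.order:])
        return " ".join(out)

corpus = [
    "we study quantum key distribution over noisy channels",
    "we study quantum walks on graphs",
]
writer = ToyWriter(order=2, seed=42).fit(corpus)
print(writer.write())
```

With a context of only two words and a tiny corpus, the generator can do little more than replay a training sentence. A longer context (the notebook uses `ngram=5`) makes the output more fluent but also more likely to copy existing abstracts verbatim, which is presumably why the notebook checks generated text against existing abstracts.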


In [10]:
# Write fake abstracts for one (precomputed) cluster: start from the general
# writer's fitted state, then fit further on that cluster's abstracts only.
def writeAbstractCluster(cluster, article_number=1, check_article=True, seed=42, specialized_weight=0.8):
    writer = AbstractWriter.AbstractWriter(ngram=writer_general.ngram,
                                           randomize=writer_general.randomize,
                                           seed=seed,
                                           maxWords=writer_general.maxWords)
    writer._data = dict(writer_general._data)
    writer._abstracts = list(writer_general._abstracts)
    writer._specialized_weight = specialized_weight
#     print writer._data
#     writer = copy.deepcopy(writer_general)
    start = time.time()
    current_abstracts_iterator = (x for x, y in zip(abstracts, predict) if y==cluster)
    writer.fit_specialized(current_abstracts_iterator)
    print time.time() - start
    
    for _ in range(article_number):
        start = time.time()
        fake_abstract = writer.write_abstract()
        print time.time() - start

        print 'New abstract: ' + fake_abstract
        print ''
        if check_article:
            print 'Existing abstract: ' + writer.find_similar(fake_abstract)
            print ''
    return writer
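
The function above calls `writer.find_similar` to compare each generated abstract with existing ones. How `AbstractWriter.find_similar` works is not shown here. Assuming its purpose is to reveal how much of a generated abstract is copied verbatim, one way to mark the longest shared run of words between two texts is sketched below with the standard library's `difflib`. The helper `mark_longest_overlap`, the `[[[ ]]]` marking, and the example strings are illustrative assumptions only.

```python
from difflib import SequenceMatcher

def mark_longest_overlap(generated, existing):
    """Wrap the longest word run that `existing` shares with `generated` in [[[ ]]]."""
    a, b = generated.split(), existing.split()
    # Compare word lists; autojunk=False turns off difflib's heuristic for long inputs
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size == 0:
        return existing  # no word in common
    shared = " ".join(b[m.b:m.b + m.size])
    marked = b[:m.b] + ["[[[" + shared + "]]]"] + b[m.b + m.size:]
    return " ".join(marked)

generated = "we show that the paper has been withdrawn by the author"
existing = "This paper has been withdrawn by the author due to an error"
print(mark_longest_overlap(generated, existing))
# This [[[paper has been withdrawn by the author]]] due to an error
```

A full similarity search would apply such a comparison against every abstract in the corpus and keep the one with the longest overlap, which is costly for tens of thousands of abstracts.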

In [8]:
# Write two abstracts per cluster.

In [9]:
for i in range(n_clusters):
    print "Cluster %d (%i articles):" % (i, count_clusters[i])
    print ', '.join([terms[x] for x in order_centroids[i, :20]])
    print ''
    writeAbstractCluster(i, 2)

Cluster 0 (99 articles):
withdrawn, paper withdrawn, paper, paper withdrawn author, withdrawn author, author, paper withdrawn authors, withdrawn authors, withdrawn author crucial, author crucial, article withdrawn, error, authors, withdrawn author problems, author problems, crucial, ph, 20 11 97, withdrawn 20, withdrawn 20 11

2.90811300278
0.000174999237061
New abstract: This paper has been withdrawn by the author because Lemma # is incorrect. This mistake is crucial in this paper.

Existing abstract: [[[This paper has been withdrawn by the author because Lemma # is incorrect. This mistake is crucial in this]]] paper.

0.00097393989563
New abstract: This paper has been withdrawn by the author because it is superseded by cond mat ####### . These systems constitute quantum bits with logical states differing by one Cooper pair charge. Single and two bit operations required for quantum information applications. The uncertainty relations allow an arbitrary choice of a large number exponent

In [11]:
i = 7
print "Cluster %d (%i articles):" % (i, count_clusters[i])
print ', '.join([terms[x] for x in order_centroids[i, :20]])
print ''
writeAbstractCluster(i, 8, check_article=False, seed=1492, specialized_weight=0.9)

Cluster 7 (2483 articles):
key, protocol, quantum, channel, security, qkd, key distribution, quantum key, quantum key distribution, communication, protocols, secure, distribution, channels, capacity, information, secret, classical, rate, bob

5.07771587372
0.000466108322144
New abstract: We show how weak non linearities can be used in quantum key distribution protocols. This exposition does not require knowledge of the quantum seal offer unprecedented security for unattended monitoring systems.

0.00108504295349
New abstract: We consider the notion of canonical attacks which are the cryptographic analog of the canonical forms of positive partial transposition and of the sets of unitaries implementable by circuits over the Clifford and T library and unitaries over the ring mathbb Z # sqrt # i . Moreover we show that the resolved sideband regime where the resonance states overlap and many branch points exceptional points in the parameter space which separates the Markovian and non Markov

<AbstractWriter.AbstractWriter instance at 0x10cbf0758>