http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html#example-applications-topics-extraction-with-nmf-py

In [1]:
from __future__ import print_function

In [2]:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
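# RandomizedPCA was deprecated in scikit-learn 0.18 and removed in 0.20;
# newer versions use PCA(svd_solver='randomized') instead.
from sklearn.decomposition import RandomizedPCA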

In [3]:
import json
import random

In [22]:
data = []
ids = []
with open("with_abstracts.json", 'r') as json_file:
    for line in json_file:
        # Keep a random ~4% sample of the records.
        if random.random() < 0.04:
            record = json.loads(line)  # parse each sampled line only once
            data.append(record.get("abstract"))
            ids.append(record.get("recid"))
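
The sampling cell above draws from the global `random` state, so the sample size and the record ids shown further down change on every run. A minimal sketch of a repeatable variant follows. The helper name `sample_records`, the seed value, and the in-memory records are assumptions made for illustration, not part of the original notebook.

```python
import io
import json
import random


def sample_records(lines, rate, seed):
    """Keep each JSON line with probability `rate`, parsing it only once."""
    rng = random.Random(seed)  # private generator; global random state untouched
    abstracts, recids = [], []
    for line in lines:
        if rng.random() < rate:
            record = json.loads(line)
            abstracts.append(record.get("abstract"))
            recids.append(record.get("recid"))
    return abstracts, recids


# Tiny in-memory stand-in for with_abstracts.json (made-up records).
fake_file = io.StringIO(u"\n".join(
    json.dumps({"recid": i, "abstract": "abstract %d" % i}) for i in range(1000)
))
lines = fake_file.readlines()

first = sample_records(lines, rate=0.04, seed=0)
second = sample_records(lines, rate=0.04, seed=0)
print(len(first[0]), first == second)
```

With a fixed seed the same lines are selected each time, so later cells that index into `data` and `ids` refer to the same records across runs.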

In [25]:
len(data)

17612

u'Systems that exhibit pattern formation are typically driven and dissipative. In the early universe, parametric resonance can drive explosive particle production called preheating. The fields that are populated then decay quantum mechanically if their particles are unstable. Thus, during preheating, a driven-dissipative system exists. We have shown previously that pattern formation can occur in two dimensions in a self-coupled inflaton system undergoing parametric resonance. In this paper, we provide evidence of pattern formation for more realistic initial conditions in both two and three dimensions. In the one-field case, we have the novel interpretation that these patterns can be thought of as a network of domain walls. We also show that the patterns are spatio-temporal, leading to a distinctive, but probably low-amplitude peak in the gravitational wave spectrum. In the context of a two-field model, we discuss putting power from resonance into patterns on cosmological scales, in par

In [39]:
n_features = 25
n_topics = 20 
n_top_words = 20 

In [40]:
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                             stop_words='english')
tfidf = vectorizer.fit_transform(data)


In [41]:
tfidf

<17612x25 sparse matrix of type '<type 'numpy.float64'>'
	with 69095 stored elements in Compressed Sparse Row format>

In [44]:
tfidf[19,:].toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.36246621,  0.        ,  0.66971222,
         0.        ,  0.        ,  0.        ,  0.        ,  0.36613548,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.38029082,  0.37606847,  0.        ]])
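
The five nonzero weights in row 19 have squares that sum to about 1. That is what one would expect if the vectorizer applied its default L2 row normalization (`norm='l2'`). The sketch below rebuilds the row from the printed values and checks this. The column positions are read off the output above, and `row` is a hand-built stand-in, not the live `tfidf` matrix.

```python
import numpy as np

# Rebuild row 19 from the printed output: 25 columns, five nonzero weights.
row = np.zeros(25)
row[[7, 9, 14, 22, 23]] = [0.36246621, 0.66971222, 0.36613548,
                           0.38029082, 0.37606847]

nonzero_columns = np.flatnonzero(row)   # which vocabulary columns are active
l2_norm = np.sqrt(np.sum(row ** 2))     # Euclidean length of the row

print(nonzero_columns, round(l2_norm, 6))
```

In the notebook session itself, the column indices could be mapped back to terms with `vectorizer.get_feature_names()` (renamed `get_feature_names_out()` in recent scikit-learn releases).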

In [45]:
pca = RandomizedPCA(n_components=15)

In [46]:
pca.fit(tfidf.toarray())

RandomizedPCA(copy=True, iterated_power=3, n_components=15, random_state=None,
       whiten=False)

In [47]:
print(pca.explained_variance_ratio_)

[ 0.0776513   0.06693365  0.06088904  0.05789804  0.0533353   0.04962283
  0.04787239  0.04558359  0.04440875  0.04313766  0.04302982  0.03962672
  0.03830302  0.03634753  0.03585678]


In [48]:
sum(pca.explained_variance_)

0.61097714710806694
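
Note that the cell above sums `explained_variance_` (absolute variances), not `explained_variance_ratio_`. The printed ratios sum to roughly 0.74, so the 15 components retain about 74% of the total variance rather than 61%. The sketch below works through that arithmetic from the printed values. It assumes the ratio is the explained variance divided by the total variance of the input.

```python
import numpy as np

# Values copied from the explained_variance_ratio_ output above.
ratios = np.array([
    0.0776513, 0.06693365, 0.06088904, 0.05789804, 0.0533353,
    0.04962283, 0.04787239, 0.04558359, 0.04440875, 0.04313766,
    0.04302982, 0.03962672, 0.03830302, 0.03634753, 0.03585678,
])

cumulative = np.cumsum(ratios)  # fraction of variance kept by the first k components

# Smallest number of components whose cumulative ratio reaches 50%.
n_for_half = int(np.searchsorted(cumulative, 0.5)) + 1

# Total variance implied by sum(explained_variance_) printed above
# (assumption: ratio = explained variance / total variance).
implied_total_variance = 0.61097714710806694 / cumulative[-1]

print(round(cumulative[-1], 4), n_for_half, round(implied_total_variance, 4))
```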

In [59]:
pca.transform(tfidf[1,:].toarray())

array([[ 0.02946374,  0.49706797, -0.29585848, -0.44068286, -0.36224846,
         0.22645292, -0.09460469, -0.10511972, -0.05819341,  0.06657629,
         0.10417855, -0.08303495, -0.17156866,  0.02300129, -0.04456839]])

In [60]:
pca.transform(tfidf[2,:].toarray())

array([[ 0.78371395, -0.10428656, -0.06692724,  0.10971775, -0.28026573,
        -0.03354644, -0.00862006, -0.14109541,  0.07161482, -0.05691854,
        -0.0227676 ,  0.00279854, -0.13364749,  0.05617544, -0.04112762]])

In [61]:
pca.transform(tfidf[3,:].toarray())

array([[ 0.3380412 ,  0.40088511,  0.12709053, -0.1425649 ,  0.50340428,
         0.24268593,  0.05205757, -0.17956719, -0.00679553,  0.00646802,
        -0.0339809 , -0.06772522, -0.07978799, -0.08348602, -0.01058996]])
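
The three single-row calls above could likely be replaced by one call on a slice, for example `pca.transform(tfidf[1:4].toarray())`, because the projection acts on each row independently. The NumPy sketch below illustrates why. It uses an exact SVD on made-up data rather than the randomized solver in `RandomizedPCA`, and `project` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(50, 25)            # made-up stand-in for tfidf.toarray()

# Plain PCA via SVD of the centered data.
mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
components = vt[:15]            # top 15 principal axes, shape (15, 25)


def project(rows):
    """Center the rows and project them onto the principal axes."""
    return np.dot(rows - mean, components.T)


one_by_one = np.vstack([project(X[i:i + 1]) for i in (1, 2, 3)])
batched = project(X[1:4])

print(batched.shape, np.allclose(one_by_one, batched))
```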

In [65]:
ids[1:4]

[514094, 522623, 522648]

In [66]:
data[1:4]

[u'Taking the effective interaction between a quark and an anti-quark from previous work, the dependendence on a regularization scale is removed in line with the renormalization group. In order to emphasize the essential point, the full spinor interaction is replaced by a model which includes only the Coulomb and the hyperfine interaction. By adjusting the effective quark masses, the only free parameters of the theory, the mass and the size of the pion are reproduced, as well as the mass of all other pseudo-scalar mesons. Estimates for the vector mesons are close to the empirical values. The model exposes screening rather than strict confinement. The ionization thresholds are in general much larger than the pion mass.',
 u"Invariant functions under the transformations of a compact linear group $G$ acting in $\\real^n$ can be expressed in terms of functions defined in the orbit space of $G$. We develop a method to determine the isotropy classes of the orbit spaces of all the real linear

In [49]:
pca.components_

array([[ -1.36979634e-01,  -2.79812595e-01,  -9.46451381e-02,
         -1.42855735e-01,   5.23622310e-01,  -8.08232147e-02,
         -1.61205045e-01,  -4.56000385e-02,  -7.03811922e-02,
         -2.20222629e-01,  -7.83199985e-03,   7.12715278e-04,
         -2.62045758e-03,  -5.20700983e-04,  -1.80755880e-02,
          1.60594233e-01,  -5.20006540e-02,  -1.37289537e-01,
         -8.61126279e-02,  -8.47701412e-02,  -7.14647745e-02,
          5.35674745e-02,   6.57682312e-01,   3.40544666e-02,
         -8.55738206e-02],
       [ -7.72308236e-02,   2.45746631e-02,  -9.08073859e-02,
         -1.73931258e-01,  -1.15177715e-01,  -6.78995321e-02,
         -1.76239677e-01,   1.63555669e-02,  -5.36460450e-02,
          1.29628127e-01,   5.81802027e-02,  -4.60837026e-03,
          9.07620491e-01,   1.09709672e-01,   6.08127207e-03,
          1.45726963e-02,  -2.82688607e-02,  -1.46530007e-01,
          1.06302455e-02,  -6.72659912e-02,  -7.11653997e-02,
          1.83369145e-02,  -2.28275178e-02,