# Topic Model Exploration

Bag-of-words (BOW), document-term matrix, and TF-IDF. Based on the gensim tutorial: https://radimrehurek.com/gensim/tut1.html

In [1]:
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
# test corpora based on: http://topicmodels.west.uni-koblenz.de/ckling/tmt/part1.pdf
'''documents = ["probabilistic topic model",
    "probabilistic topic model",
    "probabilistic topic model",
    "probabilistic topic model",
    "probabilistic topic model",
    "probabilistic topic model",
    "probabilistic topic model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model"
]'''
'''documents = ["probabilistic topic model",
    "probabilistic topic model",
    "probabilistic topic model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model",
    "famous fashion model"
]'''
documents = ["modem the steering linux. modem, linux the modem. steering the modem. linux!",
    "linux; the linux. the linux modem linux. the modem, clutch the modem. gear.",
    "gear! clutch the steering, steering, linux. the steering clutch gear. clutch the gear; the clutch.",
    "the the the. clutch clutch clutch! steering gear; steering gear gear; steering gear!!!!"]
len(documents)

4

In [3]:
from nltk.tokenize import word_tokenize
import string

texts = [[word for word in word_tokenize(document) if word not in string.punctuation] for document in documents]
print(texts)

[['modem', 'the', 'steering', 'linux', 'modem', 'linux', 'the', 'modem', 'steering', 'the', 'modem', 'linux'], ['linux', 'the', 'linux', 'the', 'linux', 'modem', 'linux', 'the', 'modem', 'clutch', 'the', 'modem', 'gear'], ['gear', 'clutch', 'the', 'steering', 'steering', 'linux', 'the', 'steering', 'clutch', 'gear', 'clutch', 'the', 'gear', 'the', 'clutch'], ['the', 'the', 'the', 'clutch', 'clutch', 'clutch', 'steering', 'gear', 'steering', 'gear', 'gear', 'steering', 'gear']]


In [4]:
from gensim import corpora

dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)



{'modem': 0, 'the': 1, 'steering': 2, 'linux': 3, 'clutch': 4, 'gear': 5}


In [5]:
bow = [dictionary.doc2bow(text) for text in texts]
bow

[[(0, 4), (1, 3), (2, 2), (3, 3)],
 [(0, 3), (1, 4), (3, 4), (4, 1), (5, 1)],
 [(1, 4), (2, 3), (3, 1), (4, 4), (5, 3)],
 [(1, 3), (2, 3), (4, 3), (5, 4)]]

In [6]:
from gensim.matutils import corpus2dense
bow_dense = corpus2dense(bow, len(dictionary.token2id), len(bow))

In [7]:
import pandas as pd

columns = [f'D{i+1}' for i, doc in enumerate(bow)]
bow_df = pd.DataFrame(bow_dense, columns = columns, index = dictionary.token2id)
print(bow_df)

           D1   D2   D3   D4
modem     4.0  3.0  0.0  0.0
the       3.0  4.0  4.0  3.0
steering  2.0  0.0  3.0  3.0
linux     3.0  4.0  1.0  0.0
clutch    0.0  1.0  4.0  3.0
gear      0.0  1.0  3.0  4.0


In [8]:
from gensim import models

model = models.TfidfModel(bow)
tfidf = [model[doc] for doc in bow]
tfidf

[[(0, 0.936603022962913), (2, 0.19436268823376643), (3, 0.29154403235064963)],
 [(0, 0.8624176140851579),
  (3, 0.47724753317857443),
  (4, 0.11931188329464361),
  (5, 0.11931188329464361)],
 [(2, 0.50709255283711),
  (3, 0.1690308509457033),
  (4, 0.6761234037828132),
  (5, 0.50709255283711)],
 [(2, 0.5144957554275266), (4, 0.5144957554275266), (5, 0.6859943405700354)]]

In [9]:
tfidf_dense = corpus2dense(tfidf, len(dictionary.token2id), len(bow))

In [10]:
tfidf_df = pd.DataFrame(tfidf_dense, columns = columns, index = dictionary.token2id)
print(tfidf_df)

                D1        D2        D3        D4
modem     0.936603  0.862418  0.000000  0.000000
the       0.000000  0.000000  0.000000  0.000000
steering  0.194363  0.000000  0.507093  0.514496
linux     0.291544  0.477248  0.169031  0.000000
clutch    0.000000  0.119312  0.676123  0.514496
gear      0.000000  0.119312  0.507093  0.685994
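
The tf-idf weights above can be reproduced by hand. The sketch below assumes gensim's default `TfidfModel` settings (raw term frequency, `log2(N / df)` as the inverse document frequency, then L2 normalization per document). The `counts` array is copied from the bag-of-words table, and names such as `manual_tfidf` are only illustrative.

```python
import numpy as np

# Term-document counts copied from the bag-of-words table (rows: terms, columns: D1..D4)
counts = np.array([
    [4, 3, 0, 0],  # modem
    [3, 4, 4, 3],  # the
    [2, 0, 3, 3],  # steering
    [3, 4, 1, 0],  # linux
    [0, 1, 4, 3],  # clutch
    [0, 1, 3, 4],  # gear
], dtype=float)

n_docs = counts.shape[1]
doc_freq = (counts > 0).sum(axis=1)        # number of documents containing each term
idf = np.log2(n_docs / doc_freq)           # assumed gensim default: log base 2
weighted = counts * idf[:, None]           # tf * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=0)  # L2-normalize each document column

print(np.round(manual_tfidf, 6))
```

Because 'the' occurs in all four documents, its idf is log2(4/4) = 0, which is why its row is zero in the table.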


# SVD, LSA (LSI)

## On bow matrix, with gensim

In [11]:
lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2) # initialize an LSI transformation
lsi.print_topics(-1)

[(0,
  '0.624*"the" + 0.384*"clutch" + 0.376*"gear" + 0.364*"steering" + 0.335*"linux" + 0.277*"modem"'),
 (1,
  '0.583*"modem" + 0.514*"linux" + -0.412*"gear" + -0.397*"clutch" + -0.242*"steering" + 0.099*"the"')]

**Important Note:** Interpreting the _topic loadings_ (the coefficients attached to each word in a topic) is a major difficulty with LSI/LSA. [Some](https://www.quora.com/What-is-the-interpretation-of-negative-scores-in-the-basis-vectors-obtained-with-LSI/answer/Luis-Argerich?srid=3A71), like this gensim LSI model, rank the words in a topic by the absolute value (magnitude) of the loading. [Others](https://www.slideshare.net/vitomirkovanovic/topic-modeling-for-learning-analytics-researchers-lak15-tutorial/65) rank by the highest signed value.
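
To make the two conventions concrete, here is a small sketch that ranks the topic 1 words both ways. The loadings are copied from the `print_topics` output above; the variable names are illustrative only.

```python
# Topic 1 loadings copied from the print_topics output above
topic1 = {"modem": 0.583, "linux": 0.514, "gear": -0.412,
          "clutch": -0.397, "steering": -0.242, "the": 0.099}

# Convention 1: rank by magnitude (absolute value of the loading)
by_magnitude = sorted(topic1, key=lambda w: abs(topic1[w]), reverse=True)

# Convention 2: rank by signed value
by_signed = sorted(topic1, key=lambda w: topic1[w], reverse=True)

print(by_magnitude[:3])
print(by_signed[:3])
```

Both rankings agree on _modem_ and _linux_, but the magnitude ranking places _gear_ third while the signed ranking places 'the' third.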

**Important Note:** Notice the effect of not cleaning up _stop words_ when running LSA on the term frequency matrix. 'the' has been determined to be the most important word in the most important topic (in both magnitude and signed interpretations of _topic loadings_), which rings false.

**TODO:** Word->Loading bar charts for each topic (like http://topicmodels.west.uni-koblenz.de/ckling/tmt/part1.pdf)

In [12]:
print(lsi.projection.u) # left singular vectors (U)

[[ 0.27744519  0.5826725 ]
 [ 0.62416737  0.09912868]
 [ 0.36351351 -0.24249522]
 [ 0.33508594  0.51399463]
 [ 0.38393822 -0.39735845]
 [ 0.37630578 -0.4120414 ]]


In [13]:
print(lsi.projection.s) # singular values (Ʃ)

[ 11.25526786   6.53892079]


In [14]:
corpus_lsi = lsi[bow] # create a wrapper over the original corpus: bow->fold-in-lsi

# https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q3-how-do-you-calculate-the-matrix-v-in-lsi-space
print((corpus2dense(corpus_lsi, len(lsi.projection.s)).T / lsi.projection.s).T) # right singular vectors (V^T)

[[ 0.41887654  0.48240457  0.58523405  0.4993289 ]
 [ 0.56355929  0.51860406 -0.40412391 -0.50013462]]


In [15]:
for doc in corpus_lsi: # doc inference: bow->lsi transformation is actually executed here, on the fly
     print(doc)
        
# TODO: why are these numbers so big (and not < 1)?

[(0, 4.7145677241169315), (1, 3.6850694969601752)]
[(0, 5.4295928192507432), (1, 3.3911108898747324)]
[(0, 6.5869661866505869), (1, -2.6425342939042018)]
[(0, 5.620080433766681), (1, -3.2703405625661404)]
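
On the scale of these projections: folding a document into LSI space computes U^T·d, which for a training document equals the matching column of V^T scaled by the singular values, so the coordinates are not unit length. A quick numpy check for D1, using the values printed above (array names are illustrative):

```python
import numpy as np

# U, singular values and V^T copied from the printed outputs above
U = np.array([[0.27744519, 0.5826725],
              [0.62416737, 0.09912868],
              [0.36351351, -0.24249522],
              [0.33508594, 0.51399463],
              [0.38393822, -0.39735845],
              [0.37630578, -0.4120414]])
s = np.array([11.25526786, 6.53892079])
Vt = np.array([[0.41887654, 0.48240457, 0.58523405, 0.4993289],
               [0.56355929, 0.51860406, -0.40412391, -0.50013462]])

# Bag-of-words vector of D1 in term-id order: modem, the, steering, linux, clutch, gear
d1 = np.array([4, 3, 2, 3, 0, 0], dtype=float)

projected = U.T @ d1      # fold-in of D1
scaled_v = s * Vt[:, 0]   # first column of V^T, scaled by the singular values

print(projected)
print(scaled_v)
```

Dividing the projection by the singular values recovers the unit-scale column of V^T, which is what the recipe in In [14] does.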


## On bow matrix, with scikit-learn

In [16]:
from sklearn.utils.extmath import randomized_svd
from gensim.matutils import corpus2csc

# requires sparse matrix form of bow
bow_sparse = corpus2csc(bow)
U, Sigma, VT = randomized_svd(bow_sparse, n_components=2)

In [17]:
print(U)

[[ 0.27744519  0.5826725 ]
 [ 0.62416737  0.09912868]
 [ 0.36351351 -0.24249522]
 [ 0.33508594  0.51399463]
 [ 0.38393822 -0.39735845]
 [ 0.37630578 -0.4120414 ]]


In [18]:
print(Sigma)

[ 11.25526786   6.53892079]


In [19]:
print(VT)

[[ 0.41887655  0.48240458  0.58523407  0.49932889]
 [ 0.56355928  0.51860406 -0.40412392 -0.5001346 ]]


Using [TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) below to 1) perform dimensionality reduction on the doc-term matrix, 2) get extra characteristics like explained variance

In [20]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, n_iter=1)
svd.fit(bow_sparse)
print(svd.transform(bow_sparse)) # bow_sparse is terms x docs, so each output row is a term's coordinates in topic space (U·Ʃ)

[[ 3.12271994  3.81004931]
 [ 7.02517092  0.64819461]
 [ 4.09144197 -1.58565701]
 [ 3.77148204  3.36097017]
 [ 4.32132753 -2.59829542]
 [ 4.23542236 -2.69430611]]
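
Since `bow_sparse` has terms as rows, the rows returned by `transform` appear to be the term coordinates U·Ʃ. A short check against the `randomized_svd` values printed earlier (array names are illustrative):

```python
import numpy as np

# U and Sigma copied from the randomized_svd output above
U = np.array([[0.27744519, 0.5826725],
              [0.62416737, 0.09912868],
              [0.36351351, -0.24249522],
              [0.33508594, 0.51399463],
              [0.38393822, -0.39735845],
              [0.37630578, -0.4120414]])
Sigma = np.array([11.25526786, 6.53892079])

# Output of svd.transform(bow_sparse) copied from above
transformed = np.array([[3.12271994, 3.81004931],
                        [7.02517092, 0.64819461],
                        [4.09144197, -1.58565701],
                        [3.77148204, 3.36097017],
                        [4.32132753, -2.59829542],
                        [4.23542236, -2.69430611]])

term_coords = U * Sigma   # scale each column of U by its singular value
print(np.round(term_coords, 6))
```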


In [21]:
'''SVD suffers from a problem called "sign indeterminacy", which means the
sign of the ``components_`` and the output from transform depend on the
algorithm and random state. To work around this, fit instances of this
class to data once, then keep the instance around to do transformations.'''
print(svd.components_) # V^T

[[ 0.41887655  0.48240458  0.58523407  0.49932889]
 [ 0.56355928  0.51860406 -0.40412392 -0.5001346 ]]


**TODO**: write about the significance of explained variance

In [22]:
print(svd.singular_values_)

[ 11.25526786   6.53892079]


In [23]:
print(svd.explained_variance_)

[ 1.50696752  7.10165316]


In [24]:
print(svd.explained_variance_ratio_)

[ 0.15816569  0.74536301]
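
A note on reading these numbers: judging from the values, `explained_variance_` is the variance of each column of the transformed data, and the ratio divides that by the summed variance of the input columns. `TruncatedSVD` does not center the data, so the first component mostly captures the mean, which would explain why it has the largest singular value but a small explained variance. A sketch that reproduces the printed values (array names are illustrative):

```python
import numpy as np

# Output of svd.transform(bow_sparse), copied from above (rows: terms)
transformed = np.array([[3.12271994, 3.81004931],
                        [7.02517092, 0.64819461],
                        [4.09144197, -1.58565701],
                        [3.77148204, 3.36097017],
                        [4.32132753, -2.59829542],
                        [4.23542236, -2.69430611]])

# Term-document counts from the bag-of-words table (rows: terms, columns: D1..D4)
counts = np.array([[4, 3, 0, 0],
                   [3, 4, 4, 3],
                   [2, 0, 3, 3],
                   [3, 4, 1, 0],
                   [0, 1, 4, 3],
                   [0, 1, 3, 4]], dtype=float)

explained_variance = transformed.var(axis=0)   # variance per component
total_variance = counts.var(axis=0).sum()      # summed variance of the input columns
explained_variance_ratio = explained_variance / total_variance

print(explained_variance)
print(explained_variance_ratio)
```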


## On tf-idf matrix, with gensim

In [25]:
lsi = models.LsiModel(tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
lsi.print_topics(-1)

[(0,
  '0.593*"modem" + 0.441*"clutch" + 0.438*"gear" + 0.405*"steering" + 0.314*"linux" + 0.000*"the"'),
 (1,
  '-0.707*"modem" + 0.418*"gear" + 0.412*"clutch" + 0.316*"steering" + -0.236*"linux" + -0.000*"the"')]

Interesting results. On the bright side, tf-idf LSA correctly gives the stopword 'the' a loading of ~0, which bow LSA struggled with.

However, the topics are not very interpretable: all words in topic 0 have very similar topic loadings, topic 1 has a clear separation between the word 'modem' and the rest of the field (from the magnitude perspective), and you can really only see the separation between the 'cars' and 'IT' words by using sign in topic 1.

Comparing loadings per word across topics, the higher magnitudes mostly come from the same topic for this example (topic 0 - the only max that comes from topic 1 is for _modem_). **TODO:** is this cross-topic comparison valid, or should I stick with the separate word->loading bar charts per topic?

In [26]:
print(lsi.projection.u) # left singular vectors (U)

[[  5.92808679e-01  -7.06679855e-01]
 [  4.09109497e-17  -4.72574037e-17]
 [  4.04574539e-01   3.15748047e-01]
 [  3.13511892e-01  -2.36236614e-01]
 [  4.40860983e-01   4.12293569e-01]
 [  4.38462313e-01   4.18465087e-01]]


In [27]:
print(lsi.projection.s) # singular values (Ʃ)

[ 1.50330866  1.28357349]


In [28]:
corpus_lsi = lsi[tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

# https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q3-how-do-you-calculate-the-matrix-v-in-lsi-space
print((corpus2dense(corpus_lsi, len(lsi.projection.s)).T / lsi.projection.s).T) # right singular vectors (V^T)

[[ 0.48244458  0.50939982  0.5179025   0.48942423]
 [-0.52149894 -0.48542378  0.47612706  0.51546638]]


In [29]:
for doc in corpus_lsi: # doc inference: both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
     print(doc)

[(0, 0.72526311682447664), (1, -0.66938222377208589)]
[(0, 0.76578514191849489), (1, -0.62307711556749001)]
[(0, 0.77856731972283111), (1, 0.61114406793638831)]
[(0, 0.73575565287651956), (1, 0.66163900269050036)]


## On tf-idf matrix, with scikit-learn

In [30]:
from sklearn.utils.extmath import randomized_svd

# randomized_svd requires sparse matrix form of tfidf
tfidf_sparse = corpus2csc(tfidf)
U, Sigma, VT = randomized_svd(tfidf_sparse, n_components=2)

In [31]:
print(U)

[[  5.92808679e-01   7.06679855e-01]
 [ -7.74993893e-24  -2.38175662e-22]
 [  4.04574539e-01  -3.15748047e-01]
 [  3.13511892e-01   2.36236614e-01]
 [  4.40860983e-01  -4.12293569e-01]
 [  4.38462313e-01  -4.18465087e-01]]


In [32]:
print(Sigma)

[ 1.50330866  1.28357349]


In [33]:
print(VT)

[[ 0.48244458  0.50939981  0.5179025   0.48942421]
 [ 0.52149895  0.4854238  -0.47612706 -0.5154664 ]]
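
Compared with the gensim result, the second singular vectors here have opposite signs. That is harmless: flipping the sign of a left singular vector together with its right singular vector leaves their product unchanged. A small numpy check using the printed values (near-zero entries rounded to 0; names are illustrative):

```python
import numpy as np

sigma2 = 1.28357349  # second singular value, identical in both libraries

# Second left/right singular vectors as printed by gensim
u2_gensim = np.array([-0.706679855, 0.0, 0.315748047, -0.236236614, 0.412293569, 0.418465087])
v2_gensim = np.array([-0.52149894, -0.48542378, 0.47612706, 0.51546638])

# Second left/right singular vectors as printed by scikit-learn
u2_sklearn = np.array([0.706679855, 0.0, -0.315748047, 0.236236614, -0.412293569, -0.418465087])
v2_sklearn = np.array([0.52149895, 0.4854238, -0.47612706, -0.5154664])

# Rank-1 contribution of the second component to the reconstructed matrix
rank1_gensim = sigma2 * np.outer(u2_gensim, v2_gensim)
rank1_sklearn = sigma2 * np.outer(u2_sklearn, v2_sklearn)

print(np.abs(rank1_gensim - rank1_sklearn).max())
```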


Using [TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) below to 1) perform dimensionality reduction on the doc-term matrix, 2) get extra characteristics like explained variance

http://scikit-learn.org/stable/modules/decomposition.html#lsa

In [34]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, n_iter=1)
svd.fit(tfidf_sparse)
print(svd.transform(tfidf_sparse)) # tfidf_sparse is terms x docs, so each output row is a term's coordinates in topic space (U·Ʃ)

[[ 0.89117442  0.90707552]
 [ 0.          0.        ]
 [ 0.60820041 -0.40528582]
 [ 0.47130514  0.30322705]
 [ 0.66275013 -0.52920909]
 [ 0.65914419 -0.53713069]]


In [35]:
'''SVD suffers from a problem called "sign indeterminacy", which means the
sign of the ``components_`` and the output from transform depend on the
algorithm and random state. To work around this, fit instances of this
class to data once, then keep the instance around to do transformations.'''
print(svd.components_) # V^T

[[ 0.48244458  0.50939981  0.5179025   0.48942421]
 [ 0.52149895  0.4854238  -0.47612706 -0.5154664 ]]


In [36]:
print(svd.singular_values_)

[ 1.50330866  1.28357349]


In [37]:
print(svd.explained_variance_)

[ 0.075516    0.27269655]


In [38]:
print(svd.explained_variance_ratio_)

[ 0.20772864  0.75013086]


The gensim [tutorial](https://radimrehurek.com/gensim/tut2.html) mentions that it is "preferable" to perform LSA on the tf-idf matrix rather than the bow/term frequency matrix. Why?

[sklearn says similarly about tf-idf](http://scikit-learn.org/stable/modules/decomposition.html#lsa): "While the TruncatedSVD transformer works with any (sparse) feature matrix, using it on tf–idf matrices is recommended over raw frequency counts in an LSA/document processing setting."

## On bow matrix with removed stop words, with gensim

In [39]:
from nltk import download
download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\narho_000\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
from nltk.corpus import stopwords

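english_stopwords = set(stopwords.words('english')) # load the stop word list once
stopword_texts = [(token_id, word) for word, token_id in dictionary.token2id.items() if word in english_stopwords] # pair each stop word with its actual token id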
print(stopword_texts)

[(1, 'the')]


In [41]:
# filter stopwords from gensim corpora dictionary
dictionary.filter_tokens(bad_ids=[i for i, word in stopword_texts])

In [42]:
dictionary.token2id

{'clutch': 3, 'gear': 4, 'linux': 2, 'modem': 0, 'steering': 1}

In [43]:
bow = [dictionary.doc2bow(text) for text in texts]
bow

[[(0, 4), (1, 2), (2, 3)],
 [(0, 3), (2, 4), (3, 1), (4, 1)],
 [(1, 3), (2, 1), (3, 4), (4, 3)],
 [(1, 3), (3, 3), (4, 4)]]

In [44]:
lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2) # initialize an LSI transformation
lsi.print_topics(-1)

[(0,
  '-0.518*"clutch" + -0.514*"gear" + -0.489*"steering" + -0.372*"linux" + -0.300*"modem"'),
 (1,
  '0.648*"modem" + 0.587*"linux" + -0.331*"gear" + -0.317*"clutch" + -0.160*"steering"')]

**Important Note:** Notice the effect of cleaning up _stop words_ when running LSA on the term frequency matrix. Results look more believable! Further, the separation between the words _clutch_, _gear_, _steering_ ('cars') and _modem_, _linux_ ('IT') is clearer, and the loading magnitudes seem to reflect topic relevance better.

In [45]:
lsi.projection.u # left singular vectors (U)

array([[-0.30024393,  0.64782113],
       [-0.48911522, -0.15959877],
       [-0.37171036,  0.58689171],
       [-0.5183769 , -0.31707865],
       [-0.513553  , -0.33147438]])

In [46]:
lsi.projection.s # singular values (Ʃ)

array([ 8.83493413,  6.462851  ])

In [47]:
corpus_lsi = lsi[bow]

# https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q3-how-do-you-calculate-the-matrix-v-in-lsi-space
(corpus2dense(corpus_lsi, len(lsi.projection.s)).T / lsi.projection.s).T # right singular vectors (V^T)

array([[-0.37287627, -0.38704343, -0.61723412, -0.5746153 ],
       [ 0.62399118,  0.56360224, -0.33338883, -0.42642631]])

In [48]:
for doc in corpus_lsi:
     print(doc)

[(0, -3.2943372469083361), (1, 4.0327621295130145)]
[(0, -3.4195031421177662), (1, 3.6424772099715881)]
[(0, -5.4532226067907299), (1, -2.1546423228579714)]
[(0, -5.0766883466956418), (1, -2.7559297667574412)]


## On bow matrix with removed stop words, with scikit-learn

In [49]:
from sklearn.utils.extmath import randomized_svd

bow_sparse = corpus2csc(bow)
U, Sigma, VT = randomized_svd(bow_sparse, n_components=2)

In [50]:
print(U)

[[ 0.30024393  0.64782113]
 [ 0.48911522 -0.15959877]
 [ 0.37171036  0.58689171]
 [ 0.5183769  -0.31707865]
 [ 0.513553   -0.33147438]]


In [51]:
print(Sigma)

[ 8.83493413  6.462851  ]


In [52]:
print(VT)

[[ 0.37287627  0.38704342  0.6172341   0.5746153 ]
 [ 0.6239912   0.56360223 -0.33338883 -0.42642632]]


Using [TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) below to 1) perform dimensionality reduction on the doc-term matrix, 2) get extra characteristics like explained variance

In [53]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, n_iter=1)
svd.fit(bow_sparse)
print(svd.transform(bow_sparse))

[[ 2.65263533  4.18677147]
 [ 4.32130074 -1.03146305]
 [ 3.28403659  3.79299367]
 [ 4.57982573 -2.04923205]
 [ 4.53720693 -2.14226954]]


In [54]:
'''SVD suffers from a problem called "sign indeterminacy", which means the
sign of the ``components_`` and the output from transform depend on the
algorithm and random state. To work around this, fit instances of this
class to data once, then keep the instance around to do transformations.'''
print(svd.components_) # V^T

[[ 0.37287627  0.38704342  0.6172341   0.5746153 ]
 [ 0.6239912   0.56360223 -0.33338883 -0.42642632]]


In [55]:
print(svd.singular_values_)

[ 8.83493413  6.462851  ]


In [56]:
print(svd.explained_variance_)

[ 0.59557896  8.04969065]


In [57]:
print(svd.explained_variance_ratio_)

[ 0.06152675  0.83157961]


## On tf-idf matrix with removed stop words, with gensim

In [58]:
model = models.TfidfModel(bow)
tfidf = [model[doc] for doc in bow]
tfidf

[[(0, 0.936603022962913), (1, 0.19436268823376643), (2, 0.29154403235064963)],
 [(0, 0.8624176140851579),
  (2, 0.47724753317857443),
  (3, 0.11931188329464361),
  (4, 0.11931188329464361)],
 [(1, 0.50709255283711),
  (2, 0.1690308509457033),
  (3, 0.6761234037828132),
  (4, 0.50709255283711)],
 [(1, 0.5144957554275266), (3, 0.5144957554275266), (4, 0.6859943405700354)]]

In [59]:
lsi = models.LsiModel(tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
lsi.print_topics(-1)

[(0,
  '0.593*"modem" + 0.441*"clutch" + 0.438*"gear" + 0.405*"steering" + 0.314*"linux"'),
 (1,
  '-0.707*"modem" + 0.418*"gear" + 0.412*"clutch" + 0.316*"steering" + -0.236*"linux"')]

Same topics and loadings as before stop word removal - which makes sense since tf-idf was robust enough back then to zero-out the loading for 'the'.

In [60]:
print(lsi.projection.u) # left singular vectors (U)

[[ 0.59280868 -0.70667985]
 [ 0.40457454  0.31574805]
 [ 0.31351189 -0.23623661]
 [ 0.44086098  0.41229357]
 [ 0.43846231  0.41846509]]


In [61]:
print(lsi.projection.s) # singular values (Ʃ)

[ 1.50330866  1.28357349]


In [62]:
corpus_lsi = lsi[tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

# https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q3-how-do-you-calculate-the-matrix-v-in-lsi-space
print((corpus2dense(corpus_lsi, len(lsi.projection.s)).T / lsi.projection.s).T) # right singular vectors (V^T)

[[ 0.48244458  0.50939982  0.5179025   0.48942423]
 [-0.52149894 -0.48542378  0.47612706  0.51546638]]


In [63]:
for doc in corpus_lsi: # doc inference: both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
     print(doc)

[(0, 0.7252631168244752), (1, -0.669382223772087)]
[(0, 0.76578514191849389), (1, -0.62307711556749124)]
[(0, 0.77856731972283222), (1, 0.61114406793638687)]
[(0, 0.73575565287652078), (1, 0.66163900269049902)]


## On tf-idf matrix with removed stop words, with scikit-learn

In [64]:
from sklearn.utils.extmath import randomized_svd

# requires sparse matrix form of tfidf
tfidf_sparse = corpus2csc(tfidf)
U, Sigma, VT = randomized_svd(tfidf_sparse, n_components=2)

In [65]:
print(U)

[[ 0.59280868  0.70667985]
 [ 0.40457454 -0.31574805]
 [ 0.31351189  0.23623661]
 [ 0.44086098 -0.41229357]
 [ 0.43846231 -0.41846509]]


In [66]:
print(Sigma)

[ 1.50330866  1.28357349]


In [67]:
print(VT)

[[ 0.48244458  0.50939981  0.5179025   0.48942421]
 [ 0.52149895  0.4854238  -0.47612706 -0.5154664 ]]


Using [TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) below to 1) perform dimensionality reduction on the doc-term matrix, 2) get extra characteristics like explained variance

http://scikit-learn.org/stable/modules/decomposition.html#lsa

In [68]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, n_iter=1)
svd.fit(tfidf_sparse)
print(svd.transform(tfidf_sparse)) # tfidf_sparse is terms x docs, so each output row is a term's coordinates in topic space (U·Ʃ)

[[ 0.89117442  0.90707552]
 [ 0.60820041 -0.40528582]
 [ 0.47130514  0.30322705]
 [ 0.66275013 -0.52920909]
 [ 0.65914419 -0.53713069]]


In [69]:
'''SVD suffers from a problem called "sign indeterminacy", which means the
sign of the ``components_`` and the output from transform depend on the
algorithm and random state. To work around this, fit instances of this
class to data once, then keep the instance around to do transformations.'''
print(svd.components_) # V^T

[[ 0.48244458  0.50939981  0.5179025   0.48942421]
 [ 0.52149895  0.4854238  -0.47612706 -0.5154664 ]]


**Small note:** Up to this point, these results have been exactly the same as _On tf-idf matrix, with scikit-learn_. Explained variance characteristics, however, have changed below.

In [70]:
print(svd.singular_values_)

[ 1.50330866  1.28357349]


In [71]:
print(svd.explained_variance_)

[ 0.01834557  0.32678059]


In [72]:
print(svd.explained_variance_ratio_)

[ 0.05047117  0.89901829]
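
A likely explanation for the change: removing 'the' drops an all-zero row from the matrix, and explained variance is a variance taken over rows, so both the mean and the sample count shift even though the singular vectors stay the same. A sketch using the `transform` output printed earlier for the tf-idf matrix that still included 'the' (names are illustrative):

```python
import numpy as np

# svd.transform output for the tf-idf matrix that still contained 'the' (row 1 is all zeros)
with_the = np.array([[0.89117442, 0.90707552],
                     [0.0, 0.0],
                     [0.60820041, -0.40528582],
                     [0.47130514, 0.30322705],
                     [0.66275013, -0.52920909],
                     [0.65914419, -0.53713069]])

# Same coordinates after the stop word row is removed
without_the = np.delete(with_the, 1, axis=0)

var_with = with_the.var(axis=0)
var_without = without_the.var(axis=0)

print(var_with)
print(var_without)
```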


I wonder if tf-idf struggles with such a small corpus... I prefer the results here from bow LSA, despite the suggestions from the experts. However, I should probably run a larger analysis before coming to a conclusion on which is better.

# NMF

Non-negative Matrix Factorization

"NMF is an alternative approach to decomposition that **assumes that the data and the components are non-negative**. NMF can be **plugged in instead of PCA or its variants**, in the cases where the data matrix does not contain negative values."

Base code: http://scikit-learn.org/0.18/auto_examples/applications/topics_extraction_with_nmf_lda.html

In [73]:
def print_nmf_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" + ".join([f'{topic[i]:.3f}*{feature_names[i]}'
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [74]:
from sklearn.decomposition import NMF

n_topics = 2
n_top_words = 20

# Fit the NMF model.
# Why transpose? corpus2csc returns a terms x documents matrix, while scikit-learn
# expects samples (documents) as rows and features (terms) as columns.
# Note: newer scikit-learn releases replace `alpha` with `alpha_W` / `alpha_H`.
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf_sparse.T)

print("\nTopics in NMF model:")
tfidf_feature_names = [dictionary[i] for i in range(len(dictionary))] # order terms by token id so names line up with the matrix columns
print_nmf_words(nmf, tfidf_feature_names, n_top_words)


Topics in NMF model:
Topic #0:
0.663*gear + 0.661*clutch + 0.558*steering + 0.070*linux + 0.000*modem
Topic #1:
1.057*modem + 0.424*linux + 0.044*steering + 0.000*gear + 0.000*clutch



Topic separation (for this toy example at least) is much cleaner with NMF than LSA.

In [75]:
print(nmf.components_) # word scores per topic (H, topics x terms)

[[ 0.          0.55830476  0.06987073  0.66135614  0.66317597]
 [ 1.05721258  0.04355812  0.42441744  0.          0.        ]]


In [76]:
print(nmf.transform(tfidf_sparse.T)) # document scores per topic (W, documents x topics)

[[ 0.02899529  0.79336551]
 [ 0.07958809  0.78535598]
 [ 0.82681189  0.        ]
 [ 0.83008648  0.        ]]
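
NMF approximates the documents x terms matrix as a product W·H, with W (documents x topics) coming from `transform` and H (topics x terms) from `components_`. The sketch below multiplies the two printed factors back together and compares the result with the tf-idf values from earlier (rounded to four decimals). Because of the regularization terms the fit is only approximate. Array names are illustrative.

```python
import numpy as np

# W: documents x topics, copied from the nmf.transform output
W = np.array([[0.02899529, 0.79336551],
              [0.07958809, 0.78535598],
              [0.82681189, 0.0],
              [0.83008648, 0.0]])

# H: topics x terms, copied from nmf.components_
# (term order: modem, steering, linux, clutch, gear)
H = np.array([[0.0, 0.55830476, 0.06987073, 0.66135614, 0.66317597],
              [1.05721258, 0.04355812, 0.42441744, 0.0, 0.0]])

# tf-idf matrix (documents x terms), rounded from the values printed earlier
X = np.array([[0.9366, 0.1944, 0.2915, 0.0, 0.0],
              [0.8624, 0.0, 0.4772, 0.1193, 0.1193],
              [0.0, 0.5071, 0.1690, 0.6761, 0.5071],
              [0.0, 0.5145, 0.0, 0.5145, 0.6860]])

approx = W @ H
relative_error = np.linalg.norm(X - approx) / np.linalg.norm(X)

print(np.round(approx, 3))
print(relative_error)
```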


# pLSA

Probabilistic Latent Semantic Analysis

"Previous studies (Ding, Li, and Peng 2006; Gaussier and Goutte 2005) have suggested that **PLSA and Kullback-Leibler (KL) divergence based Non-negative Matrix Factorization (NMF) indeed optimize the same objective function** although they converge to different local minima. It is noteworthy that, as a widely used dimension reduction technique, traditional NMF with Frobenius norm performs well for document clustering and topic modeling (Arora et al. 2012; 2013; Kuang, Choo, and Park 2015), although it lacks explicit probabilistic meaning of factors." - https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14469/14409

Going to use NMF with Kullback-Leibler divergence as the _distance function_ to approximate pLSA

In [77]:
# Fit the NMF model
plsa = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5, beta_loss='kullback-leibler', solver='mu').fit(tfidf_sparse.T)

print("\nTopics in pLSA model:")
tfidf_feature_names = list(dictionary.token2id)
print_nmf_words(plsa, tfidf_feature_names, n_top_words)


Topics in pLSA model:
Topic #0:
0.667*gear + 0.666*clutch + 0.619*steering + 0.099*linux + 0.000*modem
Topic #1:
1.064*modem + 0.450*linux + 0.000*gear + 0.000*clutch + 0.000*steering
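
pLSA topics are usually read as word distributions P(w|z), whereas the loadings above are unnormalized. One simple way to compare them, assuming that row-normalizing the factor is an acceptable approximation for this toy example, is to divide each topic by its sum. The loadings below are copied from the printed output (three decimals), and the names are illustrative.

```python
import numpy as np

terms = ["modem", "steering", "linux", "clutch", "gear"]

# Loadings copied from the pLSA-style NMF output above, in the term order listed in `terms`
loadings = np.array([[0.000, 0.619, 0.099, 0.666, 0.667],   # Topic #0
                     [1.064, 0.000, 0.450, 0.000, 0.000]])  # Topic #1

# Normalize each topic so its loadings sum to 1 and can be read as P(word | topic)
word_given_topic = loadings / loadings.sum(axis=1, keepdims=True)

for topic_idx, row in enumerate(word_given_topic):
    print(topic_idx, dict(zip(terms, np.round(row, 3))))
```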



**TODO**: Can this be run on BOW?

# LDA

"LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics **can be interpreted as probability distributions over words**... Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA)." - https://radimrehurek.com/gensim/tut2.html

"Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete dataset such as text corpora. It is also a topic model that is used for discovering abstract topics from a collection of documents. The graphical model of LDA is a three-level Bayesian model" - http://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation

In [78]:
#lda = models.LdaModel(tfidf, id2word=dictionary, num_topics=2)
lda = models.LdaModel(tfidf, id2word=dictionary, num_topics=n_topics, passes=5)
lda.print_topics(-1)

[(0,
  '0.276*"clutch" + 0.275*"gear" + 0.247*"steering" + 0.111*"linux" + 0.091*"modem"'),
 (1,
  '0.410*"modem" + 0.231*"linux" + 0.129*"steering" + 0.116*"gear" + 0.115*"clutch"')]

https://gist.github.com/tmylk/b71bf7d3ec2f203bfce2

In [79]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5, learning_method='batch', learning_offset=0, n_jobs=-1)
lda.fit(tfidf_sparse.T)

print("\nTopics in LDA model:")
tf_feature_names = list(dictionary.token2id)
print_nmf_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:
Topic #0:
2.277*modem + 1.287*linux + 0.655*steering + 0.588*clutch + 0.587*gear
Topic #1:
1.725*gear + 1.722*clutch + 1.561*steering + 0.651*linux + 0.522*modem



In [80]:
lda.components_ # topic-term matrix

array([[ 2.27748806,  0.65542975,  1.28714876,  0.58779908,  0.58711402],
       [ 0.52153258,  1.56052125,  0.65067366,  1.72213196,  1.72528476]])
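
The scikit-learn values are pseudocounts rather than probabilities, which is why they exceed 1. Normalizing each row of `components_` gives a distribution over words per topic that is easier to compare with the gensim output. A sketch using the printed matrix (names are illustrative):

```python
import numpy as np

terms = ["modem", "steering", "linux", "clutch", "gear"]

# lda.components_ copied from the output above (topics x terms)
components = np.array([[2.27748806, 0.65542975, 1.28714876, 0.58779908, 0.58711402],
                       [0.52153258, 1.56052125, 0.65067366, 1.72213196, 1.72528476]])

# Normalize each topic row so it sums to 1
topic_word = components / components.sum(axis=1)[:, np.newaxis]

for topic_idx, row in enumerate(topic_word):
    print(topic_idx, dict(zip(terms, np.round(row, 3))))
```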

In [81]:
lda.transform(tfidf_sparse.T) # document-topic matrix

array([[ 0.75569981,  0.24430019],
       [ 0.74977371,  0.25022629],
       [ 0.20759303,  0.79240697],
       [ 0.19702837,  0.80297163]])

In [82]:
lda.perplexity(tfidf_sparse.T)

10.811135640598456

In [83]:
lda.score(tfidf_sparse.T)

-15.652586556243815
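
The perplexity and score above appear to be two views of the same quantity: perplexity = exp(-score / total weight), where the total weight is the sum of all entries in the input matrix (tf-idf weights here, rather than word counts). A sketch that checks this with the printed numbers (names are illustrative):

```python
import math

# Non-zero tf-idf weights per document, copied from the output of In [58]
tfidf_rows = [
    [0.936603022962913, 0.19436268823376643, 0.29154403235064963],
    [0.8624176140851579, 0.47724753317857443, 0.11931188329464361, 0.11931188329464361],
    [0.50709255283711, 0.1690308509457033, 0.6761234037828132, 0.50709255283711],
    [0.5144957554275266, 0.5144957554275266, 0.6859943405700354],
]

score = -15.652586556243815  # lda.score output from above

total_weight = sum(sum(row) for row in tfidf_rows)
perplexity = math.exp(-score / total_weight)  # per-weight log-likelihood, exponentiated

print(perplexity)
```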