# Topic Modeling using Python

Latent Semantic Indexing (LSI) is a method for discovering hidden concepts in document data. In the vector space model (VSM), documents are represented by the occurrences of the different words (terms) in them, i.e. as vectors of terms. Documents, however, are generally about topics rather than individual words. We are going to use Singular Value Decomposition (SVD) here to reveal such topics.

Here is a list of documents, each composed of different terms.

In [43]:
documents = [
    'alice likes gadgets',
    'bob likes sweets',
    'alice likes cars',
    'bob likes chocolate',
    'alice likes python',
    'bob likes java'
]

In [None]:
import numpy as np
import pandas as pd

## Create Vector Space Model

We are going to use scikit-learn to convert the documents into a vector space. We will not use TF-IDF weighting here, only raw term counts.

In [44]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1) 
vsm = vectorizer.fit_transform(documents) 
vsm = vsm.toarray() 

In [98]:
vocabulary = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
vsm_df = pd.DataFrame(vsm, columns=vocabulary)

As you can see below, we have 6 documents, 9 terms, and a matrix showing how frequently each term occurs in each document. Each term here either appears once or doesn't appear at all in a document. In real life, terms may appear more than once in a document, especially in much longer documents; think of a web page as a single document.

In [99]:
print(vsm_df)

   alice  bob  cars  chocolate  gadgets  java  likes  python  sweets
0      1    0     0          0        1     0      1       0       0
1      0    1     0          0        0     0      1       0       1
2      1    0     1          0        0     0      1       0       0
3      0    1     0          1        0     0      1       0       0
4      1    0     0          0        0     0      1       1       0
5      0    1     0          0        0     1      1       0       0
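
To make the table above less of a black box, here is a minimal pure-Python sketch of the same counting step. It is not scikit-learn's actual implementation, only an illustration that assumes terms are separated by whitespace, which holds for this toy corpus. (By default `CountVectorizer` also lowercases the text and ignores one-character tokens; neither matters here.)

```python
# Hand-rolled term counting for the toy corpus (illustrative sketch only).
documents = [
    'alice likes gadgets',
    'bob likes sweets',
    'alice likes cars',
    'bob likes chocolate',
    'alice likes python',
    'bob likes java',
]

# Vocabulary: every distinct term, sorted alphabetically.
vocabulary = sorted({term for doc in documents for term in doc.split()})

# One row per document, one column per term, each cell a raw count.
counts = [[doc.split().count(term) for term in vocabulary]
          for doc in documents]

print(vocabulary)
for row in counts:
    print(row)
```

Each printed row should match the corresponding row of the DataFrame above.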


## Singular Value Decomposition (SVD)

We are using the Singular Value Decomposition in NumPy. The input matrix (our VSM) is decomposed into 3 matrices: U, s and V. Strictly speaking, the third of the three is the conjugate transpose of V rather than V itself; NumPy returns it already transposed, and we simply call it V here. We get s (small letter) as a vector of singular values, but we will convert it into a diagonal matrix S (capital letter) in a moment.

In [100]:
U, s, V = np.linalg.svd(vsm)

# print(U.shape, np.diag(s).shape, V.shape)
# (6, 6) (6, 6) (9, 9)
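
As a quick sanity check on what `np.linalg.svd` hands back, the sketch below inspects the three factors. The count matrix is typed in by hand so the block runs on its own, and the name `Vt` (instead of `V`) is my own choice to stress that the third factor is already transposed.

```python
import numpy as np

# The 6x9 term-count matrix from the table above, typed in by hand.
vsm = np.array([
    [1, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
])

U, s, Vt = np.linalg.svd(vsm)  # Vt is what this notebook calls V

print(U.shape, s.shape, Vt.shape)  # (6, 6) (6,) (9, 9)

# Singular values are returned sorted from largest to smallest.
is_sorted = bool(np.all(s[:-1] >= s[1:]))

# U and Vt are orthogonal: multiplying by the transpose gives the identity.
u_orthogonal = np.allclose(U.T @ U, np.eye(6))
v_orthogonal = np.allclose(Vt @ Vt.T, np.eye(9))

print(is_sorted, u_orthogonal, v_orthogonal)
```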

To recover the original VSM from U, s and V, you just convert s into S and take the dot product of the 3 matrices. Never mind the slight differences between the original and the restored matrices; they are floating-point rounding errors. You know, computers aren't perfect.

In [101]:
S = np.zeros((U.shape[1], V.shape[0]))
S[:6, :6] = np.diag(s)

np.dot(U,np.dot(S,V))

array([[  1.00000000e+00,   1.81298661e-16,   5.55111512e-17,
         -9.09531586e-17,   1.00000000e+00,  -8.79796769e-17,
          1.00000000e+00,   0.00000000e+00,   2.22044605e-16],
       [ -2.99954345e-16,   1.00000000e+00,   2.93606963e-16,
          4.33680869e-17,  -1.11669288e-15,  -4.33680869e-17,
          1.00000000e+00,   4.22627021e-16,   1.00000000e+00],
       [  1.00000000e+00,   4.59711739e-16,   1.00000000e+00,
          5.29645297e-17,   1.11022302e-16,   3.66488066e-16,
          1.00000000e+00,  -5.55111512e-17,   8.32667268e-17],
       [ -1.20828713e-16,   1.00000000e+00,   3.40134432e-16,
          1.00000000e+00,  -6.20500092e-16,   1.11022302e-16,
          1.00000000e+00,   4.87008878e-17,  -1.94289029e-16],
       [  1.00000000e+00,   5.24723776e-16,  -2.49800181e-16,
          4.97053740e-16,  -2.77555756e-16,   1.72199037e-16,
          1.00000000e+00,   1.00000000e+00,  -8.32667268e-17],
       [ -1.34328455e-16,   1.00000000e+00,   1.95690971e-16,
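
Rather than eyeballing residuals on the order of 1e-16, the round trip can also be checked numerically. The following is a sketch under two assumptions of mine: the count matrix is re-typed by hand so the block is self-contained, and the reduced SVD (`full_matrices=False`) is used so that `s` needs no zero-padding.

```python
import numpy as np

# The 6x9 term-count matrix from the notebook, typed in by hand.
vsm = np.array([
    [1, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
])

# Reduced SVD: U is (6, 6), s is (6,), Vt is (6, 9).
U, s, Vt = np.linalg.svd(vsm, full_matrices=False)

# With matching shapes, np.diag(s) can be used directly as S.
restored = U @ np.diag(s) @ Vt

# Largest absolute difference between the original and the restored matrix.
max_error = np.abs(restored - vsm).max()

print(np.allclose(restored, vsm))
print(max_error)
```

`np.allclose` compares within a small tolerance, which is the usual way to test floating-point results for equality.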
   

## Topics

s holds the singular values; in our case here, each one corresponds to a topic, and its size tells how important that topic is. In practice, you set all the small values of s to zero; the number of non-zero values left is your number of topics. There is no silver bullet for selecting a good number of topics. Here we are going to try 2 topics and truncate U, S, and V accordingly.

In [103]:
topics = 2

U_ = U[:,0:topics]
V_ = V[0:topics,:]
S_ = np.diag(s[0:topics])

# print(U_.shape, S_.shape, V_.shape)
# (6, topics) (topics, topics) (topics, 9)
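
There is no silver bullet for the number of topics, but one common heuristic (my addition, not something this notebook relies on) is to look at how much of the total squared singular-value mass the first k topics account for. A minimal sketch, with the count matrix typed in by hand and the name `energy` chosen for illustration:

```python
import numpy as np

# The 6x9 term-count matrix from the notebook, typed in by hand.
vsm = np.array([
    [1, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
])

# Only the singular values are needed here.
s = np.linalg.svd(vsm, compute_uv=False)

# Cumulative share of the squared singular values kept by the first k topics.
energy = np.cumsum(s ** 2) / np.sum(s ** 2)

for k, share in enumerate(energy, start=1):
    print(k, round(float(share), 3))
```

For this corpus the squared singular values are 10, 4, 1, 1, 1, 1, so the first two topics account for 14/18 of the total and every further topic adds only 1/18.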

Now we have two new matrices, one for the document-topic co-occurrences and one for the topic-term co-occurrences. Notice that topics are shown as columns below. Try altering the value of topics = 2 above and running again.

In [102]:
documents_topics = np.dot(U_,S_)
topics_terms = np.dot(S_,V_)

documents_topics_df = pd.DataFrame(documents_topics, columns=range(topics))
terms_topics_df = pd.DataFrame(topics_terms.T, columns=range(topics), index=vocabulary)

print(documents_topics_df)
print(terms_topics_df)

          0         1
0 -1.290994 -0.816497
1 -1.290994  0.816497
2 -1.290994 -0.816497
3 -1.290994  0.816497
4 -1.290994 -0.816497
5 -1.290994  0.816497
                  0             1
alice     -1.224745 -1.224745e+00
bob       -1.224745  1.224745e+00
cars      -0.408248 -4.082483e-01
chocolate -0.408248  4.082483e-01
gadgets   -0.408248 -4.082483e-01
java      -0.408248  4.082483e-01
likes     -2.449490 -5.551115e-16
python    -0.408248 -4.082483e-01
sweets    -0.408248  4.082483e-01


Topic 0 seems to be more of a general topic for the 6 documents, while topic 1 seems to differentiate between the two persons and their preferences. See how related terms are shown together when we sort terms_topics_df according to topic 1.

In [97]:
t = 1

print('Terms sorted according to topic {}\n'.format(t))
print(terms_topics_df.sort_values(by=t)[t])  # DataFrame.sort() was removed from pandas

print('\n')

print('Documents sorted according to topic {}\n'.format(t))
print(documents_topics_df.sort_values(by=t)[t])

Terms sorted according to topic 1

alice       -1.224745e+00
gadgets     -4.082483e-01
cars        -4.082483e-01
python      -4.082483e-01
likes       -5.551115e-16
chocolate    4.082483e-01
java         4.082483e-01
sweets       4.082483e-01
bob          1.224745e+00
Name: 1, dtype: float64


Documents sorted according to topic 1

0   -0.816497
2   -0.816497
4   -0.816497
3    0.816497
5    0.816497
1    0.816497
Name: 1, dtype: float64
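
The sorted output suggests a simple way to split the documents into two groups: look only at the sign of their topic 1 score. The sketch below does that with plain NumPy on a hand-typed copy of the count matrix. The sign of a singular vector is arbitrary, so which group comes out negative may differ between runs or library versions; only the split itself is meaningful.

```python
import numpy as np

# The 6x9 term-count matrix from the notebook, typed in by hand.
vsm = np.array([
    [1, 0, 0, 0, 1, 0, 1, 0, 0],   # 0: alice likes gadgets
    [0, 1, 0, 0, 0, 0, 1, 0, 1],   # 1: bob likes sweets
    [1, 0, 1, 0, 0, 0, 1, 0, 0],   # 2: alice likes cars
    [0, 1, 0, 1, 0, 0, 1, 0, 0],   # 3: bob likes chocolate
    [1, 0, 0, 0, 0, 0, 1, 1, 0],   # 4: alice likes python
    [0, 1, 0, 0, 0, 1, 1, 0, 0],   # 5: bob likes java
])

U, s, Vt = np.linalg.svd(vsm)

# Document scores on topic 1: second column of U scaled by its singular value.
topic1 = U[:, 1] * s[1]

# Split the document indices by the sign of their score.
negative_group = sorted(int(i) for i in np.where(topic1 < 0)[0])
positive_group = sorted(int(i) for i in np.where(topic1 > 0)[0])

print(negative_group, positive_group)
```

One group holds the documents about alice (0, 2, 4) and the other those about bob (1, 3, 5).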


We can try restoring our VSM from the new U, S and V. This time, I will remove very small values to make it clearer.

In [109]:
vsm_ = np.dot(U_,np.dot(S_,V_))
vsm_[np.abs(vsm_) < 0.01] = 0  # zero out values that are tiny in magnitude
pd.DataFrame(vsm_, columns=vocabulary) 

   alice  bob      cars  chocolate   gadgets      java  likes    python    sweets
0      1    0  0.333333        0.0  0.333333       0.0      1  0.333333       0.0
1      0    1       0.0   0.333333       0.0  0.333333      1       0.0  0.333333
2      1    0  0.333333        0.0  0.333333       0.0      1  0.333333       0.0
3      0    1       0.0   0.333333       0.0  0.333333      1       0.0  0.333333
4      1    0  0.333333        0.0  0.333333       0.0      1  0.333333       0.0
5      0    1       0.0   0.333333       0.0  0.333333      1       0.0  0.333333
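
How much is lost by keeping only some of the topics can be measured rather than judged by eye. The sketch below is my addition, with a hand-typed matrix and illustrative variable names. It computes the Frobenius norm of the difference between the original matrix and its rank-k restoration, and compares it with the square root of the sum of the squared discarded singular values, which is what the truncated SVD is known to produce.

```python
import numpy as np

# The 6x9 term-count matrix from the notebook, typed in by hand.
vsm = np.array([
    [1, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
])

U, s, Vt = np.linalg.svd(vsm, full_matrices=False)

errors = []
for k in range(1, len(s) + 1):
    # Rank-k restoration from the first k topics only.
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    # Frobenius norm of what the restoration misses.
    error = float(np.linalg.norm(vsm - approx))
    # The same quantity predicted from the discarded singular values.
    predicted = float(np.sqrt(np.sum(s[k:] ** 2)))
    errors.append(error)
    print(k, round(error, 3), round(predicted, 3))
```

The two numbers on each line should agree, and the error drops to (numerically) zero once all 6 topics are kept.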


See how terms like cars, python and gadgets now have weights even in documents they do not appear in. This is because they belong to the same topic, so we assume they share a common meaning.
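
One way to put a number on "share a common meaning" is to compare terms by cosine similarity, first in the raw count space and then in the 2-topic space. This is a sketch of that idea, not part of the original notebook; the helpers `cosine` and `term` are hypothetical names introduced here, and the matrix and vocabulary are typed in by hand.

```python
import numpy as np

vocabulary = ['alice', 'bob', 'cars', 'chocolate', 'gadgets',
              'java', 'likes', 'python', 'sweets']

# The 6x9 term-count matrix from the notebook, typed in by hand.
vsm = np.array([
    [1, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
])

topics = 2
U, s, Vt = np.linalg.svd(vsm)

# One row per term, one column per topic (same as terms_topics_df above).
term_vectors = (np.diag(s[:topics]) @ Vt[:topics, :]).T


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def term(name):
    """Topic-space vector of a term (hypothetical helper)."""
    return term_vectors[vocabulary.index(name)]


# Raw space: cars and python never share a document, so similarity is 0.
raw_sim = cosine(vsm[:, vocabulary.index('cars')],
                 vsm[:, vocabulary.index('python')])

# Topic space: both are alice's interests, so they point the same way.
same_topic_sim = cosine(term('cars'), term('python'))

# Topic space: cars (alice) versus chocolate (bob).
cross_topic_sim = cosine(term('cars'), term('chocolate'))

print(raw_sim, same_topic_sim, cross_topic_sim)
```

In the raw counts cars and python look unrelated, while in topic space they become as similar as two terms can be; cars and chocolate stay unrelated.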

## References

* [Latent Semantic Analysis Tutorial](http://www.engr.uvic.ca/~seng474/svd.pdf)
* [NumPy SVD](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.linalg.svd.html)