# Latent Semantic Analysis

Latent semantic analysis (LSA), also known as latent semantic indexing (LSI), is an early NLP method that I have always found interesting. It uses matrix decomposition to discover unobserved, "latent" semantic associations between words across documents. In this project, I use it to construct an index for a particular work. The idea comes from Cosma Shalizi's data analysis book, in which he discusses using LSI to construct an index for Adam Smith's "The Wealth of Nations." I was unable to find the edition he uses, so I work with a different but related book: Karl Marx's "Capital, Volume I."

This notebook is self-contained except for the text of the book itself. We use the Penguin edition, originally published in 1976 (Marx 2004), but for copyright reasons I have not uploaded it to GitHub. First, we import the modules we need and extract the text of each page:


In [1]:
import PyPDF2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
import pandas as pd
import numpy as np
import logging
logger = logging.getLogger("PyPDF2")
logger.setLevel(logging.ERROR)

capital_file = 'Marx_Capital_1.pdf'
capital = open(capital_file, 'rb')
capitalReader = PyPDF2.PdfReader(capital)

# The actual text starts on page 124 of the PDF, i.e. zero-based index 123
first_page = 123
last_page = 1083
page_text = []
for pageNum in range(first_page, last_page):
    page_text.append(capitalReader.pages[pageNum].extract_text())

We need to construct a matrix that can be used to find the latent semantics. In the spirit of the original paper, we use a pure "count" matrix that records the number of times each word appears in each document (which in this project is a page of the text). Alternatively, we could use a term frequency-inverse document frequency (tf-idf) matrix, which additionally down-weights words that appear in many documents across the corpus. We instantiate a `CountVectorizer` object that removes English stop words and then construct the count matrix $X$. We don't do any other processing, such as stemming. We will see that even with very basic processing, we can get interesting results.

In [None]:
vectorizer = CountVectorizer(stop_words='english',
                             strip_accents='ascii')
X = vectorizer.fit_transform(page_text)
features = vectorizer.get_feature_names_out()
idx_word = 'religion'
word_idx = np.where(features == idx_word)[0][0]

In this project, we are going to construct an index for the word "religion." One obvious way to do this is to look at raw word counts and peel off the top 10 pages that most frequently mention the word "religion." 

In [4]:
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
top10 = df.sort_values(by='religion', ascending=False).iloc[0:10, :]
top10_index = top10.index.values

This doesn't work in general, because it will only include pages that have the word "religion" and will not include pages that discuss religion but don't use the word (Shalizi 2016, p. 383). Instead, we will decompose the matrix $X$ to try and discover latent semantics.

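For any $m \times n$ matrix $A$ with real entries, the singular value decomposition (SVD) factors $A$ as (the same holds for complex entries with the conjugate transpose in place of the transpose):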

\begin{equation}
A = U \Sigma V^T
\end{equation}

where $U$ ($m \times m$) and $V$ ($n \times n$) are orthogonal matrices and $\Sigma$ is an $m \times n$ diagonal matrix whose nonnegative diagonal entries are the singular values of $A$. The columns of $U$ are the left singular vectors of $A$ and are eigenvectors of $AA^T$. Similarly, the columns of $V$ are the right singular vectors of $A$ and are eigenvectors of $A^TA$. These two matrices therefore give the eigendecompositions of $AA^T$ and $A^TA$. The idea behind LSI is that, once we have the SVD of $X$, we can find a lower-dimensional representation of the matrix by keeping only the $k$ largest singular values and their singular vectors. By the Eckart-Young theorem, this produces the best rank-$k$ approximation to the original matrix as measured by the Frobenius norm.
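
The properties above can be checked numerically. The following is a small sketch on a random toy matrix (the matrix, the seed, and the choice $k = 2$ are arbitrary assumptions for illustration). It verifies that the squared singular values are the eigenvalues of $A^TA$ and that the Frobenius error of the rank-$k$ truncation equals the square root of the sum of the discarded squared singular values.

```python
import numpy as np

# Toy count-like matrix (hypothetical data): 6 "documents" by 4 "words"
rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(6, 4)).astype(float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of A^T A, sorted in descending order to match s**2
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]

# Rank-k truncation: keep only the k largest singular values
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error of the truncation vs. the discarded singular values
err = np.linalg.norm(A - A_k, "fro")
expected_err = np.sqrt(np.sum(s[k:] ** 2))

print(np.allclose(s ** 2, eigvals), np.isclose(err, expected_err))
```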

We rarely have a clear idea of what $k$ should be in any practical application. As with PCA, choosing it is primarily a trial-and-error process. The original LSI paper recommends using between 50 and 100 factors (Deerwester et al. 1990, p. 7). In this case, we truncate the SVD to 75 singular vectors. The `sklearn` package provides a fast truncated SVD in the `TruncatedSVD` class.

In [5]:
ncomponents = 75
svd = TruncatedSVD(n_components=ncomponents, n_iter=10, random_state=42)
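
One common heuristic for exploring $k$, which is not the procedure used in this notebook, is to look at how much variance the first $k$ components capture. The sketch below does this on a random toy matrix standing in for the page-by-word matrix. The data and the names `X_toy` and `svd_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy count-like matrix standing in for the page-by-word matrix (hypothetical data)
rng = np.random.default_rng(0)
X_toy = rng.poisson(0.2, size=(200, 500)).astype(float)

# Fit once with the largest k under consideration
svd_toy = TruncatedSVD(n_components=100, n_iter=10, random_state=42)
svd_toy.fit(X_toy)

# Cumulative share of variance captured by the first k components
cumulative = np.cumsum(svd_toy.explained_variance_ratio_)
for k in (25, 50, 75, 100):
    print(f"k={k}: {cumulative[k - 1]:.3f}")
```

On real text the curve typically flattens as $k$ grows, which can suggest a reasonable cutoff, but the final choice still comes down to inspecting the results.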

Once the `TruncatedSVD` object is instantiated, we fit it on our $X$ matrix and reconstruct a rank-$k$ approximation $\hat{X}$; we then extract the documents most strongly associated with our word of interest. In this case, our target is the single word "religion," which has its own column in $\hat{X}$, but the method applies more generally: a query made of several words can be transformed into the reduced-dimension feature space and compared with the documents there.

In [6]:
Xhat = svd.fit_transform(X).dot(svd.components_)
word_col = Xhat[:, word_idx]
row_idx = np.argsort(word_col)[::-1]
row_idx = row_idx[:10]
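
The more general approach, transforming a query into the reduced space and comparing it with the documents there, can be sketched as follows. This is an illustration on a made-up toy corpus, not the method used for the results below: the pages, the query, the use of cosine similarity, and the names `vec`, `svd_toy`, and `doc_vecs` are all assumptions. Note that a query word that is missing from the fitted vocabulary contributes nothing, because `CountVectorizer` ignores unknown terms.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "pages" (hypothetical text): two about religion, three about the economy
toy_pages = [
    "the church and religion shape worship",
    "worship in the church follows ritual",
    "the factory pays wages for labour",
    "labour in the factory produces commodities",
    "wages and commodities circulate as money",
]
vec = CountVectorizer(stop_words="english")
X_toy = vec.fit_transform(toy_pages)

# Project the pages into a 2-dimensional latent space
svd_toy = TruncatedSVD(n_components=2, random_state=42)
doc_vecs = svd_toy.fit_transform(X_toy)

# Fold the query into the same space using the fitted vectorizer and SVD
query_vec = svd_toy.transform(vec.transform(["religion ritual"]))

# Rank pages by cosine similarity to the query, most similar first
sims = cosine_similarity(query_vec, doc_vecs).ravel()
ranking = np.argsort(sims)[::-1]
print(ranking, np.round(sims, 3))
```

In this toy corpus, the second page never uses the word "religion," yet it should rank alongside the first because the two pages share "church" and "worship."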

In the same way that principal components analysis (PCA) assumes there is underlying structure in the dataset that can be extracted, the SVD assumes there are meaningful underlying patterns in the data that are obscured by noise. As an example, let's look at what happens when we decompose the matrix. Looking at the raw counts, we see the word "religion" does not appear on the first five pages.

In [7]:
print(df.iloc[0:5, word_idx])

0    0
1    0
2    0
3    0
4    0
Name: religion, dtype: int64


Compare this to the first five pages of the approximate matrix:

In [10]:
dfhat = pd.DataFrame(Xhat, columns=vectorizer.get_feature_names_out())
print(dfhat.iloc[0:5, word_idx])

0    0.029056
1    0.027789
2    0.010388
3    0.001513
4   -0.003449
Name: religion, dtype: float64


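The entries are no longer exactly 0: the low-rank approximation assigns each of these pages a small weight for the target word, which can be positive or negative. We can read this as the decomposition having found some association between the target word and these documents, however small.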

## Results

How does the decomposition method compare to the index produced by the raw word count? We compare the top 10 pages from the reconstructed matrix $\hat{X}$ to the top 10 pages from the original count matrix.

In [11]:
print(f"The truncated SVD produces the following index: {row_idx}")
print(f"The word count method produces the following index: {top10_index}")

The truncated SVD produces the following index: [ 50  40 643 368 642  41 634 222  48  96]
The word count method produces the following index: [634  47 643 782 450  50  40 369 237 773]


We can see that these are similar, but not identical. What is the SVD method uncovering? Let's look at index 48, which appears in the top 10 for the SVD index but not in the top 10 for the word count. Does the word "religion" appear on this page at all?

In [12]:
if df.iloc[48, word_idx] == 0:
    print("The word 'religion' does not appear on the page")
else:
    print("The word 'religion' appears on the page")

The word 'religion' does not appear on the page


In [13]:
print(page_text[48])

The Commod ity 173 
:But they are founded either on the immaturity of man as an in­
, dividual, when he has not yet torn himselfloose from the umbilical 
.: cord of his natural species-connection with other men, or on direct 
· ' relations of dominanc e and servitude. They are conditioned by a 
low stage of development of the productive powers of labour and 
... .correspondingly limited relations between men within the process 
of creating and reproducing their material life, hence also limited 
reiations between man and nature. These real limitations are re­
fl�ted in the ancient worship of nature, and in other elements of 
tribal religions. The religious reflections of the real world can, in 
any case, vanish only when the practical relations of everyday life 
between man and man, and man and nature, generally present 
-themselves to him in a transparent and rational form. The veil is 
not removed from the countenanc e of the social life-process, i.e . 
. the process of material prod

We clearly see words associated with religion: "religions," "religious," and "worship." Of course, appropriate word stemming would have let the word count method pick up this page by reducing the plural "religions" to "religion." But the point is that the SVD discovered this page without seeing the literal word "religion."

The latent semantics the method uncovers are sometimes mysterious and not always "correct." For example, the method ranked index 41 as related to religion, but nothing on the page is obviously about religion.

In [14]:
if df.iloc[41, word_idx] == 0:
    print("The word 'religion' does not appear on the page")
else:
    print("The word 'religion' appears on the page")

The word 'religion' does not appear on the page


In [15]:
print(page_text[41])

166 Commodities and Money 
the producers, therefore, the social relations between their private 
labours appear as what they are, i.e. they do not appear as direct 
social relations between persons in their work, but rather as 
material [dinglich] relations between persons and social relations 
between things. 
· It is only by being exchanged that the products of labour 
acquire a socially uniform objectivity as values, which is distinct 
from their sensuously varied objectivity as articles of utility. 
This division of the product of labour into a useful thing and a 
thing possessing value appears in practice only when exchange has 
already acquired a sufficient extension and importance to allow 
useful things to be produced for the purpose of being exchanged, 
so that their character as values has already to be taken into 
consideration during production. From this moment on, the 
labour of the individual producer acquires a twofold social 
character. On the one hand, it must, as a d

In a real project, we would probably need to do more preprocessing because the OCR in this PDF is so poor. Nevertheless, LSA is a simple method that can uncover interesting results.
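
As a rough idea of what such preprocessing might look like, the sketch below applies a few conservative regex fixes to one page of extracted text. The function `clean_page` and the specific rules are assumptions based on the artifacts visible in the pages printed above (soft hyphens at line breaks and stray punctuation at the start of lines); they were not applied to produce the results in this notebook.

```python
import re

def clean_page(text: str) -> str:
    """Apply a few conservative fixes to one page of extracted text."""
    # Strip stray punctuation that the OCR left at the start of a line
    text = re.sub(r"(?m)^[\s.:,;·'-]+", "", text)
    # Rejoin words split across lines by a soft hyphen (U+00AD)
    text = re.sub("\u00ad\\s*\\n\\s*", "", text)
    # Drop any soft hyphens that remain inside a line
    text = text.replace("\u00ad", "")
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()

# Sample modelled on the extracted text of page index 48
sample = (
    "immaturity of man as an in\u00ad\n"
    ", dividual, when he has not\n"
    ".: cord of his natural"
)
print(clean_page(sample))
```

Rules like these are easy to overfit to one page, so it would be worth spot-checking several pages before running them over the whole text.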

## References

- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
- Marx, K. (2004). Capital: Volume I (Vol. 1). Penguin UK.
- Shalizi, C. R. (2016). Advanced data analysis from an elementary point of view. URL http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.