# Latent Semantic Analysis

Before we begin, let's define some basic vocabulary. 

**Natural language processing** refers to a family of techniques used to derive meaning from text data.

A **document** is a collection of words and represents one instance, or "row", of our dataset.

A **body**, more commonly called a *corpus*, is a collection of documents and is our entire dataset.

A **dictionary** is the set of all words that appear in at least one document in our body.

A **topic** is a collection of words that tend to co-occur in the same documents.

The word **latent** means hidden. In this context, we are referring to features that are "hidden" in the data, meaning that they cannot be measured directly. These latent features are essential to the data, but are not the original features of the dataset.

**Latent Semantic Analysis (LSA)**:

- is a natural language processing technique
- is an unsupervised learning technique
- creates representations of the documents in a body based on the topics inherent to that body
- reduces the dimensionality of a text-based dataset
- consists of two steps:
   - creating a document-term matrix
   - reducing dimensionality via a singular value decomposition

## Document-Term Matrix

The basic idea behind a document-term matrix is that documents can be represented as points in Euclidean space, also known as **vectors**.

Here is an example of a document-term matrix.

![](https://www.evernote.com/l/AAE9rZErr9BCcLX-wE6dpPbqNTsxKNmxH3UB/image.png)

Here, each document is a short phrase describing a canine and defines one row of our matrix. The dictionary defines the columns of our matrix. Each entry counts how many times the column's word appears in the row's document.

#### Documents as Vectors

According to this Document-Term matrix,

$$\text{"the quick brown fox"} = (1,0,1,0,1,0,0,1,0)$$
$$\text{"the slow brown dog"}  = (1,1,0,0,0,0,1,1,0)$$
$$\text{"the quick red dog"}   = (0,1,0,0,1,1,0,1,0)$$
$$\text{"the lazy yellow fox"} = (0,0,1,1,0,0,0,1,1)$$
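
To make the mapping from text to vector concrete, here is a minimal pure-Python sketch that rebuilds these vectors by hand. It assumes whitespace tokenization and an alphabetically sorted dictionary, which is consistent with the column order used above. The helper name `to_vector` is made up for illustration.

```python
# Toy body from the example above.
body = [
    "the quick brown fox",
    "the slow brown dog",
    "the quick red dog",
    "the lazy yellow fox",
]

# Dictionary: every distinct word in the body, sorted alphabetically
# (assumed column order).
dictionary = sorted({word for document in body for word in document.split()})


def to_vector(document, dictionary):
    """Count how often each dictionary word occurs in the document."""
    words = document.split()
    return [words.count(term) for term in dictionary]


# One row per document, one column per dictionary word.
document_term_matrix = [to_vector(document, dictionary) for document in body]

print(dictionary)
for document, vector in zip(body, document_term_matrix):
    print(f"{document!r:24} -> {vector}")
```

Each position in a vector corresponds to one dictionary word, so two documents that share words have nonzero entries in the same positions.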

## Singular Value Decomposition

The Singular Value Decomposition (SVD):

- is closely related to Principal Component Analysis (PCA)
- reduces the dimension of the original data
- transforms the data so that it is encoded using latent, or hidden, variables
- yields latent variables that, for LSA, represent topics
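
As a rough illustration of these points, the sketch below applies NumPy's SVD to the toy document-term matrix and keeps only the two largest singular values. The matrix `X` is copied from the vectors above; `k`, `doc_topics`, and `topic_words` are illustrative names, not part of LSA itself.

```python
import numpy as np

# Document-term matrix from the example above (rows: documents, columns: words).
X = np.array([
    [1, 0, 1, 0, 1, 0, 0, 1, 0],  # the quick brown fox
    [1, 1, 0, 0, 0, 0, 1, 1, 0],  # the slow brown dog
    [0, 1, 0, 0, 1, 1, 0, 1, 0],  # the quick red dog
    [0, 0, 1, 1, 0, 0, 0, 1, 1],  # the lazy yellow fox
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent variables (topics) to keep
doc_topics = U[:, :k] * s[:k]  # each document encoded with k latent variables
topic_words = Vt[:k, :]        # each latent variable expressed over the 9 words

# Rank-k approximation of X and the error introduced by truncation.
X_approx = doc_topics @ topic_words
print("shape before:", X.shape, "shape after:", doc_topics.shape)
print("reconstruction error:", np.linalg.norm(X - X_approx))
```

The sign of each singular vector is arbitrary, so a column of `doc_topics` may come out negated relative to another SVD implementation without changing its meaning.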

## Implementation in Scikit-Learn

We will first demonstrate a trivial implementation using the Python library, [Scikit-Learn](https://scikit-learn.org/stable/).

![](https://www.evernote.com/l/AAGiYGcKcIxIaJ7sCg97K9JDtUO2dY9mywoB/image.png)

### Raw Text Data

<img src="https://www.evernote.com/l/AAFfAyDQQ1xGPLTIxT2hcUSLrHuQDbYzsuYB/image.png" width=600px>

Here each line of text is a **document** and the collection of all lines of text is the **body**.

One advantage of working in Databricks is that the Databricks Runtime for ML contains many popular machine learning libraries, including Scikit-Learn, TensorFlow, and XGBoost. We will use the Databricks Runtime for ML to implement our Latent Semantic Analysis in Scikit-Learn.

In [0]:
body = [
    "the quick brown fox",
    "the slow brown dog",
    "the quick red dog",
    "the lazy yellow fox"
]

### Document-Term Matrix

<img src="https://www.evernote.com/l/AAFtjaKOjT5CYr5N_NPHKU6vpBWNnBgbWLIB/image.png" width=600px>

The Document-Term Matrix can be created using the `CountVectorizer` model [[doc]](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) in Scikit-Learn.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(body)

This process has converted each **document** into a vector. The resulting matrix contains one row vector for each **document** in the **body**.

$$\text{"the quick brown fox"} = (1,0,1,0,1,0,0,1,0)$$
$$\text{"the slow brown dog"}  = (1,1,0,0,0,0,1,1,0)$$
$$\text{"the quick red dog"}   = (0,1,0,0,1,1,0,1,0)$$
$$\text{"the lazy yellow fox"} = (0,0,1,1,0,0,0,1,1)$$

In [0]:
bag_of_words.todense()

### Singular Value Decomposition

<img src="https://www.evernote.com/l/AAEhTiOBufhPwKBx-Hgufx4XZ5XyfsCp8cMB/image.png" width=600px>

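The decomposition can be performed using the `TruncatedSVD` model.

It is named "truncated" SVD because it computes only the `n_components` largest singular values and their singular vectors rather than the full decomposition. The result has fewer features than the input while retaining as much of the original information as that number of components allows, which makes it well suited to reducing the dimension of data.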

In [0]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)

In [0]:
lsa
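
To get a sense of how much information survives for a given `n_components`, a fitted `TruncatedSVD` exposes the `singular_values_` and `explained_variance_ratio_` attributes. The sketch below repeats the previous steps so that it runs on its own; `random_state=0` is an added assumption to make the randomized solver repeatable.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

body = [
    "the quick brown fox",
    "the slow brown dog",
    "the quick red dog",
    "the lazy yellow fox",
]
bag_of_words = CountVectorizer().fit_transform(body)

# random_state is an assumption added here for repeatable results.
svd = TruncatedSVD(n_components=2, random_state=0)
lsa = svd.fit_transform(bag_of_words)

# Larger singular values correspond to more dominant latent variables.
print("singular values:", svd.singular_values_)
# Share of the variance in the original data captured by each component.
print("explained variance ratio:", svd.explained_variance_ratio_)
print("total variance retained:", svd.explained_variance_ratio_.sum())
```

Rerunning with a different `n_components` and comparing the retained total is one simple way to choose the number of topics.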

### Topic Encoded Data

<img src="https://www.evernote.com/l/AAGhSgfs1nZHAIYfbnmNaHU8YjMV2i9fTmgB/image.png" width=600px>

The process transforms the original data into **topic-encoded data**.

Here, each row is indexed by its original text value. The data now consists of two columns, one for each of the two topics used to encode the **body**. Recall that this value of 2 was passed as the `n_components` argument to `TruncatedSVD` in the previous step.

In [0]:
import pandas as pd

topic_encoded_df = pd.DataFrame(lsa, columns = ["topic_1", "topic_2"])
topic_encoded_df["body"] = body
display(topic_encoded_df[["body", "topic_1", "topic_2"]])

| body | topic_1 | topic_2 |
|---|---|---|
| the quick brown fox | 1.694904931186462 | 0.2995240544049724 |
| the slow brown dog | 1.515851114202598 | -0.7691103672363853 |
| the quick red dog | 1.5158511142025994 | -0.769110367236388 |
| the lazy yellow fox | 1.266186062866739 | 1.440585132717669 |


## Byproducts of the Latent Semantic Analysis

The LSA generates a few byproducts that are useful for analysis:

- the **dictionary** or the set of all words that appear at least once in the **body**
- the **encoding matrix** used to encode the documents into topics. The **encoding matrix** can be studied to gain an understanding of the **topics** that are latent to the **body**.

#### The Dictionary

The dictionary is stored on a fitted `CountVectorizer` model and can be accessed using the `.get_feature_names_out()` method. Older scikit-learn releases used `.get_feature_names()`, which was removed in version 1.2.

In [0]:
dictionary = vectorizer.get_feature_names_out()
dictionary

#### The Encoding Matrix

The **encoding matrix** consists of the `components_` stored as an attribute of a fitted `TruncatedSVD`. It has one row per topic and one column per dictionary word. We can examine this matrix to gain an understanding of the **topics** latent to the **body**.

**Note:** in `sklearn`, attributes of a model that are generated by a fitting process have a trailing underscore in their name as can be seen here with `svd.components_`.

In [0]:
encoding_matrix = pd.DataFrame(svd.components_,
                               index=['topic_1', 'topic_2'],
                               columns=dictionary).T
encoding_matrix

| word | topic_1 | topic_2 |
|---|---|---|
| brown | 0.353937 | -0.140256 |
| dog | 0.334199 | -0.459436 |
| fox | 0.326416 | 0.519736 |
| lazy | 0.139578 | 0.430274 |
| quick | 0.353937 | -0.140256 |
| red | 0.1671 | -0.229718 |
| slow | 0.1671 | -0.229718 |
| the | 0.660615 | 0.0603 |
| yellow | 0.139578 | 0.430274 |
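
One way to see why this is called an encoding matrix: multiplying the document-term matrix by the transpose of `components_` gives the topic-encoded data. The sketch below checks this numerically. It re-creates the fitted models so that it is self-contained, and `random_state=0` is an added assumption for repeatability.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

body = [
    "the quick brown fox",
    "the slow brown dog",
    "the quick red dog",
    "the lazy yellow fox",
]
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(body)
svd = TruncatedSVD(n_components=2, random_state=0).fit(bag_of_words)

# components_ has one row per topic and one column per dictionary word.
print(svd.components_.shape)  # (2, 9)

# Encoding a document: dot product of its word counts with each topic's weights.
manual_encoding = bag_of_words.toarray() @ svd.components_.T
sklearn_encoding = svd.transform(bag_of_words)
print(np.allclose(manual_encoding, sklearn_encoding))
```

The same product can be used to encode a new document, provided it is first vectorized with the already fitted `CountVectorizer`.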


#### Interpret The Encoding Matrix

What are the top words for each topic? What dimensions in word-space explain most of the variance in the data? 

To analyze this, we need to look at the *absolute value* of each word's weight in the topic, since a large negative weight influences the encoding as strongly as a large positive one.

In [0]:
import numpy as np

encoding_matrix['abs_topic_1'] = np.abs(encoding_matrix['topic_1'])
encoding_matrix['abs_topic_2'] = np.abs(encoding_matrix['topic_2'])
encoding_matrix.sort_values('abs_topic_1', ascending=False)

| word | topic_1 | topic_2 | abs_topic_1 | abs_topic_2 |
|---|---|---|---|---|
| the | 0.660615 | 0.0603 | 0.660615 | 0.0603 |
| quick | 0.353937 | -0.140256 | 0.353937 | 0.140256 |
| brown | 0.353937 | -0.140256 | 0.353937 | 0.140256 |
| dog | 0.334199 | -0.459436 | 0.334199 | 0.459436 |
| fox | 0.326416 | 0.519736 | 0.326416 | 0.519736 |
| red | 0.1671 | -0.229718 | 0.1671 | 0.229718 |
| slow | 0.1671 | -0.229718 | 0.1671 | 0.229718 |
| lazy | 0.139578 | 0.430274 | 0.139578 | 0.430274 |
| yellow | 0.139578 | 0.430274 | 0.139578 | 0.430274 |


In [0]:
encoding_matrix.sort_values('abs_topic_2', ascending=False)

| word | topic_1 | topic_2 | abs_topic_1 | abs_topic_2 |
|---|---|---|---|---|
| fox | 0.326416 | 0.519736 | 0.326416 | 0.519736 |
| dog | 0.334199 | -0.459436 | 0.334199 | 0.459436 |
| lazy | 0.139578 | 0.430274 | 0.139578 | 0.430274 |
| yellow | 0.139578 | 0.430274 | 0.139578 | 0.430274 |
| red | 0.1671 | -0.229718 | 0.1671 | 0.229718 |
| slow | 0.1671 | -0.229718 | 0.1671 | 0.229718 |
| quick | 0.353937 | -0.140256 | 0.353937 | 0.140256 |
| brown | 0.353937 | -0.140256 | 0.353937 | 0.140256 |
| the | 0.660615 | 0.0603 | 0.660615 | 0.0603 |
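
The two sorted tables can also be condensed into a short list of top words per topic. The following sketch hard-codes the weights printed above so that it runs without a fitted model; the helper `top_words` and its `n` parameter are hypothetical names introduced for illustration.

```python
# Word weights copied from the encoding matrix above: word -> (topic_1, topic_2).
encoding = {
    "brown": (0.353937, -0.140256),
    "dog": (0.334199, -0.459436),
    "fox": (0.326416, 0.519736),
    "lazy": (0.139578, 0.430274),
    "quick": (0.353937, -0.140256),
    "red": (0.1671, -0.229718),
    "slow": (0.1671, -0.229718),
    "the": (0.660615, 0.0603),
    "yellow": (0.139578, 0.430274),
}


def top_words(encoding, topic_index, n=3):
    """Return the n words with the largest absolute weight in one topic."""
    ranked = sorted(
        encoding,
        key=lambda word: abs(encoding[word][topic_index]),
        reverse=True,
    )
    # Keep the signed weight so the direction of each word stays visible.
    return [(word, encoding[word][topic_index]) for word in ranked[:n]]


for topic_index, name in enumerate(["topic_1", "topic_2"]):
    print(name, top_words(encoding, topic_index))
```

Keeping the signed weight next to each word preserves information that the absolute value hides: in `topic_2`, the words `fox`, `lazy`, and `yellow` carry positive weights, while `dog`, `red`, and `slow` carry negative ones.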
