# LSA 

## What you will learn in this course 🧐🧐

LSA is short for *Latent Semantic Analysis*. It is a method mostly used for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text (Landauer and Dumais, 1997).

This led to one main use case of this algorithm 👉 *Topic Modeling*. In this course, you will learn: 

* What is Topic Modeling 
* The difference between SVD and Truncated SVD 

## Topic Modeling 💃💃 

Here is a proper definition from <a href="https://en.wikipedia.org/wiki/Topic_model" target="_blank">Wikipedia</a>:

*In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents*

The whole idea is to create groups, which we will call *topics*, and to find which of them each document in a corpus belongs to.

There are **a lot** of use cases for Topic Modeling. Among them are: 

* **News:** Labeling the topics of articles, i.e. reading a headline and being able to automatically tell whether the article is about *Politics*, *Gaming*, *Sport*, etc. 📰

* **Customer Support:** Imagine your customer has a problem and you need to know what the topic is so that you can automatically redirect them to the right person. Topic modeling is great for this! 💁🏼

* **Meeting Summary:** Let's say you recorded a meeting. You can then produce a summary of it using LSA 💼

## TruncatedSVD ✂️✂️

### SVD Reminder 💡

Let's consider the matrix X:

$$
\begin{bmatrix}
x_{1,1} & \ldots & x_{1,j} & \ldots & x_{1,n}\\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i,1} & \ldots & x_{i,j} & \ldots & x_{i,n}\\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{m,1} & \ldots & x_{m,j} & \ldots & x_{m,n}
\end{bmatrix}
$$

Let's denote:
* $t_i^T$ the vector $[x_{i,1}, \ldots, x_{i,j}, \ldots, x_{i,n}]$ representing the weight of word $i$ in each of the $n$ documents.
* $d_j$ the vector

$$
\begin{bmatrix}
x_{1,j}\\
\vdots\\
x_{i,j}\\
\vdots\\
x_{m,j} 
\end{bmatrix}
$$

that describes the weight of each of the $m$ terms in document $j$.

Therefore, the dot product $t_i^T t_p$ measures the correlation between terms $i$ and $p$ across the corpus of documents. Similarly, $d_q^T d_j$ measures the correlation between documents $q$ and $j$.
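As a minimal sketch of these dot products, the snippet below builds a small term-document matrix with made-up weights (3 terms, 4 documents; the values are purely illustrative) and computes all term-term and document-document dot products at once with `X @ X.T` and `X.T @ X`.

```python
import numpy as np

# Toy term-document matrix with hypothetical weights:
# 3 terms (rows) x 4 documents (columns)
X = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [3.0, 0.0, 0.0, 1.0],
])

# Row i of X is t_i^T (term i across documents),
# column j of X is d_j (document j across terms).
term_similarity = X @ X.T   # entry (i, p) is the dot product of terms i and p
doc_similarity = X.T @ X    # entry (q, j) is the dot product of documents q and j

print(term_similarity)
print(doc_similarity)
```

Each entry of these two matrices is exactly one of the dot products described above, which is why $XX^T$ and $X^TX$ show up again in the decomposition.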

SVD states the existence of two orthogonal matrices $U$ and $V$ and a diagonal matrix $\Sigma$ such that:

$$X = U\Sigma V^T$$
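To make the decomposition concrete, here is a small sketch using `numpy.linalg.svd` on a random matrix (the shape and the seed are arbitrary choices for illustration). It checks that multiplying the three factors back together recovers $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 3))   # hypothetical matrix: 5 terms x 3 documents

# full_matrices=False returns the "thin" SVD:
# U is 5x3, s holds the 3 singular values, Vt is V^T with shape 3x3
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rebuild X from its three factors
X_rebuilt = U @ np.diag(s) @ Vt

print(s)                          # singular values, sorted from largest to smallest
print(np.allclose(X, X_rebuilt))
```

Note that NumPy returns $V^T$ directly (here named `Vt`) and the singular values as a 1-D array rather than a diagonal matrix.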

An orthogonal matrix is a matrix whose column vectors are orthogonal to one another and have a norm equal to 1. Let's consider an orthogonal matrix $O$:

$$
\begin{bmatrix}
\begin{bmatrix}  \\ O_1 \\ \\ \end{bmatrix} & \ldots & \begin{bmatrix} \\ O_i \\ \\ \end{bmatrix} & \ldots & \begin{bmatrix} \\ O_l \\ \\ \end{bmatrix}\\
\end{bmatrix}
$$

where $O_i$ is the column vector $i$ of $O$. The following equalities hold because $O$ is orthogonal:

$$
O_i^T O_j = 0 \; \forall \; i\neq j
$$

$$
O_i^TO_i = 1 \; \forall i
$$

Which means that:

$$
O^TO=I_l
$$

where $I_l$ is the identity matrix of size $l$, the number of columns of $O$.
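This property is easy to check numerically. The sketch below builds an orthogonal matrix from the QR decomposition of a random matrix (one convenient way to obtain one, chosen here only for illustration) and verifies that $O^TO$ is the identity.

```python
import numpy as np

rng = np.random.default_rng(42)

# The Q factor of a QR decomposition has orthonormal columns
Q, _ = np.linalg.qr(rng.random((4, 4)))

# Q^T Q collects every dot product between pairs of columns of Q
gram = Q.T @ Q

print(np.round(gram, 6))
print(np.allclose(gram, np.eye(4)))
```

The off-diagonal entries are the dot products between different columns (all 0) and the diagonal entries are the squared norms of each column (all 1).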

This major property of the singular value decomposition helps us write the following:

$$X^TX = (U\Sigma V^T)^T U\Sigma V^T = V\Sigma U^TU\Sigma V^T = V\Sigma^2V^T$$

$$XX^T = U\Sigma V^T(U\Sigma V^T)^T = U\Sigma V^TV\Sigma U^T = U\Sigma^2U^T$$

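Since $\Sigma^2$ is diagonal, the columns of $V$ are <a href="https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors" target="_blank">eigenvectors</a> of $X^TX$ and the columns of $U$ are eigenvectors of $XX^T$. In both cases the corresponding eigenvalues are the diagonal entries of $\Sigma^2$, i.e. the squared singular values.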
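The link between the SVD and the eigendecomposition can be verified on a small example. This is only a sketch on a random matrix (shape and seed are arbitrary): it compares the eigenvalues of $X^TX$ with the squared singular values, and checks that each right singular vector satisfies the eigenvector equation.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((6, 4))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T   # columns of V are the right singular vectors

# eigh is designed for symmetric matrices and returns eigenvalues in ascending order
eigvals, _ = np.linalg.eigh(X.T @ X)
eigvals_desc = eigvals[::-1]

# Eigenvalues of X^T X are the squared singular values
print(np.allclose(eigvals_desc, s**2))

# Eigenvector equation for all columns at once: (X^T X) V = V Sigma^2
print(np.allclose(X.T @ X @ V, V * s**2))
```

The eigenvectors themselves are only compared through the eigenvector equation, because an eigenvector is defined up to its sign.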

If we rewrite slightly $U$ and $V$ in terms of their column vectors:

$$
X = \begin{bmatrix}
\begin{bmatrix}  \\ U_1 \\ \\ \end{bmatrix} & \ldots & \begin{bmatrix} \\ U_i \\ \\ \end{bmatrix} & \ldots & \begin{bmatrix} \\ U_l \\ \\ \end{bmatrix} \\
\end{bmatrix}
\begin{bmatrix}
\sigma_1 & \ldots & 0 \\
\vdots & \ddots & \vdots \\
0 & \ldots & \sigma_l
\end{bmatrix}
\begin{bmatrix}
\begin{bmatrix} & V_1 & & \end{bmatrix} \\ \vdots \\ \begin{bmatrix} & V_j & & \end{bmatrix} \\ \vdots \\ \begin{bmatrix} & V_l & & \end{bmatrix} \\
\end{bmatrix}
$$

where the $\sigma_i$ are called the singular values, the $U_i$ are the left singular vectors and the $V_j$ are the right singular vectors. Notice that the only part of $U$ that contributes to $t_i^T$ is its $i^{th}$ row; let's call this row vector $\hat{t}_i^T$. Similarly, the only part of $V^T$ that contributes to $d_j$ is its $j^{th}$ column; let's call this column vector $\hat{d}_j$.

### Problem with sparse matrices 🐚

<a href="https://machinelearningmastery.com/sparse-matrices-for-machine-learning/" target="_blank">Sparse matrices</a> are matrices in which most entries are `0`. This may not seem like a big deal, but it is for computers: if such a matrix is stored as a regular dense array, every zero still takes up memory and goes through every calculation. 

Therefore, the bigger your matrix, the slower your computations will be. 

```python
# Import time to measure how long each computation takes
import time

# Import the sparse module from SciPy to create sparse matrices
from scipy import sparse

# Create a random 100x100 sparse matrix (the density defaults to 0.01)
sparse_matrix = sparse.random(100, 100)

# Time the dot product on the sparse representation
start_time = time.time()
sparse_matrix.dot(sparse_matrix)
print("Dot product on sparse matrix took {:.5f} seconds".format(time.time() - start_time))

# Convert the same data to a dense NumPy array
dense_matrix = sparse_matrix.toarray()

# Time the dot product on the dense representation
start_time = time.time()
dense_matrix.dot(dense_matrix)
print("Dot product on dense matrix took {:.5f} seconds".format(time.time() - start_time))
```

```
Dot product on sparse matrix took 0.00403 seconds
Dot product on dense matrix took 0.03961 seconds
```


As you can see, for two 100x100 matrices holding the same values, the dot product took almost 10x longer on the dense array than on the sparse representation. 😮 
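Speed is not the only difference: memory usage differs too. The sketch below compares the storage needed by a CSR sparse matrix and by its dense equivalent. The size, density and seed are arbitrary choices for illustration, and the byte count for the sparse matrix simply adds up the three arrays that the CSR format stores.

```python
from scipy import sparse

# Hypothetical 1000x1000 matrix where only 1% of the entries are non-zero
sparse_matrix = sparse.random(1000, 1000, density=0.01, format="csr", random_state=0)
dense_matrix = sparse_matrix.toarray()

# CSR stores the non-zero values, their column indices and one row pointer array
sparse_bytes = (
    sparse_matrix.data.nbytes
    + sparse_matrix.indices.nbytes
    + sparse_matrix.indptr.nbytes
)
# The dense array stores every entry, zeros included
dense_bytes = dense_matrix.nbytes

print("Sparse storage: {} bytes".format(sparse_bytes))
print("Dense storage:  {} bytes".format(dense_bytes))
```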

The problem is that when we perform SVD, we do a lot of dot products... 👉👉 You guessed the problem. 

Indeed, when dealing with NLP or recommendation systems, for example, you will often end up with sparse matrices. So how can we solve the problem?

!(https://vimeo.com/486595658)

### TruncatedSVD algorithm

TruncatedSVD (AKA LSA) is well suited to sparse matrices. It does not compute the full SVD: 

$$A = U \Sigma V^\intercal$$

where $A$ is an $n \times m$ matrix, but only a part of it.

In the full decomposition, the matrices look like this: 

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/SVD_no_legend.png" alt="SVD"/>


Pay particular attention to $\Sigma$, which holds $r$ singular values. The idea behind TruncatedSVD is that we can approximate $A$ by keeping only the $k$ largest singular values (where $k < r$), along with the corresponding columns of $U$ and $V$. 
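The truncation itself can be sketched with plain NumPy: compute the full SVD, keep the first $k$ singular values and vectors, and rebuild an approximation of $A$. The matrix, the seed and the value of `k` below are arbitrary and only serve the illustration. In the Frobenius norm, the approximation error depends only on the singular values that were dropped, which the last lines check.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 6))

# Full (thin) SVD; singular values come sorted from largest to smallest
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and the matching singular vectors
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius norm of the approximation error
error = np.linalg.norm(A - A_k)
# Square root of the sum of the squared dropped singular values
dropped = np.sqrt(np.sum(s[k:] ** 2))

print(error, dropped)
```

The two printed numbers agree, which shows that the smaller the discarded singular values are, the less information is lost by truncating.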

Another particularity, noted in the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html?highlight=truncated%20svd#sklearn.decomposition.TruncatedSVD" target="_blank">`sklearn`</a> documentation, is that TruncatedSVD does not center the data. This lets it work efficiently on sparse matrices, since centering would turn most of the zeros into non-zero values. 
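Putting the pieces together, here is a minimal sketch of LSA with `sklearn`. The four-sentence corpus is made up for illustration, and the choice of TF-IDF weights and of 2 components is an assumption, not a recommendation. Note that `sklearn` puts documents in rows and terms in columns, i.e. the transpose of the term-document matrix $X$ used earlier.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: two sport sentences and two politics sentences
corpus = [
    "the team won the football match",
    "the election results surprised the government",
    "the striker scored a late goal in the match",
    "parliament voted on the new government budget",
]

# TF-IDF returns a SciPy sparse matrix of shape (n_documents, n_terms)
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(corpus)

# TruncatedSVD accepts the sparse matrix directly and keeps 2 components ("topics")
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)   # dense array of shape (n_documents, 2)

print(X.shape, doc_topics.shape)
print(doc_topics)
```

Each row of `doc_topics` gives the coordinates of one document in the reduced 2-dimensional space, and `svd.components_` holds the weight of every term in each component.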

!(https://vimeo.com/486596019)

## Resources 📚📚

* <a href="https://en.wikipedia.org/wiki/Topic_model" target="_blank">Topic Model</a>

* <a href="https://medium.com/@fatmafatma/industrial-applications-of-topic-model-100e48a15ce4" target="_blank">Industrial Applications of Topic Model</a>

* <a href="https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors" target="_blank">Eigenvectors</a> 

* <a href="https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/" target="_blank">Sparse Matrices with scipy</a>

* <a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html?highlight=truncated%20svd#sklearn.decomposition.TruncatedSVD" target="_blank">TruncatedSVD</a>