Singular Value Decomposition (SVD) is a matrix factorization from linear algebra that is widely used in natural language processing (NLP) and information retrieval. In **text mining** and **document analysis**, SVD is used to reduce the dimensionality of large document-term matrices, uncover latent structure (such as topics), and reveal relationships between terms and documents.

Here's a more detailed explanation of the SVD components in the context of text mining:

### Document-Term Matrix (D)

A **Document-Term Matrix (DTM)** represents the occurrence of terms (words) in a set of documents. In this matrix:
- Rows correspond to individual documents.
- Columns correspond to individual terms (words) that appear in the corpus.
- Each entry \( D_{ij} \) represents the frequency (or other weight such as TF-IDF) of term \( j \) in document \( i \).

For example:

| Document/Term | apple | banana | fruit | programming | python |
|----------------|-------|--------|-------|-------------|--------|
| Doc 1          | 1     | 2      | 1     | 0           | 0      |
| Doc 2          | 0     | 0      | 0     | 1           | 1      |
| Doc 3          | 0     | 0      | 0     | 0           | 0      |

- Here, **Doc 1** contains "apple" once, "banana" twice, and "fruit" once.
- **Doc 2** contains "programming" and "python" once each.
- **Doc 3** contains none of these terms, so its row is all zeros.

### Singular Value Decomposition (SVD)

SVD is a factorization technique that decomposes a matrix into three components. For a document-term matrix **D**, SVD decomposes it as:

\[
D = U S V^T
\]

Where:
1. **U (Left Singular Vector Matrix)**: This matrix contains the left singular vectors (eigenvectors of \( D D^T \)) as its columns. Its rows correspond to documents in the corpus, and each column represents a **latent concept** or **topic**. Entry \( U_{ik} \) indicates how strongly document \( i \) is associated with topic \( k \).
   
2. **S (Singular Value Matrix)**: This diagonal matrix contains the singular values, sorted in decreasing order, which represent the importance or strength of each topic. Larger singular values indicate more significant topics, while smaller values correspond to weaker topics. The squared singular values measure how much of the matrix's total squared magnitude (its squared Frobenius norm) each topic captures; this equals variance only if the columns of **D** have been mean-centered.

3. **V^T (Right Singular Vector Matrix)**: This matrix contains the right singular vectors (eigenvectors of \( D^T D \)) as its rows, and its columns correspond to terms in the corpus. Each row in **V^T** corresponds to a **topic**, and the values in each row indicate the **contribution of each term to that topic**. Terms with large weights in the same row tend to co-occur across documents, which is why a row is often interpreted as a latent topic in the data.
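
To make the shapes concrete, here is a small sketch that applies NumPy's `np.linalg.svd` to the example matrix from the table above. The variable names (`U`, `s`, `Vt`, `D_rebuilt`) are illustrative choices, not part of the original text.

```python
import numpy as np

# Document-term matrix from the example table (rows: Doc 1-3, columns: terms).
D = np.array([
    [1, 2, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
], dtype=float)

# full_matrices=False returns the "thin" SVD: U is (docs x r) and
# Vt is (r x terms), with r = min(number of docs, number of terms).
U, s, Vt = np.linalg.svd(D, full_matrices=False)

print("U shape:", U.shape)    # documents x latent concepts
print("s:", s)                # singular values, largest first
print("Vt shape:", Vt.shape)  # latent concepts x terms

# Multiplying the factors back together recovers D (up to rounding error).
D_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(D, D_rebuilt))
```

Note that NumPy returns the singular values as a 1-D array rather than a diagonal matrix, so `np.diag(s)` is needed to form **S** explicitly.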

### Intuition Behind the Decomposition

1. **Left Singular Vectors (U)**: 
   - The matrix **U** can be thought of as representing how each document is related to the underlying topics. Each row in **U** corresponds to a document, and the values in that row indicate the strength or importance of each latent topic for that document.
   
   - For example, a document about "machine learning" would likely have high values in the columns corresponding to topics about technology or programming in **U**.

2. **Singular Values (S)**:
   - The diagonal values in **S** represent the significance or "strength" of each topic. By convention the singular values are sorted in decreasing order, so the first corresponds to the most significant topic, the second to the second most important topic, and so on. These singular values allow us to rank topics by their importance.

3. **Right Singular Vectors (V^T)**:
   - The rows of **V^T** represent **topics** in the text, with each row corresponding to a particular topic. The values in the row represent how much each term contributes to that topic.
   
   - For instance, a topic about "fruit" might have high weights for words like "apple", "banana", and "fruit" in the corresponding row of **V^T**.
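
The intuition above can be checked on the example matrix. The sketch below lists the highest-weighted terms in each row of **V^T**; it ranks by absolute value because singular vectors are only defined up to sign. The names `terms`, `n_topics`, and `top_terms_per_topic` are illustrative.

```python
import numpy as np

terms = ["apple", "banana", "fruit", "programming", "python"]
D = np.array([
    [1, 2, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(D, full_matrices=False)

n_topics = 2  # the example matrix has two nonzero singular values
top_terms_per_topic = []
for k in range(n_topics):
    # Sort term indices by decreasing absolute weight in topic k.
    order = np.argsort(-np.abs(Vt[k]))
    # Keep only terms whose weight is not numerically zero.
    top = [terms[j] for j in order if abs(Vt[k, j]) > 1e-9]
    top_terms_per_topic.append(top)
    print(f"topic {k} (singular value {s[k]:.2f}): {top}")

# Document-topic strengths: rows of U scaled by the singular values.
doc_topic = np.abs(U[:, :n_topics] * s[:n_topics])
print(np.round(doc_topic, 3))
```

In this toy case the first topic groups the fruit-related terms and the second groups the programming-related terms, while the all-zero Doc 3 has no weight on either topic.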

### Document Similarity (Gram Matrix \( D D^T \))

- The matrix **\( D D^T \)** is called the **Gram Matrix** or **Document Similarity Matrix**, and it represents the pairwise similarities between documents. The (i, j)-th entry is the dot product of rows **i** and **j** of the document-term matrix, which is an unnormalized measure of how similar document **i** is to document **j**.
  
- By performing eigendecomposition on **\( D D^T \)**, we obtain the left singular vectors (columns of **U**) as its eigenvectors, with eigenvalues equal to the squared singular values. These eigenvectors capture the principal directions of the document similarities, which helps us understand the underlying structure in the document corpus.
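
The link between the Gram matrix and the SVD can be verified numerically. The following sketch, again using the example matrix, compares the eigenvalues of \( D D^T \) with the squared singular values of \( D \); `np.linalg.eigh` is used because the Gram matrix is symmetric.

```python
import numpy as np

D = np.array([
    [1, 2, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
], dtype=float)

G = D @ D.T  # Gram matrix: G[i, j] is the dot product of documents i and j

# eigh returns eigenvalues in ascending order; flip to descending
# so they line up with the singular values.
eigenvalues, eigenvectors = np.linalg.eigh(G)
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

U, s, Vt = np.linalg.svd(D, full_matrices=False)

print(G)
print(eigenvalues)  # eigenvalues of D D^T
print(s ** 2)       # squared singular values of D

# Eigenvectors agree with the columns of U up to sign.
print(np.allclose(np.abs(eigenvectors), np.abs(U)))
```

The diagonal of `G` holds each document's squared length, and the off-diagonal zeros reflect that Doc 1 and Doc 2 share no terms.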

### Applications of SVD in Text Mining:

1. **Latent Semantic Analysis (LSA)**:
   - **LSA** is a technique that uses SVD to uncover the latent structure of a collection of text. It reduces the dimensionality of the document-term matrix, revealing hidden topics or concepts in the text.
   - By keeping only the largest singular values (top-k topics), LSA can efficiently represent documents and terms in a lower-dimensional space while preserving the most important information.

2. **Topic Modeling**:
   - The right singular vectors in **V^T** can help uncover the topics in the text corpus. Each row in **V^T** gives a vector of term weights that can be interpreted as a "topic"; unlike a probability distribution, these weights can be negative and do not sum to one. Topics associated with larger singular values are more significant.

3. **Document Clustering and Classification**:
   - By reducing the document-term matrix using SVD (keeping only the most important singular values and corresponding vectors), documents can be projected into a lower-dimensional space. This representation can be used for clustering similar documents or classifying documents into categories based on their topic distribution.

4. **Information Retrieval and Search Engines**:
   - SVD can improve search by representing documents and queries in a lower-dimensional space. This helps identify more relevant documents that may share similar underlying topics, even if they don’t have identical keywords.

5. **Data Compression**:
   - SVD allows us to approximate the document-term matrix with fewer dimensions by keeping only the largest singular values and corresponding vectors. This is useful for compressing large text data, reducing storage space while retaining essential information.
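
As a rough illustration of the retrieval use case, the sketch below compares cosine similarity in the raw term space with cosine similarity after projecting onto the top-k right singular vectors. The three-document corpus, the query, the `cosine` helper, and the choice `k = 2` are all hypothetical and chosen for illustration; this is one common way to fold a query into an LSA space, not the only one.

```python
import numpy as np

# Hypothetical toy corpus (not the example table used elsewhere).
# Columns: apple, banana, fruit, programming, python
D = np.array([
    [1, 0, 1, 0, 0],  # Doc A: "apple fruit"
    [0, 1, 1, 0, 0],  # Doc B: "banana fruit"
    [0, 0, 0, 1, 1],  # Doc C: "programming python"
], dtype=float)

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(D, full_matrices=False)
V_k = Vt[:k].T         # terms x k

doc_vectors = D @ V_k  # documents in the k-dim space (equals U_k * s_k)

query = np.array([1, 0, 0, 0, 0], dtype=float)  # the single word "apple"
query_vector = query @ V_k                      # project the query the same way

raw_scores = [cosine(query, d) for d in D]
lsa_scores = [cosine(query_vector, d) for d in doc_vectors]
print("raw term space:", np.round(raw_scores, 3))
print("latent space:  ", np.round(lsa_scores, 3))
```

In the raw term space, Doc B scores zero against the query because it does not contain "apple". In the latent space it scores close to one, because both the query and Doc B load on the same fruit-related dimension, while Doc C remains unrelated.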

### Example

Consider again the document-term matrix introduced earlier:

| Document/Term | apple | banana | fruit | programming | python |
|----------------|-------|--------|-------|-------------|--------|
| Doc 1          | 1     | 2      | 1     | 0           | 0      |
| Doc 2          | 0     | 0      | 0     | 1           | 1      |
| Doc 3          | 0     | 0      | 0     | 0           | 0      |

Performing SVD on the Document-Term Matrix would decompose it into three matrices:

1. **U (Document-Topic Matrix)**: This tells us how strongly each document is associated with each latent topic.
2. **S (Singular Values)**: This gives the importance of each topic. For this matrix the singular values are \( \sqrt{6} \approx 2.45 \) (the fruit topic), \( \sqrt{2} \approx 1.41 \) (the programming topic), and 0.
3. **V^T (Topic-Term Matrix)**: This shows how terms contribute to each topic.

By performing dimensionality reduction (e.g., keeping only the top k singular values), we can reduce noise and focus on the most significant topics that represent the data.
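
A minimal sketch of this truncation on the example matrix, assuming we keep only the single largest singular value (`k = 1` is an arbitrary choice for illustration):

```python
import numpy as np

D = np.array([
    [1, 2, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(D, full_matrices=False)

k = 1  # keep only the largest singular value
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # rank-k approximation of D

# Share of the total squared singular values retained by the top k.
retained = np.sum(s[:k] ** 2) / np.sum(s ** 2)

# Frobenius-norm error; equals the root of the sum of the discarded s_i^2.
error = np.linalg.norm(D - D_k)

print(np.round(D_k, 3))
print("retained share:", retained)
print("approximation error:", error)
```

With `k = 1` only the fruit topic survives: Doc 1's row is reproduced and Doc 2's row is zeroed out. This shows that truncation discards weaker topics along with noise, so `k` has to be chosen with the downstream task in mind.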

---

### Summary:
- **SVD** decomposes a document-term matrix into three matrices: **U**, **S**, and **V^T**.
- **U** represents the documents in terms of latent topics (document-topic associations).
- **S** represents the importance of the topics.
- **V^T** represents the latent topics as weights over terms (topic-term associations).
- This decomposition helps uncover hidden relationships, reduce dimensionality, and find meaningful patterns in text data.
- It is widely used in applications like **Latent Semantic Analysis (LSA)**, **topic modeling**, and **information retrieval**.

By understanding SVD, we can better analyze and manipulate text data, uncover latent themes, and improve tasks like document classification and information retrieval.

### Application of Singular Value Decomposition (SVD) for Topic Modeling

Singular Value Decomposition (SVD) is a powerful technique used in topic modeling, especially in Latent Semantic Analysis (LSA), to extract hidden topics from a collection of documents. In this approach, we use **SVD** to reduce the dimensions of the **Document-Term Matrix (DTM)**, and the resulting components (U, S, V^T) help us understand the underlying topics in the text.

Let’s break down how to use SVD for topic modeling in more detail:

### Step-by-Step Process for Topic Modeling Using SVD:

1. **Create the Document-Term Matrix (DTM)**:
   The first step is to convert the collection of documents into a **Document-Term Matrix (DTM)**. This matrix represents the frequency (or the weighted frequency, like TF-IDF) of words in documents.
   
   - **Rows** represent individual documents.
   - **Columns** represent unique words in the corpus.
   - Each entry \( D_{ij} \) represents the frequency of word \( j \) in document \( i \).

2. **Apply Singular Value Decomposition (SVD)**:
   We perform **SVD** on the **DTM** to decompose it into three matrices:
   - **U (Document-Topic Matrix)**: Represents the relationship between documents and topics.
   - **S (Singular Values Matrix)**: Represents the strength or importance of each topic.
   - **V^T (Topic-Term Matrix)**: Represents the relationship between topics and words; each row is a topic and each column is a word.

3. **Identify Topics**:
   The rows of the **V^T** matrix represent the terms associated with each topic, and the values in these rows represent the significance of each word for a given topic. The **U** matrix represents how much each document relates to each topic.

4. **Interpret Topics**:
   - **Topics** are the rows of **V^T**, and each row gives the weights of the words that are strongly related to a particular topic.
   - The **U** matrix gives the document-topic relationship, meaning how much each document is associated with each of the topics.

In [2]:
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

In [None]:
class TopicModelingSVD:
    def __init__(self, documents : list, num_topics : int) -> None:
        self.documents = documents
        self.num_topics = num_topics
        self.vectorizer = CountVectorizer()  # Bag-of-words term counts
        self.dtm = self.vectorizer.fit_transform(documents)  # Document-Term Matrix
        self.svd = TruncatedSVD(n_components=num_topics)
        self.svd_matrix = self.svd.fit_transform(self.dtm)  # Matrix of documents in topic space
        self.term_topic_matrix = self.svd.components_  # Matrix of words in topic space
        