📝 **Author:** Amirhossein Heydari - 📧 **Email:** amirhosseinheydari78@gmail.com - 📍 **Linktree:** [linktr.ee/mr_pylin](https://linktr.ee/mr_pylin)

---

# Dependencies

In [None]:
import math

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import torch
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from torch import nn, optim

In [2]:
# set a seed for deterministic results
random_state = 42
torch.manual_seed(random_state)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [3]:
# set print options to increase line width
torch.set_printoptions(linewidth=200)

# One-Hot Encoding
   - To plug words into a Neural Network, we need a way to turn the **words** into **numbers**.
   - One-hot encoding is a simple representation of **categorical** data in **binary vectors**.
   - Each unique **word** (or **token**) in a vocabulary is assigned a **unique index**.
   - The vector representation consists of **all zeros** except for a **`1`** at the position of the **word's index**.

<figure style="text-align: center;">
    <img src="../../assets/images/original/we/one-hot-embedding.svg" alt="one-hot-embedding.svg" style="width: 80%;">
    <figcaption style="text-align: center;">One-Hot Embedding</figcaption>
</figure>

📜 **Properties**:
   - **Dimensionality**
      - The dimension of the vector equals the size of the vocabulary.
      - For example, if there are 10,000 words, each vector is 10,000-dimensional.
   - **Sparsity**
      - Every value in the vector except one is 0, which wastes memory and computation when the vectors are stored densely.
   - **No Semantic Relationships**
      - Vectors for words like **"cat"** and **"dog"** are as dissimilar as **"cat"** and **"table"**, even though **"cat"** and **"dog"** are **semantically related**.

📉 **Limitations**:
   - **Inefficient** for large vocabularies due to high dimensionality.
   - Encodes words **independently** without considering their **meaning** or **relationships**.
   - **Sparse** vectors can lead to **poor** performance in machine learning models, especially for **large datasets**.
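
To make the dimensionality and sparsity points concrete, the rough arithmetic below compares the storage needed for a full set of one-hot vectors against a dense lookup table. The 300-dimensional dense size and the 4-byte `float32` element size are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
# back-of-the-envelope memory comparison (all sizes are illustrative assumptions)
example_vocab_size = 10_000  # vocabulary size from the example above
dense_dim = 300              # assumed size of a dense embedding vector
bytes_per_value = 4          # assumed float32 storage

# one one-hot vector per word -> vocab_size x vocab_size values
one_hot_bytes = example_vocab_size * example_vocab_size * bytes_per_value

# one dense vector per word -> vocab_size x dense_dim values
dense_bytes = example_vocab_size * dense_dim * bytes_per_value

# share of non-zero entries in a single one-hot vector
non_zero_fraction = 1 / example_vocab_size

print(f"one-hot table : {one_hot_bytes / 1e6:.0f} MB")
print(f"dense table   : {dense_bytes / 1e6:.0f} MB")
print(f"non-zero share: {non_zero_fraction:.4%}")
```

Sparse storage formats reduce the memory cost of one-hot vectors, but they do not fix the lack of semantic information.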

### Generating One-Hot Vectors for a Vocabulary

In [4]:
# define the vocabulary
vocabulary = ["apple", "banana", "school", "date"]
vocab_size = len(vocabulary)

# create and store one-hot vectors in a dictionary
one_hot_vectors = {vocabulary[idx]: one_hot_vector for idx, one_hot_vector in enumerate(torch.eye(vocab_size))}

# display the one-hot vectors
for word, vector in one_hot_vectors.items():
    print(f"{word}: {vector.tolist()}")


apple: [1.0, 0.0, 0.0, 0.0]
banana: [0.0, 1.0, 0.0, 0.0]
school: [0.0, 0.0, 1.0, 0.0]
date: [0.0, 0.0, 0.0, 1.0]


### Using One-Hot Vectors in a Simple Context

In [None]:
# function to compute similarity (cosine similarity)
def calculate_similarity(vec1, vec2):
    return torch.cosine_similarity(vec1.unsqueeze(dim=0), vec2.unsqueeze(dim=0))

# create a 2D tensor for similarity values directly
similarity_values = torch.zeros((vocab_size, vocab_size))

for i in range(vocab_size):
    for j in range(vocab_size):
        similarity_values[i, j] = calculate_similarity(one_hot_vectors[vocabulary[i]], one_hot_vectors[vocabulary[j]])

# plot the heatmap
plt.figure()
sns.heatmap(similarity_values, annot=True, fmt=".1f", xticklabels=vocabulary, yticklabels=vocabulary, cmap="Blues")
plt.title("Word Similarity Heatmap")
plt.xlabel("Words")
plt.ylabel("Words")
plt.show()


# Frequency-Based Word Representations

## 1. Count Vectorization
   - Represents each word as a **vector** with a **dimensionality** equal to the **vocabulary size**.
   - Each entry in the vector corresponds to the **number of times** the word appears in a **document** or **corpus**.
   - For the sentence **"I love AI, AI loves me"** with a vocabulary of `{I, love, AI, loves, me}`, the count vectors are:
      - `I : [1, 0, 0, 0, 0]`
      - `AI: [0, 0, 2, 0, 0]`
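
As a minimal sketch of the example above, the snippet below counts the tokens of the sentence with `collections.Counter`. The naive tokenization (dropping the comma and splitting on whitespace) and the name `vocabulary_order` are assumptions made for illustration. It builds one sentence-level count vector, which equals the sum of the per-word vectors listed above.

```python
from collections import Counter

sentence = "I love AI, AI loves me"
vocabulary_order = ["I", "love", "AI", "loves", "me"]  # fixed column order (assumed)

# naive tokenization: drop the comma, then split on whitespace (case is kept)
tokens = sentence.replace(",", "").split()

# count how often each token occurs in the sentence
counts = Counter(tokens)

# sentence-level count vector: one entry per vocabulary word
count_vector = [counts[word] for word in vocabulary_order]

print(dict(counts))
print(count_vector)
```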

<figure style="text-align: center;">
    <img src="../../assets/images/original/we/frequency-count-vectorization.svg" alt="frequency-count-vectorization.svg" style="width: 100%;">
    <figcaption style="text-align: center;">Frequency-Based: Count Vectorization</figcaption>
</figure>

📈 **Advantages**:
   - Simple and easy to implement.

📉 **Disadvantages**:
   - Most entries are **zero** for large vocabularies.
   - Fails to capture **relationships** between words.

In [6]:
corpus = [
    "The cat sat on the mat",             # Document 1
    "The dog sat on the mat",             # Document 2
    "The cat and the dog played outside"  # Document 3
]

In [7]:
# tokenize and convert to lowercase
def preprocess(corpus: list) -> list:
    return [doc.lower().split() for doc in corpus]

# count vectorization
def count_vectorization(corpus: list) -> dict[str, list[int]]:
    docs = preprocess(corpus)
    unique_vocabs = set(word for doc in docs for word in doc)
    word_counts = {word: [doc.count(word) for doc in docs] for word in unique_vocabs}
    return word_counts

In [8]:
# pre-process the raw corpus
pp_corpus = preprocess(corpus)
len_max_word = max(len(word) for doc in pp_corpus for word in doc)  # length of the longest word

# calculate Count Vectorization for the entire corpus
word_counts = count_vectorization(corpus)

# log
print("Count Vectorization values:")
for term, counts in word_counts.items():
    print(f"  {term:{len_max_word + 1}}: {counts}")

Count Vectorization values:
  mat     : [1, 1, 0]
  the     : [2, 2, 2]
  cat     : [1, 0, 1]
  on      : [1, 1, 0]
  played  : [0, 0, 1]
  dog     : [0, 1, 1]
  outside : [0, 0, 1]
  sat     : [1, 1, 0]
  and     : [0, 0, 1]


## 2. TF-IDF (Term Frequency-Inverse Document Frequency)
   - Improves on **raw counts** by assigning **weights** to words based on their **importance** in the document relative to the entire corpus.
   - **TF (Term Frequency)**: Frequency of word $t$ in document $d$.
   - **IDF (Inverse Document Frequency)**: How rare word $t$ is across all documents; words that appear in many documents receive a lower weight.

<figure style="text-align: center;">
    <img src="../../assets/images/original/we/frequency-tf-idf.svg" alt="frequency-tf-idf.svg" style="width: 100%;">
    <figcaption style="text-align: center;">Frequency-Based: TF-IDF</figcaption>
</figure>

🔬 **Formula**:
   1. **Term Frequency (TF)**
      $$\text{TF}(t, d) = \frac{\text{Count of term t in document d}}{\text{Total terms in document d}}$$
   1. **Inverse Document Frequency (IDF)**
      $$\text{IDF}(t) = \log \left( \frac{N + 1}{\text{DF}(t) + 1} \right)$$
   1. **Term Frequency-Inverse Document Frequency (TF-IDF)**
      $$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$
where $N$ is the total number of documents, and $\text{DF}(t)$ is the number of documents containing $t$.
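
As a worked example under these formulas, take the term *cat* in Document 1 of the three-document corpus used in this section ("The cat sat on the mat", lowercased and split on whitespace). The document has 6 tokens, *cat* occurs once, and it appears in 2 of the $N = 3$ documents. The logarithm is assumed to be the natural log, as in Python's `math.log`.

$$\text{TF}(\text{cat}, d_1) = \frac{1}{6} \approx 0.1667$$

$$\text{IDF}(\text{cat}) = \log \left( \frac{3 + 1}{2 + 1} \right) = \log \frac{4}{3} \approx 0.2877$$

$$\text{TF-IDF}(\text{cat}, d_1) \approx 0.1667 \times 0.2877 \approx 0.0479$$

By the same formula, a term such as *the*, which occurs in all three documents, has $\text{IDF} = \log(4/4) = 0$, so its TF-IDF is 0 in every document.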

In [9]:
corpus = [
    "The cat sat on the mat",             # Document 1
    "The dog sat on the mat",             # Document 2
    "The cat and the dog played outside"  # Document 3
]

In [10]:
# tokenize and convert to lowercase
def preprocess(corpus: list) -> list:
    return [doc.lower().split() for doc in corpus]

# calculate Term Frequency (TF)
def term_frequency(doc: list[str], term: str) -> float:
    term_count = doc.count(term)
    return term_count / len(doc)

# calculate Inverse Document Frequency (IDF)
def inverse_document_frequency(corpus: list, term: str) -> float:
    num_docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (num_docs_with_term + 1))

# calculate TF-IDF for each term in the corpus
def calculate_tf_idf(corpus: list) -> tuple[dict[str, list[float]], dict[str, float], dict[str, torch.Tensor]]:
    terms = set(word for doc in corpus for word in doc)  # get unique terms from all documents
    tf_values = {}
    idf_values = {}
    tf_idf_values = {}
    
    # calculate TF and IDF for all terms
    for term in terms:
        tf_values[term] = [term_frequency(doc, term) for doc in corpus]
        idf_values[term] = inverse_document_frequency(corpus, term)
        tf_idf_values[term] = torch.tensor(tf_values[term]) * idf_values[term]
    
    return tf_values, idf_values, tf_idf_values

In [11]:
# pre-process the raw corpus
pp_corpus = preprocess(corpus)
len_max_word = max(len(word) for doc in pp_corpus for word in doc)  # length of the longest word

# calculate TF, IDF, and TF-IDF for the entire corpus
tf_values, idf_values, tf_idf_values = calculate_tf_idf(pp_corpus)

# log
print("Term Frequency (TF) values:")
for term, tf in tf_values.items():
    print(f"  {term:{len_max_word + 1}}: {[f'{t:.4f}' for t in tf]}")

print("\nInverse Document Frequency (IDF) values:")
for term, idf in idf_values.items():
    print(f"  {term:{len_max_word + 1}}: {idf:.4f}")

print("\nTF-IDF values:")
for term, tf_idf in tf_idf_values.items():
    print(f"  {term:{len_max_word + 1}}: {[f'{t:.4f}' for t in tf_idf.tolist()]}")


Term Frequency (TF) values:
  mat     : ['0.1667', '0.1667', '0.0000']
  the     : ['0.3333', '0.3333', '0.2857']
  cat     : ['0.1667', '0.0000', '0.1429']
  on      : ['0.1667', '0.1667', '0.0000']
  played  : ['0.0000', '0.0000', '0.1429']
  dog     : ['0.0000', '0.1667', '0.1429']
  outside : ['0.0000', '0.0000', '0.1429']
  sat     : ['0.1667', '0.1667', '0.0000']
  and     : ['0.0000', '0.0000', '0.1429']

Inverse Document Frequency (IDF) values:
  mat     : 0.2877
  the     : 0.0000
  cat     : 0.2877
  on      : 0.2877
  played  : 0.6931
  dog     : 0.2877
  outside : 0.6931
  sat     : 0.2877
  and     : 0.6931

TF-IDF values:
  mat     : ['0.0479', '0.0479', '0.0000']
  the     : ['0.0000', '0.0000', '0.0000']
  cat     : ['0.0479', '0.0000', '0.0411']
  on      : ['0.0479', '0.0479', '0.0000']
  played  : ['0.0000', '0.0000', '0.0990']
  dog     : ['0.0000', '0.0479', '0.0411']
  outside : ['0.0000', '0.0000', '0.0990']
  sat     : ['0.0479', '0.0479', '0.0000']
  and     : ['0.0000', '0.0000', '0.0990']

# Word Embedding


## 1. Random Initialization
   - It maps **words** to **dense vectors of fixed size** that start out as random values; the vectors only come to capture **semantic** meanings and **relationships** once they have been trained.
   - Unlike one-hot encoding, embeddings are **continuous-valued** and **compact representations**.
   - The vectors are updated while training on a downstream task (for example, classification), so that words with similar meanings end up closer together in the vector space.
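
One way to see the link between one-hot vectors and an embedding layer: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, which is why an embedding layer can be implemented as a lookup table. The sketch below checks this with plain Python lists. The toy vocabulary, the 3-dimensional size, and the helper names are made up for illustration; in PyTorch the table would correspond to the weight of an `nn.Embedding` layer.

```python
import random

random.seed(0)

toy_vocab = ["apple", "banana", "school", "date"]  # hypothetical vocabulary
toy_dim = 3                                        # hypothetical embedding size

# randomly initialized lookup table: one row (dense vector) per word
table = [[random.uniform(-1, 1) for _ in range(toy_dim)] for _ in toy_vocab]


def one_hot(index: int, size: int) -> list[float]:
    """Return a one-hot vector with a 1.0 at `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def vec_times_matrix(vec: list[float], matrix: list[list[float]]) -> list[float]:
    """Compute vec @ matrix for a (V,) vector and a (V x D) matrix."""
    n_cols = len(matrix[0])
    return [sum(v * row[d] for v, row in zip(vec, matrix)) for d in range(n_cols)]


idx = toy_vocab.index("school")
via_matmul = vec_times_matrix(one_hot(idx, len(toy_vocab)), table)  # one-hot @ table
via_lookup = table[idx]                                             # direct row lookup

print(via_matmul)
print(via_lookup)
```

Both prints show the same vector, so the lookup avoids ever materializing the one-hot vector.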

<figure style="text-align: center;">
    <img src="../../assets/images/original/we/word-embedding.svg" alt="word-embedding.svg" style="width: 100%;">
    <figcaption style="text-align: center;">Traditional Word Embedding using Neural Networks</figcaption>
</figure>

📜 **Properties**:
   - **Low Dimensionality**
      - Vectors typically have dimensions like **50**, **100**, or **300**, **regardless of vocabulary size**.
   - **Semantic Relationships**
      - Words with similar **meanings** or **contexts** have similar **vector representations**.
      - Example: **"king"** and **"queen"** might have vectors that are close in the **embedding space**.
   - **Learned Representations**
      - Embeddings are **learned from data**, capturing nuanced meanings based on word co-occurrences.

📈 **Advantages**:
   - **Lower** **memory** and **computational** requirements compared to one-hot encoding.
   - Captures **semantic** relationships and contextual nuances.
   - Pre-trained embeddings (like [**GloVe**](https://nlp.stanford.edu/projects/glove/) or [**FastText**](https://fasttext.cc/)) can be used across multiple tasks.

📉 **Limitations**:
   - Embeddings may not generalize well if the training data is biased or limited.
   - Pre-trained embeddings struggle with unseen words, though methods like FastText address this by considering subword information.
      - **GloVe & Word2Vec**
         - These embeddings are **static** and **tied** directly to the words in the training corpus.
         - If a word **does not appear** in the training corpus, the model **cannot** generate an **embedding** for it.
         - They rely on **co-occurrence statistics** (or local context), so a word that is absent from the training data gives the model nothing from which to build a meaningful representation.
         - This results in a failure to handle **out-of-vocabulary (OOV)** words.
      - **FastText**
         - It represents each word as a bag of character n-grams (subword units).
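         - For example, with n = 3 and the boundary markers `<` and `>` added around the word, "playing" is broken down into subword units like:
            - "<pl", "pla", "lay", "ayi", "yin", "ing", "ng>" (the boundary markers distinguish prefixes and suffixes from n-grams inside the word)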
            - This is subword-level information that captures word morphology.
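
The n-gram extraction described for FastText can be sketched in a few lines. This is a simplified illustration with a hypothetical helper `char_ngrams` and a single fixed n-gram length; the actual FastText library extracts a range of n-gram lengths and hashes them, which is not shown here.

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the character n-grams of `word` wrapped in boundary markers."""
    marked = f"<{word}>"  # '<' and '>' mark the start and end of the word
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("playing"))
```

Because an unseen word still shares n-grams with known words (for example the suffix "ing"), a vector can be composed for it from its n-gram vectors.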

In [12]:
corpus = [
    "cats like to chase mice",           # Document 1
    "dogs like to chase cats",           # Document 2
    "mice like to chase dogs",           # Document 3
    "cats and dogs are pets",            # Document 4
    "mice are small and quick",          # Document 5
    "dogs are loyal and friendly",       # Document 6
    "cats are independent and curious",  # Document 7
    "mice are often found in fields",    # Document 8
    "pets bring joy to their owners",    # Document 9
    "dogs and cats can be friends",      # Document 10
]

In [13]:
# tokenize the corpus and build a vocabulary
vocabulary = {word for sentence in corpus for word in sentence.split()}
vocab_size = len(vocabulary)
word_to_idx = {word: idx for idx, word in enumerate(vocabulary)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# hyperparameters
embedding_dim = 10  # each word will be represented by 10 numbers
context_size = 2    # number of context words to predict the next word
learning_rate = 0.001
epochs = 120

# prepare training data (pairs of context words and target word)
context_words = []
target_words = []
for sentence in corpus:
    words = sentence.split()
    for i in range(context_size, len(words)):

        # context = previous 'context_size' words
        context = words[i - context_size:i]
        target = words[i]
        
        # convert words to indices
        context_indices = [word_to_idx[word] for word in context]
        target_index = word_to_idx[target]
        
        context_words.append(context_indices)
        target_words.append(target_index)

# convert to tensors
context_tensor = torch.tensor(context_words, dtype=torch.long)
target_tensor = torch.tensor(target_words, dtype=torch.long)

In [14]:
class NextWordPredictionModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim * context_size, vocab_size)
    
    def forward(self, context):
        embedded = self.embeddings(context)  # embedding layer [lookup table]
        embedded = embedded.view(1, -1)      # flatten the embeddings
        output = self.fc(embedded)           # fully connected layer
        return output

# instantiate the model, loss function, and optimizer
model = NextWordPredictionModel(vocab_size, embedding_dim)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

In [15]:
# training loop
for epoch in range(epochs):
    total_loss = 0
    for i in range(len(context_tensor)):
        context = context_tensor[i]  # x
        target = target_tensor[i]    # y_true
        
        # zero the gradients
        optimizer.zero_grad()
        
        # forward
        output = model(context)
        
        # backward
        loss = loss_fn(output, target.unsqueeze(dim=0))
        loss.backward()
        
        # update weights
        optimizer.step()
        
        # store loss
        total_loss += loss.item()
    
    # log
    if epoch % 10 == 0 or (epoch + 1) == epochs:
        print(f"Epoch {epoch + 1:03}/{epochs} -> Loss: {total_loss}")

Epoch 001/120 -> Loss: 106.17169380187988
Epoch 011/120 -> Loss: 97.98147189617157
Epoch 021/120 -> Loss: 90.42929065227509
Epoch 031/120 -> Loss: 83.52974367141724
Epoch 041/120 -> Loss: 77.28132420778275
Epoch 051/120 -> Loss: 71.66597467660904
Epoch 061/120 -> Loss: 66.64803922176361
Epoch 071/120 -> Loss: 62.175538301467896
Epoch 081/120 -> Loss: 58.18662166595459
Epoch 091/120 -> Loss: 54.6183876991272
Epoch 101/120 -> Loss: 51.413461446762085
Epoch 111/120 -> Loss: 48.52277076244354
Epoch 120/120 -> Loss: 46.156003057956696


In [16]:
len_max_word = len(max(vocabulary, key=len))

# extract the learned word embeddings
word_embeddings = model.embeddings.weight.detach()

# display word embeddings for each word in the vocabulary
for idx, word in idx_to_word.items():
    print(f"Word: {word:{len_max_word + 1}}, Embedding: {word_embeddings[idx]}")

Word: dogs        , Embedding: tensor([ 1.9137,  1.4560,  0.8759, -2.1305,  0.7160, -1.2763, -0.0435, -1.6225, -0.7886,  1.6386])
Word: joy         , Embedding: tensor([-0.3999, -1.4019, -0.7463, -0.5884, -0.7627,  0.7427,  1.6566, -0.1580, -0.4732,  0.4605])
Word: found       , Embedding: tensor([-0.7651,  1.0983,  0.8069,  1.7188,  1.2894,  1.3102,  0.5997,  1.3287, -0.2605,  0.0381])
Word: curious     , Embedding: tensor([-0.2516,  0.8599, -1.3847, -0.8712, -0.2234,  1.7174,  0.3189, -0.4245,  0.3057, -0.7746])
Word: are         , Embedding: tensor([-1.6074,  1.0477, -0.8846, -0.6043, -1.3506,  2.2575, -1.2085, -0.4120, -1.0048, -0.7248])
Word: quick       , Embedding: tensor([ 0.0780,  0.5258, -0.4880,  1.1914, -0.8140, -0.7360, -1.4032,  0.0360, -0.0635,  0.6756])
Word: be          , Embedding: tensor([-0.1146,  1.8441, -1.1611,  1.4067,  1.4701,  0.8668,  2.2431,  0.5206,  0.3355, -0.1853])
Word: their       , Embedding: tensor([-1.0755,  1.3009, -0.1542,  0.5065,  0.0569,  0.401

In [None]:
# function to compute similarity (cosine similarity)
def calculate_similarity(vec1, vec2):
    return torch.cosine_similarity(vec1.unsqueeze(dim=0), vec2.unsqueeze(dim=0))

# create a 2D tensor for similarity values directly
similarity_values = torch.zeros((vocab_size, vocab_size))

for i in range(vocab_size):
    for j in range(vocab_size):
        similarity_values[i, j] = calculate_similarity(word_embeddings[i], word_embeddings[j])

# plot the heatmap
plt.figure(figsize=(24, 18))
sns.heatmap(similarity_values, annot=True, fmt=".1f", xticklabels=list(idx_to_word.values()), yticklabels=list(idx_to_word.values()), cmap="Blues")
plt.title("Word Similarity Heatmap")
plt.xlabel("Words")
plt.ylabel("Words")
plt.show()

In [None]:
# reduce dimensionality of word embeddings
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
reduced_embeddings = tsne.fit_transform(word_embeddings.numpy())

# plot the embeddings
plt.figure(figsize=(8, 8))
for i, word in enumerate(idx_to_word.values()):
    plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1])
    plt.annotate(word, (reduced_embeddings[i, 0], reduced_embeddings[i, 1]))

plt.title("Word Embeddings Visualization")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()

## 2. Word2Vec
   - A popular method for learning word embeddings.
   - It was introduced in the paper [**Efficient Estimation of Word Representations in Vector Space**](https://www.khoury.northeastern.edu/home/vip/teach/DMcourse/4_TF_supervised/notes_slides/1301.3781.pdf) by [*Tomas Mikolov*](https://scholar.google.com/citations?user=oBu8kMMAAAAJ&hl=en&oi=sra) et al. in 2013.

🏛️ **Two Main Architectures of Word2Vec**:
   - **Skip-gram**
      - Predicts the **context words** (surrounding words) given a **target word**.
      - Objective: Maximize the probability of context words appearing given the target word.
      <figure style="text-align: center;">
         <img src="../../assets/images/original/we/word2vec-skipgram.svg" alt="word2vec-skipgram.svg" style="width: 100%;">
         <figcaption style="text-align: center;">Word2Vec using Skip-Gram method</figcaption>
      </figure>

   - **CBOW (Continuous Bag of Words)**
      - Predicts a **target word** based on its surrounding **context words**.
      - Objective: Maximize the probability of a target word given the context.
      <figure style="text-align: center;">
         <img src="../../assets/images/original/we/word2vec-cbow.svg" alt="word2vec-cbow.svg" style="width: 100%;">
         <figcaption style="text-align: center;">Word2Vec using CBOW method</figcaption>
      </figure>
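
The difference between the two architectures shows up in how training examples are built from a sentence. The sketch below generates both kinds of pairs for one sentence from the earlier corpus; the helper names and the window size of 1 are assumptions for illustration, and no model is trained here.

```python
def skipgram_pairs(tokens: list[str], window: int) -> list[tuple[str, str]]:
    """(target, context) pairs: the target word is used to predict each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens: list[str], window: int) -> list[tuple[list[str], str]]:
    """(context, target) pairs: all neighbours together are used to predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


tokens = "cats like to chase mice".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram produces one example per (target, neighbour) combination, while CBOW produces one example per position with the whole context as input.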

📈 **Advantages**:
   - It typically yields better word representations than randomly initialized embeddings that are trained only on a small downstream task.
   - Optimized using techniques like **negative sampling** or **hierarchical softmax** to improve efficiency at training stage.
      - **Negative Sampling**
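         - Focuses on distinguishing observed (**positive**) word-context pairs from randomly sampled (**negative**) pairs (typically **around 2-20** negative pairs per positive pair), so only a few output weights are updated per step instead of the whole vocabulary.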
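
A minimal sketch of how negative pairs might be drawn for a single positive pair is given below. The helper `sample_negatives`, the toy corpus, and the value `k=5` are assumptions for illustration. Candidates are weighted by their unigram count raised to the 3/4 power, the noise distribution commonly used with Word2Vec. The sketch only builds labeled examples and does not train anything.

```python
import random
from collections import Counter


def sample_negatives(
    counts: Counter, positive_context: str, k: int, rng: random.Random
) -> list[str]:
    """Draw k negative context words, skipping the observed (positive) one."""
    candidates = [w for w in counts if w != positive_context]
    # weight each candidate by its unigram count raised to the 3/4 power
    weights = [counts[w] ** 0.75 for w in candidates]
    return rng.choices(candidates, weights=weights, k=k)


toy_corpus = ["cats like to chase mice", "dogs like to chase cats"]
counts = Counter(word for sentence in toy_corpus for word in sentence.split())

rng = random.Random(42)
negatives = sample_negatives(counts, positive_context="chase", k=5, rng=rng)

# one positive pair (label 1) plus k sampled negative pairs (label 0)
training_examples = [("cats", "chase", 1)] + [("cats", w, 0) for w in negatives]
print(training_examples)
```

The model is then trained as a binary classifier on these labels, which is far cheaper than a softmax over the full vocabulary.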

📉 **Limitations**:
   - **Static Embeddings**
      - Produces the **same** vector for a word, regardless of its **context**.
      - For example, **"bark"** has the same embedding whether referring to the **outer covering of a tree** or the **sound a dog makes**.
   - **OOV Words**
      - Struggles with **unseen** words in the test set unless extended methods like **FastText** are used.
   - These limitations are overcome by **contextual embeddings** (e.g., [**BERT**](https://github.com/google-research/bert), [**GPT-3**](https://github.com/openai/gpt-3)).

## 3. GloVe (Global Vectors for Word Representation)

## 4. FastText (Subword Information)

## 5. ELMo (Embeddings from Language Models)

## 6. BERT (Bidirectional Encoder Representations from Transformers)

# Document Embedding

## 1. Matrix Factorization (Latent Semantic Analysis - LSA)
   - **Matrix factorization** is a technique used in **Latent Semantic Analysis (LSA)** to uncover hidden structures in large text corpora.
   - It aims to capture the **semantic relationships** between words and documents by decomposing a large **term-document matrix** into a smaller **latent semantic space**.
   - It uses **singular value decomposition (SVD)** to reduce the dimensionality of the term-document matrix, revealing patterns and associations between terms and documents.

📜 **Properties**:
   - **Dimensionality Reduction**
      - The technique reduces the **high-dimensional** term-document matrix into a **lower-dimensional representation** while retaining key semantic information.
   - **Latent Semantic Structure**
      - LSA uncovers **hidden relationships** and **topics** within the data that are not immediately visible in the raw term-document matrix.
   - **Singular Value Decomposition (SVD)**
      - The matrix is factorized into three matrices, **U**, **Σ**, and **Vᵀ**, where:
         - **U**: Term matrix (words),
         - **Σ**: Singular values (importance),
         - **V**: Document matrix.
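
Written out, and assuming $X$ is the term-document matrix with $m$ terms and $n$ documents (the symbols $m$, $n$, and $k$ are introduced here for illustration), the rank-$k$ truncation keeps only the $k$ largest singular values:

$$X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad U_k \in \mathbb{R}^{m \times k}, \quad \Sigma_k \in \mathbb{R}^{k \times k}, \quad V_k \in \mathbb{R}^{n \times k}$$

Each row of $U_k \Sigma_k$ can then serve as a $k$-dimensional term vector, and each row of $V_k \Sigma_k$ as a $k$-dimensional document vector.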

📈 **Advantages**:
   - **Captures Synonymy**
      - LSA can recognize words with similar meanings even if they don't appear together frequently in the documents.
      - Example: **"car"** and **"automobile"** might be clustered together in the **latent space** despite not often appearing in the same context.
   - **Dimensionality Reduction**
      - LSA simplifies the data while preserving key semantic information, reducing both **memory usage** and **computation time**.
   - **Discover Topics**
      - It can uncover **latent topics** within the corpus, grouping similar documents and terms based on their underlying meaning.
   - **No Need for Labeling**
      - Unlike supervised learning, LSA does not require labeled data to identify these patterns.

📉 **Limitations**:
   - **Does Not Handle Polysemy Well**
      - Words with multiple meanings may be grouped together despite having different contexts. For example, **"bank"** (riverbank vs. financial institution) could be treated the same.
   - **Sparse Matrices**
      - LSA relies on large, sparse term-document matrices, which can be **computationally expensive** to create and process.
   - **Requires Sufficient Data**
      - The quality of the latent semantic space depends heavily on the amount and diversity of the training data.
   - **Interpretability Issues**
      - The topics or relationships discovered by LSA may be difficult to **interpret** due to the complexity of the latent space and the SVD transformation.


In [19]:
corpus = [
    "I love programming in Python",               # Document 1
    "Python is a great programming language",     # Document 2
    "I enjoy data science and machine learning",  # Document 3
    "Data science is amazing with Python",        # Document 4
]

In [20]:
# vectorize the corpus using TF-IDF (Term Frequency - Inverse Document Frequency)
vectorizer = TfidfVectorizer(stop_words='english')  # removing stop words in English : "the", "and", "is", "in", "at", "on", "of", "to", "a", "for", etc.
# note: scikit-learn returns a document-term matrix (rows = documents, columns = terms)
X = vectorizer.fit_transform(corpus)

# apply Latent Semantic Analysis (LSA) using Truncated SVD
# n_components is the number of latent semantic dimensions you want to reduce to
lsa = TruncatedSVD(n_components=2)
X_lsa = lsa.fit_transform(X)

# log
print(f"Original Matrix Shape (term-document matrix): {X.shape}")
print(f"Transformed Matrix Shape (after LSA): {X_lsa.shape}")
print(f"\nTransformed Matrix (LSA representation):\n{X_lsa}")

# examine the topics (terms most related to each component)
terms = vectorizer.get_feature_names_out()
print("\nTop terms for each LSA component:")
for i, topic in enumerate(lsa.components_):
    print(f"Component {i+1}:")
    terms_indices = topic.argsort()[:-4:-1]  # get the top 3 terms for this component
    for index in terms_indices:
        print(f"  {terms[index]}")
    print()

Original Matrix Shape (term-document matrix): (4, 11)
Transformed Matrix Shape (after LSA): (4, 2)

Transformed Matrix (LSA representation):
[[ 0.69609115 -0.46351513]
 [ 0.67453765 -0.48738522]
 [ 0.43283672  0.74122971]
 [ 0.66586887  0.49645966]]

Top terms for each LSA component:
Component 1:
  python
  programming
  love

Component 2:
  data
  science
  machine



In [None]:
# compute document similarity as raw (unnormalized) dot products in LSA space
similarity_matrix = np.dot(X_lsa, X_lsa.T)

# plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(similarity_matrix, annot=True, cmap='YlGnBu', fmt='.2f', linewidths=0.5, xticklabels=['Doc 1', 'Doc 2', 'Doc 3', 'Doc 4'], yticklabels=['Doc 1', 'Doc 2', 'Doc 3', 'Doc 4'])
plt.title("Document Similarity Matrix in LSA Space")
plt.show()
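
The heatmap above is built from raw dot products, which mix direction with vector length. A length-normalized alternative is cosine similarity. As a rough sketch, the snippet below recomputes it for Document 1 against the other documents, using the LSA coordinates printed earlier (copied by hand and rounded to four decimals, so the results are approximate).

```python
import math

# LSA document vectors copied from the printed output above (rounded)
doc_vectors = {
    "Doc 1": [0.6961, -0.4635],
    "Doc 2": [0.6745, -0.4874],
    "Doc 3": [0.4328, 0.7412],
    "Doc 4": [0.6659, 0.4965],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


for other in ["Doc 2", "Doc 3", "Doc 4"]:
    sim = cosine_similarity(doc_vectors["Doc 1"], doc_vectors[other])
    print(f"cos(Doc 1, {other}) = {sim:.3f}")
```

In this two-dimensional space the two programming-focused documents (1 and 2) point in almost the same direction, while Document 3 (data science and machine learning) is close to orthogonal to Document 1.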