### 1. Imports and Setup

In [None]:
import numpy as np
import pandas as pd
import gensim.downloader as api
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import word_tokenize
import nltk
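nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead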

### 2. Define a simple test corpus

In [None]:
documents = [
    "I love deep learning and natural language processing.",
    "Natural language models are fascinating.",
    "Topic modeling helps to discover themes in documents.",
    "Machine learning enables automatic topic discovery.",
    "Neural networks learn embeddings from data."
]

### 3. Bag-of-Words (BoW) Representation

The **Bag-of-Words** model is one of the simplest ways to represent text numerically. It ignores grammar and word order and focuses only on word occurrence.

##### **What is it?**
- Each document is treated as a "bag" of individual words.
- A vocabulary is built from all the unique words in the corpus.
- Each document is then represented as a vector counting how many times each word from the vocabulary appears.

This results in a **document-term matrix**:
- Each row corresponds to a document.
- Each column corresponds to a word from the vocabulary.
- Each cell contains the count of the word in that document.

Although simple, BoW has limitations:
- It does not consider word order or context.
- It can result in very high-dimensional and sparse data.
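
The first limitation is easy to see in a minimal sketch. It uses plain Python with a naive whitespace tokenizer, and the `bag_of_words` helper and the two sentences are illustrative assumptions, not part of this notebook's pipeline:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace: a deliberately naive tokenizer.
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Both sentences map to the same bag, although their meanings differ.
print(a == b)
```

Because only counts are stored, any reordering of the same words yields an identical representation.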

##### **Simple Example**
Let's say we have two short documents:

- Document 1: "I love NLP"
- Document 2: "I love machine learning"

The combined vocabulary is: `[I, love, NLP, machine, learning]`

We can represent each document as a vector of word counts:

| Document | I | love | NLP | machine | learning |
|----------|---|------|-----|---------|----------|
| Doc 1    | 1 | 1    | 1   | 0       | 0        |
| Doc 2    | 1 | 1    | 0   | 1       | 1        |

This matrix shows how many times each word appears in each document. No word order is preserved.
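
The table above can be reproduced by hand in a few lines. This is only a sketch: it splits on whitespace and preserves case so that the columns match the table, which differs from what `CountVectorizer` does by default.

```python
docs = ["I love NLP", "I love machine learning"]

# Build the vocabulary in order of first appearance.
vocab = []
for doc in docs:
    for token in doc.split():
        if token not in vocab:
            vocab.append(token)

# One row per document, one column per vocabulary word.
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```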

Despite these limitations, BoW is a foundational method and helps build intuition for more sophisticated approaches like TF-IDF and word embeddings.

##### 🛠️ **Code Example**

The code block below uses `CountVectorizer` from `sklearn` to create a Bag-of-Words (BoW) representation of our corpus:
- `CountVectorizer` transforms the documents into a matrix (documents × words). With its default settings it lowercases the text and keeps only tokens of two or more alphanumeric characters, so punctuation and one-letter words such as "I" are dropped.
- Each element in the matrix is the number of times a word appears in a document.
- `fit_transform` builds the vocabulary and generates the counts in one step.
- Finally, we convert the sparse matrix into a Pandas DataFrame for readability.

In [None]:
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(documents)
pd.DataFrame(X_bow.toarray(), columns=vectorizer.get_feature_names_out())

### 4. TF-IDF (Term Frequency–Inverse Document Frequency) Representation

TF-IDF is an improvement over the Bag-of-Words model. While BoW only counts word frequency, **TF-IDF balances frequency with uniqueness**, reducing the weight of common words that appear in many documents.

##### **What is it?**
- **Term Frequency (TF)** measures how often a word appears in a specific document.
- **Inverse Document Frequency (IDF)** downweights words that appear in many documents.
- The product **TF × IDF** gives more importance to words that are frequent in a document but rare across the corpus.

The result is a **weighted document-term matrix** that emphasizes more informative words.

TF-IDF(w, d, D) = TF(w, d) × IDF(w, D)

Where:
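- **TF(w, d)** is the term frequency of word *w* in document *d*:
  > TF(w, d) = (Number of times *w* appears in *d*) / (Total words in *d*)

  (scikit-learn's `TfidfVectorizer` uses the raw count here and, by default, L2-normalizes each document vector afterwards.)

- **IDF(w, D)** is the inverse document frequency of *w* in the full corpus *D*. The smoothed form below, with the natural logarithm, is the one scikit-learn uses by default:
  > IDF(w, D) = log[(1 + N) / (1 + DF(w))] + 1

  where:
  - *N* is the total number of documents
  - *DF(w)* is the number of documents containing the word *w*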

📌 This helps to penalize very common words (like "the", "and", "is") and give more weight to words that are specific to a document.
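
As a worked example, the two formulas can be applied directly to a toy corpus. This is a sketch of the definitions exactly as written above (natural logarithm assumed), not of scikit-learn's full pipeline. The two tokenized documents and the helper names `tf`, `idf`, and `tf_idf` are illustrative assumptions.

```python
import math

docs = [["i", "love", "nlp"], ["i", "love", "machine", "learning"]]
N = len(docs)

def tf(word, doc):
    # Relative frequency of the word within one document.
    return doc.count(word) / len(doc)

def idf(word):
    # Smoothed inverse document frequency, as in the formula above.
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + N) / (1 + df)) + 1

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

for word in ["i", "nlp"]:
    print(word, round(tf_idf(word, docs[0]), 3))
```

With this smoothed IDF, a word that occurs in every document keeps an IDF of exactly 1, so it is downweighted relative to rarer words rather than zeroed out.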

##### 📋 TF-IDF Table Example

Let’s use the same two documents:

- Document 1: "I love NLP"  
- Document 2: "I love machine learning"

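Assuming a simplified, unsmoothed IDF(w, D) = log(N / DF(w)) with the natural logarithm, and TF as defined above (the smoothed formula would instead give words that occur in every document an IDF of 1, not 0):

| Document | I    | love | NLP   | machine | learning |
|----------|------|------|-------|---------|----------|
| Doc 1    | 0.00 | 0.00 | 0.231 | 0.000   | 0.000    |
| Doc 2    | 0.00 | 0.00 | 0.000 | 0.173   | 0.173    |

🔍 **Interpretation**:
- Words that appear in every document, like `"I"` and `"love"`, get a TF-IDF score of 0 because log(2/2) = 0.
- Terms that appear in only one document, like `"NLP"`, `"machine"`, and `"learning"`, receive positive weights: (1/3) × log 2 ≈ 0.231 in Doc 1 and (1/4) × log 2 ≈ 0.173 in Doc 2.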

##### 🛠️ **Code Example**

The code block below uses `TfidfVectorizer` from `sklearn` to generate a TF-IDF matrix and display it using Pandas.

This block:
- Computes TF-IDF values for all terms in the corpus.
- Applies the smoothed IDF weighting and L2-normalizes each document vector (the `TfidfVectorizer` defaults).
- Outputs a readable DataFrame to inspect how word importance varies by document.


In [None]:
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)
pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out()).round(2)
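
By default `TfidfVectorizer` uses `norm='l2'`, which rescales each row to unit Euclidean length. That is why the values in the DataFrame all lie between 0 and 1. A minimal sketch of the normalization step, using a hypothetical row of weights and a hypothetical helper name `l2_normalize`:

```python
import math

def l2_normalize(row):
    norm = math.sqrt(sum(v * v for v in row))
    # Leave an all-zero row unchanged to avoid dividing by zero.
    return [v / norm for v in row] if norm > 0 else list(row)

row = [3.0, 4.0, 0.0]  # hypothetical unnormalized TF-IDF weights
normalized = l2_normalize(row)
print(normalized)
```

After normalization the squared entries of each row sum to 1, so documents of different lengths become directly comparable.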

### 5. GloVe (Global Vectors for Word Representation)

GloVe is a word embedding model that captures **semantic meaning** by learning from global word co-occurrence statistics. It is a log-bilinear model fit to co-occurrence counts rather than a neural network. Unlike BoW and TF-IDF, which produce sparse, vocabulary-sized document vectors, GloVe produces dense, low-dimensional word vectors in which similar words lie close together.

##### **What is it?**
- GloVe starts by building a **co-occurrence matrix**, where each cell counts how often word *j* appears in the context of word *i*.
- It then **factorizes** this matrix so that the **dot product** of word vectors approximates the **log of their co-occurrence**.
- This allows the model to capture meaningful relationships between words, including analogies like:
  > `"king" - "man" + "woman" ≈ "queen"`

##### **Formula (Simplified)**

The GloVe model learns word vectors such that:

> **w<sub>i</sub> · w<sub>j</sub> + b<sub>i</sub> + b<sub>j</sub> ≈ log(X<sub>ij</sub>)**

Where:
- *w<sub>i</sub>* and *w<sub>j</sub>* are the word vectors for word *i* and context word *j*
- *X<sub>ij</sub>* is the number of times word *j* appears in the context of word *i*
- *b<sub>i</sub>*, *b<sub>j</sub>* are bias terms
- The model minimizes the weighted squared error between both sides
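
The "weighted squared error" for a single word pair can be sketched as follows. The weighting function with `x_max = 100` and `alpha = 0.75` follows the values reported in the GloVe paper. The function names and the example vectors and biases are illustrative assumptions, not values from a trained model.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    # Damps the influence of rare pairs and caps that of very frequent ones.
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    # Weighted squared gap between the model's score and log(X_ij).
    dot = sum(a * b for a, b in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(x_ij)
    return weight(x_ij) * error ** 2

# Hypothetical 2-d vectors and biases, chosen only for illustration.
loss = pair_loss([1.0, 0.5], [2.0, 1.0], 0.1, 0.1, x_ij=15)
print(loss)
```

The full objective sums this term over all word pairs with a nonzero co-occurrence count, and training adjusts the vectors and biases to reduce it.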

##### Imagine a simplified co-occurrence matrix:

|         | ice | steam | solid | gas |
|---------|-----|--------|--------|-----|
| **ice**   |  0  |   3    |   15   |  7  |
| **steam** |  3  |   0    |   2    | 13  |


##### 📋 GloVe Table Example

| Word Pair     | Co-occurrence | log(X<sub>ij</sub>) | GloVe dot product |
|---------------|----------------|---------------------|-------------------|
| ice, solid    | 15             | ~2.71               | close to 2.71     |
| ice, gas      | 7              | ~1.95               | close to 1.95     |
| steam, solid  | 2              | ~0.69               | close to 0.69     |
| steam, gas    | 13             | ~2.56               | close to 2.56     |

- GloVe uses these co-occurrence counts (or smoothed versions) to learn word embeddings.
- It trains word vectors so that their dot product, plus the bias terms, approximates the **logarithm** of the number of times the words co-occur.

For example (ignoring the bias terms):
- `dot(ice, solid) ≈ log(15)`
- `dot(ice, gas) ≈ log(7)`
- `dot(steam, solid) ≈ log(2)`
- `dot(steam, gas) ≈ log(13)`

This training process helps the model learn **meaningful differences** between words:
- “ice” is more strongly associated with “solid” than “gas”
- “steam” is more strongly associated with “gas” than “solid”
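
The log targets in the table can be checked with a short computation over the toy co-occurrence counts (natural logarithm assumed):

```python
import math

cooccurrence = {
    "ice":   {"solid": 15, "gas": 7},
    "steam": {"solid": 2,  "gas": 13},
}

# Target value that each (word, context) dot product should approximate.
targets = {
    (word, context): math.log(count)
    for word, contexts in cooccurrence.items()
    for context, count in contexts.items()
}

for (word, context), value in targets.items():
    print(f"log X[{word}, {context}] = {value:.2f}")
```

A larger target for ("ice", "solid") than for ("ice", "gas") is what pushes the vector for "ice" closer to "solid" during training.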

✅ The result is that similar words end up with similar vectors, and **vector differences** can capture relationships and analogies.
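
To illustrate how vector differences can encode an analogy, here is a toy sketch with hand-made 2-dimensional vectors. The vectors, the axis interpretation, and the `analogy` helper are hypothetical. Real GloVe vectors are learned and have no such clean axes.

```python
import math

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    # Solve "a is to b as c is to ?" via vec(b) - vec(a) + vec(c).
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words, then pick the most similar remaining word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("man", "king", "woman"))
```

With a loaded gensim `KeyedVectors` model, `most_similar(positive=[...], negative=[...])` runs the same kind of query over the full vocabulary.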


##### 🛠️ **Code Example**

The code block below loads the `"glove-wiki-gigaword-50"` model and explores:
- The vocabulary size and the dimension of the vectors
- Examples of real embeddings (e.g., `"deep"` and `"learning"`)
- Arithmetic on vectors to reveal patterns (e.g., plural forms, analogies)

These embeddings can be used as input features for tasks like clustering, topic modeling, or classification.


In [None]:
# Load pre-trained word embeddings (GloVe)
print("Loading GloVe word embeddings...")
glove_vectors = api.load("glove-wiki-gigaword-50")

In [None]:
print("Number of word vectors in the model:", len(glove_vectors))
print("Dimension of each word vector:", glove_vectors.vector_size)

In [None]:
# Display the first 10 components of the embeddings for 'deep' and 'learning'
print("Embedding for 'deep': ", glove_vectors['deep'][:10], "...")
print("Embedding for 'learning': ", glove_vectors['learning'][:10], "...")

In [None]:
# 6. Define Helper to Retrieve Embeddings
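def get_embedding(word):
    """Return the GloVe vector for `word`, or a zero vector if it is out of vocabulary."""
    if word in glove_vectors:
        return glove_vectors[word]
    # Match the model's dimensionality instead of hard-coding 50.
    return np.zeros(glove_vectors.vector_size)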


In [17]:
# 7. Extract Word Embeddings from Corpus
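# Lowercase, tokenize, and deduplicate; sorting makes the row order reproducible.
# Note: punctuation tokens such as "." are kept and counted as "words" here.
unique_words = sorted(set(word_tokenize(" ".join(documents).lower())))
word_embeddings = np.array([get_embedding(word) for word in unique_words])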

In [16]:
rows, cols = word_embeddings.shape
print("Number of words:", rows)
print("Embedding dimensions:", cols)

Number of words: 30
Embedding dimensions: 50
