# NLP Vectorization Pipeline
This Jupyter notebook demonstrates four fundamental techniques for converting text into numerical features suitable for neural network models:
1. **One‑Hot Encoding**
2. **Bag‑of‑Words (BoW)**
3. **Word Embeddings** (`nn.Embedding`)
4. **Embedding Bags** (`nn.EmbeddingBag`)

Each technique is applied to the same miniature corpus so you can observe and compare their representations.

In [None]:
# -----------------------------
# 1. Imports and Sample Corpus
# -----------------------------
from collections import Counter  # For counting word frequencies if needed
import torch
import torch.nn as nn

# Reproducibility: fix the seed so that randomly initialised weights
# (e.g. the nn.Embedding table) are the same on every fresh run
torch.manual_seed(42)

# Sample corpus (three documents)
corpus = [
    "I like cats",
    "I hate dogs",
    "I'm impartial to hippos"
]

print("Corpus:", corpus)

Corpus: ['I like cats', 'I hate dogs', "I'm impartial to hippos"]


In [None]:
# -------------------------------------
# 2. Tokenisation and Vocabulary Build
# -------------------------------------
def tokenize(text):
    # Lowercase the text, strip apostrophes (so "I'm" becomes "im"),
    # then split on whitespace.
    return text.lower().replace("'", "").split()

# Tokenise corpus
tokenised_docs = [tokenize(doc) for doc in corpus]
print("Tokenised Documents:", tokenised_docs)

# Build vocabulary (word -> index), starting at 0
all_tokens = [tok for doc in tokenised_docs for tok in doc]
vocab = {tok: idx for idx, tok in enumerate(sorted(set(all_tokens)))}
print("Vocabulary:", vocab)

Tokenised Documents: [['i', 'like', 'cats'], ['i', 'hate', 'dogs'], ['im', 'impartial', 'to', 'hippos']]
Vocabulary: {'cats': 0, 'dogs': 1, 'hate': 2, 'hippos': 3, 'i': 4, 'im': 5, 'impartial': 6, 'like': 7, 'to': 8}
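
The vocabulary above only covers words seen in the corpus, so looking up any other word raises a `KeyError`. One common workaround, sketched below, is to reserve an index for unknown words. The `<unk>` token, the `vocab_unk` mapping, and the `encode` helper are hypothetical names introduced for illustration; they are not part of this notebook's pipeline, and the shifted indices differ from the vocabulary used in later cells.

```python
# Self-contained sketch: encode text with a fallback index for unseen words.
corpus = ["I like cats", "I hate dogs", "I'm impartial to hippos"]

def tokenize(text):
    return text.lower().replace("'", "").split()

tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})

# Assumption: index 0 is reserved for a hypothetical "<unk>" token,
# so real words start at 1 (the notebook's own vocabulary starts at 0).
UNK = "<unk>"
vocab_unk = {UNK: 0}
vocab_unk.update({tok: idx for idx, tok in enumerate(tokens, start=1)})

def encode(text):
    # Map each token to its index, falling back to the <unk> index.
    return [vocab_unk.get(tok, vocab_unk[UNK]) for tok in tokenize(text)]

encoded = encode("I like hippos and cats")
print(encoded)  # "and" is not in the corpus, so it maps to 0
```

Any downstream layer would then need `len(vocab_unk)` rows rather than `len(vocab)`.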


In [8]:
# ----------------------------------
# 3. One‑Hot Encoding (per word)
# ----------------------------------
import numpy as np

vocab_size = len(vocab)

def one_hot(word):
    vec = np.zeros(vocab_size, dtype=int)  # all-zero vector of length vocab_size
    vec[vocab[word]] = 1                   # mark the word's vocabulary index
    return vec

# Example: one‑hot vectors for first document
doc0_vectors = [one_hot(tok) for tok in tokenised_docs[0]]
print("One‑Hot vectors for doc0:\n", np.array(doc0_vectors))

One‑Hot vectors for doc0:
 [[0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 1 0]
 [1 0 0 0 0 0 0 0 0]]
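
For comparison, the same one-hot matrix can be produced directly in PyTorch with `torch.nn.functional.one_hot`. This is a minimal sketch that hard-codes the vocabulary printed earlier so it runs on its own; the variable names `indices` and `one_hot_doc0` are illustrative.

```python
import torch
import torch.nn.functional as F

# Vocabulary copied from the output of the tokenisation cell.
vocab = {'cats': 0, 'dogs': 1, 'hate': 2, 'hippos': 3, 'i': 4,
         'im': 5, 'impartial': 6, 'like': 7, 'to': 8}
doc0 = ['i', 'like', 'cats']

# Convert tokens to indices, then expand each index into a one-hot row.
indices = torch.tensor([vocab[tok] for tok in doc0], dtype=torch.long)
one_hot_doc0 = F.one_hot(indices, num_classes=len(vocab))

print(one_hot_doc0)
```

Passing `num_classes` explicitly matters here: without it, the width is inferred from the largest index in the input, which would be 8 columns rather than 9 for this document.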


In [12]:
# ------------------------------------------------------
# 4. Bag‑of‑Words (sum of one‑hot vectors per document)
# ------------------------------------------------------
def bow_vector(doc_tokens):
    # Bag-of-Words vector for a non-empty list of tokens:
    # build a one-hot vector per token, then sum them along axis=0.
    # The result has length vocab_size, and each position holds the
    # number of times the corresponding word occurs in the document.
    return np.sum([one_hot(tok) for tok in doc_tokens], axis=0)

bow_representations = [bow_vector(doc) for doc in tokenised_docs]  # Compute BoW vector for each document in the corpus
print("Bag‑of‑Words representations:")
for i, vec in enumerate(bow_representations):
    print(f"Doc {i}: {vec}")

Bag‑of‑Words representations:
Doc 0: [1 0 0 0 1 0 0 1 0]
Doc 1: [0 1 1 0 1 0 0 0 0]
Doc 2: [0 0 0 1 0 1 1 0 1]
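
Summing one-hot vectors allocates one full-length vector per token. An alternative sketch, assuming the same vocabulary and tokenised documents as above, counts tokens first with `collections.Counter` and fills a single vector. The helper name `bow_from_counts` is hypothetical.

```python
from collections import Counter
import numpy as np

# Vocabulary and documents copied from the earlier cell outputs.
vocab = {'cats': 0, 'dogs': 1, 'hate': 2, 'hippos': 3, 'i': 4,
         'im': 5, 'impartial': 6, 'like': 7, 'to': 8}
tokenised_docs = [['i', 'like', 'cats'],
                  ['i', 'hate', 'dogs'],
                  ['im', 'impartial', 'to', 'hippos']]

def bow_from_counts(doc_tokens):
    # Count each token once, then write the counts into one vector.
    vec = np.zeros(len(vocab), dtype=int)
    for tok, n in Counter(doc_tokens).items():
        vec[vocab[tok]] = n
    return vec

bow_matrix = np.stack([bow_from_counts(doc) for doc in tokenised_docs])
print(bow_matrix)

# A repeated word shows up as a count greater than 1.
repeated = bow_from_counts(['i', 'like', 'cats', 'cats'])
print(repeated)
```

Unlike the summation approach, this version also returns a correctly shaped all-zero vector for an empty document.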


In [11]:
# --------------------------------------------
# 5. Word Embeddings using nn.Embedding
# --------------------------------------------
embedding_dim = 8  # small dimension for demonstration
embedding_layer = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Convert token lists to index tensors
index_docs = [torch.tensor([vocab[tok] for tok in doc], dtype=torch.long) for doc in tokenised_docs]

# Retrieve embeddings for first document
embeddings_doc0 = embedding_layer(index_docs[0])
print("Embeddings for first document (shape = [tokens, embedding_dim]):\n", embeddings_doc0)

Embeddings for first document (shape = [tokens, embedding_dim]):
 tensor([[ 0.5230,  0.9717, -0.2779, -0.6116, -0.5572, -0.9683,  0.8713, -0.0956],
        [ 0.8854,  0.1824,  0.7864, -0.0579,  0.5667, -0.7098, -0.4875,  0.0501],
        [ 0.5635,  1.8582,  1.0441, -0.8638,  0.8351, -0.3157,  0.2691,  0.0854]],
       grad_fn=<EmbeddingBackward0>)
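
An embedding lookup is mathematically the same as multiplying a one-hot matrix by the embedding weight matrix; `nn.Embedding` just skips the multiplication and indexes rows directly. The sketch below checks this on a freshly initialised layer. It uses its own seed, so the numbers will not match the output above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed for this standalone sketch
vocab_size, embedding_dim = 9, 8
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Indices of 'i', 'like', 'cats' in the notebook's vocabulary.
indices = torch.tensor([4, 7, 0])

with torch.no_grad():
    looked_up = embedding_layer(indices)
    # One-hot rows select the matching rows of the weight matrix.
    one_hot_rows = F.one_hot(indices, num_classes=vocab_size).float()
    via_matmul = one_hot_rows @ embedding_layer.weight

print(torch.allclose(looked_up, via_matmul))
```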


In [7]:
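# --------------------------------------------------
# 6. Embedding Bag (average of embeddings per doc)
# --------------------------------------------------
# Note: this layer has its own randomly initialised weight matrix, separate
# from embedding_layer, so its outputs are not the means of the section 5 embeddings.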
embedding_bag = nn.EmbeddingBag(num_embeddings=vocab_size,
                                embedding_dim=embedding_dim,
                                mode='mean')

# Flatten all indices into one 1‑D tensor
flat_indices = torch.cat(index_docs)
# Offsets: starting indices of each document in flat_indices
offsets = torch.tensor([0] + [len(d) for d in index_docs[:-1]]).cumsum(dim=0)

print("Flat indices:", flat_indices)
print("Offsets:", offsets)

# Compute embedding bag representations
bag_outputs = embedding_bag(flat_indices, offsets)
print("EmbeddingBag outputs (shape = [docs, embedding_dim]):\n", bag_outputs)

Flat indices: tensor([4, 7, 0, 4, 2, 1, 5, 6, 8, 3])
Offsets: tensor([0, 3, 6])
EmbeddingBag outputs (shape = [docs, embedding_dim]):
 tensor([[-0.0471, -0.3956,  0.1453,  0.0946,  0.4031, -0.5658, -0.3008, -0.2628],
        [ 0.0516,  0.3793,  0.8084, -0.1968,  0.6478, -0.1254, -0.4170,  0.2644],
        [-0.8807,  0.4806,  0.3092,  0.2491,  0.0087,  0.1930,  1.0166,  0.9075]],
       grad_fn=<EmbeddingBagBackward0>)
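
To see that `mode='mean'` really averages per-token embeddings, the two layers need to share weights. The sketch below builds an `nn.EmbeddingBag` from an existing embedding table via `from_pretrained` and compares it with a manual mean. The names `shared_bag`, `bag_means`, and `manual_means` are illustrative, and the seed is arbitrary, so the values differ from the outputs above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for this standalone sketch
vocab_size, embedding_dim = 9, 8
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Assumption: reuse the embedding table so both layers hold the same weights.
shared_bag = nn.EmbeddingBag.from_pretrained(
    embedding_layer.weight.detach(), freeze=True, mode='mean')

# Index tensors for the three documents, as in the notebook.
index_docs = [torch.tensor([4, 7, 0]),
              torch.tensor([4, 2, 1]),
              torch.tensor([5, 6, 8, 3])]
flat_indices = torch.cat(index_docs)
offsets = torch.tensor([0, 3, 6])

with torch.no_grad():
    bag_means = shared_bag(flat_indices, offsets)
    # Manual equivalent: look up each document and average over its tokens.
    manual_means = torch.stack(
        [embedding_layer(doc).mean(dim=0) for doc in index_docs])

print(torch.allclose(bag_means, manual_means, atol=1e-6))
```

The bag version avoids materialising the per-token embeddings and handles documents of different lengths without padding.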


## Conclusion
This notebook illustrated how the same corpus is represented under four vectorisation schemes. You can extend it by training a classifier on top of any of these representations, or by experimenting with different embedding dimensions and pooling modes (`nn.EmbeddingBag` supports `'sum'`, `'mean'`, and `'max'`).

---
**Next Steps (Suggested Exercises)**
1. Add a simple fully‑connected classifier on top of the Bag‑of‑Words vectors.
2. Train the embedding + linear layers end‑to‑end on a larger labelled dataset.
3. Compare model performance and runtime across the four representations.
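
As a possible starting point for the first exercise, here is a minimal sketch of a single fully-connected layer trained on the Bag-of-Words vectors. The labels are hypothetical (the corpus in this notebook is unlabelled), and the learning rate and step count are arbitrary choices, so treat it as scaffolding rather than a meaningful model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for this standalone sketch

# Bag-of-Words vectors copied from the notebook output.
bow = torch.tensor([[1, 0, 0, 0, 1, 0, 0, 1, 0],
                    [0, 1, 1, 0, 1, 0, 0, 0, 0],
                    [0, 0, 0, 1, 0, 1, 1, 0, 1]], dtype=torch.float32)

# HYPOTHETICAL labels for illustration only: 0 = negative, 1 = neutral, 2 = positive.
labels = torch.tensor([2, 0, 1])

model = nn.Linear(in_features=9, out_features=3)  # one fully-connected layer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(bow), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

predictions = model(bow).argmax(dim=1)
print("First loss:", losses[0], "Last loss:", losses[-1])
print("Predictions:", predictions)
```

With only three training examples the model simply memorises them; a larger labelled dataset and a held-out split are needed before any accuracy figure means anything.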