# NLP Vectorization Pipeline
This Jupyter notebook demonstrates four fundamental techniques for converting text into numerical features suitable for neural network models:
1. **One‑Hot Encoding**
2. **Bag‑of‑Words (BoW)**
3. **Word Embeddings** (`nn.Embedding`)
4. **Embedding Bags** (`nn.EmbeddingBag`)

Each technique is applied to the same miniature corpus so you can observe and compare their representations.

In [None]:
# -----------------------------
# 1. Imports and Sample Corpus
# -----------------------------
from collections import Counter
import torch
import torch.nn as nn

# Reproducibility
torch.manual_seed(42)

# Sample corpus (three documents)
corpus = [
    "I like cats",
    "I hate dogs",
    "I'm impartial to hippos"
]

print("Corpus:", corpus)

In [None]:
# -------------------------------------
# 2. Tokenisation and Vocabulary Build
# -------------------------------------
def tokenize(text):
    # Lowercase, drop apostrophes (so "I'm" becomes "im"), then split on whitespace
    return text.lower().replace("'", "").split()

# Tokenise corpus
tokenised_docs = [tokenize(doc) for doc in corpus]
print("Tokenised Documents:", tokenised_docs)

# Build vocabulary (word -> index); indices start at 0 and follow sorted token order
all_tokens = [tok for doc in tokenised_docs for tok in doc]
vocab = {tok: idx for idx, tok in enumerate(sorted(set(all_tokens)))}
print("Vocabulary:", vocab)

In [None]:
# ----------------------------------
# 3. One‑Hot Encoding (per word)
# ----------------------------------
import numpy as np

vocab_size = len(vocab)

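def one_hot(word):
    # Vector of zeros with a single 1 at the word's vocabulary index
    # (raises KeyError for words that are not in the vocabulary)
    vec = np.zeros(vocab_size, dtype=int)
    vec[vocab[word]] = 1
    return vec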

# Example: one‑hot vectors for the first document
doc0_vectors = [one_hot(tok) for tok in tokenised_docs[0]]
print("One‑Hot vectors for doc0:\n", np.array(doc0_vectors))
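
The same one-hot rows can also be produced directly in PyTorch. The following is a minimal sketch, assuming the vocabulary built above; `demo_docs`, `demo_vocab`, and `doc0_one_hot` are hypothetical names introduced here so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for the notebook's tokenised corpus and vocabulary
demo_docs = [["i", "like", "cats"],
             ["i", "hate", "dogs"],
             ["im", "impartial", "to", "hippos"]]
demo_vocab = {tok: idx for idx, tok in
              enumerate(sorted({tok for doc in demo_docs for tok in doc}))}

# Map the first document to vocabulary indices, then expand to one-hot rows
doc0_indices = torch.tensor([demo_vocab[tok] for tok in demo_docs[0]])
doc0_one_hot = F.one_hot(doc0_indices, num_classes=len(demo_vocab))

print(doc0_one_hot.shape)  # torch.Size([3, 9])
print(doc0_one_hot)
```

Passing `num_classes` explicitly keeps the vector width equal to the vocabulary size even when a document does not contain the highest index.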

In [None]:
# ------------------------------------------------------
# 4. Bag‑of‑Words (sum of one‑hot vectors per document)
# ------------------------------------------------------
def bow_vector(doc_tokens):
    # Summing the one‑hot vectors yields per-word counts for the document
    return np.sum([one_hot(tok) for tok in doc_tokens], axis=0)

bow_representations = [bow_vector(doc) for doc in tokenised_docs]
print("Bag‑of‑Words representations:")
for i, vec in enumerate(bow_representations):
    print(f"Doc {i}: {vec}")
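
Summing one-hot vectors is equivalent to counting how often each vocabulary word occurs, so the same result can be sketched with `collections.Counter`. The helper `bow_counts` and the `demo_*` names are hypothetical and exist only to make the snippet self-contained.

```python
from collections import Counter

# Hypothetical stand-in for the notebook's tokenised corpus and vocabulary
demo_docs = [["i", "like", "cats"],
             ["i", "hate", "dogs"],
             ["im", "impartial", "to", "hippos"]]
demo_vocab = {tok: idx for idx, tok in
              enumerate(sorted({tok for doc in demo_docs for tok in doc}))}

def bow_counts(doc_tokens, vocab):
    # Count each token once, then write the counts into the vocabulary slots
    vec = [0] * len(vocab)
    for tok, count in Counter(doc_tokens).items():
        vec[vocab[tok]] = count
    return vec

for i, doc in enumerate(demo_docs):
    print(f"Doc {i}: {bow_counts(doc, demo_vocab)}")
```

This avoids allocating one full-length vector per token, which matters once the vocabulary is much larger than this toy example.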

In [None]:
# --------------------------------------------
# 5. Word Embeddings using nn.Embedding
# --------------------------------------------
embedding_dim = 8  # small dimension for demonstration
embedding_layer = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Convert token lists to index tensors
index_docs = [torch.tensor([vocab[tok] for tok in doc], dtype=torch.long) for doc in tokenised_docs]

# Retrieve embeddings for the first document (randomly initialised, since the layer is untrained)
embeddings_doc0 = embedding_layer(index_docs[0])
print("Embeddings for first document (shape = [tokens, embedding_dim]):\n", embeddings_doc0)
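
An embedding lookup can be read as multiplying a one-hot row by the layer's weight matrix. The sketch below checks that equivalence on a freshly initialised layer; the `demo_*` names and the hard-coded indices (taken from the sorted vocabulary above) are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical layer with the same shape as in the notebook: 9 words, 8 dimensions
demo_embedding = nn.Embedding(num_embeddings=9, embedding_dim=8)
demo_indices = torch.tensor([4, 7, 0])  # "i", "like", "cats" in the sorted vocabulary

with torch.no_grad():
    # Direct table lookup
    looked_up = demo_embedding(demo_indices)
    # Same rows selected via one-hot vectors times the weight matrix
    one_hot_rows = F.one_hot(demo_indices, num_classes=9).float()
    via_matmul = one_hot_rows @ demo_embedding.weight

print(torch.allclose(looked_up, via_matmul))
```

The lookup is preferred in practice because it skips the multiplication by zeros entirely.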

In [None]:
# --------------------------------------------------
# 6. Embedding Bag (average of embeddings per doc)
# --------------------------------------------------
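# Note: this layer has its own randomly initialised weights, independent of
# embedding_layer above, so its outputs are not averages of the section 5 embeddings.
embedding_bag = nn.EmbeddingBag(num_embeddings=vocab_size,
                                embedding_dim=embedding_dim,
                                mode='mean')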

# Flatten all indices into one 1‑D tensor
flat_indices = torch.cat(index_docs)
# Offsets: index in flat_indices at which each document starts
offsets = torch.tensor([0] + [len(d) for d in index_docs[:-1]]).cumsum(dim=0)

print("Flat indices:", flat_indices)
print("Offsets:", offsets)

# Compute embedding bag representations
bag_outputs = embedding_bag(flat_indices, offsets)
print("EmbeddingBag outputs (shape = [docs, embedding_dim]):\n", bag_outputs)
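
To see that `mode='mean'` really averages per-token embeddings, the two layers need to share weights. The following is a minimal sketch under that assumption, using `nn.EmbeddingBag.from_pretrained` to copy the weights of a hypothetical `demo_embedding` layer; all `demo_*` and `shared_bag` names are introduced here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical embedding layer and a bag layer built from the same weights
demo_embedding = nn.Embedding(num_embeddings=9, embedding_dim=8)
shared_bag = nn.EmbeddingBag.from_pretrained(demo_embedding.weight.detach(), mode="mean")

# Index tensors for the three documents, using the sorted vocabulary above
demo_index_docs = [torch.tensor([4, 7, 0]),
                   torch.tensor([4, 2, 1]),
                   torch.tensor([5, 6, 8, 3])]

flat = torch.cat(demo_index_docs)
lengths = torch.tensor([len(d) for d in demo_index_docs])
# Each document starts where the previous ones end: [0, 3, 6]
demo_offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)[:-1]])

with torch.no_grad():
    bag_out = shared_bag(flat, demo_offsets)
    # Manual pooling: look up each document, then average over its tokens
    manual = torch.stack([demo_embedding(d).mean(dim=0) for d in demo_index_docs])

print(torch.allclose(bag_out, manual, atol=1e-6))
```

`from_pretrained` freezes the weights by default, so this form suits inspection; for training, a single `nn.EmbeddingBag` would normally be used on its own.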

## Conclusion
This notebook illustrated how the same corpus is represented under four vectorisation schemes. You can extend it by training a classifier on top of any of these representations, or by experimenting with different embedding dimensions and `nn.EmbeddingBag` pooling modes (`'sum'`, `'mean'`, `'max'`).

---
**Next Steps (Suggested Exercises)**
1. Add a simple fully‑connected classifier on top of the Bag‑of‑Words vectors.
2. Train the embedding + linear layers end‑to‑end on a larger labelled dataset.
3. Compare model performance and runtime across the four representations.