# NLPA Laboratory Requirements

### Setting Up Your Python Environment

# Core Libraries & Installation

pip install \
  numpy \
  scipy \
  pandas \
  matplotlib \
  jupyter \
  scikit-learn \
  nltk \
  spacy \
  gensim \
  torch torchvision torchaudio \
  tensorflow \
  transformers \
  datasets \
  sentencepiece

| Library           | Purpose                                                         |
| ----------------- | --------------------------------------------------------------- |
| **numpy, scipy**  | Numerical computing                                             |
| **pandas**        | Data handling & analysis                                        |
| **matplotlib**    | Basic plotting & visualization                                  |
| **jupyter**       | Interactive notebooks                                           |
| **scikit-learn**  | Traditional ML algorithms & preprocessing                       |
| **nltk**          | Classic NLP tasks (tokenization, corpora, simple models)        |
| **spaCy**         | Industrial-strength NLP (tokenization, POS, dependency parsing) |
| **gensim**        | Topic modeling & word embeddings (Word2Vec, Doc2Vec, LDA)       |
| **torch**         | Deep-learning framework (PyTorch)                               |
| **tensorflow**    | Deep-learning framework                                         |
| **transformers**  | State-of-the-art pre-trained models (BERT, GPT, etc.)           |
| **datasets**      | Easy access to common NLP datasets                              |
| **sentencepiece** | Subword tokenization (for Transformer models)                   |


# Highly Recommended

In [1]:
!pip install huggingface-hub



- **plotly** or **seaborn**: richer data visualizations as you analyze text and model outputs.

# Verifying Your Setup

In [2]:
import nltk, torch
print("NLTK:", nltk.__version__)
# print("spaCy:", spacy.__version__)
print("PyTorch:", torch.__version__)
# print("TensorFlow:", tensorflow.__version__)


NLTK: 3.8.1
PyTorch: 2.4.0+cu118


In [3]:
# Test Python version
import sys
print("Python:", sys.version.split()[0])

# Core scientific stack
import numpy;       print("NumPy:        ", numpy.__version__)
import scipy;       print("SciPy:        ", scipy.__version__)
import pandas;      print("Pandas:       ", pandas.__version__)
import matplotlib;  print("Matplotlib:   ", matplotlib.__version__)

# Machine-learning toolkit
import sklearn;     print("scikit-learn: ", sklearn.__version__)

# Classic NLP libraries
import nltk;        print("NLTK:         ", nltk.__version__)
import spacy;       print("spaCy:        ", spacy.__version__)
import gensim;      print("Gensim:       ", gensim.__version__)

# Deep-learning frameworks
import torch;       print("PyTorch:      ", torch.__version__)
import torchvision; print("TorchVision:  ", torchvision.__version__)
import torchaudio;  print("TorchAudio:   ", torchaudio.__version__)

#import tensorflow as tf
#print("TensorFlow:   ", tf.__version__)

# Transformer models & datasets
import transformers;  print("Transformers: ", transformers.__version__)
import datasets;      print("Datasets:     ", datasets.__version__)
import sentencepiece; print("SentencePiece:", sentencepiece.__version__)


Python: 3.11.7
NumPy:         1.24.4
SciPy:         1.10.1
Pandas:        1.5.3
Matplotlib:    3.8.0
scikit-learn:  1.2.2
NLTK:          3.8.1
spaCy:         3.8.7
Gensim:        4.3.0
PyTorch:       2.4.0+cu118
TorchVision:   0.19.0+cu118
TorchAudio:    2.4.0+cu118
Transformers:  4.51.3
Datasets:      4.0.0
SentencePiece: 0.2.0
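
The import cell above stops at the first library that fails to import. A minimal sketch of a more forgiving check, assuming only the standard library's `importlib`; the helper name `report_versions` is hypothetical and not part of the original notebook:

```python
import importlib

def report_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Broad on purpose: a broken install can raise more than ImportError,
            # and this is only a diagnostic report.
            versions[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions

# Import names, not pip names: scikit-learn is imported as "sklearn".
wanted = ["numpy", "scipy", "pandas", "matplotlib", "sklearn", "nltk",
          "spacy", "gensim", "torch", "tensorflow", "transformers",
          "datasets", "sentencepiece"]

for name, version in report_versions(wanted).items():
    print(f"{name:<15}", version if version is not None else "NOT INSTALLED")
```

Any line reading `NOT INSTALLED` points at a package to revisit with `pip install`.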


In [4]:
#!pip install datasets

In [5]:
#! pip install spacy

In [6]:
#! pip install gensim

# How Machines Interpret Text - Like Operating Systems

### UTF-8: Unicode Transformation Format, 8-bit

# Text Encoding & Bit-Level Demo

This notebook shows how a string of text is represented internally:
- **Unicode code points**  
- **UTF-8 byte sequences**  
- **Bit-level patterns**  
- **Binary file I/O**  


In [1]:
# 1) Define some sample text (ASCII + non-ASCII)
text = "Hello, 世界!"

In [2]:
# 2) Per-character breakdown: code points and UTF-8 bytes
print("Per-character breakdown:")
for ch in text:
    code_point = ord(ch)
    utf8_bytes = ch.encode('utf-8')
    print(f"  '{ch}'  → code point: U+{code_point:04X}  → bytes: {list(utf8_bytes)}")

Per-character breakdown:
  'H'  → code point: U+0048  → bytes: [72]
  'e'  → code point: U+0065  → bytes: [101]
  'l'  → code point: U+006C  → bytes: [108]
  'l'  → code point: U+006C  → bytes: [108]
  'o'  → code point: U+006F  → bytes: [111]
  ','  → code point: U+002C  → bytes: [44]
  ' '  → code point: U+0020  → bytes: [32]
  '世'  → code point: U+4E16  → bytes: [228, 184, 150]
  '界'  → code point: U+754C  → bytes: [231, 149, 140]
  '!'  → code point: U+0021  → bytes: [33]
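
The byte lists above follow a fixed bit layout: the size of the code point decides how many bytes are used, and each byte carries a marker prefix plus payload bits. A sketch of that rule written by hand, assuming valid code points and ignoring the surrogate range for brevity (the function name `utf8_encode_code_point` is made up for this example):

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point into UTF-8 by hand."""
    if cp < 0x80:            # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare the hand-rolled encoder with Python's built-in codec.
for ch in "Hé世😊":
    manual = utf8_encode_code_point(ord(ch))
    assert manual == ch.encode("utf-8")
    print(ch, list(manual), [format(b, "08b") for b in manual])
```

In real code `str.encode('utf-8')` is the right tool; the manual version only shows where the bytes come from.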


In [3]:
# 3) Full text as a UTF-8 byte sequence
full_bytes = text.encode('utf-8')
print("\nFull text as bytes:", full_bytes)


Full text as bytes: b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'


In [4]:
# 4) Bit-level representation of those bytes
bit_strs = [format(b, '08b') for b in full_bytes]
print("Full text as bits:   ", ' '.join(bit_strs))

Full text as bits:    01001000 01100101 01101100 01101100 01101111 00101100 00100000 11100100 10111000 10010110 11100111 10010101 10001100 00100001
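
The same bytes are what end up on disk. A minimal sketch of binary file I/O with the sample string, assuming a temporary directory is acceptable for the demo (the file name `sample.txt` is arbitrary):

```python
import os
import tempfile

text = "Hello, 世界!"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # Write the raw UTF-8 bytes in binary mode.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Read them back as bytes: no decoding happens in binary mode.
    with open(path, "rb") as f:
        raw = f.read()

    # Read the same file in text mode with an explicit encoding.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

print(len(text), "characters ->", len(raw), "bytes on disk")
print("Raw bytes:   ", raw)
print("Decoded text:", decoded)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform default.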


# Q. Is this the encoding format for NLP applications?

# Encoding in NLP: Why UTF-8 Is Recommended

In this notebook we will:
- See why UTF-8 is the go-to encoding for NLP.
- Show how ASCII encoding fails on non-ASCII characters.
- Demonstrate how mis-decoding (e.g., reading UTF-8 bytes as Latin-1) corrupts text.
- Confirm that UTF-8 round-trips without loss.


In [5]:
# Sample text containing ASCII, CJK, accented Latin, and emoji
text = "Hello, world! 你好, café 😊"
print("Original text:", text)

Original text: Hello, world! 你好, café 😊


In [6]:
# 1) Attempt to encode with ASCII (should error)
try:
    ascii_bytes = text.encode('ascii')
    print("ASCII bytes:", ascii_bytes)
except UnicodeEncodeError as e:
    print("ASCII encoding error:", e)

ASCII encoding error: 'ascii' codec can't encode characters in position 14-15: ordinal not in range(128)


> **Why this fails:**  
> ASCII only covers code points 0–127. Characters like “你” (U+4F60), “é” (U+00E9), or “😊” (U+1F60A) lie outside that range, so `text.encode('ascii')` raises a `UnicodeEncodeError`.
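
Mis-decoding and the lossless UTF-8 round trip can be sketched in the same way. The example below assumes the bytes were written as UTF-8 and then read by a tool that wrongly assumes Latin-1:

```python
text = "Hello, world! 你好, café 😊"
utf8_bytes = text.encode("utf-8")

# Latin-1 maps every possible byte to some character, so decoding never raises;
# multi-byte UTF-8 sequences silently turn into mojibake instead.
garbled = utf8_bytes.decode("latin-1")
print("Mis-decoded:", repr(garbled))
print("'café' alone becomes:", "café".encode("utf-8").decode("latin-1"))

# Nothing was lost yet: re-encoding as Latin-1 restores the original bytes.
repaired = garbled.encode("latin-1").decode("utf-8")
print("Repaired:", repaired)

# Decoding with the correct codec round-trips exactly.
round_trip = utf8_bytes.decode("utf-8")
print("UTF-8 round-trip OK:", round_trip == text)
```

The repair step only works while the garbled string is still intact; once it has been saved or normalized under another encoding, the original text may not be recoverable.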

## Conclusion

- **UTF-8** is the de facto standard for NLP: it can losslessly encode every Unicode character and is backward-compatible with ASCII.  
- Narrower encodings (ASCII, Latin-1, etc.) either raise errors on characters they cannot represent or corrupt your data when bytes are decoded with the wrong codec.  
- **Always** ensure your entire NLP pipeline—file I/O, model inputs, serialization, network transfers—is UTF-8 end-to-end.
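
Python's `errors` argument makes the "errors or corruption" trade-off concrete. A small sketch, assuming ASCII as the narrow target encoding:

```python
text = "café 😊"

# Default behaviour (errors="strict") raises UnicodeEncodeError.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict :", exc.reason)

# errors="replace" substitutes "?" for each character ASCII cannot represent.
replaced = text.encode("ascii", errors="replace")
print("replace:", replaced)   # b'caf? ?'

# errors="ignore" drops those characters entirely.
ignored = text.encode("ascii", errors="ignore")
print("ignore :", ignored)    # b'caf '

# UTF-8 needs no error handler for this text and loses nothing.
assert text.encode("utf-8").decode("utf-8") == text
```

Both lenient handlers avoid the exception but permanently discard information, which is rarely what an NLP pipeline wants.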

# Let's do a simple text encoding for a machine learning application

In [11]:
# 1. Your raw sentence
text = "I love natural language processing courese"

In [12]:
# 2. Lowercase & whitespace-tokenize
tokens = text.lower().split()
#    → ["i", "love", "natural", "language", "processing", "courese"]
tokens

['i', 'love', 'natural', 'language', 'processing', 'courese']

In [14]:
# 3. Build a tiny vocab mapping (in real life you'd pre-build on your whole corpus)
vocab = {tok: idx+1 for idx, tok in enumerate(tokens)}
#    → {'i':1, 'love':2, 'natural':3, 'language':4, 'processing':5, 'courese':6}
vocab

{'i': 1, 'love': 2, 'natural': 3, 'language': 4, 'processing': 5, 'courese': 6}

In [15]:
# 4. Convert sentence to a list of token IDs
encoded = [vocab[t] for t in tokens]
print("Integer-encoded:", encoded)
# Integer-encoded: [1, 2, 3, 4, 5, 6]

Integer-encoded: [1, 2, 3, 4, 5, 6]
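
The vocabulary above is built from a single sentence, so any new word would raise a `KeyError`. A sketch of the corpus-level version mentioned in step 3, assuming ID 0 is reserved for padding and ID 1 for unknown words; the names `build_vocab`, `encode_sentence`, `<pad>`, and `<unk>` are illustrative choices, not from the original notebook:

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"   # hypothetical special-token names

def build_vocab(corpus, min_freq=1):
    """Assign an integer ID to every token seen at least min_freq times."""
    counts = Counter(tok for sentence in corpus for tok in sentence.lower().split())
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in counts.most_common():   # most frequent tokens get the lowest IDs
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode_sentence(sentence, vocab):
    """Map tokens to IDs, sending out-of-vocabulary tokens to the UNK ID."""
    return [vocab.get(tok, vocab[UNK]) for tok in sentence.lower().split()]

corpus = ["I love natural language processing",
          "I love machine learning"]
corpus_vocab = build_vocab(corpus)

print(corpus_vocab)
print(encode_sentence("I love deep learning", corpus_vocab))   # "deep" is unseen
```

Here the unseen word "deep" maps to the unknown-word ID instead of crashing the encoder.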


# One-hot encoding

In [18]:
import numpy as np

vocab_size = len(vocab) + 1   # +1 because token IDs start at 1; ID 0 is reserved for padding/OOV
# Create identity matrix of size vocab_size
eye = np.eye(vocab_size)

# Build one-hot rows for each token ID
one_hot = eye[encoded]
print("One-hot shape:", one_hot.shape)
# One-hot shape: (6, 7)

print("One-hot rows:\n", one_hot)

One-hot shape: (6, 7)
One-hot rows:
 [[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 1.]]


In [19]:
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [20]:
# 2) One-hot encoding: print each token with its vector
import numpy as np

# (Assuming you’ve already defined `tokens`, `vocab`, and `encoded` as before:)
# tokens   = ["i","love","natural","language","processing","courese"]
# vocab    = {'i':1, 'love':2, ...}
# encoded  = [1, 2, 3, 4, 5, 6]

vocab_size = len(vocab) + 1   # +1 because token IDs start at 1; ID 0 is reserved for padding/OOV
eye = np.eye(vocab_size, dtype=int)

one_hot = eye[encoded]        # shape: (sentence_length, vocab_size)

print("Token".ljust(15), "One-hot vector")
print("-"*15, "-"* (vocab_size*2))
for token, vec in zip(tokens, one_hot):
    print(token.ljust(15), vec)

Token           One-hot vector
--------------- --------------
i               [0 1 0 0 0 0 0]
love            [0 0 1 0 0 0 0]
natural         [0 0 0 1 0 0 0]
language        [0 0 0 0 1 0 0]
processing      [0 0 0 0 0 1 0]
courese         [0 0 0 0 0 0 1]


In [22]:
import numpy as np

# Re-define tokens, vocab, and encoded
tokens = ["i", "love", "natural", "language", "processing", "courese"]
vocab = {tok: idx+1 for idx, tok in enumerate(tokens)}
encoded = [vocab[t] for t in tokens]

# Simulate a PyTorch nn.Embedding with a NumPy matrix
np.random.seed(42)
vocab_size = len(vocab) + 1
embedding_dim = 4
embedding_matrix = np.random.rand(vocab_size, embedding_dim)

# Lookup embeddings
embedded_sequence = embedding_matrix[encoded]

print("Token".ljust(15), "Embedding vector")
print("-" * 15)
for token, vec in zip(tokens, embedded_sequence):
    print(token.ljust(15), vec)

Token           Embedding vector
---------------
i               [0.15601864 0.15599452 0.05808361 0.86617615]
love            [0.60111501 0.70807258 0.02058449 0.96990985]
natural         [0.83244264 0.21233911 0.18182497 0.18340451]
language        [0.30424224 0.52475643 0.43194502 0.29122914]
processing      [0.61185289 0.13949386 0.29214465 0.36636184]
courese         [0.45606998 0.78517596 0.19967378 0.51423444]


# Q. What is this embedding layer?

An embedding layer is essentially a learnable lookup table that maps discrete input tokens (words, subwords, characters, item IDs, etc.) to continuous, dense vectors.

**Inputs:**

- A sequence of integer IDs, each representing a token in your vocabulary.
- E.g. `[12, 5, 89, 32]` might correspond to `["I", "love", "NLP", "."]`.

**Parameters:**

- A weight matrix W of shape (V, D), where
  - V = size of your vocabulary (or number of unique IDs),
  - D = dimensionality of the embedding vectors you want (e.g. 50, 100, 300).

**Operation:**

- For each input ID i, you return the i-th row of W, which is a D-dimensional vector.
- If your input is a sequence of length L, the output is a matrix of shape (L, D).

**Learning:**

- During training, W is updated via backpropagation so that tokens used in similar contexts acquire similar vectors.
- You can also initialize W from pre-trained embeddings (e.g. GloVe, word2vec) and either freeze or fine-tune them.
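
The operation described above, returning row i of W, gives the same result as multiplying a one-hot vector by W. A plain-Python sketch with a made-up 4 x 3 weight matrix (the values are arbitrary, and NumPy is left out only to keep the example dependency-free):

```python
# Hypothetical toy weight matrix W with V = 4 rows (vocabulary) and D = 3 columns.
W = [
    [0.0, 0.0, 0.0],   # ID 0, e.g. reserved for padding
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]
V, D = len(W), len(W[0])

def lookup(ids):
    """Embedding lookup: pick row i of W for every input ID."""
    return [W[i] for i in ids]

def one_hot(i):
    """Length-V vector with a single 1.0 at position i."""
    return [1.0 if j == i else 0.0 for j in range(V)]

def one_hot_times_w(ids):
    """Same result via an explicit (one-hot vector) x W product."""
    return [[sum(one_hot(i)[r] * W[r][c] for r in range(V)) for c in range(D)]
            for i in ids]

ids = [2, 0, 3]                      # a sequence of length L = 3
embedded = lookup(ids)
assert embedded == one_hot_times_w(ids)
print("Output shape:", (len(embedded), len(embedded[0])))   # (L, D)
```

The lookup form is what frameworks implement in practice, since it avoids materializing the mostly-zero one-hot vectors.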



In [20]:
import torch
from torch import nn

vocab_size = 10000   # e.g. 10k words
embedding_dim = 128  # each word → a 128-dim vector

# Create the layer
embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Sample batch of token IDs (batch_size=2, seq_len=5)
input_ids = torch.tensor([[12, 45, 900, 32, 1],
                          [ 4, 23,  17,  0, 7]], dtype=torch.long)

# Forward pass → shape (2, 5, 128)
output = embed(input_ids)
print(output.shape)  # torch.Size([2, 5, 128])

torch.Size([2, 5, 128])


# Inspecting Semantic Structure with a PyTorch Embedding Layer

We will:

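1. Load an embedding matrix into `nn.Embedding` (a random stand-in below; real pretrained weights such as GloVe in practice).  
2. Extract the vectors for a handful of words.  
3. Compute pairwise cosine similarities.  
4. Check whether related pairs such as “cat” ↔ “dog” and “apple” ↔ “banana” score higher than unrelated pairs, which is expected only with genuinely pretrained weights.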


In [21]:
import torch
import torch.nn as nn
import numpy as np

# ---- 2.1) A toy “pretrained” embedding matrix ----
# In practice, replace this with loaded GloVe/BERT weights.
# Here we simulate a 10-word vocab, each with a 5-dim vector of random
# values, so the similarities printed below carry no semantic meaning.
np.random.seed(0)
pretrained_weights = np.random.randn(10, 5).astype(np.float32)

# Let’s pretend our vocab is:
vocab = ["<pad>","cat","dog","car","apple","banana","king","queen","man","woman"]
word2idx = {w:i for i,w in enumerate(vocab)}

# ---- 2.2) Build the Embedding layer and load weights ----
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=5)
emb.weight.data.copy_(torch.from_numpy(pretrained_weights))

# ---- 2.3) Select words we care about ----
words = ["cat","dog","car","apple","banana","king","queen"]
idxs  = torch.tensor([word2idx[w] for w in words], dtype=torch.long)

# ---- 2.4) Lookup their embeddings ----
vectors = emb(idxs)                          # shape: (7, 5)
vectors = vectors.detach().cpu().numpy()

# ---- 2.5) Cosine similarity function ----
def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# ---- 2.6) Compute and print pairwise similarities ----
print("Pairwise cosine similarities:\n")
for i, w1 in enumerate(words):
    for j, w2 in enumerate(words[i+1:], start=i+1):
        sim = cosine_sim(vectors[i], vectors[j])
        print(f"  {w1:>6} ↔ {w2:<6}: {sim:.3f}")

Pairwise cosine similarities:

     cat ↔ dog   : 0.528
     cat ↔ car   : 0.288
     cat ↔ apple : 0.760
     cat ↔ banana: 0.523
     cat ↔ king  : 0.126
     cat ↔ queen : 0.291
     dog ↔ car   : 0.562
     dog ↔ apple : 0.345
     dog ↔ banana: 0.125
     dog ↔ king  : -0.130
     dog ↔ queen : 0.829
     car ↔ apple : -0.339
     car ↔ banana: -0.249
     car ↔ king  : 0.119
     car ↔ queen : 0.553
   apple ↔ banana: 0.612
   apple ↔ king  : -0.029
   apple ↔ queen : 0.159
  banana ↔ king  : -0.622
  banana ↔ queen : -0.309
    king ↔ queen : 0.073


> With genuinely pretrained weights (e.g. GloVe), pairs like **cat ↔ dog** and **apple ↔ banana** should score notably higher than, say, **cat ↔ car** or **king ↔ apple**. The matrix used above is random, so the printed scores are arbitrary: **cat ↔ apple** (0.760) beats **cat ↔ dog** (0.528), and **king ↔ queen** is only 0.073.  
>
> Once you load semantically trained embedding weights into a PyTorch `nn.Embedding` layer, the geometry of that vector space places similar-meaning words close together, which is what lets downstream models generalize by proximity in embedding space.
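
To see what "close together" looks like without downloading real vectors, here is a sketch with hand-made 3-dimensional toy vectors. The values are hypothetical, chosen so that animals, fruits, and vehicles point in different directions; they are not trained embeddings:

```python
import math

# Hand-made toy vectors (NOT trained): axes loosely mean animal / fruit / vehicle.
toy_vectors = {
    "cat":    [0.9, 0.1, 0.0],
    "dog":    [0.8, 0.2, 0.1],
    "apple":  [0.1, 0.9, 0.0],
    "banana": [0.0, 0.8, 0.1],
    "car":    [0.1, 0.0, 0.9],
}

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, k=1):
    """Return the k words most similar to `word`, best first."""
    scores = [(other, cosine_sim(toy_vectors[word], vec))
              for other, vec in toy_vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

for query in ["cat", "apple", "car"]:
    neighbour, score = nearest(query)[0]
    print(f"{query:>6} -> {neighbour:<6} ({score:.3f})")
```

With real pretrained vectors the same nearest-neighbour query is a quick sanity check that the weights were loaded correctly.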


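The labels below are informal rules of thumb rather than a standard scale; useful cutoffs depend on the embedding model.

| Score range    | Label              |
| -------------- | ------------------ |
| \[0.90, 1.00]  | Nearly identical   |
| \[0.75, 0.90)  | Highly similar     |
| \[0.50, 0.75)  | Moderately similar |
| \[0.25, 0.50)  | Slightly related   |
| \[-1.00, 0.25) | Unrelated          |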


In [22]:
import numpy as np

# Illustrative, hand-picked cosine similarities (not model outputs)
pairs = {
    ("cat","dog"):   0.92,
    ("cat","car"):   0.12,
    ("apple","banana"): 0.88
}

def interpret_similarity(s):
    if s >= 0.90: return "Nearly identical"
    if s >= 0.75: return "Highly similar"
    if s >= 0.50: return "Moderately similar"
    if s >= 0.25: return "Slightly related"
    return "Unrelated"

for (w1,w2), score in pairs.items():
    angle = np.degrees(np.arccos(score))
    print(f"{w1:>6} ↔ {w2:<6}: {score:.2f} ({score*100:.0f}%), {angle:.0f}°, {interpret_similarity(score)}")


   cat ↔ dog   : 0.92 (92%), 23°, Nearly identical
   cat ↔ car   : 0.12 (12%), 83°, Unrelated
 apple ↔ banana: 0.88 (88%), 28°, Highly similar


In [24]:
from IPython.display import display, Math

display(Math(r"""
s = \frac{a \cdot b}{\|a\|\;\|b\|}, 
\quad
\theta = \cos^{-1}(s)\times\frac{180}{\pi}
"""))


$$s = \frac{a \cdot b}{\|a\|\,\|b\|}, \quad \theta = \cos^{-1}(s)\times\frac{180}{\pi}$$

In [25]:
import numpy as np

# 1) Define two example embedding vectors
a = np.array([0.90, 0.10, 0.00])
b = np.array([0.88, 0.12, 0.00])

# 2) Compute cosine similarity s
s = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 3) Print s and also the corresponding angle θ
theta = np.degrees(np.arccos(s))

print(f"cosine similarity s = {s:.4f}")
print(f"angle θ = {theta:.1f}°")


cosine similarity s = 0.9997
angle θ = 1.4°
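
Cosine similarity depends only on direction, not on vector length, which is one reason it is a common choice for comparing embeddings. A short sketch reusing the two vectors above and stretching one of them by an arbitrary factor of 10 (an assumption made purely for illustration):

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [0.90, 0.10, 0.00]
b = [0.88, 0.12, 0.00]
b_scaled = [10 * x for x in b]   # same direction as b, ten times longer

print(f"cosine(a, b)       = {cosine_sim(a, b):.4f}")
print(f"cosine(a, 10*b)    = {cosine_sim(a, b_scaled):.4f}")
print(f"euclidean(a, b)    = {math.dist(a, b):.4f}")
print(f"euclidean(a, 10*b) = {math.dist(a, b_scaled):.4f}")
```

The cosine score is unchanged by the scaling, while the Euclidean distance grows with the length of the stretched vector.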
