# NLP Lab 2: Count Vectorization (OOP approach)

**Objective:** Implement Count Vectorization (Bag-of-Words) in an object-oriented style.

**Contents:**
1. Test SimpleTokenizer and RegexTokenizer
2. Test CountVectorizer on a toy corpus
3. Test on the UD English EWT dataset
4. Compare with scikit-learn

In [1]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../src')))

from preprocessing.tokenizers import SimpleTokenizer, RegexTokenizer
from representations.count_vectorizer import CountVectorizer
import json

## 1. Test SimpleTokenizer and RegexTokenizer (Basic Examples)

In [2]:
# Test cases
test_sentences = [
    "Hello World!",
    "I love NLP.",
    "Natural Language Processing is fun!",
    "   Multiple   spaces   here   ",
    "Numbers: 123, 456.78"
]

simple_tokenizer = SimpleTokenizer()
regex_tokenizer = RegexTokenizer()

print("=" * 60)
print("TOKENIZER COMPARISON")
print("=" * 60)

for sentence in test_sentences:
    print(f"\nInput: '{sentence}'")
    print(f"  SimpleTokenizer: {simple_tokenizer.tokenize(sentence)}")
    print(f"  RegexTokenizer:  {regex_tokenizer.tokenize(sentence)}")

TOKENIZER COMPARISON

Input: 'Hello World!'
  SimpleTokenizer: ['Hello', 'World!']
  RegexTokenizer:  ['hello', 'world']

Input: 'I love NLP.'
  SimpleTokenizer: ['I', 'love', 'NLP.']
  RegexTokenizer:  ['i', 'love', 'nlp']

Input: 'Natural Language Processing is fun!'
  SimpleTokenizer: ['Natural', 'Language', 'Processing', 'is', 'fun!']
  RegexTokenizer:  ['natural', 'language', 'processing', 'is', 'fun']

Input: '   Multiple   spaces   here   '
  SimpleTokenizer: ['Multiple', 'spaces', 'here']
  RegexTokenizer:  ['multiple', 'spaces', 'here']

Input: 'Numbers: 123, 456.78'
  SimpleTokenizer: ['Numbers:', '123,', '456.78']
  RegexTokenizer:  ['numbers', '123', '456', '78']


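### Notes on the tokenizers:
- **SimpleTokenizer**: splits on whitespace only and keeps punctuation and case, so `World!` and `123,` remain single tokens
- **RegexTokenizer**: uses the regex `\b\w+\b`, lowercases the text, and drops punctuation; as a side effect, the decimal `456.78` is split into `456` and `78`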
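
The two behaviours described above can be reproduced with a few lines of standard-library Python. The source of `preprocessing.tokenizers` is not shown in this notebook, so the classes below (`SketchSimpleTokenizer`, `SketchRegexTokenizer`) are hypothetical stand-ins that only assume the whitespace split and the `\b\w+\b` regex mentioned in the notes; the real classes may differ in detail.

```python
import re


class SketchSimpleTokenizer:
    """Hypothetical stand-in: split on whitespace, keep case and punctuation."""

    def tokenize(self, text):
        # str.split() with no argument collapses runs of whitespace
        # and ignores leading/trailing spaces.
        return text.split()


class SketchRegexTokenizer:
    """Hypothetical stand-in: lowercase, then keep runs of word characters."""

    _pattern = re.compile(r"\b\w+\b")

    def tokenize(self, text):
        # Punctuation is not matched by \w, so it is dropped.
        return self._pattern.findall(text.lower())


sample = "Numbers: 123, 456.78"
simple_tokens = SketchSimpleTokenizer().tokenize(sample)
regex_tokens = SketchRegexTokenizer().tokenize(sample)
print(simple_tokens)
print(regex_tokens)
```

On the last test sentence, this sketch yields the same tokens as the output shown in cell 2.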

## 2. Test CountVectorizer (Toy Corpus)

In [3]:
# Corpus from the assignment
corpus = [
    "I love NLP.",
    "I love programming.",
    "NLP is a subfield of AI."
]

tokenizer = RegexTokenizer()
vectorizer = CountVectorizer(tokenizer)

# fit_transform
vectors = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.vocabulary_)
print(f"\nVocabulary size: {len(vectorizer.vocabulary_)}")
print("\nDocument-Term Matrix:")
for i, (doc, vec) in enumerate(zip(corpus, vectors)):
    print(f"  Doc {i}: {vec}  <- '{doc}'")

Vocabulary: {'a': 0, 'ai': 1, 'i': 2, 'is': 3, 'love': 4, 'nlp': 5, 'of': 6, 'programming': 7, 'subfield': 8}

Vocabulary size: 9

Document-Term Matrix:
  Doc 0: [0, 0, 1, 0, 1, 1, 0, 0, 0]  <- 'I love NLP.'
  Doc 1: [0, 0, 1, 0, 1, 0, 0, 1, 0]  <- 'I love programming.'
  Doc 2: [1, 1, 0, 1, 0, 1, 1, 0, 1]  <- 'NLP is a subfield of AI.'
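
The output above suggests that the vocabulary is built from the sorted set of tokens and that each document becomes a list of per-token counts. Below is a minimal sketch of that idea. The implementation in `representations.count_vectorizer` is not shown here, so `SketchCountVectorizer` and `sketch_tokenize` are hypothetical names, and the sorted-vocabulary choice is an assumption inferred from the printed indices.

```python
import re
from collections import Counter


def sketch_tokenize(text):
    # Hypothetical tokenizer: lowercase and keep runs of word characters.
    return re.findall(r"\b\w+\b", text.lower())


class SketchCountVectorizer:
    """Hypothetical bag-of-words vectorizer with a sorted vocabulary."""

    def __init__(self, tokenize):
        self.tokenize = tokenize
        self.vocabulary_ = {}

    def fit(self, corpus):
        # Collect every distinct token, then assign indices in sorted order.
        unique_tokens = set()
        for doc in corpus:
            unique_tokens.update(self.tokenize(doc))
        self.vocabulary_ = {tok: i for i, tok in enumerate(sorted(unique_tokens))}
        return self

    def transform(self, corpus):
        vectors = []
        for doc in corpus:
            counts = Counter(self.tokenize(doc))
            vec = [0] * len(self.vocabulary_)
            for tok, n in counts.items():
                idx = self.vocabulary_.get(tok)
                if idx is not None:  # tokens unseen during fit are ignored
                    vec[idx] = n
            vectors.append(vec)
        return vectors

    def fit_transform(self, corpus):
        return self.fit(corpus).transform(corpus)


sketch_corpus = ["I love NLP.", "I love programming.", "NLP is a subfield of AI."]
sketch_vectorizer = SketchCountVectorizer(sketch_tokenize)
sketch_vectors = sketch_vectorizer.fit_transform(sketch_corpus)
print(sketch_vectorizer.vocabulary_)
print(sketch_vectors)
```

Ignoring out-of-vocabulary tokens in `transform` is one possible design choice; it keeps the vector length fixed at the size of the fitted vocabulary.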


In [4]:
# Another toy corpus for additional testing
toy_corpus = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are enemies"
]

toy_vectorizer = CountVectorizer(RegexTokenizer())
toy_vectors = toy_vectorizer.fit_transform(toy_corpus)

print("Vocabulary:", toy_vectorizer.vocabulary_)
print("\nDocument-Term Matrix:")
for i, vec in enumerate(toy_vectors):
    print(f"  Doc {i}: {vec}")

# Verify: "the" appears twice in doc 0 and doc 1
the_idx = toy_vectorizer.vocabulary_['the']
print(f"\nVerify 'the' (index={the_idx}): Doc0={toy_vectors[0][the_idx]}, Doc1={toy_vectors[1][the_idx]}")

Vocabulary: {'and': 0, 'are': 1, 'cat': 2, 'cats': 3, 'dog': 4, 'dogs': 5, 'enemies': 6, 'log': 7, 'mat': 8, 'on': 9, 'sat': 10, 'the': 11}

Document-Term Matrix:
  Doc 0: [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 2]
  Doc 1: [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 2]
  Doc 2: [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0]

Verify 'the' (index=11): Doc0=2, Doc1=2


## 3. Test on the UD English EWT Dataset

### 3.1 Test Tokenizers on EWT

In [5]:
# Load dataset
dataset_path = '../data/lab5/UD_English-EWT/en_ewt-ud-train.jsonl'
documents = []

with open(dataset_path, 'r', encoding='utf-8') as f:
    for line in f:
        data = json.loads(line)
        text = " ".join(data['words'])
        documents.append(text)

print(f"Loaded {len(documents)} documents from EWT dataset.")
print("\nSample documents:")
for i in range(min(3, len(documents))):
    print(f"  [{i}]: {documents[i][:80]}...")

Loaded 12544 documents from EWT dataset.

Sample documents:
  [0]: Al - Zaman : American forces killed Shaikh Abdullah al - Ani , the preacher at t...
  [1]: [ This killing of a respected cleric will be causing us trouble for years to com...
  [2]: DPA : Iraqi authorities announced that they had busted up 3 terrorist cells oper...


In [6]:
# Test tokenizers on EWT samples
simple_tok = SimpleTokenizer()
regex_tok = RegexTokenizer()

print("Tokenizer comparison on EWT samples:")
for i in range(min(3, len(documents))):
    doc = documents[i]
    simple_tokens = simple_tok.tokenize(doc)
    regex_tokens = regex_tok.tokenize(doc)
    print(f"\nDoc {i}: '{doc[:50]}...'")
    print(f"  SimpleTokenizer: {len(simple_tokens)} tokens")
    print(f"  RegexTokenizer:  {len(regex_tokens)} tokens")

Tokenizer comparison on EWT samples:

Doc 0: 'Al - Zaman : American forces killed Shaikh Abdulla...'
  SimpleTokenizer: 29 tokens
  RegexTokenizer:  23 tokens

Doc 1: '[ This killing of a respected cleric will be causi...'
  SimpleTokenizer: 18 tokens
  RegexTokenizer:  15 tokens

Doc 2: 'DPA : Iraqi authorities announced that they had bu...'
  SimpleTokenizer: 17 tokens
  RegexTokenizer:  15 tokens


### 3.2 Test CountVectorizer on EWT

In [7]:
# Initialize vectorizer
ewt_tokenizer = RegexTokenizer()
ewt_vectorizer = CountVectorizer(ewt_tokenizer)

# Fit on entire EWT dataset
ewt_vectorizer.fit(documents)
print(f"Vocabulary size: {len(ewt_vectorizer.vocabulary_)}")

# Show some vocabulary samples
vocab_items = list(ewt_vectorizer.vocabulary_.items())[:10]
print("\nSample vocabulary (first 10):")
for word, idx in vocab_items:
    print(f"  '{word}': {idx}")

Vocabulary size: 15972

Sample vocabulary (first 10):
  '0': 0
  '00': 1
  '000': 2
  '0000108806': 3
  '0027': 4
  '0046': 5
  '008': 6
  '01': 7
  '0134': 8
  '02': 9


In [8]:
# Transform first 5 documents
ewt_vectors = ewt_vectorizer.transform(documents[:5])

print("First 5 document vectors:")
for i, vec in enumerate(ewt_vectors):
    non_zero = sum(1 for v in vec if v > 0)
    total_count = sum(vec)
    print(f"  Doc {i}: length={len(vec)}, non-zero={non_zero}, total_count={total_count}")

First 5 document vectors:
  Doc 0: length=15972, non-zero=19, total_count=23
  Doc 1: length=15972, non-zero=15, total_count=15
  Doc 2: length=15972, non-zero=15, total_count=15
  Doc 3: length=15972, non-zero=12, total_count=15
  Doc 4: length=15972, non-zero=30, total_count=34


In [9]:
# Analyze sparsity
sample_vectors = ewt_vectorizer.transform(documents[:100])
total_elements = len(sample_vectors) * len(sample_vectors[0])
non_zero_elements = sum(sum(1 for v in vec if v > 0) for vec in sample_vectors)
sparsity = 1 - (non_zero_elements / total_elements)

print("Sparsity analysis (first 100 docs):")
print(f"  Total elements: {total_elements}")
print(f"  Non-zero elements: {non_zero_elements}")
print(f"  Sparsity: {sparsity:.4f} ({sparsity*100:.2f}%)")

Sparsity analysis (first 100 docs):
  Total elements: 1597200
  Non-zero elements: 1880
  Sparsity: 0.9988 (99.88%)
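
With almost all entries equal to zero, dense Python lists store mostly padding. One possible alternative, sketched below, is to keep only the non-zero counts of each document in a dictionary keyed by column index. The helpers `to_sparse_rows` and `sparsity_of` are hypothetical and not part of the lab code; the example uses the small cat/dog toy corpus so that it runs without the EWT file.

```python
import re
from collections import Counter


def sketch_tokenize(text):
    # Hypothetical tokenizer: lowercase and keep runs of word characters.
    return re.findall(r"\b\w+\b", text.lower())


def to_sparse_rows(corpus, vocabulary):
    """Return one {column_index: count} dict per document (non-zero entries only)."""
    rows = []
    for doc in corpus:
        counts = Counter(sketch_tokenize(doc))
        rows.append({vocabulary[t]: n for t, n in counts.items() if t in vocabulary})
    return rows


def sparsity_of(rows, n_columns):
    """Fraction of zero entries in the implied dense matrix."""
    stored = sum(len(row) for row in rows)
    return 1 - stored / (len(rows) * n_columns)


toy = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are enemies",
]
all_tokens = {tok for doc in toy for tok in sketch_tokenize(doc)}
vocab = {tok: i for i, tok in enumerate(sorted(all_tokens))}

sparse_rows = to_sparse_rows(toy, vocab)
toy_sparsity = sparsity_of(sparse_rows, len(vocab))
print(sparse_rows[0])
print(f"Sparsity: {toy_sparsity:.4f}")
```

On the toy corpus the saving is small, but with 1880 non-zero entries out of 1597200 in the EWT sample, a representation like this would store only the 1880.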


## 4. Comparison with scikit-learn

In [10]:
from sklearn.feature_extraction.text import CountVectorizer as SklearnCountVectorizer

# Sklearn vectorizer
sklearn_vectorizer = SklearnCountVectorizer()
sklearn_vectors = sklearn_vectorizer.fit_transform(documents)

print("Comparison:")
print(f"  My Vocabulary Size:      {len(ewt_vectorizer.vocabulary_)}")
print(f"  Sklearn Vocabulary Size: {len(sklearn_vectorizer.vocabulary_)}")
print(f"\n  My Vector Shape:      ({len(documents)}, {len(ewt_vectorizer.vocabulary_)})")
print(f"  Sklearn Vector Shape: {sklearn_vectors.shape}")

Comparison:
  My Vocabulary Size:      15972
  Sklearn Vocabulary Size: 15936

  My Vector Shape:      (12544, 15972)
  Sklearn Vector Shape: (12544, 15936)


In [11]:
# Compare on toy corpus to verify correctness
test_corpus = [
    "I love NLP.",
    "I love programming.",
    "NLP is a subfield of AI."
]

my_vec = CountVectorizer(RegexTokenizer())
my_result = my_vec.fit_transform(test_corpus)

sk_vec = SklearnCountVectorizer(lowercase=True)
sk_result = sk_vec.fit_transform(test_corpus).toarray()

print("My vocabulary:", my_vec.vocabulary_)
print("Sklearn vocabulary:", dict(sorted(sk_vec.vocabulary_.items(), key=lambda x: x[1])))
print("\nMy vectors:")
for v in my_result:
    print(f"  {v}")
print("\nSklearn vectors:")
for v in sk_result:
    print(f"  {list(v)}")

My vocabulary: {'a': 0, 'ai': 1, 'i': 2, 'is': 3, 'love': 4, 'nlp': 5, 'of': 6, 'programming': 7, 'subfield': 8}
Sklearn vocabulary: {'ai': 0, 'is': 1, 'love': 2, 'nlp': 3, 'of': 4, 'programming': 5, 'subfield': 6}

My vectors:
  [0, 0, 1, 0, 1, 1, 0, 0, 0]
  [0, 0, 1, 0, 1, 0, 0, 1, 0]
  [1, 1, 0, 1, 0, 1, 1, 0, 1]

Sklearn vectors:
  [np.int64(0), np.int64(0), np.int64(1), np.int64(1), np.int64(0), np.int64(0), np.int64(0)]
  [np.int64(0), np.int64(0), np.int64(1), np.int64(0), np.int64(0), np.int64(1), np.int64(0)]
  [np.int64(1), np.int64(1), np.int64(0), np.int64(1), np.int64(1), np.int64(0), np.int64(1)]


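### Explaining the differences:
- **Different tokenization**: by default, sklearn uses the regex `(?u)\b\w\w+\b` (tokens of at least 2 characters), whereas our tokenizer uses `\b\w+\b` (tokens of at least 1 character). This is why `a` and `i` are missing from the sklearn vocabulary on the toy corpus.
- **Vocabulary size**: on EWT our vocabulary is larger by 36 entries (15972 vs. 15936), which is consistent with sklearn dropping single-character tokens. In general, the sizes can differ in either direction depending on how punctuation and single-character tokens are handled.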
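
The effect of the two patterns can be checked with the standard `re` module alone, without fitting either vectorizer. The helper `vocab_for` below is a hypothetical name introduced for this illustration; it lowercases each document, applies the given pattern, and returns the sorted distinct tokens.

```python
import re

OUR_PATTERN = r"\b\w+\b"                 # tokens of 1+ word characters
SKLEARN_DEFAULT = r"(?u)\b\w\w+\b"       # tokens of 2+ word characters


def vocab_for(corpus, pattern):
    """Sorted distinct tokens found by `pattern` in the lowercased corpus."""
    return sorted({tok for doc in corpus for tok in re.findall(pattern, doc.lower())})


pattern_corpus = [
    "I love NLP.",
    "I love programming.",
    "NLP is a subfield of AI.",
]

ours = vocab_for(pattern_corpus, OUR_PATTERN)
theirs = vocab_for(pattern_corpus, SKLEARN_DEFAULT)
only_ours = sorted(set(ours) - set(theirs))

print("Ours:   ", ours)
print("Sklearn:", theirs)
print("Only in ours:", only_ours)
```

The tokens that appear only under our pattern are the single-character ones. To make the two vectorizers agree, one option is to pass `token_pattern=r"(?u)\b\w+\b"` to sklearn's `CountVectorizer`; this has not been run in this notebook, so treat it as a suggestion rather than a verified result.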