In [1]:
!pip install numpy pandas scikit-learn



# AIG230 NLP (Week 3 Lab) — Notebook 1: Text Representation

This notebook focuses on **turning raw text into numeric features** you can use in real-world ML systems.

You will build:
- a clean **train/test split**
- **Bag-of-Words** (binary and count)
- **Document-Term Matrix** (DTM)
- **TF-IDF** (with n-grams)
- **Hashing trick** (production-friendly)
- basic **retrieval** (cosine similarity) and a **baseline classifier**
- model **persistence** (save/load)

## 0) Setup


In [2]:

import re
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import joblib


## 1) A small, realistic dataset (you can replace with your own CSV)


In industry, text often comes with:
- an **ID**
- free-text **description**
- a **label** (category, priority, intent, topic) or a target (churn, fraud, etc.)

Here we create a toy dataset that looks like support tickets / ops incidents.  
Swap this section with a `pd.read_csv(...)` in your own workflows.


In [3]:

data = [
    ("T-001", "VPN keeps disconnecting every 10 minutes on Windows 11 after latest update", "network"),
    ("T-002", "Password reset link is expired and user cannot login to the portal", "auth"),
    ("T-003", "Email delivery delayed, outbound messages queued for hours", "messaging"),
    ("T-004", "Cannot install printer driver, installer fails with error code 1603", "device"),
    ("T-005", "MFA prompt never arrives on mobile app, user stuck at login", "auth"),
    ("T-006", "WiFi signal drops in meeting rooms, access point reboot helps temporarily", "network"),
    ("T-007", "Outlook search not returning results, index seems corrupted", "messaging"),
    ("T-008", "Laptop battery drains fast after BIOS update, power settings unchanged", "device"),
    ("T-009", "Portal shows 500 error when submitting form, happened after deployment", "app"),
    ("T-010", "API requests timing out, latency spike observed in last hour", "app"),
    ("T-011", "User cannot access shared drive, permission denied though in correct group", "auth"),
    ("T-012", "Teams calls have choppy audio, jitter high on corporate network", "network"),
    ("T-013", "Push notifications not working on Android for the app", "app"),
    ("T-014", "Mailbox is full and cannot receive emails, auto-archive not running", "messaging"),
    ("T-015", "Bluetooth mouse not pairing after restart, device shows as unknown", "device"),
    ("T-016", "Zoom client crashes when joining scheduled meeting after update", "app"),
    ("T-017", "Network drive mapping lost after system reboot, user needs remount", "network"),
    ("T-018", "SSO login loop occurs after password change on corporate portal", "auth"),
    ("T-019", "Excel freezes when opening large CSV files from shared drive", "app"),
    ("T-020", "Outlook calendar events missing after profile rebuild", "messaging"),
    ("T-021", "VPN fails to reconnect automatically when switching WiFi networks", "network"),
    ("T-022", "OneDrive sync stuck on pending, files not uploading", "app"),
    ("T-023", "Monitor not detected on docking station after sleep mode", "device"),
    ("T-024", "Slack notifications delayed until app reopened on desktop", "messaging"),
    ("T-025", "Git push fails due to SSL certificate validation error", "app"),
]

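# Note: the toy `data` list above is not used in this run.
# The DataFrame is loaded from a custom CSV (columns: `text`, `label`), as in Exercise 4.
df = pd.read_csv('dataset.csv')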
df


Unnamed: 0,text,label
0,Package still not delivered even after estimat...,delivery
1,Payment declined but money deducted from my bank,payment
2,Received damaged item inside the parcel,product
3,Cannot login to my account forgot email used,account
4,Order confirmation email never received after ...,order
5,Product description on website differs from ac...,product
6,Refund not processed after returning the shoes,payment
7,Tracking link not updating since last three days,delivery
8,Account locked due to multiple failed login at...,account
9,Placed order twice by mistake need one canceled,order


### Train/test split


In [4]:

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.33, random_state=42, stratify=df["label"]
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))


Train size: 13
Test size: 7
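
The `stratify` argument in the split above keeps label proportions similar in both halves. A minimal sketch of that effect, using made-up ticket texts and labels (the names `demo_texts` and `demo_labels` are illustrative, not part of the lab dataset):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 auth, 6 network, 4 device
demo_texts = [f"ticket {i}" for i in range(20)]
demo_labels = ["auth"] * 10 + ["network"] * 6 + ["device"] * 4

_, _, demo_y_train, demo_y_test = train_test_split(
    demo_texts, demo_labels, test_size=0.5, random_state=0, stratify=demo_labels
)

# Each half keeps the 5:3:2 class ratio of the full dataset
train_counts = Counter(demo_y_train)
test_counts = Counter(demo_y_test)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Without `stratify`, a small test set can easily miss a class entirely, which makes per-class metrics unreliable.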


## 2) Tokenization basics and normalization (lightweight, practical)


In production pipelines you typically do **minimal, safe normalization**:
- lowercase
- normalize whitespace
- optionally strip obvious punctuation
- keep numbers when they carry meaning (error codes, versions, dates)

Heavy normalization (stemming, aggressive regexes) can hurt when your text includes:
error codes, product names, IDs, or domain terminology.
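
To make the risk concrete, here is a small sketch comparing light normalization with a hypothetical "letters only" cleanup. The function `aggressive_normalize` is an illustrative assumption, not something this lab uses:

```python
import re

def light_normalize(text: str) -> str:
    """Lowercase and collapse whitespace; digits and punctuation survive."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def aggressive_normalize(text: str) -> str:
    """Hypothetical heavy cleanup: drop everything that is not a letter or space."""
    letters_only = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", letters_only).strip()

sample = "Installer  fails with error code 1603 on Windows 11"
light = light_normalize(sample)
heavy = aggressive_normalize(sample)

print(light)  # error code and OS version are preserved
print(heavy)  # "1603" and "11" are gone
```

After the aggressive version, a ticket about error 1603 looks the same as a ticket about any other installer error.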


In [5]:

def simple_normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["text_norm"] = df["text"].map(simple_normalize)
df[["text_norm","label"]].head()


Unnamed: 0,text_norm,label
0,package still not delivered even after estimat...,delivery
1,payment declined but money deducted from my bank,payment
2,received damaged item inside the parcel,product
3,cannot login to my account forgot email used,account
4,order confirmation email never received after ...,order


## 3) Vocabulary + Document-Term Matrix (DTM) with CountVectorizer


**CountVectorizer** builds:
- a vocabulary (token → column index)
- a sparse matrix where rows are documents and columns are tokens

This is the classic **Document-Term Matrix** representation.


In [6]:

count_vec = CountVectorizer(
    lowercase=True,
    token_pattern=r"(?u)\b\w+\b",  # keeps numeric tokens ("500", "1603") and 1-character tokens the default pattern drops
    min_df=1
)

X_train_counts = count_vec.fit_transform(X_train)
X_test_counts  = count_vec.transform(X_test)

print("DTM shape (train):", X_train_counts.shape)
print("Vocabulary size:", len(count_vec.vocabulary_))


DTM shape (train): (13, 80)
Vocabulary size: 80


### Inspect the vocabulary and a single row


In [7]:

# Show a small slice of the vocabulary (token -> index)
vocab_items = sorted(count_vec.vocabulary_.items(), key=lambda x: x[1])[:25]
vocab_items


[('accepted', 0),
 ('account', 1),
 ('actual', 2),
 ('after', 3),
 ('attempts', 4),
 ('being', 5),
 ('but', 6),
 ('by', 7),
 ('canceled', 8),
 ('card', 9),
 ('checkout', 10),
 ('color', 11),
 ('confirmation', 12),
 ('courier', 13),
 ('credit', 14),
 ('damaged', 15),
 ('date', 16),
 ('days', 17),
 ('delivered', 18),
 ('description', 19),
 ('different', 20),
 ('differs', 21),
 ('due', 22),
 ('during', 23),
 ('email', 24)]

In [8]:

# Look at a specific document row: non-zero entries (token counts)
row_id = 0
row = X_train_counts[row_id]
inv_vocab = {idx: tok for tok, idx in count_vec.vocabulary_.items()}

nz_cols = row.nonzero()[1]
tokens_counts = sorted([(inv_vocab[c], int(row[0, c])) for c in nz_cols], key=lambda x: -x[1])
tokens_counts[:20]


[('refund', 1),
 ('not', 1),
 ('processed', 1),
 ('after', 1),
 ('returning', 1),
 ('the', 1),
 ('shoes', 1)]

## 4) Binary vs Count-based Bag-of-Words


Binary BoW: token present or not (good for short texts and some classification tasks)  
Count BoW: raw frequency (baseline for many pipelines)

Both discard word order.


In [9]:
binary_vec = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
X_train_bin = binary_vec.fit_transform(X_train)


In [10]:
X_train_bin.shape

(13, 80)

In [11]:
X_train_bin

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 98 stored elements and shape (13, 80)>

## 5) TF-IDF (a refinement, not a replacement)


TF-IDF downweights very common tokens and upweights tokens that are more distinctive.

In industry, TF-IDF with **n-grams** is a strong baseline for:
- ticket routing
- intent detection
- spam detection
- incident clustering


In [12]:
tfidf_vec = TfidfVectorizer(
    ngram_range=(1,3),         # unigrams + bigrams + trigrams
    token_pattern=r"(?u)\b\w+\b",
    min_df=1,
    sublinear_tf=True          # common practical tweak
)

X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf  = tfidf_vec.transform(X_test)



In [13]:
print("TF-IDF shape (train):", X_train_tfidf.shape)

TF-IDF shape (train): (13, 237)


## 6) Quick retrieval: 'find similar tickets' with cosine similarity


A very common industry use case is **nearest neighbor retrieval** for:
- deduplication
- suggesting knowledge base articles
- finding similar past incidents
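
Cosine similarity itself is just a normalized dot product. The sketch below, on three made-up tickets (the `sim_tickets` list is illustrative), compares scikit-learn's `cosine_similarity` with a manual NumPy computation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sim_tickets = [
    "vpn keeps disconnecting on windows",
    "password reset link expired",
    "vpn fails to reconnect on wifi",
]

sim_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X_sim = sim_vec.fit_transform(sim_tickets)
q_sim = sim_vec.transform(["vpn disconnecting"])

sk_sims = cosine_similarity(q_sim, X_sim).ravel()

# Manual version: dot product divided by the product of vector norms
X_dense = X_sim.toarray()
q_dense = q_sim.toarray().ravel()
manual_sims = X_dense @ q_dense / (
    np.linalg.norm(X_dense, axis=1) * np.linalg.norm(q_dense)
)

best = int(np.argmax(sk_sims))
print(sk_sims)
print("best match:", sim_tickets[best])
```

A ticket that shares no tokens with the query scores exactly zero, which is why retrieval results can end with rows of `0.000000` similarity.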


In [14]:
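# Build a search index from ALL tickets using TF-IDF.
# Use a separate vectorizer so the one fitted on X_train above is not overwritten.
search_vec = TfidfVectorizer(
    ngram_range=(1,3),
    token_pattern=r"(?u)\b\w+\b",
    sublinear_tf=True
)
X_all = search_vec.fit_transform(df["text"])

def search_similar(query: str, top_k: int = 5):
    qv = search_vec.transform([query])
    sims = cosine_similarity(qv, X_all).ravel()
    top_idx = np.argsort(-sims)[:top_k]  # row positions of the highest similarities
    return df.iloc[top_idx][["text","label"]].assign(similarity=sims[top_idx])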

print(search_similar("My package says delivered but I never received it", top_k=5))
print(search_similar("Payment failing in PayPal", top_k=5))
print(search_similar("I can’t sign into my account after changing my phone number", top_k=5))


                                                 text     label  similarity
16  Courier marked package delivered but not received  delivery    0.357687
4   Order confirmation email never received after ...     order    0.246886
0   Package still not delivered even after estimat...  delivery    0.126695
1    Payment declined but money deducted from my bank   payment    0.125187
3        Cannot login to my account forgot email used   account    0.069462
                                                 text    label  similarity
10      Credit card not being accepted during payment  payment    0.150412
1    Payment declined but money deducted from my bank  payment    0.139672
14      Order stuck in processing status for two days    order    0.139169
13  Unable to update phone number in profile settings  account    0.138211
4   Order confirmation email never received after ...    order    0.000000
                                                 text    label  similarity
3        Cannot log

## 7) Classification baseline (Logistic Regression)


For text classification, a strong baseline is:

**TF-IDF → Linear model (LogReg / Linear SVM)**

This is fast, reliable, easy to explain, and often hard to beat without deep learning.
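
The Linear SVM variant mentioned above has the same shape as the LogReg pipeline. A minimal sketch on four made-up tickets (the data and the name `svm_pipe` are assumptions for illustration), using `LinearSVC` as the linear model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

svm_texts = [
    "vpn keeps disconnecting",
    "wifi signal drops in meeting rooms",
    "password reset link expired",
    "mfa prompt never arrives at login",
]
svm_labels = ["network", "network", "auth", "auth"]

svm_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")),
    ("model", LinearSVC()),
])
svm_pipe.fit(svm_texts, svm_labels)

# The query only shares tokens with the "network" tickets
svm_pred = svm_pipe.predict(["vpn drops on wifi"])
print(svm_pred)
```

Swapping the final step is the only change, which is one reason to keep the vectorizer and the model together in a `Pipeline`.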


In [15]:

clf = LogisticRegression(max_iter=2000)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1,3),
        token_pattern=r"(?u)\b\w+\b",
        sublinear_tf=True
    )),
    ("model", clf)
])

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)

print(classification_report(y_test, pred))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))


              precision    recall  f1-score   support

     account       0.00      0.00      0.00         2
    delivery       0.33      1.00      0.50         1
       order       0.50      1.00      0.67         1
     payment       0.00      0.00      0.00         2
     product       0.50      1.00      0.67         1

    accuracy                           0.43         7
   macro avg       0.27      0.60      0.37         7
weighted avg       0.19      0.43      0.26         7

Confusion matrix:
 [[0 1 1 0 0]
 [0 1 0 0 0]
 [0 0 1 0 0]
 [0 1 0 0 1]
 [0 0 0 0 1]]




## 8) Production pattern: HashingVectorizer (no stored vocab)


In production, you may need:
- constant memory usage
- privacy (no stored vocabulary that could expose raw tokens)
- streaming support
- easier deployment across services

**HashingVectorizer** avoids building a vocabulary. Tradeoff: collisions.
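
To see both properties, the sketch below hashes ten made-up tokens into a deliberately tiny space of 4 columns. The `n_features=4` setting is only for demonstration; real pipelines use far larger values so that collisions are rare:

```python
from sklearn.feature_extraction.text import HashingVectorizer

hash_tokens = ["vpn", "wifi", "password", "printer", "outlook",
               "mfa", "sso", "bios", "zoom", "slack"]

tiny_hasher = HashingVectorizer(
    n_features=4,            # unrealistically small, to force collisions
    alternate_sign=False,
    norm=None,
    token_pattern=r"(?u)\b\w+\b",
)

# No fit step is needed: the hasher is stateless and stores no vocabulary
token_to_col = {
    tok: int(tiny_hasher.transform([tok]).nonzero()[1][0]) for tok in hash_tokens
}

for tok, col in token_to_col.items():
    print(f"{tok:10s} -> column {col}")

n_used = len(set(token_to_col.values()))
print("distinct columns used:", n_used, "for", len(hash_tokens), "tokens")
```

Ten tokens cannot fit into four columns without sharing, so some unrelated tokens end up in the same feature. That sharing is the collision tradeoff, and it also means you cannot map a column back to a token.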


In [16]:

hash_pipe = Pipeline([
    ("hash", HashingVectorizer(
        n_features=2**18,        # tune for your scale
        alternate_sign=False,    # keep hashed feature values non-negative (no sign flips)
        ngram_range=(1,3),
        token_pattern=r"(?u)\b\w+\b"
    )),
    ("model", LogisticRegression(max_iter=2000))
])

hash_pipe.fit(X_train, y_train)
pred_hash = hash_pipe.predict(X_test)
print(classification_report(y_test, pred_hash))



              precision    recall  f1-score   support

     account       0.00      0.00      0.00         2
    delivery       0.33      1.00      0.50         1
       order       0.50      1.00      0.67         1
     payment       0.00      0.00      0.00         2
     product       0.50      1.00      0.67         1

    accuracy                           0.43         7
   macro avg       0.27      0.60      0.37         7
weighted avg       0.19      0.43      0.26         7





## 9) Save and load the model (typical deployment step)


In [17]:

model_path = "week3_text_representation_model.joblib"
joblib.dump(pipeline, model_path)

loaded = joblib.load(model_path)
loaded.predict(["portal returns 500 error after deploy"])


array(['order'], dtype=object)
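
A quick sanity check after persisting a model is to confirm that the reloaded pipeline reproduces the original predictions. The sketch below does this on made-up data and writes to a temporary directory; the file name `demo_model.joblib` and all variable names are illustrative:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rt_texts = [
    "vpn keeps disconnecting",
    "wifi signal drops",
    "password reset link expired",
    "mfa prompt never arrives",
]
rt_labels = ["network", "network", "auth", "auth"]

rt_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")),
    ("model", LogisticRegression(max_iter=2000)),
]).fit(rt_texts, rt_labels)

# Save the whole pipeline (vectorizer + model) and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    rt_path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(rt_pipe, rt_path)
    rt_loaded = joblib.load(rt_path)

rt_queries = ["vpn drops", "password expired"]
same_labels = np.array_equal(rt_pipe.predict(rt_queries), rt_loaded.predict(rt_queries))
same_probs = np.allclose(rt_pipe.predict_proba(rt_queries), rt_loaded.predict_proba(rt_queries))
print("round-trip OK:", same_labels and same_probs)
```

Saving the full pipeline rather than only the classifier ensures the fitted vocabulary travels with the model. Only load joblib files from sources you trust, and load them with the same scikit-learn version used to save them where possible.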

## Exercises (do these during lab)
1) Add 10 more tickets to `data` with realistic wording and labels. Re-train and compare results.  
2) Try `ngram_range=(1,3)` and observe what changes.  
3) For retrieval, test at least 3 queries and explain why the top result makes sense.  
4) Replace the dataset with a CSV you create (columns: `text`, `label`) and rerun the notebook.


1. Accuracy increases with more data.  
2. The shape of `X_train_tfidf` grows from (13, 165) to (13, 237) when `ngram_range=(1,3)` adds trigram features.  
3. The top results make sense because they contain keywords found in the queries.