# Mini AI Pipeline: AG News Headline Classification

This notebook follows the small-pipeline plan: it compares a keyword-rule baseline against MiniLM sentence embeddings with a linear classifier. The whole notebook runs end-to-end in under 10 minutes on CPU.

## Setup
Uncomment the install line below if the lightweight dependencies are missing. Leave it commented out in managed environments that already include them.

In [1]:
# !pip install -q datasets scikit-learn sentence-transformers torch pandas numpy

In [2]:
import random
from collections import Counter
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

## Load and subsample data
Load AG News and keep a small random subset for speed: 2,000 training examples and 500 test examples. Text is lowercased and stripped of surrounding whitespace.

In [3]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

dataset = load_dataset('ag_news')
label_names = dataset['train'].features['label'].names

train_df = dataset['train'].to_pandas().sample(n=2000, random_state=SEED).reset_index(drop=True)
test_df = dataset['test'].to_pandas().sample(n=500, random_state=SEED).reset_index(drop=True)

def preprocess(text: str) -> str:
    return text.lower().strip()

train_df['text'] = train_df['text'].apply(preprocess)
test_df['text'] = test_df['text'].apply(preprocess)
train_df.head()

Unnamed: 0,text,label
0,"bbc set for major shake-up, claims newspaper l...",2
1,marsh averts cash crunch embattled insurance b...,2
2,"jeter, yankees look to take control (ap) ap - ...",1
3,flying the sun to safety when the genesis caps...,3
4,stocks seen flat as nortel and oil weigh new ...,2
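
Because the subset is drawn at random, it can be worth checking how the classes are distributed before training. The sketch below counts labels with `Counter` (imported above but otherwise unused). It uses only the five labels visible in the `head()` output as stand-in data; in the notebook you would pass `train_df['label']` instead. The class-name order is assumed to match `label_names`.

```python
from collections import Counter

# AG News class names in label-id order (assumed to match label_names).
label_names = ['World', 'Sports', 'Business', 'Sci/Tech']

# Stand-in data: the labels of the five rows shown by train_df.head().
# In the notebook, use train_df['label'] here instead.
labels = [2, 2, 1, 3, 2]

counts = Counter(labels)
total = len(labels)
for label_id, name in enumerate(label_names):
    n = counts.get(label_id, 0)
    print(f'{name:<10} {n:>4}  {n / total:.1%}')
```

A strongly skewed distribution would make accuracy harder to interpret and would change which class the baseline falls back to.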


## Baseline: keyword rules
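Simple keyword counts per class. Ties between classes are broken by class name (the name that sorts last wins), and headlines with no keyword match fall back to the majority class of the training subset.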

In [4]:
keywords = {
    'World': {'war', 'government', 'minister', 'president', 'iraq', 'election', 'peace'},
    'Sports': {'win', 'wins', 'victory', 'coach', 'season', 'game', 'team', 'vs', 'cup'},
    'Business': {'market', 'profit', 'shares', 'stocks', 'deal', 'company', 'dollar', 'trade'},
    'Sci/Tech': {'software', 'research', 'technology', 'chip', 'internet', 'science', 'phone', 'data'}
}
label_to_name = dict(enumerate(label_names))
name_to_label = {v: k for k, v in label_to_name.items()}
majority_label = train_df['label'].mode().iloc[0]

def baseline_predict(text: str) -> int:
    tokens = text.split()
    scores = {name: 0 for name in keywords}
    for token in tokens:
        for cls, vocab in keywords.items():
            if token in vocab:
                scores[cls] += 1
    best_name = max(scores.items(), key=lambda x: (x[1], x[0]))[0]
    if scores[best_name] == 0:
        return majority_label
    return name_to_label[best_name]

baseline_preds = [baseline_predict(t) for t in test_df['text']]
baseline_acc = accuracy_score(test_df['label'], baseline_preds)
baseline_f1 = f1_score(test_df['label'], baseline_preds, average='macro')
print(f'Baseline accuracy: {baseline_acc:.3f}, macro-F1: {baseline_f1:.3f}')

Baseline accuracy: 0.460, macro-F1: 0.447
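
One likely reason the baseline misses keywords is that `baseline_predict` tokenizes with `text.split()`, which leaves punctuation attached to words (for example `shake-up,` in the first training row). The sketch below contrasts whitespace splitting with a simple regex tokenizer. The regex tokenizer and the example headline are hypothetical illustrations, not part of the notebook, and the effect on the reported scores has not been measured.

```python
import re

# Business keywords copied from the baseline cell above.
BUSINESS = {'market', 'profit', 'shares', 'stocks', 'deal', 'company', 'dollar', 'trade'}

def whitespace_tokens(text):
    """Tokenize as the baseline does: split on whitespace only."""
    return text.split()

def regex_tokens(text):
    """Hypothetical alternative: keep runs of letters and digits, drop punctuation."""
    return re.findall(r'[a-z0-9]+', text.lower())

# Hypothetical headline, already lowercased like the preprocessed data.
headline = 'stocks, shares slide as oil weighs on market.'

ws_hits = sum(tok in BUSINESS for tok in whitespace_tokens(headline))
re_hits = sum(tok in BUSINESS for tok in regex_tokens(headline))
print(f'whitespace hits: {ws_hits}, regex hits: {re_hits}')
```

With whitespace splitting, `stocks,` and `market.` do not match the vocabulary, so only `shares` is counted; the regex tokenizer counts all three.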


## AI pipeline: MiniLM embeddings + logistic regression
Embed headlines, then train a linear classifier.

In [5]:
encoder = SentenceTransformer('all-MiniLM-L6-v2')

# encode() runs on the model's own device by default, so no device argument is needed.
train_embeddings = encoder.encode(
    train_df['text'].tolist(), batch_size=64, show_progress_bar=True, convert_to_numpy=True
)
test_embeddings = encoder.encode(
    test_df['text'].tolist(), batch_size=64, show_progress_bar=True, convert_to_numpy=True
)

# multi_class is omitted: it is deprecated in recent scikit-learn releases, and the
# default lbfgs solver already fits a multinomial model for multiclass targets.
clf = LogisticRegression(max_iter=1000, C=4.0, n_jobs=-1)
clf.fit(train_embeddings, train_df['label'])

preds = clf.predict(test_embeddings)
acc = accuracy_score(test_df['label'], preds)
macro_f1 = f1_score(test_df['label'], preds, average='macro')
print(f'Pipeline accuracy: {acc:.3f}, macro-F1: {macro_f1:.3f}')

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Pipeline accuracy: 0.850, macro-F1: 0.852
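
The `huggingface/tokenizers` warning in the output above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch follows, assuming the variable is set at the top of the notebook before the encoder is first used. The choice of `false` is an assumption that favors quiet logs over tokenizer parallelism.

```python
import os

# Set this before sentence_transformers / tokenizers is first used,
# e.g. in the first code cell of the notebook.
# 'false' disables tokenizer parallelism up front, so a later fork has nothing to warn about.
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

print(os.environ['TOKENIZERS_PARALLELISM'])
```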


## Qualitative differences
Collect examples where baseline and pipeline disagree.

In [6]:
diffs = []
for text, true_label, base_pred, model_pred in zip(test_df['text'], test_df['label'], baseline_preds, preds):
    if base_pred != model_pred:
        diffs.append({
            'text': text,
            'true': label_to_name[true_label],
            'baseline': label_to_name[base_pred],
            'pipeline': label_to_name[model_pred],
        })

pd.DataFrame(diffs).head(10)

Unnamed: 0,text,true,baseline,pipeline
0,paris tourists search for key to 'da vinci cod...,World,Sports,World
1,net firms: don't tax voip the spanish-american...,Sci/Tech,Sci/Tech,Business
2,dependent species risk extinction the global e...,Sci/Tech,Sports,Sci/Tech
3,eds is charter member of siebel bpo alliance (...,Sci/Tech,Sports,Business
4,"forgoing stiff upper lip, charles jousts with ...",World,World,Sports
5,profit plunges at international game tech inte...,Business,Sports,Business
6,salvaging genesis despite a seemingly calamito...,Sci/Tech,Sports,Sci/Tech
7,general mills goes whole grains new york (cnn/...,Business,Sports,Business
8,big merger could box qwest in qwest communicat...,Business,Sports,Business
9,mortgage approvals drop sharply figures showin...,Business,Sports,Business
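
To summarize the disagreements rather than read them row by row, one option is to tally which system matched the true label. The sketch below does this for the ten rows shown above, copied by hand as `(true, baseline, pipeline)` tuples; the helper `who_is_right` is a hypothetical name. In the notebook the same tally could be run over the full `diffs` list.

```python
from collections import Counter

# (true, baseline, pipeline) for the ten disagreement rows shown above.
rows = [
    ('World', 'Sports', 'World'),
    ('Sci/Tech', 'Sci/Tech', 'Business'),
    ('Sci/Tech', 'Sports', 'Sci/Tech'),
    ('Sci/Tech', 'Sports', 'Business'),
    ('World', 'World', 'Sports'),
    ('Business', 'Sports', 'Business'),
    ('Sci/Tech', 'Sports', 'Sci/Tech'),
    ('Business', 'Sports', 'Business'),
    ('Business', 'Sports', 'Business'),
    ('Business', 'Sports', 'Business'),
]

def who_is_right(true, baseline, pipeline):
    """Classify one disagreement; both cannot be right since the predictions differ."""
    if pipeline == true:
        return 'pipeline'
    if baseline == true:
        return 'baseline'
    return 'neither'

tally = Counter(who_is_right(*row) for row in rows)
print(dict(tally))
```

On these ten rows the pipeline is correct seven times, the baseline twice, and neither once. Ten rows are too few to generalize from.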


## Classification report (optional detail)

In [7]:
print(classification_report(test_df['label'], preds, target_names=label_names))

              precision    recall  f1-score   support

       World       0.84      0.87      0.85       121
      Sports       0.93      0.97      0.95       116
    Business       0.80      0.78      0.79       124
    Sci/Tech       0.83      0.79      0.81       139

    accuracy                           0.85       500
   macro avg       0.85      0.85      0.85       500
weighted avg       0.85      0.85      0.85       500
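
The two average rows of the report differ only in how they weight classes: the macro average is a plain mean over classes, while the weighted average weights each class by its support. The sketch below recomputes both F1 averages from the per-class rows above. It works from the rounded two-decimal values in the report, so it only approximates what scikit-learn computes from the raw predictions.

```python
# Per-class (f1-score, support) copied from the report above.
report = {
    'World': (0.85, 121),
    'Sports': (0.95, 116),
    'Business': (0.79, 124),
    'Sci/Tech': (0.81, 139),
}

f1_scores = [f1 for f1, _ in report.values()]
supports = [support for _, support in report.values()]

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * s for f1, s in zip(f1_scores, supports)) / sum(supports)

print(f'macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}')
```

Both values round to 0.85 here because the four classes have similar support in the 500-example test subset.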

