# HW 4: Text Classification
### COSC 426: Fall 2025, Colgate University

Use this notebook to answer questions and load + display from your experiments. Feel free to add as many code and markdown chunks as you would like in each of the sub-sections. 

**If you use any external resources (e.g., code snippets, reference articles), please cite them in comments or text!**

## Part 1: Train and evaluate `distilgpt` based Bayesian Classifier

In this part, load in the probability estimates from your finetuned **language models**, use them to classify text, and display classification results.

We want to calculate the probability of text given its class (the likelihood), so we finetune two separate models: one trained only on positive-sentiment text and the other only on negative-sentiment text.

For the train configuration, we will use minimal pair comparison mode with `loadPretrained` set to `true` (or left unspecified, since it is true by default) because we want to finetune the distilgpt model.

Also, since we want to train 2 separate models, we will have 2 different pairs of train-val datasets: one with positive reviews and another with negative reviews.

The preprocessed training/validation data will be in the following format:

| text |
| :-------: |
| `review` |
| ... |

In order to calculate the accuracy of the model, we need to compare the likelihoods under the neg and pos models, then use the class with the higher log probability as `predicted`. The evaluation data will be in the following format:

| sentid | pairid | comparison | sentence | target |
| :-------: | :------: | :-------: | :-------: | :-------: |
| 0 | 0 | expected | `review` | `neg` |
| 1 | 1 | expected | `review` | `neg` |
| ... | ... | ... | ... | ... |

### Train & Val Data Generation

In [88]:
import os 

nf = "./aclImdb/train/neg"
ntfs = [f"{nf}/{path}" for path in os.listdir(nf)[:2500]]  # negative train files
nvfs = [f"{nf}/{path}" for path in os.listdir(nf)[2500:3000]]  # negative val files

pf = "./aclImdb/train/pos"
ptfs = [f"{pf}/{path}" for path in os.listdir(pf)[:2500]]  # positive train files
pvfs = [f"{pf}/{path}" for path in os.listdir(pf)[2500:3000]]  # positive val files

In [89]:
import csv
def txts2tsv(paths: list[str], savefpath: str) -> None:
    """Generates a tsv file consisting of the first line of each file

    Args:
        paths (list[str]): paths of source files
        savefpath (str): save path
    """
    rows = [['text']]
    for path in paths:
        with open(path, 'r') as f:
            rows.append([f.readline()])

    with open(savefpath, "w", newline="") as f:
        fw = csv.writer(f, delimiter="\t")  # tab-delimited to match the .tsv extension
        fw.writerows(rows)

In [90]:
t = {
    'data/neg_train.tsv': ntfs,
    'data/neg_val.tsv': nvfs,
    'data/pos_train.tsv': ptfs,
    'data/pos_val.tsv': pvfs
    }

for savefpath, paths in t.items():
    txts2tsv(paths, savefpath)

### Eval Dataset Generation

In [91]:
nef = "./aclImdb/test/neg"
nefs = [f"{nef}/{path}" for path in os.listdir(nef)[:1000]]  # negative eval files

pef = "./aclImdb/test/pos"
pefs = [f"{pef}/{path}" for path in os.listdir(pef)[:1000]]  # positive eval files

In [92]:
e = {
    'neg': nefs,
    'pos': pefs
    }
sentid = 0
pairid = 0
rows = [['sentid', 'pairid', 'comparison', 'sentence', 'target']]
for target, paths in e.items():
    for path in paths:    
        with open(path, 'r') as f:
            rows.append([sentid, pairid, "expected", f.readline(), target])
            sentid += 1
            pairid += 1

with open('data/bay_eval.tsv', 'w', newline='') as f:
    fw = csv.writer(f, delimiter='\t')
    fw.writerows(rows)

In [93]:
import pandas as pd 

ndf = pd.read_csv('predictions/neg_predictions.tsv', sep='\t')
pdf = pd.read_csv('predictions/pos_predictions.tsv', sep='\t')
df = pd.concat([ndf, pdf], ignore_index=True)

In [94]:
print("=" * 40)
likelihood = df['prob'].mean()
print(f"Overall Likelihood: {likelihood:.3f}")

prior = 1/len(['pos', 'neg'])
print(f"Assumed Prior: {prior:.1f}")
print("=" * 40)

Overall Likelihood: 0.208
Assumed Prior: 0.5


In [100]:
import numpy as np

ndf['logprob'] = np.log(ndf['prob'])
pdf['logprob'] = np.log(pdf['prob'])

gndf = ndf.groupby(["sentid"])["logprob"].mean()
gpdf = pdf.groupby(["sentid"])["logprob"].mean()

In [101]:
bdf = pd.read_csv('data/bay_eval.tsv', sep='\t')
cdf = pd.DataFrame({"neg_prob": gndf, "pos_prob": gpdf, "target": bdf["target"], "sentence": bdf["sentence"]})
cdf["predicted"] = np.where(
    cdf["neg_prob"] > cdf["pos_prob"], "neg", "pos"
)
cdf['correct'] = cdf["predicted"] == cdf["target"]

In [102]:
acc = cdf["correct"].mean()

print('=' * 40)
print(f'Accuracy: {acc * 100:.2f} %')
print('=' * 40)

Accuracy: 86.95 %


In general, P(class∣text) ∝ P(text∣class) · P(class). However, since we assume a uniform prior, P(class) is the same for both classes and does not affect the final ranking, making the likelihood the sole determining factor. A mean token probability of 0.21 is strong, considering that a model guessing uniformly at random would assign each token a probability of only 1/vocab-size, a value that is typically very small.
To calculate the accuracy, I first evaluated both the `neg_bay` and `pos_bay` models on the `bay_eval.tsv` dataset. Since the models' predictions are generated on a per-token basis, I grouped the tokens by sentence and calculated the average log probability. I chose to use log probabilities instead of raw probabilities because this prevents numerical underflow and penalizes low-probability tokens more harshly (log(p) → -∞ as p → 0). I also took the average, rather than the sum, of the log probabilities because averaging normalizes the scores for sentence length. Since both models share a tokenizer and score the same tokens of a given sentence, the mean and the sum should pick the same class; the normalization mainly keeps scores comparable across sentences of different lengths.
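
The two choices above, logs instead of raw probabilities and means instead of sums, can be illustrated with a small sketch. The token probabilities here are made up for illustration and are not taken from the models' prediction files.

```python
import numpy as np

# Hypothetical per-token probabilities (illustrative values only).
long_review = np.full(2000, 0.2)   # 2000 tokens, each with probability 0.2
short_review = np.full(20, 0.2)    # 20 tokens, same per-token probability

# Multiplying raw probabilities underflows to exactly 0.0 in float64.
raw_product = np.prod(long_review)

# Summing log probabilities stays finite, but the total grows with length.
long_sum = np.log(long_review).sum()
short_sum = np.log(short_review).sum()

# Averaging removes the length effect: both reviews get the same score.
long_mean = np.log(long_review).mean()
short_mean = np.log(short_review).mean()

print(raw_product)
print(long_sum, short_sum)
print(long_mean, short_mean)
```

The summed score of the long review is far lower than that of the short one even though every token is equally predictable, which is the length effect that averaging removes.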

Next, I compared the likelihoods of the negative and positive sentiments for each sentence and set whichever was higher as the predicted value. The final step was to evaluate whether the predicted value matched the target (or gold) label of the sentence. I then took the mean of these comparisons to calculate the overall accuracy.
I was very pleased with the resulting accuracy of 86.95%, which is a significant improvement over the 50% baseline of a random guess.
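
The earlier point that a uniform prior cannot change the prediction can be checked with a small sketch of Bayes' rule in log space. The log-likelihood values and the skewed prior are hypothetical, and `predict` is an illustrative helper rather than part of the pipeline used in this notebook.

```python
import math

def predict(log_likelihoods: dict[str, float], priors: dict[str, float]) -> str:
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {
        label: log_likelihoods[label] + math.log(priors[label])
        for label in log_likelihoods
    }
    return max(scores, key=scores.get)

# Hypothetical log-likelihoods of one review under each language model.
log_likelihoods = {"neg": -310.2, "pos": -309.5}

uniform = {"neg": 0.5, "pos": 0.5}
skewed = {"neg": 0.9, "pos": 0.1}  # hypothetical non-uniform prior

uniform_pred = predict(log_likelihoods, uniform)  # same as comparing likelihoods alone
skewed_pred = predict(log_likelihoods, skewed)    # a strong prior can flip a close call

print(uniform_pred, skewed_pred)
```

With the uniform prior, the same constant is added to both scores, so the ranking is decided by the likelihoods alone.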

## Part 2: Train and evaluate a `distilgpt` based TextClassification model
In this part, load in and display the results from your finetuned **TextClassification models**. 

In [53]:
t = {
    "data/textclassification_train.tsv": {'0': ntfs, '1': ptfs},
    "data/textclassification_val.tsv": {'0': nvfs, '1': pvfs},
}

for savefpath, sentiments in t.items():
    rows = [["text", "label"]]
    for sentiment, paths in sentiments.items():
        for path in paths:
            with open(path, "r") as f:
                text = f.readline()
                rows.append([text, sentiment])
    with open(savefpath, "w", newline="") as f:
        fw = csv.writer(f, delimiter='\t')
        fw.writerows(rows)

In [54]:
e = {
    "data/textclassification_eval.tsv":{
    'neg': nefs,
    'pos': pefs
    }
}
textid = 0
for savefpath, sentiments in e.items():
    rows = [["textid", "text", "target"]]
    for sentiment, paths in sentiments.items():
        for path in paths:
            with open(path, "r") as f:
                text = f.readline()
                rows.append([textid, text, sentiment])
                textid += 1
    with open(savefpath, "w", newline="") as f:
        fw = csv.writer(f, delimiter="\t")
        fw.writerows(rows)

In [55]:
# Trained w/ CPU
tdf = pd.read_csv('results/textclassification.tsv', sep='\t', index_col=0)

# Trained w/ CUDA w/ max_length=512
ctdf = pd.read_csv("results/cuda_textclassification.tsv", sep="\t", index_col=0)

# After discovering the `device: cuda` config option, I tried training the model on Turing with CUDA.
# However, I ran into some kind of index mismatch error.

# I got curious and was able to solve this error by adding a `max_length=512` param at `NLPScholar/src/trainers/HFTextClassificationTrainer.py:52`

# Strangely, `device: cpu/mps` seems to handle this case, only warning users ("Sequence length: 1275 is larger than maximum sequence length allowed by the model: 1024.  Using a stride of 512 in calculating token predictabilities.")

# I wanted to see if there would be any difference in the end.

# Just as a fun fact, it took me 34m 55s 329ms to train the model on CPU,
# but only a minute or so with CUDA. Very glad to have discovered this option.

# & now I sort of understand the industry's obsession with GPUs...

In [56]:
display(tdf)
display(ctdf)

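| | model | micro-precision | micro-recall | micro-f1 | macro-precision | macro-recall | macro-f1 | accuracy |
| :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: |
| 0 | /Users/jeong/Projects/cosc426-projects/cosc426... | 0.9145 | 0.9145 | 0.9145 | 0.91451 | 0.9145 | 0.914499 | 0.9145 |

| | model | micro-precision | micro-recall | micro-f1 | macro-precision | macro-recall | macro-f1 | accuracy |
| :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: |
| 0 | /Users/jeong/Projects/cosc426-projects/cosc426... | 0.9145 | 0.9145 | 0.9145 | 0.91457 | 0.9145 | 0.914496 | 0.9145 |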


The final results were straightforward. My text classification model achieved 91.45% accuracy (the CPU and CUDA runs matched), while the Bayesian model scored 86.95%. This gap of 4.5 percentage points likely comes down to what each model is trained to do. The Bayesian classifier is simple: each of its two language models is finetuned only to predict the next token in reviews of one sentiment, and the class is inferred indirectly by comparing their average token log probabilities. The weakness is that most tokens in a review say little about sentiment, so the signal from the few that do, including negation or tricky phrasing, can get diluted in the average. The second model's higher accuracy suggests that training directly on the labels lets it focus on the context and structure that actually separate the two classes. That 4.5-point improvement is plausibly the value of optimizing for the classification task itself.

## Part 3: Reflect on the two approaches to classification

In this part, answer the question in `HW4.md` in markdown chunks. If you used external sources to find and make sense of this, please cite them!

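The biggest difference is really about what each model is trained to predict and how that knowledge is used. Both approaches start from the same pretrained `distilgpt` model, so neither is built from scratch. The Bayesian classifier finetunes two language models, one per sentiment, to estimate P(text∣class), and then applies Bayes' rule to pick the class whose model gives the review the higher probability. Sentiment is never a training target; it only emerges from comparing the two likelihoods.

For the TextClassification approach, we took the same pretrained model, which already had a deep understanding of how language works, and finetuned it to predict the label directly, i.e. P(class∣text). So, it's basically the difference between a generative approach, which models the text and infers the label indirectly, and a discriminative one, which adapts an already "intelligent" system to the specific task of separating the two classes.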

## Part 4 (Optional): Error analysis comparing the two approaches to classification

In this part, include the results from your experiments (if you choose to attempt this)

### Error analysis on the Bayesian model

In [133]:
# For Bayesian
acc

neg_pre = (cdf.loc[cdf['predicted'] == 'neg', 'correct']).mean()
pos_pre = (cdf.loc[cdf['predicted'] == 'pos', 'correct']).mean()

neg_rec = (cdf.loc[cdf["target"] == "neg", "correct"]).mean()
pos_rec = (cdf.loc[cdf["target"] == "pos", "correct"]).mean()

neg_f1 = 2 * (neg_pre * neg_rec) / (neg_pre + neg_rec)
pos_f1 = 2 * (pos_pre * pos_rec) / (pos_pre + pos_rec)

### Error analysis on the TextClassification model

In [134]:
# For TextClassification
tcdf = pd.read_csv("predictions/textclassification.tsv", sep="\t", index_col=0)
tcdf["correct"] = tcdf["predicted"] == tcdf["target"]

tacc = tdf['accuracy'].iloc[0]

tneg_pre = (tcdf.loc[tcdf["predicted"] == "neg", "correct"]).mean()
tpos_pre = (tcdf.loc[tcdf["predicted"] == "pos", "correct"]).mean()

tneg_rec = (tcdf.loc[tcdf["target"] == "neg", "correct"]).mean()
tpos_rec = (tcdf.loc[tcdf["target"] == "pos", "correct"]).mean()

# Use the TextClassification precision/recall here, not the Bayesian ones
tneg_f1 = 2 * (tneg_pre * tneg_rec) / (tneg_pre + tneg_rec)
tpos_f1 = 2 * (tpos_pre * tpos_rec) / (tpos_pre + tpos_rec)

### Comparison

In [143]:
GREEN = "\033[92m"
BLUE = "\033[94m"
END = "\033[0m"

print("=" * 70)
print(f"{GREEN}Bayesian classifier{END}")
print(f"Accuracy| {acc*100:.2f}%")
print(f"Negative| Precision: {neg_pre:.4f}, Recall: {neg_rec:.4f}, F1: {neg_f1:.4f}")
print(f"Positive| Precision: {pos_pre:.4f}, Recall: {pos_rec:.4f}, F1: {pos_f1:.4f}")

print("=" * 70)

print(f"{BLUE}TextClassification model{END}")
print(f"Accuracy| {tacc*100:.2f}%")
print(f"Negative| Precision: {tneg_pre:.4f}, Recall: {tneg_rec:.4f}, F1: {tneg_f1:.4f}")
print(f"Positive| Precision: {tpos_pre:.4f}, Recall: {tpos_rec:.4f}, F1: {tpos_f1:.4f}")
print("=" * 70)

Bayesian classifier
Accuracy| 86.95%
Negative| Precision: 0.8549, Recall: 0.8900, F1: 0.8721
Positive| Precision: 0.8853, Recall: 0.8490, F1: 0.8668
TextClassification model
Accuracy| 91.45%
Negative| Precision: 0.9166, Recall: 0.9120, F1: 0.9143
Positive| Precision: 0.9124, Recall: 0.9170, F1: 0.9147


### Discussion

The models likely agree on most of the simple sentences, which is why the Bayesian model's accuracy is already high at about 87%. The accuracy gap of 4.5 percentage points comes from how they each handle the tougher cases.
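
How often the two models can agree is bounded by their error counts alone. The sketch below is a back-of-the-envelope calculation, assuming both models were scored on the same 2,000 evaluation reviews and using the two accuracies reported in this notebook; `agreement_bounds` is an illustrative helper, not part of the analysis code.

```python
def agreement_bounds(n: int, acc_a: float, acc_b: float) -> tuple[float, float]:
    """Bounds on the share of items where two binary classifiers agree.

    With two labels, the models disagree exactly when one is wrong and the
    other is right, so disagreements range from |errors_a - errors_b|
    (errors fully overlap) to errors_a + errors_b (errors never overlap).
    """
    errors_a = round(n * (1 - acc_a))
    errors_b = round(n * (1 - acc_b))
    max_disagree = min(errors_a + errors_b, n)
    min_disagree = abs(errors_a - errors_b)
    return (n - max_disagree) / n, (n - min_disagree) / n

low, high = agreement_bounds(2000, 0.8695, 0.9145)
print(f"Agreement between {low:.1%} and {high:.1%}")
```

Under these assumptions the models agree on at least 78.4% and at most 95.5% of the reviews; the exact figure would require joining the two prediction files row by row.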

The metrics show they make different kinds of mistakes. The Bayesian model is biased toward the negative class: its higher recall for negative sentences, combined with lower precision, means it tends to mislabel positive sentences as negative. The TextClassification model is not just more accurate; it's more balanced. Its precision and recall scores are nearly the same for both classes, showing it doesn't have a default guess. Its errors are probably on sentences that are truly ambiguous, while the Bayesian model's errors more likely come from its simpler token-wise scoring method.
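
The direction of each model's mistakes can be recovered from the recall values alone. This is a minimal sketch, assuming 1,000 negative and 1,000 positive evaluation reviews (the slice sizes used when building the eval set) and the recall values printed in the comparison; `error_counts` is an illustrative helper.

```python
def error_counts(neg_recall: float, pos_recall: float, per_class: int = 1000) -> dict[str, int]:
    """Turn per-class recall into counts of each kind of mistake."""
    return {
        "neg_as_pos": round(per_class * (1 - neg_recall)),
        "pos_as_neg": round(per_class * (1 - pos_recall)),
    }

bayes = error_counts(neg_recall=0.8900, pos_recall=0.8490)
classifier = error_counts(neg_recall=0.9120, pos_recall=0.9170)

print("Bayesian:", bayes)
print("TextClassification:", classifier)
```

Under these assumptions the Bayesian model mislabels 151 positive reviews as negative but only 110 negative reviews as positive, while the TextClassification model's errors split 88 to 83.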