## Headline Sentiment Analysis Model Training

This notebook fine-tunes Microsoft's MiniLM-L12-H384-uncased transformer for regression-based sentiment analysis of headlines. The model predicts a continuous sentiment score rather than a discrete class. It uses the Hugging Face `transformers` and `datasets` libraries for tokenisation, data loading, and training.

### Overview
- **Base Model**: Microsoft MiniLM-L12-H384-uncased (12-layer transformer with 384-dimensional hidden states)
- **Task**: Regression with a sequence-classification head that has a single output (sentiment scoring)
- **Dataset**: CSV file containing headline text and ratings
- **Hardware**: Apple Silicon (the model is moved to the `mps` device)

### Training Configuration
- **Data Split**: 80% train, 10% validation, 10% test
- **Epochs**: 10
- **Evaluation Strategy**: Per epoch; the checkpoint with the lowest validation MSE is reloaded at the end of training
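
The 80/10/10 split comes from two successive splits: 20% is held out first, and that held-out part is then halved. The sketch below shows the arithmetic in plain Python. The helper name `two_stage_split` and the seed are hypothetical and only for illustration; the notebook itself relies on `train_test_split` from `datasets`.

```python
import random


def two_stage_split(items, holdout_frac=0.2, seed=0):
    """Split items into train / validation / test using two successive splits."""
    rng = random.Random(seed)  # illustrative seed, not taken from the notebook
    shuffled = list(items)
    rng.shuffle(shuffled)

    # First split: hold out `holdout_frac` of the rows, train on the rest.
    n_holdout = round(len(shuffled) * holdout_frac)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]

    # Second split: halve the held-out rows into validation and test.
    n_val = len(holdout) // 2
    return train, holdout[:n_val], holdout[n_val:]


train, val, test = two_stage_split(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```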

### Output
The best model is evaluated on the test set, reporting the final MSE, MAE, and loss, and is then saved to `/models/`.
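
For reference, the two error metrics named above can be computed by hand as follows. This is a minimal plain-Python sketch with made-up labels and predictions; the notebook itself uses the `scikit-learn` implementations.

```python
def mse(labels, predictions):
    """Mean squared error: average of squared differences."""
    return sum((p - y) ** 2 for y, p in zip(labels, predictions)) / len(labels)


def mae(labels, predictions):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(p - y) for y, p in zip(labels, predictions)) / len(labels)


# Made-up sentiment scores, for illustration only.
labels = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 2.0]

print(mse(labels, predictions))  # about 0.4167
print(mae(labels, predictions))  # 0.5
```

MSE penalises large errors more heavily than MAE, which is why the two can rank models differently.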

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import load_dataset
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [None]:
checkpoint = "microsoft/MiniLM-L12-H384-uncased"

tokeniser = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=1,
    problem_type="regression"
)
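# Move the model to the Apple Silicon GPU (Metal Performance Shaders backend).
# This call fails on machines without MPS support.
model.to("mps")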

In [None]:
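def tokenize_function(examples):
    # Truncate to at most 512 tokens. Padding is left to DataCollatorWithPadding,
    # which pads each batch to its longest sequence at training time.
    return tokeniser(
        examples['text'],
        truncation=True,
        max_length=512,
    )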
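
To illustrate what truncation and per-batch padding do to token ID lists, here is a minimal plain-Python sketch. The helper `truncate_and_pad`, the pad ID of 0, and the sample IDs are assumptions made for illustration. Real tokenisers also keep special tokens in place when truncating, which this sketch does not.

```python
def truncate_and_pad(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs, for illustration only.
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 102]]
out = truncate_and_pad(batch, max_length=5)

print(out["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 2146]]
print(out["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```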

In [None]:
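# Note: Trainer reads the regression target from a column named `label` or `labels`.
dataset = load_dataset("csv", data_files="/data/headline_data.csv")
# batched=True tokenises many rows per call instead of one row at a time.
dataset = dataset.map(tokenize_function, batched=True)

# First split: 80% train, 20% held out.
split = dataset["train"].train_test_split(test_size=0.2)
train_dataset = split["train"]

# Second split: halve the held-out 20% into 10% validation and 10% test.
holdout = split["test"].train_test_split(test_size=0.5)
validation_dataset = holdout["train"]
test_dataset = holdout["test"]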

In [None]:
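def compute_metrics(eval_pred):
    """Compute MSE and MAE from the (predictions, labels) pair passed by Trainer."""
    predictions, labels = eval_pred
    # The single-output head yields shape (n, 1); flatten to (n,) to match the labels.
    predictions = predictions.flatten()
    labels = labels.flatten()

    return {
        "mse": mean_squared_error(labels, predictions),
        "mae": mean_absolute_error(labels, predictions),
    }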

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokeniser)
training_args = TrainingArguments(
    output_dir="/models",
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir="/logs",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="mse",
    greater_is_better=False,
    save_total_limit=2,
    dataloader_num_workers=0,
    push_to_hub=False,
)
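
With `load_best_model_at_end=True`, `metric_for_best_model="mse"`, and `greater_is_better=False`, the checkpoint reloaded after training is the one with the lowest validation MSE. A minimal plain-Python sketch of that selection rule follows. The helper `best_epoch` and the per-epoch values are hypothetical and do not come from a real run.

```python
def best_epoch(metric_per_epoch, greater_is_better=False):
    """Return the 1-based epoch whose metric is best under the given direction."""
    pick = max if greater_is_better else min
    best_index = pick(range(len(metric_per_epoch)), key=metric_per_epoch.__getitem__)
    return best_index + 1


# Made-up validation MSE per epoch, for illustration only.
validation_mse = [0.52, 0.41, 0.37, 0.39, 0.40]
print(best_epoch(validation_mse))  # 3
```

Note that this selects a checkpoint after the fact. It does not stop training early, so all configured epochs still run.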

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

In [None]:
test_results = trainer.evaluate(eval_dataset=test_dataset)

print(f"Test MSE: {test_results['eval_mse']:.4f}")
print(f"Test MAE: {test_results['eval_mae']:.4f}")
print(f"Test Loss: {test_results['eval_loss']:.4f}")

In [None]:
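# Save the model weights and config to output_dir ("/models").
trainer.save_model()
# Also save the tokeniser so the directory can be reloaded on its own
# (harmless if Trainer has already saved it).
tokeniser.save_pretrained(training_args.output_dir)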