## Results Summary

The table below summarizes test-set performance for all six models on the same stratified split. I have also included some key plots at the end of the notebook.

| Model | Test Accuracy | Test Macro F1 |
|---|---:|---:|
| MLP | 0.7400 | 0.6812 |
| RNN | 0.6988 | 0.6812 |
| LSTM | 0.7689 | 0.7458 |
| GRU | 0.7634 | 0.7330 |
| BERT | 0.8171 | 0.8033 |
| GPT | 0.8033 | 0.7815 |

Ranking by macro F1 (best to worst): **BERT > GPT > LSTM > GRU > MLP = RNN**.
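For reference, macro F1 is the unweighted mean of the per-class F1 scores, so a model that ignores minority classes can score well on accuracy and poorly on macro F1. The sketch below illustrates this with hypothetical toy labels (not the assignment data); the function names are my own, not taken from the training scripts.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: a classifier that always predicts the majority class.
labels = ["negative", "neutral", "positive"]
y_true = ["neutral"] * 6 + ["positive"] * 2 + ["negative"] * 2
y_pred = ["neutral"] * 10

toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred, labels)
print(f"accuracy={toy_accuracy:.4f}, macro F1={toy_macro_f1:.4f}")
```

In this toy case the two minority classes each contribute an F1 of zero, which pulls macro F1 well below accuracy.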

## 1. Training Dynamics

### Did your models show signs of overfitting or underfitting? What changes could address this?

Both MLP and LSTM showed **overfitting**, but the LSTM showed it more strongly.

- **MLP:** Training metrics improved steadily, while validation F1 plateaued around 0.66-0.69 after early epochs. This is moderate overfitting.
- **LSTM:** Training macro F1 became very high (close to perfect) while validation F1 peaked around 0.76 and then flattened or fluctuated. This is stronger overfitting.

Changes that could help:
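- Increase regularization (for example, higher dropout or weight decay).
- Stop training earlier, using validation macro F1 as the stopping criterion.
- Reduce model capacity (for example, a smaller hidden size or fewer layers).
- Try data augmentation or noise injection.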
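As one illustration of the early-stopping option, here is a minimal patience-based sketch driven by validation macro F1. The `EarlyStopping` class, its parameters, and the validation curve are hypothetical and are not taken from the training scripts.

```python
class EarlyStopping:
    """Stop when validation F1 has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_f1):
        """Record one epoch; return True when training should stop."""
        if val_f1 > self.best_score + self.min_delta:
            self.best_score = val_f1
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation curve that peaks and then flattens.
val_f1_history = [0.60, 0.70, 0.74, 0.76, 0.75, 0.76, 0.75, 0.74]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_f1 in enumerate(val_f1_history):
    if stopper.step(epoch, val_f1):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

In a real training loop one would also save a checkpoint whenever `best_epoch` updates and restore it after stopping.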

### How did class weights affect training stability and final performance?

Using class weights in `CrossEntropyLoss` helped the models pay more attention to the minority classes (especially negative and positive) instead of being dominated by the neutral class. This generally improved macro F1 and class balance, but it also made optimization noisier early in training, which is expected when minority-class errors receive larger gradient weight.
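To make the weighting concrete, the sketch below computes inverse-frequency class weights and a weighted cross-entropy in plain Python. The class counts are hypothetical, and the inverse-frequency formula is one common choice that I am assuming here, not necessarily the one used in the scripts. The normalization (dividing by the sum of the target weights) follows the weighted-mean behavior of PyTorch's `CrossEntropyLoss(weight=...)` with its default `reduction="mean"`, assuming that is the loss in use.

```python
import math


def inverse_frequency_weights(class_counts):
    """w_c = N / (K * n_c): rarer classes receive larger weights."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {c: total / (num_classes * n) for c, n in class_counts.items()}


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target), normalized by the sum of target weights."""
    weighted_losses = [weights[t] * -math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(weighted_losses) / sum(weights[t] for t in targets)


# Hypothetical class counts, for illustration only.
class_counts = {"negative": 100, "neutral": 600, "positive": 300}
weights = inverse_frequency_weights(class_counts)
uniform = {c: 1.0 for c in class_counts}

# Two hypothetical predictions: a poor one on a rare class, a better one on neutral.
probs = [
    {"negative": 0.25, "neutral": 0.50, "positive": 0.25},
    {"negative": 0.25, "neutral": 0.50, "positive": 0.25},
]
targets = ["negative", "neutral"]

weighted_loss = weighted_cross_entropy(probs, targets, weights)
unweighted_loss = weighted_cross_entropy(probs, targets, uniform)
print(weights)
print(f"weighted={weighted_loss:.4f}, unweighted={unweighted_loss:.4f}")
```

The error on the rare negative example dominates the weighted loss, which is consistent with the noisier early optimization described above.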

## 2. Model Performance and Error Analysis

### Which of your two models generalized better to the test set?

**LSTM generalized better than MLP.**

- MLP test macro F1: **0.6812**
- LSTM test macro F1: **0.7458**

The LSTM also had higher test accuracy (0.7689 vs. 0.7400).

### Which sentiment class was most frequently misclassified, and why?

For both MLP and LSTM, the **positive class** had the lowest per-class F1:

- MLP: Positive F1 = **0.5969**
- LSTM: Positive F1 = **0.6425**

Likely reasons:
- Positive financial statements are often subtle and can be linguistically close to neutral statements.
- The neutral class is the largest class, so borderline examples are often pulled toward neutral predictions.
- Mean pooling (used by the MLP) in particular discards local contextual cues and sentiment shifters such as negation.

## 3. Cross-Model Comparison

### How did mean-pooled FastText embeddings limit the MLP vs. sequence models?

Mean pooling removes token order and long-distance dependencies. The MLP receives one fixed sentence vector and cannot model sequence structure, negation scope, or phrase-level interactions as effectively as sequence models.
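A small sketch of the order problem: two sentences built from the same words but with opposite meaning produce the same mean-pooled vector. The 2-dimensional embeddings below are hypothetical toy values, not real FastText vectors.

```python
def mean_pool(tokens, embeddings):
    """Average the token vectors into one fixed-size sentence vector."""
    vectors = [embeddings[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy embeddings (values chosen to be exactly representable floats).
toy_embeddings = {
    "costs": [-0.25, 0.5],
    "profit": [0.75, 0.25],
    "rose": [0.5, -0.5],
    "fell": [-0.5, -0.25],
    "but": [0.0, 0.125],
}

sentence_a = "costs rose but profit fell".split()   # negative for the company
sentence_b = "profit rose but costs fell".split()   # positive for the company

pooled_a = mean_pool(sentence_a, toy_embeddings)
pooled_b = mean_pool(sentence_b, toy_embeddings)
print(pooled_a, pooled_b)
```

Because both sentences contain the same multiset of tokens, any classifier on top of the pooled vector has to give them the same prediction.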

### What advantage did LSTM sequential processing provide over MLP?

The LSTM processed tokens in order and used hidden states to retain contextual information across the sentence. This improved test macro F1 from **0.6812 (MLP)** to **0.7458 (LSTM)**.
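The order sensitivity can be illustrated with a toy scalar recurrence. This is a plain tanh recurrence, not an LSTM (an LSTM adds gates and a cell state on top of this idea), and the weights and inputs are hypothetical.

```python
import math


def run_recurrence(inputs, w_h=0.5, w_x=1.0):
    """Scalar recurrence h_t = tanh(w_h * h_{t-1} + w_x * x_t), with h_0 = 0."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h


# Same values in a different order: the mean is identical, the final state is not.
sequence = [1.0, -1.0, 0.5]
reordered = [0.5, -1.0, 1.0]

final_a = run_recurrence(sequence)
final_b = run_recurrence(reordered)
print(f"final state: {final_a:.4f} vs {final_b:.4f}")
```

The final hidden state depends on the order in which the inputs arrive, which is the information a mean-pooled input throws away.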

### Did fine-tuned LLMs (BERT/GPT) outperform classical baselines? Why?

Yes. Both BERT and GPT outperformed all classical baselines:

- BERT macro F1 = **0.8033**
- GPT macro F1 = **0.7815**

Pretraining gives these models rich contextual representations and language knowledge that static embedding pipelines do not capture. Fine-tuning then adapts those representations to the financial sentiment task efficiently.

### Rank all six models by test performance and explain the ranking

By test macro F1:
1. **BERT (0.8033)**
2. **GPT (0.7815)**
3. **LSTM (0.7458)**
4. **GRU (0.7330)**
5. **MLP (0.6812)** (tied with RNN on macro F1)
6. **RNN (0.6812)** (tied with MLP on macro F1; lower test accuracy, 0.6988 vs. 0.7400)
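The ranking can be reproduced directly from the test macro F1 values in the results table. This is only a small sketch; the variable names are my own.

```python
# Test macro F1 values copied from the results table.
test_macro_f1 = {
    "MLP": 0.6812,
    "RNN": 0.6812,
    "LSTM": 0.7458,
    "GRU": 0.7330,
    "BERT": 0.8033,
    "GPT": 0.7815,
}

# Python's sort is stable, so tied models keep their insertion order.
ranking = sorted(test_macro_f1.items(), key=lambda item: item[1], reverse=True)
ranked_names = [name for name, _ in ranking]

for rank, (name, score) in enumerate(ranking, start=1):
    gap_to_best = ranking[0][1] - score
    print(f"{rank}. {name}: {score:.4f} (gap to best: {gap_to_best:.4f})")
```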

Why:
- Transformer LLMs are strongest due to contextual pretraining and expressive architecture.
- LSTM/GRU outperform MLP/RNN because their gating mechanisms capture sequence dependencies better and mitigate the vanishing-gradient problem that affects vanilla RNNs.
- MLP and vanilla RNN are weakest due to representational limitations.
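The vanishing-gradient point can be sketched with a scalar vanilla recurrence: the gradient of the final state with respect to the initial state is a product of per-step factors, each smaller than one in magnitude here, so it shrinks quickly with sequence length. The weight and input values are hypothetical, and this toy model is only meant to show the mechanism, not the behavior of the trained RNN.

```python
import math


def gradient_through_time(num_steps, w_h=0.9, x=0.5):
    """Return d h_T / d h_0 for h_t = tanh(w_h * h_{t-1} + x), with h_0 = 0."""
    h = 0.0
    grad = 1.0
    for _ in range(num_steps):
        h = math.tanh(w_h * h + x)
        # Chain rule: d h_t / d h_{t-1} = w_h * (1 - h_t^2).
        grad *= w_h * (1.0 - h * h)
    return grad


for steps in (1, 10, 50):
    print(f"T={steps:>2}: d h_T / d h_0 = {gradient_through_time(steps):.3e}")
```

Gated units such as LSTM and GRU are designed to give gradients a more direct path across time steps, which is one plausible reason they rank above the vanilla RNN here.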

## Key Plots

### MLP
![](outputs/mlp_f1_learning_curves.png)
### LSTM
![](outputs/lstm_f1_learning_curves.png)
### RNN / GRU / BERT / GPT
![](outputs/rnn_f1_learning_curves.png)
![](outputs/gru_f1_learning_curves.png)
![](outputs/bert_f1_learning_curves.png)
![](outputs/gpt_f1_learning_curves.png)

## AI Use Disclosure (Required)

Yes.

- **Tool(s) used:** OpenAI Codex (GPT-5 based coding assistant in this environment).
- **How it was used:** Implemented `train_sentiment_mlp_classifier.py` and `train_sentiment_lstm_classifier.py`, helped execute all required training scripts, and drafted this reflection text using observed results.
- **What I verified myself:** I ran all six training scripts, confirmed output figures/model checkpoints were generated, and checked final test metrics before writing the analysis and ranking.