## Neural Models for Sentiment Classification

**Author:** Junho Hong  

This notebook contains analysis and reflection on six neural models trained for sentiment classification on the Financial PhraseBank dataset:
- MLP (mean-pooled FastText)
- RNN (SentenceTransformer sequences)
- LSTM (padded FastText sequences)
- GRU (SentenceTransformer sequences)
- BERT (pretrained transformer, fine-tuned)
- GPT-2 (pretrained transformer, fine-tuned)

## Summary of Results

| Rank | Model | Test Accuracy | Test Macro F1 | Architecture Type |
|------|-------|--------------|---------------|-------------------|
| 1 | **BERT** | **0.8253** | **0.8090** | Pretrained Transformer (Encoder) |
| 2 | **GPT-2** | 0.8061 | 0.7931 | Pretrained Transformer (Decoder) |
| 3 | **GRU** | 0.7785 | 0.7520 | Sequential (SentenceTransformer) |
| 4 | **LSTM** | 0.7648 | 0.7375 | Sequential (FastText) |
| 5 | **RNN** | 0.7180 | 0.7023 | Sequential (SentenceTransformer) |
| 6 | **MLP** | 0.7001 | 0.6699 | Feed-forward (mean pooling) |

## 1. Training Dynamics

### 1.1 MLP Training Curves

![MLP Training Curves](outputs/mlp_training_curves.png)

### 1.2 LSTM Training Curves

![LSTM Training Curves](outputs/lstm_training_curves.png)

### 1.3 Analysis: Overfitting vs Underfitting

#### MLP Model:
The MLP shows **slight overfitting** characteristics:
- Training accuracy reaches ~75% while validation plateaus at ~72-73%
- Training F1 reaches ~0.71 while validation plateaus at ~0.70
- Training loss continues decreasing to ~0.58 while validation loss stabilizes at ~0.60
- The train/validation gap is small (~2-3 percentage points in accuracy, ~0.01 in F1), indicating mild overfitting

**Evidence:** The training curves show steady improvement on the training set while validation metrics plateau after epoch 20, creating a small but consistent gap.

#### LSTM Model:
The LSTM shows **significant overfitting**:
- Training accuracy reaches ~93% while validation plateaus at ~77%
- Training F1 reaches ~0.92 while validation plateaus at ~0.75
- Training loss drops to ~0.17 while validation loss increases to ~0.88
- Large gap (~16 percentage points in accuracy, ~0.17 in F1) between train and validation performance

#### Why LSTM overfits more than MLP:
1. **Model Capacity**: LSTM has more parameters (~280K) than MLP (~39K)
2. **Sequential Information**: LSTM can memorize specific word sequences from training data
3. **Complexity**: Two-layer LSTM with recurrent connections has higher capacity to fit training data patterns

### 1.4 Architectural/Training Changes to Address Overfitting

#### For LSTM:
1. **Stronger Regularization**:
   - Increase dropout from 0.3 to 0.5
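   - Add recurrent dropout in LSTM layers (this needs a custom cell in PyTorch, since the `dropout` argument of `nn.LSTM` only applies between stacked layers)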
   - Apply weight decay (L2 regularization)
2. **Reduce Model Capacity**:
   - Use 1 LSTM layer instead of 2
   - Reduce hidden dimension from 128 to 64
3. **Gradient Clipping**: Cap the gradient norm to prevent exploding gradients; this stabilizes recurrent training rather than reducing overfitting directly
4. **Learning Rate Scheduling**: More aggressive learning rate reduction when validation performance plateaus
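
The plateau-based schedule in item 4 can be sketched without any framework. The class below is a hypothetical helper, loosely modeled on the behavior of PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau`; the `factor`, `patience`, and validation-loss values are made up for illustration and are not the settings used in this notebook.

```python
class PlateauLR:
    """Cut the learning rate when validation loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=1, min_delta=1e-4, min_lr=1e-6):
        self.lr = lr
        self.factor = factor        # multiplicative cut applied on a plateau
        self.patience = patience    # non-improving epochs tolerated before a cut
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


# Made-up validation losses that bottom out and then rise, as in an overfitting run.
val_losses = [0.80, 0.70, 0.66, 0.67, 0.70, 0.74, 0.80, 0.88]
scheduler = PlateauLR(lr=1e-3, factor=0.5, patience=1)
history = [scheduler.step(loss) for loss in val_losses]
print(history)
```

A smaller `factor` or a shorter `patience` makes the schedule more aggressive, which is the change suggested above.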

### 1.5 Effect of Class Weights on Training

Class weights were computed based on the imbalanced distribution:
- Negative: 604 samples (weight ≈ 2.67)
- Neutral: 2879 samples (weight ≈ 0.56)
- Positive: 1363 samples (weight ≈ 1.18)
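
The notebook does not state how these weights were derived. The listed values are consistent with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is also what scikit-learn uses for `class_weight="balanced"`. A minimal sketch under that assumption (`balanced_class_weights` is an illustrative helper, not code from the training scripts):

```python
# Class counts as listed above.
counts = {"negative": 604, "neutral": 2879, "positive": 1363}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(counts)
for label, w in weights.items():
    print(f"{label:>8}: {w:.3f}")

# How much more a Negative error costs than a Neutral error under these weights.
ratio = weights["negative"] / weights["neutral"]
print(f"negative/neutral loss ratio: {ratio:.2f}")
```

The results agree with the listed weights to within about 0.01; small differences would be expected if the original weights were computed on the training split only.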

#### Impact on Training Stability:
1. **Loss Weighting**: Negative-class errors contribute about 4.8x as much to the loss as Neutral-class errors (2.67 / 0.56)
2. **Learning Focus**: Model learns to pay more attention to minority classes (Negative, Positive)
3. **Macro F1 Improvement**: Without class weights, the model would likely overpredict the Neutral class; with weights, performance is more balanced across classes

#### Impact on Final Performance:
- Class weights helped achieve better **Macro F1** scores (which treat all classes equally)
- Without weights, the model would likely achieve higher overall accuracy but lower F1 on minority classes
- The weighted loss made training more stable by preventing the model from simply learning to predict the majority class
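
To see why Macro F1 penalizes majority-class prediction more than accuracy does, consider a hypothetical classifier that always predicts Neutral. The sketch below scores it on the test-set class sizes reported in the confusion-matrix analysis (91 / 432 / 204). The function and the always-Neutral matrix are illustrative and do not correspond to any trained model.

```python
def macro_f1_and_accuracy(confusion):
    """confusion[i][j] = number of samples with true class i predicted as class j."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    f1_scores = []
    for c in range(k):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(k)) - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / k, correct / total


# Hypothetical classifier that always predicts Neutral.
always_neutral = [
    [0, 91, 0],   # true Negative
    [0, 432, 0],  # true Neutral
    [0, 204, 0],  # true Positive
]
macro_f1, accuracy = macro_f1_and_accuracy(always_neutral)
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
# prints: accuracy=0.594, macro F1=0.248
```

Accuracy stays near 59% because Neutral dominates, while Macro F1 falls to about 0.25 because two of the three per-class F1 scores are zero.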

## 2. Model Performance and Error Analysis

### 2.1 MLP Confusion Matrix

![MLP Confusion Matrix](outputs/mlp_confusion_matrix.png)

**Test Performance:**
- Accuracy: 70.01%
- Macro F1: 0.6699

### 2.2 LSTM Confusion Matrix

![LSTM Confusion Matrix](outputs/lstm_confusion_matrix.png)

**Test Performance:**
- Accuracy: 76.48%
- Macro F1: 0.7375

### 2.3 Generalization Comparison

**LSTM generalizes better to the test set** on both reported metrics:

| Metric | MLP | LSTM | Difference |
|--------|-----|------|------------|
| Test Accuracy | 70.01% | 76.48% | **+6.47 pp** |
| Test Macro F1 | 0.6699 | 0.7375 | **+0.0676** |

#### Evidence from Confusion Matrices:

**MLP Errors:**
- Negative class: 72/91 correct (79.1% recall)
- Neutral class: 300/432 correct (69.4% recall)
- Positive class: 137/204 correct (67.2% recall)
- High confusion between Neutral→Positive (98 errors)

**LSTM Errors:**
- Negative class: 75/91 correct (82.4% recall)
- Neutral class: 350/432 correct (81.0% recall)
- Positive class: 131/204 correct (64.2% recall)
- Much higher Neutral recall than the MLP (81.0% vs 69.4%), but slightly lower Positive recall (64.2% vs 67.2%)
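
The per-class recalls and the overall accuracies can be cross-checked from the counts above. This is a small sketch; `summarize` is an illustrative helper rather than part of the evaluation scripts.

```python
# (correct, total) per class, copied from the confusion-matrix counts above.
results = {
    "MLP": {"negative": (72, 91), "neutral": (300, 432), "positive": (137, 204)},
    "LSTM": {"negative": (75, 91), "neutral": (350, 432), "positive": (131, 204)},
}


def summarize(per_class):
    """Return per-class recall and overall accuracy from (correct, total) pairs."""
    recall = {label: correct / total for label, (correct, total) in per_class.items()}
    n_correct = sum(correct for correct, _ in per_class.values())
    n_total = sum(total for _, total in per_class.values())
    return recall, n_correct / n_total


summary = {model: summarize(per_class) for model, per_class in results.items()}
for model, (recall, accuracy) in summary.items():
    recalls = ", ".join(f"{label} {value:.1%}" for label, value in recall.items())
    print(f"{model}: accuracy {accuracy:.2%} | recall: {recalls}")
```

The summed counts give 509/727 and 556/727 correct, matching the reported test accuracies of 70.01% and 76.48%.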

#### Why LSTM Generalizes Better:
1. **Sequential Context**: LSTM captures word order and dependencies that MLP's mean pooling loses
2. **Contextual Understanding**: LSTM processes sequences token-by-token, preserving syntactic structure

### 2.4 Most Frequently Misclassified Class

#### Analysis from Confusion Matrices:

**MLP:**
- Negative: 19 errors out of 91 (20.9% error rate)
- Neutral: 132 errors out of 432 (30.6% error rate)
- Positive: **67 errors out of 204 (32.8% error rate)** ← **Highest**

**LSTM:**
- Negative: 16 errors out of 91 (17.6% error rate)
- Neutral: 82 errors out of 432 (19.0% error rate)
- Positive: **73 errors out of 204 (35.8% error rate)** ← **Highest**

**Answer: The Positive class is most frequently misclassified in both models.**

#### Reasons for Positive Class Misclassification:

1. **Semantic Ambiguity in Financial Text**:
   - Financial positive sentiment is often understated (e.g., "modest gains," "slight improvement")
   - These phrases lack strong positive indicators, making them hard to distinguish from Neutral
   - Example: "The company reported stable earnings" could be Neutral or mildly Positive

2. **Model Limitations**:
   - Both models struggle to capture nuanced sentiment in professional financial writing
   - Mean pooling (MLP) loses word order that might indicate positive framing
   - LSTM's sequential processing doesn't fully capture subtle financial sentiment cues

## 3. Cross-Model Comparison

### 3.1 Additional Model Confusion Matrices

#### RNN
![RNN Confusion Matrix](outputs/rnn_confusion_matrix.png)

**Performance:** Accuracy = 71.80%, Macro F1 = 0.7023

#### GRU
![GRU Confusion Matrix](outputs/gru_confusion_matrix.png)

**Performance:** Accuracy = 77.85%, Macro F1 = 0.7520

#### BERT
![BERT Confusion Matrix](outputs/bert_confusion_matrix.png)

**Performance:** Accuracy = 82.53%, Macro F1 = 0.8090

#### GPT-2
![GPT Confusion Matrix](outputs/gpt_confusion_matrix.png)

**Performance:** Accuracy = 80.61%, Macro F1 = 0.7931

### 3.2 Training Curves Comparison

#### RNN Training Curves
![RNN Learning Curves](outputs/rnn_f1_learning_curves.png)

#### GRU Training Curves
![GRU Learning Curves](outputs/gru_f1_learning_curves.png)

#### BERT Training Curves
![BERT Learning Curves](outputs/bert_f1_learning_curves.png)

#### GPT-2 Training Curves
![GPT Learning Curves](outputs/gpt_f1_learning_curves.png)

### 3.3 How Mean-Pooled FastText Embeddings Limited MLP Performance

Mean pooling creates a **bag-of-words representation** that loses information:

#### What is Lost:
1. **Word Order**: "profits increased significantly" vs "significantly increased profits" → same vector
2. **Negation**: "not good" vs "good" → vectors are averaged, weakening negation signal
3. **Syntactic Structure**: Cannot capture subject-verb-object relationships
4. **Context-Dependent Meaning**: "the bank reported losses" (financial) vs "the river bank" → word "bank" has same embedding

#### Impact on Sentiment Classification:
- **Example 1**: "The company did not meet expectations" vs "The company met expectations"
  - Mean pooling produces very similar vectors despite opposite sentiments
  - The MLP struggles to separate them because "not" is one token among several and shifts the averaged vector only slightly

- **Example 2**: "Sales increased while profits decreased" 
  - Sequential models see the contrast structure
  - MLP only sees average of "increased" and "decreased" embeddings
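
The order-invariance behind Example 2 can be shown with a toy mean-pooling function. The 3-dimensional vectors below are invented for illustration (real FastText vectors have hundreds of dimensions), and `mean_pool` is a hypothetical helper, not the notebook's feature pipeline.

```python
import math


def mean_pool(tokens, embeddings):
    """Average the token vectors; token order plays no role in the result."""
    dim = len(next(iter(embeddings.values())))
    sums = [0.0] * dim
    for tok in tokens:
        for i, value in enumerate(embeddings[tok]):
            sums[i] += value
    return [s / len(tokens) for s in sums]


# Invented toy vectors, not real FastText embeddings.
toy = {
    "sales": [0.5, 0.25, 0.0],
    "profits": [0.25, 0.5, 0.0],
    "increased": [1.0, 0.0, 0.5],
    "decreased": [-1.0, 0.0, 0.5],
    "while": [0.0, 0.0, 0.25],
}

original = mean_pool("sales increased while profits decreased".split(), toy)
swapped = mean_pool("sales decreased while profits increased".split(), toy)

same = all(math.isclose(a, b) for a, b in zip(original, swapped))
print(original, swapped, same)
```

Swapping "increased" and "decreased" reverses which quantity went up, yet both sentences contain the same bag of tokens, so their pooled vectors are identical and no classifier on top can tell them apart.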

#### Quantitative Evidence:
- MLP F1: 0.6699 (lowest among all models)
- Gap to LSTM: +0.0676 F1 (10% relative improvement)
- Gap to BERT: +0.1391 F1 (21% relative improvement)
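
The absolute and relative gaps follow directly from the Macro F1 scores in the summary table. A short check (`gain_over` is an illustrative helper):

```python
macro_f1 = {"MLP": 0.6699, "LSTM": 0.7375, "BERT": 0.8090}


def gain_over(baseline, model, scores):
    """Return (absolute gain, relative gain) of `model` over `baseline`."""
    absolute = scores[model] - scores[baseline]
    return absolute, absolute / scores[baseline]


gains = {model: gain_over("MLP", model, macro_f1) for model in ("LSTM", "BERT")}
for model, (abs_gain, rel_gain) in gains.items():
    print(f"{model}: +{abs_gain:.4f} F1 ({rel_gain:.1%} relative)")
```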

### 3.4 LSTM's Sequential Processing Advantage Over MLP

**1. Preserves Word Order:**
- LSTM processes tokens sequentially: word₁ → word₂ → word₃
- Can learn that "not profitable" is different from "profitable not"

**2. Captures Long-Range Dependencies:**
- LSTM's memory cells maintain information across the sequence
- Can connect subject at start of sentence to verb at end

**3. Handles Negation Better:**
- Sequential processing allows LSTM to learn negation patterns
- "not good" is processed as: "not" updates hidden state → "good" is modulated by that state

**4. Contextual Word Representation:**
- Same word in different positions/contexts produces different hidden states
- "bank" in "financial bank" vs "river bank" gets different representations based on preceding words
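
The order sensitivity described above can be illustrated with a one-unit LSTM cell written in plain Python. This is a toy sketch, not the trained model: the weights are arbitrary, and the scalar encodings for "not" and "profitable" are invented so that their mean is the same in either order.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_final_state(inputs, w):
    """Run a one-unit LSTM over scalar inputs and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in inputs:
        i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])    # input gate
        f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])    # forget gate
        o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])    # output gate
        g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"])  # candidate cell value
        c = f * c + i * g       # cell state mixes old memory with the new candidate
        h = o * math.tanh(c)    # hidden state depends on everything seen so far
    return h


# Arbitrary, untrained weights chosen only for illustration.
weights = {
    "wi": 1.0, "ui": 0.5, "bi": 0.0,
    "wf": 1.0, "uf": 0.5, "bf": 1.0,
    "wo": 1.0, "uo": 0.5, "bo": 0.0,
    "wg": 2.0, "ug": 0.5, "bg": 0.0,
}

NOT, PROFITABLE = -1.0, 1.0  # invented scalar "embeddings"
h_not_profitable = lstm_final_state([NOT, PROFITABLE], weights)
h_profitable_not = lstm_final_state([PROFITABLE, NOT], weights)
print(h_not_profitable, h_profitable_not)
```

Both orderings have the same mean input (zero), so mean pooling would treat them identically, while the recurrent update produces clearly different final hidden states.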

### 3.5 Did Fine-Tuned LLMs Outperform Classical Baselines? Ranking and Key Factors

**Answer: Yes, by a clear margin.** BERT (Macro F1 0.8090) and GPT-2 (0.7931) outperform the best classical baseline, GRU (0.7520), by about 7.6% and 5.5% relative, respectively (0.057 and 0.041 absolute).

#### Ranking by Test Macro F1

| Rank | Model | F1 | Architecture | Key differentiator |
|------|-------|-----|--------------|--------------------|
| 1 | BERT | 0.8090 | Pretrained Transformer | Bidirectional, contextual embeddings, self-attention |
| 2 | GPT-2 | 0.7931 | Pretrained Transformer | Unidirectional decoder, contextual embeddings |
| 3 | GRU | 0.7520 | Gated RNN | Sequential + gating; less overfitting than LSTM on small data |
| 4 | LSTM | 0.7375 | Gated RNN | Sequential + memory; more overfitting than GRU here |
| 5 | RNN | 0.7023 | Vanilla RNN | No gating, weaker long-range dependencies |
| 6 | MLP | 0.6699 | Feed-forward | Mean pooling loses word order and negation |

#### Why LLMs Lead

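1. **Pretraining & transfer**: BERT and GPT-2 are pretrained on large corpora, and fine-tuning adapts the whole network to financial sentiment. The classical models reuse pretrained embeddings as inputs, but their MLP and recurrent layers are trained from scratch on ~4.8k samples.
2. **Contextual embeddings**: BERT and GPT-2 produce context-dependent token representations. FastText vectors are static (one vector per word regardless of context). SentenceTransformer embeddings come from a pretrained transformer, so they are not static in the same sense, but only BERT and GPT-2 are fine-tuned in this comparison.
3. **Self-attention**: Transformers let every token attend directly to other tokens; RNNs process tokens one step at a time, which can dilute long-range information.
4. **Bidirectionality (BERT)**: BERT attends to both left and right context; GPT-2 and standard (non-bidirectional) RNNs read left to right only.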
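
The difference between bidirectional and left-to-right attention in points 3 and 4 comes down to which positions each token is allowed to see. The sketch below lists the visible positions for a short example sentence. `visible_positions` is an illustrative helper and the sentence is made up; real models apply the same idea as a mask on attention scores.

```python
def visible_positions(n_tokens, causal):
    """For each query position, list the key positions it may attend to."""
    return [
        [k for k in range(n_tokens) if not causal or k <= q]
        for q in range(n_tokens)
    ]


tokens = ["profits", "did", "not", "increase"]
bidirectional = visible_positions(len(tokens), causal=False)  # encoder-style (BERT)
left_to_right = visible_positions(len(tokens), causal=True)   # decoder-style (GPT-2)

for i, tok in enumerate(tokens):
    seen_bi = [tokens[k] for k in bidirectional[i]]
    seen_causal = [tokens[k] for k in left_to_right[i]]
    print(f"{tok:>9} | bidirectional: {seen_bi} | causal: {seen_causal}")
```

Under the causal mask, "profits" is encoded without access to the later "not", whereas the bidirectional encoder lets every token's representation reflect the negation.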

#### Why This Ranking Within Tiers

- **BERT > GPT-2:** Bidirectional context and encoder design suit classification better than a decoder-only model.
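- **GRU > LSTM:** On this small dataset, GRU's simpler gating may generalize better (LSTM showed stronger overfitting in training curves). The two models also use different input embeddings (SentenceTransformer vs FastText), so the comparison is not fully controlled.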
- **RNN below gated RNNs:** No gating and vanishing gradients limit long-range dependency modeling.
- **MLP last:** Mean pooling loses word order and negation, so bag-of-words is too weak for sentiment.

## AI Use Disclosure

In completing this assignment, I used Claude for:
- Debugging PyTorch compatibility issues when writing scripts
- Structuring code with proper documentation and comments
- Organizing the structure of this reflection notebook
