1. Training Dynamics

a. Did your models show signs of overfitting or underfitting? What architectural or training changes could address this?

MLP Loss vs. Epochs
![MLP Loss vs. Epochs](loss_vs_epochs.png)

MLP Accuracy vs. Epochs
![MLP Accuracy vs. Epochs](accuracy_vs_epochs.png)

MLP Macro F1 vs. Epochs
![MLP Macro F1 vs. Epochs](macro_f1_vs_epochs.png)

LSTM Loss vs. Epochs
![LSTM Loss vs. Epochs](lstm_loss_vs_epochs.png)

LSTM Accuracy vs. Epochs
![LSTM Accuracy vs. Epochs](lstm_accuracy_vs_epochs.png)

LSTM Macro F1 vs. Epochs
![LSTM Macro F1 vs. Epochs](lstm_macro_f1_vs_epochs.png)

The MLP model shows more signs of mild overfitting than the LSTM model. In the loss curves, training loss continues decreasing steadily for MLP, while validation loss plateaus after approximately 10 epochs, with slight oscillation thereafter. For LSTM, although the validation loss also begins to stabilize around 10 epochs, it continues to trend downward more gradually over later epochs. Similarly, the macro-F1 curves show that the MLP’s validation performance plateaus earlier and exhibits a widening gap relative to training performance, whereas the LSTM maintains validation improvements further into training. This suggests that the LSTM’s additional capacity is being used to learn meaningful sequential structure rather than simply memorizing training data.
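To make the "plateau" reading of the curves more concrete, one option is to find the epoch with the lowest validation loss and compare the train/validation gap at that epoch with the gap at the final epoch. The sketch below assumes the per-epoch losses were logged as plain Python lists. The helper name `summarize_curves` and the numbers are hypothetical and are not taken from the runs plotted above.

```python
def summarize_curves(train_loss, val_loss):
    """Return the best validation epoch and the val-minus-train gap there and at the end."""
    if len(train_loss) != len(val_loss) or not val_loss:
        raise ValueError("expected two non-empty lists of equal length")
    # Index of the epoch with the lowest validation loss.
    best = min(range(len(val_loss)), key=lambda i: val_loss[i])
    return {
        "best_epoch": best + 1,  # report epochs as 1-indexed
        "gap_at_best": val_loss[best] - train_loss[best],
        "gap_at_end": val_loss[-1] - train_loss[-1],
    }


# Made-up curves for illustration only (not the values behind the plots).
train = [1.00, 0.80, 0.65, 0.55, 0.48, 0.42]
val = [1.02, 0.85, 0.75, 0.72, 0.73, 0.74]
summary = summarize_curves(train, val)
print(summary)
```

A gap that keeps widening after the best validation epoch matches the pattern described for the MLP above.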


b. How did using class weights affect training stability and final performance?

Adjusting for class imbalance increased sensitivity to minority classes and improved macro-F1 by penalizing misclassification of the smaller classes more heavily. Although this introduced larger fluctuations in the validation curves because of uneven gradient scaling, it led to more balanced classification behavior across the three sentiment categories. Overall accuracy was similar, but macro-F1 improved, indicating better generalization across all classes rather than just the majority neutral class.
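As a rough illustration of why weighting shifts the loss toward minority classes, the sketch below computes inverse-frequency weights and a weighted cross-entropy in plain Python. The weighting formula `n_samples / (n_classes * class_count)` is an assumption (a common "balanced" heuristic), not necessarily the scheme used in the assignment code. The function names, label counts, and probabilities are all made up.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Hypothetical imbalanced label set: neutral is the majority class.
labels = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
weights = inverse_frequency_weights(labels)

# A model that always leans toward "neutral", whatever the input.
probs = [{"neutral": 0.8, "positive": 0.1, "negative": 0.1}] * len(labels)
uniform = {c: 1.0 for c in weights}

unweighted = weighted_cross_entropy(probs, labels, uniform)
weighted = weighted_cross_entropy(probs, labels, weights)
print(weights)
print(unweighted, weighted)
```

With these toy numbers the weighted loss is larger than the unweighted one, because the always-neutral predictions are penalized more for the positive and negative examples. The same rescaling makes per-batch gradients depend more on which classes happen to be in the batch, which is one plausible source of the noisier validation curves.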

2. Model Performance and Error Analysis

a. Which of your two models generalized better to the test set? Provide evidence from your metrics.

My LSTM model generalized better to the test set. Its test F1 score was 3 points higher than the MLP's: 0.72 for the LSTM vs. 0.69 for the MLP. This makes sense because the LSTM processes text sequentially, whereas the MLP uses fixed input vectors, so word order matters to the LSTM but not to the MLP. Overall performance on unseen data was therefore better for the LSTM than for the MLP.

Another way to look at generalization is to compare the train F1 score with the test F1 score for each model. For the MLP, test performance was 8 percentage points lower than training performance. For the LSTM, the difference was 10 points. So the LSTM's performance dropped more on unseen data than the MLP's, but only by a small amount. This might indicate slightly more overfitting for the LSTM.
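For reference, macro F1 is the unweighted average of the per-class F1 scores, and the train/test gap discussed above is the difference between that score on the two splits. The sketch below is a minimal pure-Python version. The function name `macro_f1` and the toy labels are hypothetical, and the actual scores in this report came from the assignment's own metric code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
y_true = ["neg", "neu", "neu", "pos", "pos", "neu"]
y_pred = ["neg", "neu", "pos", "pos", "neu", "neu"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because every class counts equally, a model that does well only on the majority class scores lower on macro F1 than on accuracy.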

b. Which sentiment class was most frequently misclassified? Propose reasons for this pattern.

![MLP Confusion Matrix](mlp_confusion_matrix.png)

![LSTM Confusion Matrix](lstm_confusion_matrix.png)

Based on the confusion matrices, the positive class was the most frequently misclassified. The LSTM misclassified it around 33% of the time.

For the MLP, this class was misclassified about 45% of the time. Most of the misclassifications mistook positive sentiment for neutral sentiment.

This could be because the positive sentiment content was relatively mild and was harder to differentiate from more neutral tones. Financial content tends to use more neutral phrasing, so positive sentiment is harder to capture. On the flip side, negative sentiment is likely easier to capture because this sort of criticism is often more direct.
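The per-class percentages above can be read off a confusion matrix as one minus the recall of each row. The sketch below shows that calculation. It assumes rows are true labels and columns are predicted labels, and the matrix values are invented for illustration (they are not the counts in the figures above). The helper name `per_class_error` is hypothetical.

```python
def per_class_error(confusion, labels):
    """confusion[i][j] = number of examples with true class i predicted as class j."""
    report = {}
    for i, name in enumerate(labels):
        row = confusion[i]
        total = sum(row)
        errors = total - row[i]  # everything off the diagonal in this row
        # Wrong predictions for this true class, as (count, predicted label).
        wrong = [(row[j], labels[j]) for j in range(len(labels)) if j != i]
        report[name] = {
            "error_rate": errors / total if total else 0.0,
            "most_confused_with": max(wrong)[1],
        }
    return report


labels = ["negative", "neutral", "positive"]
# Made-up counts: rows are true classes, columns are predictions.
confusion = [
    [40, 8, 2],
    [5, 180, 15],
    [4, 26, 70],
]
report = per_class_error(confusion, labels)
worst = max(report, key=lambda name: report[name]["error_rate"])
print(worst, report[worst])
```

In this toy matrix the positive class has the highest error rate and is mostly confused with neutral, the same qualitative pattern as in the results above.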

3. Cross-Model Comparison

Compare all six models: MLP, RNN, LSTM, GRU, BERT, GPT

a. How did mean-pooled FastText embeddings limit the MLP compared to sequence-based models?

Mean-pooled FastText embeddings limited the MLP because averaging word vectors eliminates word order and compositional structure. The resulting representation prevents the model from capturing negation, modifier scope, and sequential dependencies. Sequence-based models process tokens in order and maintain contextual memory, which allows interactions between words and lets the model capture more nuance.
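A small example makes the loss of word order visible: two phrases built from the same words in a different order produce exactly the same mean-pooled vector. The 2-dimensional vectors below are hypothetical stand-ins for real FastText embeddings, and `mean_pool` is an illustrative helper, not the assignment's pooling code.

```python
def mean_pool(tokens, embeddings):
    """Average the word vectors of the given tokens, dimension by dimension."""
    vecs = [embeddings[t] for t in tokens]
    dims = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dims)]


# Hypothetical 2-d word vectors (real FastText vectors are much larger).
toy = {
    "not": [-1.0, 0.0],
    "good": [1.0, 1.0],
    "but": [0.0, 0.5],
    "bad": [-1.0, -1.0],
}

a = mean_pool("not good but bad".split(), toy)
b = mean_pool("not bad but good".split(), toy)
print(a, b, a == b)
```

The two phrases carry opposite sentiment, yet the MLP receives identical inputs for both, so no amount of training on the pooled vector can separate them.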

b. What advantage did the LSTM’s sequential processing provide over the MLP?

The LSTM's sequential processing provided a structural advantage over the MLP by incorporating word order and contextual dependencies. As above, the LSTM processes tokens one at a time and maintains a memory state that captures interactions across word positions. This allows the model to capture negation, structure, and subtle effects that can make a big difference in sentiment.
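The effect of a running memory state can be sketched with a much simpler recurrence than an LSTM. The code below is a plain scalar tanh recurrence, `h_t = tanh(w_x * x_t + w_h * h_{t-1})`, with no gates or cell state, so it is only a simplified stand-in for the LSTM used here. The weights and the per-token input scores are made up.

```python
import math


def run_recurrence(inputs, w_x=1.0, w_h=0.5):
    """Plain tanh recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1}), starting from h = 0."""
    h = 0.0
    for x in inputs:
        # The new state depends on the current input and on everything seen so far.
        h = math.tanh(w_x * x + w_h * h)
    return h


# Same two hypothetical token scores, presented in opposite orders.
forward = run_recurrence([1.0, -1.0])
reverse = run_recurrence([-1.0, 1.0])
print(forward, reverse)
```

Both sequences have the same average input, but the final states differ in sign because each step depends on what came before. An LSTM adds gating on top of this idea to control what is kept or forgotten.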

c. Did fine-tuned LLMs (BERT/GPT) outperform classical baselines? Explain the performance gap in terms of pretraining and contextual representations.

Yes, BERT and GPT clearly outperformed the classical baselines. The primary reason is large-scale pretraining, which lets transformers learn rich language representations before the model is fine-tuned. Unlike models trained from scratch, BERT and GPT start fine-tuning already encoding grammar, syntax, and semantic relationships. Transformers generate contextual embeddings (word representations depend on the surrounding words), whereas the classical baselines rely on static embeddings and have to learn language structure from limited data. These differences explain the performance gap between the fine-tuned models and the classical baselines.

d. Rank all six models by test performance. What architectural or representational factors explain the ranking?

Based on the Macro F1 test performance, the following are the model rankings:

1. BERT: 0.84
2. GPT: 0.81
3. GRU: 0.77
4. LSTM: 0.70
5. MLP: 0.69
6. RNN: 0.66

This ranking may not match the expected ordering, since the models I built myself might be out of place in it.

The results largely reflect differences in model architecture, capacity, and pretraining. The transformer-based models outperformed the older baselines thanks to pretraining and contextual embeddings, which in BERT's case are bidirectional. GRU surprisingly outperformed my LSTM model, perhaps due to its lower parameter count and more stable optimization on the small dataset we were working with.

The RNN performed the worst, which is consistent with its difficulty capturing long-range dependencies. The MLP fell behind the gated sequence models (GRU and LSTM) because it used mean-pooled embeddings, which discard word order and syntactic structure.

In general, performance tended to increase with model complexity and representational capacity.

## AI Use Disclosure (Required)

If you used any AI-enabled tools (e.g., ChatGPT, GitHub Copilot, Claude, or other LLM assistants) while working on this assignment, you must disclose that use here. The goal is transparency, not punishment.

In your disclosure, briefly include:
- **Tool(s) used:** (name + version if known)
- **How you used them:** (e.g., concept explanation, debugging, drafting code, rewriting text)
- **What you verified yourself:** (e.g., reran the notebook, checked outputs/plots, checked shapes, read documentation)
- **What you did *not* use AI for (if applicable):** (optional)

You are responsible for the correctness of your submission, even if AI suggested code or explanations.

#### <font color="red">Write your disclosure here.</font>


I used ChatGPT 5.2 to help me debug and make code more concise, especially with complex PyTorch syntax and data loading problems. It was also helpful for formatting the required graphs and metrics. I also used AI to help me understand the provided scripts in more depth.

I edited the code, analyzed outputs and charts, read documentation resources, and reformatted code myself. I did not use AI for the write-up or open questions.