## Open-Ended Reflection Questions

### 1. Training Dynamics

- Both the MLP and LSTM models show clear signs of **overfitting**: training metrics approach perfect scores while validation loss diverges and rises after the initial epochs. To address this, applying **early stopping** after epoch 30 or strengthening **regularization, such as a higher dropout rate or weight decay**, would help the models generalize.
- Because the Financial PhraseBank dataset is highly imbalanced, using `class_weight` keeps the models from reaching high accuracy simply by predicting the majority neutral class. It also gives errors on minority classes more influence on the loss, which stabilizes training for those classes.
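
As a rough illustration of the class-weighting idea above, the sketch below computes inverse-frequency weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy label counts are assumptions made for this example; they are not the actual dataset statistics or necessarily the way `class_weight` was computed in this assignment.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so their errors count more in a
    weighted loss. Hypothetical helper, written for illustration only.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy label distribution (NOT the real Financial PhraseBank counts).
toy_labels = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1

weights = balanced_class_weights(toy_labels)
for cls, w in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{cls:>8}: {w:.3f}")
```

In a PyTorch training loop, weights like these are typically converted to a tensor, ordered by class index, and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.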

![MLP Training Curves](outputs/mlp_training_curves.png){#fig-mlp fig-align="center"}

![LSTM Training Curves](outputs/lstm_training_curves.png){#fig-lstm fig-align="center"}
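
The early-stopping remedy mentioned in this section can be sketched as follows. This is a patience-based variant (stop once validation loss has not improved for a set number of epochs) rather than a fixed cutoff at a particular epoch. The class name `EarlyStopping`, its parameters, and the toy loss values are all assumptions for illustration, not the code or results from this assignment.

```python
class EarlyStopping:
    """Patience-based early stopping on validation loss (illustrative sketch)."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses that fall and then rise, mimicking overfitting.
toy_val_losses = [0.90, 0.75, 0.70, 0.72, 0.78, 0.85, 0.95]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(toy_val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch was {stopper.best_epoch}")
```

In practice the model weights from the best epoch would be saved as a checkpoint and restored after stopping, so the final model is the one with the lowest validation loss.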

### 2. Model Performance and Error Analysis

- The LSTM generalized slightly better to the test set than the MLP. Although both models clearly overfit, the LSTM achieved a higher final macro F1 score (at least 0.73) and showed a stronger diagonal in its confusion matrix.
- The "Negative" class was misclassified most often, usually as "Neutral". This pattern occurs because the Financial PhraseBank dataset is highly imbalanced, with "Neutral" samples making up the largest share of the data. Consequently, the models develop a bias toward the "Neutral" class, which makes it difficult for them to separate nuanced sentiment from neutral reporting.
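
To make the link between the confusion matrix and macro F1 concrete, here is a minimal sketch that derives per-class precision, recall, and F1 from a confusion matrix and averages the F1 scores with equal weight per class. The matrix below is a made-up example with a weak "Negative" row, not the actual output of either model, and the function names are hypothetical.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j.

    Returns a list of (precision, recall, f1) tuples, one per class.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def macro_f1(cm):
    """Unweighted mean of per-class F1, so every class counts equally."""
    f1_scores = [f1 for _, _, f1 in per_class_metrics(cm)]
    return sum(f1_scores) / len(f1_scores)


# Toy confusion matrix; rows and columns are ordered negative, neutral, positive.
toy_cm = [
    [5, 4, 1],    # many true negatives are predicted as neutral
    [1, 27, 2],
    [1, 5, 14],
]

class_names = ["negative", "neutral", "positive"]
for name, (p, r, f1) in zip(class_names, per_class_metrics(toy_cm)):
    print(f"{name:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

accuracy = sum(toy_cm[k][k] for k in range(3)) / sum(map(sum, toy_cm))
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1(toy_cm):.2f}")
```

Because macro F1 weights each class equally, a poorly recognized minority class lowers it more than it lowers accuracy, which is why it is the more informative metric on an imbalanced dataset.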

::: {#fig-confusion-matrices layout-ncol=2}

![MLP Confusion Matrix](outputs/mlp_confusion_matrix.png){#fig-mlp fig-align="center"}

![LSTM Confusion Matrix](outputs/lstm_confusion_matrix.png){#fig-lstm fig-align="center"}

Comparison of Confusion Matrices for MLP and LSTM Models
:::

### 3. Cross-Model Comparison

- Mean-pooling word embeddings for the MLP forces the model to treat each sentence as "a bag of words," which **ignores word order and grammatical structure**. As a result, the model cannot distinguish between sentences that contain the same words but express different sentiments through their arrangement.
- The LSTM's sequential architecture processes tokens one by one, allowing it to maintain a hidden state that **captures long-range dependencies** within the sentence. This enables the model to grasp the nuance of financial phrases, leading to better generalization and higher performance on the test set.
- Yes, the fine-tuned LLMs significantly outperformed the classical baselines. This is due to their large-scale pretraining on vast amounts of text, which gives them broad knowledge of syntax and semantics before fine-tuning begins. Unlike the static FastText embeddings used in the MLP and LSTM, BERT and GPT generate contextual representations, meaning **the same word receives a different vector depending on its surrounding context**, which allows them to capture subtle financial nuances.
- The overall ranking of model performance in this case, from best to worst, is: BERT > GPT > GRU > LSTM > MLP > RNN. Classical models like the MLP are hindered by static embeddings, while the gated sequence models (LSTM, GRU) improve on it by modeling token order, yet they still lack the semantic depth that pretraining gives the LLMs. The vanilla RNN is the exception among the sequence models: it ranks below the MLP, plausibly because ungated recurrent networks are prone to vanishing gradients and are harder to train.
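
The word-order point above can be checked with a small sketch: two sentences built from the same words in a different order produce identical mean-pooled vectors, whereas an order-sensitive recurrence produces different final states. The 2-dimensional embeddings, the example sentences, and the simple `tanh` recurrence are all assumptions chosen for illustration; the recurrence is a stand-in for sequential processing in general, not an LSTM.

```python
import math


def mean_pool(tokens, embeddings):
    """Average the token vectors; the result does not depend on token order."""
    vectors = [embeddings[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def toy_recurrent_state(tokens, embeddings, decay=0.5):
    """Toy recurrence h_t = tanh(decay * h_{t-1} + x_t), applied token by token."""
    dim = len(next(iter(embeddings.values())))
    h = [0.0] * dim
    for t in tokens:
        h = [math.tanh(decay * h_d + x_d) for h_d, x_d in zip(h, embeddings[t])]
    return h


# Hypothetical static embeddings (real FastText vectors have hundreds of dimensions).
toy_embeddings = {
    "profit": [1.0, 0.0],
    "rose": [0.0, 1.0],
    "while": [0.5, 0.5],
    "costs": [-1.0, 0.0],
    "fell": [0.0, -1.0],
}

sent_a = "profit rose while costs fell".split()   # favorable reading
sent_b = "costs rose while profit fell".split()   # unfavorable reading

pooled_a = mean_pool(sent_a, toy_embeddings)
pooled_b = mean_pool(sent_b, toy_embeddings)
state_a = toy_recurrent_state(sent_a, toy_embeddings)
state_b = toy_recurrent_state(sent_b, toy_embeddings)

same_pooled = all(abs(x - y) < 1e-12 for x, y in zip(pooled_a, pooled_b))
same_state = all(abs(x - y) < 1e-12 for x, y in zip(state_a, state_b))
print("mean-pooled vectors identical:", same_pooled)
print("recurrent final states identical:", same_state)
```

Any classifier placed on top of the mean-pooled vector receives the same input for both sentences and therefore must assign them the same label, while a sequential model at least has the information needed to tell them apart.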

::: {#fig-all-confusions layout-ncol=3}

![MLP](outputs/mlp_confusion_matrix.png){#fig-mlp fig-align="center"}

![RNN](outputs/rnn_confusion_matrix.png){#fig-rnn fig-align="center"}

![LSTM](outputs/lstm_confusion_matrix.png){#fig-lstm fig-align="center"}

![GRU](outputs/gru_confusion_matrix.png){#fig-gru fig-align="center"}

![BERT](outputs/bert_confusion_matrix.png){#fig-bert fig-align="center"}

![GPT](outputs/gpt_confusion_matrix.png){#fig-gpt fig-align="center"}

Confusion Matrices Across All Six Models
:::

## AI Use Disclosure (Required)

If you used any AI-enabled tools (e.g., ChatGPT, GitHub Copilot, Claude, or other LLM assistants) while working on this assignment, you must disclose that use here. The goal is transparency, not punishment.

In your disclosure, briefly include:

- **Tool(s) used:** (name + version if known)
- **How you used them:** (e.g., concept explanation, debugging, drafting code, rewriting text)
- **What you verified yourself:** (e.g., reran the notebook, checked outputs/plots, checked shapes, read documentation)
- **What you did *not* use AI for (if applicable):** (optional)

You are responsible for the correctness of your submission, even if AI suggested code or explanations.

---

### In this assignment, I used AI to assist with:

- Debugging library import errors and resolving version conflicts.
- Explaining code and troubleshooting errors related to the torch library.
- Assisting with Markdown formatting and layout for inserting images.
- Refining and polishing language.
