## 1. Training Dynamics

*Focus on your MLP and LSTM implementations*

- Did your models show signs of **overfitting** or **underfitting**? What architectural or training changes could address this?
- How did using **class weights** affect training stability and final performance?


#### MLP Curves
![MLP Learning Curves](outputs/mlp_f1_learning_curves.png)
#### LSTM Curves
![LSTM Learning Curves](outputs/lstm_f1_learning_curves.png)

The MLP shows no clear signs of overfitting: there is no significant gap between training and validation performance. Training and validation accuracy and F1 all end in the 0.70-0.75 range. In the loss curves, the training loss keeps decreasing as expected, and so does the validation loss. The validation loss is starting to level off, but it has not turned upward, which would be the sign of overfitting. This suggests the model is learning patterns from the training data that generalize to unseen examples, which is consistent with the test Macro F1 of 0.6982 (~70%).

The MLP could overfit if the hidden layer dimensionality were large relative to the number of training examples. In that case, some of its units may start memorizing noise rather than signal, especially with longer training. Conversely, it could underfit if the architecture is too shallow or constrained to capture more complex, nonlinear relationships. For example, mean-pooling the embeddings compresses each sentence into a single vector, which can discard useful information; if the MLP cannot compensate for that loss, performance will plateau. Balancing capacity, regularization, and training duration is therefore essential to maintaining performance.

The LSTM also shows relatively good training dynamics, though with signs of mild overfitting. The training and validation curves track each other closely for most of training, but by the end there is a ~10% gap between the training and validation metrics. In the loss curves, the training loss keeps decreasing while the validation loss plateaus and even begins to curve slightly upward. This suggests the model is starting to fit noise rather than real trends. Reducing the number of training epochs would therefore likely reduce the overfitting, and early stopping based on validation loss would let the model keep its best state before overfitting sets in. Another factor is the LSTM's capacity: I increased the hidden state dimension from the 128 used in the RNN code to 256, which may be too large for the size of the dataset. As with the MLP, increasing dropout or applying stronger regularization could also reduce overfitting.
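
The early-stopping idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the training code used for this assignment: the `EarlyStopper` class, the `patience` value, and the validation losses are all hypothetical and chosen only to mimic a loss curve that falls, plateaus, and then rises.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Tracks validation loss and signals when training should stop."""

    patience: int = 3        # hypothetical value; would need tuning on real runs
    min_delta: float = 0.0   # minimum decrease that counts as an improvement
    best_loss: float = float("inf")
    best_epoch: int = -1
    bad_epochs: int = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch and return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0  # a real loop would save a checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: decreasing, then a plateau, then an upturn.
toy_val_losses = [0.95, 0.80, 0.72, 0.70, 0.71, 0.73, 0.76, 0.80]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(toy_val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

In a real training loop, the model weights from `best_epoch` would be restored after stopping, so the final model is the one with the lowest validation loss rather than the last one trained.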

#### MLP Confusion Matrix
![MLP Confusion Matrix](outputs/mlp_confusion_matrix.png)
#### LSTM Confusion Matrix
![LSTM Confusion Matrix](outputs/lstm_confusion_matrix.png)

Using class weights had a positive impact on both training stability and final performance. The dataset is heavily imbalanced, with 684 negative samples, 2,879 neutral samples, and 1,363 positive samples, so the neutral class makes up ~58% of the data. Without class weighting, the models have a stronger incentive to predict the majority class, since doing so minimizes the loss. With class weights, the loss curves shown above converge relatively smoothly for both models, suggesting that the weighted loss provides a useful signal for all three classes. The confusion matrices also show that class weights kept the models from collapsing into majority-class predictors: both the MLP and LSTM perform reasonably well on all three classes. This is reflected in the macro F1 scores of about 0.70 for both models (0.6982 for the MLP and 0.7048 for the LSTM), a metric that treats all classes equally regardless of their frequency. The score would be much lower if the models had only learned to predict the majority class. The class weights therefore forced the models to pay appropriate attention to the minority classes by making errors on those classes contribute proportionally more to the loss.
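
To make the weighting concrete, the sketch below derives inverse-frequency class weights from the class counts reported above. The "balanced" formula `total / (num_classes * count)` is an assumption for illustration; the exact weighting scheme used in the training scripts is not stated here.

```python
# Class counts as reported in the text above.
class_counts = {"negative": 684, "neutral": 2879, "positive": 1363}

total = sum(class_counts.values())
num_classes = len(class_counts)

# Assumed "balanced" inverse-frequency scheme: rarer classes get larger weights.
class_weights = {
    label: total / (num_classes * count)
    for label, count in class_counts.items()
}

for label, count in class_counts.items():
    share = count / total
    print(f"{label:>8}: share={share:.3f}  weight={class_weights[label]:.3f}")
```

Under this scheme, every class contributes the same total weight (`count * weight`) to the loss, which is what removes the incentive to default to the neutral class. Weights like these are typically passed to a weighted cross-entropy loss, for example through the `weight` argument of PyTorch's `torch.nn.CrossEntropyLoss`.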

## 2. Model Performance and Error Analysis

*Focus on your MLP and LSTM implementations*

- Which of your two models **generalized better** to the test set? Provide evidence from your metrics.
- Which **sentiment class** was most frequently misclassified? Propose reasons for this pattern.


#### MLP Confusion Matrix
![MLP Confusion Matrix](outputs/mlp_confusion_matrix.png)
#### LSTM Confusion Matrix
![LSTM Confusion Matrix](outputs/lstm_confusion_matrix.png)

The LSTM performed slightly better than the MLP on the test set, despite its minor overfitting. The final test metrics for the MLP were: Accuracy: 0.7263, Macro F1: 0.6982, Weighted F1: 0.7368. The final test metrics for the LSTM were: Accuracy: 0.7387, Macro F1: 0.7048, Weighted F1: 0.7443. The consistent improvement across accuracy, macro F1, and weighted F1 shows that the LSTM improved both overall performance and balance across classes. The confusion matrices further support this conclusion. The LSTM's confusion matrix has a slightly more concentrated diagonal, meaning more correct predictions overall, while the MLP's has slightly more mass on the off-diagonal cells, which are wrong predictions. This difference is likely due to the architecture of each approach. The LSTM processes tokens sequentially, which lets it capture word order, negation patterns, and contextual relationships that may be important for sentiment analysis. The MLP, in contrast, collapses all the sequential information into a single vector, which can lose information about how the words interact with each other.
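
The information loss from mean pooling can be illustrated with a toy example. The 2-dimensional embeddings below are hypothetical values chosen for readability (they are not FastText vectors); the point is only that averaging is order-invariant, so two sentences with the same words in a different order produce the same input for the MLP.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy embeddings, not real FastText vectors.
toy_embeddings = {
    "profits": [0.5, 0.25],
    "rose": [0.75, -0.5],
    "costs": [-0.25, 0.5],
    "fell": [-0.75, -0.25],
}

sentence_a = ["profits", "rose", "costs", "fell"]   # favorable reading
sentence_b = ["costs", "rose", "profits", "fell"]   # unfavorable reading

pooled_a = mean_pool([toy_embeddings[w] for w in sentence_a])
pooled_b = mean_pool([toy_embeddings[w] for w in sentence_b])

print("same words, different order:", sentence_a != sentence_b)
print("identical pooled vectors:   ", pooled_a == pooled_b)
```

A recurrent model reads the two token sequences in different orders and can therefore produce different hidden states for them, which is one plausible reason for the LSTM's small edge here.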

Looking at the confusion matrices and classification reports, the neutral class (Class 1) had the highest absolute number of misclassifications. For the MLP, the recall for the neutral class was 0.7269 on 432 neutral test samples, so 118 neutral instances were misclassified. The LSTM did slightly better with a recall of 0.7639, misclassifying only 102 neutral samples. The positive class, however, had a higher misclassification rate despite fewer absolute errors. The MLP misclassified about 60 positive samples for a recall of 0.7059, and the LSTM misclassified 66 positive samples for a recall of 0.6765. The negative class had the fewest absolute misclassifications and the highest recall. Overall, the positive class had the highest misclassification rate, while the neutral class had the most total errors.
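
The error counts above follow directly from recall and class support, since the number of misclassified samples is `(1 - recall) * support`. The short check below reproduces them. The neutral support of 432 is stated above; the positive support of 204 is not stated and is inferred here from the reported recalls and error counts, so it should be read as an assumption.

```python
# (recall, support) per model and class.
# 432 is stated in the text; 204 is an inferred assumption.
reports = {
    "MLP": {"neutral": (0.7269, 432), "positive": (0.7059, 204)},
    "LSTM": {"neutral": (0.7639, 432), "positive": (0.6765, 204)},
}


def misclassified(recall, support):
    """Number of samples of a class that were predicted as another class."""
    return round((1 - recall) * support)


errors = {
    model: {label: misclassified(r, n) for label, (r, n) in classes.items()}
    for model, classes in reports.items()
}

for model, classes in reports.items():
    for label, (recall, support) in classes.items():
        rate = 1 - recall
        print(f"{model:>4} {label:>8}: errors={errors[model][label]:>3}  error rate={rate:.1%}")
```

The output makes the distinction explicit: neutral has more errors in absolute terms because it has more than twice the support, while positive has the higher error rate for both models.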

The high number of misclassifications for the neutral class can be attributed both to its large number of samples and to the ambiguity of neutral sentiment. Unlike clearly positive or negative statements, neutral statements tend to be factual and lack emotional indicators. Statements can also carry mixed signals, such as a positive remark made in the context of a declining market. The positive class's high proportional error rate likely reflects similar challenges. Financial writing often relies on hedging language, which blurs the boundaries between the sentiment classes. Distinguishing genuinely positive statements from cautiously optimistic or neutral language is difficult, especially since financial analysts often use conservative language even in favorable conditions.

## 3. Cross-Model Comparison

*Compare across all models you trained (MLP, RNN, LSTM, GRU, BERT, GPT)*

- **Rank the models** from best to worst based on a metric of your choice. Justify your ranking.
- What **architectural differences** distinguish the top performer from the lowest?
- Do you observe any patterns in which models **overfit more**? What might explain this?
- If you were to deploy one of these models for **real-world sentiment analysis**, which would you choose and why?

#### RNN Curves
![RNN Learning Curves](outputs/rnn_f1_learning_curves.png)
#### GRU Curves
![GRU Learning Curves](outputs/gru_f1_learning_curves.png)
#### BERT Curves
![BERT Learning Curves](outputs/bert_f1_learning_curves.png)
#### GPT Curves
![GPT Learning Curves](outputs/gpt_f1_learning_curves.png)



I rank the models from best to worst by test Macro F1. Macro F1 is a good choice for this problem because the dataset has heavily imbalanced classes and Macro F1 treats all three sentiments equally. The ranking from best to worst is BERT, GPT, GRU, LSTM, RNN, MLP. This ordering likely reflects fundamental differences in model architecture. BERT is a bidirectional transformer with self-attention, and it achieves the highest test Macro F1. GPT is also a transformer, which gives it the second-highest performance, but it is unidirectional, which likely makes its test performance slightly worse. The next three models, the GRU, the LSTM, and the RNN, are all recurrent; the gating mechanisms in the GRU and LSTM mitigate vanishing gradients better than the base RNN. The GRU is a simpler architecture than the LSTM, using two gates (update and reset) rather than three (input, forget, and output), and this simplicity may let it generalize slightly better on this problem. The MLP performed the worst. It receives mean-pooled sentence embeddings, which discard word order and contextual relationships, and this loss of information likely explains its last-place result.

The best-performing model, BERT, and the worst-performing model, the MLP, have fundamentally different architectures. BERT stacks many bidirectional transformer layers with multi-head self-attention, which connects every word to every other word in the context, together with positional encodings that preserve word order. The MLP, in contrast, is a shallow feedforward network that averages all word embeddings, destroying the relationships and structure among the words. Additionally, BERT benefits from transfer learning: its pre-training gives it general language understanding before fine-tuning, whereas the MLP learns from scratch using only the static FastText embeddings.
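
The case for Macro F1 as the ranking metric can be checked against a trivial baseline. The sketch below scores a hypothetical classifier that always predicts the majority class, using the class counts reported earlier in this document as if they were the evaluation set (an assumption made only for illustration; the real test split is smaller).

```python
# Class counts reported earlier in this document.
class_counts = {"negative": 684, "neutral": 2879, "positive": 1363}


def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


majority = max(class_counts, key=class_counts.get)
total = sum(class_counts.values())

per_class_f1 = {}
for label, count in class_counts.items():
    if label == majority:
        # Every sample is predicted as the majority class.
        tp, fp, fn = count, total - count, 0
    else:
        # Minority classes are never predicted.
        tp, fp, fn = 0, 0, count
    per_class_f1[label] = f1(tp, fp, fn)

accuracy = class_counts[majority] / total
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
weighted_f1 = sum(per_class_f1[k] * class_counts[k] for k in class_counts) / total

print(f"accuracy={accuracy:.3f}  weighted F1={weighted_f1:.3f}  macro F1={macro_f1:.3f}")
```

The baseline gets a deceptively decent accuracy from the neutral class alone, while its Macro F1 falls to roughly a quarter because two of the three classes score zero. That gap is why Macro F1 is the more informative metric for comparing models on this dataset.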

Looking at the learning curves, there is a rough trend in which the more complex models overfit more. The MLP shows no overfitting: its training and validation curves track each other closely, as discussed above, likely because its limited capacity prevents it from fitting more complex patterns. The RNN and GRU show substantial overfitting, with large gaps between training and validation performance by the end of training. The RNN struggles with vanishing gradients, which cause unstable learning and poor generalization, and the GRU, despite its gating mechanisms, is likely learning overly complex sequential patterns. The LSTM shows only mild overfitting, with training and validation curves remaining relatively close, as discussed above, suggesting that its three-gate design generalizes better here. The transformer models, GPT and BERT, show only early signs of overfitting, as their training and validation curves still align fairly well. However, these models were trained for only 5 epochs each rather than the 30 used for the less complex models, so they would very likely show more significant overfitting with longer training. Overall, the curves suggest that overfitting has some correlation with model complexity.

I would choose BERT for real-world deployment. Despite its higher computational cost, it has the best performance across the sentiment classes. In a financial context especially, the consequences of misclassification far outweigh the computational cost, and very low latency is not a strict requirement. BERT's stronger performance matters because it is better at distinguishing truly neutral statements from subtle positive or negative signals hidden in cautious language. Additionally, BERT's pre-training gives it more capacity to handle unusual sentences, domain jargon, and varied writing styles that models trained from scratch would struggle with.

## AI Use Disclosure (Required)

If you used any AI-enabled tools (e.g., ChatGPT, GitHub Copilot, Claude, or other LLM assistants) while working on this assignment, you must disclose that use here. The goal is transparency, not punishment.

In your disclosure, briefly include:
- **Tool(s) used:** (name + version if known)
- **How you used them:** (e.g., concept explanation, debugging, drafting code, rewriting text)
- **What you verified yourself:** (e.g., reran the notebook, checked outputs/plots, checked shapes, read documentation)
- **What you did *not* use AI for (if applicable):** (optional)

You are responsible for the correctness of your submission, even if AI suggested code or explanations.

I used Claude Sonnet 4.5 to help me write the training code for the MLP and LSTM classifiers. However, I copied much of the code directly from the RNN code, and these models were relatively simple, so my main use case was writing the code to pull the FastText gensim data from assignment 2 into the scripts. I also had the built-in VSCode autocomplete enabled while writing the answers to the written questions. I verified the code and the writing by reading everything and making changes as necessary, based on performance for the code and on my own judgment for the writing.