## Section 1. Training Dynamics

Focus on your MLP and LSTM implementations
- Question: Did your models show signs of overfitting or underfitting? What architectural or training changes could address this?

<font color="red">Answer: The MLP shows mild overfitting after roughly epoch 20, as seen in Figure 1.2. The Train F1 curve rises steadily and plateaus at approximately 0.68, whereas Validation F1 peaks at 0.68 and then falls slightly. The gap that opens between the Training and Validation F1 scores after the validation peak supports the conclusion that there is mild overfitting. </font>

<font color="red">For the LSTM, the Training F1 curve ends at approximately 0.79, whereas the Validation F1 curve ends at approximately 0.72, as seen in Figure 1.4. This gap is larger than the MLP’s, which suggests that the LSTM is more expressive but also overfits more. Nonetheless, the validation curves of both models trend upwards overall, so the overfitting is not severe. </font>

<font color="red">Underfitting does not appear to be an issue for either model. Both reach high training and validation F1, and their losses fall substantially over the epochs (Figures 1.5 and 1.6), which is the opposite of what underfitting would look like.</font>
<font color="red">Since overfitting is the main issue for both the MLP and the LSTM, the following architectural and training changes could address it.</font>

<font color="red">For the MLP, increasing weight decay would penalize large parameters, which lowers effective model complexity and discourages memorizing training-specific patterns. Reducing the hidden size can also keep the MLP from overfitting to small outliers in the Phrasebank. Adding dropout randomly deactivates neurons during training, which pushes the model to learn more robust representations. For the LSTM, mitigations include early stopping once validation F1 shows a sustained decrease, adding dropout between LSTM layers, and reducing the hidden size from 128. 
</font>
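
The early-stopping rule mentioned above can be sketched without any framework. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation F1 values below are hypothetical and are not taken from the notebook's training loop; this is only an illustration of the stopping logic.

```python
class EarlyStopping:
    """Stop training when validation macro F1 has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # hypothetical default, not from the report
        self.min_delta = min_delta    # smallest change that counts as an improvement
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_f1):
        """Record one epoch; return True if training should stop."""
        if val_f1 > self.best_score + self.min_delta:
            self.best_score = val_f1
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative validation F1 values (made up, not the report's curves).
val_f1_history = [0.60, 0.65, 0.68, 0.67, 0.675, 0.66, 0.65]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate(val_f1_history):
    if stopper.step(epoch, score):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

In practice the weights from the best epoch would be saved and restored after stopping. If the models are built in PyTorch (an assumption here), weight decay is available as the `weight_decay` argument of optimizers such as `torch.optim.AdamW`, dropout as `torch.nn.Dropout`, and dropout between stacked LSTM layers as the `dropout` argument of `torch.nn.LSTM`.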

#### Figure 1.1: MLP Accuracy Curve 
<img src="outputs/mlp_acc_curve.png" width="700">

#### Figure 1.2: MLP Macro F1 Curve 
<img src="outputs/mlp_f1_curve.png" width="700">

#### Figure 1.3: LSTM Accuracy Curve 
<img src="outputs/lstm_accuracy_curve.png" width="700">

#### Figure 1.4: LSTM Macro F1 Curve 
<img src="outputs/lstm_macro_f1_curve.png" width="700">

#### Figure 1.5: MLP Loss Curve 
<img src="outputs/mlp_loss_curve.png" width="700">

#### Figure 1.6: LSTM Loss Curve 
<img src="outputs/lstm_loss_curve.jpeg" width="700">

- Question: How did using class weights affect training stability and final performance?

<font color="red"> Answer:
The classes are imbalanced, with negative sentiment as the minority class and neutral sentiment as the majority. Introducing class weights in the cross-entropy loss penalizes misclassified minority-class examples more heavily during training. This in turn encourages the model to devote more capacity to predicting negative sentiment correctly. </font>

<font color="red">The effect of these weights is evident in the confusion matrices. The minority class receives a non-trivial number of correct classifications rather than being frequently or systematically misclassified as neutral. Furthermore, the macro F1 score, which weights every class equally, is also relatively strong. This indicates improved balance across classes. </font>

<font color="red">While class weights likely improved minority-class recall, they also appear to have made the validation curves slightly noisier, plausibly because of the stronger corrective updates during training. 
</font>
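
As a rough illustration of how class weights change the loss, the sketch below computes inverse-frequency weights and a weighted cross-entropy for a single example. The label counts, the predicted probabilities and both helper functions are made up for illustration; the notebook's actual weighting scheme may differ. The weighting formula used here, `n_samples / (n_classes * class_count)`, is the common "balanced" heuristic.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_cross_entropy(probs, label, weights):
    """Loss for one example: -w[label] * log(p[label])."""
    return -weights[label] * math.log(probs[label])


# Made-up label distribution: neutral is the majority, negative the minority.
labels = ["neutral"] * 60 + ["positive"] * 30 + ["negative"] * 10
weights = inverse_frequency_weights(labels)

# The same predicted distribution is penalized very differently
# depending on which class is the true one.
probs = {"neutral": 0.6, "positive": 0.3, "negative": 0.1}
loss_if_negative = weighted_cross_entropy(probs, "negative", weights)
loss_if_neutral = weighted_cross_entropy(probs, "neutral", weights)

print(weights)
print(f"loss if true class is negative: {loss_if_negative:.3f}")
print(f"loss if true class is neutral:  {loss_if_neutral:.3f}")
```

Because a missed minority example contributes a much larger loss, its gradient is also larger, which is consistent with both the improved minority recall and the noisier validation curves described above.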




## Section 2. Model Performance and Error Analysis
Focus on your MLP and LSTM implementations
- Question: Which of your two models generalized better to the test set? Provide evidence from your metrics.

<font color="red">Answer: The LSTM generalizes better to the test set. This is supported by the LSTM’s higher test accuracy of 0.75 compared to the MLP’s 0.721 (Figures 1.1 and 1.3). The improvement in Macro F1, 0.72 compared to the MLP’s 0.677, is especially meaningful (Figures 1.2 and 1.4), as it indicates well-balanced performance across the sentiment classes (positive, neutral and negative) rather than performance driven by the majority neutral class. 
</font>
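
To make concrete why macro F1 is more informative than accuracy on imbalanced data, the sketch below computes both from a confusion matrix. The matrix is invented (a model that predicts neutral for everything) and is not one of the report's confusion matrices; the helper functions are hypothetical.

```python
def per_class_f1(confusion, cls):
    """F1 for class index `cls`; confusion[i][j] = count of true class i predicted as j."""
    tp = confusion[cls][cls]
    fp = sum(row[cls] for row in confusion) - tp
    fn = sum(confusion[cls]) - tp
    if tp == 0:
        return 0.0  # convention: F1 is 0 when the class is never predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    n = len(confusion)
    return sum(per_class_f1(confusion, c) for c in range(n)) / n


def accuracy(confusion):
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Made-up matrix, rows = true class, columns = predicted class,
# in the order [negative, neutral, positive].
majority_only = [
    [0, 10, 0],
    [0, 60, 0],
    [0, 30, 0],
]

acc = accuracy(majority_only)
mf1 = macro_f1(majority_only)
print(f"accuracy: {acc:.2f}, macro F1: {mf1:.2f}")  # accuracy: 0.60, macro F1: 0.25
```

A majority-only predictor looks acceptable on accuracy but collapses on macro F1, so a model that improves macro F1 is doing better on the smaller classes as well.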

- Question: Which sentiment class was most frequently misclassified? Propose reasons for this pattern.

<font color="red">Answer: In the MLP, the most frequent error was neutral text misclassified as positive (74 cases), as seen in Figure 2.1. The minority class was also confused with the other classes. This pattern likely derives from the averaging effect of mean pooling, which dilutes the subtle contextual hints that differentiate neutral from positive phrasing in financial texts. </font>

<font color="red">The LSTM also confuses neutral and positive texts. Nonetheless, it detects negative sentiment with higher precision than the MLP (Figure 2.2). This difference suggests that sequential modeling and the positional information of text are important for capturing stronger polarity cues such as negation and negative financial phrases. 
</font>

#### Figure 2.1: MLP Confusion Matrix
<img src="outputs/mlp_confusion_matrix.png" width="700">

#### Figure 2.2: LSTM Confusion Matrix
<img src="outputs/lstm_confusion_matrix.jpeg" width="500">


## Section 3. Cross-Model Comparison

Compare all six models: MLP, RNN, LSTM, GRU, BERT, GPT

- Question: How did mean-pooled FastText embeddings limit the MLP compared to sequence-based models?

<font color="red">Answer: The MLP receives a single 300-dimensional vector obtained by averaging FastText word embeddings. Mean pooling treats a sentence as an unordered collection of word vectors. Hence, it cannot encode word order, positional information, syntactic structure, or negation scope (the words affected by a negative term such as “no” or “without”). As a result, the model struggles to distinguish phrases such as “not profitable” and “profitable”, where sentiment hinges on a negation term. Moreover, mean pooling averages away the effect of polarity modifiers. With these limitations, the MLP struggles to separate neutral from mildly positive financial statements, which likely explains its lower Macro F1 score compared to the sequence-based models that preserve token order. </font>
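
The order-invariance of mean pooling can be checked directly. The sketch below uses hypothetical 3-dimensional vectors standing in for the 300-dimensional FastText embeddings; the words, values and helper names are invented for illustration.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


# Hypothetical toy embeddings (not real FastText values).
toy_embeddings = {
    "profit": [1.0, 0.0, 0.0],
    "not": [-0.5, 0.0, 1.0],
    "loss": [-1.0, 0.0, 0.0],
    "but": [0.0, 0.5, 0.0],
}


def embed(sentence):
    """Mean-pool the toy embeddings of a whitespace-tokenized sentence."""
    return mean_pool([toy_embeddings[w] for w in sentence.split()])


# Same words, different order, opposite meaning.
a = embed("profit but not loss")
b = embed("loss but not profit")

print(a)
print(b)
print("identical inputs to the MLP:", a == b)
```

Since both sentences map to the same pooled vector, no classifier placed on top of it can assign them different labels, whereas a sequence model sees the two token orders as different inputs.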

- Question: What advantage did the LSTM’s sequential processing provide over the MLP?

<font color="red"> Answer: The LSTM’s sequential hidden state enables it to model contextual dependencies across tokens. It processes tokens in order and uses input, forget and output gates to control what is written to, kept in and read from its memory. This lets it capture negation patterns, phrase-level dependencies and interactions across tokens. These benefits are evidenced by the LSTM’s higher Macro F1 score relative to the MLP and its better detection of negative phrases. </font>
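
For reference, the gating described above corresponds to the standard LSTM update, written here in one common notation. The symbols are assumptions of this sketch rather than names from the notebook: $x_t$ is the token embedding at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Because $h_t$ depends on $h_{t-1}$ and $c_{t-1}$, the representation of a word such as “profitable” depends on whether “not” preceded it, which is exactly the information mean pooling discards.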

- Question: Did fine-tuned LLMs (BERT/GPT) outperform classical baselines? Explain the performance gap in terms of pretraining and contextual representations.

<font color="red">Answer: Fine-tuned LLMs did outperform the classical baselines. The classical models relied on static pre-trained FastText embeddings and were trained only on the comparatively small Financial Phrasebank dataset. </font>
<font color="red">Classical baselines such as the LSTM and GRU preserve token order. However, they lack the deep contextual representations that come from large-scale language pretraining: they cannot re-embed a word according to its context and instead must learn contextual patterns from scratch. </font>

<font color="red">In contrast, GPT and BERT are transformer models pre-trained on huge text corpora. Both produce contextual embeddings, inferring a word’s meaning from its surrounding words. Because of their large-scale pretraining, they pick up nuanced sentiment cues, compositional structure and subtle financial language patterns, which enables more precise sentiment classification. Self-attention, together with fine-tuning that adjusts the high-level representations, allows the LLMs to adapt these rich semantic representations to the sentiment classification task. These factors explain the higher macro F1 scores the two LLMs achieve compared to the classical baselines. </font>


- Question: Rank all six models by test performance. What architectural or representational factors explain the ranking?

<font color="red">Answer:
Models ranked by Test Macro F1 values are as follows:
</font>

<font color="red">

1. BERT (0.827): Like GPT, BERT is built on the transformer architecture and large-scale pretraining. It achieves the best performance because its bidirectional self-attention lets each token attend simultaneously to all tokens in the sentence, both to its left and to its right. This allows for rich contextual representations. Furthermore, fine-tuning adapts the pretrained representations to financial sentiment classification. These factors together explain BERT’s leading macro F1.
2. GPT (0.774): GPT also benefits from large-scale pretraining and the transformer architecture, though its unidirectional attention restricts each token to attending only to preceding tokens. This gives it more limited contextual access than BERT’s bidirectional modeling, hence its lower performance.
3. LSTM (0.732) & GRU (0.732): Both use gated recurrence, which mitigates vanishing gradients and improves long-range modeling, and which explains the similar performance of the two models.
4. RNN (0.691): The vanilla RNN processes tokens sequentially but performs worse because it has no gating mechanism, which makes it harder to capture longer-range dependencies.
5. MLP (0.677): The MLP has the lowest F1 score because mean-pooled FastText embeddings discard word order, so sentences are treated as unordered collections of word vectors. This restricts the model’s ability to capture compositional meaning and negation.

</font>
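
The difference between BERT-style bidirectional attention and GPT-style unidirectional attention in the ranking above comes down to the attention mask. The sketch below builds both masks for a toy sentence; the `attention_mask` helper and the example tokens are hypothetical and do not reflect how either model tokenizes text.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when token i may attend to token j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


tokens = ["profit", "did", "not", "rise"]
bidirectional = attention_mask(len(tokens), causal=False)  # BERT-style
unidirectional = attention_mask(len(tokens), causal=True)  # GPT-style

# Which tokens can "not" see under each mask?
idx = tokens.index("not")
seen_bi = [t for t, ok in zip(tokens, bidirectional[idx]) if ok]
seen_uni = [t for t, ok in zip(tokens, unidirectional[idx]) if ok]

print("bidirectional :", seen_bi)
print("unidirectional:", seen_uni)
```

Under the causal mask, “not” cannot see “rise”, the word it negates, while under the bidirectional mask it can. The last position still sees the whole sentence under a causal mask, which is why GPT-style classifiers typically read the prediction from the final token; the intermediate representations, however, are built with less context.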

## AI Use Disclosure (Required)

If you used any AI-enabled tools (e.g., ChatGPT, GitHub Copilot, Claude, or other LLM assistants) while working on this assignment, you must disclose that use here. The goal is transparency, not punishment.

In your disclosure, briefly include:
- **Tool(s) used:** (name + version if known)
- **How you used them:** (e.g., concept explanation, debugging, drafting code, rewriting text)
- **What you verified yourself:** (e.g., reran the notebook, checked outputs/plots, checked shapes, read documentation)
- **What you did *not* use AI for (if applicable):** (optional)

You are responsible for the correctness of your submission, even if AI suggested code or explanations.

#### <font color="red">I used the AI tools ChatGPT and Google’s NotebookLM to simplify lecture slide content and to proofread the structure and flow of my writing. I referred to course materials and AI-generated summaries to edit and correct my responses. I independently re-ran the code cells, inspected the outputs and confirmed that they were consistent with the course content. I also used ChatGPT to debug code, and iterated on the code myself using its responses.</font>
