# Project 2 Learning Goals

1. **Fine-Tuning BERT**: Gain hands-on experience with fine-tuning a BERT model for Sentiment Analysis on financial data.
2. **Tokenizer Usage**: Learn how to use a tokenizer for text-to-token mapping, padding, and truncation.
3. **Training Setup**: Understand and utilize `TrainingArguments` and `Trainer` for model training.
4. **Model Deployment**: Learn how to push models to the Hugging Face Hub.

## Setup & Imports

In [None]:
!pip install transformers[torch] datasets evaluate --quiet
!pip install einops --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.7/7.7 MB[0m [31m23.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m493.7/493.7 kB[0m [31m33.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.0/302.0 kB[0m [31m32.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m57.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m17.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m261.0/261.0 kB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m13.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━

In [None]:
from datasets import load_dataset
import evaluate
import matplotlib.pyplot as plt
from typing import List, Dict, Any, Union, Generator, Callable, Tuple
from transformers import AutoModelForSequenceClassification, AutoConfig, AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer
from tqdm import tqdm
import torch
import numpy as np
import unittest
from unittest.mock import Mock, patch

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

## Dataset preparation

We will be using the same dataset as Project 1, so let's just repeat some of that code here. The Financial PhraseBank dataset is relatively small (fewer than 5,000 examples), so we'll use a fairly aggressive train/test split (70/30). Since BERT is pretrained, we don't need an enormous training set anyway.

In [None]:
# Load the dataset with the 'sentences_50agree' configuration
phrasebank = load_dataset("financial_phrasebank", "sentences_50agree")

# Split the 'train' data into training and test sets
phrasebank_split = phrasebank["train"].train_test_split(test_size=0.3, shuffle=True)

# Retrieve the string version of the three classes of sentiments
sentiment_names = phrasebank["train"].features["label"].names

Generating train split:   0%|          | 0/4846 [00:00<?, ? examples/s]
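
The split sizes reported later in the notebook (3392 training and 1454 test examples) follow from the 70/30 ratio applied to the 4846 loaded examples. Here is a quick check of that arithmetic. It assumes the test set size is rounded up and the remainder goes to training; that rounding rule is an assumption that happens to match the reported sizes, not something the notebook states.

```python
import math

total_examples = 4846   # size of the loaded "train" split
test_fraction = 0.3     # test_size passed to train_test_split

# Assumption: the test set size is rounded up, the rest is used for training.
n_test = math.ceil(total_examples * test_fraction)
n_train = total_examples - n_test

print(n_train, n_test)  # 3392 1454
```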

## Retrieve Pretrained Model
We will retrieve the pretrained [BERT model](https://huggingface.co/bert-base-uncased) from Hugging Face. This is an encoder model that can easily be fine-tuned for a variety of tasks. In this project, we'll be fine-tuning it for classification on the Financial PhraseBank dataset.

In [None]:
# Retrieve the model and tokenizer for 'bert-base-uncased'
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(sentiment_names))
# Note that we are specifying the number of labels we want.
# This adds a classification head that outputs one logit per class; softmax is applied when computing the loss.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
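
The warning above says that `classifier.weight` and `classifier.bias` are newly initialized: the pretrained checkpoint has no sentiment head, so one is created from scratch and must be trained. The sketch below is a rough NumPy illustration of what such a head computes. It assumes a hidden size of 768 (the size used by `bert-base-uncased`) and three labels; the random values and variable names are illustrative stand-ins, not the model's actual weights or internals.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 3

# Stand-in for the sentence representation produced by the encoder.
pooled = rng.normal(size=(1, hidden_size))

# Freshly initialized head: small random weights and a zero bias.
weight = rng.normal(scale=0.02, size=(hidden_size, num_labels))
bias = np.zeros(num_labels)

# The head is a linear layer that outputs one logit per class.
logits = pooled @ weight + bias

# Softmax turns logits into probabilities. During training it is
# applied as part of the cross-entropy loss, not by the head itself.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(logits.shape, probs.sum())
```

Because the weights start out random, the probabilities are meaningless until the head has been fine-tuned.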


## Tokenize Dataset
In Project 1, we didn't tokenize the dataset because we only needed it ad hoc. This time, we know that we'll be iterating over it several times during training, so we'll tokenize the whole thing up front to save time later.

In [None]:
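def tokenize_function(example: Dict[str, List[Any]]) -> Dict[str, torch.Tensor]:
    """Tokenizes a batch of examples using a pre-trained tokenizer.

    Because this function is mapped with `batched=True`, `example` holds a batch:
    each key maps to a list of values rather than a single value.

    Args:
        example: A batch of examples; `example["sentence"]` is a list of sentences.

    Returns:
        The tokenizer output for the batch, including input_ids and attention_mask,
        padded to the longest sentence in the batch and truncated to the model's maximum length.
    """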
    tokenized_example = tokenizer(
        example["sentence"],
        padding=True,
        truncation=True,
        return_tensors="pt"
    )
    return tokenized_example

# Map the train and test sets to tokenized versions of that data using the tokenize_function()
train_tokenized_datasets = phrasebank_split["train"].map(tokenize_function, batched=True)
test_tokenized_datasets = phrasebank_split["test"].map(tokenize_function, batched=True)

Map:   0%|          | 0/3392 [00:00<?, ? examples/s]

Map:   0%|          | 0/1454 [00:00<?, ? examples/s]
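
To make the three tokenizer jobs (text-to-token mapping, truncation, and padding) concrete, here is a toy sketch in plain Python. It is not the BERT WordPiece tokenizer: the whitespace splitting, the tiny `vocab`, the `PAD_ID` value, and the `toy_encode` helper are all hypothetical and exist only for illustration.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical padding id for this toy vocabulary
UNK_ID = 1  # hypothetical id for words missing from the vocabulary

def toy_encode(sentences: List[str], vocab: Dict[str, int],
               max_length: int) -> Dict[str, List[List[int]]]:
    """Map words to ids, truncate to max_length, pad to the longest sequence."""
    # Text-to-token mapping: look up each lowercased word in the vocabulary.
    ids = [[vocab.get(w, UNK_ID) for w in s.lower().split()] for s in sentences]
    # Truncation: drop tokens beyond max_length.
    ids = [seq[:max_length] for seq in ids]
    # Padding: extend every sequence to the longest one in the batch.
    longest = max(len(seq) for seq in ids)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in ids]
    # Attention mask: 1 for real tokens, 0 for padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

vocab = {"profit": 2, "rose": 3, "sharply": 4, "sales": 5, "fell": 6}
batch = toy_encode(["Profit rose sharply", "Sales fell"], vocab, max_length=8)
print(batch["input_ids"])       # [[2, 3, 4], [5, 6, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask is what lets the model ignore the padded positions.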

## Collator
Next we'll create a data collator, which will ensure that all our data is padded appropriately as it is loaded in batches to the model.  Passing the tokenizer to the data collator serves a specific purpose: it allows the collator to know how to handle padding and other sequence manipulations in a way that is consistent with how the original tokenization was done.

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
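
The benefit of padding at batch time can be seen with a small sketch. The `pad_batch` helper and the token-id sequences below are hypothetical; the point is only that padding each batch to its own longest sequence produces fewer padded tokens than padding everything to the global maximum.

```python
from typing import List

def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

# Hypothetical token-id sequences of different lengths (2, 3, 10, and 12).
examples = [[7, 8], [7, 8, 9], [7] * 10, [7] * 12]

# Collating two batches of two: each batch is padded to its own maximum.
batches = [pad_batch(examples[i:i + 2]) for i in range(0, len(examples), 2)]
dynamic_tokens = sum(len(seq) for batch in batches for seq in batch)

# Padding everything to the global maximum instead.
global_tokens = sum(len(seq) for seq in pad_batch(examples))

print(dynamic_tokens, global_tokens)  # 30 48
```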

## Setting Up Training Arguments

Before we start the training process, we need to configure various training parameters. This is done using the `TrainingArguments` class from the Hugging Face Transformers library. Here's a breakdown of the parameters we are setting:

- **output_dir**: This is the directory where the training outputs (like model checkpoints) will be saved. We set it to `"phrasebank-sentiment-analysis"`.

- **evaluation_strategy**: This parameter defines how often the model should be evaluated during training. We set it to `"steps"`, meaning the model will be evaluated at regular step intervals.

- **eval_steps**: This specifies the number of training steps between each evaluation. We set it to `100`.

- **per_device_train_batch_size**: This is the number of training examples processed per device (GPU or CPU) in each training step. A batch is a portion of the dataset used for training the model in a single step. We set it to `32`.

- **logging_steps**: This defines how often training metrics should be logged. We set it to `100`, so metrics will be logged every 100 steps.

- **num_train_epochs**: This is the number of times the training loop will iterate over the entire training dataset. We set it to `4`.

By setting these parameters, we control various aspects of training, evaluation, and logging, making the training process more structured and easier to manage.
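
As a rough check of what these settings imply, the sketch below works out how many optimization steps the run takes and when evaluation happens. It assumes a single device and no gradient accumulation, and uses the training split size of 3392 shown in the tokenization output. The result lines up with the `global_step=424` reported in the training output later in the notebook.

```python
import math

train_examples = 3392   # training split size shown in the tokenization output
batch_size = 32         # per_device_train_batch_size
num_epochs = 4          # num_train_epochs
eval_steps = 100        # eval_steps

# Assumption: one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(steps_per_epoch, total_steps, eval_points)  # 106 424 [100, 200, 300, 400]
```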


In [None]:
training_args = TrainingArguments(
    output_dir = "phrasebank-sentiment-analysis",
    evaluation_strategy = "steps",
    eval_steps = 100,
    per_device_train_batch_size = 32,
    logging_steps = 100,
    num_train_epochs = 4)

In [None]:
# @title Test Your Code!
class TestTrainingArguments(unittest.TestCase):

    def test_training_args(self):
        # Check each parameter
        self.assertEqual(training_args.output_dir, "phrasebank-sentiment-analysis")
        self.assertEqual(training_args.evaluation_strategy, "steps")
        self.assertEqual(training_args.eval_steps, 100)
        self.assertEqual(training_args.per_device_train_batch_size, 32)
        self.assertEqual(training_args.logging_steps, 100)
        self.assertEqual(training_args.num_train_epochs, 4)

# Run the tests
unittest.TextTestRunner().run(unittest.TestLoader().loadTestsFromTestCase(TestTrainingArguments))

.
----------------------------------------------------------------------
Ran 1 test in 0.003s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>

## Defining Custom Evaluation Metrics

To evaluate the performance of our fine-tuned model, we define a function called `compute_metrics`. This function will compute the F1 score and accuracy for the model's predictions.

Here's a breakdown of what the function does:

- **f1_metric and accuracy_metric**: We load the F1 and accuracy metrics with `evaluate.load` from the Hugging Face Evaluate library. These metrics are widely used for classification tasks.

- **logits, labels**: The function takes `eval_preds` as input, which is a tuple containing the logits (model outputs) and the true labels.

- **predictions**: We use NumPy's `argmax` function to find the index (class label) with the maximum value for each logit vector. This converts the logits to class labels.

- **f1_score**: We compute the F1 score using the loaded `f1_metric`. We set the average parameter to `"macro"` to calculate the metric independently for each class and then find the average.

- **accuracy**: We compute the accuracy using the loaded `accuracy_metric`.

The function then returns a dictionary containing these computed metrics.
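
To see what the argmax and macro-averaging steps do, here is a hand-rolled NumPy sketch. It is not the implementation used by the `evaluate` metrics; the `macro_f1` helper and the example logits are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def macro_f1(labels: np.ndarray, predictions: np.ndarray) -> float:
    """Average the per-class F1 scores, weighting every class equally."""
    scores = []
    for cls in np.union1d(labels, predictions):
        tp = np.sum((predictions == cls) & (labels == cls))
        fp = np.sum((predictions == cls) & (labels != cls))
        fn = np.sum((predictions != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

logits = np.array([[2.0, 0.1, -1.0],   # highest score at index 0
                   [0.2, 1.5, 0.3],    # highest score at index 1
                   [0.1, 0.2, 3.0],    # highest score at index 2
                   [1.2, 0.9, 0.1]])   # highest score at index 0
labels = np.array([0, 1, 2, 1])

# argmax over the last axis converts each logit vector to a class label.
predictions = np.argmax(logits, axis=-1)
accuracy = float(np.mean(predictions == labels))

print(predictions, accuracy, macro_f1(labels, predictions))
```

Here three of four predictions are correct, so accuracy is 0.75. The per-class F1 scores are 2/3, 2/3, and 1, giving a macro F1 of 7/9 (about 0.778).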

In [None]:
def compute_metrics(eval_preds: Tuple[np.ndarray, np.ndarray]) -> Dict[str, float]:
    """Computes F1 score and accuracy for model evaluation.

    This function takes a tuple containing the predicted logits and true labels,
    and computes the F1 score and accuracy. The F1 and accuracy metrics are loaded
    inside the function with `evaluate.load` from the Hugging Face Evaluate library.

    Args:
        eval_preds: A tuple containing two NumPy arrays.
                    The first array contains the predicted logits.
                    The second array contains the true labels.

    Returns:
        A dictionary containing the F1 score and accuracy as scalar values.
    """

    # Load evaluation metrics
    f1_metric = evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    # Extract logits and labels from eval_preds
    logits, labels = eval_preds

    # Convert logits to class labels
    predictions = np.argmax(logits, axis=-1)

    # Compute F1 score and extract the scalar value
    f1_result = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    f1_score = f1_result['f1'] if isinstance(f1_result, dict) else f1_result

    # Compute accuracy and extract the scalar value
    accuracy_result = accuracy_metric.compute(predictions=predictions, references=labels)
    accuracy_score = accuracy_result['accuracy'] if isinstance(accuracy_result, dict) else accuracy_result


    return {"F1": f1_score, "Accuracy": accuracy_score}


In [None]:
# @title Test Your Code!
from sklearn.metrics import f1_score, accuracy_score

class TestComputeMetrics(unittest.TestCase):

    def test_compute_metrics(self):
        # Create example data: 4 correct predictions, 2 incorrect predictions
        true_labels = np.array([0, 1, 0, 1, 1, 0])
        pred_logits = np.array([[0.7, 0.3], [0.4, 0.6], [0.6, 0.4], [0.35, 0.65], [0.8, 0.2], [0.4, 0.6]])

        # Compute expected F1 and accuracy using sklearn
        pred_labels = np.argmax(pred_logits, axis=-1)
        expected_f1 = f1_score(true_labels, pred_labels, average='macro')
        expected_accuracy = accuracy_score(true_labels, pred_labels)

        # Compute metrics using the function to be tested
        result = compute_metrics((pred_logits, true_labels))

        # Validate the results
        self.assertAlmostEqual(result["F1"], expected_f1, places=5)
        self.assertAlmostEqual(result["Accuracy"], expected_accuracy, places=5)

# Run the tests in the notebook
unittest.TextTestRunner().run(unittest.TestLoader().loadTestsFromTestCase(TestComputeMetrics))


.
----------------------------------------------------------------------
Ran 1 test in 1.854s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>

### Initializing the Trainer

In this section, we initialize the `Trainer` class provided by the Hugging Face Transformers library. The `Trainer` is responsible for managing the training and evaluation loops. Below is an explanation of each argument passed to the `Trainer`:

- `model.to(torch_device)`: The pre-trained model that we are fine-tuning for our task. It is moved to the device specified by `torch_device` (GPU if available, otherwise CPU).
  
- `training_args`: This contains various training arguments like the output directory, evaluation strategy, batch size, etc., which are defined in a `TrainingArguments` object.
  
- `train_dataset=train_tokenized_datasets`: This is the tokenized version of our training dataset, which the `Trainer` will use during the training process.
  
- `eval_dataset=test_tokenized_datasets`: Similar to `train_dataset`, this is the tokenized version of our test dataset used during the evaluation steps.
  
- `data_collator=data_collator`: A data collator is responsible for batching together samples for training and evaluation. Here, we use a predefined data collator suitable for our task.
  
- `tokenizer=tokenizer`: The tokenizer converts text into tokens that the model can understand. It is not strictly necessary for training here, but passing it lets the `Trainer` save it alongside the model, so it is available for post-training tasks like inference.
  
- `compute_metrics=compute_metrics`: This function is used to compute evaluation metrics, like F1 score and accuracy, at the end of each evaluation loop.
  
By initializing the `Trainer` with these arguments, we set up a robust training and evaluation loop that takes care of most of the heavy lifting for us.


In [None]:
trainer = Trainer(
    model = model.to(torch_device),
    args = training_args,
    train_dataset = train_tokenized_datasets,
    eval_dataset = test_tokenized_datasets,
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics)

# Note: There is no unit test for this cell.
# If the training loop in the next code cell works, then you've succeeded!

### Starting the Training Process

The `trainer.train()` method is called to start the actual training of the model. This function initiates the training loop that iterates over the training dataset, updates the model parameters, and performs evaluations based on the configurations we set in `TrainingArguments` and `Trainer`.

When this method is called, the following steps are executed:

1. **Initialization**: The optimizer and learning-rate scheduler are created based on the configurations; the model itself was already loaded earlier.
  
2. **Training Loop**: The model iterates over the training data in batches, performing forward and backward passes, and updating the model weights.
  
3. **Evaluation**: If specified in `TrainingArguments`, the model is evaluated on the test dataset at regular intervals. Metrics like F1 score and accuracy are computed using the `compute_metrics` function.
  
4. **Logging**: Training and evaluation statistics are logged, which can be viewed in real-time if a logging utility like TensorBoard is used.

By calling this single method, the entire training, evaluation, and logging pipeline is executed, simplifying the process into a one-step operation.
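
The steps above can be sketched as a bare-bones loop. This is a toy NumPy softmax classifier on synthetic data, not what `Trainer` does internally; the data, learning rate, and `eval_every` interval are all made up for illustration. It shows the same shape of loop: forward pass, backward pass, weight update, periodic evaluation, and logging.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2 features, 2 classes, separable by the sign of the first feature.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
X_train, y_train, X_eval, y_eval = X[:150], y[:150], X[150:], y[150:]

W = np.zeros((2, 2))  # the "model" parameters
learning_rate, batch_size, eval_every = 0.5, 32, 5

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

step, history = 0, []
for epoch in range(4):
    for start in range(0, len(X_train), batch_size):
        xb = X_train[start:start + batch_size]
        yb = y_train[start:start + batch_size]
        rows = np.arange(len(yb))
        probs = softmax(xb @ W)                            # forward pass
        loss = -np.log(probs[rows, yb]).mean()             # cross-entropy loss
        grad_logits = probs.copy()
        grad_logits[rows, yb] -= 1                         # backward pass
        W -= learning_rate * xb.T @ grad_logits / len(yb)  # weight update
        step += 1
        if step % eval_every == 0:                         # periodic evaluation
            acc = (softmax(X_eval @ W).argmax(axis=1) == y_eval).mean()
            history.append((step, float(loss), float(acc)))  # logging

print(history[-1])
```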


In [None]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


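| Step | Training Loss | Validation Loss | F1 | Accuracy |
|------|---------------|-----------------|----|----------|
| 100 | 0.5676 | 0.443338 | 0.796069 | 0.825309 |
| 200 | 0.2504 | 0.413734 | 0.832065 | 0.854195 |
| 300 | 0.1266 | 0.525624 | 0.845222 | 0.859697 |
| 400 | 0.0501 | 0.61501 | 0.837114 | 0.852132 |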


TrainOutput(global_step=424, training_loss=0.23752048920910313, metrics={'train_runtime': 386.4382, 'train_samples_per_second': 35.11, 'train_steps_per_second': 1.097, 'total_flos': 1045875835468800.0, 'train_loss': 0.23752048920910313, 'epoch': 4.0})
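
One way to read the evaluation log is to ask which evaluation step looked best. In the numbers above, the training loss keeps falling while the validation loss starts rising after step 200, which is often a sign of overfitting. The sketch below simply copies the logged values into a list and picks the best row by two different criteria; the `log` structure and its key names are made up for illustration.

```python
# Evaluation log copied from the training output above.
log = [
    {"step": 100, "train_loss": 0.5676, "val_loss": 0.443338, "f1": 0.796069, "accuracy": 0.825309},
    {"step": 200, "train_loss": 0.2504, "val_loss": 0.413734, "f1": 0.832065, "accuracy": 0.854195},
    {"step": 300, "train_loss": 0.1266, "val_loss": 0.525624, "f1": 0.845222, "accuracy": 0.859697},
    {"step": 400, "train_loss": 0.0501, "val_loss": 0.61501, "f1": 0.837114, "accuracy": 0.852132},
]

# Lowest validation loss and highest F1 do not have to agree.
best_by_loss = min(log, key=lambda row: row["val_loss"])
best_by_f1 = max(log, key=lambda row: row["f1"])

print(best_by_loss["step"], best_by_f1["step"])  # 200 300
```

Which criterion to prefer depends on the goal; this is the kind of question the hyperparameter tuning extension could explore.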

### Authentication and Model Upload to Hugging Face Hub

#### Authentication
The `notebook_login()` function from the `huggingface_hub` library authenticates your notebook with your Hugging Face account. This step is required before pushing models to the Hugging Face Hub. A login widget will appear in the notebook and ask for a Hugging Face access token; the token needs write permission to push a model.

#### Pushing Model to the Hub
After successful authentication, we call `trainer.push_to_hub()` to upload the trained model to the Hugging Face Model Hub.

Here's what happens when you execute this code:

1. **Authentication**: The `notebook_login()` function prompts you to log in to your Hugging Face account, allowing you secure access to push models to the hub.

2. **Model Upload**: The `trainer.push_to_hub()` method uploads all model files (model weights, configuration, etc.) to your Hugging Face account. The model will be publicly available, and others can download it using its identifier.

By running these commands, you not only preserve your model but also make it accessible to the wider community for various NLP tasks.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
# Push to the hub
model_url = trainer.push_to_hub()
print(f'Find your new model here:  {model_url}')

Find your new model here:  https://huggingface.co/guilima5/phrasebank-sentiment-analysis/tree/main/


# Project 2 Wrap-Up

## Summary

In this project, we successfully achieved the following learning goals:

### Fine-Tuning BERT
We downloaded a pre-trained BERT model and fine-tuned it for the task of Sentiment Analysis, specifically focusing on financial data. This gave us hands-on experience with adapting a general-purpose language model to a specialized task.

### Tokenizer Usage
We learned how to use a tokenizer for essential text processing steps such as text-to-token mapping, padding, and truncation. This is crucial for preparing textual data for model training.

### Training Setup
We utilized the `TrainingArguments` and `Trainer` classes from the Hugging Face Transformers library. These classes encapsulate best practices for training transformer models and provide a streamlined way to set up and execute the training process.

### Model Deployment
Finally, we pushed our fine-tuned model to the Hugging Face Hub. This allows for easy sharing of the model and offers a platform for community evaluation and usage.

## Optional Steps for Future Exploration
- **Evaluation Metrics**: Dive deeper into the evaluation metrics, possibly comparing them with benchmarks or other models.
- **Model Interpretability**: Investigate why the model makes specific predictions to understand it better.
- **Hyperparameter Tuning**: Experiment with different hyperparameters to potentially improve model performance.
- **Version Control**: Learn to manage different versions of the model on the Hub.
- **Real-World Testing**: Demonstrate how to use the deployed model for sentiment analysis on new financial data.
- **Documentation**: Add detailed documentation to enhance the project's understandability and reusability.

# [Optional] Bonus: Extensions

This project focused on building a text classification model for identifying sentiment of sentences from financial news. There are many ways we could extend this project to handle more complex natural language processing tasks:

**Fine-Tuning for Named Entity Recognition**

We could fine-tune a pretrained model like BERT or RoBERTa to do named entity recognition (NER). This involves identifying "named entities" like people, organizations, and locations in text. The [TNER dataset](https://huggingface.co/datasets/tner/fin) on Hugging Face provides labeled data for this task. We would add a token classification head to our model and train it to predict an entity tag for each token. To evaluate the model, we need to turn the per-token predictions into entity spans and compare them to the ground truth labels. The [seqeval](https://github.com/chakki-works/seqeval) library provides useful functions to evaluate sequence predictions for NER models.

[Token classification reference notebook](https://github.com/huggingface/notebooks/blob/main/examples/token_classification.ipynb)
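
To illustrate what evaluating structured NER output involves, here is a simplified sketch of entity-level F1 over BIO tags. It is not seqeval's implementation and handles fewer edge cases (for instance, a stray `I-` tag with no preceding `B-` tag is simply ignored). The helper names and the `ORG`, `PER`, and `LOC` tags are illustrative assumptions.

```python
from typing import List, Set, Tuple

def extract_entities(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (type, start, end) spans from a BIO tag sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == current
        if current is not None and not inside:
            entities.add((current, start, i - 1))
            current = None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities

def entity_f1(true_tags: List[str], pred_tags: List[str]) -> float:
    """F1 over exact span matches: type, start, and end must all agree."""
    true_ents = extract_entities(true_tags)
    pred_ents = extract_entities(pred_tags)
    correct = len(true_ents & pred_ents)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred_ents), correct / len(true_ents)
    return 2 * precision * recall / (precision + recall)

true_tags = ["B-ORG", "I-ORG", "O", "B-PER", "O"]
pred_tags = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]
print(entity_f1(true_tags, pred_tags))  # 0.5
```

The prediction finds one of the two true entities and adds one spurious entity, so precision and recall are both 0.5.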

**77-Class Intent Classifier**

To build a more sophisticated virtual assistant, we could train a model to classify customer intents into 77 different classes using the [Banking77 dataset](https://huggingface.co/datasets/PolyAI/banking77). This would allow our assistant to distinguish between a much finer-grained set of customer needs like changing passwords or reporting lost credit cards. We would replace the classification layer in our existing model architecture with a new layer predicting 77 classes instead of just 3 classes.
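
As a small back-of-the-envelope illustration, the sketch below compares the size of the classification layer for 3 and 77 classes. It assumes the head is a single linear layer on top of a hidden size of 768 (the size used by `bert-base-uncased`); the `head_parameters` helper is hypothetical.

```python
hidden_size = 768  # hidden size of bert-base-uncased

def head_parameters(num_labels: int) -> int:
    """Weights plus biases of one linear layer: hidden_size -> num_labels."""
    return hidden_size * num_labels + num_labels

print(head_parameters(3), head_parameters(77))  # 2307 59213
```

Either way the head is tiny compared to the encoder, and in practice the swap amounts to loading the pretrained model with `num_labels=77`, just as `num_labels` was set for three sentiment classes earlier in this notebook.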

Adding capabilities like NER and more detailed intent classification would make our assistant far more useful for real-world applications! The pretrained Transformer models we used provide a great starting point for implementing these more complex NLP tasks as well.

**Question Answering**

In addition to intent classification, we could train our model to answer user questions. The [HotpotQA dataset](https://huggingface.co/datasets/hotpot_qa) on HuggingFace contains 113k Wikipedia-based question-answer pairs.

To implement question answering, we would fine-tune a model like BERT on this data using a span prediction head. The model would take as input a context paragraph from Wikipedia and a question, and predict the start and end token span in the context containing the answer.

Question answering is useful for conversational assistants as it allows directly answering user questions, instead of simply classifying the intent. Adding a module like this could make our assistant more capable of natural conversation and providing relevant information to users.

[QA reference notebook](https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb)

In [None]:
## Start your extension project here