## Authorship
Otávio Sbampato
> Hugging Face - https://huggingface.co/otaviosbampato  
> GitHub - https://github.com/otaviosbampato

## Preparation

- first, we need to change our notebook's runtime configuration to use a GPU.  
- then, we install our prerequisites.

In [None]:
!pip install -U transformers datasets evaluate
# engine, datasets



In [None]:
import transformers; print(transformers.__version__)

4.57.1


## Preparing our dataset

url: https://huggingface.co/datasets/vamossyd/finance_emotions

In [None]:
from datasets import load_dataset

In [None]:
dataset_name = "vamossyd/finance_emotions"
train_set = load_dataset(dataset_name, split="train[:90%]")
test_set = load_dataset(dataset_name, split="train[90%:]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
train_set

Dataset({
    features: ['body', 'cleaned_text', 'label', 'chatGPT_label'],
    num_rows: 9000
})
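
the two `split` strings above cut the single `train` split by position: the first 90% of rows for training, the last 10% for testing. a minimal sketch of that partition, assuming 10,000 rows in total (9,000 + 1,000, as seen in this notebook); `percent_split` is a hypothetical helper, not part of `datasets`, and the library's exact rounding may differ when the row count does not divide evenly:

```python
# hypothetical helper: mimics how "train[:90%]" and "train[90%:]" partition rows
def percent_split(n_rows, percent):
    cut = round(n_rows * percent / 100)  # boundary index between the two slices
    return range(0, cut), range(cut, n_rows)

# assumption: 10,000 rows in total (9,000 train + 1,000 test)
train_idx, test_idx = percent_split(10_000, 90)
print(len(train_idx), len(test_idx))  # 9000 1000
```

note that slicing like this keeps the rows in their stored order, so nothing is shuffled between the two parts.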

In [None]:
train_set[65]

{'body': '$MNKD This stock is the worst investment I have ever made.  Underwater, I will be happy to dump once (if ever) I get back to flat.',
 'cleaned_text': '<ticker> this stock is the worst investment i have ever made . underwater , i will be happy to dump once if ever i get back to flat .',
 'label': 'sad',
 'chatGPT_label': 'sad'}

## Tokenizer

we'll now turn the text we care about into numbers by tokenizing it.

In [None]:
from transformers import AutoTokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# bert-base-cased is case-sensitive: it keeps upper- and lower-case letters distinct instead of lowercasing the text.

In [None]:
# train_set[100:105]
label_to_value = {
    "neutral": 0,
    "happy": 1,
    "fear": 2,
    "surprise": 3,
    "sad": 4,
    "disgust": 5,
    "anger": 6
}
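
the model will answer with numbers, so at some point we'll want to go back from a number to its label name. a small sketch of the reverse lookup; `value_to_label` is a name made up for this example:

```python
label_to_value = {
    "neutral": 0,
    "happy": 1,
    "fear": 2,
    "surprise": 3,
    "sad": 4,
    "disgust": 5,
    "anger": 6,
}

# invert the mapping so a predicted id can be turned back into a label name
value_to_label = {value: label for label, value in label_to_value.items()}

print(value_to_label[4])  # sad
```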

In [None]:
def tokenizer_func(batch):
  tokenized_batch = tokenizer(
      batch['cleaned_text'],
      padding="max_length",
      truncation=True)

  # map every string label in the batch to its integer id via label_to_value,
  # and store those ids under 'label' so the model receives numeric targets.
  tokenized_batch['label'] = [label_to_value[label] for label in batch['label']]

  # then we return it.
  return tokenized_batch

> - above, we tokenize the cleaned_text field of each batch, pad every example to max_length (for BERT, 512 tokens), and set truncation to True.
> - if a cleaned_text has fewer than 512 tokens, the remaining positions are filled with the padding token (id 0), and the attention mask tells the model to ignore them.


> > *   for strings as short as ours, a likely better approach would be padding=True, which pads only up to the longest sequence in each batch.
> > *   in a study-case scenario such as this, however, it's fine to keep max_length: every example ends up with the same shape (at the cost of a lot of wasted computation on padding).


> - if it has MORE than 512 tokens, the tokenizer simply truncates it.
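
to make padding and truncation concrete, here is a simplified sketch on plain lists of token ids. `pad_or_truncate` is a hypothetical helper and the ids are made up; the real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores:

```python
PAD_ID = 0  # id used for padding positions

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    token_ids = token_ids[:max_length]           # truncation: drop anything past max_length
    n_real = len(token_ids)
    n_pad = max_length - n_real
    input_ids = token_ids + [pad_id] * n_pad     # padding: fill the rest with pad_id
    attention_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding to ignore
    return input_ids, attention_mask

print(pad_or_truncate([5, 6, 7], max_length=6))
# ([5, 6, 7, 0, 0, 0], [1, 1, 1, 0, 0, 0])
print(pad_or_truncate([5, 6, 7, 8, 9, 10, 11, 12], max_length=6))
# ([5, 6, 7, 8, 9, 10], [1, 1, 1, 1, 1, 1])
```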

**wait, what even is a tokenizer?**

a computer is not as good at recognizing words as we are (and even if it were, doing so would be far more expensive computationally). <br/>
that is why we have a tokenizer! its purpose is to map words to numbers, which makes all our lives much easier. <br/>
check the example below to understand it more thoroughly.

suppose we have 2 phrases:

1.   I enjoy driving my car  
2.   I really enjoy reading good books
  
mapping each unique word to a number gives:  

I -> 0  
enjoy -> 1  
driving -> 2  
my -> 3  
car -> 4  
really -> 5  
reading -> 6  
good -> 7  
books -> 8  
  
*tokenizer comes in*  
phrase 1 becomes: 0 1 2 3 4  
phrase 2 becomes: 0 5 1 6 7 8  
and so on.  
  
this makes it much easier for the model to interpret text and do its magic.  

> it is worth noting that tokenizers differ in behavior. some split words into subwords (*driving -> 2* would then become *driv -> 2* and *ing -> 3*), which helps with rare or unseen words, and some even map every single character and build the text from those. BERT's tokenizer is of the subword kind (WordPiece).
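
the word-to-number idea from the two phrases can be sketched in a few lines. this is a toy word-level tokenizer written only for illustration (`build_vocab` and `encode` are made-up names); it is not how the BERT tokenizer works internally:

```python
def build_vocab(phrases):
    """Give every unique word the next free number, in order of first appearance."""
    vocab = {}
    for phrase in phrases:
        for word in phrase.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(phrase, vocab):
    """Replace each word with its number."""
    return [vocab[word] for word in phrase.split()]

phrases = ["I enjoy driving my car", "I really enjoy reading good books"]
vocab = build_vocab(phrases)

for phrase in phrases:
    print(phrase, "->", encode(phrase, vocab))
# I enjoy driving my car -> [0, 1, 2, 3, 4]
# I really enjoy reading good books -> [0, 5, 1, 6, 7, 8]
```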

In [None]:
tokenized_train_data = train_set.map(tokenizer_func, batched=True)
tokenized_test_data = test_set.map(tokenizer_func, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

## Importing the model

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=7)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


here we use BERT, an encoder-only transformer model. encoder-only models read the whole input at once, using the context on both sides of each token, which suits classification tasks such as sentiment analysis.  
> encoder-only transformers do NOT generate new text (unlike decoder-only models such as GPT). for our use case, they are a good fit.
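
for classification, the model ends in a small classifier layer (the `classifier.weight` and `classifier.bias` named in the warning above) that outputs one score, called a logit, per label: 7 of them here. a library-free sketch of how such scores can be turned into probabilities and a predicted label; the logit values are made up, and the label order is assumed to follow `label_to_value`:

```python
import math

# assumption: same order as the ids in label_to_value
LABELS = ["neutral", "happy", "fear", "surprise", "sad", "disgust", "anger"]

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.1, -1.2, 0.3, -0.5, 2.4, 0.0, 0.7]  # made-up scores for one example
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability

print(LABELS[best], round(probs[best], 3))
```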

In [None]:
from transformers import TrainingArguments

In [None]:
training_args = TrainingArguments(output_dir="finance-model")

In [None]:
!pip install evaluate



In [None]:
import numpy as np
import evaluate

these two libs will help us compute accuracy and other metrics more easily.

In [None]:
metric = evaluate.load("accuracy")


In [None]:
def compute_metrics(evaluate_prediction):
  logits, labels = evaluate_prediction
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)
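
to see what this function computes, here is the same idea without numpy or evaluate, on a tiny made-up example with 3 classes. `argmax` and `accuracy` are helper names written for this sketch:

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

logits = [
    [2.0, 0.1, 0.3],  # predicts class 0
    [0.2, 0.1, 1.5],  # predicts class 2
    [0.9, 1.1, 0.4],  # predicts class 1
    [1.2, 0.8, 0.1],  # predicts class 0
]
labels = [0, 2, 0, 0]

print(accuracy(logits, labels))  # {'accuracy': 0.75}
```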

we'll implement training now.

In [None]:
training_args = TrainingArguments(output_dir="finance-model",
                                  eval_strategy="epoch",
                                  report_to="none")

In [None]:
from transformers import Trainer

the Trainer puts everything together. in a real project you might pick more distinct variable names, but for this learning environment we name each variable as closely as possible to the parameter it is passed to.  
everything we've done so far leads to this very moment.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_test_data,
    compute_metrics=compute_metrics,
)

training takes ~45min. let it do its magic :)

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.0598 | 1.6779 | 0.408 |
| 2 | 0.6904 | 2.027521 | 0.432 |
| 3 | 0.3669 | 2.666963 | 0.449 |


TrainOutput(global_step=3375, training_loss=0.7243557219328703, metrics={'train_runtime': 2662.1503, 'train_samples_per_second': 10.142, 'train_steps_per_second': 1.268, 'total_flos': 7104317414400000.0, 'train_loss': 0.7243557219328703, 'epoch': 3.0})
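
the `global_step=3375` in the output can be checked by hand. a quick sketch, assuming the defaults we did not override in `TrainingArguments` (batch size 8 per device, 3 epochs) and a single GPU:

```python
import math

n_train = 9000   # rows in train_set
batch_size = 8   # assumption: default per_device_train_batch_size, one GPU
epochs = 3       # assumption: default num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)  # one step = one batch
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 1125 3375
```

the results table also shows training loss falling while validation loss rises every epoch, which usually points to overfitting, so fewer epochs or early stopping may be worth trying.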

we now upload our project to Hugging Face!

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `hf auth whoami` to get more information or `hf auth logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: fineGrained).
The tok

In [None]:
model.save_pretrained("finance-emotions-model",
                      push_to_hub=True,
                      private=False)

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...s-model/model.safetensors:   0%|          |  558kB /  433MB            