# FinBERT (benchmark)

## Environment configuration


In [2]:
!pip install -qU transformers accelerate datasets==2.16.0 watermark textattack
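!pip install pyarrow
# Pinning NumPy below 2.x can clash with packages already built against NumPy 2;
# if a "numpy.dtype size changed" ValueError follows, restarting the kernel may help.
!pip install "numpy<2"
# scikit-learn is needed later for classification_report
!pip install -q pandas tqdm scikit-learn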

%reload_ext watermark
%watermark -vmp transformers,datasets,torch,numpy,pandas,tqdm

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")



ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

## Dataset

[Financial PhraseBank](https://huggingface.co/datasets/takala/financial_phrasebank)

In [None]:
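from datasets import load_dataset, DatasetDict

dataset = load_dataset("takala/financial_phrasebank", "sentences_50agree")

# The dataset ships with a single "train" split, so validation and test sets are carved out manually
full_dataset = dataset['train']

# Keep 80% for training, then split the remaining 20% evenly into validation and test (10% each)
split_dataset = full_dataset.train_test_split(test_size=0.2, seed=42)
test_valid_split = split_dataset['test'].train_test_split(test_size=0.5, seed=42)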

dataset = DatasetDict({
    'train': split_dataset['train'],
    'validation': test_valid_split['train'],
    'test': test_valid_split['test']
})

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
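
As a sanity check on the 80/10/10 split, the expected split sizes can be worked out by hand. The sketch below is only illustrative: it assumes the `sentences_50agree` configuration holds 4,846 sentences and that `train_test_split` rounds the test portion up, neither of which is shown in the notebook. `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test portion up (assumed behavior)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


TOTAL = 4846  # assumed size of the sentences_50agree configuration

# First split: 80% train, 20% held out
n_train, n_holdout = split_sizes(TOTAL, 0.2)
# Second split: the held-out part is halved into validation and test
n_validation, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_validation, n_test)
```

Under these assumptions the test split has 485 sentences, which agrees with the number of predictions reported after inference.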


In [None]:
label_names = dataset["train"].features["label"].names
# Map label names to integer ids (and back) using the dataset's ClassLabel feature
label2id = {name: dataset["train"].features["label"].str2int(name) for name in label_names}
id2label = {idx: name for name, idx in label2id.items()}

print("Label names: ", label_names)
print("Label ids: ", label2id["negative"], label2id['neutral'], label2id["positive"])

Label names:  ['negative', 'neutral', 'positive']
Label ids:  0 1 2


## FinBERT

[FinBERT](https://huggingface.co/ProsusAI/finbert)

In [None]:
from transformers import pipeline

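# The task is stated explicitly instead of being inferred from the model
sentiment_clf = pipeline(
    task="text-classification",
    model="ProsusAI/finbert",
    device=device, batch_size=32
)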

Device set to use cuda


In [None]:
from transformers.pipelines.pt_utils import KeyDataset

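test_outputs = []
# With top_k=None the pipeline returns the scores for every label, sorted from highest to lowest,
# so output[0] is the predicted label for that sentence
for output in sentiment_clf(KeyDataset(dataset["test"], "sentence"), top_k=None):
    test_outputs.append(output[0])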

print(f"Inference complete. Total predictions: {len(test_outputs)}")

Inference complete. Total predictions: 485
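
The inference loop keeps only the first element of each result. The sketch below illustrates the assumed shape of one result when the pipeline is called with `top_k=None`: one dict per label, each with a `label` and a `score`. The scores are made up for illustration, and `top_prediction` is a hypothetical helper that picks the best label without relying on the list order.

```python
# Hypothetical scores for a single sentence, shaped like one
# text-classification result with top_k=None (one dict per label).
example_output = [
    {"label": "neutral", "score": 0.91},
    {"label": "positive", "score": 0.06},
    {"label": "negative", "score": 0.03},
]


def top_prediction(scores):
    """Return the highest-scoring label dict, regardless of list order."""
    return max(scores, key=lambda item: item["score"])


best = top_prediction(example_output)
print(best["label"], best["score"])
```

Taking the maximum explicitly is a defensive choice; when the list is already sorted by score it returns the same element as `output[0]`.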


## Metrics

In [None]:
from sklearn.metrics import classification_report

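true_labels = dataset["test"]["label"]

# FinBERT emits lowercase label strings that match the dataset's label names,
# so label2id converts them to the same integer ids as the ground truth
predicted_labels = [label2id[output['label']] for output in test_outputs]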

print("\n--- Final Test Results (FinBERT Zero-Shot Baseline via Pipeline) ---")

report = classification_report(
    y_true=true_labels,
    y_pred=predicted_labels,
    target_names=label_names,
    digits=4
)

print(report)


--- Final Test Results (FinBERT Zero-Shot Baseline via Pipeline) ---
              precision    recall  f1-score   support

    negative     0.8028    0.9500    0.8702        60
     neutral     0.9603    0.8582    0.9064       282
    positive     0.8086    0.9161    0.8590       143

    accuracy                         0.8866       485
   macro avg     0.8573    0.9081    0.8785       485
weighted avg     0.8961    0.8866    0.8879       485



### Interpreting the metrics

- **Precision**: Of all the sentences the model labeled as negative, 80.28% were actually negative.

- **Recall**: Of all the sentences that were actually negative (60 in total), the model correctly identified 95.00%.

In summary, the overall accuracy of 88.66% is the headline metric, while the rest of the table supports the argument that the specialized model already performs robustly without any fine-tuning in this notebook.
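
To make the definitions concrete, the figures in the report can be recomputed from per-class counts. The counts below are not printed anywhere in the notebook; they are inferred from the reported support, recall, and precision (for example, 95% recall on 60 negative sentences implies 57 true positives), so treat them as a reconstruction. `precision_recall_f1` is a hypothetical helper.

```python
# Per-class counts reconstructed from the classification report:
# tp = correct predictions, predicted = times the class was predicted,
# support = true examples of the class in the test set.
counts = {
    "negative": {"tp": 57, "predicted": 71, "support": 60},
    "neutral": {"tp": 242, "predicted": 252, "support": 282},
    "positive": {"tp": 131, "predicted": 162, "support": 143},
}


def precision_recall_f1(tp, predicted, support):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / predicted  # share of predictions that were right
    recall = tp / support       # share of true examples that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for name, c in counts.items():
    p, r, f = precision_recall_f1(c["tp"], c["predicted"], c["support"])
    print(f"{name:>8}  precision={p:.4f}  recall={r:.4f}  f1={f:.4f}")

# Accuracy is the share of all test sentences that were classified correctly
total_correct = sum(c["tp"] for c in counts.values())
total_examples = sum(c["support"] for c in counts.values())
accuracy = total_correct / total_examples
print(f"accuracy={accuracy:.4f}")
```

The reconstructed counts reproduce the per-class rows of the report as well as the 0.8866 accuracy (430 correct out of 485).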
