# FinBERT

In this notebook, you will learn how to use **FinBERT**, a transformer-based model fine-tuned on financial text, to perform sentiment analysis. We will cover how to: 
- Set up FinBERT using the Hugging Face Transformers library
- Run predictions on financial text
- Interpret and visualize sentiment results

# Background

Natural language processing (NLP) has entered a new era in how machines represent and understand language. But let's first recap the evolution of language representation for classification tasks, like **sentiment analysis**. 

## Early Sentiment Classification: N-Grams, Bag-of-Words, and Linear Models
The earliest approaches to sentiment analysis relied on simple yet effective techniques rooted in **n-gram** models and the **Bag-of-Words (BoW)** representation. N-grams represent documents by counting sequences of one (unigram), two (bigram), or more consecutive words. For example, “not good” as a bigram helps capture negative sentiment more accurately than the words “not” and “good” alone.
In the BoW model, each document is converted into a sparse vector of word or n-gram counts, with no information about grammar or word meaning. These vectors serve as input features for simple classifiers like:
- **Naive Bayes** - assumes word independence and applies Bayes’ theorem to predict sentiment by modeling the likelihood of observing each word in documents of a particular sentiment.
- **Logistic Regression** - learns weights for each word or n-gram as input features to predict sentiment probability.
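
To make the representation concrete, here is a minimal sketch of bag-of-n-grams feature extraction using only the standard library. The function names `ngrams` and `bag_of_ngrams` and the whitespace tokenization are assumptions made for illustration; real pipelines typically use a library vectorizer and a proper tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Count every 1-gram up to max_n-gram in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


features = bag_of_ngrams("the outlook is not good")
print(features["not good"])  # count of the bigram feature
print(features["good"])      # count of the unigram feature
print(len(features))         # number of distinct features in this document
```

A classifier such as Naive Bayes or logistic regression would then learn one weight (or likelihood) per feature, so the bigram “not good” can carry a different signal than the unigram “good”.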

These models are fast and effective for simple cases. N-grams capture which words appear together in short sequences, but every word or n-gram is still treated as an independent feature, so the models cannot generalize across contexts or capture deeper word meaning. Words like “bad” and “terrible” are treated as unrelated, and sentence structure beyond the n-gram window is lost.

## Machine Learning 

In the 2000s, machine learning techniques like **Support Vector Machines (SVMs)** and **Conditional Random Fields (CRFs)** became widely used in NLP. These models offered stronger generalization and allowed more complex linguistic features to be incorporated into classification tasks. However, they still relied on sparse, hand-crafted features, typically derived from BoW or n-gram representations.

## Neural Networks - Non-Linear Classification
To overcome the limitations of linear models, **neural networks** were introduced. A basic neural network consists of an input layer, one or more hidden layers that apply nonlinear transformations, and an output layer that produces predictions.

Neural networks allow models to learn more complex decision boundaries and interactions between features. 
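
As a small illustration of why the hidden layer matters, the sketch below hard-codes a two-unit network that computes XOR, a feature interaction that no single linear layer can represent. The weights are hand-picked for the example, not learned, and the function name `tiny_network` is hypothetical.

```python
def relu(x):
    """Rectified linear unit: the nonlinearity applied in the hidden layer."""
    return max(0.0, x)


def tiny_network(x1, x2):
    """Forward pass of a 2-input, 2-hidden-unit, 1-output network."""
    # Hidden layer: weighted sum plus bias, followed by ReLU.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output layer: linear combination of the hidden activations.
    return 1.0 * h1 - 2.0 * h2


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", tiny_network(x1, x2))
```

The output is 1 only when exactly one input is active. In a trained network the weights would be fitted to data rather than set by hand.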

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/0*KfJUyVjsS9ZhxcBk.png" alt="Classical vs Deep Learning" width="600">
Image Source: Adarsha Regmi, NLP guide towards neural networks

  
Neural language models represent each word in the preceding context by its embedding, rather than by its word identity as n-gram language models do. Compared to n-gram models, neural language models can handle much longer histories, generalize better over contexts of similar words, and are more accurate at word prediction. On the other hand, they are much more complex, slower and more energy-intensive to train, and less interpretable than n-gram models.

## Word Embeddings and Deep Learning
### Static Word Embeddings
In vector semantics, a word is modeled as a vector in multi-dimensional continuous space, called an **embedding**. Traditional word embedding models, such as **Word2Vec** and **GloVe**, moved NLP forward by mapping words to dense vector representations that capture semantic and syntactic relationships. The embeddings are learned from the word distributions in large text corpora, and allow models to place similar words close in vector space (e.g., “excellent” and “great”), which significantly improves sentiment classification. 

<img src="https://towardsdatascience.com/wp-content/uploads/2021/03/15F4TXdFYwqi-BWTToQPIfg.jpeg" alt="Word2Vec Vectors" width="600">
Image source: Word2Vec Research Paper Explained. Towards Data Science

Google’s **Word2Vec** introduced two training architectures:
- **CBOW** (Continuous Bag-of-Words) - predicts a word given its context.
- **Skip-Gram** - predicts context words given a target word.
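
The difference between the two objectives is easiest to see in the training examples each one generates. The sketch below builds them for a short sentence; the function names and the `window` parameter are illustrative assumptions, and the actual Word2Vec implementation adds details such as subsampling and negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: one (target, context_word) pair per neighbor within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


tokens = "profits rose sharply".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```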

**GloVe**, developed at Stanford, took a matrix factorization approach, learning embeddings from global word co-occurrence counts. Like Word2Vec, it produced static embeddings that improved many NLP tasks, including sentiment analysis.

Both produce embeddings, dense vector representations that place semantically similar words close together: Word2Vec by training a shallow neural network, GloVe by fitting global co-occurrence statistics. For sentiment classification, this meant models could now recognize that “happy” and “joyful” convey similar tone.
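
"Close together" is usually measured with cosine similarity. The sketch below computes it for three toy vectors. The 3-dimensional values are invented purely for illustration; real Word2Vec or GloVe vectors have hundreds of dimensions and are learned from data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, made up for this example only.
toy_embeddings = {
    "happy": [0.9, 0.8, 0.1],
    "joyful": [0.8, 0.9, 0.2],
    "loss": [-0.7, -0.6, 0.3],
}

sim_close = cosine_similarity(toy_embeddings["happy"], toy_embeddings["joyful"])
sim_far = cosine_similarity(toy_embeddings["happy"], toy_embeddings["loss"])
print(f"happy vs joyful: {sim_close:.2f}")
print(f"happy vs loss:   {sim_far:.2f}")
```

Values near 1 indicate vectors pointing in nearly the same direction, and negative values indicate opposing directions.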



Since these models were pre-trained, the embeddings can be reused across tasks, reducing the need for task-specific training. However, a key limitation of these models is that each word has a single fixed vector regardless of context, i.e. **static embeddings**. A word like “cold” has the same vector whether referring to weather or tone.


### Contextual Word Embeddings

These limitations prompted the development of **contextualized word embeddings**. Embeddings from Language Models (**ELMo**) addressed the problem with a bidirectional language model (**biLM**) built from Long Short-Term Memory (**LSTM**) networks, which generates word vectors that change with the surrounding context. The model reads the full sentence and produces a dynamic representation for each word. The same word (e.g., “charged”) can now have different representations in “charged a fee” vs. “charged with a crime.”

<img src="https://jalammar.github.io/images/elmo-forward-backward-language-model-embedding.png" alt="ELMo Embeddings" width="500">
Image source: Jay Alammar, The Illustrated BERT, ELMO, and co. 

### Transfer Learning

**ULMFiT** followed with innovations in transfer learning, showing how a pretrained language model could be fine-tuned for specific downstream NLP tasks such as sentiment classification or spam detection. These models are trained on vast amounts of unlabeled text to predict the next word in a sequence, which makes it possible to learn deep language patterns without manual annotation.

## Transformers and Large Language Models

The Transformer architecture replaced recurrence with self-attention, enabling efficient parallel processing and capturing long-range dependencies in text. This design became the basis for **large language models (LLMs)** like GPT and BERT.
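
The core of self-attention can be sketched in a few lines: each position scores every other position, turns the scores into weights with a softmax, and takes a weighted average of the value vectors. The version below is a simplified, single-head sketch in plain Python. It feeds the same toy vectors in as queries, keys, and values, whereas a real Transformer first applies learned projection matrices and uses several heads in parallel.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w[j] * values[j][d] for j in range(len(values)))
               for d in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy 2-dimensional token vectors, invented for illustration.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 2) for w in row])
```

Because every position is processed independently of the others' outputs, all rows can be computed in parallel, which is what removes the sequential bottleneck of recurrent models.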

These models are pretrained on massive corpora using self-supervised tasks such as masked language modeling (BERT) or next-word prediction (GPT), and then fine-tuned for downstream tasks like sentiment classification. Unlike static embeddings, LLMs produce contextual embeddings that adapt to the full sentence, which helps them classify sentiment, tone, and subtle meaning more accurately.

<img src="https://jalammar.github.io/images/t/Transformer_decoder.png" alt="Transformer" width="500">
Image source: Jay Alammar, The Illustrated Transformer

### BERT: Deep Bidirectional Context for Sentiment Tasks

**BERT** (Bidirectional Encoder Representations from Transformers) marked another breakthrough. By analyzing text in both directions, BERT captures richer context than unidirectional models. Pretrained on Wikipedia and BookCorpus, it can be fine-tuned on sentiment classification.

BERT significantly outperforms earlier models on sentiment benchmarks. Its embeddings are dynamic and encode nuanced information, which helps it distinguish “not bad” from “bad” and cope better with negation and, to some extent, sarcasm. It uses the Transformer encoder to process each word in relation to both preceding and following words.

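BERT builds on two key ideas:
1. Masked Language Modeling (MLM): randomly masks words and trains the model to predict them from the context on both sides, enabling deep bidirectional context learning.
2. Transformer encoder architecture: adopted from the earlier Transformer work rather than introduced by BERT, it captures long-range dependencies better than RNNs.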
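
The masking step of MLM can be sketched as follows. This is a simplified illustration: the function name `mask_tokens` and its parameters are assumptions, and every selected token is replaced with `[MASK]`. The BERT paper selects about 15% of tokens and, among those, sometimes substitutes a random token or leaves the token unchanged instead.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, targets), where targets maps position -> original token."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the model is trained to recover this token
        else:
            masked.append(tok)
    return masked, targets


tokens = "net income declined due to higher costs".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

During pretraining, the loss is computed only at the positions stored in `targets`, so the model must use both the left and right context to fill in the blanks.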

<img src="https://jalammar.github.io/images/BERT-classification-spam.png" alt="Transformer" width="500">
Image source: Jay Alammar, The Illustrated BERT, ELMO, and co. 


#### FinBERT: Domain-Specific Sentiment Classification in Finance

While BERT performs well on general text, it struggles with financial language where words like “negative,” “flat,” or “beat” have technical meanings. For example, “negative cash flow” is a neutral financial term, not an expression of negative sentiment, and “beat consensus” implies a positive outcome, not aggression.

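Several models share the name FinBERT. The one used in this notebook, `ProsusAI/finbert`, adapts BERT by further pretraining it on a financial news corpus and then fine-tuning it for sentiment classification on the Financial PhraseBank dataset. (The figure of 4.9 billion tokens from SEC filings, earnings calls, and analyst reports describes a different FinBERT variant.) This domain-specific training helps the model better understand financial jargon, tone, and context, allowing for more accurate sentiment classification in tasks like investor communication analysis and earnings report interpretation. In the evaluations reported by its authors, FinBERT outperforms general-purpose models on financial sentiment tasks.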

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*-qwGn_js-CjwgOz5e8tWug.png" alt="FinBERT1" width="400">
<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*WBvoOrHXYXHPy_tx8iwLOQ.png" alt="FinBERT2" width="250">

Image source: Zulkuf Genc, FinBERT: Financial Sentiment Analysis with BERT 

All the fine-tuned FinBERT models are publicly hosted on Hugging Face 🤗.


References: 
- [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) (Dan Jurafsky and James H. Martin. 2025)
- Word2Vec [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781) (Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013)
- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/) (Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014)
- ELMo [Deep contextualized word representations](https://arxiv.org/pdf/1802.05365) (Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018)
- BERT [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805) (Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019)
- FinBERT [FinBERT: Financial Sentiment Analysis with BERT](https://medium.com/prosus-ai-tech-blog/finbert-financial-sentiment-analysis-with-bert-b277a3607101) (Zulkuf Genc, Prosus AI Tech Blog); model hosted on Hugging Face as `ProsusAI/finbert`

# Environment Setup

## Required libraries

To follow along with this tutorial, you will need the following Python libraries:
- `transformers`: from Hugging Face, to load FinBERT
- `torch`: for running the model (PyTorch based)
- `pandas`
- `numpy`
- `matplotlib`
- `scipy`

## Environment Setup

Please follow the instructions in the `README.md` file to set up the environment using `conda` and `environment.yml`.


## Setup

In [7]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import torch
import transformers

# Suppress all warnings
warnings.filterwarnings("ignore")

# Set display options for pandas
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

This code prints the installed PyTorch and Hugging Face Transformers versions and checks whether CUDA (NVIDIA GPU support) is available for acceleration:

In [8]:
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Transformers version:", transformers.__version__)

PyTorch version: 2.0.1
CUDA available: False
Transformers version: 4.40.1


## Load FinBERT and Tokenizer

As mentioned above, FinBERT is a version of BERT that has been fine-tuned specifically on financial sentiment data such as earnings reports, analyst statements, and press releases.

In [9]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

In [10]:
# Load pretrained FinBERT model and tokenizer
model_name = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

The `tokenizer` splits the text into tokens and converts them into numeric input IDs. The `model` processes those IDs and outputs raw scores for three sentiment classes: positive, negative, and neutral.
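
BERT-style tokenizers generally use WordPiece, which splits unknown words into known sub-word pieces by greedy longest-match. The sketch below imitates that idea with a made-up six-entry vocabulary. The vocabulary, the IDs, and the helper names `wordpiece` and `encode` are all hypothetical; the real FinBERT tokenizer has its own vocabulary and additional normalization steps.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, apply WordPiece, and add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


# Toy vocabulary invented for this example.
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "earn": 3, "##ings": 4, "rose": 5}

toy_tokens, toy_ids = encode("Earnings rose", toy_vocab)
print(toy_tokens)
print(toy_ids)
```

With the real tokenizer, `tokenizer.tokenize(text)` shows the pieces and `tokenizer(text)["input_ids"]` shows the corresponding IDs.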

## Sentiment Analysis
Analyzing sentiment in financial text is valuable for understanding the perspectives of managers, analysts, and investors. 

- **Input**: A financial text.

- **Output**: Positive, Neutral or Negative.

Let’s analyze some sample financial statements. In practice, these could be headlines, 10-K filings, press releases, etc.

### Sentence examples

In [11]:
texts = [
    # Positive examples
    "Earnings exceeded analyst expectations by 12%.",
    "The new product line drove record profits this quarter.",

    # Negative examples
    "The company faces lawsuits related to data breaches.",
    "Supply chain disruptions severely impacted Q2 margins.",

    # Neutral examples
    "The board of directors met on May 3rd.",
    "The company is headquartered in San Jose, California.",

    # Ambiguous example
    "Despite strong revenue, net income declined due to increased R&D investment.",
]


### Tokenize the input

In [12]:
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")


Parameters:
- `padding=True`: pads shorter sentences so that every sequence in the batch has the same length as the longest one.
- `truncation=True`: cuts off sentences that exceed the model’s maximum input length.
- `return_tensors="pt"`: returns PyTorch tensors for use with the model.
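
To see what padding and truncation do to a batch, here is a simplified sketch that operates on plain lists of token IDs. The IDs, the `pad_id=0` default, and the function name are assumptions for illustration. Unlike the real tokenizer, this version truncates with a plain slice and does not preserve the closing special token.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token IDs for three sentences of different lengths.
batch = [[11, 7, 8, 12], [11, 9, 12], [11, 4, 5, 6, 7, 8, 12]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The `attention_mask` returned alongside `input_ids` by the Hugging Face tokenizer plays the same role: it tells the model which positions are padding.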

## Run Sentiment Predictions 

### Get Sentiment Predictions

Now, we’ll pass the tokenized text to the FinBERT model and interpret the output.

In [13]:
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)



What is happening:
- We use `torch.no_grad()` because we’re only making predictions, not training.
- The model returns **logits**, which are unnormalized prediction scores.
- We convert logits to **probabilities** using the softmax function.
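
For reference, the softmax step is small enough to write by hand. The sketch below applies it to one row of hypothetical logits; it performs the same computation that `torch.nn.functional.softmax` with `dim=-1` applies to each row of the logits tensor.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 0.5, -1.0]  # hypothetical scores for three classes
example_probs = softmax(example_logits)
print([round(p, 3) for p in example_probs])
print(sum(example_probs))
```

Softmax preserves the ranking of the logits, so the class with the largest logit always has the largest probability.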

### Map Predictions to Sentiment Labels

In [14]:
labels = ["positive", "negative", "neutral"]

for text, prob in zip(texts, probs):
    sentiment = labels[torch.argmax(prob)]
    confidence = prob.max().item()
    print(f"\"{text}\" → {sentiment} ({confidence:.2f})")

"Earnings exceeded analyst expectations by 12%." → positive (0.95)
"The new product line drove record profits this quarter." → positive (0.93)
"The company faces lawsuits related to data breaches." → negative (0.91)
"Supply chain disruptions severely impacted Q2 margins." → negative (0.97)
"The board of directors met on May 3rd." → neutral (0.94)
"The company is headquartered in San Jose, California." → neutral (0.95)
"Despite strong revenue, net income declined due to increased R&D investment." → negative (0.97)


How do we get the scores?
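- `torch.argmax(prob)` gives the index of the highest-probability class for one sentence.
- We map that index to one of the labels: "positive", "negative", or "neutral". The order of the `labels` list must match the class order the model was trained with.
- Confidence is the highest softmax probability, i.e. how strongly the model favors its predicted class.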
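
A hard-coded label list silently produces wrong sentiments if its order differs from the checkpoint's. Hugging Face models expose the mapping as the dictionary `model.config.id2label`, so one option is to look labels up there instead. The sketch below shows the lookup on plain Python values; the helper name `predict_label`, the example mapping, and the probabilities are illustrative assumptions, not output from the model.

```python
def predict_label(prob_row, id2label):
    """Return (label, confidence) for one row of class probabilities."""
    best_index = max(range(len(prob_row)), key=lambda i: prob_row[i])
    return id2label[best_index], prob_row[best_index]


# Hypothetical mapping and probabilities, standing in for
# model.config.id2label and one row of `probs`.
example_id2label = {0: "positive", 1: "negative", 2: "neutral"}
label, confidence = predict_label([0.03, 0.95, 0.02], example_id2label)
print(f"{label} ({confidence:.2f})")
```

With the notebook's objects, the equivalent call would be `predict_label(prob.tolist(), model.config.id2label)` inside the loop.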

## Try Domain-Specific Text Snippets

### Analyst notes

In [15]:
texts = [
    "We maintain our overweight rating due to improving margins.",
    "Downgraded to neutral as macro risks continue to weigh on sentiment.",
    "The upgrade reflects stronger-than-expected earnings and a favorable outlook.",
    "We remain cautious due to continued pricing pressure and regulatory overhang.",
]


In [16]:
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

labels = ["positive", "negative", "neutral"]

for text, prob in zip(texts, probs):
    sentiment = labels[torch.argmax(prob)]
    confidence = prob.max().item()
    print(f"\"{text}\" → {sentiment} ({confidence:.2f})")

"We maintain our overweight rating due to improving margins." → positive (0.96)
"Downgraded to neutral as macro risks continue to weigh on sentiment." → negative (0.89)
"The upgrade reflects stronger-than-expected earnings and a favorable outlook." → positive (0.96)
"We remain cautious due to continued pricing pressure and regulatory overhang." → negative (0.90)


### Earnings call quotes

In [17]:
texts = [
    "We experienced strong year-over-year revenue growth across all segments.",
    "Operating margins declined this quarter due to higher logistics costs.",
    "We are reaffirming our full-year guidance despite market volatility.",
    "Customer demand remained stable, but input costs continue to rise.",
    "We anticipate headwinds from FX and interest rate uncertainty in Q3.",
]


In [18]:

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

labels = ["positive", "negative", "neutral"]

for text, prob in zip(texts, probs):
    sentiment = labels[torch.argmax(prob)]
    confidence = prob.max().item()
    print(f"\"{text}\" → {sentiment} ({confidence:.2f})")

"We experienced strong year-over-year revenue growth across all segments." → positive (0.96)
"Operating margins declined this quarter due to higher logistics costs." → negative (0.98)
"We are reaffirming our full-year guidance despite market volatility." → positive (0.94)
"Customer demand remained stable, but input costs continue to rise." → positive (0.89)
"We anticipate headwinds from FX and interest rate uncertainty in Q3." → negative (0.95)


### Risk factor excerpts from 10-Ks

In [19]:
texts = [
    "Our business could be adversely affected by rising interest rates.",
    "We face ongoing risks related to global supply chain disruptions.",
    "Failure to comply with new ESG regulations may impact operations.",
    "A prolonged downturn in consumer spending could hurt sales performance.",
    "We are exposed to cybersecurity threats that may disrupt critical systems.",
]


In [20]:
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

labels = ["positive", "negative", "neutral"]

for text, prob in zip(texts, probs):
    sentiment = labels[torch.argmax(prob)]
    confidence = prob.max().item()
    print(f"\"{text}\" → {sentiment} ({confidence:.2f})")

"Our business could be adversely affected by rising interest rates." → negative (0.91)
"We face ongoing risks related to global supply chain disruptions." → negative (0.95)
"Failure to comply with new ESG regulations may impact operations." → negative (0.95)
"A prolonged downturn in consumer spending could hurt sales performance." → negative (0.97)
"We are exposed to cybersecurity threats that may disrupt critical systems." → negative (0.84)
