# FinBERT Example Notebook

This notebook shows how to train and use the FinBERT pre-trained language model for financial sentiment analysis.

In [83]:
!pip install -q ipykernel

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
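
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in a cell before the tokenizer is first used. The value `"false"` is one reasonable choice, not something this notebook prescribes:

```python
import os

# Setting the variable explicitly tells the tokenizers library what to do
# when the process forks (for example when a `!pip` command spawns a shell).
# "false" disables the tokenizer's internal thread parallelism.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting it to `"true"` instead keeps parallelism enabled, at the risk of the deadlocks the warning mentions.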


In [84]:
!pip install -q torch



In [85]:
!pip install -q transformers 



In [86]:
!pip install -q nltk 
!pip install transformers[torch]
!pip install accelerate -U





In [87]:
from pathlib import Path
import shutil
import os
import logging
import torch
import sys

import pandas as pd  # used below to read the train/test CSV files

sys.path.append('..')

from pprint import pprint
from sklearn.metrics import classification_report

from transformers import AutoModelForSequenceClassification, AutoTokenizer

from utility import *
import utils as tools


%load_ext autoreload
%autoreload 2

project_dir = Path.cwd().parent
#pd.set_option('max_colwidth', -1)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [88]:
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.ERROR)

## Train & Test Data 

In [89]:
!pip install datasets
from datasets import load_dataset





In [90]:
cl_data_path = project_dir/'data'/'sentiment_data'
train = pd.read_csv(os.path.join(cl_data_path, 'train.csv'), sep='\t', index_col=False)
eval = pd.read_csv(os.path.join(cl_data_path, 'test.csv'), sep='\t', index_col=False)


In [91]:
train.head(6)

| Unnamed: 0.1 | Unnamed: 0 | text | label |
|---|---|---|---|
| 0 | 1950 | After the reporting period , BioTie North Amer... | positive |
| 1 | 4283 | They will cover all Forest Industry 's units a... | negative |
| 2 | 3014 | ( ADP News ) - Nov 28 , 2008 - Finnish power-s... | positive |
| 3 | 4097 | Following the transaction , Lundbeck has world... | positive |
| 4 | 2733 | A few employees would remain at the Oulu plant... | neutral |
| 5 | 1464 | ASSA ABLOY Kaupthing Bank gave a ` neutral ' r... | neutral |


In [92]:
train['Unnamed: 0']

0       1950
1       4283
2       3014
3       4097
4       2733
        ... 
3483    3056
3484    4644
3485    3502
3486    4235
3487     554
Name: Unnamed: 0, Length: 3488, dtype: int64

In [93]:
# Saving train and eval data
#file_path = "/content/drive/MyDrive/data/"
train.to_csv(os.path.join(cl_data_path, "train_subset.csv"), index=False)
eval.to_csv(os.path.join(cl_data_path, "eval.csv"), index=False)

In [94]:
dataset = load_dataset('csv', data_files={'train': os.path.join(cl_data_path, 'train_subset.csv'), 'eval': os.path.join(cl_data_path, 'eval.csv')})


In [95]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'text', 'label'],
        num_rows: 3488
    })
    eval: Dataset({
        features: ['Unnamed: 0', 'text', 'label'],
        num_rows: 970
    })
})

## Prepare the model

In [96]:

MODEL = 'bert-base-cased'

tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [97]:

# Map the string labels stored in the CSV files to integer class ids
LABEL2ID = {'negative': 0, 'neutral': 1, 'positive': 2}

def transform_labels(example):
    # The `label` column holds strings such as 'positive', not -1/0/1
    return {'labels': LABEL2ID[example['label'].strip().lower()]}

# Tokenize the text, padding or truncating every example to the model's maximum length
def tokenize_data(example):
    return tokenizer(example['text'], padding='max_length', truncation=True)

# Convert the sentences to token ids that the model can consume
dataset = dataset.map(tokenize_data, batched=True)

# Transform the labels and remove the columns the model does not need
remove_columns = ['Unnamed: 0', 'label', 'text']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)


In [98]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3488
    })
    eval: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 970
    })
})
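
The `input_ids` and `attention_mask` features listed above are produced by the tokenizer. As a toy illustration of what padding to a fixed length means, the sketch below uses made-up token ids and a hypothetical `pad_to_length` helper. It assumes a pad id of 0 and ignores the special tokens (such as `[CLS]` and `[SEP]`) that the real tokenizer adds:

```python
def pad_to_length(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a sequence and build the matching attention mask."""
    ids = list(token_ids[:max_length])   # truncate sequences that are too long
    padding = max_length - len(ids)      # number of pad positions to append
    return {
        "input_ids": ids + [pad_id] * padding,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * padding,
    }

short = pad_to_length([7, 8, 9], max_length=5)
long = pad_to_length([1, 2, 3, 4, 5, 6, 7, 8], max_length=5)
print(short)
print(long)
```

Every example ends up with the same length, which is what lets the `Trainer` stack them into batches.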

In [99]:

#Load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [100]:
train_dataset = dataset['train'].shuffle(seed=10) #.select(range(40000)) # to select a part
eval_dataset = dataset['eval'].shuffle(seed=10)


In [101]:

import numpy as np
import evaluate  # replaces the deprecated `datasets.load_metric`; needs `pip install evaluate`

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [102]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=8,    # batch size per device during evaluation
    # eval_steps, save_steps and logging_steps only apply to the 'steps'
    # strategy, so they are omitted here
    evaluation_strategy='epoch',     # evaluate at the end of each epoch
    save_strategy='epoch',           # save a checkpoint at the end of each epoch
    logging_strategy='epoch',        # log at the end of each epoch
    logging_dir='./logs',
    warmup_steps=500,                # steps of linear warmup from 0 to learning_rate
    learning_rate=5e-6,              # learning rate
    weight_decay=0.01,               # weight decay
    seed=42,
    load_best_model_at_end=True,     # reload the best checkpoint when training finishes
)

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
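
The error above appears even though `accelerate` was installed in an earlier cell. A likely cause is that the running kernel still sees the older version, in which case restarting the kernel after the install usually resolves it. A small sketch for checking the installed version first. The helper names are hypothetical, and the tuple comparison is a simplification of full version parsing (the `packaging` library is more thorough):

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Return the leading numeric release, e.g. '0.21.0.dev0' -> (0, 21, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def at_least(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def meets_minimum(package, minimum):
    """True if `package` is installed with a release number >= `minimum`."""
    try:
        return at_least(version(package), minimum)
    except PackageNotFoundError:
        return False


print("accelerate >= 0.20.1:", meets_minimum("accelerate", "0.20.1"))
```

If this prints `False` after the install cells have run, restart the kernel and run the notebook again from the top.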

In [None]:
from transformers import Trainer


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

# Fine-tune the model; evaluation runs at the end of every epoch
trainer.train()

## Test the model

`finbert.evaluate` returns a DataFrame containing the true label and the logit values for each example.

In [None]:
test_data = finbert.get_data('test')

In [None]:
results = finbert.evaluate(examples=test_data, model=bertmodel)

### Prepare the classification report

In [None]:
def report(df, cols=('label', 'prediction', 'logits')):
    # cols = (true-label column, predicted-label column, logits column)
    #print('Validation loss:{0:.2f}'.format(metrics['best_validation_loss']))
    #cs = CrossEntropyLoss(weight=finbert.class_weights)
    #loss = cs(torch.tensor(list(df[cols[2]])),torch.tensor(list(df[cols[0]])))
    #print("Loss:{0:.2f}".format(loss))
    accuracy = (df[cols[0]] == df[cols[1]]).sum() / df.shape[0]
    print("Accuracy:{0:.2f}".format(accuracy))
    print("\nClassification Report:")
    print(classification_report(df[cols[0]], df[cols[1]]))

In [None]:
results['prediction'] = results.predictions.apply(lambda x: np.argmax(x,axis=0))

In [None]:
report(results,cols=['labels','prediction','predictions'])
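
The report printed by scikit-learn's `classification_report` lists precision, recall and F1 for each class. A rough sketch of what those numbers mean, in plain Python with made-up labels. The `per_class_metrics` helper is hypothetical and only illustrates the definitions:

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # times the model chose this label
        actual = sum(t == label for t in y_true)     # times the label is correct
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


y_true = [0, 0, 1, 1, 2, 2]  # made-up true labels
y_pred = [0, 1, 1, 1, 2, 0]  # made-up predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "when the model predicts this class, how often is it right?", while recall answers "of the examples in this class, how many did the model find?".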

### Get predictions

With the `predict` function, given a piece of text, we split it into a list of sentences and then predict sentiment for each sentence. The output is written into a dataframe. Predictions are represented in three different columns: 

1) `logit`: probabilities for each class

2) `prediction`: predicted label

3) `sentiment_score`: sentiment score calculated as: probability of positive - probability of negative
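
The three columns can be sketched for a single sentence as follows. This is an illustration only, not the `predict` helper itself: the raw scores are made up, and the class order (0 = negative, 1 = neutral, 2 = positive) is an assumption taken from the label ids used earlier in this notebook.

```python
import math

# Assumed class order; the actual `predict` helper may use a different one.
LABELS = ("negative", "neutral", "positive")


def softmax(scores):
    """Convert raw model scores to probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_sentence(scores):
    """Build the three output columns for one sentence."""
    probs = softmax(scores)
    return {
        "logit": probs,                                   # probability per class
        "prediction": LABELS[probs.index(max(probs))],    # most likely label
        "sentiment_score": probs[LABELS.index("positive")]
        - probs[LABELS.index("negative")],                # P(positive) - P(negative)
    }


example = score_sentence([2.0, 0.5, -1.0])  # made-up raw scores
print(example["prediction"], round(example["sentiment_score"], 3))
```

The score lies between -1 (certainly negative) and 1 (certainly positive), so averaging it over sentences gives a single summary number for a passage.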

Below we analyze a paragraph taken from [this article](https://www.economist.com/finance-and-economics/2019/01/03/a-profit-warning-from-apple-jolts-markets) in The Economist. For comparison, we also include the sentiment predicted by TextBlob.
> Later that day Apple said it was revising down its earnings expectations in the fourth quarter of 2018, largely because of lower sales and signs of economic weakness in China. The news rapidly infected financial markets. Apple’s share price fell by around 7% in after-hours trading and the decline was extended to more than 10% when the market opened. The dollar fell by 3.7% against the yen in a matter of minutes after the announcement, before rapidly recovering some ground. Asian stockmarkets closed down on January 3rd and European ones opened lower. Yields on government bonds fell as investors fled to the traditional haven in a market storm.

In [None]:
text = "Later that day Apple said it was revising down its earnings expectations in \
the fourth quarter of 2018, largely because of lower sales and signs of economic weakness in China. \
The news rapidly infected financial markets. Apple’s share price fell by around 7% in after-hours \
trading and the decline was extended to more than 10% when the market opened. The dollar fell \
by 3.7% against the yen in a matter of minutes after the announcement, before rapidly recovering \
some ground. Asian stockmarkets closed down on January 3rd and European ones opened lower. \
Yields on government bonds fell as investors fled to the traditional haven in a market storm."

In [None]:
cl_path = project_dir/'models'/'classifier_model'/'finbert-sentiment'
model = AutoModelForSequenceClassification.from_pretrained(cl_path, cache_dir=None, num_labels=3)

In [None]:
import nltk
nltk.download('punkt')

In [None]:
result = predict(text,model)

In [None]:
from textblob import TextBlob  # explicit import in case `utility` does not export it

blob = TextBlob(text)
result['textblob_prediction'] = [sentence.sentiment.polarity for sentence in blob.sentences]
result

In [None]:
print(f'Average sentiment is {result.sentiment_score.mean():.2f}.')

Here is another example:

In [None]:
text2 = "Shares in the spin-off of South African e-commerce group Naspers surged more than 25% \
in the first minutes of their market debut in Amsterdam on Wednesday. Bob van Dijk, CEO of \
Naspers and Prosus Group poses at Amsterdam's stock exchange, as Prosus begins trading on the \
Euronext stock exchange in Amsterdam, Netherlands, September 11, 2019. REUTERS/Piroschka van de Wouw \
Prosus comprises Naspers’ global empire of consumer internet assets, with the jewel in the crown a \
31% stake in Chinese tech titan Tencent. There is 'way more demand than is even available, so that’s \
good,' said the CEO of Euronext Amsterdam, Maurice van Tilburg. 'It’s going to be an interesting \
hour of trade after opening this morning.' Euronext had given an indicative price of 58.70 euros \
per share for Prosus, implying a market value of 95.3 billion euros ($105 billion). The shares \
jumped to 76 euros on opening and were trading at 75 euros at 0719 GMT."

In [None]:
result2 = predict(text2,model)
blob = TextBlob(text2)
result2['textblob_prediction'] = [sentence.sentiment.polarity for sentence in blob.sentences]

In [None]:
result2

In [None]:
print(f'Average sentiment is {result2.sentiment_score.mean():.2f}.')