# Here we will use a small finance sentiment dataset and try to improve the results by adding noisy data.

Here is the plan:

1.   Use the Kaggle financial sentiment set as a starting point --> expected test accuracy is ~85%
2.   Get the accuracy, classification report and confusion matrix for the following models:

> Logistic Regression

> Passive Aggressive Model

> Fine-tuned BERT model (the cased version) - pytorch

> Fine-tuned BERT model - pytorch-lightning

3. Start augmenting the data using various methods and see how each one affects the results. Methods to test:

- Find top ngrams for positive and negative and get additional data on 500 companies
- Use negative and positive phrases to look for articles that will be marked negative or positive
- Obtain articles from PR hose and mark all such content as positive

4. We will test the effect of each method on how the above models are improved (or not)
5. We will try to provide guidelines for noisy data enhancements
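
To make the phrase-based idea in step 3 a bit more concrete, here is a minimal sketch of how noisy labels could be assigned from key phrases. The phrase lists, the example headlines and the `weak_label` helper are all made up for illustration; they are not the n-grams or articles actually collected for this project.

```python
# Hypothetical phrase lists; in practice they would come from the top n-grams
# of the positive and negative classes.
POSITIVE_PHRASES = ["profit rose", "sales increased", "raises guidance"]
NEGATIVE_PHRASES = ["profit fell", "net loss", "cuts guidance"]


def weak_label(headline):
    """Return 'positive', 'negative' or None (no label) for a headline."""
    text = headline.lower()
    has_pos = any(phrase in text for phrase in POSITIVE_PHRASES)
    has_neg = any(phrase in text for phrase in NEGATIVE_PHRASES)
    if has_pos == has_neg:
        # no match at all, or conflicting matches: too ambiguous to label
        return None
    return "positive" if has_pos else "negative"


# invented headlines, only to exercise the helper
headlines = [
    "Operating profit rose to EUR 13.1 mn",
    "The company reported a net loss for the quarter",
    "Profit fell although sales increased",
    "The annual general meeting will be held in May",
]
for headline in headlines:
    print(weak_label(headline), "<--", headline)
```

Skipping headlines with no match or with conflicting matches keeps the added data smaller but, presumably, less noisy.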

Let's dive in...





## install and import libraries

In [None]:
!pip install -qq transformers
!pip install torch
!pip install cleantext
!pip install -q -U watermark
!pip install plotly==5.2.1

Collecting cleantext
  Downloading cleantext-1.1.3-py3-none-any.whl (3.7 kB)
Installing collected packages: cleantext
Successfully installed cleantext-1.1.3
Collecting plotly==5.2.1
  Downloading plotly-5.2.1-py2.py3-none-any.whl (21.8 MB)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 4.4.1
    Uninstalling plotly-4.4.1:
      Successfully uninstalled plotly-4.4.1
Successfully installed plotly-5.2.1 tenacity-8.0.1


In [None]:
# get the versions
%reload_ext watermark
%watermark -v -p numpy,pandas,torch,transformers

Python implementation: CPython
Python version       : 3.7.12
IPython version      : 5.5.0

numpy       : 1.19.5
pandas      : 1.1.5
torch       : 1.9.0+cu102
transformers: 4.11.0



In [None]:
# Import libraries
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import numpy as np
import pandas as pd

from cleantext import clean

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap

import plotly.express as px

from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

from tqdm.auto import tqdm

# import pytorch_lightning as pl
# from pytorch_lightning.metrics.functional import accuracy, f1, auroc
# from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
# from pytorch_lightning.loggers import TensorBoardLogger


HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)



<torch._C.Generator at 0x7fa24f6a8bf0>

## allocate a GPU/CPU

In [None]:
!nvidia-smi
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

Wed Sep 29 15:49:14 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

device(type='cuda', index=0)

## get the Kaggle financial sentiment data and prepare it
We'll prepare 2 datasets:
1. As is, without touching the text
2. Pre-processed: lowercased, punctuation removed and words stemmed


The data is here - https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news


In [None]:
# get the data
!gdown --id 1VFcdeyOf5NY0q3m2xqN04PpwoS63kK0E

Downloading...
From: https://drive.google.com/uc?id=1VFcdeyOf5NY0q3m2xqN04PpwoS63kK0E
To: /content/all-data.csv
  0% 0.00/672k [00:00<?, ?B/s]100% 672k/672k [00:00<00:00, 48.0MB/s]


In [None]:
# read the original data into a DataFrame

df = pd.read_csv("/content/all-data.csv", engine='python', names=['sentiment','text'])
df.head()

Unnamed: 0,sentiment,text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...


## alternatively, use the extended data

In [None]:
# alternatively read the augmented data into a DataFrame
df_alt = pd.read_csv('/content/drive/MyDrive/data/Kaggle financial sentiment/extended_data.csv')

In [None]:
# look at the alternative df and fix it a bit to fit the current classifier
df_alt = df_alt[['sentiment', 'text']]


In [None]:
# replace -1, 0, 1 with 'negative', 'neutral', 'positive'

def int_to_sentiment(sent):
  if sent == -1:
    return 'negative'
  elif sent == 0:
    return 'neutral'
  elif sent == 1:
    return 'positive'
  else:
    return 'unknown'

df_alt['sentiment'] = df_alt['sentiment'].apply(int_to_sentiment)

In [None]:
df_alt.head()

Unnamed: 0,sentiment,text
0,positive,The Future of Hair Loss Treatments Will Involv...
1,negative,Australis Reports Q4 and Financial Year (FY) 2...
2,negative,EPS dropped to EUR0 .2 from EUR0 .3 .
3,negative,US Central bank anticipates low rates for an “...
4,negative,GameStop (GME) Stock News and Forecast: Q2 Ear...


In [None]:
df = df_alt.copy(deep=True)

## end of alternative extended data - from here on all calculations are the same

In [None]:
# how much data do we have, and of what type?
print('shape -->', df.shape)
df.info()  # info() prints its summary itself and returns None

In [None]:
# let's also look at the sentiment distribution
df['sentiment'].value_counts() # later we leave 'neutral' untouched and try to improve mostly the negative and positive classes

positive    3895
negative    3447
neutral     2879
Name: sentiment, dtype: int64

In [None]:
# define class_names
class_names = ['negative', 'neutral','positive']

In [None]:
# also verify we don't have null values
df.isna().sum()

sentiment    0
text         0
dtype: int64

In [None]:
# finally, let's look at the class distribution for this set
# let's plot the class counts as a histogram
fig = px.histogram(df, x='sentiment', color='sentiment', width=1000, color_discrete_sequence=px.colors.qualitative.Pastel)

fig.update_layout(
    title_text='Classes distribution', # title of plot
    xaxis_title_text='Classes', # xaxis label
    yaxis_title_text='Counts', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1, # gap between bars of the same location coordinates
    showlegend=False,
    coloraxis_showscale=False,
    font_size=18
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=True)


fig.show() 


## clean the data (and keep an uncleaned version). we will have

1.   df --> the original data
2.   df_clean --> stemmed and lowercased, with stopwords, punctuation and numbers removed



In [None]:
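# we'll use the cleantext library (its clean function) together with the NLTK stopwords

import nltk
nltk.download('stopwords')

# define a function to do the cleaning
def clean_it(sentence):
  # note: with lowercase=False, capitalized stopwords (e.g. a leading 'The') appear to
  # survive stopword removal; set lowercase=True if they should be dropped as well
  # alternatively just use --> clean(sentence, all=True)
  return clean(
      sentence,
      stemming=True,
      stopwords=True,
      lowercase=False,
      numbers=True,
      punct=True,
      stp_lang='english'
  )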

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# so from now on we have 2 dataframes: the regular df and df_clean, which holds the cleaned text
df_clean = df.copy(deep=True)
df_clean['text'] = df['text'].apply(clean_it)

# and show few lines of df_clean
df_clean.head()

Unnamed: 0,sentiment,text
0,neutral,accord gran compani plan move product russia ...
1,neutral,technopoli plan develop stage area less squar...
2,negative,the intern electron industri compani elcoteq l...
3,positive,with new product plant compani would increas c...
4,positive,accord compani updat strategi year baswar ta...


In [None]:
# run with df_clean if desired

df = df_clean

df.head()

Unnamed: 0,sentiment,text
0,neutral,accord gran compani plan move product russia ...
1,neutral,technopoli plan develop stage area less squar...
2,negative,the intern electron industri compani elcoteq l...
3,positive,with new product plant compani would increas c...
4,positive,accord compani updat strategi year baswar ta...


## convert the labels or targets to integers

In [None]:
# Important note: we use 0, 1, 2 (instead of, say, -1, 0, 1) because the classifier and the loss expect class indices starting at 0... using e.g. -1, 0, 1 will result in an error

# convert sentiments to integers
# negative --> 0
# neutral --> 1
# positive-->  2

def sentiment_to_int(sentiment):
  if (sentiment.strip()  == 'negative'):
    return int(0)
  elif (sentiment.strip()  == 'neutral'):
    return int(1)
  elif (sentiment.strip()  == 'positive'):
    return int(2)
  else:
      return int(100)



In [None]:
# convert to integers
df['sentiment'] = df['sentiment'].apply(sentiment_to_int)
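
The same mapping can also be written as a dictionary lookup. Below is a small sketch; `LABEL_TO_ID`, `ID_TO_LABEL` and `encode_label` are names introduced here, and unlike `sentiment_to_int` this version lowercases the label and raises on anything unexpected instead of returning the sentinel value 100. Both are design choices of the sketch, not of the notebook.

```python
# class indices have to be 0..n_classes-1 for the cross-entropy loss used later
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}


def encode_label(sentiment):
    """Map a sentiment string to its class index, failing loudly on unknown labels."""
    key = sentiment.strip().lower()
    if key not in LABEL_TO_ID:
        raise ValueError(f"unexpected sentiment label: {sentiment!r}")
    return LABEL_TO_ID[key]


print([encode_label(s) for s in ["negative", " neutral ", "positive"]])
print(ID_TO_LABEL[2])
```

Failing early makes a separate "do we have any 100 values?" check unnecessary, and the inverse dictionary is handy when printing predictions.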

In [None]:
# shuffle the df
df = df.sample(frac=1)

In [None]:
# let's look at the resulting df
df.head(12)

Unnamed: 0,sentiment,text
4409,0,net sale drop yearonyear eur million
4687,0,adp new jan finnish mobil phone maker no...
831,2,depart store sale improv eur mn
2619,1,the crane would instal onboard two freighter o...
4278,1,the subscript period amer sport warrant sche...
3602,1,from emphasi kyro strategi glaston growth
3075,1,other detail provid
4752,0,oper profit total eur mn eur mn correspond ...
2596,1,the compani also seek possibl reloc luumaki pe...
2246,2,these measur expect produc annual cost save eu...


In [None]:
# do we have any sentiment value equal to 100 ?
(df['sentiment'] == 100).any() # no we don't have :)

False

## tokenization --> we'll be using BERT tokenizer in all cases

In [None]:
# define the model to use and tokenizer
BERT_MODEL_NAME = 'bert-base-cased'

tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME)


Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.



In [None]:
# let's look at the tokenizer and use its encode_plus method

text = ' Lionel Messi believes PSG is the best place for him to win Champions League again and again'

encoding = tokenizer.encode_plus(
  text, 
  max_length=64,
  add_special_tokens=True, # Add '[CLS]' and '[SEP]'
  return_token_type_ids=False, # single-sentence input, so segment ids are not needed
  padding='max_length',
  truncation=True,
  return_attention_mask=True,
  return_tensors='pt',  # Return PyTorch tensors  
)

In [None]:
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
encoded = tokenizer.encode_plus(text)

print(f' Sentence: {text}')
print(f'   Tokens: {tokens}')
print(f'Token IDs: {token_ids}')
print(f'Encoded with special chars: {encoded}')

 Sentence:  Lionel Messi believes PSG is the best place for him to win Champions League again and again
   Tokens: ['Lionel', 'Me', '##ssi', 'believes', 'PS', '##G', 'is', 'the', 'best', 'place', 'for', 'him', 'to', 'win', 'Champions', 'League', 'again', 'and', 'again']
Token IDs: [14957, 2508, 19828, 6616, 12727, 2349, 1110, 1103, 1436, 1282, 1111, 1140, 1106, 1782, 4748, 1453, 1254, 1105, 1254]
Encoded with special chars: {'input_ids': [101, 14957, 2508, 19828, 6616, 12727, 2349, 1110, 1103, 1436, 1282, 1111, 1140, 1106, 1782, 4748, 1453, 1254, 1105, 1254, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
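
Splits such as 'Messi' --> 'Me', '##ssi' come from WordPiece, which (as far as I understand it) matches the longest vocabulary entry greedily from left to right. Here is a toy sketch of that idea; `wordpiece` and `TOY_VOCAB` are made up for illustration, and the real tokenizer uses a vocabulary of roughly 29k entries plus extra pre-processing.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# tiny hypothetical vocabulary
TOY_VOCAB = {"Lionel", "Me", "##ssi", "PS", "##G", "believes"}

for word in ["Lionel", "Messi", "PSG", "Ronaldo"]:
    print(word, "->", wordpiece(word, TOY_VOCAB))
```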


In [None]:
# and we can convert the input_ids back to tokens
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])
print(tokens)

['[CLS]', 'Lionel', 'Me', '##ssi', 'believes', 'PS', '##G', 'is', 'the', 'best', 'place', 'for', 'him', 'to', 'win', 'Champions', 'League', 'again', 'and', 'again', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
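
What `padding='max_length'`, `truncation=True` and `return_attention_mask=True` do together can be sketched in plain Python. `pad_and_mask` is a hypothetical helper, not part of the tokenizer API; the ids 101 and 102 are the `[CLS]` and `[SEP]` ids visible in the output above, and 0 is the `[PAD]` id of `bert-base-cased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_and_mask(token_ids, max_len):
    """Add [CLS]/[SEP], truncate to max_len, pad with [PAD] and build the attention mask."""
    ids = [CLS_ID] + token_ids[: max_len - 2] + [SEP_ID]
    mask = [1] * len(ids)  # 1 = real token the model should attend to
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 = padding, ignored


# first three token ids of the example sentence above
ids, mask = pad_and_mask([14957, 2508, 19828], max_len=8)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the long run of `[PAD]` tokens shown in the previous output.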


In [None]:
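# Now, let's decide on the sequence length... to do that we look at the distribution of sentence lengths
# note: this counts whitespace-separated words, so it is only a lower bound on the number of BERT (WordPiece) tokens

def number_of_tokens(sentence):
  return len(sentence.split())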

# example
number_of_tokens('today is a good day for travel and having good food with friends, yes indeed')

15

In [None]:
# define a DataFrame that holds lengths distribution
ds = df['text'].apply(number_of_tokens)
ds = ds.to_frame('lengths')
# let's look at the lengths
ds.value_counts().head()


lengths
28         468
26         467
25         443
27         442
29         442
dtype: int64

In [None]:
# now plot it

# let's plot token lengths as histogram
fig = px.histogram(ds, x='lengths', color='lengths', width=1000, color_discrete_sequence=px.colors.qualitative.Pastel)
   

fig.update_layout(
    title_text='Token lengths distribution', # title of plot
    xaxis_title_text='Token length', # xaxis label
    yaxis_title_text='Counts', # yaxis label
    bargap=0.5, # gap between bars of adjacent location coordinates
    bargroupgap=0.1, # gap between bars of the same location coordinates
    showlegend=False,
    font_size=20
)
fig.show() 

In [None]:
# judging by the histogram, 64 tokens looks like a reasonable choice (longer texts get truncated)
MAX_LEN = 64
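
One way to sanity-check a candidate `MAX_LEN` is to measure what fraction of sentences fits once `[CLS]` and `[SEP]` are added. The sketch below assumes the lengths are real tokenizer counts (e.g. `len(tokenizer.tokenize(text))`) rather than whitespace word counts; the `coverage` helper and the `token_counts` values are invented for illustration.

```python
def coverage(lengths, max_len, special_tokens=2):
    """Fraction of sequences that fit into max_len once [CLS] and [SEP] are added."""
    fits = sum(1 for n in lengths if n + special_tokens <= max_len)
    return fits / len(lengths)


# made-up token counts standing in for the per-sentence tokenizer output
token_counts = [12, 19, 25, 31, 40, 47, 58, 62, 63, 90]

for candidate in (32, 64, 128):
    print(candidate, f"{coverage(token_counts, candidate):.0%}")
```

A longer `MAX_LEN` truncates fewer sentences but costs memory and time, so something that covers most (not necessarily all) sentences is usually a fair trade-off.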

## preparing the dataset class and DataLoader capability

In [None]:
class FinanceSentimentDataset(Dataset):

  def __init__(self, texts, targets, tokenizer, max_len):
    self.texts = texts
    self.targets = targets
    self.tokenizer = tokenizer
    self.max_len = max_len
  
  def __len__(self):
    return len(self.texts)
  
  def __getitem__(self, item):
    text = str(self.texts[item])
    target = self.targets[item]

    encoding = self.tokenizer.encode_plus(
      text,
      add_special_tokens=True,
      max_length=self.max_len,
      return_token_type_ids=False,
      padding='max_length',
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt',
    )

    return {
      'text': text,
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'targets': torch.tensor(target, dtype=torch.long)
    }

In [None]:
def create_data_loader(df, tokenizer, max_len, batch_size):
  ds = FinanceSentimentDataset(
    texts=df.text.to_numpy(),
    targets=df.sentiment.to_numpy(),
    tokenizer=tokenizer,
    max_len=max_len
  )

  return DataLoader(
    ds,
    batch_size=batch_size,
    num_workers=2
  )


## prepare train, validation and test data sets

In [None]:
# use the train_test_split utility of sklearn to create df_train, df_val and df_test

df_train, df_val_test = train_test_split(df, train_size=0.35, random_state=RANDOM_SEED )
df_val, df_test = train_test_split(df_val_test, test_size=0.5, random_state=RANDOM_SEED)


# let's look at the resulting sizes
print(f'train set size--> {df_train.shape}')
print(f'validation set size--> {df_val.shape}')
print(f'test set size--> {df_test.shape}')

train set size--> (1696, 2)
validation set size--> (1575, 2)
test set size--> (1575, 2)
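
The printed sizes can be reproduced by hand. The sketch below assumes 4846 rows in total (the sum of the three sizes above) and the rounding I believe `train_test_split` applies: floor for a fractional `train_size`, ceil for a fractional `test_size`. `split_sizes` is a helper written for this check, not an sklearn function.

```python
import math


def split_sizes(n_rows, train_fraction, test_fraction_of_rest):
    """Sizes of a two-step split: train vs rest, then rest into validation and test."""
    n_train = math.floor(train_fraction * n_rows)
    rest = n_rows - n_train
    n_test = math.ceil(test_fraction_of_rest * rest)
    n_val = rest - n_test
    return n_train, n_val, n_test


print(split_sizes(4846, 0.35, 0.5))  # the split used above
print(split_sizes(4846, 0.70, 0.5))  # a 70/15/15 split of the same rows, for comparison
```

Note that `train_size=0.35` leaves only about a third of the rows for training; a larger fraction gives the model more data at the price of smaller validation and test sets.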


In [None]:
# let's look at sentiment distribution in the train data set

df_train['sentiment'].value_counts() 

# it is very skewed towards neutral and positive; we can try to stratify the split or, better, enhance the data with negative and positive examples... more on that later

2    2724
0    2400
1    2030
Name: sentiment, dtype: int64

In [None]:
# invoke dataloaders for each type of the datasets we have: train, validation, test
BATCH_SIZE = 8


train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)

## let's look at a sample data


In [None]:
# we look at 2 sample batches (of size batch_size) from our training data

it = iter(train_data_loader)
data1 = next(it)
data2 = next(it)

# and now let's print the labels (targets) of both to see if the distribution makes sense
print(f"First sample targets: {data1['targets']}")
print(f"Second sample targets: {data2['targets']}")


First sample targets: tensor([0, 2, 0, 0, 2, 2, 1, 0])
Second sample targets: tensor([2, 1, 0, 0, 1, 2, 2, 0])


In [None]:
# let's look at the shapes of the data structure

print(f"Each batch has {len(data1['text'])} texts")
print(f"Each batch has {data1['input_ids'].shape} ids")
print(f"Each batch has {data1['attention_mask'].shape} attention masks")
print(f"Each batch has {data1['targets'].shape} targets")

Each batch has 8 texts
Each batch has torch.Size([8, 64]) ids
Each batch has torch.Size([8, 64]) attention masks
Each batch has torch.Size([8]) targets


In [None]:
# apply bert model to a data example

bert_model = BertModel.from_pretrained(BERT_MODEL_NAME)


Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to b

In [None]:
# let's try to use it on our train_data example

output = bert_model(
  input_ids=data1['input_ids'], 
  attention_mask=data1['attention_mask']
)

In [None]:
# output of the bert model contains 2 elements: last_hidden_state and pooler_output

output.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [None]:
# let's look at the last hidden state and the pooler output, which is the hidden state of the [CLS] token passed through a dense layer with tanh (not an average)
# so let's check their shapes
output.last_hidden_state.shape, output.pooler_output.shape # we are interested in the pooler output, which we will feed to a linear layer and softmax for predictions


(torch.Size([8, 64, 768]), torch.Size([8, 768]))

In [None]:
# the 768 comes from the BERT_BASE model and we can see it by looking at
bert_model.config.hidden_size

768

## time to build our model --> bert --> dropout --> linear of size (hidden_size x n_classes)

In [None]:
# OK, time to build the model class based on the BERT_BASE

class SentimentClassifier(nn.Module):

  def __init__(self, n_classes):
    super(SentimentClassifier, self).__init__()
    self.bert = BertModel.from_pretrained(BERT_MODEL_NAME)
    self.drop = nn.Dropout(p=0.5)
    self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
  
  def forward(self, input_ids, attention_mask):
    output = self.bert(
      input_ids=input_ids,
      attention_mask=attention_mask
    )
    output = self.drop(output.pooler_output)
    return self.out(output)

In [None]:
# let's instantiate our model

sentiment_values = [0, 1, 2]

model = SentimentClassifier(len(sentiment_values))
model = model.to(device)



In [None]:
# we'll move our sample batch to device as well

input_ids = data1['input_ids'].to(device)
attention_mask = data1['attention_mask'].to(device)

print(input_ids.shape) # batch size x seq length
print(attention_mask.shape) # batch size x seq length

torch.Size([8, 64])
torch.Size([8, 64])


In [None]:
# then to get the predicted class we'll apply softmax to the output layer (logits); for the batch, the per-class probabilities are
outputs = model(input_ids, attention_mask)
F.softmax(outputs, dim=1)


tensor([[0.4596, 0.2714, 0.2690],
        [0.3779, 0.0819, 0.5402],
        [0.3040, 0.4422, 0.2538],
        [0.2507, 0.2323, 0.5170],
        [0.4432, 0.2872, 0.2695],
        [0.5236, 0.2863, 0.1901],
        [0.3529, 0.1426, 0.5045],
        [0.4049, 0.1652, 0.4299]], device='cuda:0', grad_fn=<SoftmaxBackward>)

In [None]:
# the above is the normalized distribution per sample in the batch; we could take only the argmax of the raw outputs (logits)
_, output_predictions = torch.max(outputs, dim=1)
print(f'before softmax predictions --> {output_predictions}')

# or we can run the softmax first and get the same predictions, this time from the normalized distribution
_, softmax_predictions = torch.max(F.softmax(outputs, dim=1), dim=1)
print(f'after softmax predictions --> {softmax_predictions}')

# true labels are
print(f"true labels are --> {data1['targets']} - but errors are obvious as we did not train the model yet")


# so the prediction stays the same with or without the softmax (softmax is monotonic) --> that's why we can run torch.max directly on the logits


before softmax predictions --> tensor([0, 2, 1, 2, 0, 0, 2, 2], device='cuda:0')
after softmax predictions --> tensor([0, 2, 1, 2, 0, 0, 2, 2], device='cuda:0')
true labels are --> tensor([0, 2, 0, 0, 2, 2, 1, 0]) - but errors are obvious as we did not train the model yet
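
Why the two predictions agree: softmax preserves the ordering of the logits, so the argmax cannot change. Here is a dependency-free sketch with made-up logits; `softmax` and `argmax` are small helpers written for this note, not the torch functions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [0.3, -1.2, 0.7]  # made-up raw scores for negative, neutral, positive
probs = softmax(logits)
print(probs)
print(argmax(logits), argmax(probs))
```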


## now, let's train our model and evaluate it

In [None]:
# define optimizer, loss function , epochs and scheduler

EPOCHS = 8

optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)

loss_fn = nn.CrossEntropyLoss().to(device) 
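
With `num_warmup_steps=0` the scheduler simply decays the learning rate linearly from `2e-5` to 0 over all training steps. The function below is my own re-implementation sketch of what I understand `get_linear_schedule_with_warmup` to compute, not the library code; `TOTAL_STEPS` assumes 1696 training rows, batch size 8 and 8 epochs.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


BASE_LR = 2e-5
TOTAL_STEPS = 212 * 8  # assumed: 1696 rows / batch size 8 = 212 batches, times 8 epochs

for step in (0, 424, 848, 1696):
    print(step, BASE_LR * linear_schedule_factor(step, 0, TOTAL_STEPS))
```

This is also why `scheduler.step()` has to be called once per batch in the training loop: the schedule is defined over optimizer steps, not epochs.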

In [None]:
# we now write a function to train a single epoch, later we will loop through all epochs
def train_epoch(
  model, 
  data_loader, 
  loss_fn, 
  optimizer, 
  device, 
  scheduler, 
  n_examples
):
  model = model.train()

  losses = []
  correct_predictions = 0
  
  for d in data_loader:
    input_ids = d["input_ids"].to(device)
    attention_mask = d["attention_mask"].to(device)
    targets = d["targets"].to(device)

    outputs = model(
      input_ids=input_ids,
      attention_mask=attention_mask
    )

    _, preds = torch.max(outputs, dim=1)
    loss = loss_fn(outputs, targets)

    correct_predictions += torch.sum(preds == targets)
    losses.append(loss.item())

    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

  return correct_predictions.double() / n_examples, np.mean(losses)
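
The `clip_grad_norm_` call rescales all gradients together when their global L2 norm exceeds `max_norm`. A plain-Python sketch of that rule follows; `clip_by_global_norm` is a name made up for this note, and the small `eps` mirrors what I understand torch to use.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)

unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)
```

The direction of the gradient is kept and only its length is capped, which helps against the occasional exploding update during fine-tuning.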


In [None]:
# so now let's have a function for evaluating the model (good for validation and test DataLoaders)
# we will not need the optimizer and the scheduler

def eval_model(model, data_loader, loss_fn, device, n_examples):
  model = model.eval()

  losses = []
  correct_predictions = 0

  with torch.no_grad():
    for d in data_loader:
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)

      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)

      loss = loss_fn(outputs, targets)

      correct_predictions += torch.sum(preds == targets)
      losses.append(loss.item())

  return correct_predictions.double() / n_examples, np.mean(losses)


In [None]:
%%time
# ok, now we are good to go over all epochs and run the training and validation loop...:)
# (the %%time cell magic has to be the first line of the cell)

history = defaultdict(list)
best_accuracy = 0

for epoch in range(EPOCHS):
  print(f'Epoch {epoch + 1}/{EPOCHS}')
  print('-' * 10)

  train_acc, train_loss = train_epoch(
     model,
     train_data_loader,
     loss_fn,
     optimizer,
     device,
     scheduler,
     len(df_train) 
      
  )

  print(f'Train loss {train_loss} - Train accuracy {train_acc}')

  val_acc, val_loss = eval_model(
      model,
      val_data_loader,
      loss_fn,
      device,
      len(df_val)
  )

  print(f'Validation loss {val_loss} - Validation accuracy {val_acc}')
  print ()

  # fill up the right history tracking lists
  history['train_acc'].append(train_acc)
  history['train_loss'].append(train_loss)
  history['val_acc'].append(val_acc)
  history['val_loss'].append(val_loss)

  if val_acc > best_accuracy:
    torch.save(model.state_dict(), '/content/drive/MyDrive/data/finance_sentiment/best_model_noisy_state_with_stemming.bin')
    best_accuracy = val_acc 


Epoch 1/8
----------
Train loss 0.9260779970851024 - Train accuracy 0.5053117137265866
Validation loss 0.7743840354184309 - Validation accuracy 0.5420743639921721

Epoch 2/8
----------
Train loss 0.7513576940617748 - Train accuracy 0.6169974839250769
Validation loss 0.6594672341210147 - Validation accuracy 0.7181996086105675

Epoch 3/8
----------
Train loss 0.5816338481641682 - Train accuracy 0.7746715124405926
Validation loss 0.5841074214549735 - Validation accuracy 0.7769080234833659

Epoch 4/8
----------
Train loss 0.4563641566127622 - Train accuracy 0.849874196253844
Validation loss 0.7543715890642488 - Validation accuracy 0.7782126549249837

Epoch 5/8
----------
Train loss 0.3822022539131635 - Train accuracy 0.8884540117416829
Validation loss 0.8457339640117425 - Validation accuracy 0.7951728636660143

Epoch 6/8
----------
Train loss 0.31226805313620404 - Train accuracy 0.9179480011182555
Validation loss 0.9861446862062925 - Validation accuracy 0.7958251793868232

Epoch 7/8
------

In [None]:
# convert the pt tensors in history to plain floats

def tensor_to_float(tensors):
  # each history entry is a 0-dim tensor; .item() extracts the Python float
  return [t.item() for t in tensors]

train_accuracy = tensor_to_float(history['train_acc'])
validation_accuracy = tensor_to_float(history['val_acc'])

In [None]:
# it is also a good idea to save all history to a file
df_history = pd.DataFrame({ 'train_accuracy': train_accuracy, 'validation_accuracy':validation_accuracy})

df_history.to_csv('/content/drive/MyDrive/data/finance_sentiment/cased_history_noisy_best_model_noisy_state_with_stemming.csv')

df_history

Unnamed: 0,train_accuracy,validation_accuracy
0,0.505312,0.542074
1,0.616997,0.7182
2,0.774672,0.776908
3,0.849874,0.778213
4,0.888454,0.795173
5,0.917948,0.795825
6,0.935141,0.808219
7,0.947023,0.811481


## inspecting the accuracy 

In [None]:
# let's read the saved history back into a DataFrame for ease of inspection
df_accuracy = pd.read_csv('/content/drive/MyDrive/data/finance_sentiment/cased_history_noisy_best_model_noisy_state_with_stemming.csv')
#df_accuracy = pd.DataFrame(history)

df_accuracy 

Unnamed: 0.1,Unnamed: 0,train_accuracy,validation_accuracy
0,0,0.505312,0.542074
1,1,0.616997,0.7182
2,2,0.774672,0.776908
3,3,0.849874,0.778213
4,4,0.888454,0.795173
5,5,0.917948,0.795825
6,6,0.935141,0.808219
7,7,0.947023,0.811481


In [None]:
# plot it
import plotly.express as px
# import plotly.graph_objects as go

fig = px.line(df_accuracy, x=df_accuracy.index, y=['train_accuracy','validation_accuracy'],  range_y=[0.6, 1.02], width=1000)

fig.update_yaxes( title='Accuracy')
fig.update_xaxes( title='Epoch', nticks=8)
fig.update_layout(
    title='Training and Validation accuracy',
    font_size=20
)

fig.show()



We see above an interesting behaviour: training accuracy increases and training loss decreases, but validation accuracy improves only a bit while validation loss increases. One reason could be overfitting to the training set, or simply validation and test sets that are too small... we will explore this further...
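
To see how much the choice of criterion matters, here is a small sketch using the validation numbers logged above (rounded, and only the six epochs whose loss is shown). `early_stop_epoch` is a hypothetical helper implementing plain patience-based early stopping; the notebook itself keeps the checkpoint with the best validation accuracy.

```python
# rounded values from the training log above (epochs 1 to 6)
val_loss = [0.7744, 0.6595, 0.5841, 0.7544, 0.8457, 0.9861]
val_acc = [0.5421, 0.7182, 0.7769, 0.7782, 0.7952, 0.7958]

# 1-based epoch numbers of the best checkpoint under each criterion
best_by_acc = max(range(len(val_acc)), key=val_acc.__getitem__) + 1
best_by_loss = min(range(len(val_loss)), key=val_loss.__getitem__) + 1


def early_stop_epoch(losses, patience):
    """Epoch at which training would stop, or None if the loss keeps improving."""
    best, bad_epochs = float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


print(best_by_acc, best_by_loss, early_stop_epoch(val_loss, patience=2))
```

The two criteria pick different epochs, which is typical when the model becomes more confident in its wrong predictions: accuracy barely moves while the loss grows.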

## experimental visualization (not needed)

In [None]:
# # another experimental visualization tool - autoviz

# !pip install autoviz

# from autoviz.AutoViz_Class import AutoViz_Class
# AV = AutoViz_Class()

# filename = ""
# sep = ","
# dft = AV.AutoViz(
#     filename,
#     sep=",",
#     depVar="",
#     dfte=df_acc,
#     header=0,
#     verbose=0,
#     lowess=False,
#     chart_format="svg",
#     max_rows_analyzed=150000,
#     max_cols_analyzed=30,
# )

## model evaluation

In [None]:
# let's load the best model and check accuracy on the test data

model = SentimentClassifier(len(sentiment_values))
model.load_state_dict(torch.load('/content/drive/MyDrive/data/finance_sentiment/best_model_noisy_state_with_stemming.bin'))
model = model.to(device)



In [None]:
# we can now use the eval_model function on the test data to see how well we're doing

test_acc, _ = eval_model(
  model,
  test_data_loader,
  loss_fn,
  device,
  len(df_test)
)

print(f'Accuracy of the noisy model on the original test set using bert model: {test_acc.item():.4f}')


Accuracy of the noisy model on the original test set using bert model: 0.9010


## getting predictions from the model

In [None]:
# a prediction function, very similar to the test evaluation function, that also returns the texts and class probabilities
# (it relies on `device` and `F` as defined in the earlier cells)

def get_predictions(model, data_loader):

  model = model.eval()


  texts = []
  predictions = []
  prediction_probs = []
  true_labels = []

  
  with torch.no_grad():
    for d in data_loader:

      batch_texts = d['text']
      input_ids = d['input_ids'].to(device)
      attention_mask = d['attention_mask'].to(device)
      targets = d['targets'].to(device)

      outputs = model(
          input_ids=input_ids,
          attention_mask=attention_mask
      )

      _, preds = torch.max(outputs, dim=1)

      probs = F.softmax(outputs, dim=1)

      texts.extend(batch_texts)
      predictions.extend(preds)
      prediction_probs.extend(probs)
      true_labels.extend(targets)

  predictions = torch.stack(predictions).cpu()
  prediction_probs = torch.stack(prediction_probs).cpu()
  true_labels = torch.stack(true_labels).cpu()
  
  return texts, predictions, prediction_probs, true_labels



In [None]:
# let's look at the test_data and look at other accuracy measures

y_review_texts, y_pred, y_pred_probs, y_test = get_predictions(
  model,
  test_data_loader
)

In [None]:
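# build a dataframe of texts, predictions, class probabilities and true labels
# (tensors are converted to plain Python/NumPy values so they display and export cleanly)

pred_df = pd.DataFrame({
    'text': y_review_texts,
    'predicted_label': y_pred.numpy(),
    'predicted_probability': y_pred_probs.tolist(),
    'true_label': y_test.numpy(),
})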

In [None]:
# how big is the dataframe ?

pred_df.shape

(727, 4)

In [None]:
# show full column contents when displaying a few predictions and true labels
pd.set_option('display.max_colwidth', None)
# pred_df.head(8)

# save the predictions as a csv file so they can be presented later
pred_df.to_csv('/content/drive/MyDrive/data/finance_sentiment/8_predictions.csv')

In [None]:
# let's have a classification report

class_names = ['negative', 'neutral','positive']
print(classification_report(y_test, y_pred, target_names=class_names))

              precision    recall  f1-score   support

    negative       0.96      0.95      0.95        95
     neutral       0.96      0.96      0.96       413
    positive       0.93      0.94      0.93       219

    accuracy                           0.95       727
   macro avg       0.95      0.95      0.95       727
weighted avg       0.95      0.95      0.95       727



In [None]:
# let's look at the confusion matrix
# (scikit-learn expects y_true first, so rows are true labels and columns are predictions)

cf_matrix = confusion_matrix(y_test, y_pred)
print(cf_matrix)


[[ 90   3   2]
 [  3 397  13]
 [  1  13 205]]
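
The per-class numbers in the classification report can be recovered from the confusion matrix counts. The sketch below is a plain-Python illustration that assumes rows are true labels and columns are predicted labels (so the row sums equal the support column of the report: 95, 413, 219); the helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix, labels):
    """Compute precision and recall per class from a square confusion matrix.

    Assumes matrix[i][j] counts samples with true class i predicted as class j.
    """
    n = len(labels)
    metrics = {}
    for k in range(n):
        true_positive = matrix[k][k]
        row_sum = sum(matrix[k])                       # all samples whose true class is k
        col_sum = sum(matrix[i][k] for i in range(n))  # all samples predicted as k
        metrics[labels[k]] = {
            "precision": true_positive / col_sum if col_sum else 0.0,
            "recall": true_positive / row_sum if row_sum else 0.0,
        }
    return metrics


# Counts from the test run above, oriented with true labels on the rows
matrix = [
    [90, 3, 2],
    [3, 397, 13],
    [1, 13, 205],
]
labels = ["negative", "neutral", "positive"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[k][k] for k in range(len(labels))) / total
print(f"accuracy: {accuracy:.2f}")
for name, m in per_class_metrics(matrix, labels).items():
    print(f"{name:>8}  precision={m['precision']:.2f}  recall={m['recall']:.2f}")
```

Rounded to two decimals, these values agree with the precision and recall columns of the classification report.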


In [None]:
# plot the confusion matrix

import plotly.figure_factory as ff

z = np.flipud(confusion_matrix(y_test, y_pred)) # rows are true labels; flip the matrix vertically so the rows line up with the y labels below

x = ['Negative', 'Neutral', 'Positive']
y = ['Positive', 'Neutral', 'Negative']

fig = ff.create_annotated_heatmap(z, x=x, y=y, colorscale='amp' )
fig.update_layout(width=800, font_size=20)
fig.update_yaxes(title = "True values", title_font=dict(size=20, family='Arial', color='crimson'))
fig.update_xaxes(title = "Predicted values", title_font=dict(size=20, family='Arial', color='crimson'))
fig.show()



## predicting from raw text

In [None]:
# let's try to predict the sentiment of an arbitrary text

text = '''
He is a very experinced manager and can make the company highly energized and innovative
'''

# text = ' Microsot revenues rose and shareholders are hapy and optimistic about the future of the company'

encoded_review = tokenizer.encode_plus(
  text,
  max_length=MAX_LEN,
  add_special_tokens=True,
  return_token_type_ids=False,
  padding='max_length',
  return_attention_mask=True,
  return_tensors='pt',
)

In [None]:
# let's get the prediction from our model (no gradients are needed for inference)

input_ids = encoded_review['input_ids'].to(device)
attention_mask = encoded_review['attention_mask'].to(device)

with torch.no_grad():
  output = model(input_ids, attention_mask)
_, prediction = torch.max(output, dim=1)

print(f'Review text: {text}')
print(f'Sentiment  : {class_names[prediction]}')

prediction.item()


Review text: 
He is a very experinced manager and can make the company highly energized and innovative

Sentiment  : positive


2
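
The cell above reports only the winning class. To also see how confident the model is, the raw scores can be passed through a softmax, as `get_predictions` does for the test set. The following is a plain-Python sketch of that step so it runs without the trained model; the `logits` values are invented for illustration and are not real model output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["negative", "neutral", "positive"]

# Hypothetical raw scores for one sentence (assumption, not produced by the model)
logits = [-1.2, 0.3, 2.5]

probs = softmax(logits)
# Index of the highest probability, equivalent to an argmax over the classes
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(f"Sentiment  : {class_names[predicted_index]}")
print(f"Confidence : {probs[predicted_index]:.3f}")
```

With the real model, the same idea would apply `F.softmax(output, dim=1)` to the `output` tensor before reading off the probability of the predicted class.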