<a href="https://colab.research.google.com/github/oferweintraub/finance_sent/blob/main/Using_noisy_data_to_improve_finance_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# here we will use basic data (small set) of finance sentiment analysis and try to improve it by using noisy data.

Here is the plan:

1.   Use kaggle financial sentiment set as a starting point --> expected accuracy of test is is ~85%
2.   get the accuracy, classification report and confusion matrix for 3 models:

> Logostic Regression

> Passive Aggresive Model

> Fine tuned BERT model (the cased version)- pytorch

> fine tuned BERT model - pytorch-lightning

3. Start augmenting the data in various methods and see how it affects the results. Methods to test:

- Find top ngrams for positive and negative and get additional data on 500 companies
- Use negative and positive phrases to look for articles that will be marked negative or positive
- Obtain articles from PR hose and mark all such content as positive

4. We will test the effect of each method on how the above models are improved (or not)
5. We will try to provide guidelines for noisy data enhancements

Let's dive in...





## install and import libraries

In [55]:
!pip install -qq transformers
# !pip install pytorch-lightning
!pip install cleantext
!pip install -q -U watermark
!pip install plotly==5.2.1



In [56]:
# get the versions
%reload_ext watermark
%watermark -v -p numpy,pandas,torch,transformers

Python implementation: CPython
Python version       : 3.7.11
IPython version      : 5.5.0

numpy       : 1.19.5
pandas      : 1.1.5
torch       : 1.9.0+cu102
transformers: 4.9.2



In [99]:
# Import libraries
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import numpy as np
import pandas as pd

from cleantext import clean

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap

import plotly.express as px

from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

from tqdm.auto import tqdm

# import pytorch_lightning as pl
# from pytorch_lightning.metrics.functional import accuracy, f1, auroc
# from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
# from pytorch_lightning.loggers import TensorBoardLogger


HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

<torch._C.Generator at 0x7f3737c6b9b0>

## allocate a GPU/CPU

In [98]:
!nvidia-smi
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



device(type='cpu')

## get the Kaggle financial sentiment data and perpare it. 
We'll prepare 2 datasets
1. As is without touching the text
2. pre-processed to lowercase, remove puctuation and stemming the words


The data is here - https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news


In [59]:
# get the data
!gdown --id 1VFcdeyOf5NY0q3m2xqN04PpwoS63kK0E

Downloading...
From: https://drive.google.com/uc?id=1VFcdeyOf5NY0q3m2xqN04PpwoS63kK0E
To: /content/all-data.csv
  0% 0.00/672k [00:00<?, ?B/s]100% 672k/672k [00:00<00:00, 10.7MB/s]


In [60]:
# and read it into a dataframe

df = pd.read_csv("/content/all-data.csv", engine='python', names=['sentiment','text'])
df.head()

Unnamed: 0,sentiment,text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...


In [61]:
# how much data we hve, what type of data
print('shape -->',df.shape)
print(df.info())

shape --> (4846, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4846 entries, 0 to 4845
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  4846 non-null   object
 1   text       4846 non-null   object
dtypes: object(2)
memory usage: 75.8+ KB
None


In [62]:
# let's also look the the sentiment distribution
df['sentiment'].value_counts() # we later leave the 'neutral' untouched and try to improve mostly the negatove and positive

neutral     2879
positive    1363
negative     604
Name: sentiment, dtype: int64

In [63]:
# finally let's look at data dsitribution for this set
# let's plot token lengths as histogram
fig = px.histogram(df, x='sentiment', color='sentiment', width=800, color_discrete_sequence=px.colors.qualitative.Pastel)
   

fig.update_layout(
    title_text='Sentiment in dataset', # title of plot
    #xaxis_title_text=' labels', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1, # gap between bars of the same location coordinates
    showlegend=False
)
fig.show() 


## clean the data (and leave an uncleaned version). we will have

1.   df --> the original data
2.   df_clean --> stemmed, lowercased, stopwords, punctuations, numbers removed



In [64]:
# we'll use cleantext library (and the clean object) to clean the text and NLTK stopwords

import nltk
nltk.download('stopwords')

# define a function to do the cleaning
def clean_it(sentence):
  return clean(sentence,  
       stemming=True,
      stopwords=True,
      lowercase=True,
      numbers=True,
      punct=True,
      stp_lang='english')
  # or just use --> clean(sentence, all=True)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [65]:
# so fron now on we have 2 dataframes, the regular df and df_clean which we cleaned
df_clean = df.copy(deep=True)
df_clean['text'] = df['text'].apply(clean_it)

# and show few lines of df_clean
df_clean.head()

Unnamed: 0,sentiment,text
0,neutral,accord gran compani plan move product russia ...
1,neutral,technopoli plan develop stage area less squar...
2,negative,intern electron industri compani elcoteq laid ...
3,positive,new product plant compani would increas capac ...
4,positive,accord compani updat strategi year baswar ta...


In [66]:
# run with df_clean if desired

# df = df_clean

df.head()

Unnamed: 0,sentiment,text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...


## convert the labels or targets to integers

In [67]:
# convert sentiments to integers
# negative - 0
# neutral - 1
# positive - 2

def sentiment_to_int(sentiment):
  if (sentiment.strip()  == 'negative'):
    return int(0)
  elif (sentiment.strip()  == 'neutral'):
    return int(1)
  elif (sentiment.strip()  == 'positive'):
    return int(2)
  else:
      return int(10)

df['sentiment'] = df['sentiment'].apply(sentiment_to_int)

In [68]:
# look at df
df.head()

Unnamed: 0,sentiment,text
0,1,"According to Gran , the company has no plans t..."
1,1,Technopolis plans to develop in stages an area...
2,0,The international electronic industry company ...
3,2,With the new production plant the company woul...
4,2,According to the company 's updated strategy f...


## tokenization --> we'll be using BERT tokenizer in all cases

In [69]:
# define the model to use and tokenizer
BERT_MODEL_NAME = 'bert-base-cased'

tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME)

In [70]:
# let's look at the tokenizer and use its encoding_plus method

text = ' Lionel Messi believes PSG is the best place for him to win Champions League again and again'

encoding = tokenizer.encode_plus(
  text, 
  max_length=64,
  add_special_tokens=True, # Add '[CLS]' and '[SEP]'
  return_token_type_ids=False, # we only need MLM, not NSP task
  padding='max_length',
  truncation=True,
  return_attention_mask=True,
  return_tensors='pt',  # Return PyTorch tensors  
)

In [71]:
# let's look at some encoding data
type (encoding) # --> transformers.tokenization_utils_base.BatchEncoding
encoding.keys() # --> dict_keys(['input_ids', 'attention_mask'])
print(f"input_ids: \n {encoding['input_ids']}")
print()
print(f"attention_mask: \n {encoding['attention_mask']}")

input_ids: 
 tensor([[  101, 14957,  2508, 19828,  6616, 12727,  2349,  1110,  1103,  1436,
          1282,  1111,  1140,  1106,  1782,  4748,  1453,  1254,  1105,  1254,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]])

attention_mask: 
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])


In [72]:
# and we can convert the input_ids back to tokens
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])
tokens

['[CLS]',
 'Lionel',
 'Me',
 '##ssi',
 'believes',
 'PS',
 '##G',
 'is',
 'the',
 'best',
 'place',
 'for',
 'him',
 'to',
 'win',
 'Champions',
 'League',
 'again',
 'and',
 'again',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]']

In [73]:
# Now, let's decide on the optimal sequence length... to do that we look and sentence lengths distribution

def number_of_tokens(sentence):
  return len(sentence.split())

# example
number_of_tokens('today is a good day for travel and having good food with friends, yes indeed')

15

In [74]:
# define a DataFrame that holds lengths distribution
ds = df['text'].apply(number_of_tokens)
ds = ds.to_frame('lengths')
# let's look at lengthes
ds.value_counts().head()


lengths
17         216
16         211
21         210
19         210
20         200
dtype: int64

In [75]:
# now plot it

# let's plot token lengths as histogram
fig = px.histogram(ds, x='lengths', color='lengths', width=800, color_discrete_sequence=px.colors.qualitative.Pastel)
   

fig.update_layout(
    title_text='token lengths distribution', # title of plot
    xaxis_title_text=' Tokens length', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.5, # gap between bars of adjacent location coordinates
    bargroupgap=0.1, # gap between bars of the same location coordinates
    showlegend=False
)
fig.show() 

In [76]:
# clearly 64 tokens is a good choice
MAX_LEN = 64

## preparing the dataset class and DataLoader capability

In [77]:
class FinanceSentimentDataset(Dataset):

  def __init__(self, texts, targets, tokenizer, max_len):
    self.texts = texts
    self.targets = targets
    self.tokenizer = tokenizer
    self.max_len = max_len
  
  def __len__(self):
    return len(self.texts)
  
  def __getitem__(self, item):
    text = str(self.texts[item])
    target = self.targets[item]

    encoding = self.tokenizer.encode_plus(
      text,
      add_special_tokens=True,
      max_length=self.max_len,
      return_token_type_ids=False,
      padding='max_length',
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt',
    )

    return {
      'text': text,
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'targets': torch.tensor(target, dtype=torch.long)
    }

In [78]:
def create_data_loader(df, tokenizer, max_len, batch_size):
  ds = FinanceSentimentDataset(
    texts=df.text.to_numpy(),
    targets=df.sentiment.to_numpy(),
    tokenizer=tokenizer,
    max_len=max_len
  )

  return DataLoader(
    ds,
    batch_size=batch_size,
    num_workers=2
  )


## prepare train, validation and test data sets

In [79]:
# we'll use the train_test_split utility of sklearn to create df_train, df_val and df_test

df_train, df_val_test = train_test_split(df, train_size=0.8, random_state=RANDOM_SEED )
df_val, df_test = train_test_split(df_val_test, test_size=0.5, random_state=RANDOM_SEED)


# let's look t the resulting sizes
print(f'train set size--> {df_train.shape}')
print(f'validation set size--> {df_val.shape}')
print(f'test set size--> {df_test.shape}')

train set size--> (3876, 2)
validation set size--> (485, 2)
test set size--> (485, 2)


In [80]:
# we can also look at df_test just to make sure all looks good
print(df_test.head())
df_test.sentiment.value_counts() 

# it is very skewed towards neutral and positive we can try stratify it or better enhance the data with negative and positive examples...more on it later

      sentiment                                               text
2804          1  Another firm Air Liquide was exempted because ...
1534          1  In 2008 , Kemira recorded revenue of approxima...
1181          1  As the world leaders in developing UV technolo...
3412          1  There will be return flights from Stuttgart ev...
1084          1  The company has a wide selection of metal prod...


1    285
2    144
0     56
Name: sentiment, dtype: int64

In [81]:
# invoke dataloaders for each type of the datasets we have: train, validation, test
BATCH_SIZE = 8

train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)

## let's look at a sample data


In [82]:
# we look at sample data of batch_size from our training data

data = next(iter(train_data_loader))
data.keys()

dict_keys(['text', 'input_ids', 'attention_mask', 'targets'])

In [83]:
# let's look at the shape of these tensors
print(data['input_ids'].shape)
print(data['attention_mask'].shape)
print(data['targets'].shape)

torch.Size([8, 64])
torch.Size([8, 64])
torch.Size([8])


In [84]:
data['text'][0]

"In Russia , Raisio 's Food Division 's home market stretches all the way to Vladivostok ."

In [85]:
bert_model = BertModel.from_pretrained(BERT_MODEL_NAME)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [86]:
# let's try to use it on our train_data example

output = bert_model(
  input_ids=data['input_ids'], 
  attention_mask=data['attention_mask']
)

In [87]:
# let's look at the output last hidden size and output pooler layer which averages the last hidden state

# output contains 2 features
print (output.keys())

# so let's check their shapes
output.last_hidden_state.shape, output.pooler_output.shape # we are interested in the pooler layer which we will feed to a softmanx fro predictions


odict_keys(['last_hidden_state', 'pooler_output'])


(torch.Size([8, 64, 768]), torch.Size([8, 768]))

In [88]:
# the 768 comes from the BERT_BASE model and we can see it by lookin at
bert_model.config.hidden_size

768

## time to build our model --> bert --> droput --> linear of size (hidden_state x n_classes)

In [89]:
# OK, time to build the model class based on the BERT_BASE

class SentimentClassifier(nn.Module):

  def __init__(self, n_classes):
    super(SentimentClassifier, self).__init__()
    self.bert = BertModel.from_pretrained(BERT_MODEL_NAME)
    self.drop = nn.Dropout(p=0.3)
    self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
  
  def forward(self, input_ids, attention_mask):
    output = self.bert(
      input_ids=input_ids,
      attention_mask=attention_mask
    )
    output = self.drop(output.pooler_output)
    return self.out(output)

In [90]:
# let's instatiate our model

sentiment_values = [0, 1, 2]

model = SentimentClassifier(len(sentiment_values))
model = model.to(device)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [91]:
# we'll move our sample batch to device as well

input_ids = data['input_ids'].to(device)
attention_mask = data['attention_mask'].to(device)

print(input_ids.shape) # batch size x seq length
print(attention_mask.shape) # batch size x seq length

torch.Size([8, 64])
torch.Size([8, 64])


In [92]:
# then to get the predicted class we'll apply softmax to the output layers (logits) so for the batch we get the per class perdoction
outputs = model(input_ids, attention_mask)
F.softmax(model(input_ids, attention_mask), dim=1)


tensor([[0.2864, 0.3261, 0.3875],
        [0.1659, 0.3250, 0.5090],
        [0.2778, 0.3072, 0.4149],
        [0.3403, 0.3071, 0.3527],
        [0.1904, 0.3309, 0.4787],
        [0.2200, 0.3293, 0.4507],
        [0.3431, 0.2016, 0.4553],
        [0.3108, 0.3131, 0.3761]], grad_fn=<SoftmaxBackward>)

In [93]:
# so in the above we have the normalized distribution per sample in the batch, we could select only the MAX and show this
_, output_predictions =torch.max(outputs, dim=1)
print(f'before softmax predictions --> {output_predictions}')

# or we can run the softmax and actually get the same predictions, alas from the normalized distribution
_, softmax_predictions =torch.max(F.softmax(outputs, dim=1), dim=1)
print(f'after softmax predictions --> {softmax_predictions}')

# so we see the prediction remains the same --> that's why in mny cased we can run the torch.max directly on the class non_normalized probabilities


before softmax predictions --> tensor([2, 2, 2, 2, 1, 2, 2, 2])
after softmax predictions --> tensor([2, 2, 2, 2, 1, 2, 2, 2])


## now, let's train our model and evaluate it

In [94]:
# define optimizer, loss function , epochs and scheduler

EPOCHS = 1

optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)

loss_fn = nn.CrossEntropyLoss().to(device)

In [95]:
# we now write a function to train a single epoch, later we will loop through all epochs
def train_epoch(
  model, 
  data_loader, 
  loss_fn, 
  optimizer, 
  device, 
  scheduler, 
  n_examples
):
  model = model.train()

  losses = []
  correct_predictions = 0
  
  for d in data_loader:
    input_ids = d["input_ids"].to(device)
    attention_mask = d["attention_mask"].to(device)
    targets = d["targets"].to(device)

    outputs = model(
      input_ids=input_ids,
      attention_mask=attention_mask
    )

    _, preds = torch.max(outputs, dim=1)
    loss = loss_fn(outputs, targets)

    correct_predictions += torch.sum(preds == targets)
    losses.append(loss.item())

    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

  return correct_predictions.double() / n_examples, np.mean(losses)


In [96]:
# so now let's have a function for evaluationg the model (good for validation and test DataLoaders)
# we will not need the optimizer and the sceduler 

def eval_model(model, data_loader, loss_fn, device, n_examples):
  model = model.eval()

  losses = []
  correct_predictions = 0

  with torch.no_grad():
    for d in data_loader:
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)

      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)

      loss = loss_fn(outputs, targets)

      correct_predictions += torch.sum(preds == targets)
      losses.append(loss.item())

  return correct_predictions.double() / n_examples, np.mean(losses)


In [97]:
# ok, now we are good to go over all epochs and run the training and validation loop...:)

%%time

history = defaultdict(list)
best_accuracy = 0

for epoch in range(EPOCHS):
  print(f'Epoch {epoch + 1}/{EPOCHS}')
  print('-' * 10)

  train_acc, train_loss = train_epoch(
     model,
     train_data_loader,
     loss_fn,
     optimizer,
     device,
     scheduler,
     len(df_train) 
      
  )

  print(f'Train loss {train_loss} - Train accuracy {train_acc}')

  val_acc, val_loss = eval_model(
      model,
      val_data_loader,
      loss_fn,
      device,
      len(df_val)
  )

  print(f'Validation loss {val_loss} - Validation accuracy {val_acc}')
  print ()

  # fill up the right history tracking lists
  history['train_acc'].append(train_acc)
  history['train_loss'].append(train_loss)
  history['val_acc'].append(val_acc)
  history['val_loss'].append(val_loss)

  if val_acc > best_accuracy:
    torch.save(model.state_dict(), '/content/drive/MyDrive/data/finance_sentiment/best_model_state.bin')
    best_accuracy = val_acc


Epoch 1/1
----------


KeyboardInterrupt: ignored

In [None]:
# it is also a good idea to save all history to a file
df_history = pd.DataFrame(history)

df_history.to_csv('/content/drive/MyDrive/data/finance_sentiment/cased_history.csv')

df_history.head()

## inspecting the accuracy 

In [None]:
# let's place all history dict into a DataFrame for ease of inspection
#df_accuracy = pd.read_csv('/content/drive/MyDrive/data/finance_sentiment/cased_history.csv')
df_accuracy = pd.DataFrame(history)

df_accuracy.head()

In [None]:
# df_accuracy contains the data as pt tensors , let's extract this data and place it into df_acc

def get_item(pt_tensor):
  if isinstance(pt_tensor, float):     
    return pt_tensor
  else: 
    return pt_tensor.item()

ds_t =  df_accuracy['train_acc'].apply(get_item)
ds_v =  df_accuracy['val_acc'].apply(get_item)

df_acc= pd.concat([ds_t, ds_v], axis=1)


In [47]:
df_acc

Unnamed: 0,train_acc,val_acc
0,0.702786,0.797938


In [None]:
# plot it
import plotly.express as px
# import plotly.graph_objects as go

fig = px.line(df_acc, x=df_acc.index, y=['train_acc', 'val_acc'],  range_y=[0.6, 1.02], width=1000)
fig.update_yaxes( title='Accuracy')
fig.update_xaxes( title='Epoch')

fig.show()



## experimental visualization (not needed)

In [49]:
# # another experimental visualization too - autoviz

# !pip install autoviz

# from autoviz.AutoViz_Class import AutoViz_Class
# AV = AutoViz_Class()

# filename = ""
# sep = ","
# dft = AV.AutoViz(
#     filename,
#     sep=",",
#     depVar="",
#     dfte=df_acc,
#     header=0,
#     verbose=0,
#     lowess=False,
#     chart_format="svg",
#     max_rows_analyzed=150000,
#     max_cols_analyzed=30,
# )

## model evaluation

In [100]:
# let's load the best model 

model = SentimentClassifier(len(sentiment_values))
model.load_state_dict(torch.load('/content/drive/MyDrive/data/finance_sentiment/best_model_state.bin'))
model = model.to(device)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [101]:
# we can now use te eval_model on the test data to see how well we're doing

test_acc, _ = eval_model(
  model,
  test_data_loader,
  loss_fn,
  device,
  len(df_test)
)

test_acc.item()


0.8288659793814434

## getting predictions from the model

In [102]:
# let's have a function for predictions wich is very similar to the test evaluation fuction but also returns probabilities

def get_predictions (model, data_loader):

  model = model.eval()


  texts = []
  predictions = []
  prediction_probs = []
  true_labels = []

  with torch.no_grad():
    for d in data_loader:

      texts = d['text']
      input_ids = d['input_ids'].to(device)
      attention_mask = d['attention_mask'].to(device)
      targets = d['targets'].to(device)

      outputs = model(
          input_ids=input_ids,
          attention_mask=attention_mask
      )

      _, preds = torch.max(outputs, dim=1)

      probs = F.softmax(outputs, dim=1)

      texts.extend(texts)
      predictions.extend(preds)
      prediction_probs.extend(probs)
      true_labels.extend(targets)

  predictions = torch.stack(predictions).cpu()
  prediction_probs = torch.stack(prediction_probs).cpu()
  true_labels = torch.stack(true_labels).cpu()

  return texts, predictions, prediction_probs, true_labels




In [103]:
# let's look at the test_data and look at other accuracy measures

y_review_texts, y_pred, y_pred_probs, y_test = get_predictions(
  model,
  test_data_loader
)

In [104]:
# let's have a classification report

class_names = ['negative', 'neutral','positive']
print(classification_report(y_test, y_pred, target_names=class_names))

              precision    recall  f1-score   support

    negative       0.75      0.80      0.78        56
     neutral       0.87      0.87      0.87       285
    positive       0.77      0.75      0.76       144

    accuracy                           0.83       485
   macro avg       0.80      0.81      0.80       485
weighted avg       0.83      0.83      0.83       485



In [174]:
# let's look at the confusion matrix

cf_matrix=confusion_matrix(y_pred, y_test)
print(cf_matrix)


[[ 45   6   9]
 [  9 249  27]
 [  2  30 108]]


In [175]:
# plot the confusion matrix

import plotly.figure_factory as ff

# for plotting let's transfrom the cf_matrix so heat map reflects the confusion matrix correcly

cf_matrix = np.flip(cf_matrix, 0)

x = ['true negative', 'true neutral', 'true positive']
y = ['predicted positive', 'predicted neutral', 'predicted negative']


fig = ff.create_annotated_heatmap(cf_matrix, x=x, y=y, colorscale='amp')
fig.show()



In [193]:
# let's look at one or more examples (we can switch the review_text to any example we want ) 

idx = 6

review_text = y_review_texts[idx]
true_sentiment = y_test[idx]
pred_df = pd.DataFrame({
  'class_names': class_names,
  'values': y_pred_probs[idx],
  'true_label': true_sentiment.item()
})

In [190]:
pred_df

Unnamed: 0,class_names,values,true_label
0,negative,0.02548,2
1,neutral,0.165756,2
2,positive,0.808764,2


In [191]:
# review the example we inserted above

print("\n".join(wrap(review_text)))
print()
print(f'True sentiment: {class_names[true_sentiment]}')

17 March 2011 - Goldman Sachs estimates that there are negative
prospects for the Norwegian mobile operations of Norway 's Telenor ASA
OSL : TEL and Sweden 's TeliaSonera AB STO : TLSN in the short term .

True sentiment: positive


## predicting raw text

In [197]:
# let's try to predict any text

# text = '''
# Suspect in Birmingham FedEx driver road rage shooting captured in Wisconsin
# A man wanted on an attempted murder charge in the Aug. 2 shooting 
# of a FedEx driver on Interstate 20/59 in downtown Birmingham was apprehended 
# today in Madison, Wisconsin. Ronnie Thompson, 31, was apprehended by members of the United States Marshals Service.
# '''

text = ' microsoft losses deepended and the quarter results where really bad'

encoded_review = tokenizer.encode_plus(
  text,
  max_length=MAX_LEN,
  add_special_tokens=True,
  return_token_type_ids=False,
  padding='max_length',
  return_attention_mask=True,
  return_tensors='pt',
)

In [198]:
#Let's get the predictions from our model:

input_ids = encoded_review['input_ids'].to(device)
attention_mask = encoded_review['attention_mask'].to(device)

output = model(input_ids, attention_mask)
_, prediction = torch.max(output, dim=1)

print(f'Review text: {text}')
print(f'Sentiment  : {class_names[prediction]}')

prediction.item()


Review text:  microsoft losses deepended and the quarter results where really bad
Sentiment  : negative


tensor([0])