# Financial sentiment analysis

Fine-tuning di un modello [BERT](https://arxiv.org/pdf/1810.04805.pdf) preaddestrato caricato da [TensorFlow Hub](https://www.tensorflow.org/hub/) per eseguire sentiment analysis su un [dataset di news finanziarie](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news).

## Dataset e Preprocessing

In [26]:
!pip install -q -U tensorflow-text

In [27]:
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import numpy as np

Il dataset in questione è composto da entry formate dal titolo di un articolo di news finanziarie e la relativa etichetta (positive, neutral, negative). Utilizzare il solo titolo e non l'intero articolo è una pratica comune, viene utilizzata anche nel paper [Deep Learning for Event-Driven Stock Prediction](https://www.ijcai.org/Proceedings/15/Papers/329.pdf), in quanto il titolo è considerato una sintesi esaustiva del contenuto dell'articolo.


In [28]:
import os.path
from urllib.request import urlretrieve

if not os.path.exists("financial_data_all.csv"):
    urlretrieve("https://raw.githubusercontent.com/gned0/financial_sentiment_analysis/main/financial_data_all.csv", "financial_data_all.csv")

data = pd.read_csv('financial_data_all.csv', delimiter=',', encoding='latin-1')

In [29]:
data2 = data.set_axis(['Target', "Text"], axis=1, inplace=False)
data2.head()

Unnamed: 0,Target,Text
0,neutral,Technopolis plans to develop in stages an area...
1,negative,The international electronic industry company ...
2,positive,With the new production plant the company woul...
3,positive,According to the company 's updated strategy f...
4,positive,FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag...


In [30]:
data2['Target'].value_counts()

neutral     2878
positive    1363
negative     604
Name: Target, dtype: int64

Labeling delle etichette

In [31]:
from sklearn import preprocessing
data2["Target"] = data2["Target"].astype("category")
le = preprocessing.LabelEncoder()
data2['Target'] = le.fit_transform(data2.Target.values)
data2.head()

Unnamed: 0,Target,Text
0,1,Technopolis plans to develop in stages an area...
1,0,The international electronic industry company ...
2,2,With the new production plant the company woul...
3,2,According to the company 's updated strategy f...
4,2,FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag...


Conversione del dataset da Dataframe Pandas a Dataset Tensorflow

In [32]:
DATASET_SIZE = data2.size
BATCH_SIZE = 32

text = data2['Text'].to_numpy()
targets = data2['Target'].to_numpy()

dataset = tf.data.Dataset.from_tensor_slices((text, targets)).shuffle(1000)
batches = dataset.shuffle(1000).batch(BATCH_SIZE)

In [33]:

train_size = int(0.85 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)
train_dataset = dataset.take(train_size).batch(BATCH_SIZE)
val_dataset = dataset.skip(train_size)
val_dataset = dataset.take(val_size).batch(BATCH_SIZE)

train_dataset

<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int64)>

Ogni batch ha la seguente forma

In [34]:
for row in train_dataset.take(1):
  print(row)

(<tf.Tensor: shape=(32,), dtype=string, numpy=
array([b'Finnish food company Raisio Oyj HEL : RAIVV said on Friday it has wrapped up the divestment of its margarine operations to US sector player Bunge Ltd NYSE : BG for EUR80m USD119 .2 m .',
       b"Cargotec 's share capital after the increase totals EUR 64,299,180 .",
       b"Kalnapilio-Tauro Grupe ( Kalnapilis-Tauras Group ) , which is owned by Denmark 's Royal Unibrew , raised its market share to 25.18 percent from 23.74 percent , as beer sales for the seven months jumped by 14.5 percent to 40.5 million liters .",
       b"Seppala 's revenue increased by 0.2 % to EUR10 .1 m. In Finland , revenue went down by 2.4 % to EUR6 .8 m , while sales abroad rose by 6.2 % to EUR3 .3 m. Sales increased in all the Baltic countries as well as in Russia and Ukraine .",
       b'The borrower was happy to do the roadshow and this paid off as the hit ratio from it was high .',
       b'Compared with the FTSE 100 index , which rose 28.3 points or 0

## Costruzione del modello

Da TensorFlow Hub viene importato un modello BERT (architettura con 12 layer, ognuno con 12 attention heads). Il modello importato ha già i pesi del pre-training (eseguito su Wikipedia e BooksCorpus). Viene anche importato il relativo layer di preprocessing, che sarà inserito subito dopo il layer di input nella rete neurale.

In [None]:
bert_model_name = 'bert_en_uncased_L-12_H-768_A-12'

tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'

In [None]:
preprocessing = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
encoder_layer = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')

In [None]:
text_test = ['Apple sues Samsung']
preprocessing(text_test)

{'input_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 'input_type_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 

In [None]:
def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = preprocessing
  encoder_inputs = preprocessing_layer(text_input)
  encoder = encoder_layer
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(3, activation='softmax', name='classifier')(net)
  return tf.keras.Model(text_input, net)

In [None]:
model = build_classifier_model()

In [None]:
model.compile(loss=tf.losses.SparseCategoricalCrossentropy(from_logits=False), optimizer="sgd", metrics=['accuracy'])

In [None]:
model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text (InputLayer)               [(None,)]            0                                            
__________________________________________________________________________________________________
preprocessing (KerasLayer)      {'input_word_ids': ( 0           text[0][0]                       
__________________________________________________________________________________________________
BERT_encoder (KerasLayer)       {'pooled_output': (N 109482241   preprocessing[0][0]              
                                                                 preprocessing[0][1]              
                                                                 preprocessing[0][2]              
____________________________________________________________________________________________

## Addestramento e valutazione

Viene utilizzato un callback per salvare dei checkpoint dei pesi durante l'addestramento: alla fine di ogni epoca, se essa è la migliore finora, viene salvato un checkpoint dei pesi. La metrica di confronto delle epoche è l'accuracy (parametro *monitor*).

In [None]:
EPOCHS = 5
checkpoint_filepath = '/tmp/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='accuracy',
    mode='max',
    save_best_only=True)

model.fit(train_dataset, epochs=EPOCHS, verbose = 1, callbacks=[model_checkpoint_callback])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f8633d6ee50>

In [None]:
model.evaluate(val_dataset)



[0.356236070394516, 0.860977292060852]

# Binarizzazione del problema

Viene ora binarizzato il dataset (vengono rimosse le entry con etichetta "neutral"), dopodiché viene rieseguito il fine-tuning sul nuovo task binario.

In [35]:
data_binary = data.set_axis(['Target', "Text"], axis=1, inplace=False)
data_binary = data_binary[data_binary.Target != 'neutral']
data_binary['Target'].value_counts()

positive    1363
negative     604
Name: Target, dtype: int64

In [36]:
data_binary["Target"] = data_binary["Target"].astype("category")
data_binary['Target'] = le.fit_transform(data_binary.Target.values)

In [37]:
BINARY_DATASET_SIZE = data_binary.size/2

binary_text = data_binary['Text'].to_numpy()
binary_targets = data_binary['Target'].to_numpy()
dataset = tf.data.Dataset.from_tensor_slices((binary_text, binary_targets)).shuffle(1000)

In [38]:
train_size = int(0.85 * BINARY_DATASET_SIZE)
val_size = int(0.15 * BINARY_DATASET_SIZE)
train_dataset = dataset.take(train_size).batch(BATCH_SIZE)
val_dataset = dataset.skip(train_size)
val_dataset = dataset.take(val_size).batch(BATCH_SIZE)

In [None]:
def build_binary_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = preprocessing
  encoder_inputs = preprocessing_layer(text_input)
  encoder = encoder_layer
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation="sigmoid", name='classifier')(net)
  return tf.keras.Model(text_input, net)

In [None]:
binary_model = build_binary_model()
binary_model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=False), optimizer="sgd", metrics=tf.metrics.BinaryAccuracy())
binary_model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text (InputLayer)               [(None,)]            0                                            
__________________________________________________________________________________________________
preprocessing (KerasLayer)      {'input_mask': (None 0           text[0][0]                       
__________________________________________________________________________________________________
BERT_encoder (KerasLayer)       {'sequence_output':  109482241   preprocessing[0][0]              
                                                                 preprocessing[0][1]              
                                                                 preprocessing[0][2]              
______________________________________________________________________________________________

In [None]:
EPOCHS = 5
checkpoint_filepath = '/tmp/checkpoint_binary'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='loss',
    mode='max',
    save_best_only=True)

binary_model.fit(train_dataset, epochs=EPOCHS, verbose = 1, callbacks=[model_checkpoint_callback])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f8634f53210>

In [None]:
binary_model.evaluate(val_dataset)



[0.02635045163333416, 0.996610164642334]

# Confronto con semplice modello lineare

## Binario

In [19]:
import sklearn

In [60]:
from sklearn.model_selection import train_test_split
training, validation = train_test_split(data_binary, test_size=0.15, random_state=21)

In [61]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [65]:
from sklearn.linear_model import LogisticRegression
lrm = LogisticRegression(solver="saga", C=10)
lrm.fit(vect.fit_transform(training['Text']), training['Target']);



In [66]:
lrm.score(vect.transform(validation['Text']), validation['Target'])

0.8648648648648649

## Multiclasse

In [67]:
training, validation = train_test_split(data2, test_size=0.15, random_state=21)

In [68]:
lrm = LogisticRegression(solver="saga", C=10)
lrm.fit(vect.fit_transform(training['Text']), training['Target']);



In [69]:
lrm.score(vect.transform(validation['Text']), validation['Target'])

0.7730398899587345

# Conclusioni

Un semplice modello lineare raggiunge accuracy del **77%** nel task multiclasse e dell' **86%** nel task binario. Il modello preaddestrato BERT con fine-tuning raggiunge invece rispettivamente l'**86%** e **99%** di accuracy.