## Sentiment Analysis from CryptoLin

### Importing the data of CryptoLin for finBERT model Training 

In [1]:
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

In [2]:
df = pd.read_csv("CryptoLin_IE.csv")

In [3]:
df.columns

Index(['id', 'date', 'news', 'final_manual_labelling', 'text_span',
       'type_abnormal_return_fama_frech', 'vader', 'textblob', 'flair',
       'finbert_positive', 'finbert_negative', 'finbert_neutral',
       'vader_class', 'textblob_class', 'flair_class',
       'finbert_positive_class', 'finbert_negative_class',
       'finbert_neutral_class'],
      dtype='object')

In [4]:
df.drop(['id'],inplace=True,axis=1)

In [5]:
df.head()

Unnamed: 0,date,news,final_manual_labelling,text_span,type_abnormal_return_fama_frech,vader,textblob,flair,finbert_positive,finbert_negative,finbert_neutral,vader_class,textblob_class,flair_class,finbert_positive_class,finbert_negative_class,finbert_neutral_class
0,2022-01-25,"Ripple announces stock buyback, nabs $15 billi...",1,{annotator1_id:22;annotator1_label:1; annotato...,0,0.0,0.0,0.875877,0.098288,-0.020569,0.881142,0,-1,-1,1,0,1
1,2022-01-25,IMF directors urge El Salvador to remove Bitco...,-1,{annotator1_id:16;annotator1_label:-1; annotat...,0,0.128,0.2,0.998796,0.047823,-0.162971,0.789206,1,1,1,-1,-1,1
2,2022-01-25,Dragonfly Capital is raising $500 million for ...,1,{annotator1_id:45;annotator1_label:1; annotato...,0,0.0,0.136364,0.984027,0.156997,-0.008097,0.834906,0,1,1,1,1,1
3,2022-01-25,Rick and Morty co-creator collaborates with Pa...,0,{annotator1_id:32;annotator1_label:0; annotato...,0,0.0,0.0,0.996666,0.055608,-0.015489,0.928903,0,-1,1,0,0,1
4,2022-01-25,How fintech SPACs lost their shine,0,{annotator1_id:48;annotator1_label:0; annotato...,0,-0.3182,0.0,0.999921,0.039964,-0.472788,0.487248,-1,-1,1,-1,-1,-1


In [6]:
#We ae going to use the reduce manual_labeled df for training our finBERT model:
input_df = df[['date','news','final_manual_labelling','text_span']]

## FinBERT Sentiment Analysis
ref: https://wandb.ai/ivangoncharov/FinBERT_Sentiment_Analysis_Project/reports/Financial-Sentiment-Analysis-on-Stock-Market-Headlines-With-FinBERT-Hugging-Face--VmlldzoxMDQ4NjM0

HuggingFace makes it really easy for us to try out different NLP models. We can find the FinBERT model on the HuggingFace model hub (https://huggingface.co/ProsusAI/finbert) & even run a test inference using a little text box right on their website (https://huggingface.co/ProsusAI/finbert)! 

#### Using Finbert as it is, checking the output 

In [7]:
#!pip install transformers

In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [9]:
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

In [10]:
def apply_finbert(x):
    inputs = tokenizer([x], padding = True, truncation = True, return_tensors='pt')
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=1) 
    return predictions[:, 0].tolist()[0], predictions[:, 1].tolist()[0], predictions[:, 2].tolist()[0]

In [11]:
input_df[['Positive','Negative','Neutral']] = (input_df['news'].apply(apply_finbert)).apply(pd.Series)

In [12]:
input_df

Unnamed: 0,date,news,final_manual_labelling,text_span,Positive,Negative,Neutral
0,2022-01-25,"Ripple announces stock buyback, nabs $15 billi...",1,{annotator1_id:22;annotator1_label:1; annotato...,0.098288,0.020569,0.881142
1,2022-01-25,IMF directors urge El Salvador to remove Bitco...,-1,{annotator1_id:16;annotator1_label:-1; annotat...,0.047823,0.162970,0.789207
2,2022-01-25,Dragonfly Capital is raising $500 million for ...,1,{annotator1_id:45;annotator1_label:1; annotato...,0.156997,0.008097,0.834906
3,2022-01-25,Rick and Morty co-creator collaborates with Pa...,0,{annotator1_id:32;annotator1_label:0; annotato...,0.055608,0.015489,0.928903
4,2022-01-25,How fintech SPACs lost their shine,0,{annotator1_id:48;annotator1_label:0; annotato...,0.039964,0.472789,0.487248
...,...,...,...,...,...,...,...
2678,2020-05-01,Gambling for a good cause  CryptoSlots donate...,1,{annotator1_id:80;annotator1_label:1; annotato...,0.178831,0.008580,0.812589
2679,2020-04-18,"Litecoin, The Chinese Alternative to Bitcoin",0,{annotator1_id:10;annotator1_label:0; annotato...,0.105272,0.009314,0.885414
2680,2020-04-10,Do You Know What is Happening to Money?,0,{annotator1_id:32;annotator1_label:0; annotato...,0.027453,0.304318,0.668229
2681,2018-07-30,Download CoinMarketCal app on App Store,0,{annotator1_id:33;annotator1_label:0; annotato...,0.046135,0.015244,0.938621


In [13]:
input_df[input_df['Positive']>0.5]['final_manual_labelling'].value_counts()

 1    419
 0     95
-1     11
Name: final_manual_labelling, dtype: int64

In [14]:
input_df[input_df['Negative']>0.5]['final_manual_labelling'].value_counts()

-1    202
 0     73
 1     32
Name: final_manual_labelling, dtype: int64

In [15]:
input_df[input_df['Neutral']>0.5]['final_manual_labelling'].value_counts()

 1    901
 0    743
-1    172
Name: final_manual_labelling, dtype: int64

In [16]:
def calculate_result(df):
    largo = df.shape[0]
    result = []
    for i in range(0,largo):
        if df.iloc[i]['Positive']>0.5:
            result.append(1)
        elif df.iloc[i]['Negative']>0.5:
            result.append(-1)
        elif df.iloc[i]['Neutral']>0.5:
            result.append(0)
        elif df.iloc[i]['Neutral']>df.iloc[i]['Positive'] and df.iloc[i]['Neutral']>df.iloc[i]['Negative']:
            result.append(0)
        elif df.iloc[i]['Positive']>df.iloc[i]['Neutral'] and df.iloc[i]['Positive']>df.iloc[i]['Negative']:
            result.append(1)
        else:
            result.append(-1)

    df['result']=result
    return df

In [17]:
non_trained_result = calculate_result(input_df)

In [18]:
non_trained_result.head(10)

Unnamed: 0,date,news,final_manual_labelling,text_span,Positive,Negative,Neutral,result
0,2022-01-25,"Ripple announces stock buyback, nabs $15 billi...",1,{annotator1_id:22;annotator1_label:1; annotato...,0.098288,0.020569,0.881142,0
1,2022-01-25,IMF directors urge El Salvador to remove Bitco...,-1,{annotator1_id:16;annotator1_label:-1; annotat...,0.047823,0.16297,0.789207,0
2,2022-01-25,Dragonfly Capital is raising $500 million for ...,1,{annotator1_id:45;annotator1_label:1; annotato...,0.156997,0.008097,0.834906,0
3,2022-01-25,Rick and Morty co-creator collaborates with Pa...,0,{annotator1_id:32;annotator1_label:0; annotato...,0.055608,0.015489,0.928903,0
4,2022-01-25,How fintech SPACs lost their shine,0,{annotator1_id:48;annotator1_label:0; annotato...,0.039964,0.472789,0.487248,0
5,2022-01-25,Multichain vulnerability put a billion dollars...,-1,{annotator1_id:77;annotator1_label:-1; annotat...,0.101357,0.194968,0.703675,0
6,2022-01-25,YouTube wants to help content creators capital...,0,{annotator1_id:52;annotator1_label:0; annotato...,0.475625,0.007213,0.517162,0
7,2022-01-25,OpenSea is reimbursing users who sold NFTs bel...,0,{annotator1_id:10;annotator1_label:0; annotato...,0.011183,0.952579,0.036238,-1
8,2022-01-25,GoodDollar Launches Key Protocol Upgrade to Ex...,1,{annotator1_id:22;annotator1_label:1; annotato...,0.853261,0.009177,0.137562,1
9,2022-01-25,BCB Group raises a $60 million Series A round ...,1,{annotator1_id:43;annotator1_label:1; annotato...,0.218266,0.010609,0.771125,0


In [19]:
#The accuracy with the original FinBERT:
accuracy_score(y_true=non_trained_result['final_manual_labelling'], y_pred=non_trained_result['result'])

0.5143496086470369

## Retraining Finbert

In [20]:
df = input_df[['news','final_manual_labelling']]

In [21]:
df.head()

Unnamed: 0,news,final_manual_labelling
0,"Ripple announces stock buyback, nabs $15 billi...",1
1,IMF directors urge El Salvador to remove Bitco...,-1
2,Dragonfly Capital is raising $500 million for ...,1
3,Rick and Morty co-creator collaborates with Pa...,0
4,How fintech SPACs lost their shine,0


In [22]:
df.columns = ['news','labels']
df.head()

Unnamed: 0,news,labels
0,"Ripple announces stock buyback, nabs $15 billi...",1
1,IMF directors urge El Salvador to remove Bitco...,-1
2,Dragonfly Capital is raising $500 million for ...,1
3,Rick and Morty co-creator collaborates with Pa...,0
4,How fintech SPACs lost their shine,0


In [23]:
#Giving a +1 offset to the target variable:
df['labels'].replace({1:2},inplace=True)
df['labels'].replace({0:1},inplace=True)
df['labels'].replace({-1:0},inplace=True)

In [24]:
df['labels'].value_counts()

2    1366
1     921
0     396
Name: labels, dtype: int64

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import torch
from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import EarlyStoppingCallback
import numpy as np

In [26]:
# Preprocess data
X = list(df["news"])
y = list(df["labels"])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
X_val_2, X_test, y_val_2, y_test = train_test_split(X_val, y_val, test_size=0.2)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val_2, padding=True, truncation=True, max_length=512)
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)

In [27]:
# Create torch dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

In [28]:
train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)

In [29]:
# Define Trainer parameters
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    #recall = recall_score(y_true=labels, y_pred=pred,'weighted')
    #precision = precision_score(y_true=labels, y_pred=pred)
    #f1 = f1_score(y_true=labels, y_pred=pred)
    
    return {"accuracy": accuracy}#, "precision": precision, "recall": recall, "f1": f1}

In [30]:
# Define Trainer
args = TrainingArguments(
    output_dir="first_test",
    evaluation_strategy="steps",
    eval_steps=500,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    seed=0,
    load_best_model_at_end=True,
)

In [31]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

In [32]:
# Train pre-trained model
trainer.train()

***** Running training *****
  Num examples = 2146
  Num Epochs = 4
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1076


Step,Training Loss,Validation Loss,Accuracy
500,0.6653,1.974498,0.37296
1000,0.2607,3.483169,0.358974


***** Running Evaluation *****
  Num examples = 429
  Batch size = 8
Saving model checkpoint to first_test/checkpoint-500
Configuration saved in first_test/checkpoint-500/config.json
Model weights saved in first_test/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 429
  Batch size = 8
Saving model checkpoint to first_test/checkpoint-1000
Configuration saved in first_test/checkpoint-1000/config.json
Model weights saved in first_test/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from first_test/checkpoint-500 (score: 1.9744975566864014).


TrainOutput(global_step=1076, training_loss=0.44057233094282755, metrics={'train_runtime': 3038.0404, 'train_samples_per_second': 2.826, 'train_steps_per_second': 0.354, 'total_flos': 185272957552032.0, 'train_loss': 0.44057233094282755, 'epoch': 4.0})

In [33]:
# ----- 3. Predicting -----#
# Create torch dataset
test_dataset = Dataset(X_test_tokenized, y_test)

In [34]:
# Load trained model
model_path = "first_test/checkpoint-500"
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=3)

loading configuration file first_test/checkpoint-500/config.json
Model config BertConfig {
  "_name_or_path": "ProsusAI/finbert",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "positive",
    "1": "negative",
    "2": "neutral"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "negative": 1,
    "neutral": 2,
    "positive": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.20.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file first

In [35]:
# Define test trainer
test_trainer = Trainer(model)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [36]:
 #Make prediction
raw_pred, _, _ = test_trainer.predict(test_dataset)

***** Running Prediction *****
  Num examples = 108
  Batch size = 8


In [37]:
# Preprocess raw predictions
y_pred = np.argmax(raw_pred, axis=1)

In [38]:
#The accuracy with the re-trained FinBERT:
accuracy_score(y_true=y_test, y_pred=y_pred)

0.6944444444444444

In [39]:
#The accuracy of the original finbert:
X_test_df = pd.DataFrame()
X_test_df['news']=X_test
X_test_df['labels']=y_test

In [40]:
X_test_df.head()

Unnamed: 0,news,labels
0,PayPal ups weekly limit on crypto purchases to...,2
1,Bitcoin climbs more than 10% following an ext...,2
2,Russian crypto ban proposal draws denunciation...,0
3,At least $611 million stolen in massive cross-...,0
4,Shenzhen-listed ICT firm to spend up to $155 m...,2


In [41]:
X_test_df.shape

(108, 2)

In [42]:
X_test_df['labels'].replace({0:-1},inplace=True)
X_test_df['labels'].replace({1:0},inplace=True)
X_test_df['labels'].replace({2:1},inplace=True)

In [43]:
X_test_df.head()

Unnamed: 0,news,labels
0,PayPal ups weekly limit on crypto purchases to...,1
1,Bitcoin climbs more than 10% following an ext...,1
2,Russian crypto ban proposal draws denunciation...,-1
3,At least $611 million stolen in massive cross-...,-1
4,Shenzhen-listed ICT firm to spend up to $155 m...,1


In [44]:
X_test_df[['Positive','Negative','Neutral']] = (X_test_df['news'].apply(apply_finbert)).apply(pd.Series)

In [45]:
result_df = calculate_result(X_test_df)

In [46]:
result_df.head(10)

Unnamed: 0,news,labels,Positive,Negative,Neutral,result
0,PayPal ups weekly limit on crypto purchases to...,1,0.129498,0.562219,0.308283,-1
1,Bitcoin climbs more than 10% following an ext...,1,0.010615,0.011444,0.977942,0
2,Russian crypto ban proposal draws denunciation...,-1,0.890342,0.087145,0.022513,1
3,At least $611 million stolen in massive cross-...,-1,0.916018,0.060444,0.023538,1
4,Shenzhen-listed ICT firm to spend up to $155 m...,1,0.006029,0.022424,0.971547,0
5,Coinbase says it holds bitcoin as an investmen...,1,0.00531,0.019722,0.974968,0
6,Solana-based DEX Soldex AI : An interview with...,0,0.018919,0.959193,0.021887,-1
7,"Keep, NuCypher communities back protocol merge...",0,0.021079,0.742907,0.236014,-1
8,Cross-chain decentralized exchange zkLink rais...,1,0.004484,0.038862,0.956654,0
9,Grammy Award winner The Weeknd announces plan ...,1,0.01107,0.940532,0.048397,-1


In [47]:
result_df.shape

(108, 6)

In [48]:
#The accuracy with the original FinBERT:
accuracy_score(y_true=result_df['labels'], y_pred=result_df['result'])

0.12962962962962962

### Retraining with a second set of arguments 

In [49]:
# Define Trainer
args = TrainingArguments(
        output_dir = 'second_test/',
        evaluation_strategy = 'epoch',
        save_strategy = 'epoch',
        learning_rate=2e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=4,
        weight_decay=0.01,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy',
)


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [50]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

In [51]:
# Train pre-trained model
trainer.train()

***** Running training *****
  Num examples = 2146
  Num Epochs = 4
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 272


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.328407,0.403263
2,No log,2.56128,0.389277
3,No log,2.762023,0.389277
4,No log,2.879592,0.389277


***** Running Evaluation *****
  Num examples = 429
  Batch size = 32
Saving model checkpoint to second_test/checkpoint-68
Configuration saved in second_test/checkpoint-68/config.json
Model weights saved in second_test/checkpoint-68/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 429
  Batch size = 32
Saving model checkpoint to second_test/checkpoint-136
Configuration saved in second_test/checkpoint-136/config.json
Model weights saved in second_test/checkpoint-136/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 429
  Batch size = 32
Saving model checkpoint to second_test/checkpoint-204
Configuration saved in second_test/checkpoint-204/config.json
Model weights saved in second_test/checkpoint-204/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 429
  Batch size = 32
Saving model checkpoint to second_test/checkpoint-272
Configuration saved in second_test/checkpoint-272/config.json
Model weights saved in second_test/checkpoint-272/pytor

TrainOutput(global_step=272, training_loss=0.16326583133024328, metrics={'train_runtime': 2430.9704, 'train_samples_per_second': 3.531, 'train_steps_per_second': 0.112, 'total_flos': 185272957552032.0, 'train_loss': 0.16326583133024328, 'epoch': 4.0})

In [52]:
# ----- 3. Predicting -----#
# Create torch dataset
test_dataset = Dataset(X_test_tokenized, y_test)

In [53]:
# Load trained model
model_path = "second_test/checkpoint-204"
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=3)

loading configuration file second_test/checkpoint-204/config.json
Model config BertConfig {
  "_name_or_path": "first_test/checkpoint-500",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "positive",
    "1": "negative",
    "2": "neutral"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "negative": 1,
    "neutral": 2,
    "positive": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.20.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights 

In [54]:
# Define test trainer
test_trainer = Trainer(model)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [55]:
 #Make prediction
raw_pred, _, _ = test_trainer.predict(test_dataset)

***** Running Prediction *****
  Num examples = 108
  Batch size = 8


In [56]:
# Preprocess raw predictions
y_pred = np.argmax(raw_pred, axis=1)

In [57]:
#The accuracy with the re-trained FinBERT:
accuracy_score(y_true=y_test, y_pred=y_pred)

0.7314814814814815

## Testing different approaches to improve the model accuracy 

### Removing the Stopwords 

In [58]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
df['news'] = df['news'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

[nltk_data] Downloading package stopwords to /Users/paula/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [59]:
df["news"] = df["news"].str.lower()

In [60]:
import string
df["news"] = df['news'].str.replace('[^\w\s]','')
print(df['news'])

0       ripple announces stock buyback nabs 15 billion...
1       imf directors urge el salvador remove bitcoin ...
2          dragonfly capital raising 500 million new fund
3       rick morty cocreator collaborates paradigm nft...
4                            how fintech spacs lost shine
                              ...                        
2678    gambling good cause  cryptoslots donates proce...
2679             litecoin the chinese alternative bitcoin
2680                     do you know what happening money
2681                 download coinmarketcal app app store
2682               download coinmarketcal app google play
Name: news, Length: 2683, dtype: object


In [61]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

#Function to apply for each word the proper lemmatization.
def lemmetize_titles(words):
    a = []
    tokens = word_tokenize(words)
    for token in tokens:
        lemmetized_word = lemmatizer.lemmatize(token)
        a.append(lemmetized_word)
    lemmatized_title = ' '.join(a)
    return lemmatized_title

[nltk_data] Downloading package punkt to /Users/paula/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/paula/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [62]:
df['lemmetized_titles'] = df['news'].apply(lemmetize_titles)

In [63]:
df.head(20)

Unnamed: 0,news,labels,lemmetized_titles
0,ripple announces stock buyback nabs 15 billion...,2,ripple announces stock buyback nabs 15 billion...
1,imf directors urge el salvador remove bitcoin ...,0,imf director urge el salvador remove bitcoin l...
2,dragonfly capital raising 500 million new fund,2,dragonfly capital raising 500 million new fund
3,rick morty cocreator collaborates paradigm nft...,1,rick morty cocreator collaborates paradigm nft...
4,how fintech spacs lost shine,1,how fintech spacs lost shine
5,multichain vulnerability put billion dollars r...,0,multichain vulnerability put billion dollar ri...
6,youtube wants help content creators capitalize...,1,youtube want help content creator capitalize nfts
7,opensea reimbursing users sold nfts market val...,1,opensea reimbursing user sold nfts market valu...
8,gooddollar launches key protocol upgrade expan...,2,gooddollar launch key protocol upgrade expand ...
9,bcb group raises 60 million series a round col...,2,bcb group raise 60 million series a round cole...


In [64]:
# Preprocess data
X = list(df["lemmetized_titles"])
y = list(df["labels"])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
X_val_2, X_test, y_val_2, y_test = train_test_split(X_val, y_val, test_size=0.2)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val_2, padding=True, truncation=True, max_length=512)
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)

In [65]:
train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)
test_dataset = Dataset(X_test_tokenized, y_test)

In [66]:
# Define Trainer
args = TrainingArguments(
        output_dir = 'third_test/',
        evaluation_strategy = 'epoch',
        save_strategy = 'epoch',
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=4,
        weight_decay=0.01,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy',
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [67]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

In [68]:
# Train pre-trained model
trainer.train()

***** Running training *****
  Num examples = 2146
  Num Epochs = 4
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1076


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.818241,0.435897
2,0.418200,2.943276,0.391608
3,0.418200,3.3849,0.407925
4,0.177700,3.623519,0.39627


***** Running Evaluation *****
  Num examples = 429
  Batch size = 8
Saving model checkpoint to third_test/checkpoint-269
Configuration saved in third_test/checkpoint-269/config.json
Model weights saved in third_test/checkpoint-269/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 429
  Batch size = 8
Saving model checkpoint to third_test/checkpoint-538
Configuration saved in third_test/checkpoint-538/config.json
Model weights saved in third_test/checkpoint-538/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 429
  Batch size = 8
Saving model checkpoint to third_test/checkpoint-807
Configuration saved in third_test/checkpoint-807/config.json
Model weights saved in third_test/checkpoint-807/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 429
  Batch size = 8
Saving model checkpoint to third_test/checkpoint-1076
Configuration saved in third_test/checkpoint-1076/config.json
Model weights saved in third_test/checkpoint-1076/pytorch_model.b

TrainOutput(global_step=1076, training_loss=0.2893604301608628, metrics={'train_runtime': 2417.2246, 'train_samples_per_second': 3.551, 'train_steps_per_second': 0.445, 'total_flos': 136749087716976.0, 'train_loss': 0.2893604301608628, 'epoch': 4.0})

In [69]:
# Load trained model
model_path = "third_test/checkpoint-269"
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=3)

loading configuration file third_test/checkpoint-269/config.json
Model config BertConfig {
  "_name_or_path": "second_test/checkpoint-204",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "positive",
    "1": "negative",
    "2": "neutral"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "negative": 1,
    "neutral": 2,
    "positive": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.20.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights 

In [70]:
# Define test trainer
test_trainer = Trainer(model)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [71]:
 #Make prediction
raw_pred, _, _ = test_trainer.predict(test_dataset)

***** Running Prediction *****
  Num examples = 108
  Batch size = 8


In [72]:
# Preprocess raw predictions
y_pred = np.argmax(raw_pred, axis=1)

In [73]:
#The accuracy with the re-trained FinBERT:
accuracy_score(y_true=y_test, y_pred=y_pred)

0.8611111111111112

In [87]:
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, confusion_matrix

In [88]:
confusion_matrix(y_true=y_test, y_pred=y_pred)

array([[19,  0,  1],
       [ 1, 22, 12],
       [ 0,  1, 52]])

In [89]:
raw_pred, _, _ = test_trainer.predict(train_dataset)

***** Running Prediction *****
  Num examples = 2146
  Batch size = 8


In [90]:
y_pred = np.argmax(raw_pred, axis=1)

In [91]:
accuracy_score(y_true=y_train, y_pred=y_pred)

0.9263746505125815

In [92]:
confusion_matrix(y_true=y_train, y_pred=y_pred)

array([[ 300,   13,    8],
       [  25,  624,   85],
       [  10,   17, 1064]])