Authors: Allard Marc-Antoine

Course: FIN-407

---

## Sentiment Analysis Benchmark

1. Exploration of the training set


2. Models training:
    1. CBOW + Classification Algo
    2. FastText
    3. LSTM/GRU
    4. RoBERTa
    5. FinBERT


---
### Import

In [None]:
# For Colab
# !pip instal requirements.txt

In [40]:
%load_ext autoreload
%autoreload 2

from torch.utils.data import Dataset
import pandas as pd
pd.options.plotting.backend = "plotly"
pd.options.plotting.backend = 'matplotlib'
import plotly.express as px

import numpy as np
import torch
import torch.nn as nn
from torch.optim import RAdam
import random
from tqdm import tqdm
import statsmodels as sm 
import seaborn as sns 
%matplotlib inline 

import spacy, nltk, gensim, sklearn
import pyLDAvis.gensim_models

from datasets import Dataset as HFDataset
from datasets import concatenate_datasets
from torch.utils.data import DataLoader

from transformers import (AutoConfig,
                          AutoTokenizer,
                          AutoModelForSequenceClassification,
                          DataCollatorWithPadding,
                          TrainingArguments,
                          AdamW,
                          get_scheduler,
                          Trainer)

import evaluate
accuracy = evaluate.load("accuracy")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


---
### Helper func

In [45]:
def plot_timeseries(df): 
    # Convert 'Date' column to datetime
    df['Date'] = pd.to_datetime(df['Date'])

    # Extract the year and month from the Date column
    df['YearMonth'] = df['Date'].dt.to_period('M').astype(str)

    # Group by YearMonth and count the number of articles for each sentiment label
    monthly_counts = df.groupby(['YearMonth', 'label']).size().reset_index(name='Count')

    # Create a bar plot using Plotly Express with hue for the sentiment label
    fig = px.bar(monthly_counts, 
                 x='YearMonth', 
                 y='Count', 
                 title='Monthly Count of Sentiment-Labeled News Articles', 
                 color='label', 
                 color_discrete_map={'-1': 'red', '0': 'blue', '1': 'green'},
                 labels={'YearMonth': 'Month', 'Count': 'Number of Articles'},
                 category_orders={'label': ['-1', '0', '1']})

    # Update the x-axis tick format and angle for better readability
    fig.update_xaxes(tickangle=-45)

    # Show the plot
    fig.show()

---
### Exploration of the training set

In [29]:
training_path = "../../data/training/"
# Load the data
training_news = pd.read_json(training_path+"train_answers.json")

# Add unique identifier
training_news['id'] = range(1, len(training_news) + 1)
# label to float
training_news['label'] = training_news['label'].str.replace(" ", "")
training_news['label'] = training_news['label'].astype(float)

training_news.set_index('id', inplace=True)

training_news.head()

Unnamed: 0_level_0,Summary,Date,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,"Today, the United States wakes up at 2, smokes...",2010-01-12 00:00:00+00:00,-1.0
2,"Today, the United States wakes up at 2, smokes...",2010-01-12 00:00:00+00:00,0.0
3,Short treasuries via TBT. That 4.7% return on ...,2010-02-12 00:00:00+00:00,1.0
4,Hard Assets Investor submits: By Brad Zigler R...,2010-02-26 00:00:00+00:00,0.0
5,"After seeing the debacle 1999-2002 in CDOs, mo...",2010-02-26 00:00:00+00:00,1.0


In [30]:
training_news.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3931 entries, 1 to 3931
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype              
---  ------   --------------  -----              
 0   Summary  3931 non-null   object             
 1   Date     3931 non-null   datetime64[ns, UTC]
 2   label    3931 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(1), object(1)
memory usage: 122.8+ KB


In [47]:
import matplotlib.pyplot as plt

fig = training_news.label.value_counts().plot(kind='bar', backend='plotly')

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Number of Articles',
)

In [44]:
plot_timeseries(training_news)


Converting to PeriodArray/Index representation will drop timezone information.



---
### Models training

#### 1. CBOW

#### 2. FastText

#### 3. LSTM/GRU

#### 4. RoBERTa

#### 5. FinBERT