In [1]:
import matplotlib.pyplot as plt
from nltk.sentiment import vader
import numpy as np
import prep_financial_phrasebank as prep # library for preprocessing dataset
from afinn import Afinn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression



## Financial Phrasebank

This dataset contains labelled sentences extracted from financial news. It is distributed as four subsets, each defined by the minimum percentage of annotators who agreed on a sentence's label:
1) sentences_50agree
2) sentences_66agree
3) sentences_75agree
4) sentences_100agree

The entire dataset contains 11,821 news articles.



## Data preparation

Because our real data consists of news article abstracts, we apply the following cleaning steps:

1) convert the entire sentence to lowercase
2) handle negations: we use the WordNet library, which provides lists of antonyms for many words. Whenever "not" or "n't" is encountered, we replace the next word with its antonym; for example, "not good" becomes "bad". This lets us remove stopwords afterwards without changing the sentiment of the sentence.
3) remove punctuation
4) remove stopwords
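The four cleaning steps above can be sketched as follows. This is a minimal illustration, not the code in `prep_financial_phrasebank`: the `ANTONYMS` and `STOPWORDS` tables are small hypothetical stand-ins for the WordNet antonym lookup and a full stopword list, and `clean_sentence` is a name chosen for this example.

```python
import string

# Hypothetical stand-ins: the notebook uses WordNet antonyms and a full stopword list.
ANTONYMS = {"good": "bad", "bad": "good", "profitable": "unprofitable"}
STOPWORDS = {"the", "a", "an", "is", "was", "are", "of", "in", "to", "this"}


def clean_sentence(sentence):
    """Return the cleaned tokens of a sentence."""
    # 1) lowercase
    text = sentence.lower()

    # 2) handle negations: expand "n't" to " not", then replace the word
    #    following "not" with its antonym when one is known
    words = text.replace("n't", " not").split()
    negated = []
    i = 0
    while i < len(words):
        if words[i] == "not" and i + 1 < len(words):
            target = words[i + 1].strip(string.punctuation)
            if target in ANTONYMS:
                negated.append(ANTONYMS[target])
                i += 2  # skip both "not" and the negated word
                continue
        negated.append(words[i])
        i += 1

    # 3) remove punctuation
    table = str.maketrans("", "", string.punctuation)
    tokens = [word.translate(table) for word in negated]

    # 4) remove stopwords (and tokens emptied by punctuation removal)
    return [token for token in tokens if token and token not in STOPWORDS]


print(clean_sentence("The quarter wasn't good."))
```

In this sketch, "not" is kept whenever no antonym is known, so the negation is not silently lost when stopwords are removed.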

After cleaning, we tokenize each sentence and turn the tokens into features. We'll use two approaches:
1) single-embedding tokenization: look up a sentiment score for each word and **average** the scores of all words in the sentence
2) pretrained word2vec embeddings: this approach uses the Google News 300 embeddings, which were trained on about 100 billion words and provide a 300-dimensional vector for each word. Here we try to retain as much context as possible from each word in the sentence.
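For the second approach, the per-word vectors have to be combined into one fixed-size input per sentence. The notebook does not show how this is done; the sketch below assumes simple mean pooling, and uses a hypothetical 3-dimensional `TOY_VECTORS` table in place of the 300-dimensional pretrained vectors.

```python
# Hypothetical toy vectors; the real embeddings have 300 dimensions per word.
TOY_VECTORS = {
    "profit": [1.0, 0.0, 2.0],
    "rose": [3.0, 2.0, 0.0],
}


def sentence_vector(tokens, vectors, dim):
    """Mean-pool the vectors of the known tokens (an assumed pooling choice)."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # no token is in the vocabulary: fall back to a zero vector
        return [0.0] * dim
    # average each dimension across the known word vectors
    return [sum(column) / len(known) for column in zip(*known)]


print(sentence_vector(["profit", "rose", "unknownword"], TOY_VECTORS, dim=3))
```

Out-of-vocabulary tokens are skipped here rather than mapped to zeros, so they do not pull the average toward the origin.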

### Single-embedding tokenization

We'll start with a simple model which gives each word a sentiment score. The sentence sentiment will be determined by the **average** of the sentiment scores of the words in the sentence.

For this model, we use the `score` method of the afinn library's `Afinn` class to get the sentiment score of each word.

In [17]:
afin = Afinn()

In [4]:
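def sum_vader_scores(sentence):
    # Despite the "vader" in the name, this looks up AFINN word scores.
    # Note: it returns the sum of the word scores, not their average.
    return np.sum([afin.score(word) for word in sentence])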

def generate_X_vader(df):
    X = np.array([
        sum_vader_scores(sentence)
        for sentence in df['tokenized_sentences']
    ])
    return X.reshape(-1,1)
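The text describes the sentence score as an **average**, while the helper above sums the word scores. A minimal sketch of the averaging variant is shown below; `mean_sentiment_score`, `toy_score` and `TOY_LEXICON` are hypothetical names for this example, and the toy lexicon stands in for the AFINN word list.

```python
# Hypothetical toy lexicon standing in for the AFINN word scores.
TOY_LEXICON = {"gain": 2.0, "loss": -3.0, "strong": 2.0}


def toy_score(word):
    """Return the word's score, or 0.0 for words outside the lexicon."""
    return TOY_LEXICON.get(word, 0.0)


def mean_sentiment_score(tokens, score_fn):
    """Average the per-word scores; an empty sentence scores 0.0."""
    if not tokens:
        return 0.0
    return sum(score_fn(token) for token in tokens) / len(tokens)


print(mean_sentiment_score(["strong", "gain", "loss", "shares"], toy_score))
```

With the real library, `Afinn().score` could be passed as `score_fn`. Unlike a sum, the average does not grow with sentence length, so long and short sentences stay on a comparable scale.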


In [5]:
# import and clean data
fp = prep.import_all_splits()
fp = prep.tokenize_financial_phrasebank(fp)
# get X and y with afinn embeddings
X = prep.generate_X_vader(fp)
y = prep.generate_y_vader()

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
# ravel() flattens the (n, 1) label array to the 1-D shape scikit-learn expects
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train.ravel())
print("afinn (train):", clf.score(X_train, y_train))
print("afinn (test):", clf.score(X_test, y_test))

afinn (train): 0.612707182320442
afinn (test): 0.6203090507726269


In [11]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics

# prepare training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

# ravel() flattens the (n, 1) label array to the 1-D shape scikit-learn expects
gbc = GradientBoostingClassifier(
    random_state=0, max_depth=4, n_estimators=30, learning_rate=0.3
).fit(X_train, y_train.ravel())
print("afinn (train):", gbc.score(X_train, y_train))
print("afinn (test):", gbc.score(X_test, y_test))

afinn (train): 0.612707182320442
afinn (test): 0.6203090507726269


In [12]:
y_train[0]

array([2], dtype=int64)

In [13]:
import torch
from torch import nn
from torch.optim import Adam

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

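class SimpleRNN(nn.Module):
    """Single-layer RNN followed by a linear layer that outputs one value per sentence."""
    def __init__(self, hidden_size):
        super().__init__()
        # input_size=1: each time step carries one scalar (the sentence's sentiment score)
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x has shape (batch_size, sequence_length, input_size)
        out, _ = self.rnn(x)
        # keep only the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out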

# Convert numpy arrays to PyTorch tensors
X_train_torch = torch.from_numpy(X_train).float()
y_train_torch = torch.from_numpy(y_train).float()

# Reshape X_train to be (batch_size, sequence_length, input_size)
X_train_torch = X_train_torch.view(-1, 1, 1)

# Initialize the model, loss function, and optimizer
model = SimpleRNN(hidden_size=10)
# MSELoss treats the integer class labels as regression targets
criterion = nn.MSELoss()
optimizer = Adam(model.parameters(), lr=0.01)

# Train the model with full-batch gradient descent
model.train()
for epoch in range(100):
    optimizer.zero_grad()
    outputs = model(X_train_torch)
    loss = criterion(outputs, y_train_torch)
    loss.backward()
    optimizer.step()
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

Epoch: 1, Loss: 1.1279287338256836
Epoch: 2, Loss: 1.03074049949646
Epoch: 3, Loss: 0.9399200081825256
Epoch: 4, Loss: 0.8547313213348389
Epoch: 5, Loss: 0.7754307985305786
Epoch: 6, Loss: 0.7023527026176453
Epoch: 7, Loss: 0.6358165740966797
Epoch: 8, Loss: 0.5761474370956421
Epoch: 9, Loss: 0.5236926078796387
Epoch: 10, Loss: 0.4788150489330292
Epoch: 11, Loss: 0.4418506324291229
Epoch: 12, Loss: 0.4130420684814453
Epoch: 13, Loss: 0.39245468378067017
Epoch: 14, Loss: 0.3798768222332001
Epoch: 15, Loss: 0.3747141659259796
Epoch: 16, Loss: 0.37590497732162476
Epoch: 17, Loss: 0.38189730048179626
Epoch: 18, Loss: 0.39074909687042236
Epoch: 19, Loss: 0.40037909150123596
Epoch: 20, Loss: 0.4089144170284271
Epoch: 21, Loss: 0.41499829292297363
Epoch: 22, Loss: 0.41793790459632874
Epoch: 23, Loss: 0.4176730811595917
Epoch: 24, Loss: 0.414628803730011
Epoch: 25, Loss: 0.40952637791633606
Epoch: 26, Loss: 0.4032071530818939
Epoch: 27, Loss: 0.39649197459220886
Epoch: 28, Loss: 0.390083432197

In [14]:
# Convert numpy arrays to PyTorch tensors
X_test_torch = torch.from_numpy(X_test).float()
y_test_torch = torch.from_numpy(y_test).float()

# Reshape X_test to be (batch_size, sequence_length, input_size)
X_test_torch = X_test_torch.view(-1, 1, 1)

# Switch model to evaluation mode
model.eval()

# Make predictions
with torch.no_grad():
    predictions = model(X_test_torch)

# Calculate the loss
loss = criterion(predictions, y_test_torch)
print(f'Test Loss: {loss.item()}')

# Round the model's continuous output to the nearest integer class label
predicted_labels = torch.round(predictions)

# Calculate the number of correct predictions
correct_predictions = (predicted_labels == y_test_torch).float().sum()

# Calculate the accuracy
accuracy = correct_predictions / y_test_torch.shape[0]

print(f'Accuracy: {accuracy.item()}')

Test Loss: 0.3615744113922119
Accuracy: 0.620309054851532


In [15]:
prep.RNN_experiment_torch()

[1,   100] loss: 1.044
[1,   200] loss: 1.990
[1,   300] loss: 2.920
[2,   100] loss: 0.919
[2,   200] loss: 1.830
[2,   300] loss: 2.742
[3,   100] loss: 0.919
[3,   200] loss: 1.810
[3,   300] loss: 2.704
[4,   100] loss: 0.886
[4,   200] loss: 1.784
[4,   300] loss: 2.660
[5,   100] loss: 0.885
[5,   200] loss: 1.772
[5,   300] loss: 2.672
[6,   100] loss: 0.895
[6,   200] loss: 1.788
[6,   300] loss: 2.666
[7,   100] loss: 0.887
[7,   200] loss: 1.754
[7,   300] loss: 2.632
[8,   100] loss: 0.880
[8,   200] loss: 1.753
[8,   300] loss: 2.628
[9,   100] loss: 0.872
[9,   200] loss: 1.737
[9,   300] loss: 2.626
[10,   100] loss: 0.868
[10,   200] loss: 1.727
[10,   300] loss: 2.588
Finished Training
Accuracy:  0.6419627749576988


### Synthetic data generation

We generated synthetic data using the following process:
1) obtain a vocabulary of words from the lexicon of nltk's VADER sentiment analyzer
2) clean that vocabulary so that it only includes alphabetic words, filtering out punctuation, numbers, etc.
3) generate sentences of random length using random words from the vocabulary
4) create a label for each sentence by taking the **average** of the sentiment scores of the words in the sentence.
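The generation process above can be sketched as follows. This is an illustration under stated assumptions: `TOY_LEXICON` is a made-up lexicon standing in for the VADER lexicon, and `generate_synthetic` with its length bounds and seed are hypothetical choices, not values taken from the notebook.

```python
import random

# Made-up scores for illustration; the notebook reads them from the VADER lexicon.
TOY_LEXICON = {
    "good": 1.5, "bad": -2.0, "great": 3.0,
    "awful": -2.5, "okay": 1.0, "poor": -1.5, ":)": 2.0,
}


def generate_synthetic(n_sentences, min_len, max_len, lexicon, seed=0):
    """Return (words, average_score) pairs built from random vocabulary words.

    min_len must be at least 1 so the average is always defined.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    # steps 1-2: keep only alphabetic tokens from the lexicon
    vocab = sorted(word for word in lexicon if word.isalpha())
    data = []
    for _ in range(n_sentences):
        # step 3: random-length sentence of random vocabulary words
        length = rng.randint(min_len, max_len)
        words = [rng.choice(vocab) for _ in range(length)]
        # step 4: the label is the average word score
        score = sum(lexicon[word] for word in words) / length
        data.append((words, score))
    return data


for words, score in generate_synthetic(3, min_len=2, max_len=5, lexicon=TOY_LEXICON):
    print(words, round(score, 3))
```

Because sentence lengths vary, averaging rather than summing keeps the labels of short and long sentences on the same scale.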

The following histogram shows the distribution of sentiment scores across the words in our vocabulary.

In [None]:
# Initialize VADER
sia = vader.SentimentIntensityAnalyzer()

# make a vocabulary from the lexicon which excludes non alpha tokens
vocab = sorted([token for token in sia.lexicon if token.isalpha()])

values = np.array([sia.lexicon[word] for word in vocab])

# show a histogram of vocab sentiment scores
plt.hist(values, bins='auto')  # 'auto' automatically determines the number of bins
plt.title('Histogram of VADER Lexicon Values')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Given the above distribution, we've decided to create a 'neutral' sentiment cutoff between [[ INPUT THE CUTOFFS WE CHOSE ]]. Aggregate scores below this range will be classified as 'negative' while scores above this range will be classified as 'positive'.
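The mapping from an aggregate score to a class can be sketched as below. The cutoffs are left as parameters because the chosen values are not stated here; the `-0.5` and `0.5` in the usage lines are placeholders for illustration only, and `label_from_score` is a hypothetical name.

```python
def label_from_score(score, lower, upper):
    """Classify an aggregate score against a neutral band [lower, upper]."""
    if score < lower:
        return "negative"
    if score > upper:
        return "positive"
    return "neutral"


# Placeholder cutoffs for illustration only, not the notebook's values.
for score in (-1.2, 0.1, 0.9):
    print(score, label_from_score(score, lower=-0.5, upper=0.5))
```

Scores exactly on a cutoff fall into the neutral band in this sketch.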