<a href="https://colab.research.google.com/github/pschrader98/ACC540_Tutorial5/blob/main/ACC540Tutorial5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook provides a simple application of sentiment analysis with Bidirectional Encoder Representations from Transformers (BERT).
Transformers have taken over many natural language processing tasks in practice and academia. Libraries such as Hugging Face Transformers and TensorFlow provide a lot of functionality for the analysis of unstructured data. Many models are pre-trained on large datasets and readily available. You can browse https://huggingface.co/models if you are looking for a model for a specific task. It is often sufficient to use these models "out of the box" or to fine-tune them.

We start by importing the packages.

In [1]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
import torch

For this analysis we start with the FinBERT model, which is trained on earnings calls, analyst reports, and corporate reports. Please check out the documentation (https://huggingface.co/yiyanghkust/finbert-tone) and the accompanying paper (https://onlinelibrary.wiley.com/doi/full/10.1111/1911-3846.12832).

In [2]:
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

Let's load the model and the tokenizer.

In [40]:
finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone', model_max_length=512)

Let's load the dataset. It contains Reddit posts in which retail investors discuss stocks. We are interested in the sentiment of the discussion around certain stocks.

In [6]:
data = pd.read_csv('/content/ACC540_Tutorial5/Comments_ticker_sample.csv', index_col=False)

In [7]:
data

Unnamed: 0.1,Unnamed: 0,Date_UTC,Author,Body,Tickers
0,48861,1649873485,SwedishFish123,"I fold man, I’m buying GOLD shares and calling...",['GOLD']
1,759092,1655145020,MONEYMIKE_Bx,SPY GONNA SEE 360 EOW people??,['SEE']
2,2469125,1670441743,Not99Percent,"Biden used up that SPR excuse already, can't ...",['SPR']
3,3414216,1681222843,Rare-ish_Bird,"G'day, mate. sleep well?",['G']
4,2854971,1675167857,FunCranberry112122,SNAP usually does bounce back a lot after earn...,['SNAP']
...,...,...,...,...,...
995,3734314,1686065752,downinthebasement,Interest rates at 0: Market melts up. \nBulls...,['V']
996,3506174,1682518366,NobleMotary,It’s like 90% institutional owned.\n\nDon’t sh...,['CMG']
997,1561815,1662146902,juliettewhiskey,WOW WAY TO KIL THE MARKET,['WOW']
998,1298123,1660053147,estherxxx07,Where are the AMTD bagholders?,['AMTD']
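Two columns in this preview are easier to work with after a small conversion. The sketch below assumes that Date_UTC holds Unix timestamps in seconds and that Tickers is stored as the string form of a Python list, which is how a list column typically comes back from a CSV file. The sample values are copied from the first row above.

```python
import ast
from datetime import datetime, timezone

# Values copied from the first row of the preview above.
raw_date = 1649873485       # assumed: seconds since 1970-01-01 UTC
raw_tickers = "['GOLD']"    # assumed: list serialized as a string in the CSV

# Convert the epoch seconds to a timezone-aware datetime.
posted_at = datetime.fromtimestamp(raw_date, tz=timezone.utc)

# Safely parse the string back into a real Python list.
tickers = ast.literal_eval(raw_tickers)

print(posted_at.isoformat())
print(tickers)
```

On the full dataframe, the same idea could be written as pd.to_datetime(data['Date_UTC'], unit='s', utc=True) and data['Tickers'].apply(ast.literal_eval), provided the assumptions above hold for every row.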


Running inference with, or training, large language models often requires a GPU for speed. Google Colab offers an environment with a GPU, but there are often capacity constraints, and you may not be able to connect with a basic account. For the small dataset we use here, a CPU works as well, but it takes longer. The code below uses a GPU if one is available.

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
finbert.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30873, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.

In [41]:
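# Use GPU 0 when available, otherwise fall back to the CPU (-1)
nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1, truncation=True)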

Let's predict the sentiment of the Reddit posts.

In [43]:
predictions = nlp(data['Body'].tolist())

The pipeline returns one dictionary per post containing a label and a score. Here we add the list as a new column to the dataframe and keep only the label.

In [45]:
data['Predictions_FinBert'] = predictions
data['Predictions_FinBert'] = data['Predictions_FinBert'].apply(lambda x: x['label'])
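As a minimal sketch of the label extraction above, and of a quick look at how often each class was predicted, the example below works on a few made-up predictions. The list example_predictions is hypothetical and is not output from the model.

```python
from collections import Counter

# Hypothetical pipeline output: one dict per input text.
example_predictions = [
    {"label": "Neutral", "score": 0.98},
    {"label": "Negative", "score": 0.87},
    {"label": "Positive", "score": 0.91},
    {"label": "Neutral", "score": 0.76},
]

# Keep only the label, as the notebook does with .apply(lambda x: x['label']).
labels = [p["label"] for p in example_predictions]

# Count how often each class was predicted.
label_counts = Counter(labels)
print(label_counts.most_common())
```

On the dataframe itself, data['Predictions_FinBert'].value_counts() gives the same kind of overview.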

Now we can look at some examples of negative posts.

In [46]:
data[data['Predictions_FinBert']=='Negative']

Unnamed: 0.1,Unnamed: 0,Date_UTC,Author,Body,Tickers,Predictions,Predictions_TwitterBert,Predictions_FinBert
19,2569867,1671551386,talaabo,People act like Meta is the only major tech st...,['AMZN'],Negative,negative,Negative
24,522111,1653056383,warm_breakfast_beer,AMZN trying to hold it together. This shit is ...,['AMZN'],Negative,negative,Negative
31,248109,1651261903,Not99Percent,AAPL lost more than 100bn just in this session...,['AAPL'],Negative,negative,Negative
35,2668860,1672941579,Oxianas,Coal stocks are not following natgas this time...,"['BTU', 'CEIX']",Negative,negative,Negative
50,3446784,1681737368,UsernameTaken_123,Wonder how much TWTR would be worth if it star...,['TWTR'],Negative,neutral,Negative
...,...,...,...,...,...,...,...,...
929,3851543,1687550215,MonkeyDickLuffy,My ADBE and AAPL calls single handedly kept th...,"['ADBE', 'AAPL']",Negative,negative,Negative
953,1977808,1665141146,VisualMod,>Imagine being Mr. Bull's BDSM sex slave or an...,['AMD'],Negative,negative,Negative
958,2815729,1674757610,VisualMod,>USER REPORTS INDICATE SPOTIFY IS HAVING PROBL...,"['USER', 'IS']",Negative,negative,Negative
963,65142,1649961019,OMASJack,BBBY is holding to a total of -3% in last 2 da...,"['BBBY', 'AMD']",Negative,negative,Negative


... and positive posts.

In [47]:
data[data['Predictions_FinBert']=='Positive']

Unnamed: 0.1,Unnamed: 0,Date_UTC,Author,Body,Tickers,Predictions,Predictions_TwitterBert,Predictions_FinBert
20,3556007,1683128397,RedsInABox,No one can afford new vehicles so people are o...,"['O', 'AZO']",Positive,negative,Positive
22,1994892,1665675311,Key_Abroad7633,So hotter than expected inflation is good now?...,['BP'],Positive,negative,Positive
39,665673,1654529036,iop9,$DUOL \nGood short opportunity in Duoling...,['DUOL'],Positive,negative,Positive
59,3367232,1680545100,VisualMod,I completely agree! DOGE is a much better inve...,['TSLA'],Positive,positive,Positive
81,1419032,1660822106,BBFA369,"Look, you seem convinced. Just buy some puts a...",['GME'],Positive,neutral,Positive
...,...,...,...,...,...,...,...,...
961,2452237,1670269390,WallStreetBoners,"I legit use CRM, SNOW everyday at my job. Work...","['CRM', 'SNOW']",Positive,positive,Positive
971,1427280,1661167767,fuscosco,"the stock split is coming, earnings are still ...",['TWTR'],Positive,negative,Positive
974,3397299,1680880772,Harvooost,FFS WHATS SO GOOD ABOUT IT,"['SO', 'GOOD', 'IT']",Positive,positive,Positive
976,2106146,1666637956,VisualMod,">NASDAQ 100 EXTENDS AFTERNOON GAINS, UP 1%\n\n...",['UP'],Positive,neutral,Positive


# 2. TwitterBert

Let's check out a different model for sentiment analysis. For sentiment analysis in particular, context and domain matter. We can try a model that was trained on social media (Twitter) data: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest

We start by loading the model.

In [34]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

In [35]:
tokenizer = AutoTokenizer.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment-latest', model_max_length = 512)
model = AutoModelForSequenceClassification.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment-latest')

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We want to use a GPU, if possible.

In [36]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

The device argument below selects GPU 0 if one is available and falls back to the CPU (-1) otherwise.

In [37]:
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1, truncation=True)

Here we use a slightly different approach for inference that makes it easier to track progress. Splitting the data into batches lets tqdm show a progress bar, and you can sometimes achieve a significant speed-up by using batches and multiple processors.

In [38]:
# Split data into manageable batches
batch_size = 10  # Choose a suitable batch size based on your system capability
bodies = data['Body'].tolist()
batches = [bodies[i:i + batch_size] for i in range(0, len(bodies), batch_size)]

predictions = []

# Process each batch and update the progress bar
for batch in tqdm(batches, desc="Analyzing Sentiments"):
    batch_predictions = nlp(batch)
    predictions.extend(batch_predictions)

Analyzing Sentiments: 100%|██████████| 100/100 [04:30<00:00,  2.71s/it]
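The list comprehension in the cell above cuts the posts into consecutive slices of batch_size items, with a shorter final slice when the number of posts is not a multiple of the batch size. A minimal sketch of the same slicing as a reusable generator follows; chunked is a hypothetical helper name, not part of any library used here, and the toy list stands in for the real posts.

```python
from math import ceil

def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy input standing in for data['Body'].tolist().
toy_bodies = [f"post {i}" for i in range(25)]
toy_batches = list(chunked(toy_bodies, 10))

# The number of batches equals ceil(n / batch_size).
batch_lengths = [len(b) for b in toy_batches]
print(len(toy_batches), batch_lengths)
```

A generator avoids building all slices up front, which matters little for this sample but can help with larger datasets. The pipeline call also accepts a batch_size argument, which may be worth trying on a GPU.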


Create a new column in the dataframe.

In [39]:
data['Predictions_TwitterBert'] = predictions
data['Predictions_TwitterBert'] = data['Predictions_TwitterBert'].apply(lambda x: x['label'])

Which model would you say classifies the sentiment more accurately? The cell below lists the posts on which the two models disagree.

In [55]:
data[data['Predictions_TwitterBert']!=data['Predictions_FinBert'].str.lower()]

Unnamed: 0.1,Unnamed: 0,Date_UTC,Author,Body,Tickers,Predictions,Predictions_TwitterBert,Predictions_FinBert
0,48861,1649873485,SwedishFish123,"I fold man, I’m buying GOLD shares and calling...",['GOLD'],Neutral,negative,Neutral
2,2469125,1670441743,Not99Percent,"Biden used up that SPR excuse already, can't ...",['SPR'],Neutral,negative,Neutral
3,3414216,1681222843,Rare-ish_Bird,"G'day, mate. sleep well?",['G'],Neutral,positive,Neutral
4,2854971,1675167857,FunCranberry112122,SNAP usually does bounce back a lot after earn...,['SNAP'],Neutral,positive,Neutral
5,89093,1650040432,WeBuiltaTowerofStone,There are tons of republicans on Twitter. Awww...,['DWAC'],Neutral,negative,Neutral
...,...,...,...,...,...,...,...,...
992,35315,1649788022,Rippper600,How come NVDA go up but then go down but then ...,['NVDA'],Neutral,negative,Neutral
995,3734314,1686065752,downinthebasement,Interest rates at 0: Market melts up. \nBulls...,['V'],Neutral,negative,Neutral
996,3506174,1682518366,NobleMotary,It’s like 90% institutional owned.\n\nDon’t sh...,['CMG'],Neutral,negative,Neutral
997,1561815,1662146902,juliettewhiskey,WOW WAY TO KIL THE MARKET,['WOW'],Neutral,positive,Neutral
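The comparison above lowercases the FinBERT labels because the two models spell their classes differently (Neutral versus neutral). To put a number on the disagreement, one option is to compute the share of posts on which the labels differ and to count each combination of labels. The sketch below uses made-up labels for six posts; none of the values are taken from the dataset.

```python
from collections import Counter

# Hypothetical labels for six posts; not taken from the dataset.
finbert_labels = ["Neutral", "Negative", "Positive", "Neutral", "Neutral", "Positive"]
twitter_labels = ["negative", "negative", "positive", "positive", "neutral", "negative"]

# Normalize the casing so that "Neutral" and "neutral" compare equal.
pairs = [(f.lower(), t) for f, t in zip(finbert_labels, twitter_labels)]

# Share of posts on which the two models disagree.
disagreement_rate = sum(f != t for f, t in pairs) / len(pairs)

# How often each (FinBERT, Twitter model) combination occurs.
pair_counts = Counter(pairs)

print(f"disagreement rate: {disagreement_rate:.2f}")
for (f, t), n in sorted(pair_counts.items()):
    print(f"{f:>8} vs {t:<8} {n}")
```

With the dataframe, pd.crosstab(data['Predictions_FinBert'].str.lower(), data['Predictions_TwitterBert']) produces the same kind of table in one call.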


You can see that almost half of our predictions differ. It is important to test models and validate their predictions.
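One simple way to validate predictions is to label a small random sample of posts by hand and measure how often each model matches those labels. The following is a minimal sketch under the assumption that such a hand-labeled sample exists; hand_labels and both prediction lists are made up for illustration, so the resulting numbers say nothing about the actual models.

```python
def accuracy(predicted, reference):
    """Return the fraction of predictions that match the reference labels."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("need at least one labeled example")
    # Compare case-insensitively because the two models use different casing.
    matches = sum(p.lower() == r.lower() for p, r in zip(predicted, reference))
    return matches / len(reference)

# Hypothetical hand labels and model outputs for five posts.
hand_labels = ["negative", "neutral", "positive", "negative", "neutral"]
finbert_sample = ["Neutral", "Neutral", "Positive", "Negative", "Neutral"]
twitter_sample = ["negative", "positive", "positive", "negative", "negative"]

finbert_accuracy = accuracy(finbert_sample, hand_labels)
twitter_accuracy = accuracy(twitter_sample, hand_labels)
print(f"FinBERT: {finbert_accuracy:.2f}, Twitter model: {twitter_accuracy:.2f}")
```

Accuracy is only one possible measure. With three unevenly distributed classes, per-class precision and recall may give a more informative picture.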