# Sentiment Analysis on crypto-related comments using transformers

**Luca Santarella (A.Y. 2021/2022)**


The datasets and the fine-tuned models can be downloaded from here: https://drive.google.com/drive/folders/1QDCGFKcqSQpaK9UGHUsJpi4RWYYv8LCK?usp=sharing


In [1]:
!pip install datasets
!pip install transformers



## Importing libraries

In [94]:
import pandas as pd
import torch
import torch.nn as nn
import numpy as np
import tensorflow as tf
import requests
from tqdm.notebook import tqdm
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import datetime
import json
import pandas_datareader

In [3]:
PATH = 'C:\\Users\\lucas\\Desktop\\Unipi\\HLT\\sentiment-analysis-crypto' 

## Preprocessing labeled dataset used for fine tuning
The dataset contains comments taken from various crypto subreddits, such as "r/cryptocurrency" (https://www.reddit.com/r/CryptoCurrency/), for the period of August 2021. The dataset was taken from SocialGrep, which was responsible for the aggregation and labeling of the comments (https://socialgrep.com/datasets/reddit-cryptocurrency-data-for-august-2021).

In [4]:
# df = pd.read_csv(os.path.join(PATH,'crypto-aug-2021-comments.csv'))

### Removing irrelevant columns

We keep only the columns useful for the task: `body` and `sentiment`.

In [5]:
# df.drop(labels=['type','id','subreddit.id','subreddit.name', 'subreddit.nsfw','created_utc', 'permalink', 'score'], axis=1, inplace=True)

In [6]:
# df.info()

### Removing irrelevant comments
Comments whose body is "[deleted]" or "[removed]" no longer have their content available, so every such comment is removed from the dataset.
Automatic comments posted by bots on the subreddits are also removed, so that only comments written by humans are kept.

In [7]:
# df['body']= df['body'].replace(r'\n',' ', regex=True) 

In [8]:
# df.body.value_counts()[:10]

In [52]:
def remove_comments(dataframe):
    """Drop deleted/removed comments and known bot comments, in place."""
    # Comments whose body is no longer available
    unavailable = dataframe.body.isin(["[removed]", "[deleted]"])
    print(f"Number of deleted rows: {unavailable.sum()}")
    dataframe.drop(dataframe[unavailable].index, inplace=True)

    # auto_comments.txt lists one automatic (bot) comment per line
    with open(os.path.join(PATH, "auto_comments.txt"), encoding="utf8") as fp:
        auto_comments = fp.read().splitlines()

    is_auto = dataframe.body.isin(auto_comments)
    print(f"Number of deleted rows with automatic comments: {is_auto.sum()}")
    dataframe.drop(dataframe[is_auto].index, inplace=True)

# The call requires the raw `body` column, so it is disabled here
# like the other preprocessing cells.
# remove_comments(df)

In [10]:
# df.body.value_counts()[:10]

We also keep only unique comments which are 512 characters or shorter.

In [11]:
# df = df.dropna()
# df = df[df['body'].apply(lambda x: len(x) <= 512) ]

### Renaming the columns
We rename the columns to keep them consistent with the `datasets.Dataset` implementation.

In [12]:
# df.rename(columns={"sentiment":"labels","body":"text"}, inplace=True)
# df.reset_index(drop=True, inplace=True)

In [13]:
# n_rows_before = df.shape[0]
# df.drop_duplicates(subset="text", inplace=True)
# print(f"Unique {df.shape[0]} rows out of {n_rows_before} rows")

### Save dataframe

In [14]:
# df.to_csv(os.path.join(PATH, "preprocessed_df.csv"))

In [15]:
df = pd.read_csv(os.path.join(PATH, "preprocessed_df.csv"), index_col=0)

### Sampling the dataframe

Finally, we take a sample of the original dataset, which we will then divide into training, validation and test sets.

In [16]:
n_sample = 300000

df_sample = df.sample(n=n_sample, random_state=42)
print(f"Sample of {df_sample.shape[0]} rows out of {df.shape[0]} rows")

Sample of 300000 rows out of 2533454 rows


In [17]:
df = df_sample.copy()

The float sentiment scores are mapped to three class ids, where 0 is a **negative** comment, 1 is a **positive** comment and 2 is a **neutral** one.

In [18]:
df.loc[df_sample['labels'] > 0, 'labels'] = 1 #POSITIVE
df.loc[df_sample['labels'] == 0, 'labels'] = 2 #NEUTRAL
df.loc[df_sample['labels'] < 0, 'labels'] = 0 #NEGATIVE
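The mapping rule can be illustrated on a few values with a minimal plain-Python sketch. The helper `score_to_label` and the example scores are hypothetical, and the sketch assumes, as the cell above does, that the raw sentiment is a signed score with exactly zero meaning neutral.

```python
def score_to_label(score: float) -> int:
    """Map a signed sentiment score to a class id: 0 negative, 1 positive, 2 neutral."""
    if score > 0:
        return 1  # POSITIVE
    if score < 0:
        return 0  # NEGATIVE
    return 2      # NEUTRAL (score is exactly zero)

# Hypothetical scores, not taken from the dataset.
example_scores = [-0.7, 0.0, 0.42, -0.01]
example_labels = [score_to_label(s) for s in example_scores]
print(example_labels)  # [0, 2, 1, 0]
```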

We keep the dataset balanced, so that it contains the same number of negative, positive and neutral comments.

In [19]:
df = (df.groupby('labels', as_index=False)
        .apply(lambda x: x.sample(n=30000, random_state=69))
        .reset_index(drop=True))
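What the per-class sampling does can be sketched without pandas. The following is only a plain-Python analogue of the `groupby(...).apply(sample)` pattern, with a hypothetical helper `balanced_sample` and made-up rows: every class contributes the same number of rows, drawn without replacement.

```python
import random
from collections import defaultdict

def balanced_sample(rows, n_per_class, seed=69):
    """Draw the same number of (text, label) rows from every class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    sample = []
    for label in sorted(by_label):
        # random.sample raises ValueError if a class has fewer than n_per_class rows
        sample.extend(rng.sample(by_label[label], n_per_class))
    return sample

# Hypothetical data in which class 2 is over-represented.
toy_rows = [(f"comment {i}", i % 3) for i in range(30)]
toy_rows += [(f"extra {i}", 2) for i in range(20)]
balanced = balanced_sample(toy_rows, n_per_class=5)
```

After sampling, each of the three classes appears exactly five times, regardless of how skewed the input was.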

In [20]:
df.reset_index(drop=True, inplace=True)

In [21]:
df['labels'].value_counts()

0.0    30000
1.0    30000
2.0    30000
Name: labels, dtype: int64

In [22]:
df_dev, df_test = train_test_split(df, test_size=0.3, random_state=69)
df_train, df_val = train_test_split(df_dev, test_size=0.2, random_state=69)
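As a quick check of the resulting proportions, the two `train_test_split` calls above (30% held out for testing, then 20% of the remainder for validation) can be worked through with integer arithmetic. The helper `split_sizes` is hypothetical and only does the arithmetic, assuming the 90,000 balanced rows shown above.

```python
def split_sizes(n_rows, test_pct, val_pct):
    """Sizes of train/val/test when test is cut first and val is cut from the rest."""
    n_test = n_rows * test_pct // 100
    n_dev = n_rows - n_test
    n_val = n_dev * val_pct // 100
    n_train = n_dev - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(90_000, test_pct=30, val_pct=20)
print(n_train, n_val, n_test)  # 50400 12600 27000
```

Overall this corresponds to 56% training, 14% validation and 30% test.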

In [23]:
df_train.to_csv(os.path.join(PATH, "df_train.csv"), index=False)
df_val.to_csv(os.path.join(PATH, "df_val.csv"), index=False)
df_test.to_csv(os.path.join(PATH, "df_test.csv"), index=False)

### Prepare dataset

The dataframe is converted to the `Dataset` format accepted by the Hugging Face model.

In [24]:
from datasets import Dataset, load_dataset, Value, ClassLabel, Features

schema = Features({'text': Value(dtype='string', id=None),
 'labels': ClassLabel(num_classes=3, id=None),
}) #defining the schema we make sure that the Dataset is well built 

dataset_all = Dataset.from_pandas(df, schema)

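Two splits are drawn from the full dataset: a 70%/30% split whose 30% part is used as the test set, and an 80%/20% split that provides the training set (72,000 rows) and the validation set (18,000 rows). Both splits start from `dataset_all`, so the test set is not guaranteed to be disjoint from the training set.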

In [25]:
dataset = dataset_all.train_test_split(test_size=0.3,seed=69)

In [26]:
dataset_train = dataset_all.train_test_split(test_size=0.2,seed=69)

In [27]:
dataset_train["train"]

Dataset({
    features: ['text', 'labels'],
    num_rows: 72000
})
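One way to obtain training, validation and test sets that cannot overlap is to split in a nested fashion: hold out the test part first, then split only the remainder. The sketch below illustrates the idea on plain Python indices rather than on the `datasets` API. The helper `nested_split` is hypothetical and this is not the procedure used in the cells above.

```python
import random

def nested_split(indices, test_frac, val_frac, seed=69):
    """Split indices into disjoint train / validation / test lists."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    # Hold out the test part first...
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    # ...then carve the validation part out of what is left.
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train_idx, val_idx, test_idx = nested_split(range(1000), test_frac=0.3, val_frac=0.2)
```

With `datasets`, the same idea would amount to calling `train_test_split` on `dataset["train"]` instead of on `dataset_all` for the second split.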

### Tokenize the dataset

We tokenize the dataset using the appropriate tokenizer provided by the *transformers* module.

In [28]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tok_ds_train = dataset_train["train"].map(tokenize_function, batched=True)
tok_ds_val = dataset_train["test"].map(tokenize_function, batched=True)
tok_ds_test = dataset["test"].map(tokenize_function, batched=True)

tok_ds_train = tok_ds_train.remove_columns(["text"])
tok_ds_test = tok_ds_test.remove_columns(["text"])
tok_ds_val = tok_ds_val.remove_columns(["text"])

100%|██████████████████████████████████████████████████████████████████████████████████| 72/72 [00:21<00:00,  3.39ba/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 18/18 [00:05<00:00,  3.35ba/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 27/27 [00:08<00:00,  3.32ba/s]
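The effect of `padding="max_length"` and `truncation=True` can be pictured with a toy function: every sequence ends up with the same length, and an attention mask marks which positions hold real tokens. This is a simplified sketch with a hypothetical helper `pad_or_truncate` and made-up token ids, not the tokenizer's implementation. In particular, the real tokenizer also keeps the special tokens in place when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # padding: fill the remaining positions
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4], max_length=2)
print(short_ids, short_mask)  # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2] [1, 1]
```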


## Fine-tune DistilBERT model

In [29]:
from torch.utils.data import DataLoader

tok_ds_train.set_format("pt")
tok_ds_test.set_format("pt")
tok_ds_val.set_format("pt")

train_dataloader = DataLoader(tok_ds_train, shuffle=True, batch_size=10)
test_dataloader = DataLoader(tok_ds_test, batch_size=10)
val_dataloader = DataLoader(tok_ds_val, batch_size=10)


In [30]:
# class Classifier(nn.Module):
#     def __init__(self, num_units, activation_fun, hidden_layers):
#         super(Classifier, self).__init__()
        
#         if(activation_fun == "sigmoid"):
#             activation_fun = nn.Sigmoid()
#         elif activation_fun == "relu":
#             activation_fun = nn.ReLU()
#         elif activation_fun == "tanh":
#             activation_fun = nn.Tanh()
            
#         if hidden_layers == 2:
#             self.linear_stack = nn.Sequential(
#                 nn.Linear(768, num_units),
#                 activation_fun,
#                 nn.Linear(num_units, num_units),
#             )
#         elif hidden_layers == 1:
#             self.linear_stack = nn.Sequential(
#                 nn.Linear(768, num_units),
#             )     

#     def forward(self, x):
#         X = self.linear_stack(x)
#         return X
    
#     def predict(self, x):
#         X = self.linear_stack(x)
#         return X

In [31]:
# myclassifier = Classifier(500, 'relu', 2)

In [32]:
from transformers import DistilBertForSequenceClassification

distilbert_model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)
# distilbert_model.pre_classifier = myclassifier
# distilbert_model.classifier = nn.Linear(500, 3)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [33]:
for idx, m in enumerate(distilbert_model.modules()):
  print(idx, '->', m)

0 -> DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
  

In [34]:
from torch.optim import AdamW

optimizer = AdamW(distilbert_model.parameters(), lr=3e-5)

In [35]:
# device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# classifier = Classifier(num_units=768, activation_fun = "sigmoid", hidden_layers = 1)
# print(classifier)

In [36]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
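The `"linear"` schedule decays the learning rate linearly from its initial value to zero over the training steps, after an optional warmup. The sketch below re-derives that rule in plain Python. The function `linear_lr` is a hypothetical illustration, not the library code, and the step count assumes the 72,000 training rows and batch size of 10 shown above.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

steps_per_epoch = 72_000 // 10        # rows in the training set / batch size
total_steps = 3 * steps_per_epoch     # three epochs

lr_start = linear_lr(0, 3e-5, total_steps)
lr_middle = linear_lr(total_steps // 2, 3e-5, total_steps)
lr_end = linear_lr(total_steps, 3e-5, total_steps)
print(total_steps, lr_start, lr_middle, lr_end)
```

With no warmup, the rate starts at the optimizer's `lr`, is halved at the midpoint of training and reaches zero at the last step.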

In [37]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
distilbert_model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [38]:
print(device)

cuda


In [39]:
tok_ds_train

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 72000
})

In [40]:
progress_bar = tqdm(range(num_training_steps))

distilbert_model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = distilbert_model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

100%|██████████████████████████████████████████████████████████████████████████| 21600/21600 [1:15:56<00:00,  4.79it/s]

In [42]:
torch.save(distilbert_model.state_dict(), PATH+"/distilbert_model.pth")

In [44]:
distilbert_model.load_state_dict(torch.load(PATH+"/distilbert_model.pth"))
distilbert_model.eval()


In [45]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
distilbert_model.to(device)


### Testing DistilBERT model

In [48]:
from datasets import load_metric

preds = []

metric1 = load_metric('accuracy')
metric2 = load_metric("precision")
metric3 = load_metric("recall")

# Evaluate on the validation set; the progress bar advances once per batch
progress_bar = tqdm(val_dataloader)
distilbert_model.eval()
for batch in progress_bar:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = distilbert_model(**batch)

    # The predicted class is the index of the largest logit
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    preds.append(predictions.cpu().detach().tolist())
    metric1.add_batch(predictions=predictions, references=batch["labels"])
    metric2.add_batch(predictions=predictions, references=batch["labels"])
    metric3.add_batch(predictions=predictions, references=batch["labels"])

# Precision and recall are macro-averaged over the three classes
accuracy = metric1.compute()
precision = metric2.compute(average='macro')
recall = metric3.compute(average='macro')



In [49]:
print(accuracy)
print(precision)
print(recall)

{'accuracy': 0.9492777777777778}
{'precision': 0.9493852767688301}
{'recall': 0.9493533597808287}
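With `average='macro'`, precision and recall are computed separately for each class and then averaged with equal weight per class. The following sketch spells this out in plain Python on a small hypothetical example. The helper `macro_precision_recall` and the label lists are illustrative and are not the metric implementation used above.

```python
def macro_precision_recall(y_true, y_pred, num_classes=3):
    """Unweighted mean of the per-class precision and recall."""
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)  # tp + false positives
        actual = sum(1 for t in y_true if t == c)     # tp + false negatives
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / num_classes, sum(recalls) / num_classes

# Hypothetical labels: two examples per class, two mistakes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
print(round(macro_p, 4), round(macro_r, 4))  # 0.7222 0.6667
```

Because the dataset is balanced, macro averages and overall accuracy are expected to be close, as in the results above.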



100%|██████████████████████████████████████████████████████████████████████████████| 1800/1800 [02:22<00:00, 14.04it/s]

In [None]:
# distilbert_model.to('cpu')

In [None]:
# tokenizer.decode(db_tokenized_ds_test[2]['input_ids'], skip_special_tokens=True)

In [None]:
# for i in range(100):
#     text = tokenizer.decode(tokenized_ds_test[i]['input_ids'], skip_special_tokens=True)
#     true_label = tokenized_ds_test[i]['labels']
#     inputs = tokenizer(text, return_tensors="pt")

#     with torch.no_grad():
#       outputs = distilbert_model(**inputs)
#       prediction = torch.argmax(outputs.logits, dim=-1).item()
#     if(prediction != true_label):
#         print(f"TEXT: {text} \nTRUE LABEL:{true_label} - PREDICTION: {prediction}\n\n")

## Experiments


### Preprocessing dataset of November
A second unlabeled dataset with comments from the "r/cryptocurrency" subreddit will be used to conduct experiments on the model.

This dataset was built with the PushShift API for Reddit (https://github.com/pushshift/api), which collects data on Reddit posts and comments, and with `pmaw` (https://pypi.org/project/pmaw/0.0.2/), a wrapper around the PushShift API that issues multiple requests.

The timeframe selected for the comments in the dataset is November 2021. We believe this timeframe to be significant because of the multiple sudden crashes of the crypto market that happened throughout the month.

In [73]:
df_month = pd.read_csv(os.path.join(PATH, 'nov_cc.csv'))

In [74]:
remove_comments(df_month)

Number of deleted rows: 76449
Number of deleted rows with automatic comments: 2143


In [75]:
df_month['created_utc'] = pd.to_datetime(df_month['created_utc'],unit='s')

In [76]:
df_month = df_month.dropna()
df_month = df_month[df_month['body'].apply(lambda x: len(x) <= 512) ]
df_month.rename(columns={"body":"text"}, inplace=True)
df_month.reset_index(drop=True, inplace=True)

### Predict sentiment of each day
The comments are split by day of the month, so that we obtain the number of positive and negative comments for each day.
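
As an illustration of the splitting step, here is a minimal, self-contained sketch on toy data. It uses only the standard library instead of pandas; the `toy_comments` list and the `split_by_day` helper are hypothetical names introduced for this example and are not part of the notebook.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical toy comments: (Unix timestamp in seconds, text).
toy_comments = [
    (1635768000, "BTC to the moon"),          # 2021-11-01 12:00 UTC
    (1635771600, "I am selling everything"),  # 2021-11-01 13:00 UTC
    (1635854400, "Still holding"),            # 2021-11-02 12:00 UTC
]


def split_by_day(comments):
    """Group comment texts by the UTC day of the month they were posted on."""
    by_day = defaultdict(list)
    for created_utc, text in comments:
        day = datetime.fromtimestamp(created_utc, tz=timezone.utc).day
        by_day[day].append(text)
    return dict(by_day)


comments_per_day = split_by_day(toy_comments)
for day in sorted(comments_per_day):
    print(day, len(comments_per_day[day]))
```

Grouping by the day number alone is only safe when the data covers a single month, which is the case for the November dataset used here.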


In [98]:
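import calendar

df_days = []
tok_ds_days = []

# The dataset covers a single month, so its first row gives the month and year
month_selected = int(df_month['created_utc'].dt.month.iloc[0])
year_selected = int(df_month['created_utc'].dt.year.iloc[0])

num_days = calendar.monthrange(year_selected, month_selected)[1]

# range() excludes its upper bound, so num_days + 1 is needed to reach the last day
for day_number in range(1, num_days + 1):
  print(day_number)
  # .copy() lets us add a 'prediction' column later without a SettingWithCopyWarning
  df_tmp = df_month[df_month['created_utc'].dt.day == day_number].copy()
  df_days.append(df_tmp)

  ds_day = Dataset.from_pandas(df_tmp)
  tok_ds_days.append(ds_day.map(tokenize_function, batched=True))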


In [100]:
cols_to_drop = ["__index_level_0__", "text", "created_utc"]

for i in range(len(tok_ds_days)):
    # Drop only the columns that are still present, so the cell can be re-run safely
    present = [c for c in cols_to_drop if c in tok_ds_days[i].column_names]
    tok_ds_days[i] = tok_ds_days[i].remove_columns(present)

In [None]:
preds_counts = []

distilbert_model.eval()  # disable dropout during inference

for i in range(len(tok_ds_days)):
  preds = []
  tok_ds_days[i].set_format("torch")
  exp_dataloader = DataLoader(tok_ds_days[i], batch_size=12)
  for batch in tqdm(exp_dataloader):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
      outputs = distilbert_model(**batch)

    predictions = torch.argmax(outputs.logits, dim=-1)
    preds.extend(predictions.cpu().tolist())

  # minlength=2 keeps a count for both classes even if one is never predicted
  preds_counts.append(np.bincount(preds, minlength=2))
  df_days[i]['prediction'] = preds

We then compute the ratio of positive to negative comments for each day; we will use this ratio to analyze the sentiment on the days when sudden crashes happened.

In [60]:
# Ratio of positive (label 1) to negative (label 0) predictions for each day
scores = [counts[1] / counts[0] for counts in preds_counts]
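
The ratio is undefined for a day with no negative predictions. The following minimal sketch shows one way to guard against that case; it assumes label 1 means positive and label 0 means negative, and both the `sentiment_ratio` helper and the `toy_day` list are hypothetical names introduced for this example.

```python
from collections import Counter


def sentiment_ratio(predictions):
    """Return positives / negatives for a list of 0/1 predictions.

    Assumes label 1 is positive and label 0 is negative. Returns None
    when there are no negative predictions, since the ratio is then
    undefined.
    """
    counts = Counter(predictions)
    if counts[0] == 0:
        return None
    return counts[1] / counts[0]


toy_day = [1, 1, 0, 1, 0, 1]  # hypothetical predictions for one day
print(sentiment_ratio(toy_day))
```

Returning `None` instead of raising makes it easy to skip such days when plotting, although other conventions (for example, adding a small smoothing constant to both counts) would work as well.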

### Correlation between sentiment and price
The price of Bitcoin shows a high correlation with the sentiment of the crypto community; this is commonly believed to reflect a direct causal effect of price on sentiment.

In [64]:
import pandas as pd
import pandas_datareader.data as pdr
import datetime

start = datetime.datetime(2021,11,1)
end = datetime.datetime(2021,11,29)
df = pdr.DataReader('BTC-USD','yahoo',start,end)

We multiply the sentiment score by a constant factor so that it can be drawn on the same axis as the price, which gives a qualitative plot of the correlation between price and sentiment.

In [65]:
new_scores = [elem*29000 for elem in scores]
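
The factor 29000 is chosen by hand for this particular month. As an alternative, and only as a sketch of a different technique rather than what this notebook does, the scores can be min-max rescaled into a chosen price range. The helper `rescale_to_range` and the values in `toy_scores` are hypothetical.

```python
import numpy as np

def rescale_to_range(values, low, high):
    """Linearly map values so that their min hits `low` and their max hits `high`."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        # Constant series: place it in the middle of the target range
        return np.full_like(values, (low + high) / 2.0)
    return low + (values - v_min) * (high - low) / (v_max - v_min)

# Hypothetical sentiment ratios mapped onto a hypothetical price range (USD)
toy_scores = [1.0, 1.5, 2.0]
toy_scaled = rescale_to_range(toy_scores, 50000.0, 60000.0)
```

Unlike multiplying by a constant, this mapping also shifts the series, so it is suitable only for a qualitative overlay and not for reading off absolute sentiment values.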

In [68]:
import plotly.offline as py
import plotly.graph_objs as go

py.init_notebook_mode(connected=True)

data = [go.Candlestick(x=df.index,
                       open=df.Open,
                       high=df.High,
                       low=df.Low,
                       close=df.Close,
                       name="candlesticks"),
       go.Scatter(x=df.index, y=new_scores,
                  line=dict(
                        color='rgb(0, 0, 0)',
                        width=2
                    ),
                  name="sentiment score"
                  )]

layout = go.Layout(title='Bitcoin price (November 2021)',
                   xaxis={'title': "Date", 'rangeslider':{'visible':False},
                          'dtick': "day"},
                   yaxis={'title': "Price (USD)"},
                   width=1024)

fig = go.Figure(data=data,layout=layout)
py.iplot(fig,filename='bitcoin_candlestick')
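
The plot gives only a visual impression of the relationship. To put a number on it, one option is the Pearson correlation coefficient between the daily closing price and the daily sentiment score. The sketch below uses placeholder values in a hypothetical `toy` frame; they are for illustration only and are not results from this notebook.

```python
import pandas as pd

# Placeholder values for illustration only; not results from this notebook
toy = pd.DataFrame({
    "close": [61000.0, 63000.0, 60500.0, 58000.0, 57500.0],
    "sentiment": [1.8, 2.0, 1.6, 1.2, 1.3],
})

# Series.corr computes the Pearson coefficient by default
pearson_r = toy["close"].corr(toy["sentiment"])
```

In this notebook the same call could be applied to `df.Close` and a series built from `scores`, provided both cover the same days and have the same length. A coefficient computed this way measures linear association only and says nothing about the direction of causation.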

## Conclusions

We used a labeled dataset of comments made by the crypto community to fine-tune the pre-trained DistilBERT and BERT models provided by HuggingFace.
We experimented with different model configurations and amounts of data, and obtained a final DistilBERT model that achieves an accuracy of 0.9511 on the test set.
In our experiments, DistilBERT took about half the time of the corresponding BERT model while achieving similar performance.
As expected, the models performed better on the evaluation set when trained on a larger amount of data.
Finally, we showed a practical example of how to compare the movements of the crypto market with the sentiment of the community, showing that the approach is well-founded.

Future work could extend the analysis to a much longer time span and to multiple data sources, which might show whether a so-called "wisdom of the crowd" can predict rises or falls of the market.
More generally, the model could help extract insights from the sentiment of the community.