# Sentiment Analysis on crypto related comments using transformers

**Luca Santarella (A.Y. 2021/2022)**


The datasets and the fine-tuned models can be downloaded from here: https://drive.google.com/drive/folders/1QDCGFKcqSQpaK9UGHUsJpi4RWYYv8LCK?usp=sharing


In [1]:
!pip install datasets
!pip install transformers



## Importing libraries

In [2]:
import pandas as pd
import torch
import torch.nn as nn
import numpy as np
import tensorflow as tf
import requests
from tqdm.auto import tqdm
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import datetime
import json
import pandas_datareader
from torch.utils.data import DataLoader
from datasets import Dataset, load_dataset, Value, ClassLabel, Features

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
PATH = 'C:\\Users\\lucas\\Desktop\\Unipi\\HLT\\sentiment-analysis-crypto' 

## Preprocessing labeled dataset used for fine tuning
The dataset contains comments taken from various crypto subreddits crypto such as "r/cryptocurrency" (https://www.reddit.com/r/CryptoCurrency/) for the period of August 2021. The dataset was taken from SocialGrep which was responsible for the aggreagation and labeling of the comments (https://socialgrep.com/datasets/reddit-cryptocurrency-data-for-august-2021).

In [4]:
# df = pd.read_csv(os.path.join(PATH,'crypto-aug-2021-comments.csv'))

### Removing irrelevant columns

We only keep the columns useful for the task which are: `body` and `sentiment`. 

In [5]:
# df.drop(labels=['type','id','subreddit.id','subreddit.name', 'subreddit.nsfw','created_utc', 'permalink', 'score'], axis=1, inplace=True)

In [6]:
# df.info()

### Removing irrelevant comments
Comments which have a body such as: "[deleted\]" or "[removed\]" mean that the content of the body is not available anymore, so every instance of such comments are removed from the dataset.
Automatic comments made by bots on the subreddits are also removed from the dataset, in this way we keep only comments made by humans.

In [7]:
# df['body']= df['body'].replace(r'\n',' ', regex=True) 

In [8]:
# df.body.value_counts()[:10]

In [9]:
def remove_comments(dataframe):
  print(f"Number of deleted rows: {(dataframe.body == '[removed]').sum()+(dataframe.body == '[deleted]').sum()}")

  dataframe.drop(dataframe[(dataframe.body == "[removed]") | (dataframe.body == "[deleted]")].index, inplace=True)
  sum_auto_comments = 0
  auto_comments = []

  with open(os.path.join(PATH, "auto_comments.txt"), encoding="utf8") as fp:
    lines = fp.read().splitlines()
    for line in lines:
        auto_comments.append(line)
  for comment in auto_comments:
    sum_auto_comments += (dataframe.body == comment).sum()
    dataframe.drop(dataframe[dataframe.body == comment].index, inplace=True)

  print(f"Number of deleted rows with automatic comments: {sum_auto_comments}")

#remove_comments(df)


In [10]:
# df.body.value_counts()[:10]

We also keep only unique comments which are 512 characters or shorter.

In [11]:
# df = df.dropna()
# df = df[df['body'].apply(lambda x: len(x) <= 512) ]

### Renaming the columns
We rename the columns to keep them consistent with the `datasets.Dataset`
implementation

In [12]:
# df.rename(columns={"sentiment":"labels","body":"text"}, inplace=True)
# df.reset_index(drop=True, inplace=True)

In [13]:
# n_rows_before = df.shape[0]
# df.drop_duplicates(subset="text", inplace=True)
# print(f"Unique {df.shape[0]} rows out of {n_rows_before} rows")

### Save dataframe

In [14]:
# df.to_csv(os.path.join(PATH, "preprocessed_df.csv"))

In [15]:
# df = pd.read_csv(os.path.join(PATH, "preprocessed_df.csv"), index_col=0)

### Sampling the dataframe

Finally we take a sample of the original dataset which we will then divided into training set, test set and evaluation set.

In [16]:
# n_sample = 300000

# df_sample = df.sample(n=n_sample, random_state=42)
# print(f"Sample of {df_sample.shape[0]} rows out of {df.shape[0]} rows")

In [17]:
# df = df_sample.copy()

The labels are converted from float to integers where 0 is a **negative** comment, 1 is a **positive** comment and 2 is a **neutral** one.

In [18]:
# df.loc[df_sample['labels'] > 0, 'labels'] = 1 #POSITIVE
# df.loc[df_sample['labels'] == 0, 'labels'] = 2 #NEUTRAL
# df.loc[df_sample['labels'] < 0, 'labels'] = 0 #NEGATIVE

We keep the dataset balanced, in this way we will have the same amount of negative, positive and neutral comments.

In [19]:
# df = (df.groupby('labels', as_index=False)
#         .apply(lambda x: x.sample(n=30000, random_state=69))
#         .reset_index(drop=True))

In [20]:
# df.reset_index(drop=True, inplace=True)

In [21]:
# df['labels'].value_counts()

In [22]:
# df_dev, df_test = train_test_split(df, test_size=0.3, random_state=69)
# df_train, df_val = train_test_split(df_dev, test_size=0.2, random_state=69)

In [23]:
# df_train.to_csv(os.path.join(PATH, "df_train.csv"), index=False)
# df_val.to_csv(os.path.join(PATH, "df_val.csv"), index=False)
# df_test.to_csv(os.path.join(PATH, "df_test.csv"), index=False)

### Prepare dataset

The dataframe is converted to the Dataset format accepted by the Huggingface model.

In [24]:
# schema = Features({'text': Value(dtype='string', id=None),
#  'labels': ClassLabel(num_classes=3, id=None),
# }) #defining the schema we make sure that the Dataset is well built 

# dataset_all = Dataset.from_pandas(df, schema)

The dataset is partitioned into training set (70%) and evaluation set (30%).

In [25]:
# dataset = dataset_all.train_test_split(test_size=0.3,seed=69)

In [26]:
# dataset_train = dataset_all.train_test_split(test_size=0.2,seed=69)

In [27]:
# dataset_train["train"]

### Tokenize the dataset

We tokenize the dataset using the appropriate tokenizer provided by the *transformers* module.

In [28]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# tok_ds_train = dataset_train["train"].map(tokenize_function, batched=True)
# tok_ds_val = dataset_train["test"].map(tokenize_function, batched=True)
# tok_ds_test = dataset["test"].map(tokenize_function, batched=True)

# tok_ds_train = tok_ds_train.remove_columns(["text"])
# tok_ds_test = tok_ds_test.remove_columns(["text"])
# tok_ds_val = tok_ds_val.remove_columns(["text"])

## Fine-tune DistilBERT model

In [29]:
# tok_ds_train.set_format("pt")
# tok_ds_test.set_format("pt")
# tok_ds_val.set_format("pt")

# train_dataloader = DataLoader(tok_ds_train, shuffle=True, batch_size=10)
# test_dataloader = DataLoader(tok_ds_test, batch_size=10)
# val_dataloader = DataLoader(tok_ds_val, batch_size=10)


In [30]:
# class Classifier(nn.Module):
#     def __init__(self, num_units, activation_fun, hidden_layers):
#         super(Classifier, self).__init__()
        
#         if(activation_fun == "sigmoid"):
#             activation_fun = nn.Sigmoid()
#         elif activation_fun == "relu":
#             activation_fun = nn.ReLU()
#         elif activation_fun == "tanh":
#             activation_fun = nn.Tanh()
            
#         if hidden_layers == 2:
#             self.linear_stack = nn.Sequential(
#                 nn.Linear(768, num_units),
#                 activation_fun,
#                 nn.Linear(num_units, num_units),
#             )
#         elif hidden_layers == 1:
#             self.linear_stack = nn.Sequential(
#                 nn.Linear(768, num_units),
#             )     

#     def forward(self, x):
#         X = self.linear_stack(x)
#         return X
    
#     def predict(self, x):
#         X = self.linear_stack(x)
#         return X

In [31]:
# myclassifier = Classifier(500, 'relu', 2)

In [32]:
from transformers import DistilBertForSequenceClassification

distilbert_model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)
# distilbert_model.pre_classifier = myclassifier
# distilbert_model.classifier = nn.Linear(500, 3)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifi

In [33]:
# for idx, m in enumerate(distilbert_model.modules()):
#   print(idx, '->', m)

In [34]:
# from torch.optim import AdamW

# optimizer = AdamW(distilbert_model.parameters(), lr=3e-5)

In [35]:
# device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# classifier = Classifier(num_units=768, activation_fun = "sigmoid", hidden_layers = 1)
# print(classifier)

In [36]:
# from transformers import get_scheduler

# num_epochs = 3
# num_training_steps = num_epochs * len(train_dataloader)
# lr_scheduler = get_scheduler(
#     name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
# )

In [37]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
distilbert_model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [38]:
# print(device)

In [39]:
# tok_ds_train

In [40]:
# progress_bar = tqdm(range(num_training_steps))

# distilbert_model.train()
# for epoch in range(num_epochs):
#     for batch in train_dataloader:
#         batch = {k: v.to(device) for k, v in batch.items()}
#         outputs = distilbert_model(**batch)
#         loss = outputs.loss
#         loss.backward()

#         optimizer.step()
#         lr_scheduler.step()
#         optimizer.zero_grad()
#         progress_bar.update(1)

In [41]:
# torch.save(distilbert_model.state_dict(), PATH+"/distilbert_model.pth")

In [42]:
distilbert_model.load_state_dict(torch.load(PATH+"/distilbert_model.pth"))
distilbert_model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [43]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
distilbert_model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

### Testing DistilBERT model

In [44]:
# from datasets import load_metric
# distilbert_model.eval()
# preds = []

# metric1 = load_metric('accuracy')
# metric2 = load_metric("precision")
# metric3 = load_metric("recall")

# progress_bar = tqdm(val_dataloader)
# distilbert_model.eval()
# for batch in val_dataloader:
#     batch = {k: v.to(device) for k, v in batch.items()}
#     with torch.no_grad():
#         outputs = distilbert_model(**batch)

#     logits = outputs.logits
#     predictions = torch.argmax(logits, dim=-1)
#     preds.append(predictions.cpu().detach().tolist())
#     metric1.add_batch(predictions=predictions, references=batch["labels"])
#     metric2.add_batch(predictions=predictions, references=batch["labels"])
#     metric3.add_batch(predictions=predictions, references=batch["labels"])
#     progress_bar.update(1)
    
# accuracy = metric1.compute()
# precision = metric2.compute(average='macro')
# recall = metric3.compute(average='macro')

In [45]:
# print(accuracy)
# print(precision)
# print(recall)

In [46]:
# distilbert_model.to('cpu')

In [47]:
# tokenizer.decode(db_tokenized_ds_test[2]['input_ids'], skip_special_tokens=True)

In [48]:
# for i in range(100):
#     text = tokenizer.decode(tokenized_ds_test[i]['input_ids'], skip_special_tokens=True)
#     true_label = tokenized_ds_test[i]['labels']
#     inputs = tokenizer(text, return_tensors="pt")

#     with torch.no_grad():
#       outputs = distilbert_model(**inputs)
#       prediction = torch.argmax(outputs.logits, dim=-1).item()
#     if(prediction != true_label):
#         print(f"TEXT: {text} \nTRUE LABEL:{true_label} - PREDICTION: {prediction}\n\n")

## Experiments


### Preprocessing dataset of November
A second unlabeled dataset with comments from the "r/cryptocurrency" subreddit will be used to conduct experiments on the model.

This dataset was built using the PushShift API (https://github.com/pushshift/api) for Reddit which is used to collect data of posts and comments from Reddit and `pmaw` (https://pypi.org/project/pmaw/0.0.2/) which is a wrapper for the PushShift API to make multiple requests.

The timeframe selected for the comments in the dataset is the month of November 2021, we believe this timeframe to be significant because of multiple sudden crashes of the crypto market which happened throughout the month.

In [70]:
df_month = pd.read_csv(PATH+'./june_cc.csv')

In [71]:
remove_comments(df_month)

Number of deleted rows: 31050
Number of deleted rows with automatic comments: 2533


In [72]:
df_month['created_utc'] = pd.to_datetime(df_month['created_utc'],unit='s')

In [73]:
df_month = df_month.dropna()
df_month = df_month[df_month['body'].apply(lambda x: len(x) <= 512) ]
df_month.rename(columns={"body":"text"}, inplace=True)
df_month.reset_index(drop=True, inplace=True)

In [76]:
df_month.drop(labels = df_month[df_month['created_utc'].dt.month == 5].index, inplace=True)


In [80]:
df_month[df_month['created_utc'].dt.month == 6]

Unnamed: 0,text,created_utc
0,Get your red dildos ready folk,2022-06-03 22:53:55
1,Imagine if Con edison tips off the police acco...,2022-06-03 22:53:50
2,Yeah just comparing 2018 to this cycle has mad...,2022-06-03 22:53:48
3,Probably because this post has nothing to do w...,2022-06-03 22:53:43
4,Exposing your components to a high temperature...,2022-06-03 22:53:39
...,...,...
442237,"Thank you for submitting to /r/CryptoCurrency,...",2022-06-19 12:41:09
442238,"Thank you for submitting to /r/CryptoCurrency,...",2022-06-19 12:41:08
442239,"Newbies in crypto bull market ""Let's exit the ...",2022-06-19 12:40:57
442240,Solana is just a piggy bank for FTX Almeida an...,2022-06-19 12:40:51


### Predict sentiment of each day
The comments are split according to the day of the month, in this way we obtain the number of positive and negative comments for each day.


In [81]:
import calendar
from datasets import Dataset

df_days = []
tok_ds_days = []

year = "2022"
month_selected = df_month['created_utc'].dt.month.unique()
year_selected = df_month['created_utc'].dt.year.unique()

num_days = calendar.monthrange(int(year_selected), int(month_selected))[1]

for day_number in range(1, num_days+1):
  df_tmp = df_month[df_month['created_utc'].dt.day == day_number]
  df_days.append(df_tmp) 

  ds_day = Dataset.from_pandas(df_tmp)
  tok_ds_days.append(ds_day.map(tokenize_function, batched=True))


  0%|                                                                                           | 0/14 [00:00<?, ?ba/s][A
  7%|█████▉                                                                             | 1/14 [00:00<00:04,  3.08ba/s][A
 14%|███████████▊                                                                       | 2/14 [00:00<00:03,  3.01ba/s][A
 21%|█████████████████▊                                                                 | 3/14 [00:01<00:03,  2.92ba/s][A
 29%|███████████████████████▋                                                           | 4/14 [00:01<00:03,  2.77ba/s][A
 36%|█████████████████████████████▋                                                     | 5/14 [00:01<00:03,  2.64ba/s][A
 43%|███████████████████████████████████▌                                               | 6/14 [00:02<00:03,  2.36ba/s][A
 50%|█████████████████████████████████████████▌                                         | 7/14 [00:02<00:02,  2.52ba/s][A
 57%|██████████

 45%|█████████████████████████████████████▋                                             | 5/11 [00:02<00:02,  2.59ba/s][A
 55%|█████████████████████████████████████████████▎                                     | 6/11 [00:02<00:01,  2.68ba/s][A
 64%|████████████████████████████████████████████████████▊                              | 7/11 [00:02<00:01,  2.63ba/s][A
 73%|████████████████████████████████████████████████████████████▎                      | 8/11 [00:03<00:01,  2.63ba/s][A
 82%|███████████████████████████████████████████████████████████████████▉               | 9/11 [00:03<00:00,  2.37ba/s][A
100%|██████████████████████████████████████████████████████████████████████████████████| 11/11 [00:04<00:00,  2.70ba/s][A

  0%|                                                                                           | 0/12 [00:00<?, ?ba/s][A
  8%|██████▉                                                                            | 1/12 [00:00<00:04,  2.32ba/s][A
 17%|██████████

 79%|████████████████████████████████████████████████████████████████▍                 | 11/14 [00:03<00:01,  2.74ba/s][A
 86%|██████████████████████████████████████████████████████████████████████▎           | 12/14 [00:04<00:00,  2.71ba/s][A
 93%|████████████████████████████████████████████████████████████████████████████▏     | 13/14 [00:04<00:00,  2.74ba/s][A
100%|██████████████████████████████████████████████████████████████████████████████████| 14/14 [00:04<00:00,  2.95ba/s][A

  0%|                                                                                           | 0/20 [00:00<?, ?ba/s][A
  5%|████▏                                                                              | 1/20 [00:00<00:06,  2.90ba/s][A
 10%|████████▎                                                                          | 2/20 [00:00<00:06,  2.98ba/s][A
 15%|████████████▍                                                                      | 3/20 [00:00<00:05,  3.03ba/s][A
 20%|██████████

 24%|███████████████████▉                                                               | 6/25 [00:02<00:07,  2.61ba/s][A
 28%|███████████████████████▏                                                           | 7/25 [00:02<00:06,  2.68ba/s][A
 32%|██████████████████████████▌                                                        | 8/25 [00:02<00:06,  2.63ba/s][A
 36%|█████████████████████████████▉                                                     | 9/25 [00:03<00:06,  2.30ba/s][A
 40%|████████████████████████████████▊                                                 | 10/25 [00:03<00:06,  2.48ba/s][A
 44%|████████████████████████████████████                                              | 11/25 [00:04<00:05,  2.56ba/s][A
 48%|███████████████████████████████████████▎                                          | 12/25 [00:04<00:04,  2.71ba/s][A
 52%|██████████████████████████████████████████▋                                       | 13/25 [00:04<00:04,  2.72ba/s][A
 56%|███████████

100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:07<00:00,  2.69ba/s][A

  0%|                                                                                           | 0/16 [00:00<?, ?ba/s][A
  6%|█████▏                                                                             | 1/16 [00:00<00:05,  2.93ba/s][A
 12%|██████████▍                                                                        | 2/16 [00:00<00:04,  2.91ba/s][A
 19%|███████████████▌                                                                   | 3/16 [00:01<00:04,  2.95ba/s][A
 25%|████████████████████▊                                                              | 4/16 [00:01<00:04,  2.89ba/s][A
 31%|█████████████████████████▉                                                         | 5/16 [00:01<00:04,  2.69ba/s][A
 38%|███████████████████████████████▏                                                   | 6/16 [00:02<00:04,  2.46ba/s][A
 44%|██████████

 95%|█████████████████████████████████████████████████████████████████████████████▉    | 19/20 [00:07<00:00,  2.57ba/s][A
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:07<00:00,  2.62ba/s][A

  0%|                                                                                           | 0/16 [00:00<?, ?ba/s][A
  6%|█████▏                                                                             | 1/16 [00:00<00:05,  2.67ba/s][A
 12%|██████████▍                                                                        | 2/16 [00:00<00:05,  2.70ba/s][A
 19%|███████████████▌                                                                   | 3/16 [00:01<00:04,  2.83ba/s][A
 25%|████████████████████▊                                                              | 4/16 [00:01<00:04,  2.77ba/s][A
 31%|█████████████████████████▉                                                         | 5/16 [00:01<00:04,  2.69ba/s][A
 38%|██████████

 42%|██████████████████████████████████▌                                                | 5/12 [00:01<00:02,  2.38ba/s][A
 50%|█████████████████████████████████████████▌                                         | 6/12 [00:02<00:02,  2.55ba/s][A
 58%|████████████████████████████████████████████████▍                                  | 7/12 [00:02<00:01,  2.61ba/s][A
 67%|███████████████████████████████████████████████████████▎                           | 8/12 [00:03<00:01,  2.60ba/s][A
 75%|██████████████████████████████████████████████████████████████▎                    | 9/12 [00:03<00:01,  2.69ba/s][A
 83%|████████████████████████████████████████████████████████████████████▎             | 10/12 [00:03<00:00,  2.70ba/s][A
 92%|███████████████████████████████████████████████████████████████████████████▏      | 11/12 [00:04<00:00,  2.71ba/s][A
100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [00:04<00:00,  2.81ba/s][A

  0%|          

 92%|███████████████████████████████████████████████████████████████████████████▏      | 11/12 [00:04<00:00,  2.59ba/s][A
100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [00:04<00:00,  2.86ba/s][A

  0%|                                                                                           | 0/13 [00:00<?, ?ba/s][A
  8%|██████▍                                                                            | 1/13 [00:00<00:04,  2.92ba/s][A
 15%|████████████▊                                                                      | 2/13 [00:00<00:03,  2.94ba/s][A
 23%|███████████████████▏                                                               | 3/13 [00:01<00:03,  2.88ba/s][A
 31%|█████████████████████████▌                                                         | 4/13 [00:01<00:03,  2.54ba/s][A
 38%|███████████████████████████████▉                                                   | 5/13 [00:01<00:03,  2.44ba/s][A
 46%|██████████

In [82]:
len(tok_ds_days)

30

In [83]:
for i in range(num_days):
    tok_ds_days[i] = tok_ds_days[i].remove_columns(["__index_level_0__"])
    tok_ds_days[i] = tok_ds_days[i].remove_columns(["text"])
    tok_ds_days[i] = tok_ds_days[i].remove_columns(["created_utc"])

In [84]:
preds_counts = []

progress_bar = tqdm(range(num_days))
for i in range(num_days):
  preds = []
  tok_ds_days[i].set_format("torch")
  exp_dataloader = DataLoader(tok_ds_days[i], batch_size=12)
  for batch in exp_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = distilbert_model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    preds.append(predictions.cpu().detach().tolist())
  preds = [item for sublist in preds for item in sublist]  
    
  predictions_count = np.bincount(preds)
  preds_counts.append(predictions_count)
  df_days[i]['prediction'] = preds
  progress_bar.update(1)


100%|███████████████████████████████████████████████████████████████████████████████| 31/31 [3:57:19<00:00, 459.32s/it][A
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_days[i]['prediction'] = preds

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_days[i]['prediction'] = preds

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_days[i]['p

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_days[i]['prediction'] = preds

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_days[i]['prediction'] = preds

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_days[i]['prediction'] = preds

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_index

A ratio of the number of positive over negative comments is computed, we will use this ratio to analyze the sentiment for the days when sudden crashes happened.

In [85]:
scores = []

for i in range(num_days):
  scores.append(preds_counts[i][1]/preds_counts[i][0])

In [86]:
print(month_selected)

[6]


In [87]:
year = "2022"

start_str = str(int(month_selected))+'/1/'+year
end_str = str(num_days)+'/'+str(int(month_selected))+'/'+str(year)


date_days = pd.date_range(start=start_str, end=end_str)
print(date_days)

data = {'score': scores,
        'date': date_days,}
df_final = pd.DataFrame(data=data)

DatetimeIndex(['2022-06-01', '2022-06-02', '2022-06-03', '2022-06-04',
               '2022-06-05', '2022-06-06', '2022-06-07', '2022-06-08',
               '2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12',
               '2022-06-13', '2022-06-14', '2022-06-15', '2022-06-16',
               '2022-06-17', '2022-06-18', '2022-06-19', '2022-06-20',
               '2022-06-21', '2022-06-22', '2022-06-23', '2022-06-24',
               '2022-06-25', '2022-06-26', '2022-06-27', '2022-06-28',
               '2022-06-29', '2022-06-30'],
              dtype='datetime64[ns]', freq='D')


In [88]:
df_final

Unnamed: 0,score,date
0,1.473476,2022-06-01
1,1.620942,2022-06-02
2,1.520142,2022-06-03
3,1.560479,2022-06-04
4,1.557252,2022-06-05
5,1.534075,2022-06-06
6,1.502872,2022-06-07
7,1.69673,2022-06-08
8,1.626885,2022-06-09
9,1.46165,2022-06-10


In [89]:
df_final.to_csv('june_22.csv')

### Correlation between sentiment and price
The price of Bitcoin shows an high correlation with the sentiment on the crypto community, this is commonly believed to be a direct causation of price towards sentiment. 

In [163]:
import pandas as pd
import pandas_datareader.data as pdr
import datetime

start = datetime.datetime(2022,1,1)
end = datetime.datetime(2022,1,30)
df = pdr.DataReader('BTC-USD','yahoo',start,end)

We multiply the sentiment score for a constant factor to obtain a qualitative plot showing the correlation between price and sentiment.

In [164]:
new_scores = [elem*23000 for elem in scores]

In [165]:
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
import plotly.offline as py
import plotly.graph_objs as go

py.init_notebook_mode(connected=True)

data = [go.Candlestick(x=df.index,
                       open=df.Open,
                       high=df.High,
                       low=df.Low,
                       close=df.Close,
                       name="candlesticks"),
       go.Scatter(x=df.index, y=new_scores,
                  line=dict(
                        color='rgb(0, 0, 0)',
                        width=2
                    ),
                  name="sentiment score"
                  )]

layout = go.Layout(title='Bitcoin price (January 2022)',
                   xaxis={'title': "Date", 'rangeslider':{'visible':False},
                          'dtick': "day"},
                   yaxis={'title': "Price (USD)"},
                   width=1024)

fig = go.Figure(data=data,layout=layout)
py.iplot(fig,filename='bitcoin_candlestick')

## Conclusions

We used a labeled dataset of comments made by the crypto community to perform fine tuning on the DistilBERT and BERT pre-trained models provided by HuggingFace.
We experimented with different configurations of the models and amount of data, in this way we obtained a final DistilBERT model which achieves an accuracy of 0.9511 on the test set.
We observed that in the experiments the DistilBERT version took half of the time and it also had a similar performance with respect to the corresponding BERT model.
Further, as expected with a greater amount of data the models performed better on the evaluation set.
Finally we showed a practical example of how to analyze the movements of the crypto market in comparison to the sentiment showing that the approach is well-founded.

A further work that could be done is an analysis on a much longer span and possibly on multiple sources of data, this could possibly show if a so called ``wisdom of the crowd" could predict rises or falls of the market. 
More in general the model could help in finding some kind of insight that could be extrapolated from the sentiment of the community.