# Evolving sentiment in Crypto markets
James Rockey (jrockey2)

Kareem Benaissa (kareem2)

### This project explores the relationship between cryptocurrency sentiment, measured through tweets, and the predictive power of LSTM models

This is done in the following steps:

1. Calculate the sentiment of tweets with a BERT model pretrained for financial sentiment analysis

Model: https://huggingface.co/ProsusAI/finbert

Tweets dataset: https://www.kaggle.com/code/codeblogger/bitcoin-sentiment-analysis

Tweets are preprocessed before they are fed into the model.

The FinBERT model outputs a confidence score for each of the following three sentiment categories: ['positive', 'negative', 'neutral']

2. After sentiment scores are calculated for every tweet, we measure the effect of sentiment on the predictive power of LSTMs.

For the purposes of this research we want to explore intraday sentiment, so we collected tweets from 15 random days in February and March 2021 and, for each of these days, calculated sentiment in 10-minute increments. We then used sequences of varying length, where each step covers 10 minutes, to predict the next 1 or 2 steps. We compare the LSTM's predictions of price, volume, and other features when trained on those features alone (the baseline) with its performance when sentiment and weighted-average sentiment are included in training.
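The sequence setup described above can be illustrated with a small sliding-window helper. This is only a sketch of one common way to build LSTM inputs, not necessarily the windowing used in this project; the function name `make_windows` and the parameters `seq_len` and `horizon` are hypothetical.

```python
import numpy as np

def make_windows(features, seq_len, horizon):
    """Split a (T, F) array of 10-minute rows into (input, target) pairs.

    seq_len: number of consecutive 10-minute steps fed to the model (hypothetical name)
    horizon: number of following steps to predict, e.g. 1 or 2 (hypothetical name)
    """
    features = np.asarray(features, dtype=float)
    X, y = [], []
    for start in range(len(features) - seq_len - horizon + 1):
        end = start + seq_len
        X.append(features[start:end])          # input sequence
        y.append(features[end:end + horizon])  # the steps that follow it
    return np.array(X), np.array(y)

# Toy day: 6 intervals with 2 made-up features per interval
day = np.arange(12).reshape(6, 2)
X, y = make_windows(day, seq_len=3, horizon=2)
print(X.shape, y.shape)  # (2, 3, 2) (2, 2, 2)
```

Each row of `X` is one input sequence and the matching row of `y` holds the 1 or 2 intervals that follow it.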

## Step 1: Calculating Sentiment

In our dataset, the 'text' column contains the content of the tweet. We preprocess this text and use the FinBERT model to calculate its sentiment.
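To make the preprocessing step concrete, here is a standalone sketch of the placeholder logic, mirroring the `preprocess` helper defined later in this notebook. The function name `mask_mentions_and_links` and the example tweet are illustrative only.

```python
def mask_mentions_and_links(text):
    """Replace user mentions with '@user' and links with 'http'."""
    tokens = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"   # hide the specific account name
        elif token.startswith("http"):
            token = "http"    # hide the specific URL
        tokens.append(token)
    return " ".join(tokens)

# Made-up tweet for illustration
print(mask_mentions_and_links("@satoshi BTC is up, see https://example.com/chart"))
# @user BTC is up, see http
```

Masking these tokens keeps account names and URLs, which carry no sentiment, from influencing the model.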

In [5]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from scipy.special import softmax
import csv
import urllib.request






In [8]:
df = pd.read_csv('datasets/Bitcoin_tweets.csv', sep=',', header=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48583 entries, 0 to 48582
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   user_name         48582 non-null  object
 1   user_location     28273 non-null  object
 2   user_description  45263 non-null  object
 3   user_created      48583 non-null  object
 4   user_followers    48583 non-null  int64 
 5   user_friends      48583 non-null  int64 
 6   user_favourites   48583 non-null  int64 
 7   user_verified     48583 non-null  bool  
 8   date              48583 non-null  object
 9   text              48583 non-null  object
 10  hashtags          38416 non-null  object
 11  source            47685 non-null  object
 12  is_retweet        48583 non-null  bool  
dtypes: bool(2), int64(3), object(8)
memory usage: 4.2+ MB


In [10]:
# Load the model
MODEL = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [11]:
# helper functions
def get_labels(): 
    labels=[]

    # This is for another BERT model that we tried:
    # mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
    # with urllib.request.urlopen(mapping_link) as f:
    #     html = f.read().decode('utf-8').split("\n")
    #     csvreader = csv.reader(html, delimiter='\t')
    # labels = [row[1] for row in csvreader if len(row) > 1]

    # This is for the FinBERT model:
    labels = ['positive', 'negative', 'neutral']
    return labels


def preprocess(text):
    '''
    Preprocess text (username and link placeholders)
    '''    
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


def process_tweet(text: str):
    '''
    Calculates sentiment scores for the given text
    '''
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    return scores

# Modified print_scores function to return values instead of printing
def get_sentiment_scores(scores):
    labels = get_labels()
    ranking = np.argsort(scores)
    ranking = ranking[::-1]
    return {labels[ranking[i]]: np.round(float(scores[ranking[i]]), 4) for i in range(scores.shape[0])}

# Function to apply to each row
def analyze_sentiment(row):
    '''
    Takes in a row from the dataframe and returns a series of sentiment scores
    in the order [positive, neutral, negative].
    Used in df.apply()
    '''
    text = row['text']
    scores = process_tweet(text)
    # get_sentiment_scores looks up the labels itself, so it takes only the scores
    sentiment_scores = get_sentiment_scores(scores)
    return pd.Series([sentiment_scores.get('positive', 0), 
                      sentiment_scores.get('neutral', 0), 
                      sentiment_scores.get('negative', 0)])

def print_scores(scores, labels):
    ranking = np.argsort(scores)
    ranking = ranking[::-1]
    for i in range(scores.shape[0]):
        l = labels[ranking[i]]
        s = scores[ranking[i]]
        print(f"{i+1}) {l} {np.round(float(s), 4)}")


### Example use of sentiment model

In [33]:
text = '''I'm absolutely ecstatic about Bitcoin's remarkable 
performance and incredibly optimistic about its
 potential to revolutionize finance''' # example tweet

scores = process_tweet(text)
labels = get_labels()
print_scores(scores, labels)

1) positive 0.9032
2) neutral 0.0864
3) negative 0.0103
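The three scores above sum to roughly 1 because `process_tweet` passes the model's raw outputs (logits) through a softmax. The sketch below shows that step in isolation with NumPy; the helper name `softmax_scores` is hypothetical and the logits are made-up numbers, not real FinBERT output.

```python
import numpy as np

def softmax_scores(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

labels = ['positive', 'negative', 'neutral']  # label order used in this notebook
logits = np.array([2.0, -2.5, -0.3])          # made-up logits for illustration
probs = softmax_scores(logits)

for label, p in zip(labels, probs):
    print(label, round(float(p), 4))
print("sum:", round(float(probs.sum()), 4))
```

The largest logit always receives the largest probability, so ranking the softmax scores gives the same order as ranking the raw outputs.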


### Apply to dataset
Note: this may take several hours. The data with sentiment scores already calculated is available in `sentiment_added_data.csv`.

In [34]:
# df[['positive', 'neutral', 'negative']] = df.apply(analyze_sentiment, axis=1)
# df.to_csv('sentiment_added_data.csv')
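Because the scoring pass is slow, one option is to reuse the saved file whenever it exists and only recompute otherwise. The helper below is a minimal sketch of that idea, not part of the original notebook; `load_or_compute` and `compute_fn` are hypothetical names.

```python
import os
import pandas as pd

def load_or_compute(path, compute_fn):
    """Return the cached DataFrame at `path`, or build it with `compute_fn` and cache it."""
    if os.path.exists(path):
        # to_csv writes the index as the first column, so read it back as the index
        return pd.read_csv(path, index_col=0)
    result = compute_fn()
    result.to_csv(path)
    return result

# Possible usage (assumes a function that runs the scoring pass and returns
# the DataFrame with the three sentiment columns added):
# df = load_or_compute('sentiment_added_data.csv', run_sentiment_pass)
```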



## Step 2: Analyzing Sentiment

Here, we explore how sentiment calculated with the BERT model affects LSTM prediction.

### Calculate 10 minute sentiment scores for each day in dataset

The sentiment score for a given 10-minute interval is found by averaging the ['positive', 'negative', 'neutral'] columns in our dataset over the tweets posted during that interval.

For each trading day, we group tweets into 10-minute intervals from 12:00:00 am to 11:59:59 pm.

We also explore how the "reach" of a tweet affects sentiment by calculating a weighted average in which each tweet is weighted by its author's follower count.

If a 10-minute slice has no tweets (or its tweets have zero followers in total), we give each sentiment column an equal value of 1/3.
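Written out, the two averages for a sentiment category $c$ over the $n$ tweets in one interval look as follows. The symbols are notation introduced here for illustration: $s_{i,c}$ is the score of tweet $i$ for category $c$, and $f_i$ is the follower count of its author.

$$
\text{avg}_c = \frac{1}{n}\sum_{i=1}^{n} s_{i,c},
\qquad
\text{weighted\_avg}_c = \frac{\sum_{i=1}^{n} f_i \, s_{i,c}}{\sum_{i=1}^{n} f_i}
$$

As a made-up example, two tweets with positive scores 0.9 (100 followers) and 0.3 (300 followers) give $\text{avg}_{\text{positive}} = (0.9 + 0.3)/2 = 0.6$ and $\text{weighted\_avg}_{\text{positive}} = (100 \cdot 0.9 + 300 \cdot 0.3)/400 = 0.45$. When $\sum_i f_i = 0$, the fallback value of $1/3$ is used instead.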

In [35]:
'''
The following functions calculate the weighted average 
and average sentiment scores for a group from the dataframe.

group is from the df.groupby() function

Input: group from dataframe

Output: series of weighted average and average sentiment scores

'''

def weighted_average(group):
    weighted_positive = (group['positive'] * group['user_followers']).sum()
    weighted_neutral = (group['neutral'] * group['user_followers']).sum()
    weighted_negative = (group['negative'] * group['user_followers']).sum()
    total_followers = np.sum(group['user_followers'])
    
    if total_followers == 0:
        return pd.Series({
            'weighted_avg_positive': 1/3,
            'weighted_avg_neutral': 1/3,
            'weighted_avg_negative': 1/3
        })
    return pd.Series({
        'weighted_avg_positive': weighted_positive / total_followers,
        'weighted_avg_neutral': weighted_neutral / total_followers,
        'weighted_avg_negative': weighted_negative / total_followers
    })

def average(group):
    avg_positive = group['positive'].mean()
    avg_neutral = group['neutral'].mean()
    avg_negative = group['negative'].mean()

    total_followers = np.sum(group['user_followers'])
    
    # Same fallback condition as weighted_average, but with the 'avg_*' keys
    # so that every group returns the same columns
    if total_followers == 0:
        return pd.Series({
            'avg_positive': 1/3,
            'avg_neutral': 1/3,
            'avg_negative': 1/3
        })
    return pd.Series({
        'avg_positive': avg_positive,
        'avg_neutral': avg_neutral,
        'avg_negative': avg_negative
    })
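As a rough illustration of the 10-minute grouping, the sketch below bins a few made-up tweets with `pd.Grouper` and computes the follower-weighted positive score per bin. It is a vectorized variant of the idea, not the notebook's own pipeline, and it covers only the 'positive' column; the toy data and the helper column `weighted_pos` are assumptions.

```python
import pandas as pd

# Made-up tweets; 'date', 'positive' and 'user_followers' match the dataset's column names
tweets = pd.DataFrame({
    'date': pd.to_datetime(['2021-02-10 00:01:00',
                            '2021-02-10 00:07:00',
                            '2021-02-10 00:25:00']),
    'positive': [0.9, 0.3, 0.6],
    'user_followers': [100, 300, 50],
})

# Score multiplied by reach, so a per-bin sum gives the weighted numerator
tweets['weighted_pos'] = tweets['positive'] * tweets['user_followers']

# Bin into 10-minute intervals; the 00:10 bin has no tweets and sums to 0
bins = tweets.groupby(pd.Grouper(key='date', freq='10min'))
sums = bins[['weighted_pos', 'user_followers']].sum()

# Weighted average per bin, falling back to 1/3 where there are no followers
weighted = (sums['weighted_pos'] / sums['user_followers']).where(
    sums['user_followers'] > 0, 1 / 3)
print(weighted)
```

Note that the bins here only span the first to the last tweet timestamp; covering a full day from 12:00 am to 11:59 pm would need the result to be reindexed onto the complete set of intervals.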
