# Bitcoin price prediction based on *sentiment analysis* of related tweets
-----------

## Introduction

This project applies sentiment analysis to tweets about cryptocurrencies in order to build a model that predicts how valuations might change with public sentiment. We want to see whether the behaviour of cryptocurrencies correlates with people's sentiment about them, since their value rests on intangible services and could therefore be volatile depending on the availability of those services.

We have taken a sentiment-tagged dataset of Bitcoin ($BTC) related tweets, and used that to train a machine learning model in order to sentiment-tag a curated dataset which we have pulled from the Twitter API. We then compare the change in sentiment per day to the change in price of BTC per day, in order to analyse the correlation. Finally, we attempt to determine whether the sentiment influences the price or vice versa.
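
As a rough illustration of the comparison described above, the sketch below correlates day-over-day changes in mean sentiment with day-over-day changes in price, then repeats the calculation with one series shifted by a day. The numbers are invented for illustration, and the helper names (`daily_changes`, `pearson`) are hypothetical rather than part of this project's code.

```python
from math import sqrt

def daily_changes(values):
    """Return day-over-day differences of a daily series."""
    return [b - a for a, b in zip(values, values[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented daily values, for illustration only (not project data).
sentiment = [0.10, 0.15, 0.05, 0.20, 0.25, 0.10]
price = [30000, 30500, 29800, 31000, 31600, 30200]

d_sent = daily_changes(sentiment)
d_price = daily_changes(price)

# Same-day relationship between the two change series.
same_day = pearson(d_sent, d_price)

# Sentiment change on day t against price change on day t+1.
sentiment_leads = pearson(d_sent[:-1], d_price[1:])

# Price change on day t against sentiment change on day t+1.
price_leads = pearson(d_price[:-1], d_sent[1:])

print(f"same day: {same_day:.3f}")
print(f"sentiment leads by one day: {sentiment_leads:.3f}")
print(f"price leads by one day: {price_leads:.3f}")
```

A stronger correlation at one lag than at the other is only suggestive of direction; it does not by itself establish that one series drives the other.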

## Data
First, we selected the datasets we would use for the analysis as well as to train and test our models later on. We have chosen the following datasets for our analysis:

- Bitcoin Sentiment Twitter dataset

> We use data from [this](https://www.kaggle.com/code/alexandrayuliu/bitcoin-tweets-sentiment-analysis/data) sentiment-tagged dataset to train and test a model for tagging our curated data.

- Twitter API web-crawled dataset

> As we want to analyse the most dramatic and interesting points in time related to BTC, we use a custom dataset which we gathered using [Twitter's Developer API](https://developer.twitter.com/en/docs/twitter-api).

## Research Question 
Is the sentiment of online discourse about Bitcoin representative of the change of its market price? 

If there is a correlation, which one affects the other more?

# Preprocessing of training data
-----------
This section only needs to be run once; the results are saved to a CSV file for later access.

In [2]:
import pandas as pd
sent_tweets = pd.read_csv("data/sent_tweets.csv").drop(columns=['user_location', 'user_description', 
                                                           'user_followers', 'user_friends',])
display(sent_tweets.head(5))
sent_tweets.shape

Unnamed: 0,date,tweets,score
0,2021-02-05 10:52:04,AT_USER AT_USER AT_USER right here w/ AT_USER ...,0.0
1,2021-02-05 10:52:04,AT_USER AT_USER please donate bitcoin19 donate...,0.6597
2,2021-02-05 10:52:06,$sos market cap is 308 million. if they’re min...,0.0
3,2021-02-05 10:52:07,"bitcoin btc current price (gbp): £34,880 like ...",0.3612
4,2021-02-05 10:52:26,AT_USER right here w/ AT_USER URL referral cod...,0.0


(1519555, 3)

## Filtering and formatting

We chose not to remove stopwords, as Saif et al. (LREC 2014) suggest that doing so might degrade the quality of classification.

[On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter](http://www.lrec-conf.org/proceedings/lrec2014/pdf/292_Paper.pdf) (Saif et al., LREC 2014)

In [53]:
import re 

# Provide case insensitive data
tweets = sent_tweets["tweets"].str.lower().astype(str)
# Take out links with or without www
tweets = tweets.apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
tweets = tweets.apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))

# Take out possible HTML character references 
tweets = tweets.apply(lambda x: re.sub(r'&[a-z]+;', '', x))

# Sometimes twitter data has links preprocessed into a reference such as {link}.
# This must run before the non-letter filter below, which would otherwise strip the braces.
tweets = tweets.apply(lambda x: re.sub(r'\{link\}', '', x))

# Take out non-letter characters except for spaces and sentence delimiters
tweets = tweets.apply(lambda x: re.sub(r"[^a-z\s.!?]", '', x))

# Noticed the dataset contains at user and url references so we can remove them
tweets = tweets.str.replace('url', '')
tweets = tweets.str.replace('atuser', '')

sent_tweets["tweets"] = tweets

sent_tweets.head(5)

Unnamed: 0,date,tweets,score
0,2021-02-05 10:52:04,right here w referral code,0.0
1,2021-02-05 10:52:04,please donate bitcoin donate challenge from ...,0.6597
2,2021-02-05 10:52:06,sos market cap is million. if theyre mining ....,0.0
3,2021-02-05 10:52:07,bitcoin btc current price gbp like my updates...,0.3612
4,2021-02-05 10:52:26,right here w referral code helpmehelpyou,0.0
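
To make the effect of the filtering steps above easier to check, the sketch below collects a subset of them into a single function and applies it to one invented tweet. The function name `clean_tweet` and the sample text are hypothetical, and the final whitespace collapse is an addition for readability that the cell above does not perform.

```python
import re

def clean_tweet(text):
    """Apply a subset of the filtering steps to a single tweet (illustrative helper)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # drop http(s) links
    text = re.sub(r"&[a-z]+;", "", text)        # drop HTML character references
    text = re.sub(r"[^a-z\s.!?]", "", text)     # keep letters, whitespace and . ! ?
    text = text.replace("atuser", "").replace("url", "")  # drop dataset placeholders
    return " ".join(text.split())               # addition: collapse repeated whitespace

# Invented example, not taken from the dataset.
sample = "AT_USER Bitcoin is up 5% today!!! &amp; see https://example.com URL"
print(clean_tweet(sample))  # bitcoin is up today!!! see
```

Note that the plain `replace` calls also remove `url` and `atuser` when they occur inside longer words, which is a limitation shared with the cell above.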


## Tokenisation and POS-tagging

We use the regular expression tokeniser from NLTK, as it gives us a high level of customisability.

In [40]:
import nltk
from nltk.tokenize import RegexpTokenizer

tweets = list(zip(sent_tweets["tweets"], sent_tweets["score"]))

regextk = RegexpTokenizer(r'\s+', gaps=True)

tokens = [(regextk.tokenize(tweet), sentiment) for (tweet, sentiment) in tweets if type(tweet) == str]

tagged = [(nltk.pos_tag(tweet), sentiment) for tweet, sentiment in tokens]

tagged[3]

([('bitcoin', 'NN'),
  ('btc', 'NN'),
  ('current', 'JJ'),
  ('price', 'NN'),
  ('gbp', 'NN'),
  ('like', 'IN'),
  ('my', 'PRP$'),
  ('updates?', 'NN'),
  ('you', 'PRP'),
  ('can', 'MD'),
  ('tip', 'VB'),
  ('me', 'PRP'),
  ('at', 'IN'),
  ('ldztlqrcxnpnvgfnrbazuqvmrz', 'NN')],
 0.3612)
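
Splitting on whitespace gaps leaves punctuation attached to the neighbouring word, which is why `updates?` appears as a single token in the output above. The sketch below reproduces that behaviour with the standard library's `re.split`, as a stand-in for the NLTK tokeniser rather than a replacement for it; the helper name `whitespace_tokens` is hypothetical.

```python
import re

def whitespace_tokens(text):
    """Split on runs of whitespace and drop empty strings."""
    return [tok for tok in re.split(r"\s+", text) if tok]

tokens = whitespace_tokens("like my  updates? you can tip me")
print(tokens)  # ['like', 'my', 'updates?', 'you', 'can', 'tip', 'me']
```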

## Lemmatisation with Wordnet POS-tags

In [41]:
import string
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

def wn_pos(tag):
    """Convert Penn Treebank POS tags into WordNet POS tags for lemmatisation."""
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None

lem_tweets = []
lem = WordNetLemmatizer()

for tweet in tagged:
    lemmas = []
    
    for word, tag in tweet[0]:
        wn_tag = wn_pos(tag)
        
        # Strip one trailing punctuation mark left attached by the whitespace tokeniser
        if word[-1] in string.punctuation:
            word = word[:-1]

        if wn_tag is not None:
            lemmas.append(lem.lemmatize(word, wn_tag))
        else:
            lemmas.append(lem.lemmatize(word))
                
    lem_tweets.append((lemmas, tweet[1]))

lem_tweets[3]

(['bitcoin',
  'btc',
  'current',
  'price',
  'gbp',
  'like',
  'my',
  'update',
  'you',
  'can',
  'tip',
  'me',
  'at',
  'ldztlqrcxnpnvgfnrbazuqvmrz'],
 0.3612)

## Saving to CSV file for later use

In [42]:
pd.DataFrame(lem_tweets).to_csv("data/lem_tweets.csv")

End of training data preprocessing


# Training the model

In [2]:
import pandas as pd
import numpy as np

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep]

lem_tweets = pd.read_csv("data/lem_tweets.csv").drop(columns="Unnamed: 0").rename(columns={"0":"Tweet", "1":"Sentiment"})

# joining data back to strings
new = []
for tweet in lem_tweets["Tweet"]:
    new.append((tweet[1:-1].replace(",", "").replace("'", "")))
    
data = pd.DataFrame(zip(new, lem_tweets["Sentiment"]), columns=["Tweet", "Sentiment"])
data = clean_dataset(data)

X = data['Tweet'].astype(str)
y = data['Sentiment']
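
When a list of lemmas is written to CSV it is stored as its string representation, which is why the loop above strips the brackets, commas and quotes. The sketch below shows that conversion on one invented row, next to an alternative that parses the string with `ast.literal_eval`. The alternative is a suggestion under the assumption that every row is a valid Python list literal; it is not what the notebook does.

```python
import ast

# How one row might look after the CSV round trip (invented example).
stored = "['bitcoin', 'btc', 'current', 'price']"

# Approach used in the cell above: strip brackets, commas and quotes.
stripped = stored[1:-1].replace(",", "").replace("'", "")

# Alternative: parse the literal back into a list, then join with spaces.
parsed = " ".join(ast.literal_eval(stored))

print(stripped)  # bitcoin btc current price
print(parsed)    # bitcoin btc current price
```

The two agree here; they would differ only if a token itself contained a comma or a quote character.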

## Splitting and vectorising

TF-IDF (term frequency-inverse document frequency) is used as our baseline for word representation.

Each term's weight is its term frequency, TF (the number of times the term appears in a document), multiplied by its inverse document frequency, IDF (the log of the number of documents divided by the number of documents containing the term).

TF-IDF does not incorporate similarity between different words.
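
To make the definition concrete, the sketch below computes the textbook weighting by hand on three invented documents. The corpus and the helper name `tfidf` are hypothetical and only illustrate the formula.

```python
import math
from collections import Counter

# Invented mini-corpus, for illustration only.
docs = [
    "bitcoin price up",
    "bitcoin price down",
    "buy bitcoin now",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in tokenised for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: raw term count times log(N / df)."""
    counts = Counter(doc)
    return {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}

weights = tfidf(tokenised[0])
print(weights)
```

A term that occurs in every document ("bitcoin") gets weight 0, while a term unique to one document ("up") gets the largest weight. scikit-learn's `TfidfTransformer` uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalises each row by default, so its values differ from this textbook version.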

In [3]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.model_selection import train_test_split

count = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# split data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

train_counts = count.fit_transform(X_train)
train_tfidf = tfidf_transformer.fit_transform(train_counts)

## Applying ML model to train sentiment tagger

We use a regression model because we are predicting sentiment on a continuous scale from -1 to 1.

In our testing, the *Ridge Regression* model performed best. It is well suited to estimating the coefficients of multiple-regression models when the independent variables are highly correlated.

In this model, the loss function is *linear least squares* and regularization is given by the *l2-norm*.
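
For reference, this objective can be written as below, where $X$ is the TF-IDF matrix, $y$ the vector of sentiment scores, $w$ the coefficient vector and $\alpha$ the regularization strength. This is the standard ridge formulation as documented for scikit-learn's `Ridge` (intercept omitted for brevity); the symbols are introduced here for illustration only.

$$
\min_{w} \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$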

In [4]:
from sklearn.linear_model import SGDRegressor, Ridge

ridge = Ridge().fit(train_tfidf, y_train)

#### Using MSE to score the model

In [5]:
from sklearn.metrics import mean_squared_error

X_test_V = count.transform(X_test)
X_t = tfidf_transformer.transform(X_test_V)
y_pred = ridge.predict(X_t)

mean_squared_error(y_test, y_pred)

0.045889882755098726
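
To help read this number, the sketch below computes MSE by hand on a few invented scores and then takes the square root of the value reported above, which expresses the error in the same units as the sentiment score. The helper name `mse` and the sample values are hypothetical.

```python
from math import sqrt

def mse(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented scores, for illustration only.
y_true = [0.0, 0.5, -0.5, 1.0]
y_pred = [0.1, 0.4, -0.3, 0.8]
example = mse(y_true, y_pred)
print(example)

# Square root of the test MSE reported above, in sentiment-score units.
rmse = sqrt(0.045889882755098726)
print(round(rmse, 3))
```

The root of the reported MSE is roughly 0.21, on a sentiment scale that runs from -1 to 1.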

#### Let's try to improve this regressor

In [6]:
ridge.get_params().keys()

dict_keys(['alpha', 'copy_X', 'fit_intercept', 'max_iter', 'normalize', 'positive', 'random_state', 'solver', 'tol'])

## Changing the Alpha parameter

Alpha is the constant that multiplies the L2 term, controlling regularization strength. 

Alpha must be a non-negative float.
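
The effect of alpha is easiest to see in the one-feature case without an intercept, where the ridge coefficient has the closed form `sum(x*y) / (sum(x*x) + alpha)`. The sketch below uses invented data and a hypothetical helper, `ridge_slope`, to show the coefficient shrinking towards zero as alpha grows.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge coefficient for one feature and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

# Invented data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for alpha in [0.0, 0.25, 1.0, 10.0]:
    print(alpha, round(ridge_slope(xs, ys, alpha), 3))
```

With `alpha = 0` the ordinary least-squares slope of 2 is recovered; larger values trade some fit for smaller coefficients.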

In [7]:
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

alpha = [0.1, 0.2, 0.25, 0.3, 0.5, 0.75, 1, 2, 5, 7.5, 10]

params = {"alpha": alpha}

rscv = RandomizedSearchCV(ridge, params, n_iter=9, cv=5, verbose=2, n_jobs=-1, random_state=42)
search = rscv.fit(train_tfidf, y_train)
search.best_params_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


{'alpha': 0.25}

In [8]:
ridge_opt = Ridge(alpha=0.25).fit(train_tfidf, y_train)
opt_pred = ridge_opt.predict(X_t)

mean_squared_error(y_test, opt_pred)

0.044991285484770945

Minor improvement (~ 0.0009)

# Applying model to API data

Reading in and formatting the tweets collected through the Twitter API

In [5]:
data = pd.read_csv("data/Btc_tweets_17_23.csv").drop(columns=["Unnamed: 0"])

updated = []
for tweet in data["tweet"]:
    updated.append((tweet[1:-1].replace(",", "").replace("'", "")))
    
api = pd.DataFrame(zip(updated, data["date"]), columns=["Tweet", "Date"])

display(api.head(5))
api.shape

Unnamed: 0,Tweet,Date
0,current stats of delegatedonthate block find p...,2022-05-19 23:59:59+00:00
1,bbcworld for all those who be new to this work...,2022-05-19 23:59:58+00:00
2,smilingpunks floor price no gas fee polygon b...,2022-05-19 23:59:56+00:00
3,i be claim my free lightning sat from bitcoine...,2022-05-19 23:59:56+00:00
4,washingtonpost for all those who be new to thi...,2022-05-19 23:59:54+00:00


(500000, 2)

Creating a new dataframe to store our tweets and their new sentiment tags

In [56]:
api_counts = count.transform(api["Tweet"])
api_tfidf = tfidf_transformer.transform(api_counts)

r_pred = ridge_opt.predict(api_tfidf)

api_tagged = pd.DataFrame(zip(api["Tweet"], api["Date"], r_pred), columns=["Tweet", "Date", "Sentiment"])
api_tagged.sample(5)

Unnamed: 0,Tweet,Date,Sentiment
41012,a legal case involve the hack of crypto platfo...,2022-05-19 13:19:09+00:00,-0.021642
137662,you know that all confidence be lose when you ...,2022-05-18 14:29:03+00:00,0.056309
308382,brucelaing backbitcoin rbw mariuscrypt several...,2022-05-21 20:55:47+00:00,-0.384532
432656,the european central bank will decide to raise...,2022-05-23 11:40:35+00:00,0.189848
56843,royalkeyafrica turkish lira be a brilliant exa...,2022-05-19 09:16:19+00:00,0.264485


## Format dates and send to CSV

In [57]:
import datetime as dt

api_tagged["Date"] = [d.date() for d in pd.to_datetime(api_tagged["Date"])]
api_tagged

Unnamed: 0,Tweet,Date,Sentiment
0,current stats of delegatedonthate block find p...,2022-05-19,-0.079393
1,bbcworld for all those who be new to this work...,2022-05-19,0.192822
2,smilingpunks floor price no gas fee polygon b...,2022-05-19,0.446237
3,i be claim my free lightning sat from bitcoine...,2022-05-19,0.333951
4,washingtonpost for all those who be new to thi...,2022-05-19,0.192822
...,...,...,...
499995,natashacryptous luna lunaterra lunaburn will h...,2022-05-22,0.215744
499996,cryptocurrencies bitcoin litecoin and ethereum...,2022-05-22,0.209185
499997,cryptocurrencies bitcoin litecoin and ethereum...,2022-05-22,0.209185
499998,this be a big opportunity.made by a very profe...,2022-05-22,0.799702
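
Reducing timestamps to calendar dates makes it possible to group tweets by day, which the per-day comparison described in the introduction relies on. The sketch below shows one way such an aggregation could look using only the standard library. The rows are invented, and taking the mean as the daily summary is an assumption rather than something stated in this notebook.

```python
from collections import defaultdict
from datetime import datetime

# Invented (timestamp, sentiment) pairs in the same timestamp format as above.
rows = [
    ("2022-05-19 23:59:59+00:00", 0.2),
    ("2022-05-19 08:15:00+00:00", 0.4),
    ("2022-05-20 12:00:00+00:00", -0.1),
]

by_day = defaultdict(list)
for stamp, score in rows:
    day = datetime.fromisoformat(stamp).date()  # drop the time of day
    by_day[day].append(score)

# Mean sentiment per calendar day.
daily_mean = {day: sum(scores) / len(scores) for day, scores in by_day.items()}
for day, mean in sorted(daily_mean.items()):
    print(day, round(mean, 3))
```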


In [58]:
api_tagged.to_csv("data/api_tagged.csv")

# Big dataset tagging

In [1]:
import pandas as pd

btc_tweets = pd.read_csv("data/BIG_btc_tweets.csv", low_memory=False)

In [2]:
btc_tweets = btc_tweets[['date', 'text']]
btc_tweets.head(5) 

Unnamed: 0,date,text
0,2021-02-10 23:59:04,Blue Ridge Bank shares halted by NYSE after #b...
1,2021-02-10 23:58:48,"😎 Today, that's this #Thursday, we will do a ""..."
2,2021-02-10 23:54:48,"Guys evening, I have read this article about B..."
3,2021-02-10 23:54:33,$BTC A big chance in a billion! Price: \487264...
4,2021-02-10 23:54:06,This network is secured by 9 508 nodes as of t...


In [3]:
sorted_tweets = btc_tweets.sort_values('date').dropna().reset_index(drop=True)

In [4]:
reduced = sorted_tweets.iloc[1200000:1999000]

In [8]:
dates = [d.date() for d in pd.to_datetime(reduced["date"])]

In [27]:
import re 

# Provide case insensitive data
text = reduced["text"].str.lower().astype(str)

# Take out links with or without www
text = text.apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
text = text.apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))

# Take out possible HTML character references 
text = text.apply(lambda x: re.sub(r'&[a-z]+;', '', x))

# Sometimes twitter data has links preprocessed into a reference such as {link}.
# This must run before the non-letter filter below, which would otherwise strip the braces.
text = text.apply(lambda x: re.sub(r'\{link\}', '', x))

# Take out non-letter characters except for spaces and sentence delimiters
text = text.apply(lambda x: re.sub(r"[^a-z\s.!?]", '', x))

# Noticed the dataset contains at user and url references so we can remove them
text = text.str.replace('url', '')
text = text.str.replace('atuser', '')

filtered = pd.DataFrame({"Date": dates, "Tweet": text})

filtered.head(5)

Unnamed: 0,Date,Tweet
1200000,2021-08-26,bitcointrustbct i hope its a great project. a...
1200001,2021-08-26,jobcashofficial good project\n\nairdropinside\...
1200002,2021-08-26,jobcashofficial nice project is implemented ve...
1200003,2021-08-26,bitcoin atms illinoisfunded coinflip new compl...
1200004,2021-08-26,what was the third part of the holy trinity? a...


In [29]:
filtered["Date"].unique()

array([datetime.date(2021, 8, 26), datetime.date(2021, 9, 10),
       datetime.date(2021, 10, 18), datetime.date(2021, 10, 19),
       datetime.date(2021, 10, 20), datetime.date(2021, 10, 21),
       datetime.date(2021, 10, 22), datetime.date(2021, 10, 23),
       datetime.date(2021, 10, 27), datetime.date(2021, 10, 28),
       datetime.date(2021, 10, 29), datetime.date(2021, 11, 4),
       datetime.date(2021, 11, 5), datetime.date(2021, 11, 6),
       datetime.date(2021, 11, 11), datetime.date(2021, 11, 12),
       datetime.date(2021, 11, 18), datetime.date(2021, 11, 19),
       datetime.date(2021, 11, 24), datetime.date(2021, 11, 25),
       datetime.date(2021, 11, 26), datetime.date(2021, 12, 11),
       datetime.date(2021, 12, 17), datetime.date(2021, 12, 29),
       datetime.date(2021, 12, 30)], dtype=object)

In [30]:
import nltk
from nltk.tokenize import RegexpTokenizer

regextk = RegexpTokenizer(r'\s+', gaps=True)

reduced_tagged = [nltk.pos_tag(regextk.tokenize(tweet)) for tweet in filtered["Tweet"] if type(tweet) == str]

In [31]:
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

def wn_pos(tag):
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None

lem_tweets = []
lem = WordNetLemmatizer()

for tweet in reduced_tagged:
    lemmas = []
    
    for word, tag in tweet:
        wn_tag = wn_pos(tag)

        if wn_tag is not None:
            lemmas.append(lem.lemmatize(word, wn_tag))
        else:
            lemmas.append(lem.lemmatize(word))
                
    lem_tweets.append(lemmas)

In [32]:
joined = [" ".join(tweet) for tweet in lem_tweets]

big = pd.DataFrame({"Date" : reduced["date"], 
                    "Tweet" : joined})
big

Unnamed: 0,Date,Tweet
1200000,2021-08-26,bitcointrustbct i hope it a great project. and...
1200001,2021-08-26,jobcashofficial good project airdropinside pan...
1200002,2021-08-26,jobcashofficial nice project be implement very...
1200003,2021-08-26,bitcoin atm illinoisfunded coinflip new compli...
1200004,2021-08-26,what be the third part of the holy trinity? ah...
...,...,...
1998995,2021-12-30,georgehahn jerrysaltz ill be buy crypto!!! my ...
1998996,2021-12-30,the late bitcoin block with transaction be jus...
1998997,2021-12-30,saylor wonder what historical event happen in ...
1998998,2021-12-30,happy new year eve everyone xcad zil btc


In [33]:
big_counts = count.transform(big["Tweet"])
big_tfidf = tfidf_transformer.transform(big_counts)

r_pred = ridge_opt.predict(big_tfidf)

df = pd.DataFrame(zip(big["Tweet"], big["Date"], r_pred), columns=["Tweet", "Date", "Sentiment"])
df.sample(5)

Unnamed: 0,Tweet,Date,Sentiment
153270,a backtest of the old all time high allright l...,2021-10-21,-0.156123
135330,gdet add more s .s...yall know this be go to t...,2021-10-20,-0.071871
295611,crypto price usd bitcoin . ethereum . binance ...,2021-10-27,-0.145125
510185,you can do cloud mine use this site mining dog...,2021-11-11,0.068456
444095,thanks cathiedwood and charlieshrem for share ...,2021-11-05,0.685202


In [34]:
df.to_csv("data/big_tagged.csv")

In [13]:
import pandas as pd

old = pd.read_csv("data/proccessed_old.csv").drop(columns=["Unnamed: 0"])

updated = []
for tweet in old["0"]:
    updated.append((tweet[1:-1].replace(",", "").replace("'", "")))
    
old = pd.DataFrame(zip(updated, old["1"]), columns=["Tweet", "Date"])
old

Unnamed: 0,Tweet,Date
0,rt alxtoken paul krugman nobel luddite i have ...,0.000000
1,lopp kevinpham psychosage naval but proffaustu...,0.000000
2,rt tippereconomy another use case for blockcha...,0.136364
3,free coin,0.400000
4,rt payvxofficial we be happy to announce that ...,0.468182
...,...,...
50868,rt fixyapp fixy network bring popular cryptocu...,0.600000
50869,rt bethereumteam after a successful launch of ...,0.375000
50870,rt gymrewards buy gymrewards tokens bonus time...,0.000000
50871,i add a video to a youtube playlist how to bit...,0.400000


In [15]:
old_counts = count.transform(old["Tweet"])
old_tfidf = tfidf_transformer.transform(old_counts)

r_pred = ridge_opt.predict(old_tfidf)

df = pd.DataFrame(zip(old["Tweet"], old["Date"], r_pred), columns=["Tweet", "Date", "Sentiment"])
df.sample(5)

Unnamed: 0,Tweet,Date,Sentiment
10947,rt gymrewards ready for gymbase ico part of g...,0.2,0.736366
31753,name everex symbol evx hour change price ran...,-0.4,-0.142198
17872,rt etherspinio etherspin will support all cryp...,0.0,0.529898
12778,dutch court find bitcoin a legitimate transfer...,0.0,0.245588
36250,rt cryptoflix cryptoflix be feature at this list,0.0,-0.133766


In [16]:
df.to_csv("data/old_tagged.csv")