# Bitcoin price prediction based on *sentiment analysis* of related tweets


## Introduction
For our project, we have chosen to use twitter data to perform a sentiment analysis on users opinions about crypto currencies inorder to create a predictive model that relies on the sentiment to predict how different crpyto currencies may behave, or vice versa.

## Description
We begin the project by collecting our data

### Data
-----------
First, we selected the datasets we would use for the analysis as well as to train and test our models later on. We have chosen the following datasets for our analysis:

- Bitcoin_Sentiment Twitter dataset
  > We use data from the following source:https://www.kaggle.com/code/alexandrayuliu/bitcoin-tweets-sentiment-analysis/data as it is already sentiment tagged to train and test our model


### Research Question 
We chose to investigate how the price of Bitcoin may be affected by twitter sentiments about the currency based on a sentiment analysis model trained on the sentiment-tagged corpus and a final prediction model based on the sentiment model.

# Preprocessing of training data

#### Only to be run once, with results saved to CSV file for later access

In [69]:
import pandas as pd
sent_tweets = pd.read_csv("data/sent-tweets.csv").drop(columns=['user_location', 'user_description', 
                                                           'user_followers', 'user_friends',])
sent_tweets

Unnamed: 0,date,tweets,score
0,2021-02-05 10:52:04,AT_USER AT_USER AT_USER right here w/ AT_USER ...,0.0000
1,2021-02-05 10:52:04,AT_USER AT_USER please donate bitcoin19 donate...,0.6597
2,2021-02-05 10:52:06,$sos market cap is 308 million. if they’re min...,0.0000
3,2021-02-05 10:52:07,"bitcoin btc current price (gbp): £34,880 like ...",0.3612
4,2021-02-05 10:52:26,AT_USER right here w/ AT_USER URL referral cod...,0.0000
...,...,...,...
1519550,2021-10-29 23:59:45,bitcoin &amp; binary trading invest with us ✔️...,0.2732
1519551,2021-10-29 23:59:49,,
1519552,2021-10-29 23:59:51,,
1519553,2021-10-29 23:59:53,join me at bybit and get a $20 bonus in usdt! ...,0.9259


We chose to not remove the stopwords as Saif et al ( LREC 2014) suggests that it might degrade the quality of the classification.

On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter](http://www.lrec-conf.org/proceedings/lrec2014/pdf/292_Paper.pdf) (Saif et al., LREC 2014)

In [77]:
import re 

# Provide case insensitive data
sent_tweets["tweets"]=sent_tweets["tweets"].str.lower().astype(str)

# Take out links with or without www
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
sent_tweets["tweets"].apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))

# Take out possible HTML character references 
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r'&[a-z]+;', '', x))

# Take out nonletter characters except for spaces and sentence delimitators
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r"[^a-z\s.!?]", '', x))

# Sometimes twitter data has links preprocessed into a reference such as {link}
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r'{link}', '', x))

# I noticed the dataset contains at user and url references so we can remove them

sent_tweets["tweets"]= sent_tweets["tweets"].str.replace('url', '')
sent_tweets["tweets"]= sent_tweets["tweets"].str.replace('atuser', '')

sent_tweets.head(5)

Unnamed: 0,date,tweets,score
0,2021-02-05 10:52:04,right here w referral code,0.0
1,2021-02-05 10:52:04,please donate bitcoin donate challenge from ...,0.6597
2,2021-02-05 10:52:06,sos market cap is million. if theyre mining ....,0.0
3,2021-02-05 10:52:07,bitcoin btc current price gbp like my updates...,0.3612
4,2021-02-05 10:52:26,right here w referral code helpmehelpyou,0.0


In [40]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, TweetTokenizer

tweets = list(zip(sent_tweets["tweets"], sent_tweets["score"]))

regextk = RegexpTokenizer('\s+', gaps=True)
tweettk = TweetTokenizer()

tokens = [(regextk.tokenize(tweet), sentiment) for (tweet, sentiment) in tweets if type(tweet) == str]

filtered = []
for tweet in tokens:
    new = []
    for tok in tweet[0]:
        if tok != "AT_USER" and tok != "URL":
            new.append(tok)
            
    filtered.append((new, tweet[1]))

tagged = [(nltk.pos_tag(tweet), sentiment) for tweet, sentiment in filtered]

tagged[3]

([('bitcoin', 'NN'),
  ('btc', 'NN'),
  ('current', 'JJ'),
  ('price', 'NN'),
  ('gbp', 'NN'),
  ('like', 'IN'),
  ('my', 'PRP$'),
  ('updates?', 'NN'),
  ('you', 'PRP'),
  ('can', 'MD'),
  ('tip', 'VB'),
  ('me', 'PRP'),
  ('at', 'IN'),
  ('ldztlqrcxnpnvgfnrbazuqvmrz', 'NN')],
 0.3612)

In [41]:
import string
from nltk.corpus import wordnet as wn

def wn_pos(tag):
    "converts treebank tags into wordbank tags for lemmatization"
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None

lem_tweets = []
lem = WordNetLemmatizer()

for tweet in tagged:
    lemmas = []
    
    for word, tag in tweet[0]:
        wn_tag = wn_pos(tag)
        
        if word[-1] in string.punctuation:
                word = word[:-1]

        if wn_pos(tag) is not None:
            lemmas.append(lem.lemmatize(word, wn_tag))
        else:
            lemmas.append(lem.lemmatize(word))
                
    lem_tweets.append((lemmas, tweet[1]))

lemmas = [lem for tweet in lem_tweets for lem in tweet]

lem_tweets[3]

(['bitcoin',
  'btc',
  'current',
  'price',
  'gbp',
  'like',
  'my',
  'update',
  'you',
  'can',
  'tip',
  'me',
  'at',
  'ldztlqrcxnpnvgfnrbazuqvmrz'],
 0.3612)

In [42]:
pd.DataFrame(lem_tweets).to_csv("lem_tweets.csv")

# End of training data preprocessing

# Begin training and apply sentiment tags to new data

In [2]:
import pandas as pd

lem_tweets = pd.read_csv("data/lem_tweets.csv").drop(columns="Unnamed: 0").rename(columns={"0":"Tweet", "1":"Sentiment"})

new = []
for tweet in lem_tweets["Tweet"]:
    new.append((tweet[1:-1].replace(",", "").replace("'", "")))
    
train = pd.DataFrame(zip(new, lem_tweets["Sentiment"]), columns=["Tweet", "Sentiment"])
train

Unnamed: 0,Tweet,Sentiment
0,right here w referral code 35002540…,0.0000
1,please donate bitcoin19 donate challenge from ...,0.6597
2,$sos market cap be 308 million if they’re mini...,0.0000
3,bitcoin btc current price (gbp) £34880 like my...,0.3612
4,right here w referral code 71559444 helpmehelp...,0.0000
...,...,...
9995,elon musk again bitcoin hit $43k all-time hig...,0.0000
9996,crazy when binance freeze bybit work normally ...,-0.4003
9997,bitcoin hit $44000 bitcoin btc btc dogecoins e...,0.0000
9998,how many short be liquidated btc elonmusk bitc...,0.0000


In [3]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

count = CountVectorizer()
tfidf_transformer = TfidfTransformer()

train_counts = count.fit_transform(train["Tweet"])
train_tfidf = tfidf_transformer.fit_transform(train_counts)

In [11]:
import numpy as np
from sklearn.linear_model import SGDRegressor, Ridge

ridge = Ridge().fit(train_tfidf, train["Sentiment"])


## Import API data

In [8]:
data = pd.read_csv("data/Btc_tweets.csv").drop(columns=["Unnamed: 0"])

updated = []
for tweet in data["tweet"]:
    updated.append((tweet[1:-1].replace(",", "").replace("'", "")))
    
api = pd.DataFrame(zip(updated, data["date"]), columns=["Tweet", "Date"])
api.head(5)

Unnamed: 0,Tweet,Date
0,current stats of delegatedonthate block find p...,2022-05-19 23:59:59+00:00
1,bbcworld for all those who be new to this work...,2022-05-19 23:59:58+00:00
2,smilingpunks floor price no gas fee polygon b...,2022-05-19 23:59:56+00:00
3,i be claim my free lightning sat from bitcoine...,2022-05-19 23:59:56+00:00
4,washingtonpost for all those who be new to thi...,2022-05-19 23:59:54+00:00


In [19]:
api_counts = count.transform(api["Tweet"])
api_tfidf = tfidf_transformer.transform(api_counts)

r_pred = ridge.predict(api_tfidf)

df = pd.DataFrame(zip(api["Tweet"], api["Date"], r_pred), columns=["Tweet", "Date", "Predicted Sentiment"])

display(df.sample(5))

Unnamed: 0,Tweet,Date,Predicted Sentiment
87902,peterschiff for a guy that hat bitcoin you sur...,2022-05-18 22:28:57+00:00,0.466427
24154,slippage crypto cryptonews cryptocurrency cryp...,2022-05-19 17:10:17+00:00,0.060184
80542,jack dorsey affirms block bitcoin future and d...,2022-05-19 01:01:11+00:00,0.145125
17767,blocktrainer bitcoin be a runaway freight train,2022-05-19 18:42:04+00:00,-0.092451
62140,bitcoin and ethereum trend low altcoins slide ...,2022-05-19 07:50:14+00:00,-0.126398


Let's try to improve this regressor

In [22]:
ridge.get_params().keys()

dict_keys(['alpha', 'copy_X', 'fit_intercept', 'max_iter', 'normalize', 'positive', 'random_state', 'solver', 'tol'])

In [32]:
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

alpha = [x for x in [0.1, 0.2, 0.25, 0.3, 0.5, 0.75, 1, 2, 5, 7.5, 10]]

params = {"alpha": alpha}

rscv = RandomizedSearchCV(ridge, params, n_iter=9, cv=5, verbose=2, n_jobs=-1, random_state=42)
search = rscv.fit(train_tfidf, train["Sentiment"])
search.best_params_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


{'alpha': 0.25}

In [60]:
ridge_opt = Ridge(alpha=0.25).fit(train_tfidf, train["Sentiment"])
r_opt_pred = ridge.predict(api_tfidf)
df = pd.DataFrame(zip(api["Tweet"], api["Date"], r_opt_pred), columns=["Tweet", "Date", "Sentiment"])
df.sample(5)

Unnamed: 0,Tweet,Date,Sentiment
2052,cryptoa let go bitcoin,2022-05-19 23:16:19+00:00,0.028651
5974,samcallah which one alt be you say outperforms...,2022-05-19 21:58:15+00:00,0.097207
69815,i be claim my free lightning sat from bitcoine...,2022-05-19 05:16:50+00:00,0.380444
2824,bitcoin price prediction price fall in min ta...,2022-05-19 23:01:10+00:00,-0.049248
79721,bitcoin bottom price,2022-05-19 01:18:08+00:00,-0.119115


### testing

In [61]:
df[df["Sentiment"] < -0.2]

Unnamed: 0,Tweet,Date,Sentiment
27,california law wont ban offshore drilling afte...,2022-05-19 23:59:19+00:00,-0.582496
36,updated bitcoin transaction fee bch next block...,2022-05-19 23:59:12+00:00,-0.234924
93,therealkiyosaki what why you not say bitcoi...,2022-05-19 23:58:13+00:00,-0.211711
122,bookwhiz pierrepoilievre no body own bitcoin ...,2022-05-19 23:57:32+00:00,-0.250647
140,autocontrolsys bitcoinmagazine totally not get...,2022-05-19 23:57:15+00:00,-0.201819
...,...,...,...
99922,bitcoin bearish signal whale ramp up dump cryp...,2022-05-18 19:14:41+00:00,-0.206141
99934,the fear greed index touch the low of the last...,2022-05-18 19:14:33+00:00,-0.341533
99966,bitcoin loss cost country ten of million of do...,2022-05-18 19:13:54+00:00,-0.210591
99968,elonmusk now fuck up their shit and buy b of b...,2022-05-18 19:13:54+00:00,-0.397400


## Splitting data

In [65]:
import datetime as dt

df["Date"] = [d.date() for d in pd.to_datetime(df["Date"])]
df.head(5)

Unnamed: 0,Tweet,Date,Sentiment
0,current stats of delegatedonthate block find p...,2022-05-19,-0.030147
1,bbcworld for all those who be new to this work...,2022-05-19,0.252789
2,smilingpunks floor price no gas fee polygon b...,2022-05-19,0.251752
3,i be claim my free lightning sat from bitcoine...,2022-05-19,0.380444
4,washingtonpost for all those who be new to thi...,2022-05-19,0.252789


In [68]:
df.to_csv("api_tagged.csv")