# Bitcoin price prediction based on *sentiment analysis* of related tweets


## Introduction
For our project, we have chosen to use twitter data to perform a sentiment analysis on users opinions about crypto currencies inorder to create a predictive model that relies on the sentiment to predict how different crpyto currencies may behave, or vice versa.

## Description
We begin the project by collecting our data

### Data
-----------
First, we selected the datasets we would use for the analysis as well as to train and test our models later on. We have chosen the following datasets for our analysis:

- Bitcoin_Sentiment Twitter dataset
  > We use data from the following source:https://www.kaggle.com/code/alexandrayuliu/bitcoin-tweets-sentiment-analysis/data as it is already sentiment tagged to train and test our model


### Research Question 
We chose to investigate how the price of Bitcoin may be affected by twitter sentiments about the currency based on a sentiment analysis model trained on the sentiment-tagged corpus and a final prediction model based on the sentiment model.

# Preprocessing of training data

#### Only to be run once, with results saved to CSV file for later access

In [2]:
import pandas as pd
sent_tweets = pd.read_csv("data/sent-tweets.csv").drop(columns=['user_location', 'user_description', 
                                                           'user_followers', 'user_friends',])
display(sent_tweets.sample(5))
sent_tweets.shape

Unnamed: 0,date,tweets,score
151559,2021-06-21 13:44:04,"$btc bitcoin lod AT_USER pretty close to 32,17...",0.4939
468569,2021-07-21 23:40:17,imagine all the people living for todayyyy oly...,0.0
10267,2021-02-08 14:26:53,deal! i will switch from porsche to tesla when...,-0.1759
194108,2021-06-22 13:34:38,the big boys are starting to buy the dip bitco...,0.0
474109,2021-07-22 05:01:02,"spacex owns bitcoin bitcoin going to the moon,...",0.0


(1519555, 3)

We chose to not remove the stopwords as Saif et al ( LREC 2014) suggests that it might degrade the quality of the classification.

On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter](http://www.lrec-conf.org/proceedings/lrec2014/pdf/292_Paper.pdf) (Saif et al., LREC 2014)

In [77]:
import re 

# Provide case insensitive data
sent_tweets["tweets"]=sent_tweets["tweets"].str.lower().astype(str)

# Take out links with or without www
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
sent_tweets["tweets"].apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))

# Take out possible HTML character references 
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r'&[a-z]+;', '', x))

# Take out nonletter characters except for spaces and sentence delimitators
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r"[^a-z\s.!?]", '', x))

# Sometimes twitter data has links preprocessed into a reference such as {link}
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r'{link}', '', x))

# I noticed the dataset contains at user and url references so we can remove them

sent_tweets["tweets"]= sent_tweets["tweets"].str.replace('url', '')
sent_tweets["tweets"]= sent_tweets["tweets"].str.replace('atuser', '')

sent_tweets.head(5)

Unnamed: 0,date,tweets,score
0,2021-02-05 10:52:04,right here w referral code,0.0
1,2021-02-05 10:52:04,please donate bitcoin donate challenge from ...,0.6597
2,2021-02-05 10:52:06,sos market cap is million. if theyre mining ....,0.0
3,2021-02-05 10:52:07,bitcoin btc current price gbp like my updates...,0.3612
4,2021-02-05 10:52:26,right here w referral code helpmehelpyou,0.0


In [40]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, TweetTokenizer

tweets = list(zip(sent_tweets["tweets"], sent_tweets["score"]))

regextk = RegexpTokenizer('\s+', gaps=True)
tweettk = TweetTokenizer()

tokens = [(regextk.tokenize(tweet), sentiment) for (tweet, sentiment) in tweets if type(tweet) == str]

filtered = []
for tweet in tokens:
    new = []
    for tok in tweet[0]:
        if tok != "AT_USER" and tok != "URL":
            new.append(tok)
            
    filtered.append((new, tweet[1]))

tagged = [(nltk.pos_tag(tweet), sentiment) for tweet, sentiment in filtered]

tagged[3]

([('bitcoin', 'NN'),
  ('btc', 'NN'),
  ('current', 'JJ'),
  ('price', 'NN'),
  ('gbp', 'NN'),
  ('like', 'IN'),
  ('my', 'PRP$'),
  ('updates?', 'NN'),
  ('you', 'PRP'),
  ('can', 'MD'),
  ('tip', 'VB'),
  ('me', 'PRP'),
  ('at', 'IN'),
  ('ldztlqrcxnpnvgfnrbazuqvmrz', 'NN')],
 0.3612)

In [41]:
import string
from nltk.corpus import wordnet as wn

def wn_pos(tag):
    "converts treebank tags into wordbank tags for lemmatization"
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None

lem_tweets = []
lem = WordNetLemmatizer()

for tweet in tagged:
    lemmas = []
    
    for word, tag in tweet[0]:
        wn_tag = wn_pos(tag)
        
        if word[-1] in string.punctuation:
                word = word[:-1]

        if wn_pos(tag) is not None:
            lemmas.append(lem.lemmatize(word, wn_tag))
        else:
            lemmas.append(lem.lemmatize(word))
                
    lem_tweets.append((lemmas, tweet[1]))

lemmas = [lem for tweet in lem_tweets for lem in tweet]

lem_tweets[3]

(['bitcoin',
  'btc',
  'current',
  'price',
  'gbp',
  'like',
  'my',
  'update',
  'you',
  'can',
  'tip',
  'me',
  'at',
  'ldztlqrcxnpnvgfnrbazuqvmrz'],
 0.3612)

In [42]:
pd.DataFrame(lem_tweets).to_csv("lem_tweets.csv")

# End of training data preprocessing

# Begin training and apply sentiment tags to new data

In [33]:
import pandas as pd
import numpy as np

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(1)
    return df[indices_to_keep]

lem_tweets = pd.read_csv("data/lem_tweets.csv").drop(columns="Unnamed: 0").rename(columns={"0":"Tweet", "1":"Sentiment"})

new = []
for tweet in lem_tweets["Tweet"]:
    new.append((tweet[1:-1].replace(",", "").replace("'", "")))
    
data = pd.DataFrame(zip(new, lem_tweets["Sentiment"]), columns=["Tweet", "Sentiment"])
data = clean_dataset(data)

X = data['Tweet'].astype(str)
y = data['Sentiment']

In [34]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.model_selection import train_test_split

count = CountVectorizer()
tfidf_transformer = TfidfTransformer()

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

train_counts = count.fit_transform(X_train)
train_tfidf = tfidf_transformer.fit_transform(train_counts)

#### Ridge regression model

In [35]:
from sklearn.linear_model import SGDRegressor, Ridge
from sklearn.svm import SVC

ridge = Ridge().fit(train_tfidf, y_train)
#svm = SVC(C=1.0, kernel='linear', degree=3, gamma='auto')

#### Using MSE for accuracy scoring

In [36]:
from sklearn.metrics import mean_squared_error

X_test_V = count.transform(X_test)
X_t = tfidf_transformer.transform(X_test_V)
y_pred = ridge.predict(X_t)

mean_squared_error(y_test, y_pred)

0.044348873137165935

## Import API data

In [37]:
data = pd.read_csv("data/Btc_tweets.csv").drop(columns=["Unnamed: 0"])

updated = []
for tweet in data["tweet"]:
    updated.append((tweet[1:-1].replace(",", "").replace("'", "")))
    
api = pd.DataFrame(zip(updated, data["date"]), columns=["Tweet", "Date"])
api.head(5)

Unnamed: 0,Tweet,Date
0,current stats of delegatedonthate block find p...,2022-05-19 23:59:59+00:00
1,bbcworld for all those who be new to this work...,2022-05-19 23:59:58+00:00
2,smilingpunks floor price no gas fee polygon b...,2022-05-19 23:59:56+00:00
3,i be claim my free lightning sat from bitcoine...,2022-05-19 23:59:56+00:00
4,washingtonpost for all those who be new to thi...,2022-05-19 23:59:54+00:00


In [38]:
api_counts = count.transform(api["Tweet"])
api_tfidf = tfidf_transformer.transform(api_counts)

r_pred = ridge.predict(api_tfidf)

df = pd.DataFrame(zip(api["Tweet"], api["Date"], r_pred), columns=["Tweet", "Date", "Predicted Sentiment"])

display(df.sample(5))

Unnamed: 0,Tweet,Date,Predicted Sentiment
2280,just because youre stream on twitch doesnt mea...,2022-05-19 23:10:59+00:00,0.062029
41785,bitcoin ethereum cardano solana lose up to ter...,2022-05-19 13:09:39+00:00,-0.248105
91213,may utc . . bitcoin btc btc crypto financia...,2022-05-18 21:30:01+00:00,0.198276
83606,bitcoin currency be it even a good idea,2022-05-18 23:58:04+00:00,0.4814
51311,despite the crypto crash bitcoin still have a ...,2022-05-19 10:52:06+00:00,-0.020306


Let's try to improve this regressor

In [39]:
ridge.get_params().keys()

dict_keys(['alpha', 'copy_X', 'fit_intercept', 'max_iter', 'normalize', 'positive', 'random_state', 'solver', 'tol'])

In [40]:
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

alpha = [x for x in [0.1, 0.2, 0.25, 0.3, 0.5, 0.75, 1, 2, 5, 7.5, 10]]

params = {"alpha": alpha}

rscv = RandomizedSearchCV(ridge, params, n_iter=9, cv=5, verbose=2, n_jobs=-1, random_state=42)
search = rscv.fit(train_tfidf, y_train)
search.best_params_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


{'alpha': 0.25}

In [45]:
ridge_opt = Ridge(alpha=0.25).fit(train_tfidf, y_train)
y_opt_pred = ridge_opt.predict(X_t)

df = pd.DataFrame(zip(api["Tweet"], api["Date"], y_opt_pred), columns=["Tweet", "Date", "Sentiment"])
df.sample(5)

Unnamed: 0,Tweet,Date,Sentiment
1516,free nft giveaway rt follow nftholding nftho...,2022-05-19 23:30:00+00:00,-0.151003
673,lorigubbio lorenzojova faracifra hello im kait...,2022-05-19 23:48:11+00:00,-0.084975
1140,apompliano i calculate the liquidity pool and ...,2022-05-19 23:38:55+00:00,0.053416
362,wvbcaa coltrincrist matej saylor jordanbpeters...,2022-05-19 23:53:32+00:00,0.242719
1511,usd en elsalvador bitcoinday te dieren sat ah...,2022-05-19 23:30:01+00:00,0.032646


In [46]:
mean_squared_error(y_test, y_opt_pred)

0.04252225443446537

Very slight improvement (~ 0.002)

## Splitting data

In [47]:
import datetime as dt

df["Date"] = [d.date() for d in pd.to_datetime(df["Date"])]
df.head(5)

Unnamed: 0,Tweet,Date,Sentiment
0,current stats of delegatedonthate block find p...,2022-05-19,0.187431
1,bbcworld for all those who be new to this work...,2022-05-19,0.306517
2,smilingpunks floor price no gas fee polygon b...,2022-05-19,-0.009743
3,i be claim my free lightning sat from bitcoine...,2022-05-19,0.023495
4,washingtonpost for all those who be new to thi...,2022-05-19,-0.253645


In [48]:
df.to_csv("data/api_tagged.csv")