# Bitcoin Prediction based on Sentiment Analysis


## Introduction
For our project, we use Twitter data to perform sentiment analysis on users' opinions about cryptocurrencies, in order to build a predictive model that uses this sentiment to predict how different cryptocurrencies may behave.

## Description
We begin the project by collecting our data.

### Data
-----------
First, we selected the datasets we would use for the analysis as well as to train and test our models later on. We have chosen the following datasets for our analysis:

- Bitcoin_Sentiment Twitter dataset
  > We use the data from https://www.kaggle.com/code/alexandrayuliu/bitcoin-tweets-sentiment-analysis/data because it is already sentiment-tagged, which lets us train and test our model on it.

- UCC (The Unhealthy Comments Corpus)
  > This dataset can be obtained from: https://github.com/conversationai/unhealthy-conversations

### Research Question 
We investigate how the price of Bitcoin may be affected by Twitter sentiment about the currency, using a sentiment analysis model trained on the UCC corpus and a final prediction model based on that sentiment model.
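
One way to connect tweet-level sentiment to a price series is to average the scores per calendar day. The sketch below illustrates that idea only: `daily_mean_sentiment` is a hypothetical helper, the sample values are made up, and daily granularity is an assumption rather than something fixed by the project.

```python
from collections import defaultdict
from datetime import datetime

def daily_mean_sentiment(rows):
    """Average sentiment scores per calendar day.

    `rows` is an iterable of (timestamp_string, score) pairs, with timestamps
    formatted like "2021-02-05 10:52:04".
    """
    buckets = defaultdict(list)
    for timestamp, score in rows:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
        buckets[day].append(score)
    # Sort by day so the result can be lined up with a daily price series
    return {day: sum(scores) / len(scores) for day, scores in sorted(buckets.items())}

# Illustrative values only, not taken from the dataset
sample = [
    ("2021-02-05 10:52:04", 0.0),
    ("2021-02-05 10:52:07", 0.5),
    ("2021-02-06 09:00:00", -0.25),
]
daily = daily_mean_sentiment(sample)
```

The resulting mapping from day to mean score could then be joined with daily closing prices before fitting the prediction model.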

## Preprocessing

In [38]:
import pandas as pd
sent_tweets = pd.read_csv("sent-tweets.csv").drop(columns=['user_location', 'user_description', 
                                                           'user_followers', 'user_friends',])
sent_tweets.head(5)

| Unnamed: 0 | date | tweets | score |
|---|---|---|---|
| 0 | 2021-02-05 10:52:04 | AT_USER AT_USER AT_USER right here w/ AT_USER ... | 0.0 |
| 1 | 2021-02-05 10:52:04 | AT_USER AT_USER please donate bitcoin19 donate... | 0.6597 |
| 2 | 2021-02-05 10:52:06 | $sos market cap is 308 million. if they’re min... | 0.0 |
| 3 | 2021-02-05 10:52:07 | bitcoin btc current price (gbp): £34,880 like ... | 0.3612 |
| 4 | 2021-02-05 10:52:26 | AT_USER right here w/ AT_USER URL referral cod... | 0.0 |


Preprocessing the data

We chose not to remove stopwords, as Saif et al. (LREC 2014) suggest that removing them might degrade classification quality.

[On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter](http://www.lrec-conf.org/proceedings/lrec2014/pdf/292_Paper.pdf) (Saif et al., LREC 2014)
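
A small illustration of why stopword removal can hurt sentiment classification: many general-purpose stopword lists include negation words such as "not", so filtering can make sentences with opposite sentiment look identical. The stopword set below is a tiny hand-written stand-in (an assumption, not a real list), and `remove_stopwords` is a hypothetical helper.

```python
# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"this", "is", "a", "the", "not", "no"}

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword set."""
    return [tok for tok in tokens if tok not in STOPWORDS]

negative = "this coin is not good".split()
positive = "this coin is good".split()

filtered_negative = remove_stopwords(negative)
filtered_positive = remove_stopwords(positive)

# After filtering, both sentences reduce to the same tokens
print(filtered_negative, filtered_positive)
```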

In [39]:
import re 

# Lower-case the text so later matching is case-insensitive
# (astype(str) turns missing tweets into the string "nan")
sent_tweets["tweets"] = sent_tweets["tweets"].str.lower().astype(str)

# Take out links with or without www
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r"www\.\S+|\S+\.com\b\S*", '', x))

# Take out possible HTML character references
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r'&[a-z]+;', '', x))

# Sometimes Twitter data has links preprocessed into a reference such as {link};
# remove these before the braces are stripped by the next step
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r'\{link\}', '', x))

# Take out non-letter characters except for whitespace and sentence delimiters
sent_tweets["tweets"] = sent_tweets["tweets"].apply(lambda x: re.sub(r"[^a-z\s.!?]", '', x))

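# The dataset contains AT_USER and URL placeholders; after the steps above they
# appear as the tokens "atuser" and "url", so we remove them as whole words only

sent_tweets["tweets"] = sent_tweets["tweets"].str.replace(r'\burl\b', '', regex=True)
sent_tweets["tweets"] = sent_tweets["tweets"].str.replace(r'\batuser\b', '', regex=True)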

sent_tweets.head(5)

| Unnamed: 0 | date | tweets | score |
|---|---|---|---|
| 0 | 2021-02-05 10:52:04 | right here w referral code | 0.0 |
| 1 | 2021-02-05 10:52:04 | please donate bitcoin donate challenge from ... | 0.6597 |
| 2 | 2021-02-05 10:52:06 | sos market cap is million. if theyre mining .... | 0.0 |
| 3 | 2021-02-05 10:52:07 | bitcoin btc current price gbp like my updates... | 0.3612 |
| 4 | 2021-02-05 10:52:26 | right here w referral code helpmehelpyou | 0.0 |
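
The cleaning steps can also be collected into a single function that works on one string, which makes them easy to test in isolation. This is a sketch rather than the notebook's exact behavior: `clean_tweet` is a hypothetical name, the `www` pattern is simplified, and the final whitespace collapse is an extra step not present in the cell above.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the tweet cleaning steps to a single string."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)          # links with a scheme
    text = re.sub(r"www\.\S+", "", text)              # bare www links
    text = re.sub(r"\{link\}", "", text)              # {link} placeholders
    text = re.sub(r"&[a-z]+;", "", text)              # HTML character references
    text = re.sub(r"\b(?:at_user|url)\b", "", text)   # dataset placeholders
    text = re.sub(r"[^a-z\s.!?]", "", text)           # keep letters, whitespace, . ! ?
    return re.sub(r"\s+", " ", text).strip()          # collapse leftover whitespace

example = clean_tweet("AT_USER Check https://t.co/abc &amp; buy #Bitcoin now!! URL")
print(example)
```

Placeholders are removed before the non-letter filter here, so the underscore in `AT_USER` is still available for whole-word matching.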


Candidate classifiers:

- Linear SVC
- Multinomial naive Bayes
- Random forest

In [40]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, TweetTokenizer

tweets = list(zip(sent_tweets["tweets"], sent_tweets["score"]))

# Split on runs of whitespace; gaps=True means the pattern describes the separators
regextk = RegexpTokenizer(r'\s+', gaps=True)
tweettk = TweetTokenizer()

tokens = [(regextk.tokenize(tweet), sentiment) for (tweet, sentiment) in tweets if isinstance(tweet, str)]

# Drop any placeholder tokens that are still present
filtered = []
for toks, sentiment in tokens:
    new = [tok for tok in toks if tok != "AT_USER" and tok != "URL"]
    filtered.append((new, sentiment))

# Tag each token with its Penn Treebank part of speech
tagged = [(nltk.pos_tag(tweet), sentiment) for tweet, sentiment in filtered]

tagged[3]

([('bitcoin', 'NN'),
  ('btc', 'NN'),
  ('current', 'JJ'),
  ('price', 'NN'),
  ('gbp', 'NN'),
  ('like', 'IN'),
  ('my', 'PRP$'),
  ('updates?', 'NN'),
  ('you', 'PRP'),
  ('can', 'MD'),
  ('tip', 'VB'),
  ('me', 'PRP'),
  ('at', 'IN'),
  ('ldztlqrcxnpnvgfnrbazuqvmrz', 'NN')],
 0.3612)
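
Because the tokenizer splits on whitespace only, punctuation stays attached to the preceding word, which is why `'updates?'` appears as a single token above. The standard-library sketch below should behave similarly to `RegexpTokenizer(r'\s+', gaps=True)` for this purpose; `whitespace_tokenize` is a hypothetical stand-in, not part of NLTK.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace and drop empty strings."""
    return [tok for tok in re.split(r"\s+", text) if tok]

tokens = whitespace_tokenize("like my updates? you can tip me")
print(tokens)
```

Trailing punctuation therefore has to be handled in a later step, as the lemmatization cell does.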

In [41]:
import string
from nltk.corpus import wordnet as wn

def wn_pos(tag):
    """Convert a Penn Treebank tag into the matching WordNet POS tag for lemmatization."""
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None

lem_tweets = []
lem = WordNetLemmatizer()

for tweet in tagged:
    lemmas = []

    for word, tag in tweet[0]:
        wn_tag = wn_pos(tag)

        # Strip trailing punctuation such as the "?" in "updates?"
        word = word.rstrip(string.punctuation)
        if not word:
            continue

        if wn_tag is not None:
            lemmas.append(lem.lemmatize(word, wn_tag))
        else:
            lemmas.append(lem.lemmatize(word))

    lem_tweets.append((lemmas, tweet[1]))

# Flat list of every lemma across all tweets
lemmas = [lemma for tweet_lemmas, _ in lem_tweets for lemma in tweet_lemmas]

lem_tweets[3]

(['bitcoin',
  'btc',
  'current',
  'price',
  'gbp',
  'like',
  'my',
  'update',
  'you',
  'can',
  'tip',
  'me',
  'at',
  'ldztlqrcxnpnvgfnrbazuqvmrz'],
 0.3612)
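
The tag conversion used for lemmatization can be expressed as a lookup on the first letter of the Treebank tag. The sketch below uses the single-letter strings that the WordNet POS constants stand for (`'a'`, `'v'`, `'n'`, `'r'`), so it runs without NLTK; `treebank_to_wordnet` is a hypothetical name.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def treebank_to_wordnet(tag):
    """Return the WordNet POS letter for a Treebank tag, or None if unmapped."""
    return TREEBANK_PREFIX_TO_WORDNET.get(tag[:1])

for tag in ["NNS", "VBD", "JJ", "RB", "PRP$"]:
    print(tag, treebank_to_wordnet(tag))
```

Two caveats apply to this prefix rule: tags such as `RP` (particle) also start with `R` and are treated as adverbs, and tokens whose tag is unmapped are lemmatized with the lemmatizer's default part of speech, which is noun.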

In [42]:
pd.DataFrame(lem_tweets).to_csv("lem_tweets.csv")

The preprocessing above only needs to be run once; the lemmatized tweets are saved to `lem_tweets.csv`.

In [43]:
pd.read_csv("lem_tweets.csv")

| Unnamed: 0.1 | Unnamed: 0 | 0 | 1 |
|---|---|---|---|
| 0 | 0 | ['right', 'here', 'w', 'referral', 'code'] | 0.0000 |
| 1 | 1 | ['please', 'donate', 'bitcoin', 'donate', 'cha... | 0.6597 |
| 2 | 2 | ['so', 'market', 'cap', 'be', 'million', 'if',... | 0.0000 |
| 3 | 3 | ['bitcoin', 'btc', 'current', 'price', 'gbp', ... | 0.3612 |
| 4 | 4 | ['right', 'here', 'w', 'referral', 'code', 'he... | 0.0000 |
| ... | ... | ... | ... |
| 1519550 | 1519550 | ['bitcoin', 'binary', 'trading', 'invest', 'wi... | 0.2732 |
| 1519551 | 1519551 | ['nan'] | |
| 1519552 | 1519552 | ['nan'] | |
| 1519553 | 1519553 | ['join', 'me', 'at', 'bybit', 'and', 'get', 'a... | 0.9259 |
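
When the file is read back, the token lists arrive as strings such as `"['right', 'here']"` rather than as Python lists. A minimal way to recover them, assuming the cells were written with Python's list representation as shown above, is `ast.literal_eval`; `parse_token_list` is a hypothetical helper, and treating `['nan']` as an empty tweet is an assumption about how missing rows should be handled.

```python
import ast

def parse_token_list(cell):
    """Turn a string such as "['right', 'here']" back into a list of tokens."""
    tokens = ast.literal_eval(cell)
    # Rows that came from missing tweets hold the single token "nan"
    return [] if tokens == ["nan"] else tokens

parsed = parse_token_list("['right', 'here', 'w', 'referral', 'code']")
empty = parse_token_list("['nan']")
print(parsed, empty)
```

With pandas, the same function could be applied to the token column after loading, and rows with an empty list or missing score could be dropped before training.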


In [None]:
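from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

# `train` is expected to be a DataFrame with a "Tweet" text column
# Build the vocabulary and raw term counts from the training tweets
count = CountVectorizer()
train_counts = count.fit_transform(train["Tweet"])

# Reweight the counts by TF-IDF, fitting the IDF weights on the training set only
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)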
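
To make the weighting concrete, here is a pure-Python sketch of TF-IDF on tokenized documents. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to my understanding matches the default settings of `TfidfTransformer`; the function name `tfidf` and the three toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalization for tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Scale each document vector to unit length (empty documents stay empty)
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors

docs = [["bitcoin", "price", "up"], ["bitcoin", "price", "down"], ["bitcoin", "scam"]]
vectors = tfidf(docs)
print(vectors[0])
```

A term that occurs in every document, like "bitcoin" here, gets the lowest weight within each tweet, while rarer terms such as "up" or "scam" are weighted more heavily.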

In [None]:
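import numpy as np
from IPython.display import display
from sklearn.naive_bayes import MultinomialNB

# `train` and `val` are expected to be DataFrames with "Tweet" and "Sentiment" columns
# Fit multinomial naive Bayes on the TF-IDF features and the training labels
MNB = MultinomialNB().fit(train_tfidf, train["Sentiment"])

# Transform the validation tweets with the vocabulary and IDF weights fitted on the training set
val_counts = count.transform(val["Tweet"])
val_tfidf = tfidf_transformer.transform(val_counts)

mnb_pred = MNB.predict(val_tfidf)

# Show each validation tweet next to its predicted and actual label
MNB_df = pd.DataFrame(zip(val["Tweet"], mnb_pred, val["Sentiment"]), columns=["Tweet", "Predicted", "Actual"])

display(MNB_df)

# Accuracy is the fraction of validation tweets whose predicted label matches the actual one
print("Sentiment prediction accuracy:", np.mean(mnb_pred == val["Sentiment"]))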
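
The classifier expects discrete labels in the `Sentiment` column, whereas the tweet dataset loaded earlier carries a continuous `score`. One possible way to bridge the two is to bucket the score into three classes. This is a sketch under assumptions: `score_to_label` is a hypothetical helper, and the 0.05 threshold is an illustrative choice, not a value taken from the project.

```python
def score_to_label(score, threshold=0.05):
    """Map a continuous sentiment score to a discrete class label.

    Rows with a missing score should be dropped before calling this.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

labels = [score_to_label(s) for s in [0.6597, 0.0, 0.3612, -0.3]]
print(labels)
```

The threshold controls how wide the neutral band is, so it is worth checking the resulting class balance before training.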