## Web-Crawling Twitter Data


### Instructions
----------------
We select the data we will use for the analysis later on.
We have chosen the following datasets for our analysis:
- Bitcoin Twitter chatter dataset
  > We crawl this data ourselves and use it only to attempt to predict the Bitcoin price from the sentiment of the tweets.
  
  
**Please do not repeat the steps in this section until after the preprocessing.**
### Research Question 
We chose to investigate how the price of Bitcoin may be affected by Twitter sentiment about the currency, using a sentiment analysis model trained on the UCC corpus and a final prediction model built on the output of the sentiment model.

## Preprocessing 
-------------------------

> For our project, we perform a sentiment analysis on tweets related to cryptocurrencies and use this analysis to predict how the currencies will vary depending on the sentiment.

> Since we are only interested in tweets related to Bitcoin, we specify a keyword filter and remove the tweets that do not contain any of the words in the filter.

> After that, we run a sentiment analysis using pre-trained models to see whether we can accurately predict the sentiment of the tweets.

> The models are trained on the UCC (Unhealthy Comments Corpus) mentioned above, which contains over 40,000 online comments that have been tagged with sentiment values.
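
The keyword filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the filter used in this notebook: the `keywords` list and the `keep_bitcoin_tweets` helper are hypothetical names, and the sample tweets are invented.

```python
import re
import pandas as pd

# Hypothetical keyword list; adjust it to the terms of interest.
keywords = ["bitcoin", "btc"]

def keep_bitcoin_tweets(df, column="text"):
    """Return only the rows whose text mentions one of the keywords as a whole word."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    mask = df[column].str.contains(pattern, case=False, regex=True, na=False)
    return df[mask].reset_index(drop=True)

# Invented sample data for illustration only.
sample = pd.DataFrame({"text": ["Bitcoin is up today", "I like coffee", "selling my BTC"]})
bitcoin_only = keep_bitcoin_tweets(sample)
print(bitcoin_only["text"].tolist())
```

Matching on whole words keeps unrelated words that merely contain "btc" or "bitcoin" as a substring out of the result.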

In [36]:
import pandas as pd
import numpy as np
import tweepy as tw 
from tqdm import tqdm
from IPython.display import clear_output


In [37]:
import configparser
config = configparser.ConfigParser(interpolation=None)
config.read("conf.conf")

['conf.conf']
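
For readers reproducing this step, the layout that `conf.conf` needs can be inferred from the keys the notebook looks up. The sketch below parses an equivalent file from a string; the values are placeholders, and the name `example` is only used here so that the real `config` object is not overwritten.

```python
import configparser

# Placeholder values only; real credentials belong in conf.conf,
# which should be kept out of version control.
example_conf = """
[twitter]
api_key = YOUR_API_KEY
api_key_secret = YOUR_API_KEY_SECRET
bearer_token = YOUR_BEARER_TOKEN
"""

# interpolation=None stops configparser from treating "%" in a token as special syntax.
example = configparser.ConfigParser(interpolation=None)
example.read_string(example_conf)

# The same kind of lookup the notebook performs on the real file.
print(sorted(example["twitter"].keys()))
```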

In [38]:
consumer_api_key = config['twitter']['api_key']
consumer_api_secret = config['twitter']['api_key_secret']

In [39]:
auth = tw.OAuthHandler(consumer_api_key, consumer_api_secret)
api = tw.API(auth, wait_on_rate_limit=True)
client = tw.Client(bearer_token=config['twitter']['bearer_token'])  # tweepy is imported as tw


In [40]:
search_words = ["Bitcoin -filter:retweets","bitcoin -filter:retweets","Btc -filter:retweets","btc -filter:retweets",
                "cryptocurrency -filter:retweets"]


In [41]:
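date_since = "2022-05-12"
date_until = "2022-05-19"

# Collect tweets with the v1.1 search endpoint: tw.Cursor only paginates tw.API
# methods, and the Status objects it yields carry the created_at, text and
# retweeted attributes used later on. `since` is not an accepted API parameter
# (hence the "Unexpected parameter: since" warning in the logged output), so the
# start date goes into the query as a search operator instead.
tweets = tw.Cursor(api.search_tweets,  # named api.search before tweepy 4.0
                   q=f"{search_words[0]} since:{date_since}",
                   until=date_until,
                   count=100,
                   ).items(50000)  # return at most 50,000 tweets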


In [42]:
# Tweet retrieval
tweets_copy = []
for tweet in tqdm(tweets):
    tweets_copy.append(tweet)
    clear_output()

46224it [2:00:28, 15.48it/s]Unexpected parameter: since
46224it [2:00:28,  6.39it/s]


### Checking we have received the right number of Tweets

In [57]:
print(f"New tweets retrieved: {len(tweets_copy)}")

New tweets retrieved: 46224


In [58]:
tweets_df=pd.DataFrame()

In [59]:
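# DataFrame.append was removed in pandas 2.0, so build one single-row frame per
# tweet and concatenate them once (every row keeps index 0, as in the original loop).
tweets_df = pd.concat(
    [pd.DataFrame({'date': tweet.created_at,
                   'text': tweet.text,
                   'is_retweet': tweet.retweeted}, index=[0])
     for tweet in tqdm(tweets_copy)]
)
clear_output()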

In [74]:
tweets_df.head()

Unnamed: 0,date,text,is_retweet
0,2022-05-18 23:59:49+00:00,Take this as a lesson. Anything offering 20%+...,False
0,2022-05-18 23:59:42+00:00,Walkfivemiles found #bitcoin in a User vault a...,False
0,2022-05-18 23:59:39+00:00,Be sure to join this AMA with @OsoPowerful and...,False
0,2022-05-18 23:59:37+00:00,#BTC #Bitcoin Top analyst price target for ne...,False
0,2022-05-18 23:59:35+00:00,Going down the #bitcoin rabbit hole ensures ma...,False


In [75]:
print(len(tweets_df))

46224


### Removing Retweets, Filtering and Lemmatizing Tweets


In [76]:
tweets_df.to_csv('btc_past7.csv',index=False)

In [77]:
btc_raw = pd.read_csv("btc_past7.csv")

In [90]:
btc_raw.drop(btc_raw.index[btc_raw['is_retweet'] == 1], inplace=True)

In [93]:
btc_raw.head()

Unnamed: 0,date,text,is_retweet
0,2022-05-18 23:59:49+00:00,Take this as a lesson. Anything offering 20%+...,False
1,2022-05-18 23:59:42+00:00,Walkfivemiles found #bitcoin in a User vault a...,False
2,2022-05-18 23:59:39+00:00,Be sure to join this AMA with @OsoPowerful and...,False
3,2022-05-18 23:59:37+00:00,#BTC #Bitcoin Top analyst price target for ne...,False
4,2022-05-18 23:59:35+00:00,Going down the #bitcoin rabbit hole ensures ma...,False


In [94]:
print(len(btc_raw))

46224
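
The row count stays at 46224, so the `is_retweet` flag removed nothing. One possible reason, as far as we understand the v1.1 API, is that `retweeted` reports whether the authenticated account retweeted a tweet rather than whether the tweet is itself a retweet. A hedged alternative, sketched on invented rows, is to look for the `RT @` prefix that the v1.1 API places at the start of retweet text; the names `sample`, `is_rt` and `originals` are illustrative.

```python
import pandas as pd

# Invented rows in the same shape as btc_past7.csv.
sample = pd.DataFrame({
    "text": ["RT @someone: Bitcoin to the moon", "Bitcoin dipped again today"],
    "is_retweet": [False, False],
})

# Heuristic: treat text beginning with "RT @" as a retweet.
is_rt = sample["text"].str.startswith("RT @")
originals = sample[~is_rt].reset_index(drop=True)
print(len(sample), "->", len(originals))
```

This is a heuristic only; it would also drop an original tweet that happens to start with `RT @`.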


In [95]:
import re 
filtered_btc=btc_raw.drop(columns='is_retweet')

# Make the text case-insensitive (cast to str first so missing values do not break .str)
filtered_btc["text"] = filtered_btc["text"].astype(str).str.lower()

# Take out links with or without www
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))

# Take out possible HTML character references
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r'&[a-z]+;', '', x))

# Sometimes Twitter data has links preprocessed into a reference such as {link};
# remove these before the braces are stripped by the next step
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r'\{link\}', '', x))

# Take out non-letter characters except for whitespace and sentence delimiters
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r"[^a-z\s.!?]", '', x))

# The dataset also contains "atuser" and "url" placeholders, so we remove them
filtered_btc["text"] = filtered_btc["text"].str.replace('url', '', regex=False)
filtered_btc["text"] = filtered_btc["text"].str.replace('atuser', '', regex=False)


In [96]:
filtered_btc.head()

Unnamed: 0,date,text
0,2022-05-18 23:59:49+00:00,take this as a lesson. anything offering apy...
1,2022-05-18 23:59:42+00:00,walkfivemiles found bitcoin in a user vault at...
2,2022-05-18 23:59:39+00:00,be sure to join this ama with osopowerful and ...
3,2022-05-18 23:59:37+00:00,btc bitcoin top analyst price target for next...
4,2022-05-18 23:59:35+00:00,going down the bitcoin rabbit hole ensures man...
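
To check the cleaning steps on a single string, they can be gathered into one function. This is a sketch rather than the exact code of the cell above: `clean_tweet` is a hypothetical helper, the `www` pattern is simplified, the final whitespace collapse is an extra step, and the sample tweet is invented.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip links, HTML entities and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)    # links with a scheme
    text = re.sub(r"www\.\S+", "", text)        # links starting with www.
    text = re.sub(r"&[a-z]+;", "", text)        # HTML character references such as &amp;
    text = re.sub(r"[^a-z\s.!?]", "", text)     # keep letters, whitespace and . ! ?
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace

cleaned = clean_tweet("Going down the #Bitcoin rabbit hole &amp; loving it! https://t.co/abc123")
print(cleaned)
```

Applying it to the dataframe would then be a single call such as `filtered_btc["text"].apply(clean_tweet)`.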


In [97]:
import nltk
from nltk.tokenize import TweetTokenizer

tweets = list(zip(filtered_btc["text"], filtered_btc["date"]))

tweet_tokenizer = TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)

tokens = [(tweet_tokenizer.tokenize(tweet), date) for (tweet, date) in tweets if isinstance(tweet, str)]

filtered = []
for tweet in tokens:
    new = []
    for tok in tweet[0]:
        if tok != "AT_USER" and tok != "URL":
            new.append(tok)
            
    filtered.append((new, tweet[1]))

tagged = [(nltk.pos_tag(tweet), date) for tweet, date in filtered]


In [111]:
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

def wn_pos(tag):
    "Converts Penn Treebank POS tags into WordNet POS tags for lemmatization"
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None

lem_tweets = []
lem = WordNetLemmatizer()

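for tweet in tagged:
    lemmas = []

    for word, tag in tweet[0]:
        wn_tag = wn_pos(tag)

        # Strip one trailing punctuation character (a token that is only
        # punctuation, such as ".", becomes an empty string)
        if word[-1] in string.punctuation:
            word = word[:-1]

        if wn_tag is not None:
            lemmas.append(lem.lemmatize(word, wn_tag))
        else:
            lemmas.append(lem.lemmatize(word))

    lem_tweets.append((lemmas, tweet[1]))

# Flat list of every lemma across all tweets
lemmas = [lemma for tweet_lemmas, _ in lem_tweets for lemma in tweet_lemmas]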

len(lem_tweets)

46224

In [112]:
lem_tweets = pd.DataFrame(lem_tweets, columns=['tweet', 'date'])
lem_tweets.to_csv("Btc_tweets.csv")

## Bitcoin Daily Price Data Pre-Processing

## Analysis
------------------

We begin the analysis by splitting the data by day; each day's tweets are then run through the model to produce a daily report based on the results.
- First, we create a report card that scores the performance of the model on each day's data and add this information to a dataframe in which each row represents a different day.
- Then we append to the same dataframe the change in the price of Bitcoin for each day, calculated as the difference between the opening price at midnight and the closing price at midnight 24 hours later.
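
The daily price change in the second step can be sketched as follows. The column names `Date`, `Open` and `Close` are an assumption about the layout of `BTC-USD.csv`, and the prices are invented round numbers, not real data.

```python
import pandas as pd

# Invented sample rows; Date, Open and Close are assumed column names.
prices = pd.DataFrame({
    "Date": ["2022-05-12", "2022-05-13", "2022-05-14"],
    "Open": [100.0, 110.0, 105.0],
    "Close": [110.0, 105.0, 105.0],
})
prices["Date"] = pd.to_datetime(prices["Date"]).dt.date

# Daily change: closing price minus opening price for the same day.
prices["change"] = prices["Close"] - prices["Open"]
print(prices[["Date", "change"]])
```

Keeping `Date` as a plain calendar date makes it straightforward to join this table with per-day tweet results later.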

In [113]:
from datetime import datetime
from datetime import timedelta  
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score


In [117]:
btc= pd.read_csv("Btc_tweets.csv").drop(columns='Unnamed: 0')

In [118]:
btc_price= pd.read_csv("BTC-USD.csv")


In [119]:
btc.head()

Unnamed: 0,tweet,date
0,"['take', 'this', 'a', 'a', 'lesson', '', 'anyt...",2022-05-18 23:59:49+00:00
1,"['walkfivemiles', 'find', 'bitcoin', 'in', 'a'...",2022-05-18 23:59:42+00:00
2,"['be', 'sure', 'to', 'join', 'this', 'ama', 'w...",2022-05-18 23:59:39+00:00
3,"['btc', 'bitcoin', 'top', 'analyst', 'price', ...",2022-05-18 23:59:37+00:00
4,"['go', 'down', 'the', 'bitcoin', 'rabbit', 'ho...",2022-05-18 23:59:35+00:00
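
As the output shows, the token lists come back from the CSV as strings. A minimal sketch of restoring them and splitting the tweets by day is given below; the two rows are invented, and the names `sample`, `day` and `tweets_per_day` are illustrative rather than part of the notebook.

```python
import ast
import pandas as pd

# Two invented rows in the same shape as Btc_tweets.csv after a CSV round trip.
sample = pd.DataFrame({
    "tweet": ["['btc', 'bitcoin', 'top']", "['go', 'down', 'the', 'bitcoin']"],
    "date": ["2022-05-18 23:59:37+00:00", "2022-05-17 08:15:00+00:00"],
})

# literal_eval safely turns the string "['a', 'b']" back into the list ['a', 'b'].
sample["tweet"] = sample["tweet"].apply(ast.literal_eval)
# Parse the timestamps and keep only the calendar day (UTC) for grouping.
sample["day"] = pd.to_datetime(sample["date"], utc=True).dt.date

# One entry per day, each holding the list of tokenized tweets for that day.
tweets_per_day = sample.groupby("day")["tweet"].apply(list)
print(tweets_per_day)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.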
