## Web-Crawling Twitter Data


### Instructions
----------------
**Please do not repeat the steps in this section.**

We select the data that we will use for the analysis later on.
We have chosen the following dataset for our analysis:
- Bitcoin Twitter chatter dataset
  > We crawled this data ourselves and use it only to attempt to predict the Bitcoin price from the sentiment of the tweets.
  
  
### Research Question 
We chose to investigate how the price of Bitcoin may be affected by Twitter sentiment about the currency, using a sentiment analysis model trained on the UCC corpus and a final prediction model built on the output of the sentiment model.

## Preprocessing 
-------------------------

> For our project, we perform a sentiment analysis on tweets related to cryptocurrencies and use this analysis to predict how the currencies will vary depending on the sentiment.

> Since we are only interested in tweets that are related to Bitcoin, we specify a keyword filter and discard the tweets that do not contain any of the words in the filter.

> After that, we perform a sentiment analysis using pre-trained models to see whether we can accurately predict the sentiment of the tweets.

> The models used are trained on the UCC (Unhealthy Comments Corpus) mentioned above, which contains over 40,000 online comments that have been tagged with sentiment values.
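
The keyword filter described above can be sketched as follows. This is a minimal illustration using only the standard library; the term list `BITCOIN_TERMS`, the helper `mentions_bitcoin`, and the sample sentences are assumptions made for the example, not the notebook's actual filter.

```python
import re

# Hypothetical keyword list; the filter terms actually used may differ.
BITCOIN_TERMS = ["bitcoin", "btc"]

# One case-insensitive pattern; \b keeps "btc" from matching inside longer words.
BITCOIN_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BITCOIN_TERMS)) + r")\b",
    re.IGNORECASE,
)


def mentions_bitcoin(text: str) -> bool:
    """Return True when the text contains at least one filter term."""
    return BITCOIN_PATTERN.search(text) is not None


# Made-up sample sentences, for illustration only.
sample = [
    "Bitcoin is up again today",
    "Nothing about crypto here",
    "Just bought some #BTC",
]
kept = [t for t in sample if mentions_bitcoin(t)]
print(kept)
```

With a pandas DataFrame the same predicate could be applied through `df[df["text"].apply(mentions_bitcoin)]`.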

In [1]:
import pandas as pd
import numpy as np
import tweepy as tw 
from tqdm import tqdm
from IPython.display import clear_output


In [2]:
import configparser
config = configparser.ConfigParser(interpolation=None)
config.read("conf.conf")

['conf.conf']

In [3]:
client = tw.Client(config['twitter']['bearer_token'], wait_on_rate_limit=True)

In [4]:
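date_since = "2022-05-22T00:00:00.00Z"
date_until = "2022-05-22T00:00:00Z"  # currently unused: end_time is commented out in the search call
# Chaining Python strings with `or` evaluates to the first string only, so the
# alternatives are combined with the search API's OR operator in a single query.
# Keyword matching is case-insensitive, so "Bitcoin" and "bitcoin" need no separate terms.
search_words = "(bitcoin OR btc OR #bitcoin OR #btc) lang:en -is:retweet"
fields = ['created_at', 'text']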
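
When several search terms are needed, they have to be combined inside the query string itself, because Python's `or` between two non-empty strings simply returns the first one. The sketch below shows one way to assemble such a query. It assumes the Twitter API v2 search syntax (`OR` between terms, parentheses for grouping); `build_query` is a hypothetical helper, not part of tweepy.

```python
def build_query(terms, lang="en", exclude_retweets=True):
    """Join search terms with OR and append the shared filters once."""
    query = "(" + " OR ".join(terms) + ")"
    if lang:
        query += f" lang:{lang}"
    if exclude_retweets:
        query += " -is:retweet"
    return query


# Pitfall: `or` between two non-empty strings returns the first operand.
first_only = "Bitcoin lang:en" or "btc lang:en"

query = build_query(["bitcoin", "btc", "#bitcoin", "#btc"])
print(first_only)
print(query)
```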

In [5]:
# Collect tweets
tweets = tw.Paginator(client.search_recent_tweets,
                      tweet_fields=fields,
                      query=search_words,
                      start_time=date_since,
                      #end_time=date_until,
                      max_results=100).flatten(limit=100000)  # Cap the Paginator at 100,000 tweets


In [6]:
# Tweet retrieval
tweets_copy = []
for tweet in tqdm(tweets):
    tweets_copy.append(tweet)



    

44843it [03:46, 204.57it/s]Rate limit exceeded. Sleeping for 675 seconds.
89808it [18:53, 194.72it/s]Rate limit exceeded. Sleeping for 669 seconds.
100000it [30:54, 53.91it/s]
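
The two pauses in the log above are what a back-of-the-envelope estimate predicts. The sketch below assumes a quota of 450 search requests per 15-minute window; that figure is an assumption (it is not stated in this notebook), although it is roughly consistent with the pauses occurring after about 45,000 and 90,000 tweets.

```python
import math

TWEET_LIMIT = 100_000       # limit passed to flatten()
PAGE_SIZE = 100             # max_results per request
REQUESTS_PER_WINDOW = 450   # assumed quota per 15-minute window

# Each request returns at most PAGE_SIZE tweets.
requests_needed = math.ceil(TWEET_LIMIT / PAGE_SIZE)
# Number of rate-limit windows needed to issue that many requests.
windows_needed = math.ceil(requests_needed / REQUESTS_PER_WINDOW)
# A pause happens between consecutive windows.
pauses_expected = windows_needed - 1

print(requests_needed, windows_needed, pauses_expected)
```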


### Checking we have received the desired number of Tweets

In [7]:
print(f"New tweets retrieved: {len(tweets_copy)}")

New tweets retrieved: 100000


In [8]:
tweets_df=pd.DataFrame()

In [9]:
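# Build one single-row frame per tweet and concatenate them in a single call.
# DataFrame.append was removed in pandas 2.0 and is slow when called in a loop.
tweets_df = pd.concat([pd.DataFrame({'date': tweet.created_at,
                                     'text': tweet.text}, index=[0])
                       for tweet in tqdm(tweets_copy)])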

In [10]:
tweets_df.head()

Unnamed: 0,date,text
0,2022-05-23 18:58:27+00:00,Mark my words we are currently a baby whale🐬 b...
0,2022-05-23 18:58:25+00:00,@BDliveSA Being the head of the home isn't eas...
0,2022-05-23 18:58:25+00:00,@BitcoinMagazine We are early. Every news abou...
0,2022-05-23 18:58:24+00:00,"🚨 170 #BTC (5,094,422 USD) just transferred 🚨\..."
0,2022-05-23 18:58:23+00:00,Liberals want to repeal the word “obesity” bec...


### Filtering and Lemmatizing Tweets


In [11]:
import re

# Copy so that the column assignments below do not act on a view of tweets_df
filtered_btc = tweets_df.dropna().copy()

# Lowercase the text so that later matching is case-insensitive
filtered_btc["text"] = filtered_btc["text"].astype(str).str.lower()

# Take out links with or without www (the result must be assigned back to take effect)
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r"(?:www\.)?[a-z]+\.com", '', x))

# Take out possible HTML character references
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r'&[a-z]+;', '', x))

# Sometimes Twitter data has links preprocessed into a reference such as {link};
# remove it before the braces are stripped by the next step
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r'\{link\}', '', x))

# Take out non-letter characters except for spaces and sentence delimiters
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r"[^a-z\s.!?]", '', x))

# The dataset contains "atuser" and "url" placeholder tokens; remove them as whole
# words so that letters inside ordinary words are left alone
filtered_btc["text"] = filtered_btc["text"].str.replace(r'\b(?:url|atuser)\b', '', regex=True)


In [12]:
filtered_btc.head()

Unnamed: 0,date,text
0,2022-05-23 18:58:27+00:00,mark my words we are currently a baby whale bu...
0,2022-05-23 18:58:25+00:00,bdlivesa being the head of the home isnt easy ...
0,2022-05-23 18:58:25+00:00,bitcoinmagazine we are early. every news about...
0,2022-05-23 18:58:24+00:00,btc usd just transferred \n\nfrom\nhfbtnavz...
0,2022-05-23 18:58:23+00:00,liberals want to repeal the word obesity becau...
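
To see what the cleaning steps do to a single tweet, the sketch below applies the main substitutions to one string using only the standard library. The function `clean_tweet` and the sample text are made up for illustration, and the final whitespace-collapsing step is an addition that the notebook cell does not perform.

```python
import re


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip links, HTML entities and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # links with a scheme
    text = re.sub(r"&[a-z]+;", "", text)       # HTML character references such as &amp;
    text = re.sub(r"[^a-z\s.!?]", "", text)    # keep letters, whitespace and . ! ?
    # Extra step (not in the notebook): collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample tweet, for illustration only.
raw = "@BitcoinMagazine We are early &amp; #BTC is at 30,000! https://t.co/abc123"
cleaned = clean_tweet(raw)
print(cleaned)
```

Note that handles and hashtags lose their `@` and `#` but keep their letters, which is why tokens such as `bitcoinmagazine` appear in the cleaned output above.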


In [13]:
import nltk
from nltk.tokenize import TweetTokenizer

tweets = list(zip(filtered_btc["text"], filtered_btc["date"]))

tweet_tokenizer = TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)

tokens = [(tweet_tokenizer.tokenize(tweet), date) for (tweet, date) in tweets if type(tweet) == str]

filtered = []
for tweet in tokens:
    new = []
    for tok in tweet[0]:
        if tok != "AT_USER" and tok != "URL":
            new.append(tok)
            
    filtered.append((new, tweet[1]))

tagged = [(nltk.pos_tag(tweet), date) for tweet, date in filtered]


In [14]:
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

def wn_pos(tag):
    """Convert a Penn Treebank POS tag into the matching WordNet POS constant."""
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None

lem_tweets = []
lem = WordNetLemmatizer()

for tagged_tokens, date in tagged:
    lemmas = []

    for word, tag in tagged_tokens:
        wn_tag = wn_pos(tag)

        # Drop a single trailing punctuation character, if there is one
        if word[-1] in string.punctuation:
            word = word[:-1]

        # Use the POS-aware lemmatizer when the tag maps to a WordNet category
        if wn_tag is not None:
            lemmas.append(lem.lemmatize(word, wn_tag))
        else:
            lemmas.append(lem.lemmatize(word))

    lem_tweets.append((lemmas, date))

# Flat list of every lemma across all tweets (dates are not included)
lemmas = [lemma for tweet_lemmas, _ in lem_tweets for lemma in tweet_lemmas]

len(lem_tweets)

100000

In [15]:
lem_tweets = pd.DataFrame(lem_tweets, columns=['tweet', 'date'])
lem_tweets.to_csv("Btc_tweets_22_23.csv")

## Bitcoin Daily Price Data Pre-Processing

## Analysis
------------------

We begin the analysis by splitting the data by day. Each day's data is then run through the model to create a daily report based on the results.
- First, we combine the data that will be used in the model into a single DataFrame covering the past 7 days.
- Then we append to the same DataFrame the daily change in the price of Bitcoin, calculated as the difference between the opening price at midnight and the closing price at midnight 24 hours later.
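
The two steps above can be sketched with the standard library as follows. All rows and prices here are made-up placeholders (not real tweets or market data), and the names `tweets_by_day`, `prices` and `daily` are assumptions made for the example; the notebook itself works with pandas DataFrames.

```python
from collections import defaultdict
from datetime import date, datetime

# Placeholder rows in the same shape as the tweet data: (tokens, timestamp string).
rows = [
    (["bitcoin", "be", "up"], "2022-05-19 23:59:59+00:00"),
    (["btc", "dip"], "2022-05-20 00:00:03+00:00"),
    (["buy", "btc"], "2022-05-20 12:30:00+00:00"),
]

# Group the tweets by calendar day, using the date part of each timestamp.
tweets_by_day = defaultdict(list)
for tokens, stamp in rows:
    day = datetime.fromisoformat(stamp).date()
    tweets_by_day[day].append(tokens)

# Placeholder (open, close) prices at consecutive midnights; not real prices.
prices = {
    date(2022, 5, 19): (100.0, 110.0),
    date(2022, 5, 20): (110.0, 104.0),
}

# Daily price change = closing price minus opening price 24 hours earlier.
daily = {
    day: {"n_tweets": len(tweets_by_day[day]), "price_change": close - open_}
    for day, (open_, close) in prices.items()
}

for day, summary in sorted(daily.items()):
    print(day, summary)
```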

In [21]:
import os

In [22]:
df = pd.concat([pd.read_csv(f'data/Btc_tweets_17-23/{f}') for f in os.listdir('data/Btc_tweets_17-23') if f.endswith('.csv')])

In [23]:
df.head()

Unnamed: 0.1,Unnamed: 0,tweet,date
0,0,"['current', 'stats', 'of', 'delegatedonthate',...",2022-05-19 23:59:59+00:00
1,1,"['bbcworld', 'for', 'all', 'those', 'who', 'be...",2022-05-19 23:59:58+00:00
2,2,"['smilingpunks', 'floor', 'price', '', 'no', '...",2022-05-19 23:59:56+00:00
3,3,"['i', 'be', 'claim', 'my', 'free', 'lightning'...",2022-05-19 23:59:56+00:00
4,4,"['washingtonpost', 'for', 'all', 'those', 'who...",2022-05-19 23:59:54+00:00


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  500000 non-null  int64 
 1   tweet       500000 non-null  object
 2   date        500000 non-null  object
dtypes: int64(1), object(2)
memory usage: 15.3+ MB
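
Because the token lists were written to CSV, the `tweet` column comes back as plain strings (dtype `object`) such as `"['current', 'stats', ...]"` rather than Python lists. One way to recover the lists is `ast.literal_eval`, sketched below on a single made-up value.

```python
import ast

# A value in the form the "tweet" column takes after the CSV round trip
# (made-up tokens, for illustration only).
stored = "['i', 'be', 'claim', 'my', 'free', 'btc']"

# literal_eval only parses Python literals, so it does not execute arbitrary code.
tokens = ast.literal_eval(stored)
print(type(tokens).__name__, len(tokens), tokens[:3])
```

On the full DataFrame the same conversion could be applied with `df["tweet"].apply(ast.literal_eval)`.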


In [25]:
df.to_csv("Btc_tweets_17_23.csv")