## Web-Crawling Twitter Data


### Instructions
----------------
**Please do not repeat the steps in this section.**

We first select the data that we will use for the analysis later on.
We have chosen the following dataset for our analysis:
- Bitcoin Twitter chatter dataset
  > We crawl this data ourselves and use it only to attempt to predict the price of Bitcoin from the sentiment of the tweets.
  
  
### Research Question 
We investigate how the price of Bitcoin may be affected by Twitter sentiment about the currency. To do so, we use a sentiment analysis model trained on the UCC (Unhealthy Comments Corpus) and a final price prediction model built on the output of that sentiment model.

## Preprocessing 
-------------------------

> For our project, we perform a sentiment analysis on tweets related to cryptocurrencies and use this analysis to predict how the currencies will vary depending on the sentiment.

> Since we are only interested in tweets that are related to Bitcoin, we define a keyword filter and discard the tweets that contain none of its words.

> After that, we perform a sentiment analysis using pre-trained models to see whether we can accurately predict the sentiment of the tweets.

> The models are trained on the UCC (Unhealthy Comments Corpus) mentioned before, which contains over 40,000 online comments that have been tagged with sentiment values.
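The keyword filter described above could be sketched as follows. The keyword list and the `keep_bitcoin_tweets` helper are illustrative assumptions, not necessarily the exact filter used in this project.

```python
import re
import pandas as pd

# Assumed keyword list for illustration; adjust it to the terms of interest.
KEYWORDS = ["bitcoin", "btc"]

def keep_bitcoin_tweets(df, column="text", keywords=KEYWORDS):
    """Return only the rows whose text mentions at least one keyword."""
    # Word boundaries avoid matching a keyword inside a longer word.
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    mask = df[column].str.contains(pattern, case=False, regex=True, na=False)
    return df[mask].reset_index(drop=True)

sample = pd.DataFrame(
    {"text": ["Bitcoin is up today", "Nice weather", "#BTC to the moon", None]}
)
filtered_sample = keep_bitcoin_tweets(sample)
print(filtered_sample["text"].tolist())
```

Missing texts are treated as non-matching (`na=False`), so they are dropped together with the unrelated tweets.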

In [1]:
import pandas as pd
import numpy as np
import tweepy as tw 
from tqdm import tqdm
from IPython.display import clear_output


In [2]:
import configparser
config = configparser.ConfigParser(interpolation=None)
config.read("conf.conf")

['conf.conf']
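The code above implies that `conf.conf` has a `[twitter]` section with a `bearer_token` key. A minimal sketch of that layout, with a placeholder token and a hypothetical `load_bearer_token` helper that fails early when the key is missing:

```python
import configparser

# Assumed contents of conf.conf; the token value is a placeholder.
EXAMPLE_CONF = """
[twitter]
bearer_token = YOUR_BEARER_TOKEN
"""

def load_bearer_token(parser):
    """Fetch the bearer token, failing with a clear message if it is missing."""
    if not parser.has_option("twitter", "bearer_token"):
        raise KeyError("conf.conf needs a [twitter] section with a bearer_token key")
    return parser.get("twitter", "bearer_token")

# interpolation=None keeps any '%' characters in the token literal.
example_config = configparser.ConfigParser(interpolation=None)
example_config.read_string(EXAMPLE_CONF)
token = load_bearer_token(example_config)
```

Keeping the token in a separate file that is not committed avoids exposing the credential in the notebook itself.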

In [3]:
client = tw.Client(config['twitter']['bearer_token'], wait_on_rate_limit=True)

In [4]:
date_since = "2022-05-22T00:00:00.00Z"
date_until="2022-05-22T00:00:00Z"
# Combine the terms with the API's uppercase OR operator inside a single query string.
# Python's `or` between string literals would evaluate to the first literal only.
search_words = "(bitcoin OR btc OR #bitcoin OR #btc) lang:en -is:retweet"
fields=['created_at','text']

In [5]:
# Collect tweets
tweets = tw.Paginator(client.search_recent_tweets,
                      tweet_fields=fields,
                      query=search_words,
                      start_time=date_since,
                      #end_time=date_until,
                      max_results=100).flatten(limit=100000)  # Cap the Paginator at 100,000 tweets


In [6]:
# Tweet retrieval
tweets_copy = []
for tweet in tqdm(tweets):
    tweets_copy.append(tweet)



    

44843it [03:46, 204.57it/s]Rate limit exceeded. Sleeping for 675 seconds.
89808it [18:53, 194.72it/s]Rate limit exceeded. Sleeping for 669 seconds.
100000it [30:54, 53.91it/s]


### Checking we have received the desired number of Tweets

In [7]:
print(f"New tweets retrieved: {len(tweets_copy)}")

New tweets retrieved: 100000


In [8]:
tweets_df=pd.DataFrame()

In [9]:
# Build the DataFrame in one step; appending row by row is slow,
# and DataFrame.append was removed in pandas 2.0.
tweets_df = pd.DataFrame({'date': [tweet.created_at for tweet in tweets_copy],
                          'text': [tweet.text for tweet in tweets_copy]})

# Accessing and Formatting API Web-Crawled Data

In [2]:
tweets=pd.read_csv('Btc_tweets_1-31_unprocessed700.csv')

In [5]:
tweets

| Unnamed: 0.1 | Unnamed: 0 | date | text |
|---|---|---|---|
| 0 | 0 | 2021-09-01 00:59:58+00:00 | Cyber-thieves used malware to swipe 16.4 Bitco... |
| 1 | 0 | 2021-09-01 00:59:53+00:00 | Our CEO and Co-Founder @raypaxful believes tha... |
| 2 | 0 | 2021-09-01 00:59:49+00:00 | @Dennis_Porter_ I agree. Although loss of hope... |
| 3 | 0 | 2021-09-01 00:59:46+00:00 | What a fast growing ecosystem the XRP ledger h... |
| 4 | 0 | 2021-09-01 00:59:37+00:00 | @finance_keep I have participated. I believe t... |
| ... | ... | ... | ... |
| 6995 | 0 | 2021-09-30 03:59:43+00:00 | @mchooyah @RiggsBTC We buy bitcoin. We HODL. T... |
| 6996 | 0 | 2021-09-30 03:59:42+00:00 | #Bitcoin trying to break out out right now!!! ... |
| 6997 | 0 | 2021-09-30 03:59:41+00:00 | Brooklyn Police Stations Upland Block Chain He... |
| 6998 | 0 | 2021-09-30 03:59:40+00:00 | Current Bitcoin transaction fees: \n \nBCH ... |
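Columns named `Unnamed: 0` and `Unnamed: 0.1` typically appear when a DataFrame is written to CSV together with its index and then read back without `index_col`. The following sketch, using a made-up one-row frame and in-memory buffers instead of files, shows the effect and how `index=False` avoids it:

```python
import io
import pandas as pd

frame = pd.DataFrame({"date": ["2021-09-01"], "text": ["example tweet"]})

# Writing with the default index=True stores the index as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Writing with index=False keeps only the real columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'date', 'text']
print(list(reloaded_clean.columns))       # ['date', 'text']
```

For a file that already contains these columns, they can be discarded after loading with `DataFrame.drop(columns=[...])`.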


### Filtering and Lemmatizing Tweets


In [6]:
import re

# Copy so that the column assignments below do not act on a view of `tweets`
filtered_btc = tweets.dropna().copy()

# Lowercase the text so that matching is case-insensitive
filtered_btc["text"] = filtered_btc["text"].astype(str).str.lower()

# Take out links with or without www
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))

# Take out possible HTML character references
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r'&[a-z]+;', '', x))

# Sometimes Twitter data has links preprocessed into a reference such as {link};
# remove it while the braces are still present
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r'\{link\}', '', x))

# Take out non-letter characters except for spaces and sentence delimiters
filtered_btc["text"] = filtered_btc["text"].apply(lambda x: re.sub(r"[^a-z\s.!?]", '', x))

# The dataset also contains "atuser" and "url" placeholder tokens, so we remove them
# (word boundaries keep words that merely contain these letters intact)
filtered_btc["text"] = filtered_btc["text"].str.replace(r'\b(?:url|atuser)\b', '', regex=True)


In [7]:
filtered_btc.head()

| Unnamed: 0.1 | Unnamed: 0 | date | text |
|---|---|---|---|
| 0 | 0 | 2021-09-01 00:59:58+00:00 | cyberthieves used malware to swipe . bitcoin f... |
| 1 | 0 | 2021-09-01 00:59:53+00:00 | our ceo and cofounder raypaxful believes that ... |
| 2 | 0 | 2021-09-01 00:59:49+00:00 | dennisporter i agree. although loss of hope ca... |
| 3 | 0 | 2021-09-01 00:59:46+00:00 | what a fast growing ecosystem the xrp ledger h... |
| 4 | 0 | 2021-09-01 00:59:37+00:00 | financekeep i have participated. i believe thi... |
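To see what the cleaning steps do to a single tweet, they can be condensed into one function. This is a simplified sketch: the `clean_tweet` name, the sample strings, and the final whitespace-collapsing step are assumptions added for illustration, and the `www`/`.com` rule is left out for brevity.

```python
import re

def clean_tweet(text):
    """Apply the main cleaning steps of the notebook to one string."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # links
    text = re.sub(r"&[a-z]+;", "", text)       # HTML character references
    text = re.sub(r"\{link\}", "", text)       # {link} placeholders
    text = re.sub(r"[^a-z\s.!?]", "", text)    # everything but letters, spaces, . ! ?
    # Extra step (not in the notebook): collapse the gaps left by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Check this out: https://t.co/abc123 &amp; #Bitcoin is UP 5%!!"))
# check this out bitcoin is up !!
print(clean_tweet("swipe 16.4 Bitcoin"))
# swipe . bitcoin
```

The second call shows why the first row of the output above reads `swipe . bitcoin`: digits are removed, but the decimal point is kept as a sentence delimiter.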


In [8]:
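import nltk
from nltk.tokenize import TweetTokenizer

# nltk.pos_tag needs the perceptron tagger model; fetch it once with nltk.download
# (the resource name depends on the NLTK version)

tweets = list(zip(filtered_btc["text"], filtered_btc["date"]))

tweet_tokenizer = TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)

tokens = [(tweet_tokenizer.tokenize(tweet), date) for (tweet, date) in tweets if isinstance(tweet, str)]

# Drop user and URL placeholder tokens; the text was lowercased earlier,
# so the comparison ignores case
placeholders = ("at_user", "atuser", "url")
filtered = []
for toks, date in tokens:
    new = [tok for tok in toks if tok.lower() not in placeholders]
    filtered.append((new, date))

# Part-of-speech tag every tweet so the lemmatizer can use the word class
tagged = [(nltk.pos_tag(tweet), date) for tweet, date in filtered]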


In [9]:
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

# The lemmatizer needs the WordNet corpus (fetch it once with nltk.download('wordnet'))

def wn_pos(tag):
    """Convert a Penn Treebank POS tag into a WordNet POS tag for lemmatization."""
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None

lem_tweets = []
lem = WordNetLemmatizer()

for tagged_tokens, date in tagged:
    lemmas = []

    for word, tag in tagged_tokens:
        wn_tag = wn_pos(tag)

        # Strip one trailing punctuation character and skip tokens left empty
        if word[-1] in string.punctuation:
            word = word[:-1]
        if not word:
            continue

        if wn_tag is not None:
            lemmas.append(lem.lemmatize(word, wn_tag))
        else:
            lemmas.append(lem.lemmatize(word))

    lem_tweets.append((lemmas, date))

# Flat list of every lemma across all tweets
lemmas = [lemma for tweet_lemmas, _ in lem_tweets for lemma in tweet_lemmas]

len(lem_tweets)

7000

In [10]:
lem_tweets = pd.DataFrame(lem_tweets, columns=['tweet', 'date'])
lem_tweets.to_csv("Btc_tweets_1_30.csv")
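One thing to keep in mind when this file is loaded again: the `tweet` column holds Python lists, and CSV stores them as their string representation. A minimal sketch of the round trip, using a made-up row and an in-memory buffer instead of the real file, with `ast.literal_eval` as one way to restore the lists:

```python
import ast
import io
import pandas as pd

saved = pd.DataFrame({"tweet": [["bitcoin", "be", "up"]], "date": ["2021-09-01"]})

buffer = io.StringIO()
saved.to_csv(buffer, index=False)
buffer.seek(0)

# After reading, each cell is a string such as "['bitcoin', 'be', 'up']".
loaded = pd.read_csv(buffer)
raw_value = loaded.loc[0, "tweet"]

# ast.literal_eval parses that string back into a Python list
# without executing arbitrary code.
loaded["tweet"] = loaded["tweet"].apply(ast.literal_eval)
restored = loaded.loc[0, "tweet"]
print(restored)
```

A format that preserves nested types, such as pickle or JSON, would avoid the conversion step, at the cost of a less portable file.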