Name: Han Chen, Daniel Fernandez Davila

### Motivation
Many of the choices we make are influenced by how others see and perceive things. Our decisions about what to eat, what to wear, or where to invest are shaped by other people, and when we need to make a decision we often ask for others' opinions. Recently, social media has become an increasingly important channel for this kind of influence.

Bitcoin has been a hot topic recently, and its price seems to be driven by speculative trends. Some of these trends, along with enthusiasm for or aversion to Bitcoin, probably spread across social networks. Specifically, there is a significant Bitcoin community on Twitter that spreads news, investment opinions, and general sentiment about Bitcoin.

Some work has been done on predicting stock market movements using Twitter data. Specifically, Zhiang Hu, Jian Jiao, and Jialu Zhu (2013) use tweets to predict the stock market. Additionally, Bollen et al. (2010) predicted the direction of DJIA movements using Twitter data, reporting an accuracy of 87.6%.

As an additional motivation for this project, one of our team members has investments in Bitcoin, so we thought it would be interesting to look at the Bitcoin market.


### Problem
We want to use machine learning methods to measure the impact of social media (Twitter in this project) on the Bitcoin market. We frame this as a classification problem: predict whether the market will go up or down based on the tweets posted during a given time interval. We chose a 6-hour interval so that we have more data points for modeling. We also want to accommodate the momentum of the market, which moves quickly during the day.
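
To make the 6-hour framing concrete, the sketch below shows one way tweets could be grouped into 6-hour windows. It is an illustration under assumptions, not the code used in this project: the toy rows and the names `toy_tweets`, `window_start`, and `docs_per_window` are hypothetical, and only the `created_at` and `text` column names come from the scraped data.

```python
import pandas as pd

# Toy tweets: the column names mirror the scraped data, the values are made up.
toy_tweets = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2018-03-26 01:15:00",
        "2018-03-26 05:59:59",
        "2018-03-26 06:00:00",
        "2018-03-26 17:09:23",
    ]),
    "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
})

# Floor each timestamp to the start of its 6-hour window (00:00, 06:00, 12:00, 18:00).
toy_tweets["window_start"] = toy_tweets["created_at"].dt.floor("6h")

# Join all tweets that fall into the same window into one document per window.
docs_per_window = toy_tweets.groupby("window_start")["text"].apply(" ".join)
print(docs_per_window)
```

Each resulting row is one candidate data point for the classifier: the text posted in a window, to be paired with the market direction for that window.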

### Dataset Overview
I. Tweets (scraped using the Tweepy API)  
1. Challenge: Tweepy only allows topic-based searches to go back 7 days  
--> therefore, we selected 77 Bitcoin influencers, including Bitcoin analysts, Bitcoin investors, etc. "Influencers" were chosen based on their number of followers and how frequently they tweet about Bitcoin.
2. A total of 68K tweets; features include the timestamp broken down into year, month, day, hour, minute, and second, and the actual text.
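
The timestamp breakdown mentioned above can be sketched with the pandas `.dt` accessor. This is a minimal illustration, assuming the timestamps are stored as strings in a `created_at` column; the frame name `sample` is hypothetical and the two timestamps are taken from the sample rows printed in the preprocessing section.

```python
import pandas as pd

# Two timestamps in the same format as the scraped created_at column.
sample = pd.DataFrame({"created_at": ["2018-03-26 17:09:23", "2018-03-25 02:49:39"]})
sample["created_at"] = pd.to_datetime(sample["created_at"])

# Split each timestamp into its six calendar and clock components.
for part in ["year", "month", "day", "hour", "minute", "second"]:
    sample[part] = getattr(sample["created_at"].dt, part)

print(sample)
```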

II. Bitcoin Price Data (processed in the second notebook file)  
1. Downloaded from Kaggle https://www.kaggle.com/preslavrachev/bitcoin-historical-prices-from-cryptocompare-/data
2. Covers price information from 4/23/2017 to 4/5/2018 at 5-minute intervals
3. Features: timestamp, date, open, close, low, high, weightedAverage, volume, quoteVolume
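
As a rough sketch of how 5-minute prices could be turned into up/down labels for 6-hour windows, the example below resamples synthetic closing prices. The labeling rule (a window counts as "up" when its last close is above its first close) is an assumption made for illustration, not necessarily the rule used in the second notebook; the prices are synthetic, and only the `timestamp` and `close` column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute closes covering one day (4 windows of 72 observations each).
idx = pd.date_range("2018-01-01 00:00", periods=4 * 72, freq="5min")
close = np.concatenate([
    np.linspace(100, 110, 72),  # rising window
    np.linspace(110, 105, 72),  # falling window
    np.linspace(105, 120, 72),  # rising window
    np.linspace(120, 118, 72),  # falling window
])
prices = pd.DataFrame({"timestamp": idx, "close": close})

# First and last close inside each 6-hour window.
six_hour = prices.set_index("timestamp")["close"].resample("6h").agg(["first", "last"])

# Assumed label: 1 if the window closed higher than it opened, else 0.
six_hour["up"] = (six_hour["last"] > six_hour["first"]).astype(int)
print(six_hour)
```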

### Data Preprocessing (Tweets)

In [45]:
import pandas as pd
import glob

In [46]:
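# Read every per-user CSV and stack them into one DataFrame.
# DataFrame.append was removed in pandas 2.0, so the frames are combined with pd.concat.
frames = [pd.read_csv(file, encoding='utf-8') for file in glob.glob('Datasets/Tweets/*.csv')]
tweets = pd.concat(frames)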

In [47]:
tweets = tweets.drop('id', axis = 1)
tweets.head(10)

Unnamed: 0,created_at,text
0,2018-03-26 17:09:23,b'RT @coindesk: Cboe Prods SEC on Bitcoin ETF ...
1,2018-03-25 02:49:39,"b'At Mars, Jeff Bezos Hosted Roboticists, Astr..."
2,2018-03-22 17:28:31,b'RT @GeminiDotCom: U.K. fintech firm @BeeksFi...
3,2018-03-22 17:26:58,b'RT @BeeksFinCloud: Write up in @financemagna...
4,2018-03-17 23:24:44,"b""RT @coindesk: Peter Thiel: Bitcoin Will Be t..."
5,2018-03-16 22:59:20,"b'RT @HumanoidHistory: March 16, 1966: Astrona..."
6,2018-03-16 22:57:41,b'RT @HumanoidHistory: #TodayInHistory: High a...
7,2018-03-16 22:56:55,b'RT @FIAconnect: Thanks to Cameron &amp; Tyle...
8,2018-03-16 22:56:48,b'RT @GeminiDotCom: .Cameron @winklevoss and @...
9,2018-03-16 16:26:29,b'Yesterday at the Futures Industry Associatio...


### Filtering tweets

#### Challenge: 
The text contains links and special characters (escaped newlines and raw UTF-8 byte sequences such as \xe2\x80\xa6), and we did not find an exhaustive way to filter all of them out or to recover what they originally meant.  
We therefore first removed the sequences we encountered, or replaced them with their original characters where we could identify them.

In [48]:
tweets.iloc[3198,1]

"b'RT @elonmusk: Tesla Goes Bankrupt\\nPalo Alto, California, April 1, 2018 -- Despite intense efforts to raise money, including a last-ditch ma\\xe2\\x80\\xa6'"

In [50]:
# Cleaning rules as (pattern, replacement, is_regex), applied in order.
# regex is passed explicitly because the default of Series.str.replace
# changed from regex=True to regex=False in pandas 2.0.
rules = [
    (r'\.\+.', " ", True),
    ('\\', " ", False),  # every backslash becomes a space, so an escaped \n ends up as ' n'
    (r'\.\n', " ", True),
    (r'\n', " ", True),
    ("'s", "", False),
    ('xe2 x80 xa6', "", False),
    ('.https:.+', "", True),
    ('xe2 x80 x9c', "", False),
    ('xe2 x80 x99t', "", False),
    (' xe2 x80 x99', "'", False),
    ('xe2 x80 x93', "", False),
    ('xe2 x80 x93 xc2 xa0', "", False),
    ('xf0 x9f x98 x9c', "", False),
    ('n xe2 x9d xafmobi', "", False),
    ('xef xbf xa', "", False),
    ('xf0 x9f x93 x89', "", False),
    ('xf0 x9f x93 x88', "", False),
    ('xf0 x9f x94 xae xf0 x9f x98 x8e', "", False),
    ('xe2 x80 x9d', "", False),
    ('xe2 x9e x97', "", False),
    ('xf0 x9f x91 x80', "", False),
    ('-&gt', "", False),
    ('|', "", False),
    ("'", "", False),
]
for pattern, replacement, is_regex in rules:
    tweets.iloc[:,1] = tweets.iloc[:,1].str.replace(pattern, replacement, regex=is_regex)

In [51]:
# example of processed tweet
tweets.iloc[3198,1]

'bRT @elonmusk: Tesla Goes Bankrupt nPalo Alto, California, April 1, 2018 -- Despite intense efforts to raise money, including a last-ditch ma '
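
The raw text above has the shape of a printed Python bytes object (`b'...'` with `\x..` escapes), so one possible alternative to pattern-by-pattern cleanup is to parse it back into bytes and decode it as UTF-8. The sketch below is an illustration under that assumption, not the approach used in this notebook; the helper name `decode_bytes_repr` is hypothetical, and the input is the raw tweet shown earlier.

```python
import ast


def decode_bytes_repr(raw: str) -> str:
    """Turn the printed form of a bytes object back into readable text."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # not a valid Python literal: leave the text untouched
    if isinstance(value, bytes):
        # Undecodable bytes are replaced rather than raising an error.
        return value.decode("utf-8", errors="replace")
    return raw


raw = "b'RT @elonmusk: Tesla Goes Bankrupt\\nPalo Alto, California, April 1, 2018 -- Despite intense efforts to raise money, including a last-ditch ma\\xe2\\x80\\xa6'"
decoded = decode_bytes_repr(raw)
print(decoded)
```

Here the bytes `\xe2\x80\xa6` decode to a single ellipsis character and `\n` becomes a real line break. Applied to the raw `text` column before any other cleaning (for example with `Series.map`), this would leave only links, mentions, and similar items to filter.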

In [34]:
# take only Bitcoin related tweets for analysis
import re

# A tweet is kept if it matches any of these regex patterns.
# '.' is the regex wildcard, so '.BTC.' needs one character on each side of BTC;
# 'Bitcoin' also covers variants such as '#Bitcoin' and '.Bitcoin.'.
patterns = ['bitcoin', 'Bitcoin', 'BITCOIN', 'BitCoin', 'BItcoin',
            '.BTC.', '.btc.', '.crypto.', '.Crypto.', '.blockchain.', '.eth.',
            '.hodl.', '.Hodl.', '.HODL.', '.Gemini.', '.Bitfinex.',
            'MOON', 'Moon', 'moon']
boolean = tweets.iloc[:,1].str.contains('|'.join(patterns), regex=True, na=False)

In [35]:
tweets = tweets.loc[boolean,:]
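
Because patterns like `.eth.` match inside ordinary words ("whether", "method"), a stricter variant of the filter could use case-insensitive matching with word boundaries. The sketch below is a hedged illustration on toy texts: the shortened keyword list and the names `texts`, `keywords`, `pattern`, and `mask` are assumptions, not the filter used for the results in this notebook.

```python
import pandas as pd

# Toy texts: two that should be kept and two false positives for substring matching.
texts = pd.Series([
    "RT @coindesk: Cboe Prods SEC on Bitcoin ETF",
    "I wonder whether the method works",
    "Time to HODL some btc",
    "Honeymoon photos",
])

# Hypothetical shortened keyword list; \b requires a word boundary on both sides.
keywords = ["bitcoin", "btc", "crypto", "blockchain", "eth", "hodl"]
pattern = r"\b(?:" + "|".join(keywords) + r")\b"

# case=False makes one lowercase keyword cover Bitcoin, BITCOIN, BitCoin, and so on.
mask = texts.str.contains(pattern, case=False, regex=True, na=False)
print(mask.tolist())
```

The trade-off is that word boundaries also exclude forms such as "bitcoins" or "cryptocurrency" unless they are listed explicitly, so the keyword list would need to be extended for real use.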

In [36]:
tweets.head()

Unnamed: 0,created_at,text,rt
0,2018-03-26 17:09:23,RT @coindesk: Cboe Prods SEC on Bitcoin ETF Ap...,1
2,2018-03-22 17:28:31,RT @GeminiDotCom: U.K. fintech firm @BeeksFinC...,1
3,2018-03-22 17:26:58,RT @BeeksFinCloud: Write up in @financemagnate...,1
4,2018-03-17 23:24:44,RT @coindesk: Peter Thiel: Bitcoin Will Be the...,1
5,2018-03-16 22:59:20,"RT @HumanoidHistory: March 16, 1966: Astronaut...",1


In [37]:
len(tweets)

67943

### Creating a Dummy if RT
Created a dummy variable rt that flags retweets, which are indicated by "RT" at the beginning of the text  
(not used in the analysis)

In [38]:
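tweets['rt'] = 0
# A retweet starts with "RT"; allow for a leftover bytes prefix (b, b' or b") in front of it
boolean = tweets.iloc[:,1].str.contains(r'^b?["\']?RT\b', regex=True, na=False)
tweets.loc[boolean,'rt'] = 1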

In [39]:
tweets.head()

Unnamed: 0,created_at,text,rt
0,2018-03-26 17:09:23,RT @coindesk: Cboe Prods SEC on Bitcoin ETF Ap...,1
2,2018-03-22 17:28:31,RT @GeminiDotCom: U.K. fintech firm @BeeksFinC...,1
3,2018-03-22 17:26:58,RT @BeeksFinCloud: Write up in @financemagnate...,1
4,2018-03-17 23:24:44,RT @coindesk: Peter Thiel: Bitcoin Will Be the...,1
5,2018-03-16 22:59:20,"RT @HumanoidHistory: March 16, 1966: Astronaut...",1


In [40]:
# Strip the leading bytes-literal prefix (b, b' or b") exactly once per tweet.
# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default.
tweets.iloc[:,1] = tweets.iloc[:,1].str.replace(r'^b["\']?', "", regex=True)

In [41]:
tweets.head(50)

Unnamed: 0,created_at,text,rt
0,2018-03-26 17:09:23,RT @coindesk: Cboe Prods SEC on Bitcoin ETF Ap...,1
2,2018-03-22 17:28:31,RT @GeminiDotCom: U.K. fintech firm @BeeksFinC...,1
3,2018-03-22 17:26:58,RT @BeeksFinCloud: Write up in @financemagnate...,1
4,2018-03-17 23:24:44,RT @coindesk: Peter Thiel: Bitcoin Will Be the...,1
5,2018-03-16 22:59:20,"RT @HumanoidHistory: March 16, 1966: Astronaut...",1
6,2018-03-16 22:57:41,RT @HumanoidHistory: #TodayInHistory: High abo...,1
7,2018-03-16 22:56:55,RT @FIAconnect: Thanks to Cameron &amp; Tyler ...,1
8,2018-03-16 22:56:48,RT @GeminiDotCom: .Cameron @winklevoss and @ty...,1
9,2018-03-16 16:26:29,Yesterday at the Futures Industry Association ...,0
10,2018-03-16 16:06:42,RT @ErikVoorhees: Bitcoin: a digital currency ...,1


In [42]:
# wrote out to csv file for further merging
tweets.to_csv('./tweets_clean.csv')

In [18]:
len(tweets)

68516