## Do Twitter Sentiment and Securities Pricing Show Measurable Correlation?

Here's a basic overview of how we will structure our data and step through the analysis.

1. Use Tweepy and the Twitter API to search tweets for certain keywords within a certain date range.
2. Clean the tweets of links, emojis, retweet information, etc. 
3. Use TextBlob's sentiment analysis (by default a lexicon-based analyzer adapted from the Pattern library; a Naive Bayes classifier trained on movie review data is available as an option) to establish how Twitter 'generally' speaks about the keywords throughout the date range.
4. Pull stock, security, or cryptocurrency price over the date range using Yahoo Finance's API.
5. Build a dataframe containing relevant price information and sentiment data.
6. Use basic plotting and statistics to see how well price and sentiment correlate.

First, let's import the necessary libraries and packages, then set our matplotlib backend.

In [None]:
import re
import tweepy
import numpy as np
import pandas as pd
import yfinance as yf
import datetime
import json
import matplotlib as mpl
import matplotlib.pyplot as plt
from textblob import TextBlob
from sklearn.metrics import r2_score

%matplotlib inline

#!pipreqsnb .

I have my Twitter API authentication information in a file tucked away. This code reads the three credential lines and passes the third one (the bearer token) to Tweepy.

In [None]:
auth_tokens = []
with open('TwitterAPI.txt', 'r') as security:
    for i in range(3):
        #strip the trailing newline from each of the three credential lines
        auth_tokens.append(security.readline().strip())

In [None]:
try:
    auth = tweepy.OAuth2BearerHandler(auth_tokens[2])
    tweepy_api = tweepy.API(auth)
    print("Success!")
except Exception as err:
    print("Authentication Error:", err)

***

This cell is home base for building our final dataframe. It prompts for the keyphrase, ticker, and date range to be considered, then calls the functions defined in the cells below to get tweets, clean them, get stock information, and merge the two dataframes. Run those function cells first so the names exist when this cell executes.

`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop collects one row per day in a list and builds the dataframe once at the end.

In [None]:
keyphrase = input("Keyphrase? ")
ticker = input("What ticker do you think is correlated with the key phrase? ").upper()

#Get date range. Use datetime for start date and an int for lookback time.
start_date = datetime.datetime.strptime(input("Analysis start date (YYYY-MM-DD)? "), '%Y-%m-%d')
lookback_days = int(input("How many days do you want to look back (recommended minimum of 15 days)? "))

#loop through date range and get sentiment value for each day. Collect one row per day.
sentiment_rows = []
for i in range(lookback_days+1):
    curr_date_iter = start_date - datetime.timedelta(days=i)
    tweet_temp_raw = get_tweets(keyphrase, curr_date_iter)
    tweet_holder = make_tweet_list(tweet_temp_raw)
    tweet_holder = tweet_prep(tweet_holder)
    sentiment_value = get_sentiment(tweet_holder)
    sentiment_rows.append({'Date':curr_date_iter.date(), 'Sentiment':sentiment_value})

#build the twitter sentiment dataframe in one step and index it by date.
twitter_df = pd.DataFrame(sentiment_rows, columns=['Date', 'Sentiment'])
twitter_df['Date'] = pd.to_datetime(twitter_df['Date'])
twitter_df.set_index('Date', inplace=True)
#twitter_df.head()

stock_df = get_stock_info(ticker, start_date, lookback_days)
#stock_df.head()

merged_data = join_dataframes(twitter_df, stock_df)
merged_data.head(10)


***

In [None]:
def get_tweets(search_phrase, start_time):
    '''search the full archive for English tweets matching search_phrase on the day that begins at start_time'''

    try:
        #the full-archive search endpoint expects timestamps formatted as YYYYMMDDhhmm
        from_date = start_time.strftime('%Y%m%d') + '0000'
        to_date = (start_time + datetime.timedelta(days=1)).strftime('%Y%m%d') + '0000'

        raw_tweets = tweepy_api.search_full_archive(label='testAnalysis', query=search_phrase + ' lang:en',
                                                    fromDate=from_date, toDate=to_date, maxResults=100)
        return raw_tweets

    except Exception as err:
        print("Error retrieving tweets:", err)
        #return an empty list so the rest of the pipeline can still run for this day
        return []
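To make the date handling concrete, here is a small sketch of how a one-day search window can be formatted as the `YYYYMMDDhhmm` strings that `fromDate` and `toDate` take. The helper name `search_window` is hypothetical and only exists for this illustration.

```python
import datetime

def search_window(day):
    """Return (fromDate, toDate) strings covering one calendar day, formatted as YYYYMMDDhhmm."""
    next_day = day + datetime.timedelta(days=1)
    return day.strftime('%Y%m%d%H%M'), next_day.strftime('%Y%m%d%H%M')

# strptime with a date-only format yields midnight, so the window starts at 00:00
start = datetime.datetime.strptime('2022-01-31', '%Y-%m-%d')
from_date, to_date = search_window(start)
print(from_date, to_date)  # 202201310000 202202010000
```

Adding a `timedelta` to the `datetime` object (rather than to a formatted string) is what lets the window roll over month boundaries correctly.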

Tweepy returns a ResultSet object. We need to get the JSON data from within and clean everything up. We could also get rid of retweets here, but there is value in keeping them because they can be an indicator of a shared emotion among users.

In [None]:
def make_tweet_list(raw_tweets):
    '''converts returned tweepy ResultSet object into, and returns, list of text elements from the tweets'''
    return [tweet._json['text'] for tweet in raw_tweets]

Use regular expression library to clean the text from the tweets. 

In [None]:
def tweet_prep(tweets):
    '''takes a list as input and cleans the string elements of mentions, links, and unwanted characters.
    returns lowercase, cleaned strings
    '''
    
    prepped_tweets = []
    for tweet in tweets:
        tweet_temp = re.sub(r'@\S+', "", tweet)
        tweet_temp = re.sub(r'http\S+', "", tweet_temp)
        tweet_temp = re.sub("[^a-zA-Z0-9(),\"'_ ]", "", tweet_temp)
        tweet_temp = re.sub(r'RT  ', "", tweet_temp)
        #collapse runs of spaces left behind by the removals above
        tweet_temp = re.sub(' +', ' ', tweet_temp)
        
        prepped_tweets.append(tweet_temp.lower())
        
    return prepped_tweets
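As a quick illustration of what the cleaning does, the sketch below applies the same sequence of substitutions to a single made-up tweet (whitespace runs are collapsed with `' +'` here). The function name `clean_tweet`, the handle, and the URL are all hypothetical.

```python
import re

def clean_tweet(tweet):
    """Apply mention, link, character, and retweet-marker removal to one tweet."""
    text = re.sub(r'@\S+', "", tweet)                      # drop @mentions
    text = re.sub(r'http\S+', "", text)                    # drop links
    text = re.sub("[^a-zA-Z0-9(),\"'_ ]", "", text)        # drop emojis, '#', '!' and similar
    text = re.sub(r'RT  ', "", text)                       # drop the leftover retweet marker
    text = re.sub(' +', ' ', text)                         # collapse runs of spaces
    return text.lower()

sample = "RT @some_user: Bitcoin to the moon!! 🚀 https://t.co/abc123 #BTC"
print(clean_tweet(sample))  # bitcoin to the moon btc
```

Note that the retweet marker is only matched after the mention has been removed, because deleting `@some_user:` is what leaves the two spaces that `'RT  '` looks for.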

Use TextBlob for sentiment analysis. It really is this simple.

In [None]:
def get_sentiment(cleaned_tweets):
    '''pass tweets ready for basic sentiment analysis and return an overall 
    sentiment value from 0 to 1 (0 is negative, 0.5 is neutral, 1 is positive)
    '''
    
    sentiment_scores = np.zeros(len(cleaned_tweets))
    
    for i, tweet in enumerate(cleaned_tweets):
        blobject = TextBlob(tweet)
        if blobject.sentiment.polarity > 0:
            sentiment_scores[i] = 1
        elif blobject.sentiment.polarity < 0:
            sentiment_scores[i] = 0
        else:
            sentiment_scores[i] = 0.5
  
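    #an empty tweet list has no defined sentiment, so return NaN instead of dividing by zero
    if len(sentiment_scores) == 0:
        return np.nan
    return sentiment_scores.mean()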
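The scoring scheme is easy to check by hand. This is a minimal sketch of the same bucketing applied to made-up polarity values (TextBlob polarities fall in the range -1 to 1); `bucket` and `daily_score` are hypothetical names, and returning NaN for a day with no tweets is an assumption of this sketch.

```python
def bucket(polarity):
    """Map a polarity in [-1, 1] to 1 (positive), 0 (negative), or 0.5 (neutral)."""
    if polarity > 0:
        return 1.0
    if polarity < 0:
        return 0.0
    return 0.5

def daily_score(polarities):
    """Average the bucketed scores; return NaN when there are no tweets."""
    if not polarities:
        return float('nan')
    return sum(bucket(p) for p in polarities) / len(polarities)

# Hypothetical polarity values for five tweets from one day
polarities = [0.8, 0.1, 0.0, -0.4, 0.0]
print(daily_score(polarities))  # (1 + 1 + 0.5 + 0 + 0.5) / 5 = 0.6
```

One consequence of bucketing is that the strength of the sentiment is discarded: a polarity of 0.8 and a polarity of 0.1 both count as a full positive vote.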

***

In [None]:
def get_stock_info(ticker_string, date, num_days):
    '''pass in ticker and date range info. return a dataframe with relevant stock price information'''

    #first day of the lookback window
    first_day = str((date - datetime.timedelta(days=num_days)).date())
    #yf.download treats end as exclusive. errors='ignore' because some yfinance versions omit 'Adj Close'
    stock_info = yf.download(ticker_string, start=first_day, end=str(date.date()),
                             interval='1d').drop(columns=['High', 'Low', 'Adj Close'], errors='ignore')
    stock_info['Day Change %'] = 100 * (stock_info['Close']-stock_info['Open'])/stock_info['Open']

    #normalize the index to plain dates so it lines up with the twitter dataframe
    stock_info.reset_index(inplace=True)
    stock_info['Date'] = stock_info['Date'].apply(lambda x: x.date())
    stock_info['Date'] = pd.to_datetime(stock_info['Date'])
    stock_info.set_index('Date', inplace=True)

    return stock_info

***

Join our twitter and stock dataframes. 

In [None]:
def join_dataframes(twitter, stock):
    '''join the twitter sentiment and stock price history dataframe on date'''
    #reset_index() returns copies, so the caller's dataframes are left untouched
    merged_data = pd.merge(twitter.reset_index(), stock.reset_index(), how='outer', on='Date').set_index('Date')
    return merged_data
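One thing worth seeing with toy data is what the outer join does on days without prices. In this sketch (all values are made up), sentiment exists for every calendar day but the stock only trades on Friday and Monday, so the weekend rows come back with NaN and need to be dropped before fitting a line.

```python
import pandas as pd

# Hypothetical toy data: sentiment exists for every calendar day...
twitter = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-07', '2022-01-08', '2022-01-09', '2022-01-10']),
    'Sentiment': [0.60, 0.55, 0.70, 0.45],
})
# ...but a stock only has prices on trading days (Friday and Monday here).
stock = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-07', '2022-01-10']),
    'Day Change %': [1.2, -0.8],
})

outer = pd.merge(twitter, stock, how='outer', on='Date').set_index('Date')
inner = pd.merge(twitter, stock, how='inner', on='Date').set_index('Date')

print(outer)                       # weekend rows carry NaN in 'Day Change %'
print(len(outer), len(inner))      # 4 2
print(len(outer.dropna()))         # 2
```

An outer join followed by `dropna()` and an inner join end up with the same days here; the outer join just makes the gaps visible first.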

***

Now that we have our dataframes, let's start doing some EDA. Let's plot a few different things to see if anything obvious pops out. After that we can quantify correlation and see if there is anything statistically significant.

I'll embed saved images for all of these plots so nothing gets overwritten the next time the code is run.

This cell is just here to check out the data and debug.

In [None]:
#merged_data.dropna(inplace=True)
#merged_data.reset_index(inplace=True)
#merged_data
print(merged_data.dtypes) #check that Sentiment and Day Change % are numeric before fitting
print(merged_data.describe())

I first tried a small test with Tesla and there was not exactly a lot to go on. We could almost fit an ellipse to the data, but I'll put the image here just for fun. Maybe I'll try something like Bitcoin. It might be more susceptible to the internet's feelings.

<img src="tesla_test.png" align="left" width=400 height=400/>

Well, Bitcoin didn't give us much more to go on. But for the sake of completeness, let's continue anyway.

<img src="btc_scatter.png" align="left" width=400 height=400/>

In [None]:
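#np.polyfit cannot fit through NaN, so keep only days that have both a sentiment score and a price change
clean_data = merged_data[['Sentiment', 'Day Change %']].astype(float).dropna()
x = clean_data['Sentiment']
y = clean_data['Day Change %']
m, b = np.polyfit(x, y, 1)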

fig, ax = plt.subplots()
ax.scatter('Sentiment', 'Day Change %', data=merged_data, label='')
ax.plot(x, m*x+b, color='red', label='y={:.2f}x+{:.2f}'.format(m, b))
ax.set(xlabel='Sentiment', ylabel='Stock % Change', title='BTC Price vs Twitter Sentiment')
fig.set_dpi(100)
plt.legend(fontsize=11)
plt.show()

<img src="line_fit.png" align="left" width=400 height=400/>

Let's quantify the correlation, or, more accurately, the lack thereof.

And just to get a better feel for the data, let's plot the sentiment and day change percentage histograms as well.

`np.corrcoef` returns the 2x2 correlation coefficient matrix. Since we only have two variables, the off-diagonal element `[0,1]` is the Pearson correlation between them. For the Bitcoin test case, it was 0.031.

In [None]:
correlation = np.corrcoef(x, y)[0,1]
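The coefficient on its own does not say whether it is distinguishable from chance. One assumption-light way to check is a permutation test, sketched below in plain Python. This is not something the notebook ran: the function names are hypothetical and the data values are made up for illustration.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # add-one correction keeps the estimate away from an impossible p = 0
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical values, not results from the notebook
sentiment = [0.52, 0.61, 0.48, 0.57, 0.66, 0.50, 0.59, 0.63]
day_change = [1.1, -0.4, 0.3, -1.2, 0.8, 0.2, -0.6, 0.5]
print(pearson_r(sentiment, day_change), permutation_p_value(sentiment, day_change))
```

A large p-value means shuffled pairings produce correlations at least as strong as the observed one most of the time, which is what we would expect for a coefficient near zero.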

In [None]:
fig2, ax2 = plt.subplots()
ax2.hist('Sentiment', data=merged_data)
fig2.set_dpi(100)
plt.legend()
plt.show()

<img src="sent_hist.png" align="left" width=400 height=400/>

In [None]:
fig3, ax3 = plt.subplots()
ax3.hist('Day Change %', data=merged_data)
fig3.set_dpi(100)
plt.legend()
plt.show()

<img src="daychange_hist.png" align="left" width=400 height=400/>

So we actually have something resembling normal distributions (albeit with some holes and skew). Pearson's correlation coefficient is easily distorted by heavy skew or outliers, so this suggests our calculation is probably trustworthy when it tells us that sentiment and change in securities price are not well correlated.

The final thing I'd like to do to round this out is find the $R^2$ value. We might as well pull in something from scikit-learn and do our due diligence. Sure, we could easily do this within numpy, but that's boring.

In [None]:
r2 = r2_score(y, m*x+b)
r2

And sometimes boring is good, at least for checking our work. The NumPy calculation below gives the same answer, so it's safe to say this model is junk (which we knew from the beginning). Almost none of the variance in the day change percentage is explained by sentiment.

In [None]:
#reuse the float-typed x and y from the line fit so both R^2 calculations see the same data
SSr = np.sum(np.square(y - (m*x + b)))
SSt = np.sum(np.square(y - np.mean(y)))
r2_np = 1 - (SSr/SSt)
r2_np
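There is also a third sanity check available: for a straight-line least-squares fit with an intercept, $R^2$ equals the square of the Pearson correlation. With the reported coefficient of 0.031 for Bitcoin, that would put $R^2$ at roughly 0.001. The sketch below demonstrates the identity in plain Python on made-up values; the function names are hypothetical.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

def r_squared(xs, ys, m, b):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical values, not results from the notebook
sentiment = [0.52, 0.61, 0.48, 0.57, 0.66, 0.50, 0.59, 0.63]
day_change = [1.1, -0.4, 0.3, -1.2, 0.8, 0.2, -0.6, 0.5]

m, b = linear_fit(sentiment, day_change)
r2 = r_squared(sentiment, day_change, m, b)
r = pearson_r(sentiment, day_change)
print(r2, r ** 2)  # the two values agree up to floating-point rounding
```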

There are quite a few ways to improve from where we are, so this list is definitely not exhaustive.

First, I'd like to use a different Twitter endpoint to allow for more data. Unfortunately, we can only make 50 calls a month (each returning up to 100 tweets) against the full Twitter archive. The limits loosen a bit for more recent timeframes, though.

A better option (if we were to run this again) would be getting more granular with respect to time rather than aggregating and looking at one day at a time. 

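Our TextBlob model is also not optimal for what we are doing. The default analyzer scores text against a general-purpose polarity lexicon, and the optional Naive Bayes classifier was trained on movie reviews; neither is tuned to financial chatter. Training our own model would likely be a better option here.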

As I'm writing this, I realize I forgot that pandas has its own datetime functionality, so I worked through everything with the datetime package... Live and learn.

The goal here wasn't to walk away with some revelation we never saw coming. This is really just practice for different ways to interact with data or even use other resources to supplement data we may already have. We have an almost non-existent data set here, but it was fun to put a lot of moving parts together.