# Investor Sentiment Analysis

In this portion, we will be using comments scraped from r/wallstreetbets to determine the community's sentiment of TSLA over time. It is important to note that this is an eccentric community with unconventional trading practices and sometimes explicit language in their comments.

Let's get started!

In [2]:
import pandas as pd
import numpy as np
from os import getcwd, listdir
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
import re
import ast
import pytz
import spacy
from sklearn.model_selection import train_test_split
import nltk
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string
from string import punctuation
import collections
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
from sklearn.naive_bayes import MultinomialNB

#nltk.download('wordnet')
#nltk.download('stopwords')

### Load Data 

In [3]:
# Loop through and append all r/wallstreetbets .csv files scraped from Reddit to dataframe
df = pd.DataFrame(columns=['id','post','created_utc','body','score'])
for filename in listdir(getcwd()+'\\WallStreetBets'):
    if filename[-4:] == '.csv':
        df = pd.concat([df, pd.read_csv(getcwd()+'\\WallStreetBets\\'+filename)], ignore_index=True)

In [4]:
# Get number of records
df.shape

(351250, 5)

In [5]:
df.head()

Unnamed: 0,id,post,created_utc,body,score
0,la75n9,"Daily Discussion Thread #2 for February 1, 2021",1612288210,b' Wish there was a broker option to diamond h...,1
1,la75n9,"Daily Discussion Thread #2 for February 1, 2021",1612232343,b'Listen up and don\'t fret all you re-re\'s. ...,13
2,la75n9,"Daily Discussion Thread #2 for February 1, 2021",1612229022,b'I am a NOK holder so not trying to kill the...,-9
3,la75n9,"Daily Discussion Thread #2 for February 1, 2021",1612219285,"b'If you aren\xe2\x80\x99t buying at discount,...",21
4,la75n9,"Daily Discussion Thread #2 for February 1, 2021",1612216647,b'newbies snoozing if u dont buy a tsla dip',1


### Process Data

One of the first things we will do when processsing the data, is convert emojis to text. There can be a lot of meaning in emojis and we don't want to omit this information. Below is an example of an extremely bullish tweet with a lot of emojis used.

In [7]:
# Decode emoji example 
comment = "LOL \\xF0\\x9F\\x8C\\x88\\xF0\\x9F\\x90\\xBBs capitulated. Never bet against Papa Musk. TSLA to the moon!!! \\xF0\\x9F\\x8C\\x95\\xf0\\x9f\\x9a\\x80"
comment = "b'"+comment+"'"
comment = ast.literal_eval(comment)
comment.decode('utf-8')

'LOL 🌈🐻s capitulated. Never bet against Papa Musk. TSLA to the moon!!! 🌕🚀'

In the particular example above, we could probably determine the sentiment without the emojis, but this may not always be the case. Below, we see an example with TSLA, a bull emoji and the word "rekt". Now without this emoji and the knowledge that this user intentionally mispelled the word "wrecked", this would be extremely difficult to interpret. It would be difficult to account for all of the mispelled words and slang in this group. However, usually bull and bear emojis are used to taunt people on the losing end. So in this particular case, we could assume this is actually a bearish post taunting the TSLA bulls.

In [8]:
# Decode emoji example 
comment = "TSLA \\xF0\\x9F\\x90\\x82s rekt"
comment = "b'"+comment+"'"
comment = ast.literal_eval(comment)
comment.decode('utf-8')

'TSLA 🐂s rekt'

Next, we will create a dictionary to map all emojis to their respective names with the help of https://apps.timwhitlock.info/emoji/tables/unicode

In [9]:
# Create dictionary to replace emoji bytes with words
emoji_df = pd.read_excel(getcwd()+'\\emoji_unicode.xlsx', encoding='ISO-8859-1', sheet_name='emoji_short')
emoji_df["Bytes"] = emoji_df["Bytes"].apply(lambda x: x.lower())
emoji_df["Description"] = emoji_df["Description"].apply(lambda x: " " + x + " ") #pad word substitute
emoji_dict = dict(zip(emoji_df['Bytes'], emoji_df['Description']))

# Loop through each post and replace emojis
for i in range(df.shape[0]):
    for k, v in zip(emoji_dict.keys(), emoji_dict.values()):
        if re.search(rf"{re.escape(k)}", df.loc[i, 'body']):
            df.loc[i,'body'] = re.sub(rf"{re.escape(k)}", v, df.loc[i, 'body'])
            
# remove unconverted emojis and unknown characters from message
df['body'] = df['body'].apply(lambda x: ast.literal_eval(x))
df['body'] = df['body'].apply(lambda x: x.decode('utf-8'))
printable = set(string.printable)
df['body'] = df['body'].apply(lambda x: ''.join(filter(lambda y: y in printable, x)).replace('\n', ' '))

We need to convert the comment timestamps from UTC -> EST because our data from the stock market is in EST.

In [10]:
# Convert epoch to UTC
df['created_utc'] = df['created_utc'].apply(lambda x: datetime.fromtimestamp(x))

# Convert UTC -> EST
df['created_utc'] = df['created_utc'].dt.tz_localize('UTC')
df['created_utc'] = df['created_utc'].dt.tz_convert('US/Eastern')
df['created_utc'] = df['created_utc'].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))
df['created_utc'] = df['created_utc'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

#### Checkpoint - Load Processed Posts Here

Some of the previous steps can take a few minutes to run, so the cell below allows us to pick up where we left off by reading in the preprocessed posts.

In [3]:
# Output file for checkpoint
#df.to_csv('Preprocessed_WSB_Posts.csv',index=False)
# Read file
df = pd.read_csv('.//Preprocessed_WSB_Posts.csv')

There are a few more preprocessing steps we need to take care of before our data is ready for sentiment analysis. This includes removing unhelpful characters, extracting call and put options, which are bullish/bearish indicators. 

In [4]:
# Preprocess text
df['body'] = df['body'].str.lower()
df['body'] = df['body'].apply(lambda x: ''.join(u for u in x if u not in ('?','.',';',':','!','"',',','(',')')))

# Flag call/put options (i.e. Replace "$1000c" with "call")
df['option'] = df['body'].apply(lambda x: re.findall(r'(\$?[0-9]+(p|c))', x))
df['option'] = df['option'].apply(lambda x: [y[1] for y in x])
df['call'] = df['option'].apply(lambda x: 1 if 'c' in x else 0)
df['put'] = df['option'].apply(lambda x: 1 if 'p' in x else 0)
df['body'] = df['body'].apply(lambda x: re.sub(r'(\$?[0-9]+(p|c))', '', x))

# Remove dates and numbers and other characters
df['body'] = df['body'].apply(lambda x: ''.join(i for i in x if i not in ('$','/','%')))
df['body'] = df['body'].apply(lambda x: x.replace('&amp', ''))
df['body'] = df['body'].apply(lambda x: ''.join(i for i in x if not i.isdigit()))

# add call and put positions detected in posts
df['body'] = df.apply(lambda row: row['body'] + ' call' if row['call'] > 0 else row['body'], axis=1)
df['body'] = df.apply(lambda row: row['body'] + ' put' if row['put'] > 0 else row['body'], axis=1)

# Trim all whitespace
df['body'] = df['body'].apply(lambda x: ' '.join(x.split()))

### Tokenization, Lemmatization and removing stopwords

In this step, we will use a natural language processing technique, called tokenization to split each post into a list of words, which we will call tokens. We remove stop words that do not add value, such as "I", "you", "hereafter", "thus", "indeed", "whereupon"... You get the point. Additionally, we will use another method, called lemmatization to normalize words derived from the same word, but used in different inflected forms (i.e. code, codes, coding, coded).

In [5]:
#Tokenization, Lemmatization and removing stopwords
#nlp = en_core_web_sm.load() 
nlp = spacy.load('en_core_web_sm')
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()
#stop = set(stopwords.words('english'))
stop = nlp.Defaults.stop_words

# remove relevant words from stop list
keep_words = ['call','put','up','down']
for w in keep_words:
    stop.remove(w)
    
punctuation = list(string.punctuation) #already taken care of with the cleaning function.
stop.update(punctuation)
w_tokenizer = WhitespaceTokenizer()

def process_comments(text):
    """Convert user comments to tokenized strings"""
    final_text = []
    for i in w_tokenizer.tokenize(text):
        if i.lower() not in stop:
            word = lemmatizer.lemmatize(i)
            final_text.append(word.lower())
    return ' '.join(final_text)

df['tokenized'] = df['body'].apply(process_comments)

### Calculate Jaccard Similarity Scores to label posts

Because our posts are not already labeled and we do not have time to manually label 280,000+ posts, we will start with two word banks with words we already know indicate bullish/bearish sentiment.

In [6]:
# Sets of words
bullish_words = '''call up moon moonin rocket long pump pumpin bear rainbow against doubt rally'''

bearish_words = '''put down drop drill drillin dump dumpin short ox bull rug'''

Using our two word banks, we will create sentiment scores using the Jaccard Similarity Index. This approach will measure the similarity between each user comment and the word bank, comparing the members in each set to come up with a similarity score.

In [7]:
# Jaccard Similarity Scores
def jaccard_similarity(group, comment):
    intersection = set(group).intersection(set(comment))
    union = set(group).union(set(comment))
    return len(intersection)/len(union)

def get_scores(group, comments):
    scores = []
    for c in comments:
        s = jaccard_similarity(group, c)
        scores.append(s)
    return scores

bull_scores = get_scores(bullish_words, list(df['tokenized']))
bear_scores = get_scores(bearish_words, list(df['tokenized']))

In [8]:
# Add scores
jdf = df.copy()
jdf['Bullish'] = bull_scores
jdf['Bearish'] = bear_scores
jdf = jdf[['created_utc','tokenized','score','Bullish','Bearish']]
jdf = jdf.rename(columns={'created_utc':'Created_EST','tokenized':'Comment','score':'Score'})

In [20]:
jdf.loc[15,'Comment']

'tesla up sub doesnt notice care man place different couple week ago'

Using our jaccard similarity scores, we will flag each comment as bullish with a value of "1" or bearish with a value of "-1" depending on whose score is higher.

In [9]:
# Derive Sentiment Scores
jdf['Sentiment'] = jdf.apply(lambda row: 1 if row["Bullish"] >= row["Bearish"] else -1, axis=1)

In [10]:
# Create two dataframes for two Naive Bayes Classifiers to predict bullish/bearish posts
bull_df = jdf[['Comment','Sentiment']].copy()
bear_df = jdf[['Comment','Sentiment']].copy()

# Bullish flag
bull_df['Sentiment'] = bull_df['Sentiment'].apply(lambda x: 0 if x == -1 else x)

# Bearish flag
bear_df['Sentiment'] = bear_df['Sentiment'].apply(lambda x: 0 if x == 1 else x)
bear_df['Sentiment'] = bear_df['Sentiment'].apply(lambda x: 1 if x == -1 else x)

### Naive Bayes Classifier

Now, we are going to try to enhance our approach by using a Naive Bayes classifier to identify other words indicative of bullish/bearish sentiment that we can add to our word banks. To do this, we will need to vectorize our set of tokens.

In [11]:
#vectorizer = TfidfVectorizer(stop_words='english') # don't use
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(jdf['Comment'])

In [64]:
# Print vectors
#print(vectorizer.get_feature_names())

In [25]:
vectors.shape

(351250, 75064)

Now that we have our tokens vectorized (bag of words), we can fit a Naive Bayes classifier to the data. We will fit two separate models for bullish and bearish sentiment.

In [12]:
# Bullish classifier
bull_clf = MultinomialNB(alpha=.01).fit(vectors, bull_df['Sentiment'])

We can view the most important words for classifying sentiment by using the function below.

In [27]:
def show_top_words(classifier, vectorizer, categories, top_n):
    """Function to show most important words from NB classifier"""
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        top = np.argsort(classifier.coef_[i])[-top_n:]
        print(f"{category}: {' '.join(feature_names[top])}")

In [28]:
show_top_words(bull_clf, vectorizer, ["Bullish"], 100)

Bullish: month red probably spread high drop run thing end yesterday bull go gain morning fucked long papa lmao battery feel pump new car come big profit selling retard apple friday don eow split option hit company nio play hold holding close bear look green year want guy way earnings let short people aapl moon it belong spy need price dont know good right eod sold got shit face dip market open share fucking buying musk think time lol rocket gonna sell money im bought down stock week tomorrow going fuck today day like buy up put elon call tesla tsla


In [29]:
# Bearish classifier
bear_clf = MultinomialNB(alpha=.01).fit(vectors, bear_df['Sentiment'])

In [30]:
show_top_words(bear_clf, vectorizer, ["Bearish"], 100)

Bearish: that don imo gain run ath elons fomo stop tweet shop people diamond green printing rn imagine go nah let bull split big wish it expiring eod guh dumping option strength god belongs print want mooning wtf drop thats amzn long power earnings damn holder dont open retard need lmao rip bullish dump nio time sell overnight aapl who got worth good moon hope sold pump morning oh hard shorting hand month high belong im whats thought share lol call thing ah dip shit hour hit gonna hold down short right going holding up tomorrow elon bought put tesla tsla


### Update Word Bank

In [13]:
# Sets of words
bullish_words = '''call up moon moonin mooning rocket long pump pumpin bear rainbow against doubt long
                    green rally strength papa battery'''

bearish_words = '''put down drop drill drillin dump dumpin short shorting ox bull bulls rug rip fomo red'''

# Jaccard Similarity Scores
def jaccard_similarity(group, comment):
    intersection = set(group).intersection(set(comment))
    union = set(group).union(set(comment))
    return len(intersection)/len(union)

def get_scores(group, comments):
    scores = []
    for c in comments:
        s = jaccard_similarity(group, c)
        scores.append(s)
    return scores

bull_scores = get_scores(bullish_words, list(df['tokenized']))
bear_scores = get_scores(bearish_words, list(df['tokenized']))

In [14]:
jdf = df.copy()
jdf['Bullish'] = bull_scores
jdf['Bearish'] = bear_scores

jdf = jdf[['created_utc','tokenized','score','Bullish','Bearish']]
jdf = jdf.rename(columns={'created_utc':'Created_EST','tokenized':'Comment','score':'Score'})
jdf.head()
jdf['Sentiment'] = jdf.apply(lambda row: 1 if row["Bullish"] >= row["Bearish"] else -1, axis=1)

Another approach is to create sentiment scores entirely based on a word bank, and scoring based on the ratio of bullish/bearish words present.

In [None]:
## create bull and bear sentiment scores for each day using keywords
bullish_words = ['call', 'up', 'moon', 'moonin', 'mooning', 'rocket', 'long', 'pump', 'pumpin', 'bear', 'rainbow',
                 'against', 'doubt', 'long', 'green', 'rally', 'strength', 'papa', 'battery']

bearish_words = ['put', 'down', 'drop', 'drill', 'drillin', 'dump', 'dumpin', 'short', 'shorting',
                 'ox', 'bull', 'bulls', 'rug', 'rip', 'fomo', 'red']

jdf['Bull Count'] = jdf['Comment'].apply(lambda x: len([w for w in x.split(' ') if w in bullish_words]))
jdf['Bear Count'] = jdf['Comment'].apply(lambda x: len([w for w in x.split(' ') if w in bearish_words]))

jdf['Sentiment WB'] = jdf.apply(lambda row: (row["Bull Count"] - row["Bear Count"])/max((row["Bull Count"] + row["Bear Count"]),1), axis=1)

### VADER Sentiment Analysis (Valence Aware Dictionary for sEntiment Reasoning)

Before we finish, we will create one more sentiment variable using "VADER"

In [15]:
#http://www.nltk.org/howto/sentiment.html
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
#nltk.download('punkt')
#nltk.download('vader_lexicon')

VADER is a parsimonious rule-based model developed by a group of Georgia Tech researchers for sentiment analysis of social media text. You can view their research paper here: (http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf).

In [20]:
sid = SentimentIntensityAnalyzer()
#sid.polarity_scores(x) #{'neg': 0.735, 'neu': 0.265, 'pos': 0.0, 'compound': -0.7616}
jdf['VADER Score'] = jdf['Comment'].apply(lambda x: sid.polarity_scores(x)['compound'])

### Aggregate Sentiment

Now we are going to create weighted scores using the number of upvotes and downvotes.

In [22]:
# Add weights based on upvotes/downvotes
jdf.loc[jdf['Score'] == 0,'Score'] = 1
jdf['Weighted Sentiment'] = jdf['Sentiment']*jdf['Score']

In [24]:
# Assign to date based on market close - If >= 16:00 (4pm) then +1 day
#jdf['Created_EST'] = jdf['Created_EST'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
jdf['Post Hour'] = jdf['Created_EST'].apply(lambda x: x.strftime('%H'))
jdf['Created_EST'] = jdf.apply(lambda row: row['Created_EST'] + timedelta(days=1) if int(row['Post Hour']) >= 16 else row['Created_EST'], axis=1)
jdf['Created_EST'] = jdf['Created_EST'].apply(lambda x: x.strftime('%Y-%m-%d'))

In [38]:
# Aggregate data
agg_df = jdf.groupby('Created_EST').apply(lambda x: pd.Series({
            'weighted_avg': x['Weighted Sentiment'].mean(), # upvote weighted score
            'avg': x['Sentiment'].mean(),    # original sentiment score
            'count': x['Sentiment'].count(), # comment volume
            'wb': x['Sentiment WB'].mean(),  # word bank sentiment score
            'vader': x['VADER Score'].mean() # vader sentiment score
            }))

In [39]:
agg_df = agg_df.sort_values(by=['count'])
agg_df.tail()

Unnamed: 0_level_0,weighted_avg,avg,count,wb,vader
Created_EST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020-02-05,1.241135,0.691095,3807.0,0.062,0.046129
2020-08-20,2.695002,0.709946,4082.0,0.161087,0.074849
2020-09-01,3.214615,0.708621,4338.0,0.236101,0.045215
2020-07-22,2.468863,0.603455,4978.0,0.186348,0.053659
2020-02-04,1.326475,0.699393,5103.0,0.145415,0.061487


We're all wrapped up with the sentiment scores. Scores greater than 0 are bullish and less than zero are bearish. We can output this file and use it in our forecasting model.

In [40]:
# Output
agg_df.to_csv('Sentiment Scores.csv')