# r/wallstreetbets Sentiment Analysis

In [1]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected = True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px

In [2]:
wsb = pd.read_csv("./wsbsentiment.csv", names = ['title', 'text', 'sentiment'], encoding = "utf-8", encoding_errors = 'ignore')

print(wsb.head)

<bound method NDFrame.head of                                                  title  \
0    Yolo'd my first ever 8k on an FHA 35 mortgage ...   
1    Britons use of consumer credit is rising with ...   
2                           Norwegian salmon over eggs   
3    Calling all dividend investors  Stop making in...   
4    Let get fucked This is not how imagined 2022 I...   
..                                                 ...   
320  To the two sitting in front of me on the plane...   
321  Bearish on Musk buying TWTR Here's what you ca...   
322                2MM Twitter YOLO. In Elon we Trust.   
323  What do you guys say about Ford I kept watchin...   
324                       INTC YOLO  Already down $15k   

                                                  text sentiment  
0                                                  NaN  positive  
1    Excuse my retardedness but couldn't this lead ...  negative  
2                                                  NaN   neutral  
3    

In [4]:
fig = px.histogram(wsb, x = "sentiment")
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Sentiment Score')
fig.show()

## Naive Bayes Classifier

Download the `punkt` tokenizer models and the `stopwords` list from `nltk`.

In [5]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kim3\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kim3\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Remove neutral posts so that only positive and negative labels remain, then build a list of `[content, sentiment]` pairs: one for each `title`, plus one for each non-empty `text` body.

In [6]:
wsblist = wsb.drop(columns = ['text'])
wsbtext = wsb.dropna(subset = ['text'])
wsbtext = wsbtext.drop(columns = ['title'])
wsblist = wsblist[wsblist['sentiment'] != 'neutral']
wsblist = wsblist.values.tolist()
wsbtext = wsbtext[wsbtext['sentiment'] != 'neutral']
wsbtext = wsbtext.values.tolist()
wsblist = wsblist + wsbtext

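Tokenize the posts, remove stopwords, and build a word-presence feature dictionary for each post (painfully, and by hand). The result is cached in `words.pkl` so it only has to be computed once.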

In [7]:
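import pickle

# Build the vocabulary: every lowercased token in the dataset, minus stopwords
all_words = set(word.lower() for post in wsblist for word in word_tokenize(post[0]))
stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(["'s", "'d"])
all_words.difference_update(stopwords)

# Load cached features if they exist; otherwise build and cache them
try:
    with open('words.pkl', 'rb') as fp:
        words = pickle.load(fp)
except FileNotFoundError:
    # One (features, label) pair per post; each feature records whether a
    # vocabulary word appears among the post's tokens. The tokens keep their
    # original case, so capitalized words do not match the lowercased vocabulary.
    words = []
    for post in wsblist:
        tokens = set(word_tokenize(post[0]))
        words.append(({word: (word in tokens) for word in all_words}, post[1]))
    with open('words.pkl', 'wb') as fp:
        pickle.dump(words, fp)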
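
The feature dictionaries map every vocabulary word to `True` or `False` depending on whether it appears in a post. Here is a minimal sketch of that idea on two made-up posts. It uses plain `str.split` in place of `word_tokenize` so it runs without any NLTK data, and the names `toy_posts`, `toy_stopwords`, and `presence_features` are hypothetical, chosen only for illustration.

```python
# Toy labeled posts (made up for illustration, not from the dataset)
toy_posts = [
    ("tendies to the moon", "positive"),
    ("down 15k on calls", "negative"),
]
toy_stopwords = {"to", "the", "on"}

# Vocabulary: every lowercased token across all posts, minus stopwords
vocab = {w.lower() for text, _ in toy_posts for w in text.split()} - toy_stopwords

def presence_features(text, vocab):
    """Map each vocabulary word to True/False for one post."""
    tokens = {w.lower() for w in text.split()}
    return {word: (word in tokens) for word in vocab}

featuresets = [(presence_features(text, vocab), label) for text, label in toy_posts]
print(sorted(vocab))
print(featuresets[0][0]["moon"], featuresets[1][0]["moon"])
```

Unlike the cell above, this sketch lowercases the tokens of each post before the membership test, so capitalized words still match the vocabulary.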


80/20 split on the training versus testing data.

In [8]:
print(len(words))
train_data = words[:(round(len(words) * 0.8))]
test_data = words[(round(len(words) * 0.8)):]

431
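
The slices above cut the list in its existing order, and `wsblist` holds all of the titles followed by all of the text bodies, so the test slice comes entirely from the end of that list. One possible alternative, sketched here under the assumption that a shuffled split is acceptable, is to shuffle a copy first. The helper name `split_80_20` and the fixed seed are hypothetical.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle a copy of items, then cut it at the 80% mark."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 431 feature sets in `words`
train, test = split_80_20(range(431))
print(len(train), len(test))
```

With 431 items the cut lands at 345, leaving 86 items for testing, the same sizes the unshuffled slices produce.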


Train the Bayes classifier on the training data, then determine accuracy based on test data.

In [9]:
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)
print('Accuracy is ', classify.accuracy(classifier, test_data))
classifier.show_most_informative_features(20)

Accuracy is  0.6162790697674418
Most Informative Features
                  making = True           negati : positi =      9.4 : 1.0
               contracts = True           negati : positi =      6.6 : 1.0
                  growth = True           negati : positi =      5.2 : 1.0
                    lead = True           negati : positi =      5.2 : 1.0
                  longer = True           negati : positi =      5.2 : 1.0
                       0 = True           negati : positi =      4.5 : 1.0
                   alone = True           negati : positi =      4.5 : 1.0
                  banned = True           negati : positi =      4.5 : 1.0
                dividend = True           negati : positi =      4.5 : 1.0
                 exports = True           negati : positi =      4.5 : 1.0
                   might = True           negati : positi =      4.5 : 1.0
                    rest = True           negati : positi =      4.5 : 1.0
                   seems = True           

With only 431 labeled examples, about 62% accuracy on a two-class problem is weak. This could probably be better with more data.

Classify a custom post. Write a post in the `custom_wsb` variable, then run the cell to see what the classifier thinks its sentiment is.

In [10]:
from nltk.tokenize import word_tokenize
custom_wsb = "to the moon"
custom_tokens = word_tokenize(custom_wsb)
print(classifier.classify(dict([token, True] for token in custom_tokens)))

positive


Try words like "bullish", "bearish", "bull", "infinity", or "hedge fund"!

## Text Generation

We'll need to retokenize the data, this time without the sentiment labels. We'll use the Naive Bayes classifier later to determine the sentiment of the generated text. Start by making a list with all of the titles and text.

In [40]:
wsbstrlist = []
for index, row in wsb.iterrows():
    wsbstrlist.append(str(row['title']))
    wsbstrlist.append(str(row['text']))
wsbstrlist[0]

"Yolo'd my first ever 8k on an FHA 35 mortgage 2 years ago and now have over 100k 'equity' Loving this K shaped recovery lol Idk if this is allowed"

We will now pre-process the documents. This will tokenize the text, then lemmatize the tokens. We'll also compute bigrams.

In [41]:
from nltk.tokenize import RegexpTokenizer
# Split documents into tokens
tokenizer = RegexpTokenizer(r"[A-Za-z0-9']+\b")
for index in range(len(wsbstrlist)):
    wsbstrlist[index] = str(wsbstrlist[index]).lower()
    wsbstrlist[index] = tokenizer.tokenize(wsbstrlist[index])
print(type(wsbstrlist))
print(wsbstrlist[0])

# Remove words that are only one character
wsbstrlist = [[token for token in post if len(token) > 1] for post in wsbstrlist]

# Remove nans
wsbstrlist = [[token for token in post if token != 'nan'] for post in wsbstrlist]

<class 'list'>
["yolo'd", 'my', 'first', 'ever', '8k', 'on', 'an', 'fha', '35', 'mortgage', '2', 'years', 'ago', 'and', 'now', 'have', 'over', '100k', "'equity", 'loving', 'this', 'k', 'shaped', 'recovery', 'lol', 'idk', 'if', 'this', 'is', 'allowed']


We can now use the WordNet lemmatizer from `nltk`. A lemmatizer is preferred over a stemmer because it provides more readable words, even in the extreme case of WSB's unique vocabulary.

In [5]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

Now, we can lemmatize the document.

In [42]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
wsbstrlist = [[lemmatizer.lemmatize(token) for token in post] for post in wsbstrlist]
while [] in wsbstrlist:
    wsbstrlist.remove([])

This lets us find bigrams in our WSB posts. Bigrams are pairs of adjacent words. Using bigrams, we can get phrases like "free_tendies" in our output. Without bigrams, we would only get "free" and "tendies".

Note that in the code below, when we find bigrams, we append them to the original tokens instead of replacing the individual words, since we'd like to keep the words "free" and "tendies" as well as the bigram "free_tendies".

***Careful!** Computing n-grams of a large dataset can be very computationally and memory intensive.*

In [43]:
from gensim.models.phrases import Phrases
ngram = Phrases(wsbstrlist, min_count = 20)
for index in range(len(wsbstrlist)):
    for token in ngram[wsbstrlist[index]]:
        if '_' in token:
            wsbstrlist[index].append(token)
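
To make the append-rather-than-replace step concrete, here is a simplified sketch in plain Python. It keeps a bigram whenever the pair occurs at least `min_count` times. This is an assumption made for illustration: gensim's `Phrases` also applies a scoring threshold, which this sketch leaves out. The toy `posts` are made up.

```python
from collections import Counter

# Toy tokenized posts (made up for illustration)
posts = [
    ["free", "tendies", "today"],
    ["want", "free", "tendies"],
    ["free", "money"],
]
min_count = 2  # hypothetical threshold; the notebook uses 20

# Count every pair of adjacent tokens across all posts
pair_counts = Counter(
    (a, b) for post in posts for a, b in zip(post, post[1:])
)

# Append frequent pairs as joined tokens, keeping the original words
augmented = []
for post in posts:
    bigrams = [
        f"{a}_{b}" for a, b in zip(post, post[1:])
        if pair_counts[(a, b)] >= min_count
    ]
    augmented.append(post + bigrams)

print(augmented[0])
```

The first post keeps "free", "tendies", and "today" and gains "free_tendies" at the end, while the third post is unchanged because "free money" appears only once.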

In [44]:
wsbstrlist = [item for subitem in wsbstrlist for item in subitem]

A neural network works with numbers, not text. So we'll need to convert the tokens in our input to numbers. We'll sort the set of all tokens that appear in our input text, then use the `enumerate` function to assign each one a number. We then create a dictionary that maps each token to the number that represents it. (The variable names say `char`, but each unit here is a whole token: a word or a bigram.)

In [45]:
chars = sorted(list(set(wsbstrlist)))
char_to_num = dict((c, i) for i, c in enumerate(chars))
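
A small sketch of the same mapping on a made-up token list, including the reverse lookup that is needed later to turn predictions back into text. The names `token_to_id` and `id_to_token` are hypothetical stand-ins for the notebook's dictionaries.

```python
tokens = ["to", "the", "moon", "to", "the", "tendies"]

# Sorted unique tokens give each token a stable integer ID
vocab = sorted(set(tokens))
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

# Encoding and then decoding should return the original tokens
encoded = [token_to_id[tok] for tok in tokens]
decoded = [id_to_token[i] for i in encoded]
print(encoded)
```

Sorting before enumerating matters: a bare `set` has no guaranteed order, so the IDs could differ between runs and would no longer match saved model weights.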

We need the total length of our input and the size of our vocabulary for later data prep, so we'll store these in variables. To check that converting the posts to tokens has worked so far, let's print both values. (The printed label says "characters", but the count is of tokens.)

In [46]:
input_len = len(wsbstrlist)
vocab_len = len(chars)
print ("Total number of characters:", input_len)
print ("Total vocab:", vocab_len)

Total number of characters: 50660
Total vocab: 5088


Now that we've transformed the data into the form it needs to be in, we can begin making a dataset out of it, which we'll feed into our network. We need to define how long we want an individual sequence (one window of input tokens, stored as integers) to be. We'll set a length of 100 for now, and declare empty lists to store our input and output data.

In [47]:
seq_length = 100
x_data = []
y_data = []

Now we need to go through the entire list of inputs and convert the tokens to numbers. We'll do this with a `for` loop. This creates overlapping sequences, each starting one token later than the previous one, beginning with the first token.

In [48]:
for i in range(0, input_len - seq_length, 1):
    # Input: the seq_length tokens starting at position i
    in_seq = wsbstrlist[i:i + seq_length]

    # Output: the single token that follows the input window
    out_seq = wsbstrlist[i + seq_length]

    # Convert the tokens to integers with the mapping built earlier
    # and append them to our lists
    x_data.append([char_to_num[char] for char in in_seq])
    y_data.append(char_to_num[out_seq])
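
The windowing is easier to see on a short list. This sketch applies the same loop to six made-up IDs with a window of 3 instead of 100. The function name `make_windows` is hypothetical.

```python
def make_windows(ids, seq_length):
    """Pair each run of seq_length IDs with the single ID that follows it."""
    inputs, targets = [], []
    for i in range(len(ids) - seq_length):
        inputs.append(ids[i:i + seq_length])
        targets.append(ids[i + seq_length])
    return inputs, targets

ids = [3, 2, 0, 3, 2, 1]
inputs, targets = make_windows(ids, seq_length=3)
print(inputs)
print(targets)
```

A list of `n` IDs yields `n - seq_length` windows, which is why 50660 tokens with a window of 100 give 50560 patterns.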

Now we have our input sequences of tokens and our outputs, each of which is the token that should come after its sequence ends. These are our training data features and labels, stored as `x_data` and `y_data`. Let's save our total number of sequences and check how many input sequences we have.

In [49]:
n_patterns = len(x_data)
print ("Total Patterns:", n_patterns)

Total Patterns: 50560


Now we'll convert our input sequences into a numpy array of shape `(n_patterns, seq_length, 1)`, which is the `(samples, timesteps, features)` layout the LSTM layer expects. We'll also divide the integer IDs by the vocabulary size so that the inputs are floats between 0 and 1.

In [50]:
import numpy as np
X = np.reshape(x_data, (n_patterns, seq_length, 1))
X = X/float(vocab_len)

One-hot encode the label data.

In [51]:
from keras.utils import to_categorical
y = to_categorical(y_data)
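
As a rough illustration of what the last two cells produce, the sketch below reshapes and scales three made-up windows and one-hot encodes their targets with plain numpy. Indexing `np.eye` is used here as a stand-in for the Keras helper, not as its actual implementation, and the toy values are assumptions.

```python
import numpy as np

inputs = [[3, 2, 0], [2, 0, 3], [0, 3, 2]]  # toy ID windows
targets = [3, 2, 1]                          # toy next-token IDs
vocab_len = 4

# Shape (samples, timesteps, features) with values scaled into [0, 1)
X = np.reshape(inputs, (len(inputs), 3, 1)) / float(vocab_len)

# One-hot labels: row i has a 1 in column targets[i]
y = np.eye(vocab_len)[targets]

print(X.shape, y.shape)
```

Each label row has exactly one nonzero entry, so the label array has one column per vocabulary entry, which matches the `(50560, 5088)` shape printed for the real data.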

Since our features and labels are now ready for the network to use, let's go ahead and create our LSTM model. We specify the kind of model we want to make (a `Sequential` one), and then add our first layer.

We'll follow each LSTM layer with dropout to reduce overfitting, stacking three LSTM layers in total. Then we'll add the final layer, a densely connected softmax layer that outputs a probability for each token in the vocabulary being the next one in the sequence.

In [61]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
print(X.shape)
print(y.shape)

(50560, 100, 1)
(50560, 5088)


Compile and train the model.

In [None]:
from keras.callbacks import ModelCheckpoint
model.compile(loss='categorical_crossentropy', optimizer='adam')
filepath = "model_weights_saved.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
desired_callbacks = [checkpoint]
model.fit(X, y, epochs=20, batch_size=256, callbacks=desired_callbacks)

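After it has finished training, we'll specify the file name and load in the weights, then recompile the model. The weights file must have been saved from a model with the same architecture and vocabulary size; otherwise loading fails with a shape mismatch like the one below.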

In [58]:
filename = "model_weights_saved.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

ValueError: Cannot assign value to variable ' dense_8/kernel:0': Shape mismatch.The variable shape (128, 5088), and the assigned value shape (128, 38) are incompatible.

Since we converted the tokens to numbers earlier, we need a dictionary that converts the model's numeric output back into tokens.

In [35]:
num_to_char = dict((i, c) for i, c in enumerate(chars))

To generate text, we give the trained model a randomly chosen seed sequence from `x_data` that it can continue from.

In [36]:
start = np.random.randint(0, len(x_data) - 1)
pattern = x_data[start]
print("Random Seed:")
print("\"", ''.join([num_to_char[value] for value in pattern]), "\"")

Random Seed:
" 'done'some'work'for'the'poultry'industry'in'the'past'because'of'the'bird'flu'scare'they'wanted'to'unload'ton'of'eggs'quickly'at'cut'rate'costs'they'reached'out'to'me'because'they'know'am'on'wsb'and'do'some'day'trading'saw'this'opportunity'and'agreed'to'sell'some'futures'contracts'ended'up'selling'the'agreed'upon'allotment'of'dce'jd'contracts'fresh'hen'egg'futures'turns'out'probably'could'have'sold'those'contracts'for'way'more'than'did'but'we'farmers'and'wanted'to'make'sure'the'eggs'would'move'before'they'went'bad'only'took "


Generate new tokens from the seed pattern, feeding each prediction back in as input.

In [33]:
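generated = []
for i in range(1000):
    # Shape the current window as (batch, timesteps, features) and scale it
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(vocab_len)
    prediction = model.predict(x, verbose=0)

    # Greedy decoding: take the most probable next token
    index = int(np.argmax(prediction))
    generated.append(num_to_char[index])

    # Slide the window forward by one token without mutating x_data
    pattern = pattern[1:] + [index]
print(' '.join(generated))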
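
The generation loop can be tried without a trained network. In this sketch, `fake_predict` is a hypothetical stand-in for `model.predict` that always favors the ID after the last one in the window, and the four-token vocabulary is made up. Only the window-sliding logic is meant to mirror the cell above.

```python
def fake_predict(window, vocab_len):
    """Stand-in for model.predict: always favors (last ID + 1) mod vocab_len."""
    scores = [0.0] * vocab_len
    scores[(window[-1] + 1) % vocab_len] = 1.0
    return scores

id_to_token = {0: "moon", 1: "tendies", 2: "the", 3: "to"}
pattern = [3, 2, 0]  # seed window of token IDs
generated = []

for _ in range(4):
    scores = fake_predict(pattern, len(id_to_token))
    index = scores.index(max(scores))  # greedy choice, like np.argmax
    generated.append(id_to_token[index])
    pattern = pattern[1:] + [index]    # drop the oldest ID, append the new one

print(" ".join(generated))
```

The window always keeps the same length: one ID leaves at the front for every ID added at the back, so the model sees a fixed-size input at each step.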

## tendiebot

To run tendiebot, run the following code to start the Flask webserver.

In [None]:
%run flask/app.py