---
layout: code-post
title: Predicting rooting interests on reddit
description: Going to see if we can predict neutral fan rooting interests from reddit posts.
tags: [neural nets]
---

In this notebook / post I'm going to see whether we can predict who a
supposedly neutral fan is rooting for in the (as of this writing) about-to-end
Lakers-Heat NBA Finals. To do this, I'm going to scrape comments from game
threads by Heat and Lakers fans and use them to train an LSTM-based model. As
long as I can get the flair from the reddit API, this should be doable. If
not, I will have to scrape the team-specific subreddits and try it that
way.

Outline:
- Accessing Reddit
  - Reddit's API
  - Getting posts
- Making a dataset
  - Encoding posts
  - Saving the data
- Training a model
- Test set predictions

## Accessing reddit

If you don't know what [reddit](https://reddit.com/) is, it's a website that is
organized into communities called _subreddits_. The site has _users_, each of whom can
belong to multiple subreddits. Each subreddit contains a sequence of _submissions_ (also
called _posts_), which can be links to other sites, images, or text. Each post has
_comments_, which are text only and are written by the users. The _comment section_ is
organized as a forest of trees: top-level comments, with replies nested below each
top-level comment. Each user can have a _flair_, which varies with the subreddit they
are posting in. The flair and username are shown along with each comment. A flair
contains a small image as well as text. Not every user has a flair in every
subreddit to which they belong. It is not always required to be a member of a
subreddit in order to post, although this varies by community.
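
To make the "forest of trees" picture concrete, here is a minimal sketch in plain Python. The `Node` class and the `flatten` function are hypothetical names for illustration, not part of reddit's API or of any library; they only show how nested comments can be walked level by level into one flat list.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A hypothetical stand-in for a comment: some text plus nested replies."""
    body: str
    replies: List["Node"] = field(default_factory=list)


def flatten(top_level: List[Node]) -> List[Node]:
    """Walk the forest breadth-first and return every comment in one list."""
    queue = deque(top_level)
    flat = []
    while queue:
        node = queue.popleft()
        flat.append(node)
        # replies are visited after everything already waiting in the queue
        queue.extend(node.replies)
    return flat


forest = [
    Node("what a shot", replies=[Node("unreal", replies=[Node("agreed")])]),
    Node("defense is slipping"),
]
print([node.body for node in flatten(forest)])
```

Both top-level comments come out first, followed by the nested replies.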

We will be using the NBA subreddit which as of this writing has ~3.5 million users.
Users in this subreddit have flairs which denote which team the user is a supporter
of. Mine is for the Cleveland Cavaliers.

### PRAW - Exploring Reddit

In order to access reddit, we need a reddit account and an application created
with that account as its developer. Go to [the apps page](https://www.reddit.com/prefs/apps)
to create an application. (Please read the terms and conditions!)
You need four credentials: the `user_agent`, for which I store the name of the app; the
`client_id`, the 14 character string shown below the app name once it's created;
the `client_secret`, a 27 character secret that is generated for the app;
and your `username`. I have put these credentials in an encrypted YAML file
that I created using ansible-vault.
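
As a rough guide to what the decrypted file needs to hold, the sketch below checks a credentials dictionary for the four keys used in this post. The helper name `missing_reddit_creds` and the placeholder values are made up for illustration.

```python
REQUIRED_KEYS = ("username", "user_agent", "client_id", "client_secret")


def missing_reddit_creds(creds):
    """Return the required keys that are absent or empty in `creds`."""
    return [key for key in REQUIRED_KEYS if not creds.get(key)]


# placeholder values only; real credentials should stay in the encrypted file
example_creds = {
    "username": "my_reddit_username",
    "user_agent": "my_app_name",
    "client_id": "xxxxxxxxxxxxxx",
    "client_secret": "",
}
print(missing_reddit_creds(example_creds))
```

An empty list means all four values are present.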

We will be using [PRAW](https://github.com/praw-dev/praw), the Python Reddit API
Wrapper.

In [9]:
import praw
import yaml
from ansible_vault import Vault
from getpass import getpass

In [10]:
vault = Vault(getpass())
with open('redditcreds.yml', 'r') as f:
    reddit_creds = vault.load(f.read())

reddit = praw.Reddit(username=reddit_creds['username'],
                     user_agent=reddit_creds['user_agent'],
                     client_id=reddit_creds['client_id'],
                     client_secret=reddit_creds['client_secret'])



Now we go to the [NBA subreddit](https://www.reddit.com/r/nba) and search for _game threads_, the posts
in which users comment while a game is ongoing. These are posted automatically, so they have a
predictable title format, which makes it easier to search for them and then filter the returned submissions
by title.

In [76]:
subreddit = reddit.subreddit('nba')

submissions = subreddit.search(query='title:"GAME THREAD"',
                               time_filter="month")
game_threads = [
    s for s in submissions 
    if
        s.title[:11] == 'GAME THREAD' 
        and ("Lakers" in s.title or "Heat" in s.title)
]

In [77]:
for t in game_threads:
    print(t.title)

GAME THREAD: Miami Heat (44-29) @ Los Angeles Lakers (52-19) - (October 09, 2020)
GAME THREAD: Miami Heat (44-29) @ Los Angeles Lakers (52-19) - (September 30, 2020)
GAME THREAD: Los Angeles Lakers (52-19) @ Miami Heat (44-29) - (October 06, 2020)
GAME THREAD: Los Angeles Lakers (52-19) @ Miami Heat (44-29) - (October 04, 2020)
GAME THREAD: Miami Heat (44-29) @ Los Angeles Lakers (52-19) - (October 02, 2020)
GAME THREAD: Denver Nuggets (46-27) @ Los Angeles Lakers (52-19) - (September 26, 2020)
GAME THREAD: Los Angeles Lakers (52-19) @ Denver Nuggets (46-27) - (September 24, 2020)
GAME THREAD: Los Angeles Lakers (52-19) @ Denver Nuggets (46-27) - (September 22, 2020)
GAME THREAD: Denver Nuggets (46-27) @ Los Angeles Lakers (52-19) - (September 18, 2020)
GAME THREAD: Boston Celtics (48-24) @ Miami Heat (44-29) - (September 27, 2020)
GAME THREAD: Denver Nuggets (46-27) @ Los Angeles Lakers (52-19) - (September 20, 2020)
GAME THREAD: Miami Heat (44-29) @ Boston Celtics (48-24) - (Septembe
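
Because the titles follow a fixed pattern, they can also be parsed rather than just filtered. The sketch below is one way to do that with a regular expression. The pattern and the `parse_game_thread_title` helper are a guess at the format based on the titles printed above, so they may not cover every thread.

```python
import re
from datetime import datetime

# away team (record) @ home team (record) - (date), as seen in the titles above
TITLE_PATTERN = re.compile(
    r"^GAME THREAD: (?P<away>.+?) \((?P<away_record>\d+-\d+)\) @ "
    r"(?P<home>.+?) \((?P<home_record>\d+-\d+)\) - \((?P<date>[^)]+)\)$"
)


def parse_game_thread_title(title):
    """Split a game thread title into teams, records, and date, or return None."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        return None
    info = match.groupdict()
    info["date"] = datetime.strptime(info["date"], "%B %d, %Y").date()
    return info


title = "GAME THREAD: Miami Heat (44-29) @ Los Angeles Lakers (52-19) - (October 09, 2020)"
info = parse_game_thread_title(title)
print(info["away"], "at", info["home"], "on", info["date"])
```

Titles that do not match the pattern return `None`, so the same function can double as the filter.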

The comments for a submission are contained in a PRAW `CommentForest`. If we do not
care about the structure, we can flatten this to a list. Note that the list
will contain both `Comment` objects and `MoreComments` objects. It is possible to replace
the `MoreComments` objects using the `replace_more` method, but each replacement
requires a call to the reddit API. By default, up to 32 of the `MoreComments` objects will
be replaced. I will keep that limit, but only replace those which stand in for at least
5 more comments. Any `MoreComments` instances that are not replaced are removed, so none
remain in the list of comments.

In [93]:
game_thread = game_threads[0]
print(game_thread.title)
print("number of comments:", game_thread.num_comments)
game_thread.comments.replace_more(limit=32, threshold=5)
comments = game_thread.comments.list()

GAME THREAD: Miami Heat (44-29) @ Los Angeles Lakers (52-19) - (October 09, 2020)
number of comments: 28455
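
To get a feel for what the `limit` and `threshold` arguments do, here is a toy model in plain Python. It is not PRAW's code: the function `simulate_replace_more` is hypothetical, each `MoreComments` stub is reduced to the number of comments it hides, and I am assuming that the largest stubs are expanded first and that expanding a stub reveals no further stubs.

```python
import heapq


def simulate_replace_more(stub_counts, limit=32, threshold=0):
    """Toy model: return (expanded, dropped) stub sizes.

    Each expansion stands for one API request. Stubs are visited from
    largest to smallest (an assumption about the ordering).
    """
    # heapq is a min-heap, so store negated counts to pop the largest first
    heap = [-count for count in stub_counts]
    heapq.heapify(heap)

    expanded, dropped = [], []
    while heap:
        count = -heapq.heappop(heap)
        if len(expanded) < limit and count >= threshold:
            expanded.append(count)
        else:
            dropped.append(count)
    return expanded, dropped


expanded, dropped = simulate_replace_more([40, 3, 12, 1, 7], limit=2, threshold=5)
print("expanded (one request each):", expanded)
print("dropped from the forest:", dropped)
```

In this toy run the stub hiding 7 comments clears the threshold but is still dropped, because the limit of 2 requests is already used up.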


And here is one of the comments:

In [168]:
c = comments[10]
print('comment author:   ', c.author.name)
print('comment score:    ', c.score, '({}-{})'.format(c.ups, c.downs))
print('author flair text:', c.author_flair_text)
print('body:\n\n', c.body)

comment author:    StoneColdAM
comment score:     28 (28-0)
author flair text: Lakers
body:

 Dwight Howard strategy: get Jimmy Butler ejected.


For the training set, the goal is to find comments whose authors are flaired as Lakers or Heat, or whose
flair text contains `[LAL]` or `[MIA]`.

In [178]:
lakers_comments = [
    c for c in comments
    if
        c.author_flair_text is not None
        and (
            c.author_flair_text == 'Lakers'
            or '[LAL]' in c.author_flair_text
        )
]
print('number of comments by Lakers users:', len(lakers_comments))

heat_comments = [
    c for c in comments
    if
        c.author_flair_text is not None
        and (
            c.author_flair_text == 'Heat'
            or '[MIA]' in c.author_flair_text
        )
]
print('number of comments by Heat users  :', len(heat_comments))

number of comments by Lakers users: 1042
number of comments by Heat users  : 275
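
The two list comprehensions differ only in the team they look for, so the flair test can be folded into one helper. This is a sketch: `team_from_flair` and the `TEAM_FLAIRS` table are names introduced here, and the example flair `[MIA] Some Player` is invented to illustrate the bracketed pattern.

```python
# team label -> (plain flair text, bracketed abbreviation)
TEAM_FLAIRS = {
    "lakers": ("Lakers", "[LAL]"),
    "heat": ("Heat", "[MIA]"),
}


def team_from_flair(flair_text):
    """Return 'lakers' or 'heat' for a matching flair, otherwise None."""
    if flair_text is None:
        return None
    for team, (name, abbreviation) in TEAM_FLAIRS.items():
        if flair_text == name or abbreviation in flair_text:
            return team
    return None


for flair in ["Lakers", "[MIA] Some Player", "Cavaliers", None]:
    print(repr(flair), "->", team_from_flair(flair))
```

With this helper, a single pass over the comments can sort them into per-team lists.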


### Pushshift - Gathering data

It turns out that PRAW is fairly limited for this kind of search, and the [Pushshift API](https://github.com/pushshift/api) is
more powerful. There is a wrapper for this API called [PSAW](https://github.com/dmarx/psaw).
PSAW and PRAW interact with each other nicely: if you pass a PRAW `Reddit` instance to PSAW's
`PushshiftAPI` class, Pushshift gathers the IDs you want but then returns PRAW objects.

In [274]:
from psaw import PushshiftAPI
import datetime as dt

api = PushshiftAPI(reddit)

# we'll search month by month for submissions
game_threads = []

months = [
    int(dt.datetime(2019, 10, 19).timestamp()),
    int(dt.datetime(2019, 11, 1).timestamp()),
    int(dt.datetime(2019, 12, 1).timestamp()),
    int(dt.datetime(2020, 1, 1).timestamp()),
    int(dt.datetime(2020, 2, 1).timestamp()),
    int(dt.datetime(2020, 3, 1).timestamp()),
    int(dt.datetime(2020, 4, 1).timestamp()),
    int(dt.datetime(2020, 5, 1).timestamp()),
    int(dt.datetime(2020, 6, 1).timestamp()),
    int(dt.datetime(2020, 7, 1).timestamp()),
    int(dt.datetime(2020, 8, 1).timestamp()),
    int(dt.datetime(2020, 9, 1).timestamp()),
    int(dt.datetime(2020, 10, 1).timestamp()),
    int(dt.datetime(2020, 11, 1).timestamp())
]

for i in range(len(months)-1):
    submissions = list(api.search_submissions(after=months[i],
                                              before=months[i+1],
                                              q='game thread',
                                              subreddit='nba',
                                              author='NBA_MOD',
                                              limit=5000))
    
    game_threads += [
        s for s in submissions
        if
            s.title[:11] == 'GAME THREAD'
            and ("Lakers" in s.title or "Heat" in s.title)
    ]
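
The hand-written list of timestamps can also be generated. Here is a sketch, assuming the same window as above (October 19, 2019 up to November 1, 2020); `month_boundaries` is a hypothetical helper, not something PSAW provides.

```python
import datetime as dt


def month_boundaries(start, end):
    """Return `start`, then the first day of each later month that is not after `end`."""
    boundaries = [start]
    year, month = start.year, start.month
    while True:
        # step to the first day of the next month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        first_of_month = dt.datetime(year, month, 1)
        if first_of_month > end:
            break
        boundaries.append(first_of_month)
    return boundaries


boundaries = month_boundaries(dt.datetime(2019, 10, 19), dt.datetime(2020, 11, 1))
months = [int(b.timestamp()) for b in boundaries]
print(len(months), boundaries[0].date(), boundaries[-1].date())
```

This yields the same 14 boundaries as the explicit list, and changing the season only means changing the two dates.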

Now that we have game threads, it's time to collect relevant comments from each game.
We'll be using PRAW for this.

In [504]:
from IPython.display import clear_output

lakers_comments = []
heat_comments = []

num_game_threads = len(game_threads)
i = 1

def check_lakers(comment):
    if comment.author_flair_text is not None:
        if comment.author_flair_text == 'Lakers':
            return True
        if '[LAL]' in comment.author_flair_text:
            return True
    return False

def check_heat(comment):
    if comment.author_flair_text is not None:
        if comment.author_flair_text == 'Heat':
            return True
        if '[MIA]' in comment.author_flair_text:
            return True
    return False

for game_thread in game_threads:
    
    clear_output(wait=True)
    print('scouring game thread: {}/{}'.format(i, num_game_threads))
    print('number of lakers comments:', len(lakers_comments))
    print('  number of heat comments:', len(heat_comments))
    
    has_lakers = 'Lakers' in game_thread.title
    has_heat = 'Heat' in game_thread.title
    
    # NOTE: this part takes a long time
    game_thread.comments.replace_more(limit=32, threshold=2)
    comments = game_thread.comments.list()
    
    if has_lakers and has_heat:
        lakers = []
        heat = []
        
        for c in comments:
            if check_lakers(c) and len(c.body.split()) > 3:
                lakers += [c]
            elif check_heat(c) and len(c.body.split()) > 3:
                heat += [c]
                
        lakers_comments += lakers
        heat_comments += heat
        
    elif has_lakers:
        lakers =[]
        
        for c in comments:
            if check_lakers(c) and len(c.body.split()) > 3:
                lakers += [c]
                
        lakers_comments += lakers
      
    elif has_heat:
        heat = []
        
        for c in comments:
            if check_heat(c) and len(c.body.split()) > 3:
                heat += [c]
                
        heat_comments += heat
        
    i += 1

scouring game thread: 180/180
number of lakers comments: 47535
  number of heat comments: 12907
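
A back-of-the-envelope estimate shows why this loop is slow. Each `replace_more` replacement costs one API request, so 180 threads with `limit=32` can need up to 5,760 requests for the replacements alone. The rate of 60 requests per minute used below is an assumption about reddit's limit for an authenticated client, not something measured here, and `replace_more_budget` is a hypothetical helper.

```python
def replace_more_budget(num_threads, limit, requests_per_minute):
    """Upper bound on replacement requests and the minutes they take at a given rate."""
    max_requests = num_threads * limit
    return max_requests, max_requests / requests_per_minute


# 180 threads and limit=32 come from the run above; 60 requests/minute is assumed
max_requests, minutes = replace_more_budget(180, 32, requests_per_minute=60)
print(max_requests, "requests, about", minutes, "minutes")
```

This is only an upper bound: threads with fewer than 32 qualifying `MoreComments` objects need fewer requests.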


In [505]:
import pickle

with open('data/lakers_comments.pkl', 'wb') as f:
    pickle.dump(lakers_comments, f)
    
with open('data/heat_comments.pkl', 'wb') as f:
    pickle.dump(heat_comments, f)

### Preparing to clean

Cleaning text is tedious, and many people have already thought about what to do.
I am taking code from [this blog post](https://hub.packtpub.com/clean-social-media-data-analysis-python/).
The cleaning we're doing just replaces weird and unwanted characters and then
lemmatizes.

I have played around with the stop words a bit. Since this is really a form of
sentiment analysis, I put the negations back in. Since this is basketball, I also
want numbers such as two and three to be allowed, along with some of the directions like
up and down and over and under.

Since I'm going to use spaCy's vector encoding for the words instead of a dummy
one-hot encoding, I thought about replacing the player names with a placeholder
such as `heatplayer`, but this appears to be mostly unnecessary, as the built-in
vectors already cover names such as `lebron`. One can confirm this by checking that
`nlp('lebron')[0].vector` is not the zero vector.

In [536]:
import re
import itertools
import spacy

nlp = spacy.load('en_core_web_lg')

regexes = {
    'weird_chars': re.compile(r'[\?;\(\)\\.,\!:–\-\"\[\]“”]'),
    'newlines': re.compile(r'\n'),
    'html': re.compile(r'<[^<]+/>', re.MULTILINE),
    'urls': re.compile(r'^https?://.*[\r\n]*', re.MULTILINE),
    'spaces': re.compile(r'\s{2,}'),
    'u': re.compile(r'\bu\b')
}

allowed_stops = ['no', 'never', 'not', 'none', 'up', 'down', 
                 'back', 'over', 'under', 'two', 'three']
stop_words = [
    word for word in nlp.Defaults.stop_words 
    if word not in allowed_stops
]
stop_words += ['-PRON-']

def clean_text(text):
    """ substitute things based on the regexes above """
    text = text.lower()
    
    # custom replacements we decided on
    text = text.replace('`', '\'')
    text = text.replace('’', '\'')
    
    text  = text.replace("won't", "will not")
    text  = text.replace("n't", " not")
    
    text = regexes['u'].sub('you', text)
    text = regexes['html'].sub(' ', text)
    text = regexes['urls'].sub(' ', text)
    text = regexes['weird_chars'].sub(' ', text)
    text = regexes['newlines'].sub(' ', text)
    text = regexes['spaces'].sub(' ', text)
    
    # removing the multiletters such as 'happppy' -> 'happy'
    text = ''.join(''.join(s)[:2] for _, s in itertools.groupby(text))
    
    # lemmatize
    text = ' '.join([word.lemma_ for word in nlp(text)])
    
    # remove stopwords
    text = ' '.join([w for w in text.split(' ') if w not in stop_words])
    
    return text
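
Two of the steps in `clean_text` are easy to try in isolation, without loading spaCy. The helper names `collapse_repeats` and `expand_negations` are introduced here for illustration; their bodies are lifted from the function above.

```python
import itertools


def collapse_repeats(text):
    """Keep at most two of any run of the same character."""
    return ''.join(''.join(run)[:2] for _, run in itertools.groupby(text))


def expand_negations(text):
    """Rewrite contractions so that 'not' survives as its own word."""
    text = text.replace("won't", "will not")
    return text.replace("n't", " not")


print(collapse_repeats('happppy'))       # the run of four p's is cut to two
print(collapse_repeats('lets gooooo'))
print(collapse_repeats('1000 points'))   # runs of digits are cut as well
print(expand_negations("it clearly wasn't a foul"))
```

Note the third call: because the collapse works on any character, a number such as 1000 loses a digit, which may or may not matter for this data.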

Now let's just print some examples of what we end up with when cleaning the text:

In [551]:
def print_cleantext(text):
    print('|', text)
    print('|\t\t-----------')
    print('|',clean_text(text))

print_cleantext(heat_comments[50].body)
print('\n\n')
print_cleantext(heat_comments[1047].body)
print('\n\n')
print_cleantext(lakers_comments[500].body)
print('\n\n')
print_cleantext(lakers_comments[-200].body)
print('\n\n')
print_cleantext(lakers_comments[14477].body)

| He's coming back from an injury.
|		-----------
| come back injury



| Y'all thought that we were joking about the championship?
|		-----------
| think joke championship



| Flagrant 2 requires it to be excessively violent, which it pretty clearly wasn’t. Easy flagrant 1 call
|		-----------
| flagrant 2 require excessively violent pretty clearly not easy flagrant 1



| Get AD back in pls
|		-----------
| ad back pls



| Are we going to win a fucking championship? I can't believe this. I thought the warriors would be on top forever (and then some).
|		-----------
| win fucking championship not believe think warrior forever


### Cleaning and saving data

Now we start the (long?) work of converting the comments into vectors which we
can then save. We have approximately 60,000 comments total. Just for a ballpark estimate,
let's suppose there are 10 words per comment. Thus we need to store about 600k vectors.
How much space does each vector take? We can figure it out from the array's `nbytes`
attribute. (`sys.getsizeof` is misleading here: the vector is a view into spaCy's vector
table, so it reports only the size of the array header and not the 300 floats behind it.)

In [571]:
vector_size = nlp('lebron')[0].vector.nbytes
print('Vector size in bytes:', vector_size)
print('Estimate size to store:', vector_size * 600_000 / 1e6, 'MB')

Vector size in bytes: 1200
Estimate size to store: 720.0 MB


This still fits in memory and on disk on most machines, although it would get uncomfortable if we were off in our average comment size
by a factor of ten. We'll store the data as a list of tuples where each
tuple contains both a numpy array of shape (n, 300), where n is the number of words in the comment, and a 0 or 1
depending on whether the comment is from a Heat fan (0) or a Lakers fan (1). Previously we guaranteed that the comments
had a certain length, but stop word removal will shorten comments, so we will drop some comments here if there are
not enough words after cleaning.

In [595]:
import numpy as np

def comment_to_numpy_array(comment, min_words=3):
    
    # first step is to clean the comment
    clean_comment = clean_text(comment)
    
    # next is to get an nlp object
    nlp_comment = nlp(clean_comment)
    
    # get nonzero vectors
    word_vectors = [
        word.vector for word in nlp_comment
        if np.count_nonzero(word.vector) > 0
    ]
    
    # return sufficiently long comments as a single array of shape (n, 300)
    if len(word_vectors) >= min_words:
        return np.stack(word_vectors)
    else:
        return None

In [None]:
initial_lakers_comments = [
    comment_to_numpy_array(comment.body) for comment in lakers_comments
]

# label 1 = Lakers fan
lakers_vector_comments = [
    (comment, 1) for comment in initial_lakers_comments
    if comment is not None
]

initial_heat_comments = [
    comment_to_numpy_array(comment.body) for comment in heat_comments
]

# label 0 = Heat fan
heat_vector_comments = [
    (comment, 0) for comment in initial_heat_comments
    if comment is not None
]

with open('data/lakers_vector_comments.pkl', 'wb') as f:
    pickle.dump(lakers_vector_comments, f)
    
with open('data/heat_vector_comments.pkl', 'wb') as f:
    pickle.dump(heat_vector_comments, f)