# Building a subreddit classifier
An exploration of NLP packages and evaluation of classifier models

## Problem Statement
The world of investing is a multifaceted universe. Long gone are the days of simple shareholding. Today, traders can choose from a variety of securities to suit their needs. We want to zoom in on 2 major classes of securities: stocks, and options, which are derivatives of stocks. Some traders exclusively trade stocks, while others exclusively trade options, and both are able to make money. In this project, we will seek to understand posts from 2 subreddits devoted to these 2 classes of traders: r/stocks and r/options.

**Problem statement:** What are the most representative text features of posts from each subreddit that will allow us to correctly classify them?

To answer this question, we will explore a variety of classification models and natural language processing (NLP) tools to build a classifier that can reliably determine whether a given post is from r/options or r/stocks. This classifier should be able to identify the top predictive features for each subreddit, and from there, we hope to additionally answer questions such as "what trading strategies are currently trending among option and stock traders" and "what companies are currently popular with traders". This information will be useful to traders who want to better understand current trends within the market, regardless of whether they trade more in options or in stocks.

## Executive Summary

This project seeks to build a classifier that is able to differentiate between posts from r/options and r/stocks, two trading subreddits focused on different classes of securities. The classifier should be able to identify what features are important in discriminating between these subreddits and use them to make accurate predictions.

The methodology of our project is as follows:

1. Scrape data from our 2 subreddits
    - scrape top posts by month; data was scraped on 10th March 2021
2. Data cleaning
    - ensure there are no duplicate posts after scraping
    - process text data to make it suitable for EDA and modelling
    - check for pseudo-duplicates, e.g. megathreads with the same body but a different date in the title
3. EDA
    - examine frequency of features
    - examine post length
    - examine post engagement via upvotes
4. Modelling
    - 2 parametric models
        - Logistic Regression, Multinomial Naive Bayes
    - 4 non-parametric tree models
        - Random Forest, Extra Trees, AdaBoosted trees, GradientBoosted trees
    - 2 vectorizers
        - CountVectorizer
        - TfidfVectorizer
5. Evaluation of models
    - compare models based on accuracy score and ROC AUC score
    - because the classes are balanced, these metrics are very close
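
The pseudo-duplicate check in step 2 could look something like the sketch below. It is an illustration under assumptions, not the method used in the cleaning notebook: the function names `normalise` and `find_pseudo_duplicates` are hypothetical, the example posts are made up, and "same body" is taken to mean identical text after lowercasing and collapsing whitespace.

```python
import re
from collections import defaultdict


def normalise(text):
    """Lowercase and collapse whitespace so trivially different bodies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_pseudo_duplicates(posts):
    """Return lists of titles whose posts share the same non-empty body text."""
    groups = defaultdict(list)
    for post in posts:
        body = normalise(post["selftext"])
        if body:  # skip posts with an empty body (e.g. link or image posts)
            groups[body].append(post["title"])
    # keep only bodies that appear under more than one distinct title
    return [titles for titles in groups.values() if len(set(titles)) > 1]


# Made-up example posts, for illustration only
example_posts = [
    {"title": "Daily Discussion Thread - Mar 08, 2021",
     "selftext": "Please use this thread  for general questions."},
    {"title": "Daily Discussion Thread - Mar 09, 2021",
     "selftext": "Please use this thread for general questions.\n"},
    {"title": "Is this a good entry?",
     "selftext": "Thinking about buying shares."},
    {"title": "Chart", "selftext": ""},
]

pseudo_duplicates = find_pseudo_duplicates(example_posts)
```

Here `pseudo_duplicates` holds one group containing the two megathread titles. Dropping duplicates on `title` and `selftext` together would miss these posts, because their titles differ.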
        
Results show that our Logistic Regression model with TfidfVectorizer produces the best result. It is the least overfit (a difference of 5% between the train and validation sets) and has the highest accuracy (89% after full optimization). Full details of our model testing are discussed in our 3rd notebook.

We found that top feature predictors for r/options are more technical in nature and unique to option trading strategies and options in general. Examples include terms like `call`, `put`, `premium`, `credit spread` and such. Top feature predictors for r/stocks show an emphasis on fundamental research on companies or the broader market. Examples include terms like `invest`, `revenue`, `market`, and `company`. 

Recommendations for future improvement of our classifier include scraping more posts every month to increase the corpus we can train our model on. The exploration of named entity recognition and sentiment analysis is also recommended for users wishing to use our classifier to identify what companies are most frequently discussed on a given subreddit, as well as the valence of the discussion surrounding these companies.

A limitation of our current model is that the content of discussion on these subreddits can vary over time, as market trends are constantly evolving. As such, topics that are in vogue currently may fade into obscurity 6 months from now, which will throw off the performance of our classifier. Hence, this classifier will need to be retrained at an appropriate time interval in order to maintain its performance.

## Set up OAuth to use the Reddit API
We will use Reddit's recommended method for accessing their API, which requires the use of OAuth. If you do not have a Reddit account, you can still access the API by calling the .json version of the page instead.
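
For readers without an account, the .json route mentioned above might be sketched as follows. This is a minimal example using only the standard library, not the approach taken in the rest of this notebook. The helper names `build_public_listing_request` and `fetch_public_listing` and the User-Agent string are placeholders made up for this example.

```python
import json
import urllib.parse
import urllib.request


def build_public_listing_request(subreddit, listing="top", **params):
    """Build a request for the public .json version of a subreddit listing."""
    url = f"https://www.reddit.com/r/{subreddit}/{listing}.json"
    query = urllib.parse.urlencode(params)
    if query:
        url += "?" + query
    # A descriptive User-Agent is expected; this value is only a placeholder
    return urllib.request.Request(
        url, headers={"User-Agent": "classifier proj (placeholder)"}
    )


def fetch_public_listing(request):
    """Send the request and parse the JSON body (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_public_listing_request("options", t="month", limit=100)
print(request.full_url)
```

Calling `fetch_public_listing(request)` should return the same `data` / `children` structure that the authenticated endpoint returns, though unauthenticated requests may be rate-limited more strictly.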

In [1]:
import requests
import pandas as pd
import random
import time
pd.options.display.html.use_mathjax = False

In [16]:
# note that CLIENT_ID refers to 'personal use script' and SECRET_TOKEN to 'secret'
auth = requests.auth.HTTPBasicAuth('<CLIENT_ID>', '<SECRET_TOKEN>')

# here we pass our login method (password), username, and password
data = {'grant_type': 'password',
        'username': '<USERNAME>',
        'password': '<PASSWORD>'}

# setup our header info, which gives reddit a brief description of our app
headers = {'User-Agent': 'classifier proj by u/ltckrompirov'}

# send our request for an OAuth token
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

# convert response to JSON and pull access_token value
TOKEN = res.json()['access_token']

# add authorization to our headers dictionary
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# while the token is valid (~2 hours) we just add headers=headers to our requests, like this
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)

<Response [200]>

In [14]:
res.status_code

200
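
The credentials in the authentication cell are placeholders. One way to avoid pasting real values into a notebook is to read them from environment variables, sketched below. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical and chosen only for this example.

```python
import os

# Hypothetical environment variable names; use whatever suits your setup
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_SECRET",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def load_reddit_credentials(env=None):
    """Return the four credentials as a dict, raising if any are missing or empty."""
    if env is None:
        env = os.environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be passed to `requests.auth.HTTPBasicAuth` and to the `data` dictionary in place of the placeholder strings.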

In [2]:
def df_from_json(url, cycles=10, params=None):
    """Scrape up to `cycles` pages of posts from a Reddit listing URL into a dataframe."""
    # avoid a mutable default argument
    if params is None:
        params = {'limit': '100'}

    # collect one dict per post and build the dataframe once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []
    # this is to tell the api where to start each new cycle from
    after = None
    for _ in range(cycles):
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers=headers, params=params)

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        listing = res.json()['data']
        for post in listing['children']:
            # keep only the relevant fields
            rows.append({'subreddit': post['data']['subreddit'],
                'title': post['data']['title'],
                'selftext': post['data']['selftext'],
                'upvote_ratio': post['data']['upvote_ratio'],
                'ups': post['data']['ups'],
                'downs': post['data']['downs'],
                'score': post['data']['score'],
                'author': post['data']['author']
            })

        after = listing['after']
        # stop when there are no more pages, otherwise we would restart from the first page
        if after is None:
            break

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(6, 12)
        print(sleep_duration)
        time.sleep(sleep_duration)

    df = pd.DataFrame(rows)
    # drops all duplicate rows based on the text values of title and selftext
    if not df.empty:
        df = df.drop_duplicates(subset=['title', 'selftext'])

    return df

### Scraping from r/options
Using our function above makes scraping 1000 posts trivial. Just come back in about 3 minutes.

To scrape from the top 1000 posts this month, we have to pass in the parameters appropriately.
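
To make the parameters concrete, the sketch below builds the query string for a first page and for a follow-up page. The helper `listing_params` is hypothetical and written only for this illustration. The `after` value is a post ID of the kind the listing returns to mark where the next page starts.

```python
from urllib.parse import urlencode


def listing_params(time_filter="month", limit=100, after=None):
    """Build query parameters for a 'top' listing request."""
    # 't' selects the time window; 'limit' is the number of posts per request
    params = {"t": time_filter, "limit": str(limit)}
    if after is not None:
        # 'after' tells the API which post the next page should start from
        params["after"] = after
    return params


first_page = listing_params()
next_page = listing_params(after="t3_lpe4sw")

print(urlencode(first_page))  # t=month&limit=100
print(urlencode(next_page))   # t=month&limit=100&after=t3_lpe4sw
```

Each request returns at most 100 posts, so the default 10 cycles yields up to 1000 posts.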

In [91]:
df_options = df_from_json('http://oauth.reddit.com/r/options/top', params={'sort':'top','limit':'100','t':'month'})

http://oauth.reddit.com/r/options/top
10
http://oauth.reddit.com/r/options/top?after=t3_lpe4sw
10
http://oauth.reddit.com/r/options/top?after=t3_lw5mdv
12
http://oauth.reddit.com/r/options/top?after=t3_m003jm
11
http://oauth.reddit.com/r/options/top?after=t3_ljftql
13
http://oauth.reddit.com/r/options/top?after=t3_lioy86
11
http://oauth.reddit.com/r/options/top?after=t3_lol75v
15
http://oauth.reddit.com/r/options/top?after=t3_lql8ss
17
http://oauth.reddit.com/r/options/top?after=t3_lxoxr6
10
http://oauth.reddit.com/r/options/top?after=t3_lnm8qy
20


In [95]:
# save our scraped data (df_from_json has already dropped duplicate posts)
df_options.to_csv('options_topmonth.csv', index=False)

### Scraping from r/stocks

In [85]:
df_stocks = df_from_json('http://oauth.reddit.com/r/stocks/top', params={'sort':'top','limit':'100','t':'month'})

http://oauth.reddit.com/r/stocks/top
38
http://oauth.reddit.com/r/stocks/top?after=t3_lj3ec6
12
http://oauth.reddit.com/r/stocks/top?after=t3_lkj30k
36
http://oauth.reddit.com/r/stocks/top?after=t3_ly92v2
8
http://oauth.reddit.com/r/stocks/top?after=t3_lv5vnf
23
http://oauth.reddit.com/r/stocks/top?after=t3_m06i7b
40
http://oauth.reddit.com/r/stocks/top?after=t3_lhspgz
30
http://oauth.reddit.com/r/stocks/top?after=t3_lxinux
14
http://oauth.reddit.com/r/stocks/top?after=t3_lk4mzl
10
http://oauth.reddit.com/r/stocks/top?after=t3_lxlp5v
12


In [94]:
# save our scraped data (df_from_json has already dropped duplicate posts)
df_stocks.to_csv('stocks_topmonth.csv', index=False)