# Building a subreddit classifier
An exploration of NLP packages and an evaluation of classifier models

## Problem Statement
The world of investing is multifaceted. Long gone are the days of simple shareholding: today, traders can choose from a variety of securities to suit their needs. We zoom in on two major classes of securities: stocks, and options, which are derivatives of stocks. Some traders trade stocks exclusively and others trade options exclusively, and both can make money. In this project, we seek to understand posts from two subreddits devoted to these two classes of traders, r/stocks and r/options.

**Problem statement:** What are the most representative features of posts from each subreddit that will allow us to correctly classify them?

To answer this question, we will explore a variety of classification models and natural language processing (NLP) tools to build a classifier that can reliably determine whether a given post is from r/options or r/stocks. This classifier should identify the top predictive features for each subreddit, and from there we hope to answer further questions such as "what trading strategies are currently trending among option and stock traders" and "what companies are currently popular with traders". This information will be useful to traders who want to better understand current trends within the market, regardless of whether they trade more in options or in stocks.

In [1]:
import requests
import pandas as pd
import random
import time
pd.options.display.html.use_mathjax = False

### Set up OAuth to use the Reddit API

In [16]:
# note that CLIENT_ID refers to 'personal use script' and SECRET_TOKEN to 'secret'
auth = requests.auth.HTTPBasicAuth('<CLIENT_ID>', '<SECRET_TOKEN>')

# here we pass our login method (password), username, and password
data = {'grant_type': 'password',
        'username': '<USER>',
        'password': '<PASSWORD>'}

# setup our header info, which gives reddit a brief description of our app
headers = {'User-Agent': 'classifier proj by <YOUR USER>'}

# send our request for an OAuth token
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

# convert response to JSON and pull access_token value
TOKEN = res.json()['access_token']

# add authorization to our headers dictionary
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# while the token is valid (~2 hours) we just add headers=headers to our requests
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)

<Response [200]>

In [17]:
headers

{'User-Agent': 'classifier proj by u/ltckrompirov',
 'Authorization': 'bearer 30729750034-o28D9IXGWevnUKpeR8FkIGmJZ30rpw'}
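The token request above indexes `res.json()['access_token']` directly, so a response that carries no token (for example when the credentials are rejected) surfaces only as a bare `KeyError`. A minimal sketch of a more explicit check follows; `build_auth_headers` is a hypothetical helper, not part of `requests` or the Reddit API, and reading an `error` field from the payload is an assumption about its shape.

```python
def build_auth_headers(token_response, user_agent):
    """Return request headers carrying the bearer token, or raise if none was issued."""
    token = token_response.get('access_token')
    if not token:
        # assumption: a failed request describes the problem under an 'error' key
        reason = token_response.get('error', 'unknown error')
        raise ValueError(f'no access_token in response: {reason}')
    return {'User-Agent': user_agent, 'Authorization': f'bearer {token}'}
```

With the variables from the cell above, this would be called as `headers = build_auth_headers(res.json(), 'classifier proj by <YOUR USER>')`.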

In [13]:
res = requests.get('http://oauth.reddit.com/r/wallstreetbets/hot', headers=headers)

In [14]:
res.status_code

200

In [15]:
reddit_dict = res.json()

In [26]:
reddit_dict['data']['children'][2]['data']

{'approved_at_utc': None,
 'subreddit': 'wallstreetbets',
 'selftext': '',
 'author_fullname': 't2_vev2iv9',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'As promised in the GME daily thread: 1 pet goodnight for every dollar gained today',
 'link_flair_richtext': [{'e': 'text', 't': 'Meme'}],
 'subreddit_name_prefixed': 'r/wallstreetbets',
 'hidden': False,
 'pwls': 7,
 'link_flair_css_class': 'meme',
 'downs': 0,
 'thumbnail_height': 140,
 'top_awarded_type': None,
 'hide_score': False,
 'name': 't3_m1mnps',
 'quarantine': False,
 'link_flair_text_color': 'light',
 'upvote_ratio': 0.96,
 'author_flair_background_color': None,
 'subreddit_type': 'public',
 'ups': 15743,
 'total_awards_received': 143,
 'media_embed': {},
 'thumbnail_width': 140,
 'author_flair_template_id': None,
 'is_original_content': False,
 'user_reports': [],
 'secure_media': {'reddit_video': {'bitrate_kbps': 4800,
   'fallback_url': 'https://v.redd.it/8b0qa3wxy3m61/DASH_10

In [27]:
rows = []

for post in reddit_dict['data']['children']:
    # collect the relevant fields of each post
    rows.append({'subreddit': post['data']['subreddit'],
        'title': post['data']['title'],
        'selftext': post['data']['selftext'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'ups': post['data']['ups'],
        'downs': post['data']['downs'],
        'score': post['data']['score'],
        'author': post['data']['author']
    })

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df = pd.DataFrame(rows)
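The loop above indexes every field directly, so a post that lacks one of the keys would stop the cell with a `KeyError`. As a sketch of a more forgiving variant, the hypothetical helper `posts_to_rows` below flattens a listing into plain dicts and fills missing fields with `None`. The sample listing is made up for illustration and is not real API output.

```python
FIELDS = ['subreddit', 'title', 'selftext', 'upvote_ratio',
          'ups', 'downs', 'score', 'author']


def posts_to_rows(listing, fields=FIELDS):
    """Flatten a listing dict into one plain dict per post, keeping only `fields`."""
    rows = []
    for child in listing.get('data', {}).get('children', []):
        post = child.get('data', {})
        # .get() yields None for any field the post does not carry
        rows.append({field: post.get(field) for field in fields})
    return rows


# made-up listing with a single, partially filled post
sample = {'data': {'children': [
    {'data': {'subreddit': 'options', 'title': 'Iron condor question',
              'selftext': '', 'ups': 12}},
]}}
rows = posts_to_rows(sample)
print(rows[0]['title'])
```

The resulting list can be passed straight to `pd.DataFrame(rows)`, which builds the frame in one step.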

In [33]:
options = requests.get('http://oauth.reddit.com/r/options/hot', headers=headers,
                       params={'limit':'50'})

In [34]:
options.status_code

200

In [39]:
options.json()['data']['after']

't3_m1oguo'
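The `after` value is the fullname of the last post in the listing; sending it back as the `after` query parameter asks the API for the next page. The cursor-following logic can be sketched independently of the network, as below. `iter_pages` and `fetch_page` are hypothetical names, and the fake pages and cursors are made up for illustration.

```python
def iter_pages(fetch_page, max_pages=10):
    """Yield listing pages, following the `after` cursor until it runs out.

    `fetch_page(after)` is a caller-supplied function that returns the parsed
    JSON of one listing page.
    """
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        yield page
        after = page['data']['after']
        if after is None:
            break  # the listing has no further pages


# stand-in for the API, with made-up cursors
fake_pages = {
    None: {'data': {'children': ['a', 'b'], 'after': 't3_x'}},
    't3_x': {'data': {'children': ['c'], 'after': None}},
}
pages = list(iter_pages(lambda after: fake_pages[after]))
print(len(pages))  # 2
```

Against the real API, `fetch_page` could wrap `requests.get(url, headers=headers, params={'limit': '100', 'after': after}).json()`; `requests` leaves out parameters whose value is `None`, so the first call carries no cursor.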

In [2]:
def df_from_json(url, cycles=10, params=None):
    # avoid a mutable default argument
    if params is None:
        params = {'limit': '100'}

    # collect one dict per post; DataFrame.append was removed in pandas 2.0
    rows = []
    # this is to tell the api where to start each new cycle from
    after = None
    for _ in range(cycles):
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers=headers, params=params)

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        listing = res.json()['data']
        for post in listing['children']:
            # keep only the relevant fields of each post
            rows.append({'subreddit': post['data']['subreddit'],
                'title': post['data']['title'],
                'selftext': post['data']['selftext'],
                'upvote_ratio': post['data']['upvote_ratio'],
                'ups': post['data']['ups'],
                'downs': post['data']['downs'],
                'score': post['data']['score'],
                'author': post['data']['author']
            })

        after = listing['after']
        # stop once the api reports no further pages
        if after is None:
            break

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(6,12)
        print(sleep_duration)
        time.sleep(sleep_duration)

    df = pd.DataFrame(rows)
    # drops all duplicate rows based on the text values of title and selftext
    if not df.empty:
        df = df.drop_duplicates(subset = ['title','selftext'])

    return df

### Scraping from r/options

In [91]:
df_options = df_from_json('http://oauth.reddit.com/r/options/top', params={'sort':'top','limit':'100','t':'month'})

http://oauth.reddit.com/r/options/top
10
http://oauth.reddit.com/r/options/top?after=t3_lpe4sw
10
http://oauth.reddit.com/r/options/top?after=t3_lw5mdv
12
http://oauth.reddit.com/r/options/top?after=t3_m003jm
11
http://oauth.reddit.com/r/options/top?after=t3_ljftql
13
http://oauth.reddit.com/r/options/top?after=t3_lioy86
11
http://oauth.reddit.com/r/options/top?after=t3_lol75v
15
http://oauth.reddit.com/r/options/top?after=t3_lql8ss
17
http://oauth.reddit.com/r/options/top?after=t3_lxoxr6
10
http://oauth.reddit.com/r/options/top?after=t3_lnm8qy
20


In [95]:
# df_from_json already drops duplicate posts, so the returned frame is saved directly
df_options.to_csv('options_topmonth.csv',index=False)

### Scraping from r/stocks

In [85]:
df_stocks = df_from_json('http://oauth.reddit.com/r/stocks/top', params={'sort':'top','limit':'100','t':'month'})

http://oauth.reddit.com/r/stocks/top
38
http://oauth.reddit.com/r/stocks/top?after=t3_lj3ec6
12
http://oauth.reddit.com/r/stocks/top?after=t3_lkj30k
36
http://oauth.reddit.com/r/stocks/top?after=t3_ly92v2
8
http://oauth.reddit.com/r/stocks/top?after=t3_lv5vnf
23
http://oauth.reddit.com/r/stocks/top?after=t3_m06i7b
40
http://oauth.reddit.com/r/stocks/top?after=t3_lhspgz
30
http://oauth.reddit.com/r/stocks/top?after=t3_lxinux
14
http://oauth.reddit.com/r/stocks/top?after=t3_lk4mzl
10
http://oauth.reddit.com/r/stocks/top?after=t3_lxlp5v
12


In [94]:
# df_from_json already drops duplicate posts, so the returned frame is saved directly
df_stocks.to_csv('stocks_topmonth.csv',index=False)