# Using Reddit's API for Classifying Subreddit Posts

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this project, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles requests that use Python's default user agent. You'll need to set a custom `User-agent` header to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```
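
A minimal sketch of wrapping the request so that a throttled response fails loudly instead of passing unnoticed. The helper name `fetch_listing` and its `get` parameter are hypothetical, added here so the function can be exercised with a stub instead of a live request; by default it falls back to `requests.get`.

```python
def fetch_listing(url, user_agent, get=None):
    """Request a Reddit listing and return the parsed JSON as a dict.

    `get` is a hypothetical hook for testing; it defaults to requests.get.
    """
    if get is None:
        import requests  # imported lazily so a stub can be passed in instead
        get = requests.get

    res = get(url, headers={'User-agent': user_agent})

    # 429 means "Too Many Requests": the request is being throttled
    if res.status_code == 429:
        raise RuntimeError('Throttled (429): set a custom User-agent and slow down')
    if res.status_code != 200:
        raise RuntimeError(f'Unexpected status code: {res.status_code}')

    return res.json()
```

With this in place, `data = fetch_listing(URL, 'YOUR NAME Bot 0.1')` would replace the separate `requests.get` and `res.json()` calls.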

In [25]:
# Library Imports

# For webscraping
import requests
import json
import time

# Regular data cleaning/handling
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


In [7]:
# Geocaching subreddit
URL = "https://www.reddit.com/r/geocaching.json"

In [8]:
# Save requests as an object using .get
res = requests.get(URL, headers = {'User-agent': 'nicoleBot'})

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [9]:
# Save as a json
data = res.json()

#### Explore the json

In [7]:
# Post body text
data['data']['children'][1]['data']['selftext']

''

In [8]:
# Post title
data['data']['children'][1]['data']['title']

'A quick Virtual Cache for the August Geochallenge!'

In [9]:
# How many posts in the json
print(len(data['data']['children']))

25


In [10]:
# What is the 'after' value?
data['data']['after']

't3_99g4sx'
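
The lookups above all walk the same path through the dictionary, so they can be gathered into one helper. This is only a sketch: the function name `extract_rows` and the sample dictionary are made up for illustration, and the sample keeps just the keys this notebook uses (`selftext`, `title`, `subreddit`, `after`).

```python
def extract_rows(listing):
    """Return (rows, after) for one page of a Reddit listing dict."""
    rows = []
    for post in listing['data']['children']:
        fields = post['data']
        rows.append([fields['selftext'], fields['title'], fields['subreddit']])
    # 'after' names the last post on the page; it is used to request the next page
    return rows, listing['data']['after']


# Hand-written stand-in for res.json(), trimmed to the keys used here
sample = {
    'data': {
        'after': 't3_example',
        'children': [
            {'data': {'selftext': '', 'title': 'First post', 'subreddit': 'geocaching'}},
            {'data': {'selftext': 'Body text', 'title': 'Second post', 'subreddit': 'geocaching'}},
        ],
    }
}

rows, after = extract_rows(sample)
print(len(rows), after)
```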

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3)  # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
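
The steps above can be sketched as a single loop. This is a rough outline under a few assumptions: `fetch` is a hypothetical callable that takes a URL and returns the parsed JSON (for example, a thin wrapper around `requests.get(...).json()`), and the names `collect_posts`, `max_pages`, and `pause` are invented here. The loop also stops early when the `after` value comes back as `None`, which this sketch treats as the signal that there are no more pages.

```python
import time


def collect_posts(base_url, fetch, max_pages=40, pause=3):
    """Follow 'after' values from base_url and return a list of post dicts."""
    posts = []
    url = base_url
    for _ in range(max_pages):
        data = fetch(url)
        posts.extend(child['data'] for child in data['data']['children'])

        # Step 1: get the name of the last post on this page
        after = data['data']['after']
        if after is None:
            break  # nothing left to page through

        # Step 2: point the next request at the following page
        url = base_url + '?after=' + after

        # Throttle the loop before the next request
        time.sleep(pause)
    return posts
```

Checking for `None` before building the next URL avoids a `TypeError` from concatenating a string with `None` on the last page.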

In [52]:
# Exploring how to get the next 25 posts

new_url = 'http://www.reddit.com/r/geocaching.json?after=t3_99c6t7'
res = requests.get(new_url, headers={'User-agent': 'nicoleBot'})
new_data = res.json()
new_data['data']['children'][0]['data']['title']

'What do you think of field puzzles that are solved simply with lots of trial and error?'
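
Rather than pasting the `after` value into the URL by hand, the query string can be assembled programmatically. The sketch below uses `urllib.parse.urlencode` from the standard library; `listing_url` is a hypothetical helper name, not something Reddit or `requests` provides.

```python
from urllib.parse import urlencode


def listing_url(subreddit, after=None):
    """Build the JSON listing URL for a subreddit, optionally for the page after a given post."""
    url = f'http://www.reddit.com/r/{subreddit}.json'
    if after is not None:
        url += '?' + urlencode({'after': after})
    return url


print(listing_url('geocaching'))
print(listing_url('geocaching', after='t3_99c6t7'))
```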

## Geocaching posts captured, don't run this twice

In [55]:
# # Capturing Geocaching posts
# url = 'http://www.reddit.com/r/geocaching.json'

# all_posts = []

# for _ in range(40):
#     #construct list of 1000 posts
    
#     # Get posts by going to the url, put it into json, and store it
#     res = requests.get(url, headers={'User-agent': 'nicoleBot'})
#     data = res.json()
    
#     # save only the posts out of the json into list_of_posts, then
#     # add each of them to the all_posts list
#     list_of_posts = data['data']['children']
    
    
#     # do your data cleaning every time you pull, then only
#     # save the info that you need
#     for post in list_of_posts:
#         current_row = []
#         current_row.append(post['data']['selftext'])
#         current_row.append(post['data']['title'])
#         current_row.append(post['data']['subreddit'])  
#         all_posts.append(current_row)
        
    
#     # reassign the after to the current 'after', and then update the url to hit
#     after = data['data']['after']
#     url = 'http://www.reddit.com/r/geocaching.json?after=' + after
    
#     # go to sleep for 3 seconds so you do not overwhelm reddit and get kicked out
#     print('The current after: ', after)
#     time.sleep(3)

# now put the list of lists, where each inner list is a row,
# straight into a dataframe
# df = pd.DataFrame(all_posts, columns = ['text','title','subreddit'])

### Webscraping a second subreddit

I chose the subreddit IWantToLearn because I expect it to have a wider range of vocabulary and subjects.

In [44]:
learn_url = 'http://www.reddit.com/r/IWantToLearn.json'
res = requests.get(learn_url, headers={'User-agent': 'nicoleBot'})
learn_data = res.json()

In [52]:
# Confirm what subreddit we've scraped from and how to navigate down through the json
learn_data['data']['children'][0]['data']['subreddit']

'IWantToLearn'

## IWTL posts captured, do not run

In [None]:
# # Repeat process for IWantToLearn subreddit

# url = 'http://www.reddit.com/r/IWantToLearn.json'

# all_posts = []

# for _ in range(3):
#     # construct list of posts (25 per request)
    
#     # Get posts by going to the url, put it into json, and store it
#     res = requests.get(url, headers={'User-agent': 'nicoleBot'})
#     data = res.json()
    
#     # save only the posts out of the json into list_of_posts, then
#     # add each of them to the all_posts list
#     list_of_posts = data['data']['children']
    
    
#     # do your data cleaning every time you pull, then only
#     # save the info that you need
#     for post in list_of_posts:
#         current_row = []
#         current_row.append(post['data']['selftext'])
#         current_row.append(post['data']['title'])
#         current_row.append(post['data']['subreddit'])
#         all_posts.append(current_row)
        
    
#     # reassign the after to the current 'after', and then update the url to hit
#     after = data['data']['after']
#     url = 'http://www.reddit.com/r/IWantToLearn.json?after=' + after
    
#     # go to sleep for 3 seconds so you do not overwhelm reddit and get kicked out
#     print('The current after: ', after)
#     time.sleep(3)

In [7]:
# Convert to dataframe
learn_df = pd.DataFrame(all_posts, columns = ['text','title','subreddit'])

# Concatenate onto geocaching df for a full dataset
two_subr = pd.concat([df, learn_df], axis=0)

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.

In [None]:
two_subr.to_csv('reddit_posts.csv')
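
One way to save regularly while scraping is to append each page of rows to a CSV as soon as it arrives. The sketch below uses Python's built-in `csv` module rather than pandas, so rows can be appended without rebuilding a DataFrame; the helper name `append_rows` is illustrative only, and the default header mirrors the column names used above.

```python
import csv
import os


def append_rows(path, rows, header=('text', 'title', 'subreddit')):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)
```

Calling `append_rows` once per page inside the scraping loop means a crash loses at most the page in progress, and the resulting file can still be loaded with `pd.read_csv`.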