# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will return a [429 error](https://httpstatuses.com/429) if you make the request with the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles the default user agent sent by Python's `requests` library. You'll need to set a custom `User-agent` header to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```
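
Even with a custom header, a request may occasionally come back with a 429, for example when requests are sent too quickly. The sketch below shows one way to wait and retry. The name `get_with_retry`, its parameters, and the default wait time are illustrative assumptions, not part of the `requests` library. The fetching function is passed in as an argument (in practice `requests.get`) so the helper can be tried without touching the network.

```python
import time


def get_with_retry(get, url, headers, max_retries=3, wait_seconds=5, sleep=time.sleep):
    """Call get(url, headers=headers), retrying while the status code is 429.

    `get` is expected to behave like requests.get: it returns an object
    with a `status_code` attribute. All names here are hypothetical.
    """
    res = get(url, headers=headers)
    for _ in range(max_retries):
        if res.status_code != 429:
            break
        sleep(wait_seconds)  # back off before trying again
        res = get(url, headers=headers)
    return res
```

With `requests` imported, a call might look like `res = get_with_retry(requests.get, URL, {'User-agent': 'YOUR NAME Bot 0.1'})`.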

In [4]:
import requests
import json
import pandas as pd
import time

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

#### Getting more results

By default, Reddit will give you the first 25 posts:

```python
print(len(data['data']['children']))
```
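
To make the nesting concrete, here is a small sketch using a hand-written dictionary in the same shape as the response: a top-level `data` key holding `children` and `after`, with each child wrapping a post under its own `data` key. The sample values and the helper name `extract_posts` are made up for illustration; real responses carry many more fields per post.

```python
# A hand-written stand-in for res.json(); the values are invented.
sample = {
    'data': {
        'after': 't3_example2',
        'children': [
            {'kind': 't3', 'data': {'name': 't3_example1', 'title': 'First post'}},
            {'kind': 't3', 'data': {'name': 't3_example2', 'title': 'Second post'}},
        ],
    }
}


def extract_posts(listing):
    """Return the inner post dictionaries from one page of results."""
    return [child['data'] for child in listing['data']['children']]


posts = extract_posts(sample)
print(len(posts))                   # number of posts on this page
print([p['name'] for p in posts])   # each post's unique name
print(sample['data']['after'])      # value to pass as `after` for the next page
```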

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 
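
Step 2 can be sketched as a small helper that appends the `after` value as a query parameter. The function name `build_page_url` is hypothetical. It uses `urllib.parse.urlencode` from the standard library rather than string concatenation so the value is escaped properly.

```python
from urllib.parse import urlencode


def build_page_url(base_url, after=None):
    """Return base_url for the first page, or base_url?after=... for later pages."""
    if after is None:
        return base_url
    return base_url + '?' + urlencode({'after': after})


print(build_page_url('http://www.reddit.com/r/boardgames.json'))
print(build_page_url('http://www.reddit.com/r/boardgames.json', 't3_9bufbr'))
```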

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3)  # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
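
Putting the pieces together, the loop might be organised roughly as follows. This is a sketch under two assumptions: `fetch_page(after)` is a hypothetical callable that returns the parsed JSON dictionary for one page (for example by wrapping `requests.get` and `res.json()`), and the last page reports its `after` value as `None`. The function name `collect_posts` and its defaults are also illustrative.

```python
import time


def collect_posts(fetch_page, max_pages=40, delay=3, sleep=time.sleep):
    """Gather posts page by page until max_pages is reached or no pages remain."""
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(child['data'] for child in page['data']['children'])
        after = page['data']['after']
        if after is None:  # no further pages to request
            break
        sleep(delay)  # pause between requests
    return posts
```

Passing the fetching step in as an argument keeps the throttling and stopping logic separate from the network call, which makes the loop easy to try out on fake pages first.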

In [5]:
url = 'https://www.reddit.com/r/depression.json'

In [6]:
posts_depression = []
after = None

for i in range(40):
    # The first request has no `after`; later requests continue from the last post seen
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Naman 1.0'})
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts_depression.extend(current_posts)
    after = current_dict['data']['after']
    if after is None:
        # No more pages; stop instead of requesting the first page again
        break
    time.sleep(2)

https://www.reddit.com/r/depression.json
https://www.reddit.com/r/depression.json?after=t3_9bufbr
https://www.reddit.com/r/depression.json?after=t3_9bsip2
https://www.reddit.com/r/depression.json?after=t3_9br1xk
https://www.reddit.com/r/depression.json?after=t3_9bx1j0
https://www.reddit.com/r/depression.json?after=t3_9bw2p8
https://www.reddit.com/r/depression.json?after=t3_9brpoh
https://www.reddit.com/r/depression.json?after=t3_9bqo5m
https://www.reddit.com/r/depression.json?after=t3_9bpwzh
https://www.reddit.com/r/depression.json?after=t3_9bneqs
https://www.reddit.com/r/depression.json?after=t3_9bpvjx
https://www.reddit.com/r/depression.json?after=t3_9blfz2
https://www.reddit.com/r/depression.json?after=t3_9bqxi1
https://www.reddit.com/r/depression.json?after=t3_9bot2t
https://www.reddit.com/r/depression.json?after=t3_9bgufz
https://www.reddit.com/r/depression.json?after=t3_9bp9vh
https://www.reddit.com/r/depression.json?after=t3_9bmm3a
https://www.reddit.com/r/depression.json?after=

In [7]:
res = requests.get('https://www.reddit.com/r/depression.json?after=t3_9ajnm4', headers = {'User-agent': 'Naman 1.0'})
current_dict = res.json()
current_posts = [p['data'] for p in current_dict['data']['children']]
posts_depression.extend(current_posts)
after = current_dict['data']['after']

In [8]:
len(posts_depression), len(set([p['name'] for p in posts_depression]))

(979, 979)
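
The two numbers match, so every collected post is unique. If they had differed, one option would be to keep only the first occurrence of each `name`. The helper below is a sketch of that idea; `dedupe_by_name` is a hypothetical name and the sample posts are invented.

```python
def dedupe_by_name(posts):
    """Keep the first post seen for each unique 'name', preserving order."""
    seen = set()
    unique = []
    for post in posts:
        if post['name'] not in seen:
            seen.add(post['name'])
            unique.append(post)
    return unique


sample_posts = [{'name': 't3_a'}, {'name': 't3_b'}, {'name': 't3_a'}]
unique_posts = dedupe_by_name(sample_posts)
print(len(sample_posts), len(unique_posts))
```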

In [9]:
url = 'https://www.reddit.com/r/rant.json'

In [10]:
posts_rant = []
after = None

for i in range(40):
    # The first request has no `after`; later requests continue from the last post seen
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Naman 1.0'})
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts_rant.extend(current_posts)
    after = current_dict['data']['after']
    if after is None:
        # No more pages; stop instead of requesting the first page again
        break
    time.sleep(2)

https://www.reddit.com/r/rant.json
https://www.reddit.com/r/rant.json?after=t3_9br5f6
https://www.reddit.com/r/rant.json?after=t3_9bu471
https://www.reddit.com/r/rant.json?after=t3_9bqbty
https://www.reddit.com/r/rant.json?after=t3_9bohwn
https://www.reddit.com/r/rant.json?after=t3_9bcavl
https://www.reddit.com/r/rant.json?after=t3_9bel76
https://www.reddit.com/r/rant.json?after=t3_9bfjqi
https://www.reddit.com/r/rant.json?after=t3_9bb5gu
https://www.reddit.com/r/rant.json?after=t3_9bdrqm
https://www.reddit.com/r/rant.json?after=t3_9bbks7
https://www.reddit.com/r/rant.json?after=t3_9b23xm
https://www.reddit.com/r/rant.json?after=t3_9b2tup
https://www.reddit.com/r/rant.json?after=t3_9b3bmk
https://www.reddit.com/r/rant.json?after=t3_9atwq5
https://www.reddit.com/r/rant.json?after=t3_9asyty
https://www.reddit.com/r/rant.json?after=t3_9arlrj
https://www.reddit.com/r/rant.json?after=t3_9aq64z
https://www.reddit.com/r/rant.json?after=t3_9agi6y
https://www.reddit.com/r/rant.json?after=t3_9ab

In [11]:
len(posts_rant), len(set([p['name'] for p in posts_rant]))

(977, 977)

### Save your results as a JSON
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.

In [14]:
with open('../data/depression.json', 'w+') as f:
    json.dump(posts_depression, f)

In [15]:
with open('../data/rant.json', 'w+') as f:
    json.dump(posts_rant, f)
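
For saving regularly during a long scrape, one possible approach is to write each checkpoint to a temporary file and then rename it into place, so a crash mid-write does not leave a half-written JSON file behind. The helpers `save_checkpoint` and `load_checkpoint` below are hypothetical names for a minimal sketch of that pattern, using only the standard library.

```python
import json
import os
import tempfile


def save_checkpoint(posts, path):
    """Write posts to path as JSON via a temporary file, then rename it into place."""
    directory = os.path.dirname(path) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    with os.fdopen(fd, 'w') as f:
        json.dump(posts, f)
    os.replace(tmp_path, path)  # swap in the finished file


def load_checkpoint(path):
    """Read back a list of posts saved by save_checkpoint."""
    with open(path) as f:
        return json.load(f)
```

Inside the scraping loop, a call such as `save_checkpoint(posts_rant, '../data/rant.json')` after each page (or every few pages) would keep the file on disk close to what has been collected so far.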