# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled Python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```
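
A request can fail for reasons other than the user agent, so it may help to check the status code before parsing the body. The sketch below wraps the call in a hypothetical helper, `fetch_listing`. The helper name, the `timeout` value, and the `session` parameter are assumptions for illustration and are not part of the original exercise.

```python
def fetch_listing(session, url, user_agent="YOUR NAME Bot 0.1", timeout=10):
    """Fetch a Reddit listing and return the parsed JSON as a dict.

    `session` is any object with a requests-style `.get()` method,
    for example `requests.Session()` (assumed; not imported here).
    """
    res = session.get(url, headers={"User-agent": user_agent}, timeout=timeout)
    # A 429 means the request was rate limited; treat any non-200 status as a failure.
    if res.status_code != 200:
        raise RuntimeError(f"Request to {url} failed with status {res.status_code}")
    return res.json()
```

With `requests` imported, a call such as `fetch_listing(requests.Session(), URL)` should return the same dictionary as `res.json()`, or raise an error that names the status code.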

In [13]:
# Library Imports

import requests
import json
import time

import pandas as pd
import numpy as np

In [2]:
URL = "https://www.reddit.com/r/geocaching.json"

In [3]:
## YOUR CODE HERE

res = requests.get(URL, headers = {'User-agent': 'nicoleBot'})

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [4]:
data = res.json()

In [5]:
data

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 25,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'geocaching',
     'selftext': '',
     'author_fullname': 't2_190yam2',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'Logged my 300th find today! Atop Little Haystack, NH',
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/geocaching',
     'hidden': False,
     'pwls': 6,
     'link_flair_css_class': None,
     'downs': 0,
     'thumbnail_height': 140,
     'hide_score': False,
     'name': 't3_9at73l',
     'quarantine': False,
     'link_flair_text_color': 'dark',
     'author_flair_background_color': None,
     'subreddit_type': 'public',
     'ups': 90,
     'domain': 'i.redd.it',
     'media_embed': {},
     'thumbnail_width': 140,
     'author_flair_template_id': None,
     'is_original_content': False,
     'user_reports': [],
     'secure_media': None,
   

In [6]:
data['data']['children'][1]['data']['selftext']

''

In [7]:
data['data']['children'][1]['data']['title']

'A quick Virtual Cache for the August Geochallenge!'

In [8]:
print(len(data['data']['children']))

25


In [9]:
data['data']['after']

't3_99c6t7'
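
The exploration above suggests that each post sits under `data['data']['children']`, with its fields in a nested `'data'` dictionary. Below is a minimal sketch of pulling out only the fields used later, assuming that structure holds for every post. `extract_posts` and the `sample` listing are hypothetical and made up for illustration.

```python
def extract_posts(listing):
    """Return (rows, after) from a listing dict shaped like `res.json()`."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        # .get() guards against a post that lacks a field (an assumption, not observed above).
        rows.append({"text": post.get("selftext", ""), "title": post.get("title", "")})
    return rows, listing["data"].get("after")


# Tiny hand-made listing, for illustration only (not real Reddit output).
sample = {
    "kind": "Listing",
    "data": {
        "after": "t3_example",
        "children": [
            {"kind": "t3", "data": {"selftext": "", "title": "First post"}},
            {"kind": "t3", "data": {"selftext": "Body text", "title": "Second post"}},
        ],
    },
}

rows, after = extract_posts(sample)
print(len(rows), after)
```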

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
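
Putting the three steps and the pause together, one possible shape for the loop is sketched here. `collect_posts` and `fetch_page` are hypothetical names: `fetch_page(after)` is assumed to return the parsed JSON for one page, for example by calling `requests.get(...).json()` with a custom `User-agent`. The sketch also stops early when the response carries no `after` value, which is assumed to mean there are no more pages.

```python
import time


def collect_posts(fetch_page, max_pages=40, pause=3):
    """Collect post dicts across pages by following the `after` token."""
    posts = []
    after = None  # None requests the first page
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:
            break  # assumed to mean the listing is exhausted
        time.sleep(pause)  # wait between requests to respect the rate limit
    return posts
```

Passing the fetching step in as a function keeps the loop easy to try out with canned pages before pointing it at Reddit.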

In [10]:
## YOUR CODE HERE

new_url = 'http://www.reddit.com/r/geocaching.json?after=t3_99c6t7'
res = requests.get(new_url, headers={'User-agent': 'nicoleBot'})
new_data = res.json()
new_data['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'geocaching',
   'selftext': 'I found a cache once that was locked with a padlock and had a separate container nearby full of probably 100+ keys. As far as I could tell, the only option was brute force and going through and trying all the keys.\n\nNormally, I\'d think that I wouldn\'t enjoy something like this where there\'s no "cleverness" involved in solving the puzzle. But, for some reason, I enjoyed it anyway.\n\nJust curious what others thoughts are?',
   'author_fullname': 't2_3ly1t',
   'saved': False,
   'mod_reason_title': None,
   'gilded': 0,
   'clicked': False,
   'title': 'What do you think of field puzzles that are solved simply with lots of trial and error?',
   'link_flair_richtext': [],
   'subreddit_name_prefixed': 'r/geocaching',
   'hidden': False,
   'pwls': 6,
   'link_flair_css_class': None,
   'downs': 0,
   'thumbnail_height': None,
   'parent_whitelist_status': 'all_ads',
   'hide_score': Fal

In [19]:
print(len(new_data['data']['children']))

25


In [11]:
# need 1000 posts, can only get 25/time

1000/25

40.0

In [12]:

url = 'http://www.reddit.com/r/geocaching.json'

all_posts = []

for _ in range(40):
    # build up a list of roughly 1000 posts, 25 per request

    # get one page of posts from the url and parse the JSON
    res = requests.get(url, headers={'User-agent': 'nicoleBot'})
    data = res.json()

    # keep only the posts from the JSON response
    list_of_posts = data['data']['children']

    # do your data cleaning every time you pull, then only
    # save the info that you need
    for post in list_of_posts:
        current_row = []
        current_row.append(post['data']['selftext'])
        current_row.append(post['data']['title'])
        current_row.append(0)  # subreddit label: 0 for r/geocaching
        all_posts.append(current_row)

    # read the current 'after'; stop if there are no more pages
    after = data['data']['after']
    print('The current after: ', after)
    if after is None:
        break

    # update the url to hit on the next iteration
    url = 'http://www.reddit.com/r/geocaching.json?after=' + after

    # go to sleep for 3 seconds so you do not overwhelm reddit and get kicked out
    time.sleep(3)

The current after:  t3_99g4sx
The current after:  t3_97ub63
The current after:  t3_97485n
The current after:  t3_95xi6t
The current after:  t3_94y4rh
The current after:  t3_940f42
The current after:  t3_92wt18
The current after:  t3_922h2n
The current after:  t3_9178n0
The current after:  t3_907wzs
The current after:  t3_8zanqt
The current after:  t3_8y191f
The current after:  t3_8wtyel
The current after:  t3_8vmzce
The current after:  t3_8ufv2u
The current after:  t3_8t59b1
The current after:  t3_8rtoe3
The current after:  t3_8qohsu
The current after:  t3_8p5eq7
The current after:  t3_8nxs4j
The current after:  t3_8mft40
The current after:  t3_8krjc6
The current after:  t3_8jludl
The current after:  t3_8ij74h
The current after:  t3_8h92mc
The current after:  t3_8fhsmf
The current after:  t3_8e28k2
The current after:  t3_8c7jep
The current after:  t3_8av61j
The current after:  t3_89i6q8
The current after:  t3_88dxze
The current after:  t3_871uet
The current after:  t3_851gez
The curren


In [14]:
# now put the list of lists, where each inner list is a row,
# straight into a dataframe
df = pd.DataFrame(all_posts, columns = ['text','title','subreddit'])


### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.
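
One way to save regularly is to append each batch of rows to the CSV as soon as it is scraped. The sketch below uses Python's built-in `csv` module, so it does not need a DataFrame to be built first. `append_rows` and its default column names are hypothetical, chosen to match the row layout used above.

```python
import csv
import os


def append_rows(path, rows, header=("text", "title", "subreddit")):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        # csv.writer quotes fields as needed, so post bodies with newlines survive.
        writer.writerows(rows)
```

Calling this once per page inside the scraping loop means a crash loses at most the page in progress. The resulting file can be loaded later with `pd.read_csv`.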

In [15]:
# Export to csv
df.to_csv('reddit_geocaching_posts.csv')

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
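
As a starting point, here is a minimal sketch of turning titles and descriptions into binary word features with `CountVectorizer`. The toy posts are invented for illustration; in practice the text would come from your scraped DataFrame. Joining the title and description into one string is one possible choice, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy posts, invented for illustration only.
titles = ["Found my first cache today", "Best board game for two players"]
texts = ["", "Looking for a game night recommendation"]

# Combine title and description; an empty description simply contributes nothing.
documents = [f"{title} {text}".strip() for title, text in zip(titles, texts)]

# binary=True records whether a word is present instead of how often it occurs.
vectorizer = CountVectorizer(stop_words="english", binary=True)
X = vectorizer.fit_transform(documents)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Setting `binary=False` (the default) gives count features instead, which makes it straightforward to compare the two options in the same model.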

In [None]:
## YOUR CODE HERE

## Predicting subreddit using Random Forests + Another Classifier

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?
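
One common way to define the baseline is the accuracy you would get by always predicting the most frequent class. A minimal sketch, assuming a hypothetical label list with 600 posts from one subreddit and 400 from the other:

```python
from collections import Counter

# Hypothetical labels: 0 for one subreddit, 1 for the other.
y = [0] * 600 + [1] * 400

counts = Counter(y)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(y)

print(majority_class, baseline_accuracy)  # 0 0.6
```

A model is only useful if it beats this number; with perfectly balanced classes the baseline would be 0.5.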

In [None]:
## YOUR CODE HERE

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.
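
A minimal sketch of one way to wire this up is to chain a vectorizer and a `RandomForestClassifier` in a pipeline, so the model accepts raw text. The documents and labels below are toy data invented for illustration, and the hyperparameters are arbitrary starting values rather than recommendations.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration: 0 = geocaching-style, 1 = board-game-style.
docs = [
    "found a cache near the trail",
    "hid a new cache in the park",
    "logged a cache find with my gps",
    "favorite board game for two players",
    "new board game expansion review",
    "which board game should we play tonight",
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline vectorizes the text, then fits the forest on the word counts.
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(docs, labels)

print(model.predict(["another cache hidden by the trail"]))
```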

In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.
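
The sketch below shows one possible way to combine cross-validation and the bonus grid search. The toy documents are invented, `cv=2` is used only because the toy set is tiny, and the parameter grid is an arbitrary example. The step names `vec` and `clf` are labels chosen here; they determine the `vec__` and `clf__` prefixes in the grid.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration: 0 = geocaching-style, 1 = board-game-style.
docs = [
    "found a cache near the trail",
    "hid a new cache in the park",
    "logged a cache find with my gps",
    "replaced the logbook in an old cache",
    "favorite board game for two players",
    "new board game expansion review",
    "which board game should we play tonight",
    "teaching a board game to new players",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Cross-validated accuracy of the pipeline with its current settings.
scores = cross_val_score(pipe, docs, labels, cv=2, scoring="accuracy")
print(scores.mean())

# Search over vectorizer and classifier settings together.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__max_depth": [None, 5],
}
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so vocabulary from the held-out fold does not leak into training.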

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc.)
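
Because the vectorizer and classifier sit in one pipeline, swapping the classifier is a small change. Below is a minimal sketch comparing two alternatives on invented toy data. `cv=2` and `max_iter=1000` are arbitrary choices for the example, not tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration: 0 = geocaching-style, 1 = board-game-style.
docs = [
    "found a cache near the trail",
    "hid a new cache in the park",
    "logged a cache find with my gps",
    "replaced the logbook in an old cache",
    "favorite board game for two players",
    "new board game expansion review",
    "which board game should we play tonight",
    "teaching a board game to new players",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Score each classifier with the same vectorizer and the same folds.
results = {}
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    results[name] = cross_val_score(pipe, docs, labels, cv=2).mean()

print(results)
```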

In [None]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.