# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles the default user agent that `requests` sends. You'll need to set a custom `User-agent` header to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [1]:
import requests
import json
import pandas as pd
import time

In [2]:
URL = "http://www.reddit.com/r/boardgames.json"

In [3]:
## YOUR CODE HERE
result = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
results = result.json()

In [4]:
#results commented to reduce 

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [5]:
data = result.json()

In [6]:
data['data']['children'][0]['data']['title']

'/r/boardgames Daily Discussion and Game Recommendations (September 06, 2018)'

In [7]:
posts = [p['data'] for p in data['data']['children']]

In [8]:
df = pd.DataFrame(posts)
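The listing that Reddit returns nests each post under `data`, then `children`, then `data` again. As a minimal sketch of that navigation, the snippet below runs on a small hand-written dictionary that imitates the shape, so it works without a network call. The sample titles, the `t3_example` token, and the helper name `extract_posts` are illustrative assumptions, not part of Reddit's API.

```python
# Hand-written stand-in for res.json(); real listings carry many more fields.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example",  # made-up token for illustration
        "children": [
            {"kind": "t3", "data": {"title": "First post", "subreddit": "boardgames"}},
            {"kind": "t3", "data": {"title": "Second post", "subreddit": "boardgames"}},
        ],
    },
}


def extract_posts(listing, fields=("title", "subreddit")):
    """Return one flat dict per post, keeping only the requested fields."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        # .get() avoids a KeyError when a post lacks one of the fields
        rows.append({field: post.get(field) for field in fields})
    return rows


rows = extract_posts(sample_listing)
print(rows[0]["title"])
```

Keeping only a few fields is a design choice for readability; passing the full `data` dictionaries to `pd.DataFrame`, as the cell above does, keeps every column.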

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
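The three steps above can be sketched as a small function. To keep the sketch runnable offline, the page fetcher is passed in as an argument; `scrape_listing`, `fetch_page`, and the fake pages are hypothetical names for illustration. In a real run, `fetch_page` would wrap the `requests.get(...)` call shown earlier. The sketch also assumes the last page reports `after` as `None`.

```python
import time


def scrape_listing(fetch_page, max_pages=40, pause=3.0):
    """Collect posts page by page, following the `after` token.

    fetch_page(after) must return a listing dict shaped like res.json().
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # assumed signal that there is no further page
            break
        time.sleep(pause)  # stay under the request limit
    return posts


# Fake two-page listing standing in for Reddit (illustrative data only)
fake_pages = {
    None: {"data": {"after": "t3_b", "children": [{"data": {"title": "a"}}, {"data": {"title": "b"}}]}},
    "t3_b": {"data": {"after": None, "children": [{"data": {"title": "c"}}]}},
}

posts = scrape_listing(lambda after: fake_pages[after], pause=0)
print(len(posts))
```

Stopping when `after` is `None` matters: otherwise the next request would carry no token and start again from the first page, duplicating posts.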

In [9]:
data['data']['after']

't3_9dgnby'

In [10]:
## YOUR CODE HERE


In [11]:
URL = 'http://www.reddit.com/r/nfl.json'
lp = data['data']['after']

In [12]:
URL + '?after=' + lp

'http://www.reddit.com/r/nfl.json?after=t3_9dgnby'

In [13]:
time.sleep(3)

In [14]:
URLPF = 'http://www.reddit.com/r/nfl.json'
pf = requests.get(URLPF, headers = {'User-agent':'jacksbot'})
dfp = pf.json()

dfp['data']['after']

't3_9dijn8'

In [15]:
url = 'http://www.reddit.com/r/nfl.json'
posts = []
after = None
var = 0  # counts successful requests

for _ in range(100):
    # The first request has no `after` token; later ones page forward from it
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Jacks bot'})
    if res.status_code != 200:
        print('Status Error', res.status_code)
        break
    print(var)
    var += 1
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    if after is None:  # no more pages; avoid restarting from the first page
        break
    time.sleep(1.2)  # pause between requests to respect the rate limit

http://www.reddit.com/r/nfl.json
0
http://www.reddit.com/r/nfl.json?after=t3_9dijn8
1
http://www.reddit.com/r/nfl.json?after=t3_9djir8
2
http://www.reddit.com/r/nfl.json?after=t3_9di8l5
3
http://www.reddit.com/r/nfl.json?after=t3_9de84e
4
http://www.reddit.com/r/nfl.json?after=t3_9daygm
5
http://www.reddit.com/r/nfl.json?after=t3_9dagp6
6
http://www.reddit.com/r/nfl.json?after=t3_9cylim
7
http://www.reddit.com/r/nfl.json?after=t3_9d1zpg
8
http://www.reddit.com/r/nfl.json?after=t3_9dj0vz
9
http://www.reddit.com/r/nfl.json?after=t3_9d2ytr
10
http://www.reddit.com/r/nfl.json?after=t3_9cz2zh
11
http://www.reddit.com/r/nfl.json?after=t3_9cw88a
12
http://www.reddit.com/r/nfl.json?after=t3_9cxzkx
13
http://www.reddit.com/r/nfl.json?after=t3_9codnu
14
http://www.reddit.com/r/nfl.json?after=t3_9cnd18
15
http://www.reddit.com/r/nfl.json?after=t3_9cnryg
16
http://www.reddit.com/r/nfl.json?after=t3_9cq56k
17
http://www.reddit.com/r/nfl.json?after=t3_9ce43d
18
http://www.reddit.com/r/nfl.json?after

In [16]:

URLPF = 'http://www.reddit.com/r/personalfinance.json'
pf = requests.get(URLPF, headers = {'User-agent':'jacksbot'})
dfp = pf.json()

dfp['data']['after']

't3_9dk0mw'

In [17]:
len(posts)

2477

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.
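One way to save regularly is to append each batch of posts to the CSV as soon as it arrives. The sketch below uses only the standard library; the helper name `append_posts`, the chosen columns, and the sample rows are assumptions made for illustration.

```python
import csv
import os
import tempfile


def append_posts(path, posts, fields=("id", "title", "subreddit")):
    """Append posts to a CSV file, writing the header only when the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys that are not listed in `fields`
        writer = csv.DictWriter(handle, fieldnames=list(fields), extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerows(posts)


# Demo in a temporary folder with invented rows
demo_path = os.path.join(tempfile.mkdtemp(), "posts_checkpoint.csv")
append_posts(demo_path, [{"id": "t3_a", "title": "first", "subreddit": "nfl", "score": 10}])
append_posts(demo_path, [{"id": "t3_b", "title": "second", "subreddit": "nfl"}])

with open(demo_path, newline="", encoding="utf-8") as handle:
    saved = list(csv.DictReader(handle))
print(len(saved))
```

Calling a helper like this once per page inside the scraping loop means a crash loses at most the page that was in flight.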

In [18]:
nfl_data = pd.DataFrame(posts)

# Export to csv
nfl_data.to_csv('nfl_posts_thu.csv', index=False)

In [19]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [20]:
pd.DataFrame(posts).to_csv('nfl_posts_wed.csv', index = False)
nfl_words = nfl_data[['title', 'subreddit']]
nfl_words

| | title | subreddit |
|---|---|---|
| 0 | Thursday Talk Thread... Yes That's The Thread ... | nfl |
| 1 | Official rNFL Pick 'Em League | nfl |
| 2 | Ohrnberger: Le’Veon Bell’s teammates are brain... | nfl |
| 3 | After 214 days of waiting since SB 52, its fin... | nfl |
| 4 | [Fanatics] Top selling jerseys in the preseaso... | nfl |
| 5 | What is a "franchise tag" and why is it consid... | nfl |
| 6 | Le'Veon Bell's response to a post about OL com... | nfl |
| 7 | Quote of the day goes to Packers CB Kevin King... | nfl |
| 8 | The Film Room Ep. 80: Sam Darnold is already t... | nfl |
| 9 | Rams offered 'aggressive' package for Khalil M... | nfl |


In [21]:
url = 'http://www.reddit.com/r/soccer.json'
posts = []
after = None
var = 0  # counts successful requests

for _ in range(100):
    # The first request has no `after` token; later ones page forward from it
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Jacks bot'})
    if res.status_code != 200:
        print('Status Error', res.status_code)
        break
    print(var)
    var += 1
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    if after is None:  # no more pages; avoid restarting from the first page
        break
    time.sleep(1.2)  # pause between requests to respect the rate limit

soccer_data = pd.DataFrame(posts)

# Export to csv
soccer_data.to_csv('soccer_posts.csv', index=False)

http://www.reddit.com/r/soccer.json
0
http://www.reddit.com/r/soccer.json?after=t3_9dgely
1
http://www.reddit.com/r/soccer.json?after=t3_9dk3bl
2
http://www.reddit.com/r/soccer.json?after=t3_9d8hvw
3
http://www.reddit.com/r/soccer.json?after=t3_9dg0c7
4
http://www.reddit.com/r/soccer.json?after=t3_9d55s0
5
http://www.reddit.com/r/soccer.json?after=t3_9dizna
6
http://www.reddit.com/r/soccer.json?after=t3_9daaik
7
http://www.reddit.com/r/soccer.json?after=t3_9dbz7m
8
http://www.reddit.com/r/soccer.json?after=t3_9d59pm
9
http://www.reddit.com/r/soccer.json?after=t3_9cvho7
10
http://www.reddit.com/r/soccer.json?after=t3_9cv0e3
11
http://www.reddit.com/r/soccer.json?after=t3_9ctsga
12
http://www.reddit.com/r/soccer.json?after=t3_9cvvho
13
http://www.reddit.com/r/soccer.json?after=t3_9cueyc
14
http://www.reddit.com/r/soccer.json?after=t3_9cznia
15
http://www.reddit.com/r/soccer.json?after=t3_9ckunr
16
http://www.reddit.com/r/soccer.json?after=t3_9ctmuy
17
http://www.reddit.com/r/soccer.json?

In [22]:
pd.DataFrame(posts).to_csv('soccer_posts_thu.csv', index = False)

In [23]:
soccer_words = soccer_data[['title', 'subreddit']]

In [24]:
soccer_words

| | title | subreddit |
|---|---|---|
| 0 | Throwback Thursday Thread [2018-09-06] | soccer |
| 1 | Daily Discussion [2018-09-06] | soccer |
| 2 | A stray dog has become "assistant coach" of Pa... | soccer |
| 3 | Sneijder: Eto'o played left winger for Mourinh... | soccer |
| 4 | If the plans for Girona-Barça are as Cope repo... | soccer |
| 5 | [OC] Three Bundesliga players have reached 100... | soccer |
| 6 | Reporter: FC Barcelona was interested, why did... | soccer |
| 7 | [OC] Giggs, Rooney and Lampard are the 3 Premi... | soccer |
| 8 | Napoli have conceded 6 goals so far in 3 games... | soccer |
| 9 | Matteo Guendouzi is Arsenal's August Player of... | soccer |


## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
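To make the difference between count and binary features concrete, here is a small sketch on two invented titles. `CountVectorizer(binary=True)` records only whether a word appears, while the default records how often it appears. The titles are made up for illustration and are not scraped data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy titles invented for illustration
titles = [
    "goal goal goal in the final",
    "touchdown in the final quarter",
]

count_vec = CountVectorizer(stop_words="english")
binary_vec = CountVectorizer(stop_words="english", binary=True)

counts = count_vec.fit_transform(titles).toarray()
flags = binary_vec.fit_transform(titles).toarray()

# vocabulary_ maps each token to its column index; both vectorizers
# tokenize identically, so the same index applies to both matrices
vocab = count_vec.vocabulary_
print(counts[0, vocab["goal"]], flags[0, vocab["goal"]])
```

Short titles rarely repeat a word, so the two feature types may behave similarly on this kind of data; comparing cross-validated scores is the way to find out.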

In [25]:
import numpy as np
from bs4 import BeautifulSoup 
from sklearn.feature_extraction.text import CountVectorizer
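from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # stop-word list; the old `stop_words` module is not importable in recent scikit-learn releases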
#cvec = CountVectorizer(stop_words='english', analyzer='word')

In [26]:
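# Assumption: total_df stacks the NFL and soccer title frames built above
total_df = pd.concat([nfl_words, soccer_words], ignore_index=True)
total_df['class'] = (total_df['subreddit'] == 'nfl').astype(int)  # 1 = nfl, 0 = soccer
total_df.to_csv('scraped_data.csv', index=False)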

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = total_df['title']
y = total_df['class']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
total_df.loc[total_df['subreddit'] != 'nfl', 'class'] = 0

In [None]:
total_df.head()

In [None]:
cvec = CountVectorizer(stop_words='english')
X_train = cvec.fit_transform(X_train)

In [None]:
X_train.todense()

In [None]:
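# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
df = pd.DataFrame(X_train.todense(),
                  columns=cvec.get_feature_names_out())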





In [None]:
df.head()

In [None]:
X_test = cvec.transform(X_test)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score 
model_lr = LogisticRegression()
print(cross_val_score(model_lr, X_train, y_train).mean())
model_lr.fit(X_train, y_train)
model_lr.predict(X_test)

In [None]:
y_test.shape[0]

In [None]:
X_test.shape[0]

In [None]:
model_lr.score(X_test, y_test)

In [None]:
## YOUR CODE HERE
# The cells above use the count vectorizer. Next, I fit a TF-IDF vectorizer.



In [None]:
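from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: corpus holds one document per subreddit, with all titles joined together
corpus = [' '.join(nfl_words['title']), ' '.join(soccer_words['title'])]
tvec = TfidfVectorizer(stop_words='english')
tvec.fit(corpus)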

In [None]:
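# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
df = pd.DataFrame(tvec.transform(corpus).todense(),
                  columns=tvec.get_feature_names_out(),
                  index=['nfl', 'soccer'])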

df.transpose().sort_values('soccer', ascending=False).head(10).transpose()

## Predicting subreddit using Random Forests + Another Classifier

In [None]:
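# Label every row as class 1. `pd_def` is not defined earlier in this notebook;
# it is assumed to be the DataFrame of posts for the positive class.
pd_def['class'] = 1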

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?

In [None]:
## YOUR CODE HERE
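As a sketch of the labelling and the baseline, the snippet below assigns class `1` to one subreddit and `0` to the other, then computes the accuracy of always guessing the most common class. The `subreddits` list is invented for illustration and stands in for the scraped `subreddit` column.

```python
from collections import Counter

# Invented subreddit column standing in for the scraped data
subreddits = ["nfl", "nfl", "nfl", "soccer", "soccer"]

# Class 1 for one subreddit, class 0 for the other
labels = [1 if name == "nfl" else 0 for name in subreddits]

# Baseline: always predict the most common class
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)
print(majority_class, baseline_accuracy)
```

A model is only useful if it beats this number. With a pandas Series of labels `y`, the same figure should be available as `y.value_counts(normalize=True).max()`.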

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [None]:
## YOUR CODE HERE
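One possible shape for this step is a pipeline that vectorizes the titles and feeds them to a random forest. This is a sketch under stated assumptions, not a reference solution: the titles and labels are invented, and `n_estimators=100` and `random_state=42` are arbitrary choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Invented titles and labels (1 = nfl, 0 = soccer) for illustration only
train_titles = [
    "quarterback throws touchdown pass",
    "touchdown wins the game for the quarterback",
    "striker scores penalty goal",
    "goalkeeper saves penalty in the derby",
]
train_labels = [1, 1, 0, 0]

forest = make_pipeline(
    CountVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
forest.fit(train_titles, train_labels)

predictions = forest.predict(["late touchdown by the quarterback", "penalty goal"])
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline means raw titles can be passed straight to `fit` and `predict`, and the vocabulary is learned from the training titles only.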

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.

In [None]:
## YOUR CODE HERE
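For the bonus, a grid search over a pipeline might look like the sketch below. The titles, the labels, the parameter grid, and `cv=2` are all assumptions chosen to keep the example small; a real run would use the scraped titles and more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented titles and labels (1 = nfl, 0 = soccer) for illustration only
titles = [
    "quarterback throws late touchdown",
    "coach praises rookie quarterback",
    "touchdown pass wins playoff game",
    "rookie linebacker signs contract",
    "striker scores penalty goal",
    "goalkeeper saves late penalty",
    "midfielder signs new contract",
    "manager praises young striker",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vec", CountVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys follow the "<step name>__<parameter>" convention
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "vec__binary": [False, True],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(titles, labels)
print(search.best_params_, search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so no vocabulary from the held-out fold leaks into training.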

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [None]:
## YOUR CODE HERE
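Swapping in a different classifier only changes the last step of the pipeline. The sketch below pairs `TfidfVectorizer` with `MultinomialNB` and scores it with two-fold cross-validation; the titles and labels are invented, and the fold count is an arbitrary choice for such a small sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented titles and labels (1 = nfl, 0 = soccer) for illustration only
titles = [
    "quarterback throws late touchdown",
    "coach praises rookie quarterback",
    "touchdown pass wins playoff game",
    "rookie linebacker signs contract",
    "striker scores penalty goal",
    "goalkeeper saves late penalty",
    "midfielder signs new contract",
    "manager praises young striker",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

nb_model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
scores = cross_val_score(nb_model, titles, labels, cv=2)
print(scores.mean())
```

Comparing this mean score with the random forest's score under the same folds gives a like-for-like view of the two classifiers.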

# Executive Summary
---
Put your executive summary in a Markdown cell below.

Have you ever felt the need to classify different inputs to get important information out of words? My goal was to build a model that can accurately predict whether the title of a Reddit post came from one subreddit or another.

Originally, I planned on using the Soccer and NFL subreddits to build a classification model. However, I was not happy with how easy the two were to tell apart: I scored very highly on the training data, so I figured it would be more interesting to use two subreddits with similar topics. I chose Personal Finance and Financial Independence. These scores were more interesting to me.

Also interesting are the individual coefficients the logistic regression model came up with. Sorting them shows which words are most determinant for the model.

Overall, my process was to explore the subreddits, scrape the data, clean and organize the data, and build the various models.