# Web Scraping for Reddit & Predicting Comments

Please read the readme carefully!

Your method for acquiring the data will be scraping the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/). You'll acquire _AT LEAST FOUR_ pieces of information about each thread:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.

**BONUS PROBLEMS**
1. Scrape the actual text of the threads using Selenium 
2. Write the actual article that you're pitching and turn it into a blog post that you host on your personal website.

# This starter code is just meant to be a guide to help frame the problem.  You DO NOT need to follow it, as long as you address the problem requirements set out in the readme.

You will need to use the `sleep` functionality so that you don't make too many calls to the reddit server (see the reddit API documentation for details).  Below is an example of how the timer works.

In [1]:
import time
print "hello"
time.sleep(3)
print "bye"

hello
bye


### Scraping Thread Info from Reddit.com

#### Set up a request (using requests or chromedriver) to the URL below. Use BeautifulSoup to parse the page and extract all results

In [365]:
import numpy as np
import pandas as pd

In [74]:
import requests
from bs4 import BeautifulSoup

In [171]:
# URL for reddit
URL = "http://www.reddit.com"

In [76]:
from selenium import webdriver

In [172]:
# Imported chromedriver, had to redownload the latest chromedriver and insert the file into project file path
# because the current chrome browser wouldnt work with the earlier version of chromedriver.

driver = webdriver.Chrome(executable_path="./chromedriver")

In [173]:
# Getting chromedriver to retrieve html from reddit, not sure what the timer is for exactly need to get info

driver.get("http://www.reddit.com")

# Giving it 2 seconds to completely load the reddit page before scraping

time.sleep(2)

html = driver.page_source

In [85]:
print html

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" class=" js cssanimations csstransforms"><head><title>reddit: the front page of the internet</title><meta name="keywords" content=" reddit, reddit.com, vote, comment, submit " /><meta name="description" content="reddit: the front page of the internet" /><meta name="referrer" content="always" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link type="application/opensearchdescription+xml" rel="search" href="/static/opensearch.xml" /><link rel="canonical" href="https://www.reddit.com/" /><meta name="viewport" content="width=1024" /><link rel="dns-prefetch" href="//out.reddit.com" /><link rel="preconnect" href="//out.reddit.com" /><link rel="apple-touch-icon" sizes="57x57" href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-57x57.png" /><link rel="apple-touch-icon" sizes="60x60" href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-60x60.png" /><link rel="apple-touch-

In [174]:
# Using BeautifulSoup parser, why do we need lxml?

soup = BeautifulSoup(html, 'lxml')

In [175]:
driver.close()

In [81]:
# List comprehension for reddit titles

rtitle = [x.text for x in soup.find_all('a', {'data-event-action':'title'})]
rtitle.pop(0)

u"TV's most-anticipated psychological thriller, The Alienist, is giving you an exclusive look before the premiere. Catch a sneak peek of what\u2019s to come before the show premieres tonight on TNT."

In [82]:
# List comprehension for reddit titles

rtitle = [x.text for x in soup.find_all('a', {'data-event-action':'title'})]
rtitle.pop(0)

# List comprehension for subreddits of reddit posts

subreddit = [x.text for x in soup.find_all('a', {'class':'subreddit hover may-blank'})]

# List comprehension for reddit post timestamps

rtime = [x.text for x in soup.find_all('time', {'class':'live-timestamp'})]

# List comprehension for reddit total comments of post

comments = [x.text for x in soup.find_all('a', {'class':'bylink comments may-blank'})]

In [83]:
# List comprehension for reddit post timestamps

rtime = [x.text for x in soup.find_all('time', {'class':'live-timestamp'})]



In [84]:
# List comprehension for reddit total comments of post

comments = [x.text for x in soup.find_all('a', {'class':'bylink comments may-blank'})]

In [16]:
# /span[@class='next-button']/a

SyntaxError: invalid syntax (<ipython-input-16-c11b9ece0634>, line 1)

In [47]:
rtitle[0]

u"Hey Reddit, we've created a special space full of fun, weird and unique items that you probably didn't even know existed on Amazon! Shop our endless finds today!"

In [86]:
len(rtitle)

25

In [87]:
len(subreddit)

25

In [88]:
len(rtime)

25

In [89]:
len(comments)

25

In [90]:
for x in rtitle:
    print x
    print '___'

Vermont has officially legalized cannabis
___
Twitch streamer suggests a game should have random scripted events to make the game more interesting, experiences a random scripted event.
___
I was wondering why my grenade didn't kill him...well here is why
___
NASA cancels and postpones all of their public events and activities until further notice due to government shutdown.
___
Finnish ski jumping team
___
Messing with the new guy.
___
Super excited about motherhood
___
This Pool in Germany
___
YEAH! Lets go outs... no.
___
New Bill Would Stop States From Banning Broadband Competition
___
Help identify this piece of bumper from a hit-and-run with a cyclist now in critical condition.
___
I made a [homemade] cake that looks like my dog. Needless to say, he was fascinated.
___
Guess you can't even start a casual conversation about being ill anymore
___
Germans make the best faces when confused
___
"Creepy unexplained sightings found footage" starter pack
___
U.S. Government Shutdown (2018

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The thread title is within an `<a>` tag with the attribute `data-event-action="title"`.
- The time since the thread was created is within a `<time>` tag with attribute `class="live-timestamp"`.
- The subreddit is within an `<a>` tag with the attribute `class="subreddit hover may-blank"`.
- The number of comments is within an `<a>` tag with the attribute data-event-action="comments"`.

## Write 4 functions to extract these items (one function for each): title, time, subreddit, and number of comments.¶
Example
```python
def extract_title_from_result(result):
    return result.find ...
```

##### - Make sure these functions are robust and can handle cases where the data/field may not be available.
>- Remember to check if a field is empty or None for attempting to call methods on it
>- Remember to use try/except if you anticipate errors.

- **Test** the functions on the results above and simple examples

In [None]:
## YOUR CODE HERE

In [51]:
def extract_title_from_result(result):
    rtitle = [x.text for x in result.find_all('a', {'data-event-action':'title'})]
    rtitle.pop(0)
    return rtitle

In [52]:
def extract_subreddit_from_result(result):
    subreddit = [x.text for x in result.find_all('a', {'class':'subreddit hover may-blank'})]
    return subreddit

In [53]:
def extract_time_from_result(result):
    rtime = [x.text for x in result.find_all('time', {'class':'live-timestamp'})]
    return rtime


In [54]:
def extract_ncomments_from_result(result):
    comments = [x.text for x in result.find_all('a', {'class':'bylink comments may-blank'})]
    return comments

Now, to scale up our scraping, we need to accumulate more results.

First, look at the source of a Reddit.com page: (https://www.reddit.com/).
Try manually changing the page by clicking the 'next' button on the bottom. Look at how the url changes.

After leaving the Reddit homepage, the URLs should look something like this:
```
https://www.reddit.com/?count=25&after=t3_787ptc
```

The URL here has two query parameters
- count is the result number that the page starts with
- after is the unique id of the last result on the _previous_ page

In order to scrape lots of pages from Reddit, we'll have to change these parameters every time we make a new request so that we're not just scraping the same page over and over again. Incrementing the count by 25 every time will be easy, but the bizarre code after `after` is a bit trickier.

To start off, let's look at a block of HTML from a Reddit page to see how we might solve this problem:
```html
<div class=" thing id-t3_788tye odd gilded link " data-author="LordSneaux" data-author-fullname="t2_j3pty" data-comments-count="1548" data-context="listing" data-domain="v.redd.it" data-fullname="t3_788tye" data-kind="video" data-num-crossposts="0" data-permalink="/r/funny/comments/788tye/not_all_heroes_wear_capes/" data-rank="25" data-score="51468" data-subreddit="funny" data-subreddit-fullname="t5_2qh33" data-timestamp="1508775581000" data-type="link" data-url="https://v.redd.it/ush0rh2tultz" data-whitelist-status="all_ads" id="thing_t3_788tye" onclick="click_thing(this)">
      <p class="parent">
      </p>
      <span class="rank">
       25
      </span>
      <div class="midcol unvoted">
       <div aria-label="upvote" class="arrow up login-required access-required" data-event-action="upvote" role="button" tabindex="0">
       </div>
       <div class="score dislikes" title="53288">
        53.3k
       </div>
       <div class="score unvoted" title="53289">
        53.3k
       </div>
       <div class="score likes" title="53290">
        53.3k
       </div>
       <div aria-label="downvote" class="arrow down login-required access-required" data-event-action="downvote" role="button" tabindex="0">
       </div>
      </div>
```

Notice that within the `div` tag there is an attribute called `id` and it is set to `"thing_t3_788tye"`. By finding the last ID on your scraped page, you can tell your _next_ request where to start (pass everything after "thing_").

For more info on this, you can take a look at the [Reddit API docs](https://github.com/reddit/reddit/wiki/JSON)

## Write one more function that finds the last `id` on the page, and stores it.

In [57]:
## YOUR CODE HERE
[x.text for x in html.find_all('a', {'class':'next '})]

[]

In [164]:
soup.find_all('div', {'class': 'thing'})['data-fullname']

TypeError: list indices must be integers, not str

In [95]:
soup.find_all('span', {'class':'next-button'})[0]

<span class="next-button"><a href="https://www.reddit.com/?count=25&amp;after=t3_7s6k5t" rel="nofollow next">next \u203a</a></span>

In [66]:
import re

In [110]:
def getlastid(my_soup):
    return my_soup.find(id=re.compile("thing"))['id'][6:]

In [111]:
getlastid(soup)

't3_7rd2ya'

In [156]:
def last_id(result):
    ids=[]
    data = result.findAll('div', {'class': 'thing'})
    for el in data:
        ids.append(el['data-fullname'])
    return ids[-1]

In [183]:
ids=[]
for x in soup.findAll('div', {'class': 'thing'}):
    ids.append(x['data-fullname'])

In [185]:
ids[-1]

't3_7s67zn'

In [157]:
last_id(soup)

't3_7s6k5t'

## (Optional) Collect more information

While we only require you to collect four features, there may be other info that you can find on the results page that might be useful. Feel free to write more functions so that you have more interesting and useful data.

In [None]:
def upvotes(result):
    upvotes = [x.text for x in soup.find_all('div', {'class':'score unvoted'})]

## Now, let's put it all together.

Use the functions you wrote above to parse out the 4 fields - title, time, subreddit, and number of comments. Create a dataframe from the results with those 4 columns.

In [423]:
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep

# The code loops over 3 pages of reddit and grabs only titles. Doesn't implement try/except so no error 
# handling built in. Also grabs sponsored titles, which should be excluded. Just proof of concept to show how
# to jump over pages.

#just testing how it works on three pages. You can try more!
pages = range(1,4)
titles = []
# starting url 
url = 'https://www.reddit.com/'    


for page in pages: 
    # Instantiate a new driver every loop
    driver = webdriver.Chrome(executable_path="./chromedriver")
    driver.get(url)
    html = driver.page_source
    # Put the page HTML in a soup object
    soup = BeautifulSoup(html, 'lxml')
    # Loop below grabs the titles on each page and appends them to titles, above. 
    for element in soup.find_all('p', {'class':'title'}):
        titles.append(element.a.text)
    # overwrite the url with the url that the "Next" link points to.
    url = soup.find('span', {'class':'next-button'}).a['href']
    # Close out the driver
    driver.close()
    # Sleeping 
    sleep(2)

In [303]:
def reddit_scraper(website):
    
    driver = webdriver.Chrome(executable_path="./chromedriver")
    driver.get(website)
    
    time.sleep(1)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
        
    titles = [x. text for x in soup.find_all('a', {'data-event-action':'title'})]
    titles.pop(0)    
    
    upvotes = [x.text for x in soup.find_all('div', {'class':'score unvoted'})]
    upvotes.pop(0)
    
    subreddit = [x.text for x in soup.find_all('a', {'class':'subreddit hover may-blank'})]
    
    rtime = [x.text for x in soup.find_all('time', {'class':'live-timestamp'})]
    
    comments = [x.text for x in soup.find_all('a', {'class':'bylink comments may-blank'})]
    
    dataframe = pd.DataFrame(
    {'titles': titles,
     'upvotes': upvotes,
     'subreddit': subreddit,
     'time_stamp': rtime,
     'num_comments': comments
    })
    
    ids=[]
        
    for x in soup.findAll('div', {'class': 'thing'}):
        ids.append(x['data-fullname'])
            
    max_results = 20001
    
    for i in range(25, max_results, 25):
        
        url_template = "http://www.reddit.com/?count={}&after={}".format(i, ids[-1])
        driver.get(url_template)
        
        time.sleep(1)
        
        html = driver.page_source
        soup = BeautifulSoup(html, 'lxml')
        
        # Additional titles added onto titles varaible 
        
        titles_2 = []
        
        for x in soup.find_all('a', {'data-event-action':'title'}):
            titles_2.append(x.text)
            
        titles_2.pop(0)
        titles += titles_2
        
        # Additional upvote numbers added onto upvote varaible 
        
        upvotes_2 = []
        
        for x in soup.find_all('div', {'class':'score unvoted'}):
            upvotes_2.append(x.text)
            
        upvotes_2.pop(0)
        upvotes += upvotes_2
        
        # Additional subreddit groups added onto subreddit varaible 
        
        subreddit_2 = []
        
        for x in soup.find_all('a', {'class':'subreddit hover may-blank'}):
            subreddit_2.append(x.text)
    
        subreddit += subreddit_2
        
        # Additional time stamps added onto rtime varaible 
        
        rtime_2 = []
        
        for x in soup.find_all('time', {'class':'live-timestamp'}):
            rtime_2.append(x.text)
    
        rtime += rtime_2
                                     
        # Additional comment numbers added onto comments varaible 
        
        comments_2 = []
        
        for x in soup.find_all('a', {'class':'bylink comments may-blank'}):
            comments_2.append(x.text)
        
        comments += comments_2                             
          
    
        # Adding additional ids to id's variable allowing us to see the next page id
        
        for x in soup.findAll('div', {'class': 'thing'}):
            ids.append(x['data-fullname'])
    
        try:
            dataframe_1 = pd.DataFrame({'titles': titles_2,
                                  'upvotes': upvotes_2,
                                  'subreddit': subreddit_2,
                                  'time_stamp': rtime_2,
                                  'num_comments': comments_2
                                 })
            dataframe = pd.concat([dataframe, dataframe_1])
            
        except:
            pass       
               
        time.sleep(1)
    
    dataframe.reset_index(inplace = True)
    driver.close()
    return dataframe

In [304]:
# hashed out to prevent rerunning
# reddit_info = reddit_scraper('http://www.reddit.com')

In [305]:
reddit_info.columns

Index([u'index', u'num_comments', u'subreddit', u'time_stamp', u'titles',
       u'upvotes'],
      dtype='object')

In [308]:
reddit_info = reddit_info.drop('index', axis = 1)

In [385]:
reddit_info.titles.drop_duplicates(inplace = True)

In [390]:
reddit_info.titles.duplicated()

0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
6444     False
7087     False
7518     False
7984     False
7985     False
9741     False
10161    False
11046    False
11051    False
11060    False
11430    False
11713    False
11912    False
12849    False
12851    False
13703    False
13730    False
13731    False
14131    False
14150    False
15397    False
16309    False
16327    False
16755    False
17181    False
17184    False
17534    False
17589    False
17607    False
17627    False
Name: titles, Length: 516, dtype: bool

In [388]:
reddit_info.shape

(17885, 5)

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops of your computer crashes, you don't lose all your data.

In [None]:
# Export to csv
reddit_info.to_csv('reddit_data_2.csv', encoding = 'utf-8', index = False)

## Predicting comments using Random Forests + Another Classifier

#### Load in the the data of scraped results

In [410]:
df = pd.read_csv('./reddit_data_2.csv')

In [425]:
df_2 = pd.read_csv('./scraping_results.csv')

In [411]:
df.shape

(17885, 5)

In [430]:
df_2.head()

Unnamed: 0.1,Unnamed: 0,created_at,id,num_comments,subreddit,time_delta,time_now,title,upvotes
0,0,2018-01-22 16:36:35,470616764,1153,gifs,0 days 04:24:57.883078000,2018-01-22 21:01:32.883070,Finnish ski jumping team,86005
1,1,2018-01-22 16:37:28,470617063,238,pics,0 days 04:24:04.883093000,2018-01-22 21:01:32.883092,Super excited about motherhood,20336
2,2,2018-01-22 16:50:12,470621440,844,funny,0 days 04:11:20.883101000,2018-01-22 21:01:32.883100,Messing with the new guy.,17611
3,3,2018-01-22 17:25:20,470633617,380,space,0 days 03:36:12.883109000,2018-01-22 21:01:32.883108,NASA cancels and postpones all of their public...,11178
4,4,2018-01-22 15:41:56,470598246,430,technology,0 days 05:19:36.883120000,2018-01-22 21:01:32.883118,New Bill Would Stop States From Banning Broadb...,13467


In [432]:
df_2.drop_duplicates('title', inplace = True)

In [None]:
index_list = []
for x,y in zip(df_2.index, df_2.title):
    if y not in index_list:
        

In [433]:
df.titles.shape

(17885,)

In [434]:
df_2.shape

(5405, 9)

In [427]:
len(set(df_2.title))

5405

In [417]:
unique_titles = []
for title in df.titles:
    if title not in unique_titles:
        unique_titles.append(title)
    else:
        pass

In [418]:
len(unique_titles)

516

In [426]:
from collections import Counter
Counter(df_2.titles)

AttributeError: 'DataFrame' object has no attribute 'titles'

In [336]:
# removed string word comments
df.num_comments = df.num_comments.apply(lambda x: x.replace(' comments', ''))

In [337]:
# removed string word comment
df.num_comments = df.num_comments.apply(lambda x: x.replace(' comment', ''))

In [339]:
# convert column into int type 
df.num_comments = df.num_comments.astype(int)

In [341]:
df.dtypes

num_comments     int64
subreddit       object
time_stamp      object
titles          object
upvotes         object
dtype: object

In [344]:
df.num_comments.median()

74.0

In [346]:
df.num_comments.max()

32452

In [347]:
df.num_comments.min()

1

In [349]:
# 74 comments or less is encoded as 0, 75 comments or higher is encoded as 1
df.num_comments= df.num_comments.apply(lambda x:0 if x <= 74 else 1)

In [350]:
X = df.drop('num_comments', axis = 1)

In [351]:
y = df.num_comments

#### We want to predict a binary variable - whether the number of comments was low or high. Compute the median number of comments and create a new binary variable that is true when the number of comments is high (above the median)

We could also perform Linear Regression (or any regression) to predict the number of comments here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW number of comments.

While performing regression may be better, performing classification may help remove some of the noise of the extremely popular threads. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of comment numbers. 

#### What is the baseline accuracy for this model?

In [355]:
y.value_counts()

0    8970
1    8915
Name: num_comments, dtype: int64

In [436]:
y.value_counts().values[0]/float(len(y))

0.50153760134190661

#### Create a Random Forest model to predict High/Low number of comments using Sklearn. Start by ONLY using the subreddit as a feature. 

In [358]:
## YOUR CODE HERE
Xs = pd.get_dummies(df.subreddit, drop_first = True)

In [359]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

In [360]:
cv = StratifiedKFold(n_splits = 10, random_state = 100)

In [361]:
rf = RandomForestClassifier(class_weight='balanced', n_jobs=-1, n_estimators=50)

In [363]:
score = cross_val_score(rf, Xs, y, cv = cv, verbose = 1)

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   16.4s finished


In [373]:
print 'Mean {} STD +/- {}'.format(np.mean(score),np.std(score))

Mean 0.892313579256 STD +/- 0.00934743485959


#### Create a few new variables in your dataframe to represent interesting features of a thread title.
- For example, create a feature that represents whether 'cat' is in the title or whether 'funny' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use count-vectorizer to create features based on the words in the thread titles.
- Build a new random forest model with subreddit and these new features included.

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process with a non-tree-based method.

In [406]:
from sklearn.linear_model import LogisticRegression 

In [None]:
lr = LogisticRegression()

In [None]:
## YOUR CODE HERE
cross_val_score()

#### Use Count Vectorizer from scikit-learn to create features from the thread titles. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [403]:
from sklearn.feature_extraction.text import CountVectorizer

In [404]:
cvec = CountVectorizer(ngram_range = (1,2), lowercase = True, stop_words = 'english', max_features = 5000)

In [437]:
X_train_dtm = pd.DataFrame(cvec.fit_transform(X_train.title).todense(), columns=cvec.get_feature_names())

AttributeError: 'Series' object has no attribute 'title'

In [422]:
summaries = "".join(df.titles)
ngrams_summaries = cvec.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'new', 664),
 (u'people', 505),
 (u'just', 421),
 (u'like', 388),
 (u'got', 380),
 (u'old', 375),
 (u'time', 372),
 (u'oc', 290),
 (u'world', 285),
 (u'000', 275),
 (u'upvote', 274),
 (u'left', 264),
 (u'post', 252),
 (u'console', 240),
 (u'don', 236),
 (u'old console', 235),
 (u'yes', 229),
 (u'bank', 210),
 (u'burn', 205),
 (u'years', 203)]

In [408]:
from sklearn.model_selection import train_test_split

In [407]:
from sklearn.naive_bayes import MultinomialNB

In [409]:
X_train, X_test, y_train, y_test = train_test_split(X.titles, y, test_size=0.33, random_state=100)

In [None]:
X_train

In [None]:
mnnb = MultinomialNB()
mnnb.fit(X_train_dtm, y_train)

In [None]:

## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.

### BONUS
Refer to the README for the bonus parts

In [None]:
## YOUR CODE HERE