## MISSION PART ONE: GETTING DATA
* You are going to scrape the front page of reddit every 4 hours, saving a CSV file that includes:
1. The title of the post
2. The number of votes it has (the number between the up and down arrows)
3. The number of comments it has
4. What subreddit it is from (e.g. /r/AskReddit, /r/todayilearned)
5. When it was posted (get a TIMESTAMP, e.g. 2016-06-22T12:33:58+00:00, not "4 hours ago")
6. The URL to the post itself
7. The URL of the thumbnail image associated with the post

For the purposes of this exercise I shall be scraping https://www.reddit.com/r/funny/, because who doesn't want funny e-mails coming in at 8am every morning? I am going to regret this, aren't I?

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

In [3]:
#Grab the reddit site
response = requests.get("https://www.reddit.com/r/funny/", headers=headers)

In [4]:
doc = BeautifulSoup(response.text, 'html.parser')

In [5]:
#doc

In [6]:
#Grab the posts

posts = doc.find_all('div', {'class': 'thing'})
#len(posts)

In [7]:
#posts

In [8]:
all_posts = []

#Collect the fields we need from each post
for post in posts:
    #Titles are <a> tags with the class "title"
    title = post.find('a', {'class': 'title'})
    title_text = title.text.strip()
    #Votes are in a <div> with the class "score"
    vote = post.find('div', {'class': 'score'})
    vote_text = vote.text.strip()
    #Comments are <a> tags with the class "bylink"
    comment = post.find('a', {'class': 'bylink'})
    comment_text = comment.text.strip()
    #When it was posted: the datetime attribute of the <time> tag
    #(named timestamp so it does not shadow the time module imported later)
    timestamp = post.find('time', {'class': 'live-timestamp'})['datetime']
    #URL of the post, taken from the title link
    link = title['href']
    #URL of the thumbnail image, if the post has one
    thumbnail = post.find('img')
    if thumbnail:
        thumbnail = "http:" + (thumbnail['src'])
    funny_posts = {'title': title_text, 'votes': vote_text, 'comments': comment_text, 'timestamp': timestamp, 'link': link, 'thumbnail': thumbnail}
    all_posts.append(funny_posts)
all_posts
    
#QUESTION: How to replace /r/ with link? Do we use regular expressions?

[{'comments': '129 comments',
  'link': 'https://www.reddit.com/r/unfortunateplacement',
  'thumbnail': 'http://a.thumbs.redditmedia.com/ExQ61Q54Z-aAuJpkFNcC0viWh-2iQcEc9HrocEZcxw8.jpg',
  'timestamp': '2016-06-01T16:08:54+00:00',
  'title': 'Subreddit Of The Month [June 2016]: /r/unfortunateplacement, "(Un)fortunate Ad Placement". Know of a small (under 10,000 subscribers) humor-based subreddit that deserves a month in the spotlight? Link it inside!',
  'votes': '406'},
 {'comments': '378 comments',
  'link': '/r/funny/comments/4j1nln/irs_phone_scams_and_similar_posts_tldr_dont_post/',
  'thumbnail': None,
  'timestamp': '2016-05-12T16:57:43+00:00',
  'title': "IRS phone scams, and similar posts. tldr - DON'T POST PHONE NUMBERS ON REDDIT",
  'votes': '1304'},
 {'comments': '2694 comments',
  'link': 'http://imgur.com/WQ9f3g0',
  'thumbnail': 'http://b.thumbs.redditmedia.com/1ZRp2v23jOSqkzRmfcqjR5fthye4Iz1dRPwea7Bz9YQ.jpg',
  'timestamp': '2016-06-23T12:41:33+00:00',
  'title': 'Army S
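
On the question in the cell above: relative links such as `/r/funny/comments/...` probably do not need a regular expression, since `urllib.parse.urljoin` resolves them against the page address and leaves absolute URLs alone. The same cell also skips item 4 (the subreddit), so the sketch below adds one possible way to read it out of a URL. The helper names `absolute_link` and `subreddit_from_url` are made up for illustration, and the second one assumes the URL path starts with `/r/<name>`, which holds for the links shown in the output but is not verified for every post.

```python
import re
from urllib.parse import urljoin, urlparse

# Assumption: the page was fetched from this address, as in the cells above.
BASE_URL = "https://www.reddit.com/r/funny/"

def absolute_link(href, base=BASE_URL):
    """Resolve a relative href against the page URL; absolute URLs pass through."""
    return urljoin(base, href)

def subreddit_from_url(url):
    """Return '/r/<name>' if the URL path starts with /r/<name>, else None."""
    match = re.match(r"/r/([^/]+)", urlparse(url).path)
    return "/r/" + match.group(1) if match else None

# Links copied from the output above.
relative = "/r/funny/comments/4j1nln/irs_phone_scams_and_similar_posts_tldr_dont_post/"
external = "http://imgur.com/WQ9f3g0"

print(absolute_link(relative))       # now starts with https://www.reddit.com
print(absolute_link(external))       # unchanged
print(subreddit_from_url(relative))  # /r/funny
print(subreddit_from_url(external))  # None
```

Inside the loop, the `link` value could be passed through `absolute_link`, and the `href` of the comments link could be passed to `subreddit_from_url`, assuming that link is a permalink under `/r/<name>/`.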

In [9]:
#len(all_posts)

In [10]:
import pandas as pd

In [11]:
posts_df = pd.DataFrame(all_posts)
posts_df.head()

| | comments | link | thumbnail | timestamp | title | votes |
|---|---|---|---|---|---|---|
| 0 | 129 comments | https://www.reddit.com/r/unfortunateplacement | http://a.thumbs.redditmedia.com/ExQ61Q54Z-aAuJ... | 2016-06-01T16:08:54+00:00 | Subreddit Of The Month [June 2016]: /r/unfortu... | 406 |
| 1 | 378 comments | /r/funny/comments/4j1nln/irs_phone_scams_and_s... | None | 2016-05-12T16:57:43+00:00 | IRS phone scams, and similar posts. tldr - DON... | 1304 |
| 2 | 2694 comments | http://imgur.com/WQ9f3g0 | http://b.thumbs.redditmedia.com/1ZRp2v23jOSqkz... | 2016-06-23T12:41:33+00:00 | Army Specialist was denied leave to go to a ba... | 9292 |
| 3 | 834 comments | http://i.imgur.com/m0xCyiI.jpg | http://b.thumbs.redditmedia.com/sXji7NpPLobngC... | 2016-06-23T03:43:15+00:00 | A girl at the fire station after getting stuck... | 5853 |
| 4 | 145 comments | http://i.imgur.com/tw02cpV.jpg | http://b.thumbs.redditmedia.com/ZuCNEoY9po1AeT... | 2016-06-23T10:50:10+00:00 | This is it. No more arguing. | 1304 |
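
The `comments` and `votes` columns above are still text ("129 comments", "406"), and the timestamps are strings. If numbers and real datetimes are wanted later, a small clean-up step along these lines might help. This is only a sketch using the standard library: `leading_int` is a made-up helper name, and returning `None` when a value holds no digits (for example a hidden score) is an assumption about how such rows should be treated.

```python
import re
from datetime import datetime

def leading_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

# One row copied from the table above.
row = {'comments': '129 comments',
       'votes': '406',
       'timestamp': '2016-06-01T16:08:54+00:00'}

comments = leading_int(row['comments'])
votes = leading_int(row['votes'])
# fromisoformat (Python 3.7+) keeps the +00:00 offset, so the result is timezone-aware.
posted = datetime.fromisoformat(row['timestamp'])

print(comments, votes, posted.year)  # 129 406 2016
```

The same function could be applied to a whole column with `posts_df['comments'].apply(leading_int)` before saving.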


In [12]:
import time

In [13]:
datestring = time.strftime("%Y-%m-%d-%H-%M")
datestring

'2016-06-23-13-16'

In [14]:
filename = "reddit-data" + datestring + ".csv"
posts_df.to_csv(filename, index=False)
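
Since the scrape is meant to run every 4 hours, each run needs its own file name, and the minute-level date string above takes care of that. One thing to be aware of: `time.strftime` with no time argument uses the machine's local time, while the post timestamps carry a `+00:00` offset. A sketch of a helper that takes the moment explicitly is below; `snapshot_filename` is a hypothetical name, and it keeps the same name pattern as the cell above.

```python
from datetime import datetime, timezone

def snapshot_filename(moment, prefix="reddit-data"):
    """Build the CSV name in the same pattern as the cell above, from an explicit datetime."""
    return prefix + moment.strftime("%Y-%m-%d-%H-%M") + ".csv"

# A fixed moment reproduces the name from the run shown above.
print(snapshot_filename(datetime(2016, 6, 23, 13, 16)))  # reddit-data2016-06-23-13-16.csv

# For a scheduled run, the current UTC time would line up with the post timestamps.
print(snapshot_filename(datetime.now(timezone.utc)))
```

Passing the datetime in, rather than reading the clock inside the function, also makes the name easy to check without waiting for the schedule.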