## Reddit Part One: Getting Data

You are going to scrape the front page of reddit every 4 hours, saving a CSV file that includes the following for each post (each post is a `div.thing` element):

* The title of the post `a.title`
* The number of votes it has (the number between the up and down arrows) `div.unvoted`
* The number of comments it has `ul.flat-list .first`
* What subreddit it is from (e.g. /r/AskReddit, /r/todayilearned)
* When it was posted (get a TIMESTAMP, e.g. 2016-06-22T12:33:58+00:00, not "4 hours ago") `p.tagline time`, attribute `title`
* The URL to the post itself
* The URL of the thumbnail image associated with the post

Columns: `title|votes|comments|parent|date|url|img`

Data:

`list[
    {'title': post_title, 'votes': ...},
    {...},
    ...
]`
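
As a rough illustration of how a list of dictionaries like this maps onto the column layout, the sketch below serializes one made-up row. It uses the standard library `csv` module only to keep the example dependency-free (the notebook itself uses pandas); the function name `rows_to_csv`, the `|` delimiter and the sample values are all assumptions.

```python
import csv
import io

# Column order taken from the "Columns:" line of the assignment
COLUMNS = ['title', 'votes', 'comments', 'parent', 'date', 'url', 'img']

def rows_to_csv(rows, delimiter='|'):
    """Serialize a list of post dictionaries to CSV text, one column per key in COLUMNS."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS,
                            delimiter=delimiter, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up sample row, for illustration only
sample = [{'title': 'A made-up post', 'votes': 12, 'comments': 3,
           'parent': '/r/example', 'date': '2016-06-22T12:33:58+00:00',
           'url': 'https://www.reddit.com/r/example/', 'img': ''}]
csv_text = rows_to_csv(sample)
print(csv_text)
```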



In [2]:
import requests
import time, datetime
import pandas as pd

In [8]:
from bs4 import BeautifulSoup
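# A browser-like User-Agent is optional; what matters most is not sending too many requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:47.0) Gecko/20100101 Firefox/47.0'}
# headers must be passed by keyword: the second positional argument of requests.get is params
response = requests.get("https://www.reddit.com", headers=headers)
doc = BeautifulSoup(response.text, 'html.parser')
# Compare the text of the <title> tag, not the tag object itself
title_tag = doc.find('title')
if title_tag is not None and title_tag.string == 'Too Many Requests':
    print('Reddit sent a “Too Many Requests” response.')
doc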

<!DOCTYPE doctype html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><title>reddit: the front page of the internet</title><meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/><meta content="reddit: the front page of the internet" name="description"/><meta content="always" name="referrer"><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><link href="https://www.reddit.com/" rel="canonical"/><link href="https://m.reddit.com/?User-Agent=Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10.11%3B+rv%3A47.0%29+Gecko%2F20100101+Firefox%2F47.0" media="only screen and (max-width: 640px)" rel="alternate"/><meta content="width=1024" name="viewport"><link href="//www.redditstatic.com/icon.png" rel="icon" sizes="256x256" type="image/png"/><link href="//www.redditstatic.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="//www.redditstatic.com/icon-touch.png" rel="apple-touch-icon-precomposed"/><link href="https://www
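
Since the main risk here is sending too many requests, one option is to wait and retry when the server refuses. The sketch below is a minimal version of that idea, assuming the refusal arrives with HTTP status 429. The helper name `fetch_with_backoff`, its parameters and the waiting times are hypothetical; the `get` argument is expected to behave like `requests.get`.

```python
import time

def fetch_with_backoff(get, url, headers, max_tries=3, wait_seconds=60, sleep=time.sleep):
    """Call get(url, headers=headers), pausing and retrying while the server answers 429."""
    response = None
    for attempt in range(1, max_tries + 1):
        response = get(url, headers=headers)
        if response.status_code != 429:  # 429 is the HTTP status for Too Many Requests
            break
        if attempt < max_tries:
            sleep(wait_seconds * attempt)  # wait a little longer after each refusal
    return response

# Intended use in the notebook (not run here):
# response = fetch_with_backoff(requests.get, "https://www.reddit.com", headers)
```

If every attempt is refused, the last response is returned, so the caller can still inspect its status code.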

In [10]:
data = []
https = 'https:'
base_url = 'https://www.reddit.com'

def get_title(post_div):
    title = post_div.select('a.title')
    if len(title) > 0: # else: "it looks like you haven't subscribed..."
        return title[0].string
    else:
        return False

def get_votes(post_div):
    votes = post_div.select('div.unvoted div.unvoted')
    if len(votes) > 0:
        try:
            votes_nb = int(votes[0].string)
            return votes_nb
        except ValueError:
            return False
    else:
        return False
    
def get_comments_nb(post_div):
    comments_list = post_div.select('ul.flat-list .first a')
    if len(comments_list) > 0:
        comment_str = comments_list[0]
        if comment_str.string == 'comment':
            return 0
        else:
            return int(comment_str.string[0:-8])
    else:
        return False

def get_subreddit(post_div):
    subreddit = post_div.select('.tagline a.subreddit')
    if len(subreddit) > 0:
        return subreddit[0].string
    else:
        return False
     
def get_date(post_div):
    date = post_div.select('p.tagline time') #  att:title
    if len(date)>0:
        return date[0].get('title')
    else:
        return False

def get_url(post_div):
    url_list = post_div.select('a.title')
    if len(url_list) > 0: # else: "it looks like you haven't subscribed..."
        url = url_list[0].get("href")
        if url[0:4] == 'http': # absolute url
            return url
        else:
            return base_url + url
    else:
        return False

def get_thumbnail(post_div):
    img = post_div.select('a.thumbnail img')
    if len(img) > 0:
        return https + img[0].get('src')
    else:
        return '' # No thumbnail
    
posts = doc.select('div.sitetable div.thing')
print(len(posts), "posts found…")
for post in posts:
    # title|votes|comments|parent|date|url|img
    title = get_title(post)
    if title:
        votes = get_votes(post)
        comments = get_comments_nb(post)
        subreddit = get_subreddit(post)
        date = get_date(post)
        url = get_url(post)
        thumbnail = get_thumbnail(post)

        dic = {'title': title, 'votes': votes, 'comments': comments, 'subreddit': subreddit, 'date': date, 'url': url, 'thumbnail': thumbnail}
        data.append(dic)
    
print(len(data), "posts scraped.")
print(data[20])

25 posts found…
25 posts scraped.
{'subreddit': '/r/AskReddit', 'date': 'Thu Jun 23 08:07:05 2016 UTC', 'votes': 4763, 'title': 'Germans, Japanese, and Italians of Reddit, What did you learn about WW2 in School?', 'comments': 8721, 'url': 'https://www.reddit.com/r/AskReddit/comments/4pfnig/germans_japanese_and_italians_of_reddit_what_did/', 'thumbnail': ''}
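
The assignment asks for a timestamp such as 2016-06-22T12:33:58+00:00, while the `title` attribute yields a string like 'Thu Jun 23 08:07:05 2016 UTC'. A minimal sketch of the conversion follows, assuming the attribute always has this exact layout and that the day and month names are parsed under an English locale. The helper name `to_iso_timestamp` is hypothetical.

```python
from datetime import datetime, timezone

def to_iso_timestamp(reddit_date):
    """Convert e.g. 'Thu Jun 23 08:07:05 2016 UTC' to an ISO 8601 string in UTC."""
    # The trailing "UTC" is matched literally, so the result is a naive datetime
    parsed = datetime.strptime(reddit_date, "%a %b %d %H:%M:%S %Y UTC")
    # Attach the UTC timezone explicitly so isoformat() includes the +00:00 offset
    return parsed.replace(tzinfo=timezone.utc).isoformat()

print(to_iso_timestamp("Thu Jun 23 08:07:05 2016 UTC"))  # 2016-06-23T08:07:05+00:00
```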


In [11]:
datestring = time.strftime("%Y-%m-%d")
filename = "reddit-" + datestring + ".csv"

posts_df = pd.DataFrame(data)
posts_df.to_csv(filename)
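
Because the scrape is meant to run every 4 hours, a filename that contains only the date is overwritten by each later run on the same day. One possible way around this, sketched below, is to include the hour in the name. The helper `build_filename` and the `reddit-YYYY-MM-DD-HH.csv` pattern are assumptions, and any code that later looks the file up would have to use the same pattern.

```python
from datetime import datetime

def build_filename(now=None):
    """Return a CSV filename that includes the date and the hour of the scrape."""
    now = now or datetime.now()
    return "reddit-" + now.strftime("%Y-%m-%d-%H") + ".csv"

print(build_filename(datetime(2016, 6, 23, 8, 7)))  # reddit-2016-06-23-08.csv
```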


## Reddit Part Two: Sending data

You'd like to get something in your inbox about what's happening on reddit every morning at 8:30AM. Using a mailgun.com account and their API, send an email to your email address with the CSV you saved at 8AM attached. The subject of the email should be something like "Reddit this morning: January 1, 1970".

TIP: How are you going to find that CSV file? Well, think about how specific the datetime stamp in the filename really needs to be.
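
One way to read the tip: if the filename carries only the date, the 8:30AM script can rebuild the same name and check that the file is there before trying to attach it. A minimal sketch, assuming files are named `reddit-YYYY-MM-DD.csv` as in Part One. The helper name `todays_csv` and the `directory` parameter are hypothetical.

```python
import os
import time

def todays_csv(directory="."):
    """Return the path of today's reddit CSV if it exists, otherwise None."""
    filename = "reddit-" + time.strftime("%Y-%m-%d") + ".csv"
    path = os.path.join(directory, filename)
    return path if os.path.exists(path) else None

path = todays_csv()
if path is None:
    print("No CSV found for today; nothing to send.")
else:
    print("Found", path)
```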

In [15]:
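import time
import requests

datestring = time.strftime("%Y-%m-%d")
filename = "reddit-" + datestring + ".csv"

# Builds a subject such as "Reddit this morning: June 23, 2016"
subject = "Reddit this morning: " + time.strftime("%B %d, %Y")

def send_message(to='paul.ronga@laposte.net', subject='No Subject', message='Test message', file='test.csv'):
    # Open the attachment in binary mode and close it once the request has been sent
    with open(file, 'rb') as attachment:
        return requests.post(
            "https://api.mailgun.net/v3/sandboxd0b0701ec9a74e0eba05c70bd49ed81a.mailgun.org/messages",
            auth=("api", "key-568389f7b3856d32560304ddac69f2d8"),
            files=[("attachment", attachment)],
            data={"from": "Paul Ronga (MG) <mailgun@mg.tcch.ch>",
                  "to": [to],
                  "subject": subject,
                  "text": message})

# Pass the subject explicitly; otherwise the default 'No Subject' is used
result = send_message(subject=subject, file=filename)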
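
The API key appears directly in the cell above, which makes it easy to leak when the notebook is shared. A common alternative, sketched here, is to read it from an environment variable. The variable name `MAILGUN_API_KEY` and the helper `mailgun_auth` are assumptions; the returned tuple has the same `("api", key)` shape that the `auth=` argument above uses.

```python
import os

def mailgun_auth(env=os.environ):
    """Build the (username, password) pair for the auth= argument from the environment."""
    api_key = env.get("MAILGUN_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("MAILGUN_API_KEY is not set")
    return ("api", api_key)

# Intended use (not run here): requests.post(..., auth=mailgun_auth(), ...)
```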