<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Subreddit Classification

## Table of Contents
1. [Web-Scraping](#1.-Web-Scraping)   

# 1. Web-Scraping

In this project, we use the [Pushshift API](https://github.com/pushshift/api) to scrape posts (submissions) from two subreddits on [reddit.com](https://www.reddit.com/). The collected posts serve as the data for the classification task.

## 1.1 Import Libraries

In [1]:
import requests
import pandas as pd
import time
from datetime import datetime as dt
from os import path

## 1.2 Fetch Data From API

In [2]:
# Define constants
URL = 'https://api.pushshift.io/reddit/search/submission'
SIZE = 100
WAIT_TIME = 10

In [3]:
def fetch_data(URL, params):
    '''
    Fetch one page of data from the API, retrying until the request succeeds
    :parameter
        :param URL: URL of the API endpoint
        :param params: query parameters for the request, e.g. 'subreddit', 'size'
    :return
        a dataframe with the fetched data
    '''
    success = False
    while not success:
        try:
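            # Pass params by keyword and set a timeout so a stalled
            # connection raises an error instead of hanging forever
            res = requests.get(URL, params=params, timeout=30)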
            status = res.status_code
            print(f'Get Status: {status}')
            if status == 200:
                success = True
            else:
                time.sleep(WAIT_TIME)
        except requests.exceptions.RequestException as error:
            # Network problem or timeout: report it, wait, then retry
            print(error)
            time.sleep(WAIT_TIME)

    data = res.json()
    return pd.DataFrame(data['data'])
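
`fetch_data` retries indefinitely, so a persistent failure (for example, an API outage) would stall the notebook. Below is a minimal sketch of a bounded retry, assuming a hypothetical helper `fetch_with_retries` that is not part of the original notebook. The request is passed in as a callable, `get_page`, so the retry logic can be tried without network access.

```python
import time


def fetch_with_retries(get_page, max_retries=5, wait_time=0.0):
    """Call get_page() until it returns a result or max_retries is reached.

    get_page is a hypothetical zero-argument callable that returns the parsed
    payload on success and raises an exception on failure.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return get_page()
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            print(f'Attempt {attempt} failed: {error}')
            time.sleep(wait_time)
    # Every attempt failed: surface the last error instead of looping forever
    raise RuntimeError(f'Giving up after {max_retries} attempts') from last_error
```

In the notebook, `get_page` could wrap the `requests.get` call, with `wait_time` set to `WAIT_TIME`.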

In [4]:
def main_loop(file, subreddit, num_loops):
    '''
    Repeatedly fetch posts from the API and save them to a CSV file
    :parameter
        :param file: path of the CSV file used to store the fetched data
        :param subreddit: name of the subreddit to fetch from the API
        :param num_loops: number of fetch requests to make
    :output
        a csv file with all fetched data
    '''
    for i in range(num_loops):
        print(f'Loop {i}')

        # If the file does not exist, start pulling posts from the current datetime;
        # otherwise continue from the created_utc of the last post in the file
        if not path.isfile(file):
            params = {
                'subreddit': subreddit,
                'size': SIZE,
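                # now() rather than utcnow(): timestamp() treats a naive
                # datetime as local time, so utcnow() shifts the epoch value
                'before': int(dt.now().timestamp())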
            }
            df = fetch_data(URL, params)
        else:
            df = pd.read_csv(file)
            params = {
                'subreddit': subreddit,
                'size': SIZE,
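                # Last row is assumed to be the oldest post fetched so far;
                # cast the numpy integer to a plain int for the query string
                'before': int(df['created_utc'].iloc[-1])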
            }
            new_df = fetch_data(URL, params)
            df = pd.concat([df, new_df])

        df.to_csv(file, index=False)
        time.sleep(WAIT_TIME)
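
The loop pages backwards through time: each request asks for posts created before the oldest post collected so far. Below is a minimal sketch of that cursor logic on in-memory data. It assumes two hypothetical helpers, `next_before` and `append_page`, that are not in the original notebook, and it assumes each post carries an `id` column alongside `created_utc`. Taking the minimum `created_utc` avoids relying on row order, and dropping duplicate ids guards against storing the same post twice.

```python
import pandas as pd


def next_before(df):
    """Return the timestamp to use as the next 'before' cursor."""
    return int(df['created_utc'].min())


def append_page(df, new_df):
    """Append a newly fetched page, dropping posts that are already stored."""
    combined = pd.concat([df, new_df], ignore_index=True)
    return combined.drop_duplicates(subset='id').reset_index(drop=True)


# Toy pages standing in for two API responses (values are made up)
page_1 = pd.DataFrame({'id': ['a', 'b'], 'created_utc': [300, 200]})
page_2 = pd.DataFrame({'id': ['b', 'c'], 'created_utc': [200, 100]})

posts = append_page(page_1, page_2)
print(posts['id'].tolist())
print(next_before(posts))
```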

In [5]:
# Fetch subreddit_1: SleepApnea
file = '../data/SleepApnea.csv'
subreddit = 'SleepApnea'

main_loop(file, subreddit, 17)

Loop 0
Get Status: 200
Loop 1
Get Status: 200
Loop 2
Get Status: 200
Loop 3
Get Status: 200
Loop 4
Get Status: 200
Loop 5
Get Status: 200
Loop 6
Get Status: 200
Loop 7
Get Status: 200
Loop 8
Get Status: 200
Loop 9
Get Status: 200
Loop 10
Get Status: 200
Loop 11
Get Status: 200
Loop 12
Get Status: 200
Loop 13
Get Status: 200
Loop 14
Get Status: 200
Loop 15
Get Status: 200
Loop 16
Get Status: 200


In [6]:
# Fetch subreddit_2: Sleepparalysis
file = '../data/Sleepparalysis.csv'
subreddit = 'Sleepparalysis'

main_loop(file, subreddit, 17)

Loop 0
Get Status: 200
Loop 1
Get Status: 200
Loop 2
Get Status: 200
Loop 3
Get Status: 200
Loop 4
Get Status: 200
Loop 5
Get Status: 200
Loop 6
Get Status: 200
Loop 7
Get Status: 200
Loop 8
Get Status: 200
Loop 9
Get Status: 200
Loop 10
Get Status: 200
Loop 11
Get Status: 200
Loop 12
Get Status: 200
Loop 13
Get Status: 200
Loop 14
Get Status: 200
Loop 15
Get Status: 200
Loop 16
Get Status: 200
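
Each successful request returns at most `SIZE` = 100 posts, so 17 loops collect at most 1,700 rows per subreddit. A quick sanity check on a saved file can confirm the row count, the number of distinct posts, and the date range covered. Below is a minimal sketch, assuming a hypothetical helper `summarize_posts` (not in the original notebook) and assuming the data has `id` and `created_utc` columns. Toy data is used here in place of the real CSV files.

```python
import pandas as pd


def summarize_posts(df):
    """Return basic sanity-check figures for a dataframe of fetched posts."""
    # created_utc holds epoch seconds; convert to timezone-aware datetimes
    created = pd.to_datetime(df['created_utc'], unit='s', utc=True)
    return {
        'rows': len(df),
        'unique_ids': df['id'].nunique(),
        'oldest': created.min(),
        'newest': created.max(),
    }


# Toy data standing in for a saved CSV (values are made up)
toy = pd.DataFrame({'id': ['a', 'b', 'b'], 'created_utc': [0, 86400, 86400]})
summary = summarize_posts(toy)
print(summary['rows'], summary['unique_ids'])
print(summary['oldest'], summary['newest'])
```

On the real data, this could be called as `summarize_posts(pd.read_csv('../data/SleepApnea.csv'))`.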
