# Project 3: Web APIs & Classification

## Problem Statement

Using Reddit's API, we collect posts from two subreddits, r/worldnews and r/todayilearned, and then use NLP to train a classifier that predicts which subreddit a given post came from.

## Executive Summary

### Contents:
- [Scraping reddit for data](#Scraping-reddit-for-data)
- [2018 Data Import and Cleaning](#2018-Data-Import-and-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Data Visualization](#Visualize-the-data)
- [Descriptive and Inferential Statistics](#Descriptive-and-Inferential-Statistics)
- [Outside Research](#Outside-Research)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

In [1]:
import numpy as np
import pandas as pd
import requests
import time
import random

## Scraping reddit for data

#### 1. Scrape reddit.

In [2]:
url = 'https://www.reddit.com/r/worldnews.json'
url2 = 'https://www.reddit.com/r/todayilearned.json'

In [19]:
def subreddit_scraper(subreddit, n_posts):
    '''
    Scrape roughly n_posts posts from r/{subreddit} and return them
    as a DataFrame with repeated posts removed.
    '''
    url = f'https://www.reddit.com/r/{subreddit}.json'
    headers = {'User-agent': 'srs 0.01'}
    posts = []
    after = None
    # The listing returns 25 posts per request by default.
    n_scrapes = int(np.ceil(n_posts / 25))
    for i in range(n_scrapes):
        
        # Load indicator.
        if i%5 == 0:
            print(f'scrape {i} in progress...')

        # Use original url for first scrape.
        if after is None:
            params = {}
        else:
            params = {'after': after}

        r = requests.get(url, params=params, headers=headers)
        if r.status_code == 200:
            js = r.json()
            posts.extend(js['data']['children'])
            after = js['data']['after']
        else:
            print(r.status_code)
            break

        # Randomize sleep timing.
        time.sleep(random.random()*7)
    
    # Remove repeats, keeping the first occurrence of each post name.
    seen = set()
    posts_nr = []
    for p in posts:
        name = p['data']['name']
        if name not in seen:
            seen.add(name)
            posts_nr.append(p)

    # Each post's fields live under its 'data' key; build the frame from those dicts.
    df = pd.DataFrame([p['data'] for p in posts_nr])
    return df
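
The `after` token is what moves the scraper from one page of the listing to the next. As a minimal offline sketch of that cursor logic, the snippet below swaps the HTTP call for an in-memory stand-in. The names `fake_fetch`, `collect_pages`, `FAKE_POSTS`, and the sample post names are illustrative assumptions, not part of the notebook or of Reddit's API.

```python
# Hypothetical in-memory stand-in for the subreddit listing (not real data).
FAKE_POSTS = [{'data': {'name': f't3_{i:03d}'}} for i in range(60)]
PAGE_SIZE = 25


def fake_fetch(after=None):
    """Mimic the shape of a listing response: children plus an 'after' token."""
    names = [p['data']['name'] for p in FAKE_POSTS]
    start = 0 if after is None else names.index(after) + 1
    children = FAKE_POSTS[start:start + PAGE_SIZE]
    # The token is the name of the last post served, or None on the final page.
    more = start + PAGE_SIZE < len(FAKE_POSTS)
    next_after = children[-1]['data']['name'] if more else None
    return {'data': {'children': children, 'after': next_after}}


def collect_pages(fetch, n_pages):
    """Follow the 'after' token for up to n_pages requests."""
    posts, after = [], None
    for _ in range(n_pages):
        js = fetch(after)
        posts.extend(js['data']['children'])
        after = js['data']['after']
        if after is None:
            # No further pages; stopping here avoids re-requesting page one.
            break
    return posts


posts = collect_pages(fake_fetch, n_pages=5)
print(len(posts))
```

In `subreddit_scraper` itself, an `after` of `None` partway through the loop means the next request is sent without a cursor, so it fetches the first page again. The early `break` in this sketch is one way to avoid that.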

In [20]:
df_wn = subreddit_scraper('worldnews', 1000)

scrape 0 in progress...
scrape 5 in progress...
scrape 10 in progress...
scrape 15 in progress...
scrape 20 in progress...
scrape 25 in progress...
scrape 30 in progress...
scrape 35 in progress...


In [21]:
df_wn.shape

(795, 104)
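
The frame has 795 rows rather than the 1000 requested. One plausible reason, given the "Remove repeats" step in `subreddit_scraper`, is that some requests returned posts that had already been collected. The sketch below illustrates that step on a tiny hypothetical payload; the post names and titles are made up, and `drop_duplicates` is used here as an alternative to the notebook's explicit loop.

```python
import pandas as pd

# Hypothetical payload shaped like js['data']['children']; values are made up.
posts = [
    {'kind': 't3', 'data': {'name': 't3_a', 'title': 'first'}},
    {'kind': 't3', 'data': {'name': 't3_b', 'title': 'second'}},
    {'kind': 't3', 'data': {'name': 't3_a', 'title': 'first'}},  # repeat
]

# Flatten the 'data' dicts into columns, then drop repeated post names.
df = pd.DataFrame([p['data'] for p in posts])
df_unique = df.drop_duplicates(subset='name').reset_index(drop=True)

print(len(posts), '->', len(df_unique))
```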

#### 2. Save dataset after scraping.

In [26]:
export_path = r'..\datasets\worldnews.csv'
df_wn.to_csv(export_path, index=False)

In [24]:
df_til = subreddit_scraper('todayilearned', 1000)

scrape 0 in progress...
scrape 5 in progress...
scrape 10 in progress...
scrape 15 in progress...
scrape 20 in progress...
scrape 25 in progress...
scrape 30 in progress...
scrape 35 in progress...


In [25]:
df_til.shape

(674, 105)

In [27]:
export_path = r'..\datasets\todayilearned.csv'
df_til.to_csv(export_path, index=False)
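
The export paths above use backslashes, which only resolve as directory separators on Windows. As a rough sketch of a more portable save step, the snippet below builds the path with `pathlib` and derives the filename from the subreddit name, which also makes it harder to write one frame under the other's filename. The helper `save_posts` and the demo rows are hypothetical, and a temporary directory stands in for `../datasets`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_posts(df, folder, subreddit):
    """Write df to <folder>/<subreddit>.csv and return the path (hypothetical helper)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f'{subreddit}.csv'
    df.to_csv(path, index=False)
    return path


# Made-up rows standing in for scraped posts.
demo = pd.DataFrame({'name': ['t3_a', 't3_b'], 'title': ['first', 'second']})

with tempfile.TemporaryDirectory() as tmp:
    path = save_posts(demo, tmp, 'todayilearned')
    # Read the file back to confirm the round trip preserves the shape.
    reloaded = pd.read_csv(path)

print(path.name, reloaded.shape)
```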