# Project 3 - Web APIs & Classification

## Description:

For project 3, the goal is two-fold:

1. Use Reddit's API to collect posts from two subreddits.
2. Use NLP to train a classifier that predicts which subreddit a given post came from. This is a binary classification problem.

## Project Structure:

- Part 1 - Web APIs and Data Collection
- Part 2 - EDA, Data Cleaning, and Feature Engineering
- Part 3a - Modeling: Naive-Bayes
- Part 3b - Modeling: Logistic-Regression
- Part 4 - Visualization with Scattertext

## Part 1 - Web APIs and Data Collection

### Table of Contents

- [1.0-Import Libraries](#1.0---Import-Libraries)
- [1.1-Reddit API](#1.1---Reddit-API)
- [1.2-Generate CSV File Names](#1.2---Generate-CSV-File-Names)
- [1.3-Collecting Posts](#1.3---Collecting-Posts)

### 1.0 - Import Libraries

In [1]:
import requests
import pandas as pd
import re
import time

### 1.1 - Reddit API

In [2]:
# list of urls
# to add more urls, label each url using the url_# scheme and append it to url_list

url_1 = 'http://www.reddit.com/r/breastcancer.json'
url_2 = 'http://www.reddit.com/r/airquality.json'
url_list = [url_1,
            url_2
           ]
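
As the number of subreddits grows, the URLs could also be built from the subreddit names rather than numbered by hand. The following is a minimal sketch, not part of the original notebook: `BASE_URL`, `build_urls`, and `sketch_urls` are hypothetical names, and the URL template simply mirrors the shape of `url_1` and `url_2`.

```python
# Hypothetical helper (not in the original notebook): build listing URLs
# from subreddit names instead of hand-numbered url_# variables.
BASE_URL = 'http://www.reddit.com/r/{}.json'  # same shape as url_1 and url_2


def build_urls(subreddits):
    """Return one .json listing URL per subreddit name."""
    return [BASE_URL.format(name) for name in subreddits]


sketch_urls = build_urls(['breastcancer', 'airquality'])
print(sketch_urls)
```

With this approach, adding a third subreddit only means adding one more name to the list.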

### 1.2 - Generate CSV File Names

In [3]:
# Build one csv file name per url, e.g. '.../r/breastcancer.json' -> 'breastcancer.csv'
def gen_filename(url_list):
    pattern = r'r/(\w+\.)json'
    return [re.findall(pattern, url)[0].lower() +'csv' for url in url_list]

In [4]:
# Generate filenames
filenames = gen_filename(url_list)
filenames

['breastcancer.csv', 'airquality.csv']
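
The regular expression above assumes each URL contains exactly `r/<name>.json`. As an alternative sketch, the same file names can be derived by parsing the URL path with the standard library, which also tolerates a trailing query string. The function name `gen_filename_from_path` is hypothetical and not part of the original notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def gen_filename_from_path(url_list):
    """Derive '<subreddit>.csv' from each URL's path component."""
    names = []
    for url in url_list:
        # urlparse separates the path ('/r/breastcancer.json') from any query string
        path = PurePosixPath(urlparse(url).path)
        # swap the .json suffix for .csv and keep only the final path segment
        names.append(path.with_suffix('.csv').name.lower())
    return names


print(gen_filename_from_path([
    'http://www.reddit.com/r/breastcancer.json',
    'http://www.reddit.com/r/airquality.json',
]))
```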

### 1.3 - Collecting Posts

In [5]:
# Function for collecting posts
def post_collector(url_list, post_n, filenames):
    '''Collect posts from subreddits and save them to csv files.
       Inputs:
              url_list:  list, list of urls (reddit APIs)
              post_n:    int, number of posts to request for each url
              filenames: list, csv file names (one per url)
       Output: one csv file per url, written to ./dataset/
    '''
    headers = {'User-agent': 'Kai Bot 2.0'}
    for url, filename in zip(url_list, filenames):  # pair each reddit url with its csv file name
        posts = []
        after = None                # 'after' stores the ID of the last post on the previous page
        for i in range(post_n // 25):  # each page contains 25 posts, so request post_n // 25 pages
            params = {} if after is None else {'after': after}
            res = requests.get(url, params=params, headers=headers)  # load page
            if res.status_code != 200:
                print(res.status_code)
                break
            the_json = res.json()  # parse the response body as JSON
            posts.extend(the_json['data']['children'])
            after = the_json['data']['after']  # update 'after' to the ID of the last post on the current page
            print(f'Downloaded posts {25*(i+1)} from {url}')
            if after is None:      # no further page; another request would restart from the first page
                break
            time.sleep(5)  # wait 5 seconds between requests
        pd.DataFrame(posts).to_csv('./dataset/' + filename, index=False)  # the ./dataset/ directory must already exist
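
The `after` cursor in `post_collector` can be exercised without any network access. The sketch below is an illustration under stated assumptions, not the notebook's code: `collect_pages`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names, the fake listing stands in for `the_json['data']`, and it assumes the listing reports `None` for `after` once no further page exists.

```python
def collect_pages(fetch_page, post_n, page_size=25):
    """Follow the 'after' cursor for at most post_n // page_size pages."""
    posts = []
    after = None                      # no cursor yet: start from the first page
    for _ in range(post_n // page_size):
        listing = fetch_page(after)   # dict shaped like the_json['data']
        posts.extend(listing['children'])
        after = listing['after']      # cursor for the next page
        if after is None:             # assumed end-of-listing signal
            break
    return posts


# Fake listing of 60 post IDs, standing in for a subreddit (hypothetical data).
FAKE_POSTS = [f't3_{i}' for i in range(60)]


def fake_fetch(after, page_size=25):
    """Return one page of FAKE_POSTS starting right after the given ID."""
    start = 0 if after is None else FAKE_POSTS.index(after) + 1
    page = FAKE_POSTS[start:start + page_size]
    has_more = start + page_size < len(FAKE_POSTS)
    return {'children': page, 'after': page[-1] if has_more else None}


collected = collect_pages(fake_fetch, post_n=1000)
print(len(collected))
```

Stopping when the cursor is `None` matters because a request without `after` returns the first page again, which would append duplicate posts.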

In [6]:
# Collect posts and save to csv
post_collector(url_list, 1000, filenames)

Downloaded posts 25 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 50 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 75 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 100 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 125 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 150 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 175 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 200 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 225 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 250 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 275 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 300 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 325 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 350 from http://www.reddit.com/r/breastcancer.json
Downloaded posts 375 from http://www.reddit.com/r/b