# Project 3: NLP for Classifying Reddit Posts (Data Collection)

## Problem Statement

- With a growing number of working moms, and dads taking on a larger role in caring for children, child rearing is increasingly a shared responsibility.

- At theAsianparent, we help parents experience healthy pregnancies and raise healthy families. Through this project, we hope to:
    - Automate the classification and monitoring of posts and discussions by moms and dads
    - Better understand their concerns and topics of interest, so that we can create targeted, interesting articles that better support parenthood

- Target Audience: Moms and dads

- Because we want to curate topics and articles for both moms and dads, we will use accuracy, sensitivity (the proportion of r/Mommit posts correctly classified) and specificity (the proportion of r/Daddit posts correctly classified) to select the best-performing model.

- We will explore Logistic Regression, Multinomial Naive Bayes and Random Forest classifiers, each paired with a count vectorizer and a TF-IDF vectorizer.
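
The three evaluation metrics named above can all be read off a two-class confusion matrix. The sketch below is a minimal illustration, not the project's evaluation code: it assumes r/Mommit is treated as the positive class, and the function name `classification_metrics` and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive="Mommit"):
    """Return accuracy, sensitivity and specificity for a two-class problem.

    Assumes both classes appear in y_true; otherwise a denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # Mommit predicted as Mommit
    fn = sum(t == positive and p != positive for t, p in pairs)  # Mommit predicted as Daddit
    tn = sum(t != positive and p != positive for t, p in pairs)  # Daddit predicted as Daddit
    fp = sum(t != positive and p == positive for t, p in pairs)  # Daddit predicted as Mommit
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of Mommit posts correctly classified
        "specificity": tn / (tn + fp),   # share of Daddit posts correctly classified
    }


# Toy labels, made up purely for illustration
y_true = ["Mommit", "Mommit", "Mommit", "Daddit", "Daddit"]
y_pred = ["Mommit", "Mommit", "Daddit", "Daddit", "Mommit"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy shows whether a model favours one subreddit over the other, which accuracy alone can hide.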

## Import Libraries

In [6]:
import requests
import pandas as pd

In [7]:
def get_reddit_posts(total_posts, topic):
    """Collect at least `total_posts` unique text posts from a subreddit via Pushshift."""

    n_posts = 0
    df = pd.DataFrame()

    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {'subreddit': topic, 'size': 100}

    while n_posts < total_posts:
        res = requests.get(url, params=params)
        res.raise_for_status()
        posts = res.json()['data']
        if not posts:
            # no older posts left, so stop instead of failing on an empty page
            break
        # page backwards in time: the next request asks for posts older than the last one here
        params['before'] = posts[-1]['created_utc']
        df_new = pd.DataFrame(posts)
        df_new = df_new[['subreddit', 'selftext', 'title']]
        # keep only posts that still have body text
        df_new = df_new[df_new['selftext'].notnull() & ~df_new['selftext'].isin(['', '[removed]', '[deleted]'])]
        df = pd.concat([df, df_new], ignore_index=True)
        df.drop_duplicates(subset=['selftext'], keep='last', inplace=True)
        n_posts = df.shape[0]

    return df
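
The function above pages through results by passing the `created_utc` of the last post on each page as the `before` parameter of the next request. The sketch below isolates that cursor logic so it can be run without network access. It is an illustration under stated assumptions: `paginate`, `fake_fetch` and `FAKE_POSTS` are hypothetical stand-ins for the real API call, and posts are assumed to arrive newest first.

```python
def paginate(fetch_page, total_posts):
    """Collect posts by repeatedly requesting those older than the last one seen."""
    collected = []
    before = None  # no cursor on the first request
    while len(collected) < total_posts:
        page = fetch_page(before)
        if not page:
            # nothing older is available, so stop rather than loop forever
            break
        collected.extend(page)
        before = page[-1]["created_utc"]  # oldest post on this page becomes the cursor
    return collected[:total_posts]


# Hypothetical in-memory "API": seven posts, newest first
FAKE_POSTS = [{"id": i, "created_utc": 1000 - i} for i in range(7)]


def fake_fetch(before, size=3):
    """Return up to `size` posts strictly older than `before` (or the newest if None)."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]


result = paginate(fake_fetch, total_posts=5)
print([p["id"] for p in result])
```

Unlike `get_reddit_posts`, this sketch truncates the result to exactly `total_posts`; the original keeps every post from the final page, so it may return slightly more than requested.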

In [10]:
# scrape 10,000 posts from r/Mommit and save to mommit.csv
mommit = get_reddit_posts(10000, 'Mommit')
mommit.to_csv('../data/mommit.csv', index=False)

In [12]:
# scrape 10,000 posts from r/Daddit and save to daddit.csv
daddit = get_reddit_posts(10000, 'Daddit')
daddit.to_csv('../data/daddit.csv', index=False)