# Project 3: 'AskFeminists' vs. 'MensRights'

## Part A: Webscraping

To scrape the two subreddits, I used [Pushshift's API](https://github.com/pushshift/api) to pull roughly the last year's worth of posts and comments from each. In total, I collected 25,000 comments per subreddit, plus about 4,300 posts for AskFeminists and about 6,400 posts for MensRights (counts after dropping posts with no body text).

In [1]:
# Import libraries
import pandas as pd
import requests
import time
import datetime as dt
import json
from bs4 import BeautifulSoup

pd.set_option('display.max_colwidth', None) # None = no truncation (-1 is rejected by recent pandas versions)
pd.options.display.max_columns = 999

In [2]:
# Brian's function to scrape pushshift API
def query_pushshift(subreddit, # subreddit name
                    kind='submission', # can be 'submission' or 'comment'
                    times = 26, # number of time periods to iterate through
                    skip = 15, # number of days in each time period
                    subfield = ['title', 'selftext', 'subreddit', 'created_utc', 
                                'author', 'num_comments', 'score', 'is_self', 'full_link'], 
                    # subfields for just submissions
                    comfields = ['body', 'score', 'created_utc']): # fields for comments (not used in the function body)

    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    # creating base url
    mylist = [] # instantiating empty list
    
    for x in range(1, times): # x runs from 1 to times - 1, so this makes times - 1 requests
        
        URL = "{}&after={}d".format(stem, skip * x) # new url for each time period
        print(URL) # prints url as it's scraping it
        response = requests.get(URL) # setting up scraper
        assert response.status_code == 200 # halt if the request did not succeed
        mine = response.json()['data'] # content we want from scrape
        df = pd.DataFrame.from_dict(mine) # setting up dataframe from dictionaries of scraped content
        mylist.append(df) # adding to mylist
        time.sleep(2) # setting sleep time between scrapes
        
    full = pd.concat(mylist, sort=False) # concatenating all dfs into one
    
    if kind == "submission": # for submissions, dropping dups and not including comfields
        
        full = full[subfield]
        
        full = full.drop_duplicates()
        
    def get_date(created): # getting date in datetime from created_utc
        return dt.date.fromtimestamp(created)
    
    _timestamp = full["created_utc"].apply(get_date) # changing created_utc to date
    
    full['timestamp'] = _timestamp # setting new timestamp as field in df

    print(full.shape) #prints shape of final df at end of scrape
    
    return full 
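
To see which requests the loop issues without touching the network, the URL construction can be pulled out on its own. This is only a sketch and not part of the original notebook: `build_urls` is a hypothetical helper name, and its `size` parameter simply exposes the `size=500` that is hard-coded in the function above. It mirrors the `range(1, times)` loop, so the default `times=26` produces 25 URLs with `after` values from 15 to 375 days.

```python
def build_urls(subreddit, kind="submission", times=26, skip=15, size=500):
    """Return the URLs that the scraping loop would request (hypothetical helper)."""
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size={}".format(
        kind, subreddit, size
    )
    # range(1, times) yields times - 1 values, matching the loop in query_pushshift
    return ["{}&after={}d".format(stem, skip * x) for x in range(1, times)]


urls = build_urls("AskFeminists")
print(len(urls))   # number of requests that would be made
print(urls[0])     # first time window
print(urls[-1])    # last time window
```

Printing the list before a long scrape is a cheap way to confirm that the time windows cover the intended period.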

##### AskFeminists Scrape

In [None]:
askfeminists = query_pushshift('AskFeminists') # pulling submissions

In [None]:
askfeminists.to_csv('./askfeminists121818') # saving to csv

In [None]:
askfeminists_com = query_pushshift('AskFeminists', kind='comment') # pulling comments

In [None]:
askfeminists_com.to_csv('./askfeministscom121818') # saving to csv

##### MensRights Scrape

In [None]:
mensrights = query_pushshift('MensRights') # pulling submissions

In [None]:
mensrights.to_csv('./mensrights121818') # saving to csv

In [None]:
mensrights_com = query_pushshift('MensRights', kind='comment') # pulling comments

In [None]:
mensrights_com.to_csv('./mensrightscom121818') # saving to csv

## Organizing Data

#### AskFeminists

In [3]:
askfeminists = pd.read_csv('./data/askfeminists121818')
askfeminists_com = pd.read_csv('./data/askfeministscom121818')
mensrights = pd.read_csv('./data/mensrights121818')
mensrights_com = pd.read_csv('./data/mensrightscom121818')

In [4]:
# for submissions
# combining title and selftext for new column
askfeminists['text'] = askfeminists['title'] + askfeminists['selftext'] 
# creating column for type = 'post'
askfeminists['type'] = 'post'
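
Because `+` on two string columns returns a missing value whenever either side is missing, any post without `selftext` ends up with a null `text` and is removed by the `dropna` call a few cells later. If keeping title-only posts were preferred, one option is to fill missing `selftext` with an empty string and join the two parts with a space. The sketch below uses a small toy DataFrame rather than the scraped data, and applying it to the real data would change the row counts reported in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: one post with a body, one without (assumed example rows)
toy = pd.DataFrame({
    "title": ["A question", "A link post"],
    "selftext": ["with a body", np.nan],
})

# Plain concatenation: the missing selftext makes the whole result missing
toy["text_plain"] = toy["title"] + toy["selftext"]

# Alternative: treat missing selftext as empty and separate the parts with a space
toy["text_filled"] = (toy["title"] + " " + toy["selftext"].fillna("")).str.strip()

print(toy[["text_plain", "text_filled"]])
```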

In [5]:
# for comments
# copying the comment body into a new 'text' column
askfeminists_com['text'] = askfeminists_com['body']
# creating column for type = 'comment'
askfeminists_com['type'] = 'comment'

In [6]:
# creating new df with just three columns
askfeminists_DF = askfeminists[['text', 'type', 'subreddit']].copy()

In [7]:
# checking for nulls
askfeminists_DF['text'].isnull().sum()

468

In [8]:
askfeminists_DF.dropna(inplace = True)

In [9]:
# checking shape
askfeminists_DF.shape

(4305, 3)

In [10]:
# creating new df with just three columns
askfeminists_com_DF = askfeminists_com[['text', 'type', 'subreddit']].copy()

In [12]:
askfeminists_com_DF.isnull().sum()

text         0
type         0
subreddit    0
dtype: int64

In [11]:
# checking shape
askfeminists_com_DF.shape

(25000, 3)

In [13]:
# Creating one DF for all askfeminists
askfeminists_all = pd.concat([askfeminists_DF, askfeminists_com_DF], axis=0, join='outer')

In [14]:
askfeminists_all.shape

(29305, 3)

#### MensRights

In [15]:
# for submissions
# combining title and selftext for new column
mensrights['text'] = mensrights['title'] + mensrights['selftext']
# creating new column type = 'post'
mensrights['type'] = 'post'

In [16]:
# for comments
# copying the comment body into a new 'text' column
mensrights_com['text'] = mensrights_com['body']
# creating new column type = 'comment'
mensrights_com['type'] = 'comment'

In [17]:
# creating df for submissions
mensrights_DF = mensrights[['text', 'type', 'subreddit']].copy()

In [18]:
# checking for nulls (picture posts)
mensrights_DF['text'].isnull().sum()

17736

In [19]:
mensrights_DF.dropna(inplace = True)

In [20]:
# checking shape, about 2000 more than askfeminists
mensrights_DF.shape

(6446, 3)

In [21]:
# creating df for comments
mensrights_com_DF = mensrights_com[['text', 'type', 'subreddit']].copy()

In [22]:
# checking nulls
mensrights_com_DF['text'].isnull().sum()

0

In [23]:
# checking shape, same as askfeminists
mensrights_com_DF.shape

(25000, 3)

In [24]:
# Creating one DF for all mensrights
mensrights_all = pd.concat([mensrights_DF, mensrights_com_DF], axis=0, join='outer')

In [25]:
mensrights_all.shape # closer to askfeminists shape

(31446, 3)

##### Saving to csv

In [26]:
mensrights_all.to_csv('./mensrights_all')

In [27]:
askfeminists_all.to_csv('./askfeminists_all')
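
One thing to keep in mind: `pd.concat` keeps each frame's original row labels, so the combined frames contain repeated index values, and `to_csv` writes the index as an unnamed first column by default. A minimal sketch of resetting the index on concatenation and omitting it on save is shown below. It uses toy data, and writing to an in-memory buffer instead of a file is purely for illustration.

```python
import io

import pandas as pd

# Toy stand-ins for the post and comment frames (assumed example rows)
posts = pd.DataFrame({"text": ["p1", "p2"], "type": "post", "subreddit": "AskFeminists"})
comments = pd.DataFrame({"text": ["c1", "c2", "c3"], "type": "comment", "subreddit": "AskFeminists"})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating labels
combined = pd.concat([posts, comments], axis=0, ignore_index=True)

# index=False keeps the row labels out of the saved file
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]

print(combined.shape)
print(header)
```

Saving without the index avoids an extra `Unnamed: 0` column when the file is read back with `pd.read_csv`.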