# Project 3: Web APIs & NLP
## Notebook I

Author: Julie Vovchenko

---

### Table of Contents:
- [Scraping](#Scraping)
    -  [Defining Subreddits](#Defining-Subreddits)
    -  [Scraping Subreddits](#Scraping-Subreddits)
    -  [Saving Raw Data](#Saving-Raw-Data)

### Dataset

- [Scraped Raw Data from Both Subreddits](../data/both_scraped_subreddits.csv)


# Scraping

In [1]:
# Importing libraries
import pandas as pd
import datetime as dt
import time
import requests


## Defining Subreddits

In [2]:
# We are using two subreddits for the analysis:
# 'Parenting' and 'stepparents'
subreddit_1 = 'Parenting'
subreddit_2 = 'stepparents'

## Scraping Subreddits

In [5]:
# This function collects posts from a given subreddit through the
# Pushshift API (api.pushshift.io), which archives reddit.com data.

# query_pushshift takes the following parameters:
# subreddit - name of the subreddit from which we collect data
# Default parameters:
# kind: type of content to query, 'submission' or 'comment' (kind='submission')
# skip: size of each step back in time, in days (skip=30)
# times: how many API queries to run and collect (times=5)
# subfield: columns to keep for submissions
# comfields: columns intended for comments (not used in the function body)

def query_pushshift(subreddit, kind='submission', skip=30, times=5, 
                    subfield = ['title', 'selftext', 
                                'subreddit', 'created_utc', 
                                'author', 'num_comments', 
                                'score', 'is_self'],
                    comfields = ['body', 'score', 'created_utc']):
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    
    # list that stores the DataFrame returned by each query
    mylist = []
    
    # append the results of each query to the list above
    for x in range(1, times + 1):  # runs `times` times: x = 1, ..., times
        
        URL = "{}&after={}d".format(stem, skip * x)  # stem plus an offset of 30/60/90/... days
        print(URL)
        
        # request the data
        response = requests.get(URL)
        # check that the response was successful
        assert response.status_code == 200
        
        mine = response.json()['data']  # list of posts under the 'data' key
        df = pd.DataFrame.from_dict(mine)
        mylist.append(df)
        time.sleep(2)
        
    # combine the per-query DataFrames into a single DataFrame
    full = pd.concat(mylist, sort=False)
    
    
    if kind == "submission":
        full = full[subfield]
        # drop duplicates that might occur during scraping
        full = full.drop_duplicates() 
        # keep only self (text) posts
        full = full.loc[full['is_self'] == True] 
        
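    def get_date(created):
        # convert a Unix timestamp to a calendar date (in the local time zone)
        return dt.date.fromtimestamp(created)
    
    
    # add a human-readable date column derived from created_utc
    full['timestamp'] = full["created_utc"].apply(get_date)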
   
    print(full.shape)
    return full 
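
The URLs requested by the loop in `query_pushshift` can be previewed without touching the network. The sketch below rebuilds them with a hypothetical helper, `build_urls` (not part of the notebook), using the same stem and the same `skip * x` offsets. It only formats strings and makes no claim about how the Pushshift API responds.

```python
def build_urls(subreddit, kind="submission", skip=30, times=5, size=500):
    """Return the URLs that query_pushshift would request, in order.

    build_urls is a hypothetical helper for illustration; the `size`
    argument mirrors the hard-coded size=500 in the original stem.
    """
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size={}".format(
        kind, subreddit, size
    )
    # One URL per step back in time: skip, 2*skip, ..., times*skip days.
    return ["{}&after={}d".format(stem, skip * x) for x in range(1, times + 1)]


urls = build_urls("Parenting")
for url in urls:
    print(url)
```

With the defaults this yields five URLs whose `after` values are 30d, 60d, 90d, 120d and 150d, the same sequence the function prints when it runs.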

In [6]:
# Running query_pushshift to collect posts from
# subreddit_1, which is 'Parenting'
sub_1_query = query_pushshift(subreddit_1)

https://api.pushshift.io/reddit/search/submission/?subreddit=Parenting&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=Parenting&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=Parenting&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=Parenting&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=Parenting&size=500&after=150d
(2500, 9)


In [7]:
# checking the size of the dataframe scraped from
# subreddit_1, which is 'Parenting'
sub_1_query.shape

(2500, 9)

In [8]:
# Running query_pushshift to collect posts from
# subreddit_2, which is 'stepparents'
sub_2_query = query_pushshift(subreddit_2)

https://api.pushshift.io/reddit/search/submission/?subreddit=stepparents&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=stepparents&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=stepparents&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=stepparents&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=stepparents&size=500&after=150d
(2494, 9)


In [9]:
# checking the size of the dataframe scraped from
# subreddit_2, which is 'stepparents'
sub_2_query.shape

(2494, 9)

**Observation:**  
Since we collected about 2,500 posts from each subreddit, almost 5,000 posts in total, we believe this is enough data to make reliable predictions and to identify specific words that indicate which group posted them.
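
The balance between the two classes can be checked with a little arithmetic on the shapes reported above. The sketch below is an illustration only: the counts are copied from the outputs of the two queries, and the majority-class baseline (the accuracy obtained by always predicting the larger class) is a common reference point that this notebook does not itself compute.

```python
# Row counts copied from the shapes printed above.
counts = {"Parenting": 2500, "stepparents": 2494}

total = sum(counts.values())
# Share of posts belonging to each subreddit.
shares = {name: n / total for name, n in counts.items()}
# Accuracy of always predicting the larger class.
baseline = max(shares.values())

print(total)
print({name: round(share, 4) for name, share in shares.items()})
print(round(baseline, 4))
```

Because the two classes are almost perfectly balanced, the baseline sits very close to 50%, so any later model has to beat roughly a coin flip.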

In [10]:
# Concatenating both scraped dataframes into one large dataframe
both_dataframes = [sub_1_query, sub_2_query]

df = pd.concat(both_dataframes)
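
`pd.concat` keeps the original row labels of each input by default, so the combined frame can contain repeated index values. This does not affect the CSV saved at the end of the notebook, which is written without the index, but if unique labels are wanted, `ignore_index=True` renumbers the rows. The frames below are small stand-ins for `sub_1_query` and `sub_2_query`, not the scraped data.

```python
import pandas as pd

# Small stand-ins for the two scraped frames (illustrative data only).
part_1 = pd.DataFrame({"title": ["a", "b"], "subreddit": ["Parenting"] * 2})
part_2 = pd.DataFrame({"title": ["c", "d"], "subreddit": ["stepparents"] * 2})

# Default behaviour: row labels are carried over, so 0 and 1 appear twice.
kept = pd.concat([part_1, part_2])

# ignore_index=True discards the old labels and numbers rows 0..n-1.
renumbered = pd.concat([part_1, part_2], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```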

In [11]:
# viewing top rows of the raw dataframe
df.head(3)

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Two month old sleeping 8-9 hours at a time. | My daughter is two months old today. Normally ... | Parenting | 1577553731 | shiteinmemooth | 7 | 1 | True | 2019-12-28 |
| 1 | Living room takeovers | Son spends all his waking hours watching YouTu... | Parenting | 1577553829 | Rach_InOz | 11 | 1 | True | 2019-12-28 |
| 2 | I freaking hate the baby stage. | [removed] | Parenting | 1577553878 | HokieGirl07 | 1 | 1 | True | 2019-12-28 |
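
The `timestamp` column is produced by `dt.date.fromtimestamp`, which interprets `created_utc` in the local time zone of the machine running the notebook, so dates near midnight can differ between machines. A minimal sketch of a time-zone-independent alternative, assuming pandas' `to_datetime` with `unit="s"` and `utc=True`, follows. The small `posts` frame and the `created_dt` column name are illustrative; the frame reuses the three `created_utc` values from the rows above.

```python
import pandas as pd

# Illustrative frame: the created_utc values of the three rows shown above.
posts = pd.DataFrame({"created_utc": [1577553731, 1577553829, 1577553878]})

# Interpret the integers as seconds since the Unix epoch, in UTC.
posts["created_dt"] = pd.to_datetime(posts["created_utc"], unit="s", utc=True)

# Keep only the calendar date (UTC), comparable to the notebook's timestamp column.
posts["timestamp"] = posts["created_dt"].dt.date

print(posts)
```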


In [12]:
# checking the size of the whole dataframe with both subreddits
df.shape

(4994, 9)

## Saving Raw Data

In [14]:
# saving the raw data into one file;
# exploratory data analysis and modeling are done in the next notebooks
df.to_csv('../data/both_scraped_subreddits.csv', index=False)
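
A quick way to confirm the file was written as intended is to read it back and compare it with the frame in memory. The sketch below does this round trip on a toy frame in a temporary directory (the file name `sample.csv` and the data are placeholders). It also shows that writing without the index avoids an extra `Unnamed: 0` column on reload.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the scraped data (illustrative values only).
sample = pd.DataFrame(
    {"title": ["first post", "second post"], "subreddit": ["Parenting", "stepparents"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    # index=False leaves the row labels out of the file.
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(list(reloaded.columns))
```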