# Table of Contents
- [Importing Libraries](#import)
- [Scraping Posts from Reddit](#scrape)
- [Cleaning Data to Title + Text and Subreddit Only](#clean)
- [Saving and Exporting CSV for EDA](#save)

I keep the scraping in its own notebook so that restarting kernels in the other notebooks does NOT re-pull the data.

###  Importing Libraries <a id="import"></a>

In [1]:
# Import Libraries

import requests # HTTP library used to request data from Reddit
import time # Used to add a pause between requests in the loop

import pandas as pd

###  Scraping Posts from Reddit <a id="scrape"></a>

In [2]:
# Set up source to scrape from

url_base_ww = 'https://www.reddit.com/r/thewestwing.json' # API Endpoint 1
url_base_nr = 'https://www.reddit.com/r/Thenewsroom.json' # API Endpoint 2

In [42]:
# Create a function with a for loop that requests posts from the subreddit
# until either 1000 posts have been pulled or the entire subreddit has been pulled

def pull_posts(url_base):
    
    # Setting user agent to enable pulling from Reddit
    user_agent = {'User-agent': 'mags'}
    
    # Instantiating 'after' string that can be concatenated to url after the first pull request
    after = None
    
    # Creating empty posts list
    posts = []
    
    # 1000 posts at 25 posts per request means at most 40 requests
    for pull_req in range(1000 // 25):
        # Setting url to pull from based on whether it is the first pull (if)
        # or a subsequent pull (else)
        if after is None:
            url = url_base
        else:
            url = f"{url_base}?after={after}"
        
        # Making request
        res = requests.get(url, headers = user_agent)
        
        # Checking that the request succeeded before
        # adding the returned posts to the posts list.
        if res.status_code == 200:
            json_data = res.json() 
            posts.extend(json_data['data']['children'])
            # need to change after string such that next pull pulls the next set of posts
            after = json_data['data']['after']
            
        else:
            print(f'There was an error : {res.status_code}.')
            break
        
        # No 'after' value means there are no more posts to pull
        if after is None:
            break
        
        print(f'Pulled {len(posts)} posts so far...')
        time.sleep(5)
    
    return posts
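
The loop above follows Reddit's `after` cursor from one page of results to the next. As a minimal sketch of that pattern on its own, the version below takes the page-fetching step as a parameter so the cursor logic can be tried without touching the network. `paginate`, `fetch_page`, and the `fake_pages` data are hypothetical names made up for illustration; they are not part of this notebook.

```python
def paginate(fetch_page, max_pages=40):
    """Collect listing children by following the 'after' cursor.

    fetch_page is a hypothetical callable: it takes the current cursor
    (None for the first page) and returns a parsed listing dict shaped
    like {'data': {'children': [...], 'after': <str or None>}}.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(page['data']['children'])
        after = page['data']['after']
        # A missing cursor means the listing is exhausted
        if after is None:
            break
    return posts


# Made-up pages standing in for real responses (assumption for illustration)
fake_pages = {
    None: {'data': {'children': ['p1', 'p2'], 'after': 't3_a'}},
    't3_a': {'data': {'children': ['p3', 'p4'], 'after': 't3_b'}},
    't3_b': {'data': {'children': ['p5'], 'after': None}},
}

all_posts = paginate(lambda after: fake_pages[after])
print(all_posts)
```

Separating the cursor logic from the request makes it possible to check the stopping conditions (page cap reached, or no `after` value returned) against known data before running the real, slow pull.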

In [43]:
posts_ww = pull_posts(url_base_ww)
ww_raw_df = pd.DataFrame(posts_ww)
ww_raw_df.to_csv('./raw_ww.csv')

Pulled 26 posts so far...
Pulled 51 posts so far...
Pulled 76 posts so far...
Pulled 101 posts so far...
Pulled 126 posts so far...
Pulled 151 posts so far...
Pulled 176 posts so far...
Pulled 201 posts so far...
Pulled 226 posts so far...
Pulled 251 posts so far...
Pulled 276 posts so far...
Pulled 301 posts so far...
Pulled 326 posts so far...
Pulled 351 posts so far...
Pulled 376 posts so far...
Pulled 401 posts so far...
Pulled 426 posts so far...
Pulled 451 posts so far...
Pulled 476 posts so far...
Pulled 501 posts so far...
Pulled 526 posts so far...
Pulled 551 posts so far...
Pulled 576 posts so far...
Pulled 601 posts so far...
Pulled 626 posts so far...
Pulled 651 posts so far...
Pulled 676 posts so far...
Pulled 701 posts so far...
Pulled 726 posts so far...
Pulled 751 posts so far...
Pulled 776 posts so far...
Pulled 801 posts so far...
Pulled 826 posts so far...
Pulled 851 posts so far...
Pulled 876 posts so far...
Pulled 901 posts so far...
Pulled 926 posts so far...
Pull

In [15]:
posts_nr = pull_posts(url_base_nr)
nr_raw_df = pd.DataFrame(posts_nr)
nr_raw_df.to_csv('./raw_nr.csv')

Pulled 25 posts so far...
Pulled 50 posts so far...
Pulled 75 posts so far...
Pulled 100 posts so far...
Pulled 125 posts so far...
Pulled 150 posts so far...
Pulled 175 posts so far...
Pulled 200 posts so far...
Pulled 225 posts so far...
Pulled 250 posts so far...
Pulled 275 posts so far...
Pulled 300 posts so far...
Pulled 325 posts so far...
Pulled 350 posts so far...
Pulled 375 posts so far...
Pulled 400 posts so far...
Pulled 425 posts so far...
Pulled 450 posts so far...
Pulled 475 posts so far...
Pulled 500 posts so far...
Pulled 525 posts so far...
Pulled 550 posts so far...
Pulled 575 posts so far...
Pulled 600 posts so far...
Pulled 625 posts so far...
Pulled 650 posts so far...
Pulled 675 posts so far...
Pulled 700 posts so far...
Pulled 725 posts so far...
Pulled 750 posts so far...
Pulled 775 posts so far...
Pulled 800 posts so far...
Pulled 825 posts so far...
Pulled 850 posts so far...
Pulled 875 posts so far...
Pulled 900 posts so far...
Pulled 925 posts so far...
Pull
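
Each row of the raw DataFrames holds a nested dictionary, which `to_csv` writes out as a plain string. If the raw pull ever needs to be reloaded with its structure intact, one option is to dump the list of posts as JSON instead. This is only a sketch: `save_raw_posts`, `load_raw_posts`, the sample post, and the temporary file path are hypothetical and not part of this notebook.

```python
import json
import os
import tempfile


def save_raw_posts(posts, path):
    # Write the list of post dictionaries as JSON so the nesting survives
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(posts, f)


def load_raw_posts(path):
    # Read the list back as real dictionaries rather than strings
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


# Made-up post in the same kind/data shape as a listing child
sample_posts = [{'kind': 't3', 'data': {'title': 'Hello', 'selftext': ''}}]

path = os.path.join(tempfile.mkdtemp(), 'raw_sample.json')
save_raw_posts(sample_posts, path)
reloaded = load_raw_posts(path)
print(reloaded[0]['data']['title'])
```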

###  Cleaning Data to Title + Text and Subreddit Only <a id="clean"></a> 

In [32]:
# Creating a function to take the posts and format them such that 
# the title and text can be used for NLP.

# Concatenating title and text because some posts do not have any text,
# and some posts have short titles
def clean_posts(posts):
    # Creating empty list to hold one single-entry dictionary per post
    clean_text = []
    
    # Looping through each post to create dictionary
    for data in posts:
        post = data['data']
        
        # Joining the title and the body text into one string
        temp_text = post['title'] + " " + post['selftext']
        
        # Keying on the subreddit name, so the resulting DataFrame has a
        # single column named after the subreddit
        clean_text.append({post['subreddit']: temp_text})
    
    return pd.DataFrame(clean_text)
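
To make the concatenation step concrete, here is a small sketch that applies the same title-plus-text join to two made-up posts. `combine_title_and_text` and the sample posts are hypothetical. The `.get` fallback and the `strip()` call are additions assumed here for robustness; the function above does neither.

```python
def combine_title_and_text(post):
    """Join a post's title and body text into one string."""
    data = post['data']
    # Fall back to an empty string if a field is absent or None (assumption)
    title = data.get('title') or ''
    body = data.get('selftext') or ''
    # strip() removes the trailing space left when there is no body text
    return (title + ' ' + body).strip()


# Made-up posts in the same nested shape as the scraped data
sample_posts = [
    {'data': {'subreddit': 'thewestwing', 'title': 'Favorite episode?',
              'selftext': 'Mine is Two Cathedrals.'}},
    {'data': {'subreddit': 'thewestwing', 'title': 'Rewatch thread',
              'selftext': ''}},
]

texts = [combine_title_and_text(p) for p in sample_posts]
print(texts)
```

The second sample post has no body text, so its combined string is just the title. This is the case the concatenation is meant to cover.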

In [33]:
text_ww = clean_posts(posts_ww)
text_ww.to_csv('./titletext_ww.csv')
text_ww.columns = ['text']

In [39]:
text_nr = clean_posts(posts_nr)
text_nr.to_csv('./titletext_nr.csv')
text_nr.columns = ['text']

### Saving and Exporting CSV for EDA  <a id="save"></a> 

In [40]:
# Y = 1 for West Wing, Y = 0 for Newsroom
text_nr['subreddit'] = 0
text_ww['subreddit'] = 1

# Combining Newsroom and West Wing data frames 
text_df = pd.concat([text_nr, text_ww])
text_df['class'] = text_df['subreddit']
text_df.drop('subreddit', axis=1, inplace=True)

text_df.to_csv('./text_df.csv')
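
Because both frames keep their own 0-based index, the combined frame has repeated index values, and `to_csv` writes that index out as an unnamed first column. The sketch below shows one way to avoid this on made-up rows: `ignore_index=True` renumbers the rows, and `index=False` leaves the index out of the file. The sample frames and the in-memory buffer are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up stand-ins for text_nr and text_ww (assumption for illustration)
sample_nr = pd.DataFrame({'text': ['newsroom post'], 'class': [0]})
sample_ww = pd.DataFrame({'text': ['west wing post'], 'class': [1]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0
combined = pd.concat([sample_nr, sample_ww], ignore_index=True)

# index=False keeps the index out of the CSV, so no unnamed column appears
buffer = io.StringIO()
combined.to_csv(buffer, index=False)

# Reading the text back gives the same two columns
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```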