# Reddit Classification Project

## L. Minter
### November 2021

### Notebook 01: Data Collection 
This is the first notebook of the analysis.  It is designed to collect the necessary data using the Pushshift API.  

#### Problem statement
Identify common and disparate themes in Reddit posts from the Portland and Seattle subreddits that could be useful for marketing campaigns across the Pacific Northwest.

In [1]:
#imports
import requests
import pandas as pd
import time

import os

In [2]:
base_url = 'https://api.pushshift.io/reddit/search/submission'

### Start with a quick look at a single request for r/Seattle

In [3]:
params_Seattle= {
    'subreddit':'Seattle',
    'metadata':'True',
    'size':100,
    'before':1635439118
}
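
The `before` value above is a Unix epoch timestamp in seconds, which is hard to read at a glance. A minimal sketch for turning it into a readable UTC date is shown here; the helper name `epoch_to_utc` is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds: int) -> datetime:
    """Convert a Unix epoch timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# The cutoff used in params_Seattle
cutoff = epoch_to_utc(1635439118)
print(cutoff.isoformat())  # 2021-10-28T16:38:38+00:00
```

So the request asks for the 100 most recent submissions posted before the afternoon of October 28, 2021 (UTC).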

In [6]:
res = requests.get( url = base_url, params = params_Seattle)
res.status_code

200

In [7]:
payload = res.json() #parse the response body once
posts = payload['data']
metadata = payload['metadata']

In [8]:
#quick look at the data coming back
seattle = pd.DataFrame(posts)
seattle.head()

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | url_overridden_by_dest | link_flair_css_class | link_flair_text | author_flair_background_color | author_flair_text_color | link_flair_template_id | author_flair_template_id | is_gallery | gallery_data | media_metadata |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | galumphix | | [] | | text | t2_s4xtg | False | False | ... | | | | | | | | | | |
| 1 | [] | False | Automatic_Man52 | | [] | | text | t2_cqpcdsow | False | False | ... | https://datastudio.google.com/reporting/314b52... | | | | | | | | | |
| 2 | [] | False | phillipsn21 | | [] | | text | t2_59m8pcz0 | False | False | ... | | moving | Moving / Visiting | | | | | | | |
| 3 | [] | False | wsdot | flair verified | [{'e': 'text', 't': 'WA State Dept of Transpor... | WA State Dept of Transportation | richtext | t2_4upd9 | False | False | ... | | | | | dark | | | | | |
| 4 | [] | False | gharrity | | [] | | text | t2_6lx0n | False | False | ... | https://www.seattletimes.com/seattle-news/poli... | flair | Politics | | | d03c04ee-412a-11e8-88cf-0e80e220ed5c | | | | |


We now have 100 posts from `r/Seattle`. Next, we can write a function that repeats this request for a specific subreddit, stepping further back in time with each call.
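
The idea of repeating the request can be sketched independently of the API: each page's oldest `created_utc` becomes the `before` cutoff for the next request. The sketch below uses a made-up `fake_fetch` function in place of a real HTTP call, and it assumes that `before` is exclusive (only strictly older posts come back). The names `paginate_before` and `fake_fetch` are hypothetical.

```python
from typing import Callable, Iterator, List


def paginate_before(fetch_page: Callable[[int], List[dict]],
                    start_utc: int,
                    max_pages: int) -> Iterator[List[dict]]:
    """Yield pages of posts, moving the 'before' cutoff back after each page."""
    before = start_utc
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:
            break  # nothing older is left
        yield page
        # the oldest post on this page is the cutoff for the next request
        before = min(post['created_utc'] for post in page)


def fake_fetch(before: int, size: int = 3) -> List[dict]:
    """Stand-in for the API: ten fake posts, newest first, strictly older than 'before'."""
    all_posts = [{'id': i, 'created_utc': 1000 + i} for i in range(10)]
    older = [p for p in all_posts if p['created_utc'] < before]
    return sorted(older, key=lambda p: p['created_utc'], reverse=True)[:size]


pages = list(paginate_before(fake_fetch, start_utc=1010, max_pages=10))
print([len(page) for page in pages])  # [3, 3, 3, 1]
```

Stopping on an empty page means the loop ends by itself once the subreddit's history is exhausted, rather than always running a fixed number of requests.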

### Getting large numbers of posts at a time

In [150]:
# function to get a bunch of posts from the subreddit we want
# input: subreddit_name
# output: a bunch of saved files with up to 100 posts each, labeled by subreddit name and UTC
def get_all_posts(subreddit_name):
    base_url = 'https://api.pushshift.io/reddit/search/submission'
    starting_utc = 1635439118
    out_dir = f'./data/{subreddit_name}'
    os.makedirs(out_dir, exist_ok=True) #make sure the output folder exists
    
    params = {
        'subreddit':subreddit_name,
        'size':100,
        'before':starting_utc
        }
    
    #one initial request plus 1500 more; every page is saved to the same folder
    for i in range(1501):
        if i%10 == 0: print('processing', i) #print a little update so we know how it is going
        res = requests.get( url = base_url, params = params)
        if res.status_code==200:
            posts = res.json()['data']
            if not posts: break #no older posts left, so stop early
            df = pd.DataFrame(posts)
            utc = params['before'] #UTC we will use to label file
            
            #write the 100 posts data frame to file, no index
            df.to_csv(f'{out_dir}/posts_{subreddit_name}_{utc}.csv',index = False)
            params['before']=df['created_utc'].min() #update UTC cutoff using data we already have
    
        time.sleep(2) #pause between requests
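
The loop above retries a failed request only after a fixed two-second pause. One possible refinement, shown here as a sketch rather than as part of the original notebook, is to retry with increasing waits. The helper `get_with_retries` and the `FakeResponse` stand-in are hypothetical; the demo records the waits instead of sleeping, so it runs without network access.

```python
import time
from typing import Callable, List, Optional


def get_with_retries(send: Callable[[], object],
                     max_attempts: int = 4,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[object]:
    """Call send() until it returns a response whose status_code is 200.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between failed attempts.
    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        response = send()
        if response.status_code == 200:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# Demo with a stand-in response object, so no network access is needed.
class FakeResponse:
    def __init__(self, status_code: int):
        self.status_code = status_code


statuses = iter([429, 500, 200])  # two failures, then success
delays: List[float] = []          # record the waits instead of sleeping

result = get_with_retries(lambda: FakeResponse(next(statuses)), sleep=delays.append)
print(result.status_code, delays)  # 200 [1.0, 2.0]
```

In the notebook, `send` could be something like `lambda: requests.get(url=base_url, params=params)`, with the default `time.sleep` left in place.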
    

In [151]:
get_all_posts('SeattleWA')

processing 0
processing 10
processing 20
...
processing 650
processing 660
[output truncated]

In [152]:
get_all_posts('Seattle')

processing 0
processing 10
processing 20
...
processing 650
processing 660
[output truncated]

In [None]:
get_all_posts('Portland')

### Combine the files into single file for each subreddit

In [4]:
def combine_files(subreddit_name):
    #relative path (same ./data folder the posts were saved to), so no os.chdir or full path is needed
    data_dir = os.path.join('.', 'data', subreddit_name)
    filenames = os.listdir(data_dir) #get list of files
    filenames = [file for file in filenames if subreddit_name in file] #limit to subreddit
    dataframes = [pd.read_csv(os.path.join(data_dir, file)) for file in filenames]

    return pd.concat(dataframes)
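
As a sketch of one way to find the saved files without changing the working directory, `pathlib` can build the folder path and match the `posts_{subreddit_name}_{utc}.csv` naming used when the files were written. The helper `list_post_files` is hypothetical, and the demo builds a throwaway folder with made-up files rather than touching the real data.

```python
import tempfile
from pathlib import Path
from typing import List


def list_post_files(data_root, subreddit_name: str) -> List[Path]:
    """Return the saved post files for one subreddit, sorted by file name."""
    folder = Path(data_root) / subreddit_name
    return sorted(folder.glob(f'posts_{subreddit_name}_*.csv'))


# Demo in a temporary folder that mimics the ./data/<subreddit>/ layout
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / 'Seattle'
    folder.mkdir()
    for utc in (1635439118, 1635400000):
        (folder / f'posts_Seattle_{utc}.csv').write_text('created_utc\n')
    (folder / 'notes.txt').write_text('not a data file')
    found = [p.name for p in list_post_files(tmp, 'Seattle')]

print(found)  # ['posts_Seattle_1635400000.csv', 'posts_Seattle_1635439118.csv']
```

Each returned path can be passed straight to `pd.read_csv`, and the glob pattern skips any stray files in the folder.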
    

In [15]:
seattlewa = combine_files('SeattleWA')

In [16]:
seattle = combine_files('Seattle')

In [None]:
portland = combine_files('Portland')

In [7]:
len(seattlewa),len(seattle)

(59524, 61181)

### Next steps
Now that we are done getting the data we can move on to data cleaning using the next [notebook](./02_data_cleaning.ipynb).