# Data Collection
I wrote a series of functions that measure how much data has been collected and fetch posts and comments from before and after Sanders declared his candidacy for the 2020 campaign. The results are held in pandas DataFrames and saved to CSV files.

In [8]:
import requests
import pandas as pd
import datetime
from time import sleep, mktime
from pathlib import Path

## Project variables
This project builds and compares NLP models that detect whether a post from the [/r/SandersforPresident](http://reddit.com/r/sandersforpresident) subreddit was written during the 2016 or the 2020 campaign. Using the [PushShift](http://pushshift.io) API, I pull posts from each time period, starting on the day he declared his candidacy. Both comments and submissions are evaluated, so each type needs its own set of metadata and content fields.

In [16]:
subreddit = 'SandersforPresident'
max_data = 500_000  # target number of rows per dataset

# Columns to keep for each data type
fields_dict = {
    'comment': ['id', 'parent_id', 'created_utc', 'body', 'author', 'score'],
    'submission': ['id', 'created_utc', 'title', 'selftext', 'author',
                   'score', 'num_comments', 'stickied'],
}


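years = ('2020', '2016')
d_types = ('comment', 'submission')

# Unix timestamps (seconds) at which fetching starts for each campaign
last_retrieved = {'2016': 1430352060,  # 2015-04-30 UTC
                  '2020': 1550563200,  # 2019-02-19 UTC
                  }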
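
The two values in `last_retrieved` are Unix epoch timestamps in seconds. As a quick sanity check, they can be converted to readable UTC dates with the standard library. This is only an illustrative sketch: `start_times` and `to_utc` are hypothetical names, not part of the notebook.

```python
from datetime import datetime, timezone

# Stand-in for the notebook's `last_retrieved` dictionary (same values).
start_times = {'2016': 1430352060, '2020': 1550563200}

def to_utc(epoch_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

for campaign, epoch in start_times.items():
    print(campaign, to_utc(epoch).isoformat())
# 2016 2015-04-30T00:01:00+00:00
# 2020 2019-02-19T08:00:00+00:00
```

Passing `tz=timezone.utc` keeps the result independent of the machine's local time zone.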

## Fetching Functions
I defined four functions to help with this task.

### `get_len`
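This function returns the number of rows stored for one subreddit, data type, and year. If the newest row already reaches the end of the collection window (today for 2020, the timestamp `1478692800` in November 2016 for the 2016 campaign), it returns `max_data` instead so that fetching stops. If the file cannot be read, it returns 0.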

In [10]:
def get_len(subreddit, d_type, year):

    # Try to read in the dataset for this subreddit/d_type/year combo
    # and get its length. If that fails, return 0 because we have no data!

    try:
        df = pd.read_csv(f'datasets/{subreddit}_{d_type}_{year}.csv')
        df.dropna(axis=0, subset=['created_utc'], inplace=True)
        df['created_utc'] = [datetime.datetime.fromtimestamp(int(date)) for date in df['created_utc']]

        # Once the newest row reaches the end of the collection window,
        # report max_data so that run() stops fetching.
        if year == '2020':
            if df['created_utc'].max() >= datetime.datetime.today():
                return max_data
            return df.shape[0]
        elif year == '2016':
            if df['created_utc'].max() >= datetime.datetime.fromtimestamp(1478692800):
                return max_data
            return df.shape[0]

    except Exception:
        return 0
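
The cutoff check can also be done on the raw epoch seconds, which avoids the local-time conversion that `datetime.fromtimestamp` performs when no time zone is given. The following is a minimal sketch on an in-memory DataFrame; `capped_len` and its parameters are hypothetical and not part of the notebook.

```python
import pandas as pd

def capped_len(df, cutoff_epoch, cap):
    """Return `cap` once the newest row reaches the cutoff, else the row count.

    Hypothetical helper for illustration; rows without a timestamp are ignored.
    """
    stamps = pd.to_numeric(df['created_utc'], errors='coerce').dropna()
    if stamps.empty:
        return 0
    if stamps.max() >= cutoff_epoch:
        return cap
    return len(stamps)

sample = pd.DataFrame({'created_utc': [1430352060, 1430355000, None]})
print(capped_len(sample, cutoff_epoch=1478692800, cap=500_000))  # 2
```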
    

### `pre_run`
This function prepares a retrieval run. If a data file already exists, it reads `created_utc.max()` from it and stores that value in `last_retrieved`, so fetching resumes where the previous run stopped. If not, the start time stays at the default in `last_retrieved` (the date Sanders declared his candidacy) and the data file is created.

In [11]:
def pre_run(subreddit, d_type, year):
    # 'last_retrieved' is key to fetching the data sequentially.
    path = Path(f'datasets/{subreddit}_{d_type}_{year}.csv')

    # Try to read the most recent timestamp from the dataset and write
    # it to the `last_retrieved` dictionary.
    try:
        df = pd.read_csv(path)
        last_retrieved[year] = df.created_utc.max()

    # If that doesn't work, keep the default in `last_retrieved`: the date
    # Sanders declared his candidacy for that campaign.
    except Exception:
        pass

    # Create the folder and the data file to be written to if they
    # do not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()
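
`Path.touch()` raises `FileNotFoundError` when the parent folder is missing, so creating the folder first makes a first run more robust. Below is a small standard-library sketch of that step; `ensure_data_file` is a hypothetical helper, and the file name is only an example.

```python
import tempfile
from pathlib import Path

def ensure_data_file(path):
    """Create the parent folder and an empty file if they are missing.

    Hypothetical helper for illustration; returns True if the file was created.
    """
    path = Path(path)
    created = not path.exists()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch(exist_ok=True)
    return created

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'datasets' / 'example_submission_2016.csv'
    print(ensure_data_file(target))  # True: folder and file were created
    print(ensure_data_file(target))  # False: the file already existed
```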
    

### `run`
This function fetches data by calling `get_data` in a loop until the row count reported by `get_len` reaches `max_data`, waiting 65 seconds between requests.

In [12]:
def run(subreddit, d_type, fields, year):
    
    data_len = get_len(subreddit,d_type,year)
    print(f'Fetching {max_data} rows of data for:\n- Subreddit: {subreddit}\n- d_type: {d_type}\n- year: {year}\n')    
    
    if data_len < max_data:
        print(f'- There are only {data_len} rows of data.')
    
    while data_len < max_data:
        get_data(subreddit,
            d_type = d_type,
            fields = fields,
            year = year,
             )
        
        data_len = get_len(subreddit,d_type,year)
        
        if data_len < max_data:
            print(f'- There are now {data_len} rows of data.')
            print('Waiting to fetch more data ', end='')

            for _ in range(65):
                print('.', end='')
                sleep(1)
            print(' Done!')
        
    print(f'- There are {data_len} rows of data.\n')  
    
    print(f'Data fetching complete for:\n- Subreddit: {subreddit}\n- d_type: {d_type}\n- Year: {year}\n\n')
    print(f'- Last fetched date was {last_retrieved[year]}')
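
The fetch-and-wait pattern in `run` can be sketched independently of the API by passing the fetching and counting steps in as functions. This is an illustration only: `fetch_until`, `pause`, and `max_rounds` are hypothetical names, and the cap on rounds is an added safeguard that the notebook's loop does not have.

```python
from time import sleep

def fetch_until(target, fetch_batch, count_rows, pause=65, max_rounds=None):
    """Call `fetch_batch` until `count_rows()` reaches `target`.

    Hypothetical helper: `max_rounds` caps the number of requests so the
    loop cannot spin forever if no new rows arrive.
    """
    rounds = 0
    while count_rows() < target:
        if max_rounds is not None and rounds >= max_rounds:
            break
        fetch_batch()
        rounds += 1
        if count_rows() < target:
            sleep(pause)  # wait between requests
    return rounds

# Usage with an in-memory stand-in for the CSV file and the API call.
rows = []
rounds = fetch_until(
    target=1200,
    fetch_batch=lambda: rows.extend(range(500)),
    count_rows=lambda: len(rows),
    pause=0,
)
print(rounds, len(rows))  # 3 1500
```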
        

### `get_data`
This function uses `requests` and the PushShift.io API to fetch comments or submissions, up to 500 per request. It merges each batch into the existing CSV file for later processing.

In [13]:
def get_data(subreddit,
             d_type,
             year,
             **kwargs
             ):

    base_url = "https://api.pushshift.io/reddit/search/" + d_type + "/?"
    path = f'datasets/{subreddit}_{d_type}_{year}.csv'

    fields = kwargs.get('fields', None)

    # Ask for up to 500 items created after the last timestamp we have
    params = {
        "subreddit": subreddit,
        "size": 500,
        "after": int(last_retrieved[year]),
    }

    res = requests.get(base_url, params=params)
    if res.status_code != 200:
        print(f'Error Code: {res.status_code}')
        return

    data = res.json()['data']
    if not data:  # nothing newer than last_retrieved[year]
        print('No new data returned.')
        return

    df = pd.DataFrame(data)
    if fields:  # keep only the requested columns; missing ones become NaN
        df = df.reindex(columns=fields)
    last_retrieved[year] = df.created_utc.max()

    try:  # try to load in your existing data and then merge
        old_df = pd.read_csv(path)
        df = pd.concat([old_df, df])
    except Exception:
        pass  # no existing data yet, so this batch is the whole dataset

    # drop duplicates and incomplete rows, then save to csv for the future
    df = df[~df.duplicated(subset='id', keep='first')]
    df = df.dropna(axis=0, subset=['created_utc', 'score'])
    df.to_csv(path, index=False)

    newest = datetime.datetime.fromtimestamp(int(df.created_utc.max()))
    print(f'Date of last fetched data: {newest}')
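
The merge step in `get_data` (append the new batch, keep the first copy of each `id`, drop rows without a timestamp or score) can be tried on small in-memory frames. The sketch below assumes made-up sample rows; `merge_batches` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def merge_batches(old_df, new_df):
    """Append a new batch to stored rows, keeping the first copy of each id.

    Hypothetical helper for illustration; not part of the notebook.
    """
    merged = pd.concat([old_df, new_df], ignore_index=True)
    merged = merged.drop_duplicates(subset='id', keep='first')
    merged = merged.dropna(subset=['created_utc', 'score'])
    return merged.reset_index(drop=True)

# Made-up sample rows: 'b' is a duplicate and 'd' has no timestamp.
old = pd.DataFrame({'id': ['a', 'b'], 'created_utc': [100, 200], 'score': [1, 2]})
new = pd.DataFrame({'id': ['b', 'c', 'd'], 'created_utc': [200, 300, None], 'score': [2, 3, 4]})

merged = merge_batches(old, new)
print(merged['id'].tolist())              # ['a', 'b', 'c']
print(int(merged['created_utc'].max()))   # 300
```

The maximum `created_utc` of the merged data is the value a following request would pass as `after`.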

## Fetching
Below, we fetch the data. The printed output reports progress along the way.

In [17]:
d_type = 'submission'
# for year in years:
year = '2016'
pre_run(subreddit = subreddit,
        d_type=d_type,
        year = year)

run(subreddit,
    d_type=d_type,
    fields = fields_dict[d_type],
    year = year
)

Fetching 500000 rows of data for:
- Subreddit: SandersforPresident
- d_type: submission
- year: 2016

- There are 500000 rows of data.

Data fetching complete for:
- Subreddit: SandersforPresident
- d_type: submission
- Year: 2016


- Last fetched date was 1483400691
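
The cell above fetches a single combination (submissions for 2016), and the commented-out loop hints at iterating over the rest. One possible way to enumerate every dataset is sketched below; `plan_runs` is a hypothetical helper, and the loop only prints the plan rather than calling the API.

```python
from itertools import product

years = ('2020', '2016')
d_types = ('comment', 'submission')

def plan_runs(years, d_types):
    """List every (d_type, year) dataset to fetch. Hypothetical helper."""
    return [(d_type, year) for d_type, year in product(d_types, years)]

for d_type, year in plan_runs(years, d_types):
    # In the notebook, this is where pre_run(...) and run(...) would be called.
    print(f'{d_type} {year}')
```

Because `last_retrieved` is keyed by year only, comments and submissions for the same year share one starting timestamp, so it would likely need to be reset between data types.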


## From here move to [Data Cleaning & Vectorization](2-Data_Cleaning_Vectorization.ipynb)