# Data Collector
This notebook collects data from my selected subreddits and saves the data as CSV files.

## Imports

In [8]:
# Import necessary libraries
import pandas as pd
import numpy as np
import datetime as dt
import time
import requests
import json

## Pushshift API

Collect some data to test the API.

In [9]:
# Pull data from the Pushshift API using requests.get()
# The URL requests 100 submissions from the UnresolvedMysteries subreddit posted before a specific epoch time
url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=unresolvedmysteries&before=1642778603&size=100'
res = requests.get(url)

# Turn the response into a DataFrame and see the first 5 rows
data = pd.DataFrame(res.json()['data'])
data.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,upvote_ratio,url,whitelist_status,wls,post_hint,preview,removed_by_category,author_flair_background_color,author_flair_text_color,banned_by
0,[],False,Ok-Development-5805,,[],,text,t2_9womdgl4,False,False,...,1.0,https://www.reddit.com/r/UnresolvedMysteries/c...,all_ads,6,,,,,,
1,[],False,Odd_Ad1962,,[],,text,t2_5zvncy2o,False,False,...,1.0,https://www.reddit.com/r/UnresolvedMysteries/c...,all_ads,6,self,"{'enabled': False, 'images': [{'id': 'hpInOvKp...",,,,
2,[],False,Starasolum,,[],,text,t2_agtepedt,False,False,...,1.0,https://www.reddit.com/r/UnresolvedMysteries/c...,all_ads,6,,,moderator,,,
3,[],False,SwissChocolate1024,,[],,text,t2_303cypb,False,False,...,1.0,https://www.reddit.com/r/UnresolvedMysteries/c...,all_ads,6,self,"{'enabled': False, 'images': [{'id': 'fzN5-DY-...",,,,
4,[],False,HoloGalaxy,,[],,text,t2_2jhn6qgk,False,False,...,1.0,https://www.reddit.com/r/UnresolvedMysteries/c...,all_ads,6,self,"{'enabled': False, 'images': [{'id': 'qioXaA1R...",,,,


Got data OK.
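
The test URL above packs three query parameters (`subreddit`, `before`, `size`) into one string. As a minimal sketch, the same URL can be assembled from a dictionary, and the epoch cutoff converted to a readable UTC timestamp. The names `BASE_URL` and `build_url` are hypothetical helpers introduced for illustration, not part of the Pushshift API or of this notebook.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Endpoint taken from the test request above
BASE_URL = 'https://api.pushshift.io/reddit/search/submission/'


def build_url(subreddit, before, size=100):
    """Return the submission-search URL for one page of results (hypothetical helper)."""
    params = {'subreddit': subreddit, 'before': before, 'size': size}
    return f'{BASE_URL}?{urlencode(params)}'


url = build_url('unresolvedmysteries', before=1642778603)

# Convert the epoch cutoff to a timezone-aware UTC datetime for readability
cutoff = datetime.fromtimestamp(1642778603, tz=timezone.utc)

print(url)
print(cutoff.isoformat())
```

Building the URL from a dictionary keeps each parameter in one place, which makes it easier to change the cutoff or page size between requests.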

## Feature selection

In [10]:
# List of features to include in data
features = [
    'author', 
    'created_utc',
    'selftext',
    'subreddit',
    'title',
    'total_awards_received',
]

## API request function
This function pulls data from the API and keeps the features listed in [Feature selection](#feature-selection).

Function improvements to-do:
- ~~Get initial current time automatically~~
- ~~Add DOCSTRING or similar~~
- ~~Keep function from adding removed posts to DataFrame~~
- ~~Add while loop so function runs until there are enough (500?) valid posts (not 'removed')~~

In [20]:
# This function is built on Chuck's Breakfast Hour example
def pushshift_query(subreddit='unresolvedmysteries', features=features):
    '''
    Generate a DataFrame of posts (submissions) to a particular subreddit,
    keeping only the specified features from each submission.

    Posts whose selftext is '[removed]' are skipped. Requests repeat, paging
    backwards in time, until at least 1000 posts have been collected.

    Args:
        subreddit (str): subreddit from which to retrieve posts
        features (list): a list of valid features to include in resulting DataFrame

    Returns:
        posts (DataFrame): DataFrame containing valid posts and features
    '''
    current_time = int(time.time())
    posts = []
    while len(posts) < 1000:
        url = f'https://api.pushshift.io/reddit/search/submission/?subreddit={subreddit}&before={current_time}&size=100'
        res = requests.get(url)
        print(res.status_code)
        for post in res.json()['data']:
            # Skip removed posts; .get() tolerates a missing 'selftext' key
            if post.get('selftext') == '[removed]':
                continue
            # Keep only the requested features, filling missing ones with NaN
            posts.append({feature: post.get(feature, np.nan) for feature in features})
            # Move the cursor back to the oldest kept post for the next request
            current_time = min(current_time, post.get('created_utc', current_time))
        print(f'Posts before {current_time}. Current data frame has {len(posts)} rows')
        # Pause between requests
        time.sleep(5)
    return pd.DataFrame(posts)
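
The filtering and paging logic in the function above can be checked without calling the API. The sketch below runs the same three steps (skip removed posts, keep only the requested features, find the oldest `created_utc` for the next `before` value) on hand-made posts. The helpers `keep_features` and `is_removed` and the sample data are hypothetical and exist only for this illustration.

```python
import numpy as np


def keep_features(post, features):
    """Return only the requested features; missing keys become NaN."""
    return {feature: post.get(feature, np.nan) for feature in features}


def is_removed(post):
    """True when the post body was replaced by the '[removed]' marker."""
    return post.get('selftext') == '[removed]'


# Hand-made stand-ins for entries of res.json()['data']
sample_posts = [
    {'author': 'a', 'created_utc': 300, 'selftext': 'text', 'title': 't1'},
    {'author': 'b', 'created_utc': 200, 'selftext': '[removed]', 'title': 't2'},
    {'author': 'c', 'created_utc': 100, 'title': 't3'},  # no 'selftext' key
]
wanted = ['author', 'created_utc', 'selftext']

# Drop removed posts, then trim each remaining post to the wanted features
kept = [keep_features(post, wanted) for post in sample_posts if not is_removed(post)]

# The oldest kept timestamp becomes the 'before' value of the next request
next_before = min(post['created_utc'] for post in kept)

print(len(kept), next_before)
```

The third sample post has no `selftext` key, so it is kept with a NaN body rather than dropped.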

## r/UnresolvedMysteries Data

Using the function defined in [API request function](#api-request-function), we collect data from the r/UnresolvedMysteries subreddit and save it to a CSV file.

In [22]:
# Collect data from r/UnresolvedMysteries
data_unre = pushshift_query(subreddit='unresolvedmysteries', features=features)

200
Posts before 1642544274. Current data frame has 42 rows
200
Posts before 1642116931. Current data frame has 87 rows
200
Posts before 1641812756. Current data frame has 132 rows
200
Posts before 1641185195. Current data frame has 180 rows
200
Posts before 1640634755. Current data frame has 219 rows
200
Posts before 1640020520. Current data frame has 269 rows
200
Posts before 1639426720. Current data frame has 315 rows
200
Posts before 1638905201. Current data frame has 371 rows
200
Posts before 1638397065. Current data frame has 421 rows
200
Posts before 1637950274. Current data frame has 467 rows
200
Posts before 1637413379. Current data frame has 512 rows
200
Posts before 1637049743. Current data frame has 565 rows
200
Posts before 1636574329. Current data frame has 609 rows
200
Posts before 1636181319. Current data frame has 654 rows
200
Posts before 1635731561. Current data frame has 703 rows
200
Posts before 1635264432. Current data frame has 752 rows
200
Posts before 163479140

In [23]:
# Check for duplicates
data_unre.duplicated().sum()

1
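
The check above counts rows that exactly repeat an earlier row. If the duplicate should be dropped before saving, one possible approach is `drop_duplicates`, sketched here on a small made-up DataFrame (the `sample` data is an assumption for illustration; the notebook itself does not remove the duplicate).

```python
import pandas as pd

# Made-up frame in which the last row repeats the second one
sample = pd.DataFrame({
    'author': ['a', 'b', 'b'],
    'created_utc': [300, 200, 200],
    'title': ['t1', 't2', 't2'],
})

# Same check as in the cell above
n_duplicates = sample.duplicated().sum()

# Keep the first occurrence of each row and renumber the index
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduped.shape)
```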

In [24]:
# Check for NaNs
data_unre.isna().sum()

author                    0
created_utc               0
selftext                 28
subreddit                 0
title                     0
total_awards_received     0
dtype: int64
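
Only `selftext` has missing values. Since the request function writes NaN for any feature absent from a post, these are presumably posts whose API record had no `selftext` field. If later text processing needs strings in every row, one option is to fill the gaps with empty strings, sketched below on made-up data (the `sample` frame is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing post body
sample = pd.DataFrame({
    'title': ['t1', 't2'],
    'selftext': ['body text', np.nan],
})

# Replace missing bodies with empty strings, leaving the original frame untouched
filled = sample.assign(selftext=sample['selftext'].fillna(''))

print(filled.isna().sum())
```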

In [25]:
# Verify the number of posts collected
data_unre.shape

(1005, 6)

### Save to CSV file

In [26]:
data_unre.to_csv('../data/unresolved.csv')
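
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which `pd.read_csv` then loads as a column called `Unnamed: 0`. The sketch below shows the round trip with and without the index, using an in-memory buffer and made-up data instead of the real file (the `sample` frame and variable names are assumptions for illustration).

```python
import io

import pandas as pd

sample = pd.DataFrame({'author': ['a', 'b'], 'title': ['t1', 't2']})

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```

Whether to pass `index=False` depends on how the CSV is read later: reading with `index_col=0` recovers the original index without the extra column.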

## r/UnsolvedMysteries Data

Using the function defined in [API request function](#api-request-function), we collect data from the r/UnsolvedMysteries subreddit and save it to a CSV file.

In [27]:
data_unsol = pushshift_query(subreddit='unsolvedmysteries', features=features)

200
Posts before 1641337025. Current data frame has 100 rows
200
Posts before 1638912603. Current data frame has 200 rows
200
Posts before 1637080787. Current data frame has 300 rows
200
Posts before 1635131003. Current data frame has 400 rows
200
Posts before 1633669779. Current data frame has 499 rows
200
Posts before 1632099408. Current data frame has 599 rows
200
Posts before 1630473431. Current data frame has 699 rows
200
Posts before 1628094607. Current data frame has 799 rows
200
Posts before 1625934082. Current data frame has 899 rows
200
Posts before 1623758550. Current data frame has 999 rows
200
Posts before 1622291579. Current data frame has 1099 rows


### Save to CSV file

In [28]:
data_unsol.to_csv('../data/unsolved.csv')

## Check data for 'removed' posts

In [29]:
data_unre[data_unre['selftext'] == '[removed]'].shape

(0, 6)

In [30]:
data_unsol[data_unsol['selftext'] == '[removed]'].shape

(0, 6)