# Reddit Scraping
This notebook uses PMAW to pull archived comments from Pushshift, which retrieves comments right after they are created. This way, we can capture comments before they risk being deleted.

Using this method, we can also get older data, for instance from 2021. **IMPORTANT**: we need to check that `created_utc` and `retrieved_utc` are close, to confirm that Pushshift was retrieving posts immediately after creation during the chosen epoch.

Pushshift API [here](https://reddit-api.readthedocs.io/en/latest/) <br/>
PRAW API [here](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html)<br/>

PMAW documentation [here](https://github.com/mattpodolak/pmaw)

___
### --- to do --- 
##### before we start scraping:
* ~~decide on the epoch~~
* ~~finalize dictionary of terms to use in the filter for PSAW~~
* ~~decide on how to keep track of all the scraping instances.~~
    * ~~For instance, do we want to use a naming convention so that every time the scraper makes a request and saves new DFs to Google Drive, each DF has a number in the name to reference the time or order in which it was scraped? <br/>Like: `comments_df_202203121400` or `comments_df_001`~~



* we will probably run into <mark>rate limits</mark> and will have to learn more about them and troubleshoot
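
One common way to cope with rate limits is to retry a failed request after an exponentially growing pause. The sketch below is generic and not tied to PMAW or PRAW, which both come with some rate-limit handling of their own. `with_backoff` and `flaky_request` are hypothetical names used only for illustration.

```python
import time

def with_backoff(request, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call request() until it succeeds, doubling the pause after each failure."""
    for attempt in range(max_tries):
        try:
            return request()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

# hypothetical request that fails twice before succeeding
calls = {'n': 0}
def flaky_request():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('rate limited')
    return 'ok'

delays = []  # record the pauses instead of actually sleeping
result = with_backoff(flaky_request, sleep=delays.append)
```

Passing `sleep=delays.append` only keeps the demo instant; in a real run the default `time.sleep` would apply the pauses.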


### --- happening now ---

* using pmaw in lieu of psaw because psaw maintainer mentioned pmaw is "much more actively maintained" [here](https://github.com/dmarx/psaw/issues/103)
    *  trying to see if it works better 
* starting to scrape the pushshift data first so classification can start
* scraping data from every hour of the day for jan2022. If that yields fewer than 100k comments, will include feb2022
    * update: retrieved 223 items from the first hour of 2022. We may get enough data scraping either only jan22 or jan22+feb22.<br/>
* pushshift data will be included in df named `comments_jan22` (or `comments_monyear` if expanding to other months, each month will have its own file for now)
* praw data will populate two other DFs, `submissions_jan22` and `user_jan22`
* data from pushshift also populates a reference dictionary that stores `comment_id`, `link_id` and `author` <br/>this is not split into months
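
The hour-by-hour plan above can be sketched as a generator of `(after, before)` epoch pairs. This is only an illustration: `hourly_windows` is a hypothetical helper, and it builds the windows in UTC, whereas a naive `datetime(...).timestamp()` call interprets the date in the machine's local timezone.

```python
from datetime import datetime, timedelta, timezone

def hourly_windows(year, month):
    """Yield (after, before) epoch-second pairs, one per hour of the month."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # the first day of the following month marks the end of the range
    end = datetime(year + month // 12, month % 12 + 1, 1, tzinfo=timezone.utc)
    while start < end:
        stop = start + timedelta(hours=1) - timedelta(seconds=1)
        yield int(start.timestamp()), int(stop.timestamp())
        start += timedelta(hours=1)

windows = list(hourly_windows(2022, 1))
```

Each pair spans hh:00:00 through hh:59:59, matching the window shape used in the test cell of this notebook.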
___

In [1]:
# pip install psaw
# pip install praw
# pip install pmaw

In [3]:
import reddit_auth
import praw
from datetime import datetime
import pandas as pd
from collections import Counter
from collections import defaultdict
import time

import joblib

In [4]:
reddit = praw.Reddit(
    client_id=reddit_auth.client_id,
    client_secret=reddit_auth.client_secret,
    user_agent=reddit_auth.user_agent, 
    password=reddit_auth.password,
    username=reddit_auth.username
)

In [5]:
import logging

handler = logging.StreamHandler()
handler.setLevel(logging.INFO)

logger = logging.getLogger('pmaw')
logger.setLevel(logging.INFO)
logger.addHandler(handler)

___
# Definitions

* Define features for PMAW
* Define search terms

In [6]:
# features to request from Pushshift (passed to pmaw as `filter`)
our_filter = ['author',
              'author_flair_type',
              'author_fullname',
              'author_premium',
              'body',
              'body_sha1',
              'controversiality',
              'created_utc',
              'distinguished',
              'gilded',
              'id',
              'is_submitter',
              'link_id', 
              'locked',
              'parent_id',
              'permalink',
              'retrieved_utc',
              'subreddit',
              'subreddit_id',
              'subreddit_name_prefixed',
              'subreddit_type'
             ]

In [7]:
# terms for our filter
hate_terms = pd.read_csv('hate_terms.csv')

our_terms = '|'.join(hate_terms.term)
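
The notebook relies on `|` separating alternatives in the query string, so stray whitespace, empty cells or duplicates in the CSV would end up in the query. Below is a minimal cleaning sketch; the terms are neutral placeholders standing in for the `term` column, and the cleaning steps are an assumption rather than part of the original pipeline.

```python
import pandas as pd

# placeholder terms standing in for the `term` column of hate_terms.csv
terms = pd.Series(['alpha', ' Beta ', 'alpha', None, ''])

clean = (terms.dropna()      # drop missing cells
              .str.strip()   # remove surrounding whitespace
              .str.lower())  # normalise case
clean = clean[clean != ''].drop_duplicates()

query = '|'.join(clean)
```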

___
# Creating master files

In [8]:
# Save reference files with unique identifiers for 'comments', 'submissions', 'users'

reference_ids = defaultdict(list)
joblib.dump(reference_ids, 'reference_ids.joblib', compress=3)

# starting with pushshift first (comments only)
# starting with jan22
comments_jan22 = pd.DataFrame()
joblib.dump(comments_jan22, 'comments_jan22.joblib', compress=3)

['comments_jan22.joblib']

In [9]:
submissions_jan22 = pd.DataFrame()
users_jan22 = pd.DataFrame()
joblib.dump(submissions_jan22, 'submissions_jan22.joblib', compress=3)
joblib.dump(users_jan22, 'users_jan22.joblib', compress=3)

['users_jan22.joblib']

___
# Defining functions

In [10]:
def save_unique_ids(df):
    '''
    - Loads reference dictionary file,
    - updates it with unique id for comments, submissions, and users, in given df and 
    - saves updated dictionary as joblib compressed file
    '''
    import joblib
    # load reference dictionary
    ref = joblib.load('reference_ids.joblib')
    
    # update dictionary
    ref['comment'].extend(df.id)
    ref['submission'].extend(df.link_id)
    ref['user'].extend(df.author)
    
    # save updated reference dictionary
    joblib.dump(ref, 'reference_ids.joblib', compress=3)
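
Because `extend` appends every value, a submission or author shows up once per comment in the reference dictionary. If the lists are meant to hold unique ids, a helper along these lines could be used instead; `merge_unique` is a hypothetical name, and the ids are made up.

```python
from collections import defaultdict

def merge_unique(ref, key, new_ids):
    """Append only ids not yet stored under ref[key], keeping first-seen order."""
    seen = set(ref[key])
    for item in new_ids:
        if item not in seen:
            ref[key].append(item)
            seen.add(item)
    return ref

ref = defaultdict(list)
merge_unique(ref, 'user', ['ann', 'bob', 'ann'])
merge_unique(ref, 'user', ['bob', 'cy'])
```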
    
    

In [11]:
def save_retrieved(df, endpoint, month, year):
    '''
    Loads the saved file for a specific endpoint and month, updates it with newly scraped data and 
    saves updated DF as joblib compressed file
    - df: df created with newly scraped data
    - endpoint: comments | submissions | users
    - month: 3-letter lowercase abbreviation, e.g. jan
    - year: last 2 digits, e.g. 22
    '''
    import joblib
    # load comments file
    ref_df = joblib.load(f'{endpoint}_{month}{year}.joblib')
    
    # update DF (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
    ref_df = pd.concat([ref_df, df])
    
    # save updated DF
    joblib.dump(ref_df, f'{endpoint}_{month}{year}.joblib', compress=3)

In [12]:
def get_pushshift(our_terms, before, after, our_filter):

    '''
    Input
    - our_terms: search terms joined by | into a single string
    - before: upper time limit. Format: int(datetime(yyyy,m,d,h,m,s).timestamp())
    - after: lower time limit, same format as before
    - our_filter: list of comment fields to retrieve
    Output
    - comments_df
    
    Uses pmaw to scrape historical data created between 'after' and 'before', 
    filtered by the search terms in our_terms
    '''
    
    from pmaw import PushshiftAPI
    api = PushshiftAPI()

    gen = api.search_comments(q=our_terms, 
                              after=after, 
                              before=before,
                              filter=our_filter, 
                              mem_safe=True,
                              safe_exit=True)    
    
    comments_df = pd.DataFrame(gen)

    comments_df['created_utc'] = pd.to_datetime(comments_df.created_utc, unit='s')
#     comments_df['created'] = pd.to_datetime(comments_df.created, unit='s')
    comments_df['retrieved_utc'] = pd.to_datetime(comments_df.retrieved_utc, unit='s')

    return comments_df

___
## --- TEST ---

In [13]:
reference_ids = defaultdict(list)
joblib.dump(reference_ids, 'reference_ids.joblib', compress=3)

comments_jan22 = pd.DataFrame()
joblib.dump(comments_jan22, 'comments_jan22.joblib', compress=3)

['comments_jan22.joblib']

In [14]:
ref = joblib.load('reference_ids.joblib')
comm = joblib.load('comments_jan22.joblib')

In [None]:
start = time.time()

day = tuple(range(1,2))
hour = tuple(range(1))

for d in day:
    for h in hour:
        after =  int(datetime(2022, 1, d, h, 0, 0).timestamp())
        before = int(datetime(2022, 1, d, h, 59, 59).timestamp())
        try:
            df = get_pushshift(our_terms, before, after, our_filter)
            save_unique_ids(df)
            save_retrieved(df, 'comments', 'jan', 22)
        except Exception as e:
            # report what went wrong instead of hiding it, then stop
            print(f'error: {e}')
            break
        print(df.shape,end='')

end = time.time()
print(' ')
print(end-start)

Response cache key: 88f2ab804ecf8149b8388cb2cc21e5a3
No previous requests to load
223 result(s) available in Pushshift


In [78]:
ref = joblib.load('reference_ids.joblib')
print(len(ref['comment']))

comm = joblib.load('comments_jan22.joblib')
comm.shape

___

Since comments are retrieved soon after being posted, their features are not up to date. To update them, we use praw.

# PRAW

#### Get updated information on each Comment
#### Get live data on each Submission (parent)
#### Get live data on each User

In [7]:
def get_livedata(df):
    '''
    Input
    - df: comments_df
    Output
    - comments_df
    - submissions_df
    - users_df
    
    Creates 2 new DFs:
    - submissions_df
    - users_df
    And updates df with current information
    
    Iterate over df to get identifying information and 
    use it to populate the 2 new DFs
    
    '''
    
    import praw
    import reddit_auth
    reddit = praw.Reddit(
        client_id=reddit_auth.client_id,
        client_secret=reddit_auth.client_secret,
        user_agent=reddit_auth.user_agent, 
        password=reddit_auth.password,
        username=reddit_auth.username
    )
    
    comments_df = df.copy()
    submissions_data = defaultdict(list)
    users_data = defaultdict(list)
    
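    for idx, row in df.iterrows():
        comment = reddit.comment(id=row.id)
        # refresh() is needed to load replies for a comment fetched by id
        comment.refresh()
        # write to this row only; assigning to the whole column would overwrite every row
        comments_df.loc[idx, 'score'] = comment.score
        comments_df.loc[idx, 'num_replies'] = len(list(comment.replies))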

        submission = reddit.submission(row.link_id[3:])
        # if line above doesn't work, use line below
        # submission = reddit.submission(reddit.comment(id=row.id).submission.id)
        
        submissions_data['id'].append(submission.id)
        submissions_data['name'].append(submission.name)
        submissions_data['title'].append(submission.title)
        submissions_data['num_comments'].append(submission.num_comments)
        submissions_data['author'].append(submission.author)
        submissions_data['created_utc'].append(datetime.fromtimestamp(submission.created_utc))
        submissions_data['distinguished'].append(submission.distinguished)
        submissions_data['over_18'].append(submission.over_18)
        submissions_data['permalink'].append(submission.permalink)
        submissions_data['score'].append(submission.score)
        submissions_data['selftext'].append(submission.selftext)
        submissions_data['upvote_ratio'].append(submission.upvote_ratio)
        submissions_data['url'].append(submission.url)

        user = reddit.redditor(row.author)
        
        users_data['comment_karma'].append(user.comment_karma)
        users_data['created_utc'].append(datetime.fromtimestamp(user.created_utc))
        users_data['has_verified_email'].append(user.has_verified_email)
        users_data['is_mod'].append(user.is_mod)
        users_data['is_gold'].append(user.is_gold)
        users_data['link_karma'].append(user.link_karma)
        users_data['subreddit'].append(user.subreddit)
    
    submissions_df = pd.DataFrame(submissions_data)
    users_df = pd.DataFrame(users_data)
    
    return comments_df, submissions_df, users_df
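
The slice `row.link_id[3:]` works because `link_id` is a Reddit fullname: a type prefix (`t3_` for submissions, `t1_` for comments) followed by the short id. A slightly more defensive way to split it is sketched below; `split_fullname` is a hypothetical helper and the id is a placeholder.

```python
def split_fullname(fullname):
    """Split a Reddit fullname such as 't3_abc123' into (kind, short id)."""
    kind, sep, short_id = fullname.partition('_')
    if not sep or not short_id:
        raise ValueError(f'not a fullname: {fullname!r}')
    return kind, short_id

kind, submission_id = split_fullname('t3_abc123')
```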
        

In [30]:
# reference_df = pd.DataFrame().to_csv('reference.csv')

In [170]:
# epoch: from 1/1/2022, 0h0min0sec through 1/31/2022 23h59min59sec

# day = tuple(range(1,32))
# hour = tuple(range(24))

# for d in day:
#     for h in hour:
#         after =  int(datetime(2022, 1, d, h, 0, 0).timestamp())
#         before = int(datetime(2022, 1, d, h, 59, 59).timestamp())
#         df = get_pushshift(our_terms, before, after, our_filter)

___
#### Check time lapse between created_utc and retrieved_utc

In [32]:
# we may only need to see that the max time lapse is small enough
max(comments.retrieved_utc - comments.created_utc)

Timedelta('0 days 00:00:15')
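
The maximum alone can be dominated by a single late retrieval, so it may help to also look at the share of comments retrieved within a cutoff. The sketch below uses made-up timestamps, and the one-minute threshold is an assumption to adjust.

```python
import pandas as pd

# made-up timestamps standing in for the scraped comments
comments = pd.DataFrame({
    'created_utc':   pd.to_datetime([1640995200, 1640995260, 1640995320], unit='s'),
    'retrieved_utc': pd.to_datetime([1640995205, 1640995275, 1640998920], unit='s'),
})

lag = comments.retrieved_utc - comments.created_utc
threshold = pd.Timedelta(minutes=1)     # assumed cutoff, adjust as needed

share_fast = (lag <= threshold).mean()  # fraction retrieved within the cutoff
late = comments[lag > threshold]        # rows to inspect or drop
```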

___

In [381]:
# if we decide to get all comments for some posts:
# below is an instance of CommentForest
# print(reddit.submission(id=reddit.comment(id=df.id[0]).submission.id).comments)

In [382]:
# # if we decide to get all the comments from a user:
# user.comments
# user_subreddits = []
# for comment in reddit.redditor(user.name).comments.new(limit=None):
#     user_subreddits.append(comment.subreddit)
# 
# Counter(user_subreddits)
# print(len(user_subreddits)) #999, I wonder if it's cutting it off at 999...