# Reddit Scraping
This notebook uses PSAW to fetch comments that Pushshift archived immediately after they were created. This lets us capture comments before they can be deleted.

Using this method, we can get older data, for instance from 2021. **IMPORTANT**: we need to check that `created_utc` and `retrieved_utc` are close, so we can be sure Pushshift was retrieving comments immediately after creation during the chosen epoch.

___
### --- TO DO --- 
#### before we start scraping:
* decide on the epoch
* finalize dictionary of terms to use in the filter for PSAW
* decide how to keep track of all the scraping instances. For instance, do we want a naming convention so that every time the scraper makes a request and saves new DFs to Google Drive, each DF has a number in its name that records the time or order in which it was scraped? For instance, `comments_df_202203121400` or `comments_df_001`
* we will probably run into rate limits and will have to learn more about them and troubleshoot
___
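
One way to settle the naming question above is to stamp each saved DataFrame with the time of the scrape, which matches the `comments_df_202203121400` pattern. The sketch below is only an illustration: `make_df_name` is a hypothetical helper, not part of PSAW or PRAW, and it assumes we want UTC timestamps with minute resolution.

```python
from datetime import datetime, timezone

def make_df_name(prefix, scraped_at=None):
    """Return a name like 'comments_df_202203121400' (hypothetical helper)."""
    # Default to the current UTC time so that names sort chronologically.
    if scraped_at is None:
        scraped_at = datetime.now(timezone.utc)
    return f"{prefix}_{scraped_at.strftime('%Y%m%d%H%M')}"

example = make_df_name("comments_df", datetime(2022, 3, 12, 14, 0))
print(example)  # comments_df_202203121400
```

Two scrapes within the same minute would collide under this scheme, so a sequence number (`comments_df_001`) may still be the safer choice if requests are fired in quick succession.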

In [269]:
# pip install psaw
# pip install praw

In [11]:
from psaw import PushshiftAPI
import reddit_auth
import praw
from datetime import datetime
import pandas as pd
from collections import Counter
from collections import defaultdict

In [12]:
import praw
import reddit_auth
reddit = praw.Reddit(
    client_id=reddit_auth.client_id,
    client_secret=reddit_auth.client_secret,
    user_agent=reddit_auth.user_agent, 
    password=reddit_auth.password,
    username=reddit_auth.username
)

# PSAW

### Get archive comments

In [18]:
def get_pushshift(our_filter, our_limit, before, after):

    '''
    Input
    - our_filter: string of search terms separated by |
    - our_limit: maximum number of comments to retrieve
    - before: upper time limit. Format: int(datetime(yyyy, m, d, h, min, s).timestamp())
    - after: lower time limit, same format as 'before'
    Output
    - comments_df
    
    Uses psaw to scrape historical comments created between 'after' and 'before', 
    filtered by the terms in our_filter
    
    '''
    
    from psaw import PushshiftAPI
    api = PushshiftAPI()

    gen = api.search_comments(q=our_filter, 
                              limit=our_limit, 
                              after=after, 
                              before=before)
    comments_df = pd.DataFrame(gen)

    comments_df.drop(labels=['d_'], inplace=True, axis=1)
    comments_df['created_utc'] = pd.to_datetime(comments_df.created_utc, unit='s')
    comments_df['created'] = pd.to_datetime(comments_df.created, unit='s')
    comments_df['retrieved_utc'] = pd.to_datetime(comments_df.retrieved_utc, unit='s')
    
    return comments_df

Because Pushshift retrieves comments soon after they are posted, fields such as the score are not up to date. To update them, we use PRAW.

# PRAW

#### Get updated information on each Comment
#### Get live data on each Submission (parent)
#### Get live data on each User

In [19]:
def get_livedata(df):
    '''
    Input
    - df: comments_df
    Output
    - comments_df
    - submissions_df
    - users_df
    
    Creates 2 new DFs:
    - submissions_df
    - users_df
    And updates df with current information
    
    Iterate over df to get identifying information and 
    use it to populate the 2 new DFs
    
    '''
    
    import praw
    import reddit_auth
    reddit = praw.Reddit(
        client_id=reddit_auth.client_id,
        client_secret=reddit_auth.client_secret,
        user_agent=reddit_auth.user_agent, 
        password=reddit_auth.password,
        username=reddit_auth.username
    )
    
    comments_df = df.copy()
    submissions_data = defaultdict(list)
    users_data = defaultdict(list)
    
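    for idx, row in df.iterrows():
        comment = reddit.comment(id=row.id)
        # write to this row only; assigning to comments_df['col'] would
        # overwrite the whole column on every iteration
        comments_df.loc[idx, 'updated_score'] = comment.score
        comments_df.loc[idx, 'saved'] = comment.saved
        comments_df.loc[idx, 'updated_stickied'] = comment.stickied
        # note: replies may be empty for a comment fetched by id
        # unless comment.refresh() is called first
        comments_df.loc[idx, 'num_replies'] = len(list(comment.replies))

        # link_id looks like 't3_<id>'; drop the 't3_' prefix
        submission = reddit.submission(row.link_id[3:])
        # if the line above doesn't work, use the line below
        # submission = reddit.submission(reddit.comment(id=row.id).submission.id)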
        
        submissions_data['id'].append(submission.id)
        submissions_data['name'].append(submission.name)
        submissions_data['title'].append(submission.title)
        submissions_data['num_comments'].append(submission.num_comments)
        submissions_data['author'].append(submission.author)
        submissions_data['created_utc'].append(datetime.fromtimestamp(submission.created_utc))
        submissions_data['distinguished'].append(submission.distinguished)
        submissions_data['is_self'].append(submission.is_self)
        submissions_data['link_flair_text'].append(submission.link_flair_text)
        submissions_data['locked'].append(submission.locked)
        submissions_data['over_18'].append(submission.over_18)
        submissions_data['permalink'].append(submission.permalink)
        submissions_data['saved'].append(submission.saved)
        submissions_data['score'].append(submission.score)
        submissions_data['selftext'].append(submission.selftext)
        submissions_data['spoiler'].append(submission.spoiler)
        submissions_data['stickied'].append(submission.stickied)
        submissions_data['upvote_ratio'].append(submission.upvote_ratio)
        submissions_data['url'].append(submission.url)

        user = reddit.redditor(row.author)
        
        users_data['id'].append(user.id)
        users_data['comment_karma'].append(user.comment_karma)
        users_data['created_utc'].append(datetime.fromtimestamp(user.created_utc))
        users_data['has_verified_email'].append(user.has_verified_email)
        users_data['is_employee'].append(user.is_employee)
        users_data['is_mod'].append(user.is_mod)
        users_data['is_gold'].append(user.is_gold)
        users_data['link_karma'].append(user.link_karma)
        users_data['subreddit'].append(user.subreddit)
    
    submissions_df = pd.DataFrame(submissions_data)
    users_df = pd.DataFrame(users_data)
    
    return comments_df, submissions_df, users_df
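
The loop in `get_livedata` makes several requests per comment, so this is where the rate-limit problems mentioned in the TO DO list are most likely to show up. PRAW applies its own rate-limit handling, so the sketch below is only a fallback idea for transient failures. `call_with_retries` is a hypothetical helper, and catching a bare `Exception` is an assumption made for brevity; a real version should catch the specific exception classes raised by the client library.

```python
import time

def call_with_retries(fn, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait and retry with exponential backoff."""
    for attempt in range(max_tries):
        try:
            return fn()
        except Exception:
            # Give up after the last allowed attempt.
            if attempt == max_tries - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before retrying.
            sleep(base_delay * 2 ** attempt)

# Demo with a function that fails twice and then succeeds.
# The sleep function is replaced by a recorder so the demo runs instantly.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

delays = []
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

Inside the loop this could wrap a single lookup, for example `call_with_retries(lambda: reddit.comment(id=row.id).score)`, since PRAW only sends the request when an attribute is first accessed.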
        

In [20]:
# define epoch
# note: these are naive datetimes, so .timestamp() interprets them in the local time zone
after = int(datetime(2022,1,1,0,0).timestamp())
before = int(datetime(2022,1,2,0,0).timestamp())

# define our filter
our_list_of_terms = ['python','java']
our_filter = '|'.join(our_list_of_terms)

# define limit
our_limit = 10 # so testing can be faster
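
Calling `.timestamp()` on a naive `datetime` interprets it in the machine's local time zone, so the cell above can select a different window on a different machine. If the epoch is meant to be in UTC, a sketch like the following pins it down. `utc_epoch` is a hypothetical helper introduced here for illustration.

```python
from datetime import datetime, timezone

def utc_epoch(year, month, day, hour=0, minute=0, second=0):
    """Return the Unix timestamp of a UTC wall-clock time (hypothetical helper)."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())

after = utc_epoch(2022, 1, 1)
before = utc_epoch(2022, 1, 2)
print(after, before, before - after)  # 1640995200 1641081600 86400
```

These values can be passed to `get_pushshift` in place of the ones computed above.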

In [21]:
df = get_pushshift(our_filter, our_limit, before, after)
comments, submissions, users = get_livedata(df)

In [28]:
submissions.head(2)

| | id | name | title | num_comments | author | created_utc | distinguished | is_self | link_flair_text | locked | over_18 | permalink | saved | score | selftext | spoiler | stickied | upvote_ratio | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rtoawg | t3_rtoawg | 6 elixir move. | 302 | Gam-erAnimation | 2022-01-01 09:46:09 | | False | FAKE ARTICLE/TWEET/TEXT | False | False | /r/PoliticalCompassMemes/comments/rtoawg/6_eli... | False | 4669 | | False | False | 0.98 | https://i.redd.it/vfzjskwi54981.jpg |
| 1 | ru0tde | t3_ru0tde | Iron Farm Not working. | 12 | zylofan | 2022-01-01 19:50:53 | | True | Help | False | False | /r/Minecraft/comments/ru0tde/iron_farm_not_wor... | False | 1 | Playing on 1.18.1, just built a Wattles Iron f... | False | False | 1.0 | https://www.reddit.com/r/Minecraft/comments/ru... |


In [30]:
users.head(2)

| | id | comment_karma | created_utc | has_verified_email | is_employee | is_mod | is_gold | link_karma | subreddit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7mloc6t0 | 13770 | 2020-08-09 03:16:06 | True | False | True | True | 2289 | u_ankyboii007 |
| 1 | 8q6bu | 1387 | 2012-08-20 00:31:15 | True | False | True | False | 2392 | u_zylofan |


___
#### Check time lapse between created_utc and retrieved_utc

In [32]:
# we may only need to see that the max time lapse is small enough
max(comments.retrieved_utc - comments.created_utc)

Timedelta('0 days 00:00:15')
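
The maximum alone can be dominated by a single outlier, so it may help to also look at the share of comments retrieved within a chosen threshold. The sketch below is a minimal illustration: `lag_summary` and the one-minute default threshold are assumptions, not values taken from Pushshift. It is written for plain `datetime` objects, and passing the `created_utc` and `retrieved_utc` columns of `comments` is expected to work as well, since pandas timestamps subclass `datetime`.

```python
from datetime import datetime, timedelta

def lag_summary(created, retrieved, threshold=timedelta(minutes=1)):
    """Summarise retrieval lag for paired created/retrieved timestamps."""
    lags = [r - c for c, r in zip(created, retrieved)]
    if not lags:
        raise ValueError("no timestamps to compare")
    # Count how many comments were retrieved within the threshold.
    within = sum(1 for lag in lags if lag <= threshold)
    return {
        "max_lag": max(lags),
        "share_within_threshold": within / len(lags),
    }

# Made-up timestamps for illustration only.
created = [datetime(2022, 1, 1, 9, 0, 0), datetime(2022, 1, 1, 9, 5, 0)]
retrieved = [datetime(2022, 1, 1, 9, 0, 15), datetime(2022, 1, 1, 9, 7, 0)]
summary = lag_summary(created, retrieved)
print(summary["max_lag"], summary["share_within_threshold"])
```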

___

In [381]:
# if we decide to get all comments for some posts:
# below is an instance of CommentForest
# print(reddit.submission(id=reddit.comment(id=df.id[0]).submission.id).comments)

In [382]:
# # if we decide to get all the comments from a user:
# user.comments
# user_subreddits = []
# for comment in reddit.redditor(user.name).comments.new(limit=None):
#     user_subreddits.append(comment.subreddit)
# 
# Counter(user_subreddits)
# print(len(user_subreddits)) #999, I wonder if it's cutting it off at 999...