# This notebook includes 3 sections:

### 1. Using the pushshift API

This section steps through how the posts for this project were gathered using psaw, a Python wrapper around the Pushshift API.

For more info about Pushshift and the different ways to use it, see this Reddit [post](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/).

### 2. Using praw

This is a Python wrapper around the Reddit API. It has a limit of 1,000 posts and was only used to view some preliminary posts.

### 3. Getting comments and replies per post

This was not used for the project analysis, but is useful for future work.

In [3]:
import praw
import pandas as pd
import numpy as np
import pickle

In [14]:
import reddit as red
%load_ext autoreload
%autoreload 2

### 1. PSAW - Pushshift API

In [3]:
from psaw import PushshiftAPI
import pandas as pd
import datetime as dt
api=PushshiftAPI()

#set date range desired for pull
start_epoch=int(dt.datetime(2021, 2, 20).timestamp())
end_epoch=int(dt.datetime(2021, 2, 24).timestamp())

gen = api.search_submissions(after=start_epoch, before=end_epoch,
                            subreddit='crossfit',
                            filter=['id', 'title','author', 'selftext', 'created', 'num_comments', 'score', 'upvote_ratio', 'url', 'is_video', 'thumbnail'],
                            limit=None)

df_praw = pd.DataFrame([sub.d_ for sub in gen])

In [269]:
# Repeat the pull above for each desired date range, appending the results to a single df_praw; then pickle the result to save it
picklefile_name = 'df_praw_feb.pkl'
with open(picklefile_name, 'wb') as picklefile:
    pickle.dump(df_praw, picklefile)
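
The pull above covers a single four-day range, and the comment mentions repeating it over several ranges. A minimal sketch of how those ranges could be generated is shown here. The helper name `epoch_windows` and the seven-day window size are assumptions made for illustration; they are not part of the original notebook or of psaw.

```python
import datetime as dt


def epoch_windows(start, end, days=4):
    """Yield (after, before) epoch-second pairs that cover [start, end).

    Hypothetical helper: splits a long date range into consecutive
    windows so each pull stays small.
    """
    step = dt.timedelta(days=days)
    cursor = start
    while cursor < end:
        # Clamp the last window so it never runs past the requested end date.
        window_end = min(cursor + step, end)
        yield int(cursor.timestamp()), int(window_end.timestamp())
        cursor = window_end


# Example: split February 2021 into 7-day windows.
windows = list(epoch_windows(dt.datetime(2021, 2, 1), dt.datetime(2021, 3, 1), days=7))
for after, before in windows:
    print(after, before)
```

Each `(after, before)` pair could then be passed as `after=` and `before=` to `api.search_submissions`, with the resulting frames combined into a single `df_praw` before pickling.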

### 2. PRAW - Reddit API

Limitation: at most 1,000 posts can be retrieved.

Prerequisite: Reddit app credentials. Go to [App Preferences](https://www.reddit.com/prefs/apps) and create an app to receive credentials.

View the available submission attributes in the [documentation](http://lira.no-ip.org:8080/doc/praw-doc/html/code_overview/models/submission.html).

In [5]:
import service_creds as creds

In [7]:
reddit = praw.Reddit(client_id=creds.client_id_, client_secret=creds.client_secret_, user_agent=creds.user_agent_)

Version 7.1.4 of praw is outdated. Version 7.2.0 was released 4 days ago.


In [6]:
# get the 10 top posts of all time from the crossfit subreddit
top_posts = reddit.subreddit('crossfit').top('all', limit=10)
for post in top_posts:
    print(post.title, post.id)

Today marks 9 months of CF and I gave birth to what appear to be muscles! Who knew someone who hadn't even touched a barbell up until that point could fall in love with getting stronger each day? Completing my first Tough Mudder on Saturday! 💪🏾😝 78woxh
Look how less fat I am! 18 months progress 72zmnf
I got my BMU right before quarantine and am still in disbelief that I have them. Until this year, I've always assumed it was just a movement I wouldn't have. Through a lot of community support, 3 months of deliberate practice, and 40# of weightloss, I finally did it! This was my journey. hgc15v
I know I post often, but I feel like there’s a lot of people in my same boat or are scared of starting and I want to tell you that you’re totally capable. Here’s me 1.5 years apart. From 15kg to 30kg thrusters and down 70lbs. I never thought I’d be where I’m at now. cu5922
6 months of CrossFit + Macros tracking + blood,sweat, & tears later... (-60lbs of body fat) l2izvj
Celebrating my one year anni

In [25]:
subreddit = reddit.subreddit("crossfit")
submission_dict = {}

In [27]:
# limitation of 1k
for submission in subreddit.top('year', limit=None): # subreddit.hot(limit=None) for hot instead of top category
    details_dict = {}
    details_dict['title'] = submission.title
    details_dict['author'] = submission.author
    details_dict['self_text'] = submission.selftext
    details_dict['time'] = red.convert_date_time(submission.created)
    details_dict['num_comments'] = submission.num_comments
    details_dict['score'] = submission.score
    details_dict['upvote_ratio'] = submission.upvote_ratio
    details_dict['url'] = submission.url
    details_dict['video'] = submission.is_video
    details_dict['thumbnail'] = submission.thumbnail

    #save dictionary of dictionaries
    submission_dict[submission.id] = details_dict
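
The loop above relies on `red.convert_date_time`, which lives in the local `reddit` module and is not shown in this notebook. A minimal sketch of what such a helper might look like follows. The function name `to_datetime_string`, the choice of UTC, and the output format are assumptions inferred from the `time` column in the output, not the author's actual implementation. PRAW also exposes `created_utc` on submissions, which may be the safer input if consistent UTC times are needed.

```python
import datetime as dt


def to_datetime_string(epoch_seconds):
    """Convert a Unix timestamp (seconds) to a 'YYYY-MM-DD HH:MM:SS' string in UTC.

    Hypothetical stand-in for red.convert_date_time; the real helper may
    use local time or a different format.
    """
    moment = dt.datetime.fromtimestamp(epoch_seconds, tz=dt.timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


print(to_datetime_string(0))      # 1970-01-01 00:00:00
print(to_datetime_string(86400))  # 1970-01-02 00:00:00
```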

In [63]:
temp = pd.DataFrame(submission_dict).T
hot_top_year = temp.reset_index().rename({'index': 'id'}, axis=1)

hot_top_year.head()

| | id | title | author | self_text | time | num_comments | score | upvote_ratio | url | video | thumbnail |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hgc15v | I got my BMU right before quarantine and am st... | Keekabo | | 2020-06-26 18:33:08 | 163 | 2336 | 0.98 | https://v.redd.it/q5mhrbnyha751 | True | https://b.thumbs.redditmedia.com/Hb9Nsuxziy-I1... |
| 1 | l2izvj | 6 months of CrossFit + Macros tracking + blood... | VivianE20 | | 2021-01-22 07:53:38 | 125 | 2041 | 0.99 | https://i.redd.it/px4xkucnauc61.jpg | False | https://b.thumbs.redditmedia.com/njL2xrtaxA7R0... |
| 2 | gyx460 | What’s the big deal with HQ/Glassman’s comments? | TacticalCocoaBunny | Black female crossfitter here.\n\nIt’s been a ... | 2020-06-08 11:50:33 | 213 | 1805 | 0.94 | https://www.reddit.com/r/crossfit/comments/gyx... | False | self |
| 3 | gdom00 | A girl can dream ok | fwds | | 2020-05-05 02:32:33 | 116 | 1605 | 0.98 | https://i.redd.it/06o0hi1unuw41.jpg | False | https://a.thumbs.redditmedia.com/GOyCHXECnj-pJ... |
| 4 | lb5qf2 | Fraser has retired | Puzzleheaded_Cod_716 | | 2021-02-02 20:36:31 | 347 | 1514 | 0.97 | https://i.redd.it/4sdztv9sk4f61.jpg | False | https://b.thumbs.redditmedia.com/36jNRf2qSd_Fq... |


### 3. Comments and replies per submission

Comments

In [197]:
# get list of submission IDs
sub_id_list = hot_top_year.id.tolist()

In [202]:
len(sub_id_list)

1000

In [199]:
details = []
replies_dict = {}

#iterate through submission IDs
for id_ in sub_id_list:
    submission = reddit.submission(id=id_)

    #iterate through top level comments in submission
    for top_level_comment in submission.comments:
        #save top level comment details
        details.append(red.get_details(top_level_comment))

        #save replies object for submission
        comment_id = top_level_comment.id
        replies_obj = top_level_comment.replies
        replies_dict[comment_id] = replies_obj
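
The loop above calls `red.get_details`, which is defined in the local `reddit` module and not shown here. The sketch below illustrates one way such a helper could produce the seven columns used later (`comment_id`, `parent_id`, `submission_id`, `author`, `body`, `date_time`, `score`). The function name `comment_details` and its behavior are assumptions for illustration only. It reads the standard PRAW comment attributes `id`, `parent_id`, `link_id`, `author`, `body`, `created_utc`, and `score`, and returns `None` for objects without a `body`, such as PRAW's `MoreComments` placeholders. A `SimpleNamespace` stands in for a real comment so the example runs without credentials.

```python
import datetime as dt
from types import SimpleNamespace


def comment_details(comment):
    """Return a tuple of comment fields, or None for placeholder objects.

    Hypothetical stand-in for red.get_details.
    """
    # Placeholder objects (e.g. "load more comments" stubs) have no body; skip them.
    if not hasattr(comment, 'body'):
        return None
    created = dt.datetime.fromtimestamp(comment.created_utc, tz=dt.timezone.utc)
    return (
        comment.id,
        comment.parent_id,
        comment.link_id,  # fullname of the submission the comment belongs to
        str(comment.author) if comment.author is not None else None,
        comment.body,
        created.strftime('%Y-%m-%d %H:%M:%S'),
        comment.score,
    )


# Fake comment object used only to demonstrate the output shape.
fake_comment = SimpleNamespace(
    id='abc123', parent_id='t3_xyz', link_id='t3_xyz',
    author=None, body='Nice work!', created_utc=0, score=5,
)
row = comment_details(fake_comment)
placeholder_row = comment_details(SimpleNamespace(id='more'))
print(row)
print(placeholder_row)
```

Returning `None` for placeholders matches the `None` check applied when the replies are collected later in this section.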

In [201]:
#save top level comments in df
cols = ['comment_id', 'parent_id', 'submission_id', 'author', 'body', 'date_time', 'score']
df_comments = pd.DataFrame(details, columns=cols)
df_comments.head()

| | comment_id | parent_id | submission_id | author | body | date_time | score |
|---|---|---|---|---|---|---|---|
| 0 | fw31uua | t3_hgc15v | t3_hgc15v | hereforthenow | Holy shit girl, that is awesome! Seeing the am... | 2020-06-26 10:49:52 | 165 |
| 1 | fw3hbdk | t3_hgc15v | t3_hgc15v | Keekabo | I've gotten a few questions so..\n\nBackyard P... | 2020-06-26 12:54:49 | 91 |
| 2 | fw31wqc | t3_hgc15v | t3_hgc15v | discojuggz | I love how ecstatic you were at end. That's go... | 2020-06-26 10:50:17 | 48 |
| 3 | fw34epl | t3_hgc15v | t3_hgc15v | Nice-Salary | Girl, you are an inspiration!! Look at those p... | 2020-06-26 11:09:59 | 20 |
| 4 | fw30vsq | t3_hgc15v | t3_hgc15v | cjb67 | Heck Yeah! Congrats! | 2020-06-26 10:42:16 | 16 |


In [203]:
df_comments.shape

(20046, 7)

In [204]:
picklefile_name = 'df_comments_hottopyear.pkl'
with open(picklefile_name, 'wb') as picklefile:
    pickle.dump(df_comments, picklefile)
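
For future work, the saved file has to be read back in a later session. The sketch below shows the round trip with the standard `pickle` module. The file name `example.pkl` and the small `records` list are made up for illustration and stand in for the real DataFrame, so the example runs without pandas or Reddit access.

```python
import os
import pickle
import tempfile

# Small stand-in for the comments data (hypothetical values).
records = [{'comment_id': 'abc123', 'score': 5}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')

    # Write the object out, mirroring the dump step used in this notebook.
    with open(path, 'wb') as picklefile:
        pickle.dump(records, picklefile)

    # Read it back; note the 'rb' mode for loading.
    with open(path, 'rb') as picklefile:
        restored = pickle.load(picklefile)

print(restored == records)
```

The same pattern applies to `df_comments_hottopyear.pkl`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.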

Replies

In [241]:
# get replies from dictionary
reply_details = []

for idx, replies in replies_dict.items():
    for reply in replies.list():
        detail = red.get_details(reply)
        if detail is not None:
            reply_details.append(detail)

In [242]:
len(reply_details)

33094

In [245]:
df_reply = pd.DataFrame(reply_details, columns=cols)
df_reply.head()

| | comment_id | parent_id | submission_id | author | body | date_time | score |
|---|---|---|---|---|---|---|---|
| 0 | fw696ve | t1_fw31uua | t3_hgc15v | | Exactly!!! This is what we need to celebrate!!... | 2020-06-27 08:25:39 | 4 |
| 1 | fw6kck0 | t1_fw696ve | t3_hgc15v | walkenrider | Oh for goodness sake, whyyyyyyy did you have t... | 2020-06-27 10:06:01 | 13 |
| 2 | fw6korp | t1_fw6kck0 | t3_hgc15v | | I’m congratulated her. She fucking killed it! ... | 2020-06-27 10:09:01 | 3 |
| 3 | fw6kts7 | t1_fw6korp | t3_hgc15v | walkenrider | Time and place, my friend. Time and place. | 2020-06-27 10:10:15 | 7 |
| 4 | fw6kyis | t1_fw6kts7 | t3_hgc15v | | Thank you for telling me when and where I can ... | 2020-06-27 10:11:25 | 1 |


In [246]:
df_reply.shape

(33094, 7)

In [247]:
# combine comments and replies df (pd.concat is used because DataFrame.append was removed in pandas 2.0)
temp = pd.concat([df_comments, df_reply], ignore_index=True)
#filter out deleted/removed posts
mask = (temp.body == '[deleted]') | (temp.body == '[removed]')
df_comm_replies = temp[~mask]
df_comm_replies.shape

(51918, 7)