### Author : 
Nishi Cestero

This notebook details data acquisition from Reddit using the Python Pushshift API wrapper. This method allows us to more easily retrieve historical posts without running into post-retrieval limits of the original Python API. For reference, I've kept code of using the original Python Reddit API at the bottom.

As of 4/19/2020: scraped subreddit data from January 2016 to April, 19, 2020. Start date corresponds to the year r/BPD began using "flairs" in their posts. Set the same starting date for all other subreddits for consistency. 

Subreddits scraped:
- General:
    - r/Advice
    - r/needadvice
    - r/Vent
    - r/venting
    - r/Rant
- Mental Health:
    - r/BPD
    - r/CPTSD
    - r/Anxiety
    - r/depressionhelp

# Python Pushshift API wrapper method

PSAW documentation:
https://github.com/dmarx/psaw

In [3]:
import pandas as pd
import numpy as np
from psaw import PushshiftAPI

api = PushshiftAPI()

Example of using API 

In [62]:
import datetime as dt

    
start_epoch=int(dt.datetime(2020,3,1).timestamp())
end_epoch=int(dt.datetime(2020,4,1).timestamp())

time_now = int(dt.datetime.utcnow().timestamp())
results = list(api.search_submissions(after=start_epoch,
                                  before=end_epoch,
                        subreddit='BPD',
                        filter=['id','created_utc' ,'author', 'title','selftext', 'subreddit','num_comments', 'ups',
                               'downs','score','url','link_flair_text' ],
                                  limit=10000
                        ))

Example of a submission object's dictionary returned by API

In [64]:
results[0].d_

{'author': 'ccholericc',
 'created_utc': 1585712740,
 'id': 'fsrzxr',
 'link_flair_text': 'Positivity',
 'num_comments': 9,
 'score': 3,
 'selftext': 'After everything that has been going on around the world, this is one of the only places I feel safe enough to talk about my feelings. Everyone here really understands what I’m going through and feeling and nobody ever judges. I’m not trying to romanticize BPD, but you are some of the most compassionate and kind people I’ve ever met. I hope everyone is coping well with the chaos of today, if not at least know we’re all going through it together',
 'subreddit': 'BPD',
 'title': 'I feel safe in this sub',
 'url': 'https://www.reddit.com/r/BPD/comments/fsrzxr/i_feel_safe_in_this_sub/',
 'created': 1585727140.0}

## crawl_reddit(...)

Retrieve all posts in a subreddit from given start year/month to present time. Queries API month-per-month to avoid results limitation.

By default only returns flaired posts. Specify to return all posts regardless of having flair. 

In [81]:
import datetime as dt

#crawl all submissions from a subreddit from start time to current in monthly increments
def crawl_reddit(start_year, start_month, subreddit_name, filter_nonflaired=True):
    
    df = pd.DataFrame()
    time_now = int(dt.datetime.utcnow().timestamp())    
    start_epoch = int(dt.datetime(start_year, start_month, 1).timestamp())

    for year in range(start_year,2021):
        for month in range(start_month, 13):
            start_epoch = int(dt.datetime(year, month, 1).timestamp())
            
            if(month==12):
                end_epoch = int(dt.datetime(year+1, 1, 1).timestamp())
            else:
                end_epoch = int(dt.datetime(year, month+1, 1).timestamp())
            
            if( start_epoch > time_now):
                break


            results = list(api.search_submissions(
                        after=start_epoch, before=end_epoch,
                        subreddit=subreddit_name,
                        filter=['id','created_utc' ,'author', 'title','selftext', 'subreddit',
                                'num_comments','score','url','link_flair_text' ],
                        limit=100000
                    ))

            if (filter_nonflaired):
                #remove objects missing the link_flair_text attribute
                results= [submission for submission in results if submission.d_.get('link_flair_text','NA') != 'NA']
            else:
                #make sure at least one submission carries 'link_flair_text' in .d_
                results[0].d_['link_flair_text'] = results[0].d_.get('link_flair_text',np.nan)
                results = [submission.d_ for submission in results]
            
            print(year, month)
            print(len(results))
            
            if(len(results)>0):
                df = df.append(results)
                
    return df

## Mental Health Subreddits with Intention Tags

Subreddits (Member Count) : [ List of Relevant Flairs]
- r/BPD (88.9k) : ['Venting', 'Input', 'Seeking Support',  + older flairs]
- r/CPTSD (65.3k): ['CPTSD Vent/ Rant' , 'Request: Emotional Support']
- r/depression_help (51.2k): [ 'REQUESTING SUPPORT', 'REQUESTING ADVICE' , 'RANT' ]
- r/Anxiety (349k) : ['Advice Needed', 'Needs A Hug/Support', 'Venting']





In [75]:
BPD_df = crawl_reddit(2016,1,'BPD')
BPD_df.to_csv('BPD_01_01_2016_04_19_2020')

2016 1
498
2016 2
498
2016 3
490
2016 4
450
2016 5
508
2016 6
486
2016 7
531
2016 8
542
2016 9
323
2016 10
122
2016 11
80
2016 12
70
2017 1
71
2017 2
43
2017 3
44
2017 4
93
2017 5
173
2017 6
173
2017 7
198
2017 8
204
2017 9
216
2017 10
251
2017 11
264
2017 12
324
2018 1
374
2018 2
378
2018 3
356
2018 4
409
2018 5
458
2018 6
500
2018 7
635
2018 8
634
2018 9
603
2018 10
657
2018 11
715
2018 12
771
2019 1
1000
2019 2
1835
2019 3
2218
2019 4
1978
2019 5
2069
2019 6
2058
2019 7
2067
2019 8
2245
2019 9
2209
2019 10
2513
2019 11
2413
2019 12
2754
2020 1
3540
2020 2
3457
2020 3
3997
2020 4
2579


In [82]:
CPTSD_df = crawl_reddit(2016,1,'CPTSD')
CPTSD_df.to_csv('CPTSD_01_01_2016_04_19_2020.csv')

2016 1
0
2016 2
0
2016 3
0
2016 4
0
2016 5
0
2016 6
0
2016 7
0
2016 8
0
2016 9
0
2016 10
0
2016 11
0
2016 12
0
2017 1
0
2017 2
0
2017 3
0
2017 4
0
2017 5
0
2017 6
0
2017 7
0
2017 8
0
2017 9
0
2017 10
0
2017 11
0
2017 12
0
2018 1
0
2018 2
0
2018 3
0
2018 4
0
2018 5
1
2018 6
0
2018 7
1
2018 8
0
2018 9
0
2018 10
0
2018 11
0
2018 12
0
2019 1
0
2019 2
48
2019 3
443
2019 4
533
2019 5
642
2019 6
628
2019 7
810
2019 8
847
2019 9
987
2019 10
1083
2019 11
1056
2019 12
1357
2020 1
1361
2020 2
1426
2020 3
1420
2020 4
924


In [83]:
depression = crawl_reddit(2016,1,'depression_help')
depression.to_csv('depression_help_04_19_2020.csv')

2016 1
0
2016 2
0
2016 3
0
2016 4
0
2016 5
0
2016 6
0
2016 7
0
2016 8
0
2016 9
0
2016 10
0
2016 11
0
2016 12
0
2017 1
0
2017 2
0
2017 3
0
2017 4
0
2017 5
0
2017 6
0
2017 7
0
2017 8
0
2017 9
0
2017 10
0
2017 11
0
2017 12
0
2018 1
0
2018 2
0
2018 3
0
2018 4
0
2018 5
0
2018 6
1
2018 7
0
2018 8
0
2018 9
0
2018 10
0
2018 11
0
2018 12
0
2019 1
333
2019 2
598
2019 3
1038
2019 4
1047
2019 5
947
2019 6
769
2019 7
336
2019 8
0
2019 9
0
2019 10
548
2019 11
771
2019 12
873
2020 1
980
2020 2
1102
2020 3
1063
2020 4
817


In [84]:
anxiety = crawl_reddit(2016,1,'Anxiety')
anxiety.to_csv('anxiety_04_19_2020.csv')

2016 1
802
2016 2
801
2016 3
742
2016 4
711
2016 5
802
2016 6
753
2016 7
849
2016 8
984
2016 9
944
2016 10
1025
2016 11
919
2016 12
917
2017 1
1102
2017 2
944
2017 3
1113
2017 4
1037
2017 5
996
2017 6
1082
2017 7
1163
2017 8
1323
2017 9
1225
2017 10
1284
2017 11
1578
2017 12
1536
2018 1
1846
2018 2
1661
2018 3
1779
2018 4
1689
2018 5
1506
2018 6
1395
2018 7
1499
2018 8
1997
2018 9
2474
2018 10
2599
2018 11
2587
2018 12
2706
2019 1
3217
2019 2
2904
2019 3
3258
2019 4
2920
2019 5
3213
2019 6
3183
2019 7
3436
2019 8
3605
2019 9
3559
2019 10
3396
2019 11
3177
2019 12
3259
2020 1
4207
2020 2
3864
2020 3
4250
2020 4
2773


In [360]:
BPD_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39423 entries, 0 to 1300
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   author                39423 non-null  object 
 1   created_utc           39423 non-null  int64  
 2   id                    39423 non-null  object 
 3   link_flair_text       39418 non-null  object 
 4   num_comments          39423 non-null  int64  
 5   score                 39423 non-null  int64  
 6   selftext              39423 non-null  object 
 7   subreddit             39423 non-null  object 
 8   title                 39423 non-null  object 
 9   url                   39423 non-null  object 
 10  created               39423 non-null  object 
 11  d_                    39384 non-null  object 
 12  body_compound_score   39423 non-null  float64
 13  title_compound_score  39423 non-null  float64
dtypes: float64(2), int64(3), object(9)
memory usage: 5.8+ MB


In [324]:
BPD_df.head()

Unnamed: 0,author,created_utc,id,link_flair_text,num_comments,score,selftext,subreddit,title,url,created,d_
0,[deleted],1454296789,43mkdn,Seeking Support,2,1,[deleted],BPD,Fight with a coworker was publicly disrespectf...,https://www.reddit.com/r/BPD/comments/43mkdn/f...,1454310000.0,"{'author': '[deleted]', 'created_utc': 1454296..."
1,SharpAtTheEdge,1454296727,43mk8p,Seeking Support,5,4,I just seem to fuck up every relationship I've...,BPD,I hate BPD so much.,https://www.reddit.com/r/BPD/comments/43mk8p/i...,1454310000.0,"{'author': 'SharpAtTheEdge', 'created_utc': 14..."
2,skyandbuildings,1454296592,43mjwl,Seeking Support,6,2,I think about killing myself all the time. As ...,BPD,I can't get the thought of suicide out of my h...,https://www.reddit.com/r/BPD/comments/43mjwl/i...,1454310000.0,"{'author': 'skyandbuildings', 'created_utc': 1..."
3,The_JollyGreenGiant,1454296570,43mjul,Questions,12,5,Now I've realized that I've entered into a rel...,BPD,I thought I was fine over the past few weeks. ...,https://www.reddit.com/r/BPD/comments/43mjul/i...,1454310000.0,"{'author': 'The_JollyGreenGiant', 'created_utc..."
4,justanotherikealamp,1454290090,43m3q5,Venting,4,14,I also have complex ptsd. I've been trapped in...,BPD,Is there anyone who could talk (or type) with ...,https://www.reddit.com/r/BPD/comments/43m3q5/i...,1454310000.0,"{'author': 'justanotherikealamp', 'created_utc..."


Cleaning dataset notes
- watch out for empty selftext
- watch out for [deleted] in self text
- author can also be [deleted] 
- Seems we cannot get ups and downs from this api, only final score 

# General Subreddits for Advice or Venting



Advice subreddits (member count):
- r/Advice (401k)
- r/needadvice (381k) : limits discussions of certain topics, including relationships, but is flaired with question category ("pet loss", "mental health")

Venting subreddits:
- r/venting (6.1k)
- r/Rant (170k)
- r/Vent (40k)

In [85]:
needadvice = crawl_reddit(2016,1,'needadvice',filter_nonflaired=False)
needadvice.head()
needadvice.to_csv('needadvice_01_01_2016_04_19_2020.csv')

2016 1
554
2016 2
490
2016 3
512
2016 4
512
2016 5
539
2016 6
520
2016 7
521
2016 8
594
2016 9
521
2016 10
538
2016 11
565
2016 12
611
2017 1
654
2017 2
530
2017 3
587
2017 4
445
2017 5
479
2017 6
511
2017 7
524
2017 8
513
2017 9
533
2017 10
564
2017 11
539
2017 12
575
2018 1
672
2018 2
703
2018 3
841
2018 4
835
2018 5
914
2018 6
909
2018 7
1102
2018 8
1072
2018 9
1130
2018 10
1191
2018 11
1126
2018 12
947
2019 1
1409
2019 2
1253
2019 3
1687
2019 4
1496
2019 5
1639
2019 6
1253
2019 7
1434
2019 8
1614
2019 9
1415
2019 10
1491
2019 11
1509
2019 12
1496
2020 1
1439
2020 2
1353
2020 3
1337
2020 4
546


In [86]:
rant = crawl_reddit(2016,1,'Rant',filter_nonflaired=False)
rant.head()
rant.to_csv('rant_01_01_2016_04_19_2020.csv')


2016 1
1155
2016 2
1214
2016 3
1398
2016 4
1331
2016 5
1531
2016 6
1570
2016 7
1664
2016 8
1801
2016 9
1756
2016 10
1751
2016 11
1938
2016 12
1805
2017 1
1884
2017 2
1656
2017 3
1965
2017 4
1786
2017 5
1884
2017 6
1949
2017 7
2037
2017 8
2300
2017 9
2236
2017 10
2333
2017 11
2252
2017 12
2441
2018 1
2380
2018 2
2562
2018 3
2918
2018 4
3023
2018 5
3238
2018 6
3032
2018 7
3536
2018 8
3508
2018 9
3642
2018 10
3880
2018 11
4198
2018 12
4423
2019 1
4813
2019 2
4724
2019 3
5027
2019 4
5269
2019 5
5248
2019 6
5264
2019 7
6287
2019 8
6522
2019 9
6275
2019 10
6574
2019 11
6762
2019 12
7414
2020 1
7594
2020 2
7333
2020 3
9206
2020 4
5355


In [87]:
vent = crawl_reddit(2016,1,'Vent',filter_nonflaired=False)
vent.head()
vent.to_csv('vent_01_01_2016_04_19_2020.csv')

2016 1
212
2016 2
186
2016 3
244
2016 4
263
2016 5
246
2016 6
236
2016 7
236
2016 8
246
2016 9
293
2016 10
301
2016 11
302
2016 12
340
2017 1
360
2017 2
391
2017 3
411
2017 4
399
2017 5
400
2017 6
430
2017 7
465
2017 8
590
2017 9
504
2017 10
576
2017 11
616
2017 12
746
2018 1
693
2018 2
769
2018 3
923
2018 4
996
2018 5
1069
2018 6
1110
2018 7
1237
2018 8
1391
2018 9
1385
2018 10
1684
2018 11
1827
2018 12
2036
2019 1
2300
2019 2
2455
2019 3
2816
2019 4
2757
2019 5
2900
2019 6
3205
2019 7
3473
2019 8
3624
2019 9
3585
2019 10
3780
2019 11
4133
2019 12
4762
2020 1
5055
2020 2
4906
2020 3
5901
2020 4
3628


In [71]:
advice_df = crawl_reddit(2016,1,'Advice', filter_nonflaired=False)

2016 1
5326
2016 2
5163
2016 3
5181
2016 4
5519
2016 5
6012
2016 6
6027
2016 7
6369
2016 8
7005
2016 9
6772
2016 10
6942
2016 11
6725
2016 12
7651
2017 1
8009
2017 2
7169
2017 3
7934
2017 4
7760
2017 5
8032
2017 6
8146
2017 7
8192
2017 8
8645
2017 9
8492
2017 10
9360
2017 11
9834
2017 12
9727
2018 1
10644
2018 2
9735
2018 3
10610
2018 4
11411
2018 5
12131
2018 6
11928
2018 7
13081
2018 8
14285
2018 9
14276
2018 10
16271
2018 11
16131
2018 12
17618
2019 1
19665
2019 2
19313
2019 3
22617
2019 4
22520
2019 5
23238
2019 6
24091
2019 7
26882
2019 8
27794
2019 9
26418
2019 10
26776
2019 11
26409
2019 12
29581
2020 1
30304
2020 2
28793
2020 3
30404
2020 4
18438


In [72]:
advice_df.to_csv('advice_01_01_2016_04_19_2020.csv')

In [73]:
venting_df = crawl_reddit(2016,1,'venting',filter_nonflaired=False)
venting_df.to_csv('venting_01_01_2016_04_19_2020.csv')

2016 1
58
2016 2
42
2016 3
43
2016 4
46
2016 5
46
2016 6
44
2016 7
54
2016 8
74
2016 9
72
2016 10
73
2016 11
49
2016 12
89
2017 1
70
2017 2
53
2017 3
85
2017 4
59
2017 5
107
2017 6
85
2017 7
80
2017 8
87
2017 9
67
2017 10
91
2017 11
86
2017 12
88
2018 1
113
2018 2
83
2018 3
95
2018 4
144
2018 5
111
2018 6
143
2018 7
146
2018 8
169
2018 9
167
2018 10
195
2018 11
185
2018 12
199
2019 1
220
2019 2
247
2019 3
274
2019 4
278
2019 5
278
2019 6
303
2019 7
381
2019 8
389
2019 9
395
2019 10
436
2019 11
472
2019 12
534
2020 1
536
2020 2
559
2020 3
626
2020 4
376


# Python Reddit API

Documentation for reference of data collection attempt directly with Python Reddit API.

Problem with method: no way to retrieve historical data by specifying date range. Limit on number of results returned. 

In [2]:
import praw

In [2]:
reddit= praw.Reddit(user_agent='Comment Extraction (by /u/USERNAME)',
                    client_id='NEwAilxBlqBWeA', client_secret='***', 
                    username='***',password='***' )

In [88]:
# print(reddit.user.me())


Example on retrieving comment threads:

https://github.com/akhilesh-reddy/Cable-cord-cutter-Sentiment-analysis-using-Reddit-data/blob/master/Scraping%20comments%20from%20Reddit%20forum.ipynb

In [32]:
#extracting comments from BPD subreddit, complete according to cordcutter example
comm_list = []
header_list = []
i = 0
for submission in reddit.subreddit('BPD').search('a',time_filter='week'):
    print(submission.link_flair_text, submission.title, submission.id, submission.created_utc,
          submission.author, submission.num_comments, submission.ups, submission.downs, submission.score
         ) #submission.selftext, )
    submission.comments.replace_more(limit=None)
    comment_queue = submission.comments[:]
    print(comment_queue)


## cordcutter example    
comm_list = []
header_list = []
i = 0
for submission in reddit.subreddit('cordcutters').hot(limit=2):
    submission.comments.replace_more(limit=None)
    comment_queue = submission.comments[:]  # Seed with top-level
    while comment_queue:
        header_list.append(submission.title)
        comment = comment_queue.pop(0)
        comm_list.append(comment.body)
        t = []
        t.extend(comment.replies)
        while t:
            header_list.append(submission.title)
            reply = t.pop(0)
            comm_list.append(reply.body)
df = pd.DataFrame(header_list)
df['comm_list'] = comm_list
df.columns = ['header','comments']
df['comments'] = df['comments'].apply(lambda x : x.replace('\n',''))
df.to_csv('cordcutter_comments.csv',index = False)

Seeking Support My ex got engaged after less than a year of dating someone else. Don’t know where else to talk about this. f1q0o6 1581339917.0 JordanLikeAStone 8 13 0 13
[Comment(id='fh7okkx'), Comment(id='fh858ol'), Comment(id='fh8j9sd'), Comment(id='fh84wkl')]
Urgent: Coping Skills Needed Should I go to the hospital for help with a meltdown??? f1uqeg 1581360256.0 carbeean 17 3 0 3
[Comment(id='fh8iwwc'), Comment(id='fh8jcbw')]
CW: Multiple DAE have a million thoughts and ideas constantly?? f1ia2k 1581295733.0 PeelMeDaddy 7 8 0 8
[Comment(id='fh6ef5c'), Comment(id='fh6s911'), Comment(id='fh6imnm'), Comment(id='fh6vi40')]
DAE DAE not have a full on fear of abandonment, but a fear of being hated? f1fu73 1581285314.0 nevillethrowaway22 6 10 0 10
[Comment(id='fh57bwc'), Comment(id='fh578gx'), Comment(id='fh5npk5'), Comment(id='fh6x0y8')]
Seeking Support Coping ahead for a Valentine's day alone f1i07c 1581294535.0 princessmudkip 6 3 0 3
[Comment(id='fh6jc3b'), Comment(id='fh6qe20'), Commen

In [114]:
#list of lists for submissions objects
submission_list = []

for submission in reddit.subreddit('BPD').search('a',time_filter='all', sort='new', limit=1000):
    submission_list.append([submission.id, submission.subreddit, submission.created_utc, submission.link_flair_text, 
                            submission.author, submission.title, submission.selftext, submission.num_comments, 
                            submission.ups, submission.downs, submission.score])
    

In [115]:
submissions_df = pd.DataFrame(submission_list, columns=['id','subreddit','created_utc', 'link_flair_text', 'author', 'title',
                                      'body', 'num_comments', 'ups', 'downs', 'score'])