
## 1. Data Collection
- For this project, we will be using Reddit's API to collect posts from five subreddits: "r/depression", "r/SuicideWatch", "r/Stress", "r/CasualConversation" and "r/Showerthoughts".

> Note: Data in this notebook was collected on Oct 7th 2022, 10:30am. Running the code again on another day will scrape a different set of posts.


In [1]:
import requests
import time
import pandas as pd
from random import randint

### 1.1 Exploring the JSON structure of a subreddit page

In [36]:
#WE WILL SCRAPE EACH SUBREDDIT THROUGH ITS PUBLIC .json LISTING
#LET'S START BY EXPLORING THE JSON RETURNED FOR ONE OF THEM
#UNCOMMENT EXACTLY ONE url_1 LINE PER RUN; THE OUTPUT SHOWN BELOW IS FROM r/Showerthoughts
# url_1 = "https://www.reddit.com/r/depression.json"
# url_1 = "https://www.reddit.com/r/CasualConversation.json"
url_1 = "https://www.reddit.com/r/Showerthoughts.json"
# url_1 = "https://www.reddit.com/r/Stress.json"

In [37]:
#DEFINING A USER AGENT AND MAKING SURE STATUS IS GOOD TO GO
headers = {"User-agent" : "Pranav Satyal"}
res = requests.get(url_1, headers=headers)
res.status_code

200

In [38]:
#PEEKING AT WHAT OUR DATA WILL LOOK LIKE
depress_json = res.json()
depress_json

{'kind': 'Listing',
 'data': {'after': 't3_y58vbg',
  'dist': 25,
  'modhash': '',
  'geo_filter': None,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'Showerthoughts',
     'selftext': '',
     'author_fullname': 't2_lae4r',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'The world didn’t end in 2012 but everything did start to slowly get weirder right around then',
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/Showerthoughts',
     'hidden': False,
     'pwls': 6,
     'link_flair_css_class': None,
     'downs': 0,
     'thumbnail_height': None,
     'top_awarded_type': None,
     'hide_score': False,
     'name': 't3_y54uhb',
     'quarantine': False,
     'link_flair_text_color': 'dark',
     'upvote_ratio': 0.92,
     'author_flair_background_color': None,
     'subreddit_type': 'public',
     'ups': 8926,
     'total_awards_received': 0,
     'media_embed': {},

In [40]:
#THE REDDIT DATA SEEMS TO BE ORGANISED AS A DICTIONARY
#LET'S GET ITS KEYS
sorted(depress_json["data"].keys())

['after', 'before', 'children', 'dist', 'geo_filter', 'modhash']

In [41]:
#WITH SOME HELP FROM THIS YOUTUBE TUTORIAL: https://www.youtube.com/watch?v=5Y3ZE26Ciuk
#WE FIND OUT THAT THE after KEY HOLDS THE NAME OF THE LAST POST ON THE PAGE
#PASSING IT AS THE after QUERY PARAMETER IN OUR URL REQUESTS THE NEXT 25 POSTS

depress_json["data"]["after"]

't3_y58vbg'
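
As a rough illustration of how the `after` value turns into a next-page request, the sketch below builds the URL by hand with the standard library. The helper name `build_page_url` is hypothetical and not part of this notebook; `requests` produces an equivalent URL when it is given `params={"after": ...}`.

```python
from urllib.parse import urlencode

def build_page_url(base_url, after=None):
    """Return the listing URL for the page that follows the post named `after`."""
    # Hypothetical helper for illustration; with no `after`, the first page is requested.
    if after is None:
        return base_url
    return "{}?{}".format(base_url, urlencode({"after": after}))

first_page = build_page_url("https://www.reddit.com/r/depression.json")
next_page = build_page_url("https://www.reddit.com/r/depression.json", after="t3_y58vbg")
print(first_page)
print(next_page)
```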

In [42]:
#CONFIRMING THAT THE after VALUE ABOVE IS REALLY THE name OF THE LAST POST ON OUR PAGE
[post["data"]["name"] for post in depress_json["data"]["children"]]

['t3_y54uhb',
 't3_y4z5bu',
 't3_y5bcz4',
 't3_y51vxo',
 't3_y55i1m',
 't3_y5acas',
 't3_y5dkdv',
 't3_y5d4sr',
 't3_y4rm0q',
 't3_y582v2',
 't3_y4kqcl',
 't3_y58kaf',
 't3_y4u4et',
 't3_y4i1m3',
 't3_y5anpz',
 't3_y57n7v',
 't3_y4hrpz',
 't3_y55s6t',
 't3_y5e7ey',
 't3_y5f72k',
 't3_y5e16d',
 't3_y5g2v9',
 't3_y525jp',
 't3_y4tt7g',
 't3_y58vbg']

In [43]:
#CHECKING OUT THE NUMBER OF POSTS IN ONE PAGE
len(depress_json["data"]["children"])

25

In [44]:
# OH, WE CAN DATAFRAME IT. 
pd.DataFrame(depress_json["data"]["children"])

Unnamed: 0,kind,data
0,t3,"{'approved_at_utc': None, 'subreddit': 'Shower..."
1,t3,"{'approved_at_utc': None, 'subreddit': 'Shower..."
2,t3,"{'approved_at_utc': None, 'subreddit': 'Shower..."
3,t3,"{'approved_at_utc': None, 'subreddit': 'Shower..."
4,t3,"{'approved_at_utc': None, 'subreddit': 'Shower..."
5,t3,"{'approved_at_utc': None, 'subreddit': 'Shower..."
6,t3,"{'approved_at_utc': None, 'subreddit': 'Shower..."
7,t3,"{'approved_at_utc': None, 'subreddit': 'Shower..."
8,t3,"{'approved_at_utc': None, 'subreddit': 'Shower..."
9,t3,"{'approved_at_utc': None, 'subreddit': 'Shower..."
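
Each row above still holds the whole post as a nested dictionary in the `data` column. A minimal sketch of pulling a few fields out into flat records follows, using a tiny hand-made stand-in for `depress_json["data"]["children"]`. The `flatten_posts` name and the choice of fields are assumptions made for illustration.

```python
def flatten_posts(children, fields=("name", "subreddit", "title", "selftext")):
    """Turn Reddit listing children into flat dicts holding only the chosen fields."""
    # .get() leaves a field as None when a post does not carry it.
    return [{field: child["data"].get(field) for field in fields} for child in children]

# Hand-made sample shaped like depress_json["data"]["children"] (values shortened).
sample_children = [
    {"kind": "t3", "data": {"name": "t3_y54uhb", "subreddit": "Showerthoughts",
                            "title": "First title", "selftext": ""}},
    {"kind": "t3", "data": {"name": "t3_y4z5bu", "subreddit": "Showerthoughts",
                            "title": "Second title"}},
]

flat = flatten_posts(sample_children)
print(flat[0]["name"], flat[1]["selftext"])
```

The resulting list of flat dicts can be passed to `pd.DataFrame` to get one column per field.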


In [45]:
#LOOKS LIKE THIS DATA IS REALLY WHAT WE ARE LOOKING FOR
depress_json["data"]["children"][20]["data"]

{'approved_at_utc': None,
 'subreddit': 'Showerthoughts',
 'selftext': 'Just watched Goosebumps 2 and it made me realise the whole ‘adults are bad’ at stuff despite them growing up through it being a thing.',
 'author_fullname': 't2_16ajin',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'Shows frequently display (mid) adults as being bad with technology compared to teenagers despite most adults growing up during the shift to the new technology and teenagers lacking knowledge of the stuff that existed before the shift - adults have more overall knowledge',
 'link_flair_richtext': [],
 'subreddit_name_prefixed': 'r/Showerthoughts',
 'hidden': False,
 'pwls': 6,
 'link_flair_css_class': None,
 'downs': 0,
 'thumbnail_height': None,
 'top_awarded_type': None,
 'hide_score': False,
 'name': 't3_y5e16d',
 'quarantine': False,
 'link_flair_text_color': 'dark',
 'upvote_ratio': 0.65,
 'author_flair_background_color': None,
 'subreddit_type': 'public',
 

### 1.2 Creating functions to automate the Data Collection process 
- We will first run these functions on one subreddit and check that they work well.

In [46]:
# NOW WE CAN DEFINE A FUNCTION TO SCRAPE A REDDIT PAGE

def reddit_scrape(url_string, number_of_scrapes, output_list):
    #SCRAPED POSTS WILL BE APPENDED TO output_list (SHOULD BE EMPTY ON THE FIRST CALL)
    #RELIES ON THE GLOBAL headers DICT DEFINED EARLIER
    #after STAYS None FOR THE FIRST REQUEST, WHICH FETCHES THE FIRST PAGE OF THE SUBREDDIT
    after = None
    for i in range(number_of_scrapes):
        if i == 0:
            print("SCRAPING {}\n--------------------------------------------------".format(url_string))
            print("<<<SCRAPING COMMENCED>>>")
            print("Downloading Batch {} of {}...".format(1, number_of_scrapes))
        elif (i + 1) % 5 == 0:
            print("Downloading Batch {} of {}...".format(i + 1, number_of_scrapes))

        if after is None:
            #NOTE: THIS BRANCH ALSO RUNS IF A LATER PAGE RETURNS after = None,
            #SO THE SCRAPE RESTARTS FROM THE FIRST PAGE AND COLLECTS DUPLICATES
            params = {}
        else:
            #THIS WILL TELL THE SCRAPER TO GET THE NEXT SET AFTER REDDIT'S after CODE
            params = {"after": after}
        res = requests.get(url_string, params=params, headers=headers)
        if res.status_code == 200:
            the_json = res.json()
            output_list.extend(the_json["data"]["children"])
            after = the_json["data"]["after"]
        else:
            #STOP ON ANY NON-200 RESPONSE AND SHOW THE STATUS CODE
            print(res.status_code)
            break
        #RANDOM PAUSE BETWEEN REQUESTS SO WE DO NOT HAMMER THE SERVER
        time.sleep(randint(1, 6))

    print("<<<SCRAPING COMPLETED>>>")
    print("Number of posts downloaded: {}".format(len(output_list)))
    print("Number of unique posts: {}".format(len({p["data"]["name"] for p in output_list})))
 

In [47]:
#CALLING THE FUNCTION ON A SUBREDDIT (r/CasualConversation IN THE RUN SHOWN HERE)
depress_scraped = [] #DEFINING AN EMPTY LIST THAT WILL CONTAIN OUR SCRAPED DATA
reddit_scrape("https://www.reddit.com/r/CasualConversation.json", 50, depress_scraped)

SCRAPING https://www.reddit.com/r/CasualConversation.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1241
Number of unique posts: 814
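
The gap between 1241 downloaded and 814 unique posts suggests the scraper revisited some pages. One plausible cause, going by the function above, is that when a page comes back with `after` set to `None`, the next request carries no `after` parameter and starts again from the first page. The sketch below stops at that point instead. Here `fetch_page` is a hypothetical callable standing in for the `requests.get(...).json()` step, so the example runs without network access.

```python
def collect_pages(fetch_page, max_pages):
    """Collect listing children page by page, stopping when no next page exists."""
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)          # hypothetical: returns the parsed listing JSON
        posts.extend(page["data"]["children"])
        after = page["data"]["after"]
        if after is None:                 # no further pages, so do not wrap around
            break
    return posts

# Fake two-page listing used only to exercise the loop.
fake_pages = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}],
                    "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_c"}}], "after": None}},
}

posts = collect_pages(lambda after: fake_pages[after], max_pages=50)
names = [p["data"]["name"] for p in posts]
print(names)
```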


In [48]:
#CREATING A FUNCTION TO OUTPUT A LIST OF UNIQUE POSTS
def create_unique_list(original_scrape_list, new_list_name):
    #KEEPS THE FIRST OCCURRENCE OF EACH POST, IDENTIFIED BY ITS UNIQUE name FIELD
    seen_names = set()
    for post in original_scrape_list:
        name = post["data"]["name"]
        if name not in seen_names:
            new_list_name.append(post["data"])
            seen_names.add(name)
    #CHECKING IF THE NEW LIST IS OF SAME LENGTH AS UNIQUE POSTS
    print("LIST NOW CONTAINS {} UNIQUE SCRAPED POSTS".format(len(new_list_name)))
    

In [49]:
#CALLING THE FUNCTION ON OUR SCRAPED DATA
depress_scraped_unique = []
create_unique_list(depress_scraped, depress_scraped_unique)

LIST NOW CONTAINS 814 UNIQUE SCRAPED POSTS


In [50]:
#PUTTING DATA INTO A DATAFRAME AND LABELLING IT
depression = pd.DataFrame(depress_scraped_unique)
# depression["is_suicide"] = 0
depression["class"] = "casual conversation"
depression.head() #CHECK THAT THE LAST COLUMN IS THE NEW class LABEL

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,call_to_action,poll_data,author_cakeday,class
0,,CasualConversation,Welcome to r/CasualConversation! Thank you for...,t2_6l4z3,False,,0,False,r/CasualConversation Welcome Thread - Month of...,"[{'a': ':chat:', 'e': 'emoji', 'u': 'https://e...",...,https://www.reddit.com/r/CasualConversation/co...,2010592,1664636000.0,1,,False,,,,casual conversation
1,,CasualConversation,We've set up a Spotify playlist for our Commun...,t2_ghulyn8b,False,,0,False,/r/CasualConversation Community Spotify Playlist,"[{'a': ':note:', 'e': 'emoji', 'u': 'https://e...",...,https://www.reddit.com/r/CasualConversation/co...,2010592,1664708000.0,1,,False,,,,casual conversation
2,,CasualConversation,I got engaged today to my boyfriend. He doesn’...,t2_pv4ksqqq,False,,0,False,I got engaged today!,"[{'a': ':chat:', 'e': 'emoji', 'u': 'https://e...",...,https://www.reddit.com/r/CasualConversation/co...,2010592,1665869000.0,0,,False,,,,casual conversation
3,,CasualConversation,"Potato Broccoli Cheesy Soup from scratch, ya'l...",t2_hs5rg,False,,0,False,My anxiety has been debilitating lately. But t...,"[{'e': 'text', 't': '🍍'}]",...,https://www.reddit.com/r/CasualConversation/co...,2010592,1665815000.0,0,,False,,,,casual conversation
4,,CasualConversation,"Without mentioning what covid is, or just how ...",t2_ssi49iu5,False,,0,False,If you went back in time right before covid to...,"[{'a': ':thinking:', 'e': 'emoji', 'u': 'https...",...,https://www.reddit.com/r/CasualConversation/co...,2010592,1665861000.0,0,,False,,,,casual conversation


In [51]:
#SAVING DATA TO CSV; MAKE SURE THE FILE NAME MATCHES THE SUBREDDIT SCRAPED ABOVE
depression.to_csv('../data/shower_thoughts.csv', index = False)

### Scraping r/SuicideWatch Data

In [16]:
#CALLING THE SCRAPING FUNCTION ON OUR SUICIDEWATCH SUBREDDIT
suicide_scraped = [] #DEFINING AN EMPTY LIST THAT WILL CONTAIN OUR SCRAPED DATA
reddit_scrape("https://www.reddit.com/r/SuicideWatch.json", 50, suicide_scraped)


In [17]:
#CALLING THE "UNIQUE ONLY" FUNCTION ON OUR SCRAPED DATA
suicide_scraped_unique = []
create_unique_list(suicide_scraped, suicide_scraped_unique)

LIST NOW CONTAINS 964 UNIQUE SCRAPED POSTS


In [19]:
#PUTTING SUICIDEWATCH DATA INTO A DATAFRAME AND SAVING TO CSV
suicide_watch = pd.DataFrame(suicide_scraped_unique)
suicide_watch["is_suicide"] = 1
suicide_watch.head() #CHECK IF THERE ARE 100 COLS AND LAST DUMMY is_suicide COL

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,call_to_action,author_cakeday,link_flair_template_id,is_suicide
0,,SuicideWatch,We've been seeing a worrying increase in pro-s...,t2_1t70,False,,1,False,New wiki on how to avoid accidentally encourag...,[],...,https://www.reddit.com/r/SuicideWatch/comments...,381743,1567526000.0,0,,False,,,,1
1,,SuicideWatch,"Activism, i.e. advocating or fundraising for s...",t2_1t70,False,,0,False,Please remember that NO ACTIVISM of any kind i...,[],...,https://www.reddit.com/r/SuicideWatch/comments...,381743,1631232000.0,0,,False,,,,1
2,,SuicideWatch,,t2_8t8wj7fg,False,,0,False,The worst part is when you reach the stage whe...,[],...,https://www.reddit.com/r/SuicideWatch/comments...,381743,1665069000.0,0,,False,,,,1
3,,SuicideWatch,Sometimes it doesn't get better. Sometimes thi...,t2_217xa3t7,False,,0,False,"When people say ""It gets better"" it just makes...",[],...,https://www.reddit.com/r/SuicideWatch/comments...,381743,1665057000.0,0,,False,,,,1
4,,SuicideWatch,I don't understand how anyone can live through...,t2_51en9mor,False,,0,False,Everything just repeats,[],...,https://www.reddit.com/r/SuicideWatch/comments...,381743,1665096000.0,0,,False,,,,1


#### NOTE: I've commented out the `to_csv` calls in the next cell to prevent any accidental overwriting of the saved datasets.

In [20]:
#SAVING FILES
# suicide_watch.to_csv('../data/suicide_watch.csv', index = False)
# depression.to_csv('../data/depression.csv', index = False)

In [21]:
#INVESTIGATING THE CASE OF r/SuicideWatch HAVING AN ADDITIONAL COLUMN
suicide_watch.columns.difference(depression.columns)

Index(['link_flair_template_id'], dtype='object')
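
`columns.difference` only reports columns that are present on the left-hand side and missing on the right. A small sketch for checking both directions with plain sets is given below. The column lists are shortened, made-up stand-ins for `suicide_watch.columns` and `depression.columns`, and `compare_columns` is a hypothetical helper name.

```python
def compare_columns(left_columns, right_columns):
    """Return (only_in_left, only_in_right) as sorted lists of column names."""
    left, right = set(left_columns), set(right_columns)
    return sorted(left - right), sorted(right - left)

# Shortened, made-up column lists for illustration only.
suicide_cols = ["title", "selftext", "author_cakeday", "link_flair_template_id"]
depression_cols = ["title", "selftext", "author_cakeday", "poll_data"]

only_in_suicide, only_in_depression = compare_columns(suicide_cols, depression_cols)
print(only_in_suicide)
print(only_in_depression)
```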

In [22]:
#CHECKING HOW MANY author_cakeday VALUES ARE MISSING (NOTE: THE ADDITIONAL COLUMN FOUND ABOVE IS link_flair_template_id)
suicide_watch['author_cakeday'].isnull().value_counts()

True     957
False      7
Name: author_cakeday, dtype: int64

### Data seems to be collected successfully.
#### We have 964 r/SuicideWatch posts and 976 r/depression posts.