<!-- <img src="../assets/a_eyes_readme.gif" style="float:right ; margin: 10px ; width:300px;">  -->

<h1><left>SAVE WORLD: Using Natural Language Processing to identify suicidal posts</left></h1>
<!-- <h4><left>By Prathamesh & Mayuresh</left></h4> -->

___

## 1. Data Collection
- For this project, we will use Reddit's API to collect posts from two subreddits: "r/depression" and "r/SuicideWatch".
- We aim to automate as much of this process as possible into reusable functions so that the data collection is repeatable.
- When collecting data, we add a randomized delay between requests out of consideration for Reddit's servers and security staff.
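
The randomized delay mentioned in the last point can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: the name `polite_pause`, its default bounds of 1 and 6 seconds, and the injectable `sleep` argument are all assumptions made for this example.

```python
import random
import time


def polite_pause(low=1, high=6, sleep=time.sleep):
    """Sleep for a random whole number of seconds in [low, high] and return it.

    `polite_pause` is a hypothetical helper; `sleep` can be swapped out so the
    function can be tried without actually waiting.
    """
    delay = random.randint(low, high)  # both endpoints are inclusive
    sleep(delay)
    return delay


# Example: pass a no-op sleeper to see the chosen delay without waiting.
chosen = polite_pause(1, 6, sleep=lambda seconds: None)
print("Would have waited {} second(s)".format(chosen))
```

Keeping the sleep function as a parameter is only a convenience for trying the helper out; in a real scrape the default `time.sleep` would be used.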

> Note: Data in this notebook was collected on 27 August 2021. If you run the code again on another day, a different set of posts will be scraped.


In [1]:
import requests
import time
import pandas as pd
from random import randint

### 1.1 Exploring the JSON structure of the r/depression subreddit page

In [2]:
#WE WILL SCRAPE THE r/depression AND r/SuicideWatch SUBREDDITS
#LET'S START BY EXPLORING THE JSON INNARDS OF THE FORMER
url_1 = "https://www.reddit.com/r/depression.json"

In [3]:
#DEFINING A USER AGENT AND MAKING SURE STATUS IS GOOD TO GO
headers = {"User-agent" : "Sam He"}
res = requests.get(url_1, headers=headers)
res.status_code

200

In [4]:
#PEEKING AT WHAT OUR DATA WILL LOOK LIKE
depress_json = res.json()
depress_json

{'kind': 'Listing',
 'data': {'after': 't3_pciz4e',
  'dist': 27,
  'modhash': '',
  'geo_filter': None,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'depression',
     'selftext': 'We understand that most people who reply immediately to an OP with an invitation to talk privately  mean only to help, but this type of response usually leads to either disappointment or disaster.  it usually works out quite differently here than when you say "PM me anytime" in a casual social context.  \n\nWe have huge admiration and appreciation for the goodwill and good citizenship of so many of you who support others here and flag inappropriate content - even more so because we know that so many of you are struggling yourselves.  We\'re hard at work behind the scenes on more information and resources to make it easier to give and get quality help here - this is just a small start.  \n\nOur new wiki page explains in detail why it\'s much better to respond in publi

In [5]:
#THE REDDIT DATA SEEMS TO BE ORGANISED AS A DICTIONARY
#LET'S GET ITS KEYS
sorted(depress_json["data"].keys())

['after', 'before', 'children', 'dist', 'geo_filter', 'modhash']

In [7]:
#WE FIND OUT THAT THE after VALUE IS WHAT WE PASS AS A QUERY PARAMETER...
#TO INDICATE IN OUR URL THAT WE WANT TO SEE THE NEXT SET OF POSTS AFTER THE after "CODE"

depress_json["data"]["after"]

't3_pciz4e'
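
As a rough sketch of how the `after` code ends up in a URL, the snippet below builds the next-page address with the standard library. The helper name `next_page_url` is hypothetical and exists only for this illustration; `requests` achieves the same effect when the value is passed through its `params` argument.

```python
from urllib.parse import urlencode


def next_page_url(base_url, after=None):
    """Return base_url, with ?after=<code> appended when a code is given.

    `next_page_url` is a hypothetical helper used only for illustration.
    """
    if after is None:
        return base_url  # first page: no query string needed
    return "{}?{}".format(base_url, urlencode({"after": after}))


print(next_page_url("https://www.reddit.com/r/depression.json", "t3_pciz4e"))
# → https://www.reddit.com/r/depression.json?after=t3_pciz4e
```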

In [8]:
#CONFIRMING THAT THE after VALUE ABOVE IS REALLY THE NAME OF THE LAST POST ON OUR PAGE
[post["data"]["name"] for post in depress_json["data"]["children"]]

['t3_doqwow',
 't3_m246c4',
 't3_pcb1su',
 't3_pc132v',
 't3_pcc6a6',
 't3_pc8c7z',
 't3_pcfnra',
 't3_pc9l7h',
 't3_pcditk',
 't3_pchbxe',
 't3_pchbii',
 't3_pcg0e6',
 't3_pcbw3e',
 't3_pciouj',
 't3_pcarlr',
 't3_pcerop',
 't3_pcfy8m',
 't3_pciwok',
 't3_pcg3bs',
 't3_pci36s',
 't3_pci0g7',
 't3_pboa6e',
 't3_pcef1w',
 't3_pc6gg1',
 't3_pce6m9',
 't3_pcg9ju',
 't3_pciz4e']
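
The visual check above can also be written as a comparison. The sketch below uses a trimmed, hand-made stand-in for `depress_json` (three of the post names shown above) so that it runs without a network call; the variable names `listing` and `last_matches_after` are assumptions for this example.

```python
# Hypothetical stand-in for depress_json, trimmed to three posts.
listing = {
    "data": {
        "after": "t3_pciz4e",
        "children": [
            {"kind": "t3", "data": {"name": "t3_doqwow"}},
            {"kind": "t3", "data": {"name": "t3_pcg9ju"}},
            {"kind": "t3", "data": {"name": "t3_pciz4e"}},
        ],
    }
}

# The after code should equal the name of the last post on the page.
names = [post["data"]["name"] for post in listing["data"]["children"]]
last_matches_after = names[-1] == listing["data"]["after"]
print(last_matches_after)  # → True
```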

In [9]:
#CHECKING OUT THE NUMBER OF POSTS IN ONE PAGE
len(depress_json["data"]["children"])

27

In [10]:
# OH, WE CAN DATAFRAME IT. 
pd.DataFrame(depress_json["data"]["children"])

| | kind | data |
|---|---|---|
| 0 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 1 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 2 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 3 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 4 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 5 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 6 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 7 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 8 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |
| 9 | t3 | {'approved_at_utc': None, 'subreddit': 'depres... |


In [11]:
#LOOKS LIKE THIS DATA IS REALLY WHAT WE ARE LOOKING FOR
depress_json["data"]["children"][0]["data"]

{'approved_at_utc': None,
 'subreddit': 'depression',
 'selftext': 'We understand that most people who reply immediately to an OP with an invitation to talk privately  mean only to help, but this type of response usually leads to either disappointment or disaster.  it usually works out quite differently here than when you say "PM me anytime" in a casual social context.  \n\nWe have huge admiration and appreciation for the goodwill and good citizenship of so many of you who support others here and flag inappropriate content - even more so because we know that so many of you are struggling yourselves.  We\'re hard at work behind the scenes on more information and resources to make it easier to give and get quality help here - this is just a small start.  \n\nOur new wiki page explains in detail why it\'s much better to respond in public comments, at least until you\'ve gotten to know someone.  It will be maintained at /r/depression/wiki/private_contact, and the full text of the current v
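
Since each entry in `children` wraps the useful fields inside its `"data"` dictionary, a small helper can pull out just the fields of interest. This is a sketch under assumptions: the helper name `pick_fields`, the choice of fields, and the two sample posts are all made up for illustration.

```python
def pick_fields(children, fields=("name", "title", "selftext")):
    """Return one small dict per post, holding only the requested fields.

    `pick_fields` is a hypothetical helper; missing fields come back as None.
    """
    rows = []
    for child in children:
        post = child["data"]  # the post's own fields live under "data"
        rows.append({field: post.get(field) for field in fields})
    return rows


# Made-up sample shaped like Reddit's listing children.
sample_children = [
    {"kind": "t3", "data": {"name": "t3_aaa", "title": "First", "selftext": "Body one", "gilded": 0}},
    {"kind": "t3", "data": {"name": "t3_bbb", "title": "Second", "selftext": "Body two", "gilded": 1}},
]

rows = pick_fields(sample_children)
print(rows[0])
# → {'name': 't3_aaa', 'title': 'First', 'selftext': 'Body one'}
```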

### 1.2 Creating functions to automate the Data Collection process 
- We will first run those functions on r/depression and check if they have worked well.

In [12]:
# NOW WE CAN DEFINE A FUNCTION TO SCRAPE A REDDIT PAGE

def reddit_scrape(url_string, number_of_scrapes, output_list):
    #SCRAPED POSTS ARE ADDED TO output_list (SHOULD BE EMPTY WHEN PASSED IN)
    #USES THE GLOBAL headers DICT DEFINED EARLIER FOR THE USER AGENT
    #after STARTS AS None SO THAT THE FIRST REQUEST FETCHES THE FIRST PAGE
    after = None
    for i in range(number_of_scrapes):
        if i == 0:
            print("SCRAPING {}\n--------------------------------------------------".format(url_string))
            print("<<<SCRAPING COMMENCED>>>")
            print("Downloading Batch {} of {}...".format(1, number_of_scrapes))
        elif (i + 1) % 5 == 0:
            print("Downloading Batch {} of {}...".format(i + 1, number_of_scrapes))

        #THIS WILL TELL THE SCRAPER TO GET THE NEXT SET AFTER REDDIT'S after CODE
        #NOTE: IF REDDIT RETURNS after AS None, THE NEXT REQUEST STARTS AGAIN FROM THE FIRST PAGE
        params = {} if after is None else {"after": after}
        res = requests.get(url_string, params=params, headers=headers)
        if res.status_code == 200:
            the_json = res.json()
            output_list.extend(the_json["data"]["children"])
            after = the_json["data"]["after"]
        else:
            #STOP ON ANY NON-200 RESPONSE AND REPORT THE STATUS CODE
            print(res.status_code)
            break
        time.sleep(randint(1, 6))

    print("<<<SCRAPING COMPLETED>>>")
    print("Number of posts downloaded: {}".format(len(output_list)))
    print("Number of unique posts: {}".format(len({p["data"]["name"] for p in output_list})))
 

In [13]:
#CALLING THE FUNCTION ON OUR DEPRESSION SUBREDDIT
depress_scraped = [] #DEFINING AN EMPTY LIST THAT WILL CONTAIN OUR SCRAPED DATA
reddit_scrape("https://www.reddit.com/r/depression.json", 50, depress_scraped)

SCRAPING https://www.reddit.com/r/depression.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1241
Number of unique posts: 989
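
The run above downloaded 1241 posts, of which only 989 were unique. One possible contributor, judging from the code alone, is that whenever Reddit returns no `after` code the next request carries no `after` parameter and fetches the first page again. The sketch below shows a variant that stops at that point instead. It is an illustration, not the notebook's method: `collect_pages`, `fetch_page`, and `FAKE_PAGES` are hypothetical names, and the fake two-page listing stands in for real responses so that the example runs offline.

```python
def collect_pages(fetch_page, max_pages):
    """Collect children from up to max_pages pages, stopping at the end of the listing.

    `fetch_page` is a hypothetical callable: it takes an after code (or None)
    and returns a dict shaped like Reddit's listing JSON.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:
            break  # no further pages; avoid requesting the first page again
    return posts


# Fake two-page listing so the sketch runs without any network access.
FAKE_PAGES = {
    None: {"data": {"after": "t3_b",
                    "children": [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}]}},
    "t3_b": {"data": {"after": None,
                      "children": [{"data": {"name": "t3_c"}}]}},
}

posts = collect_pages(lambda after: FAKE_PAGES[after], max_pages=5)
print([p["data"]["name"] for p in posts])
# → ['t3_a', 't3_b', 't3_c']
```

Even though five pages were requested, the loop ends after two because the second page reports no `after` code.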


In [14]:
#CREATING A FUNCTION TO OUTPUT A LIST OF UNIQUE POSTS
def create_unique_list(original_scrape_list, new_list_name):
    #KEEPS THE FIRST OCCURRENCE OF EACH POST, IDENTIFIED BY ITS name
    seen_names = set()
    for post in original_scrape_list:
        name = post["data"]["name"]
        if name not in seen_names:
            new_list_name.append(post["data"])
            seen_names.add(name)
    #CHECKING IF THE NEW LIST IS OF SAME LENGTH AS UNIQUE POSTS
    print("LIST NOW CONTAINS {} UNIQUE SCRAPED POSTS".format(len(new_list_name)))
    

In [15]:
#CALLING THE FUNCTION ON OUR SCRAPED DATA
depress_scraped_unique = []
create_unique_list(depress_scraped, depress_scraped_unique)

LIST NOW CONTAINS 989 UNIQUE SCRAPED POSTS
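
An alternative way to express the same de-duplication is to key the posts by their `name` in a dictionary, which keeps insertion order in Python 3.7+. This is only a sketch: `unique_posts` and the sample data are hypothetical, and unlike `create_unique_list` it returns a new list rather than filling one passed in.

```python
def unique_posts(scraped):
    """Return each post's data dict once, keeping the first occurrence.

    `unique_posts` is a hypothetical alternative, shown for illustration only.
    """
    by_name = {}
    for child in scraped:
        post = child["data"]
        by_name.setdefault(post["name"], post)  # ignore later duplicates
    return list(by_name.values())


# Made-up scrape result containing one repeated post.
sample_scrape = [
    {"data": {"name": "t3_a", "title": "first copy"}},
    {"data": {"name": "t3_b", "title": "other post"}},
    {"data": {"name": "t3_a", "title": "second copy"}},
]

deduped = unique_posts(sample_scrape)
print([p["title"] for p in deduped])
# → ['first copy', 'other post']
```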


In [16]:
#PUTTING DEPRESSION DATA INTO A DATAFRAME AND LABELLING IT (SAVING TO CSV COMES LATER)
depression = pd.DataFrame(depress_scraped_unique)
depression["is_suicide"] = 0
depression.head() #CHECK IF THERE ARE 100 COLS AND LAST DUMMY is_suicide COL

| | approved_at_utc | subreddit | selftext | author_fullname | saved | mod_reason_title | gilded | clicked | title | link_flair_richtext | ... | permalink | parent_whitelist_status | stickied | url | subreddit_subscribers | created_utc | num_crossposts | media | is_video | is_suicide |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | depression | We understand that most people who reply immed... | t2_1t70 | False | | 2 | False | Our most-broken and least-understood rules is ... | [] | ... | /r/depression/comments/doqwow/our_mostbroken_a... | no_ads | True | https://www.reddit.com/r/depression/comments/d... | 785073 | 1572361000.0 | 2 | | False | 0 |
| 1 | | depression | Welcome to /r/depression's check-in post - a p... | t2_1t70 | False | | 0 | False | Regular Check-In Post, with important reminder... | [] | ... | /r/depression/comments/m246c4/regular_checkin_... | no_ads | True | https://www.reddit.com/r/depression/comments/m... | 785073 | 1615400000.0 | 0 | | False | 0 |
| 2 | | depression | At least it’s been there for the entirety of m... | t2_sfz74es | False | | 0 | False | Depression is just this dark, cold cloud that ... | [] | ... | /r/depression/comments/pcb1su/depression_is_ju... | no_ads | False | https://www.reddit.com/r/depression/comments/p... | 785073 | 1630020000.0 | 0 | | False | 0 |
| 3 | | depression | Since the past couple of months, I have been s... | t2_85eyhyk6 | False | | 0 | False | Having NO friends, ZERO motivation, and RUMINA... | [] | ... | /r/depression/comments/pc132v/having_no_friend... | no_ads | False | https://www.reddit.com/r/depression/comments/p... | 785073 | 1629990000.0 | 0 | | False | 0 |
| 4 | | depression | I just want time to stop for a second. If ther... | t2_7aa22vzb | False | | 0 | False | I always stay up late because I don't want the... | [] | ... | /r/depression/comments/pcc6a6/i_always_stay_up... | no_ads | False | https://www.reddit.com/r/depression/comments/p... | 785073 | 1630024000.0 | 0 | | False | 0 |


### 1.3 Running our functions on the r/SuicideWatch subreddit 

In [17]:
#CALLING THE SCRAPING FUNCTION ON OUR SUICIDEWATCH SUBREDDIT
suicide_scraped = [] #DEFINING AN EMPTY LIST THAT WILL CONTAIN OUR SCRAPED DATA
reddit_scrape("https://www.reddit.com/r/SuicideWatch.json", 50, suicide_scraped)

SCRAPING https://www.reddit.com/r/SuicideWatch.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1230
Number of unique posts: 981


In [18]:
#CALLING THE "UNIQUE ONLY" FUNCTION ON OUR SCRAPED DATA
suicide_scraped_unique = []
create_unique_list(suicide_scraped, suicide_scraped_unique)

LIST NOW CONTAINS 981 UNIQUE SCRAPED POSTS


In [19]:
#PUTTING SUICIDEWATCH DATA INTO A DATAFRAME AND LABELLING IT (SAVING TO CSV COMES LATER)
suicide_watch = pd.DataFrame(suicide_scraped_unique)
suicide_watch["is_suicide"] = 1
suicide_watch.head() #CHECK IF THERE ARE 100 COLS AND LAST DUMMY is_suicide COL

| | approved_at_utc | subreddit | selftext | author_fullname | saved | mod_reason_title | gilded | clicked | title | link_flair_richtext | ... | parent_whitelist_status | stickied | url | subreddit_subscribers | created_utc | num_crossposts | media | is_video | author_cakeday | is_suicide |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | SuicideWatch | We've been seeing a worrying increase in pro-s... | t2_1t70 | False | | 1 | False | New wiki on how to avoid accidentally encourag... | [] | ... | no_ads | True | https://www.reddit.com/r/SuicideWatch/comments... | 299322 | 1567526000.0 | 0 | | False | | 1 |
| 1 | | SuicideWatch | Activism, i.e. advocating or fundraising for s... | t2_1t70 | False | | 2 | False | Please remember that NO ACTIVISM of any kind i... | [] | ... | no_ads | True | https://www.reddit.com/r/SuicideWatch/comments... | 299322 | 1599734000.0 | 0 | | False | | 1 |
| 2 | | SuicideWatch | I am about to sneak out the house rn in order ... | t2_b3jv0o43 | False | | 0 | False | I am sneaking out rn to kms 15m | [] | ... | no_ads | False | https://www.reddit.com/r/SuicideWatch/comments... | 299322 | 1630032000.0 | 0 | | False | | 1 |
| 3 | | SuicideWatch | What is the value of life? If someone wanted t... | t2_cre2vqty | False | | 0 | False | Why is suicide wrong? | [] | ... | no_ads | False | https://www.reddit.com/r/SuicideWatch/comments... | 299322 | 1630045000.0 | 0 | | False | | 1 |
| 4 | | SuicideWatch | Three weeks ago, my (21) girlfriend of three y... | t2_djggl438 | False | | 0 | False | Girlfriend Was Raped. I Don’t Want to Live Any... | [] | ... | no_ads | False | https://www.reddit.com/r/SuicideWatch/comments... | 299322 | 1630043000.0 | 0 | | False | | 1 |


#### NOTE: I've commented out the `to_csv` calls in the next cell to prevent any accidental overwriting of the saved dataset.

In [20]:
#suicide_watch.to_csv('../data/suicide_watch.csv', index = False)
#depression.to_csv('../data/depression.csv', index = False)
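
Instead of commenting the save calls in and out, one option is to refuse to write when the target file already exists. The sketch below shows the idea with plain file handling from the standard library. It is an assumption-laden illustration: `write_if_absent` is a hypothetical helper, the CSV text is a placeholder, and a temporary directory is used so that nothing under `../data/` is touched.

```python
import os
import tempfile


def write_if_absent(path, text):
    """Write text to path only if no file exists there; return True if written.

    `write_if_absent` is a hypothetical helper shown for illustration.
    """
    try:
        # Mode "x" creates the file and raises FileExistsError if it already exists.
        with open(path, "x", encoding="utf-8") as handle:
            handle.write(text)
    except FileExistsError:
        return False
    return True


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "depression.csv")

first = write_if_absent(demo_path, "is_suicide\n0\n")   # file is created
second = write_if_absent(demo_path, "overwritten\n")    # existing file is left alone
print(first, second)
# → True False
```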

In [21]:
#INVESTIGATING THE CASE OF r/SuicideWatch HAVING AN ADDITIONAL COLUMN
suicide_watch.columns.difference(depression.columns)

Index(['author_cakeday'], dtype='object')

In [22]:
#LOOKING INTO THAT ADDITIONAL COLUMN
suicide_watch['author_cakeday'].isnull().value_counts()

True     976
False      5
Name: author_cakeday, dtype: int64
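
When pandas builds a DataFrame from a list of dictionaries, it creates a column for every key that appears in any of them, so a key present in only a handful of posts still becomes a column. The sketch below reproduces the column comparison with plain sets; the records and the helper `all_keys` are made up for illustration.

```python
def all_keys(records):
    """Return the union of keys across a list of dicts (hypothetical helper)."""
    keys = set()
    for record in records:
        keys.update(record)  # updating with a dict adds its keys
    return keys


# Made-up records: only one post in the second set carries the extra key.
records_a = [{"name": "t3_x", "title": "first"}, {"name": "t3_y", "title": "second"}]
records_b = [{"name": "t3_z", "title": "third", "author_cakeday": True},
             {"name": "t3_w", "title": "fourth"}]

extra = sorted(all_keys(records_b) - all_keys(records_a))
print(extra)
# → ['author_cakeday']
```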

#### Early thoughts about the collected data
- Data seems to be collected successfully.
- We have a slight imbalance in the size of our sets, as we collected 981 r/SuicideWatch posts and 989 r/depression posts. We might want to consider evening out the classes with another round of collection.
- There is also the matter of r/SuicideWatch having one extra column, "author_cakeday", which seems strange given that both subreddits exist on the same site. A likely explanation is that pandas creates a column for every key that appears in any post, and this one is non-null in only 5 of the 981 posts. Since the column is mostly NaNs, it doesn't seem like one we will be using for our classifier.
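
Another round of collection is one way to even out the two sets. A different option, shown here only as a sketch and not as what this project does, is to randomly downsample the larger set to the size of the smaller one. The helper `downsample_to_match`, the seed, and the use of integer IDs in place of real posts are assumptions for this example; the sizes 989 and 981 are the unique post counts reported above.

```python
import random


def downsample_to_match(larger, smaller, seed=42):
    """Return a random subset of `larger` with as many items as `smaller`.

    `downsample_to_match` is a hypothetical helper; the seed makes the draw repeatable.
    """
    rng = random.Random(seed)
    return rng.sample(larger, len(smaller))  # sampling without replacement


# Integer IDs stand in for the 989 r/depression and 981 r/SuicideWatch posts.
depression_ids = list(range(989))
suicide_watch_ids = list(range(981))

balanced_depression = downsample_to_match(depression_ids, suicide_watch_ids)
print(len(balanced_depression), len(suicide_watch_ids))
# → 981 981
```

Downsampling discards a few posts, so with a gap this small it may be just as reasonable to leave the sets as they are.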
