___

## 1. Data Collection
- For this project, we use Reddit's API to collect posts from two subreddits: r/depression and r/SuicideWatch.
- We aim to automate as much of this process as possible with reusable functions, so that the data collection can be repeated.
- When collecting data, we add a randomized delay between requests out of consideration for Reddit's servers and security staff.
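
The randomized delay mentioned above can be sketched as a small helper. This is only an illustration: the name `polite_pause` and the injectable `sleep` argument are assumptions made here for the example, and the 1 to 6 second range mirrors the `randint(1,6)` call used later in this notebook.

```python
import random
import time

def polite_pause(min_seconds=1, max_seconds=6, sleep=time.sleep):
    """Wait a random whole number of seconds and return the delay that was used.

    `sleep` is injectable (hypothetical convenience) so the helper can be
    exercised without actually waiting.
    """
    delay = random.randint(min_seconds, max_seconds)  # inclusive on both ends
    sleep(delay)
    return delay
```

Calling `polite_pause()` between two requests would space them out by 1 to 6 seconds.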

In [45]:
import requests
import time
import pandas as pd
from random import randint

### 1.1 Exploring the JSON structure of the r/depression subreddit page

In [2]:
#WE WILL SCRAPE THE r/depression AND r/SuicideWatch SUBREDDITS
#LET'S START BY EXPLORING THE JSON RETURNED FOR THE FORMER
url_1 = "https://www.reddit.com/r/depression.json"

In [3]:
#DEFINING A USER AGENT AND MAKING SURE STATUS IS GOOD TO GO
headers = {"User-agent" : "Abhishek Chavan"}
res = requests.get(url_1, headers=headers)
res.status_code

200

In [4]:
#PEEKING AT WHAT OUR DATA WILL LOOK LIKE
depress_json = res.json()
depress_json

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 27,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'depression',
     'selftext': 'We understand that most people who reply immediately to an OP with an invitation to talk privately  mean only to help, but this type of response usually leads to either disappointment or disaster.  it usually works out quite differently here than when you say "PM me anytime" in a casual social context.  \n\nWe have huge admiration and appreciation for the goodwill and good citizenship of so many of you who support others here and flag inappropriate content - even more so because we know that so many of you are struggling yourselves.  We\'re hard at work behind the scenes on more information and resources to make it easier to give and get quality help here - this is just a small start.  \n\nOur new wiki page explains in detail why it\'s much better to respond in public comments, at least until you\'ve gotten to k

In [5]:
#THE REDDIT DATA SEEMS TO BE ORGANISED AS A DICTIONARY
#LET'S GET ITS KEYS
sorted(depress_json["data"].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [6]:
#WE FIND OUT THAT THE after KEY HOLDS THE CODE WE PASS AS A QUERY PARAMETER...
#TO TELL REDDIT THAT WE WANT THE NEXT PAGE OF POSTS FOLLOWING THAT "CODE"

depress_json["data"]["after"]

't3_fegz17'
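
As a rough sketch of how the `after` code feeds into the next request, the helper below builds the query parameters from a listing response. The function name `next_page_params` is hypothetical and the sample dictionary is a trimmed stand-in for a real response, not actual scraped data.

```python
def next_page_params(listing_json):
    """Build query parameters for the page that follows `listing_json`.

    Returns an empty dict when there is no `after` code, which means a
    request made with it would start from the first page.
    """
    after = listing_json["data"].get("after")
    return {} if after is None else {"after": after}

# Trimmed stand-in for a listing response (illustrative only)
sample_listing = {"kind": "Listing", "data": {"after": "t3_fegz17", "children": []}}
print(next_page_params(sample_listing))  # {'after': 't3_fegz17'}
```

The resulting dict can be passed as `params=` to `requests.get`, which appends it to the URL as `?after=t3_fegz17`.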

In [7]:
#DOUBLE CONFIRMING THAT THE PREVIOUS AFTER KEY IS REALLY THE LAST ITEM ON OUR PAGE
[post["data"]["name"] for post in depress_json["data"]["children"]]

['t3_doqwow',
 't3_exo6f1',
 't3_fedwbi',
 't3_feel0k',
 't3_fe6ua3',
 't3_fecd9s',
 't3_feb4tq',
 't3_fegd6t',
 't3_fef10i',
 't3_fe6pvs',
 't3_fedtxb',
 't3_feh19t',
 't3_fedkl8',
 't3_fecmob',
 't3_fe2bv5',
 't3_fedzro',
 't3_fduqan',
 't3_fectqt',
 't3_fefb0h',
 't3_fea8or',
 't3_fehlyx',
 't3_fedluw',
 't3_fecam7',
 't3_fehduy',
 't3_fefwhm',
 't3_feh60s',
 't3_fegz17']

In [8]:
#CHECKING OUT THE NUMBER OF POSTS IN ONE PAGE
len(depress_json["data"]["children"])

27

In [9]:
pd.DataFrame(depress_json["data"]["children"])

Unnamed: 0,kind,data
0,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
1,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
2,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
3,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
4,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
5,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
6,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
7,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
8,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
9,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."


In [10]:
#THIS GIVES US THE DATA WE ARE LOOKING FOR
depress_json["data"]["children"][0]["data"]

{'approved_at_utc': None,
 'subreddit': 'depression',
 'selftext': 'We understand that most people who reply immediately to an OP with an invitation to talk privately  mean only to help, but this type of response usually leads to either disappointment or disaster.  it usually works out quite differently here than when you say "PM me anytime" in a casual social context.  \n\nWe have huge admiration and appreciation for the goodwill and good citizenship of so many of you who support others here and flag inappropriate content - even more so because we know that so many of you are struggling yourselves.  We\'re hard at work behind the scenes on more information and resources to make it easier to give and get quality help here - this is just a small start.  \n\nOur new wiki page explains in detail why it\'s much better to respond in public comments, at least until you\'ve gotten to know someone.  It will be maintained at /r/depression/wiki/private_contact, and the full text of the current v
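
Since each child wraps the post fields inside its own `data` dictionary, a small sketch of unpacking them may help. The helper name `extract_fields` and the sample records are made up for illustration; only the keys `name`, `title`, and `selftext` come from the output shown above.

```python
def extract_fields(children, fields=("name", "title", "selftext")):
    """Pull selected fields out of each child's nested `data` dictionary."""
    return [
        {field: child["data"].get(field) for field in fields}
        for child in children
    ]

# Made-up records shaped like Reddit listing children (illustrative only)
sample_children = [
    {"kind": "t3", "data": {"name": "t3_aaa", "title": "First post", "selftext": "Hello"}},
    {"kind": "t3", "data": {"name": "t3_bbb", "title": "Second post", "selftext": ""}},
]
rows = extract_fields(sample_children)
print(rows[0]["title"])  # First post
```

A list of flat dictionaries like `rows` can be handed directly to `pd.DataFrame`, giving one column per field instead of a single `data` column of nested dictionaries.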

### 1.2 Creating functions to automate the Data Collection process 
- We will first run those functions on r/depression and check if they have worked well.

In [11]:
# NOW WE CAN DEFINE A FUNCTION TO SCRAPE A REDDIT PAGE

def reddit_scrape(url_string, number_of_scrapes, output_list):
    #SCRAPED POSTS ARE APPENDED TO output_list (PASS IN AN EMPTY LIST)
    #USES THE GLOBAL headers DICTIONARY DEFINED EARLIER
    #after STAYS None FOR THE FIRST REQUEST, SO WE START AT THE TOP OF THE SUBREDDIT
    after = None
    for batch in range(number_of_scrapes):
        if batch == 0:
            print("SCRAPING {}\n--------------------------------------------------".format(url_string))
            print("<<<SCRAPING COMMENCED>>>")
            print("Downloading Batch {} of {}...".format(1, number_of_scrapes))
        elif (batch + 1) % 5 == 0:
            print("Downloading Batch {} of {}...".format(batch + 1, number_of_scrapes))

        #THIS TELLS THE SCRAPER TO GET THE NEXT SET AFTER REDDIT'S after CODE
        #IF REDDIT RETURNS after AS None, THE NEXT REQUEST STARTS FROM THE TOP AGAIN,
        #WHICH CAN PRODUCE DUPLICATE POSTS
        params = {} if after is None else {"after": after}
        res = requests.get(url_string, params=params, headers=headers)
        if res.status_code == 200:
            the_json = res.json()
            output_list.extend(the_json["data"]["children"])
            after = the_json["data"]["after"]
        else:
            #STOP ON ANY NON-200 RESPONSE AND SHOW THE STATUS CODE
            print(res.status_code)
            break
        #RANDOMIZED DELAY BETWEEN REQUESTS
        time.sleep(randint(1,6))

    print("<<<SCRAPING COMPLETED>>>")
    print("Number of posts downloaded: {}".format(len(output_list)))
    unique_names = {p["data"]["name"] for p in output_list}
    print("Number of unique posts: {}".format(len(unique_names)))
 

In [12]:
#CALLING THE FUNCTION ON OUR DEPRESSION SUBREDDIT
depress_scraped = [] #DEFINING AN EMPTY LIST THAT WILL CONTAIN OUR SCRAPED DATA
reddit_scrape("https://www.reddit.com/r/depression.json", 50, depress_scraped)

SCRAPING https://www.reddit.com/r/depression.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1243
Number of unique posts: 917
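
The gap between downloaded and unique posts means some posts were fetched more than once. One way to inspect this, sketched under the assumption that the scraped list has the same shape as Reddit's `children` entries, is to count how often each post `name` appears. The helper name `duplicate_name_counts` and the sample data are hypothetical.

```python
from collections import Counter

def duplicate_name_counts(scraped_children):
    """Return {post name: occurrences} for names that appear more than once."""
    counts = Counter(child["data"]["name"] for child in scraped_children)
    return {name: n for name, n in counts.items() if n > 1}

# Made-up scrape with one repeated post (illustrative only)
sample_scrape = [
    {"kind": "t3", "data": {"name": "t3_aaa"}},
    {"kind": "t3", "data": {"name": "t3_bbb"}},
    {"kind": "t3", "data": {"name": "t3_aaa"}},
]
print(duplicate_name_counts(sample_scrape))  # {'t3_aaa': 2}
```

Running this on `depress_scraped` would show which posts repeat and how often, which can help judge whether the duplicates come from stickied posts or from pagination restarting.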


In [13]:
#CREATING A FUNCTION TO OUTPUT A LIST OF UNIQUE POSTS
def create_unique_list(original_scrape_list, new_list_name):
    #TRACK POST NAMES ALREADY SEEN; A SET KEEPS THE MEMBERSHIP CHECK FAST
    seen_names = set()
    for post in original_scrape_list:
        name = post["data"]["name"]
        if name not in seen_names:
            #KEEP ONLY THE FIRST OCCURRENCE OF EACH POST
            new_list_name.append(post["data"])
            seen_names.add(name)
    #CHECKING IF THE NEW LIST IS OF SAME LENGTH AS UNIQUE POSTS
    print("LIST NOW CONTAINS {} UNIQUE SCRAPED POSTS".format(len(new_list_name)))
    

In [14]:
#CALLING THE FUNCTION ON OUR SCRAPED DATA
depress_scraped_unique = []
create_unique_list(depress_scraped, depress_scraped_unique)

LIST NOW CONTAINS 917 UNIQUE SCRAPED POSTS


In [15]:
#PUTTING DEPRESSION DATA INTO A DATAFRAME AND ADDING THE is_suicide LABEL (SAVING TO CSV HAPPENS LATER)
depression = pd.DataFrame(depress_scraped_unique)
depression["is_suicide"] = 0
depression.head() #CHECK IF THERE ARE 100 COLS AND LAST DUMMY is_suicide COL

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,is_suicide
0,,depression,We understand that most people who reply immed...,t2_1t70,False,,0,False,Our most-broken and least-understood rules is ...,[],...,/r/depression/comments/doqwow/our_mostbroken_a...,no_ads,True,https://www.reddit.com/r/depression/comments/d...,611580,1572361000.0,0,,False,0
1,,depression,Welcome to /r/depression's check-in post - a p...,t2_64qjj,False,,0,False,Regular Check-In Post,[],...,/r/depression/comments/exo6f1/regular_checkin_...,no_ads,True,https://www.reddit.com/r/depression/comments/e...,611580,1580649000.0,0,,False,0
2,,depression,I've been feeling really depressed and lonely ...,t2_17aooz,False,,0,False,I hate it so much when you try and express you...,[],...,/r/depression/comments/fedwbi/i_hate_it_so_muc...,no_ads,False,https://www.reddit.com/r/depression/comments/f...,611580,1583503000.0,0,,False,0
3,,depression,I literally broke down crying and asked to go ...,t2_5v2j4itq,False,,0,False,I went to the hospital because I was having re...,[],...,/r/depression/comments/feel0k/i_went_to_the_ho...,no_ads,False,https://www.reddit.com/r/depression/comments/f...,611580,1583507000.0,0,,False,0
4,,depression,Any kind soul want to give a depressed person ...,t2_15xfmv,False,,0,False,Cake day for me,[],...,/r/depression/comments/fe6ua3/cake_day_for_me/,no_ads,False,https://www.reddit.com/r/depression/comments/f...,611580,1583463000.0,0,,False,0


### 1.3 Running our functions on the r/SuicideWatch subreddit 

In [16]:
#CALLING THE SCRAPING FUNCTION ON OUR SUICIDEWATCH SUBREDDIT
suicide_scraped = [] #DEFINING AN EMPTY LIST THAT WILL CONTAIN OUR SCRAPED DATA
reddit_scrape("https://www.reddit.com/r/SuicideWatch.json", 50, suicide_scraped)

SCRAPING https://www.reddit.com/r/SuicideWatch.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1232
Number of unique posts: 980


In [17]:
#CALLING THE "UNIQUE ONLY" FUNCTION ON OUR SCRAPED DATA
suicide_scraped_unique = []
create_unique_list(suicide_scraped, suicide_scraped_unique)

LIST NOW CONTAINS 980 UNIQUE SCRAPED POSTS


In [18]:
#PUTTING SUICIDEWATCH DATA INTO A DATAFRAME AND ADDING THE is_suicide LABEL (SAVING TO CSV HAPPENS LATER)
suicide_watch = pd.DataFrame(suicide_scraped_unique)
suicide_watch["is_suicide"] = 1
suicide_watch.head() #CHECK IF THERE ARE 100 COLS AND LAST DUMMY is_suicide COL

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,is_suicide
0,,SuicideWatch,We've been seeing a worrying increase in pro-s...,t2_1t70,False,,1,False,New wiki on how to avoid accidentally encourag...,[],...,no_ads,True,https://www.reddit.com/r/SuicideWatch/comments...,188579,1567526000.0,0,,False,,1
1,,SuicideWatch,"If you want to recognise an occasion, please d...",t2_1t70,False,,0,False,Reminder: Absolutely no activism of any kind i...,[],...,no_ads,True,https://www.reddit.com/r/SuicideWatch/comments...,188579,1568093000.0,0,,False,,1
2,,SuicideWatch,I really fucking feel you,t2_111wkq,False,,0,False,To every single poster here i wanne say one thing,[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188579,1583465000.0,0,,False,,1
3,,SuicideWatch,Everyone ends up hating me eventually. \nMy ps...,t2_4de9uxb2,False,,0,False,I just want it all to stop,[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188579,1583505000.0,0,,False,,1
4,,SuicideWatch,,t2_5f7xbhrc,False,,0,False,"Nobody gives a fuck until you die, and even th...",[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188579,1583481000.0,0,,False,,1


In [47]:
#suicide_watch.to_csv('../data/suicide_watch.csv', index = False)
#depression.to_csv('../data/depression.csv', index = False)

In [21]:
#INVESTIGATING THE CASE OF r/SuicideWatch HAVING AN ADDITIONAL COLUMN
suicide_watch.columns.difference(depression.columns)

Index(['author_cakeday'], dtype='object')

In [34]:
#LOOKING INTO THAT ADDITIONAL COLUMN
suicide_watch['author_cakeday'].isnull().value_counts()

True     979
False      1
Name: author_cakeday, dtype: int64
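
With `author_cakeday` populated in only one row, one possible way to handle the mismatch is to keep just the columns both frames share before combining them. The sketch below uses tiny made-up frames and is not necessarily what this notebook does next; the variable names ending in `_demo` are placeholders.

```python
import pandas as pd

# Tiny made-up frames that mimic the column mismatch (illustrative only)
depression_demo = pd.DataFrame({"title": ["a", "b"], "is_suicide": [0, 0]})
suicide_demo = pd.DataFrame(
    {"title": ["c"], "author_cakeday": [True], "is_suicide": [1]}
)

# Columns present in one frame but not the other
extra_columns = suicide_demo.columns.difference(depression_demo.columns)

# Keep only the shared columns, then stack the two frames
shared_columns = suicide_demo.columns.intersection(depression_demo.columns)
combined = pd.concat(
    [depression_demo[shared_columns], suicide_demo[shared_columns]],
    ignore_index=True,
)
print(list(extra_columns))  # ['author_cakeday']
print(combined.shape)       # (3, 2)
```

An alternative is to concatenate without dropping anything, in which case pandas fills the missing `author_cakeday` values with `NaN` for the rows that lack it.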