This is the first of three notebooks that analyze whether a depressed post can be distinguished from a suicidal one, using what I'd like to call "pseudo-supervised" learning. The idea is to:

1. Pull posts from two subreddits, **r/depression and r/SuicideWatch**, whose posts we assume can be labeled as depressed and suicidal, respectively
2. Conduct topic modeling of the two subreddits to see if differentiation between the two types of posts can be discerned
3. Build classification models to see how well we could label a post as being depressed or suicidal

This first notebook pulls about 4000 posts from each of the subreddits mentioned.

In [2]:
!pip install psaw

Collecting psaw
  Downloading https://files.pythonhosted.org/packages/60/b7/6724defc12bdcc45470e2b1fc1b978367f3d183ec6c6baa2770a0b083fc7/psaw-0.0.7-py3-none-any.whl
Installing collected packages: psaw
Successfully installed psaw-0.0.7


In [3]:
import pandas as pd
from psaw import PushshiftAPI
import requests
import json
import datetime
import math
import time  # needed for time.sleep() in single_pull

In [4]:
def single_pull(subreddit, size=1000, beforeEpochDate=""):
    """
    This function makes one pull of posts from a given subreddit. The maximum number of posts per pull is 1000.
    
    subreddit: subreddit you want posts from
    size: optional parameter. Number of posts you want from a pull, with a max of 1000
    beforeEpochDate: optional parameter. Epoch timestamp; only posts created before it are returned.
        An empty string starts from the newest posts
    
    returns: tuple of (postsDF, epochKey)
        postsDF: DataFrame including title and body of all posts
        epochKey: epoch date associated with last post in pull    
    """
    success = False
    while success is False:
        page_request = requests.get("https://api.pushshift.io/reddit/search/submission/?subreddit={}&size={}&before={}".format(subreddit, size, beforeEpochDate))
        raw_pull = page_request.json()        
        #If there's an error, sleep for 2 seconds. The rate limit is 30 requests per minute.
        if "error" not in raw_pull.keys():
            success = True
        else:
            print("Pull Failed")
            time.sleep(2)
    # Create DataFrame
    titles = []
    bodies = []
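    for post in raw_pull["data"]:
        # Keep a post only if it has both fields, so titles and bodies stay aligned
        if "selftext" in post and "title" in post:
            titles.append(post["title"])
            bodies.append(post["selftext"])
    postsDF = pd.DataFrame({"Title":titles, "Body":bodies}, index = range(0,len(titles)))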
    # Key to use to reference last post of this pull
    epochKey = raw_pull["data"][-1]["created_utc"]
    return (postsDF, epochKey)
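The parsing half of `single_pull` can be tried without touching the network. The sketch below is a minimal illustration, not the notebook's own code: `posts_to_frame` and the `sample` payload are hypothetical, and the payload only mimics the three fields `single_pull` reads (`title`, `selftext`, `created_utc`). It assumes results arrive newest first, which is what the paging key in `single_pull` relies on.

```python
import pandas as pd

def posts_to_frame(payload):
    """Turn a Pushshift-style response dict into (DataFrame, oldest created_utc)."""
    data = payload.get("data", [])
    # Keep only posts that carry both fields so the two columns stay aligned
    rows = [
        {"Title": post["title"], "Body": post["selftext"]}
        for post in data
        if "title" in post and "selftext" in post
    ]
    frame = pd.DataFrame(rows, columns=["Title", "Body"])
    # With newest-first results, the last entry is the oldest post in the pull
    oldest = data[-1]["created_utc"] if data else None
    return frame, oldest

# Hypothetical payload: the second post has no "selftext" and is skipped
sample = {
    "data": [
        {"title": "first", "selftext": "body one", "created_utc": 3000},
        {"title": "link post", "created_utc": 2000},
        {"title": "third", "selftext": "body three", "created_utc": 1500},
    ]
}
frame, oldest = posts_to_frame(sample)
print(frame)
print(oldest)
```

Returning `None` for an empty pull is a design choice of this sketch; `single_pull` itself would raise an `IndexError` on an empty `data` list.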

In [5]:
def pull_x_posts(subreddit, numPosts):
    """
    This function pulls a given number of posts from a given subreddit
    
    subreddit: subreddit you want to pull from
    numPosts: number of posts you want from that subreddit
    
    returns: compiledDF--DataFrame with title and body for each post
    """
    pulls_needed = math.ceil(numPosts/1000)
    pulls = []
    # Epoch date of the oldest post seen so far; each pull asks for posts before it
    before_key=""
    for i in range(0,pulls_needed):
        current_pull = single_pull(subreddit, beforeEpochDate=before_key)
        pulls.append(current_pull[0])
        before_key = current_pull[1]
    if not pulls:
        return pd.DataFrame(columns=["Title", "Body"])
    # DataFrame.append was removed in pandas 2.0, so concatenate all pulls at once
    compiledDF = pd.concat(pulls, ignore_index=True)
    return compiledDF 
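The paging idea in `pull_x_posts` is that each pull hands back the epoch date of its oldest post, and the next pull asks only for posts before that date. The sketch below shows that loop against a stand-in data source so it runs offline. `pull_pages` and `fake_fetch` are hypothetical names introduced here, and the integer timestamps are made up for illustration.

```python
import math
import pandas as pd

def pull_pages(fetch_page, num_posts, page_size=1000):
    """Page backward in time by feeding each page's oldest timestamp into the next call."""
    frames = []
    before = ""
    for _ in range(math.ceil(num_posts / page_size)):
        frame, before = fetch_page(before)
        frames.append(frame)
        if before is None:  # nothing older is left
            break
    if not frames:
        return pd.DataFrame(columns=["Title", "Body"])
    return pd.concat(frames, ignore_index=True)

# Stand-in for the API: ten posts with timestamps 10 (newest) down to 1 (oldest)
TIMESTAMPS = list(range(10, 0, -1))

def fake_fetch(before, page_size=3):
    upper = float("inf") if before == "" else before
    page = [t for t in TIMESTAMPS if t < upper][:page_size]
    frame = pd.DataFrame(
        {"Title": [f"post {t}" for t in page], "Body": ["..."] * len(page)},
        columns=["Title", "Body"],
    )
    return frame, (page[-1] if page else None)

result = pull_pages(fake_fetch, num_posts=6, page_size=3)
print(result["Title"].tolist())
```

Because each page is strictly older than the previous one in this stand-in, no post appears twice. With real data, posts that share a timestamp at a page boundary can be missed or repeated, which is one reason the cleaning cells call `drop_duplicates`.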

In [6]:
r_depression = pull_x_posts("depression", 4000)
r_depression.drop_duplicates(inplace=True)
r_depression = r_depression[r_depression["Body"] != "[removed]"]
r_depression.to_csv("r_depression.csv")
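The cleaning steps above can be checked on a toy frame. The sketch below uses made-up rows and an in-memory buffer instead of a file. It also shows a side effect worth knowing for the later notebooks: `to_csv` writes the index by default, so reading the file back without `index_col=0` produces an extra `Unnamed: 0` column.

```python
import io
import pandas as pd

# Made-up rows: one exact duplicate and one removed post
toy = pd.DataFrame({
    "Title": ["a", "a", "b", "c"],
    "Body": ["text one", "text one", "[removed]", "text two"],
})

# Same two cleaning steps as the notebook cell
cleaned = toy.drop_duplicates()
cleaned = cleaned[cleaned["Body"] != "[removed]"]

# Round-trip through CSV using a buffer in place of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer)

buffer.seek(0)
naive = pd.read_csv(buffer)                  # index comes back as a data column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index is restored as the index

print(naive.columns.tolist())
print(restored.columns.tolist())
```

The surviving rows keep their original index labels, so the index has gaps after filtering. Calling `reset_index(drop=True)` before saving is one way to avoid that, if a contiguous index matters downstream.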

In [9]:
print(r_depression.shape)
r_depression.head()

(3796, 2)


Unnamed: 0,Title,Body
0,Not sure if I'm being annoying and overbearing...,This semester a friend of mine has skipped sev...
1,Why can't I just be homeless and die in the cold,I dont want to work I dont want to get up I do...
2,I’m better,I’d officially better and I’m ready to leave t...
3,i have therapy and i’m not going,i’ve got a therapy appointment in 1 hour and i...
4,This one girl has actually driven me into depr...,This girl and I had known each other for a whi...


In [7]:
r_suicide = pull_x_posts("suicidewatch", 4000)
r_suicide.drop_duplicates(inplace=True)
r_suicide = r_suicide[r_suicide["Body"] != "[removed]"]
r_suicide.to_csv("r_suicide.csv")

In [10]:
print(r_suicide.shape)
r_suicide.head()

(3900, 2)


Unnamed: 0,Title,Body
0,Holy crap I’m back,Is that what this account will be? Pouring my ...
1,Telling someone not to kill themselves seems a...,"Also, stop telling people ""Think about your fa..."
2,My life is meaningless,\n\n\n\nI just want to kill myself. I cannot g...
3,I think this is my last option,It's not that I want to die. It's that I have ...
4,I don't think I can do this,I have midterms and the last thing I wanna do ...
