# Project 3: Suicide Watch

## Notebook 1: Data Collection, Cleaning and Preprocessing

This notebook contains the Problem Statement and Executive Summary, followed by the code for data collection, data cleaning and preprocessing.

Preprocessing here consists of both stemming and lemmatizing, after which data can be visualised in the next notebook.

## Problem Statement

To predict whether the language of a post comes from an individual who is suicidal or depressed, using data web-scraped from the subreddits SuicideWatch and depression and modelled with Multinomial Naive Bayes and Logistic Regression. The subreddits allow free expression of emotions, which lets us study the difference between depressed and suicidal individuals in its most honest form. The information can help counsellors and psychologists in schools to classify children and adolescents as 'high-alert' cases if they are predicted to be suicidal, so that more attention can be given to them.

Success will be evaluated by accuracy, precision, recall and F1-score. For this problem, recall is the most important metric because we want to identify as many suicidal individuals as possible. Failing to identify them can be a matter of life and death, whereas staying on high alert for someone wrongly identified as suicidal does little harm.
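
As a rough illustration of how these four metrics relate, the sketch below computes them from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical, chosen for illustration only and not taken from this project's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    The positive class is 'suicidal' (label 1); all counts used here are hypothetical.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Illustrative counts only: 80 suicidal posts caught, 20 missed,
# 30 depression posts flagged by mistake, 70 correctly left alone.
metrics = classification_metrics(tp=80, fp=30, fn=20, tn=70)
print(metrics)
```

In this made-up case recall is 80 / (80 + 20) = 0.8: the 20 missed suicidal posts are the errors this project most wants to avoid, while the 30 false alarms only lower precision.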

## Executive Summary

The number of suicides in Singapore rose 10 per cent last year, with suicides among boys aged 10 to 19 at a record high, according to the Samaritans of Singapore (SOS). Against this backdrop, we investigate how we can better identify potential cases of suicide by looking at the words used by these individuals.

The subreddits Suicide Watch and depression will be used to study the language used. They offer firsthand accounts of the thought processes and emotional experiences of these individuals. The posts are cleaned by removing stopwords and the posts pinned by moderators. Multinomial Naive Bayes, Logistic Regression and K-nearest Neighbours models were then run with Count and Tfidf Vectorizers. The better models (Multinomial Naive Bayes and Logistic Regression with Tfidf Vectorizer) were then optimised before the best one was selected as the production model.

The model with the highest recall (sensitivity), at 81%, was Multinomial Naive Bayes with Tfidf Vectorizer. This means that 81% of suicidal cases were identified correctly. It will help psychologists and counsellors in a school setting to be more sensitive to such cases and to be on high alert to prevent loss of lives.

### Contents
- [Data Collection](#Data-Collection)
- [Data Cleaning and EDA](#Data-Cleaning-and-EDA)
    * [Combine Features](#Combine-Features)
- [Preprocessing](#Preprocessing)
    * [Tokenizing](#Tokenizing)
    * [Lemmatizing](#Lemmatizing)
    * [Stemming](#Stemming)

### Data Collection

In [1]:
# Read in libraries
import requests
import time
import pandas as pd
import random
import re

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split

The following has been commented out to prevent re-scraping of data.

In [2]:
# #Loop through 25 depression posts at a time
# depression_posts = []
# after = None
# url0 = 'https://www.reddit.com/r/depression/.json'

# for a in range(41):
#     if after == None:
#         current_url = url0
#     else:
#         current_url = url0 + '?after=' + after
#     print(current_url)
#     res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
#     if res.status_code != 200:
#         print('Status error', res.status_code)
#         break
    
#     current_dict = res.json()
#     current_posts = [p['data'] for p in current_dict['data']['children']]
#     depression_posts.extend(current_posts)
#     after = current_dict['data']['after']
    
#     if a > 0:
#         prev_posts = pd.read_csv('../datasets/depression.csv')
#         current_df = pd.DataFrame()
        
#     else:
#         pd.DataFrame(depression_posts).to_csv('../datasets/depression.csv', index = False)

#     # generate a random sleep duration to look more 'natural'
#     sleep_duration = random.randint(2,6)
#     print(sleep_duration)
#     time.sleep(sleep_duration)

In [3]:
# Check the number of posts
# print('The number of depression posts scraped: {}'.format(len(depression_posts)))

In [4]:
# Commented out the code below to prevent overwriting of saved dataset
# pd.DataFrame(depression_posts).to_csv('../datasets/depression.csv', index = False)

In [5]:
# Display scraped posts in a dataframe
# depression = pd.DataFrame(depression_posts)
# depression.head()

In [6]:
# # Loop through 25 suicide posts at a time
# suicide_posts = []
# after = None
# url1 = 'https://www.reddit.com/r/SuicideWatch/.json'

# # Request up to 41 pages:
# for a in range(41):
#     if after == None:
#         current_url = url1
#     else:
#         current_url = url1 + '?after=' + after
#     print(current_url)
#     res = requests.get(current_url, headers={'User-agent': 'Beep Inc 1.0'})
    
#     if res.status_code != 200:
#         print('Status error', res.status_code)
#         break
    
#     current_dict = res.json()
#     current_posts = [p['data'] for p in current_dict['data']['children']]
#     suicide_posts.extend(current_posts)
#     after = current_dict['data']['after']
    
#     if a > 0:
#         prev_posts = pd.read_csv('../datasets/suicide.csv')
#         current_df = pd.DataFrame()
        
#     else:
#         pd.DataFrame(suicide_posts).to_csv('../datasets/suicide.csv', index = False)

#     # generate a random sleep duration to look more 'natural'
#     sleep_duration = random.randint(2,6)
#     print(sleep_duration)
#     time.sleep(sleep_duration)

In [7]:
# Check the number of posts
# print('The number of suicide posts scraped: {}'.format(len(suicide_posts)))

In [8]:
# Commented out the code below to prevent overwriting of saved dataset
# pd.DataFrame(suicide_posts).to_csv('../datasets/suicide.csv', index = False)

In [9]:
# Display scraped posts in a dataframe
# suicide = pd.DataFrame(suicide_posts)
# suicide.head()
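
The two scraping loops above follow the same pattern: request a listing page, keep each child's `data` dictionary, and pass the returned `after` token to the next request. A minimal sketch of that pattern is shown below, with the HTTP call abstracted behind a hypothetical `fetch_page` callable so the logic can be exercised without contacting Reddit. The names `collect_posts`, `fake_fetch` and `FAKE_PAGES` are illustrative and not part of the original notebook, and the sketch assumes the listing reports `after` as null on its last page.

```python
def collect_posts(fetch_page, max_pages):
    """Follow Reddit-style 'after' tokens and gather post dictionaries.

    fetch_page(after) is a hypothetical callable that returns the parsed
    JSON of one listing page (the same shape res.json() gives above).
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)['data']
        posts.extend(child['data'] for child in listing['children'])
        after = listing['after']
        if after is None:  # no further pages to request
            break
    return posts


# Stand-in for the real request: two fake pages keyed by their 'after' token.
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'title': 'post 1'}}], 'after': 't3_a'}},
    't3_a': {'data': {'children': [{'data': {'title': 'post 2'}}], 'after': None}},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


posts = collect_posts(fake_fetch, max_pages=41)
print([p['title'] for p in posts])
```

In a real run, `fetch_page` would wrap the `requests.get(...)` call together with the status-code check and the random sleep between requests.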

### Data Cleaning and EDA

In [10]:
# Import data from saved csv file
depression = pd.read_csv('../datasets/depression.csv')
suicide = pd.read_csv('../datasets/suicide.csv')

In [11]:
# Check for missing values in depression dataframe
print('Missing values in depression selftext column: {}' .format(depression.selftext.isnull().sum()))
print('Missing values in depression title column: {}' .format(depression.title.isnull().sum()))

Missing values in depression selftext column: 1
Missing values in depression title column: 0


In [12]:
# Impute missing value with empty string
depression['selftext'] = depression['selftext'].fillna('')

In [13]:
# Check if it has been imputed
print('Missing values in depression selftext column: {}' .format(depression.selftext.isnull().sum()))

Missing values in depression selftext column: 0


In [14]:
# Check for missing values in suicide dataframe
print('Missing values in suicide selftext column: {}' .format(suicide.selftext.isnull().sum()))
print('Missing values in suicide title column: {}' .format(suicide.title.isnull().sum()))

Missing values in suicide selftext column: 61
Missing values in suicide title column: 0


In [15]:
# Impute missing value with empty string
suicide['selftext'] = suicide['selftext'].fillna('')

# Check if it has been imputed
print('Missing values in suicide selftext column: {}' .format(suicide.selftext.isnull().sum()))

Missing values in suicide selftext column: 0


#### Combine Features

In [16]:
# Select all duplicate rows based on one column
depression_dup = depression[depression.duplicated(['selftext'])]

# Check the number of duplicates
depression_dup.shape

(104, 100)

In [17]:
# Checking some of the contents of the posts
depression.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday
0,,depression,We understand that most people who reply immed...,t2_1t70,False,,0,False,Our most-broken and least-understood rules is ...,[],...,/r/depression/comments/doqwow/our_mostbroken_a...,no_ads,True,https://www.reddit.com/r/depression/comments/d...,611712,1572361000.0,0,,False,
1,,depression,Welcome to /r/depression's check-in post - a p...,t2_64qjj,False,,0,False,Regular Check-In Post,[],...,/r/depression/comments/exo6f1/regular_checkin_...,no_ads,True,https://www.reddit.com/r/depression/comments/e...,611712,1580649000.0,0,,False,
2,,depression,I've been feeling really depressed and lonely ...,t2_17aooz,False,,0,False,I hate it so much when you try and express you...,[],...,/r/depression/comments/fedwbi/i_hate_it_so_muc...,no_ads,False,https://www.reddit.com/r/depression/comments/f...,611712,1583503000.0,0,,False,
3,,depression,Seems like when anyone else in my life is feel...,t2_2e3v4lor,False,,0,False,"I'm tired of caring about other people, but no...",[],...,/r/depression/comments/feh19t/im_tired_of_cari...,no_ads,False,https://www.reddit.com/r/depression/comments/f...,611712,1583516000.0,0,,False,
4,,depression,I literally broke down crying and asked to go ...,t2_5v2j4itq,False,,0,False,I went to the hospital because I was having re...,[],...,/r/depression/comments/feel0k/i_went_to_the_ho...,no_ads,False,https://www.reddit.com/r/depression/comments/f...,611712,1583507000.0,0,,False,


The title contains useful descriptions as well, so the two columns will be joined together.

Notice that there are two pinned posts from the moderators, which will be removed later. These posts are not the voices of people suffering from depression; they only set out the rules that regulate the posts.
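
One detail worth watching when joining the columns: plain string concatenation inserts no separator, so the last word of a title fuses with the first word of the body. A hedged sketch of a join that adds a space and tolerates a missing body is given below. `join_fields` is a hypothetical helper, not the notebook's own code, and the body text is shortened from one of the scraped posts.

```python
def join_fields(title, selftext, sep=' '):
    """Join a post's title and body with a separator.

    A missing body (None or NaN, i.e. anything that is not a string)
    is treated as empty text.
    """
    body = selftext if isinstance(selftext, str) else ''
    return (title + sep + body).strip()


# Plain concatenation fuses 'me' and 'I' into a single token.
print('For all those broken kids like me' + 'I see so many')
# Joining with a space keeps the two words apart.
print(join_fields('For all those broken kids like me', 'I see so many'))
# A missing body leaves just the title.
print(join_fields('I give up', float('nan')))
```

With the dataframes here, this could be applied row by row, for example `[join_fields(t, s) for t, s in zip(depression['title'], depression['selftext'])]`.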

In [18]:
depression['title_selftext'] = depression['title'] + depression['selftext']

In [19]:
# Check for duplicates again and get the number of duplicated posts
depression_join_dup = depression[depression.duplicated(['title_selftext'])]
depression_join_dup.shape

(104, 101)

In [20]:
# Drop duplicates
depression.drop_duplicates(subset='title_selftext', keep='first', inplace=True)
# Check shape of depression df after dropping
depression.shape

(914, 101)

In [21]:
# Remove the pinned posts from the moderator
depression.drop([0,1], inplace=True)

In [22]:
# Select all duplicate rows based on one column and get the number of duplicate posts
suicide_dup = suicide[suicide.duplicated(['selftext'])]
suicide_dup.shape

(88, 100)

In [23]:
# Check the duplicated data
suicide_dup.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday
15,,SuicideWatch,,t2_6ur9j,False,,0,False,I have two brothers who have killed themselves...,[],...,/r/SuicideWatch/comments/feenlk/i_have_two_bro...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583507000.0,0,,False,
57,,SuicideWatch,,t2_35asc889,False,,0,False,I give up,[],...,/r/SuicideWatch/comments/fepwuq/i_give_up/,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583554000.0,0,,False,
62,,SuicideWatch,,t2_5f7xbhrc,False,,0,False,"Nobody gives a fuck until you die, and even th...",[],...,/r/SuicideWatch/comments/fea9x1/nobody_gives_a...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583481000.0,0,,False,
80,,SuicideWatch,,t2_3gs1nbv2,False,,0,False,"One day, I’ll step in front of a moving truck ...",[],...,/r/SuicideWatch/comments/feokwz/one_day_ill_st...,,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583548000.0,0,,False,
95,,SuicideWatch,,t2_467pkhb3,False,,0,False,If someone checks themselves into a psych hosp...,[],...,/r/SuicideWatch/comments/feqiyb/if_someone_che...,,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583558000.0,0,,False,


Some titles contain useful content that is not in the selftext, and some selftext fields are empty. The two columns will be joined to capture all the text, as was done for the depression posts.

In [24]:
suicide['title_selftext'] = suicide['title'] + suicide['selftext']

In [25]:
# Check for duplicates again and get the number of duplicated posts
suicide_join_dup = suicide[suicide.duplicated(['title_selftext'])]
suicide_join_dup.shape

(30, 101)

In [26]:
# Drop duplicates
suicide.drop_duplicates(subset='title_selftext', keep='first', inplace=True)
# Check shape of suicide df after dropping
suicide.shape

(980, 101)

In [27]:
suicide.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,title_selftext
0,,SuicideWatch,We've been seeing a worrying increase in pro-s...,t2_1t70,False,,1,False,New wiki on how to avoid accidentally encourag...,[],...,no_ads,True,https://www.reddit.com/r/SuicideWatch/comments...,188688,1567526000.0,0,,False,,New wiki on how to avoid accidentally encourag...
1,,SuicideWatch,"If you want to recognise an occasion, please d...",t2_1t70,False,,0,False,Reminder: Absolutely no activism of any kind i...,[],...,no_ads,True,https://www.reddit.com/r/SuicideWatch/comments...,188688,1568093000.0,0,,False,,Reminder: Absolutely no activism of any kind i...
2,,SuicideWatch,"Please, take a moment to listen.\n\nI cannot s...",t2_5nixf695,False,,1,False,"I ask you kindly to stop what you are doing, m...",[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583541000.0,0,,False,,"I ask you kindly to stop what you are doing, m..."
3,,SuicideWatch,About a year ago I made a post on this subredd...,t2_1ycht3ps,False,,0,False,To /u/Ryfflex - if you are still out there. Pl...,[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583518000.0,0,,False,,To /u/Ryfflex - if you are still out there. Pl...
4,,SuicideWatch,I see so many people just like me on this subr...,t2_4dhvypdd,False,,0,False,For all those broken kids like me,[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583548000.0,0,,False,,For all those broken kids like meI see so many...


Remove the posts pinned by the moderators, because they do not represent posts by people experiencing suicidal thoughts.
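
Dropping rows by hard-coded index labels works only while the pinned posts sit at labels 0 and 1. Since the scraped data includes a `stickied` flag (visible in the tables above, where it is `True` for the two moderator posts), an alternative is to filter on that flag. The sketch below shows the idea on plain dictionaries with made-up titles; `drop_stickied` is a hypothetical helper.

```python
def drop_stickied(posts):
    """Return only the posts that are not pinned (stickied) by moderators."""
    return [post for post in posts if not post.get('stickied', False)]


# Made-up records mimicking the fields of the scraped post dictionaries.
sample_posts = [
    {'title': 'Moderator announcement', 'stickied': True},
    {'title': 'A user post', 'stickied': False},
    {'title': 'Another user post', 'stickied': False},
]

kept = drop_stickied(sample_posts)
print([post['title'] for post in kept])
```

On a dataframe the equivalent would be a boolean mask such as `suicide[~suicide['stickied']]`, assuming the column was read in as booleans.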

In [28]:
# Remove the pinned posts from the moderator
suicide.drop([0,1], inplace=True)

### Preprocessing

In [29]:
# Concatenate the two dataframes to be processed together
df = pd.concat([suicide, depression], axis=0, ignore_index=True)
df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,title_selftext
0,,SuicideWatch,"Please, take a moment to listen.\n\nI cannot s...",t2_5nixf695,False,,1,False,"I ask you kindly to stop what you are doing, m...",[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583541000.0,0,,False,,"I ask you kindly to stop what you are doing, m..."
1,,SuicideWatch,About a year ago I made a post on this subredd...,t2_1ycht3ps,False,,0,False,To /u/Ryfflex - if you are still out there. Pl...,[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583518000.0,0,,False,,To /u/Ryfflex - if you are still out there. Pl...
2,,SuicideWatch,I see so many people just like me on this subr...,t2_4dhvypdd,False,,0,False,For all those broken kids like me,[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583548000.0,0,,False,,For all those broken kids like meI see so many...
3,,SuicideWatch,I can’t tolerate anybody. I hate the sound of ...,t2_5flkjo17,False,,0,False,Is anyone angry and bitter all the time?,[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583534000.0,0,,False,,Is anyone angry and bitter all the time?I can’...
4,,SuicideWatch,,t2_5nc0f3ro,False,,0,False,NOW I COULD USE SOMEONE TO TALK TO,[],...,no_ads,False,https://www.reddit.com/r/SuicideWatch/comments...,188688,1583551000.0,0,,False,,NOW I COULD USE SOMEONE TO TALK TO


To predict a binary variable, label posts from the SuicideWatch subreddit as 1 and posts from depression as 0.

In [30]:
df['suicide'] = df['subreddit'].apply(lambda x: 1 if x=='SuicideWatch' else 0)

#### Tokenizing

We will start with tokenizing, lowercasing, removing stopwords and saving the tokens into a list to be lemmatized and stemmed later. 

In [31]:
# Instantiate Tokenizer
tokenizer = RegexpTokenizer(r'\w+')
# Build the stopword set once; set lookups are far cheaper than re-reading the list per token
stop_words = set(stopwords.words('english'))
tokens = []

for text in df['title_selftext']:
    loop_tokens = tokenizer.tokenize(text.lower())

    for j, token in enumerate(loop_tokens):
        # Remove non-letters (\w also matches digits and underscores)
        token = re.sub('[^a-zA-Z]', '', token)
        loop_tokens[j] = token

        # Replace urls and stopwords with an empty string
        if 'http' in token or token in stop_words:
            loop_tokens[j] = ''

    tokens.append(loop_tokens)

In [32]:
# Check token
tokens[2][:10]

['', '', '', 'broken', 'kids', 'like', 'mei', 'see', '', 'many']

#### Lemmatizing

While stemming and lemmatization both reduce inflected words to a root form, a stem might not be an actual word, whereas a lemma is always a valid word in the language.

Both of these techniques are designed with recall in mind, and precision tends to suffer as a result. In the case of depression and suicide, we want a recall near 1.0, meaning we want to find every individual who might actually attempt suicide, and we can accept a lower precision since extra attention given to an individual suffering from depression would not be harmful.

So we will try cleaning the data with both Lemmatization and Stemming and compare the results.

In [33]:
# Instantiate Lemmatizer
lem = WordNetLemmatizer()

posts_token_lem = []
for post in tokens:
    post_lem = []
    
    for word in post:
        #print(word)
        word_lem = lem.lemmatize(word) # get lemmatized word
        post_lem.append(word_lem) # add to post list
    posts_token_lem.append(post_lem)  # add post list to lemma matrix

# Check lemmatized token
posts_token_lem[2][:10]

['', '', '', 'broken', 'kid', 'like', 'mei', 'see', '', 'many']

In [34]:
# Format tokenized lemma for vectorizer i.e. change to a list of strings
lem_list = []

for post in posts_token_lem:
    lem_list.append(' '.join(post))

#### Stemming
We will improve the modeling ability of our strings by using a stemmer, which trims characters from each word to convert it to a stem. Words will register as equivalent during feature extraction if they share a stem.
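
To make the idea of shared stems concrete, the short sketch below runs NLTK's `PorterStemmer` on a handful of words that appear in these posts. The word list is chosen for illustration. Note that stems such as `mani` or `peopl` are not dictionary words.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# 'kid' and 'kids' collapse to one stem, so a vectorizer would count them together.
words = ['kid', 'kids', 'many', 'people', 'kindly']
stems = [stemmer.stem(word) for word in words]
print(list(zip(words, stems)))
```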

In [35]:
# Instantiate Stemmer
stem = PorterStemmer()

posts_token_stem = []
for post in tokens:
    post_stem = [] # empty post stems
    for word in post:
        #print(word)
        word_stem = stem.stem(word) # get stemmed word
        post_stem.append(word_stem) # add to post list
    posts_token_stem.append(post_stem)  # add post list to stem matrix

# Check stemmed token
posts_token_stem[2][:10]

['', '', '', 'broken', 'kid', 'like', 'mei', 'see', '', 'mani']

In [36]:
# Format tokenized stems for vectorizer i.e. change to a list of strings
stem_list = []

for post in posts_token_stem:
    stem_list.append(' '.join(post))

#### Save preprocessed data to csv

In [37]:
# Build the dataframe column by column from the two lists of joined strings
df_prep = pd.DataFrame({'post_lem': lem_list, 'post_stem': stem_list})

In [38]:
df_prep.head()

Unnamed: 0,post_lem,post_stem
0,ask kindly stop make tea read text p...,ask kindli stop make tea read text p...
1,u ryfflex still please let know year ...,u ryfflex still pleas let know year a...
2,broken kid like mei see many people like ...,broken kid like mei see mani peopl like ...
3,anyone angry bitter time tolerate anybo...,anyon angri bitter time toler anybodi ...
4,could use someone talk,could use someon talk


In [39]:
# Add in target variable
df_prep['suicide'] = df['suicide']
df_prep.head()

Unnamed: 0,post_lem,post_stem,suicide
0,ask kindly stop make tea read text p...,ask kindli stop make tea read text p...,1
1,u ryfflex still please let know year ...,u ryfflex still pleas let know year a...,1
2,broken kid like mei see many people like ...,broken kid like mei see mani peopl like ...,1
3,anyone angry bitter time tolerate anybo...,anyon angri bitter time toler anybodi ...,1
4,could use someone talk,could use someon talk,1


In [40]:
# Save to datasets folder
df_prep.to_csv('../datasets/preprocessed.csv', index=False)

## To be Continued in Notebook 2: EDA and Model Selection