# Suicide Prediction using Machine Learning

Here is the outline of the procedure:
1. Question
2. Understand the data
3. Manipulate the data
4. Fit & test models
5. Deploy models

## 1. Question

Due to the random nature of suicidality, traditional statistical methods have performed poorly at predicting it (SOURCE). This can be attributed to the fact that no single risk factor or risk assessment approach predicts suicide with high accuracy. Relying on a few strong predictors is therefore implausible, and greater emphasis must be placed on examining a larger number of factors together. Modelling complex combinations of hundreds of factors is not well suited to traditional statistical methods. Furthermore, these methods require the model to be specified before the data are collected (a priori) so that its assumptions can be accounted for.

In contrast, deep learning (DL) algorithms are self-optimizing: during training they adjust their own parameters using gradients computed by back-propagation, which improves predictive performance without manual re-specification. The algorithm therefore learns from the data itself, giving the analyst data-driven insights. This also allows DL algorithms to exploit high-dimensional data, where the large number of predictors lets the model learn parameters that can then be applied to future datasets. Finally, a meta-analysis by Franklin et al. (2017) found that traditional statistical models of suicide attempts have achieved only near-chance predictive performance, highlighting the need for more advanced data-driven methods.

Emerging machine learning research in suicidality has shown promising results compared to traditional methods. Hack et al. (2017) examined suicidality in the Grady Trauma Project among people seeking primary care in large urban hospitals. Their dataset comprised 1017 participants, split into training and test sets (80/20), each stratified to contain 16% suicide attempters. They fitted Support Vector Machine (SVM) and least absolute shrinkage and selection operator (LASSO) models over 100 iterations, which scored an area under the curve (AUC) of 0.70 and 0.71 respectively. Walsh et al. (2017) used machine learning to predict suicide attempts from electronic health records. They developed a Random Forest model, an ensemble of decision trees, which was run for 500 iterations. Their model predicted future suicide attempts with an AUC of 0.84, and accuracy improved as the prediction window shortened from 720 days to 7 days before the suicide attempt.
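
Since AUC is the metric reported by both studies, the following minimal sketch shows one way it can be computed from first principles: as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function name `roc_auc` is hypothetical, and the labels and risk scores are made-up toy values, not data from either study.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative counts as half a win.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Compare every positive score with every negative score.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example (made-up values): 1 = attempt, 0 = no attempt.
labels = [0, 0, 1, 0, 1, 1]
risk_scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
auc = roc_auc(labels, risk_scores)
print(f"AUC = {auc:.3f}")
```

Under this definition an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the values quoted above.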

Walsh et al. (2017) found that “recurrent depression with psychosis, schizophrenia, and schizoaffective disorder were consistently ranked highly in importance. Age and diagnoses of dependence on opioids, sedative-hypnotics, and cannabis increased in relative importance as prediction windows shortened. Some codes that likely indicate prior suicide attempts were also consistently predictive: poisoning, the most common mechanism of prior nonfatal suicide attempts in these data; injuries by firearms; and injuries “NEC” or not elsewhere classifiable. Medication classes such as selective serotonin reuptake inhibitors (SSRIs), benzodiazepines, anilides (such as acetaminophen), and propionic acid derivatives (such as ibuprofen) appear stronger within longer prediction windows. Melatonin receptor agonists such as melatonin supplements gain relative importance closer to the suicide attempt (i.e., shorter prediction windows).”

## 2. Understand the data

In [1]:
import numpy as np 
import pandas as pd 
baseline = pd.read_csv("../../data/t1_beta.csv")
core = pd.read_csv("../../data/core_labels.csv")



In [2]:
baseline.head()

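| Unnamed: 0 | Participant_ID | wave | today | INT_ID | phone | Postcode | dob | doby | dobm | dobd | ... | OME_over90 | threeorMoreCC | fourorMoreCC | Deceased1 | Withdrew_self | Withdrawn_all | Cannabis_12m | Cannabis12m_notbaseline | Cannabisbaseline_not12m | Cannabisbaseline_12m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26885 | 0 | 6/25/2013 | TM | Telephone | 4500 | 10/8/1951 | 1951 | 10 | 8 | ... | 0 | 0 | 0 | No | | 0 | No | No | No | No |
| 1 | 27928 | 0 | 10/25/2013 | Don't Know | Telephone | 3020 | 3/26/1974 | 1974 | 3 | 26 | ... | 0 | 0 | 0 | No | | 0 | No | No | No | No |
| 2 | 27112 | 0 | 8/23/2013 | Other | Telephone | 5052 | 11/8/1960 | 1960 | 11 | 8 | ... | 0 | 1 | 0 | No | Yes | 1 | No | No | No | No |
| 3 | 26719 | 0 | 6/20/2013 | CO | Telephone | 5276 | 7/15/1950 | 1950 | 7 | 15 | ... | 0 | 0 | 0 | Yes | | 2 | No | No | No | No |
| 4 | 28473 | 0 | 1/7/2014 | Don't Know | Telephone | 4575 | 7/30/1948 | 1948 | 7 | 30 | ... | 0 | 1 | 1 | No | | 0 | No | No | No | No |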


In [3]:
print("Shape:", baseline.shape)

Shape: (1514, 2903)


We can see that there are 1514 rows with 2903 variables for the baseline data.

In [6]:
# Variables we want are saved in a txt file in the current directory 
!ls

Project-Assignment.ipynb  model1.ipynb  model1_vars.txt


In [9]:
filename = "model1_vars.txt"

# The with statement closes the file automatically when the block ends
with open(filename, "r") as file:
    # Strip whitespace and skip blank lines so every entry is a column name
    variables = [line.strip() for line in file if line.strip()]

# Now we subset the data based on the variables we are interested in.
model1_data = baseline[variables]
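
Indexing a DataFrame with a list of names raises a `KeyError` if any name is not a column, for example because of a typo in the text file. The sketch below is one possible guard: it keeps the names that exist and reports the rest. The helper name `select_columns` and the toy frame are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

def select_columns(df, wanted):
    """Return df restricted to the wanted columns, plus the names not found."""
    present = [name for name in wanted if name in df.columns]
    missing = [name for name in wanted if name not in df.columns]
    return df[present], missing

# Toy frame standing in for `baseline`; the values are illustrative only.
toy = pd.DataFrame({"sex": ["M", "F"], "bmi": [24.1, 30.2], "edu": [3, 5]})
subset, missing = select_columns(toy, ["sex", "bmi", "not_a_column"])
print(list(subset.columns), missing)
```

Applied to the real data, this would be called as `select_columns(baseline, variables)`, and any names returned in `missing` could be checked against the text file before modelling.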

In [10]:
# Ideation
x = model1_data.drop('Suicidal_Thoughts_12m', axis=1)
# Treat single-space placeholder cells as missing values (np.NaN was removed in NumPy 2.0)
x = x.replace(" ", np.nan)

y = model1_data[['Suicidal_Thoughts_12m']]

# Attempts
# x = model1_data.drop('Suicide_Attempts_12m', axis=1)
# y = model1_data[['Suicide_Attempts_12m']]
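
A natural next step is to turn the target into numbers and hold out a stratified test set, similar to the 80/20 split described for Hack et al. (2017). The sketch below assumes the target is coded as "Yes"/"No" strings like the other categorical columns shown earlier; that coding is an assumption, since the values of `Suicidal_Thoughts_12m` are not displayed. The helper names and the toy series are hypothetical.

```python
import pandas as pd

def encode_yes_no(series):
    """Map 'Yes'/'No' strings to 1/0; anything else becomes missing (NaN)."""
    return series.map({"Yes": 1, "No": 0})

def stratified_split(y, test_frac=0.2, seed=0):
    """Split row labels into train/test while preserving the class balance of y."""
    y = y.dropna()
    # Sample the same fraction from each class to form the test set.
    test_idx = y.groupby(y).sample(frac=test_frac, random_state=seed).index
    train_idx = y.index.difference(test_idx)
    return train_idx, test_idx

# Toy target standing in for y['Suicidal_Thoughts_12m'] (values are made up).
raw = pd.Series(["No"] * 40 + ["Yes"] * 10)
target = encode_yes_no(raw)
train_idx, test_idx = stratified_split(target)
print(target.loc[train_idx].mean(), target.loc[test_idx].mean())
```

Both printed proportions should match the overall share of positive cases, which is the property a stratified split is meant to guarantee. The returned index labels can then be used with `x.loc[...]` and `y.loc[...]` to build the training and test sets.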

In [11]:
to_drop1 = variables[21:32]
to_drop2 = variables[38:51]

to_drop = to_drop1 + to_drop2

In [13]:
list(baseline)

['Participant_ID',
 'wave',
 'today',
 'INT_ID',
 'phone',
 'Postcode',
 'dob',
 'doby',
 'dobm',
 'dobd',
 'Actual_age',
 'Age_group',
 'age_under_58',
 'sex',
 'cob',
 'cob_s',
 'indig',
 'indig_yn',
 'edu',
 'tertiaryedu_yn',
 'MarritalStatus',
 'Employ',
 'Employ_sp',
 'Employ_chnge_pain',
 'Employ_before_pain',
 'Employ_before_pain_sp',
 'income_wk',
 'income_wk_partner',
 'income_lowhigh',
 'income_range',
 'income_yn',
 'income_partner_lowhigh',
 'accom',
 'accom_sp',
 'Live_Count',
 'Live_Spouse',
 'Live_Chldrn',
 'Live_Parents',
 'Live_Friends',
 'Live_OthRelatives',
 'Live_Housemates',
 'Live_Alone',
 'Live_Oth',
 'relativecount',
 'relative_yn',
 'otherlive',
 'otherlivewith_yn',
 'hcm',
 'hfeet',
 'hinch',
 'wkg',
 'wpnd',
 'hcmall',
 'wkgall',
 'bmi',
 'bmicat',
 'pain_wrk_related',
 'complete_daily_activ',
 'cond1',
 'cond2',
 'ever_Arth',
 'ever_Back',
 'ever_Head',
 'ever_Visc',
 'ever_Fibro',
 'ever_Cmplx',
 'ever_Shing',
 'ever_GenPain',
 'ever_OthPain',
 'num_chronic

First, I am going to test a model on basic features only.

## 3. Fit the model(s)

Here we will test four types of model:
1. Logistic regression
2. Support Vector Machines
3. Tree models
4. Artificial Neural Networks
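
The four model families listed above could be compared along the following lines. This is a rough sketch under several assumptions: it uses scikit-learn, which the notebook has not imported so far; the data are synthetic and imbalanced rather than the real features; a random forest stands in for "tree models"; and all hyperparameters are defaults or arbitrary small values, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the real features and target (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.84, 0.16], random_state=0,
)

# Scale-sensitive models are wrapped in a pipeline with standardisation.
models = {
    "Logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support vector machine": make_pipeline(StandardScaler(), SVC()),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    ),
}

# Stratified folds keep the share of positive cases similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: mean AUC = {results[name]:.3f}")
```

Scoring every model with the same cross-validation folds and the same metric keeps the comparison like for like; the numbers printed for the synthetic data say nothing about performance on the real dataset.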

### 3.1. Logistic Regression

### 3.2. Support Vector Machine

### 3.3. Tree Models

### 3.4. Artificial Neural Network

**REFERENCES**

Franklin, J., Ribeiro, J., Fox, K., Bentley, K., Kleiman, E., Huang, X., & Nock, M. (2017). Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychological Bulletin, 187-232.

Hack, L., Jovanovic, T., Carter, S., Ressler, K., & Smith, A. (2017). Suicide prediction using machine learning techniques in screening clinician derived data. Biological Psychiatry, 361-361.

Walsh, C., Ribeiro, J., & Franklin, J. (2017). Predicting Risk of Suicide Attempts Over Time Through Machine Learning. Clinical Psychological Science, 457-469.

# End.