Suicide has long been one of the major causes of death worldwide. According to Wikipedia, suicide resulted in 828,000 global deaths in 2015, an increase from 712,000 deaths in 1990, which makes it the 10th leading cause of death worldwide. There is also increasing evidence that the Internet and social media can influence suicide-related behaviour. Using Natural Language Processing, a field of Machine Learning, I built a very simple suicidal ideation classifier that predicts whether a text is likely to be suicidal or not.

Sentiment analysis refers to identifying and classifying the sentiments expressed in a text source. Tweets are a plentiful source of such text, and analysing them yields a large amount of sentiment data. These data are useful for understanding people's opinions on a variety of topics.

In [None]:
#Importing necessary libraries
import numpy as np
import pandas as pd

In [None]:
# Get the data
df = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/twitter-suicidal-intention-dataset/master/twitter-suicidal_data.csv")
df.head()

Unnamed: 0,tweet,intention
0,my life is meaningless i just want to end my l...,1
1,muttering i wanna die to myself daily for a fe...,1
2,work slave i really feel like my only purpose ...,1
3,i did something on the 2 of october i overdose...,1
4,i feel like no one cares i just want to die ma...,1


In [None]:
# Counting the data each label holds
df['intention'].value_counts()

0    5121
1    3998
Name: intention, dtype: int64

<h3>Preprocessing</h3>

In [None]:
import re
def clean_tweets(text):
    text = re.sub(r'http\S+\s*', ' ', text) #Remove URLs (raw string avoids invalid-escape warnings)
    text = re.sub(r'@[A-Za-z0-9]+','',text) #Removing @ mentions
    text = re.sub(r'#','',text) #Removing the hashtag symbol
    text = re.sub(r'RT[\s]+','',text) #Removing RT
    text = re.sub(r"(.)\1{2,}", r"\1", text) #Collapse any character repeated 3 or more times into one
    
    return text
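
To see what each substitution does, here is a small check of the same rules on a made-up tweet. The sample string is purely illustrative and not taken from the dataset, and the function is repeated only so that the snippet runs on its own.

```python
import re

def clean_tweets(text):
    text = re.sub(r'http\S+\s*', ' ', text)    # remove URLs
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove @ mentions
    text = re.sub(r'#', '', text)              # remove the hashtag symbol
    text = re.sub(r'RT[\s]+', '', text)        # remove the RT marker
    text = re.sub(r'(.)\1{2,}', r'\1', text)   # collapse 3+ repeats into one
    return text

# Hypothetical input, chosen to trigger every rule once.
sample = "RT @user1 this is sooooo cool #fun http://example.com/x"
cleaned = clean_tweets(sample)

# strip() drops the spaces left behind where the URL used to be.
print(cleaned.strip())  # this is so cool fun
```

Note that a double letter such as the "oo" in "cool" is left alone, because the last rule only fires on three or more repeats.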

In [None]:
df['tweet'] = df['tweet'].apply(clean_tweets)

In [None]:
df.head()

Unnamed: 0,tweet,intention
0,my life is meaningless i just want to end my l...,1
1,muttering i wanna die to myself daily for a fe...,1
2,work slave i really feel like my only purpose ...,1
3,i did something on the 2 of october i overdose...,1
4,i feel like no one cares i just want to die ma...,1


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
#Keeping only the 20000 most frequent n-grams as features
#ngram_range=(1,3) ---> n-grams of length 1, 2 and 3
#analyzer='char' ---> tokenization is character by character, so these are character unigrams, bigrams and trigrams (not words)
text_clf_rf = Pipeline([('word_vectorizer', TfidfVectorizer(max_features = 20000, ngram_range=(1,3), analyzer='char')),
                    ('clf', RandomForestClassifier())])

text_clf_knn = Pipeline([('word_vectorizer', TfidfVectorizer(max_features = 20000, ngram_range=(1,3), analyzer='char')),
                    ('clf', KNeighborsClassifier())])

text_clf_lsvc = Pipeline([('word_vectorizer', TfidfVectorizer(max_features = 20000, ngram_range=(1,3), analyzer='char')),
                    ('clf', LinearSVC())])

print ("Feature completed .....")

Feature completed .....
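
As a quick illustration of what `analyzer='char'` with `ngram_range=(1,3)` produces, the sketch below fits the same kind of vectorizer on a single toy string and lists the n-grams it learns. The string `"abc"` is an assumption made for the example; `max_features` is left out because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same analyzer and n-gram range as the pipelines, fitted on one toy string.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), analyzer='char')
vectorizer.fit(["abc"])

# vocabulary_ maps every learned n-gram to its column index.
ngrams = sorted(vectorizer.vocabulary_)
print(ngrams)  # ['a', 'ab', 'abc', 'b', 'bc', 'c']
```

Every feature is a run of one to three characters, which is why the pipelines can pick up on misspellings and word fragments without any word-level tokenization.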


In [None]:
# Splitting the dataset into training and testing data
X_train,X_test,y_train,y_test = train_test_split(df['tweet'],df['intention'],random_state=0, test_size=0.2)
print(X_train.shape)
print(X_test.shape)

(7295,)
(1824,)
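
The two labels are not perfectly balanced (5121 vs 3998), so one optional variation is to pass `stratify=` to `train_test_split` so that both splits keep the same label ratio. This is not what the notebook does; the sketch below only shows the effect on hypothetical toy labels.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 samples of class 0 and 40 of class 1.
texts = [f"text {i}" for i in range(100)]
labels = [0] * 60 + [1] * 40

# stratify=labels keeps the 60/40 ratio in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```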


# Linear SVC

In [None]:
#Fitting the data to the model
text_clf_lsvc.fit(X_train,y_train)
prediction = text_clf_lsvc.predict(X_test)

In [None]:
print("\n Classification report for classifier %s:\n%s\n" % (text_clf_lsvc, metrics.classification_report(y_test, prediction)))


 Classification report for classifier Pipeline(memory=None,
         steps=[('word_vectorizer',
                 TfidfVectorizer(analyzer='char', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=20000,
                                 min_df=1, ngram_range=(1, 3), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_in

# K-Neighbors Classifier

In [None]:
#Fitting the data to the model
text_clf_knn.fit(X_train,y_train)
prediction = text_clf_knn.predict(X_test)

In [None]:
# Pass text_clf_knn (not text_clf_lsvc) so that the report header names the KNN pipeline
print("\n Classification report for classifier %s:\n%s\n" % (text_clf_knn, metrics.classification_report(y_test, prediction)))


 Classification report for classifier Pipeline(memory=None,
         steps=[('word_vectorizer',
                 TfidfVectorizer(analyzer='char', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=20000,
                                 min_df=1, ngram_range=(1, 3), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_in

# Random Forest Classifier

In [None]:
#Fitting the data to the model
text_clf_rf.fit(X_train,y_train)
prediction = text_clf_rf.predict(X_test)

In [None]:
print("\n Classification report for classifier %s:\n%s\n" % (text_clf_rf, metrics.classification_report(y_test, prediction)))


 Classification report for classifier Pipeline(memory=None,
         steps=[('word_vectorizer',
                 TfidfVectorizer(analyzer='char', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=20000,
                                 min_df=1, ngram_range=(1, 3), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 toke...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                      

Linear SVC achieves higher accuracy than both the K-Neighbors and Random Forest classifiers, so I saved the final model with the Linear SVC classifier.
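
A compact way to put the three models side by side is to fit each pipeline in a loop and collect one accuracy number per model. The sketch below is a minimal version of that idea: `build_pipeline` and `compare_accuracy` are hypothetical helper names, and the toy sentences are made up so that the snippet runs on its own. With the notebook's data, the `X_train`, `y_train`, `X_test` and `y_test` from the split would be passed in instead.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(clf):
    """Character n-gram TF-IDF followed by the given classifier."""
    return Pipeline([
        ('word_vectorizer', TfidfVectorizer(max_features=20000, ngram_range=(1, 3), analyzer='char')),
        ('clf', clf),
    ])


def compare_accuracy(pipelines, X_train, y_train, X_test, y_test):
    """Fit every pipeline and return a {name: test accuracy} dict."""
    scores = {}
    for name, pipe in pipelines.items():
        pipe.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipe.predict(X_test))
    return scores


# Hypothetical toy data, only here to make the example self-contained.
toy_X_train = ["good day", "nice day", "great time", "lovely time", "fine day", "good time",
               "bad night", "awful night", "poor sleep", "dreadful sleep", "bad sleep", "awful time"]
toy_y_train = [0] * 6 + [1] * 6
toy_X_test = ["nice time", "poor night"]
toy_y_test = [0, 1]

pipelines = {
    'LinearSVC': build_pipeline(LinearSVC()),
    'KNeighbors': build_pipeline(KNeighborsClassifier()),
    'RandomForest': build_pipeline(RandomForestClassifier(random_state=0)),
}
scores = compare_accuracy(pipelines, toy_X_train, toy_y_train, toy_X_test, toy_y_test)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

The numbers printed for the toy data say nothing about the real dataset; the point is only the shape of the comparison.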

In [None]:
test_ans = '''I don't feel like I can stand anymore stress in my life. 
                I want to end it for once and all.'''
text_clf_lsvc.predict([test_ans])[0]

1

In [None]:
test_ans_pos = '''I wish tremendous joy and good health to you and your family.'''
text_clf_lsvc.predict([test_ans_pos])[0]

0

In [None]:
# Saving the trained model so that it can be loaded later in the Twitter sentiment analysis model
import pickle

In [None]:
with open('suicide_tendency_model', 'wb') as to_write:
    pickle.dump(text_clf_lsvc, to_write)
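
Loading the model back is the mirror image of saving it: open the file in `'rb'` mode and call `pickle.load`. The sketch below round-trips a tiny stand-in pipeline through a temporary file; the toy training sentences and the file name `toy_model.pkl` are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny stand-in for text_clf_lsvc, trained on hypothetical toy data.
model = Pipeline([
    ('word_vectorizer', TfidfVectorizer(ngram_range=(1, 3), analyzer='char')),
    ('clf', LinearSVC()),
])
model.fit(["good day", "nice day", "bad night", "awful night"], [0, 0, 1, 1])

# Write to a temporary path rather than 'suicide_tendency_model'.
path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
with open(path, 'wb') as to_write:
    pickle.dump(model, to_write)

# Read the pipeline back; it includes the fitted vectorizer and classifier.
with open(path, 'rb') as to_read:
    restored = pickle.load(to_read)

# The restored pipeline should predict exactly what the original does.
samples = ["good night", "awful day"]
same = list(restored.predict(samples)) == list(model.predict(samples))
print(same)
```

Because the whole pipeline is pickled, raw text can be passed straight to `restored.predict`. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.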

Thus the model classifies suicidal texts with an accuracy of 93%.