# Binary Classification Model for Sentiment Analysis

## Overview:

The goal of this project is to develop a binary classification model that detects suicidal intent or suicidal thoughts in written text.

## Table of Contents:

1. Text **Cleaning, Vectorization and Splitting** for NLP
2. **Training** the Model (Logistic Regression)
3. **Evaluating** the Model's Performance


## ***1. Text Cleaning, Vectorization and Splitting for NLP***

In [2]:
import pandas as pd
    
# Reading CSV data to pandas dataframe
df = pd.read_csv('Suicide_Detection.csv', encoding="utf-8")

# Replacing all occurrences of 'suicide' with 1 and 'non-suicide' with 0 in the 'class' column of the dataframe
df['class'] = df['class'].replace({'suicide': 1, 'non-suicide': 0})

# Dropping duplicate entries in the dataframe
df = df.drop_duplicates()

# Print the head of the dataframe to check (a bare expression is only displayed when it is the last line of a cell)
print(df.head())

# Making sure the two classes are balanced across the whole dataset
df.groupby('class').describe()

Summary statistics of the `Unnamed: 0` column, grouped by `class`:

| class | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 116037.0 | 174078.570852 | 100600.329485 | 3.0 | 86843.0 | 174534.0 | 261124.0 | 348110.0 |
| 1 | 116037.0 | 174227.156183 | 100400.800343 | 2.0 | 87240.0 | 174196.0 | 261448.0 | 348108.0 |
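The balance check above reads the per-class `count` out of a full `describe()` summary. A more direct way to get the same information is `value_counts()`. The sketch below uses a hypothetical four-row frame (`toy_df`) in place of the real dataset, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (assumption)
toy_df = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "class": [1, 0, 1, 0],
})

# Absolute number of rows per class
class_counts = toy_df["class"].value_counts().sort_index()

# Share of rows per class; values near 0.5 indicate a balanced binary dataset
class_share = toy_df["class"].value_counts(normalize=True).sort_index()

print(class_counts)
print(class_share)
```

On the real dataframe the same two calls would be applied to `df['class']`.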


In [3]:
from sklearn.feature_extraction.text import CountVectorizer
 
# Vectorizing the text into counts of unigrams, bigrams and trigrams, with English stop words removed
# N-grams that appear in fewer than 30 distinct documents are excluded from the vocabulary

vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words='english', min_df=30)
x = vectorizer.fit_transform(df['text'])
y = df['class']
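To make the effect of `ngram_range` and `min_df` concrete, here is a minimal sketch on a hypothetical three-document corpus. The corpus, the variable names and the lower `min_df=2` are assumptions chosen so the example stays small, and stop-word removal is left out so the result is easy to trace by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from Suicide_Detection.csv
toy_docs = [
    "i feel so alone",
    "i feel fine today",
    "feel fine",
]

# ngram_range=(1, 3) builds unigrams, bigrams and trigrams;
# min_df=2 keeps only n-grams found in at least 2 documents
toy_vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# The learned vocabulary, sorted alphabetically
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)

# One row per document, one column per retained n-gram
print(toy_matrix.shape)
```

Only `feel`, `fine` and `feel fine` occur in two or more documents, so they are the only columns that survive. The default tokenizer also drops single-character tokens such as `i`.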

In [4]:
from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing; the remaining 75% is used for training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

## ***2. Training the Model***

In [9]:
from sklearn.linear_model import LogisticRegression
 
# Fitting a logistic regression classifier on the training split
model = LogisticRegression(max_iter=1000, random_state=0)
model.fit(x_train, y_train)
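The cells above fit `CountVectorizer` on the full dataset before splitting, so the vocabulary and the `min_df` filter also see the test documents. One possible alternative, shown here only as a sketch, is to split the raw text first and wrap both steps in a scikit-learn pipeline so the vocabulary is learned from the training split alone. The toy texts and labels are hypothetical stand-ins for `df['text']` and `df['class']`, and `stratify` is an added option that the notebook does not use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['text'] and df['class']
texts = [
    "i want to end it all", "life feels hopeless",
    "no reason to go on", "i cannot do this anymore",
    "great day at the beach", "loved the new movie",
    "pizza night with friends", "excited for the weekend",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class ratio in both splits
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The vectorizer inside the pipeline is fitted on the training texts only
pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000, random_state=0),
)
pipeline.fit(text_train, y_train)

# Raw strings go in; vectorization happens inside the pipeline
predictions = pipeline.predict(text_test)
print(predictions)
```

A side benefit of this design is that new text can be passed to `pipeline.predict_proba` directly, without a separate `vectorizer.transform` call.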

## ***3. Evaluating the Model's Performance***

In [37]:
from sklearn.metrics import confusion_matrix

# Creating the confusion matrix
confusion = confusion_matrix(y_test, model.predict(x_test))

confusion_df = pd.DataFrame(
    confusion,
    columns=['Predicted Negative', 'Predicted Positive'],
    index=['Actual Negative', 'Actual Positive']
)

# Displaying the Confusion Matrix
print("Confusion Matrix:")
print(confusion_df)

correct = confusion_df.loc['Actual Negative', 'Predicted Negative'] + confusion_df.loc['Actual Positive', 'Predicted Positive']
incorrect = confusion_df.loc['Actual Negative', 'Predicted Positive'] + confusion_df.loc['Actual Positive', 'Predicted Negative']

# Displaying the Accuracy
print("Accuracy: ", (correct/(correct+incorrect))*100, "%")


Confusion Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative               27505                1521
Actual Positive                2437               26556
Accuracy:  93.17809683034868 %
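Accuracy treats both kinds of error alike, but the confusion matrix printed above also gives precision and recall for the positive (suicide) class. The sketch below derives them from the four reported counts using the standard definitions. It is plain arithmetic on the published numbers and does not re-run the model.

```python
# Counts copied from the confusion matrix printed above
tn, fp = 27505, 1521   # actual negative row
fn, tp = 2437, 26556   # actual positive row

# Share of all test documents that were classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Of the documents flagged as positive, the share that really were positive
precision = tp / (tp + fp)

# Of the truly positive documents, the share the model caught
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

Recall comes out lower than precision because the model produces more false negatives (2437) than false positives (1521). For a screening task like this one, missed positive cases are arguably the costlier error.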


In [36]:
# Individual Prediction Test
model.predict_proba(vectorizer.transform(["I have everything, a good family, good friends, a great son, great house, plenty of money, but yet I feel empty. Recently i've been feeling very lonely and I have no one to talk to. In front of everyone else I act like nothing happened but every day I feel like I dont have a purpose anymore and I should jut end it."]))

array([[0.08645274, 0.91354726]])
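The two columns of `predict_proba` follow the order of `model.classes_`, so here the first value is the probability of class 0 (non-suicide) and the second the probability of class 1 (suicide). A minimal sketch of turning that probability into a label is shown below. The helper `flag_text` and the alternative threshold of 0.95 are hypothetical, introduced only for illustration.

```python
# Class probabilities reported above, ordered as [P(class 0), P(class 1)]
probabilities = [0.08645274, 0.91354726]


def flag_text(p_suicide, threshold=0.5):
    """Return 1 if the positive-class probability reaches the threshold, else 0.

    Hypothetical helper: a threshold of 0.5 mirrors the default decision rule
    of model.predict for a binary logistic regression.
    """
    return int(p_suicide >= threshold)


# With the default threshold the example text is flagged as class 1
print(flag_text(probabilities[1]))

# A stricter, purely illustrative threshold would not flag it
print(flag_text(probabilities[1], threshold=0.95))
```

Lowering the threshold below 0.5 would flag more texts, which generally raises recall at the cost of more false positives.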