# Binary Classification Model for Sentiment Analysis

## Overview:

The goal of this project is to develop a binary classification model that detects suicidal intent or thoughts in text.

## Table of Contents:

1. Text **Cleaning, Vectorization and Splitting** for NLP
2. **Training** the Model (Logistic Regression)
3. **Evaluating** the Model's Performance
    - Confusion Matrix
    - Precision, Recall, and F1 Score


## ***1. Text Cleaning, Vectorization and Splitting for NLP***

In [1]:
import pandas as pd
    
# Reading CSV data to pandas dataframe
df = pd.read_csv('Suicide_Detection.csv', encoding="utf-8")

# Replacing all occurrences of 'suicide' with 1 and 'non-suicide' with 0 in the 'class' column of the dataframe
df['class'] = df['class'].replace({'suicide': 1, 'non-suicide': 0})

# Dropping duplicate entries in the dataframe
df = df.drop_duplicates()

# Display head of dataframe to check
df.head()

# Checking that the two classes are balanced (equal row counts per class)
df.groupby('class').describe()

Summary statistics of the `Unnamed: 0` column, grouped by `class`:

| class | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 116037.0 | 174078.570852 | 100600.329485 | 3.0 | 86843.0 | 174534.0 | 261124.0 | 348110.0 |
| 1 | 116037.0 | 174227.156183 | 100400.800343 | 2.0 | 87240.0 | 174196.0 | 261448.0 | 348108.0 |
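
A lighter-weight way to confirm class balance is to count rows per class directly. The following is a minimal sketch on a hypothetical toy dataframe (`toy_df` and its rows are illustrative and not taken from `Suicide_Detection.csv`); it applies the same encoding and de-duplication steps as the cell above.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV
toy_df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "d"],
    "class": ["suicide", "non-suicide", "suicide", "non-suicide", "non-suicide"],
})

# Encode the labels: 1 = suicide, 0 = non-suicide
toy_df["class"] = toy_df["class"].map({"suicide": 1, "non-suicide": 0})

# Drop exact duplicate rows, then count the remaining rows per class
toy_df = toy_df.drop_duplicates()
class_counts = toy_df["class"].value_counts().sort_index()
print(class_counts)
```

Equal counts for 0 and 1 indicate a balanced dataset, which is what the `count` column of the `describe()` output shows for the real data.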


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
 
# Vectorizing the text into counts of unigrams, bigrams and trigrams, with English stop words removed
# N-grams that appear in fewer than 30 distinct documents are excluded from the vocabulary

vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words='english', min_df=30)
x = vectorizer.fit_transform(df['text'])
y = df['class']

In [3]:
from sklearn.model_selection import train_test_split

# Reserve 75% for training and 25% for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
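
The sketch below shows what a 75/25 split looks like on a hypothetical toy array. It also passes `stratify`, which the notebook does not use; it is included here only as an optional variation that keeps the class ratio identical in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy features and perfectly balanced labels
toy_x = np.arange(16).reshape(8, 2)
toy_y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# stratify=toy_y is an assumption added for illustration, not part of the notebook
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=0, stratify=toy_y
)

# 6 rows go to training and 2 to testing
print(tx_train.shape, tx_test.shape)

# Fraction of positive labels in each partition
print(ty_train.mean(), ty_test.mean())
```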

## ***2. Training the Model***

In [4]:
from sklearn.linear_model import LogisticRegression
 
# Fitting a logistic regression classifier on the training set
model = LogisticRegression(max_iter=1000, random_state=0)
model.fit(x_train, y_train)

In [5]:
from joblib import dump

# Save the model
dump(model, 'logistic_regression_model.joblib')

['logistic_regression_model.joblib']
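
The saved model expects input that has already been transformed by the fitted `vectorizer`, so classifying new raw text later also requires that vectorizer. One possible approach, sketched here under the assumption that bundling both steps is acceptable, is to wrap them in a scikit-learn pipeline and persist that single object. The toy corpus, `toy_pipeline` and the temporary file name are hypothetical.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; labels follow the 1 = suicide, 0 = non-suicide encoding
toy_texts = [
    "life feels hopeless",
    "great day outside",
    "hopeless and alone",
    "lovely sunny day",
]
toy_labels = [1, 0, 1, 0]

# Bundle vectorizer and classifier so raw text can be passed straight in
toy_pipeline = make_pipeline(
    CountVectorizer(),
    LogisticRegression(max_iter=1000, random_state=0),
)
toy_pipeline.fit(toy_texts, toy_labels)

# Round-trip through joblib using a throwaway temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(toy_pipeline, path)
    restored = load(path)

# The restored pipeline accepts raw strings directly
print(restored.predict(["feeling hopeless today"]))
```

Fitting the vectorizer inside the pipeline also means its vocabulary is learned from the training texts only.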

## ***3. Evaluating the Model's Performance***

### Displaying Confusion Matrices and Basic Accuracy

In [11]:
import joblib

# Loading the saved model from disk
model = joblib.load('logistic_regression_model.joblib')

In [12]:
from sklearn.metrics import confusion_matrix

# Creating the confusion count matrix
confusion = confusion_matrix(y_test, model.predict(x_test))

confusion_count_df = pd.DataFrame(
    confusion,
    columns=['Predicted Negative', 'Predicted Positive'],
    index=['Actual Negative', 'Actual Positive']
)

# Extracting the counts of true negatives, true positives, false positives and false negatives on the test set
TN_count = confusion_count_df.loc['Actual Negative', 'Predicted Negative']
TP_count = confusion_count_df.loc['Actual Positive', 'Predicted Positive']
FP_count = confusion_count_df.loc['Actual Negative', 'Predicted Positive']
FN_count = confusion_count_df.loc['Actual Positive', 'Predicted Negative']
TOTAL_count = confusion_count_df.values.sum()

# Expressing each cell as a fraction of all test samples (these are shares of the total, not the per-class TPR/FPR)
TN_rate = (TN_count/TOTAL_count)
TP_rate = (TP_count/TOTAL_count)
FP_rate = (FP_count/TOTAL_count)
FN_rate = (FN_count/TOTAL_count)

# Constructing the confusion rate matrix (each cell as a percentage of all test samples)
# Each column lists its values in index order: Actual Negative first, then Actual Positive
confusion_rate_df = pd.DataFrame({
    'Predicted Negative': [TN_rate*100, FN_rate*100],
    'Predicted Positive': [FP_rate*100, TP_rate*100]
}, index=['Actual Negative', 'Actual Positive'])

# Displaying confusion count
print("Confusion Count Matrix:")
print(confusion_count_df)

# Displaying confusion rate
print("\nConfusion Rate Matrix:")
print(confusion_rate_df)

# Displaying the Accuracy
print("\nAccuracy: ", ((TN_count+TP_count)/(TOTAL_count))*100, "%")


Confusion Count Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative               27505                1521
Actual Positive                2437               26556

Confusion Rate Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative           47.406884            2.621555
Actual Positive            4.200348           45.771213

Accuracy:  93.17809683034868 %
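
As a cross-check, the four cells can also be unpacked straight from the 2x2 array without going through a dataframe. This sketch hard-codes the counts from the confusion count matrix printed above; the variable names `cm`, `tn`, `fp`, `fn` and `tp` are illustrative.

```python
import numpy as np

# Counts copied from the confusion count matrix printed above
# (rows: actual negative/positive, columns: predicted negative/positive)
cm = np.array([[27505, 1521],
               [2437, 26556]])

# For a 2x2 matrix in this layout, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions among all test samples
accuracy = (tn + tp) / cm.sum()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Accuracy: {accuracy:.4%}")
```

The same unpacking works on the array returned by `confusion_matrix(y_test, model.predict(x_test))`, since scikit-learn places actual classes on rows and predicted classes on columns.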


### Precision, Recall, and F1 Score

In [13]:
# Calculating Precision, Recall, and F1 Score
PRECISION = TP_count/(TP_count + FP_count)
RECALL = TP_count/(TP_count + FN_count)
F1_SCORE = 2 * (PRECISION * RECALL) / (PRECISION + RECALL)

# Displaying Precision, Recall, and F1 Score
print("Precision: ", PRECISION*100, "%")
print("Recall: ", RECALL*100, "%")
print("F1 Score: ", F1_SCORE*100, "%")

Precision:  94.5827545677957 %
Recall:  91.5945228158521 %
F1 Score:  93.06465743823374 %
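
The same three metrics are available as ready-made functions in `sklearn.metrics`. The sketch below uses hypothetical toy labels (`y_true` and `y_pred` are made up for illustration) so the values can be verified by hand: there are 3 true positives, 1 false positive and 1 false negative.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical toy labels: 3 TP, 1 FP, 1 FN, 3 TN
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]

# Precision = TP / (TP + FP), Recall = TP / (TP + FN)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

On the real test set, calling these functions with `y_test` and `model.predict(x_test)` should agree with the manual calculation above.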
