# Sentiment Analysis on Mental Health Statements  

## 1. Introduction  
This notebook focuses on **preprocessing and resampling** mental health-related statements to prepare them for sentiment classification. Each statement in the dataset is labeled with one of seven statuses: `Anxiety`, `Bipolar`, `Depression`, `Normal`, `Personality disorder`, `Stress`, and `Suicidal`. The workflow includes:

- **Data Cleaning & Preprocessing**: Removing noise, tokenizing, and stemming the statements.
- **Feature Engineering**: Combining TF-IDF text features with simple numerical features (character and sentence counts).
- **Handling Imbalanced Data**: Applying resampling techniques such as SMOTE, NearMiss, Tomek Links, and random over- and undersampling.

The output of this notebook is a set of resampled training datasets, one per technique, together with a held-out test set, ready for sentiment classification modeling.


In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [47]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [48]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\nihar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## 2. Loading the Dataset

In [49]:
# Loading the dataset into a pandas DataFrame
data = pd.read_csv('../data/raw/cleaned_sentiment_data.csv')

In [50]:
data.head()

Unnamed: 0,statement,status
0,oh my gosh,Anxiety
1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,I've shifted my focus to something else but I'...,Anxiety
4,"I'm restless and restless, it's been a month n...",Anxiety


## 3. Text Data Preprocessing

In [51]:
# Calculate the number of characters and sentences
data['num_of_characters'] = data['statement'].str.len()
data['num_of_sentences'] = data['statement'].apply(lambda x: len(nltk.sent_tokenize(x)))

# Generate descriptive statistics
description = data[['num_of_characters', 'num_of_sentences']].describe()

In [52]:

# Display the descriptive statistics of character and sentence counts
print(description)

       num_of_characters  num_of_sentences
count       51073.000000      51073.000000
mean          575.375051          6.249251
std           847.661079         10.762749
min             2.000000          1.000000
25%            79.000000          1.000000
50%           313.000000          3.000000
75%           745.000000          8.000000
max         32759.000000       1260.000000


### Cleaning Text Data

The `clean_text` function lowercases each statement and strips URLs, markdown links, @mentions, punctuation, digits, and extra whitespace.

In [53]:
import re

def clean_text(text):
    """
    Cleans the text by:
    - Removing URLs, markdown links, and user handles (@mentions)
    - Converting text to lowercase
    - Removing punctuation, numbers, and extra whitespace
    """
    if isinstance(text, str):  # Ensure input is a string
        text = text.lower()  # Convert to lowercase
        text = re.sub(r'http[s]?://\S+', '', text)  # Remove URLs
        text = re.sub(r'\[.*?\]\(.*?\)', '', text)  # Remove markdown links
        text = re.sub(r'@\w+', '', text)  # Remove @mentions
        text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
        text = re.sub(r'\d+', '', text)  # Remove numbers
        text = ' '.join(text.split())  # Remove extra spaces
        return text
    return ''


In [54]:
# Apply the cleaning function to the 'statement' column
data['cleaned_text'] = data['statement'].apply(clean_text)

In [55]:
data[['statement', 'cleaned_text']].head(10)


Unnamed: 0,statement,cleaned_text
0,oh my gosh,oh my gosh
1,"trouble sleeping, confused mind, restless hear...",trouble sleeping confused mind restless heart ...
2,"All wrong, back off dear, forward doubt. Stay ...",all wrong back off dear forward doubt stay in ...
3,I've shifted my focus to something else but I'...,ive shifted my focus to something else but im ...
4,"I'm restless and restless, it's been a month n...",im restless and restless its been a month now ...
5,"every break, you must be nervous, like somethi...",every break you must be nervous like something...
6,"I feel scared, anxious, what can I do? And may...",i feel scared anxious what can i do and may my...
7,Have you ever felt nervous but didn't know why?,have you ever felt nervous but didnt know why
8,"I haven't slept well for 2 days, it's like I'm...",i havent slept well for days its like im restl...
9,"I'm really worried, I want to cry.",im really worried i want to cry


### Tokenization

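Splitting statements into individual words (tokens) for further processing. Note that the cell below tokenizes the raw `statement` column rather than `cleaned_text`, so the tokens keep their original casing and punctuation, as the output shows.

In [56]:
# Tokenize the raw statements (not `cleaned_text`), keeping punctuation tokens
data['tokens'] = data['statement'].apply(word_tokenize)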


In [57]:
data.head()

Unnamed: 0,statement,status,num_of_characters,num_of_sentences,cleaned_text,tokens
0,oh my gosh,Anxiety,10,1,oh my gosh,"[oh, my, gosh]"
1,"trouble sleeping, confused mind, restless hear...",Anxiety,64,2,trouble sleeping confused mind restless heart ...,"[trouble, sleeping, ,, confused, mind, ,, rest..."
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety,78,2,all wrong back off dear forward doubt stay in ...,"[All, wrong, ,, back, off, dear, ,, forward, d..."
3,I've shifted my focus to something else but I'...,Anxiety,61,1,ive shifted my focus to something else but im ...,"[I, 've, shifted, my, focus, to, something, el..."
4,"I'm restless and restless, it's been a month n...",Anxiety,72,2,im restless and restless its been a month now ...,"[I, 'm, restless, and, restless, ,, it, 's, be..."


### Stemming

Stemming strips suffixes to reduce words to a common stem. The stem is not always a dictionary word (e.g., "crying" → "cri"), and irregular forms are left untouched (e.g., "ran" stays "ran"). Despite this, **PorterStemmer** is useful because it:

- **Reduces vocabulary size** (e.g., "running" and "runs" → "run")
- **Can improve model generalization** by grouping word variations
- **Runs faster than lemmatization**, which relies on dictionary lookups and part-of-speech information
- **Balances accuracy and speed** for large-scale text processing

Even with these inaccuracies, PorterStemmer simplifies the text and shrinks the feature space, which often helps downstream models.
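The behaviour described above can be checked directly. This is a minimal sketch using NLTK's `PorterStemmer` with its default settings; the word list is an illustrative assumption.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words: two regular inflections, one irregular past tense,
# and one word whose stem is not a dictionary word
words = ["running", "runs", "ran", "crying"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Regular inflections collapse to one stem, while the irregular form "ran" is left alone; mapping it to "run" would require a lemmatizer.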



In [58]:
# Initialize the stemmer
stemmer = PorterStemmer()

# Function to stem tokens and convert them to strings
def stem_tokens(tokens):
    return ' '.join(stemmer.stem(str(token)) for token in tokens)

# Applying stemming to tokenized words
data['tokens_stemmed'] = data['tokens'].apply(stem_tokens)


In [59]:

data.head()

Unnamed: 0,statement,status,num_of_characters,num_of_sentences,cleaned_text,tokens,tokens_stemmed
0,oh my gosh,Anxiety,10,1,oh my gosh,"[oh, my, gosh]",oh my gosh
1,"trouble sleeping, confused mind, restless hear...",Anxiety,64,2,trouble sleeping confused mind restless heart ...,"[trouble, sleeping, ,, confused, mind, ,, rest...","troubl sleep , confus mind , restless heart . ..."
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety,78,2,all wrong back off dear forward doubt stay in ...,"[All, wrong, ,, back, off, dear, ,, forward, d...","all wrong , back off dear , forward doubt . st..."
3,I've shifted my focus to something else but I'...,Anxiety,61,1,ive shifted my focus to something else but im ...,"[I, 've, shifted, my, focus, to, something, el...",i 've shift my focu to someth els but i 'm sti...
4,"I'm restless and restless, it's been a month n...",Anxiety,72,2,im restless and restless its been a month now ...,"[I, 'm, restless, and, restless, ,, it, 's, be...","i 'm restless and restless , it 's been a mont..."


### Keep Stopwords for Better Sentiment Analysis  

Stopwords may seem insignificant, but in sentiment analysis, they provide crucial context. Words like "what," "why," and "if" shape meaning and emotion. Removing them can weaken insights, especially in mental health analysis. Instead of discarding all stopwords, consider their impact on context and sentiment.


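### Stratified Split & Label Encoding

The labels are encoded as integers, and the data is split into training (80%) and test (20%) sets using stratified sampling so that both sets preserve the original class proportions.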

In [60]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder


In [61]:

X = data[['tokens_stemmed', 'num_of_characters', 'num_of_sentences']]  # features
y = data['status']
lbl_enc = LabelEncoder()
y = lbl_enc.fit_transform(y.values)

In [62]:
dict(zip(lbl_enc.classes_, lbl_enc.transform(lbl_enc.classes_)))

{'Anxiety': np.int64(0),
 'Bipolar': np.int64(1),
 'Depression': np.int64(2),
 'Normal': np.int64(3),
 'Personality disorder': np.int64(4),
 'Stress': np.int64(5),
 'Suicidal': np.int64(6)}
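`LabelEncoder` assigns codes in sorted label order, which is why the mapping above is alphabetical. A minimal sketch of encoding and decoding, using a fresh encoder (the name `enc` is an assumption, standing in for `lbl_enc`) fitted on the seven status names:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Anxiety", "Bipolar", "Depression", "Normal",
          "Personality disorder", "Stress", "Suicidal"]

enc = LabelEncoder().fit(labels)

# Text label -> integer code
codes = enc.transform(["Normal", "Suicidal"])
# Integer code -> text label, e.g. to make model predictions readable
decoded = enc.inverse_transform([2, 5])

print(list(codes), list(decoded))
```

Keeping the fitted encoder alongside the test data makes it possible to report predictions by status name later on.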

In [63]:
y = pd.Series(y)

In [64]:

# Create a stratified shuffle split with 1 iteration
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]


In [65]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((40858, 3), (10215, 3), (40858,), (10215,))
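To see what stratification guarantees, the sketch below applies the same `StratifiedShuffleSplit` settings to synthetic labels (80 samples of one class and 20 of another, an assumption made purely for illustration) and counts the classes in each split.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced labels: 80 samples of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(sss.split(X_toy, y_toy))

train_counts = Counter(y_toy[train_idx].tolist())
test_counts = Counter(y_toy[test_idx].tolist())

print("train:", train_counts)
print("test: ", test_counts)
```

Both splits keep the 4:1 class ratio of the full label set, which matters here because the smallest classes in the real data are rare.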

In [91]:
X_train.head()

Unnamed: 0,tokens_stemmed,num_of_characters,num_of_sentences
9722,i hate myself and do not understand whi i shou...,604,8
48313,i 'm at the edg i have been have chronic tensi...,505,5
30428,i wish i wa free that night . i 'm kind of mad...,63,2
23853,i do not know what to do anymor . i know there...,1264,15
35946,i dont have long left thi past month ha been t...,905,1


In [92]:
y_train.head()

9722     2
48313    5
30428    3
23853    6
35946    6
dtype: int64

### Resampling Data

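The training set is heavily imbalanced: the largest class has 12,831 statements and the smallest has 716. Six resampling techniques are applied to the training data only: Random Over-Sampling, NearMiss, SMOTE, Tomek Links, Random Undersampling, and SMOTE + Tomek Links.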

In [99]:
from collections import Counter
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks
from imblearn.combine import SMOTETomek

In [100]:
# Vectorization - TF-IDF
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)  # Reduced to save memory
X_train_tfidf = vectorizer.fit_transform(X_train['tokens_stemmed'])
X_test_tfidf = vectorizer.transform(X_test['tokens_stemmed'])

# Extract Numerical Features (Keep Sparse)
X_train_num = csr_matrix(X_train[['num_of_characters', 'num_of_sentences']].values)
X_test_num = csr_matrix(X_test[['num_of_characters', 'num_of_sentences']].values)

# Combine Features (TF-IDF + Numerical)
X_train_combined = hstack([X_train_tfidf, X_train_num])  # Sparse matrix
X_test_combined = hstack([X_test_tfidf, X_test_num])  # Sparse matrix

print("Feature Extraction Completed! Shape:", X_train_combined.shape)

Feature Extraction Completed! Shape: (40858, 10002)
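The combined matrix has 10,000 TF-IDF columns plus 2 count columns. TF-IDF values lie in [0, 1], whereas character counts can reach the thousands, so distance-based samplers such as SMOTE and NearMiss may be dominated by the count columns. The sketch below shows one possible mitigation on toy data: scaling the numerical block with `MaxAbsScaler` before stacking. This scaling step is an assumption and is not part of the pipeline in this notebook; the documents and counts are made up.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler

# Toy statements and hypothetical [num_of_characters, num_of_sentences] values
docs = ["i feel restless", "i can not sleep at night", "feel fine today"]
lengths = [[15, 1], [24, 1], [15, 1]]

tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# MaxAbsScaler divides each column by its largest absolute value and keeps sparsity
scaler = MaxAbsScaler()
num_scaled = scaler.fit_transform(csr_matrix(lengths, dtype=float))

combined = hstack([tfidf, num_scaled]).tocsr()
print(combined.shape, combined.max())
```

In a real pipeline the scaler would be fitted on the training rows only and then applied to the test rows with `scaler.transform`, mirroring how the vectorizer is used.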


In [101]:
print(f"Original Class Distribution: {Counter(y_train)}")

Original Class Distribution: Counter({3: 12831, 2: 12069, 6: 8513, 0: 2894, 1: 2001, 5: 1834, 4: 716})


In [106]:

# Define resampling techniques
sampling_methods = {
    "Random Over-Sampling": RandomOverSampler(random_state=101),
    "NearMiss": NearMiss(version=1),
    "SMOTE": SMOTE(random_state=101),
    "Tomek Links": TomekLinks(),
    "Random Undersampling": RandomUnderSampler(random_state=101),
    "SMOTE + Tomek Links": SMOTETomek(random_state=101)
}

In [107]:
resampled_datasets = {}
for method, sampler in sampling_methods.items():
    try:
        print(f"Applying {method}...")

        # All six samplers accept SciPy sparse input, so no dense conversion is needed
        X_resampled, y_resampled = sampler.fit_resample(X_train_combined, y_train)

        resampled_datasets[method] = (csr_matrix(X_resampled), y_resampled)  # Store as sparse
        print(f"{method} applied successfully - New Class Distribution: {Counter(y_resampled)}")

    except MemoryError as e:
        print(f"{method} failed due to memory limitations: {e}")

print("Resampling Process Completed!")

Applying Random Over-Sampling...
Random Over-Sampling applied successfully - New Class Distribution: Counter({2: 12831, 5: 12831, 3: 12831, 6: 12831, 0: 12831, 4: 12831, 1: 12831})
Applying NearMiss...
NearMiss applied successfully - New Class Distribution: Counter({0: 716, 1: 716, 2: 716, 3: 716, 4: 716, 5: 716, 6: 716})
Applying SMOTE...
SMOTE applied successfully - New Class Distribution: Counter({2: 12831, 5: 12831, 3: 12831, 6: 12831, 0: 12831, 4: 12831, 1: 12831})
Applying Tomek Links...
Tomek Links applied successfully - New Class Distribution: Counter({3: 11725, 2: 8601, 6: 5428, 0: 1716, 1: 1135, 5: 1134, 4: 716})
Applying Random Undersampling...
Random Undersampling applied successfully - New Class Distribution: Counter({0: 716, 1: 716, 2: 716, 3: 716, 4: 716, 5: 716, 6: 716})
Applying SMOTE + Tomek Links...
SMOTE + Tomek Links applied successfully - New Class Distribution: Counter({3: 12744, 5: 12643, 4: 12530, 1: 12460, 0: 12456, 6: 12060, 2: 11906})
Resampling Process Completed!

### Exporting the re-sampled data

In [None]:
import pickle


# Path to save the file
save_path = "../data/processed/resampled_data.pkl"

# Repackage each result as {"X": features, "y": labels}; X is already stored as CSR,
# so csr_matrix() here is only a safeguard
resampled_dict = {
    method: {"X": csr_matrix(X_resampled), "y": y_resampled}
    for method, (X_resampled, y_resampled) in resampled_datasets.items()
}

# Save the dictionary
with open(save_path, "wb") as f:
    pickle.dump(resampled_dict, f)

print(f"All resampled datasets saved successfully as {save_path}!")


All resampled datasets saved successfully as ../data/processed/resampled_data.pkl!
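A modeling notebook can read the file back with `pickle.load` and iterate over the techniques. The sketch below demonstrates the round trip with a tiny stand-in dictionary in the same `{method: {"X": ..., "y": ...}}` layout, written to a temporary file; the demo data and path are assumptions, not the real `resampled_data.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for the real file: one tiny "resampled" dataset in the same layout
demo = {"SMOTE": {"X": csr_matrix(np.eye(3)), "y": np.array([0, 1, 2])}}
path = os.path.join(tempfile.mkdtemp(), "resampled_demo.pkl")  # hypothetical path

with open(path, "wb") as f:
    pickle.dump(demo, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

for method, parts in loaded.items():
    print(method, parts["X"].shape, parts["y"].shape)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.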


In [66]:
import pickle

# Define the save path
save_path = "../data/processed/test_data.pkl"

# Create a dictionary to store the objects
data_to_save = {
    "label_encoder": lbl_enc,
    "X_test": X_test,
    "y_test": y_test
}

# Save the dictionary as a pickle file
with open(save_path, "wb") as f:
    pickle.dump(data_to_save, f)

print(f"Test data saved successfully at: {save_path}")

Test data saved successfully at: ../data/processed/test_data.pkl
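The pickle above stores `X_test` as a DataFrame that still contains the stemmed text column. A model trained on the resampled TF-IDF features would presumably need the same fitted `TfidfVectorizer` to transform that text; whether the vectorizer is saved elsewhere is not shown in this notebook. The sketch below illustrates, on toy data, that a fitted vectorizer survives a pickle round trip unchanged; the documents and path are assumptions.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training and test statements
train_docs = ["i feel restless", "can not sleep at night"]
test_docs = ["feel restless at night"]

vec = TfidfVectorizer(ngram_range=(1, 2)).fit(train_docs)

path = os.path.join(tempfile.mkdtemp(), "vectorizer_demo.pkl")  # hypothetical path
with open(path, "wb") as f:
    pickle.dump(vec, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored vectorizer should produce exactly the same features
original_features = vec.transform(test_docs)
restored_features = restored.transform(test_docs)
same = (original_features != restored_features).nnz == 0
print(same, restored_features.shape)
```

Storing the vectorizer next to the label encoder would keep everything needed to evaluate on the test set in one place.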
