Why?

Loads the dataset to explore the structure and contents.
Displays the first few rows to understand what data is available.


In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("mentalhealth.csv.csv")

# Display first few rows
df.head()


Unnamed: 0.1,Unnamed: 0,statement,status
0,0,oh my gosh,Anxiety
1,1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,3,I've shifted my focus to something else but I'...,Anxiety
4,4,"I'm restless and restless, it's been a month n...",Anxiety


In [None]:
# Check dataset info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11151 entries, 0 to 11150
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  11151 non-null  int64 
 1   statement   11140 non-null  object
 2   status      11138 non-null  object
dtypes: int64(1), object(2)
memory usage: 261.5+ KB


Why?

Identifies how many missing values exist in each column.
Helps in deciding an appropriate cleaning strategy.
If missing values exist, they are filled in deliberately rather than dropping the affected rows.

In [None]:
# Check missing values
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)


Missing values per column:
 Unnamed: 0     0
statement     11
status        13
dtype: int64


Why?

If a mental health-related statement is missing, it is filled with "Unknown" rather than being removed.
This ensures that no rows are lost unnecessarily.


Why?

The status column represents the mental health classification.
The missing values are filled with the most frequently occurring category (mode()), ensuring consistency in classification.

In [None]:
# Fix for filling missing values
df["statement"] = df["statement"].fillna("Unknown")
df["status"] = df["status"].fillna(df["status"].mode()[0])


In [None]:
print(df.isnull().sum())  # Should print all zeros if cleaning was successful


Unnamed: 0    0
statement     0
status        0
dtype: int64
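
The two fill rules above can be tried on a tiny frame first. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data, not rows from mentalhealth.csv.csv
toy = pd.DataFrame({
    "statement": ["feeling fine", None, "cannot sleep", "so tired"],
    "status": ["Normal", "Normal", None, "Anxiety"],
})

# Missing statements get a placeholder string
toy["statement"] = toy["statement"].fillna("Unknown")

# Missing statuses get the most frequent label (mode) of the column
most_common_status = toy["status"].mode()[0]
toy["status"] = toy["status"].fillna(most_common_status)

print(toy)
print(toy.isnull().sum())
```

Note that a row holding "Unknown" with an imputed label keeps the row count intact but carries no real text signal, so whether to keep such rows is a judgment call.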


In [None]:
# Save the cleaned dataset and download it (Google Colab only)

from google.colab import files

# df already holds the cleaned data from the cells above,
# so it can be written out directly without re-reading the raw CSV
df.to_csv('cleaned_mentalhealth.csv', index=False)

# Download the cleaned dataset
files.download('cleaned_mentalhealth.csv')



Why?

Counts the number of statements for each mental health category.
Sorting in ascending order helps visualize the least and most frequent categories.

In [None]:
status_counts = df["status"].value_counts()
print(status_counts.sort_values(ascending=True))  # Ascending order


status
Sui              1
Anxiety        730
Depression     873
Suicidal       889
Normal        6252
Name: count, dtype: int64
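
The counts above are easier to judge as percentages. The sketch below is a rough illustration that uses only the five counts printed above (typed in by hand, not recomputed from the file); the rare-label threshold of 5 is an arbitrary assumption.

```python
import pandas as pd

# Counts copied by hand from the printed output above
status_counts = pd.Series(
    {"Sui": 1, "Anxiety": 730, "Depression": 873, "Suicidal": 889, "Normal": 6252},
    name="count",
)

# Share of each label among the counts shown, in percent
shares = (status_counts / status_counts.sum() * 100).round(1)
print(shares)

# Labels that occur fewer than 5 times (threshold chosen arbitrarily)
rare_labels = status_counts[status_counts < 5].index.tolist()
print("Rare labels:", rare_labels)
```

A label that appears only once, such as "Sui" here, may be a truncated or mistyped category and is worth inspecting before training.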


Why?

Converts multi-class mental health categories into binary classification (normal vs. depression).
Any category other than "Normal" is considered "depression" for simplicity in model training.

In [None]:
df["status_binary"] = df["status"].apply(lambda x: "depression" if x != "Normal" else "normal")


Why?

Converts "normal" to 0 and "depression" to 1 for model training.
Machine learning algorithms require numerical labels, and an explicit mapping makes "depression" the positive class (1) that precision and recall report on.
LabelEncoder is avoided here because it assigns codes in alphabetical order, which would give "depression" 0 and "normal" 1.

In [None]:
from google.colab import files

# Encode the binary labels explicitly: normal -> 0, depression -> 1
df["status_numeric"] = df["status_binary"].map({"normal": 0, "depression": 1})

# Save the updated DataFrame to a new CSV file
df.to_csv('label_encoded_mentalhealth.csv', index=False)

# Download the updated dataset
files.download('label_encoded_mentalhealth.csv')
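
It is worth checking which integer each label actually receives, because scikit-learn's `LabelEncoder` assigns codes in sorted label order. A minimal sketch on a made-up Series (the `labels` values are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration
labels = pd.Series(["normal", "depression", "normal", "depression"])

# LabelEncoder sorts the classes, so "depression" -> 0 and "normal" -> 1
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# An explicit mapping pins the codes: normal -> 0, depression -> 1
explicit = labels.map({"normal": 0, "depression": 1})
print(explicit.tolist())
```

With the explicit mapping, class 1 is "depression", which is the class that `precision_score` and `recall_score` treat as positive by default.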



Why?

80-20 split: 80% of the data is used for training the model, and 20% is kept for testing.
The model learns patterns from the training set, and the held-out test set shows how well it generalizes to unseen data.
random_state=42 ensures reproducibility.

In [None]:
from sklearn.model_selection import train_test_split

X = df["statement"]
y = df["status_numeric"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
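
Because the classes are imbalanced, a stratified split may be worth considering. The sketch below is an optional variation, not what the cell above does: it passes `stratify` so both parts keep the same class ratio. The toy arrays are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class ratio in both train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Train size:", len(X_tr), "| share of class 1:", y_tr.mean())
print("Test size:", len(X_te), "| share of class 1:", y_te.mean())
```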


Why?

Converts text statements into numerical representations for machine learning models.
TF-IDF (Term Frequency-Inverse Document Frequency) helps capture the importance of words in each statement.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
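
To see what the vectorizer produces, here is a small sketch on made-up sentences. It shows that the vocabulary is learned from the training texts only, so unseen words in test texts are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example texts
train_docs = ["i feel anxious", "i feel happy", "happy happy day"]
test_docs = ["anxious and worried"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

# The default tokenizer drops one-letter tokens such as "i"
vocab = sorted(vec.vocabulary_)
print("Vocabulary:", vocab)
print("Train shape:", train_matrix.shape)
print("Non-zero entries in test row:", test_matrix.nnz)
```

Fitting only on the training set, as the cell above does, keeps information from the test statements out of the learned vocabulary.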


Why?

Decision Trees are simple and interpretable models used for classification.

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier()
dt_model.fit(X_train_tfidf, y_train)
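
The interpretability claim can be illustrated by asking a fitted tree which word it relies on most. This is a sketch on four made-up statements, with `random_state=0` added as an assumption for repeatability.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy statements; 1 = depression, 0 = normal
docs = ["so sad today", "very sad and tired", "happy day", "feeling happy"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(docs)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_toy, labels)

# Map feature indices back to words to see which word drives the split
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}
top_word = index_to_word[int(tree.feature_importances_.argmax())]

print("Depth:", tree.get_depth(), "| leaves:", tree.get_n_leaves())
print("Most important word:", top_word)
```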


Why?

A neural network can capture more complex patterns in text data than a single decision tree.
Uses one hidden layer with 100 neurons and allows up to 500 training iterations (max_iter is an upper limit; training stops earlier if the loss stops improving).

In [None]:
from sklearn.neural_network import MLPClassifier

nn_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
nn_model.fit(X_train_tfidf, y_train)
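
Whether the iteration limit was actually reached can be checked after fitting through the `n_iter_` attribute. A minimal sketch on synthetic data from `make_classification`; the data and `random_state=0` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data, not the TF-IDF features from the notebook
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
clf.fit(X_toy, y_toy)

# n_iter_ is the number of iterations actually run; max_iter is only the cap
stopped_early = clf.n_iter_ < clf.max_iter
print("Iterations run:", clf.n_iter_, "of at most", clf.max_iter)
print("Stopped before the cap:", stopped_early)

# One hidden layer means two weight matrices: input->hidden and hidden->output
print([w.shape for w in clf.coefs_])
```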


Why?

Accuracy measures the share of all predictions that are correct.
Precision checks how many statements predicted as depression were actually depression.
Recall determines how many of the actual depression cases the model finds.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Predictions
y_pred_dt = dt_model.predict(X_test_tfidf)
y_pred_nn = nn_model.predict(X_test_tfidf)

# Evaluation
def evaluate_model(y_test, y_pred, model_name):
    print(f"Performance of {model_name}:")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print(f"Precision: {precision_score(y_test, y_pred):.2f}")
    print(f"Recall: {recall_score(y_test, y_pred):.2f}\n")

evaluate_model(y_test, y_pred_dt, "Decision Tree")
evaluate_model(y_test, y_pred_nn, "Neural Network")
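
The three metrics can be worked out by hand from the confusion matrix. The sketch below uses ten made-up labels (1 = depression, 0 = normal) and compares the manual arithmetic with scikit-learn's functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions, not real model output
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / len(y_true)  # correct predictions / all predictions
precision = tp / (tp + fp)          # correct positives / predicted positives
recall = tp / (tp + fn)             # correct positives / actual positives

print(f"Manual:  acc={accuracy:.2f} prec={precision:.2f} rec={recall:.2f}")
print(f"sklearn: acc={accuracy_score(y_true, y_hat):.2f} "
      f"prec={precision_score(y_true, y_hat):.2f} "
      f"rec={recall_score(y_true, y_hat):.2f}")
```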


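Why?

The ROC Curve visualizes how well the models distinguish between "normal" and "depression" across all decision thresholds.
A good model has a high area under the curve (AUC).

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Use predicted probabilities for class 1 rather than hard 0/1 predictions,
# otherwise the curve collapses to a single threshold
y_score_dt = dt_model.predict_proba(X_test_tfidf)[:, 1]
y_score_nn = nn_model.predict_proba(X_test_tfidf)[:, 1]

fpr_dt, tpr_dt, _ = roc_curve(y_test, y_score_dt)
fpr_nn, tpr_nn, _ = roc_curve(y_test, y_score_nn)

plt.plot(fpr_dt, tpr_dt, label=f"Decision Tree (AUC = {auc(fpr_dt, tpr_dt):.2f})")
plt.plot(fpr_nn, tpr_nn, label=f"Neural Network (AUC = {auc(fpr_nn, tpr_nn):.2f})")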
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
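
The AUC depends on whether it is computed from continuous scores or from hard 0/1 predictions. A small sketch with made-up scores, using a 0.5 cutoff as an assumption, shows the difference:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for class 1
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.7, 0.4, 0.8, 0.9])

# Hard predictions at a 0.5 cutoff discard the ranking information
hard = (scores >= 0.5).astype(int)

auc_scores = roc_auc_score(y_true, scores)
auc_hard = roc_auc_score(y_true, hard)

print(f"AUC from probabilities:    {auc_scores:.3f}")
print(f"AUC from hard predictions: {auc_hard:.3f}")
```

Scores rank every positive against every negative, while hard predictions evaluate only one threshold, so the two values generally differ.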


Why?

Transforms new statements into numerical features and classifies them using both models.

In [None]:
test_statements = [
    "I can't stop worrying about everything",
    "I've been working hard, and seeing the results makes me feel incredibly happy and fulfilled",
    "Even the smallest things feel like too much right now",
    "I can’t stop smiling",
    "Today has been amazing!"
]

# Transform text
test_tfidf = vectorizer.transform(test_statements)

# Predict with both models
dt_predictions = dt_model.predict(test_tfidf)
nn_predictions = nn_model.predict(test_tfidf)

print("Decision Tree Predictions:", dt_predictions)
print("Neural Network Predictions:", nn_predictions)
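
Numeric predictions are easier to read next to their statements. The sketch below assumes that 0 means "normal" and 1 means "depression"; this should be verified against the encoding actually used. The `predictions` array is a made-up placeholder, not real model output.

```python
import numpy as np

# Assumption: 0 = "normal", 1 = "depression" (check the encoding step)
label_names = {0: "normal", 1: "depression"}

statements = [
    "I can't stop worrying about everything",
    "Today has been amazing!",
]
predictions = np.array([1, 0])  # hypothetical placeholder values

# Pair each statement with its readable label
readable = [label_names[int(p)] for p in predictions]
for text, label in zip(statements, readable):
    print(f"{label:>10} | {text}")
```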
