# Mental Health Status Prediction

This notebook predicts the mental health status of students from the features in the provided dataset. We explore the data, preprocess it, and then train and compare four machine learning models: Support Vector Machine (SVM), Logistic Regression, Random Forest, and K-Nearest Neighbors (KNN). We close by identifying the best-performing model.

## 1. Importing Libraries and Loading Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np

df = pd.read_csv('mental_health_dataset.csv')


## 2. Data Exploration and Preprocessing

In [None]:
df.head()

In [None]:
df.info()

In [None]:
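# Student_ID is only an identifier, so drop it before modeling.
df = df.drop(columns='Student_ID')
# Separate the features (X) from the target label (y).
X = df.drop(columns='Mental_Health_Status')
y = df['Mental_Health_Status']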
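
Before building the pipeline, it can help to check for missing values and class balance: `StandardScaler` and the classifiers used later do not accept NaN, and accuracy is hard to interpret when one class dominates. The sketch below is only an illustration. The frame `toy`, its values, the class names `Healthy` and `Stressed`, and the helper `summarize` are all made up; only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; every value here is invented for illustration.
toy = pd.DataFrame({
    'Age': [19, 21, np.nan, 22, 20, 23],
    'Sleep_Hours': [7.0, 5.5, 6.0, np.nan, 8.0, 4.5],
    'Mental_Health_Status': ['Healthy', 'Stressed', 'Healthy',
                             'Healthy', 'Stressed', 'Healthy'],
})


def summarize(frame, target):
    """Return per-column missing counts and the class proportions of `target`."""
    missing = frame.isna().sum()
    balance = frame[target].value_counts(normalize=True)
    return missing, balance


missing, balance = summarize(toy, 'Mental_Health_Status')
print(missing)
print(balance)
```

On the real data the same two calls would be applied to `df`; whether any columns actually contain missing values is not known from this notebook.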

## 3. Feature Engineering and Pipeline Creation

In [None]:
categorical_features = ['Gender', 'Mood_Description']
numerical_features = ['Age', 'GPA', 'Stress_Level', 'Anxiety_Score', 'Depression_Score', 'Sleep_Hours', 'Steps_Per_Day', 'Sentiment_Score']
text_features = 'Daily_Reflections'

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('text', TfidfVectorizer(stop_words='english', max_features=100), text_features)
    ],
    remainder='passthrough'
)
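
One detail in the transformer above is easy to miss: `text_features` is a plain string rather than a list. `ColumnTransformer` passes a 1-D column to the transformer when the selector is a string, which is what `TfidfVectorizer` expects, whereas a list would pass a 2-D frame. The following minimal sketch shows the same pattern on a tiny invented frame (`toy` and `toy_preprocessor` are illustrative names, and the reflections are made up).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows that reuse three of the notebook's column names.
toy = pd.DataFrame({
    'Age': [19, 21, 22, 20],
    'Gender': ['F', 'M', 'F', 'M'],
    'Daily_Reflections': [
        'felt calm after class',
        'exam stress and little sleep',
        'calm walk with friends',
        'stress about exam results',
    ],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender']),
        # A string selector (not a list) hands TfidfVectorizer a 1-D column of documents.
        ('text', TfidfVectorizer(stop_words='english'), 'Daily_Reflections'),
    ]
)

features = toy_preprocessor.fit_transform(toy)
# The fitted vectorizer's vocabulary size gives the number of TF-IDF columns.
vocab_size = len(toy_preprocessor.named_transformers_['text'].vocabulary_)
print(features.shape, vocab_size)
```

The output width is one scaled numeric column, two one-hot columns, and one column per vocabulary term.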

## 4. Model Training and Evaluation

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
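
The `stratify=y` argument keeps the class proportions of the full dataset in both splits. A small sketch on invented, imbalanced labels (`X_toy` and `y_toy` are hypothetical) illustrates the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented labels: 80 samples of class 0 and 20 samples of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification the 20-sample test split keeps the 80/20 class ratio.
minority_share = y_test_strat.mean()
print(len(y_test_strat), minority_share)
```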

In [None]:
models = {
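    # random_state fixes the internal cross-validation used for probability estimates.
    'Support Vector Machine': SVC(kernel='linear', probability=True, random_state=42),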
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
}

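results = {}
for model_name, model in models.items():
    # Chain preprocessing and the classifier so the transformers are fit on the training data only.
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', model)])
    pipeline.fit(X_train, y_train)
    # Evaluate on the held-out test split.
    y_pred = pipeline.predict(X_test)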
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    results[model_name] = {
        'accuracy': accuracy,
        'classification_report': report,
        'predictions': y_pred
    }
    print(f'--- {model_name} ---')
    print(f'Accuracy: {accuracy:.4f}')
    print('Classification Report:')
    print(pd.DataFrame(report).transpose())
    print()
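
The loop above scores each model on a single train/test split. A common complement, not part of the original notebook, is k-fold cross-validation, which gives a mean and spread of accuracy per model. The sketch below uses synthetic data from `make_classification` and a simplified pipeline (scaler plus logistic regression) purely as an assumption-laden illustration; on the real data the full `preprocessor` and each entry of `models` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the student dataset.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=42
)

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Stratified folds keep class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_syn, y_syn, cv=cv, scoring='accuracy')
print(f'Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}')
```

Because the whole pipeline is passed to `cross_val_score`, the scaler is refit inside each fold, so no information from the validation fold leaks into preprocessing.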

## 5. Model Comparison

In [None]:
accuracies = {model_name: result['accuracy'] for model_name, result in results.items()}
accuracy_df = pd.DataFrame(list(accuracies.items()), columns=['Model', 'Accuracy']).sort_values('Accuracy', ascending=False)

plt.figure(figsize=(10, 6))
# A single color avoids the seaborn warning raised when palette is passed without hue.
sns.barplot(x='Accuracy', y='Model', data=accuracy_df, color='steelblue')
plt.title('Model Accuracy Comparison')
plt.xlabel('Accuracy')
plt.ylabel('Model')
plt.xlim(0, 1.0)
# Annotate each bar with its accuracy value.
for index, value in enumerate(accuracy_df['Accuracy']):
    plt.text(value, index, f' {value:.4f}', va='center')
plt.show()

## 6. Detailed Look at the Best Model: Support Vector Machine

In [None]:
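# Select the model with the highest test accuracy instead of hard-coding its name.
best_model_name = max(results, key=lambda name: results[name]['accuracy'])
best_model_predictions = results[best_model_name]['predictions']
# Fix the label order so the matrix rows and columns match the tick labels.
class_labels = np.unique(y)
cm = confusion_matrix(y_test, best_model_predictions, labels=class_labels)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)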
plt.title(f'Confusion Matrix for {best_model_name}')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
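
Raw counts in a confusion matrix can be hard to compare across classes of different sizes. One way to read the matrix, sketched here on invented labels (`y_true`, `y_hat`, and the class names `A`, `B`, `C` are hypothetical), is to divide each diagonal entry by its row sum, which gives per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented true and predicted labels for three classes.
y_true = ['A', 'A', 'A', 'B', 'B', 'C']
y_hat = ['A', 'A', 'B', 'B', 'B', 'A']
labels = ['A', 'B', 'C']

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm_toy = confusion_matrix(y_true, y_hat, labels=labels)

# Per-class recall: correct predictions divided by the number of true samples per class.
recall = cm_toy.diagonal() / cm_toy.sum(axis=1)
for label, value in zip(labels, recall):
    print(f'{label}: {value:.2f}')
```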

## 7. Conclusion

Based on the evaluation on the held-out test set, the **Support Vector Machine (SVM)** model achieved the highest accuracy of the four models tested. This suggests that a linear SVM is well suited to predicting mental health status from the given features in this dataset. The confusion matrix for the SVM model also indicates that it classifies the different mental health statuses well.