# Student mental health dataset

**Test whether the type of course or the study year is a factor in stress or anxiety:**

In [1]:
import pandas as pd

filename=r'C:\Users\lora.maillard\OneDrive - De Vinci\Documents\ESILV\Informatique\S5\Data science\Final project\Student Mental health.csv'
# Load the CSV file into a DataFrame
df = pd.read_csv(filename)
df.head()

Unnamed: 0,Timestamp,Choose your gender,Age,What is your course?,Your current year of Study,What is your CGPA?,Marital status,Do you have Depression?,Do you have Anxiety?,Do you have Panic attack?,Did you seek any specialist for a treatment?
0,8/7/2020 12:02,Female,18.0,Engineering,year 1,3.00 - 3.49,No,Yes,No,Yes,No
1,8/7/2020 12:04,Male,21.0,Islamic education,year 2,3.00 - 3.49,No,No,Yes,No,No
2,8/7/2020 12:05,Male,19.0,BIT,Year 1,3.00 - 3.49,No,Yes,Yes,Yes,No
3,8/7/2020 12:06,Female,22.0,Laws,year 3,3.00 - 3.49,Yes,Yes,No,No,No
4,8/7/2020 12:13,Male,23.0,Mathemathics,year 4,3.00 - 3.49,No,No,No,No,No


**Data preprocessing**

First, we have to clean the data, which means turning categorical data into numerical data and normalizing the data.

In [2]:

new_column_names = {
    'Timestamp': 'timestamp',
    'Choose your gender': 'gender',
    'Age': 'age',
    'What is your course?': 'course',
    'Your current year of Study': 'study_year',
    'What is your CGPA?': 'cgpa',
    'Marital status': 'marital_status',
    'Do you have Depression?': 'depression',
    'Do you have Anxiety?': 'anxiety',
    'Do you have Panic attack?': 'panic_attack',
    'Did you seek any specialist for a treatment?': 'treatment'
}
# Rename the columns
df.rename(columns=new_column_names, inplace=True)
df.info()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   timestamp       101 non-null    object 
 1   gender          101 non-null    object 
 2   age             100 non-null    float64
 3   course          101 non-null    object 
 4   study_year      101 non-null    object 
 5   cgpa            101 non-null    object 
 6   marital_status  101 non-null    object 
 7   depression      101 non-null    object 
 8   anxiety         101 non-null    object 
 9   panic_attack    101 non-null    object 
 10  treatment       101 non-null    object 
dtypes: float64(1), object(10)
memory usage: 8.8+ KB


Unnamed: 0,timestamp,gender,age,course,study_year,cgpa,marital_status,depression,anxiety,panic_attack,treatment
0,8/7/2020 12:02,Female,18.0,Engineering,year 1,3.00 - 3.49,No,Yes,No,Yes,No
1,8/7/2020 12:04,Male,21.0,Islamic education,year 2,3.00 - 3.49,No,No,Yes,No,No
2,8/7/2020 12:05,Male,19.0,BIT,Year 1,3.00 - 3.49,No,Yes,Yes,Yes,No
3,8/7/2020 12:06,Female,22.0,Laws,year 3,3.00 - 3.49,Yes,Yes,No,No,No
4,8/7/2020 12:13,Male,23.0,Mathemathics,year 4,3.00 - 3.49,No,No,No,No,No


In [3]:
# Binary encoding for the Yes/No columns ('Yes' -> 1, anything else -> 0)
cols = ['marital_status','depression','panic_attack','anxiety','treatment']

for i in cols:
    df[i] = df[i].apply(lambda x: 1 if x == 'Yes' else 0)
    
# Convert 'study_year' to an integer ('year 1' and 'Year 1' both end with the digit)
df['study_year'] = df['study_year'].apply(lambda x: int(x[-1:]))

# Handle the 'cgpa' values: inspect the categories, then map each range to an ordinal score
df['cgpa'].unique()
df['cgpa'].value_counts().sort_values()

def change_cgpa(x):
    # The top range also appears with a trailing space in the raw data
    if x == '3.50 - 4.00' or x == '3.50 - 4.00 ':
        return 5
    elif x == '3.00 - 3.49':
        return 4
    elif x == '2.50 - 2.99':
        return 3
    elif x == '2.00 - 2.49':
        return 2
    else:
        # Any other value maps to the lowest score
        return 1

df['cgpa'] = df['cgpa'].apply(change_cgpa)

# Label encoding for the 'gender' column (Female -> 0, anything else -> 1)
df['gender'] = df['gender'].apply(lambda x: 0 if x == 'Female' else 1)

# Label encoding for the 'course' column (the integer codes follow the sorted labels and carry no meaning)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['course'] = le.fit_transform(df['course'])

# Drop the 'timestamp' column
df.drop('timestamp',axis=1,inplace=True)
# Delete any row that contains a null value (one row has a null 'age')
df.dropna(inplace=True)

df.head()

Unnamed: 0,gender,age,course,study_year,cgpa,marital_status,depression,anxiety,panic_attack,treatment
0,0,18.0,17,1,4,0,1,0,1,0
1,1,21.0,25,2,4,0,0,1,0,0
2,1,19.0,4,1,4,0,1,1,1,0
3,0,22.0,33,3,4,1,1,0,0,0
4,1,23.0,37,4,4,0,0,0,0,0
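The string-to-number conversions above can also be written with a lookup table and pandas string methods. The following is a minimal sketch under stated assumptions, not the notebook's own code: the small `raw` frame, the `'0 - 1.99'` value, and the helper names `encode_cgpa` and `encode_study_year` are hypothetical and only serve the illustration. Unlike `change_cgpa`, it strips whitespace from every value before the lookup.

```python
import pandas as pd

# Hypothetical sample of raw answers, including a trailing space
# and inconsistent capitalisation of 'year'.
raw = pd.DataFrame({
    "cgpa": ["3.00 - 3.49", "3.50 - 4.00 ", "2.50 - 2.99", "0 - 1.99"],
    "study_year": ["year 1", "Year 2", "year 3", "Year 4"],
})

# Ordinal score for each known CGPA range (same scores as change_cgpa)
CGPA_SCORES = {
    "3.50 - 4.00": 5,
    "3.00 - 3.49": 4,
    "2.50 - 2.99": 3,
    "2.00 - 2.49": 2,
}

def encode_cgpa(series):
    """Strip whitespace, map ranges to scores, give unknown ranges the score 1."""
    return series.str.strip().map(CGPA_SCORES).fillna(1).astype(int)

def encode_study_year(series):
    """Take the trailing digit of 'year N' / 'Year N' as an integer."""
    return series.str.strip().str[-1].astype(int)

encoded = pd.DataFrame({
    "cgpa": encode_cgpa(raw["cgpa"]),
    "study_year": encode_study_year(raw["study_year"]),
})
print(encoded)
```

Keeping the mapping in a dictionary makes it easy to see at a glance which raw values are recognised and which fall back to the default score.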


**Machine Learning Modeling**

We will train a model to predict the 'anxiety' status of students from selected features.
Then we will assess the model's accuracy and its ability to classify instances correctly, which shows where it is strong and where it is weak.

The goal is to understand which of the selected features influence students' anxiety and to build a predictive model for future instances.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


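features = ['gender', 'age', 'course', 'study_year', 'cgpa', 'marital_status']
# Every feature is already numeric after preprocessing, so get_dummies leaves the columns unchanged
X = pd.get_dummies(df[features], drop_first=True)
y = df['anxiety']  # target variable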

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print the results
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")


Accuracy: 0.8
Confusion Matrix:
[[12  2]
 [ 2  4]]
Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.86      0.86        14
           1       0.67      0.67      0.67         6

    accuracy                           0.80        20
   macro avg       0.76      0.76      0.76        20
weighted avg       0.80      0.80      0.80        20



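Based on the result, the model correctly predicted the class for 80% of the samples in the test set (16 of 20).
Class 0 (no anxiety) has higher precision and recall (0.86 each) than Class 1 (anxiety, 0.67 each). With only 6 anxious students in the test set, these figures are rough estimates.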
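To make the link between the confusion matrix and the report explicit, the metrics can be recomputed by hand. This is a small sketch in plain Python using the matrix printed above; the helper name `metrics_from_confusion` is hypothetical, and the "majority baseline" (always predicting the most frequent class) is added here only as a point of comparison.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions
conf_matrix = [[12, 2],
               [2, 4]]

def metrics_from_confusion(cm):
    """Recompute accuracy, per-class precision/recall and a majority baseline."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_0": tn / (tn + fn),   # predicted 0 and truly 0
        "recall_0": tn / (tn + fp),      # truly 0 and found
        "precision_1": tp / (tp + fp),   # predicted 1 and truly 1
        "recall_1": tp / (tp + fn),      # truly 1 and found
        # Accuracy obtained by always predicting the most frequent true class
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }

metrics = metrics_from_confusion(conf_matrix)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The majority baseline comes out at 0.70 (14 of 20 students have no anxiety), so the model's 0.80 accuracy is ten points above always predicting "no anxiety".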

Now we do the same for 'depression'.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


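features = ['gender', 'age', 'course', 'study_year', 'cgpa', 'marital_status']
# Every feature is already numeric after preprocessing, so get_dummies leaves the columns unchanged
X = pd.get_dummies(df[features], drop_first=True)
y = df['depression']  # target variable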

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print the results
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")


Accuracy: 0.75
Confusion Matrix:
[[12  3]
 [ 2  3]]
Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.80      0.83        15
           1       0.50      0.60      0.55         5

    accuracy                           0.75        20
   macro avg       0.68      0.70      0.69        20
weighted avg       0.77      0.75      0.76        20
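Because each test set holds only 20 students, a single train/test split gives a noisy accuracy estimate. One common way to get a steadier figure is stratified k-fold cross-validation. The sketch below is an illustration on synthetic data, not a result for this dataset: `X_demo` and `y_demo` are hypothetical stand-ins generated with `make_classification`, and the number of folds is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the survey data (assumption): 100 rows, 6 features,
# roughly 70% negative and 30% positive labels.
X_demo, y_demo = make_classification(
    n_samples=100, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=42,
)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Accuracy per fold:", np.round(scores, 2))
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

With the real data, `X_demo` and `y_demo` would be replaced by the `X` and `y` built in the cells above; the spread across folds then indicates how much the accuracy depends on the particular split.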



**Feature importance**

Feature importance measures how much each feature (input variable) contributes to a model's prediction of the target variable, which helps identify the features that matter most. For tree-based models such as Random Forest, scikit-learn's feature_importances_ attribute reports an impurity-based importance: for each feature, the total reduction of the split criterion achieved by the splits that use it, averaged over all trees and normalized so that the values sum to 1. A higher value means the feature plays a larger role in the model's predictions.
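As a small illustration of how a forest's importances relate to its individual trees, the sketch below fits a forest on synthetic data and compares the forest-level values with the average over its trees. The data and the names `X_demo`, `y_demo`, and `forest` are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 4 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    random_state=0,
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Importance reported by the forest
forest_importance = forest.feature_importances_

# Importance of each individual tree, then the average across trees
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_of_trees = per_tree.mean(axis=0)

print("Forest importances:", np.round(forest_importance, 3))
print("Mean over trees:   ", np.round(mean_of_trees, 3))
print("Sum of importances:", round(float(forest_importance.sum()), 6))
```

The values are relative shares rather than effect sizes: they say how much of the model's splitting work each feature does, not how strongly it changes the outcome.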

In [6]:
from sklearn.ensemble import RandomForestRegressor

features = ['course', 'study_year', 'cgpa']
X = pd.get_dummies(df[features], drop_first=True)
y = df['anxiety'] # target variable

# Fit a regressor on the 0/1 target; only its feature importances are used below
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Feature importance
feature_importance = pd.Series(model.feature_importances_, index=X.columns)
print("Feature Importance:\n", feature_importance)


Feature Importance:
 course        0.701243
study_year    0.166983
cgpa          0.131774
dtype: float64


According to the trained RandomForestRegressor model, the type of course is by far the most influential of the three features in predicting anxiety (importance of about 0.70).
Study year (0.17) and CGPA (0.13) also contribute to the predictions, but to a much smaller degree.
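Impurity-based importances are known to favor features with many distinct values, and 'course' has far more categories than the other features. A possible cross-check, not part of the original analysis, is permutation importance on held-out data. The sketch below uses synthetic data under stated assumptions: `signal` drives the target, `noise_id` is a many-valued code unrelated to it, and both names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic data (assumption): 'signal' determines the target up to 10% label
# noise, 'noise_id' takes 40 distinct values and is unrelated to the target.
signal = rng.integers(0, 2, size=n)
noise_id = rng.integers(0, 40, size=n)
flip = rng.random(n) < 0.1
y_demo = np.where(flip, 1 - signal, signal)
X_demo = pd.DataFrame({"signal": signal, "noise_id": noise_id})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance, computed from the training process
impurity = pd.Series(forest.feature_importances_, index=X_demo.columns)

# Permutation importance: drop in test accuracy when one column is shuffled
perm = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
permutation = pd.Series(perm.importances_mean, index=X_demo.columns)

print(pd.DataFrame({"impurity": impurity, "permutation": permutation}))
```

If the ranking of 'course' holds up under permutation importance on a held-out set, the conclusion above is better supported; if it drops sharply, part of its impurity-based score may come from its large number of categories.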

In [7]:
from sklearn.ensemble import RandomForestRegressor

features = ['course', 'study_year', 'cgpa']
X = pd.get_dummies(df[features], drop_first=True)
y = df['depression']

# Fit a regressor on the 0/1 target; only its feature importances are used below
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Feature importance
feature_importance = pd.Series(model.feature_importances_, index=X.columns)
print("Feature Importance:\n", feature_importance)

Feature Importance:
 course        0.636581
study_year    0.178753
cgpa          0.184666
dtype: float64


We obtain a similar result when the target variable is 'depression': the type of course is again the most influential feature in predicting depression (importance of about 0.64).
Study year and CGPA also contribute to the predictions, with CGPA (0.185) slightly more important than study year (0.179).