# Executive Summary

### Goals

- Find models that can predict whether a patient has a mental health problem by looking at their recent activities.

### Data Origin

- The data was downloaded from Kaggle.

### Metrics

- Maximize model accuracy, preferably above 98%.

### Findings

- The best model is random forest, which outperforms the other models with an accuracy of 98.5% and negligible overfitting.

- Among the hyperparameter values tested, the best setting for the random forest is a max depth of 10 with 100 estimators.

### Risks / Limitations / Assumptions

#### Risks
- There may be technical risks such as scalability, since the data file is relatively large.
- The data may have sampling bias.
#### Assumptions
- The data is consistent.
- The features selected and provided are relevant to the target variable.


# Statistical Analysis Summary

### Implementation

#### EDA
This is a classification problem: the target variable is categorical. Inspecting the distribution of the target shows that it is multi-class, with the values ‘Yes’, ‘No’ and ‘Maybe’. The dataset has a fair number of missing values, and all of them are in the column that records whether the person is self-employed. Through EDA I discovered that a standard binary classification setup does not apply; the modeling has to handle multiple classes, which requires some adjustments. Barring any mistakes on my part, the EDA is complete.
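
The two checks described above (where the missing values sit, and how many classes the target has) can be sketched on a small made-up frame. The column names follow the dataset, but the rows are invented for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy rows for illustration only; values are invented, not from the Kaggle file.
toy = pd.DataFrame({
    "self_employed": [None, "No", "Yes", None, "No", "No"],
    "Mental_Health_History": ["Yes", "No", "Maybe", "No", "Yes", "Maybe"],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
columns_with_missing = missing[missing > 0].index.tolist()

# More than two distinct target values means a multi-class problem.
n_classes = toy["Mental_Health_History"].nunique()

print(columns_with_missing)
print(n_classes)
```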

#### Modeling
I started the modeling with logistic regression, expecting it to be the most suitable model for this project. However, after fitting and evaluating it, the accuracy was only 0.355, which is far too low for this project. I therefore switched to random forest. After fitting and evaluation, its accuracy reached 0.976, which is much better. This may not be the best achievable performance, since no hyperparameter tuning or bootstrapping had been done at that point; doing so may improve performance further.

After that, I ran a grid search over several models (LogisticRegression, RidgeClassifier, KNN, DecisionTree and RandomForest) to find the best model for my goals and its optimal hyperparameters.

### Evaluation

After a lengthy modeling run, the model performance is summarized below.

- LogisticRegression: accuracy score 0.36
- RidgeClassifier: accuracy score 0.41
- KNN: accuracy score 0.33
- DecisionTree: accuracy score 0.98
- RandomForest: accuracy score 0.98
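
For context, a model that always predicts the most frequent class sets a floor for accuracy. A minimal sketch of that baseline, using the class counts from the `value_counts()` output of the target column in this notebook (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output of the target column.
class_counts = {"No": 102179, "Maybe": 93664, "Yes": 91319}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = class_counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))
```

On these counts the baseline is about 0.356, so the LogisticRegression score is at the level of always predicting `No`. This agrees with its classification report, where only one class has non-zero recall.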

# Modeling and Coding

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [2]:
df = pd.read_csv('Mental Health Dataset.csv')

In [3]:
df.head() # Reading the initial data

Unnamed: 0,Timestamp,Gender,Country,Occupation,self_employed,family_history,treatment,Days_Indoors,Growing_Stress,Changes_Habits,Mental_Health_History,Mood_Swings,Coping_Struggles,Work_Interest,Social_Weakness,mental_health_interview,care_options
0,2014-08-27 11:29:31,Female,United States,Corporate,,No,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Not sure
1,2014-08-27 11:31:50,Female,United States,Corporate,,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,No
2,2014-08-27 11:32:39,Female,United States,Corporate,,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Yes
3,2014-08-27 11:37:59,Female,United States,Corporate,No,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,Maybe,Yes
4,2014-08-27 11:43:36,Female,United States,Corporate,No,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Yes


# EDA and Data Preparation

In [4]:
df.isnull().sum()  # count missing values per column

Timestamp                     0
Gender                        0
Country                       0
Occupation                    0
self_employed              5202
family_history                0
treatment                     0
Days_Indoors                  0
Growing_Stress                0
Changes_Habits                0
Mental_Health_History         0
Mood_Swings                   0
Coping_Struggles              0
Work_Interest                 0
Social_Weakness               0
mental_health_interview       0
care_options                  0
dtype: int64

In [5]:
df = df.dropna()

In [6]:
df.isnull().sum()

Timestamp                  0
Gender                     0
Country                    0
Occupation                 0
self_employed              0
family_history             0
treatment                  0
Days_Indoors               0
Growing_Stress             0
Changes_Habits             0
Mental_Health_History      0
Mood_Swings                0
Coping_Struggles           0
Work_Interest              0
Social_Weakness            0
mental_health_interview    0
care_options               0
dtype: int64

In [7]:
df.head()  # preview the data after dropping rows with missing values

Unnamed: 0,Timestamp,Gender,Country,Occupation,self_employed,family_history,treatment,Days_Indoors,Growing_Stress,Changes_Habits,Mental_Health_History,Mood_Swings,Coping_Struggles,Work_Interest,Social_Weakness,mental_health_interview,care_options
3,2014-08-27 11:37:59,Female,United States,Corporate,No,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,Maybe,Yes
4,2014-08-27 11:43:36,Female,United States,Corporate,No,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Yes
5,2014-08-27 11:49:51,Female,Poland,Corporate,No,No,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,Maybe,Not sure
6,2014-08-27 11:51:34,Female,Australia,Corporate,No,Yes,Yes,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,Not sure
7,2014-08-27 11:52:41,Female,United States,Corporate,No,No,No,1-14 days,Yes,No,Yes,Medium,No,No,Yes,No,No


In [8]:
df = pd.get_dummies(
    df,
    columns=['Gender', 'Country', 'Occupation', 'self_employed', 'family_history',
             'treatment', 'Days_Indoors', 'Growing_Stress', 'Changes_Habits',
             'Mood_Swings', 'Coping_Struggles', 'Work_Interest', 'Social_Weakness',
             'mental_health_interview', 'care_options'],
    dtype=int,
)  # dummify the categorical feature columns

In [9]:
df.head()

Unnamed: 0,Timestamp,Mental_Health_History,Gender_Female,Gender_Male,Country_Australia,Country_Belgium,Country_Bosnia and Herzegovina,Country_Brazil,Country_Canada,Country_Colombia,...,Work_Interest_Yes,Social_Weakness_Maybe,Social_Weakness_No,Social_Weakness_Yes,mental_health_interview_Maybe,mental_health_interview_No,mental_health_interview_Yes,care_options_No,care_options_Not sure,care_options_Yes
3,2014-08-27 11:37:59,Yes,1,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,1
4,2014-08-27 11:43:36,Yes,1,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,1
5,2014-08-27 11:49:51,Yes,1,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,1,0
6,2014-08-27 11:51:34,Yes,1,0,1,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
7,2014-08-27 11:52:41,Yes,1,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,1,0,0


In [10]:
# Parse the timestamp strings, then convert them to a numerical value
# (seconds since the epoch) so that the model can use them.
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

df['Timestamp_'] = df['Timestamp'].apply(lambda x: x.timestamp())

In [11]:
df.drop('Timestamp', axis=1, inplace=True)

In [12]:
df['Mental_Health_History'].value_counts()  # inspect the distribution of the target variable

Mental_Health_History
No       102179
Maybe     93664
Yes       91319
Name: count, dtype: int64

### Transforming / Engineering Data

- The data is first inspected for missing values, and each column is checked to see whether it is categorical or numerical.

- Rows with missing values are dropped with dropna, and the categorical features are dummified into binary indicator columns.

- The timestamp is converted from a string to a numerical value so that it represents time.

- The target variable is analysed to decide which models or changes are needed to accommodate it.

- A label encoder converts the target variable into integer labels that the models can work with.
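
The steps listed above can be sketched end to end on a tiny made-up frame. The column names follow the dataset, but the three rows are invented for illustration, and only two of the categorical columns are included to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows for illustration only; values are invented, not from the real data.
toy = pd.DataFrame({
    "Timestamp": ["2014-08-27 11:29:31", "2014-08-27 11:37:59", "2014-08-27 11:43:36"],
    "Gender": ["Female", "Male", "Female"],
    "self_employed": [None, "No", "Yes"],
    "Mental_Health_History": ["Yes", "No", "Maybe"],
})

# 1. Drop rows that contain missing values.
toy = toy.dropna()

# 2. Turn the categorical features into 0/1 indicator columns.
toy = pd.get_dummies(toy, columns=["Gender", "self_employed"], dtype=int)

# 3. Convert the timestamp strings to seconds since the Unix epoch.
toy["Timestamp_"] = pd.to_datetime(toy["Timestamp"]).apply(lambda t: t.timestamp())
toy = toy.drop(columns="Timestamp")

# 4. Encode the string target as integers (classes are sorted alphabetically).
encoder = LabelEncoder()
y = encoder.fit_transform(toy["Mental_Health_History"])

print(toy.columns.tolist())
print(list(encoder.classes_), y.tolist())
```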

# Modeling

In [13]:
X = df.drop(columns='Mental_Health_History')  # define features X and target y for the train/test split
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['Mental_Health_History'])

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.20,
                                                    random_state = 42,
                                                    stratify = y)

### Selecting models and optimizing hyperparameter

- Model selection and hyperparameter optimization are done by running each model through a grid search to find the best-performing model and its best hyperparameter set among the listed candidates.

- Computationally expensive models such as SVM are excluded because of their long run time. The number of hyperparameter values tested is reduced for the same reason.
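
To see why the grid size matters for run time, the number of model fits can be counted before anything is trained. A rough sketch, using the grids defined in the cells that follow and the same 5-fold setting; the helper names (`cv_folds`, `fits_per_model`) are illustrative:

```python
from math import prod

# Grids copied from the notebook's models_param dictionary.
models_param = {
    "LogisticRegression": {"C": [0.1, 1, 10], "solver": ["liblinear", "saga"]},
    "RidgeClassifier": {"alpha": [0.1, 1, 10]},
    "KNN": {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    "DecisionTree": {"max_depth": [None, 10, 20], "min_samples_split": [2, 10, 20]},
    "RandomForest": {"n_estimators": [10, 50, 100], "max_depth": [None, 10, 20]},
}
cv_folds = 5  # same value as cv=5 in the grid search

fits_per_model = {}
for name, grid in models_param.items():
    # A full grid tries every combination, so candidates multiply across parameters.
    n_candidates = prod(len(values) for values in grid.values())
    fits_per_model[name] = n_candidates * cv_folds

total_fits = sum(fits_per_model.values())
print(fits_per_model, total_fits)
```

The first three counts match the "Fitting 5 folds for each of ... candidates" lines in the grid search output. They exclude the final refit that `GridSearchCV` performs on the best candidate by default.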

In [15]:
models = {  # initialize the candidate classifiers
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'RidgeClassifier': RidgeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'RandomForest': RandomForestClassifier(random_state=42),
}

In [16]:
models_param = {  # hyperparameter grids to test for each model
    'LogisticRegression': {
        'C': [0.1, 1, 10],
        'solver': ['liblinear', 'saga']
    },
    'RidgeClassifier': {
        'alpha': [0.1, 1, 10]
    },
    'KNN': {
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance']
    },
    'DecisionTree': {
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 10, 20]
    },
    'RandomForest': {
        'n_estimators': [10, 50, 100],
        'max_depth': [None, 10, 20]
    }
}

In [17]:
for name, model in models.items():  # grid search each model over its hyperparameter grid
    gridsearch = GridSearchCV(model, models_param[name], cv=5, verbose=1)
    gridsearch.fit(X_train, y_train)
    best_model = gridsearch.best_estimator_
    print(f"Best parameters for {name}: {gridsearch.best_params_}")
    print(f"Best cross-validation score for {name}: {gridsearch.best_score_:.4f}")
    y_pred = best_model.predict(X_test)
    print(f"Model: {name}")
    print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters for LogisticRegression: {'C': 0.1, 'solver': 'liblinear'}
Best cross-validation score for LogisticRegression: 0.3558
Model: LogisticRegression
              precision    recall  f1-score   support

           0       0.00      0.00      0.00     18733
           1       0.36      1.00      0.52     20436
           2       0.00      0.00      0.00     18264

    accuracy                           0.36     57433
   macro avg       0.12      0.33      0.17     57433
weighted avg       0.13      0.36      0.19     57433

Fitting 5 folds for each of 3 candidates, totalling 15 fits


  _warn_prf(average, modifier, msg_start, len(result))
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T

Best parameters for RidgeClassifier: {'alpha': 0.1}
Best cross-validation score for RidgeClassifier: 0.4176
Model: RidgeClassifier
              precision    recall  f1-score   support

           0       0.41      0.44      0.43     18733
           1       0.41      0.49      0.45     20436
           2       0.41      0.28      0.34     18264

    accuracy                           0.41     57433
   macro avg       0.41      0.41      0.40     57433
weighted avg       0.41      0.41      0.41     57433

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters for KNN: {'n_neighbors': 5, 'weights': 'uniform'}
Best cross-validation score for KNN: 0.3338
Model: KNN
              precision    recall  f1-score   support

           0       0.32      0.44      0.37     18733
           1       0.36      0.37      0.37     20436
           2       0.31      0.18      0.23     18264

    accuracy                           0.33     57433
   macro avg       0.33      0.33  

## Choosing Random Forest

In [16]:
RandFor = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=42)  # refit the best model found by the grid search
RandFor.fit(X_train, y_train)

# Evaluation

## Random Forest

In [18]:
cross_val_score(RandFor, X, y, cv=5)  # 5-fold cross-validation accuracy of the random forest on the full dataset

array([0.36773284, 0.42630543, 0.35713539, 0.37844756, 0.34137763])
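
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses `StratifiedKFold` without shuffling, so each fold is a contiguous block of rows. The fold scores above (0.34 to 0.43) are well below the hold-out accuracy reported in the following cells. One possible explanation, offered here as an assumption and not verified, is that row order matters in this dataset. A minimal sketch of cross-validation with shuffled folds, run on synthetic data from `make_classification` as a stand-in for the real `X` and `y`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y; the real notebook data is not used here.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# shuffle=True assigns rows to folds at random instead of in file order.
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=shuffled_cv)

print(scores.mean(), scores.std())
```

Passing the splitter object through `cv` keeps the rest of the call unchanged, so the same pattern could be tried on the notebook's own `X` and `y`.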

In [20]:
y_pred = RandFor.predict(X_test)

In [21]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)

In [22]:
accuracy

0.9849737955530793

In [23]:
print(report)

              precision    recall  f1-score   support

       Maybe       0.99      0.99      0.99     18733
          No       0.98      0.99      0.99     20436
         Yes       0.98      0.98      0.98     18264

    accuracy                           0.98     57433
   macro avg       0.98      0.98      0.98     57433
weighted avg       0.98      0.98      0.98     57433



In [24]:
train_score = RandFor.score(X_train, y_train)

test_score = RandFor.score(X_test, y_test)

In [25]:
train_score  # the train and test scores are nearly equal, which indicates minimal overfitting

0.9853610123232156

In [26]:
test_score

0.9849737955530793
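
The overfitting check in the cells above comes down to the gap between the two scores. A short sketch of that arithmetic, with the values copied from the notebook output:

```python
# Scores copied from the notebook output above.
train_score = 0.9853610123232156
test_score = 0.9849737955530793

# A small positive gap means the model does only slightly better on data it has seen.
gap = train_score - test_score

print(f"train - test accuracy gap: {gap:.4f}")
```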

### Minimizing false negatives

- Positive means the person has a mental health problem, and negative means the person does not. A false negative is someone who is predicted not to have a mental health problem but actually has one. This is worse than a false positive and should be prioritized, so minimizing false negatives is a must. This can be achieved with a high recall.

- The recall for this model is 0.99 for the Maybe and No classes and 0.98 for the Yes class, so the goal of minimizing false negatives is achieved.
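
The link between recall and false negatives can be made concrete with a small worked example. The labels below are invented for illustration and are not the notebook's predictions. Because the target has three classes, treating "Yes" as the positive class is an assumption made for this sketch.

```python
from collections import Counter

# Invented labels for illustration; not the notebook's predictions.
y_true = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "Maybe", "Maybe", "Maybe"]
y_pred = ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "Maybe", "Maybe", "No"]

# Number of actual members of each class (TP + FN).
actual = Counter(y_true)
# Number of correctly predicted members of each class (TP).
correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

# Recall = TP / (TP + FN): the share of actual members of a class that were found.
recall = {label: correct[label] / actual[label] for label in actual}

# False negatives for "Yes": actual "Yes" rows predicted as anything else.
false_negatives_yes = actual["Yes"] - correct["Yes"]

print(recall, false_negatives_yes)
```

A recall close to 1 for a class therefore means that few of its actual members were missed, which is the quantity this section aims to keep small.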