# **Machine Learning for Mental Health**

##### The project involves machine learning models that can assist in predicting mental health conditions, analyzing mental health-related data and creating tools that help in diagnosing or monitoring mental health conditions.


##### Our main objective is to predict, from the values in the dataset, whether a patient should be treated for a mental illness.

##### Output label = seek_help
##### Features = Age, Gender, self_employed, family_history, treatment, no_employees, remote_work, tech_company, benefits, care_options, wellness_program, seek_help, anonymity, leave, mental_health_consequence, phys_health_consequence, coworkers, supervisor, mental_health_interview, phys_health_interview, mental_vs_physical, obs_consequence
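
A minimal sketch of how the label and the features listed above could be separated. The three rows below are made up for illustration; only the column names come from the survey. The label column is dropped from the feature matrix so that `seek_help` is not used to predict itself.

```python
import pandas as pd

# Hypothetical three-row sample; only the column names come from the survey.
sample = pd.DataFrame({
    "Age": [46, 29, 31],
    "family_history": ["Yes", "Yes", "No"],
    "treatment": ["No", "Yes", "No"],
    "seek_help": ["Yes", "No", "No"],
})

label = "seek_help"
y = sample[label]                 # target column
X = sample.drop(columns=[label])  # every remaining column is a candidate feature

print(X.columns.tolist())
print(y.tolist())
```

Keeping the label name in a single variable makes it easy to switch the target (for example to `treatment`) without touching the rest of the code.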

### **1. Imports**

In [81]:
import pandas as pd
import numpy as np

### **2. Data Preprocessing**

In [82]:
path = 'survey.csv'
data = pd.read_csv(path)

In [111]:
data.shape  # (rows, columns); the execution count shows this cell was re-run after the cleaning cells below

(977, 25)

In [84]:
data.describe()

                  Age
count          1259.0
mean       79428150.0
std      2818299000.0
min           -1726.0
25%              27.0
50%              31.0
75%              36.0
max    100000000000.0


In [85]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness

In [86]:
data.isna().sum()

Timestamp                       0
Age                             0
Gender                          0
Country                         0
state                         515
self_employed                  18
family_history                  0
treatment                       0
work_interfere                264
no_employees                    0
remote_work                     0
tech_company                    0
benefits                        0
care_options                    0
wellness_program                0
seek_help                       0
anonymity                       0
leave                           0
mental_health_consequence       0
phys_health_consequence         0
coworkers                       0
supervisor                      0
mental_health_interview         0
phys_health_interview           0
mental_vs_physical              0
obs_consequence                 0
comments                     1095
dtype: int64
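
The counts above show that `comments` and `state` are mostly empty. As a rough sketch, sparse columns could be dropped by a missing-value threshold instead of by name. The helper `drop_sparse_columns`, the 0.4 threshold, and the toy rows are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.4) -> pd.DataFrame:
    """Return a copy of df without columns whose missing fraction exceeds max_missing."""
    missing_fraction = df.isna().mean()
    sparse = missing_fraction[missing_fraction > max_missing].index
    return df.drop(columns=sparse)

# Toy frame with made-up values that mimics the pattern seen in the survey.
toy = pd.DataFrame({
    "Age": [46, 29, 31, 41, 33],
    "state": ["IL", np.nan, np.nan, "TX", np.nan],         # 60% missing
    "comments": [np.nan, np.nan, np.nan, np.nan, "text"],  # 80% missing
    "self_employed": ["Yes", "No", np.nan, "No", "No"],    # 20% missing
})

trimmed = drop_sparse_columns(toy)
print(trimmed.columns.tolist())
```

A column such as `Timestamp` has no missing values, so a threshold rule would keep it; it still has to be dropped by name if it is not wanted as a feature.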

In [87]:
data.drop(['comments', 'state', 'Timestamp'],  axis=1, inplace=True)

In [88]:
data.isna().sum()

Age                            0
Gender                         0
Country                        0
self_employed                 18
family_history                 0
treatment                      0
work_interfere               264
no_employees                   0
remote_work                    0
tech_company                   0
benefits                       0
care_options                   0
wellness_program               0
seek_help                      0
anonymity                      0
leave                          0
mental_health_consequence      0
phys_health_consequence        0
coworkers                      0
supervisor                     0
mental_health_interview        0
phys_health_interview          0
mental_vs_physical             0
obs_consequence                0
dtype: int64

In [89]:
# Percentage of missing values in each column of `data`
missing_percentages = (data.isna().mean() * 100).round(2)

print(missing_percentages)

Age                           0.00
Gender                        0.00
Country                       0.00
self_employed                 1.43
family_history                0.00
treatment                     0.00
work_interfere               20.97
no_employees                  0.00
remote_work                   0.00
tech_company                  0.00
benefits                      0.00
care_options                  0.00
wellness_program              0.00
seek_help                     0.00
anonymity                     0.00
leave                         0.00
mental_health_consequence     0.00
phys_health_consequence       0.00
coworkers                     0.00
supervisor                    0.00
mental_health_interview       0.00
phys_health_interview         0.00
mental_vs_physical            0.00
obs_consequence               0.00
dtype: float64


#### **Cleaning the NaN values**

In [90]:
data = data.dropna()
print(data)

      Age Gender        Country self_employed family_history treatment  \
18     46   male  United States           Yes            Yes        No   
20     29   Male  United States            No            Yes       Yes   
21     31   male  United States           Yes             No        No   
22     46   Male  United States            No             No       Yes   
23     41   Male  United States            No             No       Yes   
...   ...    ...            ...           ...            ...       ...   
1252   29   male  United States            No            Yes       Yes   
1253   36   Male  United States            No            Yes        No   
1255   32   Male  United States            No            Yes       Yes   
1256   34   male  United States            No            Yes       Yes   
1258   25   Male  United States            No            Yes       Yes   

     work_interfere    no_employees remote_work tech_company  ...   anonymity  \
18        Sometimes           

In [91]:
data.isna().sum()

Age                          0
Gender                       0
Country                      0
self_employed                0
family_history               0
treatment                    0
work_interfere               0
no_employees                 0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
wellness_program             0
seek_help                    0
anonymity                    0
leave                        0
mental_health_consequence    0
phys_health_consequence      0
coworkers                    0
supervisor                   0
mental_health_interview      0
phys_health_interview        0
mental_vs_physical           0
obs_consequence              0
dtype: int64

The data now contains no missing values.
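
`dropna()` discards every row with a missing `self_employed` or `work_interfere` value, which can be up to 282 rows (18 + 264). As a possible alternative, and only as a sketch, the missing values could be imputed so that no rows are lost. The toy rows and the `"Unknown"` label below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "self_employed": ["No", np.nan, "Yes", "No"],
    "work_interfere": ["Sometimes", np.nan, "Never", np.nan],
})

imputed = toy.copy()
# Few values are missing here, so the most frequent answer is a reasonable stand-in.
imputed["self_employed"] = imputed["self_employed"].fillna(imputed["self_employed"].mode()[0])
# Many values are missing here, so "missing" is kept as its own category.
imputed["work_interfere"] = imputed["work_interfere"].fillna("Unknown")

print(imputed)
```

Whether imputing is preferable to dropping depends on how the missing answers arose, so both variants are worth comparing during model evaluation.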

#### **Encoding data**

In [92]:
# Print the distinct values of every column
for col in data.columns:
    print(col)
    print(data[col].unique())

Age
[         46          29          31          41          33          35
          34          37          32          30          42          40
          27          24          18          38          26          22
          44          23          36          28          39          25
          45          21          19          43          56          54
         329          55 99999999999          57          58          48
          47          62          51          50          49          20
       -1726          53          61           8          11          -1
          72          60]
Gender
['male' 'Male' 'Female' 'female' 'M' 'm' 'Male-ish' 'Trans-female'
 'Cis Female' 'F' 'Cis Male' 'f' 'Mal' 'queer/she/they' 'non-binary'
 'woman' 'Make' 'Nah' 'All' 'Enby' 'fluid' 'Genderqueer' 'Female '
 'Androgyne' 'Agender' 'cis-female/femme' 'Guy (-ish) ^_^'
 'male leaning androgynous' 'Male ' 'Trans woman' 'msle' 'Neuter'
 'Female (trans)' 'queer' 'Female (cis)' 'Mail' 'ci

##### **a. Gender column**

In [93]:
# unique values of Gender
data['Gender'].unique()

array(['male', 'Male', 'Female', 'female', 'M', 'm', 'Male-ish',
       'Trans-female', 'Cis Female', 'F', 'Cis Male', 'f', 'Mal',
       'queer/she/they', 'non-binary', 'woman', 'Make', 'Nah', 'All',
       'Enby', 'fluid', 'Genderqueer', 'Female ', 'Androgyne', 'Agender',
       'cis-female/femme', 'Guy (-ish) ^_^', 'male leaning androgynous',
       'Male ', 'Trans woman', 'msle', 'Neuter', 'Female (trans)',
       'queer', 'Female (cis)', 'Mail', 'cis male', 'A little about you',
       'Malr', 'p', 'Woman', 'femail', 'Cis Man',
       'ostensibly male, unsure what that really means'], dtype=object)

In [94]:

male = ["male", "Male", "M", "m", "Male-ish", "maile", "Mal", "Male (CIS)","mail", "Guy (-ish) ^_^", "Mail",
        "male leaning androgynous", "Cis Male", "cis male", "Cis Man", "Make", "something kinda male?", "Man", "Malr", "Male "]
female = ["female", "Female", "Woman", "woman","F", "f","femail", "Cis Female", "Femake", "cis-female/femme",
          "Female (cis)", "Female "]
other = ["Trans-female", "queer/she/they", "non-binary", "Genderqueer", "Female (trans)", "Agender", "A little about you",
         "Nah", "All", "Enby", "fluid", "Androgyne", "Agender", "Trans woman", "msle", "Neuter", "queer", "p", "ostensibly male, unsure what that really means"]

In [95]:
# Map every raw spelling to its canonical label in one vectorized pass
# (avoids the row-by-row loop and in-place replace on a column selection)
gender_map = {
    **dict.fromkeys(male, 'male'),
    **dict.fromkeys(female, 'female'),
    **dict.fromkeys(other, 'other'),
}
data['Gender'] = data['Gender'].replace(gender_map)


In [96]:
data['Gender'].unique()

array(['male', 'female', 'other'], dtype=object)
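
The lists above need one entry per spelling, including case and trailing-space variants. A possible simplification, sketched here under assumptions, is to trim and lower-case each answer before looking it up. The function `normalize_gender` and the two sets are hypothetical and deliberately shorter than the notebook's lists. Unlike the notebook's replacement, anything not listed falls into `other`.

```python
import pandas as pd

# Spellings after trimming and lower-casing; extend as new variants appear.
MALE = {"male", "m", "man", "cis male", "cis man", "mal", "malr", "make", "mail"}
FEMALE = {"female", "f", "woman", "cis female", "femail", "female (cis)"}

def normalize_gender(value: str) -> str:
    """Map a free-text gender answer to 'male', 'female', or 'other'."""
    cleaned = value.strip().lower()
    if cleaned in MALE:
        return "male"
    if cleaned in FEMALE:
        return "female"
    return "other"

raw = pd.Series(["Male ", "M", "femail", "Cis Female", "non-binary", "Enby"])
normalized = raw.map(normalize_gender)
print(normalized.tolist())
```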

##### **b. Age column**

In [97]:
# Fill any missing ages with the median
data['Age'] = data['Age'].fillna(data['Age'].median())

# Replace implausible ages (< 14 or > 120) with the median
data.loc[data['Age'] < 14, 'Age'] = data['Age'].median()
data.loc[data['Age'] > 120, 'Age'] = data['Age'].median()

#Ranges of Age
data['age_range'] = pd.cut(data['Age'], [0,20,30,65,120], labels=["0-20", "21-30", "31-65", "66-120"], include_lowest=True)
data["Age"].unique()

array([46, 29, 31, 41, 33, 35, 34, 37, 32, 30, 42, 40, 27, 24, 18, 38, 26,
       22, 44, 23, 36, 28, 39, 25, 45, 21, 19, 43, 56, 54, 55, 57, 58, 48,
       47, 62, 51, 50, 49, 20, 53, 61, 72, 60])
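
The same clean-up can be sketched as a small reusable function. The name `clean_age` and its defaults are assumptions. It differs from the cell above in one respect: the median is computed from the plausible ages only, so extreme entries cannot influence the replacement value.

```python
import pandas as pd

def clean_age(age: pd.Series, low: int = 14, high: int = 120) -> pd.Series:
    """Replace ages outside [low, high] with the median of the plausible ages."""
    plausible = age.between(low, high)
    median = age[plausible].median()
    return age.where(plausible, median)

# A few of the values seen in the Age column, including the implausible ones.
ages = pd.Series([46, 29, -1726, 329, 99999999999, 8, 31])
cleaned = clean_age(ages)
print(cleaned.tolist())
```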

##### **c. Self employed column**

In [98]:
data['self_employed'].unique()

array(['Yes', 'No'], dtype=object)

In [99]:
# Binary encoding using Lambda function
data['self_employed'] = data['self_employed'].apply(lambda x: 1 if x == 'Yes' else 0)

##### **d. Family history column**

In [100]:
data['family_history'].unique()

array(['Yes', 'No'], dtype=object)

In [101]:
# Binary encoding using Lambda function
data['family_history'] = data['family_history'].apply(lambda x: 1 if x == 'Yes' else 0)

##### **e. Treatment column**

In [102]:
data['treatment'].unique()

array(['No', 'Yes'], dtype=object)

In [103]:
# Binary encoding using Lambda function
data['treatment'] = data['treatment'].apply(lambda x: 1 if x == 'Yes' else 0)

##### **f. Work interfere column**

In [110]:
data['work_interfere'].unique()

array([0, 1, 2, 3])

In [109]:
# Encode the categorical values
data['work_interfere'] = pd.factorize(data['work_interfere'])[0]
data['work_interfere'].unique()

array([0, 1, 2, 3])

In [114]:
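# Safeguard: fill any remaining missing values with the most common category
# (no rows are missing after dropna(), so this does not change the data here)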
data['work_interfere'] = data['work_interfere'].fillna(data['work_interfere'].mode()[0])
data["work_interfere"].isna().sum()

0
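
`pd.factorize` numbers the categories in order of first appearance, so the codes carry no meaning beyond identity. If the answers form a scale, an ordered categorical keeps that order in the codes. The sketch below assumes the answer scale is `Never`, `Rarely`, `Sometimes`, `Often`; only `Sometimes` is visible in the output above, so the other labels are an assumption.

```python
import pandas as pd

# Assumed answer scale, ordered from least to most interference.
ORDER = ["Never", "Rarely", "Sometimes", "Often"]

# Made-up answers for illustration.
answers = pd.Series(["Sometimes", "Never", "Often", "Rarely", "Sometimes"])

ordered = pd.Categorical(answers, categories=ORDER, ordered=True)
codes = pd.Series(ordered.codes, index=answers.index)

print(codes.tolist())                    # codes follow the scale
print(pd.factorize(answers)[0].tolist())  # codes follow order of appearance
```

Values outside the listed categories receive the code -1, which makes unexpected answers easy to spot.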

##### **g. Remote work column**

In [105]:
data['remote_work'].unique()

array(['Yes', 'No'], dtype=object)

In [106]:
# Binary encoding using Lambda function
data['remote_work'] = data['remote_work'].apply(lambda x: 1 if x == 'Yes' else 0)
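
The four Yes/No columns are encoded with the same lambda in separate cells. As a sketch, they could be encoded together with an explicit mapping. The toy rows are made up. Unlike the lambda, which turns every non-`Yes` value into 0, `map` leaves unexpected answers as NaN so that they remain visible.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "self_employed": ["Yes", "No", "No"],
    "family_history": ["Yes", "Yes", "No"],
    "treatment": ["No", "Yes", "Yes"],
    "remote_work": ["No", "No", "Yes"],
})

binary_cols = ["self_employed", "family_history", "treatment", "remote_work"]
# Apply the same Yes/No mapping to each listed column.
toy[binary_cols] = toy[binary_cols].apply(lambda col: col.map({"Yes": 1, "No": 0}))

print(toy)
```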

### Explore and visualize data: exploratory data analysis (EDA)


### Feature Engineering

### Model Selection 

### Model Training and evaluation 

### Model tuning

### Conclusions