#### Machine Learning for Mental Health

###### This project builds machine learning models that help predict mental health conditions, analyze mental health data, and support tools for diagnosing or monitoring those conditions.


###### Our main objective is to predict, from the values in the dataset, whether a patient should be treated for their mental illness.

###### Output label = seek_help
###### Features = Age, Gender, self_employed, family_history, treatment, no_employees, remote_work, tech_company, benefits, care_options, wellness_program, seek_help, anonymity, leave, mental_health_consequence, phys_health_consequence, coworkers, supervisor, mental_health_interview, phys_health_interview, mental_vs_physical, obs_consequence
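
The output label and the features need to be kept apart before any model is trained, otherwise the model can read the answer straight from its inputs. A minimal sketch of that split, assuming the label is `seek_help` as stated above; the toy rows and the names `toy`, `label_column`, `X` and `y` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Toy rows standing in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [37, 44, 32],
    "family_history": ["No", "No", "Yes"],
    "seek_help": ["Yes", "Don't know", "No"],
})

label_column = "seek_help"             # output label named in this notebook
X = toy.drop(columns=[label_column])   # model inputs: every column except the label
y = toy[label_column]                  # prediction target

print(list(X.columns))
print(y.tolist())
```

The same two lines would apply to the cleaned DataFrame once encoding is finished.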

#### Imports

In [97]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder



#### Data Preprocessing

In [98]:
path = 'survey.csv'
data = pd.read_csv(path)

In [99]:
data.head()

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [100]:
data.describe()

Unnamed: 0,Age
count,1259.0
mean,79428150.0
std,2818299000.0
min,-1726.0
25%,27.0
50%,31.0
75%,36.0
max,100000000000.0
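
The summary above shows that Age contains impossible values (a minimum of -1726 and a maximum of 100000000000), which is why the mean is meaningless. A small sketch of counting such entries, assuming the same 0 to 120 range that the Age cleaning step in this notebook uses; the `toy` frame is illustrative and only reuses a few ages that appear in the outputs:

```python
import pandas as pd

# Toy frame standing in for survey.csv; ages copied from outputs shown in this notebook
toy = pd.DataFrame({"Age": [37, 44, -1726, 329, 99999999999, 31, -29]})

# between() is inclusive on both ends by default
in_range = toy["Age"].between(0, 120)

print(f"{(~in_range).sum()} of {len(toy)} ages fall outside 0-120")
print(toy.loc[in_range, "Age"].describe())
```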


In [101]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness

In [102]:
data.isna().sum()

Timestamp                       0
Age                             0
Gender                          0
Country                         0
state                         515
self_employed                  18
family_history                  0
treatment                       0
work_interfere                264
no_employees                    0
remote_work                     0
tech_company                    0
benefits                        0
care_options                    0
wellness_program                0
seek_help                       0
anonymity                       0
leave                           0
mental_health_consequence       0
phys_health_consequence         0
coworkers                       0
supervisor                      0
mental_health_interview         0
phys_health_interview           0
mental_vs_physical              0
obs_consequence                 0
comments                     1095
dtype: int64

In [103]:
data.drop(['comments','work_interfere', 'state', 'Timestamp'],  axis=1, inplace=True)

In [104]:
data.isna().sum()

Age                           0
Gender                        0
Country                       0
self_employed                18
family_history                0
treatment                     0
no_employees                  0
remote_work                   0
tech_company                  0
benefits                      0
care_options                  0
wellness_program              0
seek_help                     0
anonymity                     0
leave                         0
mental_health_consequence     0
phys_health_consequence       0
coworkers                     0
supervisor                    0
mental_health_interview       0
phys_health_interview         0
mental_vs_physical            0
obs_consequence               0
dtype: int64

In [105]:
# Percentage of missing values per column in `data`
missing_percentages = (data.isna().mean() * 100).round(2)

print(missing_percentages)

Age                          0.00
Gender                       0.00
Country                      0.00
self_employed                1.43
family_history               0.00
treatment                    0.00
no_employees                 0.00
remote_work                  0.00
tech_company                 0.00
benefits                     0.00
care_options                 0.00
wellness_program             0.00
seek_help                    0.00
anonymity                    0.00
leave                        0.00
mental_health_consequence    0.00
phys_health_consequence      0.00
coworkers                    0.00
supervisor                   0.00
mental_health_interview      0.00
phys_health_interview        0.00
mental_vs_physical           0.00
obs_consequence              0.00
dtype: float64


##### Cleaning the NaN values

In [106]:
df_clean = data.dropna()
print(df_clean)

      Age Gender         Country self_employed family_history treatment  \
18     46   male   United States           Yes            Yes        No   
19     36   Male          France           Yes            Yes        No   
20     29   Male   United States            No            Yes       Yes   
21     31   male   United States           Yes             No        No   
22     46   Male   United States            No             No       Yes   
...   ...    ...             ...           ...            ...       ...   
1254   26   male  United Kingdom            No             No       Yes   
1255   32   Male   United States            No            Yes       Yes   
1256   34   male   United States            No            Yes       Yes   
1257   46      f   United States            No             No        No   
1258   25   Male   United States            No            Yes       Yes   

        no_employees remote_work tech_company benefits  ...   anonymity  \
18               1-5    

In [107]:
df_clean.isna().sum()

Age                          0
Gender                       0
Country                      0
self_employed                0
family_history               0
treatment                    0
no_employees                 0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
wellness_program             0
seek_help                    0
anonymity                    0
leave                        0
mental_health_consequence    0
phys_health_consequence      0
coworkers                    0
supervisor                   0
mental_health_interview      0
phys_health_interview        0
mental_vs_physical           0
obs_consequence              0
dtype: int64

Our data now contains no missing values: dropna() removed the 18 rows with a missing self_employed value, leaving 1241 of the original 1259 rows.
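
Dropping rows is the approach taken here. As a point of comparison only, the sketch below shows one alternative, filling the missing `self_employed` answers with the most frequent value so that no respondent is discarded. This is an assumption about what could be tried, not something the notebook does; the `toy` frame and variable names are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gap as the survey data
toy = pd.DataFrame({
    "self_employed": ["No", np.nan, "Yes", "No", np.nan],
    "treatment": ["Yes", "No", "No", "Yes", "Yes"],
})

dropped = toy.dropna()                             # approach used in this notebook

most_common = toy["self_employed"].mode().iloc[0]  # mode() ignores NaN by default
imputed = toy.fillna({"self_employed": most_common})

print(len(dropped), len(imputed))
print(imputed["self_employed"].tolist())
```

With only about 1.43% of rows affected, either choice is likely to have a small effect on this dataset.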

##### Encoding data

In [108]:
# Iterate through each column in the DataFrame
for column in df_clean.columns:
    unique_values = df_clean[column].unique()
    print(f'Column "{column}" has {len(unique_values)} unique values: {unique_values}')

Column "Age" has 53 unique values: [         46          36          29          31          41          33
          35          34          37          32          30          42
          40          27          38          50          24          18
          28          26          22          44          23          19
          25          39          45          21         -29          43
          56          60          54         329          55 99999999999
          48          20          57          58          47          62
          51          65          49       -1726           5          53
          61           8          11          -1          72]
Column "Gender" has 49 unique values: ['male' 'Male' 'Female' 'female' 'M' 'm' 'Male-ish' 'maile' 'Trans-female'
 'Cis Female' 'F' 'something kinda male?' 'Cis Male' 'Woman' 'f' 'Mal'
 'Male (CIS)' 'queer/she/they' 'non-binary' 'Femake' 'woman' 'Make' 'Nah'
 'All' 'Enby' 'fluid' 'Genderqueer' 'Female ' 'Androgyne' 'Ag

1 - Gender column

In [109]:
# Map the free-text Gender answers onto three categories: female, male, other
female_labels = ['Femake', 'Female (cis)', 'femail', 'cis-female/femme', 'Female (trans)', 'Trans woman', 'woman',
                 'f', 'F', 'Female', 'female', 'Trans-female', 'Cis Female', 'Woman', 'queer/she/they']
male_labels = ['m', 'M', 'male', 'Male', 'Male-ish', 'maile', 'Mal', 'Cis Male', 'Male (CIS)', 'Make',
               'Guy (-ish) ^_^', 'male leaning androgynous', 'Man', 'msle', 'Mail', 'cis male', 'Malr', 'Cis Man']
# assign() returns a new DataFrame, so pandas does not raise SettingWithCopyWarning
df_clean = df_clean.assign(Gender=np.where(df_clean['Gender'].isin(female_labels), 'female',
                                           np.where(df_clean['Gender'].isin(male_labels), 'male', 'other')))
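
Listing every spelling by hand is easy to get wrong. A possible alternative, sketched under the assumption that trimming whitespace and lowercasing is acceptable for this column, is to normalize each answer first and then look it up in a small set. The sets below are abbreviated and hypothetical, and `normalize_gender` is an illustrative helper, not part of the original notebook:

```python
import pandas as pd

# Abbreviated, illustrative lookup sets (all lowercase, no surrounding spaces)
FEMALE = {"female", "f", "woman", "cis female", "femake"}
MALE = {"male", "m", "man", "cis male", "maile", "mal", "make"}

def normalize_gender(raw) -> str:
    """Return 'female', 'male' or 'other' for a free-text answer."""
    key = str(raw).strip().lower()
    if key in FEMALE:
        return "female"
    if key in MALE:
        return "male"
    return "other"

toy = pd.Series(["Male", "female ", "M", "Femake", "Enby", "Cis Male"])
normalized = toy.map(normalize_gender)
print(normalized.tolist())
```

Note that this changes behavior slightly compared with exact matching: an answer such as `'Female '` with a trailing space would be counted as female rather than other.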


2 - Age column

In [110]:
unique_values_age = df_clean['Age'].unique()
print(unique_values_age)

[         46          36          29          31          41          33
          35          34          37          32          30          42
          40          27          38          50          24          18
          28          26          22          44          23          19
          25          39          45          21         -29          43
          56          60          54         329          55 99999999999
          48          20          57          58          47          62
          51          65          49       -1726           5          53
          61           8          11          -1          72]


In [116]:
# LabelEncoder from scikit-learn
le = LabelEncoder()

In [111]:
# Clean the Age column by removing invalid values.
# .copy() makes df an independent DataFrame, so the assignment below does not raise SettingWithCopyWarning.
df = df_clean[(df_clean['Age'] >= 0) & (df_clean['Age'] <= 120)].copy()

# Encode the Age column using label encoding (each age is replaced by its rank among the sorted unique ages)
df['Age'] = le.fit_transform(df['Age'])


In [112]:
print(df['Age'].unique())

[31 21 14 16 26 18 20 19 22 17 15 27 25 12 23 35  9  3 13 11  7 29  8  4
 10 24 30  6 28 40 43 38 39 33  5 41 42 32 45 36 46 34  0 37 44  1  2 47]


The Age column is now label-encoded and contains no invalid values.
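
The encoded values are no longer ages: `LabelEncoder` sorts the unique values and replaces each one with its position in that sorted list, which is why an age of 46 appears as 31 above. A short sketch of this behavior on a handful of ages; the array and the name `le_age` are illustrative:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

ages = np.array([46, 36, 29, 5, 72, 36])

le_age = LabelEncoder()
codes = le_age.fit_transform(ages)

print(codes)                             # rank of each age among the sorted unique ages
print(le_age.classes_)                   # sorted unique ages; the index is the code
print(le_age.inverse_transform(codes))   # recovers the original ages
```

The codes keep the ordering of the ages but not the distance between them, so keeping Age as a plain numeric column may be worth considering for models that use magnitudes.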

3 - Self-employed column

In [113]:
df['self_employed'].unique()

array(['Yes', 'No'], dtype=object)

In [114]:
# assign() returns a new DataFrame, so pandas does not raise SettingWithCopyWarning
df = df.assign(self_employed=le.fit_transform(df['self_employed']))
df['self_employed'].unique()


4 - Family history column

In [117]:
data['family_history'].unique()

array(['No', 'Yes'], dtype=object)

In [118]:
# assign() returns a new DataFrame, so pandas does not raise SettingWithCopyWarning
df = df.assign(family_history=le.fit_transform(df['family_history']))
df['family_history'].unique()


array([1, 0])
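
The remaining categorical columns can be encoded the same way. Refitting a single shared encoder on each column works, but it forgets the mapping of the previous column, so the sketch below keeps one encoder per column in a dictionary. This is an illustrative pattern under that assumption, not the notebook's own code; `toy`, `encoded` and `encoders` are hypothetical names and the rows are made up:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows using answer values that appear in the survey
toy = pd.DataFrame({
    "family_history": ["Yes", "Yes", "No", "No"],
    "remote_work": ["No", "Yes", "No", "No"],
    "benefits": ["Yes", "Don't know", "No", "Yes"],
})

encoders = {}            # column name -> fitted LabelEncoder
encoded = toy.copy()     # work on an independent copy of the data
for column in toy.columns:
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

print(encoded)
print(list(encoders["benefits"].classes_))   # index in this list is the code
```

Keeping the fitted encoders makes it possible to translate codes back to the original answers later with `inverse_transform`.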

### Explore and visualize data: exploratory data analysis (EDA)


### Feature Engineering

### Model Selection 

### Model Training and evaluation 

### Model tuning

### Conclusions