## Step-1: Business Problem Understanding
- Predict whether or not a person is depressed (binary target `Depression`, 0 or 1).

## Step-2: Data Understanding

**Import Libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')

**Load the data**

In [4]:
data=pd.read_csv('test.csv')
pd.set_option('display.max_columns',None)
data.head(1)

| | id | Name | Gender | Age | City | Working Professional or Student | Profession | Academic Pressure | Work Pressure | CGPA | Study Satisfaction | Job Satisfaction | Sleep Duration | Dietary Habits | Degree | Have you ever had suicidal thoughts ? | Work/Study Hours | Financial Stress | Family History of Mental Illness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 140700 | Shivam | Male | 53.0 | Visakhapatnam | Working Professional | Judge | | 2.0 | | | 5.0 | Less than 5 hours | Moderate | LLB | No | 9.0 | 3.0 | Yes |


**Data Exploration**

In [7]:
print(f'No of Rows: {data.shape[0]}')
print(f'No of Columns: {data.shape[1]}')

No of Rows: 93800
No of Columns: 19


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93800 entries, 0 to 93799
Data columns (total 19 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   id                                     93800 non-null  int64  
 1   Name                                   93800 non-null  object 
 2   Gender                                 93800 non-null  object 
 3   Age                                    93800 non-null  float64
 4   City                                   93800 non-null  object 
 5   Working Professional or Student        93800 non-null  object 
 6   Profession                             69168 non-null  object 
 7   Academic Pressure                      18767 non-null  float64
 8   Work Pressure                          75022 non-null  float64
 9   CGPA                                   18766 non-null  float64
 10  Study Satisfaction                     18767 non-null  float64
 11  Jo

In [11]:
continuous=['Age']
discrete_count=['Sleep Duration','Work/Study Hours']
ordinal=['Work Pressure','Job Satisfaction',
         'Financial Stress'
        ]
nominal=['Gender','City','Working Professional or Student','Profession',
         'Dietary Habits','Degree','Have you ever had suicidal thoughts ?',
         'Family History of Mental Illness'
        ]
other=['Name','id']

In [13]:
col_name = []
unique_count = []
for col in discrete_count+ordinal+nominal+other:
    count = data[col].nunique()
    unique_count.append(count)
    col_name.append(col)

unique_df = pd.DataFrame({'Unique_count':unique_count},index=col_name)
print('No of Unique values:')
unique_df.T

No of Unique values:


| | Sleep Duration | Work/Study Hours | Work Pressure | Job Satisfaction | Financial Stress | Gender | City | Working Professional or Student | Profession | Dietary Habits | Degree | Have you ever had suicidal thoughts ? | Family History of Mental Illness | Name | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unique_count | 31 | 13 | 5 | 5 | 5 | 2 | 68 | 2 | 64 | 22 | 87 | 2 | 2 | 374 | 93800 |


**Observations:**

**Number of unique values: train vs test**

| Column | Train data | Test data |
|---|---|---|
| Sleep Duration | 36 | 31 |
| City | 98 | 68 |
| Profession | 64 | 64 |
| Dietary Habits | 23 | 22 |
| Degree | 115 | 87 |
| Name | 422 | 374 |
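A comparison like the one above can be produced programmatically. The following is a minimal sketch on two tiny, made-up frames (`train_df` and `test_df` are hypothetical stand-ins, not the real datasets); it also lists categories that appear in the test data but never in the training data, which is usually the more important check before encoding.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the train and test frames.
train_df = pd.DataFrame({
    "City": ["Pune", "Delhi", "Surat", "Pune"],
    "Degree": ["BSc", "MBA", "BSc", "PhD"],
})
test_df = pd.DataFrame({
    "City": ["Pune", "Delhi", "Delhi", "Pune"],
    "Degree": ["BSc", "MBA", "LLB", "LLB"],
})

# One row per column, one unique-count column per dataset.
comparison = pd.DataFrame({
    "Train data": train_df.nunique(),
    "Test data": test_df.nunique(),
})

# Categories that occur in the test frame but never in the train frame.
unseen_in_train = {
    col: sorted(set(test_df[col].dropna()) - set(train_df[col].dropna()))
    for col in test_df.columns
}

print(comparison)
print(unseen_in_train)
```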


**Feature Selection (Filter Method)**

**1. Based on number of unique values**

In [18]:
# Drop columns based on no of unique values
drop_cols = ['Name','City','id']
data.drop(columns=drop_cols,inplace=True)

**2. Based on missing values**

In [21]:
data.isnull().sum()/len(data)

Gender                                   0.000000
Age                                      0.000000
Working Professional or Student          0.000000
Profession                               0.262601
Academic Pressure                        0.799925
Work Pressure                            0.200192
CGPA                                     0.799936
Study Satisfaction                       0.799925
Job Satisfaction                         0.200149
Sleep Duration                           0.000000
Dietary Habits                           0.000053
Degree                                   0.000021
Have you ever had suicidal thoughts ?    0.000000
Work/Study Hours                         0.000000
Financial Stress                         0.000000
Family History of Mental Illness         0.000000
dtype: float64

**Observations:**
- In both the train and test datasets, 'Academic Pressure', 'CGPA' and 'Study Satisfaction' have more than 50% missing values (about 80% in the test data shown above).

In [24]:
# drop columns that have more than 30% missing values
drop_cols = ['Academic Pressure','CGPA','Study Satisfaction']
data.drop(columns=drop_cols,inplace=True)
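Instead of typing the column names by hand, the columns to drop can be derived from the missing-value fractions. This is a minimal sketch on a made-up frame (`demo` and `MISSING_THRESHOLD` are illustrative names, not part of the notebook), using the same 30% cut-off as the comment above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: CGPA is 80% missing, Profession is 20% missing.
demo = pd.DataFrame({
    "Age": [21.0, 35.0, 40.0, 28.0, 52.0],
    "CGPA": [8.1, np.nan, np.nan, np.nan, np.nan],
    "Profession": ["Judge", None, "Chef", "Pilot", "Teacher"],
})

MISSING_THRESHOLD = 0.30  # assumed cut-off: drop columns with more than 30% missing

# isnull().mean() gives the fraction of missing values per column.
missing_fraction = demo.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()

demo_reduced = demo.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Deriving the list this way keeps the train and test notebooks consistent only if the same list is applied to both, so in practice the list computed on the training data would be reused here.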

## Step-3: Data Preprocessing

**1. Data Cleaning**

**1.1 Wrong Data**

In [444]:
wrong_data_columns = ['Sleep Duration','Dietary Habits','Profession']

In [446]:
# Collapse the raw sleep-duration strings into three levels: Moderate, Good, Excellent.
# Assign the result back instead of using inplace=True on a selected column.
data['Sleep Duration'] = data['Sleep Duration'].replace({
    'Less than 5 hours':'Moderate','7-8 hours':'Good','More than 8 hours':'Excellent','5-6 hours':'Good','6-7 hours':'Good',
    '8-9 hours':'Excellent','4-5 hours':'Moderate','2-3 hours':'Moderate','3-4 hours':'Moderate','9-5 hours':'Moderate',
    '1-6 hours':'Good','4-6 hours':'Good','1-2 hours':'Moderate','10-6 hours':'Good','9-10 hours':'Excellent',
    '9-11 hours':'Excellent','50-75 hours':'Good','6 hours':'Good','1-3 hours':'Moderate','8-89 hours':'Excellent',
    '20-21 hours':'Moderate','60-65 hours':'Moderate','9-6 hours':'Good','than 5 hours':'Moderate','0':'Moderate',
    '3-6 hours':'Good','9-5':'Moderate'
})

In [448]:
sleep_wrong_data = data[(data['Sleep Duration']!='Moderate') & (data['Sleep Duration']!='Good') & (data['Sleep Duration']!='Excellent')].index
sleep_wrong_data = sleep_wrong_data.tolist()
len(sleep_wrong_data)

5

In [450]:
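# Select the column by name rather than by position so the fix survives column reordering
data.loc[sleep_wrong_data,'Sleep Duration'] = 'Good'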

In [452]:
data['Sleep Duration'].value_counts()

Sleep Duration
Good         45916
Moderate     25685
Excellent    22199
Name: count, dtype: int64
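The replace-then-patch sequence above can also be written as a single mapping step with a default. The sketch below is an alternative, not the notebook's method: it uses a shortened, illustrative dictionary and a made-up invalid value (`"Indore"`), and relies on `Series.map` returning NaN for any value that is not a key.

```python
import pandas as pd

# Toy column with three known labels, one junk value and one missing value.
sleep = pd.Series(["Less than 5 hours", "7-8 hours", "More than 8 hours", "Indore", None])

# Shortened, illustrative mapping; the full dictionary is in the cell above.
sleep_map = {
    "Less than 5 hours": "Moderate",
    "7-8 hours": "Good",
    "More than 8 hours": "Excellent",
}

# map() yields NaN for values that are not keys; fillna() then applies the default.
sleep_clean = sleep.map(sleep_map).fillna("Good")
print(sleep_clean.tolist())
```

The trade-off is that `map` silently sends every unlisted value to the default, so it is worth counting how many rows were affected before accepting the result.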

In [454]:
data['Dietary Habits'] = data['Dietary Habits'].replace({
    'More Healthy':'Healthy','5 Unhealthy':'Unhealthy','5 Healthy':'Healthy','Less Healthy':'Unhealthy'
})

In [456]:
data['Dietary Habits'].value_counts()

Dietary Habits
Moderate       33018
Unhealthy      30788
Healthy        29969
No                 6
Educational        1
Naina              1
1.0                1
Raghav             1
Vivaan             1
Soham              1
MCA                1
Academic           1
Resistant          1
Mealy              1
Male               1
Prachi             1
Indoor             1
Kolkata            1
Name: count, dtype: int64

In [458]:
# Missing value also included
dietary_wrong_data = data[(data['Dietary Habits'] != 'Moderate') & (data['Dietary Habits'] != 'Unhealthy') & (data['Dietary Habits'] != 'Healthy')].index
dietary_wrong_data = dietary_wrong_data.tolist()
len(dietary_wrong_data)

25

In [460]:
# Replace invalid and missing 'Dietary Habits' values with 'Moderate'
data.loc[dietary_wrong_data,'Dietary Habits'] = 'Moderate'

In [462]:
# Fix a misspelling, map degree names entered as professions to 'Student', and merge 'Surgeon' into 'Doctor'
data['Profession'] = data['Profession'].replace({
    'Finanancial Analyst':'Financial Analyst','MD':'Student','B.Ed':'Student','LLM':'Student','M.Tech':'Student','BBA':'Student',
    'B.Pharm':'Student','MCA':'Student','M.Ed':'Student','PhD':'Student','ME':'Student','M.Pharm':'Student','Surgeon':'Doctor'
})

In [464]:
pd.set_option('display.max_rows',None)
data['Profession'].value_counts()

Profession
Teacher                   16385
Content Writer             5187
Architect                  2982
Consultant                 2920
Pharmacist                 2656
HR Manager                 2601
Doctor                     2199
Business Analyst           2186
Chemist                    1967
Financial Analyst          1942
Entrepreneur               1935
Chef                       1844
Educational Consultant     1827
Data Scientist             1582
Lawyer                     1497
Researcher                 1496
Pilot                      1448
Customer Support           1422
Marketing Manager          1284
Judge                      1189
Travel Consultant          1188
Manager                    1155
Sales Executive            1139
Plumber                    1123
Electrician                1121
Software Engineer          1002
Digital Marketer            942
Civil Engineer              938
UX/UI Designer              915
Accountant                  853
Mechanical Engineer         8

In [466]:
# Values that are clearly not professions (names, cities, stray tokens)
invalid_professions = ['Working Professional','Unhealthy','Surat','3M','24th','Manvi','Yogesh','Samar','Name','Simran',
                       'Analyst','Profession','No','Unveil','City Consultant']
profession_wrong_data = data[data['Profession'].isin(invalid_professions)].index
profession_wrong_data = profession_wrong_data.tolist()
len(profession_wrong_data)

20

In [468]:
# Replace invalid entries with the most frequent profession
data.loc[profession_wrong_data,'Profession'] = 'Teacher'

In [30]:
# Group degrees into UG, PG, Intermediate and DT (doctorate)
data['Degree'] = data['Degree'].replace({
                        'LLB':'UG', 'B.Ed':'UG', 'B.Arch':'UG', 'BSc':'UG', 'BCA':'UG', 
                        'B.Com':'UG', 'MA':'PG', 'BA':'UG', 'BBA':'UG',
                        'Class 12':'Intermediate', 'MD':'PG', 'MBA':'PG', 'M.Ed':'PG',
                        'M.Pharm':'PG', 'BHM':'UG','LLM':'PG', 'PhD':'DT','M.Com':'PG', 
                        'BE':'UG', 'MBBS':'UG', 'B.Tech':'UG', 'ME':'PG', 'MCA':'PG', 
                        'B.Pharm':'UG', 'MHM':'PG', 'M.Tech':'PG', 'BTech':'UG',
                        'MSc':'PG', 'BArch':'UG','M.Arch':'PG', 'A.Ed':'PG', 
                        'Mechanical Engineer':'UG','B.H':'UG', 'B.Sc':'UG','B BCA':'UG',
                        'B.Press':'UG', 'BPharm':'UG','MPharm':'PG','B_Com':'UG',
                        'B._Pharm':'UG','B.M.Com':'UG','M.M.Ed':'PG', 'S.Pharm':'UG',
                        'E.Ed':'UG','B.BA':'UG','B B.Tech':'UG','M.B.Ed':'UG', 
                        'GCA':'UG', 'G.Ed':'UG','RCA':'UG',
                        'B.CA':'UG', 'PCA':'UG', 'J.Ed':'UG', 'BH':'UG', 'BEd':'UG',
                        'K.Ed':'PG', 'BHCA':'UG'
})

In [36]:
# Missing value also included
degree_wrong_data = data[(data['Degree'] != 'UG') & (data['Degree'] != 'PG') & (data['Degree'] != 'Intermediate') & (data['Degree'] != 'DT')].index
degree_wrong_data = degree_wrong_data.tolist()
len(degree_wrong_data)

37

In [44]:
# Replace invalid and missing degrees with the most frequent group
data.loc[degree_wrong_data,'Degree'] = 'UG'

In [46]:
pd.set_option('display.max_rows',None)
data['Degree'].value_counts()

Degree
UG              49155
PG              32760
Intermediate     9812
DT               2073
Name: count, dtype: int64

In [317]:
# data.drop(columns='Degree',inplace=True)

**1.2 Missing Values**

In [34]:
data.isnull().sum()
# In both the train and test datasets, 'Profession', 'Work Pressure' and 'Job Satisfaction' have a large number of missing values

Gender                                       0
Age                                          0
Working Professional or Student              0
Profession                               24632
Work Pressure                            18778
Job Satisfaction                         18774
Sleep Duration                               0
Dietary Habits                               5
Degree                                       2
Have you ever had suicidal thoughts ?        0
Work/Study Hours                             0
Financial Stress                             0
Family History of Mental Illness             0
dtype: int64

In [557]:
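# Assign the filled columns back instead of calling fillna(inplace=True) on a selected column
data["Profession"] = data["Profession"].fillna("Unemployed")
data['Work Pressure'] = data['Work Pressure'].fillna(data['Work Pressure'].mode()[0])
data['Job Satisfaction'] = data['Job Satisfaction'].fillna(data['Job Satisfaction'].mode()[0])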

**Feature Scaling**

In [560]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data['Age']=sc.fit_transform(data[['Age']])
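The cell above fits a new `StandardScaler` on the test data, so the scaled `Age` values use the test set's mean and standard deviation rather than those the model saw during training. A common alternative is to persist the scaler fitted on the training data and only call `transform` here. The sketch below assumes such a scaler was saved; `train_age`, `test_age` and the in-memory `blob` are hypothetical stand-ins for the real frames and a pickle file.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the train and test 'Age' columns.
train_age = pd.DataFrame({"Age": [20.0, 30.0, 40.0]})
test_age = pd.DataFrame({"Age": [30.0, 50.0]})

# Training notebook: fit once and serialize (in practice, written to a file).
scaler = StandardScaler().fit(train_age)
blob = pickle.dumps(scaler)

# Inference notebook: restore the fitted scaler and only transform.
restored = pickle.loads(blob)
test_scaled = restored.transform(test_age)
print(test_scaled.ravel())
```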

**Encoding**

In [562]:
from sklearn.preprocessing import LabelEncoder
en = LabelEncoder()
data['Gender'] = en.fit_transform(data['Gender'])
data['Working Professional or Student'] = en.fit_transform(data['Working Professional or Student'])
data['Have you ever had suicidal thoughts ?'] = en.fit_transform(data['Have you ever had suicidal thoughts ?'])
data['Family History of Mental Illness'] = en.fit_transform(data['Family History of Mental Illness'])
data['Profession'] = en.fit_transform(data['Profession'])

In [564]:
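# Ordinal encoding: assign the result back instead of using inplace=True on a selected column
data["Sleep Duration"] = data["Sleep Duration"].replace({'Moderate':0,'Good':1,'Excellent':2})
data["Dietary Habits"] = data["Dietary Habits"].replace({'Unhealthy':0,'Moderate':1,'Healthy':2})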
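`LabelEncoder` assigns integers by sorted order of the values it is fitted on, so refitting it on the test data can give a category a different code than it had during training whenever the two category sets differ. One way to guard against this, shown here as a sketch and not as what this notebook does, is to fit scikit-learn's `OrdinalEncoder` on the training categories and map unseen test categories to a sentinel. The frames `train_cat` and `test_cat` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy categories; 'Pilot' never appears in the training data.
train_cat = pd.DataFrame({"Profession": ["Chef", "Judge", "Teacher"]})
test_cat = pd.DataFrame({"Profession": ["Teacher", "Pilot", "Chef"]})

# Unknown categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_cat)

test_codes = encoder.transform(test_cat).ravel().tolist()
print(test_codes)
```

Whether -1 is a sensible code for unseen categories depends on the model; tree-based models usually tolerate it, but that is an assumption to verify.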

In [417]:
# data.drop(columns='Working Professional or Student',inplace=True)

In [566]:
X=data

In [568]:
import pickle
# Use a context manager so the file handle is closed after loading
with open('mental_health_xgb5.pkl','rb') as f:
    saved_model = pickle.load(f)
saved_model

In [570]:
ypred = saved_model.predict(X)
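Because `X` is passed to the model as prepared above, the prediction is only meaningful if its columns match the training features in both name and order. The sketch below shows one possible pre-flight check; `expected_columns` is a hypothetical list that would have to be saved from the training notebook, and `X_demo` is a toy frame.

```python
import pandas as pd

# Hypothetical: the feature order used when the model was trained.
expected_columns = ["Gender", "Age", "Profession"]

# Toy inference frame with the right columns in a different order.
X_demo = pd.DataFrame({"Age": [0.5], "Profession": [3], "Gender": [1]})

missing = [c for c in expected_columns if c not in X_demo.columns]
extra = [c for c in X_demo.columns if c not in expected_columns]
if missing or extra:
    raise ValueError(f"Column mismatch. Missing: {missing}, unexpected: {extra}")

# Reorder to the training layout before calling predict().
X_aligned = X_demo[expected_columns]
print(list(X_aligned.columns))
```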

In [572]:
ypred=ypred.tolist()
ypred

[0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,


In [574]:
data1=pd.read_csv('test.csv')
# Use `ids` rather than `id` so the built-in id() is not shadowed
ids=data1['id'].tolist()

In [576]:
d={'id':ids,'Depression':ypred}

In [578]:
final=pd.DataFrame(d)
final.head()

| | id | Depression |
|---|---|---|
| 0 | 140700 | 0 |
| 1 | 140701 | 0 |
| 2 | 140702 | 0 |
| 3 | 140703 | 1 |
| 4 | 140704 | 0 |


In [580]:
final['Depression'].value_counts()

Depression
0    79246
1    14554
Name: count, dtype: int64

In [582]:
final.to_csv('final_prediction_xgb5.csv',index=False)
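Before uploading a prediction file, a few cheap sanity checks can catch row-count or label problems. This is a minimal sketch on a three-row, made-up frame (`submission` and `expected_rows` are illustrative); for the real file, `expected_rows` would be the number of rows in `test.csv`.

```python
import pandas as pd

# Hypothetical miniature submission frame.
submission = pd.DataFrame({
    "id": [140700, 140701, 140702],
    "Depression": [0, 0, 1],
})
expected_rows = 3  # assumed: number of rows in the test file

checks = {
    "row_count": len(submission) == expected_rows,
    "unique_ids": bool(submission["id"].is_unique),
    "no_missing": not bool(submission.isnull().any().any()),
    "binary_labels": bool(submission["Depression"].isin([0, 1]).all()),
}

failed = [name for name, ok in checks.items() if not ok]
print("All checks passed" if not failed else f"Failed checks: {failed}")
```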