## Dataset: Mental Health in IT

<p>"This dataset is from a 2014 survey that measures attitudes towards mental health and frequency of mental health disorders in the tech workplace." (from Kaggle)</p>
<p>Link: <a href="https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey?resource=download&select=survey.csv">Data source: Kaggle</a></p>
<p>Version 1.2 - 28/05/2022</p>

In [1]:
# Imports
import pandas as pd
import numpy as np

### 1 Data Collection

In [2]:
df = pd.read_csv("survey.csv")

### 2 Data Understanding

In [3]:
df.head()

,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [4]:
# Column types
df.dtypes

Timestamp                    object
Age                           int64
Gender                       object
Country                      object
state                        object
self_employed                object
family_history               object
treatment                    object
work_interfere               object
no_employees                 object
remote_work                  object
tech_company                 object
benefits                     object
care_options                 object
wellness_program             object
seek_help                    object
anonymity                    object
leave                        object
mental_health_consequence    object
phys_health_consequence      object
coworkers                    object
supervisor                   object
mental_health_interview      object
phys_health_interview        object
mental_vs_physical           object
obs_consequence              object
comments                     object
dtype: object
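
`Timestamp` is loaded as `object`, meaning plain strings. If time-based analysis were needed, it could be parsed with `pd.to_datetime`. The sketch below uses a hypothetical two-row frame `toy` with strings in the same format as the survey output above; it is not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data in the same timestamp format as survey.csv
toy = pd.DataFrame({"Timestamp": ["2014-08-27 11:29:31", "2014-08-27 11:29:37"]})

# Parse the strings into datetime64 values using an explicit format
toy["Timestamp"] = pd.to_datetime(toy["Timestamp"], format="%Y-%m-%d %H:%M:%S")

print(toy.dtypes)
```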

In [5]:
# Number of rows in the dataset
len(df)

1259

In [6]:
# Count null values per column
df.isnull().sum()

Timestamp                       0
Age                             0
Gender                          0
Country                         0
state                         515
self_employed                  18
family_history                  0
treatment                       0
work_interfere                264
no_employees                    0
remote_work                     0
tech_company                    0
benefits                        0
care_options                    0
wellness_program                0
seek_help                       0
anonymity                       0
leave                           0
mental_health_consequence       0
phys_health_consequence         0
coworkers                       0
supervisor                      0
mental_health_interview         0
phys_health_interview           0
mental_vs_physical              0
obs_consequence                 0
comments                     1095
dtype: int64

In [7]:
# Total number of NaN values in the dataset
df.isna().sum().sum()

1892

### 3 Data Preparation

#### Dealing with missing values
<p>Missing values per column:</p>
<ul>
<li>state - 515</li>
<li>self_employed - 18</li>
<li>work_interfere - 264</li>
<li>comments - 1095</li>
</ul>
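
The counts listed above can also be expressed as a share of all rows, which may help when choosing between filling and dropping. This is a minimal sketch on a hypothetical frame `toy` standing in for `survey.csv`; the helper name `missing_summary` is illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return NaN count and percentage for each column that has missing values."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one NaN, most affected first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy data standing in for survey.csv
toy = pd.DataFrame({
    "state": ["IL", np.nan, np.nan, "TX"],
    "self_employed": ["No", "Yes", np.nan, "No"],
    "treatment": ["Yes", "No", "No", "Yes"],
})
summary = missing_summary(toy)
print(summary)
```

On the survey data itself, `comments` is missing in 1095 of 1259 rows (about 87%), while `self_employed` is missing in only 18 (about 1.4%).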



In [8]:
# Fill NaN in the state column with "Not informed"
# Column: state - 515 NaN
df['state'] = df['state'].fillna("Not informed")

In [9]:
# Column: work_interfere - 264 NaN
df['work_interfere']

0           Often
1          Rarely
2          Rarely
3           Often
4           Never
5       Sometimes
6       Sometimes
7           Never
8       Sometimes
9           Never
10      Sometimes
11          Never
12      Sometimes
13          Never
14          Never
15         Rarely
16      Sometimes
17      Sometimes
18      Sometimes
19            NaN
20      Sometimes
21          Never
22          Often
23          Never
24         Rarely
25      Sometimes
26            NaN
27         Rarely
28      Sometimes
29      Sometimes
          ...    
1229          NaN
1230    Sometimes
1231    Sometimes
1232    Sometimes
1233       Rarely
1234    Sometimes
1235        Often
1236        Often
1237    Sometimes
1238        Often
1239    Sometimes
1240    Sometimes
1241        Often
1242       Rarely
1243       Rarely
1244          NaN
1245        Often
1246        Never
1247        Often
1248    Sometimes
1249    Sometimes
1250        Often
1251        Often
1252    Sometimes
1253      

In [10]:
# Drop rows where work_interfere is NaN
df.groupby(['work_interfere']).count()
df.dropna(subset=["work_interfere"], inplace=True)

In [11]:
# Verify the length of the dataset
len(df)

995
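
The new length should equal the original row count minus the number of NaN values in `work_interfere` (1259 - 264 = 995). The same check can be sketched on a hypothetical frame `toy`, which is an assumption used here in place of `survey.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the notebook itself works on survey.csv
toy = pd.DataFrame({
    "work_interfere": ["Often", np.nan, "Rarely", np.nan, "Never"],
    "Age": [37, 44, 32, 31, 31],
})

rows_before = len(toy)
n_missing = int(toy["work_interfere"].isna().sum())

# Drop only the rows where work_interfere is NaN
cleaned = toy.dropna(subset=["work_interfere"])

# The drop should remove exactly n_missing rows
assert len(cleaned) == rows_before - n_missing
print(rows_before, n_missing, len(cleaned))
```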

In [12]:
# Count NaN values again
df.isnull().sum()

Timestamp                      0
Age                            0
Gender                         0
Country                        0
state                          0
self_employed                 18
family_history                 0
treatment                      0
work_interfere                 0
no_employees                   0
remote_work                    0
tech_company                   0
benefits                       0
care_options                   0
wellness_program               0
seek_help                      0
anonymity                      0
leave                          0
mental_health_consequence      0
phys_health_consequence        0
coworkers                      0
supervisor                     0
mental_health_interview        0
phys_health_interview          0
mental_vs_physical             0
obs_consequence                0
comments                     852
dtype: int64

In [13]:
# self_employed
df.groupby(['self_employed']).count()

self_employed,Timestamp,Age,Gender,Country,state,family_history,treatment,work_interfere,no_employees,remote_work,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
No,852,852,852,852,852,852,852,852,852,852,...,852,852,852,852,852,852,852,852,852,118
Yes,125,125,125,125,125,125,125,125,125,125,...,125,125,125,125,125,125,125,125,125,22
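
`groupby(...).count()` leaves out rows whose key is NaN, so the 18 missing `self_employed` values do not appear in the table above. One way to see them alongside the other categories is `value_counts(dropna=False)`. The sketch below uses a hypothetical frame `toy`, not the survey data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing self_employed value
toy = pd.DataFrame({"self_employed": ["No", "No", "Yes", np.nan]})

# dropna=False keeps NaN as its own category in the result
counts = toy["self_employed"].value_counts(dropna=False)
print(counts)
```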


In [14]:
# Drop the 18 rows where self_employed is NaN
df.dropna(subset=["self_employed"], axis=0, inplace=True)

In [15]:
len(df)

977

In [16]:
# Comments column
df['comments']

18                                                    NaN
20                                                    NaN
21                                                    NaN
22                                                    NaN
23                                                    NaN
24                    Relatively new job. Ask again later
25      Sometimes I think  about using drugs for my me...
27                                                    NaN
28                                                    NaN
29                                                    NaN
30                                                    NaN
31                                                    NaN
32                                                    NaN
33      I selected my current employer based on its po...
34                                                    NaN
35                                                    NaN
36                                                    NaN
39            

#### Dropping variables

In [17]:
# Drop column: comments
df.drop(["comments"], axis=1, inplace=True)

In [18]:
df.columns

Index(['Timestamp', 'Age', 'Gender', 'Country', 'state', 'self_employed',
       'family_history', 'treatment', 'work_interfere', 'no_employees',
       'remote_work', 'tech_company', 'benefits', 'care_options',
       'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence'],
      dtype='object')

In [19]:
# Sum of null values
df.isnull().sum()

Timestamp                    0
Age                          0
Gender                       0
Country                      0
state                        0
self_employed                0
family_history               0
treatment                    0
work_interfere               0
no_employees                 0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
wellness_program             0
seek_help                    0
anonymity                    0
leave                        0
mental_health_consequence    0
phys_health_consequence      0
coworkers                    0
supervisor                   0
mental_health_interview      0
phys_health_interview        0
mental_vs_physical           0
obs_consequence              0
dtype: int64

In [20]:
# Export the cleaned data (index=False keeps the row index out of the file)
df.to_csv('survey_v1.csv', index=False)

In [22]:
# Read transformed data
df_t = pd.read_csv('survey_v1.csv')

In [23]:
len(df_t)

977
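
A note on the CSV round trip: by default `to_csv` writes the row index as a first, unnamed column, and `read_csv` then loads it back as `Unnamed: 0`. Passing `index=False` when writing avoids that. The sketch below shows the difference with an in-memory buffer and a hypothetical frame `toy`, both assumptions used for illustration only:

```python
import io
import pandas as pd

# Hypothetical toy frame with a non-contiguous index, as left behind by dropna
toy = pd.DataFrame({"Age": [37, 32], "Gender": ["Female", "Male"]}, index=[0, 2])

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

The row count is the same either way; only the column set differs.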