<a href="https://colab.research.google.com/github/calmrocks/master-machine-learning-engineer/blob/main/basic_models/Classification1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification Model Walkthrough

## Problem Statement

**Question**: Can we predict whether a person should be treated for a mental illness, based on the values in the dataset?

## Dataset
[Mental Health in Tech Survey](https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey)

## Similar notebook on Kaggle

Most of the code is taken from [Machine Learning for Mental Health](https://www.kaggle.com/code/youpengcheng/machine-learning-for-mental-health). Running it in a Kaggle notebook is easier, because the dataset can be attached there directly.

# Getting data

The dataset is hosted on Kaggle. If you are using a Kaggle notebook, you can add the dataset directly to the notebook. On Colab, we first need to download the data.

## Dataset

[Mental Health in Tech Survey](https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey)

## Data Columns

- Timestamp
- Age
- Gender
- Country
- state: If you live in the United States, which state or territory do you live in?
- self_employed: Are you self-employed?
- family_history: Do you have a family history of mental illness?
- treatment: Have you sought treatment for a mental health condition?
- work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
- no_employees: How many employees does your company or organization have?
- remote_work: Do you work remotely (outside of an office) at least 50% of the time?
- tech_company: Is your employer primarily a tech company/organization?
- benefits: Does your employer provide mental health benefits?
- care_options: Do you know the options for mental health care your employer provides?
- wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
- seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?
- anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
- leave: How easy is it for you to take medical leave for a mental health condition?
- mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
- phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
- coworkers: Would you be willing to discuss a mental health issue with your coworkers?
- supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
- mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?
- phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?
- mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?
- obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
- comments: Any additional notes or comments



In [3]:
import pandas as pd

url = "https://raw.githubusercontent.com/calmrocks/master-machine-learning-engineer/main/basic_models/data/survey.csv"
df = pd.read_csv(url)

In [5]:
df.head()

|  | Timestamp | Age | Gender | Country | state | self_employed | family_history | treatment | work_interfere | no_employees | ... | leave | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-27 11:29:31 | 37 | Female | United States | IL | NaN | No | Yes | Often | 6-25 | ... | Somewhat easy | No | No | Some of them | Yes | No | Maybe | Yes | No | NaN |
| 1 | 2014-08-27 11:29:37 | 44 | M | United States | IN | NaN | No | No | Rarely | More than 1000 | ... | Don't know | Maybe | No | No | No | No | No | Don't know | No | NaN |
| 2 | 2014-08-27 11:29:44 | 32 | Male | Canada | NaN | NaN | No | No | Rarely | 6-25 | ... | Somewhat difficult | No | No | Yes | Yes | Yes | Yes | No | No | NaN |
| 3 | 2014-08-27 11:29:46 | 31 | Male | United Kingdom | NaN | NaN | Yes | Yes | Often | 26-100 | ... | Somewhat difficult | Yes | Yes | Some of them | No | Maybe | Maybe | No | Yes | NaN |
| 4 | 2014-08-27 11:30:22 | 31 | Male | United States | TX | NaN | No | No | Never | 100-500 | ... | Don't know | No | No | Some of them | Yes | Yes | Yes | Don't know | No | NaN |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 24 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Age                        1259 non-null   int64 
 1   Gender                     1259 non-null   object
 2   Country                    1259 non-null   object
 3   self_employed              1241 non-null   object
 4   family_history             1259 non-null   object
 5   treatment                  1259 non-null   object
 6   work_interfere             995 non-null    object
 7   no_employees               1259 non-null   object
 8   remote_work                1259 non-null   object
 9   tech_company               1259 non-null   object
 10  benefits                   1259 non-null   object
 11  care_options               1259 non-null   object
 12  wellness_program           1259 non-null   object
 13  seek_help                  1259 non-null   object
 14  anonymit

In [9]:
df.describe()

|  | Age |
|---|---|
| count | 1259.0 |
| mean | 79428150.0 |
| std | 2818299000.0 |
| min | -1726.0 |
| 25% | 27.0 |
| 50% | 31.0 |
| 75% | 36.0 |
| max | 100000000000.0 |
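
The summary above reports a minimum age of -1726 and a maximum of 100000000000, so some `Age` entries are clearly invalid and they distort the mean and standard deviation. The following is a minimal sketch of one possible way to handle them, not necessarily what the referenced Kaggle notebook does. It assumes that ages outside 18 to 100 are implausible and replaces them with the median of the plausible ages. The toy frame `ages_demo` and the bounds `MIN_AGE` and `MAX_AGE` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy data that mimics the kind of outliers seen in df.describe();
# these are not the real survey rows.
ages_demo = pd.DataFrame({"Age": [37, 44, -1726, 31, 100000000000, 32, 29]})

# Assumed plausibility bounds (hypothetical; adjust to your own judgment).
MIN_AGE, MAX_AGE = 18, 100

# Boolean mask: True where the age lies inside the assumed range (inclusive).
plausible = ages_demo["Age"].between(MIN_AGE, MAX_AGE)

# Median of the plausible ages only, so the outliers cannot distort it.
# int() truncates any fractional part and keeps the column integer-typed.
median_age = int(ages_demo.loc[plausible, "Age"].median())

# Overwrite the implausible ages with that median.
ages_demo.loc[~plausible, "Age"] = median_age

print(ages_demo["Age"].tolist())
```

Dropping the offending rows with `ages_demo[plausible]` would be an equally reasonable choice when only a handful of rows are affected.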


In [10]:
print(df.shape)

(1259, 24)


In [11]:
print(df.columns.tolist())

['Age', 'Gender', 'Country', 'self_employed', 'family_history', 'treatment', 'work_interfere', 'no_employees', 'remote_work', 'tech_company', 'benefits', 'care_options', 'wellness_program', 'seek_help', 'anonymity', 'leave', 'mental_health_consequence', 'phys_health_consequence', 'coworkers', 'supervisor', 'mental_health_interview', 'phys_health_interview', 'mental_vs_physical', 'obs_consequence']


In [12]:
print(df.isnull().sum())

Age                            0
Gender                         0
Country                        0
self_employed                 18
family_history                 0
treatment                      0
work_interfere               264
no_employees                   0
remote_work                    0
tech_company                   0
benefits                       0
care_options                   0
wellness_program               0
seek_help                      0
anonymity                      0
leave                          0
mental_health_consequence      0
phys_health_consequence        0
coworkers                      0
supervisor                     0
mental_health_interview        0
phys_health_interview          0
mental_vs_physical             0
obs_consequence                0
dtype: int64
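
The null counts above are easier to judge as a share of the 1259 rows. A small worked calculation, using only the counts printed above (the variable names are illustrative):

```python
# Counts copied from the df.isnull().sum() output and the row count from df.shape.
total_rows = 1259
missing_counts = {"self_employed": 18, "work_interfere": 264}

# Percentage of rows with a missing value, rounded to one decimal place.
missing_pct = {
    column: round(100 * count / total_rows, 1)
    for column, count in missing_counts.items()
}

for column, pct in missing_pct.items():
    print(f"{column}: {pct}% missing")
```

On the full DataFrame, `df.isnull().mean()` gives the same fractions for every column at once.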


In [13]:
print(df.dtypes)

Age                           int64
Gender                       object
Country                      object
self_employed                object
family_history               object
treatment                    object
work_interfere               object
no_employees                 object
remote_work                  object
tech_company                 object
benefits                     object
care_options                 object
wellness_program             object
seek_help                    object
anonymity                    object
leave                        object
mental_health_consequence    object
phys_health_consequence      object
coworkers                    object
supervisor                   object
mental_health_interview      object
phys_health_interview        object
mental_vs_physical           object
obs_consequence              object
dtype: object


In [15]:
print(df['self_employed'].unique())
print(df['benefits'].unique())

[nan 'Yes' 'No']
['Yes' "Don't know" 'No']


In [19]:
print(df['work_interfere'].value_counts())
print(df['Gender'].value_counts())

work_interfere
Sometimes    465
Never        213
Rarely       173
Often        144
Name: count, dtype: int64
Gender
Male                                              615
male                                              206
Female                                            121
M                                                 116
female                                             62
F                                                  38
m                                                  34
f                                                  15
Make                                                4
Male                                                3
Woman                                               3
Cis Male                                            2
Man                                                 2
Female (trans)                                      2
Female                                              2
Trans woman                                         1
msle                
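
The `Gender` counts above show that the column is free text: the same answer appears with different capitalization, abbreviations, stray whitespace, and apparent typos. A minimal sketch of one way to normalize it is shown below. The label sets and the three output groups are assumptions made for illustration; they cover only the values visible in the (truncated) output above, and how to group the remaining answers is a modeling decision that the original notebook may handle differently.

```python
# Hypothetical label sets, compared after stripping whitespace and lowercasing.
# "make" and "msle" are treated here as typos of "male" (an assumption).
MALE_LABELS = {"male", "m", "man", "cis male", "make", "msle"}
FEMALE_LABELS = {"female", "f", "woman"}


def normalize_gender(raw):
    """Map a raw survey answer to 'male', 'female', or 'other'."""
    label = str(raw).strip().lower()
    if label in MALE_LABELS:
        return "male"
    if label in FEMALE_LABELS:
        return "female"
    # Everything not listed above falls into a catch-all group.
    return "other"


for answer in ["Male ", "F", "Make", "Trans woman"]:
    print(repr(answer), "->", normalize_gender(answer))
```

With pandas, the function could be applied to the whole column as `df['Gender'] = df['Gender'].map(normalize_gender)`, followed by `df['Gender'].value_counts()` to check the resulting groups.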

## Data cleanup

In [7]:
# Drop the free-text comments, the US-only state column, and the timestamp
df = df.drop(columns=['comments', 'state', 'Timestamp'])
df.head()

|  | Age | Gender | Country | self_employed | family_history | treatment | work_interfere | no_employees | remote_work | tech_company | ... | anonymity | leave | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | Female | United States | NaN | No | Yes | Often | 6-25 | No | Yes | ... | Yes | Somewhat easy | No | No | Some of them | Yes | No | Maybe | Yes | No |
| 1 | 44 | M | United States | NaN | No | No | Rarely | More than 1000 | No | No | ... | Don't know | Don't know | Maybe | No | No | No | No | No | Don't know | No |
| 2 | 32 | Male | Canada | NaN | No | No | Rarely | 6-25 | No | Yes | ... | Don't know | Somewhat difficult | No | No | Yes | Yes | Yes | Yes | No | No |
| 3 | 31 | Male | United Kingdom | NaN | Yes | Yes | Often | 26-100 | No | Yes | ... | No | Somewhat difficult | Yes | Yes | Some of them | No | Maybe | Maybe | No | Yes |
| 4 | 31 | Male | United States | NaN | No | No | Never | 100-500 | Yes | Yes | ... | Don't know | Don't know | No | No | Some of them | Yes | Yes | Yes | Don't know | No |


### Deal with null values

- self_employed: drop the 18 rows with missing values
- work_interfere: fill the 264 missing values with the most frequent value (the mode)

In [21]:
import pandas as pd
import numpy as np

# Work on a copy of the DataFrame to avoid pandas' SettingWithCopyWarning
df = df.copy()

# Print initial counts
print("Initial counts:")
print("\nwork_interfere counts:")
print(df['work_interfere'].value_counts())
print("\nwork_interfere null count:", df['work_interfere'].isnull().sum())

print("\nself_employed counts:")
print(df['self_employed'].value_counts())
print("\nself_employed null count:", df['self_employed'].isnull().sum())

# 1. self_employed: drop rows with missing values
df = df.dropna(subset=['self_employed'])

# 2. work_interfere: fill missing values with the most frequent value (mode),
#    using .loc so the assignment targets df itself
mode_value = df['work_interfere'].mode()[0]
df.loc[df['work_interfere'].isnull(), 'work_interfere'] = mode_value

# Equivalent alternative: assign the result of fillna back to the column
# (recent pandas versions warn about fillna(..., inplace=True) on a selected column)
# df['work_interfere'] = df['work_interfere'].fillna(mode_value)

# Print results after cleaning
print("\nAfter cleaning:")
print("\nwork_interfere counts:")
print(df['work_interfere'].value_counts())
print("\nwork_interfere null count:", df['work_interfere'].isnull().sum())

print("\nself_employed counts:")
print(df['self_employed'].value_counts())
print("\nself_employed null count:", df['self_employed'].isnull().sum())

# Print total number of rows
print("\nTotal number of rows in dataset:", len(df))

Initial counts:

work_interfere counts:
work_interfere
Sometimes    722
Never        207
Rarely       170
Often        142
Name: count, dtype: int64

work_interfere null count: 0

self_employed counts:
self_employed
No     1095
Yes     146
Name: count, dtype: int64

self_employed null count: 0

After cleaning:

work_interfere counts:
work_interfere
Sometimes    722
Never        207
Rarely       170
Often        142
Name: count, dtype: int64

work_interfere null count: 0

self_employed counts:
self_employed
No     1095
Yes     146
Name: count, dtype: int64

self_employed null count: 0

Total number of rows in dataset: 1241
