For my course project I chose the topic "Mental Health".
For it I use the "Mental Health in Tech Survey" dataset:
https://www.kaggle.com/osmi/mental-health-in-tech-survey

I want to train a model that, from a respondent's answers to the survey questions, predicts whether they have sought professional mental health treatment.
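
As a rough sketch of the kind of model described above, the snippet below fits a logistic regression inside a scikit-learn pipeline on a small synthetic table. The toy data, the `build_model` helper name, and the choice of features are assumptions made for illustration only; they are not the survey data and not necessarily the final model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_model():
    """Hypothetical helper: scale the features, then fit a logistic regression."""
    return make_pipeline(StandardScaler(), LogisticRegression())


# Synthetic stand-in for the survey (NOT real data).
rng = np.random.default_rng(0)
n_rows = 200
toy = pd.DataFrame({
    "Age": rng.integers(18, 60, size=n_rows),
    "family_history": rng.integers(0, 2, size=n_rows),
    "remote_work": rng.integers(0, 2, size=n_rows),
})

# Toy target: copy family_history and flip about 20% of the labels as noise.
flip = rng.random(n_rows) < 0.2
toy["treatment"] = np.where(flip, 1 - toy["family_history"], toy["family_history"])

X = toy.drop(columns=["treatment"])
y = toy["treatment"]

# Stratify so both classes keep the same share in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = build_model()
model.fit(X_train, y_train)

# Predicted probability of the positive class (treatment == 1).
proba = model.predict_proba(X_test)[:, 1]
```

On the real data, the same pipeline would be fitted after the categorical answers have been recoded to numbers.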

Content

This dataset contains the following data:

Timestamp
Age
Gender
Country
state: If you live in the United States, which state or territory do you live in?

self_employed: Are you self-employed?

family_history: Do you have a family history of mental illness?

treatment: Have you sought treatment for a mental health condition?

work_interfere: If you have a mental health condition, do you feel that it interferes with your work?

no_employees: How many employees does your company or organization have?

remote_work: Do you work remotely (outside of an office) at least 50% of the time?

tech_company: Is your employer primarily a tech company/organization?

benefits: Does your employer provide mental health benefits?

care_options: Do you know the options for mental health care your employer provides?

wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?

seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?

anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?

leave: How easy is it for you to take medical leave for a mental health condition?

mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?

phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?

coworkers: Would you be willing to discuss a mental health issue with your coworkers?

supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?

mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?

phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?

mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?

obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?

comments: Any additional notes or comments

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
#from sklearn.feature_extraction.text import TfidfVectorizer
import itertools

import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
df = pd.read_csv("survey.csv")
df.head(3)

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,


We take treatment ("Have you sought treatment for a mental health condition?") as the target variable: it indicates whether the respondent has sought help for their mental health.

In [3]:
df['treatment'].value_counts()

Yes    637
No     622
Name: treatment, dtype: int64

In [4]:
df.loc[df['treatment'] == 'Yes', 'treatment']=1
df.loc[df['treatment'] == 'No', 'treatment']=0
df["treatment"] = df["treatment"].astype('int32')
df['treatment'].value_counts()

1    637
0    622
Name: treatment, dtype: int64

Looking at the class distribution, we see that the dataset is fairly small (1,259 rows) and that the classes are balanced (637 vs. 622).
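
The balance can also be expressed as class shares rather than raw counts. The sketch below rebuilds a series from the counts printed above (it does not read survey.csv) and uses `value_counts(normalize=True)`; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the target from the counts shown in the output above.
treatment = pd.Series([1] * 637 + [0] * 622, name="treatment")

# normalize=True returns the share of each class instead of the raw count.
shares = treatment.value_counts(normalize=True)
majority_share = shares.max()
```

The majority class accounts for roughly 50.6% of the rows, so plain accuracy is a usable first metric here.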

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   int32 
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness

In [6]:
df['self_employed'].value_counts()

No     1095
Yes     146
Name: self_employed, dtype: int64

In [7]:
df.loc[df['self_employed'] == 'Yes', 'self_employed']=1
df.loc[df['self_employed'] == 'No', 'self_employed']=0

df['self_employed'].value_counts()

0    1095
1     146
Name: self_employed, dtype: int64

In [8]:
df['family_history'].value_counts()

No     767
Yes    492
Name: family_history, dtype: int64

In [9]:
df.loc[df['family_history'] == 'Yes', 'family_history']=1
df.loc[df['family_history'] == 'No', 'family_history']=0
df["family_history"] = df["family_history"].astype('int32')
df['family_history'].value_counts()

0    767
1    492
Name: family_history, dtype: int64

In [10]:
df['work_interfere'].value_counts()

Sometimes    465
Never        213
Rarely       173
Often        144
Name: work_interfere, dtype: int64

In [11]:
df['no_employees'].value_counts()

6-25              290
26-100            289
More than 1000    282
100-500           176
1-5               162
500-1000           60
Name: no_employees, dtype: int64
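
The no_employees answers are ordered size buckets, so one possible encoding (an assumption, not something this notebook does so far) is an ordered categorical whose integer codes follow company size. `size_order` and `no_employees_code` are hypothetical names, and the sample rows are made up from the categories listed above.

```python
import pandas as pd

# Buckets from the value_counts() output, sorted from smallest to largest.
size_order = ["1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000"]

sample = pd.DataFrame(
    {"no_employees": ["6-25", "More than 1000", "1-5", "100-500"]}
)

# An ordered categorical assigns each bucket its position in size_order.
size_cat = pd.Categorical(
    sample["no_employees"], categories=size_order, ordered=True
)
sample["no_employees_code"] = size_cat.codes
```

A value that is not in `size_order` would get the code -1, which makes unexpected answers easy to spot.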

In [12]:
df['remote_work'].value_counts()

No     883
Yes    376
Name: remote_work, dtype: int64

In [13]:
df.loc[df['remote_work'] == 'Yes', 'remote_work']=1
df.loc[df['remote_work'] == 'No', 'remote_work']=0
df["remote_work"] = df["remote_work"].astype('int32')
df['remote_work'].value_counts()

0    883
1    376
Name: remote_work, dtype: int64

In [14]:
df['tech_company'].value_counts()

Yes    1031
No      228
Name: tech_company, dtype: int64

In [15]:
df.loc[df['tech_company'] == 'Yes', 'tech_company']=1
df.loc[df['tech_company'] == 'No', 'tech_company']=0
df["tech_company"] = df["tech_company"].astype('int32')
df['tech_company'].value_counts()

1    1031
0     228
Name: tech_company, dtype: int64

In [16]:
df['benefits'].value_counts()

Yes           477
Don't know    408
No            374
Name: benefits, dtype: int64

In [17]:
df['care_options'].value_counts()

No          501
Yes         444
Not sure    314
Name: care_options, dtype: int64

In [47]:
df.loc[df['care_options'] == 'Yes', 'care_options']=1
df.loc[df['care_options'] == 'No', 'care_options']=0
df.loc[df['care_options'] == 'Not sure', 'care_options']=1
df["care_options"] = df["care_options"].astype('int32')
df['care_options'].value_counts()

1    758
0    501
Name: care_options, dtype: int64

In [18]:
df['wellness_program'].value_counts()

No            842
Yes           229
Don't know    188
Name: wellness_program, dtype: int64

In [44]:
df.loc[df['wellness_program'] == 'Yes', 'wellness_program']=1
df.loc[df['wellness_program'] == 'No', 'wellness_program']=0
df.loc[df['wellness_program'] ==   "Don't know", 'wellness_program']=0
df["wellness_program"] = df["wellness_program"].astype('int32')
df['wellness_program'].value_counts()

0    1030
1     229
Name: wellness_program, dtype: int64

In [19]:
df['seek_help'].value_counts()

No            646
Don't know    363
Yes           250
Name: seek_help, dtype: int64

In [45]:
df.loc[df['seek_help'] == 'Yes', 'seek_help']=1
df.loc[df['seek_help'] == 'No', 'seek_help']=0
df.loc[df['seek_help'] ==   "Don't know", 'seek_help']=0
df["seek_help"] = df["seek_help"].astype('int32')
df['seek_help'].value_counts()

0    1009
1     250
Name: seek_help, dtype: int64

In [20]:
df['anonymity'].value_counts()

Don't know    819
Yes           375
No             65
Name: anonymity, dtype: int64

In [21]:
df['leave'].value_counts()

Don't know            563
Somewhat easy         266
Very easy             206
Somewhat difficult    126
Very difficult         98
Name: leave, dtype: int64

In [22]:
df['mental_health_consequence'].value_counts()

No       490
Maybe    477
Yes      292
Name: mental_health_consequence, dtype: int64

In [23]:
df.loc[df['mental_health_consequence'] == 'Yes', 'mental_health_consequence']=1
df.loc[df['mental_health_consequence'] == 'No', 'mental_health_consequence']=0
df.loc[df['mental_health_consequence'] == 'Maybe', 'mental_health_consequence']=1
df["mental_health_consequence"] = df["mental_health_consequence"].astype('int32')
df['mental_health_consequence'].value_counts()

1    769
0    490
Name: mental_health_consequence, dtype: int64

In [24]:
df['phys_health_consequence'].value_counts()

No       925
Maybe    273
Yes       61
Name: phys_health_consequence, dtype: int64

In [25]:
df.loc[df['phys_health_consequence'] == 'Yes', 'phys_health_consequence']=1
df.loc[df['phys_health_consequence'] == 'No', 'phys_health_consequence']=0
df.loc[df['phys_health_consequence'] == 'Maybe', 'phys_health_consequence']=1
df["phys_health_consequence"] = df["phys_health_consequence"].astype('int32')
df['phys_health_consequence'].value_counts()

0    925
1    334
Name: phys_health_consequence, dtype: int64

In [26]:
df['coworkers'].value_counts()

Some of them    774
No              260
Yes             225
Name: coworkers, dtype: int64

In [27]:
df.loc[df['coworkers'] == 'Yes', 'coworkers']=1
df.loc[df['coworkers'] == 'No', 'coworkers']=0
df.loc[df['coworkers'] == 'Some of them', 'coworkers']=1
df["coworkers"] = df["coworkers"].astype('int32')
df['coworkers'].value_counts()

1    999
0    260
Name: coworkers, dtype: int64

In [28]:
df['supervisor'].value_counts()

Yes             516
No              393
Some of them    350
Name: supervisor, dtype: int64

In [29]:
df.loc[df['supervisor'] == 'Yes', 'supervisor']=1
df.loc[df['supervisor'] == 'No', 'supervisor']=0
df.loc[df['supervisor'] == 'Some of them', 'supervisor']=1
df["supervisor"] = df["supervisor"].astype('int32')
df['supervisor'].value_counts()

1    866
0    393
Name: supervisor, dtype: int64

In [30]:
df['mental_health_interview'].value_counts() 

No       1008
Maybe     207
Yes        44
Name: mental_health_interview, dtype: int64

In [31]:
df.loc[df['mental_health_interview'] == 'Yes', 'mental_health_interview']=1
df.loc[df['mental_health_interview'] == 'No', 'mental_health_interview']=0
df.loc[df['mental_health_interview'] == 'Maybe', 'mental_health_interview']=1
df["mental_health_interview"] = df["mental_health_interview"].astype('int32')
df['mental_health_interview'].value_counts()

0    1008
1     251
Name: mental_health_interview, dtype: int64

In [32]:
df['phys_health_interview'].value_counts() 

Maybe    557
No       500
Yes      202
Name: phys_health_interview, dtype: int64

In [33]:
df.loc[df['phys_health_interview'] == 'Yes', 'phys_health_interview']=1
df.loc[df['phys_health_interview'] == 'No', 'phys_health_interview']=0
df.loc[df['phys_health_interview'] == 'Maybe', 'phys_health_interview']=1
df["phys_health_interview"] = df["phys_health_interview"].astype('int32')
df['phys_health_interview'].value_counts()

1    759
0    500
Name: phys_health_interview, dtype: int64

In [34]:
df['mental_vs_physical'].value_counts() 

Don't know    576
Yes           343
No            340
Name: mental_vs_physical, dtype: int64

In [35]:
df.loc[df['mental_vs_physical'] == 'Yes', 'mental_vs_physical']=1
df.loc[df['mental_vs_physical'] == 'No', 'mental_vs_physical']=0
df.loc[df['mental_vs_physical'] == "Don't know", 'mental_vs_physical']=0
df["mental_vs_physical"] = df["mental_vs_physical"].astype('int32')
df['mental_vs_physical'].value_counts()

0    916
1    343
Name: mental_vs_physical, dtype: int64

In [36]:
df['obs_consequence'].value_counts() 

No     1075
Yes     184
Name: obs_consequence, dtype: int64

In [37]:
df.loc[df['obs_consequence'] == 'Yes', 'obs_consequence']=1
df.loc[df['obs_consequence'] == 'No', 'obs_consequence']=0
df["obs_consequence"] = df["obs_consequence"].astype('int32')
df['obs_consequence'].value_counts()

0    1075
1     184
Name: obs_consequence, dtype: int64
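
The recoding cells above all follow the same pattern: map each answer to 0 or 1, then cast the column to an integer type. A more compact way to write this, sketched here on made-up sample rows, uses `Series.map` with one dictionary per column. The `binarize` helper is hypothetical; the two mappings mirror the choices made above ("Maybe" counted as 1, "Don't know" counted as 0).

```python
import pandas as pd


def binarize(series, mapping):
    """Hypothetical helper: recode answers via `mapping` and cast to int32."""
    recoded = series.map(mapping)
    # Answers missing from the mapping become NaN; fail loudly instead of
    # silently producing a float column.
    if recoded.isna().any():
        raise ValueError(f"unmapped or missing values in {series.name!r}")
    return recoded.astype("int32")


sample = pd.DataFrame({
    "mental_health_consequence": ["No", "Maybe", "Yes", "No"],
    "seek_help": ["No", "Don't know", "Yes", "Yes"],
})

mappings = {
    "mental_health_consequence": {"Yes": 1, "Maybe": 1, "No": 0},
    "seek_help": {"Yes": 1, "No": 0, "Don't know": 0},
}

for column, mapping in mappings.items():
    sample[column] = binarize(sample[column], mapping)
```

Keeping the mappings in one dictionary also documents in a single place how each "Maybe" or "Don't know" answer was treated.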

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 24 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Age                        1259 non-null   int64 
 1   Gender                     1259 non-null   object
 2   Country                    1259 non-null   object
 3   self_employed              1259 non-null   object
 4   family_history             1259 non-null   int32 
 5   treatment                  1259 non-null   int32 
 6   work_interfere             1259 non-null   object
 7   no_employees               1259 non-null   object
 8   remote_work                1259 non-null   int32 
 9   tech_company               1259 non-null   int32 
 10  benefits                   1259 non-null   object
 11  care_options               1259 non-null   int32 
 12  wellness_program           1259 non-null   int32 
 13  seek_help                  1259 non-null   int32 
 14  anonymit

In [39]:
df['comments'].value_counts() 
# Drop the free-text, mostly empty, and timestamp columns; columns= already implies axis=1
df = df.drop(columns=['comments', 'state', 'Timestamp'])

In [41]:
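# Series.mode is a method and returns a Series (ties are possible), so call it and take the first value
mode = df['self_employed'].mode()[0]
df["self_employed"] = df["self_employed"].fillna(mode)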

In [42]:
sum(df['work_interfere'].isna())
# Call mode() and take the first (most frequent) value before filling the gaps
mode1 = df['work_interfere'].mode()[0]
df["work_interfere"] = df["work_interfere"].fillna(mode1)
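
A minimal sketch of mode imputation on toy data (not the survey rows): `Series.mode` is a method that returns a Series, because several values can tie for most frequent, so it has to be called and indexed before the result is passed to `fillna`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

answers = pd.Series(
    ["Sometimes", np.nan, "Never", "Sometimes", np.nan],
    name="work_interfere",
)

# mode() ignores NaN by default; [0] picks the first most frequent value.
most_frequent = answers.mode()[0]

# Replace every missing answer with that value.
filled = answers.fillna(most_frequent)
```

Passing `answers.mode` without the parentheses would hand `fillna` the method object itself rather than a value.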