## Dataset Source
This dataset comes from the OSMH/OSMI Mental Health in Tech Survey 2014:
https://osmihelp.org/research

## Dataset Information
This dataset is from a 2014 survey that measures attitudes towards mental health and the frequency of mental health disorders in the tech workplace. The training set contains 1007 records: 510 respondents who sought treatment for a mental health condition and 497 who did not. Studying this dataset may help companies build supportive environments for people affected by mental health disorders. The "treatment" field is the class label that separates the two groups (sought treatment or not).

## Attribute Information:
This dataset contains the following data:

1. Age
2. Gender
3. self_employed: Are you self-employed?
4. family_history: Do you have a family history of mental illness?
5. work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
6. no_employees: How many employees does your company or organization have?
7. remote_work: Do you work remotely (outside of an office) at least 50% of the time?
8. tech_company: Is your employer primarily a tech company/organization?
9. benefits: Does your employer provide mental health benefits?
10. care_options: Do you know the options for mental health care your employer provides?
11. wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
12. seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?
13. anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
14. leave: How easy is it for you to take medical leave for a mental health condition?
15. mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
16. phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
17. coworkers: Would you be willing to discuss a mental health issue with your coworkers?
18. supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
19. mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?
20. phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?
21. mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?
22. obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
23. treatment: Have you sought treatment for a mental health condition?


### Download the training set

Notes: no index column; gender and age need cleaning; one-hot encoding.

In [None]:
# Download from Google Drive
!gdown --id 1HZnYBOe8Z04UzK6T0BXeTH5oaU_ABjIz

Downloading...
From: https://drive.google.com/uc?id=1HZnYBOe8Z04UzK6T0BXeTH5oaU_ABjIz
To: /content/project2.zip
  0% 0.00/19.2k [00:00<?, ?B/s]100% 19.2k/19.2k [00:00<00:00, 35.1MB/s]


In [None]:
!unzip project2.zip
# if seeing the message: "replace project2_test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename:"
# you may enter "A"

Archive:  project2.zip
  inflating: project2_test.csv       
  inflating: project2_train.csv      


In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('project2_train.csv')
df.columns

Index(['Age', 'Gender', 'self_employed', 'family_history', 'work_interfere',
       'no_employees', 'remote_work', 'tech_company', 'benefits',
       'care_options', 'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'treatment'],
      dtype='object')

In [None]:
df.head()

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,treatment
0,37,Female,,No,Often,6-25,No,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,Yes
1,44,M,,No,Rarely,More than 1000,No,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No,No
2,32,Male,,No,Rarely,6-25,No,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,No
3,31,Male,,Yes,Often,26-100,No,Yes,No,Yes,No,No,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,Yes
4,31,Male,,No,Never,100-500,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,No


In [None]:
for col in df.columns:
  print('Unique values in {} :'.format(col), len(df[col].unique()))

Unique values in Age : 52
Unique values in Gender : 44
Unique values in self_employed : 3
Unique values in family_history : 2
Unique values in work_interfere : 5
Unique values in no_employees : 6
Unique values in remote_work : 2
Unique values in tech_company : 2
Unique values in benefits : 3
Unique values in care_options : 3
Unique values in wellness_program : 3
Unique values in seek_help : 3
Unique values in anonymity : 3
Unique values in leave : 5
Unique values in mental_health_consequence : 3
Unique values in phys_health_consequence : 3
Unique values in coworkers : 3
Unique values in supervisor : 3
Unique values in mental_health_interview : 3
Unique values in phys_health_interview : 3
Unique values in mental_vs_physical : 3
Unique values in obs_consequence : 2
Unique values in treatment : 2


In [None]:
df.info()  # columns with fewer than 1007 non-null entries contain missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1007 entries, 0 to 1006
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Age                        1007 non-null   int64 
 1   Gender                     1007 non-null   object
 2   self_employed              994 non-null    object
 3   family_history             1007 non-null   object
 4   work_interfere             800 non-null    object
 5   no_employees               1007 non-null   object
 6   remote_work                1007 non-null   object
 7   tech_company               1007 non-null   object
 8   benefits                   1007 non-null   object
 9   care_options               1007 non-null   object
 10  wellness_program           1007 non-null   object
 11  seek_help                  1007 non-null   object
 12  anonymity                  1007 non-null   object
 13  leave                      1007 non-null   object
 14  mental_h

In [None]:
df.isnull().sum().sort_values(ascending=False)  # there are nulls that need handling

work_interfere               207
self_employed                 13
treatment                      0
wellness_program               0
Gender                         0
family_history                 0
no_employees                   0
remote_work                    0
tech_company                   0
benefits                       0
care_options                   0
seek_help                      0
obs_consequence                0
anonymity                      0
leave                          0
mental_health_consequence      0
phys_health_consequence        0
coworkers                      0
supervisor                     0
mental_health_interview        0
phys_health_interview          0
mental_vs_physical             0
Age                            0
dtype: int64

In [None]:
df['work_interfere'].unique()

array(['Often', 'Rarely', 'Never', 'Sometimes', nan], dtype=object)

In [None]:
df.self_employed.unique()

array([nan, 'Yes', 'No'], dtype=object)
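
Both columns with missing values are categorical, so one option is to keep "no answer" as its own category instead of dropping the columns. The following is a minimal sketch on a toy frame; the `toy` data and the `'Missing'` label are illustrative assumptions, not values from the survey.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two columns that contain NaN in the training set.
toy = pd.DataFrame({
    "work_interfere": ["Often", np.nan, "Never", "Sometimes", np.nan],
    "self_employed": [np.nan, "Yes", "No", "No", "No"],
})

# Count missing values per column, largest first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)

# Treat "no answer" as its own category instead of discarding rows or columns.
filled = toy.fillna("Missing")

print(missing_counts)
print(filled["work_interfere"].value_counts())
```

Whether a separate category is appropriate depends on why the answer is missing; for `work_interfere` the question only applies to respondents who report a condition, so the gap may itself carry information.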

In [None]:
df.Gender.unique()

array(['Female', 'M', 'Male', 'female', 'male', 'm', 'maile',
       'Trans-female', 'Cis Female', 'F', 'something kinda male?',
       'Cis Male', 'Woman', 'f', 'Mal', 'Male (CIS)', 'queer/she/they',
       'non-binary', 'woman', 'Make', 'Nah', 'All', 'Enby', 'fluid',
       'Genderqueer', 'Androgyne', 'cis-female/femme', 'Guy (-ish) ^_^',
       'male leaning androgynous', 'Male ', 'Trans woman', 'Man', 'msle',
       'Neuter', 'queer', 'Female (cis)', 'Mail', 'cis male',
       'A little about you', 'Malr', 'p', 'femail', 'Cis Man',
       'ostensibly male unsure what that really means'], dtype=object)

In [None]:
other  = ['A little about you', 'p', 'Nah', 'Enby', 'Trans-female','something kinda male?','queer/she/they','non-binary','All','fluid', 'Genderqueer','Androgyne', 'Agender','Guy (-ish) ^_^', 'male leaning androgynous','Trans woman','Neuter', 'Female (trans)','queer','ostensibly male unsure what that really means','trans']
male   = ['male', 'Male','M', 'm', 'Male-ish', 'maile','Cis Male','Mal', 'Male (CIS)','Make','Male ', 'Man', 'msle','cis male', 'Cis Man','Malr','Mail']
female = ['Female', 'female','Cis Female', 'F','f','Femake', 'woman','Female ','cis-female/femme','Female (cis)','femail','Woman','female']
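
One way to apply lists like the three above is to turn them into a single lookup table and map the column through it after normalizing case and whitespace. This is a sketch only: the shortened lists and the `raw` series are hypothetical, and unmatched answers are deliberately left as NaN so they can be inspected.

```python
import pandas as pd

# Hypothetical, shortened versions of the three lists, already lowercased.
male = ["male", "m", "man", "cis male", "maile"]
female = ["female", "f", "woman", "cis female", "femail"]
other = ["non-binary", "enby", "genderqueer", "queer"]

# Build one lookup table: raw answer -> group label.
lookup = {}
for label, answers in (("male", male), ("female", female), ("other", other)):
    for answer in answers:
        lookup[answer] = label

raw = pd.Series(["Male ", "F", "maile", "Enby", "p"])

# Normalize case and whitespace before the lookup; unmatched answers become NaN.
grouped = raw.str.strip().str.lower().map(lookup)
print(grouped.tolist())
```

Normalizing first means the lists only need one spelling per answer, rather than separate entries such as `'Male'`, `'male'` and `'Male '`.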

In [None]:
df.Age.min(), df.Age.max()  # TODO: bound the ages, e.g. set values below 18 to 18 and above 72 to 72

(-1726, 99999999999)
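
The minimum and maximum above are clearly data-entry errors. Two common ways to handle them are sketched below on a toy series: clipping to a plausible range, or replacing out-of-range values with the median of the valid ones. The bounds 18 and 72 follow the note in the cell above and are an assumption, not a property of the dataset.

```python
import pandas as pd

ages = pd.Series([37, -1726, 44, 99999999999, 5, 72])

# Option 1: clip to an assumed plausible range of 18 to 72.
clipped = ages.clip(lower=18, upper=72)

# Option 2: replace out-of-range values with the median of the in-range values.
in_range = ages.between(18, 72)  # inclusive on both ends
median_age = ages[in_range].median()
imputed = ages.where(in_range, median_age)

print(clipped.tolist())
print(imputed.tolist())
```

Clipping keeps the direction of the error (too low stays low), while median imputation treats the value as unknown; which is preferable depends on how the errors arose.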

In [None]:
df.treatment = df.treatment.astype('category')
df.treatment = df.treatment.cat.codes
df.treatment.value_counts()

1    510
0    497
Name: treatment, dtype: int64
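
`cat.codes` assigns integers by sorted category order, so `'No'` becomes 0 and `'Yes'` becomes 1. An explicit mapping makes that visible in the code; a small sketch on made-up labels:

```python
import pandas as pd

treatment = pd.Series(["Yes", "No", "No", "Yes", "Yes"])

# Explicit mapping: the meaning of 0 and 1 is visible in the code.
encoded = treatment.map({"No": 0, "Yes": 1})

# Equivalent here, because categories are sorted alphabetically ('No' < 'Yes').
codes = treatment.astype("category").cat.codes

print(encoded.tolist())
print(encoded.value_counts().to_dict())
```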

### The stage is yours

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from scipy.stats import randint

# prep
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.preprocessing import binarize, LabelEncoder, MinMaxScaler

# models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Validation libraries
from sklearn import metrics
from sklearn.metrics import accuracy_score, mean_squared_error, precision_recall_curve
from sklearn.model_selection import cross_val_score

#Neural Network
from sklearn.neural_network import MLPClassifier

#Bagging
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier

#Naive bayes
from sklearn.naive_bayes import GaussianNB 



In [None]:
#dealing with missing data
#df1 = df.drop(['work_interfere','self_employed'], axis=1)
#df1.columns

In [None]:
#Cleaning NaN
defaultInt = 0
defaultString = 'NaN'
defaultFloat = 0.0

intFeatures = ['Age']
stringFeatures = ['Gender', 'self_employed', 'family_history', 'work_interfere',
       'no_employees', 'remote_work', 'tech_company', 'benefits',
       'care_options', 'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'treatment']
                 
floatFeatures = []

# Clean the NaN's
for feature in df:
    if feature in intFeatures:
        df[feature] = df[feature].fillna(defaultInt)
    elif feature in stringFeatures:
        df[feature] = df[feature].fillna(defaultString)
    elif feature in floatFeatures:
        df[feature] = df[feature].fillna(defaultFloat)
    else:
        print('Error: Feature %s not recognized.' % feature)
df.head(5)

# Clean 'Gender': group the free-text answers into male / female / trans.
# The lists are compared against the lowercased answers.
male_str = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man","msle", "mail", "malr","cis man", "Cis Male", "cis male"]
trans_str = ["trans-female", "something kinda male?", "queer/she/they", "non-binary","nah", "all", "enby", "fluid", 
             "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", 
             "trans woman", "neuter", "female (trans)", "queer", "ostensibly male unsure what that really means"]           
female_str = ["cis female", "f", "female", "woman",  "femake", "female ","cis-female/femme", "female (cis)", "femail"]

# Map each lowercased answer to its group; answers in no group are left unchanged
gender_map = {g: 'male' for g in male_str}
gender_map.update({g: 'female' for g in female_str})
gender_map.update({g: 'trans' for g in trans_str})
gender_lower = df['Gender'].str.lower()
df['Gender'] = gender_lower.map(gender_map).fillna(df['Gender'])

# Drop rows whose Gender entry is not a usable answer
stk_list = ['A little about you', 'p']
df = df[~df['Gender'].isin(stk_list)].copy()

print(df['Gender'].unique())

# Fill missing ages with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Replace implausible ages (< 18 or > 120) with the median
df.loc[df['Age'] < 18, 'Age'] = df['Age'].median()
df.loc[df['Age'] > 120, 'Age'] = df['Age'].median()

# Ranges of Age (ages above 100 fall outside the bins and become NaN)
df['age_range'] = pd.cut(df['Age'], [0,20,30,65,100], labels=["0-20", "21-30", "31-65", "66-100"], include_lowest=True)



['female' 'male' 'trans']
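
The age bins used with `pd.cut` are right-inclusive, and they stop at 100 even though the cleaning step accepts ages up to 120. The toy example below (made-up ages) shows where boundary values land and that an age above 100 is left without a bin.

```python
import pandas as pd

ages = pd.Series([18, 20, 21, 30, 31, 65, 66, 100, 110])

# Same bins and labels as in the cell above.
age_range = pd.cut(
    ages,
    bins=[0, 20, 30, 65, 100],
    labels=["0-20", "21-30", "31-65", "66-100"],
    include_lowest=True,
)

print(age_range.tolist())
print("unbinned:", int(age_range.isna().sum()))
```

Raising the last bin edge to 120, or lowering the plausibility cutoff to 100, would keep the two steps consistent.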


In [None]:
df2 = df.drop('Age', axis=1)
df_age= df.iloc[:,:1]

In [None]:
df_age

Unnamed: 0,Age
0,37
1,44
2,32
3,31
4,31
...,...
1002,36
1003,26
1004,32
1005,34


In [None]:
#Encoding data
labelDict = {}
for feature in df2:
    le = preprocessing.LabelEncoder()
    le.fit(df2[feature])
    le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    df2[feature] = le.transform(df2[feature])
    # Get labels
    labelKey = 'label_' + feature
    labelValue = [*le_name_mapping]
    labelDict[labelKey] =labelValue
    
for key, value in labelDict.items():     
    print(key, value)

df2.head()

label_Gender ['female', 'male', 'trans']
label_self_employed ['NaN', 'No', 'Yes']
label_family_history ['No', 'Yes']
label_work_interfere ['NaN', 'Never', 'Often', 'Rarely', 'Sometimes']
label_no_employees ['1-5', '100-500', '26-100', '500-1000', '6-25', 'More than 1000']
label_remote_work ['No', 'Yes']
label_tech_company ['No', 'Yes']
label_benefits ["Don't know", 'No', 'Yes']
label_care_options ['No', 'Not sure', 'Yes']
label_wellness_program ["Don't know", 'No', 'Yes']
label_seek_help ["Don't know", 'No', 'Yes']
label_anonymity ["Don't know", 'No', 'Yes']
label_leave ["Don't know", 'Somewhat difficult', 'Somewhat easy', 'Very difficult', 'Very easy']
label_mental_health_consequence ['Maybe', 'No', 'Yes']
label_phys_health_consequence ['Maybe', 'No', 'Yes']
label_coworkers ['No', 'Some of them', 'Yes']
label_supervisor ['No', 'Some of them', 'Yes']
label_mental_health_interview ['Maybe', 'No', 'Yes']
label_phys_health_interview ['Maybe', 'No', 'Yes']
label_mental_vs_physical ["Don't 

Unnamed: 0,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,treatment,age_range
0,0,0,0,2,4,0,1,2,1,1,2,2,2,1,1,1,2,1,0,2,0,1,2
1,1,0,0,3,5,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,2
2,1,0,0,3,4,0,1,1,0,1,1,0,1,1,1,2,2,2,2,1,0,0,2
3,1,0,1,2,2,0,1,1,2,1,1,1,1,2,2,1,0,0,0,1,1,1,2
4,1,0,0,1,1,1,1,2,0,0,0,0,0,1,1,1,2,2,2,0,0,0,2
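
The loop above fits a fresh `LabelEncoder` per column, and the fitted encoders are not kept. If the same integer codes are needed for another data split, one alternative is to fit a single encoder on the training data and reuse it. The sketch below uses scikit-learn's `OrdinalEncoder` on two made-up frames; the choice of encoder and the `-1` code for unseen categories are assumptions, not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"leave": ["Very easy", "Don't know", "Somewhat easy"],
                      "benefits": ["Yes", "No", "Don't know"]})
test = pd.DataFrame({"leave": ["Somewhat easy", "Very difficult"],
                     "benefits": ["No", "Yes"]})

# Fit once on the training data; unseen test categories are encoded as -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train)
test_codes = encoder.transform(test)

print(encoder.categories_)
print(test_codes)
```

Here `'Very difficult'` never appears in the toy training frame, so it is encoded as -1 rather than silently shifting the codes of the other categories.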


In [None]:
df1=df_age.merge(df2, how='inner', left_index=True, right_index=True)

df1.head()

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,treatment,age_range
0,37,0,0,0,2,4,0,1,2,1,1,2,2,2,1,1,1,2,1,0,2,0,1,2
1,44,1,0,0,3,5,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,2
2,32,1,0,0,3,4,0,1,1,0,1,1,0,1,1,1,2,2,2,2,1,0,0,2
3,31,1,0,1,2,2,0,1,1,2,1,1,1,1,2,2,1,0,0,0,1,1,1,2
4,31,1,0,0,1,1,1,1,2,0,0,0,0,0,1,1,1,2,2,2,0,0,0,2


In [None]:
# Feature-scale Age
scaler = MinMaxScaler().fit(df1[['Age']])
df1['Age'] = scaler.transform(df1[['Age']])
df1.head()

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,treatment,age_range
0,0.351852,0,0,0,2,4,0,1,2,1,1,2,2,2,1,1,1,2,1,0,2,0,1,2
1,0.481481,1,0,0,3,5,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,2
2,0.259259,1,0,0,3,4,0,1,1,0,1,1,0,1,1,1,2,2,2,2,1,0,0,2
3,0.240741,1,0,1,2,2,0,1,1,2,1,1,1,1,2,2,1,0,0,0,1,1,1,2
4,0.240741,1,0,0,1,1,1,1,2,0,0,0,0,0,1,1,1,2,2,2,0,0,0,2


In [None]:
from sklearn.model_selection import train_test_split
# Train and Test set
feature_cols = ['Age', 'Gender', 'family_history', 'benefits', 'care_options', 'anonymity', 'leave', 'work_interfere']
X = df1.drop('treatment', axis=1)
y = df1.treatment

# Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 408570291) # change random_state to your own student ID (digits only)

In [None]:
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
ypred = random_forest.predict(X_test)

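# Note: classification_report expects (y_true, y_pred). The arguments are swapped here,
# so the precision and recall columns in the printed report are interchanged.
print(metrics.classification_report(ypred, y_test))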

              precision    recall  f1-score   support

           0       0.74      0.90      0.81        79
           1       0.92      0.80      0.85       122

    accuracy                           0.84       201
   macro avg       0.83      0.85      0.83       201
weighted avg       0.85      0.84      0.84       201
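
Argument order matters when reading a report like the one above: scikit-learn's metric functions take the true labels first and the predictions second. A small sketch with made-up labels shows what changes when the two are swapped.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0]

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

report = classification_report(y_true, y_pred, output_dict=True)
swapped = classification_report(y_pred, y_true, output_dict=True)

print(cm)
print("correct order:", round(report["1"]["precision"], 3), round(report["1"]["recall"], 3))
print("swapped order:", round(swapped["1"]["precision"], 3), round(swapped["1"]["recall"], 3))
```

Swapping the arguments exchanges precision and recall for each class, while accuracy stays the same.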



In [None]:
# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
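
The cell above computes the importances and their ordering but does not display them. A sketch of how such a ranking could be printed follows; it uses synthetic data from `make_classification` and hypothetical feature names, so the numbers say nothing about the survey.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data; the feature names are hypothetical.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest_demo = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
importances = forest_demo.feature_importances_
# Spread of each feature's importance across the individual trees.
std = np.std([t.feature_importances_ for t in forest_demo.estimators_], axis=0)
order = np.argsort(importances)[::-1]  # most important first

for rank, idx in enumerate(order, start=1):
    print(f"{rank}. {names[idx]}: {importances[idx]:.3f} (+/- {std[idx]:.3f})")
```

With a DataFrame as input, `X.columns` can take the place of the `names` list.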
    


In [None]:
from sklearn.svm import LinearSVC

# Build and fit the model
lsvm = LinearSVC()
lsvm.fit(X_train, y_train)

# Predict
y_pred = lsvm.predict(X_test)

# Classification accuracy
print("Accuracy:", round(lsvm.score(X_test, y_test), 4))

Accuracy: 0.7811


In [None]:
from sklearn.model_selection import GridSearchCV
C_grid = [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1, 2, 3, 4, 5, 10, 20, 30, 40]
param_grid = {'C':C_grid}
grid = GridSearchCV(LinearSVC(), param_grid, cv=10, scoring='f1')
grid.fit(X_train, y_train)

print('Best paras',grid.best_params_)

param_grid = {'C': [0.0001, 0.01, 0.1, 0.3, 0.5, 1, 10, 100], 'gamma':[100, 10, 1, 0.1, 0.01, 0.001]} 
from sklearn.svm import SVC
grid_search = GridSearchCV(SVC(), param_grid, cv=10)

grid_search.fit(X_train, y_train)

# Note: this displays the LinearSVC search; use grid_search.best_params_ for the SVC search
grid.best_params_



Best paras {'C': 0.3}


{'C': 0.3}
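
For a kernel SVC, the search over `C` and `gamma` is often wrapped in a pipeline so that any scaling is refit inside each cross-validation fold. The sketch below shows the pattern on synthetic data; the grid values, the `StandardScaler` step and `cv=5` are illustrative assumptions rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Scaling inside the pipeline is refit on each training fold.
pipe = make_pipeline(StandardScaler(), SVC())

# Parameter names are prefixed with the lowercased step name, here "svc".
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.predict` uses the best parameter combination refit on all of the data passed to `fit`.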

In [None]:
import numpy as np
import pandas as pd
from sklearn import svm, preprocessing, metrics

# Build the SVC model
svc = svm.SVC()
svc_fit = svc.fit(X_train, y_train)

# Predict
test_y_predicted = svc.predict(X_test)

# Performance
accuracy = metrics.accuracy_score(y_test, test_y_predicted)
print(accuracy)

0.8059701492537313


### Make prediction and submission file

In [None]:
x_test = pd.read_csv('project2_test.csv')

In [None]:
# The test data needs the same processing (gender, age, missing-value imputation)

#dealing with missing data

#x_test = x_test.drop(['work_interfere','self_employed'], axis=1)

#Cleaning NaN
defaultInt = 0
defaultString = 'NaN'
defaultFloat = 0.0

intFeatures = ['Age']
stringFeatures = ['Gender', 'self_employed', 'family_history', 'work_interfere',
       'no_employees', 'remote_work', 'tech_company', 'benefits',
       'care_options', 'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'treatment']
floatFeatures = []

# Clean the NaN's
for feature in x_test:
    if feature in intFeatures:
        x_test[feature] = x_test[feature].fillna(defaultInt)
    elif feature in stringFeatures:
        x_test[feature] = x_test[feature].fillna(defaultString)
    elif feature in floatFeatures:
        x_test[feature] = x_test[feature].fillna(defaultFloat)
    else:
        print('Error: Feature %s not recognized.' % feature)

# Clean 'Gender': group the free-text answers into male / female / trans.
# The lists are compared against the lowercased answers.
male_str = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man","msle", "mail", "malr","cis man", "Cis Male", "cis male"]
trans_str = ["trans-female", "something kinda male?", "queer/she/they", "non-binary","nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "trans woman", "neuter", "female (trans)", "queer", "ostensibly male unsure what that really means"]           
female_str = ["cis female", "f", "female", "woman",  "femake", "female ","cis-female/femme", "female (cis)", "femail"]

# Map each lowercased answer to its group; answers in no group are left unchanged
gender_map = {g: 'male' for g in male_str}
gender_map.update({g: 'female' for g in female_str})
gender_map.update({g: 'trans' for g in trans_str})
gender_lower = x_test['Gender'].str.lower()
x_test['Gender'] = gender_lower.map(gender_map).fillna(x_test['Gender'])

# Drop rows whose Gender entry is not a usable answer
# (any row dropped here will have no prediction in the submission)
stk_list = ['A little about you', 'p']
x_test = x_test[~x_test['Gender'].isin(stk_list)].copy()

print(x_test['Gender'].unique())

# Fill missing ages with the median
x_test['Age'] = x_test['Age'].fillna(x_test['Age'].median())

# Replace implausible ages (< 18 or > 120) with the median
x_test.loc[x_test['Age'] < 18, 'Age'] = x_test['Age'].median()
x_test.loc[x_test['Age'] > 120, 'Age'] = x_test['Age'].median()

# Ranges of Age (ages above 100 fall outside the bins and become NaN)
x_test['age_range'] = pd.cut(x_test['Age'], [0,20,30,65,100], labels=["0-20", "21-30", "31-65", "66-100"], include_lowest=True)


['male' 'female' 'trans']


In [None]:
x_test2 = x_test.drop('Age', axis=1)
x_test_age= x_test.iloc[:,:1]

In [None]:
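# Encoding data
# Note: the encoders are refit on the test set here, so the integer codes match the
# training set only if every column contains the same set of categories in both files.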
labelDict = {}
for feature in x_test2:
    le = preprocessing.LabelEncoder()
    le.fit(x_test2[feature])
    le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    x_test2[feature] = le.transform(x_test2[feature])
    # Get labels
    labelKey = 'label_' + feature
    labelValue = [*le_name_mapping]
    labelDict[labelKey] =labelValue
    
for key, value in labelDict.items():     
    print(key, value)

x_test2

label_Gender ['female', 'male', 'trans']
label_self_employed ['NaN', 'No', 'Yes']
label_family_history ['No', 'Yes']
label_work_interfere ['NaN', 'Never', 'Often', 'Rarely', 'Sometimes']
label_no_employees ['1-5', '100-500', '26-100', '500-1000', '6-25', 'More than 1000']
label_remote_work ['No', 'Yes']
label_tech_company ['No', 'Yes']
label_benefits ["Don't know", 'No', 'Yes']
label_care_options ['No', 'Not sure', 'Yes']
label_wellness_program ["Don't know", 'No', 'Yes']
label_seek_help ["Don't know", 'No', 'Yes']
label_anonymity ["Don't know", 'No', 'Yes']
label_leave ["Don't know", 'Somewhat difficult', 'Somewhat easy', 'Very difficult', 'Very easy']
label_mental_health_consequence ['Maybe', 'No', 'Yes']
label_phys_health_consequence ['Maybe', 'No', 'Yes']
label_coworkers ['No', 'Some of them', 'Yes']
label_supervisor ['No', 'Some of them', 'Yes']
label_mental_health_interview ['Maybe', 'No', 'Yes']
label_phys_health_interview ['Maybe', 'No', 'Yes']
label_mental_vs_physical ["Don't 

Unnamed: 0,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,age_range
0,1,1,1,3,4,1,1,0,0,1,1,0,4,1,1,2,2,0,2,2,0,1
1,1,1,0,0,2,0,0,0,1,1,0,0,0,1,1,2,2,1,1,1,0,2
2,0,1,0,4,1,0,1,1,0,1,1,0,3,2,2,1,0,1,1,1,1,2
3,1,1,0,3,5,0,1,2,0,1,1,0,4,2,0,1,0,1,1,0,0,1
4,1,1,0,0,5,1,1,0,1,0,0,0,0,1,1,1,2,1,1,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
247,1,1,1,2,2,0,1,1,0,1,1,0,2,0,0,2,2,1,1,1,0,1
248,1,1,1,2,5,0,1,2,0,1,1,0,0,0,0,1,1,1,1,0,1,2
249,0,1,0,0,4,0,1,1,0,1,1,0,1,0,1,1,1,1,1,0,0,2
250,1,1,0,1,5,1,1,0,1,1,1,0,0,2,1,2,1,1,1,1,1,2


In [None]:
x_test=x_test_age.merge(x_test2, how='inner', left_index=True, right_index=True)

x_test.head()

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,age_range
0,24,1,1,1,3,4,1,1,0,0,1,1,0,4,1,1,2,2,0,2,2,0,1
1,35,1,1,0,0,2,0,0,0,1,1,0,0,0,1,1,2,2,1,1,1,0,2
2,32,0,1,0,4,1,0,1,1,0,1,1,0,3,2,2,1,0,1,1,1,1,2
3,29,1,1,0,3,5,0,1,2,0,1,1,0,4,2,0,1,0,1,1,0,0,1
4,39,1,1,0,0,5,1,1,0,1,0,0,0,0,1,1,1,2,1,1,0,0,2


In [None]:
# Feature-scale Age (reusing the scaler fitted on the training set)
x_test['Age'] = scaler.transform(x_test[['Age']])
x_test.head()

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,age_range
0,0.111111,1,1,1,3,4,1,1,0,0,1,1,0,4,1,1,2,2,0,2,2,0,1
1,0.314815,1,1,0,0,2,0,0,0,1,1,0,0,0,1,1,2,2,1,1,1,0,2
2,0.259259,0,1,0,4,1,0,1,1,0,1,1,0,3,2,2,1,0,1,1,1,1,2
3,0.203704,1,1,0,3,5,0,1,2,0,1,1,0,4,2,0,1,0,1,1,0,0,1
4,0.388889,1,1,0,0,5,1,1,0,1,0,0,0,0,1,1,1,2,1,1,0,0,2
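
Reusing the scaler fitted on the training data means test ages are scaled with the training minimum and maximum, so a test age outside the training range maps outside [0, 1]. A short sketch with made-up ages:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train_age = pd.DataFrame({"Age": [18, 30, 45, 72]})
test_age = pd.DataFrame({"Age": [24, 80]})

# The scaler learns min=18 and max=72 from the training ages only.
age_scaler = MinMaxScaler().fit(train_age)
scaled_test = age_scaler.transform(test_age)

# (24 - 18) / (72 - 18) is inside [0, 1]; (80 - 18) / (72 - 18) is above 1.
print(scaled_test.ravel())
```

This is expected behaviour rather than an error; refitting the scaler on the test set would instead make the two splits inconsistent.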


In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Treatment'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
df_submit['Treatment'] = forest.predict(x_test)

df_submit.to_csv('submission_forest.csv', index=None)

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Treatment'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
df_submit['Treatment'] = lsvm.predict(x_test)

df_submit.to_csv('submission_lsvm.csv', index=None)

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Treatment'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
df_submit['Treatment'] = grid.predict(x_test)

df_submit.to_csv('submission_grid.csv', index=None)

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Treatment'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
df_submit['Treatment'] = grid_search.predict(x_test)

df_submit.to_csv('submission_grid_search.csv', index=None)

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Treatment'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
df_submit['Treatment'] = random_forest.predict(x_test)

df_submit.to_csv('submission_random_forest.csv', index=None)

In [None]:
df_submit = pd.DataFrame([], columns=['Id', 'Treatment'])
df_submit['Id'] = [f'{i:03d}' for i in range(len(x_test))]
# `model` is not defined earlier in this notebook; assign the chosen fitted estimator to it first
df_submit['Treatment'] = model.predict(x_test)

In [None]:
df_submit.to_csv('submission.csv', index=None)
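
The submission cells above all repeat the same three steps. A small helper can build the frame once per model; the sketch below uses placeholder predictions and a hypothetical `build_submission` function, and writes to an in-memory buffer so it runs without touching the file system.

```python
import io
import pandas as pd

def build_submission(predictions):
    """Return a two-column frame with zero-padded Ids, one row per prediction."""
    return pd.DataFrame({
        "Id": [f"{i:03d}" for i in range(len(predictions))],
        "Treatment": list(predictions),
    })

preds = [1, 0, 1, 1]  # placeholder predictions
submission = build_submission(preds)

# Write to an in-memory buffer here; pass a file path to write a real CSV.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Because the Ids are generated from the number of predictions, it is worth checking that `len(submission)` equals the number of rows in the original test file, particularly if any rows were dropped during cleaning.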