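In the Team Project_02 file we checked feature importance for each model, but only after splitting the data by Gender and BMI. <br>
I was curious what the feature importances would look like on the original file, so here I check them without splitting the data.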

<hr>

#### Load the CSV data

In [1]:
import pandas as pd

sleep = pd.read_csv("Sleep_health_and_lifestyle_dataset.csv")
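
Depending on the pandas version, `read_csv` may parse the literal string `None` in the Sleep Disorder column as a missing value (as far as I know, pandas 2.0 added `'None'` to the default NA strings). The sketch below uses a tiny in-memory stand-in for the real file (the rows are hypothetical) and restores the label with `fillna`, which is a no-op on versions that keep the string.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (hypothetical rows, illustration only)
sample_csv = io.StringIO(
    "Person ID,Gender,Sleep Disorder\n"
    "1,Male,None\n"
    "2,Female,Insomnia\n"
    "3,Male,Sleep Apnea\n"
)

demo = pd.read_csv(sample_csv)

# If "None" was read as NaN, put the label back; otherwise nothing changes
demo["Sleep Disorder"] = demo["Sleep Disorder"].fillna("None")

print(demo["Sleep Disorder"].tolist())
```

Without a step like this, the later label encoding of Sleep Disorder could see missing values instead of a `None` class.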

#### Convert Gender to numeric

In [2]:
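# drop_first=True drops the first category, leaving a single indicator column;
# dtype=int keeps it as 0/1 (newer pandas versions default to bool dummies)
sleep['Gender'] = pd.get_dummies(sleep['Gender'], drop_first=True, dtype=int).iloc[:, 0]

# Gender --> 1: male,  0: female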

#### Drop the unused Person ID and Occupation columns

In [3]:
sleep.drop(['Person ID','Occupation'], axis=1, inplace=True)       

#### Keep only the systolic value from Blood Pressure

In [4]:
# Blood Pressure is stored as "systolic/diastolic"; keep the part before the slash
sleep['Blood Pressure'] = sleep['Blood Pressure'].str.split('/').str[0]

In [5]:
sleep['Blood Pressure'] = sleep['Blood Pressure'].astype('int')
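
The notebook keeps only the systolic value. As a side sketch, both readings could be kept as separate integer columns with `str.split(..., expand=True)`. The sample values and the column names `Systolic` and `Diastolic` are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical readings in the same "systolic/diastolic" format
bp = pd.Series(["126/83", "125/80", "140/90"], name="Blood Pressure")

# expand=True returns one column per piece; cast both to int
parts = bp.str.split("/", expand=True).astype(int)
parts.columns = ["Systolic", "Diastolic"]

print(parts)
```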

#### Convert BMI Category values to numbers

In [6]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(sleep['BMI Category'])

sleep['BMI Category'] = encoder.transform(sleep['BMI Category'])

print(encoder.classes_)
# Normal: 0,  Normal Weight: 1,  Obese: 2,  Overweight: 3

['Normal' 'Normal Weight' 'Obese' 'Overweight']
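
`LabelEncoder` assigns codes in alphabetical order, so Obese (2) ends up between Normal Weight (1) and Overweight (3), and Normal and Normal Weight get different codes. A possible alternative is an explicit mapping ordered by severity. The sketch below assumes that Normal and Normal Weight mean the same thing, which the data does not confirm, and using it would change the importances reported in this notebook.

```python
import pandas as pd

# Hypothetical sample of BMI Category labels
bmi = pd.Series(["Overweight", "Normal", "Normal Weight", "Obese"])

# Assumed ordering by severity; merging the two "normal" labels is an assumption
bmi_order = {"Normal": 0, "Normal Weight": 0, "Overweight": 1, "Obese": 2}

bmi_encoded = bmi.map(bmi_order)
print(bmi_encoded.tolist())
```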


#### Convert Sleep Disorder values to numbers (label encoding)

In [7]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(sleep['Sleep Disorder'])

sleep['Sleep Disorder'] = encoder.transform(sleep['Sleep Disorder'])

print(encoder.classes_)               
# Insomnia : 0,  None : 1,  Sleep Apnea : 2

['Insomnia' 'None' 'Sleep Apnea']


In [8]:
sleep.head(3)

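| | Gender | Age | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 27 | 6.1 | 6 | 42 | 6 | 3 | 126 | 77 | 4200 | 1 |
| 1 | 1 | 28 | 6.2 | 6 | 60 | 8 | 0 | 125 | 75 | 10000 | 1 |
| 2 | 1 | 28 | 6.2 | 6 | 60 | 8 | 0 | 125 | 75 | 10000 | 1 |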


<hr>

#### Split into training and test data (ratio 7 : 3)

In [9]:
from sklearn.model_selection import train_test_split

X = sleep[['Gender','Age', 'Sleep Duration', 'Quality of Sleep', 'Physical Activity Level', 
           'Stress Level','BMI Category','Blood Pressure', 'Heart Rate', 'Daily Steps']]
y = sleep['Sleep Disorder'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)                
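
If the three Sleep Disorder classes are not equally common, a plain random split can leave the test set with different class proportions from the training set. A minimal sketch of a stratified split, using synthetic labels (the 20/60/20 class counts are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, three classes with made-up counts 20/60/20
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 20 + [1] * 60 + [2] * 20)

# stratify=y_demo keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(np.bincount(y_tr), np.bincount(y_te))
```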

## Check feature importance in descending order

In [None]:
# feature_importance['importance'].sort_values(ascending=False)

#### Feature importance from the decision tree

In [10]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train,y_train)

feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': dt.feature_importances_
})

feature_importance

| | feature | importance |
|---|---|---|
| 0 | Gender | 0.017454 |
| 1 | Age | 0.070177 |
| 2 | Sleep Duration | 0.029547 |
| 3 | Quality of Sleep | 0.002773 |
| 4 | Physical Activity Level | 0.267109 |
| 5 | Stress Level | 0.003422 |
| 6 | BMI Category | 0.485891 |
| 7 | Blood Pressure | 0.083184 |
| 8 | Heart Rate | 0.033624 |
| 9 | Daily Steps | 0.006818 |


- Top 3 by importance: BMI Category > Physical Activity Level > Blood Pressure
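
The top three can be read off by sorting the importance table in descending order with `sort_values`. The sketch below re-enters the decision tree values reported above by hand rather than refitting the model.

```python
import pandas as pd

# Decision tree importances copied from the output above
dt_importance = pd.DataFrame({
    "feature": ["Gender", "Age", "Sleep Duration", "Quality of Sleep",
                "Physical Activity Level", "Stress Level", "BMI Category",
                "Blood Pressure", "Heart Rate", "Daily Steps"],
    "importance": [0.017454, 0.070177, 0.029547, 0.002773, 0.267109,
                   0.003422, 0.485891, 0.083184, 0.033624, 0.006818],
})

# Largest importance first; reset the index so the rank is easy to read
ranked = dt_importance.sort_values("importance", ascending=False).reset_index(drop=True)

print(ranked.head(3))
```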

#### Feature importance from the random forest

In [11]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
# Note: fitted on the full dataset (X, y), whereas the decision tree above was fitted on X_train
rf.fit(X, y)

feature_importance = pd.DataFrame({
    'category': X.columns,
    'importance': rf.feature_importances_
})

feature_importance

| | category | importance |
|---|---|---|
| 0 | Gender | 0.011781 |
| 1 | Age | 0.109395 |
| 2 | Sleep Duration | 0.138135 |
| 3 | Quality of Sleep | 0.026671 |
| 4 | Physical Activity Level | 0.073708 |
| 5 | Stress Level | 0.027092 |
| 6 | BMI Category | 0.262011 |
| 7 | Blood Pressure | 0.217304 |
| 8 | Heart Rate | 0.05694 |
| 9 | Daily Steps | 0.076963 |


- Top 3 by importance: BMI Category > Blood Pressure > Sleep Duration
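
`feature_importances_` is an impurity-based measure computed from the data the model was fitted on. One way to cross-check it is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. This is a sketch on synthetic data from `make_classification`, not a rerun on the sleep dataset, so the numbers it prints say nothing about the real features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

rf_demo = RandomForestClassifier(random_state=42)
rf_demo.fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the drop in accuracy
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)

print(result.importances_mean)
```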

#### Feature importance from extra trees

In [12]:
from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier(random_state=42)
# Note: fitted on the full dataset (X, y), not on X_train
et.fit(X, y)

feature_importance = pd.DataFrame({
    'category': X.columns,
    'importance': et.feature_importances_
})

feature_importance

| | category | importance |
|---|---|---|
| 0 | Gender | 0.038934 |
| 1 | Age | 0.111126 |
| 2 | Sleep Duration | 0.09415 |
| 3 | Quality of Sleep | 0.054857 |
| 4 | Physical Activity Level | 0.090121 |
| 5 | Stress Level | 0.0483 |
| 6 | BMI Category | 0.279766 |
| 7 | Blood Pressure | 0.176161 |
| 8 | Heart Rate | 0.043819 |
| 9 | Daily Steps | 0.062765 |


- Top 3 by importance: BMI Category > Blood Pressure > Age

#### Feature importance from gradient boosting

In [13]:
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(random_state=42)
# Note: fitted on the full dataset (X, y), not on X_train
gb.fit(X, y)

feature_importance = pd.DataFrame({
    'category': X.columns,
    'importance': gb.feature_importances_
})

feature_importance

| | category | importance |
|---|---|---|
| 0 | Gender | 0.004096 |
| 1 | Age | 0.027909 |
| 2 | Sleep Duration | 0.018366 |
| 3 | Quality of Sleep | 0.012814 |
| 4 | Physical Activity Level | 0.105718 |
| 5 | Stress Level | 0.002147 |
| 6 | BMI Category | 0.323861 |
| 7 | Blood Pressure | 0.437363 |
| 8 | Heart Rate | 0.040495 |
| 9 | Daily Steps | 0.027231 |


- Top 3 by importance: Blood Pressure > BMI Category > Physical Activity Level
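
The four cells above repeat the same fit-and-tabulate steps, so they could be folded into one loop that puts every model's importances side by side. A minimal sketch on synthetic data (the feature names `f0` to `f5` are placeholders, and all models are fitted on the same data so the columns are comparable):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the sleep features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_classes=3, random_state=42
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "extra_trees": ExtraTreesClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# One column per model, one row per feature
importance_table = pd.DataFrame(
    {name: model.fit(X_demo, y_demo).feature_importances_
     for name, model in models.items()},
    index=feature_names,
)

print(importance_table.round(3))
```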

<hr>

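Comparing the feature importances from the decision tree, random forest, extra trees, and gradient boosting: <br>
the models did not all give the same ranking, but BMI Category and Blood Pressure appear in the top three for all four models, <br>
and Physical Activity Level appears in the top three for two of them (decision tree and gradient boosting).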
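
To make the overlap concrete, the top-three lists noted under each model can be tallied with a `Counter`. The lists below are copied from the bullet points above.

```python
from collections import Counter

# Top-3 features per model, copied from the results above
top3_by_model = {
    "Decision tree": ["BMI Category", "Physical Activity Level", "Blood Pressure"],
    "Random forest": ["BMI Category", "Blood Pressure", "Sleep Duration"],
    "Extra trees": ["BMI Category", "Blood Pressure", "Age"],
    "Gradient boosting": ["Blood Pressure", "BMI Category", "Physical Activity Level"],
}

# Count in how many models each feature reaches the top three
counts = Counter(f for top3 in top3_by_model.values() for f in top3)

for feature, n in counts.most_common():
    print(f"{feature}: {n} of {len(top3_by_model)} models")
```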