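# Data Representation and Feature Engineering
* Feature engineering
    * Finding the data representation best suited to a particular application.
    * The right data representation can have a greater effect on the performance of a supervised model than choosing appropriate parameters.
## Categorical Variables
### One-Hot Encoding
* Also known as one-out-of-N encoding
* Also known as dummy variables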
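
To make the idea concrete, here is a minimal sketch of one-hot encoding on a tiny made-up column. The `colors` Series and its values are illustrative assumptions and are not part of the adult dataset used below.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
colors = pd.Series(['red', 'green', 'red', 'blue'], name='color')

# One new 0/1 column per category; exactly one column is 1 in each row.
# dtype=int keeps the output as 0/1 regardless of the pandas version.
one_hot = pd.get_dummies(colors, prefix='color', dtype=int)
print(one_hot)
```

Each row has a single 1 in the column that matches its original category, which is where the name "one-hot" comes from.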

In [1]:
import os

import pandas as pd
import mglearn

data = pd.read_csv(
    os.path.join(mglearn.datasets.DATA_PATH, 'adult.data'),
    header=None,
    index_col=False,
    names=[
        'age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'gender',
        'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
        'income'
    ]
)
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']]
data.head()

| | age | workclass | education | gender | hours-per-week | occupation | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |


#### Checking String-Encoded Categorical Data

In [2]:
print(data['gender'].value_counts())

 Male      21790
 Female    10771
Name: gender, dtype: int64


In [3]:
print('original feature:\n{}'.format(list(data.columns)))
print()
data_dummies = pd.get_dummies(data)
print('get_dummies feature:\n{}'.format(data_dummies.columns))

original feature:
['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']

get_dummies feature:
Index(['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov',
       'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private',
       'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc',
       'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th',
       'education_ 11th', 'education_ 12th', 'education_ 1st-4th',
       'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th',
       'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors',
       'education_ Doctorate', 'education_ HS-grad', 'education_ Masters',
       'education_ Preschool', 'education_ Prof-school',
       'education_ Some-college', 'gender_ Female', 'gender_ Male',
       'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces',
       'occupation_ Craft-repair', 'occupation_ Exec-managerial',
       'occupatio

In [4]:
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
X = features.values
y = data_dummies['income_ >50K'].values
print('X.shape: {} y.shape: {}'.format(X.shape, y.shape))

X.shape: (32561, 44) y.shape: (32561,)


In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print('test score: {:.3f}'.format(logreg.score(X_test, y_test)))

test score: 0.809


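* Either call get_dummies on a single DataFrame that contains both the training and the test data points, or, if you call get_dummies on each set separately, compare the column names of the training and test sets afterwards to confirm that they represent the same features.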
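
The sketch below shows how the mismatch can arise and one possible way to repair it. The toy frames are made up, and aligning with `reindex` is an assumption of this example rather than the approach the notes prescribe.

```python
import pandas as pd

# Hypothetical toy frames; 'fox' appears only in the test data.
train = pd.DataFrame({'category': ['socks', 'box', 'socks']})
test = pd.DataFrame({'category': ['fox', 'box']})

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# Encoded separately, the two frames end up with different columns.
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# Reindex the test frame to the training columns: columns missing from
# the test frame are filled with 0, and columns never seen during
# training are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

After alignment, both frames have the same columns in the same order, so a model trained on `train_dummies` sees a consistent layout at prediction time.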

### Categorical Features Encoded as Numbers
* Categorical features are often encoded as numbers.

In [6]:
demo_df = pd.DataFrame({'number': [0, 1, 2, 1], 'category': ['socks', 'fox', 'socks', 'box']})
demo_df

| | number | category |
|---|---|---|
| 0 | 0 | socks |
| 1 | 1 | fox |
| 2 | 2 | socks |
| 3 | 1 | box |


In [7]:
pd.get_dummies(demo_df)

| | number | category_box | category_fox | category_socks |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 |
| 2 | 2 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 0 |


* Only the string feature is encoded; the numeric feature is left unchanged.

In [8]:
demo_df['number'] = demo_df['number'].astype(str)
pd.get_dummies(demo_df, columns=['number', 'category'])

| | number_0 | number_1 | number_2 | category_box | category_fox | category_socks |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 1 | 0 | 0 |


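* Once the numbers are converted to strings, get_dummies creates dummy variables for them even if the columns parameter is not specified.
* Even a numeric feature is turned into dummy variables if it is listed in the columns parameter.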
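
The second point can be checked without the `astype(str)` conversion. The sketch below rebuilds the same toy frame under a new, illustrative name (`demo`) so that the `number` column is still an integer column.

```python
import pandas as pd

demo = pd.DataFrame({'number': [0, 1, 2, 1],
                     'category': ['socks', 'fox', 'socks', 'box']})

# 'number' stays an integer column; naming it in `columns` is enough
# for get_dummies to treat it as categorical.
encoded = pd.get_dummies(demo, columns=['number', 'category'], dtype=int)
print(list(encoded.columns))
```

The resulting column names match the ones produced above after converting `number` to strings.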

## OneHotEncoder and ColumnTransformer: Handling Categorical Variables with scikit-learn

In [9]:
from sklearn.preprocessing import OneHotEncoder
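# sparse=False returns a dense NumPy array instead of a sparse matrix.
# Note: scikit-learn 1.2+ names this parameter sparse_output (sparse was removed in 1.4).
ohe = OneHotEncoder(sparse=False)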
print(ohe.fit_transform(demo_df))

[[1. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0.]]


In [12]:
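# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
print(ohe.get_feature_names())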

['x0_0' 'x0_1' 'x0_2' 'x1_box' 'x1_fox' 'x1_socks']


In [13]:
data.head()

| | age | workclass | education | gender | hours-per-week | occupation | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |


In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer(
    [('scaling', StandardScaler(), ['age', 'hours-per-week']),
    ('onehot', OneHotEncoder(sparse=False), ['workclass', 'education', 'gender', 'occupation'])]
)

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data_features = data.drop('income', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    data_features, data.income, random_state=0
)

ct.fit(X_train)
X_train_trans = ct.transform(X_train)
print(X_train_trans.shape)

(24420, 44)


In [17]:
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_trans, y_train)

X_test_trans = ct.transform(X_test)
print('test score: {:.3f}'.format(logreg.score(X_test_trans, y_test)))

test score: 0.809


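* Encapsulating all preprocessing steps in a single transformer has advantages: for example, the transformer is fit once on the training data and then applies exactly the same preprocessing to the test data.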
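
One way to take advantage of this is to chain the transformer and the classifier in a single `Pipeline`. This is a minimal sketch under stated assumptions: the `toy` frame and `target` labels are made up, and the use of `Pipeline` and `handle_unknown='ignore'` is a choice of this example, not something the notes above rely on.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the adult dataset.
toy = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29, 60, 44],
    'workclass': ['Private', 'Private', 'State-gov', 'Self-emp',
                  'Private', 'State-gov', 'Self-emp', 'Private'],
})
target = [0, 0, 1, 1, 0, 0, 1, 1]

preprocess = ColumnTransformer([
    ('scaling', StandardScaler(), ['age']),
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros.
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['workclass']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling statistics and the categories from the training rows only.
model.fit(toy.iloc[:6], target[:6])

# predict() reuses the fitted preprocessing on the held-out rows.
predictions = model.predict(toy.iloc[6:])
print(predictions)
```

Because preprocessing and model live in one object, there is no separate `transform` call to forget when scoring new data.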

In [18]:
ct.named_transformers_.onehot

OneHotEncoder(sparse=False)

## Creating a ColumnTransformer Conveniently with make_column_transformer

In [20]:
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (StandardScaler(), ['age', 'hours-per-week']),
    (OneHotEncoder(sparse=False), ['workclass', 'education', 'gender', 'occupation'])
)

* One drawback of ColumnTransformer is that, as of version 0.20, it does not yet provide a way to find which input columns correspond to the transformed output columns.