# Data Representation
- Features are either categorical features or continuous features.
- Feature engineering is the task of finding the data representation best suited to a particular application.
- Model performance depends on how the data is represented.

1. Categorical features


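- A categorical feature is a feature whose values are not continuous (it is discrete).
- Categories are usually written as strings, but a feature stored as numbers can also be categorical.
- Categorical features are most often represented with one-hot encoding.
- One-hot encoding is also called one-out-of-N encoding or dummy variables.
- This scheme treats each category as a separate feature whose value is 0 or 1; for each sample, exactly one of these features is 1.
- Variables encoded as 0 and 1 can be used as numeric input by practically any model, which makes this encoding a convenient default.
- The pandas get_dummies function makes it easy to encode data this way.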
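To make the definition above concrete, here is a minimal sketch that builds a one-hot encoding by hand, one 0/1 column per category. The `colors` series and its values are hypothetical illustration data, not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical categorical feature with three categories
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One new 0/1 column per category: 1 where the sample belongs to that category
categories = sorted(colors.unique())
one_hot = pd.DataFrame(
    {f"color_{c}": (colors == c).astype(int) for c in categories}
)

print(one_hot)
# Each row contains exactly one 1, because each sample has exactly one category
print(one_hot.sum(axis=1).tolist())
```

In practice get_dummies does the same work in a single call; the manual version is only meant to show what the resulting columns mean.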


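- Categorical features may contain typos, or the same value may be written in different ways, so checking for this before processing the data is essential.
- The value_counts method of a Series is a good way to check (each column of a DataFrame is a Series).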
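The check described above might look like the following sketch. The sample values are made up to show typical problems (stray whitespace, inconsistent capitalization, a typo), and the cleanup rules are assumptions that would need to be adapted to the real data.

```python
import pandas as pd

# Hypothetical raw column with inconsistent spellings of two categories
raw = pd.Series([" Male", "male", " Female", "Male ", "femle"])

# value_counts exposes every distinct spelling and how often it occurs
raw_counts = raw.value_counts()
print(raw_counts)

# Normalize whitespace and case, then map the known typo to its intended value
cleaned = raw.str.strip().str.lower().replace({"femle": "female"})
counts = cleaned.value_counts()
print(counts)
```

Note that the values in adult.data carry a leading space (for example ` Male`), which is why the dummy column names later in this notebook look like `14_ >50K`.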


In [1]:
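import pandas as pd
# Column names: '숫자특성' = numeric feature, '범주형 특성' = categorical feature
df = pd.DataFrame({'숫자특성':[0,1,2,1], '범주형 특성':['a','b','c','b']})
display(df)
# By default only the string column is encoded; the integer column is left unchanged
display(pd.get_dummies(df))
# Listing the columns explicitly encodes the integer column as well
display(pd.get_dummies(df,columns=['숫자특성', '범주형 특성']))
# Converting the integer column to str has the same effect
df['숫자특성'] = df['숫자특성'].astype(str)
display(pd.get_dummies(df))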

In [7]:
import os
import mglearn
data = pd.read_csv(os.path.join(mglearn.datasets.DATA_PATH, "adult.data"), header=None, index_col=False)

In [11]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [12]:
data[9].value_counts()

 Male      21790
 Female    10771
Name: 9, dtype: int64

In [13]:
data_dummies = pd.get_dummies(data)

In [14]:
data_dummies

Unnamed: 0,0,2,4,10,11,12,1_ ?,1_ Federal-gov,1_ Local-gov,1_ Never-worked,...,13_ Scotland,13_ South,13_ Taiwan,13_ Thailand,13_ Trinadad&Tobago,13_ United-States,13_ Vietnam,13_ Yugoslavia,14_ <=50K,14_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
3,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
4,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5,37,284582,14,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
6,49,160187,5,0,0,16,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
7,52,209642,9,0,0,45,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
8,31,45781,14,14084,0,50,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
9,42,159449,13,5178,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1


In [17]:
features = data_dummies.iloc[:,0:-2]

In [22]:
X = features.values
y = data_dummies['14_ >50K'].values

In [23]:
X.shape

(32561, 108)

In [24]:
features.shape

(32561, 108)

In [25]:
y.shape

(32561,)

In [26]:
data_dummies['14_ >50K'].shape

(32561,)
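The positional slice `iloc[:,0:-2]` above works because the two income dummy columns happen to be last. A name-based selection is less dependent on column order; the sketch below shows one way to do it on a tiny hypothetical frame (the `toy` data and the `income_` prefix are assumptions for illustration, not the notebook's actual column names).

```python
import pandas as pd

# Hypothetical miniature of the adult data, including the leading spaces
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" State-gov", " Private", " Private", " Private"],
    "income": [" <=50K", " <=50K", " >50K", " >50K"],
})

# dtype=int keeps the 0/1 output regardless of the pandas version's default
toy_dummies = pd.get_dummies(toy, dtype=int)

# Drop every column derived from the target so it cannot leak into X
target_cols = [c for c in toy_dummies.columns if c.startswith("income_")]
features = toy_dummies.drop(columns=target_cols)

X = features.values
y = toy_dummies["income_ >50K"].values
print(X.shape, y.shape)
```

Leaving either target column in `X` would let a model read the answer directly, so it is worth checking `features.columns` before training.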

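- get_dummies treats every numeric column as continuous, so it does not create dummy variables for a categorical feature that is stored as numbers.
- To encode such a feature, either list it in the columns parameter of get_dummies or convert its type to str.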
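Besides the two options listed above, converting the column to the pandas `category` dtype should also work, since get_dummies encodes object, string, and category columns by default. The sketch below uses hypothetical English column names (`num_feature`, `cat_feature`) in place of the notebook's, and passes `dtype=int` so the output is 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

# Hypothetical frame mirroring the notebook's example with English column names
df_demo = pd.DataFrame({
    "num_feature": [0, 1, 2, 1],
    "cat_feature": ["a", "b", "c", "b"],
})

# Mark the integer column as categorical instead of converting it to str
df_demo["num_feature"] = df_demo["num_feature"].astype("category")

encoded = pd.get_dummies(df_demo, dtype=int)
print(encoded)
```

Compared with `astype(str)`, the `category` dtype keeps the original integer values available through `df_demo["num_feature"].cat.categories`.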

In [27]:
import pandas as pd
df = pd.DataFrame({'숫자특성':[0,1,2,1], '범주형 특성':['a','b','c','b']})
display(df)

Unnamed: 0,숫자특성,범주형 특성
0,0,a
1,1,b
2,2,c
3,1,b


In [28]:
display(pd.get_dummies(df))

Unnamed: 0,숫자특성,범주형 특성_a,범주형 특성_b,범주형 특성_c
0,0,1,0,0
1,1,0,1,0
2,2,0,0,1
3,1,0,1,0


In [29]:
display(pd.get_dummies(df,columns=['숫자특성', '범주형 특성']))

Unnamed: 0,숫자특성_0,숫자특성_1,숫자특성_2,범주형 특성_a,범주형 특성_b,범주형 특성_c
0,1,0,0,1,0,0
1,0,1,0,0,1,0
2,0,0,1,0,0,1
3,0,1,0,0,1,0


In [30]:
df['숫자특성'] = df['숫자특성'].astype(str)
display(pd.get_dummies(df))

Unnamed: 0,숫자특성_0,숫자특성_1,숫자특성_2,범주형 특성_a,범주형 특성_b,범주형 특성_c
0,1,0,0,1,0,0
1,0,1,0,0,1,0
2,0,0,1,0,0,1
3,0,1,0,0,1,0


- Encoding categorical data with sklearn
    - OneHotEncoder

In [31]:
from sklearn.preprocessing import OneHotEncoder
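ohe = OneHotEncoder(sparse=False)
# With sparse=False, OneHotEncoder returns a NumPy array instead of a sparse matrix
# (scikit-learn 1.2 renamed this parameter to sparse_output; sparse was removed in 1.4)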

In [32]:
ohe2 = OneHotEncoder()

In [34]:
print(ohe.fit_transform(df))

[[1. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]
 [0. 1. 0. 0. 1. 0.]]


In [37]:
print(ohe2.fit_transform(df))

  (0, 0)	1.0
  (0, 3)	1.0
  (1, 1)	1.0
  (1, 4)	1.0
  (2, 2)	1.0
  (2, 5)	1.0
  (3, 1)	1.0
  (3, 4)	1.0


In [39]:
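# OneHotEncoder returns an array, not a DataFrame, so the columns have no names. Use get_feature_names to recover them
# (deprecated in scikit-learn 1.0 and removed in 1.2 in favor of get_feature_names_out)
print(ohe.get_feature_names())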

['x0_0' 'x0_1' 'x0_2' 'x1_a' 'x1_b' 'x1_c']


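- OneHotEncoder treats every input column as categorical, so it can be applied on its own only when all features are categorical.
- When continuous and categorical features are mixed, use ColumnTransformer.
- ColumnTransformer can apply a different transformation to each column.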

In [40]:
data = pd.read_csv(os.path.join(mglearn.datasets.DATA_PATH, "adult.data"), header=None, index_col=False,
                  names=['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship',
                        'race','gender','capital-gain','capital-loss','hours-per-week','native-country','income'])

In [43]:
data = data[['age','hours-per-week','workclass','education','gender','occupation','income']]

In [44]:
# age and hours-per-week are continuous -> StandardScaler
# the remaining columns are categorical -> OneHotEncoder

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([("scaling",StandardScaler(), ['age','hours-per-week']),
                       ("onehot",OneHotEncoder(sparse=False),['workclass','education','gender','occupation'])])

In [45]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data_features = data.drop("income",axis=1)

In [46]:
X_train, X_test, y_train, y_test = train_test_split(data_features, data.income, random_state=0)

In [47]:
ct.fit(X_train)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('scaling',
                                 StandardScaler(copy=True, with_mean=True,
                                                with_std=True),
                                 ['age', 'hours-per-week']),
                                ('onehot',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               n_values=None, sparse=False),
                                 ['workclass', 'education', 'gender',
                                  'occupation'])],
                  verbose=False)

In [49]:
X_train_trans = ct.transform(X_train)
print(X_train_trans.shape)

(24420, 44)


In [50]:
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_trans,y_train)
X_test_trans = ct.transform(X_test)
print("test score:{:.2f}".format(logreg.score(X_test_trans,y_test)))



test score:0.81
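The separate fit and transform calls above can also be bundled into a single estimator. The following is a minimal sketch using `make_pipeline`, run on randomly generated toy data because adult.data is not assumed to be available; the data, the rule that generates `income`, and the use of `handle_unknown="ignore"` are all assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with two continuous and two categorical features
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    "age": rng.randint(18, 70, size=n),
    "hours-per-week": rng.randint(10, 60, size=n),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], size=n),
    "gender": rng.choice(["Male", "Female"], size=n),
})
# Made-up target: longer working hours (plus noise) mean the higher income class
noisy_hours = toy["hours-per-week"] + rng.normal(0, 5, size=n)
toy["income"] = np.where(noisy_hours > 35, ">50K", "<=50K")

ct = ColumnTransformer([
    ("scaling", StandardScaler(), ["age", "hours-per-week"]),
    # ignore categories at predict time that were not seen during fit
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["workclass", "gender"]),
])

# The pipeline fits the transformer on the training data only, then the model
pipe = make_pipeline(ct, LogisticRegression(max_iter=1000))

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop("income", axis=1), toy["income"], random_state=0
)
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print("test score:{:.2f}".format(score))
```

With a pipeline, `pipe.score(X_test, y_test)` applies the already fitted transformer to the test set automatically, so there is no separate `ct.transform(X_test)` step to forget.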
