## Dummy variables
###### A standard way of handling nominal (categorical) variables
###### Requires the patsy package to be installed

In [15]:
from pandas import read_excel, DataFrame, get_dummies
from patsy import dmatrix
import numpy as np

In [16]:
df = read_excel("https://data.hossam.kr/C02/dum.xlsx")
df

Unnamed: 0,성별,비만도
0,남자,정상
1,여자,경도
2,여자,정상
3,남자,고도
4,남자,정상
5,남자,경도
6,남자,고도
7,여자,고도
8,여자,경도
9,남자,고도


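Labeling nominal variables
###### Always append +0 to the formula string that names the column; this removes the intercept, so every level gets its own column.
###### The +0 columns always sum to 1 across a variable's levels, which makes them collinear with an intercept term, so do not use them as-is in a regression.
###### Use +0 to inspect the levels,
###### then remove the +0 and run the regression on the original formula.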
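As a rough illustration of the +0 note above, the sketch below compares the column names patsy produces with and without +0. The `toy` DataFrame and its ASCII column name `sex` are hypothetical stand-ins, not the dataset loaded earlier.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical stand-in data with ASCII names (not the dataset loaded above)
toy = pd.DataFrame({"sex": ["M", "F", "F", "M"]})

# With "+ 0" the intercept is removed and every level gets its own column
full = dmatrix("sex + 0", toy)
print(full.design_info.column_names)

# Without "+ 0" patsy adds an intercept and drops the first level
reduced = dmatrix("sex", toy)
print(reduced.design_info.column_names)
```

The first matrix is convenient for checking which levels exist; the second is the form that already avoids redundancy with the intercept.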

In [17]:
dmatrix('성별 + 0', df)

DesignMatrix with shape (20, 2)
  성별[남자]  성별[여자]
       1       0
       0       1
       0       1
       1       0
       1       0
       1       0
       1       0
       0       1
       0       1
       1       0
       0       1
       0       1
       0       1
       1       0
       0       1
       0       1
       1       0
       0       1
       0       1
       0       1
  Terms:
    '성별' (columns 0:2)

In [18]:
dmatrix('비만도 + 0', df)

DesignMatrix with shape (20, 3)
  비만도[경도]  비만도[고도]  비만도[정상]
        0        0        1
        1        0        0
        0        0        1
        0        1        0
        0        0        1
        1        0        0
        0        1        0
        0        1        0
        1        0        0
        0        1        0
        0        0        1
        0        0        1
        0        0        1
        0        0        1
        0        1        0
        1        0        0
        0        1        0
        1        0        0
        0        0        1
        1        0        0
  Terms:
    '비만도' (columns 0:3)

In [19]:
dm = dmatrix('성별:비만도 + 0', df)
dm

DesignMatrix with shape (20, 6)
  Columns:
    ['성별[남자]:비만도[경도]',
     '성별[여자]:비만도[경도]',
     '성별[남자]:비만도[고도]',
     '성별[여자]:비만도[고도]',
     '성별[남자]:비만도[정상]',
     '성별[여자]:비만도[정상]']
  Terms:
    '성별:비만도' (columns 0:6)
  (to view full data, use np.asarray(this_obj))

In [20]:
dm.design_info.column_names

['성별[남자]:비만도[경도]',
 '성별[여자]:비만도[경도]',
 '성별[남자]:비만도[고도]',
 '성별[여자]:비만도[고도]',
 '성별[남자]:비만도[정상]',
 '성별[여자]:비만도[정상]']

In [21]:
dmarray = np.asarray(dm)
dmarray

array([[0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0.]])

In [22]:
dummy_df = DataFrame(dmarray, columns=dm.design_info.column_names)
dummy_df

Unnamed: 0,성별[남자]:비만도[경도],성별[여자]:비만도[경도],성별[남자]:비만도[고도],성별[여자]:비만도[고도],성별[남자]:비만도[정상],성별[여자]:비만도[정상]
0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,1.0,0.0,0.0,0.0
7,0.0,0.0,0.0,1.0,0.0,0.0
8,0.0,1.0,0.0,0.0,0.0,0.0
9,0.0,0.0,1.0,0.0,0.0,0.0


When the source data is already label-encoded

In [23]:
df2 = df.copy()
df2.head()

Unnamed: 0,성별,비만도
0,남자,정상
1,여자,경도
2,여자,정상
3,남자,고도
4,남자,정상


In [24]:
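# Replace the values wherever they occur, regardless of column
df2.replace("남자", 0, inplace=True)
df2.replace("여자", 1, inplace=True)

# Replace only within one column, using the {column: value} form.
# Note: the key here is 성별, which holds none of these values, so nothing
# matches and 비만도 is left unchanged (see the output below).
df2.replace({"성별": "정상"}, 0, inplace=True)
df2.replace({"성별": "경도"}, 1, inplace=True)
df2.replace({"성별": "고도"}, 2, inplace=True)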
df2.head()

Unnamed: 0,성별,비만도
0,0,정상
1,1,경도
2,1,정상
3,0,고도
4,0,정상


In [25]:
df2.dtypes

성별      int64
비만도    object
dtype: object
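The dtypes above show that only one column ended up as integers. If the intent is to label-encode a specific column, one possible approach is `Series.map` with an explicit dictionary, sketched here on hypothetical stand-in data (the `toy` frame, its column names, and the code assignments are assumptions for illustration, and this is an alternative to `replace`, not the notebook's own method).

```python
import pandas as pd

# Hypothetical stand-in data with ASCII names (not the dataset loaded above)
toy = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "level": ["normal", "mild", "normal", "severe"],
})

# Explicit value-to-code mappings, one per column
sex_codes = {"M": 0, "F": 1}
level_codes = {"normal": 0, "mild": 1, "severe": 2}

labeled = toy.copy()
# map() only touches the column it is called on
labeled["sex"] = labeled["sex"].map(sex_codes)
labeled["level"] = labeled["level"].map(level_codes)
print(labeled.dtypes)
```

Any value missing from the dictionary becomes NaN under `map`, which makes an incomplete mapping easy to spot.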

##### Converting label-encoded data to dummy variables
###### Wrap the column in C(...) in the formula to mark it as categorical

In [26]:
dm = dmatrix('C(성별):C(비만도) + 0', df2)
dummy_df = DataFrame(np.asarray(dm), columns=dm.design_info.column_names)
dummy_df

Unnamed: 0,C(성별)[0]:C(비만도)[경도],C(성별)[1]:C(비만도)[경도],C(성별)[0]:C(비만도)[고도],C(성별)[1]:C(비만도)[고도],C(성별)[0]:C(비만도)[정상],C(성별)[1]:C(비만도)[정상]
0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,1.0,0.0,0.0,0.0
7,0.0,0.0,0.0,1.0,0.0,0.0
8,0.0,1.0,0.0,0.0,0.0,0.0
9,0.0,0.0,1.0,0.0,0.0,0.0
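To see why C(...) matters for integer-coded columns, the sketch below builds the same formula with and without it. The `toy` DataFrame and its column name `sex` are hypothetical; without C(...), patsy treats an integer column as a plain numeric variable and produces a single column.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical integer-coded column (0 and 1 stand for two categories)
toy = pd.DataFrame({"sex": [0, 1, 1, 0]})

# Without C(): the integers are used as one numeric column
as_number = dmatrix("sex + 0", toy)
print(as_number.design_info.column_names)

# With C(): each distinct code becomes its own dummy column
as_category = dmatrix("C(sex) + 0", toy)
print(as_category.design_info.column_names)
```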


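#### #05. Using pandas functions
##### Convert every field to dummy variables (N columns per variable)
###### Pass the source DataFrame and the names of the columns to convert as parameters.
###### In the returned DataFrame, the original columns are removed and the dummy columns are added in their place.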

In [27]:
dummy_df = get_dummies(df, columns=['성별', '비만도'])
dummy_df.head()

Unnamed: 0,성별_남자,성별_여자,비만도_경도,비만도_고도,비만도_정상
0,True,False,False,False,True
1,False,True,True,False,False
2,False,True,False,False,True
3,True,False,False,True,False
4,True,False,False,False,True


##### Create N-1 dummy variables
###### Set the drop_first parameter to True.

In [28]:
dummy_df = get_dummies(df, columns=['성별', '비만도'], drop_first=True)
dummy_df.head()

Unnamed: 0,성별_여자,비만도_고도,비만도_정상
0,False,False,True
1,True,False,False
2,True,False,True
3,False,True,False
4,False,False,True
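One way to check why N-1 columns are enough is to compare matrix ranks once an intercept column is added. This is a minimal sketch on hypothetical stand-in data (the `toy` frame and its column names are assumptions), using `numpy.linalg.matrix_rank`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with ASCII names (not the dataset loaded above)
toy = pd.DataFrame({
    "sex": ["M", "F", "F", "M", "M", "F"],
    "level": ["normal", "mild", "normal", "severe", "normal", "mild"],
})

def with_intercept(dummies: pd.DataFrame) -> np.ndarray:
    """Prepend a column of ones to the dummy columns."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy(dtype=float)])

# N dummies per variable plus an intercept: columns are linearly dependent
X_full = with_intercept(pd.get_dummies(toy, columns=["sex", "level"]))
rank_full = np.linalg.matrix_rank(X_full)

# N-1 dummies per variable plus an intercept: no redundant columns
X_reduced = with_intercept(
    pd.get_dummies(toy, columns=["sex", "level"], drop_first=True)
)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # number of columns vs. rank
print(X_reduced.shape[1], rank_reduced)  # number of columns vs. rank
```

When the rank is smaller than the number of columns, the regression coefficients are not uniquely determined, which is the practical reason for dropping one level per variable.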


Specifying the output data type

In [29]:
dummy_df = get_dummies(df, columns=['성별', '비만도'], drop_first=True, dtype='int')
dummy_df.head()

Unnamed: 0,성별_여자,비만도_고도,비만도_정상
0,0,0,1
1,1,0,0
2,1,0,1
3,0,1,0
4,0,0,1
