<a id=0></a>
# 7. Categorical Features
Handling categorical features (variables)

---
### [1. LabelEncoder()](#1)
### [2. get_dummies()](#2)
### [3. OneHotEncoder()](#3)
### [4. Differences between pd.get_dummies() and OneHotEncoder()](#4)
### [5. Using the str accessor of a Series](#5)

---

This notebook uses sample1_without_index.csv as the dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('/Users/zinkoko/projects/ml/udemy/udemy_course/notebooks/sample1_without_index.csv')
df.head()

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Score,Difference,Color,Shape
0,1997-07-05,2291,25,2.94665,5.305868,45.8933,52.762659,0.276266,green,triangle
1,1997-07-06,506,16,1.915208,0.679004,50.611735,31.453719,-1.854628,blue,
2,1997-07-07,9629,32,7.869855,6.563335,43.830416,56.239011,0.623901,blue,square
3,1997-07-08,6161,67,6.375209,5.756029,41.358007,61.453113,1.145311,green,square
4,,8570,55,0.390629,3.578136,55.739709,,1.03719,red,square


In [3]:
df = df[['Color', 'Shape']]

In [4]:
df.head()

Unnamed: 0,Color,Shape
0,green,triangle
1,blue,
2,blue,square
3,green,square
4,red,square


In [5]:
df.isnull().sum()

Color    4
Shape    5
dtype: int64

In [6]:
df[df['Color'].isnull()].index

Index([19, 37, 40, 73], dtype='int64')

---
<a id=1></a>
[Back to top](#0)

---
## 1. LabelEncoder()  
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html  
Note: replaces each label with an integer (0, 1, 2, ...).

In [7]:
from sklearn.preprocessing import LabelEncoder

In [8]:
encoder = LabelEncoder()

In [9]:
encoder.fit(df['Color'])

In [11]:
encoder.classes_

array(['blue', 'green', 'red', nan], dtype=object)

In [10]:
encoder.transform(df['Color'])

array([1, 0, 0, 1, 2, 1, 0, 2, 2, 2, 1, 0, 1, 1, 2, 0, 0, 0, 0, 3, 1, 0,
       1, 0, 1, 1, 1, 2, 1, 1, 0, 0, 0, 2, 0, 1, 1, 3, 0, 0, 3, 2, 1, 2,
       0, 0, 2, 1, 0, 0, 0, 1, 2, 1, 2, 2, 2, 2, 1, 0, 2, 2, 1, 1, 2, 1,
       2, 1, 1, 2, 1, 0, 1, 3, 1, 0, 1, 1, 0, 1, 0, 0, 0, 2, 0, 0, 2, 0,
       1, 2, 2, 2, 0, 1, 2, 0, 2, 0, 0, 2])

In [12]:
df_ce = df.copy()
df_ce['Color_encoded'] = encoder.transform(df['Color'])
df_ce.head()

Unnamed: 0,Color,Shape,Color_encoded
0,green,triangle,1
1,blue,,0
2,blue,square,0
3,green,square,1
4,red,square,2


In [13]:
df_ce.columns

Index(['Color', 'Shape', 'Color_encoded'], dtype='object')

In [14]:
cols = ['Color', 'Color_encoded', 'Shape']

In [16]:
df_ce = df_ce[cols]
df_ce.loc[36:41]

Unnamed: 0,Color,Color_encoded,Shape
36,green,1,square
37,,3,circle
38,blue,0,square
39,blue,0,square
40,,3,triangle
41,red,2,
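
The encoding above can be reproduced on a small scale. This is a minimal sketch that uses a hypothetical `colors` Series (not the course dataset) to show that `LabelEncoder` numbers the classes in sorted order and that `inverse_transform` recovers the original labels. Note that the scikit-learn documentation describes `LabelEncoder` as intended for target values (`y`); `OrdinalEncoder` is the documented counterpart for input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, chosen only for illustration
colors = pd.Series(['green', 'blue', 'blue', 'green', 'red'], name='Color')

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)       # fit and transform in one step

print(encoder.classes_)                     # classes are stored in sorted order
print(codes)                                # integer code for each row

# Map the integer codes back to the original labels
restored = encoder.inverse_transform(codes)
print(restored)
```

Because the codes follow alphabetical order, they carry no meaningful ranking; a model that treats them as ordered numbers may pick up an artificial relationship.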


---
<a id=2></a>
[Back to top](#0)

---
## 2. get_dummies()  
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html  
Note: converts a categorical variable into dummy variables (0 or 1).

* Create dummy variables
* What drop_first=True does
* What happens to np.nan
---

Create dummy variables

In [19]:
pd.get_dummies(df['Shape'], dtype=int).head()

Unnamed: 0,circle,square,triangle
0,0,0,1
1,0,0,0
2,0,1,0
3,0,1,0
4,0,1,0


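What drop_first=True does: it drops the dummy column of the first category (here `circle`), leaving k-1 columns for k categories.  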

In [24]:
df_cd = pd.get_dummies(df, columns=['Shape'], dtype=int, drop_first=True)
df_cd.head()

Unnamed: 0,Color,Shape_square,Shape_triangle
0,green,0,1
1,blue,0,0
2,blue,1,0
3,green,1,0
4,red,1,0


In [22]:
pd.get_dummies(df, columns=['Color', 'Shape'], dtype=int, drop_first=True)

Unnamed: 0,Color_green,Color_red,Shape_square,Shape_triangle
0,1,0,0,1
1,0,0,0,0
2,0,0,1,0
3,1,0,1,0
4,0,1,1,0
...,...,...,...,...
95,0,0,0,0
96,0,1,0,1
97,0,0,1,0
98,0,0,0,0
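
To make the effect of `drop_first=True` easier to see, here is a minimal sketch on a hypothetical four-row `shapes` Series (not the course dataset). It compares the columns produced with and without the option.

```python
import pandas as pd

# Hypothetical toy data, chosen only for illustration
shapes = pd.Series(['triangle', 'square', 'circle', 'square'], name='Shape')

# One column per category
full = pd.get_dummies(shapes, dtype=int)

# The first category in sorted order ('circle') is dropped
reduced = pd.get_dummies(shapes, dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In `reduced`, the `circle` row is represented by zeros in every remaining column, so no information is lost as long as there are no missing values.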


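What happens to np.nan

By default a missing value becomes 0 in every dummy column; `dummy_na=True` adds a separate `_nan` indicator column.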

In [25]:
df_cd.isnull().sum()

Color             4
Shape_square      0
Shape_triangle    0
dtype: int64

In [26]:
df.isnull().sum()

Color    4
Shape    5
dtype: int64

In [32]:
df_cd = pd.get_dummies(df, columns=['Color', 'Shape'], drop_first=True, dummy_na=True, dtype=int)
df_cd.head()

Unnamed: 0,Color_green,Color_red,Color_nan,Shape_square,Shape_triangle,Shape_nan
0,1,0,0,0,1,0
1,0,0,0,0,0,1
2,0,0,0,1,0,0
3,1,0,0,1,0,0
4,0,1,0,1,0,0


In [33]:
df_cd.loc[36:41]

Unnamed: 0,Color_green,Color_red,Color_nan,Shape_square,Shape_triangle,Shape_nan
36,1,0,0,1,0,0
37,0,0,1,0,0,0
38,0,0,0,1,0,0
39,0,0,0,1,0,0
40,0,0,1,0,1,0
41,0,1,0,0,0,1
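
The NaN handling shown above can be isolated in a small example. This sketch assumes a hypothetical three-row frame `df_demo` and contrasts the default behaviour with `dummy_na=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
df_demo = pd.DataFrame({'Shape': ['circle', np.nan, 'square']})

# Default: the NaN row is 0 in every dummy column
without_na = pd.get_dummies(df_demo, columns=['Shape'], dtype=int)

# dummy_na=True: NaN gets its own indicator column
with_na = pd.get_dummies(df_demo, columns=['Shape'], dtype=int, dummy_na=True)

print(without_na)
print(with_na)
```

Without `dummy_na=True`, a missing value cannot be told apart from a dropped first category once `drop_first=True` is also used, which is one reason to keep the indicator column.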


---
<a id=3></a>
[Back to top](#0)

---
## 3. OneHotEncoder()  
Note: one-hot means exactly one element is 1 and the rest are 0.  
Note: creates dummy variables using features that pd.get_dummies() lacks.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Default keyword arguments: drop=None, handle_unknown='error'

* Try OneHotEncoder()
* Transform multiple features
---

Try OneHotEncoder()

In [34]:
from sklearn.preprocessing import OneHotEncoder

In [35]:
encoder = OneHotEncoder()

In [37]:
encoder.fit(df[['Color']])

In [41]:
encoder.transform(df[['Color']]).toarray()[:5]

array([[0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.]])

Transform multiple features

In [42]:
encoder = OneHotEncoder()

In [43]:
encoder.fit(df)

In [44]:
encoder.categories_

[array(['blue', 'green', 'red', nan], dtype=object),
 array(['circle', 'square', 'triangle', nan], dtype=object)]

In [46]:
encoder.transform(df).toarray()[:5]

array([[0., 1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0.]])

In [47]:
encoder.inverse_transform([[0, 1, 0, 0, 0, 1, 0, 0]])

array([['green', 'square']], dtype=object)

---
<a id=4></a>
[Back to top](#0)

---
## 4. Differences between pd.get_dummies() and OneHotEncoder()

* With get_dummies(), the train set and test set end up with different columns
* With OneHotEncoder(handle_unknown='error', drop='first')
* With OneHotEncoder(handle_unknown='ignore')
---

In [57]:
np.random.seed(1)
s = pd.Series(np.random.choice([0, 1], len(df)), name='target')
s[:5]

0    1
1    1
2    0
3    0
4    1
Name: target, dtype: int64

In [58]:
df_new = pd.concat([df, s], axis=1)
df_new.head()

Unnamed: 0,Color,Shape,target
0,green,triangle,1
1,blue,,1
2,blue,square,0
3,green,square,0
4,red,square,1


In [59]:
y = df_new.pop('target')
X = df_new

In [60]:
from sklearn.model_selection import train_test_split

In [61]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, stratify=y, random_state=17)

In [62]:
X_train

Unnamed: 0,Color,Shape
94,red,square
3,green,square
25,green,triangle
42,green,square
69,red,triangle
...,...,...
34,blue,square
5,green,triangle
99,red,
61,red,triangle


In [63]:
X_test

Unnamed: 0,Color,Shape
16,blue,circle
29,green,triangle
80,blue,triangle
44,blue,triangle
48,blue,triangle


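With get_dummies(), the train set and test set end up with different columns

The test set here contains no `red` and no `square`, so its dummy columns lack `Color_red` and `Shape_square`.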

In [67]:
pd.get_dummies(X_train, dtype=int, drop_first=True, dummy_na=True)

Unnamed: 0,Color_green,Color_red,Color_nan,Shape_square,Shape_triangle,Shape_nan
94,0,1,0,1,0,0
3,1,0,0,1,0,0
25,1,0,0,0,1,0
42,1,0,0,1,0,0
69,0,1,0,0,1,0
...,...,...,...,...,...,...
34,0,0,0,1,0,0
5,1,0,0,0,1,0
99,0,1,0,0,0,1
61,0,1,0,0,1,0


In [68]:
pd.get_dummies(X_test, dtype=int, drop_first=True, dummy_na=True)

Unnamed: 0,Color_green,Color_nan,Shape_triangle,Shape_nan
16,0,0,0,0
29,1,0,1,0
80,0,0,1,0
44,0,0,1,0
48,0,0,1,0
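
One common workaround for the column mismatch, which is not part of the original notebook, is to reindex the test dummies to the training columns. The sketch below uses hypothetical `train` and `test` frames and fills the missing columns with 0.

```python
import pandas as pd

# Hypothetical toy data: 'red' appears only in the training set
train = pd.DataFrame({'Color': ['blue', 'green', 'red']})
test = pd.DataFrame({'Color': ['blue', 'green']})

train_d = pd.get_dummies(train, dtype=int)
test_d = pd.get_dummies(test, dtype=int)    # has no 'Color_red' column

# Align the test columns to the training columns; absent ones become 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
print(test_aligned)
```

This also silently discards any test-only category, so it trades an explicit error for a quiet all-zero row.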


In [69]:
encoder = OneHotEncoder(drop='first')

In [72]:
encoder.fit_transform(X_train).toarray()[:5]

array([[0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0.]])

In [73]:
encoder.transform(X_test).toarray()

array([[0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.]])

With OneHotEncoder(handle_unknown='error', drop='first')

In [74]:
X_test_new = X_test.copy()
X_test_new.loc[6, 'Color'] = 'purple'
X_test_new

Unnamed: 0,Color,Shape
16,blue,circle
29,green,triangle
80,blue,triangle
44,blue,triangle
48,blue,triangle
6,purple,


In [75]:
encoder_error = OneHotEncoder(handle_unknown='error', drop='first')

In [76]:
encoder_error.fit_transform(X_train).toarray()[:5]

array([[0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0.]])

In [77]:
encoder_error.transform(X_test).toarray()

array([[0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.]])

In [79]:
# Raises ValueError: 'purple' was not seen during fit
# encoder_error.transform(X_test_new).toarray()
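
The failure with an unseen category can be observed safely by catching the exception. This is a minimal sketch with hypothetical `train` and `test` frames, not the notebook's own split.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'purple' never appears in the training set
train = pd.DataFrame({'Color': ['blue', 'green', 'red']})
test = pd.DataFrame({'Color': ['purple']})

enc = OneHotEncoder(handle_unknown='error', drop='first')
enc.fit(train)

try:
    enc.transform(test)
    raised = False
except ValueError as exc:
    # scikit-learn reports the unknown category in the error message
    raised = True
    print(exc)
```

Failing loudly is useful when an unseen category signals a data problem that should be investigated rather than ignored.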

With OneHotEncoder(handle_unknown='ignore')

In [80]:
encoder_ignore = OneHotEncoder(handle_unknown='ignore')

In [81]:
encoder_ignore.fit_transform(X_train).toarray()[:5]

array([[0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 1., 0.]])

In [82]:
encoder_ignore.transform(X_test).toarray()

array([[1., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.]])

In [83]:
encoder_ignore.transform(X_test_new).toarray()

array([[1., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.]])
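
The last row above shows the effect of `handle_unknown='ignore'`: the unseen color is encoded as all zeros. A minimal sketch on a single hypothetical `Color` column makes this easier to read:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, chosen only for illustration
train = pd.DataFrame({'Color': ['blue', 'green', 'red']})
test = pd.DataFrame({'Color': ['purple', 'red']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# 'purple' was not seen during fit, so its row is all zeros
out = enc.transform(test).toarray()
print(out)
```

The all-zero row keeps the pipeline running, at the cost of treating every unseen category identically.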

#### Choosing by situation (examples)
* Few categories, many records  
    => no category is missing from the test data => get_dummies, OneHotEncoder(drop='first')
* Few categories, few records  
    => some category may be missing from the test data => OneHotEncoder(handle_unknown='error', drop='first')
* Many categories, few records  
    => the test data will certainly contain values absent from the train data => OneHotEncoder, handle_unknown='ignore'

---
<a id=5></a>
[Back to top](#0)

---
## 5. Using the str accessor of a Series

* What Series.str is
* Check the methods
* Frequently used operations: replace, extract, split
---

What Series.str is

In [84]:
df = pd.DataFrame()
df['ID'] = ['A-123', 'B-456', 'A-789', 'B-123']
df['Color'] = ['py/white black', 'red green blue', 'py/yellow', 'purple white']
df

Unnamed: 0,ID,Color
0,A-123,py/white black
1,B-456,red green blue
2,A-789,py/yellow
3,B-123,purple white


Check the methods


In [88]:
df['ID'].str[0]

0    A
1    B
2    A
3    B
Name: ID, dtype: object

In [90]:
df['ID'].str[:]

0    A-123
1    B-456
2    A-789
3    B-123
Name: ID, dtype: object

In [89]:
df['ID'].str.lower()

0    a-123
1    b-456
2    a-789
3    b-123
Name: ID, dtype: object

In [91]:
df['ID'].str.startswith('A')

0     True
1    False
2     True
3    False
Name: ID, dtype: bool

In [92]:
df['ID'].str.startswith('B')

0    False
1     True
2    False
3     True
Name: ID, dtype: bool

In [93]:
df['Color'].str.contains('white')

0     True
1    False
2    False
3     True
Name: Color, dtype: bool

In [94]:
df['Color'].str.contains('ye|pu')

0    False
1    False
2     True
3     True
Name: Color, dtype: bool
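
The pattern `'ye|pu'` works as "ye or pu" because `str.contains` interprets its argument as a regular expression by default. As a minimal sketch on the same four color strings, passing `regex=False` treats the pattern as a literal substring instead:

```python
import pandas as pd

colors = pd.Series(['py/white black', 'red green blue', 'py/yellow', 'purple white'])

# Default: the pattern is a regular expression, so '|' means "or"
regex_hits = colors.str.contains('ye|pu')

# regex=False: look for the literal five-character text 'ye|pu'
literal_hits = colors.str.contains('ye|pu', regex=False)

print(regex_hits.tolist())
print(literal_hits.tolist())
```

Use `regex=False` when the search text contains characters such as `|`, `.`, or `(` that should be matched literally.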

Frequently used operations: replace, extract, split

In [96]:
df['Color'].str.replace('black', 'gold')

0     py/white gold
1    red green blue
2         py/yellow
3      purple white
Name: Color, dtype: object

In [97]:
df['ID']

0    A-123
1    B-456
2    A-789
3    B-123
Name: ID, dtype: object

In [98]:
df['ID'].str.split('-')

0    [A, 123]
1    [B, 456]
2    [A, 789]
3    [B, 123]
Name: ID, dtype: object

In [99]:
df['ID'].str.split('-', expand=True)

Unnamed: 0,0,1
0,A,123
1,B,456
2,A,789
3,B,123


In [100]:
df[['ID_a', 'ID_n']] = df['ID'].str.split('-', expand=True)
df

Unnamed: 0,ID,Color,ID_a,ID_n
0,A-123,py/white black,A,123
1,B-456,red green blue,B,456
2,A-789,py/yellow,A,789
3,B-123,purple white,B,123


In [101]:
df[['Color_1', 'Color_2', 'Color_3']] = df['Color'].str.split(' ', expand=True)
df

Unnamed: 0,ID,Color,ID_a,ID_n,Color_1,Color_2,Color_3
0,A-123,py/white black,A,123,py/white,black,
1,B-456,red green blue,B,456,red,green,blue
2,A-789,py/yellow,A,789,py/yellow,,
3,B-123,purple white,B,123,purple,white,
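
Rows with fewer tokens are padded with missing values when `expand=True`, as the table above shows. The following sketch, on the same four color strings, also illustrates two related options: taking only the first token with `.str[0]` and limiting the number of splits with `n`.

```python
import pandas as pd

colors = pd.Series(['py/white black', 'red green blue', 'py/yellow', 'purple white'])

# expand=True returns a DataFrame; short rows are padded with missing values
parts = colors.str.split(' ', expand=True)

# Take only the first token of each row
first = colors.str.split(' ').str[0]

# n=1 splits at most once, keeping the remainder together
head_rest = colors.str.split(' ', n=1, expand=True)

print(parts)
print(first.tolist())
print(head_rest)
```

Limiting the split with `n` gives a fixed number of columns regardless of how many tokens a row contains.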


In [103]:
df['py'] = df['Color_1'].str.extract('(py/)', expand=True)
df

Unnamed: 0,ID,Color,ID_a,ID_n,Color_1,Color_2,Color_3,py
0,A-123,py/white black,A,123,py/white,black,,py/
1,B-456,red green blue,B,456,red,green,blue,
2,A-789,py/yellow,A,789,py/yellow,,,py/
3,B-123,purple white,B,123,purple,white,,


In [104]:
df['Color_1'] = df['Color_1'].str.replace('py/', '')
df

Unnamed: 0,ID,Color,ID_a,ID_n,Color_1,Color_2,Color_3,py
0,A-123,py/white black,A,123,white,black,,py/
1,B-456,red green blue,B,456,red,green,blue,
2,A-789,py/yellow,A,789,yellow,,,py/
3,B-123,purple white,B,123,purple,white,,
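
As an alternative to storing the extracted text `py/` in the flag column, a boolean flag may be more convenient for modeling. This is a minimal sketch on a hypothetical `color_1` Series matching the `Color_1` values above; `regex=False` is passed explicitly so that the prefix is removed as literal text regardless of the pandas version's default.

```python
import pandas as pd

# Same values as the Color_1 column before the prefix is removed
color_1 = pd.Series(['py/white', 'red', 'py/yellow', 'purple'])

# Boolean flag: does the value start with the 'py/' prefix?
has_py = color_1.str.startswith('py/')

# Remove the prefix as a literal string
cleaned = color_1.str.replace('py/', '', regex=False)

print(has_py.tolist())
print(cleaned.tolist())
```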


---
[Back to top](#0)

---
## End
    
---