# [Module 2.1] Advanced Feature Engineering (Target Encoding)

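Target encoding converts a categorical variable into a numeric value by using the label (target) values, so that the encoded feature carries each category's relationship with the label.
For example, given the following data:
```
-----------
Cate Target
-----------
book   1
book   3
food   3
```
we use each category's mean target value (or some other formula) as its encoding. For book, (1+3)/2 = 2, so 2 is used as the target-encoded value. food has a single row with target 3, so it is encoded as 3.
```
-----------
Cate-TE Target
-----------
2        1
2        3
3        3
```
Of the common target encoding approaches (Smoothing, Cross Validation), this notebook uses Smoothing.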
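The toy example above can be reproduced with a plain pandas `groupby`. This is only a minimal sketch of unsmoothed mean encoding; the frame `toy` and the column name `Cate-TE` are illustrative and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy data from the example above
toy = pd.DataFrame({"Cate": ["book", "book", "food"], "Target": [1, 3, 3]})

# Mean of the target within each category
category_mean = toy.groupby("Cate")["Target"].mean()

# Map each row's category to its mean to get the encoded column
toy["Cate-TE"] = toy["Cate"].map(category_mean)
print(toy)
```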

- Target encoding, the right way
    - https://www.kaggle.com/c/ieee-fraud-detection/discussion/108311
    - Python target encoding for categorical features
        - https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features
        - Data
            - https://www.kaggle.com/c/porto-seguro-safe-driver-prediction



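This notebook creates new features through the following feature engineering steps.
- Create date features (month, day, day of week)
- Combine existing features into new ones (feature1 + feature2 = new feature)
- Create a product volume feature (length * width * height = volume)
- Split the data into train and test sets 8:2 in time order
- Apply target encoding to all remaining categorical columns.
    - First, target-encode the train data.
    - Then pass in the test data: each test row takes the encoded value of its category from the training data, and if the category does not appear in the training data, the overall mean of the training target is used.
- Select the final columns
    - XGBoost, CatBoost, AutoGluon
- Save the data locally
    - Save the final target-encoded dataset (for XGBoost, CatBoost)
    - Save the dataset for AutoGluon (in this notebook it is built from the same column list)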

In [1]:
import pandas as pd
pd.options.display.max_rows=5
import numpy as np

In [2]:
%store -r full_data_file_name

### Load and shuffle the data

In [3]:
df = pd.read_csv(full_data_file_name)
df = df.sample(frac=1.0, random_state=1000)
df

Unnamed: 0,classes,order_approved_at,customer_id,customer_zip_code_prefix,customer_city,customer_state,price,freight_value,product_id,product_weight_g,product_length_cm,product_height_cm,product_width_cm,product_category_name_english,seller_zip_code_prefix,seller_city,seller_state
37413,3,2018-07-14 13:04:07,4ee61c3905a5c398d44b089108961bb3,28950,armacao dos buzios,RJ,105.00,27.04,2f13d1dc8b4e1d9d8027be50339546a9,2650.0,30.0,30.0,30.0,furniture_decor,3204,sao paulo,SP
54762,3,2017-03-25 10:25:16,959292edcade77d6b60dc8f49f01cd71,37880,cabo verde,MG,23.99,14.52,b000447e24e31a4d7e628ca4d0622131,250.0,19.0,4.0,11.0,telephony,3504,sao paulo,SP
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18782,3,2018-07-16 18:20:34,609b34a18fd0d61719d0b3190ec05231,21240,rio de janeiro,RJ,200.00,19.50,92d5ae8e42c4599266f210bf0469ae9b,439.0,18.0,11.0,13.0,watches_gifts,14050,ribeirao preto,SP
3776,1,2018-05-25 07:06:30,dc275c5e585b7c3a3cddef527e34fc19,26540,nilopolis,RJ,18.90,7.55,15de022edf1005363381e66bed514528,100.0,16.0,3.0,23.0,furniture_decor,20270,rio de janeiro,RJ


In [4]:
df.columns

Index(['classes', 'order_approved_at', 'customer_id',
       'customer_zip_code_prefix', 'customer_city', 'customer_state', 'price',
       'freight_value', 'product_id', 'product_weight_g', 'product_length_cm',
       'product_height_cm', 'product_width_cm',
       'product_category_name_english', 'seller_zip_code_prefix',
       'seller_city', 'seller_state'],
      dtype='object')

## Create date features: month, day, day of week

In [5]:
def create_date_feature(raw_df):
    df = raw_df.copy()
    df['order_date'] = pd.to_datetime(df['order_approved_at'])    
    df['order_weekday'] = df['order_date'].dt.weekday
    df['order_day'] = df['order_date'].dt.day    
    df['order_month'] = df['order_date'].dt.month        
    return df

f_df = create_date_feature(df)
f_df

Unnamed: 0,classes,order_approved_at,customer_id,customer_zip_code_prefix,customer_city,customer_state,price,freight_value,product_id,product_weight_g,...,product_height_cm,product_width_cm,product_category_name_english,seller_zip_code_prefix,seller_city,seller_state,order_date,order_weekday,order_day,order_month
37413,3,2018-07-14 13:04:07,4ee61c3905a5c398d44b089108961bb3,28950,armacao dos buzios,RJ,105.00,27.04,2f13d1dc8b4e1d9d8027be50339546a9,2650.0,...,30.0,30.0,furniture_decor,3204,sao paulo,SP,2018-07-14 13:04:07,5,14,7
54762,3,2017-03-25 10:25:16,959292edcade77d6b60dc8f49f01cd71,37880,cabo verde,MG,23.99,14.52,b000447e24e31a4d7e628ca4d0622131,250.0,...,4.0,11.0,telephony,3504,sao paulo,SP,2017-03-25 10:25:16,5,25,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18782,3,2018-07-16 18:20:34,609b34a18fd0d61719d0b3190ec05231,21240,rio de janeiro,RJ,200.00,19.50,92d5ae8e42c4599266f210bf0469ae9b,439.0,...,11.0,13.0,watches_gifts,14050,ribeirao preto,SP,2018-07-16 18:20:34,0,16,7
3776,1,2018-05-25 07:06:30,dc275c5e585b7c3a3cddef527e34fc19,26540,nilopolis,RJ,18.90,7.55,15de022edf1005363381e66bed514528,100.0,...,3.0,23.0,furniture_decor,20270,rio de janeiro,RJ,2018-05-25 07:06:30,4,25,5
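As a quick check of the conventions behind `create_date_feature`, the sketch below applies the same `dt` accessors to two timestamps taken from the rows shown above. In pandas, `dt.weekday` counts Monday as 0 and Sunday as 6. The variable names here are illustrative only.

```python
import pandas as pd

# Two timestamps taken from the rows shown above
stamps = pd.to_datetime(pd.Series(["2018-07-14 13:04:07", "2018-07-16 18:20:34"]))

weekday = stamps.dt.weekday   # Monday=0 ... Sunday=6
day = stamps.dt.day           # day of the month
month = stamps.dt.month       # January=1 ... December=12
print(weekday.tolist(), day.tolist(), month.tolist())
```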


## Combine existing features into new features (column1 + column2 = new feature)

In [6]:
def change_var_type(f_df):
    df = f_df.copy()
    df['customer_zip_code_prefix'] = df['customer_zip_code_prefix'].astype(str)
    df['seller_zip_code_prefix'] = df['seller_zip_code_prefix'].astype(str)    
    return df

def comnbine_columns(f_df,src_col1, src_col2,new_col):
    df = f_df.copy()
    df[new_col] = df[str(src_col1)] + '_' + df[str(src_col2)]
    print("df shape: ", df.shape)
    return df



f_df = change_var_type(f_df)
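The reason for casting the zip code prefixes to strings first can be seen on a tiny frame: the columns are numeric after `read_csv`, and joining them with `'_'` needs string operands. The sketch below uses a hypothetical two-row frame `toy` with values copied from the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_zip_code_prefix": [28950, 37880],   # numeric after read_csv
    "seller_zip_code_prefix": [3204, 3504],
})

# Cast to str first so that '+' means string concatenation
for col in ["customer_zip_code_prefix", "seller_zip_code_prefix"]:
    toy[col] = toy[col].astype(str)

toy["customer_seller_zip_code_prefix"] = (
    toy["customer_zip_code_prefix"] + "_" + toy["seller_zip_code_prefix"]
)
print(toy["customer_seller_zip_code_prefix"].tolist())
```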

### customer_state + seller_state

In [7]:
f_df = comnbine_columns(f_df,src_col1='customer_state', src_col2='seller_state',new_col='customer_seller_state')

df shape:  (67176, 22)


### customer_city + seller_city

In [8]:
f_df = comnbine_columns(f_df,src_col1='customer_city', src_col2='seller_city',new_col='customer_seller_city')

df shape:  (67176, 23)


### customer_zip + seller_zip

In [9]:
f_df = comnbine_columns(f_df,src_col1='customer_zip_code_prefix', 
                        src_col2='seller_zip_code_prefix',new_col='customer_seller_zip_code_prefix')

df shape:  (67176, 24)


In [10]:
f_df

Unnamed: 0,classes,order_approved_at,customer_id,customer_zip_code_prefix,customer_city,customer_state,price,freight_value,product_id,product_weight_g,...,seller_zip_code_prefix,seller_city,seller_state,order_date,order_weekday,order_day,order_month,customer_seller_state,customer_seller_city,customer_seller_zip_code_prefix
37413,3,2018-07-14 13:04:07,4ee61c3905a5c398d44b089108961bb3,28950,armacao dos buzios,RJ,105.00,27.04,2f13d1dc8b4e1d9d8027be50339546a9,2650.0,...,3204,sao paulo,SP,2018-07-14 13:04:07,5,14,7,RJ_SP,armacao dos buzios_sao paulo,28950_3204
54762,3,2017-03-25 10:25:16,959292edcade77d6b60dc8f49f01cd71,37880,cabo verde,MG,23.99,14.52,b000447e24e31a4d7e628ca4d0622131,250.0,...,3504,sao paulo,SP,2017-03-25 10:25:16,5,25,3,MG_SP,cabo verde_sao paulo,37880_3504
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18782,3,2018-07-16 18:20:34,609b34a18fd0d61719d0b3190ec05231,21240,rio de janeiro,RJ,200.00,19.50,92d5ae8e42c4599266f210bf0469ae9b,439.0,...,14050,ribeirao preto,SP,2018-07-16 18:20:34,0,16,7,RJ_SP,rio de janeiro_ribeirao preto,21240_14050
3776,1,2018-05-25 07:06:30,dc275c5e585b7c3a3cddef527e34fc19,26540,nilopolis,RJ,18.90,7.55,15de022edf1005363381e66bed514528,100.0,...,20270,rio de janeiro,RJ,2018-05-25 07:06:30,4,25,5,RJ_RJ,nilopolis_rio de janeiro,26540_20270


## Create the product volume column (length * width * height)

In [11]:
def add_product_volume(raw_df):
    df = raw_df.copy()
    df['product_volume'] = df.product_length_cm * df.product_width_cm * df.product_height_cm
    return df

f_df = add_product_volume(f_df)

In [12]:
f_df.columns

Index(['classes', 'order_approved_at', 'customer_id',
       'customer_zip_code_prefix', 'customer_city', 'customer_state', 'price',
       'freight_value', 'product_id', 'product_weight_g', 'product_length_cm',
       'product_height_cm', 'product_width_cm',
       'product_category_name_english', 'seller_zip_code_prefix',
       'seller_city', 'seller_state', 'order_date', 'order_weekday',
       'order_day', 'order_month', 'customer_seller_state',
       'customer_seller_city', 'customer_seller_zip_code_prefix',
       'product_volume'],
      dtype='object')

## Split into train and test datasets
- Split the data into train and test sets 8:2 in time order.

In [13]:

def split_data_2(raw_df, sort_col='order_approved_at', val_ratio=0.3):
    '''
    Split the data into train and test sets in time order.
    val_ratio is the fraction of the most recent rows held out as the second set.
    '''
    df = raw_df.copy()
    train_ratio = 1 - val_ratio  # e.g. 1 - 0.2 = 0.8

    df = df.sort_values(by=sort_col)  # sort chronologically
    split_idx = int(train_ratio * len(df))
    # Earlier rows go to data1 (train), later rows to data2 (test);
    # iloc slicing yields the same two pieces as np.split at split_idx
    data1, data2 = df.iloc[:split_idx], df.iloc[split_idx:]

    print(f"data1, data2 shape: {data1.shape},{data2.shape}")

    return data1, data2

train_df, test_df = split_data_2(f_df, val_ratio=0.2)




data1, data2 shape: (53740, 25),(13436, 25)
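A time-ordered split is meant to keep every training row earlier than every test row. The sketch below checks that property on a hypothetical five-row frame; the rows and the names `ordered`, `cut`, `train` and `test` are made up for illustration. It relies on the timestamps being ISO-formatted strings, which sort in chronological order.

```python
import pandas as pd

toy = pd.DataFrame({
    "order_approved_at": ["2018-03-01 10:00:00", "2017-01-15 09:30:00",
                          "2018-07-14 13:04:07", "2017-11-02 08:00:00",
                          "2016-10-04 10:19:23"],
    "classes": [3, 1, 3, 0, 2],
})

# Sort by timestamp, then cut at 80% of the rows
ordered = toy.sort_values(by="order_approved_at")
cut = int(0.8 * len(ordered))
train, test = ordered.iloc[:cut], ordered.iloc[cut:]

# Every training timestamp should be no later than the earliest test timestamp
assert train["order_approved_at"].max() <= test["order_approved_at"].min()
print(train.shape, test.shape)
```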


## Create target encoding features


In [14]:
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None, 
                  tst_series=None, 
                  target=None, 
                  min_samples_leaf=1, 
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior  
    noise_level (float) : scale of the multiplicative Gaussian noise applied to the encoded values
    """ 
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean 
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
#    display(averages)
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # display(smoothing)
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # display(averages)    
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index 
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
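To make the smoothing step concrete, the sketch below repeats the arithmetic of `target_encode` on the three-row book/food example from the introduction, using the function's default `min_samples_leaf=1` and `smoothing=1` and no noise. The names `toy`, `stats`, `weight` and `encoded` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Cate": ["book", "book", "food"], "Target": [1, 3, 3]})
min_samples_leaf, smoothing = 1, 1   # the defaults of target_encode

stats = toy.groupby("Cate")["Target"].agg(["mean", "count"])
prior = toy["Target"].mean()         # global mean of the target, 7/3

# Sigmoid weight: approaches 1 as the category count grows past min_samples_leaf
weight = 1 / (1 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))

# Blend the global prior with the per-category mean
encoded = prior * (1 - weight) + stats["mean"] * weight
print(encoded.round(4).to_dict())
```

With the values passed later in this notebook (`min_samples_leaf=100`, `smoothing=10`), a category with only a handful of rows gets a weight close to 0, so its encoded value stays close to the global mean.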

## Run target encoding
Apply target encoding to every categorical column.

In [15]:
def add_new_te(raw_train, raw_test):
    train_df = raw_train.copy()
    test_df = raw_test.copy()
    
    def add_new_fe_col(train_df, test_df, cat):
        trn, sub = target_encode(train_df[cat], 
                                 test_df[cat], 
                                 target=train_df.classes, 
                                 min_samples_leaf=100,
                                 smoothing=10,
                                 noise_level=0.01)
        te_col_name = 'te_' + cat + '_mean_smoothed'
        train_df[te_col_name] = trn
        test_df[te_col_name] = sub

        return train_df, test_df

    cat_cols = ['product_id','product_category_name_english',
                'seller_state','seller_city','seller_zip_code_prefix',
                'customer_seller_city','customer_seller_state','customer_seller_zip_code_prefix',
                'customer_city','customer_state','customer_zip_code_prefix'
               ]


    for cat in cat_cols:
        train_df, test_df = add_new_fe_col(train_df, test_df, cat)
    
    return train_df, test_df

train_te_df, test_te_df = add_new_te(train_df, test_df)    
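The handling of test categories that never occur in the training data can be sketched as follows. This is a simplified stand-in, not the notebook's function: it uses the plain category mean with `Series.map` instead of the smoothed value and the `pd.merge` used in `target_encode`. The frames and the column name `te_seller_state` are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"seller_state": ["SP", "SP", "RJ"], "classes": [3, 3, 0]})
test = pd.DataFrame({"seller_state": ["SP", "MG"]})   # "MG" never appears in train

prior = train["classes"].mean()                        # global training mean
train_means = train.groupby("seller_state")["classes"].mean()

# Look up each test category in the training table; unseen ones become NaN,
# which fillna then replaces with the global training mean
test["te_seller_state"] = test["seller_state"].map(train_means).fillna(prior)
print(test)
```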
    
    

In [16]:
display(train_te_df.head(2))
display(test_te_df.head(2))

Unnamed: 0,classes,order_approved_at,customer_id,customer_zip_code_prefix,customer_city,customer_state,price,freight_value,product_id,product_weight_g,...,te_product_category_name_english_mean_smoothed,te_seller_state_mean_smoothed,te_seller_city_mean_smoothed,te_seller_zip_code_prefix_mean_smoothed,te_customer_seller_city_mean_smoothed,te_customer_seller_state_mean_smoothed,te_customer_seller_zip_code_prefix_mean_smoothed,te_customer_city_mean_smoothed,te_customer_state_mean_smoothed,te_customer_zip_code_prefix_mean_smoothed
2605,3,2016-10-04 10:19:23,7812fcebfc5e8065d31e1bb5f0017dae,12030,taubate,SP,29.99,10.96,e2a1d45a73dc7f5a7f9236b043431b89,9000.0,...,2.623823,2.112744,1.834408,1.725886,2.146831,1.631038,2.173256,2.291789,1.784604,2.216951
41063,2,2016-10-04 13:46:31,aadd27185177fc7ac9b364898ac09343,78075,cuiaba,MT,23.9,26.82,43bb8825dd6838251606e5e4130cfff4,1500.0,...,2.220169,2.077027,2.222871,2.176866,2.151111,3.250278,2.193177,2.204278,3.168087,2.177346


Unnamed: 0,classes,order_approved_at,customer_id,customer_zip_code_prefix,customer_city,customer_state,price,freight_value,product_id,product_weight_g,...,te_product_category_name_english_mean_smoothed,te_seller_state_mean_smoothed,te_seller_city_mean_smoothed,te_seller_zip_code_prefix_mean_smoothed,te_customer_seller_city_mean_smoothed,te_customer_seller_state_mean_smoothed,te_customer_seller_zip_code_prefix_mean_smoothed,te_customer_city_mean_smoothed,te_customer_state_mean_smoothed,te_customer_zip_code_prefix_mean_smoothed
4927,3,2018-06-19 03:36:38,956346db615f7adb9ea991c5e4648c55,89219,joinville,SC,105.0,23.89,a62e25e09e05e6faf31d90c6ec1aa3d1,1000.0,...,2.23284,2.19512,2.120438,2.711891,2.197809,2.173182,2.136937,2.609409,2.622631,2.183538
6617,0,2018-06-19 03:36:39,a3777228aa1a73a3bf9a3127609cf68a,9950,diadema,SP,99.97,15.8,ed1e39b938c6cc6867d8f5fd408ce319,650.0,...,2.044144,2.379377,1.984575,2.149048,2.154301,2.228585,2.177672,1.633646,1.781283,2.16351


In [17]:
print(train_te_df.shape)
print(test_te_df.shape)

(53740, 36)
(13436, 36)


In [18]:
train2_lb = train_te_df
test2_lb = test_te_df

In [19]:
train2_lb.columns

Index(['classes', 'order_approved_at', 'customer_id',
       'customer_zip_code_prefix', 'customer_city', 'customer_state', 'price',
       'freight_value', 'product_id', 'product_weight_g', 'product_length_cm',
       'product_height_cm', 'product_width_cm',
       'product_category_name_english', 'seller_zip_code_prefix',
       'seller_city', 'seller_state', 'order_date', 'order_weekday',
       'order_day', 'order_month', 'customer_seller_state',
       'customer_seller_city', 'customer_seller_zip_code_prefix',
       'product_volume', 'te_product_id_mean_smoothed',
       'te_product_category_name_english_mean_smoothed',
       'te_seller_state_mean_smoothed', 'te_seller_city_mean_smoothed',
       'te_seller_zip_code_prefix_mean_smoothed',
       'te_customer_seller_city_mean_smoothed',
       'te_customer_seller_state_mean_smoothed',
       'te_customer_seller_zip_code_prefix_mean_smoothed',
       'te_customer_city_mean_smoothed', 'te_customer_state_mean_smoothed',
       'te_c

## Select the final columns
### For the XGBoost and CatBoost algorithms

In [20]:
def filter_df(raw_df, cols):
    df = raw_df.copy()
    df = df[cols]
    return df


cols = ['classes',
        'price', 'freight_value',
        'product_weight_g', 
        'product_volume',    
        'order_weekday',
        'order_day', 'order_month',        
        'te_product_id_mean_smoothed',
        'te_product_category_name_english_mean_smoothed',        
        'te_seller_state_mean_smoothed', 'te_seller_city_mean_smoothed',
        'te_seller_zip_code_prefix_mean_smoothed',
        'te_customer_seller_city_mean_smoothed',
        'te_customer_seller_state_mean_smoothed',
        'te_customer_seller_zip_code_prefix_mean_smoothed',
        'te_customer_city_mean_smoothed', 'te_customer_state_mean_smoothed',
       'te_customer_zip_code_prefix_mean_smoothed'
       ]



encode_te_train = filter_df(train2_lb, cols)
encode_te_test = filter_df(test2_lb, cols)



### For AutoGluon

In [21]:
te_auto_train = filter_df(train2_lb, cols)
te_auto_test = filter_df(test2_lb, cols)

## Save the data locally

In [22]:
import os

def save_local(train_data, test_data, preproc_folder):
    train_df = pd.concat([train_data['classes'], train_data.drop(['classes'], axis=1)], axis=1)
    train_file_name = os.path.join(preproc_folder, 'train.csv')
    train_df.to_csv(train_file_name, index=False)
    print(f'{train_file_name} is saved')

    test_df = pd.concat([test_data['classes'], test_data.drop(['classes'], axis=1)], axis=1)
    test_file_name = os.path.join(preproc_folder, 'test.csv')
    test_df.to_csv(test_file_name, index=False)
    print(f'{test_file_name} is saved')        
    
    return train_file_name, test_file_name
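`save_local` moves the `classes` column to the front before writing the CSV. The sketch below shows the same reordering with a column list on a hypothetical three-column frame and writes to a temporary directory rather than to `preproc_data`. The notebook does not say why the label goes first; a common reason is that some training tools expect the label in the first CSV column, which is an assumption here.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"price": [105.0, 23.99], "classes": [3, 3], "order_month": [7, 3]})

# Move the label to the front and keep the other columns in their original order
ordered_cols = ["classes"] + [c for c in toy.columns if c != "classes"]
label_first = toy[ordered_cols]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")
    label_first.to_csv(path, index=False)   # index=False keeps the row index out of the file
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```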




## Save the data for XGBoost and CatBoost

In [23]:
preproc_folder = 'preproc_data/fe/te_xgboost'
os.makedirs(preproc_folder, exist_ok=True)    
te_pre_train_file, te_pre_test_file = save_local(encode_te_train, encode_te_test, preproc_folder)




preproc_data/fe/te_xgboost/train.csv is saved
preproc_data/fe/te_xgboost/test.csv is saved


## Save the data for AutoGluon

In [24]:
preproc_folder = 'preproc_data/fe/te_auto'
os.makedirs(preproc_folder, exist_ok=True)    
te_auto_train_file,te_auto_test_file  = save_local(te_auto_train, te_auto_test, preproc_folder)



preproc_data/fe/te_auto/train.csv is saved
preproc_data/fe/te_auto/test.csv is saved


In [25]:
%store te_pre_train_file
%store te_pre_test_file

%store te_auto_train_file
%store te_auto_test_file

Stored 'te_pre_train_file' (str)
Stored 'te_pre_test_file' (str)
Stored 'te_auto_train_file' (str)
Stored 'te_auto_test_file' (str)
