# Categorical Feature Encoding Challenge Step 4

---
### Analysis summary and modeling strategy
#### 1) Analysis summary
1. No missing values
2. No features to remove
3. Encode binary features : Change values to 0 and 1
    + **'bin_3'** and **'bin_4'**
4. Encode nominal features : One-hot encoding, since the dataset is small enough for it to be practical
    + **'nom_0' ~ 'nom_9'**
5. Encode ordinal features : Encode as the order of unique values
    + **'ord_0' ~ 'ord_5'**
6. Encode cyclical features : One-hot encoding, so that the model does not interpret the values as larger or smaller magnitudes
    + **'day'** and **'month'**

#### 2) Modeling strategy
- Baseline model : Logistic Regression
    + Feature engineering : One-hot encoding of all features
- Performance improvement : Additional feature engineering and hyperparameter optimization
    + Feature engineering : <u>Custom encoding for categorical features</u> and <u>feature scaling</u>
    + Hyperparameter optimization : GridSearch
    + Additional tip : Use validation data for training
--- 

## 4. Performance improvement
- Process
    1. Import data
    2. Feature engineering
    3. Make evaluation metrics calculation function
    4. Hyperparameter optimization (Train model)
        + **Generate model**
        + **Generate GridSearch object**
        + **Train GridSearch**
    5. Validate performance
        + If performance is not good, go back to **Feature Engineering** or **Hyperparameter Optimization**
    6. Submit

### 4.1. Modeling focused on features
- Key point
    + **Custom encoding** for categorical features
        + Binary features : Manual encoding
        + Ordinal features : Manual encoding and ordinal encoding
        + Nominal features : One-hot encoding
        + Date features : One-hot encoding
    + **Feature scaling** for ordinal features
    + Hyperparameter optimization

#### 1) Import data

In [1]:
# Import data
import pandas as pd
data_path = '/kaggle/input/cat-in-the-dat/'
train = pd.read_csv(data_path + 'train.csv', index_col='id')
test = pd.read_csv(data_path + 'test.csv', index_col='id')
submission = pd.read_csv(data_path + 'sample_submission.csv', index_col='id')

#### 2-1) Feature engineering : Custom encoding for categorical features

In [2]:
# Merge train data and test data
all_data = pd.concat([train, test])
# Drop the target column so that features and target are handled separately
all_data = all_data.drop('target', axis=1)

##### (1) Binary feature
- 'bin_3', 'bin_4' : Manual encoding

In [3]:
all_data['bin_3'] = all_data['bin_3'].map({'F':0, 'T':1})
all_data['bin_4'] = all_data['bin_4'].map({'N':0, 'Y':1})
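
One caveat with mapping through a dict: any value that is not a key in the dict becomes NaN. The following minimal sketch uses a made-up Series (the value 'X' is deliberately unexpected and does not come from the competition data) to show a quick check that could be run after the mapping above.

```python
import pandas as pd

# Toy column; 'X' is a hypothetical unexpected value, not from the real data
toy_bin = pd.Series(['T', 'F', 'T', 'X'])

# Same style of manual encoding as used for 'bin_3'
mapped = toy_bin.map({'F': 0, 'T': 1})

# Values missing from the dict turn into NaN, so count them
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)  # 1
```

If the count is zero on the real column, the dict covers every value.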

##### (2) Ordinal feature
- 'ord_1', 'ord_2' : Manual encoding
- 'ord_3', 'ord_4', 'ord_5' : Encoding in alphabetical order

In [4]:
# Manual encoding
ord1dict = {'Novice':0, 'Contributor':1,
            'Expert':2, 'Master':3, 'Grandmaster':4}
ord2dict = {'Freezing':0, 'Cold':1, 'Warm':2,
            'Hot':3, 'Boiling Hot':4, 'Lava Hot':5}
all_data['ord_1'] = all_data['ord_1'].map(ord1dict)
all_data['ord_2'] = all_data['ord_2'].map(ord2dict)
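
The manual dictionaries are needed because the alphabetical order of these labels does not match their meaning. As an alternative sketch (not what this notebook does), OrdinalEncoder accepts a categories argument that fixes the order explicitly. The toy frame and the variable names below are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame reusing the level names of 'ord_1' and 'ord_2' (illustrative rows only)
toy = pd.DataFrame({
    'ord_1': ['Novice', 'Grandmaster', 'Expert'],
    'ord_2': ['Cold', 'Lava Hot', 'Freezing'],
})

# Levels listed from lowest to highest, matching ord1dict and ord2dict
ord1_levels = ['Novice', 'Contributor', 'Expert', 'Master', 'Grandmaster']
ord2_levels = ['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']

# One list of levels per column, in column order
explicit_encoder = OrdinalEncoder(categories=[ord1_levels, ord2_levels])
toy_encoded = explicit_encoder.fit_transform(toy)
print(toy_encoded)
```

Each value is replaced by its position in the given list, so 'Grandmaster' becomes 4.0 and 'Lava Hot' becomes 5.0, the same codes as in the dictionaries.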

In [5]:
# Encoding in alphabetical order using OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
    # List to encode
ord_345 = ['ord_3', 'ord_4', 'ord_5']
    # Generate encoder object
ord_encoder = OrdinalEncoder()
    # Apply ordinal encoding
all_data[ord_345] = ord_encoder.fit_transform(all_data[ord_345])
    # Print encoding order by feature
for feature, categories in zip(ord_345, ord_encoder.categories_):
    print(feature)
    print(categories)

ord_3
['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o']
ord_4
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R'
 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z']
ord_5
['AP' 'Ai' 'Aj' 'BA' 'BE' 'Bb' 'Bd' 'Bn' 'CL' 'CM' 'CU' 'CZ' 'Cl' 'DH'
 'DN' 'Dc' 'Dx' 'Ed' 'Eg' 'Er' 'FI' 'Fd' 'Fo' 'GD' 'GJ' 'Gb' 'Gx' 'Hj'
 'IK' 'Id' 'JX' 'Jc' 'Jf' 'Jt' 'KR' 'KZ' 'Kf' 'Kq' 'LE' 'MC' 'MO' 'MV'
 'Mf' 'Ml' 'Mx' 'NV' 'Nf' 'Nk' 'OR' 'Ob' 'Os' 'PA' 'PQ' 'PZ' 'Ps' 'QM'
 'Qb' 'Qh' 'Qo' 'RG' 'RL' 'RP' 'Rm' 'Ry' 'SB' 'Sc' 'TR' 'TZ' 'To' 'UO'
 'Uk' 'Uu' 'Vf' 'Vx' 'WE' 'Wc' 'Wv' 'XI' 'Xh' 'Xi' 'YC' 'Yb' 'Ye' 'ZR'
 'ZS' 'Zc' 'Zq' 'aF' 'aM' 'aO' 'aP' 'ac' 'av' 'bF' 'bJ' 'be' 'cA' 'cG'
 'cW' 'ck' 'cp' 'dB' 'dE' 'dN' 'dO' 'dP' 'dQ' 'dZ' 'dh' 'eG' 'eQ' 'eb'
 'eg' 'ek' 'ex' 'fO' 'fh' 'gJ' 'gM' 'hL' 'hT' 'hh' 'hp' 'iT' 'ih' 'jS'
 'jV' 'je' 'jp' 'kC' 'kE' 'kK' 'kL' 'kU' 'kW' 'ke' 'kr' 'kw' 'lF' 'lL'
 'll' 'lx' 'mb' 'mc' 'mm' 'nX' 'nh' 'oC' 'oG' 'oH' 'oK' 'od' 'on' 'pa'
 'ps' 'qA' 'qJ' 'qK' 'qP' 'qX' '

In [6]:
# Before ordinal encoding
train[ord_345].head()

| id | ord_3 | ord_4 | ord_5 |
|---|---|---|---|
| 0 | h | D | kr |
| 1 | a | A | bF |
| 2 | h | R | Jc |
| 3 | i | D | kW |
| 4 | a | R | qP |


In [7]:
# After ordinal encoding
all_data[ord_345].head()

| id | ord_3 | ord_4 | ord_5 |
|---|---|---|---|
| 0 | 7.0 | 3.0 | 136.0 |
| 1 | 0.0 | 0.0 | 93.0 |
| 2 | 7.0 | 17.0 | 31.0 |
| 3 | 8.0 | 3.0 | 134.0 |
| 4 | 0.0 | 17.0 | 158.0 |


##### (3) Nominal feature
- 'nom_0' ~ 'nom_9' : One-hot encoding
- Note
    + By default, OneHotEncoder returns its result as a sparse matrix in CSR (Compressed Sparse Row) format
    + CSR stores only the non-zero entries, so for one-hot data it uses far less memory than a dense array and speeds up computation
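
To make the memory point concrete, here is a minimal sketch comparing a dense one-hot-like array with its CSR version. The matrix size and the random column choice are assumptions for illustration and are unrelated to the competition data.

```python
import numpy as np
from scipy import sparse

# Dense one-hot-like matrix: 1,000 rows, 200 columns, exactly one 1 per row
rng = np.random.default_rng(0)
rows = np.arange(1000)
cols = rng.integers(0, 200, size=1000)
dense = np.zeros((1000, 200))
dense[rows, cols] = 1.0

# Convert to CSR, which keeps only the non-zero entries and their positions
csr = sparse.csr_matrix(dense)

# Memory of the dense array versus the three arrays backing the CSR matrix
dense_bytes = dense.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(dense_bytes, csr_bytes)
```

The dense array pays for every zero, while the CSR matrix stores only the 1,000 non-zero values plus their index arrays.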

In [8]:
# One-hot encoding
from sklearn.preprocessing import OneHotEncoder
    # List to encode
nom_features = ['nom_' + str(i) for i in range(10)]
    # Generate encoder object
onehot_encoder = OneHotEncoder()
    # Apply one-hot encoding
encoded_nom_matrix = onehot_encoder.fit_transform(all_data[nom_features])
encoded_nom_matrix

<500000x16276 sparse matrix of type '<class 'numpy.float64'>'
	with 5000000 stored elements in Compressed Sparse Row format>

In [9]:
# Remove nominal features from all_data
all_data = all_data.drop(nom_features, axis=1)

##### (4) Date feature
- 'day', 'month' : One-hot encoding
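
As a small illustration of what one-hot encoding does to date columns, the sketch below encodes a made-up three-row frame. Each distinct day and month gets its own 0/1 column, so December (12) is no longer treated as "larger" than January (1). The toy values and variable names are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy date columns (illustrative values only)
toy_dates = pd.DataFrame({'day': [1, 7, 3], 'month': [12, 1, 6]})

date_encoder = OneHotEncoder()
toy_matrix = date_encoder.fit_transform(toy_dates)

# One column per distinct value: 3 for 'day' plus 3 for 'month'
print(date_encoder.categories_)
print(toy_matrix.toarray())
```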

In [10]:
# One-hot encoding
    # List to encode
date_features = ['day', 'month']
    # Apply one-hot encoding
encoded_date_matrix = onehot_encoder.fit_transform(all_data[date_features])
encoded_date_matrix

<500000x19 sparse matrix of type '<class 'numpy.float64'>'
	with 1000000 stored elements in Compressed Sparse Row format>

#### 2-2) Feature engineering : Feature scaling
- When numerical features have very different value ranges, features with larger ranges can dominate and the model may train poorly
- Feature scaling : Adjusting the value ranges of features so that they match each other
    + Binary, nominal, and date features are already encoded as 0 and 1
    + The ordinal features must therefore also be scaled to the range 0 to 1
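
Min-max scaling maps each value x to (x - min) / (max - min) per column. The sketch below checks that formula against MinMaxScaler on a toy column. The values 0, 3, 7, 14 are an assumption standing in for 'ord_3' codes, which run from 0 to 14 in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column standing in for ordinal codes between 0 and 14
values = np.array([[0.0], [3.0], [7.0], [14.0]])

# Manual min-max formula, applied per column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# Same result from scikit-learn
scaled = MinMaxScaler().fit_transform(values)
print(np.allclose(manual, scaled))  # True
print(scaled.ravel())
```

A code of 7 maps to 7 / 14 = 0.5, which matches the scaled 'ord_3' value shown for id 0 in the output further down.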

In [11]:
# Feature scaling
from sklearn.preprocessing import MinMaxScaler
    # List to scale
ord_features = ['ord_' + str(i) for i in range(6)]
    # Just to compare before and after
before_scaling = all_data[ord_features]
    # Min-max normalization
all_data[ord_features] = MinMaxScaler().fit_transform(all_data[ord_features])

In [12]:
# Before scaling
before_scaling.head()

| id | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 |
|---|---|---|---|---|---|---|
| 0 | 2 | 4 | 1 | 7.0 | 3.0 | 136.0 |
| 1 | 1 | 4 | 3 | 0.0 | 0.0 | 93.0 |
| 2 | 1 | 2 | 5 | 7.0 | 17.0 | 31.0 |
| 3 | 1 | 4 | 4 | 8.0 | 3.0 | 134.0 |
| 4 | 1 | 4 | 0 | 0.0 | 17.0 | 158.0 |


In [13]:
# After scaling
all_data[ord_features].head()

| id | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 |
|---|---|---|---|---|---|---|
| 0 | 0.5 | 1.0 | 0.2 | 0.5 | 0.12 | 0.712042 |
| 1 | 0.0 | 1.0 | 0.6 | 0.0 | 0.0 | 0.486911 |
| 2 | 0.0 | 0.5 | 1.0 | 0.5 | 0.68 | 0.162304 |
| 3 | 0.0 | 1.0 | 0.8 | 0.571429 | 0.12 | 0.701571 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.68 | 0.827225 |


In [14]:
# Merge encoded and scaled features in CSR format
from scipy import sparse

all_data_sprs = sparse.hstack([sparse.csr_matrix(all_data), # Convert all_data to CSR format
                              encoded_nom_matrix,
                              encoded_date_matrix],
                              format='csr')

# Split data into train data and test data
num_train = len(train)
X_train = all_data_sprs[:num_train]
X_test = all_data_sprs[num_train:]
y = train['target']

# Split train data into train and validation data
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y,
                                                     test_size=0.1,
                                                     stratify=y,
                                                     random_state=10)
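
The merge-then-split pattern in the cell above can be seen on a tiny example. The sketch below stacks two small sparse blocks side by side and slices the result by row, the same way X_train and X_test are cut from all_data_sprs. The toy matrices and the split point are assumptions for illustration.

```python
import numpy as np
from scipy import sparse

# Two toy feature blocks with the same number of rows
left = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))
right = sparse.csr_matrix(np.array([[0.0], [1.0], [0.0]]))

# Column-wise concatenation; format='csr' keeps row slicing efficient
stacked = sparse.hstack([left, right], format='csr')

# Row slicing: first two rows play the role of train, the rest of test
top = stacked[:2]
bottom = stacked[2:]
print(stacked.shape, top.shape, bottom.shape)  # (3, 3) (2, 3) (1, 3)
```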

#### 3) Make evaluation metric calculation function
- Use the scikit-learn library
    + sklearn.metrics.roc_auc_score
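
As a quick reminder of how this metric behaves, the sketch below scores four made-up predictions. ROC AUC is the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The labels and scores are toy values, not model output.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```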

#### 4) Hyperparameter optimization (Train model)
- Grid search finds the **best hyperparameter values** by trying every combination in a given grid and evaluating each one with cross-validation


- Process
    1. Generate model
    2. Generate GridSearch object
        + Target model
        + List of hyperparameter values (Dictionary type)
        + Evaluation function for cross-validation
    3. Train model (GridSearch)
    

- Note
    + C : The inverse of the regularization strength; the smaller the value, the stronger the regularization
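
The process above can be sketched on a small synthetic dataset before running it on the full data. The dataset from make_classification, the grid values, and the toy_ names are assumptions chosen so that the example runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic binary classification problem (illustrative only)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# 1. Generate model, 2. generate GridSearch object
toy_search = GridSearchCV(
    estimator=LogisticRegression(solver='liblinear', random_state=42),
    param_grid={'C': [0.01, 0.1, 1.0]},  # candidate values to try
    scoring='roc_auc',                   # metric used in cross-validation
    cv=5,                                # number of folds
)

# 3. Train: fits one model per candidate and fold, then refits the best one
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
for c, score in zip(toy_search.cv_results_['param_C'],
                    toy_search.cv_results_['mean_test_score']):
    print(c, round(score, 4))
```

The per-candidate scores in cv_results_ show how much the choice of C matters, which helps when deciding whether to widen or narrow the grid.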

In [15]:
%%time

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Generate model
logistic_model = LogisticRegression()
# List of hyperparameter values (Dict type)
lr_params = {'C':[0.1, 0.125, 0.2], 'max_iter':[800, 900, 1000],
            'solver':['liblinear'], 'random_state':[42]}
# Generate GridSearch object
gridsearch_logistic_model = GridSearchCV(estimator=logistic_model,
                                        param_grid=lr_params,
                                        scoring='roc_auc',
                                        cv=5)
# Train model and GridSearch
gridsearch_logistic_model.fit(X_train, y_train)

print('best hyperparameter :', gridsearch_logistic_model.best_params_)

best hyperparameter : {'C': 0.125, 'max_iter': 800, 'random_state': 42, 'solver': 'liblinear'}
CPU times: user 19min 46s, sys: 17min 49s, total: 37min 36s
Wall time: 9min 40s


#### 5) Validate performance

In [16]:
# Predict the probability of target value 1 with validation data
y_valid_preds = gridsearch_logistic_model.predict_proba(X_valid)[:, 1]

# Validate model
from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_valid, y_valid_preds)
print(f'ROC AUC of validation data : {roc_auc:.4f}')

ROC AUC of validation data : 0.8045


#### 6) Submit

In [17]:
# Predict with test data
y_preds = gridsearch_logistic_model.best_estimator_.predict_proba(X_test)[:, 1]

# Save submission file
submission['target'] = y_preds
submission.to_csv('submission.csv')
submission

| id | target |
|---|---|
| 300000 | 0.345187 |
| 300001 | 0.694867 |
| 300002 | 0.112086 |
| 300003 | 0.472137 |
| 300004 | 0.856656 |
| ... | ... |
| 499995 | 0.285640 |
| 499996 | 0.146745 |
| 499997 | 0.333108 |
| 499998 | 0.573309 |


### 4.2. Train on all data including validation data
- Even a small amount of additional training data can help improve performance
- Key points for performance improvement
    1. Do feature engineering (encoding / feature scaling)
    2. Try different kinds of models and optimize their hyperparameters
    3. Select the model with the best performance on the validation data
    4. Retrain the selected model on the full training data, including the validation data
    5. Submit
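
Steps 3 and 4 can be sketched as follows on synthetic data. This is only an illustration under assumed names and candidate values. The cells in this section take a slightly different route and rerun the grid search on the full training set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the full training data (illustrative only)
X_all, y_all = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_all, y_all, test_size=0.1, stratify=y_all, random_state=10)

# Step 3: score each candidate on the held-out validation data
val_scores = {}
for C in [0.01, 0.1, 1.0]:
    model = LogisticRegression(C=C, solver='liblinear', random_state=42)
    model.fit(X_tr, y_tr)
    val_scores[C] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Pick the candidate with the highest validation ROC AUC
best_C = max(val_scores, key=val_scores.get)

# Step 4: retrain the selected configuration on train + validation data
final_model = LogisticRegression(C=best_C, solver='liblinear', random_state=42)
final_model.fit(X_all, y_all)
print(best_C, val_scores)
```

After the final fit there is no held-out data left to score, so the validation result from step 3 is the last available performance estimate.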

#### 1) Import data

In [18]:
# Import data
import pandas as pd
data_path = '/kaggle/input/cat-in-the-dat/'
train = pd.read_csv(data_path + 'train.csv', index_col='id')
test = pd.read_csv(data_path + 'test.csv', index_col='id')
submission = pd.read_csv(data_path + 'sample_submission.csv', index_col='id')

#### 2-1) Feature engineering : Custom encoding for categorical features

In [19]:
# Merge train data and test data
all_data = pd.concat([train, test])
# Drop the target column so that features and target are handled separately
all_data = all_data.drop('target', axis=1)

In [20]:
# Binary feature
all_data['bin_3'] = all_data['bin_3'].map({'F':0, 'T':1})
all_data['bin_4'] = all_data['bin_4'].map({'N':0, 'Y':1})

# Ordinal feature
ord1dict = {'Novice':0, 'Contributor':1,
            'Expert':2, 'Master':3, 'Grandmaster':4}
ord2dict = {'Freezing':0, 'Cold':1, 'Warm':2,
            'Hot':3, 'Boiling Hot':4, 'Lava Hot':5}
all_data['ord_1'] = all_data['ord_1'].map(ord1dict)
all_data['ord_2'] = all_data['ord_2'].map(ord2dict)

from sklearn.preprocessing import OrdinalEncoder
    # List to encode
ord_345 = ['ord_3', 'ord_4', 'ord_5']
    # Generate encoder object
ord_encoder = OrdinalEncoder()
    # Apply ordinal encoding
all_data[ord_345] = ord_encoder.fit_transform(all_data[ord_345])
    # Print encoding order by feature
for feature, categories in zip(ord_345, ord_encoder.categories_):
    print(feature)
    print(categories)
    
# Nominal feature
from sklearn.preprocessing import OneHotEncoder
    # List to encode
nom_features = ['nom_' + str(i) for i in range(10)]
    # Generate encoder object
onehot_encoder = OneHotEncoder()
    # Apply one-hot encoding
encoded_nom_matrix = onehot_encoder.fit_transform(all_data[nom_features])
encoded_nom_matrix
    # Remove nominal features from all_data
all_data = all_data.drop(nom_features, axis=1)

# Date feature
    # List to encode
date_features = ['day', 'month']
    # Apply one-hot encoding
encoded_date_matrix = onehot_encoder.fit_transform(all_data[date_features])
encoded_date_matrix

ord_3
['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o']
ord_4
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R'
 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z']
ord_5
['AP' 'Ai' 'Aj' 'BA' 'BE' 'Bb' 'Bd' 'Bn' 'CL' 'CM' 'CU' 'CZ' 'Cl' 'DH'
 'DN' 'Dc' 'Dx' 'Ed' 'Eg' 'Er' 'FI' 'Fd' 'Fo' 'GD' 'GJ' 'Gb' 'Gx' 'Hj'
 'IK' 'Id' 'JX' 'Jc' 'Jf' 'Jt' 'KR' 'KZ' 'Kf' 'Kq' 'LE' 'MC' 'MO' 'MV'
 'Mf' 'Ml' 'Mx' 'NV' 'Nf' 'Nk' 'OR' 'Ob' 'Os' 'PA' 'PQ' 'PZ' 'Ps' 'QM'
 'Qb' 'Qh' 'Qo' 'RG' 'RL' 'RP' 'Rm' 'Ry' 'SB' 'Sc' 'TR' 'TZ' 'To' 'UO'
 'Uk' 'Uu' 'Vf' 'Vx' 'WE' 'Wc' 'Wv' 'XI' 'Xh' 'Xi' 'YC' 'Yb' 'Ye' 'ZR'
 'ZS' 'Zc' 'Zq' 'aF' 'aM' 'aO' 'aP' 'ac' 'av' 'bF' 'bJ' 'be' 'cA' 'cG'
 'cW' 'ck' 'cp' 'dB' 'dE' 'dN' 'dO' 'dP' 'dQ' 'dZ' 'dh' 'eG' 'eQ' 'eb'
 'eg' 'ek' 'ex' 'fO' 'fh' 'gJ' 'gM' 'hL' 'hT' 'hh' 'hp' 'iT' 'ih' 'jS'
 'jV' 'je' 'jp' 'kC' 'kE' 'kK' 'kL' 'kU' 'kW' 'ke' 'kr' 'kw' 'lF' 'lL'
 'll' 'lx' 'mb' 'mc' 'mm' 'nX' 'nh' 'oC' 'oG' 'oH' 'oK' 'od' 'on' 'pa'
 'ps' 'qA' 'qJ' 'qK' 'qP' 'qX' '

<500000x19 sparse matrix of type '<class 'numpy.float64'>'
	with 1000000 stored elements in Compressed Sparse Row format>

#### 2-2) Feature engineering : Feature scaling

In [21]:
# Feature scaling
from sklearn.preprocessing import MinMaxScaler
    # List to scale
ord_features = ['ord_' + str(i) for i in range(6)]
    # Just to compare before and after
before_scaling = all_data[ord_features]
    # Min-max normalization
all_data[ord_features] = MinMaxScaler().fit_transform(all_data[ord_features])

In [22]:
# Merge encoded and scaled features in CSR format
from scipy import sparse

all_data_sprs = sparse.hstack([sparse.csr_matrix(all_data), # Convert all_data to CSR format
                              encoded_nom_matrix,
                              encoded_date_matrix],
                              format='csr')

# Split data into train data and test data
num_train = len(train)
X_train = all_data_sprs[:num_train]
X_test = all_data_sprs[num_train:]
y = train['target']

#### 3) Make evaluation metric calculation function

#### 4) Hyperparameter optimization (Train model)

In [23]:
%%time

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Generate model
logistic_model = LogisticRegression()
# List of hyperparameter values (Dict type)
lr_params = {'C':[0.1, 0.125, 0.2], 'max_iter':[800, 900, 1000],
            'solver':['liblinear'], 'random_state':[42]}
# Generate GridSearch object
gridsearch_logistic_model = GridSearchCV(estimator=logistic_model,
                                        param_grid=lr_params,
                                        scoring='roc_auc',
                                        cv=5)
# Train model and GridSearch
gridsearch_logistic_model.fit(X_train, y)

print('best hyperparameter :', gridsearch_logistic_model.best_params_)

best hyperparameter : {'C': 0.125, 'max_iter': 800, 'random_state': 42, 'solver': 'liblinear'}
CPU times: user 21min 28s, sys: 19min 21s, total: 40min 49s
Wall time: 10min 30s


#### 5) Validate performance

#### 6) Submit

In [24]:
# Predict with test data
y_preds = gridsearch_logistic_model.best_estimator_.predict_proba(X_test)[:, 1]

# Save submission file
submission['target'] = y_preds
submission.to_csv('submission.csv')
submission

| id | target |
|---|---|
| 300000 | 0.339968 |
| 300001 | 0.695769 |
| 300002 | 0.121029 |
| 300003 | 0.436690 |
| 300004 | 0.868538 |
| ... | ... |
| 499995 | 0.291462 |
| 499996 | 0.129374 |
| 499997 | 0.323234 |
| 499998 | 0.597045 |


References
===
- [EDA reference](https://www.kaggle.com/kabure/eda-feat-engineering-encode-conquer)
- [Modeling reference](https://www.kaggle.com/dkomyagin/cat-in-the-dat-0-80285-private-lb-solution)
- 머신러닝·딥러닝 문제해결 전략 (Machine Learning and Deep Learning Problem-Solving Strategies), by Shin Baek-gyun (신백균)