# Categorical Feature Encoding Challenge Step 3

---
### Analysis summary and modeling strategy
#### 1) Analysis summary
1. No missing values
2. No features to remove
3. Encode binary features : Change values to 0 and 1
    + **'bin_3'** and **'bin_4'**
4. Encode nominal features : One-hot encoding, which is feasible because the dataset is not very large
    + **'nom_0' ~ 'nom_9'**
5. Encode ordinal features : Map each value to an integer that follows the order of the unique values
    + **'ord_0' ~ 'ord_5'**
6. Encode cyclical features : One-hot encoding, so that the model does not interpret the values as larger or smaller magnitudes
    + **'day'** and **'month'**
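
The binary and ordinal encodings listed above can be sketched on a toy frame. This is only an illustration: the rows are made up, and the rank order chosen for `ord_1` (including the `Master` level) is an assumption rather than something stated in this notebook.

```python
import pandas as pd

# Toy frame mimicking a few columns of the competition data (rows are made up)
toy = pd.DataFrame({
    'bin_3': ['T', 'F', 'T'],
    'bin_4': ['Y', 'N', 'Y'],
    'ord_1': ['Novice', 'Expert', 'Grandmaster'],
})

# Binary features: map the two text labels to 0 and 1
toy['bin_3'] = toy['bin_3'].map({'F': 0, 'T': 1})
toy['bin_4'] = toy['bin_4'].map({'N': 0, 'Y': 1})

# Ordinal feature: map each level to its rank (this ordering is an assumption)
ord_1_order = {'Novice': 0, 'Contributor': 1, 'Expert': 2,
               'Master': 3, 'Grandmaster': 4}
toy['ord_1'] = toy['ord_1'].map(ord_1_order)

print(toy)
```
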

#### 2) Modeling strategy
- Baseline model : Logistic Regression
    + Feature engineering : One-hot encoding of all features
- Performance improvement : Additional feature engineering and hyperparameter optimization
    + Feature engineering : <u>Custom encoding for categorical features</u> and <u>feature scaling</u>
    + Hyperparameter optimization : GridSearch
    + Additional tip : Also use the validation data for training
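
A grid search over logistic regression hyperparameters might look like the sketch below. It runs on synthetic data from `make_classification`, and the grid over `C` is a hypothetical choice for illustration, not the grid used later in this series.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded training data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical grid: candidate values for the inverse regularization strength C
param_grid = {'C': [0.01, 0.1, 1.0]}

# Score each candidate with ROC AUC using 3-fold cross-validation
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring='roc_auc', cv=3)
grid.fit(X_demo, y_demo)

best_C = grid.best_params_['C']
print(best_C, grid.best_score_)
```
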
--- 

## 3. Baseline model
- Train
    + Find the optimal regression coefficients given the independent variables (features) and the target values
- Predict
    + Estimate the target value when new independent variables are given to the trained model (which holds the optimal regression coefficients)
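
As a rough illustration of the two steps, the sketch below fits a logistic regression on a tiny made-up dataset, then reproduces its predicted probabilities by hand from the learned coefficients. The data and variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature dataset: small values belong to class 0, large ones to class 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

# Train: find the regression coefficients
model = LogisticRegression()
model.fit(X_demo, y_demo)

# Predict: apply the learned coefficients to new inputs
X_new = np.array([[0.5], [4.5]])
model_proba = model.predict_proba(X_new)[:, 1]

# The same probabilities computed by hand: sigmoid(X @ coef + intercept)
z = X_new @ model.coef_.T + model.intercept_
manual_proba = (1.0 / (1.0 + np.exp(-z))).ravel()

print(model_proba, manual_proba)
```
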

### 3.1.  Import data

In [1]:
# Import data
import pandas as pd

data_path = '../../Datasets/categorical_feature_encoding/'

train = pd.read_csv(data_path + 'train.csv', index_col='id')
test = pd.read_csv(data_path + 'test.csv', index_col='id')
submission = pd.read_csv(data_path + 'sample_submission.csv', index_col='id')

### 3.2. Feature engineering
- When training an ML model, the feature data types should be **int** or **float** because the model cannot process text data directly
- Encoding means changing the way data is represented, here converting text categories into numbers
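
A minimal sketch of what one-hot encoding does to a single text column, using a made-up `nom_0` column. `OneHotEncoder` sorts the categories and returns a sparse matrix by default, so the example converts it to a dense array only for display.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up single-column frame with three distinct text categories
toy = pd.DataFrame({'nom_0': ['Green', 'Blue', 'Red', 'Green']})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy)  # sparse matrix, one column per category

categories = encoder.categories_[0].tolist()  # categories in sorted order
dense = encoded.toarray()                     # dense copy, for display only

print(categories)
print(dense)
```
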

In [2]:
# Merge train data and test data
all_data = pd.concat([train, test])

# Drop the target column so that features and target are handled separately
all_data = all_data.drop('target', axis=1)

all_data

| id | bin_0 | bin_1 | bin_2 | bin_3 | bin_4 | nom_0 | nom_1 | nom_2 | nom_3 | nom_4 | ... | nom_8 | nom_9 | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 | day | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | T | Y | Green | Triangle | Snake | Finland | Bassoon | ... | c389000ab | 2f4cb3d51 | 2 | Grandmaster | Cold | h | D | kr | 2 | 2 |
| 1 | 0 | 1 | 0 | T | Y | Green | Trapezoid | Hamster | Russia | Piano | ... | 4cd920251 | f83c56c21 | 1 | Grandmaster | Hot | a | A | bF | 7 | 8 |
| 2 | 0 | 0 | 0 | F | Y | Blue | Trapezoid | Lion | Russia | Theremin | ... | de9c9f684 | ae6800dd0 | 1 | Expert | Lava Hot | h | R | Jc | 7 | 2 |
| 3 | 0 | 1 | 0 | F | Y | Red | Trapezoid | Snake | Canada | Oboe | ... | 4ade6ab69 | 8270f0d71 | 1 | Grandmaster | Boiling Hot | i | D | kW | 2 | 1 |
| 4 | 0 | 0 | 0 | F | N | Red | Trapezoid | Lion | Canada | Oboe | ... | cb43ab175 | b164b72a7 | 1 | Grandmaster | Freezing | a | R | qP | 7 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 499995 | 0 | 0 | 0 | F | N | Green | Square | Lion | Canada | Theremin | ... | 9e4b23160 | acc31291f | 1 | Novice | Lava Hot | j | A | Gb | 1 | 3 |
| 499996 | 1 | 0 | 0 | F | Y | Green | Trapezoid | Lion | China | Piano | ... | cfbd87ed0 | eae3446d0 | 1 | Contributor | Lava Hot | f | S | Ed | 2 | 2 |
| 499997 | 0 | 1 | 1 | T | Y | Green | Trapezoid | Lion | Canada | Oboe | ... | 1108bcd6c | 33dd3cf4b | 1 | Novice | Boiling Hot | g | V | TR | 3 | 1 |
| 499998 | 1 | 0 | 0 | T | Y | Blue | Star | Hamster | Costa Rica | Bassoon | ... | 606ac930b | d4cf587dd | 2 | Grandmaster | Boiling Hot | g | X | Ye | 2 | 1 |


In [3]:
# One-hot encoding of all features
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder() # Create the one-hot encoder
all_data_encoded = encoder.fit_transform(all_data) # Fit the encoder and transform the data (returns a sparse matrix)

In [4]:
# Split data into train data and test data
num_train = len(train)
X_train = all_data_encoded[:num_train] # rows (0 ~ num_train-1)
X_test = all_data_encoded[num_train:]  # rows (num_train ~ end index)

# Target values
y = train['target']
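
The merge, encode, split pattern above can be checked on toy frames. In this sketch (made-up data, hypothetical `toy_` names) a category that appears only in the test rows still gets a column, and both splits end up with the same number of columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up train and test frames; 'Green' appears only in the test rows
toy_train = pd.DataFrame({'nom_0': ['Red', 'Blue'], 'target': [1, 0]})
toy_test = pd.DataFrame({'nom_0': ['Green', 'Red', 'Blue']})

# Merge, drop the target, and encode everything with one encoder
toy_all = pd.concat([toy_train, toy_test]).drop('target', axis=1)
toy_encoded = OneHotEncoder().fit_transform(toy_all)

# Split back by position: the first len(toy_train) rows are the train rows
n = len(toy_train)
toy_X_train = toy_encoded[:n]
toy_X_test = toy_encoded[n:]

print(toy_X_train.shape, toy_X_test.shape)
```
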

In [5]:
# Split train data into train and validation data
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y,
    test_size=0.1,
    stratify=y,  # Keep the same target distribution in train and validation data
    random_state=10)
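
One way to see the effect of `stratify` is to compare the share of positive targets in each split. The sketch below uses synthetic labels with an assumed 30% positive rate, not the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 30% of them with target 1
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 300 + [0] * 700)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo,
    test_size=0.1,
    stratify=y_demo,  # preserve the class ratio in both splits
    random_state=10)

# Share of positive targets in each split; both should be close to 0.3
train_rate = y_tr.mean()
valid_rate = y_va.mean()
print(train_rate, valid_rate)
```
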

### 3.3. Evaluation metric
- Use the scikit-learn library
    + sklearn.metrics.roc_auc_score
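
A small made-up example of how `roc_auc_score` is called. It takes the true labels and predicted scores (such as the probability of class 1), and because ROC AUC depends only on how the scores are ranked, rescaling them does not change the result.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly
auc = roc_auc_score(y_true, y_score)

# Halving every score keeps the ranking, so the AUC is unchanged
auc_scaled = roc_auc_score(y_true, [s * 0.5 for s in y_score])

print(auc, auc_scaled)
```
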

### 3.4. Train model
- Note
    + max_iter : The maximum number of iterations the solver may use to update the regression coefficients while training the model
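
Since `max_iter` is an upper bound, the solver may stop earlier once it converges. A quick way to check this is the fitted model's `n_iter_` attribute, shown here on synthetic data rather than the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_demo, y_demo)

# Number of iterations the solver actually ran (at most max_iter)
iterations_used = int(model.n_iter_[0])
print(iterations_used)
```
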

In [6]:
# Import library
from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression(max_iter=1000, random_state=42) # Create the model
logistic_model.fit(X_train, y_train) # Train the model

LogisticRegression(max_iter=1000, random_state=42)

### 3.5. Validate performance

In [7]:
# Predict probabilities of target value with validation data
logistic_model.predict_proba(X_valid)

array([[0.23291409, 0.76708591],
       [0.91410264, 0.08589736],
       [0.83000693, 0.16999307],
       ...,
       [0.24886846, 0.75113154],
       [0.49433266, 0.50566734],
       [0.95657777, 0.04342223]])

In [8]:
# Predict target value with validation data
logistic_model.predict(X_valid)

array([1, 0, 0, ..., 1, 1, 0])

In [9]:
# Predict the probability of target value 1 with validation data
y_valid_preds = logistic_model.predict_proba(X_valid)[:, 1]
y_valid_preds

array([0.76708591, 0.08589736, 0.16999307, ..., 0.75113154, 0.50566734,
       0.04342223])
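
The relationship between the three outputs above can be sketched on synthetic data. Each row of `predict_proba` sums to 1, column 1 is the probability of target 1, and `predict` returns the class with the larger probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = model.predict_proba(X_demo)  # columns: P(target=0), P(target=1)
labels = model.predict(X_demo)       # hard 0/1 predictions

# Picking the more probable class per row reproduces predict()
labels_from_proba = model.classes_[np.argmax(proba, axis=1)]

print(proba[:3])
print(labels[:3], labels_from_proba[:3])
```

ROC AUC is computed from the class-1 probabilities rather than the hard labels, since the probabilities keep the ranking information that thresholding discards.
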

In [10]:
# Validate model
from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_valid, y_valid_preds)
print(f'ROC AUC of validation data : {roc_auc:.4f}')

ROC AUC of validation data : 0.7965


### 3.6. Submit

In [11]:
# Predict with test data
y_preds = logistic_model.predict_proba(X_test)[:, 1]

# Save submission file
submission['target'] = y_preds
submission.to_csv('submission.csv')
submission

| id | target |
|---|---|
| 300000 | 0.308302 |
| 300001 | 0.699205 |
| 300002 | 0.067976 |
| 300003 | 0.444452 |
| 300004 | 0.893166 |
| ... | ... |
| 499995 | 0.311533 |
| 499996 | 0.142492 |
| 499997 | 0.406103 |
| 499998 | 0.491811 |


## References
- [EDA reference](https://www.kaggle.com/kabure/eda-feat-engineering-encode-conquer)
- [Modeling reference](https://www.kaggle.com/dkomyagin/cat-in-the-dat-0-80285-private-lb-solution)
- 머신러닝·딥러닝 문제해결 전략 (Machine Learning and Deep Learning Problem-Solving Strategies), 신백균 (Baekkyun Shin)