# Engineer Numerical Features

## 0. Custom Transformers for Numerical Features
**Feature engineering for numerical features** consists of:
- sklearn-Based Transformers
    - `StandardScaler_DF`
    - `RobustScaler_DF`
    - `MinMaxScaler_DF`
    - `MaxAbsScaler_DF`
    - `Normalizer_DF`
    - `PowerTransformer_DF`
    - `Binarizer_DF`
    - `QuantileTransformer_DF`     
    - `KBinsDiscretizer_DF`
    
- numpy-Based Transformers
    - `Log1pTransformer`
    - `SqrtTransformer`
    - `ReciprocalTransformer`

- Utility Transformers
    - `FeatureUnion_DF`
        - Concatenate the outputs of all custom transformers into a single dataframe (df).
    - `UniversalTransformer`
        - Transform a given df with a user-supplied function.
    - `PassTransformer`
        - Pass a given df to the next step without any transformation.
    - `FeatureSelector`
        - Select given features from a df.
    - `FeatureSelector_NUM`
        - Select NUMERICAL features from a df.
    - `FeatureSelector_CAT`
        - Select CATEGORICAL features from a df.
  
- **Note**: 
    - Data is in ***pandas dataframe*** format both before and after transformation.
    - ***sklearn/numpy-based transformers*** behave like their sklearn/numpy counterparts, but accept and return dataframes.

- References: 
    - sklearn Preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
    - sklearn Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
    - pandas-pipelines-custom-transformers: https://github.com/jem1031/pandas-pipelines-custom-transformers
    - In-house Code: 'data_procesing_for_modeling.py' 
    - Featuretools: https://docs.featuretools.com/#
    - Feature Engine: https://pypi.org/project/feature-engine/
    - Category Encoders: http://contrib.scikit-learn.org/categorical-encoding/    
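
The in-house `FE` module is not shown in this notebook, so the following is only a minimal sketch of how a dataframe-preserving wrapper of this kind could be written. The class names `StandardScalerDFSketch` and `Log1pTransformerSketch` are hypothetical, and the `prefix_column` naming scheme is an assumption about how the `prefix` argument is used.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class StandardScalerDFSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for FE.StandardScaler_DF (not the in-house code)."""

    def __init__(self, prefix="Standard"):
        self.prefix = prefix

    def fit(self, X, y=None):
        # Learn per-column mean and standard deviation from the TRAIN data.
        self.scaler_ = StandardScaler().fit(X)
        return self

    def transform(self, X):
        # sklearn returns a bare ndarray, so rebuild the index and column names.
        values = self.scaler_.transform(X)
        columns = [f"{self.prefix}_{c}" for c in X.columns]  # assumed naming scheme
        return pd.DataFrame(values, index=X.index, columns=columns)


class Log1pTransformerSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for FE.Log1pTransformer (stateless, numpy-based)."""

    def __init__(self, prefix="Log1p"):
        self.prefix = prefix

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # np.log1p applied to a dataframe returns a dataframe with the same index.
        return np.log1p(X).add_prefix(f"{self.prefix}_")


demo = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [10.0, 20.0, 30.0]}, index=[7, 8, 9])

# Roughly what a dataframe-aware feature union does: run each branch on the
# same input and concatenate the results column-wise.
parts = [t.fit(demo).transform(demo)
         for t in (StandardScalerDFSketch(), Log1pTransformerSketch())]
demo_fe = pd.concat(parts, axis=1)
print(demo_fe.columns.tolist())
```

Keeping the original index on every branch is what makes the column-wise concatenation safe, because `pd.concat(axis=1)` aligns rows by index.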

## 1. Pre-Processed TRAIN/TEST Datasets

In [11]:
### 0. Import Required Packages
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
import datetime as dt
%matplotlib inline

from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

### Import Custom Transformers
import PreProcessing_Custom_Transformers_v2 as PP
import FeatureEngineering_Custom_Transformers as FE
import FeatureCreation_Custom_Transformers as FC

# Suppress all warnings (this also hides sklearn's DataConversionWarning)
import warnings
warnings.simplefilter('ignore')

In [12]:
# Use Pre-Processed Data as TRAIN and TEST
df_train = pd.read_pickle('data_new/Vol_df_train_FiosONT_PP_5months.pkl')
df_test  = pd.read_pickle('data_new/Vol_df_test_FiosONT_PP_5months.pkl')

# TRAIN
train_X = df_train.drop('status', axis=1).copy()
train_y = df_train['status']

# TEST
test_X  = df_test.drop('status', axis=1).copy()
test_y  = df_test['status']

# Sample Size
print('*'*50 + '\nTRAIN vs TEST Datasets\n' + '*'*50)
# print('Competitive Area: ', df_train.competitive_area.unique())
print('The Shape of TRAIN Data: ' + str(df_train.shape))
print('The Shape of TEST Data:  ' + str(df_test.shape))

## Churn Rate by Sample Type
print('\n' + '*'*50 + '\nOverall Churn Rate\n' + '*'*50)
print('TRAIN: ', df_train.status.value_counts(normalize=True)[1].round(4))
print('TEST:  ', df_test.status.value_counts(normalize=True)[1].round(4), '\n')

**************************************************
TRAIN vs TEST Datasets
**************************************************
The Shape of TRAIN Data: (102451, 777)
The Shape of TEST Data:  (100858, 777)

**************************************************
Overall Churn Rate
**************************************************
TRAIN:  0.0367
TEST:   0.0405 



## 2. Feature Engineering

### Create/Use a Meta Custom Transformer for Feature Engineering

In [13]:
%%time

# (1) Make a Pipeline in Parallel/Sequence and Instantiate 
# Custom Transformers in Parallel for NUMERICAL Features
Pipe_FU          =  FE.FeatureUnion_DF([
                    ('Original', FE.PassTransformer(prefix='Original')),
                    ('Standard', FE.StandardScaler_DF(prefix='Standard')),
                    ('Robust', FE.RobustScaler_DF(prefix='Robust', quantile_range=(5.0, 95.0))),
                    ('Quantile', FE.QuantileTransformer_DF(prefix='Quantile', n_quantiles=100, random_state=0)),
                    ('Binary', FE.Binarizer_DF(prefix='Binary', threshold=0)),
                    ('MinMax', FE.MinMaxScaler_DF(prefix='MinMax', feature_range=(0, 1))),
                    ('MaxAbs', FE.MaxAbsScaler_DF(prefix='MaxAbs')),
                    ('Norm', FE.Normalizer_DF(prefix='Norm', norm='l1')),
                    ('KBin', FE.KBinsDiscretizer_DF(prefix='KBin', n_bins=10, encode='ordinal')),
                    ('Log1p', FE.Log1pTransformer(prefix='Log1p')),
                    ('Sqrt', FE.SqrtTransformer(prefix='Sqrt')),
                    ('Reciprocal', FE.ReciprocalTransformer(prefix='Reciprocal'))
                    ])

# Custom Transformers in Sequence for NUMERICAL Features
NUM_Pipe          = Pipeline([
                    ('Selector', FE.FeatureSelector_NUM()),
                    ('FU_Pipe', Pipe_FU)
                    ])


# (2) fit()
NUM_Pipe.fit(train_X, train_y)


# (3) transform()
train_X_FE = NUM_Pipe.transform(train_X)
test_X_FE  = NUM_Pipe.transform(test_X)

# Feature Dimension
print('\n' + '*'*50 + '\nBefore vs After Feature Engineering (FE)\n' + '*'*50)
print('TRAIN: Before FE:' + str(train_X.shape))
print('TRAIN: After FE: ' + str(train_X_FE.shape))
print('TEST:  After FE: ' + str(test_X_FE.shape))


**************************************************
Before vs After Feature Engineering (FE)
**************************************************
TRAIN: Before FE:(102451, 776)
TRAIN: After FE: (102451, 4992)
TEST:  After FE: (100858, 4992)
CPU times: user 1min 57s, sys: 33.6 s, total: 2min 31s
Wall time: 49.3 s
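
The output width above is consistent with each of the 12 branches in `Pipe_FU` emitting one column per selected numerical feature. The short check below works through that arithmetic. It assumes that every branch preserves the column count (which holds for `KBinsDiscretizer` with `encode='ordinal'`) and that the remaining columns were dropped by `FeatureSelector_NUM` as non-numerical.

```python
n_branches = 12          # transformers listed in Pipe_FU
n_cols_before = 776      # train_X.shape[1]
n_cols_after = 4992      # train_X_FE.shape[1]

# If every branch keeps one output column per input column,
# the output width must be an exact multiple of the branch count.
assert n_cols_after % n_branches == 0

n_numeric = n_cols_after // n_branches
n_dropped = n_cols_before - n_numeric   # presumably non-numerical columns

print(n_numeric, n_dropped)
```

The implied 416 numerical features agree with the `count` of 416 reported for `TRAIN_Original` in the correlation summary.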


### Correlation Summary: TRAIN vs TEST

In [14]:
p_list          = [.01, .05, .1, .2, .3, .4, .6, .7, .8, .9, .95, .99]
flag_NUM        = train_X.select_dtypes(exclude=[object, 'category']).columns.tolist()
corr_train      = train_X[flag_NUM].apply(lambda x: x.corr(train_y)).to_frame().describe(percentiles=p_list)
corr_train_all  = train_X_FE.apply(lambda x: x.corr(train_y)).to_frame().describe(percentiles=p_list)

corr_test       = test_X[flag_NUM].apply(lambda x: x.corr(test_y)).to_frame().describe(percentiles=p_list)
corr_test_all   = test_X_FE.apply(lambda x: x.corr(test_y)).to_frame().describe(percentiles=p_list)

corr_all         = pd.concat([corr_train, corr_train_all, corr_test, corr_test_all], axis=1)
corr_all.columns = ['TRAIN_Original', 'TRAIN_All', 'TEST_Original', 'TEST_All']
print('\n' + '*'*50 + '\nCorrelation Summary: TRAIN vs TEST\n' + '*'*50)
corr_all


**************************************************
Correlation Summary: TRAIN vs TEST
**************************************************


| Statistic | TRAIN_Original | TRAIN_All | TEST_Original | TEST_All |
|---|---|---|---|---|
| count | 416.0 | 4759.0 | 415.0 | 4750.0 |
| mean | 0.000401 | 0.000137 | 0.002034 | 0.001822 |
| std | 0.008391 | 0.008241 | 0.009544 | 0.009769 |
| min | -0.050927 | -0.05302 | -0.047839 | -0.052411 |
| 1% | -0.02125 | -0.021181 | -0.017396 | -0.018422 |
| 5% | -0.0119 | -0.012161 | -0.008281 | -0.008881 |
| 10% | -0.008043 | -0.008145 | -0.005923 | -0.006743 |
| 20% | -0.00458 | -0.004556 | -0.003767 | -0.003983 |
| 30% | -0.002516 | -0.002719 | -0.00254 | -0.002685 |
| 40% | -0.001268 | -0.001387 | -0.001394 | -0.001374 |
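
The `count` row for the engineered features (4759 on TRAIN) is smaller than the 4992 engineered columns. One plausible reason, offered as an assumption since the underlying data is not shown, is that some transformed columns have zero variance, so their Pearson correlation with the target is undefined. The toy example below shows that such columns yield `NaN` and that `describe()` leaves them out of `count`.

```python
import numpy as np
import pandas as pd

# Toy target and features; the names are illustrative only.
y = pd.Series([0, 1, 0, 1, 1])
X = pd.DataFrame({
    "varying":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "constant": [0.0, 0.0, 0.0, 0.0, 0.0],   # e.g. a binarized all-zero column
})

# Same pattern as the notebook cell: one correlation per column.
with np.errstate(invalid="ignore", divide="ignore"):
    corr = X.apply(lambda col: col.corr(y))

summary = corr.to_frame().describe()

n_total = X.shape[1]
n_valid = int(summary.loc["count"].iloc[0])   # NaN correlations are not counted
print(n_total, n_valid)
```

Comparing `corr.isna().sum()` against the gap between column count and `count` is a quick way to test whether this explanation holds for the real data.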


### Saving Transformed Numerical Features

In [15]:
df_train_FE_NUM = train_y.to_frame().\
                 merge(train_X_FE, how='inner', left_index=True, right_index=True)

df_test_FE_NUM  = test_y.to_frame().\
                  merge(test_X_FE, how='inner', left_index=True, right_index=True)

df_train_FE_NUM.to_pickle('data_new/Vol_df_train_FiosONT_FE_NUM_5months.pkl')
df_test_FE_NUM.to_pickle('data_new/Vol_df_test_FiosONT_FE_NUM_5months.pkl')
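
Because the merge above is an inner join on the index, it only keeps every row when the target and the transformed features share the same index. The sketch below, on made-up data, shows both the aligned case and the silent row loss that occurs if a transformer resets the index. The column name `Standard_a` is illustrative.

```python
import pandas as pd

y = pd.Series([0, 1, 0], index=[10, 11, 12], name="status")
X_fe = pd.DataFrame({"Standard_a": [0.1, 0.2, 0.3]}, index=[10, 11, 12])

# Aligned indices: every row survives the inner join.
merged = y.to_frame().merge(X_fe, how="inner", left_index=True, right_index=True)
assert merged.shape[0] == y.shape[0]

# Misaligned indices (here reset to 0..2): the inner join drops rows
# without raising an error.
X_bad = X_fe.reset_index(drop=True)
dropped = y.to_frame().merge(X_bad, how="inner", left_index=True, right_index=True)

print(merged.shape, dropped.shape)
```

An equivalent row-count assertion on `df_train_FE_NUM` and `df_test_FE_NUM` before calling `to_pickle` would catch this kind of misalignment early.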