# Engineer Categorical Features

## 0. Custom Transformers for Categorical Features
**Feature Creation** consists of the following custom transformers:
- `FeatureMaker`
    - Create new features that can be used as grouping variables
- `FeatureInteractionTransformer`
    - Account for interaction between any pair of given features.
    - Use newly created interaction features in further analyses
- `FeatureAggregator`
    - Aggregate both Numerical and Categorical features by a grouping variable.
    - Use aggregated features as new features. 
- `RareCategoryEncoder`
    - Re-group rare categories into either 'all_other' or the most common category.
    - Produce a more representative and manageable number of categories.
- `UniversalCategoryEncoder`
    - Encode CATEGORICAL features with selected encoding methods.
    - Encoding methods:
        - `ohe`: Generate a 0/1 binary variable for every label of each CATEGORICAL feature.
        - `pct`: Replace each category with its share of observations (%).
        - `count`: Replace each category with its number of observations.
        - `ordinal`: Replace each category with its rank when categories are ordered by the average value of target y.
        - `y_mean`: Replace each category with its average value of target y.
        - `y_log_ratio`: Replace each category with its log(p(Churner)/p(Non-Churner)).
        - `y_ratio`: Replace each category with its p(Churner)/p(Non-Churner).
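
The encoding methods above can be sketched with plain pandas on toy data. This is only an illustration of the definitions in the list, not the code inside `UniversalCategoryEncoder`: the column names (`plan`, `y`), the clipping constant `eps`, and the tie handling in the ordinal rank are all assumptions, and how the real transformer treats zero probabilities or unseen categories is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one categorical feature and a binary target (1 = churner).
df = pd.DataFrame({
    "plan": ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"],
    "y":    [1,   0,   0,   1,   1,   0,   0,   0,   1,   0],
})

# ohe: one 0/1 column per category label.
ohe = pd.get_dummies(df["plan"], prefix="plan").astype(int)

# count / y_mean: per-category number of rows and average target value.
stats = df.groupby("plan")["y"].agg(count="count", y_mean="mean")

# pct: per-category share of rows.
stats["pct"] = stats["count"] / len(df)

# ordinal: rank of each category when sorted by its target mean (1 = lowest).
stats["ordinal"] = stats["y_mean"].rank(method="dense").astype(int)

# y_ratio / y_log_ratio: odds and log-odds of churn within each category.
# Clipping keeps the ratio finite when a category is all 0s or all 1s
# (eps is an assumption made for this sketch).
eps = 1e-6
p = stats["y_mean"].clip(eps, 1 - eps)
stats["y_ratio"] = p / (1 - p)
stats["y_log_ratio"] = np.log(stats["y_ratio"])

# Map the per-category statistics back onto every row.
encoded = df[["plan"]].join(stats, on="plan")
print(stats)
```

The statistics are learned from the training rows only and then mapped onto new data, which is why the target-based encodings need `y` at fit time.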

- **Note**: 
    - Data is a ***pandas DataFrame*** both before and after each transformation.
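
To make the DataFrame-in, DataFrame-out convention concrete, here is a minimal sketch of an sklearn-compatible transformer that re-groups rare categories. The class name `RareCategoryGrouper` and its parameters are hypothetical and are not the API of `FC.RareCategoryEncoder`; the sketch covers only the 'all_other' option and a minimum-share threshold.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RareCategoryGrouper(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: relabel categories below a minimum share."""

    def __init__(self, min_pct=0.05, other_label="all_other"):
        self.min_pct = min_pct
        self.other_label = other_label

    def fit(self, X, y=None):
        # Learn, per column, which categories are frequent enough to keep.
        self.keep_ = {}
        for col in X.columns:
            freq = X[col].value_counts(normalize=True)
            self.keep_[col] = list(freq[freq >= self.min_pct].index)
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched.
        X = X.copy()
        for col, keep in self.keep_.items():
            # Rare categories, and categories unseen at fit time,
            # are both mapped to the fallback label.
            X[col] = X[col].where(X[col].isin(keep), self.other_label)
        return X


# Toy usage (hypothetical data): 'x' covers 90% of rows, 'y' and 'z' 5% each.
train = pd.DataFrame({"city": ["x"] * 18 + ["y"] + ["z"]})
test = pd.DataFrame({"city": ["x", "y", "w"]})

grouper = RareCategoryGrouper(min_pct=0.10).fit(train)
out = grouper.transform(test)
print(out)
```

Because `transform` returns a DataFrame with the original column names and index, the result can be passed straight to the next step of a `Pipeline`.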

- References: 
    - sklearn Preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
    - sklearn Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
    - Featuretools: https://docs.featuretools.com/#
    - Feature Engine: https://pypi.org/project/feature-engine/
    - Category Encoders: http://contrib.scikit-learn.org/categorical-encoding/    

## 1. Pre-Processed TRAIN/TEST Datasets

In [6]:
### Import Base Modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

### Import Custom Transformers
import PreProcessing_Custom_Transformers_v2 as PP
import FeatureEngineering_Custom_Transformers as FE
import FeatureCreation_Custom_Transformers as FC

# Suppress ALL warnings (this also hides sklearn's DataConversionWarning)
import warnings
warnings.simplefilter('ignore')
# To silence only DataConversionWarning instead, use:
# from sklearn.exceptions import DataConversionWarning
# warnings.filterwarnings(action='ignore', category=DataConversionWarning)

#### Note: `FeatureMaker` is used at pre-processing.

In [7]:
# Use Pre-Processed Data as TRAIN and TEST
df_train = pd.read_pickle('data_new/Vol_df_train_FiosONT_PP_5months.pkl')
df_test  = pd.read_pickle('data_new/Vol_df_test_FiosONT_PP_5months.pkl')

# TRAIN
train_X = df_train.drop('status', axis=1).copy()
train_y = df_train['status']

# TEST
test_X  = df_test.drop('status', axis=1).copy()
test_y  = df_test['status']

# Sample Size
print('*'*50 + '\nTRAIN vs TEST Datasets\n' + '*'*50)
# print('Competitive Area: ', df_train.competitive_area.unique())
print('The Shape of TRAIN Data: ' + str(df_train.shape))
print('The Shape of TEST Data:  ' + str(df_test.shape))

## Churn Rate by Sample Type
print('\n' + '*'*50 + '\nOverall Churn Rate\n' + '*'*50)
print('TRAIN: ', df_train.status.value_counts(normalize=True)[1].round(4))
print('TEST:  ', df_test.status.value_counts(normalize=True)[1].round(4), '\n')

**************************************************
TRAIN vs TEST Datasets
**************************************************
The Shape of TRAIN Data: (102451, 777)
The Shape of TEST Data:  (100858, 777)

**************************************************
Overall Churn Rate
**************************************************
TRAIN:  0.0367
TEST:   0.0405 



## 2. Feature Engineering

### Create/Use a Meta Custom Transformer for Feature Engineering

In [8]:
%%time

# (1) Make a Pipeline in Parallel/Sequence and Instantiate 
# List of Features Used as Parameters
fe_1st           = ['grp_tenure_3m', 'grp_payment_method',
                    'grp_payment_25dollar', 'grp_payment_change_10dollar', 'grp_payment_change_5pct',
                    'grp_payment_income', 'grp_call_csc', 'grp_call_bill',
                    'grp_call_csr', 'grp_call_tsr']
fe_2nd           = fe_1st + ['income_demos', 'ethnic', 'age_demos', 'archetype']
fe_group         = ['census', 'cleansed_city', 'cleansed_zipcode']

# Custom Transformers in Parallel for CATEGORICAL Features
Pipe_FU          =  FE.FeatureUnion_DF([
                    ('OHE', FC.UniversalCategoryEncoder(encoding_method='ohe')),
                    ('PCT', FC.UniversalCategoryEncoder(encoding_method='pct', prefix='PCT')),
                    ('COUNT', FC.UniversalCategoryEncoder(encoding_method='count', prefix='COUNT')),
                    ('ORDINAL', FC.UniversalCategoryEncoder(encoding_method='ordinal', prefix='ORDINAL')),
                    ('Y_MEAN', FC.UniversalCategoryEncoder(encoding_method='y_mean', prefix='Y_MEAN')),
                    ('Y_LOG_RATIO', FC.UniversalCategoryEncoder(encoding_method='y_log_ratio', prefix='Y_LOG_RATIO')),
                    ('Y_RATIO', FC.UniversalCategoryEncoder(encoding_method='y_ratio', prefix='Y_RATIO')),
                    ('Aggregation', FC.FeatureAggregator(features_grouping=fe_group, correlation_threshold=0.01))
                    ])

# Custom Transformers in Sequence for CATEGORICAL Features
CAT_Pipe          = Pipeline([
                    ('Interaction', FC.FeatureInteractionTransformer(features_1st=fe_1st, features_2nd=fe_2nd)),
                    ('RareCategory', FC.RareCategoryEncoder(category_min_pct=0.05, category_max_count=30)),
                    ('FU_Pipe', Pipe_FU)
                    ])

# (2) fit()
CAT_Pipe.fit(train_X, train_y)


# (3) transform()
train_X_FE = CAT_Pipe.transform(train_X)
test_X_FE  = CAT_Pipe.transform(test_X)

# Feature Dimension
print('\n' + '*'*50 + '\nBefore vs After Feature Engineering (FE)\n' + '*'*50)
print('TRAIN: Before FE:' + str(train_X.shape))
print('TRAIN: After FE: ' + str(train_X_FE.shape))
print('TEST:  After FE: ' + str(test_X_FE.shape))

'ordinal' encoding requires target y.
'y_mean' encoding requires target y.
'y_log_ratio' encoding requires target y.
'y_ratio' encoding requires target y.
'FeatureAggregator' requires target y.

**************************************************
Before vs After Feature Engineering (FE)
**************************************************
TRAIN: Before FE:(102451, 776)
TRAIN: After FE: (102451, 6891)
TEST:  After FE: (100858, 6891)
CPU times: user 1h 15min 19s, sys: 20min 29s, total: 1h 35min 49s
Wall time: 1h 8min 23s
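
The interaction step is the first stage of the pipeline, and every column it creates is then passed through each encoder in the union, so the feature count grows quickly. As a rough illustration of what pairwise interaction between two feature lists can look like, here is a minimal sketch. The function name, the `_x_` naming scheme, and the rule of skipping self-pairs and mirrored pairs are assumptions; the notebook does not show how `FC.FeatureInteractionTransformer` builds or names its columns.

```python
from itertools import product

import pandas as pd


def add_pairwise_interactions(df, features_1st, features_2nd, sep="_x_"):
    """Hypothetical helper: add one combined-label column per feature pair."""
    out = df.copy()
    seen = set()
    for a, b in product(features_1st, features_2nd):
        pair = frozenset((a, b))
        # Skip (a, a) and the mirrored duplicate (b, a) of an existing pair.
        if a == b or pair in seen:
            continue
        seen.add(pair)
        # The interaction label is the two category labels joined together.
        out[f"{a}{sep}{b}"] = df[a].astype(str) + "|" + df[b].astype(str)
    return out


# Toy usage (hypothetical columns); the second list contains the first,
# mirroring how fe_2nd extends fe_1st above.
toy = pd.DataFrame({
    "tenure": ["new", "old", "old"],
    "method": ["card", "card", "check"],
    "age":    ["young", "young", "senior"],
})
first = ["tenure", "method"]
second = first + ["age"]

toy_int = add_pairwise_interactions(toy, first, second)
print(toy_int.columns.tolist())
```

Interaction columns built this way can have many distinct labels, which is one reason a rare-category step follows before encoding.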


### Correlation Summary: TRAIN vs TEST

In [9]:
p_list          = [.01, .05, .1, .2, .3, .4, .6, .7, .8, .9, .95, .99]
corr_train_all  = train_X_FE.apply(lambda x: x.corr(train_y)).to_frame().describe(percentiles=p_list)
corr_test_all   = test_X_FE.apply(lambda x: x.corr(test_y)).to_frame().describe(percentiles=p_list)

corr_all         = pd.concat([corr_train_all, corr_test_all], axis=1)
corr_all.columns = ['TRAIN_All', 'TEST_All']
print('\n' + '*'*70 + '\nCorrelation Summary: TRAIN vs TEST\n' + '*'*70)
corr_all


**********************************************************************
Correlation Summary: TRAIN vs TEST
**********************************************************************


| Statistic | TRAIN_All | TEST_All |
|---|---|---|
| count | 6813.0 | 6743.0 |
| mean | 0.00497 | 0.003665 |
| std | 0.014884 | 0.013955 |
| min | -0.048018 | -0.054007 |
| 1% | -0.027277 | -0.027083 |
| 5% | -0.012891 | -0.013984 |
| 10% | -0.009446 | -0.008889 |
| 20% | -0.004493 | -0.00455 |
| 30% | -0.001772 | -0.002533 |
| 40% | 0.000505 | -0.000472 |
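
The `count` row is lower than the 6891 engineered columns. `describe()` counts only non-missing values, so the gap most likely comes from columns whose correlation with the target is undefined, for example columns that are constant within a sample. The sketch below reproduces this on toy data and uses `DataFrame.corrwith`, which computes the same column-by-column correlation as the `apply` call above. The toy columns and the random seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy features (hypothetical): one informative, one pure noise, one constant.
X = pd.DataFrame({
    "signal":   rng.normal(size=200),
    "noise":    rng.normal(size=200),
    "constant": np.ones(200),
})
# Binary target that depends on 'signal' only.
y = ((X["signal"] + rng.normal(size=200)) > 0).astype(int).rename("status")

# Pearson correlation of every column with the target.
corr = X.corrwith(y)

# A zero-variance column has no defined correlation, so it shows up as NaN
# and is excluded from the 'count' reported by describe().
n_undefined = int(corr.isna().sum())
summary = corr.describe(percentiles=[.05, .5, .95])

print(corr)
print("Columns with undefined correlation:", n_undefined)
```

Listing `corr[corr.isna()].index` on the real data would show which engineered columns fall into this group.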


### Saving Transformed Categorical Features

In [10]:
df_train_FE_CAT = train_y.to_frame().\
                  merge(train_X_FE, how='inner', left_index=True, right_index=True)

df_test_FE_CAT  = test_y.to_frame().\
                  merge(test_X_FE, how='inner', left_index=True, right_index=True)

df_train_FE_CAT.to_pickle('data_new/Vol_df_train_FiosONT_FE_CAT_5months.pkl')
df_test_FE_CAT.to_pickle('data_new/Vol_df_test_FiosONT_FE_CAT_5months.pkl')
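
An inner merge on the index silently drops any row whose index is missing on either side, so it can be worth checking the row count before saving and confirming that the pickle round-trips. The sketch below does this on toy data with a temporary file; the frame contents and the file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins (hypothetical) for the transformed features and the target.
X_fe = pd.DataFrame({"f1": [0.1, 0.2, 0.3]}, index=[10, 11, 12])
y = pd.Series([0, 1, 0], index=[10, 11, 12], name="status")

# Same merge pattern as above: target first, then the engineered features.
merged = y.to_frame().merge(X_fe, how="inner", left_index=True, right_index=True)

# With how='inner', unmatched rows disappear without an error, so compare lengths.
rows_preserved = len(merged) == len(y) == len(X_fe)

# Round-trip through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_FE_CAT.pkl")
    merged.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, index, or columns differ.
pd.testing.assert_frame_equal(merged, restored)
print(rows_preserved, restored.shape)
```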