# Pre-Process TRAIN/TEST Datasets

## 0. Custom Transformers for Pre-Processing
The **custom transformers** perform the following pre-processing steps:
- Missing values and imputation
    - `Remove_MissingFeatures`
        - Compute the percentage of missing values for each feature.
        - Remove features whose missing percentage is >= the missing threshold.
        - **Note**: Apply this step before **any missing value imputation**.
- Zero/near-zero variance features
    - `Remove_ConstantFeatures`
        - Identify features with a single unique value.
        - Remove those constant features.
- Duplicate/highly correlated features
    - `Remove_CorrelatedFeatures`
        - Compute pairwise correlations between features.
        - Remove features with abs(correlation) >= the correlation threshold.
        - **Note**: Applies to **numerical features only**.
    - `Remove_DuplicateFeatures`
        - Identify features whose columns duplicate another feature.
        - Remove the duplicated features.
- Data Type Conversion
    - `Use_DefaultDataType`
        - Identify features whose data types differ from the default data types.
        - Convert those features to the default data types.
        - **Note**: No features are removed.
- Default Imputation
    - `Use_DefaultImputer`
        - Fill missing values with the default imputation values.
        - **Note**: No features are removed.
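
To make the transformer contract concrete, here is a minimal sketch of what a `Remove_MissingFeatures`-style transformer could look like. It is not the implementation in `PreProcessing_Custom_Transformers_v2`: the class name `MissingFeatureRemover`, the attribute `features_to_drop_`, and the toy data are hypothetical. Only the `missing_threshold` parameter name is taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MissingFeatureRemover(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for PP.Remove_MissingFeatures (illustration only)."""

    def __init__(self, missing_threshold=0.99):
        self.missing_threshold = missing_threshold

    def fit(self, X, y=None):
        # Fraction of missing values per column, learned from the data passed to fit().
        missing_pct = X.isnull().mean()
        mask = missing_pct >= self.missing_threshold
        self.features_to_drop_ = missing_pct[mask].index.tolist()
        return self

    def transform(self, X):
        # Drop the same columns from whichever dataset is passed in (TRAIN or TEST).
        return X.drop(columns=self.features_to_drop_)


# Toy data (made up for this sketch).
toy_train = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, np.nan],
    'half_missing':   [1.0, np.nan, 3.0, np.nan],
    'complete':       [1, 2, 3, 4],
})

remover = MissingFeatureRemover(missing_threshold=0.99).fit(toy_train)
toy_train_reduced = remover.transform(toy_train)

print(remover.features_to_drop_)            # ['mostly_missing']
print(toy_train_reduced.columns.tolist())   # ['half_missing', 'complete']
```

Learning the columns to drop in `fit()` and only applying that list in `transform()` is what allows a single fitted pipeline to produce the same feature set for TRAIN and TEST.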

## 1. TRAIN/TEST Datasets

In [1]:
### 0. Import Required Packages
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
%matplotlib inline

from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

### Import Custom Transformers
import PreProcessing_Custom_Transformers_v2 as PP
import FeatureEngineering_Custom_Transformers as FE
import FeatureCreation_Custom_Transformers as FC

# Load the updated attribute/imputation dictionaries
# (the pipelines below rely on attribute_dict and attribute_imputer_dict).
%run 'data_new/attribute_dictionary.py'
%run 'data_new/imputation_dictionary.py'

# Suppress DataConversionWarning (note: this filter silences all warnings)
import warnings
warnings.simplefilter('ignore')

In [2]:
# Competitive Area = 'Fios ONT'
df_train = pd.read_pickle('data_new/Vol_df_train_Fios ONT Competitive Area_5months.pkl')
df_test  = pd.read_pickle('data_new/Vol_df_test_Fios ONT Competitive Area_5months.pkl')

# Use 'products' to choose eligible customers.
df_train = df_train[(df_train['products'].isin(['2: Video/OOL','3: Video/OOL/OV']))]
df_test  = df_test[(df_test['products'].isin(['2: Video/OOL','3: Video/OOL/OV']))]

# Use 'chc_id' as the index and sort by it.
# (Reassigning avoids an inplace modification of the filtered frames.)
df_train = df_train.set_index('chc_id').sort_index()
df_test  = df_test.set_index('chc_id').sort_index()

# TRAIN
train_X = df_train.drop('status', axis=1).copy()
train_y = df_train['status']

# TEST
test_X  = df_test.drop('status', axis=1).copy()
test_y  = df_test['status']

# Sample Size
print('*'*50 + '\nTRAIN vs TEST Datasets\n' + '*'*50)
print('Competitive Area: ', df_train.competitive_area.unique())
print('The Shape of TRAIN Data: ' + str(df_train.shape))
print('The Shape of TEST Data:  ' + str(df_test.shape))

## Churn Rate by Sample Type
print('\n' + '*'*50 + '\nOverall Churn Rate\n' + '*'*50)
print('TRAIN: ', df_train.status.value_counts(normalize=True)[1].round(4))
print('TEST:  ', df_test.status.value_counts(normalize=True)[1].round(4), '\n')

# print(train_X.index)
# print(train_y.index)
# print(test_X.index)
# print(test_y.index)

**************************************************
TRAIN vs TEST Datasets
**************************************************
Competitive Area:  ['Fios ONT Competitive Area']
The Shape of TRAIN Data: (102451, 1023)
The Shape of TEST Data:  (100858, 1023)

**************************************************
Overall Churn Rate
**************************************************
TRAIN:  0.0367
TEST:   0.0405 



## 2. Pre-Processing

### Pre-Processing Data

In [3]:
%%time

# (1) Make a Pipeline and Instantiate
Pipe_PP = Pipeline([
                    ('DataType', PP.Use_DefaultDataType(default_dtypes=attribute_dict)),
                    ('Missing', PP.Remove_MissingFeatures(missing_threshold=0.99)), 
                    ('Constant1', PP.Remove_ConstantFeatures(unique_threshold=1, missing_threshold=0.00)), 
                    ('Correlated1', PP.Remove_CorrelatedFeatures(correlation_threshold=0.99)), 
                    ('Duplicate', PP.Remove_DuplicateFeatures()),
                    ('Imputer', PP.Use_DefaultImputer(default_imputers=attribute_imputer_dict, default_dtypes=attribute_dict)),
                    ('Constant2', PP.Remove_ConstantFeatures(unique_threshold=1, missing_threshold=0.00)), 
                    ('Correlated2', PP.Remove_CorrelatedFeatures(correlation_threshold=0.90))
                  ])

# 'Constant2' is added to handle features that become constant after imputation,
# i.e. (1) the only unique value is 0 and (2) the default imputation value is 0.
# 'Correlated2' is added to further remove correlated features after imputation.


# (2) fit()
Pipe_PP.fit(train_X, train_y)


# (3) transform()
train_X_PP = Pipe_PP.transform(train_X)
test_X_PP  = Pipe_PP.transform(test_X)

# Feature Dimension
print('\n' + '*'*50 + '\nBefore vs After Transformation\n' + '*'*50)
print('TRAIN: Before Transformation:' + str(train_X.shape))
print('TRAIN: After Transformation: ' + str(train_X_PP.shape))
print('TEST:  After Transformation: ' + str(test_X_PP.shape))

# CPU times: user 46min 8s, sys: 8min 53s, total: 55min 2s

**************************************************
Pre-Processing: Use_DefaultDataType
**************************************************
- It will convert data types into default ones.

**************************************************
Pre-Processing: Remove_MissingFeatures
**************************************************
- It will remove features with a high missing pct.

**************************************************
Pre-Processing: Remove_ConstantFeatures
**************************************************
- It will remove features with 1 unique value(s).

**************************************************
Pre-Processing: Remove_CorrelatedFeatures
**************************************************
- It will work on Numerical Features Only, doing nothing on Categorical Features.
- It may take 10+ minutes. Be patient!

**************************************************
Pre-Processing: Remove_DuplicateFeatures
**************************************************
- It may take 10+ 
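
The reason for the second constant-feature pass (`'Constant2'`) can be shown on toy data. This is a sketch in plain pandas rather than with the custom transformers. The column names are made up, and counting NaN as its own level before imputation is an assumption about how the first pass sees such a column.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one column holds only 0 or NaN, the other genuinely varies.
toy = pd.DataFrame({
    'only_zero_or_missing': [0.0, np.nan, 0.0, np.nan],
    'varies':               [1.0, np.nan, 2.0, 3.0],
})

# Before imputation the first column has two distinct states: 0 and NaN.
levels_before = toy.nunique(dropna=False)

# Imputing with a default value of 0 collapses {0, NaN} into the single value 0.
levels_after = toy.fillna(0).nunique(dropna=False)

print(levels_before.to_dict())  # {'only_zero_or_missing': 2, 'varies': 4}
print(levels_after.to_dict())   # {'only_zero_or_missing': 1, 'varies': 4}
```

A feature like `only_zero_or_missing` carries no information once imputed, which is why a constant-feature check is repeated after the imputer step.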

### Creating New Features

In [4]:
%%time

# (1) Make a Pipeline and Instantiate
Pipe_NF = Pipeline([
                    ('Imputer', PP.Use_DefaultImputer(default_imputers=attribute_imputer_dict, default_dtypes=attribute_dict)),
                    ('NewFeatures', FC.FeatureMaker())
                  ])

# 'Imputer' is added to handle missing values.
# 'NewFeatures' is added to create new features.


# (2) fit()
Pipe_NF.fit(train_X, train_y)


# (3) transform()
train_X_NF = Pipe_NF.transform(train_X)
test_X_NF  = Pipe_NF.transform(test_X)

# Feature Dimension
print('\n' + '*'*50 + '\nBefore vs After Transformation\n' + '*'*50)
print('TRAIN: Before Transformation:' + str(train_X.shape))
print('TRAIN: After Transformation: ' + str(train_X_NF.shape))
print('TEST:  After Transformation: ' + str(test_X_NF.shape))
print('\n' + '*'*50 + '\nNewly Created Features\n' + '*'*50 + '\n', 
      Pipe_NF.named_steps['NewFeatures'].features_new_)

# CPU times: user 17min 27s, sys: 4min 27s, total: 21min 55s

**************************************************
Pre-Processing: Use_DefaultImputere
**************************************************
- It will append default imputation values to missings.


**************************************************
Before vs After Transformation
**************************************************
TRAIN: Before Transformation:(102451, 1022)
TRAIN: After Transformation: (102451, 16)
TEST:  After Transformation: (100858, 16)

**************************************************
Newly Created Features
**************************************************
 ['grp_tenure_3m', 'grp_tenure_1m', 'grp_tenure_6m', 'grp_payment_method', 'grp_payment_25dollar', 'grp_payment_10dollar', 'grp_payment_change_5dollar', 'grp_payment_change_10dollar', 'grp_payment_change_2pct', 'grp_payment_change_5pct', 'ratio_payment_income', 'grp_payment_income', 'grp_call_csc', 'grp_call_bill', 'grp_call_csr', 'grp_call_tsr']
CPU times: user 16min 36s, sys: 4min 59s, total: 21min 36s
Wall time

In [5]:
# train_X_NF.groupby('grp_tenure_3m').count()
# train_X_NF.grp_tenure_3m.value_counts()

### Combining Processed and New Features

In [6]:
# Create Datasets that Consist of Pre-processed and New Features.
df_train_NF_PP = (
    train_y.to_frame()
           .merge(train_X_NF, how='inner', left_index=True, right_index=True)
           .merge(train_X_PP, how='inner', left_index=True, right_index=True)
)

df_test_NF_PP = (
    test_y.to_frame()
          .merge(test_X_NF, how='inner', left_index=True, right_index=True)
          .merge(test_X_PP, how='inner', left_index=True, right_index=True)
)

# Save Data for Feature Engineering
# Pre-processed data with new features
df_train_NF_PP.to_pickle('data_new/Vol_df_train_FiosONT_PP_5months.pkl')
df_test_NF_PP.to_pickle('data_new/Vol_df_test_FiosONT_PP_5months.pkl')

print(df_train_NF_PP.shape)
print(df_test_NF_PP.shape)

(102451, 777)
(100858, 777)
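
Because both merges join on the index, any column name shared by the two feature sets would be kept twice and renamed with pandas' default `_x`/`_y` suffixes rather than raising an error. A small sketch of a pre-merge check follows. The helper `combine_features` and the toy frames standing in for `train_y`, `train_X_NF`, and `train_X_PP` are hypothetical.

```python
import pandas as pd


def combine_features(y, X_new, X_pp):
    """Join target, new features, and pre-processed features on the index."""
    # Fail loudly instead of letting merge() add _x/_y suffixes.
    overlap = X_new.columns.intersection(X_pp.columns)
    if len(overlap) > 0:
        raise ValueError(f'Overlapping feature names: {overlap.tolist()}')
    return (
        y.to_frame()
         .merge(X_new, how='inner', left_index=True, right_index=True)
         .merge(X_pp, how='inner', left_index=True, right_index=True)
    )


# Toy data (made up) sharing the same 'chc_id' index.
idx = pd.Index([101, 102, 103], name='chc_id')
y = pd.Series([0, 1, 0], index=idx, name='status')
X_new = pd.DataFrame({'grp_tenure_3m': [1, 2, 3]}, index=idx)
X_pp = pd.DataFrame({'feature_a': [0.1, 0.2, 0.3]}, index=idx)

combined = combine_features(y, X_new, X_pp)
print(combined.shape)             # (3, 3)
print(combined.columns.tolist())  # ['status', 'grp_tenure_3m', 'feature_a']
```

With an inner join, the row counts above (102451 and 100858) matching the original TRAIN and TEST sizes indicates that no `chc_id` was lost in either merge.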
