## Custom Transformers for Pre-Processing
The **Custom Transformers** perform the following pre-processing steps:
- [Missing values and imputation](#Missing-values-and-imputation)
    - `Remove_MissingFeatures`
        - Identify missing percentages of features.
        - Remove features with missing % >= threshold missing %.
        - **Note**: Apply this before **any missing value imputation**.
- [Zero/near-zero variance features](#Zero/near-zero-variance-features)
    - `Remove_ConstantFeatures`
        - Identify features with a single unique value.
        - Remove those constant features.
- [Duplicate/highly correlated features](#Duplicate/highly-correlated-features)
    - `Remove_CorrelatedFeatures`
        - Compute pairwise correlation between features.
        - Remove features with abs(correlation) >= threshold correlation.
        - **Note**: Relevant to **numerical features only**.
    - `Remove_DuplicateFeatures`
        - Identify features whose columns are exact duplicates of another feature.
        - Remove the duplicated features.
- [Data Type Conversion](#Data-Type-Conversion)
    - `Use_DefaultDataType`
        - Identify features having data types inconsistent with default data types.
        - Convert the data types into the default data types if inconsistent.
        - **Note**: No feature removed!
- [Default Imputation](#Default-Imputation)
    - `Use_DefaultImputer`
        - Use default imputation values.
        - **Note**: No feature removed!        
- Extreme values/outliers (**TBD if necessary**)
    - How to define extreme values
    - How to replace extreme values
- Non-informative features (**TBD if necessary**)
    - Identify non-informative features
    - Decide which ones to drop
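
The four removal rules listed above can be sketched with plain pandas. This is only an illustration: the helper names (`find_missing`, `find_constant`, `find_correlated`, `find_duplicates`) and the toy data are hypothetical, and the actual transformers in `PreProcessing_Custom_Transformers_v2` may apply the rules differently (for example, in how they count missing values when looking for constant features).

```python
import numpy as np
import pandas as pd


def find_missing(X, missing_threshold=0.99):
    """Features whose share of missing values is >= missing_threshold."""
    missing_pct = X.isna().mean()
    return missing_pct[missing_pct >= missing_threshold].index.tolist()


def find_constant(X, unique_threshold=1):
    """Features with at most unique_threshold distinct values.

    Assumption: NaN is counted as a distinct value (dropna=False).
    """
    n_unique = X.nunique(dropna=False)
    return n_unique[n_unique <= unique_threshold].index.tolist()


def find_correlated(X, correlation_threshold=0.90):
    """Numerical features whose |correlation| with an earlier column is >= threshold."""
    corr = X.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] >= correlation_threshold).any()]


def find_duplicates(X):
    """Features whose column repeats an earlier column value for value."""
    return X.columns[X.T.duplicated()].tolist()


# Toy data (hypothetical feature names).
X = pd.DataFrame({
    "all_missing": [np.nan, np.nan, np.nan, np.nan],
    "constant": [7, 7, 7, 7],
    "base": [1.0, 2.0, 3.0, 4.0],
    "scaled": [2.0, 4.0, 6.0, 8.5],  # nearly a multiple of "base"
    "plan": pd.Series(["a", "b", "a", "b"], dtype=object),
    "plan_copy": pd.Series(["a", "b", "a", "b"], dtype=object),
})

# Apply the rules one after another, dropping as we go.
dropped = {}
for name, finder in [("missing", find_missing), ("constant", find_constant),
                     ("correlated", find_correlated), ("duplicate", find_duplicates)]:
    dropped[name] = finder(X)
    X = X.drop(columns=dropped[name])

print(dropped)
print(list(X.columns))
```

Because the correlation rule only looks at numerical columns, the duplicated categorical column in this toy example is only caught by the duplicate rule.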

## Custom Transformers: Parameters, Methods and Attributes
### Parameters
#### Common Parameters
All Custom Transformers require a pandas DataFrame (df) that contains all features:
- **X**: a df with all possible features
    - e.g.: Remove_DuplicateFeatures().fit(X)

#### Additional Parameters
Custom Transformers may require additional parameters depending on their purpose:
- **Transformer specific parameter(s)** such as 'correlation_threshold'
    - default:
        - Remove_MissingFeatures(**missing_threshold=0.99**)
        - Remove_ConstantFeatures(**unique_threshold=1**)
        - Remove_CorrelatedFeatures(**correlation_threshold=0.90**)
    - e.g.: Remove_CorrelatedFeatures(correlation_threshold=0.99).fit(X)        
- **y**: a pandas Series that represents the churn status:
    - default = None
    - e.g.: Remove_CorrelatedFeatures(correlation_threshold=0.99).fit(X, y)

### Methods
#### Common Methods
All Custom Transformers have the same methods as **any other sklearn transformer**:
- **fit**: CustomTransformer().fit()
- **transform**: CustomTransformer().transform()
- **fit_transform**: CustomTransformer().fit_transform()
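
As a rough sketch of how these methods fit together, the class below follows the usual scikit-learn convention: `fit` learns which columns to drop and returns `self`, `transform` applies that decision to any DataFrame, and `fit_transform` is inherited from `TransformerMixin`. The class name `DropMissingSketch` is hypothetical and stands in for a transformer such as `Remove_MissingFeatures`; it is not the author's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropMissingSketch(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a transformer like Remove_MissingFeatures."""

    def __init__(self, missing_threshold=0.99):
        # Constructor arguments are stored unchanged, as sklearn expects.
        self.missing_threshold = missing_threshold

    def fit(self, X, y=None):
        # y is accepted for pipeline compatibility but not used here.
        missing_pct = X.isna().mean()
        self.features_dropped_ = missing_pct[
            missing_pct >= self.missing_threshold
        ].index.tolist()
        self.features_kept_ = [c for c in X.columns if c not in self.features_dropped_]
        return self

    def transform(self, X):
        # Apply the column selection learned in fit(), e.g. to TEST data.
        return X[self.features_kept_].copy()


train_X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [np.nan, np.nan, np.nan]})
test_X = pd.DataFrame({"a": [4.0, 5.0], "b": [6.0, np.nan]})

step = DropMissingSketch(missing_threshold=0.99)
train_out = step.fit_transform(train_X)  # fit on TRAIN, then transform TRAIN
test_out = step.transform(test_X)        # reuse the TRAIN decision on TEST

print(step.features_dropped_, list(test_out.columns))
```

Note that the TEST frame loses column `b` even though it is not fully missing there; the decision is made once, on the data passed to `fit`.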

#### Additional Methods
Some Custom Transformers have additional methods as below:
- **plot**: CustomTransformer().plot() 

### Attributes
#### Common Attributes
All Custom Transformers have the below attributes:
- **summary_dropped_**: CustomTransformer().fit().summary_dropped_
    - a df that includes dropped features with ***simple*** summary statistics 
- **summary_dropped_NUM_**: CustomTransformer().fit().summary_dropped_NUM_
    - a df that includes dropped **Numerical** features with ***full*** summary statistics 
- **summary_dropped_CAT_**: CustomTransformer().fit().summary_dropped_CAT_
    - a df that includes dropped **Categorical** features with ***full*** summary statistics 
- **features_dropped_**: CustomTransformer().fit().features_dropped_
    - a list of dropped features
- **features_kept_**: CustomTransformer().fit().features_kept_
    - a list of kept features
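
To make the shape of these attributes concrete, here is a minimal sketch that derives comparable objects from a list of dropped features. The column names, the choice of statistics in the "simple" summary, and the helper `describe_or_empty` are all assumptions for illustration; the real attributes may contain different statistics.

```python
import pandas as pd

X = pd.DataFrame({
    "constant_num": [0.0, 0.0, 0.0, 0.0],
    "constant_cat": pd.Series(["x", "x", "x", "x"], dtype=object),
    "useful_num": [1.0, 2.0, 3.0, 4.0],
})

# Pretend a constant-feature step decided to drop these two columns.
features_dropped = ["constant_num", "constant_cat"]
features_kept = [c for c in X.columns if c not in features_dropped]

dropped = X[features_dropped]

# "Simple" summary: one row per dropped feature.
summary_dropped = pd.DataFrame({
    "dtype": dropped.dtypes.astype(str),
    "missing_pct": dropped.isna().mean(),
    "n_unique": dropped.nunique(),
})


def describe_or_empty(frame):
    """describe() raises on a frame without columns, so guard that case."""
    return frame.describe().T if frame.shape[1] else pd.DataFrame()


# "Full" summaries, split by numerical vs. categorical features.
summary_dropped_NUM = describe_or_empty(dropped.select_dtypes(include="number"))
summary_dropped_CAT = describe_or_empty(dropped.select_dtypes(exclude="number"))

print(features_kept)
print(summary_dropped)
```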

#### Additional Attributes
Some Custom Transformers have additional attributes as below:
- **features_irrelevant (Class Attribute)**: CustomTransformer().features_irrelevant
    - a list of irrelevant features that are pre-excluded before pre-processing
- `Use_DefaultImputer` has the following attributes:
    - **summary_imputation_**: Use_DefaultImputer().fit().summary_imputation_
    - **summary_imputation_NUM_**: Use_DefaultImputer().fit().summary_imputation_NUM_
    - **summary_imputation_CAT_**: Use_DefaultImputer().fit().summary_imputation_CAT_
- `Use_DefaultDataType` has the following attributes:
    - **summary_inconsistent_dtypes_**: Use_DefaultDataType().fit().summary_inconsistent_dtypes_
    - **summary_inconsistent_NUM_**: Use_DefaultDataType().fit().summary_inconsistent_NUM_
    - **summary_inconsistent_CAT_**: Use_DefaultDataType().fit().summary_inconsistent_CAT_
    - **features_inconsistent_dtypes_**: Use_DefaultDataType().fit().features_inconsistent_dtypes_
    - **features_inconsistent_NUM_**: Use_DefaultDataType().fit().features_inconsistent_NUM_
    - **features_inconsistent_CAT_**: Use_DefaultDataType().fit().features_inconsistent_CAT_

## Things You Need to Do
### Data: Your Own TRAIN/TEST
- Create your own TRAIN/TEST datasets from a master churn data file.
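
One possible way to build the two datasets is a split stratified on `status`, sketched below with pandas only. The notebook does not say how its TRAIN/TEST files were produced, so the 50/50 ratio, the random seed, and the synthetic `master` frame are assumptions; an out-of-time split may be more appropriate for churn data.

```python
import pandas as pd

# Hypothetical master file: one row per customer, 10% churners (status == 1).
master = pd.DataFrame({
    "chc_id": range(1000),
    "feature_1": [i % 7 for i in range(1000)],
    "status": [1 if i % 10 == 0 else 0 for i in range(1000)],
}).set_index("chc_id")

# Sample half of each status group for TRAIN; the remainder becomes TEST.
df_train = master.groupby("status").sample(frac=0.5, random_state=0)
df_test = master.drop(index=df_train.index)

print(df_train.shape, df_test.shape)
print(df_train["status"].mean(), df_test["status"].mean())
```

The resulting frames could then be saved with `DataFrame.to_pickle`, which matches the `pd.read_pickle` calls used in the sample cells.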

### Dictionary: Default Data Type and Imputation Values
- Import default data type and imputation value dictionaries.
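
The contents of `attribute_dictionary.py` and `imputation_dictionary.py` are not shown here. As a sketch, assume each dictionary maps a feature name to a default dtype or a default fill value; under that assumption, the conversion and imputation steps reduce to the pandas calls below. The feature names and values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical dictionaries; the real structure may differ.
attribute_dict = {
    "tenure_months": "float64",
    "monthly_charge": "float64",
    "plan_type": "object",
}
attribute_imputer_dict = {
    "tenure_months": 0.0,
    "monthly_charge": 0.0,
    "plan_type": "UNKNOWN",
}

X = pd.DataFrame({
    "tenure_months": pd.Series(["12", "7", "3"], dtype=object),  # numbers stored as text
    "monthly_charge": [50.0, np.nan, 70.0],
    "plan_type": pd.Series(["A", None, "B"], dtype=object),
})

# Data type conversion: find and fix columns that disagree with the default dtype.
inconsistent = [c for c, t in attribute_dict.items() if str(X[c].dtype) != t]
X = X.astype({c: attribute_dict[c] for c in inconsistent})

# Default imputation: fill each column with its own default value.
X = X.fillna(value=attribute_imputer_dict)

print(inconsistent)
print(X)
```

Neither step removes a feature; only dtypes and missing values change.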

### Pre-Process Sample Data

In [1]:
### 0. Import Required Packages
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
%matplotlib inline

# Import Custom Transformers Here!!!
import PreProcessing_Custom_Transformers_v2 as PP         ### Import PreProcessing Custom Transformers

# Use the Updated Attribute/Imputation Dictionaries!!!
%run 'data_new/attribute_dictionary.py'
%run 'data_new/imputation_dictionary.py'

In [2]:
df_train = pd.read_pickle('data_new/df_train_Fios ONT Competitive Area.pkl')
df_test  = pd.read_pickle('data_new/df_test_Fios ONT Competitive Area.pkl')

# Use 'chc_id' as index
df_train.set_index('chc_id', inplace=True)
df_test.set_index('chc_id', inplace=True)

# TRAIN
train_X = df_train.drop('status', axis=1).copy()
train_y = df_train['status']

# TEST
test_X  = df_test.drop('status', axis=1).copy()
test_y  = df_test['status']

# Sample Size
print('*'*50 + '\nTRAIN vs TEST Datasets\n' + '*'*50)
print('Competitive Area: ', df_train.competitive_area.unique())
print('The Shape of TRAIN Data: ' + str(df_train.shape))
print('The Shape of TEST Data:  ' + str(df_test.shape))

## Churn Rate by Sample Type
print('\n' + '*'*50 + '\nOverall Churn Rate\n' + '*'*50)
print('TRAIN: ', df_train.status.value_counts(normalize=True)[1].round(4))
print('TEST:  ', df_test.status.value_counts(normalize=True)[1].round(4), '\n')

**************************************************
TRAIN vs TEST Datasets
**************************************************
Competitive Area:  ['Fios ONT Competitive Area']
The Shape of TRAIN Data: (126928, 1017)
The Shape of TEST Data:  (123820, 1017)

**************************************************
Overall Churn Rate
**************************************************
TRAIN:  0.0199
TEST:   0.0258 



In [3]:
%%time
from sklearn.pipeline import Pipeline

# (1) Make a Pipeline and Instantiate
PP_Pipe = Pipeline([
                    ('DataType', PP.Use_DefaultDataType(default_dtypes=attribute_dict)),
                    ('Missing', PP.Remove_MissingFeatures(missing_threshold=0.99)), 
                    ('Constant1', PP.Remove_ConstantFeatures(unique_threshold=1, missing_threshold=0.00)), 
                    ('Correlated1', PP.Remove_CorrelatedFeatures(correlation_threshold=0.99)), 
                    ('Duplicate', PP.Remove_DuplicateFeatures()),
                    ('Imputer', PP.Use_DefaultImputer(default_imputers=attribute_imputer_dict, default_dtypes=attribute_dict)),
                    ('Constant2', PP.Remove_ConstantFeatures(unique_threshold=1, missing_threshold=0.00)), 
                    ('Correlated2', PP.Remove_CorrelatedFeatures(correlation_threshold=0.90))
                  ])

# 'Constant2' is added to handle features whose (1) only unique value is 0 and (2) default imputation value is 0.
# 'Correlated2' is added to further remove correlated features after imputation.


# (2) fit()
# default: y=None
PP_Pipe.fit(train_X, train_y)


# (3) transform()
train_X_Preprocessed = PP_Pipe.transform(train_X)
test_X_Preprocessed  = PP_Pipe.transform(test_X)

# Feature Dimension
print('\n' + '*'*50 + '\nBefore vs After Transformation\n' + '*'*50)
print('TRAIN: Before Transformation:' + str(train_X.shape))
print('TRAIN: After Transformation: ' + str(train_X_Preprocessed.shape))
print('TEST:  After Transformation: ' + str(test_X_Preprocessed.shape))

**************************************************
Pre-Processing: Use_DefaultDataType
**************************************************
- It will convert data types into default ones.

**************************************************
Pre-Processing: Remove_MissingFeatures
**************************************************
- It will remove features with a high missing pct.

**************************************************
Pre-Processing: Remove_ConstantFeatures
**************************************************
- It will remove features with 1 unique value(s).

**************************************************
Pre-Processing: Remove_CorrelatedFeatures
**************************************************
- It will work on Numerical Features Only, doing nothing on Categorical Features.

- It may take 10+ minutes. Be patient!

**************************************************
Pre-Processing: Remove_DuplicateFeatures
**************************************************
- It may take 10+
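
The pipeline runs the constant-feature step a second time after imputation. A small sketch of why that can matter, assuming missing values count as a distinct value before imputation (the feature name is hypothetical): a feature whose only observed value is 0 becomes constant once its missing values are filled with a default of 0.

```python
import numpy as np
import pandas as pd

feature = pd.Series([0.0, np.nan, 0.0, np.nan], name="hypothetical_count")

# Before imputation: 0 and NaN are two distinct values.
n_unique_before = feature.nunique(dropna=False)

# After default imputation with 0: only a single value remains.
n_unique_after = feature.fillna(0.0).nunique(dropna=False)

print(n_unique_before, n_unique_after)  # prints: 2 1
```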

#### Print Common Attributes, `CustomTransformer().fit().features_dropped_`

In [4]:
print('\n' + '*'*50 + '\nFeatures Dropped Due to High Missing Pct\n' + '*'*50 + '\n', 
      PP_Pipe.named_steps['Missing'].features_dropped_)

print('\n' + '*'*50 + '\nFeatures Dropped Due to Constant Value before Imputation\n' + '*'*50 + '\n', 
      PP_Pipe.named_steps['Constant1'].features_dropped_)

print('\n' + '*'*50 + '\nFeatures Dropped Due to Multicollinearity before Imputation\n' + '*'*50 + '\n', 
      PP_Pipe.named_steps['Correlated1'].features_dropped_)

print('\n' + '*'*50 + '\nFeatures Dropped Due to Duplicate Columns before Imputation\n' + '*'*50 + '\n', 
      PP_Pipe.named_steps['Duplicate'].features_dropped_)

print('\n' + '*'*50 + '\nFeatures Dropped Due to Constant Value after Imputation\n' + '*'*50 + '\n', 
      PP_Pipe.named_steps['Constant2'].features_dropped_)

print('\n' + '*'*50 + '\nFeatures Dropped Due to Multicollinearity after Imputation\n' + '*'*50 + '\n', 
      PP_Pipe.named_steps['Correlated2'].features_dropped_)


**************************************************
Features Dropped Due to High Missing Pct
**************************************************
 ['ddp_recurring_m1', 'espanol_save_offer_lift_amount', 'espanol_save_offer_months_remaining', 'gf_mig', 'hbo_svod', 'mover', 'ooldown_m1', 'ooldown_m2', 'ooldown_m3', 'other_offer_lift_amount', 'other_offer_months_remaining', 'outage_m1', 'outage_ool_m1', 'outage_ov_m1', 'outage_vid_m1', 'portout_m1', 'portout_m2', 'portout_m3', 'portout_m4', 'premium_offer_lift_amount', 'premium_offer_months_remaining', 'tcs_m1', 'uversezip', 'viddown_m1', 'viddown_m2', 'viddown_m3']

**************************************************
Features Dropped Due to Constant Value before Imputation
**************************************************
 ['commprod_ind', 'curr_addl_did_blocks', 'curr_sip_sessions', 'curr_tf_lines', 'fiosont', 'music_choice_m1', 'music_choice_m2', 'ool4b_flag', 'ov4b_flag', 'range_extend_m1', 'range_extend_m2', 'rewindbuf_chrg_m1', 'rewind

#### Print/Retrieve Common Attributes, `CustomTransformer().fit().summary_dropped_` if necessary