# Compare Classifiers
*Paulo G. Martinez* Fri. Apr. 3, 2020

---

# classifier-kitchen-sink
Attempt efficient classifier comparison and promotion

## This is an experiment in "elegant brute force."
- Given a "large" but "tidy" data set with a "wide" set of potentially sparse and/or redundant numeric, categorical, and datetime features and multiple "imbalanced" targets (not a multi-class column, but two or more binary-class target columns)
  - Attempt to find an efficient way of comparing the success of various classification models 
  -   each with various model configurations

P1. One of the premises of this experiment is that it would take "too much" time to assess the viability of, let alone solve, the problem using a "traditional" domain-knowledge-based approach.

P2. Another premise is that compute and memory resources are so limited that a brute-force iteration through models would also take "too long."

## Workflow 1.0
"Failing Fast." We begin with the most difficult but cheapest-to-compute configurations, in hopes of finding success before having to "bloat" all the way up to full brute-force iteration.

For each binary-classification target
- find "natural" class_weights
- for sample_size in [small, medium, ... large, full]:
  - for weight_balance in [natural_weights, slight_upsample, ..., aggressive_upsample]:
    - for model in [rf, cnb, boost, etc]:
      - pre-process features
        - impute nulls
        - perhaps scale
        - fit, train, test model
        - score model
        - score feature importance
        - while model scores unsatisfactorily and feature refinement is possible:
          - refine feature selection
          - refit, retrain, retest, rescore
          - (if model fails to score satisfactorily move on to next model)
        - **if model scores satisfactorily on sample**
          - save model, its score and config to "training_sample_weight_success"
          - while unsampled training data is available:
            - out_of_sample test and score model on increasingly larger samples
            - **if model scores satisfactorily when "generalized" to data out of its training sample:**
              - save model, its score and config to "training_sample_weight_generalsample_success"
              - (if model failed to generalize well move on to next model)
      - (when all models have been evaluated on this class weight, upsample and repeat.)
    - (when all weight resamplings have been attempted, increase the sample size and repeat.)
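
The loop order in the outline above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `fail_fast_configs`, `first_success`, `evaluate`, and `threshold` are hypothetical names, and stopping at the first satisfactory score is only one possible reading of "failing fast."

```python
from itertools import product


def fail_fast_configs(sample_sizes, weight_balances, models):
    """Yield (sample_size, weight_balance, model) tuples, smallest sample first.

    The model varies fastest, then the weight balance, then the sample size,
    which mirrors the nesting of the outline.
    """
    # sorted() puts the cheapest (smallest) samples at the front
    for config in product(sorted(sample_sizes), weight_balances, models):
        yield config


def first_success(configs, evaluate, threshold):
    """Return (config, score) for the first config scoring >= threshold, else None.

    `evaluate` is an assumed callable that takes
    (sample_size, weight_balance, model) and returns a numeric score.
    """
    for config in configs:
        score = evaluate(*config)
        if score >= threshold:
            return config, score
    return None


# toy scorer standing in for "fit, test, score": only "cnb" scores well here
def toy_evaluate(sample_size, weight_balance, model):
    return 0.9 if model == "cnb" else 0.1


configs = fail_fast_configs([0.4, 0.1], ["natural", "balanced"], ["rf", "cnb"])
print(first_success(configs, toy_evaluate, threshold=0.8))
```

Because the configurations come from a generator, the more expensive ones are never built unless the cheaper ones fail.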



**prep environment**

In [1]:
# import open source software packages
# numerical manipulation and analysis
import numpy as np
# data frame manipulation and analysis
import pandas as pd

# import builtins
from datetime import datetime
import re

# visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# notebook variables
notebook_verbose = True

**load data**
Use the Titanic data set for its mix of numeric and categorical features.

In [2]:
# load data
data_df = pd.read_csv('data/semi_processed_all.csv')
data_df.head(2)

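| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | EMBARKED | EMBARKED_S | EMBARKED_C | EMBARKED_Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3 | braund | male | 22.0 | 1 | 0 | a | 7.25 | NaN | S | True | False | False |
| 1 | 1.0 | 1 | cumings | female | 38.0 | 1 | 0 | pc | 71.2833 | C85 | C | False | True | False |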


In [3]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    float64
 1   Pclass      1309 non-null   int64  
 2   Name        1309 non-null   object 
 3   Sex         1309 non-null   object 
 4   Age         1046 non-null   float64
 5   SibSp       1309 non-null   int64  
 6   Parch       1309 non-null   int64  
 7   Ticket      352 non-null    object 
 8   Fare        1308 non-null   float64
 9   Cabin       295 non-null    object 
 10  EMBARKED    1307 non-null   object 
 11  EMBARKED_S  1309 non-null   bool   
 12  EMBARKED_C  1309 non-null   bool   
 13  EMBARKED_Q  1309 non-null   bool   
dtypes: bool(3), float64(3), int64(3), object(5)
memory usage: 116.5+ KB


# Workflow 1.0

## Declare Targets

In [4]:
target_columns = [col for col in data_df.columns if col == col.upper()]
target_columns

['EMBARKED', 'EMBARKED_S', 'EMBARKED_C', 'EMBARKED_Q']

In [5]:
binary_target_columns = ['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C']
binary_target_columns

['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C']

In [52]:
target_columns = ['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C', 'EMBARKED']

### get targets' natural class-weights

In [88]:
def get_target_class_weights(targets = [], data_frame = None, verbose = True):
    '''
    Get the class weights of a set of target columns in a data frame.
    Print the information and return it as a dictionary of classes and weights.
    Expects datetime and pandas to be available.
    
    Parameters
    ----------
    targets : list of target-column names in a dataframelike object.
        Default []
    
    data_frame: a pandas.DataFrame like object with the .shape attribute
        Default None
    
    verbose: boolean. Whether or not the function prints out feedback.
        Default True
    '''
    # validate inputs before using them
    if data_frame is None or data_frame.shape[0] == 0:
        raise TypeError(f'''Expected non empty input for data_frame but received {type(data_frame)}''')
    if not targets:
        raise TypeError(f'''Expected non empty input for targets but received "falsy" type {type(targets)}''')
    # validate targets in data_frame
    for col in targets:
        if col not in data_frame.columns:
            raise NameError(f'''target column {col} not found in received data_frame.''')

    if verbose:
        feedback = 'Getting class weights'
        print(feedback+'\n'+'-'*len(feedback))
        print(datetime.now(), "Available Data Records:", f"{data_frame.shape[0]:.2E} =", f"{data_frame.shape[0]:,}\n")
    
    # initialize a storage 
    target_class_weights = {}

    # for each target
    for target_col in targets:
        if verbose:
            print(datetime.now(), 'Working on target:', target_col)

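        # get natural class weights as each class's share of all rows
        # (NaN never equals itself, so a missing-value class is reported with weight 0.0)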
        target_class_weights[target_col] = {
            clss:sum(data_frame[target_col]==clss)/data_frame.shape[0]
            for clss in data_frame[target_col].unique()
        }
        if verbose:
            print(datetime.now(), "Got target's natural class weights:")
            print(target_class_weights[target_col], '\n')
        
    
    return target_class_weights

In [90]:
# spot check
get_target_class_weights(target_columns, data_df)

Getting class weights
---------------------
2020-04-04 21:46:24.550360 Available Data Records: 1.31E+03 = 1,309

2020-04-04 21:46:24.550584 Working on target: EMBARKED_Q
2020-04-04 21:46:24.552480 Got target's natural class weights:
{False: 0.906035141329259, True: 0.09396485867074103} 

2020-04-04 21:46:24.552793 Working on target: EMBARKED_S
2020-04-04 21:46:24.554022 Got target's natural class weights:
{True: 0.6982429335370511, False: 0.3017570664629488} 

2020-04-04 21:46:24.556088 Working on target: EMBARKED_C
2020-04-04 21:46:24.560986 Got target's natural class weights:
{False: 0.7937356760886173, True: 0.20626432391138275} 

2020-04-04 21:46:24.561328 Working on target: EMBARKED
2020-04-04 21:46:24.564328 Got target's natural class weights:
{'S': 0.6982429335370511, 'C': 0.20626432391138275, 'Q': 0.09396485867074103, nan: 0.0} 



{'EMBARKED_Q': {False: 0.906035141329259, True: 0.09396485867074103},
 'EMBARKED_S': {True: 0.6982429335370511, False: 0.3017570664629488},
 'EMBARKED_C': {False: 0.7937356760886173, True: 0.20626432391138275},
 'EMBARKED': {'S': 0.6982429335370511,
  'C': 0.20626432391138275,
  'Q': 0.09396485867074103,
  nan: 0.0}}
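
The `nan: 0.0` entry above appears because `NaN == NaN` is always false, so rows with a missing `EMBARKED` value are never counted (the `info()` output lists 1307 non-null values out of 1309). A minimal standard-library sketch of one way to count missing labels as their own class follows; `natural_class_weights` is a hypothetical helper and the labels are toy values, not the notebook's data.

```python
import math
from collections import Counter


def natural_class_weights(labels):
    """Return each class's share of `labels` (a list), with missing values as one class."""

    def normalize(label):
        # map every NaN/None to a single shared key, because NaN != NaN
        if label is None or (isinstance(label, float) and math.isnan(label)):
            return None
        return label

    counts = Counter(normalize(label) for label in labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# toy labels only
toy_labels = ["S", "S", "C", float("nan"), "Q", "S", float("nan"), "S"]
weights = natural_class_weights(toy_labels)
print(weights)
```

In pandas, `Series.value_counts(normalize=True, dropna=False)` gives comparable shares without a custom helper.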

## Workflow 1.0

In [None]:
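# classifiers referenced in the default arguments below
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def incrementally_compare_classifiers(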
    data_frame = None, target_columns = [],
    class_weights = ['natural'],
    classifiers = [
        #SVC(kernel="linear", C=0.025),
        #SVC(gamma=2, C=1),
        #GaussianProcessClassifier(1.0 * RBF(1.0)),
        #DecisionTreeClassifier(max_depth=5),
        RandomForestClassifier(
            #max_depth=5, n_estimators=10, max_features=1
        ),
        ComplementNB(),
        #MLPClassifier(alpha=1, max_iter=1000),
        AdaBoostClassifier(),
        KNeighborsClassifier(3),
        #GaussianNB(),
        QuadraticDiscriminantAnalysis()
    ],
    verbose = True
):
    '''
    For each sample and each weight balance, pre-process the data, train_test_split it, 
    fit, predict, and score the model, then save the model, its configuration, and its 
    evaluation metrics to the destination directory for later reference.
    
    Requirements: Assumes sklearn modules are available, as well as pandas
    
    Parameters
    -----------
    data_frame: a pandas pd.DataFrame with "tidy" data. Accepts nullable numerical, datetime, 
        and string/categorical columns.
    
    target_columns: list of target columns containing labeled classes for each row in a data_frame
    
    class_weights: list of dictionaries of class weights to use during model training and scoring 
        on sample data
        Default ['natural'], will get the natural weights of each class in each target column.
        Ex: ['natural', {True:0.20, False:0.80}, {True:0.50, False:0.50}]
        
    classifiers: List of sklearn classifiers to compare
        Default [
            RandomForestClassifier(),
            ComplementNB(),
            AdaBoostClassifier(),
            KNeighborsClassifier(3),
            QuadraticDiscriminantAnalysis()
        ]
    
    verbose: boolean, whether or not the function prints feedback
        Default True
    
    # Context
    ---------
    ## This is an experiment in "elegant brute force."
    - Given a "large" but "tidy" data set with a "wide" set of potentially sparse and/or redundant 
        numeric, categorical, and datetime features and multiple "imbalanced" targets (not a 
        multi-class column, but two or more binary-class target columns)
      - Attempt to find an efficient way of comparing the success of various classification models 
          - each with various model configurations

    P1. One of the premises of this experiment is that it would take "too much" time to assess the 
        viability, let alone solve the problem using a "traditional" domain-knowledge based approach.
    P2. Another premise is that compute and memory resources are so limited that a brute force 
        iteration through models would also take "too long."

    ## Workflow 1.0 "Failing Fast." We begin with the most difficult but cheapest-to-compute 
        configurations, in hopes of finding success before having to "bloat" all the way up to 
        full brute-force iteration.

    For each binary-classification target
    - find "natural" class_weights
    - for sample_size in [small, medium, ... large, full]:
      - for weight_balance in [natural_weights, slight_upsample, ..., aggressive_upsample]:
        - for model in [rf, cnb, boost, etc]:
          - pre-process features
            - impute nulls
            - perhaps scale
            - fit, train, test model
            - score model
            - score feature importance
            - while model scores unsatisfactorily and feature refinement is possible:
              - refine feature selection
              - refit, retrain, retest, rescore
              - (if model fails to score satisfactorily move on to next model)
            - **if model scores satisfactorily on sample**
              - save model, its score and config to "training_sample_weight_success"
              - while unsampled training data is available:
                - out_of_sample test and score model on increasingly larger samples
                - **if model scores satisfactorily when "generalized" to data out of its training sample:**
                  - save model, its score and config to "training_sample_weight_generalsample_success"
                  - (if model failed to generalize well move on to next model)
          - (when all models have been evaluated on this class weight, upsample and repeat.)
        - (when all weight resamplings have been attempted, increase the sample size and repeat.)
    '''
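
The `class_weights` parameter described in the docstring mixes the string `'natural'` with explicit dictionaries. One way such a list might be resolved into concrete dictionaries for a single target is sketched below. `resolve_class_weights` is a hypothetical helper, and the sum-to-about-1 check is an assumption about what counts as a valid configuration.

```python
def resolve_class_weights(class_weights, natural_weights):
    """Turn a mixed list such as ['natural', {True: 0.2, False: 0.8}] into dicts only.

    `natural_weights` is the measured class-weight dictionary for one target,
    for example the value returned for that target by a class-weight helper.
    """
    resolved = []
    for config in class_weights:
        if isinstance(config, str) and config == 'natural':
            # substitute the placeholder with a copy of the measured weights
            resolved.append(dict(natural_weights))
        elif isinstance(config, dict):
            # assumed rule: explicit weights should sum to about 1
            if round(sum(config.values()), 2) != 1:
                raise ValueError(f'Expected weights summing to about 1 but got {config}')
            resolved.append(dict(config))
        else:
            raise TypeError(f"Expected 'natural' or a dict but got {type(config)}")
    return resolved


# toy natural weights, rounded for readability
toy_natural = {False: 0.91, True: 0.09}
resolved_configs = resolve_class_weights(
    ['natural', {True: 0.20, False: 0.80}, {True: 0.50, False: 0.50}],
    toy_natural,
)
print(resolved_configs)
```

Copying each dictionary keeps later edits to a resolved configuration from changing the caller's inputs.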

## Developing "master" script
Which uses the helper functions defined above

**input variables/parameters**

In [95]:
target_columns

['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C', 'EMBARKED']

In [97]:
data_df.head(2)

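| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | EMBARKED | EMBARKED_S | EMBARKED_C | EMBARKED_Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3 | braund | male | 22.0 | 1 | 0 | a | 7.25 | NaN | S | True | False | False |
| 1 | 1.0 | 1 | cumings | female | 38.0 | 1 | 0 | pc | 71.2833 | C85 | C | False | True | False |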


In [120]:
data_df.to_dict(orient = 'index')

{0: {'Survived': 0.0,
  'Pclass': 3,
  'Name': 'braund',
  'Sex': 'male',
  'Age': 22.0,
  'SibSp': 1,
  'Parch': 0,
  'Ticket': 'a',
  'Fare': 7.25,
  'Cabin': nan,
  'EMBARKED': 'S',
  'EMBARKED_S': True,
  'EMBARKED_C': False,
  'EMBARKED_Q': False},
 1: {'Survived': 1.0,
  'Pclass': 1,
  'Name': 'cumings',
  'Sex': 'female',
  'Age': 38.0,
  'SibSp': 1,
  'Parch': 0,
  'Ticket': 'pc',
  'Fare': 71.2833,
  'Cabin': 'C85',
  'EMBARKED': 'C',
  'EMBARKED_S': False,
  'EMBARKED_C': True,
  'EMBARKED_Q': False},
 2: {'Survived': 1.0,
  'Pclass': 3,
  'Name': 'heikkinen',
  'Sex': 'female',
  'Age': 26.0,
  'SibSp': 0,
  'Parch': 0,
  'Ticket': 'stono',
  'Fare': 7.925,
  'Cabin': nan,
  'EMBARKED': 'S',
  'EMBARKED_S': True,
  'EMBARKED_C': False,
  'EMBARKED_Q': False},
 3: {'Survived': 1.0,
  'Pclass': 1,
  'Name': 'futrelle',
  'Sex': 'female',
  'Age': 35.0,
  'SibSp': 1,
  'Parch': 0,
  'Ticket': nan,
  'Fare': 53.1,
  'Cabin': 'C123',
  'EMBARKED': 'S',
  'EMBARKED_S': True,
  'EM

In [119]:
# declare target columns and dataframe
target_columns = ['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C', 'EMBARKED']
data_frame = data_df
# reuse the notebook-level verbosity flag
verbose = notebook_verbose

# declare sample sizes as percentages of total data size
sample_sizes = [.10, .20, .40, .80, 1.00]
# validate sample sizes whether or not feedback is printed
for pct in sample_sizes:
    if pct <= 0 or pct > 1:
        raise ValueError(f"Expected percentage in (0, 1] but received {pct}")
# feedback on sample sizes
if verbose:
    print(datetime.now(), 'Got following sample sizes:')
    for pct in sample_sizes:
        print(f' {pct*data_frame.shape[0]:,} is {100*pct}% of {data_frame.shape[0]}')

# declare weight configurations to use as float percentages
use_natural_weights = True
use_balanced_weights = True
custom_weight_configs = [
    {True:.20, False:.80},
    {True:.30, False:.70}
]
# Validation for each class_weight configuration
for weighting in custom_weight_configs:
    # spot check that they add up to 1 (for 100%)
    if not np.round(pd.Series(weighting).sum(), 2) == 1:
        raise ValueError(f'Invalid or incomplete class weight values. Expected sum to about 1 but got {pd.Series(weighting)}')

# if testing/debugging on a single target
one_run = False

2020-04-04 22:19:22.139780 Got following sample sizes:
 130.9 is 10.0% of 1309
 261.8 is 20.0% of 1309
 523.6 is 40.0% of 1309
 1,047.2 is 80.0% of 1309
 1,309.0 is 100.0% of 1309
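
The sizes printed above are fractional (130.9 rows, for example), so they need to become whole row counts before any sampling can happen. A minimal sketch of that conversion follows. `sample_row_counts` is a hypothetical helper, and rounding up is an assumption, chosen so that even the smallest sample is never empty.

```python
import math


def sample_row_counts(fractions, n_rows):
    """Convert sample fractions in (0, 1] into whole row counts."""
    counts = []
    for fraction in fractions:
        if not 0 < fraction <= 1:
            raise ValueError(f"Expected fraction in (0, 1] but received {fraction}")
        # round up, but never ask for more rows than exist
        counts.append(min(n_rows, math.ceil(fraction * n_rows)))
    return counts


row_counts = sample_row_counts([.10, .20, .40, .80, 1.00], 1309)
print(row_counts)  # [131, 262, 524, 1048, 1309]
```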


In [113]:
# for each target column in a single data frame
for target in target_columns:
    if verbose:
        print(datetime.now(), 'WORKING ON TARGET COLUMN:', target)
        print('===================================================')
    
    # initialize weights to use
    class_weight_configs = []
    # get natural class weights
    if use_natural_weights:
        # make this the first item in the list
        class_weight_configs = [get_target_class_weights([target], data_frame)[target]] + class_weight_configs
    # get balanced class weights
    if use_balanced_weights:
        # take the classes from the data so this also works without natural weights
        classes = data_frame[target].unique()
        # make this the last item in the list
        class_weight_configs += [{clss:np.round(1/len(classes),2) for clss in classes}]
    
    # FOR EACH DATA SAMPLE
    for pct in sample_sizes:
        # placeholder: the per-sample work is not implemented yet
        pass


    # if testing/debugging on a single target
    if one_run:
        break

2020-04-04 22:06:33.427921 WORKING ON TARGET COLUMN: EMBARKED_Q
Getting class weights
---------------------
2020-04-04 22:06:33.428204 Available Data Records: 1.31E+03 = 1,309

2020-04-04 22:06:33.428328 Working on target: EMBARKED_Q
2020-04-04 22:06:33.430406 Got target's natural class weights:
{False: 0.906035141329259, True: 0.09396485867074103} 

2020-04-04 22:06:33.433526 WORKING ON TARGET COLUMN: EMBARKED_S
Getting class weights
---------------------
2020-04-04 22:06:33.434307 Available Data Records: 1.31E+03 = 1,309

2020-04-04 22:06:33.434427 Working on target: EMBARKED_S
2020-04-04 22:06:33.435812 Got target's natural class weights:
{True: 0.6982429335370511, False: 0.3017570664629488} 

2020-04-04 22:06:33.437457 WORKING ON TARGET COLUMN: EMBARKED_C
Getting class weights
---------------------
2020-04-04 22:06:33.437853 Available Data Records: 1.31E+03 = 1,309

2020-04-04 22:06:33.437972 Working on target: EMBARKED_C
2020-04-04 22:06:33.439140 Got target's natural class weight