# Compare Classifiers
*Paulo G. Martinez* Fri. Apr. 3, 2020

---

# classifier-kitchen-sink
Attempt efficient classifier comparison and promotion

## This is an experiment in "elegant brute force."
- Given a "large" but "tidy" data set with a "wide" set of potentially sparse and/or redundant numeric, categorical, and datetime features and multiple "imbalanced" targets (not one multi-class column, but two or more binary-class target columns):
  - Attempt to find an efficient way of comparing the success of various classification models,
  - each with various model configurations.

P1. One premise of this experiment is that it would take "too much" time to assess the viability of a "traditional" domain-knowledge-based approach, let alone to solve the problem with one.

P2. Another premise is that compute and memory resources are so limited that a brute-force iteration through all models would also take "too long."

## Workflow_1.0
"Failing Fast." We begin with the configurations that are hardest to learn from but cheapest to compute (small samples, natural class weights), in hopes of finding success before having to "bloat" all the way up to full brute-force iteration.

For each binary-classification target
- find "natural" class_weights
- for sample_size in [small, medium, ..., large, full]:
  - for weight_balance in [natural_weights, slight_upsample, ..., aggressive_upsample]:
    - for model in [rf, cnb, boost, etc]:
      - pre-process features
        - impute nulls
        - perhaps scale
      - fit, train, test model
      - score model
      - score feature importance
      - while model scores unsatisfactorily and feature refinement is possible:
        - refine feature selection
        - refit, retrain, retest, rescore
        - (if model fails to score satisfactorily, move on to next model)
      - **if model scores satisfactorily on sample**
        - save model, its score and config to "training_sample_weight_success"
        - while unsampled training data is available:
          - out_of_sample test and score model on increasingly larger samples
          - **if model scores satisfactorily when "generalized" to data out of its training sample:**
            - save model, its score and config to "training_sample_weight_generalsample_success"
            - (if model failed to generalize well, move on to next model)
      - (when all models have been evaluated on this class weight, upsample and repeat.)
    - (when all weight resamplings have been attempted, increase the sample size and repeat.)
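
The nesting order of the outline above can be sketched as a generator that yields configurations cheapest-first. This is only an illustration of the loop order, not the notebook's implementation: `iter_configs` is a hypothetical helper, and the weight-balance and model labels are placeholder strings taken from the outline.

```python
from itertools import product

# Placeholder values: the sample sizes match the notebook's loop below,
# the weight balances and model names are just labels from the outline.
SAMPLE_SIZES = [0.01, 0.05, 0.10, 0.20, 0.40, 0.80, 1.00]
WEIGHT_BALANCES = ["natural_weights", "slight_upsample", "aggressive_upsample"]
MODELS = ["rf", "cnb", "boost"]


def iter_configs(targets, sample_sizes=SAMPLE_SIZES,
                 weight_balances=WEIGHT_BALANCES, models=MODELS):
    """Yield (target, sample_size, weight_balance, model), cheapest first."""
    for target in targets:
        # product() varies its right-most argument fastest, which mirrors
        # the outline: models inside weight balances inside sample sizes.
        for sample_size, weight_balance, model in product(
                sample_sizes, weight_balances, models):
            yield target, sample_size, weight_balance, model


configs = list(iter_configs(["EMBARKED_Q"]))
print(len(configs))
print(configs[0])
```

Because the generator is lazy, a caller can stop iterating as soon as a model scores satisfactorily, which is the point of "failing fast."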



**prep environment**

In [28]:
# import open source software packages
# numerical manipulation and analysis
import numpy as np
# data frame manipulation and analysis
import pandas as pd

# import builtins
from datetime import datetime
import re

# visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# notebook variables
notebook_verbose = True

**load data**
Use the Titanic data set for its mix of numeric and categorical features.

In [4]:
# load data
data_df = pd.read_csv('data/semi_processed_all.csv')
data_df.head(2)

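|   | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | EMBARKED | EMBARKED_S | EMBARKED_C | EMBARKED_Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3 | braund | male | 22.0 | 1 | 0 | a | 7.25 | NaN | S | True | False | False |
| 1 | 1.0 | 1 | cumings | female | 38.0 | 1 | 0 | pc | 71.2833 | C85 | C | False | True | False |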


In [5]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    float64
 1   Pclass      1309 non-null   int64  
 2   Name        1309 non-null   object 
 3   Sex         1309 non-null   object 
 4   Age         1046 non-null   float64
 5   SibSp       1309 non-null   int64  
 6   Parch       1309 non-null   int64  
 7   Ticket      352 non-null    object 
 8   Fare        1308 non-null   float64
 9   Cabin       295 non-null    object 
 10  EMBARKED    1307 non-null   object 
 11  EMBARKED_S  1309 non-null   bool   
 12  EMBARKED_C  1309 non-null   bool   
 13  EMBARKED_Q  1309 non-null   bool   
dtypes: bool(3), float64(3), int64(3), object(5)
memory usage: 116.5+ KB
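
Since the workflow includes an "impute nulls" step, it may help to see how much of each column is missing. The sketch below only redoes the arithmetic on the non-null counts printed above; on the real frame, `data_df.isna().mean()` should give the same shares. The variable names are illustrative.

```python
# Non-null counts copied from the data_df.info() output above.
n_rows = 1309
non_null_counts = {
    "Survived": 891,
    "Age": 1046,
    "Ticket": 352,
    "Fare": 1308,
    "Cabin": 295,
    "EMBARKED": 1307,
}

# Share of missing values per column.
missing_share = {
    col: 1 - count / n_rows for col, count in non_null_counts.items()
}

# Print the columns with the most missing values first.
for col, share in sorted(missing_share.items(),
                         key=lambda item: item[1], reverse=True):
    print(f"{col:<10}{share:6.1%}")
```

Note that `EMBARKED` has two nulls while the three boolean `EMBARKED_*` columns have none, so those two rows are `False` in all three binary targets.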


## Declare Targets

In [6]:
# treat every column whose name is entirely upper case as a target
target_columns = [col for col in data_df.columns if col == col.upper()]
target_columns

['EMBARKED', 'EMBARKED_S', 'EMBARKED_C', 'EMBARKED_Q']

In [22]:
# keep only the binary (one-hot) targets; the multi-class EMBARKED column is excluded
binary_target_columnns = ['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C']
binary_target_columnns

['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C']
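
The binary targets are hard-coded above. As a possible alternative, they could be derived from the column dtypes. The sketch below assumes the binary targets are exactly the upper-case columns stored as booleans, which holds for this data set but is an assumption in general; `toy_df` and `binary_targets` are illustrative names and the toy values are made up.

```python
import pandas as pd

# Toy frame using the same naming convention as data_df (values are made up).
toy_df = pd.DataFrame({
    "Pclass": [3, 1],
    "EMBARKED": ["S", "C"],
    "EMBARKED_S": [True, False],
    "EMBARKED_C": [False, True],
    "EMBARKED_Q": [False, False],
})

# Same convention as the notebook: all-uppercase column names are targets.
target_columns = [col for col in toy_df.columns if col == col.upper()]

# Keep only the targets stored as booleans, i.e. the binary ones.
binary_targets = (
    toy_df[target_columns].select_dtypes(include="bool").columns.tolist()
)

print(binary_targets)
```

The result follows the column order of the frame, so it may differ from the order of the hard-coded list.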

## Workflow 1.0

In [32]:
# initialize a storage dict
target_class_weights = {}
verbose = notebook_verbose

# for each binary target
for binary_target in binary_target_columnns:
    
    if verbose:
        print(datetime.now(), 'Working on target:', binary_target)
    
    # get natural class weights: the share of rows that fall in each class
    target_class_weights[binary_target] = {
        clss: (data_df[binary_target] == clss).mean()
        for clss in data_df[binary_target].unique()
    }
    
    if verbose:
        print(datetime.now(), "Got target's natural class weights:")
        print(target_class_weights[binary_target], '\n')
    
    # define sample sizes
    for sample_size in [.01, .05, .10, .20, .40, .80, 1.00]:
        if verbose:
            print(datetime.now(), '''Using sample size:''', sample_size)

2020-04-03 20:43:39.456308 Working on target: EMBARKED_Q
2020-04-03 20:43:39.458253 Got target's natural class weights:
{False: 0.906035141329259, True: 0.09396485867074103} 

2020-04-03 20:43:39.460380 Using sample size: 0.01
2020-04-03 20:43:39.460458 Using sample size: 0.05
2020-04-03 20:43:39.460527 Using sample size: 0.1
2020-04-03 20:43:39.460595 Using sample size: 0.2
2020-04-03 20:43:39.460661 Using sample size: 0.4
2020-04-03 20:43:39.460727 Using sample size: 0.8
2020-04-03 20:43:39.460791 Using sample size: 1.0
2020-04-03 20:43:39.460882 Working on target: EMBARKED_S
2020-04-03 20:43:39.463270 Got target's natural class weights:
{True: 0.6982429335370511, False: 0.3017570664629488} 

2020-04-03 20:43:39.464238 Using sample size: 0.01
2020-04-03 20:43:39.465615 Using sample size: 0.05
2020-04-03 20:43:39.465687 Using sample size: 0.1
2020-04-03 20:43:39.465749 Using sample size: 0.2
2020-04-03 20:43:39.465811 Using sample size: 0.4
2020-04-03 20:43:39.466234 Using sample size
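
So far the loop only prints each sample size. One way the sample itself might be drawn is a stratified sample, which keeps the natural class weights at every size. This is a sketch under assumptions, not the notebook's method: `stratified_sample` is a hypothetical helper, `toy_df` is made-up data, and `DataFrameGroupBy.sample` requires pandas 1.1 or newer.

```python
import pandas as pd

# Toy stand-in for data_df: 100 rows with a 10% positive class (made up).
toy_df = pd.DataFrame({
    "Fare": range(100),
    "EMBARKED_Q": [True] * 10 + [False] * 90,
})


def stratified_sample(df, target, frac, random_state=0):
    """Sample `frac` of the rows within each class of `target`."""
    return df.groupby(target, group_keys=False).sample(
        frac=frac, random_state=random_state)


for sample_size in [0.10, 0.20, 0.40]:
    sample_df = stratified_sample(toy_df, "EMBARKED_Q", sample_size)
    # The positive share stays at the natural weight for every size.
    print(sample_size, len(sample_df), sample_df["EMBARKED_Q"].mean())
```

At very small fractions a rare class can shrink to a handful of rows or vanish entirely, which is worth checking before fitting a model on the sample.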

In [24]:
target_class_weights

{'EMBARKED_Q': {False: 0.906035141329259, True: 0.09396485867074103},
 'EMBARKED_S': {True: 0.6982429335370511, False: 0.3017570664629488},
 'EMBARKED_C': {False: 0.7937356760886173, True: 0.20626432391138275}}
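
These natural weights are the starting point for the "slight" to "aggressive" upsampling steps in the workflow. As a rough sketch of the arithmetic, the hypothetical helper below computes how many minority rows would have to be added to reach a chosen minority share. The target shares of 0.2 and 0.5 are arbitrary illustrations, not values from the notebook.

```python
import math


def extra_minority_rows(n_majority, n_minority, target_share):
    """Rows to add to the minority class so it makes up `target_share`."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    # Solve m / (m + n_majority) = target_share for the minority count m.
    needed = math.ceil(target_share * n_majority / (1 - target_share))
    # Never return a negative number if the class is already large enough.
    return max(0, needed - n_minority)


# EMBARKED_Q counts implied by the weights above: 123 True and 1186 False
# out of 1309 rows.
for share in (0.2, 0.5):
    print(share, extra_minority_rows(1186, 123, share))
```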