# Compare Classifiers
*Paulo G. Martinez* Sat. Apr. 4, 2020

---

# classifier-kitchen-sink
Attempt efficient classifier comparison and promotion

## This is an experiment in "elegant brute force."
- Given a "large" but "tidy" data set with a "wide" set of potentially sparse and/or redundant numeric, categorical, and datetime features and multiple "imbalanced" targets (not one multi-class column, but two or more binary-class target columns)
  - Attempt to find an efficient way of comparing the success of various classification models
    - each with various model configurations

P1. One premise of this experiment is that it would take "too much" time to assess the viability of the problem, let alone solve it, using a "traditional" domain-knowledge-based approach.

P2. Another premise is that compute and memory resources are so limited that a brute-force iteration through models would also take "too long."

## Workflow 1.0
"Failing Fast." We begin with the most difficult but easiest-to-compute configurations in hopes of finding success before having to "bloat" all the way up to full brute-force iteration.

For each binary-classification target
- find "natural" class_weights
- for sample_size in [small, medium, ... large, full]:
  - for weight_balance in [natural_weights, slight_upsample, ..., aggressive_upsample]:
    - for model in [rf, cnb, boost, etc]:
      - pre-process features
        - impute nulls
        - perhaps scale
      - fit, train, test model
      - score model
      - score feature importance
      - while model scores unsatisfactorily and feature refinement is possible:
        - refine feature selection
        - refit, retrain, retest, rescore
        - (if model fails to score satisfactorily, move on to next model)
      - **if model scores satisfactorily on sample**
        - save model, its score and config to "training_sample_weight_success"
        - while unsampled training data is available:
          - out_of_sample test and score model on increasingly larger samples
          - **if model scores satisfactorily when "generalized" to data out of its training sample:**
            - save model, its score and config to "training_sample_weight_generalsample_success"
            - (if model failed to generalize well, move on to next model)
      - (when all models have been evaluated on this class weight, upsample and repeat.)
    - (when all weight resamplings have been attempted, increase the sample size and repeat.)
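
The loop order above can be sketched as a lazy generator, so the caller can stop at the first success instead of exhausting every configuration. This is a minimal sketch, not the notebook's implementation: `iter_successes`, `toy_evaluate`, and the score threshold are hypothetical names, and `evaluate` stands in for the whole pre-process, fit, and score step.

```python
from itertools import product

def iter_successes(sample_sizes, weight_configs, models, evaluate, threshold):
    """Yield each configuration whose score meets `threshold`, cheapest first.

    `evaluate` is a hypothetical callable (sample_size, weights, model) -> score
    standing in for pre-processing, fitting, and scoring a model.
    """
    # product() varies its last argument fastest, matching the nested loops:
    # models inside weight configurations inside sample sizes.
    for sample_size, weights, model in product(sample_sizes, weight_configs, models):
        score = evaluate(sample_size, weights, model)
        if score >= threshold:
            yield {'sample_size': sample_size, 'weights': weights,
                   'model': model, 'score': score}

calls = []  # records every configuration that was actually evaluated

def toy_evaluate(sample_size, weights, model):
    # toy scoring rule (an assumption for illustration only)
    calls.append((sample_size, weights, model))
    return sample_size * (1.0 if model == 'rf' else 0.5)

# stop at the first satisfactory configuration
first = next(
    iter_successes([0.1, 0.2, 0.4], ['natural', 'balanced'], ['cnb', 'rf'],
                   toy_evaluate, threshold=0.2),
    None,
)
```

Because the generator is lazy, the larger sample sizes are never evaluated once `next` has returned a success; iterating further resumes the search where it left off.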



## Reveal data structures for clarity during dev

In [1]:
# this is the data we'll be developing on, it's a slightly modified version of the titanic data set
import pandas as pd
pd.read_csv('data/semi_processed_all.csv').head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EMBARKED,EMBARKED_S,EMBARKED_C,EMBARKED_Q
0,0.0,3,braund,male,22.0,1,0,a,7.25,,S,True,False,False
1,1.0,1,cumings,female,38.0,1,0,pc,71.2833,C85,C,False,True,False
2,1.0,3,heikkinen,female,26.0,0,0,stono,7.925,,S,True,False,False
3,1.0,1,futrelle,female,35.0,1,0,,53.1,C123,S,True,False,False
4,0.0,3,allen,male,35.0,0,0,,8.05,,S,True,False,False


In [2]:
import pandas as pd
pd.read_csv('data/semi_processed_all.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    float64
 1   Pclass      1309 non-null   int64  
 2   Name        1309 non-null   object 
 3   Sex         1309 non-null   object 
 4   Age         1046 non-null   float64
 5   SibSp       1309 non-null   int64  
 6   Parch       1309 non-null   int64  
 7   Ticket      352 non-null    object 
 8   Fare        1308 non-null   float64
 9   Cabin       295 non-null    object 
 10  EMBARKED    1307 non-null   object 
 11  EMBARKED_S  1309 non-null   bool   
 12  EMBARKED_C  1309 non-null   bool   
 13  EMBARKED_Q  1309 non-null   bool   
dtypes: bool(3), float64(3), int64(3), object(5)
memory usage: 116.5+ KB


# Developing Functions and Script
# --------------------------------------

**Declare Variables**

In [3]:
# declare some global variables
using_TSNE = False

**prep environment**

In [4]:
# import open source software packages
# numerical manipulation and analysis
import numpy as np
# data frame manipulation and analysis
import pandas as pd

# import builtins
from datetime import datetime

# visualizations
import matplotlib.pyplot as plt
if using_TSNE:
    import seaborn as sns

## Define helper functions

### getting targets' natural class-weights

In [5]:
def get_target_class_count_weight_and_recordkeys(targets = [], data = None, verbose = True):
    '''
    Get the class weights of a set of target columns in a dataframe 
    (with unique indices) or its equivalent index-oriented dictionary. 
    Print the information and return it as a dictionary of classes and 
    weights in the following format:
    {
        'total_count': tc,
        target_col : {
            class_a : {
                'natural_count': nc,
                'natural_weight': nw,
                'records': {i, i+1, ..., i+nc}
            }
        }
    }
    
    Expects datetime and pandas to be available.
    
    Parameters
    ----------
    targets : list of target-column names in a dataframelike object.
        Default []
    
    data: a pandas.DataFrame or its .to_dict(orient = 'index') equivalent. 
        Default None
    
    verbose: boolean. Whether or not the function prints out feedback.
        Default True
    '''
    if verbose:
        feedback = 'Getting class weights'
        print(feedback+'\n'+'-'*len(feedback))
        
    # validate non-empty inputs
    for name, value in {'targets': targets, 'data': data}.items():
        # check emptiness explicitly: a DataFrame has no unambiguous truth value
        if value is None or len(value) == 0:
            raise TypeError(f'''Expected non empty input for {name} but received "falsy" type {type(value)}''')
    
    # if received dataframe cast it to dict and continue
    if isinstance(data, pd.DataFrame):
        # ignore the columns we won't use
        data = data[targets].to_dict(orient = 'index')
    
    # workflow for data dictionary
    if isinstance(data, dict):
        # initialize storage dict
        target_class_weights = {}
        
        # get length of data
        data_length = len(data)
        if verbose:
            print(datetime.now(), "Available Data Records:", f"{data_length:.2E} = {data_length:,}\n")
        target_class_weights['total_count'] = data_length
        
        for target_col in targets:
            if verbose:
                print(datetime.now(), 'weighing classes in target:', target_col)
            # initialize a dict for each target column 
            target_class_weights[target_col] = {}
            
            # iterate once through the records to get each class's record keys
            for record in data:
                # get the class label at that record
                class_label = data[record][target_col]
                
                # if first time seeing this class, initialize its own dict
                if class_label not in target_class_weights[target_col]:
                    # initialize its set of records (to avoid the need for iteration searches downstream)
                    target_class_weights[target_col][class_label] = {'records':set()}
                
                # update this class' set of records
                target_class_weights[target_col][class_label]['records'].add(record)
            
            # now that we have each class's set of records, save its count and weight for convenience
            for class_label in target_class_weights[target_col]:
                target_class_weights[target_col][class_label]['natural_count'] = len(target_class_weights[target_col][class_label]['records'])
                target_class_weights[target_col][class_label]['natural_weight'] = target_class_weights[target_col][class_label]['natural_count']/target_class_weights['total_count']
                if verbose:
                    print(f"class:{class_label}")
                    for attribute in ['natural_count', 'natural_weight']:
                        print(f"- {attribute}: {target_class_weights[target_col][class_label][attribute]}")
            if verbose:
                print('')
    # if neither data frame nor dict
    else:
        raise TypeError(f'Expected data to be type dict or pd.DataFrame but instead got type: {type(data)}')
    
    return target_class_weights
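
As a cross-check on the helper above, the same natural weights and record keys can be read off a DataFrame column with pandas directly. This is a sketch on a made-up four-row frame (`toy` and its `target` column are assumptions, not the notebook's data). Note that `value_counts` drops nulls by default, so the weights are computed over non-null rows only.

```python
import pandas as pd

# hypothetical toy target column: one positive record out of four
toy = pd.DataFrame({'target': [True, False, False, False]})

# share of each class among the non-null records
natural_weights = toy['target'].value_counts(normalize=True).to_dict()

# index keys of the records belonging to each class
record_keys = {
    label: set(toy.index[toy['target'] == label])
    for label in natural_weights
}
```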

### getting data sample of given weight

In [None]:
def get_data_sampl`
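
The cell above is cut off in the source. Independently of it, one way a sample with given class weights could be drawn from the per-class record sets is sketched below. `sample_record_keys` is a hypothetical helper, not the notebook's function; it assumes each class gets `round(sample_size * weight)` records, that every requested class has at least one record, and that a class with too few records is upsampled by drawing with replacement.

```python
import random

def sample_record_keys(records_by_class, class_weights, sample_size, seed=0):
    """Draw record keys so each class fills its weighted share of the sample.

    Hypothetical helper: classes with enough records are sampled without
    replacement; smaller classes are upsampled with replacement.
    """
    rng = random.Random(seed)
    sample = []
    for label, weight in class_weights.items():
        pool = sorted(records_by_class[label])  # sort for reproducibility
        n_wanted = round(sample_size * weight)
        if n_wanted <= len(pool):
            sample.extend(rng.sample(pool, n_wanted))
        else:
            sample.extend(rng.choices(pool, k=n_wanted))
    return sample

# toy record sets: 2 positive keys, 18 negative keys
records_by_class = {True: {0, 1}, False: set(range(2, 20))}
sample = sample_record_keys(records_by_class, {True: 0.30, False: 0.70}, sample_size=10)
```

Here the positive class has only two records but three are requested, so at least one positive key repeats in the sample; the seven negative keys are distinct.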

## Declare "Script" Variables (i.e. input parameters)

**load data**

Use the Titanic data set for its mix of numeric and categorical features.

In [6]:
# load the data as an index-oriented dictionary; for large data of known structure, dicts can be more efficient than dataframes
data = pd.read_csv('data/semi_processed_all.csv').to_dict(orient = 'index')

**declare target columns**

In [7]:
target_columns = ['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C', 'EMBARKED']

In [29]:
verbose = True

# declare sample sizes as percentages of total data size
sample_size_percents = [.10, .20, .40, .80, 1.00]
# validate sample pcts
for pct in sample_size_percents:
    if pct <=0 or pct > 1:
        raise ValueError(f"Expected percentage between (0, 1] but received {pct}")

# declare weight configurations to use
use_natural_weights = True
use_balanced_weights = True
class_weight_configs = [
    {True:.20, False:.80},
    {True:.30, False:.70}
]
# Validation for each class_weight configuration
for weighting in class_weight_configs:
    # spot check that they add up to 1 (for 100%)
    if not np.round(pd.Series(weighting).sum(), 2) == 1:
        raise ValueError(f'Invalid or incomplete class weight values. Expected weights to sum to about 1 but got {pd.Series(weighting)}')

# if testing/debugging on a single target
one_run = True


## Script development

In [30]:
# for each target column in a single data frame
for target in target_columns:
    if verbose:
        print(datetime.now(), 'WORKING ON TARGET COLUMN:', target)
        print('===================================================')
    
    # get target classes, counts, weights, and record keys
    class_counts_weights_keys = get_target_class_count_weight_and_recordkeys(targets=[target], data = data)[target]
    
    if use_natural_weights:
        # make this the first item in the list
        class_weight_configs = [{class_label:class_counts_weights_keys[class_label]['natural_weight'] for class_label in class_counts_weights_keys}] + class_weight_configs
    # get balanced class weights
    if use_balanced_weights:
        # make this the last item in the list
        class_weight_configs += [{clss:np.round(1/len(class_weight_configs[0]),2) for clss in class_weight_configs[0]}]
    
    # GET DATA SAMPLE
    for sample_pct in sample_size_percents:
        for use_weights in class_weight_configs:
            print('get', sample_pct, 'percent sample with weights:', use_weights)
    

    # if testing/debugging on a single target
    if one_run:
        break

2020-04-05 01:07:19.845068 WORKING ON TARGET COLUMN: EMBARKED_Q
Getting class weights
---------------------
2020-04-05 01:07:19.845326 Available Data Records: 1.31E+03 = 1,309

2020-04-05 01:07:19.845400 weighing classes in target: EMBARKED_Q
class:False
- natural_count: 1186
- natural_weight: 0.906035141329259
class:True
- natural_count: 123
- natural_weight: 0.09396485867074103

get 0.1 percent sample with weights: {False: 0.906035141329259, True: 0.09396485867074103}
get 0.1 percent sample with weights: {True: 0.2, False: 0.8}
get 0.1 percent sample with weights: {True: 0.3, False: 0.7}
get 0.1 percent sample with weights: {False: 0.5, True: 0.5}
get 0.2 percent sample with weights: {False: 0.906035141329259, True: 0.09396485867074103}
get 0.2 percent sample with weights: {True: 0.2, False: 0.8}
get 0.2 percent sample with weights: {True: 0.3, False: 0.7}
get 0.2 percent sample with weights: {False: 0.5, True: 0.5}
get 0.4 percent sample with weights: {False: 0.906035141329259, True

In [25]:
class_weight_configs

[{False: 0.906035141329259, True: 0.09396485867074103},
 {True: 0.2, False: 0.8},
 {True: 0.3, False: 0.7},
 {False: 0.5, True: 0.5}]
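
The four configurations listed above (natural, two custom, balanced) could also be assembled per target into a fresh list, which avoids growing the shared `class_weight_configs` each time the loop or the cell runs again. A minimal sketch, where `build_weight_configs` is a hypothetical helper and the balanced weights are left unrounded:

```python
def build_weight_configs(natural_weights, custom_configs,
                         use_natural=True, use_balanced=True):
    """Return a new list: natural weights first, custom next, balanced last."""
    configs = []
    if use_natural:
        configs.append(dict(natural_weights))
    # copy each custom config so the caller's list is never mutated
    configs.extend(dict(config) for config in custom_configs)
    if use_balanced:
        labels = list(natural_weights)
        configs.append({label: 1 / len(labels) for label in labels})
    return configs

natural = {False: 0.906035141329259, True: 0.09396485867074103}
custom = [{True: 0.20, False: 0.80}, {True: 0.30, False: 0.70}]
configs = build_weight_configs(natural, custom)
```

Calling the helper again with the same inputs returns the same four entries, since nothing is prepended or appended to `custom` itself.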

In [15]:
class_dict = get_target_class_count_weight_and_recordkeys(
    targets = ['EMBARKED_Q'], data=data, verbose = False
)['EMBARKED_Q']

{class_label:class_dict[class_label]['natural_weight'] for class_label in class_dict}

{False: 0.906035141329259, True: 0.09396485867074103}

In [13]:
get_target_class_count_weight_and_recordkeys(
    targets = ['EMBARKED_Q'], data=data, verbose = False
)['EMBARKED_Q']

{False: {'records': {0,
   1,
   2,
   3,
   4,
   6,
   7,
   8,
   9,
   10,
   11,
   12,
   13,
   14,
   15,
   17,
   18,
   19,
   20,
   21,
   23,
   24,
   25,
   26,
   27,
   29,
   30,
   31,
   33,
   34,
   35,
   36,
   37,
   38,
   39,
   40,
   41,
   42,
   43,
   45,
   48,
   49,
   50,
   51,
   52,
   53,
   54,
   55,
   56,
   57,
   58,
   59,
   60,
   61,
   62,
   63,
   64,
   65,
   66,
   67,
   68,
   69,
   70,
   71,
   72,
   73,
   74,
   75,
   76,
   77,
   78,
   79,
   80,
   81,
   83,
   84,
   85,
   86,
   87,
   88,
   89,
   90,
   91,
   92,
   93,
   94,
   95,
   96,
   97,
   98,
   99,
   100,
   101,
   102,
   103,
   104,
   105,
   106,
   107,
   108,
   110,
   111,
   112,
   113,
   114,
   115,
   117,
   118,
   119,
   120,
   121,
   122,
   123,
   124,
   125,
   127,
   128,
   129,
   130,
   131,
   132,
   133,
   134,
   135,
   136,
   137,
   138,
   139,
   140,
   141,
   142,
   144,
   145,
   146,
   147,
  