# Compare Classifiers
*Paulo G. Martinez* Sat. Apr. 4, 2020

---

# classifier-kitchen-sink
Attempt efficient classifier comparison and promotion

## This is an experiment in "elegant brute force."
- Given a "large" but "tidy" data set with a "wide" set of potentially sparse and or redundant numeric, categorical, and datetime features and multiple "imbalaced" targets (not a multi class column, but two or more binary class target columns)
  - Attempt to find an efficient way of comparing the success of various classification models 
  -   each with various model configurations

P1. One of the premises of this experiment is that it would take "too much" time to assess the viability, let alone solve the problem using a "traditional" domain-knowledge based approach.

P2. Another premise is that compute and memory resources are so limited that a brute force iteration through models would also take "too long."

## Workflow 1.0
"Failing Fast." We begin with the most challenging but easiest to compute configurations in hopes of fiding success before having to "bloat" all the way up to full brute-force iteration

For each binary-classification target
- find "natural" class_weights
- for sample_size in [small, medium, ... large, full]:
  - for weight_balance in [natural_weights, slight_upsample, ..., aggressive_upsample]:
    - for model in [rf, cnb, boost, etc]:
      - pre-process features
        - impute nulls
        - perhaps scale
        - fit, train, test model
        - score model
        - score feature importance
        - while model scores unsatisfactorily and features refinement is possible:
          - refine feature selection
          - refit, retrain, retest, rescore
          - (if model fails to score satisfactorily move on to next model)
        - **if model scores satisfactorily on sample**
          - save model, its score and config to "training_sample_weight_success"
             - out_of_sample test and score model on increasingly larger samples
            - **if model scores satisfactorily when "generalized" to data out of its training sample:**
              - save model, its score and config to "training_sample_weight_generalsample_success"
              - (if model failed to generalize well move on to next model)
      - (when all models have been evaluated on this class weight, upsample and repeat.)
    - (when all weight resampling's have been attempted increase the sample size and repeat.)



## Reveal data structures for clarity during dev

In [None]:
# this is the data we'll be developing on, it's a slightly modified version of the titanic data set
if 'pd' not in vars():
    import pandas as pd
pd.read_csv('data/semi_processed_all.csv').head()

In [None]:
if 'pd' not in vars():
    import pandas as pd
pd.read_csv('data/semi_processed_all.csv').info()

# Developing Functions and Script
# --------------------------------------

**Declare Variables**

In [1]:
# declare some global variables
using_TSNE = False

**prep environment**

In [2]:
# import open source software packages
# numerical manipulation and analysis
import numpy as np
# data frame manipulation and analysis
import pandas as pd
# sci-kit learn modules
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
#from sklearn.svm import SVC
#from sklearn.gaussian_process import GaussianProcessClassifier
#from sklearn.gaussian_process.kernels import RBF
#from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
#from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import ComplementNB
#from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# import builtins
from datetime import datetime

# visualizations
import matplotlib.pyplot as plt
if using_TSNE:
    import seaborn as sns

## Define helper functions

### getting targets' natural class-weights

In [3]:
def get_target_class_count_weight_and_recordkeys(targets = [], data = None, verbose = True):
    '''
    Get the class weights of a set of target columns in a dataframe 
    (with unique indices)or its equivalent index-oriented dictionary. 
    Print the information and return it as a dictionary of classes and 
    weights in the following format:
    {
        'total_count': tc,
        target_col : {
            class_a : {
                'natural_count':nc,
                'natural_weight': nw,
                'records': {i0, i1, ..., 1nc}
            },
            class_b : {
                'natural_count':nc,
                'natural_weight': nw,
                'records': {i0, i1, ..., 1nc}
            }
        }
    }
    
    Expects datetime and pandas to be available.
    
    Parameters
    ----------
    targets : list of target-column names in a dataframelike object.
        Default []
    
    data: a pandas.DataFrame or its .to_dict(orient = 'index') equivalent. 
        Default None
    
    verbose: boolean. Whether or not the function prints out feedback.
        Default True
    '''
    if verbose:
        feedback = 'Getting class weights'
        print(feedback+'\n'+'-'*len(feedback))
        
    # validate non falsy inputs
    for var in {'targets': targets, 'data': data}:
        # if empty or null
        if not var:
            raise TypeError(f'''Expected non empty input for {var} but received "falsy" type {type(var)}''')
    
    # if received dataframe cast it to dict and continue
    if isinstance(data, pd.DataFrame):
        # ignore the columns we won't use
        data = data[targets].to_dict(orient = 'index')
    
    # workflow for data dictionary
    if isinstance(data, dict):
        # initialize storage dict
        target_class_weights = {}
        
        # get length of data
        data_length = len(data)
        if verbose:
            print(datetime.now(), "Available Data Records:", f"{data_length:.2E} = {data_length:,}\n")
        target_class_weights['total_count'] = data_length
        
        for target_col in targets:
            if verbose:
                print(datetime.now(), 'weighing classes in target:', target_col)
            # initialize a dict for each target column 
            target_class_weights[target_col] = {}
            
            # iterate once through the records to get each classes keys
            for record in data:
                # get the class label at that record
                class_label = data[record][target_col]
                
                # if first time seeing this class, initialize its own dict
                if class_label not in target_class_weights[target_col]:
                    # initialize its set of records (to avoid the need for iteration searches downstream)
                    target_class_weights[target_col][class_label] = {'records':set()}
                
                # update this class' set of records
                target_class_weights[target_col][class_label]['records'].add(record)
            
            # now that we have each class's set of records, save its count and weight for convenience
            for class_label in target_class_weights[target_col]:
                target_class_weights[target_col][class_label]['natural_count'] = len(target_class_weights[target_col][class_label]['records'])
                target_class_weights[target_col][class_label]['natural_weight'] = target_class_weights[target_col][class_label]['natural_count']/target_class_weights['total_count']
                if verbose:
                    print(f"class:{class_label}")
                    for attribute in ['natural_count', 'natural_weight']:
                        print(f"- {attribute}: {target_class_weights[target_col][class_label][attribute]}")
            if verbose:
                print('')
    # if neither data frame nor dict
    else:
        raise TypeError(f'Expected data to be type dict or pd.DataFrame but instead got type: {type(data)}')
    
    return target_class_weights

## getting data sample of given weight

In [16]:
def get_class_weighted_data_samples(
    class_counts_weights_keys = {}, sample_pct = .10, sample_weights = {}, 
    verbose = False, return_dict = False
):
    '''
    Takes a dictionary denoting a set of classes, and the data keys that correspond to them in a data set.
    Returns a list of random samples to meet the size and weighting specifications.
    - Classes will be upsampled or downsampled to meet the requested parameters.
    - When upsampling, all unique records will be added once, then additional records will be added at random
        with replacement
        
    Requires numpy.random.choice
    
    Parameters
    ----------
    class_counts_weight_keys: dictionary with the following features where the 'records' contain the 
        keys or indices of records in a data dictionary or dataframe.
        Default {} empty dict (will fail).
        {
            class_a : {
                'natural_count':nc,
                'natural_weight': nw,
                'records': {i0, i1, ..., 1nc}
            },
            class_b : {
                'natural_count':nc,
                'natural_weight': nw,
                'records': {i0, i1, ..., 1nc}
            }
        }
    
    sample_pct: float between (0,1] denoting the percetage of the original data to be sampled
        Default .10
        
    sample_weights: dictionary of classes and their desired weights in the data sample.
        Ex: {class_a:.50, class_b:.50}
    
    verbose: boolean, whether to print feedback or not
        Default False
        
    return_dict: boolean, whether to return a second object. Optionally Also Returns a similar 
        dictionary of the same classes denoting their sampled count, sampled weight and a list
        of their sampled keys (returns a list because upsampling migh require duplicate keys). 
        Default False
        
    '''
    # validate inputs
    for variable in [class_counts_weights_keys, sample_weights]:
        if not isinstance(variable, dict):
            raise TypeError(f"Expected type dict but received type{type(variable)}.")
    
    # initialize list of data keys to return
    use_records = []
    # initialize dictionary to return
    sample_class_counts_weights_keys = {
        class_label:{
            'sample_count': None,
            'sample_weight':None,
            'sample_records':[]
        } 
        for class_label in class_counts_weights_keys
    }
    
    # get total count of available data
    total_count = 0
    for class_label in class_counts_weights_keys:
        total_count += class_counts_weights_keys[class_label]['natural_count']
        if class_label not in sample_weights:
            raise NameError(f"class: {class_label} not in sample_weights {sample_weights.keys()} If you don't want it in the sample assign its weight to 0")
    
    # determine the requested size of the sample
    sample_size = int(total_count*sample_pct)
    if verbose:
        print('Requested sample size:', sample_size)
    
    # for each class
    for class_label in class_counts_weights_keys:
        
        # determine and save the number of class samples requested
        requested_class_count = int(sample_size*sample_weights[class_label])
        if verbose:
            print('requested_class_count:', class_label, requested_class_count)
        
        # determine if upsampling will be required
        if requested_class_count > class_counts_weights_keys[class_label]['natural_count']:
            if verbose:
                print(f"class: {class_label} is too small by {requested_class_count - class_counts_weights_keys[class_label]['natural_count']}\n Upsampling")
            # add all the unique records available
            use_records += list(class_counts_weights_keys[class_label]['records'])
            # add all the unique records to the class' sample records
            sample_class_counts_weights_keys[class_label]['sample_records'] += list(class_counts_weights_keys[class_label]['records'])
            replacement = True
            # define how many replacement samples still need to be added
            outstanding = requested_class_count - class_counts_weights_keys[class_label]['natural_count']
        
        # determine if downsampling will be required
        if requested_class_count <= class_counts_weights_keys[class_label]['natural_count']:
            if verbose:
                print(f"class: {class_label} is too large by {class_counts_weights_keys[class_label]['natural_count'] - requested_class_count}\n Downsampling")
            replacement = False
            outstanding = requested_class_count
        
        # get outstanding samples to be added
        class_samples = np.random.choice(list(class_counts_weights_keys[class_label]['records']), size = outstanding, replace = replacement)
        class_samples = list(class_samples)
        # add the class samples to the list of records to be used.
        use_records += class_samples
        # add and save the class samples to be used
        sample_class_counts_weights_keys[class_label]['sample_records'] += class_samples
        
        # spot check to make sure it worked as expected
        # get the new count of class samples in use_records
        returned_class_count = sum([record in class_counts_weights_keys[class_label]['records'] for record in use_records])
        returned_class_weight = returned_class_count/sample_size
        assert returned_class_count == requested_class_count
        # save class sample counts and weights for output
        sample_class_counts_weights_keys[class_label]['sample_count'] = returned_class_count
        sample_class_counts_weights_keys[class_label]['sample_weight'] = returned_class_weight
        if verbose:
            print(f"class:{class_label}\n resampled to weight:{np.round(returned_class_weight, 2)}, count:{returned_class_count}\n")
    
    if verbose:
            print(datetime.now(), 'Done resampling')
            
    if not return_dict:
        return use_records
    else:
        return use_records, sample_class_counts_weights_keys


## Declare "Script" Variables (i.e. input parameters)

**load data**

Use titanic data set for its mix of numeric and categoricals

In [12]:
# load data as a dictionary, because they are more efficient when large and of konwn structure than dataframes are
data = pd.read_csv('data/semi_processed_all.csv').to_dict(orient = 'index')

**declare target columns**

In [13]:
target_columns = ['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C']

In [14]:
verbose = True

# declare sample sizes as percentages of total data size
sample_size_percents = [.10, .20, .40, .80, 1.00]
# validate sample pcts
for pct in sample_size_percents:
    if (pct <= 0) or (pct > 1):
        raise ValueError(f"Expected percentage between (0, 1] but received {pct}")

# declare weight configurations to use
use_natural_weights = True
use_balanced_weights = True
class_weight_configs = [{True:.20, False:.80},{True:.30, False:.70}]
# Validation for each class_weight configuration
for weighting in class_weight_configs:
    # spot check that they add up to 1 (for 100%)
    if not np.round(pd.Series(weighting).sum(), 2) == 1:
        raise ValueError(f'Invalid or incomplete class weight values. Expected sum to about 1 but got {pd.Series(weighting)}')

# declare classifiers to use
classifiers = {
    "Complement_Naive_Bayes" : ComplementNB(),
    "Random_Forest_Classifier": RandomForestClassifier(),
    "Nearest_Neighbors" : KNeighborsClassifier(),
    "Add_a_Boost": AdaBoostClassifier(),
    "Neural_Network" : MLPClassifier()
}
# declare which models need their features scaled
needs_scaling = {"Nearest_Neighbors", "Neural_Network"}

        
# if testing/debuging on a single target
one_run = True

## Script development

In [17]:
# for each target column in a single data frame
for target in target_columns:
    if verbose:
        print(datetime.now(), 'WORKING ON TARGET COLUMN:', target)
        print('===================================================')
    
    # get target classes, counts, weights, and recordkeys
    class_counts_weights_keys = get_target_class_count_weight_and_recordkeys(targets=[target], data = data)[target]
    
    if use_natural_weights:
        # make this the first item in the list
        class_weight_configs = [{
            class_label:class_counts_weights_keys[class_label]['natural_weight'] 
            for class_label in class_counts_weights_keys
        }] + class_weight_configs
    # get balanced class weights
    if use_balanced_weights:
        # make this the last item in the list
        class_weight_configs += [{
            clss:np.round(1/len(class_weight_configs[0]),2) for clss in class_weight_configs[0]
        }]
    
    # GET WEIGHTED DATA SAMPLE
    for sample_pct in sample_size_percents:
        for sample_weights in class_weight_configs:
            if verbose:
                print('getting', sample_pct, 'percent sample with weights:', sample_weights)
                print('------------------------------------')
            # save (list of record keys, and dict of class sample counts and weights and corresponding record keys)
            use_samples, sample_class_counts_weights_keys = get_class_weighted_data_samples(
                class_counts_weights_keys = class_counts_weights_keys, 
                sample_pct = sample_pct, 
                sample_weights = sample_weights,
                verbose = True,
                return_dict=True
            )
            print('')
        
        
        
        
        
        # FOR EACH CLASSIFIER
        #for model in classifiers:
            

    # if testing/debugging on a single target
    if one_run:
        break

2020-04-05 18:13:33.657040 WORKING ON TARGET COLUMN: EMBARKED_Q
Getting class weights
---------------------
2020-04-05 18:13:33.657320 Available Data Records: 1.31E+03 = 1,309

2020-04-05 18:13:33.657395 weighing classes in target: EMBARKED_Q
class:False
- natural_count: 1186
- natural_weight: 0.906035141329259
class:True
- natural_count: 123
- natural_weight: 0.09396485867074103

get 0.1 percent sample with weights: {False: 0.906035141329259, True: 0.09396485867074103}
------------------------------------
Requested sample size: 130
requested_class_count: False 117
class: False is too large by 1069
 Downsampling
class:False
 resampled to weight:0.9, count:117

requested_class_count: True 12
class: True is too large by 111
 Downsampling
class:True
 resampled to weight:0.09, count:12

2020-04-05 18:13:33.659886 Done resampling

get 0.1 percent sample with weights: {False: 0.906035141329259, True: 0.09396485867074103}
------------------------------------
Requested sample size: 130
request