# Compare Classifiers
*Paulo G. Martinez* Sun. Apr. 5, 2020

---

# classifier-kitchen-sink
Attempt efficient classifier comparison and promotion

## This is an experiment in "elegant brute force."
- Given a "large" but "tidy" data set with a "wide" set of potentially sparse and/or redundant numeric, categorical, and datetime features and multiple "imbalanced" targets (not a multi-class column, but two or more binary-class target columns)
  - Attempt to find an efficient way of comparing the success of various classification models
    - each with various model configurations

P1. One of the premises of this experiment is that it would take "too much" time to assess the viability of, let alone solve, the problem using a "traditional" domain-knowledge-based approach.

P2. Another premise is that compute and memory resources are so limited that a brute force iteration through models would also take "too long."

## Workflow 1.1
"Failing Fast, Promote Fast": We begin with the most challenging but easiest-to-compute configurations in hopes of finding success before having to "bloat" all the way up to full brute-force iteration.

### We define the "rote" order of brute force iteration of tasks to be tested
**small samples**
- natural to balanced weights
['small-sample-natural-weight-model1', 'small-sample-natural-weight-model2', ..., 'small-sample-natural-weight-modeln'] + 
['small-sample-mild-reweight-model1', 'small-sample-mild-reweight-model2', ..., 'small-sample-mild-reweight-modeln'] + 
...
['small-sample-aggressive-reweight-model1', 'small-sample-aggressive-reweight-model2', ..., 'small-sample-aggressive-reweight-modeln']

**medium samples**
- ibid, mutatis mutandis

**large samples**
- ibid, mutatis mutandis

**full data**
- ibid, mutatis mutandis
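
The rote order described above can be sketched as a nested iteration. This is only an illustration: the sample-size labels, weighting labels, and model names below are hypothetical placeholders, and only the nesting order (sample size, then weighting, then model) is taken from the description.

```python
from itertools import product

# Hypothetical labels, for illustration only.
sample_sizes = ['small-sample', 'medium-sample', 'large-sample', 'full-data']
weightings = ['natural-weight', 'mild-reweight', 'aggressive-reweight']
models = ['model1', 'model2', 'model3']

# product() varies the last iterable fastest, so models cycle within each
# weighting, and weightings cycle within each sample size.
rote_order = ['-'.join(parts) for parts in product(sample_sizes, weightings, models)]

for task_name in rote_order[:4]:
    print(task_name)
```

With these placeholder lists the grid holds 4 x 3 x 3 = 36 tasks, which is the full brute-force iteration the workflow hopes to avoid completing.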

### we iterate through the un-tested tasks
- We begin at the smallest untested task
    - but at the most difficult untested version of the task (class imbalance)
        - we may manipulate balance of training data, but not of test data!
        - we iterate through models to survey performance
            - each model does a small number of feature selection optimizations
        - **at this point we have a "bearish" approach, wanting to preview models before committing compute resources to them**
            - recording which configurations we have already tried
            - **if a model succeeds at performing satisfactorily, we switch to an increasingly bullish approach assuming this model will generalize well**
                - we start a new recursion of this same pattern but on only that model's configurations
                    - but for every success we double the number of intermediate tasks to skip
                        - ex. if tasks were [smallest, smaller, small, medium, large, larger, largest] and model succeeds at smallest
                            - we skip smaller and go straight to small
                                - if model succeeds a second time, we skip to larger
                - recording which configurations we have already tried
                - if model fails 
                    - we exit this recursion and return to the original pattern
                    
**Note: This would only be more efficient if we hypothesize that a model will generalize well to unseen data.**
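
The skip-doubling rule in the bullish phase can be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's implementation: `promotion_schedule` and the `succeeded` callback are hypothetical names, and stopping once a jump runs past the last task is an assumption the description does not specify.

```python
def promotion_schedule(tasks, succeeded):
    """Return the tasks visited for one promoted model.

    tasks: task labels ordered from smallest to largest.
    succeeded: hypothetical callback, task -> bool, reporting whether the
        model performed satisfactorily on that task.
    """
    visited = []
    index = 0
    skip = 1  # number of intermediate tasks to skip after the first success
    while index < len(tasks):
        task = tasks[index]
        visited.append(task)
        if not succeeded(task):
            # on failure we leave the bullish recursion
            break
        index += skip + 1  # jump over `skip` intermediate tasks
        skip *= 2          # every success doubles the next skip
    return visited


tasks = ['smallest', 'smaller', 'small', 'medium', 'large', 'larger', 'largest']
print(promotion_schedule(tasks, lambda task: True))
# ['smallest', 'small', 'larger']
```

The printed sequence matches the example in the list above: one success skips "smaller", and a second success skips "medium" and "large".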

## Reveal data structures for clarity during dev

In [1]:
# this is the data we'll be developing on, it's a slightly modified version of the titanic data set
if 'pd' not in vars():
    import pandas as pd
pd.read_csv('data/semi_processed_all.csv').head()

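| Unnamed: 0 | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | EMBARKED | EMBARKED_S | EMBARKED_C | EMBARKED_Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3 | braund | male | 22.0 | 1 | 0 | a | 7.25 |  | S | True | False | False |
| 1 | 1.0 | 1 | cumings | female | 38.0 | 1 | 0 | pc | 71.2833 | C85 | C | False | True | False |
| 2 | 1.0 | 3 | heikkinen | female | 26.0 | 0 | 0 | stono | 7.925 |  | S | True | False | False |
| 3 | 1.0 | 1 | futrelle | female | 35.0 | 1 | 0 |  | 53.1 | C123 | S | True | False | False |
| 4 | 0.0 | 3 | allen | male | 35.0 | 0 | 0 |  | 8.05 |  | S | True | False | False |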


In [2]:
if 'pd' not in vars():
    import pandas as pd
pd.read_csv('data/semi_processed_all.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    float64
 1   Pclass      1309 non-null   int64  
 2   Name        1309 non-null   object 
 3   Sex         1309 non-null   object 
 4   Age         1046 non-null   float64
 5   SibSp       1309 non-null   int64  
 6   Parch       1309 non-null   int64  
 7   Ticket      352 non-null    object 
 8   Fare        1308 non-null   float64
 9   Cabin       295 non-null    object 
 10  EMBARKED    1307 non-null   object 
 11  EMBARKED_S  1309 non-null   bool   
 12  EMBARKED_C  1309 non-null   bool   
 13  EMBARKED_Q  1309 non-null   bool   
dtypes: bool(3), float64(3), int64(3), object(5)
memory usage: 116.5+ KB


# Developing Functions and Script

---

**Declare Variables**

In [3]:
# declare some global variables
using_TSNE = False

**prep environment**

In [4]:
# import open source software packages
# numerical manipulation and analysis
import numpy as np
# data frame manipulation and analysis
import pandas as pd
# sci-kit learn modules
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
#from sklearn.svm import SVC
#from sklearn.gaussian_process import GaussianProcessClassifier
#from sklearn.gaussian_process.kernels import RBF
#from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
#from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import ComplementNB
#from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# import builtins
from datetime import datetime
import json

# visualizations
import matplotlib.pyplot as plt
if using_TSNE:
    import seaborn as sns

## Define helper functions

### getting targets' natural class-weights

In [5]:
def get_target_class_count_weight_and_recordkeys(targets = [], data = None, verbose = True):
    '''
    Get the class weights of a set of target columns in a dataframe 
    (with unique indices) or its equivalent index-oriented dictionary. 
    Print the information and return it as a dictionary of classes and 
    weights in the following format:
    {
        'total_count': tc,
        target_col : {
            class_a : {
                'count': c,
                'weight': w,
                'records': {i0, i1, ..., 1nc}
            },
            class_b : {
                'count': c,
                'weight': w,
                'records': {i0, i1, ..., 1nc}
            }
        }
    }
    
    Expects datetime and pandas to be available.
    
    Parameters
    ----------
    targets : list of target-column names in a dataframelike object.
        Default []
    
    data: a pandas.DataFrame or its .to_dict(orient = 'index') equivalent. 
        If the data is large, you can reduce memory load by simply passing 
        the target columns. No other information is required.
        Default None
    
    verbose: boolean. Whether or not the function prints out feedback.
        Default True
    '''
    if verbose:
        feedback = 'Getting class weights'
        print(feedback+'\n'+'-'*len(feedback))
        
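    # validate non falsy inputs
    for name, value in {'targets': targets, 'data': data}.items():
        # if empty or null (check length explicitly: a DataFrame has no unambiguous truth value)
        if value is None or len(value) == 0:
            raise TypeError(f'''Expected non empty input for {name} but received "falsy" type {type(value)}''')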
    
    # if received dataframe cast it to dict and continue
    if isinstance(data, pd.DataFrame):
        # ignore the columns we won't use
        data = data[targets].to_dict(orient = 'index')
    
    # workflow for data dictionary
    if isinstance(data, dict):
        # initialize storage dict
        target_class_weights = {}
        
        # get length of data
        data_length = len(data)
        if verbose:
            print(datetime.now(), "Available Data Records:", f"{data_length:.2E} = {data_length:,}\n")
        target_class_weights['total_count'] = data_length
        
        for target_col in targets:
            if verbose:
                print(datetime.now(), 'weighing classes in target:', target_col)
            # initialize a dict for each target column 
            target_class_weights[target_col] = {}
            
            # iterate once through the records to get each classes keys
            for record in data:
                # get the class label at that record
                class_label = data[record][target_col]
                
                # if first time seeing this class, initialize its own dict
                if class_label not in target_class_weights[target_col]:
                    # initialize its set of records (to avoid the need for iteration searches downstream)
                    target_class_weights[target_col][class_label] = {'records':set()}
                
                # update this class' set of records
                target_class_weights[target_col][class_label]['records'].add(record)
            
            # now that we have each class's set of records, save its count and weight for convenience
            for class_label in target_class_weights[target_col]:
                target_class_weights[target_col][class_label]['count'] = len(target_class_weights[target_col][class_label]['records'])
                target_class_weights[target_col][class_label]['weight'] = target_class_weights[target_col][class_label]['count']/target_class_weights['total_count']
                if verbose:
                    print(f"class:{class_label}")
                    for attribute in ['count', 'weight']:
                        print(f"- {attribute}: {target_class_weights[target_col][class_label][attribute]}")
            if verbose:
                print('')
    # if neither data frame nor dict
    else:
        raise TypeError(f'Expected data to be type dict or pd.DataFrame but instead got type: {type(data)}')
    
    return target_class_weights
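
To make the returned structure concrete, here is a compact standalone sketch of the same grouping idea on a toy index-oriented dictionary. It is not the function above: `summarize_classes`, the `is_positive` column, and the toy records are made up for illustration, and the sketch handles a single target column only.

```python
from collections import defaultdict

# Toy index-oriented records, shaped like DataFrame.to_dict(orient='index').
# The column name and values are made up for illustration.
toy_data = {
    0: {'is_positive': False},
    1: {'is_positive': True},
    2: {'is_positive': False},
    3: {'is_positive': False},
}


def summarize_classes(data, target_col):
    """Group record keys by class label, then derive counts and weights."""
    records_by_class = defaultdict(set)
    for key, record in data.items():
        records_by_class[record[target_col]].add(key)
    total = len(data)
    return {
        label: {'count': len(keys), 'weight': len(keys) / total, 'records': keys}
        for label, keys in records_by_class.items()
    }


summary = summarize_classes(toy_data, 'is_positive')
print(summary[True])   # {'count': 1, 'weight': 0.25, 'records': {1}}
print(summary[False])  # {'count': 3, 'weight': 0.75, 'records': {0, 2, 3}}
```

Keeping each class's record keys as a set is what lets the sampling step downstream draw keys directly, without rescanning the data.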

### getting data sample of given weight

In [6]:
def get_class_weighted_data_samples(
    class_counts_weights_keys = {}, sample_pct = .10, sample_weights = {}, 
    verbose = False, return_dict = False
):
    '''
    Takes a dictionary denoting a set of classes, and the data keys that correspond to them in a data set.
    Returns a list of random samples to meet the size and weighting specifications.
    - Classes will be upsampled or downsampled to meet the requested parameters.
    - When upsampling, all unique records will be added once, then additional records will be added at random
        with replacement
        
    Requires numpy.random.choice
    
    Parameters
    ----------
    class_counts_weights_keys: dictionary with the following features where the 'records' contain the 
        keys or indices of records in a data dictionary or dataframe.
        Default {} empty dict (will fail).
        {
            class_a : {
                'count': c,
                'weight': w,
                'records': {i0, i1, ..., 1c}
            },
            class_b : {
                'count': c,
                'weight': w,
                'records': {i0, i1, ..., 1c}
            }
        }
    
    sample_pct: float in (0, 1] denoting the percentage of the original data to be sampled
        Default .10
        
    sample_weights: dictionary of classes and their desired weights in the data sample.
        Ex: {class_a:.50, class_b:.50}
    
    verbose: boolean, whether to print feedback or not
        Default False
        
    return_dict: boolean, whether to also return a second object: a similar 
        dictionary of the same classes denoting their sampled count, sampled weight and a list
        of their sampled keys (a list because upsampling might require duplicate keys). 
        Default False
        Ex:
        {
            class_a : {
                'count': c,
                'weight': w,
                'records': [i0, i1, ..., 1c]
            },
            class_b : {
                'count': c,
                'weight': w,
                'records': [i0, i1, ..., 1c]
            }
        }
        
    '''
    # validate inputs
    for variable in [class_counts_weights_keys, sample_weights]:
        if not isinstance(variable, dict):
            raise TypeError(f"Expected type dict but received type{type(variable)}.")
    
    # initialize list of data keys to return
    use_records = []
    # initialize dictionary to return
    sample_class_counts_weights_keys = {
        class_label:{
            'count': None,
            'weight':None,
            'records':[]
        } 
        for class_label in class_counts_weights_keys
    }
    
    # get total count of available data
    total_count = 0
    for class_label in class_counts_weights_keys:
        total_count += class_counts_weights_keys[class_label]['count']
        if class_label not in sample_weights:
            raise NameError(f"class: {class_label} not in sample_weights {sample_weights.keys()} If you don't want it in the sample assign its weight to 0")
    
    # determine the requested size of the sample
    sample_size = int(total_count*sample_pct)
    if verbose:
        print('Requested sample size:', sample_size)
    
    # for each class
    for class_label in class_counts_weights_keys:
        
        # determine and save the number of class samples requested
        requested_class_count = int(sample_size*sample_weights[class_label])
        if verbose:
            print('requested_class_count:', class_label, requested_class_count)
        
        # determine if upsampling will be required
        if requested_class_count > class_counts_weights_keys[class_label]['count']:
            if verbose:
                print(f"class: {class_label} is too small by {requested_class_count - class_counts_weights_keys[class_label]['count']}\n Upsampling")
            # add all the unique records available
            use_records += list(class_counts_weights_keys[class_label]['records'])
            # add all the unique records to the class' sample records
            sample_class_counts_weights_keys[class_label]['records'] += list(class_counts_weights_keys[class_label]['records'])
            replacement = True
            # define how many replacement samples still need to be added
            outstanding = requested_class_count - class_counts_weights_keys[class_label]['count']
        
        # determine if downsampling will be required
        if requested_class_count <= class_counts_weights_keys[class_label]['count']:
            if verbose:
                print(f"class: {class_label} is too large by {class_counts_weights_keys[class_label]['count'] - requested_class_count}\n Downsampling")
            replacement = False
            outstanding = requested_class_count
        
        # get outstanding samples to be added
        class_samples = np.random.choice(list(class_counts_weights_keys[class_label]['records']), size = outstanding, replace = replacement)
        class_samples = list(class_samples)
        # add the class samples to the list of records to be used.
        use_records += class_samples
        # add and save the class samples to be used
        sample_class_counts_weights_keys[class_label]['records'] += class_samples
        
        # spot check to make sure it worked as expected
        # get the new count of class samples in use_records
        returned_class_count = sum([record in class_counts_weights_keys[class_label]['records'] for record in use_records])
        returned_class_weight = returned_class_count/sample_size
        assert returned_class_count == requested_class_count
        # save class sample counts and weights for output
        sample_class_counts_weights_keys[class_label]['count'] = returned_class_count
        sample_class_counts_weights_keys[class_label]['weight'] = returned_class_weight
        if verbose:
            print(f"class:{class_label}\n resampled to weight:{np.round(returned_class_weight, 2)}, count:{returned_class_count}\n")
    
    if verbose:
        print(datetime.now(), 'Done resampling')
            
    if not return_dict:
        return use_records
    else:
        return use_records, sample_class_counts_weights_keys
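
The per-class up/downsampling rule can be isolated in a few lines. This is a simplified sketch for a single class, not the function above: `sample_class_keys` is a hypothetical helper, and it swaps in the standard library's `random.sample` and `random.choices` in place of `numpy.random.choice` so that it runs without third-party packages.

```python
import random


def sample_class_keys(record_keys, requested_count, seed=None):
    """Hypothetical helper: return `requested_count` keys drawn from one class."""
    rng = random.Random(seed)
    keys = sorted(record_keys)  # sorted so a fixed seed gives repeatable draws
    if requested_count <= len(keys):
        # downsampling: draw without replacement, so no key repeats
        return rng.sample(keys, requested_count)
    # upsampling: keep every unique key once, then top up with replacement
    extra = rng.choices(keys, k=requested_count - len(keys))
    return keys + extra


minority = {10, 11, 12}
upsampled = sample_class_keys(minority, 5, seed=0)
downsampled = sample_class_keys(minority, 2, seed=0)
print(len(upsampled), len(downsampled))  # 5 2
```

Adding every unique key before topping up guarantees that upsampling never drops an available record, which is the same guarantee the docstring above describes.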


## Declare "Script" Variables (i.e. input parameters)

In [7]:
# declare path to data
path_to_data = 'data/semi_processed_all.csv'
# declare target columns
target_columns = ['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C']
# declare columns to drop, if any. Required to be at least an empty list
drop_columns = ['EMBARKED']
# declare date columns
date_cols = []
# declare numeric cols
numeric_cols = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
# declare categorical cols
categorical_cols = ['Name','Sex','Ticket', 'Cabin']

verbose = False

# declare sample sizes as percentages of total data size
sample_size_percents = [.10, .20, .40, .80, 1.00]
# validate sample pcts
for pct in sample_size_percents:
    if (pct <= 0) or (pct > 1):
        raise ValueError(f"Expected percentage between (0, 1] but received {pct}")

# declare weight configurations to use
use_natural_weights = True
use_balanced_weights = True
class_weight_configs = [{True:.20, False:.80},{True:.30, False:.70}]
# Validation for each class_weight configuration
for weighting in class_weight_configs:
    # spot check that they add up to 1 (for 100%)
    if not np.round(pd.Series(weighting).sum(), 2) == 1:
        raise ValueError(f'Invalid or incomplete class weight values. Expected sum to about 1 but got {pd.Series(weighting)}')

# declare classifiers to use
classifiers = {
    "Complement_Naive_Bayes" : ComplementNB(),
    "Random_Forest_Classifier": RandomForestClassifier(),
    "Nearest_Neighbors" : KNeighborsClassifier(),
    "Add_a_Boost": AdaBoostClassifier(),
    "Neural_Network" : MLPClassifier()
}
# declare which models need their features scaled
needs_scaling = {"Nearest_Neighbors", "Neural_Network"}
# declare which models need non-negative features
needs_non_negative = {'Complement_Naive_Bayes'}

        
# if testing/debugging on a single target
one_run = True

## Script development

In [8]:
if verbose:
    feedback = f"{datetime.now()} GETTING TARGET-DATA" + '\n'
    feedback = feedback + '='*len(feedback)
    print(feedback)
# get target data
if path_to_data.split('.')[-1].lower() == 'csv':
    # if data frame read only target columns then cast to dict
    target_data = pd.read_csv(path_to_data, usecols=target_columns).to_dict(orient = 'index')
if path_to_data.split('.')[-1].lower() == 'json':
    # if json read whole thing
    with open(path_to_data, 'r') as json_file:
        target_data = json.loads(json_file.read())
        # then reduce to target features
        target_data = {record:{target:target_data[record][target] for target in target_columns} for record in target_data}
        
# get targets' class counts, weights, and record keys
if verbose:
    print(datetime.now(), 'getting targets_class_counts_weights_recordKeys...')
targets_class_counts_weights_recordKeys = get_target_class_count_weight_and_recordkeys(
    targets = target_columns, 
    data = target_data,
    verbose = True
)

# START WORKING ON TARGET COLUMNS
if verbose:
    print(datetime.now(), 'STARTING WORK FOR TARGET COLUMNS:')
    print('==============================================================')
    print(target_columns)
# for each target column in a single data frame
for target in target_columns:
    print(datetime.now(), 'WORKING ON TARGET COLUMN:', target)
    print('-------------------------------------------------')
    
    # get all available data record keys
    all_data_record_keys = set(target_data.keys())
    
    # get target's natural classes, counts, weights, and recordkeys
    natural_class_counts_weights_keys = targets_class_counts_weights_recordKeys[target]
    
    # get the target's natural class weights (always needed: the test set is drawn at natural weights)
    natural_weights = {
        class_label:np.round(natural_class_counts_weights_keys[class_label]['weight'], 2) 
        for class_label in natural_class_counts_weights_keys
    }
    # finalize this target's list of class-weight configs, starting from a copy so that
    # one target's natural weights never leak into the next target's configs
    target_weight_configs = list(class_weight_configs)
    if use_natural_weights:
        # make this the first item in the list
        if natural_weights not in target_weight_configs:
            target_weight_configs = [natural_weights] + target_weight_configs
    # get balanced class weights
    if use_balanced_weights:
        balanced_weights = {
            clss:np.round(1/len(natural_weights),2) for clss in natural_weights
        }
        # make this the last item in the list
        if balanced_weights not in target_weight_configs:
            target_weight_configs += [balanced_weights]
        
    # define brute force order of test iteration (via a nested list comprehension).
    # Keep this snippet inside the loop because it has to "pack" tasks based on the
    # natural and balanced weights, which are only derived inside the loop
    brute_force_order = [
        {'pct':pct, 'training_weights': weighting, 'model': model, 'tested': False}
        # pcts are ordered in ascending compute load (ascending size of sample)
        for pct in sample_size_percents 
        # weightings are ordered in descending order of presumed difficulty (natural class imbalance first, artificial balance last)
        for weighting in target_weight_configs 
        # dicts preserve insertion order (Python 3.7+), so models are tested in the order the dict was declared
        for model in classifiers
    ]
    if verbose:
        print(datetime.now(), 'Defined "brute force order" (order of incremental compute load iteration)')
        #print('Here are the first 30')
        #for i in range(30):
        #    print(brute_force_order[i])
        #print('\n')
    
    # EXECUTE TASKS (each task tests a model on specified data sample size and training-set class-weights)
    if verbose:
        feedback = "STARTING WORK ON TASKS"
        feedback = feedback + '\n' + '='*len(f"{datetime.now()} {feedback}")
        print(datetime.now(), feedback)
    # --------------
    # initialize task-state trackers
    prior_task_pct = 0.0
    prior_task_training_weights = {}
    prior_task_model = None

    # for each task in brute_force_order
    for task in brute_force_order:
        
        # skip completed tasks
        if task['tested']:
            print('SKIPPING COMPLETED TASK >>>>')
            continue
        # give some unsolicited feedback
        feedback = f"{datetime.now()} Working on task:"
        feedback = feedback+'\n'+'-'*len(feedback)
        print(feedback)
        print(task, '\n')
        
        # check if we need to resample the data (can re-use same training and testing records for all models at this task size and weight)
        if (task['pct']!=prior_task_pct) or (task['training_weights']!=prior_task_training_weights):
            if verbose:
                feedback = str(datetime.now()) + ' Resampling data for task'
                feedback += '\n' + '-'*len(feedback)
                print(feedback)
            
            # DEFINE TEST DATA-SET: "reserve" data required for the testing-set as 30% of the sample, 
            # using natural weights. (If we used manipulated weights for testing we wouldn't be testing 
            # the model's performance in real-world circumstances, where class imbalance is presumed to be 
            # severe; rather, we would be testing the model's performance on an easier task
            # where class imbalance is less severe.)
            if verbose:
                print(datetime.now(), 'defining test set...')
            test_records = get_class_weighted_data_samples(
                # resample from all the records available
                class_counts_weights_keys = natural_class_counts_weights_keys,
                # 30% of the sample (equivalent to 30% of task['pct'] of total records available)
                sample_pct = .30*task['pct'],
                sample_weights = natural_weights,
                verbose = verbose,
                return_dict = False
            )
            # spot-check: a 30% subset of data with the natural balance should never require upsampling
            if len(test_records) != len(set(test_records)):
                raise ValueError(
                    f'Uh oh did not obtain correct number of unique records for test-set expected {len(test_records)} unique records but got {len(set(test_records))}'
                )
            else:
                test_records = set(test_records)
            if verbose:
                print(datetime.now(), 'done defining test-set!\n')
            
            # determine remaining data records available after "securing" test set for this task
            remaining_records_available_for_training = {record for record in all_data_record_keys if record not in test_records}
            # get remaining_record_class_counts_weights_keys
            remaining_records_class_counts_weights_keys = get_target_class_count_weight_and_recordkeys(
                targets = [target],
                # pass only the target feature of each record to avoid having to copy or pass entire remaining data set
                data = {record:{target: target_data[record][target]} for record in remaining_records_available_for_training},
                verbose = verbose
            )[target]
            
            # DEFINE TRAINING-SET: 70% of the sample
            # (To combat overfitting, it's important to do this after the test set is "secured;" 
            # to ensure any upsampling required for completion of the training set does not 
            # include overlap or "data leakage" from the test set)
            if verbose:
                print(datetime.now(), 'defining training set...')
            training_records = get_class_weighted_data_samples(
                # resample from the remaining records available after securing test-set
                class_counts_weights_keys=remaining_records_class_counts_weights_keys,
                # 70% of the sample (equivalent to 70% of task['pct'] of remaining records available)
                sample_pct = .70*task['pct'],
                sample_weights = task['training_weights'],
                verbose = verbose,
                return_dict = False
            )
            if verbose:
                print(datetime.now(), 'done defining training-set!', '\n')
            
            # update drop_columns by adding the target columns we aren't currently working on
            task_drop_columns = drop_columns + [col for col in target_columns if col != target]
            
            # READ IN TRAINING-DATA FOR TASK
            if verbose:
                print(datetime.now(), 'READING TRAINING-DATA FOR TASK')
                print('------------------------------------------------------------------')
            # handle csv data
            if path_to_data.split('.')[-1].lower() == 'csv':
                # if data frame read and only keep training records (drop unused target columns)
                training_data = pd.read_csv(path_to_data).loc[
                    # filter to training records
                    training_records
                ].drop(
                    # filter to non-target columns (and drop specified columns if any)
                    columns = task_drop_columns
                    # reset the index cause it might not be unique after upsampling
                ).reset_index(drop = True)
                
            # handle json data
            if path_to_data.split('.')[-1].lower() == 'json':
                # if json read whole thing then reduce to target features
                with open(path_to_data, 'r') as json_file:
                    training_data = json.loads(json_file.read())
                    # filter down to only target records (add an enumerating prefix because there may be duplicate keys due to upsampling)
                    training_data = {
                        # keep each column and its record value, but ignore (drop specified columns if any)
                        f"{ri}_{record}":{
                            column:training_data[record][column] for column in training_data[record] if column not in task_drop_columns
                        }
                        for ri, record in enumerate(training_records)
                    }
                    # cast dict to dataframe for pre-processing
                    training_data = pd.DataFrame(training_data).T.reset_index(drop = True)
            
            # separate training target from features
            y_train = training_data[target]
            training_data = training_data.drop(columns = [target])
            if verbose:
                print('done reading training-data!\n')
            
            # READ IN TEST-DATA FOR TASK
            if verbose:
                print(datetime.now(), 'READING TEST-DATA FOR TASK')
                print('------------------------------------------------------------------')
            # handle csv data
            if path_to_data.split('.')[-1].lower() == 'csv':
                # if data frame read and only keep test records (drop unused target columns)
                testing_data = pd.read_csv(path_to_data).loc[
                    # filter to test records
                    list(test_records)
                ].drop(
                    # filter to non-target columns (and drop specified columns if any)
                    columns = task_drop_columns
                )
            # handle json data
            if path_to_data.split('.')[-1].lower() == 'json':
                # if json read whole thing then reduce to target features
                with open(path_to_data, 'r') as json_file:
                    testing_data = json.loads(json_file.read())
                    # filter down to only target records
                    testing_data = {
                        # keep each column and its record value, but ignore (drop specified columns if any)
                        record:{
                            column:testing_data[record][column] for column in testing_data[record] if column not in task_drop_columns
                        }
                        for record in test_records
                    }
                    # cast dict to dataframe for pre-processing
                    testing_data = pd.DataFrame(testing_data).T
            
            # separate testing target from features
            y_test = testing_data[target]
            testing_data = testing_data.drop(columns = [target])
            if verbose:
                print('done reading test-data!\n')
            
            # PRE-PROCESS DATA SETS
            # ---------------------
            if verbose:
                print(datetime.now(), 'pre-processing data sets...'.upper())
                print('----------------------------------------------------')
            for data_set in [training_data, testing_data]:
                
                # process numeric cols
                if verbose:
                    print('\n processing numeric columns...')
                    print('- - - - - - - - - - - - - - - - -')
                for col in numeric_cols:
                    # scrub the numeric columns for commas
                    dt = data_set.dtypes[col]
                    if dt == np.dtype('object'):
                        # strip thousands separators with str.replace (no regex import needed); keep nulls as NaN
                        data_set[col] = data_set[col].map(
                            lambda s: float(str(s).replace(',', ''))
                            if pd.notna(s) else np.nan
                        )
                    # ensure they are numeric
                    data_set[col] = data_set[col].astype('float64')
                    if verbose:
                        print(datetime.now(), col, 'casted to float64')
                
                # process date columns to days ago
                today = datetime.today().date()
                if verbose:
                    print(datetime.now(), '\n coercing date cols to int64 days ago...')
                    print('- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ')
                for col in date_cols:
                    if data_set[col].dtype != np.dtype('datetime64[ns]'):
                        # parse once; unparseable values become NaT
                        parsed = pd.to_datetime(data_set[col], errors='coerce')
                        if parsed.notna().any():
                            # subtract as Timestamps so the result is a timedelta column with a .dt accessor
                            data_set[col] = (pd.Timestamp(today) - parsed).dt.days
                            if verbose:
                                print(datetime.now(), 'cast', col, 'to', data_set[col].dtype)
                        else:
                            # if they are all null, then drop the column
                            data_set.drop(columns = col, inplace = True)
                            print(datetime.now(), 'dropped empty col:', col)
                
                # IMPUTE MISSING VALUES
                if verbose:
                    print(datetime.now(), 'IMPUTING MISSING VALUES')
                    print('--------------------------------------------')
                # declare sentinel categorical to impute with
                sentinel_cat = '0'
                dry_run = False

                for col in data_set.columns:
                    # get the data type
                    dt = data_set[col].dtype
                    if verbose:
                        print(datetime.now(), col, ':', dt)

                    # handle categoricals
                    if dt == np.dtype('object'):
                        if verbose:
                            print('-found categorical feature')
                        # if the nulls sum (n_sm) is > 0
                        n_sm = sum(data_set[col].isnull())
                        if n_sm > 0:
                            if verbose:
                                print('Column: ', col, 'has', n_sm, 'nulls')
                            if not dry_run:
                                data_set[col] = data_set[col].fillna(sentinel_cat)
                                if verbose:
                                    print('- Imputed')
                                    print('- ', col, 'now has', sum(data_set[col].isnull()), 'nulls')

                    # for numerical values impute them with the median
                    elif dt == np.dtype('float64') or dt == np.dtype('int64'):
                        if verbose:
                            print('-found numerical feature')
                        # if the nulls sum (n_sm) is > 0
                        n_sm = sum(data_set[col].isnull())
                        if n_sm > 0:
                            if verbose:
                                print('- Column: ', col, 'has', n_sm, 'nulls')
                            if not dry_run:
                                md = np.median(data_set[col].dropna())
                                data_set[col] = data_set[col].fillna(md)
                                if verbose:
                                    print('- Imputed')
                                    print('- ', col, 'now has', sum(data_set[col].isnull()), 'nulls')
                    # for datetime values impute them with the "manufactured" median
                    elif dt == np.dtype('datetime64[ns]'):
                        if verbose:
                            print('-found', dt ,'feature')
                        # if the nulls sum (n_sm) is > 0
                        n_sm = sum(data_set[col].isnull())
                        if n_sm > 0:
                            if verbose:
                                print('- Column: ', col, 'has', n_sm, 'nulls')
                            if not dry_run:
                                # manufacture the middle value between the max and the min by adding half that distance to the min
                                md = data_set[col].min() + (data_set[col].max() - data_set[col].min())/2
                                data_set[col] = data_set[col].fillna(md)
                                # if there weren't enough values to compute a median and its still null fill it with today's date
                                if data_set[col].isna().sum():
                                    now = datetime.now()
                                    data_set[col] = data_set[col].fillna(now)
                                if verbose:
                                    print('- Imputed')
                                    print('- ', col, 'now has', sum(data_set[col].isnull()), 'nulls')
                if verbose:
                    print(datetime.now(), 'done imputing!', '\n')
            
        # determine if any columns with negative values need to be split for both data sets (triggerable by either data set)
        # this transformation is "model-specific" and ephemeral to the task's general data, but we flag the columns now, before get_dummies "blows up" the number of columns to check
        if task['model'] in needs_non_negative:
            if verbose:
                print(datetime.now(), 'Flagging columns with negative values to split for model:', task['model'])
                print('- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ')
            # initialize column tracker (a set, so that .add() works and each column is tracked once)
            split_cols = set()
            for data_set in [training_data, testing_data]:
                for col in data_set:
                    # skip columns already flagged for splitting
                    if col in split_cols:
                        continue

                    # skip categoricals
                    if data_set[col].dtype != np.dtype('object'):
                        # flag the column if any of its values are negative
                        if (data_set[col] < 0).any():
                            if verbose:
                                print(col, 'has negative values, will split it')
                            # track the col for splitting
                            split_cols.add(col)
                            
        # check if we need to re-dummy the data (can re-use the same dummied training and testing data for all models at this task size and weight)
        if (task['pct']!=prior_task_pct) or (task['training_weights']!=prior_task_training_weights):
            # GET DUMMIES
            if verbose:
                feedback = f"{datetime.now()} GETTING DUMMIES..."
                feedback = feedback +'\n'+'-'*len(feedback)
                print(feedback)
            # get dummies for training-set (once; no need to loop over the data sets)
            training_data = pd.get_dummies(training_data)
            # get dummies for test-set, then align it to the training columns:
            # reindex drops features not seen in the training-set, adds training features
            # missing from the test-set as zero columns, and matches the training column order
            testing_data = pd.get_dummies(testing_data).reindex(columns = training_data.columns, fill_value = 0)
            if verbose:
                print(datetime.now(), 'done getting dummies!\n')
            
            # DONE WITH GENERAL PREPROCESSING OF DATA AT SAME TASK PCT AND TRAINING WEIGHT
            if verbose:
                print(datetime.now(), 'done with general pre-processing of data sets!')
                print('-------------------------------------------------------------------')
            
        # MODEL-SPECIFIC EPHEMERAL PRE-PROCESSING
        # ---------------------------------------
        if verbose:
            print(datetime.now(), 'BEGINNING MODEL-SPECIFIC DATA PRE-PROCESSING')
            print('--------------------------------------------------------------------')

        # initialize customizable copies of training and test data sets that won't perpetuate their changes to source
        X_train = training_data.copy()
        X_test = testing_data.copy()

        # execute splitting of non-positive columns, if necessary
        if task['model'] in needs_non_negative:
            # now split the columns in each data-set
            for col in split_cols:
                for data_set in [X_train, X_test]:
                    data_set[f'{col}_positive'] = data_set[col].map(lambda v: v if v >= 0 else 0)
                    data_set[f'{col}_negative'] = data_set[col].map(lambda v: abs(v) if v<= 0 else 0)
                    data_set.drop(columns = col, inplace = True)
            if verbose:
                print('For both data-sets, split non-positive columns:', split_cols, '\n')

        # SCALE DATA, IF DESIRED
        if task['model'] in needs_scaling:
            if verbose:
                print(datetime.now(), 'SCALING DATA...')
            scaler = StandardScaler()
            X_train = scaler.fit_transform(X_train)
            X_test = scaler.transform(X_test)
            if verbose:
                print(datetime.now(), 'done scaling!\n')

        # FIT MODEL TO TRAINING-DATA
        # look up the pre-instantiated model for this task (the same instance is re-fit on every task that uses it)
        clf = classifiers[task['model']]
        if verbose:
            print(datetime.now(), 'FITTING MODEL TO TRAINING-DATA...')
            print('---------------------------------------------------------')
        clf.fit(X_train, y_train)
        if verbose:
            print(datetime.now(), 'done fitting model!')

        # TEST PREDICTIONS
        if verbose:
            print(datetime.now(), 'PREDICTING TEST-DATA...')
            print('---------------------------------------------------------')
        y_pred = clf.predict(X_test)
        if verbose:
            print(datetime.now(), 'done predicting!\n')

        # report accuracy
        acc = accuracy_score(y_test, y_pred)
        print(f"CLASSIFIER:\n {clf}")
        print('----------------------------')
        print(f'ACCURACY: {acc}')
        print('----------------------------')
        cm = confusion_matrix(y_test, y_pred)
        print("Confusion Matrix ")
        print(cm, '\n'*3)


        # store previous task's states
        prior_task_pct = task['pct']
        prior_task_training_weights = task['training_weights']
        prior_task_model = task['model']
        # note task completion
        task['tested'] = True

        if verbose:
            print(datetime.now(), '>>>>>>>>>>>>DONE WITH TASK!!!\n')
                
    # if testing/debugging on a single target
    if one_run:
        break
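
The dummy-encoding step in the cell above has to leave the test set with exactly the training set's columns, in the same order. Below is a minimal sketch of one way to do that with `DataFrame.reindex`, run on made-up toy frames. The `fare` and `port` columns and their values are illustrative assumptions, not the experiment's data.

```python
import pandas as pd

# Toy frames (hypothetical data): the test set lacks category "C"
# and contains a category "Q" that training never saw.
train = pd.DataFrame({"fare": [7.25, 71.28, 8.05], "port": ["S", "C", "S"]})
test = pd.DataFrame({"fare": [53.10, 8.46], "port": ["S", "Q"]})

train_dummies = pd.get_dummies(train)

# reindex does three things at once: drops columns unseen in training,
# adds training columns missing from the test set (filled with 0),
# and puts the columns in the training order.
test_dummies = pd.get_dummies(test).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(list(train_dummies.columns))
print(list(test_dummies.columns))
```

Matching the column order matters once the frames are handed to a scaler or a model as plain arrays, because position is then the only thing that identifies a feature.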

Getting class weights
---------------------
2020-04-06 10:19:20.191313 Available Data Records: 1.31E+03 = 1,309

2020-04-06 10:19:20.191363 weighing classes in target: EMBARKED_Q
class:False
- count: 1186
- weight: 0.906035141329259
class:True
- count: 123
- weight: 0.09396485867074103

2020-04-06 10:19:20.191778 weighing classes in target: EMBARKED_S
class:True
- count: 914
- weight: 0.6982429335370511
class:False
- count: 395
- weight: 0.3017570664629488

2020-04-06 10:19:20.192524 weighing classes in target: EMBARKED_C
class:False
- count: 1039
- weight: 0.7937356760886173
class:True
- count: 270
- weight: 0.20626432391138275

2020-04-06 10:19:20.194966 WORKING ON TARGET COLUMN: EMBARKED_Q
-------------------------------------------------
2020-04-06 10:19:20.195416 Working on task:
-------------------------------------------
{'pct': 0.1, 'training_weights': {False: 0.91, True: 0.09}, 'model': 'Complement_Naive_Bayes', 'tested': False} 

CLASSIFIER:
 ComplementNB(alpha=1.0, class_pri

CLASSIFIER:
 MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
----------------------------
ACCURACY: 0.8947368421052632
----------------------------
Confusion Matrix 
[[34  1]
 [ 3  0]] 



2020-04-06 10:19:21.880617 Working on task:
-------------------------------------------
{'pct': 0.1, 'training_weights': {False: 0.5, True: 0.5}, 'model': 'Complement_Naive_Bayes', 'tested': False} 

CLASSIFIER:
 ComplementNB(alpha=1.0, class_prior=None, fit_prior=True, norm=False)
----------------------------
ACCURACY: 0.605263157

CLASSIFIER:
 MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
----------------------------
ACCURACY: 0.8441558441558441
----------------------------
Confusion Matrix 
[[65  5]
 [ 7  0]] 



2020-04-06 10:19:23.401495 Working on task:
-------------------------------------------
{'pct': 0.2, 'training_weights': {True: 0.3, False: 0.7}, 'model': 'Complement_Naive_Bayes', 'tested': False} 

CLASSIFIER:
 ComplementNB(alpha=1.0, class_prior=None, fit_prior=True, norm=False)
----------------------------
ACCURACY: 0.519480519

 [  5   9]] 



2020-04-06 10:19:25.331957 Working on task:
-------------------------------------------
{'pct': 0.4, 'training_weights': {False: 0.91, True: 0.09}, 'model': 'Neural_Network', 'tested': False} 

CLASSIFIER:
 MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
----------------------------
ACCURACY: 0.7948717948717948
----------------------------
Confusion Matrix 
[[124  18]
 [ 14   0]] 



2020-04-06 10:19:25.668810 Working on task:
-------------------------------------------
{'pct': 0.4, 'training_weights'

CLASSIFIER:
 MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
----------------------------
ACCURACY: 0.1858974358974359
----------------------------
Confusion Matrix 
[[ 15 127]
 [  0  14]] 



2020-04-06 10:19:28.412213 Working on task:
-------------------------------------------
{'pct': 0.8, 'training_weights': {False: 0.91, True: 0.09}, 'model': 'Complement_Naive_Bayes', 'tested': False} 

CLASSIFIER:
 ComplementNB(alpha=1.0, class_prior=None, fit_prior=True, norm=False)
----------------------------
ACCURACY: 0.638

CLASSIFIER:
 AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=None)
----------------------------
ACCURACY: 0.33226837060702874
----------------------------
Confusion Matrix 
[[ 77 208]
 [  1  27]] 



2020-04-06 10:19:33.161918 Working on task:
-------------------------------------------
{'pct': 0.8, 'training_weights': {True: 0.3, False: 0.7}, 'model': 'Neural_Network', 'tested': False} 

CLASSIFIER:
 MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_s

CLASSIFIER:
 KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
----------------------------
ACCURACY: 0.21994884910485935
----------------------------
Confusion Matrix 
[[ 54 302]
 [  3  32]] 



2020-04-06 10:19:38.052272 Working on task:
-------------------------------------------
{'pct': 1.0, 'training_weights': {True: 0.2, False: 0.8}, 'model': 'Add_a_Boost', 'tested': False} 

CLASSIFIER:
 AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=None)
----------------------------
ACCURACY: 0.5882352941176471
----------------------------
Confusion Matrix 
[[196 160]
 [  1  34]] 



2020-04-06 10:19:38.466562 Working on task:
-------------------------------------------
{'pct': 1.0, 'training_weights': {True: 0.2, False: 0.8}, 'model': 'Neural_Network', 'tested': False} 

CLASSI
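
Accuracy alone is hard to read in the log above because the `True` class is rare. As a rough sketch, the confusion matrices can be turned into recall, precision, and balanced accuracy. The helper name `summarize` is hypothetical. The layout assumed is scikit-learn's: rows are true classes, columns are predicted classes, with `False` first. The two matrices are copied from the `EMBARKED_Q` output above.

```python
import numpy as np

def summarize(cm):
    """Summarize a 2x2 confusion matrix laid out as [[tn, fp], [fn, tp]]."""
    tn, fp, fn, tp = (int(v) for v in np.asarray(cm).ravel())
    recall = tp / (tp + fn)            # share of true positives that were found
    specificity = tn / (tn + fp)       # share of true negatives that were found
    # precision is undefined when nothing was predicted positive
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "recall": recall,
        "precision": precision,
        "balanced_accuracy": (recall + specificity) / 2,
    }

# matrix reported with ACCURACY: 0.8947368421052632 (pct 0.1, natural weights)
high_accuracy = summarize([[34, 1], [3, 0]])
# matrix reported with ACCURACY: 0.5882352941176471 (pct 1.0, Add_a_Boost)
low_accuracy = summarize([[196, 160], [1, 34]])

for name, result in [("high accuracy", high_accuracy), ("low accuracy", low_accuracy)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

The first run scores the higher accuracy while finding none of the three `True` records. The second has lower accuracy but recovers 34 of 35, so ranking tasks on accuracy alone would promote the wrong one if the rare class is what matters.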


            # MODEL-SPECIFIC EPHEMERAL PRE-PROCESSING
            if verbose:
                print(datetime.now(), 'BEGINNING MODEL-SPECIFIC DATA PRE-PROCESSING')
                print('--------------------------------------------------------------------')
            
            # initialize customizable copies of training and test data sets that won't perpetuate their changes to source
            X_train = training_data.copy()
            X_test = testing_data.copy()
            
            # execute splitting of non-positive columns, if necessary
            if task['model'] in needs_non_negative:
                # now split the columns in each data-set
                for col in split_cols:
                    for data_set in [X_train, X_test]:
                        data_set[f'{col}_positive'] = data_set[col].map(lambda v: v if v >= 0 else 0)
                        data_set[f'{col}_negative'] = data_set[col].map(lambda v: abs(v) if v<= 0 else 0)
                        data_set.drop(columns = col, inplace = True)
                if verbose:
                    print('For both data-sets, split non-positive columns:', split_cols, '\n')
                                    
            # SCALE DATA, IF DESIRED
            if task['model'] in needs_scaling:
                if verbose:
                    print(datetime.now(), 'SCALING DATA...')
                scaler = StandardScaler()
                X_train = scaler.fit_transform(X_train)
                X_test = scaler.transform(X_test)
                if verbose:
                    print(datetime.now(), 'done scaling!\n')
                    
            # FIT MODEL TO TRAINING-DATA
            # look up the pre-instantiated model for this task (the same instance is re-fit on every task that uses it)
            clf = classifiers[task['model']]
            print(type(clf))
            if verbose:
                print(datetime.now(), 'FITTING MODEL TO TRAINING-DATA...')
                print('---------------------------------------------------------')
            clf.fit(X_train, y_train)
            if verbose:
                print(datetime.now(), 'done fitting model!')
            
            # TEST PREDICTIONS
            if verbose:
                print(datetime.now(), 'PREDICTING TEST-DATA...')
                print('---------------------------------------------------------')
            y_pred = clf.predict(X_test)
            if verbose:
                print(datetime.now(), 'done predicting!')

            # report accuracy
            acc = accuracy_score(y_test, y_pred)
            print(f"\nCLASSIFIER:\n {clf}")
            print('----------------------------')
            print(f'ACCURACY: {acc}')
            print('----------------------------')
            cm = confusion_matrix(y_test, y_pred)
            print("Confusion Matrix ")
            print(cm)
            
            
            # store previous task's states
            prior_task_pct = task['pct']
            prior_task_training_weights = task['training_weights']
            prior_task_model = task['model']
            # note task completion
            task['tested'] = True
            
            if verbose:
                print(datetime.now(), '>>>>>>>>>>>>DONE WITH TASK!!!\n')
                
    # if testing/debugging on a single target
    if one_run:
        break
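
The column-splitting step in the cell above replaces each signed column with two non-negative columns, so that models listed in `needs_non_negative` can accept it. A vectorized sketch of the same idea follows. The helper name `split_signed_column` and the toy `delta` column are assumptions made for illustration.

```python
import pandas as pd

def split_signed_column(frame, col):
    """Return a copy of `frame` where `col` is replaced by two non-negative
    columns holding its positive part and the magnitude of its negative part."""
    out = frame.copy()
    out[f"{col}_positive"] = out[col].clip(lower=0)        # negatives become 0
    out[f"{col}_negative"] = out[col].clip(upper=0).abs()  # positives become 0
    return out.drop(columns=col)

# hypothetical signed feature
demo = pd.DataFrame({"delta": [-2.0, 0.0, 3.5]})
split = split_signed_column(demo, "delta")
print(split)
```

No information is lost: the original value is `delta_positive - delta_negative`, and both new columns are safe for models that reject negative inputs.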

if verbose:
    feedback = f"{datetime.now()} GETTING TARGET-DATA" + '\n'
    feedback = feedback + '='*len(feedback)
    print(feedback)
# get target data
if path_to_data.split('.')[-1].lower() == 'csv':
    # if csv, read only the target columns then cast to dict
    target_data = pd.read_csv(path_to_data, usecols=target_columns).to_dict(orient = 'index')
elif path_to_data.split('.')[-1].lower() == 'json':
    # if json read whole thing
    with open(path_to_data, 'r') as json_file:
        target_data = json.load(json_file)
    # then reduce to target features
    target_data = {record:{target:target_data[record][target] for target in target_columns} for record in target_data}
        
# get targets' class counts, weights, and record keys
if verbose:
    print(datetime.now(), 'getting targets_class_counts_weights_recordKeys...')
targets_class_counts_weights_recordKeys = get_target_class_count_weight_and_recordkeys(
    targets = target_columns, 
    data = target_data,
    verbose = True
)

# START WORKING ON TARGET COLUMNS
if verbose:
    print(datetime.now(), 'STARTING WORK FOR TARGET COLUMNS:')
    print('==============================================================')
    print(target_columns)
# for each target column in a single data frame
for target in target_columns:
    print(datetime.now(), 'WORKING ON TARGET COLUMN:', target)
    print('-------------------------------------------------')
    
    # get all available data record keys
    all_data_record_keys = set(target_data.keys())
    
    # get target's natural classes, counts, weights, and recordkeys
    natural_class_counts_weights_keys = targets_class_counts_weights_recordKeys[target]
    
    # natural weights are always computed because the test-set is sampled with them,
    # whether or not they are also used as a training configuration
    natural_weights = {
        class_label:np.round(natural_class_counts_weights_keys[class_label]['weight'], 2) 
        for class_label in natural_class_counts_weights_keys
    }
    # finalize list of class_weight_configs to use
    # NOTE: class_weight_configs persists across targets, so configs added for an earlier target are also tested for later ones
    if use_natural_weights:
        # make this the first item in the list
        if natural_weights not in class_weight_configs:
            class_weight_configs = [natural_weights] + class_weight_configs
    # get balanced class weights
    if use_balanced_weights:
        balanced_weights = {
            clss:np.round(1/len(natural_weights),2) for clss in natural_weights
        }
        # make this the last item in the list
        if balanced_weights not in class_weight_configs:
            class_weight_configs += [balanced_weights]
        
    # define brute force order of test iteration (via a nested list comprehension). Keep this snippet inside the loop because it has to "pack" tasks based on the natural and balanced weights, which are only derived inside the loop
    brute_force_order = [
        {'pct':pct, 'training_weights': weighting, 'model': model, 'tested': False}
        # pcts are ordered in ascending compute load (ascending size of sample)
        for pct in sample_size_percents 
        # weightings are ordered in descending order of presumed difficulty (natural class imbalance first, artificial balance last)
        for weighting in class_weight_configs 
        # dicts preserve insertion order (Python 3.7+), so models are iterated in the order they were declared, i.e. the preferred testing order
        for model in classifiers
    ]
    if verbose:
        print(datetime.now(), 'Defined "brute force order" (order of incremental compute load iteration)')
        #print('Here are the first 30')
        #for i in range(30):
        #    print(brute_force_order[i])
        #print('\n')
    
    # EXECUTE TASKS (each task tests a model on specified data sample size and training-set class-weights)
    if verbose:
        feedback = "STARTING WORK ON TASKS"
        feedback = feedback + '\n' + '='*len(f"{datetime.now()} {feedback}")
        print(datetime.now(), feedback)
    # --------------
    # initialize task-state trackers
    prior_task_pct = 0.0
    prior_task_training_weights = {}
    prior_task_model = None

    # for each task in brute_force_order
    for task in brute_force_order:
        # skip completed tasks
        if task['tested']:
            print('SKIPPING COMPLETED TASK >>>>')
            continue
        # give some unsolicited feedback
        feedback = f"{datetime.now()} Working on task:"
        feedback = feedback+'\n'+'-'*len(feedback)
        print(feedback)
        print(task, '\n')
        # check if we need to resample the data (can re-use same training and testing records for all models at this task size and weight)
        if (task['pct']!=prior_task_pct) or (task['training_weights']!=prior_task_training_weights):
            if verbose:
                feedback = str(datetime.now()) + ' Resampling data for task'
                feedback += '\n' + '-'*len(feedback)
                print(feedback)
            
            # DEFINE TEST DATA-SET: "reserve" data required for the testing-set as 30% of the sample, 
            # using natural weights. (If we used manipulated weights for testing we wouldn't be testing 
            # the model's performance in real-world circumstances, where class imbalance is presumed to be 
            # severe; rather, we would be testing the model's performance on an easier task
            # where class imbalance is less severe.)
            if verbose:
                print(datetime.now(), 'defining test set...')
            test_records = get_class_weighted_data_samples(
                # resample from all the records available
                class_counts_weights_keys = natural_class_counts_weights_keys,
                # 30% of the sample (equivalent to 30% of task['pct'] of total records available)
                sample_pct = .30*task['pct'],
                sample_weights = natural_weights,
                verbose = verbose,
                return_dict = False
            )
            # spot-check: a 30% subset of data with the natural balance should never require upsampling
            if len(test_records) != len(set(test_records)):
                raise ValueError(
                    f'Uh oh did not obtain correct number of unique records for test-set expected {len(test_records)} unique records but got {len(set(test_records))}'
                )
            else:
                test_records = set(test_records)
            if verbose:
                print(datetime.now(), 'done defining test-set!\n')
            
            # determine remaining data records available after "securing" test set for this task
            remaining_records_available_for_training = all_data_record_keys - test_records
            # get remaining_record_class_counts_weights_keys
            remaining_records_class_counts_weights_keys = get_target_class_count_weight_and_recordkeys(
                targets = [target],
                # pass only the target feature of each record to avoid having to copy or pass entire remaining data set
                data = {record:{target: target_data[record][target]} for record in remaining_records_available_for_training},
                verbose = verbose
            )[target]
            
            # DEFINE TRAINING-SET: 70% of the sample
            # (To combat overfitting, it's important to do this after the test set is "secured;" 
            # to ensure any upsampling required for completion of the training set does not 
            # include overlap or "data leakage" from the test set)
            if verbose:
                print(datetime.now(), 'defining training set...')
            training_records = get_class_weighted_data_samples(
                # resample from the remaining records available after securing test-set
                class_counts_weights_keys=remaining_records_class_counts_weights_keys,
                # 70% of the sample (equivalent to 70% of task['pct'] of remaining records available)
                sample_pct = .70*task['pct'],
                sample_weights = task['training_weights'],
                verbose = verbose,
                return_dict = False
            )
            if verbose:
                print(datetime.now(), 'done defining training-set!', '\n')
            
            # update drop_columns by adding the target columns we aren't currently working on
            task_drop_columns = drop_columns + [col for col in target_columns if col != target]
            
            # READ IN TRAINING-DATA FOR TASK
            if verbose:
                print(datetime.now(), 'READING TRAINING-DATA FOR TASK')
                print('------------------------------------------------------------------')
            # handle csv data
            if path_to_data.split('.')[-1].lower() == 'csv':
                # if data frame read and only keep training records (drop unused target columns)
                X_train = pd.read_csv(path_to_data).loc[
                    # filter to training records
                    training_records
                ].drop(
                    # filter to non-target columns (and drop specified columns if any)
                    columns = task_drop_columns
                    # reset the index cause it might not be unique after upsampling
                ).reset_index(drop = True)
                
            # handle json data
            if path_to_data.split('.')[-1].lower() == 'json':
                # if json read whole thing then reduce to target features
                with open(path_to_data, 'r') as json_file:
                    X_train = json.loads(json_file.read())
                    # filter down to only target records (add an enumerating prefix because there may be duplicate keys due to upsampling)
                    X_train = {
                        # keep each column and its record value, but ignore (drop specified columns if any)
                        f"{ri}_{record}":{
                            column:X_train[record][column] for column in X_train[record] if column not in task_drop_columns
                        }
                        for ri, record in enumerate(training_records)
                    }
                    # cast dict to dataframe for pre-processing
                    X_train = pd.DataFrame(X_train).T.reset_index(drop = True)
            
            # separate training target from features
            y_train = X_train[target]
            X_train = X_train.drop(columns = [target])
            if verbose:
                print('done reading training-data!\n')
            
            # READ IN TEST-DATA FOR TASK
            if verbose:
                print(datetime.now(), 'READING TEST-DATA FOR TASK')
                print('------------------------------------------------------------------')
            # handle csv data
            if path_to_data.split('.')[-1].lower() == 'csv':
                # read the csv and keep only the test records (drop unused target columns)
                X_test = pd.read_csv(path_to_data).loc[
                    # filter to test records
                    list(test_records)
                ].drop(
                    # filter to non-target columns (and drop specified columns if any)
                    columns = task_drop_columns
                )
            # handle json data
            if path_to_data.split('.')[-1].lower() == 'json':
                # read the whole json file, then reduce to the test records and kept features
                with open(path_to_data, 'r') as json_file:
                    X_test = json.load(json_file)
                    # filter down to only test records
                    X_test = {
                        # keep each column and its record value, but ignore (drop specified columns if any)
                        record:{
                            column:X_test[record][column] for column in X_test[record] if column not in task_drop_columns
                        }
                        for record in test_records
                    }
                    # cast dict to dataframe for pre-processing
                    X_test = pd.DataFrame(X_test).T
            
            # separate testing target from features
            y_test = X_test[target]
            X_test = X_test.drop(columns = [target])
            if verbose:
                print('done reading test-data!\n')
            
            # PRE-PROCESS DATA SETS
            # ---------------------
            if verbose:
                print(datetime.now(), 'pre-processing data sets...'.upper())
                print('----------------------------------------------------')
            for data_set in [X_train, X_test]:
                
                # process numeric cols
                if verbose:
                    print('\n processing numeric columns...')
                    print('- - - - - - - - - - - - - - - - -')
                for col in numeric_cols:
                    # scrub the numeric columns for commas
                    dt = data_set.dtypes[col]
                    if dt == np.dtype('object'):
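                        data_set[col] = data_set[col].map(
                            # strip thousands separators from strings; None becomes NaN; numbers and NaN pass through
                            lambda s: float(s.replace(',', ''))
                            if isinstance(s, str) else (np.nan if s is None else s)
                        )
                    # ensure they are numeric
                    data_set[col] = data_set[col].astype('float64')
                    if verbose:
                        print(datetime.now(), col, 'cast to float64')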
                
                # process date columns to days ago
                today = datetime.today().date()
                if verbose:
                    print(datetime.now(), '\n coercing date cols to int64 days ago...')
                    print('- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ')
                for col in date_cols:
                    if data_set[col].dtype != np.dtype('datetime64[ns]'):
                        # parse once; unparseable values become NaT
                        parsed = pd.to_datetime(data_set[col], errors='coerce')
                        if not parsed.notna().sum():
                            # if they are all null, then drop the column
                            data_set.drop(columns = col, inplace = True)
                            print(datetime.now(), 'dropped empty col:', col)
                        else:
                            # whole days between today and each date; NaT becomes NaN and is imputed later
                            data_set[col] = (pd.Timestamp(today) - parsed.dt.normalize()).dt.days
                            if verbose:
                                print(datetime.now(), 'cast', col, 'to', data_set[col].dtype)
                
                # split features that contain negative values into two non-negative columns, if model requires it
                if task['model'] in needs_non_negative:
                    if verbose:
                        print(datetime.now(), 'Splitting columns with negative values for model:', task['model'])
                        print('- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ')
                    # iterate over a copy of the column names, since columns are added and dropped inside the loop
                    for col in list(data_set.columns):
                        # skip categoricals
                        if data_set[col].dtype != np.dtype('object'):
                            # check if any of the values are less than 0
                            if data_set[col].map(lambda v: v<0).any():
                                if verbose:
                                    print(col, 'has negative values, splitting it')
                                data_set[f'{col}_positive'] = data_set[col].map(lambda v: v if v >= 0 else 0)
                                data_set[f'{col}_negative'] = data_set[col].map(lambda v: abs(v) if v<= 0 else 0)
                                data_set.drop(columns = col, inplace = True)
                                if verbose:
                                    print('Split it!')
                
                # IMPUTE MISSING VALUES
                if verbose:
                    print(datetime.now(), 'IMPUTING MISSING VALUES')
                    print('--------------------------------------------')
                # declare sentinel categorical to impute with
                sentinel_cat = '0'
                dry_run = False

                for col in data_set.columns:
                    # get the data type
                    dt = data_set[col].dtype
                    if verbose:
                        print(datetime.now(), col, ':', dt)

                    # handle categoricals
                    if dt == np.dtype('object'):
                        if verbose:
                            print('-found categorical feature')
                        # if the nulls sum (n_sm) is > 0
                        n_sm = sum(data_set[col].isnull())
                        if n_sm > 0:
                            if verbose:
                                print('Column: ', col, 'has', n_sm, 'nulls')
                            if not dry_run:
                                data_set[col] = data_set[col].fillna(sentinel_cat)
                                if verbose:
                                    print('- Imputed')
                                    print('- ', col, 'now has', sum(data_set[col].isnull()), 'nulls')

                    # for numerical values impute them with the median
                    elif dt == np.dtype('float64') or dt == np.dtype('int64'):
                        if verbose:
                            print('-found numerical feature')
                        # if the nulls sum (n_sm) is > 0
                        n_sm = sum(data_set[col].isnull())
                        if n_sm > 0:
                            if verbose:
                                print('- Column: ', col, 'has', n_sm, 'nulls')
                            if not dry_run:
                                md = np.median(data_set[col].dropna())
                                data_set[col] = data_set[col].fillna(md)
                                if verbose:
                                    print('- Imputed')
                                    print('- ', col, 'now has', sum(data_set[col].isnull()), 'nulls')
                    # for datetime values impute them with the "manufactured" median
                    elif dt == np.dtype('datetime64[ns]'):
                        if verbose:
                            print('-found', dt ,'feature')
                        # if the nulls sum (n_sm) is > 0
                        n_sm = sum(data_set[col].isnull())
                        if n_sm > 0:
                            if verbose:
                                print('- Column: ', col, 'has', n_sm, 'nulls')
                            if not dry_run:
                                # manufacture the middle value between the max and the min by adding half that distance to the min
                                md = data_set[col].min() + (data_set[col].max() - data_set[col].min())/2
                                data_set[col] = data_set[col].fillna(md)
                                # if there weren't enough values to compute a median and it's still null, fill it with today's date
                                if data_set[col].isna().sum():
                                    now = datetime.now()
                                    data_set[col] = data_set[col].fillna(now)
                                if verbose:
                                    print('- Imputed')
                                    print('- ', col, 'now has', sum(data_set[col].isnull()), 'nulls') 
                    if verbose:
                        print('\n')
            if verbose:
                print(datetime.now(), 'done pre-processing data sets!')
                print('--------------------------------------------------------')
            
            # GET DUMMIES
            if verbose:
                feedback = f"{datetime.now()} GETTING DUMMIES..."
                feedback = feedback +'\n'+'-'*len(feedback)
                print(feedback)
            # get dummies for training-set
            X_train = pd.get_dummies(X_train)
            # get dummies for test-set
            X_test = pd.get_dummies(X_test)
            # align test-set to training-set: drop features not seen in training, add training-features
            # missing in test-set as zero columns, and match the training-set column order
            X_test = X_test.reindex(columns = X_train.columns, fill_value = 0)
            if verbose:
                print(datetime.now(), 'done getting dummies!\n')
            
            # SCALE DATA, IF DESIRED
            if task['model'] in needs_scaling:
                if verbose:
                    print(datetime.now(), 'SCALING DATA...')
                scaler = StandardScaler()
                X_train = scaler.fit_transform(X_train)
                X_test = scaler.transform(X_test)
                if verbose:
                    print(datetime.now(), 'done scaling!\n')
                    
            # FIT MODEL TO TRAINING-DATA
            # look up the model for this task
            clf = classifiers[task['model']]
            print(type(clf))
            if verbose:
                print(datetime.now(), 'FITTING MODEL TO TRAINING-DATA...')
                print('---------------------------------------------------------')
            clf.fit(X_train, y_train)
            if verbose:
                print(datetime.now(), 'done fitting model!')
            
            # TEST PREDICTIONS
            if verbose:
                print(datetime.now(), 'PREDICTING TEST-DATA...')
                print('---------------------------------------------------------')
            y_pred = clf.predict(X_test)
            if verbose:
                print(datetime.now(), 'done predicting!')

            # report accuracy
            acc = accuracy_score(y_test, y_pred)
            print(f"\nCLASSIFIER:\n {clf}")
            print('----------------------------')
            print(f'ACCURACY: {acc}')
            print('----------------------------')
            cm = confusion_matrix(y_test, y_pred)
            print("Confusion Matrix ")
            print(cm)
            
            
            # store previous task's states
            prior_task_pct = task['pct']
            prior_task_training_weights = task['training_weights']
            prior_task_model = task['model']
            # note task completion
            task['tested'] = True
            
            if verbose:
                print(datetime.now(), '>>>>>>>>>>>>DONE WITH TASK!!!\n')
                
    # if testing/debugging on a single target
    if one_run:
        break