# Compare Classifiers
*Paulo G. Martinez* Sun. Apr. 5, 2020

---

# classifier-kitchen-sink
Attempt efficient classifier comparison and promotion

## This is an experiment in "elegant brute force."
- Given a "large" but "tidy" data set with a "wide" set of potentially sparse and/or redundant numeric, categorical, and datetime features and multiple "imbalanced" targets (not one multi-class column, but two or more binary-class target columns)
  - Attempt to find an efficient way of comparing the success of various classification models
    - each with various model configurations

P1. One of the premises of this experiment is that it would take "too much" time to assess the viability, let alone solve the problem using a "traditional" domain-knowledge based approach.

P2. Another premise is that compute and memory resources are so limited that a brute force iteration through models would also take "too long."

## Workflow 1.1
"Failing Fast, Promote Fast": we begin with the most challenging but easiest-to-compute configurations in hopes of finding success before having to "bloat" all the way up to full brute-force iteration.

### We define the "rote" order of brute force iteration of tasks to be tested
**small samples**
- natural to balanced weights
['small-sample-natural-weight-model1', 'small-sample-natural-weight-model2', ..., 'small-sample-natural-weight-modeln'] + 
['small-sample-mild-reweight-model1', 'small-sample-mild-reweight-model2', ..., 'small-sample-mild-reweight-modeln'] + 
...
['small-sample-aggressive-reweight-model1', 'small-sample-aggressive-reweight-model2', ..., 'small-sample-aggressive-reweight-modeln']

**medium samples**
- ibid, mutatis mutandis

**large samples**
- ibid, mutatis mutandis

**full data**
- ibid, mutatis mutandis
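
The rote order above can be generated rather than written out by hand. The following is a minimal sketch, not the notebook's own code: the label lists are hypothetical placeholders (three models, three weightings), and `itertools.product` varies the last list fastest, so models cycle within a weighting and weightings cycle within a sample size.

```python
from itertools import product

# Hypothetical labels, ordered from cheapest to most expensive sample size
# and from natural (hardest) to most aggressively rebalanced weighting.
sample_sizes = ['small-sample', 'medium-sample', 'large-sample', 'full-data']
weightings = ['natural-weight', 'mild-reweight', 'aggressive-reweight']
models = ['model1', 'model2', 'model3']

# product() varies the right-most list fastest: models, then weightings, then sizes.
rote_order = ['-'.join(parts) for parts in product(sample_sizes, weightings, models)]

print(len(rote_order))   # 36
print(rote_order[:4])
```

The fourth entry is the first model again under the next weighting, which is the same ordering the script later builds with its nested comprehension.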

### we iterate through the un-tested tasks
- We begin at the smallest untested task
    - but at the most difficult untested version of the task (class imbalance)
        - we may manipulate balance of training data, but not of test data!
        - we iterate through models to survey performance
            - each model does a small number of feature selection optimizations
        - **at this point we have a "bearish" approach, wanting to preview models before committing compute resources to them**
            - recording which configurations we have already tried
            - **if a model performs satisfactorily, we switch to an increasingly bullish approach, assuming this model will generalize well**
                - we start a new recursion of this same pattern but on only that model's configurations
                    - but for every success we double the number of intermediate tasks to skip
                        - ex. if tasks were [smallest, smaller, small, medium, large, larger, largest] and the model succeeds at smallest
                            - we skip smaller and go straight to small
                                - if the model succeeds a second time, we skip medium and large and go to larger
                - recording which configurations we have already tried
                - if model fails 
                    - we exit this recursion and return to the original pattern
                    
**Note: This would only be more efficient if we hypothesize that a model will generalize well to unseen data.**
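
The skip-doubling rule in the list above can be sketched as a small function. This is an illustration under assumptions, not the notebook's implementation: `bullish_path` and `succeeds` are hypothetical names, and the sketch simply stops when the next jump would run past the last task, which the text above does not specify.

```python
def bullish_path(tasks, succeeds):
    """Return the tasks visited when every success doubles the number of tasks skipped."""
    visited = []
    index, skip = 0, 1
    while index < len(tasks):
        task = tasks[index]
        visited.append(task)
        if not succeeds(task):
            # failure: exit the recursion and return to the original pattern
            break
        index += skip + 1   # jump over `skip` intermediate tasks
        skip *= 2           # the next success skips twice as many
    return visited


tasks = ['smallest', 'smaller', 'small', 'medium', 'large', 'larger', 'largest']
print(bullish_path(tasks, lambda task: True))
# ['smallest', 'small', 'larger']
```

With a model that always succeeds, the visited tasks match the example in the list: smallest, then small, then larger.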

## Reveal data structures for clarity during dev

In [1]:
# this is the data we'll be developing on, it's a slightly modified version of the titanic data set
if 'pd' not in vars():
    import pandas as pd
pd.read_csv('data/semi_processed_all.csv').head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EMBARKED,EMBARKED_S,EMBARKED_C,EMBARKED_Q
0,0.0,3,braund,male,22.0,1,0,a,7.25,,S,True,False,False
1,1.0,1,cumings,female,38.0,1,0,pc,71.2833,C85,C,False,True,False
2,1.0,3,heikkinen,female,26.0,0,0,stono,7.925,,S,True,False,False
3,1.0,1,futrelle,female,35.0,1,0,,53.1,C123,S,True,False,False
4,0.0,3,allen,male,35.0,0,0,,8.05,,S,True,False,False


In [2]:
if 'pd' not in vars():
    import pandas as pd
pd.read_csv('data/semi_processed_all.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    float64
 1   Pclass      1309 non-null   int64  
 2   Name        1309 non-null   object 
 3   Sex         1309 non-null   object 
 4   Age         1046 non-null   float64
 5   SibSp       1309 non-null   int64  
 6   Parch       1309 non-null   int64  
 7   Ticket      352 non-null    object 
 8   Fare        1308 non-null   float64
 9   Cabin       295 non-null    object 
 10  EMBARKED    1307 non-null   object 
 11  EMBARKED_S  1309 non-null   bool   
 12  EMBARKED_C  1309 non-null   bool   
 13  EMBARKED_Q  1309 non-null   bool   
dtypes: bool(3), float64(3), int64(3), object(5)
memory usage: 116.5+ KB


# Developing Functions and Script
# --------------------------------------

**Declare Variables**

In [3]:
# declare some global variables
using_TSNE = False

**prep environment**

In [4]:
# import open source software packages
# numerical manipulation and analysis
import numpy as np
# data frame manipulation and analysis
import pandas as pd
# sci-kit learn modules
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
#from sklearn.svm import SVC
#from sklearn.gaussian_process import GaussianProcessClassifier
#from sklearn.gaussian_process.kernels import RBF
#from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
#from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import ComplementNB
#from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# import builtins
from datetime import datetime
import json

# visualizations
import matplotlib.pyplot as plt
if using_TSNE:
    import seaborn as sns

## Define helper functions

### getting targets' natural class-weights

In [5]:
def get_target_class_count_weight_and_recordkeys(targets = [], data = None, verbose = True):
    '''
    Get the class weights of a set of target columns in a dataframe 
    (with unique indices) or its equivalent index-oriented dictionary. 
    Print the information and return it as a dictionary of classes and 
    weights in the following format:
    {
        'total_count': tc,
        target_col : {
            class_a : {
                'count': c,
                'weight': w,
                'records': {i0, i1, ..., 1nc}
            },
            class_b : {
                'count': c,
                'weight': w,
                'records': {i0, i1, ..., 1nc}
            }
        }
    }
    
    Expects datetime and pandas to be available.
    
    Parameters
    ----------
    targets : list of target-column names in a dataframelike object.
        Default []
    
    data: a pandas.DataFrame or its .to_dict(orient = 'index') equivalent. 
        If the data is large, you can reduce memory load by simply passing 
        the target columns. No other information is required.
        Default None
    
    verbose: boolean. Whether or not the function prints out feedback.
        Default True
    '''
    if verbose:
        feedback = 'Getting class weights'
        print(feedback+'\n'+'-'*len(feedback))
        
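    # validate non falsy inputs
    for name, value in {'targets': targets, 'data': data}.items():
        # if empty or null (use len() because a DataFrame has no unambiguous truth value)
        if value is None or len(value) == 0:
            raise TypeError(f'''Expected non empty input for {name} but received "falsy" type {type(value)}''')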
    
    # if received dataframe cast it to dict and continue
    if isinstance(data, pd.DataFrame):
        # ignore the columns we won't use
        data = data[targets].to_dict(orient = 'index')
    
    # workflow for data dictionary
    if isinstance(data, dict):
        # initialize storage dict
        target_class_weights = {}
        
        # get length of data
        data_length = len(data)
        if verbose:
            print(datetime.now(), "Available Data Records:", f"{data_length:.2E} = {data_length:,}\n")
        target_class_weights['total_count'] = data_length
        
        for target_col in targets:
            if verbose:
                print(datetime.now(), 'weighing classes in target:', target_col)
            # initialize a dict for each target column 
            target_class_weights[target_col] = {}
            
            # iterate once through the records to get each class's keys
            for record in data:
                # get the class label at that record
                class_label = data[record][target_col]
                
                # if first time seeing this class, initialize its own dict
                if class_label not in target_class_weights[target_col]:
                    # initialize its set of records (to avoid the need for iteration searches downstream)
                    target_class_weights[target_col][class_label] = {'records':set()}
                
                # update this class' set of records
                target_class_weights[target_col][class_label]['records'].add(record)
            
            # now that we have each class's set of records, save its count and weight for convenience
            for class_label in target_class_weights[target_col]:
                target_class_weights[target_col][class_label]['count'] = len(target_class_weights[target_col][class_label]['records'])
                target_class_weights[target_col][class_label]['weight'] = target_class_weights[target_col][class_label]['count']/target_class_weights['total_count']
                if verbose:
                    print(f"class:{class_label}")
                    for attribute in ['count', 'weight']:
                        print(f"- {attribute}: {target_class_weights[target_col][class_label][attribute]}")
            if verbose:
                print('')
    # if neither data frame nor dict
    else:
        raise TypeError(f'Expected data to be type dict or pd.DataFrame but instead got type: {type(data)}')
    
    return target_class_weights
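
To make the returned structure concrete, here is a condensed, standard-library-only sketch of the same bookkeeping for a single target column. `summarize_target` and `toy_data` are hypothetical names introduced for illustration; the function above additionally handles DataFrames, multiple targets, and the `'total_count'` entry.

```python
from collections import defaultdict


def summarize_target(data, target):
    """Group record keys by class label, then derive each class's count and weight."""
    records_by_class = defaultdict(set)
    for key, row in data.items():
        records_by_class[row[target]].add(key)
    total = len(data)
    return {
        label: {'count': len(keys), 'weight': len(keys) / total, 'records': keys}
        for label, keys in records_by_class.items()
    }


# index-oriented dictionary, as produced by DataFrame.to_dict(orient='index')
toy_data = {
    0: {'EMBARKED_Q': False},
    1: {'EMBARKED_Q': True},
    2: {'EMBARKED_Q': False},
    3: {'EMBARKED_Q': False},
}
summary = summarize_target(toy_data, 'EMBARKED_Q')
print(summary[True])   # {'count': 1, 'weight': 0.25, 'records': {1}}
```

Storing the record keys as a set per class is what lets the sampling step downstream draw from a class without rescanning the data.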

### getting data sample of given weight

In [6]:
def get_class_weighted_data_samples(
    class_counts_weights_keys = {}, sample_pct = .10, sample_weights = {}, 
    verbose = False, return_dict = False
):
    '''
    Takes a dictionary denoting a set of classes, and the data keys that correspond to them in a data set.
    Returns a list of random samples to meet the size and weighting specifications.
    - Classes will be upsampled or downsampled to meet the requested parameters.
    - When upsampling, all unique records will be added once, then additional records will be added at random
        with replacement
        
    Requires numpy.random.choice
    
    Parameters
    ----------
    class_counts_weights_keys: dictionary with the following structure, where the 'records' contain the 
        keys or indices of records in a data dictionary or dataframe.
        Default {} empty dict (will fail).
        {
            class_a : {
                'count': c,
                'weight': w,
                'records': {i0, i1, ..., 1c}
            },
            class_b : {
                'count': c,
                'weight': w,
                'records': {i0, i1, ..., 1c}
            }
        }
    
    sample_pct: float in (0,1] denoting the percentage of the original data to be sampled
        Default .10
        
    sample_weights: dictionary of classes and their desired weights in the data sample.
        Ex: {class_a:.50, class_b:.50}
    
    verbose: boolean, whether to print feedback or not
        Default False
        
    return_dict: boolean, whether to return a second object. Optionally Also Returns a similar 
        dictionary of the same classes denoting their sampled count, sampled weight and a list
        of their sampled keys (returns a list because upsampling might require duplicate keys). 
        Default False
        Ex:
        {
            class_a : {
                'count': c,
                'weight': w,
                'records': [i0, i1, ..., 1c]
            },
            class_b : {
                'count': c,
                'weight': w,
                'records': [i0, i1, ..., 1c]
            }
        }
        
    '''
    # validate inputs
    for variable in [class_counts_weights_keys, sample_weights]:
        if not isinstance(variable, dict):
            raise TypeError(f"Expected type dict but received type {type(variable)}.")
    
    # initialize list of data keys to return
    use_records = []
    # initialize dictionary to return
    sample_class_counts_weights_keys = {
        class_label:{
            'count': None,
            'weight':None,
            'records':[]
        } 
        for class_label in class_counts_weights_keys
    }
    
    # get total count of available data
    total_count = 0
    for class_label in class_counts_weights_keys:
        total_count += class_counts_weights_keys[class_label]['count']
        if class_label not in sample_weights:
            raise NameError(f"class: {class_label} not in sample_weights {sample_weights.keys()} If you don't want it in the sample assign its weight to 0")
    
    # determine the requested size of the sample
    sample_size = int(total_count*sample_pct)
    if verbose:
        print('Requested sample size:', sample_size)
    
    # for each class
    for class_label in class_counts_weights_keys:
        
        # determine and save the number of class samples requested
        requested_class_count = int(sample_size*sample_weights[class_label])
        if verbose:
            print('requested_class_count:', class_label, requested_class_count)
        
        # determine if upsampling will be required
        if requested_class_count > class_counts_weights_keys[class_label]['count']:
            if verbose:
                print(f"class: {class_label} is too small by {requested_class_count - class_counts_weights_keys[class_label]['count']}\n Upsampling")
            # add all the unique records available
            use_records += list(class_counts_weights_keys[class_label]['records'])
            # add all the unique records to the class' sample records
            sample_class_counts_weights_keys[class_label]['records'] += list(class_counts_weights_keys[class_label]['records'])
            replacement = True
            # define how many replacement samples still need to be added
            outstanding = requested_class_count - class_counts_weights_keys[class_label]['count']
        
        # determine if downsampling will be required
        if requested_class_count <= class_counts_weights_keys[class_label]['count']:
            if verbose:
                print(f"class: {class_label} is too large by {class_counts_weights_keys[class_label]['count'] - requested_class_count}\n Downsampling")
            replacement = False
            outstanding = requested_class_count
        
        # get outstanding samples to be added
        class_samples = np.random.choice(list(class_counts_weights_keys[class_label]['records']), size = outstanding, replace = replacement)
        class_samples = list(class_samples)
        # add the class samples to the list of records to be used.
        use_records += class_samples
        # add and save the class samples to be used
        sample_class_counts_weights_keys[class_label]['records'] += class_samples
        
        # spot check to make sure it worked as expected
        # get the new count of class samples in use_records
        returned_class_count = sum([record in class_counts_weights_keys[class_label]['records'] for record in use_records])
        returned_class_weight = returned_class_count/sample_size
        assert returned_class_count == requested_class_count
        # save class sample counts and weights for output
        sample_class_counts_weights_keys[class_label]['count'] = returned_class_count
        sample_class_counts_weights_keys[class_label]['weight'] = returned_class_weight
        if verbose:
            print(f"class:{class_label}\n resampled to weight:{np.round(returned_class_weight, 2)}, count:{returned_class_count}\n")
    
    if verbose:
        print(datetime.now(), 'Done resampling')
            
    if not return_dict:
        return use_records
    else:
        return use_records, sample_class_counts_weights_keys
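
The per-class rule inside the function can be isolated as follows. This is a minimal sketch that swaps in the standard library's `random` module for `numpy.random.choice`; `sample_class` is a hypothetical helper, not part of the notebook.

```python
import random


def sample_class(records, requested, rng):
    """Downsample without replacement, or keep every record and top up with replacement."""
    records = sorted(records)
    if requested > len(records):
        # upsampling: every unique record once, then random repeats for the shortfall
        extra = rng.choices(records, k=requested - len(records))
        return records + extra
    # downsampling: a random subset with no repeats
    return rng.sample(records, requested)


rng = random.Random(0)  # seeded only to make the sketch repeatable
upsampled = sample_class({1, 2, 3}, 5, rng)
downsampled = sample_class(set(range(10)), 4, rng)
print(len(upsampled), len(set(upsampled)))       # 5 3
print(len(downsampled), len(set(downsampled)))   # 4 4
```

Because an upsampled class may contain repeated keys, the result has to be a list rather than a set, which is why the function above returns lists.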


## Declare "Script" Variables (i.e. input parameters)

In [37]:
# declare path to data
path_to_data = 'data/semi_processed_all.csv'
# declare target columns
target_columns = ['EMBARKED_Q', 'EMBARKED_S', 'EMBARKED_C']
# declare columns to drop, if any. Required to be at least an empty list
drop_columns = ['EMBARKED']

In [38]:
verbose = True

# declare sample sizes as percentages of total data size
sample_size_percents = [.10, .20, .40, .80, 1.00]
# validate sample pcts
for pct in sample_size_percents:
    if (pct <= 0) or (pct > 1):
        raise ValueError(f"Expected percentage between (0, 1] but received {pct}")

# declare weight configurations to use
use_natural_weights = True
use_balanced_weights = True
class_weight_configs = [{True:.20, False:.80},{True:.30, False:.70}]
# Validation for each class_weight configuration
for weighting in class_weight_configs:
    # spot check that they add up to 1 (for 100%)
    if not np.round(pd.Series(weighting).sum(), 2) == 1:
        raise ValueError(f'Invalid or incomplete class weight values. Expected sum to about 1 but got {pd.Series(weighting)}')

# declare classifiers to use
classifiers = {
    "Complement_Naive_Bayes" : ComplementNB(),
    "Random_Forest_Classifier": RandomForestClassifier(),
    "Nearest_Neighbors" : KNeighborsClassifier(),
    "Add_a_Boost": AdaBoostClassifier(),
    "Neural_Network" : MLPClassifier()
}
# declare which models need their features scaled
needs_scaling = {"Nearest_Neighbors", "Neural_Network"}

        
# if testing/debugging on a single target
one_run = True

## Script development

In [64]:
# get target data
if path_to_data.split('.')[-1].lower() == 'csv':
    # if data frame read only target columns then cast to dict
    target_data = pd.read_csv(path_to_data, usecols=target_columns).to_dict(orient = 'index')
if path_to_data.split('.')[-1].lower() == 'json':
    # if json read whole thing
    with open(path_to_data, 'r') as json_file:
        target_data = json.loads(json_file.read())
        # then reduce to target features
        target_data = {record:{target:target_data[record][target] for target in target_columns} for record in target_data}

# get targets' class counts, weights, and record keys
targets_class_counts_weights_recordKeys = get_target_class_count_weight_and_recordkeys(
    targets = target_columns, 
    data = target_data,
    verbose = True
)

# for each target column in a single data frame
for target in target_columns:
    if verbose:
        print(datetime.now(), 'WORKING ON TARGET COLUMN:', target)
        print('===================================================')
    
    # get all available data record keys
    all_data_record_keys = set(target_data.keys())
    
    # get target's natural classes, counts, weights, and recordkeys
    natural_class_counts_weights_keys = targets_class_counts_weights_recordKeys[target]
    
    # always compute the natural weights (the test set is sampled with them even when they aren't a training config)
    natural_weights = {
        class_label:np.round(natural_class_counts_weights_keys[class_label]['weight'], 2) 
        for class_label in natural_class_counts_weights_keys
    }
    # finalize this target's list of class-weight configs (work on a copy so the declared list isn't mutated between targets)
    target_weight_configs = list(class_weight_configs)
    if use_natural_weights:
        # make this the first item in the list
        target_weight_configs = [natural_weights] + target_weight_configs
    # get balanced class weights
    if use_balanced_weights:
        # make this the last item in the list
        target_weight_configs += [{
            clss:np.round(1/len(natural_weights),2) for clss in natural_weights
        }]
        
    # define brute force order of test iteration (via a nested list comprehension)
    brute_force_order = [
        {'pct':pct, 'training_weights': weighting, 'model': model, 'tested': False}
        # pcts are ordered in ascending compute load (ascending size of sample)
        for pct in sample_size_percents 
        # weightings are ordered in descending order of presumed difficulty (natural class imbalance first, artificial balance last)
        for weighting in target_weight_configs 
        # dicts preserve insertion order (Python 3.7+), so models are tested in the order they were declared
        for model in classifiers
    ]
    if verbose:
        print(datetime.now(), 'Defined "brute force order" (order of incremental compute load iteration)')
        print('Here are the first 30')
        for preview_task in brute_force_order[:30]:
            print(preview_task)
        print('\n')
    
    # EXECUTE TASKS (each task tests a model on specified data sample size and training-set class-weights)
    # --------------
    # initialize task-state trackers
    prior_task_pct = 0.0
    prior_task_training_weights = {}
    prior_task_model = None

    # for each task in brute_force_order
    for task in brute_force_order:
        # skip completed tasks
        if task['tested']:
            continue
        if verbose:
            print(datetime.now(), 'Working on task:')
            print('------------------------------------------')
            print(task, '\n')
        # check if we need to resample the data (can re-use same training and testing data for all models at this task size and weight)
        if (task['pct']!=prior_task_pct) or (task['training_weights']!=prior_task_training_weights):
            if verbose:
                print(datetime.now(), 'Resampling data for task\n')
            
            # DEFINE TEST DATA-SET: "reserve" data required for the testing-set as 30% of the sample, 
            # using natural weights. (If we used manipulated weights for testing we wouldn't be testing 
            # the model's performance in real-world circumstances, where class imbalance is presumed to be 
            # severe; rather, we would be testing the model's performance on an easier task
            # where class imbalance is less severe.)
            if verbose:
                print(datetime.now(), 'defining test set...')
            test_records = get_class_weighted_data_samples(
                # resample from all the records available
                class_counts_weights_keys = natural_class_counts_weights_keys,
                # 30% of the sample (equivalent to 30% of task['pct'] of total records available)
                sample_pct = .30*task['pct'],
                sample_weights = natural_weights,
                verbose = verbose,
                return_dict = False
            )
            # spot-check: a 30% subset of data with the natural balance should never require upsampling
            if len(test_records) != len(set(test_records)):
                raise ValueError(
                    f'Uh oh did not obtain correct number of unique records for test-set expected {len(test_records)} unique records but got {len(set(test_records))}'
                )
            else:
                test_records = set(test_records)
            if verbose:
                print(datetime.now(), 'done defining test-set!\n')
            
            # determine remaining data records available after "securing" test set for this task
            remaining_records_available_for_training = {record for record in all_data_record_keys if record not in test_records}
            # get remaining_record_class_counts_weights_keys
            remaining_records_class_counts_weights_keys = get_target_class_count_weight_and_recordkeys(
                targets = [target],
                # pass only the target feature of each record to avoid having to copy or pass entire remaining data set
                data = {record:{target: target_data[record][target]} for record in remaining_records_available_for_training},
                verbose = verbose
            )[target]
            
            # DEFINE TRAINING-SET: 70% of the sample
            # (To combat overfitting, it's important to do this after the test set is "secured;" 
            # to ensure any upsampling required for completion of the training set does not 
            # include overlap or "data leakage" from the test set)
            if verbose:
                print(datetime.now(), 'defining training set...')
            training_records = get_class_weighted_data_samples(
                # resample from the remaining records available after securing test-set
                class_counts_weights_keys=remaining_records_class_counts_weights_keys,
                # 70% of the sample (equivalent to 70% of task['pct'] of remaining records available)
                sample_pct = .70*task['pct'],
                sample_weights = task['training_weights'],
                verbose = verbose,
                return_dict = False
            )
            if verbose:
                print(datetime.now(), 'done defining training-set!', '\n')
            
            # update drop_columns by adding the target columns we aren't currently working on
            task_drop_columns = drop_columns + [col for col in target_columns if col != target]
            
            # READ IN TRAINING-DATA FOR TASK
            if verbose:
                print(datetime.now(), 'READING TRAINING-DATA FOR TASK')
                print('-------------------------------------')
            # handle csv data
            if path_to_data.split('.')[-1].lower() == 'csv':
                # if data frame read and only keep training records (drop unused target columns)
                training_data = pd.read_csv(path_to_data).loc[
                    # filter to training records
                    training_records
                ].drop(
                    # filter to non-target columns (and drop specified columns if any)
                    columns = task_drop_columns
                    # reset the index cause it might not be unique after upsampling
                ).reset_index(drop = True)
            # handle json data
            if path_to_data.split('.')[-1].lower() == 'json':
                # if json read whole thing then reduce to target features
                with open(path_to_data, 'r') as json_file:
                    training_data = json.loads(json_file.read())
                    # filter down to only target records (add an enumerating prefix because there may be duplicate keys due to upsampling)
                    training_data = {
                        # keep each column and its record value, but ignore (drop specified columns if any)
                        f"{ri}_{record}":{
                            column:training_data[record][column] for column in training_data[record] if column not in task_drop_columns
                        }
                        for ri, record in enumerate(training_records)
                    }
                    # cast dict to dataframe for pre-processing
                    training_data = pd.DataFrame(training_data).T.reset_index(drop = True)
            if verbose:
                print('done reading training-data!\n')
            
            # READ IN TEST-DATA FOR TASK
            if verbose:
                print(datetime.now(), 'READING TEST-DATA FOR TASK')
                print('-------------------------------------')
            # handle csv data
            if path_to_data.split('.')[-1].lower() == 'csv':
                # if data frame read and only keep test records (drop unused target columns)
                test_data = pd.read_csv(path_to_data).loc[
                    # filter to test records
                    list(test_records)
                ].drop(
                    # filter to non-target columns (and drop specified columns if any)
                    columns = task_drop_columns
                )
            # handle json data
            if path_to_data.split('.')[-1].lower() == 'json':
                # if json read whole thing then reduce to target features
                with open(path_to_data, 'r') as json_file:
                    test_data = json.loads(json_file.read())
                    # filter down to only target records
                    test_data = {
                        # keep each column and its record value, but ignore (drop specified columns if any)
                        record:{
                            column:test_data[record][column] for column in test_data[record] if column not in task_drop_columns
                        }
                        for record in test_records
                    }
                    # cast dict to dataframe for pre-processing
                    test_data = pd.DataFrame(test_data).T
            if verbose:
                print('done reading test-data!\n')
            
            # PRE-PROCESS DATA SETS
            # ---------------------
            

    # if testing/debugging on a single target
    if one_run:
        break
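
The sample sizes printed in the output that follows come from two truncating `int()` calls: one for the sample size and one per class. The sketch below reproduces that arithmetic for the first task. `requested_counts` is a hypothetical helper; the record counts (1,309 in total, 1,271 remaining after the test set is reserved) and the rounded natural weights are taken from the printed output.

```python
def requested_counts(total_records, sample_pct, sample_weights):
    """Mirror the truncating arithmetic used when sizing a sample and its classes."""
    sample_size = int(total_records * sample_pct)
    per_class = {
        label: int(sample_size * weight)
        for label, weight in sample_weights.items()
    }
    return sample_size, per_class


natural_weights = {False: 0.91, True: 0.09}

# test set: 30% of a 10% sample, drawn from all 1,309 records
test_size, test_counts = requested_counts(1309, 0.30 * 0.10, natural_weights)
# training set: 70% of a 10% sample, drawn from the 1,271 remaining records
train_size, train_counts = requested_counts(1271, 0.70 * 0.10, natural_weights)

print(test_size, test_counts)     # 39 {False: 35, True: 3}
print(train_size, train_counts)   # 88 {False: 80, True: 7}
```

Because each class count is truncated separately, the classes sum to 38 and 87 records rather than the requested 39 and 88, which is why the reported minority-class weight is 0.08 instead of 0.09.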

Getting class weights
---------------------
2020-04-06 05:53:57.565973 Available Data Records: 1.31E+03 = 1,309

2020-04-06 05:53:57.566046 weighing classes in target: EMBARKED_Q
class:False
- count: 1186
- weight: 0.906035141329259
class:True
- count: 123
- weight: 0.09396485867074103

2020-04-06 05:53:57.568166 weighing classes in target: EMBARKED_S
class:True
- count: 914
- weight: 0.6982429335370511
class:False
- count: 395
- weight: 0.3017570664629488

2020-04-06 05:53:57.569337 weighing classes in target: EMBARKED_C
class:False
- count: 1039
- weight: 0.7937356760886173
class:True
- count: 270
- weight: 0.20626432391138275

2020-04-06 05:53:57.571507 WORKING ON TARGET COLUMN: EMBARKED_Q
2020-04-06 05:53:57.572069 Defined "brute force order" (order of incremental compute load iteration)
Here are the first 30
{'pct': 0.1, 'training_weights': {False: 0.91, True: 0.09}, 'model': 'Complement_Naive_Bayes', 'tested': False}
{'pct': 0.1, 'training_weights': {False: 0.91, True: 0.09}, 'mo

Getting class weights
---------------------
2020-04-06 05:53:58.000227 Available Data Records: 1.27E+03 = 1,271

2020-04-06 05:53:58.000309 weighing classes in target: EMBARKED_Q
class:False
- count: 1151
- weight: 0.90558615263572
class:True
- count: 120
- weight: 0.0944138473642801

2020-04-06 05:53:58.001825 defining training set...
Requested sample size: 88
requested_class_count: False 80
class: False is too large by 1071
 Downsampling
class:False
 resampled to weight:0.91, count:80

requested_class_count: True 7
class: True is too large by 113
 Downsampling
class:True
 resampled to weight:0.08, count:7

2020-04-06 05:53:58.003039 Done resampling
2020-04-06 05:53:58.003114 done defining training-set! 

2020-04-06 05:53:58.003169 READING TRAINING-DATA FOR TASK
-------------------------------------
done reading training-data!

2020-04-06 05:53:58.018016 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:53:58.031348 Working on task


2020-04-06 05:53:58.204891 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:53:58.213413 Working on task:
------------------------------------------
{'pct': 0.1, 'training_weights': {True: 0.3, False: 0.7}, 'model': 'Random_Forest_Classifier', 'tested': False} 

2020-04-06 05:53:58.213514 Resampling data for task

2020-04-06 05:53:58.213548 defining test set...
Requested sample size: 39
requested_class_count: False 35
class: False is too large by 1151
 Downsampling
class:False
 resampled to weight:0.9, count:35

requested_class_count: True 3
class: True is too large by 120
 Downsampling
class:True
 resampled to weight:0.08, count:3

2020-04-06 05:53:58.214448 Done resampling
2020-04-06 05:53:58.214506 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:53:58.215475 Available Data Records: 1.27E+03 = 1,271

2020-04-06 05:53:58.215515 weighing classes in target: EMBARKED_Q
class:False
- count: 1151
- w

done reading test-data!

2020-04-06 05:53:58.408327 Working on task:
------------------------------------------
{'pct': 0.1, 'training_weights': {False: 0.5, True: 0.5}, 'model': 'Nearest_Neighbors', 'tested': False} 

2020-04-06 05:53:58.408447 Resampling data for task

2020-04-06 05:53:58.408489 defining test set...
Requested sample size: 39
requested_class_count: False 35
class: False is too large by 1151
 Downsampling
class:False
 resampled to weight:0.9, count:35

requested_class_count: True 3
class: True is too large by 120
 Downsampling
class:True
 resampled to weight:0.08, count:3

2020-04-06 05:53:58.409274 Done resampling
2020-04-06 05:53:58.409340 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:53:58.411159 Available Data Records: 1.27E+03 = 1,271

2020-04-06 05:53:58.415099 weighing classes in target: EMBARKED_Q
class:False
- count: 1151
- weight: 0.90558615263572
class:True
- count: 120
- weight: 0.0944138473642801

2020-04-06 05:53:58.41

done reading test-data!

2020-04-06 05:53:58.830211 Working on task:
------------------------------------------
{'pct': 0.2, 'training_weights': {False: 0.91, True: 0.09}, 'model': 'Random_Forest_Classifier', 'tested': False} 

2020-04-06 05:53:58.830298 Resampling data for task

2020-04-06 05:53:58.830325 defining test set...
Requested sample size: 78
requested_class_count: False 70
class: False is too large by 1116
 Downsampling
class:False
 resampled to weight:0.9, count:70

requested_class_count: True 7
class: True is too large by 116
 Downsampling
class:True
 resampled to weight:0.09, count:7

2020-04-06 05:53:58.830834 Done resampling
2020-04-06 05:53:58.830884 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:53:58.831535 Available Data Records: 1.23E+03 = 1,232

2020-04-06 05:53:58.831577 weighing classes in target: EMBARKED_Q
class:False
- count: 1116
- weight: 0.9058441558441559
class:True
- count: 116
- weight: 0.09415584415584416

2020-04-06

---------------------
2020-04-06 05:53:59.037262 Available Data Records: 1.23E+03 = 1,232

2020-04-06 05:53:59.037307 weighing classes in target: EMBARKED_Q
class:False
- count: 1116
- weight: 0.9058441558441559
class:True
- count: 116
- weight: 0.09415584415584416

2020-04-06 05:53:59.038073 defining training set...
Requested sample size: 172
requested_class_count: False 137
class: False is too large by 979
 Downsampling
class:False
 resampled to weight:0.8, count:137

requested_class_count: True 34
class: True is too large by 82
 Downsampling
class:True
 resampled to weight:0.2, count:34

2020-04-06 05:53:59.038661 Done resampling
2020-04-06 05:53:59.038698 done defining training-set! 

2020-04-06 05:53:59.038749 READING TRAINING-DATA FOR TASK
-------------------------------------
done reading training-data!

2020-04-06 05:53:59.055303 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:53:59.073132 Working on task:
----------------


2020-04-06 05:53:59.239707 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:53:59.247544 Working on task:
------------------------------------------
{'pct': 0.2, 'training_weights': {True: 0.3, False: 0.7}, 'model': 'Neural_Network', 'tested': False} 

2020-04-06 05:53:59.247639 Resampling data for task

2020-04-06 05:53:59.247674 defining test set...
Requested sample size: 78
requested_class_count: False 70
class: False is too large by 1116
 Downsampling
class:False
 resampled to weight:0.9, count:70

requested_class_count: True 7
class: True is too large by 116
 Downsampling
class:True
 resampled to weight:0.09, count:7

2020-04-06 05:53:59.248245 Done resampling
2020-04-06 05:53:59.248485 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:53:59.249115 Available Data Records: 1.23E+03 = 1,232

2020-04-06 05:53:59.249172 weighing classes in target: EMBARKED_Q
class:False
- count: 1116
- weight: 0.9

done reading training-data!

2020-04-06 05:53:59.446558 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:53:59.455396 Working on task:
------------------------------------------
{'pct': 0.4, 'training_weights': {False: 0.91, True: 0.09}, 'model': 'Complement_Naive_Bayes', 'tested': False} 

2020-04-06 05:53:59.455498 Resampling data for task

2020-04-06 05:53:59.455532 defining test set...
Requested sample size: 157
requested_class_count: False 142
class: False is too large by 1044
 Downsampling
class:False
 resampled to weight:0.9, count:142

requested_class_count: True 14
class: True is too large by 109
 Downsampling
class:True
 resampled to weight:0.09, count:14

2020-04-06 05:53:59.456462 Done resampling
2020-04-06 05:53:59.456532 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:53:59.460554 Available Data Records: 1.15E+03 = 1,153

2020-04-06 05:53:59.460665 weighing classes in target: EMBARKE

done reading test-data!

2020-04-06 05:53:59.658299 Working on task:
------------------------------------------
{'pct': 0.4, 'training_weights': {False: 0.91, True: 0.09}, 'model': 'Add_a_Boost', 'tested': False} 

2020-04-06 05:53:59.658406 Resampling data for task

2020-04-06 05:53:59.658442 defining test set...
Requested sample size: 157
requested_class_count: False 142
class: False is too large by 1044
 Downsampling
class:False
 resampled to weight:0.9, count:142

requested_class_count: True 14
class: True is too large by 109
 Downsampling
class:True
 resampled to weight:0.09, count:14

2020-04-06 05:53:59.659648 Done resampling
2020-04-06 05:53:59.659730 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:53:59.660837 Available Data Records: 1.15E+03 = 1,153

2020-04-06 05:53:59.660905 weighing classes in target: EMBARKED_Q
class:False
- count: 1044
- weight: 0.9054640069384216
class:True
- count: 109
- weight: 0.09453599306157849

2020-04-06 05:53:5

done reading training-data!

2020-04-06 05:53:59.870407 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:53:59.884942 Working on task:
------------------------------------------
{'pct': 0.4, 'training_weights': {True: 0.3, False: 0.7}, 'model': 'Complement_Naive_Bayes', 'tested': False} 

2020-04-06 05:53:59.885064 Resampling data for task

2020-04-06 05:53:59.885105 defining test set...
Requested sample size: 157
requested_class_count: False 142
class: False is too large by 1044
 Downsampling
class:False
 resampled to weight:0.9, count:142

requested_class_count: True 14
class: True is too large by 109
 Downsampling
class:True
 resampled to weight:0.09, count:14

2020-04-06 05:53:59.886066 Done resampling
2020-04-06 05:53:59.886155 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:53:59.887018 Available Data Records: 1.15E+03 = 1,153

2020-04-06 05:53:59.887087 weighing classes in target: EMBARKED_


2020-04-06 05:54:00.073912 Working on task:
------------------------------------------
{'pct': 0.4, 'training_weights': {False: 0.5, True: 0.5}, 'model': 'Complement_Naive_Bayes', 'tested': False} 

2020-04-06 05:54:00.074011 Resampling data for task

2020-04-06 05:54:00.074053 defining test set...
Requested sample size: 157
requested_class_count: False 142
class: False is too large by 1044
 Downsampling
class:False
 resampled to weight:0.9, count:142

requested_class_count: True 14
class: True is too large by 109
 Downsampling
class:True
 resampled to weight:0.09, count:14

2020-04-06 05:54:00.075837 Done resampling
2020-04-06 05:54:00.075976 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:54:00.077411 Available Data Records: 1.15E+03 = 1,153

2020-04-06 05:54:00.077487 weighing classes in target: EMBARKED_Q
class:False
- count: 1044
- weight: 0.9054640069384216
class:True
- count: 109
- weight: 0.09453599306157849

2020-04-06 05:54:00.079188 defini

---------------------
2020-04-06 05:54:00.280085 Available Data Records: 1.15E+03 = 1,153

2020-04-06 05:54:00.280146 weighing classes in target: EMBARKED_Q
class:False
- count: 1044
- weight: 0.9054640069384216
class:True
- count: 109
- weight: 0.09453599306157849

2020-04-06 05:54:00.281059 defining training set...
Requested sample size: 322
requested_class_count: False 161
class: False is too large by 883
 Downsampling
class:False
 resampled to weight:0.5, count:161

requested_class_count: True 161
class: True is too small by 52
 Upsampling
class:True
 resampled to weight:0.5, count:161

2020-04-06 05:54:00.282351 Done resampling
2020-04-06 05:54:00.282394 done defining training-set! 

2020-04-06 05:54:00.282427 READING TRAINING-DATA FOR TASK
-------------------------------------
done reading training-data!

2020-04-06 05:54:00.296212 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:54:00.309882 Working on task:
----------------
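
The "defining training set" entries above suggest that each requested class count is the target weight times the requested sample size, truncated to an integer, and that a class is downsampled when more records are available than requested and upsampled otherwise. The sketch below reproduces that bookkeeping under those assumptions. `plan_class_resampling` and `resample_indices` are hypothetical names, and sampling with replacement for the upsampling step is an assumption, not something the log confirms.

```python
import random


def plan_class_resampling(available_counts, target_weights, sample_size):
    """Decide, per class, whether to down- or upsample (hypothetical helper)."""
    plan = {}
    for cls, weight in target_weights.items():
        requested = int(weight * sample_size)  # truncation, as the log suggests
        available = available_counts[cls]
        if available > requested:
            action = "downsample"
        elif available < requested:
            action = "upsample"
        else:
            action = "keep"
        plan[cls] = {
            "requested": requested,
            "action": action,
            "difference": abs(available - requested),
        }
    return plan


def resample_indices(indices, requested, rng):
    """Return `requested` row indices for one class.

    Downsampling draws without replacement. Upsampling keeps every row and
    adds duplicates drawn with replacement (an assumption of this sketch).
    """
    if len(indices) >= requested:
        return rng.sample(indices, requested)
    return indices + rng.choices(indices, k=requested - len(indices))


# Numbers from the log: pct 0.4, balanced weights, 1044 False and 109 True.
plan = plan_class_resampling(
    available_counts={False: 1044, True: 109},
    target_weights={False: 0.5, True: 0.5},
    sample_size=322,
)
rng = random.Random(0)
upsampled_true = resample_indices(list(range(109)), plan[True]["requested"], rng)
```

Only the training set is reweighted this way. In the log, the test set is always drawn at roughly the natural 0.91 / 0.09 balance, matching the rule that test data balance is not manipulated.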

done reading training-data!

2020-04-06 05:54:00.485165 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:54:00.500331 Working on task:
------------------------------------------
{'pct': 0.8, 'training_weights': {False: 0.91, True: 0.09}, 'model': 'Add_a_Boost', 'tested': False} 

2020-04-06 05:54:00.500445 Resampling data for task

2020-04-06 05:54:00.500483 defining test set...
Requested sample size: 314
requested_class_count: False 285
class: False is too large by 901
 Downsampling
class:False
 resampled to weight:0.91, count:285

requested_class_count: True 28
class: True is too large by 95
 Downsampling
class:True
 resampled to weight:0.09, count:28

2020-04-06 05:54:00.505019 Done resampling
2020-04-06 05:54:00.505149 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:54:00.506404 Available Data Records: 9.96E+02 = 996

2020-04-06 05:54:00.506474 weighing classes in target: EMBARKED_Q
class:Fals

---------------------
2020-04-06 05:54:00.687233 Available Data Records: 9.96E+02 = 996

2020-04-06 05:54:00.687271 weighing classes in target: EMBARKED_Q
class:False
- count: 901
- weight: 0.9046184738955824
class:True
- count: 95
- weight: 0.09538152610441768

2020-04-06 05:54:00.687686 defining training set...
Requested sample size: 557
requested_class_count: False 445
class: False is too large by 456
 Downsampling
class:False
 resampled to weight:0.8, count:445

requested_class_count: True 111
class: True is too small by 16
 Upsampling
class:True
 resampled to weight:0.2, count:111

2020-04-06 05:54:00.688415 Done resampling
2020-04-06 05:54:00.688468 done defining training-set! 

2020-04-06 05:54:00.688496 READING TRAINING-DATA FOR TASK
-------------------------------------
done reading training-data!

2020-04-06 05:54:00.698375 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:54:00.706367 Working on task:
--------------------


2020-04-06 05:54:00.889718 Working on task:
------------------------------------------
{'pct': 0.8, 'training_weights': {False: 0.5, True: 0.5}, 'model': 'Add_a_Boost', 'tested': False} 

2020-04-06 05:54:00.889970 Resampling data for task

2020-04-06 05:54:00.890019 defining test set...
Requested sample size: 314
requested_class_count: False 285
class: False is too large by 901
 Downsampling
class:False
 resampled to weight:0.91, count:285

requested_class_count: True 28
class: True is too large by 95
 Downsampling
class:True
 resampled to weight:0.09, count:28

2020-04-06 05:54:00.891682 Done resampling
2020-04-06 05:54:00.891820 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:54:00.893769 Available Data Records: 9.96E+02 = 996

2020-04-06 05:54:00.893868 weighing classes in target: EMBARKED_Q
class:False
- count: 901
- weight: 0.9046184738955824
class:True
- count: 95
- weight: 0.09538152610441768

2020-04-06 05:54:00.894922 defining training set.

done reading training-data!

2020-04-06 05:54:01.101878 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:54:01.112531 Working on task:
------------------------------------------
{'pct': 1.0, 'training_weights': {False: 0.91, True: 0.09}, 'model': 'Complement_Naive_Bayes', 'tested': False} 

2020-04-06 05:54:01.112631 Resampling data for task

2020-04-06 05:54:01.112666 defining test set...
Requested sample size: 392
requested_class_count: False 356
class: False is too large by 830
 Downsampling
class:False
 resampled to weight:0.91, count:356

requested_class_count: True 35
class: True is too large by 88
 Downsampling
class:True
 resampled to weight:0.09, count:35

2020-04-06 05:54:01.113889 Done resampling
2020-04-06 05:54:01.114002 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:54:01.114947 Available Data Records: 9.18E+02 = 918

2020-04-06 05:54:01.114995 weighing classes in target: EMBARKED_Q

done reading training-data!

2020-04-06 05:54:01.341258 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:54:01.352797 Working on task:
------------------------------------------
{'pct': 1.0, 'training_weights': {False: 0.91, True: 0.09}, 'model': 'Neural_Network', 'tested': False} 

2020-04-06 05:54:01.353222 Resampling data for task

2020-04-06 05:54:01.353244 defining test set...
Requested sample size: 392
requested_class_count: False 356
class: False is too large by 830
 Downsampling
class:False
 resampled to weight:0.91, count:356

requested_class_count: True 35
class: True is too large by 88
 Downsampling
class:True
 resampled to weight:0.09, count:35

2020-04-06 05:54:01.354694 Done resampling
2020-04-06 05:54:01.354822 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:54:01.355833 Available Data Records: 9.18E+02 = 918

2020-04-06 05:54:01.355881 weighing classes in target: EMBARKED_Q
class:F

done reading test-data!

2020-04-06 05:54:01.552706 Working on task:
------------------------------------------
{'pct': 1.0, 'training_weights': {True: 0.3, False: 0.7}, 'model': 'Nearest_Neighbors', 'tested': False} 

2020-04-06 05:54:01.552961 Resampling data for task

2020-04-06 05:54:01.553001 defining test set...
Requested sample size: 392
requested_class_count: False 356
class: False is too large by 830
 Downsampling
class:False
 resampled to weight:0.91, count:356

requested_class_count: True 35
class: True is too large by 88
 Downsampling
class:True
 resampled to weight:0.09, count:35

2020-04-06 05:54:01.554209 Done resampling
2020-04-06 05:54:01.554404 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:54:01.555519 Available Data Records: 9.18E+02 = 918

2020-04-06 05:54:01.555584 weighing classes in target: EMBARKED_Q
class:False
- count: 830
- weight: 0.9041394335511983
class:True
- count: 88
- weight: 0.09586056644880174

2020-04-06 05:54:01

done reading training-data!

2020-04-06 05:54:01.758425 READING TEST-DATA FOR TASK
-------------------------------------
done reading test-data!

2020-04-06 05:54:01.769690 Working on task:
------------------------------------------
{'pct': 1.0, 'training_weights': {False: 0.5, True: 0.5}, 'model': 'Complement_Naive_Bayes', 'tested': False} 

2020-04-06 05:54:01.769786 Resampling data for task

2020-04-06 05:54:01.769819 defining test set...
Requested sample size: 392
requested_class_count: False 356
class: False is too large by 830
 Downsampling
class:False
 resampled to weight:0.91, count:356

requested_class_count: True 35
class: True is too large by 88
 Downsampling
class:True
 resampled to weight:0.09, count:35

2020-04-06 05:54:01.771005 Done resampling
2020-04-06 05:54:01.771126 done defining test-set!

Getting class weights
---------------------
2020-04-06 05:54:01.772222 Available Data Records: 9.18E+02 = 918

2020-04-06 05:54:01.772282 weighing classes in target: EMBARKED_Q
c