# How Many Events Are Good Enough?
The number of events directly affects how long training takes. This notebook explores how performance varies with the number of events. Two questions need to be answered:

- How stable is a run?
- How many events are needed?

## Initialization

In [6]:
from bdt_training_scikit_tools import (
    plot_training_performance, load_default_samples, test_train_samples,
    prep_samples, default_training, calc_performance, calc_performance_for_run,
)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## Load Data Samples
Load all the data we need.

In [7]:
def fraction(fractionGoal):
    '''Build a sequence of divisors that selects roughly the requested fraction of events.
    Meant to be used against an EventNumber to evenly pare down the number of events.
    
    Args
        fractionGoal - the fraction of events we'd like to see
        
    Returns
        seq - A tuple of divisors. Event numbers divisible by at least one of them
              make up approximately fractionGoal of all event numbers.
    '''
    if (fractionGoal < 0) or (fractionGoal > 1.0):
        raise ValueError("Fraction must be between 0.0 and 1.0 (not {0})".format(fractionGoal))
        
    seq = ()
    fg = fractionGoal   # fraction still left to cover
    maxCount = 300      # window of event numbers used to measure the achieved fraction
    for i in range(1, maxCount):
        frac = 1.0/i
        if frac <= fg:
            # Taking every i-th event fits in what is left, so add i as a divisor.
            seq = seq + (i,)
            # Recount how many numbers in the window are selected by any divisor so far.
            selected = [n for n in range(maxCount) if any(n % d == 0 for d in seq)]
            actualFraction = len(selected)/maxCount
            fg = fractionGoal - actualFraction

    return seq
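
As a quick sanity check on a divisor sequence, the sketch below computes what fraction of event numbers in a window is divisible by at least one divisor. The helper name `selected_fraction` and the default window of 300 (mirroring `maxCount`) are illustrative choices, not part of `bdt_training_scikit_tools`.

```python
def selected_fraction(seq, window=300):
    """Fraction of integers in range(window) divisible by at least one element of seq.

    Hypothetical helper for illustration; `window` mirrors maxCount in fraction().
    """
    selected = [n for n in range(window) if any(n % d == 0 for d in seq)]
    return len(selected) / window


# A single divisor d selects every d-th event number.
print(selected_fraction((2,)))   # 0.5
print(selected_fraction((4,)))   # 0.25
```

Divisors that share multiples overlap, which is why the achieved fraction has to be recounted after each new divisor instead of simply summing `1/d`.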

In [8]:
all_events = load_default_samples("20")

BIB: 100000 events
Multijet: 100000 events
Signal: 100000 events


Memory use after loading the samples above, as read from the Task Manager on Windows 10 (Fall Creators Update):

- 100K events each for BIB, MJ, and Signal: 275 MB
- 600K events each for BIB, MJ, and Signal: 600 MB
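
Task Manager readings cover the whole Python process. For a rough programmatic cross-check, the standard-library `tracemalloc` module can report how much memory Python allocated while a piece of code ran. The sketch below is an assumption about how one might measure this, not how the numbers above were obtained; `measure_peak_mb` and the stand-in workload are hypothetical.

```python
import tracemalloc


def measure_peak_mb(func):
    """Run func() and return (result, peak traced memory in MB).

    Hypothetical helper: only allocations traced by tracemalloc are counted,
    so the number will generally differ from what Task Manager shows.
    """
    tracemalloc.start()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1024 ** 2


# Stand-in workload; in the notebook this could wrap load_default_samples("20").
data, peak_mb = measure_peak_mb(lambda: [float(i) for i in range(100000)])
print("peak: {0:.1f} MB".format(peak_mb))
```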

In [25]:
def calcDFFilter(df, seq):
    '''Build a boolean mask that is True for events whose EventNumber
    is not divisible by any of the divisors in seq.
    '''
    gf, *gfRest = [df.EventNumber % i != 0 for i in seq]
    for g in gfRest:
        gf = gf & g
    return gf

def get_fraction_of_events(events, fractionToUse):
    '''Return a fraction of all events as training and testing samples.
    
    Args
        events - Triple of DataFrames (bib, mj, sig) holding all events
        fractionToUse - fraction of the full data sample we should be using
        
    Returns
        training - Training triple of events (bib, mj, sig)
        testing - Testing triple of events (bib, mj, sig)
    
    '''
    # Create the per-event filter. We have to do this against each
    # of the three input samples.
    seq = fraction(fractionToUse)
    fracFilters = [calcDFFilter(df, seq) for df in events]

    # Apply each mask to its own sample.
    fraction_events = [df[mask] for mask, df in zip(fracFilters, events)]
    return test_train_samples(fraction_events)
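
To see what the mask built by `calcDFFilter` does, the sketch below applies the same construction to a toy DataFrame. The `EventNumber` values are made up for illustration, and `functools.reduce` stands in for the explicit loop.

```python
from functools import reduce
import operator

import pandas as pd

# Toy sample: twelve events with consecutive event numbers (made-up data).
df = pd.DataFrame({"EventNumber": range(12)})

seq = (2,)  # the divisor sequence that fraction(0.5) produces

# One boolean Series per divisor: True when the event number is NOT a multiple.
masks = [df.EventNumber % d != 0 for d in seq]

# AND the masks together, as calcDFFilter does.
keep = reduce(operator.and_, masks)

kept = df[keep]
print(kept.EventNumber.tolist())  # [1, 3, 5, 7, 9, 11]
```

Because the mask keeps events that are not multiples of any divisor, it retains the complement of the events the divisors select. For a fraction of 0.5 the two halves are the same size.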

## Stability
Look at the stability of a run by repeating the training several times on a fixed fraction of events.

In [38]:
def stability_run(train, testing, nRuns):
    '''Run the same training n-times
    
    Args
        train - (bib, mj, sig), each one the full DF of all events to train on (including weights)
        testing - (bib, mj, sig), each one the full DF of all events to test on (including weights)
        nRuns - How many times shall we repeat the training?
    
    Returns
        [i,...] - i is the dict from calc_performance with all the results
    
    '''
    
    # Prep the events for running
    all_events, all_events_class, training_weight, evaluation_weight = prep_samples(train[0], train[1], train[2])
    
    # Train and evaluate nRuns times, collecting one result per run
    return [calc_performance(default_training(all_events, training_weight, all_events_class), testing) for _ in range(nRuns)]

In [39]:
training, testing = get_fraction_of_events(all_events, 0.5)

In [49]:
%%time
r = stability_run(training, testing, 10)

Wall time: 2min 55s


In [50]:
results = pd.DataFrame(r)
results.HSSSsqrtB

0    14.093197
1    14.093197
2    14.093197
3    14.093197
4    14.093197
5    14.093197
6    14.093197
7    14.093197
8    14.093197
9    14.093197
Name: HSSSsqrtB, dtype: float64

Conclusion from this: *BDT boosting is deterministic*. Ten trainings on the same events give identical results.

- We don't have to repeat a training several times to tell two results apart.
- Differences between runs are not training noise, so we have to treat them more seriously.
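
One way to make this check explicit is to compute the spread of the metric across runs. The sketch below uses the ten `HSSSsqrtB` values printed above; `run_spread` is a hypothetical helper, not part of `bdt_training_scikit_tools`.

```python
import statistics


def run_spread(values):
    """Return (mean, population standard deviation, max - min) for a list of run results."""
    return statistics.mean(values), statistics.pstdev(values), max(values) - min(values)


# The ten HSSSsqrtB values reported by the stability run.
hsss = [14.093197] * 10

mean, std, value_range = run_spread(hsss)
print(value_range)  # 0.0
```

In the notebook the same call could be made on `results.HSSSsqrtB.tolist()`.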