## Intro

This notebook gradually adds different complexity metrics to the ledger of all predictions across DVs, models, epochs -- essentially left-joining new metrics onto the ledger.

We checkpoint several of these intermediate ledgers, but the final output of this notebook is `ledger_len`, which is passed to ArtifactProcessing.ipynb to create the input data for the artifacts (i.e., tables and figures) included in the paper.

Stored values can be loaded at several checkpoints by setting the `USE_STORED` flag to True; otherwise the notebook recalculates these values. The flag is reset before each checkpoint, so individual stages can be loaded or recomputed independently. Initial data loading can take 8-10min, but most other calculations finish in 1-2min with parallelization.

In [1]:
USE_STORED = True

### Load Ledgers

We store a "ledger" for each model trained, where the ledger tracks model output / loss for every instance over each epoch.
They are inherently heavy documents (w.r.t. memory) but we need this instance-epoch level of granularity to calculate dynamic complexity metrics.

In [2]:
import numpy as np
import pandas as pd
from utils.post_processing import *

if USE_STORED:
    # WARNING: takes ~20sec
    # TODO: replace with third-party hosting
    all_ledgers = pd.read_csv( '/Users/ryancook/Downloads/all_ledgers.csv', keep_default_na=False, low_memory=False )
    PROBLEM_ID = 'f5_r594'
    all_ledgers = remove_duplicate_id( PROBLEM_ID, all_ledgers )
else:
    # WARNING: takes ~8min
    # TODO; better time printout
    from utils.post_processing import load_and_clean_ledgers
    all_ledgers = load_and_clean_ledgers()

# used later to ensure proper joins
check_counts = all_ledgers.groupby([ 'score_var', 'split_full' ])[ 'ID' ].count().reset_index().rename(columns={'ID': 'count'})
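
The row counts stored in `check_counts` act as a guard for every later left join: after attaching a new metric and dropping unmatched rows, each task-split must still contain the same number of rows. Below is a minimal sketch of that pattern on toy data; the frame contents and the `toy_metric` column are hypothetical and stand in for the real ledgers and metrics.

```python
import pandas as pd

# Toy ledger: each ID appears once per epoch, as in the real ledgers.
ledger = pd.DataFrame({
    'score_var': ['A', 'A', 'A', 'A'],
    'split_full': ['test', 'test', 'test', 'test'],
    'ID': ['r1', 'r1', 'r2', 'r2'],
    'epoch': [1, 2, 1, 2],
})

# Hypothetical per-instance metric to attach (one row per ID).
metric = pd.DataFrame({'ID': ['r1', 'r2'], 'toy_metric': [0.3, 0.7]})

# Expected row count per task-split, recorded before any join.
expected = ledger.groupby(['score_var', 'split_full'])['ID'].count()

# Left join; validate= raises if the metric table has duplicate IDs.
joined = pd.merge(ledger, metric, on='ID', how='left', validate='many_to_one')

# Keep only rows that found a match, then recount.
matched = joined[joined['toy_metric'].notna()]
observed = matched.groupby(['score_var', 'split_full'])['ID'].count()

# The join is sound only if no task-split lost (or gained) rows.
join_ok = bool(observed.reindex(expected.index, fill_value=0).eq(expected).all())
```

If an ID were missing from the metric table, its rows would be dropped by the `notna()` filter and `join_ok` would be False.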

### v-Information 

The difficulty of a given instance can be seen as its lack of v-usable information, which considers the accessibility of Shannon mutual information between an encrypted input $X$ and an output $Y$ (Xu et al., 2020; Shannon, 2001; Ethayarajh et al., 2022).


<br>

$
    \text{PVI}(x_i \rightarrow y_i) = -\log_2 g(y_i | \varnothing) +
    \log_2 g'(y_i | x_i)
$


where $g$ is the "null model" trained on the null string $\varnothing$ and $g'$ is the primary model trained on the inputs. Averaging each term over the data gives the entropies $H_{\nu}(Y) = \mathbf{E}[-\log_2 g(y_i | \varnothing)]$ and $H_{\nu}(Y|X) = \mathbf{E}[-\log_2 g'(y_i | x_i)]$.

<br>

These calculations can be done below in ~2sec.

In [3]:
def calculate_entropy(probs):
    return -1 * np.log2(probs)

def calculate_pvi(model_probs, null_probs):
    return calculate_entropy( null_probs ) - calculate_entropy( model_probs )


# Fast row-wise calculation since we don't need to go epoch by epoch
# empty strings (from keep_default_na=False) become NaN so the float cast succeeds
all_ledgers['null_probs'] = np.where(all_ledgers['null_probs']=='', np.nan, all_ledgers['null_probs'] )
all_ledgers['PVI'] = calculate_pvi(all_ledgers['probs'].astype(float), all_ledgers['null_probs'].astype(float))
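
As a quick sanity check on the sign convention, here is a small worked example with made-up probabilities (the arrays below are illustrative, not taken from the ledgers). An instance where the primary model does better than the null model gets positive PVI, and one where it does worse gets negative PVI.

```python
import numpy as np

def pvi(model_prob, null_prob):
    # -log2 g(y | null) + log2 g'(y | x)
    return -np.log2(null_prob) + np.log2(model_prob)

# Hypothetical probabilities assigned to the gold label.
model_probs = np.array([0.9, 0.5, 0.25])
null_probs = np.array([0.5, 0.5, 0.5])

pvi_values = pvi(model_probs, null_probs)
# First instance: model beats the null model, so PVI > 0.
# Second instance: both agree, so PVI = 0.
# Third instance: model is half as confident as the null model, so PVI = -1 bit.
```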

### Forgetting Statistics

Forgotten examples are instances which are classified correctly in earlier epochs but misclassified at some later epoch in the training process (Toneva et al., 2019). Instances that are forgotten more frequently are considered to be more complex.

<br>

$
    \text{TF} = \sum \limits_{e=1}^{|E|} \sum \limits_{k=e+1}^{|E|} \mathbf{f}(y_i | x_i)
$

where $e \in E$ are training epochs.

<br>

These calculations can be done below in ~2min with parallelization.
Note that all parallelization was done on $n-1=7$ cores of a Mac M1 machine.

In [4]:
# remember returns zero-indexed IDXs not EPOCHs
def get_learn_forget_events(corr_idxs, last_idx=14):
    if len(corr_idxs)==0:
        return [], []
    
    # first one is always a learning event
    these_les = [corr_idxs[0]]
    these_fes = []
    # loop the rest of them
    counter = 1
    for cid in corr_idxs[1:]:
        prev_cid = corr_idxs[counter-1]
        counter+=1

        # a break of >1 cdx implies a learning event
        if prev_cid+1 == cid:
            continue
        else:
            these_les.append(cid)
            # NOTE; we also have a forgetting event after last cid
            these_fes.append(prev_cid+1)

    # check if we forget it again after cids
    last_cid = corr_idxs[-1]
    if last_cid!=last_idx:
        these_fes.append(last_cid+1)
    
    return these_les, these_fes


def proc_mn_forget(mn):
    mdf = all_ledgers[all_ledgers['model_name']==mn]
    # forgetting stats done at model level
    corr_by_ep = mdf.pivot_table('correct', ['ID'], 'epoch')

    rows = []
    for item in corr_by_ep.index:
        this_row = corr_by_ep.loc[item, :]
        # zero-indexed positions of the epochs where this instance was correct
        corr_idxs = list(np.where(np.array(this_row)==1)[0])

        learn_events, forget_events = get_learn_forget_events(corr_idxs)
        rows.append({'ID': item,
                     'ep_times_learned': len(learn_events),
                     'ep_times_forgotten': len(forget_events),
                     'ep_is_unforgettable': int( len(forget_events)==0 )})

    # build the per-instance frame once instead of concatenating inside the loop
    idf = pd.DataFrame(rows, columns=['ID', 'ep_times_learned', 'ep_times_forgotten', 'ep_is_unforgettable'])

    return pd.merge( mdf, idf, on='ID', how='left' )


USE_STORED = False

if USE_STORED:
    ledger_forget = pd.read_csv( '/Users/ryancook/Downloads/these_ledgers/ledger_forget.csv', keep_default_na=False )
    ledger_forget['split_full'] = [ s + '-' + ledger_forget['strat'].iloc[sdx] if s=='train' else s for sdx, s in enumerate(ledger_forget['split']) ]
else:
    # start with index-level accuracy (i.e., correct)
    all_ledgers['correct'] = (all_ledgers['preds'] == all_ledgers['labels']).astype(int)
        
    #process files in parallel
    import multiprocess
    # leave one core free; the pool itself is created in the `with` block below
    cores_to_use = multiprocess.cpu_count()-1

    # WARNING: takes ~2min
    all_mns = all_ledgers['model_name'].unique()
    with multiprocess.Pool(cores_to_use) as pool:
        out_list = pool.map(proc_mn_forget, all_mns)

    ledger_forget = pd.concat( out_list )
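
The event counting can be illustrated on a single toy correctness sequence. The sketch below is an alternative, vectorized way of counting the same transitions (0 to 1 for learning, 1 to 0 for forgetting) and is not the function used above; `count_learn_forget` is a hypothetical helper name.

```python
import numpy as np

def count_learn_forget(correct):
    """Count learning (0 -> 1) and forgetting (1 -> 0) transitions.

    `correct` holds one 0/1 flag per epoch. A leading 0 is prepended so
    that being correct at the first epoch counts as a learning event.
    """
    c = np.asarray(correct, dtype=int)
    diffs = np.diff(np.concatenate([[0], c]))
    n_learn = int((diffs == 1).sum())
    n_forget = int((diffs == -1).sum())
    return n_learn, n_forget

# Learned at epoch index 1, forgotten at 3, relearned at 4, forgotten at 5.
toy_events = count_learn_forget([0, 1, 1, 0, 1, 0, 0])

# Correct from the first epoch onward: learned once, never forgotten.
stable_events = count_learn_forget([1, 1, 1])
```

On these inputs the counts agree with `get_learn_forget_events` when `last_idx` is set to the final index of the sequence.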

In [50]:
ledger_forget

Unnamed: 0,model_name,epoch,split,ID,labels,probs,preds,losses,null_probs,score_var,strat,split_full,PVI,correct,ep_times_learned,ep_times_forgotten,ep_is_unforgettable
0,local-ffn_Numeracy_ft-True_sub-False_t7_numLay...,1,train,f4_r656,1,0.452078,0.0,0.793901,0.49707943,Numeracy,Constant,train-Constant,-0.136905,0,1,0,1
1,local-ffn_Numeracy_ft-True_sub-False_t7_numLay...,1,train,f5_r523,0,0.494448,0.0,0.682105,0.49707943,Numeracy,Constant,train-Constant,-0.007656,1,2,2,0
2,local-ffn_Numeracy_ft-True_sub-False_t7_numLay...,1,train,f5_r402,1,0.444545,0.0,0.810704,0.49707943,Numeracy,Constant,train-Constant,-0.161148,0,1,1,0
3,local-ffn_Numeracy_ft-True_sub-False_t7_numLay...,1,train,f4_r855,1,0.465312,0.0,0.765048,0.49707943,Numeracy,Constant,train-Constant,-0.095279,0,1,1,0
4,local-ffn_Numeracy_ft-True_sub-False_t7_numLay...,1,train,f2_r1164,0,0.439030,0.0,0.578088,0.49707943,Numeracy,Constant,train-Constant,-0.179157,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79855,local-cnn_Numeracy_ft-True_sub-False_t5_numLay...,15,test,f1_r1697,0,0.370953,0.0,0.463550,0.5009064674377441,Numeracy,Constant,test,-0.433304,1,1,0,1
79856,local-cnn_Numeracy_ft-True_sub-False_t5_numLay...,15,test,f1_r1698,0,0.354244,0.0,0.437333,0.5009064674377441,Numeracy,Constant,test,-0.499800,1,1,0,1
79857,local-cnn_Numeracy_ft-True_sub-False_t5_numLay...,15,test,f1_r1699,0,0.353470,0.0,0.436136,0.5009064674377441,Numeracy,Constant,test,-0.502953,1,1,0,1
79858,local-cnn_Numeracy_ft-True_sub-False_t5_numLay...,15,test,f1_r1700,0,0.361193,0.0,0.448153,0.5009064674377441,Numeracy,Constant,test,-0.471770,1,1,0,1


### PyHard

The PyHard algorithm uses instance space analysis to sample only informative meta-features and efficiently generate a single output probability of misclassification (PH) from a pool of seven diverse classifiers (Paiva et al., 2021).
A higher PH value indicates that an instance has a higher probability of being misclassified.

<br>

$
    \text{PH}_{\mathcal{L}} \Big( \langle x_i, y_i \rangle \Big) = 1 - \frac{1}{|\mathcal{L}|} \sum \limits_{j=1}^{|\mathcal{L}|} p \Big( y_i | x_i, ~g_j(t, \alpha) \Big)
$

where $\mathcal{L}$ is the pool of diverse classifiers and $g_j(t, \alpha)$ is the $j$-th learning algorithm trained on data $t$ with hyperparameters $\alpha$.

<br>

Note that these calculations are generated from the <i>run_pyhard.py</i> script and stored in <i>pyhard/</i> for convenient access later.
While calculations took several hours on an HTC Condor compute system, they can be added back to the central ledger below in ~1min with parallelization.

In [5]:
def proc_hard_sv_split(in_tup):
    ph_base = 'pyhard/'

    svar, full_split = in_tup

    if svar == 'wer':
        input_data_d = load_zda_data('data/data_zda/', subset=False)
    else:
        input_data_d = load_hal_data('data/DataCVFolds/', score_variable=svar, subset=False)
    
    split = full_split.split('-')[0]

    if 'train' in full_split:
        input_data_d['hard_train_data'], input_data_d['rand_train_data'] = sample_hard_rand(input_data_d['train_data'], svar)

    split_path = full_split if 'train' not in full_split else ('hard_train' if 'Constant' in full_split else 'rand_train' )
    this_input_data = input_data_d[split_path+'_data']

    full_path = ph_base + f'{svar}/{split_path}/'
    this_ih = pd.read_csv( full_path + 'ih.csv' )

    ih_id = pd.concat([ this_input_data[['ID']], this_ih ], axis=1)
    ih_id['score_var'] = svar
    ih_id['split'] = split
    ih_id['split_full'] = full_split
    ih_id['instance_hardness'] = ih_id['instance_hardness'].fillna(0)

    # INIT CHECK: loaded data must match length of stored pyhard 
    ## stored from previously running run_pyhard.py
    if len(this_input_data) != len(ih_id):
        print('-----------------------------')
        print(f'INIT PROBLEM WITH {svar}-{full_split}')
        print('-----------------------------')
        raise ValueError('Shape of input data must match instance hardness calculations')

    # append pyhard metrics on left-hand side and filter 
    sub_ledg = ledger_forget[((ledger_forget['score_var']==svar) &  (ledger_forget['split_full']==full_split))]
    this_join = pd.merge( sub_ledg, ih_id, on=['score_var', 'split_full', 'split', 'ID'],
                            how='left' ).drop('instances', axis=1)
    sub_join = this_join[this_join['instance_hardness'].isna()==False]


    # JOIN CHECK: must have the same number of IDs in the existing ledger for this task-split
    join_check_num = check_counts[( (check_counts['score_var']==svar) &
                                (check_counts['split_full']==full_split) )]['count'].iloc[0]
    if join_check_num != len(sub_join):
        print('-----------------------------')
        print(f'JOIN PROBLEM WITH {svar}-{full_split}')
        print('-----------------------------')
        raise ValueError('Shape of joined data must match expected dimensions of ledger_forget')

    return sub_join



USE_STORED = False

if USE_STORED:
    ledger_hardness = pd.read_csv( '/Users/ryancook/Downloads/these_ledgers/ledger_hardness.csv', keep_default_na=False )
else:
    # NOTE; these are older functions to avoid filtering NAs from demographic information
    ## we are only interested in IDs / text strings here
    from utils.data_loading_no_demo import *
    from utils.data_processing import *
        
    #process files in parallel
    import multiprocess
    # leave one core free; the pool itself is created in the `with` block below
    cores_to_use = multiprocess.cpu_count()-1

    from itertools import product
    these_svars, these_full_splits = ledger_forget['score_var'].unique(), ledger_forget['split_full'].unique()
    all_sv_splits = list( product( these_svars, these_full_splits ) )
    
    with multiprocess.Pool(cores_to_use) as pool:
        out_list = pool.map(proc_hard_sv_split, all_sv_splits)

    ledger_hardness = pd.concat( out_list )
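
To make the PH formula concrete, here is a small worked example with invented numbers. It assumes we already have, for each instance, the probability each classifier in the pool assigns to the true label; it does not call PyHard itself.

```python
import numpy as np

# Rows are instances, columns are classifiers in the pool.
# Each entry is a hypothetical p(y_i | x_i, g_j): the probability the
# j-th classifier assigns to the true label of instance i.
probs_true_label = np.array([
    [0.90, 0.80, 0.95],   # an easy instance
    [0.40, 0.50, 0.30],   # a hard instance
])

# PH = 1 - mean probability of the true label across the pool.
ph = 1.0 - probs_true_label.mean(axis=1)
```

The second instance receives the higher PH value because the pool assigns its true label a lower average probability.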

### IRT

The difficulty $b$ of a given item can be considered as the point on the ability scale $\theta$ where the probability of any subject providing a correct answer is $p(\theta)=0.5$.
We can obtain IRT difficulty estimates in a one-parameter model by fitting the Item Characteristic Curve:

$
    p(\theta) = \frac{ \displaystyle 1}{ \displaystyle 1 + e^{-(\theta-b)}}
$

<br>

Note that calculations were implemented in Working_py-irt.ipynb (in ~2min) since py-irt has conflicting package requirements.
Code is included to create train predictions, since the training ledger only stores val and test split predictions, but here we load these post hoc computed values from <i>py-irt/</i>.
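
For intuition, the curve can be evaluated directly. The sketch below assumes the standard one-parameter logistic form, in which the probability of a correct answer rises with ability; the difficulty and ability values are invented for illustration and are unrelated to the py-irt estimates.

```python
import numpy as np

def icc_1pl(theta, b):
    # One-parameter logistic item characteristic curve.
    return 1.0 / (1.0 + np.exp(-(theta - b)))

b = 0.8                                  # hypothetical item difficulty
abilities = np.array([-1.0, 0.8, 2.5])   # hypothetical subject abilities

p_correct = icc_1pl(abilities, b)
# A subject whose ability equals the item difficulty answers correctly
# with probability 0.5; weaker subjects fall below, stronger ones above.
```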

In [67]:
def proc_irt_sv_split(in_tup):
    svar, full_split = in_tup
    this_id_diff = pd.read_csv( f"py-irt/{svar}_{full_split}_id_difficulty.csv" )
    this_id_diff['split_full'] = this_id_diff['split']
    this_id_diff['split'] = [ s.split('-')[0] for s in this_id_diff['split_full'] ]

    sub_ledg = ledger_hardness[((ledger_hardness['score_var']==svar) &  (ledger_hardness['split_full']==full_split))]
    this_join = pd.merge( sub_ledg, this_id_diff,
                        on=['score_var', 'split_full', 'split', 'ID'], how='left' )
    sub_join = this_join[this_join['irt_difficulty'].isna()==False]

    # also add boundary proximity while we're looping thru
    tdf = get_full_split_df(svar, full_split)
    # copy so the column assignments below do not write into a view of tdf
    sub_tdf = tdf[['ID', 'score']].copy()
    sub_tdf['split_full'] = full_split
    sub_tdf['score_var'] = svar
    sub_tdf['split'] = full_split.split('-')[0]
    
    bound_join = pd.merge( sub_join, sub_tdf, on=['score_var', 'split_full', 'split', 'ID'], how='left')
    bound_sub = bound_join[ bound_join['score'].isna()==False ]

    join_check_num = check_counts[( (check_counts['score_var']==svar) & (check_counts['split_full']==full_split) )]['count'].iloc[0]
    if join_check_num != len(bound_sub):
        print('-----------------------------')
        print(f'JOIN PROBLEM WITH {svar}-{full_split}')
        print('-----------------------------')
        raise ValueError('Shape of joined data must match expected dimensions of ledger_hardness')
    # print('passed join check...')

    return bound_sub



pd.options.mode.chained_assignment = None

ledger_diff = pd.DataFrame()

# NOTE; our splits are named differently
split_lst = ['train-None', 'train-Constant', 'val', 'test']
check_counts = ledger_hardness.groupby([ 'score_var', 'split_full' ])[ 'ID' ].count().reset_index().rename(columns={'ID': 'count'})


USE_STORED = True

if USE_STORED:
    ledger_diff = pd.read_csv( '/Users/ryancook/Downloads/these_ledgers/ledger_diff.csv', keep_default_na=False )
else:
    # NOTE; these are older functions to avoid filtering NAs from demographic information
    ## we are only interested in IDs / text strings here
    from utils.data_loading_no_demo import *
    from utils.data_processing import *
        
    #process files in parallel
    import multiprocess
    # leave one core free; the pool itself is created in the `with` block below
    cores_to_use = multiprocess.cpu_count()-1

    from itertools import product
    these_svars, these_full_splits = ledger_hardness['score_var'].unique(), ledger_hardness['split_full'].unique()
    all_sv_splits = list( product( these_svars, these_full_splits ) )
    
    with multiprocess.Pool(cores_to_use) as pool:
        out_list = pool.map(proc_irt_sv_split, all_sv_splits)

    ledger_diff = pd.concat( out_list )



### Boundary Proximity

Boundary proximity can be seen as "the difficulty in separating the data points into their expected classes," and this complexity increases as the distance between a given point and the classification boundary shrinks (Lorena et al., 2024).

$
    \text{BP}(y_{i,c}) = | y_c^* - y_{i,c} |
$

where $y_c^*$ refers to the class boundary between class $c \in C$ and its nearest neighboring class $c^*$.

<br>

We can calculate BP from the latent variable score below in ~1min.

In [9]:
# add boundary proximity: absolute distance of each score from the median of its score_var
med_d = ledger_diff.groupby('score_var')['score'].median().to_dict()
ledger_diff['boundary_prox'] = ( ledger_diff['score'] - ledger_diff['score_var'].map(med_d) ).abs()
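
The cell above treats the per-task median score as the class boundary. A small worked example on toy data shows the resulting values; the frame below is hypothetical and uses `groupby(...).transform` as one possible way to broadcast each group's median back to its rows.

```python
import pandas as pd

# Toy scores for two hypothetical score variables.
toy = pd.DataFrame({
    'score_var': ['A', 'A', 'A', 'B', 'B', 'B'],
    'score': [1.0, 2.0, 6.0, 10.0, 20.0, 40.0],
})

# Median of each score_var, repeated on every row of that group.
boundary = toy.groupby('score_var')['score'].transform('median')

# Boundary proximity: absolute distance from the group's boundary.
toy['boundary_prox'] = (toy['score'] - boundary).abs()
```

Instances sitting exactly on the median get a proximity of 0, which marks them as the hardest to separate under this metric.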

### Sentence Length

The most efficient and popular linguistic heuristic is sentence length, which is a simple count of the number of tokens in the input sequence.
As sentence length grows, the sequence becomes more complex because the exponentially increasing number of possible sequences makes the class harder to guess by chance (Spitkovsky et al., 2009).

$
\text{SL}(x_i) = ~||x_i||
$

<br>

We calculate sentence length in the code below in ~1min with parallelization.

In [70]:
def proc_sl_sv_split(in_tup):
    svar, full_split = in_tup

    if svar == 'wer':
        input_data_d = load_zda_data('data/data_zda/', subset=False)
    else:
        input_data_d = load_hal_data('data/DataCVFolds/', score_variable=svar, subset=False)

    split = full_split.split('-')[0]

    if 'train' in full_split:
        input_data_d['hard_train_data'], input_data_d['rand_train_data'] = sample_hard_rand(input_data_d['train_data'], svar)

    split_path = full_split if 'train' not in full_split else ('hard_train' if 'Constant' in full_split else 'rand_train' )
    this_input_data = input_data_d[split_path+'_data']

    # token count per text, aligned to the input data's index so the concat lines up row by row
    slSer = pd.Series( [ len(word_tokenize(s)) for s in this_input_data['text'] ],
                       index=this_input_data.index, name='tok_len' )

    jdf = pd.concat( [this_input_data, slSer], axis = 1 )
    # copy so the column assignments below do not write into a view of jdf
    sub_jdf = jdf[['ID', 'tok_len']].copy()
    sub_jdf['split_full'] = full_split
    sub_jdf['score_var'] = svar
    sub_jdf['split'] = split

    sub_ledg = ledger_diff[((ledger_diff['score_var']==svar) &  (ledger_diff['split_full']==full_split))]
    this_join = pd.merge( sub_ledg, sub_jdf, on=['score_var', 'split_full', 'split', 'ID'], how='left' )
    sub_join = this_join[this_join['tok_len'].isna()==False]

    join_check_num = check_counts[( (check_counts['score_var']==svar) & (check_counts['split_full']==full_split) )]['count'].iloc[0]
    if join_check_num != len(sub_join):
        print('-----------------------------')
        print(f'JOIN PROBLEM WITH {svar}-{full_split}')
        print('-----------------------------')
        raise ValueError('Shape of joined data must match expected dimensions of ledger_diff')
    # print('passed join check...')

    return sub_join



USE_STORED = False

if USE_STORED:
    ledger_len = pd.read_csv( '/Users/ryancook/Downloads/these_ledgers/ledger_len.csv', keep_default_na=False )
else:
    # NOTE; these are older functions to avoid filtering NAs from demographic information
    ## we are only interested in IDs / text strings here
    from utils.data_loading_no_demo import *
    from utils.data_processing import *
    
    #process files in parallel
    import multiprocess
    # leave one core free; the pool itself is created in the `with` block below
    cores_to_use = multiprocess.cpu_count()-1

    from itertools import product
    these_svars, these_full_splits = ledger_diff['score_var'].unique(), ledger_diff['split_full'].unique()
    all_sv_splits = list( product( these_svars, these_full_splits ) )
    
    from nltk.tokenize import word_tokenize
    with multiprocess.Pool(cores_to_use) as pool:
        out_list = pool.map(proc_sl_sv_split, all_sv_splits)

    ledger_len = pd.concat( out_list )
    ledger_len = ledger_len.rename(columns={'ep_times_forgotten': 'times_forgotten'})
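
As a rough illustration of the token count, the sketch below uses plain whitespace splitting as a stand-in for `word_tokenize`, so it needs no NLTK data. This is an assumption made for the example only: NLTK also splits off punctuation, so real counts can be slightly higher. The texts and IDs are invented.

```python
import pandas as pd

# Hypothetical input texts keyed by ID.
texts = pd.DataFrame({
    'ID': ['r1', 'r2'],
    'text': ['the cat sat', 'a much longer sentence than the first'],
})

# Whitespace tokens per text (stand-in for len(word_tokenize(s))).
texts['tok_len'] = texts['text'].str.split().str.len()
```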


In [72]:
ledger_len.to_csv( 'data/ledger_len.csv', index=False )