### Up/downsampling data using temperature-scaled mixing

This notebook implements temperature-scaled mixing, following:
- https://arxiv.org/pdf/1910.10683.pdf
  - Scroll down to the "Examples-proportional mixing" section
- https://github.com/HKUNLP/UnifiedSKG/blob/main/seq2seq_construction/meta_tuning.py#L23-L74

The first link is Google's T5 paper, where the algorithm mixes datasets of different sizes. Here we mix labels with different numbers of examples instead. The method is at the bottom of this notebook. P.S. If you're here for checkbox cleanup/analysis, I've moved it to checkbox_data_cleanup.

In [None]:
import pandas as pd
import numpy as np
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
import re

# pd.set_option('display.max_rows', 120)

df = pd.read_csv("../data/processed/case-data.csv")
df.head()

In [2]:
label_counts = df["DESCRIPTION"].value_counts()
print("Total Examples: %d" % label_counts.sum()) 
label_counts

Total Examples: 108804


Direct Contact                            38878
Collateral Contact                        24794
Client Contact out of office              12677
Client contact in office                  11446
Attempted client contact                   4285
                                          ...  
Client rejected by available providers        2
Medical detox not available                   2
HACA                                          1
Outpatient Treatment Program                  1
Denied                                        1
Name: DESCRIPTION, Length: 113, dtype: int64

### Examples proportional mixing

"...if we simply sample in proportion to each data set’s size, the vast majority of the data the model sees will be unlabeled, and it will undertrain on all of the supervised tasks... To get around this issue, we set an artificial “limit” on the data set sizes before computing the proportions."

Rewriting their formulas in terms of labels:

If the number of examples for each of our $N$ labels is $e_n, n \in \{1, \dots, N\}$, then
$$
r_m = \frac{\min(e_m, K)}{\sum_{n=1}^{N} \min(e_n, K)}
$$
where $r_m$ is the probability of sampling an example with the $m$-th label during training, and $K$ is the artificial label size limit.
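
As a small worked example of the formula above, here is a sketch on made-up label counts. The helper name `examples_proportional_rates` and the toy counts are illustrative assumptions, not part of the dataset or the paper.

```python
import pandas as pd


def examples_proportional_rates(counts: pd.Series, k: int) -> pd.Series:
    """Return r_m = min(e_m, K) / sum_n min(e_n, K) for each label."""
    capped = counts.clip(upper=k)  # min(e_m, K) for every label
    return capped / capped.sum()


# Hypothetical label counts: one very large, one medium, one tiny label
toy_counts = pd.Series({"a": 5000, "b": 1000, "c": 10})
rates = examples_proportional_rates(toy_counts, k=1000)
print(rates)
```

With $K = 1000$, labels `a` and `b` are both capped at 1000, so they end up with the same mixing rate even though `a` has five times as many examples. Label `c` is below the limit and keeps its original count in the numerator.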

In [3]:
# However we choose to combine and remove data, the sampling method can be written generically.
# For simplicity, let's use all_labels for now

k = 1000
total_labels = label_counts.apply(lambda x: min(x, k)).sum()
weights = label_counts.apply(lambda x: min(x, k) / total_labels)

In [4]:
df.set_index("DESCRIPTION", inplace=True)

`weights` contains $r_m$ for the $m$-th label. Now, for each datapoint, we need the probability of picking it to equal $r_m / m_{total}$, where $m_{total}$ is the actual number of samples with that label. Note that $m_{total}$ is not $K$, the artificial limit; it is the uncapped count.

That way, the probabilities of picking samples with label $m$ sum to $r_m$.
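
The per-row weighting described above can be checked on a toy frame. This is a minimal sketch: the frame `toy`, the column name `label`, and the limit of 3 are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 6 rows of "a", 3 rows of "b", 1 row of "c"
toy = pd.DataFrame({"label": ["a"] * 6 + ["b"] * 3 + ["c"]})

counts = toy["label"].value_counts()   # m_total for each label
capped = counts.clip(upper=3)          # apply the artificial limit K = 3
rates = capped / capped.sum()          # r_m for each label

# Each row gets r_m / m_total, looked up by that row's label
row_weights = toy["label"].map(rates / counts)

# Summing the row weights within a label should recover r_m
per_label = row_weights.groupby(toy["label"]).sum()
print(per_label)
```

Because every row of a label receives an equal share of that label's rate, the row weights sum to 1 overall and can be passed directly as sampling weights.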

In [5]:
df["class_weights"] = weights
df["class_totals"] = label_counts
df["sample_weights"] = df["class_weights"] / df["class_totals"]
df.reset_index(drop=False, inplace=True)

# These should be equal
# print(weights)
# df.groupby("DESCRIPTION")["sample_weights"].sum().sort_values(ascending=False)

In [6]:
sampled = df.sample(frac=1, weights="sample_weights", replace=True)
df.drop(columns=["class_weights", "class_totals", "sample_weights"], inplace=True)
sampled["DESCRIPTION"].value_counts()

Attempted client contact                           5546
Collateral Contact                                 5524
Client contact in office                           5491
Client Contact out of office                       5455
No Show                                            5448
                                                   ... 
Client rejected by available providers                9
Outpatient Treatment Program                          8
Denied                                                7
HACA                                                  6
Client not assigned DACC CSR due to court order       3
Name: DESCRIPTION, Length: 113, dtype: int64

### Temperature Scaled Mixing
Partly summarized, partly copied from the paper: temperature-scaled mixing is almost identical, except that each label's mixing rate $r_m$ is raised to the power of $\frac{1}{T}$, where $T$ is a temperature scaling parameter. The rates are then renormalized so they sum to 1.
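
To get a feel for the effect of $T$, here is a small sketch that applies the scaling to made-up mixing rates. The helper `temperature_scale` and the example rates are assumptions for illustration only.

```python
import pandas as pd


def temperature_scale(rates: pd.Series, T: float) -> pd.Series:
    """Raise each rate to the power 1/T, then renormalize to sum to 1."""
    scaled = rates.pow(1.0 / T)
    return scaled / scaled.sum()


# Hypothetical examples-proportional rates for three labels
rates = pd.Series({"a": 0.7, "b": 0.25, "c": 0.05})

for T in (1, 2, 10):
    print(T, temperature_scale(rates, T).round(3).to_dict())
```

At $T = 1$ the rates come back unchanged, and larger values of $T$ shrink the gap between the most and least frequent labels.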

"When $T = 1$, this approach is equivalent to examples-proportional mixing and as $T$ increases the proportions become closer to equal mixing. We retain the data set size limit $K$ (applied to obtain $r_m$ before temperature scaling) but set it to a large value of $K = 2^{21}$. We use a large value of $K$ because increasing the temperature will decrease the mixing rate of the largest data"


In [7]:
def temperature_scaled_mixing(df: pd.DataFrame, label_col: str, T: float, K=None, frac=1.0):
    """
    df: the dataset to sample (not modified)
    label_col: the column containing the labels
    T: The temperature parameter. When T=1, this is identical to examples-proportional mixing
    K: The artificial size limit. If not provided, defaults to size of largest label set
    frac: What fraction of the original df should the returned sampled df be. If 1, len(sampled.index) == len(df.index)
    """

    label_counts = df[label_col].value_counts()
    if K is None:
        K = label_counts.max()

    # Examples-proportional mixing rates r_m, with the size limit K applied
    capped = label_counts.clip(upper=K)
    weights = capped / capped.sum()

    # Temperature scaling: raise each rate to 1/T, then renormalize to sum to 1
    weights = weights.pow(1.0 / T)
    weights /= weights.sum()

    # Per-row weight: the label's rate divided by the number of rows with that label
    sample_weights = df[label_col].map(weights / label_counts)

    # Pass a plain array so the weights are matched to rows by position, not by index
    return df.sample(frac=frac, weights=sample_weights.to_numpy(), replace=True)

In [10]:
sampled_df = temperature_scaled_mixing(df, "DESCRIPTION", T=1, K=3000)
sampled_df["DESCRIPTION"].value_counts()

Direct Contact                                     10347
Client contact in office                           10315
Collateral Contact                                 10261
Client Contact out of office                       10260
Attempted client contact                           10239
                                                   ...  
Partial: More than 1/2                                 5
Denied                                                 5
HACA                                                   4
Client not assigned DACC CSR due to court order        2
Outpatient Treatment Program                           2
Name: DESCRIPTION, Length: 113, dtype: int64