In [1]:
# Import the module and get a dataset
import btv

# 44156 is the id for the "electricity" dataset, predicting whether the price of electricity rises or falls week to week.
# Feel free to choose another from the OpenML catalogue (https://www.openml.org/search?type=data&status=active)
# Make sure to update the name of the class column argument if necessary.
df, X, y = btv.data_tab.dataset2df(44156, class_cols=["class"], verbose=True)

# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# To reproduce the results shown here, use the saved train/test index sets below instead of the random split commented out above
import numpy as np
train_idx = np.load('demo_support/electric_train_set.npy')
test_idx = np.load('demo_support/electric_test_set.npy')
X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]

OpenML Dataset
Name..........: electricity
Version.......: 13
Format........: arff
Upload Date...: 2022-07-10 10:34:54
Licence.......: Public
Download URL..: https://api.openml.org/data/v1/download/22103281/electricity.arff
OpenML URL....: https://www.openml.org/d/44156
# of features.: 9
# of instances: 38474


### Label information without any data
Without looking at the data or deploying any kind of model, the labels necessarily hold a lot of information. All we have to go on to guess the label of any given data point is the frequency of the label's occurrence. The information in each label, $y_i=c_i$, is written $$I(y_i) = -\log_2[p(y_i=c_i)]$$ where $p(y_i=c_i)$, the probability that label $y_i$ takes the sample's correct class $c_i$, can only be crudely estimated. Note that the sample's data, $x_i$, plays no role, and that the base-2 logarithm makes the unit of information the bit.
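As a rough illustration of the formula above, both crude estimates of $p(y_i=c_i)$ can be computed directly with NumPy. The helper name `chance_info_sketch` is hypothetical and written only for this example; it is not the implementation of `btv.core.chance_info`, whose internals may differ.

```python
import numpy as np

def chance_info_sketch(y, use_freq=True):
    """Total information (in bits) in the labels y, ignoring the data entirely.

    Hypothetical helper for illustration only, not part of btv.
    """
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    if use_freq:
        # p(label) estimated as the relative frequency of that label
        p = counts / counts.sum()
        return float(-(counts * np.log2(p)).sum())
    # p(label) assumed uniform: 1 / number of classes
    return float(len(y) * np.log2(len(classes)))

balanced = [0, 0, 1, 1]
skewed = [0, 0, 0, 1]
print(chance_info_sketch(balanced, use_freq=False), chance_info_sketch(balanced))
print(chance_info_sketch(skewed, use_freq=False), chance_info_sketch(skewed))
```

For perfectly balanced labels the two totals coincide; any imbalance makes the frequency-based total smaller.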

In [2]:
naive_test_info = btv.core.chance_info(y_test,use_freq=False)
frequnecy_test_info = btv.core.chance_info(y_test,use_freq=True)
print("Information in the test set assuming p(label) == 1/# of classes:\n", naive_test_info)
print("Information in the test set assuming p(label) == relative frequency of the label:\n", frequnecy_test_info)

Information in the test set assuming p(label) == 1/# of classes:
 7695.0
Information in the test set assuming p(label) == relative frequency of the label:
 7694.651180995905


The frequency-aware information should be *lower* than (or at most equal to) the one based only on the number of classes, since it reflects the actual distribution of labels better. If the two are close, the samples are very evenly distributed between the classes.
### Information remaining for a trained model
After training, a model can use the data to infer the label, so the information in the label for a trained model will be much less than when using frequency statistics alone. It can be written $$I(y_i|x_i) = -\log_2[p(y_i=c_i|x_i)],$$ where $p(y_i=c_i|x_i)$ is the probability the classifier assigns to the correct class. A *perfect* model would know every label with certainty from its data and assign probability one to every correct label, giving $I=0$ for all samples.
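A minimal sketch of this per-sample quantity, assuming the probability columns are ordered like the sorted class labels (as with scikit-learn's `clf.classes_`). The function `prediction_info_sketch` and its `eps` clipping are assumptions made for this example; `btv.core.prediction_info` may handle edge cases differently, so its values need not match.

```python
import numpy as np

def prediction_info_sketch(y_true, proba, classes, eps=1e-12):
    """Per-sample information -log2 p(correct class | x), in bits.

    Hypothetical helper for illustration only, not part of btv.
    `classes` must be sorted and match the column order of `proba`.
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    # Column index of each sample's correct class
    col = np.searchsorted(np.asarray(classes), y_true)
    p_correct = proba[np.arange(len(y_true)), col]
    # Clip to avoid log2(0) when the correct class gets zero probability
    return -np.log2(np.clip(p_correct, eps, 1.0))

y_toy = [0, 1]
proba_toy = [[0.5, 0.5], [0.25, 0.75]]
bits = prediction_info_sketch(y_toy, proba_toy, classes=[0, 1])
print(bits, bits.sum())
```

A coin-flip prediction costs exactly one bit, a confident correct prediction costs close to zero, and a confident wrong prediction is penalized heavily.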

In [3]:
from sklearn.ensemble import GradientBoostingClassifier # a classic for tabular data
# Feel free to use another classifier. The methods called will be .fit(X,y), .predict_proba(X), and .score(X,y)
clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.0, max_depth=1, random_state=10
)
clf.fit(X_train, y_train)
predicted_label_probabilities = clf.predict_proba(X_test)
trained_test_info = btv.core.prediction_info(y_test, predicted_label_probabilities).sum() # function returns the information of each label
full_train_score = clf.score(X_test, y_test)

print(f"Information in the test set from frequency stats: {frequnecy_test_info:.3f}")
print(f"Information remaining in test set for trained classifier: {trained_test_info:.3f}")
print(f"Score when trained on full training set: {full_train_score:0.5f}")

Information in the test set from frequency stats: 7694.651
Information remaining in test set for trained classifier: 1761.151
Score when trained on full training set: 0.81092


Since the Shannon information of a message can be thought of as the amount of information needed to specify the message, a better-trained model needs less information about a test set (or "leaves less information in" it, if you use the thermodynamic analogy of information being hidden in a system).

In [4]:
# Select a minimum training set, one that includes only one sample of every label
min_idx, _ = btv.core.collect_min_set(y_train)
X_min, y_min = X_train[min_idx], y_train[min_idx]

# Fitting to this minimum set will result in a functional, but poor classifier
clf.fit(X_min, y_min)
min_trained_predicted_test_probabilities = clf.predict_proba(X_test)
min_trained_test_info = btv.core.prediction_info(y_test, min_trained_predicted_test_probabilities).sum()
min_trained_score = clf.score(X_test,y_test)

print(f"Information in the test set from frequency stats: {frequnecy_test_info:.3f}")
print(f"Information remaining in test set for fully-trained classifier: {trained_test_info:.3f}")
print(f"Score when trained on full training set: {full_train_score:0.5f}")
print(f"Information remaining in test set for minimum-trained classifier: {min_trained_test_info:.3f}")
print(f"Score when trained on minimum training set: {min_trained_score:0.5f}")

Information in the test set from frequency stats: 7694.651
Information remaining in test set for fully-trained classifier: 1761.151
Score when trained on full training set: 0.81092
Information remaining in test set for minimum-trained classifier: 2889.147
Score when trained on minimum training set: 0.61663


The amount of information left in a test set is more than just another metric for classifier performance. The more ignorant a classifier is about a set, the more it can potentially learn from it.
#### Looking at the information remaining in the training set after training on subsets of it:

In [8]:
frequnecy_train_info = btv.core.chance_info(y_train)
# Train on the full training set, then measure the information left in that same set
clf.fit(X_train, y_train)
full_trained_train_info = btv.core.prediction_info(y_train, clf.predict_proba(X_train)).sum()
# Train on the first tenth of the training set only
clf.fit(X_train[:len(y_train)//10], y_train[:len(y_train)//10])
tenth_trained_train_info = btv.core.prediction_info(y_train, clf.predict_proba(X_train)).sum()
# Train on the minimum set (one sample per label)
clf.fit(X_min, y_min)
min_trained_train_info = btv.core.prediction_info(y_train, clf.predict_proba(X_train)).sum()

print(f"Information in the training set from frequency stats: {frequnecy_train_info:.3f}")
print(f"Information remaining in train set for minimum-trained classifier: {min_trained_train_info:.3f}")
print(f"Information remaining in train set for tenth-trained classifier: {tenth_trained_train_info:.3f}")
print(f"Information remaining in train set for fully-trained classifier: {full_trained_train_info:.3f}")

Information in the training set from frequency stats: 30778.913
Information remaining in train set for minimum-trained classifier: 11439.607
Information remaining in train set for tenth-trained classifier: 7171.968
Information remaining in train set for fully-trained classifier: 6907.605


This raises the natural question of how much of the information in a training set a classifier extracts: the "extraction rate."
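One plausible reading of the extraction rate, assumed here rather than taken from the btv source, is the fraction of the initial frequency-based information that no longer remains after training. The helper `extraction_rate_sketch` is hypothetical; the two numbers fed to it are the full-training-set figures printed in the previous cell.

```python
def extraction_rate_sketch(initial_info, remaining_info):
    """Fraction of the initial information no longer left after training.

    Hypothetical helper for illustration only, not part of btv.
    """
    if initial_info <= 0:
        raise ValueError("initial_info must be positive")
    return 1.0 - remaining_info / initial_info

# Frequency-based information in the training set, and the information
# remaining in it for the fully-trained classifier (from the output above)
full_rate = extraction_rate_sketch(30778.913, 6907.605)
print(f"{full_rate:.5f}")  # prints 0.77557
```

Under this reading, a classifier that leaves nothing behind has a rate of one, and a classifier that learns nothing beyond label frequencies has a rate of zero.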

In [9]:
full_trained_train_info_rate = btv.core.extraction_rate(clf,X_train,y_train)
tenth_trained_train_info_rate = btv.core.extraction_rate(clf,X_train[:len(y_train)//10],y_train[:len(y_train)//10])
min_trained_train_info_rate =  btv.core.extraction_rate(clf,X_min,y_min)

print(f"Fraction of information extracted from train set for minimum-trained classifier: {min_trained_train_info_rate:.5f}")
print(f"Fraction of information extracted from train set for tenth-trained classifier: {tenth_trained_train_info_rate:.5f}")
print(f"Fraction of information extracted from train set for fully-trained classifier: {full_trained_train_info_rate:.5f}")

Fraction of information extracted from train set for minimum-trained classifier: 1.00000
Fraction of information extracted from train set for tenth-trained classifier: 0.77871
Fraction of information extracted from train set for fully-trained classifier: 0.77557


This gives a rough idea of how well a model learns from a training set, which is a combination of how complex the data is and how well-suited the model's architecture and parameters are to the training data. Such a nest of dependencies does not make it a good evaluation tool.

However, we can ask how much information *about* the test set a training set has for the model:

In [10]:
import numpy as np

idx_0s = np.where(y_train == 0)[0] 
idx_0splus1 = np.unique(np.append(min_idx, idx_0s)) # a training set of almost all zeros will not tell us much about the test set
clf.fit(X_train[idx_0splus1],y_train[idx_0splus1])
almost_all_zero_test_info = btv.core.prediction_info(y_test, clf.predict_proba(X_test)).sum()

print(f"Information in the test set from frequency stats: {frequnecy_test_info:.3f}")
print(f"Information remaining in test set for fully-trained classifier: {trained_test_info:.3f}")
print(f"Score when trained on full training set: {full_train_score:0.5f}")
print(f"Information remaining in test set for minimum-trained classifier: {min_trained_test_info:.3f}")
print(f"Score when trained on minimum training set: {min_trained_score:0.5f}")
print(f"Information remaining in test set for classifier trained on almost all 0 labels: {almost_all_zero_test_info:.3f}")
print(f"Score when trained on almost all 0 labels: {clf.score(X_test,y_test):0.5f}")

Information in the test set from frequency stats: 7694.651
Information remaining in test set for fully-trained classifier: 1761.151
Score when trained on full training set: 0.81092
Information remaining in test set for minimum-trained classifier: 2889.147
Score when trained on minimum training set: 0.61663
Information remaining in test set for classifier trained on almost all 0 labels: 3561.000
Score when trained on almost all 0 labels: 0.53723


NB: When using a random forest classifier on the electricity data set (OpenML id = 44156) and the train/test split saved in ./demo_support/, the classifier trained on almost all zeros did *worse* than one trained on only two samples. It left more information in the test set ($\approx1100$ bits) and scored $\approx8$% worse ($\approx53$% accuracy). This is not too surprising, since the training set with only one sample of each label reflects the class balance of the test set.
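The class-balance argument can be checked on any label array by comparing class proportions. The sketch below uses toy labels and two hypothetical helpers, `class_proportions` and `one_per_class_indices`; the latter simply takes the first occurrence of each label and is only a stand-in for the idea of a minimum set, not the implementation of `btv.core.collect_min_set`.

```python
import numpy as np

def class_proportions(y):
    """Map each label to its share of the samples (hypothetical helper)."""
    classes, counts = np.unique(np.asarray(y), return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))

def one_per_class_indices(y):
    """Index of the first occurrence of each label (hypothetical helper)."""
    _, first_idx = np.unique(np.asarray(y), return_index=True)
    return first_idx

y_toy = np.array([1, 0, 0, 1, 0, 1])   # toy labels, evenly split
min_idx_toy = one_per_class_indices(y_toy)

print(class_proportions(y_toy))               # the full toy set
print(class_proportions(y_toy[min_idx_toy]))  # one sample per label
```

On roughly balanced data, a one-sample-per-label set has the same proportions as the full set, whereas a set dominated by a single label does not.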

Asking how much information different training sets provide about a test set may be interesting, but in practical terms, this is just another evaluation metric. It requires us to perform the expensive part of model development, training, which we are trying to minimize through data pruning and prioritization.
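The comparison can be made concrete by subtracting the information remaining in the test set from its frequency-based total, using the figures printed above. Treating that difference as the information a training set provides about the test set is an interpretation assumed for this sketch, and the variable names are illustrative.

```python
# Figures copied from the notebook output above (bits)
chance_bits = 7694.651  # frequency-based information in the test set
remaining_bits = {
    "full training set": 1761.151,
    "minimum training set": 2889.147,
    "almost all 0 labels": 3561.000,
}

# Information about the test set accounted for by each training set
resolved = {name: chance_bits - bits for name, bits in remaining_bits.items()}

for name, bits in resolved.items():
    share = bits / chance_bits
    print(f"{name:>22}: {bits:9.3f} bits resolved ({share:.1%} of the total)")
```

Each row still costs one full training run to produce, which is why this quantity works as an after-the-fact metric and not as a selection criterion.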

# Training Data Selection
So Shannon information is behaving as expected, giving a quantitative measure of how ignorant a classifier is about a certain set of labels. If we are in the position of having to choose a training set, however, information alone is not enough. The information a set contains for a given classifier scales with its size, while performance once trained (and the information remaining in a test set) does not. What is more, ideally, we would select training data 

In [None]:
# Choosing between fine-tuning sets
from sklearn.model_selection import train_test_split

ds = btv.data_tab.getdata(44156, verbose=False)
df, X, y = btv.data_tab.dataset2df(ds, class_cols=["class"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.0, max_depth=1, random_state=10
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_dev1, y_train, y_dev1 = train_test_split(X_train, y_train, test_size=0.2)
X_train, X_dev2, y_train, y_dev2 = train_test_split(X_train, y_train, test_size=0.2)
X_train, X_dev3, y_train, y_dev3 = train_test_split(X_train, y_train, test_size=0.2)
X_train, X_dev4, y_train, y_dev4 = train_test_split(X_train, y_train, test_size=0.2)
X_train, X_dev5, y_train, y_dev5 = train_test_split(X_train, y_train, test_size=0.2)

X_ft_list = [X_dev1, X_dev2, X_dev3, X_dev4, X_dev5]
y_ft_list = [y_dev1, y_dev2, y_dev3, y_dev4, y_dev5]