# Judge as a means to resolve disagreement

- can we talk about honest and adversarial agents in the context of AI safety?
- either we accept that agents may want to deceive us, in which case we cannot assume that even one agent is honest
- or we decide to assume that both agents are honest
- still, even if both agents act in good faith, there will be cases of disagreement
- how to combine the results of different models is a known problem, usually solved by ensembling
- could the game of debate be used as an ensembling method?
- in this work I explore a simplified version of the game with only the pre-commitment phase, without the actual debate
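
For contrast with the judge-based approach explored here, the sketch below shows one common ensembling baseline, soft voting, which simply averages the class probabilities of two models. The function name `soft_vote` and the toy probabilities are illustrative assumptions and are not part of this notebook's `utils` module.

```python
import numpy as np

def soft_vote(prob_a, prob_b):
    """Average two (n_samples, n_classes) probability arrays and pick the top class."""
    mean_prob = (prob_a + prob_b) / 2.0
    return mean_prob.argmax(axis=1)

# Made-up probabilities for 2 samples and 3 classes (illustration only).
prob_a = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.5, 0.4]])
prob_b = np.array([[0.6, 0.3, 0.1],
                   [0.1, 0.2, 0.7]])

ensemble_labels = soft_vote(prob_a, prob_b)
print(ensemble_labels)  # the models disagree on the second sample; the average favours class 2
```

Soft voting needs calibrated probabilities from both models, whereas the pre-commitment game only needs each agent's top class plus a third, weaker model to arbitrate.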

In [1]:
import numpy as np
from utils import prepare_data, create_mnist_cnn_model, sparse_mnist_generator_nonzero

batch_size = 128
num_classes = 10
epochs = 12

def get_accuracy(y_pred, y_true):
    correct = (y_pred == y_true).sum()
    print('correct: ', correct)
    return correct / y_true.shape[0]

Using TensorFlow backend.


In [None]:
x_train, y_train, x_test, y_test, input_shape = prepare_data()

In [None]:
val_data_generator = sparse_mnist_generator_nonzero(
    x_test,
    y_test,
    batch_size=x_test.shape[0],
    sparsity=6,
    shuffle=False
)
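
The implementation of `sparse_mnist_generator_nonzero` lives in `utils` and is not shown here. As a rough sketch, and assuming that `sparsity=6` means the judge only sees six randomly chosen non-zero pixels of each image, the masking step might look like the following. The helper name `keep_random_nonzero_pixels` and the fake image are hypothetical.

```python
import numpy as np

def keep_random_nonzero_pixels(image, n_pixels, rng):
    """Return a copy of `image` with all but `n_pixels` random non-zero pixels zeroed."""
    flat = image.ravel()
    nonzero_idx = np.flatnonzero(flat)
    # Assumption: sample without replacement among the non-zero pixels only.
    n_keep = min(n_pixels, nonzero_idx.size)
    keep = rng.choice(nonzero_idx, size=n_keep, replace=False)
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    return sparse.reshape(image.shape)

rng = np.random.default_rng(0)
# Fake 28x28 "digit": roughly 20% of the pixels are non-zero (illustration only).
image = rng.random((28, 28)) * (rng.random((28, 28)) > 0.8)
sparse_image = keep_random_nonzero_pixels(image, n_pixels=6, rng=rng)
print(np.count_nonzero(sparse_image))
```

Because the kept pixels are drawn at random, two calls on the same image generally give different inputs to the judge, which is one plausible source of run-to-run variation in its predictions.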

In [None]:
data_x_sparse, data_y = next(val_data_generator)

In [None]:
true_categories = data_y.argmax(axis=1)

### Judge - 5k batches

In [67]:
judge = create_mnist_cnn_model(num_classes, input_shape)
judge.load_weights('model_sparse_mnist_generator_nonzero_5k.h5py')

### Better judge - 30k batches

In [None]:
judge = create_mnist_cnn_model(num_classes, input_shape)
judge.load_weights('model_sparse_mnist_generator_nonzero_30k.h5py')

# The strategy

- let's have a simple strategy for combining the opinions of 2 agents with superior capabilities and a judge with limited capabilities
- if the agents agree, take their classification as the result
- if they disagree, take their opinions as a preselection (or pre-commitment, to be consistent in terminology with the previous work) of candidate solutions
- and let the judge decide which of the two is more likely

In [345]:
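def resolve_disagreement(predictions_a, predictions_b, predictions_judge):
    """Return the judge-resolved labels for the samples where the agents disagree."""
    # boolean mask of samples where the agents pre-commit to different classes
    disagreement = predictions_a != predictions_b

    # True where the judge gives a's class a higher probability than b's class
    resolution = predictions_judge[disagreement, predictions_a[disagreement]] > \
                 predictions_judge[disagreement, predictions_b[disagreement]]

    # take b's predictions (boolean indexing returns a copy, so predictions_b is untouched)
    result = predictions_b[disagreement]

    # unless a has a greater probability according to the judge
    result[resolution] = predictions_a[disagreement][resolution]
    return result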
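
To make the rule concrete, here is a small self-contained sketch on made-up data. It is a vectorised variant written for illustration (the name `resolve_with_judge` is hypothetical) and, unlike `resolve_disagreement`, it returns a label for every sample: where the agents agree, the shared label is kept.

```python
import numpy as np

def resolve_with_judge(labels_a, labels_b, judge_probs):
    """Pick, per sample, whichever of the two proposed labels the judge rates higher."""
    rows = np.arange(len(labels_a))
    # Judge's score for the class proposed by each agent.
    judge_prefers_a = judge_probs[rows, labels_a] > judge_probs[rows, labels_b]
    # Ties (including agreement) fall back to b's label, as in resolve_disagreement.
    return np.where(judge_prefers_a, labels_a, labels_b)

labels_a = np.array([3, 1, 2])
labels_b = np.array([3, 7, 0])

# Made-up judge scores for 3 samples and 10 classes (not normalised, illustration only).
judge_probs = np.full((3, 10), 0.05)
judge_probs[1, 7] = 0.40  # judge leans towards b's label on sample 1
judge_probs[2, 2] = 0.30  # judge leans towards a's label on sample 2

resolved = resolve_with_judge(labels_a, labels_b, judge_probs)
print(resolved)  # [3 7 2]
```

Note that the judge never proposes a class of its own here; it only ranks the two candidates, which is what distinguishes this setup from using the judge's `argmax` directly.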

### Load two different models

In [311]:
super_agent_a = create_mnist_cnn_model(num_classes, input_shape)
super_agent_a.load_weights('model_mnist_1epoch_adam1e-5.h5py')

super_agent_b = create_mnist_cnn_model(num_classes, input_shape)
super_agent_b.load_weights('model_mnist_1epoch_adam5e-5.h5py')

### Make predictions

In [312]:
predictions_a = super_agent_a.predict(x_test).argmax(axis=1) # categorical
predictions_b = super_agent_b.predict(x_test).argmax(axis=1) # categorical
predictions_judge = judge.predict(data_x_sparse) # raw class probabilities

# boolean mask of the samples where the two agents disagree
disagreement = predictions_a != predictions_b

### What is the accuracy of the agents alone?

In [318]:
get_accuracy(predictions_a, true_categories)

0.8387

In [319]:
get_accuracy(predictions_b, true_categories)

0.9368

### What is their accuracy on the samples where they disagree?

In [316]:
get_accuracy(predictions_a[disagreement], true_categories[disagreement])

0.0722521137586472

In [317]:
get_accuracy(predictions_b[disagreement], true_categories[disagreement])

0.8262874711760184

### What is the judge's accuracy on the samples where the agents disagree?

In [333]:
predicted_category_judge = predictions_judge.argmax(axis=1)

In [321]:
get_accuracy(predicted_category_judge[disagreement], true_categories[disagreement])

0.22982321291314373

### How does it change if the agents' pre-commitments are combined with the judge's probabilities?

In [315]:
result = resolve_disagreement(predictions_a, predictions_b, predictions_judge)
get_accuracy(result, true_categories[disagreement])

0.44427363566487316

### How big is the disagreement?

In [314]:
disagreement.sum()

1301

# Agents of the same power (just a different seed)

In [322]:
super_agent_a = create_mnist_cnn_model(num_classes, input_shape)
super_agent_a.load_weights('model_mnist_1epoch_adam5e-5_2.h5py')

super_agent_b = create_mnist_cnn_model(num_classes, input_shape)
super_agent_b.load_weights('model_mnist_1epoch_adam5e-5.h5py')

In [339]:
data_x_sparse, data_y = next(val_data_generator)

true_categories = data_y.argmax(axis=1)

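predictions_a = super_agent_a.predict(x_test).argmax(axis=1)
predictions_b = super_agent_b.predict(x_test).argmax(axis=1)
predictions_judge = judge.predict(data_x_sparse)

# boolean mask of the samples where the two agents disagree
disagreement = predictions_a != predictions_b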

### What is the accuracy of the agents alone?

In [328]:
get_accuracy(predictions_a, true_categories)

0.9229

In [329]:
get_accuracy(predictions_b, true_categories)

0.9368

### What is their accuracy on the samples where they disagree?

In [348]:
get_accuracy(predictions_a[disagreement], true_categories[disagreement])

correct:  86


0.23822714681440443

In [349]:
get_accuracy(predictions_b[disagreement], true_categories[disagreement])

correct:  225


0.6232686980609419

### What is the judge's accuracy on the samples where the agents disagree?

In [333]:
predicted_category_judge = predictions_judge.argmax(axis=1)

In [334]:
get_accuracy(predicted_category_judge[disagreement], true_categories[disagreement])

0.21329639889196675

### How does it change if the agents' pre-commitments are combined with the judge's probabilities?

In [347]:
result = resolve_disagreement(predictions_a, predictions_b, predictions_judge)
get_accuracy(result, true_categories[disagreement])

correct:  137


0.37950138504155123

### How big is the disagreement?

In [330]:
# the test set has 10k samples
disagreement.sum()

361

### What's the resulting accuracy of the agents plus the judge's resolution?

In [384]:
all_preds_combined = predictions_a.copy()
all_preds_combined[disagreement] = result

In [385]:
get_accuracy(all_preds_combined, true_categories)

correct:  9280


0.928
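
The counts reported in this experiment can be cross-checked with a little arithmetic. The sketch below uses only numbers printed above (the 9368 is derived from agent b's accuracy of 0.9368 on the 10k test samples). The "oracle" line is an added illustration: since the agents disagree on these samples, at most one of them can be right, so the sum of their correct counts is the ceiling for any rule that picks between the two pre-commitments.

```python
n_test = 10_000
n_disagree = 361

# Correct counts on the disagreement samples, as printed above.
correct_a_dis = 86
correct_b_dis = 225
correct_resolved_dis = 137

# Agent b's overall correct count, from its accuracy of 0.9368 on 10k samples.
correct_b_total = 9368

# Samples where the agents agree and are right.
correct_on_agreement = correct_b_total - correct_b_dis          # 9143

# Agent a's total and the combined total follow from the same split.
correct_a_total = correct_on_agreement + correct_a_dis          # 9229, matches 0.9229
combined_correct = correct_on_agreement + correct_resolved_dis  # 9280, matches 0.928

# Ceiling: always picking the correct one of the two pre-commitments.
oracle_correct = correct_on_agreement + correct_a_dis + correct_b_dis  # 9454

print(combined_correct / n_test, oracle_correct / n_test)
print((correct_a_dis + correct_b_dis) / n_disagree)
```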

# Preliminary results

- the pre-selection of two candidate classes by two superior but fallible agents seems to improve the accuracy of the limited judge
- the judge's accuracy improves substantially even when we examine only the samples where the agents disagree (22.98% to 44.43% in the first experiment, and 21.33% to 37.95% in the second)

## What to do next
- experiments should be repeated and evaluated on mean values, as the judge's results are stochastic
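
A minimal sketch of how repeated runs could be summarised, assuming each run can be wrapped in a function that returns one accuracy value. `summarize_trials` and `fake_trial` are hypothetical names, and `fake_trial` returns made-up noise rather than real results; in practice it would draw a fresh sparse batch, run the judge, and call `resolve_disagreement`.

```python
import numpy as np

def summarize_trials(run_trial, n_trials, seed=0):
    """Run `run_trial(rng)` n_trials times and return (mean, sample std) of the accuracies."""
    rng = np.random.default_rng(seed)
    accuracies = np.array([run_trial(rng) for _ in range(n_trials)])
    return accuracies.mean(), accuracies.std(ddof=1)

def fake_trial(rng):
    """Stand-in for one full experiment; the numbers are invented for illustration."""
    return 0.4 + 0.02 * rng.standard_normal()

mean_acc, std_acc = summarize_trials(fake_trial, n_trials=20)
print(f"{mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a difference between two setups exceeds the run-to-run noise.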