# Judge as a means to resolve disagreement

- can we meaningfully talk about honest and adversarial agents in the context of AI safety?
- either we accept that agents may want to deceive us, in which case we cannot assume that even one agent is honest
- or we decide to assume that both agents are honest
- still, even if both agents act in good faith, there will be cases where they disagree
- how to combine the results of different models is a known problem, usually solved by ensembling
- could the game of debate be used as an ensembling method?
- in this work I explore a simplified version of the game that has only the pre-commitment phase, without the actual debate
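For contrast with the judge-based approach explored below, here is a minimal sketch of a conventional ensembling baseline: soft voting, which averages the class probabilities of two models and takes the argmax. The function name `average_ensemble` and the toy probability arrays are made up for illustration and are not part of this notebook's experiments.

```python
import numpy as np

def average_ensemble(probs_a, probs_b):
    """Soft voting: average two (n_samples, n_classes) probability arrays, then argmax."""
    return ((probs_a + probs_b) / 2.0).argmax(axis=1)

# toy probabilities for 2 samples and 3 classes (illustrative values only)
probs_a = np.array([[0.7, 0.2, 0.1],
                    [0.4, 0.5, 0.1]])
probs_b = np.array([[0.6, 0.3, 0.1],
                    [0.1, 0.3, 0.6]])

# on the second sample the models disagree (class 1 vs. class 2);
# the averaged probabilities decide between them
ensemble_prediction = average_ensemble(probs_a, probs_b)
```

Soft voting needs the full probability output of every model, whereas the strategy in this notebook only needs each agent's final class plus the judge's probabilities.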

In [2]:
import numpy as np
from precommit_analysis.keras_mnist_example import prepare_data, create_mnist_cnn_model
from precommit_analysis.generators import sparse_mnist_generator_nonzero

batch_size = 128
num_classes = 10
epochs = 12

def get_accuracy(y_pred, y_true):
    correct = (y_pred == y_true).sum()
    print('correct: ', correct)
    return correct / y_true.shape[0]

In [3]:
x_train, y_train, x_test, y_test, input_shape = prepare_data(num_classes)

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples


In [4]:
val_data_generator = sparse_mnist_generator_nonzero(
    x_test,
    y_test,
    batch_size=x_test.shape[0],
    sparsity=6,
    shuffle=False
)

In [5]:
data_x_sparse, data_y = next(val_data_generator)

In [6]:
true_categories = data_y.argmax(axis=1)

### Judge - 5k batches

In [9]:
judge = create_mnist_cnn_model(num_classes, input_shape)
judge.load_weights('models/model_sparse_mnist_generator_nonzero_5k.h5py')

### Better judge - 30k batches

In [None]:
judge = create_mnist_cnn_model(num_classes, input_shape)
judge.load_weights('models/model_sparse_mnist_generator_nonzero_30k.h5py')

# The strategy

- let's use a simple strategy for combining the opinions of 2 agents with superior capabilities and a judge with limited capabilities
- if the agents agree, take their shared classification as the result
- if they disagree, treat their two opinions as a preselection of candidate solutions (a pre-commitment, to stay consistent with the terminology of the previous work)
- and let the judge decide which of the two candidates is more likely

In [18]:
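def resolve_disagreement(predictions_a, predictions_b, predictions_judge):
    """Let the judge pick between the two agents' classes wherever they disagree.

    predictions_a, predictions_b: integer class labels, shape (n_samples,)
    predictions_judge: judge's class probabilities, shape (n_samples, n_classes)
    Returns the resolved labels for the disagreeing samples only, plus the boolean mask.
    """
    disagreement = predictions_a != predictions_b

    # judge's probability of a's class vs. b's class on the disagreeing samples
    prefers_a = predictions_judge[disagreement, predictions_a[disagreement]] > \
                predictions_judge[disagreement, predictions_b[disagreement]]

    # take b's predictions (boolean indexing returns a copy, predictions_b stays unchanged)
    result = predictions_b[disagreement]

    # unless a has a strictly greater probability according to the judge (ties go to b)
    result[prefers_a] = predictions_a[disagreement][prefers_a]
    return result, disagreement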
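To see the rule at work, here is a small self-contained example on toy data. The function is restated so the block runs on its own; the label arrays and judge probabilities are invented for illustration and are unrelated to the MNIST models used below.

```python
import numpy as np

def resolve_disagreement(predictions_a, predictions_b, predictions_judge):
    disagreement = predictions_a != predictions_b
    prefers_a = predictions_judge[disagreement, predictions_a[disagreement]] > \
                predictions_judge[disagreement, predictions_b[disagreement]]
    result = predictions_b[disagreement]
    result[prefers_a] = predictions_a[disagreement][prefers_a]
    return result, disagreement

# four toy samples: the agents agree on samples 0 and 3, disagree on 1 and 2
toy_a = np.array([3, 5, 7, 1])
toy_b = np.array([3, 8, 2, 1])

# toy judge probabilities over 10 classes (illustrative values only)
toy_judge = np.full((4, 10), 0.05)
toy_judge[1, 5], toy_judge[1, 8] = 0.1, 0.3   # sample 1: judge prefers b's class 8
toy_judge[2, 7], toy_judge[2, 2] = 0.4, 0.2   # sample 2: judge prefers a's class 7
toy_judge /= toy_judge.sum(axis=1, keepdims=True)  # normalize each row to sum to 1

toy_result, toy_disagreement = resolve_disagreement(toy_a, toy_b, toy_judge)
```

Note that `toy_result` has one entry per disagreeing sample, not one per input sample, which is why the notebook later writes it back through the `disagreement` mask.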

### Load two different models

In [12]:
super_agent_a = create_mnist_cnn_model(num_classes, input_shape)
super_agent_a.load_weights('models/model_mnist_1epoch_adam1e-5.h5py')

super_agent_b = create_mnist_cnn_model(num_classes, input_shape)
super_agent_b.load_weights('models/model_mnist_1epoch_adam5e-5.h5py')

### Make predictions

In [13]:
predictions_a = super_agent_a.predict(x_test).argmax(axis=1) # categorical
predictions_b = super_agent_b.predict(x_test).argmax(axis=1) # categorical
predictions_judge = judge.predict(data_x_sparse) # raw class probabilities

### Find and resolve disagreement

In [19]:
result, disagreement = resolve_disagreement(predictions_a, predictions_b, predictions_judge)

### What is the accuracy of the agents alone?

In [14]:
get_accuracy(predictions_a, true_categories)

correct:  8387


0.8387

In [15]:
get_accuracy(predictions_b, true_categories)

correct:  9368


0.9368

### What is their accuracy on the samples where they disagree?

In [20]:
get_accuracy(predictions_a[disagreement], true_categories[disagreement])

correct:  94


0.0722521137586472

In [21]:
get_accuracy(predictions_b[disagreement], true_categories[disagreement])

correct:  1075


0.8262874711760184

### What is the judge's accuracy on the samples where the agents disagree?

In [22]:
predicted_category_judge = predictions_judge.argmax(axis=1)

In [23]:
get_accuracy(predicted_category_judge[disagreement], true_categories[disagreement])

correct:  291


0.22367409684857803

### How does it change if the agents' pre-commitments are combined with the judge's probabilities?

In [24]:
get_accuracy(result, true_categories[disagreement])

correct:  563


0.4327440430438125

### How big is the disagreement?

In [26]:
disagreement.sum()

1301

# Agents of the same power (just different seed)

In [27]:
super_agent_a = create_mnist_cnn_model(num_classes, input_shape)
super_agent_a.load_weights('models/model_mnist_1epoch_adam5e-5_2.h5py')

super_agent_b = create_mnist_cnn_model(num_classes, input_shape)
super_agent_b.load_weights('models/model_mnist_1epoch_adam5e-5.h5py')

### Make predictions

In [28]:
predictions_a = super_agent_a.predict(x_test).argmax(axis=1)
predictions_b = super_agent_b.predict(x_test).argmax(axis=1)
predictions_judge = judge.predict(data_x_sparse)

### Find and resolve disagreement

In [29]:
result, disagreement = resolve_disagreement(predictions_a, predictions_b, predictions_judge)

### What is the accuracy of the agents alone?

In [30]:
get_accuracy(predictions_a, true_categories)

correct:  9229


0.9229

In [31]:
get_accuracy(predictions_b, true_categories)

correct:  9368


0.9368

### What is their accuracy on the samples where they disagree?

In [32]:
get_accuracy(predictions_a[disagreement], true_categories[disagreement])

correct:  86


0.23822714681440443

In [33]:
get_accuracy(predictions_b[disagreement], true_categories[disagreement])

correct:  225


0.6232686980609419

### What is the judge's accuracy on the samples where the agents disagree?

In [34]:
predicted_category_judge = predictions_judge.argmax(axis=1)

In [35]:
get_accuracy(predicted_category_judge[disagreement], true_categories[disagreement])

correct:  68


0.1883656509695291

### How does it change if the agents' pre-commitments are combined with the judge's probabilities?

In [36]:
get_accuracy(result, true_categories[disagreement])

correct:  143


0.3961218836565097

### How big is the disagreement?

In [37]:
# the test set has 10k samples
disagreement.sum()

361

### What's the resulting accuracy of the agents plus the judge's resolution?

In [39]:
all_preds_combined = predictions_a.copy()
all_preds_combined[disagreement] = result

In [40]:
get_accuracy(all_preds_combined, true_categories)

correct:  9286


0.9286
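The two steps above (resolving the disagreement, then writing the result back into a full-length prediction array) can also be folded into a single vectorized helper. This is a sketch of an equivalent formulation using `np.where`; the name `combine_with_judge` is hypothetical and the arrays are toy data, not the notebook's predictions.

```python
import numpy as np

def combine_with_judge(predictions_a, predictions_b, predictions_judge):
    """Return one label per sample: the agents' shared label, or the judge's pick."""
    rows = np.arange(predictions_a.shape[0])
    prob_a = predictions_judge[rows, predictions_a]  # judge's probability of a's class
    prob_b = predictions_judge[rows, predictions_b]  # judge's probability of b's class
    # where the agents agree, prob_a == prob_b and b's label (equal to a's) is kept;
    # where they disagree, a wins only with a strictly greater judge probability
    return np.where(prob_a > prob_b, predictions_a, predictions_b)

# toy data (illustrative values only)
toy_a = np.array([3, 5, 7, 1])
toy_b = np.array([3, 8, 2, 1])
toy_judge = np.full((4, 10), 0.05)
toy_judge[1, 5], toy_judge[1, 8] = 0.1, 0.3
toy_judge[2, 7], toy_judge[2, 2] = 0.4, 0.2

toy_combined = combine_with_judge(toy_a, toy_b, toy_judge)
```

The tie-breaking matches `resolve_disagreement`: equal judge probabilities fall back to agent b.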

# Preliminary results

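- the pre-selection of two candidate classes by two superior but fallible agents seems to improve the accuracy of the limited judge
- on the samples where the agents disagree, the judge's accuracy roughly doubles when its choice is restricted to the agents' two candidates (22.37% to 43.27% in the first experiment, 18.84% to 39.61% in the second)
- on those same samples the resolved answers are still less accurate than agent b alone (82.63% and 62.33%), and in the second experiment the combined accuracy (92.86%) lies between agent a (92.29%) and agent b (93.68%)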

## What to do next
- the experiments should be repeated and evaluated on mean values, as the judge's results are stochastic
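A minimal sketch of how such a repeated evaluation could be organized. The helper `summarize_runs` and the callback `run_once` are hypothetical names; in this notebook `run_once` would draw a fresh sparse batch, run the judge, resolve the disagreement and return the accuracy. The `toy_run` below is a purely synthetic stand-in (a coin-flip "judge") so the block runs on its own, and its numbers say nothing about the real experiment.

```python
import numpy as np

def summarize_runs(run_once, n_runs=10, seed=0):
    """Call run_once(rng) n_runs times; return the mean and sample std of the accuracies."""
    rng = np.random.default_rng(seed)
    accuracies = np.array([run_once(rng) for _ in range(n_runs)])
    return accuracies.mean(), accuracies.std(ddof=1)

def toy_run(rng):
    # synthetic stand-in: a "judge" that picks the correct one of two
    # candidates by coin flip on 1000 samples; returns the fraction correct
    return rng.integers(0, 2, size=1000).mean()

mean_acc, std_acc = summarize_runs(toy_run, n_runs=20)
print('mean accuracy: ', mean_acc, ' std: ', std_acc)
```

Reporting the standard deviation alongside the mean makes it possible to tell whether a difference between two setups exceeds the run-to-run noise.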