In [9]:
import numpy as np
from precommit_analysis.keras_mnist_example import prepare_data, create_mnist_cnn_model
from precommit_analysis.generators import eval_generator, eval_precommit_generator, sparse_mnist_generator_nonzero, eval_precommit_adversarial_generator, eval_optimal_adversary_generator

batch_size = 128
num_classes = 10
epochs = 12

In [2]:
x_train, y_train, x_test, y_test, input_shape = prepare_data(num_classes)

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples


In [3]:
val_data_generator = sparse_mnist_generator_nonzero(
    x_test,
    y_test,
    batch_size=x_test.shape[0],
    sparsity=6,
    shuffle=False
)

# Evaluate judge alone

- judge is a sparse MNIST classifier
- 6 non-zero pixels are randomly sampled from an input image
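
The sampling step described above can be sketched in plain NumPy. This is only an illustration: `sample_sparse_image` is a hypothetical helper, not the repository's `sparse_mnist_generator_nonzero`, and the real generator may encode the judge's input differently (for example with an extra mask channel).

```python
import numpy as np

def sample_sparse_image(image, sparsity=6, rng=None):
    """Keep `sparsity` randomly chosen non-zero pixels of `image`, zero out the rest.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = image.reshape(-1)
    nonzero_idx = np.flatnonzero(flat)
    # Guard against images with fewer non-zero pixels than requested.
    k = min(sparsity, nonzero_idx.size)
    chosen = rng.choice(nonzero_idx, size=k, replace=False)
    sparse = np.zeros_like(flat)
    sparse[chosen] = flat[chosen]
    return sparse.reshape(image.shape)
```

Because the pixels are drawn at random, two calls on the same image generally give different sparse inputs, which is why the evaluation below is repeated several times.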

### Judge - 5k batches

In [4]:
judge = create_mnist_cnn_model(num_classes, input_shape)
judge.load_weights('models/model_sparse_mnist_generator_nonzero_5k.h5py')

In [68]:
# the judge samples 6 pixels at random, so we repeat the evaluation and report the mean and variance
accuracies = eval_generator(val_data_generator, judge, num_repetitions=10)
print('accuracy: %.2f%%' % (100 * np.mean(accuracies)))
print('variance: %E' % np.var(accuracies))

accuracy: 52.99%
variance: 9.043600E-06
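
A variance on the order of 1e-5 is hard to read next to an accuracy in percent. A small helper along these lines (hypothetical, not part of `precommit_analysis`) reports the spread as a standard deviation and a standard error of the mean instead. It assumes `accuracies` is a sequence of fractions in [0, 1], as the prints above suggest.

```python
import numpy as np

def summarize_runs(accuracies):
    """Return (mean, sample std, standard error of the mean) of repeated runs."""
    acc = np.asarray(accuracies, dtype=float)
    mean = acc.mean()
    std = acc.std(ddof=1)          # sample standard deviation (n - 1 in the denominator)
    sem = std / np.sqrt(acc.size)  # standard error of the mean
    return mean, std, sem
```

For scale, the variance of 9.04E-06 printed above corresponds to a standard deviation of roughly 0.3 percentage points across the 10 repetitions.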


### Better judge - 30k batches

In [None]:
judge = create_mnist_cnn_model(num_classes, input_shape)
judge.load_weights('models/model_sparse_mnist_generator_nonzero_30k.h5py')

In [64]:
accuracies = eval_generator(val_data_generator, judge, num_repetitions=10)
print('accuracy: %.2f%%' % (100 * np.mean(accuracies)))
print('variance: %E' % np.var(accuracies))

accuracy: 55.51%
variance: 8.398400E-06


# Random pre-commit

### Judge - 5k batches

In [3]:
accuracies = eval_precommit_generator(val_data_generator, judge, num_classes, num_repetitions=10)
print('accuracy: %.2f%%' % (100 * np.mean(accuracies)))
print('variance: %E' % np.var(accuracies))

accuracy: 87.36%
variance: 7.724500E-06


### Better judge - 30k batches

In [5]:
accuracies = eval_precommit_generator(val_data_generator, judge, num_classes, num_repetitions=10)
print('accuracy: %.2f%%' % (100 * np.mean(accuracies)))
print('variance: %E' % np.var(accuracies))

accuracy: 88.31%
variance: 8.738100E-06


# Adversarial precommit

- evaluate the best adversary, which was found in train_adversary.ipynb

In [6]:
adversary = create_mnist_cnn_model(num_classes, input_shape)
adversary.load_weights('models/model_mnist_1epoch_adam1e-5.h5py')

### Judge - 5k batches

In [8]:
accuracies = eval_precommit_adversarial_generator(x_test, val_data_generator, judge, adversary, num_repetitions=10)
print('accuracy: %.2f%%' % (100 * np.mean(accuracies)))
print('variance: %E' % np.var(accuracies))

accuracy: 73.87%
variance: 7.136900E-06


### Better judge - 30k batches

In [9]:
accuracies = eval_precommit_adversarial_generator(x_test, val_data_generator, judge, adversary, num_repetitions=10)
print('accuracy: %.2f%%' % (100 * np.mean(accuracies)))
print('variance: %E' % np.var(accuracies))

accuracy: 75.41%
variance: 1.368610E-05


# Optimal adversary - perfect knowledge of judge

- with perfect knowledge of the judge it's trivial to find an optimal adversarial pre-commit class
- choose the judge's predicted class as long as it is not the true class
- otherwise take the 2nd most probable class according to the judge and hope for a tie, which counts as a loss in our setting
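
The selection rule above can be sketched as follows. Both functions are hypothetical and only illustrate the rule; they are not the implementation behind `eval_optimal_adversary_generator`. They assume `probs` is an `(N, C)` array of the judge's class probabilities on the sparse inputs and `y_true` holds integer labels.

```python
import numpy as np

def optimal_adversary_precommit(probs, y_true):
    """Pick the adversarial pre-commit class for each sample."""
    # Classes sorted by decreasing judge probability (stable for ties).
    order = np.argsort(-probs, axis=1, kind="stable")
    top1, top2 = order[:, 0], order[:, 1]
    # Use the judge's top class unless it is the true one, then fall back to the runner-up.
    return np.where(top1 != y_true, top1, top2)

def precommit_accuracy(probs, y_true, y_adv):
    """Fraction of samples where the true class strictly beats the adversarial one."""
    rows = np.arange(len(y_true))
    # Strict inequality: a tie counts as a loss for the honest side.
    return float(np.mean(probs[rows, y_true] > probs[rows, y_adv]))
```

Under this rule the honest side wins exactly when the judge alone would already have ranked the true class strictly first, which is consistent with the perfect-knowledge accuracies in this notebook sitting close to the judge-alone baseline.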

### Judge - 5k batches

In [11]:
accuracies = eval_optimal_adversary_generator(val_data_generator, judge, num_repetitions=10)
print('accuracy: %.2f%%' % (100 * np.mean(accuracies)))
print('variance: %E' % np.var(accuracies))

accuracy: 52.89%
variance: 5.974100E-06


### Better judge - 30k batches

In [55]:
accuracies = eval_optimal_adversary_generator(val_data_generator, judge, num_repetitions=10)
print('accuracy: %.2f%%' % (100 * np.mean(accuracies)))
print('variance: %E' % np.var(accuracies))

accuracy: 55.62%
variance: 3.117450E-05


# Results

- accuracy for different choices of pre-commit
- highlighted are the highest judge accuracy (random 2nd class pre-commit) and the best adversarial pre-commit, i.e. the one with the lowest judge accuracy

| pre-commit type | judge 5k | judge 30k |
|-----------------------|----------|----------|
| **random** | **87.35%** | **88.31%** |
| adversarial_top | 79.43% | 80.77% |
| adversarial_30k | 77.72% | 80.60% |
| adversarial_15k | 76.04% | 77.42% |
| adversarial_10k | 75.31% | 76.82% |
| adversarial_7.5k | 76.23% | 77.24% |
| adversarial_5k | 78.31% | 80.12% |
| adversarial_500 | 84.33% | 85.61% |
| adversarial_adam 1e-6 | 83.32% | 84.98% |
| adversarial_adam 5e-5 | 75.23% | 76.28% |
| **adversarial_adam 1e-5** | **73.87%** | **75.41%** |
| adversarial_adam 1e-4 | 75.06% | 76.50% |
| perfect knowledge | 52.89% | 55.62% |
| none / baseline | 52.99% | 55.51% |
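
One way to read the table is to ask what share of the gain from random pre-commit (over the judge-alone baseline) the best adversary takes away. The snippet below is a small arithmetic sketch using values copied from the table. The function name and the "fraction of gain removed" measure are my own choices for illustration, not something computed elsewhere in this notebook.

```python
# Accuracies (%) copied from the results table: (judge 5k, judge 30k).
results = {
    "random": (87.35, 88.31),
    "adversarial_adam 1e-5": (73.87, 75.41),
    "perfect knowledge": (52.89, 55.62),
    "none / baseline": (52.99, 55.51),
}

def fraction_of_gain_removed(results, adversary, judge_idx):
    """Share of the random pre-commit gain over baseline that `adversary` removes."""
    baseline = results["none / baseline"][judge_idx]
    random_acc = results["random"][judge_idx]
    adv_acc = results[adversary][judge_idx]
    gain = random_acc - baseline
    return (random_acc - adv_acc) / gain

for idx, judge_name in enumerate(("judge 5k", "judge 30k")):
    frac = fraction_of_gain_removed(results, "adversarial_adam 1e-5", idx)
    print(f"{judge_name}: {100 * frac:.1f}% of the random pre-commit gain removed")
```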

# Conclusion

- much of the gain in the judge's accuracy can be explained by the pre-commit alone, without any actual debate between the 2 agents
- adversarial pre-commit did decrease the judge's accuracy compared to random pre-commit
- the game of debate seems to me a good tool for finding candidate solutions (values for the pre-commit) via agents of superior capabilities, and for mitigating the negative effect of the adversary

# Future work

It is an open question whether we can talk about honest and adversarial agents in the context of AI safety. Either we accept that agents may generally want to deceive us, in which case we can't assume even one honest agent, or we assume that both agents are honest. But even if both agents act in good faith, there will be cases of disagreement. This is a known problem, usually addressed by ensembling methods. Could the game of debate be used as an addition to the existing ensembling methods? For some preliminary exploration of this idea, take a look at the future_work.ipynb notebook.