# 2-digit XOR in binary

XOR (exclusive or) is a binary operation that takes two bits and returns 1 if they are different and 0 if they are the same. Applied to two integers, it operates independently on each pair of aligned bits.

For example, 0 XOR 0 = 0, 0 XOR 1 = 1, 1 XOR 0 = 1, and 1 XOR 1 = 0.
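
The truth table above can be reproduced with Python's built-in `^` operator. This is a minimal sketch; the helper name `xor_bits` is introduced here for illustration only and is not part of the notebook.

```python
def xor_bits(a: int, b: int) -> int:
    """Return 1 if the two bits differ, else 0."""
    return int(a != b)

# Print the full truth table and compare against the built-in operator.
for a in (0, 1):
    for b in (0, 1):
        print(f"{a} XOR {b} = {xor_bits(a, b)}  (built-in: {a ^ b})")
```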

Here, I construct a dataset from pairs of positive integers with at most two decimal digits (the generation code below uses all pairs with i + j ≤ 99), computing their XOR, and representing everything in binary (zero-padded to 7 bits).

E.g. for 3 XOR 25, the result is 26:
    
    Prompt: '0000011 XOR 0011001 ='. Completion: ' 0011010'
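
As a quick check of this example, the sketch below lines up the two operands bit by bit and builds the same prompt and completion strings. The helper `to_bits` is a name chosen for this illustration, not something defined in the notebook.

```python
def to_bits(n: int, width: int = 7) -> str:
    """Zero-padded binary representation of a non-negative integer."""
    return format(n, f"0{width}b")

a, b = 3, 25
result = a ^ b  # XOR of each pair of aligned bits

# Column view: a result bit is 1 exactly where the operand bits differ.
print(f"  {to_bits(a)}  ({a})")
print(f"^ {to_bits(b)}  ({b})")
print(f"= {to_bits(result)}  ({result})")

prompt = f"{to_bits(a)} XOR {to_bits(b)} ="
completion = f" {to_bits(result)}"
print(repr(prompt), repr(completion))
```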


UPD. The above formatting resulted in 97% accuracy after fine-tuning. I then remembered that the model uses BPE tokenization, which can merge several adjacent digits into a single token, so I separated the digits with spaces:

    Prompt: '0 0 0 0 0 1 1 XOR 0 0 1 1 0 0 1 ='. Completion: ' 0 0 1 1 0 1 0'

This resulted in 100% accuracy after fine-tuning.

In [1]:
import numpy as np

np.random.seed(42)

def bin_string_from_int(n, length=7):
    # Zero-pad to `length` bits and separate the bits with spaces, e.g. 3 -> '0 0 0 0 0 1 1'.
    return ' '.join(format(n, f'0{length}b'))

dataset = []

# All pairs of positive integers (i, j) with i + j <= 99.
for i in range(1, 99):
    for j in range(1, 100-i):
        prompt = f'{bin_string_from_int(i)} XOR {bin_string_from_int(j)} ='
        completion = f' {bin_string_from_int(i^j)}'
        dataset.append((prompt, completion))

np.random.shuffle(dataset)
print('dataset size:', len(dataset))

dataset size: 4851
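
The size of 4851 follows from the loop bounds: for each `i` there are `99 - i` values of `j`, giving 98 + 97 + ... + 1 pairs. The sketch below re-derives that count and also checks that every operand and every result fits in 7 bits. The variable names are chosen for this illustration.

```python
# Same loop bounds as the notebook cell above.
pairs = [(i, j) for i in range(1, 99) for j in range(1, 100 - i)]

# Closed form for 98 + 97 + ... + 1.
expected_size = 98 * 99 // 2

max_operand = max(max(i, j) for i, j in pairs)
max_result = max(i ^ j for i, j in pairs)

print("pairs:", len(pairs), "closed form:", expected_size)
print("largest operand:", max_operand, "largest XOR result:", max_result)
print("fits in 7 bits:", max(max_operand, max_result) < 2 ** 7)
```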


In [2]:

train_set = dataset[:-851]
test_set = dataset[-851:]

print('train set size:', len(train_set))
print('test set size:', len(test_set))

train set size: 4000
test set size: 851


In [10]:
def with_n_shot(dataset, n=1):
    # Use the first n pairs as solved demonstrations and prepend them to every remaining question.
    examples = [question + answer for question, answer in dataset[:n]]
    dataset = [('\n'.join(examples) + '\n' + question, answer) for question, answer in dataset[n:]]
    return dataset

In [19]:
for question, answer in with_n_shot(test_set, n=3)[:3]:
    print()
    print(question)


0 0 1 1 0 1 0 XOR 0 1 1 0 1 1 1 = 0 1 0 1 1 0 1
0 0 0 0 1 0 1 XOR 0 1 0 0 0 1 0 = 0 1 0 0 1 1 1
0 0 0 1 0 1 0 XOR 1 0 0 1 1 0 1 = 1 0 0 0 1 1 1
0 0 0 1 1 1 0 XOR 0 1 1 1 0 1 1 =

0 0 1 1 0 1 0 XOR 0 1 1 0 1 1 1 = 0 1 0 1 1 0 1
0 0 0 0 1 0 1 XOR 0 1 0 0 0 1 0 = 0 1 0 0 1 1 1
0 0 0 1 0 1 0 XOR 1 0 0 1 1 0 1 = 1 0 0 0 1 1 1
0 0 0 0 1 0 1 XOR 1 0 1 1 1 0 0 =

0 0 1 1 0 1 0 XOR 0 1 1 0 1 1 1 = 0 1 0 1 1 0 1
0 0 0 0 1 0 1 XOR 0 1 0 0 0 1 0 = 0 1 0 0 1 1 1
0 0 0 1 0 1 0 XOR 1 0 0 1 1 0 1 = 1 0 0 0 1 1 1
0 0 1 1 0 0 0 XOR 0 1 0 0 0 1 1 =
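
As the output above shows, every prompt shares the same `n` demonstrations, which are taken from the start of the set and are not asked as questions themselves. The sketch below repeats the function so it runs on its own and applies it to a small placeholder dataset (`toy_set` is made up for illustration).

```python
def with_n_shot(dataset, n=1):
    # Same logic as the notebook's function, repeated so this sketch is self-contained.
    examples = [question + answer for question, answer in dataset[:n]]
    return [('\n'.join(examples) + '\n' + question, answer) for question, answer in dataset[n:]]

# Placeholder data, not real XOR prompts.
toy_set = [('a =', ' 1'), ('b =', ' 2'), ('c =', ' 3'), ('d =', ' 4')]
two_shot = with_n_shot(toy_set, n=2)

print(len(toy_set), '->', len(two_shot))  # the n demonstrations are dropped from the questions
print(repr(two_shot[0][0]))
```

One consequence is that an n-shot evaluation on the 851-item test set scores only 851 - n questions.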


In [3]:
test_set[:10]

[('0 0 1 1 0 1 0 XOR 0 1 1 0 1 1 1 =', ' 0 1 0 1 1 0 1'),
 ('0 0 0 0 1 0 1 XOR 0 1 0 0 0 1 0 =', ' 0 1 0 0 1 1 1'),
 ('0 0 0 1 0 1 0 XOR 1 0 0 1 1 0 1 =', ' 1 0 0 0 1 1 1'),
 ('0 0 0 1 1 1 0 XOR 0 1 1 1 0 1 1 =', ' 0 1 1 0 1 0 1'),
 ('0 0 0 0 1 0 1 XOR 1 0 1 1 1 0 0 =', ' 1 0 1 1 0 0 1'),
 ('0 0 1 1 0 0 0 XOR 0 1 0 0 0 1 1 =', ' 0 1 1 1 0 1 1'),
 ('1 0 0 0 0 0 1 XOR 0 0 0 1 0 0 0 =', ' 1 0 0 1 0 0 1'),
 ('0 0 1 0 1 0 0 XOR 0 0 0 0 0 1 0 =', ' 0 0 1 0 1 1 0'),
 ('0 1 1 1 1 1 0 XOR 0 1 0 0 0 0 0 =', ' 0 0 1 1 1 1 0'),
 ('0 1 1 1 1 1 0 XOR 0 0 1 1 0 1 0 =', ' 0 1 0 0 1 0 0')]

### Baseline performance

In [4]:
from src.openai_bb import OpenAIGPT3



In [22]:
model = OpenAIGPT3('curie')

#### Zero-shot

In [10]:
questions, answers = zip(*test_set)

outputs = model.generate_text(questions, max_length=20, stop_string='\n')

correct_ones = np.array(outputs) == np.array(answers)
print('accuracy:', correct_ones.mean())

accuracy: 0.0
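
The accuracy here is an exact string match, so an output counts as correct only if every character, including the leading space of the completion, equals the reference. The sketch below mirrors that comparison in plain Python. The `strip` option is a hypothetical relaxation that the notebook does not use, and the example outputs are made up.

```python
def exact_match_accuracy(outputs, answers, strip=False):
    """Fraction of outputs that equal their reference string.

    strip=True ignores leading/trailing whitespace (hypothetical option,
    not part of the notebook's evaluation).
    """
    if strip:
        outputs = [o.strip() for o in outputs]
        answers = [a.strip() for a in answers]
    matches = [o == a for o, a in zip(outputs, answers)]
    return sum(matches) / len(matches)

answers = [' 0 1 0 1 1 0 1', ' 0 1 0 0 1 1 1']
# Made-up model outputs: the first one lacks the leading space.
outputs = ['0 1 0 1 1 0 1', ' 0 1 0 0 1 1 1']

print(exact_match_accuracy(outputs, answers))
print(exact_match_accuracy(outputs, answers, strip=True))
```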


#### Few-shot

In [24]:
# 1-shot
questions, answers = zip(*with_n_shot(test_set, n=1))

outputs = model.generate_text(questions, max_length=20, stop_string='\n')

correct_ones = np.array(outputs) == np.array(answers)
print('accuracy:', correct_ones.mean())

accuracy: 0.010588235294117647


In [25]:
# 3-shot
questions, answers = zip(*with_n_shot(test_set, n=3))

outputs = model.generate_text(questions, max_length=20, stop_string='\n')

correct_ones = np.array(outputs) == np.array(answers)
print('accuracy:', correct_ones.mean())

accuracy: 0.041273584905660375


In [26]:
# 20-shot
questions, answers = zip(*with_n_shot(test_set, n=20))

outputs = model.generate_text(questions, max_length=20, stop_string='\n')

correct_ones = np.array(outputs) == np.array(answers)
print('accuracy:', correct_ones.mean())

accuracy: 0.021660649819494584
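
Because `with_n_shot` removes the `n` demonstrations from the evaluated questions, each few-shot accuracy is a fraction over 851 - n items. The sketch below recovers the implied number of correct answers from the accuracies printed above. It is only arithmetic on the reported values, not a re-run of the model.

```python
import math

test_size = 851
# Accuracies copied from the three few-shot cells above.
reported = {1: 0.010588235294117647, 3: 0.041273584905660375, 20: 0.021660649819494584}

counts = {}
for n, accuracy in reported.items():
    evaluated = test_size - n                # the n demonstrations are not scored
    correct = round(accuracy * evaluated)    # implied number of exact matches
    counts[n] = (correct, evaluated)
    consistent = math.isclose(correct / evaluated, accuracy)
    print(f"{n}-shot: {correct}/{evaluated} correct (consistent: {consistent})")
```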


### Fine-tuning Curie on the task

In [11]:
import json

stop_sequence = '\n'

# Write one JSON object per line; mode 'w' truncates any existing file.
with open('dataset-algo.jsonl', 'w') as f:
    f.writelines(json.dumps({'prompt': x, 'completion': y + stop_sequence}) + '\n' for x, y in train_set)
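
To see what one line of this file looks like, the sketch below writes a single record to a scratch file and reads it back. The file location and the one-item `train_pairs` list are placeholders for illustration. Note that `json.dumps` escapes the stop sequence as the two characters `\n`, so each record still occupies exactly one physical line.

```python
import json
import os
import tempfile

# One example pair in the notebook's spaced format.
train_pairs = [('0 0 0 0 0 1 1 XOR 0 0 1 1 0 0 1 =', ' 0 0 1 1 0 1 0')]
stop_sequence = '\n'

# Scratch file in a temporary directory (placeholder path).
path = os.path.join(tempfile.mkdtemp(), 'example.jsonl')
with open(path, 'w') as f:
    for prompt, completion in train_pairs:
        record = {'prompt': prompt, 'completion': completion + stop_sequence}
        f.write(json.dumps(record) + '\n')

with open(path) as f:
    raw_lines = f.read().splitlines()
records = [json.loads(line) for line in raw_lines]

print(raw_lines[0])                     # the stop sequence appears escaped
print(repr(records[0]['completion']))   # after parsing it is a real newline again
```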

In [12]:
# Run fine-tuning using the OpenAI CLI

!openai api fine_tunes.create -t dataset-algo.jsonl -m curie

Upload progress: 100%|███████████████████████| 328k/328k [00:00<00:00, 180Mit/s]
Uploaded file from dataset-algo.jsonl: file-bzimJiRU3i7oP5bm8cJhM6FC
Created fine-tune: ft-rvSgR5Kg436xNsAGu1fGK67D
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2022-11-26 02:18:34] Created fine-tune: ft-rvSgR5Kg436xNsAGu1fGK67D
[2022-11-26 02:18:41] Fine-tune costs $1.20
[2022-11-26 02:18:42] Fine-tune enqueued. Queue number: 0
[2022-11-26 02:18:43] Fine-tune started

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-rvSgR5Kg436xNsAGu1fGK67D



### Evaluating fine-tuned Curie

In [13]:
!openai api fine_tunes.list

{
  "data": [
    {
      "created_at": 1669419491,
      "fine_tuned_model": "curie:ft-university-of-tartu-2022-11-25-23-59-58",
      "hyperparams": {
        "batch_size": 8,
        "learning_rate_multiplier": 0.1,
        "n_epochs": 4,
        "prompt_loss_weight": 0.01
      },
      "id": "ft-XtjdqmRTxlxAgqqZubqaMhWZ",
      "model": "curie",
      "object": "fine-tune",
      "organization_id": "org-U4Xje8KdPBHxjYb62oL10QeW",
      "result_files": [
        {
          "bytes": 108026,
          "created_at": 1669420799,
          "filename": "compiled_results.csv",
          "id": "file-HMlO2OoPaR2PYbfeAypoMXXh",
          "object": "file",
          "purpose": "fine-tune-results",
          "status": "processed",
          "status_details": null
        }
      ],
      "status": "succeeded",
      "training_files": [
        {
          "bytes": 256000,
          "created_at": 1669419490,
          "filename": "dataset-algo.jsonl",
          "id": "file-UkpD6ZXE95tymyrVIVry

In [16]:
model = OpenAIGPT3('curie:ft-university-of-tartu-2022-11-26-00-36-11')

questions, answers = zip(*test_set)

outputs = model.generate_text(questions, max_length=20, stop_string='\n')

correct_ones = np.array(outputs) == np.array(answers)
print('fine-tuned accuracy:', correct_ones.mean())

fine-tuned accuracy: 1.0


### Results


Model: Curie

Task: 2-digit XOR in binary

| Zero-shot | 1-shot | 3-shot | 20-shot | Fine-tuned |
|-----------|--------|--------|---------|------------|
| 0%        | 1.1%   | 4.1%   | 2.2%    | 100%       |