## Basic Idea

I want to see what a diff looks like when BPE is used to encode the input, so that the diff works on tokens instead of characters.


In [None]:
from de.mindscan.fluentgenesis.bpe.bpe_model import BPEModel
from de.mindscan.fluentgenesis.bpe.bpe_encoder_decoder import SimpleBPEEncoder

In [None]:
# load the BPE Model description file and hyper-parameter file.
model = BPEModel("16K-full","D:\\Projects\\SinglePageApplication\\Angular\\FluentGenesis-Classifier\\src\\de\\mindscan\\fluentgenesis\\bpe\\")
model.load_hparams()

# load associated vocabulary and bpe-pairs
model_vocabulary = model.load_tokens()
model_bpe_data = model.load_bpe_pairs()
    
# the encoder needs both the vocabulary and the byte-pair occurrences, so we pass both to it.
bpe_encoder = SimpleBPEEncoder(model_vocabulary, model_bpe_data)

## Case Number 0x01: substitutions and insertions

* "tt" became "span"
* "text-monospace" was added

In [None]:
del_line = '<tt class="ml-2 small">{{revision.shortrev}}</tt>'
add_line = '<span class="ml-2 small text-monospace">{{revision.shortrev}}</span>'


In [None]:
bpe_del_line = bpe_encoder.encode([del_line])
bpe_add_line = bpe_encoder.encode([add_line])

In [None]:
print(bpe_del_line)
print(bpe_add_line)

We want to figure out where the two lines have identical parts, substitutions, deletions and insertions.

* Insertions and deletions can be found by stretching the arrays with a neutral element, e.g. "0". The goal is to bring both arrays to equal length, so that they can be compared element-wise.

Let's assume we have such an algorithm:

In [None]:
bpe_del_line_stretched=[
    61, 3397, 2839, 1756, 539, 46, 51, 2119, 110, 625,    0,  0,    0,   0,    0, 
    10003, 124, 124, 6778, 47, 1755, 6844, 126, 126, 1794, 3397, 63]
bpe_add_line_stretched=[
    61, 3039, 2839, 1756, 539, 46, 51, 2119, 110, 625, 7645, 46, 2339, 450, 1070, 
    10003, 124, 124, 6778, 47, 1755, 6844, 126, 126, 1794, 3039, 63]
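
The stretching step is only assumed above. One possible way to sketch it is with `difflib.SequenceMatcher` from the Python standard library: walk its opcodes and pad the shorter side of every block with the neutral element. The function name `stretch_token_lines` and the `neutral` parameter are hypothetical and not part of the FluentGenesis code. The sketch also assumes flat lists of integer token IDs and that `0` never occurs as a real token ID.

```python
from difflib import SequenceMatcher


def stretch_token_lines(del_tokens, add_tokens, neutral=0):
    """Pad both token lists with `neutral` so that they have equal length.

    Hypothetical helper: assumes `neutral` is never a real token ID.
    """
    del_out, add_out = [], []
    matcher = SequenceMatcher(a=del_tokens, b=add_tokens, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        del_part = list(del_tokens[i1:i2])
        add_part = list(add_tokens[j1:j2])
        # Pad the shorter side of this block so both sides stay aligned.
        width = max(len(del_part), len(add_part))
        del_out.extend(del_part + [neutral] * (width - len(del_part)))
        add_out.extend(add_part + [neutral] * (width - len(add_part)))
    return del_out, add_out


# Token IDs taken from the hand-stretched arrays above, with the zeros removed.
raw_del = [61, 3397, 2839, 1756, 539, 46, 51, 2119, 110, 625,
           10003, 124, 124, 6778, 47, 1755, 6844, 126, 126, 1794, 3397, 63]
raw_add = [61, 3039, 2839, 1756, 539, 46, 51, 2119, 110, 625, 7645, 46, 2339, 450, 1070,
           10003, 124, 124, 6778, 47, 1755, 6844, 126, 126, 1794, 3039, 63]

stretched_del, stretched_add = stretch_token_lines(raw_del, raw_add)
print(stretched_del)
print(stretched_add)
```

For this pair of lines the sketch yields the same alignment as the hand-stretched arrays. Replaced blocks of unequal length are padded at the end of the shorter side, so in such a block a replacement is followed by insertions or deletions.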

We can now compare the two arrays element-wise:
* two equal elements -> no change
* del is zero and add is nonzero -> insertion
* add is zero and del is nonzero -> deletion
* two different nonzero values -> replacement

The result is an array of the same length.

In [None]:
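def bpe_syndrome_calculation(del_line: list, add_line: list) -> list:
    # one marker per position: '_' unchanged, 'I' insertion, 'D' deletion, 'R' replacement
    syndrome = []
    if len(del_line) != len(add_line):
        # raising a plain string is a TypeError in Python 3, so raise a real exception
        raise ValueError("cannot calculate syndromes for different array lengths")
    for i in range(len(del_line)):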
        if del_line[i] == add_line[i]:
            syndrome.append('_')
        elif del_line[i] == 0:
            syndrome.append('I')
        elif add_line[i] == 0:
            syndrome.append('D')
        else:
            syndrome.append('R')
    return syndrome

bpe_diff_syndrome = bpe_syndrome_calculation(bpe_del_line_stretched, bpe_add_line_stretched)

In [None]:
print(bpe_diff_syndrome)