# Fast Pair Extraction

In *theory202104* the pair extraction is exhaustive and relies on a single property (if $ab$ is a pair in $T$ then $ba$ is not a pair). For an alphabet $\Sigma$ this implies $O(|\Sigma|^2)$ cardinality computations. This cost can be lowered by observing that, for $(a,b)$ to be a pair, the frequencies of $a$ and $b$ must be the same, so only symbols with equal frequency need to be compared.
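To get a feeling for how much the frequency condition prunes the search, the sketch below counts the ordered pairs (including $(a,a)$) that remain once only symbols of equal frequency are compared. The helper name `candidate_pair_count` and the toy trace are illustrative assumptions, not part of either theory module.

```python
from collections import Counter

def candidate_pair_count(trace):
    """Count ordered pairs (a, b), a == b included, whose symbols share a frequency.

    Hypothetical helper for illustration only.
    """
    freq = Counter(trace)                 # symbol -> number of occurrences
    group_sizes = Counter(freq.values())  # frequency -> number of symbols having it
    # A group of k symbols contributes k * k ordered pairs.
    return sum(k * k for k in group_sizes.values())

trace = "abzabzzxy"                       # a:2, b:2, z:3, x:1, y:1
alphabet_size = len(set(trace))

print(alphabet_size ** 2)                 # ordered pairs without the filter
print(candidate_pair_count(trace))        # ordered pairs with the filter
```

For this toy trace the filter keeps 9 of the 25 ordered pairs; the saving grows as the symbols spread over more distinct frequencies.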

## Python Preamble

In [1]:
import random
import sys
sys.path.append("../../src")
%reload_ext autoreload
%autoreload 2

from theory202105 import *
# from helpers import graph
import theory202104

In [2]:
import logging, logging.config, sys

# Disable other loggers
logging.config.dictConfig({
    'version': 1,
    'disable_existing_loggers': True,
})

# Set logging for this notebook only
logging.basicConfig(
    format='[%(asctime)s] %(levelname)s: %(message)s',
    level=logging.INFO,
    handlers=[
        logging.StreamHandler(stream=sys.stderr),
    ]
)

## 202104 all pair extraction

In [3]:
N=[1000]; Sigma=[100,200,300, 400, 500]
for n in N:
    for s in Sigma:
        print(f'\nExtract pairs in a random trace of length {n}, with {s} symbols')
        %time theory202104.pairs_in_trace( [random.randint(1,s) for _ in range(n)] )


Extract pairs in a random trace of length 1000, with 100 symbols
CPU times: user 542 ms, sys: 3.71 ms, total: 546 ms
Wall time: 547 ms

Extract pairs in a random trace of length 1000, with 200 symbols
CPU times: user 2.1 s, sys: 8.45 ms, total: 2.11 s
Wall time: 2.12 s

Extract pairs in a random trace of length 1000, with 300 symbols
CPU times: user 4.35 s, sys: 13.9 ms, total: 4.36 s
Wall time: 4.38 s

Extract pairs in a random trace of length 1000, with 400 symbols
CPU times: user 6.89 s, sys: 15.7 ms, total: 6.9 s
Wall time: 6.91 s

Extract pairs in a random trace of length 1000, with 500 symbols
CPU times: user 9.59 s, sys: 22.1 ms, total: 9.62 s
Wall time: 9.63 s


## New pairs in trace

In the previous version the full adjacency matrix is returned, mostly filled with -1. In this version the trace is split into subtraces, each made of the symbols that share the same frequency, and pairs are searched only within each subtrace.

```python
def pairs_in_trace(Trace, cardinality=pair_cardinality):
    Pairs_in_T = {}

    # Fill freq table
    
    freq_of_char = {}
    for a in Trace:
        freq_of_char[a] = freq_of_char.get(a, 0) + 1    
        
    # Split the trace into subtraces keyed by symbol frequency.
    # This step preserves order but reduces trace size.

    hashTrace = {}
    for a in Trace:
        f = freq_of_char[a]
        # Symbols are dict keys above, so they are hashable and never lists:
        # appending in place is enough and avoids rebuilding the list each time.
        hashTrace.setdefault(f, []).append(a)
        
    # For each sliced trace compute all positive pairs
    for f, T in hashTrace.items():

        # The alphabet in T
        Sigma = list(set(T))
    
        for i in range( len(Sigma) ):
            a = Sigma[i]

            Pairs_in_T[ (a,a) ] = cardinality(a, a, T)
            for b in Sigma[i+1:]:
                Pairs_in_T[ (a,b) ] = cardinality(a, b, T)
                # asymmetric paired property: if (a,b) is paired in T then (b,a) is not
                if Pairs_in_T[ (a,b) ]>=0:
                    Pairs_in_T[ (b,a) ] = -1  
                else: 
                    Pairs_in_T[ (b,a) ] = cardinality(b, a, T)
                    
    return Pairs_in_T
```
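The splitting step can be looked at in isolation. The following is a minimal sketch, assuming symbols are hashable; `split_by_frequency` is a hypothetical name used only here, and it mirrors the frequency table and `hashTrace` construction above without calling `cardinality`.

```python
from collections import Counter

def split_by_frequency(trace):
    """Split a trace into subtraces keyed by symbol frequency (illustrative helper)."""
    freq = Counter(trace)                  # symbol -> number of occurrences
    subtraces = {}
    for symbol in trace:                   # a single pass keeps the original order
        subtraces.setdefault(freq[symbol], []).append(symbol)
    return subtraces

print(split_by_frequency("abzabzzxy"))
# {2: ['a', 'b', 'a', 'b'], 3: ['z', 'z', 'z'], 1: ['x', 'y']}
```

Each subtrace keeps the relative order of its symbols, and a symbol lands in exactly one subtrace, so the pairs found in different subtraces never collide.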

In [8]:
# Note: the sequence 1234 is no longer inferred once 4321 is added to the trace.
T='123abcabc44321'
# pairs = pairs_in_trace(T)
# pandasPair(pairs)
print('Sequences inferred from 202104 theory: {}'.format(theory202104.sequences_from_log([T]))) 
print('Sequences inferred from 202105 theory: {}'.format(sequences_from_log([T]))) 

Sequences inferred from 202104 theory: {2: [['a', 'b', 'c']]}
Sequences inferred from 202105 theory: {2: [['a', 'b', 'c']]}


In [5]:
L=['aXYbaYXb']
for X in L:
    print('For trace {} the positive pair graph is:'.format(X))
#     graph(positive_graph(X))

print('Sequences inferred from 202104 theory: {}'.format(theory202104.sequences_from_log(L))) 
print('Sequences inferred from 202105 theory: {}'.format(sequences_from_log(L))) 
# pairs = pairs_in_trace(L[0])
# pandasPair(pairs)

For trace aXYbaYXb the positive pair graph is:
Sequences inferred from 202104 theory: {2: [['a', 'X', 'b'], ['a', 'Y', 'b']]}
Sequences inferred from 202105 theory: {2: [['a', 'X', 'b'], ['a', 'Y', 'b']]}


In [6]:
%%time
N=[1000]; Sigma=[500]
for n in N:
    for s in Sigma:
        print(f'\nExtract pairs in a random trace of length {n}, with {s} symbols')
        P = pairs_in_trace( [ f'x{random.randint(1,s)}' for _ in range(n)] )
        
print(f'I found {len(P)} pairs in the random trace')
pandasPair(P)


Extract pairs in a random trace of length 1000, with 500 symbols
I found 46616 pairs in the random trace
CPU times: user 673 ms, sys: 6.68 ms, total: 680 ms
Wall time: 678 ms


| | pairs |
|---|---|
| (x408, x408) | -1 |
| (x408, x360) | -1 |
| (x360, x408) | -1 |
| (x408, x291) | -1 |
| (x291, x408) | -1 |
| ... | ... |
| (x127, x127) | -1 |
| (x127, x401) | -1 |
| (x401, x127) | -1 |
| (x401, x401) | -1 |


For $N=1000$, $|\Sigma|=500$ the wall time drops from 9.63 s to 678 ms, roughly a x14 improvement. Not bad. Let's try with bigger values.
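The improvement factor can be checked directly from the wall times printed by the cells above. This is only the arithmetic, written as a small sketch; `speedup` is a hypothetical helper, and the two figures are copied from the 202104 run with 500 symbols and from the timed cell of the new version.

```python
def speedup(baseline_seconds, new_seconds):
    """Ratio between two wall times (illustrative helper)."""
    return baseline_seconds / new_seconds

wall_202104 = 9.63    # seconds, 202104 version, N=1000, 500 symbols
wall_202105 = 0.678   # seconds, new version, same N and alphabet size

print(f"x{speedup(wall_202104, wall_202105):.1f}")
```

The two runs use different random traces, and the new timing also includes building the table, so the ratio should be read as an order of magnitude rather than a precise benchmark.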

In [7]:
N=[1000, 5000, 10000]; Sigma=[500, 1000, 2000]
for n in N:
    for s in Sigma:
        print(f'\nExtract pairs in a random trace of length {n}, with {s} symbols')
        %time pairs_in_trace( [random.randint(1,s) for _ in range(n)] )


Extract pairs in a random trace of length 1000, with 500 symbols
CPU times: user 535 ms, sys: 3.77 ms, total: 539 ms
Wall time: 538 ms

Extract pairs in a random trace of length 1000, with 1000 symbols
CPU times: user 2.72 s, sys: 11 ms, total: 2.73 s
Wall time: 2.73 s

Extract pairs in a random trace of length 1000, with 2000 symbols
CPU times: user 7.62 s, sys: 25.4 ms, total: 7.64 s
Wall time: 7.66 s

Extract pairs in a random trace of length 5000, with 500 symbols
CPU times: user 611 ms, sys: 2.01 ms, total: 613 ms
Wall time: 613 ms

Extract pairs in a random trace of length 5000, with 1000 symbols
CPU times: user 4.56 s, sys: 6.65 ms, total: 4.57 s
Wall time: 4.57 s

Extract pairs in a random trace of length 5000, with 2000 symbols
CPU times: user 34.6 s, sys: 88.7 ms, total: 34.7 s
Wall time: 34.7 s

Extract pairs in a random trace of length 10000, with 500 symbols
CPU times: user 681 ms, sys: 4.63 ms, total: 685 ms
Wall time: 689 ms

Extract pairs in a random trace of length 10