## PDFA Learning

This notebook shows how to use the implementation
of the learning algorithm for probabilistic deterministic
finite automata (PDFA) described in [1].

### Example

The following cell imports the library and defines utility functions to render an automaton and display the resulting SVG.

In [1]:
%matplotlib inline
from pprint import pprint
from src.learn_pdfa.base import learn_pdfa
from src.learn_pdfa.common import MultiprocessedGenerator
import tempfile
from pathlib import Path
from IPython.display import display, HTML, SVG
from src.pdfa import PDFA
from src.pdfa.render import to_graphviz

_default_svg_style = "display: block; margin-left: auto; margin-right: auto; width: 50%;"
def display_svgs(*filenames, style=_default_svg_style):
    svgs = [SVG(filename=f).data for f in filenames]
    joined_svgs = "".join(svgs)
    no_wrap_div = f'<div style="{style}white-space: nowrap">{joined_svgs}</div>'
    display(HTML(no_wrap_div))

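def render_automaton(pdfa: PDFA):
    # Convert the given PDFA to a Graphviz digraph.
    digraph = to_graphviz(pdfa)
    # Render into a fresh temporary directory and display the SVG.
    tmp_dir = tempfile.mkdtemp()
    tmp_filepath = str(Path(tmp_dir, "output"))
    digraph.render(tmp_filepath)
    display_svgs(tmp_filepath + ".svg")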

## Example with 1 state.

We use the following automaton to generate samples. In `transition_dict`, each state maps every symbol to a pair `(next_state, probability)`.

In [2]:
p = 0.3
automaton = PDFA(
    nb_states=1,
    alphabet_size=2,
    transition_dict={
        0: {
            0: (0, p),
            1: (1, 1 - p),
        }
    }
)
render_automaton(automaton)

[2020-10-04 17:30:09,127][graphviz.files][DEBUG] write 148 bytes to '/tmp/tmpiaz6vifl/output'
[2020-10-04 17:30:09,128][graphviz.backend][DEBUG] run ['dot', '-Tsvg', '-O', 'output']
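
To make the sampling process concrete, here is a minimal pure-Python sketch of drawing words from the automaton above. It does not use the library: `sample_word` is a hypothetical helper, and it assumes that state 1, which has no outgoing transitions in `transition_dict`, acts as the final state where a sample ends.

```python
import random


def sample_word(transitions, final_state, rng, start_state=0):
    """Walk the automaton from start_state until final_state is reached."""
    word = []
    state = start_state
    while state != final_state:
        # Outgoing symbols of the current state and their probabilities.
        symbols = list(transitions[state])
        weights = [transitions[state][s][1] for s in symbols]
        symbol = rng.choices(symbols, weights=weights)[0]
        word.append(symbol)
        # Follow the transition to the next state.
        state = transitions[state][symbol][0]
    return word


p = 0.3
transitions = {0: {0: (0, p), 1: (1, 1 - p)}}

# Assumption: state 1 is the final (absorbing) state.
rng = random.Random(0)
words = [sample_word(transitions, final_state=1, rng=rng) for _ in range(20000)]
avg_length = sum(len(w) for w in words) / len(words)
print(avg_length)
```

Under that assumption the word length is geometric with mean $1 / (1 - p) \approx 1.43$, so the empirical average should land close to that value.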


Now we run the PAC learning algorithm
to learn the above automaton:

- `MultiprocessedGenerator` wraps the automaton and generates
  samples using multiple processes;
- `learn_pdfa` is the main entry point of the algorithm implementation;
- `n1_max_debug` caps the sample size $N_1$ (used for subgraph learning);
- `n2_max_debug` caps the sample size $N_2$ (used for learning the probabilities);
- `m0_max_debug` caps the threshold $m_0$ (used for multiset filtering).
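
As a rough illustration of what a debug cap does, the sketch below limits a theoretically required quantity to a smaller budget. The helper `capped` is hypothetical and not part of the library; the value 466658 is the theoretical $m_0$ reported in the log of the next cell.

```python
def capped(theoretical, cap=None):
    """Return the theoretical value, limited by an optional debug cap."""
    if cap is None:
        return theoretical
    return min(theoretical, cap)


m0_theoretical = 466658  # logged as "m0 = 466658"
# Assumption: the cap is applied as a simple minimum.
m0_used = capped(m0_theoretical, cap=100000 / 10)
print(m0_used)  # 10000.0
```

Applying the cap as a simple minimum is an assumption: it matches the `using m0 = 10000.0` line in the log, but the source does not confirm how the library implements it.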

In [7]:
generator = MultiprocessedGenerator(automaton, nb_processes=8)

pdfa = learn_pdfa(
    sample_generator=generator,
    alphabet_size=2,
    epsilon=0.2,
    delta_1=0.2,
    delta_2=0.2,
    mu=0.1,
    n=3,
    n1_max_debug=100000,
    n2_max_debug=100000,
    m0_max_debug=100000 / 10,
)

[2020-10-04 17:32:08,897][src.learn_pdfa][INFO] Parameters: ('_Params(sample_generator=<src.learn_pdfa.common.MultiprocessedGenerator '
 'object at 0x7f5c2be78ed0>, alphabet_size=2, epsilon=0.2, delta_1=0.2, '
 'delta_2=0.2, mu=0.1, n=3, m0_max_debug=10000.0, n1_max_debug=100000, '
 'n2_max_debug=100000)')
[2020-10-04 17:32:08,899][src.learn_pdfa][INFO] N1 = 54432.579348157145, N2 = 55998960.0. Chosen: 55998960
[2020-10-04 17:32:08,900][src.learn_pdfa][INFO] m0 = 466658
[2020-10-04 17:32:08,901][src.learn_pdfa][INFO] N = 55998960
[2020-10-04 17:32:08,901][src.learn_pdfa][INFO] using m0 = 10000.0, N = 100000
[2020-10-04 17:32:10,078][src.learn_pdfa][INFO] Sampling done.
[2020-10-04 17:32:10,079][src.learn_pdfa][INFO] Number of samples: 100000.
[2020-10-04 17:32:10,083][src.learn_pdfa][INFO] Avg. length of samples: 1.4224.
[2020-10-04 17:32:10,209][src.learn_pdfa][INFO] Iteration 0
[2020-10-04 17:32:10,359][src.learn_pdfa][INFO] Iteration 1
[2020-10-04 17:32:10,448][src.learn_pdfa][INFO]

The learned automaton is:

In [8]:
print("Transitions: ")
pprint(pdfa.transitions)
render_automaton(pdfa)

[2020-10-04 17:32:27,330][graphviz.files][DEBUG] write 148 bytes to '/tmp/tmplcjmh_3y/output'
[2020-10-04 17:32:27,332][graphviz.backend][DEBUG] run ['dot', '-Tsvg', '-O', 'output']


Transitions: 
{(0, 0, 0.3002295247158932, 0), (0, 1, 0.6997704752841069, 1)}
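
A quick way to check the result is to compare the learned probabilities with the true ones. The sketch below assumes each tuple in `pdfa.transitions` has the layout `(state, symbol, probability, next_state)`, which is inferred from the printed output rather than from the library documentation; the values are copied from the output above.

```python
p = 0.3

# True transition probabilities of the 1-state automaton, keyed by (state, symbol).
true_probabilities = {(0, 0): p, (0, 1): 1 - p}

# Learned transitions, copied from the printed output.
# Assumed layout: (state, symbol, probability, next_state).
learned = {(0, 0, 0.3002295247158932, 0), (0, 1, 0.6997704752841069, 1)}

# Absolute error of each learned probability.
errors = {
    (state, symbol): abs(prob - true_probabilities[(state, symbol)])
    for state, symbol, prob, _next_state in learned
}
max_error = max(errors.values())
print(max_error)  # roughly 2.3e-4
```

The per-transition error is not the variation distance that `epsilon` bounds, but it gives a quick sanity check that the learned probabilities are close to `p` and `1 - p`.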


## Example with 2 states.

Now let's try to learn the following automaton:

In [9]:
p1 = 0.4
p2 = 0.7
automaton = PDFA(
    nb_states=2,
    alphabet_size=2,
    transition_dict={
        0: {
            0: (1, p1),
            1: (2, 1 - p1),
        },
        1: {
            0: (2, 1 - p2),
            1: (1, p2),
        },
    },
)
render_automaton(automaton)


[2020-10-04 17:32:36,415][graphviz.files][DEBUG] write 201 bytes to '/tmp/tmp4oc3458o/output'
[2020-10-04 17:32:36,417][graphviz.backend][DEBUG] run ['dot', '-Tsvg', '-O', 'output']


In [10]:
generator = MultiprocessedGenerator(automaton, nb_processes=8)

pdfa = learn_pdfa(
    sample_generator=generator,
    alphabet_size=2,
    epsilon=0.2,
    delta_1=0.2,
    delta_2=0.2,
    mu=0.1,
    n=3,
    n1_max_debug=3000000,
    n2_max_debug=1000000,
    m0_max_debug=3000000 / 10,
)

[2020-10-04 17:32:38,910][src.learn_pdfa][INFO] Parameters: ('_Params(sample_generator=<src.learn_pdfa.common.MultiprocessedGenerator '
 'object at 0x7f5c64a83210>, alphabet_size=2, epsilon=0.2, delta_1=0.2, '
 'delta_2=0.2, mu=0.1, n=3, m0_max_debug=300000.0, n1_max_debug=3000000, '
 'n2_max_debug=1000000)')
[2020-10-04 17:32:38,912][src.learn_pdfa][INFO] N1 = 54432.579348157145, N2 = 55998960.0. Chosen: 55998960
[2020-10-04 17:32:38,913][src.learn_pdfa][INFO] m0 = 466658
[2020-10-04 17:32:38,915][src.learn_pdfa][INFO] N = 55998960
[2020-10-04 17:32:38,916][src.learn_pdfa][INFO] using m0 = 300000.0, N = 3000000
[2020-10-04 17:34:00,177][src.learn_pdfa][INFO] Sampling done.
[2020-10-04 17:34:00,178][src.learn_pdfa][INFO] Number of samples: 3000000.
[2020-10-04 17:34:00,259][src.learn_pdfa][INFO] Avg. length of samples: 2.33844.
[2020-10-04 17:34:03,306][src.learn_pdfa][INFO] Iteration 0
[2020-10-04 17:34:11,405][src.learn_pdfa][INFO] Iteration 1
[2020-10-04 17:34:18,205][src.learn_pdfa

In [11]:
render_automaton(pdfa)

[2020-10-04 17:35:17,370][graphviz.files][DEBUG] write 201 bytes to '/tmp/tmpyopxpn2d/output'
[2020-10-04 17:35:17,374][graphviz.backend][DEBUG] run ['dot', '-Tsvg', '-O', 'output']


## References

- [1] Palmer N., Goldberg P.W. (2005)
  PAC-Learnability of Probabilistic Deterministic
  Finite State Automata in Terms of
  Variation Distance.
  In: Jain S., Simon H.U., Tomita E. (eds)
  Algorithmic Learning Theory. ALT 2005.
  Lecture Notes in Computer Science, vol 3734.
  Springer, Berlin, Heidelberg.
  https://doi.org/10.1007/11564089_14