Consider merging char and range transitions in NFA and DFA #40

osa1 · 2022-02-04T09:03:36Z

Currently we have these transitions in NFAs:

struct State<A> {
    char_transitions: Map<char, Set<StateIdx>>,
    range_transitions: RangeMap<Set<StateIdx>>,
    empty_transitions: Set<StateIdx>,
    any_transitions: Set<StateIdx>,
    end_of_input_transitions: Set<StateIdx>,
    ...
}

(I was confused for a few seconds here on why we have both empty_transitions and end_of_input_transitions, we should document that "empty" is epsilon)

and these in DFAs:

pub struct State<T, A> {
    char_transitions: Map<char, T>,
    range_transitions: RangeMap<T>,
    any_transition: Option<T>,
    end_of_input_transition: Option<T>,
    ...
}

In both of these, we could merge char and range transitions in a single RangeMap field. When the transition happens on a character the range will include a single character. This has a few advantages:

It simplifies things and adds no new or special cases to consider, as range transitions can already have just one character today.
- I think this case may be handled poorly in the code generator today, e.g. maybe we're generating guards like x >= 'a' && x <= 'a'. We should check this.
When we implement DFA minimization (Implement DFA minimization #38) we will have to iterate all inputs of a partition (i.e. set of states). There we will have to merge (or actually, take difference of) characters and ranges. For example, in a partition, if a state has 'a' as input, another has ['a'-'z'] as input, we will have to test the inputs 'a' and ['b'-'z'] on the partition. I don't know how to implement this yet, but clearly we need to consider range-range overlaps and also range-char overlaps. Removing range vs. char distinction means less cases to handle.

Actually, all states in a partition need to handle the same range, otherwise they can be distinguished and so should be moved to separate partitions. So we will have [RangeMap] (one for each state in the partition), split the partition into new partitions where for every state in a new partition, the RangeMaps have the same domain. Then we can use any of the RangeMaps to make sure they map same inputs to same partitions. Having just RangeMaps simplifies this.

Disadvantage is that for char transitions we will need heap allocation for the Vec<Range>. Two ways to avoid that:

Use SmallVec<[Range; 1]>
Make RangeMap an enum, with a variant for single-character ranges. With this we will still need to treat single-char ranges differently, but the handling will be encapsulated in RangeMap module. Outside of RangeMap we will only have ranges. (except in the codegen we will need to ask if a range is single-char to generate simpler conditionals)

The text was updated successfully, but these errors were encountered:

osa1 added the refactoring label Feb 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider merging char and range transitions in NFA and DFA #40

Consider merging char and range transitions in NFA and DFA #40

osa1 commented Feb 4, 2022 •

edited

Consider merging char and range transitions in NFA and DFA #40

Consider merging char and range transitions in NFA and DFA #40

Comments

osa1 commented Feb 4, 2022 • edited

osa1 commented Feb 4, 2022 •

edited