Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider merging char and range transitions in NFA and DFA #40

Open
osa1 opened this issue Feb 4, 2022 · 0 comments
Open

Consider merging char and range transitions in NFA and DFA #40

osa1 opened this issue Feb 4, 2022 · 0 comments

Comments

@osa1
Copy link
Owner

osa1 commented Feb 4, 2022

Currently we have these transitions in NFAs:

struct State<A> {
    char_transitions: Map<char, Set<StateIdx>>,
    range_transitions: RangeMap<Set<StateIdx>>,
    empty_transitions: Set<StateIdx>,
    any_transitions: Set<StateIdx>,
    end_of_input_transitions: Set<StateIdx>,
    ...
}

(I was confused for a few seconds here on why we have both empty_transitions and end_of_input_transitions, we should document that "empty" is epsilon)

and these in DFAs:

pub struct State<T, A> {
    char_transitions: Map<char, T>,
    range_transitions: RangeMap<T>,
    any_transition: Option<T>,
    end_of_input_transition: Option<T>,
    ...
}

In both of these, we could merge char and range transitions in a single RangeMap field. When the transition happens on a character the range will include a single character. This has a few advantages:

  • It simplifies things and adds no new or special cases to consider, as range transitions can already have just one character today.

    • I think this case may be handled poorly in the code generator today, e.g. maybe we're generating guards like x >= 'a' && x <= 'a'. We should check this.
  • When we implement DFA minimization (Implement DFA minimization #38) we will have to iterate all inputs of a partition (i.e. set of states). There we will have to merge (or actually, take difference of) characters and ranges. For example, in a partition, if a state has 'a' as input, another has ['a'-'z'] as input, we will have to test the inputs 'a' and ['b'-'z'] on the partition. I don't know how to implement this yet, but clearly we need to consider range-range overlaps and also range-char overlaps. Removing range vs. char distinction means less cases to handle.

    Actually, all states in a partition need to handle the same range, otherwise they can be distinguished and so should be moved to separate partitions. So we will have [RangeMap] (one for each state in the partition), split the partition into new partitions where for every state in a new partition, the RangeMaps have the same domain. Then we can use any of the RangeMaps to make sure they map same inputs to same partitions. Having just RangeMaps simplifies this.

Disadvantage is that for char transitions we will need heap allocation for the Vec<Range>. Two ways to avoid that:

  • Use SmallVec<[Range; 1]>
  • Make RangeMap an enum, with a variant for single-character ranges. With this we will still need to treat single-char ranges differently, but the handling will be encapsulated in RangeMap module. Outside of RangeMap we will only have ranges. (except in the codegen we will need to ask if a range is single-char to generate simpler conditionals)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant