jl2 edited this page Sep 14, 2010 · 3 revisions
Clone this wiki locally

Features in the Python version:

Feature Example Implemented
Character classes [a-z] X
Closure a* X
Concatenation abc X
Alternation a|b X
Grouping (ab)* X
One or more a+ X
Numeric quantifiers a{1,3} X
Zero or one a? X
POSIX named character classes [:alpha:] X
Negative character classes [^a-z]  

After briefly thinking about negative character classes, my initial instinct is that they will require a pretty fundamental change to the NFA data structure being used. Characters and character classes are currently handled by enumerating all of the characters in the class and creating a transition in the NFA for each one.

The fundamental problem is that the implementation isn’t scalable. A regex like ‘[a-\u0xffff]{10}’ will blow up memory. To create a negative character class would require enumerating every unicode character not in the set. Also infeasible.

For the time being I’m going to ignore the problem while working on the code to convert the NFA to a DFA.