# pbrisbin/pbrisbin.com

---
title: Regular Expression Evaluation via Finite Automata
tags: haskell
---

What follows is a literate Haskell file runnable via `ghci`. The raw source for this page can be found [here][here].

[here]: https://github.com/pbrisbin/pbrisbin.com/blob/master/posts/2014-04-07-regular_expression_evaluation_via_finite_automata.lhs

While reading [Understanding Computation][uc] again last night, I was going back through the chapter where Tom Stuart describes deterministic and non-deterministic finite automata. These simple state machines seem like little more than a teaching tool, but he eventually uses them as the implementation for a regular expression matcher. I thought seeing this concrete use for such an abstract idea was interesting and wanted to reinforce the ideas by implementing such a system myself -- with Haskell, of course.

[uc]: http://computationbook.com/

Before we get started, we'll just need to import some libraries:

> import Control.Monad.State
> import Data.List (foldl')
> import Data.Maybe

Patterns and NFAs

We're going to model a subset of regular expression patterns.

> data Pattern
>     = Empty                   -- ""
>     | Literal Char            -- "a"
>     | Concat Pattern Pattern  -- "ab"
>     | Choose Pattern Pattern  -- "a|b"
>     | Repeat Pattern          -- "a*"
>     deriving Show

With this, we can build "pattern ASTs" to represent regular expressions:

```
ghci> let p = Choose (Literal 'a') (Repeat (Literal 'b')) -- /a|b*/
```

It's easy to picture a small parser to build these out of strings, but we won't do that as part of this post. Instead, we'll focus on converting these patterns into [Nondeterministic Finite Automata][nfa] or NFAs. We can then use the NFAs to determine if the pattern matches a given string.

[nfa]: http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton

To explain NFAs, it's probably easiest to explain DFAs, their deterministic counterparts, first. Then we can go on to describe how NFAs differ.

A DFA is a simple machine with states and rules. The rules describe how to move between states in response to particular input characters. Certain states are special and flagged as "accept" states. If, after reading a series of characters, the machine is left in an accept state, it's said that the machine "accepted" that particular input.

An NFA is the same with two notable differences: First, an NFA can have rules to move it into more than one state in response to the same input character. This means the machine can be in more than one state at once. Second, there is the concept of a Free Move, which means the machine can jump between certain states without reading any input.

Modeling an NFA requires a type with rules, current states, and accept states:

> type SID = Int -- State Identifier
>
> data NFA = NFA
>     { rules         :: [Rule]
>     , currentStates :: [SID]
>     , acceptStates  :: [SID]
>     } deriving Show

A rule defines what characters tell the machine to change states and which state to move into.

> data Rule = Rule
>     { fromState  :: SID
>     , inputChar  :: Maybe Char
>     , nextStates :: [SID]
>     } deriving Show

Notice that `nextStates` and `currentStates` are lists. This is to represent the machine moving to, and remaining in, more than one state in response to a particular character. Similarly, `inputChar` is a `Maybe` value because it will be `Nothing` in the case of a rule representing a Free Move.

If, after processing some input, any of the machine's current states (*or any states we can reach via a free move*) are in its list of "accept" states, the machine has accepted the input.

> accepts :: NFA -> [Char] -> Bool
> accepts nfa = accepted . foldl' process nfa
>
>   where
>     accepted :: NFA -> Bool
>     accepted nfa = any (`elem` acceptStates nfa) (currentStates nfa ++ freeStates nfa)

Processing a single character means finding any followable rules for the given character and the current machine state, and following them.

> process :: NFA -> Char -> NFA
> process nfa c = case findRules c nfa of
>     -- Invalid input should cause the NFA to go into a failed state.
>     -- We can do that easily, just remove any acceptStates.
>     [] -> nfa { acceptStates = [] }
>     rs -> nfa { currentStates = followRules rs }
>
> findRules :: Char -> NFA -> [Rule]
> findRules c nfa = filter (ruleApplies c nfa) $ rules nfa

A rule applies if

1. The read character is a valid input character for the rule, and
2. That rule applies to an available state

> ruleApplies :: Char -> NFA -> Rule -> Bool
> ruleApplies c nfa r =
>     maybe False (c ==) (inputChar r) &&
>     fromState r `elem` availableStates nfa

An "available" state is one which we're currently in, or can reach via Free Moves.

> availableStates :: NFA -> [SID]
> availableStates nfa = currentStates nfa ++ freeStates nfa

The process of finding free states (those reachable via Free Moves) gets a bit hairy. We need to start from our current state(s) and follow any Free Moves *recursively*.
This ensures that Free Moves which lead to other Free Moves are correctly accounted for.

> freeStates :: NFA -> [SID]
> freeStates nfa = go [] (currentStates nfa)
>
>   where
>     go acc [] = acc
>     go acc ss =
>         let ss' = filter (`notElem` acc) $ followRules $ freeMoves nfa ss
>         in go (acc ++ ss') ss'

*(Many thanks go to Christopher Swenson for spotting an infinite loop here and fixing it by filtering out any states already in the accumulator)*

Free Moves from a given set of states are rules for those states which have no input character.

> freeMoves :: NFA -> [SID] -> [Rule]
> freeMoves nfa ss = filter (\r ->
>     (fromState r `elem` ss) && (isNothing $ inputChar r)) $ rules nfa

Of course, the states that result from following rules are simply the concatenation of those rules' next states.

> followRules :: [Rule] -> [SID]
> followRules = concatMap nextStates

Now we can model an NFA and see if it accepts a string or not. You could test this in `ghci` by defining an NFA in state 1 with an accept state 2 and a single rule that moves the machine from 1 to 2 if the character "a" is read.

```
ghci> let nfa = NFA [Rule 1 (Just 'a') [2]] [1] [2]
ghci> nfa `accepts` "a"
True
ghci> nfa `accepts` "b"
False
```

Pretty cool.

What we need to do now is construct an NFA whose rules for moving from state to state are derived from the nature of the pattern it represents. A string matches the pattern only if the NFA we construct ends in an accept state after reading that string.

> matches :: String -> Pattern -> Bool
> matches s = (`accepts` s) . toNFA

We'll define `toNFA` later, but if you've loaded this file, you can play with it in `ghci` now:

```
ghci> "" `matches` Empty
True
ghci> "abc" `matches` Empty
False
```

And use it in an example `main`:

> main :: IO ()
> main = do
>     -- This AST represents the pattern /ab|cd*/:
>     let p = Choose
>             (Concat (Literal 'a') (Literal 'b'))
>             (Concat (Literal 'c') (Repeat (Literal 'd')))
>
>     print $ "xyz" `matches` p
>     -- => False
>
>     print $ "cddd" `matches` p
>     -- => True

Before I show `toNFA`, we need to talk about mutability.
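
As a quick aside on the machinery so far: the `ghci` example earlier only exercised a rule with an input character, so here is a rough, standalone sketch of how Free Moves behave. It restates the types and functions from this section in condensed form (list comprehensions stand in for `findRules`, `freeMoves` and `followRules`) so that it can be compiled on its own, separately from this file. The machine `freeMoveNfa` is a hypothetical example of mine, not something taken from the book: states 1 and 2 are joined by Free Moves in both directions, which is the kind of cycle that would loop forever without the accumulator filter.

```haskell
import Data.List (foldl')
import Data.Maybe (isNothing)

type SID = Int

data Rule = Rule
    { fromState  :: SID
    , inputChar  :: Maybe Char -- Nothing marks a Free Move
    , nextStates :: [SID]
    }

data NFA = NFA
    { rules         :: [Rule]
    , currentStates :: [SID]
    , acceptStates  :: [SID]
    }

-- States reachable from the current states by following Free Moves only.
-- States already in the accumulator are dropped, so cycles terminate.
freeStates :: NFA -> [SID]
freeStates nfa = go [] (currentStates nfa)
  where
    go acc [] = acc
    go acc ss =
        let ss' = filter (`notElem` acc) $ concatMap nextStates
                [ r | r <- rules nfa, fromState r `elem` ss, isNothing (inputChar r) ]
        in  go (acc ++ ss') ss'

availableStates :: NFA -> [SID]
availableStates nfa = currentStates nfa ++ freeStates nfa

-- Follow every rule for this character that starts in an available state;
-- if there is none, clear the accept states so the machine can never accept.
process :: NFA -> Char -> NFA
process nfa c =
    case [ r | r <- rules nfa, inputChar r == Just c, fromState r `elem` availableStates nfa ] of
        [] -> nfa { acceptStates = [] }
        rs -> nfa { currentStates = concatMap nextStates rs }

accepts :: NFA -> String -> Bool
accepts nfa = accepted . foldl' process nfa
  where
    accepted n = any (`elem` acceptStates n) (availableStates n)

-- Hypothetical machine: 1 <-> 2 via Free Moves (a cycle), and reading 'a'
-- in state 2 moves to the accept state 3.
freeMoveNfa :: NFA
freeMoveNfa = NFA
    [ Rule 1 Nothing [2]
    , Rule 2 Nothing [1]
    , Rule 2 (Just 'a') [3]
    ]
    [1]
    [3]

main :: IO ()
main = do
    print $ freeStates freeMoveNfa      -- [2,1]
    print $ freeMoveNfa `accepts` "a"   -- True
    print $ freeMoveNfa `accepts` ""    -- False
    print $ freeMoveNfa `accepts` "aa"  -- False
```

The machine starts in state 1, which has no rule for "a", yet it still accepts "a" because state 2 is available via a Free Move. Note that `freeStates` reports state 1 as well, since the cycle leads back to it; the duplicate is harmless here because the lists are only ever searched with `elem`.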