{{ message }}
This repository has been archived by the owner. It is now read-only.

# pbrisbin / pbrisbin.com Public archive

Switch branches/tags
Nothing to show

Cannot retrieve contributors at this time
378 lines (300 sloc) 13.2 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 --- title: Regular Expression Evaluation via Finite Automata tags: haskell ---
What follows is a literate haskell file runnable via `ghci`. The raw source for this page can be found [here][here].
[here]: https://github.com/pbrisbin/pbrisbin.com/blob/master/posts/2014-04-07-regular_expression_evaluation_via_finite_automata.lhs While reading [Understanding Computation][uc] again last night, I was going back through the chapter where Tom Stuart describes deterministic and non-deterministic finite automata. These simple state machines seem like little more than a teaching tool, but he eventually uses them as the implementation for a regular expression matcher. I thought seeing this concrete use for such an abstract idea was interesting and wanted to re-enforce the ideas by implementing such a system myself -- with Haskell, of course. [uc]: http://computationbook.com/ Before we get started, we'll just need to import some libraries: > import Control.Monad.State > import Data.List (foldl') > import Data.Maybe

Patterns and NFAs

We're going to model a subset of regular expression patterns. > data Pattern > = Empty -- "" > | Literal Char -- "a" > | Concat Pattern Pattern -- "ab" > | Choose Pattern Pattern -- "a|b" > | Repeat Pattern -- "a*" > deriving Show With this, we can build "pattern ASTs" to represent regular expressions: ``` ghci> let p = Choose (Literal 'a') (Repeat (Literal 'b')) -- /a|b*/ ``` It's easy to picture a small parser to build these out of strings, but we won't do that as part of this post. Instead, we'll focus on converting these patterns into [Nondeterministic Finite Automata][nfa] or NFAs. We can then use the NFAs to determine if the pattern matches a given string. [nfa]: http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton To explain NFAs, it's probably easiest to explain DFAs, their deterministic counter parts, first. Then we can go on to describe how NFAs differ. A DFA is a simple machine with states and rules. The rules describe how to move between states in response to particular input characters. Certain states are special and flagged as "accept" states. If, after reading a series of characters, the machine is left in an accept state it's said that the machine "accepted" that particular input. An NFA is the same with two notable differences: First, an NFA can have rules to move it into more than one state in response to the same input character. This means the machine can be in more than one state at once. Second, there is the concept of a Free Move which means the machine can jump between certain states without reading any input. Modeling an NFA requires a type with rules, current states, and accept states: > type SID = Int -- State Identifier > > data NFA = NFA > { rules :: [Rule] > , currentStates :: [SID] > , acceptStates :: [SID] > } deriving Show A rule defines what characters tell the machine to change states and which state to move into. > data Rule = Rule > { fromState :: SID > , inputChar :: Maybe Char > , nextStates :: [SID] > } deriving Show Notice that `nextStates` and `currentStates` are lists. This is to represent the machine moving to, and remaining in, more than one state in response to a particular character. Similarly, `inputChar` is a `Maybe` value because it will be `Nothing` in the case of a rule representing a Free Move. If, after processing some input, any of the machine's current states (*or any states we can reach via a free move*) are in its list of "accept" states, the machine has accepted the input. > accepts :: NFA -> [Char] -> Bool > accepts nfa = accepted . foldl' process nfa > > where > accepted :: NFA -> Bool > accepted nfa = any (`elem` acceptStates nfa) (currentStates nfa ++ freeStates nfa) Processing a single character means finding any followable rules for the given character and the current machine state, and following them. > process :: NFA -> Char -> NFA > process nfa c = case findRules c nfa of > -- Invalid input should cause the NFA to go into a failed state. > -- We can do that easily, just remove any acceptStates. > [] -> nfa { acceptStates = [] } > rs -> nfa { currentStates = followRules rs } > > findRules :: Char -> NFA -> [Rule] > findRules c nfa = filter (ruleApplies c nfa) \$ rules nfa A rule applies if 1. The read character is a valid input character for the rule, and 2. That rule applies to an available state > ruleApplies :: Char -> NFA -> Rule -> Bool > ruleApplies c nfa r = > maybe False (c ==) (inputChar r) && > fromState r `elem` availableStates nfa An "available" state is one which we're currently in, or can reach via Free Moves. > availableStates :: NFA -> [SID] > availableStates nfa = currentStates nfa ++ freeStates nfa The process of finding free states (those reachable via Free Moves) gets a bit hairy. We need to start from our current state(s) and follow any Free Moves *recursively*. This ensures that Free Moves which lead to other Free Moves are correctly accounted for. > freeStates :: NFA -> [SID] > freeStates nfa = go [] (currentStates nfa) > > where > go acc [] = acc > go acc ss = > let ss' = filter (`notElem` acc) \$ followRules \$ freeMoves nfa ss > in go (acc ++ ss') ss' *(Many thanks go to Christopher Swenson for spotting an infinite loop here and fixing it by filtering out any states already in the accumulator)* Free Moves from a given set of states are rules for those states which have no input character. > freeMoves :: NFA -> [SID] -> [Rule] > freeMoves nfa ss = filter (\r -> > (fromState r `elem` ss) && (isNothing \$ inputChar r)) \$ rules nfa Of course, the states that result from following rules are simply the concatenation of those rules' next states. > followRules :: [Rule] -> [SID] > followRules = concatMap nextStates Now we can model an NFA and see if it accepts a string or not. You could test this in `ghci` by defining an NFA in state 1 with an accept state 2 and a single rule that moves the machine from 1 to 2 if the character "a" is read. ``` ghci> let nfa = NFA [Rule 1 (Just 'a') [2]] [1] [2] ghci> nfa `accepts` "a" True ghci> nfa `accepts` "b" False ``` Pretty cool. What we need to do now is construct an NFA whose rules for moving from state to state are derived from the nature of the pattern it represents. Only if the NFA we construct moves to an accept state for a given string of input does it mean the string matches that pattern. > matches :: String -> Pattern -> Bool > matches s = (`accepts` s) . toNFA We'll define `toNFA` later, but if you've loaded this file, you can play with it in `ghci` now: ``` ghci> "" `matches` Empty True ghci> "abc" `matches` Empty False ``` And use it in an example `main`: > main :: IO () > main = do > -- This AST represents the pattern /ab|cd*/: > let p = Choose > (Concat (Literal 'a') (Literal 'b')) > (Concat (Literal 'c') (Repeat (Literal 'd'))) > > print \$ "xyz" `matches` p > -- => False > > print \$ "cddd" `matches` p > -- => True Before I show `toNFA`, we need to talk about mutability.

Since `Pattern` is a recursive data type, we're going to have to recursively create and combine NFAs. For example, in a `Concat` pattern, we'll need to turn both sub-patterns into NFAs then combine those in some way. In the Ruby implementation, Mr. Stuart used `Object.new` to ensure unique state identifiers between all the NFAs he has to create. We can't do that in Haskell. There's no global object able to provide some guaranteed-unique value. What we're going to do to get around this is conceptually simple, but appears complicated because it makes use of monads. All we're doing is defining a list of identifiers at the beginning of our program and drawing from that list whenever we need a new identifier. Because we can't maintain that as a variable we constantly update every time we pull an identifier out, we'll use the `State` monad to mimic mutable state through our computations.
I apologize for the naming confusion here. This `State` type is from the Haskell library and has nothing to with the states of our NFAs.
First, we take the parameterized `State s a` type, and fix the `s` variable as a list of (potential) identifiers: > type SIDPool a = State [SID] a This makes it simple to create a `nextId` action which requests the next identifier from this list as well as updates the computation's state, removing it as a future option before presenting that next identifier as its result. > nextId :: SIDPool SID > nextId = do > (x:xs) <- get > put xs > return x This function can be called from within any other function in the `SIDPool` monad. Each time called, it will read the current state (via `get`), assign the first identifier to `x` and the rest of the list to `xs`, set the current state to that remaining list (via `put`) and finally return the drawn identifier to the caller.

Pattern ⇒ NFA

Assuming we have some function `buildNFA` which handles the actual conversion from `Pattern` to `NFA` but is in the `SIDPool` monad, we can evaluate that action, supplying an infinite list as the potential identifiers, and end up with an NFA with unique identifiers. > toNFA :: Pattern -> NFA > toNFA p = evalState (buildNFA p) [1..] As mentioned, our conversion function, lives in the `SIDPool` monad, allowing it to call `nextId` at will. This gives it the following type signature: > buildNFA :: Pattern -> SIDPool NFA Every pattern is going to need at least one state identifier, so we'll pull that out first, then begin a case analysis on the type of pattern we're dealing with: > buildNFA p = do > s1 <- nextId > > case p of The empty pattern results in a predictably simple machine. It has one state which is also an accept state. It has no rules. If it gets any characters, they'll be considered invalid and put the machine into a failed state. Giving it no characters is the only way it can remain in an accept state. > Empty -> return \$ NFA [] [s1] [s1] Also simple is the literal character pattern. It has two states and a rule between them. It moves from the first state to the second only if it reads that character. Since the second state is the only accept state, it will only accept that character. > Literal c -> do > s2 <- nextId > > return \$ NFA [Rule s1 (Just c) [s2]] [s1] [s2] We can model a concatenated pattern by first turning each sub-pattern into their own NFAs, and then connecting the accept state of the first to the start state of the second via a Free Move. This means that as the combined NFA is reading input, it will only accept that input if it moves through the first NFAs states into what used to be its accept state, hop over to the second NFA, then move into its accept state. Conceptually, this is exactly how a concatenated pattern should match. *Note that `freeMoveTo` will be shown after.* > Concat p1 p2 -> do > nfa1 <- buildNFA p1 > nfa2 <- buildNFA p2 > > let freeMoves = map (freeMoveTo nfa2) \$ acceptStates nfa1 > > return \$ NFA > (rules nfa1 ++ freeMoves ++ rules nfa2) > (currentStates nfa1) > (acceptStates nfa2) We can implement choice by creating a new starting state, and connecting it to both sub-patterns' NFAs via Free Moves. Now the machine will jump into both NFAs at once, and the composed machine will accept the input if either of the paths leads to an accept state. > Choose p1 p2 -> do > s2 <- nextId > nfa1 <- buildNFA p1 > nfa2 <- buildNFA p2 > > let freeMoves = > [ freeMoveTo nfa1 s2 > , freeMoveTo nfa2 s2 > ] > > return \$ NFA > (freeMoves ++ rules nfa1 ++ rules nfa2) [s2] > (acceptStates nfa1 ++ acceptStates nfa2) > A repeated pattern is probably hardest to wrap your head around. We need to first convert the sub-pattern to an NFA, then we'll connect up a new start state via a Free Move (to match 0 occurrences), then we'll connect the accept state back to the start state (to match repetitions of the pattern). > Repeat p -> do > s2 <- nextId > nfa <- buildNFA p > > let initMove = freeMoveTo nfa s2 > freeMoves = map (freeMoveTo nfa) \$ acceptStates nfa > > return \$ NFA > (initMove : rules nfa ++ freeMoves) [s2] > (s2: acceptStates nfa) > And finally, our little helper which connects some state up to an NFA via a Free Move. > where > freeMoveTo :: NFA -> SID -> Rule > freeMoveTo nfa s = Rule s Nothing (currentStates nfa)

That's It

I want to give a big thanks to Tom Stuart for writing Understanding Computation. That book has opened my eyes in so many ways. I understand why he chose Ruby as the book's implementation language, but I find Haskell to be better-suited to these sorts of modeling tasks. Hopefully he doesn't mind me exploring that by rewriting some of his examples.