This repository has been archived by the owner on Sep 10, 2021. It is now read-only.
---
title: Regular Expression Evaluation via Finite Automata
tags: haskell
---
<div class="well">
What follows is a literate Haskell file runnable via `ghci`. The raw
source for this page can be found [here][here].
</div>
[here]: https://github.com/pbrisbin/pbrisbin.com/blob/master/posts/2014-04-07-regular_expression_evaluation_via_finite_automata.lhs
While reading [Understanding Computation][uc] again last night, I was
going back through the chapter where Tom Stuart describes deterministic
and non-deterministic finite automata. These simple state machines seem
like little more than a teaching tool, but he eventually uses them as
the implementation for a regular expression matcher. I thought seeing
this concrete use for such an abstract idea was interesting and wanted
to reinforce the ideas by implementing such a system myself -- with
Haskell, of course.
[uc]: http://computationbook.com/
Before we get started, we'll just need to import some libraries:
> import Control.Monad.State
> import Data.List (foldl')
> import Data.Maybe
<h2>Patterns and NFAs</h2>
We're going to model a subset of regular expression patterns.

> data Pattern
>     = Empty                   -- ""
>     | Literal Char            -- "a"
>     | Concat Pattern Pattern  -- "ab"
>     | Choose Pattern Pattern  -- "a|b"
>     | Repeat Pattern          -- "a*"
>     deriving Show

With this, we can build "pattern ASTs" to represent regular expressions:
```
ghci> let p = Choose (Literal 'a') (Repeat (Literal 'b')) -- /a|b*/
```
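
To make the mapping between an AST and the regular expression it stands for a little more concrete, here is a rough sketch of a pretty-printer going the other way, from `Pattern` back to a string. The names `render` and `precedence` are my own inventions for this sketch and are not used anywhere else in the post. It assumes literals are plain characters and makes no attempt to escape special ones like `|` or `*`.

```haskell
-- Same constructors as the Pattern type in this post.
data Pattern
    = Empty
    | Literal Char
    | Concat Pattern Pattern
    | Choose Pattern Pattern
    | Repeat Pattern
    deriving Show

-- Hypothetical helper: how tightly each constructor binds, so that
-- render knows when a sub-pattern needs parentheses.
precedence :: Pattern -> Int
precedence (Choose _ _) = 0
precedence (Concat _ _) = 1
precedence Empty        = 1
precedence (Repeat _)   = 2
precedence (Literal _)  = 3

-- Hypothetical helper: turn a pattern AST back into regex-like text.
render :: Pattern -> String
render = go 0
  where
    go :: Int -> Pattern -> String
    go minPrec p = bracket (precedence p < minPrec) $ case p of
        Empty        -> ""
        Literal c    -> [c]
        Concat p1 p2 -> go 1 p1 ++ go 1 p2
        Choose p1 p2 -> go 0 p1 ++ "|" ++ go 0 p2
        Repeat p1    -> go 3 p1 ++ "*"

    -- Wrap in parentheses only when the sub-pattern binds too loosely.
    bracket :: Bool -> String -> String
    bracket True  s = "(" ++ s ++ ")"
    bracket False s = s
```

With this loaded, `render (Choose (Literal 'a') (Repeat (Literal 'b')))` gives back `"a|b*"`, and a repeated concatenation such as `Repeat (Concat (Literal 'a') (Literal 'b'))` comes out as `"(ab)*"`.
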
It's easy to picture a small parser to build these out of strings, but
we won't do that as part of this post. Instead, we'll focus on
converting these patterns into [Nondeterministic Finite Automata][nfa]
or NFAs. We can then use the NFAs to determine if the pattern matches a
given string.
[nfa]: http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
To explain NFAs, it's probably easiest to explain DFAs, their
deterministic counterparts, first. Then we can go on to describe how
NFAs differ.
A DFA is a simple machine with states and rules. The rules describe how
to move between states in response to particular input characters.
Certain states are special and flagged as "accept" states. If, after
reading a series of characters, the machine is left in an accept state,
it's said that the machine "accepted" that particular input.
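
As a small illustration of that description, here is a minimal sketch of a DFA in Haskell. None of these names (`DFA`, `runDFA`, `acceptsDFA`, `example`) appear in the rest of the post. They are assumptions made for this sketch only, and the rules are modeled as a single function for brevity.

```haskell
import Data.List (foldl')

type SID = Int -- State Identifier

-- A hypothetical DFA: at most one next state per (state, character).
data DFA = DFA
    { transition :: SID -> Char -> Maybe SID  -- Nothing: no rule applies
    , startState :: SID
    , dfaAccepts :: [SID]
    }

-- Feed the input through the machine; Nothing means it got stuck.
runDFA :: DFA -> String -> Maybe SID
runDFA dfa = foldl' step (Just (startState dfa))
  where
    step current c = current >>= \s -> transition dfa s c

-- The input is accepted if the machine ends in an accept state.
acceptsDFA :: DFA -> String -> Bool
acceptsDFA dfa input = maybe False (`elem` dfaAccepts dfa) (runDFA dfa input)

-- Example machine for /ab*/: state 1 is the start, state 2 accepts.
example :: DFA
example = DFA rule 1 [2]
  where
    rule 1 'a' = Just 2
    rule 2 'b' = Just 2
    rule _ _   = Nothing
```

Here `acceptsDFA example "abb"` is `True`, while `acceptsDFA example "ba"` is `False` because no rule applies to `'b'` in state 1.
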
An NFA is the same with two notable differences: First, an NFA can have
rules to move it into more than one state in response to the same input
character. This means the machine can be in more than one state at once.
Second, there is the concept of a Free Move, which means the machine can
jump between certain states without reading any input.
Modeling an NFA requires a type with rules, current states, and accept
states:

> type SID = Int -- State Identifier
>
> data NFA = NFA
>     { rules         :: [Rule]
>     , currentStates :: [SID]
>     , acceptStates  :: [SID]
>     } deriving Show

A rule defines what characters tell the machine to change states and
which state to move into.

> data Rule = Rule
>     { fromState  :: SID
>     , inputChar  :: Maybe Char
>     , nextStates :: [SID]
>     } deriving Show

Notice that `nextStates` and `currentStates` are lists. This is to
represent the machine moving to, and remaining in, more than one state
in response to a particular character. Similarly, `inputChar` is a
`Maybe` value because it will be `Nothing` in the case of a rule
representing a Free Move.
If, after processing some input, any of the machine's current states (*or any
states we can reach via a free move*) are in its list of "accept" states, the
machine has accepted the input.

> accepts :: NFA -> [Char] -> Bool
> accepts nfa = accepted . foldl' process nfa
>
>   where
>     accepted :: NFA -> Bool
>     accepted nfa = any (`elem` acceptStates nfa) (currentStates nfa ++ freeStates nfa)

Processing a single character means finding any followable rules for the
given character and the current machine state, and following them.

> process :: NFA -> Char -> NFA
> process nfa c = case findRules c nfa of
>     -- Invalid input should cause the NFA to go into a failed state.
>     -- We can do that easily, just remove any acceptStates.
>     [] -> nfa { acceptStates = [] }
>     rs -> nfa { currentStates = followRules rs }
>
> findRules :: Char -> NFA -> [Rule]
> findRules c nfa = filter (ruleApplies c nfa) $ rules nfa

A rule applies if
1. The read character is a valid input character for the rule, and
2. That rule applies to an available state

> ruleApplies :: Char -> NFA -> Rule -> Bool
> ruleApplies c nfa r =
>     maybe False (c ==) (inputChar r) &&
>     fromState r `elem` availableStates nfa

An "available" state is one which we're currently in, or can reach via
Free Moves.
> availableStates :: NFA -> [SID]
> availableStates nfa = currentStates nfa ++ freeStates nfa
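The process of finding free states (those reachable via Free Moves) gets
a bit hairy. We need to start from our current state(s) and follow any
Free Moves *recursively*. This ensures that Free Moves which lead to
other Free Moves are correctly accounted for. We also skip any state
we've already collected; otherwise a cycle of Free Moves (which a pattern
like `Repeat Empty` produces) would make the search loop forever.

> freeStates :: NFA -> [SID]
> freeStates nfa = go [] (currentStates nfa)
>
>   where
>     go acc [] = acc
>     go acc ss =
>         -- Drop states we've already seen so a cycle of Free Moves
>         -- can't keep us going around in circles.
>         let ss' = filter (`notElem` acc) $ followRules $ freeMoves nfa ss
>         in go (acc ++ ss') ss'
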
Free Moves from a given set of states are rules for those states which
have no input character.

> freeMoves :: NFA -> [SID] -> [Rule]
> freeMoves nfa ss = filter (\r ->
>     (fromState r `elem` ss) && (isNothing $ inputChar r)) $ rules nfa

Of course, the states that result from following rules are simply the
concatenation of those rules' next states.
> followRules :: [Rule] -> [SID]
> followRules = concatMap nextStates
Now we can model an NFA and see if it accepts a string or not. You could
test this in `ghci` by defining an NFA in state 1 with an accept state 2
and a single rule that moves the machine from 1 to 2 if the character
"a" is read.
```
ghci> let nfa = NFA [Rule 1 (Just 'a') [2]] [1] [2]
ghci> nfa `accepts` "a"
True
ghci> nfa `accepts` "b"
False
```
Pretty cool.
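
The example above doesn't exercise Free Moves, so here is a condensed, standalone sketch that does. To keep it short, it is a different formulation from the code in this post: rules are plain tuples, and `closure`, `step`, `acceptsFrom` and the example machine are names I've made up for this sketch. Treat it as an illustration of the idea rather than a drop-in replacement.

```haskell
import Data.List (foldl', nub)

type SID = Int

-- (from state, input character or Nothing for a Free Move, next states)
type Rule = (SID, Maybe Char, [SID])

-- All states reachable from the given ones using only Free Moves,
-- including the given states themselves. Tracking what we've seen
-- keeps this from looping when Free Moves form a cycle.
closure :: [Rule] -> [SID] -> [SID]
closure rs = go []
  where
    go seen [] = seen
    go seen (s:rest)
        | s `elem` seen = go seen rest
        | otherwise     =
            go (seen ++ [s]) (rest ++ [n | (f, Nothing, ns) <- rs, f == s, n <- ns])

-- Read one character from every state we might currently be in.
step :: [Rule] -> [SID] -> Char -> [SID]
step rs current c =
    nub [n | (f, Just c', ns) <- rs, c' == c, f `elem` available, n <- ns]
  where
    available = closure rs current

-- Rules, start states and accept states are passed in separately here.
acceptsFrom :: [Rule] -> [SID] -> [SID] -> String -> Bool
acceptsFrom rs start accept input =
    any (`elem` accept) . closure rs $ foldl' (step rs) start input

-- Example machine: 1 --a--> 2, a Free Move 2 --> 3, and 3 --b--> 3.
exampleRules :: [Rule]
exampleRules = [(1, Just 'a', [2]), (2, Nothing, [3]), (3, Just 'b', [3])]

exampleAccepts :: String -> Bool
exampleAccepts = acceptsFrom exampleRules [1] [3]
```

State 3 is the only accept state, so `exampleAccepts "a"` is `True` purely because of the Free Move from 2 to 3. `exampleAccepts "abb"` is also `True`, while `exampleAccepts ""` and `exampleAccepts "b"` are `False`.
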
What we need to do now is construct an NFA whose rules for moving from
state to state are derived from the nature of the pattern it represents.
A string matches the pattern only if the NFA we construct ends up in an
accept state after reading that string.
> matches :: String -> Pattern -> Bool
> matches s = (`accepts` s) . toNFA
We'll define `toNFA` later, but if you've loaded this file, you can play
with it in `ghci` now:
```
ghci> "" `matches` Empty
True
ghci> "abc" `matches` Empty
False
```
And use it in an example `main`:

> main :: IO ()
> main = do
>     -- This AST represents the pattern /ab|cd*/:
>     let p = Choose
>             (Concat (Literal 'a') (Literal 'b'))
>             (Concat (Literal 'c') (Repeat (Literal 'd')))
>
>     print $ "xyz" `matches` p
>     -- => False
>
>     print $ "cddd" `matches` p
>     -- => True

Before I show `toNFA`, we need to talk about mutability.
<h3>A Bit About Mutable State</h3>
Since `Pattern` is a recursive data type, we're going to have to
recursively create and combine NFAs. For example, in a `Concat` pattern,
we'll need to turn both sub-patterns into NFAs then combine those in
some way. In the Ruby implementation, Mr. Stuart used `Object.new` to
ensure unique state identifiers between all the NFAs he has to create.
We can't do that in pure Haskell code: there's no global object a pure
function can ask for some guaranteed-unique value.
What we're going to do to get around this is conceptually simple, but
appears complicated because it makes use of monads. All we're doing is
defining a list of identifiers at the beginning of our program and
drawing from that list whenever we need a new identifier. Because we
can't maintain that as a variable we constantly update every time we
pull an identifier out, we'll use the `State` monad to mimic mutable
state through our computations.
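
To see what the `State` monad will be saving us from, here is a rough sketch of the same idea with the pool of identifiers threaded through by hand. The names `draw` and `drawThree` are hypothetical and exist only for this illustration.

```haskell
type SID = Int

-- Hypothetical helper: take the next identifier and hand back the rest
-- of the pool so the caller can pass it along.
draw :: [SID] -> (SID, [SID])
draw (x:xs) = (x, xs)
draw []     = error "identifier pool exhausted"

-- Drawing three identifiers means naming every intermediate pool.
drawThree :: [SID] -> ((SID, SID, SID), [SID])
drawThree pool0 =
    let (a, pool1) = draw pool0
        (b, pool2) = draw pool1
        (c, pool3) = draw pool2
    in ((a, b, c), pool3)
```

Calling `fst (drawThree [1..])` gives `(1,2,3)`. It works, but every function that needs an identifier has to accept a pool and return the leftover one, and it's easy to pass the wrong `poolN` along by mistake. That bookkeeping is what `State` hides.
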
<div class="well">
I apologize for the naming confusion here. This `State` type is from the
Haskell library and has nothing to do with the states of our NFAs.
</div>
First, we take the parameterized `State s a` type, and fix the `s`
variable as a list of (potential) identifiers:
> type SIDPool a = State [SID] a
This makes it simple to create a `nextId` action which requests the next
identifier from this list as well as updates the computation's state,
removing it as a future option before presenting that next identifier as
its result.
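
> nextId :: SIDPool SID
> nextId = do
>     -- The lazy pattern (~) keeps this compiling on newer GHCs, where a
>     -- failable pattern in do-notation requires a MonadFail instance.
>     ~(x:xs) <- get
>     put xs
>     return x
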
This function can be called from within any other function in the
`SIDPool` monad. Each time it's called, it will read the current state (via
`get`), assign the first identifier to `x` and the rest of the list to
`xs`, set the current state to that remaining list (via `put`) and
finally return the drawn identifier to the caller.
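
As a quick, standalone illustration of calling it from another action, here is a small sketch. `nextId` is repeated so the block loads on its own (written with a lazy pattern so it doesn't need a `MonadFail` instance), and `freshPair` is a hypothetical action invented for this example.

```haskell
import Control.Monad.State

type SID = Int
type SIDPool a = State [SID] a

nextId :: SIDPool SID
nextId = do
    ~(x:xs) <- get   -- lazy pattern; the pool is assumed to be infinite
    put xs
    return x

-- Hypothetical example action: draw two fresh identifiers in a row.
freshPair :: SIDPool (SID, SID)
freshPair = do
    a <- nextId
    b <- nextId
    return (a, b)
```

Running `evalState freshPair [1..]` yields `(1,2)`. Nothing in `freshPair` mentions the pool at all, yet the second call to `nextId` sees the list left behind by the first.
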
<h2>Pattern ⇒ NFA</h2>
Assuming we have some function `buildNFA` which handles the actual
conversion from `Pattern` to `NFA` but is in the `SIDPool` monad, we can
evaluate that action, supplying an infinite list as the potential
identifiers, and end up with an NFA with unique identifiers.
> toNFA :: Pattern -> NFA
> toNFA p = evalState (buildNFA p) [1..]
As mentioned, our conversion function lives in the `SIDPool` monad,
allowing it to call `nextId` at will. This gives it the following type
signature:
> buildNFA :: Pattern -> SIDPool NFA
Every pattern is going to need at least one state identifier, so we'll
pull that out first, then begin a case analysis on the type of pattern
we're dealing with:

> buildNFA p = do
>     s1 <- nextId
>
>     case p of

The empty pattern results in a predictably simple machine. It has one
state which is also an accept state. It has no rules. If it gets any
characters, they'll be considered invalid and put the machine into a
failed state. Giving it no characters is the only way it can remain in
an accept state.

>         Empty -> return $ NFA [] [s1] [s1]

Also simple is the literal character pattern. It has two states and a
rule between them. It moves from the first state to the second only if
it reads that character. Since the second state is the only accept
state, it will only accept that character.

>         Literal c -> do
>             s2 <- nextId
>
>             return $ NFA [Rule s1 (Just c) [s2]] [s1] [s2]

We can model a concatenated pattern by first turning each sub-pattern
into its own NFA, and then connecting the accept states of the first
to the start state of the second via Free Moves. This means that as the
combined NFA is reading input, it will only accept that input if it
moves through the first NFA's states into what used to be its accept
state, hops over to the second NFA, then moves into its accept state.
Conceptually, this is exactly how a concatenated pattern should match.
*Note that `freeMoveTo` will be shown after.*

>         Concat p1 p2 -> do
>             nfa1 <- buildNFA p1
>             nfa2 <- buildNFA p2
>
>             let freeMoves = map (freeMoveTo nfa2) $ acceptStates nfa1
>
>             return $ NFA
>                 (rules nfa1 ++ freeMoves ++ rules nfa2)
>                 (currentStates nfa1)
>                 (acceptStates nfa2)

We can implement choice by creating a new starting state, and connecting
it to both sub-patterns' NFAs via Free Moves. Now the machine will jump
into both NFAs at once, and the composed machine will accept the input
if either of the paths leads to an accept state.

>         Choose p1 p2 -> do
>             s2 <- nextId
>             nfa1 <- buildNFA p1
>             nfa2 <- buildNFA p2
>
>             let freeMoves =
>                     [ freeMoveTo nfa1 s2
>                     , freeMoveTo nfa2 s2
>                     ]
>
>             return $ NFA
>                 (freeMoves ++ rules nfa1 ++ rules nfa2) [s2]
>                 (acceptStates nfa1 ++ acceptStates nfa2)
>

A repeated pattern is probably hardest to wrap your head around. We need
to first convert the sub-pattern to an NFA, then we'll connect up a new
start state via a Free Move (to match 0 occurrences), then we'll connect
the accept state back to the start state (to match repetitions of the
pattern).

>         Repeat p -> do
>             s2 <- nextId
>             nfa <- buildNFA p
>
>             let initMove = freeMoveTo nfa s2
>                 freeMoves = map (freeMoveTo nfa) $ acceptStates nfa
>
>             return $ NFA
>                 (initMove : rules nfa ++ freeMoves) [s2]
>                 (s2 : acceptStates nfa)
>

And finally, our little helper which connects some state up to an NFA via
a Free Move.

>   where
>     freeMoveTo :: NFA -> SID -> Rule
>     freeMoveTo nfa s = Rule s Nothing (currentStates nfa)

<h2>That's It</h2>
I want to give a big thanks to Tom Stuart for writing Understanding
Computation. That book has opened my eyes in so many ways. I understand
why he chose Ruby as the book's implementation language, but I find
Haskell to be better-suited to these sorts of modeling tasks. Hopefully
he doesn't mind me exploring that by rewriting some of his examples.