Skip to content
Mauricio C Antunes edited this page Jan 13, 2014 · 3 revisions

Tutorial: a simple regular expression parser for Haskell strings

We will use bindings-posix package to build a simple and easy to use (and inefficient) Haskell interface. Your bread and butter while using C code in Haskell will be the Foreign modules. In this example, you should pay special attention to Foreign.C, Foreign.C.String, Foreign.Marshall.Alloc, and Foreign.Ptr. You may also take a look at Data.Bits for operators on bits (like .|.) that are useful when dealing with flags.


First, we create a Cabal package in a new directory:

mkdir regex-example
cd regex-example
cabal init

The resulting cabal file, after we have edited the necessary fields, can be seen at the repository. Note that there's no need to request the Foreign Funtion Interface extension or add bindings-DSL as a dependency. You do have to add bindings-posix as a dependency, which you may need to install first with cabal install bindings-posix.


Since we will name our module RegexExample, we will have a corresponding src/RegexExample.hs file, at which we'll now look in detail. You may want to keep Posix documentation open for comparison while reading it.


newtype RegexException = RegexException String deriving (Show, Typeable)
instance Exception RegexException

throwRegexException :: Ptr C'regex_t -> CInt -> IO ()
throwRegexException preg errcode = do
    errbuf_size <- c'regerror errcode preg nullPtr 0
    allocaBytes (fromEnum errbuf_size) $ \errbuf -> do
        c'regerror errcode preg errbuf errbuf_size
        peekCString errbuf >>= (throwIO . RegexException)

Posix function regerror gives human readable messages for regular expressions errors. We can use that to generate exceptions. So, throwRegexException takes as parameters a pointer to a regex_t structure and an error code. It then calls regerror with a null pointer and a buffer size of zero. This gives us the necessary size for a buffer that can hold the message for that code, which we store in errbuf_size. Calling regerror again with an allocated buffer of that size, we get a message in that buffer that can be made into a Haskell String with peekCString, and then thrown as an exception.


withRegex :: String -> (Ptr C'regex_t -> IO a) -> IO a
withRegex regex exec = 
    alloca $ \preg ->
    withCString regex $ \pattern ->
    do
        errcode <- c'regcomp preg pattern cflags
        when (errcode /= 0) $ throwRegexException preg errcode
        finally (exec preg) (c'regfree preg)
  where
    cflags = c'REG_EXTENDED .|. c'REG_NEWLINE

Posix requires that calls to the regular expression parsing function regexec be wrapped between calls to regcomp and regfree to fill and then free a regex_t structure. After filling a regex_t structure using regcomp and checking for a possible error, the withRegex function makes use of finally, from Control.Exception, to encapsulate that requirement, ensuring that regfree will be called even if an exception is thrown. There are other solutions to the same problem, including specific monad transformers.

Note the use of flags REG_EXTENDED and REG_NEWLINE in the call to regcomp, which says that we want extended regular expressions and the possibility of matching the start and end of lines with ^ and $ special characters.


parseLine :: Ptr C'regex_t -> String -> IO [String]
parseLine preg line =
    withCString line $ \cline ->
    alloca $ \pmatch ->

Function parseLine will parse a single line of text. This is convenient because we want to know if the ^ character in a regular expression will match the start of a line, and breaking text input into lines is the easier, although not the most efficient way to check that. (See the use of c'REG_NOTBOL below.)

The start of parseLine allocates the necessary C data: cline, a C string created from the Haskell string; and pmatch, a pointer to a regmatch_t structure that regexec will fill with indexes for regular expression matches.


  let
    findMatches :: Int -> MaybeT (WriterT [String] IO) ()

The main engine, findMatches, will make use of a stack of two monad transformers from transformers package. WriterT gives us a tell function which allows us to accumulate data (here in the form of [String]) as we go through some computation, and get the result at the end. MaybeT gives us the possibility of stopping a computation at some condition, using guard from its MonadPlus instance.


    findMatches pos = do
        status <- liftIO $ c'regexec preg (advancePtr cline pos) 1 pmatch eflags
        guard (status == 0)
        rm <- liftIO $ peek pmatch
        let so = pos + (fromEnum $ c'regmatch_t'rm_so rm)
        let eo = pos + (fromEnum $ c'regmatch_t'rm_eo rm)
        when (so < eo) $ do
            match <- liftIO $ peekCStringLen (advancePtr cline so, eo - so)
            lift $ tell [match]
        guard . (/= 0) =<< (liftIO $ peek (advancePtr cline eo))
        findMatches $ if so < eo then eo else eo + 1
      where
        eflags = if pos == 0 then 0 else c'REG_NOTBOL

The findMatches function uses tail recursion to go through the string searching for matches. It takes as parameter the position where regexec will be called. After that call, guard is used to stop iterating if the return value does not indicate a new valid match.

It then peek a regmatch_t structure from its pointers, and uses c'regmatch_t'rm_so field accessors to extract the start and end match positions, which are added to the start position given to regexec. Note the convention used by bindings-DSL for field accessors: a c' prefix, followed by the structure name, a ', and the field name.

If those positions are not the same, indicating the match is not empty, peekCStringLen is used to get the match into a Haskell string, which is then stored by tell.

A second use of guard checks if we found the end of the string (using the C convention of using a \0 char at that point). If we did not, it starts again.


  in execWriterT $ runMaybeT $ findMatches 0

This is how we get the accumulated [String].


parse :: String -> String -> [String]
parse regex text = unsafePerformIO $
    withRegex regex $ \preg -> liftM concat $ mapM (parseLine preg) (lines text)

Our last trick is the use of unsafePerformIO, allowing parseLine to be used outside IO monads after beeing mapped by parse to each line of input. As the name implies, unsafePerformIO is potentially unsafe. Read about it before using it. It's usually better to test your code without it first.


Our example is ready. Let's test it. (You'll need nmap installed for the last command.)

cabal install
ghci
:m + RegexExample
parse "a" "aaa"
	==> ["a","a","a"]
parse "a(" "aaa"
	==> *** Exception: RegexException "Unmatched ( or \\("
parse "(^...)|(...$)" "qwertyuiop\n1234567890"
	==> ["qwe","iop","123","890"]
:m + System.Process Control.Monad Data.List
readProcess "ncat" ["www.w3.org", "80"] "GET / HTTP/1.1\nHost: www.w3.org\n\n" >>= return . take 5 . nub . (parse "</[a-z]*>")
	==> ["</title>","</style>","</head>","</a>","</span>"]

If you think something could be improved or corrected in this tutorial, feel free to open an issue ticket about it.