The regex Tutorial
==================

This tutorial is a self-testing literate Haskell programme introducing
the vanilla API of the [regex package](http://hs.regex.uk). There
are other tutorials explaining the more specialist aspects of regex
and you can load them into your Haskell REPL of choice: see the
[regex Tutorials page](http://tutorial.regex.uk) for details.
Language Pragmas
----------------

The first thing you will have to do is enable `QuasiQuotes` as regex
uses them to check that REs are well-formed at compile time.

\begin{code}
{-# LANGUAGE QuasiQuotes #-}
{-# OPTIONS_GHC -fno-warn-missing-signatures #-}
\end{code}

If you are trying out examples interactively at the ghci prompt then you
will need

```
:seti -XQuasiQuotes
```
Importing the API
-----------------

\begin{code}
module Main(main) where
\end{code}

*********************************************************
*
* WARNING: this is generated from pp-tutorial-master.lhs
*
*********************************************************

Before importing the `regex` API into your Haskell script you will need
to answer two questions:

  1. Which flavour of REs do I need? If you need Posix REs then the `TDFA`
     back end is for you, otherwise it is the PCRE back end, which is
     housed in a separate `regex-with-pcre` package.

  2. Which Haskell type is being used for the text I need to match? This
     can influence the choice of back end as, at the time of writing, the
     `PCRE` `regex` back end
     [does not support the `Text` types](https://github.com/iconnect/regex/issues/58).

The import statement will in general look like this

```
import Text.RE.<back-end>.<text-type>
```

As we have no interest in Posix/PCRE distinctions or performance here,
we have chosen to work with the `TDFA` back end with `String` types.

\begin{code}
import TestKit
import Text.RE.TDFA.String
\end{code}

You could also import `Text.RE.TDFA` or `Text.RE.PCRE` to get an API
in which the operators are overloaded over all text types accepted by
each of these back ends: see the [Tools Tutorial](re-tutorial-tools.html)
for details.
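To make the `Text.RE.<back-end>.<text-type>` pattern concrete, here is a minimal sketch that keeps the `TDFA` back end but switches the text type to strict `Text`. It assumes the `text` package is installed and that `Text.RE.TDFA.Text` exports the same operators as the `String` module, specialised to `Text`. It uses the first-match operator `?=~` and `matchedText`, both introduced in the next section. The helper name `firstDateT` is ours and is not part of regex.

```haskell
{-# LANGUAGE QuasiQuotes #-}

import qualified Data.Text as T
import           Text.RE.TDFA.Text

-- | Hypothetical helper (not part of regex): the first YYYY-MM-DD
-- substring of a strict Text value, or Nothing if there is none.
firstDateT :: T.Text -> Maybe T.Text
firstDateT txt = matchedText $ txt ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
```

Only the import line and the text type differ from the `String` set-up used in the rest of this tutorial; the RE itself is unchanged.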
Single `Match` with `?=~`
-------------------------

The regex API provides two matching operators: one for looking for the first
match in its search string and the other for finding all of the matches. The
first-match operator, `?=~`, yields the result of attempting to find the first
match.

```
(?=~) :: String -> RE -> Match String
```

The boolean `matched` function,

```
matched :: Match a -> Bool
```

can be used to test whether a match was found:

\begin{code}
evalme_SGL_01 = checkThis "evalme_SGL_01" (True) $ matched $ "2016-01-09 2015-12-5 2015-10-05" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}

To get the matched text use `matchedText`,

```
matchedText :: Match a -> Maybe a
```

which returns `Nothing` if no match was found in the search string:

\begin{code}
evalme_SGL_02 = checkThis "evalme_SGL_02" (Just "2016-01-09") $ matchedText $ "2016-01-09 2015-12-5 2015-10-05" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}

\begin{code}
evalme_SGL_03 = checkThis "evalme_SGL_03" (Nothing) $ matchedText $ "2015-12-5" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
Multiple `Matches` with `*=~`
-----------------------------

Use `*=~` to locate all of the non-overlapping substrings that match a RE,

```
(*=~) :: String -> RE -> Matches String
anyMatches :: Matches a -> Bool
```

`anyMatches` can be used to determine if any matches were found:

\begin{code}
evalme_MLT_01 = checkThis "evalme_MLT_01" (True) $ anyMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}

and `countMatches` will tell us how many sub-strings matched:

\begin{code}
evalme_MLT_02 = checkThis "evalme_MLT_02" (2) $ countMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}

`matches` will return all of the matches.

```
matches :: Matches a -> [a]
```

\begin{code}
evalme_MLT_03 = checkThis "evalme_MLT_03" (["2016-01-09","2015-10-05"]) $ matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
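Outside the test harness, the three `Matches` functions above would typically be applied to one shared `Matches` value. The following is a small sketch of that, using only `*=~`, `anyMatches`, `countMatches` and `matches` from this section; the helper names `allDates` and `summarise` are ours and not part of regex.

```haskell
{-# LANGUAGE QuasiQuotes #-}

import Text.RE.TDFA.String

-- | Hypothetical helper: every well-formed YYYY-MM-DD substring, in order.
allDates :: String -> [String]
allDates s = matches $ s *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]

-- | Hypothetical helper: a one-line summary built from a single
-- Matches value that is queried three times.
summarise :: String -> String
summarise s
  | anyMatches ms = show (countMatches ms) ++ " date(s): " ++ unwords (matches ms)
  | otherwise     = "no dates"
  where
    ms = s *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
```

With the search string used in the tests above, `summarise` should yield `"2 date(s): 2016-01-09 2015-10-05"`, since `2015-12-5` lacks the two-digit day.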
The `regex` Macros and Parsers
------------------------------

regex supports macros in regular expressions. There are a bunch of
standard macros that you can just use, and you can define your own.

RE macros are enclosed in `@{` ... `}`. By convention the macros in
the standard environment start with a `%`. `@{%date}` will match an
ISO 8601 date, so this

\begin{code}
evalme_MAC_00 = checkThis "evalme_MAC_00" (2) $ countMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|@{%date}|]
\end{code}

will pick out the two dates.

There are also parsing functions for analysing the matched text. The
`@{%string}` macro will match quoted strings (in which double quotes can be
escaped with backslashes in the usual way) and its companion `parseString`
function will extract the string that was being quoted, interpreting any
escaped double quotes:

\begin{code}
evalme_MAC_01 = checkThisWith convertMaybeTextList "evalme_MAC_01" ([Just "foo",Just "bar", Just "\""]) $ map parseString $ matches $ "\"foo\", \"bar\" and a quote \"\\\"\"" *=~ [re|@{%string}|]
\end{code}

See the [macro tables page](http://macros.regex.uk) for details of the standard macros and their parsers.

See the [testbench tutorial](re-tutorial-testbench.html) for more on how
you can develop, document and test RE macros with the regex test bench.
Search and Replace
------------------

If you need to edit a string then `SearchReplace` `[ed|` ... `|]`
templates can be used with `?=~/` to replace a single instance or
`*=~/` to replace all matching instances.

\begin{code}
evalme_SRP_00 = checkThis "evalme_SRP_00" ("0x0000: 40AA fab0") $ "0000 40AA fab0" ?=~/ [ed|${adr}([0-9A-Fa-f]{4}):?///0x${adr}:|]
\end{code}

\begin{code}
evalme_SRP_01 = checkThis "evalme_SRP_01" ("0x0000: 0x40AA 0xfab0") $ "0000: 40AA fab0" *=~/ [ed|[0-9A-Fa-f]{4}///0x$0|]
\end{code}
Specifying Options
------------------

By default regular expressions are of the multi-line case-sensitive
variety so this

\begin{code}
evalme_SOP_01 = checkThis "evalme_SOP_01" (2) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [re|[0-9a-f]{2}$|]
\end{code}

will find 2 matches: the `$` anchor matches at each of the newlines, but only
the first two lines hold lowercase hex numbers that match the RE. The case
sensitivity and multiline-ness can be controlled by selecting alternative
parsers.

+--------------------------+-------------+-----------+----------------+
| long name                | short forms | multiline | case sensitive |
+==========================+=============+===========+================+
| reMultilineSensitive     | reMS, re    | yes       | yes            |
+--------------------------+-------------+-----------+----------------+
| reMultilineInsensitive   | reMI        | yes       | no             |
+--------------------------+-------------+-----------+----------------+
| reBlockSensitive         | reBS        | no        | yes            |
+--------------------------+-------------+-----------+----------------+
| reBlockInsensitive       | reBI        | no        | no             |
+--------------------------+-------------+-----------+----------------+

So while the default setup

\begin{code}
evalme_SOP_02 = checkThis "evalme_SOP_02" (2) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reMultilineSensitive|[0-9a-f]{2}$|]
\end{code}

finds 2 matches, a case-insensitive RE

\begin{code}
evalme_SOP_03 = checkThis "evalme_SOP_03" (4) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reMultilineInsensitive|[0-9a-f]{2}$|]
\end{code}

finds 4 matches, while a non-multiline RE

\begin{code}
evalme_SOP_04 = checkThis "evalme_SOP_04" (0) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reBlockSensitive|[0-9a-f]{2}$|]
\end{code}

finds no matches, but a non-multiline, case-insensitive RE

\begin{code}
evalme_SOP_05 = checkThis "evalme_SOP_05" (1) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reBlockInsensitive|[0-9a-f]{2}$|]
\end{code}

finds the final match.

For the hard of typing the short forms are available.

\begin{code}
evalme_SOP_06 = checkThis "evalme_SOP_06" (True) $ matched $ "SuperCaliFragilisticExpialidocious" ?=~ [reMI|supercalifragilisticexpialidocious|]
\end{code}
Compiling and Escaping
----------------------

It is possible to compile a dynamically acquired RE string at run-time using
`compileRegex`:

```
compileRegex :: (Functor m, Monad m) => String -> m RE
```

\begin{code}
evalme_CPL_01 = checkThis "evalme_CPL_01" (["2016-01-09","2015-10-05"]) $ matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ (maybe (error "evalme_CPL_01") id $ compileRegex "[0-9]{4}-[0-9]{2}-[0-9]{2}")
\end{code}

This will compile the RE using the default multiline, case-sensitive options,
but you can specify the options dynamically using `compileRegexWith`:

```
compileRegexWith :: (Functor m, Monad m) => SimpleREOptions -> String -> m RE
```

where `SimpleREOptions` is a simple enumerated type.

%include "Text/RE/REOptions.lhs" "^data SimpleREOptions"

\begin{code}
evalme_CPL_02 = checkThis "evalme_CPL_02" (["2016-01-09","2015-10-05"]) $ matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ (maybe (error "evalme_CPL_02") id $ compileRegexWith MultilineSensitive "[0-9]{4}-[0-9]{2}-[0-9]{2}")
\end{code}
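As a sketch of choosing the options at run time, the helper below takes a `SimpleREOptions` value as an ordinary argument and runs `compileRegexWith` in the `Maybe` monad, so a malformed RE string shows up as `Nothing` rather than as an `error` call. The name `countHexWith` is ours; the sketch assumes, as the example above suggests, that `Text.RE.TDFA.String` exports the `SimpleREOptions` constructors.

```haskell
import Text.RE.TDFA.String

-- | Hypothetical helper: count two-digit lowercase-hex numbers that sit
-- at the end of a line (or of the whole text), with the RE options
-- chosen by the caller. Nothing means the RE string failed to compile.
countHexWith :: SimpleREOptions -> String -> Maybe Int
countHexWith opts s = do
  rex <- compileRegexWith opts "[0-9a-f]{2}$"
  return $ countMatches $ s *=~ rex
```

Applied to `"0a\nbb\nFe\nA5"`, this should reproduce the counts listed in the Specifying Options section: `Just 2` for `MultilineSensitive`, `Just 4` for `MultilineInsensitive`, `Just 0` for `BlockSensitive` and `Just 1` for `BlockInsensitive`.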
If you need to compile `SearchReplace` templates for use with `?=~/` and
`*=~/` then `compileSearchReplace` and `compileSearchReplaceWith`,

```
compileSearchReplace :: (Monad m, Functor m, IsRegex RE s) => String -> String -> m (SearchReplace RE s)
compileSearchReplaceWith :: (Monad m, Functor m, IsRegex RE s) => SimpleREOptions -> String -> String -> m (SearchReplace RE s)
```

work analogously to `compileRegex` and `compileRegexWith`, with the RE
and replacement template (either side of the `///` in the `[ed|...///...|]`
quasi quoters) being passed into these functions in two separate strings,
to compile to the `SearchReplace` type expected by the `?=~/` and `*=~/`
operators.

%include "Text/RE/ZeInternals/Types/SearchReplace.lhs" "^data SearchReplace"
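A minimal sketch of `compileSearchReplace` in use follows. It passes the two halves of the `[ed|[0-9A-Fa-f]{4}///0x$0|]` template from the Search and Replace section as separate strings and runs the compilation in `Maybe`. The helper name `prefixHex` is ours, not part of regex.

```haskell
import Text.RE.TDFA.String

-- | Hypothetical helper: prefix every four-digit hex number with "0x",
-- compiling the RE and the replacement template from plain strings.
-- Nothing means the RE or the template was rejected.
prefixHex :: String -> Maybe String
prefixHex s = do
  sr <- compileSearchReplace "[0-9A-Fa-f]{4}" "0x$0"
  return $ s *=~/ sr
```

Because `*=~/` from `Text.RE.TDFA.String` works on `String`, the text type `s` in the `compileSearchReplace` signature is inferred and needs no annotation. `prefixHex "0000: 40AA fab0"` should give `Just "0x0000: 0x40AA 0xfab0"`, mirroring `evalme_SRP_01`.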
The `escape` and `escapeWith` functions are special compilers that compile
a string into a RE that should match itself, which is assumed to be embedded
in a complex RE to be compiled.

```
escape :: (Functor m, Monad m) => (String->String) -> String -> m RE
```

The function passed in the first argument to `escape` takes the RE string
that will match the string passed in the second argument and yields the
RE to be compiled, which is returned from the parsing action.

\begin{code}
evalme_CPL_03 = checkThis "evalme_CPL_03" ("foobar") $ "fooe{0}bar" *=~/ SearchReplace (maybe (error "evalme_CPL_03") id $ escape id "e{0}") ""
\end{code}
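The example above passes `id`, so the escaped string is compiled unchanged. As a sketch of the embedding case, the helper below passes a function that prepends a `^` anchor to the escaped RE string. The name `startsLineWith` is ours, and the sketch assumes that the default multiline options apply, so that `^` matches at the start of any line.

```haskell
import Text.RE.TDFA.String

-- | Hypothetical helper: does the text contain a line that starts with
-- the given literal string? The literal is escaped so that any RE
-- metacharacters in it match themselves, then anchored with ^.
-- Nothing means the resulting RE failed to compile.
startsLineWith :: String -> String -> Maybe Bool
startsLineWith lit s = do
  rex <- escape ("^" ++) lit
  return $ matched $ s ?=~ rex
```

With the literal `"a.b"`, the text `"a.b c"` should be accepted and `"axb c"` rejected, because the escaped `.` no longer acts as a wildcard.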
The Classic regex-base Match Operators
--------------------------------------

The original `=~` and `=~~` match operators are still available for
those that have mastered them.

\begin{code}
evalme_CLC_01 = checkThis "evalme_CLC_01" (True ) $ ("bar" =~ [re|(foo|bar)|] :: Bool)
\end{code}

\begin{code}
evalme_CLC_02 = checkThis "evalme_CLC_02" (False) $ ("quux" =~ [re|(foo|bar)|] :: Bool)
\end{code}

\begin{code}
evalme_CLC_03 = checkThis "evalme_CLC_03" (2) $ ("foobar" =~ [re|(foo|bar)|] :: Int)
\end{code}

\begin{code}
evalme_CLC_04 = checkThis "evalme_CLC_04" (Nothing) $ ("foo" =~~ [re|bar|] :: Maybe String)
\end{code}
\begin{code}
main :: IO ()
main = runTheTests
  [ evalme_CLC_04
  , evalme_CLC_03
  , evalme_CLC_02
  , evalme_CLC_01
  , evalme_CPL_03
  , evalme_CPL_02
  , evalme_CPL_01
  , evalme_SOP_06
  , evalme_SOP_05
  , evalme_SOP_04
  , evalme_SOP_03
  , evalme_SOP_02
  , evalme_SOP_01
  , evalme_SRP_01
  , evalme_SRP_00
  , evalme_MAC_01
  , evalme_MAC_00
  , evalme_MLT_03
  , evalme_MLT_02
  , evalme_MLT_01
  , evalme_SGL_03
  , evalme_SGL_02
  , evalme_SGL_01
  ]
\end{code}