# iconnect/regex

The regex Replacing Tutorial
============================

Language Options and Imports
-----------------------------

This tutorial is a literate Haskell program where we start by specifying
the language pragmas and imports we will need for this module.

\begin{code}
{-# LANGUAGE QuasiQuotes #-}
{-# OPTIONS_GHC -fno-warn-missing-signatures #-}
\end{code}

\begin{code}
module Main(main) where
\end{code}

*********************************************************
*
* WARNING: this is generated from pp-tutorial-master.lhs
*
*********************************************************

\begin{code}
import TestKit
import Text.RE.Replace
import Text.RE.TDFA.String
\end{code}

Simple Text Replacement
-----------------------

regex supports the replacement of matched text with alternative text.
This section covers replacement text specified with templates. More
flexible tools that allow functions to calculate the replacement text
are covered below.

_Capture_ sub-expressions, whose matched text can be inserted into the
replacement template, can be specified as follows:

  * `$( ... )` identifies a capture by its left-to-right position
    relative to the other captures; in the replacement template `$1`
    represents the leftmost capture, `$2` the next leftmost capture,
    and so on;

  * `${foo}( ... )` identifies a capture by name. Such captures can be
    referenced in the template either by their left-to-right position
    in the regular expression or by `${foo}`.
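The placeholder conventions above can be illustrated with a small, self-contained sketch that uses only `base`. It is not how the regex package implements templates: the `Caps` type and the `expand` function are hypothetical names introduced here, and expanding an unknown placeholder to the empty string is a choice made for this sketch, not a claim about the library's behaviour.

```haskell
import Data.Char (isDigit)
import Data.Maybe (fromMaybe)

-- | Hypothetical stand-in for one match: the captured texts by position
-- (index 0 is the text of the whole match) and a table that maps
-- capture names to positions.
data Caps = Caps
  { capTexts :: [String]
  , capNames :: [(String, Int)]
  }

-- | Expand $n and ${name} placeholders in a template.
-- Unknown placeholders expand to "" (an assumption of this sketch).
expand :: Caps -> String -> String
expand caps = go
  where
    -- ${name}: look the name up, then fetch the capture by position
    go ('$' : '{' : rest)
      | (name, '}' : rest') <- break (== '}') rest
      = fromMaybe "" (lookup name (capNames caps) >>= byIndex) ++ go rest'
    -- $n: one or more digits give the capture position
    go ('$' : rest)
      | (ds@(_ : _), rest') <- span isDigit rest
      = fromMaybe "" (byIndex (read ds)) ++ go rest'
    -- anything else is copied through unchanged
    go (c : rest) = c : go rest
    go []         = []

    byIndex i
      | i >= 0 && i < length (capTexts caps) = Just (capTexts caps !! i)
      | otherwise                            = Nothing

main :: IO ()
main = do
  let caps = Caps ["2016-01-09", "2016", "01", "09"]
                  [("y", 1), ("m", 2), ("d", 3)]
  putStrLn (expand caps "${d}/${m}/${y}")  -- named placeholders
  putStrLn (expand caps "$3/$2/$1")        -- positional placeholders
```

Both calls in `main` print `09/01/2016`, since the named and the positional template refer to the same three captures.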
A function to convert ISO format dates into a UK-format date could be
written thus:

\begin{code}
uk_dates :: String -> String
uk_dates src =
  replaceAll "${d}/${m}/${y}" $ src *=~ [re|${y}([0-9]{4})-${m}([0-9]{2})-${d}([0-9]{2})|]
\end{code}

with

\begin{code}
evalme_RPL_01 = checkThis "evalme_RPL_01" ("09/01/2016 2015-12-5 05/10/2015") $
  uk_dates "2016-01-09 2015-12-5 2015-10-05"
\end{code}

The same function written with numbered captures:

\begin{code}
uk_dates' :: String -> String
uk_dates' src =
  replaceAll "$3/$2/$1" $ src *=~ [re|$([0-9]{4})-$([0-9]{2})-$([0-9]{2})|]
\end{code}

with

\begin{code}
evalme_RPL_02 = checkThis "evalme_RPL_02" ("09/01/2016 2015-12-5 05/10/2015") $
  uk_dates' "2016-01-09 2015-12-5 2015-10-05"
\end{code}

yielding the same result. In both cases `2015-12-5` is left untouched
because its single-digit day does not match `[0-9]{2}`.

(Most regex conventions use plain parentheses, `( ... )`, to mark
captures but we would like to reserve those exclusively for grouping
in regex REs.)

Matches/Match/Capture
---------------------

The types returned by the `?=~` and `*=~` operators form the foundations
of the package. Understanding these simple types is the key to
understanding the package.

The type of `*=~` in this module (imported from `Text.RE.TDFA.String`)
is:

    (*=~) :: String -> RE -> Matches String

with `Matches` defined in `Text.RE.ZeInternals.Types.Matches` thus:

%include "Text/RE/ZeInternals/Types/Matches.lhs" "^data Matches "

The critical component of the `Matches` type is the `[Match a]` in
`allMatches`, containing the details of each substring matched by the
RE. The `matchesSource` component also retains a copy of the original
search string, but the critical information is in `allMatches`.
The type of `?=~` in this module (imported from `Text.RE.TDFA.String`)
is:

    (?=~) :: String -> RE -> Match String

with `Match` (referenced in the definition of `Matches` above) defined
in `Text.RE.ZeInternals.Types.Match` thus:

%include "Text/RE/ZeInternals/Types/Match.lhs" "^data Match "

Like `matchesSource` above, `matchSource` retains the original search
string, but `Match` also has a `CaptureNames` field listing all of the
capture names in the RE (needed by the text replacement tools). But the
'real' content of `Match` is to be found in the `MatchArray`,
enumerating all of the substrings captured by this match, starting with
`0` for the substring captured by the whole RE, `1` for the leftmost
explicit capture in the RE, `2` for the next leftmost capture, and so
on.

Each captured substring is represented by the following `Capture` type:

%include "Text/RE/ZeInternals/Types/Capture.lhs" "^data Capture "

Here we list the whole original search string in `captureSource` and
the text of the sub-string captured in `capturedText`. `captureOffset`
contains the number of characters preceding the captured substring, or
is negative if no substring was captured (which is a different
situation from epsilon, the empty string, being captured).
`captureLength` gives the length of the captured string in
`capturedText`.

The test suite in [examples/re-tests.lhs](re-tests.html) contains
extensive worked-out examples of these `Matches`/`Match`/`Capture`
types.

Using Functions to Replace Text
-------------------------------

Sometimes you will need to process each string captured by an RE with a
function.
`replaceAllCaptures` takes a `REContext`, a substitution function and a
`Matches` and applies the function to each captured substring according
to the `REContext`, as we can see in the following example function to
clean up all of the mis-formatted dates in the argument string,

\begin{code}
fixup_dates :: String -> String
fixup_dates src =
    replaceAllCaptures SUB phi $ src *=~ [re|([0-9]+)-([0-9]+)-([0-9]+)|]
  where
    phi _ loc cap = Just $ case locationCapture loc of
        1 -> fmt 4 $ read s
        2 -> fmt 2 $ read s
        3 -> fmt 2 $ read s
        _ -> error "fixup_dates"
      where
        s = capturedText cap

fmt :: Int -> Int -> String
fmt w x = replicate (max 0 $ w - length x_s) '0' ++ x_s
  where
    x_s = show x
\end{code}

which will fix up our running example

\begin{code}
evalme_RPF_01 = checkThis "evalme_RPF_01" ("2016-01-09 2015-12-05 2015-10-05") $
  fixup_dates "2016-01-09 2015-12-5 2015-10-05"
\end{code}

The `replaceAllCaptures` function is of type

%include "Text/RE/ZeInternals/Replace.lhs" "replaceAllCaptures ::"

and the `REContext` and `RELocation` types are defined in
`Text.RE.Replace` as follows,

%include "Text/RE/ZeInternals/Replace.lhs" "^data REContext"

The processing function gets applied to the captures specified by the
`REContext`, which can be directed to process `ALL` of the captures,
including the substring captured by the whole RE and all of the
subsidiary captures, or just the `TOP`, `0` capture that the whole RE
matches, or just the `SUB` (subsidiary) captures, as was the case above.

The substitution function takes the `Match` corresponding to the current
redex being processed, the `RELocation` information specifying redex
_n_ and capture _i_, and the `Capture` being substituted. Our
substitution function didn't need the `Match` context so it ignored it.

The substitution function returns either `Nothing`, to indicate that no
substitution should be made, or `Just` the replacement text.
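The contract just described (a context that selects captures, and a function that answers `Nothing` or a replacement) can be modelled in a few lines of plain Haskell. This is only a sketch under simplifying assumptions: `Ctx`, `selects`, `substitute` and `pad` are hypothetical names, the captures of one match are represented as a bare list of strings with the whole match at index 0, and the sketch does not reassemble the source string the way the library does.

```haskell
import Data.Maybe (fromMaybe)

-- | Hypothetical stand-in for the three contexts described above
-- (not the library's REContext type).
data Ctx = Top | Sub | All
  deriving (Eq, Show)

-- | Does the context select capture number i? Capture 0 is the whole match.
selects :: Ctx -> Int -> Bool
selects Top i = i == 0
selects Sub i = i > 0
selects All _ = True

-- | Offer each selected capture to f. Nothing (or an unselected capture)
-- keeps the original text; Just t replaces it with t.
substitute :: Ctx -> (Int -> String -> Maybe String) -> [String] -> [String]
substitute ctx f caps =
  [ if selects ctx i then fromMaybe txt (f i txt) else txt
  | (i, txt) <- zip [0 ..] caps
  ]

-- | Left-pad a string of digits with zeros to width w.
pad :: Int -> String -> String
pad w s = replicate (w - length s) '0' ++ s

main :: IO ()
main = do
  -- whole match first, then the year, month and day captures
  let caps = ["2015-12-5", "2015", "12", "5"]
      padDay i txt = if i == 3 then Just (pad 2 txt) else Nothing
  print (substitute Sub padDay caps)  -- only the day capture changes
  print (substitute Top padDay caps)  -- capture 0 is offered, padDay declines
```

With the `Sub` context only the day is rewritten (to `"05"`); with `Top` the function is offered capture 0 alone and returns `Nothing`, so every text is kept as it was.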
The above fixup function could be extended to enclose the whole date in
square brackets by specifying an `ALL` context and a `0` case for the
substitution function.

\begin{code}
fixup_and_reformat_dates :: String -> String
fixup_and_reformat_dates src =
    replaceAllCaptures ALL f $ src *=~ [re|([0-9]+)-([0-9]+)-([0-9]+)|]
  where
    f _ loc cap = Just $ case locationCapture loc of
        0 -> "["++txt++"]"
        1 -> fmt 4 $ read txt
        2 -> fmt 2 $ read txt
        3 -> fmt 2 $ read txt
        _ -> error "fixup_date"
      where
        txt = capturedText cap
\end{code}

The `fixup_and_reformat_dates` function applied to our running example,

\begin{code}
evalme_RPF_02 = checkThis "evalme_RPF_02" ("[2016-01-09] [2015-12-05] [2015-10-05]") $
  fixup_and_reformat_dates "2016-01-09 2015-12-5 2015-10-05"
\end{code}

`Text.RE.Replace` provides analogous functions for replacing the text of
a single `Match` returned from `?=~`.

\begin{code}
main :: IO ()
main = runTheTests
  [ evalme_RPF_02
  , evalme_RPF_01
  , evalme_RPL_02
  , evalme_RPL_01
  ]
\end{code}