In [14]:
:opt no-lint

In [2]:
import Control.Monad.Trans.State
import Text.Trifecta
import Text.Parser.Combinators
import Control.Applicative ((<|>))
import Text.RawString.QQ

# 24 Parser combinators
## 24.1 Parser combinators
Chapter content:
• Use a parsing library to cover the basics of parsing.
• Demonstrate the power of parser combinators.
• Marshal and unmarshal some JSON data.
• Talk about tokenization.
## 24.2 A few more words of introduction
Due to the size of the field, the chapter explains parsing with an outside-in approach.
## 24.3 Understanding the parsing process
A parser combinator is a higher-order function that takes parsers as input and returns a new parser as output. In terms of lambda calculus, combinators are expressions with no free variables.
Usually, the argument passing is elided, because the interface of parsers will often be like the `State` monad, which permits `Reader`-style implicit argument passing.
Combinators allow for recursion and for gluing together parsers in a modular fashion to parse data according to complex rules.
#### A bit like...
`Parser` is similar to `State`, plus `Maybe`. Parsing is like scrolling a cursor over a text, while mutating the internal state.
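The analogy can be made concrete with a minimal sketch. The `ToyParser` type and the names `runToy`, `charP` and `oneTwoP` are hypothetical, introduced only for illustration; this is not how trifecta represents its `Parser`. The remaining input plays the role of the state, and `Maybe` carries failure.

```haskell
-- A toy parser: the remaining input is the state, Nothing signals failure.
newtype ToyParser a = ToyParser { runToy :: String -> Maybe (a, String) }

instance Functor ToyParser where
  fmap f (ToyParser p) = ToyParser $ \s -> case p s of
    Nothing        -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative ToyParser where
  pure a = ToyParser $ \s -> Just (a, s)
  ToyParser pf <*> ToyParser pa = ToyParser $ \s -> case pf s of
    Nothing        -> Nothing
    Just (f, rest) -> case pa rest of
      Nothing         -> Nothing
      Just (a, rest') -> Just (f a, rest')

instance Monad ToyParser where
  ToyParser p >>= f = ToyParser $ \s -> case p s of
    Nothing        -> Nothing
    Just (a, rest) -> runToy (f a) rest

-- Consume one character if it matches, moving the "cursor" forward.
charP :: Char -> ToyParser Char
charP c = ToyParser $ \s -> case s of
  (x:xs) | x == c -> Just (c, xs)
  _               -> Nothing

-- Gluing two parsers together yields a new parser.
oneTwoP :: ToyParser String
oneTwoP = (\a b -> [a, b]) <$> charP '1' <*> charP '2'
```

With these definitions, `runToy oneTwoP "123"` evaluates to `Just ("12","3")`, while `runToy oneTwoP "13"` evaluates to `Nothing`.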
### Exercises: Parsing practice
1. Modify the parsers to expect an end-of-file closing the input stream

In [3]:
one = char '1' >> eof

oneTwo = char '1' >> char '2' >> eof

print $ parseString one mempty "123"
print $ parseString oneTwo mempty "123"

Failure (ErrInfo {_errDoc = (interactive):1:2: error: expected: end of input
123<EOF>
 ^       , _errDeltas = [Columns 1 1]})

Failure (ErrInfo {_errDoc = (interactive):1:3: error: expected: end of input
123<EOF>
  ^      , _errDeltas = [Columns 2 2]})

2. Use `string` to make a `Parser` that parses `"1"`, `"12"`, and `"123"` out of the example input, respectively. Try combining it with `stop`, too.

In [4]:
stop = unexpected "stop"

oneTwoThree = string "123" <|> string "12" <|> string "1"
oneTwoThree' = oneTwoThree <* stop

print $ parseString oneTwoThree mempty "1"
print $ parseString oneTwoThree mempty "12"
print $ parseString oneTwoThree mempty "123"
print $ parseString oneTwoThree' mempty "1"

Success "1"

Success "12"

Success "123"

Failure (ErrInfo {_errDoc = (interactive):1:2: error: unexpected
    stop
1<EOF>
 ^     , _errDeltas = [Columns 1 1]})

3. Try writing a `Parser` that does what `string` does, but using `char`.

In [5]:
stringAsChar :: CharParsing m => String -> m String
stringAsChar = try . traverse char

## 24.4 Parsing fractions
The example ensures that only valid fractions are parsed: when the denominator is 0, the parser fails via the monadic `fail` instead of building the ratio.
### Exercise: Unit of success
Write a parser such that it returns the integer successfully when it receives an input with an integer, followed by an EOF, and fails in all other cases:

In [6]:
parseInteger :: Parser Integer
parseInteger = integer <* eof

parseString parseInteger mempty "123"
parseString parseInteger mempty "123abc"

Success 123

Failure (ErrInfo {_errDoc = (interactive):1:4: error: expected: digit,
    end of input
123abc<EOF>
   ^        , _errDeltas = [Columns 3 3]})

## 24.5 Haskell’s parsing ecosystem
Haskell has several excellent parsing libraries available.
### Type classes of parsers
The type class `Parsing` has `Alternative` as a superclass. It includes the functions:
- `try`: takes a parser that may consume input and, on failure, rewinds to where it started, so the failure behaves as if no input had been consumed.
- `unexpected`: used to emit an error on an unexpected token.

`Parsing` is a superclass of `CharParsing`, which defines functions such as `notChar`, `anyChar`, `string`, `text`.
## 24.6 Alternative
`<|>` is like a disjunction of two parsers. `Alternative` is a subclass of `Applicative` (that is, `Applicative` is its superclass) and defines functions commonly used in parsing, such as `many` and `some`.
It is possible to `fmap` a data constructor over a parser: the parsed result is then wrapped in that constructor.
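As a small illustration of these points, the sketch below combines `<|>`, `some`, `many` and `fmap` with trifecta. The `Item` type and the names `itemP` and `itemsP` are hypothetical and exist only for this example.

```haskell
import Control.Applicative ((<|>))
import Text.Trifecta

-- Hypothetical result type used only for this illustration.
data Item = Number Integer | Word String deriving (Eq, Show)

-- fmap wraps each parser's result in a constructor; <|> tries the
-- left parser first and falls back to the right one.
itemP :: Parser Item
itemP = (Number <$> decimal) <|> (Word <$> some letter)

-- many: zero or more items, each optionally followed by whitespace.
itemsP :: Parser [Item]
itemsP = many (itemP <* spaces) <* eof
```

Running `parseString itemsP mempty "12 ab 7"` should yield a `Success` holding `[Number 12,Word "ab",Number 7]`.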
### Exercise: Try try
Make a parser, using the existing fraction parser plus a new decimal parser, that can parse either decimals or fractions.

In [7]:
import Data.Ratio ((%))

virtuousParseFraction :: Parser Rational
virtuousParseFraction = do
  numerator <- decimal
  char '/'
  denominator <- decimal
  case denominator of
    0 -> fail "denominator cannot be zero"
    _ -> return (numerator % denominator)

type DecimalOrFraction = Either Double Rational

parseDecimalOrFraction :: Parser DecimalOrFraction
parseDecimalOrFraction = (Left <$> try double) <|> (Right <$> try virtuousParseFraction)

print $ parseString parseDecimalOrFraction mempty "1/2"
print $ parseString parseDecimalOrFraction mempty "0.5"

Success (Right (1 % 2))

Success (Left 0.5)

## 24.7 Parsing configuration files
The chapter shows how to hierarchically combine smaller parsers in order to build a parser for INI files.
## 24.8 Character and token parsers
Haskell parsing libraries usually offer the same API for both lexing and parsing. Lexers (or tokenizers) and parsers are primarily differentiated by their purposes and classes of grammar. Tokenization isn’t exclusively about whitespace: it’s about ignoring noise.
Overuse of tokenizing parsers, or mixing them with character parsers, can make your parser slow or hard to understand.
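The difference between a character parser and a token parser can be seen in a short sketch. The names `digitsP` and `digitsT` are hypothetical; `token` is the combinator from the `parsers` package that trifecta re-exports.

```haskell
import Text.Trifecta

-- Character-level parser: digits only, whitespace is not skipped.
digitsP :: Parser String
digitsP = some digit

-- Token parser: the same digits, but trailing whitespace is consumed too.
digitsT :: Parser String
digitsT = token (some digit)
```

On the input `"1 2 3"`, `some digitsP` stops after the first number because the following space is not a digit, whereas `some digitsT` collects all three numbers.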
## 24.9 Polymorphic parsers
If we take the time to assert polymorphic types for our parsers, we can get parsers that can be run using various parsing libraries.
Even so, you may still need to understand the particularities of each library.
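A sketch of what such a polymorphic signature might look like for the fraction parser is given below. The names `fractionP` and `viaTrifecta` are hypothetical, and the `MonadFail` constraint is an assumption made so that `fail` compiles on recent GHC versions.

```haskell
import Data.Ratio ((%))
import Text.Trifecta

-- The parser only asks for the type classes it needs, not a concrete type.
fractionP :: (MonadFail m, TokenParsing m) => m Rational
fractionP = do
  numerator <- decimal
  _ <- char '/'
  denominator <- decimal
  case denominator of
    0 -> fail "denominator cannot be zero"
    _ -> return (numerator % denominator)

-- Instantiating the polymorphic parser at trifecta's Parser type.
viaTrifecta :: String -> Result Rational
viaTrifecta = parseString fractionP mempty
```

Any other library whose parser type provides the same instances could run `fractionP` unchanged, although its error messages and backtracking behaviour may differ.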
#### Failure and backtracking
Backtracking is returning the cursor to where it was before a failing parser consumed input.
With backtracking, error attribution can become more complicated. To avoid this, consider using the `<?>` operator to annotate parse rules any time you use `try`.
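The effect of `try`, and the habit of pairing it with `<?>`, can be sketched as follows with trifecta. The names `committed`, `backtracking` and `succeeded` are hypothetical and serve only this illustration.

```haskell
import Control.Applicative ((<|>))
import Text.Trifecta

-- Without try: the left branch consumes 'a' before failing on 'c',
-- so the right branch is never attempted on the input "ac".
committed :: Parser Char
committed = (char 'a' *> char 'b') <|> (char 'a' *> char 'c')

-- With try: a failing left branch rewinds the cursor, and <?> names
-- each rule so that error messages stay readable.
backtracking :: Parser Char
backtracking = (try (char 'a' *> char 'b') <?> "ab")
           <|> (char 'a' *> char 'c' <?> "ac")

-- Small helper to inspect the outcome without printing the error.
succeeded :: Result a -> Bool
succeeded (Success _) = True
succeeded (Failure _) = False
```

On the input `"ac"`, `committed` fails because the first branch has already consumed the `'a'`, while `backtracking` succeeds. Other libraries may behave differently here, so this describes trifecta only.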
## 24.10 Marshalling from an AST to a datatype
The chapter advises not to mix `ByteString` with `Lazy.ByteString`.
Also, it shows how to marshal and unmarshal to/from JSON with the `aeson` library. For each type, two conversion instances must be implemented (`ToJSON` and `FromJSON`). Sum types are supported.
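A minimal round trip with `aeson` might look like the sketch below. The `Host` record and its field names are hypothetical, chosen only to show the two instances side by side.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson

-- Hypothetical record used only for this illustration.
data Host = Host { hostName :: String, hostPort :: Int } deriving (Eq, Show)

-- Marshalling: from the Haskell value to a JSON object.
instance ToJSON Host where
  toJSON (Host n p) = object ["name" .= n, "port" .= p]

-- Unmarshalling: from a JSON object back to the Haskell value.
instance FromJSON Host where
  parseJSON = withObject "Host" $ \o ->
    Host <$> o .: "name" <*> o .: "port"

-- encode produces a lazy ByteString; decode consumes one.
roundTrip :: Host -> Maybe Host
roundTrip = decode . encode
```

Since `encode` and `decode` both work on lazy `ByteString` values here, the round trip needs no conversion between the strict and lazy variants.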
## 24.11 Chapter exercises
1. Write a parser for semantic versions as defined by http://semver.org/. After making a working parser, write an `Ord` instance for the `SemVer` type that obeys the specification outlined on the SemVer website:

In [8]:
import Text.Read (readMaybe)

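-- Numeric identifiers rank below alphanumeric ones in the SemVer spec,
-- so NOSI must come first for the derived Ord to agree.
data NumberOrString = NOSI Integer | NOSS String deriving (Eq, Ord, Show)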

type Major = Integer
type Minor = Integer
type Patch = Integer
type Release = [NumberOrString]
type Metadata = [NumberOrString]

data SemVer = SemVer Major Minor Patch Release Metadata deriving (Eq, Show)

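instance Ord SemVer where
  compare (SemVer major minor patch rel _) (SemVer major' minor' patch' rel' _) =
    mconcat
      [ compare major major'
      , compare minor minor'
      , compare patch patch'
      , compareRelease rel rel'
      ]
    where
      -- A version without pre-release identifiers outranks one with them.
      compareRelease [] [] = EQ
      compareRelease [] _  = GT
      compareRelease _  [] = LT
      compareRelease a  b  = compare a b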

parseSemVer :: Parser SemVer
parseSemVer =  do
  major     <- decimal
  char '.'
  minor     <- decimal
  char '.'
  patch     <- decimal
  release   <- option [] releaseP
  metadata  <- option [] metadataP
  eof
  return $ SemVer major minor patch release metadata

releaseP :: Parser Release
releaseP = char '-' *>  sepBy1 labelP (char '.')

metadataP :: Parser Metadata
metadataP = char '+' *> sepBy1 labelP (char '.')

labelP :: Parser NumberOrString
labelP = do
  nos <- some $ oneOf ['a'..'z'] <|> oneOf ['A'..'Z'] <|> digit <|> char '-'
  case readMaybe nos of
      Nothing -> return $ NOSS nos
      Just int -> return $ NOSI int

parseString parseSemVer mempty "2.1.1"
parseString parseSemVer mempty "1.0.0-x.7.z.92"
parseString parseSemVer mempty "1.0.0-gamma+002"
parseString parseSemVer mempty "1.0.0-beta+oof.sha.41af286"

big = SemVer 2 1 1 [] []
little = SemVer 2 1 0 [] []
big > little

Success (SemVer 2 1 1 [] [])

Success (SemVer 1 0 0 [NOSS "x",NOSI 7,NOSS "z",NOSI 92] [])

Success (SemVer 1 0 0 [NOSS "gamma"] [NOSI 2])

Success (SemVer 1 0 0 [NOSS "beta"] [NOSS "oof",NOSS "sha",NOSS "41af286"])

True

2. Write a parser for positive integer values. Don’t reuse the preexisting `digit` or `integer` functions.

In [5]:
import Data.Char (digitToInt)

parseDigit :: Parser Char
parseDigit = oneOf ['0'..'9'] <?> "['0'..'9']"

base10Integer :: Parser Integer
base10Integer = read' <$> some parseDigit

read' :: String -> Integer
read' = listToInteger . map (toInteger . digitToInt)
  where listToInteger :: [Integer] -> Integer
        listToInteger = foldl (\acc a -> acc * 10 + a) 0

parseString parseDigit mempty "123"
parseString parseDigit mempty "abc"
parseString base10Integer mempty "123abc"
parseString base10Integer mempty "123"

Success '1'

Failure (ErrInfo {_errDoc = (interactive):1:1: error: expected: ['0'..'9']
abc<EOF>
^        , _errDeltas = [Columns 0 0]})

Success 123

Success 123

3. Extend the parser you wrote to handle negative and positive integers. Try writing a new parser in terms of the one you already have in order to do this:

In [51]:
base10Integer' :: Parser Integer
base10Integer' = do
   sign <- op <$> optional (char '+' <|> char '-')
   sign <$> base10Integer
     
op sign = case sign of
  Just '-' -> negate
  _        -> id
    
parseString base10Integer' mempty "-123abc"

Success (-123)

4. Write a parser for US/Canada phone numbers with varying formats:

In [41]:
type NumberingPlanArea = Int
type Exchange = Int
type LineNumber = Int

data PhoneNumber = PhoneNumber NumberingPlanArea Exchange LineNumber deriving (Eq, Show)

parsePhone :: Parser PhoneNumber
parsePhone = do
  _ <- optional $ string "1-"
  a <- count 3 digit <|> char '(' *> count 3 digit <* char ')'
  _ <- optional $ oneOf [' ', '-']
  b <- count 3 digit
  _ <- optional $ oneOf [' ', '-']
  c <- count 4 digit
  pure $ PhoneNumber (read a) (read b) (read c)

parseString parsePhone mempty "123-456-7890"
parseString parsePhone mempty "1234567890"
parseString parsePhone mempty "(123) 456-7890"
parseString parsePhone mempty "1-123-456-7890"

Success (PhoneNumber 123 456 7890)

Success (PhoneNumber 123 456 7890)

Success (PhoneNumber 123 456 7890)

Success (PhoneNumber 123 456 7890)

5. Write a parser for a log file format, and sum the time spent in each activity. Additionally, provide an alternative aggregation of the data that provides average time spent per activity per day. The format supports the use of comments, which your parser will have to ignore. The # characters followed by a date mark the beginning of a particular day.

In [15]:
{-# LANGUAGE OverloadedStrings, QuasiQuotes #-}

import Control.Monad           (void)
import Text.Printf             (printf)
import Text.Parser.Char        (char, anyChar, newline, space, spaces, string)
import Text.Parser.Token       (decimal)
import Test.QuickCheck         (Arbitrary, Property, arbitrary, quickCheck, (==>), generate, elements, listOf)

type Year = Integer
type Month = Integer
type Day = Integer
type Hour = Integer
type Minute = Integer
type Description = String

data Date = Date Year Month Day  deriving (Eq, Ord) -- TODO: use import Data.Date

instance Show Date where
  show (Date year month day) = printf "%04d" year ++ "-" ++ printf "%02d" month ++ "-" ++ printf "%02d" day

instance Arbitrary Date where
  arbitrary = Date <$> elements [2000..2020] <*> elements [1..12] <*> elements [1..30]


data Time = Time Hour Minute deriving (Eq, Ord)

instance Show Time where
  show (Time hour min) = printf "%02d" hour ++ ":" ++ printf "%02d" min
  
instance Arbitrary Time where
  arbitrary = Time <$> elements [0..23] <*> elements [0..59]


data Activity = Activity Time Description deriving (Eq)

instance Show Activity where
  show (Activity time desc) = show time ++ " " ++ desc

instance Arbitrary Activity where
  arbitrary = Activity <$> arbitrary <*> listOf (elements ['a' .. 'z'])


data DailyLog = DailyLog Date [Activity] deriving (Eq)

instance Show DailyLog where
  show (DailyLog date activities) = unlines $ ("# " ++ show date) : map show activities

instance Arbitrary DailyLog where
  arbitrary = do
    date <- arbitrary
    activities <- arbitrary
    pure $ DailyLog date [activities]
    
newtype Log = Log [DailyLog] deriving (Eq)

instance Show Log where
  show (Log dailies) = concatMap show dailies
  
instance Arbitrary Log where
  arbitrary = do
    daily <- arbitrary
    pure $ Log [daily]


skipComment :: Parser ()
skipComment = char '-' *> char '-' *> skipRestOfLine

skipRestOfLine :: Parser ()
skipRestOfLine = void $ manyTill anyChar (void newline <|> eof)

skipSpaceAndComments :: Parser ()
skipSpaceAndComments = skipMany (skipComment <|> skipSome space)

dateP :: Parser Date
dateP = Date <$> (decimal <* char '-') <*> (decimal <* char '-') <*> decimal

dateLineP :: Parser Date
dateLineP = char '#' *> char ' ' *> dateP <* skipRestOfLine

timeP :: Parser Time
timeP = Time <$> (decimal <* char ':') <*> decimal

descriptionP :: Parser Description
descriptionP = manyTill anyChar (try skipComment <|> void newline <|> eof)

activityP :: Parser Activity
activityP = Activity <$> timeP <*> (char ' ' *> descriptionP)

activitiesP :: Parser [Activity]
activitiesP = some (skipSpaceAndComments *> activityP <* skipSpaceAndComments)

dayP :: Parser DailyLog
dayP = do
  skipSpaceAndComments
  date <- dateLineP
  activities <- activitiesP
  pure $ DailyLog date activities

logP :: Parser Log
logP = Log <$> some dayP

maybeSuccess :: Result a -> Maybe a
maybeSuccess (Success a) = Just a
maybeSuccess _ = Nothing

runLogParser :: String -> Maybe Log
runLogParser = maybeSuccess . parseString logP mempty

prop_correctParsing :: Log -> Bool
prop_correctParsing x = runLogParser (show x) == Just x

quickCheck prop_correctParsing

+++ OK, passed 100 tests.

6. Write a parser for IPv4 addresses:

In [13]:
import Data.Word
import Control.Monad (guard)

data IPAddress = IPAddress Word32 deriving (Eq, Ord, Show)

ipv4P :: Parser IPAddress
ipv4P = do
  x4 <- octetP
  [x3, x2, x1] <- count 3 (char '.' *> octetP)
  pure $ IPAddress $ x4 * 256^3 + x3 * 256^2 + x2 * 256 + x1

octetP :: Num a => Parser a
octetP = do
  oct <- decimal
  guard $ (oct >= 0) && (oct <= 255)
  pure $ fromIntegral oct

print $ parseString ipv4P mempty "172.16.257.1"
print $ parseString ipv4P mempty "204.120.0.15"

Failure (ErrInfo {_errDoc = (interactive):1:11: error: expected: digit
172.16.257.1<EOF>
          ^       , _errDeltas = [Columns 10 10]})

Success (IPAddress 3430416399)

7. Same as before, but IPv6:

In [9]:
import Data.Word
import Data.List.Split (splitOn)
import Data.Bits       (shiftL)
import Numeric         (readHex)

data IPAddress6 = IPAddress6 Word64 Word64 deriving (Eq, Ord, Show)

ipv6P :: Parser IPAddress6
ipv6P = try ipV6FullP <|> ipV6AbbrP 

ipV6FullP :: Parser IPAddress6
ipV6FullP = toIPAddress6 <$> some hexDigit `sepBy` colon

ipV6AbbrP :: Parser IPAddress6
ipV6AbbrP = do
  before <- manyTill ipv6DigitP $ try $ string "::"
  after <- many ipv6DigitP
  let bs = splitOn ":" before
  let as = splitOn ":" after
  return $ toIPAddress6 $ bs ++ replicate (8 - length bs - length as) "0" ++ as

ipv6DigitP :: Parser Char
ipv6DigitP = oneOf ['a'..'f'] <|> oneOf ['A'..'F'] <|> digit <|> char ':'

toIPAddress6 :: [String] -> IPAddress6
toIPAddress6 xs = let
    [x8, x7, x6, x5, x4, x3, x2, x1] = xs
  in
    IPAddress6 (toWord64 [x8, x7, x6, x5]) (toWord64 [x4, x3, x2, x1])

toWord64 :: [String] -> Word64
toWord64 xs  = let
    [x4, x3, x2, x1] = fmap (fromIntegral . fst) $ xs >>= readHex
  in
    sum [shiftL x4 48, shiftL x3 32, shiftL x2 16, shiftL x1 0]

print $ parseString ipv6P mempty "0:0:0:0:0:ffff:ac10:fe01"
print $ parseString ipv6P mempty "0:0:0:0:0:ffff:cc78:f"
print $ parseString ipv6P mempty "FE80:0000:0000:0000:0202:B3FF:FE1E:8329"
print $ parseString ipv6P mempty "2001:DB8::8:800:200C:417A"

Failure (ErrInfo {_errDoc = (interactive):1:1: error: unspecified
    error
0:0:0:0:0:ffff:ac10:fe01<EOF>
^                             , _errDeltas = [Columns 0 0]})

Failure (ErrInfo {_errDoc = (interactive):1:1: error: unspecified
    error
0:0:0:0:0:ffff:cc78:f<EOF>
^                          , _errDeltas = [Columns 0 0]})

Failure (ErrInfo {_errDoc = (interactive):1:1: error: unspecified
    error
FE80:0000:0000:0000:0202:B3FF:FE1E:8329<EOF>
^                                            , _errDeltas = [Columns 0 0]})

Success (IPAddress6 2306139568115548160 2260596444381562)

8. Remove the derived `Show` instances from the `IPAddress`/`IPAddress6` types, and write your own `Show` instance for each type that renders in the typical textual format appropriate to each.

In [157]:
import Data.List  (intersperse, intercalate)
import Numeric    (showHex)
import Data.Bits  (shiftR, (.&.))

instance Show IPAddress where
  show (IPAddress w) = intercalate "." $ fmap show
    [ shiftR x 24 .&. 0xff
    , shiftR x 16 .&. 0xff
    , shiftR x  8 .&. 0xff
    ,        x    .&. 0xff
    ]
      where
        x = fromIntegral w

instance Show IPAddress6 where
  show (IPAddress6 hi lo) = (foldr ($) "" . intersperse (":" ++) . fmap showHex) $ quads hi ++ quads lo
    where 
      quads w =
        let
          x = fromIntegral w
        in
          [ shiftR x 48 .&. 0xffff
          , shiftR x 32 .&. 0xffff
          , shiftR x 16 .&. 0xffff
          ,        x    .&. 0xffff
          ]

print $ parseString ipv4P mempty "172.16.254.1"
print $ parseString ipv6P mempty "2001:DB8::8:800:200C:417A"

Success 172.16.254.1

Success 2001:db8:0:0:8:800:200c:417a

9. Write a function that converts between your types for `IPAddress` and `IPAddress6`.

In [84]:
ipv4ToIpv6 :: IPAddress -> IPAddress6
ipv4ToIpv6 (IPAddress w32) = IPAddress6 0 (fromIntegral w32)

10. Write a parser for the DOT language that GraphViz uses to express graphs in plain text

In [27]:
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes #-} 

import Data.GraphViz as DG
import Data.Text.Lazy as T
import Text.RawString.QQ

type DG = DG.DotGraph String

c1 :: T.Text
c1 = [r|
digraph "sm0" {
  A
  B
  A -> B
}
|]

print (DG.parseDotGraph c1 :: DG)

DotGraph {strictGraph = False, directedGraph = True, graphID = Just (Str "sm0"), graphStatements = DotStmts {attrStmts = [], subGraphs = [], nodeStmts = [DotNode {nodeID = "A", nodeAttributes = []},DotNode {nodeID = "B", nodeAttributes = []}], edgeStmts = [DotEdge {fromNode = "A", toNode = "B", edgeAttributes = []}]}}

## 24.12 Definitions
1. A *parser* parses.
2. A *parser combinator* combines one or more parsers to produce a new parser (e.g. `<|>`, `>>`, `some`, `many`).
3. *Marshalling/unmarshalling*: cf. serialization and deserialization.
4. A *tokenizer*, or *lexer*, converts text, usually a stream of characters, into meaningful symbols.