Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Per-line parser reporting per-line errors #210

Open
MaxGabriel opened this issue Dec 20, 2021 · 3 comments
Open

Per-line parser reporting per-line errors #210

MaxGabriel opened this issue Dec 20, 2021 · 3 comments
Labels
PR welcome re: error reporting Concerning error messages delivered by the parser

Comments

@MaxGabriel
Copy link

Right now, the functions in Data.Csv generally return Either String (Vector a) or a similar variant, which is great in most cases.

If you want to get errors on a per-line basis, one needs to use Data.Csv.Incremental*. This is unfortunate because the complexity of it is significantly higher (because it also is supporting interleaving IO and incrementally feeding data to the parser).

I think adding convenience functions for taking a ByteString and returning Vector (Either String a), without going through the Incremental functions, could be a good addition. The main use case I have in mind for those is providing better error messages for user-provided CSVs.

  • If my read of the Cassava docs is right
@andreasabel
Copy link
Member

I think adding convenience functions for taking a ByteString and returning Vector (Either String a),

Yeah, that is thinkable.
The current parser (e.g. for without header),

csv :: DecodeOptions -> AL.Parser Csv
csv !opts = do
vals <- sepByEndOfLine1' (record (decDelimiter opts))
_ <- optional endOfLine
endOfInput
let nonEmpty = removeBlankLines vals
return $! V.fromList nonEmpty

uses the parser modifier sepByEndOfLine':
-- | Specialized version of 'sepBy1'' which is faster due to not
-- accepting an arbitrary separator.
sepByEndOfLine1' :: AL.Parser a
-> AL.Parser [a]
sepByEndOfLine1' p = liftM2' (:) p loop
where
loop = do
mb <- A.peekWord8
case mb of
Just b | b == cr ->
liftM2' (:) (A.anyWord8 *> A.word8 newline *> p) loop
| b == newline ->
liftM2' (:) (A.anyWord8 *> p) loop
_ -> pure []

So, it is just a single parse with a single error returned.
If you want per-line parsing, you will first have to split the input using a variant of sepByEndOfLine' and then map a line parser over it.

PR welcome, if it comes with benchmarks comparing the performance of single parse versus per line parse.

@MaxGabriel
Copy link
Author

MaxGabriel commented Dec 21, 2021

Ok, I'm not interested in the performance implications of this, so just going to close

@andreasabel andreasabel changed the title Convenience function for per-line errors Per-line parser reporting per-line errors Dec 21, 2021
@andreasabel andreasabel added PR welcome re: error reporting Concerning error messages delivered by the parser labels Dec 21, 2021
@andreasabel
Copy link
Member

Let's leave it open, maybe others are interested.

@andreasabel andreasabel reopened this Dec 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR welcome re: error reporting Concerning error messages delivered by the parser
Projects
None yet
Development

No branches or pull requests

2 participants