Second implementation of parsing for space-delimited data #36

Open
wants to merge 14 commits into
from

Projects

None yet

2 participants

@Shimuuar

Previous dicussion is in #30 pul request

Everything does work. Although encoding os space delimited data is a bit fragile and depends on fact that space is always escaped

There are performance regression with respect to current HEAD. Benchmarks shows that streaming have around 1.3 slowdown and decoding of named CSV have similar slowdown.

tibbe and others added some commits Feb 23, 2013
@tibbe tibbe Check-point commit
Space-delimited parsing is mostly working, except that trailing
whitespace is not dropped correctly.
f2bf695
@Shimuuar Shimuuar Fix build 9eeea09
@Shimuuar Shimuuar Add more failing tests d217e27
@Shimuuar Shimuuar Correctly leading/trailing spaces
It's hard to strip whitespaces correctly because
 a) It's valid part of field for CSV so
    "a,b,c " -> ["a","b","c "]
 b) If we're using spaces as delimiter we get spurious empty field
    at the end fo the line
    "a b c " -> ["a","b","c",""]

Only reliable way to strip them is to read whole line, strip spaces
and parse stripped line.
6b8e1dd
@Shimuuar Shimuuar parser for header is same as parser for record 6fbc77f
@Shimuuar Shimuuar Pass DecodeOptions to the field and name parsers
It's necessary for implementing delimiters which are not single
character
0a60b79
@Shimuuar Shimuuar Use predicate on delimiter character.
Now all tests are passing.
24f375e
@Shimuuar Shimuuar Reexport spaceDecodeOptions f7d8d8a
@Shimuuar Shimuuar Inline unescapedField
Now it depends on rather complex data structure and needs to be
specialized by compiler
1c18502
@tibbe
Collaborator

The change looks good from a brief look. I will have to take a deeper look and run the benchmarks later next week (I'm very busy this week, sorry!)

@tibbe tibbe commented on an outdated diff Mar 21, 2013
Data/Csv/Encoding.hs
. V.toList
{-# INLINE encodeWith #-}
-encodeRecord :: Word8 -> Record -> Builder
-encodeRecord delim = mconcat . intersperse (fromWord8 delim)
- . map fromByteString . map escape . V.toList
+encodeRecord :: EncodeOptions -> Record -> Builder
@tibbe
tibbe Mar 21, 2013

I don't see why passing the whole record here is necessary (since we only use the delimiter). It might be slower or it might not, but I'm in favor of not changing things without reason (unless there is a reason I missed).

@tibbe
Collaborator

I took a deeper look and things look fine except for the small comments above.

Could you please post the criterion benchmarks for before and after your change (you can use cabal bench to run the benchmarks).

Shimuuar added some commits Feb 28, 2013
@Shimuuar Shimuuar Add options for encoding of space-delimited data
Always escape delimiter. Now it's passed to escape function as
parameter and not hardcoded as comma.

Note correct encoding of space-delimited data depends on escaping
of space character.
361703c
@Shimuuar Shimuuar Reexport spaceEncodingOptions from Data.Csv fa7fe10
@Shimuuar Shimuuar Add tests for encoding of space delimited data
All tests are passing
1ca1784
@Shimuuar Shimuuar Document field values in defaultDecodeOptions and spaceDecodeOptions b009085
@Shimuuar

I changed EncodeOptions for a few times and when I selected simple option I didn't roll back all changes. I've pushed amended commits.

Here is table with benchmarks results. Current HEAD, space delimited implementation and ration space/HEAD

                                                       Name       old       new     ratio
1           positional/decode/presidents/without conversion  60.86681  57.76073 0.9489693
2              positional/decode/presidents/with conversion 134.72521 123.18220 0.9143218
3 positional/decode/streaming/presidents/without conversion  98.87020 171.77240 1.7373526
4    positional/decode/streaming/presidents/with conversion 137.70761 191.79104 1.3927410
5              positional/encode/presidents/with conversion 162.29042 180.51010 1.1122659
6                named/decode/presidents/without conversion 282.89621 355.71811 1.2574156
7                   named/decode/presidents/with conversion 213.22024 275.15457 1.2904712
8                   named/encode/presidents/with conversion 264.76437 267.29542 1.0095596
9                                       comparison/lazy-csv 150.31214 147.47115 0.9810994

Streaming suffered badly and named fields see performance impact as well,

@tibbe
Collaborator

The reason I haven't looked into this yet is I was hoping to find some time to figure out why the streaming decode got 70% slower.

@Shimuuar

I've merged current master and updated benchmarks. Streaming performans still suffer although less

                                                       name       old      new     ratio
1           positional/decode/presidents/without conversion  59.47160  63.4873 1.0675229
2              positional/decode/presidents/with conversion 107.10150 112.4752 1.0501736
3 positional/decode/streaming/presidents/without conversion  89.97198 126.9854 1.4113888
4    positional/decode/streaming/presidents/with conversion 107.13403 149.0713 1.3914473
5              positional/encode/presidents/with conversion 134.93142 128.5330 0.9525799
6                named/decode/presidents/without conversion 247.20380 289.9606 1.1729619
7                   named/decode/presidents/with conversion 197.89479 246.9330 1.2477996
8                   named/encode/presidents/with conversion 195.77406 193.9283 0.9905722
9                                       comparison/lazy-csv 116.55858 117.6775 1.0095993
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment