Second implementation of parsing for space-delimited data #36

Shimuuar · 2013-03-08T16:28:21Z

Previous dicussion is in #30 pul request

Everything does work. Although encoding os space delimited data is a bit fragile and depends on fact that space is always escaped

There are performance regression with respect to current HEAD. Benchmarks shows that streaming have around 1.3 slowdown and decoding of named CSV have similar slowdown.

Space-delimited parsing is mostly working, except that trailing whitespace is not dropped correctly.

It's hard to strip whitespaces correctly because a) It's valid part of field for CSV so "a,b,c " -> ["a","b","c "] b) If we're using spaces as delimiter we get spurious empty field at the end fo the line "a b c " -> ["a","b","c",""] Only reliable way to strip them is to read whole line, strip spaces and parse stripped line.

It's necessary for implementing delimiters which are not single character

Now all tests are passing.

Now it depends on rather complex data structure and needs to be specialized by compiler

tibbe · 2013-03-11T23:50:29Z

The change looks good from a brief look. I will have to take a deeper look and run the benchmarks later next week (I'm very busy this week, sorry!)

tibbe · 2013-03-21T21:20:56Z

Data/Csv/Encoding.hs

-encodeRecord :: Word8 -> Record -> Builder
-encodeRecord delim = mconcat . intersperse (fromWord8 delim)
-                     . map fromByteString . map escape . V.toList
+encodeRecord :: EncodeOptions -> Record -> Builder


I don't see why passing the whole record here is necessary (since we only use the delimiter). It might be slower or it might not, but I'm in favor of not changing things without reason (unless there is a reason I missed).

tibbe · 2013-03-21T21:25:18Z

I took a deeper look and things look fine except for the small comments above.

Could you please post the criterion benchmarks for before and after your change (you can use cabal bench to run the benchmarks).

Always escape delimiter. Now it's passed to escape function as parameter and not hardcoded as comma. Note correct encoding of space-delimited data depends on escaping of space character.

All tests are passing

Shimuuar · 2013-03-23T08:50:28Z

I changed EncodeOptions for a few times and when I selected simple option I didn't roll back all changes. I've pushed amended commits.

Here is table with benchmarks results. Current HEAD, space delimited implementation and ration space/HEAD

                                                       Name       old       new     ratio
1           positional/decode/presidents/without conversion  60.86681  57.76073 0.9489693
2              positional/decode/presidents/with conversion 134.72521 123.18220 0.9143218
3 positional/decode/streaming/presidents/without conversion  98.87020 171.77240 1.7373526
4    positional/decode/streaming/presidents/with conversion 137.70761 191.79104 1.3927410
5              positional/encode/presidents/with conversion 162.29042 180.51010 1.1122659
6                named/decode/presidents/without conversion 282.89621 355.71811 1.2574156
7                   named/decode/presidents/with conversion 213.22024 275.15457 1.2904712
8                   named/encode/presidents/with conversion 264.76437 267.29542 1.0095596
9                                       comparison/lazy-csv 150.31214 147.47115 0.9810994

Streaming suffered badly and named fields see performance impact as well,

tibbe · 2013-04-11T20:37:03Z

The reason I haven't looked into this yet is I was hoping to find some time to figure out why the streaming decode got 70% slower.

Shimuuar · 2013-10-15T09:26:46Z

I've merged current master and updated benchmarks. Streaming performans still suffer although less

                                                       name       old      new     ratio
1           positional/decode/presidents/without conversion  59.47160  63.4873 1.0675229
2              positional/decode/presidents/with conversion 107.10150 112.4752 1.0501736
3 positional/decode/streaming/presidents/without conversion  89.97198 126.9854 1.4113888
4    positional/decode/streaming/presidents/with conversion 107.13403 149.0713 1.3914473
5              positional/encode/presidents/with conversion 134.93142 128.5330 0.9525799
6                named/decode/presidents/without conversion 247.20380 289.9606 1.1729619
7                   named/decode/presidents/with conversion 197.89479 246.9330 1.2477996
8                   named/encode/presidents/with conversion 195.77406 193.9283 0.9905722
9                                       comparison/lazy-csv 116.55858 117.6775 1.0095993

tibbe and others added 9 commits February 28, 2013 20:32

Check-point commit

f2bf695

Space-delimited parsing is mostly working, except that trailing whitespace is not dropped correctly.

Fix build

9eeea09

Add more failing tests

d217e27

parser for header is same as parser for record

6fbc77f

Pass DecodeOptions to the field and name parsers

0a60b79

It's necessary for implementing delimiters which are not single character

Use predicate on delimiter character.

24f375e

Now all tests are passing.

Reexport spaceDecodeOptions

f7d8d8a

Inline unescapedField

1c18502

Now it depends on rather complex data structure and needs to be specialized by compiler

tibbe reviewed Mar 21, 2013
View reviewed changes

Shimuuar added 4 commits March 23, 2013 12:17

Add options for encoding of space-delimited data

361703c

Always escape delimiter. Now it's passed to escape function as parameter and not hardcoded as comma. Note correct encoding of space-delimited data depends on escaping of space character.

Reexport spaceEncodingOptions from Data.Csv

fa7fe10

Add tests for encoding of space delimited data

1ca1784

All tests are passing

Document field values in defaultDecodeOptions and spaceDecodeOptions

b009085

Merge master into space2

2610958

devonhollowood mentioned this pull request Feb 27, 2017

Current status of implementing space-delimited data #132

Open

hvr added this to the ⊥ milestone Jun 15, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Second implementation of parsing for space-delimited data #36

Second implementation of parsing for space-delimited data #36

Shimuuar commented Mar 8, 2013

tibbe commented Mar 11, 2013

tibbe Mar 21, 2013

tibbe commented Mar 21, 2013

Shimuuar commented Mar 23, 2013

tibbe commented Apr 11, 2013

Shimuuar commented Oct 15, 2013

Second implementation of parsing for space-delimited data #36

Are you sure you want to change the base?

Second implementation of parsing for space-delimited data #36

Conversation

Shimuuar commented Mar 8, 2013

tibbe commented Mar 11, 2013

tibbe Mar 21, 2013

Choose a reason for hiding this comment

tibbe commented Mar 21, 2013

Shimuuar commented Mar 23, 2013

tibbe commented Apr 11, 2013

Shimuuar commented Oct 15, 2013