Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DKVP: pair separator in value breaks parsing #62

Closed
rbroemeling opened this issue Sep 3, 2015 · 8 comments
Closed

DKVP: pair separator in value breaks parsing #62

rbroemeling opened this issue Sep 3, 2015 · 8 comments
Labels

Comments

@rbroemeling
Copy link

Miller has difficulty parsing DKVP when the pair separator character is within the value of the data that it is parsing. To whit:

Source File

a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647=899636,y=0.5221511083334797
a=wye,b="wy=e",i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x==0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5=,x=0.5732889198020006,y=0.8636244699032729

Expected Output of mlr --opprint cat

a   b      i  x                    y
pan pan    1  0.3467901443380824   0.7268028627434533
eks pan    2  0.7586799647=899636  0.5221511083334797
wye "wy=e" 3  0.20460330576630303  0.33831852551664776
eks wye    4  =0.38139939387114097 0.13418874328430463
wye pan    5= 0.5732889198020006   0.8636244699032729

Actual Output of mlr --opprint cat

a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 899636              0.5221511083334797
wye e"  3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan - 0.5732889198020006  0.8636244699032729

It looks like what miller does is that it reads the key until it finds the first pair separator, and then starts reading the value after the last pair separator.

What it should do, I think, is read the key until it finds the first pair separator, and then assign everything after that first pair separator right up until the field separator is found to be value.

@johnkerl
Copy link
Owner

johnkerl commented Sep 3, 2015

The RFC-compliant CSV reader handles embedded separators within double quotes; readers for the other formats do not at all. This is a non-ideal situation, for sure.

This is a dup of #52 and all four of the current on-deck or active tasks are what I'm actively working on for v2.2.0.

@rbroemeling
Copy link
Author

@johnkerl While proper quoting can solve this issue, I'm not sure it is the only solution and I'm not certain that it is preferable in this case. Having miller make a best-effort to parse unquoted-but-still-parsable data is a bug, IMO.

Given the log file that I am currently working with, for example: adding quotes to all the fields would make it a LOT harder to read (as a human reading plaintext, I mean).

@johnkerl
Copy link
Owner

johnkerl commented Sep 3, 2015

Good feedback. I'll dig harder into your request this evening.

@rbroemeling
Copy link
Author

Thanks for looking into it further, @johnkerl -- I have not spent long looking into this, but I'm wondering if the problem isn't in /c/input/lrec_reader_mmap_dkvp.c within the lrec_parse_mmap_dkvp function: whenever it matches on ips, it changes the ips to a null and then sets the value to the byte after the ips. This seemingly would result in the behavior that is being seen (i.e. the key ends at the first ips, and the value begins at the last ips).

I wonder if this couldn't be fixed by tweaking that code to ensure that you only match on ips once per field, which should result in only the first ips being matched.

@johnkerl
Copy link
Owner

johnkerl commented Sep 3, 2015

My apologies for the hasty read; I've got double-quotes on the brain & assigned too much weight to the double-quotes in your data. :^/

This is (was) definitely a bug; fixed in c2e11c0.

Thank you for the report!! :)

@rbroemeling
Copy link
Author

No problem at all, @johnkerl -- thanks very much for the quick fix!

johnkerl added a commit that referenced this issue Sep 3, 2015
@rbroemeling
Copy link
Author

Nice fix, @johnkerl -- have confirmed that with it in-place, miller can now deal very nicely with things like a web-server access.log file in dkvp format (using = as the PS now works well, even though it is used in some field values as well).

@johnkerl
Copy link
Owner

johnkerl commented Sep 4, 2015

Awesome!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants