Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV node outputs non valid RFC4180 CSV data under certain conditions #3934

Open
Steve-Mcl opened this issue Oct 29, 2022 · 8 comments · Fixed by #4540
Open

CSV node outputs non valid RFC4180 CSV data under certain conditions #3934

Steve-Mcl opened this issue Oct 29, 2022 · 8 comments · Fixed by #4540
Labels
Milestone

Comments

@Steve-Mcl
Copy link
Contributor

Current Behavior

This is related #3919 (but to be addressed separately)

In the screenshot below, an object prop with a double quote generates invalid CSV then fails completely to reconvert to object
image

Here is what Google Sheets / Excel makes of double quotes...
image
Note that double quotes are doubled up and surrounded in double quotes as per RFC

Here is what convertcsv.com make of it...
image

Expected Behavior

If this is considered a bug, then target master with fixes

If this is considered "current/expected behaviour" then target dev with suggested enhancements...

Ideally, we would make the CSV output data that is 100% compatible with the specification however if it is deemed making the CSV node 100% RFC compatible would break users existing flows, it may be preferable to implement a new [x] Strict Mode option that fully adheres to the specification

Steps To Reproduce

Import provided flows - they demonstrate the various problems

Example flow

[{"id":"ee0d712a0a753877","type":"inject","z":"3b9a9d02857dd4a7","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"[{\"c\\\"c\":3,\"d.d\":4,\"b,b\":2,\"e`e\":5,\"a'a\":1},{\"d.d\":2,\"e`e\":1,\"b,b\":4,\"c\\\"c\":3,\"a'a\":5}]","payloadType":"json","x":930,"y":1460,"wires":[["81bfac100669ed83"]]},{"id":"81bfac100669ed83","type":"csv","z":"3b9a9d02857dd4a7","name":"","sep":",","hdrin":"","hdrout":"all","multi":"one","ret":"\\n","temp":"","skip":"0","strings":true,"include_empty_strings":"","include_null_values":"","x":1110,"y":1460,"wires":[["21813caee6ea84c1"]]},{"id":"21813caee6ea84c1","type":"debug","z":"3b9a9d02857dd4a7","name":"debug 28","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":1300,"y":1460,"wires":[]},{"id":"31e6b7bd1227c180","type":"comment","z":"3b9a9d02857dd4a7","name":"No col/template object data only.  Outputs invalid header   c\"c,d.d,\"b,b\",e`e,a'a   (RFC4180 2.6/2.7)","info":"#### Demo 1 - no column template\n\nInput data\n```\n[\n    {\"c\\\"c\":3,\"d.d\":4,\"b,b\":2,\"e`e\":5,\"a'a\":1},\n    {\"d.d\":2,\"e`e\":1,\"b,b\":4,\"c\\\"c\":3,\"a'a\":5}\n]\n```\n\nExtracts col template...\n```\n['c\"c', 'd.d', 'b,b', 'e`e', \"a'a\"]\n//c\"c,d.d,b,b,e`e,a'a               as a CSV string\n```\n\nAccording to RFC4180 2.6 & 2.7 the CSV generated for the template is invalid:\n\n> 2.6\n> Fields containing line breaks (CRLF), double quotes, and commas\n> should be enclosed in double-quotes\n\n> 2.7\n> If double-quotes are used to enclose fields, then a double-quote\n> appearing inside a field must be escaped by preceding it with\n> another double quote\n\ni.e. the column template should be...\n\n```\n\"c\"\"c\",d.d,\"b,b\",e`e,a'a\n```","x":1210,"y":1420,"wires":[]},{"id":"fe74d9ac546a36a1","type":"inject","z":"3b9a9d02857dd4a7","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"[{\"c\\\"c\":3,\"d.d\":4,\"b,b\":2,\"e`e\":5,\"a'a\":1},{\"d.d\":2,\"e`e\":1,\"b,b\":4,\"c\\\"c\":3,\"a'a\":5}]","payloadType":"json","x":930,"y":1600,"wires":[["bb56df9c07e10fb7"]]},{"id":"bb56df9c07e10fb7","type":"csv","z":"3b9a9d02857dd4a7","name":"","sep":",","hdrin":"","hdrout":"all","multi":"one","ret":"\\n","temp":"a'a,\"b,b\",c\"c,d.d,e`e","skip":"0","strings":true,"include_empty_strings":"","include_null_values":"","x":1110,"y":1600,"wires":[["c450170efe7dfb96"]]},{"id":"c450170efe7dfb96","type":"debug","z":"3b9a9d02857dd4a7","name":"debug 109","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":1310,"y":1600,"wires":[]},{"id":"8de0e9f39fe63f23","type":"comment","z":"3b9a9d02857dd4a7","name":"Incorrectly parses `col`  a'a,\"b,b\",c\"c,d.d,e`e  - as -  ['a'a,b', 'b,cc', 'd.d', 'e`e']      (RFC4180 2.6)","info":"#### Demo 2\n\nColumn Template\n```\na'a,\"b,b\",c\"c,d.d,e`e\n```\n\nparses as\n```\n['a'a,b', 'b,cc', 'd.d', 'e`e']\n```\n\nAccording to RFC4180 2.6 the CSV for the template is invalid:\n> Fields containing line breaks (CRLF), double quotes, and commas\n>  should be enclosed in double-quotes\n\n\nUsing https://www.npmjs.com/package/@vanillaes/csv\n\n```\ncsv.parse('a\\'a,\"b,b\",c\"c,d.d,e`e')\n//Uncaught Error: CSVError: Illegal state [row:1, col:3]\n```","x":1190,"y":1560,"wires":[]},{"id":"eecf56ebbebfa0c6","type":"inject","z":"3b9a9d02857dd4a7","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"[{\"c\\\"c\":3,\"d.d\":4,\"b,b\":2,\"e`e\":5,\"a'a\":1},{\"d.d\":2,\"e`e\":1,\"b,b\":4,\"c\\\"c\":3,\"a'a\":5}]","payloadType":"json","x":930,"y":2020,"wires":[["56c208f10d5666c4"]]},{"id":"56c208f10d5666c4","type":"csv","z":"3b9a9d02857dd4a7","name":"","sep":",","hdrin":"","hdrout":"all","multi":"one","ret":"\\n","temp":"a'a,\"b,b\",\"c\"\"c\",d.d,e`e","skip":"0","strings":true,"include_empty_strings":"","include_null_values":"","x":1110,"y":2020,"wires":[["2f27f53e4d0407a0"]]},{"id":"2f27f53e4d0407a0","type":"debug","z":"3b9a9d02857dd4a7","name":"debug 110","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":1310,"y":2020,"wires":[]},{"id":"ade3b3657fafd2a1","type":"comment","z":"3b9a9d02857dd4a7","name":"Valid CSV `col` incorrectly parsed   a'a,\"b,b\",\"c\"\"c\",d.d,e`e  →  a'a,\"b,b\",cc,d.d,e`e","info":"#### Demo 5\n\nColumn Template\n```\na'a,\"b,b\",\"c\"\"c\",d.d,e`e\n```\n\nparses as\n```\n['a'a', 'b,b', 'cc', 'd.d', 'e`e']\n//a'a   \"b,b\"   cc    d.d    e`e\n```\n\nAccording to RFC4180 2.7 this should be\n```\n[ \"a'a\", 'b,b', 'c\"c', 'd.d', 'e`e' ]\n// a'a    b,b    c\"c    d.d    e`e    just values\n// a'a,\"b,b\",\"c\"\"c\",d.d,e`e           as a CSV string\n```\n\n\n--- \n\nUsing https://www.npmjs.com/package/@vanillaes/csv\n\n```\ncsv.parse('a\\'a,\"b,b\",\"c\"\"c\",d.d,e`e')\n// [ [ \"a'a\", 'b,b', 'c\"c', 'd.d', 'e`e' ] ]\n```","x":1150,"y":1980,"wires":[]},{"id":"a93a707f65d0e980","type":"csv","z":"3b9a9d02857dd4a7","name":"","sep":",","hdrin":"","hdrout":"all","multi":"one","ret":"\\n","temp":"a'a,\"b,b\",c\"\"c,d.d,e`e","skip":"0","strings":true,"include_empty_strings":"","include_null_values":"","x":1110,"y":1740,"wires":[["a848ea05db8a81f3"]]},{"id":"ce107b562b9d768b","type":"inject","z":"3b9a9d02857dd4a7","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"[{\"c\\\"c\":3,\"d.d\":4,\"b,b\":2,\"e`e\":5,\"a'a\":1},{\"d.d\":2,\"e`e\":1,\"b,b\":4,\"c\\\"c\":3,\"a'a\":5}]","payloadType":"json","x":930,"y":1740,"wires":[["a93a707f65d0e980"]]},{"id":"a848ea05db8a81f3","type":"debug","z":"3b9a9d02857dd4a7","name":"debug 108","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":1310,"y":1740,"wires":[]},{"id":"a535f1c991015aca","type":"comment","z":"3b9a9d02857dd4a7","name":"Incorrectly parses `col`  a'a,\"b,b\",c\"\"c,d.d,e`e  - as -  ['a'a,b', 'b,cc', 'd.d', 'e`e']      (RFC4180 2.6)","info":"#### Demo 3\n\nColumn Template\n```\na'a,\"b,b\",c\"\"c,d.d,e`e\n```\n\nparses as\n```\n['a'a,b', 'b,cc', 'd.d', 'e`e']\n```\n\nAccording to RFC4180 2.6 the CSV for the template is invalid:\n> Fields containing line breaks (CRLF), double quotes, and commas\n>  should be enclosed in double-quotes\n\n\nUsing https://www.npmjs.com/package/@vanillaes/csv\n\n```\ncsv.parse('a\\'a,\"b,b\",c\"\"c,d.d,e`e')\n// Uncaught Error: CSVError: Illegal state [row:1, col:3]\n```","x":1190,"y":1700,"wires":[]},{"id":"5f8644df15a13de8","type":"inject","z":"3b9a9d02857dd4a7","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"[{\"c\\\"c\":3,\"d.d\":4,\"b,b\":2,\"e`e\":5,\"a'a\":1},{\"d.d\":2,\"e`e\":1,\"b,b\":4,\"c\\\"c\":3,\"a'a\":5}]","payloadType":"json","x":930,"y":2160,"wires":[["5c78de4ec669b8a6"]]},{"id":"5c78de4ec669b8a6","type":"csv","z":"3b9a9d02857dd4a7","name":"","sep":",","hdrin":"","hdrout":"all","multi":"one","ret":"\\n","temp":"a'a ,\" b,b\",\" c\"\"c\",d.d, e`e","skip":"0","strings":true,"include_empty_strings":"","include_null_values":"","x":1090,"y":2160,"wires":[["d4cd02472bf4a44f"]]},{"id":"d4cd02472bf4a44f","type":"debug","z":"3b9a9d02857dd4a7","name":"debug 111","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":1230,"y":2160,"wires":[]},{"id":"b21a46644ce246f1","type":"comment","z":"3b9a9d02857dd4a7","name":"Valid CSV `col` incorrectly parsed     a'a ,\" b,b\",\" c\"\"c\",d.d, e`e  →   ['a'a', 'b,b', 'cc', 'd.d', 'e`e']  (RFC4180 2.4)","info":"#### Demo 6\n\nColumn Template\n```\na'a ,\" b,b\",\" c\"\"c\",d.d, e`e\n```\n\nparses as\n```\n['a'a', 'b,b', 'cc', 'd.d', 'e`e']\n//a'a,\"b,b\",cc,d.d,e`e             as a CSV string\n```\n\nAccording to RFC4180 2.4 this should be\n```\n[ \"a'a \", ' b,b', ' c\"c', 'd.d', ' e`e' ]\n//\"a'a \",\" b,b\",\" c\"\"c\",\" d.d\",\" e`e\"     as a CSV string\n```\n\n\n--- \n\nUsing https://www.npmjs.com/package/@vanillaes/csv\n\n```\ncsv.parse('a\\'a ,\" b,b\",\" c\"\"c\",d.d, e`e')\n[ \"a'a \", ' b,b', ' c\"c', 'd.d', ' e`e' ]\n```","x":1230,"y":2120,"wires":[]},{"id":"c95c51ee79981882","type":"csv","z":"3b9a9d02857dd4a7","name":"","sep":",","hdrin":"","hdrout":"all","multi":"one","ret":"\\n","temp":"a'a,\"b,b\",c\"\"c,d.d,e`e","skip":"0","strings":true,"include_empty_strings":"","include_null_values":"","x":1110,"y":1880,"wires":[["4b3bb17567930944"]]},{"id":"f2066292ecf6c372","type":"inject","z":"3b9a9d02857dd4a7","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"[{\"c\\\"c\":3,\"d.d\":4,\"b,b\":2,\"e`e\":5,\"a'a\":1},{\"d.d\":2,\"e`e\":1,\"b,b\":4,\"c\\\"c\":3,\"a'a\":5}]","payloadType":"json","x":930,"y":1880,"wires":[["c95c51ee79981882"]]},{"id":"4b3bb17567930944","type":"debug","z":"3b9a9d02857dd4a7","name":"debug 112","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","x":1310,"y":1880,"wires":[]},{"id":"f20fd2e9a3d93b60","type":"comment","z":"3b9a9d02857dd4a7","name":"Incorrectly parses `col`  a'a,\"b,b\",\"c\"c\",d.d,e`e  - as -  a'a,\"b,b\",cc,d.d,e`e      (RFC4180 2.7)","info":"#### Demo 4\n\nColumn Template\n```\na'a,\"b,b\",\"c\"c\",d.d,e`e\n```\n\nincorrectly parses as\n```\na'a,\"b,b\",cc,d.d,e`e\n```\n\nAccording to RFC4180 2.6 the CSV for the template is invalid:\n> Fields containing line breaks (CRLF), double quotes, and commas\n>  should be enclosed in double-quotes\n","x":1170,"y":1840,"wires":[]},{"id":"08bc532a384512b8","type":"inject","z":"3b9a9d02857dd4a7","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":895,"y":2380,"wires":[["344cc244e2924a56"]],"l":false},{"id":"344cc244e2924a56","type":"template","z":"3b9a9d02857dd4a7","name":"RFC4180 compliant CSV","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"a'a,\"b,b\",\"c\"\"c\",d.d,e`e\n1,2,3,4,5\n5,4,3,2,1","output":"str","x":1070,"y":2380,"wires":[["2937c4b7db968a2d"]]},{"id":"2937c4b7db968a2d","type":"csv","z":"3b9a9d02857dd4a7","name":"","sep":",","hdrin":true,"hdrout":"all","multi":"mult","ret":"\\n","temp":"","skip":"0","strings":true,"include_empty_strings":"","include_null_values":"","x":1250,"y":2380,"wires":[["7384a10937784487"]]},{"id":"7384a10937784487","type":"debug","z":"3b9a9d02857dd4a7","name":"gets prop name  c\"c  wrong","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":1430,"y":2380,"wires":[]},{"id":"34442861f7ba5ee0","type":"comment","z":"3b9a9d02857dd4a7","name":"\"c\"\"c\",d.d,\"b,b\",e`e,a'a \\n 3,4,2,5,1 \\n 3,2,4,1,5","info":"#### Demo 6\n\nColumn Template\n```\na'a ,\" b,b\",\" c\"\"c\",d.d, e`e\n```\n\nparses as\n```\n['a'a', 'b,b', 'cc', 'd.d', 'e`e']\n//a'a,\"b,b\",cc,d.d,e`e             as a CSV string\n```\n\nAccording to RFC4180 2.4 this should be\n```\n[ \"a'a \", ' b,b', ' c\"c', 'd.d', ' e`e' ]\n//\"a'a \",\" b,b\",\" c\"\"c\",\" d.d\",\" e`e\"     as a CSV string\n```\n\n\n--- \n\nUsing https://www.npmjs.com/package/@vanillaes/csv\n\n```\ncsv.parse('a\\'a ,\" b,b\",\" c\"\"c\",d.d, e`e')\n[ \"a'a \", ' b,b', ' c\"c', 'd.d', ' e`e' ]\n```","x":1070,"y":2320,"wires":[]},{"id":"ca71df2df82ef0bc","type":"inject","z":"3b9a9d02857dd4a7","name":"Object data with prop names  a'a  b  c\"c","props":[{"p":"payload"},{"p":"topic","vt":"str"},{"p":"payloadCopy","v":"[{\"a'a\":\"A1\",\"b\":\"B1\",\"c\\\"c\":\"C1\"},{\"a'a\":\"A2\",\"b\":\"B2\",\"c\\\"c\":\"C2\"}]","vt":"json"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"[{\"a'a\":\"A1\",\"b\":\"B1\",\"c\\\"c\":\"C1\"},{\"a'a\":\"A2\",\"b\":\"B2\",\"c\\\"c\":\"C2\"}]","payloadType":"json","x":1030,"y":2580,"wires":[["7568fe04a18602b3","b08d04c713911cfe"]]},{"id":"7568fe04a18602b3","type":"csv","z":"3b9a9d02857dd4a7","name":"","sep":",","hdrin":"","hdrout":"all","multi":"one","ret":"\\n","temp":"","skip":"0","strings":true,"include_empty_strings":"","include_null_values":"","x":1090,"y":2640,"wires":[["401aa7959500fa9b","854aa86a2263b67d"]]},{"id":"401aa7959500fa9b","type":"debug","z":"3b9a9d02857dd4a7","name":"Input data as CSV","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":1430,"y":2640,"wires":[]},{"id":"c53905fdfdd7ce03","type":"comment","z":"3b9a9d02857dd4a7","name":"Output invalid CSV (outputs header row as  a'a,b,c\"c  but it should be  a'a,b,\"c\"\"c\"  (RFC4180 2.7)","info":"> 2.7\n> If double-quotes are used to enclose fields, then a double-quote\n> appearing inside a field must be escaped by preceding it with\n> another double quote","x":1200,"y":2540,"wires":[]},{"id":"854aa86a2263b67d","type":"csv","z":"3b9a9d02857dd4a7","name":"","sep":",","hdrin":"","hdrout":"all","multi":"one","ret":"\\n","temp":"","skip":"0","strings":true,"include_empty_strings":"","include_null_values":"","x":1090,"y":2700,"wires":[["a5cd2b4bf639c39a"]]},{"id":"a5cd2b4bf639c39a","type":"debug","z":"3b9a9d02857dd4a7","name":"Input data back to Object","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":1410,"y":2700,"wires":[]},{"id":"b08d04c713911cfe","type":"debug","z":"3b9a9d02857dd4a7","name":"Input data","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","x":1460,"y":2580,"wires":[]}]

Environment

  • Node-RED version: V3+
  • Node.js version: 16
  • npm version: 8
  • Platform/OS: All / NA
  • Browser: All / NA
@Steve-Mcl
Copy link
Contributor Author

@dceejay has already started works in branch CSV-fix-n-handling-inside-text-fields

@Steve-Mcl Steve-Mcl changed the title CSV mode outputs non valid RFC4180 CSV data CSV node outputs non valid RFC4180 CSV data under certain conditions Oct 29, 2022
@Steve-Mcl Steve-Mcl added this to the 4.0 milestone Jun 15, 2023
@Steve-Mcl
Copy link
Contributor Author

@dceejay

Following our discussion on RFC for CSV node, I am almost finished (tests written etc).

One last bone of contention for me is the use of regex for the number parsing. For one, regex is not known for its super high performance and it fails to convert hexadecimal/octal/binary notation too.

So, since number parsing is on the hot path, I have been doing some performance testing and seeing if there is a better AND faster method.

TL;DR

I am proposing using the + cast method for number parsing. In tests (100000 iterations) it is consistently 1.93 times faster (often more) than the original regex. Additionally, it also correctly converts 0x0, 0b0 and 0o0 into numbers

Do you have any objections?

toNumber (proposed)

function toNumber(value) {
    const v = +value
    const isNumber = !isNaN(v)
    return isNumber ? v : value
}
if (parseNumbers) {
    v = toNumber(v)
}

regex (original)

var isNumberTest = /^ *[-]?(?!E)(?!0\d)\d*\.?\d*(E-?\+?)?\d+ *$/i;
if (parseNumbers &&  isNumberTest.test(v)) {
    v = parseFloat(v)
}


performance testing details

2 data sets,:

  1. Array of 55 items, 40 good number strings, 15 bad/non number strings
  2. Array of 12 items, all strings (should defiantly not be parsed to numbers)

3 conversion functions

  1. toNumber_regex - using original regex & parseFloat method
  2. toNumber_parseFloat - using isNaN and parseFloat
  3. toNumber_proposed - using + cast

Summary of Data set 1

testNumbers[55], 100000 iterations. 40 numbers, 15 non numbers

  • toNumber_proposed (winner)
    • Operations : 5500000
    • Pass rate% : 100.00%
    • Performance: 342.5534999370575ms
  • toNumber_regex (2nd fastest)
    • Operations : 5500000
    • Pass rate% : 74.55%
    • Performance: 662.5470999479294ms
  • toNumber_parseFloat (slowest)
    • Operations : 5500000
    • Pass rate% : 83.64%
    • Performance: 736.0683000087738ms

toNumber_proposed is 2.15 times faster than toNumber_parseFloat
toNumber_proposed is 1.93 times faster than toNumber_regex

Summary of Data set 2

testStrings[12], 100000 iterations. (all strings, no numbers)

  • toNumber_proposed (winner)
    • Pass rate% : 100.00%
    • Performance: 62.891900062561035ms
  • toNumber_regex (2nd fastest)
    • Pass rate% : 100.00%
    • Performance: 130.63719999790192ms
  • toNumber_parseFloat (slowest)
    • Pass rate% : 100.00%
    • Performance: 136.92980003356934ms

toNumber_proposed is 2.18 times faster than toNumber_parseFloat
toNumber_proposed is 2.08 times faster than toNumber_regex


Full test results with detail (click to expand)

################################################################################
Data set 'testNumbers[55], 100000 iterations.  40 numbers, 15 non numbers

------------------------------toNumber_parseFloat-------------------------------
Operations : 5500000
Pass rate% : 83.64%
Performance: 736.0683000087738ms
Failures:
┌─────────┬──────────┬────────┬──────────┐
│ (index) │   test   │ result │ expected │
├─────────┼──────────┼────────┼──────────┤
│    0    │  '0x1'   │   0    │    1     │
│    1    │  '0x01'  │   0    │    1     │
│    2    │ '0x001'  │   0    │    1     │
│    3    │  '0b1'   │   0    │    1     │
│    4    │ '0b0001' │   0    │    1     │
│    5    │ '0b1010' │   0    │    10    │
│    6    │  '0o1'   │   0    │    1     │
│    7    │ '0o001'  │   0    │    1     │
│    8    │ '0o123'  │   0    │    83    │
└─────────┴──────────┴────────┴──────────┘
--------------------------------------------------------------------------------

---------------------------------toNumber_regex---------------------------------
Operations : 5500000
Pass rate% : 74.55%
Performance: 662.5470999479294ms
Failures:
┌─────────┬────────────┬────────────┬──────────┐
│ (index) │    test    │   result   │ expected │
├─────────┼────────────┼────────────┼──────────┤
│    0    │   '0x0'    │   '0x0'    │    0     │
│    1    │ '0x000000' │ '0x000000' │    0     │
│    2    │   '0x1'    │   '0x1'    │    1     │
│    3    │   '0x01'   │   '0x01'   │    1     │
│    4    │  '0x001'   │  '0x001'   │    1     │
│    5    │   '0b0'    │   '0b0'    │    0     │
│    6    │ '0b000000' │ '0b000000' │    0     │
│    7    │   '0b1'    │   '0b1'    │    1     │
│    8    │  '0b0001'  │  '0b0001'  │    1     │
│    9    │  '0b1010'  │  '0b1010'  │    10    │
│   10    │   '0o1'    │   '0o1'    │    1     │
│   11    │  '0o001'   │  '0o001'   │    1     │
│   12    │  '0o123'   │  '0o123'   │    83    │
│   13    │   '007'    │   '007'    │    7     │
└─────────┴────────────┴────────────┴──────────┘
--------------------------------------------------------------------------------

-------------------------------toNumber_proposed--------------------------------
Operations : 5500000
Pass rate% : 100.00%
Performance: 342.5534999370575ms
--------------------------------------------------------------------------------

################################################################################
Data set 'testStrings[12], 100000 iterations.  (all strings, no numbers)

------------------------------toNumber_parseFloat-------------------------------
Operations : 1200000
Pass rate% : 100.00%
Performance: 136.92980003356934ms
--------------------------------------------------------------------------------

---------------------------------toNumber_regex---------------------------------
Operations : 1200000
Pass rate% : 100.00%
Performance: 130.63719999790192ms
--------------------------------------------------------------------------------

-------------------------------toNumber_proposed--------------------------------
Operations : 1200000
Pass rate% : 100.00%
Performance: 62.891900062561035ms
--------------------------------------------------------------------------------


Performance test code (click to expand)

const isNumberTest = /^ *[-]?(?!E)(?!0\d)\d*\.?\d*(E-?\+?)?\d+ *$/i;

function toNumber_parseFloat(value) {
    const v = parseFloat(value);
    const isNumber = !isNaN(value - v);
    return isNumber ? v : value
}
function toNumber_regex(value) {
    const isNumber = isNumberTest.test(value)
    return isNumber ? parseFloat(value) : value
}

function toNumber_proposed(value) {
    const v = +value
    const isNumber = !isNaN(v)
    return isNumber ? v : value
}

const testNumbers = [
    { value: "0x0", expect: 0},
    { value: "0x000000", expect: 0},
    { value: "0x1", expect: 1},
    { value: "0x01", expect: 1},
    { value: "0x001", expect: 1},

    { value: "0b0", expect: 0},
    { value: "0b000000", expect: 0},
    { value: "0b1", expect: 1},
    { value: "0b0001", expect: 1},
    { value: "0b1010", expect: 10},

    { value: "0o1", expect: 1},
    { value: "0o001", expect: 1},
    { value: "0o123", expect: 83},

    { value: "007", expect: 7},
    { value: "0", expect: 0},
    { value: "0 ", expect: 0},
    { value: "0.0", expect: 0},
    { value: "0.0 ", expect: 0},
    { value: "0.1", expect: 0.1},
    { value: " 0.1", expect: 0.1},
    { value: "0.1 ", expect: 0.1},
    { value: " 0.1 ", expect: 0.1},
    { value: "0.1.1", expect: "0.1.1"},
    { value: "0.000000000000" , expect: 0},
    
    { value: "1 ", expect: 1},
    { value: "1", expect: 1},
    { value: "1.1", expect: 1.1},
    { value: " 1.1", expect: 1.1},
    { value: "1.1 ", expect: 1.1},
    { value: " 1.1 ", expect: 1.1},
    { value: "1.1.1", expect: "1.1.1"},
    { value: "1.000000000000" , expect: 1},

    { value: "123", expect: 123},
    { value: "3.14", expect: 3.14},
    { value: "3.14.1", expect: "3.14.1"},
    
    { value: "-5", expect: -5},
    { value: "-5.1", expect: -5.1},
    { value: "-5.1.1", expect: "-5.1.1"},
    
        
    { value: "-0.5", expect: -0.5},
    { value: "-1.23", expect: -1.23},
    { value: "-1.23.1", expect: "-1.23.1"},

    { value: "2e3", expect: 2000},
    { value: "1E-4", expect: 0.0001},
    { value: "1e+12", expect: 1000000000000},
    { value: "1e+12.1", expect: "1e+12.1"},
    { value: "1e+12.1.1", expect: "1e+12.1.1"},
    { value: "1F-4", expect: "1F-4"},
    { value: "1d-4", expect: "1d-4"},

    { value: "1.7976931348623157e+308", expect: 1.7976931348623157e+308},

    // strings that should not be considered numbers
    { value: "1 2", expect: "1 2"},
    { value: " 1 2", expect: " 1 2"},
    { value: "foo123", expect: "foo123"},
    { value: "123foo", expect: "123foo"},
    { value: "123foo456", expect: "123foo456"},
    { value: "123 bar", expect: "123 bar"},
]

// just strings! should NOT be considered numbers
const testStrings = [
    { value: "1 2", expect: "1 2"},
    { value: " 1 2", expect: " 1 2"},
    { value: "foo123", expect: "foo123"},
    { value: "123foo", expect: "123foo"},
    { value: "123foo456", expect: "123foo456"},
    { value: "foo123bar", expect: "foo123bar"},
    { value: "foo 123", expect: "foo 123"},
    { value: "123 bar", expect: "123 bar"},
    { value: "0 foo 123 bar 456", expect: "0 foo 123 bar 456"},
    { value: "a very long long long looooooooooooooooooooooooooooong string", expect: "a very long long long looooooooooooooooooooooooooooong string"},
    { value: "123 numbers followed by a very long long long looooooooooooooooooooooooooooong string", expect: "123 numbers followed by a very long long long looooooooooooooooooooooooooooong string"},
    { value: "123 numbers followed by a very long long long looooooooooooooooooooooooooooong string followed by numbers 123", expect: "123 numbers followed by a very long long long looooooooooooooooooooooooooooong string followed by numbers 123"},
]
const iterations = 100000
console.log('################################################################################')
console.log(`Data set 'testNumbers[${testNumbers.length}], ${iterations} iterations.  ${testNumbers.filter(e => typeof e.expect === 'number').length} numbers, ${testNumbers.filter(e => typeof e.expect !== 'number').length} non numbers`)
console.log('')
const testResults1 = doTest(toNumber_parseFloat, testNumbers, iterations)
const testResults2 = doTest(toNumber_regex, testNumbers, iterations)
const testResults3 = doTest(toNumber_proposed, testNumbers, iterations)
logResults([testResults1, testResults2, testResults3])

console.log('################################################################################')
console.log(`Data set 'testStrings[${testStrings.length}], ${iterations} iterations.  (all strings, no numbers)`)
console.log('')
const testResults1b = doTest(toNumber_parseFloat, testStrings, iterations)
const testResults2b = doTest(toNumber_regex, testStrings, iterations)
const testResults3b = doTest(toNumber_proposed, testStrings, iterations)
logResults([testResults1b, testResults2b, testResults3b])

function doTest(testFunction, testValues, iterations = 10000) {
    const performanceStart = performance.now();
    const opCount = testValues.length * iterations;
    const failures = {};
    let passCount = 0;
    for (let i = 0; i < iterations; i++) {
        testValues.forEach(({value, expect}) => {
            const result = testFunction(value);
            if (result === expect) {
                passCount++
            } else {
                failures[value] = {
                    result: result,
                    expected: expect
                }
            }
        });
    }
    return { functionName: testFunction.name, opCount, passCount, failures, duration: performance.now() - performanceStart }
}

function logResults(testResults) {
    const arr = Array.isArray(testResults) ? testResults : [testResults]
    for (const testResult of arr) {
        const {functionName, opCount, passCount, failures, duration} = testResult
        const fnLen = functionName.length
        const header = fnLen > 80 ? functionName : functionName.padStart((80 - fnLen) / 2 + fnLen, "-").padEnd(80, "-")
        console.log(header);
        console.log("Operations : " + opCount)
        console.log("Pass rate% : " + (passCount / opCount * 100).toFixed(2) + "%")
        console.log("Performance: " + duration + "ms");
        if (Object.keys(failures).length) {
            console.log("Failures:")
            console.table(Object.entries(failures).map(([key, val]) => { return { test: key, ...val } }))
        }
        console.log("--------------------------------------------------------------------------------");
        console.log("");
    }
}

@Steve-Mcl
Copy link
Contributor Author

@dceejay ^ scratch that ^

It would mess with things like phone numbers 0800100100 and the likes :(

Was a nice idea until reality got in the way.

PR incoming for the work we discussed last week.

@dceejay
Copy link
Member

dceejay commented Jan 21, 2024

Also would it have worked with scientific notation ? 3E5 10E-2 etc ?

@Steve-Mcl
Copy link
Contributor Author

Steve-Mcl commented Jan 21, 2024

Also would it have worked with scientific notation ? 3E5 10E-2 etc ?

@dceejay Yeah, +casting does handle scientific notation too.


It would mess with things like phone numbers 0800100100 and the likes :(

I think I have a compromise which yields 13%+ performance increase (not as snappy as previous try out, but it passes all the tests and gives us 0x0, 0b0, 0o0 parsing too)

By simplifying the number check regex and reversing the logic (i.e. find the things we know we DONT want to convert) then let the +cast do its bit.

// original regex
const isNumberTest = /^ *[-]?(?!E)(?!0\d)\d*\.?\d*(E-?\+?)?\d+ *$/i;
function toNumber_regex(value) {
    const isNumber = isNumberTest.test(value)
    return isNumber ? parseFloat(value) : value
}

// alt logic
const skipNumberConversion = /^ *(\+|-0\d|0\d)/
function toNumber_proposed_alt(value) {
    if(!skipNumberConversion.test(value)) {
        const v = +value
        return !isNaN(v) ? v : value
    }
    return value
}

operational comparison

image


Performance Summary

  • toNumber_proposed_alt is ~12.8% faster than toNumber_regex when the CSV data strings are mostly numerical
  • toNumber_proposed_alt is ~37% faster than toNumber_regex when NONE of the CSV data strings are numerical
  • toNumber_proposed_alt is ~13% faster than toNumber_regex when the CSV data is all numerical strings
Performance Details (click to view)

################################################################################
Data set 'testNumbers[60], 200000 iterations.  40 numbers, 20 non numbers

---------------------------------toNumber_regex---------------------------------
Operations : 12000000
Pass rate% : 78.33%
Performance: 1214.7373000383377ms
Failures: 13...
┌─────────┬────────────┬────────────┬──────────┐
│ (index) │    test    │   result   │ expected │
├─────────┼────────────┼────────────┼──────────┤
│    0    │   '0x0'    │   '0x0'    │    0     │
│    1    │ '0x000000' │ '0x000000' │    0     │
│    2    │   '0x1'    │   '0x1'    │    1     │
│    3    │   '0x01'   │   '0x01'   │    1     │
│    4    │  '0x001'   │  '0x001'   │    1     │
│    5    │   '0b0'    │   '0b0'    │    0     │
│    6    │ '0b000000' │ '0b000000' │    0     │
│    7    │   '0b1'    │   '0b1'    │    1     │
│    8    │  '0b0001'  │  '0b0001'  │    1     │
│    9    │  '0b1010'  │  '0b1010'  │    10    │
│   10    │   '0o1'    │   '0o1'    │    1     │
│   11    │  '0o001'   │  '0o001'   │    1     │
│   12    │  '0o123'   │  '0o123'   │    83    │
└─────────┴────────────┴────────────┴──────────┘
--------------------------------------------------------------------------------

-----------------------------toNumber_proposed_alt------------------------------
Operations : 12000000
Pass rate% : 100.00%
Performance: 1059.2009999752045ms
--------------------------------------------------------------------------------

toNumber_proposed_alt is ~12.8% faster than toNumber_regex
when there are mostly numbers in the strings

################################################################################
Data set 'testStrings[12], 200000 iterations.  (all strings, no numbers)

---------------------------------toNumber_regex---------------------------------
Operations : 2400000
Pass rate% : 100.00%
Performance: 307.0346999168396ms
--------------------------------------------------------------------------------

-----------------------------toNumber_proposed_alt------------------------------
Operations : 2400000
Pass rate% : 100.00%
Performance: 193.2387000322342ms
--------------------------------------------------------------------------------

toNumber_proposed_alt is ~37% faster than toNumber_regex
when the CSV data strings are NOT numerical

################################################################################
Data set 'allNumbers[40], 200000 iterations.  (all numbers)

---------------------------------toNumber_regex---------------------------------
Operations : 8000000
Pass rate% : 67.50%
Performance: 943.0785000324249ms
Failures: 13...
┌─────────┬────────────┬────────────┬──────────┐
│ (index) │    test    │   result   │ expected │
├─────────┼────────────┼────────────┼──────────┤
│    0    │   '0x0'    │   '0x0'    │    0     │
│    1    │ '0x000000' │ '0x000000' │    0     │
│    2    │   '0x1'    │   '0x1'    │    1     │
│    3    │   '0x01'   │   '0x01'   │    1     │
│    4    │  '0x001'   │  '0x001'   │    1     │
│    5    │   '0b0'    │   '0b0'    │    0     │
│    6    │ '0b000000' │ '0b000000' │    0     │
│    7    │   '0b1'    │   '0b1'    │    1     │
│    8    │  '0b0001'  │  '0b0001'  │    1     │
│    9    │  '0b1010'  │  '0b1010'  │    10    │
│   10    │   '0o1'    │   '0o1'    │    1     │
│   11    │  '0o001'   │  '0o001'   │    1     │
│   12    │  '0o123'   │  '0o123'   │    83    │
└─────────┴────────────┴────────────┴──────────┘
--------------------------------------------------------------------------------

-----------------------------toNumber_proposed_alt------------------------------
Operations : 8000000
Pass rate% : 100.00%
Performance: 817.5798999071121ms
--------------------------------------------------------------------------------

toNumber_proposed_alt is ~13% faster than toNumber_regex
when the data is all numerical strings



Differences that we need to agree:

  1. RFC MODE - Will parse hex/oct/bin notation numbers - CHANGE
  2. RFC MODE - All other strings (not covered by point 1) should yield the same result as lecagy mode - NO CHANGE
  3. LEGACY MODE - CSV parsing will use the original regex - NO CHANGE


Final words

I have now spent far too much time on this in the pursuit of a few ms. What do they say about premature optimisation? 😆

Steve-Mcl added a commit to Steve-Mcl/node-red that referenced this issue Jan 23, 2024
@Steve-Mcl Steve-Mcl linked a pull request Jan 23, 2024 that will close this issue
6 tasks
@GogoVega
Copy link
Contributor

GogoVega commented Apr 4, 2024

Hi @Steve-Mcl, can you please close this issue (resolved in 4540) because it seems that GH did not do it. Thanks 🙂

@Steve-Mcl
Copy link
Contributor Author

It will be closed when merged into master

@knolleary
Copy link
Member

I have removed the needs-triage label and added fixed label

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants