Empty trailing fields populated with indexes after resuming a paused parser, loses place #985

landisdesign · 2023-03-23T00:42:24Z

When using parser.pause()/parser.resume(), after resuming the parser, trailing empty cells in the row returned by step have "_1", "_2", "_3" etc. instead of empty cells. Thereafter, parsing appears to hit wrong locations.

Code sandbox here:
https://codesandbox.io/s/heuristic-violet-8wrinv?file=/src/CSVParser.js

The text was updated successfully, but these errors were encountered:

pokoli · 2023-03-23T08:21:31Z

It should be fixed with 824bbd9
Please reopen if you do not agree.

landisdesign · 2023-03-23T15:11:19Z

I replaced the package dependency with the new file from main in the CodeSandbox above, and it's returning the same results.

The odd thing is that, in addition to the underscore indices appearing, it looks like the parser is losing its place. Fields are getting eaten and the results are shifting left.

pokoli · 2023-03-23T15:15:12Z

Please provide a simple example that we can run on our test suite.

landisdesign · 2023-03-23T15:42:25Z

I don't have a lot of Node experience (I'm a front-end developer 🤷‍♂️) but below is a new entry for the CUSTOM_TESTS suite, based on the test CSV from the code sandbox above. It appears to illustrate the problem accurately.

	{
		description: "Pause and resume works with headers and empty trailing fields (Regression Test for Bug #985)",
		expected: [["Column 1", "Column 2", "Column 3", "Column 4", "Column 5"], [
			{ "Column 1": "R1C1", "Column 2": "", "Column 3": "R1C3", "Column 4": "", "Column 5": "" },
			{ "Column 1": "R2C1", "Column 2": "", "Column 3": "", "Column 4": "", "Column 5": "" },
			{ "Column 1": "R3C1", "Column 2": "", "Column 3": "", "Column 4": "R3C4", "Column 5": "" },
			{ "Column 1": "R4C1", "Column 2": "", "Column 3": "", "Column 4": "", "Column 5": "" },
		]],
		run: function(callback) {
			var inputString = [
				"Column 1,Column 2,Column 3,Column 4,Column 5",
				"R1C1,,R1C3,,",
				"R2C1,,,,",
				"R3C1,,,R3C4,",
				"R4C1,,,,"
			].join("\n");
			var output = [];
			var dataRows = [];
			var headers = null;
			Papa.parse(inputString, {
				header: true,
				step: function(results, parser) {
					if (results)
					{
						if (!headers) {
							headers = results.meta.fields;
						}
						parser.pause();
						parser.resume();
						if (results.data) {
							dataRows.push(results.data);
						}
					}
				},
				complete: function() {
					output.push(headers);
					output.push(dataRows);
					callback(output);
				}
			});
		}
	},

landisdesign · 2023-03-23T15:43:10Z

(Also, since I haven't collaborated yet I can't reopen this on my own.)

landisdesign · 2023-03-23T15:48:21Z

I was able to confirm that this only occurs when headers are requested. pause and resume appear to work fine with trailing empty fields when headers aren't requested.

landisdesign · 2023-03-24T06:36:33Z

I think I found the cause of this.

The first issue is how baseIndex is checked at line 1470. Currently it's checking for falsiness, but I'm finding that, at least with strings, baseIndex starts out as undefined but then becomes 0 when resuming, which causes it to check for headers multiple times if requested. (I haven't determined why this happens, but just saw it happened while adding console.log lines throughout.)

The second issue is how the headers are modified if duplicates are found at lines 1497-1501. If duplicates are found, indexes are appended, but when the first input line is reinserted into _input, cursor isn't adjusted to reflect the change in _input length.

The result of this is that Parser is reading the first row below the headers as a second set of headers, adding _n to each additional empty field which skews the cursor, leading to the results seen in the failing test.

It feels to me like it might make more sense to directly check for preexisting headers, rather than rely on baseIndex. The trick is keeping track of them throughout the parsing lifecycle.

Instead of populating _fields via fillHeaderFields(), I'd like to propose the following:

Initialize _fields to null.
Pass _fields as a second parameter to Parser() and save it as a local variable (fields without the underscore, perhaps).
In line 1470, check fields for falsiness instead of baseIndex.
If fields is falsy and config.headers is truthy, generate headerMap as it's currently done and save that value to fields, but don't change _input if duplicates were found.
In returnable(), if fields is not null, add a copy of it to the returned object.
In processResults(), if _results.meta.fields exists, save it back to _fields.
If parseChunk() is exited then reentered, the populated _fields variable would be fed back into the new Parser object, keeping it populated for results while prevent it from being recreated.

I think this would prevent _input from getting shifted, as well as make the connection between the nullness of _fields and populating it clearer.

Does this make sense? I wanted to run this by you before making a PR, but I think I could put this together tomorrow.

landisdesign · 2023-03-24T16:49:18Z

As I'm building this, I'm seeing how Parser.parse returns the header row as the first row of data. Since I'm proposing having Parser.parse build out results.meta.fields within itself, would it make sense to not send the header row back, and remove the special casing done at lines 1100-1101?

landisdesign · 2023-03-24T21:48:55Z

(I got excited and created the PR anyways. 😁 )

nickgrato · 2023-09-24T00:42:17Z

It should be fixed with 824bbd9 Please reopen if you do not agree.

having this same issue. but does this on default... in the csv there I just empty commas ",,,,," like that and they get auto filled with _1 _2 etc....

PierrickLozach · 2023-10-08T14:39:07Z

This is still happening with 5.4.1. I get empty values that are populated with _x values. Would welcome a workaround. At the moment, I fix it manually from code.

landisdesign changed the title ~~Empty trailing fields populated with indexes after resuming a paused parser~~ Empty trailing fields populated with indexes after resuming a paused parser, loses place Mar 23, 2023

pokoli closed this as completed Mar 23, 2023

pokoli reopened this Mar 23, 2023

landisdesign mentioned this issue Mar 30, 2023

Generate headers in Parser, based upon presence of _fields, without changing _input (fixes issue #985) #989

Open

tjstavenger-pnnl mentioned this issue May 3, 2023

CSV parsing is broken in 5.4.1 #998

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Empty trailing fields populated with indexes after resuming a paused parser, loses place #985

Empty trailing fields populated with indexes after resuming a paused parser, loses place #985

landisdesign commented Mar 23, 2023 •

edited

Loading

pokoli commented Mar 23, 2023

landisdesign commented Mar 23, 2023

pokoli commented Mar 23, 2023

landisdesign commented Mar 23, 2023

landisdesign commented Mar 23, 2023

landisdesign commented Mar 23, 2023 •

edited

Loading

landisdesign commented Mar 24, 2023 •

edited

Loading

landisdesign commented Mar 24, 2023

landisdesign commented Mar 24, 2023

nickgrato commented Sep 24, 2023

PierrickLozach commented Oct 8, 2023

Empty trailing fields populated with indexes after resuming a paused parser, loses place #985

Empty trailing fields populated with indexes after resuming a paused parser, loses place #985

Comments

landisdesign commented Mar 23, 2023 • edited Loading

pokoli commented Mar 23, 2023

landisdesign commented Mar 23, 2023

pokoli commented Mar 23, 2023

landisdesign commented Mar 23, 2023

landisdesign commented Mar 23, 2023

landisdesign commented Mar 23, 2023 • edited Loading

landisdesign commented Mar 24, 2023 • edited Loading

landisdesign commented Mar 24, 2023

landisdesign commented Mar 24, 2023

nickgrato commented Sep 24, 2023

PierrickLozach commented Oct 8, 2023

landisdesign commented Mar 23, 2023 •

edited

Loading

landisdesign commented Mar 23, 2023 •

edited

Loading

landisdesign commented Mar 24, 2023 •

edited

Loading