Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty trailing fields populated with indexes after resuming a paused parser, loses place #985

Open
landisdesign opened this issue Mar 23, 2023 · 11 comments

Comments

@landisdesign
Copy link

landisdesign commented Mar 23, 2023

When using parser.pause()/parser.resume(), after resuming the parser, trailing empty cells in the row returned by step have "_1", "_2", "_3" etc. instead of empty cells. Thereafter, parsing appears to hit wrong locations.

Code sandbox here:
https://codesandbox.io/s/heuristic-violet-8wrinv?file=/src/CSVParser.js

@landisdesign landisdesign changed the title Empty trailing fields populated with indexes after resuming a paused parser Empty trailing fields populated with indexes after resuming a paused parser, loses place Mar 23, 2023
@pokoli
Copy link
Collaborator

pokoli commented Mar 23, 2023

It should be fixed with 824bbd9
Please reopen if you do not agree.

@pokoli pokoli closed this as completed Mar 23, 2023
@landisdesign
Copy link
Author

I replaced the package dependency with the new file from main in the CodeSandbox above, and it's returning the same results.

The odd thing is that, in addition to the underscore indices appearing, it looks like the parser is losing its place. Fields are getting eaten and the results are shifting left.

Screenshot from code sandbox

@pokoli
Copy link
Collaborator

pokoli commented Mar 23, 2023

Please provide a simple example that we can run on our test suite.

@landisdesign
Copy link
Author

I don't have a lot of Node experience (I'm a front-end developer 🤷‍♂️) but below is a new entry for the CUSTOM_TESTS suite, based on the test CSV from the code sandbox above. It appears to illustrate the problem accurately.

	{
		description: "Pause and resume works with headers and empty trailing fields (Regression Test for Bug #985)",
		expected: [["Column 1", "Column 2", "Column 3", "Column 4", "Column 5"], [
			{ "Column 1": "R1C1", "Column 2": "", "Column 3": "R1C3", "Column 4": "", "Column 5": "" },
			{ "Column 1": "R2C1", "Column 2": "", "Column 3": "", "Column 4": "", "Column 5": "" },
			{ "Column 1": "R3C1", "Column 2": "", "Column 3": "", "Column 4": "R3C4", "Column 5": "" },
			{ "Column 1": "R4C1", "Column 2": "", "Column 3": "", "Column 4": "", "Column 5": "" },
		]],
		run: function(callback) {
			var inputString = [
				"Column 1,Column 2,Column 3,Column 4,Column 5",
				"R1C1,,R1C3,,",
				"R2C1,,,,",
				"R3C1,,,R3C4,",
				"R4C1,,,,"
			].join("\n");
			var output = [];
			var dataRows = [];
			var headers = null;
			Papa.parse(inputString, {
				header: true,
				step: function(results, parser) {
					if (results)
					{
						if (!headers) {
							headers = results.meta.fields;
						}
						parser.pause();
						parser.resume();
						if (results.data) {
							dataRows.push(results.data);
						}
					}
				},
				complete: function() {
					output.push(headers);
					output.push(dataRows);
					callback(output);
				}
			});
		}
	},

@landisdesign
Copy link
Author

(Also, since I haven't collaborated yet I can't reopen this on my own.)

@landisdesign
Copy link
Author

landisdesign commented Mar 23, 2023

I was able to confirm that this only occurs when headers are requested. pause and resume appear to work fine with trailing empty fields when headers aren't requested.

@pokoli pokoli reopened this Mar 23, 2023
@landisdesign
Copy link
Author

landisdesign commented Mar 24, 2023

I think I found the cause of this.

The first issue is how baseIndex is checked at line 1470. Currently it's checking for falsiness, but I'm finding that, at least with strings, baseIndex starts out as undefined but then becomes 0 when resuming, which causes it to check for headers multiple times if requested. (I haven't determined why this happens, but just saw it happened while adding console.log lines throughout.)

The second issue is how the headers are modified if duplicates are found at lines 1497-1501. If duplicates are found, indexes are appended, but when the first input line is reinserted into _input, cursor isn't adjusted to reflect the change in _input length.

The result of this is that Parser is reading the first row below the headers as a second set of headers, adding _n to each additional empty field which skews the cursor, leading to the results seen in the failing test.

It feels to me like it might make more sense to directly check for preexisting headers, rather than rely on baseIndex. The trick is keeping track of them throughout the parsing lifecycle.

Instead of populating _fields via fillHeaderFields(), I'd like to propose the following:

  1. Initialize _fields to null.
  2. Pass _fields as a second parameter to Parser() and save it as a local variable (fields without the underscore, perhaps).
  3. In line 1470, check fields for falsiness instead of baseIndex.
  4. If fields is falsy and config.headers is truthy, generate headerMap as it's currently done and save that value to fields, but don't change _input if duplicates were found.
  5. In returnable(), if fields is not null, add a copy of it to the returned object.
  6. In processResults(), if _results.meta.fields exists, save it back to _fields.
  7. If parseChunk() is exited then reentered, the populated _fields variable would be fed back into the new Parser object, keeping it populated for results while prevent it from being recreated.

I think this would prevent _input from getting shifted, as well as make the connection between the nullness of _fields and populating it clearer.

Does this make sense? I wanted to run this by you before making a PR, but I think I could put this together tomorrow.

@landisdesign
Copy link
Author

As I'm building this, I'm seeing how Parser.parse returns the header row as the first row of data. Since I'm proposing having Parser.parse build out results.meta.fields within itself, would it make sense to not send the header row back, and remove the special casing done at lines 1100-1101?

@landisdesign
Copy link
Author

(I got excited and created the PR anyways. 😁 )

@nickgrato
Copy link

It should be fixed with 824bbd9 Please reopen if you do not agree.

having this same issue. but does this on default... in the csv there I just empty commas ",,,,," like that and they get auto filled with _1 _2 etc....

@PierrickLozach
Copy link

This is still happening with 5.4.1. I get empty values that are populated with _x values. Would welcome a workaround. At the moment, I fix it manually from code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants