Bugfix for Complex JSON Array Unexpected Behavior #90

hdwatts · 2021-01-11T17:32:53Z

Status

READY

Description

I had been dealing with issues similar to Issue #78 - and pulled in PR #81 to help resolve those issues. However, I found a few further exceptions with complex array structures and schemas. I believe this PR is an alternative to PR #81, although their PR mentions fillGaps which I don't think I address here. Unfortunately there are no tests in that PR so I do not know if this addresses the fillGaps issue mentioned.

The method used here is to keep a stricter track of the csv headers, and ensure that we are only ever adding to a line if the current element is supposed to come after the previous element.

[{
    companies: [
    {
      display_name: "A",
      investments: [{ name: "B" }, { name: "C" }],
    },
    {
      display_name: "B",
      investments: [{ name: "D" }, { name: "E" }],
    },
  ]
}]

Old Logic

Step	Current Row	Comments
0	[companies.display_name, companies.investments.name]	Calculate Headers, New row needed
1	[A, null]	Populate company display name
2	[A, B]	Populate first investment name
3	[A, B]	New row needed
4	[null, C]	Populate second investment name
5	[B, C]	"B" gets placed here since it was `null`, falsely applying investment "C" to company "B"

New Logic

Step	Current Row	Comments
0	[companies.display_name, companies.investments.name]	Calculate Headers, New row needed
1	[A, null]	Populate company display name
2	[A, B]	Populate first investment name, it comes after the previously placed "A", so it is placed on the same row
3	[A, B]	New row needed
4	[null, C]	C is placed at its correct index, New row needed
5	[B, null]	"B" is associated with a header that comes before the previously placed "C", so it is associated to a new row, as appropriate

The other exceptions are noted in the tests - please let me know if you have any questions.

Related PRs

List related PRs against other branches:

branch	PR
Original Bugfix PR	link

Todos

Tests
Documentation

Steps to Test or Reproduce

Outline the steps to test or reproduce the PR here.

Tests have been added.

npm run test

Impacted Areas in Application

List general components of the application that this PR will affect:

Exporting complex JSON array structures as CSV

Update iuabhmalat/jsonexport fork with changes from kaue/jsonexport

AckerApple · 2021-01-12T22:21:16Z

Link to example of current results relevant to claim in description

(UPDATE: note to self that issuer has provided updated link example in next comment)

hdwatts · 2021-01-12T22:23:10Z

Link to example of current results relevant to claim in description

Apologies, I should have wrapped the object with brackets in my PR (updated). You will see the behavior then.

Here is an updated link.

AckerApple · 2021-01-13T00:59:39Z

Acknowledged. Due to complexity it will take me another round of reviewing and examining.

It would behoove you to attempt a most basic summation if possible for me. It’s taking me a lot of energy to compare results and dissect just exactly is the all encompassing view of this work.

I believe you may have something here. Just trying so extra careful to make sure we have full understanding and solid documentation.

If week or more goes by, please feel free to ping a comment again as my plate if fairly full

hdwatts · 2021-01-14T03:23:58Z

Sure thing, here is a breakdown of my changes. Apologies for the long write up, I just got diagnosed
with COVID and so have a lot of time on my hands.

The Problem

As stated in my initial write up, I noticed that as the results within a complex JSON array were being
iterated upon that there were easily replicable situations where items will get inserted into the wrong
row of the CSV.

The table copied from above demonstrates the logic flow.

Step	Current Row	Comments
0	[companies.display_name, companies.investments.name]	Calculate Headers, New row needed
1	[A, null]	Populate company display name
2	[A, B]	Populate first investment name
3	[A, B]	New row needed
4	[null, C]	Populate second investment name
5	[B, C]	"B" gets placed here since it was `null`, falsely applying investment "C" to company "B"

The issue noted in step 5 is due to the current code only checking if the header index is not undefined.

//
// Line 96 in lib/parser/csv.js
//
let elementHeaderIndex = getHeaderIndex(element.item);
if (currentRow[elementHeaderIndex] != undefined) {
	...

This returns false in the above example, where we are attempting to insert a display name. Therefore the
display name does not get put into a new row and instead gets put on the previous row.

My Proposed Solution

My proposed solution is to adjust the check for a new row to be whether the column at the current row is
not undefined OR it is attempting to insert an element at an index before the previously placed item.

This, of course, needs to rely on the assumption that the data coming in is in an order sorted according to
the headers of the inputted JSON object. Alas, that was not the case, so I also had to make adjustments to
lib/parser/handler.js.

The goal with the changes there is to return the same array of { item: ..., value: ...} objects that is
currently returned, however in order of the header's appearance within the JSON structure.

Ex:

[
	{
		"a": {
			"b": true,
			"c": [{
					"d": 1
				},
				{
					"d": 2
				},
				{
					"d": 3
				},
				{
					"d": 4
				}
			],
			"e": [{
					"f": 1
				},
				{
					"f": 2
				}
			]
		}
	}
]

Becomes:

[
	{ item: "a.b", value: true },
	{ item: "a.c.d", value: 1 },
	{ item: "a.c.d", value: 2 },
	{ item: "a.c.d", value: 3 },
	{ item: "a.c.d", value: 4 },
	{ item: "a.e.f", value: 1 },
	{ item: "a.e.f", value: 2 },
]

While this was already the case for "perfect" JSON schemas, if there were keys out of order or there were
missing keys these items would be returned out of order. These situations were being properly handled in
the existing "undefined check" implementation, but would break with my "header index check".

Therefore, changes had to be made in two files:

lib/parser/handler.js to ensure that the results are sorted normalized to the header indexes
lib/parser/csv.js to add the header index check to the insert new row logic

The Implementation

lib/parser/handler.js

constructor(options) {
	...
+	this._headers = []
}

We need to track headers for this JSON object on a class level.

_handleArray(array) {
	...
+	const getHeaderIndex = function(item) {
+		let index = self._headers.indexOf(item);
+		if (index === -1) {
+			if (item === null) {
+				self._headers.unshift(item);
+			} else {
+				self._headers.push(item);
+			}
+			index = self._headers.indexOf(item);
+		}
+		return index
+	}
+	const sortByHeaders = function(itemA, itemB) {
+		return getHeaderIndex(itemA.item) - getHeaderIndex(itemB.item);
+	}
	...
}

As we are going to be sorting by the headers stored on the class level, we need a sortByHeaders
function and a way to check a header value against the stored headers. The getHeaderIndex function
has additional logic for if the header is null. That is to handle situations where the user has an array
of strings in their object which require a special delimiter, and appease the firstElementWithoutItem
logic that occurs down the line. Basically if it is a null header, it should always be at the front.

Now, the _handleArray function works recursively through the JSON object, populating the item (header)
and value (the value) for each key as it works through and appending those results to a wider result
array.

This gets tricky because we cannot just sort these items at the end, as that will end up with the results
ignoring how deeply nested they are and just sorting according to the JSON schema.

Example:

[
	{
		c: [
			{
				a: "Name 1",
				b: "Field 1"
			},
			{
				a: "Name 2",
				b: "Field 2"
			}
		]
	}
]

Would result in:

[
	{ item: "c.a", "Name 1" },
	{ item: "c.a", "Name 2" },
	{ item: "c.b", "Field 1" },
	{ item: "c.b", "Field 2" },
]

Therefore, we have to sort only the results at the depth they are at, which gets complex because
we are returning the results as we iterate through recursively. To handle this, I've added a "depth"
field to the object. This is to track how deep we are in the recursive stack, as we want to be sure
that we are not inapropriately sorting items that have already been sorted at their respective depth.

+      for (let bIndex=0; bIndex < resultCheckType.length; bIndex++) {
+        getHeaderIndex(resultCheckType[bIndex].item);
+        resultCheckType[bIndex]._depth = (resultCheckType[bIndex]._depth || 0) + 1

While I am iterating within the above loop, I check for the current items depth. If it is 1, then
it is safe to sort, and I add it to a toSort array. Once I hit an item with a depth that isn't 1,
and there are items to sort, then it is time to sort those items by the headers using the sort function
above. After sorting the array, I iterate through the sorted array and place the items back into the main
result array in their sorted order. This is slightly convoluted, as I can't just make a new array without
risking losing the firstElementWithoutItem references...

+	if (resultCheckType[bIndex]._depth === 1) {
+		toSort.push(resultCheckType[bIndex]);
+	} else if (toSort.length > 0) {
+		const sorted = toSort.sort(sortByHeaders)
+		for (let cIndex = 0; cIndex < sorted.length; cIndex++) {
+			resultCheckType[bIndex - sorted.length + cIndex] =
+				sorted[cIndex];
+		}
+		toSort = []
+	}

Then the same code is executed again at the end of the for loop, just in case we had any items left
in the toSort array.

I've put some output on how the result array looks during this recursive process at this gist.

The end result is an array of items and values that are in a proper order according to the JSON's schema!

lib/parser/csv.js

_parseArray(json, stream) {
	...
+ 	let normalizedHeaders = []
	...
+	let getNormalizedIndex = function(header) {
+		var index = normalizedHeaders.indexOf(header)
+		if (index === -1) {
+			normalizedHeaders.push(header)
+			index = normalizedHeaders.indexOf(header)
+		}
+		return index
+	}
	...
}

The first changes are relatively self explanatory. We can't use the existing this._headers array and
getHeaderIndex functions, as that order can be adjusted with the headers option field. Therefore I
use a new normalizedHeaders array, and have a new getNormalizedIndex function which does ignores the
headers field, ensuring we remain with the order from the JSON schema.

fillRows = function(result) {
	...
+	let lastIndex = -1
	...
	for (let element of result) {
        	let elementHeaderIndex = getHeaderIndex(element.item);
+		let normalizedIndex = getNormalizedIndex(element.item)
+		if (
+			currentRow[elementHeaderIndex] != undefined ||
+			normalizedIndex < lastIndex
+		) {
			fillAndPush(currentRow);
			currentRow = newRow();
		}
		emptyRowIndexByHeader[elementHeaderIndex] = emptyRowIndexByHeader[elementHeaderIndex] || 0;
+	        lastIndex = normalizedIndex;
		...
	}
	...
}

And finally, here we have the check for the normalized header index against the most recently placed noramlized
index. In instances where the same element is being placed twice, then the undefined check will cause a new row.
However now, when we are going to be placing an item backwards, we will also be creating a new row.

Conclusion

While my proposed solution solves for my exact problem, I'm very new to this library and have not touched
any of the options beyond adjusting my changes based on the tests. I am open to any questions, comments,
and concerns you may have, and look forward to collaborating with you further!

kaue · 2021-09-29T14:28:38Z

@AckerApple when you have time please follow up on that last round of review.
I will try to do the same, I still have a lot to catch up on this thread.

@hdwatts Thanks for this contribution, it's really well documented 😄

I feel sorry that It's taking me so much time to reply, I hoping that next year I will be able to be more active in this community.

AckerApple · 2021-09-29T14:41:30Z

@kaue you bet friend, I will do a review within a max 4 days. Fairly loaded but I know you just asking for an eye ball review. I will respond in time

hdwatts · 2021-10-07T14:47:46Z

@kaue @AckerApple Thanks for checking back in on this, guys. I'll note that we've been using these changes on our production platform to export results from our GraphQL API over the past year and have yet to encounter any issues.

loganpowell · 2022-01-20T21:39:33Z

I would like to use the fixes proposed by @hdwatts... any chance you'll merge it?

EDIT:

I've spoken without fully testing. Now that I have, I get a number of new rows with mostly empty cells for nested objects... Not sure why...

kaue · 2022-01-21T14:00:55Z

@loganpowell i will try to find some time to review this PR

but if you can please try using @hdwatts branch and see if the fix is working properly.

sorry for the delay guys, busy times :/

iuabhmalat and others added 7 commits November 12, 2020 19:48

Merge pull request iuabhmalat#1 from kaue/master

432a650

Update iuabhmalat/jsonexport fork with changes from kaue/jsonexport

improve complex json array support with support for inconsistent schemas

c21014c

add further tests

2d36662

build

eb036ef

Revert package-lock.json

46ec5fa

Update package-lock.json

d1fe700

Merge branch 'master' into master

18d990e

AckerApple self-assigned this Jan 13, 2021

AckerApple added the help wanted label Jan 13, 2021

AckerApple mentioned this pull request Jan 15, 2021

Bugfix/complex json array unexpected behavior #81

Open

1 task

hdwatts changed the title ~~Alternative Bugfix for Complex JSON Array Unexpected Behavior~~ Bugfix for Complex JSON Array Unexpected Behavior Dec 8, 2021

hdwatts mentioned this pull request Jan 20, 2022

Empty Array Being Returned as Header causing JSON to be Outputted in CSV mrodrig/json-2-csv#168

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugfix for Complex JSON Array Unexpected Behavior #90

Bugfix for Complex JSON Array Unexpected Behavior #90

hdwatts commented Jan 11, 2021 •

edited

AckerApple commented Jan 12, 2021 •

edited

hdwatts commented Jan 12, 2021 •

edited

AckerApple commented Jan 13, 2021

hdwatts commented Jan 14, 2021

kaue commented Sep 29, 2021

AckerApple commented Sep 29, 2021

hdwatts commented Oct 7, 2021

loganpowell commented Jan 20, 2022 •

edited

kaue commented Jan 21, 2022

Bugfix for Complex JSON Array Unexpected Behavior #90

Are you sure you want to change the base?

Bugfix for Complex JSON Array Unexpected Behavior #90

Conversation

hdwatts commented Jan 11, 2021 • edited

Status

Description

Old Logic

New Logic

Related PRs

Todos

Steps to Test or Reproduce

Impacted Areas in Application

AckerApple commented Jan 12, 2021 • edited

hdwatts commented Jan 12, 2021 • edited

AckerApple commented Jan 13, 2021

hdwatts commented Jan 14, 2021

The Problem

My Proposed Solution

The Implementation

lib/parser/handler.js

lib/parser/csv.js

Conclusion

kaue commented Sep 29, 2021

AckerApple commented Sep 29, 2021

hdwatts commented Oct 7, 2021

loganpowell commented Jan 20, 2022 • edited

kaue commented Jan 21, 2022

hdwatts commented Jan 11, 2021 •

edited

AckerApple commented Jan 12, 2021 •

edited

hdwatts commented Jan 12, 2021 •

edited

loganpowell commented Jan 20, 2022 •

edited