fixes #503: Not able to Insert Neo4j Map Data type using the neo4j-spark connector #507
Conversation
```scala
s"""
  |The field `${field.name}` which has a map {${scalaMap.mkString(", ")}}
  |contains the following duplicated keys: [${dupKeys.mkString(", ")}],
  |you will loose information stored in these keys
```
Suggested change:

```diff
-|you will loose information stored in these keys
+|you will lose some of the values associated with these duplicate keys
```
Should this warning be logged in the implicit directly instead? I'm not sure.
I thought about that, but I preferred to leave the implementation clean and then add the business logic for checking duplicates in the Service.
Done
common/src/main/scala/org/neo4j/spark/util/Neo4jImplicits.scala (outdated, resolved)
common/src/main/scala/org/neo4j/spark/service/MappingService.scala (outdated, resolved)
```
Because the information that we'll store into Neo4j will be this:

----
MyNodeWithMapFlattend {
```
Suggested change:

```diff
-MyNodeWithMapFlattend {
+MyNodeWithFlattenedMap {
```
done
```
----
The field `table` which has a map {key.inner -> {key=innerValue}, key -> {inner.key=value}}
contains the following duplicated keys: [table.key.inner.key],
you will loose information stored in these keys
```
See one of my first suggestions: if you accept it, you will need to change the message here as well.
done
```scala
val dupKeys = scalaMap.flattenKeys(field.name)
  .groupBy(identity)
  .collect { case (x, List(_, _, _*)) => x }
if (dupKeys.nonEmpty) {
```
Actually, it's a pity to iterate over the map twice. Could we move the duplicate check into `flattenMap`?
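To illustrate the reviewer's point, here is a minimal, hypothetical sketch of a single-pass alternative: the flattened map and the set of duplicated keys are both produced in one traversal, instead of flattening first and then calling `groupBy` on the result. The `flatten` helper below is standalone and assumes a plain `Map[String, Any]` with nested maps; it is not the connector's actual `flattenKeys`/`flattenMap` implementation.

```scala
object FlattenSketch {
  // Flatten nested maps to dot-separated keys, collecting duplicate keys
  // in the same traversal (single pass over the data).
  def flatten(
      prefix: String,
      m: Map[String, Any],
      acc: Map[String, Any] = Map.empty,
      dupes: Set[String] = Set.empty
  ): (Map[String, Any], Set[String]) =
    m.foldLeft((acc, dupes)) { case ((a, d), (k, v)) =>
      val key = s"$prefix.$k"
      v match {
        case nested: Map[_, _] =>
          // recurse into nested maps, threading the accumulators through
          flatten(key, nested.asInstanceOf[Map[String, Any]], a, d)
        case leaf =>
          // key already seen: record the collision instead of overwriting
          if (a.contains(key)) (a, d + key)
          else (a + (key -> leaf), d)
      }
    }
}
```

With the map from the PR's doc example (`{key.inner -> {key=innerValue}, key -> {inner.key=value}}` under field `table`), both entries flatten to `table.key.inner.key`, so the duplicate is detected during the same pass that builds the flattened map.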
I'm also thinking that we could actually aggregate values in a collection when there are duplicates (with the caveat that the order is not deterministic if the initial map does not guarantee any iteration order).
Since it's a breaking change, could we maybe allow this behavior only if a specific config flag is set? It would default to `false`, and the default value could change at the next major version.
wdyt?
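A minimal sketch of the aggregation behavior proposed here (names are hypothetical, not the connector's API): values sharing a flattened key are collected into a `Seq` instead of one silently overwriting the other. As the comment above notes, the element order is only as deterministic as the source map's iteration order.

```scala
object GroupDuplicatesSketch {
  // Aggregate duplicate keys: unique keys keep their scalar value,
  // colliding keys map to a sequence of all their values.
  def groupDuplicates(pairs: Seq[(String, Any)]): Map[String, Any] =
    pairs.groupBy(_._1).map {
      case (key, Seq((_, single))) => key -> single            // unique key
      case (key, collisions)       => key -> collisions.map(_._2) // array of values
    }
}
```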
Let's eliminate the logging: if we have a DataFrame whose map column has collisions and a row count of 5 billion, we risk printing the warning 5 billion times, which is not great...
Force-pushed from 2ac57bf to a07d708 (compare)
```
@@ -178,6 +178,23 @@ class Neo4jImplicitsTest {
    Assert.assertEquals(expected, actual)
  }

  @Test
  def `should not handle collision by aggregating values`(): Unit = {
```
Suggested change:

```diff
-def `should not handle collision by aggregating values`(): Unit = {
+def `should handle collision by aggregating values`(): Unit = {
```
```
=== Group duplicated keys to array of values

You can use the option `schema.map.group.duplicate.keys` to avoid this problem, what the connector will do is to group all and only the values with duplicate keys into an array with all the values. The default value for the option is `false`.
```
Suggested change:

```diff
-You can use the option `schema.map.group.duplicate.keys` to avoid this problem, what the connector will do is to group all and only the values with duplicate keys into an array with all the values. The default value for the option is `false`.
+You can use the option `schema.map.group.duplicate.keys` to avoid this problem. The connector will group all the values with the same keys into an array. The default value for the option is `false`.
```
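For context, a usage sketch of the documented option (untested; the option name `schema.map.group.duplicate.keys` comes from this PR, while the connection URL, label, and `df` are placeholders):

```scala
// Hypothetical write with duplicate-key grouping enabled.
df.write
  .format("org.neo4j.spark.DataSource")
  .option("url", "bolt://localhost:7687")          // placeholder connection
  .option("labels", ":MyNodeWithFlattenedMap")     // placeholder label
  .option("schema.map.group.duplicate.keys", "true")
  .save()
```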
```scala
  |)
  |RETURN count(n)
  |""".stripMargin)
.peek()
```
Suggested change:

```diff
-.peek()
+.single()
```
Same as above.
fixes #503