Create a MutableMap when accessing keySet and values in SparkMap #58

raymondlam12 · 2020-09-21T18:36:27Z

Currently, accessing the keySet via _mapData.keyArray().array.iterator or the values via _mapData.valueArray().array.iterator fails because the underlying array type in Spark is UnsafeArrayData.

UnsafeArrayData does not allow accessing the raw array: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java#L108 .

Creating the mutable map solves this problem.
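To make the failure mode concrete, here is a minimal, Spark-free sketch. `GetterOnlyArray` is a hypothetical stand-in for an array type that, like UnsafeArrayData as described above, supports indexed reads but rejects raw-array access. The class name, method shapes, and exception message are assumptions for illustration, not Spark's actual code.

```scala
import scala.util.Try

// Hypothetical stand-in for a getter-only array such as Spark's UnsafeArrayData.
// It is NOT Spark code; it only mimics "indexed reads work, array() does not".
final class GetterOnlyArray(private val backing: Vector[Any]) {
  def numElements(): Int = backing.length
  def get(ordinal: Int): Any = backing(ordinal)
  // Assumption: raw-array access is rejected at runtime.
  def array(): Array[Any] =
    throw new UnsupportedOperationException("raw array access is not supported")
}

object GetterOnlyArrayDemo {
  val keys = new GetterOnlyArray(Vector("a", "b", "c"))

  // Fails: there is no raw array to iterate over.
  val viaRawArray: Try[List[Any]] = Try(keys.array().iterator.toList)

  // Works: read each element through the indexed getter.
  val viaGetter: List[Any] = (0 until keys.numElements()).map(keys.get).toList

  def main(args: Array[String]): Unit = {
    println(viaRawArray.isFailure)
    println(viaGetter)
  }
}
```

Any fix therefore has to go through the indexed getters, whether by copying the entries into another collection or by iterating over indices directly.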

shardulm94 · 2020-09-22T05:48:31Z

transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/data/SparkMap.scala

@@ -31,10 +31,14 @@ case class SparkMap(private var _mapData: MapData,
  }

  override def keySet(): util.Set[StdData] = {
+    if (_mutableMap == null) {
+      _mutableMap = createMutableMap()
+    }


Creating a mutable map on the read path seems excessive. I see that _mapData.keyArray() exposes getter methods. Can we build an iterator on top of those getters ourselves? I think we would just need to maintain a current index inside the iterator and increment it on next().

It is definitely excessive in this scenario, but I don't think there will be keySet/values calls without some get(key) call which would create a mutable map anyways.

I can make the change to make another Iterator though.

I made the change to the iterator. It's a cleaner solution, thanks for the suggestion!

Nice! Looks much better. Consider reducing duplicated code? Something like this?

  override def keySet(): util.Set[StdData] = {
    val keysIterator: Iterator[Any] = if (_mutableMap == null) {
      new Iterator[Any] {
        private val keyArray = _mapData.keyArray()
        var offset: Int = 0

        override def next(): Any = {
          offset += 1
          keyArray.get(offset - 1, _keyType)
        }

        override def hasNext: Boolean = offset < SparkMap.this.size()
      }
    } else {
      _mutableMap.keysIterator
    }

    new util.AbstractSet[StdData] {
      override def iterator(): util.Iterator[StdData] = new util.Iterator[StdData] {
        override def next(): StdData = SparkWrapper.createStdData(keysIterator.next(), _keyType)
        override def hasNext: Boolean = keysIterator.hasNext
      }

      override def size(): Int = SparkMap.this.size()
    }
  }
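One detail worth checking in a suggestion shaped like the one above: keysIterator is created once per keySet() call, so every iterator() call on the returned set wraps the same underlying iterator, and a second pass over that set would see no elements. The following Spark-free sketch shows a variant that builds a new index-based iterator on each iterator() call. `IndexedKeys`, `indexedKeySet`, `fromSeq`, and `drain` are hypothetical names introduced here for illustration and are not part of the PR.

```scala
import java.util

object IndexedKeySetSketch {
  // Hypothetical stand-in for a getter-only key array (not a Spark type).
  trait IndexedKeys {
    def numElements(): Int
    def get(ordinal: Int): Any
  }

  // Builds a read-only java.util.Set view; each iterator() call starts at index 0.
  def indexedKeySet(keys: IndexedKeys): util.Set[Any] =
    new util.AbstractSet[Any] {
      override def iterator(): util.Iterator[Any] = new util.Iterator[Any] {
        private var offset = 0
        override def hasNext: Boolean = offset < keys.numElements()
        override def next(): Any = {
          if (!hasNext) throw new NoSuchElementException("no more keys")
          val key = keys.get(offset)
          offset += 1
          key
        }
      }
      override def size(): Int = keys.numElements()
    }

  // Convenience constructor for demos and tests.
  def fromSeq(items: Seq[Any]): IndexedKeys = new IndexedKeys {
    override def numElements(): Int = items.length
    override def get(ordinal: Int): Any = items(ordinal)
  }

  // Drains a fresh iterator from the set into a List.
  def drain(set: util.Set[Any]): List[Any] = {
    val it = set.iterator()
    val out = List.newBuilder[Any]
    while (it.hasNext) out += it.next()
    out.result()
  }
}
```

Keeping the index inside the iterator rather than in the enclosing method means that callers relying on repeated traversal, or on AbstractSet methods such as contains that iterate internally, get a full pass each time.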

acked and updated.

shardulm94 · 2020-09-22T06:05:21Z

...sportable-udfs-test-spark/src/main/scala/com/linkedin/transport/test/spark/SparkTester.scala

+      case array: mutable.WrappedArray[Any] if expectedOutputData != null =>
+        val expectedArray = array.sortWith(
+          (l, r) => r == null || (l != null && l.toString < r.toString))
+        val actualArray = result.get(0).asInstanceOf[mutable.WrappedArray[Any]].sortWith(


I don't think this is the right fix. Arrays are ordered and so this check should be ordered. I understand that in this specific case, the output you are trying to test is that of the map_keys, map_values functions which are unordered. IMO, an ideal fix would be to pass the output of the function call through an array_sort UDF in the test case for those specific UDFs.

If we create the iterator ourselves instead of using a mutable map as mentioned in my comment above, do we still need this in the scope of the PR?

Yeah this should not be the fix, but I thought it might be okay to bandaid it since the unit tests don't seem to be evolving for now.
In theory, doing something like Map.get() then Map.keySet() would break the unit test ordering since a mutable map would be created which uses a sorted set as its key collection (at least from what it seems).
I think there should be some checkUnordered method in the StdTester, but it is not necessary for this RB.
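As a rough illustration of what an order-insensitive check could look like, here is a small sketch that compares two sequences as multisets instead of sorting them by toString. `sameElementsUnordered` is a hypothetical helper, not an existing StdTester method, and it assumes the elements have meaningful equals and hashCode implementations.

```scala
object UnorderedCheckSketch {
  // Hypothetical helper: true when both sequences hold the same elements
  // with the same multiplicities, regardless of order.
  // Elements are wrapped in Option so that null entries can be counted safely.
  def sameElementsUnordered(expected: Seq[Any], actual: Seq[Any]): Boolean = {
    def counts(xs: Seq[Any]): Map[Option[Any], Int] =
      xs.map(x => Option(x)).groupBy(identity).map { case (k, v) => k -> v.size }
    counts(expected) == counts(actual)
  }
}
```

Counting occurrences avoids relying on a toString-based ordering, which can make distinct values with the same string form look identical.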

If we create our own iterator without the mutable map, this change is not necessary.

Create this change to fix UnsafeArrayData bug

raymondlam12 · 2020-09-23T17:51:53Z

I don't have authorization to merge this branch; could you merge it on my behalf, Shardul?

shardulm94 · 2020-09-23T18:29:39Z

@raymondlam12 Thanks for the PR! A new version should be published in the next couple of hours.


raymondlam12 force-pushed the create-mutable-map branch from df4a6e7 to 2395082 Compare September 21, 2020 23:45

shardulm94 reviewed Sep 22, 2020


raymondlam12 force-pushed the create-mutable-map branch from 2395082 to 71ac147 Compare September 22, 2020 18:13

Create index-based iterator for non-mutable map keySet and values access

eb496db

Create this change to fix UnsafeArrayData bug

raymondlam12 force-pushed the create-mutable-map branch from 71ac147 to eb496db Compare September 22, 2020 20:50

shardulm94 approved these changes Sep 23, 2020


shardulm94 merged commit 5702f7a into linkedin:master Sep 23, 2020

wmoustafa pushed a commit that referenced this pull request Oct 13, 2021

Spark: Create index-based iterator for non-mutable map keySet and val…

fa476fa

…ues access (#58)
