Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a MutableMap when accessing keySet and values in SparkMap #58

Merged
merged 1 commit into from
Sep 23, 2020

Conversation

raymondlam12
Copy link
Contributor

Currently, accessing the keySet via: _mapData.keyArray().array.iterator or values via: _mapData.valueArray().array.iterator fails as its underlying type in Spark is UnsafeArrayData.

UnsafeArrayData does not allow accessing the raw array: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java#L108 .

Creating the mutable map solves this problem.

@@ -31,10 +31,14 @@ case class SparkMap(private var _mapData: MapData,
}

override def keySet(): util.Set[StdData] = {
if (_mutableMap == null) {
_mutableMap = createMutableMap()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a mutable map on the read path seems excessive. I see that _map.keyArray() exposes getter methods. Can we build an iterator on top of this method ourselves? I think we would need to just maintain a current index inside the iterator and then increment it on next().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is definitely excessive in this scenario, but I don't think there will be keySet/values calls without some get(key) call which would create a mutable map anyways.

I can make the change to make another Iterator though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the change to the iterator. It's a cleaner solution, thanks for suggestion!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Looks much better. Consider reducing duplicated code? Something like this?

  override def keySet(): util.Set[StdData] = {
    val keysIterator: Iterator[Any] = if (_mutableMap == null) {
      new Iterator[Any] {
        private val keyArray = _mapData.keyArray()
        var offset : Int = 0

        override def next(): Any = {
          offset += 1
          keyArray.get(offset -1, _keyType)
        }

        override def hasNext: Boolean = offset < SparkMap.this.size()
      }
    } else {
      _mutableMap.keysIterator
    }

    new util.AbstractSet[StdData] {
      override def iterator(): util.Iterator[StdData] = new util.Iterator[StdData] {

        override def next(): StdData = SparkWrapper.createStdData(keysIterator.next(), _keyType)

        override def hasNext: Boolean = keysIterator.hasNext
      }

      override def size(): Int = SparkMap.this.size()
    }
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

acked and updated.

case array: mutable.WrappedArray[Any] if expectedOutputData != null =>
val expectedArray = array.sortWith(
(l, r) => r == null || (l != null && l.toString < r.toString))
val actualArray = result.get(0).asInstanceOf[mutable.WrappedArray[Any]].sortWith(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right fix. Arrays are ordered and so this check should be ordered. I understand that in this specific case, the output you are trying to test is that of the map_keys, map_values functions which are unordered. IMO, an ideal fix would be to pass the output of the function call through an array_sort UDF in the test case for those specific UDFs.

If we create the iterator ourselves instead of using a mutable map as mentioned in my comment above, do we still need this in the scope of the PR?

Copy link
Contributor Author

@raymondlam12 raymondlam12 Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this should not be the fix, but I thought it might be okay to bandaid it since the unit tests don't seem to be evolving for now.
In theory, doing something like Map.get() then Map.keySet() would break the unit test ordering since a mutable map would be created which uses a sorted set as its key collection (at least from what it seems).
I think there should be some checkUnordered method in the StdTester, but it is not necessary for this RB.

If we create our own iterator without the mutable map, this change is not necessary.

Create this change to fix UnsafeArrayData bug
@raymondlam12
Copy link
Contributor Author

I don't have authorization to merge this branch; could you merge it on my behalf Shardul?

@shardulm94 shardulm94 merged commit 5702f7a into linkedin:master Sep 23, 2020
@shardulm94
Copy link
Contributor

@raymondlam12 Thanks for the PR! A new version should be published in the next couple of hours.

wmoustafa pushed a commit that referenced this pull request Oct 12, 2021
* Disable Jacoco for platform tests (#37)

* 0.0.46 release (previous 0.0.45) + release notes updated [ci skip]

* Fixed presto UDF patch broken link (#41)

Co-authored-by: Sushant Raikar <sraikar@sraikar-mn2.linkedin.biz>

* FileSystemUtils: remove an unreliable check for unit testing (#42)

Why: we don't need it

What changed: instead of checking configuration for hints on whether we
are unit testing, we trust the URI's protocol

Tests performed: ./gradlew build

* 0.0.47 release (previous 0.0.46) + release notes updated [ci skip]

* Presto: Pass custom configuration object when using FileSystemUtils (#43)

* 0.0.48 release (previous 0.0.47) + release notes updated [ci skip]

* Presto: Make ScalarFunctionImplementation state independent of StdUdfWrapper (#44)

* 0.0.49 release (previous 0.0.48) + release notes updated [ci skip]

* Upgrade to PrestoSQL 333 (#45)

Some major changes:
 - `SqlFunction`, `SqlScalarFunction` and `ScalarFunctionImplementation` have evolved
   in trinodb/trino#1764
 - `Metadata::getScalarFunctionImplementation` evolved in trinodb/trino#1039
 - Type signature parser was moved to presto-main in trinodb/trino#1738

* 0.0.50 release (previous 0.0.49) + release notes updated [ci skip]

* Add support for StdFloat, StdDouble, and StdBinary (#46)

* Introduce StdFloat, StdDouble, and StdBinary interfaces
* Add implementations of those interfaces in Avro, Hive, Presto, Spark, and Generic type systems
* Add examples of transport UDFs on those new types, and add tests for those UDFs
* Update documentation

* 0.0.51 release (previous 0.0.50) + release notes updated [ci skip]

* Allow users to override main and test source set names, output directories

* 0.0.52 release (previous 0.0.51) + release notes updated [ci skip]

* Empty commit to release new version

* Empty commit to release new version [ci skip-compare-publications]

* 0.0.53 release (previous 0.0.52) + release notes updated [ci skip]

* Plugin: Publish Presto thin jar which allows consumers to control dependency graph (#49)

* 0.0.54 release (previous 0.0.53) + release notes updated [ci skip]

* Hive: Struct data should not be converted to object array during StdStruct creation (#50)

* Remove slf4j-log4j12 from Transport dependency graph (#51)

* Bump shipkit (#54)

* 0.0.55 release (previous 0.0.54) + release notes updated [ci skip]

* Fix test SQL generation for binary inputs (#55)

* 0.0.56 release (previous 0.0.55) + release notes updated [ci skip]

* Spark: Create index-based iterator for non-mutable map keySet and values access (#58)

* 0.0.57 release (previous 0.0.56) + release notes updated [ci skip]

* Avro: Support simple union schemas (#60)

* 0.0.58 release (previous 0.0.57) + release notes updated [ci skip]

* Support conversion of String type to Utf8 in AvroWrapper (#61)

* 0.0.59 release (previous 0.0.58) + release notes updated [ci skip]

* Add Avro ENUM read support and fix String bug (#62)

Co-authored-by: Raymond Lam <ralam@linkedin.com>

* 0.0.60 release (previous 0.0.59) + release notes updated [ci skip]

* Build: Fail if there are checkstyle violations (#64)

- Change the Checkstyle severity level from warning to error
- Eliminate all existing checkstyle violations

Co-authored-by: Carl Steinbach <cwsteinbach@gmail.com>

* 0.0.61 release (previous 0.0.60) + release notes updated [ci skip]

* Add travis-build.sh for pre-commit testing from command line (#65)

* Add travis-build.sh for pre-commit testing from command line

* Fix name of build file in comment

Co-authored-by: Carl Steinbach <cwsteinbach@gmail.com>
Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>

* Upgrade to Gradle 6.7 (#67)

* Support builds with platform specific JDK (#69)

* Bump Avro dependency to 1.10.2 (from 1.7.7). (#71)

There doesn't seem to be any impact to the code. gradle build passes.

* Migrate from PrestoSQL to Trino  (#68)

* Automate artifact publication to Maven Central (#72)

* Update ci.yml java version to 8 (#77)

skip release

* Fix org.pentaho:pentaho-aggdesigner-algorithm sunset problem (#78)

* Remove travis build in favor of github actions (#87)

* Add scala_2.11 and scala_2.12 support (#85)

* Update ci.yml to also build the udf-examples folder (#90)

Co-authored-by: Malini Mahalakshmi Venkatachari <malvenkatachari@linkedin.com>

* Fix running multiple builds in run step in workflow action (#92)

Co-authored-by: Malini Mahalakshmi Venkatachari <malvenkatachari@linkedin.com>

* A solution to fix running multiple UDFs in Spark issue (#93)

Co-authored-by: Kai Xu <kxu@kxu-mn3.linkedin.biz>

Co-authored-by: Shardul Mahadik <smahadik@linkedin.com>
Co-authored-by: shipkit-org <shipkit.org@gmail.com>
Co-authored-by: Sushant Raikar <sraikar@wish.com>
Co-authored-by: Sushant Raikar <sraikar@sraikar-mn2.linkedin.biz>
Co-authored-by: Suren Nihalani <1093911+SurenNihalani@users.noreply.github.com>
Co-authored-by: Xingyuan Lin <xinlin@linkedin.com>
Co-authored-by: Khai Tran <46727493+khaitranq@users.noreply.github.com>
Co-authored-by: John Joyce <jjoyce6@nd.edu>
Co-authored-by: Raymond <13109642+raymondlam12@users.noreply.github.com>
Co-authored-by: curtiscwang <cwang89@gmail.com>
Co-authored-by: Raymond Lam <ralam@linkedin.com>
Co-authored-by: Carl Steinbach <cwsteinbach+github@gmail.com>
Co-authored-by: Carl Steinbach <cwsteinbach@gmail.com>
Co-authored-by: Akshay Rai <akrai@linkedin.com>
Co-authored-by: Sreeram Ramachandran <sramachandran@linkedin.com>
Co-authored-by: Raymond Zhang <razhang@linkedin.com>
Co-authored-by: KAI XU <kxu@linkedin.com>
Co-authored-by: Sushant Raikar <sraikar@linkedin.com>
Co-authored-by: Malini Mahalakshmi Venkatachari <malvenkatachari@linkedin.com>
Co-authored-by: Kai Xu <kxu@kxu-mn3.linkedin.biz>
wmoustafa pushed a commit that referenced this pull request Oct 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants