[SPARK-1468] Modify the partition function used by partitionBy.
Make partitionBy use a tweaked version of hash as its default partition function,
since the Python hash function does not consistently assign the same value
to None across Python processes.

Associated JIRA at https://issues.apache.org/jira/browse/SPARK-1468
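An illustrative note, not part of the commit message: the hash of None is not stable across processes (in CPython it is derived from the object's memory address), so two worker processes could route records keyed by None to different partitions. Below is a minimal sketch of the tweaked default, using hypothetical helper names (tweaked_hash, bucket_for) to show how partitionBy's bucketing behaves with it:

    # Sketch only: tweaked_hash mirrors the default partitionFunc added by this
    # commit, and bucket_for shows how a key is assigned to a bucket via
    # partitionFunc(key) % numPartitions.
    def tweaked_hash(x):
        # Pin None to a fixed value; fall back to the built-in hash otherwise.
        return 0 if x is None else hash(x)

    def bucket_for(key, numPartitions):
        return tweaked_hash(key) % numPartitions

    bucket_for(None, 8)   # always 0, regardless of which process computes it
    bucket_for("a", 8)    # hash("a") % 8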

Author: Erik Selin <erik.selin@jadedpixel.com>

Closes apache#371 from tyro89/consistent_hashing and squashes the following commits:

201c301 [Erik Selin] Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes.
Erik Selin authored and mateiz committed Jun 3, 2014
1 parent b1f2853 commit 8edc9d0
Showing 1 changed file with 4 additions and 1 deletion.
python/pyspark/rdd.py: 5 changes (4 additions & 1 deletion)
@@ -1062,7 +1062,7 @@ def rightOuterJoin(self, other, numPartitions=None):
         return python_right_outer_join(self, other, numPartitions)
 
     # TODO: add option to control map-side combining
-    def partitionBy(self, numPartitions, partitionFunc=hash):
+    def partitionBy(self, numPartitions, partitionFunc=None):
         """
         Return a copy of the RDD partitioned using the specified partitioner.
@@ -1073,6 +1073,9 @@ def partitionBy(self, numPartitions, partitionFunc=hash):
         """
         if numPartitions is None:
             numPartitions = self.ctx.defaultParallelism
+
+        if partitionFunc is None:
+            partitionFunc = lambda x: 0 if x is None else hash(x)
         # Transferring O(n) objects to Java is too expensive. Instead, we'll
         # form the hash buckets in Python, transferring O(numPartitions) objects
         # to Java. Each object is a (splitNumber, [objects]) pair.
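With the tweaked default in place, records keyed by None always land in the same partition no matter which worker hashed them. A short usage sketch of the patched method (the master URL, app name, and sample data are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "spark-1468-example")
    pairs = sc.parallelize([(None, 1), ("a", 2), (None, 3)])

    # Both None keys map to bucket 0 % 4 == 0, so they end up in the same
    # partition on every run instead of being scattered by a per-process hash.
    parts = pairs.partitionBy(4).glom().collect()
    print(parts)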
