This brief experiment assumes you randomly assign K items to N buckets. How many buckets will be empty for various N and K?

Why does it matter? Consider a Delta Lake merge job on a table that is partitioned randomly. Any partition that contains at least one row in the new data must be rewritten. As we'll see, in most cases that is every partition.

**The math way**

With N buckets and K items, the probability that any particular bucket is empty is $\left(\frac{N-1}{N}\right)^K$.

Intuitively, each item has a $\frac{N-1}{N}$ chance of landing in some other bucket, and the items are independent, so the probabilities multiply across all K items. You can see some examples below:

In [6]:
N = 1024  # num buckets
for K in [1, 1000, 1024, 10000, 1000000]:
    print(f"N={N}  K={K}  p={((N-1)/N)**K}")

N=1024  K=1  p=0.9990234375
N=1024  K=1000  p=0.37642379805672405
N=1024  K=1024  p=0.3676997394112712
N=1024  K=10000  p=5.711770163265015e-05
N=1024  K=1000000  p=0.0


The probability of a single bucket being empty decreases rapidly as K increases. (Fun fact: for K = N = 1024 the answer is approximately 1/e.)
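
The 1/e observation generalizes: for large N, $\left(\frac{N-1}{N}\right)^K \approx e^{-K/N}$. The sketch below compares the two for the same N as above; the helper names `p_empty_exact` and `p_empty_approx` are made up for this illustration.

```python
import math

def p_empty_exact(N, K):
    """Exact probability that one particular bucket is empty."""
    return ((N - 1) / N) ** K

def p_empty_approx(N, K):
    """Exponential approximation, reasonable when N is large."""
    return math.exp(-K / N)

N = 1024  # num buckets
for K in [1, 1000, 1024, 10000]:
    print(f"N={N}  K={K}  exact={p_empty_exact(N, K):.6g}  approx={p_empty_approx(N, K):.6g}")
```

The approximation makes the dependence on the ratio K/N explicit: what matters is roughly how many items there are per bucket.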

The calculation above gives the probability that one particular bucket is empty. How many buckets will be empty in total?

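The math answer is $N \cdot p$: by linearity of expectation, the expected number of empty buckets is the sum of each bucket's probability of being empty. We'll show this in a simulation here: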

In [36]:
import hashlib
import uuid

def run_sim(K, N, num_samples):
    """Assign K random items to N buckets, num_samples times, and print the average number of empty buckets."""
    num_zeros_accum = []
    for s in range(num_samples):
        buckets = [0] * N
        for i in range(K):
            # Hash a fresh random UUID and reduce it modulo N to pick a bucket.
            result = hashlib.md5(f'{uuid.uuid4()}'.encode())
            buckets[int(result.hexdigest(), 16) % N] += 1
        # Count the buckets that received no items in this trial.
        num_zeros = sum(count == 0 for count in buckets)
        num_zeros_accum.append(num_zeros)

    print(f"N={N}  K={K} average number of empty buckets is {sum(num_zeros_accum) / len(num_zeros_accum)}")

In [40]:
run_sim(K=1024, N=1024, num_samples=10000)
run_sim(K=10000, N=1024, num_samples=10000)


N=1024  K=1024 average number of empty buckets is 376.468
N=1024  K=10000 average number of empty buckets is 0.059
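
As a sanity check, the simulated averages can be compared with the closed form $N \cdot p$. This is a minimal sketch; `expected_empty_buckets` is a helper name introduced here for illustration, not part of the original notebook.

```python
def expected_empty_buckets(N, K):
    """Expected number of empty buckets: N times the per-bucket empty probability."""
    p = ((N - 1) / N) ** K
    return N * p

# Simulated averages reported above, keyed by (N, K).
simulated = {(1024, 1024): 376.468, (1024, 10000): 0.059}

for (N, K), sim_avg in simulated.items():
    print(f"N={N}  K={K}  closed form={expected_empty_buckets(N, K):.3f}  simulated={sim_avg}")
```

Both simulated values land close to the closed-form expectation, which is what you would hope for with 10000 samples.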


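What's the upshot? With 10000 items and 1024 buckets, the expected number of empty buckets is about 0.06, so it's very likely that *all* buckets will have at least one record. For the Delta Lake merge, that means every partition gets rewritten.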
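
To put a rough number on "very likely", the sketch below estimates how often no bucket is left empty. It swaps the MD5-of-UUID hashing for Python's `random` module, on the assumption that only the uniform assignment matters; the function name, trial count, and seed are arbitrary choices for illustration.

```python
import random

def fraction_all_nonempty(N, K, trials, seed=0):
    """Estimate the fraction of trials in which every one of N buckets gets at least one of K items."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = 0
    for _ in range(trials):
        # The set of distinct buckets touched in this trial.
        touched = {rng.randrange(N) for _ in range(K)}
        if len(touched) == N:
            hits += 1
    return hits / trials

frac = fraction_all_nonempty(N=1024, K=10000, trials=200)
print(f"N=1024  K=10000  fraction of trials with no empty bucket: {frac}")
```

Since the expected number of empty buckets is about 0.06, at most about 6% of trials can have any empty bucket at all, so the estimate should come out well above 90%.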