Skip to content

kekepins/spark-percentile

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 

Repository files navigation

spark-percentile

From the wikipedia page about percentile: https://en.wikipedia.org/wiki/Percentile

Let's do some experiments with spark.

Spark has two built-in percentile implementations:

  • percentile
  • percentile_approx

Let's implement 3 new one described in wikipedia page:

  • nearest rank
  • interpolation C1
  • interpolation C0

In spark this can be done with UserDefinedAggregateFunction, java code can be see here

Now test this on various examples from wikipedia page code here

With code:

// Register udf 
sparkSession.udf().register("percentileC1", new MyPercentile(percentiles, PercentileMode.INTERPOLATION_C1));
sparkSession.udf().register("percentileC0", new MyPercentile(percentiles, PercentileMode.INTERPOLATION_C0));
sparkSession.udf().register("percentileNearestRank", new MyPercentile(percentiles, PercentileMode.NEAREST_RANK));

// Get a dataset
Dataset<Row> ds = fromArray(sparkSession, values);
ds.show(false);
		
// Compute percentiles
ds = ds.select(
    	callUDF("percentileC1", col("data")).as("Percentile C1"),
    	callUDF("percentileC0", col("data")).as("Percentile C0"),
    	callUDF("percentileNearestRank", col("data")).as("Percentile Nearest Rank"),
    	callUDF("percentile_approx", col("data"), lit( percentiles) ).as("percentile_approx (spark builtin)"),
    	callUDF("percentile", col("data"), lit( percentiles) ).as("percentile (spark builtin)")
    		);
		
ds.show(false);

Test 1, Nearest rank example 1

data
15.0
20.0
35.0
40.0
50.0
Percentile C1 Percentile C0 Percentile Nearest Rank percentile_approx (spark builtin) percentile (spark builtin)
16.0, 23.0, 29.0, 35.0, 50.0 15.0, 19.0, 26.0, 35.0, 50.0 15.0, 20.0, 20.0, 35.0, 50.0 20.0, 20.0, 20.0, 35.0, 50.0 16.0, 23.0, 29.0, 35.0, 50.0

Test 2, Nearest rank example 2

data
3.0
6.0
7.0
8.0
8.0
10.0
13.0
15.0
16.0
20.0
Percentile C1 Percentile C0 Percentile Nearest Rank percentile_approx (spark builtin) percentile (spark builtin)
7.25, 9.0, 14.5, 20.0 6.75, 9.0, 15.25, 20.0 7.0, 8.0, 15.0, 20.0 7.0, 8.0, 15.0, 20.0 7.25, 9.0, 14.5, 20.0

Test 3, Nearest rank example 3

data
3.0
6.0
7.0
8.0
8.0
9.0
10.0
13.0
15.0
16.0
20.0
Percentile C1 Percentile C0 Percentile Nearest Rank percentile_approx (spark builtin) percentile (spark builtin)
7.5, 9.0, 14.0, 20.0 7.0, 9.0, 15.0, 20.0 7.0, 9.0, 15.0, 20.0 7.0, 9.0, 15.0, 20.0 7.5, 9.0, 14.0, 20.0

Test 4, Interpolation between closest rank (C=1) example 1 (second variant)

data
15.0
20.0
35.0
40.0
50.0
Percentile C1 Percentile C0 Percentile Nearest Rank percentile_approx (spark builtin) percentile (spark builtin)
29.0 26.0 20.0 20.0 29.0

Test 5, Interpolation between closest rank (C=1) example 2

data
1.0
2.0
3.0
4.0
Percentile C1 Percentile C0 Percentile Nearest Rank percentile_approx (spark builtin) percentile (spark builtin)
3.25 3.75 3.0 3.0 3.25

Test 6, Interpolation between closest rank (C=0) example 1 (third variant)

data
15.0
20.0
35.0
40.0
50.0
Percentile C1 Percentile C0 Percentile Nearest Rank percentile_approx (spark builtin) percentile (spark builtin)
16.0, 23.0, 29.0, 48.0 15.0, 19.0, 26.0, 50.0 15.0, 20.0, 20.0, 50.0 20.0, 20.0, 20.0, 50.0 16.0, 23.0, 29.0, 48.0

Releases

No releases published

Packages

No packages published

Languages