- Title: Statistical Functions in Spark
- Slug: spark-stat-functions
- Date: 2019-12-13
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, statistical functions, quantile, approximate, approxQuantile, approx_quantile
- Author: Ben Du
- Modified: 2021-09-30 17:35:09


- [DataFrameStatFunctions (Java API)](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameStatFunctions.html)

- [DataFrameStatFunctions.approxQuantile](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameStatFunctions.html#approxQuantile-java.lang.String:A-double:A-double-)

- [Spark SQL function approx_percentile](https://spark.apache.org/docs/latest/api/sql/index.html#approx_percentile)

In [1]:
%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1

In [3]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Spark Example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

import spark.implicits._


In [4]:
import org.apache.spark.sql.functions._

val df = Seq(
    ("Ben", "Du", 1),
    ("Ben", "Du", 2),
    ("Ben", "Tu", 3),
    ("Ben", "Tu", 4),
    ("Ken", "Xu", 1),
    ("Ken", "Xu", 9)
).toDF("fname", "lname", "score")
df.show

+-----+-----+-----+
|fname|lname|score|
+-----+-----+-----+
|  Ben|   Du|    1|
|  Ben|   Du|    2|
|  Ben|   Tu|    3|
|  Ben|   Tu|    4|
|  Ken|   Xu|    1|
|  Ken|   Xu|    9|
+-----+-----+-----+




## DataFrame.stat.approxQuantile

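`approxQuantile` takes a column name, an array of probabilities, and a relative error. It returns an `Array[Double]` holding one approximate quantile per requested probability. A relative error of 0 yields exact quantiles at a higher cost, while larger values allow a cheaper but looser answer. The calls below request the median of `score` with relative errors of 0.1, 0.001, and 0.5.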
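The Spark documentation states the guarantee behind the relative error as a bound on rank: for N rows, probability p, and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below is plain Scala and does not call Spark. `RankBound` and `acceptable` are hypothetical names, and treating rank as the 1-based position in sorted order is a simplifying assumption. It only lists which values would satisfy the bound; it is not Spark's algorithm, which the docs describe as a variation of Greenwald-Khanna.

```scala
object RankBound {
  // Values of `data` whose 1-based position in sorted order lies within
  // floor((p - err) * n) .. ceil((p + err) * n), clamped to 1..n.
  // Assumption: "rank" means position in sorted order (ties keep separate positions).
  def acceptable(data: Seq[Double], p: Double, err: Double): Seq[Double] = {
    val sorted = data.sorted
    val n = sorted.length
    val lo = math.max(1, math.floor((p - err) * n).toInt)
    val hi = math.min(n, math.ceil((p + err) * n).toInt)
    sorted.slice(lo - 1, hi).distinct
  }
}

// The six scores from df, sorted: 1, 1, 2, 3, 4, 9
val scores = Seq(1.0, 2.0, 3.0, 4.0, 1.0, 9.0)
RankBound.acceptable(scores, 0.5, 0.1) // positions 2..4 qualify
RankBound.acceptable(scores, 0.5, 0.5) // positions 1..6, so every value qualifies
```

With an error of 0.5 around the median, the bound covers the whole column, so any score is a valid answer under this reading.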

In [7]:
df.stat.approxQuantile("score", Array(0.5), 0.1)

[2.0]

In [8]:
df.stat.approxQuantile("score", Array(0.5), 0.001)

[2.0]

In [10]:
df.stat.approxQuantile("score", Array(0.5), 0.5)

[1.0]
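`approxQuantile` works on a whole DataFrame. For a per-group quantile, one option is the SQL aggregate `approx_percentile(col, percentage [, accuracy])` linked at the top of this post, called through `expr`. The following is a minimal sketch and not output from the original notebook: the names `scores`, `medians`, and `median_score` are made up for illustration, and it assumes the function is available in the Spark version in use. The optional third argument `accuracy` trades memory for precision, with `1.0 / accuracy` as the relative error according to the SQL function docs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder()
    .master("local[2]")
    .appName("approx_percentile sketch")
    .getOrCreate()

import spark.implicits._

// Same names and scores as df above, reduced to the two columns needed here.
val scores = Seq(
    ("Ben", 1), ("Ben", 2), ("Ben", 3), ("Ben", 4),
    ("Ken", 1), ("Ken", 9)
).toDF("fname", "score")

// Approximate median of score within each fname group.
val medians = scores
    .groupBy("fname")
    .agg(expr("approx_percentile(score, 0.5)").alias("median_score"))
medians.show
```

Unlike `approxQuantile`, this returns a DataFrame with one row per group, so the result can be joined or filtered like any other column.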

## References

- [Spark SQL Built-in Functions](https://spark.apache.org/docs/latest/api/sql/index.html)

- [Spark Scala Functions](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html)

- [Spark Java API for Dataset](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html)

- [Spark Java Functions](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html)

- [Spark Java API for Row](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html)