- Title: Statistical Functions in Spark
- Slug: spark-stat-functions
- Date: 2019-12-13
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, statistical functions, quantile, approximate, approxQuantile, approx_quantile
- Author: Ben Du
- Modified: 2021-09-30 17:35:09


- [DataFrameStatFunctions (Spark Java API)](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameStatFunctions.html)

- [DataFrameStatFunctions.approxQuantile](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameStatFunctions.html#approxQuantile-java.lang.String:A-double:A-double-)

- [approx_percentile (Spark SQL built-in function)](https://spark.apache.org/docs/latest/api/sql/index.html#approx_percentile)

In [9]:
import pandas as pd

In [8]:
import socket
import findspark

findspark.init("/opt/spark-3.2.0-bin-hadoop3.2")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("PySpark").enableHiveSupport().getOrCreate()

In [11]:
df = spark.createDataFrame(
    pd.DataFrame(
        data=(
            ("Ben", "Du", 1),
            ("Ben", "Du", 2),
            ("Ben", "Tu", 3),
            ("Ben", "Tu", 4),
            ("Ken", "Xu", 1),
            ("Ken", "Xu", 9),
        ),
        columns=("fname", "lname", "score"),
    )
)
df.show()

+-----+-----+-----+
|fname|lname|score|
+-----+-----+-----+
|  Ben|   Du|    1|
|  Ben|   Du|    2|
|  Ben|   Tu|    3|
|  Ben|   Tu|    4|
|  Ken|   Xu|    1|
|  Ken|   Xu|    9|
+-----+-----+-----+

## DataFrame.stat.approxQuantile

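`approxQuantile(col, probabilities, relativeError)` computes approximate quantiles of a numeric column. In Scala it returns an `Array[Double]`; in PySpark it returns a list of floats, one per requested probability. A smaller `relativeError` gives a more accurate result at a higher cost, and a `relativeError` of 0 computes the exact quantiles.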

In [16]:
df.stat.approxQuantile("score", [0.5], 0.1)

[2.0]

In [15]:
df.stat.approxQuantile("score", [0.5], 0.001)

[2.0]

In [18]:
df.stat.approxQuantile("score", [0.5], 0.5)

[1.0]
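The Spark documentation states the accuracy guarantee as a bound on rank: for N rows, a probability p, and a relative error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is a minimal pure-Python sketch of that bound, not Spark's actual algorithm. The helper name `acceptable_values` is hypothetical, and treating the rank as the 1-based position in sorted order is an assumption made for illustration.

```python
import math


def acceptable_values(values, p, err):
    """Return the distinct values whose 1-based sorted position lies within
    [floor((p - err) * N), ceil((p + err) * N)], clamped to [1, N].

    Hypothetical helper: it only illustrates the documented rank bound and
    assumes rank means the 1-based position in sorted order.
    """
    ordered = sorted(values)
    n = len(ordered)
    lo = max(1, math.floor((p - err) * n))
    hi = min(n, math.ceil((p + err) * n))
    return sorted(set(ordered[lo - 1:hi]))


# Same scores as the DataFrame above.
scores = [1, 2, 3, 4, 1, 9]
for err in (0.001, 0.1, 0.5):
    print(err, acceptable_values(scores, 0.5, err))
# 0.001 [1, 2, 3]
# 0.1 [1, 2, 3]
# 0.5 [1, 2, 3, 4, 9]
```

Under this reading, a relative error of 0.5 at p = 0.5 allows any value in the column, which is consistent with `approxQuantile` returning `1.0` in the last call, while the tighter errors keep the result near the middle of the sorted scores.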

## References

- [Spark SQL Built-in Functions](https://spark.apache.org/docs/latest/api/sql/index.html)

- [PySpark APIs](https://spark.apache.org/docs/latest/api/python/reference/index.html)

- [Spark Scala Functions](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html)

- [Spark Java API for Dataset](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html)

- [Spark Java Functions](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html)

- [Spark Java API for Row](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html)