# 3.2 Aggregation functions

**Aggregate functions** operate on a group of rows and calculate a single return value for every group.

In the previous GroupBy section, we saw a few aggregation examples. In this section, we examine the main built-in aggregation functions in Spark. They are listed below, grouped by purpose:

- approx_count_distinct, count, countDistinct
- avg/mean, max, min
- collect_list, collect_set
- grouping: indicates whether a column in a cube/rollup grouping list is aggregated in the result row; returns 1 for aggregated, 0 for not aggregated
- first, last
- sum, sumDistinct
- kurtosis, skewness
- stddev, stddev_samp, stddev_pop
- variance, var_samp, var_pop

In [1]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import count, countDistinct, approx_count_distinct, avg, min, max, mean, collect_list, \
    collect_set, grouping, first, last, sum, sumDistinct, skewness, kurtosis, stddev, stddev_samp, stddev_pop, \
    variance, var_samp, var_pop

In [2]:
import os  # required by the os.environ lookups in the cluster branch below

local = True

if local:
    spark=SparkSession.builder.master("local[4]").appName("pySparkGroupBy").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("SparkArrowCompression") \
                      .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .getOrCreate()


In [3]:
data = [("Alice", "Sales", 3000),
            ("Michael", "Sales", 4600),
            ("Robert", "IT", 4100),
            ("Maria", "Finance", 3000),
            ("Haha", "IT", 3000),
            ("Scott", "Finance", 3300),
            ("Jen", "Finance", 3900),
            ("Jeff", "Marketing", 3000),
            ("Kumar", "Marketing", 2000),
            ("Haha", "Sales", 4100)
            ]
schema = ["name", "department", "salary"]

df=spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|  Alice|     Sales|  3000|
|Michael|     Sales|  4600|
| Robert|        IT|  4100|
|  Maria|   Finance|  3000|
|   Haha|        IT|  3000|
|  Scott|   Finance|  3300|
|    Jen|   Finance|  3900|
|   Jeff| Marketing|  3000|
|  Kumar| Marketing|  2000|
|   Haha|     Sales|  4100|
+-------+----------+------+



## 3.2.1 count, countDistinct, approx_count_distinct
The classic count() counts the rows of a group. When it is given a column, it counts the rows where that column is not null.

The **approx_count_distinct** function avoids an exact count(distinct()) operation, which requires more and more memory as the number of distinct values increases. It uses an algorithm called **HyperLogLog**, which can estimate cardinalities above 1,000,000,000 with a typical error of about 2% of the actual distinct count, while using much less memory. In PySpark, the optional rsd argument sets the maximum relative standard deviation allowed (default 0.05).

This tutorial explains how the approx_count_distinct function is implemented:
https://mungingdata.com/apache-spark/hyperloglog-count-distinct/
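
To make the idea concrete, here is a deliberately simplified HyperLogLog sketch in plain Python. It is not Spark's implementation (Spark uses a refined variant, HyperLogLog++); the function name hll_estimate, the SHA-1 hash and the register count are assumptions chosen for illustration only.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values using 2**p small registers."""
    m = 1 << p                      # number of registers
    registers = [0] * m
    for value in values:
        # Hash each value to 64 bits (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - p)                   # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)            # bias correction for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)
    return raw


# 50,000 distinct values, each seen twice: duplicates never change a register.
stream = list(range(50_000)) * 2
estimate = hll_estimate(stream)
relative_error = abs(estimate - 50_000) / 50_000
print(round(estimate), f"{relative_error:.2%}")
```

The memory footprint is the fixed list of 4,096 registers, no matter how many values are fed in, whereas an exact distinct count has to remember every distinct value it has seen.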

In [4]:
# Aggregation functions can be used after groupBy or on the whole data frame.
# Count the non-null salary values of the whole data frame.
df.select(count("salary")).show()

# Count the rows of each department.
df.groupBy("department").count().show()


+-------------+
|count(salary)|
+-------------+
|           10|
+-------------+

+----------+-----+
|department|count|
+----------+-----+
|     Sales|    3|
|   Finance|    3|
| Marketing|    2|
|        IT|    2|
+----------+-----+



In [5]:
# countDistinct can take multiple columns as arguments. With more than one column,
# it counts the distinct value combinations (col1, col2, ...).

# salary has 6 distinct values, but (department, salary) has 10 distinct combinations
df.select(countDistinct("salary")).show()
df.select(countDistinct("department", "salary")).show()


+----------------------+
|count(DISTINCT salary)|
+----------------------+
|                     6|
+----------------------+

+----------------------------------+
|count(DISTINCT department, salary)|
+----------------------------------+
|                                10|
+----------------------------------+
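
As a cross-check, the same two counts can be reproduced in plain Python with sets. This is only a sketch of the semantics on the sample data above, not how Spark computes them; the variable names are illustrative.

```python
rows = [
    ("Alice", "Sales", 3000), ("Michael", "Sales", 4600),
    ("Robert", "IT", 4100), ("Maria", "Finance", 3000),
    ("Haha", "IT", 3000), ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000), ("Haha", "Sales", 4100),
]

# One column: distinct values of salary.
distinct_salaries = {salary for _, _, salary in rows}

# Two columns: distinct (department, salary) combinations.
distinct_pairs = {(dept, salary) for _, dept, salary in rows}

print(len(distinct_salaries), len(distinct_pairs))  # 6 10
```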



In [6]:
# approx count uses less resources
df.select(approx_count_distinct("salary")).show()

+-----------------------------+
|approx_count_distinct(salary)|
+-----------------------------+
|                            6|
+-----------------------------+



## 3.2.2 Get basic stats of a column by using avg/mean, min, max

Note that mean is an alias of avg. It is not the median function.
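
The difference between mean and median is easy to see on the salary column. The sketch below uses Python's statistics module on the same ten salaries rather than Spark; for a median inside Spark, an approximate percentile function such as percentile_approx (available in recent PySpark versions) is one option.

```python
import statistics

salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

mean_salary = statistics.mean(salaries)      # sum / count = 34000 / 10
median_salary = statistics.median(salaries)  # middle of the sorted values

print(mean_salary, median_salary)  # 3400 3150.0
```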

In [7]:
# show avg example
# avg only applies to numeric columns; the values of a string column such as name
# cannot be cast to a number, so the result is null (as in the output below)
df.select(avg("salary")).show()
df.select(avg("name")).show()

   

+-----------+
|avg(salary)|
+-----------+
|     3400.0|
+-----------+

+---------+
|avg(name)|
+---------+
|     null|
+---------+



In [8]:
# show mean example
# mean is an alias of avg, so it behaves exactly like avg. It is not the median.
df.select(mean("salary")).show()
df.select(mean("name")).show()



+-----------+
|avg(salary)|
+-----------+
|     3400.0|
+-----------+

+---------+
|avg(name)|
+---------+
|     null|
+---------+



In [9]:
# show min example
# min applies to numeric and string columns; strings are compared in lexicographic order
df.select(min("salary")).show()
df.select(min("name")).show()


+-----------+
|min(salary)|
+-----------+
|       2000|
+-----------+

+---------+
|min(name)|
+---------+
|    Alice|
+---------+



In [10]:
# show max example
# max applies to numeric and string columns; strings are compared in lexicographic order
df.select(max("salary")).show()
df.select(max("name")).show()

+-----------+
|max(salary)|
+-----------+
|       4600|
+-----------+

+---------+
|max(name)|
+---------+
|    Scott|
+---------+



## 3.2.3 collect_list, collect_set 

- collect_list() returns an array that contains all values of an input column, including duplicates.
- collect_set() returns an array that contains the distinct values of an input column. The order of the elements is not guaranteed.
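
The two behaviours map closely to Python's list and set. The following plain-Python sketch mimics collect_list per department and collect_set on the salary column; it only illustrates the semantics, and the variable names are assumptions.

```python
from collections import defaultdict

rows = [
    ("Alice", "Sales", 3000), ("Michael", "Sales", 4600),
    ("Robert", "IT", 4100), ("Maria", "Finance", 3000),
    ("Haha", "IT", 3000), ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000), ("Haha", "Sales", 4100),
]

# Like groupBy("department").agg(collect_list("name")): keep every value.
names_by_dept = defaultdict(list)
for name, dept, _ in rows:
    names_by_dept[dept].append(name)

# Like collect_set("salary"): duplicates collapse, order is not meaningful.
salary_set = {salary for _, _, salary in rows}

print(dict(names_by_dept))
print(sorted(salary_set))
```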

In [11]:
# show collect_list example

# collect the salary of employee, note we have duplicates
df.select(collect_list("salary")).show(truncate=False)

# collect the name of each department
# note collect_list can't be used directly after groupBy, it must be in agg()
df.groupBy("department").agg(collect_list("name")).show(truncate=False)

+------------------------------------------------------------+
|collect_list(salary)                                        |
+------------------------------------------------------------+
|[3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]|
+------------------------------------------------------------+

+----------+----------------------+
|department|collect_list(name)    |
+----------+----------------------+
|Sales     |[Alice, Michael, Haha]|
|Finance   |[Maria, Scott, Jen]   |
|Marketing |[Jeff, Kumar]         |
|IT        |[Robert, Haha]        |
+----------+----------------------+



In [12]:
# show collect_set example
# note that the duplicates are gone
df.select(collect_set("salary")).show(truncate=False)

+------------------------------------+
|collect_set(salary)                 |
+------------------------------------+
|[4600, 3000, 3900, 4100, 3300, 2000]|
+------------------------------------+



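## 3.2.4 grouping

In [None]:
df1 = df.groupBy("department").avg("salary")
df1.show()
# grouping() only works together with cube(), rollup() or GROUPING SETS, so it
# cannot be applied to the result of a plain groupBy as attempted here:
# df1.select(grouping("avg(salary)")).show()
# With rollup, the extra grand-total row has grouping(department) = 1.
df.rollup("department").agg(grouping("department"), avg("salary")).show()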

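## 3.2.5 first and last

- first(colName): returns the first element of a group in a column. When ignorenulls is set to True, it returns the first non-null element.
- last(colName): returns the last element of a group in a column. When ignorenulls is set to True, it returns the last non-null element.

Both functions depend on the order of the rows, which Spark does not guarantee after a shuffle, so their results can be non-deterministic.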

In [16]:
# get the first and last of all rows
df.select(first("salary"),last("name")).show()

+-------------+----------+
|first(salary)|last(name)|
+-------------+----------+
|         3000|      Haha|
+-------------+----------+



In [17]:
# get the first, last of each sub group
df.groupBy("department").agg(first("salary"),last("name")).show()

+----------+-------------+----------+
|department|first(salary)|last(name)|
+----------+-------------+----------+
|     Sales|         3000|      Haha|
|   Finance|         3000|       Jen|
| Marketing|         3000|     Kumar|
|        IT|         4100|      Haha|
+----------+-------------+----------+



## 3.2.6 sum and sumDistinct
- sum(colName): returns the sum of all values in a column.
- sumDistinct(colName): returns the sum of all distinct values in a column. Since Spark 3.2 it is deprecated in favour of sum_distinct.
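
The difference is easiest to see on the salary column, where 3000 appears four times and 4100 twice. A plain-Python sketch of the two sums (illustrative only, not Spark code):

```python
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

salary_sum = sum(salaries)                # every row counts
salary_distinct_sum = sum(set(salaries))  # each distinct value counts once

print(salary_sum, salary_distinct_sum)  # 34000 20900
```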

In [19]:
# get the sum and sumDistinct of the whole data frame
df.select(sum("salary").alias("salary_sum"),sumDistinct("salary").alias("salary_distinct_sum")).show()

+----------+-------------------+
|salary_sum|salary_distinct_sum|
+----------+-------------------+
|     34000|              20900|
+----------+-------------------+



In [21]:
df.groupBy("department").agg(collect_list("salary"),sum("salary").alias("salary_sum"),sumDistinct("salary").alias("salary_distinct_sum")).show()

+----------+--------------------+----------+-------------------+
|department|collect_list(salary)|salary_sum|salary_distinct_sum|
+----------+--------------------+----------+-------------------+
|     Sales|  [4600, 4100, 3000]|     11700|              11700|
|   Finance|  [3900, 3000, 3300]|     10200|              10200|
| Marketing|        [2000, 3000]|      5000|               5000|
|        IT|        [3000, 4100]|      7100|               7100|
+----------+--------------------+----------+-------------------+



## 3.2.7 skewness, kurtosis

In descriptive analysis, we often want to know the skewness of a distribution, that is, how asymmetric it is. We also want to know its kurtosis, which describes how heavy its tails are compared to a normal distribution.

Central tendency
This is the mean, the average value of a distribution. To calculate it, we can use Imputer or Spark SQL's statistics functions such as avg (section 3.2.2).

Dispersion
This is the variance, a measure of how far the data set is spread out. Sections 3.2.8 and 3.2.9 show how to calculate it.

- skewness(colName): Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the centre point.

- kurtosis(colName): Kurtosis measures whether your dataset is heavy-tailed or light-tailed compared to a normal distribution. Data sets with high kurtosis have heavy tails and more outliers, and data sets with low kurtosis tend to have light tails and fewer outliers. The values returned in this section are excess kurtosis, so a normal distribution scores 0.
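
Both measures can be written in terms of central moments. The sketch below is a plain-Python version that assumes population moments (dividing by n) and subtracts 3 from the kurtosis; this assumption is consistent with the values printed in this section, for example a kurtosis of -2.0 for a two-row group. The helper names are illustrative.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    # m3 / m2^(3/2): negative = longer left tail, positive = longer right tail
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    # m4 / m2^2 - 3: 0 for a normal distribution
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


sales = [3000, 4600, 4100]
marketing = [3000, 2000]

print(skewness(sales), excess_kurtosis(sales))
print(skewness(marketing), excess_kurtosis(marketing))
```

On the Sales and Marketing salaries this gives the same figures, up to floating-point rounding, as the grouped Spark output in this section.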



In [37]:
# a positive skewness means the distribution is skewed to the right (longer right tail),
# a negative skewness means it is skewed to the left (longer left tail)
df.select(skewness("salary"), kurtosis("salary")).show()
df.groupBy("department").agg(collect_list("salary"), skewness("salary"),kurtosis("salary")).show()

+--------------------+-------------------+
|    skewness(salary)|   kurtosis(salary)|
+--------------------+-------------------+
|-0.12041791181069564|-0.6467803030303032|
+--------------------+-------------------+

+----------+--------------------+-------------------+----------------+
|department|collect_list(salary)|   skewness(salary)|kurtosis(salary)|
+----------+--------------------+-------------------+----------------+
|     Sales|  [3000, 4600, 4100]|-0.4220804429649461|            -1.5|
|   Finance|  [3000, 3300, 3900]|0.38180177416060623|            -1.5|
| Marketing|        [3000, 2000]|                0.0|            -2.0|
|        IT|        [4100, 3000]|                0.0|            -2.0|
+----------+--------------------+-------------------+----------------+



In [28]:
# In the example below, col1 is skewed to the left and col2 is skewed to the right
data2 = [(20,100),(50,100),(100,100),(100,100),(100,100),(100,100),(100, 150),(100,180)]
df_skew= spark.createDataFrame(data=data2, schema=["col1","col2"])
df_skew.show()

+----+----+
|col1|col2|
+----+----+
|  20| 100|
|  50| 100|
| 100| 100|
| 100| 100|
| 100| 100|
| 100| 100|
| 100| 150|
| 100| 180|
+----+----+



In [30]:
# col1 is skewed to the left, so its skewness is negative
df_skew.select(skewness("col1"), kurtosis("col1")).show()

# col2 is skewed to the right, so its skewness is positive
df_skew.select(skewness("col2"), kurtosis("col2")).show()

# Both columns have the same kurtosis because col2 is the mirror image of col1 (col2 = 200 - col1).
# The value is close to 0, so the tails are only slightly heavier than those of a normal distribution.

+-------------------+-------------------+
|     skewness(col1)|     kurtosis(col1)|
+-------------------+-------------------+
|-1.3746740084540479|0.16603074794216655|
+-------------------+-------------------+

+------------------+-----------------+
|    skewness(col2)|   kurtosis(col2)|
+------------------+-----------------+
|1.3746740084540479|0.166030747942167|
+------------------+-----------------+



In [33]:
data3 = [(100,100),(100,100),(100,110),(100,120),(100,130),(100,140),(100, 150),(110,180)]
df_kurt= spark.createDataFrame(data=data3, schema=["col1","col2"])
df_kurt.show()

+----+----+
|col1|col2|
+----+----+
| 100| 100|
| 100| 100|
| 100| 110|
| 100| 120|
| 100| 130|
| 100| 140|
| 100| 150|
| 110| 180|
+----+----+



In [35]:
# col1 is skewed to the right (positive skewness). Its single outlier, 110, gives it a heavy tail, so kurtosis is positive
df_kurt.select(skewness("col1"), kurtosis("col1")).show()

# col2 is also skewed to the right (positive skewness), but its values are spread fairly evenly, so its tails are light and kurtosis is negative
df_kurt.select(skewness("col2"), kurtosis("col2")).show()

+------------------+------------------+
|    skewness(col1)|    kurtosis(col1)|
+------------------+------------------+
|2.2677868380553634|3.1428571428571432|
+------------------+------------------+

+------------------+-------------------+
|    skewness(col2)|     kurtosis(col2)|
+------------------+-------------------+
|0.6682892518272334|-0.5349496168871446|
+------------------+-------------------+



## 3.2.8 Standard deviation

Standard deviation is a measure of the amount of variation or dispersion of a set of values.
- A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set.
- A high standard deviation indicates that the values are spread out over a wider range.

Spark provides three methods:
- stddev(): alias for stddev_samp.
- stddev_samp(): returns the unbiased sample standard deviation of the expression in a group.
- stddev_pop(): returns population standard deviation of the expression in a group.
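
The only difference between the sample and the population form is the divisor: n - 1 versus n. The plain-Python sketch below uses the standard statistics module on the ten salaries as an independent cross-check; it is not Spark code.

```python
import statistics

salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

sample_std = statistics.stdev(salaries)       # divides by n - 1, like stddev_samp
population_std = statistics.pstdev(salaries)  # divides by n, like stddev_pop

print(sample_std, population_std)
```

The sample form is always slightly larger, and the gap shrinks as the number of rows grows.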

In [38]:
df.select(stddev("salary"), stddev_samp("salary"), stddev_pop("salary")).show()

+-------------------+-------------------+------------------+
|stddev_samp(salary)|stddev_samp(salary)|stddev_pop(salary)|
+-------------------+-------------------+------------------+
|  765.9416862050705|  765.9416862050705|  726.636084983398|
+-------------------+-------------------+------------------+



In [39]:
df.groupBy("department").agg(collect_list("salary"), stddev("salary"),stddev_pop("salary")).show()

+----------+--------------------+-------------------+------------------+
|department|collect_list(salary)|stddev_samp(salary)|stddev_pop(salary)|
+----------+--------------------+-------------------+------------------+
|     Sales|  [3000, 4600, 4100]|   818.535277187245| 668.3312551921141|
|   Finance|  [3000, 3300, 3900]|   458.257569495584|374.16573867739413|
| Marketing|        [3000, 2000]|  707.1067811865476|             500.0|
|        IT|        [4100, 3000]|  777.8174593052023|             550.0|
+----------+--------------------+-------------------+------------------+



## 3.2.9 Variance

Variance is also a measure of the amount of variation or dispersion of a set of values. It is the average of the squared deviations from the mean, and the standard deviation is its square root. The deviations are squared for two reasons: squaring weights outliers more heavily than points that are near the mean, and it prevents differences above the mean from cancelling those below the mean.

However, squaring changes the unit of measurement of the original data. For example, if a column contains values in centimeters, its variance is in squared centimeters, which is harder to interpret. The standard deviation is back in centimeters.

Spark provides three methods:
- variance(): alias for var_samp.
- var_samp(): returns the unbiased sample variance of the values in a group.
- var_pop(): returns the population variance of the values in a group. 

A high variance means the data are spread over a wider range.
A low variance means the data are close to the mean.
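
The relations between the two variances and the standard deviation can be checked in plain Python. This is a sketch with the standard statistics module on the ten salaries, not Spark code; the variable names are illustrative.

```python
import math
import statistics

salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]
n = len(salaries)

sample_var = statistics.variance(salaries)       # divides by n - 1, like var_samp
population_var = statistics.pvariance(salaries)  # divides by n, like var_pop

# The two forms differ only by the factor n / (n - 1).
rescaled = population_var * n / (n - 1)

# The standard deviation is the square root of the variance.
population_std = math.sqrt(population_var)

print(sample_var, population_var, rescaled, population_std)
```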

In [40]:

df.select(variance("salary"), var_samp("salary"), var_pop("salary")).show()

+-----------------+-----------------+---------------+
| var_samp(salary)| var_samp(salary)|var_pop(salary)|
+-----------------+-----------------+---------------+
|586666.6666666666|586666.6666666666|       528000.0|
+-----------------+-----------------+---------------+



In [41]:
df.groupBy("department").agg(collect_list("salary"), variance("salary"),var_pop("salary")).show()

+----------+--------------------+----------------+-----------------+
|department|collect_list(salary)|var_samp(salary)|  var_pop(salary)|
+----------+--------------------+----------------+-----------------+
|     Sales|  [3000, 4600, 4100]|        670000.0|446666.6666666667|
|   Finance|  [3000, 3300, 3900]|        210000.0|         140000.0|
| Marketing|        [3000, 2000]|        500000.0|         250000.0|
|        IT|        [4100, 3000]|        605000.0|         302500.0|
+----------+--------------------+----------------+-----------------+

