# 3.2 Aggregation functions

**Aggregate functions** operate on a group of rows and calculate a single return value for every group.

In the previous GroupBy section we saw a few aggregation examples. In this section we look at the main built-in aggregation functions in Spark. The list, grouped by purpose:

- approx_count_distinct, count, countDistinct
- avg/mean, max, min
- collect_list, collect_set
- grouping: indicates whether a grouping column has been rolled up (aggregated) in a given result row, which happens with cube or rollup; returns 1 for aggregated, 0 for not aggregated
- first, last
- sum, sumDistinct
- kurtosis, skewness
- stddev, stddev_samp, stddev_pop
- variance, var_samp, var_pop

In [1]:
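import os  # needed for os.environ in the Kubernetes session config below

from pyspark.sql import SparkSession, DataFrame
# Note: min, max and sum imported here shadow the Python built-ins of the same name.
from pyspark.sql.functions import count, countDistinct, approx_count_distinct, avg, min, max, mean, collect_list, \
    collect_set, grouping, first, last, sum, sumDistinct, skewness, kurtosis, stddev, stddev_samp, stddev_pop, \
    variance, var_samp, var_pop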

In [2]:
local=True

if local:
    spark=SparkSession.builder.master("local[4]").appName("pySparkGroupBy").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("SparkArrowCompression") \
                      .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .getOrCreate()


In [3]:
data = [
    ("Alice", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "IT", 4100),
    ("Maria", "Finance", 3000),
    ("Haha", "IT", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Haha", "Sales", 4100),
]
schema = ["name", "department", "salary"]

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|  Alice|     Sales|  3000|
|Michael|     Sales|  4600|
| Robert|        IT|  4100|
|  Maria|   Finance|  3000|
|   Haha|        IT|  3000|
|  Scott|   Finance|  3300|
|    Jen|   Finance|  3900|
|   Jeff| Marketing|  3000|
|  Kumar| Marketing|  2000|
|   Haha|     Sales|  4100|
+-------+----------+------+



## 3.2.1 Count, countDistinct, approx_count_distinct
The classic count() returns the number of rows in a group. When it is given a column, it counts only the rows where that column is not null.

**approx_count_distinct** estimates the number of distinct values without running an exact **count(distinct())**, which requires more and more memory as the number of distinct values increases. It uses an algorithm called **HyperLogLog**, which can estimate distinct counts beyond 1,000,000,000 with a typical error of about 2% while using much less memory. In Spark, the target accuracy is set by the optional rsd argument (maximum relative standard deviation allowed, default 0.05).

This tutorial shows how the approx_count_distinct function is implemented:
https://mungingdata.com/apache-spark/hyperloglog-count-distinct/
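
To make the idea behind HyperLogLog more concrete, here is a toy pure-Python sketch. It is not Spark's implementation (Spark uses a tuned HyperLogLog++ variant); the function name hll_estimate, the choice of hash and the default of 4096 registers are assumptions made for illustration only.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Toy HyperLogLog: estimate the number of distinct items in `values`."""
    m = 1 << p                                   # number of registers (4096 for p=12)
    registers = [0] * m
    for v in values:
        # Deterministic 64-bit hash of the value.
        digest = hashlib.blake2b(str(v).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h >> (64 - p)                      # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:                 # small-range correction (linear counting)
        return m * math.log(m / zeros)
    return raw


# 200,000 rows but only 50,000 distinct values.
sample = [i % 50_000 for i in range(200_000)]
estimate = hll_estimate(sample)
print(round(estimate), "estimated vs", len(set(sample)), "exact")
```

The memory used is the fixed array of registers, regardless of how many distinct values are seen, which is why the approach scales where an exact distinct count has to remember every value.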

In [8]:
# Aggregation functions can be used after groupBy or on the whole data frame.
df.select(count("salary")).show()

# count the rows in each department
df.groupBy("department").count().show()


+-------------+
|count(salary)|
+-------------+
|           10|
+-------------+

+----------+-----+
|department|count|
+----------+-----+
|     Sales|    3|
|   Finance|    3|
| Marketing|    2|
|        IT|    2|
+----------+-----+



In [9]:
# countDistinct can take multiple columns as arguments. With more than one column, it counts
# the distinct combinations of values (col1, col2, ...).

# salary has 6 distinct values, but (department, salary) has 10 distinct combinations
df.select(countDistinct("salary")).show()
df.select(countDistinct("department", "salary")).show()


+----------------------+
|count(DISTINCT salary)|
+----------------------+
|                     6|
+----------------------+

+----------------------------------+
|count(DISTINCT department, salary)|
+----------------------------------+
|                                10|
+----------------------------------+



In [10]:
# approx_count_distinct uses fewer resources than an exact distinct count
df.select(approx_count_distinct("salary")).show()

+-----------------------------+
|approx_count_distinct(salary)|
+-----------------------------+
|                            6|
+-----------------------------+



## 3.2.2 Get basic stats of a column by using avg/mean, min, max

Note that mean is an alias of avg: it computes the arithmetic mean, not the median.

In [11]:
# show avg example
# avg only makes sense on numeric columns; on a string column such as name,
# whose values cannot be read as numbers, it returns null
df.select(avg("salary")).show()
df.select(avg("name")).show()

   

+-----------+
|avg(salary)|
+-----------+
|     3400.0|
+-----------+

+---------+
|avg(name)|
+---------+
|     null|
+---------+



In [12]:
# show mean example
# mean is an alias of avg, so it behaves exactly like avg. It is not the median.
df.select(mean("salary")).show()
df.select(mean("name")).show()



+-----------+
|avg(salary)|
+-----------+
|     3400.0|
+-----------+

+---------+
|avg(name)|
+---------+
|     null|
+---------+



In [13]:
# show min example
# min works on numeric and string columns; it returns the first value in ascending sort order
df.select(min("salary")).show()
df.select(min("name")).show()


+-----------+
|min(salary)|
+-----------+
|       2000|
+-----------+

+---------+
|min(name)|
+---------+
|    Alice|
+---------+



In [14]:
# show max example
# max works on numeric and string columns; it returns the last value in ascending sort order
df.select(max("salary")).show()
df.select(max("name")).show()

+-----------+
|max(salary)|
+-----------+
|       4600|
+-----------+

+---------+
|max(name)|
+---------+
|    Scott|
+---------+



## 3.2.3 collect_list, collect_set 

- collect_list() returns an array that contains all values of the input column, duplicates included.
- collect_set() returns an array that contains the distinct values of the input column, with duplicates removed. The order of the elements is not guaranteed.

In [19]:
# show collect_list example

# collect the salaries of all employees; note that duplicates are kept
df.select(collect_list("salary")).show(truncate=False)

# collect the employee names of each department
# note that collect_list can't be called directly on the groupBy result, it must be passed to agg()
df.groupBy("department").agg(collect_list("name")).show(truncate=False)

+------------------------------------------------------------+
|collect_list(salary)                                        |
+------------------------------------------------------------+
|[3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]|
+------------------------------------------------------------+

+----------+----------------------+
|department|collect_list(name)    |
+----------+----------------------+
|Sales     |[Alice, Michael, Haha]|
|Finance   |[Maria, Scott, Jen]   |
|Marketing |[Jeff, Kumar]         |
|IT        |[Robert, Haha]        |
+----------+----------------------+



In [16]:
# show collect_set example
# note that duplicates are removed
df.select(collect_set("salary")).show(truncate=False)

+------------------------------------+
|collect_set(salary)                 |
+------------------------------------+
|[4600, 3000, 3900, 4100, 3300, 2000]|
+------------------------------------+

