# Overview
Aggregation methods compute descriptive statistics over groups of rows. Descriptive statistics include, but are not limited to, count, sum, min, max, and average.

Start by creating a DataFrame of student scores, one row per student and subject:

In [0]:
sdf = spark.createDataFrame(
[
  ('Robin',  'Science', 95),
  ('Nathan', 'Science', 78),
  ('Anna',   'Science', 88),
  ('Sonia',  'Science', 87),
  ('Robin' , 'English', 95),
  ('Nathan', 'English', 80),
  ('Anna'  , 'English', 87),
  ('Sonia' , 'English', 91)
],
  ['student_name', 'subject_name', 'subject_score']
)

sdf.show(n=1, truncate=False)

+------------+------------+-------------+
|student_name|subject_name|subject_score|
+------------+------------+-------------+
|Robin       |Science     |95           |
+------------+------------+-------------+
only showing top 1 row



### Aggregation Methods:

The .count() method returns the number of records in the DataFrame:

In [0]:
sdf.count()

Out[2]: 8

The .groupBy() method groups rows by category so that metrics can be computed for each group:

In [0]:
# Lowest score per student, using .min()
sdf_min = sdf.groupBy('student_name').min('subject_score')
sdf_min.show()

# Total score per student, using .sum()
sdf_sum = sdf.groupBy('student_name').sum('subject_score')
sdf_sum.show(n=10, truncate=False)

+------------+------------------+
|student_name|min(subject_score)|
+------------+------------------+
|       Robin|                95|
|      Nathan|                78|
|        Anna|                87|
|       Sonia|                87|
+------------+------------------+

+------------+------------------+
|student_name|sum(subject_score)|
+------------+------------------+
|Robin       |190               |
|Nathan      |158               |
|Anna        |175               |
|Sonia       |178               |
+------------+------------------+
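
To make the grouping step concrete, the per-student minimum and sum can be cross-checked without Spark. The following is a minimal plain-Python sketch, not PySpark code: `rows` and `group_scores` are hypothetical names introduced here for illustration, and the data is copied from the DataFrame created above.

```python
from collections import defaultdict

# Same records as the DataFrame: (student_name, subject_name, subject_score)
rows = [
    ('Robin', 'Science', 95), ('Nathan', 'Science', 78),
    ('Anna', 'Science', 88), ('Sonia', 'Science', 87),
    ('Robin', 'English', 95), ('Nathan', 'English', 80),
    ('Anna', 'English', 87), ('Sonia', 'English', 91),
]

def group_scores(records):
    """Collect each student's scores, mirroring groupBy('student_name')."""
    groups = defaultdict(list)
    for student_name, _subject_name, subject_score in records:
        groups[student_name].append(subject_score)
    return dict(groups)

scores_by_student = group_scores(rows)

# Apply one aggregate per group, as .min() and .sum() do after groupBy
min_by_student = {name: min(vals) for name, vals in scores_by_student.items()}
sum_by_student = {name: sum(vals) for name, vals in scores_by_student.items()}

print(min_by_student)
print(sum_by_student)
```

The two dictionaries hold the same values as the `min(subject_score)` and `sum(subject_score)` columns printed above.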



The .agg() method computes several metrics in a single call. Chaining .alias() onto each aggregate expression sets the name of its output column.

In [0]:
import pyspark.sql.functions as f

# Parentheses allow the chain to span lines without backslashes
(
    sdf.groupBy("student_name")
    .agg(
        f.count('subject_score').alias("count_subject_score"),
        f.min('subject_score').alias("min_subject_score"),
        f.max('subject_score').alias("max_subject_score"),
        f.sum('subject_score').alias("sum_subject_score"),
        f.avg('subject_score').alias("avg_subject_score"),
    )
    .show(truncate=False)
)

+------------+-------------------+-----------------+-----------------+-----------------+-----------------+
|student_name|count_subject_score|min_subject_score|max_subject_score|sum_subject_score|avg_subject_score|
+------------+-------------------+-----------------+-----------------+-----------------+-----------------+
|Robin       |2                  |95               |95               |190              |95.0             |
|Nathan      |2                  |78               |80               |158              |79.0             |
|Anna        |2                  |87               |88               |175              |87.5             |
|Sonia       |2                  |87               |91               |178              |89.0             |
+------------+-------------------+-----------------+-----------------+-----------------+-----------------+
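
As a rough check on the table above, the same five metrics can be computed in plain Python, with dictionary keys playing the role of the aliased column names. This is only a sketch under that assumption: `rows`, `scores`, and `summary` are hypothetical names, and nothing here calls the Spark API.

```python
from collections import defaultdict

# Same records as the DataFrame: (student_name, subject_name, subject_score)
rows = [
    ('Robin', 'Science', 95), ('Nathan', 'Science', 78),
    ('Anna', 'Science', 88), ('Sonia', 'Science', 87),
    ('Robin', 'English', 95), ('Nathan', 'English', 80),
    ('Anna', 'English', 87), ('Sonia', 'English', 91),
]

# Group the scores by student
scores = defaultdict(list)
for student_name, _subject_name, subject_score in rows:
    scores[student_name].append(subject_score)

# One entry per student; each key corresponds to an aliased output column
summary = {
    name: {
        'count_subject_score': len(vals),
        'min_subject_score': min(vals),
        'max_subject_score': max(vals),
        'sum_subject_score': sum(vals),
        'avg_subject_score': sum(vals) / len(vals),  # true division gives a float
    }
    for name, vals in scores.items()
}

for name, metrics in summary.items():
    print(name, metrics)
```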



### Sorting Aggregations:

The .sort() method orders the rows of a DataFrame. Its ascending parameter controls whether the order is ascending (the default) or descending.

The first option is to chain .sort() onto the end of an aggregation:

In [0]:
sdf_max = sdf.groupBy('student_name').max('subject_score').sort('max(subject_score)', ascending=False)
sdf_max.show()

+------------+------------------+
|student_name|max(subject_score)|
+------------+------------------+
|       Robin|                95|
|       Sonia|                91|
|        Anna|                88|
|      Nathan|                80|
+------------+------------------+



Alternatively, the .orderBy() method, an alias of .sort(), can be used. If the column is referenced explicitly with the col() function, .desc() or .asc() can be chained onto it to set the sort direction:

In [0]:
from pyspark.sql.functions import col
sdf_sort = sdf_sum.orderBy(col('sum(subject_score)').desc())
sdf_sort.show()

+------------+------------------+
|student_name|sum(subject_score)|
+------------+------------------+
|       Robin|               190|
|       Sonia|               178|
|        Anna|               175|
|      Nathan|               158|
+------------+------------------+
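
The descending order shown above corresponds to an ordinary sort on the aggregated value with the direction reversed. A minimal plain-Python sketch of that idea follows; `sum_by_student` is a hypothetical dictionary holding the totals copied from the output above, and the code does not use Spark.

```python
# Per-student totals, copied from the sum(subject_score) column
sum_by_student = {'Robin': 190, 'Nathan': 158, 'Anna': 175, 'Sonia': 178}

# Descending by total, comparable to .sort(..., ascending=False) or col(...).desc()
ranked_desc = sorted(sum_by_student.items(), key=lambda item: item[1], reverse=True)

# Ascending by total, comparable to the default .sort(...) or col(...).asc()
ranked_asc = sorted(sum_by_student.items(), key=lambda item: item[1])

print(ranked_desc)  # [('Robin', 190), ('Sonia', 178), ('Anna', 175), ('Nathan', 158)]
print(ranked_asc)   # [('Nathan', 158), ('Anna', 175), ('Sonia', 178), ('Robin', 190)]
```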

