## PySpark Aggregation

The PySpark `groupBy()` method collects DataFrame rows that share the same values in the given columns into groups, so that aggregate functions can be applied to each group.

| Function | Description |
| - | - |
| count() | Returns the number of rows in each group. |
| mean()  | Returns the mean of the values in each group (alias for avg()). |
| max()   | Returns the maximum value in each group. |
| min()   | Returns the minimum value in each group. |
| sum()   | Returns the total of the values in each group. |
| avg()   | Returns the average of the values in each group. |
| agg()   | Computes more than one aggregate at a time. |
| pivot() | Pivots the DataFrame (see the pivoting notebook). |

In [0]:
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this command.

#### Load libraries

In [0]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField, ArrayType, MapType, DoubleType
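# Note: sum, max and min imported here shadow Python's built-ins of the same name for the rest of the notebook
from pyspark.sql.functions import lit, col, expr, when, sum, avg, max, min, mean, count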

#### Create Spark session

In [0]:
spark = SparkSession.builder.appName('PySpark Aggregation').getOrCreate()

In [0]:
data = [
    ('Sam', 'Software Engineer', 'US', 5000, 30, 500),
    ('Adam', 'Data Scientist', 'US', 6000, 58, 550),
    ('Jonas', 'Sales Person', 'Wales', 5000, 41, 500),
    ('Peter', 'CTO', 'Ireland', 10000, 35, 1500),
    ('Ann', 'Data Analyst', 'Australia', 6000, 24, 500),
    ('Ralph', 'CEO', 'Germany', 15000, 50, 2500),
    ('Lekhana', 'Advertising', 'England', 4500, 27, 560),
    ('Tomas', 'Marketing', 'Hungary', 4500, 30, 570),
    ('Nick', 'Data Engineer', 'Ireland', 5000, 41, 600),
    ('Wade', 'Data Engineer', 'Scotland', 5500, 25, 600)
]

columns = ['name', 'job', 'country', 'salary', 'age', 'bonus']

df = spark.createDataFrame(data=data, schema=columns)

df.printSchema()
df.show(truncate=False)

#### groupBy and aggregate on DataFrame columns

In [0]:
df.groupBy('job').avg('salary').show(truncate=False)

In [0]:
df.groupBy('job').count().show()

In [0]:
df.groupBy('job').min('salary').show()

In [0]:
# Without groupBy, count() returns the total number of rows in the DataFrame
df.count()

#### Aggregate on multiple columns

In [0]:
df.groupBy('country').sum('salary','bonus').show()

#### Running multiple aggregates at once

In [0]:
# Wrapping the chain in parentheses avoids backslash line continuations
(
    df
    .groupBy('country')
    .agg(
        sum('salary').alias('sum_salary'),
        avg('salary').alias('avg_salary'),
        sum('bonus').alias('sum_bonus'),
        max('bonus').alias('max_bonus')
    )
    .show()
)

#### The end of the notebook