# 3.1 Spark GroupBy

Similar to the SQL GROUP BY clause, the PySpark groupBy() function collects rows with identical key values into
groups on a DataFrame so that aggregate functions can be applied to each group. The signature below is the Scala one:
groupBy(col1 : scala.Predef.String, cols : scala.Predef.String*) :
      org.apache.spark.sql.RelationalGroupedDataset
It takes one or more column names and returns a GroupedData object (RelationalGroupedDataset in Scala), which provides the following aggregation functions.
- count() - Returns the count of rows for each group.
- mean() - Returns the mean of values for each group.
- max() - Returns the maximum of values for each group.
- min() - Returns the minimum of values for each group.
- sum() - Returns the sum of values for each group.
- avg() - Returns the average of values for each group (same result as mean()).
- agg() - Calculates more than one aggregate at a time.
- pivot() - Pivots the distinct values of a column into separate columns.

In [1]:
# os is needed for the os.environ lookups in the Kubernetes configuration below
import os

from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql import functions as f

In [3]:
local=True

if local:
    spark=SparkSession.builder.master("local[4]").appName("pySparkGroupBy").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("SparkArrowCompression") \
                      .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .getOrCreate()


In [4]:
data = [("James", "Sales", "NY", 90000, 34, 10000),
            ("Michael", "Sales", "NY", 86000, 56, 20000),
            ("Robert", "Sales", "CA", 81000, 30, 23000),
            ("Maria", "Finance", "CA", 90000, 24, 23000),
            ("Raman", "Finance", "CA", 99000, 40, 24000),
            ("Scott", "Finance", "NY", 83000, 36, 19000),
            ("Jen", "Finance", "NY", 79000, 53, 15000),
            ("Jeff", "Marketing", "CA", 80000, 25, 18000),
            ("Kumar", "Marketing", "NY", 91000, 50, 21000)
            ]

schema = ["employee_name", "department", "state", "salary", "age", "bonus"]
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)


root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
+-------------+----------+-----+------+---+-----+



## 3.1.1 GroupBy single column

In this example, we group by a single column and aggregate one or more columns.

In [5]:
# Group by department and count the rows in each group. If each row represents a distinct employee,
# this gives the number of employees in each department.
df.groupBy("department").count().show()

+----------+-----+
|department|count|
+----------+-----+
|     Sales|    3|
|   Finance|    4|
| Marketing|    2|
+----------+-----+



In [6]:
# get the average salary and age of each state
df.groupBy("state").avg("salary","age").show()

+-----+-----------+--------+
|state|avg(salary)|avg(age)|
+-----+-----------+--------+
|   CA|    87500.0|   29.75|
|   NY|    85800.0|    45.8|
+-----+-----------+--------+



## 3.1.2 GroupBy multiple columns
We have seen how to groupBy one column. We can also groupBy multiple columns.

Suppose we group by two columns A and B, where A has m distinct values and B has n distinct values. The result then has at most m*n rows: one row for each (A, B) combination that actually occurs in the data.
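
As a sanity check on the row count, the sketch below counts the distinct (department, state) pairs in plain Python rather than Spark. The `rows` list is a hand-copied projection of the two key columns from the sample data, and the names `observed_pairs` and `without_jeff` are only illustrative. It suggests that m*n is an upper bound, reached only when every combination occurs at least once.

```python
# Plain-Python check (no Spark needed): a groupBy yields one row per
# distinct key combination that actually occurs in the data.
rows = [
    ("Sales", "NY"), ("Sales", "NY"), ("Sales", "CA"),
    ("Finance", "CA"), ("Finance", "CA"), ("Finance", "NY"), ("Finance", "NY"),
    ("Marketing", "CA"), ("Marketing", "NY"),
]

departments = {dept for dept, _ in rows}   # m distinct values
states = {state for _, state in rows}      # n distinct values
observed_pairs = set(rows)                 # the groups a groupBy would produce

upper_bound = len(departments) * len(states)
print(f"m*n upper bound: {upper_bound}, observed groups: {len(observed_pairs)}")

# Dropping the only (Marketing, CA) row leaves 5 groups although m*n is still 6.
without_jeff = [r for r in rows if r != ("Marketing", "CA")]
print(f"groups without (Marketing, CA): {len(set(without_jeff))}")
```

With the sample data every combination is present, so the two numbers coincide; with sparser data the group count would be smaller than m*n.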

In [7]:
# In this example, we group rows by their department (Sales, Finance, Marketing) and state (CA, NY).
# There are 6 possible combinations; for each one, we compute the maximum salary and bonus.
df.groupBy("department", "state").max("salary", "bonus").show()

+----------+-----+-----------+----------+
|department|state|max(salary)|max(bonus)|
+----------+-----+-----------+----------+
|   Finance|   NY|      83000|     19000|
| Marketing|   NY|      91000|     21000|
|     Sales|   CA|      81000|     23000|
| Marketing|   CA|      80000|     18000|
|   Finance|   CA|      99000|     24000|
|     Sales|   NY|      90000|     20000|
+----------+-----+-----------+----------+



## 3.1.3 Apply multiple aggregation functions on one groupBy

Shortcut methods such as min() and max() apply a single statistic to the selected columns. To compute several statistics at once, or different statistics for different columns, use agg().

Note that avg and mean return the same result.

In [8]:
df.groupBy("department").agg(
  f.min("salary").alias("min_salary"),
  f.max("salary").alias("max_salary"),
  f.avg("salary").alias("avg_salary"),
  f.mean("salary").alias("mean_salary")
).show()

+----------+----------+----------+-----------------+-----------------+
|department|min_salary|max_salary|       avg_salary|      mean_salary|
+----------+----------+----------+-----------------+-----------------+
|     Sales|     81000|     90000|85666.66666666667|85666.66666666667|
|   Finance|     79000|     99000|          87750.0|          87750.0|
| Marketing|     80000|     91000|          85500.0|          85500.0|
+----------+----------+----------+-----------------+-----------------+



Other aggregation functions can be used inside agg(). The full list of predefined aggregation functions is here:
https://sparkbyexamples.com/pyspark/pyspark-aggregate-functions/

The example below uses collect_list, which gathers the values of each group into a list.

In [9]:
df.groupBy("department").agg(f.collect_list("employee_name")).show(truncate=False)

+----------+---------------------------+
|department|collect_list(employee_name)|
+----------+---------------------------+
|Sales     |[James, Michael, Robert]   |
|Finance   |[Maria, Raman, Scott, Jen] |
|Marketing |[Jeff, Kumar]              |
+----------+---------------------------+



## 3.1.4 Cube

We have seen that with GROUP BY, every row is included exactly once, in the summary of its own group.

With GROUP BY CUBE(..), every row is included in the summary of each combination of levels it represents, wildcards included. A wildcard means "all values of this column" and is shown as a null placeholder in the output.

In [10]:
# Compare the result with groupBy("department").
# We cube only one column, so there are 3 distinct values plus the wildcard: 4 rows in total.
# The wildcard (null) represents all departments.

df.cube("department").count().show()

+----------+-----+
|department|count|
+----------+-----+
|      null|    9|
| Marketing|    2|
|     Sales|    3|
|   Finance|    4|
+----------+-----+



In [13]:
# If we cube two columns, department (3 distinct values + wildcard) and state (2 distinct values + wildcard), we should have 4*3 = 12 rows in total
df1=df.cube("department","state").count()
print(f"Total row count: {df1.count()}")
# (Finance null) means the Finance department across all states.
# (null   NY) means all departments in the state NY.
# (null null) means all departments and all states.
df1.show()

Total row count: 12
+----------+-----+-----+
|department|state|count|
+----------+-----+-----+
|   Finance| null|    4|
| Marketing| null|    2|
|   Finance|   CA|    2|
|   Finance|   NY|    2|
|     Sales|   CA|    1|
|     Sales|   NY|    2|
|      null| null|    9|
|      null|   NY|    5|
| Marketing|   CA|    1|
|      null|   CA|    4|
| Marketing|   NY|    1|
|     Sales| null|    3|
+----------+-----+-----+
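
To make the "every combination of levels" idea concrete, here is a minimal plain-Python sketch (no Spark involved) that emulates the cube count. It assumes the cube is the union of a group-by over every subset of the key columns, with `None` standing in for the null wildcard. The function `cube_counts` and the `rows` list (the two key columns copied by hand from the sample data) are illustrative names, not part of the PySpark API.

```python
from collections import Counter
from itertools import combinations

# (department, state) for each of the 9 sample employees
rows = [
    ("Sales", "NY"), ("Sales", "NY"), ("Sales", "CA"),
    ("Finance", "CA"), ("Finance", "CA"), ("Finance", "NY"), ("Finance", "NY"),
    ("Marketing", "CA"), ("Marketing", "NY"),
]

def cube_counts(rows, n_cols):
    """Count rows for every subset of key columns; None marks a wildcard column."""
    counts = Counter()
    for size in range(n_cols + 1):
        # Each subset of column positions is one grouping set of the cube.
        for kept in combinations(range(n_cols), size):
            for row in rows:
                key = tuple(row[i] if i in kept else None for i in range(n_cols))
                counts[key] += 1
    return counts

cube = cube_counts(rows, 2)
print(f"Total row count: {len(cube)}")
print(cube[("Finance", None)], cube[(None, "NY")], cube[(None, None)])
```

For n key columns this enumerates 2^n grouping sets, which is why a cube over many columns can grow quickly.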



## 3.1.5 Rollup
rollup() computes hierarchical subtotals, following the order of the input columns from left to right.

In [17]:
# We rollup("department", "state"). The first column, department, has 3 distinct values;
# the second column, state, has 2 distinct values, plus the wildcard.
# So we have 3*3 rows + 1 global wildcard row = 10 rows in the rollup result.
# Similar to cube:
# (Finance null) means the Finance department across all states.
# (null null) means all departments and all states.

# Notice that rows like (null   NY) no longer appear, because the order of the input columns defines the hierarchy.

df2=df.rollup("department", "state").count()
print(f"Total row count {df2.count()}")
df2.show()

Total row count 10
+----------+-----+-----+
|department|state|count|
+----------+-----+-----+
|   Finance| null|    4|
| Marketing| null|    2|
|   Finance|   CA|    2|
|   Finance|   NY|    2|
|     Sales|   CA|    1|
|     Sales|   NY|    2|
|      null| null|    9|
| Marketing|   CA|    1|
| Marketing|   NY|    1|
|     Sales| null|    3|
+----------+-----+-----+
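
The hierarchy can be sketched in plain Python as well (no Spark involved). The sketch assumes that a rollup over n columns is the union of group-bys over each left-to-right prefix of the column list, from all n columns down to none, with `None` as the null wildcard. `rollup_counts` and `rows` are illustrative names, and the rows are the two key columns copied by hand from the sample data.

```python
from collections import Counter

# (department, state) for each of the 9 sample employees
rows = [
    ("Sales", "NY"), ("Sales", "NY"), ("Sales", "CA"),
    ("Finance", "CA"), ("Finance", "CA"), ("Finance", "NY"), ("Finance", "NY"),
    ("Marketing", "CA"), ("Marketing", "NY"),
]

def rollup_counts(rows, n_cols):
    """Count rows for each prefix of the key columns; None marks a wildcard column."""
    counts = Counter()
    # Prefix lengths n, n-1, ..., 0: (department, state), (department), ()
    for prefix_len in range(n_cols, -1, -1):
        for row in rows:
            key = tuple(row[i] if i < prefix_len else None for i in range(n_cols))
            counts[key] += 1
    return counts

rollup = rollup_counts(rows, 2)
print(f"Total row count {len(rollup)}")
print((None, "NY") in rollup)  # state alone is not a prefix, so no such row
```

Only n+1 grouping sets are produced here, compared with 2^n for a cube, so swapping the column order changes which subtotals appear.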

