## Groupby
Similar to the SQL GROUP BY clause, the PySpark `groupBy()` function collects rows with identical values into groups on a DataFrame so that aggregate functions can be run on the grouped data.

When we call `groupBy()` on a PySpark DataFrame, it returns a `GroupedData` object, which provides the following aggregate functions.

* `count()` - Returns the count of rows for each group.

* `mean()` - Returns the mean of values for each group.

* `max()` - Returns the maximum of values for each group.

* `min()` - Returns the minimum of values for each group.

* `sum()` - Returns the total of values for each group.

* `avg()` - Returns the average of values for each group.

* `agg()` - Calculates more than one aggregate at a time.

* `pivot()` - Pivots the DataFrame. It is not covered in this article because I already have a dedicated article on Pivot & Unpivot DataFrame.
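
Conceptually, each of these functions follows a split, apply, combine pattern: rows are split into groups by a key, and an aggregate is applied to each group. The plain-Python sketch below illustrates only that idea on a few hand-written (department, salary) pairs. It is not how Spark executes the query internally, and the names `rows`, `groups`, and `summary` are made up for this illustration.

```python
from collections import defaultdict

# Illustrative (department, salary) pairs
rows = [("Sales", 90000), ("Sales", 86000), ("Sales", 81000),
        ("Finance", 90000), ("Finance", 99000), ("Finance", 83000), ("Finance", 79000),
        ("Marketing", 80000), ("Marketing", 91000)]

# Step 1: split the rows into groups keyed by department
groups = defaultdict(list)
for department, salary in rows:
    groups[department].append(salary)

# Step 2: apply the aggregates to each group and combine the results
summary = {
    department: {
        "count": len(salaries),
        "sum": sum(salaries),
        "min": min(salaries),
        "max": max(salaries),
        "avg": sum(salaries) / len(salaries),
    }
    for department, salaries in groups.items()
}

for department, stats in summary.items():
    print(department, stats)
```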

Create a DataFrame from a list of tuples to work with. This DataFrame contains the columns `employee_name`, `department`, `state`, `salary`, `age`, and `bonus`.

In [16]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("groupby").getOrCreate()

simpleData = [("James","Sales","NY",90000,34,10000),
              ("Michael","Sales","NY",86000,56,20000),
              ("Robert","Sales","CA",81000,30,23000),
              ("Maria","Finance","CA",90000,24,23000),
              ("Raman","Finance","CA",99000,40,24000),
              ("Scott","Finance","NY",83000,36,19000),
              ("Jen","Finance","NY",79000,53,15000),
              ("Jeff","Marketing","CA",80000,25,18000),
              ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
+-------------+----------+-----+------+---+-----+



### groupBy and aggregate on DataFrame columns
Let’s run `groupBy()` on the `department` column of the DataFrame and then find the sum of `salary` for each department using the `sum()` aggregate function.

In [17]:
df.groupBy("department").sum("salary").show(truncate=False)

+----------+-----------+
|department|sum(salary)|
+----------+-----------+
|Sales     |257000     |
|Finance   |351000     |
|Marketing |171000     |
+----------+-----------+



Similarly, we can calculate the number of employees in each department using `count()`.

In [18]:
df.groupBy("department").count().show()

+----------+-----+
|department|count|
+----------+-----+
|     Sales|    3|
|   Finance|    4|
| Marketing|    2|
+----------+-----+



Calculate the minimum salary of each department using `min()`

In [19]:
df.groupBy("department").min("salary").show()

+----------+-----------+
|department|min(salary)|
+----------+-----------+
|     Sales|      81000|
|   Finance|      79000|
| Marketing|      80000|
+----------+-----------+



Calculate the maximum salary of each department using `max()`

In [20]:
df.groupBy("department").max("salary").show()

+----------+-----------+
|department|max(salary)|
+----------+-----------+
|     Sales|      90000|
|   Finance|      99000|
| Marketing|      91000|
+----------+-----------+



Calculate the average salary of each department using `avg()`

In [21]:
df.groupBy("department").avg("salary").show()

+----------+-----------------+
|department|      avg(salary)|
+----------+-----------------+
|     Sales|85666.66666666667|
|   Finance|          87750.0|
| Marketing|          85500.0|
+----------+-----------------+



Calculate the mean salary of each department using `mean()`. It is an alias for `avg()`, which is why the output column is still named `avg(salary)`.

In [22]:
df.groupBy("department").mean("salary").show()

+----------+-----------------+
|department|      avg(salary)|
+----------+-----------------+
|     Sales|85666.66666666667|
|   Finance|          87750.0|
| Marketing|          85500.0|
+----------+-----------------+



### groupBy and aggregate on multiple columns
Similarly, we can run groupBy and aggregate on two or more DataFrame columns. The example below groups by `department` and `state` and runs `sum()` on the `salary` and `bonus` columns.

In [23]:
# GroupBy on multiple columns
df.groupBy("department","state").sum("salary","bonus").show()

+----------+-----+-----------+----------+
|department|state|sum(salary)|sum(bonus)|
+----------+-----+-----------+----------+
|   Finance|   NY|     162000|     34000|
| Marketing|   NY|      91000|     21000|
|     Sales|   CA|      81000|     23000|
| Marketing|   CA|      80000|     18000|
|   Finance|   CA|     189000|     47000|
|     Sales|   NY|     176000|     30000|
+----------+-----+-----------+----------+



### Running more aggregates at a time
Using the `agg()` function, we can calculate several aggregations in a single statement with the PySpark SQL aggregate functions `sum()`, `avg()`, `min()`, `max()`, `mean()`, etc. To use these, we first import them: `from pyspark.sql.functions import sum,avg,max,min,mean,count`.

In [24]:
from pyspark.sql.functions import sum,avg,max,min,mean,count

df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"),
         avg("salary").alias("avg_salary"),
         sum("bonus").alias("sum_bonus"),
         max("bonus").alias("max_bonus")).show(truncate=False)

+----------+----------+-----------------+---------+---------+
|department|sum_salary|avg_salary       |sum_bonus|max_bonus|
+----------+----------+-----------------+---------+---------+
|Sales     |257000    |85666.66666666667|53000    |23000    |
|Finance   |351000    |87750.0          |81000    |24000    |
|Marketing |171000    |85500.0          |39000    |21000    |
+----------+----------+-----------------+---------+---------+



### Using filter on aggregate data
Similar to the SQL “HAVING” clause, on a PySpark DataFrame we can use either `where()` or `filter()` to filter the rows of aggregated data.

In [25]:
from pyspark.sql.functions import col

df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"),
         avg("salary").alias("avg_salary"),
         sum("bonus").alias("sum_bonus"),
         max("bonus").alias("max_bonus")) \
    .where(col("sum_bonus") >= 50000) \
    .show(truncate=False)

+----------+----------+-----------------+---------+---------+
|department|sum_salary|avg_salary       |sum_bonus|max_bonus|
+----------+----------+-----------------+---------+---------+
|Sales     |257000    |85666.66666666667|53000    |23000    |
|Finance   |351000    |87750.0          |81000    |24000    |
+----------+----------+-----------------+---------+---------+



## Second example

In [26]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
sc = spark.sparkContext
  
datavengers = [
    ("Carol","Data Scientist","USA",70000,5),
    ("Peter","Data Scientist","USA",90000,7),
    ("Clark","Data Scientist","UK",111000,10),
    ("Jean","Data Scientist","UK",220000,30),
    ("Bruce","Data Engineer","UK",80000,4),
    ("Thanos","Data Engineer","USA",115000,13),
    ("Scott","Data Engineer","UK",180000,15),
    ("T'challa","CEO","USA",300000,20),
    ("Xavier","Marketing","USA",100000,11),
    ("Wade","Marketing","UK",60000,2)
]
 
schema = ["Name","Job","Country","salary","seniority"]
df = spark.createDataFrame(data=datavengers, schema = schema)
df.printSchema()
df.show(truncate=False)

root
 |-- Name: string (nullable = true)
 |-- Job: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- seniority: long (nullable = true)

+--------+--------------+-------+------+---------+
|Name    |Job           |Country|salary|seniority|
+--------+--------------+-------+------+---------+
|Carol   |Data Scientist|USA    |70000 |5        |
|Peter   |Data Scientist|USA    |90000 |7        |
|Clark   |Data Scientist|UK     |111000|10       |
|Jean    |Data Scientist|UK     |220000|30       |
|Bruce   |Data Engineer |UK     |80000 |4        |
|Thanos  |Data Engineer |USA    |115000|13       |
|Scott   |Data Engineer |UK     |180000|15       |
|T'challa|CEO           |USA    |300000|20       |
|Xavier  |Marketing     |USA    |100000|11       |
|Wade    |Marketing     |UK     |60000 |2        |
+--------+--------------+-------+------+---------+




### groupBy using count() function
To count the number of employees per job type, you can proceed like this:

In [27]:
df.groupBy("Job").count().show(truncate=False)

+--------------+-----+
|Job           |count|
+--------------+-----+
|CEO           |1    |
|Data Scientist|4    |
|Marketing     |2    |
|Data Engineer |3    |
+--------------+-----+



### groupBy using sum() function

In [28]:
df.groupBy("Job").sum("salary").show(truncate=False)

+--------------+-----------+
|Job           |sum(salary)|
+--------------+-----------+
|CEO           |300000     |
|Data Scientist|491000     |
|Marketing     |160000     |
|Data Engineer |375000     |
+--------------+-----------+



### groupBy using avg() function

In [29]:
df.groupBy("Job").avg("salary").show(truncate=False)

+--------------+-----------+
|Job           |avg(salary)|
+--------------+-----------+
|CEO           |300000.0   |
|Data Scientist|122750.0   |
|Marketing     |80000.0    |
|Data Engineer |125000.0   |
+--------------+-----------+



### groupBy and aggregation functions on DataFrame multiple columns

In [30]:
df.groupBy("Job","Country").avg("salary","seniority").show(truncate=False)

+--------------+-------+-----------+--------------+
|Job           |Country|avg(salary)|avg(seniority)|
+--------------+-------+-----------+--------------+
|Marketing     |UK     |60000.0    |2.0           |
|Data Engineer |UK     |130000.0   |9.5           |
|Data Scientist|UK     |165500.0   |20.0          |
|Marketing     |USA    |100000.0   |11.0          |
|Data Scientist|USA    |80000.0    |6.0           |
|CEO           |USA    |300000.0   |20.0          |
|Data Engineer |USA    |115000.0   |13.0          |
+--------------+-------+-----------+--------------+

