* **Aggregation** function must produce one result for each group, given multiple input values.
* Spark also allows us to create the following groupings types
  * The **simplest grouping** is to just summarize a complete DataFrame by performing an aggregation in a select statement.
  * A **group by** allows you to specify one or more keys as well as one or more aggregation functions to transform the value columns.
  * A **window** gives you the ability to specify one or more keys as well as one or more aggregation functions to transform the value columns. However, the rows input to the function are somehow related to the current row.
  * A **grouping set,** which you can use to aggregate at multiple different levels. Grouping sets are available as a primitive in SQL and via rollups and cubes in DataFrames.
  * A **rollup** makes it possible for you to specify one or more keys as well as one or more aggregation functions to transform the value columns, which will be summarized hierarchically.
  * A **cube** allows you to specify one or more keys as well as one or more aggregation functions to transform the value columns, which will be summarized across all combinations of columns.

* Each grouping returns a RelationalGroupedDataset on which we specify our aggregations.

In [3]:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Aggregations').getOrCreate()

22/10/20 15:17:07 WARN Utils: Your hostname, tars resolves to a loopback address: 127.0.1.1; using 192.168.1.66 instead (on interface wlan0)
22/10/20 15:17:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/20 15:17:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/20 15:17:09 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Reading in our data on purchases, repartitioning the data to have far fewer partitions (because we know it’s a small volume of data stored in a lot of small files), and caching the results for rapid access

In [4]:
df = spark.read.format("csv")\
.option("header","true")\
.option("inferSchema","true")\
.load("data/*.csv")\
.coalesce(5)

df.cache()
df.createOrReplaceTempView("dfTable")

In [6]:
df.rdd.getNumPartitions()

1

In [7]:
df.count()

3108

###### Count
* specify a specific column to count, or 
* all the columns by using count(*) or count(1) to represent that we want to count every row as the literal one

In [8]:
from pyspark.sql.functions import count
df.select(count("StockCode")).show()

+----------------+
|count(StockCode)|
+----------------+
|            3108|
+----------------+



**Note:**

when performing a count(*), Spark will count null values (including rows containing all nulls). 
However, when counting an individual column, Spark will not count the null values.

**countDistinct:** To get the number of unique groups

In [9]:
from pyspark.sql.functions import countDistinct

df.select(countDistinct("StockCode")).show()

+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     1351|
+-------------------------+



**approx_count_distinct:**

* use this function when an approximation to a certain degree of accuracy will work just fine
* Another parameter with which you can specify the maximum estimation error allowed.
* We can see much performance with larger datasets. specifying the large error makes an answer that is quite far off but does complete more quickly than countDistinct

In [11]:
from pyspark.sql.functions import approx_count_distinct
df.select(approx_count_distinct("StockCode",0.01)).show()

22/10/20 15:24:16 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


[Stage 17:>                                                         (0 + 1) / 1]

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            1359|
+--------------------------------+



[Stage 19:>                                                         (0 + 1) / 1]                                                                                

In [12]:
df.select(approx_count_distinct("StockCode")).show()

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            1282|
+--------------------------------+



**first and last:**

* Used to get the first and last values from a DataFrame.
* This will be based on the rows in the DataFrame, not on the values in the DataFrame.

In [13]:
from pyspark.sql.functions import first,last

df.select(first("StockCode"),last("StockCode")).show()

+----------------+---------------+
|first(StockCode)|last(StockCode)|
+----------------+---------------+
|          85123A|          20755|
+----------------+---------------+



**min and max:**

use min and max to extract the minimum and maximum values from a DataFrame

In [14]:
from pyspark.sql.functions import min,max

df.select(min("Quantity"),max("Quantity")).show()

+-------------+-------------+
|min(Quantity)|max(Quantity)|
+-------------+-------------+
|          -24|          600|
+-------------+-------------+



In [0]:
df.select(min("StockCode"),max("StockCode")).show()

**SUM:**

add all the values in a row using the sum function

In [15]:
from pyspark.sql.functions import sum

df.select(sum("Quantity")).show()

+-------------+
|sum(Quantity)|
+-------------+
|        26814|
+-------------+



**sumDistinct:**
By using this function, we can sum a distinct set of values

In [16]:
from pyspark.sql.functions import sumDistinct

df.select(sumDistinct("Quantity")).show()



+----------------------+
|sum(DISTINCT Quantity)|
+----------------------+
|                  4690|
+----------------------+



**avg:**
Although you can calculate average by dividing sum by count, Spark provides an easier way to get that value via the avg or mean functions.

In [17]:
from pyspark.sql.functions import avg, expr

df.select(
         count("Quantity").alias("total_transactions"),
         sum("Quantity").alias("total_purchases"),
         avg("Quantity").alias("avg_purchases"),
         expr("mean(Quantity)").alias("mean_purchases"))\
       .selectExpr(
                  "total_purchases/total_transactions",
                  "avg_purchases",
                  "mean_purchases").show()

+--------------------------------------+-----------------+-----------------+
|(total_purchases / total_transactions)|    avg_purchases|   mean_purchases|
+--------------------------------------+-----------------+-----------------+
|                     8.627413127413128|8.627413127413128|8.627413127413128|
+--------------------------------------+-----------------+-----------------+



**Variance and Standard Deviation:**

Spark has both the formula for the sample standard deviation as well as the formula for the population standard deviation.

In [18]:
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp

df.select(var_pop("Quantity"), var_samp("Quantity"),stddev_pop("Quantity"), stddev_samp("Quantity")).show()

+-----------------+------------------+--------------------+---------------------+
|var_pop(Quantity)|var_samp(Quantity)|stddev_pop(Quantity)|stddev_samp(Quantity)|
+-----------------+------------------+--------------------+---------------------+
|695.2492099104054| 695.4729785650273|  26.367578764657278|   26.371821677029203|
+-----------------+------------------+--------------------+---------------------+



**skewness and kurtosis:**

* Skewness measures the asymmetry of the values in your data around the mean.
* kurtosis is a measure of the tail of data.

In [19]:
from pyspark.sql.functions import skewness, kurtosis

df.select(skewness("Quantity"), kurtosis("Quantity")).show()

+------------------+------------------+
|skewness(Quantity)|kurtosis(Quantity)|
+------------------+------------------+
|11.384721296581182|182.91886804842397|
+------------------+------------------+



**Covariance and Correlation:**

* Correlation measures the Pearson correlation coefficient, which is scaled between –1 and +1.
* The covariance is scaled according to the inputs in the data
* Covariance can be calculated either as the sample covariance or the population covariance.

In [20]:
from pyspark.sql.functions import corr, covar_pop, covar_samp

df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"),covar_pop("InvoiceNo", "Quantity")).show()

+-------------------------+-------------------------------+------------------------------+
|corr(InvoiceNo, Quantity)|covar_samp(InvoiceNo, Quantity)|covar_pop(InvoiceNo, Quantity)|
+-------------------------+-------------------------------+------------------------------+
|     -0.12225395743668731|            -235.56327681311157|            -235.4868448608685|
+-------------------------+-------------------------------+------------------------------+



**Aggregating to Complex Types:**
In Spark, you can perform aggregations not just of numerical values using formulas, you can also perform them on complex types. 

For example, we can collect a list of values present in a given column or only the unique values by collecting to a set.

In [21]:
from pyspark.sql.functions import collect_set, collect_list

df.agg(collect_set("Country"), collect_list("Country")).show()

+--------------------+---------------------+
|collect_set(Country)|collect_list(Country)|
+--------------------+---------------------+
|[France, Australi...| [United Kingdom, ...|
+--------------------+---------------------+

22/10/20 19:28:20 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 4131223 ms exceeds timeout 120000 ms
22/10/20 19:28:20 WARN SparkContext: Killing executors is not supported by current scheduler.


###### Grouping
This will be done as group our data on one column and perform some calculations on the other columns that end up in that group.
We will group by each unique invoice number and get the count of items on that invoice. This will be done in 2 phases

* First we specify the column(s) on which we would like to group. This returns **Relatational Grouped Dataset.**
* then we specify the aggregation(s).This step returns **DataFrame.**

In [0]:
df.groupBy("InvoiceNo","CustomerId")

In [0]:
df.groupBy("InvoiceNo","CustomerId").count().show(4)

**Grouping with Expressions:**
Rather than passing count() function as an expression into a select statement, we specify it as within agg. This makes it possible for you to pass-in arbitrary expressions that just need to have some aggregations specified

In [0]:
df.groupBy("InvoiceNo").agg(\
                           count("Quantity").alias("quan"),
                           expr("count(Quantity)")).show(2)

**Grouping with Maps:**
Sometimes it can be easier to specify your transformations as a series of Maps for which the key is the column and the value is the aggregation function. 

You can reuse multiple column names if you specify them inline.

In [0]:
df.groupBy("InvoiceNo").agg(\
                           expr("avg(Quantity)"),
                           expr("stddev_pop(Quantity)"),
                           expr("count(Quantity)")).show(3)

###### Window Functions:

window functions are used to carry out some unique aggregations by either computing some aggregation on a specific **window** of data, which you define by using a reference to the current data.

This window specification determines which rows will be passed in to this function.

A groupBy takes data, and every row can go only into one grouping. A window function calculates a return value for every input row of a table based on group of rows, called a frame. Each row fall into one or more frames.

Spark supports three kinds of window functions:
* ranking functions - Rank, Dense_Rank, Row_Number etc..,
* analytic functions - Lead, Lag,First_value,Last_value etc..,
* aggregate functions- sum,avg,count,min,max etc..,

OVER clause defines the partitioning and ordering of rows(i.e a window) for the above functions to operate on. Hence these functions are called window functions. The OVER clause accepts the following three arguments to define a window for these functions to operate on.

* ORDERBY - Defines the logical order of the rows
* PARTITION BY - Divides the query result set in to partitions. The window function is applied to each partition separately.
* ROWS or RANGE Clause - Further limits the rows within the partition by specifying start and end points within the partition

In [0]:
from pyspark.sql.functions import to_date,col

dfWithDate = df.withColumn("date",to_date(col("InvoiceDate"),"MM/d/YYYY H:mm"))
dfWithDate.createOrReplaceTempView("dfWithDate")

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc

windowSpec = Window.partitionBy("CustomerId","date")\
                   .orderBy(desc("Quantity"))\
                   .rowsBetween(Window.unboundedPreceding, Window.currentRow)


We need aggregate function to learn more about each specific customer. ex: maximum purchase quantity over all time

In [0]:
import pyspark.sql.functions as F

In [0]:
from pyspark.sql.functions import max
maxPurchaseQuantity = F.max(F.col("Quantity")).over(windowSpec)
maxPurchaseQuantity

create the purchase quantity rank. For that, we use the **dense_rank** function to determine which date had the maximum purchase quantity
for every customer.

We use dense_rank as opposed to rank to avoid gaps in the ranking sequence when there are tied values

In [0]:
from pyspark.sql.functions import dense_rank, rank

purchaseDenseRank = dense_rank().over(windowSpec)
purchaseRank = rank().over(windowSpec)

In [0]:
dfWithDate.where("CustomerId IS NOT NULL").orderBy("CustomerId")\
.select(\
       col("CustomerId"),
       col("date"),
       col("Quantity"),
       purchaseRank.alias("quantityRank"),
       purchaseDenseRank.alias("quantityDenseRank"),
       maxPurchaseQuantity.alias("maxPurchaseQuantity")).show()

In [0]:
%sql
SELECT CustomerId, date, Quantity,
rank(Quantity) OVER (PARTITION BY CustomerId, date
ORDER BY Quantity DESC NULLS LAST
ROWS BETWEEN
UNBOUNDED PRECEDING AND
CURRENT ROW) as rank,
dense_rank(Quantity) OVER (PARTITION BY CustomerId, date
ORDER BY Quantity DESC NULLS LAST
ROWS BETWEEN
UNBOUNDED PRECEDING AND
CURRENT ROW) as dRank,
max(Quantity) OVER (PARTITION BY CustomerId, date
ORDER BY Quantity DESC NULLS LAST
ROWS BETWEEN
UNBOUNDED PRECEDING AND
CURRENT ROW) as maxPurchase
FROM dfWithDate WHERE CustomerId IS NOT NULL ORDER BY CustomerId

CustomerId,date,Quantity,rank,dRank,maxPurchase
12346.0,2011-01-18,74215,1,1,74215
12346.0,2011-01-18,-74215,2,2,74215
12347.0,2011-04-07,6,15,6,240
12347.0,2011-08-02,10,11,5,36
12347.0,2011-08-02,4,16,8,36
12347.0,2011-08-02,24,2,2,36
12347.0,2011-08-02,8,13,6,36
12347.0,2011-10-31,8,34,8,48
12347.0,2011-10-31,12,14,6,48
12347.0,2011-10-31,24,4,3,48


###### Grouping Sets
* an aggregation across multiple groups.
* These are a low-level tool for combining sets of aggregations together.
* 

total quantity of all stock codes and customers.

In [0]:
dfNoNull = dfWithDate.drop()
dfNoNull.createOrReplaceTempView("dfNoNull")

In [0]:
%sql
SELECT CustomerId, stockCode, sum(Quantity) FROM dfNoNull
GROUP BY customerId, stockCode
ORDER BY CustomerId DESC, stockCode DESC
LIMIT 5

CustomerId,stockCode,sum(Quantity)
18287.0,85173,48
18287.0,85040A,48
18287.0,85039B,120
18287.0,85039A,96
18287.0,84920,4


In [0]:
%sql
SELECT CustomerId, stockCode, sum(Quantity) FROM dfNoNull
GROUP BY GROUPING SETS((customerId, stockCode))
ORDER BY CustomerId DESC, stockCode DESC
LIMIT 5

customerId,stockCode,sum(Quantity)
18287.0,85173,48
18287.0,85040A,48
18287.0,85039B,120
18287.0,85039A,96
18287.0,84920,4


But if we want to include the total number of items, regardless of customer or stockcode ? with conventiopnal group-by statement this impossible. But, its simple with grouping sets.

In [0]:
%sql
SELECT CustomerId, stockCode, sum(Quantity) FROM dfNoNull
GROUP BY GROUPING SETS((customerId, stockCode),())
ORDER BY CustomerId DESC, stockCode DESC
LIMIT 5

customerId,stockCode,sum(Quantity)
18287.0,85173,48
18287.0,85040A,48
18287.0,85039B,120
18287.0,85039A,96
18287.0,84920,4


The **GROUPING SETS** operator is only available in SQL. To perform the same in DataFrames, we use the rollup and cube operators - which will allow us to get the same results.

Grouping sets depend on **null values** for aggregation levels. If you do not filter-out null values, you will get incorrect results. 
This applies to cubes, rollups, and grouping sets.

In [0]:
%sql
SELECT CustomerId, stockCode, sum(Quantity) FROM dfNoNull
GROUP BY customerId, stockCode GROUPING SETS(())
ORDER BY CustomerId DESC, stockCode DESC

customerId,stockCode,sum(Quantity)
,,5176450


The GROUPING SETS operator is only available in SQL. To perform the same in DataFrames, we need to  use the **rollup** and **cube** operators—which allow us to get the same results.

**ROLLUP:** 
* is used to do aggregate Opereation on multiple levels in a heirarchy.  
* is a multidimensional aggregation that performs a variety of group-by style calculations

In [0]:
rolledUpDF = dfNoNull.rollup("Date","Country")\
                     .agg(sum("Quantity"))\
                     .selectExpr("Date","Country", "`sum(Quantity)` as total_quantity")\
                     .orderBy("Date")
rolledUpDF.count()

In [0]:
rolledUpDF.show(10)

In [0]:
rolledUpDF.where("Country IS NULL" and "Date IS NULL").show()
#rolledUpDF.where().show()

**CUBE:**
* Produces the result set by generating all combinations of columns specified in GroupBy CUBE()
* Rather than treating elements hierarchically, a cube does the same thing across all dimensions. This means that it won’t just go by date over the entire time period, but also the country.
* To createa a table that has below combinations
  * The total across all dates and countries
  * The total for each date across all countries
  * The total for each country on each date
  * The total for each country across all dates

In [0]:
cubeDF = dfNoNull.cube("Date","Country")\
        .agg(sum("Quantity"))\
        .select("Date","Country","sum(Quantity)")\
        .orderBy("Date")
cubeDF.count()

** Difference Between CUBE and ROLLUP**
* CUBE generates a result that shows aggregation for all combinations of values in the selected columns.
* ROLLUP generates a result that shows aggregation for hierarchy of values in the selected columns.

ROLLUP("Date","Country") gives combinations of
  * "Date","Country"
  * "Date"
  * ()
It gave a count of 2022

cube("Date","Country") gives combination of
  * "Date","Country"
  * "Date"
  * "Country"
  * ()
It gave a count of 2060

**GROUPING METADATA:**
Sometimes when using cubes and rollups, you want to be able to query the aggregation levels so that you can easily filter them down accordingly.
We can do this by using the grouping_id, which gives us a column specifying the level of aggregation that we have in our result set.

In [0]:
from pyspark.sql.functions import grouping_id,desc

dfNoNull.cube("customerId", "stockCode")\
        .agg(grouping_id(), sum("Quantity"))\
        .orderBy(desc("grouping_id()"))\
        .show()

**PIVOT:**
* Used to turn unique values from one column, into multiple columns in the output.
* With a pivot, we can aggregate according to some function for each of those given countries and display them in an easy-to-query way.

In [0]:
pivoted = dfWithDate.groupBy("date").pivot("Country").sum()
pivoted.show(3)

This DataFrame will now have a column for every combination of country, numeric variable, and a column specifying the date. 

For example, for USA we have columns:USA_sum(Quantity), USA_sum(UnitPrice), USA_sum(CustomerID).

This represents one for each numeric column in our dataset (because we just performed an aggregation over all of them).

In [0]:
pivoted.where("date > '2011-12-05'").select("date","`USA_sum(CustomerID)`","`USA_sum(CAST(Quantity AS BIGINT))`").show()

###### User-Defined Aggregation Functions(UDAF):

* UDAFs to compute custom calculations over groups of input data (as opposed to single rows).
* they are a way for users to define their own aggregation functions based on custom formulae or business rules.
* UDAFs are currently available only in Scala or Java. But these can be called in Python.
* To create a UDAF, you must inherit from the **UserDefinedAggregateFunction** base class and implement the following methods:
   * **inputSchema** represents input arguments as a StructType
   * **bufferSchema** represents intermediate UDAF results as a StructType
   * **dataType** represents the return DataType
   * **deterministic** is a Boolean value that specifies whether this UDAF will return the same result for a given input
   * **initialize** allows you to initialize values of an aggregation buffer
   * **update** describes how you should update the internal buffer based on a given row
   * **merge** describes how two aggregation buffers should be merged
   * **evaluate** will generate the final result of the aggregation
   
 The following example implements a BoolAnd, which will inform us whether all the rows (for a given column) are true; if they’re not, it will return false.

In [0]:
%scala
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

class BoolAnd extends UserDefinedAggregateFunction {
  def inputSchema: org.apache.spark.sql.types.StructType = StructType(StructField("value", BooleanType) :: Nil)
  
  def bufferSchema: StructType = StructType(StructField("result", BooleanType) :: Nil)
  
  def dataType: DataType = BooleanType
  
  def deterministic: Boolean = true
  
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = true
  }
  
  def update(buffer: MutableAggregationBuffer, input: Row):Unit = {
    buffer(0) = buffer.getAs[Boolean](0) && input.getAs[Boolean](0)
  }
  
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Boolean](0) && buffer2.getAs[Boolean](0)
    }
  
  def evaluate(buffer: Row): Any = {
    buffer(0)
    }

}

In [0]:
%scala
val ba = new BoolAnd
spark.udf.register("booland", ba)
import org.apache.spark.sql.functions._
spark.range(1)
.selectExpr("explode(array(TRUE, TRUE, TRUE)) as t")
.selectExpr("explode(array(TRUE, FALSE, TRUE)) as f", "t")
.select(ba(col("t")), expr("booland(f)"))
.show()

###### Conclusion:
We have learnt about simple grouping-to window functions as well as rollups and cubes. Next chapterdiscusses how to perform joins to combine different data sources together.