* An aggregation function must produce one result for each group. Each grouping returns a `RelationalGroupedDataset` (`GroupedData` in PySpark) on which we specify the aggregations.
* `Group by` lets us specify one or more keys as well as one or more aggregation functions to transform the value columns.
* `Window` also lets us specify one or more keys and one or more aggregation functions. However, the rows passed to the function are determined relative to the current row.

In [1]:
import findspark
findspark.init('/home/purvil/spark-2.4.3-bin-hadoop2.7')

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('Aggregation').getOrCreate()

In [4]:
df = spark.read.csv("spark_data/retail-data/all/*.csv", header = True, inferSchema=True)

In [5]:
df.cache() # mark the DataFrame for caching; it is materialized lazily by the first action

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: string, UnitPrice: double, CustomerID: int, Country: string]

In [6]:
df.createOrReplaceTempView("dfTable")

In [7]:
df.show(5)

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|     2.75|     17850|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
only showing top 5 rows



In [8]:
df.count() # an action, so it also materializes the cache

541909

#### count

In [9]:
from pyspark.sql.functions import count

In [10]:
df.select(count("StockCode")).show()

+----------------+
|count(StockCode)|
+----------------+
|          541909|
+----------------+



In [11]:
spark.sql('SELECT COUNT(*) FROM dfTable').show()

+--------+
|count(1)|
+--------+
|  541909|
+--------+



* `COUNT(*)` counts every row, including rows that contain nulls. Counting a single column ignores the rows where that column is null.
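
The difference between the two counts can be sketched in plain Python. This is an analogue of the semantics, not Spark code, and the three sample rows are made up for illustration. In the notebook both counts equal 541909, which indicates that `StockCode` has no nulls in this dataset.

```python
# Hypothetical sample rows; None stands in for a SQL NULL.
rows = [
    {"StockCode": "85123A", "CustomerID": 17850},
    {"StockCode": "71053", "CustomerID": None},
    {"StockCode": None, "CustomerID": 17850},
]

# COUNT(*): every row counts, whatever it contains.
count_star = len(rows)

# COUNT(StockCode): only rows where StockCode is not null count.
count_stock_code = sum(1 for row in rows if row["StockCode"] is not None)

print(count_star, count_stock_code)
```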

#### countDistinct

In [12]:
from pyspark.sql.functions import countDistinct

In [13]:
df.select(countDistinct("StockCode")).show()

+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     4070|
+-------------------------+



In [14]:
spark.sql('SELECT COUNT (DISTINCT StockCode) FROM dfTable').show()

+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     4070|
+-------------------------+



#### approx_count_distinct

* On a large dataset an exact distinct count can be expensive, and an approximate count within a chosen accuracy is often good enough.

In [15]:
from pyspark.sql.functions import approx_count_distinct

In [16]:
df.select(approx_count_distinct("StockCode", 0.1)).show() # 0.1 is max estimation error allowed.

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            3364|
+--------------------------------+



#### first last
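* `first` and `last` return the first and last value of a column. The result depends on the row order of the DataFrame, not on the sorted values.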

In [17]:
from pyspark.sql.functions import first, last

In [18]:
df.select(first("StockCode"), last("StockCode")).show()

+-----------------------+----------------------+
|first(StockCode, false)|last(StockCode, false)|
+-----------------------+----------------------+
|                 85123A|                 22138|
+-----------------------+----------------------+



#### min max

In [19]:
from pyspark.sql.functions import min, max

In [20]:
df.select(min("Quantity"), max("Quantity")).show()

+-------------+-------------+
|min(Quantity)|max(Quantity)|
+-------------+-------------+
|       -80995|        80995|
+-------------+-------------+



#### sum

In [21]:
from pyspark.sql.functions import sum

In [22]:
df.select(sum("Quantity")).show()

+-------------+
|sum(Quantity)|
+-------------+
|      5176450|
+-------------+



#### sumDistinct

In [23]:
from pyspark.sql.functions import sumDistinct

In [24]:
df.select(sumDistinct("Quantity")).show()

+----------------------+
|sum(DISTINCT Quantity)|
+----------------------+
|                 29310|
+----------------------+



#### avg

In [25]:
from pyspark.sql.functions import avg

In [26]:
df.select(avg("Quantity")).show()

+----------------+
|   avg(Quantity)|
+----------------+
|9.55224954743324|
+----------------+



#### Variance and std

In [27]:
from pyspark.sql.functions import var_pop, stddev_pop, stddev, var_samp

In [28]:
df.select(var_pop("Quantity").alias("Pop var"), var_samp("Quantity").alias("Sample var")).show()

+-----------------+-----------------+
|          Pop var|       Sample var|
+-----------------+-----------------+
|47559.30364660916|47559.39140929885|
+-----------------+-----------------+
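
The two values above differ only in the divisor: population variance divides the sum of squared deviations by n, sample variance by n - 1. With 541909 rows the two are almost identical. A minimal sketch with Python's `statistics` module, on a made-up list of quantities, makes the gap visible on a small sample:

```python
import statistics

# Hypothetical quantities, chosen only to keep the arithmetic small.
quantities = [6, 6, 8, 6, 6]

# Population variance: divide by n (analogous to var_pop).
pop_var = statistics.pvariance(quantities)

# Sample variance: divide by n - 1 (analogous to var_samp).
samp_var = statistics.variance(quantities)

n = len(quantities)
# The two are related by the factor (n - 1) / n.
rescaled = samp_var * (n - 1) / n

print(pop_var, samp_var, rescaled)
```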



#### skewness and kurtosis
* Skewness measures the asymmetry of the values around the mean.
* Kurtosis measures how heavy the tails of the distribution are.

In [29]:
from pyspark.sql.functions import kurtosis, skewness

In [30]:
df.select(skewness("Quantity"), kurtosis("Quantity")).show()

+-------------------+------------------+
| skewness(Quantity)|kurtosis(Quantity)|
+-------------------+------------------+
|-0.2640755761052971|119768.05495535096|
+-------------------+------------------+
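
Both statistics can be written in terms of central moments. The sketch below is plain Python and assumes the population-moment definitions, with kurtosis reported as excess kurtosis (a normal distribution scores 0). The helper names and sample lists are hypothetical, and whether Spark uses exactly these definitions should be checked against its documentation.

```python
def central_moment(values, k):
    """k-th central moment: mean of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    # Third central moment, scaled by the variance to the power 3/2.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Fourth central moment over squared variance, minus 3.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]       # mirror image around the mean
right_tailed = [1, 1, 1, 2, 10]   # one large value stretches the right tail

print(skewness(symmetric), skewness(right_tailed))
print(excess_kurtosis(symmetric))
```

A symmetric sample has zero skewness, while a long right tail gives a positive value.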



#### Covariance and Correlation

In [31]:
from pyspark.sql.functions import corr, covar_pop, covar_samp

In [32]:
df.select(corr("InvoiceNo", "Quantity"), covar_pop("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity")).show()

+-------------------------+------------------------------+-------------------------------+
|corr(InvoiceNo, Quantity)|covar_pop(InvoiceNo, Quantity)|covar_samp(InvoiceNo, Quantity)|
+-------------------------+------------------------------+-------------------------------+
|     4.912186085644065E-4|            1052.7260778758152|             1052.7280543919194|
+-------------------------+------------------------------+-------------------------------+



#### Complex type aggregation
* Collect the values of a given column into a set (distinct values) with `collect_set`, or into a list (duplicates kept) with `collect_list`.

In [33]:
from pyspark.sql.functions import collect_list, collect_set

In [34]:
df.agg(collect_list("Country"), collect_set("Country")).show()

+---------------------+--------------------+
|collect_list(Country)|collect_set(Country)|
+---------------------+--------------------+
| [United Kingdom, ...|[Portugal, Italy,...|
+---------------------+--------------------+



### Grouping



1. Specify the columns to group on. This returns a `RelationalGroupedDataset` (`GroupedData` in PySpark).
2. Specify the aggregation(s).

In [35]:
df.groupBy("InvoiceNo", "CustomerId").count().show(10)

+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   536846|     14573|   76|
|   537026|     12395|   12|
|   537883|     14437|    5|
|   538068|     17978|   12|
|   538279|     14952|    7|
|   538800|     16458|   10|
|   538942|     17346|   12|
|  C539947|     13854|    1|
|   540096|     13253|   16|
|   540530|     14755|   27|
+---------+----------+-----+
only showing top 10 rows



In [36]:
spark.sql("SELECT InvoiceNo, CustomerId, count(*) FROM dfTable GROUP BY InvoiceNo, CustomerId").show(10)

+---------+----------+--------+
|InvoiceNo|CustomerId|count(1)|
+---------+----------+--------+
|   536846|     14573|      76|
|   537026|     12395|      12|
|   537883|     14437|       5|
|   538068|     17978|      12|
|   538279|     14952|       7|
|   538800|     16458|      10|
|   538942|     17346|      12|
|  C539947|     13854|       1|
|   540096|     13253|      16|
|   540530|     14755|      27|
+---------+----------+--------+
only showing top 10 rows



In [37]:
df.groupBy("InvoiceNo").agg(count("Quantity")).show()

+---------+---------------+
|InvoiceNo|count(Quantity)|
+---------+---------------+
|   536596|              6|
|   536938|             14|
|   537252|              1|
|   537691|             20|
|   538041|              1|
|   538184|             26|
|   538517|             53|
|   538879|             19|
|   539275|              6|
|   539630|             12|
|   540499|             24|
|   540540|             22|
|  C540850|              1|
|   540976|             48|
|   541432|              4|
|   541518|            101|
|   541783|             35|
|   542026|              9|
|   542375|              6|
|  C542604|              8|
+---------+---------------+
only showing top 20 rows



In [38]:
df.groupBy("InvoiceNo").agg(avg("Quantity"), stddev_pop("Quantity")).show(5)

+---------+------------------+--------------------+
|InvoiceNo|     avg(Quantity)|stddev_pop(Quantity)|
+---------+------------------+--------------------+
|   536596|               1.5|  1.1180339887498947|
|   536938|33.142857142857146|  20.698023172885524|
|   537252|              31.0|                 0.0|
|   537691|              8.15|   5.597097462078001|
|   538041|              30.0|                 0.0|
+---------+------------------+--------------------+
only showing top 5 rows
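
The two-step pattern used in this section (group the rows by key, then aggregate each group) can be sketched in plain Python. This is an analogue of the semantics, not Spark code, and the sample `(InvoiceNo, Quantity)` pairs are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (InvoiceNo, Quantity) pairs.
rows = [
    ("536365", 6),
    ("536365", 8),
    ("536366", 6),
    ("536367", 32),
    ("536367", 6),
]

# Step 1: group. Every row lands in exactly one group.
groups = defaultdict(list)
for invoice_no, quantity in rows:
    groups[invoice_no].append(quantity)

# Step 2: aggregate. Each group collapses to a single result row.
summary = {
    invoice_no: {"count": len(q), "avg": sum(q) / len(q)}
    for invoice_no, q in groups.items()
}

for invoice_no, stats in summary.items():
    print(invoice_no, stats)
```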



### Window Function
* A window function computes an aggregation over a specific window of rows, defined relative to the current row. The window specification determines which rows are passed to the function.
* With group by, every row goes into exactly one group.
* With a window, a row can fall into one or more frames. The aggregation is computed per frame, producing one result for each input row.
* Spark supports three kinds of window functions: ranking functions, analytic functions, and aggregate functions.

In [39]:
from pyspark.sql.functions import col,to_date

In [40]:
dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/d/yyyy H:mm"))

In [41]:
dfWithDate.createOrReplaceTempView('dfWithDate')

In [42]:
dfWithDate.show(2)

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+----------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|      date|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+----------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|     2.55|     17850|United Kingdom|2010-12-01|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|2010-12-01|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+----------+
only showing top 2 rows



In [43]:
from pyspark.sql.window import Window

In [44]:
from pyspark.sql.functions import desc

* First create the window specification.
* `partitionBy` defines how the rows are split into groups.
* `orderBy` defines the ordering within each partition.
* `rowsBetween` is the frame specification: it states which rows are included in the frame, relative to the current row.

In [45]:
windowSpec = Window.partitionBy("CustomerId", "date").orderBy("Quantity").rowsBetween(Window.unboundedPreceding, Window.currentRow)

In [46]:
max(col("Quantity")).over(windowSpec)

Column<b'max(Quantity) OVER (PARTITION BY CustomerId, date ORDER BY Quantity ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)'>
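
What this frame does can be sketched in plain Python. This is an analogue of the semantics, not Spark code, and the sample rows and the `running_window` helper are hypothetical. Rows are partitioned by (customer, date), ordered by quantity, and for each row the frame holds every row from the start of the partition up to the current one. Because the ordering is ascending on the aggregated column, the running max is simply the current row's quantity, so the sketch also computes a running sum to make the growing frame visible.

```python
from itertools import groupby

# Hypothetical rows.
rows = [
    {"customer": 1, "date": "2010-12-01", "quantity": 6},
    {"customer": 1, "date": "2010-12-01", "quantity": 2},
    {"customer": 1, "date": "2010-12-01", "quantity": 8},
    {"customer": 2, "date": "2010-12-01", "quantity": 3},
]

def partition_key(row):
    return (row["customer"], row["date"])

def running_window(rows):
    """Emit one output row per input row, aggregated over its frame."""
    out = []
    # Sort by partition key, then by quantity within each partition.
    ordered = sorted(rows, key=lambda r: (partition_key(r), r["quantity"]))
    for _, partition in groupby(ordered, key=partition_key):
        frame = []  # unbounded preceding .. current row
        for row in partition:
            frame.append(row["quantity"])
            out.append({**row,
                        "running_max": max(frame),
                        "running_sum": sum(frame)})
    return out

result = running_window(rows)
for row in result:
    print(row)
```

Unlike a group by, the output has as many rows as the input: each row keeps its own columns and gains the aggregate computed over its frame.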

In [47]:
### TO BE DONE -purvil

### Grouping Sets

In [49]:
dfNotNull = dfWithDate.na.drop() # drop rows that contain nulls

In [50]:
dfNotNull.createOrReplaceTempView("dfNotNull")

In [51]:
spark.sql("""
    SELECT CustomerId, StockCode, sum(Quantity) FROM dfNotNull
    GROUP BY customerId, stockCode
    ORDER BY CustomerID DESC, stockCode DESC
""").show(5)

+----------+---------+-------------+
|CustomerId|StockCode|sum(Quantity)|
+----------+---------+-------------+
|     18287|    85173|           48|
|     18287|   85040A|           48|
|     18287|   85039B|          120|
|     18287|   85039A|           96|
|     18287|    84920|            4|
+----------+---------+-------------+
only showing top 5 rows


