# Aggregations

Aggregations are an important step when processing dataframes and tabular data
in general, so they should be as simple as possible to implement.
Notable data aggregation semantics are provided by pandas, Spark and the SQL
language.

When designing an aggregation API, the following characteristics make, in
my opinion, a good aggregation method:

-   easily perform aggregation on a column or a set of columns
-   easily perform multiple aggregation functions on the same columns
-   selectively perform different aggregations on different columns

As a nice-to-have, it should also be possible to apply aggregation functions
by passing the function name as a string. A good aggregation method should allow
all of the above with a minimal amount of code.

## Getting started

Let's start Spark using datafaucet.

In [1]:
import datafaucet as dfc

In [2]:
# let's start the engine
dfc.engine('spark')

created SparkEngine
Init engine "spark"
Connecting to spark master: local[*]
Engine context spark:2.4.4 successfully started


<datafaucet.spark.engine.SparkEngine at 0x7fbd9888e9e8>

In [3]:
# expose the engine context
spark  = dfc.context()

## Generating Data

In [4]:
df = spark.range(1000)

In [5]:
df = (df
    .cols.create('g').randint(0,3)
    .cols.create('n').randchoice(['Stacy', 'Sandra'])
    .cols.create('x').randint(0,100)
    .cols.create('y').randint(0,100)
)

In [6]:
df.data.grid(5)

Unnamed: 0,id,g,n,x,y
0,0,1,Sandra,42,96
1,1,1,Stacy,6,83
2,2,2,Stacy,28,75
3,3,2,Sandra,6,21
4,4,0,Sandra,75,36


## Pandas
Let's start by looking at how pandas does aggregations. Pandas is quite flexible on the points noted above and uses hierarchical indexes on both columns and rows to store the aggregation names and the groupby values. Below are a simple aggregation and a more complex one with groupby and multiple aggregation functions.

In [7]:
pf = df.data.collect()

In [8]:
pf[['n', 'x', 'y']].agg(['max'])

Unnamed: 0,n,x,y
max,Stacy,99,99


In [9]:
agg = (pf[['g','n', 'x', 'y']]
           .groupby(['g', 'n'])
           .agg({
               'n': 'count',
               'x': ['min', 'max'],
               'y':['min', 'max']
           }))
agg

,,n,x,x,y,y
,,count,min,max,min,max
g,n,,,,,
0,Sandra,172,0,99,0,99
0,Stacy,159,0,99,0,99
1,Sandra,190,0,99,0,99
1,Stacy,146,2,98,0,99
2,Sandra,153,1,99,0,99
2,Stacy,180,0,99,0,99
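
The hierarchical column index above can be flattened into plain names such as `x_min`, which is the naming used in the Spark examples later on. A minimal sketch on a small hand-made frame (the data and the `flat` name are illustrative, not part of the notebook):

```python
import pandas as pd

# Small illustrative frame with the same column names as the notebook's data.
pf = pd.DataFrame({
    'g': [0, 0, 1, 1],
    'n': ['Stacy', 'Stacy', 'Sandra', 'Sandra'],
    'x': [1, 5, 2, 8],
    'y': [7, 3, 9, 4],
})

agg = pf.groupby(['g', 'n']).agg({'x': ['min', 'max'], 'y': ['min', 'max']})

# Each column label is a tuple such as ('x', 'min'); join the parts with '_'.
flat = agg.copy()
flat.columns = ['_'.join(label) for label in flat.columns]

# Move the groupby keys from the row index back into regular columns.
flat = flat.reset_index()
print(flat)
```

Whether to flatten depends on what consumes the result: flat names are easier to export, while the hierarchical index keeps column and function separate for later stacking.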


### Stacking 
In pandas, you can stack the multi-level column index and move it to the row index, as below. Whether to stack after aggregation depends on what you want to do with the data later. Besides adding an extra index level, stacking also explicitly codes NaN/nulls for every aggregation that is not shared by all columns (which can happen when passing a dict of aggregation functions).

In [10]:
agg = pf[['g', 'n', 'x', 'y']].groupby(['g', 'n']).agg(['min', 'max', 'mean'])
agg = agg.stack(0)
agg

,,,max,mean,min
g,n,,,,
0,Sandra,x,99,49.5,0
0,Sandra,y,99,49.877907,0
0,Stacy,x,99,48.566038,0
0,Stacy,y,99,52.125786,0
1,Sandra,x,99,51.952632,0
1,Sandra,y,99,49.921053,0
1,Stacy,x,98,49.787671,2
1,Stacy,y,99,49.554795,0
2,Sandra,x,99,49.738562,1
2,Sandra,y,99,48.568627,0
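
The stacked output above has no missing values because every column gets the same three functions. The NaN case mentioned earlier only shows up with a dict of aggregation functions. A small sketch with illustrative data (not the notebook's random frame):

```python
import pandas as pd

pf = pd.DataFrame({
    'g': [0, 0, 1, 1],
    'x': [1, 5, 2, 8],
    'y': [7, 3, 9, 4],
})

# 'min' is only computed for x and 'max' only for y.
agg = pf.groupby('g').agg({'x': ['min'], 'y': ['max']})

# Stacking the first column level yields one row per (g, column name).
# Combinations that were never computed, such as max of x, become NaN.
stacked = agg.stack(0)
print(stacked)
```

Recent pandas versions may emit a `FutureWarning` about the `stack` implementation here; the resulting rows are the same for this example.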


### Index as columns
An index in pandas is not the same as column data, but you can easily move from one to the other, as shown below, by combining the names of the index levels with the values of each level.

In [11]:
agg.index.names

FrozenList(['g', 'n', None])

In [12]:
# for example, these are the values from the first level of the index
agg.index.get_level_values(0)

Int64Index([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2], dtype='int64', name='g')

The following script iterates through all the levels and creates a column named after each index level, falling back to `_<level#>` if the level has no name. Remember that pandas allows index levels to be nameless.

In [13]:
levels = agg.index.names
for lvl, name in enumerate(levels):
    agg[name or f'_{lvl}'] = agg.index.get_level_values(lvl)

In [14]:
# now the index levels are regular columns, so drop the index
agg.reset_index(inplace=True, drop=True)
agg

Unnamed: 0,max,mean,min,g,n,_2
0,99,49.5,0,0,Sandra,x
1,99,49.877907,0,0,Sandra,y
2,99,48.566038,0,0,Stacy,x
3,99,52.125786,0,0,Stacy,y
4,99,51.952632,0,1,Sandra,x
5,99,49.921053,0,1,Sandra,y
6,98,49.787671,2,1,Stacy,x
7,99,49.554795,0,1,Stacy,y
8,99,49.738562,1,2,Sandra,x
9,99,48.568627,0,2,Sandra,y
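
The same result can be reached more directly by naming the anonymous levels first and letting `reset_index` do the move. A sketch on illustrative data, reusing the `_<level#>` fallback from the loop above:

```python
import pandas as pd

pf = pd.DataFrame({
    'g': [0, 0, 1, 1],
    'n': ['Stacy', 'Sandra', 'Stacy', 'Sandra'],
    'x': [1, 5, 2, 8],
    'y': [7, 3, 9, 4],
})

agg = pf.groupby(['g', 'n']).agg(['min', 'max']).stack(0)

# Give nameless index levels a fallback name, e.g. '_2' for the third level.
names = [name or f'_{lvl}' for lvl, name in enumerate(agg.index.names)]
agg.index = agg.index.set_names(names)

# reset_index() turns every (now named) index level into a regular column.
flat = agg.reset_index()
print(flat)
```

Unlike the loop, this places the former index levels before the aggregation columns rather than after them.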


## Spark (Python)
Spark aggregation is a bit simpler, but still very flexible, so we can achieve the same results, with a little more work in some cases. Below are a simple example and a more complex one, reproducing the same three cases as above.

In [15]:
df.select('n', 'x', 'y').agg({'n':'max', 'x':'max', 'y':'max'}).toPandas()

Unnamed: 0,max(x),max(y),max(n)
0,99,99,Stacy


Or, with a little more work, we can reproduce the pandas case exactly:

In [16]:
from pyspark.sql import functions as F

df.select('n', 'x', 'y').agg(
    F.lit('max').alias('_idx'),
    F.max('n').alias('n'), 
    F.max('x').alias('x'), 
    F.max('y').alias('y')).toPandas()

Unnamed: 0,_idx,n,x,y
0,max,Stacy,99,99


More complicated aggregations cannot be expressed with the dict of function names, which maps each column to a single function; they must be provided as function calls. Below is a way to reproduce the groupby aggregation from the second pandas example:

In [17]:
(df
    .select('g', 'n', 'x', 'y')
    .groupby('g', 'n')
    .agg(
        F.count('n').alias('n_count'),
        F.min('x').alias('x_min'),
        F.max('x').alias('x_max'),
        F.min('y').alias('y_min'),
        F.max('y').alias('y_max')
    )
).toPandas()
        

Unnamed: 0,g,n,n_count,x_min,x_max,y_min,y_max
0,0,Sandra,175,0,99,0,99
1,0,Stacy,156,0,99,0,99
2,1,Stacy,147,1,98,0,99
3,2,Sandra,159,0,99,0,99
4,1,Sandra,189,0,99,0,99
5,2,Stacy,174,0,98,0,98
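
Because the dict form of `agg` takes a single function name per column, one way to keep working with names is to expand a dict of lists into SQL snippets and hand them to `F.expr`. This is only a sketch: `agg_exprs` and `grouped_agg` are hypothetical helpers, and the column and function names are interpolated without any quoting or validation.

```python
def agg_exprs(spec):
    """Expand {'x': ['min', 'max']} into ['min(x) AS x_min', 'max(x) AS x_max']."""
    return [
        f'{func}({col}) AS {col}_{func}'
        for col, funcs in spec.items()
        for func in funcs
    ]


def grouped_agg(df, by, spec):
    """Apply the expanded expressions to a Spark dataframe grouped by `by`."""
    # Imported lazily so that agg_exprs can be used without a Spark install.
    from pyspark.sql import functions as F
    return df.groupby(*by).agg(*[F.expr(e) for e in agg_exprs(spec)])


spec = {'n': ['count'], 'x': ['min', 'max'], 'y': ['min', 'max']}
for e in agg_exprs(spec):
    print(e)
```

With the notebook's dataframe this would be called as `grouped_agg(df, ['g', 'n'], spec).toPandas()`.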


### Stacking

Stacking, as in pandas, can be used to expose the column name in a separate index column. Unfortunately, `stack` is currently available only in the SQL interface and is not as flexible as its pandas counterpart (https://spark.apache.org/docs/2.3.0/api/sql/#stack).

You could use pyspark `expr` to call the SQL function, as explained here (https://stackoverflow.com/questions/42465568/unpivot-in-spark-sql-pyspark). Another way is to union the various results, as shown below.

In [18]:
from pyspark.sql import functions as F

(df
    .select('g', 'x')
    .groupby('g')
    .agg(
        F.lit('x').alias('_idx'),
        F.min('x').alias('min'),
        F.max('x').alias('max'),
        F.mean('x').alias('mean')
    )
).union(
df
    .select('g', 'y')
    .groupby('g')
    .agg(
        F.lit('y').alias('_idx'),
        F.min('y').alias('min'),
        F.max('y').alias('max'),
        F.mean('y').alias('mean')
    )
).toPandas()

Unnamed: 0,g,_idx,min,max,mean
0,1,x,0,99,51.011905
1,2,x,0,99,47.339339
2,0,x,0,99,49.05136
3,1,y,0,99,49.761905
4,2,y,0,99,48.0
5,0,y,0,99,50.957704
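
For comparison, the SQL `stack` route mentioned above can be sketched as follows. `stack_expr` and `unpivot` are hypothetical helpers; the value columns should share a compatible type, and names are not quoted or validated.

```python
def stack_expr(cols, name_col='_idx', value_col='value'):
    """Build a Spark SQL stack() call that unpivots `cols` into two columns."""
    pairs = ', '.join(f"'{c}', {c}" for c in cols)
    return f'stack({len(cols)}, {pairs}) as ({name_col}, {value_col})'


def unpivot(df, id_cols, value_cols):
    """Return one row per (id columns, original column name, value)."""
    # selectExpr accepts SQL expression strings, so no pyspark import is needed.
    return df.selectExpr(*id_cols, stack_expr(value_cols))


print(stack_expr(['x', 'y']))
```

On the notebook's dataframe, `unpivot(df, ['g'], ['x', 'y'])` followed by `.groupby('g', '_idx').agg(F.min('value').alias('min'), F.max('value').alias('max'), F.mean('value').alias('mean'))` should give the same stacked layout as the union above.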


### Generating aggregation code

The code above looks complicated but is very regular, hence we can generate it! What we need is to loop over a list of columns and, for each one, over a list of aggregation functions, as shown below:

In [19]:
dfs = []
for c in ['x','y']:
    print(' '*2, f'col: {c}')
    aggs = []
    for func in [F.min, F.max, F.mean]:
        f = func(c).alias(func.__name__)
        aggs.append(f)
        print(' '*4, f'func: {f}')
        
    dfs.append(df.select('g', c).groupby('g').agg(*aggs))

   col: x
     func: Column<b'min(x) AS `min`'>
     func: Column<b'max(x) AS `max`'>
     func: Column<b'avg(x) AS `mean`'>
   col: y
     func: Column<b'min(y) AS `min`'>
     func: Column<b'max(y) AS `max`'>
     func: Column<b'avg(y) AS `mean`'>


The dataframes in this list all have the same columns and can be reduced with union calls:

In [20]:
from functools import reduce

reduce(lambda a,b: a.union(b), dfs).toPandas()

Unnamed: 0,g,min,max,mean
0,1,0,99,51.011905
1,2,0,99,47.339339
2,0,0,99,49.05136
3,1,0,99,49.761905
4,2,0,99,48.0
5,0,0,99,50.957704


## Meet DataFaucet agg

One of the goals of datafaucet is to simplify analytics, data wrangling and data
discovery over a set of engines with an intuitive interface. So the solution
sketched above is available, with a few extras. See the examples below.

The code below attempts to produce readable, engine-agnostic data
aggregations. The aggregation API is always of the form:

`df.cols.get(...).groupby(...).agg(...)`

Alternatively, you can use `find` instead of `get`.

In [21]:
# simple aggregation by name
d = df.cols.get('x').agg('distinct')
d.data.grid()

Unnamed: 0,x
0,100


In [22]:
# simple aggregation (multiple) by name
d = df.cols.get('x').agg(['distinct', 'avg'])
d.data.grid()

Unnamed: 0,x_distinct,x_avg
0,100,49.14


In [23]:
# simple aggregation (multiple) by name (stacked)
d = df.cols.get('x').agg(['distinct', 'avg'], stack=True)
d.data.grid()

Unnamed: 0,_idx,distinct,avg
0,x,100,49.14


In [24]:
# simple aggregation (multiple) by name (stacked, custom index name)
d = df.cols.get('x').agg(['distinct', 'avg'], stack='colname')
d.data.grid()

Unnamed: 0,colname,distinct,avg
0,x,100,49.14


In [25]:
# simple aggregation (multiple) by name and function
d = df.cols.get('x').agg(['distinct', F.min, F.max, 'avg'])
d.data.grid()

Unnamed: 0,x_distinct,x_min,x_max,x_avg
0,100,0,99,49.14
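
Mixing names and functions in one list, as above, needs some way to resolve each entry to a callable and an output alias. The sketch below shows one way this could be done; it is not datafaucet's actual implementation, and `resolve` and `registry` are hypothetical names. It also accepts `(alias, function)` pairs like the `('uniq', FF.distinct)` entry used later in this notebook. Built-in Python functions stand in for Spark ones so the example runs on its own.

```python
def resolve(funcs, registry):
    """Turn names, callables or (alias, name_or_callable) pairs into (alias, callable)."""
    resolved = []
    for entry in funcs:
        if isinstance(entry, tuple):
            alias, func = entry
        else:
            func = entry
            # Use the name itself, or the callable's __name__, as the alias.
            alias = func if isinstance(func, str) else func.__name__
        if isinstance(func, str):
            if func not in registry:
                raise ValueError(f'function {func} not found')
            func = registry[func]
        resolved.append((alias, func))
    return resolved


# Stand-in registry; with Spark it could map names to pyspark.sql.functions.
registry = {'min': min, 'max': max}
pairs = resolve(['min', max, ('biggest', 'max')], registry)
print([alias for alias, _ in pairs])
```

Each resolved pair can then be applied to a column and aliased, in the same way as the generated aggregations in the previous section.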


In [26]:
# multiple aggregation by name and function
d = df.cols.get('x', 'y').agg(['distinct', F.min, F.max, 'avg'])
d.data.grid()

Unnamed: 0,x_distinct,x_min,x_max,x_avg,y_distinct,y_min,y_max,y_avg
0,100,0,99,49.14,100,0,99,49.571


In [27]:
# multiple aggregation (multiple) by name and function
d = df.cols.get('x', 'y').agg({
    'x':['distinct', F.min], 
    'y':['distinct', 'max']})

d.data.grid()

Unnamed: 0,x_distinct,x_min,x_max,y_distinct,y_min,y_max
0,100,0,,100,,99


In [28]:
# multiple aggregation (multiple) by name and function (stacked)
d = df.cols.get('x', 'y').agg({
    'x':['distinct', F.min], 
    'y':['distinct', 'max']}, stack=True)
d.data.grid()

Unnamed: 0,_idx,distinct,min,max
0,x,100,0.0,
1,y,100,,99.0


In [29]:
# grouped by, multiple aggregation (multiple) by name and function (stacked)
d = df.cols.get('x', 'y').groupby('g','n').agg({
    'x':['distinct', F.min], 
    'y':['distinct', 'max']}, stack=True)
d.data.grid()

Unnamed: 0,g,n,_idx,distinct,min,max
0,0,Sandra,x,88,0.0,
1,0,Stacy,x,78,0.0,
2,1,Stacy,x,81,2.0,
3,2,Sandra,x,89,0.0,
4,1,Sandra,x,88,0.0,
5,2,Stacy,x,78,0.0,
6,0,Sandra,y,84,,99.0
7,0,Stacy,y,84,,99.0
8,1,Stacy,y,79,,99.0
9,2,Sandra,y,83,,99.0


### Extended list of aggregation

An extended list of aggregations is available in the datafaucet library, both by name and by function.

In [32]:
from datafaucet.spark import functions as FF

d = df.cols.get('x', 'y').groupby('g','n').agg([
        'type',
        ('uniq', FF.distinct),
        'one',
        'top3',
    ], stack=True)

d.data.grid()

ValueError: function type not found