# Aggregate Functions

When faced with a large amount of data, a useful first step is to compute summary statistics.

NumPy has built-in aggregation functions for working on arrays, which we will discuss here.

In [6]:
import numpy as np
np.random.seed(1234567890)

## Summing Values

Python has its own built-in `sum` function:

In [18]:
array = np.random.random(100)
sum(array)

49.60614887869724

and NumPy has its own corresponding one:

In [19]:
np.sum(array)

49.606148878697233

As discussed earlier, NumPy's functions are compiled, so they should be much quicker on large arrays:

In [21]:
big_array = np.random.random(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

10 loops, best of 3: 75.7 ms per loop
1000 loops, best of 3: 406 µs per loop
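
The `%timeit` magic is only available inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used instead. The array size, seed, and repetition count below are arbitrary choices for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

# Assumed setup for this sketch: a seeded generator and a moderately large array.
rng = np.random.default_rng(0)
values = rng.random(100_000)

# Time three repetitions of each approach; absolute numbers depend on the machine.
builtin_seconds = timeit.timeit(lambda: sum(values), number=3)
numpy_seconds = timeit.timeit(lambda: np.sum(values), number=3)

print(f"built-in sum: {builtin_seconds:.4f} s for 3 runs")
print(f"np.sum:       {numpy_seconds:.4f} s for 3 runs")

# Both approaches agree up to floating-point rounding.
results_agree = bool(np.isclose(sum(values), np.sum(values)))
print(results_agree)
```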


We recommend using the NumPy `sum` function, which also works on multidimensional arrays.
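
To illustrate the difference on a multidimensional array, here is a small sketch with a hypothetical 2x3 array named `grid`. The built-in `sum` iterates over the first axis and adds the rows together, whereas `np.sum` adds every element unless an axis is given.

```python
import numpy as np

# A small 2x3 array, made up for this example.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# The built-in sum adds the rows element-wise: [1+4, 2+5, 3+6].
builtin_result = sum(grid)

# np.sum adds all six elements into a single number.
numpy_result = np.sum(grid)

print(builtin_result)  # [5 7 9]
print(numpy_result)    # 21
```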

## Maximum and Minimum

Python also has built-in `max` and `min` functions:

In [6]:
min(big_array)

2.425872294153919e-07

In [7]:
max(big_array)

0.99999661881043855

Again, there are NumPy equivalents. The plain Python functions suffer from the same speed problem as the built-in `sum`, so we recommend the NumPy versions:

In [8]:
np.min(big_array)

2.425872294153919e-07

In [9]:
np.max(big_array)

0.99999661881043855

There is also a shorter syntax that may be useful: calling the aggregate as a method of the array itself. This works for many other aggregate functions as well:

In [10]:
print(big_array.min(), big_array.max(), big_array.sum())

2.42587229415e-07 0.99999661881 499718.699975
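
As a brief sketch of the method syntax with a few other aggregates, the example below uses a small made-up array named `data`. Note that not every aggregate function has a method form: `np.median`, for instance, is only available as a function.

```python
import numpy as np

# A short array invented for this example.
data = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Method form and function form give the same result.
mean_method = data.mean()
mean_function = np.mean(data)
print(mean_method, mean_function)

# argmax returns the index of the largest element (here, the last one).
largest_index = data.argmax()
print(largest_index)  # 4

# Arrays have no .median() method; use np.median(data) instead.
has_median_method = hasattr(data, "median")
print(has_median_method)  # False
```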


## Aggregates on Multidimensional Arrays

Aggregate functions work on multidimensional arrays too:

In [30]:
matrix = np.random.normal(0, 1, (5, 4))
print(matrix)

[[  1.96028429e-04  -2.91605470e-01   7.00145665e-01  -7.81665049e-01]
 [ -1.03140273e+00  -5.34097360e-01  -6.98989896e-01  -3.58551478e-01]
 [  1.39277189e+00  -7.88091397e-01  -1.86613048e+00  -7.48470646e-01]
 [ -1.18093486e+00   2.92262116e-01  -5.24043263e-01  -1.77156390e+00]
 [ -5.25360017e-01  -5.55513279e-01  -1.42538837e+00   4.29336780e-01]]


In [31]:
matrix.sum()

-10.26709570646498

By default, the aggregate is computed over every element of the array, which is often not what we want. To aggregate along a single axis, we have to specify it with the `axis` keyword. The `axis` keyword names the dimension of the array that will be collapsed, rather than the dimension that will be returned. Thus:

* `axis = 0` computes column-wise, collapsing the rows

In [32]:
matrix.min(axis=0)

array([-1.18093486, -0.7880914 , -1.86613048, -1.7715639 ])

* `axis = 1` computes row-wise, collapsing the columns

In [33]:
matrix.max(axis=1)

array([ 0.70014567, -0.35855148,  1.39277189,  0.29226212,  0.42933678])
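
Because the random matrix above is hard to check by eye, here is a small sketch of the same idea on a hypothetical 3x4 array of consecutive integers, where the result shapes make it clear which axis was collapsed. The `keepdims=True` option, which is not used elsewhere in this section, keeps the collapsed axis with length 1.

```python
import numpy as np

# 3 rows and 4 columns holding the integers 0..11.
table = np.arange(12).reshape(3, 4)

# axis=0 collapses the 3 rows, leaving one value per column.
col_sums = table.sum(axis=0)

# axis=1 collapses the 4 columns, leaving one value per row.
row_sums = table.sum(axis=1)

# keepdims=True retains the collapsed axis as a dimension of length 1.
row_sums_2d = table.sum(axis=1, keepdims=True)

print(col_sums, col_sums.shape)  # [12 15 18 21] (4,)
print(row_sums, row_sums.shape)  # [ 6 22 38] (3,)
print(row_sums_2d.shape)         # (3, 1)
```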

## Aggregate Functions on Missing Data

Most NumPy aggregate functions do not skip missing data, which NumPy represents as `NaN`. Rather than raising an error, a single `NaN` typically propagates through the calculation, so the result is itself `NaN`. For this reason, NumPy provides `NaN`-safe routines that ignore missing values.
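
A minimal sketch of how a single `NaN` affects a plain aggregate compared with its NaN-safe counterpart, using a made-up array named `readings`:

```python
import numpy as np

# Three readings, one of which is missing.
readings = np.array([1.0, np.nan, 3.0])

# The plain aggregate returns nan because the missing value propagates.
plain_sum = np.sum(readings)

# The NaN-safe versions ignore the missing value.
safe_sum = np.nansum(readings)    # 1.0 + 3.0
safe_mean = np.nanmean(readings)  # mean of the two valid readings

# Counting the missing entries is often a useful companion check.
missing_count = np.isnan(readings).sum()

print(plain_sum)      # nan
print(safe_sum)       # 4.0
print(safe_mean)      # 2.0
print(missing_count)  # 1
```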

The following table provides a list of useful aggregate functions and their NaN-safe equivalents:


|Function Name      |   NaN-safe Version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |
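
As a short sketch of the Boolean aggregates and `np.median` from the table, the example below applies them to a small made-up array named `scores`. The thresholds are arbitrary and only serve the illustration.

```python
import numpy as np

# A 2x2 array of scores invented for this example.
scores = np.array([[0.2, 0.9],
                   [0.4, 0.6]])

# np.any and np.all aggregate a Boolean array produced by a comparison.
any_high = bool(np.any(scores > 0.8))      # at least one score above 0.8
all_positive = bool(np.all(scores > 0.0))  # every score is positive

# Like other aggregates, they accept an axis argument.
rows_above_threshold = np.all(scores > 0.3, axis=1)

# The median is taken over all elements when no axis is given.
overall_median = np.median(scores)

print(any_high, all_positive)  # True True
print(rows_above_threshold)    # [False  True]
print(overall_median)
```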

Source: Jake VanderPlas (2016), Python Data Science Handbook: Essential Tools for Working with Data, O'Reilly Media.