
Add NEP for group-by additions to NumPy: reduceby, reducein, segment, and edges.
1 parent a839a42 commit bceb9c20700514db5667831ce2878f1660fb071f Travis Oliphant committed Apr 29, 2010
Showing with 112 additions and 0 deletions.
  1. +112 −0 doc/neps/groupby_additions.rst
112 doc/neps/groupby_additions.rst
@@ -0,0 +1,112 @@
+ A proposal for adding groupby functionality to NumPy
+:Author: Travis Oliphant
+:Date: 2010-04-27
+Executive summary
+NumPy provides tools for handling data and doing calculations in much
+the same way as relational algebra allows. However, the common group-by
+functionality is not easily handled. The reduce methods of NumPy's
+ufuncs are a natural place to put this groupby behavior. This NEP
+describes two additional methods for ufuncs (reduceby and reducein) and
+two additional functions (segment and edges) which can help add this
+functionality.
+Example Use Case
+Suppose you have a NumPy structured array containing information about
+the number of purchases at several stores over multiple days. To be clear, the
+structured array data-type is:
+dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
+ ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]
+Suppose there is a 1-d NumPy array of this data-type and you would like
+to compute various statistics (max, min, mean, sum, etc.) on the number
+of products sold, by product, by month, by store, etc.
+Currently, this could be done by using reduce methods on the number
+field of the array, coupled with in-place sorting, unique with
+return_inverse=True and bincount, etc. However, for such a common
+data-analysis need, it would be nice to have standard and more direct
+ways to get the results.
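For reference, the indirect route described above might look like the following with existing NumPy calls. This is only a sketch: the sample records are invented for illustration, and only three of the fields from the data-type above are kept.

```python
import numpy as np

# Hypothetical sample records; a cut-down version of the data-type above.
dt = [('month', 'i1'), ('store', 'i4'), ('number', 'i4')]
sales = np.array(
    [(1, 10, 5), (1, 20, 3), (2, 10, 7), (1, 10, 2), (2, 20, 4)], dtype=dt)

# Map each store id to a dense group label 0..k-1.
stores, labels = np.unique(sales['store'], return_inverse=True)

# Sum per store: bincount with weights (the result is floating point).
totals = np.bincount(labels, weights=sales['number'])

# Max per store: sort by label, then reduceat at the start of each group.
order = np.argsort(labels, kind='stable')
sorted_labels = labels[order]
starts = np.flatnonzero(np.r_[True, sorted_labels[1:] != sorted_labels[:-1]])
maxima = np.maximum.reduceat(sales['number'][order], starts)

print(stores, totals, maxima)
```

Each statistic needs its own combination of sorting, labeling, and reduction, which is the awkwardness the proposal aims to remove.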
+Ufunc methods proposed
+It is proposed to add two new reduce-style methods to the ufuncs:
+reduceby and reducein. The reducein method is intended to be a
+simpler-to-use version of reduceat, while the reduceby method is
+intended to provide group-by capability on reductions.
+ <ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)
+ Perform a local reduce with slices specified by pairs of indices.
+ The reduction occurs along the provided axis, using the provided
+ data-type to calculate intermediate results, storing the result into
+ the array out (if provided).
+ The indices array provides the start and end indices for the
+ reduction. If the length of the indices array is odd, then the
+ final index provides the beginning point for the final reduction
+ and the ending point is the end of arr.
+ This generalizes, along the given axis, the behavior:
+ [<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
+ for i in range(len(indices)//2)]
+ This assumes indices is of even length.
+ Example:
+ >>> a = [0,1,2,4,5,6,9,10]
+ >>> add.reducein(a,[0,3,2,5,-2])
+ [3, 11, 19]
+ Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19
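Since reducein is only proposed here, the described behavior can be approximated in pure Python for the 1-d case. The helper name `reducein_1d` is hypothetical, and the handling of `axis`, `dtype`, and `out` is left out of this sketch.

```python
import numpy as np

def reducein_1d(ufunc, arr, indices):
    """Hypothetical 1-d emulation of the proposed <ufunc>.reducein."""
    arr = np.asarray(arr)
    indices = list(indices)
    if len(indices) % 2:
        # Odd length: the last index starts a reduction that runs to the
        # end of arr.
        indices.append(len(arr))
    # Consecutive (start, stop) pairs each define one slice to reduce.
    pairs = zip(indices[0::2], indices[1::2])
    return [ufunc.reduce(arr[start:stop]) for start, stop in pairs]

a = [0, 1, 2, 4, 5, 6, 9, 10]
result = reducein_1d(np.add, a, [0, 3, 2, 5, -2])
print(result)
```

With the example input above this reproduces the three sums 3, 11, and 19. An empty slice would raise for ufuncs without an identity (for example maximum); the proposal does not say how that case should behave.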
+ <ufunc>.reduceby(arr, by, dtype=None, out=None)
+ Perform a reduction in arr over unique non-negative integers in by.
+ Let N=arr.ndim and M=by.ndim. Then, by.shape[:N] == arr.shape.
+ In addition, let I be an N-length index tuple, then by[I]
+ contains the location in the output array for the reduction to
+ be stored. Notice that if N == M, then by[I] is a non-negative
+ integer, while if N < M, then by[I] is an array of indices into
+ the output array.
+ The reduction is computed on groups specified by unique indices
+ into the output array. The index is either the single
+ non-negative integer if N == M or, if N < M, the entire
+ (M-N+1)-length index by[I] considered as a whole.
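For the simplest case, N == M, the grouping described above can be approximated with the existing `ufunc.at` method. The helper name `reduceby_same_ndim` is hypothetical, and two choices are assumptions of this sketch rather than part of the proposal: the output length is taken as `by.max() + 1`, and output slots start from the ufunc's identity.

```python
import numpy as np

def reduceby_same_ndim(ufunc, arr, by):
    """Hypothetical emulation of the proposed reduceby for N == M."""
    arr = np.asarray(arr)
    by = np.asarray(by)
    if by.shape != arr.shape:
        raise ValueError("by must have the same shape as arr when N == M")
    if ufunc.identity is None:
        raise ValueError("this sketch needs a ufunc with an identity")
    # Assumption: one output slot per label 0..by.max(), seeded with the
    # ufunc identity.
    out = np.full(by.max() + 1, ufunc.identity, dtype=arr.dtype)
    # ufunc.at is unbuffered, so repeated labels accumulate correctly.
    ufunc.at(out, by.ravel(), arr.ravel())
    return out

values = np.array([5, 3, 7, 2, 4])
groups = np.array([0, 1, 0, 0, 1])
sums = reduceby_same_ndim(np.add, values, groups)
print(sums)
```

Here `by[I]` is the output position for `arr[I]`, so elements sharing a label are reduced together. The N < M case, where `by[I]` is a multi-component index, is not covered.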
+Functions proposed
+.. Local Variables:
+.. mode: rst
+.. coding: utf-8
+.. fill-column: 72
+.. End:
