# Appendix A - Advanced NumPy

## A.1 ndarray Object Internals

Internally, an ndarray consists of the following:
* A pointer to data—that is, a block of data in RAM or in a memory-mapped file
* The data type or dtype, describing fixed-size value cells in the array
* A tuple indicating the array’s shape
* A tuple of strides, integers indicating the number of bytes to “step” in order to advance one element along a dimension

In [1]:
import numpy as np

np.ones((10, 5)).shape

(10, 5)

A typical (C order) 3 × 4 × 5 array of float64 (8-byte) values has strides `(160, 40, 8)`. Knowing about strides can be useful because, in general, the larger the stride on a particular axis, the more costly it is to perform computation along that axis.

In [2]:
np.ones((10, 5)).strides

(40, 8)

In [3]:
np.ones((3, 4, 5)).strides

(160, 40, 8)
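As a rough sketch of where these numbers come from, the strides of a C-order array can be derived from its shape and item size. The helper name `c_order_strides` is an assumption made for illustration and is not part of NumPy:

```python
import numpy as np

def c_order_strides(shape, itemsize):
    """Compute byte strides for a C-contiguous array (illustrative helper)."""
    strides = []
    step = itemsize  # the last axis advances one element at a time
    for dim in reversed(shape):
        strides.append(step)
        step *= dim  # one step on the next-outer axis skips `dim` of these
    return tuple(reversed(strides))

arr = np.ones((3, 4, 5))  # float64, so itemsize is 8 bytes
computed = c_order_strides(arr.shape, arr.itemsize)

print(computed)     # (160, 40, 8)
print(arr.strides)  # (160, 40, 8)
```

The last axis has the smallest stride because its elements sit next to each other in memory.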

### NumPy dtype Hierarchy

You may occasionally have code that needs to check whether an array contains integers, floating-point numbers, strings, or Python objects. Because there are multiple types of floating-point numbers (`float16` through `float128`), checking that the dtype is among a list of types would be very verbose. Fortunately, the dtypes have superclasses such as `np.integer` and `np.floating`, which can be used in conjunction with the `np.issubdtype` function:

In [4]:
ints = np.ones(10, dtype=np.uint16)

floats = np.ones(10, dtype=np.float32)

ints, floats

(array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=uint16),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32))

In [5]:
np.issubdtype(ints.dtype, np.integer)

True

In [6]:
np.issubdtype(floats.dtype, np.floating)

True

In [7]:
# You can see all of the parent classes of a specific 
# dtype by calling the type’s mro method
np.float64.mro()

[numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object]

In [8]:
np.uint16.mro()

[numpy.uint16,
 numpy.unsignedinteger,
 numpy.integer,
 numpy.number,
 numpy.generic,
 object]

In [9]:
np.issubdtype(ints.dtype, np.number)

True

In [10]:
np.issubdtype(ints.dtype, np.unsignedinteger)

True

In [11]:
np.bool_.mro()

[numpy.bool_, numpy.generic, object]
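One way these superclasses might be used in practice is a small dispatch helper. This is only a sketch: the function name `dtype_family` and its labels are assumptions chosen for illustration.

```python
import numpy as np

def dtype_family(arr):
    """Return a coarse label for an array's dtype (hypothetical helper)."""
    dt = arr.dtype
    if np.issubdtype(dt, np.bool_):
        return "boolean"
    if np.issubdtype(dt, np.integer):
        return "integer"
    if np.issubdtype(dt, np.floating):
        return "floating"
    return "other"

# Try a few dtypes, including a generic Python object array
families = [dtype_family(np.ones(3, dtype=dt))
            for dt in (np.uint16, np.float32, np.bool_, object)]
print(families)  # ['integer', 'floating', 'boolean', 'other']
```

Note that `np.bool_` is not a subclass of `np.integer`, which is consistent with the `mro` output above.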

## A.2 Advanced Array Manipulation

### Reshaping Arrays

In [12]:
arr = np.arange(8)
arr

array([0, 1, 2, 3, 4, 5, 6, 7])

In [13]:
arr.reshape((4, 2))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

In [14]:
arr.reshape((4, 2), order='C')

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

In [15]:
arr.reshape((4, 2), order='F')

array([[0, 4],
       [1, 5],
       [2, 6],
       [3, 7]])

In [16]:
arr.reshape((4, 2)).reshape((2, 4))

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

One of the passed shape dimensions can be `-1`, in which case the value used for that
dimension will be inferred from the data:

In [17]:
arr = np.arange(15)

arr.reshape((5, -1))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [18]:
# Since an array’s shape attribute is a tuple, it can be passed to reshape, too
other_arr = np.ones((3, 5))

arr.reshape(other_arr.shape)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

The opposite operation of `reshape` from one-dimensional to a higher dimension is
typically known as **flattening** or **raveling**:

In [19]:
arr = np.arange(15).reshape((5, 3))

arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [20]:
arr.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [21]:
arr.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

`ravel` does not produce a copy of the underlying values if the values in the result
were contiguous in the original array. The `flatten` method behaves like `ravel` except
it always returns a copy of the data
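A quick way to check the copy behavior is `np.shares_memory`. The sketch below assumes a small integer array built with `arange`; the variable names are illustrative only.

```python
import numpy as np

arr = np.arange(15).reshape((5, 3))

raveled = arr.ravel()      # C-contiguous input, so no copy is needed
flattened = arr.flatten()  # always a copy

print(np.shares_memory(arr, raveled))    # True
print(np.shares_memory(arr, flattened))  # False

# A transposed array is not C-contiguous, so ravel has to copy here
transposed_raveled = arr.T.ravel()
print(np.shares_memory(arr, transposed_raveled))  # False
```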

### C Versus Fortran Order

Functions like `reshape` and `ravel` accept an `order` argument indicating the order in which to use the data in the array. This is usually set to `'C'` or `'F'`.

In [22]:
arr = np.arange(12).reshape((3, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [23]:
arr.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [24]:
arr.ravel('F')

array([ 0,  4,  8,  1,  5,  9,  2,  6, 10,  3,  7, 11])

Reshaping arrays with more than two dimensions can be a bit mind-bending. The key difference between C and Fortran order is the way in which the dimensions are walked:
* C/row major order
    * Traverse higher dimensions first (e.g., axis 1 before advancing on axis 0).
* Fortran/column major order
    * Traverse higher dimensions last (e.g., axis 0 before advancing on axis 1)
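The two traversal orders can be seen on a three-dimensional array. This is a minimal sketch that assumes a 2 × 3 × 4 array filled with `arange`, so that each value equals its position in C order:

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))

c_flat = arr.ravel(order='C')  # last axis varies fastest
f_flat = arr.ravel(order='F')  # first axis varies fastest

print(c_flat[:4].tolist())  # [0, 1, 2, 3]
print(f_flat[:4].tolist())  # [0, 12, 4, 16]

# The first Fortran-order elements are arr[0,0,0], arr[1,0,0], arr[0,1,0], arr[1,1,0]
first_four = [arr[0, 0, 0], arr[1, 0, 0], arr[0, 1, 0], arr[1, 1, 0]]
```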
    
### Concatenating and Splitting Arrays

In [25]:
arr1 = np.array([[1, 2, 3], [4, 5, 6]])

arr2 = np.array([[7, 8, 9], [10, 11, 12]])

In [26]:
np.concatenate([arr1, arr2], axis=0)

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [27]:
np.concatenate([arr1, arr2], axis=1)

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

There are some convenience functions, like `vstack` and `hstack`, for common kinds of
concatenation. The preceding operations could have been expressed as:

In [28]:
np.vstack((arr1, arr2))

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [29]:
np.hstack((arr1, arr2))

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

`split`, on the other hand, slices apart an array into multiple arrays along an axis

In [30]:
arr = np.random.randn(5, 2)
arr

array([[ 0.16386506,  1.3675787 ],
       [-0.36596317,  0.39997361],
       [ 0.72088765,  1.01511159],
       [ 1.0209746 , -0.10938467],
       [ 0.27720946, -0.36140418]])

In [31]:
first, second, third = np.split(arr, [1, 3])

first

array([[0.16386506, 1.3675787 ]])

In [32]:
second

array([[-0.36596317,  0.39997361],
       [ 0.72088765,  1.01511159]])

In [33]:
third

array([[ 1.0209746 , -0.10938467],
       [ 0.27720946, -0.36140418]])
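The value `[1, 3]` passed to `np.split` gives the indices at which to cut the array. As a sketch, using a deterministic array instead of random data, the pieces match the corresponding slices:

```python
import numpy as np

arr = np.arange(10).reshape((5, 2))

pieces = np.split(arr, [1, 3])        # cut before row 1 and before row 3
manual = [arr[:1], arr[1:3], arr[3:]]  # the equivalent slices

shapes = [p.shape for p in pieces]
print(shapes)  # [(1, 2), (2, 2), (2, 2)]

# Concatenating the pieces along the same axis restores the original
restored = np.concatenate(pieces, axis=0)
```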

### Stacking helpers: r_ and c_

In [34]:
arr = np.arange(6)

arr1 = arr.reshape((3, 2))

arr2 = np.random.randn(3, 2)

In [35]:
np.r_[arr1, arr2]

array([[ 0.        ,  1.        ],
       [ 2.        ,  3.        ],
       [ 4.        ,  5.        ],
       [-1.65712798,  0.06950105],
       [-1.53363004,  0.90192553],
       [-0.53149155, -0.71446115]])

In [36]:
np.c_[np.r_[arr1, arr2], arr]

array([[ 0.        ,  1.        ,  0.        ],
       [ 2.        ,  3.        ,  1.        ],
       [ 4.        ,  5.        ,  2.        ],
       [-1.65712798,  0.06950105,  3.        ],
       [-1.53363004,  0.90192553,  4.        ],
       [-0.53149155, -0.71446115,  5.        ]])

In [37]:
# These additionally can translate slices to arrays
np.c_[1:6, -10:-5]

array([[  1, -10],
       [  2,  -9],
       [  3,  -8],
       [  4,  -7],
       [  5,  -6]])

### Repeating Elements: tile and repeat

In [38]:
arr = np.arange(3)
arr

array([0, 1, 2])

In [39]:
arr.repeat(3)

array([0, 0, 0, 1, 1, 1, 2, 2, 2])

By default, if you pass an integer, each element will be repeated that number of times.
If you pass an array of integers, each element can be repeated a different number of
times

In [40]:
arr.repeat([2, 3, 4])

array([0, 0, 1, 1, 1, 2, 2, 2, 2])

Multidimensional arrays can have their elements repeated along a particular axis.

In [41]:
arr = np.random.randn(2, 2)
arr

array([[-0.14124957, -2.08834871],
       [-0.40339915,  0.57617911]])

In [42]:
arr.repeat(2, axis=0)

array([[-0.14124957, -2.08834871],
       [-0.14124957, -2.08834871],
       [-0.40339915,  0.57617911],
       [-0.40339915,  0.57617911]])

Note that if no axis is passed, the array will be flattened first, which is likely not what
you want. Similarly, you can pass an array of integers when repeating a multidimensional
array to repeat a given slice a different number of times:

In [43]:
arr.repeat([2, 3], axis=0)

array([[-0.14124957, -2.08834871],
       [-0.14124957, -2.08834871],
       [-0.40339915,  0.57617911],
       [-0.40339915,  0.57617911],
       [-0.40339915,  0.57617911]])

In [44]:
arr.repeat([2, 3], axis=1)

array([[-0.14124957, -0.14124957, -2.08834871, -2.08834871, -2.08834871],
       [-0.40339915, -0.40339915,  0.57617911,  0.57617911,  0.57617911]])

`tile`, on the other hand, is a shortcut for stacking copies of an array along an axis.
Visually you can think of it as being akin to “laying down tiles”:

In [45]:
arr

array([[-0.14124957, -2.08834871],
       [-0.40339915,  0.57617911]])

In [46]:
np.tile(arr, 2)

array([[-0.14124957, -2.08834871, -0.14124957, -2.08834871],
       [-0.40339915,  0.57617911, -0.40339915,  0.57617911]])

In [47]:
# The second argument to tile can be a tuple
# indicating the layout of the “tiling”
np.tile(arr, (3, 2))

array([[-0.14124957, -2.08834871, -0.14124957, -2.08834871],
       [-0.40339915,  0.57617911, -0.40339915,  0.57617911],
       [-0.14124957, -2.08834871, -0.14124957, -2.08834871],
       [-0.40339915,  0.57617911, -0.40339915,  0.57617911],
       [-0.14124957, -2.08834871, -0.14124957, -2.08834871],
       [-0.40339915,  0.57617911, -0.40339915,  0.57617911]])
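To make the difference between the two functions easier to see, here is a small side-by-side sketch on integer data (the arrays are made up for illustration):

```python
import numpy as np

arr = np.arange(3)

repeated = np.repeat(arr, 2)  # each element duplicated in place
tiled = np.tile(arr, 2)       # the whole array laid down twice

print(repeated.tolist())  # [0, 0, 1, 1, 2, 2]
print(tiled.tolist())     # [0, 1, 2, 0, 1, 2]

grid = np.arange(4).reshape((2, 2))
tiled_shape = np.tile(grid, (3, 2)).shape      # 3 copies down, 2 across
repeated_shape = grid.repeat(2, axis=0).shape  # each row doubled

print(tiled_shape)     # (6, 4)
print(repeated_shape)  # (4, 2)
```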

## A.3 Broadcasting

In [48]:
arr = np.random.randn(4, 3)
arr

array([[-0.49686946, -1.27582474, -0.06440397],
       [-0.17882568, -0.11396129,  0.42356345],
       [ 2.49045093, -1.26066715,  0.75243843],
       [ 0.84295562, -0.34382055, -0.07924497]])

In [49]:
arr.mean(0)

array([ 0.66442785, -0.74856844,  0.25808824])

In [50]:
demeaned = arr - arr.mean(0)
demeaned

array([[-1.16129731, -0.52725631, -0.32249221],
       [-0.84325353,  0.63460714,  0.16547521],
       [ 1.82602308, -0.51209872,  0.4943502 ],
       [ 0.17852776,  0.40474788, -0.3373332 ]])

#### The Broadcasting Rule
* Two arrays are compatible for broadcasting if for each trailing dimension (i.e., starting from the end) the axis lengths match or if either of the lengths is 1. Broadcasting is then performed over the missing or length 1 dimensions.

Since `arr.mean(0)` has length 3, it is compatible for broadcasting across axis 0 because
the trailing dimension in `arr` is 3 and therefore matches. According to the rules, to
subtract over axis 1 (i.e., subtract the row mean from each row), the smaller array must
have shape `(4, 1)`.

In [51]:
row_means = arr.mean(1)

row_means.shape

(4,)

In [52]:
demeaned = arr - row_means.reshape((4, 1))
demeaned

array([[ 0.1154966 , -0.66345869,  0.54796209],
       [-0.22241784, -0.15755345,  0.37997129],
       [ 1.82971019, -1.92140789,  0.0916977 ],
       [ 0.70299225, -0.48378392, -0.21920833]])
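The broadcasting rule can also be written out as a short function. This is a sketch for illustration only: `broadcast_shape` is a hypothetical helper, not a NumPy function, and it handles just two shapes.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Apply the broadcasting rule to two shapes (illustrative helper)."""
    ndim = max(len(shape_a), len(shape_b))
    # Missing leading dimensions behave like length-1 dimensions
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for x, y in zip(a, b):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} are not compatible")
    return tuple(result)

print(broadcast_shape((4, 3), (3,)))    # (4, 3)
print(broadcast_shape((4, 3), (4, 1)))  # (4, 3)

# A length-4 array does not match the trailing dimension of a (4, 3) array
try:
    broadcast_shape((4, 3), (4,))
    incompatible = False
except ValueError:
    incompatible = True
print(incompatible)  # True
```

This is why `row_means` had to be reshaped to `(4, 1)` before subtracting.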

### Broadcasting Over Other Axes

Broadcasting with higher dimensional arrays can seem even more mind-bending, but
it is really a matter of following the rules. 

A common problem is needing to add a new axis with length 1 specifically
for broadcasting purposes. Using `reshape` is one option, but inserting an axis
requires constructing a tuple indicating the new shape, which can be tedious.
NumPy arrays therefore offer a special syntax for inserting new axes by indexing: the special `np.newaxis` attribute, used along with “full” slices, inserts the new
axis.

In [53]:
arr = np.zeros((4, 4))

arr_3d = arr[:, np.newaxis, :]

arr_3d.shape

(4, 1, 4)

In [54]:
arr_1d = np.random.normal(size=3)

arr_1d[:, np.newaxis]

array([[-0.30875999],
       [ 0.08430808],
       [ 1.25344254]])

In [55]:
arr_1d[np.newaxis, :]

array([[-0.30875999,  0.08430808,  1.25344254]])

In [56]:
# if we had a three-dimensional array and wanted to demean axis 2
# we would need to write
arr = np.random.randn(3, 4, 5)

depth_means = arr.mean(2)
depth_means

array([[-0.13632771,  0.41990525, -0.02729088, -0.15364224],
       [-0.29766037, -0.13777062, -0.74605507, -0.22467474],
       [ 0.06857992,  0.07539315,  0.07249135, -0.2603021 ]])

In [57]:
depth_means.shape

(3, 4)

In [58]:
demeaned = arr - depth_means[:, :, np.newaxis]
demeaned.mean(2)

array([[-2.22044605e-17, -3.33066907e-17,  9.02056208e-18,
        -1.11022302e-17],
       [-2.22044605e-17, -4.44089210e-17,  0.00000000e+00,
         0.00000000e+00],
       [ 2.22044605e-17, -1.66533454e-17,  4.44089210e-17,
         0.00000000e+00]])
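The same idea can be generalized to any axis without writing out the index by hand. The function below is a sketch (the name `demean_axis` is chosen for illustration); it builds the index tuple programmatically and is compared against the `keepdims=True` option of `mean`, which is another way to get the same result.

```python
import numpy as np

def demean_axis(arr, axis=0):
    """Subtract the mean along `axis`, inserting a length-1 axis for broadcasting."""
    means = arr.mean(axis)
    # Equivalent to writing means[:, :, np.newaxis] with newaxis at position `axis`
    indexer = [slice(None)] * arr.ndim
    indexer[axis] = np.newaxis
    return arr - means[tuple(indexer)]

np.random.seed(0)  # for reproducibility
arr = np.random.randn(3, 4, 5)

demeaned = demean_axis(arr, axis=1)
alternative = arr - arr.mean(axis=1, keepdims=True)

print(np.allclose(demeaned.mean(1), 0))    # True
print(np.allclose(demeaned, alternative))  # True
```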

### Setting Array Values by Broadcasting

In [59]:
arr = np.zeros((4, 3))
arr[:] = 5
arr

array([[5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.]])
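The same broadcasting rule applies when the assigned value is an array rather than a scalar. As a sketch with made-up values, a one-dimensional array can be written into every column as long as its shape is compatible:

```python
import numpy as np

arr = np.zeros((4, 3))
col = np.array([1.28, -0.42, 0.44, 1.6])

# Shape (4, 1) broadcasts across the three columns
arr[:] = col[:, np.newaxis]

# A (2, 1) value broadcasts across the first two rows only
arr[:2] = [[-1.37], [0.509]]

print(arr)
```

After these assignments, the first two rows hold -1.37 and 0.509, and the last two rows keep the values from `col`.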

## A.4 Advanced ufunc Usage

Each of NumPy’s binary ufuncs has special methods for performing certain kinds of
vectorized operations.

`reduce` takes a single array and aggregates its values, optionally along an axis, by
performing a sequence of binary operations. For example, an alternative way to sum
elements in an array is to use `np.add.reduce`:

In [60]:
arr = np.arange(10)

np.add.reduce(arr)

45

In [61]:
arr.sum()

45

The starting value (0 for `add`) depends on the ufunc. If an axis is passed, the reduction
is performed along that axis. This allows you to answer certain kinds of questions in a
concise way. As a less trivial example, we can use `np.logical_and` to check whether
the values in each row of an array are sorted:

In [62]:
np.random.seed(123456) # for reproducibility

arr = np.random.randn(5, 5)

arr[::2].sort(1)

arr[:, :-1] < arr[:, 1:]

array([[ True,  True,  True,  True],
       [ True, False,  True, False],
       [ True,  True,  True,  True],
       [False,  True, False, False],
       [ True,  True,  True,  True]])

In [63]:
np.logical_and.reduce(arr[:, :-1] < arr[:, 1:], axis=1)

array([ True, False,  True, False,  True])
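For boolean data, `np.logical_and.reduce` gives the same answer as the `all` method. A small sketch on a fixed integer array (chosen here for illustration) makes the comparison reproducible:

```python
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [4, 3, 2, 1],
                [1, 3, 2, 4]])

ascending = arr[:, :-1] < arr[:, 1:]  # compare each element with its right neighbor

by_reduce = np.logical_and.reduce(ascending, axis=1)
by_all = ascending.all(axis=1)

print(by_reduce.tolist())  # [True, False, False]
print(by_all.tolist())     # [True, False, False]
```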

`accumulate` is related to `reduce` like `cumsum` is related to `sum`. It produces an array of
the same size with the intermediate “accumulated” values:

In [64]:
arr = np.arange(15).reshape((3, 5))

np.add.accumulate(arr, axis=1)

array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]])

In [65]:
arr = np.arange(3).repeat([1, 2, 2])
arr

array([0, 1, 1, 2, 2])

In [66]:
np.multiply.outer(arr, np.arange(5))

array([[0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 2, 4, 6, 8],
       [0, 2, 4, 6, 8]])

In [67]:
x, y = np.random.randn(3, 4), np.random.randn(5)

result = np.subtract.outer(x, y)

result.shape

(3, 4, 5)

The last method, `reduceat`, performs a “local reduce,” in essence an array `groupby`
operation in which slices of the array are aggregated together. It accepts a sequence of
“bin edges” that indicate how to split and aggregate the values:

In [68]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [69]:
np.add.reduceat(arr, [0, 5, 8])

array([10, 18, 17])

In [70]:
arr[0:5].sum(), arr[5:8].sum(), arr[8:].sum()

(10, 18, 17)

In [71]:
# As with the other methods, you can pass an axis argument:
arr = np.multiply.outer(np.arange(4), np.arange(5))
arr

array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12]])

In [72]:
np.add.reduceat(arr, [0, 2, 4], axis=1)

array([[ 0,  0,  0],
       [ 1,  5,  4],
       [ 2, 10,  8],
       [ 3, 15, 12]])

### Writing New ufuncs in Python

There are a number of facilities for creating your own NumPy ufuncs. The most general is to use the NumPy C API, but that is beyond the scope of this book. In this
section, we will look at pure Python ufuncs.

`numpy.frompyfunc` accepts a Python function along with a specification for the number of inputs and outputs. For example, a simple function that adds element-wise would be specified as

In [73]:
def add_elements(x, y):
    return x + y

add_them = np.frompyfunc(add_elements, 2, 1)

add_them(np.arange(8), np.arange(8))

array([0, 2, 4, 6, 8, 10, 12, 14], dtype=object)

Functions created using `frompyfunc` always return arrays of Python objects, which
can be inconvenient. Fortunately, there is an alternative (but slightly less featureful)
function, `numpy.vectorize`, that allows you to specify the output type

In [74]:
add_them = np.vectorize(add_elements, otypes=[np.float64])

add_them(np.arange(8), np.arange(8))

array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14.])

These functions provide a way to create ufunc-like functions, but they are very slow
because they require a Python function call to compute each element, which is a lot
slower than NumPy’s C-based ufunc loops:

In [75]:
arr = np.random.randn(10000)

In [76]:
%timeit add_them(arr, arr)

2.64 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [77]:
%timeit np.add(arr, arr)

5.1 µs ± 687 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


## A.5 Structured and Record Arrays

A structured array is an `ndarray` in which each element can be thought of as
representing a struct in C (hence the “structured” name) or a row in a SQL table
with multiple named fields:

In [78]:
dtype = [('x', np.float64), ('y', np.int32)]

sarr = np.array([(1.5, 6), (np.pi, -2)], dtype=dtype)

sarr

array([(1.5       ,  6), (3.14159265, -2)],
      dtype=[('x', '<f8'), ('y', '<i4')])
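Elements and fields of a structured array can then be accessed by position and by name. The following is a minimal sketch that rebuilds the same array so the block runs on its own:

```python
import numpy as np

dtype = [('x', np.float64), ('y', np.int32)]
sarr = np.array([(1.5, 6), (np.pi, -2)], dtype=dtype)

first = sarr[0]   # one record, similar to a row
xs = sarr['x']    # all values of the field named 'x'
y0 = sarr[0]['y'] # a single field of a single record

print(first)  # (1.5, 6)
print(xs)     # [1.5        3.14159265]
print(y0)     # 6
```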

## A.6 More About Sorting

Like Python’s built-in list, the ndarray sort instance method is an in-place sort,
meaning that the array contents are rearranged without producing a new array

In [79]:
arr = np.random.randn(6)

arr.sort()

arr

array([-1.90893859, -0.0998147 ,  0.35584976,  0.43531326,  0.67911044,
        1.2145555 ])

On the other hand, `numpy.sort` creates a new, sorted copy of an array. Otherwise, it
accepts the same arguments (such as kind) as `ndarray.sort`:

In [80]:
arr = np.random.randn(5)

arr

array([-0.2625154 ,  0.02812367,  0.87458306,  0.01972901,  1.10790981])

In [81]:
np.sort(arr)

array([-0.2625154 ,  0.01972901,  0.02812367,  0.87458306,  1.10790981])

In [82]:
arr

array([-0.2625154 ,  0.02812367,  0.87458306,  0.01972901,  1.10790981])

All of these sort methods take an axis argument for sorting the sections of data along
the passed axis independently:

In [83]:
arr = np.random.randn(3, 5)

arr

array([[ 0.3825507 , -0.5565042 ,  0.78636179, -0.38964038, -0.56135084],
       [-0.1724924 ,  0.24018256, -1.77194708,  0.82789246, -0.77915441],
       [-1.00592983,  1.24347796,  0.70024873,  0.64729809,  0.31218427]])

In [84]:
arr.sort(axis=1)

arr

array([[-0.56135084, -0.5565042 , -0.38964038,  0.3825507 ,  0.78636179],
       [-1.77194708, -0.77915441, -0.1724924 ,  0.24018256,  0.82789246],
       [-1.00592983,  0.31218427,  0.64729809,  0.70024873,  1.24347796]])

You may notice that none of the sort methods have an option to sort in descending
order. This is not a problem in practice because array slicing produces views, so
reversing a sorted array neither copies data nor requires any computational work.
Many Python users are familiar with the “trick” that for a list `values`, `values[::-1]`
returns a list in reverse order. The same is true for ndarrays:

In [85]:
arr[:, ::-1]

array([[ 0.78636179,  0.3825507 , -0.38964038, -0.5565042 , -0.56135084],
       [ 0.82789246,  0.24018256, -0.1724924 , -0.77915441, -1.77194708],
       [ 1.24347796,  0.70024873,  0.64729809,  0.31218427, -1.00592983]])
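Combining `np.sort` with a reversed slice gives a descending sort. The sketch below uses small integer arrays, made up for illustration, and checks that the reversed result is a view on the sorted data:

```python
import numpy as np

values = np.array([5, 0, 1, 3, 2])

ascending = np.sort(values)   # new sorted copy
descending = ascending[::-1]  # reversed view, no data copied

print(descending.tolist())                       # [5, 3, 2, 1, 0]
print(np.shares_memory(ascending, descending))   # True

# For a two-dimensional array, reverse along the sorted axis
arr2d = np.array([[3, 1, 2], [9, 7, 8]])
rows_desc = np.sort(arr2d, axis=1)[:, ::-1]
print(rows_desc.tolist())  # [[3, 2, 1], [9, 8, 7]]
```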

### Indirect Sorts: argsort and lexsort

Given a key or keys (an array of values or multiple arrays of values), you may wish to
obtain an array of integer indices (I refer to them colloquially as indexers) that tells
you how to reorder the data to be in sorted order. Two methods for this are `argsort`
and `numpy.lexsort`. As an example:

In [86]:
values = np.array([5, 0, 1, 3, 2])

indexer = values.argsort()

indexer

array([1, 2, 4, 3, 0])

In [87]:
values[indexer]

array([0, 1, 2, 3, 5])

In [88]:
# As a more complicated example, this code reorders a 
# two-dimensional array by its first row:
arr = np.random.randn(3, 5)

arr[0] = values

arr

array([[ 5.        ,  0.        ,  1.        ,  3.        ,  2.        ],
       [-1.23442883,  0.91624336,  0.3461365 , -0.92729173,  0.38947689],
       [-0.99011385, -1.6393558 ,  0.38023146,  0.51697934, -1.13637738]])

In [89]:
arr[:, arr[0].argsort()]

array([[ 0.        ,  1.        ,  2.        ,  3.        ,  5.        ],
       [ 0.91624336,  0.3461365 ,  0.38947689, -0.92729173, -1.23442883],
       [-1.6393558 ,  0.38023146, -1.13637738,  0.51697934, -0.99011385]])

`lexsort` is similar to `argsort`, but it performs an indirect lexicographical sort on
multiple key arrays. Suppose we wanted to sort some data identified by first and last
names:

In [90]:
first_name = np.array(['Bob', 'Jane', 'Steve', 'Bill', 'Barbara'])
last_name = np.array(['Jones', 'Arnold', 'Arnold', 'Jones', 'Walters'])

sorter = np.lexsort((first_name, last_name))

sorter

array([1, 2, 3, 0, 4])

In [91]:
for last, first in zip(last_name[sorter], first_name[sorter]):
    print(last, first)

Arnold Jane
Arnold Steve
Jones Bill
Jones Bob
Walters Barbara


* `pandas` methods like Series’s and DataFrame’s `sort_values`
are implemented with variants of these functions (which must also
take into account missing values).
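The order of keys passed to `lexsort` can be confusing at first: the last key in the sequence is the primary sort key. A small numeric sketch, with arrays made up for illustration, shows this:

```python
import numpy as np

primary = np.array([2, 1, 2, 1])
secondary = np.array([1, 9, 0, 3])

# The LAST key passed (primary) is sorted first; ties are broken by secondary
order = np.lexsort((secondary, primary))

print(order.tolist())             # [3, 1, 2, 0]
print(primary[order].tolist())    # [1, 1, 2, 2]
print(secondary[order].tolist())  # [3, 9, 0, 1]
```

This matches the name example above, where `last_name` was passed last and therefore determined the main ordering.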

### Alternative Sort Algorithms

A stable sorting algorithm preserves the relative position of equal elements. This can
be especially important in indirect sorts where the relative ordering is meaningful:

In [92]:
values = np.array(['2:first', '2:second', '1:first', '1:second', '1:third'])

key = np.array([2, 2, 1, 1, 1])

indexer = key.argsort(kind='mergesort')

indexer

array([2, 3, 4, 0, 1])

In [93]:
values.take(indexer)

array(['1:first', '1:second', '1:third', '2:first', '2:second'],
      dtype='<U8')

### numpy.searchsorted: Finding Elements in a Sorted Array

`searchsorted` is an array method that performs a binary search on a sorted array,
returning the location in the array where the value would need to be inserted to
maintain sortedness

In [94]:
arr = np.array([0, 1, 7, 12, 15])

arr.searchsorted(9)

3

In [95]:
# You can also pass an array of values to get an array of indices back
arr.searchsorted([0, 8, 11, 16])

array([0, 3, 3, 5])

You might have noticed that `searchsorted` returned 0 for the 0 element. This is
because the default behavior is to return the index at the left side of a group of equal
values:

In [96]:
arr = np.array([0, 0, 0, 1, 1, 1, 1])

arr.searchsorted([0, 1])

array([0, 3])

In [97]:
arr.searchsorted([0, 1], side='right')

array([3, 7])

As another application of searchsorted, suppose we had an array of values between
0 and 10,000, and a separate array of “bucket edges” that we wanted to use to bin the
data:

In [98]:
data = np.floor(np.random.uniform(0, 10000, size=50))

bins = np.array([0, 100, 1000, 5000, 10000])

data

array([2747.,  123.,  277.,   36., 2110., 7249., 2598., 9808., 4277.,
       7560., 8986., 7430., 8287., 1414., 8443., 1894., 2922., 7259.,
       5045., 2661., 8543., 8686.,  485., 7402., 7591., 7393., 2257.,
       9751., 4723., 8059., 3257.,  357., 8737., 2253., 1747.,  739.,
       1736., 2554.,   86., 8820., 5611., 4897., 2467., 4438., 3683.,
       5783., 5179., 5300.,  897., 6529.])

In [99]:
# To then get a labeling of which interval each data point belongs to (where 1 would
# mean the bucket (0, 100], because the default side='left' puts a value equal to an
# edge into the lower bucket), we can simply use searchsorted
labels = bins.searchsorted(data)

labels

array([3, 2, 2, 1, 3, 4, 3, 4, 3, 4, 4, 4, 4, 3, 4, 3, 3, 4, 4, 3, 4, 4,
       2, 4, 4, 4, 3, 4, 3, 4, 3, 2, 4, 3, 3, 2, 3, 3, 1, 4, 4, 3, 3, 3,
       3, 4, 4, 4, 2, 4])

In [100]:
import pandas as pd

pd.Series(data).groupby(labels).mean()

1      61.000000
2     479.666667
3    2875.526316
4    7541.347826
dtype: float64
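How values that fall exactly on a bucket edge are labeled depends on the `side` argument. The sketch below uses a few hand-picked values (an assumption for illustration, not the random data above) to show the difference:

```python
import numpy as np

bins = np.array([0, 100, 1000, 5000, 10000])
data = np.array([0, 50, 100, 999, 1000])

# Default side='left': a value equal to an edge goes to the lower bucket,
# so label 1 covers (0, 100] and the value 0 gets label 0
left_labels = bins.searchsorted(data)

# side='right': a value equal to an edge goes to the upper bucket,
# so label 1 covers [0, 100)
right_labels = bins.searchsorted(data, side='right')

print(left_labels.tolist())   # [0, 1, 1, 2, 2]
print(right_labels.tolist())  # [1, 1, 2, 2, 3]
```

For continuous data the choice rarely matters, but for integer-valued data that can land exactly on an edge it changes the group assignment.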

## A.7 Writing Fast NumPy Functions with Numba

To introduce Numba, let’s consider a pure Python function that computes the
expression `(x - y).mean()` using a `for` loop:

In [101]:
def mean_distance(x, y):
    nx = len(x)
    result = 0.0
    count = 0
    for i in range(nx):
        result += x[i] - y[i]
        count += 1
        
    return result / count

In [102]:
x, y = np.random.randn(10000000), np.random.randn(10000000)

In [103]:
%timeit mean_distance(x, y)

4.91 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [104]:
%timeit (x - y).mean()

63.1 ms ± 7.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


With the timings shown above, the NumPy version is roughly 80 times faster. We can turn
this function into a compiled Numba function using the `numba.jit` function:

In [105]:
import numba as nb

In [106]:
numba_mean_distance = nb.jit(mean_distance)

In [107]:
# or we could've written:

@nb.jit
def mean_distance(x, y):
    nx = len(x)
    result = 0.0
    count = 0
    for i in range(nx):
        result += x[i] - y[i]
        count += 1
        
    return result / count

In [108]:
%timeit numba_mean_distance(x, y)

16.5 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Numba cannot compile arbitrary Python code, but it supports a significant subset of
pure Python that is most useful for writing numerical algorithms.