In [1]:
import numpy as np


# Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression x if condition else y. Suppose we wanted to take a value from xarr whenever the corresponding value in cond is True, and otherwise take the value from yarr.


In [2]:
xarr = np.arange(start=1.1, step=0.1, stop=1.6)
yarr = np.arange(start=2.1, step=0.1, stop=2.6)
cond = np.array([True, False, True, True, False])
xarr, yarr, cond


(array([1.1, 1.2, 1.3, 1.4, 1.5]),
 array([2.1, 2.2, 2.3, 2.4, 2.5]),
 array([ True, False,  True,  True, False]))

With np.where you can write this very concisely:


In [3]:
result = np.where(cond, xarr, yarr)
result


array([1.1, 2.2, 1.3, 1.4, 2.5])
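
For comparison, the same selection can be written as a plain Python list comprehension over the ternary expression. The sketch below rebuilds the three arrays from explicit lists so that it is self-contained; the names `by_loop` and `by_where` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

# Element-by-element ternary expression, evaluated in a Python-level loop
by_loop = [x if c else y for x, y, c in zip(xarr, yarr, cond)]

# Vectorized equivalent: one call, no explicit loop
by_where = np.where(cond, xarr, yarr)

# Both approaches select the same elements
print(np.array_equal(by_loop, by_where))
```

The list comprehension does its work in the Python interpreter, so it tends to be slow for large arrays, and it does not extend to multidimensional arrays without extra nesting. np.where handles both cases directly.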

The second and third arguments to np.where don’t need to be arrays; one or both of them can be scalars. A typical use of np.where in data analysis is to produce a new array of values based on another array. Suppose you had a matrix of randomly generated data and you wanted to replace all positive values with 2 and all negative values with -2. With np.where this takes a single call:


In [4]:
m = np.random.randn(4, 4)
m


array([[ 0.60939168,  1.08276868,  0.98268378, -0.62907132],
       [ 0.1334483 ,  0.74168218, -0.95300511,  1.09515104],
       [ 1.19786805, -0.65440392, -1.00390772,  0.30685435],
       [ 0.6359652 ,  1.0193255 , -0.43315041,  2.35661525]])

In [5]:
result = np.where(m > 0, 2, -2)
result


array([[ 2,  2,  2, -2],
       [ 2,  2, -2,  2],
       [ 2, -2, -2,  2],
       [ 2,  2, -2,  2]])

You can combine scalars and arrays when using np.where. For example, I can replace all positive values in m with the constant 2 like so:


In [6]:
result = np.where(m > 0, 2, m)
result


array([[ 2.        ,  2.        ,  2.        , -0.62907132],
       [ 2.        ,  2.        , -0.95300511,  2.        ],
       [ 2.        , -0.65440392, -1.00390772,  2.        ],
       [ 2.        ,  2.        , -0.43315041,  2.        ]])
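
The two outputs above differ in dtype: two scalar branches give an integer array, while mixing a scalar with a float array gives a float array. The following sketch shows this on a small fixed array and also nests np.where for a three-way choice. The array `data` and the labels "high", "low", and "neg" are made up for illustration.

```python
import numpy as np

data = np.array([[0.5, -1.5],
                 [-0.25, 2.0]])

# Two integer scalars: the result has an integer dtype
signs = np.where(data > 0, 2, -2)

# Integer scalar mixed with a float array: the result is a float array
capped = np.where(data > 0, 2, data)

# Nested calls express a three-way choice (illustrative labels)
labels = np.where(data > 1, "high", np.where(data > 0, "low", "neg"))

print(signs.dtype, capped.dtype)
print(labels)
```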

# Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array, or about the data along an axis, is accessible as methods of the array class. You can use aggregations (often called reductions) like sum, mean, and std (standard deviation) either by calling the array instance method or by using the top-level NumPy function.

Here I generate some normally distributed random data and compute some aggregate statistics:


In [7]:
arr = np.random.randn(5, 4)
arr


array([[ 0.38722757,  1.28912749,  0.936109  ,  0.73506686],
       [-0.46547212,  0.05628195,  0.77696951, -0.31563924],
       [ 1.0002844 , -1.10868044,  0.54102666,  0.61956304],
       [ 0.45451437,  0.90369383, -2.21420719,  0.131808  ],
       [-0.67456163, -1.64932903,  1.26294601, -0.52103275]])

In [8]:
total = arr.sum()  # named `total` so the built-in `sum` is not shadowed
total


2.145696315013355

In [9]:
sum_a = arr.sum(axis=0)
sum_b = np.sum(a=arr, axis=0)
sum_a, sum_b


(array([ 0.70199259, -0.50890619,  1.30284399,  0.64976592]),
 array([ 0.70199259, -0.50890619,  1.30284399,  0.64976592]))

In [10]:
mean_a = arr.mean(axis=1)
mean_b = np.mean(a=arr, axis=1)
mean_a, mean_b


(array([ 0.83688273,  0.01303503,  0.26304842, -0.18104775, -0.39549435]),
 array([ 0.83688273,  0.01303503,  0.26304842, -0.18104775, -0.39549435]))
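
Because the random values make the axis argument hard to check by eye, here is a minimal sketch on a small integer array. The name `grid` is illustrative only.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_sums = grid.sum(axis=0)    # collapses the rows: one value per column
row_means = grid.mean(axis=1)  # collapses the columns: one value per row

print(col_sums)    # [5 7 9]
print(row_means)   # [2. 5.]
print(grid.sum())  # 21, the sum over every element
```

Passing axis=0 reduces down the rows and axis=1 reduces across the columns, so the result always has one dimension fewer than the input.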

Other methods like cumsum and cumprod do not aggregate, instead producing an array of the intermediate results:


In [11]:
arr = np.arange(start=0, stop=8)
arr


array([0, 1, 2, 3, 4, 5, 6, 7])

In [12]:
acum = arr.cumsum()
acum


array([ 0,  1,  3,  6, 10, 15, 21, 28])
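
On multidimensional arrays, accumulation functions such as cumsum and cumprod return an array of the same shape, with the partial results computed along the given axis. A small sketch, using an illustrative 3 x 3 array named `arr2d`:

```python
import numpy as np

arr2d = np.array([[0, 1, 2],
                  [3, 4, 5],
                  [6, 7, 8]])

# Running sum down each column (along axis 0)
down_rows = arr2d.cumsum(axis=0)
# rows: [0, 1, 2], [3, 5, 7], [9, 12, 15]

# Running product across each row (along axis 1)
across_cols = arr2d.cumprod(axis=1)
# rows: [0, 0, 0], [3, 12, 60], [6, 42, 336]

print(down_rows)
print(across_cols)
```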

| Method           | Description                                                                                                          |
| :--------------- | :------------------------------------------------------------------------------------------------------------------- |
| `sum`            | Sum of all the elements in the array or along an axis; zero-length arrays have sum 0                                 |
| `mean`           | Arithmetic mean; zero-length arrays have `NaN` mean                                                                  |
| `std, var`       | Standard deviation and variance, respectively, with optional degrees of freedom adjustment (default denominator `n`) |
| `min, max`       | Minimum and maximum                                                                                                  |
| `argmin, argmax` | Indices of minimum and maximum elements, respectively                                                                |
| `cumsum`         | Cumulative sum of elements starting from 0                                                                           |
| `cumprod`        | Cumulative product of elements starting from 1                                                                       |
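
A few of the table entries are easier to see in code. The sketch below uses an illustrative array named `values` to show the degrees-of-freedom adjustment for std, the index-returning argmin and argmax, and the sum of a zero-length array.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = values.std()           # default ddof=0: denominator n
sample_std = values.std(ddof=1)  # ddof=1: denominator n - 1

largest_at = values.argmax()     # index of the maximum, not the maximum itself
smallest_at = values.argmin()    # index of the minimum

empty_sum = np.array([]).sum()   # zero-length arrays sum to 0

print(pop_std, sample_std)
print(largest_at, smallest_at)
print(empty_sum)
```

With the default denominator `n`, the standard deviation here is 2.0; with `ddof=1` it is slightly larger because the sum of squared deviations is divided by 7 instead of 8.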


# Methods for Boolean Arrays

Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, sum is often used as a means of counting True values in a boolean array:


In [13]:
arr = np.random.randn(10)
arr


array([ 0.25610137,  1.080277  ,  0.70012291, -0.44552899, -0.09809048,
        0.60182069, -1.93348683,  0.37037748, -1.55542559,  0.12206625])

In [14]:
count_a = (arr > 0).sum()
count_b = np.sum(a=arr > 0)
count_a, count_b


(6, 6)
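
The same coercion makes mean useful on boolean arrays: it returns the fraction of True values. A brief sketch, with an illustrative array `sample` and with np.count_nonzero as an alternative way to count:

```python
import numpy as np

sample = np.array([0.3, -1.2, 0.8, 0.0, -0.5, 2.1])
mask = sample > 0

n_positive = mask.sum()            # each True counts as 1
n_alt = np.count_nonzero(mask)     # same count, without the coercion step
share_positive = mask.mean()       # fraction of elements that are True

print(n_positive, n_alt, share_positive)
```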

There are two additional methods, any and all, that are especially useful for boolean arrays. any tests whether at least one value in an array is True, while all checks whether every value is True:


In [15]:
bools = arr > 0
bools


array([ True,  True,  True, False, False,  True, False,  True, False,
        True])

In [16]:
bools.any()


True

In [17]:
bools.all()


False
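
Both methods also accept an axis argument, and they work on non-boolean arrays, where nonzero elements are treated as True. A minimal sketch, using illustrative arrays named `flags` and `numbers`:

```python
import numpy as np

flags = np.array([[True, False, True],
                  [True, True, True]])

any_per_row = flags.any(axis=1)  # is at least one value True in each row?
all_per_row = flags.all(axis=1)  # is every value True in each row?
all_per_col = flags.all(axis=0)  # is every value True in each column?

# Non-boolean input: 0 counts as False, any other number as True
numbers = np.array([0, 3, -1])
has_nonzero = numbers.any()
all_nonzero = numbers.all()

print(any_per_row, all_per_row, all_per_col)
print(has_nonzero, all_nonzero)
```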