# Just Numpy

Let's start with a plausible problem. We have a dataset of all daily temperatures measured at Newark since 1893 and we want to analyze it. First, let's try that with a Python list.

In [1]:
temperatures = []
with open("data/newark-temperature-avg.txt") as file:
    for line in file.readlines():
        temperatures.append(float(line))

len(temperatures), temperatures[:10], temperatures[-10:]

(42019,
 [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
 [43.0, 47.0, 49.0, 52.0, 48.0, 52.0, 62.0, 68.0, 59.0, 47.0])

Much of the record is missing, as we can see by counting NaNs:

In [2]:
import math
numbad = 0
for x in temperatures:
    if math.isnan(x):
        numbad += 1

numbad / len(temperatures)

0.887717461148528

We have a more complete dataset of daily minimum and maximum temperatures. It's not as accurate, but we can impute the missing averages by averaging the minimum and maximum.

In [3]:
min_temperatures = []
with open("data/newark-temperature-min.txt") as file:
    for line in file.readlines():
        min_temperatures.append(float(line))

max_temperatures = []
with open("data/newark-temperature-max.txt") as file:
    for line in file.readlines():
        max_temperatures.append(float(line))

(len(min_temperatures), min_temperatures[:10], min_temperatures[-10:],
 len(max_temperatures), max_temperatures[:10], max_temperatures[-10:])

(42019,
 [26.0, 34.0, 17.0, 13.0, 17.0, 13.0, 12.0, 15.0, 11.0, 4.0],
 [36.0, 45.0, 45.0, 44.0, 39.0, 37.0, 52.0, 65.0, 46.0, nan],
 42019,
 [52.0, 43.0, 32.0, 23.0, 27.0, 30.0, 28.0, 28.0, 32.0, 32.0],
 [50.0, 50.0, 54.0, 59.0, 58.0, 68.0, 73.0, 73.0, 67.0, nan])

While we fill in the missing values, let's also measure how long it takes.

In [4]:
%%timeit

imputed_temperatures = []
for average, minimum, maximum in zip(temperatures, min_temperatures, max_temperatures):
    if math.isnan(average):
        imputed_temperatures.append(0.5 * (minimum + maximum))
    else:
        imputed_temperatures.append(average)

18.6 ms ± 92.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Now let's do the same thing in Numpy, again measuring the time.

In [5]:
import numpy

temperatures = numpy.array(temperatures)
min_temperatures = numpy.array(min_temperatures)
max_temperatures = numpy.array(max_temperatures)

In [9]:
%%timeit

missing = numpy.isnan(temperatures)
imputed_temperatures = numpy.empty(len(temperatures), dtype=numpy.float64)
imputed_temperatures[missing] = 0.5 * (min_temperatures[missing] + max_temperatures[missing])
imputed_temperatures[~missing] = temperatures[~missing]

187 µs ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Or, more concisely, with `numpy.where`:

In [15]:
%%timeit

imputed_temperatures = numpy.where(
    # condition                # if true                                    # if false
    numpy.isnan(temperatures), 0.5 * (min_temperatures + max_temperatures), temperatures)

73.7 µs ± 596 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


We see that Numpy can be much faster than Python loops: by the timings above, a factor of about 100 for the masked assignment and about 250 for `numpy.where`, though I have seen as much as 1000. (It depends on the application.) The way you tell it what to do is also very different, which may be good or bad: it may read more naturally, or it may not.
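The temperature files aren't reproduced here, so the following is only a sketch of the same comparison on synthetic data. The random values, the seed, and the 89% missing fraction are assumptions chosen to resemble the dataset above, and the timings will differ from machine to machine.

```python
import math
import timeit

import numpy

# Synthetic stand-ins for the Newark data (assumed values, not the real record).
rng = numpy.random.default_rng(0)
n = 42019
min_temperatures = rng.uniform(0, 50, n)
max_temperatures = min_temperatures + rng.uniform(5, 30, n)
temperatures = 0.5 * (min_temperatures + max_temperatures) + rng.normal(0, 1, n)
temperatures[rng.uniform(0, 1, n) < 0.89] = numpy.nan   # drop about 89% of the averages

# Plain Python lists for the loop version, so we time the loop and not the conversion.
avg_list = temperatures.tolist()
min_list = min_temperatures.tolist()
max_list = max_temperatures.tolist()

def impute_loop():
    # Pure-Python version: one `if` per element.
    out = []
    for average, minimum, maximum in zip(avg_list, min_list, max_list):
        if math.isnan(average):
            out.append(0.5 * (minimum + maximum))
        else:
            out.append(average)
    return out

def impute_numpy():
    # Array version: the `if` becomes a boolean array handed to numpy.where.
    return numpy.where(numpy.isnan(temperatures),
                       0.5 * (min_temperatures + max_temperatures),
                       temperatures)

loop_time = timeit.timeit(impute_loop, number=10) / 10
numpy_time = timeit.timeit(impute_numpy, number=10) / 10
print(f"loop: {loop_time * 1e3:.2f} ms, numpy: {numpy_time * 1e6:.1f} us, "
      f"ratio: {loop_time / numpy_time:.0f}x")
```

Both functions return the same numbers; only the way the work is expressed (and where the loop runs) differs.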

One thing we saw was a preoccupation with data types, which is unusual for Python.

In [20]:
numpy.zeros(5, dtype=numpy.float64)

array([0., 0., 0., 0., 0.])

In [21]:
numpy.zeros(5, dtype=numpy.int32)

array([0, 0, 0, 0, 0], dtype=int32)

In [22]:
numpy.zeros(5, dtype=numpy.bool_)

array([False, False, False, False, False])

In [23]:
numpy.zeros(5, dtype="S3")

array([b'', b'', b'', b'', b''], dtype='|S3')

This is where a large part of Numpy's speed comes from. When Python churns, a lot of that time is spent checking and re-checking data types, which in a compiled language like C++ are checked once and for all in the compilation step.
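One way to see what a fixed data type buys us is to look at how many bytes each element occupies. This is a small sketch using only standard array attributes (`dtype`, `itemsize`, `nbytes`); the length of 5 just mirrors the cells above, and the `mixed` list is a made-up contrast.

```python
import numpy

for dtype in (numpy.float64, numpy.int32, numpy.bool_, "S3"):
    array = numpy.zeros(5, dtype=dtype)
    # itemsize is bytes per element; nbytes is the size of the whole data buffer
    print(array.dtype, array.itemsize, array.nbytes)

# A Python list can hold anything, so the type of each item
# has to be looked up at run time, every time it is used.
mixed = [0.0, 0, False, b""]
print([type(x).__name__ for x in mixed])
```

Because every element of an array has the same size and meaning, compiled code can step through the buffer without asking what each item is.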

Numpy is a suite of compiled functions applied to data with predetermined types. When you're using Numpy properly, you'll have very few `for` loops and `if` statements in your code: the Python code acts as a high-level director, while Numpy does its looping in compiled code.

The Numpy library consists mainly of one class, `numpy.ndarray`, and operations on it. This is an n-dimensional array of contiguous data. Some operations change that data or make new arrays, but many operations merely change our interpretation of the data. The latter are the fastest.

In [28]:
array = numpy.arange(24, dtype=numpy.float64)    # 64-bit floating point numbers
array

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23.])

In [42]:
array.view(numpy.int64)

array([                  0, 4607182418800017408, 4611686018427387904,
       4613937818241073152, 4616189618054758400, 4617315517961601024,
       4618441417868443648, 4619567317775286272, 4620693217682128896,
       4621256167635550208, 4621819117588971520, 4622382067542392832,
       4622945017495814144, 4623507967449235456, 4624070917402656768,
       4624633867356078080, 4625196817309499392, 4625478292286210048,
       4625759767262920704, 4626041242239631360, 4626322717216342016,
       4626604192193052672, 4626885667169763328, 4627167142146473984])

In [35]:
array.reshape(6, 4)

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.],
       [12., 13., 14., 15.],
       [16., 17., 18., 19.],
       [20., 21., 22., 23.]])

In [40]:
array.reshape(6, 4, order="f")    # Fortran order vs C order: how a 1D sequence covers an nD block

array([[ 0.,  6., 12., 18.],
       [ 1.,  7., 13., 19.],
       [ 2.,  8., 14., 20.],
       [ 3.,  9., 15., 21.],
       [ 4., 10., 16., 22.],
       [ 5., 11., 17., 23.]])

This interpretation has only two parameters:

   * `dtype` (data type, including endianness): how bytes are interpreted as numbers
   * `shape` and `order`: how those numbers are arranged in an n-dimensional grid

Mistakes in interpretation are usually not subtle, so just be sure to _look_ at your data.
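To make "merely change our interpretation" concrete, here is a sketch that checks whether the reinterpreted arrays share memory with the original. It uses `numpy.shares_memory` and the `strides` attribute, which the cells above don't mention; the variable names are only for illustration.

```python
import numpy

array = numpy.arange(24, dtype=numpy.float64)

as_ints = array.view(numpy.int64)             # same bytes, read as 64-bit integers
as_grid = array.reshape(6, 4)                 # same bytes, arranged as 6 rows of 4
as_fortran = array.reshape(6, 4, order="F")   # same numbers, filled column by column

# Both reinterpretations share memory with the original: nothing was copied.
print(numpy.shares_memory(array, as_ints), numpy.shares_memory(array, as_grid))

# Writing through one interpretation is visible through the other.
as_grid[0, 1] = 100.0
print(array[1])

# shape says how many steps there are along each axis,
# strides says how many bytes each step skips.
print(as_grid.shape, as_grid.strides)
print(as_fortran[0, :])
```

Since no data moves, these operations cost about the same for 24 numbers as for 24 million.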

Numpy arrays can be used in mathematical formulae, but instead of computing one value, they compute a whole array of values, element by element.

In [45]:
a = numpy.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [46]:
a + 100

array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])

In [47]:
b = numpy.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

In [48]:
a + b

array([ 0. ,  2.1,  4.2,  6.3,  8.4, 10.5, 12.6, 14.7, 16.8, 18.9])

In [49]:
a**2

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

Generally, you can imagine a table of data to compute: the columns represent meaningful quantities (often named) while the rows represent anonymous instances.

In [71]:
a = numpy.random.uniform(5, 10, 10000)
b = numpy.random.uniform(10, 20, 10000)
c = numpy.random.uniform(-0.1, 0.1, 10000)
len(a)

10000

A conventional Python approach would be to compute the formula on each instance, one after another.

In [72]:
roots1 = []
for ai, bi, ci in zip(a, b, c):
    roots1.append((-bi + math.sqrt(bi**2 - 4*ai*ci)) / (2*ai))

The Numpy approach computes each step of the formula on all instances before moving on to the next step.

In [78]:
roots2 = (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

It is equivalent to:

In [82]:
# (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)
tmp1 = -b
tmp2 = b**2
tmp3 = 4*a
tmp4 = tmp3*c
tmp5 = tmp2 - tmp4
tmp6 = numpy.sqrt(tmp5)
tmp7 = tmp1 + tmp6
tmp8 = 2*a
roots3 = tmp7 / tmp8

One strange (but useful!) consequence of applying mathematical operations elementwise is that comparisons are elementwise, too. Suppose we want to verify that the `roots1` computed in the Python loop match the `roots2` and `roots3` computed by Numpy.

In [86]:
roots1 == roots2

array([ True,  True,  True, ...,  True,  True, False])

In [87]:
roots2 == roots3

array([ True,  True,  True, ...,  True,  True,  True])

When you want to check that _all_ of the elements are equal, I'd use `.all()`.

In [98]:
(roots2 == roots3).all()

True

In [99]:
(roots1 == roots2).all()

False

Why is that? Didn't we just see that `roots1 == roots2` is `True, True, True, ...`?

In [101]:
(roots1 == roots2).any()

True

In [103]:
(roots1 == roots2).sum(), len(roots1)

(9870, 10000)

Which ones fail?

In [104]:
numpy.nonzero(roots1 != roots2)

(array([  35,   65,  112,  227,  245,  287,  366,  369,  392,  416,  582,
         696,  840,  941,  954,  959, 1206, 1270, 1319, 1380, 1410, 1500,
        1537, 1636, 1727, 1757, 1822, 1935, 2178, 2189, 2305, 2406, 2453,
        2471, 2531, 2539, 2550, 2795, 2892, 3008, 3169, 3173, 3288, 3326,
        3362, 3463, 3495, 3605, 3708, 3763, 3956, 4106, 4144, 4190, 4296,
        4370, 4391, 4519, 4623, 4711, 4798, 5039, 5093, 5153, 5177, 5219,
        5285, 5331, 5355, 5496, 5547, 5587, 5636, 5654, 5938, 6041, 6144,
        6207, 6221, 6222, 6242, 6309, 6358, 6378, 6843, 6869, 6944, 7045,
        7229, 7248, 7292, 7356, 7402, 7456, 7666, 7730, 7863, 8041, 8122,
        8153, 8154, 8254, 8258, 8321, 8374, 8390, 8413, 8468, 8491, 8509,
        8511, 8531, 8755, 8766, 8784, 8790, 8876, 8999, 9059, 9162, 9430,
        9488, 9494, 9607, 9637, 9665, 9798, 9813, 9937, 9999]),)

In [105]:
roots1[35], roots2[35]

(-0.002851809051310325, -0.0028518090513101864)

In [106]:
roots1[35] - roots2[35]

-1.3877787807814457e-16

Numpy uses different routines to do its calculations, so results might not be exactly the same. We don't care about last-digit differences, so we set a tolerance.

In [109]:
(abs(roots1 - roots2) < 1e-15).all()

True
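Numpy also has functions for this kind of comparison, `numpy.isclose` (elementwise) and `numpy.allclose` (a single answer). The sketch below regenerates random inputs rather than reusing the arrays above, so the seed and the tolerances `rtol=1e-9` and `atol=1e-12` are assumptions, not values taken from this notebook.

```python
import math

import numpy

rng = numpy.random.default_rng(12345)   # assumed seed, for repeatability only
a = rng.uniform(5, 10, 10000)
b = rng.uniform(10, 20, 10000)
c = rng.uniform(-0.1, 0.1, 10000)

# One instance at a time, in a Python loop...
roots_loop = numpy.array([(-bi + math.sqrt(bi**2 - 4*ai*ci)) / (2*ai)
                          for ai, bi, ci in zip(a, b, c)])
# ...and one step at a time, on whole arrays.
roots_numpy = (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

# Both functions test |x - y| <= atol + rtol * |y| for each pair of elements.
close = numpy.isclose(roots_loop, roots_numpy, rtol=1e-9, atol=1e-12)
print(close.sum(), len(close))
print(numpy.allclose(roots_loop, roots_numpy, rtol=1e-9, atol=1e-12))
```

The absolute tolerance matters here because some roots are very close to zero, where a purely relative comparison would be too strict.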

The upshot of this is that you can perform a calculation with the same expression on Numpy arrays as on Python scalars, as long as it's being applied to a table of numbers (i.e. arrays of all the same length).
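As a sketch of that point, the function below (a hypothetical helper, `positive_root`, not something defined elsewhere in this notebook) is written once and called with scalars, with equal-length arrays, and with arrays of mismatched lengths.

```python
import numpy

def positive_root(a, b, c):
    # Hypothetical helper: the quadratic formula from above, wrapped in a function.
    return (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

# Python scalars in, one number out.
print(positive_root(1.0, -3.0, 2.0))

# Arrays of equal length in, an array of that length out.
a = numpy.array([1.0, 1.0, 2.0])
b = numpy.array([-3.0, 0.0, -4.0])
c = numpy.array([2.0, -4.0, 2.0])
print(positive_root(a, b, c))

# Arrays of lengths 3 and 2 do not form a table, so Numpy refuses.
try:
    positive_root(a, b[:2], c)
except ValueError as err:
    print("ValueError:", err)
```

Note that `numpy.sqrt` is used instead of `math.sqrt`, since `math.sqrt` only accepts single numbers.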

Now let's get into some fancier gymnastics.

Python has a wonderfully consistent syntax for _slicing_ lists (or tuples or whatever):

In [110]:
alist = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]

In [113]:
alist[4]

4.4

In [111]:
alist[4:]

[4.4, 5.5, 6.6, 7.7, 8.8, 9.9]

In [112]:
alist[:7]

[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6]

In [114]:
alist[-1]

9.9

In [115]:
alist[:-3]

[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6]

In [117]:
alist[3:8]

[3.3, 4.4, 5.5, 6.6, 7.7]

In [118]:
alist[3:8:2]

[3.3, 5.5, 7.7]

In [119]:
alist[::2]

[0.0, 2.2, 4.4, 6.6, 8.8]

In [120]:
alist[::-2]

[9.9, 7.7, 5.5, 3.3, 1.1]

In [125]:
alist[5::-1]

[5.5, 4.4, 3.3, 2.2, 1.1, 0.0]

Numpy arrays share this syntax (everything that works for a list works for an array), but they extend it considerably.

In [130]:
array = numpy.array([[0.0, 1.1, 2.2, 3.3], [0, 10.1, 20.2, 30.3], [0, 100.1, 200.2, 300.3]])
array

array([[  0. ,   1.1,   2.2,   3.3],
       [  0. ,  10.1,  20.2,  30.3],
       [  0. , 100.1, 200.2, 300.3]])

In [135]:
array[1]

array([ 0. , 10.1, 20.2, 30.3])

In [136]:
array[:, 1]

array([  1.1,  10.1, 100.1])

In [132]:
array[1:, 2:]

array([[ 20.2,  30.3],
       [200.2, 300.3]])

In [134]:
array[::2, 1::2]

array([[  1.1,   3.3],
       [100.1, 300.3]])

In [138]:
array[[False, True, True]]

array([[  0. ,  10.1,  20.2,  30.3],
       [  0. , 100.1, 200.2, 300.3]])

In [143]:
array[[2, 1, 0]]

array([[  0. , 100.1, 200.2, 300.3],
       [  0. ,  10.1,  20.2,  30.3],
       [  0. ,   1.1,   2.2,   3.3]])

In [146]:
array[[2, 1, 1, 1, 1, 0, 0, 2]]

array([[  0. , 100.1, 200.2, 300.3],
       [  0. ,  10.1,  20.2,  30.3],
       [  0. ,  10.1,  20.2,  30.3],
       [  0. ,  10.1,  20.2,  30.3],
       [  0. ,  10.1,  20.2,  30.3],
       [  0. ,   1.1,   2.2,   3.3],
       [  0. ,   1.1,   2.2,   3.3],
       [  0. , 100.1, 200.2, 300.3]])
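One difference worth knowing about: slicing a list copies it, slicing an array gives a view of the same data, and indexing an array with a list of integers (or booleans) makes a new array. The sketch below checks each case by writing into the result; the names `corner` and `picked` are only for illustration.

```python
import numpy

array = numpy.array([[0.0, 1.1, 2.2, 3.3],
                     [0.0, 10.1, 20.2, 30.3],
                     [0.0, 100.1, 200.2, 300.3]])

alist = [0.0, 1.1, 2.2, 3.3]
list_slice = alist[1:]
list_slice[0] = -1.0              # a list slice is a new list
print(alist[1])                   # the original list is untouched

corner = array[1:, 2:]            # a slice of an array is a view
corner[0, 0] = -1.0
print(array[1, 2])                # the original array sees the change

picked = array[[2, 1, 0]]         # indexing with a list of integers copies
picked[0, 0] = 999.0
print(array[2, 0])                # the original array is untouched
```

If a slice needs to be independent of its source, call `.copy()` on it.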

What could this possibly be useful for?

**Masking:**

In [147]:
a = numpy.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
b = numpy.array([0, 100, 200, 300, 400, 500, 600, 700, 800, 900])

In [150]:
b > 400

array([False, False, False, False, False,  True,  True,  True,  True,
        True])

In [149]:
a[b > 400]

array([5.5, 6.6, 7.7, 8.8, 9.9])
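Masks can also be combined and used on the left-hand side of an assignment. This is a small sketch with the same `a` and `b`; the cut values 200 and 700 and the names `middle` and `clipped` are arbitrary choices for illustration.

```python
import numpy

a = numpy.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
b = numpy.array([0, 100, 200, 300, 400, 500, 600, 700, 800, 900])

# Combine conditions with & (and), | (or), ~ (not). The parentheses are
# required because these operators bind more tightly than comparisons.
middle = (b > 200) & (b < 700)
print(a[middle])

# A mask on the left-hand side assigns only to the selected elements.
clipped = a.copy()
clipped[~middle] = 0.0
print(clipped)
```

This is the array counterpart of an `if` inside a loop: the condition is computed for every row at once, then used to select.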

**Sorting and maximizing:**

In [166]:
a = numpy.random.normal(0, 5, 50)
b = abs(a)

In [167]:
numpy.argsort(a)

array([28, 47, 20, 16, 30, 39, 24, 41,  3, 45,  2, 40,  9, 29, 36, 27, 10,
       46, 21,  5, 32, 38, 43, 44, 12,  8,  7, 14, 11, 42, 48, 18,  1, 22,
       25, 26, 23, 13,  4,  6, 15, 17, 19, 34, 33, 35, 49, 31,  0, 37])

In [171]:
a[numpy.argsort(a)]

array([-8.45673353, -7.97031508, -7.49093389, -6.87593934, -6.54589091,
       -6.35934363, -6.20757474, -5.69982821, -4.94300516, -4.93937733,
       -4.66210735, -4.62907433, -4.57591676, -4.22091841, -3.11460159,
       -3.09170541, -3.03656705, -2.86636703, -2.50968847, -2.46435337,
       -1.99969205, -1.70203215, -1.69933041, -1.46477529, -1.11104797,
       -0.71790588, -0.41769265, -0.16998615,  0.02748723,  0.95013039,
        1.15152996,  1.29734584,  1.37406927,  1.40786922,  1.83519781,
        2.18585511,  2.41507666,  3.22516689,  3.28321788,  3.51794208,
        3.80396331,  4.00118425,  4.07277842,  4.64196578,  5.351139  ,
        5.71021168,  7.12489904,  7.30080422,  8.12273737,  9.87689738])
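The indexes from `argsort` don't have to be applied to the array they came from. As a sketch, here `a` is reordered by its absolute value `b`; the random numbers are regenerated with an assumed seed, so they won't match the values printed above.

```python
import numpy

rng = numpy.random.default_rng(0)       # assumed seed, for repeatability only
a = rng.normal(0, 5, 50)
b = abs(a)

order = numpy.argsort(b)                # positions that would sort b
by_magnitude = a[order]                 # a, rearranged from smallest to largest |a|

print(by_magnitude[:3])                 # the three values closest to zero
print(a[order[-1]], a[b.argmax()])      # largest magnitude, found two ways
```

The same `order` could be applied to any other array of length 50 to keep several columns of a table aligned while sorting by one of them.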

In [214]:
a = numpy.meshgrid(numpy.linspace(-5, 5, 11), numpy.linspace(-5, 5, 11))[1]
a

array([[-5., -5., -5., -5., -5., -5., -5., -5., -5., -5., -5.],
       [-4., -4., -4., -4., -4., -4., -4., -4., -4., -4., -4.],
       [-3., -3., -3., -3., -3., -3., -3., -3., -3., -3., -3.],
       [-2., -2., -2., -2., -2., -2., -2., -2., -2., -2., -2.],
       [-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.,  4.,  4.,  4.,  4.,  4.,  4.,  4.],
       [ 5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.]])

In [218]:
a.argmax(axis=0)

array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10])

In [219]:
a[a.argmax(axis=0)]

array([[5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.]])
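Notice that `a[a.argmax(axis=0)]` picked out whole rows. To get one element per column instead, the row indexes can be paired with column indexes. This is a sketch on a small made-up array where the maxima are in different rows, so the difference is visible.

```python
import numpy

a = numpy.array([[1.0, 9.0, 3.0],
                 [7.0, 2.0, 8.0],
                 [4.0, 5.0, 6.0]])

rows = a.argmax(axis=0)                 # row index of the maximum in each column
cols = numpy.arange(a.shape[1])         # 0, 1, 2: one column index per entry of rows

print(rows)
print(a[rows])                          # whole rows, one per entry of rows (3 x 3)
print(a[rows, cols])                    # one element per (row, column) pair
print(a.max(axis=0))                    # the same values, computed directly
```

Of course `a.max(axis=0)` is the simpler way to get the maxima themselves; the indexes are useful when they need to be applied to a different array of the same shape.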

**Dictionary encoding:**

In [172]:
text = """Four score and seven years ago our fathers brought forth on this
continent, a new nation, conceived in Liberty, and dedicated to the proposition
that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that field, as
a final resting place for those who here gave their lives that that nation might
live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate -- we can not consecrate -- we can
not hallow -- this ground. The brave men, living and dead, who struggled here,
have consecrated it, far above our poor power to add or detract. The world will
little note, nor long remember what we say here, but it can never forget what
they did here. It is for us the living, rather, to be dedicated here to the
unfinished work which they who fought here have thus far so nobly advanced. It
is rather for us to be here dedicated to the great task remaining before us --
that from these honored dead we take increased devotion to that cause for which
they gave the last full measure of devotion -- that we here highly resolve that
these dead shall not have died in vain -- that this nation, under God, shall
have a new birth of freedom -- and that government of the people, by the people,
for the people, shall not perish from the earth."""

In [197]:
words = text.replace(".", "").replace(",", "").replace("--", "").split()

In [198]:
dictionary, integers = numpy.unique(words, return_inverse=True)

In [199]:
len(words), len(dictionary)

(271, 142)
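To see what the two return values mean without scrolling through the whole speech, here is a sketch on a short made-up word list. It also passes `return_counts=True`, an option of `numpy.unique` not used above, to count how often each word appears.

```python
import numpy

# A short, made-up word list standing in for the full text.
words = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]

dictionary, integers, counts = numpy.unique(
    words, return_inverse=True, return_counts=True)

print(dictionary)                       # each distinct word once, in sorted order
print(integers)                         # for each input word, its position in dictionary
print(counts)                           # how many times each dictionary entry occurs

print(dictionary[integers].tolist() == words)   # indexing with integers decodes the text
print(dictionary[counts.argmax()])              # the most frequent word
```

The strings are stored once in `dictionary`, and everything else is integer arithmetic.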

In [200]:
dictionary

array(['But', 'Four', 'God', 'It', 'Liberty', 'Now', 'The', 'We', 'a',
       'above', 'add', 'advanced', 'ago', 'all', 'altogether', 'and',
       'any', 'are', 'as', 'battle-field', 'be', 'before', 'birth',
       'brave', 'brought', 'but', 'by', 'can', 'cause', 'civil', 'come',
       'conceived', 'consecrate', 'consecrated', 'continent', 'created',
       'dead', 'dedicate', 'dedicated', 'detract', 'devotion', 'did',
       'died', 'do', 'earth', 'endure', 'engaged', 'equal', 'far',
       'fathers', 'field', 'final', 'fitting', 'for', 'forget', 'forth',
       'fought', 'freedom', 'from', 'full', 'gave', 'government', 'great',
       'ground', 'hallow', 'have', 'here', 'highly', 'honored', 'in',
       'increased', 'is', 'it', 'larger', 'last', 'little', 'live',
       'lives', 'living', 'long', 'measure', 'men', 'met', 'might',
       'nation', 'never', 'new', 'nobly', 'nor', 'not', 'note', 'of',
       'on', 'or', 'our', 'people', 'perish', 'place', 'poor', 'portion',
       'po

In [201]:
integers

array([  1, 109,  15, 111, 141,  12,  94,  49,  24,  55,  92, 124,  34,
         8,  86,  84,  31,  69,   4,  15,  38, 127, 120, 102, 119,  13,
        81,  17,  35,  47,   5, 133,  17,  46,  69,   8,  62,  29, 132,
       118, 135, 119,  84,  93,  16,  84, 114,  31,  15, 114,  38,  27,
        79,  45,   7,  17,  82,  92,   8,  62,  19,  91, 119, 132,   7,
        65,  30, 127,  37,   8,  99,  91, 119,  50,  18,   8,  51, 107,
        97,  53, 125, 137,  66,  60, 121,  77, 119, 119,  84,  83,  76,
         3,  71,  14,  52,  15, 101, 119, 133, 113,  43, 124,   0,  69,
         8,  73, 110, 133,  27,  89,  37, 133,  27,  89,  32, 133,  27,
        89,  64, 124,  63,   6,  23,  81,  78,  15,  36, 137, 115,  66,
        65,  33,  72,  48,   9,  94,  98, 100, 127,  10,  93,  39,   6,
       140, 138,  75,  90,  88,  79, 105, 134, 133, 108,  66,  25,  72,
        27,  85,  54, 134, 123,  41,  66,   3,  71,  53, 130, 120,  78,
       103, 127,  20,  38,  66, 127, 120, 129, 139, 136, 123, 13

In [202]:
dictionary[integers]

array(['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers',
       'brought', 'forth', 'on', 'this', 'continent', 'a', 'new',
       'nation', 'conceived', 'in', 'Liberty', 'and', 'dedicated', 'to',
       'the', 'proposition', 'that', 'all', 'men', 'are', 'created',
       'equal', 'Now', 'we', 'are', 'engaged', 'in', 'a', 'great',
       'civil', 'war', 'testing', 'whether', 'that', 'nation', 'or',
       'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated',
       'can', 'long', 'endure', 'We', 'are', 'met', 'on', 'a', 'great',
       'battle-field', 'of', 'that', 'war', 'We', 'have', 'come', 'to',
       'dedicate', 'a', 'portion', 'of', 'that', 'field', 'as', 'a',
       'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'gave',
       'their', 'lives', 'that', 'that', 'nation', 'might', 'live', 'It',
       'is', 'altogether', 'fitting', 'and', 'proper', 'that', 'we',
       'should', 'do', 'this', 'But', 'in', 'a', 'larger', 'sense', 'we',
       '