# Just Numpy

Let's start with a plausible problem. We have a dataset of all daily temperatures measured at Newark since 1893 and we want to analyze it. First, let's try that with a Python list.

In [None]:
temperatures = []
with open("data/newark-temperature-avg.txt") as file:
    for line in file.readlines():
        temperatures.append(float(line))

len(temperatures), temperatures[:10], temperatures[-10:]

Much of the record is missing, as we can see by counting NaNs:

In [None]:
import math
numbad = 0
for x in temperatures:
    if math.isnan(x):
        numbad += 1

numbad / len(temperatures)

We have a more complete dataset of daily minimum and maximum temperatures. It's not as accurate, but we can impute the missing averages by averaging the minimum and maximum.

In [None]:
min_temperatures = []
with open("data/newark-temperature-min.txt") as file:
    for line in file.readlines():
        min_temperatures.append(float(line))

max_temperatures = []
with open("data/newark-temperature-max.txt") as file:
    for line in file.readlines():
        max_temperatures.append(float(line))

(len(min_temperatures), min_temperatures[:10], min_temperatures[-10:],
 len(max_temperatures), max_temperatures[:10], max_temperatures[-10:])

While we fill in the missing values, let's also measure how long it takes.

In [None]:
%%timeit

imputed_temperatures = []
for average, minimum, maximum in zip(temperatures, min_temperatures, max_temperatures):
    if math.isnan(average):
        imputed_temperatures.append(0.5 * (minimum + maximum))
    else:
        imputed_temperatures.append(average)

Now let's do the same thing in Numpy, again measuring the time.

In [None]:
import numpy

temperatures = numpy.array(temperatures)
min_temperatures = numpy.array(min_temperatures)
max_temperatures = numpy.array(max_temperatures)

In [None]:
%%timeit

missing = numpy.isnan(temperatures)
imputed_temperatures = numpy.empty(len(temperatures), dtype=numpy.float64)
imputed_temperatures[missing] = 0.5 * (min_temperatures[missing] + max_temperatures[missing])
imputed_temperatures[~missing] = temperatures[~missing]

Or just

In [None]:
%%timeit

imputed_temperatures = numpy.where(
    # condition                # if true                                    # if false
    numpy.isnan(temperatures), 0.5 * (min_temperatures + max_temperatures), temperatures)

We see that Numpy can be much faster than Python loops: a factor of 100 or 200 in this case, though I have seen speedups of several thousand. (It depends on the application.) The way you tell it what to do is also very different, which may be good or bad: it may or may not read more naturally.

One thing we saw was a preoccupation with data types, which is unusual for Python.

In [None]:
numpy.zeros(5, dtype=numpy.float64)

In [None]:
numpy.zeros(5, dtype=numpy.int32)

In [None]:
numpy.zeros(5, dtype=numpy.bool_)    # numpy.bool_ is available in every Numpy version; the bare numpy.bool alias is not

In [None]:
numpy.zeros(5, dtype="S3")

This is where a large part of Numpy's speed comes from. When Python churns, a lot of that time is spent checking and re-checking data types, which in a compiled language like C++ were checked once and for all in the compilation step.
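
To make the fixed-type idea concrete, here is a minimal sketch (the array name `values` is only for illustration). Every element of an array shares one `dtype`, so anything assigned into it is converted to that type instead of being stored as its own Python object.

```python
import numpy

# One dtype for the whole array: 3 elements of 4 bytes each.
values = numpy.zeros(3, dtype=numpy.int32)

# Assigning a Python float converts it to the array's dtype (truncating here).
values[0] = 7.9

print(values, values.dtype, values.itemsize, values.nbytes)
```

Because the type is known up front, the compiled loops never have to ask "what kind of object is this?" for each element.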

Numpy is a suite of compiled functions applied to data with predetermined types. When you're using Numpy properly, you'll have very few `for` loops and `if` statements in your code: the Python code acts as a high-level director, while Numpy does its looping in compiled code.

The Numpy library consists mainly of one class, `numpy.ndarray`, and operations on it. This is an n-dimensional array of contiguous data. Some operations change that data or make new arrays, but many operations merely change our interpretation of the data. The latter are the fastest.

In [None]:
array = numpy.arange(24, dtype=numpy.float64)    # 64-bit floating point numbers
array

In [None]:
array.view(numpy.int64)

In [None]:
array.view("S8")

In [None]:
array.tobytes()    # tostring() is the older, deprecated name for the same thing

In [None]:
numpy.array([b"one", b"two", b"three"]).view(numpy.uint8)

In [None]:
list(b"one\x00\x00two\x00\x00three")

In [None]:
array.reshape(6, 4)

In [None]:
array.reshape(6, 4, order="f")    # Fortran order vs C order: how a 1D sequence covers an nD block

This interpretation has only two parameters:

   * `dtype` (data type, including endianness): how bytes are represented as numbers
   * `shape` and `order`: how those numbers are arranged in an n-dimensional grid

Mistakes in interpretation are usually not subtle, so just be sure to _look_ at your data.
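
As a small sketch of what "changing the interpretation" means for `shape` and `order` (the names `flat`, `c_order`, and `f_order` are illustrative), the same six numbers can be laid out row-first or column-first, and `numpy.shares_memory` confirms that the C-order reshape did not copy anything.

```python
import numpy

flat = numpy.arange(6)

c_order = flat.reshape(2, 3)               # C order: fill along rows first
f_order = flat.reshape(2, 3, order="F")    # Fortran order: fill down columns first

print(c_order)
print(f_order)

# The C-order reshape is a view: same bytes, different shape.
print(numpy.shares_memory(flat, c_order))
```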

Numpy arrays can be used in mathematical formulae, but instead of computing one value, they compute a whole array of values, element by element.

In [None]:
a = numpy.arange(10)
a

In [None]:
a + 100

In [None]:
b = numpy.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

In [None]:
a + b

In [None]:
a**2

Generally, you can imagine a table of data to compute: the columns represent meaningful quantities (often named) while the rows represent anonymous instances.

In [None]:
a = numpy.random.uniform(5, 10, 10000)
b = numpy.random.uniform(10, 20, 10000)
c = numpy.random.uniform(-0.1, 0.1, 10000)
len(a)

A conventional Python approach would be to compute the formula on each instance, one after another.

In [None]:
roots1 = []
for ai, bi, ci in zip(a, b, c):
    roots1.append((-bi + math.sqrt(bi**2 - 4*ai*ci)) / (2*ai))

The Numpy approach computes each step of the formula on all instances before moving on to the next step.

In [None]:
roots2 = (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

The Numpy expression (`(-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)`) is equivalent to:

In [None]:
tmp1 = numpy.negative(b)            # -b
tmp2 = numpy.square(b)              # b**2
tmp3 = numpy.multiply(4, a)         # 4*a
tmp4 = numpy.multiply(tmp3, c)      # tmp3*c
tmp5 = numpy.subtract(tmp2, tmp4)   # tmp2 - tmp4
tmp6 = numpy.sqrt(tmp5)             # sqrt(tmp5)
tmp7 = numpy.add(tmp1, tmp6)        # tmp1 + tmp6
tmp8 = numpy.multiply(2, a)         # 2*a
roots3 = numpy.divide(tmp7, tmp8)   # tmp7 / tmp8

One strange (but useful!) consequence of this rule that mathematical operations are applied elementwise is that it even applies to comparisons. Suppose we want to verify that the `roots1` computed in the Python loop match the `roots2` and `roots3` computed by Numpy.

In [None]:
roots1 == roots2

In [None]:
roots2 == roots3

When you want to check that _all_ of the elements are equal, I'd use `.all()`.

In [None]:
(roots2 == roots3).all()

In [None]:
(roots1 == roots2).all()

Why is that? Didn't we just see that `roots1 == roots2` is `True, True, True, ...`?

In [None]:
(roots1 == roots2).any()

In [None]:
(roots1 == roots2).sum(), len(roots1)

Which ones fail?

In [None]:
failures, = numpy.nonzero(roots1 != roots2)
failures

In [None]:
roots1[failures[0]], roots2[failures[0]]

In [None]:
roots1[failures[0]] - roots2[failures[0]]

Numpy uses different routines to do its calculations, so results might not be exactly the same. We don't care about last-digit differences, so we set a tolerance.

In [None]:
(abs(roots1 - roots2) < 1e-15).all()
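
A hedged alternative to a hand-written tolerance is `numpy.allclose`, which combines a relative tolerance (`rtol`) with an absolute one (`atol`). The sketch below regenerates its own small sample so that it runs on its own; the seed, the sample size, and the tolerance values are arbitrary choices for illustration.

```python
import math

import numpy

rng = numpy.random.default_rng(12345)
a = rng.uniform(5, 10, 1000)
b = rng.uniform(10, 20, 1000)
c = rng.uniform(-0.1, 0.1, 1000)

# Same formula, once per element in a Python loop and once on whole arrays.
loop_roots = [(-bi + math.sqrt(bi**2 - 4*ai*ci)) / (2*ai) for ai, bi, ci in zip(a, b, c)]
array_roots = (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

# True if every pair satisfies |x - y| <= atol + rtol * |y|.
close = numpy.allclose(loop_roots, array_roots, rtol=1e-9, atol=1e-12)
print(close)
```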

The upshot of this is that you can perform a calculation with the same expression on Numpy arrays as on Python scalars, as long as it's being applied to a table of numbers (i.e. arrays of all the same length).

Now let's get into some fancier gymnastics.

Python has a wonderfully consistent syntax for _slicing_ lists (or tuples or whatever):

In [None]:
alist = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]
alist[4]

In [None]:
alist[4:]

In [None]:
alist[:7]

In [None]:
alist[-1]

In [None]:
alist[:-3]

In [None]:
alist[3:8]

The third argument of a slice is the _stride,_ the amount to skip between elements.

In [None]:
alist = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]
alist[3:8:2]

In [None]:
alist[::2]

In [None]:
alist[::-2]

In [None]:
alist[5::-1]

Numpy arrays share this syntax— everything that works for a list works for an array— but they extend it considerably.

In [None]:
array = numpy.array([[0.0, 1.1, 2.2, 3.3], [0, 10.1, 20.2, 30.3], [0, 100.1, 200.2, 300.3]])
array

In [None]:
array[1]

In [None]:
array[:, 1]

In [None]:
array[1:, 2:]

In [None]:
array[::2, 1::2]

Even arrays (or sequences) of booleans or integers can be used as slices.

In [None]:
array = numpy.array([[0.0, 1.1, 2.2, 3.3], [0, 10.1, 20.2, 30.3], [0, 100.1, 200.2, 300.3]])
array[[False, True, True]]

In [None]:
array[[2, 1, 0]]

In [None]:
array[[2, 1, 1, 1, 1, 0, 0, 2]]

What could this possibly be useful for?

**Masking:**

In [None]:
a = numpy.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
b = numpy.array([0, 100, 200, 300, 400, 500, 600, 700, 800, 900])

In [None]:
b > 400

In [None]:
a[b > 400]
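
Masks are just boolean arrays, so they can be combined before they are used as a slice. A minimal sketch, reusing the same `a` and `b` values: use `&` and `|` (not the Python keywords `and` and `or`), and parenthesize each comparison because `&` and `|` bind more tightly than `>` and `<`.

```python
import numpy

a = numpy.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
b = numpy.array([0, 100, 200, 300, 400, 500, 600, 700, 800, 900])

both = a[(b > 400) & (a < 8.0)]       # elements passing both cuts
either = a[(b < 100) | (b > 800)]     # elements passing at least one cut

print(both)
print(either)
```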

**Sorting and maximizing:**

In [None]:
a = numpy.random.normal(0, 5, 50)
b = abs(a)

In [None]:
numpy.argsort(a)

In [None]:
a[numpy.argsort(a)]

In [None]:
a = numpy.meshgrid(numpy.linspace(-5, 5, 11), numpy.linspace(-5, 5, 11))[1]
a

In [None]:
a.argmax(axis=0)

In [None]:
a[a.argmax(axis=0)]

**Dictionary encoding:**

In [None]:
text = """Four score and seven years ago our fathers brought forth on this
continent, a new nation, conceived in Liberty, and dedicated to the proposition
that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that field, as
a final resting place for those who here gave their lives that that nation might
live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate -- we can not consecrate -- we can
not hallow -- this ground. The brave men, living and dead, who struggled here,
have consecrated it, far above our poor power to add or detract. The world will
little note, nor long remember what we say here, but it can never forget what
they did here. It is for us the living, rather, to be dedicated here to the
unfinished work which they who fought here have thus far so nobly advanced. It
is rather for us to be here dedicated to the great task remaining before us --
that from these honored dead we take increased devotion to that cause for which
they gave the last full measure of devotion -- that we here highly resolve that
these dead shall not have died in vain -- that this nation, under God, shall
have a new birth of freedom -- and that government of the people, by the people,
for the people, shall not perish from the earth."""

In [None]:
words = text.replace(".", "").replace(",", "").replace("--", "").split()

In [None]:
dictionary, integers = numpy.unique(words, return_inverse=True)

In [None]:
len(words), len(dictionary)

In [None]:
dictionary

In [None]:
integers

In [None]:
dictionary[integers]

Notice that slicing with integer indexes is a function in the mathematical sense. An array is a mapping from integers to the values of the array: `[0, N) → V`.

Slicing with integer indexes composes a function `[0, m) → [0, N)` with it to get a new function `[0, m) → V`.

In [None]:
a = numpy.array(["zero", "one", "two", "three", "four"])
b = numpy.array([3, 3, 1, 2, 4, 0, 1])

a[b]
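
Since integer indexing is function composition, it should not matter where the parentheses go. This sketch checks that with a third, made-up index array `c`: indexing `a[b]` by `c` gives the same result as indexing `a` by `b[c]`.

```python
import numpy

a = numpy.array(["zero", "one", "two", "three", "four"])
b = numpy.array([3, 3, 1, 2, 4, 0, 1])
c = numpy.array([6, 0, 5])    # hypothetical extra index array, [0, 3) -> [0, 7)

left = a[b][c]     # compose a with b first, then with c
right = a[b[c]]    # compose b with c first, then apply a

print(left)
print(right)
```

The second form is usually cheaper when `c` is short, because it never builds the full `a[b]` array.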

Putting things together: we can use reshaping and fancy indexing together to do some surprisingly powerful things. ("Look, ma! No for loops!")

For example, I once had to reverse a list in groups of 8. I struggled with it until I found this online:

In [None]:
original = numpy.array([7, 6, 5, 4, 3, 2, 1, 0,
                        15, 14, 13, 12, 11, 10, 9, 8,
                        23, 22, 21, 20, 19, 18, 17, 16,
                        31, 30, 29, 28, 27, 26, 25, 24])

In [None]:
original.reshape(4, 8)[:, ::-1].reshape(32)

**Question:** How would we change this if we wanted to turn

In [None]:
original = numpy.array([7, 6, 5, 4, 3, 2, 1, 0,
                        15, 14, 13, 12, 11, 10, 9, 8,
                        23, 22, 21, 20, 19, 18, 17, 16,
                        31, 30, 29, 28, 27, 26, 25, 24])

into [31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]?

In [None]:
original.reshape(4, 8)[:, ::-1].reshape(32)    # change this line!

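Python lists also let you assign to slices, but Numpy goes further: you can assign a single value to a whole slice, or assign through boolean and integer-array indexes. This overwrites parts of the array in place.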

In [None]:
array = numpy.arange(10) * 1.1
array

In [None]:
array[5:] = -9
array

In [None]:
array[::2] = 123
array

In [None]:
array = numpy.ones(10, dtype=numpy.int64) * 999
array[numpy.array([False, False, False, True, True, False, True, False, True, False])] = numpy.array([1, 2, 3, 4])
array

In [None]:
array = numpy.ones(10, dtype=numpy.int64) * 999
array[numpy.array([8, 7, 3, 0])] = numpy.array([1, 2, 3, 4])
array

**Example:** Rubik's square. Let's use the `roll` function to rotate rows and columns of a square matrix.

In [None]:
rubiks = (numpy.tile(numpy.arange(10) * 0.1, 10) + numpy.repeat(numpy.arange(10), 10)).reshape(10, 10)
rubiks

In [None]:
def twist(isrow, index, howmuch):
    if isrow:
        rubiks[index, :] = numpy.roll(rubiks[index, :], howmuch)
    else:
        rubiks[:, index] = numpy.roll(rubiks[:, index], howmuch)
    return rubiks

In [None]:
twist(True, 3, 5)

**Exercise:** Shuffling cards. Suppose you want to model the first cut of a deck of cards. The dealer roughly cuts the deck in two and flips them together such that all cards from the left hand are in order relative to each other and all cards from the right hand are in order relative to each other, but they are randomly interleaved.

<center><img src="img/cards-chance-deck-19060.jpg" width="25%" /></center>

In [None]:
cards = numpy.arange(52)
random_booleans = numpy.random.randint(0, 2, 52, dtype=numpy.bool_)
negated_booleans = ~random_booleans
solution = cards.copy()    # overwrite this array with your solution!

In [None]:
cut = random_booleans.sum()
left = cards[:cut]
right = cards[cut:]
solution[random_booleans] = left
solution[negated_booleans] = right
solution

**Something to be careful about:** copies vs views. Numpy tries to operate as little as possible on the big datasets, and that means that it sometimes returns "an array" that is just a reinterpretation of the data. The old and new array objects both point to the same bytes in memory. Example:

In [None]:
a = numpy.arange(10) * 1.1
a

In [None]:
b = a[::3]
b

In [None]:
b[:] = 999
b

In [None]:
a

Changing `b` changes `a`. This can cause some _terrible_ bugs.

... but only upon assignment. If you mostly just apply functions to arrays, Numpy will share memory among stages of your calculation and run as efficiently as possible. When you do need to make an assignment, be careful to check your work (on an interactive prompt or in a notebook).

Here's a way to check:

In [None]:
b.flags.owndata

In [None]:
b.base is a

And here's a way to ensure that you get a copy with no link to the original:

In [None]:
b = a[::3].copy()
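
Another way to check, sketched here with illustrative names, is `numpy.shares_memory`, which reports whether two arrays overlap in memory. The view shares bytes with `a` and writes through to it; the copy does neither.

```python
import numpy

a = numpy.arange(10) * 1.1
view = a[::3]                  # a reinterpretation of a's bytes
independent = a[::3].copy()    # a separate block of memory

print(numpy.shares_memory(a, view), numpy.shares_memory(a, independent))

view[:] = 999                  # writes through to a, but not to the copy
print(a[0], independent[0])
```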

**Tricky walk-through exercise:** Let's make a PNG! PNG files have a relatively simple structure ([full specification](https://www.w3.org/TR/PNG)), simple enough that we can make one without using a specialized library.

The format consists of a fixed preamble and compressed data blocks. The purpose of this exercise is to solve array restructuring problems that are similar to what you might encounter in a data analysis— half the problem is data-munging, right?— and end up with a cool tool for visualizing what various functions do in the remainder of this session.

The first part is easy: a preamble consisting of the following bytes: `b"\x89PNG\r\n\x1a\n"`. Let's write a function that produces this as a Numpy array.

In [None]:
def png_preamble():
    return numpy.frombuffer(b"\x89PNG\r\n\x1a\n", dtype="u1")    # u1 is unsigned 1-byte integers: raw bytes

png_preamble()

Next are three "chunks." A "chunk" is a four-byte length, a four-byte "tag," some data, and a four-byte "cyclic redundancy check" (CRC), a way of cross-checking that the data haven't been garbled. Let's write a function that constructs a chunk, using Numpy to reinterpret numbers as bytes (changing `dtype`).

<center><img src="img/png-spec-chunks.png" width="50%"></center>

In [None]:
import zlib

def png_chunk(tag, data):
    out = numpy.empty(4 + len(tag) + len(data) + 4, dtype="u1")
    
    length_as_u4 = numpy.array([len(data)], dtype=">u4")    # 4-byte integer with most significant byte first (">")
    out[0:4] = length_as_u4.view("u1")
    
    out[4:8] = numpy.frombuffer(tag, dtype="u1")
    out[8 : 8 + len(data)] = numpy.frombuffer(data, dtype="u1")
    
    crc = zlib.crc32(tag)
    crc = zlib.crc32(data, crc)
    crc &= 0xffffffff
    crc_as_u4 = numpy.array([crc], dtype=">u4")
    out[-4:] = crc_as_u4.view("u1")

    return out
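
As a sanity check on the chunk layout, here is a sketch that builds the same bytes with the standard library's `struct` module instead of Numpy views. The helper name `chunk_bytes` is made up for this example. The empty `IEND` chunk is a convenient test case because its twelve bytes are identical in every PNG file.

```python
import struct
import zlib

import numpy

def chunk_bytes(tag, data):
    # length (4 bytes, big-endian) + tag + data + CRC of tag and data (4 bytes, big-endian)
    crc = zlib.crc32(tag + data) & 0xffffffff
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)

end_chunk = chunk_bytes(b"IEND", b"")
print(end_chunk.hex())

# The Numpy dtype ">u4" produces the same big-endian bytes as struct's ">I".
length_bytes = numpy.array([13], dtype=">u4").view("u1")
print(length_bytes.tolist())
```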

The three chunks are `"IHDR"` specifying a width, height, and `8, 6, 0, 0, 0` for "color image," followed by `"IDAT"` with compressed image data, followed by `"IEND"` with nothing.

In [None]:
def png_image(imagedata):
    height, scanline_width = imagedata.shape
    width = int((scanline_width - 1) / 4)
    
    width_height = numpy.array([width, height], dtype=">u4")
    color_image = numpy.array([8, 6, 0, 0, 0], dtype="u1")
    headerdata = numpy.concatenate([width_height.view("u1"), color_image])
    
    preamble = png_preamble()
    header = png_chunk(b"IHDR", headerdata.tobytes())                # tobytes() replaces the deprecated tostring()
    data = png_chunk(b"IDAT", zlib.compress(imagedata.tobytes()))
    end = png_chunk(b"IEND", b"")
    
    return numpy.concatenate([preamble, header, data, end])

Now let's try it out on some randomly generated data.

In [None]:
width, height = 400, 300
imagedata = numpy.zeros((height, 4 * width + 1), dtype="u1")
imagedata[:, 1:] = numpy.random.randint(0, 256, (height, 4 * width))
imagedata

In [None]:
png_image(imagedata)

In [None]:
import IPython.display

def png_display(imagedata):
    return IPython.display.display(IPython.display.Image(data=png_image(imagedata).tobytes()))   # Image expects raw bytes

def png_save(filename, imagedata):
    with open(filename, "wb") as file:
        file.write(png_image(imagedata))   # a real PNG file that you can view with other programs

png_display(imagedata)

(You may be itching to tell me that Matplotlib's `imshow` does that, but the DIY solution will pay off in pedagogy.)

The part we skipped over was how to construct the image itself. We saw that it was an array with shape `(height, 4 * width + 1)` and the first byte of each row ("scanline") was zero. Here's a figure from the specification:

<center><img src="img/png-spec-scanline.png" width="60%"></center>

After the "filter type" (zero is simplest), they're alternating red, green, blue, "alpha" (transparency) bytes. If, for instance, we want to make the image all red, we'd set the red bytes and the alpha bytes to maximum (255), the others to zero. Use slices!

In [None]:
imagedata = numpy.zeros((height, 4 * width + 1), dtype="u1")
imagedata[:, 1::4] = 255     # max out red bytes
imagedata[:, 4::4] = 255     # max out alpha bytes
png_display(imagedata)

**Exercise:** Make a gradient from black to red using `linspace`:

In [None]:
numpy.linspace(0, 255, 20, dtype="u1")

In [None]:
imagedata = numpy.zeros((height, 4 * width + 1), dtype="u1")
imagedata[:, 1::4] = numpy.linspace(0, 255, width, dtype="u1")
imagedata[:, 4::4] = 255
png_display(imagedata)

**Harder exercise:** Now use multiplication by a `linspace` to make the gradient peak in the bottom-right corner.

In [None]:
imagedata[:, 1::4] = imagedata[:, 1::4] * numpy.linspace(0, 1, height).reshape(height, 1)
png_display(imagedata)
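
The `reshape(height, 1)` in that multiplication relies on broadcasting: an array of shape `(height, 1)` combined with one of shape `(width,)` or `(height, width)` is stretched to `(height, width)`. A minimal sketch on a tiny 3 × 5 grid (sizes chosen only to keep the output readable):

```python
import numpy

row = numpy.linspace(0, 255, 5)                   # shape (5,): varies left to right
column = numpy.linspace(0, 1, 3).reshape(3, 1)    # shape (3, 1): varies top to bottom

product = column * row                            # broadcast to shape (3, 5)

print(product.shape)
print(product)
```

The top row is all zeros, the bottom row equals `row`, and the largest value sits in the bottom-right corner.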

For flair, let's put a streak of blue through it.

In [None]:
imagedata[:, 3::4] = numpy.linspace(255, 0, width, dtype="u1")
png_display(imagedata)

We can chop up sections and move them around.

In [None]:
imagedata[100:250, 601:681] = imagedata[100:250, 1201:1281]
png_display(imagedata)

And `roll` sections, like we did with the Rubik's square.

In [None]:
imagedata[:, 801:881] = numpy.roll(imagedata[:, 801:881], 100, axis=0)
png_display(imagedata)

In [None]:
imagedata[230:250, 1:] = numpy.roll(imagedata[230:250, 1:], 400, axis=1)
png_display(imagedata)

I'd like to draw your attention to that `axis` parameter. When we applied `roll` to the 2D array, we could tell it which dimension to roll. This is a general feature of Numpy:

   * `axis=0` means apply it to the first axis (the first dimension you'd slice in square brackets)
   * `axis=1` means apply it to the second
   * and so on.
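
The same `axis` convention applies to reductions such as `sum`. In this small sketch (the array name `grid` is illustrative), summing over `axis=0` collapses the rows and leaves one value per column, while `axis=1` collapses the columns and leaves one value per row.

```python
import numpy

grid = numpy.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]

down_columns = grid.sum(axis=0)         # collapse the first axis: one sum per column
along_rows = grid.sum(axis=1)           # collapse the second axis: one sum per row

print(down_columns)
print(along_rows)
```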

Numpy arrays are truly N-dimensional objects. Nowadays, they'd be called "tensors" to sound cool.

Speaking of tensors, Numpy has a linear algebra module (backed by whichever BLAS/LAPACK implementation it was built against, such as OpenBLAS, ATLAS, or MKL).

In [None]:
matrix = numpy.array([[1.1, 2.2], [3.3, 4.4]])
matrix

In [None]:
matrix.T    # shortcut for .transpose()

In [None]:
numpy.linalg.inv(matrix)

In [None]:
numpy.eye(2)   # identity or "I" (get it?)

In [None]:
matrix @ matrix   # "at" sign means matrix multiplication in Python 3.5 and above

In [None]:
numpy.matmul(matrix, matrix)   # if you're still living in Python 2.7

In [None]:
numpy.linalg.matrix_power(matrix, 2)

In [None]:
eigenvalues, eigenvectors = numpy.linalg.eig(matrix)    # eig returns (eigenvalues, eigenvectors)
for eigenvalue, eigenvector in zip(eigenvalues, eigenvectors.T):    # eigenvectors are the columns
    print("")
    print(eigenvector)
    print(eigenvalue)

In [None]:
numpy.linalg.solve(matrix, eigenvector)
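
A quick way to convince yourself that eigenvalues and eigenvectors are paired up correctly is to test the defining property, `matrix @ v == eigenvalue * v`, up to rounding. This sketch also checks what `solve` should return for an eigenvector: since the matrix only rescales `v`, solving `matrix @ x = v` gives `v / eigenvalue`.

```python
import numpy

matrix = numpy.array([[1.1, 2.2], [3.3, 4.4]])

# eig returns the eigenvalues and a matrix whose columns are the eigenvectors.
eigenvalues, eigenvectors = numpy.linalg.eig(matrix)

for eigenvalue, eigenvector in zip(eigenvalues, eigenvectors.T):
    rescaled = numpy.allclose(matrix @ eigenvector, eigenvalue * eigenvector)
    solved = numpy.allclose(numpy.linalg.solve(matrix, eigenvector), eigenvector / eigenvalue)
    print(eigenvalue, rescaled, solved)
```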

And many others. See the documentation.

Numpy has some basic statistics functions as well.

In [None]:
a = numpy.random.normal(100, 5, 10000)
b = numpy.random.normal(0, 2, 10000) + a

In [None]:
numpy.min(a), numpy.max(a)      # use nanmin, nanmax to ignore NaN

In [None]:
numpy.mean(a)

In [None]:
numpy.median(a)

In [None]:
numpy.std(a)

In [None]:
numpy.corrcoef(a, b)

In [None]:
numpy.histogram(a, bins=10, range=(75, 125))    # counts in each bin, bin edges

**Useful odds and ends:** In my very first example, I used `where` as an if-then-else:

In [None]:
            # condition                # if true                                    # if false
numpy.where(numpy.isnan(temperatures), 0.5 * (min_temperatures + max_temperatures), temperatures)

It can also be used to find the indexes of an array where the condition is non-zero.

In [None]:
numpy.where(numpy.isnan(temperatures))

These indexes can then be used as a slice, much like the output of `argsort`, `nonzero`, etc. It's a 1-tuple (notice the comma at the end?) because `where` returns one array per dimension: N arrays for an N-dimensional array.

In [None]:
numpy.where(imagedata > 250)

In [None]:
imagedata[numpy.where(imagedata > 250)]
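
On a 2D array small enough to read by eye, the tuple structure is easier to see. In this sketch (the array `grid` is made up for the example), `where` returns one array of row indexes and one of column indexes, and passing both back as an index picks out the matching elements.

```python
import numpy

grid = numpy.array([[1, 9, 3],
                    [8, 2, 7]])

rows, cols = numpy.where(grid > 6)    # one index array per dimension

print(rows, cols)
print(grid[rows, cols])               # same as grid[numpy.where(grid > 6)]
```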

**Here's another goodie:** sometimes you want a discrete lookup table. If the table has size `m` and the data you need to look up has size `n`, a naive search would take `O(n*m)` time. But if the lookup table is sorted, a bisection search brings that down to `O(n*log(m))`.

In [None]:
independent_variable = numpy.array([1.1, 1000.0, 2013.0216, 2099.99, 9999.9, 1e12, 2e55])   # size m
dependent_variable = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7])                       # size m

dataset = numpy.exp(numpy.random.normal(0, 3, 100000)**2)                                   # size n
dataset

In [None]:
indexes = numpy.searchsorted(independent_variable, dataset, side="left")       # bisection search every element in dataset

indexes[indexes == len(independent_variable)] = len(independent_variable) - 1  # clamp outliers to the valid range

dependent_variable[indexes]
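
With only a few values, the behavior of `searchsorted` can be followed by hand. In this sketch (all names and numbers are made up for illustration), `side="left"` returns, for each value, the index of the first table entry that is greater than or equal to it; values past the end of the table get `len(edges)`, which is why they must be clamped before they are used as an index.

```python
import numpy

edges = numpy.array([10.0, 20.0, 30.0])       # sorted lookup keys
lookup = numpy.array([1.1, 2.2, 3.3])         # value associated with each key
values = numpy.array([5.0, 10.0, 15.0, 30.0, 99.0])

indexes = numpy.searchsorted(edges, values, side="left")
clamped = numpy.minimum(indexes, len(edges) - 1)    # 99.0 would otherwise index past the end

print(indexes)
print(lookup[clamped])
```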

(This is a low-level implementation of a sparse matrix...)

**Another gem:** Cumulative sum.

In [None]:
random_numbers = numpy.random.normal(0, 1, 1000)

In [None]:
random_walk = numpy.cumsum(random_numbers)
random_walk

**Splitting, stacking, and concatenating:** Numpy arrays must be contiguous in memory. Sometimes you need to work on a chunk at a time and then combine results.

In [None]:
a = numpy.arange(103)
abits = numpy.array_split(a, 10)
abits

In [None]:
numpy.concatenate(abits)

In [None]:
b = numpy.arange(100)
bbits = numpy.split(b, 10)    # must be exactly the same size
bbits

In [None]:
numpy.vstack(bbits)    # hstack, vstack, dstack...

**Universal functions:** We've been using "universal functions" or "ufuncs" all along now: they're functions of n numbers → 1 number for small n, such as addition, trigonometry, exponentiation...

In [None]:
a = numpy.arange(10) * 1.1

In [None]:
a + a

In [None]:
numpy.add(a, a)    # same thing

In [None]:
numpy.sin(a)

In [None]:
numpy.exp(a)

In [None]:
type(numpy.add), type(numpy.sin), type(numpy.exp)

This function type constitutes a special protocol: some third-party libraries like SciPy define many more functions in this form and other third-party libraries override them with special powers.

Numpy ufuncs standardize fast computation just as Numpy arrays standardize views of bytes in memory.

They also have some "hidden" features:

In [None]:
[x for x in dir(numpy.add) if not x.startswith("_")]

In [None]:
a = numpy.array([1, 2, 3, 4, 5])
numpy.add.reduce(a)

In [None]:
numpy.sum(a)

Okay, same result, but the `ufunc.reduce` gives you a reducer for any two-argument operation:

In [None]:
numpy.multiply.reduce(a)

In [None]:
numpy.bitwise_or.reduce(a)

Add (or whatever) irregularly sized subsequences of an array with `ufunc.reduceat`.

In [None]:
a = numpy.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
i = numpy.array([0, 3, 5, 6, 9])
numpy.add.reduceat(a, i)

In [None]:
(0 + 1 + 2), (3 + 4), (5), (6 + 7 + 8), (9)
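
For strictly increasing indexes like these, `reduceat` is equivalent to summing the slices between consecutive start positions, with the last slice running to the end of the array. A sketch of that equivalence follows; the explicit loop is only for explanation, and it does not cover the different rules `reduceat` applies to non-increasing indexes.

```python
import numpy

a = numpy.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
i = numpy.array([0, 3, 5, 6, 9])

# Each segment runs from one start index up to (not including) the next one.
stops = numpy.append(i[1:], len(a))
manual = [int(a[start:stop].sum()) for start, stop in zip(i, stops)]

print(manual)
print(numpy.add.reduceat(a, i))
```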

Outer products (or sums or whatever) with `ufunc.outer`.

In [None]:
a = numpy.arange(10)
b = numpy.arange(10) * 0.1

In [None]:
numpy.add.outer(a, b)

`ufunc.at` is like assignment with fancy indexing, except that repeated values are accumulated.

In [None]:
random_integers = numpy.random.randint(0, 10, 15)
random_integers

In [None]:
not_a_histogram = numpy.zeros(10, dtype=numpy.int64)
not_a_histogram[random_integers] += 1
not_a_histogram

In [None]:
yes_a_histogram = numpy.zeros(10, dtype=numpy.int64)
numpy.add.at(yes_a_histogram, random_integers, 1)
yes_a_histogram
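
With a fixed set of indexes instead of random ones, the difference is easy to verify by hand. In this sketch (the index values are made up), `+=` through a fancy index counts each distinct index only once, `numpy.add.at` accumulates repeats, and `numpy.bincount` gives the same counts for this particular use.

```python
import numpy

indexes = numpy.array([1, 3, 3, 7, 1, 3])

plus_equals = numpy.zeros(10, dtype=numpy.int64)
plus_equals[indexes] += 1               # repeated indexes are incremented only once

with_at = numpy.zeros(10, dtype=numpy.int64)
numpy.add.at(with_at, indexes, 1)       # repeated indexes accumulate

counted = numpy.bincount(indexes, minlength=10)

print(plus_equals)
print(with_at)
print(counted)
```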

**Saving your work:** Numpy has its own file format: a `.npy` file is just raw data plus the `dtype` and `shape` needed to interpret it, and a `.npz` file is a ZIP archive of `.npy` files.

In [None]:
numpy.savez("output.npz", one=a, two=imagedata)

In [None]:
file = numpy.load("output.npz")

In [None]:
file["one"]

In [None]:
file["two"]

This format is simple but pretty efficient. (Use `numpy.savez_compressed` to also compress the arrays.)

For more options and interoperability, use the h5py or pytables libraries to save/load HDF5.

This is a good place to stop because we'll cover other libraries in the Numpy ecosystem after lunch.