# NumPy Arrays

NumPy (module ``numpy``) provides an array datatype with vectorized operations, similar to MATLAB or IDL.

In [1]:
import numpy as np

Create two NumPy arrays containing 5 elements each. The ``numpy`` module contains a number of functions for generating common arrays:

In [2]:
x = np.arange(5)
x

array([0, 1, 2, 3, 4])

In [7]:
y = np.ones(5)
y

array([1., 1., 1., 1., 1.])

Operations are vectorized, so we can do arithmetic with arrays as we would with scalar variables, as long as the shapes match or can be broadcast to a common shape.

In [8]:
x - (y+0.005) * 3

array([-3.015, -2.015, -1.015, -0.015,  0.985])
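As a minimal sketch of what "vectorized" means here, the snippet below compares the expression above with an explicit element-by-element loop, and shows what happens when the shapes cannot be combined. The names ``vectorized``, ``looped``, and ``mismatch_raised`` are illustrative only.

```python
import numpy as np

x = np.arange(5)
y = np.ones(5)

# The vectorized expression from the cell above.
vectorized = x - (y + 0.005) * 3

# The same arithmetic written as an explicit loop over the indices.
looped = np.array([x[i] - (y[i] + 0.005) * 3 for i in range(len(x))])

# Both approaches give the same values.
print(np.allclose(vectorized, looped))

# Arrays of length 5 and 4 cannot be broadcast together, so NumPy raises.
try:
    x + np.ones(4)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```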

You can try to put different types of objects into an array, and NumPy will pick a data type that can hold them all. 

The results might not be quite what you expect!

In [9]:
np.array([3,3,"string",5,5])

array(['3', '3', 'string', '5', '5'], dtype='<U21')

In [10]:
_[3] * 5

'55555'
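The ``'55555'`` result follows from the array holding strings: element 3 is the text ``'5'``, and multiplying a string repeats it. A short sketch of how to inspect this and recover the number (the variable names are illustrative):

```python
import numpy as np

mixed = np.array([3, 3, "string", 5, 5])

# dtype.kind is 'U' for Unicode strings: every element was converted to text.
print(mixed.dtype.kind)

elem = mixed[3]
# NumPy string scalars behave like Python str, so * repeats the text.
repeated = elem * 5

# Converting back to int restores ordinary arithmetic.
product = int(elem) * 5
print(repeated, product)
```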

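More sensibly, when integers and floats are mixed, NumPy promotes to a floating-point type to avoid losing precision. You can override that choice with the ``dtype`` argument, at a cost: requesting ``float16`` below rounds 6.66666666666 to 6.668.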

In [12]:
z = np.array([5,6.66666666666,7,8,9], dtype=np.float16)
(z, z.dtype)

(array([5.   , 6.668, 7.   , 8.   , 9.   ], dtype=float16), dtype('float16'))
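To make the difference between the default choice and an explicit ``dtype`` concrete, here is a small sketch comparing the two. The names ``default`` and ``half`` are illustrative.

```python
import numpy as np

values = [5, 6.66666666666, 7, 8, 9]

# Without dtype, mixed ints and floats become float64.
default = np.array(values)
print(default.dtype)

# With dtype=np.float16, each value is rounded to half precision.
half = np.array(values, dtype=np.float16)
print(half.dtype)

# Rounding error introduced for the second element in each case.
default_error = abs(float(default[1]) - 6.66666666666)
half_error = abs(float(half[1]) - 6.66666666666)
print(default_error, half_error)
```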

NumPy arrays support many of the same operations as ordinary Python lists:

In [13]:
sorted(x - y * 3)

[-3.0, -2.0, -1.0, 0.0, 1.0]

...except the data type must match! A NumPy array only holds values of a single data type.

* This allows them to be packed efficiently in memory like C arrays

In [14]:
y.dtype

dtype('float64')
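Because every element has the same type, the memory footprint is simply the element size times the element count. A brief sketch using the standard ``itemsize`` and ``nbytes`` attributes (``as_int`` is an illustrative name):

```python
import numpy as np

y = np.ones(5)

# Each float64 element takes 8 bytes, so 5 elements take 40 bytes.
print(y.dtype, y.itemsize, y.nbytes)

# astype returns a converted copy; int32 elements take 4 bytes each.
as_int = y.astype(np.int32)
print(as_int.dtype, as_int.itemsize, as_int.nbytes)
```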

## Speed comparison

Math with NumPy arrays is much faster and more intuitive than the equivalent native Python operations.

Consider the function $y = 1.324\cdot a - 12.99\cdot b + 1$

In pure Python we would define:

In [15]:
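def py_add(a, b):
    # Both inputs must have the same length
    assert len(a) == len(b)
    # Preallocate the result list, then fill it element by element
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = 1.324 * a[i] - 12.99 * b[i] + 1
    return c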

Using NumPy we could instead define:

In [16]:
def np_add(a, b):
    return 1.324 * a - 12.99 * b + 1
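Before timing the two functions, it is worth checking that they compute the same thing. The sketch below repeats both definitions so that it runs on its own, and compares them on small illustrative inputs (``a_small`` and ``b_small`` are not part of the original notebook).

```python
import numpy as np

def py_add(a, b):
    assert len(a) == len(b)
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = 1.324 * a[i] - 12.99 * b[i] + 1
    return c

def np_add(a, b):
    return 1.324 * a - 12.99 * b + 1

# Small test inputs, chosen only for illustration.
a_small = np.arange(5)
b_small = np.linspace(-1.0, 1.0, 5)

py_result = py_add(a_small, b_small)   # a Python list
np_result = np_add(a_small, b_small)   # a NumPy array

# The two implementations should agree up to floating-point tolerance.
agree = np.allclose(py_result, np_result)
print(agree)
```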

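A third option is a list comprehension over the zipped inputs, ``[1.324 * x - 12.99 * y + 1 for x, y in zip(a, b)]``. It needs the input arrays to exist first, so it is timed further down alongside the two functions.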

Now let's create a couple of very large arrays to work with:

In [18]:
a = np.arange(1000000)
b = np.random.randn(1000000)
len(a)

1000000

In [19]:
b[0:20]

array([-1.19098327, -1.02221654, -0.43152622,  0.01128563, -0.21292989,
       -0.91963177,  0.14348645,  0.36103372,  0.20562992,  0.98604131,
        0.83426634,  0.62177987, -0.0187549 ,  1.28115168,  0.44128573,
       -0.91314439, -1.06721347,  0.88588316, -0.77604087,  1.58730116])

Use the IPython magic function ``%timeit`` to compare the performance of these approaches.

In [20]:
%timeit py_add(a,b)

4.01 s ± 76.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [21]:
%timeit np_add(a,b)

14 ms ± 282 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [22]:
%timeit [ 1.324 * x - 12.99 * y + 1 for x,y in zip(a,b) ]

3.78 s ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
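``%timeit`` is only available inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library ``timeit`` module can be used instead. The array size here is reduced to 100,000 (an assumption made to keep the run short), so the absolute numbers will differ from those above.

```python
import timeit
import numpy as np

def py_add(a, b):
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = 1.324 * a[i] - 12.99 * b[i] + 1
    return c

def np_add(a, b):
    return 1.324 * a - 12.99 * b + 1

# Smaller arrays than in the notebook, to keep the script quick.
n = 100_000
a = np.arange(n)
b = np.random.randn(n)

# Take the best of three single runs for each implementation.
py_time = min(timeit.repeat(lambda: py_add(a, b), number=1, repeat=3))
np_time = min(timeit.repeat(lambda: np_add(a, b), number=1, repeat=3))

print(f"py_add: {py_time:.4f} s")
print(f"np_add: {np_time:.4f} s")
print(f"ratio:  {py_time / np_time:.0f}x")
```

Taking the minimum over repeats reduces the influence of other processes on the measurement.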


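Memory usage can be compared in the same way with the ``%memit`` magic from the ``memory_profiler`` extension:

In [24]: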
import memory_profiler
%load_ext memory_profiler

In [25]:
%memit py_add(a,b)

peak memory: 114.51 MiB, increment: 30.88 MiB


In [26]:
%memit np_add(a,b)

peak memory: 91.66 MiB, increment: 7.63 MiB
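The two increments can be roughly accounted for by the size of each result. The sketch below is a back-of-the-envelope estimate, assuming 64-bit CPython; ``result`` stands in for the output of ``np_add``, and the list estimate assumes one separate float object per element.

```python
import sys
import numpy as np

n = 1_000_000

# Stand-in for the np_add output: one million float64 values.
result = np.ones(n)
array_mib = result.nbytes / 2**20

# A list stores a pointer per element plus a separate float object for each.
as_list = result.tolist()
list_mib = (sys.getsizeof(as_list) + n * sys.getsizeof(1.0)) / 2**20

print(f"array: {array_mib:.2f} MiB")
print(f"list (estimate): {list_mib:.2f} MiB")
```

The array figure, 8,000,000 bytes or about 7.63 MiB, matches the increment reported for ``np_add``; the list estimate is of the same order as the increment reported for ``py_add``.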
