# NumPy Arrays

NumPy (module ``numpy``) provides an array datatype with vectorized operations (similar to MATLAB or IDL).

In [1]:
import numpy as np

Create two NumPy arrays containing 5 elements each. The ``numpy`` module contains a number of functions for generating common arrays:

In [2]:
x = np.arange(5)
x

array([0, 1, 2, 3, 4])

In [3]:
y = np.ones(5)
y, y.dtype

(array([1., 1., 1., 1., 1.]), dtype('float64'))

Operations are vectorized, so we can do arithmetic with whole arrays (as long as their shapes are compatible!) just as we would with scalar variables.

In [4]:
x - (y+0.005) * 3

array([-3.015, -2.015, -1.015, -0.015,  0.985])
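To make the shape requirement concrete, here is a minimal sketch of what happens with a scalar, with a matching array, and with a mismatched array. The names ``w``, ``shifted``, ``total``, and ``mismatch_error`` are illustrative and are not part of the notebook above.

```python
import numpy as np

x = np.arange(5)          # shape (5,)
y = np.ones(5)            # shape (5,)
w = np.ones(4)            # shape (4,), deliberately mismatched

# A scalar is broadcast across every element of the array.
shifted = x + 10

# Arrays with equal shapes combine element by element.
total = x + y

# Mismatched shapes raise a ValueError instead of silently truncating.
try:
    x + w
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(shifted, total, type(mismatch_error).__name__)
```

The error is a useful safety net: a length mismatch usually indicates a bug upstream, and NumPy refuses to guess which elements should line up.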

You can try to put different types of objects into an array, and NumPy will pick a data type that can hold them all. 

The results might not be quite what you expect!

In [5]:
np.array([3,3,"string",5,5])

array(['3', '3', 'string', '5', '5'], dtype='<U21')

In [6]:
_[0] * 5

'33333'

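More sensibly, for numeric input NumPy will choose a type that avoids losing precision. You can also request a type explicitly with the ``dtype`` argument (note that ``np.float128`` is not available on every platform):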

In [7]:
z = np.array([5,6.66666666666,7,8,9], dtype=np.float128)
(z, z.dtype)

(array([5.        , 6.66666667, 7.        , 8.        , 9.        ],
       dtype=float128),
 dtype('float128'))
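When no ``dtype`` is given, the promotion can be observed directly. The sketch below uses small made-up inputs (``ints``, ``mixed``, ``cplx``) and ``np.result_type`` to show how the chosen type widens as the inputs become more general; the exact integer width depends on the platform, so only the kind of type is checked here.

```python
import numpy as np

ints = np.array([1, 2, 3])            # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])         # int + float -> float64
cplx = np.array([1, 2.5, 3 + 0j])     # any complex value -> complex128

# np.result_type reports the dtype that combining two dtypes would produce.
promoted = np.result_type(np.int32, np.float64)

print(ints.dtype, mixed.dtype, cplx.dtype, promoted)
```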

NumPy arrays support many of the same operations as ordinary Python lists:

In [8]:
sorted(x - y * 3)

[-3.0, -2.0, -1.0, 0.0, 1.0]
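Not every list idiom behaves identically, though. The following sketch, with an illustrative array named ``values``, shows a few operations that carry over unchanged and one (``+``) that means something different for arrays than for lists.

```python
import numpy as np

values = np.array([3, 1, 2])

n = len(values)              # len() works as it does for a list
first_two = values[:2]       # slicing works too (and returns an array)
as_list = sorted(values)     # the built-in sorted() returns a plain list
as_array = np.sort(values)   # np.sort returns a new ndarray

# '+' differs: lists concatenate, arrays add element by element.
list_sum = [3, 1, 2] + [10, 10, 10]
array_sum = values + np.array([10, 10, 10])

print(len(list_sum), array_sum)
```

If the result should stay an array for further vectorized work, ``np.sort`` is usually the better choice than ``sorted``.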

...except the data type must match! A NumPy array only holds values of a single data type.

* This allows the values to be packed efficiently in memory, like C arrays.

In [None]:
y.dtype
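The fixed element type has two visible consequences, sketched below under the assumption of default dtypes: every element occupies the same number of bytes, and a value of another type is converted when it is stored. The names ``bytes_per_element`` and ``total_bytes`` are illustrative.

```python
import numpy as np

y = np.ones(5)                     # float64 by default

bytes_per_element = y.itemsize     # 8 bytes for a float64
total_bytes = y.nbytes             # itemsize * number of elements

# Assigning a value of another type converts it to the array's dtype.
x = np.arange(5)                   # integer array
x[0] = 7.9                         # the fractional part is dropped on storage

print(bytes_per_element, total_bytes, x)
```

The silent conversion on assignment is worth keeping in mind: create the array with a floating-point ``dtype`` if it will later need to hold fractional values.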

## Speed comparison

Math with NumPy arrays is much faster and more intuitive than the equivalent native Python operations.

Consider the function $y = 1.324\cdot a - 12.99\cdot b + 1$

In pure Python we would define:

In [9]:
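def py_add(a, b):
    # Both inputs must have the same length for element-wise arithmetic.
    assert len(a) == len(b)
    c = [0] * len(a)
    # Compute the result one element at a time.
    for i in range(len(a)):
        c[i] = 1.324 * a[i] - 12.99 * b[i] + 1
    return c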

Using NumPy we could instead define:

In [10]:
def np_add(a, b):
    return 1.324 * a - 12.99 * b + 1

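For comparison, here is a third version in pure Python that uses a list comprehension instead of an explicit loop:

In [11]: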
def comp_add(a,b):
    return [ 1.324 * A - 12.99 * B + 1 for A,B in zip(a,b) ]

Now let's create a couple of very large arrays to work with:

In [12]:
a = np.arange(1000000)
b = np.random.randn(1000000)
len(a)

1000000

In [13]:
b[0:20]

array([ 0.93989476,  0.03073607, -0.18596756,  0.3745804 ,  0.22360301,
        0.63976777,  1.47903519, -0.29962935,  0.70266345,  0.34088602,
        0.28777545,  0.56515756,  0.38781758, -1.5509335 , -0.67734652,
        1.13808735, -0.61758285, -0.78183673,  1.10308637, -1.62331313])

Use the magic function ``%timeit`` to test the performance of all three approaches.

In [14]:
%timeit py_add(a,b)

2.81 s ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
%timeit np_add(a,b)

4.25 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [16]:
%timeit comp_add(a,b)

2.66 s ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
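``%timeit`` is only available inside IPython or Jupyter. Outside a notebook, a similar comparison can be sketched with the standard-library ``timeit`` module. The array size ``n`` and the repeat count are arbitrary choices made here so the script finishes quickly, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

def py_add(a, b):
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = 1.324 * a[i] - 12.99 * b[i] + 1
    return c

def np_add(a, b):
    return 1.324 * a - 12.99 * b + 1

n = 50_000  # smaller than the arrays above so the script finishes quickly
a = np.arange(n)
b = np.random.randn(n)

# Both versions should agree before timing is meaningful.
same = np.allclose(py_add(a, b), np_add(a, b))

# timeit.timeit returns the total seconds for `number` calls.
t_py = timeit.timeit(lambda: py_add(a, b), number=3) / 3
t_np = timeit.timeit(lambda: np_add(a, b), number=3) / 3
print(f"loop: {t_py:.4f} s, NumPy: {t_np:.6f} s")
```

Checking that the two functions return the same values first guards against comparing the speed of code that computes different things.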


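Memory use can be compared in the same way. The ``memory_profiler`` extension provides a ``%memit`` magic that reports peak memory:

In [17]: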
import memory_profiler
%load_ext memory_profiler

In [18]:
%memit py_add(a,b)

peak memory: 113.00 MiB, increment: 30.06 MiB


In [19]:
%memit np_add(a,b)

peak memory: 91.15 MiB, increment: 7.63 MiB
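One plausible reason for the difference is storage layout: a list holds pointers to separate Python objects, while an array holds raw values side by side. The sketch below estimates both footprints with ``sys.getsizeof`` and ``nbytes``. It is a rough illustration on made-up data, not a reproduction of the ``%memit`` measurements above, and the per-object sizes are CPython implementation details.

```python
import sys
import numpy as np

n = 100_000
as_list = [float(i) for i in range(n)]
as_array = np.arange(n, dtype=np.float64)

# A list stores pointers; each float is a separate Python object.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)

# An array stores the raw 8-byte values contiguously.
array_bytes = as_array.nbytes

print(f"list: {list_bytes} bytes, array: {array_bytes} bytes")
```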
