# NumPy Basics: Arrays and Vectorized Computation

In [1]:
import numpy as np

In [2]:
my_arr = np.arange(1000000)

In [3]:
my_list = list(range(1000000))

In [4]:
%time for _ in range(10): my_arr2 = my_arr * 2

CPU times: user 21 ms, sys: 19.4 ms, total: 40.4 ms
Wall time: 72.6 ms


In [5]:
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

CPU times: user 705 ms, sys: 181 ms, total: 886 ms
Wall time: 901 ms


## 4.1 The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray,
which is a fast, flexible container for large datasets in Python. Arrays enable you to
perform mathematical operations on whole blocks of data using similar syntax to the
equivalent operations between scalar elements.

First import NumPy and generate a small
array of random data.

In [6]:
import numpy as np

In [8]:
data = np.random.randn(2, 3)
data

array([[-0.44007968, -0.64415325, -0.05205696],
       [-0.74277142,  2.05566923,  0.80470463]])

I then write mathematical operations with data.

In [9]:
data * 10

array([[ -4.40079684,  -6.44153248,  -0.52056956],
       [ -7.42771421,  20.55669229,   8.04704635]])

In [10]:
data + data

array([[-0.88015937, -1.2883065 , -0.10411391],
       [-1.48554284,  4.11133846,  1.60940927]])

An ndarray is a generic multidimensional container for homogeneous data; that is, all
of the elements must be the same type.
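As a small sketch of what homogeneity means in practice: when the input mixes integers and floats, np.array settles on one dtype that can hold every element. The variable name mixed is made up for this illustration.

```python
import numpy as np

# Integers and a float in the same list: every element is stored as float64.
mixed = np.array([1, 2, 3.5])

print(mixed.dtype)   # float64
print(mixed[0])      # the integer 1 is now stored as the float 1.0
```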

Every array has a shape, a tuple indicating the
size of each dimension, and a dtype, an object describing the data type of the array.

In [11]:
data.shape

(2, 3)

In [12]:
data.dtype

dtype('float64')

Whenever you see “array,” “NumPy array,” or “ndarray” in the text,
with few exceptions they all refer to the same thing: the ndarray
object.

### Creating ndarrays

The easiest way to create an array is to use the array function. This accepts any
sequence-like object (including other arrays) and produces a new NumPy array containing
the passed data.

In [14]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

array([ 6. ,  7.5,  8. ,  0. ,  1. ])

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional
array.

In [15]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape
inferred from the data. We can confirm this by inspecting the ndim and shape
attributes.

In [16]:
arr2.ndim

2

In [17]:
arr2.shape

(2, 4)

Unless explicitly specified (more on this later), np.array tries to infer a good data
type for the array that it creates. The data type is stored in a special dtype metadata
object; for example, in the previous two examples we have:

In [18]:
arr1.dtype

dtype('float64')

In [19]:
arr2.dtype

dtype('int64')

In addition to np.array, there are a number of other functions for creating new
arrays. As examples, zeros and ones create arrays of 0s or 1s, respectively, with a
given length or shape.

In [20]:
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [21]:
np.zeros((3, 6))

array([[ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])
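The zeros calls above have direct counterparts for ones. The sketch below also uses np.zeros_like, which is not discussed in the text here; I include it only as an assumption about what is convenient when you already have an array of the right shape.

```python
import numpy as np

ones_1d = np.ones(4)          # length-4 array of 1.0 values
ones_2d = np.ones((2, 3))     # pass a tuple to get a 2 x 3 array

# zeros_like copies the shape and dtype of an existing array, filled with 0.
zeros_like_2d = np.zeros_like(ones_2d)

print(ones_2d.shape, ones_2d.dtype)
print(zeros_like_2d)
```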

empty creates an array without initializing its values to any particular
value. To create a higher dimensional array with these methods, pass a tuple
for the shape.

In [22]:
np.empty((2, 3, 2))

array([[[ -2.68156159e+154,  -2.68156159e+154],
        [  2.96439388e-323,   0.00000000e+000],
        [  5.74020225e+180,   1.16095484e-028]],

       [[  6.97283618e+228,   3.68008723e-110],
        [  6.48224638e+170,   3.67145870e+228],
        [  1.09826778e+295,   8.38735347e-309]]])

It’s not safe to assume that np.empty will return an array of all
zeros. In some cases, it may return uninitialized “garbage” values.

arange is an array-valued version of the built-in Python range function.

In [23]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
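Like range, arange also accepts start, stop, and step arguments. A minimal sketch, with illustrative variable names:

```python
import numpy as np

# start=0, stop=10 (exclusive), step=2, mirroring range(0, 10, 2)
evens = np.arange(0, 10, 2)
from_range = np.array(range(0, 10, 2))

print(evens)                              # [0 2 4 6 8]
print(np.array_equal(evens, from_range))  # True
```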

![alt text](resources/ch_4_1.png "Logo Title Text 1")


### Data types for ndarrays

The data type or dtype is a special object containing the information (or metadata,
data about data) the ndarray needs to interpret a chunk of memory as a particular
type of data.

In [24]:
arr1 = np.array([1, 2, 3], dtype=np.float64)

In [25]:
arr2 = np.array([1, 2, 3], dtype=np.int32)

In [26]:
arr1.dtype

dtype('float64')

In [27]:
arr2.dtype

dtype('int32')

![alt text](resources/ch_4_2.png "Logo Title Text 1")

You can explicitly convert or cast an array from one dtype to another using ndarray’s
astype method.

In [34]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

dtype('int64')

In [33]:
float_arr = arr.astype(np.float64)
float_arr.dtype

dtype('float64')

In this example, integers were cast to floating point. If I cast some floating-point
numbers to be of integer dtype, the decimal part will be truncated.

In [35]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr

array([  3.7,  -1.2,  -2.6,   0.5,  12.9,  10.1])

In [36]:
arr.astype(np.int32)

array([ 3, -1, -2,  0, 12, 10], dtype=int32)
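Truncation drops the fractional part, so negative values move toward zero. If you want a different rounding rule, one option is to round first and cast afterward. The sketch below compares the three; np.floor and np.rint are my choice for illustration and are not covered in the text above.

```python
import numpy as np

values = np.array([3.7, -1.2, -2.6, 0.5])

truncated = values.astype(np.int64)            # fractional part dropped
floored = np.floor(values).astype(np.int64)    # round down, then cast
rounded = np.rint(values).astype(np.int64)     # round to nearest, ties to even

print(truncated)
print(floored)
print(rounded)
```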

If you have an array of strings representing numbers, you can use astype to convert
them to numeric form.

In [37]:
numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.string_)
numeric_strings

array([b'1.25', b'-9.6', b'42'], 
      dtype='|S4')

In [39]:
numeric_strings.astype(float)

array([  1.25,  -9.6 ,  42.  ])

If casting fails for some reason (like a string that cannot be converted to
float64), a ValueError is raised. Here I was a bit lazy and wrote float instead
of np.float64; NumPy aliases the Python types to its own equivalent dtypes.
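A short sketch of the failure case, assuming an array in which one string is not a valid number. The names bad_strings and error_message are hypothetical.

```python
import numpy as np

bad_strings = np.array(["1.25", "not a number"])
error_message = None

try:
    bad_strings.astype(float)
except ValueError as exc:
    # The cast is abandoned as soon as one element cannot be parsed.
    error_message = str(exc)

print("conversion failed:", error_message)
print(np.dtype(float) == np.float64)   # the Python type maps to float64
```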

It’s important to be cautious when using the numpy.string_ type
(an alias of numpy.bytes_; the numpy.string_ name was removed in NumPy 2.0),
as string data in NumPy is fixed size and may truncate input
without warning. pandas has more intuitive out-of-the-box behavior
on non-numeric data.
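To make the truncation concrete, here is a minimal sketch with a fixed-size byte-string array. The array name codes is illustrative.

```python
import numpy as np

# Two 2-byte strings, so the inferred dtype holds at most 2 bytes per element.
codes = np.array([b"ab", b"cd"])
print(codes.dtype)

# Assigning a longer value cuts it to fit; no warning or error is raised.
codes[0] = b"abcdef"
print(codes[0])   # b'ab'
```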

You can also use another array’s dtype attribute.

In [40]:
int_array = np.arange(10)

In [41]:
calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)

In [42]:
int_array.astype(calibers.dtype)

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

There are shorthand type code strings you can also use to refer to a dtype.

In [44]:
empty_uint32 = np.empty(8, dtype="u4")
empty_uint32

array([         0, 1075314688,          0, 1075707904,          0,
       1075838976,          0, 1072693248], dtype=uint32)
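The type codes are simply another spelling of the same dtype objects. A quick sketch comparing a few codes with their full names:

```python
import numpy as np

# Each shorthand code compares equal to the corresponding NumPy type.
print(np.dtype("u4") == np.uint32)    # True
print(np.dtype("f8") == np.float64)   # True
print(np.dtype("i2") == np.int16)     # True

# The digit is the element size in bytes.
small_ints = np.zeros(3, dtype="i2")
print(small_ints.dtype, small_ints.itemsize)   # int16 2
```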

Calling astype always creates a new array (a copy of the data), even
if the new dtype is the same as the old dtype.
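One way to check this, sketched with astype's default arguments and np.shares_memory (a helper I am using here for illustration):

```python
import numpy as np

original = np.arange(5)
same_dtype_copy = original.astype(original.dtype)   # dtype unchanged

# Writing to the result leaves the original untouched.
same_dtype_copy[0] = 99
print(original[0])                                   # 0
print(np.shares_memory(original, same_dtype_copy))   # False
```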

### Arithmetic with NumPy Arrays

Arrays are important because they enable you to express batch operations on data
without writing any for loops. NumPy users call this vectorization. Any arithmetic
operation between equal-size arrays is applied element-wise.

In [45]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

In [46]:
arr * arr

array([[  1.,   4.,   9.],
       [ 16.,  25.,  36.]])

In [47]:
arr - arr

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in
the array.

In [48]:
1 / arr

array([[ 1.        ,  0.5       ,  0.33333333],
       [ 0.25      ,  0.2       ,  0.16666667]])

In [49]:
arr ** 0.5

array([[ 1.        ,  1.41421356,  1.73205081],
       [ 2.        ,  2.23606798,  2.44948974]])

Comparisons between arrays of the same size yield boolean arrays.

In [50]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2

array([[  0.,   4.,   1.],
       [  7.,   2.,  12.]])

In [51]:
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]], dtype=bool)

Evaluating operations between differently sized arrays is called broadcasting and will be discussed
in more detail in Appendix A.
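As a brief preview, here is a minimal sketch of one common case: adding a length-3 array to a 2 × 3 array. The names table and row are illustrative.

```python
import numpy as np

table = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                  # shape (3,)

# row is combined with each row of table in turn.
shifted = table + row
print(shifted)
```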

### Basic Indexing and Slicing

NumPy array indexing is a rich topic, as there are many ways you may want to select
a subset of your data or individual elements. One-dimensional arrays are simple; on
the surface they act similarly to Python lists.

In [52]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [53]:
arr[5]

5

In [54]:
arr[5:8]

array([5, 6, 7])

In [55]:
arr[5:8] = 12
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is
propagated (or broadcast, henceforth) to the entire selection.

An important first distinction
from Python’s built-in lists is that array slices are views on the original array.
This means that the data is not copied, and any modifications to the view will be
reflected in the source array.

In [56]:
arr_slice = arr[5:8]
arr_slice

array([12, 12, 12])

In [58]:
arr_slice[1] = 12345

In [59]:
arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,     9])

The “bare” slice [:] will assign to all values in an array.

In [60]:
arr_slice[:] = 64
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

If you are new to NumPy, you might be surprised by this, especially if you have used
other array programming languages that copy data more eagerly. As NumPy has been
designed to be able to work with very large arrays, you could imagine performance
and memory problems if NumPy insisted on always copying data.

If you want a copy of a slice of an ndarray instead of a view, you
will need to explicitly copy the array—for example,
arr[5:8].copy().
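The difference between a view, an explicit copy, and a Python list slice can be sketched side by side. The variable names are illustrative, and np.shares_memory is used only to confirm what the assignments already show.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]             # shares memory with arr
copied = arr[5:8].copy()    # independent data

view[0] = -1      # visible in arr
copied[1] = -2    # not visible in arr
print(arr)        # [ 0  1  2  3  4 -1  6  7  8  9]

# A slice of a built-in list is always a new list.
py_list = list(range(10))
list_slice = py_list[5:8]
list_slice[0] = -1
print(py_list[5])   # 5
```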

With higher dimensional arrays, you have many more options. In a two-dimensional
array, the elements at each index are no longer scalars but rather one-dimensional
arrays.

In [61]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[2]

array([7, 8, 9])

Pass a comma-separated list of indices to select individual elements.

In [62]:
arr2d[0, 2]

3
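Indexing one axis at a time reaches the same element, so the two spellings below are equivalent:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

chained = arr2d[0][2]   # take row 0, then element 2 of that row
direct = arr2d[0, 2]    # same element with a single index expression

print(chained, direct)  # 3 3
```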

See Figure 4-1 for an illustration of indexing on a two-dimensional array. I find it
helpful to think of axis 0 as the “rows” of the array and axis 1 as the “columns.”

![Figure 4-1](resources/ch_4_3.png)

In multidimensional arrays, if you omit later indices, the returned object will be a
lower dimensional ndarray consisting of all the data along the higher dimensions. So
in the 2 × 2 × 3 array arr3d:

In [63]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

arr3d[0] is a 2 × 3 array.

In [64]:
arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])

Both scalar values and arrays can be assigned to arr3d[0].

In [65]:
old_values = arr3d[0].copy()

In [66]:
arr3d[0] = 42
arr3d

array([[[42, 42, 42],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [68]:
arr3d[0] = old_values
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0),
forming a 1-dimensional array.

In [69]:
arr3d[1, 0]

array([7, 8, 9])

This expression is the same as though we had indexed in two steps.

In [70]:
x = arr3d[1]
x

array([[ 7,  8,  9],
       [10, 11, 12]])

In [71]:
x[0]

array([7, 8, 9])

Note that in all of these cases where subsections of the array have been selected, the
returned arrays are views.

#### Indexing with slices

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the
familiar syntax.

In [75]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [76]:
arr[1:6]

array([1, 2, 3, 4, 5])

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit
different.

In [78]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [79]:
arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a
range of elements along an axis. It can be helpful to read the expression arr2d[:2] as
“select the first two rows of arr2d.”

You can pass multiple slices just like you can pass multiple indexes.

In [80]:
arr2d[:2, 1:]

array([[2, 3],
       [5, 6]])

When slicing like this, you always obtain array views of the same number of dimensions.
By mixing integer indexes and slices, you get lower dimensional slices.

For example, I can select the second row but only the first two columns like so.

In [81]:
arr2d[1, :2]

array([4, 5])

Similarly, I can select the third column but only the first two rows like so.

In [82]:
arr2d[:2, 2]

array([3, 6])
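The effect on dimensionality is easy to check: an integer index removes its axis, while a slice of length one keeps it. A small sketch:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row_1d = arr2d[1, :2]     # integer on axis 0: result has shape (2,)
row_2d = arr2d[1:2, :2]   # slice on axis 0: result has shape (1, 2)

print(row_1d.shape, row_2d.shape)
```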

See Figure 4-2 for an illustration.

![Figure 4-2](resources/ch_4_4.png)

Note that a colon by itself means to take the entire
axis, so you can slice only higher dimensional axes by doing:

In [83]:
arr2d[:, :1]

array([[1],
       [4],
       [7]])

Of course, assigning to a slice expression assigns to the whole selection.

In [84]:
arr2d[:2, 1:] = 0
arr2d

array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])

### Boolean indexing

Let’s consider an example where we have some data in an array and an array of names
with duplicates. I’m going to use the randn function in numpy.random to generate
some random normally distributed data.

Suppose each name corresponds to a row in the data array and we wanted to select
all the rows with corresponding name 'Bob'. Like arithmetic operations, comparisons
(such as ==) with arrays are also vectorized. Thus, comparing names with the
string 'Bob' yields a boolean array.

This boolean array can be passed when indexing the array.

In [95]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], 
      dtype='<U4')

In [96]:
data = np.random.randn(7, 4)
data

array([[ 0.14135778,  0.55414834,  1.14182985,  0.30951065],
       [-0.93105691,  0.33365586,  0.89711331,  0.0079815 ],
       [-0.29028495,  0.78833691,  0.97267945,  0.39462022],
       [-1.93520887,  1.14780943, -0.29890964, -0.71611126],
       [ 0.76307273, -0.68972174,  0.1897374 , -2.02428918],
       [-0.96102013, -0.04776227,  0.69019307,  0.29040062],
       [-0.09173222,  1.39013853,  0.30602573,  0.3648359 ]])

In [97]:
names == "Bob"

array([ True, False, False,  True, False, False, False], dtype=bool)

In [98]:
data[names == "Bob"]

array([[ 0.14135778,  0.55414834,  1.14182985,  0.30951065],
       [-1.93520887,  1.14780943, -0.29890964, -0.71611126]])

The boolean array must be of the same length as the array axis it’s indexing. You can
even mix and match boolean arrays with slices or integers (or sequences of integers;
more on this later). 

Boolean selection raises an IndexError in recent NumPy versions if the boolean
array is not the correct length; older releases did not fail in this case, so I
still recommend care when using this feature.
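Behavior for a mismatched mask depends on the NumPy version. The sketch below assumes a recent release, where a mask whose length differs from the axis it indexes raises an IndexError. The names short_mask and mismatch_error are hypothetical.

```python
import numpy as np

data = np.zeros((7, 4))
short_mask = np.array([True, False, True])   # length 3, but axis 0 has length 7

try:
    data[short_mask]
    mismatch_error = None
except IndexError as exc:
    # Recent NumPy refuses a mask that does not match the axis length.
    mismatch_error = exc

print(type(mismatch_error).__name__)
```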

In these examples, I select from the rows where names == 'Bob' and index the columns,
too.

In [99]:
data[names == "Bob", 2:]

array([[ 1.14182985,  0.30951065],
       [-0.29890964, -0.71611126]])

In [100]:
data[names == "Bob", 3]

array([ 0.30951065, -0.71611126])

To select everything but 'Bob', you can either use != or negate the condition using ~.

In [101]:
names != "Bob"

array([False,  True,  True, False,  True,  True,  True], dtype=bool)

In [102]:
data[~(names == "Bob")]

array([[-0.93105691,  0.33365586,  0.89711331,  0.0079815 ],
       [-0.29028495,  0.78833691,  0.97267945,  0.39462022],
       [ 0.76307273, -0.68972174,  0.1897374 , -2.02428918],
       [-0.96102013, -0.04776227,  0.69019307,  0.29040062],
       [-0.09173222,  1.39013853,  0.30602573,  0.3648359 ]])

The ~ operator can be useful when you want to invert a general condition:

In [103]:
cond = names == "Bob"

In [104]:
data[~cond]

array([[-0.93105691,  0.33365586,  0.89711331,  0.0079815 ],
       [-0.29028495,  0.78833691,  0.97267945,  0.39462022],
       [ 0.76307273, -0.68972174,  0.1897374 , -2.02428918],
       [-0.96102013, -0.04776227,  0.69019307,  0.29040062],
       [-0.09173222,  1.39013853,  0.30602573,  0.3648359 ]])

To select two of the three names by combining multiple boolean conditions, use
boolean arithmetic operators like & (and) and | (or).

In [106]:
mask = (names == "Bob") | (names == "Will")
mask

array([ True, False,  True,  True,  True, False, False], dtype=bool)

In [107]:
data[mask]

array([[ 0.14135778,  0.55414834,  1.14182985,  0.30951065],
       [-0.29028495,  0.78833691,  0.97267945,  0.39462022],
       [-1.93520887,  1.14780943, -0.29890964, -0.71611126],
       [ 0.76307273, -0.68972174,  0.1897374 , -2.02428918]])

Selecting data from an array by boolean indexing always creates a copy of the data,
even if the returned array is unchanged.

The Python keywords and and or do not work with boolean arrays.
Use & (and) and | (or) instead.
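A minimal sketch of why the keywords fail: Python's or asks each operand for a single truth value, which an array with more than one element cannot provide. The variable names are illustrative.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will"])
is_bob = names == "Bob"
is_will = names == "Will"

either = is_bob | is_will    # element-wise or
print(either)

try:
    is_bob or is_will        # needs one True/False for the whole array
    keyword_error = None
except ValueError as exc:
    keyword_error = exc

print(type(keyword_error).__name__)   # ValueError
```

Note that & and | bind more tightly than comparisons such as ==, which is why each condition is wrapped in parentheses when they are written inline.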

Setting values with boolean arrays works in a common-sense way. To set all of the
negative values in data to 0 we need only do:

In [108]:
data[data < 0] = 0
data

array([[ 0.14135778,  0.55414834,  1.14182985,  0.30951065],
       [ 0.        ,  0.33365586,  0.89711331,  0.0079815 ],
       [ 0.        ,  0.78833691,  0.97267945,  0.39462022],
       [ 0.        ,  1.14780943,  0.        ,  0.        ],
       [ 0.76307273,  0.        ,  0.1897374 ,  0.        ],
       [ 0.        ,  0.        ,  0.69019307,  0.29040062],
       [ 0.        ,  1.39013853,  0.30602573,  0.3648359 ]])

Setting whole rows or columns using a one-dimensional boolean array is also easy.

In [109]:
data[names != "Joe"] = 7
data

array([[ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.        ,  0.33365586,  0.89711331,  0.0079815 ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.        ,  0.        ,  0.69019307,  0.29040062],
       [ 0.        ,  1.39013853,  0.30602573,  0.3648359 ]])

As we will see later, these types of operations on two-dimensional data are convenient
to do with pandas.

### Fancy indexing

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.

Suppose we had an 8 × 4 array:

In [115]:
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr

array([[ 0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.],
       [ 5.,  5.,  5.,  5.],
       [ 6.,  6.,  6.,  6.],
       [ 7.,  7.,  7.,  7.]])

To select out a subset of the rows in a particular order, you can simply pass a list or
ndarray of integers specifying the desired order.

In [116]:
arr[[4, 3, 0, 6]]

array([[ 4.,  4.,  4.,  4.],
       [ 3.,  3.,  3.,  3.],
       [ 0.,  0.,  0.,  0.],
       [ 6.,  6.,  6.,  6.]])

Using negative indices selects rows from the end.

In [117]:
arr[[-3, -5, -7]]

array([[ 5.,  5.,  5.,  5.],
       [ 3.,  3.,  3.,  3.],
       [ 1.,  1.,  1.,  1.]])

Passing multiple index arrays does something slightly different; it selects a one-dimensional
array of elements corresponding to each tuple of indices.

In [119]:
arr = np.arange(32).reshape((8, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [120]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

array([ 4, 23, 29, 10])

We’ll look at the reshape method in more detail in Appendix A.

Here the elements (1, 0), (5, 3), (7, 1), and (2, 2) were selected. Regardless of
how many dimensions the array has (here, only 2), the result of fancy indexing with
one integer sequence per axis is always one-dimensional.
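If what you want instead is the rectangular block formed by a set of rows and a set of columns, one approach is to index in two steps. The sketch below also shows np.ix_, which is not covered in the text above; I include it as an alternative that I believe gives the same block in a single step.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

# Pick rows 1, 5, 7, 2 first, then reorder the columns of that result.
block = arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]

# np.ix_ builds index arrays that select the same rows-by-columns block.
block_ix = arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]

print(block)
print(np.array_equal(block, block_ix))
```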

Keep in mind that fancy indexing, unlike slicing, always copies the data into a new
array.
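A quick way to see the difference between the two, using np.shares_memory as a check (the variable names are illustrative):

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

picked = arr[[4, 3, 0]]   # fancy indexing: new array
sliced = arr[:3]          # slicing: view on arr

picked[:] = -1            # does not touch arr
print(arr[4, 0])          # 16

print(np.shares_memory(arr, picked))   # False
print(np.shares_memory(arr, sliced))   # True
```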

### Transposing arrays and swapping axes

Transposing is a special form of reshaping that similarly returns a view on the underlying
data without copying anything. Arrays have the transpose method and also the
special T attribute.

In [122]:
arr = np.arange(15).reshape((3, 5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [123]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

When doing matrix computations, you may do this very often—for example, when
computing the inner matrix product using np.dot.

In [125]:
arr = np.random.randn(6, 3)
arr

array([[  6.41504102e-01,   8.18205581e-01,   1.49823896e+00],
       [ -6.38896469e-01,  -4.02331391e-01,  -5.45415939e-01],
       [  2.22931467e-03,  -7.38820347e-01,  -1.79585356e-01],
       [  8.23872424e-01,  -1.16100878e+00,   8.20229578e-01],
       [  1.06279260e+00,   3.10990017e-02,   1.91679634e-02],
       [ -6.25814993e-01,  -7.05818223e-01,   2.80858345e+00]])

In [126]:
np.dot(arr.T, arr)

array([[  3.01965947,   0.29852358,   0.24767287],
       [  0.29852358,   3.22427433,  -1.35606027],
       [  0.24767287,  -1.35606027,  11.13573436]])
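The same product can also be written with Python's @ matrix-multiplication operator. A small sketch on random data, so only the shapes and the agreement between the two spellings are checked:

```python
import numpy as np

x = np.random.randn(6, 3)

gram_dot = np.dot(x.T, x)   # (3, 6) times (6, 3) gives (3, 3)
gram_at = x.T @ x           # same computation with the @ operator

print(gram_dot.shape)
print(np.allclose(gram_dot, gram_at))
```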

For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute
the axes (for extra mind bending).

In [127]:
arr = np.arange(16).reshape((2, 2, 4))
arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [128]:
arr.transpose((1, 0, 2))

array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])

Here, the axes have been reordered with the second axis first, the first axis second,
and the last axis unchanged.

.T similarly returns a view on the data without making a copy.
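For swapping just two axes, ndarray also has a swapaxes method, which takes a pair of axis numbers. A minimal sketch on the same 2 × 2 × 4 array, also checking that the result is a view:

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

swapped = arr.swapaxes(1, 2)   # exchange axes 1 and 2: shape (2, 4, 2)
print(swapped.shape)
print(swapped[0])

# swapped is a view, so writing through it changes arr as well.
swapped[0, 0, 1] = -4          # same element as arr[0, 1, 0]
print(arr[0, 1, 0])            # -4
```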

## 4.2 Universal functions: fast element-wise array functions