# NumPy

**NumPy**--which stands for Numerical Python--is the foundational package for scientific computing in Python. It also provides the primary data structure--*the n-dimensional array*--on which the **pandas** package is built. NumPy includes extensive functionality, but we will use it primarily for:

* Fast (vectorized) array operations for data processing
* Efficient descriptive statistics
* Manipulations for merging multiple data sets

In [2]:
# Import statement
import numpy as np

## ndarrays

The ndarray is an n-dimensional array object, similar to a list but designed for fast computation. To make that speed possible, every element of an array must have the same type (the array's *dtype*). We will mostly focus on numerical (int, float) and boolean arrays.
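As a small sketch of what "a single type" means in practice: when np.**array** is given mixed values, it chooses one common dtype for all of them. The variable names below are illustrative only.

```python
import numpy as np

# Mixing ints and floats: every element is stored as a float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64

# Booleans get their own dtype
flags = np.array([True, False, True])
print(flags.dtype)   # bool

# Integers only: the default integer size depends on the platform
ints = np.array([1, 2, 3])
print(ints.dtype)
```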

Arrays will most likely be loaded from external data sources (later), but for now, we can create them via casting (using the np.**array** function) or using one of the following generating functions (or class of functions):

* np.**arange**(*start*, *stop*, *step*) (similar to the built-in range function)
* np.**zeros**(*shape*), np.**ones**(*shape*) (where *shape* is a sequence of dimension sizes)
* np.random.**rand**(*d0*,*d1*,...,*dn*) (where *d0*,*d1*,...,*dn* are dimension sizes)
* np.random.**randn**(*d0*,*d1*,...,*dn*) (where *d0*,*d1*,...,*dn* are dimension sizes)

In [2]:
# Casting from list
np.array([1,5,-1,2,4])

array([ 1,  5, -1,  2,  4])

In [3]:
# np.arange
arr1d = np.arange(10)
arr1d

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]:
# np.ones, np.zeros
np.ones((5,10))

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [13]:
# np.random.rand, np.random.randn
arr2d = np.random.rand(3,3)
arr2d

array([[0.10957236, 0.19204457, 0.06934553],
       [0.8453985 , 0.04668042, 0.43667118],
       [0.29770399, 0.86051763, 0.27657797]])

In [5]:
arr2d = np.random.randn(3,3)
arr2d

array([[-0.64135318, -0.67963489, -0.72913262],
       [-0.77436736,  2.48251485,  0.32284936],
       [ 1.25529653, -0.99003137,  0.53167481]])

In [6]:
# Common attributes
print(arr2d.ndim) # number of dimensions
print(arr2d.shape) # shape of array
print(arr2d.dtype) # data type

2
(3, 3)
float64


In [7]:
# Casting to other dtypes
np.arange(10).astype(float)

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

See https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html for additional details on ndarray objects.

## Array Operations

As previously stated, arrays are designed to support fast computation and comparison. The most common types of operations are:

* Between arrays and scalars
* Universal functions (np.func)
    - Unary (performed on a single array): abs, sqrt, exp, log, ceil, floor, logical_not, and more
    - Binary (performed between two arrays): +, -, /, *, **, minimum, maximum, mod, >, >=, <, <=, ==, !=, logical_and, logical_or, logical_xor
* Mathematical and statistical functions - Available as NumPy functions (np.func) and array methods (arr.func)
    - Aggregation: mean, sum, std, var, min/max, argmin/argmax
    - Non-aggregation: cumsum, cumprod

In [9]:
# Broadcasting with a scalar
arr2d + 1

array([[1.32040698, 1.08526154, 1.27529211],
       [1.84241435, 1.7110496 , 1.34258231],
       [1.13985641, 1.75312447, 1.53786653]])

In [13]:
# Broadcasting with 1-d array
arr2d + [1,2,3]

array([[1.32040698, 2.08526154, 3.27529211],
       [1.84241435, 2.7110496 , 3.34258231],
       [1.13985641, 2.75312447, 3.53786653]])
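The two broadcasting cases above follow the same rule: shapes are compared from the last axis backwards, and an axis of length 1 (or a missing axis) is stretched to match. A minimal sketch, using made-up arrays rather than the random data above:

```python
import numpy as np

grid = np.zeros((3, 3))

# A length-3 1-d array is matched against the last axis (the columns)
row = np.array([1, 2, 3])
print(grid + row)

# A (3, 1) column is stretched across the columns, so each row gets one value
col = np.array([[10], [20], [30]])
print(grid + col)
```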

In [14]:
# Comparison with a scalar
arr2d > 0.5

array([[False, False, False],
       [ True,  True, False],
       [False,  True,  True]])

In [15]:
# Unary functions
np.sqrt(arr2d)

array([[0.56604504, 0.29199579, 0.52468286],
       [0.91783133, 0.84323757, 0.58530532],
       [0.37397381, 0.86782744, 0.73339385]])

In [19]:
# Binary functions
print(arr2d)
arr2d * arr2d

[[0.32040698 0.08526154 0.27529211]
 [0.84241435 0.7110496  0.34258231]
 [0.13985641 0.75312447 0.53786653]]


array([[0.10266064, 0.00726953, 0.07578574],
       [0.70966194, 0.50559153, 0.11736264],
       [0.01955982, 0.56719647, 0.28930041]])

In [20]:
# Aggregation function
np.mean(arr2d)

0.4453171458393856

In [21]:
# Non-aggregation function
print(arr2d)
arr2d.cumsum()

[[0.32040698 0.08526154 0.27529211]
 [0.84241435 0.7110496  0.34258231]
 [0.13985641 0.75312447 0.53786653]]


array([0.32040698, 0.40566852, 0.68096063, 1.52337498, 2.23442458,
       2.5770069 , 2.71686331, 3.46998778, 4.00785431])
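Note that both functions above treated the 2-d array as one long sequence. Most of these functions also accept an *axis* argument to work along rows or columns instead. A short sketch on a small, fixed array (not the random data above):

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]

print(m.sum())           # 15, over all elements
print(m.sum(axis=0))     # [3 5 7], one value per column
print(m.sum(axis=1))     # [ 3 12], one value per row
print(m.cumsum(axis=1))  # running total within each row
```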

## Arrays vs. Lists

Arrays may seem similar to lists (e.g., they are both mutable and iterable sequences), but they are distinct data structures. Be sure to use an array whenever you are performing any large-scale computations or comparisons.

In [22]:
# List operations
L = list(range(10))
print(L + L)
print(L * 3)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [32]:
# Array operations
arr = np.arange(10)
print(arr + arr)
print(arr * 3)

[ 0  2  4  6  8 10 12 14 16 18]
[ 0  3  6  9 12 15 18 21 24 27]
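To get a rough feel for the speed difference, the sketch below doubles every element both ways and times each with the standard-library timeit module. The array size and repeat count are arbitrary choices, and the timings will vary by machine, so no expected numbers are shown.

```python
import timeit
import numpy as np

n = 100_000
values_list = list(range(n))
values_arr = np.arange(n)

# Same computation two ways: double every element
doubled_list = [2 * v for v in values_list]   # Python-level loop
doubled_arr = 2 * values_arr                  # single vectorized operation

# Both give the same numbers
print(np.array_equal(doubled_arr, doubled_list))

# Rough timing; absolute numbers depend on the machine
t_list = timeit.timeit(lambda: [2 * v for v in values_list], number=20)
t_arr = timeit.timeit(lambda: 2 * values_arr, number=20)
print(f"list: {t_list:.4f}s  array: {t_arr:.4f}s")
```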


In addition, a list can hold nested sequences of any length, whereas the nested sequences passed to np.**array** must all have the same length in order to form a regular n-dimensional array.

In [34]:
[[1,2,3],[4,5,6,7],[8,9]]

[[1, 2, 3], [4, 5, 6, 7], [8, 9]]

In [37]:
np.array([[1,2,3],[4,5,6],[7,8,9]])

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
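What happens if the ragged list is passed to np.**array** depends on the NumPy version: to the best of my knowledge, recent releases raise an error, while older ones issued a warning. The sketch below sidesteps that by asking for dtype=object explicitly, which shows why the result is not a useful numeric array.

```python
import numpy as np

ragged = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]

# With dtype=object, each inner list is stored as a single element
obj_arr = np.array(ragged, dtype=object)
print(obj_arr.shape)   # (3,)
print(obj_arr.dtype)   # object

# Equal-length rows produce a true 2-d numeric array
regular = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(regular.shape)   # (3, 3)
```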

## Indexing and Slicing

Indexing and slicing arrays is similar to lists...

In [38]:
print(arr[5])
print(arr[-1])
print(arr[5:8])
print(arr[::2])

5
9
[5 6 7]
[0 2 4 6 8]


...but unlike lists, array slices are views on the original array, so any updates to the array slice will be reflected in the original array. Consider the following example, in which we combine indexing with assignment (which also works as you would expect).

In [39]:
list_slice = L[5:8]
list_slice[0] = -10
print(list_slice, L)

[-10, 6, 7] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [40]:
arr_slice = arr[5:8]
arr_slice[0] = -10
print(arr_slice, arr)

[-10   6   7] [  0   1   2   3   4 -10   6   7   8   9]
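If you want a slice that is independent of the original array, make an explicit copy. A minimal sketch, using np.**shares_memory** to check whether two arrays share the same data (the names are illustrative):

```python
import numpy as np

base = np.arange(10)

view = base[5:8]           # basic slice: shares memory with base
copy = base[5:8].copy()    # explicit copy: independent data

print(np.shares_memory(base, view))   # True
print(np.shares_memory(base, copy))   # False

copy[0] = -10
print(base[5])             # 5, the original is untouched
```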


Indexing with index arrays or boolean arrays is a convenient way to filter an array. These operations return a copy of the selected elements, as opposed to a view on the original array as in the case of slicing.

In [41]:
# Index arrays - Each element in the index array is replaced by the corresponding value in the array
print(arr)
arr[np.array([1,0,1,3,9,-1])]

[  0   1   2   3   4 -10   6   7   8   9]


array([1, 0, 1, 3, 9, 9])

In [42]:
# Boolean arrays - Each element in the array is returned if the corresponding boolean scalar is True
print(arr)
arr[np.array([True, False, False, False, False, True, False, False, False, False])]

[  0   1   2   3   4 -10   6   7   8   9]


array([  0, -10])

In [43]:
# Filtering via conditional
arr[arr % 5 == 0]

array([  0, -10])
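The copy behavior can be checked directly. The sketch below also shows the one case that does change the original: a boolean index used on the left-hand side of an assignment. The array is a fresh example, not the one above.

```python
import numpy as np

data = np.arange(10)

# Boolean indexing on the right-hand side returns a copy
evens = data[data % 2 == 0]
evens[0] = 99
print(data[0])        # 0, the original is unchanged

# Boolean indexing on the left-hand side assigns into the original
data[data > 6] = -1
print(data)           # [ 0  1  2  3  4  5  6 -1 -1 -1]
```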

You can also use boolean arrays to learn about your data.

In [44]:
# How much of my data satisfies a given condition?
arr = np.random.randn(100)
(arr > 0).sum()

49

In [46]:
# Do any of my data satisfy a given condition?
print(arr)
(arr > 3).any()

[-1.06204261 -0.31270143  0.42023682 -0.93438386  0.67191252 -1.30759517
  0.0129766   0.33497742  0.61173203 -2.37990021  0.1209105   2.79228336
 -0.01083915 -0.97160628 -0.18323697 -0.33430426 -0.15825336 -1.63252715
  0.22596555  0.81567182  1.67540025  2.78012053 -0.70001067  0.23453586
 -0.68719096 -2.05923353  0.48178611 -0.21933761 -0.51808408  0.46378661
  0.25840081 -0.08361518  0.30724442 -0.03303424 -0.39554845 -0.36067976
 -0.26780153 -0.04390018 -1.38475184 -0.61328642 -0.19911534  0.96663269
 -1.99611946  0.96143961  1.22095829  2.30447976  0.22186068  0.74932656
  1.81572099 -2.76921623  1.32026588 -2.3332413   0.21405979 -1.77549816
  0.52946416  1.75807168 -0.40646517  1.14315869  0.60219132  0.012172
  0.99960299 -0.17363984 -1.07104802  0.6553653  -1.81923549  0.45012875
 -0.0677597   2.6662828  -0.51190984 -1.0965868   1.12407386  2.2298758
 -0.09906221  2.10010574 -0.76524024 -1.75219634  0.90538982  0.46879104
  0.22705426  0.33151999 -0.14571304 -0.66490633 -1.95

False

In [47]:
# Do all of my data satisfy a given condition?
(np.abs(arr) < 3).all()

True
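These summaries work because True counts as 1 and False as 0. A short sketch on a small fixed array, so the results are reproducible:

```python
import numpy as np

sample = np.array([-1.2, 0.4, 2.5, -0.3, 3.1, 0.0])
positive = sample > 0                # boolean array

print(positive.sum())                # 3, count of True values
print(positive.mean())               # 0.5, fraction of True values
print(np.count_nonzero(positive))    # 3

# Combine conditions elementwise with & (and) and | (or)
print(((sample > 0) & (sample < 3)).sum())   # 2
```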

Indexing and slicing multi-dimensional arrays is fairly intuitive. Whereas a 1-dimensional array contains 0-dimensional values (scalars), a 2-dimensional array is an array of 1-d arrays, where the first dimension represents the position of each 1-d array, and the second dimension refers to a specific position within each 1-d array (where all 1-d arrays have the same length). Similarly, a 3-dimensional array has 3 dimensions corresponding to the position of each 2-d array, 1-d array, and scalar value, respectively. And so on, for higher dimensions.

![](https://i.stack.imgur.com/R2IDC.png "Multi-dimensional arrays")

When indexing and slicing into an n-d array, each dimension is accessed in order, either via successive indexing or slicing operations or via a single comma-separated sequence of indices or slices. To retain all of the elements along a particular dimension, use a bare ':' for that dimension.

In [48]:
# 2-d array
arr2d = np.random.rand(3,3)
arr2d

array([[0.41445673, 0.00083282, 0.05439661],
       [0.39634514, 0.15116368, 0.38656848],
       [0.8295717 , 0.24823093, 0.03684579]])

In [49]:
# Index specific element
print(arr2d[0][0])
print(arr2d[0,0])

0.41445672696164015
0.41445672696164015


In [50]:
# Slice rows
print(arr2d[0])
print(arr2d[0,:])

[0.41445673 0.00083282 0.05439661]
[0.41445673 0.00083282 0.05439661]


In [51]:
# Slice columns
arr2d[:,0]

array([0.41445673, 0.39634514, 0.8295717 ])

In [52]:
# Slice rows and columns at the same time
arr2d[1:,:2]

array([[0.39634514, 0.15116368],
       [0.8295717 , 0.24823093]])

In [53]:
# 3-d array
arr3d = np.random.rand(3,3,3) # already float64, so no cast is needed
arr3d

array([[[0.48544318, 0.67328527, 0.86600703],
        [0.44283259, 0.98600632, 0.58849413],
        [0.20048454, 0.47317623, 0.7616548 ]],

       [[0.57531497, 0.27293683, 0.97100797],
        [0.98356107, 0.78183447, 0.73500857],
        [0.64377264, 0.61123671, 0.70130033]],

       [[0.2839236 , 0.76271719, 0.14407741],
        [0.34567677, 0.73021279, 0.02583279],
        [0.68875116, 0.10417464, 0.02564985]]])

In [54]:
# Index specific 2-d array
print(arr3d[1])
print(arr3d[1,:,:])

[[0.57531497 0.27293683 0.97100797]
 [0.98356107 0.78183447 0.73500857]
 [0.64377264 0.61123671 0.70130033]]
[[0.57531497 0.27293683 0.97100797]
 [0.98356107 0.78183447 0.73500857]
 [0.64377264 0.61123671 0.70130033]]


In [55]:
# Slicing 3-d array
arr3d[:,1,:]

array([[0.44283259, 0.98600632, 0.58849413],
       [0.98356107, 0.78183447, 0.73500857],
       [0.34567677, 0.73021279, 0.02583279]])
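A useful way to keep track of n-d indexing is to watch the shape of the result: each integer index removes a dimension, while each slice keeps one. A sketch on a small array with distinct values:

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)   # 2 blocks, 3 rows, 4 columns

print(cube[1].shape)          # (3, 4): one 2-d block
print(cube[1, 2].shape)       # (4,): one row of that block
print(cube[1, 2, 3])          # 23: a single scalar
print(cube[:, 0, :].shape)    # (2, 4): first row of every block
print(cube[:, :, 1:3].shape)  # (2, 3, 2): a slice keeps its axis
```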

## Other Important Array Methods

### Conditional Logic

We saw that ternary expressions were a convenient way for us to generate conditional values:

*expr1* if *cond* else *expr2*

There are several ways to perform this task for a list (which we could then cast to an array):

1. Use a **for** loop
2. Use **map** with a lambda function
3. Use a list comprehension

For arrays, we use the np.**where** function!
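For comparison, here are the three list approaches and the array approach side by side, each replacing negative values with 0. The data and variable names are made up for illustration.

```python
import numpy as np

values = [3, -1, 4, -1, 5]

# 1. for loop
clipped_loop = []
for v in values:
    clipped_loop.append(v if v > 0 else 0)

# 2. map with a lambda function
clipped_map = list(map(lambda v: v if v > 0 else 0, values))

# 3. list comprehension
clipped_comp = [v if v > 0 else 0 for v in values]

# Array version: np.where(condition, value_if_true, value_if_false)
arr_values = np.array(values)
clipped_where = np.where(arr_values > 0, arr_values, 0)

print(clipped_loop, clipped_map, clipped_comp, clipped_where)
```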

In [56]:
np.where?

In [57]:
# Flip a coin N times
N = 10
np.where(np.random.rand(N) > 0.5, 'H', 'T')

array(['T', 'H', 'T', 'T', 'H', 'T', 'H', 'H', 'H', 'T'], dtype='<U1')

In [58]:
# Select a value at random
N = 10
a = np.arange(N)
b = np.zeros(N)
np.where(np.random.rand(N) > 0.5, a, b)

array([0., 0., 2., 0., 0., 5., 6., 7., 8., 9.])

In [59]:
# Nested conditions
N = 10
a = np.ones(N)
b = np.zeros(N)
c = -np.ones(N)
np.where(np.random.rand(N) > 2/3, a, np.where(np.random.rand(N) > 1/2, b, c))

array([ 1.,  0., -1.,  0.,  0.,  0., -1.,  0.,  1.,  1.])
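Nested np.**where** calls become hard to read beyond two levels. One alternative is np.**select**, which takes a list of conditions and a matching list of values, and uses the first condition that holds. The sketch below uses a fixed array of scores instead of random draws so that the two versions can be compared.

```python
import numpy as np

scores = np.array([0.9, 0.1, 0.5, 0.7, 0.3])

# Nested np.where: high -> 1, middle -> 0, low -> -1
nested = np.where(scores > 2/3, 1, np.where(scores > 1/3, 0, -1))

# np.select: the first matching condition wins, default covers the rest
selected = np.select([scores > 2/3, scores > 1/3], [1, 0], default=-1)

print(nested)
print(selected)
```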

### Sorting

In [60]:
arr = np.random.rand(10)
arr

array([0.32899473, 0.0121254 , 0.12036217, 0.74502093, 0.91971865,
       0.1379966 , 0.87945876, 0.60853879, 0.04874319, 0.24084232])

In [61]:
# Return a copy of sorted array
np.sort(arr)

array([0.0121254 , 0.04874319, 0.12036217, 0.1379966 , 0.24084232,
       0.32899473, 0.60853879, 0.74502093, 0.87945876, 0.91971865])

In [62]:
# Return sorting indices
arr.argsort()

array([1, 8, 2, 5, 9, 0, 7, 3, 6, 4])

In [76]:
# Sort in place
arr.sort()
arr

array([0.01929118, 0.14174679, 0.16871017, 0.21343494, 0.37744676,
       0.39079423, 0.41555057, 0.42053132, 0.77197292, 0.97699876])
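Sorting indices are most useful when you want to reorder a second array to match the first. The sketch below also shows one simple way to get descending order. The arrays are made-up examples.

```python
import numpy as np

prices = np.array([3.5, 1.2, 9.9, 4.0])
labels = np.array(['c', 'a', 'd', 'b'])

order = prices.argsort()       # indices that sort prices in ascending order
print(prices[order])           # [1.2 3.5 4.  9.9]
print(labels[order])           # ['a' 'c' 'b' 'd']

# Descending order: reverse the sorted result
print(np.sort(prices)[::-1])   # [9.9 4.  3.5 1.2]
```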

### Set Logic

In [63]:
arr1 = np.arange(10)
arr2 = np.arange(20,0,-2)
print(arr1, arr2)

[0 1 2 3 4 5 6 7 8 9] [20 18 16 14 12 10  8  6  4  2]


In [64]:
# Membership
7 in arr2

False

In [65]:
# Unique elements
print(arr1 * arr2, np.unique(arr1 * arr2))

[ 0 18 32 42 48 50 48 42 32 18] [ 0 18 32 42 48 50]


In [69]:
# Comparisons - np.intersect1d, np.union1d, np.setdiff1d, np.setxor1d
print(arr1, arr2)
print(np.setxor1d(arr1, arr2)) # elements in exactly one of the two arrays
print(np.union1d(arr1, arr2))  # elements in either array

[0 1 2 3 4 5 6 7 8 9] [20 18 16 14 12 10  8  6  4  2]
[ 0  1  3  5  7  9 10 12 14 16 18 20]
[ 0  1  2  3  4  5  6  7  8  9 10 12 14 16 18 20]
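The remaining two comparison functions work the same way. For membership of many values at once, np.**isin** returns one boolean per element, which can then be used as a filter. A short sketch that rebuilds the same two arrays under new names:

```python
import numpy as np

a = np.arange(10)            # 0, 1, ..., 9
b = np.arange(20, 0, -2)     # 20, 18, ..., 2

print(np.intersect1d(a, b))  # [2 4 6 8]
print(np.setdiff1d(a, b))    # [0 1 3 5 7 9]

# Elementwise membership: one boolean per element of a
mask = np.isin(a, b)
print(a[mask])               # [2 4 6 8]
```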


## Manipulating and Combining Arrays

Sometimes, you will need to manipulate or combine multiple arrays of data prior to performing any analysis. There are a lot of built-in functions for these purposes. Very rarely will you need to develop your own code.

### Manipulating Arrays

In [70]:
arr = np.arange(8)
arr

array([0, 1, 2, 3, 4, 5, 6, 7])

In [71]:
# Reshaping arrays
print(arr.reshape((2,4)))
print(arr.reshape((2,-1))) # automatically determines the other dimension size

[[0 1 2 3]
 [4 5 6 7]]
[[0 1 2 3]
 [4 5 6 7]]


In [72]:
# Transpose
arr.reshape((2,4)).T # Our first example of chaining methods together

array([[0, 4],
       [1, 5],
       [2, 6],
       [3, 7]])

In [73]:
# Flatten
print(arr.reshape((2,4)).flatten('C')) # row-major
print(arr.reshape((2,4)).flatten('F')) # column-major

[0 1 2 3 4 5 6 7]
[0 4 1 5 2 6 3 7]
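As with slicing, it is worth knowing which of these operations copy the data. In the sketch below, reshape returns a view of a contiguous array, flatten always returns a copy, and ravel (a related method not used above) returns a view when it can.

```python
import numpy as np

flat = np.arange(8)
grid = flat.reshape((2, 4))

print(np.shares_memory(flat, grid))            # True: reshape returned a view here
print(np.shares_memory(grid, grid.flatten()))  # False: flatten always copies
print(np.shares_memory(grid, grid.ravel()))    # True: ravel avoids the copy when it can

grid[0, 0] = 100
print(flat[0])                                 # 100, the change shows through the view
```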


### Combining and Splitting Arrays

In [74]:
arr1 = np.random.rand(4,2)
arr2 = np.random.rand(4,2)
print(arr1)
print(arr2)

[[0.11752018 0.24441478]
 [0.17452596 0.70944447]
 [0.79377455 0.82717427]
 [0.52269379 0.25310571]]
[[0.24045853 0.97812523]
 [0.55800245 0.0306442 ]
 [0.692898   0.19159116]
 [0.74843069 0.39882295]]


In [75]:
# Concatenation by row
np.concatenate([arr1, arr2], axis=0)

array([[0.11752018, 0.24441478],
       [0.17452596, 0.70944447],
       [0.79377455, 0.82717427],
       [0.52269379, 0.25310571],
       [0.24045853, 0.97812523],
       [0.55800245, 0.0306442 ],
       [0.692898  , 0.19159116],
       [0.74843069, 0.39882295]])

In [76]:
# Stacking rows
np.vstack([arr1, arr2]) # np.row_stack is an older alias, deprecated in NumPy 2.0

array([[0.11752018, 0.24441478],
       [0.17452596, 0.70944447],
       [0.79377455, 0.82717427],
       [0.52269379, 0.25310571],
       [0.24045853, 0.97812523],
       [0.55800245, 0.0306442 ],
       [0.692898  , 0.19159116],
       [0.74843069, 0.39882295]])

In [77]:
# Concatenate by column
np.concatenate([arr1, arr2], axis=1)

array([[0.11752018, 0.24441478, 0.24045853, 0.97812523],
       [0.17452596, 0.70944447, 0.55800245, 0.0306442 ],
       [0.79377455, 0.82717427, 0.692898  , 0.19159116],
       [0.52269379, 0.25310571, 0.74843069, 0.39882295]])

In [78]:
# Stacking columns
np.hstack([arr1, arr2]) # also, np.column_stack

array([[0.11752018, 0.24441478, 0.24045853, 0.97812523],
       [0.17452596, 0.70944447, 0.55800245, 0.0306442 ],
       [0.79377455, 0.82717427, 0.692898  , 0.19159116],
       [0.52269379, 0.25310571, 0.74843069, 0.39882295]])
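For 2-d inputs like the ones above, np.**hstack** and np.**column_stack** agree. They differ for 1-d inputs, as this small sketch shows:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(np.hstack([x, y]).shape)        # (6,): joined end to end
print(np.column_stack([x, y]).shape)  # (3, 2): each input becomes a column
print(np.vstack([x, y]).shape)        # (2, 3): each input becomes a row
```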

In [79]:
# Splitting arrays
np.split(arr1, [1,2], axis=0) # also, np.hsplit, vsplit

[array([[0.11752018, 0.24441478]]),
 array([[0.17452596, 0.70944447]]),
 array([[0.79377455, 0.82717427],
        [0.52269379, 0.25310571]])]
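The second argument to np.**split** is either a list of positions at which to cut or an integer number of equal pieces. A sketch on a small fixed array, which also checks that concatenating the pieces restores the original:

```python
import numpy as np

table = np.arange(8).reshape(4, 2)

# Cut before row 1 and before row 2: rows [0:1], [1:2], [2:]
parts = np.split(table, [1, 2], axis=0)
print([p.shape for p in parts])    # [(1, 2), (1, 2), (2, 2)]

# An integer asks for that many equal pieces
halves = np.split(table, 2, axis=0)
print([h.shape for h in halves])   # [(2, 2), (2, 2)]

# Concatenating the parts restores the original
print(np.array_equal(np.concatenate(parts, axis=0), table))  # True
```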

## File Input/Output

There are two primary ways to save/load NumPy arrays to/from a file:

* Binary format (.npy) - np.**save** and np.**load**
* Delimited text file (.txt) - np.**savetxt** and np.**loadtxt** (also, np.**genfromtxt** for files with missing data)

In [80]:
# Print working directory
%pwd

'/Users/kzhang/Teaching/budt758x_f19/lecture_notes'

In [82]:
# Change working directory to data folder
%cd ~/Desktop

/Users/kzhang/Desktop


In [84]:
# Save arr2d to binary file
np.save('arr2d', arr2d) # function will add the .npy extension

In [85]:
# Delete arr2d and re-load from file
del arr2d
arr2d = np.load('arr2d.npy') # must include .npy extension
arr2d

array([[0.41445673, 0.00083282, 0.05439661],
       [0.39634514, 0.15116368, 0.38656848],
       [0.8295717 , 0.24823093, 0.03684579]])

In [87]:
# Save arr2d to text file
np.savetxt('arr2d.txt', arr2d, fmt='%.4f', delimiter=',')

In [88]:
# Delete arr2d and re-load from file
del arr2d
arr2d = np.loadtxt('arr2d.txt', delimiter=',')
arr2d

array([[4.145e-01, 8.000e-04, 5.440e-02],
       [3.963e-01, 1.512e-01, 3.866e-01],
       [8.296e-01, 2.482e-01, 3.680e-02]])
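Notice that the values loaded from the text file are rounded: the fmt='%.4f' argument kept only four decimal places, whereas the binary format stores the values exactly. The sketch below repeats both round trips on a small array. It writes to a temporary folder (an assumption made here so the example cleans up after itself), and the file names are placeholders.

```python
import os
import tempfile
import numpy as np

original = np.array([[0.41445673, 0.00083282],
                     [0.39634514, 0.15116368]])

with tempfile.TemporaryDirectory() as tmp:   # scratch folder, removed afterwards
    npy_path = os.path.join(tmp, 'demo.npy')
    txt_path = os.path.join(tmp, 'demo.txt')

    # Binary round trip
    np.save(npy_path, original)
    from_npy = np.load(npy_path)

    # Text round trip with four decimal places
    np.savetxt(txt_path, original, fmt='%.4f', delimiter=',')
    from_txt = np.loadtxt(txt_path, delimiter=',')

print(np.array_equal(original, from_npy))          # True: binary format is exact
print(np.array_equal(original, from_txt))          # False: text was rounded
print(np.allclose(original, from_txt, atol=1e-4))  # True: equal up to the rounding
```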