Numerical Python (NumPy) is an open source Python library for scientific computing. NumPy provides a host of features that allow a Python programmer to work with high-performance arrays and matrices.

The pandas library relies heavily on the NumPy array for the implementation of the pandas Series and DataFrame objects.

In [3]:
import numpy as np

## Benefits and characteristics of NumPy arrays

Several of these benefits are as follows: Contiguous allocation in memory; Vectorized operations; Boolean selection; Sliceability.
The following example calculates the time required by the for loop in Python to square a list consisting of 100000 sequential integers:

In [4]:
#a function that squares all the values in a sequence
def squares(values):
    result=[]
    for v in values:
        result.append(v*v)
        return result     

In [5]:
#create 100000 numbers using python range
to_square=range(100000)
#time how long it takes to repeatedly square them all
%timeit squares(to_square)

The slowest run took 30.52 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 471 ns per loop


Using NumPy and vectorized arrays, the e.g. can be rewritten as follows.

In [103]:
# now lets do this with a numpy array
array_to_square=np.arange(0,100000)
# and time using a vectorized operation
%timeit array_to_square**2

The slowest run took 347.48 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 76.9 µs per loop


Vectorization of the operation made code simpler and also performed faster!

## Creating NumPy arrays and performing basic array operations

   A NumPy array can be created using multiple techniques. The following code creates a new NumPy array object from a Python list:

In [8]:
#a simple array
a1=np.array([1,2,3,4,5])
a1

array([1, 2, 3, 4, 5])

In [9]:
type(a1)

numpy.ndarray

In [10]:
np.size(a1)

5

 In NumPy, n-dimensional arrays are denoted as ndarray, and it contains five elements, as is reported by the np.size() function. 
  The following code e.g. demonstrates using integer and floating-point values to initialize the array, which are then converted to floating-point numbers by NumPy:

In [11]:
#any float in the sequences makes it an array of floats
a2=np.array([1,2,3,4.0,5.0])
a2

array([ 1.,  2.,  3.,  4.,  5.])

In [12]:
#array is all of one type(float64 in this case)
a2.dtype

dtype('float64')

The following code uses a single item Python list to initialize an array of 10 items:

In [13]:
a3=np.array([[0]*10])
a3

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

   An array can also be initialized with sequential values using the Python range() function.

In [14]:
#convert a python range to numpy array
np.array(range(10))

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]:
#create a numpy array of 10 0.0's
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [16]:
#force it to be of int instead of float64
np.zeros(10,dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [17]:
#make 'a range' starting at 0 and with 10 values
np.arange(0,10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

   Not include the end value!

In [18]:
#0<=x<10 increment by 2
np.arange(0,10,2)

array([0, 2, 4, 6, 8])

In [19]:
#10>=x>0, counting down
np.arange(10,0,-1)

array([10,  9,  8,  7,  6,  5,  4,  3,  2,  1])

In [20]:
#evenly spaced #'s between two intervals
np.linspace(0,10,11)

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

Include the end value!
   

 The datatype of the array by default is float, and that the start and end values are inclusive.

In [21]:
#multiply numpy array by 2
a1=np.arange(0,10)
a1*2

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [22]:
#add two numpy arrays
a2=np.arange(10,20)
a1+a2

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

   The pandas Series and DataFrame object operate similarly to 1-and 2-dimensional arrays, respectively.

In [23]:
#create a 2-dimensional array(2x2)
np.array([[1,2],[3,4]])

array([[1, 2],
       [3, 4]])

In [24]:
#create a 1v20 array, and reshape to a 5x4 2d-array
m=np.arange(0,20).reshape(5,4)
m

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

use .reshape() to reorganize a 1-dimensional array into 2 dimensions.

In [25]:
#size of any dimensional array is the # of elements
np.size(m)

20

In [26]:
#can ask the size along a given axis (0 is rows)
np.size(m,0)


5

   To determine the number of rows in a 2-dimensional array, we can pass 0 as another parameter.

In [27]:
#and 1 is the columns
np.size(m,1)

4

   To determine the number of columns in a 2-dimensional array, we can pass the value 1.

## Selecting array elements

   There are many variants of this operator, but the basic access to array elements is by passing the 0-based offset of the desired element.

In [28]:
#select 0-based elements 0 and 2
a1[0],a1[2]


(0, 2)

In [29]:
#select an element in 2d array at row 1 column 2
m[1,2]

6

In [30]:
#all items in row 1
m[1,]

array([4, 5, 6, 7])

In [31]:
#all items in column 2
m[:,2]

array([ 2,  6, 10, 14, 18])

## Logical operations on arrays

   Logical operations can be applied to arrays to test the arrays values against specific criteria.

In [32]:
#which items are less than 2?
a=np.arange(5)
a<2

array([ True,  True, False, False, False], dtype=bool)

In [33]:
#create a function that is applied to all array elements
def exp (x):
    return x<3 or x>3
#np.vectorize applies the method to all items in an array
np.vectorize(exp)(a)

array([ True,  True,  True, False,  True], dtype=bool)

   Note that only the function representing the expression is passed to np.vectorize(). The array is then passed as a parameter to the object that results from that operation.

In [104]:
#boolean select items<3
r=a<3
#applying the result of the expression to the [] operator
#selects just the array elements where there is a matching Ture
a[r]

array([0, 1, 2])

In [35]:
#np.sum treats Ture as 1 and False as 0
#so this is how many items are less than 3
np.sum(a<3)

3

In [36]:
#this can be applied across two arrays
a1=np.arange(0,5)
a2=np.arange(5,0,-1)
a1<a2

array([ True,  True,  True, False, False], dtype=bool)

In [37]:
#and even multi dimensional arrays
a1=np.arange(9).reshape(3,3)
a2=np.arange(9,0,-1).reshape(3,3)
a1<a2

array([[ True,  True,  True],
       [ True,  True, False],
       [False, False, False]], dtype=bool)

## Slicing arrays

 A slice object is created using a syntax of start:end:step. Each component of the slice is optional and, as we will see, this provides convenient means to select entire rows or columns by omitting the component of the slice.

 The following code creates a ten-element array and selects items in 0-based positions from 3 up to, but not including, position 8:

In [39]:
import numpy as np
import pandas as pd
#get all items in the array from position 3
#up to position 8(but not inclusive)
a1=np.arange(1,10)
a1[3:8]

array([4, 5, 6, 7, 8])

the step value uses the default value of 1.

In [40]:
a1[::2]

array([1, 3, 5, 7, 9])

choose 0 through the length of the array as those values and then retrieves every other item.

In [41]:
a1[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1])

when using a negative step value, it is important that the start value is greater than the end value. Also note that the following example is not equivalent to the preceding example:

In [42]:
#note that when in reverse, this does not include
#the element specified in the second component of the alice
#that is, there is no 1 printed in this
a1[9:0:-1]

array([9, 8, 7, 6, 5, 4, 3, 2])

In this scenario, the 0 value in the array was not retrieved, because the end value is not inclusive, so when iterating by -1 from 9, NumPy stops at 0 before returning the value at that position in the array.

In [43]:
#all items from position 5 onwards
a1[5:]

array([6, 7, 8, 9])

In [44]:
#the items in the first 5 positons
a1[:5]

array([1, 2, 3, 4, 5])

In [45]:
#we saw this earlier
#:in rows specifier means all rows
#so this gets items in column position 1, all rows
m=np.arange(0,20).reshape(5,4)
m[:,1]


array([ 1,  5,  9, 13, 17])

In [46]:
#in all rows, but for all columns in positions
#1 up to but not including 3
m[:,1:3]

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10],
       [13, 14],
       [17, 18]])

Rows can also be sliced, and the step value is valid for both rows and columns. 

In [47]:
#in row positions 3 up to but not including 5, all columns
m[3:5,:]

array([[12, 13, 14, 15],
       [16, 17, 18, 19]])

Both columns and rows can be sliced at the same time.

In [48]:
#combined to pull out a sub matrix of the matrix
m[3:5,1:3]

array([[13, 14],
       [17, 18]])

The following code explicitly selects by position the first, third, and fourth rows:

In [49]:
#using a python array, we can select 
#non-contiguous rows or columns
m[[1,3,4],:]

array([[ 4,  5,  6,  7],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

## Reshaping arrays

the .reshape() can be used to reshape a 1-dimensional array into a matrix, it is also possible to convert from a matrix back to an array.

In [50]:
#create a 9 element array(1x9)
a=np.arange(0,9)
#and reshape to a 3x3 2-d array
m=a.reshape(3,3)
m

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [51]:
# we can reshape downward in dimensions too 
reshaped=m.reshape(9)
reshaped

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

Note that .reshape() returns a new array with a different shape, the original array's shape remains unchanged.

The .reshape() method is not the only means of reorganizing data. Another means is the .ravel() method that will flatten a matrix to one dimension as shown in the following e.g:

In [52]:
#.ravel will generate array representing a flattened 2-d array 
raveled=m.ravel()
raveled

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [53]:
#it does not alter the shape of the source
m

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [55]:
reshaped=m.reshape(np.size(m))
raveled=m.ravel()
reshaped[2]=1000
raveled[5]=2000
m

array([[   0,    1, 1000],
       [   3,    4, 2000],
       [   6,    7,    8]])

The .flatten() method functions similarly to .ravel() but instead returns a new
array with copied data instead of a view. Changes to the result do not change the
original matrix:

In [56]:
m2=np.arange(0,9).reshape(3,3)
flattened=m2.flatten()
flattened[0]=1000
flattened

array([1000,    1,    2,    3,    4,    5,    6,    7,    8])

In [57]:
#but not in the origin
m2

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [58]:
#we can reshape by assigning a tuple to the .shape property
#we start with this, which has one dimension
flattened.shape

(9,)

The .shape property returns a tuple representing the shape of the array

The property can also be assigned a tuple, which will force the array to reshape itself as specified:

In [59]:
flattened.shape=(3,3)
flattened

array([[1000,    1,    2],
       [   3,    4,    5],
       [   6,    7,    8]])

In linear algebra, it is common to transpose a matrix. This can be performed with the .transpose() method, as shown here:

In [60]:
#transpose a matrix
flattened.transpose()

array([[1000,    3,    6],
       [   1,    4,    7],
       [   2,    5,    8]])

In [61]:
#can also use .T property to transpose
flattened.T

array([[1000,    3,    6],
       [   1,    4,    7],
       [   2,    5,    8]])

The .resize() method functions similarly to the .reshape() method, except that while reshaping returns a new array with data copied into it, .resize() performs an in-place reshaping of the array.

In [64]:
m=np.arange(0,9).reshape(3,3)
m.resize(1,9)
m

array([[0, 1, 2, 3, 4, 5, 6, 7, 8]])

## Combining arrays

This process in NumPy is referred to as stacking. Stacking can take various forms, including horizontal, vertical, and depth-wise stacking.

In [67]:
a=np.arange(9).reshape(3,3)
b=(a+1)*10
a

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

Horizontal stacking combines two arrays in a manner where the columns of the second array are placed to the right of those in the first array.

In [68]:
b

array([[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

In [69]:
#horizontally stack the two arrays
#b becomes columns of a to the right of a's columns
np.hstack((a,b))

array([[ 0,  1,  2, 10, 20, 30],
       [ 3,  4,  5, 40, 50, 60],
       [ 6,  7,  8, 70, 80, 90]])

This functionally is equivalent to using the np.concatenate() function while specifying axis = 1:

In [70]:
#identical to concatenate along axis=1
np.concatenate((a,b),axis=1)

array([[ 0,  1,  2, 10, 20, 30],
       [ 3,  4,  5, 40, 50, 60],
       [ 6,  7,  8, 70, 80, 90]])

In [71]:
#vertical stack,adding b as rows after a's rows
np.vstack((a,b))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

Like np.hstack(), this is equivalent to using the concatenate function, except specifying axis=0:

In [72]:
#concatenate along axis=0 is the same as vstack
np.concatenate((a,b),axis=0)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

In [74]:
#dstack stacks each independent column of a and b
np.dstack((a,b))

array([[[ 0, 10],
        [ 1, 20],
        [ 2, 30]],

       [[ 3, 40],
        [ 4, 50],
        [ 5, 60]],

       [[ 6, 70],
        [ 7, 80],
        [ 8, 90]]])

Column stacking performs a horizontal stack of two one-dimensional arrays, making each array a column in the resulting array:

In [75]:
#set up 1-d array
one_d_a=np.arange(5)
one_d_a

array([0, 1, 2, 3, 4])

In [77]:
#another 1-d array
one_d_b=(one_d_a+1)*10
one_d_b

array([10, 20, 30, 40, 50])

In [78]:
#stack the two columns
np.column_stack((one_d_a,one_d_b))

array([[ 0, 10],
       [ 1, 20],
       [ 2, 30],
       [ 3, 40],
       [ 4, 50]])

In [79]:
#stack along rows
np.row_stack((one_d_a,one_d_b))

array([[ 0,  1,  2,  3,  4],
       [10, 20, 30, 40, 50]])

## Splitting arrays

Arrays can also be split into multiple arrays along the horizontal, vertical, and depth axes using the np.hsplit(), np.vsplit(), and np.dsplit() functions. We will only look at the np.hsplit() function as the others work similarly.

If splitting into a number of arrays, each array returned will have the same count of columns. The source array must have a number of columns that is a multiple of the specified value.

In [80]:
# sample array
a=np.arange(12).reshape(3,4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [81]:
# horiz split the 2-d array into 4 array columns
np.hsplit(a,4)

[array([[0],
        [4],
        [8]]), array([[1],
        [5],
        [9]]), array([[ 2],
        [ 6],
        [10]]), array([[ 3],
        [ 7],
        [11]])]

The result is actually an array containing the four specified arrays.

In [82]:
# horiz split into two array columns
np.hsplit(a,2)

[array([[0, 1],
        [4, 5],
        [8, 9]]), array([[ 2,  3],
        [ 6,  7],
        [10, 11]])]

In [83]:
#split at columns 1 and 3
np.hsplit(a,[1,3])

[array([[0],
        [4],
        [8]]), array([[ 1,  2],
        [ 5,  6],
        [ 9, 10]]), array([[ 3],
        [ 7],
        [11]])]

In [84]:
#along the rows
np.split(a,2,axis=1)

[array([[0, 1],
        [4, 5],
        [8, 9]]), array([[ 2,  3],
        [ 6,  7],
        [10, 11]])]

The np.split() function performs an identical task when using axis=1

In [85]:
# new array for examples
a=np.arange(12).reshape(4,3)
a

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

We can split this by 4 and get the four arrays representing the rows:

In [86]:
# split into four rows of arrays
np.vsplit(a,4)

[array([[0, 1, 2]]),
 array([[3, 4, 5]]),
 array([[6, 7, 8]]),
 array([[ 9, 10, 11]])]

Alternately, splitting by 2, retrieving two arrays of two rows each:

In [87]:
# into two rows of arrays
np.vsplit(a,2)

[array([[0, 1, 2],
        [3, 4, 5]]), array([[ 6,  7,  8],
        [ 9, 10, 11]])]

Splitting can also be performed on specific rows:

In [88]:
# split along axis=0
# row 0 og original is row 0 of new array
# rows 1 and 2 of original are row 1
np.vsplit(a,[1,3])

[array([[0, 1, 2]]), array([[3, 4, 5],
        [6, 7, 8]]), array([[ 9, 10, 11]])]

Likewise, the split command does the same when specifying axis=0:

In [89]:
# split can specify axis
np.split(a,2,axis=0)

[array([[0, 1, 2],
        [3, 4, 5]]), array([[ 6,  7,  8],
        [ 9, 10, 11]])]

Depth splitting splits three-dimensional arrays. To demonstrate this, we will use the following three-dimensional array:

In [90]:
# 3-d array
c=np.arange(27).reshape(3,3,3)
c

array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])

In [91]:
# split into 3
np.dsplit(c,3)

[array([[[ 0],
         [ 3],
         [ 6]],
 
        [[ 9],
         [12],
         [15]],
 
        [[18],
         [21],
         [24]]]), array([[[ 1],
         [ 4],
         [ 7]],
 
        [[10],
         [13],
         [16]],
 
        [[19],
         [22],
         [25]]]), array([[[ 2],
         [ 5],
         [ 8]],
 
        [[11],
         [14],
         [17]],
 
        [[20],
         [23],
         [26]]])]

## Useful numerical methods of NumPy arrays

Note that most of these functions work on multi-dimensional arrays, and the axis to which the function is applied to is specified by the axis
parameter. We will examine this for the .min() and .max() functions, but note that the axis parameter applies to many other NumPy functions.

The .min() and .max() methods return the minimum and maximum values in an
array. The .argmax() and .argmin() functions return the position of the maximum
or minimum value in the array:

In [92]:
m = np.arange(10, 19).reshape(3, 3)
print (a)
print ("{0} min of the entire matrix".format(m.min()))
print ("{0} max of entire matrix".format(m.max()))
print ("{0} position of the min value".format(m.argmin()))
print ("{0} position of the max value".format(m.argmax()))
print ("{0} mins down each column".format(m.min(axis = 0)))
print ("{0} mins across each row".format(m.min(axis = 1)))
print ("{0} maxs down each column".format(m.max(axis = 0)))
print ("{0} maxs across each row".format(m.max(axis=1)))

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
10 min of the entire matrix
18 max of entire matrix
0 position of the min value
8 position of the max value
[10 11 12] mins down each column
[10 13 16] mins across each row
[16 17 18] maxs down each column
[12 15 18] maxs across each row


The .mean(), .std(), and .var() methods compute the mathematical mean, standard deviation, and variance of the values in an array:

In [93]:
# demonstrate included statistical methods
a=np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [94]:
a.mean(),a.std(),a.var()

(4.5, 2.8722813232690143, 8.25)

The sum and products of all the elements in an array can be computed with the .sum() and .prod() methods:

In [95]:
# demonstrate sum and prod
a=np.arange(1,6)
a

array([1, 2, 3, 4, 5])

In [96]:
a.sum(),a.prod()

(15, 120)

The cumulative sum and products can be computed with the .cumsum() and .cumprod() methods:

In [97]:
# and cumulative sum and prod
a
a.cumsum(),a.cumprod()

(array([ 1,  3,  6, 10, 15], dtype=int32),
 array([  1,   2,   6,  24, 120], dtype=int32))

The .all() method returns True if all elements of an array are true, and .any() returns True if any element of the array is true.

In [98]:
# applying logical operators
a=np.arange(10)
(a<5).any()    # any<5?

True

In [100]:
(a<5).all() #all<5? (a<5).any() #all<5?

False

In [101]:
# size is always the total number of elements
np.arange(10).reshape(2,5).size

10

In [102]:
# .ndim will give you the total
# of dimensions
np.arange(10).reshape(2,5).ndim

2

There are a number of valuable statistical functions, as well as a number of descriptive statistical functions besides those demonstrated here. This was meant to be a brief overview of NumPy arrays, and the next two chapters on pandas Series and DataFrame objects will dive deeper into these additional methods.