# Numpy: ndarray basics

ndarrays are time- and space-efficient multi-dimensional arrays at the core of numpy.

## Creating Rank 1 numpy arrays: (aka vectors)

In [1]:
import numpy as np  # numpy needs to be imported to use ndarrays
# numpy arrays can be created by converting a Python list
an_array = np.array([3, 33, 333])  # create a rank 1 array
print(type(an_array))  # the type of ndarray is: "<class 'numpy.ndarray'>"

<class 'numpy.ndarray'>


In [2]:
# test the shape of the array we created; it should have just one dimension
print(an_array.shape)

(3,)


In [3]:
# since this is a rank 1 array, we need only one index to access each element
print(an_array[0], an_array[1], an_array[2])

3 33 333


In [4]:
# ndarrays are mutable; we can change their elements
an_array[0] = 888
print(an_array)

# assigned values must be convertible to the array's data type (here, an integer)
an_array[0] = 'hi'  # raises ValueError

[888  33 333]


ValueError: invalid literal for int() with base 10: 'hi'
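
A related point worth checking is what happens when the assigned value *can* be converted. The sketch below (variable names are illustrative) suggests that a float assigned into an integer array is silently converted to the array's dtype, while a non-numeric string raises an error.

```python
import numpy as np

ints = np.array([3, 33, 333])
ints[0] = 9.99            # the float is converted to the array's integer dtype
print(ints.tolist())      # [9, 33, 333]

try:
    ints[1] = 'hi'        # cannot be parsed as an integer
except ValueError as err:
    print("ValueError:", err)
```

If fractional values need to be preserved, one option is to create the array with a float dtype from the start.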

## Creating Rank 2 numpy arrays:

A rank 2 ndarray is one with two dimensions. Notice the format below of [[row], [row]]. Two-dimensional arrays are a natural way to represent matrices, which are very useful in data science.

In [5]:
another = np.array([[11,12,13], [21,22,23]])  # create a rank 2 array
print(another)
print('The shape is 2 rows, 3 columns: ', another.shape)  # rows x columns
print('Accessing elements [0,0], [0,1] and [1,0] of the ndarray: ', another[0,0], another[0,1], another[1,0])


[[11 12 13]
 [21 22 23]]
The shape is 2 rows, 3 columns:  (2, 3)
Accessing elements [0,0], [0,1] and [1,0] of the ndarray:  11 12 21


## There are many ways to create numpy arrays:

Here, we create several arrays with different shapes and different pre-filled values. numpy has a number of built-in functions which help us quickly and easily create multidimensional arrays.

In [6]:
import numpy as np

# create a 2x2 array of zeros
ex1 = np.zeros((2,2))
print(ex1)

[[ 0.  0.]
 [ 0.  0.]]


In [7]:
# create a 2x2 array filled with 9.0
ex2 = np.full((2,2), 9.0)
print(ex2)

[[ 9.  9.]
 [ 9.  9.]]


In [8]:
# create a 2x2 matrix with diagonal 1s and others 0
ex3 = np.eye(2,2)
print(ex3)

[[ 1.  0.]
 [ 0.  1.]]


In [9]:
# create an array of ones
ex4 = np.ones((1,2))
print(ex4)

[[ 1.  1.]]


In [10]:
# the above ndarray, ex4, is actually rank 2; it is a 1x2 array (1 row, 2 columns)
print(ex4.shape)

# so, we need two indexes to access an element
print()
print(ex4[0,1])

(1, 2)

1.0


In [11]:
# create a 2x2 matrix of random floats between 0 and 1
ex5 = np.random.random((2,2))
print(ex5)

[[ 0.84007037  0.81599721]
 [ 0.74263431  0.58879155]]
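
Two more constructors often appear alongside the ones above. This is a minimal sketch using `np.arange` and `np.linspace`; the variable names are illustrative only.

```python
import numpy as np

# evenly spaced integers: start (inclusive), stop (exclusive), step
evens = np.arange(0, 10, 2)

# a fixed number of evenly spaced floats; both endpoints are included by default
grid = np.linspace(0.0, 1.0, 5)

print(evens.tolist())  # [0, 2, 4, 6, 8]
print(grid.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
```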


## ndarray indexing:

    - Use slice indexing to access subsets of an ndarray
    - Recognize that such indexing creates a second reference to the same underlying data

In [9]:
import numpy as np

# Rank 2 array of shape (3,4)
an_array = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])
print(an_array)

[[11 12 13 14]
 [21 22 23 24]
 [31 32 33 34]]


In [10]:
# use array slicing to get a subarray: the first 2 rows, and columns 1 and 2 (the 2nd and 3rd columns) of those rows
a_slice = an_array[:2, 1:3]
print(a_slice)

[[12 13]
 [22 23]]


![ndarray slicing](array_slicing.png)

**a_slice is pointing to the same elements in memory as an_array, which is why we refer to a_slice as part of an_array.**

a_slice has its own indices and they are different from the indices of an_array

![indices of a_slice](a_slice_indices.png)

In [12]:
# when you modify a slice, you actually modify the underlying array
# we demonstrate how changing the element 12 in a_slice will also alter the element in an_array

print("Before: ", an_array[0,1]) # inspect element [0,1] in an_array
a_slice[0,0] = 1000 # a_slice[0,0] is the same piece of data as an_array[0,1]
print("After: ", an_array[0,1])

Before:  12
After:  1000

If you wanted the slice to be a copy, rather than reference the original array, do the following:

In [16]:
import numpy as np

# Rank 2 array of shape (3,4)
an_array = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])
print(an_array)

[[11 12 13 14]
 [21 22 23 24]
 [31 32 33 34]]


In [17]:
a_slice = np.array(an_array[:2, 1:3])  # notice that we are creating a new ndarray
print(a_slice)

[[12 13]
 [22 23]]


In [18]:
# now you will see that even though we modify an element in the slice, it doesn't affect the original ndarray
print("Before: ", an_array[0,1]) 
a_slice[0,0] = 1000 
print("After: ", an_array[0,1])

Before:  12
After:  12
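
To check whether two arrays share data without modifying either one, `np.shares_memory` can help. The sketch below uses illustrative names (`base`, `view`, `independent`) and the `.copy()` method as an alternative to wrapping the slice in `np.array(...)`.

```python
import numpy as np

base = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])

view = base[:2, 1:3]                 # basic slicing returns a view onto base
independent = base[:2, 1:3].copy()   # .copy() allocates separate storage

print(np.shares_memory(base, view))         # True
print(np.shares_memory(base, independent))  # False

view[0, 0] = 1000          # writes through to base[0, 1]
independent[0, 1] = -1     # leaves base untouched
print(base[0, 1], base[0, 2])  # 1000 13
```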


## Use both integer indexing and slice indexing

We can use both integer indexing and slice indexing to create different shaped matrices

In [2]:
import numpy as np

# Rank 2 array of shape (3,4)
an_array = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])
print(an_array)

[[11 12 13 14]
 [21 22 23 24]
 [31 32 33 34]]


In [3]:
# using both integer indexing and slice indexing generates an array of lower rank
row_rank1 = an_array[1, :] # Creates a rank 1 array
print(row_rank1, row_rank1.shape) # notice only a single []

[21 22 23 24] (4,)


In [4]:
# Slicing alone: generates an array of the same rank as an_array
row_rank2 = an_array[1:2, :] # Creates a rank 2 matrix
print(row_rank2, row_rank2.shape) # notice the [[]]

[[21 22 23 24]] (1, 4)


In [5]:
# We can do the same thing for columns of an array:
print()
col_rank1 = an_array[:, 1]
col_rank2 = an_array[:, 1:2]

print(col_rank1, col_rank1.shape) # Rank 1
print()
print(col_rank2, col_rank2.shape) # Rank 2


[12 22 32] (3,)

[[12]
 [22]
 [32]] (3, 1)


## Array indexing for changing elements

Sometimes it's useful to use an array of indexes to access or change elements

In [2]:
# create a new array
import numpy as np
an_array = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33], [41, 42, 43]])

print('Original Array: ')
print(an_array, an_array.shape)

Original Array: 
[[11 12 13]
 [21 22 23]
 [31 32 33]
 [41 42 43]] (4, 3)


In [8]:
# Create an array of indices
col_indices = np.array([0, 1, 2, 0])
print('\nCol indices picked: ', col_indices)

row_indices = np.arange(4)
print('\nRow indices picked: ', row_indices)


Col indices picked:  [0 1 2 0]

Row indices picked:  [0 1 2 3]


In [9]:
# Examine the pairings of row_indices and col_indices. These are the elements we access

for row,col in zip(row_indices, col_indices):
    print(row, ", ", col)

0 ,  0
1 ,  1
2 ,  2
3 ,  0


In [10]:
# Select one element from each row
print("Values in the array at those indices: ", an_array[row_indices, col_indices])

Values in the array at those indices:  [11 22 33 41]


In [11]:
# Change one element from each row using the indices selected
an_array[row_indices, col_indices] += 100000

print("\nChanged Array:")
print(an_array)

# an_array[row_indices, col_indices] = range(4)  # also, try this


Changed Array:
[[100011     12     13]
 [    21 100022     23]
 [    31     32 100033]
 [100041     42     43]]


**Slicing and array indexing are at the heart of NumPy. They are both convenient and extremely fast.**
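
A common use of the "one element per row" pattern is picking each row's largest value. The sketch below assumes a small made-up `scores` array; it also illustrates that integer-array indexing returns a copy, unlike slicing.

```python
import numpy as np

scores = np.array([[11, 42, 13], [27, 22, 23], [31, 32, 93]])

rows = np.arange(scores.shape[0])   # [0 1 2]
cols = scores.argmax(axis=1)        # column of the largest value in each row

row_max = scores[rows, cols]        # one element per row
print(cols.tolist(), row_max.tolist())  # [1, 0, 2] [42, 27, 93]

# integer-array indexing returns a copy, so this does not touch scores
row_max[0] = 0
print(scores[0, 1])                 # 42
```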

## ndarray boolean indexing

In data science, we often need to select elements of an array based on their values.

Boolean indexing does exactly that, and it is particularly useful while cleaning a dataset.

    - Use boolean indexing to access and modify relevant data in ndarrays.


In [33]:
# create a 3x2 array
an_array = np.array([[11, 12], [21, 22], [31, 32]])
print(an_array)

[[11 12]
 [21 22]
 [31 32]]


In [34]:
# create a filter which will return boolean values for whether each element satisfies this condition
filter = (an_array > 15)
filter

array([[False, False],
       [ True,  True],
       [ True,  True]], dtype=bool)

In [36]:
# The filter we have created has the same dimensions as our original array.
# We can use this filter as indices for the larger array, asking for those values for which the filter is true.

print(an_array[filter])

[21 22 31 32]


In [37]:
# The shorthand for this is as follows
an_array[an_array > 15]

array([21, 22, 31, 32])

In [40]:
# applying an even more complicated logic
print(an_array[(an_array > 20) & (an_array < 30)])
an_array[(an_array % 2 == 0)]

[21 22]


array([12, 22, 32])

**What is particularly useful is that we can actually change elements in the array applying a similar logical filter.** Let's add 100 to all the even values.

In [41]:
an_array[an_array % 2 == 0] += 100
print(an_array)

[[ 11 112]
 [ 21 122]
 [ 31 132]]


Filters are very useful in many Data Science operations and Computer Science algorithms involving matrices.
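
As one possible data-cleaning use, the sketch below replaces a sentinel value with the mean of the valid entries. The readings and the `-999.0` "missing" marker are hypothetical, chosen only for illustration.

```python
import numpy as np

# hypothetical sensor readings; -999.0 marks a missing measurement
readings = np.array([12.5, -999.0, 14.0, 13.5, -999.0])

missing = (readings == -999.0)          # boolean mask, same shape as readings
fill_value = readings[~missing].mean()  # mean of the valid entries only

cleaned = readings.copy()               # keep the raw data unchanged
cleaned[missing] = fill_value           # assign through the mask

print(missing.sum())  # 2
print(cleaned)
```

The `~` operator inverts the mask, so `readings[~missing]` selects only the valid measurements.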

## ndarray Datatypes and Operations

As we've seen so far, each ndarray has its own datatype.

    - Examine and set the datatype of an ndarray
    - Use common ndarray functions

In [50]:
ex1 = np.array([11, 12]) # NumPy infers the data type
print(ex1.dtype)

int64


In [54]:
ex2 = np.array([11.0, 12.0]) # NumPy infers the data type
print(ex2.dtype)

float64


In [55]:
ex3 = np.array([11, 21], dtype=np.int64) # You can also tell NumPy the data type
print(ex3.dtype)

int64


In [56]:
# you can use this to force floats into integers (the fractional part is truncated, not rounded),
# but notice that you are losing some information while doing so
ex4 = np.array([11.1, 12.7], dtype=np.int64)
print(ex4.dtype)
print()
print(ex4)

int64

[11 12]
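
The difference between truncating, flooring, and rounding only shows up for some inputs, such as negative numbers. The sketch below uses `astype`, `np.floor`, and `np.rint` on a made-up array to compare the three.

```python
import numpy as np

values = np.array([11.1, 12.7, -1.7])

truncated = values.astype(np.int64)          # drops the fractional part (toward zero)
floored = np.floor(values).astype(np.int64)  # rounds down toward negative infinity
rounded = np.rint(values).astype(np.int64)   # rounds to the nearest integer

print(truncated.tolist())  # [11, 12, -1]
print(floored.tolist())    # [11, 12, -2]
print(rounded.tolist())    # [11, 13, -2]
```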


In [57]:
# you can also force integers to floats if you anticipate that the values may change to floats later
ex5 = np.array([11, 21], dtype=np.float64)
print(ex5.dtype)
print()
print(ex5)

float64

[ 11.  21.]


### Arithmetic array operations

In [59]:
x = np.array([[111, 112], [121, 122]], dtype=np.int64)
y = np.array([[211.1, 212.1], [221.1, 222.1]], dtype=np.float64)

print(x)
print()
print(y)

[[111 112]
 [121 122]]

[[ 211.1  212.1]
 [ 221.1  222.1]]


In [60]:
# add
print(x + y)  # The plus sign works
print()
print(np.add(x, y))  # so does the numpy function  "add"

[[ 322.1  324.1]
 [ 342.1  344.1]]

[[ 322.1  324.1]
 [ 342.1  344.1]]


In [64]:
# subtract
print(x - y)  # The minus sign works
print()
print(np.subtract(x, y))  # so does the numpy function  "subtract"

[[-100.1 -100.1]
 [-100.1 -100.1]]

[[-100.1 -100.1]
 [-100.1 -100.1]]


In [65]:
# multiply
print(x * y)  # The * sign works
print()
print(np.multiply(x, y))  # so does the numpy function  "multiply"

[[ 23432.1  23755.2]
 [ 26753.1  27096.2]]

[[ 23432.1  23755.2]
 [ 26753.1  27096.2]]


In [3]:
# divide
print(x / y)  # The / sign works
print()
print(np.divide(x, y))  # so does the numpy function  "divide"

[[ 0.52581715  0.52805281]
 [ 0.54726368  0.54930212]]

[[ 0.52581715  0.52805281]
 [ 0.54726368  0.54930212]]

In [67]:
# square root
print(np.sqrt(x))

[[ 10.53565375  10.58300524]
 [ 11.          11.04536102]]


In [68]:
# exponent (e ** x)
print(np.exp(x))

[[  1.60948707e+48   4.37503945e+48]
 [  3.54513118e+52   9.63666567e+52]]


## Statistical, sorting and set operations

    - Use common ndarray functions for data analysis including statistical, sorting, and set operations
    - These functions will be used frequently in data science

In [4]:
# set up a 2x5 matrix of random values (the numbers differ on every run, so your outputs will not match those shown)
arr = 10 * np.random.randn(2,5)
print(arr)

[[ 12.7726313   21.33956986  -1.3889379    8.50961393 -18.86314504]
 [  0.28343221   3.48805733 -16.08267311   7.37643144  11.39847962]]


In [71]:
# compute the mean for all the elements
print(arr.mean())

-1.11611162775


In [72]:
# compute the means by row
print(arr.mean(axis = 1))

[ 1.63211969 -3.86434295]


In [73]:
# compute the means by column
print(arr.mean(axis = 0))

[  9.39006564 -12.95951803  -7.28473199   6.04603249  -0.77240625]


In [74]:
# sum all the elements
print(arr.sum())

-11.1611162775


In [75]:
# compute the medians by row
print(np.median(arr, axis=1))

[ 2.3183843  -3.86319681]
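
Because the random values above change from run to run, the `axis` argument may be easier to follow on a small fixed array. This sketch uses a made-up `table`; `keepdims` is an optional argument of `mean` that keeps the reduced axis with length 1.

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 9.0]])   # shape (2, 3)

col_means = table.mean(axis=0)   # collapses the rows    -> shape (3,)
row_means = table.mean(axis=1)   # collapses the columns -> shape (2,)

print(col_means.tolist())  # [2.5, 3.5, 6.0]
print(row_means.tolist())  # [2.0, 6.0]

# keepdims=True keeps the reduced axis with length 1
print(table.mean(axis=1, keepdims=True).shape)  # (2, 1)
```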


### Sorting:

In [76]:
# create a 10 element array of randoms
unsorted = np.random.randn(10)

print(unsorted)

[ 0.57403955 -0.68898809 -0.3847977   0.02113651 -1.24759871  0.20863082
  0.91399034 -0.40041768  0.62803342  0.62007379]


In [77]:
# create a copy of the above array and sort
sorted = np.array(unsorted)
sorted.sort()

print(sorted)
print()
print(unsorted)

[-1.24759871 -0.68898809 -0.40041768 -0.3847977   0.02113651  0.20863082
  0.57403955  0.62007379  0.62803342  0.91399034]

[ 0.57403955 -0.68898809 -0.3847977   0.02113651 -1.24759871  0.20863082
  0.91399034 -0.40041768  0.62803342  0.62007379]


In [78]:
# inplace sorting
unsorted.sort()

print(unsorted)

[-1.24759871 -0.68898809 -0.40041768 -0.3847977   0.02113651  0.20863082
  0.57403955  0.62007379  0.62803342  0.91399034]
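
Besides the in-place `sort()` method, `np.sort` returns a sorted copy and `np.argsort` returns the ordering itself. The sketch below uses illustrative `prices` and `labels` arrays to show how one ordering can be applied to a second, parallel array.

```python
import numpy as np

prices = np.array([30, 10, 20])
labels = np.array(['c', 'a', 'b'])

ordered = np.sort(prices)      # returns a sorted copy; prices is unchanged
order = np.argsort(prices)     # indices that would sort prices

print(ordered.tolist())        # [10, 20, 30]
print(prices.tolist())         # [30, 10, 20]
print(labels[order].tolist())  # ['a', 'b', 'c']
```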


### Finding unique elements:

In [79]:
array = np.array([1,2,1,4,2,1,4,2])
print(np.unique(array))

[1 2 4]


### Set operations with np.array data type:

In [80]:
s1 = np.array(['desk', 'chair', 'bulb'])
s2 = np.array(['lamp', 'bulb', 'chair'])
print(s1, s2)

['desk' 'chair' 'bulb'] ['lamp' 'bulb' 'chair']


In [82]:
print(np.intersect1d(s1, s2))  # intersect1d works on 1-D arrays
# It returns the sorted, unique elements that appear in both sets

['bulb' 'chair']


In [83]:
print(np.union1d(s1, s2))

['bulb' 'chair' 'desk' 'lamp']


In [84]:
# we can use difference to check for elements that are in one set but not the other
print(np.setdiff1d(s1, s2))  # elements of s1 that are not in s2

['desk']


In [87]:
print(np.isin(s1, s2))  # for each element of s1, is it also present in s2? (np.in1d is deprecated in recent NumPy)
# returns an array of booleans

[False  True  True]
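
The boolean array returned by a membership test can be used directly as a filter. A minimal sketch, assuming made-up `items` and `allowed` arrays and NumPy's `np.isin` function:

```python
import numpy as np

items = np.array(['desk', 'chair', 'bulb', 'lamp', 'chair'])
allowed = np.array(['chair', 'lamp'])

mask = np.isin(items, allowed)   # True where the item appears in allowed
print(mask.tolist())             # [False, True, False, True, True]
print(items[mask].tolist())      # ['chair', 'lamp', 'chair']
```

Unlike `intersect1d`, this keeps duplicates and the original order of `items`.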


## Additional Common ndarray Operations

### Dot Product on Matrices and Inner Product on Vectors:

In [90]:
# determine the dot product of two matrices
x2d = np.array([[1,1],[1,1]])
y2d = np.array([[2,2],[2,2]])

print(x2d.dot(y2d))
print()
print(np.dot(x2d, y2d))

[[4 4]
 [4 4]]

[[4 4]
 [4 4]]


In [91]:
# determine the inner product of two vectors
a1d = np.array([9 , 9 ])
b1d = np.array([10, 10])

print(a1d.dot(b1d))
print()
print(np.dot(a1d, b1d))

180

180


In [5]:
# dot product on an array/matrix and vector
print(x2d.dot(a1d))
print()
print(np.dot(x2d, a1d))

[18 18]

[18 18]
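
Python's `@` operator is another way to write these products, and it is easy to confuse with the element-wise `*`. The sketch below uses small illustrative arrays `m` and `v` to contrast the two.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
v = np.array([10, 20])

print((m @ m).tolist())   # [[7, 10], [15, 22]]  matrix product
print((m @ v).tolist())   # [50, 110]            matrix-vector product
print(v @ v)              # 500                  inner product
print((m * m).tolist())   # [[1, 4], [9, 16]]    element-wise product
```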

### Sum:

In [93]:
# sum elements in the array
ex1 = np.array([[11,12],[21,22]])

print(np.sum(ex1))          # add all members

66


In [94]:
print(np.sum(ex1, axis=0))  # columnwise sum

[32 34]


In [95]:
print(np.sum(ex1, axis=1))  # rowwise sum

[23 43]


### Element-wise Functions:

For example, let's compare the values of two arrays element by element and keep the larger of each pair.

In [96]:
# random array
x = np.random.randn(8)
x

array([ 1.10809784, -0.92392684,  0.04244232,  0.53913429, -0.18453419,
       -1.35753789, -0.40041814,  1.24152064])

In [97]:
# another random array
y = np.random.randn(8)
y

array([ 0.40944813,  0.81583018,  0.39156259,  0.07650113,  0.80279792,
       -0.02161284, -1.19020275, -1.38772899])

In [98]:
# returns element wise maximum between two arrays

np.maximum(x, y)

array([ 1.10809784,  0.81583018,  0.39156259,  0.53913429,  0.80279792,
       -0.02161284, -0.40041814,  1.24152064])

### Reshaping array:

In [99]:
# grab values from 0 through 19 in an array
arr = np.arange(20)
print(arr)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


In [100]:
# reshape to be a 4 x 5 matrix
arr.reshape(4,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
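
Two details of `reshape` are easy to miss: one dimension can be given as `-1` so that NumPy infers it, and the result does not replace the original array. The sketch below (illustrative names) also suggests that, for a contiguous array like this one, the reshaped result shares memory with the original.

```python
import numpy as np

flat = np.arange(20)

grid = flat.reshape(4, -1)   # -1 asks NumPy to infer that dimension (5 here)
print(grid.shape)            # (4, 5)
print(flat.shape)            # (20,)  reshape does not change flat itself

grid[0, 0] = 99              # here the reshaped array is a view onto flat
print(flat[0])               # 99
```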

### Transpose:

In [102]:
# transpose
ex1 = np.array([[11,12],[21,22]])

ex1.T

array([[11, 21],
       [12, 22]])

### Indexing using where():

In [104]:
x_1 = np.array([1,2,3,4,5])

y_1 = np.array([11,22,33,44,55])

filter = np.array([True, False, True, False, True])

In [105]:
out = np.where(filter, x_1, y_1)
print(out)

[ 1 22  3 44  5]


In [106]:
mat = np.random.rand(5,5)
mat

array([[ 0.37951327,  0.81400244,  0.55820454,  0.61704261,  0.64494202],
       [ 0.76273291,  0.24963332,  0.6987649 ,  0.17919622,  0.15430031],
       [ 0.308325  ,  0.75092146,  0.01968035,  0.38331505,  0.25865723],
       [ 0.84264595,  0.60479741,  0.3237412 ,  0.93665572,  0.74541257],
       [ 0.61687403,  0.52145583,  0.05153915,  0.94320829,  0.37371856]])

In [6]:
# use np.where() to set values > 0.5 to 1000 and others to -1
np.where( mat > 0.5, 1000, -1)

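array([[  -1, 1000, 1000, 1000, 1000],
       [1000,   -1, 1000,   -1,   -1],
       [  -1, 1000,   -1,   -1,   -1],
       [1000, 1000,   -1, 1000, 1000],
       [1000, 1000,   -1, 1000,   -1]])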
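
`np.where` has two further uses worth knowing: called with only a condition it returns the matching indices, and either branch can be an array rather than a constant. A minimal sketch with made-up temperature data:

```python
import numpy as np

temps = np.array([18.5, 25.0, 31.2, 22.1, 35.6])

# with only a condition, where returns a tuple of index arrays (one per dimension)
hot_idx = np.where(temps > 30)[0]
print(hot_idx.tolist())      # [2, 4]

# an array can serve as the "else" branch: cap values above 30, keep the rest
capped = np.where(temps > 30, 30.0, temps)
print(capped.tolist())       # [18.5, 25.0, 30.0, 22.1, 30.0]
```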

### "any" or "all" conditionals:

In [108]:
arr_bools = np.array([ True, False, True, True, False ])

In [109]:
arr_bools.any()

True

In [110]:
arr_bools.all()

False

### Random Number Generation:

In [112]:
Y = np.random.normal(size = (1,5))[0]
print(Y)

[-0.13410767 -0.6040243   0.34002731  1.26322453  0.14105455]


In [113]:
Z = np.random.randint(low=2,high=50,size=4)
print(Z)

[21 42 46 38]


In [114]:
np.random.permutation(Z) #return a new ordering of elements in Z

array([46, 42, 21, 38])

In [7]:
np.random.uniform(size=4) #uniform distribution

array([ 0.44785999,  0.33245886,  0.71784033,  0.83647467])

In [116]:
np.random.normal(size=4) #normal distribution

array([-0.13773466,  2.19683002, -0.28557715,  0.46991295])
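
Random outputs like the ones above change on every run. When reproducibility matters, one option is a seeded generator from `np.random.default_rng` (available in NumPy 1.17 and later). The seed value 42 below is arbitrary.

```python
import numpy as np

# two generators with the same seed produce the same sequence
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

draw_a = rng_a.normal(size=4)
draw_b = rng_b.normal(size=4)
print(np.array_equal(draw_a, draw_b))  # True

ints = rng_a.integers(low=2, high=50, size=4)  # high is exclusive
shuffled = rng_a.permutation(ints)             # new ordering; ints is unchanged
print(sorted(shuffled.tolist()) == sorted(ints.tolist()))  # True
```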

### Merging data sets:

In [117]:
K = np.random.randint(low=2,high=50,size=(2,2))
print(K)

print()
M = np.random.randint(low=2,high=50,size=(2,2))
print(M)

[[29 49]
 [ 2  8]]

[[43 38]
 [35 17]]


In [118]:
np.vstack((K,M)) # stack vertically

array([[29, 49],
       [ 2,  8],
       [43, 38],
       [35, 17]])

In [119]:
np.hstack((K,M)) # stack horizontally

array([[29, 49, 43, 38],
       [ 2,  8, 35, 17]])

In [120]:
np.concatenate([K, M], axis = 0) # join along axis 0: adds rows, like vstack

array([[29, 49],
       [ 2,  8],
       [43, 38],
       [35, 17]])

In [121]:
np.concatenate([K, M.T], axis = 1) # join along axis 1: adds columns, like hstack

array([[29, 49, 43, 35],
       [ 2,  8, 38, 17]])
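
With square inputs it is hard to see which dimension grows. The sketch below uses arrays of different shapes (names are illustrative) to show that only the size along the joined axis may differ, and that `np.stack` adds a new axis instead.

```python
import numpy as np

top = np.zeros((2, 3))
bottom = np.ones((4, 3))
side = np.ones((2, 5))

print(np.concatenate([top, bottom], axis=0).shape)  # (6, 3)  rows add up
print(np.concatenate([top, side], axis=1).shape)    # (2, 8)  columns add up

# np.stack adds a new axis instead of extending an existing one
print(np.stack([top, top]).shape)                   # (2, 2, 3)
```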

## Broadcasting:

Broadcasting is one of the more advanced features of NumPy, and it can help make your array operations more convenient.

    - Employ broadcasting to perform operations on different size ndarrays

If you have arrays with different dimensions, broadcasting lets you combine them without manually reshaping or tiling them to match.

These are our original arrays:


![original array](broadcasting_1.png)

Broadcasting conceptually stretches the smaller array to match the larger one, and the operation is then applied element by element:

![after broadcasting](broadcasting_2.png)
![result](broadcasting_3.png)

**Remember that B retains its original shape; the picture above is for illustration purposes only. No copy of B is made, which makes this process efficient in both memory and computation.**

https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html

In [122]:
# create a 4x3 matrix of zeros
import numpy as np

start = np.zeros((4,3))
print(start)

[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]


In [123]:
# create a rank 1 ndarray with 3 values
add_rows = np.array([1, 0, 2])
print(add_rows)

[1 0 2]


In [124]:
y = start + add_rows  # add to each row of start using broadcasting
print(y)

[[ 1.  0.  2.]
 [ 1.  0.  2.]
 [ 1.  0.  2.]
 [ 1.  0.  2.]]


In [8]:
# create a 4x1 ndarray to broadcast across columns
add_cols = np.array([[0, 1, 2, 3]])
add_cols = add_cols.T  # Transpose function

print(add_cols)

[[0]
 [1]
 [2]
 [3]]


In [132]:
# add to each column of "start" using broadcasting
y = start + add_cols
print(y)

[[ 0.  0.  0.]
 [ 1.  1.  1.]
 [ 2.  2.  2.]
 [ 3.  3.  3.]]


In [134]:
# a one-element array broadcasts across both dimensions
add_scalar = np.array([1])
print(start + add_scalar)

[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]
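
A typical practical use of broadcasting is centering each column of a data matrix. The sketch below uses a made-up `data` array; `np.broadcast_to` is included only to illustrate the "no copy" point, and the final `try` block shows what happens when shapes are incompatible.

```python
import numpy as np

data = np.array([[1.0, 10.0, 100.0],
                 [3.0, 30.0, 300.0]])   # shape (2, 3)

col_means = data.mean(axis=0)           # shape (3,)
centered = data - col_means             # (2, 3) - (3,) broadcasts across rows
print(centered.tolist())  # [[-1.0, -10.0, -100.0], [1.0, 10.0, 100.0]]

# broadcast_to exposes the stretched operand as a read-only view;
# a stride of 0 on the first axis means the same row is reused, not copied
stretched = np.broadcast_to(col_means, data.shape)
print(stretched.shape, stretched.strides[0])  # (2, 3) 0

# trailing dimensions must be equal or 1; otherwise NumPy raises ValueError
try:
    data - np.array([1.0, 2.0])         # (2, 3) vs (2,)
except ValueError:
    print("shapes (2, 3) and (2,) do not broadcast")
```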


## Speed test: ndarray vs list

    - Describe the speed benefits of ndarrays over lists

First, set up parameters for the speed test. We'll be testing the time to sum elements in an ndarray versus in a list.

In [136]:
from numpy import arange
from timeit import Timer

size = 1000000
timeits = 1000

In [137]:
# create an ndarray with values 0,1,2,...,size-1
nd_array = arange(size)
print(type(nd_array))

<class 'numpy.ndarray'>


In [138]:
# timer expects the operation as a parameter,
# here, we pass nd_array.sum()
timer_numpy = Timer("nd_array.sum()", "from __main__ import nd_array")

print("Time taken by numpy ndarray: %f seconds" % (timer_numpy.timeit(timeits)/timeits))

Time taken by numpy ndarray: 0.000695 seconds


In [139]:
# create a list with values 0,1,2,...,size-1
a_list = list(range(size))
print(type(a_list))

<class 'list'>


In [140]:
# timer expects the operation as a parameter,
# here, we pass sum(a_list)
timer_list = Timer("sum(a_list)", "from __main__ import a_list")

print("Time taken by list: %f seconds" % (timer_list.timeit(timeits)/timeits))

Time taken by list: 0.007483 seconds
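
The same comparison can be written as one self-contained script by passing callables to `timeit.timeit` instead of statement strings. This is a sketch under assumed parameters (a smaller `size` and fewer `repeats` than above, to keep it quick); absolute timings depend on the machine.

```python
import timeit
import numpy as np

size = 100_000
nd_array = np.arange(size, dtype=np.int64)
a_list = list(range(size))

repeats = 50
# timeit.timeit accepts a callable; divide by repeats to get the time per call
t_numpy = timeit.timeit(nd_array.sum, number=repeats) / repeats
t_list = timeit.timeit(lambda: sum(a_list), number=repeats) / repeats

print(f"ndarray.sum(): {t_numpy:.6f} s per call")
print(f"sum(list):     {t_list:.6f} s per call")

# both approaches agree on the result
assert int(nd_array.sum()) == sum(a_list)
```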
