<h1 id="tocheading">Table of Contents and Notebook Setup</h1>
<div id="toc"></div>

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

# Introduction

NumPy is short for Numerical Python. It is a foundational package for numerical computation in Python. Among other things, it includes:

<b>(i)</b> ndarray: a multidimensional array providing fast arithmetic operations

<b>(ii)</b> Mathematical functions for fast operations on entire arrays

<b>(iii)</b> Tools for writing and reading array data to disk and working with memory-mapped files

<b>(iv)</b> Linear algebra, random number generation, Fourier transform capabilities

<b>(v)</b> A C API for connecting NumPy with libraries written in C, C++, and Fortran

The C API makes it easy to pass data to external libraries written in a low-level language, and for those libraries to return data to Python as NumPy arrays. This feature often makes Python a good choice for wrapping legacy C/C++/Fortran codebases and giving them an easy-to-use interface.


# ndarray: The Multidimensional Array Object

ndarray is a fast and flexible container for large data sets in Python. Consider some simple operations:

In [2]:
import numpy as np

data = np.random.randn(2,3)
data

array([[ 1.75067497, -0.55541901, -2.29155033],
       [-0.66385816,  0.3124406 , -0.3907257 ]])

In [3]:
data*10

array([[ 17.50674974,  -5.5541901 , -22.91550326],
       [ -6.63858164,   3.12440602,  -3.90725697]])

In [4]:
data+data

array([[ 3.50134995, -1.11083802, -4.58310065],
       [-1.32771633,  0.6248812 , -0.78145139]])

Note that np.random.randn(n, m) returns an n-by-m array where each entry is a random number drawn from a Gaussian distribution with mean 0 and variance 1.
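As a rough check of that description, the sketch below draws a much larger sample and compares its mean and standard deviation with 0 and 1. The seed and the sample size are arbitrary choices made only for this illustration.

```python
import numpy as np

np.random.seed(0)                     # arbitrary seed, only for reproducibility
sample = np.random.randn(1000, 500)   # 1000 x 500 array of standard normal draws

print(sample.shape)    # (1000, 500)
print(sample.mean())   # close to 0
print(sample.std())    # close to 1
```

With this many draws the sample statistics should land within about a hundredth of the theoretical values, but they will never match exactly.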

Every array has a <b> shape </b> and a <b> datatype </b>.

In [5]:
data.shape

(2, 3)

In [6]:
data.dtype

dtype('float64')

## Creating ndarrays

We can convert regular Python lists to ndarrays as follows:

In [7]:
data1=[3, 4, 5, 6]
arr1 = np.array(data1)
arr1

array([3, 4, 5, 6])

Nested sequences (like a list of lists) can be converted into a multidimensional numpy array.

In [8]:
nested_list=[[1, 2, 3],[4, 5, 6]]
arr1 = np.array(nested_list)
arr1

array([[1, 2, 3],
       [4, 5, 6]])

This array has a few properties, including its <b> number of dimensions </b> and its <b> shape </b> attributes.

In [9]:
arr1.ndim

2

In [10]:
arr1.shape

(2, 3)

In addition, when the array is created, np.array tries to infer a good data type from the data it receives. The data type is stored in a special dtype metadata object.

In [11]:
arr1.dtype

dtype('int64')

There are a number of other ways to create arrays. If we want an array of a given shape before we have real data for it, for example, we can simply fill it with zeros.

In [12]:
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [13]:
np.zeros((3,5))

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
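A few other creation functions follow the same pattern as np.zeros. The sketch below is only a small sample of what NumPy offers, and the shapes and fill values are arbitrary choices for illustration.

```python
import numpy as np

ones = np.ones((2, 3))         # 2 x 3 array filled with 1.0
sevens = np.full((2, 2), 7)    # 2 x 2 array filled with the given value
evens = np.arange(0, 10, 2)    # like Python's range, but returns an ndarray
blank = np.empty((2, 3))       # allocated but not initialized; contents are arbitrary

print(ones.tolist())     # [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
print(sevens.tolist())   # [[7, 7], [7, 7]]
print(evens.tolist())    # [0, 2, 4, 6, 8]
print(blank.shape)       # (2, 3)
```

Note that np.empty does not promise zeros, so it is only safe when every element will be overwritten before it is read.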

## Data Types for ndarrays

We can specify what type of data is contained in an ndarray by invoking a keyword argument.

In [14]:
arr1 = np.array([1, 2, 3], dtype = np.float64)
arr1.dtype

dtype('float64')

dtypes are a large part of what lets NumPy interact with other systems. Because a dtype describes exactly how the data is laid out in memory, arrays can be written to and read from disk and passed to code written in C, C++ or Fortran.

We can cast an array from one dtype to another using the <b> astype </b> method.

In [15]:
arr = np.array([1,2,3,4,5])
arr.dtype

dtype('int64')

In [16]:
float_arr = arr.astype(np.float64)
float_arr.dtype

dtype('float64')

In this example, integers were cast to floating point. If we cast floating point to integer, the decimal part is truncated (<i> not rounded </i>).
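A short sketch of the difference between truncation and rounding, using made-up values:

```python
import numpy as np

float_vals = np.array([3.7, -1.2, 0.9, -2.9])
int_vals = float_vals.astype(np.int32)   # fractional part is dropped (toward zero)
rounded = np.round(float_vals)           # rounding, shown for comparison

print(int_vals.tolist())   # [3, -1, 0, -2]
print(rounded.tolist())    # [4.0, -1.0, 1.0, -3.0]
```

If rounding is what you want, round first and cast afterwards.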

We can also cast strings representing numbers to their numeric form.

In [17]:
# np.string_ was removed in NumPy 2.0; np.bytes_ is the same fixed-width byte string type
numeric_strings = np.array(['1.23', '4.56', '5.67'], dtype=np.bytes_)
numeric_strings.astype(float)

array([ 1.23,  4.56,  5.67])

Here we wrote 'float' instead of 'np.float64'. NumPy automatically aliases the Python data types to its own equivalent data types:

In [18]:
numeric_strings.astype(float).dtype

dtype('float64')

If casting fails somehow, then a <i>ValueError</i> is raised. 

In [19]:
bad_strings = np.array(['1.23','word'])
bad_strings.astype(float)

ValueError: could not convert string to float: 'word'

Finally, we can cast one array to the dtype of another array by passing that array's dtype attribute to astype. This is useful when data from different sources must share a data type.

In [20]:
int_array = np.array([1, 2, 3, 4, 5])
decimal_array = np.array([1.45, 8.32, 3.45])

int_array.astype(decimal_array.dtype)

array([ 1.,  2.,  3.,  4.,  5.])

# Arithmetic with NumPy Arrays

Arrays are important because we can do operations on data without writing for loops. NumPy users call this <b> vectorization.</b> Any arithmetic between <i> equally sized </i> NumPy arrays occurs element-wise.

In [21]:
arr = np.array([[1, 2, 3],[4, 5, 6]], dtype=float)
arr

array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

In [22]:
arr+arr

array([[  2.,   4.,   6.],
       [  8.,  10.,  12.]])

In [23]:
arr*arr

array([[  1.,   4.,   9.],
       [ 16.,  25.,  36.]])

You get the point. Arithmetic operations with scalars propagate the scalar argument to each element in the array.

In [24]:
2*arr+5

array([[  7.,   9.,  11.],
       [ 13.,  15.,  17.]])

Comparisons between arrays of the same size yield boolean arrays.

In [25]:
2/arr > 0.5*arr+1

array([[ True, False, False],
       [False, False, False]], dtype=bool)

Operations between arrays of <i> different </i> sizes are called <b> broadcasting </b> and are discussed in the appendix of the textbook.
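For a first impression of broadcasting, here is a minimal sketch with made-up values. It covers only the two simplest cases: a one-dimensional array that matches the number of columns, and a column vector that matches the number of rows.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)   # shape (2, 3)
row = np.array([10, 20, 30], dtype=float)             # shape (3,)
col = np.array([[100], [200]], dtype=float)           # shape (2, 1)

by_row = arr + row   # row is added to every row of arr
by_col = arr + col   # col is added to every column of arr

print(by_row.tolist())   # [[11.0, 22.0, 33.0], [14.0, 25.0, 36.0]]
print(by_col.tolist())   # [[101.0, 102.0, 103.0], [204.0, 205.0, 206.0]]
```

The scalar arithmetic shown earlier is the simplest instance of the same mechanism.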

# Basic Indexing and Slicing

One-dimensional arrays are simple to index: on the surface they behave just like Python lists.

In [26]:
arr = np.arange(10)
arr[5:8]

array([5, 6, 7])

We can also assign to a slice. A scalar value is propagated to every element in the selection:

In [27]:
arr[5:8] = 42
arr

array([ 0,  1,  2,  3,  4, 42, 42, 42,  8,  9])

There is an extremely important distinction from Python lists: array slices <i> are views, not copies. </i> That means that any changes to the array slice are reflected in the original array.

In [28]:
arr_slice = arr[5:8]
arr_slice[1] = 12334

arr

array([    0,     1,     2,     3,     4,    42, 12334,    42,     8,     9])

The array slice "points to" the original values in the NumPy array. This saves space when dealing with large data sets.

The "bare" slice [:] assigns to all values in an array.

In [29]:
arr_slice[:] = 64
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

If we really want a copy rather than a view, we can copy a subsection of an array explicitly using, in this case, something like arr[5:8].copy().
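The difference between a view and a copy can be checked directly. The sketch below starts from a fresh array so that it runs on its own; np.shares_memory reports whether two arrays overlap in memory.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]            # a view: shares memory with arr
copied = arr[5:8].copy()   # an independent copy

copied[:] = -1             # does not touch arr
print(arr.tolist())        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

view[:] = 64               # writes through to arr
print(arr.tolist())        # [0, 1, 2, 3, 4, 64, 64, 64, 8, 9]

print(np.shares_memory(arr, view), np.shares_memory(arr, copied))   # True False
```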

## Indexing in Higher Dimensional Arrays

Higher dimensional array indexing is slightly more complicated. Elements at each index are arrays themselves:

In [30]:
arr2d = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])
arr2d[1]

array([4, 5, 6])

Since arr2d[1] is itself a one-dimensional array, we just need to index it again to get specific elements. A comma-separated list of indices does the same thing in one step.

In [33]:
arr2d[1][0]
arr2d[1, 0] #equivalent

4

In multidimensional arrays, if you omit later indices then the object returned will be a lower dimensional array consisting of all the data along the higher dimensions:

In [34]:
# Both blocks must have the same shape (2 x 3) to form a regular 2 x 2 x 3 array
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])
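To illustrate how each additional index removes one dimension, the sketch below builds a regular 2 x 2 x 3 array and indexes it one, two and three levels deep.

```python
import numpy as np

arr3d = np.array([[[1, 2, 3], [4, 5, 6]],
                  [[7, 8, 9], [10, 11, 12]]])   # shape (2, 2, 3)

print(arr3d.shape)            # (2, 2, 3)
print(arr3d[1].tolist())      # 2-D result: [[7, 8, 9], [10, 11, 12]]
print(arr3d[1, 0].tolist())   # 1-D result: [7, 8, 9]
print(arr3d[1, 0, 2])         # a single element: 9
```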

## Indexing with Slices

Slicing can be extended to work with multi-dimensional arrays. Note that in 2d arrays, the row slice always comes first and the column slice second.

In [37]:
arr3d[0][:2, 1:]

array([[2, 3],
       [5, 6]])
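A few more slicing patterns on a small 2d array may help. This is a sketch with made-up values; mixing an integer index with a slice lowers the dimension of the result, while a bare colon takes an entire axis.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(arr2d[:2].tolist())       # first two rows: [[1, 2, 3], [4, 5, 6]]
print(arr2d[:2, 1:].tolist())   # first two rows, columns 1 onward: [[2, 3], [5, 6]]
print(arr2d[1, :2].tolist())    # integer + slice gives a 1-D result: [4, 5]
print(arr2d[:, :1].tolist())    # all rows, first column only: [[1], [4], [7]]
```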

# Boolean Indexing

## Introduction

Boolean indexing in NumPy is similar to that in pandas. Consider the following data:

In [52]:
names = np.array(['Bob','Joe','Will','Joe','Bob','Will','Will'])
data = np.random.randn(7,4)

If each name corresponds to a <i> row </i> in the numpy array and we want to select the rows corresponding to 'Bob', then we can do the following:

In [53]:
data[names=='Bob']

array([[ 0.53167396,  0.0629875 , -0.26724699, -0.68922298],
       [ 0.9609578 ,  0.40406404,  0.33030467, -1.01213574]])

The boolean array passed into data[...] must be the same length as the axis it's indexing.
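As a sketch of what happens when the lengths do not match: in recent NumPy releases the lookup raises an IndexError. The array of zeros and the short mask below are placeholders invented for this illustration.

```python
import numpy as np

data = np.zeros((7, 4))
short_mask = np.array([True, False, True])   # length 3, but data has 7 rows

try:
    data[short_mask]
    mismatch_raised = False
except IndexError as err:
    mismatch_raised = True
    print("IndexError:", err)
```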

We can mix boolean indexing and regular indexing for concise code:

In [54]:
data[names=='Joe', 1:3]

array([[ 0.21120775,  0.52717437],
       [ 1.98533752, -2.58838035]])

Negations can be obtained using ~ or !=. The ~ is particularly useful when you have a general (perhaps very large) logical condition which you must negate.

In [55]:
~(names=='Bob')
names!='Bob' #equivalent

array([False,  True,  True,  True, False,  True,  True], dtype=bool)

## Creating Masks for Complex Logical Statements

We can create logical masks for complex statements.

In [56]:
mask = (names == 'Bob') | (names == 'Will')

data[mask]

array([[ 0.53167396,  0.0629875 , -0.26724699, -0.68922298],
       [-0.45073639,  0.68042297,  2.27010218,  1.24061411],
       [ 0.9609578 ,  0.40406404,  0.33030467, -1.01213574],
       [ 0.36954954,  0.12929251, -2.02457337, -1.38915124],
       [-0.03669644, -0.09884574,  0.30501663,  0.08721723]])

<b> IMPORTANT: </b> Logical keywords 'and' and 'or' do not work for boolean indexing or creating masks. Use & and | instead.
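A minimal sketch of the difference, reusing the names array from above. The keyword form fails because Python asks for the truth value of a whole array, which NumPy refuses to guess.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Joe', 'Bob', 'Will', 'Will'])

# Element-wise OR works. The parentheses are required because |
# binds more tightly than ==.
mask = (names == 'Bob') | (names == 'Will')
print(mask.tolist())   # [True, False, True, False, True, True, True]

# The keyword version raises instead of combining element-wise.
try:
    (names == 'Bob') or (names == 'Will')
    keyword_failed = False
except ValueError as err:
    keyword_failed = True
    print("ValueError:", err)
```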

Boolean indexing can be useful to change or get rid of unwanted values.

In [57]:
data[data<0] = 0
data

array([[ 0.53167396,  0.0629875 ,  0.        ,  0.        ],
       [ 0.        ,  0.21120775,  0.52717437,  1.25553763],
       [ 0.        ,  0.68042297,  2.27010218,  1.24061411],
       [ 1.32118295,  1.98533752,  0.        ,  0.60246169],
       [ 0.9609578 ,  0.40406404,  0.33030467,  0.        ],
       [ 0.36954954,  0.12929251,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.30501663,  0.08721723]])
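A one-dimensional boolean array can also be used to overwrite whole rows. The sketch below uses np.arange in place of random data so that the result is reproducible; the fill value 7 is arbitrary.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Joe', 'Bob', 'Will', 'Will'])
data = np.arange(28, dtype=float).reshape(7, 4)   # stand-in for the random data

data[names != 'Joe'] = 7    # overwrite every row whose name is not 'Joe'

print(data[1].tolist())     # a 'Joe' row, unchanged: [4.0, 5.0, 6.0, 7.0]
print(data[0].tolist())     # a 'Bob' row, overwritten: [7.0, 7.0, 7.0, 7.0]
```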

# Transposing Arrays and Swapping Axes

Transposing arrays is particularly useful for matrix algebra.

In [59]:
arr = np.arange(15).reshape((3,5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [60]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

Calculating the matrix product of arr and its transpose:

In [61]:
np.dot(arr,arr.T)

array([[ 30,  80, 130],
       [ 80, 255, 430],
       [130, 430, 730]])
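Two related tools are worth a brief sketch: the @ operator, which computes the same matrix product as np.dot for 2d arrays, and the swapaxes and transpose methods, which reorder the axes of higher dimensional arrays. The 2 x 3 x 4 array below is made up for illustration.

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))
product = arr @ arr.T                      # @ is the matrix-multiplication operator
print(np.array_equal(product, np.dot(arr, arr.T)))   # True

arr3 = np.arange(24).reshape((2, 3, 4))
swapped = arr3.swapaxes(1, 2)              # exchange the last two axes
reordered = arr3.transpose((1, 0, 2))      # give the new axis order explicitly

print(swapped.shape)                       # (2, 4, 3)
print(reordered.shape)                     # (3, 2, 4)
print(np.shares_memory(arr3, swapped))     # True: swapaxes returns a view
```

Like slices, swapaxes returns a view on the original data, so nothing is copied.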