
# Numpy 

multidimensional data array

In [1]:
# Ipython magic
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Introduction

In the `numpy` package the terminology used for vectors, matrices and higher-dimensional data sets is *array*. 



## Creating `numpy` arrays

There are a number of ways to initialize new numpy arrays, for example:

* from a Python list or tuple
* using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, etc.
* by reading data from files

### From lists

We can use the `numpy.array` function.

In [2]:
# a vector: the argument to the array function is a Python list
v = array([1,2,3,4])
v

array([1, 2, 3, 4])

In [3]:
# a matrix: the argument to the array function is a nested Python list
M = array([[1, 2], [3, 4]])
M

array([[1, 2],
       [3, 4]])

The `v` and `M` objects are both of the type `numpy.ndarray`

In [4]:
type(v), type(M)

(numpy.ndarray, numpy.ndarray)

The difference between the `v` and `M` arrays is only their shapes. 

We can check it with the `ndarray.shape` property.

In [5]:
v.shape

(4,)

In [6]:
M.shape

(2, 2)

The number of elements in the array is available through the `ndarray.size` property:

In [7]:
M.size

4

Equivalently, we could use the functions `numpy.shape` and `numpy.size`:

In [8]:
shape(M)

(2, 2)

In [9]:
size(M)

4

So far the `numpy.ndarray` looks awfully much like a Python list (or nested list). 

Why not simply use Python lists for computations instead of creating a new array type? 

**There are several reasons**

* Python lists are very general. 
    - They can contain any kind of object. 
    - They are dynamically typed. 
* They do not support mathematical functions 
    - such as matrix and dot multiplications, etc. 
    - Implementing such functions for Python lists would not be very efficient 
        * because of the dynamic typing

* Numpy arrays are **statically typed** and **homogeneous**. 
    - The type of the elements is determined when the array is created
    - Because the element type is known in advance, numpy can implement low-level optimizations
* Numpy arrays are memory efficient.
     - fast mathematical functions can be implemented in a compiled language
        * C and Fortran are used
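
The homogeneity point can be illustrated with a minimal sketch. It assumes plain `import numpy as np` instead of the `%pylab` namespace used in this notebook, and the variable names are illustrative only:

```python
import numpy as np

# A Python list may mix element types freely
mixed_list = [1, 2.5, "three"]

# A numpy array stores a single element type: mixing ints and a float
# upcasts every element to float64
arr = np.array([1, 2.5, 3])
print(arr.dtype)
print(arr)
```

Every element of `arr` ends up as a float, whereas the list keeps an int, a float and a string side by side.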

Using the `dtype` (data type) property of an `ndarray`, we can see what type the data of an array has:

In [10]:
M.dtype

dtype('int64')

We get an error if we try to assign a value of the wrong type to an element in a numpy array:

In [11]:
M[0,0] = "hello"

ValueError: invalid literal for long() with base 10: 'hello'
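
Not every mismatched assignment fails as loudly as the string above. The following sketch (assuming `import numpy as np`; `M_int` is a hypothetical name) shows that a float assigned into an integer array is accepted but truncated:

```python
import numpy as np

M_int = np.array([[1, 2], [3, 4]])

# A string that cannot be parsed as an integer is rejected
try:
    M_int[0, 0] = "hello"
    raised = False
except ValueError:
    raised = True

# A float is accepted, but silently truncated to fit the integer dtype
M_int[0, 0] = 3.7
print(raised, M_int[0, 0])
```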

If we want, we can explicitly define the type of the array data when we create it, using the `dtype` keyword argument: 

In [12]:
M = array([[1, 2], [3, 4]], dtype=complex)

M

array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]])

Common types that can be used with `dtype` 

    `int`, `float`, `complex`, `bool`, `object`, etc.

We can also explicitly define the bit size of the data types

    `int64`, `int16`, `float128`, `complex128`.
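
As a small sketch of what the bit size means in practice (assuming `import numpy as np`; note that `float128` is not available on every platform, so only integer types are used here):

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int16)
big = np.array([1, 2, 3], dtype=np.int64)

# bytes used per element: 2 for int16, 8 for int64
print(small.itemsize, big.itemsize)

# the largest value an int16 element can hold
print(np.iinfo(np.int16).max)
```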

## If I don't see it, I don't believe it

`ndarray` = n-dimensional array

<img src="images/ndarray.png">

In [13]:
import numpy as np
dim = 10000

A quick benchmark

In [14]:
# Normal python vector
a = range(dim)
t1 = %timeit -o [i**2 for i in a]

1000 loops, best of 3: 699 µs per loop


In [15]:
# Numpy vector with normal python loop
b = np.arange(dim)
t2 = %timeit -o [i**2 for i in b]

1000 loops, best of 3: 1.78 ms per loop


In [16]:
# Numpy vector with numpy loop
c = np.arange(dim)
t3 = %timeit -n 1000 -o [c**2]

1000 loops, best of 3: 8.75 µs per loop


In [17]:
print "Python loops (no) speedup: ", t1.best / t2.best

Python loops (no) speedup:  0.393701683621


In [18]:
print "Numpy loops speedup:", int(t1.best / t3.best), "x"

Numpy loops speedup: 79 x
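
The benchmark above relies on the IPython `%timeit` magic. Outside IPython, a rough equivalent can be sketched with the standard `timeit` module. This is a sketch under assumptions (Python 3, `import numpy as np`, arbitrary repetition counts), and the timings will differ from the numbers shown above:

```python
import timeit
import numpy as np

dim = 10000
a = list(range(dim))
c = np.arange(dim)

# Time 20 repetitions of each variant and keep the best of 3 runs
t_list = min(timeit.repeat(lambda: [i**2 for i in a], number=20, repeat=3))
t_numpy = min(timeit.repeat(lambda: c**2, number=20, repeat=3))

print("list comprehension:", t_list)
print("numpy vectorized:  ", t_numpy)

# Both variants compute the same values
same = [i**2 for i in a] == (c**2).tolist()
print(same)
```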


We want to make sure...

In [19]:
print "Type", type(a), [i**2 for i in a][0:10]

Type <type 'list'> [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [20]:
print type(b), (b**2)[0:10]

<type 'numpy.ndarray'> [ 0  1  4  9 16 25 36 49 64 81]


## Using more array-generating functions

#### arange

In [21]:
# create a range
x = arange(0, 10, 1) # arguments: start, stop, step
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [22]:
x = arange(-1, 1, 0.1)
x

array([ -1.00000000e+00,  -9.00000000e-01,  -8.00000000e-01,
        -7.00000000e-01,  -6.00000000e-01,  -5.00000000e-01,
        -4.00000000e-01,  -3.00000000e-01,  -2.00000000e-01,
        -1.00000000e-01,  -2.22044605e-16,   1.00000000e-01,
         2.00000000e-01,   3.00000000e-01,   4.00000000e-01,
         5.00000000e-01,   6.00000000e-01,   7.00000000e-01,
         8.00000000e-01,   9.00000000e-01])

In [23]:
type(x)

numpy.ndarray
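
The `linspace` function mentioned earlier is often a better fit than `arange` for non-integer steps, because we specify the number of points instead of the step. A minimal sketch, assuming `import numpy as np`:

```python
import numpy as np

# 21 evenly spaced points from -1 to 1, both end points included
x_lin = np.linspace(-1, 1, 21)

# arange with a float step excludes the stop value and is subject to
# floating-point rounding (see the -2.22e-16 in the output above)
x_ar = np.arange(-1, 1, 0.1)

print(len(x_lin), x_lin[0], x_lin[-1])
print(len(x_ar))
```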

#### mgrid

In [24]:
print numpy.mgrid.__doc__.split('\n')[0]

`nd_grid` instance which returns a dense multi-dimensional "meshgrid".


In [25]:
x, y = mgrid[0:5, 0:5] # similar to meshgrid in MATLAB

In [26]:
x

array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4]])

In [27]:
y

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])
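
For comparison, the same pair of coordinate arrays can be built with `numpy.meshgrid`. The sketch below assumes `import numpy as np` and requests matrix-style (`'ij'`) indexing so that the result matches `mgrid`:

```python
import numpy as np

x_m, y_m = np.mgrid[0:5, 0:5]

# meshgrid gives the same arrays when 'ij' indexing is requested
x_g, y_g = np.meshgrid(np.arange(5), np.arange(5), indexing="ij")

print(np.array_equal(x_m, x_g), np.array_equal(y_m, y_g))
```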

#### random data

In [28]:
from numpy import random
# uniform random numbers in [0, 1)
random.rand(5,5)

array([[ 0.96801788,  0.63437251,  0.23102988,  0.92648884,  0.95317623],
       [ 0.95315949,  0.05506889,  0.59056288,  0.85642225,  0.61302573],
       [ 0.30939337,  0.03196916,  0.40909037,  0.08196318,  0.56442712],
       [ 0.80711099,  0.43538887,  0.57718446,  0.40423574,  0.16216452],
       [ 0.40554686,  0.17690677,  0.58525439,  0.10148798,  0.41444299]])

In [29]:
# standard normal distributed random numbers
random.randn(5,5)

array([[-0.29381719,  1.2519051 ,  0.10697473,  3.2290514 ,  1.68826376],
       [-1.25652205,  0.75161627,  0.00803016,  1.0439989 , -1.29815987],
       [-0.89138174,  0.74708657,  1.72588807,  0.80113815,  0.21620933],
       [ 0.3777295 ,  1.07946447,  0.82354959,  2.61366112,  0.05284668],
       [ 0.38228489, -1.21605111, -0.98459671,  0.25707789,  0.97696469]])
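
Random output changes on every run, which is why the numbers above cannot be reproduced exactly. If reproducibility matters, the global generator can be seeded first. A minimal sketch, assuming `import numpy as np` and an arbitrary seed value:

```python
import numpy as np

np.random.seed(42)            # fix the state of the global generator
first = np.random.rand(3)

np.random.seed(42)            # same seed, same sequence
second = np.random.rand(3)

print(np.array_equal(first, second))
```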

#### diag

In [30]:
# a diagonal matrix
diag([1,2,3])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

In [31]:
# diagonal with offset from the main diagonal
diag([1,2,3], k=1) 

array([[0, 1, 0, 0],
       [0, 0, 2, 0],
       [0, 0, 0, 3],
       [0, 0, 0, 0]])

#### zeros and ones

In [32]:
zeros((3,3))

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [33]:
ones((3,3))

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
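
Both functions return `float64` arrays by default, as the trailing dots in the output show. A short sketch (assuming `import numpy as np`; names are illustrative) of how to request another type or reuse the layout of an existing array:

```python
import numpy as np

z = np.zeros((2, 3), dtype=int)   # integer zeros instead of the default float64
o = np.ones_like(z)               # ones with the same shape and dtype as z
print(z.dtype, o.shape)
```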

## More properties of arrays

In [34]:
M.itemsize # bytes per element

16

In [35]:
M.nbytes # number of bytes

64

In [36]:
M.ndim # number of dimensions

2

In [37]:
# With `newaxis`, we can insert new dimensions in an array
v = array([1,2,3])
print "Original:", shape(v)

# column matrix
print "Col:", v[:,newaxis].shape

# row matrix
print "Row:", v[newaxis,:].shape


Original: (3,)
Col: (3, 1)
Row: (1, 3)
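
One reason to insert new axes is to combine a column and a row in a single expression. The following sketch assumes `import numpy as np` and Python 3 `print`:

```python
import numpy as np

v = np.array([1, 2, 3])
col = v[:, np.newaxis]     # shape (3, 1)
row = v[np.newaxis, :]     # shape (1, 3)

# reshape produces the same column layout
same_col = np.array_equal(col, v.reshape(-1, 1))

# a (3, 1) column times a (1, 3) row broadcasts to a (3, 3) outer product
outer = col * row
print(same_col, outer.shape)
```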


## Manipulating arrays

### Indexing

We can index elements in an array using the square bracket and indices:

In [38]:
# v is a vector, and has only one dimension, taking one index
v[0]

1

In [39]:
# M is a matrix, or a 2 dimensional array, taking two indices 
M[1,1]

(4+0j)

If we omit an index of a multidimensional array, it returns the whole row (or, in general, an (N-1)-dimensional array):

In [40]:
M

array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]])

In [41]:
M[1]

array([ 3.+0.j,  4.+0.j])

The same thing can be achieved by using `:` instead of an index:

In [42]:
M[1,:] # row 1

array([ 3.+0.j,  4.+0.j])

In [43]:
M[:,1] # column 1

array([ 2.+0.j,  4.+0.j])

We can assign new values to elements in an array using indexing

In [44]:
M[0,0] = 1

In [45]:
M

array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]])

In [46]:
# also works for rows and columns
M[1,:] = 0
# M is 2x2, so its valid column indices are 0 and 1: column 2 raises an IndexError
M[:,2] = -1

IndexError: index 2 is out of bounds for axis 1 with size 2

In [47]:
M

array([[ 1.+0.j,  2.+0.j],
       [ 0.+0.j,  0.+0.j]])

### Index slicing

Index slicing is the technical name for the syntax `M[lower:upper:step]` to extract part of an array

In [48]:
A = array([1,2,3,4,5])
A

array([1, 2, 3, 4, 5])

In [49]:
A[1:3]

array([2, 3])

Array slices are *mutable*: 

if they are assigned a new value the original array from which the slice was extracted is modified

In [50]:
A[1:3] = [-2,-3]

A

array([ 1, -2, -3,  4,  5])
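
The same holds when the slice is first stored in a variable: the slice is a view on the original data, not a copy. A minimal sketch, assuming `import numpy as np` (the name `s` is illustrative):

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])
s = A[1:3]        # a view: no data is copied
s[0] = 99         # writing through the view changes A as well
print(A)
print(s.base is A)
```

After the assignment, `A[1]` is 99 and `s.base` refers back to `A`.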

We can omit any of the three parameters in `M[lower:upper:step]`:

In [51]:
A[::] # lower, upper, step all take the default values

array([ 1, -2, -3,  4,  5])

In [52]:
A[::2] # step is 2, lower and upper default to the beginning and end of the array

array([ 1, -3,  5])

In [53]:
A[:3] # first three elements

array([ 1, -2, -3])

In [54]:
A[3:] # elements from index 3

array([4, 5])

Negative indices count from the end of the array (positive indices from the beginning):

In [55]:
A = array([1,2,3,4,5])

In [56]:
A[-1] # the last element in the array

5

In [57]:
A[-3:] # the last three elements

array([3, 4, 5])

Index slicing works exactly the same way for multidimensional arrays:

In [58]:
A = array([[n+m*10 for n in range(5)] for m in range(5)])

A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [59]:
# a block from the original array
A[1:4, 1:4]

array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

In [60]:
# strides
A[::2, ::2]

array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])

### Fancy indexing

Fancy indexing is the name for when **an array or list** is used in place of an *index*:

In [61]:
row_indices = [1, 2, 3]
A[row_indices]

array([[10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34]])

In [62]:
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[row_indices, col_indices]

array([11, 22, 34])
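
Note that passing two index lists pairs them up element by element rather than selecting a block. If a block of every row/column combination is wanted, `numpy.ix_` is one way to get it. A sketch, assuming `import numpy as np` and illustrative names:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
rows = [1, 2, 3]
cols = [1, 2, -1]

pairs = A[rows, cols]           # elements (1,1), (2,2), (3,-1)
block = A[np.ix_(rows, cols)]   # every combination of the rows and columns

# unlike a slice, a fancy-indexed result is a copy
pairs[0] = -1
print(pairs.shape, block.shape, A[1, 1])
```

Modifying `pairs` leaves `A[1, 1]` at 11, in contrast to the slice behaviour shown earlier.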

### We can also use index masks
* e.g. a Numpy array of data type `bool`
    - an element is selected (True) or not (False) 
    - depending on the value of the index mask at the position of each element

In [63]:
B = array([n for n in range(5)])
B

array([0, 1, 2, 3, 4])

In [64]:
row_mask = array([True, False, True, False, False])
B[row_mask]

array([0, 2])

In [65]:
# same thing
row_mask = array([1,0,1,0,0], dtype=bool)
B[row_mask]

array([0, 2])

This feature is very useful to conditionally select elements from an array, using for example comparison operators:

In [66]:
x = arange(0, 10, 0.5)
x

array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

In [67]:
mask = (5 < x) * (x < 7.5)
x[mask]
print mask
print x[mask]

[False False False False False False False False False False False  True
  True  True  True False False False False False]
[ 5.5  6.   6.5  7. ]
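
Multiplying two boolean arrays works as a logical "and", but the `&` operator states the intent more directly. A minimal sketch, assuming `import numpy as np`:

```python
import numpy as np

x = np.arange(0, 10, 0.5)

# parentheses are required: & binds more tightly than the comparisons
mask = (5 < x) & (x < 7.5)
print(x[mask])
```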


## Other functions 
for extracting data from arrays and creating arrays

### where

The index mask can be converted to position indices using the `where` function:

In [68]:
indices = where(mask)

indices

(array([11, 12, 13, 14]),)

In [69]:
x[indices] # this indexing is equivalent to the fancy indexing x[mask]

array([ 5.5,  6. ,  6.5,  7. ])
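
`where` also has a three-argument form that picks between two sources element by element. The sketch below assumes `import numpy as np`; the name `clipped` and the fill value 0 are illustrative choices:

```python
import numpy as np

x = np.arange(0, 10, 0.5)
mask = (5 < x) & (x < 7.5)

# keep x where the mask is True, use 0 everywhere else
clipped = np.where(mask, x, 0.0)
print(clipped[10:16])
```

The result has the same shape as `x`, unlike `x[mask]`, which keeps only the selected elements.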

### diag

With the diag function we can also extract the diagonal and subdiagonals of an array

In [70]:
diag(A)

array([ 0, 11, 22, 33, 44])

In [71]:
diag(A, -1)

array([10, 21, 32, 43])

### choose

Constructs an array by picking elements from several arrays

In [72]:
which = [1, 0, 1, 0]
choices = [[-2,-2,-2,-2], [5,5,5,5]]

choose(which, choices)

array([ 5, -2,  5, -2])
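
To make the selection rule explicit: position `i` of the result is taken from `choices[which[i]][i]`. A sketch that writes the same selection as a plain loop (assuming `import numpy as np`; `by_hand` is an illustrative name):

```python
import numpy as np

which = [1, 0, 1, 0]
choices = [[-2, -2, -2, -2], [5, 5, 5, 5]]

picked = np.choose(which, choices)

# the same selection written as an explicit loop
by_hand = [choices[w][i] for i, w in enumerate(which)]
print(picked, by_hand)
```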

## Linear algebra

Efficient numerical calculation with Numpy

- Computations should, wherever possible, be formulated in terms of matrix and vector operations
- such as matrix-matrix multiplication.

### Scalar-array operations

We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers.

In [73]:
v1 = arange(0, 5)

In [74]:
v1 * 2

array([0, 2, 4, 6, 8])

In [75]:
v1 + 2

array([2, 3, 4, 5, 6])

In [76]:
# Also works on a matrix
A * 2, A + 2

(array([[ 0,  2,  4,  6,  8],
        [20, 22, 24, 26, 28],
        [40, 42, 44, 46, 48],
        [60, 62, 64, 66, 68],
        [80, 82, 84, 86, 88]]), array([[ 2,  3,  4,  5,  6],
        [12, 13, 14, 15, 16],
        [22, 23, 24, 25, 26],
        [32, 33, 34, 35, 36],
        [42, 43, 44, 45, 46]]))

### Element-wise array-array operations

When we add, subtract, multiply and divide arrays with each other, the default behaviour is **element-wise** operations:

In [77]:
print A
print A * A # element-wise multiplication

[[ 0  1  2  3  4]
 [10 11 12 13 14]
 [20 21 22 23 24]
 [30 31 32 33 34]
 [40 41 42 43 44]]
[[   0    1    4    9   16]
 [ 100  121  144  169  196]
 [ 400  441  484  529  576]
 [ 900  961 1024 1089 1156]
 [1600 1681 1764 1849 1936]]


In [78]:
v1 * v1

array([ 0,  1,  4,  9, 16])

If we multiply arrays with compatible shapes, we get an element-wise multiplication of each row:

In [79]:
A.shape, v1.shape

((5, 5), (5,))

In [80]:
A * v1

array([[  0,   1,   4,   9,  16],
       [  0,  11,  24,  39,  56],
       [  0,  21,  44,  69,  96],
       [  0,  31,  64,  99, 136],
       [  0,  41,  84, 129, 176]])
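
In the result above, column `j` of `A` is scaled by `v1[j]`. If instead each row `i` should be scaled by `v1[i]`, the vector can be turned into a column first. A sketch, assuming `import numpy as np` and illustrative names:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v1 = np.arange(0, 5)

by_col = A * v1                   # column j is multiplied by v1[j]
by_row = A * v1[:, np.newaxis]    # row i is multiplied by v1[i]
print(by_col[1], by_row[1])
```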

### Matrix algebra

What about matrix multiplication? 

* We can either use the `dot` function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments: 

In [81]:
dot(A, A)

array([[ 300,  310,  320,  330,  340],
       [1300, 1360, 1420, 1480, 1540],
       [2300, 2410, 2520, 2630, 2740],
       [3300, 3460, 3620, 3780, 3940],
       [4300, 4510, 4720, 4930, 5140]])

In [82]:
dot(A, v1)

array([ 30, 130, 230, 330, 430])

In [83]:
dot(v1, v1)

30
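
On newer installations than the one used for this notebook (Python 3.5 or later with numpy 1.10 or later), the `@` operator performs the same products as `dot`. A sketch under that assumption, with `import numpy as np`:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v1 = np.arange(0, 5)

print(np.array_equal(A @ A, np.dot(A, A)))   # matrix-matrix product
print(A @ v1)                                # matrix-vector product
print(v1 @ v1)                               # inner product
```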

Alternatively

* we can cast the array objects to the type `matrix`. 

<small>Note: This changes the behavior of the standard arithmetic operators `+, -, *` to use matrix algebra.</small>

In [84]:
M = matrix(A)
M

matrix([[ 0,  1,  2,  3,  4],
        [10, 11, 12, 13, 14],
        [20, 21, 22, 23, 24],
        [30, 31, 32, 33, 34],
        [40, 41, 42, 43, 44]])

In [85]:
v = matrix(v1).T # make it a column vector
v

matrix([[0],
        [1],
        [2],
        [3],
        [4]])

In [86]:
M * M

matrix([[ 300,  310,  320,  330,  340],
        [1300, 1360, 1420, 1480, 1540],
        [2300, 2410, 2520, 2630, 2740],
        [3300, 3460, 3620, 3780, 3940],
        [4300, 4510, 4720, 4930, 5140]])

In [87]:
M * v

matrix([[ 30],
        [130],
        [230],
        [330],
        [430]])

In [88]:
# inner product
v.T * v

matrix([[30]])

In [89]:
# with matrix objects, standard matrix algebra applies
v + M*v

matrix([[ 30],
        [131],
        [232],
        [333],
        [434]])

### Warning
If we try to add, subtract or multiply objects with incompatible shapes, we get an error:

In [90]:
v = matrix([1,2,3,4,5,6]).T

In [91]:
shape(M), shape(v)

((5, 5), (6, 1))

In [92]:
M * v

ValueError: shapes (5,5) and (6,1) not aligned: 5 (dim 1) != 6 (dim 0)

See also the related functions: `inner`, `outer`, `cross`, `kron`, `tensordot`
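
A brief sketch of three of these functions on small vectors (assuming `import numpy as np`; the vectors `a` and `b` are made up for illustration):

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

print(np.inner(a, b))    # 1*3 + 2*4 = 11
print(np.outer(a, b))    # 2x2 matrix with entries a[i] * b[j]
print(np.kron(a, b))     # each a[i] times the whole of b, concatenated
```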

### Matrix computations

#### Inverse

In [93]:
C = matrix([[1j, 2j], [3j, 4j]])

In [94]:
inv(C) # equivalent to C.I 

matrix([[ 0.+2.j ,  0.-1.j ],
        [ 0.-1.5j,  0.+0.5j]])

In [95]:
C.I * C

matrix([[  1.00000000e+00+0.j,   4.44089210e-16+0.j],
        [  0.00000000e+00+0.j,   1.00000000e+00+0.j]])

#### Determinant

In [96]:
det(C)

(2.0000000000000004+0j)

In [97]:
det(C.I)

(0.50000000000000011+0j)
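
The same computations work on plain arrays through `numpy.linalg`, without the `matrix` type. A minimal sketch, assuming `import numpy as np`, that also checks the two identities suggested by the outputs above:

```python
import numpy as np

C = np.array([[1j, 2j], [3j, 4j]])
C_inv = np.linalg.inv(C)

# inverse times original gives the identity, up to rounding error
print(np.allclose(np.dot(C_inv, C), np.eye(2)))

# det(C^-1) equals 1 / det(C)
print(np.allclose(np.linalg.det(C_inv), 1 / np.linalg.det(C)))
```

Comparing with `allclose` rather than `==` accounts for rounding errors such as the `4.44e-16` entry above.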

## Data processing
File Input/Output

### Comma-separated values (CSV)

A very common file format for data files is comma-separated values (CSV).

In [98]:
# To read data from such file into Numpy arrays we can use the `numpy.genfromtxt` function
?genfromtxt

data source: https://archive.ics.uci.edu/ml/datasets/Covertype

In [99]:
A = genfromtxt('data/num.csv.gz', delimiter = ',')

In [100]:
A.shape

(71436, 55)

In [101]:
A.size

3928980

In [102]:
A[:4,:3]

array([[  2.59600000e+03,   5.10000000e+01,   3.00000000e+00],
       [  2.59000000e+03,   5.60000000e+01,   2.00000000e+00],
       [  2.80400000e+03,   1.39000000e+02,   9.00000000e+00],
       [  2.78500000e+03,   1.55000000e+02,   1.80000000e+01]])
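
To try `genfromtxt` without the data file, the input can be an in-memory text buffer. The sketch below assumes Python 3 and `import numpy as np`, and reuses a few values from the output above as stand-in data:

```python
import io
import numpy as np

# three rows standing in for a CSV file on disk
csv_text = "2596,51,3\n2590,56,2\n2804,139,9\n"
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(data.shape, data.dtype)
```

The values are parsed as `float64` by default, which is why the integers in the file are displayed in exponent notation above.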

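Using `numpy.savetxt` we can store a Numpy array in a plain-text file. By default the values are separated by spaces, so despite the `.csv` file name used below the output is space-delimited; pass `delimiter=','` to write true CSV: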

In [103]:
M = rand(3,3)

M

array([[ 0.33944456,  0.88615496,  0.2461047 ],
       [ 0.57751225,  0.26211386,  0.42031934],
       [ 0.48373785,  0.12272668,  0.62294105]])

In [104]:
savetxt("random-matrix.csv", M)

In [105]:
!cat random-matrix.csv

3.394445595937714000e-01 8.861549647862649870e-01 2.461046990821519342e-01
5.775122464178170656e-01 2.621138565572320722e-01 4.203193432334487722e-01
4.837378483565024645e-01 1.227266765816557026e-01 6.229410476884408299e-01
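
A sketch of writing comma-separated output and reading it back, assuming `import numpy as np`; the temporary directory and the small matrix are illustrative choices:

```python
import os
import tempfile
import numpy as np

M = np.array([[0.1, 0.2], [0.3, 0.4]])   # small illustrative matrix

path = os.path.join(tempfile.mkdtemp(), "matrix.csv")
np.savetxt(path, M, delimiter=",")       # comma-separated values
M_back = np.genfromtxt(path, delimiter=",")
print(np.allclose(M, M_back))
```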


### Numpy's native file format

Useful when storing and reading back numpy array data. Use the functions `numpy.save` and `numpy.load`:

In [106]:
# numpy binary file saving
save("random-matrix.npy", M)
# check type of file
!file random-matrix.npy

random-matrix.npy: data


In [107]:
# very fast, but the binary format is specific to numpy
load("random-matrix.npy")

array([[ 0.33944456,  0.88615496,  0.2461047 ],
       [ 0.57751225,  0.26211386,  0.42031934],
       [ 0.48373785,  0.12272668,  0.62294105]])
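
One practical difference from the text format is that shape and element type are stored in the file. A round-trip sketch, assuming `import numpy as np` and a temporary directory chosen for illustration:

```python
import os
import tempfile
import numpy as np

M = np.arange(6, dtype=np.int16).reshape(2, 3)

path = os.path.join(tempfile.mkdtemp(), "matrix.npy")
np.save(path, M)
M_back = np.load(path)

# shape and dtype survive the round trip
print(M_back.shape, M_back.dtype)
```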

### Statistics

Numpy provides a number of functions to calculate statistics of datasets in arrays.

In [108]:
data = A[:1000,:5]
data.shape

(1000, 5)

#### mean

In [109]:
# The mean of the 4th column (index 3)
mean(data[:,3])

236.58799999999999

#### standard deviations and variance

In [110]:
std(data[:,3]), var(data[:,3])

(189.86956642916738, 36050.452256000004)

#### min and max

In [111]:
# search the lowest value of a column
col = 4
print "Min value for", col, "is", data[:,col].min()
print "Max value for", col, "is", data[:,col].max()

Min value for 4 is -45.0
Max value for 4 is 245.0
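
Instead of selecting one column at a time, most of these functions accept an `axis` argument. A sketch on a small made-up array (assuming `import numpy as np`):

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

print(data.mean(axis=0))   # one mean per column: 2 and 30
print(data.max(axis=1))    # one maximum per row: 10, 20 and 60
print(data.mean())         # mean over all elements: 16
```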


There are many other operations. 

...but you will find more power in *pandas* for this.

### Copy and "deep copy"

- when objects are passed between functions 
    - you want to avoid an excessive amount of memory copying when it is not necessary 
    - (technical term: pass by reference)

In [112]:
A = array([[1, 2], [3, 4]])
A

array([[1, 2],
       [3, 4]])

In [113]:
B = A # now B is referring to the same array data as A 
B

array([[1, 2],
       [3, 4]])

In [114]:
A == B # check this

array([[ True,  True],
       [ True,  True]], dtype=bool)

In [115]:
# changing B affects A
B[0,0] = 10
B

array([[10,  2],
       [ 3,  4]])

In [116]:
A

array([[10,  2],
       [ 3,  4]])

If we want to avoid this behavior 
- get a new completely independent object `B` copied from `A`
- we need to do a so-called "deep copy" using the function `copy`

In [117]:
B = copy(A)

In [118]:
# now, if we modify B, A is not affected
B[0,0] = -5

B

array([[-5,  2],
       [ 3,  4]])

In [119]:
A

array([[10,  2],
       [ 3,  4]])
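
Note that `A == B` compares values, so it cannot tell a second name for the same array from an independent copy. The `is` operator can. A sketch, assuming `import numpy as np` (`alias` and `clone` are illustrative names):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
alias = A            # another name for the same object
clone = A.copy()     # independent data

print(alias is A, clone is A)      # True False
print(np.array_equal(clone, A))    # True: equal values, separate memory

clone[0, 0] = -5
print(A[0, 0])                     # still 1
```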

### Iterating over array elements

> Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” (in optimized, pre-compiled C code).

source: numpy website

- Generally, we want to avoid iterating over the elements of arrays 
    * at all costs
- In an *interpreted language* like Python (or MATLAB, or R)
    * iterations are really slow compared to vectorized operations
- Always use numpy functions, which are optimized
    * if you try a `for` loop you know what you get
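
As a small illustration of replacing a loop with vectorized operations, the sketch below computes the length of each row vector in both styles. It assumes `import numpy as np`, and the data is made up:

```python
import math
import numpy as np

X = np.array([[3.0, 4.0], [6.0, 8.0], [5.0, 12.0]])

# explicit Python loop over the rows
norms_loop = [math.sqrt(sum(value**2 for value in row)) for row in X]

# vectorized: square, sum along each row, take the square root
norms_vec = np.sqrt((X**2).sum(axis=1))
print(norms_vec)   # row lengths 5, 10 and 13
```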

### Type casting

- Numpy arrays are *statically typed*
- the type of an array does not change once created
- but we can explicitly cast an array of some type to another 
    - using the `astype` method 
    - (see also the similar `asarray` function) 
    - This always creates a new array of the new type

In [120]:
M.dtype

dtype('float64')

In [121]:
M

array([[ 0.33944456,  0.88615496,  0.2461047 ],
       [ 0.57751225,  0.26211386,  0.42031934],
       [ 0.48373785,  0.12272668,  0.62294105]])

In [122]:
M2 = M.astype(bool)
M2

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)

In [123]:
M3 = M.astype(str)
M3

array([['0.339444559594', '0.886154964786', '0.246104699082'],
       ['0.577512246418', '0.262113856557', '0.420319343233'],
       ['0.483737848357', '0.122726676582', '0.622941047688']], 
      dtype='|S32')
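
Casting can lose information, which is worth checking on a small example before applying it to real data. A sketch, assuming `import numpy as np` and illustrative names:

```python
import numpy as np

f = np.array([1.7, -1.7, 0.0])
i = f.astype(int)       # truncates toward zero: 1, -1, 0
b = f.astype(bool)      # only exact zeros become False

print(i, b, f)          # f itself is unchanged
```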

## Versions

In [124]:
%reload_ext version_information

%version_information numpy

| Software | Version |
|---|---|
| Python | 2.7.10 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] |
| IPython | 3.1.0 |
| OS | Linux 4.0.3 boot2docker x86_64 with debian jessie sid |
| numpy | 1.9.2 |

Mon Jun 08 14:03:15 2015 UTC


**Let's move to the next part :)**