In [1]:
# IPython magic
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Introduction

In the `numpy` package the terminology used for vectors, matrices and higher-dimensional data sets is *array*. 



## Creating `numpy` arrays

There are a number of ways to initialize new numpy arrays, for example from

* Python lists or tuples
* using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, etc.
* reading data from files

### From lists

We can use the `numpy.array` function.

In [2]:
# a vector: the argument to the array function is a Python list
v = array([1,2,3,4])
v

array([1, 2, 3, 4])

In [3]:
# a matrix: the argument to the array function is a nested Python list
M = array([[1, 2], [3, 4]])
M

array([[1, 2],
       [3, 4]])

The `v` and `M` objects are both of the type `numpy.ndarray`:

In [4]:
type(v), type(M)

(numpy.ndarray, numpy.ndarray)

The `v` and `M` arrays differ only in their shapes.

We can check the shape with the `ndarray.shape` property.

In [5]:
v.shape

(4,)

In [6]:
M.shape

(2, 2)

The number of elements in the array is available through the `ndarray.size` property:

In [7]:
M.size

4

Equivalently, we could use the functions `numpy.shape` and `numpy.size`:

In [8]:
shape(M)

(2, 2)

In [9]:
size(M)

4

So far the `numpy.ndarray` looks awfully much like a Python list (or nested list).

Why not simply use Python lists for computations instead of creating a new array type? 

**There are several reasons:**

* Python lists are very general.
    - They can contain any kind of object.
    - They are dynamically typed.
* They do not support mathematical operations
    - such as matrix and dot multiplications, etc.
    - Implementing such functions for Python lists would not be very efficient
        * because of the dynamic typing

* Numpy arrays are **statically typed** and **homogeneous**.
    - The type of the elements is determined when the array is created.
    - Because the element type is known in advance, numpy can apply low-level optimizations.
* Numpy arrays are memory efficient.
    - Fast mathematical functions on them can be implemented in a compiled language
        * C and Fortran are used
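The contrast between lists and arrays can be sketched in a few lines. This is an illustrative example written with an explicit `import numpy as np` and Python 3 `print()`, unlike the `%pylab` cells in this notebook; the variable names are made up for the illustration.

```python
import numpy as np

# A Python list may hold objects of different types.
mixed_list = [1, 2.5, "three"]

# A numpy array stores one element type; mixing ints and floats
# upcasts every element to a common type.
upcast = np.array([1, 2, 3.5])
print(upcast.dtype)    # float64

# Arithmetic applies to every element of an array ...
doubled_array = np.array([1, 2, 3]) * 2
# ... whereas the same operator on a list repeats the list.
doubled_list = [1, 2, 3] * 2
print(doubled_array)   # [2 4 6]
print(doubled_list)    # [1, 2, 3, 1, 2, 3]
```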

Using the `dtype` (data type) property of an `ndarray`, we can see what type the data of an array has:

In [10]:
M.dtype

dtype('int64')

We get an error if we try to assign a value of the wrong type to an element in a numpy array:

In [11]:
M[0,0] = "hello"

ValueError: invalid literal for long() with base 10: 'hello'

If we want, we can explicitly define the type of the array data when we create it, using the `dtype` keyword argument: 

In [None]:
M = array([[1, 2], [3, 4]], dtype=complex)

M

Common types that can be used with `dtype` 

    `int`, `float`, `complex`, `bool`, `object`, etc.

We can also explicitly define the bit size of the data types

    `int64`, `int16`, `float128`, `complex128`.
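As a small sketch of what the bit size means in practice, the example below stores the same four numbers with two integer widths and compares the memory used. It uses an explicit `import numpy as np` and Python 3 syntax; the names `small` and `large` are illustrative only. Note that extended types such as `float128` are not available on every platform.

```python
import numpy as np

small = np.array([1, 2, 3, 4], dtype=np.int16)
large = np.array([1, 2, 3, 4], dtype=np.int64)

# itemsize is bytes per element, nbytes is the total for the data buffer.
print(small.itemsize, small.nbytes)   # 2 8
print(large.itemsize, large.nbytes)   # 8 32
```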

## If I don't see it, I don't believe it

`ndarray` = N-dimensional array

<img src="images/ndarray.png">

A quick benchmark

In [None]:
# Normal python vector
dim = 10000
a = range(dim)
t1 = %timeit -o [i**2 for i in a]

In [None]:
# Numpy vector with normal python loop
b = arange(dim)
t2 = %timeit -o [i**2 for i in b]

In [None]:
# Numpy vector with a vectorized numpy operation (no Python loop)
c = arange(dim)
t3 = %timeit -n 1000 -o [c**2]

In [None]:
print "Python loops (no) speedup: ", t1.best / t2.best

In [None]:
print "Numpy loops speedup:", int(t1.best / t3.best), "x"

We want to make sure...

In [None]:
print "Type", type(a), [i**2 for i in a][0:10]

In [None]:
print type(b), (b**2)[0:10]

## Using more array-generating functions

#### arange

In [None]:
# create a range
x = arange(0, 10, 1) # arguments: start, stop, step
x

In [None]:
x = arange(-1, 1, 0.1)
x

In [None]:
type(x)

#### mgrid

In [None]:
print numpy.mgrid.__doc__.split('\n')[0]

In [None]:
x, y = mgrid[0:5, 0:5] # similar to meshgrid in MATLAB

In [None]:
x

In [None]:
y

#### random data

In [None]:
from numpy import random
# uniform random numbers in [0,1]
random.rand(5,5)

In [None]:
# standard normal distributed random numbers
random.randn(5,5)

#### diag

In [None]:
# a diagonal matrix
diag([1,2,3])

In [None]:
# diagonal with offset from the main diagonal
diag([1,2,3], k=1) 

#### zeros and ones

In [None]:
zeros((3,3))

In [None]:
ones((3,3))

## More properties of arrays

In [None]:
M.itemsize # bytes per element

In [None]:
M.nbytes # number of bytes

In [None]:
M.ndim # number of dimensions

In [None]:
# With `newaxis`, we can insert new dimensions in an array
v = array([1,2,3])
print "Original:", shape(v)

# column matrix
print "Col:", v[:,newaxis].shape

# row matrix
print "Row:", v[newaxis,:].shape


## Exercise

Create your own matrix and try some of the operations shown so far.

## Manipulating arrays

### Indexing

We can index elements in an array using square brackets and indices:

In [12]:
# v is a vector, and has only one dimension, taking one index
v[0]

1

In [13]:
# M is a matrix, or a 2 dimensional array, taking two indices 
M[1,1]

4

If we omit an index of a multidimensional array, it returns the whole row (or, in general, an (N-1)-dimensional array):

In [14]:
M

array([[1, 2],
       [3, 4]])

In [15]:
M[1]

array([3, 4])

The same thing can be achieved by using `:` instead of an index:

In [16]:
M[1,:] # row 1

array([3, 4])

In [17]:
M[:,1] # column 1

array([2, 4])

We can assign new values to elements in an array using indexing

In [18]:
M[0,0] = 1

In [19]:
M

array([[1, 2],
       [3, 4]])

In [20]:
# also works for rows and columns
M[1,:] = 0
M[:,2] = -1 # fails: M is 2x2, so the only valid column indices are 0 and 1

IndexError: index 2 is out of bounds for axis 1 with size 2

In [21]:
M

array([[1, 2],
       [0, 0]])

### Index slicing

Index slicing is the technical name for the syntax `M[lower:upper:step]`, which extracts part of an array:

In [22]:
A = array([1,2,3,4,5])
A

array([1, 2, 3, 4, 5])

In [23]:
A[1:3]

array([2, 3])

Array slices are *mutable*:

if a slice is assigned a new value, the original array from which the slice was extracted is modified.

In [24]:
A[1:3] = [-2,-3]

A

array([ 1, -2, -3,  4,  5])
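The reason is that a basic slice is a view onto the original data rather than a copy. A minimal sketch, using an explicit `import numpy as np` and Python 3 syntax (the names `view` and `independent` are illustrative):

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])

view = A[1:3]                 # basic slicing returns a view onto A's data
view[0] = 99                  # writing through the view changes A
print(A)                      # [ 1 99  3  4  5]

independent = A[1:3].copy()   # an explicit copy owns its own data
independent[0] = -1
print(A)                      # [ 1 99  3  4  5]
```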

We can omit any of the three parameters in `M[lower:upper:step]`:

In [25]:
A[::] # lower, upper, step all take the default values

array([ 1, -2, -3,  4,  5])

In [26]:
A[::2] # step is 2, lower and upper default to the beginning and end of the array

array([ 1, -3,  5])

In [27]:
A[:3] # first three elements

array([ 1, -2, -3])

In [28]:
A[3:] # elements from index 3

array([4, 5])

Negative indices count from the end of the array (positive indices from the beginning):

In [29]:
A = array([1,2,3,4,5])

In [30]:
A[-1] # the last element in the array

5

In [31]:
A[-3:] # the last three elements

array([3, 4, 5])

Index slicing works exactly the same way for multidimensional arrays:

In [32]:
A = array([[n+m*10 for n in range(5)] for m in range(5)])

A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [33]:
# a block from the original array
A[1:4, 1:4]

array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

In [34]:
# strides
A[::2, ::2]

array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])

### Fancy indexing

Fancy indexing is the name for when **an array or list** is used in place of an *index*:

In [35]:
row_indices = [1, 2, 3]
A[row_indices]

array([[10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34]])

In [36]:
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[row_indices, col_indices]

array([11, 22, 34])
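Note that two index lists are paired element by element, which is why the result above has three elements rather than nine. The sketch below contrasts this with `numpy.ix_`, which selects every row/column combination. It uses an explicit `import numpy as np` and Python 3 syntax; the names `paired` and `block` are illustrative.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
rows = [1, 2, 3]
cols = [1, 2, -1]

# Paired index lists pick one element per (row, column) pair.
paired = A[rows, cols]
print(paired)                  # [11 22 34]

# np.ix_ builds an open mesh, selecting every combination instead.
block = A[np.ix_(rows, cols)]
print(block.shape)             # (3, 3)
```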

## Exercise

- Define two odd numbers: n, m
- Create a random matrix with shape n x m
- Compute the middle cell position and get the center element

In [37]:
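odd_list = range(1,10,2)
# the standard-library module is bound to `stdrand`, so numpy's `random` stays available
import random as stdrand
n = stdrand.choice(odd_list)
m = stdrand.choice(odd_list)
print "Dimensions:", n, m
MAT = random.randn(n,m)
# floor division gives the middle index along an odd-length axis
matrix_center = (n//2, m//2)

print MAT
MAT[matrix_center]



Dimensions: 3 1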
[[-2.19577913]
 [ 1.64321869]
 [-1.65874573]]


1.6432186948456033

### Index masks

We can also index with masks:
* e.g. a Numpy array of data type `bool`
    - an element is selected (True) or not (False)
    - depending on the value of the index mask at the position of each element

In [38]:
B = array([n for n in range(5)])
B

array([0, 1, 2, 3, 4])

In [39]:
row_mask = array([True, False, True, False, False])
B[row_mask]

array([0, 2])

In [40]:
# same thing
row_mask = array([1,0,1,0,0], dtype=bool)
B[row_mask]

array([0, 2])

This feature is very useful to conditionally select elements from an array, using for example comparison operators:

In [41]:
x = arange(0, 10, 0.5)
x

array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

In [42]:
mask = (5 < x) * (x < 7.5)
x[mask]
print mask
print x[mask]

[False False False False False False False False False False False  True
  True  True  True False False False False False]
[ 5.5  6.   6.5  7. ]
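Multiplying two boolean arrays works as a logical AND, but the bitwise operators state the intent more directly. A small sketch of the same selection, written with an explicit `import numpy as np` and Python 3 syntax (the name `outside` is illustrative):

```python
import numpy as np

x = np.arange(0, 10, 0.5)

# `&` is the element-wise AND for boolean arrays; the parentheses are
# required because `&` binds tighter than the comparison operators.
mask = (5 < x) & (x < 7.5)
print(x[mask].tolist())        # [5.5, 6.0, 6.5, 7.0]

# `~` negates a mask element-wise (and `|` is the element-wise OR).
outside = ~mask
print(outside.sum())           # 16
```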


## Other functions

Functions for extracting data from arrays and creating arrays.

### where

The index mask can be converted to position indices using the `where` function:

In [43]:
indices = where(mask)

indices

(array([11, 12, 13, 14]),)

In [44]:
x[indices] # this indexing is equivalent to the fancy indexing x[mask]

array([ 5.5,  6. ,  6.5,  7. ])
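`where` returns a tuple with one index array per dimension, which explains the trailing comma in the output above. Called with three arguments, it instead chooses element-wise between two sources. A brief sketch with an explicit `import numpy as np` and Python 3 syntax (the names `positions` and `clipped` are illustrative):

```python
import numpy as np

x = np.arange(0, 10, 0.5)
mask = (5 < x) & (x < 7.5)

# Unpack the one-element tuple to get the plain index array.
(positions,) = np.where(mask)
print(positions)                   # [11 12 13 14]

# Three-argument form: keep x where the mask is True, use 0.0 elsewhere.
clipped = np.where(mask, x, 0.0)
print(clipped[10:16].tolist())     # [0.0, 5.5, 6.0, 6.5, 7.0, 0.0]
```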

### diag

With the `diag` function we can also extract the diagonal and subdiagonals of an array:

In [45]:
diag(A)

array([ 0, 11, 22, 33, 44])

In [46]:
diag(A, -1)

array([10, 21, 32, 43])

### choose

Constructs an array by picking elements from several arrays:

In [47]:
which = [1, 0, 1, 0]
choices = [[-2,-2,-2,-2], [5,5,5,5]]

choose(which, choices)

array([ 5, -2,  5, -2])

## Linear algebra

Efficient numerical calculation with Numpy

- Computations should, as far as possible, be formulated in terms of matrix and vector operations
- like matrix-matrix multiplication.

### Scalar-array operations

We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers.

In [48]:
v1 = arange(0, 5)

In [49]:
v1 * 2

array([0, 2, 4, 6, 8])

In [50]:
v1 + 2

array([2, 3, 4, 5, 6])

In [51]:
# Also works on a matrix
A * 2, A + 2

(array([[ 0,  2,  4,  6,  8],
        [20, 22, 24, 26, 28],
        [40, 42, 44, 46, 48],
        [60, 62, 64, 66, 68],
        [80, 82, 84, 86, 88]]), array([[ 2,  3,  4,  5,  6],
        [12, 13, 14, 15, 16],
        [22, 23, 24, 25, 26],
        [32, 33, 34, 35, 36],
        [42, 43, 44, 45, 46]]))

## Exercise

Can you list the first 20 *powers of two* using scalar-array operations?

In [52]:
from numpy import array
elements = 20
two = array([2]*elements)
for i in range(len(two)):
    two[i:] = two[i:]*2
print two

[      4       8      16      32      64     128     256     512    1024
    2048    4096    8192   16384   32768   65536  131072  262144  524288
 1048576 2097152]
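A possible loop-free alternative is to raise the scalar 2 to an array of exponents. This is only a sketch, written with an explicit `import numpy as np` and Python 3 syntax; it starts at 2^1, whereas the loop above starts at 2^2, and the name `exponents` is illustrative.

```python
import numpy as np

exponents = np.arange(1, 21)   # 1, 2, ..., 20
powers = 2 ** exponents        # scalar base, array exponent

print(powers[:5])              # [ 2  4  8 16 32]
print(powers[-1])              # 1048576
```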


### Element-wise array-array operations

When we add, subtract, multiply and divide arrays with each other, the default behaviour is **element-wise** operations:

In [53]:
print A
print A * A # element-wise multiplication

[[ 0  1  2  3  4]
 [10 11 12 13 14]
 [20 21 22 23 24]
 [30 31 32 33 34]
 [40 41 42 43 44]]
[[   0    1    4    9   16]
 [ 100  121  144  169  196]
 [ 400  441  484  529  576]
 [ 900  961 1024 1089 1156]
 [1600 1681 1764 1849 1936]]


In [54]:
v1 * v1

array([ 0,  1,  4,  9, 16])

If we multiply arrays with different but compatible shapes, the smaller array is broadcast across the larger one; here each row of `A` is multiplied element-wise by `v1`:

In [55]:
A.shape, v1.shape

((5, 5), (5,))

In [56]:
A * v1

array([[  0,   1,   4,   9,  16],
       [  0,  11,  24,  39,  56],
       [  0,  21,  44,  69,  96],
       [  0,  31,  64,  99, 136],
       [  0,  41,  84, 129, 176]])
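To make the direction of this broadcasting explicit, the sketch below multiplies `A` by `v1` once as a row and once as a column (using `newaxis`). It uses an explicit `import numpy as np` and Python 3 syntax; the names `by_columns` and `by_rows` are illustrative.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v1 = np.arange(0, 5)

# Shape (5,) is treated as (1, 5): v1 is repeated down the rows,
# so column j of A is multiplied by v1[j].
by_columns = A * v1

# Shape (5, 1) is repeated across the columns instead,
# so row i of A is multiplied by v1[i].
by_rows = A * v1[:, np.newaxis]

print(by_columns[1])   # [ 0 11 24 39 56]
print(by_rows[1])      # [10 11 12 13 14]
```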

### Matrix algebra

What about matrix multiplication?

* We can either use the `dot` function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments: 

In [57]:
dot(A, A)

array([[ 300,  310,  320,  330,  340],
       [1300, 1360, 1420, 1480, 1540],
       [2300, 2410, 2520, 2630, 2740],
       [3300, 3460, 3620, 3780, 3940],
       [4300, 4510, 4720, 4930, 5140]])

In [58]:
dot(A, v1)

array([ 30, 130, 230, 330, 430])

In [59]:
dot(v1, v1)

30

Alternatively

* we can cast the array objects to the type `matrix`. 

<small>Note: This changes the behavior of the standard arithmetic operators `+, -, *` to use matrix algebra.</small>

In [60]:
M = matrix(A)
M

matrix([[ 0,  1,  2,  3,  4],
        [10, 11, 12, 13, 14],
        [20, 21, 22, 23, 24],
        [30, 31, 32, 33, 34],
        [40, 41, 42, 43, 44]])

In [61]:
v = matrix(v1).T # make it a column vector
v

matrix([[0],
        [1],
        [2],
        [3],
        [4]])

In [62]:
M * M

matrix([[ 300,  310,  320,  330,  340],
        [1300, 1360, 1420, 1480, 1540],
        [2300, 2410, 2520, 2630, 2740],
        [3300, 3460, 3620, 3780, 3940],
        [4300, 4510, 4720, 4930, 5140]])

In [63]:
M * v

matrix([[ 30],
        [130],
        [230],
        [330],
        [430]])

In [64]:
# inner product
v.T * v

matrix([[30]])

In [65]:
# with matrix objects, standard matrix algebra applies
v + M*v

matrix([[ 30],
        [131],
        [232],
        [333],
        [434]])
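On newer versions than the ones used in this notebook (Python 3.5+ with numpy 1.10+), the same products can be written on plain `ndarray` objects with the `@` operator, without casting to `matrix`. A minimal sketch under that assumption:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v1 = np.arange(0, 5)

print(A @ v1)                                 # [ 30 130 230 330 430]
print(v1 @ v1)                                # 30
print(np.array_equal(A @ A, np.dot(A, A)))    # True
```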

### Warning

If we try to add, subtract or multiply objects with incompatible shapes we get an error:

In [66]:
v = matrix([1,2,3,4,5,6]).T

In [67]:
shape(M), shape(v)

((5, 5), (6, 1))

In [68]:
M * v

ValueError: shapes (5,5) and (6,1) not aligned: 5 (dim 1) != 6 (dim 0)

See also the related functions: `inner`, `outer`, `cross`, `kron`, `tensordot`

### Matrix computations

#### Inverse

In [69]:
C = matrix([[1j, 2j], [3j, 4j]])

In [70]:
inv(C) # equivalent to C.I 

matrix([[ 0.+2.j ,  0.-1.j ],
        [ 0.-1.5j,  0.+0.5j]])

In [71]:
C.I * C

matrix([[  1.00000000e+00+0.j,   4.44089210e-16+0.j],
        [  0.00000000e+00+0.j,   1.00000000e+00+0.j]])

#### Determinant

In [72]:
det(C)

(2.0000000000000004+0j)

In [73]:
det(C.I)

(0.50000000000000011+0j)
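The small residuals in the outputs above (such as `4.44e-16`) are floating-point round-off, so results like these are best compared with a tolerance. A short sketch on a plain `ndarray`, using the explicit `numpy.linalg` names and Python 3 syntax (the name `C_inv` is illustrative):

```python
import numpy as np

C = np.array([[1j, 2j], [3j, 4j]])
C_inv = np.linalg.inv(C)

# inv(C) times C should be the identity, up to round-off.
print(np.allclose(np.dot(C_inv, C), np.eye(2)))                  # True

# det(inv(C)) equals 1 / det(C), again up to round-off.
print(np.isclose(np.linalg.det(C_inv), 1 / np.linalg.det(C)))    # True
```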

## Data processing

File input/output.

### Comma-separated values (CSV)

A very common format for data files is comma-separated values (CSV).

In [74]:
# To read data from such file into Numpy arrays we can use the `numpy.genfromtxt` function
?genfromtxt

data source: https://archive.ics.uci.edu/ml/datasets/Covertype

In [75]:
A = genfromtxt('data/num.csv.gz', delimiter = ',')

In [76]:
A.shape

(71436, 55)

In [77]:
A.size

3928980

In [78]:
A[:4,:3]

array([[  2.59600000e+03,   5.10000000e+01,   3.00000000e+00],
       [  2.59000000e+03,   5.60000000e+01,   2.00000000e+00],
       [  2.80400000e+03,   1.39000000e+02,   9.00000000e+00],
       [  2.78500000e+03,   1.55000000e+02,   1.80000000e+01]])

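Using `numpy.savetxt` we can store a Numpy array in a plain-text file. By default the values are separated by spaces, regardless of the file extension; pass `delimiter=','` to write actual CSV: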

In [79]:
M = rand(3,3)

M

array([[ 0.37032271,  0.72107431,  0.00817785],
       [ 0.54815689,  0.43702627,  0.48544123],
       [ 0.99570616,  0.83031926,  0.34833195]])

In [80]:
savetxt("random-matrix.csv", M)

In [81]:
!cat random-matrix.csv

3.703227066684050550e-01 7.210743066150180347e-01 8.177851226846111210e-03
5.481568888557009078e-01 4.370262664386908025e-01 4.854412252923472337e-01
9.957061580132914314e-01 8.303192596515113211e-01 3.483319463859742005e-01
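As a sketch of how the delimiter and number format can be controlled, the example below writes a small array as real CSV and reads it back. It uses an explicit `import numpy as np` and Python 3 syntax; the file name and the temporary directory are hypothetical choices for the illustration.

```python
import os
import tempfile

import numpy as np

M = np.array([[0.5, 1.25], [2.0, 3.75]])

# Hypothetical file name inside a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "matrix.csv")

# delimiter="," writes comma-separated values; fmt sets the number format.
np.savetxt(path, M, delimiter=",", fmt="%.5f")

with open(path) as handle:
    first_line = handle.readline().strip()
print(first_line)                      # 0.50000,1.25000

restored = np.genfromtxt(path, delimiter=",")
print(np.array_equal(restored, M))     # True
```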


## Exercise

Read from the gzipped csv:
- Only from row 11 to 20
- Only third and sixth column
- Truncate values to integer

In [82]:
read = 10
skip = 10
a_len = len(A)
B = genfromtxt('data/num.csv.gz', delimiter = ',', usecols = (2, 5),
        skip_header=skip, skip_footer=a_len-(read+skip), dtype=np.int16)

### Numpy's native file format

Useful when storing and reading back numpy array data. Use the functions `numpy.save` and `numpy.load`:

In [83]:
# numpy binary file saving
save("random-matrix.npy", M)
# check type of file
!file random-matrix.npy

random-matrix.npy: data


In [84]:
# very fast, but the binary file is only readable by numpy-aware tools
load("random-matrix.npy")

array([[ 0.37032271,  0.72107431,  0.00817785],
       [ 0.54815689,  0.43702627,  0.48544123],
       [ 0.99570616,  0.83031926,  0.34833195]])
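Unlike a text file, the native format also records the data type and shape, so they survive the round trip. A minimal sketch with an explicit `import numpy as np` and Python 3 syntax; the file name and temporary directory are hypothetical.

```python
import os
import tempfile

import numpy as np

M = np.arange(6, dtype=np.int16).reshape(2, 3)

# Hypothetical file name inside a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "matrix.npy")

np.save(path, M)
restored = np.load(path)

print(restored.dtype, restored.shape)   # int16 (2, 3)
print(np.array_equal(restored, M))      # True
```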

### Statistics

Numpy provides a number of functions to calculate statistics of datasets in arrays.

In [85]:
data = A[:1000,:5]
data.shape

(1000, 5)

#### mean

In [86]:
# The mean of the 4th column
mean(data[:,3])

236.58799999999999

#### standard deviations and variance

In [87]:
std(data[:,3]), var(data[:,3])

(189.86956642916738, 36050.452256000004)

#### min and max

In [88]:
# search the lowest value of a column
col = 4
print "Min value for", col, "is", data[:,col].min()
print "Max value for", col, "is", data[:,col].max()

Min value for 4 is -45.0
Max value for 4 is 245.0
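Instead of slicing out one column at a time, most of these functions accept an `axis` argument that computes the statistic for every column or row at once. A small sketch on made-up data, with an explicit `import numpy as np` and Python 3 syntax:

```python
import numpy as np

# Made-up data: three rows, two columns.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

column_means = data.mean(axis=0)   # one value per column: 2.0 and 20.0
row_maxima = data.max(axis=1)      # one value per row: 10.0, 20.0 and 30.0

print(column_means.tolist())       # [2.0, 20.0]
print(row_maxima.tolist())         # [10.0, 20.0, 30.0]
print(data.mean())                 # 11.0
```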


There are many other operations. 

...but you will find more power in *pandas* for this.

### Copy and "deep copy"

- When objects are passed between functions
    - we want to avoid copying memory when it is not necessary,
    - so assignments and function calls in Python do not copy the underlying object
    - (technical term: pass by reference)

In [89]:
A = array([[1, 2], [3, 4]])
A

array([[1, 2],
       [3, 4]])

In [90]:
B = A # now B is referring to the same array data as A 
B

array([[1, 2],
       [3, 4]])

In [91]:
A == B # element-wise comparison: every value is equal

array([[ True,  True],
       [ True,  True]], dtype=bool)

In [92]:
# changing B affects A
B[0,0] = 10
B

array([[10,  2],
       [ 3,  4]])

In [93]:
A

array([[10,  2],
       [ 3,  4]])

If we want to avoid this behavior
- and get a new, completely independent object `B` copied from `A`,
- we need to do a so-called "deep copy" using the function `copy`:

In [94]:
B = copy(A)

In [95]:
# now, if we modify B, A is not affected
B[0,0] = -5

B

array([[-5,  2],
       [ 3,  4]])

In [96]:
A

array([[10,  2],
       [ 3,  4]])
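Since `==` compares values element by element, it cannot tell an alias from a copy; the `is` operator can. A short sketch with an explicit `import numpy as np` and Python 3 syntax (the names `alias` and `clone` are illustrative):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

alias = A          # a second name for the same object, no data copied
clone = A.copy()   # a new array with its own data

print(alias is A)                  # True
print(clone is A)                  # False
print(np.array_equal(clone, A))    # True

clone[0, 0] = 10                   # modifying the copy ...
print(A[0, 0])                     # 1
```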

### Iterating over array elements

> Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” (in optimized, pre-compiled C code).

source: numpy website

- Generally, we want to avoid iterating over the elements of arrays
    * whenever a vectorized alternative exists
- In an *interpreted language* like Python (or MATLAB, or R)
    * iterations are really slow compared to vectorized operations
- Always prefer the optimized numpy functions
    * an explicit `for` loop over array elements runs at interpreter speed
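As an illustration of the two styles, the sketch below computes a sum of squares first with an explicit loop and then with a single vectorized expression; both give the same result. It uses an explicit `import numpy as np` and Python 3 syntax, and the names are illustrative. The cast to `int64` is there only to avoid overflow on platforms where the default integer is 32 bits.

```python
import numpy as np

values = np.arange(10000)

# Explicit Python loop: one interpreted iteration per element.
loop_total = 0
for value in values:
    loop_total += int(value) ** 2

# Vectorized: the iteration happens inside numpy's compiled code.
vector_total = int(np.sum(values.astype(np.int64) ** 2))

print(loop_total == vector_total)   # True
```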

### Type casting

- Numpy arrays are *statically typed*
- the type of an array does not change once created
- but we can explicitly cast an array of some type to another
    - using the `astype` method
    - (see also the similar `asarray` function)
    - `astype` always creates a new array of the new type

In [97]:
M.dtype

dtype('float64')

In [98]:
M

array([[ 0.37032271,  0.72107431,  0.00817785],
       [ 0.54815689,  0.43702627,  0.48544123],
       [ 0.99570616,  0.83031926,  0.34833195]])

In [99]:
M2 = M.astype(bool)
M2

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)

In [100]:
M3 = M.astype(str)
M3

array([['0.370322706668', '0.721074306615', '0.00817785122685'],
       ['0.548156888856', '0.437026266439', '0.485441225292'],
       ['0.995706158013', '0.830319259652', '0.348331946386']], 
      dtype='|S32')
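Casting can lose information, which is worth checking on a small example before applying it to real data. The sketch below uses made-up values, an explicit `import numpy as np` and Python 3 syntax:

```python
import numpy as np

values = np.array([0.37, 1.99, -2.5])

as_int = values.astype(int)      # fractional parts are dropped (toward zero)
print(as_int)                    # [ 0  1 -2]

as_bool = values.astype(bool)    # only exact zeros become False
print(as_bool)                   # [ True  True  True]

print(values.dtype)              # float64, the original array is unchanged
```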

## Versions

In [101]:
%reload_ext version_information

%version_information numpy

Software,Version
Python,2.7.10 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
IPython,3.1.0
OS,Linux 4.0.3 boot2docker x86_64 with debian jessie sid
numpy,1.9.2
Mon Jun 15 14:25:40 2015 UTC,Mon Jun 15 14:25:40 2015 UTC


**Let's move to the next part :)**