# <img width=400 src="http://www.numpy.org/_static/numpy_logo.png" alt="Numpy"/>


## Why do we need numpy?

* You may have heard that "Python is slow". This is true when it comes to looping over many small Python objects.
* Python is dynamically typed and everything is an object, even an `int`. There are no primitive types.
* Numpy's main feature is the `ndarray` class, a fixed-length, homogeneously typed array class.
* Numpy implements a lot of functionality in fast C, Cython and Fortran code that works on these arrays.
* Python with vectorized operations using numpy can be blazingly fast.

See: [Python is not C](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C?lang=en)

But the most important reason:

* More beautiful code

In [1]:
import numpy as np

## More beautiful code through vectorisation

Pure Python with a list comprehension

In [2]:
voltages = [10.1, 15.1, 9.5]
currents = [1.2, 2.4, 5.2]

# U * I is the electrical power (the resistance would be U / I)
powers = [U * I for U, I in zip(voltages, currents)]
powers

[12.12, 36.239999999999995, 49.4]

Using numpy

In [3]:
U = np.array([10.1, 15.1, 9.5])
I = np.array([1.2, 2.4, 5.2])

P = U * I  # element-wise product: the electrical power
P

array([ 12.12,  36.24,  49.4 ])

### Finding the point with the smallest distance

In [4]:
import math

def euclidean_distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

point = (1, 2)
points = [(3, 2), (4, 2), (3, 0)]

min_distance = float('inf')
for other in points:
    distance = euclidean_distance(point, other)
    if distance < min_distance:
        closest = other
        min_distance = distance 

print(min_distance, closest)

2.0 (3, 2)


In [5]:
point = np.array([1, 2])
points = np.array([(3, 2), (4, 2), (3, 0)])

distance = np.linalg.norm(point - points, axis=1)
idx = np.argmin(distance)

print(distance[idx], points[idx])

2.0 [3 2]


## Small example timings

In [6]:
import math


def var(data):
    '''
    Welford's algorithm (described by Knuth) for a one-pass calculation
    of the population variance.
    Avoids the rounding errors that the naive approach
    `sum(v**2 for v in data) / n - (sum(data) / n)**2`
    suffers from when the values are large compared to their spread.
    '''
    
    n = 0
    mean = 0.0
    m2 = 0.0
    
    if len(data) < 2:
        return float('nan')

    for value in data:
        n += 1
        delta = value - mean
        mean += delta / n
        delta2 = value - mean
        m2 += delta * delta2

    return m2 / n 
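
The rounding problem mentioned in the docstring can be made visible with a small experiment. The following is a minimal sketch, not part of the original notebook: `naive_var` and `welford_var` are hypothetical helper names, and the test data (a small spread on top of an offset of `1e9`) is made up for illustration.

```python
import numpy as np


def naive_var(data):
    """Population variance as mean of squares minus square of the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(v**2 for v in data) / n - mean**2


def welford_var(data):
    """Population variance via the same one-pass update as `var` above."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for value in data:
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
    return m2 / n


# Hypothetical test data: small spread on top of a large offset
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]

exact = 22.5  # variance of (4, 7, 13, 16); the offset does not change it
naive = naive_var(data)
stable = welford_var(data)
reference = np.var(data)

print(naive, stable, reference)
```

The squares in the naive version are of order `1e18`, where neighbouring floats are more than 100 apart, so the small variance is lost in the subtraction. The one-pass update only ever works with differences from the running mean and stays accurate.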

In [7]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [8]:
%%timeit

l = list(range(1000))
var(l)

231 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [9]:
%%timeit

a = np.arange(1000)  # array with numbers 0,...,999

np.var(a)

31.1 µs ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


## Basic math: vectorized

Operations on numpy arrays work vectorized, element-by-element

**Lose your loops**

In [10]:
# create a numpy array from a python list
a = np.array([1.0, 3.5, 7.1, 4, 6])

In [11]:
2 * a

array([  2. ,   7. ,  14.2,   8. ,  12. ])

In [12]:
a**2

array([  1.  ,  12.25,  50.41,  16.  ,  36.  ])

In [13]:
a**a

array([  1.00000000e+00,   8.02117802e+01,   1.10645633e+06,
         2.56000000e+02,   4.66560000e+04])

In [14]:
np.cos(a)

array([ 0.54030231, -0.93645669,  0.68454667, -0.65364362,  0.96017029])

**Attention: You need the `cos` from numpy!**

In [15]:
math.cos(a)

TypeError: only length-1 arrays can be converted to Python scalars

Most normal python functions with basic operators like `*`, `+`, `**` simply work because
of operator overloading:

In [16]:
def poly(x):
    return x + 2 * x**2 - x**3

poly(a)

array([   2.   ,  -14.875, -249.991,  -28.   , -138.   ])

In [18]:
poly(np.e), poly(np.pi)

(-2.589142896867319, -8.125475224531307)

## Useful properties

In [19]:
len(a)

5

In [20]:
a.shape

(5,)

In [21]:
a.dtype

dtype('float64')

In [22]:
a.ndim

1

## Arbitrary dimension arrays

In [23]:
# two-dimensional array
y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

y + y

array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

In [24]:
# since Python 3.5, @ is the matrix product
y @ y

array([[ 30,  36,  42],
       [ 66,  81,  96],
       [102, 126, 150]])

In [25]:
# Broadcasting: the smaller array is stretched to match the shape of the larger one

y + np.array([1, 2, 3])

array([[ 2,  4,  6],
       [ 5,  7,  9],
       [ 8, 10, 12]])
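
Broadcasting also works along the other axis if the smaller array has a matching shape. A short sketch, assuming the same 3×3 array as above; the names `row`, `col` and `table` are made up for this illustration:

```python
import numpy as np

y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

row = np.array([1, 2, 3])   # shape (3,)
col = row.reshape(3, 1)     # shape (3, 1)

# shape (3,) is treated as (1, 3): `row` is added to every row of y
added_to_rows = y + row

# shape (3, 1): col[i] is added to every element of row i of y
added_to_cols = y + col

# (3, 1) combined with (3,) broadcasts to a full (3, 3) table
table = col * row

print(added_to_rows)
print(added_to_cols)
print(table)
```

The rule of thumb: shapes are compared from the last axis backwards, and two axes are compatible if they are equal or one of them has length 1.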

## Reduction operations

Numpy has many operations that reduce the dimensionality of arrays

In [26]:
x = np.random.normal(0, 1, 1000)

In [28]:
np.sum(x)

19.655378655439939

In [29]:
np.prod(x)

-1.0344065546056501e-293

In [30]:
np.mean(x)

0.01965537865543994

Standard Deviation

In [31]:
np.std(x)

0.96830751222407152

Standard error of the mean

In [32]:
np.std(x, ddof=1) / np.sqrt(len(x))

0.030635893919156273

Sample Standard Deviation

In [33]:
np.std(x, ddof=1)

0.96879202939836173

Most of the numpy functions are also methods of the array

In [34]:
x.mean(), x.std(), x.max(), x.min()

(0.01965537865543994,
 0.96830751222407152,
 3.29685437238488,
 -2.4088194062162374)

Difference between neighbor elements

In [37]:
z = np.arange(10)**2
diff_z = np.diff(z)

print(z)
print(diff_z)

[ 0  1  4  9 16 25 36 49 64 81]
[ 1  3  5  7  9 11 13 15 17]


### Reductions on multi-dimensional arrays


In [39]:
array2d = np.arange(20).reshape(4, 5)

array2d

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [40]:
np.sum(array2d, axis=0)

array([30, 34, 38, 42, 46])

In [41]:
np.var(array2d, axis=1)

array([ 2.,  2.,  2.,  2.])
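
The `axis` argument names the axis that disappears in the result. A small sketch on the same 4×5 array; the variable names are only for illustration:

```python
import numpy as np

array2d = np.arange(20).reshape(4, 5)

# axis=0 collapses the rows: one result per column
col_sums = array2d.sum(axis=0)      # shape (5,)

# axis=1 collapses the columns: one result per row
row_means = array2d.mean(axis=1)    # shape (4,)

# no axis given: reduce over all elements to a single number
total = array2d.sum()

# keepdims=True keeps the reduced axis with length 1,
# so the result broadcasts against the original array
centered = array2d - array2d.mean(axis=1, keepdims=True)

print(col_sums.shape, row_means.shape, total)
print(centered)
```

With `keepdims=True` the row means have shape `(4, 1)` instead of `(4,)`, which is what makes the subtraction work row by row.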

## Exercise 1

Write a function that calculates the analytical linear regression for a set of
x and y values.

Reminder:

$$ f(x) = a \cdot x + b$$

with 

$$
\hat{a} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} \\
\hat{b} = \bar{y} - \hat{a} \cdot \bar{x}
$$

In [42]:
# %load 04_01_numpy_solutions/exercise_linear.py
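
If the solution file is not at hand, one possible way to write the function is sketched below. This is not the notebook's reference solution, just a direct translation of the two formulas above. Covariance and variance are both normalised by $n$ here, so the normalisation cancels in the ratio.

```python
import numpy as np


def linear_regression(x, y):
    """Return slope a and intercept b of the least-squares line y = a * x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    mean_x = np.mean(x)
    mean_y = np.mean(y)

    # Cov(x, y) and Var(x), both normalised by n
    cov_xy = np.mean((x - mean_x) * (y - mean_y))
    var_x = np.mean((x - mean_x)**2)

    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


# Noise-free check: points on y = 5 x + 2 should give back a = 5, b = 2
x = np.linspace(0, 1, 50)
a, b = linear_regression(x, 5 * x + 2)
print(a, b)
```

Note that mixing `np.cov` (which divides by $n - 1$ by default) with `np.var` (which divides by $n$ by default) would give a slightly wrong slope.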

In [None]:
x = np.linspace(0, 1, 50)
y = 5 * np.random.normal(x, 0.1) + 2  # see section on random numbers later

a, b = linear_regression(x, y)
a, b

## Helpers for creating arrays

In [44]:
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [45]:
np.ones((5, 2))

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

In [47]:
np.full(5, np.nan)

array([ nan,  nan,  nan,  nan,  nan])

In [48]:
np.empty(10)  # attention: uninitialised memory, the contents are arbitrary, be careful

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [49]:
np.linspace(-2, 1, 1)

array([-2.])

In [55]:
# like range() for arrays:
np.arange(5)

array([0, 1, 2, 3, 4])

In [56]:
np.arange(2, 10, 2)

array([2, 4, 6, 8])

In [57]:
np.logspace(-4, 5, 10)

array([  1.00000000e-04,   1.00000000e-03,   1.00000000e-02,
         1.00000000e-01,   1.00000000e+00,   1.00000000e+01,
         1.00000000e+02,   1.00000000e+03,   1.00000000e+04,
         1.00000000e+05])

In [61]:
np.logspace(1, 4, 4, base=2)

array([  2.,   4.,   8.,  16.])

## Numpy Indexing

* Element access
* Slicing

In [62]:
x = np.arange(0, 10)

# like lists:
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [63]:
# like lists:
x[0]

0

In [64]:
# all elements with indices ≥1 and <4:
x[1:4]

array([1, 2, 3])

In [65]:
# negative indices count from the end
x[-1], x[-2]

(9, 8)

In [66]:
# combination:
x[3:-2]

array([3, 4, 5, 6, 7])

In [67]:
# step size
x[::2]

array([0, 2, 4, 6, 8])

In [68]:
# trick for reversal: negative step
x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [69]:
y = np.array([x, x + 10, x + 20, x + 30])
y

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])

In [70]:
# only one index ⇒ one-dimensional array
y[2]

array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In [71]:
# other axis: (: alone means the whole axis)
y[:, 3]

array([ 3, 13, 23, 33])

In [72]:
# inspecting the number of elements per axis:
y[:, 1:3].shape

(4, 2)

## Changing array content

In [73]:
y

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])

In [74]:
y[:, 3] = 0
y

array([[ 0,  1,  2,  0,  4,  5,  6,  7,  8,  9],
       [10, 11, 12,  0, 14, 15, 16, 17, 18, 19],
       [20, 21, 22,  0, 24, 25, 26, 27, 28, 29],
       [30, 31, 32,  0, 34, 35, 36, 37, 38, 39]])

Using slices on both sides

In [75]:
y[:,0] = x[3:7]
y

array([[ 3,  1,  2,  0,  4,  5,  6,  7,  8,  9],
       [ 4, 11, 12,  0, 14, 15, 16, 17, 18, 19],
       [ 5, 21, 22,  0, 24, 25, 26, 27, 28, 29],
       [ 6, 31, 32,  0, 34, 35, 36, 37, 38, 39]])

Transposing inverts the order of the dimensions

In [76]:
y

array([[ 3,  1,  2,  0,  4,  5,  6,  7,  8,  9],
       [ 4, 11, 12,  0, 14, 15, 16, 17, 18, 19],
       [ 5, 21, 22,  0, 24, 25, 26, 27, 28, 29],
       [ 6, 31, 32,  0, 34, 35, 36, 37, 38, 39]])

In [77]:
y.shape

(4, 10)

In [78]:
y.T

array([[ 3,  4,  5,  6],
       [ 1, 11, 21, 31],
       [ 2, 12, 22, 32],
       [ 0,  0,  0,  0],
       [ 4, 14, 24, 34],
       [ 5, 15, 25, 35],
       [ 6, 16, 26, 36],
       [ 7, 17, 27, 37],
       [ 8, 18, 28, 38],
       [ 9, 19, 29, 39]])

In [79]:
y.T.shape

(10, 4)

## Masks

* A boolean array can be used to select only the elements where it contains `True`.
* Very powerful tool to select the elements that fulfill a certain condition

In [80]:
a = np.linspace(0, 2, 11)
b = np.random.normal(0, 1, 11)

print(b >= 0)
print(a[b >= 0])

[False  True False False False  True False False  True  True False]
[ 0.2  1.   1.6  1.8]


In [81]:
a[[0, 2]] = np.nan  # a list of integer indices selects these positions
a

array([ nan,  0.2,  nan,  0.6,  0.8,  1. ,  1.2,  1.4,  1.6,  1.8,  2. ])

In [82]:
a[np.isnan(a)] = -1
a

array([-1. ,  0.2, -1. ,  0.6,  0.8,  1. ,  1.2,  1.4,  1.6,  1.8,  2. ])
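
Several conditions can be combined into one mask. A minimal sketch with made-up thresholds; the names `inside`, `outside` and `clipped` are only for illustration:

```python
import numpy as np

a = np.linspace(0, 2, 11)

# Combine conditions element-wise with & (and), | (or), ~ (not).
# The parentheses are required because & binds tighter than the comparisons.
inside = (a >= 0.5) & (a <= 1.5)
outside = ~inside

print(a[inside])

# A mask can be counted or averaged: True counts as 1
n_inside = np.sum(inside)
fraction_inside = np.mean(inside)

# np.where picks between two alternatives and leaves `a` untouched
clipped = np.where(a > 1.5, 1.5, a)

print(n_inside, fraction_inside)
```

Python's `and` / `or` do not work here, because they try to convert the whole array to a single truth value.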

## Random numbers

* numpy has a large number of distributions built in

In [84]:
np.random.uniform(-1, 1, 10)

array([-0.57514278, -0.2541962 , -0.27138755, -0.46221023, -0.32766479,
        0.51027397, -0.73860296, -0.75342314, -0.33284773, -0.53300585])

In [86]:
np.random.normal(0, 5, (2, 10))

array([[-7.82113651, -4.07681928, -9.61873752,  0.51879634,  6.27890798,
        -0.52368003,  7.96228987, -3.38776995, -3.12163582,  0.18131958],
       [-5.44323931,  3.48815262,  5.24580562,  4.48100025,  3.16809322,
         4.93363756, -5.9652398 ,  2.35107774,  6.62356386, -1.99691584]])

In [89]:
np.random.poisson(5, 2)

array([5, 2])
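
The calls above draw different numbers on every run. For reproducible results, a seeded generator can be used. The sketch below assumes NumPy 1.17 or newer, where `np.random.default_rng` is available; the seed value is arbitrary.

```python
import numpy as np

# Seed chosen arbitrarily for illustration
rng = np.random.default_rng(seed=42)

uniform = rng.uniform(-1, 1, 10)
normal = rng.normal(0, 5, (2, 10))
counts = rng.poisson(5, 2)

# A second generator with the same seed reproduces the same numbers
rng_again = np.random.default_rng(seed=42)
same_uniform = rng_again.uniform(-1, 1, 10)

print(np.array_equal(uniform, same_uniform))
```

The generator methods take the same distribution parameters as the `np.random.*` functions used above.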

## Calculating pi through Monte Carlo simulation

* We draw random points uniformly in a square with a side length of 2
* We count the points that lie inside the circle of radius 1

The area of the square is

$$
A_\mathrm{square} = a^2 = 4
$$

The area of the circle is
$$
A_\mathrm{circle} = \pi r^2 = \pi
$$

With
$$
\frac{n_\mathrm{circle}}{n_\mathrm{square}} = \frac{A_\mathrm{circle}}{A_\mathrm{square}}
$$
we can calculate $\pi$:

$$
\pi = 4 \frac{n_\mathrm{circle}}{n_\mathrm{square}}
$$

In [90]:
n_square = 10000000

x = np.random.uniform(-1, 1, n_square)
y = np.random.uniform(-1, 1, n_square)

radius = np.sqrt(x**2 + y**2)

n_circle = np.sum(radius <= 1.0)

print(4 * n_circle / n_square)

3.1424804
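
How far off can such an estimate be? Each point lands inside the circle with probability $p = \pi / 4$, so $n_\mathrm{circle}$ follows a binomial distribution and the statistical uncertainty of the estimate is roughly $\sigma_\pi = 4 \sqrt{p (1 - p) / n_\mathrm{square}}$. The sketch below is an addition to the notebook; it uses a seeded generator and a smaller sample purely to keep it quick and reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
n_square = 1_000_000

x = rng.uniform(-1, 1, n_square)
y = rng.uniform(-1, 1, n_square)

# Comparing squared radii avoids the square root
inside = x**2 + y**2 <= 1.0
p = np.mean(inside)  # estimated probability to land inside the circle

pi_estimate = 4 * p
pi_uncertainty = 4 * np.sqrt(p * (1 - p) / n_square)

print(f'{pi_estimate:.4f} ± {pi_uncertainty:.4f}')
```

The uncertainty shrinks only with $1 / \sqrt{n_\mathrm{square}}$: for the $10^7$ points used above it is roughly 0.0005, so a deviation in the third decimal place is expected.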


## Exercise

1. Draw 10000 Gaussian random numbers with a mean of $\mu = 2$ and a standard deviation of $\sigma = 3$
2. Calculate the mean and the standard deviation of the sample
3. What percentage of the numbers are outside of $[\mu - \sigma, \mu + \sigma]$?
4. How many of the numbers are $> 0$?
5. Calculate the mean and the standard deviation of all numbers ${} > 0$

In [91]:
# %load 04_01_numpy_solutions/exercise_gaussian.py

## Exercise

Monte-Carlo uncertainty propagation

* The Hubble constant as measured by Planck is
$$
H_0 = (67.74 \pm 0.47)\,\frac{\mathrm{km}}{\mathrm{s}\cdot\mathrm{Mpc}}
$$

* Estimate the mean and the uncertainty of the velocity of a galaxy which is measured to be $(500 \pm 100)\,\mathrm{Mpc}$ away,
using Monte Carlo methods

In [None]:
# %load 04_01_numpy_solutions/exercise_hubble.py

## Simple I/O functions

In [92]:
idx = np.arange(100)
x = np.random.normal(0, 1e5, 100)
y = np.random.normal(0, 1, 100)
n = np.random.poisson(20, 100)

In [93]:
idx.shape, x.shape, y.shape, n.shape

((100,), (100,), (100,), (100,))

In [94]:
np.savetxt(
    'data.txt',
    np.column_stack([idx, x, y, n]),
)

In [95]:
!head data.txt

0.000000000000000000e+00 7.944321587100188481e+04 -1.284045279035829490e-01 2.000000000000000000e+01
1.000000000000000000e+00 2.076340267666639702e+05 -6.483008418687143948e-01 2.700000000000000000e+01
2.000000000000000000e+00 -5.231002174333899893e+04 6.020768043670414738e-01 2.100000000000000000e+01
3.000000000000000000e+00 -1.089939124044197815e+05 2.578496397826560704e-01 2.000000000000000000e+01
4.000000000000000000e+00 1.320775145982903487e+05 -8.199118552380195435e-02 2.300000000000000000e+01
5.000000000000000000e+00 1.133857155710812367e+05 -6.675446289695297075e-02 1.900000000000000000e+01
6.000000000000000000e+00 4.443019883437853423e+04 -1.587392758883499067e-01 2.200000000000000000e+01
7.000000000000000000e+00 -6.791210829997497785e+04 5.009403846673653460e-01 1.400000000000000000e+01
8.000000000000000000e+00 -3.258680175104660157e+04 -7.370906456484876967e-01 2.500000000000000000e+01
9.000000000000000000e+00 -1.120390914855147857e+05 -2.079417428930444289e-01 1.50

In [97]:
# Load back the data, unpack=True is needed to read the data columnwise and not row-wise
idx, x, y, n = np.genfromtxt('data.txt', unpack=True)

idx.dtype, x.dtype

(dtype('float64'), dtype('float64'))

### Problems

* Everything is a float
* Way larger file than necessary because far more digits are written than the floats need
* No column names

## Numpy recarrays

* Numpy recarrays, and the structured arrays they are built on, can store columns of different types
* Rows are addressed by integer index
* Columns are addressed by strings

Solution for our I/O problem → Column names, different types

In [98]:
# for more options on formatting see
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html

np.savetxt(  # returns None, so there is nothing to assign
    'data.csv',
    np.column_stack([idx, x, y, n]),
    delimiter=',', # true csv file
    header=','.join(['idx', 'x', 'y', 'n']),
    fmt=['%d', '%.4g', '%.4g', '%d'],  # One formatter for each column
)

In [100]:
!head data.csv

# idx,x,y,n
0,7.944e+04,-0.1284,20
1,2.076e+05,-0.6483,27
2,-5.231e+04,0.6021,21
3,-1.09e+05,0.2578,20
4,1.321e+05,-0.08199,23
5,1.134e+05,-0.06675,19
6,4.443e+04,-0.1587,22
7,-6.791e+04,0.5009,14
8,-3.259e+04,-0.7371,25


In [101]:
data = np.genfromtxt(
    'data.csv',
    names=True, # load column names from first row
    dtype=None, # Automagically determine the best data type for each column
    delimiter=',',
)

In [103]:
data[:10]

array([(0,   79440., -0.1284 , 20), (1,  207600., -0.6483 , 27),
       (2,  -52310.,  0.6021 , 21), (3, -109000.,  0.2578 , 20),
       (4,  132100., -0.08199, 23), (5,  113400., -0.06675, 19),
       (6,   44430., -0.1587 , 22), (7,  -67910.,  0.5009 , 14),
       (8,  -32590., -0.7371 , 25), (9, -112000., -0.2079 , 15)], 
      dtype=[('idx', '<i8'), ('x', '<f8'), ('y', '<f8'), ('n', '<i8')])

In [104]:
data[0]

(0,  79440., -0.1284, 20)

In [105]:
data['n']

array([20, 27, 21, 20, 23, 19, 22, 14, 25, 15, 17, 20, 22, 16, 18, 23, 15,
       27, 19, 20, 14, 23, 15, 25, 22, 13, 21, 18, 29, 21, 14, 14, 14, 13,
       21, 17, 25, 19, 19, 17, 20, 18, 17, 19, 26, 23, 20, 17, 15, 24, 19,
       25, 16, 17, 28, 14, 19, 16, 24, 19, 25, 16, 22, 18, 20, 22, 19, 21,
       15, 20, 24, 13, 14, 26, 15, 26, 21, 14, 21, 19, 20, 14, 22, 26, 15,
       19, 23, 20, 15, 25, 14, 23, 33, 17, 20, 24, 19, 21, 18, 20])

In [106]:
data.dtype

dtype([('idx', '<i8'), ('x', '<f8'), ('y', '<f8'), ('n', '<i8')])
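
Such arrays can also be built by hand, and masks work on them as on any other array. A minimal sketch; the column names and values are hypothetical and only serve as an illustration:

```python
import numpy as np

# Hypothetical column names and values, for illustration only
dtype = [('name', 'U10'), ('mass', 'f8'), ('count', 'i8')]

table = np.array(
    [('a', 1.5, 3), ('b', 0.2, 10), ('c', 4.0, 7)],
    dtype=dtype,
)

# Columns by name, rows by integer index
masses = table['mass']
first_row = table[0]

# A mask built from one column selects whole rows
heavy = table[table['mass'] > 1.0]

# Several columns at once
subset = table[['name', 'count']]

print(heavy['name'])
```

Each column keeps its own type, so `table['count']` is an integer array while `table['mass']` holds floats.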

## Linear algebra

Numpy offers a lot of linear algebra functionality, mostly wrapping LAPACK

In [109]:
# symmetric matrix, use eigh
# If not symmetric, use eig
mat = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4]
])

eig_vals, eig_vecs = np.linalg.eigh(mat)

eig_vals, eig_vecs

(array([-1.40512484,  4.        ,  6.40512484]),
 array([[ -3.07818468e-01,   8.32050294e-01,   4.61454330e-01],
        [  8.31898624e-01,  -5.10993288e-17,   5.54927635e-01],
        [  4.61727702e-01,   5.54700196e-01,  -6.92181495e-01]]))

In [110]:
np.linalg.inv(mat)

array([[ 0.13888889,  0.22222222,  0.16666667],
       [ 0.22222222, -0.44444444, -0.33333333],
       [ 0.16666667, -0.33333333,  0.        ]])
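
The results can be checked against the defining equations. A short sketch using the same matrix; the right-hand side `b` is made up for the example:

```python
import numpy as np

mat = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4],
])

eig_vals, eig_vecs = np.linalg.eigh(mat)

# The eigenvectors are the columns of eig_vecs: mat @ v_i = lambda_i * v_i.
# Broadcasting multiplies column i by eig_vals[i].
print(np.allclose(mat @ eig_vecs, eig_vecs * eig_vals))

# For a symmetric matrix the eigenvectors are orthonormal
print(np.allclose(eig_vecs.T @ eig_vecs, np.eye(3)))

# Solving mat @ x = b directly is usually preferable to forming the inverse
b = np.array([8, -5, 6])
x = np.linalg.solve(mat, b)
print(x)
```

`np.linalg.solve` avoids computing the full inverse, which is usually faster and numerically more robust than `np.linalg.inv(mat) @ b`.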

## Numpy matrices

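Numpy also has a matrix class, with operator overloading suited for matrices. Note that the NumPy documentation no longer recommends `np.matrix`, even for linear algebra; regular arrays together with the `@` operator cover the same use cases.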

In [111]:
mat = np.matrix(mat)

In [112]:
mat.T

matrix([[ 4,  2,  0],
        [ 2,  1, -3],
        [ 0, -3,  4]])

In [113]:
mat ** 2

matrix([[ 20,  10,  -6],
        [ 10,  14, -15],
        [ -6, -15,  25]])

In [114]:
mat * 5

matrix([[ 20,  10,   0],
        [ 10,   5, -15],
        [  0, -15,  20]])

In [115]:
mat.I

matrix([[ 0.13888889,  0.22222222,  0.16666667],
        [ 0.22222222, -0.44444444, -0.33333333],
        [ 0.16666667, -0.33333333,  0.        ]])

In [116]:
mat * np.matrix([1, 2, 3]).T

matrix([[ 8],
        [-5],
        [ 6]])
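
The same operations can be written with plain arrays. A sketch of the equivalents, assuming the same 3×3 matrix; the names `arr` and `vec` are made up for this example:

```python
import numpy as np

arr = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4],
])
vec = np.array([1, 2, 3])

product = arr @ vec                        # matrix-vector product
squared = np.linalg.matrix_power(arr, 2)   # arr @ arr, not the element-wise arr**2
scaled = arr * 5                           # multiplication by a scalar
inverse = np.linalg.inv(arr)               # counterpart of mat.I

print(product)
print(squared)
```

The main difference to keep in mind: for arrays, `*` and `**` act element by element, while for `np.matrix` they mean matrix multiplication and matrix power.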