# Introduction to NumPy

This material is inspired by several sources:

* https://github.com/SciTools/courses
* https://github.com/paris-saclay-cds/python-workshop/blob/master/Day_1_Scientific_Python/01-numpy-introduction.ipynb

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 1. Creating NumPy arrays

We can easily create a NumPy array from a sequence using the function `np.array`.

In [2]:
X = np.array([0, 1, 2, 3, "5"], dtype=float)

In [3]:
X.dtype

dtype('float64')

In [4]:
X

array([0., 1., 2., 3., 5.])

Sometimes we want an array with a particular content: only zeros (`np.zeros`), only ones (`np.ones`), equally spaced values (`np.linspace`), logarithmically spaced values (`np.logspace`), etc.
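As a minimal sketch of the constructors listed above, the cell below builds one small array with each of them. The shapes and ranges are arbitrary choices made for illustration.

```python
import numpy as np

zeros = np.zeros(4)              # four zeros, float64 by default
ones = np.ones((2, 3))           # 2 x 3 array of ones
lin = np.linspace(0, 1, num=5)   # 5 points from 0 to 1, both ends included
log = np.logspace(0, 3, num=4)   # 10**0, 10**1, 10**2, 10**3

print(zeros)
print(ones.shape)
print(lin)
print(log)
```

Note that `np.linspace` takes the number of points, whereas `np.logspace` takes its start and stop as exponents of the base (10 by default).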

### Exercise

Try out some of these ways of creating NumPy arrays. See if you can:

* create a NumPy array from a list of integer numbers. Use the function [`np.array()`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.array.html) and pass the Python list. You can refer to the example from the documentation.

In [5]:
# %load solutions/01_solutions.py
x = np.array([1, 2, 3], dtype=np.int8)

In [6]:
x.dtype

dtype('int8')

While checking the documentation of [np.array](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.array.html), an interesting parameter to pay attention to is `dtype`. This parameter forces the data type of the elements stored in the array.
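The short sketch below illustrates what forcing a data type can look like in practice. The variable names are chosen for this example only, and `astype` is shown as one possible way to convert an existing array.

```python
import numpy as np

as_float = np.array([1, 2, 3], dtype=float)       # integers stored as float64
as_int = np.array([1.7, 2.2, -0.5]).astype(int)   # conversion truncates toward zero
small = np.array([1, 2, 3], dtype=np.int8)        # one byte per element

print(as_float.dtype)
print(as_int)
print(small.itemsize)
```

A smaller `dtype` such as `np.int8` saves memory, but it also restricts the range of values the array can hold.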

* create a 3-dimensional NumPy array filled with zeros or ones. You can check the documentation of [np.zeros](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.zeros.html) and [np.ones](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ones.html).

In [7]:
# %load solutions/03_solutions.py
np.ones(shape=(3, 2, 4), dtype=int)

array([[[1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1]]])

In [8]:
x

array([1, 2, 3], dtype=int8)

* create a NumPy array filled with a constant value other than 0 or 1. (Hint: this can be achieved using the last array you created, or you could use [np.empty](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.empty.html) and find a way of filling the array with a constant value.)

In [9]:
# %load solutions/04_solutions.py
np.ones(shape=(3, 2, 2)) * 5

array([[[5., 5.],
        [5., 5.]],

       [[5., 5.],
        [5., 5.]],

       [[5., 5.],
        [5., 5.]]])

In [10]:
X = np.empty(shape=(3, 2))
X

array([[4.67863973e-310, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000]])
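The values shown by `np.empty` are whatever happened to be in memory, so the array has to be overwritten before use. A possible way to finish the exercise, sketched here with an arbitrary constant of 7, is to call the `fill` method, or to use `np.full`, which allocates and fills in one step.

```python
import numpy as np

filled = np.empty(shape=(3, 2))   # contents are arbitrary until overwritten
filled.fill(7.0)                  # overwrite every element in place

direct = np.full((3, 2), 7.0)     # allocate and fill in a single call

print(np.array_equal(filled, direct))
```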

* create a NumPy array of 8 elements with a range of values starting from 0 and a spacing of 3 between each element. (Hint: check the function [np.arange](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html).)

In [11]:
# %load solutions/05_solutions.py
np.arange(start=0, stop=8 * 3, step=3)

array([ 0,  3,  6,  9, 12, 15, 18, 21])

## 2. Manipulating NumPy array

### 2.1 Indexing

Note that NumPy arrays are zero-indexed:

In [12]:
data = np.random.randn(10000, 5)

In [13]:
data[0, 0]

-0.23029497223370993

It means that the third element in the first row has an index of [0, 2]:

In [14]:
data[0, 2]

-0.27040295137999937

We can also assign a new value to an element:

In [15]:
data[0, 2] = 100.
print(data[0, 2])

100.0


NumPy (like Python sequences in general) checks that indices stay within the bounds of the array:

In [16]:
print(data.shape)
data[60, 10]

(10000, 5)


IndexError: index 10 is out of bounds for axis 1 with size 5
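Because an out-of-bounds access raises an ordinary Python exception, it can be caught like any other. The sketch below uses a small hypothetical array, `grid`, rather than `data`.

```python
import numpy as np

grid = np.zeros((3, 5))   # hypothetical small array for illustration

try:
    grid[1, 10]           # axis 1 only has indices 0 to 4
except IndexError as err:
    message = str(err)
    print("Caught:", message)
```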

Finally, we can ask for several elements at once:

In [17]:
data[0, [0, 3, 4]]

array([-0.23029497,  0.31147795,  0.33208437])

In [18]:
data[[0, 1], [0, 1]]

array([-0.23029497, -0.30358052])

In [19]:
data[:3, :3]

array([[ -0.23029497,   1.1899337 , 100.        ],
       [  0.17569359,  -0.30358052,   0.73374989],
       [ -0.20183838,   0.25542878,  -0.19675949]])
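The last two cells behave differently: passing two lists pairs the indices element by element, whereas slices select a rectangular block. The sketch below shows this on a small deterministic array, and uses `np.ix_` as one way to obtain a block from index lists. The array `table` is made up for this example.

```python
import numpy as np

table = np.arange(12).reshape(3, 4)     # rows: [0..3], [4..7], [8..11]

pairs = table[[0, 1], [0, 1]]           # elements (0, 0) and (1, 1)
block = table[np.ix_([0, 1], [0, 1])]   # 2 x 2 block: rows 0-1, columns 0-1
sliced = table[:2, :2]                  # the same block, obtained with slices

print(pairs)
print(block)
```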

You can also pass a negative index, which counts from the end of the array.

In [20]:
data[-1, -1]

-0.6776255287079505

### 2.2 Slices

We can use the same slicing syntax as with Python lists or Pandas dataframes to get elements along one of the axes.

In [21]:
data[0, 0:2]

array([-0.23029497,  1.1899337 ])

Note that the returned array does not include the third column (with index 2).

You can omit the first or last index of a slice, which means taking the values from the beginning or up to the end, respectively:

In [22]:
data[0, :2]

array([-0.23029497,  1.1899337 ])

If you omit both indices in the slice, leaving only the colon (:), you will get all columns of this row:

In [23]:
data[0, :]

array([ -0.23029497,   1.1899337 , 100.        ,   0.31147795,
         0.33208437])

In [24]:
data[3:6, 2:5]

array([[-1.51754594,  0.84286967, -0.73546053],
       [-1.67262402,  1.06791149, -1.0035775 ],
       [-0.55015901, -0.69630301, -1.45661449]])

### 2.3 Filtering data

In [25]:
data

array([[ -0.23029497,   1.1899337 , 100.        ,   0.31147795,
          0.33208437],
       [  0.17569359,  -0.30358052,   0.73374989,  -0.65402449,
         -0.45535334],
       [ -0.20183838,   0.25542878,  -0.19675949,  -0.4466217 ,
          0.67718881],
       ...,
       [ -0.32746943,   1.27187817,   0.29350658,   0.24936921,
         -0.72684177],
       [  0.57077365,  -1.45978908,   0.81825912,   0.55576393,
         -0.88271376],
       [ -0.39503392,   0.60551727,  -0.42491952,   0.66696407,
         -0.67762553]])

We can produce a boolean array when using comparison operators.

In [26]:
data > 0

array([[False,  True,  True,  True,  True],
       [ True, False,  True, False, False],
       [False,  True, False, False,  True],
       ...,
       [False,  True,  True,  True, False],
       [ True, False,  True,  True, False],
       [False,  True, False,  True, False]])

This mask can be used to select some specific data.

In [27]:
data[data > 0]

array([  1.1899337 , 100.        ,   0.31147795, ...,   0.55576393,
         0.60551727,   0.66696407])

It can also be used to assign new values:

In [28]:
data[data > 0] = np.inf
data

array([[-0.23029497,         inf,         inf,         inf,         inf],
       [        inf, -0.30358052,         inf, -0.65402449, -0.45535334],
       [-0.20183838,         inf, -0.19675949, -0.4466217 ,         inf],
       ...,
       [-0.32746943,         inf,         inf,         inf, -0.72684177],
       [        inf, -1.45978908,         inf,         inf, -0.88271376],
       [-0.39503392,         inf, -0.42491952,         inf, -0.67762553]])
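Masks can also be combined, and `np.where` offers a way to build a modified array without overwriting the original. The following sketch uses a small made-up array, `values`, to keep the results easy to check.

```python
import numpy as np

values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# Combine conditions with & (and) and | (or); the parentheses are required.
inside = values[(values > -1) & (values < 1)]

# np.where builds a new array instead of modifying `values` in place.
replaced = np.where(values > 0, 1.0, values)

print(inside)
print(replaced)
```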

### 2.4 Quiz

Answer the following questions:

In [29]:
data = np.random.randn(20, 20)

* Print the element in the $1^{st}$ row and $10^{th}$ column of the data.

In [30]:
# %load solutions/08_solutions.py
data[0, 9]


0.49973606163895756

* Print the elements in the $3^{rd}$ row and in the $3^{rd}$ and $15^{th}$ columns.

In [31]:
# %load solutions/09_solutions.py
data[2, [2, 14]]

array([2.08542219, 0.95440914])

* Print the elements in the $3^{rd}$ row and columns from $3^{rd}$ to $15^{th}$.

In [32]:
# %load solutions/10_solutions.py
data[2, 2:15]

array([ 2.08542219,  0.82392869, -0.82206832, -1.52094535,  1.0045179 ,
        0.22698106,  0.69275847,  0.23348535,  1.43505282,  0.87571082,
       -0.89670575,  0.04510817,  0.95440914])

* Print all the elements in the $15^{th}$ column whose value is above 0.

In [33]:
# %load solutions/11_solutions.py
X_col = data[:, 14]
X_col[X_col > 0]

array([0.56174787, 1.24223949, 0.95440914, 0.34653856, 1.68908027,
       1.04195303, 0.27676806, 0.99595048, 0.06246563, 0.6799173 ,
       0.06070187])

In [34]:
data[0]

array([-0.81220411,  1.61108105,  0.66946423,  0.84254325,  1.48910386,
        0.43477308, -0.18313331,  1.48903927, -0.03938966,  0.49973606,
       -0.87823306,  0.98800442,  0.36895262, -0.38490438,  0.56174787,
        1.85126724, -2.46254636, -0.14519051, -1.63050743,  1.56542158])

## 3. Numerical analysis

Vectorizing code is the key to writing efficient numerical calculations with Python/NumPy. This means that as much of a program as possible should be formulated in terms of matrix and vector operations.
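To make the idea concrete, the sketch below computes the same expression twice: once with an explicit Python loop and once as a single array expression. The expression `3 * x + 1` is an arbitrary example.

```python
import numpy as np

x = np.arange(1_000, dtype=float)

# Explicit Python loop: one interpreted iteration per element.
loop_result = np.empty_like(x)
for i in range(x.size):
    loop_result[i] = 3.0 * x[i] + 1.0

# Vectorized form: the same arithmetic applied to the whole array at once.
vector_result = 3.0 * x + 1.0

print(np.allclose(loop_result, vector_result))
```

Both versions give the same values. The vectorized one is shorter and moves the loop out of the Python interpreter, which is typically much faster on large arrays.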

### 3.1 Scalar-array operations

We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers.

In [35]:
import numpy as np

In [36]:
v1 = np.arange(0, 5, dtype=int)
v1

array([0, 1, 2, 3, 4])

In [37]:
id(v1)

140195965850768

In [38]:
v1 *= 2

In [39]:
id(v1)

140195965850768

In [40]:
v1 + 2

array([ 2,  4,  6,  8, 10])

In [41]:
np.sin(v1, out=v1)  # np.log(A), np.arctan(A),...

TypeError: ufunc 'sin' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind''

In [42]:
v1 = np.arange(0, 5, dtype=float)
v1

array([0., 1., 2., 3., 4.])

In [43]:
np.sin(v1, out=v1)  # np.log(A), np.arctan(A),...

array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ])
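The cells above illustrate two points: an in-place operator such as `*=` modifies the existing array (the `id` does not change), and `out=` only works when the output array can hold the result type. The sketch below restates both with hypothetical names.

```python
import numpy as np

counts = np.arange(5)        # integer array
alias = counts               # second name for the same array

counts *= 2                  # in place: `alias` sees the change
same_object = alias is counts

doubled = counts * 2         # creates a new array; `counts` is untouched

# np.sin returns floats, so its result cannot be written into an integer
# array; converting first gives a float array that can serve as `out`.
angles = counts.astype(float)
np.sin(angles, out=angles)

print(same_object, alias, doubled)
```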

### 3.2 Element-wise array-array operations

When we add, subtract, multiply and divide arrays with each other, the default behaviour is **element-wise** operations:

In [44]:
A = np.array([[1, 2], [3, 4]])
A

array([[1, 2],
       [3, 4]])

In [45]:
A * A  # element-wise multiplication

array([[ 1,  4],
       [ 9, 16]])

In [46]:
v1 * v1

array([0.        , 0.70807342, 0.82682181, 0.01991486, 0.57275002])

In [47]:
A = np.ones(shape=(3, 2))
v = np.ones(shape=(2, 1)) * 3

In [48]:
A

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

In [49]:
v

array([[3.],
       [3.]])

In [50]:
np.dot(A, v)

array([[6.],
       [6.],
       [6.]])

In [51]:
A_mat = np.matrix(A)
v_mat = np.matrix(v)

In [52]:
A_mat * v_mat

matrix([[6.],
        [6.],
        [6.]])
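The NumPy documentation no longer recommends the `np.matrix` class for new code. As an alternative sketch, the `@` operator performs the matrix product directly on regular arrays and gives the same values as `np.dot` for these 2D inputs.

```python
import numpy as np

A = np.ones(shape=(3, 2))
v = np.ones(shape=(2, 1)) * 3

product = A @ v              # matrix product on plain arrays

print(product)
print(np.array_equal(product, np.dot(A, v)))
```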

In [53]:
v = A[:1].copy()

In [54]:
v[0, :] = 20

In [55]:
v.shape

(1, 2)

In [56]:
x = A[[0, 1], [0, 1]]

In [57]:
x[0] = 1000

In [58]:
x

array([1000.,    1.])

In [59]:
A

array([[1., 1.],
       [1., 1.],
       [1., 1.]])
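The last cells show that modifying `x` left `A` untouched: indexing with lists returns a copy. A plain slice, on the other hand, returns a view that shares memory with the original, which is why `.copy()` was called explicitly above. The sketch below checks this with `np.shares_memory`.

```python
import numpy as np

A = np.ones(shape=(3, 2))

view = A[:1]                 # basic slice: shares memory with A
picked = A[[0, 1], [0, 1]]   # integer-array indexing: independent copy

picked[0] = 1000             # does not touch A
unchanged = A[0, 0]          # still 1.0 at this point
view[0, 0] = 20              # writes through to A

print(np.shares_memory(A, view), np.shares_memory(A, picked))
print(A[0, 0])
```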

### 3.3 Calculations

Often it is useful to store datasets in NumPy arrays. NumPy provides a number of functions to calculate statistics of datasets in arrays. 

In [60]:
a = np.random.random(40)
a

array([0.11257118, 0.00707165, 0.796651  , 0.59683051, 0.36965898,
       0.34240898, 0.47536519, 0.63063751, 0.92863102, 0.12215135,
       0.8916469 , 0.99855034, 0.67240504, 0.58511666, 0.89322534,
       0.02270786, 0.07802652, 0.38323684, 0.16494139, 0.9300676 ,
       0.87359102, 0.78278558, 0.66394003, 0.05666255, 0.77615218,
       0.41116696, 0.93758067, 0.4321839 , 0.71327878, 0.30137876,
       0.48104147, 0.05153989, 0.71460069, 0.30797328, 0.11598163,
       0.66273747, 0.18612732, 0.98652375, 0.36451471, 0.18288094])

Several frequently used operations are available:

In [61]:
np.cumsum(a)[-1]

20.004543428142725

In [62]:
print('Mean value is', np.mean(a))
print('Median value is',  np.median(a))
print('Std is', np.std(a))
print('Variance is', np.var(a))
print('Min is', a.min())
print('Element of minimum value is', a.argmin())
print('Max is', a.max())
print('Sum is', np.sum(a))
print('Prod', np.prod(a))
print('Cumsum is', np.cumsum(a)[-1])
print('CumProd of 5 first elements is', np.cumprod(a)[4])
print('Unique values in this array are:', np.unique(np.random.randint(1, 6, 10)))
print('85% Percentile value is: ', np.percentile(a, 85))

Mean value is 0.5001135857035681
Median value is 0.47820333233603174
Std is 0.31254338361511347
Variance is 0.09768336664158397
Min is 0.00707164650271308
Element of minimum value is 1
Max is 0.9985503384902731
Sum is 20.004543428142725
Prod 1.949457400243667e-19
Cumsum is 20.004543428142725
CumProd of 5 first elements is 0.0001399162419689769
Unique values in this array are: [2 3 4 5]
85% Percentile value is:  0.8918836654057933


In [63]:
a = np.random.random(40)
print(a.argsort())
a.sort()  # sorts in place!
print(a.argsort())

[28  7  8 17  3 35 10 24 23 21 22 38  2 26 12 15 36  6 13  9  1  0 32 34
 29 27 16 14 25 39  4 19  5 18 33 37 20 30 11 31]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]


In [64]:
np.sort(a)

array([0.00892998, 0.02912979, 0.03415597, 0.03662595, 0.04609003,
       0.08032552, 0.14470749, 0.17638347, 0.24236012, 0.24394232,
       0.24732706, 0.27415769, 0.30138838, 0.34456106, 0.36267078,
       0.37166876, 0.38202955, 0.40892199, 0.46472382, 0.47009349,
       0.48531824, 0.50501087, 0.50924792, 0.51789323, 0.51997152,
       0.5541998 , 0.61423311, 0.63647774, 0.63703256, 0.74178062,
       0.74312347, 0.84048966, 0.85500738, 0.87631295, 0.87721729,
       0.89212601, 0.91865751, 0.92535548, 0.92930412, 0.97365426])
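A common use of `argsort` is to reorder a second array in the same way as the first. The sketch below uses two small hypothetical arrays, `scores` and `labels`, and also shows that `np.sort` returns a sorted copy rather than sorting in place.

```python
import numpy as np

scores = np.array([0.7, 0.1, 0.4])
labels = np.array(["a", "b", "c"])   # hypothetical labels aligned with scores

order = scores.argsort()             # indices that would sort `scores`
sorted_copy = np.sort(scores)        # sorted copy; `scores` is unchanged
labels_by_score = labels[order]      # reorder a second array the same way

print(order)
print(labels_by_score)
```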

#### Calculations with higher-dimensional data

When functions such as `min`, `max`, etc., are applied to a multidimensional array, it is sometimes useful to apply the calculation to the entire array, and sometimes only on a row or column basis. Using the `axis` argument we can specify how these functions should behave:

In [65]:
m = np.random.rand(3, 2)
m

array([[0.40607757, 0.6404395 ],
       [0.77145628, 0.27917383],
       [0.44531463, 0.31360372]])

In [66]:
# global max
m.max()

0.7714562816466072

In [67]:
# max in each column
m.max(axis=0)

array([0.77145628, 0.6404395 ])

In [68]:
# max in each row
m.max(axis=1)

array([0.6404395 , 0.77145628, 0.44531463])

Many other functions and methods in the `array` and `matrix` classes accept the same (optional) `axis` keyword argument.
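As a small check of how `axis` behaves with other reductions, the sketch below uses a deterministic array so the results can be verified by hand. The axis given is the one that is collapsed: `axis=0` runs down the rows and leaves one value per column.

```python
import numpy as np

m = np.array([[1, 5],
              [7, 2],
              [4, 3]])

col_max = m.max(axis=0)     # collapse the rows: one value per column
row_max = m.max(axis=1)     # collapse the columns: one value per row
row_sum = m.sum(axis=1)
col_mean = m.mean(axis=0)

print(col_max, row_max, row_sum, col_mean)
```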

## 4. Data reshaping and merging

* How could you change the shape of the 8-element array you created previously to have shape (2, 2, 2)? Hint: this can be done without creating a new array.

In [69]:
arr = np.arange(8)

In [70]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7])

In [71]:
arr.reshape((2, 2, 2))

array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])

In [72]:
arr[np.newaxis,:]

array([[0, 1, 2, 3, 4, 5, 6, 7]])

In [73]:
arr.reshape(1, -1)

array([[0, 1, 2, 3, 4, 5, 6, 7]])

In [74]:
# %load solutions/07_solutions.py
arr = np.random.random(8)
print('Shape of the array', arr.shape)
arr_reshaped = arr.reshape((2, 2, 2))
print('Shape of the array', arr_reshaped.shape)


Shape of the array (8,)
Shape of the array (2, 2, 2)
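Regarding the hint above: for a contiguous array such as this one, `reshape` returns a view on the same data rather than a copy. The sketch below checks this with `np.shares_memory` and also shows `-1`, which asks NumPy to infer the missing dimension.

```python
import numpy as np

arr = np.arange(8)

cube = arr.reshape(2, 2, 2)      # new shape, same underlying data
inferred = arr.reshape(2, -1)    # -1 lets NumPy infer the missing dimension

cube[0, 0, 0] = 99               # visible through `arr` because data is shared

print(np.shares_memory(arr, cube))
print(arr[0], inferred.shape)
```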


* Could you reshape the same 8-element array to a column vector? Do the same to get a row vector. You can use `np.reshape` or `np.newaxis`.

In [75]:
# %load solutions/22_solutions.py
# Column vector: shape (8, 1)
column_arr = arr.reshape(-1, 1)
column_arr = arr[:, np.newaxis]

# Row vector: shape (1, 8)
row_arr = arr.reshape(1, -1)
row_arr = arr[np.newaxis, :]


* Stack two 1D NumPy arrays of size 10 vertically. Then, stack them horizontally. You can use the functions [np.hstack](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.hstack.html) and [np.vstack](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.vstack.html). Repeat those two operations using the function [np.concatenate](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.concatenate.html) with two 2D NumPy arrays of size 5 x 2.

In [76]:
# %load solutions/20_solutions.py
X = np.random.randn(10)
Y = np.random.randn(10)
print(np.hstack((X, Y)).shape)
print(np.vstack((X, Y)).shape)


(20,)
(2, 10)


In [77]:
# %load solutions/21_solutions.py
X = np.random.randn(5, 2)
Y = np.random.randn(5, 2)
print(np.concatenate((X, Y), axis=0).shape)
print(np.concatenate((X, Y), axis=1).shape)


(10, 2)
(5, 4)
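To relate the two approaches, the sketch below checks that, for 2D arrays, `np.concatenate` along axis 0 and axis 1 matches `np.vstack` and `np.hstack`. For 1D inputs, `np.vstack` behaves as if each array were first turned into a row. The arrays used here are arbitrary.

```python
import numpy as np

X = np.zeros((5, 2))
Y = np.ones((5, 2))

rows = np.concatenate((X, Y), axis=0)   # same result as np.vstack((X, Y))
cols = np.concatenate((X, Y), axis=1)   # same result as np.hstack((X, Y))

# For 1D inputs, np.vstack first promotes each array to a row.
a = np.arange(3)
b = np.arange(3, 6)
stacked = np.vstack((a, b))
by_hand = np.concatenate((a[np.newaxis, :], b[np.newaxis, :]), axis=0)

print(rows.shape, cols.shape, stacked.shape)
```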
