
# Numpy Introduction

Useful resources:

https://numpy.org/doc/1.18/reference/index.html

https://jakevdp.github.io/PythonDataScienceHandbook/02.03-computation-on-arrays-ufuncs.html

https://www.analyticsvidhya.com/blog/2020/04/the-ultimate-numpy-tutorial-for-data-science-beginners/ 



NB: all the print("\n") statements are there so that Google Colab displays multiple outputs per cell nicely. In a Jupyter notebook they would not be necessary.


In [0]:
import numpy as np
import pandas as pd

In [0]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Pandas and NumPy
Pandas DataFrames are, to a large extent, a (very fancy) wrapper around NumPy arrays.

You can access the underlying NumPy array with the `.values` attribute:

In [0]:
df = pd.DataFrame(dict(x=[1, 2, 3], y=[4, 5, 6]))
df
df.values

   x  y
0  1  4
1  2  5
2  3  6


array([[1, 4],
       [2, 5],
       [3, 6]])
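
A small sketch of two ways to get at the underlying array. `DataFrame.to_numpy()` is the method form that the pandas documentation recommends over `.values`; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(x=[1, 2, 3], y=[4, 5, 6]))

arr_values = df.values        # attribute used in the cell above
arr_to_numpy = df.to_numpy()  # method form recommended by the pandas docs

print(type(arr_to_numpy))     # both are plain numpy.ndarray objects
print(arr_to_numpy.shape)     # (3, 2): 3 rows, 2 columns
print(np.array_equal(arr_values, arr_to_numpy))
```

Note that the row index and column names are not part of the array; only the values are.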

# What is NumPy and why learn it?

NumPy stands for Numerical Python. It is a library for highly optimized numerical computing in Python, with a core implemented largely in C. It forms the backbone of most of the data science stack in Python.

So it's useful to at least know the basics.

Sidenote:
 - NumPy arrays are optimized for, you guessed it, numerical operations. NumPy therefore offers relatively little functionality for storing and manipulating strings in arrays, something that is typical in pandas DataFrames. This is one of the things that pandas tries to make easier for the end user.


    



## Reason 1: Speeding up your code
It's also really useful to know NumPy in order to speed up your code:

If you perform some operation with a for loop over all values in a DataFrame column, for example, chances are that there is a vectorized NumPy implementation that is faster, sometimes a lot faster (roughly 100x in the addition example that follows, and more in some cases).

In [0]:
def add_func(a, b):
    assert len(a)==len(b)
    
    new_list = []
    for i in range(len(a)):
        new_list.append(a[i] + b[i])
    return new_list

a = [x for x in range(1000000)]
b = [1000000 + x for x in range(1000000)]

%timeit add_func(a, b)

10 loops, best of 3: 149 ms per loop


In [0]:
a = np.arange(0,1000000)
b = np.arange(1000000,2000000)
%timeit a+b

1000 loops, best of 3: 1.56 ms per loop
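
`%timeit` is an IPython magic, so it only works inside a notebook. Outside a notebook, the same comparison could be sketched with the standard-library `timeit` module. The array size and the helper name `add_lists` are choices made for this sketch, and the exact timings will differ per machine.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the cells above so the sketch runs quickly

list_a = list(range(n))
list_b = list(range(n, 2 * n))
arr_a = np.arange(n)
arr_b = np.arange(n, 2 * n)

def add_lists(a, b):
    # element-by-element addition in pure Python
    return [x + y for x, y in zip(a, b)]

# both approaches must give the same numbers
assert add_lists(list_a, list_b) == (arr_a + arr_b).tolist()

# total time for 10 repetitions of each approach
loop_time = timeit.timeit(lambda: add_lists(list_a, list_b), number=10)
numpy_time = timeit.timeit(lambda: arr_a + arr_b, number=10)
print(f"python loop: {loop_time:.4f} s, numpy: {numpy_time:.4f} s")
```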


## Reason 2: Applying NumPy functions to your pandas DataFrame

Given that pandas DataFrames are built on top of NumPy arrays, you can apply NumPy functions to them, for example `df.apply(np.mean)`:


In [0]:
df.apply(np.mean)

x    2.0
y    5.0
dtype: float64
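
Element-wise NumPy functions can also be called on a DataFrame or a single column directly, without `.apply`. A minimal sketch, reusing the same small `df` as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(x=[1, 2, 3], y=[4, 5, 6]))

squared = np.square(df)   # element-wise, returns a DataFrame with the same labels
log_x = np.log(df["x"])   # works on a single column (a Series) too
col_means = df.mean()     # gives the same result here as df.apply(np.mean)

print(squared)
print(log_x)
print(col_means)
```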

## Reason 3: PyTorch tensors work much like NumPy arrays

- If you want to get into deep learning with PyTorch and you already know NumPy, your learning curve will be easier:
    - PyTorch tensors behave much like NumPy arrays, but they are easy to move to the GPU and can record the operations performed on them so that autograd can compute gradients.
    - Most of the same operations (reshape, squeeze, etc.) work with the same or very similar syntax on torch tensors.

# Defining NumPy arrays:

In [0]:
np.array([1, 2, 3, 4]) # construct from a list

array([1, 2, 3, 4])

In [0]:
np.ones((3,3)) # 3x3 matrix of 1s

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [0]:
np.zeros((4,4,4)) # 4x4x4 array of 0s

array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]])

In [0]:
np.eye(5) #identity matrix

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [0]:
np.arange(2,10,2) # numpy implementation of python range()

array([2, 4, 6, 8])

In [0]:
np.full((5,5), 7) # fill a 5x5 matrix with a given value (7 in this case)

array([[7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7]])
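
The outputs above show that `np.ones`, `np.zeros` and `np.eye` produce floats (`1.`, `0.`) while `np.arange` and `np.full` with an integer produce integers. A short sketch of how the data type can be controlled, plus two more constructors that often come in handy:

```python
import numpy as np

ones_float = np.ones((2, 2))            # default dtype is float64
ones_int = np.ones((2, 2), dtype=int)   # ask for integers explicitly
evenly_spaced = np.linspace(0, 1, 5)    # 5 points from 0 to 1, endpoints included
same_shape = np.zeros_like(ones_int)    # zeros with the shape and dtype of another array

print(ones_float.dtype, ones_int.dtype)
print(evenly_spaced)
print(same_shape)
```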

# Size of array

In [0]:
a = np.ones((3,4))
print(f"shape: {a.shape}")
print(f"size: {a.size}")

shape: (3, 4)
size: 12


# Reshaping arrays

In [0]:
np.reshape(a,(2,6))
b = np.reshape(np.arange(0, 12), (3, 4))
print("\n")
b

array([[1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.]])





array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [0]:
b.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
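
Two details about reshaping that are easy to miss, sketched below: passing `-1` lets NumPy work out one dimension by itself, and `flatten()` always copies the data whereas `ravel()` returns a view when it can.

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

two_cols = b.reshape(-1, 2)   # -1 lets NumPy infer that dimension (6 here)
print(two_cols.shape)         # (6, 2)

flat_copy = b.flatten()       # flatten always returns a copy
flat_copy[0] = 99
print(b[0, 0])                # still 0, b is unaffected

flat_view = b.ravel()         # ravel returns a view when it can
flat_view[0] = 99
print(b[0, 0])                # now 99, because the view shares memory with b
```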

In [0]:
a
print("\n")
np.transpose(a)

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])





array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

# Adding "empty" dimensions

- Sometimes it's useful to turn a vector of length n into a 1xn matrix.
- Or a 2x2 matrix into a 1x2x2 three-dimensional array.

In [0]:
a = np.array([1,2,3])
b = np.expand_dims(a,axis=0)
c = np.expand_dims(b,axis=0)

a
print("\n")
a.shape
print("\n")

b
print("\n")
b.shape
print("\n")

c
print("\n")
c.shape

array([1, 2, 3])





(3,)





array([[1, 2, 3]])





(1, 3)





array([[[1, 2, 3]]])





(1, 1, 3)
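
The same thing is often written with `np.newaxis` inside an index expression. A minimal sketch showing that it matches `np.expand_dims`:

```python
import numpy as np

a = np.array([1, 2, 3])

row = a[np.newaxis, :]   # same result as np.expand_dims(a, axis=0)
col = a[:, np.newaxis]   # same result as np.expand_dims(a, axis=1)

print(row.shape)   # (1, 3)
print(col.shape)   # (3, 1)
```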

## Removing empty dimensions
- Sometimes you would like to get rid of these empty dimensions instead.
- For example, turn a 1xn matrix into a vector of length n.

In [0]:
np.squeeze(c, axis=0)
print("\n")
np.squeeze(np.squeeze(c, axis=0), axis=0)

array([[1, 2, 3]])





array([1, 2, 3])
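
Without an `axis` argument, `np.squeeze` drops all length-1 dimensions in one go. Asking it to squeeze a dimension that is longer than 1 raises an error. A small sketch (the flag `error_raised` exists only for this illustration):

```python
import numpy as np

c = np.array([[[1, 2, 3]]])   # shape (1, 1, 3)

squeezed = np.squeeze(c)      # no axis: drop every length-1 dimension
print(squeezed.shape)         # (3,)

error_raised = False
try:
    np.squeeze(c, axis=2)     # axis 2 has length 3, so it cannot be squeezed
except ValueError as err:
    error_raised = True
    print("ValueError:", err)
```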

# Stacking


In [0]:
a = np.arange(0,5)
b = np.arange(5,10)

np.vstack((a,b))
print("\n")
np.hstack((a,b))

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])





array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
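
`np.vstack` and `np.hstack` are convenience wrappers. The more general functions are `np.concatenate`, which joins along an existing axis, and `np.stack`, which creates a new axis. A sketch with the same `a` and `b`:

```python
import numpy as np

a = np.arange(0, 5)
b = np.arange(5, 10)

joined = np.concatenate((a, b))      # for 1D inputs: same as np.hstack((a, b))
as_rows = np.stack((a, b), axis=0)   # new leading axis: same as np.vstack((a, b)) here
as_cols = np.stack((a, b), axis=1)   # new trailing axis: a and b become columns

print(joined.shape)    # (10,)
print(as_rows.shape)   # (2, 5)
print(as_cols.shape)   # (5, 2)
```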

# Indexing

Indexing can be done with slicing, just like with Python lists:

In [0]:
a = np.array([1,2,3,4,5,6])
a[1:3]
print("\n")
a[-4:-1]
print("\n")
a[1:6:2]

array([2, 3])





array([3, 4, 5])





array([2, 4, 6])

In [0]:
a = np.reshape(np.arange(0, 12), (3,4))
a

print("\nfirst row:")
a[0, :]
print("\nsecond column:")
a[:, 1]
print("\nexcluding first row and first column:")
a[1:, 1:]

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])


first row:


array([0, 1, 2, 3])


second column:


array([1, 5, 9])


excluding first row and first column:


array([[ 5,  6,  7],
       [ 9, 10, 11]])
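
One difference from Python lists is worth knowing: slicing a list makes a new list, but slicing a NumPy array gives a view on the same data, so writing to the slice changes the original. A small sketch:

```python
import numpy as np

py_list = [1, 2, 3, 4]
list_slice = py_list[:2]   # slicing a list makes a new list
list_slice[0] = 100
print(py_list)             # [1, 2, 3, 4], unchanged

arr = np.array([1, 2, 3, 4])
arr_slice = arr[:2]        # slicing an array gives a view on the same memory
arr_slice[0] = 100
print(arr)                 # the first element of arr is now 100

safe = arr[:2].copy()      # take an explicit copy when you need independence
safe[1] = -1
print(arr[1])              # still 2
```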

## Boolean indexing:


In [0]:
a = np.array([1,2,3,4,5,6])
a[[True, False, False, True, True, True]] 

array([1, 4, 5, 6])

## Using boolean indexing to select on a condition:

In [0]:
a > 3
print("\n")
a[a > 3]

array([False, False, False,  True,  True,  True])





array([4, 5, 6])
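
Conditions can be combined, but with the element-wise operators `&`, `|` and `~` rather than Python's `and`, `or` and `not`, and with parentheses around each condition. A short sketch, which also shows `np.where` for choosing between two values per element:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

between = a[(a > 2) & (a < 6)]    # "and": use &, with parentheses around each condition
extremes = a[(a < 2) | (a > 5)]   # "or": use |
not_even = a[~(a % 2 == 0)]       # "not": use ~
capped = np.where(a > 3, 3, a)    # 3 where the condition holds, else the original value

print(between)    # [3 4 5]
print(extremes)   # [1 6]
print(not_even)   # [1 3 5]
print(capped)     # [1 2 3 3 3 3]
```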

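# Broadcasting

When performing element-wise operations on arrays of different shapes, NumPy performs broadcasting: it expands dimensions as necessary.

For an element-wise operation on two arrays, the shapes are compared dimension by dimension, starting from the last one. Each pair of dimensions has to be either equal, or one of the two has to be 1. If one array has fewer dimensions, its missing leading dimensions are treated as 1. Where a dimension is 1, the values are repeated along that dimension to match the other array.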

All this will be clearer with some examples:

In [0]:
a = np.reshape(np.arange(0, 12), (3, 4))
a
print("\n")
a.shape

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])





(3, 4)

In [0]:
b = np.array([2])
b.shape
print("\n")
a + b

(1,)





array([[ 2,  3,  4,  5],
       [ 6,  7,  8,  9],
       [10, 11, 12, 13]])

In [0]:
b = np.array([[2], [4], [6]])
b
print("\n")
b.shape
print("\n")
a + b

array([[2],
       [4],
       [6]])





(3, 1)





array([[ 2,  3,  4,  5],
       [ 8,  9, 10, 11],
       [14, 15, 16, 17]])

In [0]:
b =  np.array([[2, 4, 6, 8]])
b
print("\n")
b.shape
print("\n")
a + b

array([[2, 4, 6, 8]])





(1, 4)





array([[ 2,  5,  8, 11],
       [ 6,  9, 12, 15],
       [10, 13, 16, 19]])

## Broadcasting with 3d arrays:


In [0]:
a = np.reshape(np.arange(0, 24), (3, 4, 2))
a
print("\n")
a.shape

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15]],

       [[16, 17],
        [18, 19],
        [20, 21],
        [22, 23]]])





(3, 4, 2)

In [0]:
b = np.reshape(np.arange(0, 4), (1, 4, 1))
b
print("\n")
b.shape

array([[[0],
        [1],
        [2],
        [3]]])





(1, 4, 1)

In [0]:
a + b

array([[[ 0,  1],
        [ 3,  4],
        [ 6,  7],
        [ 9, 10]],

       [[ 8,  9],
        [11, 12],
        [14, 15],
        [17, 18]],

       [[16, 17],
        [19, 20],
        [22, 23],
        [25, 26]]])
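
To round off the broadcasting examples, here is a sketch of a typical practical use (subtracting each column's mean) and of what happens when shapes do not line up. The flag `broadcast_failed` exists only for this illustration.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Shapes are compared from the last dimension backwards:
# (3, 4) with (4,)   -> compatible, result (3, 4)
# (3, 4) with (3, 1) -> compatible, result (3, 4)
# (3, 4) with (3,)   -> 4 vs 3 in the last dimension, not compatible
row = np.array([10, 20, 30, 40])   # shape (4,)
print((a + row).shape)             # (3, 4)

col_means = a.mean(axis=0)         # shape (4,), one mean per column
centered = a - col_means           # subtracts each column's mean from that column
print(centered.mean(axis=0))       # every column now has mean 0

try:
    a + np.array([1, 2, 3])        # shape (3,) does not line up with (3, 4)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)            # True
```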

# NumPy functions

Just a small taste. Check the NumPy documentation for all the cool stuff NumPy can do: https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs

(For example, the `np.fft` submodule covers discrete Fourier transforms.)



## basic functions

In [0]:
a = np.arange(5,15,2)
np.mean(a)
print("\n")
np.std(a)
print("\n")
np.median(a)
print("\n")

9.0





2.8284271247461903





9.0
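
One thing to watch out for: `np.std` divides by n by default (`ddof=0`, the population formula), while pandas' `.std()` divides by n - 1 by default (the sample formula). A small sketch of the difference on the same array:

```python
import numpy as np
import pandas as pd

a = np.arange(5, 15, 2)           # [ 5  7  9 11 13]

pop_std = np.std(a)               # ddof=0: divide by n (population formula)
sample_std = np.std(a, ddof=1)    # ddof=1: divide by n - 1 (sample formula)
pandas_std = pd.Series(a).std()   # pandas uses ddof=1 by default

print(pop_std)      # sqrt(40 / 5), about 2.83
print(sample_std)   # sqrt(40 / 4), about 3.16
print(pandas_std)   # matches sample_std
```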





## trigonometry

In [0]:
theta = np.linspace(0, np.pi, 4)

print("theta      = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))

theta      =  [0.         1.04719755 2.0943951  3.14159265]
sin(theta) =  [0.00000000e+00 8.66025404e-01 8.66025404e-01 1.22464680e-16]
cos(theta) =  [ 1.   0.5 -0.5 -1. ]
tan(theta) =  [ 0.00000000e+00  1.73205081e+00 -1.73205081e+00 -1.22464680e-16]
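
Note that `sin(pi)` comes out as `1.22464680e-16` rather than exactly 0, because of floating-point rounding. When comparing floats, `np.isclose` and `np.allclose` are therefore safer than `==`. A short sketch:

```python
import numpy as np

theta = np.linspace(0, np.pi, 4)
sin_theta = np.sin(theta)

print(sin_theta[-1] == 0)             # False: sin(pi) comes out as about 1.2e-16
print(np.isclose(sin_theta[-1], 0))   # True: equal within floating-point tolerance

degrees = np.array([0, 60, 120, 180])
print(np.allclose(np.deg2rad(degrees), theta))   # True: same angles, given in degrees
```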


## exponentials

In [0]:
x = [1, 2, 3]
print("x     =", x)
print("e^x   =", np.exp(x))
print("2^x   =", np.exp2(x))
print("3^x   =", np.power(3, x))

x     = [1, 2, 3]
e^x   = [ 2.71828183  7.3890561  20.08553692]
2^x   = [2. 4. 8.]
3^x   = [ 3  9 27]


## logs:


In [0]:
x = [1, 2, 4, 10]
print("x        =", x)
print("ln(x)    =", np.log(x))
print("log2(x)  =", np.log2(x))
print("log10(x) =", np.log10(x))

x        = [1, 2, 4, 10]
ln(x)    = [0.         0.69314718 1.38629436 2.30258509]
log2(x)  = [0.         1.         2.         3.32192809]
log10(x) = [0.         0.30103    0.60205999 1.        ]


## applying functions across a certain axis:

In [86]:
a = np.array([[1,2], [3,4]])
a
print("\n")
np.min(a)
print("\n")
np.min(a,axis=0)
print("\n")
np.max(a,axis=1)

array([[1, 2],
       [3, 4]])





1





array([1, 2])





array([2, 4])
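
In the cell above, `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row). The collapsed axis disappears from the result, which can get in the way of broadcasting the result back against the original array. A sketch of how `keepdims=True` helps, using row sums as the example:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

row_sums = a.sum(axis=1)                     # shape (2,): axis 1 is collapsed
row_sums_2d = a.sum(axis=1, keepdims=True)   # shape (2, 1): collapsed axis kept with length 1

# the (2, 1) shape broadcasts against a, so each row is divided by its own sum
row_fractions = a / row_sums_2d
print(row_fractions)
print(row_fractions.sum(axis=1))             # each row sums to 1
```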

## Sorting:

In [0]:
a = np.array([1,4,2,5,3,6,8,7,9])
np.sort(a, kind='quicksort')

array([1, 2, 3, 4, 5, 6, 7, 8, 9])
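
Often you do not want the sorted values themselves but the order, for example to sort a second array the same way. `np.argsort` returns the indices that would sort the array. A sketch with made-up scores and hypothetical labels:

```python
import numpy as np

scores = np.array([70, 95, 82])
names = np.array(["ann", "bob", "cat"])   # hypothetical labels for illustration

order = np.argsort(scores)           # indices that would sort scores ascending
print(order)                         # [0 2 1]
print(names[order])                  # names ordered by ascending score

descending = np.sort(scores)[::-1]   # reverse the sorted array for descending order
print(descending)                    # [95 82 70]
```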

## Keep applying an operation until only one element is left: reduce

NumPy universal functions (ufuncs) such as `np.add` have methods like `.reduce()` and `.accumulate()`.

In [0]:
a = np.arange(1, 5)
a
np.add.reduce(a)

array([1, 2, 3, 4])

10

## Store the intermediate steps: accumulate

In [0]:
np.add.accumulate(a)

array([ 1,  3,  6, 10])
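
The same two methods exist on other ufuncs as well, and the common cases have shorter names (`np.sum`, `np.prod`, `np.cumsum`). A small sketch:

```python
import numpy as np

a = np.arange(1, 5)                    # [1 2 3 4]

total = np.add.reduce(a)               # 1 + 2 + 3 + 4, same as np.sum(a)
product = np.multiply.reduce(a)        # 1 * 2 * 3 * 4, same as np.prod(a)
running_sum = np.add.accumulate(a)     # same as np.cumsum(a)
running_max = np.maximum.accumulate(np.array([3, 1, 4, 1, 5]))   # largest value seen so far

print(total, product)   # 10 24
print(running_sum)      # [ 1  3  6 10]
print(running_max)      # [3 3 4 4 5]
```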

# Random sampling

NumPy also has a substantial random sampling submodule, `np.random`.

Documentation:

https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html

Some examples:

In [0]:
np.random.seed(42) # set the seed so that your random numbers are reproducible

In [81]:
np.random.rand(5)

array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864])

In [82]:
np.random.choice(np.arange(0, 10))

2

In [83]:
np.random.permutation(np.arange(0, 10))

array([0, 5, 9, 1, 8, 2, 3, 4, 7, 6])

In [79]:
np.random.beta(10, 4) # draw from a beta distribution with parameters a=10 and b=4

0.6669335944569289
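
The cells above use the global-state functions (`np.random.seed`, `np.random.rand`, ...). NumPy 1.17 and later also offer the `Generator` interface via `np.random.default_rng`, which keeps the seed in an object instead of in global state. A sketch of the same kinds of draws, assuming such a NumPy version is installed (the numbers will differ from the outputs above, since it is a different generator):

```python
import numpy as np

rng = np.random.default_rng(42)   # a Generator with its own seed, no global state

uniform = rng.random(5)                        # 5 floats in [0, 1)
pick = rng.choice(np.arange(0, 10))            # one element chosen at random
shuffled = rng.permutation(np.arange(0, 10))   # a shuffled copy of 0..9
beta_draw = rng.beta(10, 4)                    # beta distribution with a=10, b=4

# a second generator with the same seed reproduces the same numbers
rng_again = np.random.default_rng(42)
print(np.array_equal(uniform, rng_again.random(5)))   # True
```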