# NumPy: Numerical Python

### Karl N. Kirschner

NumPy is the foundation for:
- Pandas
- Matplotlib
- Scikit-learn
- PyTorch


- Excels at large **arrays** of data


- Array: an n-dimensional array (i.e. ndarray):
    - a collection of values that has 1 or more dimensions
    - 1D array --> vector
    - 2D array --> matrix (arrays with more dimensions generalize this idea)


- All array data must be of the same type (i.e. homogeneous)


- Can perform computations on entire arrays without the need of loops


- Does not come by default with Python - must be installed

numpy:
https://numpy.org/doc/stable/

---

Comparisons to a regular list:
1. Both are containers for items/elements
2. NumPy arrays allow faster access to items/elements (which allows for faster mathematics), but
3. Lists are faster at inserting new and removing existing items/elements

---

#### Key Concept for Numpy
1. Each element in an array must be the same type (e.g. floats)


2. **Vectorizing operations**<br>
    "This practice of replacing explicit loops with array expressions is commonly referred to as vectorization."
    - source: https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html


3. Integrates with C, C++ and Fortran to improve performance
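
As a minimal sketch of the second concept (the variable names are illustrative and not part of the lecture), the explicit loop and the array expression below produce the same values:

```python
import numpy as np

values = np.arange(5)

## explicit Python loop over the elements
doubled_loop = np.array([value * 2 for value in values])

## vectorized: a single array expression, no explicit loop
doubled_vectorized = values * 2

print(doubled_loop)
print(doubled_vectorized)
```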

---

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
import timeit

#%matplotlib inline

## N-dimensional array object (i.e. ndarray)

Let's create two objects:
1. a regular list, and
2. a NumPy array (via Array RANGE: https://numpy.org/doc/stable/reference/generated/numpy.arange.html)


Then we can demonstrate which is faster using the time and timeit libraries.

- timeit (to time code for performance): https://docs.python.org/3/library/timeit.html

In [None]:
my_list = list(range(100000))

In [None]:
my_array = np.arange(100000)
my_array

Now, let's multiply each container by 2, and do that math 10000 times.

In [None]:
def numpy_multiply(test_array=None):
    return test_array*2


def list_multiply(test_list=None):
    ## note: test_list*2 would repeat the list, not multiply its elements
    return [item*2 for item in test_list]

In [None]:
start_time = time.process_time()

for _ in range(10000):
    numpy_multiply(my_array)

stop_time = time.process_time()
print(f"Timing: {stop_time - start_time:0.1e} seconds")

In [None]:
start_time = time.process_time()

for _ in range(10000):
    list_multiply(my_list)

stop_time = time.process_time()
print(f"Timing: {stop_time - start_time:0.1e} seconds")

#### timeit

A very good alternative library for testing performance: timeit

Multiply each container by 2, and do that math 10000 times

In [None]:
timeit.timeit(lambda:numpy_multiply(my_array), number=10000)

In [None]:
timeit.timeit(lambda:list_multiply(my_list), number=10000)
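
To also repeat the whole measurement several times, timeit.repeat can be used. The following is a minimal sketch; the array name, the number of runs (1000) and the number of repeats (5) are illustrative choices:

```python
import timeit

import numpy as np

test_array = np.arange(100000)

## each measurement runs the multiplication 1000 times; take 5 measurements
timings = timeit.repeat(lambda: test_array * 2, number=1000, repeat=5)

## the minimum is usually the value least disturbed by other processes
print(f"Best of {len(timings)}: {min(timings):0.1e} seconds")
```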

---
## General information

In [None]:
## Two data lists with 5 data points
data_1 = [6, 1, 6, 7, 9]
data_2 = [3, 5, 4, 2, 8]

In [None]:
data_1

In [None]:
data_2

In [None]:
## Two different arrays, each with a shape of (5,)
array_1 = np.array(data_1)
array_2 = np.array(data_2)

In [None]:
array_1

In [None]:
array_2

In [None]:
## A nested list, with each sublist containing 5 data points
data_3 = [[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]]

## One array, with a shape of (2,5)
array_3 = np.array(data_3)

array_3

Commit array_3 to memory - we will use it a lot later on in the lecture.

#### Array shapes and dimensions

In [None]:
## 1D shape
array_1.shape

In [None]:
## Note, this would change if you added double brackets above
##     data_1 = [[6, 1, 6, 7, 9]]
##     array_1 = np.array(data_1)
example = [[6, 1, 6, 7, 9]]
test = np.array(example)
test.shape
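
An existing 1D array can also be given an extra dimension without rebuilding it from a nested list. A minimal sketch, with illustrative variable names:

```python
import numpy as np

vector = np.array([6, 1, 6, 7, 9])
print(vector.shape)

## -1 lets NumPy infer the length of that dimension
row = vector.reshape(1, -1)
print(row.shape)

## np.newaxis inserts a new dimension of length 1
column = vector[:, np.newaxis]
print(column.shape)
```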

In [None]:
## nD shape
array_3.shape

In [None]:
array_3.ndim

Reminder of using type to figure out what you are dealing with.

In [None]:
type(array_3)

#### Data types

- https://numpy.org/doc/stable/reference/arrays.dtypes.html
- https://numpy.org/doc/stable/reference/generated/numpy.dtype.html?highlight=dtype#numpy.dtype

In [None]:
array_3.dtype
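
Since every element must share one type, NumPy picks a common dtype when the input is mixed. The sketch below (illustrative names) shows that behavior and two ways to set the dtype explicitly:

```python
import numpy as np

integers = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])  ## the integers are upcast to floats

print(integers.dtype)  ## a platform-dependent integer type
print(mixed.dtype)

## convert an existing array, or request a dtype at creation
as_float = integers.astype(np.float64)
explicit = np.array([1, 2, 3], dtype=np.float32)
print(as_float.dtype, explicit.dtype)
```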

---
## More on creating new arrays

Create an array with a shape of (3,5), and fill it with ca. pi

In [None]:
np.full((3, 5), 3.14)

Create an array with a shape of (31,) from -10 to 50 (inclusive) using a step size of 2

In [None]:
np.arange(-10, 52, 2)

Create an array that contains 10 evenly spaced values between -1 and 1
- numpy's linspace: https://numpy.org/devdocs/reference/generated/numpy.linspace.html

In [None]:
np.linspace(-1, 1, 10)

Create an array with random, continuously (uniformly) distributed values in the half-open interval [0, 1)
- random.random_sample function: https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_sample.html#numpy.random.random_sample

In [None]:
np.random.random_sample(3)

In [None]:
## shape of (3,4)
np.random.random_sample((3, 4))

---
## Accessing arrays

In [None]:
## 1D array (i.e. shape (5,) from above)
array_1

In [None]:
## accessing the fourth item position (i.e. index of 3)
array_1[3]

In [None]:
## 2D array (i.e. (2,5) from above)
array_3

In [None]:
## access the second sublist from the 2D array
array_3[1]

In [None]:
## access the second sublist (index 1) and its fifth item position (index 4)
array_3[1, 4]

Slicing - demo using [0:1], [1:2], [0:2] and [0:3]

In [None]:
array_3[0:1]

In [None]:
array_3[1:2]

In [None]:
array_3[0:2]

In [None]:
array_3[0:3]
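
Slices can also be applied per axis. A minimal sketch using the same values as array_3 (the result names are illustrative):

```python
import numpy as np

array_3 = np.array([[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]])

## all rows, columns with index 1 and 2
middle_columns = array_3[:, 1:3]
print(middle_columns)

## negative indices count from the end: last row, last column
last_item = array_3[-1, -1]
print(last_item)

## a slice that runs past the last row is silently clipped
beyond_end = array_3[0:3]
print(beyond_end.shape)
```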

---
## Searching for elements

- NumPy arrays are not searched like a list, so the more typical list methods (e.g. list.index) are not available here
- numpy.where is used instead: https://numpy.org/doc/stable/reference/generated/numpy.where.html

In [None]:
array_3

In [None]:
elements = np.where(array_3 < 0)
array_3[elements]
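
The same selection can be written with a boolean mask, and np.where with three arguments replaces values instead of returning indices. A minimal sketch (the replacement value 0 is an arbitrary choice for illustration):

```python
import numpy as np

array_3 = np.array([[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]])

## boolean mask: True wherever the condition holds
mask = array_3 < 0
negatives = array_3[mask]
print(negatives)

## np.where(condition, x, y): take x where True, otherwise y
clipped = np.where(mask, 0, array_3)
print(clipped)
```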

---
## Joining arrays

In [None]:
array_1

In [None]:
array_2

**Concatenate**: https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html

Multiple 1D arrays

In [None]:
np.concatenate([array_1, array_2, array_1])

Multiple nD arrays, along their first axis (i.e. axis=0) - conceptually, this is like adding items to columns in a table

(or as an alternative perspective - adding additional rows)

In [None]:
array_3

In [None]:
array_big = np.concatenate([array_3, array_3, array_3], axis=0)
array_big

In [None]:
array_big.shape

In [None]:
## Use Pandas to print out the table in a more human (i.e. a scientist) readable form
print(pd.DataFrame(array_big))

Multiple nD arrays, along their second axis (i.e. axis=1) - conceptually, this is like adding items to rows in a table (i.e. adding additional columns)

In [None]:
## Use Pandas
print(pd.DataFrame(array_3))

In [None]:
## Multiple nD arrays, along their second axis
array_long = np.concatenate([array_3, array_3, array_3], axis=1)
array_long

In [None]:
array_long.shape

In [None]:
## Use Pandas
print(pd.DataFrame(array_long))
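
As a rule of thumb, the lengths along the concatenation axis add up, while all other axes must match. A minimal sketch with illustrative arrays:

```python
import numpy as np

array_3 = np.array([[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]])  ## shape (2, 5)
extra_row = np.array([[0, 0, 0, 0, 0]])                   ## shape (1, 5)

## axis=0: row counts add up (2 + 1), column count must match
more_rows = np.concatenate([array_3, extra_row], axis=0)
print(more_rows.shape)

## axis=1: column counts add up (5 + 5), row count must match
more_columns = np.concatenate([array_3, array_3], axis=1)
print(more_columns.shape)
```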

#### Multiple mixed dimensional arrays
- must pay attention to the dimensions


- vertically stacked
- horizontally stacked

Vertically stacked
- nD arrays must be (x,N) and (y,N) where N is the same value

In [None]:
print(array_3)
print()
print(array_3.shape)

In [None]:
print(array_big)
print()
print(array_big.shape)

In [None]:
array_vstack = np.vstack([array_3, array_big])
array_vstack

In [None]:
array_vstack.shape

Now logically, we can also do this with our array_1 (i.e. a shape of (5,)) that we created above

In [None]:
## review what we have for array_1
print(array_1)
print()
print(array_1.shape)

In [None]:
array_vstack = np.vstack([array_1, array_3])
array_vstack

In [None]:
array_vstack.shape

Now, let's show when it doesn't work:

In [None]:
## example of when the arrays have different N values
array_4 = np.array([[99, 99, 99, 99]])
print(array_4)
print()
print(array_4.shape)

In [None]:
np.vstack([array_4, array_3])

Horizontally stacked
- nD arrays must be (N,x) and (N,y) where N is the same value

Using our examples, we need a new array that has (2,x) values since array_3 is (2,y)

In [None]:
array_4 = np.array([[99], [99]])
print(array_4)
print()
print(array_4.shape)

In [None]:
array_hstack = np.hstack([array_4, array_3])
array_hstack

In [None]:
array_hstack.shape

Now, let's show when it doesn't work:

In [None]:
array_big

In [None]:
array_big.shape

In [None]:
array_hstack = np.hstack([array_4, array_big])
array_hstack
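
Mismatched shapes raise a ValueError, which can be caught when the shapes are not known in advance. A minimal sketch; the helper try_stack is hypothetical and only for illustration:

```python
import numpy as np


def try_stack(stack_function, arrays):
    """Return the stacked array, or None if the shapes do not fit."""
    try:
        return stack_function(arrays)
    except ValueError as error:
        print(f"Cannot stack: {error}")
        return None


array_3 = np.array([[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]])  ## shape (2, 5)
row_of_four = np.array([[99, 99, 99, 99]])                ## shape (1, 4)
column_of_two = np.array([[99], [99]])                    ## shape (2, 1)

## fails: 4 columns vs 5 columns
failed = try_stack(np.vstack, [row_of_four, array_3])

## works: both have 2 rows
worked = try_stack(np.hstack, [column_of_two, array_3])
print(worked.shape)
```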

---
## Math with ndarrays
- np.add and np.subtract
- np.multiply and np.divide
- np.power
- np.negative (multiplies x by -1)

In [None]:
array_3

In [None]:
## Method 1 (numpy function)
np.add(array_3, 5)

In [None]:
## Method 2 (Python operator)
array_3 + 5

Note
- Using the NumPy function vs the Python operator doesn't matter too much here, since in both cases the action is performed on a NumPy array

In [None]:
timeit.timeit(lambda:np.add(array_3, 5), number=10000)

In [None]:
timeit.timeit(lambda:array_3 + 5, number=10000)

### Math between arrays
- math operations between equal-sized arrays are done element-wise

In [None]:
## add and subtract
array_3 + array_3

In [None]:
array_3 - array_3

In [None]:
## multiplication
array_3 * array_3

In [None]:
data_4 = [[-1, -1, -1, -1, -1], [-1, -1, -1, -1, -1]]
array_4 = np.array(data_4)

In [None]:
array_3 * array_4

In [None]:
## division
## (array_3 contains a 0, so expect a divide-by-zero RuntimeWarning and an inf value)
1/array_3
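
Because array_3 contains a zero, one possible way to avoid the infinite value is to divide only where the divisor is non-zero. This is a minimal sketch; filling the skipped position with 0.0 is an arbitrary assumption made for illustration:

```python
import numpy as np

array_3 = np.array([[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]])

## divide only where array_3 is non-zero; other positions keep the
## value already present in the output array (here 0.0)
safe_inverse = np.divide(
    1.0,
    array_3,
    out=np.zeros(array_3.shape, dtype=float),
    where=array_3 != 0,
)
print(safe_inverse)
```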

In [None]:
## powers
array_3**3

### Absolute values

In [None]:
## Python3 built-in function
abs(array_3)

In [None]:
np.absolute(array_3)

### Booleans

In [None]:
print(array_3 == -6)

### Trigonometric
- np.sin()
- np.cos()
- np.arcsin()
- etc.

In [None]:
## trig on a single input value
np.sin(-6)

In [None]:
## trig on a NumPy array
np.sin(array_3)

### Exponents and logarithms

In [None]:
## Note that the elements of the resulting arrays aren't separated by commas
##    (as seen above) due to the print statement
x = np.array([2, 3])
print("x     =", x)
print("2^x   =", np.exp2(x))
print("10^x  =", np.power(10, x))
print("e^x   =", np.exp(x))

In [None]:
x = [4., 8.]
print("log2(x)  =", np.log2(x))

x = [100., 1000.]
print("log10(x) =", np.log10(x))

x = [7.3890561, 20.08553692]
print("ln(x)    =", np.log(x))

---
### A more complex example

In [None]:
## Celsius to Fahrenheit
## Average temperature in Bonn (January ---> December)
data_celcius = [2.0, 2.8, 5.7, 9.3, 13.3, 16.5, 18.1, 17.6, 14.9, 10.5, 6.1, 3.2]
array_celcius = np.array(data_celcius)
array_celcius

In [None]:
array_fahrenheit = array_celcius*(9/5) + 32

In [None]:
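## plot both temperature scales against the month index (0 = January)
plt.plot(array_celcius, label="Celsius")
plt.plot(array_fahrenheit, label="Fahrenheit")
plt.legend()
plt.show()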

---
## Numpy statistics

#### Side note: numpy's random number generators
- generators (e.g. normal/gaussian, geometric, binomial) https://numpy.org/doc/1.18/reference/random/generator.html


Example:
What is the random distribution of 10 attempts that have a success probability of 60%, where the distribution itself is governed by a geometric distribution?


In [None]:
## each value is the number of trials needed to get the first success
## (a 1 means success on the first trial)
s = np.random.geometric(0.60, size=10)
s

Filling a numpy array with random numbers
- Create an array (3,3) of Gaussian distributed random values: mean=0.0 and standard deviation=0.1

In [None]:
random_data = np.random.normal(0, 0.1, (3, 3))
random_data
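
The generator documentation linked above describes a newer interface, np.random.default_rng. A minimal sketch of the same Gaussian array using it; the seed value 42 is an arbitrary choice that makes the result reproducible:

```python
import numpy as np

## a seeded generator gives the same numbers on every run
rng = np.random.default_rng(seed=42)
seeded_data = rng.normal(loc=0.0, scale=0.1, size=(3, 3))
print(seeded_data)

## a second generator with the same seed reproduces the values
rng_again = np.random.default_rng(seed=42)
seeded_data_again = rng_again.normal(loc=0.0, scale=0.1, size=(3, 3))
```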

In [None]:
plt.plot(random_data)
plt.show()

In [None]:
np.mean(random_data)

In [None]:
np.median(random_data)

**Side note**: what happens if we convert the numpy array to a list using the array's tolist() method?

In [None]:
import statistics

random_data_list = random_data.tolist()
print(random_data_list)

statistics.median(random_data_list)

The nested list is treated as three items (the sublists), so the "median" is a whole sublist. To do this like numpy, we must flatten the array first, and then call tolist:

In [None]:
random_data_flattened_list = random_data.flatten().tolist()
random_data_flattened_list

In [None]:
statistics.median(random_data_flattened_list)
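
The difference can be seen more easily with fixed values. In this minimal sketch (illustrative data), np.median flattens by default and also accepts an axis argument, whereas statistics.median only sees the outer list:

```python
import statistics

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

overall = np.median(data)             ## over all nine values
per_column = np.median(data, axis=0)  ## one median per column
per_row = np.median(data, axis=1)     ## one median per row
print(overall, per_column, per_row)

## the nested list has three items, so the middle sublist is returned
nested = statistics.median(data.tolist())
flat = statistics.median(data.flatten().tolist())
print(nested, flat)
```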

---
## Standard deviation and variance

In [None]:
data = [1, 2, 4, 5, 8]

#### variance
- LibreOffice spreadsheet gives a variance of 7.5 for '=VAR(1,2,4,5,8)'
- I believe Matlab also gives 7.5

Using the statistics library

In [None]:
statistics.variance(data)

All of these are actually the 'sample variance.'

However, if you use NumPy by simply typing:

In [None]:
np.var(data)

In this case there is a parameter called ddof ("Delta Degrees of Freedom") that defaults to 0
    - the divisor used in the calculation is 'N - ddof'

ddof = 0 gives you the population variance, while ddof = 1 gives you the sample variance

In [None]:
## population variance
np.var(data, ddof=0)

In [None]:
## sample variance (larger than the population variance unless all values are identical)
np.var(data, ddof=1)

https://numpy.org/doc/1.18/reference/generated/numpy.var.html?highlight=variance

- sample: "ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population"
- population: "ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables"
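
The two values can be checked by hand. The sketch below (illustrative variable names) computes the sum of squared deviations once and divides it by N and by N - 1:

```python
import numpy as np

data = [1, 2, 4, 5, 8]

n = len(data)
mean = sum(data) / n
sum_of_squares = sum((value - mean) ** 2 for value in data)

## divisor N - ddof with ddof=0 and ddof=1
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

print(population_variance, np.var(data, ddof=0))
print(sample_variance, np.var(data, ddof=1))
```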

The same is true for standard deviation.

In [None]:
## standard deviation
## libreoffice gives '=stdev(1,2,4,5,8)' of 2.7386127875

statistics.stdev(data)

In [None]:
## sample standard deviation
np.std(data, ddof=1)

In [None]:
## population standard deviation
np.std(data, ddof=0)

**Take home message**: you should always take a look at NumPy's manual to make sure you are doing what you think you are doing -- keep an eye out for default settings (e.g. ddof=0).

---

Additional resource to further learn and test your knowledge: https://github.com/rougier/numpy-100

---

### And finally, some weirdness

In [None]:
import statistics

## Should provide a mean value of 1.0 (i.e. sum of the numbers is 4 and then divide by 4)
statistics_mean = statistics.mean([1e30, 1, 3, -1e30])
np_mean = np.mean([1e30, 1, 3, -1e30])

print('Statistics mean: {}'.format(statistics_mean))
print('NumPy mean: {}'.format(np_mean))


From https://www.python.org/dev/peps/pep-0450/

"The built-in sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the above naive mean fails this "torture test"..."
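
The loss of accuracy can be reproduced with plain floating-point addition. This minimal sketch uses an explicit loop for the naive sum (the accuracy of the built-in sum may differ between Python versions) and compares it with math.fsum, which tracks the lost low-order digits:

```python
import math
import statistics

values = [1e30, 1, 3, -1e30]

## naive left-to-right accumulation: 1 and 3 vanish next to 1e30
naive_total = 0.0
for value in values:
    naive_total += value
naive_mean = naive_total / len(values)

## math.fsum keeps track of the lost low-order digits
fsum_mean = math.fsum(values) / len(values)

exact_mean = statistics.mean(values)

print(f"Naive mean: {naive_mean}")
print(f"fsum mean: {fsum_mean}")
print(f"Statistics mean: {exact_mean}")
```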