# NumPy: NUMerical PYthon

### Karl N. Kirschner

NumPy is the foundation for
- Pandas
- Matplotlib
- Scikit-learn
- PyTorch


- Excels at handling large **arrays** of data (i.e. it is VERY efficient) in terms of
    - RAM usage, and thus
    - speed


- Array: an n-dimensional array (i.e. NumPy's name: ndarray):
    - a collection of values that has 1 or more dimensions
    - 1D array --> vector
    - 2D array --> matrix
    - nD array --> tensor


- All array data must be of the same type (i.e. homogeneous)


- Can perform computations on entire arrays without the need of loops


- Contains some nice mathematical functions/tools (e.g. data extrapolation) - will be covered in the SciPy lecture


- Does not come by default with Python - must be installed

numpy:
https://numpy.org/doc/stable/

---

Comparisons to a regular list:
1. Both are a container for items/elements
2. NumPy allows faster access to items/elements (which allows for faster mathematics), but
3. Lists are faster at inserting new and removing existing items/elements
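
The third point can be illustrated with a minimal sketch. The names `small_list`, `small_array` and `bigger_array` are hypothetical and used only for this illustration: a list grows in place, whereas `np.append` builds and returns a new array.

```python
import numpy as np

small_list = [1, 2, 3]
small_array = np.array([1, 2, 3])

# A list grows in place
small_list.append(4)

# np.append returns a new, larger array; the original array is unchanged
bigger_array = np.append(small_array, 4)

print(small_list)
print(small_array, bigger_array)
```

Because every append to an array copies the data, repeatedly growing an array inside a loop is usually slower than growing a list.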

---

#### Key Concepts for NumPy
1. Each element in an array must be the same type (e.g. floats)
    - allows for efficient usage of RAM
    - NumPy always knows what the content of the array is

2. **Vectorizing operations**<br>
    "This practice of replacing explicit loops with array expressions is commonly referred to as vectorization."
    - source: https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html
    - do math operations all at once (i.e. in a single call) on an ndarray

3. Integrates with C, C++ and Fortran to improve performance
    - In this sense, NumPy is an intermediary between these low-level libraries and Python


4. The raw array data is put into a contiguous (and fixed) block of RAM
    - good at allocating space in RAM for storing the ndarrays
    

**More Information** on what is happening "under-the-hood": https://numpy.org/doc/stable/reference/internals.html
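
A minimal sketch of points 1 and 4, assuming small hypothetical arrays (`demo_array` and `mixed` are names introduced only for this illustration). It inspects the single shared data type, the fixed size per element, and the contiguous memory layout.

```python
import numpy as np

demo_array = np.arange(1000)

# Every element shares one dtype, so the size per element is fixed
print(demo_array.dtype, demo_array.itemsize, "bytes per element")

# The raw data occupy a single contiguous block of nbytes bytes
print(demo_array.flags["C_CONTIGUOUS"], demo_array.nbytes)

# Mixing integers and floats promotes all elements to one common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```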

---

**Citing Numpy**: (https://numpy.org/citing-numpy/)

Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020).

@Article{         harris2020array,  
 title         = {Array programming with {NumPy}},  
 author        = {Charles R. Harris and K. Jarrod Millman and St{\'{e}}fan J.
                 van der Walt and Ralf Gommers and Pauli Virtanen and David
                 Cournapeau and Eric Wieser and Julian Taylor and Sebastian
                 Berg and Nathaniel J. Smith and Robert Kern and Matti Picus
                 and Stephan Hoyer and Marten H. van Kerkwijk and Matthew
                 Brett and Allan Haldane and Jaime Fern{\'{a}}ndez del
                 R{\'{i}}o and Mark Wiebe and Pearu Peterson and Pierre
                 G{\'{e}}rard-Marchant and Kevin Sheppard and Tyler Reddy and
                 Warren Weckesser and Hameer Abbasi and Christoph Gohlke and
                 Travis E. Oliphant},  
 year          = {2020},  
 month         = sep,  
 journal       = {Nature},  
 volume        = {585},  
 number        = {7825},  
 pages         = {357--362},  
 doi           = {10.1038/s41586-020-2649-2},  
 publisher     = {Springer Science and Business Media {LLC}},  
 url           = {https://doi.org/10.1038/s41586-020-2649-2}  
}

---

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statistics
import time
import timeit

#%matplotlib inline

## N-dimensional array object (i.e. ndarray)

Let's create two objects:
1. a regular list, and
2. a NumPy array (via Array RANGE: https://numpy.org/doc/stable/reference/generated/numpy.arange.html)


Then we can demonstrate which is faster using the timeit library.

- timeit (to time code for performance): https://docs.python.org/3/library/timeit.html

In [None]:
my_list = list(range(100000))
my_list

In [None]:
my_array = np.arange(100000)
my_array

Now, let's multiply every element in each container by 2, and do that math 10000 times.

In [None]:
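def list_multiply(test_list=None):
    # A list comprehension doubles each element
    # (note: `test_list*2` would instead repeat the list)
    return [item*2 for item in test_list]


def numpy_multiply(test_array=None):
    # Vectorized: doubles every element in one call
    return test_array*2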

First, let's measure the list's performance using the `time` library (https://docs.python.org/3/library/time.html):

In [None]:
start_time = time.process_time()

for _ in range(10000):
    list_multiply(my_list)

stop_time = time.process_time()

print(f"Timing: {stop_time - start_time:0.1e} seconds")

Now for the Numpy array's performance:

In [None]:
start_time = time.process_time()

for _ in range(10000):
    numpy_multiply(my_array)

stop_time = time.process_time()

print(f"Timing: {stop_time - start_time:0.1e} seconds")

Using NumPy arrays is significantly faster than using lists.

#### timeit
https://docs.python.org/3/library/timeit.html

A very good alternative library for testing performance

Multiply containers by 2, and do that math 10000 times.

In [None]:
timeit.timeit(lambda:numpy_multiply(my_array), number=10000)

In [None]:
timeit.timeit(lambda:list_multiply(my_list), number=10000)
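
A single timing can be noisy, so it can help to repeat the whole measurement several times. The following is a minimal sketch using `timeit.repeat`; the short containers (`short_list`, `short_array`) and the small `number` value are assumptions chosen only so that the example runs quickly.

```python
import timeit

import numpy as np

short_list = list(range(1000))
short_array = np.arange(1000)

# number=100: calls per timing; repeat=5: the whole timing is run five times
list_times = timeit.repeat(lambda: [item * 2 for item in short_list],
                           number=100, repeat=5)
array_times = timeit.repeat(lambda: short_array * 2,
                            number=100, repeat=5)

# The minimum is usually the least disturbed measurement
print(f"list:  best of 5 = {min(list_times):0.1e} seconds")
print(f"array: best of 5 = {min(array_times):0.1e} seconds")
```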

---
## Creating Numpy Arrays from Scratch

In the following we will create several arrays that we can use throughout this lecture.

### Conversion from lists
Let's create 2 data lists with 5 data points each

In [None]:
data_1 = [6, 1, 6, 7, 9]
data_2 = [3, 5, 4, 2, 8]

data_2

Now create 2 arrays (each with a shape of (5,))

In [None]:
array_1 = np.array(data_1)
array_2 = np.array(data_2)

array_2

A slightly more complicated example...

Create a **nested list**, in which each sublist contains 5 data points:

In [None]:
data_3 = [[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]]

Convert the nested lists to a Numpy array, with a shape of (2, 5)

In [None]:
array_3 = np.array(data_3)
array_3

Commit `array_3` to memory - we will use it a lot later on in the lecture.

### Array shapes and dimensions

#### 1D shape

In [None]:
array_1.shape

Recall that we created `array_1` via:

`data_1 = [6, 1, 6, 7, 9]`

`array_1 = np.array(data_1)`

**Note**: the shape would change to (1, 5) if you added double brackets above

`data_1 = [[6, 1, 6, 7, 9]]`

`array_1 = np.array(data_1)`

As a demonstration:

In [None]:
example = [[6, 1, 6, 7, 9]]
test = np.array(example)
test.shape
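
An existing array can also be given a different shape without retyping the data. This is a minimal sketch using `reshape`, which is not otherwise covered in this lecture; the names `flat`, `row` and `column` are hypothetical.

```python
import numpy as np

flat = np.array([6, 1, 6, 7, 9])

# Same five values, arranged as one row or as one column
row = flat.reshape(1, 5)
column = flat.reshape(5, 1)

print(flat.shape, row.shape, column.shape)
print(flat.ndim, row.ndim, column.ndim)
```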

#### nD shape

In [None]:
array_3.shape

In [None]:
array_3.ndim

#### Data types

- https://numpy.org/doc/stable/reference/arrays.dtypes.html
- https://numpy.org/doc/stable/user/basics.types.html

In [None]:
array_3.dtype

Reminder: use `type` to figure out what kind of object you are dealing with:

In [None]:
type(array_3)
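
The difference between `dtype` (the type of the elements) and `type` (the type of the container) can be sketched as follows. The arrays `ints` and `floats` are hypothetical, and the exact integer dtype that is printed depends on the platform.

```python
import numpy as np

ints = np.array([[-6, 1, 6], [-5, 0, 2]])

# astype returns a converted copy; the original array keeps its dtype
floats = ints.astype(np.float64)

print(type(ints), ints.dtype)
print(type(floats), floats.dtype)
```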

---
## More on creating new arrays

#### An array that contains the same number

Create an array with a shape of (3, 5), and fill it with approximately pi (i.e. 3.14)

In [None]:
np.full((3, 5), 3.14)

#### An array of integers

Create a 1D array of 31 integers (i.e. a shape of (31,)) from -10 to 50 using a step size of 2

(similar to the built-in `range` function: the stop value of 52 is excluded)

In [None]:
np.arange(-10, 52, 2)

#### An array of floats

Create an array that contains 10 evenly spaced values between -1 and 1
- numpy's linspace: https://numpy.org/devdocs/reference/generated/numpy.linspace.html

In [None]:
np.linspace(-1, 1, 10)
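
A minimal sketch contrasting the two functions (the names `steps` and `spaced` are hypothetical): `np.arange` takes a step size and excludes the stop value, whereas `np.linspace` takes the number of points and, by default, includes the stop value.

```python
import numpy as np

# Step size given; the stop value (52) is excluded
steps = np.arange(-10, 52, 2)
print(steps.shape, steps[0], steps[-1])

# Number of points given; the stop value (1) is included
spaced = np.linspace(-1, 1, 10)
print(spaced.shape, spaced[0], spaced[-1])
```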

#### An array of random numbers

Create an array with random values drawn from a continuous uniform distribution over [0, 1)
- random.random_sample function: https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_sample.html#numpy.random.random_sample

An array with a shape of (3,):

In [None]:
np.random.random_sample(3)

An array with a shape of (3,4):

In [None]:
np.random.random_sample((3, 4))
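
For reproducible results, NumPy also provides a seeded generator object via `np.random.default_rng`. The following is a minimal sketch of that alternative (it is not the function used above); the seed value of 42 is an arbitrary assumption.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng = np.random.default_rng(seed=42)
sample = rng.random((3, 4))

rng_again = np.random.default_rng(seed=42)
sample_again = rng_again.random((3, 4))

print(sample)
print(np.array_equal(sample, sample_again))
```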

---
## Accessing arrays

#### One dimensional array
Let's look at the (5,) `array_1` from above

In [None]:
array_1

Accessing the fourth item position (i.e. at an index of 3)

In [None]:
array_1[3]

#### A multidimensional array

Now look at a 2D array (i.e. (2, 5) from above)

In [None]:
array_3

Access the first sublist (i.e. row) of the 2D array

In [None]:
array_3[0]

Access the fourth item position of the second sublist

In [None]:
array_3[1, 3]

#### Slicing

Demo using [0:1], [1:2], [0:2] and [0:6]

Slice to obtain the first nested array

In [None]:
array_3[0:1]

Slice to obtain the second nested array

In [None]:
array_3[1:2]

Slice to obtain the entire array

In [None]:
array_3[0:2]

Notice that we can specify upper bounds that go beyond the size of the array:

In [None]:
array_3[0:6]
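
Slices can also be applied along the second axis (i.e. the columns). A minimal sketch, assuming a hypothetical array `grid` that holds the same values as `array_3`:

```python
import numpy as np

grid = np.array([[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]])

# All rows, first column only
first_column = grid[:, 0]

# All rows, columns at index 1, 2 and 3
middle_columns = grid[:, 1:4]

# Second row, last two items
row_tail = grid[1, -2:]

print(first_column)
print(middle_columns)
print(row_tail)
```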

---
## Filter (search) for elements

- NumPy arrays do not provide the search methods of a list (e.g. `list.index`), so the more typical approaches are not available
- numpy.where is used instead: https://numpy.org/doc/stable/reference/generated/numpy.where.html

In [None]:
array_3

Filter `array_3` for values less than 0:

In [None]:
negative_items = np.where(array_3 < 0)

array_3[negative_items]
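
As an alternative sketch, the comparison itself produces a boolean array that can be used directly as an index, and `np.where` with three arguments replaces values instead of locating them. The array `grid` is a hypothetical copy of `array_3`.

```python
import numpy as np

grid = np.array([[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]])

# Boolean mask: True wherever the condition holds
mask = grid < 0
negatives = grid[mask]

# Three-argument form: use 0 where the condition holds, else keep the value
clipped = np.where(grid < 0, 0, grid)

print(negatives)
print(clipped)
```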

### Flatten a multidimensional array

In [None]:
array_3.flatten()

Convert the results to a list

In [None]:
array_3.flatten().tolist()

---
## Joining arrays

#### Multiple arrays with the same dimensions

In [None]:
array_1

In [None]:
array_2

**Concatenate**: https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html

Multiple 1D arrays will create a single larger 1D array

In [None]:
np.concatenate([array_1, array_2, array_1])

Multiple nD arrays, along their first axis (i.e. **axis=0**) - conceptually, this is like adding items to columns in a table

(or as an alternative perspective - **adding more rows**)

Let's join `array_3` to itself three times.

In [None]:
array_3

In [None]:
array_big = np.concatenate([array_3, array_3, array_3], axis=0)
array_big

In [None]:
array_big.shape

Okay, we can present a NumPy array in a more aesthetically pleasing way.

Use Pandas to print out the table in a more human-readable (i.e. scientist-readable) form

In [None]:
pd.DataFrame(array_big)

In [None]:
print(pd.DataFrame(array_big))

Multiple nD arrays, along their second axis (i.e. **axis=1**) - conceptually, this is like adding items to rows in a table

(or as an alternative perspective - **adding more columns**)

In [None]:
pd.DataFrame(array_3)

In [None]:
## Multiple nD arrays, along their second axis
array_long = np.concatenate([array_3, array_3, array_3], axis=1)
array_long

In [None]:
array_long.shape

In [None]:
## Use Pandas
pd.DataFrame(array_long)

#### Multiple arrays with inconsistent (i.e. mixed) dimensions
- must pay attention to the dimensions


- vertically stacked
- horizontally stacked

##### Vertically stacked
- nD arrays must be (x, N) and (y, N) where N is the same value

Below we will combine `array_3` (shape: (2, 5)) with `array_big` (shape: (6, 5)).

In [None]:
array_3

In [None]:
array_big

In [None]:
array_vstack = np.vstack([array_3, array_big])
array_vstack

In [None]:
array_vstack.shape

Now logically, we can also do this with our `array_1` (shape: (5,)), which `np.vstack` treats as a single row of 5 items

In [None]:
array_1

In [None]:
array_vstack = np.vstack([array_1, array_3])
array_vstack

In [None]:
array_vstack.shape

When would this not work?

Demo when the arrays (i.e. (x, N) and (y, N)) have different N values

In [None]:
array_4 = np.array([[99, 99, 99, 99]])
array_4

In [None]:
array_4.shape

In [None]:
np.vstack([array_4, array_3])
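
The failure can be caught and inspected rather than letting it stop the notebook. A minimal sketch, assuming hypothetical arrays `narrow` (shape (1, 4)) and `wide` (shape (2, 5)):

```python
import numpy as np

narrow = np.array([[99, 99, 99, 99]])
wide = np.array([[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]])

try:
    np.vstack([narrow, wide])
    stacked_ok = True
except ValueError as error:
    # Raised because the second dimensions (4 and 5) do not match
    stacked_ok = False
    print(f"vstack failed: {error}")
```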

##### Horizontally stacked
- nD arrays must be (N, x) and (N, y) where N is the same value

Using our examples, we need a new array with a shape of (2, x), since `array_3` has a shape of (2, 5)

In [None]:
array_5 = np.array([[99], [99]])
array_5

In [None]:
array_hstack = np.hstack([array_5, array_3])
array_hstack

In [None]:
array_5.shape

In [None]:
array_hstack.shape

When would this not work?

Demo when the arrays (i.e. (N, x) and (N, y)) have different N values

In [None]:
array_big

In [None]:
array_big.shape

In [None]:
array_hstack = np.hstack([array_4, array_big])
array_hstack

---
## Math with ndarrays
- np.add and np.subtract
- np.multiply and np.divide
- np.power
- np.negative (multiplies x by -1)


#### Math performed on a single array

In [None]:
array_3

#### Method 1: a NumPy function

In [None]:
np.add(array_3, 5)

#### Method 2: using Python's built-in `+` operator

In [None]:
array_3 + 5

Note
- Using NumPy functions vs. Python's built-in operators doesn't matter **too much** here, since in both cases the action is performed on a NumPy array

In [None]:
timeit.timeit(lambda:np.add(array_3, 5), number=100000)

In [None]:
timeit.timeit(lambda:array_3 + 5, number=100000)

### Math between arrays
- math operations between equal-sized arrays are done element-wise

Add and subtract

In [None]:
array_3 + array_3

In [None]:
array_3 - array_3

Multiplication

In [None]:
array_3 * array_3

In [None]:
data_4 = [[-1, -1, -1, -1, -1], [-1, -1, -1, -1, -1]]
array_4 = np.array(data_4)

In [None]:
array_3 * array_4

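Division (note: `array_3` contains a 0, so NumPy emits a divide-by-zero warning and returns `inf` for that element)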

In [None]:
1/array_3
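
If the zero element should be skipped rather than turned into `inf`, one option is the `where` and `out` arguments of `np.divide`. This is a minimal sketch; the array `grid` (a copy of the `array_3` values) and the choice to leave 0.0 at the skipped position are assumptions made for this illustration.

```python
import numpy as np

grid = np.array([[-6, 1, 6, 7, 9], [-5, 0, 2, 4, 3]])

# Pre-fill the output; positions where the condition is False keep this value
reciprocal = np.zeros(grid.shape, dtype=float)

# Divide only where the denominator is non-zero
np.divide(1.0, grid, out=reciprocal, where=(grid != 0))

print(reciprocal)
```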

Raise to a power

In [None]:
array_3**3

### Absolute values

Using a Numpy function

In [None]:
np.absolute(array_3)

Python's built-in function

In [None]:
abs(array_3)

### Booleans

In [None]:
print(array_3 == -6)

### Trigonometric
- np.sin()
- np.cos()
- np.arcsin()
- etc.

Trigonometry on a single input value

In [None]:
np.sin(-6)

Trigonometry on a NumPy array

In [None]:
np.sin(array_3)

### Exponents and logarithms

In [None]:
x = np.array([2, 3])

## Note that the printed array elements aren't separated by commas
##    (as seen above) due to the print statement

print("x     =", x)
print("2^x   =", np.exp2(x))
print("10^x  =", np.power(10, x))
print("e^x   =", np.exp(x))

(Recall that you can reverse the exponential calculations using log functions.)

Take the above exponential output and operate on it using log functions:

In [None]:
x = [4., 8.]
print("log2(x)  =", np.log2(x))

x = [100., 1000.]
print("log10(x) =", np.log10(x))

x = [7.3890561, 20.08553692]
print("ln(x)    =", np.log(x))
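
The round trip can also be checked without retyping the intermediate values. A minimal sketch, using a hypothetical array `values` and `np.allclose` to allow for floating-point rounding:

```python
import numpy as np

values = np.array([2.0, 3.0])

# Each logarithm undoes the matching exponential (up to rounding)
base_2_ok = np.allclose(np.log2(np.exp2(values)), values)
base_10_ok = np.allclose(np.log10(np.power(10.0, values)), values)
base_e_ok = np.allclose(np.log(np.exp(values)), values)

print(base_2_ok, base_10_ok, base_e_ok)
```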

---
### A more complex example

Convert temperature values from Celsius to Fahrenheit

Data set: Average temperature in Bonn throughout the calendar year (i.e. January ---> December)

In [None]:
data_celcius = [2.0, 2.8, 5.7, 9.3, 13.3, 16.5, 18.1, 17.6, 14.9, 10.5, 6.1, 3.2]

array_celcius = np.array(data_celcius)

array_celcius

In [None]:
array_fahrenheit = array_celcius*(9/5) + 32
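
As a quick sanity check, the conversion can be inverted to recover the original values. This is a minimal sketch on the first three temperatures; the names `celsius`, `fahrenheit` and `celsius_again` are hypothetical.

```python
import numpy as np

celsius = np.array([2.0, 2.8, 5.7])

# Forward conversion: F = C * 9/5 + 32
fahrenheit = celsius * (9 / 5) + 32

# Inverse conversion: C = (F - 32) * 5/9
celsius_again = (fahrenheit - 32) * (5 / 9)

print(fahrenheit)
print(np.allclose(celsius, celsius_again))
```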

Visualize the results for clarity using Matplotlib:

In [None]:
plt.plot(array_celcius)
plt.plot(array_fahrenheit)
plt.show()

---
## Numpy statistics

#### Side note: numpy's random number generators
- generators (e.g. normal/gaussian, geometric, binomial) https://numpy.org/doc/1.18/reference/random/generator.html

Two examples will be given as demonstrations

##### Geometric distribution
Generate 10 random values drawn from a geometric distribution, where each value is the number of attempts needed to obtain the first success when the success probability is 60%:

In [None]:
random_geom = np.random.geometric(0.60, size=10)
random_geom

##### Normal distribution
Create a (3, 3) array of Gaussian distributed random values: mean=10.0 and standard deviation=0.1

In [None]:
random_data = np.random.normal(10.0, 0.1, (3, 3))
random_data

Visualize the random array for clarity

(demo by repeating the code above and replotting)

In [None]:
plt.plot(random_data)
plt.show()

Let's also check that our mean is close to 10 and the standard deviation is close to 0.1

In [None]:
np.mean(random_data)

In [None]:
np.std(random_data)
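
With only nine values the estimates can be noticeably off, and a larger sample brings them closer to the requested parameters. A minimal sketch, assuming a seeded generator from `np.random.default_rng` (seed 0 and the sample size of 100000 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Same mean and standard deviation as above, but many more values
big_sample = rng.normal(10.0, 0.1, size=100_000)

print(np.mean(big_sample))
print(np.std(big_sample))
```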

---
## Details concerning standard deviation and variance

(Why can't I reproduce results using spreadsheets or Matlab?)

In [None]:
data = [1, 2, 4, 5, 8]

#### variance
- A LibreOffice spreadsheet gives a variance of 7.5 for '=VAR(1,2,4,5,8)'
- I believe Matlab also gives 7.5

Using the `statistics` library

In [None]:
statistics.variance(data)

The above results are actually termed 'the sample variance.'

However, if you use NumPy by simply typing:

In [None]:
np.var(data)

In this case there is a "hidden" parameter called `ddof` ("Delta Degrees of Freedom"), which defaults to 0:
- the divisor used in the calculation is 'N - ddof'

https://numpy.org/doc/1.18/reference/generated/numpy.var.html?highlight=variance

- population: "ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables"
- sample: "ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population"

The same is true for standard deviation.

Population variance:

In [None]:
np.var(data, ddof=0)

Sample variance (larger than the population variance whenever the values are not all identical):

In [None]:
np.var(data, ddof=1)

Standard deviation demo

LibreOffice gives 2.7386127875 for '=STDEV(1,2,4,5,8)'

And the `statistics` library gives:

In [None]:
statistics.stdev(data)

Numpy's sample standard deviation

In [None]:
np.std(data, ddof=1)

Numpy's population standard deviation

In [None]:
np.std(data, ddof=0)
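
The role of `ddof` can be made explicit by computing both variances by hand. A minimal sketch, using a hypothetical array `values` that holds the same numbers as `data`:

```python
import numpy as np

values = np.array([1, 2, 4, 5, 8])
n = values.size

# Sum of squared deviations from the mean
squared_deviations = ((values - values.mean()) ** 2).sum()

# Population variance divides by N (ddof=0)
population_variance = squared_deviations / n

# Sample variance divides by N - 1 (ddof=1)
sample_variance = squared_deviations / (n - 1)

print(population_variance, sample_variance)
print(np.sqrt(population_variance), np.sqrt(sample_variance))
```

Here the mean is 4 and the squared deviations sum to 30, which gives 30/5 = 6.0 and 30/4 = 7.5.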

**Take home message**: you should always take a look at NumPy's manual to make sure you are doing what you think you are doing -- keep an eye out for default settings (e.g. ddof=0).

---

Additional resource to further learn and test your knowledge: https://github.com/rougier/numpy-100

---

### And finally, some weirdness

The following should provide a mean value of 1.0

(i.e. sum of the numbers is 4 and then divide by 4)

In [None]:
large_numbers_list = [1e30, 1, 3, -1e30]

In [None]:
statistics_mean = statistics.mean(large_numbers_list)
statistics_mean

In [None]:
np_mean = np.mean(large_numbers_list)
np_mean

In [None]:
np_sum = np.sum(large_numbers_list)
np_sum

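This appears to come from the limited precision of the floating-point data type (adding 1 or 3 to 1e30 is lost to rounding), rather than from an unexpected data type: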

In [None]:
np.array(large_numbers_list).dtype

In [None]:
np_mean = np.mean(large_numbers_list, dtype=np.float64)
np_mean

In [None]:
np_mean = np.mean(large_numbers_list, dtype=np.int8)
np_mean
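
The underlying effect can be seen with plain Python floats. A minimal sketch: `math.fsum` from the standard library (an addition that is not used elsewhere in this lecture) tracks the lost low-order digits while summing, and the list `large_numbers` is a hypothetical copy of `large_numbers_list`.

```python
import math

large_numbers = [1e30, 1, 3, -1e30]

# A 64-bit float carries roughly 16 significant digits,
# so adding 1 to 1e30 does not change the stored value
absorbed = (1e30 + 1) == 1e30
print(absorbed)

# fsum compensates for the rounding error of each addition
exact_sum = math.fsum(large_numbers)
print(exact_sum, exact_sum / len(large_numbers))
```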