<a href="https://colab.research.google.com/github/munich-ml/MLPy2020/blob/master/15_NumPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

NumPy is an array processing package for Python.

NumPy's core is written in C, which makes computations with NumPy very fast.

NumPy is usually imported as **np** (alias):

In [0]:
import numpy as np   

The basic NumPy datatype is the NumPy array:

In [0]:
z = np.array([2, 4, 7])
z

## NumPy arrays vs. lists

### Similarities

NumPy arrays share many properties with Python lists, e.g. slicing: 

In [0]:
a = [2, 7, 5, 1]

In [0]:
a[1:-1]

In [0]:
z = np.array(a)
z

In [0]:
z[1:-1]

### Differences

#### Data types
- Python lists may contain items of different data types, while
- NumPy arrays have a **single data type** shared by all items:

In [0]:
a = [2, 7, 1]
z = np.array(a)
z.dtype

In [0]:
a = [2, "seven", 1]
z = np.array(a)
z

... all items are converted to text. In the resulting dtype, `U` stands for Unicode string and `21` is the maximum number of characters reserved for each item.
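
The type conversion can be checked directly. The following is a minimal sketch; the variable names are made up for illustration, and `dtype=object` is shown only as one possible way to keep mixed Python objects, not as a recommendation.

```python
import numpy as np

# Integers and floats are upcast to a common float type
mixed_numbers = np.array([2, 7.5, 1])
print(mixed_numbers.dtype)        # float64

# Integers and strings are all converted to Unicode strings
mixed_text = np.array([2, "seven", 1])
print(mixed_text.dtype.kind)      # U
print(mixed_text)

# dtype=object keeps the original Python objects,
# but gives up the compact, single-type storage
as_objects = np.array([2, "seven", 1], dtype=object)
print(type(as_objects[0]), type(as_objects[1]))
```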

#### Element-wise operations
- **list** objects don't support element-wise operations:


In [0]:
a = [1, 3, 4] 
b = [1, 3, 5]

In [0]:
try:
    a * b
except Exception as e:
    print(e)

- **NumPy arrays** are made for element-wise operations:

In [0]:
np.array(a) * np.array(b)
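
The same operators exist for lists, but they mean something different there. This short sketch contrasts the two; the result variable names are only illustrative.

```python
import numpy as np

a = [1, 3, 4]
b = [1, 3, 5]

# On lists, + concatenates and * with an integer repeats the list
list_sum = a + b          # [1, 3, 4, 1, 3, 5]
list_repeat = a * 2       # [1, 3, 4, 1, 3, 4]

# On arrays, the same operators work element by element
array_sum = np.array(a) + np.array(b)      # [2, 6, 9]
array_double = np.array(a) * 2             # [2, 6, 8]

print(list_sum, list_repeat)
print(array_sum, array_double)
```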

## Initializing arrays

Initializing **all zeros**:

In [0]:
shape = (2, 3)
np.zeros(shape)

Initializing **all ones** with specific data type:

In [0]:
shape = (2, 4)
np.ones(shape, dtype="int32")

Initializing with an **arbitrary value**:

In [0]:
np.full(shape=(2, 3), fill_value=1.33)

### Initializing with **random numbers**

Random **float** values 


In [0]:
np.random.random_sample(size=(2, 2, 4))

Random **integers**

In [0]:
np.random.randint(low=3, high=7, size=(15,))

Set the random **seed** prior to sample generation for reproducible results.

In [0]:
np.random.seed(seed=42)

In [0]:
np.random.random_sample(size=(2, 4))
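
To see what the seed buys, the sketch below draws the same sample twice after resetting the seed. The second half uses `np.random.default_rng`, the newer Generator interface, which is not used elsewhere in this notebook and is shown here only as an alternative.

```python
import numpy as np

# Same seed, same sequence of random numbers
np.random.seed(seed=42)
first = np.random.random_sample(size=(2, 4))

np.random.seed(seed=42)
second = np.random.random_sample(size=(2, 4))

print(np.array_equal(first, second))   # True

# Alternative: generator objects that each carry their own seed
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)
sample_a = rng_a.random((2, 4))
sample_b = rng_b.random((2, 4))
print(np.array_equal(sample_a, sample_b))   # True
```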

# NumPy math

## Element-wise operations

In [0]:
z = np.array([1, 2, 3, 4])

In [0]:
z + 4

In [0]:
z / 2

In [0]:
z

In [0]:
np.round(np.sin( z/4 * np.pi), decimals=3)
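
Note that the operations above return new arrays and leave `z` untouched. A small sketch of the difference between out-of-place and in-place operators (the names `shifted` and `halved` are illustrative):

```python
import numpy as np

z = np.array([1, 2, 3, 4])

shifted = z + 4      # new array, z is not modified
halved = z / 2       # true division always returns floats

print(z)             # [1 2 3 4]
print(halved.dtype)  # float64

z += 4               # in-place operator: modifies z itself
print(z)             # [5 6 7 8]
```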

## Matrix multiplication

In [0]:
z = np.ones(shape=(2,3))
z

In [0]:
y = np.full(shape=(3,2), fill_value=2)
y

In [0]:
try:
    z * y
except Exception as e:
    print(e)

... `z * y` attempts an element-wise multiplication, which fails because the shapes `(2, 3)` and `(3, 2)` don't match.

For matrix multiplication use `np.matmul` instead:

In [0]:
np.matmul(z, y)

In [0]:
np.matmul(y, z)
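
The `@` operator is a shorthand for `np.matmul`. The sketch below repeats both products with `@` and prints the resulting shapes, which depend on the order of the operands.

```python
import numpy as np

z = np.ones(shape=(2, 3))
y = np.full(shape=(3, 2), fill_value=2)

zy = z @ y    # (2, 3) @ (3, 2) -> (2, 2)
yz = y @ z    # (3, 2) @ (2, 3) -> (3, 3)

print(zy.shape, yz.shape)
print(np.array_equal(zy, np.matmul(z, y)))   # True
```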

## Transposing matrices

`array.T` returns the transpose of the array:

In [0]:
z = np.array([[1,2,3], [4,5,6]])
print(z, z.shape)

In [0]:
print(z.T, z.T.shape)
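
One detail that is easy to miss: `.T` does not copy the data. A minimal sketch, assuming you want to verify this with `np.shares_memory`; call `.copy()` if an independent array is needed.

```python
import numpy as np

z = np.array([[1, 2, 3], [4, 5, 6]])
zt = z.T

print(zt.shape)                   # (3, 2)
print(zt[2, 0] == z[0, 2])        # True: indices are swapped
print(np.shares_memory(z, zt))    # True: the transpose is a view

# An explicit copy is independent of the original
zt_copy = z.T.copy()
zt_copy[0, 0] = 99
print(z[0, 0])                    # 1
```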

# Statistics

In [0]:
z = np.array([[1,2,3], [4,5,6]])
z

Min of all values:

In [0]:
z.min()

Min of each row (`axis=1` reduces along the columns):

In [0]:
z.min(axis=1)

Standard deviation of all values:

In [0]:
z.std()

In [0]:
np.median(z, axis=1)
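
The `axis` argument names the axis that is reduced away, which is easy to mix up. The sketch below puts both directions side by side for the same array; the result names are illustrative.

```python
import numpy as np

z = np.array([[1, 2, 3], [4, 5, 6]])

col_min = z.min(axis=0)            # reduce over rows: one value per column
row_min = z.min(axis=1)            # reduce over columns: one value per row
row_median = np.median(z, axis=1)  # median of each row

print(col_min)     # [1 2 3]
print(row_min)     # [1 4]
print(row_median)  # [2. 5.]
```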

## Booleans
- `np.all()` is a logical `AND`
- `np.any()` is a logical `OR`
 

#### Logical `AND`

In [0]:
z = np.array([[True, True, False, True], [True, True, True, True]])
z

In [0]:
z.all()

In [0]:
z.all(axis=1)

#### Logical `OR`

In [0]:
z.any()

#### Logical `NOT`

In [0]:
np.logical_not(z.all(axis=0))
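
As a compact recap, the sketch below applies `all` and `any` along both axes. It also uses `~`, which for boolean arrays gives the same result as `np.logical_not`.

```python
import numpy as np

z = np.array([[True, True, False, True], [True, True, True, True]])

all_per_column = z.all(axis=0)    # AND over the rows
all_per_row = z.all(axis=1)       # AND over the columns
any_per_row = z.any(axis=1)       # OR over the columns

# For boolean arrays, ~ is the operator form of np.logical_not
inverted = ~all_per_column

print(all_per_column)   # [ True  True False  True]
print(all_per_row)      # [False  True]
print(inverted)         # [False False  True False]
```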

# Reorganizing arrays

## Reshape

In [0]:
z = np.array([[1,2,3], [4,5,6]])
z

In [0]:
z.shape

**Option 1**: Use `reshape` method:

In [0]:
z = z.reshape((3, 2))
z

In [0]:
z.shape

**Option 2**: Assign a new shape to the `.shape` attribute:

In [0]:
z.shape = (6, )
z

In [0]:
z.shape
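
Two further details of `reshape` are sketched below: one dimension may be given as `-1` and is then inferred from the array size, and a shape that doesn't match the number of items raises a `ValueError`.

```python
import numpy as np

z = np.array([[1, 2, 3], [4, 5, 6]])

three_rows = z.reshape((3, -1))    # -1 is inferred as 2
flat = z.reshape(-1)               # 1D array with all 6 items

print(three_rows.shape)   # (3, 2)
print(flat)               # [1 2 3 4 5 6]

try:
    z.reshape((4, 2))     # 8 slots for 6 items
except ValueError as e:
    print(e)
```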

## Concatenation

In [0]:
a = np.array([[1,2,3], [4,5,6]])
a

In [0]:
b = np.array([[7,8,9], [10,11,12]])
b

### `np.vstack()`  

For 2D arrays, `np.vstack()` is equivalent to `np.concatenate(..., axis=0)`:

In [0]:
z = np.concatenate((a, b), axis=0)
z

In [0]:
z = np.vstack((a, b))
z

### `np.hstack()`  

For 2D arrays, `np.hstack()` is equivalent to `np.concatenate(..., axis=1)`:

In [0]:
z = np.concatenate((a, b), axis=1)
z

In [0]:
z = np.hstack((a, b))
z
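
A quick way to keep the two directions apart is to look at the resulting shapes. This sketch stacks the same two `(2, 3)` arrays both ways and checks the results against `np.concatenate`.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[7, 8, 9], [10, 11, 12]])

stacked_rows = np.vstack((a, b))   # rows are appended below
stacked_cols = np.hstack((a, b))   # columns are appended to the right

print(stacked_rows.shape)   # (4, 3)
print(stacked_cols.shape)   # (2, 6)

print(np.array_equal(stacked_rows, np.concatenate((a, b), axis=0)))   # True
print(np.array_equal(stacked_cols, np.concatenate((a, b), axis=1)))   # True
```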

### Concatenating arrays and slices (1D arrays)


In [0]:
z = np.linspace(start=0, stop=11, num=12, dtype="int8")
z

In [0]:
z.shape = (6, 2)
z

Compute new features (columns) based on existing features:

In [0]:
new_sum = z[:, 0] + z[:, 1]
new_sum

In [0]:
new_prod = z[:, 0] * z[:, 1]
new_prod

An attempt to concatenate the new features to the right fails:

In [0]:
try:
    new_z = np.concatenate((z, new_sum, new_prod), axis=1)
except Exception as e:
    print("Exception occurred: " + str(e))

Problem: The new features are slices (1D arrays), not 2D arrays:

In [0]:
print(z.ndim, z.shape)

In [0]:
print(new_sum.ndim, new_sum.shape)

#### Option 1: Reshape first
Manually reshape the new features before concatenation:

In [0]:
new_sum.shape

In [0]:
for array in [new_sum, new_prod]:
    new_shape = (array.size, 1)
    array.shape = new_shape

In [0]:
new_sum.shape

In [0]:
np.hstack([z, new_sum, new_prod])
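
Assigning to `.shape` changes the arrays in place. If the 1D originals should stay as they are, a column view can be created instead. This is a sketch of that variant, using `np.newaxis` and `reshape(-1, 1)`, which are not used in the cells above; `np.arange` rebuilds the same `z` for self-containment.

```python
import numpy as np

z = np.arange(12, dtype="int8").reshape((6, 2))
new_sum = z[:, 0] + z[:, 1]
new_prod = z[:, 0] * z[:, 1]

# Two equivalent ways to get a (6, 1) column from a (6,) array
sum_col = new_sum[:, np.newaxis]
prod_col = new_prod.reshape(-1, 1)

new_z = np.hstack([z, sum_col, prod_col])

print(new_z.shape)      # (6, 4)
print(new_sum.shape)    # (6,) - the original is still 1D
```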

#### Option 2: `np.column_stack(tup)`
Stacks 1D or 2D arrays as columns into a 2D array.
- 2D arrays are stacked as-is, just like with `hstack`.
- 1D arrays are turned into 2D columns first.

Recreating the test bench:

In [0]:
z

Recalculate the new features to undo the manual reshaping and get 1D arrays again.

In [0]:
new_sum = z[:, 0] + z[:, 1]
new_sum

In [0]:
new_prod = z[:, 0] * z[:, 1]
new_prod

In [0]:
np.column_stack([z, new_sum, new_prod])

#### Option 3: Deep dive `np.c_`, `np.r_` 

- `.c_`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.c_.html
- `.r_`: https://docs.scipy.org/doc/numpy/reference/generated/numpy.r_.html

In [0]:
np.c_[z, new_sum, new_prod]

In [0]:
np.r_[z, np.c_[new_sum, new_prod]]
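
To make the two index tricks easier to follow, the sketch below rebuilds the same inputs and prints the resulting shapes. The comparison with `np.column_stack` is only a sanity check for this particular case, not a general equivalence.

```python
import numpy as np

z = np.arange(12, dtype="int8").reshape((6, 2))
new_sum = z[:, 0] + z[:, 1]
new_prod = z[:, 0] * z[:, 1]

# np.c_ joins along the last axis; 1D inputs become columns
as_columns = np.c_[z, new_sum, new_prod]
print(as_columns.shape)   # (6, 4)

# np.r_ joins along the first axis, so the (6, 2) block goes below z
as_rows = np.r_[z, np.c_[new_sum, new_prod]]
print(as_rows.shape)      # (12, 2)

print(np.array_equal(as_columns, np.column_stack([z, new_sum, new_prod])))   # True
```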

## Advanced indexing

In [0]:
z = np.random.randint(16, size=(3, 12))
z

Create a **masking array**, e.g. where the values are greater than a limit:

In [0]:
mask = z > 10
mask

... and index the array at the *mask* positions:

In [0]:
z[mask]

In [0]:
print("{} chosen out of {}, equivalent to {:.1%}".format(
    mask.sum(), 
    mask.size, 
    mask.sum()/mask.size))
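
Masks can also be combined and used for assignment. The following sketch goes slightly beyond the cells above: the limits `4` and `10`, the seed, and the use of `np.where` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed for repeatability
z = rng.integers(16, size=(3, 12))

# Combine conditions element-wise with & (AND), | (OR) and ~ (NOT);
# the parentheses around each comparison are required
in_range = (z > 4) & (z <= 10)
outside = ~in_range

# Assign through a mask: cap all values above 10
clipped = z.copy()
clipped[clipped > 10] = 10

# np.where picks from two alternatives depending on the condition
replaced = np.where(z > 10, -1, z)

print(in_range.sum(), outside.sum())
print(clipped.max())
```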

### NumPy versus list - speed comparison
Measure the compute time to find the smallest value within a list / array.

In [0]:
%matplotlib inline
from time import time
import matplotlib.pyplot as plt

labels = ["NumPy array", "list"]
compute_times = [[], []]
lengths = (1e5, 3e5, 1e6, 3e6, 1e7)
for length in lengths:
    np_vals = np.random.random_sample(size=(int(length),))
    list_vals = list(np_vals)
    funcs = [np_vals.min, lambda: min(list_vals)]

    for func, compute_time in zip(funcs, compute_times):
        t0 = time()
        func()
        compute_time.append(time() - t0)

for compute_time, label in zip(compute_times, labels):
    plt.loglog(lengths, compute_time, "o-", label=label)

plt.xlabel("array length")
plt.ylabel("computation time [s]")
plt.legend()
plt.grid(which="both")
plt.tight_layout()
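
A single timed call can be noisy. As a rough alternative, the sketch below averages several runs with the standard library's `timeit` module for one array length. The length and the number of repeats are arbitrary choices, and the absolute timings depend on the machine.

```python
import timeit
import numpy as np

np_vals = np.random.random_sample(size=(100_000,))
list_vals = list(np_vals)

repeats = 20   # arbitrary; more repeats give smoother averages

# timeit.timeit returns the total time for `number` calls
t_numpy = timeit.timeit(np_vals.min, number=repeats) / repeats
t_list = timeit.timeit(lambda: min(list_vals), number=repeats) / repeats

print("NumPy array: {:.2e} s per call".format(t_numpy))
print("list:        {:.2e} s per call".format(t_list))
```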