# <font color='red'>**NumPy**</font>

1. Array creation & data types
1. Universal functions: vectorized ops
1. Basic manipulations: indexing, slicing, boolean mask, fancy index, etc.
1. Broadcast: vectorized ops on arrays with different shapes

Nature article: https://www.nature.com/articles/s41586-020-2649-2.pdf

##### Package installation

In [None]:
!pip install numpy
!pip install matplotlib
!pip install pandas

In [4]:
import numpy as np

np.__version__

'1.18.5'

### Why numpy array?
* **Array**: collection of items, with the **same data type**
* Saves computation time and memory
    * A Python list is slower and takes up more space
* Vectorized operations: write fewer for-loops
* Extremely popular; a common data-exchange format between packages
* Linear algebra operations
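
The memory and single-dtype points above can be checked with a small sketch. The names `py_list`, `np_arr`, and `mixed` are illustrative only, and the list size is an estimate for CPython (the pointer table plus each separate `int` object).

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate Python int objects, so count both
# the pointer table and the int objects themselves (CPython estimate).
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An ndarray stores the raw values in one contiguous buffer.
array_bytes = np_arr.nbytes  # 1000 items * 8 bytes = 8000

print(list_bytes, array_bytes)

# Mixed inputs are converted to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```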

### Numpy is faster than list

In [None]:
# 1) compute the doubled values using a list [SLOWER]
num = 100000
doubleOld = [x*2 for x in range(num)] # list method

# 2) compute the doubled values using a numpy array [FASTER]
# arr = np.arange(num)
arr = np.array(range(num))
# doubleNew = arr + arr
doubleNew = arr * 2

# time units (prefix a line with %timeit in Jupyter to measure it)
# ms - millisecond (1e-3 s)
# µs - microsecond (1e-6 s)
# ns - nanosecond (1e-9 s)

# is it the same?
# doubleOld == doubleNew
# (doubleOld == doubleNew).all()
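
Outside Jupyter, the same comparison can be timed with the standard-library `timeit` module. This is a rough sketch: the repeat count of 20 is an arbitrary choice, and the measured times depend on the machine, so no specific numbers are claimed here.

```python
import timeit
import numpy as np

num = 100_000
arr = np.arange(num)
runs = 20  # arbitrary repeat count for this sketch

# Time the list comprehension and the vectorized multiply.
list_time = timeit.timeit(lambda: [x * 2 for x in range(num)], number=runs)
numpy_time = timeit.timeit(lambda: arr * 2, number=runs)

print(f"list comprehension: {list_time / runs * 1e3:.3f} ms per run")
print(f"numpy multiply:     {numpy_time / runs * 1e3:.3f} ms per run")

# Both approaches produce the same values.
same = np.array_equal(np.array([x * 2 for x in range(num)]), arr * 2)
print(same)  # True
```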

---
## 1) **Array creation & data types**

### Basic Operations

In [None]:
# 1d array
data = np.arange(1000)
data

# basic operations
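data.ndim # number of dimensions: 1
data.shape # shape: (1000,)
data.dtype # platform default integer (int64 on most systems, int32 on Windows before NumPy 2.0)
data.strides
# len(data)

data.strides # stride: bytes to skip to reach the next item in each dimension, e.g. (8,) for int64 or (4,) for int32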
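
Because the default integer size varies by platform, a sketch with explicit dtypes makes the stride values predictable. The arrays `a`, `b`, and `c` are illustrative names.

```python
import numpy as np

a = np.arange(6, dtype=np.int32)
print(a.itemsize, a.strides)  # 4 (4,)  -> 4 bytes to the next item

b = a.reshape(2, 3)
print(b.strides)              # (12, 4) -> 12 bytes to the next row, 4 to the next column

c = np.zeros((2, 3), dtype=np.float64)
print(c.strides)              # (24, 8)
print(c.T.strides)            # (8, 24) -> transposing only swaps the strides
```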

In [9]:
# 2d array
np.random.seed(1) # fix the seed so everyone sees the same random values
data = np.random.randn(2,3)
data
# data
# data.ndim
# data.dtype # float64 -> 8bytes
# len(data)
# data.strides # (24, 8) => skip 24 bytes to the next row, 8 bytes to the next column

# format the output
# np.set_printoptions(precision=8,suppress=True)
# np.set_printoptions?

# pandas is built on numpy

# please install pandas first
# import pandas as pd
# df = pd.DataFrame(data)
# url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/io/data/csv/tips.csv'
# tips = pd.read_csv(url)   # tips dataset
# tips.values


array([[16.99, 1.01, 'Female', ..., 'Sun', 'Dinner', 2],
       [10.34, 1.66, 'Male', ..., 'Sun', 'Dinner', 3],
       [21.01, 3.5, 'Male', ..., 'Sun', 'Dinner', 3],
       ...,
       [22.67, 2.0, 'Male', ..., 'Sat', 'Dinner', 2],
       [17.82, 1.75, 'Male', ..., 'Sat', 'Dinner', 2],
       [18.78, 3.0, 'Female', ..., 'Thur', 'Dinner', 2]], dtype=object)

### Creating ndarray

1. np.array()
1. np.arange()
1. np.zeros()
1. np.ones()
1. np.empty()
1. np.full()
1. np.eye() / np.identity()
1. np.random.random() / np.random.randint()
1. np.linspace()

In [None]:
# method 1: list
np.array([[1,2],[3,4],])

In [None]:
# method 2: np.arange(start, stop, step, dtype)
np.arange(10,100, 3, dtype='float')
np.arange(100).reshape(10,10) # reshape() to change the shape

In [None]:
# method 3: np.zeros(shape, dtype)
# data = np.zeros(5)
data = np.zeros((2,3))
# data = np.zeros([2,3])
data
# data.shape

In [None]:
# method 4: np.ones(shape, dtype)
# data = np.ones((2,3), 'object')
np.ones([2,3], 'object')

In [None]:
# method 5: np.empty()
np.empty(10) # memory is not initialized (contents are arbitrary), but creation is faster


In [None]:
# method 6: np.full(shape, fill_value)
np.full([2,3], 9.9) # filled by 9.9

In [None]:
# method 7: np.eye(N) / np.identity(N)
# np.eye
np.eye(5)

# np.identity
# np.identity(5)

In [None]:
# method 8a: np.random.random(shape)
np.random.random((3,3))

In [None]:
# method 8b: np.random.randint(low, high, shape)
np.random.randint(0, 100, (3,3))

In [None]:
# method 9: np.linspace(start, stop, num)

np.linspace(0, 1, 10) # 10 evenly spaced points from 0 to 1, both endpoints included
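
A smaller case makes the spacing easier to see: `num` counts points, not intervals. The variable names below are illustrative.

```python
import numpy as np

pts = np.linspace(0, 1, 5)  # 5 points -> 4 intervals
print(pts)                  # [0.   0.25 0.5  0.75 1.  ]

step = pts[1] - pts[0]
print(step)                 # 0.25

# endpoint=False leaves out the stop value, giving 5 intervals of width 0.2
half_open = np.linspace(0, 1, 5, endpoint=False)
print(half_open)            # [0.  0.2 0.4 0.6 0.8]
```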

### <font color=blue>Data types (dtype)</font>
* https://numpy.org/doc/stable/user/basics.types.html

**Some facts:**
* 1 byte = 8 bits
* signed -> positive, negative, 0
* unsigned -> positive, 0

**Type codes for the following types:**
1. int
1. float
1. bool
1. object
1. string
1. unicode


##### **Create array using <font color=red>Type codes</font>**

In [None]:
# Type code
# 1) int
#  i1  -> signed 8bit int, Type: int8 
#  i2  -> signed 16bit int, Type: int16
#  i4  -> signed 32bit int, Type: int32
#  i8  -> signed 64bit int, Type: int64

# u1,u2,u4,u8 -> unsigned int (uintX)

# 2) float
#  f2  -> 16bit floating point, Type: float16
#  f4  -> 32 bit floating point, Type: float32
#  f8  -> 64 bit floating point, Type: float64

# 3) bool
#  ?  -> True OR False, Type: bool

# 4) object
#  O  -> python object, Type: object

# 5) string
#  S  -> Fixed-length bytes (ASCII), S4 -> 4 ASCII characters, Type: bytes_ (alias string_, removed in NumPy 2.0)

# 6) unicode
#  U  -> Fixed-length Unicode, Type: str_ (alias unicode_, removed in NumPy 2.0)

dt = np.dtype('i2')
# dt

dt.byteorder

data = np.array([0,-1,2,3,4], dtype=dt)
data
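
As a quick check of the type codes listed above, the sketch below prints the size in bytes of each one. `codes` and `sizes` are illustrative names.

```python
import numpy as np

codes = ['i1', 'i2', 'i4', 'i8', 'u1', 'f2', 'f4', 'f8', '?', 'S4', 'U4']
sizes = {code: np.dtype(code).itemsize for code in codes}
for code, size in sizes.items():
    print(f"{code:>3}: {size} byte(s)")

# Integer ranges follow from the bit width.
print(np.iinfo('i1').min, np.iinfo('i1').max)  # -128 127
print(np.iinfo('u1').min, np.iinfo('u1').max)  # 0 255
```

Note that `U4` takes 16 bytes because NumPy stores each Unicode character in 4 bytes.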


##### **Create array using np data types (not type code here)**

In [None]:
# data = np.arange(10, dtype=np.int8) # 8bit integer
data = np.arange(10, dtype=np.int16) # 16bit integer
# data = np.arange(10, dtype=np.float64) # 64bit floating point

# data = np.array([0,-1,2,3,4], dtype=bool) # zero is False, non-zero is True
data = np.array([0,-1,2,3,4, 99999999], dtype=np.bytes_) # fixed-length byte string (shown with a b prefix); np.string_ is the NumPy 1.x alias
                
# data
data

# Byte order
# https://numpy.org/doc/stable/reference/generated/numpy.dtype.byteorder.html
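# (>) big-endian: most significant byte stored first
# (<) little-endian: least significant byte stored first
# (=) native byte order of the machine
# (|) not applicable

# S is a sequence of single bytes, so its byte order is always |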
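
The difference between the two byte orders shows up in the raw bytes. This is a small sketch; `big` and `little` are illustrative names.

```python
import numpy as np

big = np.array([1], dtype='>i2')     # big-endian 16-bit int
little = np.array([1], dtype='<i2')  # little-endian 16-bit int

print(big.tobytes())     # b'\x00\x01' (most significant byte first)
print(little.tobytes())  # b'\x01\x00' (least significant byte first)

# Both arrays hold the same value; only the memory layout differs.
print(big[0] == little[0])  # True

# Single-byte and byte-string dtypes have no byte order.
print(np.dtype('i1').byteorder, np.dtype('S4').byteorder)  # | |
```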

##### **Changing array dtype after creation**

In [None]:
data = np.array([0,-1,2,3,4, 11112222], dtype=np.bytes_) # np.string_ is the NumPy 1.x alias

# data.astype('?') # to boolean
# data.astype('i4') # to int32
data.astype('|S10')
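
A few more conversions, as a sketch. It uses a Unicode string array rather than the byte strings above, purely for illustration; `astype` always returns a new array and leaves the original unchanged.

```python
import numpy as np

text = np.array(['0', '-1', '2', '11112222'])  # fixed-length Unicode strings
numbers = text.astype('i4')                    # parse each string as int32
print(numbers.tolist())   # [0, -1, 2, 11112222]
print(text.dtype.kind)    # U (the original array is unchanged)

flags = numbers.astype('?')  # zero -> False, non-zero -> True
print(flags.tolist())        # [False, True, True, True]

floats = np.array([1.9, -1.9])
truncated = floats.astype('i4')  # float -> int drops the fractional part
print(truncated.tolist())        # [1, -1]
```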


---

## 2) **Universal functions: vectorized ops**

### <font color=red>Vectorized Computations</font>

In [None]:
data = np.arange(1, 11, dtype=np.int16) # 16bit integer
data

In [None]:
data + data # same shape
# np.add(data, data)

In [None]:
data - 1
# np.subtract(data, 1)

In [None]:
data * 2
np.multiply(data,2) # element-wise

In [None]:
data * data # same shape
# np.multiply(data, data) # element-wise

In [None]:
1/data
# np.divide(1, data)

In [None]:
data**2
np.square(data)

### <font color=blue>**Universal Functions (ufunc)**</font>
* perform <font color=red>**element-wise operations**</font> on ndarray
* vectorized computations listed above are all ufuncs
* Some ufuncs:
    1. np.add, np.subtract, np.negative
    1. np.multiply, np.divide
    1. np.power
    1. np.mod
    1. np.maximum, np.minimum
    1. np.square, np.sqrt
    1. np.exp
    1. np.log, np.log10, np.log2, np.log1p
    1. np.isnan
    1. np.sin, np.cos, np.tan
    1. np.greater, np.greater_equal
    1. np.less, np.less_equal
    1. np.equal, np.not_equal
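
A short sketch exercising a few ufuncs from the list, each applied element by element. The arrays `a` and `b` are made up for illustration.

```python
import numpy as np

a = np.array([1, 4, 9, 16])
b = np.array([2, 2, 20, 2])

diff = np.subtract(a, b)    # [-1, 2, -11, 14]
larger = np.maximum(a, b)   # [2, 4, 20, 16]
roots = np.sqrt(a)          # [1., 2., 3., 4.]
remainder = np.mod(a, 3)    # [1, 1, 0, 1]
is_greater = np.greater(a, b)  # [False, True, False, True]

print(diff, larger, roots, remainder, is_greater)

# Every function used here is an instance of np.ufunc.
print(isinstance(np.add, np.ufunc))  # True
```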

In [None]:
np.random.seed(1)
arr1 = np.array([-1,2,3,4,5])
arr2 = np.array([2,0,7,-1,8])

np.maximum(arr1, arr2)

In [None]:
arr1 = np.array([-1,2,3,4,5])
abs(arr1) # Python's built-in abs; np arrays work with some native functions
np.abs(arr1) # np.abs

# for performance, prefer the np version of a function when one exists

In [None]:
arr1 = np.array([1,2,3,4,5])
np.square(arr1)
# np.sqrt(arr1)

In [None]:
x = np.arange(10, step=0.1) 
s = np.sin(x)
c = np.cos(x)

# plot it
import matplotlib.pyplot as plot
plot.grid(True) 
plot.plot(x, s) # sin
plot.plot(x, c) # cos
plot.legend(['Sin(x)','Cos(x)'])

plot.title('Sin & Cos Wave')
plot.xlabel('X')
plot.ylabel('Amplitude')

plot.axhline(color='black')

In [None]:
arr1 = np.array([1,2,3,4,5]) 
# arr1 > 3 # it's ufunc too
# np.greater(arr1,3)

---

### <font color=blue>Aggregate functions</font>

In [None]:
data = np.arange(1,10) 

# numerical arrays only; these do not work on string arrays!
data.mean()
# np.mean(data)
# data.sum()
# data.max()
# data.min()
# data.argmin() # index val for min item
# data.argmax() # index val for max item

# reduce - combine all values of the array into a single result
# np.add.reduce(data) 
# np.sum(data)

# accumulate - the same aggregation, but keep the intermediate results
# np.add.accumulate(data)
# np.cumsum(data)
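
`reduce` and `accumulate` are methods available on binary ufuncs in general, not only on `np.add`. A brief sketch with illustrative variable names:

```python
import numpy as np

data = np.arange(1, 10)  # 1..9

total = np.add.reduce(data)        # same as data.sum()
running = np.add.accumulate(data)  # same as data.cumsum()
print(total)             # 45
print(running.tolist())  # [1, 3, 6, 10, 15, 21, 28, 36, 45]

product = np.multiply.reduce(data)  # 9! = 362880
running_max = np.maximum.accumulate(np.array([3, 1, 4, 1, 5]))
print(product)               # 362880
print(running_max.tolist())  # [3, 3, 4, 4, 5]
```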

### <font color=blue>Cumulative functions</font>

In [None]:
data = np.arange(1,10) 

# cumulative results
data.cumsum()
# data.cumprod()

# data = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
# data.cumsum(axis=1) # left-right
# data.cumsum(axis=0) # top-down

---
## 3) **Basic manipulations**

### <font color=blue>Index & Slice</font>

In [None]:
# 1d examples
arr = np.arange(10)

# indexing[pos]
arr[0] = 999  # use arr[pos] to select an item and update it
# arr[0] # use arr[pos] to get a value
# arr[0] = 'abc' # ValueError: the string cannot be converted to the array's int dtype

# slicing
# arr[:]  # all
# arr[1:3] # pos 1,2, stop at pos 3
# arr[::-1] # reverse order
# arr[::2] # step size 2

# WARNING: a slice is a view, not a copy
# view = arr[1:4]
# view[:] = [777,888,999] # updating the view also updates the original array

# copy = arr[1:4].copy() # copy first if you don't want to update the original array
# copy[:] = [777,888,999]
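
The view-versus-copy warning can be verified directly. This sketch uses `np.shares_memory` to show which arrays point at the same buffer; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)

view = arr[1:4]            # a slice is a view onto the same memory
view[:] = [777, 888, 999]
print(arr.tolist())        # [0, 777, 888, 999, 4, 5, 6, 7, 8, 9]
print(np.shares_memory(arr, view))  # True

copy = arr[1:4].copy()     # an independent buffer
copy[:] = -1
print(arr[1:4].tolist())   # [777, 888, 999] (the original is untouched)
print(np.shares_memory(arr, copy))  # False
```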

In [None]:
# 2d examples
arr2d = np.arange(16).reshape(4,4)
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11],
#        [12, 13, 14, 15]])

# indexing [row][col] OR [row, col]
# arr2d[0][0] # 0
# arr2d[1][2] # 6
# arr2d[1, 2] # 6 

# slicing[row, col]
# arr2d[1] # row 1
# arr2d[1:3] # row 1, 2
# arr2d[1:3, 1:3] # center parts

# update examples
# arr2d[1:3, 1:3] = np.full([2,2], -1)
# arr2d

In [None]:
# increase dimension
arr = np.arange(10) # 1d

arr[np.newaxis, :] # to 2d (add a new row axis): shape (1, 10)
# arr[:, np.newaxis] # to 2d (add a new column axis): shape (10, 1)

### <font color=blue>Boolean Mask</font>

In [None]:
animals = np.array(['cat', 'dog', 'fish', 'duck', 'cat'])
np.unique(animals)

animals == 'cat' # return a boolean mask
animals[ animals == 'cat' ] # select records 

np.random.seed(1)
data = np.random.randint(1,100, [5,5]) # assume each row of this data is associated with one animal
# data
# data[animals == 'cat'] # cat records
# data[~(animals == 'cat')] # non-cat records
# data[animals != 'cat'] # non-cat records
# data[(animals == 'cat') | (animals == 'dog')] # cat OR dogs 
# data[(animals == 'cat') & (animals == 'dog')] # cat AND dogs (nothing)

# data[data>30] # select all items > 30
# data[data>30] = -1 # assign values 

# any / all functions
# (data>30).any() # is any item True?
# (data>30).all() # are all items True?
# (data>30).sum() # count of items > 30
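
With deterministic data, the row selection done by a mask is easier to follow. In this sketch `data` is a made-up 5x3 array whose row `i` belongs to `animals[i]`.

```python
import numpy as np

animals = np.array(['cat', 'dog', 'fish', 'duck', 'cat'])
data = np.arange(15).reshape(5, 3)  # row i belongs to animals[i]

mask = animals == 'cat'
print(mask.tolist())        # [True, False, False, False, True]
print(data[mask].tolist())  # [[0, 1, 2], [12, 13, 14]]

# Combine masks with | and & (not the Python keywords `or` / `and`).
cat_or_dog = (animals == 'cat') | (animals == 'dog')
print(data[cat_or_dog].shape)  # (3, 3)

# True counts as 1, so sum() counts the matching items.
count = (data > 10).sum()
print(count)  # 4
```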

### <font color=blue>Sorting</font>

In [None]:
arr = np.array([4,54,2,6,8,-1,0]) 
arr.sort() # sorts in place
arr
# np.sort(arr) # returns a sorted copy and leaves arr unchanged
# np.sort(arr)[::-1] # descending order

# np.random.seed(1)
# arr2d = np.random.randn(3, 3)
# arr2d
# arr2d.sort(axis=0) # sort the values within each column
# arr2d.sort(axis=1) # sort the values within each row
# arr2d
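
The difference between the in-place method and the copying function is sketched below, together with `np.argsort`, which returns the ordering as indices. Variable names are illustrative.

```python
import numpy as np

arr = np.array([4, 54, 2, 6, 8, -1, 0])

sorted_copy = np.sort(arr)  # new array; arr keeps its original order
order = np.argsort(arr)     # indices that would sort arr
print(sorted_copy.tolist())  # [-1, 0, 2, 4, 6, 8, 54]
print(order.tolist())        # [5, 6, 2, 0, 3, 4, 1]
print(arr.tolist())          # [4, 54, 2, 6, 8, -1, 0]

result = arr.sort()  # sorts in place and returns None
print(result)        # None
print(arr.tolist())  # [-1, 0, 2, 4, 6, 8, 54]
```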


### <font color=blue>Fancy Index</font>

In [None]:
animals = np.array(['cat', 'dog', 'fish', 'duck']) 

# animals[[0,1]] # cat, dog
# animals[[-1,1]] # duck, dog
# animals[[1,1,1]] # dog, dog, dog


### <font color=blue>Transpose</font>

In [None]:
arr2d = np.arange(16).reshape(4,4)
arr2d
arr2d.T # transpose

arr2d.swapaxes(0,1)

### <font color=blue>Concatenate & Split</font>

In [None]:
arr1 = np.arange(16).reshape(2,8)
arr2 = arr1 + 10

# np.concatenate([arr1, arr2], axis=0) # vertically
# np.vstack([arr1, arr2]) # same result for 2-D arrays
# np.row_stack([arr1, arr2]) # alias of vstack (deprecated in NumPy 2.0)

# np.concatenate([arr1, arr2], axis=1) # horizontally
# np.hstack([arr1, arr2])
# np.column_stack([arr1, arr2])

# arr1
# array([[ 0,  1,  2,  3,  4,  5,  6,  7],
#        [ 8,  9, 10, 11, 12, 13, 14, 15]])
# arrayList = np.split(arr1, [1,3,5], axis=1) # a list of ndarrays, split before columns 1, 3 and 5
# type(arrayList[0])
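
To see what the split indices do, this sketch prints the shape of each piece and then joins the pieces back together. `pieces` and `rebuilt` are illustrative names.

```python
import numpy as np

arr1 = np.arange(16).reshape(2, 8)

# Split before columns 1, 3 and 5 -> columns [0:1], [1:3], [3:5], [5:8]
pieces = np.split(arr1, [1, 3, 5], axis=1)
shapes = [p.shape for p in pieces]
print(shapes)  # [(2, 1), (2, 2), (2, 2), (2, 3)]

# Concatenating along the same axis restores the original array.
rebuilt = np.concatenate(pieces, axis=1)
print(np.array_equal(rebuilt, arr1))  # True
```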


##### Stack helpers: **np.r_** & **np.c_**

In [None]:
arr1 = np.arange(16).reshape(2,8)
arr2 = arr1 + 10

# np.r_[arr1, arr2] # row stack
np.c_[arr1, arr2] # col stack
# np.c_[ np.r_[arr1, arr2], np.r_[arr1, arr2] ] # nested example

# both helpers also accept slices or lists to create arrays
# np.c_[:10] # column vector, shape (10, 1)
# np.c_[10:0:-1] # column vector in reverse order
# np.r_[:10] # 1-D array, shape (10,)

# several slices become the columns of one array
# np.c_[1:5, 6:10, 11:15]

### <font color=blue>File I/O</font>

In [None]:
arr = np.arange(20)
filename = 'my_arr'

# save ndarray to a file
# np.save(filename, arr)

# load ndarray from a file
# np.load(f'{filename}.npy')

# compress and save multiple arrays to a file
# np.savez_compressed(filename, arr1=arr, arr2=np.arange(10)) # arr1 and arr2 are the keys; any keyword names work
# tmp = np.load(f'{filename}.npz')
# tmp['arr1']
# tmp['arr2']

### <font color=blue>Linear Algebra</font>

1. np.dot
1. np.diag
1. np.linalg.det
1. np.linalg.eig
1. np.linalg.inv
1. np.linalg.solve, etc.

##### dot product
* https://www.mathsisfun.com/algebra/matrix-multiplying.html

In [None]:
a = np.array([[1,2,3], [4,5,6]])
b = np.array([[7,8],[9,10],[11,12]])

np.dot(a, b)
# a.dot(b)

##### determinant
* https://www.mathsisfun.com/algebra/matrix-determinant.html

In [None]:
X = np.array([[6,1,1],[4,-2,5], [2,8,7]])
np.linalg.det(X)

##### inverse
* https://www.mathsisfun.com/algebra/matrix-inverse.html

In [None]:
np.random.seed(1)
X = np.random.rand(3,3)
np.linalg.inv(X)

##### diagonal

In [None]:
a = np.arange(9).reshape((3,3))
np.diag(a)

##### solve linear matrix equations

> x + y + z = 6 <br>
> 2y + 5z = −4 <br>
> 2x + 5y − z = 27<br>
    
* https://www.mathsisfun.com/algebra/systems-linear-equations-matrices.html

In [None]:
A = [[1,1,1], [0,2,5], [2,5,-1]]  
B = [6, -4, 27]

np.linalg.solve(A, B)
# np.linalg.inv(A).dot(B)
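
One way to check the solution is to substitute it back into the equations. A minimal sketch, with `x` and `x_via_inverse` as illustrative names:

```python
import numpy as np

A = np.array([[1, 1, 1], [0, 2, 5], [2, 5, -1]])
B = np.array([6, -4, 27])

x = np.linalg.solve(A, B)
print(x)  # approximately [ 5.  3. -2.]

# Substituting back: A.x should reproduce B (up to rounding).
print(np.allclose(np.dot(A, x), B))  # True

# The explicit inverse gives the same answer, but solve() is usually
# preferred because it avoids forming the inverse.
x_via_inverse = np.dot(np.linalg.inv(A), B)
print(np.allclose(x, x_via_inverse))  # True
```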

---

## 4) **Broadcast: vectorized ops on arrays with different shapes**

### <font color=blue>**Broadcast**</font>
* Doing arithmetic with arrays of <font color=red>**different shapes**</font>

* Examples and diagrams are taken from the official NumPy documentation:
    * https://numpy.org/doc/stable/user/basics.broadcasting.html
    * https://numpy.org/devdocs/user/theory.broadcasting.html 

##### Arrays with **same shape** -> **element-wise** computations

In [None]:
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0]) # all 2 here

a * b # element-wise multiplication -> a and b have the same shape

##### Arrays **NOT** in **same shape**
<font color=green>**CASE 1.**</font> One of them is a <font color=red>**scalar** (number)</font>

1. The **smaller array** must **stretch** to match the shape of the **larger array**
1. After matching the shape -> **element-wise** computations

In [None]:
a = np.array([1.0, 2.0, 3.0])
b = 2 # b is a scalar now

a * b # the shapes do not match -> broadcasting (conceptually, b stretches to match a)

![ndarray with a scalar](https://numpy.org/devdocs/_images/theory.broadcast_1.gif)

Source: https://numpy.org/devdocs/user/theory.broadcasting.html

<font color=green>**CASE 2**.</font> Both are arrays, but their **shapes** differ

> **RULE**: In order to broadcast, the size of the **trailing axes** for both arrays in an operation **must either be the same size** or **one of them must be one**.

In [None]:
a = np.array([[ 0.0,  0.0,  0.0],
            [10.0, 10.0, 10.0],
            [20.0, 20.0, 20.0],
            [30.0, 30.0, 30.0]])
b = np.array([0, 1.0, 2.0])
# b = np.array([[0, 1.0, 2.0],
#               [0, 1.0, 2.0],
#               [0, 1.0, 2.0],
#               [0, 1.0, 2.0]])
a + b
# a.shape # (4, 3)
# b.shape # (3,)

#######################
# Broadcast rule check
#######################
# (4, 3) 
# (3,)

# step 1: smaller array -> prepend 1 on the left (increase the dimension)
# (4, 3) => (4, 3) # no change
# (3,)   => (1, 3) # prepend 1 on the left

# Conclusion => Compatible (each pair of axes is equal or contains a 1)

# step 2: stretch every size-1 axis to match the larger array
# (4, 3) => (4, 3) # no change
# (3,)   => (1->4, 3) # 1 stretches to 4

# step 3: the shapes now match, so do element-wise computation on the stretched arrays
# (4, 3) 
# (4, 3)
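
The rule check can also be written as a small function. `broadcast_shape` is a hypothetical helper written for this sketch, not a NumPy function; its result is compared against the shape NumPy itself reports via `np.broadcast`.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: return the broadcast shape or raise ValueError."""
    ndim = max(len(shape_a), len(shape_b))
    # Step 1: prepend 1s so both shapes have the same number of axes.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim_a, dim_b in zip(a, b):
        # Step 2: each pair must be equal, or one of them must be 1.
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"incompatible shapes {shape_a} and {shape_b}")
    return tuple(result)

print(broadcast_shape((4, 3), (3,)))  # (4, 3)
print(broadcast_shape((4, 1), (3,)))  # (4, 3)

# Cross-check against NumPy's own result.
numpy_shape = np.broadcast(np.empty((4, 1)), np.empty((3,))).shape
print(numpy_shape)  # (4, 3)
```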

![Broadcast](https://numpy.org/devdocs/_images/theory.broadcast_2.gif)

Source: https://numpy.org/devdocs/user/theory.broadcasting.html

<font color=red>**Incompatible**.</font> Broadcasting is not possible

In [None]:
a = np.array([[ 0.0,  0.0,  0.0],
            [10.0, 10.0, 10.0],
            [20.0, 20.0, 20.0],
            [30.0, 30.0, 30.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])

#######################
# Broadcast rule check
#######################
# (4, 3)
# (4,)   => the trailing axes are 3 and 4: not the same size, and neither is 1
# even after prepending 1 on the left, (1, 4) vs (4, 3): 4 still does not match 3
# Conclusion => incompatible 

# a + b # ValueError => incompatible shapes for broadcasting
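
The failure can be observed safely with `try`/`except`. The fix shown afterwards assumes the intent is to add `b[i]` to every element of row `i`; if the intent were different, another reshape would be needed.

```python
import numpy as np

a = np.arange(12.0).reshape(4, 3)   # shape (4, 3)
b = np.array([1.0, 2.0, 3.0, 4.0])  # shape (4,)

try:
    a + b
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)  # NumPy's message about mismatched shapes

# Assumed intent: one value of b per row of a.
# (4, 3) + (4, 1) is compatible because the trailing axis of b is now 1.
result = a + b[:, np.newaxis]
print(result.shape)  # (4, 3)
```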

![Broadcast](https://numpy.org/devdocs/_images/theory.broadcast_3.gif)

Source: https://numpy.org/devdocs/user/theory.broadcasting.html

In [None]:
a = np.array([0.0, 10.0, 20.0, 30.0])
b = np.array([0.0, 1.0, 2.0])

a.shape # (4,)
b.shape # (3,)

a[:,np.newaxis] # shape => (4, 1)
# array([[ 0.],
#        [10.],
#        [20.],
#        [30.]])
a[:,np.newaxis] + b

#######################
# Broadcast rule check
#######################
# (4, 1) => trailing axis is 1
# (3,)   => trailing axis is 3
# Conclusion => Compatible (one of the trailing axes is 1)

# (4, 1) => (4, 1->3) # stretch 1 to 3 to match the other array
# (3,)   => (1, 3)    # prepend 1 on the left to increase the dimension
# (1, 3) => (1->4, 3) # stretch 1 to 4 to match the other array
# result shape: (4, 3)


![Broadcast](https://numpy.org/devdocs/_images/theory.broadcast_4.gif)

Source: https://numpy.org/devdocs/user/theory.broadcasting.html

In practice,
* one of the arrays can be very large
* since the stretching is conceptual only, it does not require extra memory for stretched copies

In [None]:
np.random.seed(1)
array = np.random.random([1000,5])
mean = array.mean(axis=0) # mean for each col

array - mean # the two shapes differ, so this is a broadcast

#######################
# Broadcast rule check
#######################
# (1000, 5)
# (5,)

# step 1: smaller array -> prepend 1 on the left (increase the dimension)
# (1000, 5) => (1000, 5) # no change
# (5,)   => (1, 5) # prepend 1 on the left

# Conclusion => Compatible
# (1, 5)   => (1->1000, 5) # stretch 1 to 1000

# it is conceptual: no (1000, 5) mean array is ever built to perform this operation!
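
One way to see that the stretch costs no memory is `np.broadcast_to`, which returns a read-only view of the stretched array. In this sketch the row stride of that view is 0, meaning every row reuses the same five values.

```python
import numpy as np

np.random.seed(1)
array = np.random.random([1000, 5])
mean = array.mean(axis=0)  # shape (5,)

# A read-only view that looks like (1000, 5) but copies nothing.
stretched = np.broadcast_to(mean, array.shape)
print(stretched.shape)    # (1000, 5)
print(stretched.strides)  # (0, 8) -> stepping to the next row moves 0 bytes
print(np.shares_memory(stretched, mean))  # True

# After subtracting the column means, each column averages to ~0.
centered = array - mean
print(np.allclose(centered.mean(axis=0), 0.0))  # True
```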