# NumPy: Arrays and Vectorized Computations
## DAT540 Introduction to Data Science
## University of Stavanger

#### Antorweep Chakravorty (antorweep.chakravorty@uis.no)

- **NumPy**, or *Numerical Python*, is one of the most important foundational packages for numerical computing in Python
 - *ndarray*, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible *broadcasting* capabilities
 - Mathematical functions for fast operations on entire arrays of data without having to write loops
 - Tools for reading/writing array data to disk and working with memory-mapped files
 - Linear algebra, random number generation and Fourier transform capabilities
 - A C-API for connecting NumPy with libraries written in C, C++, or FORTRAN
 - The NumPy C-API allows data to be passed easily to external libraries written in a low-level language, and lets those libraries return data to Python as NumPy arrays

  - The main areas of functionality for the use of NumPy arrays for data analysis:
    - Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kind of computation
    - Common array algorithms like sorting, unique, and set operations
    - Efficient descriptive statistics and aggregating/summarizing data
    - Data alignment and relational data manipulation for merging and joining heterogeneous datasets
    - Expressing conditional logic as array expressions instead of loops 
    - Group-wise data manipulations (aggregation, transformation, function application)

- NumPy based algorithms are generally 10 to 100 times faster than their pure python counterparts and use significantly less memory

In [None]:
import numpy as np
# one million numbers
n = 1000000
my_arr = np.arange(n)
my_list = list(range(n))
# Now let us multiply each sequence by 2

In [None]:
import sys

In [None]:
print(sys.getsizeof(my_arr)) #bytes
%timeit my_arr * 2

In [None]:
print(sys.getsizeof(my_list)) # bytes; counts only the list's array of references, not the int objects it points to
%timeit [v * 2 for v in my_list]

- **ndarray**: a multidimensional array object that provides a fast and flexible container for large homogeneous datasets

In [None]:
import numpy as np
data = np.random.randn(2,3)
data

 - ndarrays also allow arithmetic directly with scalars and with other arrays; both are applied element-wise

In [None]:
print(data * 10)
print('-' * 100)
print(data + data)

 - Each ndarray has a *shape*, a tuple giving the size of each dimension, and a *dtype*, an object describing the data type of the array

In [None]:
print(data.shape)
print(data.dtype)

 - An ndarray can be created using the **array** function
 - It accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data

```python
data1 = [6,7.5,8,0,1]
arr1 = np.array(data1)
```
 - Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array

```python
data2 = [[1,2,3,4], [5,6,7,8]]
arr2 = np.array(data2)
```
 - What will be the shape of arr2?
 - In addition to *np.array*, there are other functions for creating new arrays.
 - **zeros** and **ones** create arrays of 0s or 1s respectively
 - **empty** creates an array without initializing its values, so it may contain zeros, ones or arbitrary garbage values

In [None]:
print(np.zeros(10))
print('-' * 100)
print(np.zeros((2,4))) # multidimensional
print('-' * 100)
print(np.ones(10))
print('-' * 100)
print(np.empty(10))

  - **arange** is an array-valued version of the built-in Python *range* function

In [None]:
np.arange(15)

  - Array creation functions

  <img src='./images/arr_creation_func.png'>
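
A few further creation functions can be sketched as follows. The selection here (`full`, `zeros_like`, `eye`) is an illustrative choice and is not taken from the table in the image; the variable names are made up for this example.

```python
import numpy as np

filled = np.full((2, 3), 7)             # 2x3 array where every element is 7
template = np.arange(6).reshape(2, 3)
zeros = np.zeros_like(template)         # zeros with the shape and dtype of template
identity = np.eye(3)                    # 3x3 identity matrix (float64 by default)

print(filled)
print(zeros)
print(identity)
```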

- *NumPy* - a multidimensional array object
  - **Data Types**
    - *dtype* is a special object containing the information (metadata) about the data stored in an ndarray
    - dtypes map directly onto an underlying disk or memory representation
    - This makes it easy to read and write binary streams of data to disk and to connect to code written in a low-level language like C

    <img src='./images/numpy_datatypes.png'>

  - A NumPy array can be converted, or cast, from one dtype to another using the ndarray's **astype** method
  - astype accepts a NumPy data type or the dtype of an existing ndarray
  - By default, calling astype creates a new array (a copy of the data), even if the new dtype is the same as the old dtype

In [None]:
import numpy as np
arr = np.arange(5)
print(arr.dtype)
arr = arr.astype(float)
print(arr.dtype)

In [None]:
b = np.array([1.0,2,3], dtype=int) # the dtype can also be specified while creating the array
print(b.dtype)
b = b.astype(float)
print(b.dtype)
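
Casting can change the values as well as the type. The sketch below uses made-up values to show that casting floats to integers drops the fractional part, and that strings representing numbers can be converted to a numeric dtype.

```python
import numpy as np

floats = np.array([3.7, -1.2, 0.5])
ints = floats.astype(np.int32)            # the fractional part is truncated toward zero
print(ints)                               # values 3, -1, 0

numeric_strings = np.array(['1.25', '-9.6', '42'])
numbers = numeric_strings.astype(float)   # each string is parsed as a float
print(numbers.dtype)                      # float64
```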

- **Pseudorandom Number Generator**
  - The numpy.random module supplements the built-in Python random module with functions for efficiently generating whole arrays of sampled values from many kinds of probability distributions
  - They are termed pseudorandom numbers because they are generated by an algorithm with deterministic behavior based on the *seed* of the random number generator
  - The data generation functions in numpy.random use a global random seed
  
  ```python
  import numpy as np
  np.random.seed(123)
  np.random.normal(size=(4,4))
  ```
  - To avoid global state, *numpy.random.RandomState* can be used to create a random number generator isolated from others
  
  ```python
  import numpy as np
  rgn = np.random.RandomState(1234)
  rgn.normal(size=(4,4))
  ```
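
NumPy 1.17 and later also provide `numpy.random.default_rng`, which returns an isolated `Generator` object. A minimal sketch, assuming such a version is installed; the seed value and variable names are arbitrary:

```python
import numpy as np

rng_a = np.random.default_rng(1234)   # independent generator with its own state
rng_b = np.random.default_rng(1234)   # same seed, so the same stream of numbers

sample_a = rng_a.normal(size=(4, 4))
sample_b = rng_b.normal(size=(4, 4))
print(np.array_equal(sample_a, sample_b))   # True: identical seeds give identical samples
```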

<img src='./images/np_random.png' width='450'>
  

 - Arithmetic Operations
   - ndarrays offer *vectorization*, that is, batch operations on data without writing any for loops
   - Any arithmetic operation between equal-size arrays is applied element-wise
   - Operations with scalars propagate the scalar argument to each element in the array
   - Comparisons between arrays of the same size yield boolean arrays
   - Operations between arrays of different but compatible shapes are handled by *broadcasting*
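
Broadcasting is not demonstrated in the cells that follow, so here is a small sketch with made-up values. A one-dimensional row and a column-shaped array are each combined with a 2x3 array.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)      # values [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])          # shape (3,)
col = np.array([[100], [200]])        # shape (2, 1)

# row is stretched along axis 0 to match the (2, 3) shape of arr
row_result = arr + row                # values [[10, 21, 32], [13, 24, 35]]
# col is stretched along axis 1
col_result = arr + col                # values [[100, 101, 102], [203, 204, 205]]
print(row_result)
print(col_result)
```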

In [None]:
# Operation between two equally sized arrays
arr1 = np.arange(4).reshape(2,2) # reshape arranges the array into the given shape
arr2 = np.arange(4, 8).reshape(2,2) # here arange generates the numbers from 4 up to, but not including, 8
print('arr1 shape:', arr1.shape)
print('arr2 shape:', arr2.shape)

# Operations between equal-sized arrays are applied element-wise
result1 = arr1 * arr2
print('op: arr1 * arr2')
print(result1)
print('op: arr1 - arr2')
result2 = arr1 - arr2
print(result2)

In [None]:
# Operations with scalars propagate the scalar argument to each element in the array
result3 = arr1 / 10
print('op: arr1 / 10')
print(result3)

# Comparisons between arrays of the same size yield a boolean array
np.random.seed(12345) # a random seed is a number used to initialize a pseudorandom number generator
arr3 = np.random.randint(-100, 100, size=4).reshape(2,2) # randint generates random ints; args are low (inclusive), high (exclusive) and size
arr4 = np.random.randint(-100, 100, size=4).reshape(2,2) # reshape arranges a sequence of numbers into the given shape
result4 = arr4 > arr3
print('op: arr4 > arr3')
print(result4)

 - **Basic Indexing and Slicing**
   - Indexing and slicing on one-dimensional ndarrays look similar to Python lists on the surface
   - If a scalar value is assigned to a slice, as in arr[5:8] = 12, the value is propagated (or broadcast) to the entire selection
   - What happens if you do a bare slice assignment: arr[:] = 12?
   - Array slices on ndarrays are views on the original array. The data is not copied, and any change to the view is reflected in the original array
   - A copy of an array slice can be created using *copy.copy* or, more directly, the **copy** method of any ndarray
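
The difference between a view and a copy can be checked with `np.shares_memory`. This is a minimal sketch with made-up values; the names `view` and `snapshot` are only for illustration.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]                          # basic slicing returns a view
view[1] = 12345                          # writes through to arr[6]
print(arr[6])                            # 12345
print(np.shares_memory(arr, view))       # True

snapshot = arr[5:8].copy()               # an explicit copy owns its data
snapshot[:] = 0                          # arr is unaffected
print(np.shares_memory(arr, snapshot))   # False
print(arr[5:8])                          # still the values 5, 12345, 7
```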

In [None]:
#Basic Indexing and Slicing
arr = np.arange(10)
print('arr: ', arr)
print(type(arr[5:8]))
arr[5:8] = 12
print('arr: ', arr)

In [None]:
import copy
arr1 = copy.copy(arr[1:4])
arr1[0] = 99
arr2 = arr[7:10].copy()
arr2[:] = 100
print('arr1: ', arr1)
print('arr2: ', arr2)
print('arr: ', arr)

   - Elements in a multidimensional ndarray can be accessed directly by passing a comma separated list of indices to select individual elements

<img src='./images/numpy_indexing.png' width='250'>

In [None]:
# accessing multidimensional arrays
arr2d = np.random.randint(0, 100, 4).reshape(2,2)
print('arr2d=', arr2d)
print('arr2d[0][1]=', arr2d[0][1], ' is similar to arr2d[0,1]=', arr2d[0,1])

In [None]:
arr3d = np.random.randint(0, 100, 20).reshape(2,5,2)
print('arr3d shape:', arr3d.shape)
print('arr3d[0] shape (a 5x2 array):', arr3d[0].shape)
print('arr3d[0, 1] shape (a 1d array of 2 elements):', arr3d[0, 1].shape)
print('individual element:', arr3d[0, 1, 1])

   - Modifying and copying slices in a multidimensional array

   - **Indexing with slices**
    - 1d ndarrays can be sliced in a similar fashion to normal Python lists
    - Multidimensional arrays can be sliced on each **axis**, generating a view on a sub-array or element
    - *Axes* are the dimensions of an ndarray. For example, a 3d array has 3 axes.
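
As a small sketch of how axes behave under indexing (the array here is made up for illustration): a slice keeps the axis it is applied to, while an integer index removes that axis from the result.

```python
import numpy as np

arr3d = np.arange(24).reshape(2, 3, 4)
print(arr3d.ndim)               # 3 axes
print(arr3d[1:, :2].shape)      # (1, 2, 4): slices keep the axis
print(arr3d[1, :2].shape)       # (2, 4): an integer index drops axis 0
print(arr3d[:, 0, :].shape)     # (2, 4): an integer index drops axis 1
```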

In [None]:
np.random.seed(123)
arr1d = np.random.randint(0, 100, 5)
print(arr1d[1:4]) # get all elements from index 1 up to, but not including, index 4

In [None]:
# Two-dimensional array slicing
arr2d = np.random.randint(0, 100, 9).reshape(3,3)
print('complete arr2d:\n', arr2d)
print('sliced arr2d[:2]\n', arr2d[:2]) # get all sub-arrays from index 0 up to, but not including, index 2 on axis 0

In [None]:
# get indices 0 and 1 on axis 0, and indices 1 and 2 on axis 1 (the stop index is excluded)
print('sliced arr2d[:2, 1:3]\n', arr2d[:2, 1:3]) 

In [None]:
arr = np.arange(10)
print('arr:', arr)
print('arr-f:', arr[0:5:2])
print('arr-b:', arr[-5::3])


<img src='./images/2darray_slicing.png' width='150'>

In [None]:
#Swapping Values
np.random.seed(123)
arr3d = np.random.randint(0, 100, 8).reshape(2,2,2)
print('arr3d (original):\n', arr3d)
print('shape of array', arr3d.shape)
print('slice 0,0', arr3d[0,0])

In [None]:
# Swapping values at arr3d[0,0] with arr3d[1,1]
temp = arr3d[0,0]
arr3d[0,0] = arr3d[1,1] # assigning a multidimensional sequence
arr3d[1,1] = temp # is it a success?

In [None]:
print('arr3d (swapped):\n', arr3d) # Why?

In [None]:
#Swapping Values
np.random.seed(123)
arr3d = np.random.randint(0, 100, 8).reshape(2,2,2)
print('arr3d (original):\n', arr3d)
# Swapping values at arr3d[0,0] with arr3d[1,1]. The right way
temp = arr3d[0,0].copy() # store a copy
arr3d[0,0] = arr3d[1,1] # assigning a multidimensional sequence
arr3d[1,1] = temp 
print('arr3d (swapped):\n', arr3d)

In [None]:
# What happens when we only swap a single element
# Swapping values at arr3d[0,0,0] with arr3d[1,1,1] and vice versa
np.random.seed(123)
arr3d = np.random.randint(0, 100, 8).reshape(2,2,2)
print('arr3d (original): \n', arr3d)

temp = arr3d[0,0,0]
arr3d[0,0,0] = arr3d[1,1,1] # assigning a single element
arr3d[1,1,1] = temp

In [None]:
print('arr3d (swapped):\n', arr3d)

  - **Boolean Indexing**
  

In [None]:
np.random.seed(123)
data = np.random.randn(7, 4) # randn draws samples from the standard normal distribution; the args d1, d2, ... give the shape of the returned array
names = np.array(['Bob', 'Bob', 'Will', 'Joe', 'Will', 'Joe', 'Joe'])
names

In [None]:
data

   - Let each name correspond to a row in the data array
   - We want to select all rows with the corresponding name 'Bob'
   - Like arithmetic operations, comparisons (e.g. ==) with arrays are also vectorized, returning a boolean array

In [None]:
names == 'Bob'

   - The boolean array can be passed when indexing the array

In [None]:
data[names == 'Bob']

   - The boolean array must be of the same length as the array axis it is indexing
   - A boolean array used for indexing can also be combined with slices, integers or other index sequences to select along different axes

In [None]:
data[names == 'Bob', 2:3]

   - To select everything but 'Bob', we can either compare with **!=** or negate the condition with **\~**

In [None]:
print(data[names != 'Bob'])
print(' ')
print(data[~ (names == 'Bob')])

   - Multiple boolean conditions can be combined using the **&** (and) or **|** (or) operators; the Python keywords *and* and *or* do not work with boolean arrays

In [None]:
print(data[(names == 'Bob') | (names == 'Will')])

In [None]:
np.random.seed(123)
data = np.random.randn(7, 4)
print(names == 'Bob')
data1 = data[0:2, 1:3] #Bob is present on axis=0 indices 0 and 1. Using basic indexing here
data1[:] = -1

In [None]:
print('data:\n', data)
print('data1:\n', data1)

In [None]:
data2 = data[names == 'Bob', 1:3]
data2[:] = 0

In [None]:
print('data:\n', data)
print('data2:\n', data2)

   - Selecting data from an array by boolean indexing always creates a **copy** of the data, even if the returned array is unchanged
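
Selection and assignment behave differently here. The following sketch, with made-up values, shows that editing the selected copy leaves the original untouched, while assigning through the boolean mask writes into the original array.

```python
import numpy as np

data = np.array([[1.0, -2.0], [-3.0, 4.0]])

selected = data[data < 0]                # boolean indexing returns a copy
selected[:] = 0                          # modifies only the copy
negatives_left = int((data < 0).sum())   # data still holds its 2 negative values
print(negatives_left)

data[data < 0] = 0                       # assignment through the mask changes data itself
print(data)                              # negatives are now replaced by 0
```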

 - **Fancy Indexing**
    - Fancy indexing is the term used to describe indexing with integer arrays

In [None]:
# Creating an 8x4 array where each row has its index number as the value for all its columns
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr

In [None]:
# Selecting rows using fancy indexing on axis=0
arr[[1,3,2,1,7]] # What do you observe here?

In [None]:
# Using negative indices
arr[[-1,-3,-2]]

   - Passing multiple index arrays, one per axis, results in a one-dimensional array of the elements corresponding to each tuple of indices
   - The index arrays must all have the same shape, or be broadcastable to a common shape (for example, a scalar)

In [None]:
arr = np.arange(16).reshape(4,4)
arr

In [None]:
arr[[1,3,2], [0,3,2]]

   - To instead select the rectangular region formed by the chosen rows and columns, first select the rows (axis=0) and then index the columns (axis=1) of the result, and so forth:

In [None]:
arr[[1,3,2]][:,[0,3,2]]

   - Like **boolean indexing**, **fancy indexing** also creates a *copy* of the array
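
The rectangular selection can also be written in a single indexing step with `np.ix_`, which builds index arrays that broadcast against each other. This is offered as an alternative sketch, not as the approach used in these notes; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(16).reshape(4, 4)
rows, cols = [1, 3, 2], [0, 3, 2]

pairs = arr[rows, cols]                  # elements at (1,0), (3,3), (2,2)
chained = arr[rows][:, cols]             # two-step selection of the 3x3 region
region = arr[np.ix_(rows, cols)]         # the same region in one step

print(pairs)                             # values 4, 15, 10
print(np.array_equal(chained, region))   # True
print(np.shares_memory(arr, region))     # False: fancy indexing returned a copy
```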

 - **Transposing Arrays and Swapping Axes**
   - Transposing is a form of reshaping that returns a view on the underlying data without copying.
   - Arrays have the *transpose* method and also the special attribute *T*

In [None]:
arr = np.arange(15).reshape(3,5)
print('arr:\n', arr)
print('arr.T:\n', arr.T) # rows become cols and cols become rows. The shape also changes
a1 = arr.T

   - Transposing is useful while performing matrix computations.
   - For example, the inner matrix product can be computed using np.dot

In [None]:
arr = np.random.randn(2,2)
np.dot(arr, arr.T)
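
With fixed values the product is easy to check by hand. The sketch below also uses the `@` operator, which computes the same matrix product as `np.dot` for 2d arrays, and confirms that the transpose is a view. The array `X` is made up for this example.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])

gram = np.dot(X, X.T)               # (2, 3) times (3, 2) gives a (2, 2) matrix
same = X @ X.T                      # the @ operator computes the same product
print(gram)                         # values [[14, 32], [32, 77]]
print(np.array_equal(gram, same))   # True
print(np.shares_memory(X, X.T))     # True: the transpose is a view, not a copy
```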

   - For higher-dimensional arrays, *transpose* also accepts a tuple of axis numbers to permute the axes
   

In [None]:
arr = np.arange(24).reshape(2,3,4)
arr

In [None]:
# transpose without any arguments acts the same way as T: it reverses the order of all axes
arr.transpose()

In [None]:
print('arr:\n', arr)
# To permute only certain axes, pass the new axis order
arr.transpose((1,0,2)) # here axes 0 and 1 are swapped and axis 2 stays unchanged

   - **swapaxes** is a method that takes a pair of axis numbers and switches the indicated axes to rearrange the data

In [None]:
arr

In [None]:
arr.swapaxes(0,1) # same as arr.transpose((1,0,2))
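
The effect of these permutations is easiest to see on the shapes. A short sketch using the same 2x3x4 array; `permuted` is an illustrative name.

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)
print(arr.transpose().shape)                          # (4, 3, 2): all axes reversed
permuted = arr.transpose((1, 0, 2))
print(permuted.shape)                                 # (3, 2, 4)
# element (i, j, k) of arr ends up at position (j, i, k)
print(arr[1, 2, 3] == permuted[2, 1, 3])              # True
print(np.array_equal(permuted, arr.swapaxes(0, 1)))   # True
```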

- Universal Functions (*ufunc*): Fast Element Wise Array Functions
  - Perform element-wise operations on data in ndarrays
  - They act as vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results
  - Most ufuncs are simple element-wise transformations on an array, like sqrt or exp, and are called *unary* ufuncs

<img src='./images/unary_ufuncs.png' width='400'>

In [None]:
arr = np.arange(5)
print('arr: ', arr)
print('np.sqrt: ', np.sqrt(arr))
print('np.exp: ', np.exp(arr))


   - *binary* ufuncs take two arrays and return a single array as result, like maximum or add

<img src='./images/binary_ufuncs.png' width='250'>

In [None]:
x = np.random.randn(8)
y = np.random.randn(8)
np.maximum(x,y)

   - There are also ufuncs that can return multiple arrays, like modf that returns fractional and integral parts of a floating point array

In [None]:
arr = np.random.randint(1, 10, 10) * 5.7
remainder, whole_part = np.modf(arr)
print('arr: ', arr)
print('remainder: ', remainder)
print('whole_part: ', whole_part)

  - ufunc instance methods
    - Each NumPy ufunc has special methods for performing certain kinds of special vectorized operations

<img src='./images/ufuncs_instance_methods.png'>

In [None]:
# reduce takes a single array and aggregates its values, optionally along an axis, by performing a sequence of binary operations
# sum elements in an array
arr = np.arange(10).reshape(2,5)
print(arr)
np.add.reduce(arr)
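
For `np.add`, reduce along an axis matches the array's `sum` method along that axis. A small sketch to confirm this, with one extra reduction using `np.multiply` as an illustrative second ufunc:

```python
import numpy as np

arr = np.arange(10).reshape(2, 5)
col_sums = np.add.reduce(arr)                       # default axis=0: values 5, 7, 9, 11, 13
row_sums = np.add.reduce(arr, axis=1)               # values 10, 35
print(np.array_equal(col_sums, arr.sum(axis=0)))    # True
print(np.array_equal(row_sums, arr.sum(axis=1)))    # True

product = np.multiply.reduce(np.arange(1, 6))       # 1*2*3*4*5 = 120
print(product)
```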

In [None]:
# accumulate produces an array of the same size with the intermediate accumulated (cumulative sum) values
arr = np.arange(4).reshape(2,2)
print('arr:\n', arr)
print('np.add.accumulate(arr):\n', np.add.accumulate(arr)) 
print('np.add.accumulate(arr, axis=1):\n', np.add.accumulate(arr, axis=1)) 

In [None]:
# repeat (an ndarray method, not a ufunc method) duplicates each element the given number of times
arr = np.arange(3).repeat([1,2,2])
arr

In [None]:
# outer performs a pairwise cross-product between two arrays
arr1 = np.arange(3)
np.multiply.outer(arr, arr1) # arr is the repeated array from the previous cell; the result's number of dimensions is the sum of the inputs' numbers of dimensions

In [None]:
# reduceat performs local reduce, in essence an array groupby operation in which slices of the array are aggregated together
# it accepts a sequence of "bin edges" that indicates how to split and aggregate the values
arr = np.arange(10)
print('arr:\n', arr)
# The results are the reductions performed over arr[0:5], arr[5:8], and arr[8:]
# As with other methods, the axis argument can also be passed
np.add.reduceat(arr, [0, 5, 8])
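
The bin-edge behaviour can be verified by summing the three slices by hand. A minimal sketch; `by_hand` is an illustrative name.

```python
import numpy as np

arr = np.arange(10)
edges = [0, 5, 8]
grouped = np.add.reduceat(arr, edges)
# the same reductions written out as explicit slices
by_hand = np.array([arr[0:5].sum(), arr[5:8].sum(), arr[8:].sum()])

print(grouped)                            # values 10, 18, 17
print(np.array_equal(grouped, by_hand))   # True
```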

  - Writing new ufuncs in python
    - **numpy.frompyfunc** accepts a python function along with a specification for number of inputs and outputs

In [None]:
# Creating an element-wise add function
def add_elements(x, y):
    return x + y

add_them = np.frompyfunc(add_elements, 2, 1) # we could also use a lambda function instead of add_elements
add_them(np.arange(8), np.arange(8))
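
One detail worth knowing is that functions built with `frompyfunc` return arrays of Python objects. The sketch below shows this and, as an alternative that is not part of the cell above, uses `np.vectorize` with its `otypes` argument to declare the output dtype.

```python
import numpy as np

def add_elements(x, y):
    return x + y

add_them = np.frompyfunc(add_elements, 2, 1)
result = add_them(np.arange(8), np.arange(8))
print(result.dtype)                    # object: the results are Python objects
as_ints = result.astype(np.int64)      # convert back to a numeric dtype

# np.vectorize lets the output dtype be declared up front
add_vec = np.vectorize(add_elements, otypes=[np.int64])
vec_result = add_vec(np.arange(8), np.arange(8))
print(vec_result.dtype)                # int64
print(np.array_equal(as_ints, np.add(np.arange(8), np.arange(8))))   # True
```

Both wrappers still call the Python function once per element, so the built-in `np.add` remains the faster choice when a native ufunc exists.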