## NumPy (Numerical Python)

Efficient storage and manipulation of numerical arrays is fundamental in data science. Python's NumPy package provides specialized tools for handling numerical arrays.  
The core data type in NumPy is the ndarray, which enables fast and space-efficient multidimensional array processing.
NumPy has many features that won't be covered here, but we will cover some of its basic data types and operations.

More detailed documentation can be found at:
http://www.numpy.org

### Readings 

Chapter 2 from _Python Data Science Handbook_ by Jake VanderPlas

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

NumPy arrays are much more efficient than lists. Every element of an array has the same fixed type, as opposed to lists, which are flexible (dynamically typed).


![title](img/pyds.png)

In [5]:
# Lists are flexible (dynamically typed), but this flexibility comes at the cost of efficiency
l0 = [True, "54" , 3.00, 9]
[type(item) for item in l0]

[bool, str, float, int]
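
By contrast with the list above, a NumPy array stores every element with one common type. The following is a minimal sketch (the variable names `mixed` and `as_text` are illustrative and not part of the original notebook) of how `np.array` converts mixed input to a single dtype:

```python
import numpy as np

# Mixed ints and a float: NumPy upcasts everything to one float dtype
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)          # float64

# Mixing numbers and strings: every element is stored as a string
as_text = np.array([1, "54", 3.0])
print(as_text.dtype.kind)   # U  (unicode string)
```

Because the type is stored once per array rather than once per element, NumPy can keep the values in a compact block of memory.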

In [4]:
# First, import numpy

import numpy as np 

# we can use np.array to create arrays from Python lists

l1 = [1,3,5,7] 
l2 = [2,4,6,8]


# Create a 1-D array (shape (4,)) and a 2-D array (shape (2, 4))
arr1 = np.array(l1)
arr2 = np.array([l1,  l2])



In [8]:
#Basic info to describe the ndarrays

print(arr1.shape[0])

print(arr2.shape[0] , arr2.shape[1])

4
2 4


In [12]:
# some common arrays

# a) an array of all zeros

zero = np.zeros(5)

# b) an array of all ones

ones = np.ones(5)

# c) an identity array of size k where k = 5

identity = np.eye(5)

# d) an array of uniformly distributed random values
rand = np.random.random(5)


print(identity)

[zero, ones ,rand]

[array([0., 0., 0., 0., 0.]),
 array([1., 1., 1., 1., 1.]),
 array([0.06526006, 0.45899169, 0.84611527, 0.41968077, 0.37961491])]

In [15]:
#you can choose data type and dimensions as well

zero2 = np.zeros((3,4) , dtype = int)
zero2

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

#### Useful array attributes
Here we look at some useful array attributes.
Don't forget to check http://www.numpy.org for more information.

The most basic attributes are ndim (number of dimensions), shape (size of each dimension), size (total number of elements), and dtype (element type):


In [18]:
np.random.seed(0)  # seed for reproducibility   

a1 = np.random.randint(10, size=6)  # One-dimensional array
a2 = np.random.randint(10, size=(4, 5))  # Two-dimensional array   
a3 = np.random.randint(10, size=(3, 4, 6))  # Three-dimensional array

print("dimensions of a3 is: ", a3.ndim)
print("shape of a3 is: " , a3.shape)
print("size of a3 is: ", a3.size)
print("type of a3 is: ", a3.dtype)

dimensions of a3 is:  3
shape of a3 is:  (3, 4, 6)
size of a3 is:  72
type of a3 is:  int64


#### Array Indexing

Indexing a one-dimensional array works like indexing a regular Python list.
In a multidimensional array, you access items using a comma-separated tuple of indices.

In [20]:

# randn: return a sample from the “standard normal” distribution
arr_2d = np.random.randn(5,3)


print(arr_2d)

#A single index gets a full row
print("The second row: ", arr_2d[1])

# To access an individual value, use both indices
print("The element at [3,1]:" , arr_2d[3,1])


[[-0.42018339  0.99982969  0.43103415]
 [-0.65091287 -1.49874039 -1.23063497]
 [ 0.19400719 -0.99838235 -0.3676376 ]
 [ 1.73719932  0.59361275 -0.54236358]
 [-1.71967238 -0.57890879  1.42694855]]
The second row:  [-0.65091287 -1.49874039 -1.23063497]
The element at [3,1]: 0.5936127527039693
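
Because the random values above change from run to run, the indexing rules can be easier to verify on a small deterministic array. The sketch below uses an illustrative array named `demo` (not part of the original notebook) and also shows negative indices and a column slice, which follow the same tuple-of-indices pattern:

```python
import numpy as np

# A small deterministic array so the results are easy to check by eye
demo = np.arange(12).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]

row = demo[1]          # a single index returns the whole second row
value = demo[1, 2]     # row 1, column 2
last = demo[-1, -1]    # negative indices count from the end
column = demo[:, 0]    # a slice in the first position selects a full column

print(row)      # [4 5 6 7]
print(value)    # 6
print(last)     # 11
print(column)   # [0 4 8]
```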


#### Universal Functions

Computation on NumPy arrays can be very fast if we use vectorized operations.  
These operations, implemented through NumPy's universal functions (ufuncs), allow batch operations on data without any explicit for loops. 
Any arithmetic operation between equal-size arrays is applied element-wise. 

In [22]:
#Examples: 

array = np.arange(5)

print(array**2)
print(array+2)
print(- array)



[ 0  1  4  9 16]
[2 3 4 5 6]
[ 0 -1 -2 -3 -4]
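
To illustrate the element-wise rule for two equal-size arrays, here is a small sketch comparing an explicit Python loop with the vectorized form. The names `x`, `y`, `looped`, and `vectorized` are illustrative only:

```python
import numpy as np

x = np.arange(5)          # [0 1 2 3 4]
y = np.arange(10, 15)     # [10 11 12 13 14]

# Explicit Python loop, one pair of elements at a time
looped = np.array([a + b for a, b in zip(x, y)])

# Vectorized: the + operator dispatches to the np.add ufunc
vectorized = x + y

print(vectorized)                                # [10 12 14 16 18]
print(np.array_equal(looped, vectorized))        # True
print(np.array_equal(np.add(x, y), vectorized))  # True
```

Both forms give the same result; the vectorized version simply moves the loop out of Python and into NumPy's compiled code.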


#### Matrix Operations

NumPy facilitates matrix operations. Just make sure the shapes are compatible.
The standard multiplication operator (*) performs element-wise multiplication, while the _dot_ method computes the matrix product (for 1-D arrays, the inner product).

In [40]:

grid = np.arange(1, 10).reshape((3, 3))
print(grid)

grid2 = np.arange(11,20).reshape((3,3))
print(grid2)


print("Add:" , grid+grid2)



[[1 2 3]
 [4 5 6]
 [7 8 9]]
[[11 12 13]
 [14 15 16]
 [17 18 19]]
Add: [[12 14 16]
 [18 20 22]
 [24 26 28]]


In [41]:
# You can also find the transpose of a matrix

gridT = grid.T
print(gridT)

[[1 4 7]
 [2 5 8]
 [3 6 9]]


In [43]:
# Matrix Multiplication

print("Using standard multiplication operator: " )
print(grid*grid2)

print("Using dot operator: ")
print(grid.dot(grid2.T))

Using standard multiplication operator: 
[[ 11  24  39]
 [ 56  75  96]
 [119 144 171]]
Using dot operator: 
[[ 74  92 110]
 [182 227 272]
 [290 362 434]]
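
The shape requirement can be seen more directly with non-square matrices. The sketch below is an illustration with made-up arrays `A` and `B`; it uses the `@` operator, which for 2-D arrays gives the same result as the `dot` method, and shows that NumPy raises a `ValueError` when the inner dimensions do not match:

```python
import numpy as np

A = np.arange(1, 7).reshape(2, 3)   # shape (2, 3): [[1 2 3], [4 5 6]]
B = np.arange(1, 7).reshape(3, 2)   # shape (3, 2): [[1 2], [3 4], [5 6]]

# Inner dimensions match (3 and 3), so the product has shape (2, 2)
product = A @ B
print(product)
# [[22 28]
#  [49 64]]
print(np.array_equal(product, A.dot(B)))   # True

# (2, 3) @ (2, 3): inner dimensions 3 and 2 do not match
shapes_rejected = False
try:
    A @ A
except ValueError as err:
    shapes_rejected = True
    print("Incompatible shapes:", err)
```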


NumPy functions are significantly faster than their built-in Python counterparts, so use them for calculations whenever you can. 
NumPy has fast built-in aggregation functions for working on arrays. They are particularly useful in computing summary statistics.


![title](img/agg.png)

In [44]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

65.4 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
341 µs ± 5.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
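
As a quick illustration of using the aggregation functions for summary statistics, the sketch below applies a few of them to a small hand-made array (the name `scores` and its values are made up for this example):

```python
import numpy as np

scores = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(scores.min(), scores.max())   # 2.0 9.0
print(scores.sum())                 # 40.0
print(scores.mean())                # 5.0
print(np.median(scores))            # 4.5
print(scores.std())                 # 2.0 (population standard deviation)
```

Most aggregates are available both as array methods (`scores.mean()`) and as functions (`np.mean(scores)`); `median` is only available as a function.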


#### Multidimensional aggregates

By default, an aggregation function is applied to the entire array, but you can aggregate along rows or columns by specifying the axis. 
The axis keyword specifies the dimension that will be collapsed. So axis=0 means that the first axis will be collapsed: for a 2-dimensional array, this means that values within each column will be aggregated. 


In [49]:
M = np.random.random((5,4))
M

array([[0.18848218, 0.36402392, 0.66008906, 0.74667817],
       [0.0534757 , 0.90424808, 0.91822802, 0.54774195],
       [0.4437489 , 0.4001331 , 0.88743974, 0.49971592],
       [0.54092482, 0.92154715, 0.3452127 , 0.34641165],
       [0.61873058, 0.66986327, 0.58516602, 0.00616579]])

In [52]:
print(M.max())
print(M.max(axis=0))
print(M.max(axis=1))

0.9215471450238449
[0.61873058 0.92154715 0.91822802 0.74667817]
[0.74667817 0.91822802 0.88743974 0.92154715 0.66986327]
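
The effect of the axis argument is easier to check on a small array with known values. This sketch uses an illustrative 2 x 3 array named `table` (not from the original notebook) and shows which dimension is collapsed in each case:

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])    # shape (2, 3)

col_sums = table.sum(axis=0)   # collapse the rows: one value per column
row_sums = table.sum(axis=1)   # collapse the columns: one value per row

print(table.sum())                      # 21
print(col_sums)                         # [5 7 9]
print(row_sums)                         # [ 6 15]
print(col_sums.shape, row_sums.shape)   # (3,) (2,)
```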


## Exercises

Now it's your turn to do some practice with NumPy:

#### 1. Create a 1-dimensional NumPy array of 100 random integers

1.  Find and print all the summary statistics (mean, std, median, max, min, ...)
2.  Compare the time for finding the max using Python's built-in function and the corresponding NumPy function
3. Create a new array that is the base-2 logarithm of your array

In [None]:
# Your code here

#### 2. Create a 2-dimensional NumPy array of random integers with shape (3,4) (mat1)

1. Create another 2-D array that is the square root of your original array (mat2)
2. Find how many values are greater than 20 using the np.count_nonzero() function (you can also use np.sum())
3. Perform a dot product between the two 2-D arrays (mat3). Hint: the shapes must be compatible, so you may need to transpose one of them
4. Find all the values that are less than 10 or greater than 30 in mat3

In [None]:
# Your code here