**Data Science Numpy**

NumPy is one of the main libraries used in machine learning and data science. It provides a wide variety of mathematical computations, and its core is written in optimized C code.

In [1]:
# import packages with alias
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

**Numpy Arrays**

Arrays are collections of elements, usually all of the same type. A rank-1 array behaves like a list. A rank-2 array is a matrix, or a list of lists. A rank-3 array is a list of lists of lists, and so on.
We can create a NumPy array by calling `np.array()` with a regular Python list as its argument:

In [2]:
# 1D array 
a1 = np.array([1,2,3,4])
a1

array([1, 2, 3, 4])

One of the most commonly used properties of a NumPy array is its `shape`, which gives the size of the array along each dimension:

In [3]:
a1.shape

(4,)

We get a tuple with one entry per dimension; the length of this tuple is the rank (number of dimensions) of the array. In this case, the array above is one-dimensional, also called a flat array. We can also use a list of lists to obtain a clearer matrix-like shape:

In [4]:
# Rank 2 array 
a2 = np.array([[1,2,3,4]]) # initialized with a nested Python list
a2 

array([[1, 2, 3, 4]])

In [5]:
a2.shape

(1, 4)

This one looks more like a *row vector*. Similarly, we can create a *column vector* as follows:

In [6]:
# column vector 
a3 = np.array([[1], 
               [2], 
               [3], 
               [4]])
a3.shape

(4, 1)
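
The row and column vectors above are transposes of each other. The following is a minimal sketch of that relationship; it also shows that transposing a flat array leaves its shape unchanged, which is one reason to prefer explicit 2D shapes. The name `col` is chosen here for illustration only.

```python
import numpy as np

a1 = np.array([1, 2, 3, 4])       # flat array, shape (4,)
a2 = np.array([[1, 2, 3, 4]])     # row vector, shape (1, 4)

col = a2.T                        # transposing a row vector gives a column vector
print(col.shape)                  # (4, 1)

# A flat array has only one axis, so .T has nothing to swap
print(a1.T.shape)                 # (4,)
```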

Full matrices are initialized from nested lists in the same way:

In [7]:
M1 = np.array([[1,0,0], 
               [0,1,0], 
               [0,0,1], 
               [0,0,1]])
M1 

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 1]])

In [8]:
M1.shape

(4, 3)

Of course, this is just a tuple:

In [9]:
m = M1.shape[0] # number of rows
n = M1.shape[1] # number of columns 
print("m: {}  n:{}".format(m,n))

m: 4  n:3
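
Besides `shape`, arrays expose a couple of related attributes. As a small sketch using the same matrix as `M1`, `ndim` gives the rank directly and `size` gives the total number of elements; the variable names `rank`, `same_rank` and `n_elements` are illustrative only.

```python
import numpy as np

# Same matrix as M1 above
M1 = np.array([[1, 0, 0],
               [0, 1, 0],
               [0, 0, 1],
               [0, 0, 1]])

rank = M1.ndim              # number of dimensions (axes)
same_rank = len(M1.shape)   # the length of the shape tuple is the same number
n_elements = M1.size        # total number of elements: 4 * 3

print(rank, same_rank, n_elements)  # 2 2 12
```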


**Reshaping an array**

We can use the `np.reshape` function to change the dimensions of an array, as long as it contains the same number of elements and the dimensions make sense. E.g.: reshaping the M1 matrix into a row vector:

In [10]:
# use reshape function
# Make sure shape makes sense
M2 = np.reshape(M1, (1,12))
display(M2)
print(M2.shape)

array([[1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]])

(1, 12)
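
To illustrate what "the dimensions make sense" means, here is a short sketch. Passing `-1` for one dimension asks NumPy to infer it from the number of elements, while a target shape with the wrong element count raises a `ValueError`. The names `col` and `reshape_failed` are introduced here for the example.

```python
import numpy as np

M1 = np.array([[1, 0, 0],
               [0, 1, 0],
               [0, 0, 1],
               [0, 0, 1]])

# -1 lets NumPy infer this dimension from the element count (12 here)
col = np.reshape(M1, (-1, 1))
print(col.shape)  # (12, 1)

# 12 elements cannot fill a 5 x 3 array, so this raises ValueError
try:
    np.reshape(M1, (5, 3))
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```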


**Flat array to vector**

We can change a flat array into a 2D array (a column vector) as follows:

In [11]:
print("original a1.shape: ", a1.shape) 

# reshape into a vector
a1_vec = np.reshape(a1, (a1.shape[0],1))

print("reshaped: ", a1_vec.shape)

original a1.shape:  (4,)
reshaped:  (4, 1)
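
Two other common spellings of the same conversion are sketched below, assuming the same flat array `a1`: the `reshape` method with `-1`, and indexing with `np.newaxis`. The names `v1` and `v2` are illustrative.

```python
import numpy as np

a1 = np.array([1, 2, 3, 4])

v1 = a1.reshape(-1, 1)      # method form; -1 infers the length
v2 = a1[:, np.newaxis]      # indexing form; inserts a new axis of length 1

print(v1.shape, v2.shape)       # (4, 1) (4, 1)
print(np.array_equal(v1, v2))   # True
```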


**Numpy arrays are not lists**

Note that there is a crucial difference between lists and NumPy arrays!

In [12]:
# python list: 
l = [1,2,3,4] 

# np.array 
l_np = np.array(l)

# different print style 
print(l) 
print(l_np) # note the lack of commas 

[1, 2, 3, 4]
[1 2 3 4]


One thing we can see straight away is the printing style. We also have very different behaviour:

In [13]:
# different behaviour!! 
print(l*5) # multiplying a python list replicates it
print(l_np*5) # numpy applies the operation elementwise

[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
[ 5 10 15 20]
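
The same difference shows up with the `+` operator. As a brief sketch, adding two lists concatenates them, adding two arrays sums them elementwise, and scaling a plain list elementwise needs an explicit comprehension (the name `scaled` is illustrative):

```python
import numpy as np

l = [1, 2, 3, 4]
l_np = np.array(l)

print(l + l)        # [1, 2, 3, 4, 1, 2, 3, 4]  (concatenation)
print(l_np + l_np)  # [2 4 6 8]                 (elementwise sum)

# Elementwise scaling of a plain list needs a loop or a comprehension
scaled = [v * 5 for v in l]
print(scaled)       # [5, 10, 15, 20]
```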


**Dot Product**

We can “multiply” arrays whose inner dimensions match with the `np.dot(a1,a2)` function, which for 2D arrays requires `a1.shape[1]==a2.shape[0]`. An example is the mathematical dot product of two n-dimensional vectors (often used in machine learning), given by $x \cdot y = \sum_{i=1}^{n} x_i y_i$:

In [14]:
### Define two vectors 
x = np.array([1,9.0,3,4,5.5,6,7,8.9,10]) 
y = np.array([1,10,1,10,1,10,1,11,11]) 
x = np.reshape(x, (x.shape[0],1))
y = np.reshape(y, (y.shape[0],1))

# Verify shapes are correct 
assert x.shape == y.shape
print("x.shape: ", x.shape)
print("y.shape: ", y.shape)

x.shape:  (9, 1)
y.shape:  (9, 1)


With a loop, this would look like this:

In [15]:
import time
## Naive way 
t1 = time.time() 
res = 0 
for i in range(x.shape[0]): 
    res += float(x[i, 0]*y[i, 0]) # index the scalar entries directly
t2 = time.time() 
print("Took {} seconds".format(round(t2-t1, 10)))
print("dot: ", res)

Took 0.0002188683 seconds
dot:  414.4


Better way to do it:

In [16]:
## The RIGHT way
t1 = time.time() 
res = np.dot(x.T,y) # the .T === transpose
t2 = time.time() 
print("Took {} seconds".format(round(t2-t1, 10)))

# display
print("res.shape: ", res.shape)
print(res.item()) # extract the single entry as a Python float

Took 0.0001559258 seconds
res.shape:  (1, 1)
414.4
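
If the inputs are kept as flat arrays, no transpose or cast is needed. The sketch below uses the same values as above, without the reshape; it assumes 1D inputs, for which `np.dot` and the `@` operator both return a scalar. It also illustrates the shape rule for 2D arrays with two hypothetical matrices `A` and `B`.

```python
import numpy as np

x = np.array([1, 9.0, 3, 4, 5.5, 6, 7, 8.9, 10])
y = np.array([1, 10, 1, 10, 1, 10, 1, 11, 11])

d1 = np.dot(x, y)   # 1D inputs: the result is a scalar, not a (1, 1) array
d2 = x @ y          # the @ operator computes the same inner product
print(d1.shape)     # ()

# Shape rule for 2D arrays: (4, 3) times (3, 2) gives (4, 2)
A = np.ones((4, 3))
B = np.ones((3, 2))
print((A @ B).shape)  # (4, 2)
```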


The difference in speed becomes much more noticeable with a lot of data, as can be seen in the example below:

In [17]:
# Generate two huge arrays
x = np.random.randn(9999, 1)
y = np.random.randn(9999, 1)

# Verify shapes are correct 
assert x.shape == y.shape
print("x.shape: ", x.shape)
print("y.shape: ", y.shape)

## 1. Naive Method: 
t1 = time.time() 
res = 0 
for i in range(x.shape[0]): 
    res += float(x[i, 0]*y[i, 0]) # index the scalar entries directly
t2 = time.time() 
print("Naive method took {} seconds".format(round(t2-t1, 10)))

## 2. np.dot 

## The RIGHT way
t1 = time.time() 
res = np.dot(x.T,y).item() # .T is the transpose; .item() extracts the scalar
t2 = time.time() 
print("Numpy took {} seconds".format(round(t2-t1, 10)))

x.shape:  (9999, 1)
y.shape:  (9999, 1)
Naive method took 0.0330636501 seconds
Numpy took 0.0003883839 seconds


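This is because `np.dot` runs the whole computation in optimized, compiled code (typically a BLAS routine that can use vectorized CPU instructions and, for large inputs, several threads) instead of executing one interpreted Python operation per element.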
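
A single `time.time()` measurement can be noisy. As a hedged sketch, the standard-library `timeit` module can average several runs instead. The helper `loop_dot`, the seed, and the repeat count of 20 are choices made for this example, and the absolute timings will vary by machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for repeatability
x = rng.standard_normal(9999)
y = rng.standard_normal(9999)

def loop_dot(u, v):
    """Dot product of two flat arrays with an explicit Python loop."""
    total = 0.0
    for i in range(u.shape[0]):
        total += u[i] * v[i]
    return total

# Both methods should agree up to floating-point rounding
loop_result = loop_dot(x, y)
dot_result = np.dot(x, y)

# Average over several runs; absolute numbers depend on the machine
t_loop = timeit.timeit(lambda: loop_dot(x, y), number=20) / 20
t_dot = timeit.timeit(lambda: np.dot(x, y), number=20) / 20
print("loop: {:.6f} s   np.dot: {:.6f} s".format(t_loop, t_dot))
```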

**Generating random arrays**

One way to do this is with the `np.random.randn()` function, which draws samples from the standard normal distribution (mean 0, variance 1):

In [18]:
# sample from the standard normal distribution
M3 = np.random.randn(5,5)
M3 

array([[-1.21724292,  0.71143241,  0.0622101 ,  0.45708855,  0.08837573],
       [ 1.23258122,  1.30325301, -0.30211276, -0.5510608 ,  0.65149722],
       [ 1.12470324, -0.14554432,  0.18780847, -0.57850032, -0.47673519],
       [ 0.41036223, -0.73582556, -0.84919322, -0.08503235,  0.00587766],
       [-0.07189282, -0.57046579, -0.52272808,  0.21099149, -0.45730797]])

In [19]:
# Ex. initialize random 10-dimensional column vector
n = 10
theta = np.random.randn(n,1) 
theta

array([[-0.08643036],
       [-0.20076212],
       [ 0.7927608 ],
       [-1.3481705 ],
       [ 0.93064234],
       [-0.93215079],
       [-0.04173901],
       [-0.20908959],
       [ 0.0050671 ],
       [-0.2810268 ]])
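
The values above change on every run. When reproducible numbers are needed, one option is NumPy's `Generator` interface with a fixed seed. This is a sketch of an alternative to `np.random.randn`, not what the cells above use; the seed `42` and the names `rng1`, `rng2`, `theta1` and `theta2` are arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same stream
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

n = 10
theta1 = rng1.standard_normal((n, 1))   # shape is passed as a tuple here
theta2 = rng2.standard_normal((n, 1))

print(theta1.shape)                     # (10, 1)
print(np.array_equal(theta1, theta2))   # True: same seed, same draws
```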