# <font color='#eb3483'> Introduction to Numpy </font>


If there's one thing you should be comfortable with by the end of this course, it's numpy (NUMerical PYthon) and pandas - the backbone of data science in Python. Both packages bundle really helpful (and, behind the scenes, very fast) functions that you'll want to whip out time and time again.

Today we'll start with numpy - a package that takes lists and supercharges them to be lean, mean data science machines. By convention most people import numpy as np - before we continue make sure that the import statement works (i.e. numpy is installed in your environment)!

In [28]:
#the convention is to import numpy as np
import numpy as np

## <font color='#eb3483'> Arrays </font>

Numpy arrays can be thought of as generalizations of lists. Instead of having one dimension along which they may have multiple entries, arrays can have arbitrarily many axes!

### <font color='#eb3483'> 1 dimensional arrays (a.k.a vectors) </font>

In [29]:
#To create a numpy array you can feed in a standard python list as the input argument
example1 = np.array([4, 5, 3])
print(example1)
type(example1)

[4 5 3]


numpy.ndarray

In [30]:
#Let's check out how long our array is
len(example1)

3

In [31]:
#Shape gives us the length along each dimension
example1.shape # a 1D array has a single entry: (length,)

(3,)

In [32]:
#ndim tells us how many dimensions our array has
example1.ndim 

1

In [33]:
#Note that unlike lists, each array has a specific data type (we can see it with dtype)
example1.dtype

dtype('int64')

In [34]:
#We can change the type using astype, which returns a new array (to set a type use np.TYPE - google for a list of types)
example1.astype(np.float64)

array([4., 5., 3.])
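It's worth knowing that `astype` hands back a new array rather than changing the original. Here's a minimal sketch of that behaviour (the names `as_float` and `truncated` are just made up for this illustration):

```python
import numpy as np

# Same array as in the cell above
example1 = np.array([4, 5, 3])

# astype returns a new array; the original is left untouched
as_float = example1.astype(np.float64)

print(example1.dtype)   # still an integer dtype
print(as_float.dtype)   # float64

# Converting floats to integers drops the decimal part (truncates toward zero)
truncated = np.array([1.7, -1.7]).astype(np.int64)
print(truncated)
```

If you want to keep the converted version, remember to assign the result to a variable.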

### <font color='#eb3483'> 2 dimensional arrays (a.k.a matrices) </font>

In [35]:
#Now let's feed in a list of lists
example2 = np.array([[1, 2, 1], [5, 43, 5]])
print(example2)

[[ 1  2  1]
 [ 5 43  5]]


In [36]:
#Remember you can always check-out the help docs
np.array?

Docstring:
array(object, dtype=None, *, copy=True, order='K', subok=False, ndmin=0,
      like=None)

Create an array.

Parameters
----------
object : array_like
    An array, any object exposing the array interface, an object whose
    __array__ method returns an array, or any (nested) sequence.
    If object is a scalar, a 0-dimensional array containing object is
    returned.
dtype : data-type, optional
    The desired data-type for the array.  If not given, then the type will
    be determined as the minimum type required to hold the objects in the
    sequence.
copy : bool, optional
    If true (default), then the object is copied.  Otherwise, a copy will
    only be made if __array__ returns a copy, if obj is a nested sequence,
    or if a copy is needed to satisfy any of the other requirements
    (`dtype`, `order`, etc.).
order : {'K', 'A', 'C', 'F'}, optional
    Specify the memory layout of the array. If object is not an array, the
    newly created array will be i

In [37]:
example2.shape

(2, 3)

In [38]:
example2.ndim

2
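With more than one dimension, `len`, `shape` and `size` each tell us something slightly different. A quick sketch using the same matrix as above:

```python
import numpy as np

example2 = np.array([[1, 2, 1], [5, 43, 5]])

print(len(example2))    # length along the first axis (number of rows)
print(example2.shape)   # (rows, columns)
print(example2.size)    # total number of elements
```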

## <font color='#eb3483'> Creating Arrays </font>

`arange` returns equally-spaced values within a given interval (inclusive of the start and exclusive of the stop). It has the following parameters:
1. **start** : number, optional
    Start of interval.  The interval includes this value.  The default
    start value is 0.
2. **stop** : number
    End of interval.  The interval does not include this value, except
    in some cases where `step` is not an integer and floating point
    round-off affects the length of `out`.
3. **step** : number, optional
    Spacing between values.  For any output `out`, this is the distance
    between two adjacent values, ``out[i+1] - out[i]``.  The default
    step size is 1.  If `step` is specified as a position argument,
    `start` must also be given.

In [39]:
# make range using arange

arng = np.arange(0, 10, 1)

print(arng)

# arange isn't constrained to integer values!
arng = np.arange(1.2, 7.87, 0.47)

print('\n', arng)

#what does '\n' do?

[0 1 2 3 4 5 6 7 8 9]

 [1.2  1.67 2.14 2.61 3.08 3.55 4.02 4.49 4.96 5.43 5.9  6.37 6.84 7.31
 7.78]


Numpy also provides a number of utility functions to generate arrays in commonly used forms. A few of these are:

1. `ones` returns an array of the specified shape populated entirely with ones.

2. `zeros` returns an array of the specified shape, filled with zeros.

3. `eye` returns a 2D array with ones on the diagonal and zeros everywhere else.

4. `diag` returns a 2D array with the given array on its diagonal and zeros elsewhere.

5. `repeat` repeats a number, or each element in a sequence of numbers, a given number of times.

6. `np.random.random` returns an array of the specified shape filled with random values in the interval [0, 1).

In [40]:
np.ones((2, 3))

array([[1., 1., 1.],
       [1., 1., 1.]])

In [41]:
np.zeros((3,2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [42]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [43]:
np.diag([1, 2, 3])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

In [44]:
np.repeat([1, 2, 3, 4], 3)

array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

To generate random numbers we use ... 

In [45]:
np.random.random(2)

array([0.2915122 , 0.02028906])

In [46]:
np.random.random((2,3))

array([[0.98941479, 0.04518727, 0.86676687],
       [0.05234234, 0.84420291, 0.74257321]])
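Random numbers change every time you run the cell. If you need the same numbers on every run, one option (not used elsewhere in this notebook) is numpy's seeded generator, `np.random.default_rng`. A minimal sketch, where the seed value 42 is an arbitrary choice:

```python
import numpy as np

# The seed value 42 is arbitrary and chosen only for illustration
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.random((2, 3))
sample_b = rng_b.random((2, 3))

# Same seed, same numbers
print(np.array_equal(sample_a, sample_b))

# Values lie in the half-open interval [0, 1)
print(sample_a.min() >= 0, sample_a.max() < 1)
```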

### <font color='#eb3483'> Quick knowledge check! </font>
1. We learned how to make an array using `arange`. A closely related function is `linspace` - take a look at the help doc. for linspace and recreate the array generated by `np.arange(0, 10, 1)` using linspace.

In [47]:
# your code goes here

2. Make a 5 x 3 matrix of all threes using numpy.

In [48]:
# your code goes here

## <font color='#eb3483'> Combining and Splitting Arrays </font>

Numpy provides a number of ways in which we can combine arrays to form new arrays.

Two of the key functions are `vstack` and `hstack`. These functions respectively stack arrays vertically (one on top of another) and horizontally (one alongside another).

You can see that this implies that `vstack` requires that the stacked arrays have the same number of columns, while `hstack` requires that the stacked arrays have the same number of rows.

In [49]:
# create an array which we'll play around with stacking
arr = np.array([[1, 2, 3], [4, 5, 6]])

print('arr:\n', arr, '\n')

# let's stack the same array vertically
vstacked = np.vstack([arr, arr])

print('vstacked:\n', vstacked, '\n')

# let's stack the same array horizontally
hstacked = np.hstack([arr, arr])

print('hstacked:\n', hstacked, '\n')

arr:
 [[1 2 3]
 [4 5 6]] 

vstacked:
 [[1 2 3]
 [4 5 6]
 [1 2 3]
 [4 5 6]] 

hstacked:
 [[1 2 3 1 2 3]
 [4 5 6 4 5 6]] 
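To see the shape requirements in action, here is a small sketch with two arrays that share a number of columns but not a number of rows (the names `wide` and `tall` are made up for this example):

```python
import numpy as np

wide = np.ones((2, 3))   # 2 rows, 3 columns
tall = np.zeros((4, 3))  # 4 rows, 3 columns

# Same number of columns, so vertical stacking works
stacked = np.vstack([wide, tall])
print(stacked.shape)

# Different number of rows, so horizontal stacking fails
try:
    np.hstack([wide, tall])
    hstack_failed = False
except ValueError as err:
    hstack_failed = True
    print('hstack failed:', err)
```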



Where we can combine arrays, so too can we break them up! `vsplit` and `hsplit` are the splitting analogs to `vstack` and `hstack`. Instead of passing in two arrays and getting one back, we now feed in one array and a list of indices where we want to split it (note that N split points will give us N + 1 new arrays).

In [50]:
# create an array which we'll play around with splitting 
# Note we're creating an array counting from 0 to 19 and then turning it into a 5 x 4 matrix using reshape
arr = np.arange(20).reshape((5,4))

print('arr:\n', arr, '\n')

# let's split the array vertically (try changing the split points)
upper, middle, lower = np.vsplit(arr, [2,4])

print('Upper Split:\n', upper, '\n',
      'Middle Split:\n', middle, '\n',
     'Lower Split:\n', lower, '\n',)

# let's split the same array horizontally
left, right = np.hsplit(arr, [2])

print('Left Split:\n', left, '\n',
      'Right Split:\n', right, '\n')

arr:
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]] 

Upper Split:
 [[0 1 2 3]
 [4 5 6 7]] 
 Middle Split:
 [[ 8  9 10 11]
 [12 13 14 15]] 
 Lower Split:
 [[16 17 18 19]] 

Left Split:
 [[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]
 [16 17]] 
 Right Split:
 [[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]
 [18 19]] 
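Here is a short sketch of the "N split points give N + 1 arrays" rule, plus two related behaviours: passing an integer instead of a list, and stacking the pieces back together. The names `pieces` and `halves` are just for illustration.

```python
import numpy as np

arr = np.arange(20).reshape((5, 4))

# Two split points -> three pieces
pieces = np.vsplit(arr, [2, 4])
print(len(pieces))

# An integer instead of a list asks for that many equal-sized pieces
halves = np.hsplit(arr, 2)
print([h.shape for h in halves])

# Stacking the pieces back together recovers the original array
print(np.array_equal(np.vstack(pieces), arr))
```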



### <font color='#eb3483'> Quick knowledge check! </font>
1. Replace the `?` below with either h or v (think about which one will work) - run the code to see if you got it right!

In [51]:
arr1 = np.ones((1,7))
arr2 = np.zeros((7,7))

myBigArray = np.vstack([arr1,arr2])

print(myBigArray)

[[1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]


## <font color='#eb3483'> Operations and Arithmetic </font>


Numpy arrays support element-wise arithmetic via a set of overloaded operators that you should be very familiar with (they simply perform the usual operations on each element of the given array).

1. `+` — element-wise addition
2. `-` — element-wise subtraction
3. `*` — element-wise multiplication
4. `/` — element-wise division
5. `**` — element-wise power

In [52]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print('addition:\n', x + y, '\n')
print('subtraction:\n', x - y, '\n')
print('multiplication:\n', x * y, '\n')
print('division:\n', x / y, '\n')
print('power:\n', x ** 2, '\n')

addition:
 [5 7 9] 

subtraction:
 [-3 -3 -3] 

multiplication:
 [ 4 10 18] 

division:
 [0.25 0.4  0.5 ] 

power:
 [1 4 9] 



We can also perform the **dot product** (which you may be familiar with):

$ \begin{bmatrix}x_1 & x_2 & x_3\end{bmatrix}
\cdot
\begin{bmatrix}y_1 \\ y_2 \\ y_3\end{bmatrix}
= x_1 y_1 + x_2 y_2 + x_3 y_3$

In [53]:
x.dot(y)

32

This can also be written as:

In [54]:
x @ y

32
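If the formula feels abstract, this sketch spells it out: multiply element by element, then add everything up. The variable `manual` is just a name chosen for this example.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Multiply element by element, then add everything up
manual = np.sum(x * y)

print(manual)
print(manual == x @ y)
print(manual == np.dot(x, y))
```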

Mathematically, the dot product treats the first vector `x` as a row vector (i.e. as if it were transposed). For matrices we can transpose explicitly with the `.T` attribute of Numpy arrays (on a 1D array like `x`, `.T` changes nothing).

In [55]:
print('array:\n', arr, '\n')
print('transposed array:\n', arr.T, '\n')
print('transposed array shape:\n', arr.T.shape, '\n')

array:
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]] 

transposed array:
 [[ 0  4  8 12 16]
 [ 1  5  9 13 17]
 [ 2  6 10 14 18]
 [ 3  7 11 15 19]] 

transposed array shape:
 (4, 5) 
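A small sketch of how transposing interacts with shapes, assuming the same 5 x 4 matrix as above (here called `matrix`, with a made-up 1D `vector` for comparison):

```python
import numpy as np

matrix = np.arange(20).reshape((5, 4))
vector = np.array([1, 2, 3])

# Transposing a 2D array swaps its axes
print(matrix.T.shape)

# Transposing a 1D array changes nothing
print(vector.T.shape)

# (5, 4) @ (4, 5) -> (5, 5): the inner dimensions must match
product = matrix @ matrix.T
print(product.shape)
```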



We can also use the same idea for comparison operators, which compare element by element and return an array of booleans

In [56]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([1, -124, 5])

print('arr1 > arr2? ', arr1 > arr2)
print('arr1 < arr2? ', arr1 < arr2)
print('arr1 = arr2? ', arr1 == arr2) #Note: to check equality we use == (a single = is assignment)
print('arr1 != arr2? ', arr1 != arr2) #!= means not equal

arr1 > arr2?  [False  True False]
arr1 < arr2?  [False False  True]
arr1 = arr2?  [ True False False]
arr1 != arr2?  [False  True  True]


We can also extend the same idea of combining boolean variables (i.e. and, or) to boolean arrays.

In [57]:
arr1 = np.array([True, True, False, False])
arr2 = np.array([True, False, True, False])

print('arr1 AND arr2', arr1 & arr2) #Note that for vectors, we need & and not 'and'
print('arr1 OR arr2', arr1 | arr2) #Note that for vectors, we need | and not 'or'

arr1 AND arr2 [ True False False False]
arr1 OR arr2 [ True  True  True False]
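In practice you'll often combine comparisons directly. One gotcha: each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` and `<`. A minimal sketch (the name `in_range` is made up for this example):

```python
import numpy as np

arr = np.arange(6)

# Parentheses are required: & binds more tightly than > and <
in_range = (arr > 1) & (arr < 5)
print(in_range)

# ~ flips every element (element-wise NOT)
print(~in_range)
```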


You might be wondering why we use numpy's functions instead of doing a loop and using Python's base functions (i.e. why not loop through our numpy array and add the two arrays together one at a time)? Beyond being a little more readable, numpy's functions actually run a lot faster than Python's built-in functions. It won't come up for smaller applications, but when you're working with massive datasets it's important to keep in mind that you want to rely on numpy whenever possible to keep your code working fast!
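You don't have to take the speed claim on faith. The sketch below times both approaches with Python's `timeit` module. The array size, the repeat count and the function names are arbitrary choices for illustration, and the exact timings will differ from machine to machine.

```python
import timeit

import numpy as np

size = 100_000  # arbitrary size, chosen for illustration
a = np.arange(size)
b = np.arange(size)

def add_with_loop():
    # Add the arrays one element at a time in pure Python
    return [a[i] + b[i] for i in range(size)]

def add_with_numpy():
    # Let numpy add the arrays in one vectorized call
    return a + b

loop_time = timeit.timeit(add_with_loop, number=5)
numpy_time = timeit.timeit(add_with_numpy, number=5)

print(f'loop:  {loop_time:.4f} s')
print(f'numpy: {numpy_time:.4f} s')
```

Both functions produce the same values; only the time taken differs.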

## <font color='#eb3483'> Aggregations </font>

A lot of times you might want to get a summarized view of your numpy array - that's where aggregations come in! Python has a lot of aggregation functions built-in (e.g. `sum`, `max`, `min`, etc.) but again, the numpy varieties run a lot faster!

Some handy aggregation functions you might want to use are:

1. `sum` — add together all the elements
2. `mean` — get the average of all the elements
3. `std` — get the standard deviation of all the elements
4. `max` — get the maximum value in the array
5. `min` — get the minimum value in the array
6. `argmax` — get the index of the maximum value in the array
7. `argmin` — get the index of the minimum value in the array

Let's take some for a spin

In [58]:
#Let's make an array to play around with
arr = np.arange(10)
print('Our array: ', arr)

#Let's start by trying out some basic aggregation functions
print('Sum: ', np.sum(arr))
print('Average: ', np.mean(arr)) #Mean = Stats-y way to say average
print('Max: ', np.max(arr)) 
print('Argmax: ', np.argmax(arr)) #Index of the maximum value

Our array:  [0 1 2 3 4 5 6 7 8 9]
Sum:  45
Average:  4.5
Max:  9
Argmax:  9


For these aggregation functions, we can also access them directly from the array object (remember to include the '()' because they're methods, not attributes).

In [59]:
#Remix of the above with a different syntax
print('Sum: ', arr.sum())
print('Mean: ', arr.mean())
print('Max: ', arr.max())
print('Argmax: ', arr.argmax()) #Index of the maximum value

Sum:  45
Mean:  4.5
Max:  9
Argmax:  9


When we're working with matrices, we can use the same aggregation functions and even apply them along a single dimension. Note that `axis=0` aggregates down the rows, giving one result per column, while `axis=1` aggregates across the columns, giving one result per row (I usually can't remember and just go to the help doc to double check).

In [60]:
matrix = np.arange(20).reshape(5,4)
print('Our matrix: \n', matrix)

print('Sum (everything): ', matrix.sum())
print('Sum (rows): ', matrix.sum(axis=1)) #Notice that we have 5 results (1 per row)
print('Sum (cols): ', matrix.sum(axis=0)) #Notice that we now have 4 results (1 per col)

Our matrix: 
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
Sum (everything):  190
Sum (rows):  [ 6 22 38 54 70]
Sum (cols):  [40 45 50 55]
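One way to remember the axis convention is to look at which dimension disappears from the result. A minimal sketch with the same 5 x 4 matrix, using `max` and `min` instead of `sum` (the names `col_maxes` and `row_mins` are made up for this example):

```python
import numpy as np

matrix = np.arange(20).reshape(5, 4)

# axis=0 collapses the 5 rows: one result per column
col_maxes = matrix.max(axis=0)
print(col_maxes.shape, col_maxes)

# axis=1 collapses the 4 columns: one result per row
row_mins = matrix.min(axis=1)
print(row_mins.shape, row_mins)
```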


Sometimes our matrices might have missing values (numpy represents these with a special floating-point value called `nan`, short for 'not a number'). A lot of our aggregation functions return `nan` as soon as they hit one, so we have to use 'nan-safe' versions (generally you just add `nan` to the front of the function's name).

In [61]:
#Let's make an array with a missing value
nanArray = np.array([1, np.nan, 3, 4])

print('Normal sum: ', nanArray.sum())
print('Nan-safe sum: ', np.nansum(nanArray)) #Note that we have to use the np.FUN format for nan safe functions

Normal sum:  nan
Nan-safe sum:  8.0
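A few more nan-related tools that tend to come in handy, sketched on the same array as above (renamed `nan_array` here):

```python
import numpy as np

nan_array = np.array([1, np.nan, 3, 4])

# nan is a float value, so the whole array is stored as floats
print(nan_array.dtype)

# nan never equals itself, so use np.isnan to find missing values
print(np.isnan(nan_array))

# Other nan-safe aggregations follow the same naming pattern
print(np.nanmean(nan_array))
print(np.nanmax(nan_array))
```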


We also have aggregations for boolean arrays too!

In [62]:
arr1 = np.array([True, False, True, False])

#Any checks if any value of the array is true
print('Any :', arr1.any())

#All checks if ALL values of the array are true
print('All :', arr1.all())

Any : True
All : False
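Because `True` counts as 1 and `False` as 0, the usual numeric aggregations also work on boolean arrays. That gives a quick way to count how many elements meet a condition. A minimal sketch:

```python
import numpy as np

arr = np.arange(10)
mask = arr > 5

# True counts as 1 and False as 0
print(mask.sum())    # how many values are greater than 5
print(mask.mean())   # what fraction of values are greater than 5
```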


### <font color='#eb3483'> Quick knowledge check! </font>
1. Add together matrices a and b and get the average value of each column in the added matrices.

In [63]:
a = np.arange(25).reshape((5,5))
b = np.eye(5)

#Your code goes here

## <font color='#eb3483'> Slicing </font>

It is very easy to take subsections of numpy arrays. This is called *"slicing"* (since we take a slice of the array).

In [64]:
matrix_34 = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
matrix_34

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

We get the first row the same way we would do with a regular python list

In [65]:
matrix_34[0]

array([1, 2, 3, 4])

We can choose the first 2 rows

In [66]:
matrix_34[:2]

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

We can select the 2nd element of each row (that is, the 2nd column):

In [67]:
matrix_34[:,1]

array([ 2,  6, 10])

When we are slicing we don't get copies, but views that refer to the same elements in the original array:

In [68]:
two_first_rows = matrix_34[:2,:]
two_first_rows

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [69]:
print('Before changing slice: \n', matrix_34)
two_first_rows[0, 0] = 100
print('After changing slice: \n',matrix_34)

Before changing slice: 
 [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
After changing slice: 
 [[100   2   3   4]
 [  5   6   7   8]
 [  9  10  11  12]]


However, we can always get around this by using the array's `copy` method, which will make a copy of the sliced view.

In [70]:
print('Before changing copied slice: \n', matrix_34)
selection = matrix_34[:2,:].copy() #Makes a copy of our slice
selection[0,1] = 100
matrix_34
print('After copied slice: \n',matrix_34)

Before changing copied slice: 
 [[100   2   3   4]
 [  5   6   7   8]
 [  9  10  11  12]]
After copied slice: 
 [[100   2   3   4]
 [  5   6   7   8]
 [  9  10  11  12]]
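If you're ever unsure whether two arrays share the same underlying data, `np.shares_memory` can tell you. A small sketch, starting from a fresh copy of the matrix (the names `view` and `copied` are made up for this example):

```python
import numpy as np

matrix_34 = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = matrix_34[:2, :]           # a plain slice
copied = matrix_34[:2, :].copy()  # an explicit copy

print(np.shares_memory(matrix_34, view))
print(np.shares_memory(matrix_34, copied))
```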


### <font color='#eb3483'> Quick knowledge check! </font>
1. Using the matrix below, use slicing to return only the even-indexed rows and columns (i.e. rows and columns 0, 2, 4, ...; no element that's in either an odd-indexed row or an odd-indexed column should be included).

In [71]:
import numpy as np
matrix = np.arange(100).reshape((10,10))

matrix[::2, ::2]

array([[ 0,  2,  4,  6,  8],
       [20, 22, 24, 26, 28],
       [40, 42, 44, 46, 48],
       [60, 62, 64, 66, 68],
       [80, 82, 84, 86, 88]])

## <font color='#eb3483'> Fancy Indexing </font>

So far, we've looked at indexing pretty similar to lists - we either put in the element we want to see (i.e. `x[0]`) or slice (`x[1:4]`) - but we can do a whole lot more in numpy! One additional way we can index is called 'masking': feeding in a boolean array that tells numpy, with a True or False for each element, whether or not to return it. Let's see it in action:

In [72]:
arr = np.arange(3)
mask = [True, False, True]

print('Masking in action: ', arr[mask])

Masking in action:  [0 2]


Masking lets us get data that meet some conditions.

In [73]:
# Let's get all the values greater than 0
print('> 0: ', arr[arr > 0])

#Or not equal to 1
print('!= 1: ', arr[arr != 1])

#It works on matrices too!
matrix = np.eye(5)
print('Matrix: \n', matrix)
print('Matrix > 0: \n', matrix[matrix > 0])

> 0:  [1 2]
!= 1:  [0 2]
Matrix: 
 [[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
Matrix > 0: 
 [1. 1. 1. 1. 1.]
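Masks work on the left-hand side of an assignment too, which makes it easy to overwrite just the values that meet a condition. A minimal sketch with made-up data (the array `readings` is purely illustrative):

```python
import numpy as np

readings = np.array([3, -1, 4, -1, 5, -9])  # made-up values for illustration

# Replace every negative value with 0, in place
readings[readings < 0] = 0
print(readings)
```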


We can also feed in the specific indices we want to access as a list or array (i.e. `x[[0,6,9]]` to get the 1st, 7th and 10th elements)!

In [74]:
arr = np.arange(100)
ind = [2, 5, 30]

#Get the 3rd, 6th and 31st values of the array
print('Array indexing: ', arr[ind])

#It works for matrices too!
matrix = np.random.rand(100).reshape(10,10)
rows = [1, 3, 5]
cols = [3, 8, 2] 

#Note that this code is returning elements at (1,3), (3,8), and (5,2)
print('Matrix indexing: ', matrix[rows, cols])

Array indexing:  [ 2  5 30]
Matrix indexing:  [0.01636855 0.16861968 0.0814864 ]
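Two things about index lists are easy to trip over: row and column lists are paired up element by element, and the result is a copy rather than a view. The sketch below uses a matrix of known values so the results are easy to check. It also uses `np.ix_`, a numpy helper not covered above, to select every combination of the listed rows and columns.

```python
import numpy as np

matrix = np.arange(100).reshape(10, 10)
rows = [1, 3, 5]
cols = [3, 8, 2]

# Paired indexing: elements at (1,3), (3,8) and (5,2)
paired = matrix[rows, cols]
print(paired)

# np.ix_ builds an open mesh: every listed row crossed with every listed column
block = matrix[np.ix_(rows, cols)]
print(block.shape)

# Unlike slices, indexing with lists returns a copy
paired[0] = -1
print(matrix[1, 3])
```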


### <font color='#eb3483'> Quick knowledge check! </font>
1. For the matrix below, get the average of all the values in the matrix that are greater than 50.

In [75]:
matrix = np.arange(100).reshape(10,10)