# Basic `numpy`

In this notebook we will learn about the python package `numpy`.

By the end of this notebook you will know about:
- Checking the version of a python package,
- `numpy` `ndarray`s,
- `ndarray` functions,
- `numpy` functions,
- Pseudorandom numbers in `numpy` and
- Linear algebra in `numpy`.

## `numpy`

`numpy` is a python package that is a real workhorse of machine learning and data science.

If you are new to python, this will be the first true package you will import. First, we should check that you have the package installed, so try to run the following code chunks. (Note: if you installed the Anaconda platform, <a href="https://www.anaconda.com/">https://www.anaconda.com/</a>, `numpy` should already be installed.)

In [1]:
## it is standard to import numpy as np
import numpy as np

In [2]:
## let's check what version of numpy you have
## when I wrote this I had version 1.24.2
## yours may be different
print(np.__version__)

2.2.1


If you had a version of `numpy` installed, both of those code chunks should have run without error. If not, you will need to install it onto your machine because we will be using it heavily in the boot camp. For installation instructions check out the `numpy` documentation here, <a href="https://numpy.org/install/">https://numpy.org/install/</a>. If you are unsure how to install a python package in general check our python package installation guide, <a href="https://www.erdosinstitute.org/data-science">https://www.erdosinstitute.org/data-science</a>.


##### Be sure you can run both of the above code chunks before continuing with this notebook. Again, it should be fine if your package version is slightly different from mine.

### `numpy`'s `ndarray`

While base python likes to store data in objects like `list`s and `tuple`s, `numpy` stores data in an `ndarray`. An `ndarray` is similar to a `list`, but it has a number of features that make it more useful for numeric data manipulation in data science applications.

<a href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html">https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html</a>.

In [7]:
## You can make an array with np.array
## You just put np.array() around a python list or tuple
# array1 = [1,2,3,4,5,6,7,8,9,10]
array1 = np.array([1,2,3,4,5,6,7,8,9,10])
print(array1)
print()
print(type(array1))

[ 1  2  3  4  5  6  7  8  9 10]

<class 'numpy.ndarray'>


`numpy` `ndarray`s can have any finite number of dimensions. A multi-dimensional array can be constructed by wrapping `np.array` around a `list` of `list`s.

In [8]:
## this produces a 2-dimensional array
## it is a 3x3 array
array2 = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(array2)
print()

## we can check the array's dimensions with np.shape()
## np.shape() returns a tuple with the size of each dimension
## array2 should be a 3 by 3 array
print("array2 is a", np.shape(array2), "ndarray")

[[1 2 3]
 [4 5 6]
 [7 8 9]]

array2 is a (3, 3) ndarray


In [9]:
array2.shape

(3, 3)
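
The `shape` check above is worth turning into a habit. As a minimal sketch (the array name `demo` is made up for illustration), the snippet below looks at the `shape`, `ndim` and `size` attributes and shows the kind of error that mismatched shapes can produce:

```python
import numpy as np

# a hypothetical array used only for illustration
demo = np.array([[1, 2, 3], [4, 5, 6]])

print(demo.shape)  # size of each dimension
print(demo.ndim)   # number of dimensions
print(demo.size)   # total number of entries

# adding a length-2 array to a 2x3 array fails because
# the shapes (2, 3) and (2,) do not line up
try:
    demo + np.array([1, 2])
except ValueError as err:
    print("shape mismatch:", err)
```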

<i>Note: the dimensionality of an `ndarray` will be quite important in our boot camp because certain algorithms will not run if the `ndarray` is the wrong shape.</i>

In [15]:
## You code
## Try making a 2x2x2 array
array3 = np.array([[[1,1],[1,1]],[[1,1],[1,1]]])
print(array3.shape)
print(array3)

(2, 2, 2)
[[[1 1]
  [1 1]]

 [[1 1]
  [1 1]]]


In [12]:
## You code 
## Try making a 2x2x2x2 array
array4 = np.array([[[[1,1],[2,2]],[[3,3],[4,4]]],[[[5,5],[6,6]],[[7,7],[8,8]]]])
print(array4.shape)
print(array4)

(2, 2, 2, 2)
[[[[1 1]
   [2 2]]

  [[3 3]
   [4 4]]]


 [[[5 5]
   [6 6]]

  [[7 7]
   [8 8]]]]


### `ndarray` Functions

#### Vectorized Operations

`ndarray`s are nice because, for the most part, they work the way you'd expect a vector or matrix to work. Let's compare and contrast with python's `list`s.

In [16]:
## You code
## see what happens when you code up 2*list1
list1 = [1,2,3,4,5]
print(2*list1)

[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]


In [18]:
## You code
## Now compare it to 2*array1
print(2*array1)

[ 2  4  6  8 10 12 14 16 18 20]


In [19]:
## what happens here?
list2 = [2,4,6,8]
list1 + list2

[1, 2, 3, 4, 5, 2, 4, 6, 8]

In [21]:
## You code
## code up the comparable ndarray expression
## and see what happens
array1+array1

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

In [25]:
## Finally what happens here?
list1 + 2

TypeError: can only concatenate list (not "int") to list

In [23]:
## Try the comparable ndarray expression
array2+1

array([[ 2,  3,  4],
       [ 5,  6,  7],
       [ 8,  9, 10]])

In [30]:
A = np.array([[1,2], [1,2]])

B = np.array([[-1,2], [1,-2]])

## A*B multiplies the arrays entry by entry
print(A*B)

## np.matmul(A,B) computes the matrix product
print(np.matmul(A,B))

[[-1  4]
 [ 1 -4]]
[[ 1 -2]
 [ 1 -2]]
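
The two printed results above differ because `*` and `np.matmul` are different operations. The sketch below repeats the same `A` and `B` and also uses the `@` operator, which python and `numpy` provide as shorthand for matrix multiplication. The names `elementwise` and `matrix_prod` are only illustrative.

```python
import numpy as np

A = np.array([[1, 2], [1, 2]])
B = np.array([[-1, 2], [1, -2]])

elementwise = A * B   # multiplies matching entries
matrix_prod = A @ B   # matrix product, same result as np.matmul(A, B)

print(elementwise)
print(matrix_prod)
print(np.array_equal(matrix_prod, np.matmul(A, B)))
```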


### Preset `numpy` Arrays

There are a number of standard arrays that you will want to use often, and `numpy` can generate them quickly.

In [31]:
## np.ones(shape) makes an array of all ones of the desired shape
print(np.ones(1))

print()

print(np.ones((4,10)))

print()

print(np.ones((2,2,2)))

print()
print(np.ones((2,3,2)))

[1.]

[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[[1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]]]

[[[1. 1.]
  [1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]
  [1. 1.]]]


In [32]:
## You code
## np.zeros(shape) is similar to np.ones, but instead of 1s
## it makes an array of 0s
## print 3 arrays of zeros, 
## one that is a single dimension of size 4
print(np.zeros(4))
print()

## one that is 4x5
print(np.zeros((4,5)))
print()

## one that is 3x1x4
print(np.zeros((3,1,4)))
print()

[0. 0. 0. 0.]

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

[[[0. 0. 0. 0.]]

 [[0. 0. 0. 0.]]

 [[0. 0. 0. 0.]]]



In [33]:
## nxn identity matrix 
## np.eye(n)

## 2x2
np.eye(2)

array([[1., 0.],
       [0., 1.]])
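
A few other quick generators are worth knowing alongside `np.ones`, `np.zeros` and `np.eye`. This notebook does not rely on them, so treat the following as an optional sketch. The variable names are made up for illustration.

```python
import numpy as np

filled = np.full((2, 3), 7)     # 2x3 array filled with the value 7
steps = np.arange(0, 10, 2)     # start, stop (excluded), step
grid = np.linspace(0, 1, 5)     # 5 evenly spaced points, both ends included

print(filled)
print(steps)
print(grid)
```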

### Built-In `numpy` Functions

`numpy` also has a number of built-in functions that provide useful mathematical operations on arrays. Let's look at some examples.

In [34]:
y = 2*np.array([1,2,3]) - 4
y

array([-2,  0,  2])

In [35]:
## absolute value
np.abs(y)

array([2, 0, 2])

In [36]:
## raising each entry to a power
np.power(y, 3)

array([-8,  0,  8])

In [37]:
## the square root
np.sqrt(np.abs(y))

array([1.41421356, 0.        , 1.41421356])

In [39]:
## You code
## using np.exp and np.log define y to be 
## e^(x+3) + log(|x|+1)
## https://numpy.org/doc/stable/reference/generated/numpy.exp.html
## https://numpy.org/doc/stable/reference/generated/numpy.log.html
x = np.array([0,1,2,3,4])

y = np.exp(x+3) + np.log(np.abs(x)+1)
print(y)

[  20.08553692   55.29129721  149.51177139  404.81508785 1098.24259634]


In [41]:
## You can sum all of the entries of an array with
## np.sum

print(np.sum(y))

1727.946289722884


### `numpy` for Pseudorandomness

`numpy` is useful for generating pseudorandom numbers as well. We can look at common statistics of arrays too.

The pseudorandom functionality is stored in the `random` subpackage of `numpy`. Documentation for `numpy.random` can be found here, <a href="https://numpy.org/doc/stable/reference/random/index.html">https://numpy.org/doc/stable/reference/random/index.html</a>.

In [42]:
## random generators are stored in np.random
## np.random.random() gives a number selected uniformly
## at random from [0,1), so 1 itself is never returned
## https://numpy.org/doc/stable/reference/random/generated/numpy.random.random.html
print(np.random.random())

print()

## You can get a random array of any shape as well
## just call np.random.random(tuple containing the shape)
print("A (10,2) uniform random array:\n", np.random.random((10,2)))

0.752710843503806

A (10,2) uniform random array:
 [[0.28768376 0.90268912]
 [0.57830019 0.81712583]
 [0.9249535  0.52007587]
 [0.68516137 0.90914002]
 [0.22032016 0.77870416]
 [0.76399423 0.27461846]
 [0.25492849 0.8921012 ]
 [0.67202039 0.21503276]
 [0.4908133  0.337636  ]
 [0.15267226 0.30119184]]


In [43]:
## Another Example
## np.random.randn() draws a normal(0,1) number
## a single draw
## https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html
print(np.random.randn())
print()

## an array of draws
## note the slight difference here, we don't have to put
## the 10 and 2 in a tuple to get a 10 by 2 array
## numpy is slightly inconsistent in this area so always
## check the docs to get it right
np.random.randn(10,2)

-0.5323185250964186



array([[-0.71659037, -0.23371257],
       [ 0.67133447, -0.82621725],
       [ 1.73172679, -0.43246798],
       [ 0.87157905, -0.28664284],
       [ 0.198256  ,  0.58050909],
       [ 1.36893807,  0.93261717],
       [-0.78042236,  0.94796512],
       [-0.15880971,  0.07275499],
       [ 0.32323988, -1.34648211],
       [-0.95222681,  1.61823381]])

In [44]:
## A third example
## np.random.binomial()
## an array of binomial(n,p) outcomes
## https://numpy.org/doc/stable/reference/random/generated/numpy.random.binomial.html
np.random.binomial(n=4, p=.3, size=(10,10))

array([[0, 0, 2, 0, 1, 0, 3, 2, 0, 1],
       [2, 2, 1, 3, 1, 1, 0, 2, 1, 2],
       [0, 2, 2, 2, 3, 2, 0, 1, 2, 1],
       [0, 1, 2, 2, 1, 1, 1, 1, 1, 0],
       [1, 0, 0, 2, 1, 0, 2, 2, 2, 3],
       [0, 2, 2, 1, 1, 1, 1, 1, 2, 1],
       [0, 3, 1, 2, 2, 1, 1, 1, 1, 3],
       [1, 0, 1, 1, 0, 0, 0, 2, 1, 2],
       [2, 1, 0, 1, 3, 2, 1, 0, 1, 0],
       [0, 0, 0, 1, 1, 2, 0, 1, 1, 0]], dtype=int32)

##### Random Seeds

You may have noticed that your randomly generated numbers are different from the ones in the pre-recorded lecture (if you are watching the lecture that is!). This is expected because they are random numbers. It would be quite the coincidence if two different runs came up with the exact same random draw (for the random distributions we have looked at above).

If you want to ensure that you get the same random draw across runs you first need to set a random seed. In `numpy` this is done with `numpy.random.seed()`, <a href="https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html">https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html</a>. Let's see it in action.

In [45]:
## Run this code chunk as many times as you'd like
## it should always give the same number

## to set a random seed you call np.random.seed(integer)
## Note that your number can be any integer from 0 to 2**32 - 1
np.random.seed(440)

np.random.randn()

-0.3202545309014756
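
To see why the seed makes the draw repeatable, here is a small sketch that re-seeds and compares two sets of draws. It also shows `np.random.default_rng`, the newer generator interface in `numpy`. That part is an aside and is not used elsewhere in this notebook. The names `first`, `second` and `rng` are illustrative.

```python
import numpy as np

# re-seeding the generator replays the same draws
np.random.seed(440)
first = np.random.randn(3)

np.random.seed(440)
second = np.random.randn(3)

print(np.array_equal(first, second))

# np.random.default_rng returns a Generator object that
# carries its own seed and state
rng = np.random.default_rng(440)
print(rng.standard_normal(3))
```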

In [46]:
## You code
## make a 20 by 3 array of random normal draws
## call it X
X = np.random.randn(20,3)

In [47]:
X

array([[ 1.09050674,  1.52886149,  1.37612136],
       [-0.3062176 , -0.32709566, -1.4372358 ],
       [-0.72984114, -1.39495541,  0.87123193],
       [-1.55909171,  0.92335339,  0.07040371],
       [ 1.55344434, -1.39755153, -0.12756144],
       [-0.33480946, -0.02569027, -0.61265739],
       [ 0.3461294 , -0.4505475 , -0.44789901],
       [-0.90636936, -0.24323007, -0.09989317],
       [-0.11314842, -0.65692752,  1.46753968],
       [-0.83684519, -0.49411999,  1.59754225],
       [-0.08843056, -1.76865057, -0.40919394],
       [-2.53475325, -0.06538607,  0.12405334],
       [-0.85589135, -0.49764564, -0.5827299 ],
       [-0.70518513,  1.20940779,  1.22638777],
       [ 0.31923843, -1.52496383, -1.70568229],
       [-0.42592553,  1.24464831,  0.49398348],
       [ 1.87258836, -0.06424433,  1.40322558],
       [-0.87162199, -0.79654564, -0.04781258],
       [-0.57865208,  0.81793672, -0.42460452],
       [-0.90231823,  0.46981075,  1.11543939]])

Now that you have a data matrix, `X`, let's compute some summary statistics about `X`.

In [48]:
## You can get the mean of all the entries of X with np.mean
## https://numpy.org/doc/stable/reference/generated/numpy.mean.html
print("The overall mean of X is", np.mean(X))
print()

## Adding in the argument "axis = " allows you to get
## the mean of each column
print("The column means of X are", np.mean(X, axis=0))
print()

## and the mean of each row
print("The row means of X are", np.mean(X, axis=1))
print()

## the axis argument tells numpy the axis or axes along 
## which the means are computed.
## so axis = 0 averages down the rows, adding up the values
## in each column and dividing by the number of rows

## If you find this confusing, do not worry, I do too

The overall mean of X is -0.10383451439077582

The column means of X are [-0.32835969 -0.17567678  0.19253292]

The row means of X are [ 1.33182986 -0.69018302 -0.41785487 -0.18844487  0.00944379 -0.32438571
 -0.1841057  -0.41649753  0.23248791  0.08885902 -0.75542502 -0.82536199
 -0.6454223   0.57687014 -0.97046923  0.43756875  1.0705232  -0.5719934
 -0.06177329  0.22764397]
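
Since the `axis` argument is easy to mix up, a tiny array whose means can be checked by hand may help. This is only a sketch, and the names `small`, `col_means` and `row_means` are made up for illustration.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

col_means = np.mean(small, axis=0)  # collapse the rows: one mean per column
row_means = np.mean(small, axis=1)  # collapse the columns: one mean per row

print(col_means)
print(row_means)
```

One way to remember it: the axis you pass is the one that disappears from the shape, so a `(2, 3)` array gives a result of shape `(3,)` for `axis=0` and `(2,)` for `axis=1`.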



In [49]:
## You code
## np.sum also has an axis argument
## https://numpy.org/doc/stable/reference/generated/numpy.sum.html
## calculate the row sums and column sums of X
print("Column sums of X", np.sum(X,axis=0))

print()
print()

print("Row sums of X",np.sum(X,axis=1))




Column sums of X [-6.56719372 -3.51353559  3.85065845]


Row sums of X [ 3.99548959 -2.07054906 -1.25356461 -0.56533462  0.02833137 -0.97315712
 -0.55231711 -1.24949259  0.69746373  0.26657707 -2.26627506 -2.47608598
 -1.9362669   1.73061043 -2.91140769  1.31270626  3.21156961 -1.71598021
 -0.18531988  0.68293191]


In [50]:
## You code
## where does the max value occur in each column of X?
## https://numpy.org/doc/stable/reference/generated/numpy.argmax.html
print("The max. values of each column of X occur at the", np.argmax(X,axis=0), "rows")


## what is the max value in each column of X?
## https://numpy.org/doc/stable/reference/generated/numpy.ndarray.max.html
print("The max values in each column of X are",np.max(X,axis=0))


The max. values of each column of X occur at the [16  0  9] rows
The max values in each column of X are [1.87258836 1.52886149 1.59754225]


In [51]:
## Another useful function is np.cumsum()
## https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html

## randint generates random integers from the first argument
## (inclusive) up to the second argument (exclusive), the third
## argument tells numpy how many random draws to perform
## https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html
x = np.random.randint(1,10,10)

print(x)

np.cumsum(x)

## What do you think it does?

[3 5 3 9 4 4 8 9 9 9]


array([ 3,  8, 11, 20, 24, 28, 36, 45, 54, 63])
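
If the pattern in the output is not obvious, the following sketch compares `np.cumsum` with a running total built by hand. The names `vals`, `running` and `manual` are illustrative only.

```python
import numpy as np

vals = np.array([3, 5, 3, 9])
running = np.cumsum(vals)

# each entry of running is the sum of vals up to and
# including that position
manual = np.array([vals[:i + 1].sum() for i in range(len(vals))])

print(running)
print(np.array_equal(running, manual))
```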

### Linear Algebra with `numpy`

A final important use of `numpy` for us is performing linear algebra calculations.

The bulk of data science algorithms use linear algebra. Since we will dive into the math behind the scenes of these algorithms, we will use `numpy`'s linear algebra capabilities.

##### Note: If you're not a math heavy person, that's okay! I have written the boot camp's notebooks so that you don't need to understand the math to learn how to perform the algorithms we cover. I just like to cover the mathematical aspects of these data science algorithms to explain what is going on to those boot campers (like myself) that are interested in the mathematical/statistical underpinnings of the algorithms.

In [52]:
## We can think of a 2D array as a matrix
A = np.random.binomial(n=10,p=.4,size=(2,2))

A

array([[3, 4],
       [4, 3]], dtype=int32)

In [62]:
## A 1d array can be a row vector
x = np.array([1,2])
print(x)
x.shape

y = np.ones((2,1))
print(y)

print(np.ones((2,)))

[1 2]
[[1.]
 [1.]]
[1. 1.]


In [65]:
## or a column vector
## .reshape() will attempt to reshape your array into the given
## shape
## https://numpy.org/doc/stable/reference/generated/numpy.ndarray.reshape.html

## When one of the shape dimensions is -1, the value is inferred from 
## the length of the array and remaining dimensions.
## so -1,1 tells numpy that you want a 2-D array with 1 column
## and it should infer the number of rows from the original shape
## of the array
## Here this reshapes x as a 2x1 column vector
print(x.reshape(-1, 1).shape)

(2, 1)
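
The `-1` placeholder works in any position, not just the first. As a short sketch (the array `v` is made up for illustration), here are three reshapes of the same six entries:

```python
import numpy as np

v = np.arange(6)                # [0 1 2 3 4 5]

column = v.reshape(-1, 1)       # column vector, rows inferred
row = v.reshape(1, -1)          # explicit 2-D row vector, columns inferred
two_rows = v.reshape(2, -1)     # 2 rows, columns inferred

print(column.shape)
print(row.shape)
print(two_rows.shape)
```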


In [66]:
## We can now calculate the matrix product Ax
## matrix.dot() is used for matrix multiplication
A.dot(x.reshape(-1,1))

array([[11],
       [10]])

In [68]:
## You code
## make a 3x1 column vector of ones, call it x
x = np.ones((3,1))

np.random.seed(576)
## Take that vector and find B*x
B = np.random.binomial(n=5, p=.6, size=(3,3))


print(B, "x", x, "=")
## code here
B.dot(x)

[[3 1 1]
 [2 4 4]
 [4 3 3]] x [[1.]
 [1.]
 [1.]] =


array([[ 5.],
       [10.],
       [10.]])

In [69]:
## numpy.linalg contains a number of useful
## matrix operations, let's import a few
## note this is just for brevity in typing
from numpy.linalg import inv, eig, det

In [70]:
## Recall our A
A

array([[3, 4],
       [4, 3]], dtype=int32)

In [71]:
## the inverse of A
## Note you may get an error here if A is not
## invertible
inv(A)

array([[-0.42857143,  0.57142857],
       [ 0.57142857, -0.42857143]])

In [72]:
## the determinant of A
det(A)

np.float64(-6.999999999999999)

In [73]:
## the eigenvalues and eigenvectors of A
eig(A)

## this returns a tuple of arrays
## the first entry holds the eigenvalues
## the second entry holds the corresponding eigenvectors as its columns

EigResult(eigenvalues=array([ 7., -1.]), eigenvectors=array([[ 0.70710678, -0.70710678],
       [ 0.70710678,  0.70710678]]))
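
One way to read the `eig` output is to check the defining equation, $Av = \lambda v$, for each eigenvalue and the matching column of the eigenvector array. The sketch below does this for the same `A` shown above, typed in by hand so the block runs on its own.

```python
import numpy as np

A = np.array([[3, 4], [4, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    # column i of eigenvectors pairs with eigenvalues[i]
    v = eigenvectors[:, i]
    print(eigenvalues[i], np.allclose(A @ v, eigenvalues[i] * v))
```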

In [74]:
## matrix.transpose() computes the transpose of the matrix
A.transpose()

array([[3, 4],
       [4, 3]], dtype=int32)

In [75]:
## You code
b = np.array([2,5]).reshape(-1,1)

## Attempt to solve Ax = b for x
## Hint remember that if A is invertible, 
## x = A^{-1} b, where A^{-1} is the inverse of A
print("Want to solve Ax=b for x:")
print("x=",inv(A).dot(b))

Want to solve Ax=b for x:
x= [[ 2.]
 [-1.]]
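
Computing the inverse works here, but `numpy` also offers `np.linalg.solve`, which solves $Ax = b$ directly without forming $A^{-1}$. The sketch below compares the two approaches on the same `A` and `b`. The names `x_solve` and `x_inv` are illustrative.

```python
import numpy as np

A = np.array([[3, 4], [4, 3]])
b = np.array([2, 5]).reshape(-1, 1)

x_solve = np.linalg.solve(A, b)     # solves Ax = b directly
x_inv = np.linalg.inv(A).dot(b)     # the inverse-based approach

print(x_solve)
print(np.allclose(x_solve, x_inv))
print(np.allclose(A @ x_solve, b))  # check that the solution satisfies Ax = b
```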


That's it for this notebook. You have now been introduced to `numpy` and are ready to take on the practice problems. Be sure to get a fair level of comfort with `numpy`'s functionality because we will be using it a lot.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph. D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)