# NB: Introducing NumPy

Programming for Data Science

<img src="../../media/numpy-logo.png" style="float:right;"/>

## What is NumPy

NumPy stands for "**Numerical Python**." 

It is designed for **high-performance numerical computing** in Python. 

> Numerical methods are computational techniques used to find approximate solutions to mathematical problems that cannot be solved exactly, typically involving equations, integrals, derivatives, or optimization problems. They rely on algorithms to perform calculations and are essential in fields where analytical solutions are difficult or impossible to obtain.

Because [numerical methods](https://www.britannica.com/science/numerical-analysis) are so important to so many sciences, NumPy is the basis of what is called **the scientific "stack"** in Python, which includes _SciPy_, _Matplotlib_, _scikit-learn_, and _pandas_.

Understanding what NumPy does and how it works is **essential** for almost anything data science related in Python.

## A New Data Structure

Essentially, NumPy introduces **a new data structure** to Python &mdash; the **n-dimensional array**. 

Along with it, it introduces a collection of **functions and methods** that take advantage of this data structure.

The data structure is designed to support the use of **numerical methods**: algorithmic approximations to the problems of mathematical analysis that are fundamental to modern science.

It also provides a new way of applying functions to data, made possible by the data structure: **vectorized functions**.

Vectorized functions **replace the use of loops** and comprehensions to apply a function to a set of data. 
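
As a minimal sketch of the difference (the array `values` and the variable names are made up for illustration), the same computation can be written with a comprehension or as a single vectorized expression:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Comprehension style: apply the operation one element at a time.
squared_loop = np.array([v ** 2 for v in values])

# Vectorized style: the operation is applied to the whole array at once.
squared_vec = values ** 2

# NumPy's own functions (such as np.sqrt) are vectorized in the same way.
roots = np.sqrt(values)

print(squared_vec)                                 # [ 1.  4.  9. 16.]
print(np.array_equal(squared_loop, squared_vec))   # True
```

The vectorized form is usually both shorter and faster, because the loop runs inside NumPy's compiled code rather than in the Python interpreter.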

In addition, it provides a library of **linear algebra** functions. 
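
For instance, the `np.linalg` submodule can solve a small system of linear equations. The system below is an arbitrary example chosen for illustration:

```python
import numpy as np

# Coefficients and right-hand side of the system
#   2x +  y = 3
#    x + 3y = 5
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

solution = np.linalg.solve(A, b)       # finds x such that A @ x equals b
print(solution)                        # approximately [0.8 1.4]
print(np.allclose(A @ solution, b))    # True
```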

NumPy also introduces a bunch of new **data types**.

Let's take a look at it.

## Importing the Library

To import NumPy, we typically alias it as `np`.

In [10]:
import numpy as np

NumPy is by widespread convention aliased as `np`.

## The ndarray

The ndarray is a multidimensional array object.

Unlike Python lists, ndarrays **enforce a data type** among elements.
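
A small sketch of what this enforcement means in practice (the values are arbitrary): a list accepts any mix of types, whereas a value assigned into an integer array is cast to the array's data type.

```python
import numpy as np

mixed_list = [1, 2.5, "three"]    # a list can hold any mix of types

int_arr = np.array([1, 2, 3])     # every element shares one integer dtype
int_arr[0] = 7.9                  # the float is cast to the array's dtype

print(int_arr)                    # [7 2 3]
print(int_arr.dtype.kind)         # i  (an integer type)
```

Note that the fractional part is silently dropped rather than raising an error, which is worth keeping in mind when assigning into existing arrays.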

Also, they are by definition n-dimensional &mdash; they are inherently multi-dimensional.

Note we will sometimes call ndarrays just arrays.

To explore ndarrays, let's generate some play data using NumPy's built-in random number generator.

Note that `np.random.randn()` samples from the "standard normal" distribution.

We construct an array by passing dimensional arguments to the function.

Here, we create an array of two dimensions: $2$ rows by $3$ columns:

In [24]:
data2 = np.random.randn(2, 3)

In [25]:
data2

array([[-0.17822775,  1.4807077 ,  1.23837876],
       [-1.97014657, -0.5958817 ,  0.55137607]])

The `shape` property tells us that the array has two dimensions (the number of elements in the tuple), with $2$ elements in the first dimension and $3$ in the second.

In [26]:
data2.shape

(2, 3)

And the `dtype` property gives the data type shared by _all_ of the elements in the structure.

In [27]:
data2.dtype

dtype('float64')

Here we create an array of $3$ dimensions:

In [28]:
data3 = np.random.randn(2, 3, 2)

In [29]:
data3

array([[[ 0.87768583, -0.22121848],
        [-0.51085571,  0.39506273],
        [ 1.86574009, -1.83249675]],

       [[ 0.1487833 ,  1.50264272],
        [-0.13138186, -0.31934921],
        [-2.0748371 , -0.05439636]]])

In [30]:
data3.shape

(2, 3, 2)
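
The arrays above come from `np.random.randn()`. NumPy also offers a newer generator interface, `np.random.default_rng()`, which can be seeded so that results are reproducible. The sketch below uses it with an arbitrarily chosen seed. Note that `standard_normal()` takes the shape as a single tuple rather than as separate arguments.

```python
import numpy as np

rng = np.random.default_rng(seed=42)      # the seed value is arbitrary
sample = rng.standard_normal((2, 3, 2))   # shape is passed as one tuple

print(sample.shape)   # (2, 3, 2)
print(sample.ndim)    # 3   (number of dimensions)
print(sample.size)    # 12  (total number of elements: 2 * 3 * 2)
print(sample.dtype)   # float64
```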

## About Dimensions

The term "dimension" is ambiguous.

Sometimes it refers to the dimensions of things in the world, such as space and time.

Sometimes it refers to the dimensions of a data structure, independent of what it represents in the world.

Note that you can represent multiple world dimensions in a two-dimensional data structure: each column can be a dimension in this sense.

For example, three-dimensional space can be represented as three columns in a two-dimensional table _or_ as three axes in a data cube. 

The dimensions of data structures are sometimes called **axes**.
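
To make the distinction concrete, here is a sketch with made-up points in three-dimensional space, stored first as a table and then as a data cube that counts the points at each location:

```python
import numpy as np

# Four points in 3-D space stored as a 2-D table: one row per point,
# one column per spatial dimension (x, y, z).
points = np.array([[0, 0, 0],
                   [1, 0, 2],
                   [1, 0, 2],
                   [2, 1, 1]])
print(points.ndim, points.shape)   # 2 (4, 3)

# The same space as a data cube: one array axis per spatial dimension,
# with each cell counting the points at that (x, y, z) location.
cube = np.zeros((3, 2, 3), dtype=int)
np.add.at(cube, (points[:, 0], points[:, 1], points[:, 2]), 1)
print(cube.ndim, cube.shape)       # 3 (3, 2, 3)
print(cube[1, 0, 2])               # 2  (two points share this location)
```

Both structures describe the same three world dimensions, but only the cube has three axes.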

## Creating Arrays

There are many ways to create arrays in NumPy.

From a list:

In [31]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

array([6. , 7.5, 8. , 0. , 1. ])

From a list of lists:

In [10]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [11]:
arr2.ndim

2

In [12]:
arr2.shape

(2, 4)

In [13]:
arr1.dtype

dtype('float64')

In [14]:
arr2.dtype

dtype('int64')

Initializing with $0$s using a convenience function:

In [15]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [16]:
np.zeros((3, 6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [17]:
np.empty((2, 3, 2))

array([[[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]]])
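
Although the output above happens to show zeros, `np.empty` only allocates memory and does not initialize it, so its contents should be treated as arbitrary until assigned. A short sketch of the related convenience functions, with arbitrary shapes and fill values:

```python
import numpy as np

ones = np.ones((2, 3))           # every element is 1.0
sevens = np.full((2, 3), 7.0)    # every element is the given fill value

scratch = np.empty((2, 3))       # contents are arbitrary at this point
scratch[:] = 1.5                 # assign before reading from it

print(ones.sum())                # 6.0
print(sevens)
print(scratch)
```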

Using `np.arange()` (instead of Python's `range()`):

In [18]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
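
Like `range()`, `np.arange()` also accepts start, stop, and step arguments. Unlike `range()`, it returns an array and allows non-integer steps. The values here are chosen only for illustration:

```python
import numpy as np

evens = np.arange(2, 10, 2)       # start, stop (exclusive), step
print(evens)                      # [2 4 6 8]
print(list(range(2, 10, 2)))      # [2, 4, 6, 8]

halves = np.arange(0, 2, 0.5)     # float steps are allowed
print(halves)                     # [0.  0.5 1.  1.5]
```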

## Data Types

Unlike any of the previous data structures we have seen in Python, 
**ndarrays must have a single data type** associated with them.

Here we initialize a series of arrays as different data types (`dtypes`).

In [32]:
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr1.dtype

dtype('float64')

Note that dtypes are defined by some **constants attached to the NumPy module**, such as `np.float64`.

We can also refer to them as strings in some contexts. 

In other words, in the context of the dtype argument, `'float64'` can substitute for `np.float64`.

In [20]:
np.array([1, 2, 3], dtype='float64')

array([1., 2., 3.])
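
A quick check that these spellings are interchangeable. The short code `'f8'` (a float of 8 bytes) is a third way to name the same type:

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.float64)
b = np.array([1, 2, 3], dtype='float64')
c = np.array([1, 2, 3], dtype='f8')

print(a.dtype == b.dtype == c.dtype)       # True
print(np.dtype('float64') == np.float64)   # True
```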

In [21]:
arr2 = np.array([1, 2, 3], dtype=np.int32)
arr2.dtype

dtype('int32')

Integer arrays default to `int64` on most platforms:

In [22]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

dtype('int64')

So you may want to use a more capacious type:

In [23]:
float_arr = arr.astype(np.float64)
float_arr.dtype

dtype('float64')
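
One way to see how capacious a type is, sketched here with NumPy's `np.iinfo` and `np.finfo` helpers, is to inspect the largest value it can hold. The last line is a reminder that a wider range does not mean exact integers everywhere: `float64` represents integers exactly only up to $2^{53}$.

```python
import numpy as np

print(np.iinfo(np.int32).max)      # 2147483647
print(np.iinfo(np.int64).max)      # 9223372036854775807
print(np.finfo(np.float64).max)    # about 1.8e308

# Beyond 2**53, neighboring integers collapse to the same float64 value.
print(np.float64(2 ** 53) + 1 == np.float64(2 ** 53))   # True
```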

Arrays can be cast:

In [24]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr

array([ 3.7, -1.2, -2.6,  0.5, 12.9, 10.1])

From floats to ints:

In [25]:
arr.astype(np.int32)

array([ 3, -1, -2,  0, 12, 10], dtype=int32)
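
Notice that casting to an integer type truncates toward zero rather than rounding. If rounding is what you want, one option is to round first and cast afterwards, as in this sketch using `np.rint`:

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)           # drops the fractional part
rounded = np.rint(arr).astype(np.int32)    # rounds to nearest, then casts

print(truncated)   # [ 3 -1 -2  0 12 10]
print(rounded)     # [ 4 -1 -3  0 13 10]
```

`np.rint` rounds exact halves to the nearest even integer, which is why `0.5` becomes `0`.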

From strings to floats:

In [26]:
# np.bytes_ replaces the older alias np.string_, which was removed in NumPy 2.0
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
numeric_strings.astype(float)

array([ 1.25, -9.6 , 42.  ])

Note that NumPy converts data types to make the array uniform:

In [27]:
non_uniform = np.array([1.25, -9.6, 42])
non_uniform, non_uniform.dtype

(array([ 1.25, -9.6 , 42.  ]), dtype('float64'))
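
The common type NumPy picks can be inspected with `np.result_type`. The sketch below, with arbitrary values, also shows that mixing numbers with text turns every element into a string:

```python
import numpy as np

print(np.array([1, 2.5]).dtype)                  # float64
print(np.result_type(np.int32, np.float64))      # float64

with_text = np.array([1, 2.5, 'three'])
print(with_text.dtype.kind)                      # U  (unicode string)
print(with_text)                                 # ['1' '2.5' 'three']
```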

Ranges default to integers:

In [28]:
int_array = np.arange(10)

In [29]:
int_array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

We can use the dtype of one array to cast another:

In [30]:
calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
int_array.astype(calibers.dtype)

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

And here is an empty array of unsigned integers. Because `np.empty` does not initialize its memory, the values shown are arbitrary:

In [31]:
empty_uint32 = np.empty(8, dtype='u4')
empty_uint32

array([         0, 1075314688,          0, 1075707904,          0,
       1075838976,          0, 1072693248], dtype=uint32)

**NumPy Data Types**

```
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - byte string
U - unicode string
V - fixed chunk of memory for other type ( void )
```
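
These single-character codes are exposed through the `kind` attribute of a dtype. A short sketch, using a handful of arbitrary dtype specifications:

```python
import numpy as np

# Map a few dtype specifications to their one-character kind codes.
specs = ['i4', 'u4', 'f8', 'c16', 'U5', 'S5', 'O',
         'datetime64[D]', 'timedelta64[s]']
kinds = {spec: np.dtype(spec).kind for spec in specs}

for spec, kind in kinds.items():
    print(spec, '->', kind)

print(np.dtype(bool).kind)   # b
```

In specifications such as `'i4'` or `'f8'`, the number after the code is the size of each element in bytes.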

**Data Type Hierarchy**

NumPy introduces 24 new fundamental Python types to describe different types of scalars.

These derive from the C programming language with which NumPy is built.

![](../../media/dtype-hierarchy.png)

See the [NumPy docs](https://numpy.org/doc/1.25/reference/arrays.scalars.html).
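
The hierarchy can be queried in code with `np.issubdtype`, which tests whether one type sits below another. The particular types checked in this sketch are arbitrary:

```python
import numpy as np

print(np.issubdtype(np.int32, np.integer))      # True
print(np.issubdtype(np.float64, np.floating))   # True
print(np.issubdtype(np.int32, np.floating))     # False
print(np.issubdtype(np.float64, np.number))     # True

# The scalar types are ordinary Python classes, so issubclass works too.
print(issubclass(np.float64, np.floating))      # True
```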

## Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that returns a view on the underlying data without copying anything.

Arrays have the `transpose` method and also the special `T` attribute:

In [105]:
arr = np.arange(15).reshape((3, 5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [106]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])
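
Because the transpose is a view, changes made through it show up in the original array. A minimal sketch of this behavior, using `np.shares_memory` to confirm that no data is copied:

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))
transposed = arr.T

print(np.shares_memory(arr, transposed))   # True

transposed[0, 1] = -1     # write through the view...
print(arr[1, 0])          # -1  ...and the original changes too

independent = arr.T.copy()   # an explicit copy has its own data
independent[0, 0] = 99
print(arr[0, 0])             # 0
```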

Transposing is often used when computing the dot product between two arrays.

Here's an example.

In [107]:
arr = np.random.randn(6, 3)
arr

array([[ 0.13663893,  1.00187709, -0.49963842],
       [-0.36045802,  0.11501271,  0.39756581],
       [ 0.14672184, -0.2152224 ,  2.09433159],
       [-1.6139847 ,  0.74614412, -2.0898423 ],
       [ 0.12286791,  0.57296788, -0.18656951],
       [ 1.16320932,  0.97720159,  2.05399248]])

In [108]:
np.dot(arr.T, arr)

array([[ 4.14322653,  0.0666845 ,  5.83498162],
       [ 0.0666845 ,  2.90325249, -0.5646554 ],
       [ 5.83498162, -0.5646554 , 13.41505604]])
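
For two-dimensional arrays, `np.dot` performs matrix multiplication, which can also be written with the `@` operator. The sketch below uses freshly generated data with an arbitrary seed, so its numbers differ from those above. It also checks that the product of a matrix's transpose with the matrix itself is symmetric:

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # the seed value is arbitrary
X = rng.standard_normal((6, 3))

gram_dot = np.dot(X.T, X)   # (3 x 6) times (6 x 3) gives (3 x 3)
gram_at = X.T @ X           # the same product using the @ operator

print(gram_dot.shape)                       # (3, 3)
print(np.allclose(gram_dot, gram_at))       # True
print(np.allclose(gram_dot, gram_dot.T))    # True
```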

For higher dimensional arrays, `transpose` will accept a tuple of axis numbers to permute the axes.

Warning: this can get confusing to conceptualize and visualize!

In [109]:
arr = np.arange(16).reshape((2, 2, 4))
arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [110]:
arr.transpose((1, 0, 2))

array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])
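
One way to reason about the permutation is through shapes and indices. The sketch below uses an array whose axes all have different lengths (an assumption made so the reordering is visible in the shape) and checks that swapping the first two axes swaps the first two indices:

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))
permuted = arr.transpose((1, 0, 2))   # new axis 0 is old axis 1, and so on

print(arr.shape, permuted.shape)      # (2, 3, 4) (3, 2, 4)

# Element [i, j, k] of the original is element [j, i, k] of the result.
matches = all(
    permuted[j, i, k] == arr[i, j, k]
    for i in range(2) for j in range(3) for k in range(4)
)
print(matches)                        # True
```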

For a two-dimensional array, simple transposing with `.T` is just a special case of swapping axes. An ndarray has the `swapaxes` method, which takes a pair of axis numbers:

In [111]:
arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [112]:
arr.swapaxes(1, 2)

array([[[ 0,  4],
        [ 1,  5],
        [ 2,  6],
        [ 3,  7]],

       [[ 8, 12],
        [ 9, 13],
        [10, 14],
        [11, 15]]])
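
The relationships among `swapaxes`, `transpose`, and `.T` can be checked directly. This sketch reuses the same array and adds a small two-dimensional `matrix` for the `.T` comparison:

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
swapped = arr.swapaxes(1, 2)

print(swapped.shape)                                        # (2, 4, 2)
print(np.array_equal(swapped, arr.transpose((0, 2, 1))))    # True

matrix = np.arange(6).reshape((2, 3))
print(np.array_equal(matrix.T, matrix.swapaxes(0, 1)))      # True

# Like transposing, swapping axes returns a view, not a copy.
print(np.shares_memory(arr, swapped))                       # True
```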