# Numpy

[Numpy](https://numpy.org/learn/) (short for *Num*erical *Py*thon) forms the basis of many other modules/packages. For instance, Pandas, one of the main packages we will use in this course, depends heavily on Numpy behind the scenes. Numpy provides a large collection of numerical operations and lets us apply an operation to many values at the same time.

Numpy is often used together with Scipy (*Sci*entific *Py*thon), the main package where you will find many statistical functions.

## Numpy *Arrays*

The main component of Numpy is the numpy `array`. An `array` is often built from a Python list. It looks like a list, BUT you cannot store elements of different types in it. Typically an array is used to store numbers (though it can also be used to store strings). In exchange, numpy arrays allow you to do many numerical calculations.

In [35]:
import numpy as np

In [36]:
# create a simple numerical array
a = np.array([2,0,0.25,-3,5])
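To make the difference between a list and an array concrete, here is a small sketch. The variable names are only illustrative and are not used elsewhere in this chapter.

```python
import numpy as np

# A plain Python list can mix types; a numpy array converts everything to one type
mixed_list = [1, 2.5, "three"]
mixed_array = np.array(mixed_list)   # every element becomes a string
print(mixed_array.dtype.kind)        # 'U' means unicode string

# Multiplying a list repeats it; multiplying an array scales each element
values_list = [2, 0, 0.25, -3, 5]
values_array = np.array(values_list)
doubled_list = values_list * 2       # list with 10 items
doubled_array = values_array * 2     # array with 5 items, each one doubled
print(len(doubled_list), doubled_array)
```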

### Properties

A numpy `array` has a set of properties that we can query and, sometimes, even change.

In [37]:
# shape of the array?
a.shape

(5,)

This tells us that the array has 5 elements.

In [38]:
# dimensions of the array
a.ndim

1

This tells us that our array has only one dimension.

In [39]:
# array type
a.dtype

dtype('float64')

This tells us that the array holds floating point values (i.e. decimal numbers) that are 64 bits long (this refers to how much memory is used to store each number).
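As a quick illustration of how the type is chosen and how it relates to memory, consider the sketch below. The names `ints`, `floats` and `small` are made up for this example.

```python
import numpy as np

ints = np.array([2, 3, 4])          # all integers -> an integer dtype
floats = np.array([2, 3, 4.0])      # one decimal number is enough -> float64
small = floats.astype(np.float32)   # convert to 32-bit floats (makes a copy)

print(ints.dtype.kind, floats.dtype, small.dtype)
print(floats.itemsize, small.itemsize)   # bytes per element: 8 and 4
```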

Let's consider other examples

In [40]:
# create a 2 x 2 numpy array
a = np.array([ [2,3], [4,5] ])
a

array([[2, 3],
       [4, 5]])

how many dimensions does it have?

In [41]:
a.ndim

2

what is its shape?

In [42]:
a.shape

(2, 2)

create an array that holds floating point values 0. to 29.

In [43]:
a = np.arange(29+1, dtype= np.float32)
a

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
       26., 27., 28., 29.], dtype=float32)

reshape the array into a 6x5 array (6 rows, 5 columns)

In [44]:

a = a.reshape((6,5)) # or change the shape attribute of a directly, e.g. a.shape = (6,5)
a

array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.],
       [25., 26., 27., 28., 29.]], dtype=float32)
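A reshape only works when the new shape holds exactly the same number of elements. The following sketch (with illustrative variable names) shows a valid reshape, the `-1` shortcut, and what happens when the sizes do not match.

```python
import numpy as np

values = np.arange(30, dtype=np.float32)

grid = values.reshape((6, 5))    # 6 rows x 5 columns = 30 elements
auto = values.reshape((-1, 5))   # -1 asks numpy to work out the number of rows

try:
    values.reshape((5, 5))       # 25 slots cannot hold 30 elements
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)

print(grid.shape, auto.shape)
```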

### Parameter `Axis`

Many numpy functions have a parameter called `axis`, which can be somewhat confusing at first:    
- to operate down each column (one result per column), set `axis = 0`    
- to operate across each row (one result per row), set `axis = 1`   
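One way to remember the rule above is that `axis` names the dimension that disappears from the result. A minimal sketch on a small 2x3 array (the names are illustrative):

```python
import numpy as np

m = np.arange(6).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]

col_sums = np.sum(m, axis=0)   # collapses the rows: one result per column
row_sums = np.sum(m, axis=1)   # collapses the columns: one result per row

print(col_sums.tolist())   # [3, 5, 7]
print(row_sums.tolist())   # [3, 12]
```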

find minimum of the entire array

In [45]:
np.amin(a)

np.float32(0.0)

find minimum across each row

In [46]:
np.amin(a, axis=1)

array([ 0.,  5., 10., 15., 20., 25.], dtype=float32)

sum all values in each column

In [47]:
np.sum(a, axis= 0)

array([75., 81., 87., 93., 99.], dtype=float32)

find the mean of each column

In [48]:
np.mean(a, axis= 0)

array([12.5, 13.5, 14.5, 15.5, 16.5], dtype=float32)

### **n**ot **a** **n**umber (`np.nan`)

**Floating point** numpy arrays have a special value, `np.nan`, to indicate that an entry is missing (a `null` value). It is important to be familiar with it, as many numpy operations return `nan` if any of the entries is `nan`. This has not been an issue so far because we have been generating most of our data, but it will be once we start reading files that are missing some values.
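The sketch below illustrates two points worth knowing before going further: only floating point arrays can hold a `nan`, and a `nan` has to be detected with `np.isnan` rather than with `==`. The array names are illustrative.

```python
import numpy as np

int_array = np.array([1, 2, 3])
try:
    int_array[0] = np.nan        # integers have no way to represent NaN
    int_assignment_failed = False
except ValueError:
    int_assignment_failed = True

float_array = np.array([1.0, 2.0, 3.0])
float_array[0] = np.nan          # fine for floating point arrays

# NaN never compares equal to anything, not even itself, so use np.isnan
print(np.nan == np.nan)          # False
print(np.isnan(float_array))     # True only for the first entry
print(np.sum(float_array))       # nan: the NaN propagates through the sum
```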

change the value at position (2,2) to NaN

In [49]:
# change entry 2,2 to a nan
a[2,2] = np.nan
a

array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., nan, 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.],
       [25., 26., 27., 28., 29.]], dtype=float32)

check whether the array contains any null value

In [50]:
# check if any entry is NaN (np.all would check whether all of them are)
np.any(np.isnan(a))

np.True_

check if any column contains a null value

In [51]:
# check each column for NaN values
np.any(np.isnan(a), axis=0)

array([False, False,  True, False, False])

calculate the mean for the entire array

In [52]:
# mean of the entire array
np.mean(a)

np.float32(nan)

In [53]:
# mean of each row
np.mean(a, axis= 1)

array([ 2.,  7., nan, 17., 22., 27.], dtype=float32)

Whenever the data contained a `nan`, we did not get a usable result. There are several ways to deal with this. Depending on the operation, you could substitute the `nan` with some value like 0, though this does not always work (e.g. if you are multiplying values, a 0 wipes out the product). Another way is to use a different version of the operation, one that ignores `nan` values.
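Both strategies can be sketched on a tiny array. Here `np.nan_to_num` and `np.where` do the substitution; the variable names are illustrative, and which replacement value is sensible depends on your data.

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0])

# Strategy 1: substitute the NaN with a value
as_zero = np.nan_to_num(data, nan=0.0)                       # replace NaN by 0 (returns a copy)
as_mean = np.where(np.isnan(data), np.nanmean(data), data)   # replace NaN by the mean of the rest

# Strategy 2: use the NaN-aware version of the operation
print(np.sum(as_zero), np.nansum(data))     # the totals agree: 4.0 4.0
print(np.mean(as_zero), np.nanmean(data))   # the means do not: the 0 pulls the first one down
```

Note that substituting 0 is harmless for a sum but changes the mean, which is why the `nan` versions of the functions are usually the safer choice.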

In [54]:
# mean of the entire array
np.nanmean(a)

np.float32(14.586206)

In [55]:
# mean of each row
np.nanmean(a, axis= 1)

array([ 2.,  7., 12., 17., 22., 27.], dtype=float32)

### Simple indexing

    
Items in a 2D array are referenced by their position:  *arr*\[**row**, **column**]

In [56]:
# restore item in a at 2,2 back to 12
a[2,2] = 12.0
a

array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.],
       [25., 26., 27., 28., 29.]], dtype=float32)

```{note}
Array positions start at 0.
```

return the item in the second row and third column

In [57]:
a[1,2]

np.float32(7.0)
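The sketch below rebuilds the same 6x5 array under the illustrative name `grid` and shows two related ways of reaching a single item: chained brackets and negative positions.

```python
import numpy as np

grid = np.arange(30, dtype=np.float32).reshape((6, 5))

second_row_third_col = grid[1, 2]   # row 1, column 2 (counting from 0)
same_item = grid[1][2]              # works too, but first extracts the whole row
last_item = grid[-1, -1]            # negative positions count from the end

print(second_row_third_col, same_item, last_item)
```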

### Slicing
Elements inside an `array` can be accessed and selected in many ways. Things can become complicated if you are dealing with arrays that have more than two dimensions. Luckily for us, this seldom occurs. Just like a `list`, a numpy array can be sliced.

```{image} ../../images/slice.png
:align: center
```

1. *start*: the position of the first item to be included in the slice.   
2. *stop*: the position where the slice ends. THIS ELEMENT IS NOT INCLUDED.    
3. *step*: how many positions we want to jump each time.    
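The three parts of a slice are easiest to see on a one-dimensional array. A minimal sketch (the names are illustrative):

```python
import numpy as np

v = np.arange(10)          # [0, 1, 2, ..., 9]

middle = v[2:7]            # start 2, stop 7 (position 7 is not included)
every_other = v[2:7:2]     # same range, jumping 2 positions at a time
first_three = v[:3]        # a missing start defaults to 0
backwards = v[::-1]        # a negative step walks the array in reverse

print(middle.tolist())       # [2, 3, 4, 5, 6]
print(every_other.tolist())  # [2, 4, 6]
print(first_three.tolist())  # [0, 1, 2]
print(backwards.tolist())    # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```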


return the top-left 3x3 values,

In [87]:
a[0:3,0:3]

array([[ 0.,  1.,  2.],
       [ 5.,  6.,  7.],
       [10., 11., 12.]], dtype=float32)

return the lower-right 3x3 values,

In [88]:
# return the lower-right 3x3 values
a[3:6,2:5]

array([[17., 18., 19.],
       [22., 23., 24.],
       [27., 28., 29.]], dtype=float32)

return the lower-right 3x3 values again, this time using negative positions,

In [89]:
# negative positions count from the end; same as a[3:6, 2:5]
a[-3:,-3:]

array([[17., 18., 19.],
       [22., 23., 24.],
       [27., 28., 29.]], dtype=float32)

As with Python lists, use a **colon** `':'` to represent an entire row/column.

return the entire second row

In [90]:
a[1,:]

array([5., 6., 7., 8., 9.], dtype=float32)

return the entire first row

In [91]:
a[0,:]

array([0., 1., 2., 3., 4.], dtype=float32)

return the entire last column

In [92]:
a[:,-1]

array([ 4.,  9., 14., 19., 24., 29.], dtype=float32)

return all rows from the third row (position 2) onwards, with all of their columns

In [93]:
a[2:,:]

array([[10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.],
       [25., 26., 27., 28., 29.]], dtype=float32)

return the entries in the first three rows (positions 0 to 2) and the first two columns (positions 0 and 1)

In [94]:
a[:3, :2]

array([[ 0.,  1.],
       [ 5.,  6.],
       [10., 11.]], dtype=float32)

return the entries from row position 3 onwards and column position 2 onwards

In [95]:
a[3:, 2:]

array([[17., 18., 19.],
       [22., 23., 24.],
       [27., 28., 29.]], dtype=float32)

return the block of items from 2,2 to 5,4 (the last row and column)

In [96]:
a[2:6, 2:5]

array([[12., 13., 14.],
       [17., 18., 19.],
       [22., 23., 24.],
       [27., 28., 29.]], dtype=float32)

generate flattened (one-dimensional) versions of an array

In [97]:
a.flatten() # copy, slower

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
       26., 27., 28., 29.], dtype=float32)

In [98]:
a.ravel() # view when possible, faster

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
       26., 27., 28., 29.], dtype=float32)
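The difference between a copy and a view matters as soon as you modify the result. The sketch below uses a small illustrative array and `np.shares_memory` to show that the output of `ravel` can share its data with the original, while the output of `flatten` never does.

```python
import numpy as np

grid = np.arange(6).reshape((2, 3))

flat_copy = grid.flatten()   # always a new, independent array
flat_view = grid.ravel()     # a view of the same memory here (grid is contiguous)

flat_view[0] = 100           # writes through to grid
flat_copy[1] = -1            # does not touch grid

print(grid[0, 0], grid[0, 1])              # 100 1
print(np.shares_memory(grid, flat_view))   # True
print(np.shares_memory(grid, flat_copy))   # False
```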

### Fancy Indexing

Numpy allows you to index arrays using Python lists or other Numpy arrays.

Return the items at each (r,c) position, where r and c are taken pairwise from the rows [0,3,4] and the columns [1,3,2].
That is, return the items at [0,1], [3,3] and [4,2].

In [70]:
rows = [0,3,4]
cols = [1,3,2]
a[rows,cols]

array([ 1., 18., 22.], dtype=float32)
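Note that the two lists are paired item by item; they are not combined in every possible way. If you do want every listed row crossed with every listed column, one option is `np.ix_`. A short sketch, rebuilding the same 6x5 array under the illustrative name `grid`:

```python
import numpy as np

grid = np.arange(30, dtype=np.float32).reshape((6, 5))
rows = [0, 3, 4]
cols = [1, 3, 2]

paired = grid[rows, cols]          # items (0,1), (3,3), (4,2) -> 3 values
block = grid[np.ix_(rows, cols)]   # every row with every column -> 3 x 3 block

print(paired.tolist())   # [1.0, 18.0, 22.0]
print(block.shape)       # (3, 3)
```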

We can build more sophisticated selection schemes using [comparison and logical operators](https://www.w3schools.com/python/python_operators.asp). We will discuss some of them later in the course.

In [71]:
# create a number generator
ng = np.random.default_rng(2021)

create a numpy array with 10 rows and 2 columns full of random values between 0 and 1

In [72]:
b = ng.random( (10,2))
b

array([[0.75694783, 0.94138187],
       [0.59246304, 0.31884171],
       [0.62607384, 0.03551387],
       [0.25212696, 0.48501366],
       [0.30126688, 0.72195094],
       [0.92989066, 0.90823581],
       [0.08495856, 0.27036617],
       [0.97168176, 0.26015007],
       [0.80052031, 0.79827865],
       [0.6425692 , 0.66806864]])

flag all of the entries that are greater than 0.5

In [73]:
gt05 = b > 0.5
gt05

array([[ True,  True],
       [ True, False],
       [ True, False],
       [False, False],
       [False,  True],
       [ True,  True],
       [False, False],
       [ True, False],
       [ True,  True],
       [ True,  True]])

Numpy returns a [boolean](https://www.w3schools.com/python/python_booleans.asp) array as a result. This is an array whose entries can only have two values, **True** or **False**. Because this array has the same shape as b, we are able to use it as an index of b. Next, we return the values that are greater than 0.5.

In [74]:
b[gt05]

array([0.75694783, 0.94138187, 0.59246304, 0.62607384, 0.72195094,
       0.92989066, 0.90823581, 0.97168176, 0.80052031, 0.79827865,
       0.6425692 , 0.66806864])

This in turn can allow us to do more sophisticated selections,

In [75]:
criterion1 = b > 0.25
criterion1

array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [False,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True]])

In [76]:
criterion2 = b < 0.75
criterion2

array([[False, False],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [False, False],
       [ True,  True],
       [False,  True],
       [False, False],
       [ True,  True]])

In [77]:
criteria = criterion1 & criterion2
criteria

array([[False, False],
       [ True,  True],
       [ True, False],
       [ True,  True],
       [ True,  True],
       [False, False],
       [False,  True],
       [False,  True],
       [False, False],
       [ True,  True]])

In [78]:
# equivalent to criterion1 & criterion2
np.logical_and(criterion1, criterion2)

array([[False, False],
       [ True,  True],
       [ True, False],
       [ True,  True],
       [ True,  True],
       [False, False],
       [False,  True],
       [False,  True],
       [False, False],
       [ True,  True]])

In [79]:
b[criteria]

array([0.59246304, 0.31884171, 0.62607384, 0.25212696, 0.48501366,
       0.30126688, 0.72195094, 0.27036617, 0.26015007, 0.6425692 ,
       0.66806864])

We can combine all of the above in a single line,

In [80]:
# return all entries that are greater than 0.25 and less than 0.75
b[ (b>0.25) & (b<0.75)]

array([0.59246304, 0.31884171, 0.62607384, 0.25212696, 0.48501366,
       0.30126688, 0.72195094, 0.27036617, 0.26015007, 0.6425692 ,
       0.66806864])
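Besides `&` (and), boolean arrays can be combined with `|` (or) and flipped with `~` (not). The parentheses around each comparison are required, because these operators are evaluated before `>` and `<`. A minimal sketch on a small illustrative array:

```python
import numpy as np

b = np.array([0.10, 0.30, 0.60, 0.90])

inside = (b > 0.25) & (b < 0.75)      # AND: both conditions hold
outside = (b <= 0.25) | (b >= 0.75)   # OR: at least one condition holds
also_outside = ~inside                # NOT: flips every True/False

print(inside.sum())        # True counts as 1, so this counts the matches: 2

clipped = b.copy()
clipped[outside] = 0.0     # a mask can also select the entries to overwrite
print(clipped.tolist())    # [0.0, 0.3, 0.6, 0.0]
```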

### Array operations

One of the largest benefits of using arrays is that we can do operations *with* them, applying a calculation to every element at once. As a result we are able to do many calculations in a fraction of a second.

Let's consider the following operation:
$$ area = width \cdot height$$

In [81]:
# select a large number
N= 1000000

In [82]:
# create a number generator
rg= np.random.default_rng(2023)

# one million random widths between 5 and 25
width = rg.uniform(5, 25, size= N)

# one million random heights between 10 and 100
height = rg.uniform(10, 100, size= N)

# calculate all of the areas at once
area = width * height

len(width), len(height), len(area)

(1000000, 1000000, 1000000)

In [83]:
# Show the first 20 areas
area[0:20]

array([ 265.19134798,  238.57331715,  674.04844159,  412.1534066 ,
       1590.02215564, 1180.29305652, 1938.54652793,  857.93025482,
        400.40868646,  319.8115903 ,  849.23933716, 1151.01637437,
        491.90510865, 1206.33553037,  639.03426375,  264.63747376,
        489.50078888, 1046.18099141,  657.64652892,  919.33889369])

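Compare the above calculation with the following one, which uses a for-loop to compute one area at a time. It is much slower. Also, although it uses the same seed, the areas differ from the ones above because here the widths and heights are drawn alternately from the generator.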

In [84]:
# create a number generator
rg= np.random.default_rng(2023)

# initialize area list to an empty list
area2=[]

for i in range(N):
    # one random width between 5 and 25
    width = rg.uniform(5, 25, 1)

    # one random height between 10 and 100
    height = rg.uniform(10, 100, 1)

    # calculate the area and add it to the list
    temp = width * height
    area2.append(temp[0])

In [85]:
area2[:20]

[np.float64(201.74822114592902),
 np.float64(362.20463045605027),
 np.float64(1120.8141987479803),
 np.float64(1267.681410784116),
 np.float64(540.5322602417758),
 np.float64(1306.5456880430643),
 np.float64(564.2977038789402),
 np.float64(934.6076464376883),
 np.float64(781.2913099338399),
 np.float64(1091.6196496595753),
 np.float64(981.6887538544491),
 np.float64(1308.6533310301277),
 np.float64(873.0606154138706),
 np.float64(1718.6202277814525),
 np.float64(725.6347420444179),
 np.float64(2060.7186795472016),
 np.float64(1539.426532206244),
 np.float64(525.3725149370016),
 np.float64(2107.389651154456),
 np.float64(439.4605647072959)]
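To see the difference in speed for yourself, you can time both approaches on the same data. The sketch below uses `time.perf_counter` and a smaller `n` than above so that the loop finishes quickly. The exact timings depend on your machine, so none are shown here.

```python
import time
import numpy as np

n = 100_000   # smaller than the N used above so the loop finishes quickly
rng = np.random.default_rng(2023)
width = rng.uniform(5, 25, size=n)
height = rng.uniform(10, 100, size=n)

# vectorized: one operation on the whole arrays
start = time.perf_counter()
area_vectorized = width * height
vectorized_seconds = time.perf_counter() - start

# loop: one multiplication per pair of values
start = time.perf_counter()
area_loop = [width[i] * height[i] for i in range(n)]
loop_seconds = time.perf_counter() - start

print(f"vectorized: {vectorized_seconds:.5f} s, loop: {loop_seconds:.5f} s")
print(np.allclose(area_vectorized, area_loop))   # both approaches give the same areas
```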

There is a **very large** number of operations that you can do on [arrays](https://numpy.org/doc/stable/reference/index.html). There is a good chance that most of the operations you may ever want to use (statistical, logical, and so on) already exist. So before you write any code, you may want to search the Numpy documentation or the web.

In [86]:
print('\nMean, median and standard deviation of areas', np.mean(area), np.median(area), np.std(area))
print('\n25th and 75th percentile of the area', np.percentile(area, [25,75]))
print('\nMinimum and maximum of area', np.min(area), np.max(area))


Mean, median and standard deviation of areas 823.8399779422148 706.1332157576957 524.0912191467664

25th and 75th percentile of the area [ 402.54295455 1157.8861751 ]

Minimum and maximum of area 50.312094665273946 2496.6397983577444
