<table border="0" style="width:100%">
 <tr>
    <td>
        <img src="https://static-frm.ie.edu/university/wp-content/uploads/sites/6/2022/06/IE-University-logo.png" width=150>
     </td>
    <td><div style="font-family:'Courier New'">
            <div style="font-size:25px">
                <div style="text-align: right"> 
                    <b> MASTER IN BIG DATA</b>
                    <br>
                    Python for Data Analysis II
                    <br><br>
                    <em> Daniel Sierra Ramos </em>
                </div>
            </div>
        </div>
    </td>
 </tr>
</table>

# **S05: NUMPY**

## What is Numpy

![](https://github.com/numpy/numpy/raw/main/branding/logo/primary/numpylogo.svg)

Numpy is a numerical computing library for Python.
- Most of numpy's functionality is written in C
- Array operations are often much faster than the equivalent loops over Python lists (figures around 50x are commonly quoted, but the actual speedup depends on the operation and the data size)
- The main data structure in numpy is the multidimensional array
    - Vectors (1 dimension)
    - Matrices (2 dimensions)
    - Tensors (>2 dimensions)
- It is memory efficient because, unlike Python lists, numpy arrays store their elements contiguously in memory, which improves memory access efficiency.
- It's open source. Here is the code: https://github.com/numpy/numpy

To use the numpy library, we first have to import it with

```python
import numpy as np
```

In [2]:
import numpy as np

## Lists -vs- Numpy Arrays

The main difference between lists and numpy arrays is the way in which the inner elements are stored.
 - A Python `list` is a collection of pointers to each of the elements that belong to the list.
 - An `ndarray` is a sequence of elements stored contiguously in memory
 
![](https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png)

A Python list is much **more flexible** because it can store any kind of data, while a numpy array cannot: all of its elements share a single data type. In return, the numpy array is **more efficient** for storing and manipulating data.
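The trade-off described above can be inspected directly. The following is a minimal sketch; the variable names `mixed` and `arr` are illustrative, and the dtype is fixed to `int64` explicitly so the byte counts do not depend on the platform.

```python
import numpy as np

# A list can mix types; each slot is a reference to a separate Python object.
mixed = [1, 2.5, "three", None]

# An array has a single dtype, and its elements sit side by side in one buffer.
arr = np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)

print(arr.dtype)                  # int64
print(arr.itemsize)               # 8 (bytes per element)
print(arr.nbytes)                 # 48 (6 elements * 8 bytes)
print(arr.flags["C_CONTIGUOUS"])  # True
```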

## Arrays in Numpy

In [3]:
# Create arrays from lists

L = [1,2,3,4,5,6]
aL = np.array(L)

In [4]:
type(L)

list

In [5]:
type(aL)

numpy.ndarray

In [6]:
print(aL)

[1 2 3 4 5 6]


In [7]:
aL.dtype

dtype('int32')

> ***EXAMPLE 1***
>
> Create an array from list `L=[1,2,3,"A","B"]`
> - What happened to the data types in the resulting array?

In [8]:
L=[1,2,3,"A","B"]

In [9]:
aL = np.array(L)

In [10]:
aL

array(['1', '2', '3', 'A', 'B'], dtype='<U11')
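The result above follows from the single-dtype rule: numpy picks one type that can represent every element, so the integers are converted to strings. A short sketch of this behaviour follows; passing `dtype=object` is shown only as one possible way to keep the original Python objects, at the cost of most of numpy's efficiency.

```python
import numpy as np

floats = np.array([1, 2.5, 3])                         # ints promoted to float
strings = np.array([1, 2, 3, "A", "B"])                # everything becomes a string
objects = np.array([1, 2, 3, "A", "B"], dtype=object)  # keeps Python objects

print(floats.dtype)        # float64
print(strings.dtype.kind)  # U (Unicode string)
print(objects.dtype)       # object
print(type(objects[0]))    # <class 'int'>
```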

## Numpy API reference
https://numpy.org/doc/stable/reference/

## Create arrays from scratch
https://numpy.org/doc/stable/reference/routines.array-creation.html

The most common functions are:

|Function|Description|
|-|-|
|`np.zeros`|Creates an array of zeros|
|`np.ones`|Creates an array of ones|
|`np.full`|Creates an array filled with the given value| 
|`np.arange`|Analogous to the `range` built-in function|

In [11]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [14]:
np.ones(4)

array([1., 1., 1., 1.])

In [15]:
np.full(5, fill_value=23)

array([23, 23, 23, 23, 23])

In [16]:
np.arange(1, 10)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [12]:
np.random.random(10)

array([0.25725336, 0.93213681, 0.68094552, 0.62961932, 0.32661102,
       0.65632093, 0.48236255, 0.30286024, 0.69704257, 0.46623954])
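The creation functions above also accept a shape tuple and, in most cases, a `dtype` argument. A brief sketch, with variable names chosen only for illustration:

```python
import numpy as np

zeros_2d = np.zeros((2, 3))              # 2 rows, 3 columns of 0.0
ones_int = np.ones(4, dtype=int)         # integer ones instead of floats
filled = np.full((2, 2), fill_value=23)  # 2-by-2 matrix filled with 23
evens = np.arange(0, 10, 2)              # start, stop (exclusive), step
rand_2d = np.random.random((2, 3))       # uniform samples in [0, 1)

print(zeros_2d.shape)  # (2, 3)
print(ones_int)        # [1 1 1 1]
print(evens)           # [0 2 4 6 8]
```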

## N-dimensional arrays

In Numpy, multidimensional data structures are efficient and straightforward to manipulate.

In [18]:
vector = np.array([1,2,3,4,5,6,7,8,9])
print(vector)
print(f"Shape: {vector.shape}")

[1 2 3 4 5 6 7 8 9]
Shape: (9,)


In [13]:
matrix = np.array([[1,2,3,4,5,6,7,8,9], [1,2,3,4,5,6,7,8,9]])
print(matrix)
print(f"Shape: {matrix.shape}")

[[1 2 3 4 5 6 7 8 9]
 [1 2 3 4 5 6 7 8 9]]
Shape: (2, 9)


In [14]:
tensor = np.array([[[1,2,3,4,5,6,7,8,9], [1,2,3,4,5,6,7,8,9]], [[1,2,3,4,5,6,7,8,9], [1,2,3,4,5,6,7,8,9]]])
print(tensor)
print(f"Shape: {tensor.shape}")

[[[1 2 3 4 5 6 7 8 9]
  [1 2 3 4 5 6 7 8 9]]

 [[1 2 3 4 5 6 7 8 9]
  [1 2 3 4 5 6 7 8 9]]]
Shape: (2, 2, 9)
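Besides `shape`, every array exposes `ndim` (the number of axes) and `size` (the total number of elements). The sketch below uses smaller arrays than the cells above so the numbers are easy to check by hand.

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3], [4, 5, 6]])
tensor = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

for name, arr in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    # ndim = number of axes, shape = length of each axis, size = total elements
    print(name, arr.ndim, arr.shape, arr.size)
# vector 1 (3,) 3
# matrix 2 (2, 3) 6
# tensor 3 (2, 2, 3) 12
```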


In [15]:
# identity matrix
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

## The `np.random` module
Random variables: https://numpy.org/doc/stable/reference/random/generator.html#distributions

> ***EXAMPLE 2***
>
> Draw 10K samples from the following univariate Gaussian distribution
> $$d\sim N(0,1)$$

In [16]:
d = np.random.normal(0, 1, size=10000)

In [17]:
d.shape

(10000,)
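The documentation linked at the top of this section describes the `Generator` interface, which can draw the same kind of sample. The following is a sketch of that alternative, not a replacement for the cell above; the seed value 42 is an arbitrary choice made here so that the run is reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility

d = rng.normal(loc=0, scale=1, size=10_000)

print(d.shape)            # (10000,)
print(d.mean(), d.std())  # close to 0 and 1, but not exactly
```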

> ***EXAMPLE 3***
>
> Draw 10K samples from the following bivariate Gaussian distribution
> $$d\sim N(\mathbf{0},\mathbf{I})$$

In [18]:
d = np.random.multivariate_normal([0,0], np.eye(2), size=10000)

In [19]:
d

array([[ 0.39306392, -0.74254578],
       [-0.74110313,  0.36648224],
       [-1.24433849, -1.39381916],
       ...,
       [-1.50267327,  0.2229117 ],
       [-0.30380493, -1.04866362],
       [-0.80986956, -2.19197741]])
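One way to sanity-check a sample like this is to compare its empirical mean and covariance with the parameters that were requested. A minimal sketch, again using the `Generator` interface with an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
d = rng.multivariate_normal(mean=[0, 0], cov=np.eye(2), size=10_000)

print(d.shape)                  # (10000, 2): one row per sample, one column per variable
print(d.mean(axis=0))           # both entries close to 0
print(np.cov(d, rowvar=False))  # close to the 2-by-2 identity matrix
```

`rowvar=False` tells `np.cov` that each column, not each row, is a variable.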

## Indexing and slicing

### Accessing single elements

In [20]:
x = np.array([1,2,3,4,5,6,7,8,9])

In [21]:
x[3]

4

In [22]:
X = np.array([[1,2,3], [4,5,6]])

In [23]:
X

array([[1, 2, 3],
       [4, 5, 6]])

In [24]:
# access by row and column
X[0,2]

3
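Indices are zero-based, and negative indices count from the end, as with Python lists. A short sketch of a few equivalent and related forms:

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])

print(X[0, 2])    # 3, first row, third column
print(X[0][2])    # 3, same element reached in two indexing steps
print(X[-1, -1])  # 6, last row, last column
print(X[1])       # [4 5 6], a single index selects a whole row
```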

> ***EXAMPLE 4***
>
> Create a 10-by-10 random matrix and select the element in the 6th row / 7th column

In [25]:
d = np.random.multivariate_normal(np.zeros(10), np.eye(10), size=10)

In [26]:
d[5,6]

-0.3816946486424704
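Because indexing starts at 0, the 6th row and 7th column correspond to indices 5 and 6. The sketch below makes this visible with a deterministic matrix (the name `grid` is illustrative) in which each value encodes its own position.

```python
import numpy as np

# grid[i, j] == 10 * i + j, so the value reveals the row and column index
grid = np.arange(100).reshape(10, 10)

print(grid[5, 6])  # 56: row index 5 (6th row), column index 6 (7th column)
```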

### Accessing subarrays

In [27]:
d = np.random.multivariate_normal([0,0], np.eye(2), size=10000)

In [28]:
d

array([[ 0.94661764,  1.4659761 ],
       [-1.00121841, -2.12763245],
       [ 0.36264015, -1.58072496],
       ...,
       [ 0.07390019,  0.78427233],
       [-0.69968362,  2.0578795 ],
       [ 0.21860964, -0.24750858]])

In [29]:
# select the first column
d[:,0]

array([ 0.94661764, -1.00121841,  0.36264015, ...,  0.07390019,
       -0.69968362,  0.21860964])

In [30]:
# select the second column
d[:,1]

array([ 1.4659761 , -2.12763245, -1.58072496, ...,  0.78427233,
        2.0578795 , -0.24750858])

In [31]:
# select the third row
d[2,:]

array([ 0.36264015, -1.58072496])

In [32]:
# select the second column, starting from the 20th row
d[19:,1]

array([-0.08516494,  0.6339979 , -0.2940304 , ...,  0.78427233,
        2.0578795 , -0.24750858])

In [33]:
# select the second column, starting from the 20th row and taking every second row
d[19::2,1]

array([-0.08516494, -0.2940304 , -1.27031711, ..., -0.90180941,
        0.78427233, -0.24750858])
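The same `start:stop:step` slices are easier to follow on a small deterministic matrix. A sketch, where `M` is an illustrative 4-by-3 array:

```python
import numpy as np

M = np.arange(12).reshape(4, 3)
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]]

print(M[:, 0])     # [0 3 6 9], first column
print(M[2, :])     # [6 7 8], third row
print(M[1:, 1])    # [ 4  7 10], second column from the 2nd row onwards
print(M[1::2, 1])  # [ 4 10], same, taking every second row
```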

> ***EXAMPLE 5***
>
> Given the matrix generated with `d = np.random.multivariate_normal([0,0,0,0], np.eye(4), size=10000)`
>
> Select the following submatrix
> - The second and third columns, and every third row starting from the 30th

In [34]:
d = np.random.multivariate_normal([0,0,0,0], np.eye(4), size=10000)

In [49]:
d[29::3,1:3]

array([[-1.59756835, -0.15430683],
       [-0.68490674, -0.17072197],
       [ 0.52346258,  0.53902201],
       ...,
       [ 0.6569955 ,  0.62993447],
       [-0.05122118, -0.52798257],
       [-1.95052798,  0.30303867]])

In [48]:
d[29::3,[1,2]]

array([[-1.59756835, -0.15430683],
       [-0.68490674, -0.17072197],
       [ 0.52346258,  0.53902201],
       ...,
       [ 0.6569955 ,  0.62993447],
       [-0.05122118, -0.52798257],
       [-1.95052798,  0.30303867]])
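The two solutions return the same values, but they are not interchangeable in every respect: a slice such as `1:3` yields a view onto the original data, while a list of indices such as `[1, 2]` yields a copy. The sketch below checks this with `np.shares_memory`; the data is generated with the `Generator` interface and an arbitrary seed rather than with the call used in the exercise.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed
d = rng.normal(size=(10_000, 4))

by_slice = d[29::3, 1:3]    # basic slicing
by_list = d[29::3, [1, 2]]  # integer-list indexing

print(np.array_equal(by_slice, by_list))  # True, same values
print(by_slice.shape)                     # (3324, 2)
print(np.shares_memory(d, by_slice))      # True, a view of d
print(np.shares_memory(d, by_list))       # False, an independent copy
```

In practice this means that assigning into `by_slice` would also modify `d`, whereas assigning into `by_list` would not.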

## Operations

In [50]:
A = np.array([1,2,3])
B = np.array([
    [1,2,1],
    [3,4,2],
    [8,1,6]
])

In [51]:
A

array([1, 2, 3])

In [54]:
B

array([[1, 2, 1],
       [3, 4, 2],
       [8, 1, 6]])

### Scalar operations

In [52]:
A + 5

array([6, 7, 8])

In [53]:
B + 5

array([[ 6,  7,  6],
       [ 8,  9,  7],
       [13,  6, 11]])

In [55]:
B / 6

array([[0.16666667, 0.33333333, 0.16666667],
       [0.5       , 0.66666667, 0.33333333],
       [1.33333333, 0.16666667, 1.        ]])

In [56]:
np.sqrt(A)

array([1.        , 1.41421356, 1.73205081])
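Each of the scalar operations above is applied to every element without an explicit loop. A small sketch comparing the vectorized form with a list comprehension, and showing how the result type depends on the operator:

```python
import numpy as np

A = np.array([1, 2, 3])

vectorized = A + 5
looped = np.array([a + 5 for a in A])      # explicit Python loop, same result
print(np.array_equal(vectorized, looped))  # True

print((A / 2).dtype)  # float64, true division always returns floats
print(A // 2)         # [0 1 1], floor division keeps integers
```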

### Matrix operations

In [57]:
A + B

array([[2, 4, 4],
       [4, 6, 5],
       [9, 3, 9]])

In [58]:
# element-wise multiplication
A * B

array([[ 1,  4,  3],
       [ 3,  8,  6],
       [ 8,  2, 18]])
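`A` has shape `(3,)` and `B` has shape `(3, 3)`, yet `A + B` and `A * B` both work. Numpy broadcasts `A` across the rows of `B`, as if `A` were repeated three times. The sketch below makes that stretched array explicit with `np.broadcast_to`, purely for illustration.

```python
import numpy as np

A = np.array([1, 2, 3])
B = np.array([[1, 2, 1], [3, 4, 2], [8, 1, 6]])

A_stretched = np.broadcast_to(A, B.shape)
print(A_stretched)
# [[1 2 3]
#  [1 2 3]
#  [1 2 3]]

print(np.array_equal(A + B, A_stretched + B))  # True
print(np.array_equal(A * B, A_stretched * B))  # True
```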

In [59]:
# dot product
B.dot(A)

array([ 8, 17, 28])
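Here `B.dot(A)` is a matrix-vector product: each output entry is the dot product of one row of `B` with `A`. A sketch of equivalent ways to write it, plus a reminder that the order of the operands matters:

```python
import numpy as np

A = np.array([1, 2, 3])
B = np.array([[1, 2, 1], [3, 4, 2], [8, 1, 6]])

print(B.dot(A))             # [ 8 17 28]
print(B @ A)                # [ 8 17 28], the @ operator does the same
print((B * A).sum(axis=1))  # [ 8 17 28], multiply each row by A, then sum it
print(A @ B)                # [31 13 23], a different product
```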

In [60]:
# transposing a matrix
B.T

array([[1, 3, 8],
       [2, 4, 1],
       [1, 2, 6]])

In [61]:
B

array([[1, 2, 1],
       [3, 4, 2],
       [8, 1, 6]])

In [62]:
# inverse of a matrix
np.linalg.inv(B)

array([[-2.00000000e+00,  1.00000000e+00, -4.68375339e-17],
       [ 1.81818182e-01,  1.81818182e-01, -9.09090909e-02],
       [ 2.63636364e+00, -1.36363636e+00,  1.81818182e-01]])
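The entry `-4.68375339e-17` in the result above is floating-point round-off for an exact value of 0. For the same reason, an inverse is best verified with `np.allclose` rather than `==`. A minimal sketch:

```python
import numpy as np

B = np.array([[1, 2, 1], [3, 4, 2], [8, 1, 6]])
B_inv = np.linalg.inv(B)

print(np.linalg.det(B))                   # approximately -11, nonzero so B is invertible
print(np.allclose(B @ B_inv, np.eye(3)))  # True, up to round-off
print(np.allclose(B_inv @ B, np.eye(3)))  # True
```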

In [63]:
# QR decomposition of a matrix
Q,R = np.linalg.qr(B)
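The QR decomposition factors `B` into an orthogonal matrix `Q` and an upper-triangular matrix `R` such that `B = QR`. A short sketch that checks these three properties numerically:

```python
import numpy as np

B = np.array([[1, 2, 1], [3, 4, 2], [8, 1, 6]])
Q, R = np.linalg.qr(B)

print(np.allclose(Q @ R, B))            # True, the factors reproduce B
print(np.allclose(Q.T @ Q, np.eye(3)))  # True, the columns of Q are orthonormal
print(np.allclose(R, np.triu(R)))       # True, R is upper triangular
```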