<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60"></center>

AI TECH - Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (Academy of Innovative Applications of Digital Technologies). Operational Programme Digital Poland 2014-2020
<hr>

<center><img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'></center>

<center>
Project co-financed by the European Union under the European Regional Development Fund
Operational Programme Digital Poland 2014-2020,
Priority Axis 3 "Digital competences of society", Measure 3.2 "Innovative solutions for digital activation"
Project title: "Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)"
    </center>

# Statistical machine learning - Notebook 1

**Author: Maciej Bartczak**



# Bootcamp - introduction to machine learning: NumPy & Pandas


## What is NumPy?

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, basic linear algebra, statistical operations, random simulation and much more.

At the core of the NumPy package, is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance. There are several important differences between NumPy arrays and the standard Python sequences:

- The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.

- NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, **such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences**.

- A growing plethora of scientific and mathematical Python-based packages are using NumPy arrays; though these typically support Python-sequence input, they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays. In other words, in order to efficiently use much (perhaps even most) of today’s scientific/mathematical Python-based software, just knowing how to use Python’s built-in sequence types is insufficient - one also needs to know how to use NumPy arrays.



from: https://numpy.org/doc/stable/user/whatisnumpy.html
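As a small illustration of the "less code" claim above, the sketch below adds two sequences elementwise, first with plain Python lists and then with NumPy arrays. The variable names and values are made up for this example.

```python
import numpy as np

# Two small sequences of numbers (illustrative values).
list_a = [1, 2, 3]
list_b = [10, 20, 30]

# Plain Python: `+` concatenates lists, so elementwise addition needs a loop.
concatenated = list_a + list_b
list_sum = [p + q for p, q in zip(list_a, list_b)]

# NumPy: `+` is applied elementwise in compiled code.
array_sum = np.array(list_a) + np.array(list_b)

print(concatenated)
print(list_sum)
print(array_sum)
```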

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## Array creation

There are multiple ways to create an array. Try the examples below.

### From nested lists

In [2]:
array_1d = np.array([1, 2, 3, 4, 5])
array_1d

array([1, 2, 3, 4, 5])

In [3]:
array_2d = np.array(
  [
    [1,  2,  3,  4,  5 ],
    [6,  7,  8,  9,  10],
    [11, 12, 13, 14, 15]
  ]
)
array_2d

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

In [4]:
array_3d = np.array(
  [
    [
      [1,  2,  3,  4,  5 ],
      [6,  7,  8,  9,  10],
      [11, 12, 13, 14, 15],
    ],
    [
      [16, 17, 18, 19, 20],
      [21, 22, 23, 24, 25],
      [26, 27, 28, 29, 30],
    ]
  ]
)
array_3d

array([[[ 1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10],
        [11, 12, 13, 14, 15]],

       [[16, 17, 18, 19, 20],
        [21, 22, 23, 24, 25],
        [26, 27, 28, 29, 30]]])

### Using zeros / ones / full



In [5]:
shape = (2, 3)

np.zeros(shape)

array([[0., 0., 0.],
       [0., 0., 0.]])

In [6]:
np.ones(shape)

array([[1., 1., 1.],
       [1., 1., 1.]])

In [7]:
fill_value = 42
np.full(shape, fill_value)

array([[42, 42, 42],
       [42, 42, 42]])

### Using zeros_like / ones_like / full_like

In [8]:
np.zeros_like(array_2d)

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

In [9]:
np.ones_like(array_2d)

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

In [10]:
fill_value = 42
np.full_like(array_2d, fill_value)

array([[42, 42, 42, 42, 42],
       [42, 42, 42, 42, 42],
       [42, 42, 42, 42, 42]])

### Using arange / linspace (1D arrays only)

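which are used to fill an array with evenly spaced values: `arange` takes a step and works on the half-open interval $[start, stop)$, while `linspace` takes the number of points and by default includes both endpoints of $[start, stop]$.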

In [12]:
start = 0
stop  = 1
step  = 0.1

arranged = np.arange(start, stop, step)
arranged

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

In [13]:
np.arange(start, stop + 1e-4, step)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

In [14]:
n_steps = int((stop-start)/step) + 1
np.linspace(start, stop, n_steps)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
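The three cells above can be condensed into one sketch comparing the two functions on the same interval. The `endpoint=False` variant of `linspace` is not used elsewhere in this notebook and is included here only for comparison.

```python
import numpy as np

start, stop, step = 0, 1, 0.1

# arange works with a step and excludes `stop`.
with_arange = np.arange(start, stop, step)

# linspace takes the number of points and includes `stop` by default.
with_linspace = np.linspace(start, stop, 11)

# endpoint=False drops `stop`, giving the same grid as arange above.
without_endpoint = np.linspace(start, stop, 10, endpoint=False)

print(len(with_arange), len(with_linspace), len(without_endpoint))
print(np.allclose(with_arange, without_endpoint))
```

With a floating point step, `linspace` is usually the safer choice when the number of points matters, because the length of an `arange` result depends on rounding of `(stop - start) / step`.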

### Using identity / diag (2D arrays only)
 to create identity or diagonal (with optional offset) matrices.

In [15]:
n = 4
np.identity(n)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [16]:
np.diag([1, 2, 3, 4])

array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])

In [20]:
offset = 1
np.diag([1, 2, 3], offset)

array([[0, 1, 0, 0],
       [0, 0, 2, 0],
       [0, 0, 0, 3],
       [0, 0, 0, 0]])

### Using meshgrid
For 1D arrays X_1, ..., X_n meshgrid behavior is specified as follows

```
G_1, ..., G_n = np.meshgrid(X_1, ..., X_n, indexing='ij')
G_j[i_1, i_2, ..., i_n] = X_j[i_j]

G_1, ..., G_n = np.meshgrid(X_1, ..., X_n, indexing='xy')  # default indexing
G_j[i_2, i_1, i_3, ..., i_n] = X_j[i_j]
```

which facilitates creating coordinate grids,




In [21]:
x = np.array([1, 2, 3], dtype=int)
y = np.array([4, 5, 6, 7], dtype=int)

x_grid, y_grid = np.meshgrid(x, y, indexing='ij')

such as:

In [22]:
x_grid

array([[1, 1, 1, 1],
       [2, 2, 2, 2],
       [3, 3, 3, 3]])

In [23]:
y_grid

array([[4, 5, 6, 7],
       [4, 5, 6, 7],
       [4, 5, 6, 7]])

In [24]:
x_grid.shape == y_grid.shape == (len(x), len(y))

True
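To make the difference between the two indexing modes concrete, here is a minimal sketch that builds both grids from the same inputs and compares their shapes. The names `x_ij`, `x_xy` and so on are chosen for this example only.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6, 7])

# 'ij' (matrix) indexing: the first axis follows x, the second follows y.
x_ij, y_ij = np.meshgrid(x, y, indexing='ij')

# 'xy' (Cartesian, default) indexing: the first two axes are swapped.
x_xy, y_xy = np.meshgrid(x, y)

print(x_ij.shape, x_xy.shape)
print(np.array_equal(x_xy, x_ij.T), np.array_equal(y_xy, y_ij.T))
```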

### Using tile & repeat
to create an array filled with a repeating pattern.

In [27]:
array = np.array([0,1,2])
array

array([0, 1, 2])

In [26]:
np.tile(array, 2)

array([0, 1, 2, 0, 1, 2])

In [30]:
np.tile(array, (2, 1))

array([[0, 1, 2],
       [0, 1, 2]])

In [31]:
np.repeat(array, 2)

array([0, 0, 1, 1, 2, 2])

In [32]:
np.repeat(array, 2, axis=0)

array([0, 0, 1, 1, 2, 2])

In [33]:
np.repeat(array.reshape(1, -1), 2, axis=0)

array([[0, 1, 2],
       [0, 1, 2]])

In [34]:
np.repeat(array.reshape(1, -1), 2, axis=1)

array([[0, 0, 1, 1, 2, 2]])

## NumPy dtypes

The full list of NumPy dtypes is available here: https://numpy.org/doc/stable/user/basics.types.html

Among them are 5 basic numerical types representing booleans (bool), integers (int), unsigned integers (uint), floating point numbers (float) and complex numbers (complex).

Data types can be passed as the `dtype` keyword argument that many NumPy functions and methods accept, in particular the array creation routines.

See examples below.

In [35]:
np.array([1, 2, 3, 4, 5], dtype=int)

array([1, 2, 3, 4, 5])

In [36]:
np.array([1, 2, 3, 4, 5], dtype=float)

array([1., 2., 3., 4., 5.])

In [37]:
np.array([1, 2, 3, 4, 5], dtype=complex)

array([1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j])

Types can be cast as follows

In [38]:
float_array   = np.array([1, 2, 3, 4, 5], dtype=float)
complex_array = float_array.astype(complex)

float_array, float_array.dtype

(array([1., 2., 3., 4., 5.]), dtype('float64'))

In [39]:
complex_array, complex_array.dtype

(array([1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j]), dtype('complex128'))

## Saving / loading data

### Using savetxt / loadtxt

to process a single 1D/2D array in a human-readable format. A file created with `savetxt` can be easily opened in a text editor

In [40]:
data = np.array(
  [
    [1, 2],
    [3, 4],
  ]
)

np.savetxt('data.txt', data)
!cat data.txt

1.000000000000000000e+00 2.000000000000000000e+00
3.000000000000000000e+00 4.000000000000000000e+00


as well as seamlessly loaded back with `loadtxt`.


In [41]:
np.loadtxt('data.txt')

array([[1., 2.],
       [3., 4.]])

File format can be specified, see https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html for reference

In [42]:
np.savetxt('data.txt', data, fmt='%.2f')
!cat data.txt

1.00 2.00
3.00 4.00


In [43]:
np.savetxt('data.txt', data, fmt='%d')
!cat data.txt

1 2
3 4


Note, however, that information about the dtype is not preserved: `loadtxt` returns floats unless `dtype` is given.

In [44]:
np.loadtxt('data.txt')

array([[1., 2.],
       [3., 4.]])

In [45]:
np.loadtxt('data.txt', dtype=int)

array([[1, 2],
       [3, 4]])

`savetxt` can also be used to save an array in CSV format; we only need to specify the proper delimiter

In [46]:
np.savetxt('data.csv', data, delimiter=',', fmt='%.2f')
!cat data.csv

1.00,2.00
3.00,4.00


and the file can be loaded back with `loadtxt`

In [47]:
np.loadtxt('data.csv', delimiter=',')

array([[1., 2.],
       [3., 4.]])

### Using save / load
to process an arbitrary array in binary format. Unfortunately the resulting file is not human-readable. On the other hand, these functions can handle arrays of arbitrary shape and preserve the underlying dtype.

In [48]:
data = np.array(
  [
    [
      [1, 2],
      [3, 4],
    ]
  ]
)

np.save('data.npy', data)

!cat data.npy

�NUMPY v {'descr': '<i8', 'fortran_order': False, 'shape': (1, 2, 2), }                                                       
                            

In [49]:
np.load('data.npy')

array([[[1, 2],
        [3, 4]]])

In [50]:
np.load('data.npy').dtype == data.dtype

True

### Using savez / load

to process **multiple** arbitrary arrays in binary format. The result of the load operation is an object of dictionary-like structure.

In [51]:
array_1 = np.array([1, 2, 3], dtype=int)
array_2 = np.array([[4, 5, 6]], dtype=float)
array_3 = np.array([[[7, 8, 9]]], dtype=complex)

np.savez('data.npz', array_1, array_2, some_name=array_3)
loaded_data = np.load('data.npz')
loaded_data.files

['some_name', 'arr_0', 'arr_1']

In [52]:
loaded_data['arr_0']

array([1, 2, 3])

In [53]:
loaded_data['arr_1']

array([[4., 5., 6.]])

In [54]:
loaded_data['some_name']

array([[[7.+0.j, 8.+0.j, 9.+0.j]]])
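Positional arguments get the automatic names `arr_0`, `arr_1`, ..., so naming every array with a keyword tends to be easier to read back. The sketch below shows such a round trip; it writes to an in-memory `io.BytesIO` buffer instead of a file, which is an assumption made here only so that the example leaves no files behind.

```python
import io
import numpy as np

first = np.array([1, 2, 3])
second = np.array([[4.0, 5.0, 6.0]])

# An in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()
np.savez(buffer, first=first, second=second)
buffer.seek(0)  # rewind before reading

# The loaded object behaves like a dictionary keyed by the given names.
with np.load(buffer) as loaded:
    names = sorted(loaded.files)
    restored_first = loaded['first']
    restored_second = loaded['second']

print(names)
print(restored_first.dtype == first.dtype, restored_second.shape)
```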

### Using savez_compressed / load
to process multiple arbitrary arrays in binary format with compression. It behaves just like `savez`, but the resulting file is also compressed, which saves disk space.



In [55]:
shape = (1000, 1000)
data = np.ones(shape)
np.savez('data_uncompressed.npz', data)
np.savez_compressed('data_compressed.npz', data)

We can compare the file sizes with the `ls` command.

In [56]:
!ls data_*compressed.npz -lh

-rw-r--r-- 1 root root  12K Oct 23 18:50 data_compressed.npz
-rw-r--r-- 1 root root 7.7M Oct 23 18:50 data_uncompressed.npz


## Shape manipulation
It is possible to change the structure of data in the array. See examples below.

#### Reshape

In [57]:
data = np.arange(0, 12, 1, dtype = int)
data

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [58]:
data.reshape(3, 4)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [59]:
data.reshape(4, 3)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [60]:
data_reshaped = data.reshape(2, 3, 2)
data_reshaped

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]]])

In [61]:
shape = data_reshaped.shape
shape

(2, 3, 2)

#### Inferring **one** dimension while using the reshape method

In [62]:
data.reshape(6, -1)

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

In [63]:
data.reshape(-1, 6)

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])

In [64]:
data.reshape(3, 2, -1)

array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])
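The rule behind `-1` can be sketched as follows: the missing dimension is the total size divided by the product of the given dimensions, and the call fails when that division is not exact or when more than one dimension is left open. The `failures` list is a helper introduced for this example.

```python
import numpy as np

data = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size: 12 / (2 * 3) = 2.
inferred = data.reshape(2, 3, -1)
print(inferred.shape)

failures = []
# (5, -1): 12 is not divisible by 5. (-1, -1): only one dimension may be inferred.
for bad_shape in [(5, -1), (-1, -1)]:
    try:
        data.reshape(bad_shape)
    except ValueError as error:
        failures.append(bad_shape)
        print(bad_shape, 'failed:', error)
```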

#### Flatten

In [65]:
data_reshaped.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

#### Adding / deleting dummy dimensions
Facilitates vectorized arithmetic in conjunction with the broadcasting mechanism (explained later).

In [66]:
data

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [67]:
data.shape

(12,)

In [68]:
data_expanded1 = np.expand_dims(data, 0)
data_expanded1

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]])

In [69]:
data_expanded1.shape

(1, 12)

In [70]:
data_expanded2 = np.expand_dims(data, 1)
data_expanded2

array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11]])

In [71]:
data_expanded2.shape

(12, 1)

In [72]:
data_expanded3 = np.expand_dims(data_expanded1, 2)
data_expanded3

array([[[ 0],
        [ 1],
        [ 2],
        [ 3],
        [ 4],
        [ 5],
        [ 6],
        [ 7],
        [ 8],
        [ 9],
        [10],
        [11]]])

In [73]:
data_expanded3.shape

(1, 12, 1)

In [74]:
data_squeezed = data_expanded3.squeeze()
data_squeezed

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [75]:
data_squeezed.shape

(12,)

#### Transposing

For 2D arrays it behaves just as expected (performs matrix transpose).

In [76]:
data = np.arange(0, 4, 1).reshape(2, 2)
data

array([[0, 1],
       [2, 3]])

In [77]:
data.transpose()

array([[0, 2],
       [1, 3]])

For higher-dimensional arrays it's best visualized by observing how the shape changes.

In [78]:
shape = (1, 2, 3, 4)
data = np.ones(shape)
data.shape

(1, 2, 3, 4)

The default behavior is to reverse the order of the dimensions.

In [80]:
data.transpose().shape

(4, 3, 2, 1)

However other permutations of dimensions might be specified.

In [81]:
dimensions_permutation = [1, 0, 3, 2]
data.transpose(dimensions_permutation).shape

(2, 1, 4, 3)

`.T` is a shorthand for `.transpose()`

In [82]:
data.T.shape

(4, 3, 2, 1)
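The shape rule for an explicit permutation can be checked directly: axis `i` of the result is axis `permutation[i]` of the input. A minimal sketch, with `expected_shape` introduced only for this check:

```python
import numpy as np

data = np.ones((1, 2, 3, 4))
permutation = (1, 0, 3, 2)

transposed = data.transpose(permutation)

# Axis i of the result is axis permutation[i] of the input.
expected_shape = tuple(data.shape[axis] for axis in permutation)
print(transposed.shape, expected_shape)
```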

## Concatenation & stacking
The following functions are used to merge multiple arrays (with compatible shapes) into one array.

### concatenate
used to join a sequence of arrays along an **existing** axis.

In [92]:
shape = (4,)
array_1 = np.full(shape, 1)
array_2 = np.full(shape, 2)

array_1, array_2

(array([1, 1, 1, 1]), array([2, 2, 2, 2]))

In [89]:
np.concatenate([array_1, array_2, array_1])

array([1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1])

In [110]:
array_1 = array_1.reshape(1, -1)
array_2 = array_2.reshape(1, -1)

array_1, array_2

(array([[1, 1, 1, 1]]), array([[2, 2, 2, 2]]))

The default axis to join on is 0

In [111]:
np.concatenate([array_1, array_2, array_1])

array([[1, 1, 1, 1],
       [2, 2, 2, 2],
       [1, 1, 1, 1]])

But it can be changed

In [114]:
np.concatenate([array_1, array_2, array_1], axis=1)

array([[1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1]])

### stack

used to stack a sequence of arrays along a **new** axis.

In [99]:
shape = (4,)
array_1 = np.full(shape, 1)
array_2 = np.full(shape, 2)

array_1, array_2

(array([1, 1, 1, 1]), array([2, 2, 2, 2]))

In [100]:
np.stack([array_1, array_2, array_1])

array([[1, 1, 1, 1],
       [2, 2, 2, 2],
       [1, 1, 1, 1]])

In [101]:
np.stack([array_1, array_2, array_1], axis=1)

array([[1, 2, 1],
       [1, 2, 1],
       [1, 2, 1],
       [1, 2, 1]])
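The difference between joining along an existing axis and stacking along a new one is easiest to see in the resulting shapes. The following sketch uses the same kind of inputs as above; the names `a` and `b` are chosen for brevity.

```python
import numpy as np

a = np.full((4,), 1)
b = np.full((4,), 2)

# concatenate joins along an existing axis: the lengths add up.
joined = np.concatenate([a, b])

# stack creates a new axis whose length is the number of inputs.
stacked_rows = np.stack([a, b])             # new axis at position 0
stacked_columns = np.stack([a, b], axis=1)  # new axis at position 1

print(joined.shape, stacked_rows.shape, stacked_columns.shape)
```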


### hstack, vstack and dstack
stand for horizontal, vertical and depthwise stacking

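In [115]:
# hstack / vstack / dstack are shown here for 2D inputs of shape (1, 4)
array_1 = array_1.reshape(1, -1)
array_2 = array_2.reshape(1, -1)
np.hstack([array_1, array_2, array_1])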

array([[1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1]])

In [116]:
np.vstack([array_1, array_2, array_1])

array([[1, 1, 1, 1],
       [2, 2, 2, 2],
       [1, 1, 1, 1]])

In [119]:
np.dstack([array_1, array_2, array_1])

array([[[1, 2, 1],
        [1, 2, 1],
        [1, 2, 1],
        [1, 2, 1]]])

## Mathematical operations
As stated before

> NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.

Let us observe some examples.



In [120]:
x = np.arange(0, 3, 1)
y = np.arange(3, 6, 1)

x, y

(array([0, 1, 2]), array([3, 4, 5]))

#### Elementwise

Basic mathematical operations will be applied to the arrays elementwise.

In [121]:
2 * x

array([0, 2, 4])

In [122]:
x * 2

array([0, 2, 4])

In [123]:
np.sin(x)

array([0.        , 0.84147098, 0.90929743])

In [124]:
x**2

array([0, 1, 4])

The same applies to the binary (taking two arguments) operators and arrays of the same shape.

In [125]:
x + y

array([3, 5, 7])

In [126]:
x * y

array([ 0,  4, 10])

In [127]:
x / y

array([0.  , 0.25, 0.4 ])

In [130]:
y ** x

array([ 1,  4, 25])

#### Broadcasting

When arrays have different shapes the broadcasting mechanism comes into play.

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimensions and works its way left. Two dimensions are compatible when

- they are equal, or
- one of them is 1

If these conditions are not met, a `ValueError: operands could not be broadcast together` exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the size that is not 1 along each axis of the inputs.

from: https://numpy.org/devdocs/user/basics.broadcasting.html

In [131]:
x_shape = (2, 4)
y_shape = (2, 1)

x = np.ones(x_shape)
y = np.ones(y_shape)

(x + y).shape

(2, 4)

In [132]:
x_shape =    (2, 4)
y_shape = (3, 1, 1)

x = np.ones(x_shape)
y = np.ones(y_shape)

(x + y).shape

(3, 2, 4)

In [133]:
x_shape = (   1, 4)
y_shape = (3, 2, 1)

x = np.ones(x_shape)
y = np.ones(y_shape)

(x + y).shape

(3, 2, 4)
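The compatibility rule quoted above can also be checked without allocating any arrays. The sketch below uses `np.broadcast_shapes`, which is available in NumPy 1.20 and later (an assumption about the installed version), and then shows a pair of shapes that cannot be broadcast.

```python
import numpy as np

# Compatible: aligned from the right, each pair of sizes is equal or contains a 1.
result_shape = np.broadcast_shapes((2, 4), (3, 1, 1))
print(result_shape)

# Incompatible: the trailing sizes 4 and 3 differ and neither of them is 1.
x = np.ones((2, 4))
y = np.ones((2, 3))
try:
    x + y
    broadcast_failed = False
except ValueError as error:
    broadcast_failed = True
    print('broadcast failed:', error)
```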

### Matrix / tensor operations
NumPy package wouldn't be complete without means to manipulate arrays other than elementwise operations, such as most common linear algebra functionalities. See the examples below.

#### Matrix multiplication

can be performed with the `matmul` function

In [137]:
a = np.eye(3)

a[2,2] = 2
b = np.arange(0, 3, 1).reshape(3, 1)

a

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 2.]])

In [138]:
b

array([[0],
       [1],
       [2]])

In [139]:
np.matmul(a, b)

array([[0.],
       [1.],
       [4.]])

or the `@` operator

In [140]:
a @ b

array([[0.],
       [1.],
       [4.]])

Be mindful of its varying behaviour for different input shapes:

- If both arguments are 2-D they are multiplied like conventional matrices.
- If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
- If the first argument is 1-D, it is promoted to a matrix by prepending a 1 to its dimensions. After matrix multiplication the prepended 1 is removed.
- If the second argument is 1-D, it is promoted to a matrix by appending a 1 to its dimensions. After matrix multiplication the appended 1 is removed.

from: https://numpy.org/doc/stable/reference/generated/numpy.matmul.html
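The four cases listed above can be observed through the shapes of the results. This is a minimal sketch with arrays of ones; the dictionary `shapes` is just a convenient way to print the cases side by side.

```python
import numpy as np

matrix = np.ones((3, 4))
other_matrix = np.ones((4, 2))
stack_of_matrices = np.ones((5, 3, 4))
vector = np.ones(4)

shapes = {
    '2-D @ 2-D': (matrix @ other_matrix).shape,             # ordinary matrix product
    'N-D @ 2-D': (stack_of_matrices @ other_matrix).shape,  # broadcast over the stack
    '2-D @ 1-D': (matrix @ vector).shape,                   # appended 1 is removed
    '1-D @ 2-D': (vector @ other_matrix).shape,             # prepended 1 is removed
}
for case, shape in shapes.items():
    print(case, shape)
```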

#### Dot product

can be performed using the `dot` function.

In [141]:
x = np.arange(0, 3, 1)
y = np.arange(3, 6, 1)

x, y

(array([0, 1, 2]), array([3, 4, 5]))

In [142]:
np.dot(x, y)

14

Be mindful of its varying behaviour for different input shapes:

- If both a and b are 1-D arrays, it is inner product of vectors (without complex conjugation).

- If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred.

- If either a or b is 0-D (scalar), it is equivalent to multiply and using numpy.multiply(a, b) or a * b is preferred.

- If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b.

- If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b:

  ```
  dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
  ```

from: https://numpy.org/doc/stable/reference/generated/numpy.dot.html
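For arrays with more than two dimensions `dot` and `matmul` give results of different shapes, which is a common source of confusion. The sketch below compares them on the same inputs; the shapes are arbitrary illustrative choices.

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 6))

# dot: sum over the last axis of a and the second-to-last axis of b;
# the remaining axes of a and b are all kept side by side.
dot_shape = np.dot(a, b).shape

# matmul: a and b are treated as stacks of 2 matrices, multiplied pairwise.
matmul_shape = np.matmul(a, b).shape

print(dot_shape, matmul_shape)
```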


#### Einstein summation convention
The Einstein summation convention provides a concise way of expressing multilinear operations. You can look it up here https://mathworld.wolfram.com/EinsteinSummation.html and its NumPy specification here https://numpy.org/doc/stable/reference/generated/numpy.einsum.html.

Take a look at the following illustrative examples.


In [145]:
data = np.tri(3)
data

array([[1., 0., 0.],
       [1., 1., 0.],
       [1., 1., 1.]])

In [146]:
np.einsum('ij->ji', data)  # transpose

array([[1., 1., 1.],
       [0., 1., 1.],
       [0., 0., 1.]])

In [147]:
np.einsum('ij->i', data)  # row sum

array([1., 2., 3.])

In [148]:
np.einsum('ij->j', data)  # column sum

array([3., 2., 1.])

In [149]:
np.einsum('ii->i', data)  # diagonal

array([1., 1., 1.])

In [150]:
vector = np.ones(3)
vector

array([1., 1., 1.])

In [151]:
np.einsum('ij,j->i', data, vector)  # matrix multiplication (matrix x vector)

array([1., 2., 3.])

In [152]:
np.einsum('ij,jk->ik', data, vector.reshape(3, 1))  # matrix multiplication (matrix x matrix)

array([[1.],
       [2.],
       [3.]])
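A few more patterns in the same spirit, each checked against the corresponding dedicated NumPy function. This is a sketch with made-up inputs; the subscript letters are arbitrary labels.

```python
import numpy as np

matrix = np.arange(9.0).reshape(3, 3)
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])
batch_a = np.ones((2, 3, 4))
batch_b = np.ones((2, 4, 5))

trace = np.einsum('ii->', matrix)                      # sum of the diagonal
outer = np.einsum('i,j->ij', u, v)                     # outer product
batched = np.einsum('bij,bjk->bik', batch_a, batch_b)  # one product per batch entry

print(trace == np.trace(matrix))
print(np.array_equal(outer, np.outer(u, v)))
print(np.allclose(batched, batch_a @ batch_b))
```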

## Other operations

### Aggregations: mean / var
Very often we are required to calculate some aggregations with respect to specific dimensions of an array, such as mean or variance.

In [153]:
data = np.arange(0, 20, 1).reshape(4, 5)
data

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

By default the whole array is aggregated

In [154]:
data.mean()

9.5

But aggregations can also be applied axiswise

In [155]:
data.mean(axis=0)

array([ 7.5,  8.5,  9.5, 10.5, 11.5])

In [156]:
data.mean(axis=1)

array([ 2.,  7., 12., 17.])

In [157]:
data.var(axis=0)

array([31.25, 31.25, 31.25, 31.25, 31.25])

In [158]:
data.var(axis=1)

array([2., 2., 2., 2.])
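Axiswise aggregations combine naturally with broadcasting. As a sketch, the `keepdims=True` argument (not used elsewhere in this notebook) keeps the reduced axis with length 1, so the row means can be subtracted from the original array directly.

```python
import numpy as np

data = np.arange(0, 20, 1).reshape(4, 5)

# keepdims=True keeps the reduced axis with length 1, so the result
# has shape (4, 1) and broadcasts against the (4, 5) array.
row_means = data.mean(axis=1, keepdims=True)
centered = data - row_means

print(row_means.shape)
print(centered.mean(axis=1))
```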

### Aggregation: quantiles

Many more aggregations are possible, e.g. determining data quantiles.

In [162]:
data = np.linspace(1, 2, 101)

np.quantile(data, 0.3)

1.3

In [163]:
np.quantile(data, [0.3, 0.6])

array([1.3, 1.6])

### Sorting
Arrays can be sorted. By default sorting is performed with respect to the last dimension

In [164]:
data = np.array([
  [1, 4, 2, 8, 5],
  [2, 3, 1, 9, 6],
  [8, 2, 3, 7, 4],
])

np.sort(data)

array([[1, 2, 4, 5, 8],
       [1, 2, 3, 6, 9],
       [2, 3, 4, 7, 8]])

but sorting dimension can be explicitly specified.

In [165]:
np.sort(data, axis=0)

array([[1, 2, 1, 7, 4],
       [2, 3, 2, 8, 5],
       [8, 4, 3, 9, 6]])

In [166]:
np.sort(data, axis=1)

array([[1, 2, 4, 5, 8],
       [1, 2, 3, 6, 9],
       [2, 3, 4, 7, 8]])

You can also get the "sorting permutations" that can be used in conjunction with advanced indexing (introduced below).

In [168]:
np.argsort(data, axis=0)

array([[0, 2, 1, 2, 2],
       [1, 1, 0, 0, 0],
       [2, 0, 2, 1, 1]])

In [169]:
np.argsort(data, axis=1)

array([[0, 2, 1, 4, 3],
       [2, 0, 1, 4, 3],
       [1, 2, 4, 3, 0]])
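One way to apply such a permutation is sketched below. In 1D, plain indexing with the result of `argsort` is enough; for the rowwise 2D case the sketch uses `np.take_along_axis`, a function not covered elsewhere in this notebook.

```python
import numpy as np

data = np.array([
    [1, 4, 2, 8, 5],
    [2, 3, 1, 9, 6],
    [8, 2, 3, 7, 4],
])

# 1D case: indexing with the permutation returns the sorted array.
first_row = data[0]
sorted_first_row = first_row[np.argsort(first_row)]

# 2D case: take_along_axis applies a separate permutation to every row.
order = np.argsort(data, axis=1)
sorted_rows = np.take_along_axis(data, order, axis=1)

print(sorted_first_row)
print(np.array_equal(sorted_rows, np.sort(data, axis=1)))
```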

## Indexing

Allows selecting items and subarrays. It can be also used to modify fragments of an array.

### Basic indexing

In [170]:
data = np.arange(0, 16, 1)
data

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [171]:
data[10]

10

In [172]:
data[-2]

14

In [173]:
data[:5]

array([0, 1, 2, 3, 4])

In [174]:
data[::2]

array([ 0,  2,  4,  6,  8, 10, 12, 14])

In [175]:
data[-3:]

array([13, 14, 15])

Basic indexing also works in the higher-dimensional setting


In [176]:
data = data.reshape(4, 4)
data

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [177]:
data[1, 1]

5

In [178]:
data[1, :]

array([4, 5, 6, 7])

One does not need to specify trailing colons `:`

In [179]:
data[:-2]

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

Note how the column selection is achieved. In this case `:` can't be omitted.

In [180]:
data[:, -2:]

array([[ 2,  3],
       [ 6,  7],
       [10, 11],
       [14, 15]])

One group of consecutive colons `:` can be replaced with `...`

In [186]:
data_reshaped = data.reshape(4, 2, 2)
data_reshaped

array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15]]])

In [187]:
data_reshaped[..., 0]

array([[ 0,  2],
       [ 4,  6],
       [ 8, 10],
       [12, 14]])

Note the difference

In [188]:
data_reshaped[:, 0]

array([[ 0,  1],
       [ 4,  5],
       [ 8,  9],
       [12, 13]])

### Views & copies

Basic indexing routines presented above return views of the data in the underlying arrays. When the underlying array is modified the changes are reflected in the view.

In [189]:
data = np.arange(0, 16, 1).reshape(4, 4)
data

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [190]:
view = data[:, 3]
view

array([ 3,  7, 11, 15])

In [191]:
data *= 2
data

array([[ 0,  2,  4,  6],
       [ 8, 10, 12, 14],
       [16, 18, 20, 22],
       [24, 26, 28, 30]])

In [192]:
view

array([ 6, 14, 22, 30])

To "detach" a view from the underlying array use the `copy` method

In [195]:
view_copy = view.copy()
data *= 2
view, view_copy

(array([12, 28, 44, 60]), array([ 6, 14, 22, 30]))

Views are only possible when selecting with slices (basic indexing), because a view then only needs to remember an offset and strides into the original data.

You can read more about this here: https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html
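Whether two arrays share the same underlying data can be checked with `np.shares_memory`. The sketch below compares a slice with a selection made through an integer list (a form of advanced indexing); the variable names are chosen for this example.

```python
import numpy as np

data = np.arange(16).reshape(4, 4)

column_view = data[:, 3]    # basic slicing: a view
column_copy = data[:, [3]]  # indexing with a list of integers: a copy

view_shares = np.shares_memory(data, column_view)
copy_shares = np.shares_memory(data, column_copy)
print(view_shares, copy_shares)

# Writing through the view changes the original array; the copy is unaffected.
column_view[0] = -1
print(data[0, 3], column_copy[0, 0])
```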

### Advanced indexing

It is also possible to index with integer or boolean arrays.

In [196]:
data = np.arange(0, 16, 1).reshape(4, 4)
data

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

We can select specific rows

In [199]:
data[[0, 3]]

array([[ 0,  1,  2,  3],
       [12, 13, 14, 15]])

and columns.

In [200]:
data[:, [0, 3]]

array([[ 0,  3],
       [ 4,  7],
       [ 8, 11],
       [12, 15]])

Selecting both together is a little more complicated

In [201]:
rows = [
 [0, 0],
 [3, 3]
]
columns = [
 [0, 3],
 [0, 3]
]
data[rows, columns]

array([[ 0,  3],
       [12, 15]])

Use the shape of the index arrays to control the output shape.

In [203]:
rows = [0, 0, 3, 3]
columns = [0, 3, 0, 3]
data[rows, columns]
# result[i] = data[rows[i], columns[i]]

array([ 0,  3, 12, 15])
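Writing out every (row, column) pair by hand gets tedious for larger selections. One alternative, sketched here, is `np.ix_`, which is not covered elsewhere in this notebook: it builds index arrays that broadcast against each other, so every combination of the listed rows and columns is selected.

```python
import numpy as np

data = np.arange(16).reshape(4, 4)

# np.ix_ returns index arrays of shapes (2, 1) and (1, 2) that broadcast
# against each other, selecting every (row, column) combination.
row_index, column_index = np.ix_([0, 3], [0, 3])
corners = data[row_index, column_index]

print(row_index.shape, column_index.shape)
print(corners)
```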

Boolean arrays can be used as well, though they work somewhat differently.

In [204]:
index = (data % 3 == 0)
index

array([[ True, False, False,  True],
       [False, False,  True, False],
       [False,  True, False, False],
       [ True, False, False,  True]])

In [205]:
data[index]

array([ 0,  3,  6,  9, 12, 15])

Its most common use case is to update array entries based on some condition.

In [206]:
data[index] = -1
data

array([[-1,  1,  2, -1],
       [ 4,  5, -1,  7],
       [ 8, -1, 10, 11],
       [-1, 13, 14, -1]])

The detailed behavior of indexing with boolean arrays is described here: https://numpy.org/doc/stable/reference/arrays.indexing.html#advanced-indexing

# NumPy Exercises

### Exercise: multiplication table
Create a {0, ..., 10} x {0, ..., 10} multiplication table in two ways, using only:
- the `meshgrid` function,
- multiplication of 1D arrays, dummy dimensions and the broadcasting mechanism.

In [6]:
### YOUR CODE BEGINS HERE ###
import numpy as np

# Create arrays representing the values from 0 to 10
x = np.arange(11)
y = np.arange(11)

# Use meshgrid to create a grid of x and y values
X, Y = np.meshgrid(x, y)

# Compute the multiplication table
multiplication_table = X * Y

# Print the multiplication table
print(multiplication_table)

### YOUR CODE ENDS HERE ###

[[  0   0   0   0   0   0   0   0   0   0   0]
 [  0   1   2   3   4   5   6   7   8   9  10]
 [  0   2   4   6   8  10  12  14  16  18  20]
 [  0   3   6   9  12  15  18  21  24  27  30]
 [  0   4   8  12  16  20  24  28  32  36  40]
 [  0   5  10  15  20  25  30  35  40  45  50]
 [  0   6  12  18  24  30  36  42  48  54  60]
 [  0   7  14  21  28  35  42  49  56  63  70]
 [  0   8  16  24  32  40  48  56  64  72  80]
 [  0   9  18  27  36  45  54  63  72  81  90]
 [  0  10  20  30  40  50  60  70  80  90 100]]


In [7]:
### YOUR CODE BEGINS HERE ###
import numpy as np

# Create arrays representing the values from 0 to 10
x = np.arange(11)
y = np.arange(11)

# Compute the multiplication table by broadcasting
multiplication_table = x[:, np.newaxis] * y

# Print the multiplication table
print(multiplication_table)

### YOUR CODE ENDS HERE ###

[[  0   0   0   0   0   0   0   0   0   0   0]
 [  0   1   2   3   4   5   6   7   8   9  10]
 [  0   2   4   6   8  10  12  14  16  18  20]
 [  0   3   6   9  12  15  18  21  24  27  30]
 [  0   4   8  12  16  20  24  28  32  36  40]
 [  0   5  10  15  20  25  30  35  40  45  50]
 [  0   6  12  18  24  30  36  42  48  54  60]
 [  0   7  14  21  28  35  42  49  56  63  70]
 [  0   8  16  24  32  40  48  56  64  72  80]
 [  0   9  18  27  36  45  54  63  72  81  90]
 [  0  10  20  30  40  50  60  70  80  90 100]]


### Exercise: speed comparison
In this exercise we will measure the speed improvement that the NumPy implementation provides.

In [9]:
length = 10000
a = b = np.arange(length)

Write code that calculates the dot product of arrays `a` and `b` using only native Python functions. Use the `timeit` magic to time the execution.

In [19]:
%%timeit


### YOUR CODE BEGINS HERE ###
result = sum(a[i] * b[i] for i in range(length))
### YOUR CODE ENDS HERE ###

2.79 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


Now compare the execution time with `np.dot`.

In [20]:
### YOUR CODE BEGINS HERE ###
print(np.dot(a,b))
### YOUR CODE ENDS HERE ###

333283335000
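Outside a notebook, where the `%%timeit` magic is not available, a similar comparison can be sketched with the standard `timeit` module. The helper names `python_dot` and `numpy_dot` and the repetition count are choices made for this example; the measured times depend on the machine, so none are quoted here.

```python
import timeit
import numpy as np

length = 10000
# dtype is fixed to int64 so that the result does not overflow on any platform.
a = b = np.arange(length, dtype=np.int64)

def python_dot():
    # Explicit Python-level loop over the elements.
    return sum(a[i] * b[i] for i in range(length))

def numpy_dot():
    # The same computation performed in compiled code.
    return np.dot(a, b)

python_time = timeit.timeit(python_dot, number=10)
numpy_time = timeit.timeit(numpy_dot, number=10)

print(python_dot() == numpy_dot())
print(f'pure Python: {python_time:.4f} s, NumPy: {numpy_time:.4f} s')
```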


### Exercise: vector operations
Calculate
$$ \int_0^1 \sin(a\cdot2\pi x)\sin(b\cdot 2\pi x)dx $$
for
- $a=b=1$,
- $a=1$ and $b=2$,

using a discrete approximation of the integral and the following NumPy functions: `linspace`, `sin`, `mean`. Do not use `vectorize`.


In [30]:
# Case a = b = 1
### YOUR CODE BEGINS HERE ###
import numpy as np

# Define constants
a = b = 1
lower_limit = 0
upper_limit = 1
num_points = 1000  # Number of discrete points for approximation

# Create an array of discrete points; the right endpoint is excluded so that
# the periodic integrand is not sampled twice at the same phase
x = np.linspace(lower_limit, upper_limit, num_points, endpoint=False)
# Calculate the integrand for each point
integrand = np.sin(a * 2 * np.pi * x) * np.sin(b * 2 * np.pi * x)
# Approximate the integral as (interval length) * (mean value of the integrand)
answer = (upper_limit - lower_limit) * np.mean(integrand)
### YOUR CODE ENDS HERE ###

print(answer.round(4))
assert np.isclose(answer, 0.5)

0.5

In [31]:
# Case a = 1, b = 2
### YOUR CODE BEGINS HERE ###
import numpy as np

# Define constants
a = 1
b = 2
lower_limit = 0
upper_limit = 1
step = 0.1

# Create an array of discrete points
x = np.linspace(lower_limit, upper_limit, int((upper_limit - lower_limit) / step) + 1)
# Calculate the integrand for each point
integrand = np.sin(a * 2 * np.pi * x) * np.sin(b * 2 * np.pi * x)
# Approximate the integral as (interval length) * (mean value of the integrand)
answer = (upper_limit - lower_limit) * np.mean(integrand)
### YOUR CODE ENDS HERE ###
print(answer.round(4))
assert np.isclose(answer, 0)

-0.0


### Exercise: vector operations ctd.

Let $a_j = j$. In a similar fashion, calculate the matrix $(b_{ij})_{i,j=0, ..., 4}$, where
$$ b_{ij} = \int_0^1 \sin(a_i\cdot2\pi x)\sin(a_j\cdot 2\pi x)dx. $$

Try to structure your code in such a way that it would accept an arbitrary array `a`. Do not use `vectorize`.

In [32]:
a = np.arange(5)

### YOUR CODE BEGINS HERE ###

### YOUR CODE ENDS HERE ###

print(answer.round(2))
assert np.isclose(answer, np.diag([0, 0.5, 0.5, 0.5, 0.5])).all()

-0.0


AssertionError: ignored
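The solution cell above was left empty, so the assertion is evaluated against the `answer` left over from the previous exercise and fails. One possible solution is sketched below; it is not the official one. It assumes a grid of 1000 points without the right endpoint (an arbitrary choice) and relies on dummy dimensions and broadcasting to form all pairs $(i, j)$ at once.

```python
import numpy as np

a = np.arange(5)

# Grid on [0, 1) without the right endpoint; num_points is an arbitrary choice.
num_points = 1000
x = np.linspace(0, 1, num_points, endpoint=False)

# Shape (len(a), num_points): row i holds sin(a_i * 2 * pi * x).
waves = np.sin(2 * np.pi * a[:, np.newaxis] * x)

# Shapes (len(a), 1, num_points) and (1, len(a), num_points) broadcast to
# (len(a), len(a), num_points); the mean over the last axis approximates
# the integral because the interval has length 1.
answer = (waves[:, np.newaxis, :] * waves[np.newaxis, :, :]).mean(axis=-1)

print(answer.round(2))
assert np.isclose(answer, np.diag([0, 0.5, 0.5, 0.5, 0.5])).all()
```

Because nothing in the code depends on the length or the values of `a`, the same lines work for an arbitrary 1D array `a`, as the exercise asks.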

### Exercise: subarrays

Let us sample $n=100 000$ standard normal random variables. Calculate the mean of the values greater than zero.

In [46]:
np.random.seed(0)

n = 100000
sample = np.random.randn(n)

### YOUR CODE BEGINS HERE ###
positive_values = sample[sample > 0]
# Store the result in `answer`, which is the name checked by the assertion below
answer = np.mean(positive_values)
print(answer)
print(np.sqrt(2/np.pi))
### YOUR CODE ENDS HERE ###

assert np.isclose(answer, np.sqrt(2/np.pi), rtol=0.01)

0.7952217326161258
0.7978845608028654
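
The reference value $\sqrt{2/\pi}$ used in the assertion is the mean of a standard normal variable conditioned on being positive. As a short derivation, writing $\varphi$ for the standard normal density and using $P(X>0)=1/2$:

$$ E[X \mid X>0] = \frac{\int_0^\infty x\,\varphi(x)\,dx}{P(X>0)} = 2\int_0^\infty \frac{x}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx = \frac{2}{\sqrt{2\pi}} = \sqrt{\frac{2}{\pi}} \approx 0.7979. $$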

### Exercise: subarrays ctd.
Let us sample an array of random integers.
1. Select the columns whose sum of values leaves a remainder of 1 when divided by 3.
2. Change all entries in those columns to ones.
3. Calculate the sum of all entries.

In [63]:
np.random.seed(0)
sample = np.random.randint(0, 3, size=(10, 10))

### YOUR CODE BEGINS HERE ###
column_sums = np.sum(sample, axis=0)
selected_columns = np.where(column_sums % 3 == 1)[0]
sample[:, selected_columns] = 1
answer = sample.sum()
print(answer)
### YOUR CODE ENDS HERE ###

assert answer == 88

88



### Exercise: matrix multiplication & merging arrays

In this exercise we will approximate the operator norm of some matrices.

1. Create an array of $n=1000$ vectors $v\in\mathbb{R}^2$ "evenly spaced" on the unit circle.
2. Combine the three given matrices into one array.
3. Use **one** matrix multiplication operation on these arrays to obtain the products of all the matrices with all the vectors.
4. For each resulting vector calculate its (Euclidean) length.
5. Take the maximum with respect to each matrix.
6. You should obtain three values approximating the operator norms of the provided matrices.


In [92]:
matrix_1 = np.array([
  [1, 0],
  [0, 1]
])

matrix_2 = np.array([
  [1, 1],
  [1, 1]
])

matrix_3 = np.array([
  [1, 1],
  [0, 1]
])

### YOUR CODE BEGINS HERE ###
# Step 1: n = 1000 unit vectors evenly spaced on the unit circle, stored as columns (shape (2, n))
n = 1000
theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
vectors = np.stack([np.cos(theta), np.sin(theta)])

# Step 2: combine the given matrices into one array of shape (3, 2, 2)
matrices = np.stack([matrix_1, matrix_2, matrix_3])

# Step 3: one broadcast matrix multiplication; products[k, :, i] = matrices[k] @ vectors[:, i]
products = matrices @ vectors

# Step 4: Euclidean length of every product vector, shape (3, n)
lengths = np.linalg.norm(products, axis=1)

# Steps 5-6: the maximum over the vectors approximates each operator norm
answer = lengths.max(axis=1)
### YOUR CODE ENDS HERE ###


print(answer.round(4))
assert np.isclose(answer, [1, 2, (1+np.sqrt(5))/2]).all()

[1.    2.    1.618]
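
The operator norm induced by the Euclidean norm equals the largest singular value of the matrix, so the expected values in the assertion can be cross-checked directly. The cell below is a small sketch of such a check; the name `exact_norms` is introduced here for illustration only.

```python
import numpy as np

matrices = np.array([
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
    [[1, 1], [0, 1]],
])

# ord=2 gives the spectral norm, i.e. the largest singular value of each matrix.
exact_norms = np.array([np.linalg.norm(m, ord=2) for m in matrices])
print(exact_norms)
```

The sampled approximation from the exercise should approach these values from below as the number of vectors grows.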

## Other NumPy functionalities
NumPy implements a plethora of mathematical functionality that is not covered in this tutorial. The NumPy user guide and API reference are great starting points for discovering it:
- https://numpy.org/doc/stable/user/
- https://numpy.org/doc/stable/reference/


# What is pandas?

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

It is meant for manipulating datasets in tabular format, that is, 2D arrays with rows representing observations and columns representing variables, the so-called **DataFrames**.

from: https://pandas.pydata.org/

### Creating dataframes

In the most basic form, you just pass a 2D array of values.

In [94]:
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 16, 1).reshape(4, 4)
)

df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


It is possible to additionally specify column and index labels

In [95]:
df = pd.DataFrame(
    np.arange(0, 16, 1).reshape(4, 4),
    columns = ['column_1', 'column_2', 'column_3', 'column_4'],
    index   = ['a', 'b', 'c', 'd']
)

df

Unnamed: 0,column_1,column_2,column_3,column_4
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


as well as the type of the entries.

In [96]:
df = pd.DataFrame(
    np.arange(0, 16, 1).reshape(4, 4),
    columns = ['column_1', 'column_2', 'column_3', 'column_4'],
    index   = ['a', 'b', 'c', 'd'],
    dtype   = float
)

df

Unnamed: 0,column_1,column_2,column_3,column_4
a,0.0,1.0,2.0,3.0
b,4.0,5.0,6.0,7.0
c,8.0,9.0,10.0,11.0
d,12.0,13.0,14.0,15.0


In [97]:
df.astype(int)

Unnamed: 0,column_1,column_2,column_3,column_4
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


Alternatively, a DataFrame can be created from a dictionary that maps column labels to their values.

In [98]:
df = pd.DataFrame(
    {
      'column_1': [0,  4,  8, 12],
      'column_2': [1,  5,  9, 13],
      'column_3': [2,  6, 10, 14],
      'column_4': [3,  7, 11, 15]
    },
    index   = ['a', 'b', 'c', 'd']
)

df

Unnamed: 0,column_1,column_2,column_3,column_4
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


It is worth mentioning that the underlying data can be accessed through the `values` attribute.

In [99]:
df.values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

Importantly, Pandas DataFrames can handle entries of different types. In the most common situation, different columns have different types.

In [100]:
df = pd.DataFrame(
    {
      'integers': [0,   1,   2,  3],
      'floats':   [1.0, 2.0, 3,  4],
      'booleans': [True, False, True, False],
      'arbitrary objects':  [0, 1.0, 'abc', False]
    }
)

df

Unnamed: 0,integers,floats,booleans,arbitrary objects
0,0,1.0,True,0
1,1,2.0,False,1.0
2,2,3.0,True,abc
3,3,4.0,False,False


In [101]:
df.dtypes

integers               int64
floats               float64
booleans                bool
arbitrary objects     object
dtype: object

Other available creation methods are listed below:
- `pd.DataFrame.from_dict`
- `pd.DataFrame.from_records`

You can read about them here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
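
As a brief illustration of the two methods listed above, the sketch below builds one small DataFrame with each of them. The data and the names `df_from_dict` and `df_from_records` are made up for this example.

```python
import pandas as pd

# from_dict with orient='index' treats the dictionary keys as row labels.
df_from_dict = pd.DataFrame.from_dict(
    {'a': [0, 1], 'b': [2, 3]},
    orient='index',
    columns=['column_1', 'column_2'],
)

# from_records builds one row per record (here: a list of tuples).
df_from_records = pd.DataFrame.from_records(
    [(0, 'x'), (1, 'y')],
    columns=['number', 'letter'],
)

print(df_from_dict)
print(df_from_records)
```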

### Loading data
Let us peek into the CSV file conveniently provided in the Google Colab runtime

In [102]:
!head './sample_data/california_housing_train.csv'

"longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income","median_house_value"
-114.310000,34.190000,15.000000,5612.000000,1283.000000,1015.000000,472.000000,1.493600,66900.000000
-114.470000,34.400000,19.000000,7650.000000,1901.000000,1129.000000,463.000000,1.820000,80100.000000
-114.560000,33.690000,17.000000,720.000000,174.000000,333.000000,117.000000,1.650900,85700.000000
-114.570000,33.640000,14.000000,1501.000000,337.000000,515.000000,226.000000,3.191700,73400.000000
-114.570000,33.570000,20.000000,1454.000000,326.000000,624.000000,262.000000,1.925000,65500.000000
-114.580000,33.630000,29.000000,1387.000000,236.000000,671.000000,239.000000,3.343800,74000.000000
-114.580000,33.610000,25.000000,2907.000000,680.000000,1841.000000,633.000000,2.676800,82400.000000
-114.590000,34.830000,41.000000,812.000000,168.000000,375.000000,158.000000,1.708300,48500.000000
-114.590000,33.610000,34.000000,4789.000000,1175.000000,3134.000000

and load it using the `pd.read_csv` function.

In [103]:
pd.read_csv('./sample_data/california_housing_train.csv')

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0


In Pandas it is also possible to load JSON files.

In [104]:
!head './sample_data/anscombe.json'

[
  {"Series":"I", "X":10.0, "Y":8.04},
  {"Series":"I", "X":8.0, "Y":6.95},
  {"Series":"I", "X":13.0, "Y":7.58},
  {"Series":"I", "X":9.0, "Y":8.81},
  {"Series":"I", "X":11.0, "Y":8.33},
  {"Series":"I", "X":14.0, "Y":9.96},
  {"Series":"I", "X":6.0, "Y":7.24},
  {"Series":"I", "X":4.0, "Y":4.26},
  {"Series":"I", "X":12.0, "Y":10.84},


In [105]:
pd.read_json('./sample_data/anscombe.json')

Unnamed: 0,Series,X,Y
0,I,10,8.04
1,I,8,6.95
2,I,13,7.58
3,I,9,8.81
4,I,11,8.33
5,I,14,9.96
6,I,6,7.24
7,I,4,4.26
8,I,12,10.84
9,I,7,4.81


## Other I/O functionalities
They are listed at https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html and cover XML, HDF, Feather and Parquet files, interaction with various databases, and much more.

## Handling missing data and duplicates

In the real world, datasets often contain missing or duplicate values. They can be easily dealt with in Pandas.

In [106]:
df = pd.DataFrame([
  [0,       1     ],
  [2,       0     ],
  [2,       np.nan],
  [np.nan,  np.nan]
])

df

Unnamed: 0,0,1
0,0.0,1.0
1,2.0,0.0
2,2.0,
3,,


Incomplete observations can be filtered out.

In [107]:
df.dropna()

Unnamed: 0,0,1
0,0.0,1.0
1,2.0,0.0


In [108]:
df.dropna(how='all')

Unnamed: 0,0,1
0,0.0,1.0
1,2.0,0.0
2,2.0,


It is also possible to "fill the gaps".

In [109]:
df.fillna(0)

Unnamed: 0,0,1
0,0.0,1.0
1,2.0,0.0
2,2.0,0.0
3,0.0,0.0


In [110]:
df.fillna(0).drop_duplicates()

Unnamed: 0,0,1
0,0.0,1.0
1,2.0,0.0
3,0.0,0.0


In [111]:
df.fillna({0: 0, 1: 1})

Unnamed: 0,0,1
0,0.0,1.0
1,2.0,0.0
2,2.0,1.0
3,0.0,1.0
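
Besides constants, the gaps can also be counted and filled with per-column statistics. The sketch below rebuilds the small frame from above under the hypothetical name `df_missing` and fills each gap with the mean of its column; this is just one possible imputation choice, shown for illustration.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame([
    [0,      1     ],
    [2,      0     ],
    [2,      np.nan],
    [np.nan, np.nan],
])

# Count the missing entries in each column.
missing_per_column = df_missing.isna().sum()

# mean() skips NaN by default, so each gap is filled with the mean of its own column.
filled = df_missing.fillna(df_missing.mean())

print(missing_per_column)
print(filled)
```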


## Data access

Let us observe how we can access and manipulate the contents of a DataFrame.

In [112]:
df = pd.read_csv('./sample_data/california_housing_train.csv')
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0


### describe & info
Once the dataframe is loaded, the `describe` and `info` methods provide an easy way to obtain a high-level view of the underlying data.

In [113]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.562108,35.625225,28.589353,2643.664412,539.410824,1429.573941,501.221941,3.883578,207300.912353
std,2.005166,2.13734,12.586937,2179.947071,421.499452,1147.852959,384.520841,1.908157,115983.764387
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.79,33.93,18.0,1462.0,297.0,790.0,282.0,2.566375,119400.0
50%,-118.49,34.25,29.0,2127.0,434.0,1167.0,409.0,3.5446,180400.0
75%,-118.0,37.72,37.0,3151.25,648.25,1721.0,605.25,4.767,265000.0
max,-114.31,41.95,52.0,37937.0,6445.0,35682.0,6082.0,15.0001,500001.0


In [114]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB


### head & tail
We can use these methods to look at the first and last entries.

In [115]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


In [116]:
df.tail()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.3,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0
16999,-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147,94600.0


In [117]:
df.head(8)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
5,-114.58,33.63,29.0,1387.0,236.0,671.0,239.0,3.3438,74000.0
6,-114.58,33.61,25.0,2907.0,680.0,1841.0,633.0,2.6768,82400.0
7,-114.59,34.83,41.0,812.0,168.0,375.0,158.0,1.7083,48500.0


### rename
can be used to rename column labels.

In [118]:
df.rename(columns={'longitude': 'lon', 'latitude': 'lat'})

Unnamed: 0,lon,lat,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0


### Accessing a subset of data

To select multiple columns, specify a list of labels within square brackets.

In [119]:
df[['longitude', 'latitude']]

Unnamed: 0,longitude,latitude
0,-114.31,34.19
1,-114.47,34.40
2,-114.56,33.69
3,-114.57,33.64
4,-114.57,33.57
...,...,...
16995,-124.26,40.58
16996,-124.27,40.69
16997,-124.30,41.84
16998,-124.30,41.80


A single column can be retrieved by specifying its name within square brackets. Note that this yields a `Series` object, which behaves a little differently from a DataFrame, but in general represents the values of a single variable.

In [120]:
df['longitude']

0       -114.31
1       -114.47
2       -114.56
3       -114.57
4       -114.57
          ...  
16995   -124.26
16996   -124.27
16997   -124.30
16998   -124.30
16999   -124.35
Name: longitude, Length: 17000, dtype: float64

If a column name does not contain spaces or other special characters, the column can be accessed as an attribute.

In [121]:
df.longitude

0       -114.31
1       -114.47
2       -114.56
3       -114.57
4       -114.57
          ...  
16995   -124.26
16996   -124.27
16997   -124.30
16998   -124.30
16999   -124.35
Name: longitude, Length: 17000, dtype: float64

A subset of a DataFrame can be selected using the `loc` attribute. You need to specify index and column labels. Note that it supports slicing with respect to column names. Also, as the [documentation](https://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#different-choices-for-indexing) says, *note that contrary to usual python slices, both the start and the stop are included!*.

In [124]:
df.loc[5:10, 'population': 'median_house_value']

Unnamed: 0,population,households,median_income,median_house_value
5,671.0,239.0,3.3438,74000.0
6,1841.0,633.0,2.6768,82400.0
7,375.0,158.0,1.7083,48500.0
8,3134.0,1056.0,2.1782,58400.0
9,787.0,271.0,2.1908,48100.0
10,2434.0,824.0,2.6797,86500.0


It is also possible to select data with slices corresponding to index and column **positions** (as in NumPy).

In [125]:
df.iloc[5:10, -4:]

Unnamed: 0,population,households,median_income,median_house_value
5,671.0,239.0,3.3438,74000.0
6,1841.0,633.0,2.6768,82400.0
7,375.0,158.0,1.7083,48500.0
8,3134.0,1056.0,2.1782,58400.0
9,787.0,271.0,2.1908,48100.0
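
The difference between label-based and position-based selection is easiest to see on a tiny frame. The sketch below uses made-up data and the hypothetical names `df_small` and `df_shifted`; it contrasts the inclusive `loc` slice with the half-open `iloc` slice, and then uses an index that does not start at zero.

```python
import pandas as pd

df_small = pd.DataFrame({'value': range(10, 20)})

by_label = df_small.loc[5:8]       # labels 5, 6, 7 and 8 (stop included)
by_position = df_small.iloc[5:8]   # positions 5, 6 and 7 (stop excluded)

# With a non-default index, labels and positions no longer coincide.
df_shifted = pd.DataFrame({'value': range(10, 20)}, index=range(100, 110))
row_by_label = df_shifted.loc[103]     # the row labelled 103
row_by_position = df_shifted.iloc[3]   # the fourth row, which is the same row here

print(len(by_label), len(by_position))
```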


Boolean arrays (including boolean `Series` objects) can be used as well.

In [126]:
index = df.median_house_value < 25000
index

0        False
1        False
2        False
3        False
4        False
         ...  
16995    False
16996    False
16997    False
16998    False
16999    False
Name: median_house_value, Length: 17000, dtype: bool

In [127]:
df.loc[index, 'median_house_value']

264      22500.0
568      14999.0
3226     14999.0
7182     17500.0
11653    22500.0
15499    22500.0
16643    14999.0
16801    14999.0
Name: median_house_value, dtype: float64

A boolean mask can also be passed directly within square brackets to select rows.

In [128]:
df[index]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
264,-116.57,35.43,8.0,9975.0,1743.0,6835.0,1439.0,2.7138,22500.0
568,-117.02,36.4,19.0,619.0,239.0,490.0,164.0,2.1,14999.0
3226,-117.86,34.24,52.0,803.0,267.0,628.0,225.0,4.1932,14999.0
7182,-118.33,34.15,39.0,493.0,168.0,259.0,138.0,2.3667,17500.0
11653,-121.29,37.95,52.0,107.0,79.0,167.0,53.0,0.7917,22500.0
15499,-122.32,37.93,33.0,296.0,73.0,216.0,63.0,2.675,22500.0
16643,-122.74,39.71,16.0,255.0,73.0,85.0,38.0,1.6607,14999.0
16801,-123.17,40.31,36.0,98.0,28.0,18.0,8.0,0.536,14999.0


## Merging dataframes
Several dataframes can be combined into one.


In [129]:
df_1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'], index=[0, 1, 2])
df_2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['b', 'c', 'd'], index=[1, 2, 3])

In [130]:
df_1

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8


In [131]:
df_2

Unnamed: 0,b,c,d
1,0,1,2
2,3,4,5
3,6,7,8


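There are two basic methods for merging dataframes. The first one is `append`, which concatenates dataframes along the index. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell below only runs on older versions.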

In [132]:
df_1.append(df_2)


Unnamed: 0,a,b,c,d
0,0.0,1,2,
1,3.0,4,5,
2,6.0,7,8,
1,,0,1,2.0
2,,3,4,5.0
3,,6,7,8.0
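
On pandas versions where `DataFrame.append` is no longer available, the same row-wise concatenation can be obtained with `pd.concat`. The sketch below recreates the two frames so that it runs on its own; the name `stacked` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'], index=[0, 1, 2])
df_2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['b', 'c', 'd'], index=[1, 2, 3])

# Stack the rows of both frames; columns missing from one frame are filled with NaN.
stacked = pd.concat([df_1, df_2])

print(stacked)
```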


The second one is `join`, which concatenates dataframes along the columns.

In [133]:
df_1.join(df_2, how='outer', lsuffix='_left', rsuffix='_right')

Unnamed: 0,a,b_left,c_left,b_right,c_right,d
0,0.0,1.0,2.0,,,
1,3.0,4.0,5.0,0.0,1.0,2.0
2,6.0,7.0,8.0,3.0,4.0,5.0
3,,,,6.0,7.0,8.0


These two methods are special cases of the more general function `concat`, which allows more advanced operations. There is also the `merge` method, which performs database-like joins.

All of these methods are described and explained here: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
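
To give a feeling for database-like joins, the sketch below merges two small made-up frames on a shared `key` column. All names and values (`left`, `right`, `key`, `left_value`, `right_value`) are hypothetical and serve only as an illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'left_value': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_value': ['x', 'y', 'z']})

# Inner join: keep only the keys present in both frames.
inner = left.merge(right, on='key', how='inner')

# Outer join: keep all keys; values missing on one side become NaN.
outer = left.merge(right, on='key', how='outer')

print(inner)
print(outer)
```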

## Operations on dataframes
Pandas can be used to perform mathematical and other operations on the underlying data.

### Elementwise operations
The most common use case is to create a new column being a result of some operation on existing columns.

In [134]:
people_per_household = df.population / df.households
people_per_household

0        2.150424
1        2.438445
2        2.846154
3        2.278761
4        2.381679
           ...   
16995    2.457995
16996    2.567742
16997    2.728070
16998    2.715481
16999    2.985185
Length: 17000, dtype: float64

In [135]:
df['people_per_household'] = people_per_household
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,people_per_household
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,2.150424
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,2.438445
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,2.846154
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,2.278761
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,2.381679
...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,2.457995
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,2.567742
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,2.728070
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,2.715481


More general functions can be applied with the `apply` method.

In [136]:
def func(row):
  if row.latitude > 0:
    ns_hemisphere = 'N'
  else:
    ns_hemisphere = 'S'

  if row.longitude > 0:
    ew_hemisphere = 'E'
  else:
    ew_hemisphere = 'W'

  gps_label = f"{abs(row.latitude)} {ns_hemisphere}, {abs(row.longitude)} {ew_hemisphere}"

  return gps_label

df.apply(func, axis=1).head()

0    34.19 N, 114.31 W
1     34.4 N, 114.47 W
2    33.69 N, 114.56 W
3    33.64 N, 114.57 W
4    33.57 N, 114.57 W
dtype: object
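
Row-wise `apply` calls a Python function once per row, which can be slow on large frames. When the logic is a simple condition, a vectorized formulation is often possible. The sketch below shows one such alternative with `np.where` on a tiny made-up frame; the frame `coords` and its values are assumptions made for this example.

```python
import numpy as np
import pandas as pd

coords = pd.DataFrame({'latitude': [34.19, -33.87], 'longitude': [-114.31, 151.21]})

# np.where evaluates the condition for the whole column at once.
ns_hemisphere = pd.Series(np.where(coords.latitude > 0, 'N', 'S'), index=coords.index)
ew_hemisphere = pd.Series(np.where(coords.longitude > 0, 'E', 'W'), index=coords.index)

# String concatenation is applied elementwise to the Series.
gps_label = (
    coords.latitude.abs().astype(str) + ' ' + ns_hemisphere + ', '
    + coords.longitude.abs().astype(str) + ' ' + ew_hemisphere
)

print(gps_label)
```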

### GroupBy
can be used to split a dataframe into multiple groups based on some condition or values.

In [137]:
values = df.median_house_value.round(-4)
values


0         70000.0
1         80000.0
2         90000.0
3         70000.0
4         70000.0
           ...   
16995    110000.0
16996     80000.0
16997    100000.0
16998     90000.0
16999     90000.0
Name: median_house_value, Length: 17000, dtype: float64

In [138]:
gb = df.groupby(values)
gb

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7b797c3a7d00>

Group labels and the corresponding row indices can be accessed under the `groups` attribute, and the groups themselves with the `get_group` method.

In [139]:
gb.groups

{10000.0: [568, 3226, 16643, 16801], 20000.0: [17, 264, 7182, 11653, 15499], 30000.0: [19, 110, 116, 157, 9636, 9637, 9641, 10026, 10692, 10935, 11902, 12341, 12671], 40000.0: [14, 18, 20, 36, 46, 50, 59, 115, 126, 158, 288, 427, 497, 1902, 2151, 2591, 3513, 6592, 9043, 9065, 9071, 9086, 9094, 9095, 9122, 9255, 9289, 9390, 9392, 9468, 9486, 9494, 9624, 9647, 9685, 9689, 9743, 9948, 9975, 10050, 10053, 10055, 10125, 10186, 10359, 10470, 10648, 10791, 11306, 11652, 11752, 11796, 12364, 12382, 12478, 13341, 15653, 16823, 16887], 50000.0: [7, 9, 12, 22, 24, 47, 51, 74, 109, 113, 121, 122, 123, 124, 125, 138, 144, 170, 182, 242, 266, 317, 1104, 1863, 2674, 2709, 3054, 4765, 4766, 4767, 8009, 8892, 8897, 8912, 8955, 8966, 9010, 9027, 9028, 9029, 9036, 9040, 9042, 9045, 9046, 9051, 9055, 9056, 9057, 9058, 9059, 9060, 9061, 9062, 9064, 9075, 9077, 9080, 9081, 9082, 9083, 9084, 9090, 9096, 9097, 9099, 9100, 9104, 9105, 9106, 9110, 9114, 9117, 9121, 9130, 9137, 9138, 9157, 9159, 9161, 9162, 9181

In [140]:
gb.get_group(10000)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,people_per_household
568,-117.02,36.4,19.0,619.0,239.0,490.0,164.0,2.1,14999.0,2.987805
3226,-117.86,34.24,52.0,803.0,267.0,628.0,225.0,4.1932,14999.0,2.791111
16643,-122.74,39.71,16.0,255.0,73.0,85.0,38.0,1.6607,14999.0,2.236842
16801,-123.17,40.31,36.0,98.0,28.0,18.0,8.0,0.536,14999.0,2.25


The main use case of the `groupby` functionality is to split a dataframe based on some condition, perform some aggregation within the groups and then join the results. An example of such an operation is calculating the mean within each group.

In [141]:
gb.mean().head()

median_house_value,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,people_per_household
10000.0,-120.1975,37.665,30.75,443.75,151.75,305.25,108.75,2.122475,14999.0,2.56644
20000.0,-118.632,35.65,30.6,2183.0,419.2,1508.2,344.0,1.88086,22000.0,3.115305
30000.0,-118.98,36.394615,26.230769,2336.615385,472.692308,1422.230769,392.307692,2.0335,30992.307692,2.779726
40000.0,-118.979661,36.169831,31.610169,1583.355932,379.338983,1094.084746,306.389831,1.695885,41915.254237,3.709225
50000.0,-119.536537,36.579339,29.821012,1671.498054,398.18677,1200.85214,357.003891,1.750918,51044.747082,3.466028
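
Different aggregations can also be computed for different columns in a single call. The sketch below uses the `agg` method with named aggregations on a small made-up frame; the frame `houses`, its columns and the output names are hypothetical and only illustrate the split, aggregate and join pattern.

```python
import pandas as pd

houses = pd.DataFrame({
    'price_band': ['low', 'low', 'high', 'high', 'high'],
    'median_income': [1.5, 2.5, 4.0, 5.0, 6.0],
    'population': [100, 300, 200, 400, 600],
})

# Each keyword names an output column and maps it to (input column, aggregation).
summary = houses.groupby('price_band').agg(
    mean_income=('median_income', 'mean'),
    total_population=('population', 'sum'),
    n_rows=('population', 'size'),
)

print(summary)
```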


You can read more about the `groupby` functionality in the pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

## Pandas Exercises

### Pandas exercise: IMDB movies

In this exercise we will look into IMDB Top 1000 movies dataset. We will first load it into a dataframe.


In [143]:
url = "https://raw.githubusercontent.com/peetck/IMDB-Top1000-Movies/master/IMDB-Movie-Data.csv"
df = pd.read_csv(url)
df

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40
...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,0.00,45
996,997,Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,0.00,22


Part 1.

1. How much should a movie earn to be among the 2% best-earning movies?
2. Select the movies that earned more than this value.
3. Calculate their mean rating.

In [162]:
### YOUR CODE BEGINS HERE ###
sorted_df = df.sort_values(by='Revenue (Millions)', ascending=False)


top_two_percent = sorted_df[:20]
answer = top_two_percent['Rating'].mean()
print(answer)

### YOUR CODE ENDS HERE ###

7.614999999999999


In [160]:
# Your code in cell(s) above
assert np.isclose(answer, 7.615)

Part 2.
1. Calculate the mean metascore and mean revenue per director (use `groupby`).
2. Restrict the results to directors who directed more than four movies (look up the `value_counts` method).
3. Sort the results with respect to the metascore and display the ranking as a dataframe.
4. Who is the 3rd director in this ranking and what is his/her mean revenue?

In [171]:
### YOUR CODE BEGINS HERE ###

### YOUR CODE ENDS HERE ###

In [None]:
print(f"Mean revenue of {director} is {revenue:.3f}.")
assert np.isclose(revenue, 43.242)

### Pandas exercises: Anscombe's quartet

In this exercise we will look into the so-called Anscombe's quartet, which illustrates the pitfalls of some statistics used to describe datasets.

![image](https://upload.wikimedia.org/wikipedia/commons/e/ec/Anscombe%27s_quartet_3.svg)

Firstly, let us load the data.

In [163]:
df = pd.read_json("./sample_data/anscombe.json")
df

Unnamed: 0,Series,X,Y
0,I,10,8.04
1,I,8,6.95
2,I,13,7.58
3,I,9,8.81
4,I,11,8.33
5,I,14,9.96
6,I,6,7.24
7,I,4,4.26
8,I,12,10.84
9,I,7,4.81


For each series calculate
- the mean value of $X$,
- the mean value of $Y$,
- the sample variance of $X$,
- the sample variance of $Y$,
- the covariance between $X$ and $Y$,

and verify whether for each series you get
- mean of $X$ = 9,
- var of $X$ = 11,
- mean of $Y$ = 7.5 $\pm$ 0.01,
- var of $Y$ = 4.125 $\pm$ 0.01,
- cov of $X$ and $Y$ = 5.50 $\pm$ 0.01.

(You can try to make use of the `aggregate` method of a `GroupBy` object.)

In [167]:
### YOUR CODE BEGINS HERE ###
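means = df.groupby('Series').mean()
# Named `variances` to avoid shadowing the built-in `vars`; var() is the sample variance (ddof=1)
variances = df.groupby('Series').var()
covs = df.groupby('Series').cov()
print(means)
print(variances)
print(covs)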
### YOUR CODE ENDS HERE ###

          X         Y
Series               
I       9.0  7.500000
II      9.0  7.500909
III     9.0  7.500000
IV      9.0  7.500909
           X         Y
Series                
I       11.0  4.132640
II      11.0  4.127629
III     11.0  4.122620
IV      11.0  4.123249
               X         Y
Series                    
I      X  11.000  5.503000
       Y   5.503  4.132640
II     X  11.000  5.500000
       Y   5.500  4.127629
III    X  11.000  5.497000
       Y   5.497  4.122620
IV     X  11.000  5.499000
       Y   5.499  4.123249

