<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60"></center>

AI TECH - Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (Academy of Innovative Applications of Digital Technologies). Operational Programme Digital Poland 2014-2020
<hr>

<center><img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'></center>

<center>
Project co-financed by the European Union under the European Regional Development Fund
Operational Programme Digital Poland 2014-2020,
Priority Axis 3 "Digital competences of society", Measure 3.2 "Innovative solutions for digital activation"
Project title: "Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)"
    </center>

# Statistical machine learning - Notebook 1

**Author: Maciej Bartczak**



# Bootcamp - introduction to machine learning: NumPy & Pandas


## What is NumPy?

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, basic linear algebra, statistical operations, random simulation and much more.

At the core of the NumPy package, is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance. There are several important differences between NumPy arrays and the standard Python sequences:

- The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.

- NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, **such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences**.

- A growing plethora of scientific and mathematical Python-based packages are using NumPy arrays; though these typically support Python-sequence input, they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays. In other words, in order to efficiently use much (perhaps even most) of today’s scientific/mathematical Python-based software, just knowing how to use Python’s built-in sequence types is insufficient - one also needs to know how to use NumPy arrays.



from: https://numpy.org/doc/stable/user/whatisnumpy.html
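
As a small illustration of the "less code" point above, the sketch below computes the same elementwise product once with a plain Python list and once with a NumPy array. The variable names are made up for this example.

```python
import numpy as np

# Plain Python: an explicit loop (here a list comprehension) over the elements.
values_list = [1.0, 2.0, 3.0, 4.0]
squares_list = [v * v for v in values_list]

# NumPy: the same operation expressed on the whole array at once.
values_array = np.array(values_list)
squares_array = values_array * values_array

print(squares_list)
print(squares_array)
```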

In [3]:
import numpy as np
import matplotlib.pyplot as plt

## Array creation

There are multiple ways to create an array, try examples below.

### From nested lists

In [None]:
array_1d = np.array([1, 2, 3, 4, 5])
array_1d

In [None]:
array_2d = np.array(
  [
    [1,  2,  3,  4,  5 ],
    [6,  7,  8,  9,  10],
    [11, 12, 13, 14, 15]
  ]
)
array_2d

In [None]:
array_3d = np.array(
  [
    [
      [1,  2,  3,  4,  5 ],
      [6,  7,  8,  9,  10],
      [11, 12, 13, 14, 15],
    ],
    [
      [16, 17, 18, 19, 20],
      [21, 22, 23, 24, 25],
      [26, 27, 28, 29, 30],
    ]
  ]
)
array_3d

### Using zeros / ones / full



In [None]:
shape = (2, 3)

np.zeros(shape)

In [None]:
np.ones(shape)

In [None]:
fill_value = 42
np.full(shape, fill_value)

### Using zeros_like / ones_like / full_like

In [None]:
np.zeros_like(array_2d)

In [None]:
np.ones_like(array_2d)

In [None]:
fill_value = 42
np.full_like(array_2d, fill_value)

### Using arange / linspace (1D arrays only)

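Both fill an array with evenly spaced values between `start` and `stop`: `arange` takes a step and excludes `stop` (the half-open interval $[start, stop)$), while `linspace` takes the number of points and by default includes `stop`.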

In [None]:
start = 0
stop  = 1
step  = 0.1

np.arange(start, stop, step)

In [None]:
np.arange(start, stop + 1e-4, step)

In [None]:
n_steps = int((stop-start)/step) + 1
np.linspace(start, stop, n_steps)
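
The cells above hint at a practical difference: `arange` leaves `stop` out, and with a non-integer step rounding can make the number of returned points hard to predict, whereas `linspace` takes the point count explicitly. A minimal sketch, reusing the values from the cells above (the names `half_open` and `closed` are illustrative):

```python
import numpy as np

start, stop, step = 0, 1, 0.1

# arange works on the half-open interval [start, stop): stop itself is left out.
half_open = np.arange(start, stop, step)

# linspace includes both endpoints by default; the point count is explicit.
n_points = 11
closed = np.linspace(start, stop, n_points)

print(len(half_open), half_open[-1])
print(len(closed), closed[-1])
```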

### Using identity / diag (2D arrays only)
 to create identity or diagonal (with optional offset) matrices.

In [None]:
n = 4
np.identity(n)

In [None]:
np.diag([1, 2, 3, 4])

In [None]:
offset = 1
np.diag([1, 2, 3], offset)

### Using meshgrid
For 1D arrays X_1, ..., X_n meshgrid behavior is specified as follows

```
G_1, ..., G_n = np.meshgrid(X_1, ..., X_n, indexing='ij')
G_j[i_1, i_2, ..., i_n] = X_j[i_j]

G_1, ..., G_n = np.meshgrid(X_1, ..., X_n, indexing='xy')  # default indexing
G_j[i_2, i_1, i_3, ..., i_n] = X_j[i_j]
```

which facilitates creating coordinate grids,




In [None]:
x = np.array([1, 2, 3], dtype=int)
y = np.array([4, 5, 6, 7], dtype=int)

x_grid, y_grid = np.meshgrid(x, y, indexing='ij')

such as:

In [None]:
x_grid

In [None]:
y_grid

In [None]:
x_grid.shape == y_grid.shape == (len(x), len(y))
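
To make the difference between the two indexing modes concrete, here is a short sketch comparing the shapes returned for `indexing='ij'` and the default `indexing='xy'`. The variable names are chosen for this example only.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6, 7])

# Matrix-style indexing: the first axis follows x, the second follows y.
x_ij, y_ij = np.meshgrid(x, y, indexing='ij')

# Cartesian (default) indexing: the first two axes are swapped.
x_xy, y_xy = np.meshgrid(x, y, indexing='xy')

print(x_ij.shape)  # (3, 4)
print(x_xy.shape)  # (4, 3)

# Both layouts describe the same grid, up to a transpose.
print(np.array_equal(x_ij, x_xy.T))  # True
```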

### Using tile & repeat
to create an array filled with a repeating pattern.

In [None]:
array = np.array([0,1,2])

In [None]:
np.tile(array, 2)

In [None]:
np.tile(array, (2, 1))

In [None]:
np.repeat(array, 2)

In [None]:
np.repeat(array, 2, axis=0)

In [None]:
np.repeat(array.reshape(1, -1), 2, axis=0)

In [None]:
np.repeat(array.reshape(1, -1), 2, axis=1)

## NumPy dtypes

Full list of NumPy dtypes is available here: https://numpy.org/doc/stable/user/basics.types.html

Among them there are five basic numerical types, representing booleans (bool), integers (int), unsigned integers (uint), floating point numbers (float) and complex numbers (complex).

Data-types can be used as arguments to the dtype keyword that many numpy functions or methods accept, in particular array creation routines.

See examples below.

In [None]:
np.array([1, 2, 3, 4, 5], dtype=int)

In [None]:
np.array([1, 2, 3, 4, 5], dtype=float)

In [None]:
np.array([1, 2, 3, 4, 5], dtype=complex)

Types can be cast as follows

In [None]:
float_array   = np.array([1, 2, 3, 4, 5], dtype=float)
complex_array = float_array.astype(complex)

float_array, float_array.dtype

In [None]:
complex_array, complex_array.dtype
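
Casting can also lose information. The sketch below, with made-up values, shows that converting floats to integers with `astype` drops the fractional part and leaves the original array untouched.

```python
import numpy as np

float_array = np.array([1.7, -1.7, 2.5])

# astype returns a new array; casting float -> int drops the fractional part.
int_array = float_array.astype(int)

print(int_array.tolist())  # [1, -1, 2]
print(float_array.dtype)   # float64: the original array is unchanged
```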

## Saving / loading data

### Using savetxt / loadtxt

to process a single 1D/2D array in a human-readable format. A file created with `savetxt` can be easily opened in a text editor

In [None]:
data = np.array(
  [
    [1, 2],
    [3, 4],
  ]
)

np.savetxt('data.txt', data)
!cat data.txt

as well as seamlessly loaded back with `loadtxt`.


In [None]:
np.loadtxt('data.txt')

The number format can be specified with `fmt`, see https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html for reference

In [None]:
np.savetxt('data.txt', data, fmt='%.2f')
!cat data.txt

In [None]:
np.savetxt('data.txt', data, fmt='%d')
!cat data.txt

Nevertheless, information about the dtype is not preserved: `loadtxt` returns floats unless a dtype is given.

In [None]:
np.loadtxt('data.txt')

In [None]:
np.loadtxt('data.txt', dtype=int)

The `savetxt` function can also be used to save an array in CSV format; we only need to specify the proper delimiter

In [None]:
np.savetxt('data.csv', data, delimiter=',', fmt='%.2f')
!cat data.csv

as well as loaded back

In [None]:
np.loadtxt('data.csv', delimiter=',')

### Using save / load
to process an arbitrary array in binary format. Unfortunately the resulting file is not human readable. On the other hand, these functions can handle arrays of arbitrary shape and preserve the underlying dtype.

In [None]:
data = np.array(
  [
    [
      [1, 2],
      [3, 4],
    ]
  ]
)

np.save('data.npy', data)

!cat data.npy

In [None]:
np.load('data.npy')

In [None]:
np.load('data.npy').dtype == data.dtype

### Using savez / load

to process **multiple** arbitrary arrays in binary format. The result of the load operation is an object of dictionary-like structure.

In [None]:
array_1 = np.array([1, 2, 3], dtype=int)
array_2 = np.array([[4, 5, 6]], dtype=float)
array_3 = np.array([[[7, 8, 9]]], dtype=complex)

np.savez('data.npz', array_1, array_2, some_name=array_3)
loaded_data = np.load('data.npz')
loaded_data.files

In [None]:
loaded_data['arr_0']

In [None]:
loaded_data['arr_1']

In [None]:
loaded_data['some_name']
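
Positional arguments end up under automatic names such as `arr_0`, so passing every array as a keyword argument tends to be easier to read back. A minimal sketch under that assumption; the file name `named.npz` and the keys `ints` and `floats` are made up for this example.

```python
import os
import tempfile

import numpy as np

ints = np.array([1, 2, 3], dtype=int)
floats = np.array([[4.0, 5.0, 6.0]])

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'named.npz')
    np.savez(path, ints=ints, floats=floats)

    # np.load on an .npz file returns a dictionary-like object.
    with np.load(path) as archive:
        names = sorted(archive.files)
        restored_ints = archive['ints']
        restored_floats = archive['floats']

print(names)  # ['floats', 'ints']
print(restored_ints.dtype == ints.dtype, restored_floats.shape)
```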

### Using savez_compressed / load
to process multiple arbitrary arrays in binary format with compression. It behaves just like `savez`, but the resulting file is also compressed, which saves disk space.



In [None]:
shape = (1000, 1000)
data = np.ones(shape)
np.savez('data_uncompressed.npz', data)
np.savez_compressed('data_compressed.npz', data)

We can compare the filesizes with `ls` command.

In [None]:
!ls data_*compressed.npz -lh

## Shape manipulation
It is possible to change the structure of data in the array. See examples below.

#### Reshape

In [None]:
data = np.arange(0, 12, 1, dtype = int)
data

In [None]:
data.reshape(3, 4)

In [None]:
data.reshape(4, 3)

In [None]:
data_reshaped = data.reshape(2, 3, 2)
data_reshaped

In [None]:
shape = data_reshaped.shape
shape

#### Inferring **one** dimension while using the reshape method

In [None]:
data.reshape(6, -1)

In [None]:
data.reshape(-1, 6)

In [None]:
data.reshape(3, 2, -1)

#### Flatten

In [None]:
data_reshaped.flatten()

#### Adding / deleting dummy dimensions
Facilitates vectorized arithmetic in conjunction with the broadcasting mechanism (explained later)

In [None]:
data

In [None]:
data.shape

In [None]:
data_expanded1 = np.expand_dims(data, 0)
data_expanded1

In [None]:
data_expanded1.shape

In [None]:
data_expanded2 = np.expand_dims(data, 1)
data_expanded2

In [None]:
data_expanded2.shape

In [None]:
data_expanded3 = np.expand_dims(data_expanded1, 2)
data_expanded3

In [None]:
data_expanded3.shape

In [None]:
data_squeezed = data_expanded3.squeeze()
data_squeezed

In [None]:
data_squeezed.shape
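
Dummy dimensions can also be added directly while indexing. The sketch below shows the `np.newaxis` spelling next to the `expand_dims` calls used above; the names `row` and `column` are illustrative.

```python
import numpy as np

data = np.arange(12)

# Indexing with np.newaxis (an alias for None) inserts a length-1 axis,
# just like np.expand_dims.
row = data[np.newaxis, :]     # same as np.expand_dims(data, 0)
column = data[:, np.newaxis]  # same as np.expand_dims(data, 1)

print(row.shape, column.shape)  # (1, 12) (12, 1)
```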

#### Transposing

For 2D arrays it behaves just as expected (performs matrix transpose).

In [None]:
data = np.arange(0, 4, 1).reshape(2, 2)
data

In [None]:
data.transpose()

For higher dimensional arrays it's best visualized by observing how the shape changes.

In [None]:
shape = (1, 2, 3, 4)
data = np.ones(shape)
data.shape

The default behavior is to reverse the order of the dimensions.

In [None]:
data.transpose().shape

However other permutations of dimensions might be specified.

In [None]:
dimensions_permutation = [1, 0, 3, 2]
data.transpose(dimensions_permutation).shape

`.T` is a shorthand for `.transpose()`

In [None]:
data.T.shape
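
Besides the shape, it can help to check where a single element ends up. A small sketch with an arbitrarily chosen permutation:

```python
import numpy as np

data = np.arange(24).reshape(2, 3, 4)
permutation = (2, 0, 1)
moved = data.transpose(permutation)

# Axis k of the result is axis permutation[k] of the input.
print(moved.shape)  # (4, 2, 3)

# The element at [i, j, k] in the input ends up at [k, i, j] in the result.
print(moved[3, 1, 2] == data[1, 2, 3])  # True
```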

## Concatenation & stacking
The following methods are used to merge multiple arrays (with compatible shapes) into one array.

### concatenate
used to join a sequence of arrays along an **existing** axis.

In [None]:
shape = (4,)
array_1 = np.full(shape, 1)
array_2 = np.full(shape, 2)

array_1, array_2

In [None]:
np.concatenate([array_1, array_2, array_1])

In [None]:
array_1 = array_1.reshape(1, -1)
array_2 = array_2.reshape(1, -1)

array_1, array_2

The default axis to join on is 0

In [None]:
np.concatenate([array_1, array_2, array_1])

But it can be changed

In [None]:
np.concatenate([array_1, array_2, array_1], axis=1)

### stack

used to stack a sequence of arrays along a **new** axis.

In [1]:
shape = (4,)
array_1 = np.full(shape, 1)
array_2 = np.full(shape, 2)

array_1, array_2

In [None]:
np.stack([array_1, array_2, array_1])

In [None]:
np.stack([array_1, array_2, array_1], axis=1)


### hstack, vstack and dstack
stand for horizontal, vertical and depthwise stacking

In [None]:
np.hstack([array_1, array_2, array_1])

In [None]:
np.vstack([array_1, array_2, array_1])

In [None]:
np.dstack([array_1, array_2, array_1])
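
To compare the functions from this section side by side, the sketch below applies each of them to the same pair of 1D arrays and prints the resulting shapes. The dictionary `results` exists only for this illustration.

```python
import numpy as np

a = np.full((4,), 1)
b = np.full((4,), 2)

# concatenate joins along an existing axis; stack and the *stack helpers
# may create new axes.
results = {
    'concatenate': np.concatenate([a, b]),
    'stack axis=0': np.stack([a, b], axis=0),
    'stack axis=1': np.stack([a, b], axis=1),
    'hstack': np.hstack([a, b]),
    'vstack': np.vstack([a, b]),
    'dstack': np.dstack([a, b]),
}

for name, result in results.items():
    print(f'{name:>12}: {result.shape}')
```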

## Mathematical operations
As stated before

> NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.

Let us observe some examples.



In [None]:
x = np.arange(0, 3, 1)
y = np.arange(3, 6, 1)

x, y

#### Elementwise

Basic mathematical operations will be applied to the arrays elementwise.

In [None]:
2 * x

In [None]:
x * 2

In [None]:
np.sin(x)

In [None]:
x**2

The same applies to the binary (taking two arguments) operators and arrays of the same shape.

In [None]:
x + y

In [None]:
x * y

In [None]:
x / y

In [None]:
y ** x

#### Broadcasting

When arrays have different shapes the broadcasting mechanism comes into play.

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimensions and works its way left. Two dimensions are compatible when

- they are equal, or
- one of them is 1

If these conditions are not met, a `ValueError: operands could not be broadcast together` exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the size that is not 1 along each axis of the inputs.

from: https://numpy.org/devdocs/user/basics.broadcasting.html

In [None]:
x_shape = (2, 4)
y_shape = (2, 1)

x = np.ones(x_shape)
y = np.ones(y_shape)

(x + y).shape

In [None]:
x_shape =    (2, 4)
y_shape = (3, 1, 1)

x = np.ones(x_shape)
y = np.ones(y_shape)

(x + y).shape

In [None]:
x_shape = (   1, 4)
y_shape = (3, 2, 1)

x = np.ones(x_shape)
y = np.ones(y_shape)

(x + y).shape
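
The rules can also be checked on shapes alone, and it is instructive to see a case that fails. A minimal sketch, assuming NumPy 1.20 or newer (where `np.broadcast_shapes` is available):

```python
import numpy as np

# np.broadcast_shapes applies the broadcasting rules to shapes alone.
print(np.broadcast_shapes((1, 4), (3, 2, 1)))  # (3, 2, 4)

# Trailing dimensions 4 and 3 are neither equal nor 1, so this fails.
try:
    np.ones((2, 4)) + np.ones((2, 3))
except ValueError as error:
    message = str(error)
    print('ValueError:', message)
```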

### Matrix / tensor operations
NumPy package wouldn't be complete without means to manipulate arrays other than elementwise operations, such as most common linear algebra functionalities. See the examples below.

#### Matrix multiplication

can be performed with the `matmul` function

In [None]:
a = np.eye(3)
a[2,2] = 2
b = np.arange(0, 3, 1).reshape(3, 1)

a

In [None]:
b

In [None]:
np.matmul(a, b)

or the `@` operator

In [None]:
a @ b

Be mindful of its varying behaviour for different input shapes

- If both arguments are 2-D they are multiplied like conventional matrices.
- If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
- If the first argument is 1-D, it is promoted to a matrix by prepending a 1 to its dimensions. After matrix multiplication the prepended 1 is removed.
- If the second argument is 1-D, it is promoted to a matrix by appending a 1 to its dimensions. After matrix multiplication the appended 1 is removed.

from: https://numpy.org/doc/stable/reference/generated/numpy.matmul.html
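
The four rules above can be observed through the output shapes. In this sketch the array names and sizes are arbitrary:

```python
import numpy as np

matrix = np.ones((3, 4))
vector = np.ones(4)
stack = np.ones((5, 4, 2))  # five 4x2 matrices

print((matrix @ np.ones((4, 2))).shape)  # (3, 2): ordinary matrix product
print((matrix @ vector).shape)           # (3,): the appended 1 is removed again
print((vector @ np.ones((4, 2))).shape)  # (2,): the prepended 1 is removed again
print((matrix @ stack).shape)            # (5, 3, 2): matrix broadcast over the stack
```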

#### Dot product

can be performed using the `dot` function.

In [None]:
x = np.arange(0, 3, 1)
y = np.arange(3, 6, 1)

x, y

In [None]:
np.dot(x, y)

Be mindful of its varying behaviour for different input shapes

- If both a and b are 1-D arrays, it is inner product of vectors (without complex conjugation).

- If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred.

- If either a or b is 0-D (scalar), it is equivalent to multiply and using numpy.multiply(a, b) or a * b is preferred.

- If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b.

- If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b:

  ```
  dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
  ```

from: https://numpy.org/doc/stable/reference/generated/numpy.dot.html
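
For arrays with more than two dimensions, `dot` and `matmul` give different results. The sketch below compares only the output shapes, using arbitrary sizes:

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))

# matmul treats the leading axis as a batch of matrices.
print(np.matmul(a, b).shape)  # (2, 3, 5)

# dot sums over the last axis of a and the second-to-last axis of b,
# keeping all remaining axes of both inputs.
print(np.dot(a, b).shape)     # (2, 3, 2, 5)
```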


#### Einstein summation convention
The Einstein summation convention provides a concise way of expressing multilinear operations. You can look it up here https://mathworld.wolfram.com/EinsteinSummation.html and its NumPy specification here https://numpy.org/doc/stable/reference/generated/numpy.einsum.html.

Take a look at the following illustrative examples.


In [None]:
data = np.tri(3)
data

In [None]:
np.einsum('ij->ji', data)  # transpose

In [None]:
np.einsum('ij->i', data)  # row sum

In [None]:
np.einsum('ij->j', data)  # column sum

In [None]:
np.einsum('ii->i', data)  # diagonal

In [None]:
vector = np.ones(3)
vector

In [None]:
np.einsum('ij,j->i', data, vector)  # matrix multiplication (matrix x vector)

In [None]:
np.einsum('ij,jk->ik', data, vector.reshape(3, 1))  # matrix multiplication (matrix x matrix)
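
One way to gain confidence in an `einsum` expression is to compare it with an equivalent NumPy call. A short sketch on random inputs (the seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 4))
b = rng.normal(size=(4, 5))

# A repeated index that is missing from the output is summed over.
product = np.einsum('ij,jk->ik', a, b)
print(np.allclose(product, a @ b))  # True

# 'ii->' sums the diagonal, i.e. computes the trace.
square = rng.normal(size=(4, 4))
print(np.isclose(np.einsum('ii->', square), np.trace(square)))  # True
```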

## Other operations

### Aggregations: mean / var
Very often we are required to calculate some aggregations with respect to specific dimensions of an array, such as mean or variance.

In [None]:
data = np.arange(0, 20, 1).reshape(4, 5)
data

By default the whole array is aggregated

In [None]:
data.mean()

But aggregations can also be applied axiswise

In [None]:
data.mean(axis=0)

In [None]:
data.mean(axis=1)

In [None]:
data.var(axis=0)

In [None]:
data.var(axis=1)

### Aggregation: quantiles

Many more aggregations are possible, e.g. determining data quantiles.

In [None]:
data = np.linspace(1, 2, 101)

np.quantile(data, 0.3)

In [None]:
np.quantile(data, [0.3, 0.6])

### Sorting
Arrays can be sorted. By default sorting is performed with respect to the last dimension

In [None]:
data = np.array([
  [1, 4, 2, 8, 5],
  [2, 3, 1, 9, 6],
  [8, 2, 3, 7, 4],
])

np.sort(data)

but sorting dimension can be explicitly specified.

In [None]:
np.sort(data, axis=0)

In [None]:
np.sort(data, axis=1)

You can also get the "sorting permutations" that can be used in conjunction with advanced indexing (introduced below).

In [None]:
np.argsort(data, axis=0)

In [None]:
np.argsort(data, axis=1)
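
As a sketch of how such a permutation can be applied, `np.take_along_axis` reorders each row with its own indices, which reproduces `np.sort`; for a 1D array plain integer-array indexing is enough.

```python
import numpy as np

data = np.array([
    [1, 4, 2, 8, 5],
    [2, 3, 1, 9, 6],
])

order = np.argsort(data, axis=1)

# take_along_axis applies a per-row permutation such as the one from argsort.
sorted_rows = np.take_along_axis(data, order, axis=1)
print(np.array_equal(sorted_rows, np.sort(data, axis=1)))  # True

# For a 1D array plain integer-array indexing is enough.
row = data[0]
print(row[np.argsort(row)].tolist())  # [1, 2, 4, 5, 8]
```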

## Indexing

Allows selecting items and subarrays. It can be also used to modify fragments of an array.

### Basic indexing

In [None]:
data = np.arange(0, 16, 1)
data

In [None]:
data[10]

In [None]:
data[-2]

In [None]:
data[:5]

In [None]:
data[::2]

In [None]:
data[-3:]

Basic indexing also works in higher dimensions


In [None]:
data = data.reshape(4, 4)
data

In [None]:
data[1, 1]

In [None]:
data[1, :]

One does not need to specify trailing colons `:`

In [None]:
data[:-2]

Note how the column selection is achieved. In this case `:` can't be omitted.

In [None]:
data[:, -2:]

One group of subsequent colons `:` can be replaced with `...`

In [None]:
data_reshaped = data.reshape(4, 2, 2)
data_reshaped

In [None]:
data_reshaped[..., 0]

Note the difference

In [None]:
data_reshaped[:, 0]

### View & copies

Basic indexing routines presented above return views of the data in the underlying arrays. When the underlying array is modified the changes are reflected in the view.

In [None]:
data = np.arange(0, 16, 1).reshape(4, 4)
data

In [None]:
view = data[:, 3]
view

In [None]:
data *= 2
data

In [None]:
view

To "detach" view from the underlying array use `copy` method

In [None]:
view_copy = view.copy()
data *= 2
view, view_copy

Views are only possible with basic indexing (slices and integers): the result can then be described just by remembering an offset and strides into the original data.

You can read more about this here: https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html
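
Whether two arrays share data can be checked with `np.shares_memory`. The sketch below contrasts a slice-based selection with a selection by an integer list (advanced indexing, covered in the next section), which returns a copy.

```python
import numpy as np

data = np.arange(16).reshape(4, 4)

view = data[:, 3]      # basic indexing (slice + integer): a view
picked = data[:, [3]]  # advanced indexing (integer list): a copy

print(np.shares_memory(data, view))    # True
print(np.shares_memory(data, picked))  # False

# Writing through the view changes the original array.
view[0] = -1
print(data[0, 3])  # -1
```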

### Advanced indexing

It is also possible to index with integer or boolean arrays

In [None]:
data = np.arange(0, 16, 1).reshape(4, 4)
data

We can select specific rows

In [None]:
data[[0, 3]]

and columns.

In [None]:
data[:, [0, 3]]

Selecting both together is a little more complicated

In [None]:
rows = [
 [0, 0],
 [3, 3]
]
columns = [
 [0, 3],
 [0, 3]
]
data[rows, columns]

Use indices shape to control the output shape.

In [None]:
rows = [0, 0, 3, 3]
columns = [0, 3, 0, 3]
data[rows, columns]
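
Writing out every row and column combination by hand gets tedious. As a sketch of one alternative, `np.ix_` builds index arrays that broadcast against each other and so select the whole sub-grid:

```python
import numpy as np

data = np.arange(16).reshape(4, 4)

# np.ix_ builds index arrays of shapes (2, 1) and (1, 2) that broadcast
# to every (row, column) combination.
rows, columns = np.ix_([0, 3], [0, 3])
print(rows.shape, columns.shape)     # (2, 1) (1, 2)
print(data[rows, columns].tolist())  # [[0, 3], [12, 15]]
```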

Boolean arrays can be used as well, though they work somewhat differently.

In [None]:
index = (data % 3 == 0)
index

In [None]:
data[index]

Its most common use case is to update array entries based on some condition.

In [None]:
data[index] = -1
data

The detailed behavior of indexing with boolean arrays is described here: https://numpy.org/doc/stable/reference/arrays.indexing.html#advanced-indexing

# NumPy Exercises

### Exercise: multiplication table
Create a "{0, ..., 10} x {0, ..., 10} multiplication table" in two ways, using only
- `meshgrid` method,
- multiplication of 1D arrays, dummy dimensions and broadcasting mechanism.

In [66]:
### YOUR CODE BEGINS HERE ###
x = np.arange(0, 11, 1)
y = np.arange(0, 11, 1)
x_grid, y_grid = np.meshgrid(x, y, indexing='ij')

mul_table = x_grid * y_grid
mul_table
### YOUR CODE ENDS HERE ###

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
       [  0,   2,   4,   6,   8,  10,  12,  14,  16,  18,  20],
       [  0,   3,   6,   9,  12,  15,  18,  21,  24,  27,  30],
       [  0,   4,   8,  12,  16,  20,  24,  28,  32,  36,  40],
       [  0,   5,  10,  15,  20,  25,  30,  35,  40,  45,  50],
       [  0,   6,  12,  18,  24,  30,  36,  42,  48,  54,  60],
       [  0,   7,  14,  21,  28,  35,  42,  49,  56,  63,  70],
       [  0,   8,  16,  24,  32,  40,  48,  56,  64,  72,  80],
       [  0,   9,  18,  27,  36,  45,  54,  63,  72,  81,  90],
       [  0,  10,  20,  30,  40,  50,  60,  70,  80,  90, 100]])
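
The second approach from the exercise statement, multiplication of 1D arrays with dummy dimensions and broadcasting, could look like the sketch below. The name `mul_table_broadcast` is chosen here to avoid clashing with the cell above.

```python
import numpy as np

x = np.arange(0, 11)

# A column (11, 1) times a row (1, 11) broadcasts to an (11, 11) table.
mul_table_broadcast = x[:, np.newaxis] * x[np.newaxis, :]

print(mul_table_broadcast.shape)  # (11, 11)
print(mul_table_broadcast[7, 8])  # 56
```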

### Exercise: speed comparison
In this exercise we will measure the speed improvements due to NumPy implementation.

In [68]:
length = 10000
a = b = np.arange(length)

Write code that calculates the dot product of arrays `a` and `b` using only native Python functions. Use the `timeit` magic to time the execution.

In [69]:
%%timeit

### YOUR CODE BEGINS HERE ###
sum(a * b)
### YOUR CODE ENDS HERE ###

735 µs ± 135 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
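
Note that `a * b` in the cell above is still a NumPy elementwise product; only the final `sum` is native Python. A sketch of a version that avoids NumPy arithmetic entirely, assuming the inputs are first converted to plain lists (the helper name `python_dot` is made up for this example, and `dtype=np.int64` is set explicitly so the reference result cannot overflow):

```python
import numpy as np

def python_dot(xs, ys):
    """Dot product using only built-in Python operations."""
    return sum(x * y for x, y in zip(xs, ys))

length = 10000
a = b = np.arange(length, dtype=np.int64)

# Convert to lists of Python ints so no NumPy arithmetic is involved.
a_list, b_list = a.tolist(), b.tolist()

print(python_dot(a_list, b_list) == int(np.dot(a, b)))
```

In a notebook, `%timeit python_dot(a_list, b_list)` can then be compared with the NumPy timing.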


Now compare the execution time with `np.dot`.

In [70]:
%%timeit

### YOUR CODE BEGINS HERE ###
np.dot(a, b)
### YOUR CODE ENDS HERE ###

8.38 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


### Exercise: vector operations
Calculate
$$ \int_0^1 \sin(a\cdot2\pi x)\sin(b\cdot 2\pi x)dx $$
for
- $a=b=1$,
- $a=1$ and $b=2$,

using discrete approximation of an integral and following NumPy functions: `linspace, sin, mean`. Do not use `vectorize`.


In [71]:
# Case a = b = 1
### YOUR CODE BEGINS HERE ###
a = b = 1
start = 0
stop = 1
step = 1e-5
n_steps = int((stop - start) / step) + 1
steps = np.linspace(start, stop, n_steps)
answer = np.mean(np.sin(a * 2 * np.pi * steps) * np.sin(b * 2 * np.pi * steps))
### YOUR CODE ENDS HERE ###

print(answer.round(4))
assert np.isclose(answer, 0.5)

0.5


In [72]:
# Case a = 1, b = 2
### YOUR CODE BEGINS HERE ###
a = 1
b = 2
start = 0
stop = 1
step = 1e-5
n_steps = int((stop - start) / step) + 1
steps = np.linspace(start, stop, n_steps)
answer = np.mean(np.sin(a * 2 * np.pi * steps) * np.sin(b * 2 * np.pi * steps))
### YOUR CODE ENDS HERE ###

print(answer.round(4))
assert np.isclose(answer, 0)

0.0


### Exercise: vector operations ctd.

Let $a_j = j$. Calculate in a similar fashion matrix $(b_{ij})_{i,j=0, ..., 4}$, where
$$ b_{ij} = \int_0^1 \sin(a_i\cdot2\pi x)\sin(a_j\cdot 2\pi x)dx. $$

Try to structure your code in such a way that it would accept an arbitrary array `a`. Do not use `vectorize`.

In [73]:
a = np.arange(5)

### YOUR CODE BEGINS HERE ###
start = 0
stop = 1
step = 1e-5
n_steps = int((stop - start) / step) + 1
steps = np.linspace(start, stop, n_steps).reshape(1, -1)
a = a.reshape(-1, 1)
sin_mat = np.sin(2 * np.pi * (a @ steps))
answer = sin_mat @ sin_mat.transpose() / n_steps
### YOUR CODE ENDS HERE ###

print(answer.round(2))
assert np.isclose(answer, np.diag([0, 0.5, 0.5, 0.5, 0.5])).all()

[[ 0.   0.   0.   0.   0. ]
 [ 0.   0.5 -0.  -0.  -0. ]
 [ 0.  -0.   0.5 -0.   0. ]
 [ 0.  -0.  -0.   0.5 -0. ]
 [ 0.  -0.   0.  -0.   0.5]]


### Exercise: subarrays

Let us sample $n=100 000$ standard normal random variables. Calculate the mean of the values greater than zero.

In [74]:
np.random.seed(0)

n = 100000
sample = np.random.randn(n)

### YOUR CODE BEGINS HERE ###
answer = np.mean(sample[np.where(sample > 0)])
### YOUR CODE ENDS HERE ###

assert np.isclose(answer, np.sqrt(2/np.pi), rtol=0.01)

### Exercise: subarrays ctd.
Let us sample an array of random integers.
1. Select the columns whose sum of values gives remainder 1 when divided by 3.
2. Change the entries in those columns to ones.
3. Calculate sum of all entries.

In [75]:
np.random.seed(0)
sample = np.random.randint(0, 3, size=(10, 10))

### YOUR CODE BEGINS HERE ###
cor_mod_cols = np.where(np.sum(sample, axis=0) % 3 == 1)
sample[:, cor_mod_cols] = 1
answer = np.sum(sample)
### YOUR CODE ENDS HERE ###

assert answer == 88


### Exercise: matrix multiplication & merging arrays

In this exercise we will approximate the operator norm of some matrices.

1. Create an array of $n=1000$ vectors $v\in\mathbb{R}^2$ "evenly spaced" on the unit circle.
2. Combine given three matrices into one array.
3. Use **one** matrix multiplication operation with these arrays to obtain matrix products of all the vectors with all the matrices.
4. For each vector calculate its (euclidean) length.
5. Take maximum with respect to each matrix.
6. You should obtain three values representing approximations of operator norm of provided matrices.


In [76]:
matrix_1 = np.array([
  [1, 0],
  [0, 1]
])

matrix_2 = np.array([
  [1, 1],
  [1, 1]
])

matrix_3 = np.array([
  [1, 1],
  [0, 1]
])

### YOUR CODE BEGINS HERE ###
spaced_vectors = np.array([np.array([np.cos(angle), np.sin(angle)]) for angle in np.linspace(0, 2 * np.pi, 1000)]).transpose()
comb_matrix = np.array([matrix_1, matrix_2, matrix_3])
mul_op = comb_matrix @ spaced_vectors
mul_op = mul_op * mul_op
answer = np.array([np.sqrt(np.max(mul_op[i, 0, :] + mul_op[i, 1, :])) for i in range(3)])
### YOUR CODE ENDS HERE ###

print(answer.round(4))
assert np.isclose(answer, [1, 2, (1+np.sqrt(5))/2]).all()

[1.    2.    1.618]


## Other NumPy functionalities
The NumPy package implements a plethora of mathematical functionalities not covered in this tutorial. The NumPy User Guide and API Reference are great starting points for discovering them:
- https://numpy.org/doc/stable/user/
- https://numpy.org/doc/stable/reference/


# What is pandas?

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

It is meant for manipulating datasets in tabular format, that is 2D arrays with rows representing observations and columns representing variables, the so called **DataFrames**.

from: https://pandas.pydata.org/

### Creating dataframes

In the most basic form you just specify a 2D array of values

In [77]:
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 16, 1).reshape(4, 4)
)

df

|   | 0  | 1  | 2  | 3  |
|---|----|----|----|----|
| 0 | 0  | 1  | 2  | 3  |
| 1 | 4  | 5  | 6  | 7  |
| 2 | 8  | 9  | 10 | 11 |
| 3 | 12 | 13 | 14 | 15 |


It is possible to additionally specify columns and index labels

In [None]:
df = pd.DataFrame(
    np.arange(0, 16, 1).reshape(4, 4),
    columns = ['column_1', 'column_2', 'column_3', 'column_4'],
    index   = ['a', 'b', 'c', 'd']
)

df

as well as type of entries

In [None]:
df = pd.DataFrame(
    np.arange(0, 16, 1).reshape(4, 4),
    columns = ['column_1', 'column_2', 'column_3', 'column_4'],
    index   = ['a', 'b', 'c', 'd'],
    dtype   = float
)

df

In [None]:
df.astype(int)

It can be done in an alternative manner

In [None]:
df = pd.DataFrame(
    {
      'column_1': [0,  4,  8, 12],
      'column_2': [1,  5,  9, 13],
      'column_3': [2,  6, 10, 14],
      'column_4': [3,  7, 11, 15]
    },
    index   = ['a', 'b', 'c', 'd']
)

df

It is worth mentioning that the underlying data can be accessed through the `values` attribute

In [None]:
df.values

Importantly, Pandas DataFrames can handle entries of different types. In the most common situation different columns will have different types.

In [None]:
df = pd.DataFrame(
    {
      'integers': [0,   1,   2,  3],
      'floats':   [1.0, 2.0, 3,  4],
      'booleans': [True, False, True, False],
      'arbitrary objects':  [0, 1.0, 'abc', False]
    }
)

df

In [None]:
df.dtypes

Other available creation methods are listed below
- `pd.DataFrame.from_dict`
- `pd.DataFrame.from_records`

You can read about them here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

### Loading data
Let us peek into the CSV file conveniently provided in the Google Colab runtime

In [None]:
!head './sample_data/california_housing_train.csv'

*and* load it using the `pd.read_csv` function.

In [None]:
pd.read_csv('./sample_data/california_housing_train.csv')

In Pandas it is also possible to load JSON files.

In [None]:
!head './sample_data/anscombe.json'

In [None]:
pd.read_json('./sample_data/anscombe.json')

## Other I/O functionalities
They are listed here https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html and are suitable for: XML, HDF, Feather, Parquet files as well as for interaction with various databases and much more.

## Handling missing data and duplicates

In the real world it's often the case that there are missing or duplicate values in the dataset. They can be easily dealt with in Pandas.

In [None]:
df = pd.DataFrame([
  [0,       1     ],
  [2,       0     ],
  [2,       np.nan],
  [np.nan,  np.nan]
])

df

Incomplete observations can be filtered out.

In [None]:
df.dropna()

In [None]:
df.dropna(how='all')

It is also possible to "fill the gaps".

In [None]:
df.fillna(0)

In [None]:
df.fillna(0).drop_duplicates()

In [None]:
df.fillna({0: 0, 1: 1})
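
Before dropping or filling anything, it is often useful to see how much data is actually missing. A minimal sketch using `isna` on the same small dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [0, 1],
    [2, 0],
    [2, np.nan],
    [np.nan, np.nan],
])

# isna marks missing entries; summing the boolean mask counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column.tolist())  # [1, 2]

# Rows in which every entry is missing.
print(df[df.isna().all(axis=1)].index.tolist())  # [3]
```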

## Data access

Let us observe how we can access and manipulate DataFrame structure.

In [None]:
df = pd.read_csv('./sample_data/california_housing_train.csv')
df

### describe & info
Once the dataframe is loaded, the `describe` and `info` methods provide an easy way to obtain a high-level view of the underlying data.

In [None]:
df.describe()

In [None]:
df.info()

### head & tail
we can use these methods to look at the first and last entries

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.head(8)

### rename
can be used to rename column labels.

In [None]:
df.rename(columns={'longitude': 'lon', 'latitude': 'lat'})

### Accessing a subset of data

To select multiple columns specify list of labels within square brackets.

In [None]:
df[['longitude', 'latitude']]

A single column can be retrieved by specifying its name within square brackets. Note that this will yield a `Series` object, which behaves a little differently than a DataFrame, but in general is a representation of the values of a single variable.

In [None]:
df['longitude']

If a column name does not contain spaces or other special characters, the column can be accessed as an attribute.

In [None]:
df.longitude

A subset of a DataFrame can be selected using the `loc` attribute, which takes index and column labels. Note that it supports slicing with respect to column names. Also, as the [documentation](https://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#different-choices-for-indexing) says, *note that contrary to usual python slices, both the start and the stop are included!*

In [None]:
df.loc[5:10, 'population': 'median_house_value']

It is also possible to select data with slices corresponding to index and column **positions** (as in NumPy).

In [None]:
df.iloc[5:10, -4:]
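
The difference between label-based and position-based slicing is easiest to see when the index does not start at zero. A small sketch on a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame(
    {'x': [10, 20, 30, 40], 'y': [1, 2, 3, 4]},
    index=[2, 3, 4, 5],
)

# loc slices by label and includes both endpoints.
print(df.loc[2:4, 'x'].tolist())  # [10, 20, 30]

# iloc slices by position and, as in NumPy, excludes the stop.
print(df.iloc[2:4, 0].tolist())   # [30, 40]
```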

Boolean arrays (including boolean `Series` object) can be used as well.

In [None]:
index = df.median_house_value < 25000
index

In [None]:
df.loc[index, 'median_house_value']

It is also possible to do

In [None]:
df[index]

## Merging dataframes
can be used to combine multiple dataframes into one.


In [None]:
df_1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'], index=[0, 1, 2])
df_2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['b', 'c', 'd'], index=[1, 2, 3])

In [None]:
df_1

In [None]:
df_2

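There are two basic ways of merging dataframes. The first one is `pd.concat`, which by default concatenates dataframes along the index (the older `DataFrame.append` method did the same, but it was deprecated in pandas 1.4 and removed in pandas 2.0).

In [None]:
pd.concat([df_1, df_2])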

The second one is `join`, which concatenates dataframes along the columns.

In [None]:
df_1.join(df_2, how='outer', lsuffix='_left', rsuffix='_right')

Both can be seen as special cases of the more general `concat` function, which allows more advanced operations. There is also the `merge` method, which performs database-like joins.

All of these methods are described and explained here: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
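
As a rough illustration of `concat` and `merge`, the sketch below uses two small made-up dataframes that share the columns `b` and `c`; the values are chosen so that two rows match.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
df_2 = pd.DataFrame({'b': [1, 4, 10], 'c': [2, 5, 11], 'd': [100, 200, 300]})

# concat along the index: columns are aligned, gaps are filled with NaN.
stacked = pd.concat([df_1, df_2])
print(stacked.shape)  # (6, 4)

# merge matches rows on the shared columns 'b' and 'c' (inner join by default).
merged = pd.merge(df_1, df_2, on=['b', 'c'])
print(merged.shape)   # (2, 4)
```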

## Operations on dataframes
Pandas can be used to perform mathematical and other operations on the underlying data.

### Elementwise operations
The most common use case is to create a new column as the result of some operation on existing columns.

In [None]:
people_per_household = df.population / df.households
people_per_household

In [None]:
df['people_per_household'] = people_per_household
df

More general functions can be applied with the `apply` method; with `axis=1` the function receives one row at a time.

In [None]:
def func(row):
  if row.latitude > 0:
    ns_hemisphere = 'N'
  else:
    ns_hemisphere = 'S'

  if row.longitude > 0:
    ew_hemisphere = 'E'
  else:
    ew_hemisphere = 'W'

  gps_label = f"{abs(row.latitude)} {ns_hemisphere}, {abs(row.longitude)} {ew_hemisphere}"

  return gps_label

df.apply(func, axis=1).head()
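Row-wise `apply` calls a Python function once per row, which can be slow on large frames. When the logic is a simple condition, a column-wise alternative such as `np.where` is often sufficient. The sketch below is one possible way to compute just the hemisphere letters; the `toy` frame and the new column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates (one northern/western, one southern/eastern point).
toy = pd.DataFrame({'latitude': [37.88, -33.87],
                    'longitude': [-122.23, 151.21]})

# np.where(condition, value_if_true, value_if_false) works on whole columns.
toy['ns_hemisphere'] = np.where(toy.latitude > 0, 'N', 'S')
toy['ew_hemisphere'] = np.where(toy.longitude > 0, 'E', 'W')

print(toy)
```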

### GroupBy
The `groupby` method can be used to split a dataframe into multiple groups based on some condition or values.

In [None]:
values = df.median_house_value.round(-4)
values


In [None]:
gb = df.groupby(values)
gb

Group labels and the corresponding row indices are available under the `groups` attribute, and the groups themselves can be retrieved with the `get_group` method.

In [None]:
gb.groups

In [None]:
gb.get_group(10000)

The main use case of the `groupby` functionality is to split a dataframe based on some condition, perform some aggregation within groups and then join the results. An example of such an operation is calculating the mean within each group.

In [None]:
gb.mean().head()
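The split, aggregate and join steps are easier to follow on a tiny frame where the result can be checked by hand. The sketch below uses a made-up `toy` frame (an assumption) and also shows named aggregation, which lets each output column use a different aggregate.

```python
import pandas as pd

# Hypothetical data: two groups, A and B.
toy = pd.DataFrame({'city': ['A', 'A', 'B', 'B', 'B'],
                    'price': [100, 300, 50, 150, 100],
                    'rooms': [2, 4, 1, 3, 2]})

gb = toy.groupby('city')

# One aggregate applied to every remaining column.
means = gb.mean()

# Named aggregation: output_name=(input_column, aggregate).
summary = gb.agg(mean_price=('price', 'mean'),
                 max_rooms=('rooms', 'max'),
                 n_rows=('price', 'count'))

print(means)
print(summary)
```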

You can read more about the `groupby` functionality in the pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

## Pandas Exercises

### Pandas exercise: IMDB movies

In this exercise we will look into the IMDB Top 1000 movies dataset. We will first load it into a dataframe.


In [4]:
url = "https://raw.githubusercontent.com/peetck/IMDB-Top1000-Movies/master/IMDB-Movie-Data.csv"
df = pd.read_csv(url)
df

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40
...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,0.00,45
996,997,Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,0.00,22


Part 1.

1. How much must a movie earn to be among the top 2% highest-earning movies?
2. Select movies that earned more than this value.
3. Calculate their mean rating.

In [5]:
### YOUR CODE BEGINS HERE ###
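# Revenue threshold for the top 2% of movies.
threshold = df['Revenue (Millions)'].quantile(.98)
# Mean rating of the movies earning more than the threshold.
answer = df.loc[df['Revenue (Millions)'] > threshold, 'Rating'].mean()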
### YOUR CODE ENDS HERE ###

In [6]:
# Your code in cell(s) above
assert np.isclose(answer, 7.615)
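The role of `quantile` in Part 1 can be checked on a series where the answer is easy to verify by hand. The sketch below uses made-up revenues 1, 2, ..., 100 (an assumption, unrelated to the IMDB data); with the default linear interpolation, exactly two values lie above the 0.98 quantile.

```python
import numpy as np
import pandas as pd

# Hypothetical revenues: 1, 2, ..., 100.
revenue = pd.Series(np.arange(1, 101), dtype=float)

# Value below which 98% of the observations fall (linear interpolation by default).
threshold = revenue.quantile(0.98)

# The top 2% are the values strictly above the threshold.
top = revenue[revenue > threshold]

print(threshold, top.tolist())
```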

Part 2.
1. Calculate mean metascore and mean revenue per director. (use groupby)
2. Restrict the results to directors who directed more than four movies. (look up the `value_counts` method)
3. Sort the results with respect to the metascore and display the ranking as a dataframe.
4. Who is the 3rd director in this ranking and what is his/her mean revenue?
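The `value_counts` hint can be illustrated on a toy column. The sketch below is one possible pattern, using a made-up `toy` frame and a threshold of one movie (both assumptions): count occurrences, keep the labels above the threshold, then filter with `isin` before grouping.

```python
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({'Director': ['X', 'Y', 'X', 'Z', 'X', 'Y'],
                    'Metascore': [70, 55, 80, 90, 60, 65]})

# Number of rows per director, sorted from most to least frequent.
counts = toy['Director'].value_counts()

# Labels of directors with more than one movie.
frequent = counts[counts > 1].index

# Keep only their movies, then aggregate per director.
filtered = toy[toy['Director'].isin(frequent)]
per_director = filtered.groupby('Director')['Metascore'].mean()

print(counts)
print(per_director)
```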

In [9]:
### YOUR CODE BEGINS HERE ###
print(df.columns)
# Mean metascore and revenue per director.
means = df.groupby('Director')[['Metascore', 'Revenue (Millions)']].mean()
print(means)
# Keep only directors who directed more than four movies.
counts = df['Director'].value_counts()
frequent = counts[counts > 4].index
restricted_means = df[df['Director'].isin(frequent)].groupby('Director')[['Metascore', 'Revenue (Millions)', 'Rating']].mean()
# Rank the remaining directors by mean metascore, best first.
ranking = restricted_means.sort_values(by='Metascore', ascending=False)
print(ranking)
# Third director in the ranking and his/her mean revenue.
director = ranking.index[2]
revenue = df[df['Director'] == director]['Revenue (Millions)'].mean()
### YOUR CODE ENDS HERE ###

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')
                     Metascore  Revenue (Millions)
Director                                          
Aamir Khan                42.0               1.200
Abdellatif Kechiche       88.0               2.200
Adam Leon                 77.0               0.000
Adam McKay                65.5             109.535
Adam Shankman             64.0              78.665
...                        ...                 ...
Xavier Dolan              61.0               1.745
Yimou Zhang               42.0              45.130
Yorgos Lanthimos          77.5               4.405
Zack Snyder               48.0             195.148
Zackary Adler             90.0               6.530

[644 rows x 2 columns]
                    Metascore  Revenue (Millions)    Rating
Director                                                   
Dav

In [10]:
print(f"Mean revenue of {director} is {revenue:.3f}.")
assert np.isclose(revenue, 43.242)

Mean revenue of Denis Villeneuve is 43.242.


### Pandas exercises: Anscombe's quartet

In this exercise we will look into the so-called Anscombe's quartet, which illustrates the pitfalls of relying on summary statistics alone to describe datasets.

![image](https://upload.wikimedia.org/wikipedia/commons/e/ec/Anscombe%27s_quartet_3.svg)

Firstly, let us load the data.

In [79]:
df = pd.read_json("./sample_data/anscombe.json")
df

Unnamed: 0,Series,X,Y
0,I,10,8.04
1,I,8,6.95
2,I,13,7.58
3,I,9,8.81
4,I,11,8.33
5,I,14,9.96
6,I,6,7.24
7,I,4,4.26
8,I,12,10.84
9,I,7,4.81


For each series calculate its
- mean value of $X$
- mean value of $Y$
- sample variance of $X$
- sample variance of $Y$
- covariance between $X$ and $Y$

and verify whether for each series you get
- mean of $X$ = 9,
- var of $X$ = 11,
- mean of $Y$ = 7.5 $\pm$ 0.01,
- var of $Y$ = 4.125 $\pm$ 0.01,
- cov of $X$ and $Y$ = 5.50 $\pm$ 0.01,

(you can try to make use of `aggregate` function of a `GroupBy` object)
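For reference, the quantities above can be written out as follows, assuming the usual unbiased convention with $n - 1$ in the denominator (this is also the default of the pandas `var` and `cov` methods), where $n$ is the number of points in a series:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
\operatorname{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).
$$

The mean and variance of $Y$ are defined analogously, with $y_i$ and $\bar{y}$ in place of $x_i$ and $\bar{x}$.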

In [80]:
### YOUR CODE BEGINS HERE ###
# Per-series mean and sample variance (ddof=1) of X and Y.
stats = df.groupby('Series')[['X', 'Y']].aggregate(['mean', 'var'])
# Per-series covariance matrices; the (X, Y) entry is the covariance we need.
cov = df.groupby('Series').cov()
covs = np.array([cov.loc[(s, 'X'), 'Y'] for s in ['I', 'II', 'III', 'IV']])
print(stats)
print(cov)
# The task states absolute tolerances of 0.01, so use atol rather than rtol.
assert np.isclose(stats[('X', 'mean')], 9.0).all()
assert np.isclose(stats[('Y', 'mean')], 7.5, atol=0.01).all()
assert np.isclose(stats[('X', 'var')], 11.0).all()
assert np.isclose(stats[('Y', 'var')], 4.125, atol=0.01).all()
assert np.isclose(covs, 5.50, atol=0.01).all()
### YOUR CODE ENDS HERE ###

          X               Y          
       mean   var      mean       var
Series                               
I       9.0  11.0  7.500000  4.132640
II      9.0  11.0  7.500909  4.127629
III     9.0  11.0  7.500000  4.122620
IV      9.0  11.0  7.500909  4.123249
               X         Y
Series                    
I      X  11.000  5.503000
       Y   5.503  4.132640
II     X  11.000  5.500000
       Y   5.500  4.127629
III    X  11.000  5.497000
       Y   5.497  4.122620
IV     X  11.000  5.499000
       Y   5.499  4.123249

