# Assignment 9.2

> Replace all TODOs with your code. Do not change any other code.

In [None]:
# Do not edit this cell

from typing import List

## Descriptive statistics

In this assignment, we will write functions that calculate basic statistics from scratch, without using numpy.

### Task 1

Let's start simple: write a function `mean` that calculates the average of a list.

$$\mu = \frac{{\sum_{i=1}^n x_i}}{{n}}$$

In [1]:
from typing import List

def mean(li: List[float]) -> float:
    return sum(li) / len(li) if li else 0.0


assert mean([1., 2., 3.]) == 2.
assert mean([1., 1., 2., 0.]) == 1.

### Task 2

Now let's calculate variance (dispersion). You may use the `mean` function implemented before.

$$V = \frac{{\sum_{i=1}^n (x_i - \mu)^2}}{{n}}$$

In [2]:
def variance(li: List[float]) -> float:
    if not li:
        return 0.0
    m = mean(li)
    return sum((x - m) ** 2 for x in li) / len(li)


assert variance([1., 1., 1.]) == 0.
assert variance([1., 2., 3., 4.]) == 1.25
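The formula above divides by $n$, so it is the population variance, which is also what `np.var` computes by default (`ddof=0`). Here is a small worked sketch for the list `[1., 2., 3., 4.]` from the second assert; the variable names are illustrative and not part of the assignment:

```python
data = [1., 2., 3., 4.]

# Step 1: the mean of the data
mu = sum(data) / len(data)                      # 2.5

# Step 2: squared deviations from the mean
squared_devs = [(x - mu) ** 2 for x in data]    # [2.25, 0.25, 0.25, 2.25]

# Step 3: average the squared deviations (divide by n, not n - 1)
population_var = sum(squared_devs) / len(data)  # 1.25

# For comparison only: the sample variance divides by n - 1 instead
sample_var = sum(squared_devs) / (len(data) - 1)

print(mu, squared_devs, population_var, sample_var)
```

The sample variance is larger than the population variance for the same data, so it matters which of the two a library function returns when you compare results.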

### Task 3

The standard deviation is easy once you get the variance:

$$\sigma = \sqrt{V}$$

In [3]:
import math

def std(li: List[float]) -> float:
    return math.sqrt(variance(li))


assert std([1., 1., 1.]) == 0.
assert std([1., 2., 3., 4.]) == 1.25**0.5

### Task 4

**Median**

The median is the middle value in a sorted dataset. If the dataset has an odd number of values, the median is the value at the center. If the dataset has an even number of values, the median is the average of the two middle values.

In [4]:
def median(li: List[float]) -> float:
    n = len(li)
    if n == 0:
        return 0.0
    sorted_li = sorted(li)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_li[mid - 1] + sorted_li[mid]) / 2
    else:
        return sorted_li[mid]


assert median([1., 1., 1.]) == 1.
assert median([1., 4., 3., 2.]) == 2.5
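As an extra sanity check beyond the asserts, the four functions can be compared with the standard library's `statistics` module on random data. This is only a sketch: the functions are repeated in compact form so the block runs on its own, and the seed, sample size, and tolerance are arbitrary assumptions.

```python
import math
import random
import statistics
from typing import List

# Compact copies of the functions from Tasks 1-4, so this block is self-contained
def mean(li: List[float]) -> float:
    return sum(li) / len(li) if li else 0.0

def variance(li: List[float]) -> float:
    if not li:
        return 0.0
    m = mean(li)
    return sum((x - m) ** 2 for x in li) / len(li)

def std(li: List[float]) -> float:
    return math.sqrt(variance(li))

def median(li: List[float]) -> float:
    if not li:
        return 0.0
    s = sorted(li)
    mid = len(s) // 2
    return (s[mid - 1] + s[mid]) / 2 if len(s) % 2 == 0 else s[mid]

random.seed(0)  # arbitrary seed, only for reproducibility
sample = [random.random() * 100 for _ in range(1_001)]

# Pair each custom result with its population-statistics counterpart
checks = {
    "mean": (mean(sample), statistics.fmean(sample)),
    "variance": (variance(sample), statistics.pvariance(sample)),
    "std": (std(sample), statistics.pstdev(sample)),
    "median": (median(sample), statistics.median(sample)),
}

for name, (ours, reference) in checks.items():
    # Compare with a tolerance: floating-point sums may differ in the last digits
    assert math.isclose(ours, reference, rel_tol=1e-9), name
    print(f"{name}: ok")
```

Note that `statistics.pvariance` and `statistics.pstdev` are the population versions, matching the formula in Task 2; `statistics.variance` and `statistics.stdev` divide by $n - 1$ and would not match.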

## Measure performance

Besides analyzing theoretical algorithmic complexity, it is sometimes a good idea to compare the runtime of two algorithms empirically, i.e., run the code many times and time it.

Python's standard library has the [timeit](https://docs.python.org/3/library/timeit.html) module, which does exactly that.
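Here is a minimal sketch of the two `timeit` calls that are most relevant here. The statement `sum(data)`, the setup string, and the run counts are hypothetical examples, not part of the assignment:

```python
import timeit

# Hypothetical statement and setup, chosen only for illustration
setup_code = "data = list(range(1_000))"
statement = "sum(data)"
runs = 1_000

# timeit.timeit returns the TOTAL time for `runs` executions, in seconds;
# the setup code runs once per measurement and is not included in the result
total = timeit.timeit(statement, setup=setup_code, number=runs)
per_call = total / runs

# timeit.repeat performs the whole measurement several times and returns a list;
# the minimum is usually the least noisy figure
best = min(timeit.repeat(statement, setup=setup_code, number=runs, repeat=3))

print(f"total: {total:.4f} s, per call: {per_call:.2e} s, best of 3: {best:.4f} s")
```

Because the returned value is a total, divide it by `number` whenever you want the average time of a single call.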

Let's compare the runtime of your implementations and numpy. Use the provided setup code:

In [5]:
import timeit

# Generate data for tests
setup = '''
import random
import numpy as np
from __main__ import mean, variance, std, median

arr = np.random.rand(10_000) * 100
li = [random.random() * 100 for _ in range(10_000)]
'''

# Pass your function to timeit module
funcs = {
    'mean': 'mean(li)',
    'variance': 'variance(li)',
    'std': 'std(li)',
    'median': 'median(li)',
}

for name, func in funcs.items():
    elapsed_time = timeit.timeit(func, setup=setup, number=1000)
    print(f"Elapsed time for {name}: {elapsed_time} seconds")

Elapsed time for mean: 0.050544936000051166 seconds
Elapsed time for variance: 1.5669192299999395 seconds
Elapsed time for std: 1.5867264739999882 seconds
Elapsed time for median: 1.6867828289999807 seconds


### Task 5

Complete the Python statements to compare your functions with numpy. Use `li` for your functions and `arr` for the numpy functions.

In [6]:
stmt_mean_custom = 'mean(li)'
stmt_mean_np = 'np.mean(arr)'

stmt_var_custom = 'variance(li)'
stmt_var_np = 'np.var(arr)'

stmt_std_custom = 'std(li)'
stmt_std_np = 'np.std(arr)'

stmt_median_custom = 'median(li)'
stmt_median_np = 'np.median(arr)'

### Task 6

Measure the average execution time of your statements with the `timeit` module. As your submission, fill out the table with the results (rounded to 2 decimal places).

In [13]:
import timeit
from tabulate import tabulate

# Define your statements
stmts = {
    'mean_custom': stmt_mean_custom,
    'mean_np': stmt_mean_np,
    'var_custom': stmt_var_custom,
    'var_np': stmt_var_np,
    'std_custom': stmt_std_custom,
    'std_np': stmt_std_np,
    'median_custom': stmt_median_custom,
    'median_np': stmt_median_np
}

# Measure execution time for each statement
times = {}
for name, stmt in stmts.items():
    elapsed_time = timeit.timeit(stmt, setup=setup, globals=globals(), number=10000)
    times[name] = round(elapsed_time, 2)

# Prepare data for the table
table = []
for func in ['mean', 'var', 'std', 'median']:
    row = [func, times[func + '_custom'], times[func + '_np']]
    table.append(row)

# Print the table
print("Time per 10000 executions, secs")
print(tabulate(table, headers=['Func', 'Custom', 'Numpy']))

Time per 10000 executions, secs
Func      Custom    Numpy
------  --------  -------
mean        0.51     0.1
var        18.2      0.39
std        16.83     0.43
median     18.13     1.39


Time per 10000 executions, secs

| Func       | Custom | Numpy |
| ---------- | ------ | ----- |
| mean       | 0.51   | 0.10  |
| var        | 18.20  | 0.39  |
| std        | 16.83  | 0.43  |
| median     | 18.13  | 1.39  |
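To make the comparison easier to read, the totals can be converted into per-call times and speedup ratios. The sketch below copies the timings from the printed output of the Task 6 cell; they will differ on another machine, and the names `results` and `speedups` are illustrative:

```python
# Timings copied from the printed output above: total seconds for 10,000 executions
runs = 10_000
results = {
    # name: (custom, numpy)
    "mean": (0.51, 0.10),
    "var": (18.20, 0.39),
    "std": (16.83, 0.43),
    "median": (18.13, 1.39),
}

speedups = {}
for name, (custom, numpy_time) in results.items():
    # Average time of one call to the custom function, in microseconds
    per_call_us = custom / runs * 1e6
    # How many times faster numpy was in this particular measurement
    speedups[name] = custom / numpy_time
    print(f"{name:<6} custom: {per_call_us:8.1f} us/call, "
          f"numpy about {speedups[name]:.0f}x faster")
```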