# TECH 2 mandatory assignment - Part B
Solution

First, the necessary packages will be imported.

In [12]:
import numpy as np
from part_A import std_builtin, std_loops
import timeit
import pandas as pd
import csv

Next, data will be read from the `data.csv` file into three lists, `small_list`, `medium_list`, and `large_list`, one per column. The columns have different lengths: `large_list` will contain 10,000 entries, `medium_list` 1,000 entries, and `small_list` 100 entries.

In [50]:
large_list, medium_list, small_list = [], [], []

# load data from csv file to lists
with open("data.csv", mode="r") as file:
    csv_reader = csv.reader(file)
    rows = [row for row in csv_reader if row]  # skip empty rows
    for row in rows:
        for value, lst in zip(row[:3], [small_list, medium_list, large_list]):
            value = value.strip()
            if value:
                lst.append(float(value))
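
The loader above relies on shorter columns being padded with empty cells, which it skips. As a minimal sketch of the same pattern on made-up data (the `sample` string and the `read_columns` helper are hypothetical and not part of the assignment), the behaviour can be checked without the real file:

```python
import csv
import io

# Hypothetical miniature of data.csv: three columns of unequal length,
# with the shorter columns padded by empty cells.
sample = "0.1,0.2,0.3\n,0.4,0.5\n,,0.6\n"


def read_columns(text, n_cols=3):
    """Collect the non-empty cells of each of the first n_cols columns."""
    columns = [[] for _ in range(n_cols)]
    for row in csv.reader(io.StringIO(text)):
        for value, column in zip(row[:n_cols], columns):
            value = value.strip()
            if value:  # skip the padding cells
                column.append(float(value))
    return columns


short, mid, long_ = read_columns(sample)
print(len(short), len(mid), len(long_))
```

Each column ends up with only its own values, regardless of how many rows the longest column needs.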

The next step is to confirm that the file was read successfully by examining the data's length and its first elements.

In [52]:
assert (
    len(large_list) == 10000
), "Large list does not contain 10 000 elements but {}".format(len(large_list))
assert (
    len(medium_list) == 1000
), "Medium list does not contain 1 000 elements but {}".format(len(medium_list))
assert (
    len(small_list) == 100
), "Small list does not contain 100 elements but {}".format(len(small_list))

In [53]:
for lst in [small_list, medium_list, large_list]:
    assert [round(val, 2) for val in lst[:3]] == [0.68, 0.05, 0.22]

Next, the standard deviation will be calculated using NumPy, built-in functions, and loops on the previously created lists. The results will be stored in a dictionary keyed by list size, where each entry holds one result per function. When displayed as a table, the list sizes correspond to the columns and the different functions correspond to the rows.

In [16]:
# set up for matrix
functions = [np.std, std_builtin, std_loops]
function_names = ["Numpy std", "Builtin std", "Loops std"]
data_samples = {"small": small_list, "medium": medium_list, "large": large_list}

# matrix
std_result_matrix = {
    size: [func(data) for func in functions] for size, data in data_samples.items()
}

The computed standard deviations for each list size and each function will then be displayed.

In [64]:
df = pd.DataFrame(std_result_matrix, index=function_names)
print(df)

                small    medium     large
Numpy std    0.282372  0.284674  0.285405
Builtin std  0.282372  0.284674  0.285405
Loops std    0.282372  0.284674  0.285405
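
All three functions agree, and since `np.std` defaults to `ddof=0`, the shared value is the population standard deviation. `std_builtin` and `std_loops` come from Part A and are not shown here, so the following is only a guess at what such functions might look like; the names `std_builtin_sketch` and `std_loops_sketch` are hypothetical.

```python
import math


def std_builtin_sketch(values):
    """Population standard deviation using sum() and len() (assumed approach)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def std_loops_sketch(values):
    """Population standard deviation using explicit loops only (assumed approach)."""
    total = 0.0
    count = 0
    for x in values:
        total += x
        count += 1
    mean = total / count

    squared = 0.0
    for x in values:
        squared += (x - mean) ** 2
    return math.sqrt(squared / count)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, variance 4
print(std_builtin_sketch(data), std_loops_sketch(data))
```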


A similar approach will be used to calculate the running times for each element in the matrix above. The results will also be stored in a dictionary. The number of executions is set explicitly, because `timeit.timeit` returns the total time for all executions, not the time for a single one. To obtain the average time per call in seconds, the output is divided by the number of executions.

In [43]:
numb = 1000  # number of executions per measurement, chosen explicitly
time_result_matrix = {
    size: [timeit.timeit(lambda: func(data), number=numb) / numb for func in functions]
    for size, data in data_samples.items()
}
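
A single `timeit.timeit` call yields one total, which can be inflated by whatever else the machine is doing. As a possible refinement (not what the cell above does), `timeit.repeat` returns several totals and the smallest is often taken as the least disturbed measurement. The helper name `per_call_seconds` is hypothetical.

```python
import timeit


def per_call_seconds(func, data, number=1000, repeat=5):
    """Best-of-`repeat` average time per call, in seconds."""
    totals = timeit.repeat(lambda: func(data), number=number, repeat=repeat)
    # Each total covers `number` calls, so divide to get the time per call.
    return min(totals) / number


sample = [float(i) for i in range(100)]
elapsed = per_call_seconds(sum, sample, number=100, repeat=3)
print(f"{elapsed:.3e} s per call")
```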

While exploring ways to present the data effectively, pandas tables were discovered. Let's give them a try.

In [44]:
import pandas as pd

df = pd.DataFrame(time_result_matrix, index=function_names)


df.style.format(precision=7, thousands=",", decimal=".").format_index(
    str.upper, axis=1
).relabel_index(function_names, axis=0).background_gradient(
    subset=pd.IndexSlice[:, :], cmap="YlOrRd"
)  # yellow = fastest, orange = in between, red = slowest

                 SMALL     MEDIUM      LARGE
Numpy std    0.0001439  0.0001637  0.0009916
Builtin std  0.0000759  0.0006716  0.0075335
Loops std    0.0001071  0.0014531  0.0103717


The table above shows, in seconds per call, which function performs best for each list size. For the small list (100 entries), the built-in version is fastest, followed by loops, with NumPy the slowest. For the medium and large lists, NumPy outperforms both: at 10,000 entries it is about 7.6 times faster than the built-in version and about 10.5 times faster than loops.
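
One plausible contributor to NumPy's weaker result on the small list is that `np.std` has to convert a Python list into an array before computing anything, a fixed cost that matters more when there is little data. This was not measured in the assignment; the sketch below, with made-up data, shows how the conversion cost could be separated out by timing a pre-converted array as well.

```python
import timeit

import numpy as np

# Made-up data standing in for small_list.
values = [i / 100 for i in range(100)]
as_array = np.asarray(values)  # convert once, outside the timed call

runs = 1000
from_list = timeit.timeit(lambda: np.std(values), number=runs) / runs
from_array = timeit.timeit(lambda: np.std(as_array), number=runs) / runs

print(f"list input:  {from_list:.3e} s per call")
print(f"array input: {from_array:.3e} s per call")
```

Both calls return the same value; only the time spent getting the data into NumPy differs.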

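Runtimes are also measured using IPython magics. `%timeit`, used for `np.std` and `std_loops`, chooses the number of iterations automatically and reports the mean over several runs. `%time`, used for `std_builtin`, times a single call, so its figures are noisier and not directly comparable to the `%timeit` results.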

In [32]:
%timeit np.std(large_list)
%timeit std_loops(large_list)
%time std_builtin(large_list)

1.46 ms ± 399 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
21.7 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
CPU times: user 5.18 ms, sys: 179 μs, total: 5.36 ms
Wall time: 6.16 ms


0.28540452694761564

In [45]:
%timeit np.std(medium_list)
%timeit std_loops(medium_list)
%time std_builtin(medium_list)

314 μs ± 150 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.85 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
CPU times: user 448 μs, sys: 1e+03 ns, total: 449 μs
Wall time: 453 μs


0.2846744328385061

In [46]:
%timeit np.std(small_list)
%timeit std_loops(small_list)
%time std_builtin(small_list)

86.3 μs ± 24.5 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
170 μs ± 30 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
CPU times: user 186 μs, sys: 5 μs, total: 191 μs
Wall time: 213 μs


0.2823721097353601
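
Outside IPython, a rough analogue of the automatic loop count used by `%timeit` is `timeit.Timer.autorange()`, which increases the number of calls until the total run takes at least 0.2 seconds. The following is a minimal sketch with made-up data; the helper name `auto_time` is hypothetical.

```python
import timeit


def auto_time(func, data):
    """Let timeit pick the loop count, then return (loops, seconds per call)."""
    timer = timeit.Timer(lambda: func(data))
    loops, total = timer.autorange()
    return loops, total / loops


loops, per_call = auto_time(sum, list(range(1000)))
print(f"{loops} loops, {per_call * 1e6:.2f} µs per loop")
```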