# LAB05: Benchmarking Python Data Structures - Collections module

### **Goals:** 
- #### Implement functions using the collections module
- #### Measure performance and analyze time complexity
- #### Compare the efficiency of `list` vs `deque` and `dict` vs `Counter`

---

## 🎯 Learning Objectives

By the end of this lab, you should be able to:
- Use Python’s `time` and `timeit` modules to measure runtime performance.
- Compare the efficiency of `list`, `deque`, `dict`, and `Counter`.
- Explain observed performance differences using **Big O notation**.

---


In [None]:
# Just run this cell to import the libraries used in this notebook
import random
from collections import Counter, deque
import timeit
import time
import matplotlib.pyplot as plt

## Background

Python’s built-in `list` and `dict` types are general-purpose containers, but the `collections` module provides optimized data structures for specific use cases.

In this notebook, you will:
1. Benchmark `list` vs `deque` for queue-like operations.
2. Benchmark `dict` vs `Counter` for counting tasks.
3. Analyze and explain your results.

---

> Recall:
> - `list.pop(0)` is **O(n)** because every remaining element must shift one position to the left.
> - `deque.popleft()` is **O(1)** because a deque is designed for fast appends and pops at both ends.
> - `Counter` is a subclass of `dict` optimized for counting frequencies.
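
The points above can be seen in a minimal sketch. The sample values and the variable names (`readings`, `as_list`, `as_deque`, `recent`) are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical sample values, used only for illustration.
readings = [12, 15, 11, 14]

as_list = list(readings)
as_deque = deque(readings)

# Both calls remove and return the oldest (leftmost) item.
first_from_list = as_list.pop(0)        # O(n): remaining items shift left
first_from_deque = as_deque.popleft()   # O(1): no shifting needed

print(first_from_list, first_from_deque)  # 12 12
print(as_list, list(as_deque))            # [15, 11, 14] [15, 11, 14]

# A deque created with maxlen discards the oldest item automatically
# when a new item is appended to a full deque.
recent = deque(maxlen=3)
for value in readings:
    recent.append(value)
print(list(recent))  # [15, 11, 14]
```

On four items the cost difference is invisible; it only shows up when the container is large and the operation is repeated many times, which is what the benchmarks in this lab measure.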


### Part 1: Benchmarking list vs deque - *Smoothing noisy sensor data*
Imagine working as an environmental engineer and tracking real-time air quality. Unfortunately, the sensor readings seem to have quite a bit of noise when collecting PM2.5 levels every second at one of the stations you have to monitor.
To smooth out the noise, one technique you can use is computing a rolling average of the readings. You want to do this efficiently. Let's see how a deque compares to a list!
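
To make the idea concrete, here is a minimal sketch of a rolling average computed with plain slicing. It assumes one common convention that the lab text does not specify: an average is produced only once a full window is available. The name `rolling_avg_reference` and the sample values are hypothetical. It is deliberately simple and is not the list-window or deque-window implementation that Question 1 asks for.

```python
def rolling_avg_reference(data, window_size=3):
    """Average every full window of `window_size` consecutive values."""
    averages = []
    for start in range(len(data) - window_size + 1):
        window = data[start:start + window_size]   # copies window_size items
        averages.append(sum(window) / window_size)
    return averages


sample = [10, 20, 30, 40, 50]
print(rolling_avg_reference(sample))  # [20.0, 30.0, 40.0]
```

A slow but obviously correct reference like this can be handy as an oracle when testing faster versions, provided they follow the same full-window convention.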

Run the code cell below. It simulates 1 million fictional air quality readings by randomly generating sample values and storing them in a list.


In [None]:
# Simulate 1 million air quality readings
pm25_data = [random.randint(5, 100) for i in range(1000000)]
# pm25_data # Uncomment this line if you would like to see what the data is like

**Question 1:** Write two Python functions in the code cells below. Each function should take **2 input arguments**, (1) a list **data** and (2) a **window_size** (default 10), and **return** a list of rolling averages.
- function **rolling_avg_list** should implement rolling window averaging where the rolling window is a *general purpose* list
- function **rolling_avg_deque** should implement rolling window averaging where the rolling window is a deque from the collections library (if you need an example of how to use a deque, review the reading on [Python's deque](https://realpython.com/python-deque/))

In [None]:
def rolling_avg_list(data, window_size=10):
    #TODO: implement rolling window averaging where the window is a list
    averages = []
    window = ...
    ...
    ...
    return averages

In [None]:
def rolling_avg_deque(data, window_size=10):
    #TODO: implement rolling window averaging where the window is a deque
    averages = []
    window = ...
    ...
    ...
    return averages

**Question 2:** Replace the TODO comments below and add some code to test both functions using simple asserts.
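
As a rough illustration of the assert pattern, the sketch below tests a hypothetical helper named `mean` (it is not part of the lab). It also shows `math.isclose`, which may be useful because two correct implementations can produce floats that differ in the last digits.

```python
import math


def mean(values):
    """Hypothetical helper used only to demonstrate the test pattern."""
    return sum(values) / len(values)


# Exact comparison is fine when the expected result is exactly representable.
assert mean([2, 4, 6]) == 4.0

# For results that may differ by floating-point rounding, compare with a tolerance.
assert math.isclose(mean([0.1, 0.2, 0.3]), 0.2)

# Two lists of floats can be compared element by element.
expected = [1.5, 2.5]
actual = [mean([1, 2]), mean([2, 3])]
assert len(actual) == len(expected)
assert all(math.isclose(a, e) for a, e in zip(actual, expected))
print("all checks passed")
```

An `assert` that passes prints nothing, so the final `print` is only there to confirm the cell ran to the end.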

In [None]:
#TODO: create a small test list to represent input data
small_list = ...
#TODO: add test for rolling_avg_list
...
#TODO: add test for rolling_avg_deque
...
#TODO: add a test to make sure both functions return the same thing
...

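**Question 3:** It's time to benchmark! Complete the code below to see how both functions perform. Use the timeit library (for example, `timeit.default_timer()` reads the clock, so you can call it before and after each function call).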
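
The start/end timing pattern can be sketched as follows. The function `busy_work` and its argument are hypothetical stand-ins for whatever you want to time; `timeit.default_timer()` is the clock that the `timeit` module itself uses.

```python
import timeit


def busy_work(n):
    """Hypothetical workload: sum the integers from 0 to n - 1."""
    return sum(range(n))


start = timeit.default_timer()   # read the clock before the call
result = busy_work(100_000)
end = timeit.default_timer()     # read it again afterwards

elapsed = end - start
print(f"busy_work took {elapsed:.6f} seconds")
```

A single measurement like this is noisy, so expect the number to vary somewhat from run to run.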

In [None]:
# Part 1: Benchmarking list vs deque 
pm25_data = [random.randint(5, 100) for i in range(1000000)]
start = ...
#TODO: call rolling_avg_list with the air quality data as its input parameter, use the default window_size of 10
end = ...

list_time = ...
print(f"List version time: {list_time:.3f} seconds")

start = ...
#TODO: call rolling_avg_deque with the air quality data as its input parameter, use the default window_size of 10
end = ...

deque_time = ...
print(f"Deque version time: {deque_time:.3f} seconds")


**Question 4:** Discussion:

1. Which method was faster and by how much?
2. Why might it be faster to use the deque structure here?
3. What is the Big O complexity for:
   - `list.append()`  
   - `deque.append()`  
   - `list.pop(0)`  
   - `deque.popleft()`  


*Replace this text with your answer*

### Part 2: Benchmarking Manual Dictionary vs Counter - *Daily e-commerce sales count*
Imagine you’re working with e-commerce data and you have a list of all products sold in a day. You want to know which items sold the most.
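
For a first feel of how `Counter` answers a "which items appear most" question, here is a small sketch. The `colors` list is hypothetical sample data, not the sales data used in this lab.

```python
from collections import Counter

# Hypothetical sample data, used only for illustration.
colors = ["red", "blue", "red", "green", "blue", "red"]

counts = Counter(colors)          # counts each distinct item in one pass
print(counts["red"])              # 3
print(counts["purple"])           # 0 (missing keys count as zero, no KeyError)
print(counts.most_common(2))      # [('red', 3), ('blue', 2)]
print(dict(counts) == {"red": 3, "blue": 2, "green": 1})  # True
```

Because `Counter` is a subclass of `dict`, the result can be used anywhere a dictionary of counts is expected.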

Run the code cell below. It simulates what this e-commerce data could look like by randomly generating sample sales and storing them in a list.

In [None]:
# Generate sample sales data (simulate real-world e-commerce transactions)
products = ["laptop", "phone", "tablet", "headphones", "camera", "monitor"]
sales_data = [random.choice(products) for _ in range(1000000)]  # 1 million sales
# sales_data # Uncomment this line if you would like to see what the data looks like

**Question 1:** Write two Python functions in the code cells below. Each function should take a list **data** as an input argument and return a dictionary where the keys are the items in the list and the values are the counts for each item.
- function **count_with_counter** should use the Counter from the collections library (if you need an example of how to use a Counter, review the reading on [Python's Counter](https://realpython.com/python-counter/))
- function **count_with_dict** should manually create the dictionary, only using the *general purpose* dictionary container

In [None]:
def count_with_counter(data):
    ...
    return ...

In [None]:
def count_with_dict(data):
    ...
    return ...

**Question 2:** Replace the TODO comments below and add some code to test both functions using simple asserts.

In [None]:
#TODO: create a small test list to represent input data
small_list = ...
#TODO: add test for count_with_counter
...
#TODO: add test for count_with_dict
...
#TODO: add a test to make sure both functions return the same thing
...

**Question 3:** It's time to benchmark! Complete the code below to see how both functions perform. Use the timeit library.
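
As a hedged sketch of how `timeit.Timer` is typically used: it takes the statement to time and a setup string, both as source code. The names `values`, `sort_timer`, and `total`, and the sorting workload, are hypothetical and unrelated to the lab's functions.

```python
import timeit

# The setup string runs once, before timing starts.
setup = "values = list(range(1000, 0, -1))"

# The statement string is what gets timed.
sort_timer = timeit.Timer(stmt="sorted(values)", setup=setup)

# Run the statement 100 times and report the total elapsed seconds.
total = sort_timer.timeit(number=100)
print(f"100 sorts took {total:.5f} seconds")
```

Note that `timeit(number=N)` returns the total time for all N runs, not the average; divide by N if you want the time per run.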

In [None]:
# Part 2: Benchmarking Manual Dictionary vs Counter

from collections import Counter
import timeit
# Set up to use with timeit.Timer()
setup_count = "from __main__ import count_with_counter, count_with_dict, sales_data"
# Counter counting
counter_count = ...

# Manual dictionary counting
manual_dict = ...


print("=== Dict vs Counter Benchmark ===")
counter_time = counter_count.timeit(number = 10)
manual_dict_time = manual_dict.timeit(number = 10)
print(f"Counter counting:     {counter_time:.5f} seconds")
print(f"Manual dict counting: {manual_dict_time:.5f} seconds")


**Question 4:** Discussion:
1. Which method was faster and by how much?  
2. Why might `Counter` be more efficient internally than a manual dictionary loop?  

*Replace this text with your answer*

## Let's visualize using the matplotlib library!
When comparing results, we often use visuals. The matplotlib library is a great Python library for this.
The code below creates a bar chart comparing the performance results using `matplotlib`.

Just run this code, then observe and discuss the chart with a peer, a tutor, or your instructor.


In [None]:
# Visualizing Results: just run this code, observe and discuss with a peer, a tutor or your instructor

import matplotlib.pyplot as plt

labels = ['rolling_avg_list', 'rolling_avg_deque', 'count_with_dict', 'count_with_counter']
times = [list_time, deque_time, manual_dict_time, counter_time]

plt.figure(figsize=(8,5))
plt.barh(labels, times)
plt.xlabel("Time (seconds)")
plt.title("Benchmark Comparison: list/deque and dict/Counter")
plt.show()


## Reflection

**Question:** Write a short summary of what you learned (~ 5-8 sentences or bullet points) that includes:

- Key differences between `list` and `deque` for queue operations.  
- Key differences between `dict` and `Counter` for counting.  
- How these results align with theoretical **Big O** complexity.  
- One real-world example (other than the ones used in this lab) where you’d prefer `deque` or `Counter`.
- What part of this lab was most challenging and why?


*Replace this text with your answer*


### Congratulations!
You’ve benchmarked and analyzed some of Python’s general-purpose and specialized data structures, a key skill for writing efficient code.