# <font color="#418FDE" size="6.5" uppercase>**Dicts And Sets**</font>

>Last update: 20260102.
    
By the end of this lecture, you will be able to:
- Explain how hash tables underpin Python dict and set behavior. 
- Analyze the expected time complexity of lookups, insertions, and deletions in dicts and sets. 
- Select appropriate key types and usage patterns to maintain efficient hash-based operations. 


## **1. Hash Table Foundations**

### **1.1. Hashing and Buckets**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master Python Algorithms/Module_03/Lecture_B/image_01_01.jpg?v=1767330609" width="250">



>* Hashing maps each key to a bucket
>* Buckets give fast direct access without scanning

>* Buckets store entries that share a hash
>* Hashing jumps directly to buckets for fast lookups

>* Library catalog numbers mirror hash-based bucket lookup
>* Even key spread keeps operations fast and predictable
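
The point about even key spread can be made concrete with a small sketch. It assumes the simplified `hash(key) % table_size` bucket model (real CPython dicts mix the hash bits further), and the helper `bucket_counts` and the value of `TABLE_SIZE` are hypothetical names chosen for illustration. Integer keys are used because small integers hash to themselves in CPython, which keeps the result repeatable between runs.

```python
from collections import Counter

def bucket_counts(keys, table_size):
    """Count how many keys land in each bucket under hash(key) % table_size."""
    counts = Counter(hash(key) % table_size for key in keys)
    return [counts.get(index, 0) for index in range(table_size)]

TABLE_SIZE = 8  # hypothetical table size for illustration

# Consecutive small integers spread evenly across the buckets.
even_spread = bucket_counts(range(16), TABLE_SIZE)

# Multiples of the table size all share remainder 0, so they pile into one bucket.
skewed_spread = bucket_counts(range(0, 128, 8), TABLE_SIZE)

print("Even spread:  ", even_spread)
print("Skewed spread:", skewed_spread)
```

Both key sets contain sixteen keys, yet the skewed set forces every lookup to search a single crowded bucket, which is the situation good hashing tries to avoid.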



In [None]:
#@title Python Code - Hashing and Buckets

# Demonstrate hashing mapping keys into buckets visually.
# Show how different keys share or use separate buckets.
# Connect dictionary behavior with underlying hash bucket positions.
# No pip install is needed because this script uses only built-in features.

# Define a simple bucketed table using a small size.
table_size = 8
buckets = [[] for _ in range(table_size)]

# Define a helper function computing bucket index from Python hash.
# Note: str hashes are randomized per process, so the layout changes between runs.
def bucket_index(key, size):
    return hash(key) % size

# Insert several keys and show their chosen buckets.
keys = ["alice", "bob", "carol", "dave"]
for key in keys:
    index = bucket_index(key, table_size)
    buckets[index].append(key)

# Print a header explaining upcoming bucket layout.
print("Bucket index and stored keys inside each bucket:")

# Loop through buckets and display their contents clearly.
for index, bucket in enumerate(buckets):
    print(f"Bucket {index}: {bucket}")

# Show that looking up a key jumps directly to its bucket.
lookup_key = "carol"
lookup_index = bucket_index(lookup_key, table_size)
print(f"Key '{lookup_key}' goes directly to bucket {lookup_index}.")




### **1.2. Handling Collisions**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master Python Algorithms/Module_03/Lecture_B/image_01_02.jpg?v=1767330624" width="250">



>* Collisions happen when different keys share buckets
>* Good collision handling keeps dict and set fast

>* Two main collision strategies: chaining and probing
>* Open addressing probes new buckets following a repeatable pattern

>* Too many collisions create slow clustered lookups
>* Python limits clustering with probing and resizing
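
Chaining, the other collision strategy named above, can be sketched as follows. This is a toy illustration rather than how CPython works (CPython dicts use open addressing); the class `ChainedTable` and its method names are hypothetical. Integer keys are chosen so that the collisions are repeatable.

```python
class ChainedTable:
    """Toy hash table that resolves collisions by chaining (a list per bucket)."""

    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for position, (stored_key, _) in enumerate(bucket):
            if stored_key == key:
                bucket[position] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # a new key joins the chain

    def get(self, key, default=None):
        for stored_key, value in self._bucket(key):
            if stored_key == key:
                return value
        return default

table = ChainedTable(size=4)
# Integer keys 1, 5, and 9 all satisfy key % 4 == 1, so they share one bucket.
for key in (1, 5, 9):
    table.put(key, key * 10)

print(table.buckets)
print(table.get(9))
```

With chaining, a collision lengthens one bucket's list instead of spilling into neighbouring slots, so lookup cost grows with the length of that chain.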



In [None]:
#@title Python Code - Handling Collisions

# Demonstrate simple hash collisions using a tiny custom table.
# Show how probing searches next slots when collisions occur.
# Compare our table behavior with normal Python dictionary behavior.
# pip install commands are unnecessary because this script uses only built-in features.

# Define a tiny hash table size to force frequent collisions.
TABLE_SIZE = 5

# Define a simple hash function using modulo operation for bucket index.
def simple_hash(key):
    return hash(key) % TABLE_SIZE

# Define an insert function using linear probing for collisions.
def insert(table, key, value):
    index = simple_hash(key)
    start_index = index

    # Loop until an empty slot or matching key is found.
    while table[index] is not None and table[index][0] != key:
        index = (index + 1) % TABLE_SIZE
        if index == start_index:
            raise RuntimeError("Table is full, cannot insert more items.")

    # Store the key value pair at the chosen index after probing.
    table[index] = (key, value)

# Define a lookup function that follows the same probing pattern.
def lookup(table, key):
    index = simple_hash(key)
    start_index = index

    # Probe sequentially until key is found or empty slot encountered.
    while table[index] is not None:
        if table[index][0] == key:
            return table[index][1]
        index = (index + 1) % TABLE_SIZE
        if index == start_index:
            break

    # Return None when key is not found after probing.
    return None

# Create an empty table with None entries representing empty buckets.
custom_table = [None] * TABLE_SIZE

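# Insert a few keys into the tiny table; with only five slots, collisions are likely.
# String hashes are randomized per process, so the exact collisions vary between runs.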
keys = ["AA", "BB", "CC"]
values = [10, 20, 30]

# Insert each key value pair while printing chosen bucket index.
for key, value in zip(keys, values):
    index = simple_hash(key)
    print(f"Planned bucket for {key} is index {index}.")
    insert(custom_table, key, value)

# Show final table layout after handling collisions with probing.
print("Final custom table layout with possible collisions handled:")
print(custom_table)

# Demonstrate lookup following the same probing pattern for a key that may have collided.
search_key = "CC"
found_value = lookup(custom_table, search_key)
print(f"Lookup for {search_key} returned value {found_value}.")

# Compare with normal Python dictionary behavior using same keys and values.
py_dict = {k: v for k, v in zip(keys, values)}
print("Python dict lookup for CC gives:", py_dict["CC"])



### **1.3. Growth and Capacity**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master Python Algorithms/Module_03/Lecture_B/image_01_03.jpg?v=1767330643" width="250">



>* Between resizes, a hash table has a fixed number of slots
>* Load factor (entries ÷ slots) drives collisions, speed, and resizing

>* Hash tables sometimes resize and rehash keys
>* Rare resizes keep operations fast and efficient

>* Resizing causes memory jumps but adds spare capacity
>* Extra capacity keeps operations fast as sizes change
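
The claim that rare resizes keep the average cost low can be checked with a little arithmetic. The sketch below simulates only the bookkeeping of a growing table, under two stated assumptions that simplify CPython's real policy: the table doubles when it grows, and it grows just before the load factor would exceed 2/3. The function name `count_rehash_work` and its parameters are hypothetical.

```python
def count_rehash_work(total_inserts, initial_slots=8, max_load=2 / 3):
    """Simulate a doubling table; return (final_slots, resizes, entries_rehashed)."""
    slots = initial_slots
    stored = 0
    resizes = 0
    rehashed = 0
    for _ in range(total_inserts):
        # Grow before this insert would push the load factor past the limit.
        if (stored + 1) / slots > max_load:
            rehashed += stored  # every stored entry moves to the new table
            slots *= 2
            resizes += 1
        stored += 1
    return slots, resizes, rehashed

for n in (10, 100, 1_000, 10_000):
    slots, resizes, rehashed = count_rehash_work(n)
    print(f"{n:>6} inserts: {slots:>6} slots, {resizes:>2} resizes, "
          f"{rehashed / n:.2f} rehashed entries per insert")
```

Because each resize doubles the capacity, the rehashing work forms a geometric series, so the number of moved entries per insert stays below a small constant however large the table becomes.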



In [None]:
#@title Python Code - Growth and Capacity

# Demonstrate dictionary growth and capacity behavior with simple insert operations.
# Show how size jumps occasionally while many insertions happen smoothly overall.
# Illustrate that occasional expensive growth keeps average access time nearly constant.
# pip install commands are unnecessary because this script uses only built in features.

# Import sys module for checking approximate dictionary memory size in bytes.
import sys

# Create empty dictionary representing a growing address book over time.
address_book = {}

# Define total number of insert operations for this simple demonstration.
total_inserts = 5000

# Define checkpoints where we will inspect dictionary size and memory usage.
checkpoints = [10, 50, 100, 500, 1000, 2500, 5000]

# Print header describing columns for later checkpoint information.
print("entries\tapprox_bytes\tcomment")

# Loop through range and insert fake entries representing house numbers and names.
for house_number in range(1, total_inserts + 1):

    # Insert simple key value pair representing a resident at some street address.
    address_book[f"{house_number} Main Street"] = f"Resident {house_number}"

    # When current count hits checkpoint, print approximate size and short explanation.
    if house_number in checkpoints:

        # Get approximate dictionary size using sys.getsizeof for rough capacity hint.
        approx_bytes = sys.getsizeof(address_book)

        # Prepare short comment for each stage; exact resize points depend on the Python version.
        if house_number == 10:
            comment = "Very small table, the first growth steps have likely happened."
        elif house_number == 50:
            comment = "Still small, spare capacity keeps lookups very fast."
        elif house_number == 100:
            comment = "More entries, the table has likely resized several times."
        elif house_number == 500:
            comment = "Compare bytes with the previous row, growth comes in jumps."
        elif house_number == 1000:
            comment = "Each resize adds spare capacity for many further inserts."
        elif house_number == 2500:
            comment = "Load factor rises between resizes until the next growth step."
        else:
            comment = "Final size reached, spare slots remain for some further growth."

        # Print current number of entries, approximate bytes, and growth comment.
        print(f"{house_number}\t{approx_bytes}\t{comment}")




## **2. Dict Operation Costs**

### **2.1. Lookup Time Costs**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master Python Algorithms/Module_03/Lecture_B/image_02_01.jpg?v=1767330662" width="250">



>* Dict lookups use hashing to find keys
>* Expected lookup time stays constant as size grows

>* Constant-time lookups are average-case, not guaranteed
>* Collisions can slow lookups; worst-case becomes linear

>* Treat frequent dict lookups as usually constant-time
>* Prefer keys with well-distributed hashes; resizing keeps the table sparse
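
The gap between the average case and the worst case can be observed by counting equality comparisons. The sketch below uses a hypothetical key class, `CountingKey`, that can be told to return the same hash for every instance; that forces every key to collide, which is an artificial setup chosen only to expose the linear worst case.

```python
class CountingKey:
    """Hypothetical key type that counts equality comparisons."""
    eq_calls = 0

    def __init__(self, value, constant_hash):
        self.value = value
        self.constant_hash = constant_hash

    def __hash__(self):
        # With constant_hash=True every key reports the same hash (worst case).
        return 42 if self.constant_hash else hash(self.value)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.value == other.value

def comparisons_for_missing_lookup(size, constant_hash):
    """Build a dict of `size` keys, then count __eq__ calls for one missing key."""
    table = {CountingKey(i, constant_hash): i for i in range(size)}
    CountingKey.eq_calls = 0
    _ = CountingKey(-5, constant_hash) in table
    return CountingKey.eq_calls

SIZE = 200
spread_out = comparisons_for_missing_lookup(SIZE, constant_hash=False)
all_colliding = comparisons_for_missing_lookup(SIZE, constant_hash=True)
print(f"Distinct hashes: {spread_out} equality checks")
print(f"Constant hash:   {all_colliding} equality checks")
```

With distinct hashes the missing-key lookup needs almost no comparisons, while the constant hash forces a comparison against every stored key, so the cost grows with the size of the dict.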



In [None]:
#@title Python Code - Lookup Time Costs

# Demonstrate dictionary lookup time staying similar for different dictionary sizes.
# Compare lookup times for small and large dictionaries using time measurements.
# Show that average lookup cost barely grows even with many stored items.

# pip install commands are unnecessary because this script uses only built in modules.

# Import time module for measuring elapsed lookup durations.
import time

# Create small dictionary with one thousand key value pairs.
small_dict = {f"user_{i}": i for i in range(1000)}

# Create large dictionary with one million key value pairs.
large_dict = {f"user_{i}": i for i in range(1000000)}

# Define helper function that measures average lookup time for given dictionary.
def measure_lookup_time(given_dict, label, repetitions):
    # Choose existing key and missing key for realistic lookup scenarios.
    existing_key = next(iter(given_dict.keys()))
    missing_key = "user_missing_key_value_here"

    # Measure time for repeated existing key lookups.
    start_existing = time.perf_counter()
    for _ in range(repetitions):
        _ = given_dict[existing_key]
    duration_existing = time.perf_counter() - start_existing

    # Measure time for repeated missing key membership checks.
    start_missing = time.perf_counter()
    for _ in range(repetitions):
        _ = missing_key in given_dict
    duration_missing = time.perf_counter() - start_missing

    # Compute average microseconds per lookup for both scenarios.
    avg_existing_us = duration_existing / repetitions * 1_000_000
    avg_missing_us = duration_missing / repetitions * 1_000_000

    # Print formatted summary showing approximate constant lookup behavior.
    print(f"{label}: existing lookup ≈ {avg_existing_us:.3f} microseconds each.")
    print(f"{label}: missing lookup ≈ {avg_missing_us:.3f} microseconds each.")

# Set number of repetitions for stable timing without long runtime.
repetitions = 200000

# Measure and display lookup times for small dictionary.
measure_lookup_time(small_dict, "Small dict (1k keys)", repetitions)

# Measure and display lookup times for large dictionary.
measure_lookup_time(large_dict, "Large dict (1M keys)", repetitions)



### **2.2. Update Operation Costs**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master Python Algorithms/Module_03/Lecture_B/image_02_02.jpg?v=1767330687" width="250">



>* Inserts, updates, deletes share hash-based probing
>* Each update takes amortized constant time, even in large dicts

>* Dictionaries sometimes resize when load factor grows
>* Resizes are rare, so average insertion stays constant

>* Deletes use tombstones, keeping probes working correctly
>* Occasional housekeeping keeps average update time constant
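
Why deletions need tombstones can be seen in a small open-addressing sketch. It assumes linear probing and a fixed-size table with no resizing, which is simpler than what CPython does; the class `ProbingSet` and the sentinels `EMPTY` and `TOMBSTONE` are hypothetical names.

```python
EMPTY = object()      # slot that has never been used
TOMBSTONE = object()  # slot whose entry was deleted

class ProbingSet:
    """Toy fixed-size set with linear probing and tombstone deletion (no resizing)."""

    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield slot indexes in probe order, starting at the key's home slot."""
        size = len(self.slots)
        start = hash(key) % size
        for step in range(size):
            yield (start + step) % size

    def add(self, key):
        if key in self:
            return
        for index in self._probe(key):
            if self.slots[index] is EMPTY or self.slots[index] is TOMBSTONE:
                self.slots[index] = key  # reuse the first free or deleted slot
                return
        raise RuntimeError("table is full")

    def __contains__(self, key):
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is EMPTY:
                return False  # a never-used slot ends the probe chain
            if slot is not TOMBSTONE and slot == key:
                return True
        return False

    def discard(self, key):
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is EMPTY:
                return
            if slot is not TOMBSTONE and slot == key:
                self.slots[index] = TOMBSTONE  # keep the probe chain intact
                return

numbers = ProbingSet(size=8)
# 3, 11, and 19 all have home slot 3, so they occupy slots 3, 4, and 5.
for key in (3, 11, 19):
    numbers.add(key)

numbers.discard(11)   # leaves a tombstone in slot 4
print(19 in numbers)  # the probe walks past the tombstone to reach slot 5
print(11 in numbers)
```

If `discard` reset the slot to `EMPTY` instead, the search for 19 would stop at slot 4 and wrongly report the key as missing. Tombstones accumulate over time, which is why real tables clear them out during occasional housekeeping such as a resize.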



In [None]:
#@title Python Code - Update Operation Costs

# Demonstrate dictionary update costs with many insertions and occasional resizing.
# Show that single updates feel fast even as the dictionary grows larger.
# Compare time per operation for small and large update batches.
# pip install commands are unnecessary because this script uses only built-in modules.

# Import required standard library modules for timing operations.
import time

# Define a helper function that measures average insertion time.
def measure_insert_time(count, label):
    start_time = time.perf_counter()
    data = {}
    for number in range(count):
        data[number] = number
    end_time = time.perf_counter()
    average = (end_time - start_time) / count
    print(f"{label}: {count} inserts, average seconds {average:.8f}")
    return average

# Warm up the interpreter to reduce one time overhead effects.
_ = measure_insert_time(1000, "Warmup batch small")

# Measure average insertion time for a medium sized batch.
medium_average = measure_insert_time(50000, "Medium batch inserts")

# Measure average insertion time for a larger batch.
large_average = measure_insert_time(200000, "Large batch inserts")

# Show ratio between large and medium average times for clarity.
ratio = large_average / medium_average if medium_average else 0.0
print(f"Large versus medium average time ratio: {ratio:.2f}")

# Demonstrate that overwriting existing keys remains very fast overall.
user_scores = {f"user_{i}": i for i in range(100000)}

# Time updating existing keys inside the prepared dictionary.
start_update = time.perf_counter()
for i in range(100000):
    user_scores[f"user_{i}"] = i + 1
end_update = time.perf_counter()

# Print average update time per overwrite operation.
avg_update = (end_update - start_update) / 100000
print(f"Overwrite existing keys average seconds {avg_update:.8f}")




### **2.3. Iterating Dictionary Views**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master Python Algorithms/Module_03/Lecture_B/image_02_03.jpg?v=1767330703" width="250">



>* Iterating visits each stored entry once, linearly
>* CPython's dense entry array keeps iteration fast and cache-friendly

>* Dict view objects are cheap, reference-based windows
>* Iteration over views is linear and can dominate time

>* Mixing iteration with operations increases total cost
>* Design algorithms to avoid repeated full scans
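
Two of the points above lend themselves to a short sketch: views are live windows onto the dict rather than copies, and repeated full scans can often be replaced by an index built in one pass. The inventory data and the names `names_in_category_scan` and `names_by_category` are hypothetical.

```python
inventory = {"apple": "fruit", "carrot": "vegetable", "pear": "fruit"}

# Creating a view does not copy the keys; it reflects later changes to the dict.
keys_view = inventory.keys()
inventory["leek"] = "vegetable"
print(len(keys_view))

# Key views also support set operations without building a full copy first.
wanted = {"pear", "leek", "mango"}
print(sorted(keys_view & wanted))

# Repeated scan: every query walks all items again, so q queries cost O(q * n).
def names_in_category_scan(category):
    return [name for name, kind in inventory.items() if kind == category]

# One pass builds a reverse index; later queries are single dict lookups.
names_by_category = {}
for name, kind in inventory.items():
    names_by_category.setdefault(kind, []).append(name)

print(names_in_category_scan("fruit"))
print(names_by_category["fruit"])
```

The reverse index costs one linear pass up front and extra memory, which pays off once the same kind of query is asked more than a few times.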



In [None]:
#@title Python Code - Iterating Dictionary Views

# Demonstrate iterating dictionary views and their linear time behavior.
# Compare iteration over keys, values, and items with growing dictionary sizes.
# Show that ten times as many entries takes roughly ten times as long to iterate.

# pip install commands are unnecessary because this script uses only builtins.

# Import time module for simple timing measurements.
import time

# Create a helper function that builds a dictionary with given size.
def build_inventory(size):
    inventory = {}
    for i in range(size):
        inventory[f"product_{i}"] = i
    return inventory

# Create two dictionaries with different sizes for comparison.
small_inventory = build_inventory(1_000)
large_inventory = build_inventory(10_000)

# Define a function that times iterating over a given dictionary view.
def time_iteration(label, view):
    start = time.perf_counter()
    total = 0
    for item in view:
        total += 1
    duration = time.perf_counter() - start
    print(f"{label}: visited {total} entries in {duration:.6f} seconds")

# Time iterating over keys for small and large dictionaries.
print("Iterating over keys view timings:")
time_iteration("Small keys view", small_inventory.keys())
time_iteration("Large keys view", large_inventory.keys())

# Time iterating over values for small and large dictionaries.
print("\nIterating over values view timings:")
time_iteration("Small values view", small_inventory.values())
time_iteration("Large values view", large_inventory.values())

# Time iterating over items for small and large dictionaries.
print("\nIterating over items view timings:")
time_iteration("Small items view", small_inventory.items())
time_iteration("Large items view", large_inventory.items())



## **3. Efficient Set Keys**

### **3.1. Fast Membership Checks**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master Python Algorithms/Module_03/Lecture_B/image_03_01.jpg?v=1767330726" width="250">



>* Sets are optimized for rapid membership checks
>* Use sets as efficient yes-or-no filters

>* Convert repeated checks on lists into sets
>* Build sets once, reuse for many lookups

>* Avoid constantly rebuilding or churning large sets
>* Keep long-lived sets stable to preserve speed
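
The "build once, reuse" pattern might look like the following sketch. The function `filter_known` and the sample data are hypothetical; the point is only that the set is constructed a single time and then serves every membership check.

```python
def filter_known(events, known_ids):
    """Keep events whose user ID appears in known_ids, using one set for all checks."""
    known = set(known_ids)  # built once: O(len(known_ids))
    return [event for event in events if event[0] in known]  # expected O(1) per check

# Hypothetical data: (user_id, action) pairs and a list of registered IDs.
events = [(7, "login"), (42, "upload"), (99, "login"), (7, "logout")]
registered_ids = [7, 13, 99]

print(filter_known(events, registered_ids))
```

Checking `event[0] in registered_ids` against the list directly would scan the list for every event, and calling `set(registered_ids)` inside the loop would rebuild the set each time; both throw away the benefit of hashing.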



In [None]:
#@title Python Code - Fast Membership Checks

# Demonstrate fast membership checks using sets versus lists in Python.
# Show why converting to a set once speeds up repeated lookups.
# Compare lookup times for a list and a set with many elements.

# pip install commands are not required because this script uses only standard libraries.

# Import time module for simple timing measurements.
import time

# Create a large list of integers representing many user IDs.
user_ids_list = list(range(0, 500000))

# Convert the list into a set once for fast membership checks.
user_ids_set = set(user_ids_list)

# Choose some test IDs including present and absent values.
test_ids = [10, 250000, 499999, 750000]

# Define a helper function to time membership checks for any collection.
def time_membership_checks(collection, label):
    # Record start time before performing membership checks.
    start = time.perf_counter()
    
    # Perform many repeated membership checks using the same collection.
    for _ in range(2000):
        for value in test_ids:
            _ = value in collection
    
    # Record end time after all membership checks complete.
    end = time.perf_counter()
    
    # Compute elapsed time in milliseconds for readability.
    elapsed_ms = (end - start) * 1000
    
    # Print timing result with a clear label for the collection type.
    print(f"{label} membership time: {elapsed_ms:.2f} ms")

# Time membership checks using the list collection first.
time_membership_checks(user_ids_list, "List")

# Time membership checks using the set collection second.
time_membership_checks(user_ids_set, "Set")

# Print a short conclusion highlighting which structure was faster overall.
print("Set membership is usually much faster than list membership for repeated checks.")



### **3.2. Choosing Hashable Keys**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master Python Algorithms/Module_03/Lecture_B/image_03_02.jpg?v=1767330758" width="250">



>* Use simple, immutable, hashable values as keys
>* They keep membership checks fast and predictable

>* Use immutable keys so their hash and equality never change
>* Choose keys that match your domain’s uniqueness rules

>* Heavy, complex keys slow large set operations
>* Prefer short, unique, lightweight keys for performance
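
Which values qualify as keys can be checked directly with the built-in `hash` function. The helper `is_hashable` and the `candidates` mapping below are hypothetical names used for illustration.

```python
def is_hashable(value):
    """Return True if value can be used as a dict key or set element."""
    try:
        hash(value)
    except TypeError:
        return False
    return True

candidates = {
    "int": 205,
    "str": "ABC123",
    "tuple of ints": (3, 4),
    "frozenset": frozenset({"read", "write"}),
    "list": [3, 4],
    "dict": {"x": 3},
    "tuple holding a list": (3, [4]),
}

for label, value in candidates.items():
    print(f"{label:<22} hashable: {is_hashable(value)}")

# Equal values must hash equally, so these three collapse into a single element.
print(len({1, 1.0, True}))
```

A tuple is hashable only if everything inside it is hashable, so wrapping a list in a tuple does not make it a valid key. The last line is a reminder that key identity follows equality: `1`, `1.0`, and `True` compare equal and therefore count as the same key.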



In [None]:
#@title Python Code - Choosing Hashable Keys

# Demonstrate choosing simple hashable keys for efficient Python set membership checks.
# Compare performance and behavior of simple keys versus heavier composite keys.
# Show why stable, meaningful, lightweight keys keep membership checks efficient.

# pip install commands are not required because this script uses only standard libraries.

# Import time module for simple timing measurements.
import time

# Create simple immutable keys representing student ID numbers.
student_ids_simple = {101, 205, 309, 412, 518}

# Create heavier keys using tuples containing ID and long descriptive strings.
student_ids_heavy = {
    (101, "student from north campus with long description"),
    (205, "student from south campus with long description"),
}

# Define a function that checks membership many times using given key and set.
def check_membership_many_times(target_set, target_key, repeat_count):
    start_time = time.time()
    found_count = 0
    for _ in range(repeat_count):
        if target_key in target_set:
            found_count += 1
    end_time = time.time()
    return found_count, end_time - start_time

# Choose a simple key and a heavy key that both represent the same student.
simple_key = 205
heavy_key = (205, "student from south campus with long description")

# Run membership checks many times for both key styles.
repetitions = 200000
simple_found, simple_seconds = check_membership_many_times(student_ids_simple, simple_key, repetitions)
heavy_found, heavy_seconds = check_membership_many_times(student_ids_heavy, heavy_key, repetitions)

# Print results; both keys work, but the heavier key usually costs more time per check.
print("Simple key found count and seconds:", simple_found, round(simple_seconds, 4))
print("Heavy key found count and seconds:", heavy_found, round(heavy_seconds, 4))

# Show that using stable, meaningful, lightweight keys keeps membership checks predictable.
print("Simple keys are shorter, immutable, and usually faster for large sets.")
print("Heavy composite keys are valid but can be slower and more memory hungry.")
print("Choose keys that match your equality rules but remain compact and stable.")



### **3.3. Safe Mutable Key Usage**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master Python Algorithms/Module_03/Lecture_B/image_03_03.jpg?v=1767330775" width="250">



>* Set elements need stable, unchanging hash values
>* Changing hashed attributes breaks membership and removal

>* Store immutable identifiers instead of mutable objects
>* Keeps membership checks fast and avoids hash bugs

>* Mutable objects in sets need stable identity fields
>* Prefer immutable keys; keep mutable data separate
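
The failure mode described above, where changing a hashed attribute breaks membership, can be reproduced in a few lines. The class `Account` is a hypothetical example of what not to do: its hash depends on a field that the program later changes. The behavior shown is what CPython does; integer account numbers are used so the hashes are predictable.

```python
class Account:
    """Hypothetical class whose hash depends on a field that can change."""

    def __init__(self, account_number):
        self.account_number = account_number

    def __hash__(self):
        return hash(self.account_number)

    def __eq__(self, other):
        return isinstance(other, Account) and self.account_number == other.account_number

account = Account(1001)
open_accounts = {account}
found_before = account in open_accounts

# Changing the hashed field after insertion: the set still files the object
# under the old hash, so a lookup with the new hash misses it.
account.account_number = 2002
found_after = account in open_accounts

# Restoring the original value makes the element reachable again.
account.account_number = 1001
found_restored = account in open_accounts

print(found_before, found_after, found_restored)
```

While the field holds the changed value, the object is still stored in the set but can be neither found nor removed through normal lookups, which is why hashed fields should be treated as read-only after insertion.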



In [None]:
#@title Python Code - Safe Mutable Key Usage

# Demonstrate safe mutable key usage with sets and stable identifiers.
# Show that changing a field not used for hashing leaves membership intact.
# Show safer pattern using immutable identifiers for mutable session objects.
# No pip install is needed because this script uses only built-in features.

# Define a simple mutable Session class with mutable attributes.
class Session:
    def __init__(self, session_id, status):
        self.session_id = session_id
        self.status = status

    # Hash and equality depend only on session_id, which must never change after creation.
    def __hash__(self):
        return hash(self.session_id)

    # Equality also compares only session_id for identity stability.
    def __eq__(self, other):
        return isinstance(other, Session) and self.session_id == other.session_id

# Create one session object and place it inside a set.
session = Session("ABC123", "active")
active_sessions_objects = {session}

# Change mutable status field, which does not affect hashing behavior.
session.status = "paused"

# Membership check still works because session_id remained unchanged.
print("Session object still found in set:", session in active_sessions_objects)

# Now demonstrate safer pattern using only immutable identifiers.
active_sessions_ids = {"ABC123", "XYZ999"}

# Simulate session status changes stored separately in a dictionary.
session_status = {"ABC123": "paused", "XYZ999": "active"}

# Membership uses stable identifiers, while details remain mutable elsewhere.
print("Identifier 'ABC123' in active set:", "ABC123" in active_sessions_ids)



# <font color="#418FDE" size="6.5" uppercase>**Dicts And Sets**</font>


In this lecture, you learned to:
- Explain how hash tables underpin Python dict and set behavior. 
- Analyze the expected time complexity of lookups, insertions, and deletions in dicts and sets. 
- Select appropriate key types and usage patterns to maintain efficient hash-based operations. 

In the next Module (Module 4), we will go over 'Searching Techniques'.