"""
Session 2 - Topic 3
===================
Optimization Implications - DEEP DIVE


"""

# =====================================
# 1. Why Optimization Implications Matter
# =====================================


"""
- CPython compiles source to bytecode and then interprets it, so execution speed depends not only on the algorithm but also on *how* you write the code.
- Different constructs (generator, list comp, global vs local) emit different bytecode.
- Some allocate more objects and put more load on the GC; others add function calls and deepen the call stack.
- Understanding these differences is key to tuning critical paths.


"""

# =====================================
# 2. Generator vs List Comprehension
# =====================================



Generators are lazy — they produce values one by one.
List comprehensions build entire lists in memory immediately.

Memory vs speed tradeoff:
- Small data: list may be faster.
- Huge data: generator wins (no RAM bloat).

Bytecode for both shows critical differences.


# List Comprehension vs Generator Expression

This compares list comprehensions and generator expressions using bytecode, time, and memory.

---

## 1. Code

```python
def list_comp():
    return [i * i for i in range(10000)]

def generator_expr():
    return (i * i for i in range(10000))

def sum_with_generator():
    return sum(i * i for i in range(10000))
```

---

## 2. Bytecode

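Use `dis.dis()` to inspect the instructions. The disassembly output in this notebook comes from CPython 3.11; opcode names vary between versions.

`list_comp` uses:

* `BUILD_LIST` to create an empty list
* `LIST_APPEND` to add each value
* All values are computed and stored before the function returns

`generator_expr` uses:

* `MAKE_FUNCTION` on a `<genexpr>` code object, followed by a call that returns a generator object
* `YIELD_VALUE` inside the `<genexpr>` code object
* Values are not computed until the generator is iterated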
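
Opcode names depend on the interpreter version, so it can help to check them programmatically rather than rely on a fixed list. The sketch below is a minimal illustration, not part of the original notebook: `all_opnames` is a hypothetical helper that walks nested code objects, since comprehensions and generator expressions may compile to separate code objects depending on the CPython version.

```python
import dis
import types


def list_comp():
    return [i * i for i in range(10000)]


def generator_expr():
    return (i * i for i in range(10000))


def all_opnames(func):
    """Return the set of opcode names in func and its nested code objects."""
    names = set()
    pending = [func.__code__]
    while pending:
        code = pending.pop()
        names.update(ins.opname for ins in dis.get_instructions(code))
        # Comprehensions and generator expressions may live in nested code objects.
        pending.extend(c for c in code.co_consts if isinstance(c, types.CodeType))
    return names


list_ops = all_opnames(list_comp)
gen_ops = all_opnames(generator_expr)

print("LIST_APPEND in list_comp:     ", "LIST_APPEND" in list_ops)
print("YIELD_VALUE in generator_expr:", "YIELD_VALUE" in gen_ops)
print("YIELD_VALUE in list_comp:     ", "YIELD_VALUE" in list_ops)
```

Comparing sets of opcode names keeps the check independent of instruction offsets, which change from release to release.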

---

## 3. Timing

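Each function runs 500 times using `timeit`.

| Function             | Time (sec) |
| -------------------- | ---------- |
| `list_comp`          | ~0.08      |
| `generator_expr`     | ~0.0008    |
| `sum_with_generator` | ~0.06      |

`generator_expr` is fast only because it creates the generator object and never runs the loop body, so its time is not comparable with the other two.
`list_comp` computes and stores all 10,000 values on every call.
`sum_with_generator` also computes all 10,000 values, resuming the generator once per item, but it never builds a list.
When every value is needed anyway, a list comprehension is often at least as fast as a generator.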
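
A like-for-like timing needs both variants to consume every value. The sketch below is one way to set that up, under the assumption that summing the squares is a fair stand-in for "use all the values"; the names `sum_list` and `sum_gen` are illustrative and do not appear in the original notebook. No expected timings are shown because they depend on the machine and the Python version.

```python
import timeit

N = 10_000


def sum_list():
    # Builds the full list first, then sums it.
    return sum([i * i for i in range(N)])


def sum_gen():
    # Feeds values to sum() one at a time; no list is built.
    return sum(i * i for i in range(N))


# Take the minimum of several repeats to reduce noise from other processes.
list_time = min(timeit.repeat(sum_list, number=200, repeat=5))
gen_time = min(timeit.repeat(sum_gen, number=200, repeat=5))

print(f"sum over list comprehension: {list_time:.5f} sec")
print(f"sum over generator:          {gen_time:.5f} sec")
```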

---

## 4. Memory

Use `tracemalloc` to check peak memory.

| Function             | Peak memory |
| -------------------- | ----------- |
| `list_comp`          | High        |
| `generator_expr`     | Very low    |
| `sum_with_generator` | Low         |

`list_comp` keeps all values in memory.
`generator_expr` just stores the iterator.
`sum_with_generator` pulls one item at a time.
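
As a quick complement to `tracemalloc`, `sys.getsizeof` shows how the container itself scales. This is a rough sketch under one important caveat: `getsizeof` reports only the list or generator object, not the integers it refers to. The helper names `squares_list` and `squares_gen` are hypothetical.

```python
import sys


def squares_list(n):
    return [i * i for i in range(n)]


def squares_gen(n):
    return (i * i for i in range(n))


for n in (10, 10_000):
    # Size of the container object only, not of the ints it references.
    list_size = sys.getsizeof(squares_list(n))
    gen_size = sys.getsizeof(squares_gen(n))
    print(f"n={n:>6}: list container {list_size} bytes, generator {gen_size} bytes")
```

The list grows with `n`, while the generator object stays the same size because it holds only its suspended frame.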

---

## 5. Summary

Use a list comprehension if:

* You need all values
* You have enough memory

Use a generator if:

* You only loop once
* Memory matters
* You don’t need to keep the data



In [4]:
import dis
import timeit
import tracemalloc

# 1. Example functions
def list_comp():
    return [i * i for i in range(10000)]

def generator_expr():
    return (i * i for i in range(10000))  # Just return the generator

def sum_with_generator():
    return sum(i * i for i in range(10000))

# 2. Disassembly
print("\nDisassembly of List Comprehension:")
dis.dis(list_comp)

print("\nDisassembly of Generator Expression:")
dis.dis(generator_expr)

# 3. Timing comparison
print("\nTiming Comparison (500 runs each):")
lc_time = timeit.timeit(list_comp, number=500)
gen_time = timeit.timeit(generator_expr, number=500)
sum_gen_time = timeit.timeit(sum_with_generator, number=500)

print(f"List Comprehension (build list): {lc_time:.5f} sec")
print(f"Generator Expression (create gen): {gen_time:.5f} sec")
print(f"Sum via Generator: {sum_gen_time:.5f} sec")

# 4. Memory comparison
def memory_usage(func):
    tracemalloc.start()
    func()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

print("\nMemory Usage Comparison:")
lc_mem = memory_usage(list_comp)
gen_mem = memory_usage(generator_expr)
sum_gen_mem = memory_usage(sum_with_generator)

print(f"List Comprehension: {lc_mem / 1024:.2f} KB")
print(f"Generator Expression (no iteration): {gen_mem / 1024:.2f} KB")
print(f"Sum via Generator (full iteration): {sum_gen_mem / 1024:.2f} KB")


Disassembly of List Comprehension:
  6           0 RESUME                   0

  7           2 LOAD_CONST               1 (<code object <listcomp> at 0x7fcc7eab2ce0, file "<ipython-input-4-219c454cc867>", line 7>)
              4 MAKE_FUNCTION            0
              6 LOAD_GLOBAL              1 (NULL + range)
             18 LOAD_CONST               2 (10000)
             20 PRECALL                  1
             24 CALL                     1
             34 GET_ITER
             36 PRECALL                  0
             40 CALL                     0
             50 RETURN_VALUE

Disassembly of <code object <listcomp> at 0x7fcc7eab2ce0, file "<ipython-input-4-219c454cc867>", line 7>:
  7           0 RESUME                   0
              2 BUILD_LIST               0
              4 LOAD_FAST                0 (.0)
        >>    6 FOR_ITER                 7 (to 22)
              8 STORE_FAST               1 (i)
             10 LOAD_FAST                1 (i)
             12 LOAD_

"""
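Python accesses:
- Locals: via LOAD_FAST (an index into the frame's local-variable array, no hashing)
- Globals: via LOAD_GLOBAL (a dictionary lookup in the module globals, then builtins; still O(1) on average, but with a larger constant)

CPython 3.11+ caches LOAD_GLOBAL results inline, which usually narrows the gap.
Favor locals inside hot loops, for example by binding a global or builtin to a local name before the loop.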
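
The difference can be seen both in the bytecode and in a timing run. The sketch below is a minimal illustration under stated assumptions: `FACTOR`, `scale_global`, `scale_local`, and `global_load_inside_loop` are hypothetical names, and the size of the timing gap depends on the CPython version.

```python
import dis
import timeit

FACTOR = 3  # module-level (global) name


def scale_global(n):
    total = 0
    for i in range(n):
        total += i * FACTOR  # FACTOR is loaded as a global on every iteration
    return total


def scale_local(n):
    factor = FACTOR  # one global load, then a fast local inside the loop
    total = 0
    for i in range(n):
        total += i * factor
    return total


def global_load_inside_loop(func, name):
    """True if func loads the global `name` after its first FOR_ITER."""
    in_loop = False
    for ins in dis.get_instructions(func):
        if ins.opname == "FOR_ITER":
            in_loop = True
        elif in_loop and ins.opname == "LOAD_GLOBAL" and ins.argval == name:
            return True
    return False


print("global load in loop (scale_global):", global_load_inside_loop(scale_global, "FACTOR"))
print("global load in loop (scale_local): ", global_load_inside_loop(scale_local, "FACTOR"))

global_time = min(timeit.repeat(lambda: scale_global(100_000), number=10, repeat=5))
local_time = min(timeit.repeat(lambda: scale_local(100_000), number=10, repeat=5))
print(f"global in loop: {global_time:.5f} sec")
print(f"local in loop:  {local_time:.5f} sec")
```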


"""
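Each function call:
- Sets up a new interpreter frame (a full PyFrameObject in older CPython versions; a lighter internal frame since 3.11)
- Pushes the frame on entry and pops it on return
- Adds a small fixed cost, typically a fraction of a microsecond per call, depending on the CPython version and hardware

In deep recursion or tight loops, this adds up.
Iteration usually beats recursion in CPython, which performs no tail-call optimization and also caps recursion depth (1000 by default, see sys.getrecursionlimit()).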
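
The per-call cost can be estimated by running the same arithmetic inline and behind a function call. This is a rough sketch, not a precise measurement: `add_one`, `loop_inline`, and `loop_call` are hypothetical names, and the estimate also includes the global lookup of `add_one`.

```python
import timeit


def add_one(x):
    return x + 1


def loop_inline(n):
    total = 0
    for i in range(n):
        total += i + 1  # work done inline
    return total


def loop_call(n):
    total = 0
    for i in range(n):
        total += add_one(i)  # same work behind a function call
    return total


n = 100_000
runs = 10
inline_time = min(timeit.repeat(lambda: loop_inline(n), number=runs, repeat=5))
call_time = min(timeit.repeat(lambda: loop_call(n), number=runs, repeat=5))

# Difference per iteration, converted to nanoseconds.
per_call_ns = (call_time - inline_time) / (runs * n) * 1e9

print(f"inline:    {inline_time:.5f} sec")
print(f"with call: {call_time:.5f} sec")
print(f"estimated overhead per call: {per_call_ns:.1f} ns")
```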


In [9]:
import dis
import timeit

# 1. Recursive Sum
def recursive_sum(n):
    if n == 0:
        return 0
    return n + recursive_sum(n - 1)

# 2. Iterative Sum
def iterative_sum(n):
    total = 0
    for i in range(n + 1):
        total += i
    return total

# 3. Disassemble both functions
print("\nDisassembly of Recursive Sum:")
dis.dis(recursive_sum)

print("\nDisassembly of Iterative Sum:")
dis.dis(iterative_sum)

# 4. Timing comparison
print("\nTiming Recursive vs Iterative Sum (n=500, 10,000 runs):")
rec_time = timeit.timeit('recursive_sum(500)', globals=globals(), number=10000)
iter_time = timeit.timeit('iterative_sum(500)', globals=globals(), number=10000)

print(f"Recursive Sum: {rec_time:.5f} sec")
print(f"Iterative Sum: {iter_time:.5f} sec")


Disassembly of Recursive Sum:
  5           0 RESUME                   0

  6           2 LOAD_FAST                0 (n)
              4 LOAD_CONST               1 (0)
              6 COMPARE_OP               2 (==)
             12 POP_JUMP_FORWARD_IF_FALSE     2 (to 18)

  7          14 LOAD_CONST               1 (0)
             16 RETURN_VALUE

  8     >>   18 LOAD_FAST                0 (n)
             20 LOAD_GLOBAL              1 (NULL + recursive_sum)
             32 LOAD_FAST                0 (n)
             34 LOAD_CONST               2 (1)
             36 BINARY_OP               10 (-)
             40 PRECALL                  1
             44 CALL                     1
             54 BINARY_OP                0 (+)
             58 RETURN_VALUE

Disassembly of Iterative Sum:
 11           0 RESUME                   0

 12           2 LOAD_CONST               1 (0)
              4 STORE_FAST               1 (total)

 13           6 LOAD_GLOBAL              1 (NULL + range)
 

Overhead Lines Explained
These instructions account for most of the overhead in the recursive version (opcode names as in the CPython 3.11 disassembly above):

1. LOAD_GLOBAL 1 (NULL + recursive_sum)
   - Loads the function object from the global namespace.
   - This runs on every recursive call, so it is not free.
   - It is a dictionary lookup, although CPython caches the result to make repeated lookups cheaper.
2. PRECALL 1 / CALL 1 (CALL_FUNCTION in CPython 3.10 and earlier)
   - This is the most expensive step in recursion.
   - Creates a new frame for the callee.
   - Passes the argument (n - 1 in this case).
   - Saves the caller's execution state so it can resume after the return.
   - Pushes the new frame onto the Python call stack.
   - Every recursive call therefore costs memory and CPU time.
3. RETURN_VALUE
   - Pops the current frame off the stack and returns control to the caller.
   - Each return restores the caller's frame and instruction pointer.
   - Repeated thousands of times, this adds up.
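
Beyond the per-call cost, recursion in CPython is bounded by the interpreter's recursion limit. The sketch below reuses the two functions from the cell above and picks an input deliberately deeper than the limit; the factor of 10 is an arbitrary choice for illustration.

```python
import sys


def recursive_sum(n):
    if n == 0:
        return 0
    return n + recursive_sum(n - 1)


def iterative_sum(n):
    total = 0
    for i in range(n + 1):
        total += i
    return total


n = sys.getrecursionlimit() * 10  # deliberately deeper than the limit

try:
    recursive_sum(n)
    hit_limit = False
except RecursionError:
    hit_limit = True

print(f"recursion limit: {sys.getrecursionlimit()}")
print(f"recursive_sum({n}) hit the limit: {hit_limit}")
print(f"iterative_sum({n}) = {iterative_sum(n)}")
```

The iterative version handles the same input with a single frame, so it has no depth limit to run into.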

"""
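Object construction matters.

- Tuples are slightly cheaper to create than lists: a tuple of constants is stored once and loaded with a single LOAD_CONST, while a list literal is rebuilt on every execution (BUILD_LIST plus LIST_EXTEND).
- Attribute lookup (obj.x) generally costs more than a direct dictionary lookup (d["x"]), because it consults the type before the instance dictionary; recent CPython versions specialize the common cases, so measure before relying on this.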
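
The attribute-versus-dictionary claim is easy to measure directly. The sketch below is a minimal comparison under stated assumptions: `Point` is a hypothetical class with a single instance attribute, and the result can go either way depending on the CPython version, so no expected numbers are given.

```python
import timeit


class Point:
    def __init__(self, x):
        self.x = x


p = Point(42)
d = {"x": 42}

# Pass the objects in explicitly so each statement sees only what it needs.
attr_time = min(timeit.repeat("p.x", globals={"p": p}, number=1_000_000, repeat=5))
dict_time = min(timeit.repeat("d['x']", globals={"d": d}, number=1_000_000, repeat=5))

print(f"attribute lookup p.x:    {attr_time:.5f} sec")
print(f"dictionary lookup d['x']: {dict_time:.5f} sec")
```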


In [12]:
import dis
import timeit

# 1. Functions to compare
def create_list():
    return [1, 2, 3, 4, 5]

def create_tuple():
    return (1, 2, 3, 4, 5)

def access_list():
    lst = [1, 2, 3, 4, 5]
    return lst[2]

def access_tuple():
    tpl = (1, 2, 3, 4, 5)
    return tpl[2]

# 2. Disassemble creation and access
print("\nDisassembly of create_list:")
dis.dis(create_list)

print("\nDisassembly of create_tuple:")
dis.dis(create_tuple)

print("\nDisassembly of access_list:")
dis.dis(access_list)

print("\nDisassembly of access_tuple:")
dis.dis(access_tuple)

# 3. Timing comparison
print("\nTiming Comparison (10 million runs):")
list_create_time = timeit.timeit('create_list()', globals=globals(), number=10_000_000)
tuple_create_time = timeit.timeit('create_tuple()', globals=globals(), number=10_000_000)
list_access_time = timeit.timeit('access_list()', globals=globals(), number=10_000_000)
tuple_access_time = timeit.timeit('access_tuple()', globals=globals(), number=10_000_000)

print(f"List creation: {list_create_time:.5f} sec")
print(f"Tuple creation: {tuple_create_time:.5f} sec")
print(f"List access: {list_access_time:.5f} sec")
print(f"Tuple access: {tuple_access_time:.5f} sec")


Disassembly of create_list:
  5           0 RESUME                   0

  6           2 BUILD_LIST               0
              4 LOAD_CONST               1 ((1, 2, 3, 4, 5))
              6 LIST_EXTEND              1
              8 RETURN_VALUE

Disassembly of create_tuple:
  8           0 RESUME                   0

  9           2 LOAD_CONST               1 ((1, 2, 3, 4, 5))
              4 RETURN_VALUE

Disassembly of access_list:
 11           0 RESUME                   0

 12           2 BUILD_LIST               0
              4 LOAD_CONST               1 ((1, 2, 3, 4, 5))
              6 LIST_EXTEND              1
              8 STORE_FAST               0 (lst)

 13          10 LOAD_FAST                0 (lst)
             12 LOAD_CONST               2 (2)
             14 BINARY_SUBSCR
             24 RETURN_VALUE

Disassembly of access_tuple:
 15           0 RESUME                   0

 16           2 LOAD_CONST               1 ((1, 2, 3, 4, 5))
              4 STORE_FAST 

"""
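Proper benchmarking tips:
- Use timeit: it runs the statement many times, uses a high-resolution clock, and disables garbage collection during timing.
- Warm up the function before timing (avoid cold start anomalies).
- Repeat multiple times (timeit.repeat) to observe variance; the minimum is usually the most stable figure.
- Treat differences under about 10% as noise unless they hold up in real-world runs.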
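
These tips can be combined in a few lines. The sketch below is one possible pattern, not a prescribed one: `work` is a hypothetical function standing in for the code under test, and the repeat counts are arbitrary.

```python
import timeit


def work():
    return sum(i * i for i in range(1_000))


work()  # warm-up call before measuring

# Five independent measurements of 1,000 calls each.
runs = timeit.repeat(work, number=1_000, repeat=5)
best = min(runs)
spread = (max(runs) - best) / best  # relative gap between slowest and fastest run

print("runs:  ", ", ".join(f"{t:.5f}" for t in runs))
print(f"best:   {best:.5f} sec")
print(f"spread: {spread:.1%}")
```

If the spread is comparable to the improvement being claimed, the measurement is too noisy to support the claim.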


"""

# =====================================
# 7. When NOT to Optimize
# =====================================


"""
DO NOT:
- Optimize readability-critical code unless profiling shows major gain.
- Fight Python's dynamic nature: heavy C-like micro-optimizations usually harm clarity without huge speed gain.
- Focus on single function benchmarks in isolation — system effects matter.

ALWAYS:
- Profile first (cProfile, timeit, tracemalloc).
- Optimize data structures before algorithm internals.
- Optimize ONLY your real bottlenecks.


"""

# ===========================
# END OF TOPIC 3
# ===========================
