**11.2 Lists vs Iterators vs Generators**
✔ List → Eager

Builds every element up front and holds them all in memory.

✔ Iterator → Lazy

Produces one value at a time, only when the consumer asks for the next one.

✔ Generator → Custom lazy iterator

Defined via:

generator functions (yield)

generator expressions

Why does this matter?

Loading 50 million rows into a list = 💥 likely out of memory.
Streaming the same rows with a generator = ✔ one row in memory at a time.
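
To make the eager vs lazy difference concrete, here is a minimal sketch that compares the size of a list with the size of an equivalent generator object. The element count `n` is an arbitrary illustrative value, and `sys.getsizeof` reports only the container itself (for the list, the pointer array, not the integers it refers to), so treat the numbers as a rough indication rather than a benchmark.

```python
import sys

n = 1_000_000  # illustrative size, not a benchmark

squares_list = [i * i for i in range(n)]   # eager: every value is built now
squares_gen = (i * i for i in range(n))    # lazy: nothing is computed yet

list_bytes = sys.getsizeof(squares_list)   # grows with n
gen_bytes = sys.getsizeof(squares_gen)     # small and independent of n

print(f"list: {list_bytes:,} bytes, generator: {gen_bytes:,} bytes")

# A generator is single-pass: once consumed, it yields nothing more.
total = sum(squares_gen)
leftover = list(squares_gen)   # empty, the generator is exhausted
```

The trade-off is that a generator cannot be indexed, measured with `len()`, or iterated twice. If you need any of those, materialize a list, ideally of a bounded chunk rather than the whole dataset.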

In [0]:
## Generator function
def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line

In [0]:
## Generator expression
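lines = ["1", "2", "3"]          # any iterable of numeric strings
nums = (int(x) for x in lines)   # nothing is converted until nums is iterated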

In [0]:
## Chaining generators
import json

# read_lines() is the generator function defined above
clean = (line.strip() for line in read_lines("logs"))
events = (json.loads(line) for line in clean)
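
Chained generators do no work when they are created. Each `next()` call pulls a single item through every stage. The sketch below shows this by recording when each stage runs. The names `source`, `parse`, and `calls`, and the sample lines, are hypothetical stand-ins and not part of the original cells.

```python
import json

calls = []  # records the order in which stages actually run

def source():
    # Stand-in for read_lines(): yields raw JSON lines (hypothetical sample data).
    for raw in ['{"id": 1}\n', '{"id": 2}\n', '{"id": 3}\n']:
        calls.append(f"read {raw.strip()}")
        yield raw

def parse(lines):
    for line in lines:
        calls.append(f"parse {line.strip()}")
        yield json.loads(line)

events = parse(source())       # builds the chain; no line has been read yet
calls_before = list(calls)     # still empty at this point

first = next(events)           # pulls exactly one record through both stages
calls_after_first = list(calls)
```

After the single `next()` call, only the first line has been read and parsed. The second and third lines stay untouched until something asks for them.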


**Why Generators Matter in Data Engineering**

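✔ They let you parse files larger than RAM
✔ They avoid memory blowups by holding one record at a time
✔ They reduce GC pressure (fewer large, long-lived containers)
✔ They fit streaming processing and can feed batches to parallel workers
✔ They mirror the lazy, composable pipeline style used by Spark
✔ They can model infinite streams (Kafka, Kinesis)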
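
The infinite-stream point can be sketched without a real broker. Below, `event_stream` is a hypothetical stand-in for an unbounded source such as a Kafka or Kinesis consumer loop. It uses no client library, and the record shape is made up for illustration. `itertools.islice` takes a finite slice so the example terminates.

```python
import itertools

def event_stream():
    # Hypothetical unbounded source: never raises StopIteration on its own.
    for offset in itertools.count():
        status = "ERROR" if offset % 3 == 0 else "OK"
        yield {"offset": offset, "status": status}

# Filtering an infinite stream is safe because the filter is lazy too.
errors = (e for e in event_stream() if e["status"] == "ERROR")

# Take only the first three matches; the stream is never fully consumed.
first_three = list(itertools.islice(errors, 3))
print([e["offset"] for e in first_three])
```

Calling `list(event_stream())` directly would never return, which is why unbounded sources should only ever be consumed through lazy stages or an explicit limit.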

In [0]:
## Full lazy pipeline, commonly used in ETL. It processes one record at a time, so memory use stays flat.

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line

def clean(lines):
    for line in lines:
        line = line.strip()
        if line:
            yield line

def parse_json(lines):
    import json
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the run

def filter_errors(records):
    for r in records:
        if r.get('status') == 'ERROR':
            yield r

pipeline = filter_errors(parse_json(clean(read_lines("logs.json"))))
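
Building `pipeline` does not read the file. Iterating it does. The sketch below runs an equivalent, more compact version of the same four stages against a small temporary file. The sample records and the temporary file are assumptions made for the demonstration.

```python
import json
import os
import tempfile

def read_lines(path):
    with open(path) as f:
        yield from f

def clean(lines):
    # Strip whitespace and drop empty lines.
    return (s for s in (line.strip() for line in lines) if s)

def parse_json(lines):
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines

def filter_errors(records):
    return (r for r in records if r.get("status") == "ERROR")

# Hypothetical sample log: two ERROR records, one OK, one blank, one corrupt line.
sample = (
    '{"id": 1, "status": "ERROR"}\n'
    "\n"
    '{"id": 2, "status": "OK"}\n'
    "not json\n"
    '{"id": 3, "status": "ERROR"}\n'
)
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write(sample)

pipeline = filter_errors(parse_json(clean(read_lines(tmp.name))))
error_ids = [r["id"] for r in pipeline]  # iteration is what drives the pipeline
os.remove(tmp.name)
```

In a real job the final step would usually be a `for record in pipeline:` loop that writes each record out, so that nothing is ever collected into a list.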

In [0]:
## Chunking patterns: chunking lets you process data in batches. Fixed-size chunks:

def chunked_iter(iterable, size):
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

**Used for:**

batch DB inserts

Snowflake COPY batches

S3 writes

incremental aggregation
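
As an illustration of the batch-insert use case, the sketch below feeds `chunked_iter` into `sqlite3`'s `executemany`. SQLite stands in here for whatever database you actually target. The table name, row shape, and batch size of 4 are assumptions chosen to keep the example small.

```python
import sqlite3

def chunked_iter(iterable, size):
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final partial batch

rows = ((i, f"event-{i}") for i in range(10))  # lazy source of 10 rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")

batch_sizes = []
for batch in chunked_iter(rows, 4):
    conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
    conn.commit()  # one commit per batch rather than per row
    batch_sizes.append(len(batch))

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
conn.close()
```

Only one batch is in memory at any time, and the last batch may be smaller than `size`. On Python 3.12 and later, `itertools.batched` offers similar behavior, yielding tuples instead of lists.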

**Windowed Iterators (Lazy Sliding Window)**: using deque

In [0]:
from collections import deque

def sliding_window(iterable, size):
    q = deque(maxlen=size)
    for x in iterable:
        q.append(x)
        if len(q) == size:
            yield tuple(q)

**Used for:**

rolling averages

anomaly detection

signal processing
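
A rolling average is the simplest of these to sketch. The example below applies `sliding_window` to a short list of hypothetical sensor readings with a window of 3. Both the data and the window size are illustrative assumptions.

```python
from collections import deque

def sliding_window(iterable, size):
    q = deque(maxlen=size)  # the oldest item is dropped automatically
    for x in iterable:
        q.append(x)
        if len(q) == size:
            yield tuple(q)

readings = [10, 12, 11, 50, 13, 12]  # hypothetical sensor values

# One mean per full window; the first size - 1 items produce no output.
rolling_means = [sum(w) / len(w) for w in sliding_window(readings, 3)]
print(rolling_means)
```

Six readings with a window of 3 give four windows. Summing each window from scratch costs time proportional to the window size per step. For large windows, a running total that adds the new value and subtracts the evicted one would be cheaper.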

**Streaming JSON (Large JSON Files)**

Large JSON files are often:

JSON Lines (one JSON object per line) → ideal

Huge nested JSON (bad: must be parsed with a streaming parser)

In [0]:
## ✔ Line-delimited JSON (NDJSON):

def json_stream(path):
    import json
    with open(path) as f:
        for line in f:
            if line.strip():  # skip blank lines, which json.loads rejects
                yield json.loads(line)
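
Real NDJSON files often contain a few corrupt lines, and a single bad line should not end a long-running job. The sketch below is one possible variant that records the line numbers of bad lines instead of raising. The function name `iter_ndjson`, the `errors` list parameter, and the in-memory sample are hypothetical choices for this illustration.

```python
import io
import json

def iter_ndjson(lines, errors):
    # Yield one parsed record per non-blank line.
    # Line numbers of unparseable lines are appended to `errors`.
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            errors.append(lineno)

# io.StringIO stands in for an open file; any iterable of lines works.
data = io.StringIO('{"a": 1}\n\n{broken\n{"a": 2}\n')
errors = []
records = list(iter_ndjson(data, errors))
```

Keeping the line numbers makes it possible to report or quarantine the bad input after the run, rather than silently dropping it.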

In [0]:
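## ✔ For HUGE JSON (1GB+): use a streaming parser such as the ijson module (Yahoo / JPM ask about this):
import ijson

# "item" selects each element of a top-level JSON array
with open("file.json", "rb") as f:
    for obj in ijson.items(f, "item"):
        process(obj)  # process() is a placeholder for your own handler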