# Chapter 9: itertools and Functional Iteration Recipes

**Chapter 9 - Learning Python, 5th Edition**

The `itertools` module provides a collection of fast, memory-efficient tools
for working with iterators. Combined with `functools.reduce` and generator
pipelines, these tools enable powerful data processing without loading
entire datasets into memory.

## Key Concepts
- **`itertools`**: Standard library of iterator building blocks
- **Lazy evaluation**: Values computed on demand, not all at once
- **Generator pipelines**: Chaining generators for multi-stage processing
- **`functools.reduce`**: Cumulative reduction of an iterable to a single value

## Section 1: Chaining Iterables

`itertools.chain` concatenates multiple iterables into a single stream.
`chain.from_iterable` does the same but accepts a single iterable of
iterables (useful when you don't know the number of sources upfront).
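
Both forms are lazy: a source is only advanced when the consumer asks for the next element. The sketch below illustrates this under one assumption: `noisy_source` is a hypothetical generator, written for this example, that records the moment it is first advanced.

```python
import itertools

started: list[str] = []  # records which sources have begun producing values


def noisy_source(name: str, items: list[int]):
    """Hypothetical generator that logs when it is first advanced."""
    started.append(name)
    yield from items


# A generator expression of sources: chain.from_iterable does not need to
# know how many sources exist, and it asks for the next source only after
# the current one is exhausted.
sources = (noisy_source(f"src{i}", [i, i + 10]) for i in range(3))
stream = itertools.chain.from_iterable(sources)

first_two = [next(stream), next(stream)]
started_after_two = list(started)  # snapshot: only the first source has run

rest = list(stream)

print(first_two)          # [0, 10]
print(started_after_two)  # ['src0']
print(rest)               # [1, 11, 2, 12]
```

Only `src0` has started after two elements are taken, which is why `chain.from_iterable` suits cases where opening every source upfront would be wasteful.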

In [None]:
import itertools

# chain: concatenate known iterables
frontend: list[str] = ["HTML", "CSS", "JavaScript"]
backend: list[str] = ["Python", "Go", "Rust"]
devops: list[str] = ["Docker", "Kubernetes"]

all_skills: list[str] = list(itertools.chain(frontend, backend, devops))
print(f"All skills: {all_skills}")

# chain.from_iterable: flatten an iterable of iterables
departments: list[list[str]] = [
    ["Alice", "Bob"],
    ["Carol", "Dave", "Eve"],
    ["Frank"],
]
all_employees: list[str] = list(itertools.chain.from_iterable(departments))
print(f"All employees: {all_employees}")

# Practical: merge sorted sequences (preserving order within each)
log_server_1: list[tuple[int, str]] = [(1, "start"), (3, "request"), (5, "response")]
log_server_2: list[tuple[int, str]] = [(2, "connect"), (4, "query"), (6, "disconnect")]

# chain then sort for a merged timeline
merged_timeline = sorted(
    itertools.chain(log_server_1, log_server_2),
    key=lambda entry: entry[0]
)
print(f"\nMerged timeline:")
for ts, event in merged_timeline:
    print(f"  t={ts}: {event}")

# heapq.merge: lazily merge inputs that are already sorted (no full re-sort needed)
import heapq
sorted_a: list[int] = [1, 4, 7, 10]
sorted_b: list[int] = [2, 5, 8, 11]
sorted_c: list[int] = [3, 6, 9, 12]
merged = list(heapq.merge(sorted_a, sorted_b, sorted_c))
print(f"\nheapq.merge: {merged}")

## Section 2: Lazy Slicing with `islice`

`itertools.islice` provides slicing for any iterator -- including those
that don't support indexing (generators, file objects, etc.). It consumes
elements lazily without building an intermediate list.
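
Because `islice` advances the underlying iterator, repeated calls on the same iterator return successive chunks rather than the same slice twice. A minimal sketch of a batching helper built on that behaviour follows; `batched_lists` is a hypothetical name chosen for illustration (Python 3.12 added a similar `itertools.batched`).

```python
import itertools
from typing import Iterable, Iterator


def batched_lists(iterable: Iterable[int], size: int) -> Iterator[list[int]]:
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(iterable)  # one shared iterator, advanced by every islice
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:  # an empty slice means the source is exhausted
            return
        yield batch


batches = list(batched_lists(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]

# The same effect explains a common surprise: slicing a shared iterator
# twice does not return the same elements twice.
shared = iter(range(10))
first = list(itertools.islice(shared, 3))
second = list(itertools.islice(shared, 3))
print(first, second)  # [0, 1, 2] [3, 4, 5]
```

This is also why the `paginate` calls in the cell below pass a fresh `iter(all_items)` each time.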

In [None]:
import itertools

# Iterators such as itertools.count() do not support [start:stop] slicing
def infinite_counter(start: int = 0) -> itertools.count:
    """An infinite sequence of integers."""
    return itertools.count(start)

# islice(iterable, stop)
first_ten: list[int] = list(itertools.islice(infinite_counter(), 10))
print(f"First 10: {first_ten}")

# islice(iterable, start, stop)
middle: list[int] = list(itertools.islice(infinite_counter(), 5, 15))
print(f"Elements 5-14: {middle}")

# islice(iterable, start, stop, step)
every_third: list[int] = list(itertools.islice(infinite_counter(), 0, 30, 3))
print(f"Every 3rd (0-29): {every_third}")

# Practical: preview the first few lines of a large "file"
import io

large_file = io.StringIO("\n".join(f"Line {i}: data_{i}" for i in range(1, 10001)))

print("\nFirst 5 lines:")
for line in itertools.islice(large_file, 5):
    print(f"  {line.rstrip()}")

# Skip header (first line), take next 3 lines
large_file.seek(0)
print("\nSkip header, next 3 lines:")
for line in itertools.islice(large_file, 1, 4):
    print(f"  {line.rstrip()}")

# Paginate results lazily
def paginate(iterable, page_size: int, page_num: int) -> list:
    """Return page `page_num` (1-based) of an iterable, consuming it up to the end of that page."""
    start = page_size * (page_num - 1)
    return list(itertools.islice(iterable, start, start + page_size))

all_items = range(1, 51)
print(f"\nPage 1 (size=10): {paginate(iter(all_items), 10, 1)}")
print(f"Page 3 (size=10): {paginate(iter(all_items), 10, 3)}")

## Section 3: Grouping with `groupby`

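`itertools.groupby` groups consecutive elements that share a key. To get one
group per distinct key, the input **must be sorted** by the same key function
first; on unsorted input, each run of equal keys becomes a separate group.
It yields `(key, group_iterator)` pairs, and each group iterator is valid only
until `groupby` advances to the next key.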
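
A short sketch of why sorting matters and of how the group iterators behave. The `events` list is made-up sample data, and the last step assumes Python 3.7 or later, where a group iterator is exhausted once `groupby` moves on.

```python
import itertools

events: list[str] = ["error", "info", "error", "error", "info"]

# Unsorted: each run of equal values becomes its own group.
runs = [(key, len(list(group))) for key, group in itertools.groupby(events)]
print(runs)  # [('error', 1), ('info', 1), ('error', 2), ('info', 1)]

# Sorted first: one group per distinct key.
totals = [
    (key, len(list(group)))
    for key, group in itertools.groupby(sorted(events))
]
print(totals)  # [('error', 3), ('info', 2)]

# Group iterators share the underlying iterator, so a group must be
# consumed (or copied into a list) before advancing to the next key.
stale = [group for _, group in itertools.groupby(sorted(events))]
stale_contents = [list(group) for group in stale]
print(stale_contents)  # [[], []]
```

Copying each group with `list(group)` inside the loop, as the cell below does, avoids the empty-group surprise.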

In [None]:
import itertools
from typing import NamedTuple


class Sale(NamedTuple):
    department: str
    product: str
    amount: float


sales: list[Sale] = [
    Sale("Electronics", "Laptop", 999.99),
    Sale("Electronics", "Phone", 699.99),
    Sale("Electronics", "Tablet", 449.99),
    Sale("Clothing", "Jacket", 89.99),
    Sale("Clothing", "Shoes", 129.99),
    Sale("Books", "Python 101", 39.99),
    Sale("Books", "Data Science", 49.99),
    Sale("Books", "Algorithms", 59.99),
]

# Data must be sorted by the grouping key
sales.sort(key=lambda s: s.department)

print("Sales by department:")
for dept, group in itertools.groupby(sales, key=lambda s: s.department):
    items = list(group)
    total = sum(s.amount for s in items)
    print(f"  {dept}: {len(items)} items, total ${total:.2f}")
    for sale in items:
        print(f"    - {sale.product}: ${sale.amount:.2f}")

# Run-length encoding: with no key function, groupby groups runs of equal values
numbers: list[int] = [1, 1, 2, 3, 3, 3, 4, 4, 5]
print(f"\nRun-length encoding:")
for value, group in itertools.groupby(numbers):
    count = sum(1 for _ in group)
    print(f"  {value} x {count}")

# Group by computed key: word length
words: list[str] = sorted(
    ["cat", "dog", "fish", "bird", "ant", "bear", "wolf", "bee", "ox"],
    key=len
)
print(f"\nWords grouped by length:")
for length, group in itertools.groupby(words, key=len):
    print(f"  {length} letters: {list(group)}")

## Section 4: Combinatorics

`itertools.product`, `combinations`, and `permutations` generate
combinatorial sequences lazily. These are essential for search problems,
testing, and mathematical applications.
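
As a sanity check, the number of items each function yields matches its closed-form count. The sketch below assumes Python 3.8 or later for `math.perm` and `math.comb`; the five-letter input and the 52-card `deck` are illustrative choices.

```python
import itertools
import math

items = "ABCDE"
n, r = len(items), 2

# Count the generated tuples without storing them.
n_product = sum(1 for _ in itertools.product(items, repeat=r))
n_perms = sum(1 for _ in itertools.permutations(items, r))
n_combs = sum(1 for _ in itertools.combinations(items, r))
n_combs_repl = sum(1 for _ in itertools.combinations_with_replacement(items, r))

print(n_product, n ** r)                      # 25 25
print(n_perms, math.perm(n, r))               # 20 20
print(n_combs, math.comb(n, r))               # 10 10
print(n_combs_repl, math.comb(n + r - 1, r))  # 15 15

# Laziness matters when the full space is large: 52-choose-5 has
# 2,598,960 hands, but islice generates only the three requested.
deck = range(52)
first_hands = list(itertools.islice(itertools.combinations(deck, 5), 3))
print(first_hands)  # [(0, 1, 2, 3, 4), (0, 1, 2, 3, 5), (0, 1, 2, 3, 6)]
```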

In [None]:
import itertools

# product: Cartesian product (replaces nested for loops)
colors: list[str] = ["red", "blue"]
sizes: list[str] = ["S", "M", "L"]

variants: list[tuple[str, str]] = list(itertools.product(colors, sizes))
print(f"Product variants ({len(variants)}):")
for color, size in variants:
    print(f"  {color}-{size}")

# product with repeat: all binary strings of length 3
binary_3: list[tuple[int, ...]] = list(itertools.product([0, 1], repeat=3))
print(f"\n3-bit binary: {binary_3}")

# combinations: unordered selections (no repetition)
team: list[str] = ["Alice", "Bob", "Carol", "Dave"]
pairs: list[tuple[str, str]] = list(itertools.combinations(team, 2))
print(f"\nAll pairs from {team}:")
for pair in pairs:
    print(f"  {pair[0]} & {pair[1]}")

# combinations_with_replacement
dice_doubles: list[tuple[int, int]] = list(
    itertools.combinations_with_replacement(range(1, 7), 2)
)
print(f"\nDice pairs (with doubles): {len(dice_doubles)} combinations")
print(f"First 5: {dice_doubles[:5]}")

# permutations: ordered arrangements
letters: list[str] = ["A", "B", "C"]
perms: list[tuple[str, ...]] = list(itertools.permutations(letters))
print(f"\nPermutations of {letters}: {perms}")

# Partial permutations (pick 2 from 4)
partial: list[tuple[str, ...]] = list(itertools.permutations("ABCD", 2))
print(f"2-permutations of ABCD ({len(partial)}): {partial}")

## Section 5: `accumulate`, `starmap`, and `functools.reduce`

`accumulate` produces running totals (or running applications of any
binary function). `starmap` applies a function to argument tuples,
unpacking each one as `func(*args)`. `reduce` collapses an iterable to a
single value.
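
The relationship between `accumulate` and `reduce`, and the role of a starting value, can be sketched as follows. The `deposits` data is invented for illustration, and the keyword-only `initial` argument of `accumulate` requires Python 3.8 or later.

```python
import functools
import itertools
import operator

deposits: list[int] = [50, -20, 30]

# `initial` seeds the running value, so the output has one more
# element than the input.
balances = list(itertools.accumulate(deposits, initial=100))
print(balances)  # [100, 150, 130, 160]

# reduce returns only the final value; it equals the last accumulate element.
final = functools.reduce(operator.add, deposits, 100)
print(final == balances[-1])  # True

# Without an initializer, reduce raises TypeError on an empty iterable;
# with one, it returns the initializer unchanged.
try:
    functools.reduce(operator.add, [])
    raised = False
except TypeError:
    raised = True
print(raised)  # True

empty_total = functools.reduce(operator.add, [], 0)
print(empty_total)  # 0
```

Passing an initializer is therefore the safer default whenever the input may be empty.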

In [None]:
import itertools
import operator
import functools

# accumulate: running total
monthly_revenue: list[int] = [100, 150, 120, 200, 180, 250]
cumulative: list[int] = list(itertools.accumulate(monthly_revenue))
print(f"Monthly revenue: {monthly_revenue}")
print(f"Cumulative:      {cumulative}")

# accumulate with custom function: running max
data: list[int] = [3, 1, 4, 1, 5, 9, 2, 6, 5]
running_max: list[int] = list(itertools.accumulate(data, max))
print(f"\nData:        {data}")
print(f"Running max: {running_max}")

# accumulate: factorial via running product
factorials: list[int] = list(itertools.accumulate(range(1, 8), operator.mul))
print(f"\nFactorials 1! to 7!: {factorials}")

# starmap: call distance(*args) for each argument tuple
points: list[tuple[float, float, float, float]] = [
    (0, 0, 3, 4),
    (1, 1, 4, 5),
    (0, 0, 0, 0),
]

import math

def distance(x1: float, y1: float, x2: float, y2: float) -> float:
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

distances: list[float] = list(itertools.starmap(distance, points))
print(f"\nPoint pairs -> distances: {[f'{d:.2f}' for d in distances]}")

# starmap with pow: [2^5, 3^2, 10^3]
powers: list[int] = list(itertools.starmap(pow, [(2, 5), (3, 2), (10, 3)]))
print(f"Powers: {powers}")

# functools.reduce: collapse to single value
numbers: list[int] = [1, 2, 3, 4, 5]

product = functools.reduce(operator.mul, numbers)
print(f"\nProduct of {numbers}: {product}")

# reduce with an initial value ({}): merge dicts left-to-right, later keys win
configs: list[dict[str, int]] = [
    {"timeout": 30, "retries": 3},
    {"retries": 5, "port": 8080},
    {"debug": 1},
]
merged: dict[str, int] = functools.reduce(lambda a, b: {**a, **b}, configs, {})
print(f"Merged configs: {merged}")

## Section 6: Building Data Pipelines with Chained Generators

Generators can be composed into multi-stage processing pipelines where
each stage lazily transforms data from the previous stage. This pattern
processes data element-by-element, keeping memory usage constant regardless
of input size.
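
To make the element-by-element behaviour visible, the following sketch records the order in which three toy stages touch each value. The stage names (`produce`, `keep_even`, `square`) and the `trace` list are hypothetical, introduced only for this illustration.

```python
from typing import Iterable, Iterator

trace: list[str] = []  # records the order in which stages touch each item


def produce(n: int) -> Iterator[int]:
    for i in range(n):
        trace.append(f"produce {i}")
        yield i


def keep_even(values: Iterable[int]) -> Iterator[int]:
    for value in values:
        trace.append(f"filter {value}")
        if value % 2 == 0:
            yield value


def square(values: Iterable[int]) -> Iterator[int]:
    for value in values:
        trace.append(f"square {value}")
        yield value * value


pipeline = square(keep_even(produce(4)))
trace_before = list(trace)
print(trace_before)  # [] because building the pipeline runs no stage code

results = list(pipeline)
print(results)  # [0, 4]
for step in trace:
    print(step)
```

The trace interleaves the stages (`produce 0`, `filter 0`, `square 0`, `produce 1`, `filter 1`, `produce 2`, ...): each value travels through the whole pipeline before the next one is produced, so no stage ever holds more than one item.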

In [None]:
from typing import Iterator, Iterable


# Pipeline stage 1: generate raw sensor readings
def sensor_readings(n: int) -> Iterator[dict[str, float]]:
    """Simulate n sensor readings with occasional bad data."""
    import random
    rng = random.Random(42)  # local RNG: reproducible without reseeding the global one
    for i in range(n):
        temp = rng.gauss(25.0, 5.0)
        # Occasionally inject invalid readings
        if rng.random() < 0.1:
            temp = -999.0  # Sensor error
        yield {"id": i, "temp_c": round(temp, 2)}


# Pipeline stage 2: filter out bad readings
def valid_only(readings: Iterable[dict[str, float]]) -> Iterator[dict[str, float]]:
    """Drop readings with invalid temperature."""
    for reading in readings:
        if reading["temp_c"] > -100:
            yield reading


# Pipeline stage 3: convert Celsius to Fahrenheit
def add_fahrenheit(readings: Iterable[dict[str, float]]) -> Iterator[dict[str, float]]:
    """Add Fahrenheit conversion to each reading."""
    for reading in readings:
        reading["temp_f"] = round(reading["temp_c"] * 9 / 5 + 32, 2)
        yield reading


# Pipeline stage 4: flag extreme values
def flag_extremes(
    readings: Iterable[dict[str, float]], threshold: float = 35.0
) -> Iterator[dict]:
    """Flag readings above the threshold."""
    for reading in readings:
        reading["extreme"] = reading["temp_c"] > threshold
        yield reading


# Compose the pipeline -- nothing executes yet
raw = sensor_readings(20)
cleaned = valid_only(raw)
converted = add_fahrenheit(cleaned)
flagged = flag_extremes(converted)

# Drive the pipeline by consuming the final generator
print("Sensor pipeline results:")
extreme_count = 0
total_count = 0
for reading in flagged:
    total_count += 1
    marker = " [EXTREME]" if reading["extreme"] else ""
    if reading["extreme"]:
        extreme_count += 1
    print(f"  #{reading['id']:02d}: {reading['temp_c']:6.2f}C / {reading['temp_f']:6.2f}F{marker}")

print(f"\nProcessed: {total_count} valid readings, {extreme_count} extreme")

## Section 7: Real-World Example -- Processing a Large Log File Lazily

This example demonstrates processing a large log file without loading it
entirely into memory. Each stage of the pipeline handles one line at a time,
so memory usage stays constant even for gigabyte-sized files.
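
The same idea applies to a file on disk, because a file object returned by `open` yields one line at a time. The sketch below is a minimal illustration under stated assumptions: `count_levels` is a hypothetical helper, a throwaway temporary file stands in for a real log, and the level is assumed to be the third whitespace-separated field, as in the sample log used in this section.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_levels(path: Path) -> Counter:
    """Count log levels while holding only one line in memory at a time."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as handle:
        for line in handle:  # file objects iterate lazily, line by line
            fields = line.split(maxsplit=3)
            if len(fields) >= 3:  # skip blank or malformed lines
                counts[fields[2]] += 1
    return counts


# Throwaway file standing in for a real, much larger log.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "app.log"
    log_path.write_text(
        "2024-01-15 08:00:01 INFO Application started\n"
        "2024-01-15 08:00:05 WARNING Slow query detected (2.3s)\n"
        "2024-01-15 08:00:10 ERROR Connection timeout to cache server\n"
        "2024-01-15 08:00:11 INFO Retrying cache connection\n",
        encoding="utf-8",
    )
    level_totals = count_levels(log_path)

print(level_totals["INFO"], level_totals["WARNING"], level_totals["ERROR"])  # 2 1 1
```

Only the `Counter` grows with the data, and its size depends on the number of distinct levels, not on the number of lines.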

In [None]:
import itertools
import re
from typing import Iterator, Iterable, NamedTuple
from collections import Counter
import io


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    message: str


# Simulate a large log file
SAMPLE_LOG = """2024-01-15 08:00:01 INFO  Application started
2024-01-15 08:00:02 DEBUG Loading configuration from /etc/app.conf
2024-01-15 08:00:03 INFO  Connected to database
2024-01-15 08:00:05 WARNING Slow query detected (2.3s)
2024-01-15 08:00:10 ERROR Connection timeout to cache server
2024-01-15 08:00:11 INFO  Retrying cache connection
2024-01-15 08:00:12 INFO  Cache connection restored
2024-01-15 08:00:15 DEBUG Processing batch of 1000 records
2024-01-15 08:00:20 WARNING Memory usage at 85%
2024-01-15 08:00:25 ERROR Failed to write to /var/log/audit.log: Permission denied
2024-01-15 08:00:30 INFO  Batch processing complete
2024-01-15 08:00:31 DEBUG Cleaning up temporary files
2024-01-15 08:00:35 WARNING Disk usage at 90%
2024-01-15 08:00:40 ERROR OutOfMemoryError in worker thread 3
2024-01-15 08:00:41 INFO  Restarting worker thread 3
2024-01-15 08:00:45 INFO  Health check passed
"""

LOG_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s+(.+)$"
)


# Stage 1: Read lines lazily (simulates reading a real file)
def read_lines(source: io.StringIO) -> Iterator[str]:
    """Yield stripped lines from a file-like object."""
    for line in source:
        stripped = line.rstrip()
        if stripped:
            yield stripped


# Stage 2: Parse each line into a structured LogEntry
def parse_entries(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Parse raw log lines into LogEntry objects."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield LogEntry(
                timestamp=match.group(1),
                level=match.group(2),
                message=match.group(3),
            )


# Stage 3: Filter by log level
def filter_level(
    entries: Iterable[LogEntry], min_level: str = "WARNING"
) -> Iterator[LogEntry]:
    """Keep only entries at or above the specified severity."""
    levels = {"DEBUG": 0, "INFO": 1, "WARNING": 2, "ERROR": 3, "CRITICAL": 4}
    threshold = levels.get(min_level, 0)
    for entry in entries:
        if levels.get(entry.level, 0) >= threshold:
            yield entry


# Compose and run the pipeline
log_file = io.StringIO(SAMPLE_LOG)
lines = read_lines(log_file)
entries = parse_entries(lines)
warnings_and_errors = filter_level(entries, "WARNING")

print("=== Warnings and Errors ===")
for entry in warnings_and_errors:
    print(f"  [{entry.level:7s}] {entry.timestamp} - {entry.message}")

# Analytics pass: count entries by level without building a list of all entries
log_file.seek(0)
level_counts = Counter(e.level for e in parse_entries(read_lines(log_file)))

print(f"\n=== Log Level Distribution ===")
for level, count in level_counts.most_common():
    bar = "#" * (count * 3)
    print(f"  {level:8s} {bar} ({count})")

# Use islice for pagination: show only first 3 errors/warnings
log_file.seek(0)
first_3_issues = list(itertools.islice(
    filter_level(parse_entries(read_lines(log_file)), "WARNING"),
    3
))
print(f"\n=== First 3 Issues ===")
for entry in first_3_issues:
    print(f"  {entry.level}: {entry.message}")

## Summary

### itertools Quick Reference
| Function | Purpose |
|---|---|
| `chain(*iterables)` | Concatenate multiple iterables |
| `chain.from_iterable(it)` | Flatten an iterable of iterables |
| `islice(it, [start,] stop [, step])` | Lazy slicing of any iterator |
| `groupby(it, key=)` | Group consecutive elements by key |
| `product(*its, repeat=)` | Cartesian product |
| `combinations(it, r)` | r-length unordered selections |
| `permutations(it, r=None)` | r-length ordered arrangements (full length when `r` is omitted) |
| `accumulate(it, func=None)` | Running totals, or running results of any binary function |
| `starmap(func, it)` | Call `func(*args)` for each argument tuple |

### Key Patterns
- **Sort by the grouping key before `groupby`** when you want one group per key -- it only groups consecutive elements
- **Use `islice`** instead of `list(gen)[:n]` to avoid materializing the full sequence
- **Chain generators** into pipelines for memory-efficient processing
- **`functools.reduce`** for cumulative operations that collapse to a single value
- **Generator pipelines** process data element-by-element, keeping memory constant