# Item 31: Be Defensive When Iterating Over Arguments

When a function takes a `list` of objects as a parameter, it's often important to iterate over that `list` multiple times.

In [1]:
# Say we'd like to figure out what percentage of overall tourism each city receives. To do this, we need a
# normalization function that sums the inputs to determine the total number of tourists per year and then divides
# each city's individual visitor count by the total to find that city's contribution to the whole
def normalize(numbers):
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result

In [3]:
# Function works as expected when given a list of visits
visits = [15, 35, 80]
percentages = normalize(visits)
print(percentages)
assert sum(percentages) == 100.0

[11.538461538461538, 26.923076923076923, 61.53846153846154]


In [4]:
# We define a generator to read data from a file because then we can reuse the same function later
def read_visits(data_path):
    with open(data_path) as f:
        for line in f:
            yield int(line)

In [6]:
# Calling normalize on the read_visits generator's return value produces no result
it = read_visits('my_numbers.txt')
percentages = normalize(it)
print(percentages)

[]


The code above returns an empty list because an iterator produces its results only a single time. In `normalize`, the call to `sum` consumes the entire iterator, so the `for` loop that follows has nothing left to read. If we iterate over an iterator or a generator that has already raised a `StopIteration` exception, we won't get any results the second time around:

In [7]:
it = read_visits('my_numbers.txt')
print(list(it))
print(list(it)) # Already exhausted

[15, 35, 80]
[]


Confusingly, we also won't get any errors when iterating over an already exhausted iterator. `for` loops, the `list` constructor, and many other functions throughout the Python Standard Library expect the `StopIteration` exception to be raised during normal operation. These functions can't tell the difference between an iterator that has no output and an iterator that had output and is now exhausted.
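
As a small illustration of that last point, the sketch below builds one iterator that never had any data and one that has been drained. The helper names `empty_source` and `exhausted_source` are hypothetical and exist only for this example; consumers such as `list` and `next` see the two as identical.

```python
# Hypothetical helpers for illustration; they are not part of the notebook above.
def empty_source():
    # An iterator that never had anything to produce
    return iter([])

def exhausted_source():
    # An iterator that had data, but all of it has been consumed
    it = iter([15, 35, 80])
    list(it)
    return it

# Both produce an empty list, and neither raises an error
print(list(empty_source()))
print(list(exhausted_source()))

# next() with a default value cannot tell them apart either
sentinel = object()
assert next(empty_source(), sentinel) is sentinel
assert next(exhausted_source(), sentinel) is sentinel
```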

In [8]:
# To solve the above issue, we can explicitly exhaust an input iterator and keep a copy of its entire content in a
# list. We can then iterate over the list version of the data as many times as we need. Here's the same function as
# before, but it defensively copies the input iterator
def normalize_copy(numbers):
    numbers_copy = list(numbers) # Copy the iterator
    total = sum(numbers_copy)
    result = []
    for value in numbers_copy:
        percent = 100 * value / total
        result.append(percent)
    return result

In [9]:
# Now the function works correctly on the read_visits generator's return value
it = read_visits('my_numbers.txt')
percentages = normalize_copy(it)
print(percentages)
assert sum(percentages) == 100.0

[11.538461538461538, 26.923076923076923, 61.53846153846154]


The problem with this approach is that the copy of the input iterator's content could be extremely large. Copying the iterator could cause the program to run out of memory and crash. This potential for scalability issues undermines the reason that the author wrote the `read_visits` function as a generator. 

In [10]:
# One way to get around the issue mentioned above is to accept a function that returns a new iterator each time it is
# called
def normalize_func(get_iter):
    total = sum(get_iter()) # New iterator
    result = []
    for value in get_iter(): # New iterator
        percent = 100 * value / total
        result.append(percent)
    return result

In [11]:
# To use normalize_func, we can pass a lambda expression that calls the generator and produces a new iterator each time
path = 'my_numbers.txt'
percentages = normalize_func(lambda: read_visits(path))
print(percentages)
assert sum(percentages) == 100.0

[11.538461538461538, 26.923076923076923, 61.53846153846154]


Although the above code works, having to pass a lambda function as a parameter to `normalize_func` is clumsy. A better way to achieve the same result is to provide a new container class that implements the *Iterator Protocol*.

The *Iterator Protocol* is how Python `for` loops and related expressions traverse the contents of a container type. When Python sees a statement like `for x in foo`, it actually calls `iter(foo)`. The `iter` built-in function calls the `foo.__iter__` special method in turn. The `__iter__` special method must return an iterator object (which itself implements the `__next__` special method). Then, the `for` loop repeatedly calls the `next` built-in function on the iterator object until it's exhausted (indicated by raising a `StopIteration` exception).
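
The steps above can be sketched by writing out by hand roughly what a `for` loop does. The function name `manual_for` is hypothetical, and this is only an approximation of the interpreter's behavior, not its actual implementation.

```python
def manual_for(iterable):
    """Collect items roughly the way a for loop would, using iter() and next()."""
    collected = []
    iterator = iter(iterable)       # Asks the object for an iterator
    while True:
        try:
            value = next(iterator)  # Asks the iterator for its next value
        except StopIteration:
            break                   # Exhaustion ends the loop without an error
        collected.append(value)
    return collected

visits = [15, 35, 80]
assert manual_for(visits) == [value for value in visits]
```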

In [16]:
# We can achieve the behavior described above by implementing the __iter__ method as a generator in our class
class ReadVisits:
    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        with open(self.data_path) as f:
            for line in f:
                yield int(line)

In [17]:
# This new container type works correctly when passed to the original function
visits = ReadVisits(path)
percentages = normalize(visits)
print(percentages)
assert sum(percentages) == 100.0

[11.538461538461538, 26.923076923076923, 61.53846153846154]


This works because the `sum` function in `normalize` calls `ReadVisits.__iter__` to allocate a new iterator object. The `for` loop that normalizes the numbers also calls `__iter__` to allocate a second iterator object. Each of those iterators will be advanced and exhausted independently, ensuring that each unique iteration sees all of the input data values. The only downside to this approach is that it reads the input data multiple times.
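
To make the two separate passes visible, here is a minimal sketch that uses a hypothetical `CountingVisits` class. It is an in-memory stand-in for `ReadVisits` that counts how many times `__iter__` is called, and `normalize` is repeated so that the example runs on its own.

```python
class CountingVisits:
    """Hypothetical in-memory stand-in for ReadVisits that counts __iter__ calls."""

    def __init__(self, values):
        self.values = values
        self.iter_calls = 0

    def __iter__(self):
        self.iter_calls += 1
        return iter(self.values)  # A fresh iterator on every call

def normalize(numbers):
    total = sum(numbers)          # First pass over the data
    result = []
    for value in numbers:         # Second pass over the data
        percent = 100 * value / total
        result.append(percent)
    return result

visits = CountingVisits([15, 35, 80])
percentages = normalize(visits)
print(visits.iter_calls)  # 2
```

With a file-backed container like `ReadVisits`, each of those calls would open and read the file again.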

The protocol states that when an iterator is passed to the `iter` built-in function, `iter` returns the iterator itself. In contrast, when a container type is passed to `iter`, a new iterator object is returned each time.
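
A quick way to observe this difference, using only a built-in `list` and a small generator (the name `one_value` is hypothetical):

```python
visits = [15, 35, 80]

# An iterator passed to iter() comes back unchanged
it = iter(visits)
assert iter(it) is it

# A container passed to iter() hands out a distinct iterator each time
first = iter(visits)
second = iter(visits)
assert first is not second
assert first is not visits

# Generator objects are iterators too, so they also return themselves
def one_value():
    yield 1

gen = one_value()
assert iter(gen) is gen
```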

In [18]:
# We can test an input value for this behavior and raise a TypeError to reject arguments that can't be repeatedly
# iterated over
def normalize_defensive(numbers):
    if iter(numbers) is numbers:
        raise TypeError('Must supply a container')
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result

In [23]:
from collections.abc import Iterator

# We can also test if the parameter is an Iterator or not by using the Iterator class in the collections.abc built-in
# module
def normalize_defensive(numbers):
    if isinstance(numbers, Iterator): # Another way to check
        raise TypeError('Must supply a container')
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result

The approach of using a container is ideal if we don't want to copy the full input iterator, as we did in the `normalize_copy` function, but we also need to iterate over the input data multiple times. 

In [24]:
# This function works as expected for list and ReadVisits inputs because they are iterable containers that follow the
# iterator protocol
visits = [15, 35, 80]
percentages = normalize_defensive(visits)
assert sum(percentages) == 100.0

visits = ReadVisits(path)
percentages = normalize_defensive(visits)
assert sum(percentages) == 100.0

In [25]:
# The function raises an exception if the input is an iterator rather than a container
visits = [15, 35, 80]
it = iter(visits)
normalize_defensive(it)

TypeError: Must supply a container