# Generators 

In [1]:
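# Consider generators instead of returning lists
# This content is based on Brett Slatkin's Effective Python video lectures.
def index_words(handle):
    """Yield the character offset at which each word in a file starts."""
    offset = 0
    for line in handle:
        if line:
            yield offset  # the first word of the line starts here
        for letter in line:
            offset += 1
            if letter == ' ':
                yield offset  # the next word starts right after a space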

In [2]:
texts = "he simplest choice for functions that produce a sequence of results is to return a list of items. For example, say you want to find the index of every word in a string. So, here I'm gonna define such a function. Takes the input texts, then it's going to have the results of the index of each word in that text. I'm assuming the text doesn't have any whitespace before or after it. And so if the text is not empty, an empty string, then we know that the very first position is a word. "

with open('sampletext.txt', 'w') as f:
    f.write(texts)

In [3]:
with open('sampletext.txt', 'r') as f:
    it = index_words(f)
    print(next(it))
    print(next(it))

0
3


In [23]:
# Memory-inefficient version: builds the full list of indices before returning
def index_indicator(statement):
    result = []
    if statement:
        result.append(0)
    for index, letter in enumerate(statement):
        if letter == ' ':
            result.append(index+1)
    return result

In [33]:
with open('sampletext.txt', 'r') as f:
    result = index_indicator(f.readline())
len(result)

96

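- The second function stores every word index in a list before returning, so its memory use grows with the input. On a huge text file it could exhaust memory and crash the program. The generator version only needs to hold one line and the current offset at a time.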
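
As a minimal sketch of the contrast, the list-building function can be rewritten as a generator that works on a string. The name `index_indicator_iter` and the sample sentence are made up for illustration; they are not part of the original notebook.

```python
from itertools import islice

def index_indicator_iter(statement):
    """Hypothetical generator counterpart of index_indicator."""
    if statement:
        yield 0
    for index, letter in enumerate(statement):
        if letter == ' ':
            yield index + 1

sample = "consider generators instead of lists"  # illustrative input

# Pull only the first three word offsets; the rest are never computed.
first_three = list(islice(index_indicator_iter(sample), 3))
print(first_three)  # [0, 9, 20]

# Wrapping the generator in list() gives the full result when it is needed.
all_offsets = list(index_indicator_iter(sample))
print(all_offsets)  # [0, 9, 20, 28, 31]
```

The caller decides how much of the sequence to materialize, which is the point of yielding instead of appending to a list.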

### Iterations

In [35]:
data = [15, 80, 75]

def normalize(numbers):
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result


output = normalize(data)
print(output)

[8.823529411764707, 47.05882352941177, 44.11764705882353]


In [36]:
# write data to file
data = [15, 80, 75, 90, 12, 45]
with open('numbers_data.txt', 'w') as f:
    for i in data:
        f.write('{}\n'.format(i))


In [37]:
# open data file with a generator
def read_numbers(path):
    with open(path, 'r') as f:
        for line in f:
            yield int(line)

In [38]:
read_numbers(path='numbers_data.txt')

<generator object read_numbers at 0x10aa60ed0>

In [39]:
list(read_numbers(path='numbers_data.txt'))

[15, 80, 75, 90, 12, 45]

In [41]:
# Now see how normalize behaves when it is given an iterator
data_list = read_numbers(path='numbers_data.txt')
print(normalize(data_list))

[]


- The cause of this behavior is that an iterator produces its results only once. If you iterate over an iterator or generator that has already raised a `StopIteration` exception, you get no results the second time around.
- In `normalize`, `sum(numbers)` consumes the whole iterator, so the `for` loop that follows finds nothing left and the result is an empty list.
- Calling `list` on an exhausted iterator raises no exception and shows no obvious problem; it just returns an empty list, and so does every call after that.
- The problem is that functions like `list` can't tell the difference between an iterator that has no output at all and an iterator that had output and is now exhausted. `StopIteration` is raised in both cases, and Python treats it as normal operation.
- The same behavior affects `for` loops and many other functions throughout the Python standard library that expect the `StopIteration` exception.
- To solve this problem, you can explicitly exhaust an input iterator and keep a copy of its entire contents in a list.
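
The same behavior can be reproduced without a file. In this sketch the generator `countdown` is a made-up stand-in for `read_numbers`, used only to keep the example self-contained.

```python
def countdown(n):
    """Hypothetical stand-in generator: yields n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1

it = countdown(3)
first_pass = list(it)    # consumes every value
second_pass = list(it)   # already exhausted: empty list, no exception

looped = []
for value in it:         # a for loop over the exhausted iterator does nothing
    looped.append(value)

empty = list(countdown(0))  # a generator that never had any output

try:
    next(it)
    stopped = False
except StopIteration:
    stopped = True       # calling next() directly is where exhaustion shows

print(first_pass, second_pass, looped, empty, stopped)
# [3, 2, 1] [] [] [] True
```

From the caller's side, `second_pass` and `empty` look identical, even though only one of them came from a source that had data.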

In [43]:
iterator = read_numbers(path='numbers_data.txt')
print(list(iterator))
print(list(iterator))

[15, 80, 75, 90, 12, 45]
[]


- To avoid the exhausted-iterator problem, change `normalize` so that it first copies its input with `numbers = list(numbers)`. Both `sum` and the `for` loop then iterate over a list, which can't be exhausted, so every value is seen both times.
- If we call `read_numbers` on the same path as before and pass that iterator into this version of `normalize`, it works properly.
- Unfortunately, it's not that easy. We've solved one problem and created another: the list made to prevent iterator exhaustion is a full copy of the input data, so a very large input could use a lot of memory.

In [45]:
# solution
def normalize(numbers):
    numbers = list(numbers) # copy the iterator
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result

it = read_numbers('numbers_data.txt')
print(normalize(it))

[4.73186119873817, 25.236593059936908, 23.65930599369085, 28.391167192429023, 3.7854889589905363, 14.195583596214512]


- So copying hasn't fundamentally solved the issue; it has only removed the surprising behavior. An alternative is to have `normalize` accept a function that returns a new iterator each time it is called. The big difference from the earlier versions is that we can't pass an iterator straight into `normalize` anymore. Instead we create a lambda that calls `read_numbers` and returns a fresh iterator, and we pass that lambda into `normalize`. With that change we get the correct behavior.

In [48]:
# Another way: accept a function that returns a new iterator on each call
def normalize(get_iter):
    total = sum(get_iter()) # New iterator
    print(total)
    result = []
    for value in get_iter(): # New iterator
        percent = 100 * value / total
        result.append(percent)
    return result

get_iter = lambda: read_numbers('numbers_data.txt')
print(normalize(get_iter))

317
[4.73186119873817, 25.236593059936908, 23.65930599369085, 28.391167192429023, 3.7854889589905363, 14.195583596214512]


- Inside `normalize`, `get_iter` is called twice: once for `sum` and once for the `for` loop. Each call evaluates the lambda, which calls **read_numbers** and returns a new generator.
- We get the correct behavior, but the consequence is that `read_numbers` runs twice, so the data file is opened and read in full twice. Making two passes through the entire input is the downside.
- The upside is that it can handle any amount of data the file contains, because the input is never copied into a list.
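
To make the two passes visible, here is a small sketch that counts how often the iterator factory is invoked. The names `get_numbers`, `normalize_func`, and the `passes` counter are illustrative, and an in-memory list stands in for the data file.

```python
passes = 0

def get_numbers():
    """Hypothetical factory: returns a fresh iterator on every call."""
    global passes
    passes += 1
    return iter([15, 80, 75, 90, 12, 45])

def normalize_func(get_iter):
    total = sum(get_iter())  # first pass over the data
    # second pass over the data, using a brand-new iterator
    return [100 * value / total for value in get_iter()]

percentages = normalize_func(get_numbers)
print(passes)  # 2
```

With a real file behind the factory, each of those two calls would mean opening and reading the file again.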

In [49]:
lambda: read_numbers('numbers_data.txt')

<function __main__.<lambda>()>

In [57]:
# A more Pythonic way: an iterable container class that implements __iter__
class ReadNumbers(object):
    def __init__(self, data_path):
        self.data_path = data_path
    
    def __iter__(self):
        with open(self.data_path) as f:
            for line in f:
                yield int(line)
                
visits = ReadNumbers('numbers_data.txt')


def normalize(numbers):
    total = sum(numbers)  # sum() iterates over visits via ReadNumbers.__iter__ to compute the total
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result


print(normalize(visits))

[4.73186119873817, 25.236593059936908, 23.65930599369085, 28.391167192429023, 3.7854889589905363, 14.195583596214512]


- Why does this work where it didn't before? The `visits` object gets passed into `normalize`, which iterates over the `numbers` object more than once.
- The first iteration is done by the `sum` built-in. To iterate over `numbers`, `sum` calls the `iter` built-in function on it, which in turn calls `ReadNumbers.__iter__`. That method opens the data file and returns a new generator over its contents. `sum` runs that iterator until it is exhausted and produces the total.
- Further down, the `for` loop calls `iter` on `numbers` a second time. That calls `__iter__` again, which opens the file a second time and yields all the numbers a second time.
- So this works because we make two passes through the data, but the function has no idea. It only knows that it was passed a container of numbers, and it expects that iterating over that container multiple times works just as it would with a standard Python list.
- Now that you know how containers like `ReadNumbers` work, you can write your own functions to ensure that parameters like `numbers` in `normalize` aren't plain iterators that would cause this buggy behavior.
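
One way to enforce this, sketched here under the assumption that rejecting plain iterators is acceptable for your use case, relies on the iterator protocol: calling `iter` on an iterator returns the iterator itself, while calling `iter` on a container returns a new iterator object each time. The name `normalize_defensive` and the error message are illustrative.

```python
def normalize_defensive(numbers):
    # iter() on an iterator returns the very same object,
    # so this identity check detects one-shot inputs.
    if iter(numbers) is numbers:
        raise TypeError('Must supply a container, not an iterator')
    total = sum(numbers)
    return [100 * value / total for value in numbers]

data = [15, 80, 75]
print(normalize_defensive(data))     # a list is a container, so this works

try:
    normalize_defensive(iter(data))  # a plain iterator is rejected
    rejected = False
except TypeError:
    rejected = True
print(rejected)  # True
```

An instance of a class like `ReadNumbers` would also pass the check, because its `__iter__` returns a new generator on each call rather than the object itself.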