# Agenda

1. Review the iterator protocol
2. Generator functions
    - How to define them
    - How they're different from regular functions
    - Keeping state across invocations
    - How do they work? What's happening behind the scenes?
3. Generator expressions (aka generator comprehensions)
    - How to define them
    - Where/when/how to use them

# The iterator protocol

How does iteration work in Python?  We know that it works on many different types of values.

In [1]:
for one_item in 'abcde':
    print(one_item)

a
b
c
d
e


In [2]:
for one_item in [10, 20, 30, 40, 50]:
    print(one_item)

10
20
30
40
50


In [3]:
d = {'a':10, 'b':20, 'c':30}

for one_item in d:
    print(one_item)

a
b
c


# Things are iterable!

The iterator protocol in Python means that we can run a `for` loop over many different kinds of objects, because so many of Python's types are *iterable*. For an object to be iterable, it needs a few things to be true:

1. It needs to respond to the `iter` builtin function. If you run `iter` on an object, it'll either give you its iterator, or it'll respond with a `TypeError`.
2. The iterator it responds with can be itself, or can be another object. The key thing about an iterator is that it knows how to respond to the `next` builtin function.
3. When the iterator wants to indicate that it has no more values to share, it raises `StopIteration`.
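The three steps above can be traced by hand. Here is a minimal sketch of roughly what a `for` loop does on our behalf; the function name `manual_loop` is made up for this illustration, and the real loop is implemented inside the interpreter, not in Python code like this.

```python
def manual_loop(iterable):
    """Collect items the way a for loop would, using iter() and next() directly."""
    iterator = iter(iterable)                # step 1: ask the object for its iterator
    items = []
    while True:
        try:
            items.append(next(iterator))     # step 2: ask the iterator for its next value
        except StopIteration:                # step 3: the iterator has nothing left
            break
    return items

print(manual_loop('abc'))                 # ['a', 'b', 'c']
print(manual_loop({'a': 10, 'b': 20}))    # ['a', 'b']

try:
    iter(5)                               # integers are not iterable
except TypeError as e:
    print(f'TypeError: {e}')
```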

In this session, we're going to be talking about *generators*. Generators are iterable objects, meaning that they know how to behave inside of a `for` loop. But they are written very differently from most iterables -- they look like functions, and give us all of the benefits of functions.

If I use generators, then I can think at a higher level, in terms of a function, rather than the low-level mechanics of `iter`, `next`, and `StopIteration`.
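To get a feel for those low-level details, here is a sketch of a class that produces 1, 2, and 3 by implementing the protocol by hand. The class name `OneTwoThree` is invented for this example. This is the kind of bookkeeping that a generator lets us skip.

```python
class OneTwoThree:
    """An iterable that produces 1, 2, 3, written without generators."""

    def __init__(self):
        self.current = 0          # we have to store our own state

    def __iter__(self):
        return self               # this object is its own iterator

    def __next__(self):
        if self.current >= 3:
            raise StopIteration   # signal that there are no more values
        self.current += 1
        return self.current

for one_item in OneTwoThree():
    print(one_item)
```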

In [4]:
# let's start with the world's dumbest Python function

def myfunc():
    return 1
    return 2
    return 3

In [5]:
myfunc()

1

In [6]:
import dis  # Python bytecode disassembler

In [7]:
dis.dis(myfunc)

  3           RESUME                   0

  4           RETURN_CONST             1 (1)


In [8]:
# what if I do something that appears slightly different?

def myfunc():
    yield 1
    yield 2
    yield 3

In [9]:
# what happens when I invoke this function? What do I get back?

myfunc()

<generator object myfunc at 0x107baa560>

Because I get a generator object back, I can put this object in a `for` loop, and it'll know how to behave.

In [10]:
g = myfunc()   # assign it to g

In [11]:
iter(g)

<generator object myfunc at 0x107bab3d0>

In [12]:
g

<generator object myfunc at 0x107bab3d0>

In [14]:
iter(g) is g    # a generator is its own iterator

True

In [15]:
next(g)

1

In [16]:
next(g)

2

In [17]:
next(g)

3

In [18]:
next(g)

StopIteration: 

# What's happening?

When Python compiles a function and notices the `yield` keyword in its body, it marks the function as a *generator function*. When we then invoke the function, the function body doesn't run. Rather, we get back a generator object that, when iterated over, will run the function body.

With each iteration -- that is, each invocation of `next` -- the function will run up to and including the next `yield` statement.

`yield` is kind of like `return`, in that it returns a value. But `return` also exits from the function. `yield`, by contrast, goes to sleep immediately after returning the value. This means that when we ask for the next value, the function picks up from where it left off, and keeps going as if nothing happened.

If you use `return` in a generator function body, that just ends the iteration by raising `StopIteration`. If the function body ends naturally, then it also raises `StopIteration`. You can use `return` with a value, but you probably don't want to, since a `for` loop will never see it: the value is stored in the `value` attribute of the `StopIteration` exception that was raised.
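Here is a small sketch of where a returned value ends up. The function name `two_then_done` and the string `'finished'` are made up for the example.

```python
def two_then_done():
    yield 1
    yield 2
    return 'finished'      # stored on the StopIteration exception

g = two_then_done()
print(next(g))             # 1
print(next(g))             # 2

try:
    next(g)
except StopIteration as e:
    returned = e.value     # the only place the returned value shows up
    print(returned)        # finished

# a for loop (or list) handles StopIteration quietly, so the value is lost
print(list(two_then_done()))   # [1, 2]
```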

In [19]:
# will this work?

for one_item in myfunc():
    print(one_item)

1
2
3


In [22]:
# a generator function is a function, and can do function-like things!


def myfunc():
    print('\tLine 5, before yield 1')
    yield 1
    print('\tLine 7, after yield 1 and before yield 2')
    yield 2
    print('\tLine 9, after yield 2 and before yield 3')
    yield 3
    print('\tLine 11, after yield 3')

In [23]:
for one_item in myfunc():
    print(one_item)

	Line 5, before yield 1
1
	Line 7, after yield 1 and before yield 2
2
	Line 9, after yield 2 and before yield 3
3
	Line 11, after yield 3


In [24]:
def squares(n):
    for one_number in range(n):
        yield one_number ** 2

In [25]:
for one_item in range(10):
    print(one_item, end=' ')

0 1 2 3 4 5 6 7 8 9 

In [26]:
for one_item in squares(10):
    print(one_item, end=' ')

0 1 4 9 16 25 36 49 64 81 

# Why generator functions?

1. We might have an infinite (or just very long) series of values we're going to get. This way, we can get them one at a time, rather than all at once in a huge list.
2. It's easy(ish) to write a generator function that filters values, or that maps them to another value.
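As an illustration of the second point, here is a sketch of a generator function that maps each value to another value. The name `map_values` is hypothetical, and Python's builtin `map` already does much the same thing; the point is only to show how little code it takes.

```python
def map_values(func, iterable):
    """Yield func(item) for each item in iterable, one at a time."""
    for one_item in iterable:
        yield func(one_item)

g = map_values(len, ['this', 'is', 'a', 'test'])

print(next(g))    # 4 -- only the first word has been processed so far
print(list(g))    # [2, 1, 4] -- the rest, computed on demand
```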


# Exercise: Only evens

Write a generator function that takes a list (or other iterable) of integers as an argument. It should yield, with each iteration, the next EVEN number in that iterable. When we get to the end of the values (or just the evens), the generator should exit.



In [27]:
def only_evens(numbers):
    for one_item in numbers:
        if one_item % 2 == 0:
            yield one_item

In [29]:
g = only_evens([10, 11, 12, 13, 15, 18, 21, 22])

In [30]:
for one_item in g:
    print(one_item)

10
12
18
22


# When are generators removed from memory?

Python manages memory for us. It normally removes any object whose reference count (i.e., the number of references to it from variables and other data structures) drops to zero. 

If a global variable refers to a generator, then that reference stays in place until you delete or reassign the variable, or until Python exits!

This means that if you don't delete a variable referring to a generator that you no longer care about, the generator will still consume memory. In many cases, that's not a big deal, because the whole point of a generator is that it doesn't consume lots of memory, but does things in little pieces.
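One way to watch this happen is with a weak reference, which lets us check whether an object still exists without keeping it alive. This is a sketch that assumes CPython, where an object is freed as soon as its reference count hits zero; other Python implementations may free it later. The names `countdown` and `watcher` are made up for the example.

```python
import weakref

def countdown(n):
    while n > 0:
        yield n
        n -= 1

g = countdown(3)
watcher = weakref.ref(g)     # a weak reference doesn't add to the reference count

print(next(g))               # 3
print(watcher() is g)        # True: the generator is still alive

del g                        # remove the only real reference to the generator
print(watcher())             # None on CPython: the generator has been removed
```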



In [31]:
# what about an infinite sequence? That's perfect for a generator!
# Fibonacci sequence

def fib():
    first = 0
    second = 1

    while True:
        yield first
        first, second = second, first+second

g = fib()        

In [32]:
# if I'm really dumb or don't like my computer (or my job), I can say

# list(g)

In [33]:
for one_item in g:
    print(one_item, end=' ')

    if one_item > 10_000_000_000:
        break

0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 2971215073 4807526976 7778742049 12586269025 
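Instead of putting a `break` inside the loop, we can also ask for a fixed number of values from an infinite generator. This sketch uses `itertools.islice` from the standard library; the variable name `first_ten` is made up for the example, and `fib` is repeated so that the code runs on its own.

```python
from itertools import islice

def fib():
    first, second = 0, 1
    while True:
        yield first
        first, second = second, first + second

# islice stops asking after 10 values, so the infinite loop is never a problem
first_ten = list(islice(fib(), 10))
print(first_ten)    # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```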

# Exercise: `read_n`

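When we iterate over a file object, we get each line of the file, one at a time. Each returned value is a string ending with `'\n'` (except possibly the last line, if the file doesn't end with a newline). When there are no more lines, the iteration simply ends; it's `readline`, not iteration, that returns an empty string at the end of the file.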

This works well, because a very large number of files are structured with one record per line.

What if you want to read from a file whose records each take three lines?

I want you to write a generator function, `read_n`, that takes two arguments:

- `filename`, the name of a text file you want to read from
- `n`, an integer, the number of lines from `filename` that should be returned with each iteration

The final iteration might contain fewer than `n` lines. 

The idea is that we can invoke `read_n` with an integer, and then we'll get that many lines back from the file with each iteration.

A few things:

- If you invoke `readline` on a file, you get back the next line, including its trailing newline
- If you are already at the end of a file, you'll get an empty string back from `readline`
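Both of these behaviors are easy to check without touching the filesystem. This sketch uses an in-memory `io.StringIO` object standing in for a real file (that substitution is an assumption of the example, as are the name `fake_file` and its contents).

```python
import io

# three lines; note that the last one has no trailing newline
fake_file = io.StringIO('first\nsecond\nthird')

# ask for more lines than the "file" contains
lines = [fake_file.readline() for _ in range(5)]

print(lines)    # ['first\n', 'second\n', 'third', '', '']
```

Once we're at the end, every further call to `readline` returns `''`, which is what makes the empty string a usable "we're done" signal.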

In [40]:
def read_n(filename, n):
    with open(filename) as infile:
        while True:
            output = []
            for counter in range(n):
                output.append(infile.readline())

            string_output = ''.join(output)

            if string_output:
                yield string_output
            else:
                break

for one_item in read_n('/etc/passwd', 6):
    print(one_item)

##
# User Database
# 
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.

#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh

daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false

_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33

In [41]:
# list comprehension

def read_n(filename, n):
    with open(filename) as infile:
        while True:
            string_output = ''.join([infile.readline()
                                    for counter in range(n)])

            if string_output:
                yield string_output
            else:
                break

for one_item in read_n('/etc/passwd', 6):
    print(one_item)

##
# User Database
# 
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.

#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh

daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false

_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33

In [42]:
g = read_n('/etc/passwd', 6)

In [43]:
next(g)

'##\n# User Database\n# \n# Note that this file is consulted directly only when the system is running\n# in single-user mode.  At other times this information is provided by\n# Open Directory.\n'

In [44]:
next(g)

'#\n# See the opendirectoryd(8) man page for additional information about\n# Open Directory.\n##\nnobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\nroot:*:0:0:System Administrator:/var/root:/bin/sh\n'

In [45]:
next(g)

'daemon:*:1:1:System Services:/var/root:/usr/bin/false\n_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico\n_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false\n_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false\n_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false\n_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false\n'

In [46]:
# what is really happening inside of the generator object?

g

<generator object read_n at 0x1085088c0>

In [47]:
g.gi_running  # is the generator currently running?

False

In [48]:
g.gi_code   # this is the code object, which contains the bytecode and lots of hints, like a function

<code object read_n at 0x108510570, file "/var/folders/n7/3xckdj8j3dz45qz4xnlk1fy40000gn/T/ipykernel_44593/1366442129.py", line 3>

In [50]:
g.gi_code.co_varnames

('filename', 'n', 'infile', 'counter', 'string_output')

In [51]:
g.gi_code.co_argcount

2

In [52]:
# this is the stack frame for the generator's function body, which holds its paused state
g.gi_frame

<frame at 0x1087bca40, file '/var/folders/n7/3xckdj8j3dz45qz4xnlk1fy40000gn/T/ipykernel_44593/1366442129.py', line 10, code read_n>

In [53]:
g.gi_frame.f_lineno

10

In [54]:
g.gi_frame.f_locals

{'filename': '/etc/passwd', 'n': 6, 'infile': <_io.TextIOWrapper name='/etc/passwd' mode='r' encoding='UTF-8'>, 'string_output': 'daemon:*:1:1:System Services:/var/root:/usr/bin/false\n_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico\n_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false\n_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false\n_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false\n_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false\n'}
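If poking at the `gi_` attributes feels too low level, the standard library's `inspect.getgeneratorstate` function summarizes where a generator is in its life. Here is a small sketch using a two-value generator; the list name `states` is made up for the example.

```python
import inspect

def myfunc():
    yield 1
    yield 2

g = myfunc()
states = [inspect.getgeneratorstate(g)]        # created, but the body hasn't started

next(g)
states.append(inspect.getgeneratorstate(g))    # paused at a yield

list(g)                                        # run the generator until it's exhausted
states.append(inspect.getgeneratorstate(g))

print(states)       # ['GEN_CREATED', 'GEN_SUSPENDED', 'GEN_CLOSED']
print(g.gi_frame)   # None -- the frame is gone once the generator has finished
```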

# Comprehensions

You've seen list comprehensions:

In [55]:
[x ** 2
 for x in range(-5, 5)]

[25, 16, 9, 4, 1, 0, 1, 4, 9, 16]

In [56]:
# set comprehensions

{x ** 2
 for x in range(-5, 5)}

{0, 1, 4, 9, 16, 25}

In [58]:
# dict comprehensions

{x : x ** 2
 for x in range(-5, 5)}

{-5: 25, -4: 16, -3: 9, -2: 4, -1: 1, 0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

In [59]:
# many people see this, and say: What about round parentheses, () ?
# they try it, thinking they'll get a "tuple comprehension"

(x ** 2
 for x in range(-5, 5))

<generator object <genexpr> at 0x108463370>

The core Python developers decided it would be useful to be able to create generators even more easily than with generator functions. They used comprehension syntax to let us create "generator expressions," or "generator comprehensions."

When should you use a generator comprehension?

- You have an iterable
- You want to invoke a function, method, or operator on each element of the iterable
- You want to get those values back in an iterable, but not necessarily in a list

Another way to say this: You want a list comprehension, but without the potential for extreme memory usage that a list would produce.

Many people nowadays use generator comprehensions almost exclusively, and almost never use list comprehensions.
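To get a rough sense of the memory difference, we can compare the two with `sys.getsizeof`. This is only a sketch: `getsizeof` reports the size of the container object itself, not of the integers it refers to, so the exact numbers will vary and understate the list's true cost. The names `as_list` and `as_gen` are made up for the example.

```python
import sys

as_list = [x ** 2 for x in range(1_000_000)]    # all values exist right now
as_gen = (x ** 2 for x in range(1_000_000))     # values are produced on demand

print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))    # True

print(sum(as_list) == sum(as_gen))    # True -- same values either way
print(sum(as_gen))                    # 0 -- the generator is now exhausted
```

The last line shows the trade-off: a list can be iterated over as many times as you like, whereas a generator can be consumed only once.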

In [60]:
numbers = [10, 11, 15, 17, 20, 22,23, 29, 32]

g = (one_number
    for one_number in numbers
    if one_number % 2 == 0)

In [61]:
list(g)

[10, 20, 22, 32]

In [62]:
# if you want it as a function, you can use a regular function that *returns* a generator expression.
# for the caller, this works much like a generator function: either way, invoking it gives back a generator.

In [63]:
def evens_only(numbers):
    return (one_number
            for one_number in numbers
            if one_number % 2 == 0)

In [65]:
list(evens_only([10, 11, 15, 17, 20, 22,23, 29, 32]))

[10, 20, 22, 32]

# Exercise: Word lengths

1. Define a string containing several words.
2. Use a generator expression to calculate the total of all word lengths in the string.  (The comprehension should return the length of each word, and you then run `sum` on the output to get the total number.)

In [70]:
s = 'this is a bunch of words for my generators course'

sum(len(one_word)
    for one_word in s.split()
   )

40