# Generators

Have you ever had to work with a dataset so large that it overwhelmed your machine’s memory? Or maybe you have a complex function that needs to maintain an internal state every time it’s called, but the function is too small to justify creating its own class. In these cases and more, generators and the Python yield statement are here to help.

Python generators are a powerful but misunderstood tool. They’re often treated as too difficult a concept for beginning programmers, creating the illusion that beginners should hold off on learning generators until they are ready. I think this assessment is unfair, and that you can use generators sooner than you think.

## Generators

If you’ve never encountered a generator before, the most common real-life example is a backup generator, which creates (generates) electricity for your house or office. Conceptually, Python generators generate values one at a time from a given sequence, instead of giving the entirety of the sequence at once. This one-at-a-time behavior is what makes them so compatible with for loops. If this sounds confusing, don’t worry too much; it will become clearer as we explain how to create generators.

There are two ways to create a generator. They differ in their syntax, but the end result is still a generator. We’ll teach these concepts by covering their syntax and comparing them to a similar, non-generator equivalent.

In [11]:
# Example: suppose we want to write a countdown function
def countdown(num):
    print('Starting')
    while num > 0:
        return num
        num -= 1
    print('Stop')

In [12]:
print(countdown(5))

Starting
5


In countdown() above, return ends the function on the very first pass through the loop, so we only ever get 5 and never reach 'Stop'. A normal Python function will always return one value, whether it be a list, an integer or some other object. But what if you wanted to be able to call a function and have it yield a series of values? That is where generators come in. A generator works by “saving” where it last left off (or yielding) and giving the calling function a value. So instead of returning the execution to the caller for good, it just gives temporary control back. To do this magic, a generator function requires Python’s yield statement.

### The generator function

A generator function is just like a regular function but with a key difference: the yield keyword replaces return.

In [13]:
# Regular function
def function_a():
    return "a"

In [14]:
# Generator function
def generator_a():
    yield "a"

The two functions above perform exactly the same action (returning/yielding the same string). However, if you call the generator function, the result won’t match what the regular function shows.

In [15]:
function_a()

'a'

In [16]:
generator_a()

<generator object generator_a at 0x7f39bc2d9228>

Calling a regular function tells Python to go to where the function is defined, run the code in its body, and return the result. Calling a generator function does not run its body; it returns a generator object instead. To get the generator to yield its values, you need to pass that object into the next() function. next() is a built-in function that asks, “What’s the next item in the iteration?” In fact, next() is what a for loop relies on behind the scenes: the loop first calls iter() on the object to get an iterator and then repeatedly asks that iterator for its next item. Lists, dictionaries, strings, and the like can all hand out such an iterator, which is why you can use them in loops in the first place.

In [19]:
a = generator_a()

In [20]:
# Asking the generator what the next item is
next(a)

'a'

In [21]:
# Do not do this
next(a)

StopIteration: 

Notice that we call the generator function with parentheses and pass the result to next(): calling generator_a() is what creates the generator object. Providing only the function name will raise a TypeError, because a function object is not an iterator. As expected, the generator yields “a” once we invoke next() on it, and a second call raises StopIteration because there is nothing left to yield.

This example is not fully representative of what a generator is useful for. Remember that generators produce a stream of values, so yielding a single value doesn’t really qualify as a stream. To get a stream, we can put multiple yield statements into a generator function. These yield statements form the sequence that the generator will output. We’ll create a generator and bind it to a variable mg. Then, if we keep passing mg into next(), we’ll get to the next yield. If we keep going past the last one, we’ll be given a StopIteration error to tell us that the generator has no more values to give. The StopIteration error is actually how a for loop knows when to stop iterating.

In [22]:
def multi_generate():
    yield "a"
    yield "b"
    yield "c"

In [23]:
mg = multi_generate()
next(mg)

'a'

In [24]:
next(mg)

'b'

In [25]:
next(mg)

'c'

In [26]:
next(mg)

StopIteration: 
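
To make the link between next(), StopIteration, and the for loop more concrete, here is a rough sketch of what a for loop does behind the scenes. The helper name manual_for is made up for this illustration, and this is only an approximation of the loop protocol, not how the interpreter literally implements it.

```python
def multi_generate():
    yield "a"
    yield "b"
    yield "c"


def manual_for(iterable):
    """Collect items roughly the way a for loop would (hypothetical helper)."""
    results = []
    iterator = iter(iterable)      # a for loop first asks for an iterator
    while True:
        try:
            item = next(iterator)  # then it keeps asking for the next item
        except StopIteration:      # StopIteration ends the loop quietly
            break
        results.append(item)
    return results


print(manual_for(multi_generate()))  # ['a', 'b', 'c']
print(manual_for([1, 2, 3]))         # [1, 2, 3]
```

The same helper works for a generator and for a list, which is the point: both can be driven through iter() and next().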

Assigning the result of multi_generate() to mg is a crucial step in using a generator function. Binding the generator to mg gives us a single generator instance that we can refer back to. We can continue passing mg into next() and get the remaining yielded values. Observe what happens if we instead call multi_generate() again each time.

In [27]:
next(multi_generate())

'a'

In [28]:
next(multi_generate())

'a'

We continue to get the result of the first yield statement. The reason behind this is subtle. Every call to multi_generate() creates a brand-new generator, so passing multi_generate() straight into next() always gives you the first yielded value. By binding the generator to a variable, you make sure that every next() call acts on the same generator.

We’ve noted that as we keep passing mg into next(), we get the other yielded values. This is possible only if the generator somehow remembers what it last did. This memory is what distinguishes generator functions from regular functions! A regular function is a one-and-done deal: once it returns its value, its local state is gone. A generator will keep yielding values until it runs out.

It’s easy to think of a generator as a machine that waits for one command and one command only: next(). Once you call next() on the generator, it will dispense the next value in the sequence it is holding. Otherwise, you can’t do much else with a generator.

This brings us to another important property of generators. Once we’ve finished iterating through them, we can’t use them anymore. Once we got through all three yielded values in mg, it can’t provide anything to us anymore. We’d have to create another instance of the multi_generate generator to begin calling next() on it again.
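
The single-use behavior can be checked directly. The short sketch below reuses multi_generate and relies on list() to consume a generator; the variable names are only illustrative.

```python
def multi_generate():
    yield "a"
    yield "b"
    yield "c"


mg = multi_generate()
first_pass = list(mg)     # consumes every value
second_pass = list(mg)    # the same generator is now exhausted

fresh = multi_generate()  # a new call gives a new generator object
third_pass = list(fresh)

print(first_pass)   # ['a', 'b', 'c']
print(second_pass)  # []
print(third_pass)   # ['a', 'b', 'c']
```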

In [29]:
# Exercise: rewrite the previous example using a generator
def countdown(num):
    print('Starting')
    while num > 0:
        yield num
        num -= 1
    print('Stop')

In [34]:
c = countdown(3)
print(next(c))
print(next(c))
print(next(c))

Starting
3
2
1
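
The three next() calls above never reach the final print('Stop'), because the generator is still paused at its last yield. As a small sketch, consuming the same countdown() with a for loop lets it run to the end, so 'Stop' is printed before the loop finishes.

```python
def countdown(num):
    print('Starting')
    while num > 0:
        yield num
        num -= 1
    print('Stop')


values = []
for value in countdown(3):  # the loop keeps asking until the generator ends
    values.append(value)

print(values)  # [3, 2, 1]
```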


Now let’s use a generator function to read in some data. The data set is stored in a CSV file called recipeData.csv. We’ll use the open() function to read it, and we’ll use next() to look at the first few lines of the CSV.



In [1]:
# Creating a generator that will generate the data row by row
def beerDataGenerator():
    file = "data/recipeData.csv"
    with open(file, encoding="ISO-8859-1") as f:
        for row in f:
            yield row

A file object is itself a lazy iterator: when we loop over it in this manner, Python reads one line at a time. This allows us to process files that are too large to load into memory. You will find generators useful for any large data set that you need to work with in chunks, or when you need to generate a large data set that would otherwise fill up all your computer’s memory.

We’ll slowly dissect the above code:
- We’ve designated beerDataGenerator as our generator function that will dispense our CSV file row by row. The function stores the name of the file in file, which we pass to the open() function to be able to read it.
- While we’ve discussed that Python objects like lists and dictionaries can be iterated over, we can also iterate over files that we open() as well.
- The encoding tells Python what kinds of characters it should expect to see; ISO-8859-1 specifically refers to Latin-1.
- The for loop will start with the first row in the CSV file, yield that row, and then save its current place in reading the file until the generator function is called again.

If you’re following along with the data on your own computer, you’ll need to replace file with the exact path on your computer to where the file is located. This will enable Python to find it when you want to open() it.
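
If you do not have recipeData.csv at hand, the same pattern can be tried on a small throwaway file. The sketch below uses made-up file contents and a hypothetical row_generator function; only the open()/yield pattern mirrors the code above. It also uses itertools.islice to peek at the first rows without reading the rest.

```python
import os
import tempfile
from itertools import islice


def row_generator(path, encoding="ISO-8859-1"):
    """Yield one line of the file at a time (hypothetical helper)."""
    with open(path, encoding=encoding) as f:
        for row in f:
            yield row


# Made-up sample data, written to a temporary file for illustration only.
sample = "BeerID,Name,Style\n1,Sample Ale,Cream Ale\n2,Sample IPA,American IPA\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="ISO-8859-1") as tmp:
    tmp.write(sample)
    path = tmp.name

rows = row_generator(path)
first_two = list(islice(rows, 2))  # read only the first two lines
rows.close()                       # finish the generator so the file is closed
os.remove(path)                    # clean up the temporary file

print(first_two)
```

Calling close() on the generator ends it at its paused yield, which lets the with block close the file before the temporary file is deleted.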

In [2]:
# Remember to store an instance of the generator so we can refer back to it
beer = beerDataGenerator()

In [3]:
next(beer)

'BeerID,Name,URL,Style,StyleID,Size(L),OG,FG,ABV,IBU,Color,BoilSize,BoilTime,BoilGravity,Efficiency,MashThickness,SugarScale,BrewMethod,PitchRate,PrimaryTemp,PrimingMethod,PrimingAmount,UserId\n'

In [4]:
next(beer)

'1,Vanilla Cream Ale,/homebrew/recipe/view/1633/vanilla-cream-ale,Cream Ale,45,21.77,1.055,1.013,5.48,17.65,4.83,28.39,75,1.038,70,N/A,Specific Gravity,All Grain,N/A,17.78,corn sugar,4.5 oz,116\n'

Once we’ve created a beerDataGenerator instance in beer, we can start passing it into next() to look at the data itself. As the CSV format suggests, the columns are separated by commas. Furthermore, each row ends with an \n, which indicates a line break. We found that the first item in recipeData.csv is a list of column names and that the first data row describes a delicious Vanilla Cream Ale.

> You may be asking, “We can store the data in a list! Why jump through an extra hoop and use a generator?” As a programmer, you may encounter Big Data. This is a somewhat nebulous term, so we won’t delve into the various Big Data definitions here. Suffice it to say that a Big Data file is too big to load into memory all at once. Our data file doesn’t qualify as Big Data, but we can still learn a lot by imposing a restriction on ourselves to recreate this conundrum. We’ll assume for now that our beer data is so large that we are incapable of storing all of it in a list of lists. With the normal route of reading in data blocked off, we are forced to reconsider our options. This is where generators come in. We’ll explain later precisely why generators work here, but until then we can rest assured that our generator function will enable us to read the data in the first place, albeit not all at once. Along with generator functions, we can also create generators using generator expressions.

### The generator expression

Earlier, we compared our generator function to a regular function since they have many similar aspects. For generator expressions, we’ll compare against list comprehensions.

In [39]:
lc_example = [n**2 for n in [1, 2, 3, 4, 5]]

In [40]:
lc_example

[1, 4, 9, 16, 25]

In [41]:
genex_example = (n**2 for n in [1, 2, 3, 4, 5])

In [42]:
genex_example

<generator object <genexpr> at 0x7f39ad5305e8>

lc_example is our list comprehension, while genex_example is our generator expression that performs almost the same task. Take note that the only difference between the two is that the generator expression is surrounded by parentheses rather than brackets. If we use either of these iterables in a for loop, they will produce the same result and will be indistinguishable. However, if we try to inspect these variables in our interpreter, they produce different results.

This result is similar to what we saw when we tried to look at a regular function and a generator function. Python also recognizes that genex_example is a generator, this time one created by a generator expression (shown as genexpr in the output). As lc_example is a list, we can perform all of the operations that lists support: indexing, slicing, mutation, etc. We cannot do this with the generator expression. Generators are specialized as an easy way to produce output one value at a time, so they do not support these operations. However, as with list comprehensions, we can implement logic within generator expressions to form a filter if we need it.

In [43]:
genex_example2 = (n**2 for n in [1, 2, 3, 4, 5] if n >= 3)

In [44]:
next(genex_example2)

9
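
Going back to the point about list operations, the difference is easy to see by trying to index both objects. The sketch below is illustrative only; the variable names are made up, and itertools.islice is shown as one possible way to take a slice-like chunk from a generator, at the cost of consuming the items it steps over.

```python
from itertools import islice

squares_list = [n**2 for n in [1, 2, 3, 4, 5]]
squares_gen = (n**2 for n in [1, 2, 3, 4, 5])

print(squares_list[1])    # lists support indexing
print(squares_list[1:3])  # and slicing

try:
    squares_gen[1]        # generators do not
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# islice() steps through the generator; skipped items are consumed too
middle = list(islice(squares_gen, 1, 3))
print(middle)  # [4, 9]
```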

Effectively, there is no difference in how we will use a generator function or generator expression. Once we have our generator expression, we can call next() on it to start getting the values it will produce. Once we go through all of the values that the generator expression can produce, we cannot use it anymore. This contrasts against a list comprehension, which we can reuse as much as we want.

In [45]:
next(genex_example)

1

In [46]:
next(genex_example)

4

In [47]:
next(genex_example)

9

In [48]:
next(genex_example)

16

In [49]:
next(genex_example)

25

In [50]:
next(genex_example)

StopIteration: 

The idea that we can only use generators once is tied to the idea of their consumption. Recall that when we iterate over some iterator, we perform some operation on each of the values within. We then move on with our analysis using these processed values, meaning that we typically do not need the original iterator anymore. Generators fit this need perfectly, allowing us to form an iterator that we can use once and then not have to worry about it taking up space after we use it (in a for loop, for example).

We talked about next() as the way to get the values from generators, but it’s often better to use generators in for loops. Using next() forces us to deal with the StopIteration ourselves, whereas the for loop uses it to know when to stop!

In [51]:
genex_example = (n**2 for n in [1, 2, 3, 4, 5])
# Using a for loop to consume a generator is better than using next()
for ge in genex_example:
    print(ge)

1
4
9
16
25


One advantage that generator expressions have over generator functions is their succinctness. Generator functions take up multiple lines, whereas we can fit generator expressions in one line. Multiple lines are not bad in and of themselves, but they open functions up to greater complexity that may introduce bugs later on. We’ll rewrite our generator function as a one-line expression that reads in our beer data. This conciseness will come in handy later in the article.

In [20]:
beer_data = "data/recipeData.csv"

# This one line performs the same action as beerDataGenerator()!
lines = (line for line in open(beer_data, encoding="ISO-8859-1"))

> - Generators produce values one-at-a-time as opposed to giving them all at once.
> - There are two ways to create generators: generator functions and generator expressions.
> - Generator functions yield, regular functions return.
> - Generator expressions need (), list comprehensions use [].
> - You can only use a generator once.
> - There are two ways to get values from generators: the next() function and a for loop. The for loop is often the preferred method.
> - We can use generators to read files and give us one line at a time.

### Laziness and generators

We know now that generators produce a single value from a defined sequence, but only when we ask with next() or within a for loop. We call this lazy evaluation. Generators are lazy because they only give us a value when we ask for it. The upside is that only that single value takes up memory at any one time. The result is that generators are very memory efficient, which makes them a perfect candidate for reading and using Big Data files. Once we ask for the next value of a generator, the old value can be discarded unless we keep a reference to it. Once we go through the entire generator and stop referring to it, it can be discarded from memory as well.
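
A rough way to see the memory difference is to compare the size of a list with the size of an equivalent generator object. This is only a sketch: sys.getsizeof reports the size of the container itself, not of the numbers stored in it, and the exact byte counts depend on the Python version and platform, so none are shown here.

```python
import sys

n = 100_000
squares_list = [i**2 for i in range(n)]  # every value is stored up front
squares_gen = (i**2 for i in range(n))   # values are produced on demand

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)

print(f"list container:   {list_size} bytes")
print(f"generator object: {gen_size} bytes")

# Both give the same total once they are consumed
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
print(total_from_list == total_from_gen)  # True
```

The generator stays the same small size no matter how large n is, because it only holds the state needed to produce the next value.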

### Generator pipelines

We currently haven’t learned anything from the beer data. All we’ve done so far is take the original CSV file and create a generator that will yield each line in the CSV, one at a time, in the form of a string. Unless we’d like to do some crazy string manipulation, we’ll need to think of a way to get our data into a readable, usable form. At the moment, our code does nothing more than a simple read from the file and output of a single line.

Generators come to the rescue again here! So far in the article, we’ve been passing other structures, specifically iterables, to the generators to indicate what sequence we’d like to generate from. However, generators are iterators themselves too. Why don’t we create another generator that takes the output of another generator? Our lines generator outputs each line in its entirety, so we’ll make a second generator that does some formatting for us.

In [30]:
beer_data = "data/recipeData.csv"

lines = (line for line in open(beer_data, encoding="ISO-8859-1"))
lists = (l.split(",") for l in lines)

The end result of our generators is a stream of lists, each containing the data within a row of the CSV. If we iterate through lists, we’ll be able to easily access the data elements within and perform the analyses we need! We’ve effectively made a pipeline for our data set, starting from the raw data set and sending it through two generators to get it into a familiar form. Remember that generators aren’t lists themselves; they merely generate a single element of a sequence and only take up the memory that element needs. By piping generators together, we’ve created a quick, easy-to-read way to read data that would be inaccessible through normal means. There’s some real power to this approach, and its significance can’t be overstated. We didn’t need to create any temporary lists to hold intermediate values as we processed them.

In this pipeline, each generator is put in charge of a single operation that will eventually be applied to all rows of the data set. Although having each row as a list is good, there are still some small issues that need to be addressed before we can do any meaningful analyses. First, we’d like to take out the column names, since they aren’t data, and then use them to build dictionaries that make any further code easier to read. Note: if you’re running this code on your own machine, you must remember that you can only use generators once. If you use the generator in a for loop to view the output, you’ll need to run the data and the whole pipeline again. Thankfully, the generators run fast here.

In [64]:
beer_data = "data/recipeData.csv"

lines = (line for line in open(beer_data, encoding="ISO-8859-1"))
lists = (l.split(",") for l in lines)

# Take the column names out of the generator and store them, leaving only data
columns = next(lists)

# Take these columns and use them to create an informative dictionary
beerdicts = (dict(zip(columns, data)) for data in lists)

In [41]:
next(beerdicts)

{'BeerID': '1',
 'Name': 'Vanilla Cream Ale',
 'URL': '/homebrew/recipe/view/1633/vanilla-cream-ale',
 'Style': 'Cream Ale',
 'StyleID': '45',
 'Size(L)': '21.77',
 'OG': '1.055',
 'FG': '1.013',
 'ABV': '5.48',
 'IBU': '17.65',
 'Color': '4.83',
 'BoilSize': '28.39',
 'BoilTime': '75',
 'BoilGravity': '1.038',
 'Efficiency': '70',
 'MashThickness': 'N/A',
 'SugarScale': 'Specific Gravity',
 'BrewMethod': 'All Grain',
 'PitchRate': 'N/A',
 'PrimaryTemp': '17.78',
 'PrimingMethod': 'corn sugar',
 'PrimingAmount': '4.5 oz',
 'UserId\n': '116\n'}

The beerdicts generator does some simple formatting, which gives our pipeline even more power!
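
One detail in the output above is that the last key and value still carry a trailing \n ('UserId\n': '116\n'), because split(",") does not remove the line break. One possible tweak, sketched here on made-up in-memory lines rather than the real file, is to strip the newline before splitting. The rows and the reduced set of columns are assumptions for the sake of a short example.

```python
# Made-up lines standing in for the file; real rows have many more columns.
raw_lines = [
    "BeerID,Name,Style,UserId\n",
    "1,Sample Ale,Cream Ale,116\n",
]

lines = (line for line in raw_lines)
# rstrip("\n") drops the trailing newline before the line is split
lists = (l.rstrip("\n").split(",") for l in lines)
columns = next(lists)
beerdicts = (dict(zip(columns, data)) for data in lists)

first = next(beerdicts)
print(first)
```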

This is a great place to start asking our data questions about our future beer brewing choices. Now that we have our generator pipeline in place, we can start consuming the data produced by the generators and create some insights. We usually consume generators using for loops, so we’ll use one to figure out what the most popular type of homebrewed beer is.

In [52]:
beer_counts = {}
for bd in beerdicts:
    if bd["Style"] not in beer_counts:
        beer_counts[bd["Style"]] = 1
    else:
        beer_counts[bd["Style"]] += 1

most_popular = 0
most_popular_type = None
for beer, count in beer_counts.items():
    if count > most_popular:
        most_popular = count
        most_popular_type = beer

In [53]:
most_popular_type

'American IPA'
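
The counting loop above can also be written with collections.Counter from the standard library, which accepts any iterable, including a generator. This is an alternative sketch rather than the approach used in this article, and the sample records below are made up to keep the example self-contained.

```python
from collections import Counter

# Made-up records standing in for the dictionaries produced by beerdicts
sample_dicts = (
    {"Style": style}
    for style in ["American IPA", "Cream Ale", "American IPA", "Saison"]
)

# Counter consumes the generator of styles one item at a time
style_counts = Counter(bd["Style"] for bd in sample_dicts)
most_popular_type, most_popular = style_counts.most_common(1)[0]

print(most_popular_type, most_popular)  # American IPA 2
```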

This operation is ubiquitous in data wrangling and processing, and you’ve probably seen it before. The only new thing here is that instead of referring back to a list of lists containing our data, we rely on dictionaries that are produced by our generators. With generators, we are able to make the same inquiries of a Big Data set as we would of a regular-sized one. We now know that American IPAs are the most popular homebrewed beer in the data set, and we know how many entries they have in the data. Next, we can try figuring out how strong our beer should be. This data is contained in the “ABV” (Alcohol By Volume) key. Since we are working with dictionaries as the output of our generator stream, why don’t we add another generator to home in on the exact values we want to output?

In [65]:
abv = (float(bd["ABV"]) for bd in beerdicts if bd["Style"] == "American IPA")

In [59]:
sum(abv)

76944.87000000004

In [60]:
most_popular

11940

In [66]:
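# Get the average ABV for an American IPA
# Note: generators are single-use, so the pipeline and abv must be rebuilt
# before this cell; the loop and the sum(abv) call above exhausted them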
sum(abv)/most_popular

6.44429396984925

We should take special note of our use of sum() with the abv generator. It is not immediately intuitive that sum() will add up all of the ABV values that it receives. You may think of sum() as reducing the whole output of the generator into one value. By dividing this sum by the number of American IPA entries, we got the average. Our data suggests that your average American IPA is about 6.4% alcohol by volume!

Our last generator, abv, takes the dictionaries that are output by beerdicts and outputs the value under the ABV key, but only if the beer is an American IPA. Filters on our generator expressions form a powerful tool in our pipeline. If we think of each successive generator as a modular component, we can then swap out generators for others that may have a more desirable functionality. If we wanted to change what kind of beer we wanted to investigate or look at another beer characteristic, the only thing we need to change is the generator operation.

The generator pipeline approach consists of three parts: some raw data you want to process, the pipeline that does the actual processing, and the final consumption of the output of this pipeline. Following this pattern will enable you to reenact what we’ve done with the beer data.
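
Because a generator can only be consumed once, computing an average needs either a count obtained elsewhere (as with most_popular above) or a single pass that tracks both the sum and the count. The sketch below shows the single-pass variant on made-up records; the sample values are assumptions and do not come from the beer data.

```python
# Made-up records standing in for the dictionaries produced by beerdicts
sample_dicts = iter([
    {"Style": "American IPA", "ABV": "6.0"},
    {"Style": "Cream Ale", "ABV": "5.5"},
    {"Style": "American IPA", "ABV": "7.0"},
])

abv = (float(bd["ABV"]) for bd in sample_dicts if bd["Style"] == "American IPA")

total = 0.0
count = 0
for value in abv:  # one pass keeps a running sum and a running count
    total += value
    count += 1

average_abv = total / count if count else float("nan")
print(average_abv)  # 6.5
```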

If you’re used to the workflow of using a list of lists and leveraging all the list methods to do your analyses, this new approach to data wrangling might be strange. However, the data pipeline is a powerful concept that can be immediately incorporated into your code and you should try it.

## Using Generators

[Practical examples](https://github.com/dabeaz/generators)

### Example 1: Reading Large Files

A common use case of generators is to work with data streams or large files, like CSV files. These text files separate data into columns by using commas. This format is a common way to share data. Now, what if you want to count the number of rows in a CSV file? The code block below shows one way of counting those rows:

In [7]:
def csv_reader(file_name):
    file = open(file_name)
    result = file.read().split("\n")
    file.close()
    return result

In [8]:
csv_gen = csv_reader("data/nile.csv")
row_count = 0

for row in csv_gen:
    row_count += 1

print(f"Row count is {row_count}")

Row count is 572


Looking at this example, you might expect csv_gen to be a list, and here it is. To populate this list, csv_reader() opens the file and uses file.read() along with .split() to add each line as a separate element. Then, the program iterates over the list and increments row_count for each row.

This works, but would this design still work if the file is very large? What if the file is larger than the memory you have available? In that case, file.read() has to load everything into memory at once, and Python would eventually raise a MemoryError. The file object returned by open(), on the other hand, can be iterated over lazily, line by line.

Before a MemoryError happens, you’ll probably notice your computer slow to a crawl. You might even need to kill the program with a KeyboardInterrupt. So, how can you handle these huge data files? Take a look at a new definition of csv_reader():

In [9]:
def csv_reader(file_name):
    for row in open(file_name, "r"):
        yield row

In [10]:
csv_gen = csv_reader("data/nile.csv")
row_count = 0

for row in csv_gen:
    row_count += 1

print(f"Row count is {row_count}")

Row count is 571
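
The two versions report different counts (572 versus 571). A likely explanation, assuming nile.csv ends with a newline, is that split("\n") produces one extra empty string after the final line break, while iterating over the file does not. The sketch below shows the effect on a made-up three-line text; io.StringIO stands in for the opened file.

```python
import io

# Made-up file contents; note the newline at the very end
text = "year,flow\n1871,1120\n1872,1160\n"

split_rows = text.split("\n")            # what file.read().split("\n") produces
iterated_rows = list(io.StringIO(text))  # what iterating over a file produces

print(split_rows)     # ['year,flow', '1871,1120', '1872,1160', '']
print(iterated_rows)  # ['year,flow\n', '1871,1120\n', '1872,1160\n']
print(len(split_rows), len(iterated_rows))  # 4 3
```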


In [11]:
# Option 1
file = "data/nile.csv"
# Use generators to get number of rows, with one row in memory
def line_aggregate(file):
    rows = 0
    with open(file, encoding='ISO-8859-1') as f:
        for row in f:
            rows += 1
        return rows
    
line_aggregate(file)

571

In [9]:
# Option 2
file = "data/nile.csv"
num_lines = sum(1 for line in open(file))
print(num_lines)

571


What’s happening in the new csv_reader()? You’ve essentially turned it into a generator function. This version opens a file, loops through each line, and yields each row instead of returning it. The two options above count the rows using the same lazy, line-by-line iteration over the file.

You can also define a generator expression (also called a generator comprehension), which has a very similar syntax to list comprehensions. In this way, you can use the generator without calling a function:

In [12]:
csv_gen = (row for row in open("data/nile.csv"))

This is a more succinct way to create the generator csv_gen. You’ll learn more about the Python yield statement soon. For now, just remember this key difference:
- Using yield will result in a generator object.
- Using return will result in the first line of the file only.

### Example 2: Generating an Infinite Sequence

Let’s switch gears and look at infinite sequence generation. In Python, to get a finite sequence, you call range() and evaluate it in a list context:

In [13]:
a = range(5)

In [14]:
list(a)

[0, 1, 2, 3, 4]

Generating an infinite sequence, however, will require the use of a generator, since your computer memory is finite:

In [15]:
def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

This code block is short and sweet. First, you initialize the variable num and start an infinite loop. Then, you immediately yield num so that you can capture the initial state. This mimics the action of range().

After yield, you increment num by 1. If you try this with a for loop, then you’ll see that it really does seem infinite:

    for i in infinite_sequence():
        print(i, end=" ")

The program will continue to execute until you stop it manually.
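
If you would rather not interrupt the program by hand, you can take a bounded number of values from the infinite generator instead. The sketch below shows two common options, itertools.islice from the standard library and a plain break; the variable names are illustrative.

```python
from itertools import islice


def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1


# islice() stops after the requested number of items
first_five = list(islice(infinite_sequence(), 5))
print(first_five)  # [0, 1, 2, 3, 4]

# A plain break works too
small = []
for i in infinite_sequence():
    if i >= 3:
        break
    small.append(i)
print(small)  # [0, 1, 2]
```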

Instead of using a for loop, you can also call next() on the generator object directly. This is especially useful for testing a generator in the console:

In [16]:
gen = infinite_sequence()

In [17]:
next(gen)

0

In [18]:
next(gen)

1

In [19]:
next(gen)

2

In [20]:
next(gen)

3

Here, you have a generator called gen, which you manually iterate over by repeatedly calling next(). This works as a great sanity check to make sure your generators are producing the output you expect.

> Note: When you use next(), Python calls .__next__() on the object you pass in as an argument. There are some special effects that this allows, but they go beyond the scope of this article. Experiment with changing the argument you pass to next() and see what happens!
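
As a starting point for that experiment, the sketch below calls next() and .__next__() on the same generator and also shows the optional second argument of next(), a default that is returned instead of raising StopIteration. The variable names are illustrative.

```python
def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1


gen = infinite_sequence()
via_builtin = next(gen)      # 0
via_method = gen.__next__()  # 1, the same mechanism called directly

# next() also accepts a default, returned instead of raising StopIteration
finished = iter([])          # an iterator with nothing left to give
fallback = next(finished, "no more values")

print(via_builtin, via_method, fallback)
```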

### Example 3: Creating New Iteration Patterns with Generators

You want to implement a custom iteration pattern that’s different from the usual built-in functions (e.g., range(), reversed(), etc.).

If you want to implement a new kind of iteration pattern, define it using a generator function. Here’s a generator that produces a range of floating-point numbers:

In [37]:
def frange(start, stop, increment):
    x = start
    while x < stop:
        yield x
        x += increment

To use such a function, you iterate over it using a for loop or use it with some other function that consumes an iterable (e.g., sum(), list(), etc.). For example:

In [40]:
for n in frange(0, 4, 0.5):
    print(n)

0
0.5
1.0
1.5
2.0
2.5
3.0
3.5


In [41]:
list(frange(0, 1, 0.125))

[0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875]
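
The increments used above (0.5 and 0.125) are exactly representable as floats, so the results look clean. With an increment such as 0.1, repeated addition accumulates rounding error and frange() can yield one value more than expected. The sketch below contrasts it with a variant that computes each value from an index instead; frange_indexed is a hypothetical name and just one possible way to reduce the drift.

```python
def frange(start, stop, increment):
    x = start
    while x < stop:
        yield x
        x += increment


def frange_indexed(start, stop, increment):
    """Variant that multiplies instead of accumulating (hypothetical name)."""
    i = 0
    x = start
    while x < stop:
        yield x
        i += 1
        x = start + i * increment  # computed fresh, so errors do not pile up


accumulated = list(frange(0, 1, 0.1))
indexed = list(frange_indexed(0, 1, 0.1))

print(len(accumulated), accumulated[-1])
print(len(indexed), indexed[-1])
```

On a typical IEEE 754 platform the accumulating version yields 11 values here, the last one just below 1, while the indexed version yields the expected 10.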

### Example 4: Creating Data Pipelines With Generators

Data pipelines allow you to string together code to process large datasets or streams of data without maxing out your machine’s memory. Imagine that you have a large CSV file. This example uses the TechCrunch Continental USA set, which describes funding rounds and dollar amounts for various startups based in the USA.

Let’s think of a strategy:
- Read every line of the file.
- Split each line into a list of values.
- Extract the column names.
- Use the column names and lists to create a dictionary.
- Filter out the rounds you aren’t interested in.
- Calculate the total and average values for the rounds you are interested in.

Normally, you can do this with a package like pandas, but you can also achieve this functionality with just a few generators. You’ll start by reading each line from the file with a generator expression:

In [46]:
file_name = "data/techcrunch.csv"
lines = (line for line in open(file_name))

Then, you’ll use another generator expression in concert with the previous one to split each line into a list:



In [47]:
list_line = (s.rstrip().split(",") for s in lines)

Here, you created the generator list_line, which iterates through the first generator lines. This is a common pattern to use when designing generator pipelines. Next, you’ll pull the column names out of techcrunch.csv. Since the column names tend to make up the first line in a CSV file, you can grab that with a short next() call:

In [48]:
cols = next(list_line)

This call to next() advances the iterator over the list_line generator one time. Put it all together, and your code should look something like this:

In [55]:
file_name = "data/techcrunch.csv"
lines = (line for line in open(file_name))
list_line = (s.rstrip().split(",") for s in lines)
cols = next(list_line)

To sum this up, you first create a generator expression lines to yield each line in a file. Next, you iterate through that generator within the definition of another generator expression called list_line, which turns each line into a list of values. Then, you advance the iteration of list_line just once with next() to get a list of the column names from your CSV file.

To help you filter and perform operations on the data, you’ll create dictionaries where the keys are the column names from the CSV:

In [49]:
company_dicts = (dict(zip(cols, data)) for data in list_line)

This generator expression iterates through the lists produced by list_line. Then, it uses zip() and dict() to create the dictionary as specified above. Now, you’ll use a fourth generator to filter the funding round you want and pull raisedAmt as well:

In [50]:
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"
)

In this code snippet, your generator expression iterates through the results of company_dicts and takes the raisedAmt for any company_dict where the round key is "a".

Remember, you aren’t iterating through all these at once in the generator expression. In fact, you aren’t iterating through anything until you actually use a for loop or a function that works on iterables, like sum(). Call sum() now to iterate through the generators:

In [51]:
total_series_a = sum(funding)

In [52]:
total_series_a

18500000

Putting this all together, you’ll produce the following script:

In [59]:
file_name = "data/techcrunch.csv"
lines = (line for line in open(file_name))
list_line = (s.rstrip().split(",") for s in lines)
cols = next(list_line)
company_dicts = (dict(zip(cols, data)) for data in list_line)
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"
)
total_series_a = sum(funding)
print(f"Total series A fundraising: ${total_series_a}")

Total series A fundraising: $18500000


This script pulls together every generator you’ve built, and they all function as one big data pipeline. Here’s a line by line breakdown:
- Line 2 reads in each line of the file.
- Line 3 splits each line into values and puts the values into a list.
- Line 4 uses next() to store the column names in a list.
- Line 5 creates dictionaries and unites them with a zip() call:
    - The keys are the column names cols from line 4.
    - The values are the rows in list form, created in line 3.
- Line 6 gets each company’s series A funding amounts. It also filters out any other raised amount.
- Line 11 begins the iteration process by calling sum() to get the total amount of series A funding found in the CSV.

> Note: The methods for handling CSV files developed in this tutorial are important for understanding how to use generators and the Python yield statement. However, when you work with CSV files in Python, you should instead use the csv module included in Python’s standard library. This module has optimized methods for handling CSV files efficiently.
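
As a rough sketch of what the csv-based version might look like, the snippet below uses csv.DictReader, which also yields one dictionary per row lazily. The sample data and the total_for_round() helper are made up for illustration; they stand in for data/techcrunch.csv and are not part of the original script.

```python
import csv
import io

# Hypothetical sample data standing in for data/techcrunch.csv.
SAMPLE_CSV = """company,round,raisedAmt
Acme,a,5000000
Globex,b,9000000
Initech,a,2500000
"""

def total_for_round(csv_file, funding_round):
    """Sum raisedAmt over the rows whose round column equals funding_round."""
    rows = csv.DictReader(csv_file)  # lazily yields one dict per row
    amounts = (
        int(row["raisedAmt"])
        for row in rows
        if row["round"] == funding_round
    )
    return sum(amounts)  # sum() drives the whole pipeline

total_series_a = total_for_round(io.StringIO(SAMPLE_CSV), "a")
print(f"Total series A fundraising: ${total_series_a}")
```

With a real file you would pass an open file object instead of io.StringIO; the csv documentation recommends opening it with newline="".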

### Example 5: A generator that follows a log file like Unix 'tail -f'

follow.py. A generator that follows lines written to a real-time log file (like Unix 'tail -f'). To run this program, you need to have a log-file to work with. Run the program logsim.py to create a simulated web-server log (written in the file access-log). Leave this program running in the background for the next few parts.

> `seek()` method: `fileObject.seek(offset[, whence])`
- offset: the position of the read/write pointer within the file.
- whence: optional, defaults to 0, which means absolute file positioning; 1 means seek relative to the current position and 2 means seek relative to the file's end.
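
To make the three whence values concrete, here is a small sketch on an in-memory binary file. The contents of buf are invented for the example; a binary buffer is used because text-mode files only accept a zero offset when seeking relative to the end.

```python
import io

# An in-memory binary file; the contents are made up for illustration.
buf = io.BytesIO(b"first line\nsecond line\n")

buf.seek(0, 2)              # whence=2: offset is relative to the end of the file
end_position = buf.tell()   # position of the end, i.e. the size in bytes

buf.seek(-12, 2)            # 12 bytes before the end
tail = buf.read()           # reads the last line

buf.seek(6, 0)              # whence=0 (the default): absolute position 6
buf.seek(5, 1)              # whence=1: 5 bytes past the current position
after_relative = buf.tell()

print(end_position, tail, after_relative)
```

follow() uses thefile.seek(0, 2) in exactly this way: it jumps to the end so that only lines written afterwards are reported.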

In [None]:
# logsim.py
import time, random

from data import ips
from data import docs

with open("access-log","w") as f:
    while True:
        time.sleep(random.random())
        n = random.randint(0,len(ips)-1)
        m = random.randint(0,len(docs)-1)
        t = time.time()
        date = time.strftime("[%d/%b/%Y:%H:%M:%S -0600]",time.localtime(t))
        write_String = f'{ips[n]} - - {date} {docs[m]}\n'
        f.write(write_String)
        f.flush()

In [None]:
# follow.py
#
# A generator that follows a log file like Unix 'tail -f'.
#
# Note: To see this example work, you need to apply it to
# an active server log file.  Run the program "logsim.py"
# in the background to simulate such a file.  This program
# will write entries to a file "access-log".

import time

def follow(thefile):
    thefile.seek(0,2)      # Go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)    # Sleep briefly
            continue
        yield line

# Example use
if __name__ == '__main__':
    with open("access-log") as logfile:
        for line in follow(logfile):
            print(line, end="")

pipeline.py. An example of using generators to set up a simple processing pipeline. Print all server log entries containing the word 'python'.

In [None]:
# pipeline.py
#
# An example of setting up a processing pipeline with generators
import re

def grep(pattern,lines):
    patc = re.compile(pattern)
    for line in lines:
        if patc.search(line):
            yield line

if __name__ == '__main__':
    from follow import follow

    # Set up a processing pipe : tail -f | grep python
    with open("access-log") as logfile:
        loglines = follow(logfile)
        pylines  = grep(r"python", loglines)

        # Pull results out of the processing pipeline
        for line in pylines:
            print(line, end="")
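
Because follow() never terminates, it can be easier to try out the grep() stage on its own first. The sketch below feeds it a short list of made-up log lines instead of a live file; the sample entries are hypothetical.

```python
import re

def grep(pattern, lines):
    """Yield only the lines that match the regular expression pattern."""
    patc = re.compile(pattern)
    for line in lines:
        if patc.search(line):
            yield line

# Hypothetical log entries; any iterable of strings works as input.
sample_lines = [
    '1.2.3.4 - - [24/Feb/2008:00:08:59 -0600] "GET /python/tutorial.html HTTP/1.1" 200 2048\n',
    '1.2.3.4 - - [24/Feb/2008:00:09:01 -0600] "GET /index.html HTTP/1.1" 200 512\n',
    '5.6.7.8 - - [24/Feb/2008:00:09:07 -0600] "GET /python/slides.pdf HTTP/1.1" 200 4096\n',
]

matching = list(grep(r"python", sample_lines))
for line in matching:
    print(line, end="")
```

The same grep() works unchanged on follow(logfile), since it only requires something it can iterate over.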

### Example 6: Calculate the number of bytes transferred in an Apache server log 

In [None]:
# genlog.py
#
# Sum up the bytes transferred in an Apache server log using
# generator expressions

with open("access-log") as wwwlog:
    bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
    bytes_sent = (int(x) for x in bytecolumn if x != '-')
    print("Total", sum(bytes_sent))
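
To see what each generator expression contributes, here is the same computation on three made-up log lines. It assumes, as genlog.py does, that the last whitespace-separated field of each line is the byte count, with '-' meaning that no bytes were sent.

```python
# Hypothetical log lines; the last field is the number of bytes sent.
sample_log = [
    '140.180.132.213 - - [24/Feb/2008:00:08:59 -0600] "GET /ply/ply.html HTTP/1.1" 200 97238',
    '140.180.132.213 - - [24/Feb/2008:00:08:59 -0600] "GET /favicon.ico HTTP/1.1" 404 133',
    '75.54.118.139 - - [24/Feb/2008:00:15:40 -0600] "GET / HTTP/1.1" 304 -',
]

# rsplit(None, 1) splits once from the right on whitespace: [rest, last_field]
bytecolumn = (line.rsplit(None, 1)[1] for line in sample_log)
# Skip the '-' placeholder and convert the remaining fields to integers
bytes_sent = (int(x) for x in bytecolumn if x != '-')

total = sum(bytes_sent)
print("Total", total)
```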

## Exercise: Creating Data Processing Pipelines

You want to process data iteratively in the style of a data processing pipeline (similar to
Unix pipes). For instance, you have a huge amount of data that needs to be processed,
but it can’t fit entirely into memory.

Generator functions are a good way to implement processing pipelines. To illustrate,
suppose you have a huge directory of log files that you want to process:

    foo/
        access-log-012007.gz
        access-log-022007.gz
        access-log-032007.gz
        ...
        access-log-012008
    bar/
        access-log-092007.bz2
        ...
        access-log-022008

To process these files, you could define a collection of small generator functions that
perform specific self-contained tasks. For example:

In [2]:
import os
import fnmatch
import gzip
import bz2
import re

def gen_find(filepat, top):
    '''
    Find all filenames in a directory tree that match a shell wildcard pattern
    '''
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path,name)
            
def gen_opener(filenames):
    '''
    Open a sequence of filenames one at a time producing a file object.
    The file is closed immediately when proceeding to the next iteration.
    '''
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        elif filename.endswith('.bz2'):
            f = bz2.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()
        
def gen_concatenate(iterators):
    '''
    Chain a sequence of iterators together into a single sequence.
    '''
    for it in iterators:
        yield from it
        
def gen_grep(pattern, lines):
    '''
    Look for a regex pattern in a sequence of lines
    '''
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

You can now easily stack these functions together to make a processing pipeline. For
example, to find all log lines that match the regular expression robotsl?.txt, you would just do this:

In [3]:
lognames = gen_find('access-log*', 'data/pipeline')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep('robotsl?.txt', lines)

for line in pylines:
    print(line, end='')

124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [1

If you want to extend the pipeline further, you can even feed the data into generator
expressions. For example, this version finds the number of bytes transferred and sums
the total:

In [4]:
lognames = gen_find('access-log*', 'data/pipeline')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep('124.115.6.12', lines)
bytecolumn = (line.rsplit(None,1)[1] for line in pylines)
_bytes = (int(x) for x in bytecolumn if x != '-')
print('Total', sum(_bytes))

Total 1704


Processing data in a pipelined manner works well for a wide variety of other problems,
including parsing, reading from real-time data sources, periodic polling, and so on.

In understanding the code, it is important to grasp that the yield statement acts as a
kind of data producer whereas a for loop acts as a data consumer. When the generators
are stacked together, each yield feeds a single item of data to the next stage of the
pipeline that is consuming it with iteration. In the last example, the sum() function is
actually driving the entire program, pulling one item at a time out of the pipeline of
generators.
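
The pull-driven behavior is easy to observe with a toy pipeline that records the order in which its stages run. The numbers() and doubled() stages and the log list are invented for this sketch.

```python
def numbers(limit, log):
    """Producer stage: yields 0..limit-1 and records each step in log."""
    for n in range(limit):
        log.append(f"produce {n}")
        yield n

def doubled(values, log):
    """Middle stage: consumes values with a for loop and yields each one doubled."""
    for value in values:
        log.append(f"double {value}")
        yield value * 2

events = []
pipeline = doubled(numbers(3, events), events)
events_before = list(events)   # still empty: building the pipeline runs nothing

total = sum(pipeline)          # sum() pulls the items through one at a time
print(total)
print(events)
```

The log shows the stages interleaving item by item, rather than the first stage finishing before the second one starts.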

One nice feature of this approach is that each generator function tends to be small and
self-contained. As such, they are easy to write and maintain. In many cases, they are so
general purpose that they can be reused in other contexts. The resulting code that glues
the components together also tends to read like a simple recipe that is easily understood.

The memory efficiency of this approach cannot be overstated. The code shown
would still work even if used on a massive directory of files. In fact, due to the iterative
nature of the processing, very little memory would be used at all.

There is a subtle point involving the gen_concatenate() function. The
purpose of this function is to concatenate input sequences together into one long sequence
of lines. The itertools.chain() function performs a similar function, but requires
that all of the chained iterables be specified as arguments. In the case of this
particular recipe, doing that would involve a statement such as
lines = itertools.chain(*files), which would cause the gen_opener() generator to be fully consumed.
Since that generator is producing a sequence of open files that are immediately
closed in the next iteration step, chain() can’t be used. The solution shown avoids this
issue.

Also appearing in the gen_concatenate() function is the use of yield from to delegate
to a subgenerator. The statement yield from it simply makes gen_concatenate()
emit all of the values produced by the generator it.
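
The difference can be checked with a stand-in for gen_opener() that records when each source is "opened". fake_opener() and the file names are hypothetical. The sketch also shows itertools.chain.from_iterable(), a standard-library function that takes the outer iterable lazily and so behaves like gen_concatenate() in this respect; it is mentioned here as an alternative, not as part of the original recipe.

```python
import itertools

def fake_opener(names, log):
    """Stand-in for gen_opener(): records each name in log, then yields its lines."""
    for name in names:
        log.append(name)
        yield [f"{name}: line 1", f"{name}: line 2"]

def gen_concatenate(iterators):
    for it in iterators:
        yield from it   # equivalent to: for item in it: yield item

names = ["a.log", "b.log"]

opened = []
lines = gen_concatenate(fake_opener(names, opened))
first = next(lines)
opened_after_first = list(opened)    # only the first source has been opened

opened_by_chain = []
eager = itertools.chain(*fake_opener(names, opened_by_chain))
# The * unpacking has already consumed fake_opener() completely.

opened_by_from_iterable = []
lazy = itertools.chain.from_iterable(fake_opener(names, opened_by_from_iterable))
next(lazy)                           # opens just the first source

print(opened_after_first, opened_by_chain, opened_by_from_iterable)
```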

Last, but not least, it should be noted that a pipelined approach doesn’t always work for
every data handling problem. Sometimes you just need to work with all of the data at
once. However, even in that case, using generator pipelines can be a way to logically
break a problem down into a kind of workflow.

[More examples](https://github.com/dabeaz/generators)

## Tips

### Consider Generator Expressions for Large Comprehensions

The problem with list comprehensions is that they may create a whole new list containing one item for each
value in the input sequence. This is fine for small inputs, but for large inputs this could
consume significant amounts of memory and cause your program to crash.

For example, say you want to read a file and return the number of characters on each line.
Doing this with a list comprehension would require holding the length of every line of the
file in memory. If the file is absolutely enormous or perhaps a never-ending network
socket, list comprehensions are problematic. Here, I use a list comprehension in a way that
can only handle small input values.

In [24]:
# Exercise: in a single line, count the number of characters on each line of the file
value = [len(x) for x in open("data/example.txt")]
print(value)

[94, 87, 89, 1, 91, 91, 93, 88, 31, 1, 88, 87, 47, 87, 85, 88, 21]


To solve this, Python provides generator expressions, a generalization of list
comprehensions and generators. Generator expressions don’t materialize the whole output
sequence when they’re run. Instead, generator expressions evaluate to an iterator that
yields one item at a time from the expression.

A generator expression is created by putting list-comprehension-like syntax between ()
characters. Here, I use a generator expression that is equivalent to the code above.
However, the generator expression immediately evaluates to an iterator and doesn’t make
any forward progress.

In [25]:
it = (len(x) for x in open("data/example.txt"))
print(it)

<generator object <genexpr> at 0x7f6341310480>


The returned iterator can be advanced one step at a time to produce the next output from
the generator expression as needed (using the next built-in function). Your code can
consume as much of the generator expression as you want without risking a blowup in
memory usage.

In [26]:
print(next(it))

94


In [27]:
print(next(it))

87


Another powerful outcome of generator expressions is that they can be composed together.
Here, I take the iterator returned by the generator expression above and use it as the input
for another generator expression.

In [28]:
roots = ((x, x**0.5) for x in it)

Each time I advance this iterator, it will also advance the interior iterator, creating a
domino effect of looping, evaluating conditional expressions, and passing around inputs
and outputs.

In [29]:
print(next(roots))

(89, 9.433981132056603)


Chaining generators like this executes very quickly in Python. When you’re looking for a
way to compose functionality that’s operating on a large stream of input, generator
expressions are the best tool for the job. The only gotcha is that the iterators returned by
generator expressions are stateful, so you must be careful not to use them more than once.
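
A short sketch of that gotcha: once a generator has been consumed, iterating over it again silently produces nothing. The make_squares() helper is a hypothetical way to get a fresh generator whenever the values are needed again.

```python
squares = (n * n for n in range(4))

first_pass = list(squares)    # consumes every item
second_pass = list(squares)   # the generator is already exhausted

def make_squares():
    """Return a brand-new generator each time it is called."""
    return (n * n for n in range(4))

total = sum(make_squares())
largest = max(make_squares())

print(first_pass, second_pass, total, largest)
```

No exception is raised on the second pass, which is why this kind of bug can be hard to spot.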

<hr>

A list comprehension in Python works by loading the entire output list into memory. For small or even medium-sized lists, this is generally fine. If you want to sum the squares of the first one-thousand integers, then a list comprehension will solve this problem admirably:


In [1]:
sum([i * i for i in range(1000)])

332833500

But what if you wanted to sum the squares of the first billion integers? If you tried that on your machine, then you may notice that your computer becomes non-responsive. That’s because Python is trying to create a list with one billion integers, which consumes more memory than your computer would like. Your computer may not have the resources it needs to generate an enormous list and store it in memory. If you try to do it anyway, then your machine could slow down or even crash.

When the size of a list becomes problematic, it’s often helpful to use a generator instead of a list comprehension in Python. A generator doesn’t create a single, large data structure in memory, but instead returns an iterable. Your code can ask for the next value from the iterable as many times as necessary or until you’ve reached the end of your sequence, while only storing a single value at a time.

If you were to sum the first billion squares with a generator, then your program will likely run for a while, but it shouldn’t cause your computer to freeze. The example below uses a generator, scaled down to the first ten million integers so that it finishes quickly:


In [3]:
sum(i * i for i in range(10000000))

333333283333335000000

You can tell this is a generator because the expression isn’t surrounded by brackets or curly braces. Optionally, generators can be surrounded by parentheses.

The example above still requires a lot of work, but it performs the operations lazily. Because of lazy evaluation, values are only calculated when they’re explicitly requested. After the generator yields a value (for example, 567 * 567), it can add that value to the running sum, then discard that value and generate the next value (568 * 568). When the sum function requests the next value, the cycle starts over. This process keeps the memory footprint small.

map() also operates lazily, meaning memory won’t be an issue if you choose to use it in this case:

In [4]:
sum(map(lambda i: i*i, range(10000000)))

333333283333335000000

It’s up to you whether you prefer the generator expression or map().

### Consider Generators Instead of Returning Lists

The simplest choice for functions that produce a sequence of results is to return a list of
items. For example, say you want to find the index of every word in a string. Here, I
accumulate results in a list using the append method and return it at the end of the
function:

In [30]:
def index_words(text):
    result = []
    if text:
        result.append(0)
    for index, letter in enumerate(text):
        if letter == ' ':
            result.append(index + 1)
    return result

This works as expected for some sample input.

In [32]:
address = 'Four score and seven years ago…'
result = index_words(address)
print(result)

[0, 5, 11, 15, 21, 27]


There are two problems with the index_words function.

The first problem is that the code is a bit dense and noisy. Each time a new result is found,
I call the append method. The method call’s bulk (result.append) deemphasizes the
value being added to the list (index + 1). There is one line for creating the result list
and another for returning it. While the function body contains ~130 characters (without
whitespace), only ~75 characters are important.

The second problem is that index_words must store every result in the list before it
returns. For huge inputs, this can cause your program to run out of memory and crash.

A better way to write this function is using a generator. Generators are functions that use
yield expressions. When called, generator functions do not actually run but instead
immediately return an iterator. With each call to the next built-in function, the iterator
will advance the generator to its next yield expression. Each value passed to yield by
the generator will be returned by the iterator to the caller.

Here, I define a generator function that produces the same results as before:

In [34]:
def index_words_iter(text):
    if text:
        yield 0
    for index, letter in enumerate(text):
        if letter == ' ':
            yield index + 1

It’s significantly easier to read because all interactions with the result list have been
eliminated. Results are passed to yield expressions instead. The iterator returned by the
generator call can easily be converted to a list by passing it to the list built-in function.

In [35]:
result = list(index_words_iter(address))

In [36]:
result

[0, 5, 11, 15, 21, 27]
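
Since the generator version produces its results on demand, a caller that needs only the first few indexes does not have to pay for the rest. The sketch below uses itertools.islice for that; the sample string is a plain-ASCII variant of the one above.

```python
import itertools

def index_words_iter(text):
    if text:
        yield 0
    for index, letter in enumerate(text):
        if letter == ' ':
            yield index + 1

address = 'Four score and seven years ago'

# islice stops pulling from the generator after three items,
# so the rest of the text is never scanned.
first_three = list(itertools.islice(index_words_iter(address), 3))
print(first_three)
```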

## Profiling Generator Performance

You learned earlier that generators are a great way to optimize memory. While an infinite sequence generator is an extreme example of this optimization, let’s amp up the number squaring examples you just saw and inspect the size of the resulting objects. You can do this with a call to sys.getsizeof():

In [42]:
import sys
nums_squared_lc = [i ** 2 for i in range(10000)]
sys.getsizeof(nums_squared_lc)

87624

In [43]:
nums_squared_gc = (i ** 2 for i in range(10000))
print(sys.getsizeof(nums_squared_gc))

120


In this case, the list you get from the list comprehension is 87,624 bytes, while the generator object is only 120. This means that the list is over 700 times larger than the generator object!

There is one thing to keep in mind, though. If the list is smaller than the running machine’s available memory, then list comprehensions can be faster to evaluate than the equivalent generator expression. To explore this, let’s sum across the results from the two comprehensions above. You can generate a readout with cProfile.run():

In [44]:
import cProfile
cProfile.run('sum([i * 2 for i in range(10000)])')

         5 function calls in 0.002 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 <string>:1(<listcomp>)
        1    0.000    0.000    0.002    0.002 <string>:1(<module>)
        1    0.000    0.000    0.002    0.002 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




In [45]:
cProfile.run('sum((i * 2 for i in range(10000)))')

         10005 function calls in 0.004 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10001    0.003    0.000    0.003    0.000 <string>:1(<genexpr>)
        1    0.000    0.000    0.004    0.004 <string>:1(<module>)
        1    0.000    0.000    0.004    0.004 {built-in method builtins.exec}
        1    0.002    0.002    0.004    0.004 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




Here, you can see that summing across all values in the list comprehension took about half the time of summing across the generator (0.002 seconds versus 0.004 seconds in this run). If speed is an issue and memory isn’t, then a list comprehension is likely a better tool for the job.

> Note: These measurements aren’t only valid for objects made with generator expressions. They’re also the same for objects made from the analogous generator function since the resulting generators are equivalent.
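
cProfile adds overhead to every function call, which penalizes the generator version in particular. As a complementary sketch, timeit can be used to compare the two forms with less distortion. The repeat count of 100 is an arbitrary choice, and the timings will differ from machine to machine, so no expected output is shown.

```python
import sys
import timeit

N = 10_000

# Statements are passed as strings so that timeit compiles and times them itself.
list_time = timeit.timeit('sum([i * 2 for i in range(10000)])', number=100)
gen_time = timeit.timeit('sum(i * 2 for i in range(10000))', number=100)

list_size = sys.getsizeof([i * 2 for i in range(N)])
gen_size = sys.getsizeof(i * 2 for i in range(N))

print(f"list comprehension:   {list_time:.4f} s for 100 runs, {list_size} bytes")
print(f"generator expression: {gen_time:.4f} s for 100 runs, {gen_size} bytes")
```

Note that sys.getsizeof reports only the size of the list object itself, not of the integers it refers to, so the real memory gap is larger than the byte counts suggest.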

Remember, list comprehensions return full lists, while generator expressions return generators. Generators work the same whether they’re built from a function or an expression. Using an expression just allows you to define simple generators in a single line, with an assumed yield at the end of each inner iteration.

The Python yield statement is certainly the linchpin on which all of the functionality of generators rests, so let’s dive into how yield works in Python.

## Advanced topics

### Sentence Take #3: A Generator Function

A Pythonic implementation of the same functionality uses a generator function to replace
the SentenceIterator class.

In [111]:
import re
import reprlib
RE_WORD = re.compile(r'\w+')  # Raw string, so \w is not treated as a string escape.

class Sentence:
    def __init__(self, text):
        self.text = text
        self.words = RE_WORD.findall(text)
    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)
    def __iter__(self):
        for word in self.words:  # Iterate over self.words.
            yield word  # Yield the current word.
        # This return is not needed; the function can just "fall through" and
        # return automatically. Either way, a generator function doesn't raise
        # StopIteration: it simply exits when it's done producing values.
        return

Back in the Sentence code in Example 14-4, `__iter__` called the SentenceIterator
constructor to build an iterator and return it. Now the iterator in Example 14-5 is in
fact a generator object, built automatically when the `__iter__` method is called, because
`__iter__` here is a generator function.

### How a Generator Function Works

Any Python function that has the yield keyword in its body is a generator function: a
function which, when called, returns a generator object. In other words, a generator
function is a generator factory.

> The only syntax distinguishing a plain function from a generator
function is the fact that the latter has a yield keyword somewhere
in its body. Some argued that a new keyword like gen should
be used for generator functions instead of def, but Guido did not
agree.

Here is the simplest function useful to demonstrate the behavior of a generator:

In [118]:
def gen_123():  # Any Python function that contains the yield keyword is a generator function.
    # Usually the body of a generator function has a loop, but not necessarily;
    # here I just repeat yield three times.
    yield 1
    yield 2
    yield 3

In [119]:
gen_123 # Looking closely, we see gen_123 is a function object.

<function __main__.gen_123()>

In [120]:
gen_123() #But when invoked, gen_123() returns a generator object.

<generator object gen_123 at 0x7f39acbd65e8>

In [121]:
for i in gen_123(): # Generators are iterators that produce the values of the expressions passed to yield.
    print(i)

1
2
3


In [122]:
g = gen_123() # For closer inspection, we assign the generator object to g.

In [123]:
next(g) # Because g is an iterator, calling next(g) fetches the next item produced by yield.

1

In [124]:
next(g)

2

In [125]:
next(g)

3

In [127]:
next(g) # When the body of the function completes, the generator object raises a StopIteration.

StopIteration: 

A generator function builds a generator object that wraps the body of the function.
When we invoke next(…) on the generator object, execution advances to the next yield
in the function body, and the next(…) call evaluates to the value yielded when the function
body is suspended. Finally, when the function body returns, the enclosing generator
object raises StopIteration, in accordance with the Iterator protocol.

> I find it helpful to be strict when talking about the results obtained
from a generator: I say that a generator yields or produces
values. But it’s confusing to say a generator “returns” values. Functions
return values. Calling a generator function returns a generator.
A generator yields or produces values. A generator doesn’t
“return” values in the usual way: the return statement in the body
of a generator function causes StopIteration to be raised by the
generator object.
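
A minimal sketch of that last point: in Python 3, a value given to return inside a generator function is not yielded; it ends up in the value attribute of the StopIteration exception. The gen_with_return() function is made up for this illustration.

```python
def gen_with_return():
    yield 1
    yield 2
    return 'done'   # becomes the value attribute of StopIteration

g = gen_with_return()
produced = [next(g), next(g)]

try:
    next(g)
except StopIteration as exc:
    returned = exc.value

# A for loop or list() handles StopIteration itself and ignores the return value.
as_list = list(gen_with_return())

print(produced, returned, as_list)
```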

This example makes the interaction between a for loop and the body of the function
more explicit.

In [128]:
def gen_AB():
    print('start')
    yield 'A'
    print('continue')
    yield 'B'
    print('end.')

In [129]:
for c in gen_AB():
    print('-->', c)

start
--> A
continue
--> B
end.


- The generator function is defined like any function, but uses yield.
- The first implicit call to next() in the for loop will print 'start' and stop at the first yield, producing the value 'A'.
- The second implicit call to next() in the for loop will print 'continue' and stop at the second yield, producing the value 'B'.
- The third call to next() will print 'end.' and fall through the end of the function body, causing the generator object to raise StopIteration.
- To iterate, the for machinery does the equivalent of g = iter(gen_AB()) to get a generator object, and then next(g) at each iteration.
- The loop block prints --> and the value returned by next(g). But this output will be seen only after the output of the print calls inside the generator function.
- The string 'start' appears as a result of print('start') in the generator function body.
- yield 'A' in the generator function body produces the value A consumed by the for loop, which gets assigned to the c variable and results in the output --> A.
- Iteration continues with a second call to next(g), advancing the generator function body from yield 'A' to yield 'B'. The text continue is output because of the second print in the generator function body.
- yield 'B' produces the value B consumed by the for loop, which gets assigned to the c loop variable, so the loop prints --> B.
- Iteration continues with a third call to next(g), advancing to the end of the body of the function. The text end. appears in the output because of the third print in the generator function body.
- When the generator function body runs to the end, the generator object raises StopIteration. The for loop machinery catches that exception, and the loop terminates cleanly.
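
The steps above can be spelled out by hand. The following sketch replaces the for statement with explicit iter() and next() calls and an explicit StopIteration handler; the seen list exists only to record what the loop received.

```python
def gen_AB():
    print('start')
    yield 'A'
    print('continue')
    yield 'B'
    print('end.')

seen = []
g = iter(gen_AB())       # for a generator, iter() returns the generator itself
while True:
    try:
        c = next(g)      # runs the function body up to the next yield
    except StopIteration:
        break            # the for statement does this silently
    print('-->', c)
    seen.append(c)
```

Running it prints the same five lines as the for loop version.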

Now hopefully it’s clear how `Sentence.__iter__` in Example 14-5 works: `__iter__` is
a generator function which, when called, builds a generator object that implements the
iterator interface, so the SentenceIterator class is no longer needed.

This second version of Sentence is much shorter than the first, but it’s not as lazy as it
could be. Nowadays, laziness is considered a good trait, at least in programming languages
and APIs. A lazy implementation postpones producing values to the last possible
moment. This saves memory and may avoid useless processing as well.

### Sentence Take #4: A Lazy Implementation

The Iterator interface is designed to be lazy: next(my_iterator) produces one item
at a time. The opposite of lazy is eager: lazy evaluation and eager evaluation are actual
technical terms in programming language theory.

Our Sentence implementations so far have not been lazy because the `__init__` eagerly
builds a list of all words in the text, binding it to the self.words attribute. This will
entail processing the entire text, and the list may use as much memory as the text itself (probably more; it depends on how many nonword characters are in the text). Most of
this work will be in vain if the user only iterates over the first couple words.
Whenever you are using Python 3 and start wondering “Is there a lazy way of doing
this?”, often the answer is “Yes.”

The re.finditer function is a lazy version of re.findall which, instead of a list, returns
an iterator producing re.Match instances on demand. If there are many
matches, re.finditer saves a lot of memory. Using it, this version of Sentence is
now lazy: it only produces the next word when it is needed.
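
Before looking at the class, here is a small sketch of the difference between the two functions on their own. The sample text is arbitrary, and itertools.islice is used only to take the first two matches.

```python
import itertools
import re

RE_WORD = re.compile(r'\w+')
text = 'To be, or not to be, that is the question'

all_words = RE_WORD.findall(text)     # eager: builds the complete list of strings
matches = RE_WORD.finditer(text)      # lazy: an iterator of match objects

first_two = [m.group() for m in itertools.islice(matches, 2)]
next_word = next(matches).group()     # the iterator resumes where it stopped

print(len(all_words), first_two, next_word)
```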

In [None]:
import re
import reprlib

RE_WORD = re.compile(r'\w+')

class Sentence:
    def __init__(self, text):
        self.text = text # No need to have a words list.
    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)
    def __iter__(self):
        # finditer builds an iterator over the matches of RE_WORD on self.text, yielding re.Match instances.
        for match in RE_WORD.finditer(self.text):
            # match.group() extracts the actual matched text from the re.Match instance.
            yield match.group()

Generator functions are an awesome shortcut, but the code can be made even shorter
with a generator expression.

### Sentence Take #5: A Generator Expression

Simple generator functions like the one in the previous Sentence class 
can be replaced by a generator expression.

A generator expression can be understood as a lazy version of a list comprehension: it
does not eagerly build a list, but returns a generator that will lazily produce the items on demand. In other words, if a list comprehension is a factory of lists, a generator
expression is a factory of generators.

In [131]:
def gen_AB():
    print('start')
    yield 'A'
    print('continue')
    yield 'B'
    print('end.')

In [132]:
res1 = [x*3 for x in gen_AB()]

start
continue
end.


In [137]:
for i in res1:
    print('-->', i)

--> AAA
--> BBB


In [134]:
res2 = (x*3 for x in gen_AB()) 

In [135]:
res2

<generator object <genexpr> at 0x7f39acbd67c8>

In [136]:
for i in res2: 
    print('-->', i)

start
--> AAA
continue
--> BBB
end.


1. The list comprehension eagerly iterates over the items yielded by the generator object produced by calling gen_AB(): 'A' and 'B'. Note the output in the next lines: start, continue, end.
2. This for loop is iterating over the res1 list produced by the list comprehension.
3. The generator expression returns res2. The call to gen_AB() is made, but that call returns a generator, which is not consumed here. 
4. res2 is a generator object.
5. Only when the for loop iterates over res2, the body of gen_AB actually executes.
6. Each iteration of the for loop implicitly calls next(res2), advancing gen_AB to the next yield. Note how the output of gen_AB interleaves with the output of the print in the for loop.

So, a generator expression produces a generator, and we can use it to further reduce the
code in the Sentence class.

In [138]:
import re
import reprlib

RE_WORD = re.compile(r'\w+')

class Sentence:
    def __init__(self, text):
        self.text = text
    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)
    def __iter__(self):
        return (match.group() for match in RE_WORD.finditer(self.text))

The only difference from Example 14-7 is the `__iter__` method, which here is not a
generator function (it has no yield) but uses a generator expression to build a generator
and then returns it. The end result is the same: the caller of `__iter__` gets a generator
object.

Generator expressions are syntactic sugar: they can always be replaced by generator
functions, but sometimes are more convenient. The next section is about generator
expression usage.

### Generator Expressions: When to Use Them

In Example 14-9, we saw that a generator expression is a syntactic shortcut to create a
generator without defining and calling a function. On the other hand, generator functions are much more flexible: you can code complex logic with multiple statements, and can even use them as coroutines.

For the simpler cases, a generator expression will do, and it’s easier to read at a glance,
as the Vector example shows.

My rule of thumb in choosing the syntax to use is simple: if the generator expression
spans more than a couple of lines, I prefer to code a generator function for the sake of
readability. Also, because generator functions have a name, they can be reused. You can
always name a generator expression and use it later by assigning it to a variable, of course,
but that is stretching its intended usage as a one-off generator.
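
As a rough illustration of the trade-off, the sketch below writes the same one-line transformation both ways. The names `squares_gen` and `data` are hypothetical and exist only for this comparison.

```python
def squares_gen(numbers):
    """Generator function: has a name, so it can be reused and extended."""
    for n in numbers:
        yield n * n

data = [1, 2, 3]  # hypothetical input

# Generator expression: a one-off generator built inline.
from_genexp = (n * n for n in data)

# Generator function: calling it builds an equivalent generator.
from_function = squares_gen(data)

genexp_values = list(from_genexp)
function_values = list(from_function)
print(genexp_values)    # [1, 4, 9]
print(function_values)  # [1, 4, 9]
```

For logic this small the expression is arguably easier to read; once the body needs several statements, the named function tends to be clearer.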

The Sentence examples we’ve seen exemplify the use of generators playing the role of
classic iterators: retrieving items from a collection. But generators can also be used to
produce values independent of a data source. The next section shows an example of
that.

### Another Example: Arithmetic Progression Generator

The classic Iterator pattern is all about traversal: navigating some data structure. But a
standard interface based on a method to fetch the next item in a series is also useful
when the items are produced on the fly, instead of retrieved from a collection. For
example, the range built-in generates a bounded arithmetic progression (AP) of integers,
and the itertools.count function generates a boundless AP.
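
A small sketch of the two built-in options may help. It uses `itertools.islice` (not mentioned in the text above) to take a finite slice of the unbounded series, and the variable names are illustrative assumptions.

```python
import itertools

# range: a bounded arithmetic progression of integers.
bounded = list(range(0, 10, 3))

# itertools.count: an unbounded progression, so we take only a few items.
unbounded = itertools.count(0, 3)
first_four = list(itertools.islice(unbounded, 4))

print(bounded)     # [0, 3, 6, 9]
print(first_four)  # [0, 3, 6, 9]

# range only accepts integers, which motivates a more general tool.
try:
    range(0, 1, 0.5)
    range_accepts_float = True
except TypeError:
    range_accepts_float = False
print(range_accepts_float)  # False
```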

We’ll cover itertools.count in the next section, but what if you need to generate a
bounded AP of numbers of any type?

Example 14-10 shows a few console tests of an ArithmeticProgression class we will
see in a moment. The signature of the constructor in Example 14-10 is
ArithmeticProgression(begin, step[, end]). The range() function is similar to the ArithmeticProgression here, but its full signature is range(start, stop[, step]). I chose
to implement a different signature because for an arithmetic progression the step is
mandatory but end is optional. I also changed the argument names from start/stop
to begin/end to make it very clear that I opted for a different signature. In each test in
Example 14-10 I call list() on the result to inspect the generated values.

In [139]:
class ArithmeticProgression:
    def __init__(self, begin, step, end=None):
        self.begin = begin
        self.step = step
        self.end = end # None -> "infinite" series
    def __iter__(self):
        result = type(self.begin + self.step)(self.begin)
        forever = self.end is None
        index = 0
        while forever or result < self.end:
            yield result
            index += 1
            result = self.begin + self.step * index

- `__init__` requires two arguments: begin and step. end is optional; if it’s None, the series will be unbounded.
- The first line of `__iter__` produces a result value equal to self.begin, but coerced to the type of the subsequent additions.
- For readability, the forever flag will be True if the self.end attribute is None, resulting in an unbounded series.
- The while loop runs forever or until result matches or exceeds self.end. When this loop exits, so does the function.
- The yield statement produces the current result.
- The last line calculates the next potential result. It may never be yielded, because the while loop may terminate.
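
The coercion step is the least obvious part, so here is a minimal sketch of that expression on its own. The variable names below are assumptions chosen for illustration.

```python
from fractions import Fraction

# int begin with float step: the sum is a float, so begin is coerced to float.
begin, step = 1, 0.5
float_result = type(begin + step)(begin)
print(repr(float_result))  # 1.0

# int begin with Fraction step: the sum is a Fraction, so begin becomes one too.
begin, step = 0, Fraction(1, 3)
fraction_result = type(begin + step)(begin)
print(repr(fraction_result))  # Fraction(0, 1)
```

Without this step, the first item yielded would keep the type of begin while every later item had the type of begin + step.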

In [144]:
ap = ArithmeticProgression(0, 1, 3)

In [145]:
list(ap)

[0, 1, 2]

In [146]:
ap = ArithmeticProgression(1, .5, 3)

In [147]:
list(ap)

[1.0, 1.5, 2.0, 2.5]

In [148]:
ap = ArithmeticProgression(0, 1/3, 1)

In [149]:
list(ap)

[0.0, 0.3333333333333333, 0.6666666666666666]

In [150]:
from fractions import Fraction
ap = ArithmeticProgression(0, Fraction(1, 3), 1)
list(ap)

[Fraction(0, 1), Fraction(1, 3), Fraction(2, 3)]

In [151]:
from decimal import Decimal
ap = ArithmeticProgression(0, Decimal('.1'), .3)
list(ap)

[Decimal('0'), Decimal('0.1'), Decimal('0.2')]

Note that the type of the numbers in the resulting arithmetic progression follows the type
of begin or step, according to the numeric coercion rules of Python arithmetic. In
Example 14-10, you see lists of int, float, Fraction, and Decimal numbers.

In the last line of Example 14-11, instead of simply incrementing the result with
self.step iteratively, I opted to use an index variable and calculate each result by
adding self.begin to self.step multiplied by index to reduce the cumulative effect
of errors when working with floats.
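
The effect is easy to reproduce. The sketch below compares repeated addition with the multiply-by-index approach for a step of 0.1; the variable names are illustrative assumptions.

```python
step = 0.1

# Repeated addition: each += adds a small rounding error to the last one.
accumulated = 0.0
for _ in range(10):
    accumulated += step

# Multiplying by the index: a single rounding per computed value.
multiplied = 0.0 + step * 10

print(accumulated)  # 0.9999999999999999
print(multiplied)   # 1.0
```

Multiplication does not make floats exact, but it keeps the error of each item independent of how many items came before it.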

The ArithmeticProgression class from Example 14-11 works as intended, and is a
clear example of the use of a generator function to implement the `__iter__` special
method. However, if the whole point of a class is to build a generator by implementing
`__iter__`, the class can be reduced to a generator function. A generator function is, after
all, a generator factory.

Example 14-12 shows a generator function called aritprog_gen that does the same job
as ArithmeticProgression but with less code. The tests in Example 14-10 all pass if
you just call aritprog_gen instead of ArithmeticProgression.

In [153]:
def aritprog_gen(begin, step, end=None):
    result = type(begin + step)(begin)
    forever = end is None
    index = 0
    while forever or result < end:
        yield result
        index += 1
        result = begin + step * index
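
As a quick check, the sketch below repeats the function and reruns some of the earlier tests against it. The unbounded case uses `itertools.islice`, which is an addition for illustration and not part of the original tests.

```python
from fractions import Fraction
from itertools import islice

def aritprog_gen(begin, step, end=None):
    result = type(begin + step)(begin)
    forever = end is None
    index = 0
    while forever or result < end:
        yield result
        index += 1
        result = begin + step * index

ints = list(aritprog_gen(0, 1, 3))
floats = list(aritprog_gen(1, .5, 3))
fractions_ = list(aritprog_gen(0, Fraction(1, 3), 1))

# With no end the series is unbounded, so take only the first three items.
unbounded_head = list(islice(aritprog_gen(10, 5), 3))

print(ints)            # [0, 1, 2]
print(floats)          # [1.0, 1.5, 2.0, 2.5]
print(fractions_)      # [Fraction(0, 1), Fraction(1, 3), Fraction(2, 3)]
print(unbounded_head)  # [10, 15, 20]
```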