Skip to content

Latest commit

 

History

History
executable file
·
162 lines (103 loc) · 8.31 KB

第十四章-可迭代之迭代器和生成器.md

File metadata and controls

executable file
·
162 lines (103 loc) · 8.31 KB

CHAPTER 14 Iterables, iterators and generators

When I see patterns in my programs, I consider it a sign of trouble. The shape of a program should reflect only the problem it needs to solve. Any other regularity in the code is a sign, to me at least, that I’m using abstractions that aren’t powerful enough — often that I’m generating by hand the expansions of some macro that I need to write [1]. — Paul Graham Lisp hacker and venture capitalist

Iteration is fundamental to data processing. And when scanning datasets that don’t fit in memory, we need a way to fetch the items lazily, that is, one at a time and on demand. This is what the Iterator pattern is about. This chapter shows how the Iterator pattern is built into the Python language so you never need to implement it by hand.

迭代是处理数据的基础。

Python does not have macros like Lisp (Paul Graham’s favorite language), so abstracting away the Iterator pattern required changing the language: the yield keyword was added in Python 2.2 (2001)[2]. The yield keyword allows the construction of generators, which work as iterators.

Note

Every generator is an iterator: generators fully implement the iter‐ ator interface. But an iterator — as defined in the GoF book — re‐ trieves items from a collection, while a generator can produce items “out of thin air”. That’s why the Fibonacci sequence generator is a common example: an infinite series of numbers cannot be stored in a collection. However, be aware that the Python community treats iterator and generator as synonyms most of the time.

  1. From Revenge of the Nerds, a blog post.
  2. Python 2.2 users could use yield with the directive from future import generators; yield became available by default in Python 2.3.

Python 3 uses generators in many places. Even the range() built-in now returns a generator-like object instead of full-blown lists like before. If you must build a list from range, you have to be explicit, e.g. list(range(100)).

Every collection in Python is iterable, and iterators are used internally to support:

Python中的每一个集合都是可迭代的,迭代器用于集合的内部以支持以下动作:

• for loops;
• collection types construction and extension;
• looping over text files line by line;
• list, dict and set comprehensions;
• tuple unpacking;
• unpacking actual parameters with * in function calls.

This chapter covers the following topics:

• How the iter(...) built-in function is used internally to handle iterable objects.

• How to implement the classic Iterator pattern in Python.

• How a generator function works in detail, with line by line descriptions.

• How the classic Iterator can be replaced by a generator function or generator ex‐ pression.

• Leveraging the general purpose generator functions in the standard library.

• Using the new yield from statement to combine generators.

• A case study: using generator functions in a database conversion utility designed to work with large data sets.

• Why generators and coroutines look alike but are actually very different and should not be mixed.

We’ll get started studying how the iter(...) function makes sequences iterable.

Sentence take #1: a sequence of words

We’ll start our exploration of iterables by implementing a Sentence class: you give its constructor a string with some text, and then you can iterate word by word. The first version will implement the sequence protocol, and it’s iterable because all sequences are iterable, as we’ve seen before, but now we’ll see exactly why.

我们通过实现的Sentence类开始对可迭代的探究:你对类的构造器赋值一些文本,然后你就可以一个词接着一个词的迭代。第一个版本会将实现序列协议,

Example 14-1 shows a Sentence class that extracts words from a text by index.

Example 14-1. sentence.py: A Sentence as a sequence of words. import re import reprlib

RE_WORD = re.compile('\w+')

class Sentence:
    def __init__(self, text): self.text = text
	  	  self.words = RE_WORD.findall(text) def __getitem__(self, index):
        return self.words[index] def __len__(self):
        return len(self.words) def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)

1: re.findall returns a list with all non-overlapping matches of the regular expression, as a list of strings.

2: self.words holds the result of .findall, so we simply return the word at the given index.

3: To complete the sequence protocol, we implement __len__ — but it is not needed to make an iterable object.

4: reprlib.repr is a utility function to generate abbreviated string representations of data structures that can be very large [3].

[3] Wefirstusedreprlibin“Vectortake#1:Vector2dcompatible”onpage278.

By default, reprlib.repr limits the generated string to 30 characters. See the following console session to see how Sentence is used:

默认情况下,reprlib.repr限制生成的字符串为30个字符串。在下列终端中的会话你可以看到Sentence是如何使用的:

Example 14-2. Testing iteration on a Sentence instance.

>>> s = Sentence('"The time has come," the Walrus said,') # >>> s
Sentence('"The time ha... Walrus said,') # >>>forwordins: #
... print(word) The
time
has
come
the
Walrus
said
>>> list(s) # 4
['The', 'time', 'has', 'come', 'the', 'Walrus', 'said']

1: A sentence is created from a string.

2: Note the output of __repr__ using ... generated by reprlib.repr.

3: Sentence instances are iterable, we’ll see why in a moment.

4: Being iterable, Sentence objects can be used as input to build lists and other iterable types.

In the next pages, we’ll develop other Sentence classes that pass the tests in Example 14-2. However, the implementation in Example 14-1 is different from all the others because it’s also a sequence, so you can get words by index:

在下一页,我们会编写其他的Sentence类

>>> s[0]
  'The' 
>>> s[5]
  'Walrus'
>>> s[-1]
 'said'

Every Python programmer knows that sequences are iterable. Now we’ll see precisely why.

Why sequences are iterable: the iter function

##为什么序列可以迭代:iter函数 Whenever the interpreter needs to iterate over an object x, it automatically calls iter(x). The iter built-in function:

当解释器需要迭代对象x时,它自动地调用iter(x)。该iter为内建函数:

  1. Checks whether the object implements, __iter__, and calls that to obtain an iterator;
  2. If __iter__ is not implemented, but __getitem__ is implemented, Python creates an iterator that attempts to fetch items in order, starting from index 0 (zero);
  3. If that fails, Python raises TypeError, usually saying "'C' object is not itera ble", where C is the class of the target object.

That is why any Python sequence is iterable: they all implement __getitem__ . In fact, the standard sequences also implement __iter__, and yours should too, because the special handling of __getitem__ exists for backward compatibility reasons and may be gone in the future (although it is not deprecated as I write this).

这就是为什么Python序列可以迭代的原因:它们全都实现了__getitem__。实际上,标准序列也都实现了__iter__,而且你应该为自己序列实现该方法,因为

As mentioned in “Python digs sequences” on page 312, this is an extreme form of duck typing: an object is considered iterable not only when it implements the special method iter, but also when it implements getitem, as long as getitem accepts int keys starting from 0.

In the goose-typing approach, the definition for an iterable is simpler but not as flexible: an object is considered iterable if it implements the iter method. No subclassing or registration is required, because abc.Iterable implements the subclasshook, as seen in “Geese can behave as ducks” on page 340. Here is a demonstration:

>>> class Foo:
... def __iter__(self):
... pass
...
>>> from collections import abc >>> issubclass(Foo, abc.Iterable) True
>>> f = Foo()
>>> isinstance(f, abc.Iterable) True

However, note that our initial Sentence class does not pass the issubclass(Sentence, abc.Iterable) test, even though it is iterable in practice.