# Generators

A generator function in Python is defined like a normal function, but it produces each value with the `yield` keyword rather than `return`. If the body of a `def` contains `yield`, the function automatically becomes a generator function.

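`yield` hands a value back to the caller and suspends the function at that point. Unlike `return`, it does not terminate the function: execution resumes from the same spot when the next value is requested.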

In [32]:
def generate_range(n):
    counter = 0
    while counter <= n:
        yield counter
        counter += 1

for elem in generate_range(3):
    print(elem)

0
1
2
3
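
To make the suspend-and-resume behaviour visible, here is a small sketch that logs each step on both sides. The names `traced_range` and `log` are illustrative helpers introduced only for this example.

```python
log = []

def traced_range(n):
    """Like generate_range, but records when each step of the body runs."""
    counter = 0
    while counter <= n:
        log.append(f"generator: yielding {counter}")
        yield counter
        # Execution continues here only when the consumer asks for more
        log.append(f"generator: resumed after {counter}")
        counter += 1

gen = traced_range(1)
log.append("consumer: created generator")  # the body has not started yet

first = next(gen)
log.append(f"consumer: received {first}")
second = next(gen)
log.append(f"consumer: received {second}")

for line in log:
    print(line)
```

The log shows that creating the generator runs none of its body, and that the line after `yield` runs only once the consumer requests the following value.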


Generators can also be recursive.

In [34]:
def generate_range(n):
    if n > 0:
        yield from generate_range(n-1)
    yield n

for elem in generate_range(3):
    print(elem)

0
1
2
3
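
Recursion with `yield from` is especially handy for nested data. As a further sketch, the hypothetical `flatten` helper below (not part of the original notebook) walks arbitrarily nested lists and delegates to itself for each sublist.

```python
def flatten(nested):
    """Yield the leaf values of arbitrarily nested lists, left to right."""
    for item in nested:
        if isinstance(item, list):
            # Delegate to a sub-generator and re-yield everything it produces
            yield from flatten(item)
        else:
            yield item

flat = list(flatten([1, [2, [3, 4]], [], 5]))
print(flat)  # [1, 2, 3, 4, 5]
```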


### Generator Object

Calling a generator function returns a generator object. A generator object is an iterator (and therefore also iterable): the generator type implements the iterator protocol, `__iter__` and `__next__`. A generator object is consumed either by passing it to the built-in `next` function or by using it in a `for` loop.

In [38]:
# x is a generator object
x = generate_range(2)

# Iterate over the generator object with the built-in next(),
# which calls x.__next__() in Python 3
print(next(x))
print(next(x))
print(next(x))


0
1
2
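
The relationship to the iterator protocol can be checked directly. The sketch below redefines `generate_range` so it runs on its own, and also shows what happens once the generator runs out of values.

```python
from collections.abc import Iterable, Iterator

def generate_range(n):
    counter = 0
    while counter <= n:
        yield counter
        counter += 1

x = generate_range(1)

is_iterator = isinstance(x, Iterator)  # generator objects are iterators
is_iterable = isinstance(x, Iterable)  # and every iterator is iterable
returns_self = iter(x) is x            # iter() on an iterator returns itself

values = [next(x), next(x)]            # consumes 0 and 1

# A further next() call on an exhausted generator raises StopIteration
try:
    next(x)
    exhausted = False
except StopIteration:
    exhausted = True

# Passing a default to next() avoids the exception
after_default = next(x, None)

print(is_iterator, is_iterable, returns_self, values, exhausted, after_default)
```

A `for` loop handles `StopIteration` internally, which is why it ends cleanly when the generator is exhausted.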


Generators bring two main advantages over lists:
 1. With a generator, we avoid needing to store all generated values simultaneously, which can bring significant memory savings.
 2. In fact, it might be *impossible* to store all values generated!  A generator may yield *infinitely* many values.
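
The first point can be made concrete with a rough measurement. This is only a sketch: `sys.getsizeof` reports the size of the container itself (for a list, its array of references), not the integers it refers to, so the real gap is larger than shown. The variable names are illustrative.

```python
import sys

n = 100_000

squares_list = [i * i for i in range(n)]  # all values exist at once
squares_gen = (i * i for i in range(n))   # values are produced on demand

list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)
print(f"list: {list_bytes} bytes, generator: {gen_bytes} bytes")

# Both can feed the same computation; the generator never holds
# more than one value at a time while doing so
total = sum(squares_gen)
```

The generator's size does not depend on `n`, whereas the list grows with it.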

In [12]:
def all_positive_integers():
    """Generator for all numbers greater than zero"""
    
    i = 1
    while True:
        # Note: loop is intentionally nonterminating!
        yield i
        i += 1

def firstn(stream, num):
    """Yield the first num values (num must be positive) generated by stream."""
    
    for v in stream:
        yield v
        num -= 1
        if num <= 0:
            return

Yes, a naive reading of `all_positive_integers` shows it running forever.  However, it continually invokes `yield`, which *suspends execution of the generator, until the consumer is ready for the next value*.  A consumer that only asks for finitely many values can terminate, even when the generator has the *capacity* to produce infinitely many values.  Here are some examples of pulling out interesting subsequences of `all_positive_integers`.

In [13]:
list(firstn(all_positive_integers(), 10))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [14]:
# Test for primes by trial division
def is_prime(n):
    return n > 1 and all(n % m != 0 for m in range(2, n))

list(firstn((n for n in all_positive_integers() if is_prime(n)), 20))

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]
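
For comparison, the standard library covers the same ground: `itertools.count` is an infinite counter and `itertools.islice` takes a bounded slice of any iterator. The sketch below reproduces the two examples above with those tools; `is_prime` is repeated so the block runs on its own.

```python
from itertools import count, islice

# count(1) yields 1, 2, 3, ... without end; islice stops after 10 items
first_ten = list(islice(count(1), 10))
print(first_ten)

def is_prime(n):
    return n > 1 and all(n % m != 0 for m in range(2, n))

# Filter the infinite stream lazily, then take the first five matches
first_primes = list(islice((n for n in count(1) if is_prime(n)), 5))
print(first_primes)
```

One difference worth noting: `islice(stream, 0)` yields nothing, whereas `firstn` as written yields one value before checking `num`, which is why its docstring requires a positive `num`.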

Note that generators do not evaluate all values at once, and therefore are not indexable. We can, however, convert them to a list first and then subscript.

In [39]:
# This will not work: generator objects do not support subscripting
#print(generate_range(10)[3])

# Convert to a list first, then subscript
print(list(generate_range(10))[3])

3
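
A short sketch of what actually happens in both cases. It also shows a related consequence of lazy evaluation: a generator can be traversed only once, so after `list()` has consumed it nothing is left. `generate_range` is redefined so the block is self-contained.

```python
def generate_range(n):
    counter = 0
    while counter <= n:
        yield counter
        counter += 1

gen = generate_range(10)

# Subscripting a generator object raises TypeError
try:
    gen[3]
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Converting to a list consumes the generator completely
as_list = list(gen)
fourth = as_list[3]

# The generator is now exhausted; a second pass yields nothing
leftover = list(gen)
print(fourth, leftover)
```

If the values are needed more than once, keep the list, or call the generator function again to get a fresh generator object.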


### Memory Benefits

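Below are two implementations of getting frequent words from a Trie. The first iterates over `trie.get_items()` and the second over `trie.generate_items()`; the `trie` module is not shown here, but the names and the timings below suggest that the former collects every item before returning, while the latter is a generator that yields items one at a time. Note that here the generator version is faster, because the loop can stop as soon as enough words are found, but that may not always be the case.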

In [2]:
from trie import Trie
from text_tokenize import tokenize_sentences

def make_word_trie(text):
    """
    Given a piece of text as a single string, create a Trie whose keys are the
    words in the text, and whose values are the number of times the associated
    word appears in the text
    """
    frequencies = {}
    for sentence in tokenize_sentences(text):
        for word in sentence.split():
            if word not in frequencies:
                frequencies[word] = 0
            frequencies[word] += 1
    t = Trie()
    for word, freq in frequencies.items():
        t.set(word, freq)
    return t

In [3]:
def get_frequent_words(trie, desired_frequency, number_of_words):
    """
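    Return a list of up to number_of_words words from trie whose frequency
    is greater than desired_frequency, iterating over trie.get_items().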
    """
    words = []
    for item, freq in trie.get_items():
        if freq > desired_frequency:
            words.append(item)
            if len(words) == number_of_words:
                break
    return words

In [4]:
def generate_frequent_words(trie, desired_frequency, number_of_words):
    """
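    Return a list of up to number_of_words words from trie whose frequency
    is greater than desired_frequency, iterating over trie.generate_items().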
    """
    words = []
    for item, freq in trie.generate_items():
        if freq > desired_frequency:
            words.append(item)
            if len(words) == number_of_words:
                break
    return words
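
Since the `trie` module itself is not shown, here is a minimal sketch of how the two traversal methods might differ. This `Trie` class is hypothetical and written only for illustration; the real module may be implemented differently. The class-level `visits` counter exists purely to show how many nodes each approach touches.

```python
class Trie:
    """Hypothetical minimal trie (an assumption, not the real `trie` module)."""

    visits = 0  # counts visited nodes, for demonstration only

    def __init__(self):
        self.value = None
        self.children = {}

    def set(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value = value

    def get_items(self, prefix=""):
        """Eager: build and return the complete list of (key, value) pairs."""
        Trie.visits += 1
        items = []
        if self.value is not None:
            items.append((prefix, self.value))
        for ch, child in self.children.items():
            items.extend(child.get_items(prefix + ch))
        return items

    def generate_items(self, prefix=""):
        """Lazy: yield (key, value) pairs one at a time."""
        Trie.visits += 1
        if self.value is not None:
            yield prefix, self.value
        for ch, child in self.children.items():
            yield from child.generate_items(prefix + ch)

t = Trie()
for word, freq in [("a", 5), ("an", 3), ("and", 9), ("to", 7), ("the", 8)]:
    t.set(word, freq)

# Eager version: the whole trie is walked even if only one item is needed
Trie.visits = 0
first_eager = t.get_items()[0]
eager_visits = Trie.visits

# Lazy version: the walk stops as soon as the consumer stops asking
Trie.visits = 0
first_lazy = next(t.generate_items())
lazy_visits = Trie.visits

print(first_eager, eager_visits)
print(first_lazy, lazy_visits)
```

Under these assumptions, a caller that breaks out of its loop early pays for the full traversal with `get_items` but only for the part it consumed with `generate_items`.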

In [6]:
filename = 'lots_of_words.txt'
with open(filename, encoding="utf-8") as f:
    text = f.read()
trie = make_word_trie(text)

In [7]:
import time

# Time the get_items-based version at three frequency thresholds
for threshold in (1000, 10000, 100000):
    start = time.perf_counter()
    frequent_words = get_frequent_words(trie, threshold, 10)
    end = time.perf_counter()
    print("{} words seen more than {} times: \n\tresult: \t{}\n\ttime spent: \t{:.4f} seconds\n".format(
        10,
        threshold,
        frequent_words,
        end - start))

10 words seen more than 1000 times: 
	result: 	['a', 'an', 'and', 'any', 'as', 'again', 'at', 'all', 'after', 'away']
	time spent: 	0.1198 seconds

10 words seen more than 10000 times: 
	result: 	['a', 'and', 'as', 'for', 'to', 'the', 'that', 'be', 'with', 'me']
	time spent: 	0.1205 seconds

10 words seen more than 100000 times: 
	result: 	[]
	time spent: 	0.1219 seconds



In [8]:
import time

# Time the generate_items-based version at the same three thresholds
for threshold in (1000, 10000, 100000):
    start = time.perf_counter()
    frequent_words = generate_frequent_words(trie, threshold, 10)
    end = time.perf_counter()
    print("{} words seen more than {} times: \n\tresult: \t{}\n\ttime spent: \t{:.4f} seconds\n".format(
        10,
        threshold,
        frequent_words,
        end - start))

10 words seen more than 1000 times: 
	result: 	['a', 'an', 'and', 'any', 'as', 'again', 'at', 'all', 'after', 'away']
	time spent: 	0.0047 seconds

10 words seen more than 10000 times: 
	result: 	['a', 'and', 'as', 'for', 'to', 'the', 'that', 'be', 'with', 'me']
	time spent: 	0.0343 seconds

10 words seen more than 100000 times: 
	result: 	[]
	time spent: 	0.0997 seconds

