# Generators

Generators are functions that produce a sequence of values to iterate over. Rather than constructing a whole list and returning it, a generator creates and hands back one element at a time. This is often faster and more space efficient. If we are looking for an element that meets some criteria, we can break as soon as we find it and don't have to create every remaining element. Since only one element is made at a time, we avoid building a massive list that could give us a `MemoryError`.

There are two keywords used in generators. The first is `yield`, which hands a value back to the caller and pauses the function at that point. When the next value is requested on the next iteration, execution resumes from that point. The second is `yield from`, which, when applied to an iterable (a list, a set, another generator), yields everything from it one at a time before continuing with the rest of the function.

In [19]:
def generate_range(n):
    counter = 0
    while counter <= n:
        yield counter
        counter += 1

In [20]:
for elem in generate_range(10):
    print (elem)

0
1
2
3
4
5
6
7
8
9
10
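
To see the pause-and-resume behaviour of `yield` directly, we can drive the generator by hand with the built-in `next` instead of a `for` loop. This is a small sketch that repeats `generate_range` from above so it runs on its own; the variable names are only illustrative.

```python
def generate_range(n):
    counter = 0
    while counter <= n:
        yield counter
        counter += 1

gen = generate_range(2)   # calling the function runs none of its body yet
first = next(gen)         # runs up to the first yield
second = next(gen)        # resumes after the yield, loops, yields again
third = next(gen)

try:
    next(gen)             # the while loop ends, so there is nothing left
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, third, exhausted)  # 0 1 2 True
```

A `for` loop does the same thing behind the scenes: it calls `next` repeatedly and stops when `StopIteration` is raised.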


Generators can also be recursive. In the example below, we `yield from` the recursive results before yielding the value for the current call.

In [21]:
def generate_range(n):
    if n > 0:
        yield from generate_range(n-1)
    yield n

In [22]:
for elem in generate_range(10):
    print (elem)

0
1
2
3
4
5
6
7
8
9
10
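
The same recursive `yield from` pattern applies to other nested structures. As a sketch, the hypothetical `flatten` function below (not part of the lab code) walks a nested list and delegates to itself whenever it meets a sub-list.

```python
def flatten(nested):
    """Yield the non-list items of an arbitrarily nested list, left to right."""
    for item in nested:
        if isinstance(item, list):
            # Delegate to the recursive generator and re-yield all of its values
            yield from flatten(item)
        else:
            yield item

flat = list(flatten([1, [2, [3, 4]], [], 5]))
print(flat)  # [1, 2, 3, 4, 5]
```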


Because generators do not evaluate all of their values at once, they are not indexable. A generator can be converted into a list by passing it to the `list` constructor.

In [23]:
print (generate_range(10))

<generator object generate_range at 0x7f41bd1f21f0>


In [24]:
print (generate_range(10)[3])

TypeError: 'generator' object is not subscriptable

In [None]:
print (list(generate_range(10)))
print (list(generate_range(10))[3])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
3
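
Converting to a list builds every element, which gives up the memory savings. If only one position or a short prefix is needed, `itertools.islice` from the standard library can pull just those items. The helper name `nth` below is an assumption made for this sketch, not something defined in the lab.

```python
from itertools import islice

def generate_range(n):
    counter = 0
    while counter <= n:
        yield counter
        counter += 1

def nth(iterable, index):
    """Return the item at position `index`, consuming only index + 1 items.

    Raises StopIteration if the iterable is too short.
    """
    return next(islice(iterable, index, None))

fourth = nth(generate_range(10), 3)
# Only three values are ever created, even though the range is huge
first_three = list(islice(generate_range(10**9), 3))

print(fourth)       # 3
print(first_three)  # [0, 1, 2]
```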


## Finding all Subsets

Below are two implementations of finding all of the subsets of a list. The first, `get_all_subsets`, creates a list of the result and is a recursive function similar to what you've seen before. The second, `generate_all_subsets`, is the generator equivalent for it.

As the number of elements grows, the number of subsets gets incredibly large: a list of `n` elements has `2**n` subsets. If we make a giant list of them, it may be too big to fit in memory.

In [None]:
def get_all_subsets(elements, current_index=0, current_subset=None):
    results = []
    if current_subset is None:
        current_subset = []
    if current_index < len(elements):
        current_element = elements[current_index]
        # add current element to subset
        results.extend(get_all_subsets(elements, current_index+1, current_subset + [current_element]))
        # skip current element
        results.extend(get_all_subsets(elements, current_index+1, current_subset))
    else:
        # print ("appending: {}".format(current_subset))
        results.append(current_subset)
    return results

In [None]:
get_all_subsets([1,2,3])

[[1, 2, 3], [1, 2], [1, 3], [1], [2, 3], [2], [3], []]

In [None]:
len(get_all_subsets([1,2,3,4,5,6,7]))

128

In [None]:
len(get_all_subsets(list(range(10))))

1024

In [None]:
len(get_all_subsets(list(range(20))))

1048576

In [None]:
def generate_all_subsets(elements, current_index=0, current_subset=None):
    if current_subset is None:
        current_subset = []
    if current_index < len(elements):
        current_element = elements[current_index]
        # add current element to subset
        yield from generate_all_subsets(elements, current_index+1, current_subset + [current_element])
        # skip current element
        yield from generate_all_subsets(elements, current_index+1, current_subset)
    else:
        print ("yielding: {}".format(current_subset))
        yield current_subset

In [None]:
elements = [1,2,3]

In [None]:
for element in generate_all_subsets(elements):
    print ("next:", element)

yielding: [1, 2, 3]
next: [1, 2, 3]
yielding: [1, 2]
next: [1, 2]
yielding: [1, 3]
next: [1, 3]
yielding: [1]
next: [1]
yielding: [2, 3]
next: [2, 3]
yielding: [2]
next: [2]
yielding: [3]
next: [3]
yielding: []
next: []


In [None]:
for element in get_all_subsets(elements):
    print ("next:", element)

next: [1, 2, 3]
next: [1, 2]
next: [1, 3]
next: [1]
next: [2, 3]
next: [2]
next: [3]
next: []
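
Both versions produce the same subsets in the same order; the difference shows up when we stop early. The sketch below repeats the generator (without the `print`) and adds a hypothetical helper, `first_subset_with_sum`, that returns as soon as a matching subset appears and reports how many subsets it examined.

```python
def generate_all_subsets(elements, current_index=0, current_subset=None):
    if current_subset is None:
        current_subset = []
    if current_index < len(elements):
        current_element = elements[current_index]
        # add current element to subset
        yield from generate_all_subsets(elements, current_index + 1,
                                        current_subset + [current_element])
        # skip current element
        yield from generate_all_subsets(elements, current_index + 1, current_subset)
    else:
        yield current_subset

def first_subset_with_sum(elements, target):
    """Return (subset, number examined), or (None, number examined) if no match."""
    examined = 0
    for subset in generate_all_subsets(elements):
        examined += 1
        if sum(subset) == target:
            return subset, examined
    return None, examined

print(first_subset_with_sum([1, 2, 3], 4))  # ([1, 3], 3)

# 0 + 1 + ... + 19 is 190, so dropping only 19 gives 171: the second subset generated
subset, examined = first_subset_with_sum(list(range(20)), 171)
print(examined)  # 2
```

With `get_all_subsets`, the same search would first build all 1048576 subsets of `list(range(20))` before the loop could look at any of them.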


# Trie Items

In Lab 5, we implemented a function to return all of the items in a Trie. It makes a lot of sense for this function to be a generator. Part of what makes a Trie useful is that it compresses words so we don't have to store all of them in their entirety. Making a giant list of the items will take up a lot of space. If we are looking for the first word with a particular value, we may not need to search through all the items in the Trie.

In [33]:
class Trie:
    def __init__(self, itemtype=None):
        self.value = None
        self.children = {}
        self.type = itemtype

    def checktype(self, key):
        """
        Raise an error when a prefix of the wrong type is used
        """
        if not isinstance(key, self.type):
            raise TypeError

    def getval(self, key):
        """
        Helper for get and contains. Gets the value associated
        with a key if one exists, otherwise returns None
        """
        # Base case, traversed Trie down to the last element of the key
        if len(key) == 0:
            return self.value
        # Recursive step: walk down Trie one element of key at a time
        childKey = key[:1]
        if childKey in self.children:
            return self.children[childKey].getval(key[1:])
        # If key does not exist in the Trie
        else:
            return None

    def set(self, key, value):
        """
        Add a key with the given value to the trie, or reassign the associated
        value if it is already present in the trie.  Assume that key is an
        immutable ordered sequence.  Raise a TypeError if the given key is of
        the wrong type.
        """
        if self.type is None:
            self.type = type(key)
        else:
            self.checktype(key)
        # Iteratively travel down the trie one element of key at a time
        current = self
        for i in range(len(key)):
            subkey = key[i:i+1]
            if subkey not in current.children:
                # Add a node for this key element if it doesn't already exist
                if value is not None:
                    current.children[subkey] = Trie(self.type)
                else:
                    # If we try to delete a key that doesn't exist in the Trie
                    raise KeyError
            current = current.children[subkey]
        # Set the value for the node associated with the key
        current.value = value


    def get(self, key):
        """
        Return the value for the specified prefix.  If the given key is not in
        the trie, raise a KeyError.  If the given key is of the wrong type,
        raise a TypeError.
        """
        self.checktype(key)
        val = self.getval(key)
        if val is None:
            raise KeyError
        else:
            return val

    def delete(self, key):
        """
        Delete the given key from the trie if it exists.
        """
        # Setting the value of the key to None is all we need
        # to do to dissociate them in the Trie
        self.set(key, None)

    def contains(self, key):
        """
        Is key a key in the trie? return True or False.
        """
        return self.getval(key) is not None

    def get_items(self):
        """
        Returns a list of (key, value) pairs for all keys/values in this trie and
        its children.
        """
        # APPROACH: Recursively build up items from children
        ans = []
        # Base case: Found a Trie node associated with a value
        if self.value is not None:
            ans.append((self.type(), self.value))
        # Recursive step: Get items from children and prepend the keys associated with
        # the children to build up full sequences
        for childkey in self.children:
            for subitem, val in self.children[childkey].get_items():
                ans.append((childkey+subitem, val))
        return ans
    
    def generate_items(self):
        """
        Yields (key, value) pairs for all keys/values in this trie and
        its children, one pair at a time.
        """
        # Base case: Found a Trie node associated with a value
        if self.value is not None:
            yield (self.type(), self.value)
        # Recursive step: Get items from children and prepend the keys associated with
        # the children to build up full sequences
        for childkey in self.children:
            for subitem, val in self.children[childkey].generate_items():
                yield childkey+subitem, val

In [26]:
from text_tokenize import tokenize_sentences

def make_word_trie(text):
    """
    Given a piece of text as a single string, create a Trie whose keys are the
    words in the text, and whose values are the number of times the associated
    word appears in the text
    """
    frequencies = {}
    for sentence in tokenize_sentences(text):
        for word in sentence.split():
            if word not in frequencies:
                frequencies[word] = 0
            frequencies[word] += 1
    t = Trie()
    for word, freq in frequencies.items():
        t.set(word, freq)
    return t

In [27]:
def get_frequent_words(trie, desired_frequency, number_of_words):
    """
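    Return a list of up to number_of_words keys in the trie whose value
    (frequency) is greater than desired_frequency. Uses get_items, so the
    full list of items is built before the search starts.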
    """
    words = []
    for item, freq in trie.get_items():
        if freq > desired_frequency:
            words.append(item)
            if len(words) == number_of_words:
                break
    return words

In [28]:
def generate_frequent_words(trie, desired_frequency, number_of_words):
    """
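    Return a list of up to number_of_words keys in the trie whose value
    (frequency) is greater than desired_frequency. Uses generate_items, so
    items are only produced until enough matches have been found.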
    """
    words = []
    for item, freq in trie.generate_items():
        if freq > desired_frequency:
            words.append(item)
            if len(words) == number_of_words:
                break
    return words

`make_word_trie` from the lab can be used to generate a mapping of words to frequencies. One thing we might care about is finding words that appear very frequently. We probably don't have to search through the entire Trie every time to do this. Using a generator instead of a list can make this faster.

In [29]:
filename = 'resources/alice.txt'
with open(filename, encoding="utf-8") as f:
    text = f.read()
trie = make_word_trie(text)

print ("From Alice in Wonderland\n")

print ("Non-Generator")
print ("words seen more than {} times: \t\t{}".format(10, get_frequent_words(trie, 10, 4)))
print ("words seen more than {} times: \t{}".format(100, get_frequent_words(trie, 100, 4)))
print ("words seen more than {} times: \t{}".format(1000, get_frequent_words(trie, 1000, 4)))
print ("words seen more than {} times: \t{}\n".format(100000, get_frequent_words(trie, 100000, 4)))

print ("Generator")
print ("words seen more than {} times: \t\t{}".format(10, generate_frequent_words(trie, 10, 4)))
print ("words seen more than {} times: \t{}".format(100, generate_frequent_words(trie, 100, 4)))
print ("words seen more than {} times: \t{}".format(1000, generate_frequent_words(trie, 1000, 4)))
print ("words seen more than {} times: \t{}".format(100000, generate_frequent_words(trie, 100000, 4)))

From Alice in Wonderland

Non-Generator
words seen more than 10 times: 		['project', 'poor', 'people', 'perhaps']
words seen more than 100 times: 	['a', 'alice', 'all', 'and']
words seen more than 1000 times: 	['the']
words seen more than 100000 times: 	[]

Generator
words seen more than 10 times: 		['project', 'poor', 'people', 'perhaps']
words seen more than 100 times: 	['a', 'alice', 'all', 'and']
words seen more than 1000 times: 	['the']
words seen more than 100000 times: 	[]


In [30]:
filename = 'resources/lots_of_words.txt'
with open(filename, encoding="utf-8") as f:
    text = f.read()
trie = make_word_trie(text)

In [31]:
import time

start = time.time()
frequent_words = get_frequent_words(trie, 1000, 10)
end = time.time()
print ("{} words seen more than {} times: \n\tresult: \t{}\n\ttime spent: \t{:.4f} seconds\n".format(
        10,
        1000,
        frequent_words,
        end - start))

start = time.time()
frequent_words = get_frequent_words(trie, 10000, 10)
end = time.time()
print ("{} words seen more than {} times: \n\tresult: \t{}\n\ttime spent: \t{:.4f} seconds\n".format(
        10,
        10000,
        frequent_words,
        end - start))

start = time.time()
frequent_words = get_frequent_words(trie, 100000, 10)
end = time.time()
print ("{} words seen more than {} times: \n\tresult: \t{}\n\ttime spent: \t{:.4f} seconds\n".format(
        10,
        100000,
        frequent_words,
        end - start))

10 words seen more than 1000 times: 
	result: 	['a', 'an', 'and', 'any', 'as', 'again', 'at', 'all', 'after', 'away']
	time spent: 	0.1367 seconds

10 words seen more than 10000 times: 
	result: 	['a', 'and', 'as', 'for', 'to', 'the', 'that', 'be', 'with', 'me']
	time spent: 	0.1151 seconds

10 words seen more than 100000 times: 
	result: 	[]
	time spent: 	0.1204 seconds



In [32]:
import time

start = time.time()
frequent_words = generate_frequent_words(trie, 1000, 10)
end = time.time()
print ("{} words seen more than {} times: \n\tresult: \t{}\n\ttime spent: \t{:.4f} seconds\n".format(
        10,
        1000,
        frequent_words,
        end - start))

start = time.time()
frequent_words = generate_frequent_words(trie, 10000, 10)
end = time.time()
print ("{} words seen more than {} times: \n\tresult: \t{}\n\ttime spent: \t{:.4f} seconds\n".format(
        10,
        10000,
        frequent_words,
        end - start))

start = time.time()
frequent_words = generate_frequent_words(trie, 100000, 10)
end = time.time()
print ("{} words seen more than {} times: \n\tresult: \t{}\n\ttime spent: \t{:.4f} seconds\n".format(
        10,
        100000,
        frequent_words,
        end - start))

10 words seen more than 1000 times: 
	result: 	['a', 'an', 'and', 'any', 'as', 'again', 'at', 'all', 'after', 'away']
	time spent: 	0.0042 seconds

10 words seen more than 10000 times: 
	result: 	['a', 'and', 'as', 'for', 'to', 'the', 'that', 'be', 'with', 'me']
	time spent: 	0.0314 seconds

10 words seen more than 100000 times: 
	result: 	[]
	time spent: 	0.1019 seconds
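
In the timings above, the generator version is much quicker when ten matches turn up early, and about the same as the list version when no word qualifies, since the whole Trie has to be walked either way. The early-exit loop can also be written with `itertools.islice`. The sketch below is one possible variant, not the lab's solution: `frequent_words` and `fake_items` are hypothetical names, and `fake_items` stands in for `trie.generate_items()` so the example runs on its own.

```python
from itertools import islice

def frequent_words(items, desired_frequency, number_of_words):
    """Return up to number_of_words words whose frequency exceeds the threshold.

    `items` is any iterable of (word, frequency) pairs,
    for example the result of trie.generate_items().
    """
    matches = (word for word, freq in items if freq > desired_frequency)
    # islice stops pulling from `matches` once enough words have been found
    return list(islice(matches, number_of_words))

# Stand-in for trie.generate_items(); records which pairs were actually produced
produced = []

def fake_items():
    for pair in [("a", 50), ("b", 2), ("c", 70), ("d", 90), ("e", 80)]:
        produced.append(pair[0])
        yield pair

result = frequent_words(fake_items(), 10, 2)
print(result)    # ['a', 'c']
print(produced)  # ['a', 'b', 'c']
```

The pairs for `'d'` and `'e'` are never produced, which is the same saving the timings above show for the Trie.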

