# Tutorial about generators

Firstly, what are generators and why use them? A generator function is a function that returns an iterator object, called a generator. Python generators are a simple way of creating iterators, and we use them because they are quick and easy to write. The section on [the advantages of generators](#advantages) goes into more detail on the pros of generators.

## Generator functions
Just as a function can build and return data, a function can also create a generator. However, functions that create generators use the keyword `yield` instead of `return`.

In [14]:
def my_gen_maker():
    yield 1

gen_1 = my_gen_maker()
type(gen_1)

generator

Let's compare the syntax of a generator function with that of a function that builds a list:

In [29]:
def list_squares_from_list(list_of_numbers):
    # Build the full list of squares, then hand it back
    results_list = []
    for number in list_of_numbers:
        results_list.append(number ** 2)
    return results_list

def gen_squares_from_list(list_of_numbers):
    # Yield one square at a time; no list is built
    for number in list_of_numbers:
        yield number ** 2

The main things to note about the generator function:
1. `yield` is used instead of `return` 
2. no empty list (or generator equivalent) needs to be created
3. the syntax is much simpler
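
To check that the two styles produce the same values, here is a minimal sketch. The shortened names `list_squares` and `gen_squares` are illustrative, not part of the notebook above.

```python
def list_squares(numbers):
    """Build the whole list of squares before returning it."""
    results = []
    for number in numbers:
        results.append(number ** 2)
    return results


def gen_squares(numbers):
    """Yield one square at a time; nothing runs until iteration starts."""
    for number in numbers:
        yield number ** 2


values = [1, 2, 3]
squares_list = list_squares(values)   # computed immediately
squares_gen = gen_squares(values)     # only a generator object so far

print(squares_list)        # [1, 4, 9]
print(list(squares_gen))   # [1, 4, 9]
```

The only visible difference for the caller is that the generator has to be iterated (here via `list()`) before any value appears.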

## Generator Expressions

Just as a list comprehension creates a list from an iterable, a generator expression creates a generator using almost the same syntax, with round brackets in place of square ones. For example:

In [30]:
lst_1 = [i**i for i in [1,2,3]]
type(lst_1)

list

In [42]:
gen_2 = (i**i for i in [1,2,3])
type(gen_2)

generator

We can print a list to see the items within, but not a generator:

In [36]:
print(lst_1)
print(gen_2)

[1, 4, 27]
<generator object <genexpr> at 0x00000285480686D0>


This is easily overcome by passing the generator to Python's `list()` function. Bear in mind that doing so consumes the generator and gives up the [advantages of generators](#advantages).

In [39]:
print(list(gen_2))

[1, 4, 27]


Similarly, we cannot index into a generator as we can with a list:

In [40]:
lst_1[2]

27

In [41]:
gen_2[2]

TypeError: 'generator' object is not subscriptable

Again, this can be overcome by converting the generator to a list with `list()` and indexing the result. But again, the [advantages of generators](#advantages) will be lost if you turn the generator into a list.

Generators produce their elements on the fly and only advance when asked for the next element, so there is no pre-existing array to index. Note also that `gen_2` was already consumed by the earlier `list(gen_2)` call, so converting it a second time gives an empty list and the index lookup fails:

In [44]:
list(gen_2)[2]

IndexError: list index out of range

We can see that generators *only* create elements in response to iteration. This is known as __lazy evaluation__ (like Spark).
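
A small sketch can make the lazy behaviour visible. The `logged_squares` function and the `log` list are hypothetical helpers introduced only for this illustration; they record the moment each element is actually computed.

```python
log = []  # records which inputs have been processed so far


def logged_squares(numbers):
    for number in numbers:
        log.append(number)   # runs only when this element is requested
        yield number ** 2


gen = logged_squares([1, 2, 3])
log_after_creation = list(log)     # nothing computed yet

first = next(gen)
log_after_first_next = list(log)   # only the first element has been computed

print(log_after_creation)      # []
print(first)                   # 1
print(log_after_first_next)    # [1]
```

Creating the generator does no work at all; each call to `next` runs the function body just far enough to reach the next `yield`.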

## Iterating through your generator

Use `next` to step through the generator; each call makes it calculate the next element in the sequence. Because `gen_2` was exhausted above, it is recreated first.

In [21]:
gen_2 = (i**i for i in [1,2,3])
next(gen_2)

1

In [22]:
next(gen_2)

4

In [23]:
next(gen_2)

27

However, if you go too far and call `next` again after exhausting the items to be generated, you get a `StopIteration` exception:

In [24]:
next(gen_2)

StopIteration: 

We could get around this by using `try` and `except` as part of our code:

In [45]:
try:
    next(gen_2)
except StopIteration:
    pass

Or simply, by using a `for` loop. `for` loops can be used to iterate through the generator, until the end of the sequence.

__Interesting side note__: because `next` has been used on `gen_2` to the point of "exhaustion", the generator object still exists but has nothing left to yield, so it needs to be recreated. This is because generators can only be iterated over once.

In [46]:
gen_2 = (i**i for i in [1,2,3])

for item in gen_2:
    print(item)

1
4
27
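
The `for` loop is doing the `next` and `StopIteration` handling for us. Roughly, and only as a sketch of the idea rather than the interpreter's exact implementation, it behaves like the loop below. The names `iterator`, `collected` and `leftover` are illustrative.

```python
gen = (i ** i for i in [1, 2, 3])
iterator = iter(gen)   # for a generator, iter() returns the generator itself
collected = []

while True:
    try:
        item = next(iterator)
    except StopIteration:
        break          # a for loop catches this and stops quietly
    collected.append(item)

leftover = list(gen)   # the generator is exhausted, so nothing is left

print(collected)   # [1, 4, 27]
print(leftover)    # []
```

The empty `leftover` list shows the single-pass behaviour described above: a second pass over the same generator yields nothing.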


## Advantages of Generators
<a id='advantages'></a>

* Save memory space: generators don’t compute the value of each item when instantiated. They only compute it when you ask for it. This is especially useful for large datasets
* Save computing time at the point of creation (but not necessarily when accessing the elements)

and as we have seen already
* the syntax is similar or even simpler than creating lists

In [63]:
from itertools import combinations_with_replacement
import sys

combo_list = [''.join(item) for item in combinations_with_replacement('ABCDWXYZ', 20)]
combo_gen = (''.join(item) for item in combinations_with_replacement('ABCDWXYZ', 20))

In [65]:
print(f"Size of list is {sys.getsizeof(combo_list)}")
print(f"Size of generator is {sys.getsizeof(combo_gen)}")

Size of list is 7731040
Size of generator is 88
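
One caveat when reading these numbers: `sys.getsizeof` is shallow, so for a list it reports the list object itself and not the strings it refers to. The real footprint of the list is therefore larger than shown. A minimal sketch of the same comparison on a smaller, made-up sequence (`numbers_list` and `numbers_gen` are illustrative names; exact byte counts vary by Python version and platform):

```python
import sys

numbers_list = [n * 2 for n in range(100_000)]
numbers_gen = (n * 2 for n in range(100_000))

list_bytes = sys.getsizeof(numbers_list)   # grows with the number of elements
gen_bytes = sys.getsizeof(numbers_gen)     # small, whatever the length

print(f"list: {list_bytes} bytes, generator: {gen_bytes} bytes")
```

The generator stays small because it stores only its current state, not the values it will eventually produce.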


In [66]:
%timeit([''.join(item) for item in combinations_with_replacement('ABCDWXYZ', 20)])

521 ms ± 161 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [67]:
%timeit(''.join(item) for item in combinations_with_replacement('ABCDWXYZ', 20))

1.38 µs ± 233 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


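However, the work is only deferred: iterating over the generator has to compute each element at that point, so accessing the elements can be slower than reading them from a list that already exists.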

In [73]:
def count_WW_in_list(combi_lst):
    count_WW = 0
    for combo in combi_lst:
        if "WW" in combo:
            count_WW += 1
    return count_WW
%timeit(count_WW_in_list(combo_list))       

The slowest run took 4.05 times longer than the fastest. This could mean that an intermediate result is being cached.
93.9 ms ± 57.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [74]:
combo_gen = (''.join(item) for item in combinations_with_replacement('ABCDWXYZ', 20))

def count_WW_in_gen(combi_gen):
    count_WW = 0
    for combo in combi_gen:
        if "WW" in combo:
            count_WW += 1
    return count_WW
%timeit(count_WW_in_gen(combo_gen))   


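The slowest run took 4.87 times longer than the fastest. This could mean that an intermediate result is being cached.
2.54 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Treat this figure with caution: `combo_gen` is exhausted by the first call, so every later run loops over an empty generator and returns almost instantly. It does not show that a full pass over the generator is faster than a pass over the list.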
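
One way to time a full pass is to rebuild the generator inside the timed call. The sketch below uses the standard `timeit` module instead of the `%timeit` magic, and a combination length of 8 instead of 20 (an assumption, purely to keep the run short). `make_gen` and `count_ww` are hypothetical helpers, not part of the notebook above.

```python
import timeit
from itertools import combinations_with_replacement

LETTERS = "ABCDWXYZ"
LENGTH = 8  # assumption: shorter than the 20 used above, to keep runs quick


def make_gen():
    """Return a fresh generator each time it is called."""
    return ("".join(item) for item in combinations_with_replacement(LETTERS, LENGTH))


def count_ww(combos):
    """Count the strings that contain 'WW'."""
    return sum(1 for combo in combos if "WW" in combo)


small_list = list(make_gen())

# The list can be reused; the generator is rebuilt on every call.
list_seconds = timeit.timeit(lambda: count_ww(small_list), number=20)
gen_seconds = timeit.timeit(lambda: count_ww(make_gen()), number=20)

print(f"list: {list_seconds:.4f} s, generator: {gen_seconds:.4f} s")
```

Timed this way, the generator figure includes the cost of producing every element, which is the cost the list paid up front when it was built.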


A practical example:

If you needed to open a massive dataset in CSV format, a generator might be perfect, as it would save time and memory.

`file_name = 'a_really_humungous_csv.csv'`

You might typically access data like this:
`list_of_CSV_data = [row for row in open(file_name)]`
But depending on the size of the CSV, this might take ages and possibly crash the program due to lack of memory.

By simply replacing the square brackets with round ones, we get a generator that is created almost instantly and takes up little memory, because rows are read only as they are requested.
`generator_of_CSV_data = (row for row in open(file_name))`
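
The same idea can be written as a generator function, which also lets a `with` block close the file once iteration finishes. This is a minimal sketch under stated assumptions: `read_csv_rows` is a hypothetical helper, and a tiny temporary file with made-up contents stands in for the large CSV so the example runs anywhere.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Yield one parsed row at a time; the file closes when iteration ends."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            yield row


# Build a tiny stand-in file (made-up data) so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("name,score\nada,3\ngrace,5\n")
    sample_path = tmp.name

rows = read_csv_rows(sample_path)
header = next(rows)                              # first row only
total_score = sum(int(row[1]) for row in rows)   # remaining rows, one at a time

print(header)        # ['name', 'score']
print(total_score)   # 8

os.remove(sample_path)
```

At no point is more than one row held in memory, which is what makes this pattern suitable for files too large to load in full.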

In [79]:
combo_gen = (''.join(item) for item in combinations_with_replacement('ABCDWXYZ', 20))

def sum_letters_in_gen(combi_gen):
    # Sum the lengths of all strings produced by the generator
    return sum(len(combo) for combo in combi_gen)

print(f"Total number of letters in the generator is {sum_letters_in_gen(combo_gen)}")
%timeit(sum_letters_in_gen(combo_gen))   

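Total number of letters in the generator is 17760600
770 ns ± 351 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Note that the `print` call consumes `combo_gen`, so `%timeit` then measures iteration over an empty generator. The 770 ns figure is therefore not comparable with the timing for the list.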


In [80]:
def sum_letters_in_list(combi_lst):
    # Sum the lengths of all strings stored in the list
    return sum(len(combo) for combo in combi_lst)
print(f"Total number of letters in the list is {sum_letters_in_list(combo_list)}")
%timeit(sum_letters_in_list(combo_list))  

Total number of letters in the list is 17760600
342 ms ± 56.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


"With big list comprehensions you'll run out of memory, with big generators you'll run out of time"