<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is created by Zhuo Chen based on the notebooks created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />
___


# Python Intermediate 5

**Description:** This notebook describes:
* What is a generator
* How to write a generator comprehension
* The advantages of using a generator
 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion Time:** 45 minutes

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics 1](./python-basics-1.ipynb))

**Knowledge Recommended:** None

**Data Format:** None

**Libraries Used:** None

**Research Pipeline:** None
___

# What is a generator?

We learned in [Python Intermediate 1](./python-intermediate-1.ipynb) that any Python object that allows its members to be iterated over in a for-loop is an **iterable**. Strings, lists, sets and dictionaries are all iterables. 

In [1]:
# Use a for loop to iterate over a list
ls = [1, 2, 3]
for num in ls:
    print(num)

1
2
3


In [2]:
# Use a for loop to iterate over a string


Python has a built-in function `iter()` which takes an **iterable** and returns an **iterator**. The iterator can be used to iterate over the input iterable.

In [3]:
# Use the built-in iter function to create an iterator out of the list stored in ls
my_ls = iter(ls)
type(my_ls)

list_iterator

To access the values in the list, we can use the `next()` function to get one value at a time.

In [4]:
# Use next() to get the first element from the list
next(my_ls)

1

In [5]:
# Use next() to get the second element from the list


A **generator** is a function that creates an **iterator**. On the surface, generators look like ordinary functions, but they are actually very different. Let's use a simple example to understand the difference. 

In [6]:
# Create a Python function which takes a list of numbers
# and returns a new list in which each number is two times
# the corresponding number in the input list

def two_times(ls):
    new_ls = []
    for n in ls:
        new_ls.append(2*n)
    return new_ls 

two_times([1, 2, 3])

[2, 4, 6]

If we feed a list of numbers to this function, we get a new list back. Most importantly, the entire new list of numbers is stored in memory at once.

We can create a Python generator to give us the same sequence of values. Note that a generator uses the `yield` statement. 

In [7]:
# Create a Python generator

def gen(ls): 
    for n in ls:
        yield 2*n 
        
my_gen = gen([1, 2, 3]) 

Since a generator creates an iterator, it yields one value at a time. 

In [8]:
# Use next() to yield one element from the iterable at a time
next(my_gen) 

2

In [9]:
# Use next() to yield one element from the iterable at a time
next(my_gen)

4

In [10]:
# Use next() to yield one element from the iterable at a time
next(my_gen)

6
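
To see that the body of a generator function pauses at each `yield` and resumes on the next call to `next()`, here is a minimal sketch. The names `traced_gen` and `events` are made up for this illustration; the list `events` simply records what the generator has done so far.

```python
# Record what happens inside the generator between next() calls
events = []

def traced_gen(ls):
    for n in ls:
        events.append(f"about to yield {2*n}")
        yield 2*n
    events.append("loop finished")

g = traced_gen([1, 2, 3])
print(events)          # [] because no code in the body has run yet

first = next(g)        # runs the body up to the first yield
print(first, events)   # 2 ['about to yield 2']

second = next(g)       # resumes after the first yield, stops at the second
print(second, events)  # 4 ['about to yield 2', 'about to yield 4']
```

Notice that creating `g` does not run any of the function body. Each call to `next()` runs just enough code to reach the next `yield`.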

The generator is exhausted when all the items have been used. If we call the `next()` function again, Python raises a `StopIteration` exception.

In [11]:
# Calling next() on an exhausted generator raises StopIteration
next(my_gen)

StopIteration: 
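
In practice we rarely need to handle `StopIteration` by hand. The sketch below reuses the same `gen` function as above and shows two common ways to avoid the exception: a for loop, which stops quietly when the generator is exhausted, and the optional second argument of `next()`, which is returned as a default instead of raising. The variable names `collected`, `exhausted`, and `result` are only for illustration.

```python
# Same generator function as above
def gen(ls):
    for n in ls:
        yield 2*n

# A for loop calls next() for us and stops at StopIteration
collected = []
for value in gen([1, 2, 3]):
    collected.append(value)
print(collected)  # [2, 4, 6]

# next() accepts a default value that is returned when nothing is left
exhausted = gen([])
result = next(exhausted, None)
print(result)  # None
```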

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

Pick an iterable of your choice and write a generator which takes the iterable as its input. 

# Generator comprehension

Python provides a shorter way to create a generator: the generator comprehension (called a generator expression in the official Python documentation).
Generator comprehensions have basically the same syntax as list comprehensions, except that they use parentheses `()` instead of square brackets `[]`.

In [12]:
# Create a list comprehension using square brackets []
numbers = [5,6,7,8,9]
new_list = [num for num in numbers if num > 5]
print(new_list)

[6, 7, 8, 9]


In [13]:
# Create a generator using parentheses
new_gen = (num for num in numbers if num > 5)

In [14]:
# Yield the values one at a time
next(new_gen)

6

In [15]:
next(new_gen)

7

In [16]:
next(new_gen)

8

In [17]:
next(new_gen)

9

When all the items have been yielded, if we call the `next()` function again, Python raises a `StopIteration` exception.

In [18]:
# Yield the next generator output
next(new_gen)

StopIteration: 
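
Two practical points about generator comprehensions can be sketched with the same `numbers` list. First, a generator comprehension can be passed directly to a function such as `sum()`, and the extra parentheses can be dropped when it is the only argument. Second, a generator can be consumed only once. The variable names `total`, `first_pass`, and `second_pass` are only for illustration.

```python
numbers = [5, 6, 7, 8, 9]

# Pass a generator comprehension straight to sum(); no list is built
total = sum(num for num in numbers if num > 5)
print(total)  # 30

# A generator can be consumed only once
new_gen = (num for num in numbers if num > 5)
first_pass = list(new_gen)
second_pass = list(new_gen)
print(first_pass)   # [6, 7, 8, 9]
print(second_pass)  # [] because the generator is already exhausted
```

If you need to loop over the same values more than once, create a new generator each time or store the values in a list.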

# The advantages of generators

A generator does not hold the entire result in memory; it yields one item at a time. Because only one item has to exist at a time, a generator can lead to significant savings in memory usage. 

In [19]:
# Demonstrate the memory size difference of 
# a list comprehension vs generator comprehension

# Import getsizeof which measures memory usage in bytes
from sys import getsizeof
  
list_comprehension = [i for i in range(10000)]
generator_comprehension = (i for i in range(10000))
  
# Print the size of the list comprehension
print('List comprehension memory usage: ', getsizeof(list_comprehension))

# Print the size of the generator comprehension
print('Generator comprehension memory usage: ', getsizeof(generator_comprehension))

List comprehension memory usage:  85176
Generator comprehension memory usage:  104
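
To check that this difference is not specific to 10,000 items, the sketch below repeats the measurement for several sizes. The helper names `make_list` and `make_gen` are made up for this illustration. It assumes the standard CPython interpreter; the exact byte counts vary between Python versions, and `getsizeof` counts only the list object itself, not the integers it refers to.

```python
from sys import getsizeof

def make_list(n):
    return [i for i in range(n)]

def make_gen(n):
    return (i for i in range(n))

# The list grows with n; the generator object stays the same size
for n in (10, 1000, 100000):
    print(n, getsizeof(make_list(n)), getsizeof(make_gen(n)))
```

The list column grows with `n`, while the generator column shows the same small number on every row.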


Since a generator occupies less memory, using a generator instead of a normal iterable like a list can lead to a performance boost. This advantage is especially helpful when you have a really big dataset with hundreds of thousands or even millions of items to loop through. 

In [20]:
# import the time module to calculate the processing time
import time

In [21]:
# Calculate the processing time when we create a list with 1m items
def ml(n):
    ls = []
    for i in range(n):
        ls.append(i)
    return ls

start = time.process_time()
ml(1000000)
end = time.process_time()
print(end - start)

0.04928600000000005


In [22]:
# Calculate the processing time when we create a generator for 1m items
# (calling ml_gen() only creates the generator object; no items are produced yet)
def ml_gen(n):
    for i in range(n):
        yield i
        
start = time.process_time()
ml_gen(1000000)
end = time.process_time()
print(end - start)

2.4999999999941735e-05
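
The very small time for the generator measures only the creation of the generator object. The items themselves are produced later, when we iterate. The sketch below separates the two steps; the variable names `created`, `consumed`, and `total` are only for illustration, and the timings will differ from machine to machine.

```python
import time

def ml_gen(n):
    for i in range(n):
        yield i

# Step 1: create the generator object (no items are produced yet)
start = time.process_time()
g = ml_gen(1000000)
created = time.process_time() - start

# Step 2: consume the generator (now every item is produced)
start = time.process_time()
total = sum(g)
consumed = time.process_time() - start

print(total)  # 499999500000
print(created, consumed)
```

The work is not skipped, only postponed and spread out, so the main saving is that the one million items never have to sit in memory at the same time.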


Using a generator makes sense in scenarios where loading an entire list, dictionary, or set could fill all available memory. This could be because each item is large, the list is large, or both. 

If you want to take one item at a time, do a lot of calculations based on that item, and then move on to the next item, then use a generator. 
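
The one-item-at-a-time pattern can be sketched by chaining generators into a small pipeline. The functions `squares` and `only_even` are made up for this illustration, and `islice` from the standard `itertools` module is used to stop after the first three results.

```python
from itertools import islice

def squares(numbers):
    for n in numbers:
        yield n * n

def only_even(values):
    for v in values:
        if v % 2 == 0:
            yield v

# Nothing is computed until we ask for values
pipeline = only_even(squares(range(1000000)))

# Take only the first three results; the rest are never computed
first_three = list(islice(pipeline, 3))
print(first_three)  # [0, 4, 16]
```

Even though the input range has one million numbers, only the first five are squared, because the pipeline stops as soon as three even squares have been found.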

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

* Create a generator object using a generator comprehension
* Print out every value in the generator 
* Use `try` and `except` in your code to prevent the program from crashing after the generator is exhausted

For a quick refresher on `try` and `except`, you can refer to [Python Basics 2](./python-basics-2.ipynb).

In [23]:
# Create a generator using a generator comprehension


# An example of a generator from Constellate

You may not be aware of it, but you have actually seen and worked with a generator before! In Constellate, when you build a dataset and use the Constellate client to download the dataset, you will be working with a generator. Let's use the example we have seen before in the notebook [exploring-word-frequencies](./exploring-word-frequencies.ipynb).

In [24]:
# import modules and libraries
import constellate
from pathlib import Path

Constellate: use and download of datasets is covered by the Terms & Conditions of Use: https://constellate.org/terms-and-conditions/


In [25]:
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
# Check to see if a dataset file exists
# If not, download a dataset using the Constellate Client
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_file = Path.cwd() / 'data' / 'Shakespeare' # Make sure this filepath matches your dataset filename

if not dataset_file.exists():
    try: 
        dataset_file = constellate.download(dataset_id, 'jsonl', 'Shakespeare')
    except: 
        dataset_file = constellate.get_dataset(dataset_id)

All documents from JSTOR published in Shakespeare Quarterly from 1950 - 2020. 6745 documents.
INFO:root:File /Users/zchen/data/Shakespeare exists. Not re-downloading.


In [26]:
# Read in the data 
dataset = constellate.dataset_reader(dataset_file)

In [27]:
# Check the type of 'dataset'
dataset

<generator object dataset_reader at 0x103ef70d0>

In [28]:
# Get the first document using next()
next(dataset)

{'id': 'http://www.jstor.org/stable/2870472',
 'docType': 'article',
 'title': 'Review Article',
 'creator': ['H. R. Coursen'],
 'isPartOf': 'Shakespeare Quarterly',
 'sourceCategory': ['Language & Literature',
  'Humanities',
  'Performing Arts',
  'Arts'],
 'pageStart': '498',
 'url': 'http://www.jstor.org/stable/2870472',
 'volumeNumber': '42',
 'issueNumber': '4',
 'language': ['eng'],
 'pageEnd': '501',
 'pageCount': 4,
 'pagination': 'pp. 498-501',
 'datePublished': '1991-12-01',
 'publicationYear': 1991,
 'publisher': 'Folger Shakespeare Library',
 'wordCount': 1658,
 'provider': 'jstor',
 'outputFormat': ['unigram', 'bigram', 'trigram'],
 'identifier': [{'name': 'issn', 'value': '00373222'},
  {'name': 'oclc', 'value': '39852252'},
  {'name': 'lccn', 'value': 'sn98-23302'},
  {'name': 'local_doi', 'value': '10.2307/2870472'},
  {'name': 'journal_id', 'value': 'shakquar'}],
 'unigramCount': {"Shakespeare's": 11,
  'Hidden': 1,
  'World:': 1,
  'A': 2,
  'Study': 1,
  'of': 91,
 

We have in total 6745 documents in the dataset. Quite a lot! 

In [33]:
# Calculate the processing time of the generator in milliseconds
start = time.process_time() * 1000
dataset = constellate.dataset_reader(dataset_file)
end = time.process_time() * 1000
print(end - start)

1241.4130000000005


In [34]:
# Calculate the processing time of the list with the same items in milliseconds
start = time.process_time() * 1000
dataset = list(constellate.dataset_reader(dataset_file))
end = time.process_time() * 1000
print(end - start)

13383.610999999999
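
Because the dataset reader is a generator, we can work with the documents one at a time instead of loading all of them into a list. The sketch below shows the pattern with a stand-in function, `fake_dataset_reader`, which is hypothetical and not part of the Constellate client; its three tiny documents are invented for illustration only. With a real dataset, you would loop over `constellate.dataset_reader(dataset_file)` in the same way.

```python
from itertools import islice

def fake_dataset_reader():
    # Hypothetical stand-in for a dataset reader; the documents are invented
    sample_docs = [
        {'id': 'doc-1', 'wordCount': 100},
        {'id': 'doc-2', 'wordCount': 250},
        {'id': 'doc-3', 'wordCount': 50},
    ]
    for doc in sample_docs:
        yield doc

# Look at just the first two documents without reading the rest
first_two = list(islice(fake_dataset_reader(), 2))
print([doc['id'] for doc in first_two])  # ['doc-1', 'doc-2']

# Process one document at a time, keeping only a running total
total_words = 0
for doc in fake_dataset_reader():
    total_words += doc['wordCount']
print(total_words)  # 400
```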


___
## Lesson Complete

Congratulations! You have completed *Python Intermediate 5*.


### Exercise Solutions
Here are a few solutions for exercises in this lesson.

In [31]:
# Pick an iterable of your choice and write a generator which takes the iterable as its input

w = "generator"
def gen(w):
    for l in w:
        yield l.upper()
w_gen = gen(w)
w

'generator'

In [32]:
# Create a generator using a generator comprehension
gen = (number for number in range(30))

# Print every value using a loop
while True:
    try: 
        print(next(gen))
    except StopIteration: 
        print('Generator exhausted')
        break

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
Generator exhausted
