Generators
A generator is a function that can be paused and resumed while still maintaining state between these stops and starts.

Pausing and Resuming Functions?
Typically, when you call a function, you lose that function's local variables after it reaches the return statement.

Generators allow you to:
*return a value
*suspend execution of the function
*then resume it later with all of the locals still intact!

Creating a Generator

*make a function that has the keyword 'yield' in it
*yield is like return in that it gives back the value immediately to its right
*however, instead of stopping the function completely and discarding the locals
*it temporarily suspends the execution of the function so that it can be continued later
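
As a minimal sketch of the points above (the name countdown is hypothetical, chosen only for illustration), the local variable n survives between calls to next:

```python
def countdown(n):
    # n is a local; it is kept alive while the function is suspended
    while n > 0:
        yield n      # hand back n and pause here
        n -= 1       # execution resumes from this line on the following next()

gen = countdown(3)
first = next(gen)    # runs the body until the first yield
second = next(gen)   # resumes, decrements n, then yields again
print(first, second)
```

Each call to countdown creates a separate generator object with its own copy of n, so two countdowns do not interfere with each other.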

Generators and Iterators
From the docs:
When you call a generator function, it doesn’t return a single value; instead it returns a generator object that supports the iterator protocol.
So... a generator object is an iterator; it implements both __iter__ and __next__ (but an iterator is not necessarily a generator)
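
One way to check this relationship, as a small sketch (gen_func and the variable names are made up for illustration), is to compare a generator object with the iterator you get from a list:

```python
import types

def gen_func():
    yield 1

gen_obj = gen_func()
list_iter = iter([1, 2, 3])   # an iterator that is not a generator

# a generator object supports the iterator protocol
has_protocol = hasattr(gen_obj, '__iter__') and hasattr(gen_obj, '__next__')
is_own_iterator = iter(gen_obj) is gen_obj

# both objects have __next__, but only one of them is a generator
gen_is_generator = isinstance(gen_obj, types.GeneratorType)
list_iter_is_generator = isinstance(list_iter, types.GeneratorType)

print(has_protocol, is_own_iterator, gen_is_generator, list_iter_is_generator)
```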

Generators and Iterators Continued
So, when you call a generator function, you immediately get a generator object back, but the function body itself is not yet executed. The generator object returned behaves like an iterator:

it has a __next__ method
...so you can pass the generator object to the built-in next function, just as you would with an iterator returned by an iterable
using next controls the function's execution; it starts or resumes the function until yield is encountered
at which point a value is returned and execution is temporarily suspended.

In [None]:
#Generator function example

def f():
    print('print 1')
    yield 'return 1'
    print('print 2')
    yield 'return 2'
    print('print 3')
    yield 'return 3'

# when calling the generator function the body isn't executed
gen_obj = f()

# calling next starts/resumes function execution until yield is encountered
next(gen_obj)
next(gen_obj)
next(gen_obj)

#can be looped over

for val in f():
    print(val)
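
What happens once the function body runs out of yields? The sketch below (two_values is a hypothetical generator, not part of the example above) shows that next raises StopIteration, that next also accepts a default, and that a for loop handles the exception for you:

```python
def two_values():
    yield 'a'
    yield 'b'

gen = two_values()
values = [next(gen), next(gen)]

# the function body has finished, so another call raises StopIteration
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

# next accepts an optional default that is returned instead of raising
fallback = next(gen, 'done')

# a for loop (or comprehension) stops cleanly; a fresh call gives a fresh generator
looped = [val for val in two_values()]
print(values, exhausted, fallback, looped)
```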


No Class Needed!
Hm - this seems really similar to creating a class and implementing __iter__ and __next__. Aaaand, that's true:

generators are a simple way of getting an object back that supports the iterator protocol
no need to define a whole new class and define two methods on that class
just write a function
Let's write some code that allows us to loop over the letters in the alphabet without creating a string of the entire alphabet beforehand.

In [None]:
#example with a class

class Alphabet:
    START, STOP = 65, 91
    def __init__(self):
        self.i = Alphabet.START
        
    def __iter__(self):
        return self
    
    def __next__(self):
        # stop before producing a character past 'Z'
        if self.i >= Alphabet.STOP:
            raise StopIteration
        ch = chr(self.i)
        self.i += 1
        return ch
    
for letter in Alphabet():
    print(letter)

In [None]:
#example with a generator

def alphabet():
    START, STOP = 65, 91
    i = START
    while i < STOP:
        yield chr(i)
        i += 1
    # or use range, of course

for letter in alphabet():
    print(letter)

#infinite generator

def infinite_abc():
    START, STOP = 65,67
    i = START
    while True:
        if i > STOP:
            i = START
        yield chr(i)
        i += 1
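
An infinite generator never raises StopIteration, so it should not be passed to list or to an unbounded for loop. One common way to take a bounded number of values is itertools.islice; the sketch below repeats infinite_abc so that it runs on its own:

```python
from itertools import islice

def infinite_abc():
    # same generator as above, repeated so this block is self-contained
    START, STOP = 65, 67
    i = START
    while True:
        if i > STOP:
            i = START
        yield chr(i)
        i += 1

# list(infinite_abc()) would never finish; islice stops after 7 values
first_seven = list(islice(infinite_abc(), 7))
print(first_seven)
```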

In [None]:
import sys
lc = [i ** 2 for i in range(10000)]
ge = (i ** 2 for i in range(10000))
sys.getsizeof(lc)
sys.getsizeof(ge)
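
The exact sizes reported depend on the Python version, so the sketch below only compares them. It also illustrates the trade-off of the smaller object: a generator expression can be consumed only once.

```python
import sys

lc = [i ** 2 for i in range(10000)]
ge = (i ** 2 for i in range(10000))

# the list holds references to all 10000 results; the generator holds only its state
list_is_bigger = sys.getsizeof(lc) > sys.getsizeof(ge)

# the trade-off: a generator can only be consumed once
total_first_pass = sum(ge)
total_second_pass = sum(ge)   # already exhausted, so this sums nothing

print(list_is_bigger, total_first_pass == sum(lc), total_second_pass)
```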




Descriptive Statistics, Generator Practice
Breaking Up Data Processing
Let's write a few reusable functions that help read and process data:

reading in a file, one stripped line at a time
extracting a column
converting a column to a type

In [None]:
def readFile(fn):
    # generator function: lines are read lazily and the file is closed when done
    with open(fn, 'r') as f:
        for line in f:
            yield line.strip()
def parseData(data, parser):
    return (parser(line) for line in data)
def extractCol(parsedData, idx):
    return (line[idx] for line in parsedData)

def extractCols(parsedData, idxs):
    return (tuple(line[idx] for idx in idxs) for line in parsedData)
def convertVals(col, fn):
    return (fn(val) for val in col)
import sys
data = readFile('starbucks_drinkMenu_expanded.csv')
print('data', sys.getsizeof(data))

parsedData = parseData(data, lambda line: line.split(','))
print('parsedData', sys.getsizeof(parsedData))

column = extractCol(parsedData, 3)
print('column', sys.getsizeof(column))

vals = convertVals(column, lambda val: int(val) if val.isnumeric() else None)
print('vals', sys.getsizeof(vals))
max((v for v in vals if v is not None))
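
To see that a chain of generators like this does its work lazily, here is a small sketch that uses hypothetical in-memory rows instead of the CSV file, plus a made-up helper (split_and_log) that records when a line is actually parsed:

```python
# hypothetical sample rows standing in for lines of a CSV file
lines = ['name,size,calories', 'tea,tall,0', 'latte,tall,190', 'mocha,tall,n/a']

parsed_log = []

def split_and_log(line):
    parsed_log.append(line)          # record when a line is actually parsed
    return line.split(',')

parsed = (split_and_log(line) for line in lines)
column = (row[2] for row in parsed)
vals = (int(v) if v.isnumeric() else None for v in column)

calls_before = len(parsed_log)       # nothing has been consumed yet
first_val = next(vals)               # pulls exactly one line through all stages
calls_after_one = len(parsed_log)

largest = max(v for v in vals if v is not None)
print(calls_before, first_val, calls_after_one, largest)
```

Building the pipeline parses nothing; each value requested at the end pulls a single line through every stage.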


In [None]:
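# each stage is a generator expression, so rows flow through one at a time
data = (line.strip() for line in open('starbucks-menu-nutrition-drinks.csv'))
parsedData = (line.split(',') for line in data)
# keep the first two columns of rows whose second column is numeric
# (rows with a non-numeric second column, such as a header, are dropped)
filtered = (row[:2] for row in parsedData if row[1].isnumeric())
max(filtered, key=lambda t: int(t[1]))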

Example Generator
Calling the following function does not require the entire contents of the file (or even the entire column) to be read into memory; instead, each calorie value is read as needed.

In [None]:
# create generator function to read in 
# calorie column
def get_calories():
    with open('starbucks_drinkMenu_expanded.csv', 'r') as f:
        next(f)  # skip the header row
        for line in f:
            line_parts = line.split(',')
            yield int(line_parts[3])
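
Splitting on ',' miscounts columns if a field contains a quoted comma. A possible variant, sketched here under assumptions (read_int_column is a hypothetical name and the sample data is invented), uses the standard csv module and takes any iterable of lines so it can be tried without the real file:

```python
import csv
import io

def read_int_column(f, idx):
    # f is any iterable of text lines, e.g. an open file
    reader = csv.reader(f)
    next(reader)                     # skip the header row
    for row in reader:
        yield int(row[idx])

# hypothetical sample data; note the comma inside the quoted first field
sample = io.StringIO(
    'name,size,kind,calories\n'
    '"Tea, iced",tall,plain,0\n'
    'Latte,tall,milk,190\n'
)
calories = list(read_int_column(sample, 3))
print(calories)
```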


Descriptive Statistics
Max, Min, and Len
It may be useful to describe a data set by:

the number of data points
the highest and lowest value
There are built-in functions in Python to do this, like max, min, and len

In [None]:
# max and min can actually take a generator 
max(get_calories())

min(get_calories())

#A generator is not actually a collection of elements, so you can't use len on it. 
# Instead, you'll have to turn your generator into a collection...

# if we want to work with all values from our generator, we can convert to a list 
# (that means all values are in memory, tho)
calories = list(get_calories())
# now it's possible to get the length of our data set
len(calories)

# because it's a list we can view the first 10 values with slicing
calories[:10]

# ...and the last 10 values
calories[-10:]
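
If the list would be too large to hold in memory, the count, minimum, and maximum can also be gathered in a single pass over the generator. This is a sketch only; describe is a hypothetical helper and the sample values are invented:

```python
def describe(values):
    # one pass over any iterable; keeps only three numbers in memory
    count = 0
    lowest = highest = None
    for v in values:
        count += 1
        if lowest is None or v < lowest:
            lowest = v
        if highest is None or v > highest:
            highest = v
    return count, lowest, highest

# hypothetical values standing in for a calorie generator
sample = (n for n in [190, 3, 510, 70])
result = describe(sample)
print(result)
```

For the count alone, sum(1 for _ in some_generator) does the same job, at the cost of consuming the generator.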

Central Tendency
Two methods of determining where our data set is centered are:

mean
median
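
As a quick illustration of the two measures on a tiny, invented data set, the standard library's statistics module can be used (a sketch; the values are hypothetical):

```python
import statistics

data = [1, 2, 3, 4, 100]                 # hypothetical values; 100 is an outlier

mean_value = statistics.mean(data)       # sum of the values divided by their count
median_value = statistics.median(data)   # middle value once the data is sorted

# with an even number of values, the median is the average of the middle two
even_median = statistics.median([1, 2, 3, 4])

print(mean_value, median_value, even_median)
```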


In [None]:
# calculating the mean
sum(calories) / len(calories)

# if we need the median, we'll have to sort first
sorted_calories = sorted(calories)

# calculating the median
# if there is an even number of elements, we'll have to take average of middle two

def median(d):
    # d must already be sorted
    middle_index = len(d) // 2
    if len(d) % 2 == 0:
        # even length: average the two middle values
        return (d[middle_index - 1] + d[middle_index]) / 2
    else: 
        return d[middle_index]

median(sorted_calories)

# note that outliers may not affect the median, whereas they can throw off the mean!

copy_sorted_calories = sorted_calories[:]

# change the last value...
copy_sorted_calories[-1] = 200000
sum(copy_sorted_calories) / len(copy_sorted_calories)

median(copy_sorted_calories)

# on the other hand, adding / removing several values that aren't outliers may make the median jump, 
# whereas the mean may only change slightly
# (re-sort after adding values, since median expects sorted data)
copy_sorted_calories = sorted([150] * 20 + sorted_calories)
sum(copy_sorted_calories) / len(copy_sorted_calories)

median(copy_sorted_calories)

# note that so many values are equal to 190 that it's tough to change the median
# without adding several values like we did above
sorted_calories.count(190)

#easier to calculate all of these using numpy or pandas
import numpy as np

np.mean(calories)

np.median(calories)

np.max(calories)

np.min(calories)

# no mode in numpy, I don't think
# but there is one in scipy
from scipy import stats
stats.mode(calories)

from collections import Counter
Counter(calories)

import pandas as pd
starbucks=pd.read_csv('starbucks_drinkMenu_expanded.csv')
descriptives=starbucks.describe()
print(type(descriptives))
descriptives.to_csv("starbucks_descriptives.csv")

descriptives
descriptives['Calories']
starbucks['Calories']
starbucks.mode()
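
The mode can also be found with the standard library alone. The sketch below uses invented values; Counter.most_common gives the most frequent value and its count, and statistics.multimode (Python 3.8+) returns every value tied for most frequent:

```python
from collections import Counter
import statistics

values = [190, 70, 190, 130, 70, 190]    # hypothetical calorie values

counts = Counter(values)
top_value, top_count = counts.most_common(1)[0]

# multimode returns all values that share the highest frequency
tied = statistics.multimode([1, 1, 2, 2, 3])

print(top_value, top_count, tied)
```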