### Generators for Big Data (modularizing data pipelines using generators)
### Generators can be used only once, which suits one-pass CONSUMPTION of data

Earlier, we imposed a restriction on ourselves: read the data with a generator instead of loading it into a list of lists. We cited the problem of Big Data and our inability to store it all in one variable. Calling it a Big Data problem is correct, but you can equally call it a memory problem.

Let's say that you have an older laptop with about 4 GB of random access memory (RAM). The true size of our beer data set is only about 3 MB, but suppose that we asked everyone around the globe to give us their recipes, resulting in a data set of around 3 GB. If we read the entire data set into a variable, it would take up more than 3 GB of RAM. This would leave little room for other operations, much less other variables of similar size. A list of lists would consume so much memory that any analysis would become excruciatingly slow.

We now know that a generator produces a single value from a defined sequence, and only when we ask for it, either by calling next() or by iterating in a for loop. This is called lazy evaluation: generators are lazy because they give us a value only when we request one. The upside is that only that single value takes up memory, which makes generators incredibly memory efficient and a perfect candidate for reading and processing Big Data files. Once we ask for the next value, the old value can be discarded, and once we have gone through the entire generator, it is exhausted and can be discarded from memory as well.
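To make the memory claim concrete, here is a minimal sketch that compares a list with the equivalent generator expression. The sequence of squares is only an illustrative stand-in for the beer data, and the sizes reported by `sys.getsizeof` are those of the container objects on CPython, not of the elements they reference.

```python
import sys

# All 100,000 squares are computed and held in memory at once.
squares_list = [n * n for n in range(100_000)]

# The same sequence described lazily; nothing is computed yet.
squares_gen = (n * n for n in range(100_000))

list_size = sys.getsizeof(squares_list)  # grows with the number of elements
gen_size = sys.getsizeof(squares_gen)    # small, independent of the range length

print(list_size > gen_size)  # True

# Values are produced only when requested.
first = next(squares_gen)   # 0
second = next(squares_gen)  # 1
```

The generator's size stays the same no matter how long the range is, because it stores only the state needed to compute the next value.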

![](https://i.imgur.com/zDmeJgr.jpg)

In [9]:
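def beerDataGenerator():
    """Yield the raw lines of recipeData.csv one at a time."""
    file = "recipeData.csv"
    # the with block closes the file once the generator is exhausted
    with open(file, encoding="ISO-8859-1") as f:
        for row in f:
            yield row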

beer = beerDataGenerator()
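As a usage sketch, the snippet below pulls rows from a file generator one request at a time. The function `row_generator`, its `path` parameter, and the three-line sample file are hypothetical stand-ins so that the example runs without `recipeData.csv`.

```python
import os
import tempfile

def row_generator(path, encoding="ISO-8859-1"):
    """Yield one raw line at a time; the file closes when iteration ends."""
    with open(path, encoding=encoding) as handle:
        for row in handle:
            yield row

# Hypothetical sample data standing in for recipeData.csv.
sample = (
    "BeerID,Name,Style\n"
    "1,Vanilla Cream Ale,Cream Ale\n"
    "2,Example IPA,American IPA\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", encoding="ISO-8859-1") as handle:
        handle.write(sample)

    rows = row_generator(path)
    header = next(rows)     # first request yields only the header line
    remaining = list(rows)  # continues from where next() stopped

print(repr(header))    # 'BeerID,Name,Style\n'
print(len(remaining))  # 2
```

Note that the header is not yielded a second time: the generator remembers its position between requests.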

### Generator expression version
![](https://i.imgur.com/hPjtGB8.jpg)

In [64]:
beer_data = "recipeData.csv"

# generator 1 - to read in line by line
lines = (line for line in open(beer_data, encoding="ISO-8859-1"))

# why not pass output of the first generator into another generator?
# generator 2 - to process each line
lists = (line.split(",") for line in lines)

# take the column names out of the generator to store them
columns = next(lists)

# generator 3 - create a dictionary entry for each row
beerdicts = (dict(zip(columns, data)) for data in lists)

# generator 4 - gets ABV of American IPA
abv = (float(bd["ABV"]) for bd in beerdicts if bd["Style"] == "American IPA")
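Splitting on `","` has two limitations: it keeps the trailing newline on the last field (visible as `'UserId\n'` in the output), and it would break a field that contains a quoted comma. One possible alternative, sketched here on hypothetical in-memory sample rows rather than the real file, is to let the standard library's `csv.reader` do the parsing. It is also lazy, so it slots into the same pipeline.

```python
import csv
import io

# Hypothetical sample rows; note the quoted comma in the second recipe name.
sample = io.StringIO(
    "BeerID,Name,Style,ABV\n"
    "1,Vanilla Cream Ale,Cream Ale,5.48\n"
    '2,"Hoppy, Hazy Example",American IPA,6.5\n'
    "3,Another Example,American IPA,7.5\n"
)

rows = csv.reader(sample)  # lazy: parses one line per request
columns = next(rows)       # header row, without a trailing newline
dicts = (dict(zip(columns, row)) for row in rows)
ipa_abv = (float(d["ABV"]) for d in dicts if d["Style"] == "American IPA")

result = list(ipa_abv)  # consuming the last stage pulls data through every stage
print(columns)  # ['BeerID', 'Name', 'Style', 'ABV']
print(result)   # [6.5, 7.5]
```

With the real data, `sample` would be replaced by an open file handle; everything downstream stays the same, which is what makes the pipeline modular.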

In [55]:
columns

['BeerID',
 'Name',
 'URL',
 'Style',
 'StyleID',
 'Size(L)',
 'OG',
 'FG',
 'ABV',
 'IBU',
 'Color',
 'BoilSize',
 'BoilTime',
 'BoilGravity',
 'Efficiency',
 'MashThickness',
 'SugarScale',
 'BrewMethod',
 'PitchRate',
 'PrimaryTemp',
 'PrimingMethod',
 'PrimingAmount',
 'UserId\n']

In [35]:
next(beerdicts)

{'BeerID': '1',
 'Name': 'Vanilla Cream Ale',
 'URL': '/homebrew/recipe/view/1633/vanilla-cream-ale',
 'Style': 'Cream Ale',
 'StyleID': '45',
 'Size(L)': '21.77',
 'OG': '1.055',
 'FG': '1.013',
 'ABV': '5.48',
 'IBU': '17.65',
 'Color': '4.83',
 'BoilSize': '28.39',
 'BoilTime': '75',
 'BoilGravity': '1.038',
 'Efficiency': '70',
 'MashThickness': 'N/A',
 'SugarScale': 'Specific Gravity',
 'BrewMethod': 'All Grain',
 'PitchRate': 'N/A',
 'PrimaryTemp': '17.78',
 'PrimingMethod': 'corn sugar',
 'PrimingAmount': '4.5 oz',
 'UserId\n': '116\n'}

### Consuming the data and generating some insights

In [56]:
# what is the most popular style of homebrewed beer?

beer_counts = {}

for bd in beerdicts:
    if bd["Style"] not in beer_counts:
        beer_counts[bd["Style"]] = 1
    else:
        beer_counts[bd["Style"]] += 1
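The counting loop can also be written with `collections.Counter`, which accepts any iterable, including a generator expression. This is a sketch on a hypothetical handful of dictionaries standing in for `beerdicts`, not a replacement for the cell above.

```python
from collections import Counter

# Hypothetical stand-in for the beerdicts generator.
sample_styles = ["American IPA", "Saison", "American IPA", "Cream Ale"]
sample_dicts = ({"Style": style} for style in sample_styles)

# Counter consumes the generator one item at a time.
style_counts = Counter(bd["Style"] for bd in sample_dicts)

top = style_counts.most_common(1)
print(top)  # [('American IPA', 2)]
```

Only the running counts are kept in memory, never the full set of rows.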

In [57]:
import pandas as pd

In [61]:
pd.Series(beer_counts).sort_values(ascending=False)

American IPA                         11940
American Pale Ale                     7581
Saison                                2617
American Light Lager                  2277
American Amber Ale                    2038
Blonde Ale                            1753
Imperial IPA                          1478
American Stout                        1268
Irish Red Ale                         1204
American Brown Ale                    1152
Witbier                               1072
California Common Beer                1044
Weissbier                              988
Oatmeal Stout                          961
Russian Imperial Stout                 929
Weizen/Weissbier                       919
Sweet Stout                            919
Robust Porter                          897
Kölsch                                 869
Double IPA                             864
Cream Ale                              830
American Porter                        787
English IPA                            784
Imperial St

In [65]:
# what's the average ABV of American IPA?
IPA_count = beer_counts["American IPA"]
sum(abv) / IPA_count

6.44429396984925
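A caveat when reading the cells top to bottom: a generator can be consumed only once. The counting loop exhausts `beerdicts`, so `abv`, which draws from it, would yield nothing afterwards. The cell numbers (In [64] runs before In [65]) suggest the pipeline was re-created before this average was computed. One way to sidestep the issue is to accumulate the sum and the count in a single pass, sketched here with hypothetical ABV values.

```python
# Hypothetical ABV values standing in for the abv generator.
abv_values = (x for x in [6.0, 7.0, 6.5])

total = 0.0
count = 0
for value in abv_values:  # single pass: sum and count together
    total += value
    count += 1

mean_abv = total / count if count else float("nan")
print(mean_abv)  # 6.5

# The generator is now exhausted; asking again yields nothing.
leftover = list(abv_values)
print(leftover)  # []
```

To run a second analysis over the same data, the generators have to be rebuilt from the file.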

### Conclusion

* Generators are memory efficient since they only require memory for the one value they yield.
* Generators are lazy: they only yield values when explicitly asked.
* You can feed the output of a generator to the input of another generator to form data pipelines.
* Data pipelines can be modularized and customized to your needs.
* Generators are useful for generating values ad infinitum.
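As a small illustration of the last point, the sketch below defines an endless generator and takes only the first few values from it. The function name `naturals` is made up for this example; `itertools.islice` is the standard way to take a finite slice of a lazy sequence.

```python
from itertools import islice

def naturals(start=0):
    """Yield start, start + 1, start + 2, ... without end."""
    n = start
    while True:
        yield n
        n += 1

# Laziness makes the infinite loop safe: only five values are ever computed.
first_five = list(islice(naturals(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```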