# Part 04: Parsing, Data Analysis, and Report Generation

# Generators

## Iteration and iterables

**Iteration is the repetition of some kind of process over and over again.** Python’s for loop gives us an easy way to iterate over various objects. Often, you’ll iterate over a list, but we can also iterate over other Python objects such as strings and dictionaries.

In [5]:
# Iterating over a list
ez_list = [1, 2, 3]
for i in ez_list:
    print(i)

1
2
3


In [6]:
# Iterating over a string
ez_string = "Generators"
for s in ez_string:
    print(s)

G
e
n
e
r
a
t
o
r
s


In [9]:
# Iterating over a dictionary
ez_dict = {1 : "First", 2 : "Second"}
for key, value in ez_dict.items():
    print(key, value)

1 First
2 Second


**We refer to any object that can support iteration as an iterable.**

### What defines an iterable?

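Iterables support the **Iterator Protocol**: calling `iter()` on an iterable returns an iterator, and the `for` loop advances that iterator with `next()`. An object without this support, such as an integer, cannot be looped over: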

In [10]:
number = 12345
for n in number:
    print(n)

TypeError: 'int' object is not iterable

> - Iteration is the idea of repeating some process over a sequence of items. In Python, iteration is usually done with the for loop.
> - An iterable is an object that supports iteration.
> - To be an iterable, an object must tell a for loop two things:
>     - Which item comes next in the iteration.
>     - When the loop should stop iterating.
> - Generators are iterables.

## Iterators

<ul>
<li>iterable - an object that has the __iter__ method defined</li>
<li>iterator - an object that has both __iter__ and __next__
defined where __iter__ will return the iterator object and
__next__ will return the next element in the iteration.</li>
</ul>
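
The two definitions above can be checked at runtime. The sketch below uses the abstract base classes `Iterable` and `Iterator` from `collections.abc`; the variable names are illustrative only.

```python
from collections.abc import Iterable, Iterator

numbers = [1, 2, 3]
numbers_iter = iter(numbers)

# A list is an iterable (it defines __iter__) but not an iterator (no __next__)
list_is_iterable = isinstance(numbers, Iterable)
list_is_iterator = isinstance(numbers, Iterator)

# The object returned by iter() is an iterator, and every iterator is also iterable
iter_is_iterator = isinstance(numbers_iter, Iterator)
iter_is_iterable = isinstance(numbers_iter, Iterable)

# Calling iter() on an iterator returns the very same object
same_object = iter(numbers_iter) is numbers_iter

# An int defines neither method, which is why looping over 12345 fails
int_is_iterable = isinstance(12345, Iterable)
```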

In [2]:
my_list = [1, 2, 3]
next(my_list)

TypeError: 'list' object is not an iterator

In [3]:
iter(my_list)

<list_iterator at 0x7f8acc362780>

In [4]:
list_iterator = iter(my_list)
next(list_iterator)

1

In [5]:
next(list_iterator)

2

In [6]:
next(list_iterator)

3

In [7]:
next(list_iterator)

StopIteration: 

In [10]:
for item in iter(my_list):
    print(item)

1
2
3


In [9]:
for item in my_list:
    print(item)

1
2
3
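
Both loops print the same values because a `for` loop calls `iter()` on its target before it starts. A rough sketch of what the loop does behind the scenes, written as an explicit `while` loop (the name `collected` is only for illustration):

```python
my_list = [1, 2, 3]
collected = []

iterator = iter(my_list)        # the for loop does this once, up front
while True:
    try:
        item = next(iterator)   # ask for the next element
    except StopIteration:       # the iterator signals that it is exhausted
        break
    collected.append(item)      # this is where the loop body would run

print(collected)
```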


### Creating Your Own Iterators

In [14]:
import re
RE_WORD = re.compile(r'\w+')


class SentenceIterator:
    def __init__(self, text):
        """
        Constructor
        """
        self.words = RE_WORD.findall(text)
        self.index = 0

    def __iter__(self):
        """
        Returns itself as an iterator
        """
        return self

    def __next__(self):
        """
        Returns the next word in the sequence or 
        raises StopIteration
        """
        try:
            word = self.words[self.index]
        except IndexError:
            raise StopIteration()
        self.index += 1
        return word

if __name__ == '__main__':
    sentence = SentenceIterator('Danes je lep dan.')
    for item in sentence:
        print(item)

Danes
je
lep
dan


In [15]:
class Doubler:
    """
    An infinite iterator
    """
    def __init__(self):
        """
        Constructor
        """
        self.number = 0

    def __iter__(self):
        """
        Returns itself as an iterator
        """
        return self

    def __next__(self):
        """
        Increments the counter each time next is called
        and returns its square (despite the class name,
        the value is squared, not doubled).
        """
        self.number += 1
        return self.number * self.number

if __name__ == '__main__':
    doubler = Doubler()
    count = 0

    for number in doubler:
        print(number)
        if count > 5:
            break
        count += 1

1
4
9
16
25
36
49
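
The manual counter and `break` above can usually be replaced by `itertools.islice`, which stops after a fixed number of items. A minimal sketch, using a hypothetical `Squares` class that behaves like the iterator above:

```python
from itertools import islice


class Squares:
    """Hypothetical infinite iterator yielding 1, 4, 9, ..."""

    def __init__(self):
        self.number = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.number += 1
        return self.number ** 2


# Take only the first seven values; the iterator itself never raises StopIteration
first_seven = list(islice(Squares(), 7))
print(first_seven)
```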


## Generators

In [11]:
# Example: a first attempt at a countdown function
def countdown(num):
    print('Starting')
    while num > 0:
        return num  # return exits the function on the first pass
        num -= 1    # never reached
    print('Stop')

In [12]:
print(countdown(5))

Starting
5


### The generator function

In [13]:
# Regular function
def function_a():
    return "a"

In [14]:
# Generator function
def generator_a():
    yield "a"

In [15]:
function_a()

'a'

In [16]:
generator_a()

<generator object generator_a at 0x7f39bc2d9228>

In [19]:
a = generator_a()

In [20]:
# Asking the generator what the next item is
next(a)

'a'

In [21]:
# Do not do this
next(a)

StopIteration: 

In [22]:
def multi_generate():
    yield "a"
    yield "b"
    yield "c"

In [23]:
mg = multi_generate()
next(mg)

'a'

In [24]:
next(mg)

'b'

In [25]:
next(mg)

'c'

In [26]:
next(mg)

StopIteration: 

In [27]:
next(multi_generate())

'a'

In [28]:
next(multi_generate())

'a'

In [29]:
# Exercise: solve the previous countdown problem with a generator
def countdown(num):
    print('Starting')
    while num > 0:
        yield num
        num -= 1
    print('Stop')

In [34]:
c = countdown(3)
print(next(c))
print(next(c))
print(next(c))

Starting
3
2
1
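
Note that 'Stop' has not been printed yet: the generator is paused at its last `yield`. Only a further `next()` call resumes it, runs the remaining code, and raises `StopIteration`. A small sketch of that final step, where the `events` list is an illustrative stand-in for the `print` calls:

```python
events = []


def countdown(num):
    events.append('Starting')
    while num > 0:
        yield num
        num -= 1
    events.append('Stop')


c = countdown(2)
values = [next(c), next(c)]         # 'Starting' is recorded on the first next()
stopped_before = 'Stop' in events   # still False: paused at the last yield

try:
    next(c)                         # resumes, records 'Stop', then the function ends
    finished = False
except StopIteration:
    finished = True
```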


In [1]:
# Creating a generator that will generate the data row by row
def beerDataGenerator():
    file = "data/recipeData.csv"
    with open(file, encoding="ISO-8859-1") as f:
        for row in f:
            yield row

In [2]:
# Remember to store an instance of the generator so we can refer back to it
beer = beerDataGenerator()

In [3]:
next(beer)

'BeerID,Name,URL,Style,StyleID,Size(L),OG,FG,ABV,IBU,Color,BoilSize,BoilTime,BoilGravity,Efficiency,MashThickness,SugarScale,BrewMethod,PitchRate,PrimaryTemp,PrimingMethod,PrimingAmount,UserId\n'

In [4]:
next(beer)

'1,Vanilla Cream Ale,/homebrew/recipe/view/1633/vanilla-cream-ale,Cream Ale,45,21.77,1.055,1.013,5.48,17.65,4.83,28.39,75,1.038,70,N/A,Specific Gravity,All Grain,N/A,17.78,corn sugar,4.5 oz,116\n'

### The generator expression

In [39]:
lc_example = [n**2 for n in [1, 2, 3, 4, 5]]

In [40]:
lc_example

[1, 4, 9, 16, 25]

In [41]:
genex_example = (n**2 for n in [1, 2, 3, 4, 5])

In [42]:
genex_example

<generator object <genexpr> at 0x7f39ad5305e8>

In [43]:
genex_example2 = (n**2 for n in [1, 2, 3, 4, 5] if n >= 3)

In [44]:
next(genex_example2)

9

In [45]:
next(genex_example)

1

In [46]:
next(genex_example)

4

In [47]:
next(genex_example)

9

In [48]:
next(genex_example)

16

In [49]:
next(genex_example)

25

In [50]:
next(genex_example)

StopIteration: 

In [51]:
genex_example = (n**2 for n in [1, 2, 3, 4, 5])
# Using a for loop to consume a generator is better than using next()
for ge in genex_example:
    print(ge)

1
4
9
16
25


In [20]:
beer_data = "data/recipeData.csv"

# This one line performs the same action as beerDataGenerator()!
lines = (line for line in open(beer_data, encoding="ISO-8859-1"))

> - Generators produce values one at a time as opposed to giving them all at once.
> - There are two ways to create generators: generator functions and generator expressions.
> - Generator functions yield, regular functions return.
> - Generator expressions need (), list comprehensions use [].
> - You can only use a generator once.
> - There are two ways to get values from generators: the next() function and a for loop. The for loop is often the preferred method.
> - We can use generators to read files and give us one line at a time.
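
The point that a generator can only be used once is easy to trip over. A short sketch showing an exhausted generator and one way around it, rebuilding the generator from a function (`squares` is an illustrative name):

```python
def squares():
    # Each call returns a fresh generator object
    return (n ** 2 for n in [1, 2, 3])


gen = squares()
first_pass = list(gen)    # consumes every value
second_pass = list(gen)   # the generator is exhausted, so this is empty

fresh_pass = list(squares())  # a new generator starts from the beginning
```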

### Laziness and generators
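
Generators are lazy: the body does not run when the generator is created, only when a value is requested. A minimal sketch, where the `log` list is an illustrative way to record when work actually happens:

```python
log = []


def noisy_square(n):
    log.append(f"squaring {n}")   # record every time real work is done
    return n ** 2


lazy = (noisy_square(n) for n in range(1_000_000))
work_done_at_creation = len(log)  # nothing has been computed yet

first = next(lazy)                # computes exactly one value
second = next(lazy)               # and one more
work_done_after_two = len(log)    # only two calls, not a million
```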

### Generator pipelines

In [30]:
beer_data = "data/recipeData.csv"

lines =  (line for line in open(beer_data, encoding="ISO-8859-1"))
lists = (l.split(",") for l in lines)

In [64]:
beer_data = "data/recipeData.csv"

lines =  (line for line in open(beer_data, encoding="ISO-8859-1"))
lists = (l.split(",") for l in lines)

# Take the column names out of the generator and store them, leaving only data
columns = next(lists)

# Take these columns and use them to create an informative dictionary
beerdicts = (dict(zip(columns, data)) for data in lists)

In [41]:
next(beerdicts)

{'BeerID': '1',
 'Name': 'Vanilla Cream Ale',
 'URL': '/homebrew/recipe/view/1633/vanilla-cream-ale',
 'Style': 'Cream Ale',
 'StyleID': '45',
 'Size(L)': '21.77',
 'OG': '1.055',
 'FG': '1.013',
 'ABV': '5.48',
 'IBU': '17.65',
 'Color': '4.83',
 'BoilSize': '28.39',
 'BoilTime': '75',
 'BoilGravity': '1.038',
 'Efficiency': '70',
 'MashThickness': 'N/A',
 'SugarScale': 'Specific Gravity',
 'BrewMethod': 'All Grain',
 'PitchRate': 'N/A',
 'PrimaryTemp': '17.78',
 'PrimingMethod': 'corn sugar',
 'PrimingAmount': '4.5 oz',
 'UserId\n': '116\n'}
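
The last key and value above still carry the trailing newline, and a plain `split(",")` would also break on fields that contain quoted commas. One possible alternative is the standard library's `csv.DictReader`, which is itself a lazy iterator. The sketch below uses a small in-memory sample with made-up rows instead of `data/recipeData.csv`:

```python
import csv
import io

# Hypothetical two-row sample standing in for the real file
sample = io.StringIO(
    'BeerID,Name,Style,ABV\n'
    '1,"Ale, Vanilla Cream",Cream Ale,5.48\n'
    '2,Some IPA,American IPA,6.40\n'
)

reader = csv.DictReader(sample)   # uses the header row as keys, yields one dict per row
first = next(reader)
second = next(reader)
```

With a real file, the open file object (opened with `newline=""`) would take the place of `sample`.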

In [52]:
beer_counts = {}
for bd in beerdicts:
    if bd["Style"] not in beer_counts:
        beer_counts[bd["Style"]] = 1
    else:
        beer_counts[bd["Style"]] += 1

most_popular = 0
most_popular_type = None
for beer, count in beer_counts.items():
    if count > most_popular:
        most_popular = count
        most_popular_type = beer

In [53]:
most_popular_type

'American IPA'

In [65]:
abv = (float(bd["ABV"]) for bd in beerdicts if bd["Style"] == "American IPA")

In [59]:
sum(abv)

76944.87000000004

In [60]:
most_popular

11940

In [66]:
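# Get the average ABV for an American IPA
# Note: a generator can only be consumed once, so abv (and the beerdicts
# pipeline it reads from) must be rebuilt before running this cell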
sum(abv)/most_popular

6.44429396984925
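
The average above relies on `most_popular` having been computed in a separate pass. As an alternative sketch, the sum and the count can be accumulated together in one pass over the pipeline. The `rows` list below is made-up sample data, not taken from `recipeData.csv`:

```python
# Hypothetical sample records in the same shape as the beerdicts items
rows = [
    {"Style": "American IPA", "ABV": "6.0"},
    {"Style": "Cream Ale", "ABV": "5.0"},
    {"Style": "American IPA", "ABV": "7.0"},
]

abv = (float(r["ABV"]) for r in rows if r["Style"] == "American IPA")

total = 0.0
count = 0
for value in abv:          # one pass: nothing is stored besides two numbers
    total += value
    count += 1

average = total / count if count else None
```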

## Using Generators

### Example 1: Reading Large Files

In [7]:
def csv_reader(file_name):
    file = open(file_name)
    result = file.read().split("\n")
    file.close()
    return result

In [8]:
csv_gen = csv_reader("data/nile.csv")
row_count = 0

for row in csv_gen:
    row_count += 1

print(f"Row count is {row_count}")

Row count is 572


In [9]:
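def csv_reader(file_name):
    # Yield one row at a time; the with block closes the file
    # once the generator is exhausted
    with open(file_name, "r") as file:
        for row in file:
            yield row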

In [10]:
csv_gen = csv_reader("data/nile.csv")
row_count = 0

for row in csv_gen:
    row_count += 1

print(f"Row count is {row_count}")

Row count is 571


In [11]:
# Option 1
file = "data/nile.csv"
# Use generators to get number of rows, with one row in memory
def line_aggregate(file):
    rows = 0
    with open(file, encoding='ISO-8859-1') as f:
        for row in f:
            rows += 1
        return rows
    
line_aggregate(file)

571

In [9]:
# Option 2
file = "data/nile.csv"
num_lines = sum(1 for line in open(file))
print(num_lines)

571


In [12]:
csv_gen = (row for row in open("data/nile.csv"))

### Example 2: Generating an Infinite Sequence

In [13]:
a = range(5)

In [14]:
list(a)

[0, 1, 2, 3, 4]

In [5]:
def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

# This loop never ends on its own; interrupt it manually (e.g. Ctrl+C):
# for i in infinite_sequence():
#     print(i, end=" ")

In [12]:
gen = infinite_sequence()

In [13]:
next(gen)

0

In [14]:
next(gen)

1

In [15]:
next(gen)

2

In [16]:
next(gen)

3
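
The standard library already ships an equivalent of `infinite_sequence()`: `itertools.count`. Combined with `itertools.islice` or `itertools.takewhile`, it can be consumed safely without an endless loop. A brief sketch:

```python
from itertools import count, islice, takewhile

# count() yields 0, 1, 2, ... forever; islice() cuts it off after five items
first_five = list(islice(count(), 5))

# takewhile() stops as soon as the condition fails
below_twenty = list(takewhile(lambda n: n < 20, count(start=0, step=5)))
```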

### Example 3: Creating New Iteration Patterns with Generators

In [37]:
def frange(start, stop, increment):
    x = start
    while x < stop:
        yield x
        x += increment

In [40]:
for n in frange(0, 4, 0.5):
    print(n)

0
0.5
1.0
1.5
2.0
2.5
3.0
3.5


In [41]:
list(frange(0, 1, 0.125))

[0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875]
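
The steps 0.5 and 0.125 are exactly representable in binary floating point, so the output above is clean. With a step such as 0.1, repeated addition accumulates rounding error and can even produce an extra element. A possible variant, here called `frange_indexed` (a hypothetical name), computes each value from its index instead of accumulating:

```python
def frange(start, stop, increment):
    x = start
    while x < stop:
        yield x
        x += increment


def frange_indexed(start, stop, increment):
    # Compute each value from the index to avoid accumulating rounding error
    i = 0
    x = start
    while x < stop:
        yield x
        i += 1
        x = start + i * increment


accumulated = list(frange(0, 1, 0.1))
indexed = list(frange_indexed(0, 1, 0.1))

print(len(accumulated), accumulated[-1])  # the last value falls just below 1
print(len(indexed), indexed[-1])
```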

### Example 4: Creating Data Pipelines With Generators

In [18]:
file_name = "data/techcrunch.csv"
lines = (line for line in open(file_name))

In [19]:
list_line = (s.rstrip().split(",") for s in lines)

In [20]:
cols = next(list_line)

In [21]:
file_name = "data/techcrunch.csv"
lines = (line for line in open(file_name))
list_line = (s.rstrip().split(",") for s in lines)
cols = next(list_line)

In [22]:
company_dicts = (dict(zip(cols, data)) for data in list_line)

In [23]:
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"
)

In [24]:
total_series_a = sum(funding)

In [25]:
total_series_a

18500000

In [26]:
file_name = "data/techcrunch.csv"
lines = (line for line in open(file_name))
list_line = (s.rstrip().split(",") for s in lines)
cols = next(list_line)
company_dicts = (dict(zip(cols, data)) for data in list_line)
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"
)
total_series_a = sum(funding)
print(f"Total series A fundraising: ${total_series_a}")

Total series A fundraising: $18500000
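
The same pipeline shape extends to other aggregations. As a sketch, the stages can be packaged into a function that totals `raisedAmt` per funding round. The function name `totals_by_round` is hypothetical, and the CSV text below is invented sample data with only three columns; the real `techcrunch.csv` is assumed to have more:

```python
import io
from collections import defaultdict


def totals_by_round(lines):
    """Sum raisedAmt per round, reading the lines lazily."""
    list_line = (s.rstrip().split(",") for s in lines)
    cols = next(list_line)                      # header row
    company_dicts = (dict(zip(cols, data)) for data in list_line)

    totals = defaultdict(int)
    for company in company_dicts:               # only one row is held at a time
        totals[company["round"]] += int(company["raisedAmt"])
    return dict(totals)


# Hypothetical sample rows
sample = io.StringIO(
    "company,raisedAmt,round\n"
    "Alpha,1000000,a\n"
    "Beta,2500000,b\n"
    "Gamma,500000,a\n"
)
result = totals_by_round(sample)
```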


### Example 5: A generator that follows a log file like Unix 'tail -f'

In [None]:
# logsim.py
import random
import time

from data import ips
from data import docs

with open("access-log","w") as f:
    while True:
        time.sleep(random.random())
        n = random.randint(0,len(ips)-1)
        m = random.randint(0,len(docs)-1)
        t = time.time()
        date = time.strftime("[%d/%b/%Y:%H:%M:%S -0600]",time.localtime(t))
        write_String = f'{ips[n]} - - {date} {docs[m]}\n'
        f.write(write_String)
        f.flush()

In [None]:
# follow.py
#
# A generator that follows a log file like Unix 'tail -f'.
#
# Note: To see this example work, you need to apply it to
# an active server log file.  Run the program "logsim.py"
# in the background to simulate such a file.  This program
# will write entries to a file "access-log".

import time

def follow(thefile):
    thefile.seek(0,2)      # Go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)    # Sleep briefly
            continue
        yield line

# Example use
if __name__ == '__main__':
    with open("access-log") as logfile:
        for line in follow(logfile):
            print(line, end="")

In [None]:
# pipeline.py
#
# An example of setting up a processing pipeline with generators
import re

def grep(pattern,lines):
    patc = re.compile(pattern)
    for line in lines:
        if patc.search(line):
            yield line

if __name__ == '__main__':
    from follow import follow

    # Set up a processing pipe : tail -f | grep python
    with open("access-log") as logfile:
        loglines = follow(logfile)
        pylines  = grep(r"python", loglines)

        # Pull results out of the processing pipeline
        for line in pylines:
            print(line, end="")
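
Because `grep` accepts any iterable of lines, it can be tried out without a live log file. A small sketch with a hard-coded list of made-up log lines in place of `follow(logfile)`:

```python
import re


def grep(pattern, lines):
    patc = re.compile(pattern)
    for line in lines:
        if patc.search(line):
            yield line


# Made-up log lines standing in for the output of follow()
sample_lines = [
    '10.0.0.1 - - [01/Jan/2020:10:00:00 -0600] /python/index.html\n',
    '10.0.0.2 - - [01/Jan/2020:10:00:01 -0600] /ruby/index.html\n',
    '10.0.0.3 - - [01/Jan/2020:10:00:02 -0600] /python/tutorial.html\n',
]

# Stages can be stacked: keep "python" lines first, then only those mentioning "tutorial"
pylines = grep(r"python", sample_lines)
tutorials = list(grep(r"tutorial", pylines))
```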

### Example 6: Calculate the number of bytes transferred in an Apache server log 

In [None]:
# genlog.py
#
# Sum up the bytes transferred in an Apache server log using
# generator expressions

with open("access-log") as wwwlog:
    bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
    bytes_sent = (int(x) for x in bytecolumn if x != '-')
    print("Total", sum(bytes_sent))
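
To see what each stage produces, the same two expressions can be run on a few in-memory lines. The lines below are invented and assume a log layout in which the last whitespace-separated field is the byte count, or `-` when no body was sent:

```python
# Invented sample lines; the last field is the number of bytes sent
wwwlog = [
    '1.2.3.4 - - [10/Jul/2012:00:18:50 -0500] "GET /index.html HTTP/1.1" 200 512\n',
    '1.2.3.4 - - [10/Jul/2012:00:18:51 -0500] "GET /missing HTTP/1.1" 304 -\n',
    '5.6.7.8 - - [10/Jul/2012:00:18:52 -0500] "GET /robots.txt HTTP/1.1" 200 71\n',
]

bytecolumn = (line.rsplit(None, 1)[1] for line in wwwlog)   # '512', '-', '71'
bytes_sent = (int(x) for x in bytecolumn if x != '-')       # 512, 71
total = sum(bytes_sent)
print("Total", total)
```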

## Exercise: Creating Data Processing Pipelines

    foo/
        access-log-012007.gz
        access-log-022007.gz
        access-log-032007.gz
        ...
        access-log-012008
    bar/
        access-log-092007.bz2
        ...
        access-log-022008

In [30]:
import os
import fnmatch
import gzip
import bz2
import re

def gen_find(filepat, top):
    '''
    Find all filenames in a directory tree that match a shell wildcard pattern
    '''
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path,name)
            
def gen_opener(filenames):
    '''
    Open a sequence of filenames one at a time producing a file object.
    The file is closed immediately when proceeding to the next iteration.
    '''
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        elif filename.endswith('.bz2'):
            f = bz2.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()
        
def gen_concatenate(iterators):
    '''
    Chain a sequence of iterators together into a single sequence.
    '''
    for it in iterators:
        yield from it
        
def gen_grep(pattern, lines):
    '''
    Look for a regex pattern in a sequence of lines
    '''
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

In [33]:
lognames = gen_find('access-log*', 'pipeline')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep(r'robotsl?\.txt', lines)  # escape the dot so it matches a literal "."

for line in pylines:
    print(line, end='')

124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robotsl.txt ..." 200 71
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
124.115.6.12 - - [1

In [35]:
lognames = gen_find('access-log*', 'pipeline')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep(r'124\.115\.6\.12', lines)  # escape the dots so only this IP matches
bytecolumn = (line.rsplit(None,1)[1] for line in pylines)
_bytes = (int(x) for x in bytecolumn if x != '-')
print('Total', sum(_bytes))

Total 1704
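
`gen_concatenate` does the same job as `itertools.chain.from_iterable` from the standard library, which also pulls from each inner iterable lazily. A brief sketch with lists standing in for the open files:

```python
from itertools import chain


def gen_concatenate(iterators):
    for it in iterators:
        yield from it


# Lists of lines standing in for open file objects
fake_files = [["a1\n", "a2\n"], ["b1\n"], []]

via_generator = list(gen_concatenate(fake_files))
via_chain = list(chain.from_iterable(fake_files))
```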


## Tips

### Consider Generator Expressions for Large Comprehensions

In [37]:
# Exercise: in a single line, count the number of characters in each line of the file
value = [len(x) for x in open("data/example.txt")]
print(value)

[7, 2, 10, 9, 7, 12, 10, 11, 4, 3, 3, 13, 3, 2, 22]


In [38]:
it = (len(x) for x in open("data/example.txt"))
print(it)

<generator object <genexpr> at 0x0000016E0C36B0B0>


In [39]:
print(next(it))

7


In [40]:
print(next(it))

2


In [41]:
roots = ((x, x**0.5) for x in it)

In [42]:
print(next(roots))

(10, 3.1622776601683795)


<hr>

In [43]:
sum([i * i for i in range(1000)])

332833500

In [44]:
sum(i * i for i in range(10000000))

333333283333335000000

In [45]:
sum(map(lambda i: i*i, range(10000000)))

333333283333335000000

### Consider Generators Instead of Returning Lists

In [46]:
def index_words(text):
    result = []
    if text:
        result.append(0)
    for index, letter in enumerate(text):
        if letter == ' ':
            result.append(index + 1)
    return result

This works as expected for some sample input.

In [47]:
address = 'Four score and seven years ago…'
result = index_words(address)
print(result)

[0, 5, 11, 15, 21, 27]


In [48]:
def index_words_iter(text):
    if text:
        yield 0
    for index, letter in enumerate(text):
        if letter == ' ':
            yield index + 1

In [49]:
result = list(index_words_iter(address))

In [50]:
result

[0, 5, 11, 15, 21, 27]
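
A benefit of the generator version is that its input can be streamed as well. The sketch below reads from any iterable of lines and keeps only one line in memory at a time; `index_lines` is a hypothetical name, and the `io.StringIO` object stands in for a real file:

```python
import io


def index_lines(handle):
    """Yield the offset of the first character of every word in a stream of lines."""
    offset = 0
    for line in handle:
        if line:
            yield offset          # a new line starts a new word
        for letter in line:
            offset += 1
            if letter == ' ':
                yield offset      # the word starts right after the space


stream = io.StringIO('Four score\nand seven years')
positions = list(index_lines(stream))
```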

## Profiling Generator Performance

In [64]:
import sys
nums_squared_lc = [i ** 2 for i in range(1_000_000)]
sys.getsizeof(nums_squared_lc)

8448728

In [59]:
nums_squared_gc = (i ** 2 for i in range(1_000_000))
print(sys.getsizeof(nums_squared_gc))

112


In [62]:
import cProfile
cProfile.run('sum([i * 2 for i in range(1_000_000)])')

         5 function calls in 0.118 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.077    0.077    0.077    0.077 <string>:1(<listcomp>)
        1    0.016    0.016    0.117    0.117 <string>:1(<module>)
        1    0.000    0.000    0.118    0.118 {built-in method builtins.exec}
        1    0.025    0.025    0.025    0.025 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




In [63]:
cProfile.run('sum((i * 2 for i in range(1_000_000)))')

         1000005 function calls in 0.232 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000001    0.111    0.000    0.111    0.000 <string>:1(<genexpr>)
        1    0.000    0.000    0.232    0.232 <string>:1(<module>)
        1    0.000    0.000    0.232    0.232 {built-in method builtins.exec}
        1    0.121    0.121    0.232    0.232 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
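
The two profiles show a trade-off: the list comprehension is faster here, while the generator expression needs far less memory. To repeat the comparison on another machine, a sketch along these lines could be used; timings vary between runs and systems, so no expected numbers are given:

```python
import sys
import timeit

N = 100_000  # smaller than above so the sketch runs quickly

# Time ten runs of each variant
list_time = timeit.timeit(lambda: sum([i * 2 for i in range(N)]), number=10)
gen_time = timeit.timeit(lambda: sum(i * 2 for i in range(N)), number=10)

# Size of the container objects themselves, not of the integers they refer to
list_size = sys.getsizeof([i * 2 for i in range(N)])
gen_size = sys.getsizeof(i * 2 for i in range(N))

print(f"list comprehension:   {list_time:.3f} s, {list_size} bytes")
print(f"generator expression: {gen_time:.3f} s, {gen_size} bytes")
```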


