# Iterators and generators
A generator lets us retrieve the elements of a collection one at a time, without first loading the whole collection into memory. Generators become more useful as the collection, or the objects stored in it, grow larger, so they are a handy tool for processing long time series or sequences that do not fit in memory.

In [41]:
from datetime import datetime
from itertools import count, filterfalse, groupby, islice
from random import normalvariate, randint

from scipy.stats import normaltest

In [2]:
def fibonacci_list(num_items):
    numbers = []
    a, b = 0, 1
    while len(numbers) < num_items:
        numbers.append(a)
        a, b = b, a+b
    return numbers

In [3]:
fibonacci_list(10)

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

In [4]:
type(fibonacci_list(10))

list

We now define a generator function that produces (yields) the Fibonacci numbers one at a time.

In [38]:
def fibonacci_gen(num_items):
    a, b = 0, 1
    while num_items:
        yield a
        a, b = b, a+b
        num_items -= 1

In [39]:
for f in fibonacci_gen(10):
    print(f)

0
1
1
2
3
5
8
13
21
34


In [7]:
type(fibonacci_gen(10))

generator
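
A generator object computes its values only when asked for them. The following minimal sketch repeats the `fibonacci_gen` definition so that it runs on its own, and uses the built-in `next()` to pull values one at a time. Once exhausted, the generator raises `StopIteration` and cannot be restarted.

```python
def fibonacci_gen(num_items):
    a, b = 0, 1
    while num_items:
        yield a
        a, b = b, a + b
        num_items -= 1

gen = fibonacci_gen(3)   # nothing has been computed yet
first = next(gen)        # 0
second = next(gen)       # 1
third = next(gen)        # 1

# A fourth call finds the generator exhausted
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

# A generator is single-pass: iterating again yields nothing
leftover = list(gen)     # []
```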

## Memory usage and time complexity
Building a full list just to loop over it is more expensive than using a generator, in terms of both memory and time. The trade-off is that we cannot index into a generator, and we cannot know how many elements it holds without consuming it. As a test, we count how many of the first 100,000 Fibonacci numbers are divisible by three, once with the generator and once with the list.

In [8]:
len([n for n in fibonacci_gen(100_000) if n % 3 == 0])

25000

In [9]:
len([n for n in fibonacci_list(100_000) if n % 3 == 0])

25000

In [10]:
%load_ext memory_profiler

In [11]:
%memit len([n for n in fibonacci_gen(100_000) if n % 3 == 0])

peak memory: 237.30 MiB, increment: 116.78 MiB


In [12]:
%memit len([n for n in fibonacci_list(100_000) if n % 3 == 0])

peak memory: 585.56 MiB, increment: 463.68 MiB


In [13]:
%timeit len([n for n in fibonacci_list(100_000) if n % 3 == 0])

2.77 s ± 354 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
%timeit len([n for n in fibonacci_gen(100_000) if n % 3 == 0])

2 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
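
The trade-off described in this section can be checked directly. The sketch below, which repeats `fibonacci_gen` so that it is self-contained, shows that `len()` and indexing both fail on a generator. It also counts the matches with `sum()` over a generator expression, an alternative to the list comprehension used in the cells above that avoids building the intermediate list of results.

```python
def fibonacci_gen(num_items):
    a, b = 0, 1
    while num_items:
        yield a
        a, b = b, a + b
        num_items -= 1

gen = fibonacci_gen(100)

# len() is not defined for generators
try:
    len(gen)
    has_len = True
except TypeError:
    has_len = False

# Indexing is not defined either
try:
    gen[0]
    indexable = True
except TypeError:
    indexable = False

# Count the matches one at a time, without storing them
divisible_by_three = sum(1 for n in fibonacci_gen(100) if n % 3 == 0)
print(divisible_by_three)  # 25
```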


## Operations with generators
Generators can be used to represent infinite sequences. The [itertools](https://docs.python.org/3/library/itertools.html) module provides functions to create infinite iterators, chain iterators together, and stop an iterator when a condition is met. To experiment with these tools, we simulate reading a long time series of (timestamp, value) pairs and check whether the data points of any day fail to follow a normal distribution. In this notebook the generator does not actually read a file; it creates the (timestamp, value) pairs at random. The file-based version is kept commented out for reference.
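
As a minimal sketch of the three kinds of itertools operations mentioned here, the example below uses `count()` for an infinite iterator, `chain()` to join two generators, and `takewhile()` to end an iterator on a condition. The variable names are illustrative only.

```python
from itertools import chain, count, islice, takewhile

# count() yields 0, 1, 2, ... forever; islice() takes a finite prefix
first_five = list(islice(count(), 5))        # [0, 1, 2, 3, 4]

# chain() concatenates iterables lazily, one after the other
evens = (n for n in range(0, 6, 2))          # 0, 2, 4
odds = (n for n in range(1, 6, 2))           # 1, 3, 5
chained = list(chain(evens, odds))           # [0, 2, 4, 1, 3, 5]

# takewhile() stops at the first element that fails the condition
squares = (n * n for n in count())
small_squares = list(takewhile(lambda s: s < 50, squares))
# [0, 1, 4, 9, 16, 25, 36, 49]
```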

In [15]:
#def read_data(filename):
#    with open(filename) as fd:
#        for line in fd:
#            data = line.strip().split(',')
#            timestamp, value = map(int, data)
#            yield datetime.fromtimestamp(timestamp), value

### Random process simulation
We simulate a random process with a generator function that loops indefinitely over [count()](https://docs.python.org/3/library/itertools.html#itertools.count), an itertools function that yields consecutive integers starting from 0. Each integer is used as a timestamp in seconds. On each iteration a value is drawn from a standard normal distribution, except that, on average once a week (604,800 seconds, i.e. iterations), the function emits the value 100, which lies far outside that distribution. The function yields (timestamp, value) tuples.

In [21]:
def read_fake_data(filename):
    # filename is unused; it only mirrors the signature of read_data
    for timestamp in count():
        # We insert an anomalous data point approximately once a week
        if randint(0, 7 * 60 * 60 * 24 - 1) == 1:
            value = 100
        else:
            value = normalvariate(0, 1)
        yield datetime.fromtimestamp(timestamp), value

In [22]:
def groupby_day(iterable):
    key = lambda row: row[0].day
    for day, data_group in groupby(iterable, key):
        yield list(data_group)
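
`groupby` only groups consecutive elements that share a key, so `groupby_day` relies on the rows arriving in chronological order. A small sketch with four hand-made rows around midnight (the dates are arbitrary) illustrates both points:

```python
from datetime import datetime, timedelta
from itertools import groupby

# Two rows at the end of January 1st, two at the start of January 2nd
start = datetime(2020, 1, 1, 23, 59, 58)
rows = [(start + timedelta(seconds=i), i) for i in range(4)]

groups = [
    (day, [value for _, value in group])
    for day, group in groupby(rows, key=lambda row: row[0].day)
]
# [(1, [0, 1]), (2, [2, 3])]

# Equal keys that are not adjacent end up in separate groups
keys = [key for key, _ in groupby([1, 1, 2, 1])]
# [1, 2, 1]
```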

In [23]:
def is_normal(data, threshold=1e-3):
    _, values = zip(*data) # unpack the data tuples
    k2, p_value = normaltest(values)
    if p_value < threshold:
        return False
    return True

This function returns a generator that yields only the groups of data points that are not normally distributed.

In [24]:
def filter_anomalous_groups(data):
    yield from filterfalse(is_normal, data)
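
`filterfalse` keeps the elements for which the predicate returns a false value, which is why passing `is_normal` selects the non-normal groups. The sketch below shows the same pattern on tiny lists. The predicate `is_small` is a hypothetical stand-in for `is_normal`, used only to keep the example deterministic.

```python
from itertools import filterfalse

def is_small(group, limit=10):
    # Hypothetical stand-in for is_normal: True when no value reaches the limit
    return max(group) < limit

groups = [[1, 2, 3], [4, 100, 5], [6, 7], [200]]

# Lazily keep only the groups for which is_small returns False
flagged = list(filterfalse(is_small, groups))
# [[4, 100, 5], [200]]
```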

In [25]:
def filter_anomalous_data(data):
    data_group = groupby_day(data)
    yield from filter_anomalous_groups(data_group)

We set up the pipeline that collects the first ten groups of (fake) data points that do not follow a normal distribution. Nothing is computed until we iterate over the result.

In [36]:
data = read_fake_data('filename')
anomaly_generator = filter_anomalous_data(data)
first_anomalies = islice(anomaly_generator, 10)

We print the first days whose data points are not normally distributed.

In [37]:
for data_anomaly in first_anomalies:
    start_date = data_anomaly[0][0]
    # Report the largest value of the day rather than the first one
    value = max(v for _, v in data_anomaly)
    end_date = data_anomaly[-1][0]
    print(f"Anomaly from {start_date} - {end_date}, value: {value}")

Anomaly from 1970-01-03 00:00:00 - 1970-01-03 23:59:59, value: 100
Anomaly from 1970-01-04 00:00:00 - 1970-01-04 23:59:59, value: 100
Anomaly from 1970-01-08 00:00:00 - 1970-01-08 23:59:59, value: 100
Anomaly from 1970-01-11 00:00:00 - 1970-01-11 23:59:59, value: 100
Anomaly from 1970-01-15 00:00:00 - 1970-01-15 23:59:59, value: 100
Anomaly from 1970-01-19 00:00:00 - 1970-01-19 23:59:59, value: 100
Anomaly from 1970-01-27 00:00:00 - 1970-01-27 23:59:59, value: 100
Anomaly from 1970-02-08 00:00:00 - 1970-02-08 23:59:59, value: 100
Anomaly from 1970-02-15 00:00:00 - 1970-02-15 23:59:59, value: 100
Anomaly from 1970-02-22 00:00:00 - 1970-02-22 23:59:59, value: 100


In [40]:
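def groupby_window(data, window_size=3600):
    # data must be an iterator, so that islice consumes the first items
    window = tuple(islice(data, window_size))
    for item in data:
        yield window
        # Slide the window forward by one item
        window = window[1:] + (item,)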
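
A short usage sketch of `groupby_window` on six integers, with a window of three items, shows the overlapping windows it produces. The function is repeated so that the example runs on its own. As written, the window built from the final item is never yielded, which does not matter for an endless data source.

```python
from itertools import islice

def groupby_window(data, window_size=3600):
    window = tuple(islice(data, window_size))
    for item in data:
        yield window
        window = window[1:] + (item,)

# Pass an iterator, not a list, so that islice and the loop share state
data = iter(range(6))
windows = list(groupby_window(data, window_size=3))
# [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
```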

## References
* [Welford's online algorithm for calculating the variance of a time series](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)