# Iterables, Iterators, and Generators   <a name='ItItGen' />

## _Iterables_   <a name='Iterables' />

We are all now very familiar with <code>for</code> loops, which provide a basis for an effective functional definition of an _iterable_ data type, which is a data type that:

- can contain 0, 1, or many elements
- can be used with a <code>for</code> statement so that each element of the _iterable_ data type is made available in the iterations of the <code>for</code> loop

<code>for x in _put-iterable-here_:
    ...</code>
    
These are well-known basic Python _iterable_ data types:

- list
- range()
- tuple
- dictionary
- set
- string

The term _iterable_ is often used as a noun, as a short form of '_iterable_ data type'.

The cells below illustrate how these data types are iterable in <code>for</code> loops.

We will later show other data types that are iterable and, therefore, can be included in <code>for</code> loops and in list comprehension statements.

In [None]:
my_list = [0, 1, 2]
my_range = range(3)
my_tup = (0, 1, 2)
my_dct = {0:'zero', 1:'one', 2:'two'}
my_set = {0, 1, 2}
my_str = 'hello'

In [None]:
for e in my_list:
    print(e)

In [None]:
for e in my_range:
    print(e)

In [None]:
for e in my_tup:
    print(e)

In [None]:
for e in my_dct:
    print(e)

In [None]:
for e in my_dct.items():
    print(e)

In [None]:
for e in my_set:
    print(e)

In [None]:
for e in my_str:
    print(e)

Computers are powerful because they can iterate through large data sets and make computations.  Iterable data and loops are the programming components that enable that capability. 

## Iterators  <a name='Iterators' />

Iterators are like iterables in that a <code>for</code> loop can be used to present their elements one at a time (an iterator is itself iterable), but they differ from container iterables such as lists in these ways:
- Only a single pass is permitted through the iterator, in which each of the elements is retrieved once in sequence.
  - Once exhausted, no further elements can be retrieved from an iterator without re-creating it.
  - Once exhausted, the iterator raises a <code>StopIteration</code> exception, which is handled gracefully (you won't see evidence of it) when iterators are used in <code>for</code> loops.
- The next iterator element can be retrieved using the <code>next()</code> function.

The built-in <code>iter()</code> function creates an iterator from a data type that is a collection of elements, as shown below.  

Iterators can be faster and use less memory than storing an entire list in memory, for example, when data sets are large.

In [None]:
it = iter(my_list)

In [None]:
print(next(it))
print(next(it))
print(next(it))
print(next(it))  # Fourth call raises StopIteration: my_list has only three elements

The code cell below, if executed immediately after the cell above, will print nothing because the iterator is exhausted.  The iterator needs to be re-created before using it again.

Notice that rather than seeing evidence of an exception/error, instead nothing happens. 

In [None]:
for i in it:
    print(i)

The code in the cell below will print the elements because the iterator is re-created in the first statement. In addition, you will notice that the <code>for</code> loop does not show the <code>StopIteration</code> error but, instead, just uses it as an indication that the loop should be terminated.

In [None]:
it = iter(my_list)
for i in it:
    print(i)

In [None]:
it = iter(range(3))
for i in it:
    print(i)
    
# This next statement will cause an error
next(it)
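
The way a <code>for</code> loop hides the <code>StopIteration</code> exception can be made explicit.  The cell below is a minimal sketch of the steps a <code>for</code> loop performs, written with <code>iter()</code>, <code>next()</code>, and a <code>try</code>/<code>except</code> block.  The function name <code>manual_for</code> is hypothetical and used only for illustration.

```python
def manual_for(iterable):
    """Collect the elements of an iterable the way a for loop retrieves them."""
    collected = []
    it = iter(iterable)          # Ask the iterable for an iterator
    while True:
        try:
            item = next(it)      # Retrieve the next element
        except StopIteration:    # Raised once the iterator is exhausted
            break                # The loop ends quietly, as a for loop does
        collected.append(item)
    return collected

print(manual_for([0, 1, 2]))
print(manual_for('abc'))
```

The exception is still raised inside the function, but it is caught and used only as the signal to stop.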

Just using <code>iter()</code> on an iterable does not necessarily provide any advantage in speed and memory usage because, as in the case above, the list already exists in memory.  But these examples are useful to demonstrate how an iterator functions.  Iterators do provide speed and memory advantages in other circumstances.

The cells below give a quick example of how iterators can reduce memory usage.

In [None]:
%load_ext memory_profiler

In [None]:
%%memit

with open('files/numbers.txt', 'r') as f:
    data = f.readlines()
    
for i in range(len(data)):
    data[i] = int(data[i].strip())
    
answer = sum(data)
print(answer)

In [None]:
%%memit

answer = 0
with open('files/numbers.txt', 'r') as f:
    for line in f:
        answer += int(line.strip())
    
print(answer)

## Generators  <a name='Generators' />

Generators are a specific type of iterator.  They have all the characteristics of iterators noted above and, in addition, they create a stream of elements with a _function_ rather than merely regurgitating the elements from an existing data structure.

Generators can be defined in two ways:
- With a function
- With a comprehension statement

A generator function is just like a regular custom function except that it uses <code>yield</code> rather than <code>return</code> to send values back to the calling statement.  <code>yield</code> returns a value or values from one iteration of the function, and then causes the function to pause until the next value is requested.  When the next value is requested, the function continues executing until its next <code>yield</code> statement.
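
The pause-and-resume behavior of <code>yield</code> can be traced with a small sketch.  The generator function <code>two_steps</code> and the list <code>events</code> are hypothetical names used only to record when each part of the function body runs.

```python
events = []   # Records which parts of the generator body have run

def two_steps():
    events.append('ran up to the first yield')
    yield 'a'
    events.append('resumed and ran up to the second yield')
    yield 'b'
    events.append('resumed and reached the end of the function')

gen = two_steps()                # Creating the generator runs none of the body
events_after_creation = list(events)

first = next(gen)                # Runs the body up to the first yield
second = next(gen)               # Resumes after the first yield
leftover = list(gen)             # Resumes again; the function ends without yielding

print(events_after_creation)
print(first, second, leftover)
print(events)
```

Nothing is recorded in <code>events</code> until the first <code>next()</code> call, which shows that calling a generator function only creates the generator.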

Generators defined using a comprehension are written within a set of parentheses.

Our first example shows another way to read numerical data from a text file.  It is shown in two ways, one using a function and another using a comprehension.
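
As a small illustration of the parenthesized form, the sketch below builds the same sequence once as a list comprehension and once as a generator expression.  The variable names are hypothetical, and the sizes reported by <code>sys.getsizeof()</code> depend on the Python implementation, so only their relative order is shown.

```python
import sys

squares_list = [n * n for n in range(1000)]   # Square brackets: all values stored
squares_gen = (n * n for n in range(1000))    # Parentheses: values made on demand

print(type(squares_list).__name__, type(squares_gen).__name__)

# The list holds 1000 references; the generator holds only its current state
list_is_larger = sys.getsizeof(squares_list) > sys.getsizeof(squares_gen)
print(list_is_larger)

first_total = sum(squares_gen)    # Consumes the generator
second_total = sum(squares_gen)   # Nothing is left, so the sum is 0
print(first_total, second_total)
```

The second <code>sum()</code> returns 0 because a generator, like any iterator, permits only a single pass.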

In [None]:
%%memit

def read_convert(filepath):
    # The with block closes the file once the generator is exhausted
    with open(filepath, 'r') as f:
        for line in f:
            yield int(line.strip())

answer = 0
for x in read_convert('files/numbers.txt'):
    answer += x
    
print(answer)

In [None]:
%%memit

read_convert_comp = (int(line.strip()) if line.strip() != '' else 0 for line in open('files/numbers.txt', 'r'))

answer = 0
for x in read_convert_comp:
    answer += x
    
print(answer)

Our next example, which generates the Fibonacci sequence one term at a time, is a little more complex.  If $f_0 = 1$ and $f_1 = 1$  are the first two terms in the Fibonacci sequence, then the remaining terms are defined by:

$f_i = f_{i-1} + f_{i-2}$,

so that the Fibonacci sequence starts with

$f = 1,1,2,3,5,8,13,21,34, ...$.

In [None]:
''' Compute n terms of the Fibonacci sequence '''
def fib(n):
    t1, t2 = 1, 1
    counter = 1
    yield t1
    
    while counter < n:  # counter terms have been yielded so far
        counter += 1
        t1, t2 = t2, t1 + t2
        yield t1

Note that two assignments can be executed on the same line, as is done twice in the code above.  

In [None]:
[x for x in fib(7)]

A generator automatically raises the <code>StopIteration</code> exception once its function terminates, as demonstrated in the next cell.

List comprehension and <code>for</code> loops handle the <code>StopIteration</code> exception gracefully.  That is, they simply stop when the exception is encountered without alerting you to it.  That does not happen with <code>while</code> loops.

In [None]:
f = fib(7)
while True:
    print(next(f))
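
When <code>next()</code> is called directly, as in a <code>while</code> loop, the <code>StopIteration</code> exception has to be handled explicitly.  The sketch below shows two common ways to do so: a <code>try</code>/<code>except</code> block, and the optional second argument of <code>next()</code>, which is returned instead of raising the exception.  The generator <code>first_squares</code> is a hypothetical stand-in so that the cell is self-contained.

```python
def first_squares(n):
    """Yield the squares 1, 4, 9, ... of the first n positive integers."""
    for k in range(1, n + 1):
        yield k * k

# Option 1: catch the exception and leave the loop
gen = first_squares(4)
values = []
while True:
    try:
        values.append(next(gen))
    except StopIteration:
        break
print(values)

# Option 2: give next() a default that signals exhaustion
gen = first_squares(2)
more_values = []
while (value := next(gen, None)) is not None:
    more_values.append(value)
print(more_values)
```

The second option assumes that <code>None</code> is never a legitimate element of the stream.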

Generators may be infinite.

In [None]:
''' Compute an infinite number of terms of the Fibonacci sequence '''
def fib_1():
    t1, t2 = 1, 1
    counter = 1
    yield t1
    
    while True:
        counter += 1
        t1, t2 = t2, t1 + t2
        yield t1

In [None]:
f = fib_1()
[next(f) for _ in range(8)]

In [None]:
''' The generator is still active '''
next(f)

Generators can provide an enormous advantage in speed and memory usage because their elements are generated on demand, one at a time, and so do not all need to be stored in memory simultaneously.

## Summary

This image from a post on medium.com summarizes the relationship between iterables, iterators, generators, and Python data types that are collections of elements (containers).

![Summary](images/0_IBiuORdbv0CTKuia.png)

https://tonylixu.medium.com/python-yield-iterator-and-generator-introduction-15be182f6135, retrieved 11/22/2021

# Text File Input with Base Python Generators <a name='text_w_gen' /> 

Why read files with a generator?

__Generators can read large files faster and with less memory because the entire file contents need not be held in memory at once, as they are with the base Python statements used so far.__

We will demonstrate the speed advantage in the subsequent code cells using a longer version of the Norfolk weather file, which is 1M lines long: <code>files/NorfolkWeatherLong.csv</code>.  Each line has an integer day index and a floating point temperature. 

<code>%%timeit</code> is a cell magic command that is built in to Jupyter.  It times the execution of the entire cell in which it is placed, running the cell multiple times to do so.

## Typical Base Python Method

... which takes approximately 1 second to read the large data file.

In [None]:
%%timeit 

def ssc_2d(file_name):
    with open(file_name,'r') as f:
        data = [line.strip().split(',') for line in f.readlines()]
        data = [[int(point[0]), float(point[1])] for point in data]
        return data

filename = 'files/NorfolkWeatherLong.csv'
nw1999 = ssc_2d(filename)

## Large File Input with Base Python Generators

The code in the cell below returns a generator that provides access to the text file data.

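It is almost 10X faster than the list-based method above, in part because a generator defers the conversion of each line until that element is requested.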

In [None]:
%%timeit
from itertools import islice

def ssc_2d(filename):
    with open(filename,'r') as f:
        data = (line.strip().split(',') for line in f.readlines())
        data = ((int(point[0]), float(point[1])) for point in data)
    return data

x = ssc_2d('files/NorfolkWeatherLong.csv')
# The data is now available in the generator x

We still haven't demonstrated how to use generators to minimize memory usage, but we will do that shortly.

Also, these execution times are only for the reading of the data.  We must also consider the ultimate task for which we are inputting the data and the total effect on execution time of using generators.  It is possible to use methods for subsequent plotting or computation that squander the speed gain demonstrated above.    Nonetheless, these quick demonstrations hint at the power and value of generators.

Let's consider one use case where we want to plot the first 365 lines of <code>NorfolkWeatherLong.csv</code> (the first year), and let's compare the speed of base Python versus generators.  As previously mentioned, <code>matplotlib</code> will not plot generators, so we need to present the data as a list (<code>numpy</code> arrays and <code>pandas</code> series work as well).

In [None]:
import matplotlib.pyplot as plt

In [None]:
%%timeit

def ssc_2d(data):
    x = []
    y = []
    for i in range(len(data)):
        data[i] = data[i].strip().split(',')
        x.append(int(data[i][0]))
        y.append(float(data[i][1]))
    return x, y

with open('files/NorfolkWeatherLong.csv','r') as f:
    nw1999 = f.readlines()
x, y = ssc_2d(nw1999)

plt.plot(x[:365],y[:365])
plt.show()

In [None]:
%%timeit

from itertools import islice

def ssc_2d(filename):
    with open(filename,'r') as f:
        data = (line.strip().split(',') for line in f.readlines())
        data = ((int(point[0]), float(point[1])) for point in data)
    return data

x = ssc_2d('files/NorfolkWeatherLong.csv')

plt.plot(*zip(*islice(x, 0, 365)))
plt.show()

Generators provide access to the text file data and plot it __6X__ faster!

What do <code>\*zip()</code> and <code>\*islice()</code> do?
- <code>*zip()</code> separates the $x$ and $y$ data, which <code>ssc_2d()</code> provides as one tuple per data point.
- <code>islice()</code> is the slice operator for generators, akin to the <code>my_list[start:stop:step]</code> statement for lists.  Here, <code>islice()</code> selects the first 365 data points.  This is much faster than transforming the entire generator to a list and then slicing.  The accompanying <code>*</code> removes the outer container of the data so <code>zip()</code> can regroup the 'loose' tuples of two elements each.
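
The two operations described in the list above can be tried on a small stand-in data set.  In this sketch the generator <code>points</code> is hypothetical: it yields <code>(day, temperature)</code> tuples with made-up values rather than reading the weather file.

```python
from itertools import islice

# Hypothetical stand-in for the file generator: (day, temperature) tuples
points = ((day, day * 0.5) for day in range(1, 1001))

# islice() takes the first three elements without building the other 997
first_three = list(islice(points, 3))
print(first_three)

# The * unpacks the list into three separate tuples, and zip() regroups
# them by position: all of the x values, then all of the y values
xs, ys = zip(*first_three)
print(xs)
print(ys)

# Only three elements were consumed, so the generator continues at day 4
next_point = next(points)
print(next_point)
```

Because <code>islice()</code> consumes only what it returns, the remainder of the generator is still available afterward.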

Multiple graphs are displayed because <code>%%timeit</code> executes the code multiple times as it profiles its execution time.

In [None]:
from itertools import islice

def ssc_2d(filename):
    with open(filename,'r') as f:
        data = (line.strip().split(',') for line in f.readlines())
        data = ((int(point[0]), float(point[1])) for point in data)
    return data

x = ssc_2d('files/NorfolkWeatherLong.csv')

print([*islice(x, 0, 25)], '\n\n')
print([*zip(*islice(x, 0, 25))])

# Computations on Large Text Files with Generators in One Pass <a name='comp_w_gen' />  

Generators provide a substantial speed advantage and, actually, a feasibility advantage over base Python methods where extremely large data files are concerned.  This is particularly the case when an entire data set cannot be held in memory.  Generators, in this context, can quickly read the data in chunks, thus avoiding the problem of overloading computer memory if all the data were to be loaded at once.  The complication that sometimes arises is that computation methods need to be designed to work with data read in chunks and, if a one-pass computation is desired, to avoid relying on summary statistics that are determined by the entire data set, such as the mean.

The first two examples below can be accomplished easily in one pass.  The third example requires thoughtful planning on how to complete the computation in one pass.

## Moving Average Computation<a name='mv_avg' />  

To compute moving averages with streamed data, we need to amend the approach above where only one value was read at a time.  With an $n$-period moving average, we need to keep track of $n$ numbers at a time while maintaining a moving window with those values in an efficient manner.  We need to consider what are the best data types to:
- Quickly update the moving window data as we stream the data
- Quickly compute the mean

There is not necessarily one data type that will be best for both, so we will need to investigate.  Two possibilities are:
- <code>numpy</code> arrays
  - These have built-in methods for computing means, which are fast
- Python lists
  - Might be fast for this application if used properly (avoiding <code>append</code>), but the mean must be computed using the <code>sum()</code> function, which might be slower than <code>numpy</code>.
  - In the code below, a scheme is used where the oldest data element in an $n$-element list is overwritten with the newest element, before taking the average.
  
Let's use a larger set of 1999 Norfolk weather data: <code>files/NorfolkWeatherLong.csv</code>.

These are the candidate approaches; the three code sets below implement the list-based and generator-based ones:

- Using basic Python without list comprehension
- Using basic Python with list comprehension
- Using <code>numpy</code> 
- Using a Python generator function

In [None]:
def ssc_2d_map(file, convert):
    with open(file, 'r') as f:
        data = f.readlines()
    
    for i in range(len(data)):
        data[i] = data[i].strip().split(',')
        for j, mk_type in convert.items():
            data[i][j] = mk_type(data[i][j])
    return [x[1] for x in data]

With a Python list and a <code>for</code> loop.

In [None]:
import time
import matplotlib.pyplot as plt

period = 10
filepath = 'files/NorfolkWeatherLong.csv'
convert_map = {0:int, 1:float}

start = time.time()

data = ssc_2d_map(filepath, convert_map)
ma = []  # Initialize moving average

for i in range(0, len(data) - period + 1):
    ma.append(sum(data[i:i + period])/period)

plt.plot(ma[:365])
plt.show()

print(f'Execution time: {float(time.time() - start): .2f} seconds')

With list comprehension.

In [None]:
import time
import matplotlib.pyplot as plt

period = 10
filepath = 'files/NorfolkWeatherLong.csv'

start = time.time()

data = ssc_2d_map(filepath, convert_map)
ma = [sum(data[i:i + period])/period for i in range(0, len(data) - period + 1)]

plt.plot(ma[:365])
plt.show()

print(f'Execution time: {float(time.time() - start): .2f} seconds')

With a Python generator.

In [None]:
import time
import matplotlib.pyplot as plt

def read_a_line(filename):
    with open(filename,'r') as f:
        for line in f:
            yield float(line.strip().split(',')[1])

start = time.time()
window = 10
filepath = 'files/NorfolkWeatherLong.csv'
read_temp = read_a_line(filepath)
mv_avg = []

''' Create variable for moving window '''
mv_win = [0 for i in range(window)]

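''' Fill all but the last slot of the window with initial data '''
for i in range(window - 1):
    mv_win[i] = next(read_temp)

''' i keeps its final value (window - 2) after the loop above, so the first
    streamed value fills the last empty slot; each later value overwrites
    the oldest element in the window '''
for temp in read_temp:
    i += 1
    mv_win[i%window] = temp
    mv_avg.append(sum(mv_win)/window)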

plt.plot(mv_avg[:365])
plt.show()

print(f'Execution time: {float(time.time() - start): .2f} seconds')
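
The moving window can also be maintained without index arithmetic.  The sketch below is an alternative to the cell above, not a replacement for it: it uses <code>collections.deque</code> with a running total so that each new value requires one subtraction and one addition rather than a fresh <code>sum()</code> over the window.  The function name <code>moving_average</code> is hypothetical, and a short list stands in for the weather file.

```python
from collections import deque

def moving_average(values, window):
    """Yield the mean of each full window of consecutive values."""
    win = deque(maxlen=window)   # Holds at most `window` recent values
    total = 0.0                  # Running sum of the values in win
    for v in values:
        if len(win) == window:
            total -= win[0]      # Oldest value is about to be pushed out
        win.append(v)
        total += v
        if len(win) == window:
            yield total / window

temps = iter([1, 2, 3, 4, 5])    # Any iterator or generator of numbers works
averages = list(moving_average(temps, 2))
print(averages)
```

Because <code>moving_average()</code> is itself a generator function, it can be chained directly to a generator that reads a file one line at a time.  With floating-point data, a running total can accumulate small rounding differences over a very long stream.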

The downsides and weaknesses of generators are due to entire file contents not being held in memory at any one time:
- Any processing/computations of the data must be done as the data is being streamed
- Any portion of the input data from a generator that is plotted in <code>matplotlib</code> must be converted to a list or similar data structure.  <code>matplotlib</code> cannot plot generators.

The code above reads one line at a time.  It is possible to improve generator performance further: the best-performing generators read files in larger _chunks_.

We could tune this ourselves, for example by using the optional <code>buffering</code> argument of the Python <code>open()</code> statement (a buffer size in bytes) or by reading a fixed number of lines in each iteration.  But the <code>pandas</code> package permits us to read files in _chunks_ and permits us to create a generator to do so, as we demonstrate in the next section.

## NYC Taxi Data <a name='nyc' />  

Let's start investigating generators with a somewhat large data set on New York City Taxis.

[NYC Taxi Cab Data](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data)

This data is about 5.5GB, so it is unlikely to crash your computer, but loading it all into RAM, or attempting to, could cause your computer to continually swap data between the hard drive and memory, which is very slow.

We will investigate several methods for loading the data and computing a frequency histogram for the number of passengers while paying attention to speed and memory consumption.
- Basic Python: reading all of the data at once.
- Basic Python with a generator
- Basic Python with a generator function
- <code>pandas</code>

Please note these methods in the cells that follow:
- The 'f' string used to format the output.
- The <code>dict.get()</code> statement, which returns a default value when a key doesn't already exist in the dictionary.  The effect of <code>fh.get(num_pass, 0)</code> is that it returns zero if the key <code>num_pass</code> doesn't exist; it does not add the key itself.  If the key exists, then this statement simply gets the value associated with the key <code>num_pass</code>.  The surrounding assignment, <code>fh[num_pass] = fh.get(num_pass, 0) + 1</code>, is what creates or updates the dictionary element.
- The so-called walrus operator (:=), which permits an assignment statement within a loop declaration or conditional statement.
- The generator also strips whitespace from the ends of the line of text, including <code>\n</code>, while splitting it at the commas.
- The generator conveniently takes a filename and opens the file within the function rather than needing a filestream passed to it.
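
The counting idiom and the walrus operator noted in the list above can be tried on a few lines of stand-in data.  In this sketch, <code>io.StringIO</code> plays the role of an open file, and the two-column sample (with a made-up <code>fare</code> column) is an assumption for illustration, not the layout of the taxi file.  The walrus operator requires Python 3.8 or later.

```python
import io

# Hypothetical sample standing in for an open text file
sample = io.StringIO("fare,passenger_count\n4.5,1\n16.9,1\n5.7,2\n7.7,1\n")

header = sample.readline().strip().split(',')
fh = {}

# The walrus operator assigns the line and tests it in one expression;
# readline() returns an empty string at the end of the data
while (line := sample.readline()):
    num_pass = int(line.strip().split(',')[1])
    fh[num_pass] = fh.get(num_pass, 0) + 1   # 0 is used when the key is new

print(header[1], fh)

# get() by itself only returns the default; it does not add the key
missing_count = fh.get(6, 0)
print(missing_count, 6 in fh)
```

The assignment <code>fh[num_pass] = ...</code> is the step that adds a new key to the dictionary.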

__This file is too large to include in a GitHub repo and also too large for everybody in class to download at the same time because it would take forever or we'd crash the local network.  So, this will be a demonstration rather than you running the code as well.__ If you do want to test this data later, you can first download the file using the code in the next cell, which will take a while to execute.

In [None]:
import shutil

source_path = r"\\files.campus.wm.edu\jrbrad\public_html\data"
dest_path = r"files"
filename = '\\train.csv'

shutil.copy(source_path + filename, dest_path + filename)

Basic Python reading the entire file at one time to create frequency histogram data.

In [None]:
import time

start = time.time()
file_name = 'files/train.csv'
fh = {}
col_idx = 7

with open(file_name, 'r') as f:
    heading = f.readline().strip().split(',')[col_idx]  # Reads header line, keeps column name
    data = f.readlines()

for pt in data:
    pt = pt.strip().split(',')
    pt = int(pt[col_idx])
    fh[pt] = fh.get(pt, 0) + 1
        
print(heading, fh)
print(f'Execution time: {float(time.time() - start): .2f} seconds')

A filestream created with the <code>open()</code> statement is actually an iterator and so we can use it to read one line at a time, thus reducing memory usage significantly: we never hold more than one line from the text file in memory at a time.

In [None]:
import time

start = time.time()
file_name = 'files/train.csv'
fh = {}
col_idx = 7

with open(file_name, 'r') as f:
    heading = f.readline().strip().split(',')[col_idx]  # Reads header line, keeps column name
    for line in f:
        num_pass = int(line.strip().split(',')[col_idx])
        fh[num_pass] = fh.get(num_pass, 0) + 1
        
print(heading, fh)
print(f'Execution time: {float(time.time() - start): .2f} seconds')

Here's the generator-function approach, using the <code>read_one_line</code> function: it uses a <code>yield</code> statement rather than <code>return</code> and so it is a generator function.

In [None]:
def read_one_line(filename, skip_first=True):
    with open(filename, 'r') as f:
        if skip_first:
            _ = f.readline()
        for line in f:
            yield line.strip().split(',')

In [None]:
import time

start = time.time()
fh = {}
col_idx = 7

file_name = 'files/train.csv'
read_gen = read_one_line(file_name, skip_first=False)
heading = next(read_gen)[col_idx]

for line in read_gen:
    num_pass = int(line[col_idx])
    fh[num_pass] = fh.get(num_pass, 0) + 1
        
print(heading, fh)
print(f'Execution time: {float(time.time() - start): .2f} seconds')

The following cells show an example of how you might try to speed up data input by reading multiple lines with each function call.  The code becomes a bit more complex and, in this case, does not provide much advantage.

In [None]:
from itertools import islice
    
def read_chunks(filepath, chunksize, skip_first):
    with open(filepath, 'r') as f:
        if skip_first:
            _ = f.readline()
        while True:
            lines = islice(f, chunksize)
            lines = [*lines]
            if lines:
                yield (line.strip().split(',') for line in lines)
            else:
                break

In [None]:
import time

start = time.time()
file_name = 'files/train.csv'
read_gen = read_chunks(file_name, 10000, skip_first=True)
fh = {}

for chunk in read_gen:
    for line in chunk:
        num_pass = int(line[7])
        fh[num_pass] = fh.get(num_pass, 0) + 1
        
print(fh)
print(f'Execution time: {float(time.time() - start): .2f} seconds')

The <code>pandas</code> package also provides a method for reading data in chunks.  This requires that we combine the results from multiple <code>value_counts()</code> operations, which create frequency histograms.  In this circumstance, <code>pandas</code> does not speed up the operation, as indicated by the graph below.  In addition, using Python to read one line at a time consumes much less memory than is required for the chunksizes required by <code>pandas</code> for reasonably fast execution.

While packages like <code>pandas</code> can provide some wonderful functionality, in some cases basic Python can be faster.

In [None]:
import pandas as pd
import time

start = time.time()

fh = pd.Series(dtype='int64')
for chunk in pd.read_csv('files/train.csv', chunksize=2000000):
    fh = fh.add(chunk['passenger_count'].value_counts(), fill_value=0)
print(fh)
print(f'Execution time: {float(time.time() - start): .2f} seconds')

Graph of <code>pandas</code> execution time versus chunksize.

In [None]:
import matplotlib.pyplot as plt

p_time = {1000: 63.54459285736084,
 10000: 62.875988245010376,
 100000: 62.655011892318726,
 1000000: 63.9449999332428,
 2000000: 64.10800004005432,
 5000000: 64.95298767089844,
 10000000: 65.6650767326355}

fig,ax = plt.subplots()
x = [x for x in p_time.keys()]
y = [y for y in p_time.values()]
y2 = [62.5 for _ in range(len(x))]
ax.scatter(x, y, c='b', label='pandas')
ax.scatter(x, y2, c='gray', label='Basic Python')
for i in range(len(p_time.keys())-1):
    ax.plot(x[i:i+2], y[i:i+2], c='b')
    ax.plot(x[i:i+2], y2[i:i+2], c='gray')
ax.set_ylabel('Execution Time (sec)')
ax.set_xlabel('Chunksize (lines)')
ax.legend()
plt.show()

__Best Approach?:__ the best approach here is arguably to use the filestream generator directly in your code (or use a function) to access one line at a time.  It is close to the fastest approach and minimizes memory consumption.

## One-pass versus Two-pass Computations <a name='1_2_pass' /> 

Calculations were made while streaming data in both of the preceding examples rather than loading all of the data into memory at once. Finding ways to do computations while streaming data is one way to reduce memory requirements.  For some computations, it is not as easy to do so.

Suppose we are interested in computing the standard deviation of the number of riders.  Having already computed the frequency histogram data we could compute the standard deviation directly.  But, this is not possible if our data were floating-point and we needed to construct _bins_, in which case computing standard deviation from the frequency data would be only an approximation. 

This formula is typically used to compute the variance (from which we compute standard deviation) of a series of $n$ values:

$ \frac{1}{n}\sum_{i=0}^{n-1}{\left( x_i - \bar{x} \right)^2}$.

This computation requires that we know the mean of $x$, $\bar{x}$, prior to summing the squared differences.  So, an intuitive approach would be to do this computation in two passes, the first to compute $\bar{x}$.   Can reasonable estimates be computed in one pass?

In [None]:
import random

n = 10
x = [i for i in range(n)]
x_bar = sum(x) / n
x_var = sum([(z - x_bar)**2 for z in x])/n
print(f'Variance: {x_var: .2f}')

Using a generator, however, presents an issue because computing the <code>sum()</code> of a generator in the code cell below exhausts the generator and so no data remains in <code>x</code> to iterate over when computing <code>x_var</code>.  The result is a variance of 0, which is incorrect.

In [None]:
n = 10
x = (i for i in range(n))
x_bar = sum(x) / n
x_var = sum([(z - x_bar)**2 for z in x])/n
print(f'Variance: {x_var: .2f}')

The generator needs to be reset in order to make this second computation and so using a generator makes this a _two-pass_ computation.  This could still be a viable method given the speed and memory requirements for generators.

In [None]:
n = 10
x = (i for i in range(n))
x_bar = sum(x) / n

x = (i for i in range(n))
x_var = sum([(z - x_bar)**2 for z in x])/n
print(f'Variance: {x_var: .2f}')

If we were to compute the variance of number of passengers in a New York City taxi in this fashion, we therefore need to instantiate the generator twice.

In [None]:
import time

start = time.time()
file_name = 'files/train.csv'

print('Computing mean: ', end='')
num_pass_sum = 0
n = 0
read_gen = read_one_line(file_name, skip_first=True)
for line in read_gen:
    num_pass_sum += int(line[7])
    n += 1
num_pass_mean = num_pass_sum/n
print(f'{float(time.time() - start): .2f} seconds, mean = {num_pass_mean}')
      
print(f'Computing variance: ', end='')
num_pass_var = 0.0
read_gen = read_one_line(file_name, skip_first=True)
for line in read_gen:
    num_pass_var += (int(line[7]) - num_pass_mean)**2
num_pass_var = num_pass_var/n
print(f'{float(time.time() - start): .2f} seconds; Variance = {num_pass_var}')

Another approach is to summarize the data in a frequency histogram and, then, compute variance from that frequency data.

This approach takes only one pass through the data and then computes the variance from a condensed form of the data, which is much faster than another pass through the data.  This gives an exact result for integer data, but would be an approximation with floating-point data.

In [None]:
import time

''' Compute frequency histogram '''
start = time.time()
file_name = 'files/train.csv'
fh = {}
col_idx = 7

with open(file_name, 'r') as f:
    heading = f.readline().strip().split(',')[col_idx]  # Reads header line, keeps column name
    for line in f:
        num_pass = int(line.strip().split(',')[col_idx])
        fh[num_pass] = fh.get(num_pass, 0) + 1

''' Compute variance from frequency data '''
n = sum([v for k,v in fh.items()])
mean = sum([k*v for k,v in fh.items()])/n
var = sum([v/n * (k - mean)**2 for k,v in fh.items()])
print(f'n: {n}; Mean: {mean}; Variance: {var}; Execution time: {float(time.time() - start): .2f} seconds')

A third option uses $Var[X] = {\bf E}[X^2] - {\bf E}[X]^2$ and the generator in a _one-pass_ method, although this approach can sometimes suffer from numerical stability issues.

In [None]:
import time

start = time.time()
file_name = 'files/train.csv'

m_x = 0
m_x_sq = 0
n = 0
read_gen = read_one_line(file_name, skip_first=True)
for line in read_gen:
    d = int(line[7])
    m_x += d
    m_x_sq += d**2
    n += 1
mean = m_x/n
print(f'mean = {mean}; variance = {m_x_sq/n - (mean**2)}; Execution time: {float(time.time() - start): .2f} seconds')
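
The identity used in the cell above follows from expanding the square in the definition of variance.  A short derivation, written for the population variance (dividing by $n$) with $\bar{x} = \frac{1}{n}\sum_{i=0}^{n-1} x_i$:

$\frac{1}{n}\sum_{i=0}^{n-1}\left( x_i - \bar{x} \right)^2 = \frac{1}{n}\sum_{i=0}^{n-1} x_i^2 - 2\bar{x} \cdot \frac{1}{n}\sum_{i=0}^{n-1} x_i + \bar{x}^2 = \frac{1}{n}\sum_{i=0}^{n-1} x_i^2 - \bar{x}^2$

In the code, <code>m_x_sq/n</code> is the first term and <code>mean**2</code> is the second.  Both running sums can be accumulated in the same loop, which is why one pass suffices.  The stability concern arises when the two terms are large and nearly equal, because their difference can then lose precision in floating-point arithmetic.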

Numerically stable _one-pass_ methods for computing variance are also possible.  Don't worry about the details of the computation: the main point is that variance can be computed accurately in one pass.  In case you do want to look at the details, here is a reference:

[https://planetmath.org/onepassalgorithmtocomputesamplevariance](https://planetmath.org/onepassalgorithmtocomputesamplevariance)
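
As an illustration of a numerically stable one-pass method, the sketch below uses the update rule commonly known as Welford's algorithm.  It is offered as an example of the idea and is not necessarily the formulation given in the reference above.  The function name <code>one_pass_variance</code> is hypothetical, and the result is the population variance (dividing by $n$), to match the earlier cells.

```python
def one_pass_variance(values):
    """Return (mean, population variance) from a single pass over values."""
    n = 0
    mean = 0.0
    m2 = 0.0                       # Running sum of squared deviations from the mean
    for x in values:
        n += 1
        delta = x - mean           # Deviation from the mean before the update
        mean += delta / n          # Updated running mean
        m2 += delta * (x - mean)   # Uses deviations before and after the update
    if n == 0:
        raise ValueError('variance requires at least one value')
    return mean, m2 / n

stream = (i for i in range(10))    # A generator: it can be read only once
mean, var = one_pass_variance(stream)
print(f'Mean: {mean: .2f}; Variance: {var: .2f}')
```

The function needs only one pass, so it works directly on a generator, and it avoids subtracting two large sums from each other.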

Lessons:
- Streaming big data is memory-efficient and can reduce execution time
- Making computations with streamed data requires careful consideration of how a calculation is made