# 01 - Iterating Collections

#### Lecture

The following point is a key takeaway of these first few sections, so I have put it at the very top where it's easy to find later. There's no need to worry about it right now if you are reading chronologically:

- An iterable is an object that implements `__iter__` (returns an iterator)
- An iterator is an object that implements `__iter__` (returns itself) and `__next__`

We saw how sequence types support iteration by being able to access elements by index. We could even write our custom sequence types by implementing the `__getitem__` method.

But there are some limitations:

* items must be numerically indexable, with indexing starting at `0`
* cannot be used with unordered collections, such as sets

If we think about iterating over a collection, what we really need is a way to request the **next** item in the collection.

This is like picking marbles out from a bag. The marbles are unlabeled and we want to ensure that we don't pick the same one twice.

If we can do that, our collection does not require being indexable, nor does it need to be ordered (i.e. we don't need the notion of relative positions of elements in the container).

This is exactly what iterables are in general - they provide a method that returns the "next" element in the collection. This approach works equally well with sequence type collections, as well as unordered collection types such as sets.

Of course, the order in which **next** returns items from an unordered collection is not known in advance, and we see that when we iterate over a set, for example:

In [1]:
s = {'x', 'y', 'b', 'c', 'a'}
for item in s:
    print(item)

y
a
c
b
x


As you can see, the order in which the elements of the set were returned did not match the order in which we added them to the set.

Furthermore, we cannot use indexing to access elements in a set:

In [2]:
s[0]

TypeError: 'set' object does not support indexing
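Even though a set cannot be indexed, it can still hand over its elements one at a time. The sketch below is only an illustration: it uses the built-in `iter()` and `next()` functions (explained properly in the sections that follow) to draw items out of a set, much like picking marbles out of a bag. The names `picker` and `picked` are my own.

```python
s = {'x', 'y', 'b', 'c', 'a'}

# Ask the set for an object that can hand out its items one at a time.
picker = iter(s)

picked = []
while True:
    try:
        picked.append(next(picker))  # request the "next" item, no index needed
    except StopIteration:
        break  # nothing left in the bag

# Every element came out exactly once, in an order we do not control.
print(len(picked) == len(s), set(picked) == s)  # True True
```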

Here is what we would like from a general approach to iteration:

- Be able to 'get the next item' in the collection (not necessarily through a sequential index).
- Make the collection finite.
- Allow exhaustion of the iterable via the exception `StopIteration`.
- Keep track of the position somehow to ensure that the same elements aren't returned multiple times.
- Be able to use a `for` loop, comprehension, etc.
- Restart the iteration from the beginning (without having to create a new instance of the object).
- Support the `__next__` special method, which returns an item from the collection.

#### Our Own Implementation

Let's go ahead and define a kind of iterable ourselves. 

What we want is a container-like class that implements the `__next__` method instead of the `__getitem__` method. 

Let's create our own implementation that we can iterate through to generate square numbers. 

- Since we want our collection to be finite, we'll require a specific length. This means we can also implement `__len__`.
- Every time we call `__next__`, it should return the next element in the collection - so we'll have to keep track of where we are in the iteration somehow. We'll do this with `i`.
- Once the items run out, `__next__` should raise the `StopIteration` exception whenever `next` is called.

In [1]:
class Squares:
    def __init__(self, length):
        self.length = length
        self.i = 0
    
    def __next__(self):
        if self.i >= self.length:
            raise StopIteration
        else:
            result = self.i ** 2
            self.i += 1
            return result   
    
    def __len__(self):
        return self.length

Now let's generate some square values:

In [4]:
sq = Squares(5)

while True:
    try: 
        print(next(sq))
    
    except StopIteration:
        break

0
1
4
9
16


Now the object is exhausted. Calling `next(sq)` again will raise the exception:

In [8]:
print(next(sq))

StopIteration: 

But we still cannot iterate through the collection using a `for` loop or a comprehension, so our collection is technically NOT iterable:

In [9]:
sq = Squares(10)
for item in sq:
    print(item)

TypeError: 'Squares' object is not iterable

# 02 - Iterators

In the last lecture we saw that we could approach iterating over a collection using this concept of `next`.

But there were some downsides that we did not resolve (yet!):
* we cannot use a `for` loop
* once we exhaust the iteration (by repeatedly calling `next`), we're essentially done with the object. The only way to iterate through it again is to create a new instance of the object.

First we are going to look at making our `next` approach usable in a `for` loop.

This idea of using `__next__` and the `StopIteration` exception is exactly what Python does.

So, somehow we need to tell Python that the object we are dealing with can be used with `next`. Python knows we have a `__next__` method, but how do we tell Python that it will behave in a way consistent with using a `while` loop to iterate?

In other words, it knows we have `__next__`, but how does it know that we raise `StopIteration` when the items run out?

To do so, we create an `iterator` type object.

**Protocols**

A protocol is simply a fancy way of saying that our class is going to implement certain functionality that Python can count on - it's basically a contract with Python. We'll go into much more detail of protocols in Part 4 of the deep dive. 

**The Iterator Protocol**

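Here, if we tell Python that we're implementing particular methods, it can rely on them to support certain features. In this case, Python will be able to iterate over our object by itself, using the same `next` and `StopIteration` technique as our hand-written `while` loop.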

We're going to implement the **iterator protocol**.

Iterators are objects that implement:
* a `__next__` method
* an `__iter__` method that simply returns the object itself

That's it - that's all there is to an iterator - two methods, `__iter__` and `__next__`.
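As a quick sanity check of the two-method rule, here is a minimal sketch. `Countdown` and `OnlyNext` are hypothetical classes of my own, not part of the lecture. It relies on `collections.abc.Iterator`, whose `isinstance` check looks for both `__iter__` and `__next__`.

```python
from collections.abc import Iterator


class Countdown:
    """Hypothetical iterator that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self  # an iterator simply returns itself

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


class OnlyNext:
    """Has __next__ but no __iter__, so it does not satisfy the protocol."""

    def __next__(self):
        raise StopIteration


countdown_is_iterator = isinstance(Countdown(3), Iterator)
only_next_is_iterator = isinstance(OnlyNext(), Iterator)

print(countdown_is_iterator, only_next_is_iterator)  # True False
print(list(Countdown(3)))  # [3, 2, 1]
```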

**If an object is an iterator, we can use it with `for` loops, comprehensions, etc.** We can also use `enumerate`, `sorted` and other functions because they accept iterators.

How do we make our `Squares` instance an iterator? 

All we have to do is add an `__iter__` method and make it return `self`. That's it.

In [10]:
class Squares:
    def __init__(self, length):
        self.length = length
        self.i = 0
    
    def __next__(self):
        if self.i >= self.length:
            raise StopIteration
        else:
            result = self.i ** 2
            self.i += 1
            return result   
    
    def __iter__(self):
        return self

In [11]:
sq = Squares(5)

for item in sq:
    print(item)

0
1
4
9
16


This iterator cannot be restarted which is an issue, but it is not a requirement of the **iterator protocol**.
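To make the "cannot be restarted" point concrete, here is a small sketch that reuses the `Squares` iterator from above (repeated so the snippet runs on its own) and tries to consume the same instance twice.

```python
class Squares:
    def __init__(self, length):
        self.length = length
        self.i = 0

    def __next__(self):
        if self.i >= self.length:
            raise StopIteration
        result = self.i ** 2
        self.i += 1
        return result

    def __iter__(self):
        return self


sq = Squares(3)
first_pass = list(sq)   # consumes the iterator
second_pass = list(sq)  # nothing left: self.i is already at the end

print(first_pass)   # [0, 1, 4]
print(second_pass)  # []

# The only way to iterate again is to build a fresh instance.
fresh_pass = list(Squares(3))
print(fresh_pass)   # [0, 1, 4]
```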

As we said above, `sorted` will work because `sorted` can take iterators:

In [12]:
sq = Squares(5)

sorted(sq)

[0, 1, 4, 9, 16]
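The same holds for comprehensions, `enumerate` and other functions that consume an iterator. A short sketch, again repeating `Squares` so that it is self-contained; note that each call gets a fresh instance because the iterator is used up after one pass.

```python
class Squares:
    def __init__(self, length):
        self.length = length
        self.i = 0

    def __next__(self):
        if self.i >= self.length:
            raise StopIteration
        result = self.i ** 2
        self.i += 1
        return result

    def __iter__(self):
        return self


# A comprehension drives the iterator just like a for loop does.
even_squares = [item for item in Squares(5) if item % 2 == 0]

# enumerate pairs each returned item with a running counter.
numbered = list(enumerate(Squares(3)))

# sum simply keeps calling next until StopIteration.
total = sum(Squares(4))

print(even_squares)  # [0, 4, 16]
print(numbered)      # [(0, 0), (1, 1), (2, 4)]
print(total)         # 14
```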

#### Iterators vs Iterables

In general iterables can be re-used for iteration (like lists, tuples, ranges, etc) - because they are not the object performing the iteration. 

The iterator does that since it implements `__next__`. However, since iterators also implement `__iter__` they are technically also iterables. 

But, iterables that are not iterators generally do not become exhausted. (We can technically break these conventions by making an iterator with `__iter__` that returns a new iterator instead of itself, but we don't need to worry about this.)
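A built-in list shows the difference nicely. This is just an illustrative sketch with made-up variable names: the list can be looped over repeatedly, while an iterator obtained from it is used up after one pass.

```python
numbers = [10, 20, 30]  # a list is an iterable, but not an iterator

# The list can be iterated over as many times as we like.
first = [n for n in numbers]
second = [n for n in numbers]

# An iterator obtained from the list is consumed by the first pass.
it = iter(numbers)
from_iterator_first = list(it)
from_iterator_second = list(it)

print(first == second)       # True
print(from_iterator_first)   # [10, 20, 30]
print(from_iterator_second)  # []

# The list itself has no __next__; its iterator does.
print(hasattr(numbers, '__next__'), hasattr(it, '__next__'))  # False True
```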

To reiterate,

- An iterable is an object that implements `__iter__` (returns an iterator)
- An iterator is an object that implements `__iter__` (returns itself) and `__next__`

So, iterables that are not iterators DO NOT implement `__next__`.

So when does `__iter__` get called?

It won't be called with our `while True` approach - that only calls `__next__`. Let's add print statements to make this clear.

In [27]:
class Squares:
    def __init__(self, length):
        self.length = length
        self.i = 0
    
    def __next__(self):
        
        print('__next__ called')
        if self.i >= self.length:
            raise StopIteration
        else:
            result = self.i ** 2
            self.i += 1
            return result   
    
    def __iter__(self):
        
        print('__iter__ called')
        return self

In [28]:
sq = Squares(2)

while True:
    try:
        print(next(sq))
    
    except StopIteration:
        break

__next__ called
0
__next__ called
1
__next__ called


But `__iter__` **will** be called at the beginning when a `for` loop (or comprehension) is used.

In [29]:
sq = Squares(2)

for item in sq:
    print(item)

__iter__ called
__next__ called
0
__next__ called
1
__next__ called


So what's actually happening? 

Firstly, let's note:

Python needs a consistent way to get an iterator either from an iterable or from an iterator. 

- If the object is an iterable that is not an iterator (it has no `__next__` implemented), Python calls its `__iter__` method to obtain an iterator. This iterator is a separate object with its own memory address. Python then calls `__next__` on that iterator, not on our original object, until it gets `StopIteration`.

- If the object is an iterator already, then Python doesn't need a new iterator: the object's `__iter__` just returns the object itself. Python then calls `__next__` on it (the same `__next__` defined on the original object) until it gets `StopIteration`.
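The two cases can be checked with the `is` operator. The sketch below uses a plain list as the iterable; the variable names are illustrative only.

```python
numbers = [1, 2, 3]

# Case 1: an iterable that is not an iterator hands back a separate object.
list_iterator = iter(numbers)
print(list_iterator is numbers)  # False

# Case 2: an iterator hands back itself.
same_iterator = iter(list_iterator)
print(same_iterator is list_iterator)  # True

# Asking the iterable again produces yet another new iterator.
another_iterator = iter(numbers)
print(another_iterator is list_iterator)  # False
```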

In [49]:
sq = Squares(5)
sq_iterator = iter(sq)
print(sq_iterator is sq)

__iter__ called
True


Then, Python calls `__next__` on `sq_iterator`. **In this particular case**, `sq` IS `sq_iterator`, so it doesn't matter whether we call it on `sq` or `sq_iterator`. But in general, Python calls `__next__` on the iterator object.

In [None]:
while True:
    try:
        print(next(sq_iterator))
    
    except StopIteration:
        break

# 03 - Iterators and Iterables

Previously we saw that we could create **iterator** objects by simply implementing:

* a `__next__` method that returns the next element in the container
* an `__iter__` method that just returns the object itself (the iterator object)

The drawback is that iterators get **exhausted**, and when this happens, the **iterator is a useless, throwaway object**. This means we have to create a new iterator every time we want to start a new iteration over the collection. Can we somehow avoid having to remember to do that every time?

Let's break down the iterator into two distinct things:

1. The collection (container) of items/elements/marbles in a bag.
2. A method to iterate over the collection.

Why should we have to recreate the collection (1.) just to iterate over them again?

So if we separate our iterator into these two parts, then we should have:

1. A separate iterator object (which will always be a throwaway object - there's no escaping that), **created every time we need to start a fresh iteration**.
2. A collection which is iterable and **only created once**. This will be used to maintain/mutate our data, so it may have `append`, `pop` methods, etc.

In this case, the iterator is responsible for iterating over the collection.

**Example**

Let's look at an example where we break up the iterator into a collection and an iterator object.

Firstly, the unseparated version:

In [1]:
class Cities:
    def __init__(self):
        self._cities = ['Paris', 'Berlin', 'Rome', 'Madrid', 'London']
        self._index = 0
    
    def __iter__(self):
        return self
    
    def __next__(self):
        if self._index >= len(self._cities):
            raise StopIteration
        else:
            item = self._cities[self._index]
            self._index += 1
            return item

Now let's break it up. Remember, our `_cities` list object may contain millions of data points or be pulled from an API which may take a long time so we can see why it would be wasteful to have to create a new instance of `Cities` every time our iterator gets exhausted. 

In [5]:
class Cities:
    def __init__(self):
        self._cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']
        
    def __len__(self):
        return len(self._cities)

And let's create our iterator this way:

In [4]:
# container part
class Cities:
    def __init__(self):
        self._cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']
        
    def __len__(self):
        return len(self._cities)

# iterator part
class CityIterator:
    def __init__(self, city_obj):
        # city_obj is an instance of Cities
        self._city_obj = city_obj
        self._index = 0
        
    def __iter__(self):
        return self
    
    def __next__(self):
        if self._index >= len(self._city_obj):
            raise StopIteration
        else:
            item = self._city_obj._cities[self._index]
            self._index += 1
            return item

1. So now we can create our `Cities` instance **once** to generate a container of all our data elements (perhaps from an API).
2. Then, we create our iterator instance and pass it the **instance** of our collection.
3. Then, we can use the iterator to iterate through our collection of objects using a `for` loop for example. But once the loop terminates, our `city_iterator` is an exhausted, throwaway object. 


In [52]:
cities = Cities()
city_iterator = CityIterator(cities)

for item in city_iterator:
    print(item)

New York
Newark
New Delhi
Newcastle


It would be nice if we didn't have to manually create a new iterator every time. 

This is where the **formal definition of a Python iterable** comes in...

**An iterable is a Python object that implements the iterable protocol. The iterable protocol requires that the object implements a single method: `__iter__`. This method returns a *new instance of the iterator object* which is used to iterate over the iterable.**

#### Making an Iterable

Let's quickly paste in from above our iterator which will be used by our iterable.

In [5]:
class CityIterator:
    def __init__(self, city_obj):
        # city_obj is an instance of Cities
        self._city_obj = city_obj
        self._index = 0
        
    def __iter__(self):
        return self
    
    def __next__(self):
        if self._index >= len(self._city_obj):
            raise StopIteration
        else:
            item = self._city_obj._cities[self._index]
            self._index += 1
            return item

So let's make our `Cities` instance a formal **iterable** by adding `__iter__` which returns a **new instance** of the iterator object.

In [1]:
class Cities:
    def __init__(self):
        self._cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']
        
    def __len__(self):
        return len(self._cities)
    
    def __iter__(self):
        return CityIterator(self)     

Don't let the `self` in `return CityIterator(self)` fool you into thinking that `__iter__` returns `self` and that `cities` must therefore be an iterator.

Here, `self` (our `Cities` instance, `cities`) is only passed as an argument to `CityIterator`. Since we've fulfilled the protocol requirements, calling `iter(cities)` returns a **new instance** of an iterator each time, and `cities` itself has no `__next__` method.

The `CityIterator` instance *is*, however, an **iterator** because it implements **both** `__iter__` and `__next__`. 

As a result, **iterators are themselves iterables, but they're iterables that *can* become exhausted**. 

Iterables like `Cities`, on the other hand, **never become exhausted** because they always return a new iterator that iterates over the original collection.

In any case, calling `iter()` on something will **always return an iterator**. 
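Because each call to `iter()` on an iterable produces its own iterator, several iterations over the same collection can be in progress at once without interfering. A minimal sketch, using a plain list of the city names rather than our `Cities` class:

```python
cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']

first_iter = iter(cities)
second_iter = iter(cities)

# Advance the first iterator twice...
a = next(first_iter)
b = next(first_iter)

# ...the second iterator still starts from the beginning.
c = next(second_iter)

print(a, b, c, sep=' | ')  # New York | Newark | New York
```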

**Chronology**

When iterating over an **iterable**, Python:
- First calls `iter()` on it to obtain an iterator.
- Then iterates over the **iterator** (using `next`, `StopIteration`, etc.).
- After the iteration is complete, if we run another `for` loop over the same iterable, it will work because **iterables never become exhausted**.

Here's proof:

In [7]:
cities = Cities()

for city in cities:
    print(city)

New York
Newark
New Delhi
Newcastle


In [8]:
for city in cities:
    print(city)

New York
Newark
New Delhi
Newcastle


Note that every time we execute a `for` loop, we always start off by calling `__iter__`. Updating `CityIterator` with a print statement in `__iter__`, `__init__` and `__next__`, and `Cities` with a print statement in `__iter__` will make that clear:

In [11]:
class Cities:
    def __init__(self):
        self._cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']
        
    def __len__(self):
        return len(self._cities)
    
    def __iter__(self):
        print('Cities __iter__ called')
        return CityIterator(self)     

In [12]:
class CityIterator:
    def __init__(self, city_obj):
        print('CityIterator object created!')
        # city_obj is an instance of Cities
        self._city_obj = city_obj
        self._index = 0
        
    def __iter__(self):
        print("CityIterator __iter__ called")
        return self
    
    def __next__(self):
        print("CityIterator __next__ called")
        if self._index >= len(self._city_obj):
            raise StopIteration
        else:
            item = self._city_obj._cities[self._index]
            self._index += 1
            return item

Now, let's call `iter()` and then run the `for` loop on the resulting iterator. 

In [18]:
cities = Cities()

city_iter_1 = iter(cities)

Cities __iter__ called
CityIterator object created!


In [19]:
for city in city_iter_1:
    print(city)

CityIterator __iter__ called
CityIterator __next__ called
New York
CityIterator __next__ called
Newark
CityIterator __next__ called
New Delhi
CityIterator __next__ called
Newcastle
CityIterator __next__ called


A `for` loop **always** calls `__iter__` first and expects an iterator to be returned, even when the object is already an iterator, as it is here (Python needs a consistent way of ensuring that it's given an iterator, because only iterators implement `__next__`). Then it calls `__next__` on the returned iterator. 
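Put together, a `for` loop behaves roughly like the hand-written version below. This is only a sketch of the idea, not how the interpreter is literally implemented, and `manual_for` is a hypothetical helper name of my own.

```python
def manual_for(iterable, action):
    """Roughly what `for item in iterable: action(item)` does."""
    iterator = iter(iterable)      # step 1: always ask for an iterator first
    while True:
        try:
            item = next(iterator)  # step 2: repeatedly request the next item
        except StopIteration:
            break                  # step 3: stop quietly once exhausted
        action(item)


# Works with an iterable...
collected = []
manual_for(['New York', 'Newark'], collected.append)
print(collected)  # ['New York', 'Newark']

# ...and with an iterator, because iter() on an iterator returns itself.
collected_again = []
manual_for(iter('ab'), collected_again.append)
print(collected_again)  # ['a', 'b']
```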

#### Final, Neat Iterator-Iterable Solution

To keep things self-contained, we can nest the `CityIterator` class inside `Cities`. Note that `return CityIterator(self)` then becomes `return self.CityIterator(self)`.

In [None]:
class Cities:
    def __init__(self):
        self._cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']
        
    def __len__(self):
        return len(self._cities)
    
    def __iter__(self):
        print('Calling Cities instance __iter__')
        return self.CityIterator(self)
    
    class CityIterator:
        def __init__(self, city_obj):
            # city_obj is an instance of Cities
            print('Calling CityIterator __init__')
            self._city_obj = city_obj
            self._index = 0

        def __iter__(self):
            print('Calling CityIterator instance __iter__')
            return self

        def __next__(self):
            print('Calling __next__')
            if self._index >= len(self._city_obj):
                raise StopIteration
            else:
                item = self._city_obj._cities[self._index]
                self._index += 1
                return item

#### Mixing Iterables and Sequences

`Cities` is an iterable but not a sequence because we haven't implemented `__getitem__`. Recall that if we implement `__getitem__` and use a `for` loop, Python repeatedly calls `__getitem__`. 

But as we've just seen, every time we run a `for` loop, Python calls `__iter__` to get an iterator, and then repeatedly calls `__next__` until the iterator is exhausted.

So which approach does Python take?

Let's take the self-contained class above and add `__getitem__` (delegating the responsibilities of slicing and indexing to the underlying list object of our collection), so that our instance is a **sequence and an iterable**:

In [20]:
class Cities:
    def __init__(self):
        self._cities = ['New York', 'Newark', 'New Delhi', 'Newcastle']
        
    def __len__(self):
        return len(self._cities)
    
    def __iter__(self):
        print('Calling Cities instance __iter__')
        return self.CityIterator(self)
    
    def __getitem__(self, s):
        print('getting item via __getitem__')
        return self._cities[s]
    
    class CityIterator:
        def __init__(self, city_obj):
            # city_obj is an instance of Cities
            print('Calling CityIterator __init__')
            self._city_obj = city_obj
            self._index = 0

        def __iter__(self):
            print('Calling CityIterator instance __iter__')
            return self

        def __next__(self):
            print('Calling __next__')
            if self._index >= len(self._city_obj):
                raise StopIteration
            else:
                item = self._city_obj._cities[self._index]
                self._index += 1
                return item

In [21]:
# create a fresh instance so that the loop uses the class defined just above
cities = Cities()

for city in cities:
    print(city)

Calling Cities instance __iter__
Calling CityIterator __init__
Calling __next__
New York
Calling __next__
Newark
Calling __next__
New Delhi
Calling __next__
Newcastle
Calling __next__


As you can see, **Python prefers the iterator-iterable approach**.

So to sum up, when Python wants to loop over some object using `for`, it first looks for an `__iter__` method. If there isn't one, it falls back to the `__getitem__` method.

Python's list object has both `__iter__` and `__getitem__` implemented, so a `for` loop over a list uses `__iter__`.
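The fallback can be seen with a class that only implements `__getitem__`. `Letters` below is a hypothetical class made up for this sketch: Python calls `__getitem__` with `0`, `1`, `2`, ... and stops at the first `IndexError`.

```python
class Letters:
    """Hypothetical sequence-style class with no __iter__ method."""

    def __init__(self, text):
        self._text = text

    def __getitem__(self, index):
        # Indexing past the end raises IndexError, which ends the iteration.
        return self._text[index]


letters = Letters('abc')

has_iter = hasattr(Letters, '__iter__')
via_getitem = [letter for letter in letters]

print(has_iter)     # False
print(via_getitem)  # ['a', 'b', 'c']
```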

# 04 - Example 1 - Consuming Iterators Manually

It can be useful to manually iterate through an iterator using the `next()` function.

A fairly typical use case for this would be when reading data from a CSV file where you know the first few lines consist of information about the data rather than just the data itself.

Let's try this using a CSV file I have saved alongside the Jupyter notebook.

Let's first load the data and see what it looks like:

In [51]:
with open("../Section 04 - Iterables and Iterators/cars.csv") as file:
    for idx, line in enumerate(file):
        if idx > 4:
            break
        print(line)

Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin

STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;INT;CAT

Chevrolet Chevelle Malibu;18.0;8;307.0;130.0;3504.;12.0;70;US

Buick Skylark 320;15.0;8;350.0;165.0;3693.;11.5;70;US

Plymouth Satellite;18.0;8;318.0;150.0;3436.;11.0;70;US



As we can see, the values are delimited by `;` and the first two lines consist of the column names, and column types.

The reason for the spacing between each line is that each line ends with a newline, and our print statement also emits a newline by default. So we'll have to strip those out.

Here's what we want to do: 
* read the first line to get the column headers and create a named tuple class
* read data types from second line and store this so we can cast the strings we are reading to the correct data type
* read the data rows and parse them into named tuples

As we might expect, when looping through the rows with a plain `for` loop, we'll need `if` statements to catch and deal with the first two rows separately.

In [52]:
with open("../Section 04 - Iterables and Iterators/cars.csv") as file:
    row_index = 0
    for line in file:
        if row_index == 0:
            # header row
            headers = line.strip('\n').split(';')
            print(headers)
        elif row_index == 1:
            # data type row
            data_types = line.strip('\n').split(';')
            print(data_types)
        else:
            # data rows
            data = line.strip('\n').split(';')
            print(data)
        row_index += 1
        
        if row_index == 4:
            break

['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model', 'Origin']
['STRING', 'DOUBLE', 'INT', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'INT', 'CAT']
['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']
['Buick Skylark 320', '15.0', '8', '350.0', '165.0', '3693.', '11.5', '70', 'US']


With the code above, we get each row as a list: the column headers for the first row, the data types for the second row, and the car data for all other rows.

Just to make our goal clear, let's print out the data types and data below one another:

In [53]:
data_types = ['STRING', 'DOUBLE', 'INT', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'INT', 'CAT']
data_row = ['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']

print(data_types)
print(data_row)

['STRING', 'DOUBLE', 'INT', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'DOUBLE', 'INT', 'CAT']
['Chevrolet Chevelle Malibu', '18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', 'US']


We want to apply the data types above to the data row below, element-wise. We can do this with the following function and a list comprehension:

In [54]:
def cast(data_type, value):
    if data_type == 'DOUBLE':
        return float(value)
    elif data_type == 'INT':
        return int(value)
    else:
        return str(value)
    
[cast(data_type, value) for data_type, value in zip(data_types, data_row)]

['Chevrolet Chevelle Malibu', 18.0, 8, 307.0, 130.0, 3504.0, 12.0, 70, 'US']

As you can see, our data is now in the correct type (float values like 18.0 are actual floats and not strings: '18.0').

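Our code became rather messy with chained `if` statements because we had to deal with the first and second rows separately. But if we get an iterator from `file` using `iter()` (file objects are in fact their own iterators, so `iter(file)` simply returns `file`), we can use `next()` to manually advance to the next line. 

This works well because an iterator is consumed as we go and we can't go backwards in it: once `next()` has taken the first two lines, the `for` loop that follows only sees the data rows, which keeps our code clean.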
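Before the full version, here is a stripped-down sketch of the idea. To keep it runnable without the CSV file, it uses an `io.StringIO` object as an in-memory stand-in for the file, holding a shortened copy (three columns only) of the first few lines. That stand-in is my own assumption, not part of the original notebook.

```python
import io

# In-memory stand-in for the first lines of cars.csv (shortened for illustration).
sample = io.StringIO(
    "Car;MPG;Cylinders\n"
    "STRING;DOUBLE;INT\n"
    "Buick Skylark 320;15.0;8\n"
    "Plymouth Satellite;18.0;8\n"
)

file_iter = iter(sample)
print(file_iter is sample)  # True: file-like objects are their own iterators

# Consume the two special rows by hand.
headers = next(file_iter).strip('\n').split(';')
data_types = next(file_iter).strip('\n').split(';')

# The loop picks up where next() left off, so it only sees data rows.
rows = [line.strip('\n').split(';') for line in file_iter]

print(headers)     # ['Car', 'MPG', 'Cylinders']
print(data_types)  # ['STRING', 'DOUBLE', 'INT']
print(len(rows))   # 2
```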

In [57]:
from collections import namedtuple
cars = []

def cast(data_type, value):
    if data_type == 'DOUBLE':
        return float(value)
    elif data_type == 'INT':
        return int(value)
    else:
        return str(value)

def cast_row(data_types, data_row):
    return [cast(data_type, value) for data_type, value in zip(data_types, data_row)]
            

with open("../Section 04 - Iterables and Iterators/cars.csv") as file:
    
    file_iter = iter(file)  
    
    headers = next(file_iter).strip('\n').split(';')      # get 0th row
    Car = namedtuple('Car', headers)
    
    data_types = next(file_iter).strip('\n').split(';')   # get 1st row
    
    for line in file_iter:
        data = line.strip('\n').split(';')
        data = cast_row(data_types, data)
        car = Car(*data)
        cars.append(car)

cars[0]

Car(Car='Chevrolet Chevelle Malibu', MPG=18.0, Cylinders=8, Displacement=307.0, Horsepower=130.0, Weight=3504.0, Acceleration=12.0, Model=70, Origin='US')

This approach is fine, but if we wanted we could clean things up a little bit more. 

You may have noticed that we define an empty `cars` list outside the `with` block and append to it inside the loop. We can replace that with a list comprehension. In the full notes, the code is shortened significantly with two list comprehensions, but it becomes quite unreadable, so I've left it out here.
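For reference, a more modest version that only replaces the `cars.append` loop with a single list comprehension might look like the sketch below. As an assumption for illustration, it reads from an in-memory `io.StringIO` stand-in holding a shortened copy of the data instead of the real `cars.csv`.

```python
import io
from collections import namedtuple


def cast(data_type, value):
    if data_type == 'DOUBLE':
        return float(value)
    elif data_type == 'INT':
        return int(value)
    return str(value)


def cast_row(data_types, data_row):
    return [cast(data_type, value)
            for data_type, value in zip(data_types, data_row)]


# In-memory stand-in for cars.csv (shortened to three columns).
sample = io.StringIO(
    "Car;MPG;Cylinders\n"
    "STRING;DOUBLE;INT\n"
    "Buick Skylark 320;15.0;8\n"
    "Plymouth Satellite;18.0;8\n"
)

file_iter = iter(sample)
headers = next(file_iter).strip('\n').split(';')
Car = namedtuple('Car', headers)
data_types = next(file_iter).strip('\n').split(';')

# The comprehension consumes the remaining (data) rows of the iterator.
cars = [Car(*cast_row(data_types, line.strip('\n').split(';')))
        for line in file_iter]

print(cars[0])  # Car(Car='Buick Skylark 320', MPG=15.0, Cylinders=8)
```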

# 05 - Example 2 - Cyclic Iterators

# 06 - Lazy Iterables

# 07 - Python's Built-In Iterables and Iterators

# 08 - Sorting Iterables

# 09 - The iter() Function

# 10 - Iterating Callables

# 11 - Delegating Iterators

# 12 - Reversed Iteration

# 13 - Caveat Using Iterators for Function Arguments