# Object-Oriented Programming

Let's pretend that there isn't any package for linear algebra such as [numpy.linalg](https://numpy.org/doc/stable/reference/routines.linalg.html), and we'd like to implement vector-like behavior for collections of numbers. Intuitively, a `list` seems to be useful for representing vectors:

In [None]:
v1 = [1, 2]
v2 = [-1, 2]

When multiplying a vector with a scalar (i.e., changing its magnitude), we expect each component of the vector to be multiplied by that scalar. What happens when we use the multiplication operator on a list?

In [None]:
v1 * 2

This is not useful: for a list `l`, the expression `l * n` produces a new list that repeats the elements of `l` `n` times. A similar problem arises with the addition operator `+`, which concatenates the two lists instead of adding their elements pairwise.

In [None]:
v1 + v2
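For comparison, the element-wise behavior we are after can be sketched with plain functions and list comprehensions. The helper names `scale` and `add` are hypothetical and only serve as an illustration:

```python
def scale(vector, scalar):
    """Multiply every component of the vector by the scalar."""
    return [component * scalar for component in vector]


def add(left, right):
    """Add two vectors of equal length component by component."""
    if len(left) != len(right):
        raise ValueError('vectors must have the same number of dimensions')
    return [a + b for a, b in zip(left, right)]


v1 = [1, 2]
v2 = [-1, 2]
scaled = scale(v1, 2)   # [2, 4]
summed = add(v1, v2)    # [0, 4]
```

This works, but the functions live apart from the data they operate on, and `v1 * 2` still repeats the list.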

What we want is to be able to define custom multiplication or addition behavior that is intimately related to a vector and its components (its data). In Object-Oriented Programming, a _class_ is a construct for exactly this purpose.

## Classes = Data + Behavior

Classes are defined using the `class` keyword. When "calling" a class (using parentheses, just like calling a function), a new _object_ is created with the type of that class.

In [None]:
class Foo:
    pass

In [None]:
f = Foo()
type(f)

As seen before, every object has _attributes_:

In [None]:
dir(f)

We can assign (new) attributes to objects using the `.` operator:

In [None]:
f.x = 42
f.x

It is also possible to assign _functions_ to a class. Note the subtle change in type when referring to the function itself (or as an attribute of the class) vs. when referring to it as an attribute of the _object_ `f`:

In [None]:
def bar(foo_obj):
    return foo_obj.x * 2

Foo.bar = bar
type(bar), type(Foo.bar), type(f.bar)

In [None]:
id(bar), id(Foo.bar), id(f.bar)

The "magic" that happens here is that when a function is an attribute of a class, it becomes a _(bound) method_ of objects of that class. When a method is bound to an object, Python passes the object itself as the first argument when calling the method:

In [None]:
bar(f)

In [None]:
f.bar()  # No arguments are passed to bar! 
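To make the binding visible, a bound method exposes the function it wraps and the object it is bound to through its `__func__` and `__self__` attributes. A minimal sketch that repeats the `Foo` and `bar` definitions from above so that it runs on its own:

```python
class Foo:
    pass


def bar(foo_obj):
    return foo_obj.x * 2


Foo.bar = bar
f = Foo()
f.x = 42

bound = f.bar
# Looked up on the class, the attribute is still the plain function.
same_function = Foo.bar is bar
# The bound method wraps the function and the object it is bound to.
wraps_function = bound.__func__ is bar
wraps_object = bound.__self__ is f
# Calling the bound method is equivalent to calling bar(f).
result = bound()
```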

It is not common to manually add functions as attributes to a class as in the example above. Instead, we usually define functions in the class body. It is a convention to use `self` as the name of the first argument that captures the object to which the method is bound.

In [None]:
class Point2D:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def distance_to_origin(self):
        return (self.x ** 2 + self.y ** 2) ** .5

That strange-looking `__init__()` function is one of the most commonly used ["magic methods"](https://rszalski.github.io/magicmethods/) (also called "dunder methods" for their "__d__ouble __under__score" naming convention). Whenever we construct a new `Point2D` object, Python will look for the object's `__init__()` method and pass through the arguments provided to the constructor:

In [None]:
p1 = Point2D(3, 4)

In [None]:
p1.x

In [None]:
p1.distance_to_origin()

Having several `Point2D` objects, what happens when we try to check if they're equal?

In [None]:
p2 = Point2D(3, 4)

In [None]:
p2 is p1

In [None]:
p2 == p1

__Question__: Does the result of value equality surprise you? Why would the two identical points be considered unequal? How should Python know what makes two objects of the same type equal by value?

Their (human-readable) representation is also not entirely clear:

In [None]:
p1

Just as with `__init__()`, there are magic methods to define equality and representation of an object:

In [None]:
class Point2D:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def distance_to_origin(self):
        return (self.x ** 2 + self.y ** 2) ** .5
    
    def __eq__(self, other):
        if isinstance(other, Point2D):
            return self.x == other.x and self.y == other.y
        else:
            return False
        
    def __repr__(self):
        return f'{self.__class__.__name__}({self.x}, {self.y})'

In [None]:
p1 = Point2D(3, 4)
p2 = Point2D(3, 4)

The `__repr__()` method is automatically called when this notebook is trying to represent any of our `Point2D` objects:

In [None]:
p1

Whenever Python encounters an expression `obj1 == obj2`, it looks for an `__eq__()` method on `obj1` and calls it with `obj2` as its argument. Without a custom `__eq__()`, Python falls back to comparing object identity, which is why `p2 == p1` was `False` before. There are similar magic methods for comparison that are triggered by operators such as `<`, `>=`, etc. These are called _rich comparison methods_, and the ordering comparisons are _not_ implemented by default!

In [None]:
p1 == p2
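As an illustration of the other rich comparison methods, the sketch below defines only `__lt__()`. The class name `OrderedPoint2D` and the choice to order points by their distance to the origin are assumptions made for this example:

```python
class OrderedPoint2D:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def distance_to_origin(self):
        return (self.x ** 2 + self.y ** 2) ** .5

    def __lt__(self, other):
        # Returning NotImplemented lets Python try the other operand,
        # or raise a TypeError if the comparison is unsupported.
        if not isinstance(other, OrderedPoint2D):
            return NotImplemented
        return self.distance_to_origin() < other.distance_to_origin()


near = OrderedPoint2D(1, 1)
far = OrderedPoint2D(3, 4)

is_nearer = near < far        # calls near.__lt__(far)
is_farther = far > near       # falls back to the reflected near.__lt__(far)
closest = min([far, near])    # min() only needs the < operator

try:
    near <= far               # __le__() was never defined
    le_supported = True
except TypeError:
    le_supported = False
```

Each operator has its own method, so defining `<` does not automatically provide `<=`.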

To prevent lots of repetitive code for initialization, comparison, and representation, modern Python (since version 3.7) has [_data classes_](https://docs.python.org/3/library/dataclasses.html). These provide sensible default implementations for object initialization, representation, and equality checks.

To make a data class, we just need to decorate a class with the `@dataclass` decorator (decorators will be explained later).

In [None]:
from dataclasses import dataclass

In [None]:
@dataclass
class Point2D:
    x: float
    y: float
    
    def distance_to_origin(self) -> float:
        return (self.x ** 2 + self.y ** 2) ** .5

In [None]:
p1 = Point2D(3, 4)
p2 = Point2D(3, 4)
p1.x

In [None]:
p1

In [None]:
p1 == p2

Data classes require that their member attributes (in our case `x` and `y`) have their types specified using _type annotations_.

## Type Hints

A full discussion of [type hints](https://docs.python.org/3/library/typing.html) is beyond the scope of this tutorial. Since Python 3.5, it is possible to indicate the types of function arguments and return values; since Python 3.6, variables and class members can be annotated as well.

In [None]:
def add_one(x: int) -> int:
    return x + 1

In [None]:
add_one(42)

In [None]:
add_one('foo')

Providing type hints is optional in most cases, but required in case of data classes!
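Note that Python itself does not enforce type hints when the code runs. They are stored as metadata and are mainly useful for readers, editors, and external static type checkers. A small sketch that repeats `add_one()` from above:

```python
def add_one(x: int) -> int:
    return x + 1


# The annotations are stored on the function but not checked on a call.
hints = add_one.__annotations__

# A float is accepted even though the hint says int.
result = add_one(1.5)

# A str fails only because str + int is unsupported, not because of the hint.
try:
    add_one('foo')
    failed = False
except TypeError:
    failed = True
```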

__Exercise__: Create a class `Vector` that represents a vector in $n$ dimensions. Think about how to represent the vector's components. Should these be separate attributes? Or a container type such as a list? (See the [specs for generic container types](https://docs.python.org/3/library/typing.html#generic-concrete-collections).)

Similar to the `Point2D` example above, implement a method to calculate the vector's [norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm). Use the assertions below to verify the correctness of your solution (note that we expect vectors of any number of dimensions).

In [None]:
# Your Solution:

In [None]:
# %load 'solutions/vector_norm.py'

In [None]:
assert Vector([3, 4]).norm() == 5
assert Vector([3, 4, 5]).norm() == 50 ** .5

## Composition

Objects can be composed of (multiple) other objects, just as the built-in container types discussed before.

In [None]:
import math

A circle has a point in space which is its center:

In [None]:
@dataclass
class Circle:
    center: Point2D
    radius: float
    
    def circumference(self):
        return 2 * math.pi * self.radius

When we add a method to our `Point2D` class to compute the distance to another point, we can easily find out if a given point is within a circle:

In [None]:
@dataclass
class Point2D:
    x: float
    y: float
    
    def distance_to_origin(self) -> float:
        return (self.x ** 2 + self.y ** 2) ** .5
    
    def distance_from(self, other: 'Point2D') -> float:
        return ((self.x - other.x) ** 2 + (self.y - other.y) ** 2) ** .5

@dataclass
class Circle:
    center: Point2D
    radius: float
    
    def circumference(self) -> float:
        return 2 * math.pi * self.radius
    
    def __contains__(self, point: Point2D) -> bool:
        return self.center.distance_from(point) <= self.radius

Using magic methods such as `__contains__()` allows us to write surprisingly elegant code:

In [None]:
c = Circle(center=Point2D(3, 4), radius=1)
Point2D(3.5, 4) in c, Point2D(5, 4) in c, Point2D(5, 4) not in c

## Composition of a Pandas DataFrame

In [None]:
import pandas as pd

Just as we created a `Point2D` object by "calling" the class, we can create a DataFrame:

In [None]:
transaction_df = pd.DataFrame({
    'amount': [42., 100., 999.],
    'from': ['bob', 'alice', 'bob'],
    'to': ['alice', 'bob', 'alice']
})
transaction_df

As can be seen, we created `transaction_df` by providing the constructor a dict having column names as keys and lists of column data as values. There are several other ways to [construct a DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [None]:
type(transaction_df)

Let's explore our DataFrame by inspecting some of its attributes:

In [None]:
transaction_df.values

In [None]:
type(transaction_df.values)

In [None]:
transaction_df.columns, type(transaction_df.columns)

In [None]:
transaction_df.amount

In [None]:
type(transaction_df.amount)

We can also look up columns of the DataFrame using the `[]` notation, just as with dicts.

In [None]:
transaction_df['amount']

In [None]:
type(transaction_df['amount'])

... or look up individual values

In [None]:
transaction_df['amount'][0], type(transaction_df['amount'][0])

## Index-based (or Label-based) Selection and Assignment

In [None]:
transaction_df = pd.DataFrame({
    'amount': [42., 100., 999.],
    'from': ['bob', 'alice', 'bob'],
    'to': ['alice', 'bob', 'alice']
})
transaction_df

Every DataFrame has an _index_, which defaults to an integer sequence starting at 0:

In [None]:
transaction_df.index

Let's start investigating `.loc`:

In [None]:
type(transaction_df.loc)

It is an attribute of our DataFrame, but not a method. We can provide single keys from the DataFrame's index to look up a particular row:

In [None]:
transaction_df.loc[1]

... or provide a list of keys to retrieve multiple rows:

In [None]:
transaction_df.loc[[0, 2]]

To understand how these lookups with the `[]` notation work, consider the [`__getitem__()` magic method](https://docs.python.org/3/reference/datamodel.html#object.__getitem__):

In [None]:
@dataclass
class FooIndexer:
    container: list
    
    def __getitem__(self, label):
        return self.container[label]

In [None]:
FooIndexer([1, 2, 3])[1]

__Question:__ What should happen if we want to look up members of the container by custom index labels?
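One possible answer, as a rough sketch: keep a separate list of labels and translate a label into a position before looking up the value. The class `LabelIndexer` and its `labels` attribute are hypothetical and are not how pandas implements `loc[]`:

```python
from dataclasses import dataclass


@dataclass
class LabelIndexer:
    container: list
    labels: list

    def __getitem__(self, label):
        # Translate the label into a position, then look up the value.
        try:
            position = self.labels.index(label)
        except ValueError:
            raise KeyError(label) from None
        return self.container[position]


indexer = LabelIndexer(container=['a', 'b', 'c'], labels=[2, 4, 6])
value = indexer[4]

try:
    indexer[1]  # 1 is a valid position, but not a label
    missing_raises = False
except KeyError:
    missing_raises = True
```

Here `list.index()` scans the labels one by one. A dict that maps labels to positions would make the lookup much faster for large containers.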

We can add a custom index when constructing a DataFrame:

In [None]:
transaction_df = pd.DataFrame({
    'amount': [42., 100., 999.],
    'from': ['bob', 'alice', 'bob'],
    'to': ['alice', 'bob', 'alice']
}, index=[2, 4, 6])
transaction_df

In [None]:
transaction_df.index

`loc[]` will raise a `KeyError` when looking up a key that's not in the index:

In [None]:
transaction_df.loc[1]

Just as for DataFrames, there's also a constructor for Series (which are representations of the columns or rows of a DataFrame):

In [None]:
transaction_messages = pd.Series(['foo', 'bar', 'baz'])
transaction_messages

Using `assign()`, we add the transaction messages as a new column to the DataFrame:

In [None]:
transaction_df.assign(message=transaction_messages)

Is this the expected result? What went wrong? How can the situation be corrected such that the messages correspond to the existing transactions in our DataFrame?

In [None]:
transaction_messages = pd.Series(['foo', 'bar', 'baz'], index=[2, 4, 6])
transaction_df.assign(message=transaction_messages)
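The underlying reason is that pandas aligns a Series with the DataFrame on index _labels_, not on position. The sketch below rebuilds a reduced version of the example (only the `amount` column) and also shows an alternative fix, assuming the values are already in the right order: passing a plain NumPy array, which is assigned by position.

```python
import pandas as pd

df = pd.DataFrame({'amount': [42., 100., 999.]}, index=[2, 4, 6])
messages = pd.Series(['foo', 'bar', 'baz'])  # default index 0, 1, 2

# Only label 2 occurs in both indexes, so only that row receives a value.
misaligned = df.assign(message=messages)

# An array carries no index, so its values are assigned by position.
by_position = df.assign(message=messages.to_numpy())
```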

In some cases, it is useful to reset the index, i.e., to change the existing index into an integer sequence starting at 0:

In [None]:
transaction_df.reset_index()

Note that the original index is added to the DataFrame as a column named `index`.

In [None]:
transaction_df = pd.DataFrame({
    'amount': [42., 100., 999.],
    'from': ['bob', 'alice', 'bob'],
    'to': ['alice', 'bob', 'alice'],
    'tx_id': [101, 201, 301]
})
transaction_df

Conversely, we can take one column and make it the DataFrame's index:

In [None]:
transaction_df.set_index('tx_id')

Why do we bother with all this indexing? The following statements demonstrate the speed differences in searching a DataFrame for particular values:

In [None]:
import numpy as np

In [None]:
df_size = 100_000

foo_df = pd.DataFrame({
    'a': np.arange(df_size),
    'b': np.random.permutation(df_size)
})
foo_df

In [None]:
%%timeit
for n in np.random.choice(df_size, size=10):
    foo_df.loc[lambda df: df['b'] == n]

In [None]:
%timeit foo_df.loc[lambda df: df['b'] == 42]

In [None]:
idx_foo_df = foo_df.set_index('b')

In [None]:
%timeit idx_foo_df = foo_df.set_index('b')

In [None]:
%%timeit
for n in np.random.choice(df_size, size=10):
    idx_foo_df.loc[n]

In [None]:
%timeit idx_foo_df.loc[42]

## Boolean-based Selection

In [None]:
transaction_df

Besides index labels, we can also provide boolean sequences, or _masks_, to `loc[]`:

In [None]:
transaction_df.loc[[True, False, True]]

In [None]:
transaction_df['amount'] > 100

In [None]:
transaction_df.loc[transaction_df['amount'] > 100]
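Masks can be combined with the element-wise operators `&` (and), `|` (or), and `~` (not). A short sketch on a copy of the transaction data; the variable names are chosen for this example only:

```python
import pandas as pd

df = pd.DataFrame({
    'amount': [42., 100., 999.],
    'from': ['bob', 'alice', 'bob'],
    'to': ['alice', 'bob', 'alice']
})

# Parentheses are needed because & and | bind more tightly than comparisons.
large_to_alice = df.loc[(df['amount'] > 50) & (df['to'] == 'alice')]
small_or_to_bob = df.loc[(df['amount'] < 50) | (df['to'] == 'bob')]
not_from_bob = df.loc[~(df['from'] == 'bob')]
```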

## Inheritance

In addition to our `Circle` class, we add a class `Square` to our domain:

In [None]:
@dataclass
class Square:
    center: Point2D
    side_length: float
    
    def circumference(self) -> float:
        return 4 * self.side_length
    
    def __contains__(self, point: Point2D) -> bool:
        return (
            point.x <= self.center.x + self.side_length / 2 and
            point.x >= self.center.x - self.side_length / 2 and
            point.y <= self.center.y + self.side_length / 2 and
            point.y >= self.center.y - self.side_length / 2
        )

This introduces a few problems. Both `Circle` and `Square` have a `center` attribute which is replicated. And imagine that both classes would share a common implementation of a method, e.g. `distance_to_origin()` that returns the distance from the center of the shape to the origin of the space. That would create additional copy/pasting of code, which is undesirable.

A solution for these issues is _inheritance_. Besides having objects composed of other objects, a class can _inherit_ data and behavior from a _parent_ class.

In [None]:
from abc import ABC, abstractmethod

All the common data and methods for circles, squares (and other shapes we might add in the future) can be defined in an (abstract) base class:

In [None]:
@dataclass
class Shape2D(ABC):
    center: Point2D
    
    @abstractmethod
    def circumference(self) -> float:
        pass
    
    @abstractmethod
    def __contains__(self, point: Point2D) -> bool:
        pass
    
    def distance_to_origin(self) -> float:
        return self.center.distance_to_origin()

Note the parentheses in the class definition: they mean that the class `Shape2D` inherits from class `ABC` (which we imported and stands for Abstract Base Class). Also note the `@abstractmethod` decorators, which indicate that these methods have no implementation in our `Shape2D` class.

In [None]:
s = Shape2D()
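Running the cell above is expected to raise a `TypeError`. The following self-contained sketch shows the same mechanism with small stand-in classes (`Base`, `Partial`, `Complete`, and the helper `can_instantiate()` are hypothetical names):

```python
from abc import ABC, abstractmethod


class Base(ABC):
    @abstractmethod
    def circumference(self) -> float:
        pass


class Partial(Base):
    pass  # does not override circumference()


class Complete(Base):
    def circumference(self) -> float:
        return 1.0


def can_instantiate(cls) -> bool:
    """Return True if calling cls() succeeds, False on a TypeError."""
    try:
        cls()
        return True
    except TypeError:
        return False


results = [can_instantiate(cls) for cls in (Base, Partial, Complete)]
```

Only the class that overrides every abstract method can be instantiated.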

Inheritance from `ABC` makes sure we cannot (accidentally) construct an object from an abstract base class. We can now define our `Circle` class as follows. Note the absence of the `center` attribute and the `distance_to_origin()` method. These are inherited from the base class.

In [None]:
@dataclass
class Circle(Shape2D):
    radius: float
    
    def circumference(self) -> float:
        return 2 * math.pi * self.radius

c = Circle(Point2D(3, 4), 1)

We still cannot construct an object of type `Circle`, because it doesn't implement the abstract method `__contains__()` from its base class. The `@abstractmethod` decorator makes sure that we don't accidentally create an object with an unimplemented method.

In [None]:
@dataclass
class Circle(Shape2D):
    radius: float
    
    def circumference(self) -> float:
        return 2 * math.pi * self.radius
    
    def __contains__(self, point: Point2D) -> bool:
        return self.center.distance_from(point) <= self.radius

@dataclass
class Square(Shape2D):
    side_length: float
    
    def circumference(self) -> float:
        return 4 * self.side_length
    
    def __contains__(self, point: Point2D) -> bool:
        return (
            point.x <= self.center.x + self.side_length / 2 and
            point.x >= self.center.x - self.side_length / 2 and
            point.y <= self.center.y + self.side_length / 2 and
            point.y >= self.center.y - self.side_length / 2
        )

It is not _required_ that parent or base classes inherit from `ABC`, or explicitly define abstract methods using the `@abstractmethod` decorator. However, these are helpful to make our intent clear and prevent some mistakes.
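To illustrate the kind of mistake this prevents, here is a sketch of a base class that does not use `ABC`. Raising `NotImplementedError` is a common convention in that case; the class names `PlainShape` and `IncompleteShape` are hypothetical:

```python
class PlainShape:
    def circumference(self):
        # Convention without ABC: signal that subclasses must override this.
        raise NotImplementedError


class IncompleteShape(PlainShape):
    pass


# Construction succeeds, because nothing checks for missing methods.
shape = IncompleteShape()

# The mistake only surfaces when the method is called.
try:
    shape.circumference()
    failed_late = False
except NotImplementedError:
    failed_late = True
```

With `ABC` and `@abstractmethod`, the same mistake is reported earlier, at construction time.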

Using the object hierarchy of shapes, we can now easily define a function that sums the circumference of all shapes in an Iterable, regardless of their exact type:

In [None]:
def total_size(shapes: list[Shape2D]) -> float:
    return sum(s.circumference() for s in shapes)

In [None]:
total_size([
    Circle(Point2D(3, 4), 1),
    Square(Point2D(0, 0), 2)
])

## Iterating

The `__iter__()` and `__next__()` magic methods allow us to turn any object into an _Iterator_:

In [None]:
@dataclass
class IntRange:
    upper_bound: int
    
    def __iter__(self):
        self.i = 0
        return self
    
    def __next__(self):
        if self.i >= self.upper_bound:
            raise StopIteration
            
        current_value = self.i
        self.i += 1
        return current_value

These magic methods are called when the object is passed to the built-in functions `iter()` and `next()`.

In [None]:
r = IntRange(3)
r_iter = iter(r)
next(r_iter), next(r_iter)

In [None]:
next(r_iter)

`next()` returns the next element of the sequence represented by the Iterator, while `iter()` turns the object passed to it into the actual Iterator (and performs any necessary initialization). When there are no more elements to return, `next()` raises a `StopIteration` exception. (Exception handling is not explained in this tutorial; the appendix contains references to good tutorials.)

In [None]:
r_iter = iter(IntRange(3))
while True:
    try:
        print(next(r_iter))
    except StopIteration:
        print('Finished iterating!')
        break

Of course, we don't need to explicitly call the `iter()` and `next()` functions. Any object that implements an `__iter__()` method is called an _Iterable_. Whenever we use an Iterable in a `for .. in` loop, Python automatically creates the Iterator for us and repeatedly calls its `__next__()` method until it raises the `StopIteration` exception:

In [None]:
for i in IntRange(3):
    print(i * 2)

In [None]:
[i ** 2 for i in IntRange(3)]
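One caveat of the `IntRange` design above: `__iter__()` returns the object itself, so all loops over the same object share a single counter. A possible alternative, sketched here under the assumption that generator functions (the `yield` keyword, not covered in this module) are acceptable, is to let `__iter__()` create a fresh Iterator on every call. The class name `IntRangeIterable` is hypothetical:

```python
from dataclasses import dataclass


@dataclass
class IntRangeIterable:
    upper_bound: int

    def __iter__(self):
        # A generator function: each call returns a new, independent Iterator.
        i = 0
        while i < self.upper_bound:
            yield i
            i += 1


r = IntRangeIterable(2)
# Both loops iterate over the same object without disturbing each other.
pairs = [(a, b) for a in r for b in r]
```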

For your information, Python has a built-in [range](https://docs.python.org/3/library/stdtypes.html#ranges) type for purposes such as the above.

__Question__: What is the benefit (in terms of memory usage) of using Iterators?

__Exercise__: Another use of Iterators is to group "common" objects in a sequence and inspect these groups (or compute some aggregate). Create a `ShapeGrouper` Iterator that groups its shapes by their type. You can test your solution with the assertions below.

_Hint_: You can get the class name of an object `o` by `o.__class__.__name__`.

In [None]:
# Your solution:

In [None]:
# %load solutions/shape_grouper_efficient.py

In [None]:
grouper = ShapeGrouper([
    Square(Point2D(1, 1), 1),
    Circle(Point2D(3, 4), 1),
    Square(Point2D(0, 0), 2)
])

In [None]:
group_iter = iter(grouper)
shape_type, shape_list = next(group_iter)
assert shape_type == 'Square'
assert shape_list == [Square(Point2D(1, 1), 1), Square(Point2D(0, 0), 2)]
shape_type, shape_list = next(group_iter)
assert shape_type == 'Circle'
assert shape_list == [Circle(Point2D(3, 4), 1)]

__Bonus Exercise__: What happens if our list of shapes is long, having many unique shape types? Can we make our implementation more efficient by using a dictionary for quickly looking up all shapes of a given type? 

_Hint_: [`collections.defaultdict()`](https://docs.python.org/3/library/collections.html#collections.defaultdict) from the standard library may be particularly useful in this case.
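As a generic illustration of how `defaultdict` works (deliberately unrelated to shapes, so the exercise stays open), the sketch below groups words by their first letter:

```python
from collections import defaultdict

words = ['apple', 'banana', 'avocado', 'blueberry', 'cherry']

# Missing keys are created on first access, with an empty list as value.
by_first_letter = defaultdict(list)
for word in words:
    by_first_letter[word[0]].append(word)

groups = dict(by_first_letter)
```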

__Bonus Exercise 2__: Modify the grouper such that it can compute the total size per shape type as below:

In [None]:
# bonus
assert grouper.total_size() == {'Square': 12, 'Circle': 6.283185307179586}

In [None]:
for shape_type, shape_list in grouper:
    print(f'{shape_type}: {shape_list}')

How does this relate to Pandas' `groupby()` method?

In [None]:
transaction_df

When we group by the recipient of the transaction, we get a `DataFrameGroupBy` object:

In [None]:
type(transaction_df.groupby('to'))

... which is just another Iterable:

In [None]:
g = transaction_df.groupby('to')
g_iter = iter(g)

In [None]:
next(g_iter)

In [None]:
for receiver, receiver_transaction_df in transaction_df.groupby('to'):
    print(f'{receiver} got a total amount of {receiver_transaction_df["amount"].sum()}')

In [None]:
transaction_df.groupby('to').sum()

## Operator (or Method) Chaining

__Exercise__: Create a class `Vector` that implements the addition and multiplication behavior as given at the top of this module. Use the assertions below to verify the correctness of your solution (note that we expect vectors of any number of dimensions).

_Hints_: 

1. There are [some magic methods](https://docs.python.org/3/reference/datamodel.html#emulating-numeric-types) to provide an implementation of numeric operators on custom classes.
2. For the addition of vectors, [`zip()`](https://docs.python.org/3/library/functions.html#zip) can be a useful built-in function.
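To show the mechanics of both hints without giving away the solution, here is a small sketch on an unrelated, hypothetical `Money` class, followed by a `zip()` call:

```python
from dataclasses import dataclass


@dataclass
class Money:
    cents: int

    def __add__(self, other):
        # Called for money + other.
        if not isinstance(other, Money):
            return NotImplemented
        return Money(self.cents + other.cents)

    def __mul__(self, factor):
        # Called for money * factor.
        if not isinstance(factor, int):
            return NotImplemented
        return Money(self.cents * factor)


# Operator precedence still applies: the multiplication happens first.
total = Money(150) + Money(250) * 2

# zip() pairs up the elements of two sequences position by position.
pairs = list(zip([1, 2, 3], [10, 20, 30]))
```

Note that `2 * Money(250)` would additionally require a `__rmul__()` method.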

In [None]:
# Your solution:

In [None]:
# %load solutions/vector_basic.py

In [None]:
v1 = Vector([1., 2.])
v2 = Vector([2., 4.])
v3 = Vector([3.5, 4.5])
v4 = Vector([1, 2, 3])

assert v1 + v1 == v2
assert v1 * 2 == v2
assert v1 * 2 + v3 == Vector([5.5, 8.5])
assert v1 + v3 * 2 == Vector([8, 11])
assert v4 + v4 == Vector([2, 4, 6])

_Reflection_: Why is it convenient that our addition and multiplication methods return (new) `Vector` objects?

__Bonus Exercise__: Add methods to the `Vector` class such that it (1) also implements a lookup by dimension just as we index a list using `[]`, and (2) it can return its number of dimensions using the builtin function `len()`.

_Hint_: There are [some magic methods](https://docs.python.org/3/reference/datamodel.html#emulating-container-types) for implementing container-like behavior for custom classes.

In [None]:
# Your solution:

In [None]:
# %load solutions/vector_as_container.py

In [None]:
v5 = Vector([42, 99])
assert v5[0] == 42
assert len(v5) == 2
assert (v5 + Vector([1, 1]) * 2)[1] == 101

_Reflection_: looking at the snippet below, what is the object that the `assign` method is bound to? What is its type? Does it have a name?

In [None]:
(
    transaction_df
    .loc[lambda df: df['to'] == 'alice']
    .assign(amount=lambda df: df['amount'] * 2)
)