<center><img src=img/MScAI_brand.png width=70%></center>

# Scikit-Learn and OOP: Exercises and Solutions

In [1]:
import doctest

In [2]:
class C:
    def __init__(self, data=17):
        self.data = data
    def __repr__(self):
        return f"C({self.data})"
    def __lt__(self, other):
        return self.data < other.data

### Exercise

Run this code and explain the result:

In [3]:
C(17) <= C(18)

TypeError: '<=' not supported between instances of 'C' and 'C'

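This raises a `TypeError` because `C` defines `__lt__` but not `__le__`. Python does not derive `<=` from `<` and `==` automatically, so each comparison operator needs its own method.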

### Exercise

Run this code and explain the result. **Hint**: the special `id` function in Python gets the **location** of the object in memory.

In [4]:
C(17) == C(17) 

False



```python
C(17) == C(17)
```

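is `False` because `C` does not define `__eq__`, so it inherits the default from `object`, which compares identity: two objects are equal only if they are the same object (the same `id`). Each call to `C(17)` creates a new instance, so the two operands are distinct objects.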

In [5]:
id(C(17))

2642756520256

In [6]:
id(C(17)) # a new instance, so a new location in memory

2642756520704
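The difference between equal data and the same object can be checked directly. The sketch below uses a hypothetical class `Plain`, a stand-in for `C` that stores data but defines no `__eq__`; the variable names are illustrative only.

```python
class Plain:
    """Hypothetical stand-in for C: stores data but defines no __eq__."""
    def __init__(self, data=17):
        self.data = data

a = Plain(17)
b = Plain(17)    # same data, but a separate object
alias = a        # a second name for the *same* object

same_data = a.data == b.data      # the stored values match
default_eq = a == b               # default == compares identity, not data
alias_eq = a == alias             # both names refer to one object
ids_match = id(a) == id(alias)    # one object, one id

print(same_data, default_eq, alias_eq, ids_match)
# prints: True False True True
```

So the default `==` only answers "is this the same object?", which is why `C` needs its own `__eq__` to compare by `data`.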

### Exercise

Edit the definition of `C`, implementing `__eq__` and `__le__`, to fix the problems above.

In [7]:
class C:
    """
    >>> C(17) <= C(18)
    True
    >>> C(17) == C(17)
    True
    """
    def __init__(self, data=17):
        self.data = data
    def __repr__(self):
        return f"C({self.data})"
    def __lt__(self, other):
        return self.data < other.data
    def __eq__(self, other):
        return self.data == other.data
    def __le__(self, other):
        return self.data <= other.data

In [8]:
doctest.run_docstring_examples(C, globals(), verbose=True)

Finding tests in NoName
Trying:
    C(17) <= C(18)
Expecting:
    True
ok
Trying:
    C(17) == C(17)
Expecting:
    True
ok
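As an alternative to writing every comparison method by hand, the standard library offers `functools.total_ordering`, which fills in the missing ordering methods from `__eq__` plus one of `__lt__`, `__le__`, `__gt__` or `__ge__`. The sketch below is one possible variant, not the intended solution; the class name `Ordered` and the `isinstance` checks are assumptions added for illustration.

```python
from functools import total_ordering

@total_ordering
class Ordered:
    """Hypothetical variant of C that lets total_ordering derive <=, > and >=."""
    def __init__(self, data=17):
        self.data = data
    def __repr__(self):
        return f"Ordered({self.data})"
    def __eq__(self, other):
        # Returning NotImplemented lets Python handle unrelated types
        if not isinstance(other, Ordered):
            return NotImplemented
        return self.data == other.data
    def __lt__(self, other):
        if not isinstance(other, Ordered):
            return NotImplemented
        return self.data < other.data

le_result = Ordered(17) <= Ordered(18)   # derived from __lt__ and __eq__
ge_result = Ordered(17) >= Ordered(18)   # also derived
eq_other = Ordered(17) == 17             # unrelated type: falls back to False

print(le_result, ge_result, eq_other)
# prints: True False False
```

One side effect worth knowing: a class that defines `__eq__` without also defining `__hash__` becomes unhashable, so its instances cannot be used as dictionary keys or set members.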


### Exercise

This is a classic OOP exercise. Implement a class `Vehicle`, then create subclasses `Bicycle` and `Car` that call the parent constructor using `super`. A `Vehicle` has a number of wheels, a colour, and a method `move`.

In [9]:
class Vehicle:
    """
    >>> v = Vehicle()
    >>> v.move()
    The vehicle is moving in an abstract kind of way
    """
    def __init__(self, colour=None, n_wheels=None):
        self.colour = colour
        self.n_wheels = n_wheels
    def move(self):
        print("The vehicle is moving in an abstract kind of way")
    
class Bicycle(Vehicle):
    """
    >>> b = Bicycle("red")
    >>> b.move()
    The red bicycle is pedalling
    """
    def __init__(self, colour):
        super().__init__(colour, 2)
    def move(self):
        print(f"The {self.colour} bicycle is pedalling")
    
class Car(Vehicle):
    """
    >>> c = Car("blue")
    >>> c.move()
    The blue car is combusting petrol
    """
    def __init__(self, colour):
        super().__init__(colour, 4)
    def move(self):
        print(f"The {self.colour} car is combusting petrol")

In [10]:
doctest.run_docstring_examples(Vehicle, globals(), verbose=True)
doctest.run_docstring_examples(Bicycle, globals(), verbose=True)
doctest.run_docstring_examples(Car, globals(), verbose=True)

Finding tests in NoName
Trying:
    v = Vehicle()
Expecting nothing
ok
Trying:
    v.move()
Expecting:
    The vehicle is moving in an abstract kind of way
ok
Finding tests in NoName
Trying:
    b = Bicycle("red")
Expecting nothing
ok
Trying:
    b.move()
Expecting:
    The red bicycle is pedalling
ok
Finding tests in NoName
Trying:
    c = Car("blue")
Expecting nothing
ok
Trying:
    c.move()
Expecting:
    The blue car is combusting petrol
ok
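The doctests above only exercise `move`. The sketch below checks two further things: that `super().__init__` really sets `colour` and `n_wheels` on the subclasses, and that the same call `v.move()` dispatches to a different method depending on the object's class. The classes are condensed copies of the ones above, with `move` returning a string instead of printing, an assumption made here so the result is easy to inspect.

```python
class Vehicle:
    def __init__(self, colour=None, n_wheels=None):
        self.colour = colour
        self.n_wheels = n_wheels
    def move(self):
        return "The vehicle is moving in an abstract kind of way"

class Bicycle(Vehicle):
    def __init__(self, colour):
        super().__init__(colour, 2)   # parent stores colour and wheel count
    def move(self):
        return f"The {self.colour} bicycle is pedalling"

class Car(Vehicle):
    def __init__(self, colour):
        super().__init__(colour, 4)
    def move(self):
        return f"The {self.colour} car is combusting petrol"

fleet = [Bicycle("red"), Car("blue")]

# Attributes set by the parent constructor, plus the is-a relationship
summary = [(type(v).__name__, v.colour, v.n_wheels, isinstance(v, Vehicle))
           for v in fleet]
# Same call, different behaviour per subclass
moves = [v.move() for v in fleet]

print(summary)
# prints: [('Bicycle', 'red', 2, True), ('Car', 'blue', 4, True)]
```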



### Exercise: predict the mode

In many machine learning scenarios it's good to create a simple **baseline** to compare a more sophisticated algorithm against. In classification, one simple example is to predict the **mode** -- the most common $y$ value in the training data, ignoring the $X$.

In [11]:
from collections import Counter

def mode(y):
    """
    Return the most common item in y.

    Example: Counter("aaba").most_common() returns (item, count) tuples
    ordered by count: [('a', 3), ('b', 1)]
    So the most common item is at [0][0]
    """
    return Counter(y).most_common()[0][0]
mode(['a', 'a', 'b', 'a'])
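To make the indexing in `mode` concrete, the sketch below inspects what `most_common` returns and what happens when two labels are tied. It assumes Python 3.7 or later, where items with equal counts are returned in the order they were first encountered; the variable names are illustrative.

```python
from collections import Counter

counts = Counter(['a', 'a', 'b', 'a'])
ranked = counts.most_common()        # all (item, count) pairs, highest count first
top = counts.most_common(1)[0][0]    # asking for only the top pair is enough

# With a tie, the label seen first in the data comes first
tied = Counter("abab").most_common()
tied_reversed = Counter("baba").most_common()

print(ranked, top)
# prints: [('a', 3), ('b', 1)] a
print(tied, tied_reversed)
# prints: [('a', 2), ('b', 2)] [('b', 2), ('a', 2)]
```

So when classes are exactly balanced, the "mode" depends on the order of the training labels, which is worth remembering when using it as a baseline.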

Create a `ModePredictor` class by refactoring the above code. Inherit from Scikit-Learn `BaseEstimator` and `ClassifierMixin`. Fit it on the training split of the dataset below, then compare its classification accuracy on the test split against another classifier, such as `RandomForestClassifier`. Remember, we should **inherit** the `score` method (classification accuracy) from `ClassifierMixin`, not implement it ourselves!
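The idea of inheriting accuracy from a mixin can be sketched in plain Python. This is not Scikit-Learn's implementation: `AccuracyMixin`, `AlwaysPredict` and the demo data are hypothetical names invented here to show the pattern, in which the mixin supplies `score` and relies on the subclass to supply `predict`.

```python
class AccuracyMixin:
    """Hypothetical mixin: gives score() to any class that defines predict()."""
    def score(self, X, y):
        y_pred = self.predict(X)
        # Fraction of predictions that match the true labels
        return sum(p == t for p, t in zip(y_pred, y)) / len(y)

class AlwaysPredict(AccuracyMixin):
    """Hypothetical classifier that returns one fixed label for every row."""
    def __init__(self, label):
        self.label = label
    def fit(self, X, y):
        return self   # nothing to learn
    def predict(self, X):
        return [self.label for _ in X]

# Made-up demo data, not the course dataset
X_demo = [[0.1], [0.2], [0.3], [0.4]]
y_demo = ["healthy", "healthy", "unhealthy", "healthy"]

clf = AlwaysPredict("healthy").fit(X_demo, y_demo)
demo_score = clf.score(X_demo, y_demo)   # score() comes from the mixin

print(demo_score)
# prints: 0.75
```

`AlwaysPredict` never defines `score`, yet it can be scored. `ClassifierMixin` plays the same role for Scikit-Learn classifiers.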

In [85]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [86]:
df = pd.read_csv("data/unbalanced.csv", index_col=0)
df.head()

|   | X0 | X1 | y |
|---|------|------|-----------|
| 0 | 0.66 | 0.27 | healthy |
| 1 | 0.57 | 0.41 | healthy |
| 8 | 0.96 | 0.46 | unhealthy |
| 2 | 0.48 | 0.47 | healthy |
| 3 | 0.55 | 0.64 | healthy |


In [90]:
X = df[["X0", "X1"]].values
y = df["y"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

In [91]:
X_train

array([[0.32, 0.39],
       [0.66, 0.27],
       [0.29, 0.14],
       [0.97, 0.74],
       [0.66, 0.29],
       [0.32, 0.46],
       [0.81, 0.72],
       [0.22, 0.87],
       [0.17, 0.18],
       [0.1 , 0.76]])

In [94]:
y_train

array(['healthy', 'healthy', 'healthy', 'unhealthy', 'healthy', 'healthy',
       'healthy', 'healthy', 'healthy', 'healthy'], dtype=object)

In [95]:
class ModePredictor(BaseEstimator, ClassifierMixin):
    def fit(self, X, y):
        # Example: Counter("aaba").most_common() returns (item, count) tuples
        # ordered by count: [('a', 3), ('b', 1)]
        # So the most common item is at [0][0]
        # Trailing underscore: Scikit-Learn convention for attributes learned in fit
        self.mode_ = Counter(y).most_common()[0][0]
        return self
    def predict(self, X):
        # ignore the values in X and return the mode
        # we have to return a prediction yhat for *each* element of the query X
        return np.array([self.mode_ for _ in X])

When we run the code below we should see a table of results like this:

```python
ModePredictor(): 0.90
RandomForestClassifier(): 0.90
```

In [96]:
clfs = [ModePredictor(), RandomForestClassifier()]
for clf in clfs:
    clf.fit(X_train, y_train)
    print(f"{clf}: {clf.score(X_test, y_test):.2f}")

ModePredictor(): 0.90
RandomForestClassifier(): 0.90


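The random forest scores 0.90, which looks good, but the mode predictor scores exactly the same. Because the data is highly imbalanced (about 90% of the samples are `healthy`), always predicting the majority class is already 90% accurate, so the random forest has not shown that it learned anything beyond the baseline.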
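The arithmetic behind this can be worked through by hand. The sketch below uses hypothetical labels with the same 9:1 ratio as the training labels shown earlier (not the actual test split) and computes plain accuracy alongside per-class recall for a predictor that always outputs the majority class. Averaging the per-class recalls is one common way to expose the problem, and is an addition here rather than part of the exercise.

```python
from collections import Counter

# Hypothetical labels: 9 healthy, 1 unhealthy (same ratio as y_train above)
y_true = ["healthy"] * 9 + ["unhealthy"]

# A mode predictor outputs the majority class for every sample
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(label):
    """Fraction of the samples with this true label that were predicted correctly."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    return hits / y_true.count(label)

recalls = {label: recall(label) for label in sorted(set(y_true))}
balanced = sum(recalls.values()) / len(recalls)   # mean of per-class recalls

print(accuracy, recalls, balanced)
# prints: 0.9 {'healthy': 1.0, 'unhealthy': 0.0} 0.5
```

Accuracy is 0.9 even though not a single `unhealthy` sample is detected, which is why a score of 0.90 on this dataset says little on its own.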