# Week 6: Python classes

## Goals:
- Create `tar` and `zip` files
- Learn fundamentals of object oriented programming
- Create Python classes


## Creating a `tar` file

One can do this fairly easily in Python with the standard library [tarfile](https://docs.python.org/3/library/tarfile.html).

Suppose we want to create a `tar` file containing files `data/ex1.csv` and `data/ex2.csv`.

In [1]:
import tarfile

with tarfile.open("two_examples.tar.gz", "w:gz") as tar:
    tar.add("data/ex1.csv")
    tar.add("data/ex2.csv")

This creates a `tar` file with the same directory structure as given. 

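The files `ex1.csv` and `ex2.csv` therefore sit inside a folder `data` in our `tar` file. If that is not what we want, we can modify the code (for example with the `arcname` argument of `tar.add`) or move the files around before archiving.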
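
One way to drop the `data` folder is the `arcname` argument of `tar.add`, which sets the name a file gets inside the archive. The sketch below is self-contained: it first writes two throwaway CSV files into a temporary directory. The file contents and the archive name `flat_examples.tar.gz` are made up for illustration.

```python
import os
import tarfile
import tempfile

# Build a throwaway data/ folder so the example runs anywhere.
workdir = tempfile.mkdtemp()
data_dir = os.path.join(workdir, "data")
os.makedirs(data_dir)
paths = []
for name in ["ex1.csv", "ex2.csv"]:
    path = os.path.join(data_dir, name)
    with open(path, "w") as f:
        f.write("x,y\n1,2\n")
    paths.append(path)

archive = os.path.join(workdir, "flat_examples.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    for path in paths:
        # arcname is the name stored in the archive; basename drops the folder.
        tar.add(path, arcname=os.path.basename(path))

# Read the archive back to check what was stored.
with tarfile.open(archive, "r:gz") as tar:
    tar_names = tar.getnames()
print(tar_names)
```

Listing the names with `getnames` is a quick way to confirm the layout before sharing an archive.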

## Creating a `zip` file

Just as with `tar`, one can create `zip` files easily with the standard library module [zipfile](https://docs.python.org/3/library/zipfile.html).

In [2]:
import zipfile

# Use the name zf so that the built-in function zip is not shadowed.
with zipfile.ZipFile("two_examples.zip", "w") as zf:
    zf.write("data/ex1.csv")
    zf.write("data/ex2.csv")

We have the same directory "problem" as before: both files are stored inside a folder `data` in the `zip` file.
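
`ZipFile.write` also accepts an `arcname` argument, so the same idea applies here. As a self-contained sketch, the code below writes two throwaway CSV files into a temporary directory. The file contents and the archive name `flat_examples.zip` are made up for illustration.

```python
import os
import tempfile
import zipfile

# Build a throwaway data/ folder so the example runs anywhere.
workdir = tempfile.mkdtemp()
data_dir = os.path.join(workdir, "data")
os.makedirs(data_dir)
paths = []
for name in ["ex1.csv", "ex2.csv"]:
    path = os.path.join(data_dir, name)
    with open(path, "w") as f:
        f.write("x,y\n1,2\n")
    paths.append(path)

archive = os.path.join(workdir, "flat_examples.zip")
with zipfile.ZipFile(archive, "w") as zf:
    for path in paths:
        # Store each file under its bare file name, without the folder.
        zf.write(path, arcname=os.path.basename(path))

# Read the archive back to check what was stored.
with zipfile.ZipFile(archive) as zf:
    zip_names = zf.namelist()
print(zip_names)
```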

## Programming paradigms

There are many programming paradigms. A paradigm refers to the fundamental style of the code and how the computation is organised.

Some examples:
- procedural (e.g. `C`, `FORTRAN`, `COBOL`)
- object oriented (e.g. `Java`, `Python`, `C++`)
- functional (e.g. `Scala`, `Haskell`, `Lisp`)

Many modern languages take ideas from many different paradigms. For example `Python` has procedural, object oriented, and functional aspects. 
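
To make this concrete, here is a small sketch of one task (summing squares) written in three styles in Python. The class name `Numbers` is made up for illustration, and real code rarely sticks to one style this strictly.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Procedural: an explicit loop that updates a variable step by step.
total_procedural = 0
for n in numbers:
    total_procedural += n * n

# Functional: combine functions; no variable is updated in place.
total_functional = reduce(lambda acc, n: acc + n * n, numbers, 0)


# Object oriented: an object holds the data and offers a method on it.
class Numbers:
    def __init__(self, values):
        self.values = values

    def sum_of_squares(self):
        return sum(v * v for v in self.values)


total_oo = Numbers(numbers).sum_of_squares()
print(total_procedural, total_functional, total_oo)
```

All three print the same value; only the organisation of the code differs.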

**Opinion:** I would say `Python` is *primarily* in the object oriented style.

One of the 'slogans' of object oriented programming is that the objects carry the data. 

Users use the objects to call certain functions. 

For example, in `Python` 
- **methods** are *functions* attached to objects, and 
- **attributes** are *data* attached to objects.

There are sometimes overlaps and inconsistencies, but this is the general idea.

We can use `dir` to get all of the methods and attributes of an object.

### Python lists

In [3]:
print(dir(list))

['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']


Special methods and attributes are indicated with the prefix and suffix `__`. Let's ignore these for now.

In [4]:
[s for s in dir(list) if "__" not in s]

['append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

All of these are methods.
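
One rough way to check this is the built-in `callable`, which reports whether something can be called like a function. It is only a proxy for "is a method", but it is good enough here. For contrast, the sketch also looks at a complex number, which has both a method and a plain data attribute.

```python
# Names of list without the special double-underscore entries.
public_names = [name for name in dir(list) if "__" not in name]

# Collect any public name that cannot be called like a function.
non_methods = [name for name in public_names if not callable(getattr(list, name))]
print(non_methods)  # an empty list means every public name is callable

# A complex number has a method (conjugate) and a data attribute (real).
z = 1 + 2j
conjugate_is_callable = callable(z.conjugate)
real_is_callable = callable(z.real)
print(conjugate_is_callable, real_is_callable)
```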

In [8]:
L = [1, 5, 25]
L.append(125)
# M = L + [125]
print(L)

[1, 5, 25, 125]


Methods can sometimes change the object in place, without defining a new object.

In [9]:
L.remove(1)
print(L)

[5, 25, 125]
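
A small sketch of the difference between changing a list in place and building a new one. The names `alias` and `M` are just for illustration.

```python
L = [1, 5, 25]
alias = L                  # a second name for the same list object

result = L.append(125)     # changes L in place and returns None
M = L + [625]              # builds a brand new list; L is untouched

print(result)              # None
print(alias is L, alias)   # the alias sees the change made through L
print(M is L, L, M)
```

Because `append` returns `None`, writing `L = L.append(125)` would throw the list away, which is a common slip.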


### Numpy arrays

In [10]:
import numpy as np

Arrays in NumPy have type `ndarray`. We can view all the methods and attributes with `dir`:

In [11]:
# type(np.array([[1]]))
dir(np.ndarray)

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_namespace__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__buffer__',
 '__class__',
 '__class_getitem__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__dlpack__',
 '__dlpack_device__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',


**Note:** For those interested, Python has naming conventions. They are described in PEPs (Python Enhancement Proposals), which are documents for the Python community that aim to steer *design*. There is a section in [PEP8](https://peps.python.org/pep-0008/#descriptive-naming-styles) that describes naming conventions and addresses the usage of `__foo` vs `_foo` vs `__foo__`. 
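
As a rough illustration of the three styles, here is a sketch with a made-up class `Account` (all names in it are hypothetical). A single leading underscore is only a hint to the reader, while two leading underscores make Python rename the attribute behind the scenes ("name mangling").

```python
class Account:
    def __init__(self):
        self.owner = "Ada"     # public attribute
        self._balance = 10     # _foo: "internal use" by convention only
        self.__pin = 1234      # __foo: stored as _Account__pin (name mangling)

    def __len__(self):         # __foo__: special method used by len()
        return self._balance


acct = Account()
balance = acct._balance                # still accessible; nothing is enforced
has_plain_pin = hasattr(acct, "__pin") # the unmangled name does not exist
mangled_pin = acct._Account__pin       # the mangled name does
print(balance, has_plain_pin, mangled_pin, len(acct))
```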

An `ndarray` has an attribute called `shape`.

In [14]:
X = np.array([[1]])
print(X)
print(X.shape)

[[1]]
(1, 1)


## Python classes

![](graphics/python-at-school.jpg)

You won't see this image unless you visit the [repository on GitHub](https://github.com/joshmaglione/CS4102-Jupyter/blob/main/graphics/python-at-school.jpg). (Don't worry it's a waste of time.)

Python classes let us define our own types of objects, just as `list` and `ndarray` are types provided by Python and NumPy. 

Classes are very convenient when you:
- want to build your own original object,
- want to interact with lots of data conveniently. 

Here's a basic example. Let's create a class called `VectorSpace`. We will *initialise* this class with a field (either $\mathbb{Q}$, $\mathbb{R}$, or $\mathbb{C}$) and a nonnegative integer.

In [15]:
class VectorSpace:
    def __init__(self, K, d):
        self.field = K 
        self.dim = d 

The special method `__init__` tells Python how to *create* a `VectorSpace` object. 

The variable `self` refers to the object the method is called on. Every method (e.g. `x.foo()`) takes it as its first parameter, and Python passes the object in that position automatically. (The name `self` is only a convention, but the position is not: the object is always the first argument.)

So the above code tells Python that a `VectorSpace` is created with inputs for `K` and `d`.
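
To see what happens to `self`, here is a sketch with a made-up class `Greeter`. Calling a method on an object is the same as calling the function on the class and passing the object in by hand.

```python
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hello, {self.name}!"


g = Greeter("Emmy")

via_instance = g.greet()       # Python fills in self = g for us
via_class = Greeter.greet(g)   # the same call with self passed explicitly
print(via_instance)
print(via_instance == via_class)
```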

In [19]:
V = VectorSpace("R", 3)
print(V)

<__main__.VectorSpace object at 0x7fc94402e4e0>


Python successfully created a `VectorSpace` object, but it does not know how to print it, so it falls back on the only information it has: the class name and the memory address.

We can tell Python how to print our object with `__repr__`. This special method should return a `str`.

In [20]:
class VectorSpace:
    def __init__(self, K, d):
        self.field = K 
        self.dim = d 
    
    def __repr__(self):
        if self.field == "Q":
            s = "rational"
        elif self.field == "C":
            s = "complex"
        else:
            s = "real"
        return f"A {self.dim}-dimensional {s} vector space"

In [21]:
V = VectorSpace("R", 3)
print(V)
print(V.field, V.dim)

A 3-dimensional real vector space
R 3
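
Strictly speaking, `print` uses `str(obj)`, which calls `__str__` if the class defines it and falls back to `__repr__` otherwise. That is why defining only `__repr__` was enough above. A small sketch with a made-up class `Point` that defines both:

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # Unambiguous form, used in the interpreter and inside containers.
        return f"Point({self.x}, {self.y})"

    def __str__(self):
        # Friendlier form, used by print and str.
        return f"({self.x}, {self.y})"


p = Point(1, 2)
print(p)      # uses __str__
print([p])    # a list shows its items with __repr__
```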


We can create methods for a class by creating functions within the class (including the input variable `self`).

In [22]:
class VectorSpace:
    def __init__(self, K, d):
        self.field = K 
        self.dim = d 
    
    def __repr__(self):
        if self.field == "Q":
            s = "rational"
        elif self.field == "C":
            s = "complex"
        else:
            s = "real"
        return f"A {self.dim}-dimensional {s} vector space"
    
    def std_basis(self):
        d = self.dim
        if d == 0:
            return []
        int_bas = [[0]*(i) + [1] + [0]*(d-i-1) for i in range(d)]
        if self.field == "Q": 
            return int_bas
        return [[float(x) for x in v] for v in int_bas]

In [25]:
V = VectorSpace("R", 3)
V.std_basis()

[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

We could edit our class, so that when we receive bad input (e.g. `d < 0`) we raise an error. We'll come back to this later... maybe.
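
A minimal sketch of what such a check could look like. It assumes we use `ValueError` and accept only the three labels `"Q"`, `"R"`, and `"C"`. The class name `CheckedVectorSpace` is made up so that it does not overwrite the class above.

```python
class CheckedVectorSpace:
    def __init__(self, K, d):
        # Assumption: fields are given by one of these three labels.
        if K not in ("Q", "R", "C"):
            raise ValueError(f"field must be 'Q', 'R', or 'C', got {K!r}")
        # The dimension must be a nonnegative integer.
        if not isinstance(d, int) or d < 0:
            raise ValueError(f"dimension must be a nonnegative integer, got {d!r}")
        self.field = K
        self.dim = d


message = None
try:
    CheckedVectorSpace("R", -1)
except ValueError as err:
    message = str(err)
print(message)

V_ok = CheckedVectorSpace("Q", 2)   # valid input still works as before
print(V_ok.field, V_ok.dim)
```

Raising the error in `__init__` means a badly formed object is never created, so the other methods do not need to re-check their inputs.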

## Try it yourself

### Problem 1:

Create a Python class that is initialised with a `pandas` dataframe and a list of the column names (strings) corresponding to independent variables. It should have the following attributes:
- `dataframe` : the given dataframe,
- `dep_vals` : a list of the column names corresponding to dependent variables,
- `ind_vals` : same as `dep_vals` but independent.

It should have the following methods:
- `X_matrix` : returns a `numpy` matrix (`ndarray`) whose columns are the 'independent' variables together with a ones column. 
- `Y_matrix` : same as `X_matrix` but 'dependent' (and without the ones column).

1. Write a `__repr__` method for this class. 
2. Write a `__len__` method for this class. 

In [None]:
class MyDataFrame:

    def __init__(self, df, ind):
        self.dataframe = df 
        self.ind_vals = ind 
        # Every column not listed as independent is treated as dependent.
        self.dep_vals = [
            s for s in df.columns.values if s not in ind
        ]
    
    def __repr__(self):
        return str(self.dataframe)
    
    def __len__(self):
        return len(self.dataframe)
    
    def X_matrix(self):
        return np.array(
            [[1]*len(self)] + [self.dataframe[x] for x in self.ind_vals]
        ).T

    def Y_matrix(self):
        return np.array(
            [self.dataframe[y] for y in self.dep_vals]
        ).T
    
import pandas as pd 
df = pd.read_csv("data/nonlinear_ex.csv")
mdf = MyDataFrame(df, ["x_i"])
print(mdf)
print(len(mdf))
print(mdf.X_matrix())
print(mdf.Y_matrix())

### Problem 2: 

Same idea, create a Python class initialised by a `pandas` dataframe, but it's up to you how to name things and what to include. There should be methods that output the covariance matrix (after normalising) and the (ordered) principal components. 

(An exercise 😁)