# Review of Python for the final

This notebook summarizes what I expect you to know about Python to be ready for the final (in addition to the midterm material, which you'll find in a separate notebook). Read each cell of text, read each cell of code, anticipate what it is going to do, execute it, and confirm your understanding.

# Basic Python


## Functions

Functions in Python are similar to functions in math: they take inputs, called arguments, and return outputs.

Functions start with the special keyword `def` ("special keyword" means that you cannot use a variable called `def`), then the name of the function, then parentheses and an optional argument list. The function declaration ends with a colon `:`, after which comes the block of the function with one level of indentation. When indentation decreases, the function is over.

Here is a simple Python function similar to the square function in math.

In [None]:
def square(x):
    return x ** 2

A function always returns a value. If you do not explicitly return a value, or if you use only `return`, then the value returned is `None`.

In [None]:
def append_to_list(x, a):
    x.append(a)

a = [1, 2, 3]
b = append_to_list(a, 4)
print("b is None: ", b is None)

The arguments to this function are called "positional arguments": Python replaces the arguments with the values depending on their position.

You may want to write a function where some arguments are implicit and have a default value. For example, this function converts a daily interest rate to an annual rate (assuming 250 trading days in the year):

In [None]:
def annualize(r):
    return ((1 + r) ** 250) - 1

Now suppose you sometimes want to convert monthly interest rates as well. You could write another function, but that would not be DRY ("Don't Repeat Yourself"). Instead, you can rewrite this function, using an extra argument `freq` and a default value of `"daily"` for it, as in this next cell. Your existing code that uses this function will still work, because the function assumes a daily interest rate by default, but now you can also write code with a monthly interest rate. (If you know Java or C++, this is Python's closest equivalent to "function overloading".)

In [None]:
def annualize(r, freq="daily"):
    period = 250
    if freq == "monthly":
        period = 12
    elif freq != "daily":
        raise ValueError("Frequency must be either daily or monthly")

    return ((1 + r) ** period) - 1

The argument `r` is called a "positional argument", because the only part that distinguishes it is its position. The argument `freq` is called a "named argument", because you pass it by name (Python's documentation calls this a "keyword argument"). For example, to annualize a 1% monthly interest rate, you would write:

In [None]:
annualize(0.01, freq="monthly")
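As a minimal sketch of how the default value behaves, the cell below repeats the function and calls it three ways. The docstring and the variable names `daily_rate`, `monthly_rate`, and `message` are only for illustration; the 0.01% daily rate is an arbitrary example value.

```python
def annualize(r, freq="daily"):
    """Convert a daily or monthly interest rate to an annual rate."""
    period = 250
    if freq == "monthly":
        period = 12
    elif freq != "daily":
        raise ValueError("Frequency must be either daily or monthly")

    return ((1 + r) ** period) - 1


# Old call sites keep working: `freq` falls back to its default, "daily".
daily_rate = annualize(0.0001)

# New call sites pass the extra argument by name.
monthly_rate = annualize(0.01, freq="monthly")
print(daily_rate, monthly_rate)

# Any other frequency is rejected explicitly.
try:
    annualize(0.01, freq="weekly")
except ValueError as error:
    message = str(error)
    print("Error:", message)
```

The first call is why adding a default value does not break existing code: callers that never mention `freq` still get the daily conversion.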

Default or named arguments are very common in Python. For example, in the `print()` function, you can use the named argument `sep`, whose default is a space, to print each item on its own line. You can also change the default value of 5 neighbors in the k-nearest neighbors algorithm to 8, for example.

Following the Google Style Guide, there are no spaces around the `=` when you define or use a named argument.

In [None]:
print("One line", "Two lines", "Three lines", sep="\n")

import sklearn.neighbors
knn8 = sklearn.neighbors.NearestNeighbors(n_neighbors=8)

## Modules

A module is a set of code written by someone (possibly you) that you can import and use.

Sometimes module names in code are different from module names in language. For example, the module we used the most was SciKit-Learn, because it implements many of our data models, but the name in code is `sklearn`. Similarly, BeautifulSoup's name in code is `bs4`.

Most of the modules we used already came with your installation of Anaconda. Other modules do not and you can install a module by running this in a shell: `pip install <module-name>`. For example, for BeautifulSoup, you can run: `pip install bs4`.

On a Jupyter notebook, you can install modules from within Python with the syntax in this cell:

In [None]:
!pip install bs4

A module can have submodules. For example, SciKit-Learn is a very big module, with neural networks, logistic regression, k-nearest neighbors, etc. If you imported all of SciKit-Learn each time, the sum of your code plus imported modules could be big, and hence slow. Therefore, programmers divided SciKit-Learn into sub-modules so you can import only the model you need and keep your code light and fast.

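For example, in a fresh session, this code to import the linear model in SciKit-Learn can throw an error (depending on your version of SciKit-Learn), because the sub-module `linear_model` has not been imported: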

In [None]:
import sklearn

linear = sklearn.linear_model.LinearRegression()

Instead, you have to import the sub-module `linear_model`, and then the code works:

In [None]:
import sklearn.linear_model

linear = sklearn.linear_model.LinearRegression()

You can "alias" a module in order to write code faster. An "alias" is a shortcut or a nickname for a module, similar to saying "Bob", which is shorter than "Robert".

Two common aliases are `np` for `numpy` and `pd` for `pandas`, with this syntax:

In [None]:
import numpy as np
import pandas as pd

Any file you write can be imported like a Python module. This helps keep your code DRY.

If you write a file called `exercise1.py`, and then want to use function `compute()` from exercise 1 in another exercise in the same folder, this would be the syntax:

``` python
import exercise1

exercise1.compute()
```

The calling file has to be in the same folder as `exercise1.py`. At our level, this is sufficient (otherwise, you have to change the Python PATH variable with `sys.path.append("/path/to/exercise1")`, but this is bad code).

### Dunder variables

Python files have "dunder variables", where "dunder" stands for "double underscore". We'll use two dunder variables: `__file__`, which has the location of the current Python file on disk if you run it locally (but not on Google Colab, where this throws an error), and `__name__`, which equals `"__main__"` if the file is run directly, and the name of the module (the file name without `.py`) if it's imported.

In [None]:
print(__name__)
print(__file__)

So we can use that dunder variable to ensure we only run a piece of code when the Python file is called directly, but not when it's imported:

In [None]:
if __name__ == "__main__":
    print("This code is running only when the file is called directly")
else:
    print("This code only runs when imported")

## Getting help

You can type `help(<object-name>)` inside Python, or run `pydoc <object-name>` in a shell. An object name can be a function, a module, a module name followed by a period (.) and a sub-module, or any other object.

For example, run this inside Python to get help on the function `print`, the module `datetime`, or the model `sklearn.neighbors.NearestNeighbors` (which tells you the default value of neighbors is 5).

In [None]:
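# help() prints the documentation itself and returns None,
# so there is no need to wrap it in print().
help(print)

import datetime
help(datetime)

import sklearn.neighbors
help(sklearn.neighbors.NearestNeighbors)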

## View vs. copy

Some data structures can be complex, such as lists. When we assign a list `[1, 2, 3]` to a variable `a`, we're actually storing a "pointer" to the list. If we assign it to another variable `b`, we're not copying the whole list, but only a pointer to it. If we change `b`, we also change the original list stored in `a`. In this case, the variable is called a **"view"** of the list. Here is an example:

In [None]:
a = [1, 2, 3]
print("Original list:", a)

b = a
b.append(4)  # this changes the list, pointed to by `a` and `b`

print("Value of a:", a)
print("Value of b:", b)

If we want the change in `b` not to affect `a`, we need to create a **"copy"** of the list:

In [None]:
a = [1, 2, 3]
print("original list:", a)

c = a.copy()
c.append(4)
print("Value of a:", a)
print("Value of c:", c)

## Value versus reference

A close concept to "view vs. copy" is "value vs. reference." The operator `is` tests whether two variables point to the same **reference** in memory, and so it is very fast. The operator `==` tests whether two variables have the same **value**.

The difference is apparent in numbers. For speed reasons, the standard Python interpreter (CPython) keeps only one object for each integer between -5 and 256. So both of these tests return true because `a` and `b` have the value `42`, and there's only one `42` in Python:

In [None]:
a = 42
b = 42
print(a is b)
print(a == b)

But when we try with `257`, then only one test returns true:

In [None]:
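# Build the integers at runtime: two literals `257` written in the same
# cell may be merged into a single object by the compiler.
a = int("257")
b = int("257")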
print(a is b)
print(a == b)

The comparison `is` uses the reference of the variable, and `==` uses the value.
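The same two operators also tell a view from a copy. This is a small sketch tying the two sections together; the variable names `same_reference_b`, `same_reference_c`, and `same_value_c` are only for illustration.

```python
a = [1, 2, 3]
b = a          # view: `b` points to the same list as `a`
c = a.copy()   # copy: a new list with the same contents

same_reference_b = a is b   # same object in memory
same_reference_c = a is c   # different objects in memory
same_value_c = a == c       # but equal contents

print("a is b:", same_reference_b)
print("a is c:", same_reference_c)
print("a == c:", same_value_c)
```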

Some functions work "in place": they take arguments by reference and modify them. Sorting a list with `.sort()` is an example: it modifies the list and returns nothing (that is, `None`):

In [None]:
a = [3, 2, 1]
b = a.sort()
print("a =", a)
print("b =", b)

If we try the same thing on a string, which is immutable and cannot be changed, then we get an error:

In [None]:
a = "bcda"
a.sort()

Instead, we can use the function `sorted()`, which works by value: it takes the value of its argument, works with it, and returns a new value:

In [None]:
a = "bcda"
print(sorted(a))

# To print this as a string, use `join()`, which joins a list into a string:
print("".join(sorted(a)))

## Doc-tests and test-driven development

Doc-tests are executable documentation. They are very helpful in ensuring that functions always do what we expect of them, and that any changes to the function do not break previous code.

In test-driven development, you write the tests first, then you write the function. In our case, you can do them in either order.

Doc-tests are inside the doc-string (which is the first string after the function name), and are anything that starts with `>>> `. They always follow this pattern: code to execute after `>>> `, and the result expected in the next line. Here is an example that you should copy for your doc-tests:

In [None]:
def compute_area(width, height):
    """
    >>> compute_area(3, 2)
    6
    >>> compute_area(1.1, 2)
    2.2
    """

    return width * height

## Defensive programming

In defensive programming, we expect things will go wrong and the function will be called with the wrong inputs.

For example, run this cell: should the area of "hi" and 2 really be "hihi"? Can an area be negative?

In [None]:
print(compute_area("hi", 2))
print(compute_area(-2, 2))

These are called edge or corner cases: something at the boundary of what's acceptable.

In this course, we do not want such an answer. So instead, we check the type of the argument and throw an explicit `ValueError` or `TypeError` in this case, refactored into a function to avoid repetition. We also check that the argument is positive. And we add those two corner cases as doc-tests:

In [None]:
def check_argument(a):
    if not isinstance(a, (int, float)):
        raise ValueError("arguments must be numbers")
    if a < 0:
        raise ValueError("arguments must be non-negative")


def compute_area(width, height):
    """
    >>> compute_area(3, 2)
    6
    >>> compute_area(1.1, 2)
    2.2
    >>> compute_area("hi", 2)
    Traceback (most recent call last):
    ...
    ValueError: arguments must be numbers
    >>> compute_area(-2, 2)
    Traceback (most recent call last):
    ...
    ValueError: arguments must be non-negative
    """

    check_argument(width)
    check_argument(height)

    return width * height

You can also use `assert` to ensure that your code is running as you expect. For example, in the computation of the area, you may want to check that it's never negative; otherwise, your code has a problem that you should fix at development time. Calling `assert` is a guardrail, a reminder for you to fix your code when writing it. An assertion should never fail at runtime, when your code is running.

These are the differences between assertions and checking arguments:

- checking arguments happens at the top of a function, and never after; assertions should not check the type of an argument, but should be used inside the function

- assertions are a coding tool and should always be true at runtime when the code is run (if they fail, you should change your code); argument checking can fail at runtime, and the caller should handle the error.

Here is an example of using `assert` in the previous function. If we checked the arguments correctly, the assertion is always true. If later we decide to allow negative areas, for example to calculate integrals, then this assert will fail and will remind us to handle that case explicitly.

In [None]:
def compute_area(width, height):
    """
    >>> compute_area(3, 2)
    6
    >>> compute_area(1.1, 2)
    2.2
    >>> compute_area("hi", 2)
    Traceback (most recent call last):
    ...
    ValueError: arguments must be numbers
    >>> compute_area(-2, 2)
    Traceback (most recent call last):
    ...
    ValueError: arguments must be non-negative
    """

    check_argument(width)
    check_argument(height)
    area = width * height

    assert area >= 0, "Area should be non-negative!"

    return area
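The point that "the caller should handle the error" can be sketched as follows. The function `describe_area()` is a hypothetical caller invented for this illustration; `check_argument()` and `compute_area()` are repeated (without doc-tests) so the cell runs on its own.

```python
def check_argument(a):
    if not isinstance(a, (int, float)):
        raise ValueError("arguments must be numbers")
    if a < 0:
        raise ValueError("arguments must be non-negative")


def compute_area(width, height):
    check_argument(width)
    check_argument(height)
    area = width * height

    assert area >= 0, "Area should be non-negative!"

    return area


def describe_area(width, height):
    # Hypothetical caller: it handles the ValueError raised by the
    # argument check, instead of letting the program crash.
    try:
        return f"area = {compute_area(width, height)}"
    except ValueError as error:
        return f"invalid input: {error}"


print(describe_area(3, 2))
print(describe_area("hi", 2))
print(describe_area(-2, 2))
```

Notice that the caller catches `ValueError` only. An `AssertionError` is deliberately not caught, since it signals a bug to fix in the code rather than a bad input.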

## Happy path / readability

Consider this function that returns whether a year is a leap year:

In [None]:
def is_leap_year(year):
    return (year % 4) and not ((year % 100) and not (year % 400))


The body of the function is wide with 65 characters and hard to read: should it be `and` or `or`? Are parentheses in the right place? Is it correctly using the conversion of integers to booleans?
Because it's hard to read, it's also hard to find the bug.

Instead, from assignment 3 onwards, I want you to write functions that are long / vertical and use early returns:

In [None]:
def is_leap_year(year):

    if year % 400 == 0:
        return True

    if year % 100 == 0:
        return False

    return year % 4 == 0

Please see the [readability grading guide on Github](https://github.com/mm3509/b9122/blob/main/grading/readability.md) for more examples of refactoring code with the "Happy Path", which is the path of the most common case, aligned to the left edge.
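One way to gain confidence in the early-return version is to compare it with the standard library. This sketch assumes the Gregorian rules and uses `calendar.isleap()` as the reference; the range of years and the name `mismatches` are arbitrary choices for the illustration.

```python
import calendar


def is_leap_year(year):

    if year % 400 == 0:
        return True

    if year % 100 == 0:
        return False

    return year % 4 == 0


# Collect every year where our function disagrees with the standard library.
mismatches = [year for year in range(1600, 2401)
              if is_leap_year(year) != calendar.isleap(year)]
print("Years that disagree:", mismatches)
```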

## Debugging

Debugging is a big part of programming. Please copy the code below into PyCharm, or your favorite code editor, and understand why the last doc-test fails. Debug it like we did in class.

In [None]:
import doctest
import numpy

def sum_numpy(a, b):
    """
    Sum two numbers.
    >>> import numpy
    >>> a = numpy.uint8(1)
    >>> b = numpy.uint8(2)
    >>> bool(sum_numpy(a, b) == 3)
    True
    >>> a = numpy.uint8(250)
    >>> b = numpy.uint8(250)
    >>> bool(sum_numpy(a, b) == 500)
    True
    """
    return a + b

doctests = doctest.testmod(optionflags=doctest.ELLIPSIS)
assert 0 == doctests.failed, "Some doc-tests failed, exiting..."
print("Your doc-tests pass, congratulations!")

## Recursion

Recursion is a programming paradigm where you solve a problem at a given step, say for integer `n`, by using the solution at a smaller integer, say `n-1`. The solution starts with a base case.

One example is factorial, which is defined recursively as in this function. Notice that the base case, where `n = 0`, is handled first.

In [None]:
def factorial(n):
    """
    >>> factorial(0)
    1
    >>> factorial(3)
    6
    >>> factorial(4)
    24
    >>> factorial(3.5)
    Traceback (most recent call last):
    ...
    ValueError: argument must be integer
    >>> factorial("3")
    Traceback (most recent call last):
    ...
    ValueError: argument must be integer
    >>> factorial(-1)
    Traceback (most recent call last):
    ...
    ValueError: argument must be non-negative
    """
    if not isinstance(n, int):
        raise ValueError("argument must be integer")

    if n < 0:
        raise ValueError("argument must be non-negative")

    if n == 0:
        return 1

    return n * factorial(n - 1)
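To see the recursion unfold, here is a sketch of a traced variant. The function `factorial_traced()` and its `depth` argument are hypothetical additions for illustration only, and argument checks are omitted to keep it short. Each call prints when it starts and when it returns, indented by its depth.

```python
def factorial_traced(n, depth=0):
    # `depth` only controls the indentation of the printed trace.
    indent = "  " * depth
    print(f"{indent}factorial({n}) called")

    if n == 0:
        result = 1  # base case: no further recursive call
    else:
        result = n * factorial_traced(n - 1, depth + 1)

    print(f"{indent}factorial({n}) returns {result}")
    return result


factorial_traced(3)
```

The calls go down to the base case first, and the multiplications happen on the way back up.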

## List comprehension

A list comprehension is a more compact way to write a `for` loop that builds a list. It starts with the list syntax (`[]`) with a `for` loop nested inside. This syntax is typical of Python. For example, to generate a list with all the letters of the alphabet (using the correspondence from an ASCII code to a character), we can use either of these pieces of code; notice how the list comprehension has most of the elements of the for loop, but rearranged in a single line.

In [None]:
# Equivalent for loop
# (range() excludes its upper bound, so we need 123 to include "z", code 122)
alphabet = []
for i in range(97, 123):
    alphabet.append(chr(i))

# List comprehension
alphabet = [chr(i) for i in range(97, 123)]
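A list comprehension can also filter elements with an `if` at the end. This is a small sketch in the same spirit as the alphabet example; the names `vowels` and `vowels_loop` are only for illustration.

```python
# List comprehension with a condition: keep only the vowels
vowels = [chr(i) for i in range(97, 123) if chr(i) in "aeiou"]

# Equivalent for loop
vowels_loop = []
for i in range(97, 123):
    letter = chr(i)
    if letter in "aeiou":
        vowels_loop.append(letter)

print(vowels)
print(vowels == vowels_loop)
```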

## File reading and writing

You can read from a file on your computer with a "context manager" and this syntax:

In [None]:
# Open a file for reading. The second parameter is "r" for "reading"
with open("/path/to/file.txt", "r") as f:
    for line in f:
        # Each line already ends with a newline, so print() should not add one
        print(line, end="")

# Alternatives: f.read(), f.readline(), f.readlines()

# Write to a file, with the same syntax, but "w".
# "w" is for "writing": it creates the file if it does not exist,
# and erases its contents if it does.
# Adding "+" (as in "w+") also allows reading from the same file object.
with open("/path/to/file", "w") as f:
    f.write("\n".join(["one", "two", "three"]))
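Since `/path/to/file` is only a placeholder, here is a sketch you can actually run: it writes to a temporary folder and reads the lines back. The file name `numbers.txt` and the use of the `tempfile` module are choices made for this illustration.

```python
import os
import tempfile

# The temporary folder (and the file in it) is deleted when the block ends.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "numbers.txt")

    with open(path, "w") as f:
        f.write("\n".join(["one", "two", "three"]))

    with open(path, "r") as f:
        # Strip the newline that ends each line read from the file
        lines = [line.rstrip("\n") for line in f]

print(lines)
```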

## Local scoping

When you declare a variable inside a function, Python creates a variable that is defined only inside that function, which is called "local scoping". If you reuse a variable name, for example from an argument, Python has two variables, one that is local to the function and one that is not.

For example, understand why this piece of code at the end does not print a dictionary with just one key, `uni3`:

In [None]:
def other_function(d):
    d = {}
    d["uni3"] = 4

d = {"uni1": 1, "uni2": 3}
print(d)
other_function(d)
print(d)
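Once you have worked out the answer, this sketch contrasts the two behaviors side by side. The function names `rebind()` and `mutate()` are hypothetical, chosen for this illustration: the first assigns a new dictionary to the local name, the second changes the dictionary it received.

```python
def rebind(d):
    d = {}           # the local name `d` now points to a new dictionary
    d["uni3"] = 4    # so this changes the new dictionary only


def mutate(d):
    d["uni3"] = 4    # this changes the caller's dictionary in place


d = {"uni1": 1, "uni2": 3}

rebind(d)
after_rebind = dict(d)   # snapshot of `d` after the first call
print("After rebind:", after_rebind)

mutate(d)
print("After mutate:", d)
```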

# NumPy

NumPy, for "Numerical Python", is a package for computations such as linear algebra. The MNIST images in assignment 2 were stored as a matrix of size 28x28, with elements.

Most of the time, you won't use NumPy directly, but use algorithms packages like SciKit-Learn that use NumPy. So this is a primer with just the basics.

NumPy is often abbreviated as `np` to type faster.

### Arrays

You create an array by calling `np.array()` on a list (for a vector) and a list of lists (for a matrix).

In [None]:
import numpy as np
vector = np.array([1, 2, 3, 4])
print("Vector: ", vector)

matrix = np.array([[1, 2], [3, 4]])
print("Matrix: ", matrix)

Even though they have different dimensions, they are both called "arrays".

A one-dimensional array, abbreviated as "1-D array", is represented by the `print()` function as a row vector.

A two-dimensional array, abbreviated as "2-D array", is represented by the `print()` function as a matrix, or list of lists.

You convert from 1-D arrays to 2-D arrays with `np.reshape()`. This can often go wrong, so please use it carefully. The function takes as arguments an array and a tuple with the dimensions. NumPy will distribute the elements from the argument array into an array of the new size. If you pass `-1` as a dimension, this will be inferred from the number of elements in the argument array and the other dimensions in the tuple. For now, please only use it with `1` or `-1` in the dimensions, to convert vectors to the equivalent row or column arrays.


Reshaping with `(-1, 1)` converts an array to the equivalent of a column vector. Reshaping with `(1, -1)` converts an array to the equivalent of a row vector.

In [None]:
vector_as_2d_array = np.reshape(vector, (1, -1))
print("Vector has shape", vector.shape)
print("Matrix has shape", vector_as_2d_array.shape)
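For completeness, here is a sketch of both reshapes next to each other, using the same 4-element vector as before. The names `row` and `column` are only for illustration.

```python
import numpy as np

vector = np.array([1, 2, 3, 4])

row = np.reshape(vector, (1, -1))      # 1 row, number of columns inferred
column = np.reshape(vector, (-1, 1))   # 1 column, number of rows inferred

print("Row has shape", row.shape)
print("Column has shape", column.shape)
print(column)
```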

### Array shape

You get the size of an array with the attribute `.shape` (it's a field, not a function, so it has no parentheses):

In [None]:
print("Vector has shape: ", vector.shape)
print("Matrix has shape: ", matrix.shape)

### Array type

All elements in an array have the same type (unlike lists in Python). For example, this code throws an error:

In [None]:
np.array([1, 2, (3, 4)])

You can check the type of the array by checking the type of an element. Notice that NumPy variables have the number of bits in the type (for example, int64 is an integer of 64 bits):

In [None]:
print(type(vector[1]))

You can convert between types with the method `.astype()`, for example to convert to floats of size 32:

In [None]:
vector_float = vector.astype(np.float32)
print(type(vector_float[1]))

### Initialize new arrays

You can initialize an array of a given size full of zeros or ones with `np.zeros()` and `np.ones()`, which take as argument a tuple for the size.

By default, these functions create an array with type float, which is why you see a dot after the integers.

In [None]:
a = np.zeros((2, 2))
print("a = ", a)
b = np.ones((2, 2))
print("b = ", b)

You can create a vector by passing a tuple of size 1: `(3,)` (the comma at the end makes this a tuple instead of an integer):

In [None]:
vector = np.ones((3, ))
print(vector)
print(vector.shape)

You can create a diagonal matrix with `np.diag()`, which takes an array and places it in the diagonal:

In [None]:
a = np.diag(np.array([1, 2, 3]))
print(a)

# np.eye(n) creates the identity matrix of size n
print(np.eye(3))

### Array stacking

We stack arrays horizontally or vertically with `np.hstack()` or `np.vstack()`, which take as arguments a **list** of arrays:

In [None]:
a = np.ones((3, 1))
b = np.zeros((3, 1))
c = np.hstack([a, b])
print("horizontal:", c, sep="\n")
d = np.vstack([a, b])
print("vertical:")
print(d)

### Indexing

Arrays behave like lists, so you can access their values with a similar notation (for a matrix, put the row and column indices inside the same brackets, separated by a comma). For example, this creates an upper triangular matrix:

In [None]:
a = np.eye(2)
a[0, 1] = 1
print("a", a, sep="\n")
print("The lower left value is:", a[1, 0])

### Basic array operations

Operations follow the intuitive math standard. For example, you can add a number to a vector, which adds that number to every element in the vector (the number is "broadcast" to the vector):

In [None]:
print(5 + np.ones((3,)))

Likewise, you can multiply a number with a vector:

In [None]:
print(5 * np.ones((3,)))

You can add two vectors:

In [None]:
print(np.ones((3,)) + 6 * np.ones((3,)))

You can add a vector and a matrix of a compatible size. The vector is broadcast to a matrix (similar to how `1 + 2.5` converts the integer `1` to a float):

In [None]:
print(np.ones((3,)) + np.ones((1, 3)))

But if the shapes do not match, for example a vector (which is treated as size (1,3)) plus a matrix of size (3,1), NumPy broadcasts both and you get a matrix of size (3,3), which may not be what you expect:

In [None]:
a = np.ones((3,))
b = np.ones((3, 1))
print(a.shape)
print(b.shape)
print(a)
print(b)
print(np.ones((3,)) + np.ones((3, 1)))
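A quick way to catch this kind of surprise is to print the shape of the result. In this sketch, the names `row`, `column`, and `column_sum` are only for illustration; the last line shows one possible fix, reshaping the vector into a column before adding.

```python
import numpy as np

vector = np.ones((3,))
row = np.ones((1, 3))
column = np.ones((3, 1))

print((vector + row).shape)      # vector treated as a row
print((vector + column).shape)   # both operands are broadcast

# To add element by element, give both operands the same shape first
column_sum = np.reshape(vector, (-1, 1)) + column
print(column_sum.shape)
```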

## Matrix operations

You can do matrix manipulation, for example `.T` transposes a matrix:

In [None]:
a = np.zeros((4, 2))
print("Original: ", a, sep="\n")
print("Transpose: ", a.T, sep="\n")

You can multiply matrices with `np.matmul()`, which takes two arrays of compatible shape:

In [None]:
a = np.ones((3, 2))
np.matmul(a, a.T)

You can invert a matrix with `np.linalg.inv()`:

In [None]:
a = np.diag([1, 2, 3])
a[0, 2] = 4
print("a:", a, sep="\n")

a_inverse = np.linalg.inv(a)
print("inverse:", a_inverse)

c = np.matmul(a, a_inverse)
print("multiplication:")
print(c)
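Because of floating-point rounding, the product may not be exactly the identity matrix. A hedged way to check it is `np.allclose()`, which compares arrays up to a small tolerance; the names `product` and `is_identity` are only for illustration.

```python
import numpy as np

a = np.diag([1, 2, 3])
a[0, 2] = 4

product = np.matmul(a, np.linalg.inv(a))

# Compare with the identity matrix, allowing for tiny rounding errors
is_identity = np.allclose(product, np.eye(3))
print("Product is the identity:", is_identity)
```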

### Linear regression

A linear regression is simply matrix multiplication (plus one matrix inversion):

$$\hat{\beta} = (X'X)^{-1} X'Y$$

so we can run a linear regression within NumPy.

This next cell sets the random number generator to a certain state, called a "seed", so you'll generate the same data as me.

It generates random data (using two functions from `np.random`, for the uniform distribution and normal distribution).

In [None]:
N = 200
alpha = 4
beta = 2

np.random.seed(42)
x = np.random.uniform(low=0.5, high=13.3, size=N)
error = np.random.normal(loc=0.0, scale=2.0, size=N)
y = alpha + beta * x + error
print(y.shape)

To run the regression, we convert `x` to a matrix of the same shape as a column vector:

In [None]:
x_as_array = np.reshape(x, (-1, 1))
y_as_array = np.reshape(y, (-1, 1))
print(x_as_array.shape)

We confirm it has the right shape. We now stack the constant horizontally:

In [None]:
constant = np.ones((N, 1))
x_with_constant = np.hstack([constant, x_as_array])
print("Regressors have shape:", x_with_constant.shape)

And now we run the formula to find the parameters:

In [None]:
xx = np.matmul(x_with_constant.T, x_with_constant)
xx_inv = np.linalg.inv(xx)
xy = np.matmul(x_with_constant.T, y_as_array)

beta_hat = np.matmul(xx_inv, xy)
print("Your linear regression produced these estimates:")
print(beta_hat)
print("The original values are:", alpha, beta)

With 200 data points, we got estimates very close to the original values!
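As a cross-check, NumPy also has a least-squares solver, `np.linalg.lstsq()`, which should give the same estimates without computing the inverse explicitly. This sketch regenerates the same data so it runs on its own; the names `X`, `Y`, `beta_normal`, and `beta_lstsq` are chosen for the illustration and are not used elsewhere in this notebook.

```python
import numpy as np

N = 200
alpha = 4
beta = 2

# Same seed and same data as in the cells above
np.random.seed(42)
x = np.random.uniform(low=0.5, high=13.3, size=N)
error = np.random.normal(loc=0.0, scale=2.0, size=N)
y = alpha + beta * x + error

X = np.hstack([np.ones((N, 1)), np.reshape(x, (-1, 1))])
Y = np.reshape(y, (-1, 1))

# Formula from above: (X'X)^{-1} X'Y
beta_normal = np.matmul(np.linalg.inv(np.matmul(X.T, X)), np.matmul(X.T, Y))

# Least-squares solver; it also returns diagnostics that we do not use here
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, Y, rcond=None)

print(beta_normal)
print(beta_lstsq)
print("Same estimates:", np.allclose(beta_normal, beta_lstsq))
```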

Note: if you multiply matrices and get an error `ValueError: matmul: Input operand 1 has a mismatch ...`, your matrices have incompatible sizes. Check the size of each matrix with `.shape`. The sizes have to follow this rule:

```
(M , N) * (N , K) => Result dimensions is (M, K)
```
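Here is a small sketch of that rule, with one multiplication that works and one that fails. The names `left`, `right`, and `failed` are only for illustration.

```python
import numpy as np

left = np.ones((2, 3))
right = np.ones((3, 4))

# (2, 3) * (3, 4): the inner dimensions match, so this works
product = np.matmul(left, right)
print(product.shape)

# (2, 3) * (2, 3): the inner dimensions (3 and 2) do not match
failed = False
try:
    np.matmul(left, left)
except ValueError as error:
    failed = True
    print("Error:", error)
```

In the second case, transposing one of the matrices (for example `np.matmul(left, left.T)`) makes the inner dimensions match.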

# Plotting

Now we plot these results. We'll use `matplotlib`, a package for plotting in Python. First, a scatter plot with `scatter()`:

In [None]:
import matplotlib.pyplot as plt
plt.scatter(x, y)

Then, we add the lines on top of the scatter. To do so, we predict the value of y at each value of x with the formula:  `y_predicted = beta_hat * x_with_constant` (when implementing, we have to adapt the formula so the multiplication has compatible dimensions). We plot a line with `plot()` and pass the color red as an argument `c="r"`:

In [None]:
y_predicted = np.matmul(x_with_constant, beta_hat)
plt.scatter(x, y)
plt.plot(x, y_predicted, c="r")

# Sci-Kit Learn

We did this linear regression from scratch to learn matrix manipulation in NumPy and Matplotlib.

In practice, you'll often use the Sci-Kit Learn package, which has linear regression and many algorithms. (Note: if you have to install it, remember that it has two names: you install the package with `scikit-learn` and import the module with `sklearn`; see details [here](https://towardsdatascience.com/scikit-learn-vs-sklearn-6944b9dc1736?gi=cafe4b37d090)).

We'll run the same regression in Sci-Kit learn to make sure we have the right results.

We import the linear regression module, start a new linear regression, update it to fit the data (in-place!), and print the coefficients. Notice that the linear regression already adds a constant:

In [None]:
import sklearn.linear_model
reg = sklearn.linear_model.LinearRegression()
reg.fit(x_as_array, y_as_array)
print(reg.coef_)

Finally, we compare to the value we found "by hand" with NumPy:

In [None]:
coeff_from_sklearn = reg.coef_[0, 0]
coeff_from_numpy = beta_hat[1, 0]
print(coeff_from_sklearn,
      coeff_from_numpy,
      abs(coeff_from_sklearn - coeff_from_numpy) < 1e-12,
      sep="\n")