<a href="https://colab.research.google.com/github/philberns/FTRN65/blob/main/ex3_python_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python environment
We will provide exercises and hand-ins in the form of Jupyter notebooks, and suggest one of two ways of running them.
* Anaconda - https://docs.anaconda.com/anaconda/install/
    * Many useful modules included by default
    * More can be installed with `conda install <package>` or `pip install <package>`
    * Run on your own computer or a server you control
    * Requires installation
* Google Colab - https://colab.research.google.com/
    * Many useful modules included by default
    * More can be installed with `!pip install <package>` in a notebook cell
    * Requires a Google account (you can use your student.lu.se account)
    * Revision history
    * Just open and go; can load files from Google Drive, GitHub or your local computer

# Python Intro

We will assume that everyone taking this course has some previous programming experience and either has experience with Python or can pick it up quickly enough. If you feel a need to brush up on your Python skills there are plenty of resources online for this, for example the [Python wiki](https://wiki.python.org/moin/BeginnersGuide/Programmers) has a page with Python tutorials for people with previous programming experience.

We will try to focus more on introducing you to some of the packages commonly used for machine learning. In this notebook we provide all the code needed to run the examples, but we encourage you to make changes and make sure you understand the basic functionality.


## PyPlot
The module `matplotlib.pyplot` has its origins in emulating MATLAB commands, but it is a library in its own right and can also be used in a Pythonic, object-oriented way. A simple tutorial can be found [here](https://matplotlib.org/tutorials/introductory/pyplot.html).
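
As a minimal sketch of the object-oriented style mentioned above, the following creates the figure and axes explicitly and calls methods on the axes object instead of relying on the implicit `plt.*` state. The data and labels are made up for illustration and are not part of the exercises.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data, not part of the exercises
x = np.linspace(0, 2 * np.pi, 50)

# Create the figure and axes explicitly, then call methods on the axes
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.scatter(x[::5], np.cos(x[::5]), label="cos(x), every 5th point")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()
```

The MATLAB-style calls used in the rest of this notebook (`plt.plot`, `plt.scatter`) draw on the current axes, so both styles can produce the same figure.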

## Numpy
NumPy offers high-performance vectorization and numerical computation tools for the Python language. Many libraries use it in the background, but you can also use it for matrix manipulation yourself.

Here we will define some data and do linear regression in NumPy. We will also set up polynomial features using NumPy to get a feeling for how the matrices are actually built in the background in libraries such as scikit-learn, where there are modules for automating this.

In [None]:
import sys
import numpy as np
import matplotlib.pyplot as plt

# Reshape to a column vector: -1 lets NumPy infer the number of rows, 1 fixes a single column
X_train = np.linspace(0, 5, 15).reshape(-1, 1)
y_train = np.sqrt(X_train) + 0.1 * np.random.randn(*X_train.shape)

X_true = np.linspace(0, 5, 100).reshape(-1, 1)
y_true = np.sqrt(X_true)

plt.scatter(X_train, y_train)
plt.plot(X_true, y_true)


Now that we have a simple dataset, we want to format it for linear regression. One simple assumption is that $y_i=\theta x_i+\epsilon$ where $\epsilon\sim\mathcal N(0, \sigma^2)$ is some measurement noise. We can then write this as $$y= X\theta + \epsilon ~~~\text{ for }~ y=\begin{bmatrix}y_1\\ \vdots\\ y_n\end{bmatrix}, X=\begin{bmatrix}x_1\\ \vdots\\ x_n\end{bmatrix}$$ The maximum likelihood (ML) estimate of $\theta$ will then be $\theta=(X^TX)^{-1}X^Ty$ as covered in the lectures.

In [None]:
X = X_train
y = y_train

theta = np.matmul(np.linalg.inv(np.matmul(X.T, X)), np.matmul(X.T, y))

y = np.matmul(X_true, theta)

plt.scatter(X_train, y_train)
plt.plot(X_true, y_true)
plt.plot(X_true, y)
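
The closed-form estimate above forms the inverse of $X^TX$ explicitly. As a hedged sketch, the same estimate can be obtained by solving the linear system with `np.linalg.solve` or by calling the least-squares routine `np.linalg.lstsq`, which usually behaves better numerically when $X^TX$ is close to singular. The fixed random seed is an assumption added here for reproducibility, and `@` is shorthand for `np.matmul`.

```python
import numpy as np

# Same kind of data as above; the seed is an assumption added for reproducibility
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 15).reshape(-1, 1)
y = np.sqrt(X) + 0.1 * rng.standard_normal(X.shape)

# Normal equations with an explicit inverse, as in the cell above
theta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Solve the linear system (X^T X) theta = X^T y without forming the inverse
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works directly on X
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Check that all three approaches agree
print(np.allclose(theta_inv, theta_solve), np.allclose(theta_inv, theta_lstsq))
```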

Instead of just assuming that $y$ is proportional to $X$ we can assume linearity and add a constant term,
$$y_i=\theta_0 + \theta_1 x_i=\begin{bmatrix}1 & x_i\end{bmatrix}\begin{bmatrix}\theta_0\\\theta_1\end{bmatrix}$$
$$\Rightarrow\underbrace{\begin{bmatrix}y_1\\\vdots\\y_n\end{bmatrix}}_y = \underbrace{\begin{bmatrix}1 & x_1\\\vdots\\1 & x_n\end{bmatrix}}_X\underbrace{\begin{bmatrix}\theta_0\\\theta_1\end{bmatrix}}_\theta$$

In [None]:
def add_ones_column(X):
    return np.hstack((np.ones(X.shape), X))

X = add_ones_column(X_train)
y = y_train

theta = np.matmul(np.linalg.inv(np.matmul(X.T, X)), np.matmul(X.T, y))

y = np.matmul(add_ones_column(X_true), theta)

plt.scatter(X_train, y_train)
plt.plot(X_true, y_true)
plt.plot(X_true, y)

Let's also add the monomials of $x$ up to degree $n$ as additional features.
$$y_i=\theta_0 + \theta_1 x_i + \dots + \theta_n x_i^n=\begin{bmatrix}1 & x_i & \dots & x_i^n \end{bmatrix}\begin{bmatrix}\theta_0\\\theta_1\\\vdots\\\theta_n\end{bmatrix}$$
Run this for a few different degrees of polynomials.

In [None]:
def create_monomials(X):
    # Choose degree of polynomial here, for 1 we should get the same as above
    n = 5
    terms = []
    for i in range(n + 1):
        terms.append(X**i)
    return np.hstack(terms)

X = create_monomials(X_train)
y = y_train

theta = np.matmul(np.linalg.inv(np.matmul(X.T, X)), np.matmul(X.T, y))

y = np.matmul(create_monomials(X_true), theta)

plt.scatter(X_train, y_train)
plt.plot(X_true, y_true)
plt.plot(X_true, y)
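
As mentioned earlier, scikit-learn has modules that automate this feature construction. The following sketch assumes `PolynomialFeatures` from `sklearn.preprocessing` is the relevant tool and checks that, for a single input feature, it produces the same matrix as the manual construction. The manual version is rewritten inline here so that the example runs on its own.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(0, 5, 15).reshape(-1, 1)
n = 5  # polynomial degree, matching create_monomials above

# Manual construction: columns x^0, x^1, ..., x^n
manual = np.hstack([X**i for i in range(n + 1)])

# scikit-learn builds the same columns; include_bias=True adds the x^0 column
auto = PolynomialFeatures(degree=n, include_bias=True).fit_transform(X)

print(manual.shape, auto.shape)
print(np.allclose(manual, auto))
```

With more than one input feature, `PolynomialFeatures` also adds cross terms such as $x_1 x_2$, which the manual loop above does not.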

## Scikit-Learn
Scikit-Learn is a huge library of tools for machine learning. It contains methods for downloading well-known datasets, for analysing and preprocessing data, for regression and classification, and for creating training pipelines.

For now we will just use it for loading a common dataset called the iris dataset.

In [None]:
from sklearn.datasets import load_iris

iris_obj = load_iris()

print("size of data {}\nnames of columns {}\ntarget label size {}\nlabel names {}".format(iris_obj.data.shape, iris_obj.feature_names, iris_obj.target.shape, iris_obj.target_names))

## Pandas
Now we want to use pandas DataFrames to explore the data a bit. Let's create a DataFrame for the data and one for the labels, and then join them together so we have everything in one place.

In [None]:
import pandas as pd

data = pd.DataFrame(iris_obj.data, columns=iris_obj.feature_names)

labels = pd.DataFrame(iris_obj.target, columns=pd.Index(["species"]))

iris = data.join(labels)

iris

Since we are just exploring for our own sake and not doing any learning, it would be nice to have the names of the species instead of the numeric labels.

In [None]:
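# Assign the result back instead of using inplace=True on a column, which newer pandas versions may not propagate to the DataFrame
iris["species"] = iris["species"].replace(dict(enumerate(iris_obj.target_names)))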
iris

`describe` provides some useful summary statistics. Each of the rows here (count, mean, std, ...) is also available as its own method if you want it separately.

In [None]:
iris.describe()
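
As a small sketch of the separate methods, the following compares `describe` with `mean`, `std` and `median`. It uses a made-up two-column DataFrame (the column names `length` and `width` are hypothetical) rather than the iris data, so that it runs on its own.

```python
import pandas as pd

# Small made-up frame standing in for the iris data
df = pd.DataFrame({"length": [1.0, 2.0, 3.0, 4.0], "width": [0.5, 0.5, 1.5, 1.5]})

summary = df.describe()   # count, mean, std, min, quartiles, max
means = df.mean()         # same values as the "mean" row of describe()
stds = df.std()           # sample standard deviation, the "std" row
medians = df.median()     # same values as the "50%" row

print(summary)
print(means)
```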

We can also apply these over groupings, such as showing the information for each type of flower, to gain some insight into which features might be useful for classifying them.

In [None]:
iris.groupby("species").describe()

We can also use `apply` to apply a function over each column to aggregate some custom statistic. For example if we wanted to find the difference between the minimum and maximum value for each feature we could create a function that finds that for a column and apply that over all columns.

In [None]:
def feature_range(c):
    return c.max() - c.min()

iris.iloc[:, 0:4].apply(feature_range)

And we can also do this over the grouped data, but then we use `aggregate` instead.

In [None]:
iris.groupby("species").aggregate(feature_range)
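
`aggregate` also accepts a list of functions, which gives one output column per function. The sketch below uses a made-up DataFrame with hypothetical group names `a` and `b` so that it runs on its own; the same pattern should carry over to the grouped iris data.

```python
import pandas as pd

# Made-up measurements for two hypothetical groups
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 2.0, 2.5],
})

def feature_range(c):
    # Difference between the largest and smallest value in a column
    return c.max() - c.min()

# Built-in aggregations can be named by string; custom ones are passed as functions
stats = df.groupby("species")["length"].aggregate(["min", "max", feature_range])
print(stats)
```

The custom function's name is used as the column label in the result.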