
  

# MACHINE LEARNING FOR RESEARCHERS

# Notebook 1. Introduction to Machine Learning algorithms


    
This notebook introduces the most basic supervised machine learning methods, both in the context of <em>regression</em> and <em>classification</em>. 

In particular, the following contents are covered: 

<ol>
    <li> Linear regression </li>
    <li> Logistic regression </li>
    <li> K-NN classifiers </li>
</ol>

It is highly recommended that this notebook be read and run after a first reading of the theory, in parallel with the slides available in AV. 
Note also that you are not required to develop any code. All examples are fully implemented, so these notebooks should be regarded as demonstrative material. The goal is to understand how the algorithms operate. The notebook contains several questions that have to be submitted through AV. 

The code used for loading and plotting the Iris and MNIST data sets has been taken from <a href=https://github.com/ageron/handson-ml2>Geron (GitHub site)</a>. Please consult the textbook for reference. 

## Linear regression

Supervised learning models that predict an output taking values in a continuous domain are called <b>regression</b> models. Some examples of these systems are stock price predictors, house price predictors, predictors of the progression of a disease after a given amount of time, etc. 

Linear regression models are those which can be expressed through hypothesis functions of the form:

  \begin{equation}
  y(\pmb{x}, \pmb{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\pmb{x}).
  \end{equation}

where the vector $\pmb{x}$ is $D$-dimensional, the basis $\{\phi_m\}$ consists of $M-1$ functions, $w_0$ is called the <em>bias</em> parameter, and $\pmb{w}$ denotes the
  vector $(w_0,\ldots,w_{M-1})^T$. By assuming a dummy basis function
  $\phi_0(\pmb{x}) = 1$, we get:

  \begin{equation}
  y(\pmb{x}, \pmb{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\pmb{x}) = \pmb{w}^T \pmb{\phi}(\pmb{x})
  \end{equation}


These models are called <em>linear</em> because the hypotheses are linear functions of the weights (even if the basis functions are non-linear functions of $\pmb{x}$).

The loss function, which measures the error in fitting the training set, is then given by the <b>squared error</b>:

\begin{equation}
  J(\pmb{w}) = \frac{1}{2} \sum_{n=1}^N (\pmb{w}^T \pmb{\phi}(\pmb{x}_n)- t_n)^2
  \end{equation}

We will first consider the simplest regression case, with a single-dimensional input space and a single target. A toy data set, similar to the one used in the slides, is created and shown next:

In [None]:
# Libraries required
import numpy as np
import os

# Required to use matplotlib inside the notebook
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# We set a fixed seed to get always the same results
np.random.seed(1)

# Data set generation
X = np.linspace(0,1,num=10)
X = np.array([X]).T # Since X is single-dimensional, we have to convert it to a matrix to transpose it 
t = np.sin(2*np.pi*X) + 0.1*np.random.randn(10, 1)
xfull = np.linspace(0,1,num=100);
xfull = np.array([xfull]).T
treal = np.sin(2*np.pi*xfull);

# Data set plotting
plt.plot(X, t, "b.")
plt.plot(xfull, treal, "g-")
plt.xlabel("$x$", fontsize=18)
plt.ylabel("$t$", rotation=0, fontsize=18)
plt.axis([0, 1, -1.5, 1.5])
plt.show()

### Trivial feature space

The simplest approach is to work directly in the input space, that is, with the basis $\{\phi_m\} = \{1, x\}$ (note that $\pmb{x}$ is single-dimensional in this case, so it is simply written as $x$). 

Finding the best weights ($\pmb{w}$) fitting the data set requires minimizing the loss function by solving the normal equations:

\begin{equation}
  \pmb{w}^* = (\pmb{\Phi}^T \pmb{\Phi})^{-1} \pmb{\Phi}^T \pmb{t}
\end{equation}

where the design matrix $\pmb{\Phi}$ is:

  \begin{equation}
  \pmb{\Phi} = \begin{bmatrix}
  \phi_0(\pmb{x}_1) & \phi_1(\pmb{x}_1) & \dots & \phi_{M-1}(\pmb{x}_1) \\
  \phi_0(\pmb{x}_2) & \phi_1(\pmb{x}_2) & \dots & \phi_{M-1}(\pmb{x}_2) \\
  \vdots & \vdots & \ddots & \vdots \\
  \phi_0(\pmb{x}_N) & \phi_1(\pmb{x}_N) & \dots & \phi_{M-1}(\pmb{x}_N) \\
  \end{bmatrix} =\begin{bmatrix}
  1 & x_1  \\
1 & x_2  \\
  \vdots & \vdots \\
1 & x_N  \\
  \end{bmatrix}
  \end{equation}
  
  The following functions find the best solution given the design matrix, and predict the targets of new points (given in the feature space $\pmb{\phi}$):

In [None]:
def fit(Phi,t):
    w = np.linalg.inv(Phi.T.dot(Phi)).dot(Phi.T).dot(t)
    return w

def predict(w, phi):
    y = phi.dot(w)  # When Φ contains several points (rows) to predict, 
                    # instead of evaluating it one by one as in eq. (9) of the unit 1 slides 
                    # it is more simple to compute in matrix form as y=Φw
    return y

Now, let's create the design matrix and find the best fit for the toy example:

In [None]:
Phi = np.c_[np.ones((X.shape[0], 1)), X] # Each point x is transformed to feature space (1,x)
wopt = fit(Phi, t)

xnew = np.arange(start=0.0,stop=1.0,step=0.01)
phinew = np.c_[np.ones((xnew.shape[0], 1)), xnew]  # Each point x is transformed to feature space (1,x)
y = predict(wopt, phinew)

plt.plot(xnew, y, "r-")
plt.plot(xfull, treal, "g-")
plt.plot(X, t, "b.")
plt.axis([0, 1, -1.5, 1.5])
plt.show()

### Polynomial feature space

What do you think about the previous result? 

It is actually a rather limited approximation. The cause is the lack of flexibility in the model (<b>it can only produce lines!</b>). The solution is to work with a more expressive feature space. Let's try, for example, the feature space composed of polynomials up to degree $M-1$, that is, $\{\phi_m\} = \{1,x,x^2,\ldots,x^{M-1}\}$. The design matrix is now:

\begin{equation}
\pmb{\Phi} = \begin{bmatrix}
1 & x_1 & x_1^2 & \ldots & x_1^{M-1} \\
1 & x_2 & x_2^2 & \ldots & x_2^{M-1}  \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_N & x_N^2 & \ldots & x_N^{M-1} \\
\end{bmatrix}
\end{equation}
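
As a side note, a polynomial design matrix of this form can also be built in one call. The snippet below is only a minimal sketch: the helper name `polynomial_design_matrix` and the demo variables are hypothetical and not part of the notebook, and it assumes a one-dimensional input as in the toy example.

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Return the N x M matrix with columns 1, x, x^2, ..., x^(M-1)."""
    x = np.asarray(x, dtype=float).ravel()      # also accepts column vectors
    return np.vander(x, N=M, increasing=True)   # increasing=True puts the ones column first

x_demo = np.linspace(0, 1, num=5)
Phi_demo = polynomial_design_matrix(x_demo, M=4)
print(Phi_demo.shape)
```

Each row corresponds to one point and each column to one basis function, which matches the layout of $\pmb{\Phi}$ above.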


The <b>fit</b> and <b>predict</b> functions do not change; only the transformation to the feature space does. 

The next code shows the results for polynomial degrees 1 to 9 (the plot titles label the degree as M).

In [None]:
M = 9
Phi = np.c_[np.ones((X.shape[0], 1))]
xnew = np.arange(start=0.0,stop=1.0,step=0.01)
phinew = np.ones((xnew.shape[0],1))
fig, ax = plt.subplots(nrows=M, ncols=1, figsize=(5,15))
fig.tight_layout()

for m in range(1, M+1):
    Phi = np.c_[Phi,X**m] # Add column X^m 
    wopt = fit(Phi, t)
    
    phinew = np.c_[phinew, xnew**m]  # Add column xnew^m
    y = predict(wopt, phinew)

    ax[m-1].set_ylim(-1.5,1.5)
    ax[m-1].set_title(f'M={m}')
    ax[m-1].plot(xfull, treal, "g-")
    ax[m-1].plot(xnew, y, "r-")
    ax[m-1].plot(X, t, "b.")
    ax[m-1].axis([0, 1, -1.5, 1.5])

Clearly, the quality of the result depends on the model complexity (M). As stated before, inflexible models underfit the data (e.g., M=1, 2). On the other hand, models that are too flexible overfit the data (e.g., M=8, 9). Neither underfit nor overfit models generalize correctly to new data. 
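
The effect described above can also be quantified instead of judged by eye. The following is a rough sketch under stated assumptions: it regenerates its own toy data with a different random generator than the notebook cells, uses `np.linalg.lstsq` rather than the notebook's `fit` function, and compares the error on the 10 training points with the error on a freshly drawn, denser validation sample. The helper names `make_data` and `rmse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on an evenly spaced grid."""
    x = np.linspace(0, 1, num=n)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
    return x, t

def rmse(w, x, t):
    """Root-mean-square error of the polynomial with weights w."""
    Phi = np.vander(x, N=len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

x_train, t_train = make_data(10)
x_val, t_val = make_data(100)   # fresh noise on a denser grid

train_rmse, val_rmse = [], []
for degree in range(0, 10):
    Phi = np.vander(x_train, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)   # least-squares fit
    train_rmse.append(rmse(w, x_train, t_train))
    val_rmse.append(rmse(w, x_val, t_val))
    print(f"degree={degree}  train RMSE={train_rmse[-1]:.3f}  validation RMSE={val_rmse[-1]:.3f}")
```

The training error can only go down as the degree grows, because each model contains the previous one; it is the validation error that reveals when the extra flexibility stops helping.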

### Regularization

To control the complexity of the model automatically, a regularization term is added to the loss function: 

\begin{equation}
  \widetilde{J}(\pmb{w}) = \frac{1}{2} \sum_{n=1}^N [\sum_{j=0}^{M-1} w_j \phi_j(\pmb{x}_n) - t_n]^2 + \frac{\lambda}{2}\sum_{j  =0}^{M-1}|w_j|^q
  \label{LOSSREG}
  \end{equation}

The term $\lambda$ is the regularization hyper-parameter, and $q$ determines the type of regularization. So-called ridge regression uses $q=2$ and has the advantage that the normal equations keep a closed-form
  solution:
  
  \begin{equation}
  \pmb{w}^* = (\lambda I + \pmb{\Phi}^T \pmb{\Phi})^{-1} \pmb{\Phi}^T \pmb{t}
  \label{WOPTREG}
  \end{equation}
  
  Let's create a new function implementing this approximation:

In [None]:
def fitridge(L,Phi,t):
    w = np.linalg.inv(L*np.eye(Phi.shape[1])+Phi.T.dot(Phi)).dot(Phi.T).dot(t)
    return w

and run it for different values of $\lambda$ (L in the function):

In [None]:
# Note that Phi and phinew are still the ones created in the previous code block for M=9
LAMBDAS = 7
fig, ax = plt.subplots(nrows=LAMBDAS, ncols=1, figsize=(5,15))
fig.tight_layout()

for i in range(LAMBDAS):
    lnL = -i*3;
    wopt = fitridge(np.exp(lnL),Phi, t)
    y = predict(wopt, phinew)

    ax[i].set_ylim(-1.5,1.5)
    ax[i].set_title(rf'ln $\lambda$={lnL}')  # raw f-string so that \l is not read as an escape sequence
    ax[i].plot(xfull, treal, "g-")
    ax[i].plot(xnew, y, "r-")
    ax[i].plot(X, t, "b.")
    ax[i].axis([0, 1, -1.5, 1.5])
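
The shrinking effect of $\lambda$ on the weights can also be checked numerically. The snippet below is a sketch with its own toy data (not the notebook's random draws); the function name `fit_ridge_solve` is hypothetical, and it solves the regularized linear system with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical choice rather than the method used in the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, num=10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)
Phi = np.vander(x, N=10, increasing=True)   # polynomial features up to degree 9

def fit_ridge_solve(lam, Phi, t):
    """Ridge solution obtained by solving (lam*I + Phi^T Phi) w = Phi^T t."""
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

lambdas = [np.exp(-18.0), np.exp(-9.0), 1.0]
norms = [np.linalg.norm(fit_ridge_solve(lam, Phi, t)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"ln lambda = {np.log(lam):6.1f}   ||w|| = {norm:.3f}")
```

Larger values of $\lambda$ give weight vectors of smaller norm, which is what makes the fitted curve smoother.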

### K-fold cross-validation

To determine the optimal $\lambda$ hyper-parameter, cross-validation can be used. Cross-validation consists of setting aside a subset of the training points and checking the accuracy of the prediction on them (note that these points cannot be used for training). K-fold cross-validation partitions the training set into K disjoint folds, uses each fold in turn as the validation set while training on the remaining K-1 folds, and averages the results. The best hyper-parameters are those with the best <b>average</b> performance (here, the lowest average validation loss). 


The next code performs this operation: 

In [None]:
from sklearn.model_selection import KFold # import KFold

LAMBDAS = 20
K = 5
kf = KFold(n_splits=K, shuffle=True) # Define the split - into 5 folds 
kf.get_n_splits(Phi) # returns the number of splitting iterations in the cross-validator
loss = np.zeros([LAMBDAS])

for  i in range(LAMBDAS):
    lnL = -i

    for train_index, validation_index in kf.split(Phi):
        # print(train_index, validation_index)
        Phi_train, Phi_validation = Phi[train_index], Phi[validation_index]
        t_train, t_validation = t[train_index], t[validation_index]
    
        wopt = fitridge(np.exp(lnL),Phi_train, t_train)
        y = predict(wopt, Phi_validation)
        loss[i] += np.sum((t_validation - y)**2)  # Scalar sum of squared errors on the validation fold
        
# Loss plotting versus -ln $\lambda$
plt.plot(np.arange(0,LAMBDAS), np.log(loss/K), "*-")
plt.xlabel(r"$- \ln\;\lambda$", fontsize=18)  # raw strings keep the LaTeX backslashes intact
plt.ylabel(r"$\ln\;\overline{loss}$", rotation=90, fontsize=18)
plt.show()

The result shows a minimal loss for a regularization hyper-parameter of about $-\ln \lambda \sim 10-12$. Note that the method is subject to randomness, since the assignment of points to folds is random; different executions therefore yield slightly different results. For this reason, it is recommended that the final estimate of the prediction loss be computed on a separate test set, split off from the training and validation sets <b>before</b> the cross-validation procedure.
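
To make the fold mechanics and the role of randomness explicit, the splitting step can be sketched by hand. This is an illustrative sketch only: `kfold_indices` is a hypothetical helper, not the scikit-learn implementation, and the fixed seed is an assumption added here so that repeated runs produce the same folds.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs; the k validation folds partition the data."""
    permutation = rng.permutation(n_samples)      # random assignment of points to folds
    folds = np.array_split(permutation, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng = np.random.default_rng(0)                    # fixed seed: same folds on every run
splits = list(kfold_indices(n_samples=10, k=5, rng=rng))
for train_idx, val_idx in splits:
    print("train:", np.sort(train_idx), "validation:", np.sort(val_idx))
```

Every point appears in exactly one validation fold and is never used for training in the round where it is validated.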

### Fitting using gradient descent optimization

Solving the normal equations can be computationally demanding when the number of instances in the training set is large. An alternative, which is also valid when no closed-form solution is available, is gradient descent. It is an iterative method that starts from some (e.g., randomly chosen) point and, at each iteration, moves it in the direction opposite to the steepest increase of the loss (since the goal is to find the minimum), that is, along $-\nabla J(\pmb{w})$. In summary:

<ol>
<li>
  Set initial weights for $\pmb{w}$ (e.g., randomly)
<li> Repeat:
  \begin{equation*}
  \pmb{w}^{(\text{new})} = \pmb{w}^{(\text{old})} - \eta \nabla J(\pmb{w}^{(\text{old})}) = \pmb{w}^{(\text{old})} + \eta \pmb{\Phi}^T [\pmb{t} - \pmb{\Phi} \pmb{w}^{(\text{old})}]
  \end{equation*}
  until convergence.
</ol>

$\eta$ is called the learning rate; it controls the trade-off between the stability of convergence and the speed of the algorithm.

Let's implement a fit function based on gradient descent and try it on the previous example for different learning rates.

In [None]:
def gradient(Phi,t,w):
    # Gradient of the squared-error loss J(w) with respect to w
    gradients = -Phi.T.dot(t - Phi.dot(w))
    return gradients

def fitGD(Phi,t,eta):
    n_iterations = 1000
    w = np.random.randn(Phi.shape[1],1) 
    for iteration in range(n_iterations):
        w = w - eta * gradient(Phi,t,w)
    return w
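
The "until convergence" step of the algorithm can also be made explicit. The variant below is a sketch, not the notebook's `fitGD`: the function name `fit_gd_tol`, the tolerance and the iteration cap are assumptions chosen for illustration. It also uses the fact that, for the squared-error loss, the iteration is stable only for learning rates below $2/\lambda_{\max}(\pmb{\Phi}^T\pmb{\Phi})$.

```python
import numpy as np

def fit_gd_tol(Phi, t, eta, tol=1e-10, max_iter=100_000):
    """Batch gradient descent on the squared error, stopped when the gradient is tiny."""
    w = np.zeros((Phi.shape[1], 1))
    for iteration in range(max_iter):
        grad = -Phi.T @ (t - Phi @ w)
        if np.linalg.norm(grad) < tol:      # convergence test
            break
        w = w - eta * grad
    return w, iteration

x = np.linspace(0, 1, num=10).reshape(-1, 1)
t = 2.0 * x + 1.0                           # noiseless line, so the optimum is w = (1, 2)
Phi = np.c_[np.ones_like(x), x]

# Largest stable learning rate for this quadratic loss
eta_max = 2.0 / np.linalg.eigvalsh(Phi.T @ Phi).max()
w_gd, n_iter = fit_gd_tol(Phi, t, eta=0.5 * eta_max)
print(f"eta_max = {eta_max:.4f}, stopped after {n_iter} iterations, w = {w_gd.ravel()}")
```

Computing `eta_max` for a given design matrix gives a quick way to see why some of the larger learning rates tried in this section may diverge.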

In [None]:
import warnings
warnings.filterwarnings('ignore')

M = 9
valuesETAS = [0.001, 0.01, 0.1, 1]
ETAS = len(valuesETAS)
Phi = np.c_[np.ones((X.shape[0], 1))]
xnew = np.arange(start=0.0,stop=1.0,step=0.01)
phinew = np.ones((xnew.shape[0],1))
fig, ax = plt.subplots(nrows=M, ncols=ETAS, figsize=(10,15))
fig.tight_layout()

for m in range(1,M+1):
    Phi = np.c_[Phi,X**m] # Add column X^M 
    phinew = np.c_[phinew, xnew**m]  # Add column xnew^M
        
    for e in range(ETAS):
        eta = valuesETAS[e]
        wopt = fitGD(Phi, t, eta)
        y = predict(wopt, phinew)

        ax[m-1,e].set_ylim(-1.5,1.5)
        ax[m-1,e].set_title(rf'M={m}, $\eta$={eta}')  # raw f-string so that \e is not read as an escape sequence
        ax[m-1,e].plot(xfull, treal, "g-")
        ax[m-1,e].plot(xnew, y, "r-")
        ax[m-1,e].plot(X, t, "b.")
        ax[m-1,e].axis([0, 1, -1.5, 1.5])

Small learning rates converge more slowly, so they require more iterations and more time. Large learning rates are quicker but may fail to converge (as shown in the right-most column of the previous plot). A natural follow-up experiment is to measure, on random data, the time difference between both fitting methods (normal equations and gradient descent) for different training-set sizes.
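
The timing comparison mentioned above could be sketched as follows. Everything here is an assumption made for illustration: the data are random Gaussian features, the sizes and the number of iterations are arbitrary, and the helper names `fit_normal_equations` and `fit_gradient_descent` are hypothetical. The measured times depend on the machine and on these choices, so no particular outcome is claimed.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def fit_normal_equations(Phi, t):
    """Solve Phi^T Phi w = Phi^T t directly."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

def fit_gradient_descent(Phi, t, n_iterations=200):
    """Plain batch gradient descent with a step size of 1 / lambda_max."""
    eta = 1.0 / np.linalg.eigvalsh(Phi.T @ Phi).max()   # safe step size for this loss
    w = np.zeros((Phi.shape[1], 1))
    for _ in range(n_iterations):
        w = w + eta * Phi.T @ (t - Phi @ w)
    return w

results = []
for n_samples in (100, 1_000, 10_000):
    Phi = rng.standard_normal((n_samples, 5))
    t = Phi @ rng.standard_normal((5, 1)) + 0.1 * rng.standard_normal((n_samples, 1))

    start = time.perf_counter()
    w_ne = fit_normal_equations(Phi, t)
    time_ne = time.perf_counter() - start

    start = time.perf_counter()
    w_gd = fit_gradient_descent(Phi, t)
    time_gd = time.perf_counter() - start

    results.append((n_samples, time_ne, time_gd, float(np.max(np.abs(w_ne - w_gd)))))
    print(f"N={n_samples:6d}  normal eq.: {time_ne:.5f}s  gradient descent: {time_gd:.5f}s")
```

Both methods reach the same weights here, so the only difference being measured is the computation time.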

### What happens if we have more data?

Overfitting can be reduced if more training data is available. As an example, we repeat the first experiment with a slightly larger data set of $N=20$ points. As the next figures show, a model of moderate complexity no longer suffers from overfitting as it did before.  

In [None]:
# Data set generation
X = np.linspace(0,1,num=20)
X = np.array([X]).T # Since X is single-dimensional, we have to convert it to a matrix to transpose it 
t = np.sin(2*np.pi*X) + 0.1*np.random.randn(20, 1)
xfull = np.linspace(0,1,num=100);
xfull = np.array([xfull]).T
treal = np.sin(2*np.pi*xfull);

M = 9
Phi = np.c_[np.ones((X.shape[0], 1))]
xnew = np.arange(start=0.0,stop=1.0,step=0.01)
phinew = np.ones((xnew.shape[0],1))
fig, ax = plt.subplots(nrows=M, ncols=1, figsize=(5,15))
fig.tight_layout()

for m in range(1, M+1):
    Phi = np.c_[Phi,X**m] # Add column X^m 
    wopt = fit(Phi, t)
    
    phinew = np.c_[phinew, xnew**m]  # Add column xnew^m
    y = predict(wopt, phinew)

    ax[m-1].set_ylim(-1.5,1.5)
    ax[m-1].set_title(f'M={m}')
    ax[m-1].plot(xfull, treal, "g-")
    ax[m-1].plot(xnew, y, "r-")
    ax[m-1].plot(X, t, "b.")
    ax[m-1].axis([0, 1, -1.5, 1.5])


### Regression with sklearn

Next, we discuss how to work directly with the scikit-learn library (<b>sklearn</b> for short), a widely used Python library for machine learning. Let's start with a linear regression in the trivial feature space, as done earlier:

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, t)
y = lin_reg.predict(xfull) 

plt.plot(xfull, y, "r-") # Plot the predictions against the points they were computed for
plt.plot(xfull, treal, "g-")
plt.plot(X, t, "b.")
plt.axis([0, 1, -1.5, 1.5])
plt.show()

We can ask sklearn to use a feature space, as before:

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=True) # Degree M=2 with bias term
Phi = poly_features.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(Phi, t)
phifull = poly_features.transform(xfull)
y = lin_reg.predict(phifull)

plt.plot(xfull, y, "r-")
plt.plot(xfull, treal, "g-")
plt.plot(X, t, "b.")
plt.axis([0, 1, -1.5, 1.5])
plt.show()

A list of options for LinearRegression can be obtained by executing the next command (in general, you can get help for any Python object or function this way in a notebook):

In [None]:
LinearRegression?

As can be seen in the documentation, predictors offer many convenient methods, for example:

In [None]:
print(f'The coefficient of determination (R^2) on the training set is {lin_reg.score(Phi,t):.4f}')
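
For regressors, the `score` method returns the coefficient of determination $R^2$, not a classification accuracy. As a sketch of what that number means, it can be computed by hand; the function name `r2_score_manual` and the demo values are hypothetical.

```python
import numpy as np

def r2_score_manual(t_true, t_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    t_true = np.asarray(t_true, dtype=float).ravel()
    t_pred = np.asarray(t_pred, dtype=float).ravel()
    ss_res = np.sum((t_true - t_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((t_true - t_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

t_demo = np.array([1.0, 2.0, 3.0, 4.0])
print(r2_score_manual(t_demo, t_demo))                      # perfect fit
print(r2_score_manual(t_demo, np.full(4, t_demo.mean())))   # constant mean prediction
```

A value of 1 means a perfect fit, 0 means the model does no better than always predicting the mean target, and negative values are possible for worse models.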

Besides, regularizers like Ridge, Lasso or ElasticNet are at hand: 

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

poly_features = PolynomialFeatures(degree=20, include_bias=True) # Degree M=20 with bias term
Phi = poly_features.fit_transform(X)
lin_reg = LinearRegression() # Unregularized baseline for comparison
lin_reg_other = Lasso(alpha=np.exp(-10)) # The regularization parameter is called $\alpha$ in scikit-learn
lin_reg.fit(Phi, t)
lin_reg_other.fit(Phi, t)

phifull = poly_features.transform(xfull)
y = lin_reg.predict(phifull)
y_other = lin_reg_other.predict(phifull)

plt.plot(xfull, y_other, "b-") # Lasso in blue
plt.plot(xfull, y, "r-") # Unregularized linear regression in red 
plt.plot(xfull, treal, "g-")
plt.plot(X, t, "b.")
plt.axis([0, 1, -1.5, 1.5])
plt.show()

In the next section we develop a full, realistic regression example, which also shows how to use other functionality, such as cross-validation, directly from Python libraries.

***
### Question 1
> **Propose some alternative feature space for the previous data set. For example, one alternative feature space could be {1,x,exp(x),exp(x^2)}. Make use of the fact that the data are generated from a sin function**
***

## A full regression example: The California House Value Data Set

In this section we provide a full example on a realistic data set aimed at predicting the median house values in Californian districts, given a number of features from these districts. This example is developed in Geron's chapter 2.

First, the data set is fetched and some rows of the data set are shown as an example. 

<b>Note 1:</b> The data set is handled using Pandas. Pandas is a library of Python for data set manipulation. Interested students can consult the book "Python Data Science Handbook" by Jake VanderPlas, which provides a clear and thorough description of this library (and others), and is available in AV.

### Get data set

In [None]:
import os
import tarfile
import urllib.request  # the request submodule must be imported explicitly

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

fetch_housing_data()

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()

### Let's show the data set

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend()

### Prepare the data for model fitting

Next, the raw table is prepared for use in the regression model. The following steps are carried out:

<ol>
    <li> Computation of representative features (e.g. rooms_per_household)
    <li> Separation of targets and features
    <li> Substitution of missing features (e.g. number of rooms) for the median value of the column
    <li> Substitution of categorical variables (e.g. ocean_proximity) for a 1-of-K (one-hot) encoding
    <li> Separation of training and test sets
</ol>

You can skip most of this step, since it involves manipulating dataframes and ndarrays. The important part is that, at the end, the result is divided into training and test sets, and the targets are provided in separate vectors as well.  

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

housing = load_housing_data() # Since we perform destructive change, we reload the data
targets = housing["median_house_value"].copy()
housing = housing.drop("median_house_value", axis=1) # drop labels for training set

# column index
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]


housing_num = housing.drop("ocean_proximity", axis=1)

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
train_set, test_set, targets_train, targets_test = train_test_split(housing_prepared, targets, test_size=0.2) # Reserve 20% of instances for testing purposes

### Linear regression model with Ridge regularization and Cross-Validation

Finally, a regression can be easily implemented with all of the functionality discussed earlier, by using the sklearn modules:

In [None]:
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

X = train_set
X_test = test_set
t = targets_train
t_test = targets_test

# Now using Ridge with cross-validation
alphas = np.exp(-np.arange(start=1,stop=5,step=.1)).tolist()
# The features were already standardized by StandardScaler in the pipeline above,
# so the `normalize` option (removed in recent scikit-learn releases) is not used
ridge_reg = RidgeCV(alphas=alphas, 
                    cv=10, 
                    fit_intercept=True)
ridge_reg.fit(X, t)
y = ridge_reg.predict(X)
y_test = ridge_reg.predict(X_test)
rmse_train = np.sqrt(mean_squared_error(t, y))
rmse_test = np.sqrt(mean_squared_error(t_test, y_test))

# For regressors, score() returns the coefficient of determination R^2
print(f'Ridge regression selected the regularizer term {ridge_reg.alpha_:.05f}')
print(f'The score (rmse) in the training set is {rmse_train:.00f}, with R^2 of {ridge_reg.score(X,t)*100:.02f}%')
print(f'The score (rmse) in the test set is {rmse_test:.00f}, with R^2 of {ridge_reg.score(X_test,t_test)*100:.02f}%')

***
### Question 2
> **Try to explain why the results got for the last example are rather poor, and propose some suitable idea to improve the results using only linear regression and linear classification**
***

## Logistic regression

A classifier tries to predict an output belonging to a finite set of values. For example, a classifier may have the purpose of indicating whether or not a patient suffers from a pathology based on various input data or features. The target in this example is binary (YES/NO). As another example, for the MNIST dataset the goal can be  predicting the number that appears in an image (0/.../9). In a musical classification application the goal can be determining to which genre a song belongs (ROCK/POP/CLASSIC/...), etc.

More formally, in <b>classification</b> problems the goal is to take a $D$-dimensional input vector $\pmb{x}$ and assign it to one of $K$ classes $\mathcal{C}_k$ for $k$=$1,\ldots,K$. As in the regression case we assume an input transformation to a <b>feature space</b> using a set of fixed (non-linear) $M-1$ basis functions $\{\phi_m(\pmb{x})\}$, for $m$=$1,\ldots,M-1$ and the dummy function $\phi_0(\pmb{x})$ = 1.

Thus, in matrix notation, $\pmb{\phi}$ = $(\phi_0(\pmb{x}),\ldots,\phi_{M-1}(\pmb{x}))^T$.

The regions assigned to different classes are separated by <b>decision boundaries</b>. The term <b>linear classification</b> is used when these decision boundaries (in the feature space) are hyperplanes. If a data set can be separated without misclassification errors by some hyperplane, then it is called <b>linearly separable</b>.

Logistic regression is a linear binary classification model which models the posterior probability of $\mathcal{C}_1$ using the <b>logistic sigmoid</b>:
  
\begin{equation}
y(\pmb{w}, \pmb{x}) = p(\mathcal{C}_1|\pmb{\phi}) = \sigma(\pmb{w}^T\pmb{\phi}) = \frac{1}{1+e^{-\pmb{w}^T\pmb{\phi}}}
\end{equation}

and the probability of $\mathcal{C}_2$ is its complement:

\begin{equation}
p(\mathcal{C}_2|\pmb{\phi}) = 1- p(\mathcal{C}_1|\pmb{\phi}) 
\end{equation}

Let's have a look at the shape of this function:

In [None]:
# Sigmoid plot

def sigmoid(z):
    return 1/(1+np.exp(-z))

z = np.linspace(-10, 10, 100)
sig = sigmoid(z)

plt.figure(figsize=(9, 3))
plt.plot([-10, 10], [0, 0], "k-")
plt.plot([-10, 10], [0.5, 0.5], "k:")
plt.plot([-10, 10], [1, 1], "k:")
plt.plot([0, 0], [-1.1, 1.1], "k-")
plt.plot(z, sig, "b-", linewidth=2, label=r"$\sigma(a) = \frac{1}{1 + e^{-a}}$")
plt.xlabel("a")
plt.legend(loc="upper left", fontsize=18)
plt.axis([-10, 10, -0.1, 1.1])
plt.show()

The final prediction is $\mathcal{C}_1$ when $y(\pmb{w}, \pmb{x}) \ge 0.5$, and $\mathcal{C}_2$ otherwise.
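
Since $\sigma(a) \ge 0.5$ exactly when $a \ge 0$, this rule is equivalent to checking the sign of $\pmb{w}^T\pmb{\phi}$, which is why the decision boundary is a hyperplane in the feature space. The snippet below is a small sketch of that equivalence; the weights `w_demo` and the inputs are made-up values for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w_demo = np.array([-1.0, 2.0])                 # hypothetical weights (bias, slope)
x_demo = np.linspace(-2, 3, num=11)
Phi_demo = np.c_[np.ones_like(x_demo), x_demo]

a = Phi_demo @ w_demo                          # activations w^T phi
by_probability = sigmoid(a) >= 0.5             # rule stated in terms of the probability
by_sign = a >= 0                               # same rule stated in terms of the activation
print(np.array_equal(by_probability, by_sign))
```

With these weights the boundary sits at $x = 0.5$, the point where $-1 + 2x = 0$.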

Consider a labeled data set $\pmb{X} = \{\pmb{x}_1,\ldots,\pmb{x}_N\}$, $\pmb{t} = \{t_1,\ldots,t_N\}$, where $t_n = 1$ if $\pmb{x}_n \in \mathcal{C}_1$, and 0 otherwise. Writing $y_n = y(\pmb{w}, \pmb{x}_n)$, the <b>cross-entropy</b> loss function is given by: 

\begin{equation}
J(\pmb{w}) = - \ln p(\pmb{t}|\pmb{w}) = - \sum_{n=1}^N [t_n \ln y_n + (1-t_n)\ln(1-y_n)]
\end{equation}

This function is convex, so any local minimum is also a global one, and it can be minimized using the iterative Newton-Raphson method:

\begin{equation*}
\pmb{w}^\text{(new)} = \pmb{w}^\text{(old)} - (\pmb{\Phi}^T \pmb{R} \pmb{\Phi})^{-1} \pmb{\Phi}^T (\pmb{y} - \pmb{t})
\end{equation*}

with $\pmb{R}$ an $N\times N$ diagonal matrix such that element $(\pmb{R})_{nn} = y_n (1-y_n)$.
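
The update above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation used later in the notebook: it starts from $\pmb{w} = 0$, keeps the diagonal of $\pmb{R}$ as a vector instead of building the $N\times N$ matrix, solves a linear system instead of inverting, and runs on a small hand-made data set that is deliberately not linearly separable so that a finite minimum exists. The names `cross_entropy` and `newton_step` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    a = Phi @ w
    # -ln sigma(a) = logaddexp(0, -a) and -ln(1 - sigma(a)) = logaddexp(0, a)
    return np.sum(t * np.logaddexp(0, -a) + (1 - t) * np.logaddexp(0, a))

def newton_step(w, Phi, t):
    y = sigmoid(Phi @ w)
    r = y * (1 - y)                        # diagonal of R, kept as a vector
    hessian = Phi.T @ (r[:, None] * Phi)   # Phi^T R Phi without an N x N matrix
    gradient = Phi.T @ (y - t)
    return w - np.linalg.solve(hessian, gradient)

# Small 1-D data set with two labels flipped, so the classes overlap
x = np.linspace(-2, 2, num=20)
t = (x > 0).astype(float)
t[7], t[12] = 1.0, 0.0
Phi = np.c_[np.ones_like(x), x]

w = np.zeros(2)
loss_start = cross_entropy(w, Phi, t)
for _ in range(25):
    w = newton_step(w, Phi, t)
loss_end = cross_entropy(w, Phi, t)
grad_norm = np.linalg.norm(Phi.T @ (sigmoid(Phi @ w) - t))
print(f"loss: {loss_start:.3f} -> {loss_end:.3f}, gradient norm: {grad_norm:.2e}")
```

On perfectly separable data the weights would grow without bound, which is one reason practical implementations add regularization or guard the entries of $\pmb{R}$ against becoming zero.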

### Iris data set

We are going to see how this method applies to an example data set known as the <b>Iris data set</b>, which contains measurements of flower petals and sepals, each labeled with the species to which the flower belongs. Our goal is to distinguish specimens of <a href=https://www.wikiwand.com/en/Iris_virginica>Iris virginica</a> from those of the two other Iris species in the data set. 

This data set is available with sklearn:

In [None]:
import numpy as np
import os
from sklearn import datasets
iris = datasets.load_iris()
# print(iris.DESCR) # Uncomment to show information about the data set

# Let's create the feature and the target set
X = iris["data"][:, (2, 3)]  # Petal length and width
t = (iris["target"] == 2).astype(int)  # np.int was removed from NumPy; the built-in int is equivalent
t = t.reshape([X.shape[0],1])

# Since only two features are used, we can plot the data set
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

plt.figure(figsize=(8, 6))
plt.plot(X[t.reshape(X.shape[0],)==0, 0], X[t.reshape(X.shape[0],)==0, 1], "bs")
plt.plot(X[t.reshape(X.shape[0],)==1, 0], X[t.reshape(X.shape[0],)==1, 1], "g^")

plt.text(3.5, 1.5, "Non Iris-Virginica", fontsize=14, color="b", ha="center")
plt.text(6.5, 2.3, "Iris-Virginica", fontsize=14, color="g", ha="center")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
plt.show()

Let's create <b>fit</b> and <b>predict</b> functions, as in the case of linear regression:

In [None]:
def fit(Phi,t): 
    n_iterations = 1000
    w = np.random.randn(Phi.shape[1],1) # Initial w at random
     
    for iteration in range(n_iterations):
        y = sigmoid(Phi.dot(w))
        R = np.diag(np.maximum(0.00001, np.multiply(y,1-y)).reshape([Phi.shape[0],])) # convert to array from matrix !
        w = w - np.linalg.inv(Phi.T.dot(R.dot(Phi))).dot(Phi.T).dot(y-t)
    
    return w
    
def predict(phi, w):
    h = (sigmoid(phi.dot(w))>0.5).astype(int)
    return h

and use them to find a linear separation boundary and plot it (note that the $\mathcal{C}_1$ probabilities are indicated as a contour plot in the resulting figure).

In [None]:
# We set a fixed seed to get always the same results
np.random.seed(23)

Phi = np.c_[np.ones([X.shape[0],1]), X]  # features: {1,x1,x2}
wopt = fit(Phi,t)
y = predict(Phi, wopt) 

accuracy = np.mean(y == t.reshape(t.shape[0],1))
print(f'Accuracy of {accuracy*100:.02f}%') 

# Plot the decision boundary
x0, x1 = np.meshgrid(np.linspace(2.9, 7, 500).reshape(-1, 1), np.linspace(0.8, 2.7, 200).reshape(-1, 1),)
xfull = np.c_[x0.ravel(), x1.ravel()]
phifull = np.c_[np.ones([xfull.shape[0],1]), xfull] 
yfull = sigmoid(phifull.dot(wopt))

zz = yfull.reshape(x0.shape)
plt.figure(figsize=(8, 6))

left_right = np.array([2.9, 7])
boundary = -(wopt[1] * left_right + wopt[0]) / wopt[2]

contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.plot(left_right, boundary, "k--", linewidth=3)
plt.plot(X[t.reshape(X.shape[0],)==0, 0], X[t.reshape(X.shape[0],)==0, 1], "bs")
plt.plot(X[t.reshape(X.shape[0],)==1, 0], X[t.reshape(X.shape[0],)==1, 1], "g^")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
plt.show()

### Spherical vs. Torus data set

In this second example, we generate an artificial data set with two classes that follow a spherical and a torus-like distribution, respectively (in two dimensions, a disc and a ring):

In [None]:
np.random.seed(1)
r = np.random.rand(50, 1) #0...1
a = 2*np.pi*np.random.rand(50,1) #0..2pi
X_0 = np.c_[r*np.cos(a), r*np.sin(a)]

r = 0.9 + np.random.rand(50, 1) #0.9...1.9
a = 2*np.pi*np.random.rand(50,1) #0..2pi
X_1 = np.c_[r*np.cos(a), r*np.sin(a)]

X = np.r_[X_0, X_1]
t = np.r_[np.zeros([50,1]),np.ones([50,1])]

# Plot data set
plt.figure(figsize=(6, 6))
plt.plot(X[t.reshape(X.shape[0],)==0, 0], X[t.reshape(X.shape[0],)==0, 1], "bs")
plt.plot(X[t.reshape(X.shape[0],)==1, 0], X[t.reshape(X.shape[0],)==1, 1], "g^")

plt.axis([-2, 2, -2, 2])
plt.show()

Clearly, the classes are not linearly separable in the input space, but it is easy to construct a feature space (with degree-2 polynomials) in which they are:

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=True) # Degree M=2 with bias term
Phi = poly_features.fit_transform(X)

wopt = fit(Phi,t)
y = predict(Phi, wopt) 

accuracy = np.mean(y == t.reshape(t.shape[0],1))
print(f'Accuracy of {accuracy*100:.02f}%') 

# Plot the decision boundary
x0, x1 = np.meshgrid(np.linspace(-2, 2, 500).reshape(-1, 1), np.linspace(-2, 2, 500).reshape(-1, 1),)
xfull = np.c_[x0.ravel(), x1.ravel()]
phifull = poly_features.transform(xfull)
yfull = sigmoid(phifull.dot(wopt))

zz = yfull.reshape(x0.shape)
plt.figure(figsize=(8, 6))

# The boundary is not a straight line in the input space, so only the 0.5 probability contour is drawn
contour = plt.contour(x0, x1, zz, cmap=plt.cm.tab10, levels=[0.5])
plt.plot(X[t.reshape(X.shape[0],)==0, 0], X[t.reshape(X.shape[0],)==0, 1], "bs")
plt.plot(X[t.reshape(X.shape[0],)==1, 0], X[t.reshape(X.shape[0],)==1, 1], "g^")
plt.axis([-2, 2, -2, 2])
plt.show()

Note that the <b>boundary in the feature space is a hyperplane</b>, even though it is not linear in the input space.
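
To see why a quadratic feature space is enough here, note that a single feature, the squared radius $x_1^2 + x_2^2$, already separates a disc from a ring with one threshold. The sketch below illustrates this under its own assumptions: it draws a fresh, larger sample with a different random generator than the notebook, the helper `ring_sample` is hypothetical, and the threshold is picked by hand rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

def ring_sample(r_min, r_max, n):
    """Points with radius uniform in [r_min, r_max) and uniform angle."""
    r = r_min + (r_max - r_min) * rng.random(n)
    angle = 2 * np.pi * rng.random(n)
    return np.c_[r * np.cos(angle), r * np.sin(angle)]

X_demo = np.r_[ring_sample(0.0, 1.0, n), ring_sample(0.9, 1.9, n)]
t_demo = np.r_[np.zeros(n), np.ones(n)]

# One quadratic feature: the squared radius x1^2 + x2^2
radius_sq = np.sum(X_demo ** 2, axis=1)
threshold = 0.95 ** 2                      # hand-picked, in the middle of the overlap region
prediction = (radius_sq > threshold).astype(float)
accuracy = np.mean(prediction == t_demo)
print(f"accuracy with a single threshold on x1^2 + x2^2: {accuracy * 100:.1f}%")
```

The remaining errors come from the radii between 0.9 and 1, where both classes overlap by construction. A threshold on the squared radius is a linear rule in the degree-2 feature space and a circle in the input space.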

***
### Question 3
> **Propose some feature space suitable for the data set created in the next example. Note that the condition for a point to be in one class or another is:**
\begin{equation}
t = \begin{cases}
\mathcal{C}_1\;\;\text{if $x>\sqrt{y}$ and $x<y^2$}\\
\mathcal{C}_2\;\;\text{otherwise}
\end{cases}
\end{equation}
***

In [None]:
np.random.seed(2)
x = 5*np.random.rand(100, 1) 
y = 5*np.random.rand(100, 1) 
X = np.c_[x, y] # Stack both coordinates as the columns of X

t = np.logical_and(X[:,0]>X[:,1]**(1/2),X[:,0]<X[:,1]**(2))

# Plot data set
plt.figure(figsize=(6, 6))
plt.plot(X[t.reshape(X.shape[0],)==0, 0], X[t.reshape(X.shape[0],)==0, 1], "bs")
plt.plot(X[t.reshape(X.shape[0],)==1, 0], X[t.reshape(X.shape[0],)==1, 1], "g^")

plt.axis([0, 5, 0, 5])
plt.show()

## A full logistic regression example: The MNIST Data Set

In this section we explore a more complex problem using the MNIST data set, and using the sklearn libraries: 

<ol>
<li> We consider a multi-class problem using the softmax loss function </li>
<li> We introduce regularization (and also cross-validation to determine $\lambda$) </li>
<li> A test set is separated from the data set to assess the accuracy of the results </li>
<li> Other performance measurements are computed using the confusion matrix </li>
</ol> 
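
As a reminder of what the softmax loss with regularization computes, here is a minimal NumPy sketch. The function names, the toy data and the exact form of the penalty ($\lambda\|\pmb{W}\|^2$, with no separate bias term) are assumptions made for illustration; sklearn's internal formulation may differ in scaling.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def regularized_cross_entropy(W, X, t, lam):
    """Mean cross-entropy plus an L2 (ridge) penalty lam * ||W||^2."""
    probs = softmax(X @ W)                        # shape (N, K)
    n = X.shape[0]
    data_term = -np.mean(np.log(probs[np.arange(n), t]))
    return data_term + lam * np.sum(W ** 2)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 3))       # 5 toy instances, 3 features
t_toy = np.array([0, 2, 1, 1, 0])     # integer class labels, K = 3
W_toy = np.zeros((3, 3))              # all-zero weights give uniform probabilities

probs_toy = softmax(X_toy @ W_toy)
loss_toy = regularized_cross_entropy(W_toy, X_toy, t_toy, lam=0.1)
print(probs_toy[0], loss_toy)
```

With all-zero weights every class receives probability $1/3$, so the loss equals $\log 3$.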

### MNIST 

First, we download (it may take some time) and show some examples of the MNIST data set, which was described in Unit 1.

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False) # as_frame=False returns NumPy arrays
# print(mnist.DESCR) # Uncomment to show description
X, t = mnist["data"], mnist["target"] # N = 70000, D = 784 
X_train, X_test, t_train, t_test = X[:60000], X[60000:], t[:60000], t[60000:]

In [None]:
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = plt.cm.binary,
               interpolation="nearest")
    plt.axis("off")
    
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = plt.cm.plasma, **options)
    plt.axis("off")

plt.figure(figsize=(9,9))
example_images = X[:100]
plot_digits(example_images, images_per_row=10)
plt.show()

Let's do softmax classification using sklearn (it takes a few seconds) and show some statistics and tests:

In [None]:
from sklearn.linear_model import LogisticRegression

# Solver is lbfgs which uses Ridge regularization by default
clf = LogisticRegression(random_state=0, multi_class='multinomial').fit(X_train, t_train)

In [None]:
print(f'The score (accuracy) in the training set is {clf.score(X_train,t_train)*100:.02f}%')
print(f'The score (accuracy) in the test set is {clf.score(X_test,t_test)*100:.02f}%\n')

plot_digits(X_test[:10], images_per_row=10)
print('Labels predicted for the 10 first pictures in the test set')
print(clf.predict(X_test[:10]))
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix

y_test = clf.predict(X_test)
conf_mx = confusion_matrix(t_test, y_test)
print(conf_mx)

Despite working directly in the original input space, this classifier performs quite well.

Let's try to tell sklearn to use cross-validation to determine the best regularization parameter:

<b style="color:red"> This fitting requires a high-end computer (about 2-3 minutes on a machine with 12 cores and 64 GB of RAM). Allow enough time for the cell to run. </b>

In [None]:
from sklearn.linear_model import LogisticRegressionCV

# Solver is lbfgs which uses Ridge regularization by default
clfCV = LogisticRegressionCV(cv=4, random_state=0, multi_class='multinomial').fit(X_train, t_train)

In [None]:
print(f'The score (accuracy) in the training set is {clfCV.score(X_train,t_train)*100:.02f}%')
print(f'The score (accuracy) in the test set is {clfCV.score(X_test,t_test)*100:.02f}%\n')

plot_digits(X_test[:10], images_per_row=10)
print('Labels predicted for the 10 first pictures in the test set')
print(clfCV.predict(X_test[:10]))
plt.show()

Accuracy is similar to the previous case, which is not bad for a linear classifier.
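
To see which regularization strength cross-validation actually selects, one can inspect the fitted object. The sketch below does so on the much smaller `load_digits` data set so that it runs in a few seconds; the candidate grid `Cs_small`, the standardization step and all variable names are assumptions made for illustration. In sklearn, `C` is the inverse of the regularization strength.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_small, t_small = load_digits(return_X_y=True)   # 1797 images of 8x8 pixels
Xs_train, Xs_test, ts_train, ts_test = train_test_split(
    X_small, t_small, test_size=0.2, random_state=0)

# Standardize the inputs so that the solver converges more easily
scaler = StandardScaler().fit(Xs_train)
Xs_train, Xs_test = scaler.transform(Xs_train), scaler.transform(Xs_test)

# Candidate values of C, compared with 4-fold cross-validation
Cs_small = [0.01, 0.1, 1.0, 10.0]
clf_small = LogisticRegressionCV(Cs=Cs_small, cv=4, max_iter=500).fit(Xs_train, ts_train)

# C_ may hold one (repeated) value per class, so take the first entry
best_C = float(np.ravel(clf_small.C_)[0])
acc_small = clf_small.score(Xs_test, ts_test)
print(f'Selected C = {best_C} (regularization strength of about {1/best_C})')
print(f'Test accuracy: {acc_small*100:.2f}%')
```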

***
### Question 4
> **Given the previous confusion matrix, determine the precision and recall for the class "6"**

***
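
As a reminder of the definitions, the following sketch computes precision and recall for one class of a made-up 3-class confusion matrix. The counts and the helper `precision_recall` are hypothetical; the layout (rows are true classes, columns are predicted classes) follows the convention of sklearn's `confusion_matrix`.

```python
import numpy as np

# Toy confusion matrix with made-up counts:
# rows are true classes, columns are predicted classes
cm_demo = np.array([[50,  2,  3],
                    [ 4, 40,  6],
                    [ 1,  5, 44]])

def precision_recall(cm, k):
    """Precision and recall of class k from a confusion matrix."""
    true_pos = cm[k, k]
    precision = true_pos / cm[:, k].sum()   # among instances predicted as k
    recall = true_pos / cm[k, :].sum()      # among instances that truly are k
    return precision, recall

prec, rec = precision_recall(cm_demo, 1)
print(f'precision = {prec:.3f}, recall = {rec:.3f}')
```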

## $K$-Nearest neighbors

$K$-nearest neighbors is a nonlinear classification method that assigns a point $\pmb{x}$ to the class that wins the majority vote among its $K$ nearest neighbors. The choice of distance affects the result; some of the most common choices are the Euclidean, cosine and Manhattan distances.
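
The voting rule can be sketched in a few lines of NumPy. This is a brute-force illustration with the Euclidean distance, not the implementation used by sklearn; the function `knn_predict` and the toy points are hypothetical, and labels are assumed to be non-negative integers.

```python
import numpy as np

def knn_predict(X_ref, t_ref, X_query, k=3):
    """Majority vote among the k nearest reference points (Euclidean distance)."""
    # Pairwise squared distances, shape (n_query, n_ref)
    diff = X_query[:, None, :] - X_ref[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    # Indices of the k closest reference points for each query point
    nearest = np.argsort(dist2, axis=1)[:, :k]
    votes = t_ref[nearest]                        # shape (n_query, k)
    # Most frequent label in each row (ties go to the smallest label)
    return np.array([np.bincount(row).argmax() for row in votes])

X_ref = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
t_ref = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.15, 0.1], [1.8, 2.0]])

pred_demo = knn_predict(X_ref, t_ref, X_query, k=3)
print(pred_demo)
```

The first query point lies next to the class-0 cluster and the second next to the class-1 cluster, so the predictions are 0 and 1, respectively.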

Its use with sklearn is straightforward. We are going to apply it to the Spherical vs Torus data set and to the MNIST data set, and check how it compares to the logistic/softmax regression:

### Spherical vs Torus data set

In [None]:
np.random.seed(1)
r = np.random.rand(500, 1) #0...1
a = 2*np.pi*np.random.rand(500,1) #0..2pi
X_0 = np.c_[r*np.cos(a), r*np.sin(a)]

r = 0.9 + np.random.rand(500, 1) #0.9...1.9
a = 2*np.pi*np.random.rand(500,1) #0..2pi
X_1 = np.c_[r*np.cos(a), r*np.sin(a)]

testsize = int(X_0.shape[0]*0.1)
X_0_train, X_0_test = X_0[testsize:], X_0[:testsize]
X_1_train, X_1_test = X_1[testsize:], X_1[:testsize]
X_test = np.r_[X_0_test, X_1_test]
X_train = np.r_[X_0_train, X_1_train]
t_test = np.r_[np.zeros([testsize,1]), np.ones([testsize,1])]
t_train = np.r_[np.zeros([500-testsize,1]), np.ones([500-testsize,1])]

# Plot test data set
plt.figure(figsize=(6, 6))
plt.plot(X_test[t_test.reshape(X_test.shape[0],)==0, 0], X_test[t_test.reshape(X_test.shape[0],)==0, 1], "bs")
plt.plot(X_test[t_test.reshape(X_test.shape[0],)==1, 0], X_test[t_test.reshape(X_test.shape[0],)==1, 1], "g^")

plt.axis([-2, 2, -2, 2])
plt.show()

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(metric='haversine', n_neighbors=5)
neigh.fit(X_train, t_train)
print(f'Accuracy of {neigh.score(X_test,t_test)*100:.02f}%') 
print(neigh)

The accuracy seems similar to that obtained with logistic regression, but the two hyperparameters (metric and n_neighbors) have simply been hard-coded. With sklearn, it is easy to run an exhaustive search with cross-validation to find the best values:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

parameters = {'metric':['euclidean', 'minkowski', 'chebyshev', 'haversine'], 'p':range(1,10), 'n_neighbors':range(1,15)}
clfaux = KNeighborsClassifier()
clf = GridSearchCV(estimator=clfaux, param_grid=parameters, cv=8, scoring='precision_macro')
clf.fit(X_train,t_train)
print(clf.best_params_)
# clf.score uses the scoring passed to GridSearchCV (macro-averaged precision)
print(f'Macro-averaged precision of {clf.score(X_test,t_test)*100:.02f}%')
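
Conceptually, a grid search evaluates every candidate with cross-validation and keeps the best one. The sketch below does this by hand for a single hyperparameter (`n_neighbors`) on a stand-in data set of two Gaussian blobs; the data, the variable names and the use of accuracy as the score are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two Gaussian blobs as a stand-in data set
X_blobs = np.r_[rng.normal(0.0, 1.0, size=(100, 2)),
                rng.normal(3.0, 1.0, size=(100, 2))]
t_blobs = np.r_[np.zeros(100), np.ones(100)]

cv_scores = {}
for k in range(1, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 8 cross-validation folds for this candidate
    cv_scores[k] = cross_val_score(model, X_blobs, t_blobs, cv=8,
                                   scoring='accuracy').mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f'Best n_neighbors = {best_k}, CV accuracy = {cv_scores[best_k]*100:.2f}%')
```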

### MNIST data set

<b style="color:red"> Despite using less than 2% of the data set instances (1,000 for training and 200 for testing), this fitting requires a high-end computer (about 2-3 minutes on a machine with 12 cores and 64 GB of RAM). Allow enough time for the cell to run. </b>

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False) # as_frame=False returns NumPy arrays
# print(mnist.DESCR) # Uncomment to show description
X, t = mnist["data"], mnist["target"] # N = 70000, D = 784 

from sklearn.utils import shuffle
X, t = shuffle(X, t, random_state=0)

X_train, X_test, t_train, t_test = X[:1000], X[1000:1200], t[:1000], t[1000:1200]

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Search best hyper-parameters
parameters = {'metric':['sokalmichener', 'jaccard', 'matching', 'dice'], 'n_neighbors':range(1,15)}
clfaux = KNeighborsClassifier()
clf = GridSearchCV(estimator=clfaux, param_grid=parameters, cv=8, scoring='precision_macro')
clf.fit(X_train,t_train)
print(clf.best_params_)

# clf.score uses the scoring passed to GridSearchCV (macro-averaged precision)
print(f'The score (macro precision) in the training set is {clf.score(X_train,t_train)*100:.02f}%')
print(f'The score (macro precision) in the test set is {clf.score(X_test,t_test)*100:.02f}%\n')

plot_digits(X_test[:10], images_per_row=10)
print('Labels predicted for the 10 first pictures in the test set')
print(clf.predict(X_test[:10]))
plt.show()

Note that even using less than 2% of the instances, the score is similar to that of the logistic regression.

***
### Question 5
> **With MNIST we have not used a feature space; we have worked directly in the input space. It can be of interest (as we will see in Unit 3) to work in lower-dimensional spaces. In this question you are asked to propose some features that can be used to simplify the problem. For example, one feature can be the ratio of yellow pixels to the total number of pixels in the instance. Try to propose some additional features for this problem.**
***
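
The example feature mentioned in the question could be computed roughly as follows. The sketch assumes that "yellow" corresponds to pixels whose intensity exceeds some threshold (the plots above map high intensities to yellow); the function `ink_ratio`, the default threshold and the toy images are hypothetical.

```python
import numpy as np

def ink_ratio(images, threshold=0):
    """Fraction of pixels brighter than `threshold` in each flattened image."""
    images = np.asarray(images, dtype=float)
    return np.mean(images > threshold, axis=1)

# Two made-up 2x2 "images", flattened to 4 pixels each
toy_images = np.array([[0, 255, 0, 0],
                       [128, 255, 0, 64]])
ratios = ink_ratio(toy_images)
print(ratios)   # one feature value per image
```

Applied to MNIST, a call such as `ink_ratio(X_train)` would map each 784-dimensional instance to a single number.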