## Chapter 14 -- Least-squares classification

Modified by kmp 2022

Sources:

https://web.stanford.edu/~boyd/vmls/

https://github.com/vbartle/VMLS-Companions

Based on "Boyd and Vandenberghe, 2021, Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares - Julia Language Companion" https://web.stanford.edu/~boyd/vmls/vmls-julia-companion.pdf


In [1]:
using LinearAlgebra
using VMLS

### 14.1 Classification
**Boolean values.** Julia has the Boolean values **`true`** and **`false`**. 

These are automatically converted to the numbers **`1`** and **`0`** when they are combined in numerical expressions. In VMLS we use the encoding (for classifiers) where **`true`** corresponds to **`+1`** and **`false`** corresponds to **`−1`**. We can get our encoding from a Julia Boolean value `b` using **`2*b-1`**, or via the ternary conditional operation **`b ? 1 : -1`**.

In [2]:
tf2pm1(b) = 2*b-1   # short-form function definition

tf2pm1(true), tf2pm1(false)

(1, -1)

In [3]:
b = [true, false, true]

3-element Vector{Bool}:
 1
 0
 1

In [4]:
tf2pm1.(b)

3-element Vector{Int64}:
  1
 -1
  1
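The ternary form mentioned above gives the same encoding as `2*b-1`. The following is a minimal sketch comparing the two; the names `tf2pm1_ternary` and `pm12tf` are illustrative assumptions and are not part of the VMLS package.

```julia
# Arithmetic and ternary versions of the Boolean -> ±1 encoding
tf2pm1(b) = 2*b - 1
tf2pm1_ternary(b) = b ? 1 : -1

# Hypothetical inverse: map a ±1 value back to a Boolean
pm12tf(s) = s > 0

b = [true, false, true]
s = tf2pm1_ternary.(b)   # applied elementwise through the dot syntax
```

Both versions return integers, and applying `pm12tf.(s)` recovers the original Boolean vector.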

**Confusion matrix.** Let us see how we would evaluate the prediction errors and confusion matrix, given a set of data `y` and predictions `yhat`, both stored as arrays (vectors) of Boolean values, of length `N`.

In [5]:
# Count errors and correct predictions
Ntp(y,yhat) = sum( (y .== true) .& (yhat .== true) )

Nfn(y,yhat) = sum( (y .== true) .& (yhat .== false) )

Nfp(y,yhat) = sum( (y .== false) .& (yhat .== true) )

Ntn(y,yhat) = sum( (y .== false) .& (yhat .== false) )

error_rate(y,yhat) = (Nfn(y,yhat) + Nfp(y,yhat)) / length(y)

confusion_matrix(y,yhat) = [ Ntp(y,yhat) Nfn(y,yhat); Nfp(y,yhat) Ntn(y,yhat) ]

y = rand(Bool,100)
yhat = rand(Bool,100)

confusion_matrix(y,yhat)

2×2 Matrix{Int64}:
 22  24
 27  27

In [6]:
error_rate(y,yhat)

0.51

The dots that precede **`==`** and **`&`** cause them to be evaluated elementwise. When we sum the Boolean vectors, they are converted to integers. 

In the last section of the code we generate two random Boolean vectors, so we expect the error rate to be around $50\%$. In the code above, we compute the error rate from the numbers of **false negatives** and **false positives**. 

A more compact expression for the error rate is **`avg(y .!= yhat)`**. The VMLS package contains the function `confusion_matrix(y, yhat)`.
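As a small check, the sketch below compares the count-based error rate with the compact form on a hand-written example. It uses `mean` from the standard `Statistics` library in place of `avg`, on the assumption that `avg` computes the ordinary average of the entries; the data vectors are made up for illustration.

```julia
using Statistics   # provides mean, used here as a stand-in for avg

y    = [true, true, false, false, true, false]
yhat = [true, false, false, true, true, false]

# Count-based error rate: (false negatives + false positives) / N
Nfn = sum((y .== true) .& (yhat .== false))
Nfp = sum((y .== false) .& (yhat .== true))
rate_counts = (Nfn + Nfp) / length(y)

# Compact form: fraction of entries where outcome and prediction differ
rate_compact = mean(y .!= yhat)
```

Here there is one false negative and one false positive among six examples, so both expressions evaluate to one third.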



### 14.2 Least squares classifier
We can evaluate $\hat{f}(x) = \mathbf{sign}(\tilde{f}(x))$ using `ftilde(x)>0`, which returns a Boolean value.

In [7]:
ftilde(x) = x'*beta .+ v # Regression model
fhat(x) = ftilde(x) > 0 # Regression classifier

fhat (generic function with 1 method)
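The definitions above refer to `beta` and `v`, which must exist before `fhat` is called. The sketch below fills them in with hypothetical values, chosen only so that the arithmetic is easy to follow, and evaluates the classifier on two points.

```julia
# Hypothetical model parameters, chosen only for illustration
beta = [1.0, -2.0]
v = 0.5

ftilde(x) = x'*beta .+ v   # regression model
fhat(x) = ftilde(x) > 0    # Boolean classifier: true when ftilde(x) is positive

x1 = [3.0, 1.0]   # ftilde(x1) = 3 - 2 + 0.5 = 1.5
x2 = [0.0, 1.0]   # ftilde(x2) = 0 - 2 + 0.5 = -1.5
fhat(x1), fhat(x2)
```

Note that with this definition a point with `ftilde(x)` exactly zero is classified as `false`.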

**Iris flower classification.** The `Iris` data set contains $150$ examples of three types of iris flowers. There are $50$ examples of each class. For each example, four features are provided. The following code reads in a dictionary containing three $50 × 4$ matrices `setosa`, `versicolor`, `virginica` with the examples for each class, and then computes a Boolean classifier that distinguishes *Iris Virginica* from the other two classes.

In [8]:
D = iris_data()

# Create 150x4 data matrix
iris = vcat(D["setosa"], D["versicolor"], D["virginica"])

# y[k] is true (1) if virginica, false (0) otherwise
y = [zeros(Bool, 50); zeros(Bool, 50); ones(Bool, 50)]
A = [ones(150) iris]

theta = A \ (2*y .- 1)

5-element Vector{Float64}:
 -2.3905637266512043
 -0.09175216910134579
  0.4055367711191057
  0.007975822012793829
  1.1035586498675736

In [9]:
yhat = A*theta .> 0
C = confusion_matrix(y, yhat)

2×2 Matrix{Int64}:
 46   4
  7  93

In [10]:
err_rate = (C[1,2] + C[2,1]) / length(y)

0.07333333333333333

In [14]:
avg(y .!= yhat)

0.07333333333333333

### 14.3 Multi-class classifiers
**Multi-class error rate and confusion matrix.** The overall error rate is easily evaluated as `avg(y .!= yhat)`. We can form the $K×K$ confusion matrix from a set of $N$ true outcomes `y` and $N$ predictions `yhat` (each with entries among $\{1, \ldots, K\}$) by counting the number of times each pair of values occurs.

In [13]:
function confusion_matrix(y, yhat; K=2)
    C = zeros(K,K)
    for i in 1:K, j in 1:K
        # entry (i,j): number of examples with true class i predicted as class j
        C[i,j] = sum((y .== i) .& (yhat .== j))
    end
    return C
end

confusion_matrix (generic function with 1 method)

In [14]:
error_rate(y, yhat) = avg(y .!= yhat)

# test for K=4 on random vectors of length 100
K = 4
y = rand(1:K, 100)
yhat = rand(1:K, 100)

C = confusion_matrix(y, yhat, K=K)

4×4 Matrix{Float64}:
 9.0   3.0  2.0  7.0
 9.0   6.0  7.0  9.0
 7.0  10.0  1.0  3.0
 9.0   5.0  5.0  8.0

In [15]:
error_rate(y, yhat), 1-sum(diag(C))/sum(C)

(0.76, 0.76)

The function **`confusion_matrix`** is included in the `VMLS` package.
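The double loop above scans the data once for every pair of classes. As an alternative sketch, the matrix can also be filled in a single pass over the examples. The function name `confusion_matrix_onepass` is illustrative, and the sketch assumes the labels are integers in $1,\ldots,K$, since they are used directly as indices.

```julia
# Single-pass variant (illustrative name): one update per example
function confusion_matrix_onepass(y, yhat; K=2)
    C = zeros(Int, K, K)
    for (yi, yhi) in zip(y, yhat)
        C[yi, yhi] += 1    # row = true class, column = predicted class
    end
    return C
end

y    = [1, 2, 3, 3, 2, 1, 4]
yhat = [1, 3, 3, 3, 2, 4, 4]
C = confusion_matrix_onepass(y, yhat, K=4)
```

The entries of `C` sum to the number of examples, and the off-diagonal entries count the errors, so the error rate can be read off the matrix as before.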



**Least squares multi-class classifier.** A $K$-class classifier (with regression model) can be expressed as
$$
\hat{f}(x) = \operatorname*{argmax}_{k=1,\ldots,K} \tilde{f}_k(x)
$$
where $\tilde{f}_k(x) = x^T\theta_k$. The $n$-vectors $\theta_1, \ldots, \theta_K$ are the coefficients or parameters in the model.

We can express this in matrix-vector notation as
$$
\hat{f}(x) = \operatorname{argmax}(x^T\Theta)
$$
where $\Theta = [\theta_1 \cdots \theta_K]$ is the $n × K$ matrix of model coefficients, and the argmax of a row vector has the obvious meaning.

In Julia the function **`argmax(u)`** finds the index of the largest entry in the row or column vector $u$, i.e., $\operatorname{argmax}_k u_k$. To extend this to matrices, we define a function **`row_argmax`** that returns a vector with, for each row, the index of the largest entry in that row.

In [17]:
row_argmax(u) = [argmax(u[i,:]) for i = 1:size(u,1)]

A = randn(4,5)

4×5 Matrix{Float64}:
  0.166401   0.52492    1.60732   -0.661108   0.9365
 -0.73142   -0.720189  -0.168272  -1.46493   -0.987553
 -0.533614  -1.13049   -0.277887   0.905306  -0.517429
 -1.65415    0.321545  -2.81065   -0.231736  -1.83179

In [18]:
row_argmax(A)

4-element Vector{Int64}:
 3
 3
 4
 2
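The same row-wise result can be obtained in other ways with functions from Julia's standard library. The sketch below shows two alternatives next to `row_argmax`; the names `row_argmax_eachrow` and `row_argmax_dims` are illustrative, and the small matrix is made up so the answer can be checked by eye.

```julia
row_argmax(u) = [argmax(u[i,:]) for i = 1:size(u,1)]

# Alternative 1: map argmax over the rows (eachrow needs Julia 1.1 or later)
row_argmax_eachrow(u) = map(argmax, eachrow(u))

# Alternative 2: argmax with dims=2 returns CartesianIndex values;
# keep only the column index of each
row_argmax_dims(u) = [idx[2] for idx in vec(argmax(u, dims=2))]

A = [0.2 0.5 1.6; -0.7 -0.1 -1.4; 0.9 -1.1 -0.3]
row_argmax(A), row_argmax_eachrow(A), row_argmax_dims(A)
```

The largest entries sit in columns 3, 2 and 1 of the three rows, and all three functions return that vector.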

If a data set with $N$ examples is stored as an $n × N$ data matrix `X`, and `Theta` is
an $n × K$ matrix with the coefficient vectors $θ_k$ as its columns, then we can now
define a function

In [19]:
fhat(X,Theta) = row_argmax(X'*Theta)

fhat (generic function with 2 methods)

to find the $N$-vector of predictions.

**Matrix least squares.** Let’s use least squares to find the coefficient matrix $\Theta$ for a multi-class classifier with $n$ features and $K$ classes, from a data set of $N$ examples.

We will assume the data is given as an $n × N$ matrix $X$ and an $N$-vector $y^{cl}$ with entries in $\{1, \ldots, K\}$ that give the classes of the examples. The least squares objective can be expressed as a matrix norm squared,

$$
\|X^T\Theta - Y\|^2
$$

where $Y$ is the $N × K$ matrix with

$$
Y_{ij} = 
\left\{
\begin{array}{ll}
 1 & y^{cl}_i = j \\
−1 & y^{cl}_i \neq j
\end{array}
\right.
$$

In other words, the rows of $Y$ describe the classes using one-hot encoding, converted from $0/1$ to $−1/+1$ values. The least squares solution is given by $\hat{\Theta} = (X^T)^\dagger Y$. In Julia:

In [20]:
function one_hot(ycl,K)
    N = length(ycl)
    Y = zeros(N,K)

    for j in 1:K
        Y[findall(ycl .== j), j] .= 1
    end
    
    return Y
end

K = 4
ycl = rand(1:K, 6)

6-element Vector{Int64}:
 2
 2
 4
 4
 2
 1

In [21]:
Y = one_hot(ycl, K)

6×4 Matrix{Float64}:
 0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  0.0  1.0
 0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0

In [22]:
2*Y .- 1

6×4 Matrix{Float64}:
 -1.0   1.0  -1.0  -1.0
 -1.0   1.0  -1.0  -1.0
 -1.0  -1.0  -1.0   1.0
 -1.0  -1.0  -1.0   1.0
 -1.0   1.0  -1.0  -1.0
  1.0  -1.0  -1.0  -1.0

Using the functions we have defined, the matrix least squares multi-class classifier
can be computed in a few lines.

In [23]:
function ls_multiclass(X,ycl,K)
    n, N = size(X)
    
    Theta = X' \ (2*one_hot(ycl,K) .- 1)
    yhat = row_argmax(X'*Theta)

    return Theta, yhat
end

ls_multiclass (generic function with 1 method)
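The backslash operation in `ls_multiclass` solves the matrix least squares problem. The sketch below checks, on synthetic data, that it agrees with the pseudo-inverse formula and with solving one ordinary least squares problem per class, which works because the squared matrix norm is the sum of the squared norms of its columns. The random data, the seed, and all variable names are assumptions made for illustration.

```julia
using LinearAlgebra
using Random

rng = Random.MersenneTwister(0)    # fixed seed so the sketch is repeatable
n, N, K = 3, 20, 4
X = randn(rng, n, N)               # synthetic n x N data matrix
ycl = rand(rng, 1:K, N)            # synthetic class labels
Y = 2.0 .* (ycl .== (1:K)') .- 1.0 # N x K matrix of ±1 targets

Theta_backslash = X' \ Y                              # matrix least squares
Theta_pinv = pinv(X') * Y                             # pseudo-inverse formula
Theta_columns = hcat([X' \ Y[:,k] for k in 1:K]...)   # one problem per class
```

The three results match up to rounding error. This assumes `X'` has linearly independent columns, which holds with probability one for the random matrix used here.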

**Iris flower classification.** We compute a $3$-class classifier for the iris flower data set. We split the data set of $150$ examples into a training set of $120$ ($40$ per class) and a test set of $30$ ($10$ per class). The code calls the functions we defined above.

In [24]:
D = iris_data()
setosa = D["setosa"]
versicolor = D["versicolor"]
virginica = D["virginica"]

# pick three random permutations of 1,..., 50
import Random

I1 = Random.randperm(50)
I2 = Random.randperm(50)
I3 = Random.randperm(50)

# training set is 40 randomly picked examples per class
Xtrain = [ setosa[I1[1:40],:] 
    versicolor[I2[1:40],:]
    virginica[I3[1:40],:] ]' # 4x120 data matrix

# add constant feature one
Xtrain = [ ones(1,120); Xtrain ] # 5x120 data matrix
ytrain = [ ones(40); 2*ones(40); 3*ones(40) ]

# test set is remaining 10 examples for each class
Xtest = [ setosa[I1[41:end],:]
    versicolor[I2[41:end],:] 
    virginica[I3[41:end],:] ]' # 4x30 data matrix
Xtest = [ ones(1,30); Xtest ] # 5x30 data matrix
ytest = [ones(10); 2*ones(10); 3*ones(10)]

30-element Vector{Float64}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0

In [25]:
Theta, yhat = ls_multiclass(Xtrain, ytrain, 3)
Ctrain = confusion_matrix(ytrain, yhat, K=3)

3×3 Matrix{Float64}:
 40.0   0.0   0.0
  0.0  26.0  14.0
  0.0   5.0  35.0

In [26]:
error_train = error_rate(ytrain, yhat)

0.15833333333333333

In [27]:
yhat = row_argmax(Xtest'*Theta)

30-element Vector{Int64}:
 2
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 3
 3
 3
 3
 3
 3
 3
 3
 3

In [28]:
Ctest = confusion_matrix(ytest, yhat, K=3)

3×3 Matrix{Float64}:
 9.0  1.0   0.0
 0.0  7.0   3.0
 0.0  0.0  10.0

In [29]:
error_test = error_rate(ytest, yhat)

0.13333333333333333
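Because the split is based on random permutations, the confusion matrices and error rates shown above change from run to run. One way to make a split repeatable, sketched here for a single class, is to pass a seeded random number generator to `randperm`. The seed value and the names `train_idx` and `test_idx` are arbitrary choices for illustration.

```julia
import Random

# Seeded generator (the seed value 1 is arbitrary) so the split is repeatable
rng = Random.MersenneTwister(1)
I1 = Random.randperm(rng, 50)

train_idx = I1[1:40]     # 40 training examples for one class
test_idx  = I1[41:end]   # remaining 10 examples for testing

# Re-creating the generator with the same seed reproduces the permutation
I1_again = Random.randperm(Random.MersenneTwister(1), 50)
```

The training and test indices are disjoint and together cover all $50$ examples of the class.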