**CAUTION:** You are viewing the solution. If you want to solve this exercise yourself, open `assignment.ipynb` instead.

---

## 1) Loading the Data

You do not need to make any changes in this cell.

In [1]:
# import packages
using MLDataUtils, Random
Random.seed!(123) # set the random number seed

# load the IRIS data set and split it into a training and a test set
X, y = MLDataUtils.load_iris()
(X_trn, y_trn), (X_tst, y_tst) = splitobs(shuffleobs((X, y)), at=0.666)

# I assume you are more familiar with a (n_samples, n_features) shape than
# with the (n_features, n_samples) shape used by MLDataUtils.jl
X_trn = transpose(X_trn) # now the shape looks like the one used by sklearn
X_tst = transpose(X_tst)

; # ending with a semicolon omits printing the output of a cell

## 2) Computing the Euclidean Distance

The Euclidean distance between two vectors `a` and `b` is the square root of the sum of their squared component-wise differences: https://en.wikipedia.org/wiki/Euclidean_distance

Your task is now to compute the Euclidean distance between arbitrary vectors.
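
Written as a formula, for two vectors with $n$ components each (the symbols $d$, $n$ and $i$ are introduced here for illustration only), the definition reads:

$$
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

For example, with $a = (0, 0)$ and $b = (3, 4)$ this gives $d(a, b) = \sqrt{3^2 + 4^2} = \sqrt{25} = 5$.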

In [2]:
# I started with:   euclidean(a, b) = sqrt(sum((a-b).^2))

euclidean(a::AbstractArray{T,1}, b::AbstractArray{T,1}) where T<:Number =
  sqrt(sum((a-b).^2)) # square root of the sum of squared differences

euclidean(a::AbstractArray{T,1}, B::AbstractArray{T,2}) where T<:Number =
  map(b -> euclidean(a, b), eachrow(B)) # distance of one point a to each point in B

euclidean (generic function with 2 methods)

In [5]:
# you can use this cell to test your implementation
a_tst = Random.rand(10000) # 10000-element Array{Float64,1}
b_tst = Random.rand(10000)
@time euclidean(a_tst, b_tst)

  0.000028 seconds (9 allocations: 156.578 KiB)


40.78569015203682

In [6]:
# can you also compute the distance of one point to all other points?
a_tst = Random.rand(10000) # 10000-element Array{Float64,1}
B_tst = Random.rand(10, 10000) # 10 such vectors, i.e. a 10x10000 Array{Float64,2}
@time euclidean(a_tst, B_tst)

  0.000419 seconds (59 allocations: 1.528 MiB)


10-element Array{Float64,1}:
 41.0881382297078  
 41.051803443649206
 40.87838569551084 
 41.15719200334341 
 40.701273788375886
 41.196095202334696
 40.81874523856506 
 40.59544693406834 
 40.958666005666316
 40.897928276524624
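
One way to sanity-check the implementation is to compare it with `norm` from Julia's standard `LinearAlgebra` library, because the Euclidean distance equals the 2-norm of the difference vector. The sketch below repeats the two `euclidean` methods from above so that it runs on its own; the small inputs are made up for illustration.

```julia
using LinearAlgebra # standard library, provides norm

# same two methods as in the solution cell above
euclidean(a::AbstractArray{T,1}, b::AbstractArray{T,1}) where T<:Number =
  sqrt(sum((a-b).^2))
euclidean(a::AbstractArray{T,1}, B::AbstractArray{T,2}) where T<:Number =
  map(b -> euclidean(a, b), eachrow(B))

a = [0.0, 0.0]
b = [3.0, 4.0]
B = [3.0 4.0; 6.0 8.0] # two points, one per row

println(euclidean(a, b) ≈ norm(a - b)) # both sides are 5.0 (3-4-5 triangle)
println(euclidean(a, B) ≈ [5.0, 10.0]) # one distance per row of B
```

Comparing with `≈` instead of `==` is the safer choice for floating-point results in general, even though these particular values happen to be exact.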

## 3) k-NN Classification

A k-NN classifier stores the entire training set. To predict the label of a new example, it computes the distance from this example to every training example. The k closest training examples then vote, and the label that occurs most often among them is used as the final prediction.
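
The voting step can be traced by hand on a toy example. The following is a minimal sketch with made-up distances and labels (they are not taken from the IRIS data); it counts votes with a plain `Dict` so that it needs no extra packages.

```julia
# made-up distances from one query point to five training examples
distances = [0.9, 0.2, 0.5, 1.4, 0.3]
labels    = ["a", "b", "a", "b", "b"]
k = 3

nearest = sortperm(distances)[1:k] # indices of the k smallest distances
votes   = labels[nearest]          # labels of the k nearest neighbors

# count how often each label occurs among the votes
counts = Dict{String,Int}()
for v in votes
    counts[v] = get(counts, v, 0) + 1
end

prediction = findmax(counts)[2] # findmax on a Dict returns (largest count, its key)
println(prediction)
```

Here the three nearest neighbors are examples 2, 5 and 3 with labels `"b"`, `"b"` and `"a"`, so the prediction is `"b"`. If two labels receive the same number of votes, this sketch makes no promise about which one wins, since the result then depends on the iteration order of the `Dict`.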

**Note:** I already provide you with the (generic) type `KNN` because Jupyter complains when types are re-defined, which would happen too often during development. Feel free to make changes, but remember to restart your kernel afterwards.

In [7]:
struct KNN{T_X<:Number, T_y<:Any}
    X::AbstractArray{T_X,2} # training set (features)
    y::AbstractArray{T_y,1} # training set (labels)
    k::Int64 # k, the number of neighbors to consider
end

# you can instantiate an object of this type by calling KNN(X, y, k)

In [12]:
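function predict(knn::KNN{T_X,T_y}, X::AbstractArray{T_X,1}) where {T_X<:Number, T_y<:Any}
    # labels of the k training examples that are closest to X
    votes = knn.y[sortperm(euclidean(X, knn.X))[1:knn.k]]
    # countmap is defined in StatsBase; run `using StatsBase` first if it is not in scope
    vote_counts = countmap(votes) # map unique values to counts
    return findmax(vote_counts)[2] # findmax returns a (count, vote) pair
end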

predict(knn::KNN{T_X,T_y}, X::AbstractArray{T_X,2}) where {T_X<:Number, T_y<:Any} =
  map(x -> predict(knn, x), eachrow(X)) # one prediction per row of X

predict (generic function with 2 methods)

## 4) Estimate the Accuracy

You can use the test set to make predictions and compare them with the true labels. The accuracy is defined as the fraction of correct predictions.
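
As a small worked example of this definition, the sketch below computes the accuracy for four made-up predictions. The helper name `accuracy` is hypothetical and not part of the assignment; `mean` comes from Julia's standard `Statistics` library.

```julia
using Statistics # standard library, provides mean

# hypothetical helper: fraction of predictions that equal the true labels
accuracy(y_pred, y_true) = mean(y_pred .== y_true)

# made-up labels for illustration, three of four predictions are correct
y_true = ["setosa", "virginica", "versicolor", "setosa"]
y_pred = ["setosa", "virginica", "virginica",  "setosa"]

println(accuracy(y_pred, y_true)) # 3 / 4 = 0.75
```

Taking the `mean` of the element-wise comparison is equivalent to dividing the number of matches by the number of labels.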

In [15]:
knn_tst = KNN(X_trn, y_trn, 3)
@time sum(predict(knn_tst, X_tst) .== y_tst) / length(y_tst)

  0.001508 seconds (21.02 k allocations: 1.517 MiB)


0.96