## Image classifier (least squares method)

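This is a simple classifier based on the [least squares method](https://en.wikipedia.org/wiki/Least_squares): each test image gets the label of the training image with the smallest sum of squared pixel differences (in other words, a nearest-neighbor rule).

First of all, let's import the `numpy` and `time` libraries.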

In [1]:
import numpy as np
import time

Now let's read the data. It is stored in two files (<i>mnist_train.csv</i> and <i>mnist_test.csv</i>) taken from [here](https://pjreddie.com/projects/mnist-in-csv/).

<b>Attention!</b> Reading and parsing large CSV files takes a fair amount of memory (up to a couple of gigabytes) and CPU time, so it's better to run the following block of code only once.

In [2]:
train_data = np.loadtxt("mnist_train.csv", delimiter=',')
test_data = np.loadtxt("mnist_test.csv", delimiter=',')

As the shapes printed below show, we have 60k vectors representing images in `train_data` and 10k vectors in `test_data`. Every vector has 785 values: the first one is the label (a digit from 0 to 9) and the remaining 784 numbers are the pixels of the image of that digit (a 28x28 pixel matrix flattened into a vector).

In [3]:
print(train_data.shape)
print(test_data.shape)

(60000, 785)
(10000, 785)
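
To make the row layout concrete, here is a minimal sketch that splits one row into its label and a 28x28 pixel matrix. The row is synthetic (a made-up stand-in so the snippet runs without the CSV files); with the real data you would take a row of `train_data` instead.

```python
import numpy as np

# Hypothetical stand-in for one row of train_data:
# the label 7 followed by 784 random pixel intensities.
rng = np.random.default_rng(0)
row = np.concatenate(([7.0], rng.integers(0, 256, size=784).astype(float)))

label = int(row[0])               # the first value is the digit label
image = row[1:].reshape(28, 28)   # the other 784 values form the 28x28 image

print(label)        # 7
print(image.shape)  # (28, 28)
```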


So, let's start. Our algorithm has a complexity of $O(N \times M \times L)$, where $N$ is the size of the training set, $M$ is the size of the test set and $L$ is the length of each image vector. Therefore, we will choose random subsets of both sets and work with those instead of the full data.

We could still use the full data, but then the computation would take much longer.
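
As a rough illustration of that cost, the snippet below counts the elementary pixel comparisons ($N \times M \times L$) for the subset sizes used in this notebook (10,000 training and 1,000 test images) and for the full data. It is only an operation count, not a timing prediction.

```python
L = 784  # pixels per image

subset_ops = 10_000 * 1_000 * L    # N * M * L for the subsets
full_ops = 60_000 * 10_000 * L     # N * M * L for the full data

print(subset_ops)              # 7840000000
print(full_ops)                # 470400000000
print(full_ops // subset_ops)  # 60
```

So the full data needs about 60 times more work than the subsets.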

In [4]:
train_N = 10000
test_N = 1000
IMAGE_LENGTH = test_data.shape[1] - 1

In [5]:
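def get_random_subset(data, new_size):
    """Pick new_size random rows and split them into labels and images."""
    old_size = data.shape[0]
    # Sample row indexes without repetition
    indexes = np.random.choice(old_size, new_size, replace=False)
    labels = data[indexes, 0].astype(int)  # first column: digit label
    images = data[indexes, 1:]             # remaining columns: pixels
    return labels, images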

train_labels, train_img = get_random_subset(train_data, train_N)
test_labels, test_img = get_random_subset(test_data, test_N)

Now let's write the classifier itself. As said above, we will use the least squares method.

How does it work?
For every sample from the test set (which is a vector), we find the closest (by Euclidean distance) vector in the training set. The label of that closest vector becomes the predicted label of the test sample.
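
Written as a formula, with notation introduced here only for illustration ($x$ is a test vector, $t_i$ is the $i$-th training vector and $y_i$ is its label), the rule is:

$$
\hat{y}(x) = y_{i^*}, \qquad i^* = \arg\min_{1 \le i \le N} \lVert x - t_i \rVert_2^2 = \arg\min_{1 \le i \le N} \sum_{j=1}^{L} (x_j - t_{ij})^2
$$

Minimizing the squared distance gives the same $i^*$ as minimizing the distance itself, so the square root can be skipped.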


<b>Comment:</b> for small test and training subsets we could vectorize the algorithm by building a matrix of Euclidean distances between every pair of vectors. However, this raises memory usage, which becomes critical for large subsets of images.
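
For reference, one possible vectorized variant is sketched below. It is not the approach used in this notebook: it relies on the identity $\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\,a \cdot b$ and runs on small synthetic arrays (`train`, `test` and `labels` are made-up stand-ins). The last lines estimate the size of a `float64` distance matrix for the subset and full-data sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((200, 784))           # synthetic stand-in for train_img
test = rng.random((50, 784))             # synthetic stand-in for test_img
labels = rng.integers(0, 10, size=200)   # synthetic stand-in for train_labels

# All pairwise squared distances at once: ||a||^2 + ||b||^2 - 2 a.b
d2 = (
    (test ** 2).sum(axis=1)[:, None]
    + (train ** 2).sum(axis=1)[None, :]
    - 2.0 * test @ train.T
)                                        # shape (50, 200)
predicted = labels[d2.argmin(axis=1)]

# Reference result from the per-sample loop
loop_predicted = np.array(
    [labels[np.square(t - train).sum(axis=1).argmin()] for t in test]
)
print(np.array_equal(predicted, loop_predicted))

# Size of the distance matrix in bytes (8 bytes per float64 value)
bytes_subset = 1_000 * 10_000 * 8    # 80 MB for the subsets
bytes_full = 10_000 * 60_000 * 8     # 4.8 GB for the full data
print(bytes_subset, bytes_full)
```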

In [6]:
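def classify_image(test_img, train_img, train_labels):
    # Squared Euclidean distance from the test image to every training image
    # (test_img of shape (L,) is broadcast against train_img of shape (N, L)).
    DM = np.square(test_img - train_img).sum(axis=1)
    # Index of the closest training image; its label is the prediction.
    index = DM.argmin(axis=0)
    return train_labels[index]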
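
A tiny usage sketch on made-up 2D points (instead of 784-pixel images) shows what the function returns. The function is repeated so the snippet runs on its own; `train_pts` and `train_lbl` are hypothetical toy data.

```python
import numpy as np

def classify_image(test_img, train_img, train_labels):
    DM = np.square(test_img - train_img).sum(axis=1)
    return train_labels[DM.argmin(axis=0)]

# Three toy "training images" with labels 3, 5 and 8
train_pts = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
train_lbl = np.array([3, 5, 8])

# Squared distances from (1, 1): 2, 162, 82 -> closest is the first point
print(classify_image(np.array([1.0, 1.0]), train_pts, train_lbl))  # 3
# Squared distances from (9, 8): 145, 5, 85 -> closest is the second point
print(classify_image(np.array([9.0, 8.0]), train_pts, train_lbl))  # 5
```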

Now let's test our algorithm. We will also measure the execution time using the `%%time` cell magic.

In [7]:
%%time

# Collect predictions in an array so the comparison below is elementwise
predicted_results = np.array([classify_image(test, train_img, train_labels)
                              for test in test_img])

success_count = (predicted_results == test_labels).sum()
accuracy = success_count / test_N

print("SAMPLES COUNT :", train_N)
print("TESTS COUNT   :", test_N)
print("ACCURACY      :", np.round(accuracy * 100, 2), '%')

SAMPLES COUNT : 10000
TESTS COUNT   : 1000
ACCURACY      : 94.4 %
CPU times: user 1min 21s, sys: 48.8 s, total: 2min 10s
Wall time: 1min 7s


Well done!

You can see how the accuracy depends on the size of the training set by changing `train_N` and rerunning the code yourself.
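
One way to run such an experiment is sketched below. To stay runnable without the CSV files it uses synthetic data (ten noisy clusters in 20 dimensions), so the printed numbers say nothing about MNIST; `nearest_neighbor_accuracy` and `make_samples` are hypothetical helpers written for this sketch.

```python
import numpy as np

def nearest_neighbor_accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test samples whose closest training sample has the same label."""
    hits = 0
    for x, y in zip(test_x, test_y):
        idx = np.square(x - train_x).sum(axis=1).argmin()
        hits += int(train_y[idx] == y)
    return hits / len(test_y)

# Synthetic stand-in for the images: 10 classes, each a noisy cluster
rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 20))

def make_samples(n):
    labels = rng.integers(0, 10, size=n)
    points = centers[labels] + rng.normal(scale=2.0, size=(n, 20))
    return points, labels

train_x, train_y = make_samples(2000)
test_x, test_y = make_samples(200)

# Accuracy for growing training subsets
results = {}
for n in (10, 100, 1000, 2000):
    results[n] = nearest_neighbor_accuracy(train_x[:n], train_y[:n], test_x, test_y)
    print(n, results[n])
```

With the real data, the same loop could call `get_random_subset(train_data, n)` for each `n` and reuse `classify_image`.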