# KNN Digits
Implementing a simple KNN to classify digits.

In [1]:
import cv2 as cv
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

#### Data Prep

In [2]:
df = pd.read_csv('numbers.csv')

df

Unnamed: 0,numbers,file_path
0,0,media/kills_only/loop_00000.jpg
1,n,media/kills_only/loop_00001.jpg
2,2,media/kills_only/loop_00002.jpg
3,2,media/kills_only/loop_00003.jpg
4,2,media/kills_only/loop_00004.jpg
...,...,...
2109,n,media/kills_only/loop_02109.jpg
2110,n,media/kills_only/loop_02110.jpg
2111,n,media/kills_only/loop_02111.jpg
2112,n,media/kills_only/loop_02112.jpg


Standardize the 'null' labels (`n`, `e`, `b`) to a single value and/or remove those rows. Both improve accuracy by a few percent.

In [3]:
# helps a few %
# map every 'null'-type label ('n', 'e', 'b') to one empty-string class;
# a single .loc call avoids chained assignment (df.numbers.loc[...] = ...)
df.loc[df.numbers.isin(['n', 'e', 'b']), 'numbers'] = ''

# # maybe helps a little less than making them all the same
# df = df.loc[df.numbers != 'n']
# df = df.loc[df.numbers != 'e']
# df = df.loc[df.numbers != 'b']

# df = df.reset_index(drop=True)

# df.tail(3)

Limit the number of samples from each possible outcome.

In [4]:
# temp_df = pd.DataFrame(columns=df.columns)

# for u in df.numbers.unique():
#     sample = df.loc[df.numbers==u].copy()
#     if len(sample) > 50:
#         sample = sample.sample(50)
#     temp_df = pd.concat([temp_df, sample])

# df = temp_df

# df
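
The per-label cap in the commented-out cell above can also be written with `groupby`. The following is a minimal sketch, not the notebook's method: the `toy` frame, the `cap_per_class` helper, and the cap of 4 are illustrative assumptions (the cell above uses 50).

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`; only the label column matters here.
toy = pd.DataFrame({
    "numbers": ["5"] * 8 + ["3"] * 2 + [""] * 5,
    "file_path": [f"img_{i:02d}.jpg" for i in range(15)],
})

MAX_PER_CLASS = 4  # illustrative; the commented-out cell uses 50


def cap_per_class(frame, label_col, max_n, seed=0):
    """Return a copy with at most `max_n` randomly chosen rows per label."""
    parts = [
        group.sample(n=min(len(group), max_n), random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    return pd.concat(parts).reset_index(drop=True)


capped = cap_per_class(toy, "numbers", MAX_PER_CLASS)
print(capped.numbers.value_counts())
```

Labels with fewer rows than the cap are kept whole, so rare classes are not dropped.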

In [5]:
df.numbers.value_counts()

5     250
      249
10    148
12    141
1     136
6     130
13    113
11    107
0      95
2      85
9      82
14     67
8      62
16     61
7      60
15     59
19     50
4      37
3      34
20     28
17     27
22     26
18     25
21     14
25      9
23      7
26      5
27      4
24      3
Name: numbers, dtype: int64

For `X`: make a list of `[image, file_path]` pairs, where each image is an array flattened to 1D with `.flatten()`.

For `y`: target values are found in the `numbers` column.

After train/test splitting, split the file paths from the arrays (images) so we have an array of file paths and an array of arrays (images) for training and for testing (4 arrays total).

The arrays of file paths (`train_file_paths`, `test_file_paths`) are of no use to our model, and are only recorded so that we can examine particular instances (e.g. to see an incorrectly predicted image).

In [6]:
X = [[cv.imread(fp).flatten(), fp] for fp in df.file_path.values]
y = df.numbers.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# keep file paths 
train_file_paths = np.array([fp for img, fp in X_train])
test_file_paths = np.array([fp for img, fp in X_test])

X_train = np.array([img for img, fp in X_train])
X_test = np.array([img for img, fp in X_test])
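
An alternative to pairing each image with its path and unpacking afterwards: `train_test_split` accepts any number of equal-length arrays and splits them all with the same shuffled indices. A minimal sketch, assuming random stand-in images and hypothetical file names in place of the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 10 tiny "images" (2x2 pixels, 3 channels), labels, and paths.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 2, 2, 3), dtype=np.uint8)
labels = np.array([str(i % 3) for i in range(10)])
paths = np.array([f"img_{i:02d}.jpg" for i in range(10)])

# Flatten each image to 1D so X has shape (n_samples, n_features).
X = images.reshape(len(images), -1)

# All three arrays are split with the same indices, so rows stay aligned.
X_train, X_test, y_train, y_test, train_paths, test_paths = train_test_split(
    X, labels, paths, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
print(test_paths)
```

Here `random_state=0` is only for reproducibility of the sketch; omit it to get a fresh split on each run.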

#### Create & Train Model
Then output an array of predictions, just to see what they look like.

In [7]:
knn = KNeighborsClassifier(n_neighbors=1)

In [8]:
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=1)
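
With `n_neighbors=1`, each test image gets the label of its single closest training image (Euclidean distance by default in scikit-learn). The sketch below illustrates that idea in plain NumPy; `predict_1nn` and the toy arrays are hypothetical and not part of the notebook, and scikit-learn's own implementation is more efficient.

```python
import numpy as np


def predict_1nn(X_train, y_train, X_query):
    """Label each query row with the label of its nearest training row."""
    # Cast to float first: image arrays are usually uint8, and uint8
    # subtraction would wrap around instead of going negative.
    X_train = np.asarray(X_train, dtype=np.float64)
    X_query = np.asarray(X_query, dtype=np.float64)
    # Squared Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    nearest = np.argmin(dists, axis=1)
    return np.asarray(y_train)[nearest]


# Toy 2-feature "images" with string labels, including the empty 'null' class.
X_train_toy = np.array([[0, 0], [10, 10], [0, 10]], dtype=np.uint8)
y_train_toy = np.array(["0", "5", ""])
X_query_toy = np.array([[1, 1], [9, 8], [2, 9]], dtype=np.uint8)

print(predict_1nn(X_train_toy, y_train_toy, X_query_toy))
```

This brute-force version builds an `(n_query, n_train, n_features)` array, so it is only practical for small inputs.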

In [9]:
knn.predict(X_test)

array(['5', '5', '6', '12', '', '12', '13', '', '12', '13', '13', '5',
       '10', '13', '', '10', '16', '', '13', '', '6', '13', '8', '', '5',
       '9', '12', '7', '6', '19', '5', '10', '0', '2', '15', '6', '2',
       '17', '23', '7', '1', '5', '1', '5', '', '11', '13', '5', '9',
       '19', '6', '', '1', '1', '', '19', '12', '10', '12', '0', '13',
       '6', '16', '1', '13', '1', '14', '9', '20', '1', '10', '', '11',
       '22', '5', '', '0', '', '10', '20', '18', '0', '7', '5', '10',
       '10', '9', '5', '5', '4', '13', '', '', '0', '1', '6', '12', '2',
       '12', '15', '4', '6', '8', '11', '5', '13', '2', '10', '15', '10',
       '27', '8', '11', '20', '0', '19', '23', '1', '12', '10', '12', '7',
       '20', '14', '6', '0', '', '2', '16', '12', '10', '14', '13', '10',
       '14', '12', '9', '3', '19', '18', '19', '', '5', '14', '6', '',
       '5', '18', '5', '16', '10', '6', '5', '6', '', '6', '5', '19',
       '11', '5', '1', '5', '9', '5', '', '10', '', '5', '1', '8

#### Score Model

In [10]:
preds = knn.predict(X_test)

n_correct = np.sum(preds == y_test)
n_possible = len(y_test)
print(f'n_correct: {n_correct}\nn_possible: {n_possible}\n% correct: {n_correct/n_possible*100}%')

n_correct: 559
n_possible: 635
% correct: 88.03149606299212%
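
Because the label counts are uneven, a single overall percentage can hide classes that are predicted poorly. A minimal sketch of a per-label breakdown, using made-up `preds_toy` and `y_toy` arrays rather than the real `preds` and `y_test`:

```python
import numpy as np
import pandas as pd

# Hypothetical predictions and labels standing in for `preds` and `y_test`.
preds_toy = np.array(["5", "5", "", "13", "1", "13"])
y_toy = np.array(["5", "5", "", "14", "", "13"])

# Overall accuracy: fraction of predictions that match the label.
overall = np.mean(preds_toy == y_toy)

# Per-label accuracy and sample count, grouped by the actual label.
per_class = (
    pd.DataFrame({"actual": y_toy, "correct": preds_toy == y_toy})
    .groupby("actual")["correct"]
    .agg(["mean", "size"])
    .rename(columns={"mean": "accuracy", "size": "n_samples"})
)

print(f"overall accuracy: {overall:.3f}")
print(per_class)
```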


#### Current 
- 2114 rows, k=1, test_size=0.3, % correct (7 runs avg): 88.008998875%
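
An average over several runs, like the 7-run figure above, can be computed in a loop over random splits. This is a sketch under assumptions: the data is synthetic, and `mean_accuracy` is a hypothetical helper, not the code used to produce the score above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data: three noisy clusters standing in for image classes.
rng = np.random.default_rng(0)
n_per_class, n_features = 60, 20
centers = rng.normal(0, 5, size=(3, n_features))
X = np.vstack([c + rng.normal(0, 1, size=(n_per_class, n_features)) for c in centers])
y = np.repeat(["0", "1", "2"], n_per_class)


def mean_accuracy(X, y, k=1, test_size=0.3, n_runs=7):
    """Fit and score a KNN on `n_runs` different random splits."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))  # fraction correct on the test split
    return np.mean(scores), np.std(scores), scores


mean, std, scores = mean_accuracy(X, y)
print(f"mean: {mean * 100:.2f}%  std: {std * 100:.2f}%  runs: {len(scores)}")
```

Reporting the standard deviation alongside the mean makes it easier to tell whether a change in `k` or `test_size` is a real improvement or just split-to-split noise.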


#### Previous Scores
- Week of 31 August 2020
    - 1740 rows, k=1, test_size=0.2, % correct: 89.65517241379311%
    - 2114 rows, k=1, test_size=0.2, % correct: 88.88888888888889%
    - 2114 rows, k=2, test_size=0.2, % correct: 88.65248226950354%
    - 2114 rows, k=3, test_size=100, % correct: 94.0% (one-off run; wider variation than the above, with higher highs and lower lows)
    - 2114 rows, k=1, test_size=100, % correct: 91.0% (consistent, some variation ranging 84-93%)

#### Goal Score (30 September 2020)
- n rows, k=k, test_size=test_size, % correct: 98.1%+

#### Goal Deployed Score (31 October 2020)
- n rows, k=k, test_size=live_feed, % correct: 95.1%+

## What's wrong? Predicted v Actual
Incorrect predictions on the left, actual values (labels) on the right. (Assumes labels are correct.)

In [11]:
comp_df = pd.DataFrame()

comp_df['predicted'] = preds
comp_df['actual'] = y_test
comp_df['reference_file'] = test_file_paths

comp_df.tail(2)

Unnamed: 0,predicted,actual,reference_file
633,,,media/kills_only/loop_01064.jpg
634,13.0,13.0,media/kills_only/loop_00996.jpg


In [12]:
comp_df.loc[comp_df.predicted != comp_df.actual]

Unnamed: 0,predicted,actual,reference_file
9,13,22,media/kills_only/loop_01869.jpg
10,13,14,media/kills_only/loop_01829.jpg
11,5,,media/kills_only/loop_01349.jpg
18,13,19,media/kills_only/loop_01865.jpg
31,10,16,media/kills_only/loop_01842.jpg
...,...,...,...
592,10,13,media/kills_only/loop_01440.jpg
600,13,14,media/kills_only/loop_01827.jpg
601,1,,media/kills_only/loop_00431.jpg
607,1,2,media/kills_only/loop_01246.jpg
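
To see which mix-ups happen most often, the wrong rows can be grouped by (actual, predicted) pair and counted. A minimal sketch, assuming a small hypothetical `comp_toy` frame with the same `predicted` and `actual` columns as `comp_df`:

```python
import pandas as pd

# Hypothetical comparison frame; the last row is a correct prediction.
comp_toy = pd.DataFrame({
    "predicted": ["13", "13", "5", "13", "10", "13", "1", "5"],
    "actual":    ["22", "14", "",  "19", "16", "14", "",  "5"],
})

# Keep only the misclassified rows.
wrong = comp_toy.loc[comp_toy.predicted != comp_toy.actual]

# Count each (actual, predicted) pair, most frequent first.
pair_counts = (
    wrong.groupby(["actual", "predicted"])
    .size()
    .sort_values(ascending=False)
    .rename("n_errors")
    .reset_index()
)

print(pair_counts)
```

Pairs near the top of this table are the ones worth inspecting first via their `reference_file` images.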
