# KNN Digits
Implementing a simple KNN to classify digits.

In [1]:
import cv2 as cv
from PIL import Image
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

#### Data Prep

In [2]:
df_28x28 = pd.read_csv('numbers.csv')
df_38x28 = pd.read_csv('digits_only_numbers.csv')

df = pd.concat([df_28x28, df_38x28], ignore_index=True)
df

| | numbers | file_path |
|---|---|---|
| 0 | 0 | media/kills_only/loop_00000.jpg |
| 1 | n | media/kills_only/loop_00001.jpg |
| 2 | 2 | media/kills_only/loop_00002.jpg |
| 3 | 2 | media/kills_only/loop_00003.jpg |
| 4 | 2 | media/kills_only/loop_00004.jpg |
| ... | ... | ... |
| 7987 | i148 | media/digits_only/loop_04686pr.jpg |
| 7988 | 0 | media/digits_only/loop_04687k.jpg |
| 7989 | i148 | media/digits_only/loop_04687pr.jpg |
| 7990 | 0 | media/digits_only/loop_04688k.jpg |


Standardize and/or remove the 'null' labels (`n`, `e`, `b`). Either approach improves accuracy by a few percent.

In [3]:
# Standardize the 'null' labels to an empty string (helps a few %).
# A single .loc assignment avoids chained indexing, which may not modify df.
df.loc[df.numbers.isin(['n', 'e', 'b']), 'numbers'] = ''

# # maybe helps a little less than making them all the same
# df = df.loc[df.numbers != 'n']
# df = df.loc[df.numbers != 'e']
# df = df.loc[df.numbers != 'b']

# df = df.reset_index(drop=True)

# df.tail(3)

Optionally limit the number of samples per label (currently disabled).

In [4]:
# temp_df = pd.DataFrame(columns=df.columns)

# for u in df.numbers.unique():
#     sample = df.loc[df.numbers==u].copy()
#     if len(sample) > 50:
#         sample = sample.sample(50)
#     temp_df = pd.concat([temp_df, sample])

# df = temp_df

# df
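A per-label cap like the one above can also be written without an explicit loop by shuffling the rows and keeping the first few of each group. This is only a sketch on a made-up toy frame; `cap_per_label`, `max_per_label`, and `seed` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def cap_per_label(frame, label_col="numbers", max_per_label=50, seed=0):
    """Keep at most `max_per_label` randomly chosen rows for each label."""
    shuffled = frame.sample(frac=1, random_state=seed)  # shuffle all rows
    capped = shuffled.groupby(label_col).head(max_per_label)
    return capped.sort_index()  # restore the original row order

# Toy data (made up for illustration): label '0' is over-represented.
toy = pd.DataFrame({
    "numbers": ["0"] * 6 + ["1"] * 2 + [""] * 3,
    "file_path": [f"img_{i}.jpg" for i in range(11)],
})
capped = cap_per_label(toy, max_per_label=3)
counts = capped["numbers"].value_counts()
print(counts)
```

Labels with fewer rows than the cap are kept whole, which matches the `if len(sample) > 50` check in the commented-out loop.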

#### What target values are in the dataset?

In [5]:
len(df.numbers.unique()), df.numbers.unique()

(194,
 array(['0', '', '2', '3', '4', '1', '21', '5', '6', '8', '7', '9', '10',
        '11', '12', '14', '15', '16', '13', '17', '18', '19', '24', '27',
        '20', '22', '23', '25', '26', '61', '59', '60', '56', '50', '49',
        '47', '39', '37', '36', '35', '33', '32', '29', '28', '30', '142',
        '139', '138', '133', '132', '130', '129', '128', '126', '122',
        '121', '120', '119', '118', '117', '116', '115', '114', '113',
        '112', '110', '106', '105', '104', '102', '101', '100', '93', '89',
        '86', '79', '76', '73', '71', '65', '64', '55', '45', '43', '42',
        '40', '38', '34', '75', '58', '57', '54', '51', '48', '46', '44',
        '41', 'r', '00', '83', '81', '80', '74', '72', '68', '123', '96',
        '95', '94', '91', '88', '87', '85', '84', '82', '78', '77', '67',
        '66', '63', '62', '53', '52', 'b6', 'b16', '150', '148', '147',
        '92', '90', '144', '143', '140', '136', '135', '131', '127', '111',
        '109', '97', '70', '69', '1

In [6]:
actual_numbers = []
for un in df.numbers.unique():
    try:
        actual_numbers.append(int(un))
    except ValueError:  # skip non-numeric labels such as '', 'r', 'i148'
        pass
    
len(actual_numbers), #sorted(actual_numbers)

(152,)

In [7]:
# Which values in 0..150 never appear as a label?
numeric_labels = set(actual_numbers)
for i in range(151):
    if i not in numeric_labels:
        print(i)

137
149


In [8]:
df.numbers.value_counts()[:10]

      640
0     416
10    376
1     356
5     316
11    309
12    268
4     268
16    258
2     247
Name: numbers, dtype: int64

In [9]:
df.numbers.value_counts()[10:30]

8     241
6     240
3     236
9     229
15    211
14    197
19    157
7     156
13    140
20    135
18    117
17    102
23     75
34     62
28     61
22     58
35     57
38     56
29     53
21     46
Name: numbers, dtype: int64

In [10]:
df.numbers.value_counts()[30:]

56      44
139     43
59      42
24      42
49      42
        ..
i107     1
98       1
i83      1
r        1
i119     1
Name: numbers, Length: 164, dtype: int64

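For `X`: make a list of `[image, file_path]` pairs, where each image is an array flattened to 1D with `.flatten()`. Images that are already 38×28 are read as-is; the others are expanded to a 38×28 box via `crop`, so every flattened array has the same length.

For `y`: the target values come from the `numbers` column.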

After train/test splitting, split the file paths from the arrays (images) so we have an array of file paths and an array of arrays (images) for training and for testing (4 arrays total).

The arrays of file paths (`train_file_paths`, `test_file_paths`) are of no use to our model, and are only recorded so that we can examine particular instances (e.g. to see an incorrectly predicted image).

In [11]:
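def load_image(fp):
    """Return the image at `fp` as a flat 1D array covering 38x28 pixels."""
    if Image.open(fp).size == (38, 28):  # PIL reports (width, height)
        return cv.imread(fp).flatten()   # note: OpenCV loads channels as BGR
    # Expand narrower images to 38x28: 3 px on the left, 7 px on the right.
    # Note: PIL loads channels as RGB, unlike the OpenCV branch above.
    return np.array(Image.open(fp).crop((-3, 0, 35, 28))).flatten()

X = [[load_image(fp), fp] for fp in df.file_path.values]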
y = df.numbers.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# keep file paths 
train_file_paths = np.array([fp for img, fp in X_train])
test_file_paths = np.array([fp for img, fp in X_test])

X_train = np.array([img for img, fp in X_train])
X_test = np.array([img for img, fp in X_test])
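`train_test_split` accepts any number of equal-length arrays and applies the same shuffle to all of them, so the images, labels, and file paths could also be split in one call instead of being paired up and separated afterwards. A minimal sketch on random stand-in data; the `*_demo` arrays and the `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 20
# Stand-ins for the flattened 38x28x3 images, their labels, and file paths.
images_demo = rng.integers(0, 256, size=(n_samples, 38 * 28 * 3))
labels_demo = np.array([str(i % 3) for i in range(n_samples)])
paths_demo = np.array([f"img_{i:05d}.jpg" for i in range(n_samples)])

# All three arrays are split with the same row permutation.
X_tr, X_te, y_tr, y_te, paths_tr, paths_te = train_test_split(
    images_demo, labels_demo, paths_demo, test_size=0.3, random_state=0
)
print(X_tr.shape, X_te.shape, len(paths_tr), len(paths_te))
```

Each row of `X_te` still lines up with the same position in `y_te` and `paths_te`, which is what makes it possible to look up a misclassified image later.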

#### Top 5 Target Values in Train, then Test

In [12]:
print(f'Top 5 Target Values (Train)\n{pd.Series(y_train).value_counts()[:5]}\n\nTop 5 Target Values (Test)\n{pd.Series(y_test).value_counts()[:5]}')

Top 5 Target Values (Train)
      440
0     292
10    284
1     246
11    210
dtype: int64

Top 5 Target Values (Test)
      200
0     124
1     110
5     106
11     99
dtype: int64


#### Create & Train Model
And output an array of predictions (just to see what they look like).

In [13]:
knn = KNeighborsClassifier(n_neighbors=1)

In [14]:
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=1)

In [15]:
knn.predict(X_test)

array(['17', '18', '17', ..., '10', '18', ''], dtype=object)

#### Score Model

In [16]:
preds = knn.predict(X_test)

n_correct = np.sum(preds == y_test)
n_possible = len(y_test)

print(f'n_correct:  {n_correct}\nn_possible: {n_possible}\n% correct:  {n_correct/n_possible*100}%')

n_correct:  2168
n_possible: 2398
% correct:  90.40867389491243%
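
The same figure can be obtained from scikit-learn directly: `accuracy_score` (or `knn.score(X_test, y_test)`) returns the fraction of exact label matches. A small sketch with made-up labels, not the notebook's data:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration; '' stands for the 'null' class.
y_true_demo = np.array(["17", "18", "", "10", "5"])
y_pred_demo = np.array(["17", "13", "", "10", "6"])

manual = np.sum(y_pred_demo == y_true_demo) / len(y_true_demo)
via_sklearn = accuracy_score(y_true_demo, y_pred_demo)
print(f"manual: {manual:.2%}, accuracy_score: {via_sklearn:.2%}")
```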


#### Current 
- 4796 rows, k=1, test_size=0.3, % correct: 93.12022237665045%

#### Previous Scores
- Week of 31 August 2020
    - 1740 rows, k=1, test_size=0.2, % correct: 89.65517241379311%
    - 2114 rows, k=1, test_size=0.2, % correct: 88.88888888888889%
    - 2114 rows, k=2, test_size=0.2, % correct: 88.65248226950354%
    - 2114 rows, k=3, test_size=100, % correct: 94.0% (one off, more range variation than above, higher highs, lower lows)
    - 2114 rows, k=1, test_size=100, % correct: 91.0% (consistent, some variation ranging 84-93%)
    - 2114 rows, k=1, test_size=0.3, % correct (7 runs avg): 88.008998875%
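
Because a single random split can swing by several points (the 84-93% range noted above), averaging over several splits gives a steadier estimate. The sketch below shows one way to do that, using scikit-learn's bundled `load_digits` set as a stand-in for the notebook's images; the helper name `mean_accuracy` and the choice of seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def mean_accuracy(X, y, k=1, test_size=0.3, n_runs=7):
    """Average test accuracy of a k-NN classifier over `n_runs` random splits."""
    scores = []
    for seed in range(n_runs):  # a different, reproducible split per run
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return np.mean(scores), np.std(scores)

# Stand-in data: 8x8 digit images, already flattened to 64 features.
X_demo, y_demo = load_digits(return_X_y=True)
avg, spread = mean_accuracy(X_demo, y_demo, k=1, test_size=0.3, n_runs=7)
print(f"mean accuracy: {avg:.3f} (std {spread:.3f})")
```

Reporting the standard deviation alongside the mean makes it easier to tell whether a change in `k` or in the data is a real improvement or just split-to-split noise.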

#### Goal Score (30 September 2020)
- n rows, k=k, test_size=test_size, % correct: 98.1%+

#### Goal Deployed Score (31 October 2020)
- n rows, k=k, test_size=live_feed, % correct: 95.1%+

## What's wrong? Predicted v Actual
Incorrect predictions on the left, actual values (labels) on the right. (Assumes labels are correct.)

In [17]:
comp_df = pd.DataFrame()

comp_df['predicted'] = preds
comp_df['actual'] = y_test
comp_df['reference_file'] = test_file_paths

comp_df.tail(2)

| | predicted | actual | reference_file |
|---|---|---|---|
| 2396 | 18 | 18 | media/digits_only/loop_04433k.jpg |
| 2397 | | | media/digits_only/loop_01024k.jpg |


In [18]:
comp_df.loc[comp_df.predicted != comp_df.actual]

| | predicted | actual | reference_file |
|---|---|---|---|
| 13 | 93 | 53 | media/digits_only/loop_03649pr.jpg |
| 14 | 130 | 136 | media/digits_only/loop_04460pr.jpg |
| 15 | 50 | | media/digits_only/loop_00080pr.jpg |
| 18 | 23 | 25 | media/kills_only/loop_01891.jpg |
| 29 | i146 | i143 | media/digits_only/loop_04441pr.jpg |
| ... | ... | ... | ... |
| 2364 | 23 | 24 | media/kills_only/loop_01433.jpg |
| 2365 | 91 | 81 | media/digits_only/loop_03564pr.jpg |
| 2367 | 13 | 19 | media/kills_only/loop_01678.jpg |
| 2379 | 16 | 15 | media/kills_only/loop_01831.jpg |
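
To see which specific confusions dominate, the mismatched rows can be grouped by their (actual, predicted) pair. A sketch on a made-up frame with the same column names as `comp_df`; `demo_df` and its values are illustrative only.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "predicted": ["23", "13", "23", "16", "5", "23"],
    "actual":    ["25", "19", "25", "15", "5", "24"],
})

misses = demo_df.loc[demo_df.predicted != demo_df.actual]
# Count each (actual, predicted) pair, most frequent first.
confusions = (
    misses.groupby(["actual", "predicted"])
    .size()
    .sort_values(ascending=False)
)
print(confusions.head())
```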


What are the top 25 targets we are missing?

In [19]:
comp_df.loc[comp_df.predicted != comp_df.actual].actual.value_counts()[:25]

       20
6      11
13      7
18      7
19      7
5       6
11      5
81      5
16      5
60      5
12      4
25      4
8       4
82      4
67      4
72      4
1       4
14      3
85      3
7       3
69      3
136     3
59      3
150     3
123     3
Name: actual, dtype: int64

What are the top 25 targets we are hitting?

In [20]:
comp_df.loc[comp_df.predicted == comp_df.actual].actual.value_counts()[:25]

      180
0     123
1     106
5     100
11     94
10     91
4      86
16     79
12     73
8      70
3      68
15     67
6      63
2      63
9      62
19     55
14     52
20     44
7      44
18     38
13     34
17     27
23     21
28     20
35     17
Name: actual, dtype: int64
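
Raw hit and miss counts mostly mirror how common each label is, so a per-label hit rate can be easier to compare. A sketch on made-up data shaped like `comp_df`; `demo_df` and the `hit` column are hypothetical.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "predicted": ["0", "0", "6", "8", "6", "", "13", "18"],
    "actual":    ["0", "0", "6", "6", "6", "", "18", "18"],
})

demo_df["hit"] = demo_df.predicted == demo_df.actual
per_label = demo_df.groupby("actual")["hit"].agg(["sum", "count", "mean"])
per_label.columns = ["n_correct", "n_total", "accuracy"]
# Labels with the lowest hit rate first.
print(per_label.sort_values("accuracy"))
```

Sorting by `accuracy` rather than by miss count highlights rare labels that are almost always wrong, which the top-25 lists above can hide.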