In [1]:
!pip install --upgrade --quiet pip
!pip install --quiet img2vec_pytorch
print('pip install/upgrade complete.')

pip install/upgrade complete.


In this case let's try turning our images into PNGs, embedding them in a space using ResNet, and then building a model using the embeddings.

In [2]:
from img2vec_pytorch import Img2Vec

SIZE = 512
img2vec = Img2Vec(cuda=False, model='resnet-18', layer='default', layer_output_size=SIZE)
print('built the img2vec/ResNet model.')

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 119MB/s]


built the img2vec/ResNet model.


In [3]:
import arrow
import pandas as pd

TEST = '/kaggle/input/digit-recognizer/test.csv'
TRAIN = '/kaggle/input/digit-recognizer/train.csv'

time_zero = arrow.now()
time_start = arrow.now()
test_df = pd.read_csv(filepath_or_buffer=TEST, )
train_df = pd.read_csv(filepath_or_buffer=TRAIN, )

print('{} data load complete.'.format(arrow.now() - time_start))

0:00:07.505599 data load complete.


Let's get our embeddings. This will take 60-70 minutes on a CPU.

In [4]:
import arrow
import numpy as np
from PIL import Image

def process(bits: pd.Series) -> np.array:
    image = Image.fromarray(bits.to_numpy().reshape(28, 28).astype(float)).convert('RGB')
    return img2vec.get_vec(image, tensor=True).numpy().reshape(SIZE,)

time_start = arrow.now()
test_df['embedding'] = test_df.apply(func=process, axis='columns')
print('{} : training data embedding done.'.format(arrow.now() - time_start))
train_df['embedding'] = train_df.drop(columns=['label']).apply(func=process, axis='columns')

print('{} embedding done'.format(arrow.now() - time_start))

0:28:11.768503 : training data embedding done.
1:10:35.624320 embedding done


In [5]:
from warnings import filterwarnings
from plotly import express
filterwarnings(action='ignore', category=UserWarning)

express.pie(data_frame=train_df, names='label', color='label')

Our classes are unbalanced but not terribly so.

Let's use dimension reduction and see how our data clusters.

In [6]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=True, n_jobs=1, low_memory=False, n_epochs=500)
train_df[['x', 'y']] = pd.DataFrame(data=umap.fit_transform(X=train_df['embedding'].apply(pd.Series)))
express.scatter(data_frame=train_df, x='x', y='y', color='label').show()
print('done with UMAP in {}'.format(arrow.now() - time_start))

2024-03-29 17:16:10.899572: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-29 17:16:10.899845: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-29 17:16:11.110585: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


UMAP(low_memory=False, n_epochs=500, n_jobs=1, random_state=2024, verbose=True)
Fri Mar 29 17:16:32 2024 Construct fuzzy simplicial set
Fri Mar 29 17:16:32 2024 Finding Nearest Neighbors
Fri Mar 29 17:16:32 2024 Building RP forest with 15 trees
Fri Mar 29 17:16:41 2024 NN descent for 15 iterations
	 1  /  15
	 2  /  15
	 3  /  15
	 4  /  15
	 5  /  15
	 6  /  15
	Stopping threshold met -- exiting after 6 iterations
Fri Mar 29 17:17:12 2024 Finished Nearest Neighbor Search
Fri Mar 29 17:17:17 2024 Construct embedding


Epochs completed:   0%|            0/500 [00:00]

	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Fri Mar 29 17:18:51 2024 Finished embedding


done with UMAP in 0:02:26.523343


It's important to remember that our plot shows us what the UMAP projection finds: five maybe six digits are mostly isolated in this projection; the other four are in clusters that are close to other digit clusters: 3 and 5 and to a lesser degree 2 and 7. This is a different result from what we see if we use UMAP to cluster the source data directly; for some reason ResNet does a good job of identifying 8s.

But each cluster has within it cases that are misidentified, as we can easily see in this scatter plot. It's hard to say how many (what proportion of the total) they are, but the scattered hard cases and edge cases are likely the source of our model errors below. 

Let's do a quick parameter study with values of k for KNN and pick the best one.

In [7]:
import arrow
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report


time_start = arrow.now()
target = 'label'
X_train, X_test, y_train, y_test = train_test_split(train_df['embedding'].apply(pd.Series), train_df[target], test_size=0.2, random_state=2024, stratify=train_df[target])


best_k = 1
best = 0
# let's step through a range of cluster sizes to find the one that will give us the best accuracy
for n_neighbors in range(2, 20):
    current = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X=X_train, y=y_train)
    score = f1_score(average='weighted', labels=train_df['label'].unique().tolist(), y_true=y_test, y_pred=current.predict(X=X_test))
    if score > best:
        best = score
        best_k = n_neighbors
    print('neighbors: {} score: {:5.4f}'.format(n_neighbors, score))
        
time_start = arrow.now()
print('building best-k model with k = {}'.format(best_k))
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X=X_train, y=y_train)
print(classification_report(labels=train_df['label'].unique().tolist(), y_true=y_test, y_pred=knn.predict(X=X_test)))
print('model time: {}'.format(arrow.now() - time_start))

neighbors: 2 score: 0.9522
neighbors: 3 score: 0.9611
neighbors: 4 score: 0.9612
neighbors: 5 score: 0.9625
neighbors: 6 score: 0.9609
neighbors: 7 score: 0.9603
neighbors: 8 score: 0.9593
neighbors: 9 score: 0.9592
neighbors: 10 score: 0.9578
neighbors: 11 score: 0.9583
neighbors: 12 score: 0.9567
neighbors: 13 score: 0.9560
neighbors: 14 score: 0.9554
neighbors: 15 score: 0.9558
neighbors: 16 score: 0.9547
neighbors: 17 score: 0.9543
neighbors: 18 score: 0.9536
neighbors: 19 score: 0.9539
building best-k model with k = 5
              precision    recall  f1-score   support

           1       0.99      0.99      0.99       937
           0       0.96      0.99      0.98       827
           4       0.97      0.98      0.97       814
           7       0.95      0.99      0.97       880
           3       0.94      0.97      0.95       870
           5       0.95      0.93      0.94       759
           8       0.98      0.92      0.95       813
           9       0.97      0.95     

Our classification report tells us that indeed our best model has its worst performance for digits 2, 3, and 5, which we might have guessed from the UMAP scatter plot above.

In [8]:
import arrow

time_start = arrow.now()
final = KNeighborsClassifier(n_neighbors=best_k).fit(X=train_df['embedding'].apply(func=pd.Series), y=train_df['label'])
result_df = pd.DataFrame(data=final.predict(X=test_df['embedding'].apply(func=pd.Series)), columns=['Label'])
result_df['ImageId'] = list(range(1, len(test_df) + 1))
output_file = '/kaggle/working/submission.csv.zip'
print('{} : writing submission to {}'.format(arrow.now() - time_start, output_file))
result_df.to_csv(path_or_buf=output_file, index=False, compression='zip')
print('{} : done.'.format(arrow.now() - time_start, ))

0:00:38.725547 : writing submission to /kaggle/working/submission.csv.zip
0:00:38.805853 : done.


Let's try building an ensemble from all of the values of k starting just below our best case. Hint: it doesn't do any better; the errors our KNN model is making aren't randomly distributed for any particular value of k, so using a bunch of different values of k and combining the results using a voting ensemble doesn't magically fix them.

In [9]:
import arrow
from sklearn.ensemble import  VotingClassifier

time_start = arrow.now()
voting = VotingClassifier(estimators=[('k: {}'.format(k), KNeighborsClassifier(n_neighbors=k, )) for k in range(4, 20)], voting='hard').fit(X=train_df['embedding'].apply(func=pd.Series), y=train_df[target])
print('{} : built/trained model.'.format(arrow.now() - time_start, ))
voting_df = pd.DataFrame(columns=['Label'], data=current.predict(X=test_df['embedding'].apply(func=pd.Series)))
voting_df['ImageId'] = list(range(1, len(voting_df)+1))
voting_file = '/kaggle/working/voting.csv.zip'
print('{} : writing submission to {}'.format(arrow.now() - time_start, voting_file))
voting_df.to_csv(path_or_buf=voting_file, index=False, compression='zip')
print('{} : done.'.format(arrow.now() - time_start, ))
print('overall time: {}'.format(arrow.now() - time_zero))

0:00:07.917916 : built/trained model.
0:00:33.802065 : writing submission to /kaggle/working/voting.csv.zip
0:00:33.877913 : done.
overall time: 1:17:24.764048
