In [1]:
import arrow
import pandas as pd

TEST = '/kaggle/input/digit-recognizer/test.csv'
TRAIN = '/kaggle/input/digit-recognizer/train.csv'

time_start = arrow.now()
test_df = pd.read_csv(filepath_or_buffer=TEST)
train_df = pd.read_csv(filepath_or_buffer=TRAIN)
print('{}: data load complete.'.format(arrow.now() - time_start))

0:00:07.276139: data load complete.


Before we train a model, let's visualize the training data by plotting the mean image for each digit.

In [2]:
import arrow
from plotly import express

COLUMNS = [column for column in train_df.columns if column.startswith('pixel')]
TARGET = 'label'

for digit in range(10):
    express.imshow(img=train_df[train_df[TARGET] == digit][COLUMNS].mean().to_numpy().reshape(28, 28)).show()

All of these look pretty sensible; what could possibly go wrong here?

In [3]:
import arrow
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score

ALPHA = 1e-3
COLUMNS = [column for column in train_df.columns if column.startswith('pixel')]
EPSILON = 0.1
ETA0 = 1e-1
FIT_INTERCEPT = True
LEARNING_RATE = ['optimal', 'adaptive'][1]
LOSS = ['hinge', 'log_loss', 'modified_huber', 'squared_hinge', 'perceptron'][0]
MAX_ITER = 1000
PENALTY = ['l2', 'l1', 'elasticnet'][1]
TARGET = 'label'
TOL = 1e-5

time_start = arrow.now()
model = SGDClassifier(loss=LOSS,  penalty=PENALTY, alpha=ALPHA, l1_ratio=0.15, fit_intercept=FIT_INTERCEPT, max_iter=MAX_ITER, tol=TOL, shuffle=True,
                      verbose=0, epsilon=EPSILON, n_jobs=None, random_state=2024, learning_rate=LEARNING_RATE, eta0=ETA0, power_t=0.35,
                      early_stopping=True, validation_fraction=0.6, n_iter_no_change=10, class_weight=None, warm_start=False, 
                      average=False).fit(X=train_df[COLUMNS], y=train_df[TARGET])
print('{} model fit complete.'.format(arrow.now() - time_start))
print('model required {} iterations out of {} to reach tolerance {}'.format(model.n_iter_, MAX_ITER, TOL))
train_pred = model.predict(X=train_df[COLUMNS])  # predict once and reuse for both reports
print('f1: {:5.4f}'.format(f1_score(average='weighted', y_true=train_df[TARGET], y_pred=train_pred)))
print(classification_report(y_true=train_df[TARGET], y_pred=train_pred))

0:02:57.174965 model fit complete.
model required 120 iterations out of 1000 to reach tolerance 1e-05
f1: 0.9045
              precision    recall  f1-score   support

           0       0.96      0.97      0.96      4132
           1       0.97      0.97      0.97      4684
           2       0.89      0.89      0.89      4177
           3       0.87      0.87      0.87      4351
           4       0.92      0.91      0.92      4072
           5       0.87      0.85      0.86      3795
           6       0.94      0.95      0.95      4137
           7       0.93      0.92      0.93      4401
           8       0.82      0.85      0.83      4063
           9       0.86      0.87      0.86      4188

    accuracy                           0.90     42000
   macro avg       0.90      0.90      0.90     42000
weighted avg       0.90      0.90      0.90     42000



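The classification report on the training data tells us a couple of things:
* Our model does not fully reproduce the training labels for any class (the best per-class f1 is 0.97), so we should not expect higher accuracy on a test set drawn from the same distribution. Note that with early_stopping=True and validation_fraction=0.6, only 40% of the rows were used to update the weights; the rest were held out to decide when to stop.
* Our model is weakest on 8s (f1 0.83), followed by 5s and 9s (0.86) and 3s (0.87)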
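
The per-class scores above say which digits are hard, but not which digits they get mistaken for. A confusion table answers that. The following is a minimal sketch using only NumPy on hypothetical toy labels (not notebook output); the helper names `confusion_counts` and `most_confused_pairs` are made up for illustration. In the notebook, `y_true` would be `train_df[TARGET]` and `y_pred` would be the model's predictions, and `sklearn.metrics.confusion_matrix` computes the same table.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Return an (n_classes, n_classes) table: rows are true labels, columns are predictions."""
    table = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when the same (row, column) pair repeats
    np.add.at(table, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return table

def most_confused_pairs(table, top=3):
    """Return the `top` largest off-diagonal cells as (true, predicted, count) tuples."""
    off_diagonal = table.copy()
    np.fill_diagonal(off_diagonal, 0)  # ignore correct predictions
    order = np.argsort(off_diagonal, axis=None)[::-1][:top]
    rows, cols = np.unravel_index(order, off_diagonal.shape)
    return [(int(r), int(c), int(off_diagonal[r, c])) for r, c in zip(rows, cols)]

# Hypothetical toy labels, chosen only to exercise the helpers
y_true = np.array([3, 3, 3, 8, 8, 8, 5, 5, 0, 0])
y_pred = np.array([3, 8, 8, 8, 3, 8, 5, 3, 0, 0])

table = confusion_counts(y_true, y_pred, n_classes=10)
pairs = most_confused_pairs(table)
print(pairs)
```

In this toy data the largest off-diagonal cell is true 3 predicted as 8; on the real data the table would show whether the weak 8, 5, 9 and 3 scores come from a few specific confusions or from errors spread across many classes.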

Let's use dimension reduction to visualize our training data.

In [4]:
import arrow
from plotly import express
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=True, n_jobs=1, low_memory=False, n_epochs=100)

plot_train_df = pd.DataFrame(columns=['x', 'y'], data=umap.fit_transform(X=train_df[COLUMNS]))
plot_train_df[TARGET] = train_df[TARGET].tolist()
express.scatter(data_frame=plot_train_df, x='x', y='y', color=TARGET).show()
print('done with UMAP in {}'.format(arrow.now() - time_start))


UMAP(low_memory=False, n_epochs=100, n_jobs=1, random_state=2024, verbose=True)
Mon Apr  1 15:22:51 2024 Construct fuzzy simplicial set
Mon Apr  1 15:22:51 2024 Finding Nearest Neighbors
Mon Apr  1 15:22:51 2024 Building RP forest with 15 trees
Mon Apr  1 15:23:00 2024 NN descent for 15 iterations
	 1  /  15
	 2  /  15
	 3  /  15
	 4  /  15
	 5  /  15
	Stopping threshold met -- exiting after 5 iterations
Mon Apr  1 15:23:26 2024 Finished Nearest Neighbor Search
Mon Apr  1 15:23:31 2024 Construct embedding


Epochs completed:   0%|            0/100 [00:00]

	completed  0  /  100 epochs
	completed  10  /  100 epochs
	completed  20  /  100 epochs
	completed  30  /  100 epochs
	completed  40  /  100 epochs
	completed  50  /  100 epochs
	completed  60  /  100 epochs
	completed  70  /  100 epochs
	completed  80  /  100 epochs
	completed  90  /  100 epochs
Mon Apr  1 15:23:49 2024 Finished embedding


done with UMAP in 0:00:58.325207


This shows us a couple of things:
* We have some digits that are mostly tightly clustered and isolated: 0, 1, 2, and 6
* We have two sets of three digits that are close to one another: 4/7/9 and 3/5/8
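* Within the cluster for each digit we see isolated points that carry a different label. UMAP never sees the labels, so these are not UMAP errors; they are digits whose pixels resemble another class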
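
The isolated points in the last bullet can be counted rather than eyeballed. One possible approach, sketched below under stated assumptions, is to flag every point whose label differs from the majority label of its nearest neighbours in the 2-D embedding. The function `neighbor_disagreement` and the two synthetic blobs are hypothetical stand-ins for `plot_train_df`; the brute-force distance matrix is only practical for a sample of rows, and something like `sklearn.neighbors.NearestNeighbors` would be a better fit for all 42,000.

```python
import numpy as np

def neighbor_disagreement(points, labels, k=5):
    """Flag points whose label differs from the majority label of their k nearest neighbours."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Brute-force pairwise squared distances: fine for a few thousand points,
    # far too much memory for the full training set
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    flagged = np.zeros(len(points), dtype=bool)
    for i, idx in enumerate(nearest):
        values, counts = np.unique(labels[idx], return_counts=True)
        flagged[i] = values[np.argmax(counts)] != labels[i]
    return flagged

rng = np.random.default_rng(2024)
# Hypothetical stand-in for the embedding: two well-separated blobs of 50 points each
points = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                    rng.normal(5.0, 0.1, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)
labels[7] = 1  # plant one point whose label does not match its cluster

flagged = neighbor_disagreement(points, labels, k=5)
print(np.flatnonzero(flagged))
```

Only the planted point is flagged here. On the real embedding, grouping the flagged rows by label would show which digits most often land in another digit's cluster.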

Now let's use the same UMAP model we fitted above to embed and plot our test data, coloring each point by the label our SGD model predicts for it.

In [5]:
time_start = arrow.now()
plot_test_df = pd.DataFrame(columns=['x', 'y'], data=umap.transform(X=test_df[COLUMNS]))
plot_test_df[TARGET] = model.predict(X=test_df[COLUMNS]).tolist()
express.scatter(data_frame=plot_test_df, x='x', y='y', color=TARGET).show()
print('done with UMAP in {}'.format(arrow.now() - time_start))

Mon Apr  1 15:23:54 2024 Worst tree score: 0.61926190
Mon Apr  1 15:23:54 2024 Mean tree score: 0.62758730
Mon Apr  1 15:23:54 2024 Best tree score: 0.63571429
Mon Apr  1 15:24:00 2024 Forward diversification reduced edges from 630000 to 265896
Mon Apr  1 15:24:05 2024 Reverse diversification reduced edges from 265896 to 265896
Mon Apr  1 15:24:09 2024 Degree pruning reduced edges from 289530 to 289530
Mon Apr  1 15:24:09 2024 Resorting data and graph based on tree order
Mon Apr  1 15:24:09 2024 Building and compiling search function


Epochs completed:   0%|            0/33 [00:00]

	completed  0  /  33 epochs
	completed  3  /  33 epochs
	completed  6  /  33 epochs
	completed  9  /  33 epochs
	completed  12  /  33 epochs
	completed  15  /  33 epochs
	completed  18  /  33 epochs
	completed  21  /  33 epochs
	completed  24  /  33 epochs
	completed  27  /  33 epochs
	completed  30  /  33 epochs


done with UMAP in 0:00:40.698421


Here we see the same basic clusters, which is mostly good news, but also points whose color does not match the cluster they sit in. The locations in this plot come from UMAP and the labels come from our SGD model above, so these points are where the two models disagree.
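
The disagreement between the two models can be turned into a number. As a rough sketch, and not what the notebook does, one could give each test point the label of the nearest per-digit centroid in the training embedding and compare that with the classifier's prediction. The function `nearest_centroid_labels` and the toy coordinates are hypothetical; UMAP clusters are not always compact, so centroids are a crude proxy and a nearest-neighbour vote may work better.

```python
import numpy as np

def nearest_centroid_labels(train_xy, train_labels, test_xy):
    """Label each test point with the class whose training centroid is closest."""
    train_xy = np.asarray(train_xy, dtype=float)
    test_xy = np.asarray(test_xy, dtype=float)
    train_labels = np.asarray(train_labels)
    classes = np.unique(train_labels)
    # One centroid per class, computed in the embedding space
    centroids = np.stack([train_xy[train_labels == c].mean(axis=0) for c in classes])
    dist = ((test_xy[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[np.argmin(dist, axis=1)]

# Hypothetical toy coordinates standing in for plot_train_df / plot_test_df
train_xy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
train_labels = [4, 4, 9, 9]
test_xy = [[0.2, 0.4], [9.8, 10.1], [0.1, 0.9]]
sgd_labels = np.array([4, 9, 9])  # stand-in for model.predict on the test rows

embedding_labels = nearest_centroid_labels(train_xy, train_labels, test_xy)
disagree = embedding_labels != sgd_labels
print(np.flatnonzero(disagree), disagree.mean())
```

The rows where `disagree` is true are the ones worth inspecting by eye, since at least one of the two views is likely wrong about them.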

Let's write out our submission.

In [6]:
import arrow

time_start = arrow.now()

result_df = plot_test_df.drop(columns=['x', 'y']).reset_index()
result_df.columns = ['ImageId', 'Label']
result_df['ImageId'] += 1
result_file = '/kaggle/working/SGDClassifier.csv.zip'
print('{} : writing SGDClassifier result to {}'.format(arrow.now() - time_start, result_file))
result_df.to_csv(path_or_buf=result_file, index=False, compression='zip')
print('{} : done.'.format(arrow.now() - time_start, ))

0:00:00.006142 : writing SGDClassifier result to /kaggle/working/SGDClassifier.csv.zip
0:00:00.086343 : done.


This submission has a public score of about 0.875, which is disappointing but not surprising given that the weighted f1 on the training data was only about 0.90.
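
A score measured on rows the model was fitted on tends to be optimistic, which is one reason the public score comes in lower. A held-out split gives an estimate before submitting. The sketch below is an illustration under assumptions: it uses scikit-learn's small bundled 8x8 digits dataset as a stand-in for the Kaggle CSVs, and only a subset of the notebook's hyperparameters, so the numbers it prints say nothing about the notebook's model.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Assumption: the bundled 8x8 digits stand in for train.csv
X, y = load_digits(return_X_y=True)

# Keep 20% of the rows aside; stratify so every digit appears in both parts
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2024)

clf = SGDClassifier(loss='hinge', penalty='l1', alpha=1e-3,
                    max_iter=1000, tol=1e-5, random_state=2024)
clf.fit(X_fit, y_fit)

fit_f1 = f1_score(y_true=y_fit, y_pred=clf.predict(X_fit), average='weighted')
hold_f1 = f1_score(y_true=y_hold, y_pred=clf.predict(X_hold), average='weighted')
print('f1 on rows used for fitting: {:5.4f}'.format(fit_f1))
print('f1 on held-out rows:         {:5.4f}'.format(hold_f1))
```

With the real data, the same pattern would apply to `train_df[COLUMNS]` and `train_df[TARGET]`, and the held-out f1 would be the number to compare against the public score.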