“Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel. Then, for each image in the training set, create four shifted copies (one per direction) and add them to the training set. Finally, train your best model on this expanded training set and measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing the training set is called data augmentation or training set expansion.”

In [1]:
from sklearn.datasets import fetch_openml

In [2]:
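# as_frame=False returns NumPy arrays, which the reshape and zip calls below rely on
data = fetch_openml('mnist_784', version=1, as_frame=False)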

In [3]:
X = data.data
y = data.target

In [4]:
import numpy as np

In [5]:
y = y.astype(np.uint8)

In [7]:
# Reshape a flattened image vector into a 28x28 matrix and shift it by one
# pixel left, right, up, and down.
# Returns a list of four (augmented_vector, label) tuples, one per direction.
# Note: np.roll wraps edge pixels around to the opposite side; MNIST borders
# are mostly background, so this only approximates a true shift.
def augment_image(image_row, label):
    reshaped = image_row.reshape(28, 28)

    shift_left = np.roll(reshaped, -1, axis=1)
    shift_right = np.roll(reshaped, 1, axis=1)
    shift_up = np.roll(reshaped, -1, axis=0)
    shift_down = np.roll(reshaped, 1, axis=0)

    # A plain list avoids building a ragged NumPy array of (vector, label) pairs
    return [
        (shift_left.flatten(), label),
        (shift_right.flatten(), label),
        (shift_up.flatten(), label),
        (shift_down.flatten(), label)
    ]
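
np.roll moves pixels that fall off one edge back in on the opposite edge, which is not quite the "shift" the exercise describes. The following is a minimal sketch of a zero-filling alternative; the name shift_image and its dx/dy/size parameters are illustrative assumptions, not part of the notebook above.

```python
import numpy as np

def shift_image(image_row, dx, dy, size=28):
    """Shift a flattened size x size image, filling vacated pixels with 0.

    dx > 0 moves right, dx < 0 moves left; dy > 0 moves down, dy < 0 moves up.
    (Hypothetical helper, shown as an alternative to np.roll.)
    """
    image = image_row.reshape(size, size)
    shifted = np.zeros_like(image)
    # Matching source and destination slices along each axis
    src_rows = slice(max(0, -dy), size - max(0, dy))
    dst_rows = slice(max(0, dy), size - max(0, -dy))
    src_cols = slice(max(0, -dx), size - max(0, dx))
    dst_cols = slice(max(0, dx), size - max(0, -dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted.flatten()

# Tiny 3x3 demo so the effect is easy to inspect
demo = np.arange(1, 10)
shifted_right = shift_image(demo, dx=1, dy=0, size=3).reshape(3, 3)
shifted_up = shift_image(demo, dx=0, dy=-1, size=3).reshape(3, 3)
```

In the demo, the column or row that leaves the frame is dropped and the vacated one is filled with zeros instead of wrapping around.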

In [8]:
from collections import defaultdict

In [None]:
augmented_data = defaultdict(list)

# image_row and shifted_label avoid shadowing the `data` and `label` names
for image_row, label in zip(X, y):
    for image_vec, shifted_label in augment_image(image_row, label):
        augmented_data["data"].append(image_vec)
        augmented_data["label"].append(shifted_label)

In [19]:
from itertools import chain

In [26]:
# Stack the original samples on top of the shifted copies
X_augmented = np.concatenate([X, np.array(augmented_data["data"])])
y_augmented = np.concatenate([y, np.array(augmented_data["label"])])

In [15]:
from sklearn.model_selection import train_test_split

In [27]:
X_train, X_test, y_train, y_test = train_test_split(
    X_augmented, y_augmented, test_size=0.3)
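
The exercise asks for shifted copies of the training images only, with accuracy measured on an untouched test set. Splitting after augmentation, as above, lets shifted copies of a test image land in the training split. A minimal sketch of the opposite order (split first, then augment the training part) might look like this; augment_with_shifts and the random stand-in data are assumptions for illustration, not MNIST and not the notebook's own code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def augment_with_shifts(X, y, size=28):
    """Return X, y extended with four one-pixel np.roll shifts of every image."""
    images = X.reshape(-1, size, size)
    shifted = [
        np.roll(images, shift, axis=axis).reshape(len(X), -1)
        for shift, axis in [(-1, 2), (1, 2), (-1, 1), (1, 1)]  # left, right, up, down
    ]
    # Each shifted block keeps the row order of X, so labels repeat five times
    return np.concatenate([X] + shifted), np.concatenate([y] * 5)

# Stand-in data (random pixels, NOT MNIST) just to show the shapes involved
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(100, 784)).astype(np.float64)
y_demo = rng.integers(0, 10, size=100).astype(np.uint8)

# Split first, so the test images never contribute shifted copies to training
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

X_train_aug, y_train_aug = augment_with_shifts(X_train_raw, y_train_raw)
```

With this ordering the training set grows fivefold while the test set keeps its original size and contains only unshifted images.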

In [28]:
from sklearn.linear_model import LogisticRegression

In [29]:
lr = LogisticRegression()

In [30]:
lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
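
The warning above suggests two remedies: raising max_iter or scaling the inputs. A minimal sketch combining both in a pipeline is shown below. It uses scikit-learn's small bundled digits dataset (8x8 images) as a stand-in so it runs quickly; the max_iter value of 1000 is an assumed starting point, not a tuned setting.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in dataset (8x8 digits), not the 28x28 MNIST images used above
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.3, random_state=42)

# Scaling happens inside the pipeline, so it is fitted on the training split only
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),  # assumed limit, raised from the default 100
)
scaled_lr.fit(X_tr, y_tr)
scaled_lr_accuracy = scaled_lr.score(X_te, y_te)
```

Keeping the scaler in the pipeline means the same transformation is applied automatically at prediction time.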

In [31]:
from sklearn.metrics import accuracy_score

In [32]:
y_pred = lr.predict(X_test)

In [33]:
accuracy_score(y_test, y_pred)

0.8955428571428572

In [34]:
from sklearn.ensemble import RandomForestClassifier

In [35]:
from sklearn.model_selection import cross_val_score

In [38]:
rf = RandomForestClassifier()
cross_val_score(rf, X_train, y_train, cv=2, scoring="accuracy")

array([0.96924898, 0.96971429])

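The random forest reaches about 0.97 accuracy in 2-fold cross-validation on the training split, well above the 0.896 test accuracy of the logistic regression, which stopped before converging. The two scores come from different evaluation setups, so the gap is indicative rather than exact.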
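
To compare the two models on equal terms, both can be fitted on the same training split and scored on the same held-out test split. A minimal sketch, again using scikit-learn's bundled digits dataset as a fast stand-in for MNIST (the model settings here are assumptions, not the ones that produced the numbers above):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 8x8 digits rather than MNIST
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.3, random_state=42)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the same training split and score it on the same test split
test_accuracy = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_accuracy[name] = accuracy_score(y_te, model.predict(X_te))
```

The resulting test_accuracy dictionary holds one directly comparable score per model.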