## Dataset

We will use the `mnist` dataset from TensorFlow Datasets (`tensorflow_datasets`), a collection of handwritten digits (0 to 9) that we will use to train a digit classifier.

The `mnist` dataset has been separated into:
- 60,000 samples for training
- 10,000 samples for testing

## Feature

Each example has two features: `image` and `label`.
- `image` is of class `Image` with shape (`x_pixel`, `y_pixel`, `color_channel`), e.g., (28, 28, 1): 28 by 28 pixels with a single color channel, i.e., grayscale.
  - `color_channel` would be 3 for a color image (one channel each for red, green, and blue).
- `label` is of class `ClassLabel`: an integer from 0 to 9 (`num_classes=10`) giving the digit shown in the image.
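To make the shape convention concrete, here is a minimal sketch that uses NumPy arrays as stand-ins for the image tensors. The arrays are filled with zeros and are purely illustrative; the names `gray_image` and `rgb_image` are assumptions and do not come from the dataset.

```python
import numpy as np

# Illustrative stand-ins for image tensors (all zeros, not real MNIST data).
gray_image = np.zeros((28, 28, 1), dtype=np.uint8)  # one channel: grayscale
rgb_image = np.zeros((28, 28, 3), dtype=np.uint8)   # three channels: red, green, blue

# The last axis holds the color channels.
print(gray_image.shape[-1], rgb_image.shape[-1])

# Dropping the single channel axis gives a plain 28 by 28 grid,
# which is the form many plotting functions expect for grayscale images.
gray_2d = gray_image.squeeze(axis=-1)
print(gray_2d.shape)
```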

In [1]:
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds

In [2]:
data, metadata = tfds.load('mnist', as_supervised=True, with_info=True)

2025-03-05 07:58:43.069374: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".


Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /Users/sebastian/tensorflow_datasets/mnist/3.0.1...



Dataset mnist downloaded and prepared to /Users/sebastian/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.


In [4]:
data

{'test': <_PrefetchDataset element_spec=(TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
 'train': <_PrefetchDataset element_spec=(TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>}

In [5]:
metadata

tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_dir='/Users/sebastian/tensorflow_datasets/mnist/incomplete.H2XFBH_3.0.1/',
    file_format=tfrecord,
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=uint8),
        'label': ClassLabel(shape=(), dtype=int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={20

In [6]:
# Prepare the train and test data

data_train = data['train']
data_test = data['test']

In [8]:
data_train

<_PrefetchDataset element_spec=(TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

In [12]:
metadata.features['label']

ClassLabel(shape=(), dtype=int64, num_classes=10)

In [13]:
# `metadata.features['label']` is a `ClassLabel`. According to the TensorFlow Datasets docs, its `.names`
# attribute returns the string names of the classes. With `num_classes=10` and no explicit names,
# these default to ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'].
class_name = metadata.features['label'].names
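To illustrate how the class names relate to the integer labels, here is a small sketch that rebuilds the default names by hand instead of reading them from `metadata`. The list comprehension is a hypothetical stand-in for `metadata.features['label'].names`, and the `label` value is chosen only for illustration.

```python
# Hypothetical stand-in for metadata.features['label'].names when num_classes=10.
class_name = [str(i) for i in range(10)]

# A label is an integer index into the list of names.
label = 7  # example value, not taken from the dataset
print(class_name[label])
```

For MNIST the names are just the digits as strings, so the lookup looks trivial, but the same pattern applies to datasets whose classes have descriptive names.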

In [14]:
# Each pixel value ranges from 0 to 255 (it fits in 1 byte). It's a good idea to normalize the data before training, because neural networks generally train better when the input values are scaled to a small range.

# We will do this for both the training and testing data. To avoid repeating ourselves, we first define a function called normalizer, then use the dataset's `map` method (`tf.data.Dataset.map`) to apply it to every example.

def normalizer(images, labels):
    # The pixel values are integers from 0 to 255, and the normalized values are fractions, so convert to float32 first
    images = tf.cast(images, tf.float32)
    # Divide by 255 to rescale the values from the range 0.0-255.0 to the range 0.0-1.0
    images = images / 255
    
    return images, labels

data_train = data_train.map(normalizer)
data_test = data_test.map(normalizer)

# Cache the datasets in memory: the first full pass stores the normalized examples, and later passes reuse them
data_train = data_train.cache()
data_test = data_test.cache()
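As a quick sanity check of the normalization step, the sketch below applies the same cast-and-divide logic to a synthetic image. It uses NumPy instead of TensorFlow so that it runs without downloading the dataset; the random image and the name `normalize_array` are assumptions made for illustration and are not part of the notebook's pipeline.

```python
import numpy as np

def normalize_array(image):
    # Same idea as normalizer above: cast to float32, then divide by 255.
    return image.astype(np.float32) / 255

# Synthetic 28 by 28 grayscale image with random pixel values from 0 to 255.
rng = np.random.default_rng(seed=0)
fake_image = rng.integers(0, 256, size=(28, 28, 1), dtype=np.uint8)

normalized = normalize_array(fake_image)

# After normalization the values are floats within the range 0.0 to 1.0.
print(normalized.dtype, normalized.min(), normalized.max())
```

The shape is unchanged by normalization; only the dtype and the value range differ.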