In [1]:
import keras
import random
import collections
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist

In [2]:
dataset = keras.datasets.mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
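
The later cells index into `dataset` as `dataset[0]` and `dataset[1]`, so it helps to see how the tuple is nested. The sketch below avoids the download by building zero-filled placeholder arrays with the shapes and dtypes that `mnist.load_data()` returns. The name `dataset_like` and the placeholder arrays are illustrative assumptions, not real MNIST data.

```python
import numpy as np

# Placeholder arrays with the same shapes and dtypes as MNIST (all zeros, not real data).
x_train = np.zeros((60000, 28, 28), dtype=np.uint8)
y_train = np.zeros((60000,), dtype=np.uint8)
x_test = np.zeros((10000, 28, 28), dtype=np.uint8)
y_test = np.zeros((10000,), dtype=np.uint8)

# Same nesting as the tuple returned by load_data():
# ((train_images, train_labels), (test_images, test_labels))
dataset_like = ((x_train, y_train), (x_test, y_test))

# dataset_like[0] is the training pair, dataset_like[1] is the testing pair.
(train_images, train_labels), (test_images, test_labels) = dataset_like
print(train_images.shape, train_labels.shape)
print(test_images.shape, test_labels.shape)
```

Each element of the outer tuple is itself an `(images, labels)` pair, which is the form `reduce_dataset` expects as its `dataset` argument.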


The code below defines a function that shrinks a large dataset of images and labels to a fixed number of examples per label. Step by step:

1. The function is called `reduce_dataset`, and it takes two inputs:
   - `dataset`: an `(images, labels)` tuple, such as `dataset[0]` (training) or `dataset[1]` (testing). The images are pictures of handwritten digits (0 to 9), and each label says which digit its image shows.
   - `target_size`: the number of images to keep for each label. For example, with `target_size` set to 600, the reduced dataset contains 600 images of each digit.

2. The function starts by extracting the images and labels from the input `dataset` tuple.

3. It initializes two empty lists called `reduced_images` and `reduced_labels`. These lists will be used to store the smaller subset of images and labels we select.

4. Next, it creates a `collections.Counter` called `label_counter`, which counts how often each label occurs in the original dataset. The function uses only its keys, that is, the distinct labels; the counts themselves are not used.

5. The function then loops over each distinct label. A `Counter` yields its keys in the order they were first seen, so the labels are visited in the order they first appear in the data, not necessarily from 0 to 9.

6. For each label, it collects the indices (positions) of all examples with that label and uses `random.sample` to pick `target_size` distinct indices from them. These indices identify the images to keep for that label. `random.sample` raises a `ValueError` if a label has fewer than `target_size` examples.

7. It then iterates through the selected indices and adds the corresponding images and labels to the `reduced_images` and `reduced_labels` lists, respectively.

8. Once it has processed all the labels, the function builds `reduced_dataset`, a tuple of two NumPy arrays converted from the lists `reduced_images` and `reduced_labels`.

9. Finally, it returns `reduced_dataset`, a smaller and more manageable subset with exactly `target_size` images for each label. The examples are grouped by label in the order the labels were processed; they are not shuffled.
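
The steps above can also be written with NumPy index arrays instead of Python lists. The following is a minimal sketch of that alternative, not the function used in this notebook: the name `reduce_dataset_np`, the `seed` parameter, and the toy data are assumptions made for illustration. Unlike the notebook's function, it visits labels in sorted order because `np.unique` returns sorted values.

```python
import numpy as np

def reduce_dataset_np(dataset, target_size, seed=0):
    """Hypothetical NumPy variant: keep target_size random examples per label."""
    images, labels = dataset
    rng = np.random.default_rng(seed)  # seeded generator for repeatable sampling
    selected = []
    for label in np.unique(labels):  # distinct labels, in sorted order
        candidates = np.flatnonzero(labels == label)  # positions of this label
        # replace=False draws distinct indices and raises ValueError
        # if fewer than target_size candidates exist.
        selected.append(rng.choice(candidates, size=target_size, replace=False))
    selected = np.concatenate(selected)
    return images[selected], labels[selected]

# Toy data: 10 labels with 50 blank "images" each (not real MNIST data).
toy_labels = np.repeat(np.arange(10), 50).astype(np.uint8)
toy_images = np.zeros((toy_labels.size, 28, 28), dtype=np.uint8)

small_images, small_labels = reduce_dataset_np((toy_images, toy_labels), target_size=5)
print(small_images.shape, small_labels.shape)
```

Indexing the arrays once with the concatenated indices avoids appending images one at a time, and the fixed seed makes the selection reproducible between runs.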

The last part of the code applies this function to the training data `dataset[0]` with `target_size` set to 600 and to the testing data `dataset[1]` with `target_size` set to 100. It then prints the number of images in the new training and testing datasets. With ten digits, the expected totals are 10 × 600 = 6000 and 10 × 100 = 1000.

In [3]:
def reduce_dataset(dataset, target_size):
    """
    Reduce the size of the dataset to the target size for each label.

    Inputs:
        dataset (tuple): The original dataset in tuple form, containing images and labels.
        target_size (int): The target size for each label.

    Output:
        tuple: The reduced dataset in the same format as the original dataset.
    """
    images, labels = dataset
    reduced_images = []
    reduced_labels = []
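    # The Counter is used only for its keys: the distinct labels, in first-seen order
    label_counter = collections.Counter(labels)

    for label in label_counter.keys():
        # Pick target_size distinct random indices among the examples with this label
        # (random.sample raises ValueError if fewer than target_size are available)
        indices = random.sample([i for i, lbl in enumerate(labels) if lbl == label], target_size)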
        for index in indices:
            reduced_images.append(images[index])
            reduced_labels.append(label)

    reduced_dataset = (np.array(reduced_images), np.array(reduced_labels))
    return reduced_dataset

# Reduce the size of the training and testing datasets
new_training_dataset = reduce_dataset(dataset[0], target_size=600)
new_testing_dataset = reduce_dataset(dataset[1], target_size=100)

# Combine the reduced training and testing datasets into a new dataset
new_dataset = (new_training_dataset, new_testing_dataset)

# Verify the sizes of the new dataset
print("Number of images in the new training dataset:", len(new_dataset[0][0]))
print("Number of images in the new testing dataset:", len(new_dataset[1][0]))

Number of images in the new training dataset: 6000
Number of images in the new testing dataset: 1000
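
The totals printed above confirm the overall size but not the balance between labels. A per-label count is a more direct check. The sketch below shows one way to do it with `np.unique`; the helper name `label_counts` and the toy label array are assumptions made for illustration.

```python
import numpy as np

def label_counts(labels):
    """Hypothetical helper: map each distinct label to its number of occurrences."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}

# Toy labels standing in for a reduced label array.
toy_labels = np.array([0, 0, 1, 1, 2, 2])
counts = label_counts(toy_labels)
print(counts)  # {0: 2, 1: 2, 2: 2}

# A balanced dataset has a single distinct count across all labels.
is_balanced = len(set(counts.values())) == 1
print(is_balanced)  # True
```

In this notebook, `label_counts(new_dataset[0][1])` should report 600 for each digit and `label_counts(new_dataset[1][1])` should report 100, given how `reduce_dataset` samples.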
