## Dataset info


For this project we are going to use the Fashion MNIST dataset from Kaggle:
https://www.kaggle.com/datasets/zalando-research/fashionmnist/data .
The dataset contains 70 000 grayscale 28x28 images (60 000 for training and 10 000 for testing) taken from zalando.com,
a popular online clothing shop.
Each image belongs to one of these 10 product categories:
'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'.
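
The labels in the CSV files are integers from 0 to 9. A minimal sketch of mapping them back to category names, assuming the labels follow the order of the list above (`CLASS_NAMES` and `label_to_name` are illustrative names, not part of the dataset):

```python
# Category names indexed by their integer label (0 to 9).
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def label_to_name(label):
    """Return the category name for an integer (or float) label."""
    return CLASS_NAMES[int(label)]


print(label_to_name(2))
```

The `int()` cast is there because the labels may be floats after they have been loaded into a float array.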

The name of this dataset was chosen by its creators because it aims to replace the
boring old MNIST dataset as the standard multiclass image classification benchmark.


## Read the dataset


The images are stored in a CSV file. The simplest way to read it is to use pandas.


In [1]:
import pandas as pd

train_df = pd.read_csv("../data/fashion-mnist_train.csv", sep=",")
test_df = pd.read_csv("../data/fashion-mnist_test.csv", sep=",")

CSV is an unusual format for storing images, but it actually helps if the images are going to be used
to train an ML model. This way, we can skip the step of converting the image from a png or jpg
to an array of numbers, which is the format that a model can understand.

Each CSV file contains 785 columns: 1 for the label and 784 for the 28x28 grayscale pixels.
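
To make the column layout concrete, here is a small sketch that turns one flat row back into a 28x28 image. It uses a synthetic one-row DataFrame instead of the real file, and it assumes the pixels are stored row by row (pixel1 being the top-left corner); `row_to_image` is an illustrative helper, not part of the project code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one row of the CSV: a label plus 784 pixel values.
columns = ["label"] + [f"pixel{i}" for i in range(1, 785)]
values = [9] + list(range(784))
sample_df = pd.DataFrame([values], columns=columns)


def row_to_image(row):
    """Drop the label and reshape the 784 pixel values into a 28x28 array."""
    pixels = row.drop("label").to_numpy()
    return pixels.reshape(28, 28)


image = row_to_image(sample_df.iloc[0])
print(sample_df.shape)  # (1, 785)
print(image.shape)      # (28, 28)
```

The resulting array can then be displayed with an image viewer, for example matplotlib's `imshow`.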


In [3]:
train_df.head()

Unnamed: 0,label,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,6,0,0,0,0,0,0,0,5,0,...,0,0,0,30,43,0,0,0,0,0
3,0,0,0,0,1,2,0,0,0,0,...,3,0,0,0,0,1,0,0,0,0
4,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
test_df.head()

Unnamed: 0,label,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,0,0,0,0,0,0,0,0,9,8,...,103,87,56,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,34,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,0,0,14,53,99,...,0,0,0,0,63,53,31,0,0,0
3,2,0,0,0,0,0,0,0,0,0,...,137,126,140,0,133,224,222,56,0,0
4,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Transformations


We need to convert the image data to the right format.
First, we convert it to a numpy array.


In [5]:
import numpy as np

train_data = np.array(train_df, dtype="float32")
test_data = np.array(test_df, dtype="float32")

Then, we extract the features and labels,
and we normalize the features
(using a float in the range [0, 1] instead of an integer in the range [0, 255]).


In [9]:
def normalize_and_extract_features_labels(data):
    features = data[:, 1:] / 255.0
    labels = data[:, 0]
    return features, labels


x_train, y_train = normalize_and_extract_features_labels(train_data)
x_test, y_test = normalize_and_extract_features_labels(test_data)
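
As a quick sanity check, the same function can be applied to a tiny synthetic array. This is only a sketch: the `toy_data` array below is made up, and the function is repeated so that the snippet runs on its own.

```python
import numpy as np


def normalize_and_extract_features_labels(data):
    features = data[:, 1:] / 255.0  # pixel values scaled to [0, 1]
    labels = data[:, 0]             # first column holds the class label
    return features, labels


# Two fake "images" with 4 pixels each; column 0 is the label.
toy_data = np.array(
    [[3, 0, 51, 102, 255],
     [7, 255, 204, 153, 0]],
    dtype="float32",
)

features, labels = normalize_and_extract_features_labels(toy_data)
print(features.min(), features.max())  # 0.0 1.0
print(labels)                          # [3. 7.]
```

Only the pixel columns are scaled; the labels keep their original values.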

## Train test split


The dataset is already divided into a train and a test set, i.e. "fashion-mnist_train.csv" and "fashion-mnist_test.csv".
However, we don't want to touch the test set until the very end, so that we can obtain a
realistic estimate of the test, a.k.a. out-of-sample, error.
For this reason, we are going to split a validation set off the training set, which gives us a set to compare
the different models that we'll try.
For this we'll use a stratified train test split with sklearn, which keeps the class proportions the same in both parts.


In [1]:
from sklearn.model_selection import train_test_split

x_train, x_validate, y_train, y_validate = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

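
To illustrate what `stratify` does, here is a sketch on synthetic labels (10 classes with 100 samples each, made up for this example). After the split, the per-class counts should be the same for every class in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 samples, 10 balanced classes, 4 dummy features.
labels = np.repeat(np.arange(10), 100)
features = np.zeros((labels.size, 4), dtype="float32")

x_tr, x_val, y_tr, y_val = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Count how many samples of each class ended up in each part.
print(np.bincount(y_tr))
print(np.bincount(y_val))
```

With the 60 000 training images of Fashion MNIST, `test_size=0.2` leaves 48 000 images for training and 12 000 for validation.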

## Save preprocessing


To avoid repeating these steps in every script,
we will create a function that does all of this and directly returns the train, validation and test sets.
We'll save this function in preprocessing.py for later use.


In [None]:
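# In preprocessing.py, this function also needs the imports (pandas, numpy, train_test_split)
# and normalize_and_extract_features_labels from the cells above.
def get_train_test_sets():
    """Read both CSV files and return the train, validation and test features and labels."""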
    # Read
    train_df = pd.read_csv("../data/fashion-mnist_train.csv", sep=",")
    test_df = pd.read_csv("../data/fashion-mnist_test.csv", sep=",")
    # Transform
    train_data = np.array(train_df, dtype="float32")
    test_data = np.array(test_df, dtype="float32")
    x_train, y_train = normalize_and_extract_features_labels(train_data)
    x_test, y_test = normalize_and_extract_features_labels(test_data)
    # Train test split
    x_train, x_validate, y_train, y_validate = train_test_split(
        x_train, y_train, test_size=0.2, random_state=42, stratify=y_train
    )
    return x_train, x_validate, x_test, y_train, y_validate, y_test
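
A sketch of how this function could be exercised without the real files. Everything specific here is an assumption made for illustration: the `data_dir` parameter, the `write_fake_csv` helper and the tiny synthetic CSVs are not part of the original code. The point is only to show the six returned arrays and their shapes.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def normalize_and_extract_features_labels(data):
    features = data[:, 1:] / 255.0
    labels = data[:, 0]
    return features, labels


def get_train_test_sets(data_dir):
    # Hypothetical variant: the folder is a parameter instead of "../data".
    data_dir = Path(data_dir)
    train_df = pd.read_csv(data_dir / "fashion-mnist_train.csv")
    test_df = pd.read_csv(data_dir / "fashion-mnist_test.csv")
    train_data = np.array(train_df, dtype="float32")
    test_data = np.array(test_df, dtype="float32")
    x_train, y_train = normalize_and_extract_features_labels(train_data)
    x_test, y_test = normalize_and_extract_features_labels(test_data)
    x_train, x_validate, y_train, y_validate = train_test_split(
        x_train, y_train, test_size=0.2, random_state=42, stratify=y_train
    )
    return x_train, x_validate, x_test, y_train, y_validate, y_test


def write_fake_csv(path, rows_per_class, seed):
    # Random pixels, 10 balanced classes, same column layout as the real CSVs.
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(10), rows_per_class)
    pixels = rng.integers(0, 256, size=(labels.size, 784))
    columns = ["label"] + [f"pixel{i}" for i in range(1, 785)]
    frame = pd.DataFrame(np.column_stack([labels, pixels]), columns=columns)
    frame.to_csv(path, index=False)


with tempfile.TemporaryDirectory() as tmp:
    write_fake_csv(Path(tmp) / "fashion-mnist_train.csv", rows_per_class=10, seed=0)
    write_fake_csv(Path(tmp) / "fashion-mnist_test.csv", rows_per_class=2, seed=1)
    sets = get_train_test_sets(tmp)

names = ["x_train", "x_validate", "x_test", "y_train", "y_validate", "y_test"]
for name, array in zip(names, sets):
    print(name, array.shape)
```

Note that the function returns six arrays in a fixed order, so callers have to unpack them in that same order.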