# MNIST dataset handling
---

In this notebook we will develop a custom `dataset` class which will be able to:
- import the MNIST dataset from a **url**
- **read** the MNIST dataset and **store** it in a `torch.tensor`
- **save** the dataset in `.pt` format to be easily accessible within the `PyTorch` environment
- provide a method to create the dataset **splits**, according to some proportions
- provide a method to perform some **preprocessing** operations

We will proceed as follows:
- file decoding procedure
    - analysis of the MNIST dataset format (info taken from this [source](http://yann.lecun.com/exdb/mnist/))
    - download the files from the sources
    - read the file and retrieve the data dimensions and type
    - store the data into a `torch.tensor` and save it to disk in `.pt` format
- `dataset` class implementation
    - define a constructor `__init__`
    - provide a method `splits` to split the dataset

## Dataset class construction

> At the heart of PyTorch data loading utility is the `torch.utils.data.DataLoader` class. It represents a Python iterable over a dataset.

The `DataLoader` class takes several arguments; the two most important are:
- `Dataset`: an abstract class representing a dataset. All subclasses of `Dataset` should override:
    - `__getitem__()`: returns a fetched data sample for a given key
    - `__len__()`: returns the size of the dataset
- `sampler` or `batch_sampler`: they define the strategy to draw samples or batches of samples from the dataset.
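
To make the roles of these two pieces concrete, here is a minimal sketch written in plain Python so that it runs without PyTorch. `ToyDataset` and `iterate_batches` are hypothetical names used only for illustration: the first mimics a map-style dataset, the second is a simplified stand-in for what a sampler plus the loader loop do. It is not how `DataLoader` is implemented; a real dataset would subclass `torch.utils.data.Dataset` and be passed to `DataLoader`.

```python
import random


class ToyDataset:
    """Map-style dataset: only __len__ and __getitem__ are required."""

    def __init__(self, samples, labels):
        if len(samples) != len(labels):
            raise ValueError("samples and labels must have the same length")
        self.samples = samples
        self.labels = labels

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx], self.labels[idx]


def iterate_batches(dataset, batch_size, shuffle=False, seed=0):
    """Draw indices (the sampler's job) and fetch the items batch by batch."""
    indices = list(range(len(dataset)))        # uses __len__
    if shuffle:
        random.Random(seed).shuffle(indices)   # random sampling strategy
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        yield [dataset[i] for i in batch_indices]  # uses __getitem__


toy = ToyDataset(samples=[[0.0], [0.1], [0.2], [0.3], [0.4]], labels=[0, 1, 0, 1, 0])
batches = list(iterate_batches(toy, batch_size=2))
print(len(toy), [len(b) for b in batches])  # 5 [2, 2, 1]
```

The loader never needs to know how the data is stored: it only asks the dataset for its length and for the item at a given index.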

So, the first step is to develop a subclass of `torch.utils.data.Dataset` that overrides the aforementioned methods and implements the ones discussed in the `file_decoding_procedure` notebook.


### Class constructor

The class constructor must be able to:
- download the dataset if requested and read it
- create the tensors and handle saving them to and loading them from disk
- leave the dataset empty if requested (this will be useful when creating splits)

### Overriding

Overriding the `__len__()` and `__getitem__()` methods is straightforward.

### Save the dataset

We can reuse the `save` function we already wrote to accomplish this task.



In [1]:
# import os
# import utils
# import torch

# class MNIST(torch.utils.data.Dataset):

#     def __init__(
#           self
#         , folder: str
#         , train: bool
#         , download: bool=False
#         , empty: bool=False
#         ) -> None:
#         """
#         Class constructor.

#         Args:
#             folder (str): folder that contains/will contain the data
#             train (bool): if True the training dataset is built, otherwise the test dataset
#             download (bool): if True the dataset will be downloaded (default = False)
#             empty (bool): if True the tensors will be left empty (default = False)
#         """

#         # user folder check
#         # ------------------------
#         if folder is None:
#             raise FileNotFoundError("Please specify the data folder")
#         if not os.path.exists(folder) or os.path.isfile(folder):
#             raise FileNotFoundError("Invalid data path: {}".format(folder))
#         # ------------------------

#         # utilities
#         # ------------------------
#         if train:
#             urls = {
#                 'training-images': 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz'
#                 , 'training-labels': 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz'
#             }
#             self.save_file = 'training.pt'
#         else:
#             urls = {
#                 'test-images': 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz'
#                 , 'test-labels': 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
#             }
#             self.save_file = 'test.pt'
#         # ------------------------

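#         # class members
#         # ------------------------
#         self.folder = folder    # kept so that new (empty) instances can be created later
#         self.train = train
#         self.raw_folder = os.path.join(folder, "data/raw")
#         self.processed_folder = os.path.join(folder, "data/processed")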
#         # ------------------------

#         # dataset download
#         # ------------------------
#         if download:
#             for name, url in urls.items():
#                 utils.download(url, self.raw_folder, name)
#         # ------------------------
        
#         # dataset folder check
#         # ------------------------
#         else:   # not download
#             if not os.path.exists(self.raw_folder) or os.path.isfile(self.raw_folder):
#                 raise FileNotFoundError("Invalid data path: {}".format(self.raw_folder))
#         # ------------------------

#         # data storing
#         # ------------------------
#         if not empty:
#             for name, _ in urls.items():
#                 filepath = os.path.join(self.raw_folder, name)
#                 if "images" in name:
#                     self.data = utils.store_file_to_tensor(filepath)
#                 elif "labels" in name:
#                     self.labels = utils.store_file_to_tensor(filepath)
#             self.save()
            
#         else:
#             self.data = None
#             self.labels = None
#         # ------------------------
            
    
#     def __len__(self) -> int:
#         """
#         Return the length of the dataset.

#         Returns:
#             length of the dataset (int)
#         """
#         return len(self.data) if self.data is not None else 0

    
#     def __getitem__(
#           self
#         , idx: int
#         ) -> tuple:
#         """
#         Retrieve the item of the dataset at index idx.

#         Args:
#             idx (int): index of the item to be retrieved.
        
#         Returns:
#             tuple: (image, label) 
#         """
#         img, label = self.data[idx], int(self.labels[idx])

#         return (img, label)
    


#     def save(self) -> None:
#         """
#         Save the dataset (tuple of torch.tensors) into a file defined by self.processed_folder and self.save_file.
#         """
#         if not os.path.exists(self.processed_folder):   
#             os.makedirs(self.processed_folder)  

#         # saving the training set into the correct folder
#         with open(os.path.join(self.processed_folder, self.save_file), 'wb') as f:
#             torch.save((self.data, self.labels), f)


#     def load(self) -> None:
#         """
#         Load the .pt file from the path defined by self.processed_folder and self.save_file.
#         """
#         file_path = os.path.join(self.processed_folder, self.save_file)

#         if not os.path.exists(file_path):
#             raise FileNotFoundError("File not found: {}".format(file_path))

#         self.data, self.labels = torch.load(file_path)

    
#     def splits(
#           self
#         , proportions: list=[0.8, 0.2]
#         , shuffle: bool=True
#         ) -> list:
#         """
#         Split the dataset according to the given proportions and return one MNIST instance per proportion (by default two: training and validation).

#         Args:
#             proportions (list): (default=[0.8,0.2]) list of proportions for training set and validation set.
#             shuffle (bool): (default=True) whether to shuffle the dataset or not

#         Returns:
#             list: MNIST instances, one per proportion
#         """

#         # check proportions
#         # ------------------------
#         # compare with a tolerance: floating-point sums such as 0.7 + 0.2 + 0.1 are not exactly 1
#         if not (abs(sum(proportions) - 1.) < 1e-8 and all([p > 0. for p in proportions])):
#             raise ValueError("Invalid proportions: they must (1) sum up to 1 (2) be all positive")
#         # ------------------------

#         # creating a list of empty MNIST objects
#         # ------------------------
#         # assumption: the constructor stores its `folder` and `train` arguments as self.folder and self.train
#         datasets = []
#         for i in range(len(proportions)):
#             datasets.append(MNIST(self.folder, self.train, empty=True))
#         # ------------------------

#         # TODO: the draft stops here; self.data and self.labels still have to be distributed among `datasets`
#         return datasets





In [2]:
import dataset
import torch

path = "./"
data_set = dataset.MNIST(path, train=True, download=False, empty=False)

Loading ./data/raw/training-images ...
./data/raw/training-images loaded!
Loading ./data/raw/training-labels ...
./data/raw/training-labels loaded!


In [3]:
datasets = data_set.splits()

In [4]:
for i in datasets:
    print(i.data.shape)
    print(i.labels.shape)
    print("----------")

torch.Size([48000, 28, 28])
torch.Size([48000])
----------
torch.Size([12000, 28, 28])
torch.Size([12000])
----------
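
The sizes printed above (48000 and 12000) follow directly from the default proportions applied to the 60000 training samples. As a hedged sketch of the idea, the function below turns a list of proportions into disjoint lists of indices. `split_indices` is a hypothetical helper written in plain Python for illustration; it is not the `splits` method of the `dataset` module, which may be implemented differently.

```python
import math
import random


def split_indices(n_samples, proportions=(0.8, 0.2), shuffle=True, seed=0):
    """Return one list of sample indices per proportion."""
    if not proportions or any(p <= 0 for p in proportions):
        raise ValueError("Proportions must be positive")
    # Tolerant comparison: 0.7 + 0.2 + 0.1 is not exactly 1.0 in floating point.
    if not math.isclose(sum(proportions), 1.0):
        raise ValueError("Proportions must sum up to 1")

    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)

    parts, start = [], 0
    for i, p in enumerate(proportions):
        if i == len(proportions) - 1:
            end = n_samples  # the last split takes whatever is left
        else:
            end = min(start + int(round(p * n_samples)), n_samples)
        parts.append(indices[start:end])
        start = end
    return parts


parts = split_indices(60000)
print([len(p) for p in parts])  # [48000, 12000]
```

Indexing the image and label tensors with each list of indices would then yield the data for the corresponding split.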


In [5]:
data_set.statistics()

N. samples:    	60000
Classes:       	{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Classes distr.: 	tensor([ 9.8717, 11.2367,  9.9300, 10.2183,  9.7367,  9.0350,  9.8633, 10.4417,
         9.7517,  9.9150])
Data type:     	<class 'torch.Tensor'>
Data shape:    	torch.Size([28, 28])



In [10]:
def foo():
    return [2,3]

[a, b] = foo()

In [11]:
print(a, b)

2 3


In [12]:
a = torch.rand(3,3)

In [15]:
b = torch.unsqueeze(a, 0)

In [16]:
a

tensor([[0.2055, 0.8854, 0.4279],
        [0.4239, 0.9809, 0.5838],
        [0.5909, 0.7128, 0.4041]])

In [18]:
b.shape

torch.Size([1, 3, 3])