In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

# Adding and processing the Fashion-MNIST (FMNIST) Dataset

*"Raw Data is Read Only." Sing it with me.*


A raw dataset is really just a list of files (and some useful metadata) that is later processed into a usable dataset. From a data provenance perspective, the most important things to know about raw data are:
* Raw data is **hash-verified**. This ensures that if something changes upstream, we know about that change.
* Raw data is **read only**, and is used to generate a separate and reproducible **Dataset** object.
* Raw data is **not saved** in the source code repository. (In fact, the whole `data` directory is specifically excluded in our `.gitignore`.)

Our approach to building a usable dataset is:

1. Assemble the raw data files.
2. Generate (and record) hashes to ensure the validity of these files.
3. Add LICENSE and DESCR (description) metadata to make the raw data usable for other people.
4. Write a function to process the raw data into a usable format.
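To make step 2 concrete, here is a minimal sketch of hash verification using only the standard library. The helper names `file_md5` and `verify_file` are hypothetical and are not part of this project's `src.data` module, which may do this differently.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fd:
        # Read fixed-size chunks so large files never need to fit in memory.
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_md5):
    """Return True when the file's MD5 digest matches the recorded value."""
    return file_md5(path) == expected_md5.lower()
```

Given a downloaded file and its recorded hash, `verify_file(path, recorded_hash)` returns `False` whenever the upstream file has changed, which is exactly the signal we want.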



# Assemble the Raw Data Files

In [None]:
dataset_name="f_mnist"

Fashion-MNIST is a dataset of Zalando's article images—consisting of a
training set of 60,000 examples and a test set of 10,000
examples. Each example is a 28x28 grayscale image, associated with a
label from 10 classes. Fashion-MNIST is intended to serve as a direct
drop-in replacement for the original MNIST dataset for benchmarking
machine learning algorithms. It shares the same image size and
structure of training and testing splits.

The dataset is free to use under an MIT license.

It can be found online at https://github.com/zalandoresearch/fashion-mnist

Looking at the documentation in this repo, we see that the raw FMNIST is distributed as a set of 4 files. Because Zalando are excellent data citizens, they have conveniently given us MD5 hashes that we can verify for this raw data.



| Name  | Content | Examples | Size | Link | MD5 Checksum|
| --- | --- |--- | --- |--- |--- |
| `train-images-idx3-ubyte.gz`  | training set images  | 60,000|26 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz)|`8d4fb7e6c68d591d4c3dfef9ec88bf0d`|
| `train-labels-idx1-ubyte.gz`  | training set labels  |60,000|29 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz)|`25c81989df183df01b3e8a0aad5dffbe`|
| `t10k-images-idx3-ubyte.gz`  | test set images  | 10,000|4.3 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz)|`bef4ecab320f06d8554ea6380940ec79`|
| `t10k-labels-idx1-ubyte.gz`  | test set labels  | 10,000| 5.1 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz)|`bb300cfdad3c16e7a12a480ee83cd310`|


## Data Directories, `paths` and `pathlib`

Recall from our `README.md` the locations of our data files:

* `data`
    * Data directory, often symlinked to a filesystem with lots of space
    * `data/raw` 
        * Raw (immutable) hash-verified downloads
    * `data/interim` 
        * Extracted and interim data representations
    * `data/processed` 
        * The final, canonical data sets for modeling.

We don't want to hardcode these paths in our scripts.  This is what the `src.paths` module is for.
* `src.paths.raw_data_path`
* `src.paths.interim_data_path`
* `src.paths.processed_data_path`

A quick aside: use `pathlib`! Read more here: https://realpython.com/python-pathlib/
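As a rough illustration of why `pathlib` helps, the sketch below builds the data directories from a single root. The `project_dir` value is a made-up placeholder, and the real `src.paths` module may compute its paths differently.

```python
from pathlib import Path

# Hypothetical project root, used here for illustration only.
project_dir = Path("my_project")

data_path = project_dir / "data"
raw_data_path = data_path / "raw"
interim_data_path = data_path / "interim"
processed_data_path = data_path / "processed"

# The `/` operator joins path components without manual string handling.
label_file = interim_data_path / "f_mnist" / "train-labels-idx1-ubyte"
print(label_file.name)
print(label_file.parent.name)
```

Because every location derives from one root, moving the data directory means changing a single line rather than hunting for hardcoded strings.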

In [None]:
from src.paths import raw_data_path, interim_data_path
from src.data.utils import list_dir

In [None]:
!rm -f ../data/raw/*.license

In [None]:
print(f"{raw_data_path}")
list_dir(raw_data_path)

The next step is to fetch these files and check their hashes. 

In [None]:
from src.data import RawDataset

In [None]:
data_site = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com'
file_list = [
    ('train-images-idx3-ubyte.gz','8d4fb7e6c68d591d4c3dfef9ec88bf0d'),
    ('train-labels-idx1-ubyte.gz','25c81989df183df01b3e8a0aad5dffbe'),
    ('t10k-images-idx3-ubyte.gz', 'bef4ecab320f06d8554ea6380940ec79'),
    ('t10k-labels-idx1-ubyte.gz', 'bb300cfdad3c16e7a12a480ee83cd310'),
]

In [None]:
fmnist = RawDataset(dataset_name)
for file, hashval in file_list:
    url = f"{data_site}/{file}"
    fmnist.add_url(url=url, hash_type='md5', hash_value=hashval)
fmnist.add_url(url='https://raw.githubusercontent.com/zalandoresearch/fashion-mnist/master/LICENSE',
            name='LICENSE', file_name='f_mnist.license')
fmnist.fetch()

In [None]:
fmnist.file_list

In [None]:
fmnist.fetched_files_

In [None]:
!ls ../data/raw

In [None]:
fmnist.unpack()

In [None]:
ds_opts = fmnist.process()

In [None]:
from src.data import Dataset

In [None]:
ds = Dataset(**ds_opts)
print(ds.LICENSE)

In [None]:
%debug

In [None]:
from src.data.utils import partial_call_signature
partial_call_signature(fmnist.load_function)

In [None]:
fmnist.file_list

In [None]:
from src.data.datasets import load_dataset

In [None]:
# we can load the newly created dataset
ds = load_dataset('f_mnist')
type(ds)

What's a `Dataset` object?
* data: the processed data
* target: (optional) target vector (for supervised learning problems)
* metadata: Data about the data

Under the hood, this is basically a dictionary.

Right now, this is empty, since we haven't yet specified how to process the raw files into usable data.
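To illustrate what "basically a dictionary" can mean, here is a minimal sketch of a dict whose keys are also readable as attributes. `DictDataset` is a hypothetical stand-in written for this example; it is not the project's actual `Dataset` class.

```python
class DictDataset(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None

    def __setattr__(self, key, value):
        self[key] = value


# An "empty" dataset: metadata is present, but data and target are not yet filled in.
ds_sketch = DictDataset(data=None, target=None, metadata={"dataset_name": "f_mnist"})
print(ds_sketch.metadata["dataset_name"])
```

With this layout, `ds_sketch.data` and `ds_sketch["data"]` refer to the same entry.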

## Process the raw files into `data` and `target`

How do we turn these raw files into processed data?

First, we need to unpack them. 

In [None]:
from src.data import fetch_and_unpack

In [None]:
untar_dir = fetch_and_unpack(dataset_name)
print(f"{untar_dir}:\n {list_dir(untar_dir)}")

`fetch_and_unpack` knows how to handle most compressed data types. Unpacked data is stored at `interim_data_path/dataset_name`.
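For the `.gz` files in this dataset, unpacking amounts to roughly the following. This is a minimal sketch using only the standard library; `gunzip_file` is a hypothetical helper and not the actual implementation of `fetch_and_unpack`.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dest_dir):
    """Decompress a single .gz file into dest_dir and return the output path."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.stem  # `stem` drops the trailing ".gz"
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        # Stream the decompressed bytes so the whole file is never held in memory.
        shutil.copyfileobj(f_in, f_out)
    return dest
```

Note that the raw `.gz` file is only read, never modified, which keeps the raw data directory immutable.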

In [None]:
from src.paths import interim_data_path

In [None]:
list_dir(interim_data_path / dataset_name)

Looking at https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py, we see how to process this data.
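The unpacked files use the IDX layout inherited from the original MNIST: a 4-byte magic number (two zero bytes, a data type code, and the number of dimensions), followed by one big-endian 32-bit size per dimension. That is why label files have an 8-byte header and image files a 16-byte header. The sketch below parses such a header; `read_idx_header` is a hypothetical helper, and it only handles the unsigned-byte type code (`0x08`) used by these files.

```python
import struct


def read_idx_header(raw):
    """Parse an IDX header from bytes.

    Returns (dims, offset): the dimension sizes and the byte offset where data begins.
    """
    # Bytes 0-1 are zero, byte 2 is the data type code, byte 3 is the dimension count.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a big-endian unsigned 32-bit integer.
    dims = struct.unpack(f">{ndim}I", raw[4:4 + 4 * ndim])
    return dims, 4 + 4 * ndim
```

Reading the header this way, rather than hardcoding offsets, also gives a cheap sanity check that the item count matches what the documentation promises.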


We need numpy. How do we add this to the environment?
* Add it to `environment.yml`
* `make requirements`

In [None]:
import numpy as np

In [None]:
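kind = "train"
# Labels: skip the 8-byte header (magic number, item count).
label_path = interim_data_path / dataset_name / f"{kind}-labels-idx1-ubyte"
with open(label_path, 'rb') as fd:
    target = np.frombuffer(fd.read(), dtype=np.uint8, offset=8)
# Images: skip the 16-byte header (magic number, item count, rows, columns),
# then flatten each 28x28 image into a row of 784 pixel values.
dataset_path = interim_data_path / dataset_name / f"{kind}-images-idx3-ubyte"
with open(dataset_path, 'rb') as fd:
    data = np.frombuffer(fd.read(), dtype=np.uint8, offset=16).reshape(len(target), 784)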

In [None]:
data.shape

## Adding a DESCR

But what is this data? We should document what this Dataset represents. Enter the first of two special pieces of metadata: DESCR. Let's be nice to our users and document our dataset.

In [None]:
fmnist_readme = '''
Fashion-MNIST
=============

Notes
-----
Data Set Characteristics:
    :Number of Instances: 70000
    :Number of Attributes: 784
    :Attribute Information: 28x28 8-bit greyscale image
    :Missing Attribute Values: None
    :Creator: Zalando
    :Date: 2017

This is a copy of Zalando's Fashion-MNIST [F-MNIST] dataset:
https://github.com/zalandoresearch/fashion-mnist

Fashion-MNIST is a dataset of Zalando's article images—consisting of a
training set of 60,000 examples and a test set of 10,000
examples. Each example is a 28x28 grayscale image, associated with a
label from 10 classes. Fashion-MNIST is intended to serve as a direct
drop-in replacement for the original MNIST dataset for benchmarking
machine learning algorithms. It shares the same image size and
structure of training and testing splits.

References
----------
  - [F-MNIST] Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms.
    Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
'''

In [None]:
from src.data import add_dataset_metadata

In [None]:
add_dataset_metadata(dataset_name, kind='DESCR', from_str=fmnist_readme)

In [None]:
ds = load_dataset(dataset_name)

In [None]:
print(ds)