When you are developing a module, it's really handy to have these lines. They make the notebook reload edited modules automatically before each cell runs:

In [None]:
%load_ext autoreload
%autoreload 2

We want to see debug-level logging in the notebook. Here's the incantation:

In [None]:
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

# Adding and processing the Fashion-MNIST (FMNIST) Dataset

A raw dataset is really just a list of files (and some useful metadata) that is later processed into a usable dataset. From a data provenance perspective, the most important things to know about raw data are:
* Raw data is **hash-verified**. This ensures that if something changes upstream, we know about that change.
* Raw data is **read only**, and is used to generate a separate and reproducible **Dataset** object.
* Raw data is **not saved** in the source code repository. (In fact, the whole `data` directory is specifically excluded in our `.gitignore`.) Instead, the recipe for obtaining and processing the raw data is saved. (A snapshot of the raw data can be synced with a large data repo, like an AWS bucket, if desired.)

Our approach to building a usable dataset is:

1. Assemble the raw data files. Generate (and record) hashes to ensure the validity of these files.
2. Add LICENSE and DESCR (description) metadata to make the raw data usable for other people.
3. Write a function to process the raw data into a usable format (for us, a `Dataset` object).



## Assemble the Raw Data Files

In [None]:
dataset_name = "f-mnist"

Fashion-MNIST is a dataset of Zalando's article images—consisting of a
training set of 60,000 examples and a test set of 10,000
examples. Each example is a 28x28 grayscale image, associated with a
label from 10 classes. Fashion-MNIST is intended to serve as a direct
drop-in replacement for the original MNIST dataset for benchmarking
machine learning algorithms. It shares the same image size and
structure of training and testing splits.

The dataset is free to use under an MIT license.

It can be found online at https://github.com/zalandoresearch/fashion-mnist



### Data Directories, `paths` and `pathlib`

Recall from our `README.md` the locations of our data files:

* `data`
    * Data directory, often symlinked to a filesystem with lots of space
    * `data/raw` 
        * Raw (immutable) hash-verified downloads
    * `data/interim` 
        * Extracted and interim data representations, such as caches
    * `data/processed` 
        * The final, cleaned and processed data sets for modeling.

However, we **do not want to hardcode these paths** in our scripts.  This is what our `src.paths` module is for.
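
To make the idea concrete, here is a minimal sketch of what a module like `src.paths` could contain. The three path names mirror the imports used in this notebook; everything else (the `project_dir` and `data_path` names, and the fixed example root) is an assumption for illustration, not the actual implementation.

```python
from pathlib import Path

# Assumption: a real module would derive the project root from its own
# location, e.g. Path(__file__).resolve().parents[1]. A fixed example
# root is used here so the sketch runs anywhere.
project_dir = Path("/tmp/example-project")

# Every other location is defined relative to the data directory, so
# relocating the data only requires changing one line.
data_path = project_dir / "data"
raw_data_path = data_path / "raw"
interim_data_path = data_path / "interim"
processed_data_path = data_path / "processed"

print(raw_data_path.relative_to(project_dir))
```

Scripts then import these names instead of spelling out directory strings, which keeps the directory layout defined in a single place.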


In [None]:
from src.paths import raw_data_path, interim_data_path, processed_data_path

A quick aside: Use `pathlib`! Read more here: https://realpython.com/python-pathlib/

In [None]:
# Makes exploring from a notebook a bit easier
from src.data.utils import list_dir

In [None]:
print(f"{raw_data_path}")
list_dir(raw_data_path)


### Download and Check Hashes
The next step is to fetch these files and check (or generate) their hashes. The object we use to house this information is a `RawDataset`
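
As a rough illustration of what "checking a hash" means, the sketch below computes the MD5 digest of a file with the standard library and compares it to an expected value. The helper names `md5_of_file` and `verify` are hypothetical; this is not how `RawDataset` is implemented, just the underlying idea.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=8192):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fd:
        # Reading in chunks keeps memory use flat for large downloads.
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """True if the file's MD5 digest matches `expected_md5`."""
    return md5_of_file(path) == expected_md5.lower()


# Demonstrate on a small temporary file instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    expected = hashlib.md5(b"hello").hexdigest()
    matches = verify(sample, expected)
    mismatch = verify(sample, "0" * 32)

print(matches, mismatch)
```

If the upstream file ever changes, the digest no longer matches the recorded value and we find out at fetch time rather than halfway through an analysis.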

In [None]:
from src.data import RawDataset

Looking at the FMNIST GitHub documentation, we see that the raw data is distributed as a set of 4 files. Because Zalando are excellent data citizens, they have conveniently given us MD5 hashes that we can verify when we download this data.

| Name  | Content | Examples | Size | Link | MD5 Checksum|
| --- | --- |--- | --- |--- |--- |
| `train-images-idx3-ubyte.gz`  | training set images  | 60,000|26 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz)|`8d4fb7e6c68d591d4c3dfef9ec88bf0d`|
| `train-labels-idx1-ubyte.gz`  | training set labels  |60,000|29 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz)|`25c81989df183df01b3e8a0aad5dffbe`|
| `t10k-images-idx3-ubyte.gz`  | test set images  | 10,000|4.3 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz)|`bef4ecab320f06d8554ea6380940ec79`|
| `t10k-labels-idx1-ubyte.gz`  | test set labels  | 10,000| 5.1 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz)|`bb300cfdad3c16e7a12a480ee83cd310`|


In [None]:
# Specify the raw files and their hashes
data_site = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com'
file_list = [
    ('train-images-idx3-ubyte.gz','8d4fb7e6c68d591d4c3dfef9ec88bf0d'),
    ('train-labels-idx1-ubyte.gz','25c81989df183df01b3e8a0aad5dffbe'),
    ('t10k-images-idx3-ubyte.gz', 'bef4ecab320f06d8554ea6380940ec79'),
    ('t10k-labels-idx1-ubyte.gz', 'bb300cfdad3c16e7a12a480ee83cd310'),
]

In [None]:
fmnist = RawDataset(dataset_name)
for file, hashval in file_list:
    url = f"{data_site}/{file}"
    fmnist.add_url(url=url, hash_type='md5', hash_value=hashval)
# Download and check the hashes
fmnist.fetch()

### Add a License and Description
How do we transform this data into a usable dataset? We need to know at least two more things:
1. License: Am I allowed to use this data? Are there any restrictions on its use?
2. Description: data about the raw data.

Start with the **License**. Zalando are good data citizens, so their data license is directly available from
their repo on GitHub: https://raw.githubusercontent.com/zalandoresearch/fashion-mnist/master/LICENSE

In [None]:
# Notice we tag this data with the name `LICENSE`
fmnist.add_url(url='https://raw.githubusercontent.com/zalandoresearch/fashion-mnist/master/LICENSE',
            name='LICENSE', file_name=f'{dataset_name}.license')


Next, we should put together a quick **Description** of the dataset. Remember, you're writing this for *future you*, who will have long since forgotten what this is and where you got it. Include:  
* What does the raw data look like?
* Where did I get it from? 
* What format is it in?
* What should it look like when it's processed?

In [None]:
fmnist_readme = '''
Fashion-MNIST
=============

Notes
-----
Data Set Characteristics:
    :Number of Instances: 70000
    :Number of Attributes: 784
    :Attribute Information: 28x28 8-bit greyscale image
    :Missing Attribute Values: None
    :Creator: Zalando
    :Date: 2017

This is a copy of Zalando's Fashion-MNIST [F-MNIST] dataset:
https://github.com/zalandoresearch/fashion-mnist

Fashion-MNIST is a dataset of Zalando's article images—consisting of a
training set of 60,000 examples and a test set of 10,000
examples. Each example is a 28x28 grayscale image, associated with a
label from 10 classes. Fashion-MNIST is intended to serve as a direct
drop-in replacement for the original [MNIST] dataset for benchmarking
machine learning algorithms. It shares the same image size and
structure of training and testing splits.

References
----------
  - [F-MNIST] Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms.
    Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
  - [MNIST] The MNIST Database of handwritten digits. Yann LeCun, Corinna Cortes,
    Christopher J.C. Burges. http://yann.lecun.com/exdb/mnist/
'''

fmnist.add_metadata(kind="DESCR", contents=fmnist_readme)

In [None]:
# Since we modified the file list (adding the 2 metadata files), 
# fetch() will re-check and redownload all files if necessary
fmnist.fetch()

### Unpack the downloaded files
By default, everything is unpacked to `interim_data_path/dataset_name`
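
For a single `.gz` file, unpacking amounts to something like the sketch below. The `gunzip` helper is hypothetical and only uses the standard library; the real `unpack()` may handle more archive types and differ in detail.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(src, dest_dir):
    """Decompress the .gz file `src` into `dest_dir`, dropping the .gz suffix."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # 'train-labels-idx1-ubyte.gz' -> 'train-labels-idx1-ubyte'
    dest = dest_dir / src.stem
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


# Demonstrate with a tiny archive created on the fly.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "raw" / "example-idx1-ubyte.gz"
    archive.parent.mkdir()
    with gzip.open(archive, "wb") as fd:
        fd.write(b"payload")
    unpacked = gunzip(archive, Path(tmp) / "interim" / "f-mnist")
    unpacked_name = unpacked.name
    unpacked_bytes = unpacked.read_bytes()

print(unpacked_name)
```

Note that the raw archive is only read, never modified, which fits the rule that raw data is read only.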

In [None]:
unpack_path = fmnist.unpack()

In [None]:
print(unpack_path)
list_dir(unpack_path)

### Processing the `RawDataset` into a `Dataset`

In [None]:
ds = fmnist.process()

In [None]:
ds = fmnist.process()

In [None]:
print(f"{ds.DESCR}\n\n{ds.LICENSE}\n\n{ds}")

In [None]:
fmnist.load_function

### Writing a processing function
We need to transform the raw data files into usable `data` and `target` vectors.
The code at https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py tells us how to do that:
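
The `offset=8` and `offset=16` arguments used when reading these files skip the IDX header: a 4-byte magic number followed by one 4-byte big-endian integer per dimension (one dimension for labels, three for images). The sketch below builds tiny synthetic files in memory to show this; the byte strings are made up for illustration and are not real Fashion-MNIST data.

```python
import struct

import numpy as np

# Synthetic label file: magic number 2049, 3 items, then one byte per label.
label_bytes = struct.pack(">II", 2049, 3) + bytes([9, 0, 3])
# Synthetic image file: magic number 2051, 3 images of 2x2 pixels.
image_bytes = struct.pack(">IIII", 2051, 3, 2, 2) + bytes(range(12))

# The label header is 8 bytes long, so the payload starts at offset 8.
label_magic, n_labels = struct.unpack(">II", label_bytes[:8])
target = np.frombuffer(label_bytes, dtype=np.uint8, offset=8)

# The image header is 16 bytes long, so the payload starts at offset 16.
image_magic, n_images, rows, cols = struct.unpack(">IIII", image_bytes[:16])
data = np.frombuffer(image_bytes, dtype=np.uint8, offset=16)
data = data.reshape(n_images, rows * cols)

print(target.shape, data.shape)
```

For the real files, `rows * cols` is 28 * 28 = 784, which is where the hardcoded `784` in the reader comes from.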





In [None]:
import numpy as np

unpack_path = fmnist.unpack()
kind = "train"

label_path = unpack_path / f"{kind}-labels-idx1-ubyte"
with open(label_path, 'rb') as fd:
    target = np.frombuffer(fd.read(), dtype=np.uint8, offset=8)
dataset_path = unpack_path / f"{kind}-images-idx3-ubyte"
with open(dataset_path, 'rb') as fd:
    data = np.frombuffer(fd.read(), dtype=np.uint8, offset=16).reshape(len(target), 784)

print(f'Data: {data.shape}, Target: {target.shape}')

#### The Dataset Constructor
A processing function must produce a dictionary of kwargs that can be used in a call to `Dataset()`:
    

In [None]:
from src.data import Dataset
help(Dataset.__init__)

A processing function **must** accept `dataset_name` and `metadata` as keywords. `dataset_name` should be obvious. `metadata` contains a dictionary that we will (optionally) add to as we construct the dataset.
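
Here is a toy version of that contract. `ToyDataset` is a hypothetical stand-in for `Dataset` (only the keyword names are taken from this notebook), and `process_toy` is a made-up processing function that generates its data in memory.

```python
class ToyDataset:
    """Hypothetical stand-in for `Dataset`, for illustration only."""

    def __init__(self, dataset_name=None, data=None, target=None, metadata=None):
        self.dataset_name = dataset_name
        self.data = data
        self.target = target
        self.metadata = metadata or {}


def process_toy(dataset_name="toy", metadata=None):
    """Return a dict of kwargs suitable for the constructor above."""
    if metadata is None:
        metadata = {}
    # A processing function may add its own fields to the metadata.
    metadata["note"] = "generated in memory"
    return {
        "dataset_name": dataset_name,
        "data": [[0, 1], [2, 3]],
        "target": [0, 1],
        "metadata": metadata,
    }


# The returned dict is unpacked straight into the constructor.
toy = ToyDataset(**process_toy(metadata={"license": "MIT"}))
print(toy.dataset_name, toy.metadata)
```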

By convention we will store these functions in `src/data/localdata.py`:

In [None]:
%%file ../src/data/localdata.py
__all__ = ['process_mnist']

from ..paths import interim_data_path

import numpy as np

def process_mnist(dataset_name='mnist', kind='train', metadata=None):
    '''
    Load the MNIST dataset (or a compatible variant; e.g. F-MNIST)

    dataset_name: {'mnist', 'f-mnist'}
        Which variant to load
    kind: {'train', 'test'}
        Dataset comes pre-split into training and test data.
        Indicates which dataset to load
    metadata: dict
        Additional metadata fields will be added to this dict.
        'subset': file prefix ('train' or 't10k') of the subset that was loaded
    '''
    if metadata is None:
        metadata = {}
        
    if kind == 'test':
        kind = 't10k'

    label_path = interim_data_path / dataset_name / f"{kind}-labels-idx1-ubyte"
    with open(label_path, 'rb') as fd:
        target = np.frombuffer(fd.read(), dtype=np.uint8, offset=8)
    dataset_path = interim_data_path / dataset_name / f"{kind}-images-idx3-ubyte"
    with open(dataset_path, 'rb') as fd:
        data = np.frombuffer(fd.read(), dtype=np.uint8,
                                       offset=16).reshape(len(target), 784)
    metadata['subset'] = kind
    
    dset_opts = {
        'dataset_name': dataset_name,
        'data': data,
        'target': target,
        'metadata': metadata,
    }
    return dset_opts


In [None]:
from src.data.localdata import process_mnist

### Aside: Partial functions
We can build a new function from an existing function, fixing some of the parameters. In Python, this is done with `functools.partial`. (In other languages, you might hear it called *partial application*. It is closely related to, but not the same as, a *closure*.)
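
A tiny standalone example, using a made-up `describe` function, shows the mechanics:

```python
from functools import partial


def describe(dataset_name, kind="train"):
    """Toy function used only to demonstrate partial application."""
    return f"{dataset_name}:{kind}"


# Fix `dataset_name`; the remaining parameters can still be supplied later.
describe_fmnist = partial(describe, dataset_name="f-mnist")

train_label = describe_fmnist()
test_label = describe_fmnist(kind="test")

# A partial object remembers the wrapped function and the fixed keywords.
fixed_keywords = describe_fmnist.keywords

print(train_label, test_label, fixed_keywords)
```

Because the partial object exposes `.func` and `.keywords`, it can be inspected later to see which function and which fixed arguments it carries.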

In [None]:
from functools import partial
fmnist.load_function = partial(process_mnist, dataset_name='f-mnist')

Now we can try processing our dataset:

In [None]:
ds = fmnist.process(force=True)
ds.data.shape, ds.target.shape

### To and From JSON
We want to be able to save and load this `RawDataset` object.
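
One wrinkle is that a function (such as the partial we just assigned) is not directly JSON-serializable. A possible approach, sketched below with hypothetical helpers, is to store the function's module, its name, and the fixed keyword arguments, then re-import it on load. This is an assumption about how such a round trip could work, not a description of what `to_json()` actually does.

```python
import importlib
import json
from functools import partial


def partial_to_json(func):
    """Serialize a keyword-only partial as a JSON string (hypothetical helper)."""
    return json.dumps({
        "module": func.func.__module__,
        "name": func.func.__name__,
        "kwargs": func.keywords,
    })


def partial_from_json(json_str):
    """Rebuild the partial by importing the function by module and name."""
    spec = json.loads(json_str)
    module = importlib.import_module(spec["module"])
    return partial(getattr(module, spec["name"]), **spec["kwargs"])


# Demonstrate with a builtin so the sketch is self-contained.
parse_binary = partial(int, base=2)
restored = partial_from_json(partial_to_json(parse_binary))
value = restored("101")

print(value)
```

This only works for importable, module-level functions with JSON-friendly keyword arguments, which is one reason to keep processing functions in a module rather than defining them inline in a notebook.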


In [None]:
jj = fmnist.to_json()

In [None]:
f2 = RawDataset.from_json('f-mnist', json_str=jj)

In [None]:
f2.fetch()
f2.unpack()
ds2 = f2.process()

In [None]:
ds2.data.shape, ds2.target.shape

# HERE BE DRAGONS

What's a `Dataset` object?
* data: the processed data
* target: (optional) target vector (for supervised learning problems)
* metadata: Data about the data

Under the hood, this is basically a dictionary
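
To illustrate "basically a dictionary", here is a minimal sketch of a dict subclass whose keys are also reachable as attributes. `AttrDict` is a hypothetical name and the actual `Dataset` class may be built differently; the point is only that `ds.data` and `ds["data"]` can refer to the same thing.

```python
class AttrDict(dict):
    """A dict whose keys can also be read and written as attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            # Raise the exception type that attribute access is expected to raise.
            raise AttributeError(key) from None

    def __setattr__(self, key, value):
        self[key] = value


toy = AttrDict(data=[[0, 1], [2, 3]], target=[0, 1], metadata={})
toy.metadata["descr"] = "toy example"
toy.dataset_name = "toy"

print(toy.data, toy["dataset_name"], sorted(toy))
```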

Right now, this is empty, since we haven't given any indication how to process the raw files into usable data

## Process the raw files into `data` and `target`

How do we turn these raw files into processed data?

First, we need to unpack them. 

In [None]:
from src.data import fetch_and_unpack

In [None]:
untar_dir = fetch_and_unpack(dataset_name)
print(f"{untar_dir}:\n {list_dir(untar_dir)}")

`fetch_and_unpack` knows how to handle most compressed data types. Unpacked data is stored at `interim_data_path/dataset_name`.

In [None]:
from src.paths import interim_data_path

In [None]:
list_dir(interim_data_path / dataset_name)

From https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py
we can see how to process this data.


We need numpy. How do we add this to the environment?
* Add it to `environment.yml`
* `make requirements`

In [None]:
import numpy as np

In [None]:
kind = "train"
label_path = interim_data_path / dataset_name / f"{kind}-labels-idx1-ubyte"
with open(label_path, 'rb') as fd:
    target = np.frombuffer(fd.read(), dtype=np.uint8, offset=8)
dataset_path = interim_data_path / dataset_name / f"{kind}-images-idx3-ubyte"
with open(dataset_path, 'rb') as fd:
    data = np.frombuffer(fd.read(), dtype=np.uint8, offset=16).reshape(len(target), 784)

In [None]:
data.shape

## Adding a DESCR

But what is this data? We should document what this Dataset represents. Enter the first of two special pieces of metadata: DESCR. Let's be nice to our users and document our dataset.

In [None]:
fmnist_readme = '''
Fashion-MNIST
=============

Notes
-----
Data Set Characteristics:
    :Number of Instances: 70000
    :Number of Attributes: 784
    :Attribute Information: 28x28 8-bit greyscale image
    :Missing Attribute Values: None
    :Creator: Zalando
    :Date: 2017

This is a copy of Zalando's Fashion-MNIST [F-MNIST] dataset:
https://github.com/zalandoresearch/fashion-mnist

Fashion-MNIST is a dataset of Zalando's article images—consisting of a
training set of 60,000 examples and a test set of 10,000
examples. Each example is a 28x28 grayscale image, associated with a
label from 10 classes. Fashion-MNIST is intended to serve as a direct
drop-in replacement for the original MNIST dataset for benchmarking
machine learning algorithms. It shares the same image size and
structure of training and testing splits.

References
----------
  - [F-MNIST] Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms.
    Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
'''

In [None]:
from src.data import add_dataset_metadata

In [None]:
add_dataset_metadata(dataset_name, kind='DESCR', from_str=fmnist_readme)

In [None]:
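# Assumption: `load_dataset` is exported from `src.data`, like `add_dataset_metadata`
from src.data import load_dataset
ds = load_dataset(dataset_name)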

In [None]:
print(ds)