When you are developing in a module, it's really handy to have these lines:

In [1]:
%load_ext autoreload
%autoreload 2

We want to see debug-level logging in the notebook. Here's the incantation

In [2]:
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

# Adding and processing the Fashion-MNIST (FMNIST) Dataset

A raw dataset is really just a list of files (and some useful metadata) that is later processed into a usable dataset. From a data provenance perspective, the most important things to know about raw data are:
* Raw data is **hash-verified**. This ensures that if something changes upstream, we know about that change.
* Raw data is **read only**, and is used to generate a separate and reproducible **Dataset** object
* Raw data is **not saved** in the source code repository. (in fact, the whole `data` directory is specifically excluded in our `.gitignore`). Instead, the recipe for obtaining and processing the raw data is saved. (a snapshot of the raw data can be synced with a large data repo, like an AWS bucket, if desired.)

Our approach to building a usable dataset is:

1. Assemble the raw data files. Generate (and record) hashes to ensure the validity of these files.
3. Add LICENSE and DESCR (description) metadata to make the raw data usable for other people, and
4. Write a function to process the raw data into a usable format (for us, a `Dataset` object)



## Assemble the Raw Data Files

In [3]:
dataset_name="f-mnist"

Fashion-MNIST is a dataset of Zalando's article images—consisting of a
training set of 60,000 examples and a test set of 10,000
examples. Each example is a 28x28 grayscale image, associated with a
label from 10 classes. Fashion-MNIST is intended to serve as a direct
drop-in replacement for the original MNIST dataset for benchmarking
machine learning algorithms. It shares the same image size and
structure of training and testing splits.

The dataset is free to use under an MIT license.

It can be found online at https://github.com/zalandoresearch/fashion-mnist



### Data Directories, `paths` and `pathlib`

Recall from our `README.md` the locations of our data files

* `data`
    * Data directory. often symlinked to a filesystem with lots of space
    * `data/raw` 
        * Raw (immutable) hash-verified downloads
    * `data/interim` 
        * Extracted and interim data representations, such as caches
    * `data/processed` 
        * The final, cleaned and processed data sets for modeling.

However, we **do not want to hardcode these paths** in our scripts.  This is what our `src.paths` module is for.


In [4]:
from src.paths import raw_data_path, interim_data_path, processed_data_path

A quick aside: Use `pathlib`! Read more here:https://realpython.com/python-pathlib/

In [5]:
# Makes exploring from a notebook a bit easier
from src.data.utils import list_dir

In [6]:
print(f"{raw_data_path}")
list_dir(raw_data_path)


/home/ava00088/src/devel/bus_number/data/raw


['f-mnist.readme',
 'train-labels-idx1-ubyte.gz',
 't10k-labels-idx1-ubyte.gz',
 't10k-images-idx3-ubyte.gz',
 'train-images-idx3-ubyte.gz',
 'f-mnist.license']

### Download and Check Hashes
The next step is to fetch these files and check (or generate) their hashes. The object we use to house this information is a `RawDataset`

In [7]:
from src.data import RawDataset

Looking at the FMNIST GitHub documentation, we see that the raw data is distributed as a set of 4 files. Because Zalando are excellent data citizens, they have conveniently given us MD5 hashes that we can verify when we download this data.

| Name  | Content | Examples | Size | Link | MD5 Checksum|
| --- | --- |--- | --- |--- |--- |
| `train-images-idx3-ubyte.gz`  | training set images  | 60,000|26 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz)|`8d4fb7e6c68d591d4c3dfef9ec88bf0d`|
| `train-labels-idx1-ubyte.gz`  | training set labels  |60,000|29 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz)|`25c81989df183df01b3e8a0aad5dffbe`|
| `t10k-images-idx3-ubyte.gz`  | test set images  | 10,000|4.3 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz)|`bef4ecab320f06d8554ea6380940ec79`|
| `t10k-labels-idx1-ubyte.gz`  | test set labels  | 10,000| 5.1 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz)|`bb300cfdad3c16e7a12a480ee83cd310`|


In [8]:
# Specify the raw files  and their hashes
data_site = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com'
file_list = [
    ('train-images-idx3-ubyte.gz','8d4fb7e6c68d591d4c3dfef9ec88bf0d'),
    ('train-labels-idx1-ubyte.gz','25c81989df183df01b3e8a0aad5dffbe'),
    ('t10k-images-idx3-ubyte.gz', 'bef4ecab320f06d8554ea6380940ec79'),
    ('t10k-labels-idx1-ubyte.gz', 'bb300cfdad3c16e7a12a480ee83cd310'),
]

In [9]:
fmnist = RawDataset(dataset_name)
for file, hashval in file_list:
    url = f"{data_site}/{file}"
    fmnist.add_url(url=url, hash_type='md5', hash_value=hashval)
# Download and check the hashes
fmnist.fetch()

DEBUG:src.logging:No file_name specified. Inferring train-images-idx3-ubyte.gz from URL
DEBUG:src.logging:train-images-idx3-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring train-labels-idx1-ubyte.gz from URL
DEBUG:src.logging:train-labels-idx1-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring t10k-images-idx3-ubyte.gz from URL
DEBUG:src.logging:t10k-images-idx3-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring t10k-labels-idx1-ubyte.gz from URL
DEBUG:src.logging:t10k-labels-idx1-ubyte.gz already exists and hash is valid


True

In [10]:
list_dir(raw_data_path)

['f-mnist.readme',
 'train-labels-idx1-ubyte.gz',
 't10k-labels-idx1-ubyte.gz',
 't10k-images-idx3-ubyte.gz',
 'train-images-idx3-ubyte.gz',
 'f-mnist.license']

### Add License and Description
Before we can turn this raw data into a usable dataset, we need to know 2 things:
1. What does the raw data look like? Where did I get it from? What format is it in? What should it look like when it's processed? (DESCR)
2. Am I allowed to use this data? (LICENSE)

In [11]:
# Easy case. Zalando are good data citizens, so their data License is directly available from
# their Raw Data Repo on github

# Notice we tag this data with the name `LICENSE`
fmnist.add_url(url='https://raw.githubusercontent.com/zalandoresearch/fashion-mnist/master/LICENSE',
            name='LICENSE', file_name=f'{dataset_name}.license')


In [12]:
# What does the raw data look like?
# Where did I get it from? 
# What format is it in?
# What should it look like when it's processed?
fmnist_readme = '''
Fashion-MNIST
=============

Notes
-----
Data Set Characteristics:
    :Number of Instances: 70000
    :Number of Attributes: 728
    :Attribute Information: 28x28 8-bit greyscale image
    :Missing Attribute Values: None
    :Creator: Zalando
    :Date: 2017

This is a copy of Zalando's Fashion-MNIST [F-MNIST] dataset:
https://github.com/zalandoresearch/fashion-mnist

Fashion-MNIST is a dataset of Zalando's article images—consisting of a
training set of 60,000 examples and a test set of 10,000
examples. Each example is a 28x28 grayscale image, associated with a
label from 10 classes. Fashion-MNIST is intended to serve as a direct
drop-in replacement for the original [MNIST] dataset for benchmarking
machine learning algorithms. It shares the same image size and
structure of training and testing splits.

References
----------
  - [F-MNIST] Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms.
    Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
  - [MNIST] The MNIST Database of handwritten digits. Yann LeCun, Corinna Cortes,
    Christopher J.C. Burges. http://yann.lecun.com/exdb/mnist/
'''

fmnist.add_metadata(kind="DESCR", contents=fmnist_readme)

In [13]:
fmnist.fetch()

DEBUG:src.logging:No file_name specified. Inferring train-images-idx3-ubyte.gz from URL
DEBUG:src.logging:train-images-idx3-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring train-labels-idx1-ubyte.gz from URL
DEBUG:src.logging:train-labels-idx1-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring t10k-images-idx3-ubyte.gz from URL
DEBUG:src.logging:t10k-images-idx3-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring t10k-labels-idx1-ubyte.gz from URL
DEBUG:src.logging:t10k-labels-idx1-ubyte.gz already exists and hash is valid
DEBUG:src.logging:f-mnist.license exists, but no hash to check. Setting to sha1:a8a7a35b62521386e849ce242bdc89964e177b12
DEBUG:src.logging:Creating f-mnist.readme from `contents` string
DEBUG:src.logging:f-mnist.readme exists, but no hash to check. Setting to sha1:db57a3964b6b3515901f665412297aabf69e007e


True

In [14]:
fmnist.unpack()

INFO:src.logging:Ungzipping train-images-idx3-ubyte
INFO:src.logging:Ungzipping train-labels-idx1-ubyte
INFO:src.logging:Ungzipping t10k-images-idx3-ubyte
INFO:src.logging:Ungzipping t10k-labels-idx1-ubyte
INFO:src.logging:Copying f-mnist.license
INFO:src.logging:Copying f-mnist.readme


PosixPath('/home/ava00088/src/devel/bus_number/data/interim/f-mnist')

In [15]:
ds = fmnist.process()

DEBUG:src.logging:Found cached Dataset for f-mnist: d95a6db563698fce9ad2afc908c56a11c7693ade


In [16]:
# do it a second time. note it is cached!
ds = fmnist.process()

DEBUG:src.logging:Found cached Dataset for f-mnist: d95a6db563698fce9ad2afc908c56a11c7693ade


In [17]:
print(ds.DESCR)
print(ds.LICENSE)
print(ds)


Fashion-MNIST

Notes
-----
Data Set Characteristics:
    :Number of Instances: 70000
    :Number of Attributes: 728
    :Attribute Information: 28x28 8-bit greyscale image
    :Missing Attribute Values: None
    :Creator: Zalando
    :Date: 2017

This is a copy of Zalando's Fashion-MNIST [F-MNIST] dataset:
https://github.com/zalandoresearch/fashion-mnist

Fashion-MNIST is a dataset of Zalando's article images—consisting of a
training set of 60,000 examples and a test set of 10,000
examples. Each example is a 28x28 grayscale image, associated with a
label from 10 classes. Fashion-MNIST is intended to serve as a direct
drop-in replacement for the original [MNIST] dataset for benchmarking
machine learning algorithms. It shares the same image size and
structure of training and testing splits.

References
----------
  - [F-MNIST] Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms.
    Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
  - [MNIST] The M

## Converting a `RawDataset` into a usable `Dataset`

What's a `Dataset` object?
It's a scikit-learn-style `Bunch`, containing:
* data: the processed data
* target: (optional) target vector (for supervised learning problems)
* metadata: Data about the data

Under the hood, this is basically a dictionary

Right now, `data` and `target` are empty, since we haven't given any indication how to process the raw files into usable data

## Process the raw files into `data` and `target`

How do we turn these raw files into processed data?

First, we need to unpack them. Fortunately, the `RawDataset` object knows how to unpack or decompress most common archive formats.

In [36]:
unpack_dir = fmnist.unpack()
print(unpack_dir)

DEBUG:src.logging:Raw Dataset f-mnist is already unpacked. Skipping


/home/ava00088/src/devel/bus_number/data/interim/f-mnist


By default, raw data is unpacked to `interim_data_path` in a directory corresponding to the `dataset_name`

In [39]:
from src.paths import interim_data_path

In [40]:
list_dir(interim_data_path / dataset_name)

['train-labels-idx1-ubyte',
 'f-mnist.readme',
 'train-images-idx3-ubyte',
 't10k-images-idx3-ubyte',
 't10k-labels-idx1-ubyte',
 'f-mnist.license']

https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py
we see how to process this data. H


### Processing the raw data
Finally, we need to convert the raw data into usable `data` and `target` vectors.
The code at https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py tells us how to do that. Having a look at the sample code, we notice that we need numpy. How do we add this to the environment?
* Add it to `environment.yml`
* `make requirements`

Once we have done this, we can do a:

In [19]:
import numpy as np

unpack_path = fmnist.unpack()
kind = "train"

label_path = unpack_path / f"{kind}-labels-idx1-ubyte"
with open(label_path, 'rb') as fd:
    target = np.frombuffer(fd.read(), dtype=np.uint8, offset=8)
dataset_path = unpack_path / f"{kind}-images-idx3-ubyte"
with open(dataset_path, 'rb') as fd:
    data = np.frombuffer(fd.read(), dtype=np.uint8, offset=16).reshape(len(target), 784)

print(f'Data: {data.shape}, Target: {target.shape}')

DEBUG:src.logging:Raw Dataset f-mnist is already unpacked. Skipping


Data: (60000, 784), Target: (60000,)


### Building a `Dataset`
A Processing function produces a dictionary of kwargs that can be used as a `Dataset` constructor:
    

In [20]:
from src.data import Dataset
help(Dataset.__init__)

Help on function __init__ in module src.data.dset:

__init__(self, dataset_name=None, data=None, target=None, metadata=None, license_txt=None, descr_txt=None, license_file=None, descr_file=None, **kwargs)
    Object representing a dataset object.
    Notionally compatible with scikit-learn's Bunch object
    
    dataset_name: string (required)
        key to use for this dataset
    data:
        Data: (usually np.array or np.ndarray)
    target: np.array
        Either classification target or label to be used. for each of the points
        in `data`
    metadata: dict
        Data about the object. Key fields include `license_txt` and `descr`
    license_txt: str
        String to use as the LICENSE for this dataset
    license_file: filename
        If `license_txt` is None, license text can be read from this file
    descr_txt: str
        String to use as the DESCR (description) for this dataset
    descr_file: filename
        If `descr_txt` is None, description text can be rea

In addition, a processing function must should accept `dataset_name` and `metadata` keywords. Any additional metadata should be added to the `metadata` object that is passed in.


Rewriting the sample code into this framework gives us this:

In [21]:
%%file ../src/data/localdata.py
__all__ = ['process_mnist']

from ..paths import interim_data_path

import numpy as np

def process_mnist(dataset_name='mnist', kind='train', metadata=None):
    '''
    Load the MNIST dataset (or a compatible variant; e.g. F-MNIST)

    dataset_name: {'mnist', 'f-mnist'}
        Which variant to load
    kind: {'train', 'test'}
        Dataset comes pre-split into training and test data.
        Indicates which dataset to load
    metadata: dict
        Additional metadata fields will be added to this dict.
        'kind': value of `kind` used to generate a subset of the data
    '''
    if metadata is None:
        metadata = {}
        
    if kind == 'test':
        kind = 't10k'

    label_path = interim_data_path / dataset_name / f"{kind}-labels-idx1-ubyte"
    with open(label_path, 'rb') as fd:
        target = np.frombuffer(fd.read(), dtype=np.uint8, offset=8)
    dataset_path = interim_data_path / dataset_name / f"{kind}-images-idx3-ubyte"
    with open(dataset_path, 'rb') as fd:
        data = np.frombuffer(fd.read(), dtype=np.uint8,
                                       offset=16).reshape(len(target), 784)
    metadata['subset'] = kind
    
    dset_opts = {
        'dataset_name': dataset_name,
        'data': data,
        'target': target,
        'metadata': metadata,
    }
    return dset_opts


Overwriting ../src/data/localdata.py


In [22]:
from functools import partial
from src.data.localdata import process_mnist

In [23]:
fmnist.unpack(force=True)
fmnist.load_function = partial(process_mnist, dataset_name='f-mnist')
ds = fmnist.process(force=True)

INFO:src.logging:Ungzipping train-images-idx3-ubyte
INFO:src.logging:Ungzipping train-labels-idx1-ubyte
INFO:src.logging:Ungzipping t10k-images-idx3-ubyte
INFO:src.logging:Ungzipping t10k-labels-idx1-ubyte
INFO:src.logging:Copying f-mnist.license
INFO:src.logging:Copying f-mnist.readme
DEBUG:src.logging:Wrote d95a6db563698fce9ad2afc908c56a11c7693ade.metadata
DEBUG:src.logging:Wrote d95a6db563698fce9ad2afc908c56a11c7693ade.dataset


In [24]:
ds.data.shape, ds.target.shape

((60000, 784), (60000,))

## Adding this Dataset to the master dataset list

In [41]:
from src.data.datasets import add_dataset, load_dataset, available_datasets

In [42]:
add_dataset(fmnist)

In [47]:
ds = load_dataset("f-mnist", kind="test")
print(f"Data:{ds.data.shape}, Target:{ds.target.shape}")

DEBUG:src.logging:No file_name specified. Inferring train-images-idx3-ubyte.gz from URL
DEBUG:src.logging:train-images-idx3-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring train-labels-idx1-ubyte.gz from URL
DEBUG:src.logging:train-labels-idx1-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring t10k-images-idx3-ubyte.gz from URL
DEBUG:src.logging:t10k-images-idx3-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring t10k-labels-idx1-ubyte.gz from URL
DEBUG:src.logging:t10k-labels-idx1-ubyte.gz already exists and hash is valid
DEBUG:src.logging:f-mnist.license already exists and hash is valid
DEBUG:src.logging:Creating f-mnist.readme from `contents` string
DEBUG:src.logging:f-mnist.readme already exists and hash is valid
INFO:src.logging:Ungzipping train-images-idx3-ubyte
INFO:src.logging:Ungzipping train-labels-idx1-ubyte
INFO:src.logging:Ungzipping t10k-image

Data:(10000, 784), Target:(10000,)


In [48]:
ds = load_dataset("f-mnist", kind="train")
print(f"Data:{ds.data.shape}, Target:{ds.target.shape}")

DEBUG:src.logging:No file_name specified. Inferring train-images-idx3-ubyte.gz from URL
DEBUG:src.logging:train-images-idx3-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring train-labels-idx1-ubyte.gz from URL
DEBUG:src.logging:train-labels-idx1-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring t10k-images-idx3-ubyte.gz from URL
DEBUG:src.logging:t10k-images-idx3-ubyte.gz already exists and hash is valid
DEBUG:src.logging:No file_name specified. Inferring t10k-labels-idx1-ubyte.gz from URL
DEBUG:src.logging:t10k-labels-idx1-ubyte.gz already exists and hash is valid
DEBUG:src.logging:f-mnist.license already exists and hash is valid
DEBUG:src.logging:Creating f-mnist.readme from `contents` string
DEBUG:src.logging:f-mnist.readme already exists and hash is valid
INFO:src.logging:Ungzipping train-images-idx3-ubyte
INFO:src.logging:Ungzipping train-labels-idx1-ubyte
INFO:src.logging:Ungzipping t10k-image

Data:(60000, 784), Target:(60000,)


### Check in the new `datasets.json`

* Check it in to source code control
* do a `make data`
