In [18]:
%load_ext autoreload
%autoreload 2

In [19]:
import logging
from src.logging import logger
logger.setLevel(logging.DEBUG)

# Tutorial 1: Reproducible Data (1h)
*"Raw Data is Read Only. Sing it with me"*


* RawDataset
  * Fetching + Unpack
      * Example 1: lvq-pak
      * Exercise: fmnist
  * Processing data
    * Process into data, (optionally, target)
    * create a process_my_dataset() function
        * Example 1: lvq-pak
        * Exercise: fmnist
    * save the raw dataset to the raw dataset catalog

* Datasets and Data Transformers
    * Create a transformer to produce a Dataset from the RawDataset
    * Add this dataset to the catalog
    * Load the dataset
        * example: lvq-pak
        * exercise: fmnist_test, fmnist_train
    
    * More Complicated Transformers
        * Example: Train/Test Split on lvq-pak
        * Exercise: merge labels on lvq-pak
        * Exercise: merge labels on fmnist
        
* Punchline: 
  * make clean_raw, clean_cache, clean_processed, (clean_data?) `make data`

## Introducing the Characters:

### Bjørn
* Bjørn is a dot-com millionaire. Currently he heads the Ikea R&D kitchen in Sweden.
* Bjørn employs a large number of Finnish line cooks. He can’t understand a word they say.
* Bjørn needs a **trained model** to do real-time translation from Finnish to Swedish.
* Reproducibility means that even though test kitchens have notoriously high turnover, Bjørn can ask his sous-chef to **deploy updated models** whenever needed.

### Mark
* Mark used to be an astronaut. After a rough year, he decided to change careers, and now works for a fashion magazine in NYC.
* Mark has to keep up with thousands of new fashions every week. More if it’s Fashion Week.
* Mark wants to **publish a paper and software library** for accelerating fashion review using science. 
* Reproducibility means Mark's peer reviewers can **reproduce his work**, his paper will get accepted, his library will get pulled into *scikit-learn*, and he will become a hero to the data-based fashion industry.


## Introducing the `RawDataset`
The `RawDataset` object handles downloading, unpacking, and processing raw data files, and serves as a container for some basic metadata, including **documentation** and **license** information.



Raw data files are downloaded to `paths.raw_data_path`.
Cache files and unpacked raw files are saved to `paths.interim_data_path`.
    

### Fetching and Unpacking Raw Data
#### Example: Bjørn and lvq-pak

##### LVQ-Pak: A Finnish Phonetic dataset

The Learning Vector Quantization (lvq-pak) project includes a simple Finnish phonetic dataset
consisting of 20-dimensional Mel Frequency Cepstrum Coefficients (MFCCs) labelled with target phoneme information. Our goal is to explore this dataset, process it into a useful form, and make it a part of a reproducible data science workflow. The project can be found at: http://www.cis.hut.fi/research/lvq_pak/

For this example, we are going to help Bjørn create a `RawDataset` by:
1. Downloading and unpacking the raw data files.
2. Generating (and recording) hash values for these files.
3. Adding relevant LICENSE and DESCR (description) metadata to this `RawDataset`.
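
To make the hashing step concrete, here is a minimal standard-library sketch of generating and recording file hashes. The helper names `hash_file` and `record_hashes` are hypothetical and are not part of `src`; the `sha1:<hexdigest>` string format simply mirrors what appears in the `fetch` debug logs.

```python
import hashlib
from pathlib import Path


def hash_file(path, algorithm="sha1", chunk_size=65536):
    """Return '<algorithm>:<hexdigest>' for the file at `path`.

    The file is read in chunks so that large downloads need not fit in memory.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fd:
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return f"{algorithm}:{digest.hexdigest()}"


def record_hashes(directory, algorithm="sha1"):
    """Map each regular file name in `directory` to its hash string."""
    return {
        path.name: hash_file(path, algorithm)
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }
```

Recording the hash the first time a file is fetched, and comparing against it on later runs, is what lets a workflow detect that a "raw" file has silently changed.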

In [22]:
from src.data import RawDataset
from src.utils import list_dir

In [13]:
raw_ds = RawDataset('lvq-pak')

In [14]:
raw_ds.add_url("http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar")

In [31]:
raw_ds.add_url("http://www.cis.hut.fi/research/lvq_pak/README", file_name='lvq-pak.readme', name='DESCR')

In [32]:
license_txt = '''
************************************************************************
*                                                                      *
*                              LVQ_PAK                                 *
*                                                                      *
*                                The                                   *
*                                                                      *
*                   Learning  Vector  Quantization                     *
*                                                                      *
*                          Program  Package                            *
*                                                                      *
*                   Version 3.1 (April 7, 1995)                        *
*                                                                      *
*                          Prepared by the                             *
*                    LVQ Programming Team of the                       *
*                 Helsinki University of Technology                    *
*           Laboratory of Computer and Information Science             *
*                Rakentajanaukio 2 C, SF-02150 Espoo                   *
*                              FINLAND                                 *
*                                                                      *
*                      Copyright (c) 1991-1995                         *
*                                                                      *
************************************************************************
*                                                                      *
*  NOTE: This program package is copyrighted in the sense that it      *
*  may be used for scientific purposes. The package as a whole, or     *
*  parts thereof, cannot be included or used in any commercial         *
*  application without written permission granted by its producents.   *
*  No programs contained in this package may be copied for commercial  *
*  distribution.                                                       *
*                                                                      *
*  All comments concerning this program package may be sent to the     *
*  e-mail address 'lvq@nucleus.hut.fi'.                                *
*                                                                      *
************************************************************************
'''
raw_ds.add_metadata(contents=license_txt, kind='LICENSE')

In [33]:
raw_ds.fetch()

2018-10-19 13:59:09,964 - fetch - DEBUG - No file_name specified. Inferring lvq_pak-3.1.tar from URL
2018-10-19 13:59:09,967 - fetch - DEBUG - lvq_pak-3.1.tar already exists and hash is valid
2018-10-19 13:59:09,968 - fetch - DEBUG - lvq-pak.readme already exists and hash is valid
2018-10-19 13:59:09,969 - fetch - DEBUG - Creating lvq-pak.license from `contents` string
2018-10-19 13:59:09,970 - fetch - DEBUG - lvq-pak.license already exists and hash is valid
2018-10-19 13:59:09,972 - fetch - DEBUG - lvq-pak.readme exists, but no hash to check. Setting to sha1:138b69cc0b4e02950cec5833752e50a54d36fd0f
2018-10-19 13:59:09,972 - fetch - DEBUG - Creating lvq-pak.license from `contents` string
2018-10-19 13:59:09,974 - fetch - DEBUG - lvq-pak.license exists, but no hash to check. Setting to sha1:e5f53b172926d34cb6a49877be49ee08bc4d51c1


True

In [34]:
unpack_dir = raw_ds.unpack()

2018-10-19 13:59:10,682 - datasets - DEBUG - Raw Dataset lvq-pak is already unpacked. Skipping


In [35]:
print(f'{unpack_dir}')
list_dir(unpack_dir)

/Users/kjell/Documents/devel/git/bus_number/data/interim/lvq-pak


['lvq-pak.license', 'lvq-pak.readme', 'lvq_pak-3.1']
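
Conceptually, unpacking amounts to extracting the archive into a per-dataset directory and skipping the work when that directory is already populated, as the "already unpacked" log message suggests. The following is a rough sketch of that idea only; `unpack_tar` is a hypothetical helper and not how `RawDataset.unpack()` is actually implemented.

```python
import tarfile
from pathlib import Path


def unpack_tar(archive_path, dest_dir):
    """Extract `archive_path` into `dest_dir`, skipping if already unpacked.

    Returns the destination directory as a Path.
    """
    dest_dir = Path(dest_dir)
    if dest_dir.exists() and any(dest_dir.iterdir()):
        return dest_dir  # already unpacked; nothing to do
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest_dir)
    return dest_dir
```

Because the raw archive is never modified, deleting the destination directory and calling the function again reproduces exactly the same unpacked files.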

In [41]:
from src import workflow           # assumed import location; adjust if `workflow` lives elsewhere
workflow.add_raw_dataset(raw_ds)   # Add this raw dataset to the catalog
workflow.available_raw_datasets()

['lvq-pak']

#### Exercise: Mark and F-MNIST
For this exercise, you are going to help Mark build a `RawDataset` out of the Fashion-MNIST files.

[Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) is available from GitHub. Looking at the documentation there, we see that the raw data is distributed as a set of 4 files. The git repo specifies the checksums of these files:

| Name  | Content | Examples | Size | Link | MD5 Checksum|
| --- | --- |--- | --- |--- |--- |
| `train-images-idx3-ubyte.gz`  | training set images  | 60,000|26 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz)|`8d4fb7e6c68d591d4c3dfef9ec88bf0d`|
| `train-labels-idx1-ubyte.gz`  | training set labels  |60,000|29 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz)|`25c81989df183df01b3e8a0aad5dffbe`|
| `t10k-images-idx3-ubyte.gz`  | test set images  | 10,000|4.3 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz)|`bef4ecab320f06d8554ea6380940ec79`|
| `t10k-labels-idx1-ubyte.gz`  | test set labels  | 10,000| 5.1 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz)|`bb300cfdad3c16e7a12a480ee83cd310`|

Your mission is to build a `RawDataset` that downloads these raw files and verifies that the hash values are as expected. You should make sure to include metadata in this `RawDataset`, including **description** (DESCR) and **license** (LICENSE) information.
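
As a hint for the verification part, the sketch below checks downloaded files against the MD5 values from the table using only the standard library. It is independent of `RawDataset`: `md5_of` and `find_mismatches` are hypothetical helpers, and the directory you pass in is whatever location you downloaded the files to.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the table above.
EXPECTED_MD5 = {
    "train-images-idx3-ubyte.gz": "8d4fb7e6c68d591d4c3dfef9ec88bf0d",
    "train-labels-idx1-ubyte.gz": "25c81989df183df01b3e8a0aad5dffbe",
    "t10k-images-idx3-ubyte.gz": "bef4ecab320f06d8554ea6380940ec79",
    "t10k-labels-idx1-ubyte.gz": "bb300cfdad3c16e7a12a480ee83cd310",
}


def md5_of(path):
    """Return the hex MD5 digest of the file at `path`."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def find_mismatches(directory, expected=EXPECTED_MD5):
    """Return {file_name: problem} for files that are missing or have the wrong MD5."""
    problems = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.exists():
            problems[name] = "missing"
        elif md5_of(path) != want:
            problems[name] = "checksum mismatch"
    return problems
```

An empty dictionary from `find_mismatches` means every expected file is present and matches its published checksum.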

### Processing Raw Data
How do we turn raw data into something useful? There are 2 steps:
1. Write a function to extract meaningful `data` (and optionally, `target`) objects from your raw files, and
2. Wrap this function in the form of a **processing function**

#### Processing Function Template
A processing function is a function that:
* Takes at least 2 keyword arguments as input: `dataset_name` (a string) and `metadata` (a dict).
* Returns a dictionary with the following keys: `dataset_name`, `data`, `target` (optional), and `metadata`.

Here's a template:


In [115]:
def process_raw_data(dataset_name='raw_data', metadata=None):
    """Process a raw dataset object.

    Parameters
    ----------
    dataset_name: (string)
        Name of this raw dataset. This will be used as a key for accessing this raw dataset in the
        Raw Dataset catalog
    metadata: dict or None
        If None, an empty metadata dictionary will be used.

    Returns
    -------
    Dictionary containing the following keys:
        dataset_name: (string)
            `dataset_name` that was passed to the function
        metadata: (dict)
            dict containing the input `metadata` key/value pairs, and (optionally)
            additional information about this raw dataset
        data: array-style object
            Often a `numpy.ndarray` or `pandas.DataFrame`
        target: (optional) vector-style object
            for supervised learning problems, the target vector associated with `data`
    """
    if metadata is None:
        metadata = {}

    data, target = None, None

    # Generate `data` and `target` info
    #    data, target = extract_func()

    dset_opts = {
        'dataset_name': dataset_name,
        'metadata': metadata,
        'data': data,
        'target': target,
    }
    return dset_opts
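
To show how the template is meant to be filled in, here is a minimal sketch that plugs a toy extraction step into it. Both `extract_toy_data` and `process_toy_data` are hypothetical names, and the tiny in-memory dataset stands in for whatever you would read from your unpacked raw files.

```python
def extract_toy_data():
    """Hypothetical extraction step: return (data, target) for a tiny dataset."""
    data = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    target = ["a", "b", "b"]
    return data, target


def process_toy_data(dataset_name="toy", metadata=None):
    """Processing function that follows the template above."""
    if metadata is None:
        metadata = {}

    data, target = extract_toy_data()

    # Build a new dict so the caller's metadata is left untouched.
    metadata = {**metadata, "n_samples": len(data)}

    return {
        "dataset_name": dataset_name,
        "metadata": metadata,
        "data": data,
        "target": target,
    }
```

Calling `process_toy_data(metadata={'license': 'CC0'})` returns a dictionary whose `metadata` holds both the supplied `license` entry and the added `n_samples` count.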

#### Example: Processing lvq-pak data
Bjørn has successfully fetched and extracted the lvq-pak data. Now he is ready to process it into `data` and `target`.

In [102]:
raw_ds = RawDataset.from_name('lvq-pak')    # load it from the catalog
unpack_dir = raw_ds.unpack()                # Find the location of the unpacked files

2018-10-19 14:38:03,528 - datasets - DEBUG - unpack() called before fetch()
2018-10-19 14:38:03,529 - fetch - DEBUG - No file_name specified. Inferring lvq_pak-3.1.tar from URL
2018-10-19 14:38:03,533 - fetch - DEBUG - lvq_pak-3.1.tar already exists and hash is valid
2018-10-19 14:38:03,535 - fetch - DEBUG - lvq-pak.readme already exists and hash is valid
2018-10-19 14:38:03,536 - fetch - DEBUG - Creating lvq-pak.license from `contents` string
2018-10-19 14:38:03,538 - fetch - DEBUG - lvq-pak.license already exists and hash is valid
2018-10-19 14:38:03,539 - fetch - DEBUG - lvq-pak.readme already exists and hash is valid
2018-10-19 14:38:03,539 - fetch - DEBUG - Creating lvq-pak.license from `contents` string
2018-10-19 14:38:03,542 - fetch - DEBUG - lvq-pak.license already exists and hash is valid
2018-10-19 14:38:03,563 - fetch - DEBUG - Extracting lvq_pak-3.1.tar
2018-10-19 14:38:03,564 - fetch - DEBUG - Copying lvq-pak.readme
2018-10-19 14:38:03,566 - fetch - DEBUG - Copying lvq-pa

In [103]:
list_dir(unpack_dir)  # what does the extracted data look like?

['lvq-pak.license', 'lvq-pak.readme', 'lvq_pak-3.1']

In [104]:
list_dir(unpack_dir / 'lvq_pak-3.1')  # Files are extracted to a subdirectory:

['accuracy.c',
 'balance.c',
 'classify.c',
 'cmatr.c',
 'config.h',
 'datafile.c',
 'datafile.h',
 'elimin.c',
 'errors.h',
 'eveninit.c',
 'ex1.dat',
 'ex2.dat',
 'extract.c',
 'fileio.c',
 'fileio.h',
 'knntest.c',
 'labels.c',
 'labels.h',
 'lvq_pak.c',
 'lvq_pak.h',
 'lvq_rout.c',
 'lvq_rout.h',
 'lvq_run.c',
 'lvqtrain.c',
 'makefile.dos',
 'makefile.unix',
 'mcnemar.c',
 'mindist.c',
 'pick.c',
 'README',
 'sammon.c',
 'setlabel.c',
 'showlabs.c',
 'stddev.c',
 'VERSION',
 'version.c',
 'version.h']

In [105]:
datafile_train = unpack_dir / 'lvq_pak-3.1' / 'ex1.dat'
datafile_test = unpack_dir / 'lvq_pak-3.1' / 'ex2.dat'
datafile_train.exists() and datafile_test.exists()

True

In [106]:
from src.utils import head_file
print(head_file(datafile_train)) # number of data columns, followed by comment, then space-delimited data

20
# Example data from speech signal
21.47 -19.90 -20.68 -6.73 13.67 -11.95 13.83 12.02 7.62 -6.15 -4.38 -2.91 4.80 -7.39 -3.54 -0.87 -5.02 -1.41 -2.33 2.12 A
0.05 28.38 9.52 -11.30 3.11 -11.88 -2.90 -11.04 2.32 -13.80 1.71 -0.40 -1.36 3.91 3.21 -0.98 -0.14 -4.70 0.30 0.27 I
-4.71 -4.61 -0.64 1.78 -1.48 5.98 12.55 -0.50 4.74 4.68 3.27 -0.36 9.24 3.39 -0.40 -1.59 0.94 2.17 -0.10 -0.45 #
10.78 -22.31 -11.32 -10.92 10.96 -14.64 7.02 13.83 6.72 -7.99 -7.45 -3.20 8.45 2.76 -2.85 1.22 -6.60 -4.96 -1.42 0.57 A



In [107]:
print(head_file(datafile_test)) # similar, but no comment header

20
13.55 -12.07 -18.81 -6.13 10.66 -16.64 10.54 3.97 6.41 -8.01 -4.28 -3.37 7.96 -3.06 1.85 -5.28 1.05 3.87 1.77 -1.79 A
0.28 22.76 8.26 -15.68 3.43 -13.39 -10.58 -14.71 3.96 -9.65 4.74 -9.55 0.90 7.25 4.69 -2.44 1.96 -4.32 2.48 1.46 I
-3.51 -6.01 -0.47 -0.82 -0.38 -2.91 -1.35 0.48 1.88 3.00 4.11 7.21 3.36 7.48 2.37 -5.26 2.58 1.99 -1.09 -4.20 #
9.76 -23.02 -12.69 -12.28 12.26 -15.74 10.01 8.30 1.95 -7.41 -0.68 -2.56 5.02 -1.56 -0.16 -1.87 -6.97 -0.08 0.51 2.00 A
-2.85 -3.63 -0.89 -1.04 -3.26 4.13 4.37 5.18 2.79 -1.02 8.60 9.78 2.49 -2.63 -1.52 -1.26 2.27 -0.17 2.05 1.54 #



In [108]:
import numpy as np   # used by read_space_delimited when class_labels is False
import pandas as pd

In [109]:
def read_space_delimited(filename, skiprows=None, class_labels=True):
    """Read a space-delimited file.

    Data is space-delimited. The last column is the (string) label for the data.

    Note: we can't use automatic comment detection, as `#` characters are also used as data labels.

    Parameters
    ----------
    filename: str or path-like
        File to read.
    skiprows: None or list
        List of row indices to skip when reading the file.
    class_labels: boolean
        If True, the last column is treated as the class (target) label.
    """
    with open(filename, 'r') as fd:
        df = pd.read_table(fd, skiprows=skiprows, skip_blank_lines=True,
                           comment=None, header=None, sep=' ', dtype=str)
        # targets are last column. Data is everything else
        if class_labels is True:
            target = df.loc[:, df.columns[-1]].values
            data = df.loc[:, df.columns[:-1]].values
        else:
            data = df.values
            target = np.zeros(data.shape[0])
        return data, target

In [110]:
data, target = read_space_delimited(datafile_train, skiprows=[0,1])
data.shape, target.shape

((1962, 20), (1962,))
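
The two files differ only in whether a comment line follows the column count, which is why `skiprows` has to be chosen per file. As an alternative illustration (not the approach used in this notebook), the pure-Python sketch below uses the leading column count to decide which lines are data, so a `#` comment line is skipped while a `#` label is kept. `parse_lvq_text` is a hypothetical helper, and the sample text is a shortened, 3-column version of the rows shown above.

```python
def parse_lvq_text(text):
    """Parse LVQ-Pak style text into (data, target).

    The first line gives the number of feature columns. A row counts as data
    only if it has exactly that many numeric fields followed by one label.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    n_features = int(lines[0])
    data, target = [], []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != n_features + 1:
            continue  # header comment or malformed row
        try:
            row = [float(value) for value in fields[:-1]]
        except ValueError:
            continue  # non-numeric fields, so not a data row
        data.append(row)
        target.append(fields[-1])
    return data, target


sample = """3
# Example data from speech signal
21.47 -19.90 -20.68 A
-4.71 -4.61 -0.64 #
"""
sample_data, sample_target = parse_lvq_text(sample)
print(sample_target)  # ['A', '#']
```

Unlike the pandas version, this sketch converts the features to floats as it reads them.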

In [119]:
from src.paths import interim_data_path
import numpy as np

In [120]:
def process_lvq_pak(dataset_name='lvq-pak', metadata=None, kind='all'):
    """Process the LVQ-Pak raw dataset.

    Parameters
    ----------
    dataset_name: (string)
        Name of this raw dataset. This will be used as a key for accessing this raw dataset in the
        Raw Dataset catalog
    metadata: dict or None
        If None, an empty metadata dictionary will be used.
    kind: {'train', 'test', 'all'}
        Whether to return training set, test set, or everything.

    Returns
    -------
    Dictionary containing the following keys:
        dataset_name: (string)
            `dataset_name` that was passed to the function
        metadata: (dict)
            dict containing the input `metadata` key/value pairs, and (optionally)
            additional information about this raw dataset
        data: array-style object
            Often a `numpy.ndarray` or `pandas.DataFrame`
        target: (optional) vector-style object
            for supervised learning problems, the target vector associated with `data`
    """
    if metadata is None:
        metadata = {}

    untar_dir = interim_data_path / dataset_name
    unpack_dir = untar_dir / 'lvq_pak-3.1'

    if kind == 'train':
        data, target = read_space_delimited(unpack_dir / 'ex1.dat', skiprows=[0,1])
    elif kind == 'test':
        data, target = read_space_delimited(unpack_dir / 'ex2.dat', skiprows=[0])
    elif kind == 'all':
        data1, target1 = read_space_delimited(unpack_dir / 'ex1.dat', skiprows=[0,1])
        data2, target2 = read_space_delimited(unpack_dir / 'ex2.dat', skiprows=[0])
        data = np.vstack((data1, data2))
        target = np.append(target1, target2)
    else:
        raise ValueError(f'Unknown kind: {kind}')

    dset_opts = {
        'dataset_name': dataset_name,
        'metadata': metadata,
        'data': data,
        'target': target,
    }
    return dset_opts

In [121]:
process_lvq_pak()

{'dataset_name': 'lvq-pak',
 'metadata': {},
 'data': array([['21.47', '-19.90', '-20.68', ..., '-1.41', '-2.33', '2.12'],
        ['0.05', '28.38', '9.52', ..., '-4.70', '0.30', '0.27'],
        ['-4.71', '-4.61', '-0.64', ..., '2.17', '-0.10', '-0.45'],
        ...,
        ['-2.63', '-6.59', '0.19', ..., '0.76', '0.89', '-3.48'],
        ['5.35', '4.96', '18.75', ..., '-0.57', '0.00', '1.35'],
        ['-0.37', '-5.27', '-1.74', ..., '3.48', '-0.90', '-1.00']],
       dtype=object),
 'target': array(['A', 'I', '#', ..., '#', 'Y', '#'], dtype=object)}

In [122]:
raw_ds.load_function = process_lvq_pak

In [125]:
ds = raw_ds.process() # Use the load_function to convert this RawDataset to a real Dataset

2018-10-19 14:50:15,167 - datasets - DEBUG - Found cached Dataset for lvq-pak: 9237b9237ed5cfb499e7ba4c10788295b09b4659


In [128]:
print(f"Built Dataset: {ds}")

Built Dataset: <Dataset: lvq-pak, data.shape=(3924, 20), target.shape=(3924,), metadata=['descr', 'license', 'dataset_name', 'hash_type', 'data_hash', 'target_hash']>


In [130]:
ds = raw_ds.process(kind="test")  # Should be half the size
print(ds)

2018-10-19 14:57:49,630 - datasets - DEBUG - Found cached Dataset for lvq-pak: 1e177bc4e370b3cfcce9d862c8401ddb09075324


<Dataset: lvq-pak, data.shape=(1962, 20), target.shape=(1962,), metadata=['descr', 'license', 'dataset_name', 'hash_type', 'data_hash', 'target_hash']>


#### EXERCISE: Process Mark's F-MNIST Data
In the last exercise, you fetched and unpacked F-MNIST data.
Now it's time to process it into a usable dataset.
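
As a starting point, the Fashion-MNIST files are gzip-compressed and, per the upstream documentation, use the same IDX layout as the original MNIST files. The sketch below parses that layout with the standard library only. It assumes the usual IDX header (two zero bytes, a type code of `0x08` for unsigned bytes, the number of dimensions, then one big-endian 32-bit size per dimension); `parse_idx` and `read_idx_gz` are hypothetical helper names.

```python
import gzip
import struct


def parse_idx(raw_bytes):
    """Parse IDX-formatted bytes into (shape, payload).

    `shape` is a tuple of dimension sizes; `payload` is the raw unsigned-byte
    data in row-major order.
    """
    zeros, type_code, n_dims = struct.unpack(">HBB", raw_bytes[:4])
    if zeros != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    header_end = 4 + 4 * n_dims
    shape = struct.unpack(f">{n_dims}I", raw_bytes[4:header_end])
    payload = raw_bytes[header_end:]
    expected_size = 1
    for dim in shape:
        expected_size *= dim
    if len(payload) != expected_size:
        raise ValueError("payload size does not match header")
    return shape, payload


def read_idx_gz(path):
    """Read a gzip-compressed IDX file and return (shape, payload)."""
    with gzip.open(path, "rb") as fd:
        return parse_idx(fd.read())
```

With NumPy available, `np.frombuffer(payload, dtype=np.uint8).reshape(shape)` turns the result into an array that can serve as `data` (images) or `target` (labels) in your processing function.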

## The `Dataset` and Data Transformations

### Tour of the Dataset Object

### Creating a Simple Transformer

### More Complicated Transformers

## Reproducible Data: The Punchline