In [1]:
from src.data import RawDataset, Dataset
from src import workflow

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import logging
from src.logging import logger
logger.setLevel(logging.DEBUG)

In [4]:
import pandas as pd

# Raw Data is Read Only: Sing it with me!

How do you assemble raw data into a usable dataset? How do you make sure your raw datasets are reproducible?
We are going to work through the process of collecting raw data files (using an object called a `RawDataset`), and converting them into something useful for our reproducible data science workflow (the `Dataset` object).

## Bjørn's Problem: Supervised Learning

Bjørn employs a large number of Finnish line cooks. He can’t understand a word they say.

Bjørn needs a trained model to do real-time translation from Finnish to Swedish.

Bjørn has decided to start with the Finnish phoneme dataset shipped with a project called lvq-pak. His objective is to train three different models, and choose the one with the best overall accuracy score.


### LVQ-Pak: A Finnish Phonetic dataset

The Learning Vector Quantization (lvq-pak) project includes a simple Finnish phonetic dataset
consisting of 20-dimensional Mel Frequency Cepstrum Coefficients (MFCCs) labelled with target phoneme information. Our goal is to explore this dataset, process it into a useful form, and make it a part of a reproducible data science workflow.

In [5]:
dataset_name='lvq-pak'  # Naming things: the hardest problem in computer science.

lvq-pak is shipped as a source code tarball. The data files are included as text files within that download. Our first goal is to retrieve these source files and record some metadata about them.

### Grab the source code package
We can add a file to the RawDataset using its `add_url()` method. If we know the hash of this file, we should include it here, and it will be checked on download. If not, one will be computed from this download and used for comparison on subsequent downloads.


In [39]:
lvq_pak = RawDataset(dataset_name)
lvq_pak.add_url(url="http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar")
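To make the hash check concrete, here is a minimal sketch of the "verify if known, otherwise record" behaviour described above. It uses only the standard library. The helper names `file_hash` and `check_or_record_hash` are hypothetical, and this is not necessarily how `RawDataset` implements it.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path, hash_type="sha1", chunk_size=65536):
    """Return the hex digest of a file, read in chunks."""
    digest = hashlib.new(hash_type)
    with open(path, "rb") as fd:
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_or_record_hash(path, hash_type="sha1", hash_value=None):
    """Verify a file against a known hash, or record one if none is known.

    Returns the hash that later downloads should be compared against.
    """
    computed = file_hash(path, hash_type)
    if hash_value is None:
        return computed  # nothing to check against yet: record it
    if computed != hash_value:
        raise ValueError(f"{path}: expected {hash_type}:{hash_value}, got {computed}")
    return hash_value


# Demonstrate on a throwaway file (hypothetical contents, not the real tarball).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.tar"
    demo_file.write_bytes(b"not really a tarball")
    recorded = check_or_record_hash(demo_file)                        # first fetch
    verified = check_or_record_hash(demo_file, hash_value=recorded)   # later fetch
```

Recording the hash on first download means that a silently changed upstream file is caught on the next fetch rather than discovered later as an unexplained change in results.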

There are some additional options to `add_url()` that we may wish to use:

In [7]:
help(RawDataset.add_url)

Help on function add_url in module src.data.datasets:

add_url(self, url=None, hash_type='sha1', hash_value=None, name=None, file_name=None)
    Add a URL to the file list
    
    hash_type: {'sha1', 'md5', 'sha256'}
    hash_value: string or None
        if None, hash will be computed from downloaded file
    file_name: string or None
        Name of downloaded file. If None, will be the last component of the URL
    url: string
        URL to fetch
    name: str
        text description of this file.



`hash_value` and `hash_type` should be self-evident.

`file_name` is optional. If skipped, the name will be guessed from the last component of the URL. It's often nice to be explicit, however, and you can use this option to override the default guess.

The `name` field can be used as a text label describing the file. It can also tag two special pieces of metadata: when `name` is `DESCR` or `LICENSE`, the downloaded (text) file will be used as the dataset description or license text, respectively.

We will use this feature to add a `README` to our dataset:

In [8]:
lvq_pak.add_url(url='http://www.cis.hut.fi/research/lvq_pak/README',
                file_name=f'{dataset_name}.readme',
                name='DESCR')

Datasets should *always* have an explicit license. Reading the project documentation, we see a license in one of the text files. We can copy that text and attach it via the `add_metadata()` method:

In [9]:
license_txt = '''
************************************************************************
*                                                                      *
*                              LVQ_PAK                                 *
*                                                                      *
*                                The                                   *
*                                                                      *
*                   Learning  Vector  Quantization                     *
*                                                                      *
*                          Program  Package                            *
*                                                                      *
*                   Version 3.1 (April 7, 1995)                        *
*                                                                      *
*                          Prepared by the                             *
*                    LVQ Programming Team of the                       *
*                 Helsinki University of Technology                    *
*           Laboratory of Computer and Information Science             *
*                Rakentajanaukio 2 C, SF-02150 Espoo                   *
*                              FINLAND                                 *
*                                                                      *
*                      Copyright (c) 1991-1995                         *
*                                                                      *
************************************************************************
*                                                                      *
*  NOTE: This program package is copyrighted in the sense that it      *
*  may be used for scientific purposes. The package as a whole, or     *
*  parts thereof, cannot be included or used in any commercial         *
*  application without written permission granted by its producents.   *
*  No programs contained in this package may be copied for commercial  *
*  distribution.                                                       *
*                                                                      *
*  All comments concerning this program package may be sent to the     *
*  e-mail address 'lvq@nucleus.hut.fi'.                                *
*                                                                      *
************************************************************************
'''
lvq_pak.add_metadata(contents=license_txt, kind='LICENSE')

### Fetching the Data

In [10]:
lvq_pak.fetch()

2018-10-17 13:46:49,649 - fetch - DEBUG - No file_name specified. Inferring lvq_pak-3.1.tar from URL
2018-10-17 13:46:49,653 - fetch - DEBUG - lvq_pak-3.1.tar exists, but no hash to check. Setting to sha1:86024a871724e521341da0ffb783956e39aadb6e
2018-10-17 13:46:49,656 - fetch - DEBUG - lvq-pak.readme exists, but no hash to check. Setting to sha1:138b69cc0b4e02950cec5833752e50a54d36fd0f
2018-10-17 13:46:49,657 - fetch - DEBUG - Creating lvq-pak.license from `contents` string
2018-10-17 13:46:49,661 - fetch - DEBUG - lvq-pak.license exists, but no hash to check. Setting to sha1:e5f53b172926d34cb6a49877be49ee08bc4d51c1


True

In [11]:
from src.utils import list_dir
from src.paths import raw_data_path
list_dir(raw_data_path)

['lvq_pak-3.1.tar',
 'fremont_bike.readme',
 'f-mnist.license',
 'f-mnist.readme',
 'lvq-pak.readme',
 'fremont_bike.license',
 't10k-images-idx3-ubyte.gz',
 'train-images-idx3-ubyte.gz',
 'fremont.csv',
 'train-labels-idx1-ubyte.gz',
 'lvq-pak.license',
 't10k-labels-idx1-ubyte.gz']

## Processing the data
The next step is to write the importer that actually processes the data we will be using for this dataset.

The important things to generate are the `data` and `target` entries. A `metadata` entry is optional, but recommended if you want to save additional information about the dataset.

Usually, this functionality gets bundled up into a function and added to `datasets.py`.


In [12]:
# Unpack the file
untar_dir = lvq_pak.unpack()
list_dir(untar_dir)

2018-10-17 13:46:49,854 - fetch - DEBUG - Extracting lvq_pak-3.1.tar
2018-10-17 13:46:49,857 - fetch - DEBUG - Copying lvq-pak.readme
2018-10-17 13:46:49,861 - fetch - DEBUG - Copying lvq-pak.license


['lvq-pak.readme', 'lvq_pak-3.1', 'lvq-pak.license']
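The key idea in this step is that unpacking writes to a separate working directory and never modifies the raw download. As a rough illustration (not the actual `unpack()` implementation), the same pattern with the standard-library `tarfile` module looks like this. The archive name and contents below are made up for the demo.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Build a tiny stand-in tarball (hypothetical contents).
    src_dir = tmp / "demo_pkg-1.0"
    src_dir.mkdir()
    (src_dir / "ex1.dat").write_text("20\n# comment\n")
    tar_path = tmp / "demo_pkg-1.0.tar"
    with tarfile.open(tar_path, "w") as tar:
        tar.add(src_dir, arcname=src_dir.name)

    # Unpack into a separate "interim" directory, leaving the tarball untouched.
    interim_dir = tmp / "interim"
    interim_dir.mkdir()
    with tarfile.open(tar_path, "r") as tar:
        tar.extractall(path=interim_dir)

    unpacked = sorted(p.name for p in (interim_dir / "demo_pkg-1.0").iterdir())
    tarball_still_there = tar_path.exists()
```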

In [13]:
unpack_dir = untar_dir / 'lvq_pak-3.1'
list_dir(unpack_dir)

['lvq_rout.h',
 'knntest.c',
 'datafile.h',
 'extract.c',
 'lvq_pak.c',
 'version.h',
 'cmatr.c',
 'fileio.c',
 'setlabel.c',
 'eveninit.c',
 'config.h',
 'mindist.c',
 'elimin.c',
 'pick.c',
 'lvq_run.c',
 'classify.c',
 'lvqtrain.c',
 'errors.h',
 'stddev.c',
 'README',
 'labels.c',
 'lvq_pak.h',
 'balance.c',
 'sammon.c',
 'makefile.dos',
 'makefile.unix',
 'datafile.c',
 'lvq_rout.c',
 'accuracy.c',
 'VERSION',
 'ex2.dat',
 'fileio.h',
 'version.c',
 'mcnemar.c',
 'ex1.dat',
 'labels.h',
 'showlabs.c']

In this dataset, the training and test datasets are stored in files named `ex1.dat` and `ex2.dat`, respectively.

In [14]:
datafile_train = unpack_dir / 'ex1.dat'
datafile_test = unpack_dir / 'ex2.dat'

datafile_train.exists() and datafile_test.exists()

True

According to the documentation, the data format is space-delimited, with the class label included as the last column. Let's have a look.

In [15]:
from src.utils import head_file

In [16]:
print(head_file(datafile_train))

20
# Example data from speech signal
21.47 -19.90 -20.68 -6.73 13.67 -11.95 13.83 12.02 7.62 -6.15 -4.38 -2.91 4.80 -7.39 -3.54 -0.87 -5.02 -1.41 -2.33 2.12 A
0.05 28.38 9.52 -11.30 3.11 -11.88 -2.90 -11.04 2.32 -13.80 1.71 -0.40 -1.36 3.91 3.21 -0.98 -0.14 -4.70 0.30 0.27 I
-4.71 -4.61 -0.64 1.78 -1.48 5.98 12.55 -0.50 4.74 4.68 3.27 -0.36 9.24 3.39 -0.40 -1.59 0.94 2.17 -0.10 -0.45 #
10.78 -22.31 -11.32 -10.92 10.96 -14.64 7.02 13.83 6.72 -7.99 -7.45 -3.20 8.45 2.76 -2.85 1.22 -6.60 -4.96 -1.42 0.57 A



Indeed, the data file starts with a single line giving the dimension of the data, followed by a comment line, and then rows of 21 space-delimited columns, the final column being the target class label.

**Note:** We have to be a little careful importing the data, because '#' is used both as the comment delimiter, and as a class label.
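To see why this matters, here is a small standard-library sketch that skips header lines by position instead of by comment character, so a `#` label survives parsing. The function name `parse_lvq_lines` and the shortened three-dimensional sample rows are assumptions made for illustration.

```python
def parse_lvq_lines(lines, n_header_lines=2):
    """Split space-delimited rows into feature vectors and labels.

    The first `n_header_lines` lines (dimension and comment) are skipped by
    position, so a `#` in the label column is never mistaken for a comment.
    """
    data, target = [], []
    for line in lines[n_header_lines:]:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        data.append([float(x) for x in fields[:-1]])
        target.append(fields[-1])
    return data, target


sample = [
    "3",
    "# Example data (shortened to 3 dimensions for illustration)",
    "21.47 -19.90 -20.68 A",
    "-4.71 -4.61 -0.64 #",
]
demo_data, demo_target = parse_lvq_lines(sample)
```

A parser that treated everything after `#` as a comment would drop the label from the last sample row, and could fail or silently misread the row depending on the tool.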

Fortunately, we have a helper function for this. We will get a little cheeky and skip the first 2 lines (hoping there are no other comments). The documentation also says there are 1962 entries in each of the training and test datasets.

In [17]:
import numpy as np  # needed for the `class_labels=False` branch below

def read_space_delimited(filename, skiprows=None, class_labels=True):
    """Read a space-delimited file

    skiprows: list of rows to skip when reading the file.

    Note: we can't use automatic comment detection, as
    `#` characters are also used as data labels.
    class_labels: boolean
        if true, the last column is treated as the class label
    """
    with open(filename, 'r') as fd:
        # dtype=str: every column (features included) is read as a string
        df = pd.read_table(fd, skiprows=skiprows, skip_blank_lines=True,
                           comment=None, header=None, sep=' ', dtype=str)
        # targets are last column. Data is everything else
        if class_labels is True:
            target = df.loc[:, df.columns[-1]].values
            data = df.loc[:, df.columns[:-1]].values
        else:
            # no labels in the file: use a placeholder target of zeros
            data = df.values
            target = np.zeros(data.shape[0])
        return data, target


In [18]:
data, target = read_space_delimited(datafile_train, skiprows=[0,1])
data2, target2 = read_space_delimited(datafile_test, skiprows=[0])

data.shape, target.shape, data2.shape, target2.shape

((1962, 20), (1962,), (1962, 20), (1962,))

In [19]:
target

array(['A', 'I', '#', ..., '#', 'Y', '#'], dtype=object)

This seems to work, so let's wrap this functionality up into a processing function.
By convention, a processing function takes a `dataset_name`, and any other options that may be useful for reading the data, and returns a dictionary that matches the `Dataset` constructor signature.

We will place this function in `localdata.py` (and add it to `__all__`) to make it visible to our dataset code.

In [20]:
#%%file ../src/data/localdata.py
"""
Custom dataset processing/generation functions should be added to this file
"""

from src.data.utils import read_space_delimited, normalize_labels
from src.paths import interim_data_path
import numpy as np

__all__ = ['process_lvq_pak']

def process_lvq_pak(dataset_name='lvq-pak', kind='all', numeric_labels=True, metadata=None):
    """
    kind: {'test', 'train', 'all'}, default 'all'
    numeric_labels: boolean (default: True)
        if set, target is a vector of integers, and label_map is created in the metadata
        to reflect the mapping to the string targets
    """
    
    untar_dir = interim_data_path / dataset_name
    unpack_dir = untar_dir / 'lvq_pak-3.1'

    if kind == 'train':
        data, target = read_space_delimited(unpack_dir / 'ex1.dat', skiprows=[0,1])
    elif kind == 'test':
        data, target = read_space_delimited(unpack_dir / 'ex2.dat', skiprows=[0])
    elif kind == 'all':
        data1, target1 = read_space_delimited(unpack_dir / 'ex1.dat', skiprows=[0,1])
        data2, target2 = read_space_delimited(unpack_dir / 'ex2.dat', skiprows=[0])
        data = np.vstack((data1, data2))
        target = np.append(target1, target2)
    else:
        raise Exception(f'Unknown kind: {kind}')

    if numeric_labels:
        if metadata is None:
            metadata = {}
        mapped_target, label_map = normalize_labels(target)
        metadata['label_map'] = label_map
        target = mapped_target

    dset_opts = {
        'dataset_name': dataset_name,
        'data': data,
        'target': target,
        'metadata': metadata
    }
    return dset_opts
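The `numeric_labels` option relies on `normalize_labels`, which is imported from `src.data.utils` and not shown here. As a rough idea of what such a helper could do, the sketch below maps string labels to integers and keeps a `label_map` for the reverse lookup. The sorted ordering and the name `normalize_labels_sketch` are assumptions, and the real helper may behave differently.

```python
def normalize_labels_sketch(target):
    """Map string labels to integers (hypothetical stand-in for `normalize_labels`).

    Returns (mapped_target, label_map), where label_map[i] is the original
    label for integer i. Labels are numbered in sorted order.
    """
    label_map = sorted(set(target))
    index = {label: i for i, label in enumerate(label_map)}
    mapped_target = [index[label] for label in target]
    return mapped_target, label_map


original = ["A", "I", "#", "A"]
mapped, demo_label_map = normalize_labels_sketch(original)
# Round trip back to the original string labels.
restored = [demo_label_map[i] for i in mapped]
```

Storing the map in the metadata is what lets us recover the phoneme symbols after a model has been trained on integer targets.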


Let's make sure this works as expected

In [21]:
from src.data.localdata import process_lvq_pak

for kind in ['train', 'test', 'all']:
    dset_opts = process_lvq_pak(kind=kind)
    dset = Dataset(**dset_opts)
    print(f'{kind}: data={dset.data.shape} target={dset.target.shape}')

train: data=(1962, 20) target=(1962,)
test: data=(1962, 20) target=(1962,)
all: data=(3924, 20) target=(3924,)


This all looks good. Now we attach this function to the `RawDataset`, and add the `RawDataset` to the global dataset list.


In [22]:
lvq_pak.load_function = process_lvq_pak

In [23]:
workflow.add_raw_dataset(lvq_pak)

Finally, re-load the dataset and save a copy of it

In [24]:
lvq = Dataset.from_raw(dataset_name, force=True)
print(str(lvq))

2018-10-17 13:46:51,475 - fetch - DEBUG - No file_name specified. Inferring lvq_pak-3.1.tar from URL
2018-10-17 13:46:51,482 - fetch - DEBUG - lvq_pak-3.1.tar already exists and hash is valid
2018-10-17 13:46:51,485 - fetch - DEBUG - lvq-pak.readme already exists and hash is valid
2018-10-17 13:46:51,488 - fetch - DEBUG - Creating lvq-pak.license from `contents` string
2018-10-17 13:46:51,491 - fetch - DEBUG - lvq-pak.license already exists and hash is valid
2018-10-17 13:46:51,531 - fetch - DEBUG - Extracting lvq_pak-3.1.tar
2018-10-17 13:46:51,536 - fetch - DEBUG - Copying lvq-pak.readme
2018-10-17 13:46:51,540 - fetch - DEBUG - Copying lvq-pak.license
2018-10-17 13:46:52,278 - datasets - DEBUG - Wrote Dataset Metadata: 2c0bb10a816a7d45cce45984f1d5f9007c0a1d16.metadata
2018-10-17 13:46:52,305 - datasets - DEBUG - Wrote Dataset: 2c0bb10a816a7d45cce45984f1d5f9007c0a1d16.dataset


<Dataset: lvq-pak, data.shape=(3924, 20), target.shape=(3924,), metadata=['descr', 'license', 'dataset_name', 'label_map', 'hash_type', 'data_hash', 'target_hash']>


Alternatively, we can create an empty transformer to convert this `RawDataset` into a `Dataset` and save it:

In [25]:
workflow.add_transformer(from_raw=dataset_name)

In [26]:
workflow.get_transformer_list()

[{'output_dataset': 'f-mnist_train',
  'raw_dataset_name': 'f-mnist',
  'raw_dataset_opts': {'kind': 'train'}},
 {'output_dataset': 'f-mnist_test',
  'raw_dataset_name': 'f-mnist',
  'raw_dataset_opts': {'kind': 'test'}},
 {'output_dataset': 'lvq-pak', 'raw_dataset_name': 'lvq-pak'},
 {'raw_dataset_name': 'lvq-pak',
  'transformations': [['train_test_split',
    {'random_state': 6502, 'test_size': 0.2}]]},
 {'output_dataset': 'lvq-pak', 'raw_dataset_name': 'lvq-pak'}]

The complete dataset can be written to `processed_data_path`. A copy of just the metadata is also stored, so that it may be quickly checked without loading the entire dataset:

In [27]:
logger.setLevel(logging.INFO)
workflow.apply_transforms()

2018-10-17 13:46:53,410 - transform_data - INFO - Writing transformed Dataset: f-mnist_train
2018-10-17 13:46:54,896 - transform_data - INFO - Writing transformed Dataset: f-mnist_test
2018-10-17 13:46:55,019 - transform_data - INFO - Writing transformed Dataset: lvq-pak
2018-10-17 13:46:55,506 - transformers - INFO - Writing Transformed Dataset: lvq-pak_train
2018-10-17 13:46:55,800 - transformers - INFO - Writing Transformed Dataset: lvq-pak_test
2018-10-17 13:46:55,917 - transform_data - INFO - Writing transformed Dataset: lvq-pak


In [28]:
lvq_meta = Dataset.load(dataset_name, metadata_only=True)

In [29]:
type(lvq_meta)

dict
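The benefit of a standalone metadata file is that we can inspect a dataset without deserializing the (potentially large) arrays. The sketch below illustrates the idea with `pickle` and `json`. The file naming, the on-disk formats, and the helper names `dump_dataset` and `load_dataset` are assumptions for illustration, not the actual `Dataset` storage layout.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def dump_dataset(name, data, metadata, dump_path):
    """Write `<name>.dataset` (everything) and `<name>.metadata` (metadata only)."""
    dump_path = Path(dump_path)
    payload = pickle.dumps(data)
    # Record a hash of the data alongside the user-supplied metadata.
    metadata = dict(metadata, hash_type="sha1",
                    data_hash=hashlib.sha1(payload).hexdigest())
    (dump_path / f"{name}.dataset").write_bytes(
        pickle.dumps({"data": data, "metadata": metadata}))
    (dump_path / f"{name}.metadata").write_text(json.dumps(metadata))
    return metadata


def load_dataset(name, dump_path, metadata_only=False):
    """Load the full dataset, or just its metadata dict if `metadata_only`."""
    dump_path = Path(dump_path)
    if metadata_only:
        return json.loads((dump_path / f"{name}.metadata").read_text())
    return pickle.loads((dump_path / f"{name}.dataset").read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    written = dump_dataset("demo", [[1.0, 2.0], [3.0, 4.0]],
                           {"dataset_name": "demo"}, tmp)
    meta = load_dataset("demo", tmp, metadata_only=True)
    full = load_dataset("demo", tmp)
```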

In [30]:
print(lvq.DATA_HASH)

c87a90aee1ddadb50282e68b9f0155a74b6d7a61


Even better, from now on, **anytime** we want to access this dataset, all we have to do is run `make data` and then load the dataset!

In [40]:
lvq_from_file = Dataset.load(dataset_name)

In [43]:
lvq_from_file.data.shape

(3924, 20)

## Train/Test Split

Since the task we have for Bjørn is to build a supervised model, we will undoubtedly want to be able to create a train/test split.

How can we script this into the Dataset creation process?

We can add a **transformer** that builds the train/test split. Since this is a transformation we want to do all the time, one is already built in.

In [46]:
workflow.available_datasets()

['lvq-pak_test', 'f-mnist_train', 'lvq-pak', 'lvq-pak_train', 'f-mnist_test']

In [47]:
workflow.available_transformers()

['index_to_date_time', 'pivot', 'train_test_split']

In [50]:
help(workflow.available_transformers(keys_only=False)['train_test_split'])

Help on function split_dataset_test_train in module src.data.transformers:

split_dataset_test_train(dset, dump_path=None, dump_metadata=True, force=True, create_dirs=True, **split_opts)
    Transformer that performs a train/test split.
    
    This transformer passes `dset` intact, but creates and dumps two new
    datasets as a side effect: {dset.name}_test and {dset.name}_train
    
    Parameters
    ----------
    dump_metadata: boolean
        If True, also dump a standalone copy of the metadata.
        Useful for checking metadata without reading
        in the (potentially large) dataset itself
    dump_path: path. (default: `processed_data_path`)
        Directory where data will be dumped.
    force: boolean
        If False, raise an exception if any dunp files already exists
        If True, overwrite any existing files
    create_dirs: boolean
        If True, `dump_path` will be created (if necessary)
    **split_opts:
        Remaining options will be passed to `train_

In [51]:
workflow.get_transformer_list()

[{'output_dataset': 'f-mnist_train',
  'raw_dataset_name': 'f-mnist',
  'raw_dataset_opts': {'kind': 'train'}},
 {'output_dataset': 'f-mnist_test',
  'raw_dataset_name': 'f-mnist',
  'raw_dataset_opts': {'kind': 'test'}},
 {'output_dataset': 'lvq-pak', 'raw_dataset_name': 'lvq-pak'},
 {'raw_dataset_name': 'lvq-pak',
  'transformations': [['train_test_split',
    {'random_state': 6502, 'test_size': 0.2}]]},
 {'output_dataset': 'lvq-pak', 'raw_dataset_name': 'lvq-pak'},
 {'raw_dataset_name': 'lvq-pak',
  'transformations': [['train_test_split',
    {'random_state': 6502, 'test_size': 0.2}]]}]

In [35]:
transform_pipeline = [("train_test_split", {'random_state':6502, 'test_size':0.2})]
workflow.add_transformer(from_raw='lvq-pak',
                         suppress_output=True,
                         transformations=transform_pipeline)

In [36]:
logger.setLevel(logging.INFO) # This step is a bit noisy, otherwise.
workflow.apply_transforms()
logger.setLevel(logging.DEBUG)

2018-10-17 13:46:58,898 - transform_data - INFO - Writing transformed Dataset: f-mnist_train
2018-10-17 13:47:00,262 - transform_data - INFO - Writing transformed Dataset: f-mnist_test
2018-10-17 13:47:00,353 - transform_data - INFO - Writing transformed Dataset: lvq-pak
2018-10-17 13:47:00,784 - transformers - INFO - Writing Transformed Dataset: lvq-pak_train
2018-10-17 13:47:01,053 - transformers - INFO - Writing Transformed Dataset: lvq-pak_test
2018-10-17 13:47:01,171 - transform_data - INFO - Writing transformed Dataset: lvq-pak
2018-10-17 13:47:01,546 - transformers - INFO - Writing Transformed Dataset: lvq-pak_train
2018-10-17 13:47:01,935 - transformers - INFO - Writing Transformed Dataset: lvq-pak_test


Notice there are now 2 new datasets: `lvq-pak_test` and `lvq-pak_train`

In [37]:
workflow.available_datasets()

['lvq-pak_test', 'f-mnist_train', 'lvq-pak', 'lvq-pak_train', 'f-mnist_test']

We can now load the split data. The metadata now reflects the options that were used to generate this split.

In [38]:
ds = Dataset.load('lvq-pak_train')
ds.SPLIT_OPTS

{'random_state': 6502, 'test_size': 0.2}
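Recording `random_state` is what makes the split reproducible: the same seed always yields the same partition. The sketch below demonstrates that property with a seeded shuffle from the standard library. It is a stand-in for illustration only, the function name `split_indices` is hypothetical, and it will not reproduce the exact rows chosen by the built-in transformer.

```python
import random


def split_indices(n_samples, test_size=0.2, random_state=None):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


# Two runs with the same options give identical splits.
train_a, test_a = split_indices(3924, test_size=0.2, random_state=6502)
train_b, test_b = split_indices(3924, test_size=0.2, random_state=6502)
```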

In general, any other munging that we want to do to our data can be done via a **transformer**. That way, our RawDataset stays **read-only**, and all of our munging is scripted and reproducible. Furthermore, we can easily access our transformed data whenever we need to.
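As a closing illustration of that pattern, here is a toy pipeline in the same `(name, options)` shape used for `transform_pipeline` above. The registry, the two transformer functions, and `apply_pipeline` are all hypothetical. They only show how a list of named steps can be replayed against untouched raw data.

```python
def scale(rows, factor=1.0):
    """Multiply every value by `factor`."""
    return [[value * factor for value in row] for row in rows]


def drop_first_column(rows):
    """Remove the first feature from every row."""
    return [row[1:] for row in rows]


# Hypothetical registry, playing the role of `workflow.available_transformers()`.
TRANSFORMERS = {"scale": scale, "drop_first_column": drop_first_column}


def apply_pipeline(rows, pipeline):
    """Apply (name, options) steps in order, never mutating the input rows."""
    for name, opts in pipeline:
        rows = TRANSFORMERS[name](rows, **opts)
    return rows


raw_rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
pipeline = [("drop_first_column", {}), ("scale", {"factor": 10.0})]
processed = apply_pipeline(raw_rows, pipeline)
```

Because each step returns new rows, `raw_rows` is left exactly as it was, and rerunning the pipeline always yields the same `processed` result.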

