# 2.2 Transforming DataSources into Datasets
“It is a capital mistake to theorize before one has data.” Sherlock Holmes, “A Scandal in Bohemia” (Arthur Conan Doyle).


In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
from src.logging import logger
logger.setLevel(logging.INFO)

## Turning a `DataSource` into a `Dataset`

### The Dataset object

A Dataset is the fundamental object we use for making the "munge" part of our data flow reproducible.

What's a Dataset object? It's a scikit-learn-style Bunch, containing:

    data: the processed data
    target: (optional) target vector (for supervised learning problems)
    metadata: Data about the data

Under the hood, this is essentially a dictionary, with a little bit of magic to make it nicer to work with.
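
To make the "dictionary with a little bit of magic" idea concrete, here is a minimal sketch of a Bunch-style container. The class name `MiniBunch` is hypothetical, and this is not the actual `Dataset` implementation in `src.data`; it only illustrates how attribute access can be layered on top of a plain dict.

```python
class MiniBunch(dict):
    """Hypothetical Bunch-like container: a dict whose keys are also attributes."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so dict methods still work
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None

    def __setattr__(self, key, value):
        # Attribute assignment writes through to the underlying dict
        self[key] = value


bunch = MiniBunch(data=[[1.0, 2.0], [3.0, 4.0]],
                  target=["A", "B"],
                  metadata={"name": "toy"})
bunch.metadata["license"] = "unknown"   # same object as bunch["metadata"]
```

Both `bunch.data` and `bunch["data"]` refer to the same object, which is the convenience the Bunch style is after.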

How do we turn raw data sources into something useful? There are 2 steps:
1. Write a function to extract meaningful `data` (and optionally, `target`) objects from your raw source files, that is, a **parse function**, and
2. package this **parse function** according to a very simple API


First, let's grab the `DataSource` we created in the last notebook.

### Loading a `DataSource` from the Catalog

In [None]:
from src import workflow
from src import paths
from src.data import DataSource
import pathlib

In [None]:
workflow.available_datasources()

In [None]:
dsrc = DataSource.from_name('lvq-pak')    # load it from the catalog
unpack_dir = dsrc.unpack()                # Find the location of the unpacked files

In [None]:
!ls -la $unpack_dir

### Building a `Dataset` object

Lucky for us, we can pick up where we left off with our `DataSource` and simply use a parse function to turn it into a `Dataset`.

## Processing the data

The next step is to write the importer that actually processes the data we will be using for this dataset.

The important things to generate are the `data` and `target` entries. A `metadata` entry is optional, but recommended if you want to save additional information about the dataset.

### `parse_function` Template
A **parse function** is a function that conforms to a very simple API: given some input, it returns a triple

```(data, target, additional_metadata)```


where `data` and `target` are in a format ingestible by, say, an sklearn pipeline.
`additional_metadata` is a dictionary of key-value pairs that will be added to any existing metadata.
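
As a rough illustration of this API, here is a toy parse function. The name `parse_toy` and its `n_samples` parameter are made up for this sketch; a real parse function would read the raw files instead of generating numbers.

```python
import numpy as np


def parse_toy(n_samples=4, metadata=None):
    """Hypothetical parse function returning (data, target, additional_metadata)."""
    if metadata is None:
        metadata = {}
    # Stand-in for reading raw files: a small feature matrix and a label vector
    data = np.arange(n_samples * 2, dtype=float).reshape(n_samples, 2)
    target = np.array(["even" if i % 2 == 0 else "odd" for i in range(n_samples)])
    # Copy before updating so the caller's dict is left untouched
    additional_metadata = dict(metadata, n_samples=n_samples)
    return data, target, additional_metadata


toy_data, toy_target, toy_metadata = parse_toy(metadata={"source": "toy"})
```

The one requirement is the shape of the return value: a feature array, a matching label array, and a dict.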

### Example: Processing lvq-pak data
Let's convert the lvq-pak data (introduced in the last section) into `data` and `target` vectors.

#### Some initial exploration of lvq-pak datafiles

In [None]:
!ls -la $unpack_dir/lvq_pak-3.1  # Files are extracted to a subdirectory:

In [None]:
datafile_train = unpack_dir / 'lvq_pak-3.1' / 'ex1.dat'
datafile_test = unpack_dir / 'lvq_pak-3.1' / 'ex2.dat'
datafile_train.exists() and datafile_test.exists()

What do these datafiles look like?

In [None]:
!head -5 $datafile_train

So `datafile_train` (`ex1.dat`) appears to consist of:
* the number of data columns, followed by
* a comment line, then
* space-delimited data

**Wait!** There's a gotcha here. Look at the last entry in each row. That's the data label. In the last row, however, we see that `#` is used as a data label (easily confused for a comment). Be careful handling this!
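
To see why this matters, here is a small sketch using made-up rows (not the real lvq-pak contents) that compares `pandas.read_csv` with and without comment detection.

```python
import io

import pandas as pd

# Made-up rows in the spirit of the lvq-pak files: the last column is the label
raw = "1.0 2.0 A\n3.0 4.0 #\n"

# comment=None (the pandas default) keeps `#` as an ordinary value
kept = pd.read_csv(io.StringIO(raw), sep=" ", header=None, comment=None, dtype=str)

# comment='#' discards everything from `#` onward, so that row loses its label
lost = pd.read_csv(io.StringIO(raw), sep=" ", header=None, comment="#", dtype=str)

print(kept.iloc[1, 2])  # the label survives
print(lost.iloc[1, 2])  # the label is now a missing value
```

With comment detection switched on, the second row silently ends up with a missing label, which is exactly the kind of quiet data corruption we want to avoid.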

In [None]:
!head -5 $datafile_test 

`datafile_test` (`ex2.dat`) is similar, but has no comment header.
 


### Exercise: Initial exploration of F-MNIST datafiles

Take a look at the F-MNIST datafiles. Plot one of the images. 

**Hint:** See https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py to get an idea for how to read in the labels and images

In [None]:
# Load the fmnist DataSource
fmnist = DataSource.from_name('fmnist')

In [None]:
fmnist_unpack_dir = fmnist.unpack()

In [None]:
!ls -la $fmnist_unpack_dir

In [None]:
!cat $fmnist_unpack_dir/fmnist.readme

By taking a look at the F-MNIST repo, we see that they actually have a parser function that shows you how to use this data! https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py

In [None]:
import numpy as np

In [None]:
# parsing labels aka. target
with open(fmnist_unpack_dir / 'train-labels-idx1-ubyte', 'rb') as labels:
    target = np.frombuffer(labels.read(), dtype=np.uint8, offset=8)

In [None]:
target.shape

In [None]:
target[:10]

In [None]:
# parsing images aka. data
with open(fmnist_unpack_dir / 'train-images-idx3-ubyte', 'rb') as images:
    data = np.frombuffer(images.read(), dtype=np.uint8, offset=16).reshape(len(target), 784)

In [None]:
data.shape

In [None]:
data[:10]
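
Where do `offset=8` and `offset=16` come from? The IDX format used by MNIST-style files begins with a header of big-endian 32-bit integers: a magic number and an item count for labels (8 bytes), plus row and column counts for images (16 bytes). The sketch below builds two tiny IDX buffers in memory (2 images of 2x2 pixels, invented for illustration) and parses them the same way.

```python
import struct

import numpy as np

# Tiny in-memory IDX buffers (2 labels, 2 images of 2x2 pixels) instead of real files
labels_raw = struct.pack(">II", 0x00000801, 2) + bytes([7, 3])
images_raw = struct.pack(">IIII", 0x00000803, 2, 2, 2) + bytes(range(8))

# Label header: magic number and item count (2 x 4 bytes = 8 bytes)
_, n_labels = struct.unpack(">II", labels_raw[:8])
toy_target = np.frombuffer(labels_raw, dtype=np.uint8, offset=8)

# Image header: magic number, item count, rows, cols (4 x 4 bytes = 16 bytes)
_, n_images, n_rows, n_cols = struct.unpack(">IIII", images_raw[:16])
toy_data = np.frombuffer(images_raw, dtype=np.uint8, offset=16)
toy_data = toy_data.reshape(n_images, n_rows * n_cols)
```

For F-MNIST the images are 28x28, which is where the `784` in the reshape comes from.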

For fun, let's take a look at one of the images (and to check that we're on track!)

In [None]:
# If not added yet, you'll need to add matplotlib to your environment.yml file and make requirements!
import matplotlib.pyplot as plt

In [None]:
# plot an image!
plt.imshow(data[0].reshape(28, 28), cmap=plt.cm.gray);

### Parsing lvq-pak data files


Recall that we want to create a `parse_function` with the following API: given some input, it returns a triple

```(data, target, additional_metadata)```


where `data` and `target` are in a format ingestible by, say, an sklearn pipeline.
`additional_metadata` is a dictionary of key-value pairs that will be added to any existing metadata.

In [None]:
import pandas as pd
import numpy as np
from functools import partial

In [None]:
def read_space_delimited(filename, skiprows=None, class_labels=True, metadata=None):
    """Read a space-delimited file
    
    Data is space-delimited. Last column is the (string) label for the data

    Note: we can't use automatic comment detection, as `#` characters are also
    used as data labels.

    Parameters
    ----------
    skiprows: list-like, int or callable, optional
        list of rows to skip when reading the file. See `pandas.read_csv`
        entry on `skiprows` for more
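    class_labels: boolean
        if true, the last column is treated as the class (target) label;
        otherwise all columns are data and the target is a vector of zeros
    metadata: dict, optional
        passed through unchanged as the third element of the returned triple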
    """
    with open(filename, 'r') as fd:
        df = pd.read_csv(fd, skiprows=skiprows, skip_blank_lines=True,
                           comment=None, header=None, sep=' ', dtype=str)
        # targets are last column. Data is everything else
        if class_labels is True:
            target = df.loc[:, df.columns[-1]].values
            data = df.loc[:, df.columns[:-1]].values
        else:
            data = df.values
            target = np.zeros(data.shape[0])
        return data, target, metadata

**Note:** `read_space_delimited` can be imported from `src.data.utils`.

In [None]:
data, target, metadata = read_space_delimited(datafile_train, skiprows=[0,1])
data.shape, target.shape, metadata
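
One thing to keep in mind: because the file is read with `dtype=str`, the returned `data` array holds strings rather than numbers. A minimal sketch of converting such an array to floats, using invented values as a stand-in for the real `data`:

```python
import numpy as np

# Stand-in for the string-valued `data` array produced by a dtype=str read
str_data = np.array([["1.5", "2.0"], ["-0.25", "4"]], dtype=object)

# Convert to a float matrix before handing it to a numerical library
float_data = str_data.astype(float)
```

Whether to convert inside the parse function or later in the pipeline is a design choice; this notebook leaves the values as read.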

### Exercise: Write a parsing function for F-MNIST

Write a function that takes as input the F-MNIST data path, and returns a `(data, target, metadata)` triple.

In [None]:
def read_fmnist(data_path, kind='train', metadata=None):
    """
    Read fmnist data files.
    
    Parameters
    ----------
    data_path: path
        base directory to look for the files in
    kind: one of 'train' and 'test'
        whether to parse the training or test datasets
    metadata: dict
        metadata to add to the process
    
    Returns
    -------
    (data, target, metadata)
    """
    data_path = pathlib.Path(data_path)
    
    if kind == 'train':
        name_kind = kind
    elif kind == 'test':
        name_kind = 't10k'
    else:
        raise ValueError(f"Unknown kind:{kind}")

    # parsing labels aka. target
    with open(data_path / f'{name_kind}-labels-idx1-ubyte', 'rb') as labels:
        target = np.frombuffer(labels.read(), dtype=np.uint8, offset=8)
        
    # parsing images aka. data
    with open(data_path / f'{name_kind}-images-idx3-ubyte', 'rb') as images:
        data = np.frombuffer(images.read(), dtype=np.uint8, offset=16).reshape(len(target), 784)
        
    return data, target, metadata

In [None]:
# test things out
data, target, metadata = read_fmnist(fmnist_unpack_dir, kind='train')
data.shape, target.shape, metadata

In [None]:
data, target, metadata = read_fmnist(fmnist_unpack_dir, kind='test')
data.shape, target.shape, metadata

## Writing a process function

We could be done here, but let's go a little further and allow the parsing function to return either `train`, `test` or `all` data. In other words, let's create a processing function, `process_lvq_pak`, that takes a `kind` as input.

In [None]:
def process_lvq_pak(*, unpack_dir, kind='all', extract_dir='lvq_pak-3.1', metadata=None):
    """
    Parse LVQ-PAK datafiles into usable numpy arrays
    
    Parameters
    ----------
    unpack_dir: path
        path to unpacked tarfile
    extract_dir: string
        name of directory in the unpacked tarfile containing
        the raw data files
    kind: {'train', 'test', 'all'}
    
    
    Returns
    -------
    A tuple: 
       (data, target, additional_metadata)
    
    """
    if metadata is None:
        metadata = {}

    if unpack_dir:
        unpack_dir = pathlib.Path(unpack_dir)

    data_dir = unpack_dir / extract_dir

    if kind == 'train':
        data, target, metadata = read_space_delimited(data_dir / 'ex1.dat',
                                                      skiprows=[0,1],
                                                      metadata=metadata)
    elif kind == 'test':
        data, target, metadata = read_space_delimited(data_dir / 'ex2.dat',
                                                      skiprows=[0],
                                                      metadata=metadata)
    elif kind == 'all':
        data1, target1, metadata = read_space_delimited(data_dir / 'ex1.dat', skiprows=[0,1],
                                                        metadata=metadata)
        data2, target2, metadata = read_space_delimited(data_dir / 'ex2.dat', skiprows=[0],
                                                        metadata=metadata)
        data = np.vstack((data1, data2))
        target = np.append(target1, target2)
    else:
        raise Exception(f'Unknown kind: {kind}')

    return data, target, metadata

In [None]:
# All data by default
data, target, metadata = process_lvq_pak(unpack_dir=unpack_dir)
data.shape, target.shape, metadata

In [None]:
# Training data 
data, target, metadata = process_lvq_pak(unpack_dir=unpack_dir, kind='train')
data.shape, target.shape, metadata

In [None]:
# Test data 
data, target, metadata = process_lvq_pak(unpack_dir=unpack_dir, kind='test')
data.shape, target.shape, metadata
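
The `kind='all'` branch simply stacks the training rows on top of the test rows. Here is a tiny sketch of that combination step on invented arrays:

```python
import numpy as np

train_data = np.array([[1.0, 2.0], [3.0, 4.0]])
test_data = np.array([[5.0, 6.0]])
train_target = np.array(["A", "B"])
test_target = np.array(["#"])

# vstack stacks the row blocks; append concatenates the two label vectors
all_data = np.vstack((train_data, test_data))
all_target = np.append(train_target, test_target)
```

Row order is preserved, so the first `len(train_target)` rows of the combined arrays are the training data.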

Now, by adding `process_lvq_pak` to our `DataSource` object as a `parse_function`, we'll be able to reproducibly create a `Dataset` from our `DataSource`.

In [None]:
dsrc.parse_function = partial(process_lvq_pak, unpack_dir=str(unpack_dir))
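
Why `partial`? It freezes the `unpack_dir` argument now while leaving options such as `kind` open until the function is actually called. A minimal sketch with a hypothetical stand-in function (`process_toy` is not part of `src`):

```python
from functools import partial


def process_toy(*, unpack_dir, kind="all", metadata=None):
    """Hypothetical stand-in for a process function with keyword-only arguments."""
    return f"{unpack_dir}:{kind}"


# Freeze the location now; leave `kind` to be chosen at call time
bound = partial(process_toy, unpack_dir="/tmp/raw")

default_result = bound()              # uses kind="all"
train_result = bound(kind="train")    # overrides kind at call time
```

The frozen keyword arguments remain inspectable via `bound.keywords`, and the wrapped function via `bound.func`.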

In [None]:
dsrc.dataset_opts()

### Exercise: Write a process function for F-MNIST

In [None]:
def process_fmnist(*, unpack_dir, kind='all', extract_dir=None, metadata=None):
    """
    Load the F-MNIST dataset 

    Parameters
    ----------
    unpack_dir: path
        path to unpacked tarfile
    kind: {'train', 'test', 'all'}
        Dataset comes pre-split into training and test data.
        Indicates which dataset to load
    metadata: dict
        Additional metadata fields will be added to this dict.
        'kind': value of `kind` used to generate a subset of the data
    
    Returns
    -------
    A tuple: 
       (data, target, additional_metadata)
    
    """
    if metadata is None:
        metadata = {}

    if unpack_dir:
        unpack_dir = pathlib.Path(unpack_dir)
    if extract_dir is not None:
        data_dir = unpack_dir / extract_dir
    else:
        data_dir = unpack_dir

    if kind == 'train':
        data, target, metadata = read_fmnist(data_dir, kind='train', metadata=metadata)
    elif kind == 'test':
        data, target, metadata = read_fmnist(data_dir, kind='test', metadata=metadata)
    elif kind == 'all':
        data1, target1, metadata = read_fmnist(data_dir, kind='train', metadata=metadata)
        data2, target2, metadata = read_fmnist(data_dir, kind='test', metadata=metadata)
        data = np.vstack((data1, data2))
        target = np.append(target1, target2)
    else:
        raise Exception(f'Unknown kind: {kind}')

    return data, target, metadata

In [None]:
# test things out
data, target, metadata = process_fmnist(unpack_dir=fmnist_unpack_dir, kind='train')
data.shape, target.shape, metadata

In [None]:
# test things out
data, target, metadata = process_fmnist(unpack_dir=fmnist_unpack_dir, kind='test')
data.shape, target.shape, metadata

In [None]:
# test things out
data, target, metadata = process_fmnist(unpack_dir=fmnist_unpack_dir, kind='all')
data.shape, target.shape, metadata

In [None]:
## Add the process function to the dataset
fmnist.parse_function = partial(process_fmnist, unpack_dir=str(fmnist_unpack_dir))

In [None]:
fmnist.dataset_opts()

### Create a `Dataset`

In [None]:
ds = dsrc.process() # Use the process function to convert this DataSource to a real Dataset
str(ds)

In [None]:
print(ds)

In [None]:
ds = dsrc.process(kind="test")  # Should be half the size
print(ds)

In [None]:
type(ds)

### Write this into the catalog

In [None]:
# Now we want to save this to the workflow. We can just do the same as before!

In [None]:
workflow.add_datasource(dsrc)

In [None]:
workflow.available_datasources()

In [None]:
dset_catalog, dset_catalog_file = workflow.available_datasources(keys_only=False)

In [None]:
dset_catalog['lvq-pak']

### Add `process_lvq_pak` to the `src` module

Part of making things reproducible is moving helper functions out of notebooks and into the `src` module as we go. By convention, we add custom dataset processing and generation functions to `src/data/localdata.py`.

### Exercise: Use the `src` module for reproducibility
Add `process_lvq_pak` to `localdata.py` (and add it to `__all__`) to make it visible to our dataset code.

In [None]:
%%file ../src/data/localdata.py
"""
Custom dataset processing/generation functions should be added to this file
"""

import pathlib
from .utils import read_space_delimited
import numpy as np

__all__ = [
    'process_lvq_pak'
]


def process_lvq_pak(*, unpack_dir, kind='all', extract_dir='lvq_pak-3.1', metadata=None):
    """
    Parse LVQ-PAK datafiles into usable numpy arrays
    
    Parameters
    ----------
    unpack_dir: path
        path to unpacked tarfile
    extract_dir: string
        name of directory in the unpacked tarfile containing
        the raw data files
    kind: {'train', 'test', 'all'}
    
    
    Returns
    -------
    A tuple: 
       (data, target, additional_metadata)
    
    """
    if metadata is None:
        metadata = {}

    if unpack_dir:
        unpack_dir = pathlib.Path(unpack_dir)

    data_dir = unpack_dir / extract_dir

    if kind == 'train':
        data, target, metadata = read_space_delimited(data_dir / 'ex1.dat',
                                                      skiprows=[0,1],
                                                      metadata=metadata)
    elif kind == 'test':
        data, target, metadata = read_space_delimited(data_dir / 'ex2.dat',
                                                      skiprows=[0],
                                                      metadata=metadata)
    elif kind == 'all':
        data1, target1, metadata = read_space_delimited(data_dir / 'ex1.dat', skiprows=[0,1],
                                                        metadata=metadata)
        data2, target2, metadata = read_space_delimited(data_dir / 'ex2.dat', skiprows=[0],
                                                        metadata=metadata)
        data = np.vstack((data1, data2))
        target = np.append(target1, target2)
    else:
        raise Exception(f'Unknown kind: {kind}')

    return data, target, metadata


In [None]:
# This import call should now work!
from src.data.localdata import process_lvq_pak

### Exercise:
Use `process_lvq_pak` from `src.data.localdata` as the `parse_function`, and re-add the `DataSource` to the `workflow`.

In [None]:
dsrc.parse_function = partial(process_lvq_pak, unpack_dir=str(unpack_dir))

In [None]:
workflow.add_datasource(dsrc)

In [None]:
## Should point to the function in the src module, not the function in this notebook!
dset_catalog, dset_catalog_file = workflow.available_datasources(keys_only=False)
dset_catalog['lvq-pak']['load_function_module']
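
Why does the module matter? The catalog entry records a module and a function name, so presumably the function is looked up again by import when the entry is loaded. The sketch below illustrates that general idea only; `describe` and `restore` are hypothetical helpers, not the `src.workflow` implementation, and `json.dumps` stands in for a process function that lives in an importable module.

```python
import importlib
import json
from functools import partial


def describe(func):
    """Record where a (possibly partial) function lives and which kwargs it carries."""
    kwargs = {}
    if isinstance(func, partial):
        kwargs, func = dict(func.keywords), func.func
    return {"module": func.__module__, "name": func.__name__, "kwargs": kwargs}


def restore(record):
    """Re-import the function by module and name, then re-apply the stored kwargs."""
    module = importlib.import_module(record["module"])
    return partial(getattr(module, record["name"]), **record["kwargs"])


record = describe(partial(json.dumps, sort_keys=True))
rebuilt = restore(record)
```

A function defined in a notebook reports `__main__` as its module, which a fresh process cannot re-import. That is the practical reason for moving it into `src/data/localdata.py`.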

In [None]:
## Check things still work!
dsrc.process()

### Exercise: Stop and check everything in using git!

Use a branch and a PR via GitHub or Bitbucket, then pull your changes to `src/data/localdata.py` back into your local master.

In [None]:
!git status

## Exercise: Finish turning the F-MNIST `DataSource` into a `Dataset`
* save your parse functions in the `src` module
* re-add your datasource to the `workflow`

In [None]:
# Add process fmnist to the src module

In [None]:
%%file ../src/data/localdata.py
"""
Custom dataset processing/generation functions should be added to this file
"""

import pathlib
from .utils import read_space_delimited
import numpy as np

__all__ = [
    'process_lvq_pak',
    'process_fmnist'
]


def process_lvq_pak(*, unpack_dir, kind='all', extract_dir='lvq_pak-3.1', metadata=None):
    """
    Parse LVQ-PAK datafiles into usable numpy arrays
    
    Parameters
    ----------
    unpack_dir: path
        path to unpacked tarfile
    extract_dir: string
        name of directory in the unpacked tarfile containing
        the raw data files
    kind: {'train', 'test', 'all'}
    
    
    Returns
    -------
    A tuple: 
       (data, target, additional_metadata)
    
    """
    if metadata is None:
        metadata = {}

    if unpack_dir:
        unpack_dir = pathlib.Path(unpack_dir)

    data_dir = unpack_dir / extract_dir

    if kind == 'train':
        data, target, metadata = read_space_delimited(data_dir / 'ex1.dat',
                                                      skiprows=[0,1],
                                                      metadata=metadata)
    elif kind == 'test':
        data, target, metadata = read_space_delimited(data_dir / 'ex2.dat',
                                                      skiprows=[0],
                                                      metadata=metadata)
    elif kind == 'all':
        data1, target1, metadata = read_space_delimited(data_dir / 'ex1.dat', skiprows=[0,1],
                                                        metadata=metadata)
        data2, target2, metadata = read_space_delimited(data_dir / 'ex2.dat', skiprows=[0],
                                                        metadata=metadata)
        data = np.vstack((data1, data2))
        target = np.append(target1, target2)
    else:
        raise Exception(f'Unknown kind: {kind}')

    return data, target, metadata

def read_fmnist(data_path, kind='train', metadata=None):
    """
    Helper function to read fmnist data files.
    
    Parameters
    ----------
    data_path: path
        base directory to look for the files in
    kind: one of 'train' and 'test'
        whether to parse the training or test datasets
    metadata: dict
        metadata to add to the process
    
    Returns
    -------
    (data, target, metadata)
    """
    data_path = pathlib.Path(data_path)
    
    if kind == 'train':
        name_kind = kind
    elif kind == 'test':
        name_kind = 't10k'
    else:
        raise ValueError(f"Unknown kind:{kind}")

    # parsing labels aka. target
    with open(data_path / f'{name_kind}-labels-idx1-ubyte', 'rb') as labels:
        target = np.frombuffer(labels.read(), dtype=np.uint8, offset=8)
        
    # parsing images aka. data
    with open(data_path / f'{name_kind}-images-idx3-ubyte', 'rb') as images:
        data = np.frombuffer(images.read(), dtype=np.uint8, offset=16).reshape(len(target), 784)
        
    return data, target, metadata

def process_fmnist(*, unpack_dir, kind='all', extract_dir=None, metadata=None):
    """
    Load the F-MNIST dataset 

    Parameters
    ----------
    unpack_dir: path
        path to unpacked tarfile
    kind: {'train', 'test', 'all'}
        Dataset comes pre-split into training and test data.
        Indicates which dataset to load
    metadata: dict
        Additional metadata fields will be added to this dict.
        'kind': value of `kind` used to generate a subset of the data
    
    Returns
    -------
    A tuple: 
       (data, target, additional_metadata)
    
    """
    if metadata is None:
        metadata = {}

    if unpack_dir:
        unpack_dir = pathlib.Path(unpack_dir)
    if extract_dir is not None:
        data_dir = unpack_dir / extract_dir
    else:
        data_dir = unpack_dir

    if kind == 'train':
        data, target, metadata = read_fmnist(data_dir, kind='train', metadata=metadata)
    elif kind == 'test':
        data, target, metadata = read_fmnist(data_dir, kind='test', metadata=metadata)
    elif kind == 'all':
        data1, target1, metadata = read_fmnist(data_dir, kind='train', metadata=metadata)
        data2, target2, metadata = read_fmnist(data_dir, kind='test', metadata=metadata)
        data = np.vstack((data1, data2))
        target = np.append(target1, target2)
    else:
        raise Exception(f'Unknown kind: {kind}')

    return data, target, metadata

In [None]:
from src.data.localdata import process_fmnist

In [None]:
# check that `process_fmnist` is coming from the correct src.data.localdata
process_fmnist?

In [None]:
# Add process_fmnist as a parse function to the fmnist
# DataSource for both train and test (based on the function in src)
fmnist_train = DataSource.from_name('fmnist')
fmnist_train.name = 'fmnist_train'

fmnist_test = DataSource.from_name('fmnist')
fmnist_test.name = 'fmnist_test'

fmnist_train.name, fmnist_test.name, fmnist.name

In [None]:
fmnist.parse_function  = partial(process_fmnist, unpack_dir=str(fmnist_unpack_dir))

In [None]:
# test things out
fmnist_dataset = fmnist.process(kind='train')
fmnist_dataset.data.shape, fmnist_dataset.target.shape,

In [None]:
fmnist_dataset = fmnist.process(kind='test')
fmnist_dataset.data.shape, fmnist_dataset.target.shape,

In [None]:
fmnist_dataset = fmnist.process()
fmnist_dataset.data.shape, fmnist_dataset.target.shape,

In [None]:
# Add the fmnist DataSource to the workflow

In [None]:
workflow.add_datasource(fmnist)

In [None]:
workflow.available_datasources()

In [None]:
# Check that the fmnist DataSource `load_function_name` is pointing to the `src` module
dset_catalog, dset_catalog_file = workflow.available_datasources(keys_only=False)
dset_catalog['fmnist']['load_function_module'], dset_catalog['fmnist']['load_function_name']

In [None]:
# Check everything in using git
!git status

## Automating the workflow

What we have so far is enough to be able to load a `Dataset` from a `DataSource`. We want to go a step further and add the generation of this data to the automated workflow so that we can blow away our data and recreate it using `make` commands.

Next up, we want to be able to `make data`:
<img src="references/cheat_sheet.png" alt="Reproducible Data Science Workflow" width="400"/>

In [None]:
from src.data import Dataset

In [None]:
workflow.available_datasources()

In [None]:
lvq_pak = Dataset.from_datasource('lvq-pak')

In [None]:
str(lvq_pak)

## Recall: so far we have up to `make sources`

In [None]:
!cd .. && make clean_raw && make clean_interim

In [None]:
!cd .. && make sources

And we can recover our newly created `Datasets` from our `DataSources`.

In [None]:
ds = Dataset.from_datasource('lvq-pak')
ds.data.shape, ds.target.shape

In [None]:
ds = Dataset.from_datasource('lvq-pak', kind='train')
ds.data.shape, ds.target.shape

## Transformers: an intro to `make data`

We still need to automate the `Dataset` generation as part of our `workflow`. We'll do this using `transformers` (which we'll get into more in the next notebook).

In [None]:
!cd .. && make clean_processed

In [None]:
!ls -la $paths.data_path/processed

In [None]:
ds.dump()

In [None]:
!ls -la $paths.data_path/processed

In [None]:
!cd .. && make clean_processed

In [None]:
!ls -la $paths.data_path/processed


Let's encode this as a transformer from a `DataSource` to a `Dataset` as part of our automated, reproducible workflow!

In [None]:
workflow.add_transformer(from_datasource='lvq-pak')

In [None]:
workflow.get_transformer_list()

In [None]:
workflow.make_data()

In [None]:
!ls -la $paths.data_path/processed

In [None]:
!cd .. && make clean_processed

In [None]:
!cd .. && make data

In [None]:
!ls -la $paths.data_path/processed

## Exercise: Create the F-MNIST dataset

* Create an F-MNIST `Dataset` for both `train` and `test`
* Blow it away and recreate it using `make data`

In [None]:
workflow.add_transformer(from_datasource='fmnist',
                         datasource_opts={'kind':'train'},
                         output_dataset='fmnist_train')

In [None]:
workflow.add_transformer(from_datasource='fmnist',
                         datasource_opts={'kind':'test'},
                         output_dataset='fmnist_test')

In [None]:
workflow.make_data()

In [None]:
workflow.available_datasets()

In [None]:
ds = Dataset.load('fmnist_train')
ds.data.shape, ds.target.shape

In [None]:
ds = Dataset.load('fmnist_test')
ds.data.shape, ds.target.shape

## Welcome to Reproducible Datasets!