# 2.2 Transforming Data Sources into Data
“It is a capital mistake to theorize before one has data.” Sherlock Holmes, “A Scandal in Bohemia” (Arthur Conan Doyle).

“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.” – Jim Barksdale, former Netscape CEO

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
from src.logging import logger
logger.setLevel(logging.INFO)

## Turning a `DataSource` into a `Dataset`
How do we turn raw data sources into something useful? There are two steps:
1. Write a function, called a **parse function**, that extracts meaningful `data` (and, optionally, `target`) objects from your raw source files.
2. Package this **parse function** according to a very simple API.


First, let's grab the `DataSource` we created in the last notebook.


### Loading a `DataSource` from the Catalog

In [None]:
from src import workflow
from src.data import DataSource
import pathlib

In [None]:
workflow.available_datasources()

In [None]:
dsrc = DataSource.from_name('lvq-pak')    # load it from the catalog
unpack_dir = dsrc.unpack()                # Find the location of the unpacked files

In [None]:
!ls -la $unpack_dir

### `parse_function` Template
A **parse function** is a function that conforms to a very simple API: given some input, it returns a triple

```(data, target, additional_metadata)```


where `data` and `target` are in a format ingestible by, say, an sklearn pipeline.
`additional_metadata` is a dictionary of key-value pairs that will be added to any existing metadata.
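
As a minimal sketch of this API, the hypothetical parse function below builds a tiny in-memory dataset. The name `parse_toy_data` and its parameters are illustrative assumptions, not part of `src`; the only thing taken from the description above is the shape of the return value.

```python
import numpy as np


def parse_toy_data(*, n_samples=4, metadata=None):
    """Hypothetical parse function: returns (data, target, additional_metadata)."""
    if metadata is None:
        metadata = {}
    # data: one row per sample, two numeric feature columns
    data = np.arange(n_samples * 2, dtype=float).reshape(n_samples, 2)
    # target: one label per sample (alternating 0 and 1)
    target = np.arange(n_samples) % 2
    # additional metadata: extra key-value pairs describing this dataset
    additional_metadata = {**metadata, 'n_samples': n_samples}
    return data, target, additional_metadata


data, target, additional_metadata = parse_toy_data(n_samples=4)
print(data.shape, target.shape, additional_metadata)
```

Using keyword-only arguments here mirrors the style of the lvq-pak parser later in this section, but that is a design choice rather than a requirement stated by the API.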

### Example: Processing lvq-pak data
Let's convert the lvq-pak data (introduced in the last section) into `data` and `target` vectors.

#### Some EDA on lvq-pak datafiles

In [None]:
!ls -la $unpack_dir/lvq_pak-3.1  # Files are extracted to a subdirectory:

In [None]:
datafile_train = unpack_dir / 'lvq_pak-3.1' / 'ex1.dat'
datafile_test = unpack_dir / 'lvq_pak-3.1' / 'ex2.dat'
datafile_train.exists() and datafile_test.exists()

What do these datafiles look like?

In [None]:
!head -5 $datafile_train

So `datafile_train` (`ex1.dat`) appears to consist of:
* the number of data columns, followed by
* a comment line, then
* space-delimited data

**Wait!** There's a gotcha here. Look at the last entry in each row. That's the data label. In the last row, however, we see that `#` is used as a data label (easily confused for a comment). Be careful handling this!
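
To see why this matters, here is a small sketch using made-up rows (not the actual contents of `ex1.dat`). With `comment='#'`, `pandas.read_csv` treats the `#` label as the start of a comment and discards it; with `comment=None`, the label survives as ordinary data.

```python
import io

import pandas as pd

# Made-up sample in the same style: space-delimited values, label in the last column
sample = "1.0 2.0 A\n3.0 4.0 #\n"

# Treating '#' as a comment character discards the label on the second row
df_comment = pd.read_csv(io.StringIO(sample), sep=' ', header=None,
                         comment='#', dtype=str)

# Disabling comment detection keeps '#' as an ordinary label
df_plain = pd.read_csv(io.StringIO(sample), sep=' ', header=None,
                       comment=None, dtype=str)

# Compare the last (label) column under each setting
labels_comment = df_comment.iloc[:, -1].tolist()
labels_plain = df_plain.iloc[:, -1].tolist()
print(labels_comment)
print(labels_plain)
```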

In [None]:
!head -5 $datafile_test 

`datafile_test` (`ex2.dat`) is similar, but has no comment header.

#### Parsing lvq-pak data files

In [None]:
import pandas as pd
import numpy as np
from functools import partial

In [None]:
def read_space_delimited(filename, skiprows=None, class_labels=True, metadata=None):
    """Read a space-delimited file

    Data is space-delimited. Last column is the (string) label for the data

    Note: we can't use automatic comment detection, as `#` characters are also
    used as data labels.

    Parameters
    ----------
    filename: path
        file to read
    skiprows: list-like, int or callable, optional
        list of rows to skip when reading the file. See `pandas.read_csv`
        entry on `skiprows` for more
    class_labels: boolean
        if true, the last column is treated as the class (target) label
    metadata: dict, optional
        returned unchanged as the third element of the result

    Returns
    -------
    A tuple: (data, target, metadata). All values are read as strings.
    """
    with open(filename, 'r') as fd:
        df = pd.read_csv(fd, skiprows=skiprows, skip_blank_lines=True,
                           comment=None, header=None, sep=' ', dtype=str)
        # targets are last column. Data is everything else
        if class_labels is True:
            target = df.loc[:, df.columns[-1]].values
            data = df.loc[:, df.columns[:-1]].values
        else:
            data = df.values
            target = np.zeros(data.shape[0])
        return data, target, metadata

In [None]:
data, target, metadata = read_space_delimited(datafile_train, skiprows=[0,1])
data.shape, target.shape, metadata

We could be done here, but let's go a little further and allow the parsing function to return either `train`, `test` or `all` data:

In [None]:
def process_lvq_pak(*, unpack_dir, kind='all', extract_dir='lvq_pak-3.1', metadata=None):
    """
    Parse LVQ-PAK datafiles into usable numpy arrays
    
    Parameters
    ----------
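    unpack_dir: path
        path to unpacked tarfile
    extract_dir: string
        name of directory in the unpacked tarfile containing
        the raw data files
    kind: {'train', 'test', 'all'}
        which data to return: `ex1.dat` ('train'), `ex2.dat` ('test'),
        or both stacked together ('all')
    metadata: dict, optional
        passed through unchanged as `additional_metadata`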
    
    
    Returns
    -------
    A tuple: 
       (data, target, additional_metadata)
    
    """
    if metadata is None:
        metadata = {}

    if unpack_dir:
        unpack_dir = pathlib.Path(unpack_dir)

    data_dir = unpack_dir / extract_dir

    if kind == 'train':
        data, target, metadata = read_space_delimited(data_dir / 'ex1.dat',
                                                      skiprows=[0,1],
                                                      metadata=metadata)
    elif kind == 'test':
        data, target, metadata = read_space_delimited(data_dir / 'ex2.dat',
                                                      skiprows=[0],
                                                      metadata=metadata)
    elif kind == 'all':
        data1, target1, metadata = read_space_delimited(data_dir / 'ex1.dat', skiprows=[0,1],
                                                        metadata=metadata)
        data2, target2, metadata = read_space_delimited(data_dir / 'ex2.dat', skiprows=[0],
                                                        metadata=metadata)
        data = np.vstack((data1, data2))
        target = np.append(target1, target2)
    else:
        raise ValueError(f'Unknown kind: {kind}')

    return data, target, metadata
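
The `'all'` branch combines the two files with two different NumPy calls because `data` is two-dimensional while `target` is one-dimensional. A minimal sketch with made-up arrays (the values are placeholders, not lvq-pak data):

```python
import numpy as np

# Made-up stand-ins for the train and test arrays
data1 = np.array([['1.0', '2.0'], ['3.0', '4.0']])
data2 = np.array([['5.0', '6.0']])
target1 = np.array(['A', '#'])
target2 = np.array(['B'])

# vstack stacks 2-D arrays row-wise, so the column count is preserved
data = np.vstack((data1, data2))
# append concatenates 1-D arrays end to end
target = np.append(target1, target2)

print(data.shape, target.shape)
```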

In [None]:
# All data by default
data, target, metadata = process_lvq_pak(unpack_dir=unpack_dir)
data.shape, target.shape, metadata

In [None]:
# Training data 
data, target, metadata = process_lvq_pak(unpack_dir=unpack_dir, kind='train')
data.shape, target.shape, metadata

In [None]:
# Test data 
data, target, metadata = process_lvq_pak(unpack_dir=unpack_dir, kind='test')
data.shape, target.shape, metadata

In [None]:
dsrc.parse_function = partial(process_lvq_pak, unpack_dir=str(unpack_dir))
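
`functools.partial` freezes the `unpack_dir` argument and leaves other keyword arguments, such as `kind`, free to be supplied later. A minimal sketch with a stand-in function (`toy_parse` and the path `/tmp/raw` are hypothetical; the function only mimics the keyword-only signature of `process_lvq_pak`):

```python
from functools import partial


def toy_parse(*, unpack_dir, kind='all'):
    """Stand-in for a parse function with keyword-only arguments."""
    return f'{unpack_dir}:{kind}'


# Bind unpack_dir now; kind can still be chosen at call time
bound = partial(toy_parse, unpack_dir='/tmp/raw')

default_result = bound()           # falls back to kind='all'
test_result = bound(kind='test')   # overrides kind at call time
print(default_result, test_result)
```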

In [None]:
dsrc.dataset_opts()

### Write this into the catalog

In [None]:
# Now we want to save this to the workflow. We can just do the same as before, right?

In [None]:
workflow.add_datasource(dsrc)

In [None]:
workflow.available_datasources()

In [None]:
dset_catalog, dset_catalog_file = workflow.available_datasources(keys_only=False)

In [None]:
dset_catalog['lvq-pak']

### Create a Dataset

In [None]:
ds = dsrc.process()  # Use the parse_function to convert this DataSource to a real Dataset
str(ds)

In [None]:
print(ds)

In [None]:
ds = dsrc.process(kind="test")  # Should be half the size
print(ds)

In [None]:
type(ds)

## EXERCISE: Turn the F-MNIST `DataSource` into a `Dataset`
In the last exercise, you fetched and unpacked F-MNIST data.
Now it's time to process it into a `Dataset` object.

## The `Dataset` and Data Transformations

### Tour of the Dataset Object

### Creating a Simple Transformer

### More Complicated Transformers

## Reproducible Data: The Punchline