# 2.2 Transforming Data Sources into Data
“It is a capital mistake to theorize before one has data.” – Sherlock Holmes, “A Scandal in Bohemia” (Arthur Conan Doyle)

“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.” – Jim Barksdale, former Netscape CEO

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
from src.logging import logger
logger.setLevel(logging.INFO)

## Turning a `DataSource` into a `Dataset`
How do we turn raw data sources into something useful? There are two steps:
1. Write a function to extract meaningful `data` (and, optionally, `target`) objects from your raw source files.
2. Wrap this function in the form of a **processing function**.


First, let's grab the `DataSource` we created in the last notebook.


### Loading a `DataSource` from the Catalog

In [3]:
from src import workflow
from src.data import DataSource

In [4]:
workflow.available_datasources()

['lvq-pak']

In [5]:
dsrc = DataSource.from_name('lvq-pak')    # load it from the catalog
unpack_dir = dsrc.unpack()                # Find the location of the unpacked files

In [6]:
!ls -la $unpack_dir

total 16
drwxr-xr-x 2 ava00088 users    0 Feb  6 17:31 .
drwxr-xr-x 2 ava00088 users 4096 Feb  6 18:24 ..
drwxr-xr-x 2 ava00088 users    0 Apr  6  1995 lvq_pak-3.1
-rwxr-xr-x 1 ava00088 users 2483 Feb  7 13:53 lvq-pak.license
-rwxr-xr-x 1 ava00088 users 4958 Feb  7 13:53 lvq-pak.readme


### Processing Function Template
A **processing function** is a function that
* takes at least two keyword arguments as input: `dataset_name` (a string) and `metadata` (a dict), and
* returns a dictionary with the following keys: `dataset_name`, `data`, `target` (optional), and `metadata`.

Why a dictionary? Because the output of this function becomes the keyword arguments
passed to the `Dataset` constructor.
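
To make the dictionary-to-constructor handoff concrete, here is a minimal sketch. `MiniDataset` and `process_toy` are hypothetical stand-ins invented for illustration; the real `Dataset` class in `src.data` may accept additional arguments and behave differently.

```python
class MiniDataset:
    """Hypothetical stand-in for `Dataset`; only mirrors the four documented keys."""
    def __init__(self, dataset_name=None, data=None, target=None, metadata=None):
        self.name = dataset_name
        self.data = data
        self.target = target
        self.metadata = metadata if metadata is not None else {}


def process_toy(dataset_name='toy', metadata=None):
    """A processing function: keyword arguments in, constructor dict out."""
    if metadata is None:
        metadata = {}
    return {
        'dataset_name': dataset_name,
        'metadata': metadata,
        'data': [[0.0, 1.0], [2.0, 3.0]],   # made-up values
        'target': ['a', 'b'],               # made-up labels
    }


# The returned dict is unpacked directly into the constructor's keyword arguments
toy_opts = process_toy(metadata={'descr': 'toy example'})
toy_ds = MiniDataset(**toy_opts)
```

Because the keys of the dict match the constructor's parameter names, no glue code is needed between the two.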

Here's a template:

In [7]:
def process_raw_data(dataset_name='raw_data', metadata=None, extract_func=None, **kwargs):
    """Convert raw DataSource files into a Dataset constructor dict
    
    Parameters
    ----------
    dataset_name: (string)
        Name of this raw dataset. This will be used as a key for accessing this raw dataset in the
        Raw Dataset catalog
    metadata: dict or None
        If None, an empty metadata dictionary will be used.
    extract_func: function returning tuple: (data, target, metadata)
    **kwargs: additional parameters to be passed to `extract_func`

    Returns
    -------
    Dictionary containing the following keys:
        dataset_name: (string)
            `dataset_name` that was passed to the function
        metadata: (dict)
            dict containing the input `metadata` key/value pairs, and (optionally)
            additional information about this raw dataset
        data: array-style object
            Often a `numpy.ndarray` or `pandas.DataFrame`
        target: (optional) vector-style object
            for supervised learning problems, the target vector associated with `data`
    """
    if metadata is None:
        metadata = {}

    data, target = None, None
    
    if extract_func is None:
        # Default: a no-op extractor that passes through empty data and target
        def extract_func(**kw):
            return (data, target, metadata)
  
    # Generate `data` and `target` info
    data, target, metadata = extract_func(metadata=metadata, **kwargs)

    dset_opts = {
        'dataset_name': dataset_name,
        'metadata': metadata,
        'data': data,
        'target': target,
    }
    return dset_opts
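
As a hedged illustration of the `extract_func` contract described in the docstring above, the sketch below defines a hypothetical extractor, `extract_toy`, and calls it the same way the template does (`extract_func(metadata=metadata, **kwargs)`). The `n_rows` parameter and the generated values are made up for the example.

```python
def extract_toy(metadata=None, n_rows=3):
    """Hypothetical extractor: returns a (data, target, metadata) tuple."""
    if metadata is None:
        metadata = {}
    data = [[float(i), float(i) ** 2] for i in range(n_rows)]   # made-up features
    target = ['even' if i % 2 == 0 else 'odd' for i in range(n_rows)]
    # Extractors may enrich the metadata they were handed
    metadata = {**metadata, 'n_rows': n_rows}
    return data, target, metadata


# Mirrors the call inside the template; extra keyword arguments arrive via **kwargs
toy_kwargs = {'n_rows': 2}
toy_data, toy_target, toy_meta = extract_toy(metadata={'descr': 'toy'}, **toy_kwargs)
```

With the template, this corresponds to calling `process_raw_data(extract_func=extract_toy, n_rows=2)`.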

### Example: Processing lvq-pak data
Let's convert the lvq-pak data (introduced in the last section) into `data` and `target` vectors.

In [8]:
!ls -la $unpack_dir/lvq_pak-3.1  # Files are extracted to a subdirectory:

total 762
drwxr-xr-x 2 ava00088 users      0 Apr  6  1995 .
drwxr-xr-x 2 ava00088 users      0 Feb  6 17:31 ..
-rwxr-xr-x 1 ava00088 users   6358 Apr  6  1995 accuracy.c
-rwxr-xr-x 1 ava00088 users   7805 Apr  6  1995 balance.c
-rwxr-xr-x 1 ava00088 users   5577 Apr  6  1995 classify.c
-rwxr-xr-x 1 ava00088 users   7092 Apr  6  1995 cmatr.c
-rwxr-xr-x 1 ava00088 users   3797 Apr  6  1995 config.h
-rwxr-xr-x 1 ava00088 users  28354 Apr  6  1995 datafile.c
-rwxr-xr-x 1 ava00088 users   4294 Apr  6  1995 datafile.h
-rwxr-xr-x 1 ava00088 users   5044 Apr  6  1995 elimin.c
-rwxr-xr-x 1 ava00088 users   2626 Apr  6  1995 errors.h
-rwxr-xr-x 1 ava00088 users   7122 Apr  6  1995 eveninit.c
-rwxr-xr-x 1 ava00088 users 226894 Apr  6  1995 ex1.dat
-rwxr-xr-x 1 ava00088 users 225948 Apr  6  1995 ex2.dat
-rwxr-xr-x 1 ava00088 users   4226 Apr  6  1995 extract.c
-rwxr-xr-x 1 ava00088 users  10101 Apr  6  1995 fileio.c
-rwxr-xr-x 1 ava00088 users   2896 Apr  6  1995 fileio.h
-rwxr-x

In [9]:
datafile_train = unpack_dir / 'lvq_pak-3.1' / 'ex1.dat'
datafile_test = unpack_dir / 'lvq_pak-3.1' / 'ex2.dat'
datafile_train.exists() and datafile_test.exists()

True

In [10]:
!head -5 $datafile_train

20
# Example data from speech signal
21.47 -19.90 -20.68 -6.73 13.67 -11.95 13.83 12.02 7.62 -6.15 -4.38 -2.91 4.80 -7.39 -3.54 -0.87 -5.02 -1.41 -2.33 2.12 A
0.05 28.38 9.52 -11.30 3.11 -11.88 -2.90 -11.04 2.32 -13.80 1.71 -0.40 -1.36 3.91 3.21 -0.98 -0.14 -4.70 0.30 0.27 I
-4.71 -4.61 -0.64 1.78 -1.48 5.98 12.55 -0.50 4.74 4.68 3.27 -0.36 9.24 3.39 -0.40 -1.59 0.94 2.17 -0.10 -0.45 #


So `datafile_train` (`ex1.dat`) appears to consist of:
* the number of data columns, followed by
* a comment line, then
* space-delimited data

**Wait!** There's a gotcha here. Look at the last entry in each row. That's the data label. In the last row, however, we see that `#` is used as a data label (easily confused for a comment). Be careful handling this!
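
The difference is easy to see on a few lines. The sketch below uses shortened copies of the first rows of `ex1.dat` (only the first three feature values of each row are kept) and two hypothetical helper functions, `parse_stripping_comments` and `parse_by_position`, neither of which is part of this project.

```python
# Shortened copies of the first lines of ex1.dat (first three features only)
sample_text = """20
# Example data from speech signal
21.47 -19.90 -20.68 A
0.05 28.38 9.52 I
-4.71 -4.61 -0.64 #
"""


def parse_stripping_comments(text):
    """Naive approach: treat everything from `#` onward as a comment."""
    rows = []
    for line in text.splitlines()[1:]:          # skip the column-count line
        line = line.split('#', 1)[0].strip()    # also strips the `#` label!
        if line:
            rows.append(line.split())
    return rows


def parse_by_position(text, n_header=2):
    """Safer approach: skip header lines by position, never by content."""
    rows = []
    for line in text.splitlines()[n_header:]:
        if line.strip():
            *features, label = line.split()
            rows.append((features, label))
    return rows


naive_rows = parse_stripping_comments(sample_text)
safe_rows = parse_by_position(sample_text)
```

Here the naive parser returns a last row with only three tokens, so its label is silently lost, while the positional parser keeps `#` as the label.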

In [11]:
!head -5 $datafile_test 

20
13.55 -12.07 -18.81 -6.13 10.66 -16.64 10.54 3.97 6.41 -8.01 -4.28 -3.37 7.96 -3.06 1.85 -5.28 1.05 3.87 1.77 -1.79 A
0.28 22.76 8.26 -15.68 3.43 -13.39 -10.58 -14.71 3.96 -9.65 4.74 -9.55 0.90 7.25 4.69 -2.44 1.96 -4.32 2.48 1.46 I
-3.51 -6.01 -0.47 -0.82 -0.38 -2.91 -1.35 0.48 1.88 3.00 4.11 7.21 3.36 7.48 2.37 -5.26 2.58 1.99 -1.09 -4.20 #
9.76 -23.02 -12.69 -12.28 12.26 -15.74 10.01 8.30 1.95 -7.41 -0.68 -2.56 5.02 -1.56 -0.16 -1.87 -6.97 -0.08 0.51 2.00 A


`datafile_test` (`ex2.dat`) is similar, but has no comment header.
Let's parse these files.

In [12]:
import numpy as np  # used by read_space_delimited when class_labels is False
import pandas as pd

In [13]:
def read_space_delimited(filename, skiprows=None, class_labels=True, metadata=None):
    """Read a space-delimited file
    
    Data is space-delimited. Last column is the (string) label for the data

    Note: we can't use automatic comment detection, as `#` characters are also
    used as data labels.

    Parameters
    ----------
    filename: path-like
        file to read
    skiprows: list-like, int or callable, optional
        list of rows to skip when reading the file. See `pandas.read_csv`
        entry on `skiprows` for more
    class_labels: boolean
        if true, the last column is treated as the class (target) label
    metadata: dict or None
        passed through unchanged as the third element of the return value

    Returns
    -------
    tuple: (data, target, metadata)
    """
    with open(filename, 'r') as fd:
        df = pd.read_csv(fd, skiprows=skiprows, skip_blank_lines=True,
                           comment=None, header=None, sep=' ', dtype=str)
        # targets are last column. Data is everything else
        if class_labels is True:
            target = df.loc[:, df.columns[-1]].values
            data = df.loc[:, df.columns[:-1]].values
        else:
            data = df.values
            target = np.zeros(data.shape[0])
        return data, target, metadata

In [14]:
data, target, metadata = read_space_delimited(datafile_train, skiprows=[0,1])
data.shape, target.shape, metadata

((1962, 20), (1962,), None)
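
The first line of each file declares the number of data columns (20), which matches the second dimension of `data.shape` above. A minimal sketch of making that check explicit follows. `declared_width` and `check_width` are hypothetical helpers, and the inline text is a made-up stand-in for the real file.

```python
import io


def declared_width(fd):
    """Read the column count from the first line of an open file."""
    return int(fd.readline().strip())


def check_width(rows, expected):
    """Raise if any row of feature values differs from the declared width."""
    for i, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(f'row {i} has {len(row)} values, expected {expected}')
    return True


# Made-up stand-in: declares 3 data columns, then two labelled rows
fake_file = io.StringIO("3\n1.0 2.0 3.0 A\n4.0 5.0 6.0 #\n")
n_cols = declared_width(fake_file)
# Drop the trailing label from each remaining line to get the feature values
feature_rows = [line.split()[:-1] for line in fake_file if line.strip()]
width_ok = check_width(feature_rows, n_cols)
```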

In [15]:
from src.paths import interim_data_path
import numpy as np

In [16]:
def process_lvq_pak(dataset_name='lvq-pak', metadata=None, kind='all'):
    """Process the lvq-pak data files into a Dataset constructor dict

    Parameters
    ----------
    dataset_name: (string)
        Name of this raw dataset. This will be used as a key for accessing this raw dataset in the
        Raw Dataset catalog
    metadata: dict or None
        If None, an empty metadata dictionary will be used.
    kind: {'train', 'test', 'all'}
        Whether to return training set, test set, or everything. 
        
    Returns
    -------
    Dictionary containing the following keys:
        dataset_name: (string)
            `dataset_name` that was passed to the function
        metadata: (dict)
            dict containing the input `metadata` key/value pairs, and (optionally)
            additional information about this raw dataset
        data: array-style object
            Often a `numpy.ndarray` or `pandas.DataFrame`
        target: (optional) vector-style object
            for supervised learning problems, the target vector associated with `data`
    """
    if metadata is None:
        metadata = {}

    untar_dir = interim_data_path / dataset_name
    unpack_dir = untar_dir / 'lvq_pak-3.1'

    if kind == 'train':
        data, target, metadata = read_space_delimited(unpack_dir / 'ex1.dat',
                                                      skiprows=[0,1],
                                                      metadata=metadata)
    elif kind == 'test':
        data, target, metadata = read_space_delimited(unpack_dir / 'ex2.dat',
                                                      skiprows=[0],
                                                      metadata=metadata)
    elif kind == 'all':
        data1, target1, metadata = read_space_delimited(unpack_dir / 'ex1.dat', skiprows=[0,1],
                                                        metadata=metadata)
        data2, target2, metadata = read_space_delimited(unpack_dir / 'ex2.dat', skiprows=[0],
                                                        metadata=metadata)
        data = np.vstack((data1, data2))
        target = np.append(target1, target2)
    else:
        raise ValueError(f'Unknown kind: {kind}')

    dset_opts = {
        'dataset_name': dataset_name,
        'metadata': metadata,
        'data': data,
        'target': target,
    }
    return dset_opts

In [17]:
process_lvq_pak()

{'dataset_name': 'lvq-pak',
 'metadata': {},
 'data': array([['21.47', '-19.90', '-20.68', ..., '-1.41', '-2.33', '2.12'],
        ['0.05', '28.38', '9.52', ..., '-4.70', '0.30', '0.27'],
        ['-4.71', '-4.61', '-0.64', ..., '2.17', '-0.10', '-0.45'],
        ...,
        ['-2.63', '-6.59', '0.19', ..., '0.76', '0.89', '-3.48'],
        ['5.35', '4.96', '18.75', ..., '-0.57', '0.00', '1.35'],
        ['-0.37', '-5.27', '-1.74', ..., '3.48', '-0.90', '-1.00']],
       dtype=object),
 'target': array(['A', 'I', '#', ..., '#', 'Y', '#'], dtype=object)}

In [18]:
dsrc.load_function = process_lvq_pak

### Write this into the catalog

In [19]:
workflow.add_datasource(dsrc)

In [20]:
workflow.available_datasources()

['lvq-pak']

### Create a Dataset

In [21]:
ds = dsrc.process() # Use the load_function to convert this DataSource to a real Dataset
str(ds)

"<Dataset: lvq-pak, data.shape=(3924, 20), target.shape=(3924,), metadata=['descr', 'license', 'dataset_name', 'hash_type', 'data_hash', 'target_hash']>"

In [22]:
print(ds)

<Dataset: lvq-pak, data.shape=(3924, 20), target.shape=(3924,), metadata=['descr', 'license', 'dataset_name', 'hash_type', 'data_hash', 'target_hash']>


In [23]:
ds = dsrc.process(kind="test")  # Should be half the size
print(ds)

<Dataset: lvq-pak, data.shape=(1962, 20), target.shape=(1962,), metadata=['descr', 'license', 'dataset_name', 'hash_type', 'data_hash', 'target_hash']>
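
The `kind="test"` argument evidently reaches `process_lvq_pak`, since the result is half the size. The sketch below illustrates one way such keyword forwarding can work, using a hypothetical `MiniDataSource` class. It is an assumption about the general pattern, not the actual implementation in `src.data`, and it returns the options dict rather than a `Dataset`.

```python
class MiniDataSource:
    """Hypothetical stand-in for `DataSource`, reduced to kwargs forwarding."""
    def __init__(self, name, load_function=None):
        self.name = name
        self.load_function = load_function

    def process(self, **kwargs):
        if self.load_function is None:
            raise ValueError('no load_function has been set')
        # Extra keyword arguments (e.g. kind='test') go straight to the load function
        return self.load_function(dataset_name=self.name, **kwargs)


def process_split(dataset_name='toy', metadata=None, kind='all'):
    """Toy processing function with the same `kind` switch as `process_lvq_pak`."""
    train, test = [[1.0], [2.0]], [[3.0], [4.0]]   # made-up rows
    if kind == 'train':
        data = train
    elif kind == 'test':
        data = test
    elif kind == 'all':
        data = train + test
    else:
        raise ValueError(f'Unknown kind: {kind}')
    return {'dataset_name': dataset_name, 'metadata': metadata or {}, 'data': data}


toy_src = MiniDataSource('toy', load_function=process_split)
all_opts = toy_src.process()               # default kind='all'
test_opts = toy_src.process(kind='test')   # forwarded to process_split
```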


In [None]:
type(ds)

## EXERCISE: Turn the F-MNIST `DataSource` into a `Dataset`
In the last exercise, you fetched and unpacked F-MNIST data.
Now it's time to process it into a `Dataset` object.

## The `Dataset` and Data Transformations

### Tour of the Dataset Object

### Creating a Simple Transformer

### More Complicated Transformers

## Reproducible Data: The Punchline