# 2.0: Reproducible Data Sources
"In God we trust. All others must bring data." – W. Edwards Deming

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
from src.logging import logger
logger.setLevel(logging.INFO)

# Introducing the `DataSource`
The `DataSource` object handles downloading, unpacking, and processing raw data files, and serves as a container for some basic metadata about the raw data, including **documentation** and **license** information.

Raw data files are downloaded to `paths.raw_data_path`.
Cache files and unpacked raw files are saved to `paths.interim_data_path`.

## Example: Bjørn's Supervised Learning Problem

Bjørn employs a large number of Finnish line cooks. He can’t understand a word they say.

Bjørn needs a trained model to do real-time translation from Finnish to Swedish.

Bjørn has decided to start with the Finnish phoneme dataset shipped with a project called lvq-pak. His objective is to train three different models, and choose the one with the best overall accuracy score.

### LVQ-Pak,  a Finnish phonetic dataset

The Learning Vector Quantization (lvq-pak) project includes a simple Finnish phonetic dataset consisting of 20-dimensional Mel Frequency Cepstrum Coefficients (MFCCs) labelled with target phoneme information. Our goal is to explore this dataset, process it into a useful form, and make it a part of a reproducible data science workflow. The project can be found at: http://www.cis.hut.fi/research/lvq_pak/




For this example, we are going to create a `DataSource` for the LVQ-Pak dataset. The process will consist of:
1. Downloading and unpacking the raw data files.
2. Generating (and recording) hash values for these files.
3. Adding LICENSE and DESCR (description) metadata to this `DataSource`.
4. Adding the complete `DataSource` to the Catalog.


### Downloading Raw Data Source Files

In [None]:
from src.data import DataSource
from src.utils import list_dir
from src import paths

In [None]:
# Create a data source object
datasource_name = 'lvq-pak'
dsrc = DataSource(datasource_name)

In [None]:
# Add URL(s) for raw data files
dsrc.add_url("http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar")

In [None]:
# Fetch the files
logger.setLevel(logging.DEBUG)
dsrc.fetch()

By default, data files are downloaded to the `paths.raw_data_path` directory:

In [None]:
!ls -la $paths.raw_data_path

Since we did not specify a hash, or target filename, these are inferred from the downloaded file:

In [None]:
dsrc.file_list
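
To make the idea of an inferred hash concrete, here is a minimal sketch of how a file's hash can be computed with the standard library. The function name `hash_file` and the default `sha1` algorithm are assumptions made for illustration; the notebook does not say which algorithm `DataSource` uses internally.

```python
import hashlib


def hash_file(path, algorithm="sha1", chunk_size=65536):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    Hypothetical helper: `algorithm` is an assumption, not necessarily
    what DataSource records.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes,
        # so large tarballs never have to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording this value once, right after a trusted download, is what lets later fetches detect a changed or damaged file.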

#### Cached Downloads

The DataSource object keeps track of whether the fetch has been performed successfully. Subsequent downloads will be skipped by default:

In [None]:
dsrc.fetch()

We can override this with `force=True`, which checks whether the downloaded file exists and has the correct hash, re-downloading it if necessary:

In [None]:
dsrc.fetch(force=True)

In the previous case, the raw data file existed on the filesystem, and had the correct hash. If the local file has a checksum that doesn't match the saved hash, it will be re-downloaded automatically. Let's corrupt the file and see what happens.

In [None]:
!echo "XXX" >> $paths.raw_data_path/lvq_pak-3.1.tar

In [None]:
dsrc.fetch(force=True)
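
The check described above (file present, hash matches) can be sketched in a few lines. This is not the `DataSource` implementation; `needs_download` is a hypothetical helper, and `sha1` is an assumed algorithm. The demonstration uses a throwaway temporary file rather than the real tarball.

```python
import hashlib
import tempfile
from pathlib import Path


def needs_download(path, expected_hash, algorithm="sha1"):
    """Return True when the local file is missing or fails its checksum."""
    path = Path(path)
    if not path.exists():
        return True
    actual = hashlib.new(algorithm, path.read_bytes()).hexdigest()
    return actual != expected_hash


# Demonstration on a temporary file standing in for the raw download.
with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "example.tar"
    raw_file.write_bytes(b"original contents")
    saved_hash = hashlib.sha1(b"original contents").hexdigest()

    intact = needs_download(raw_file, saved_hash)     # False: hash matches

    # Append junk to the file, as the `echo "XXX" >>` cell does.
    with open(raw_file, "ab") as handle:
        handle.write(b"XXX\n")
    corrupted = needs_download(raw_file, saved_hash)  # True: hash differs

print(intact, corrupted)
```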

### Remove a file from the file_list

In [None]:
# Note that adding the same URL again creates a duplicate entry in the file list
dsrc.add_url("http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar")

In [None]:
dsrc.file_list

In [None]:
dsrc.fetch()

Fetch is smart enough not to re-download the same file in this case. Still, this is messy and cumbersome. We can get rid of unwanted entries by removing them from the `file_list`.

In [None]:
dsrc.file_list.pop(1)

In [None]:
dsrc.file_list
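
Popping by index works when you know where the duplicate sits. As an alternative, the sketch below drops repeated URLs from a list of entries. It assumes each entry is a dict with a `url` key, which is a guess at the structure of `file_list`; inspect your own `dsrc.file_list` output before relying on it.

```python
def drop_duplicate_urls(file_list):
    """Return a new list keeping only the first entry for each URL.

    Assumes (hypothetically) that entries are dicts with a 'url' key.
    Entries without a URL are always kept.
    """
    seen = set()
    unique = []
    for entry in file_list:
        url = entry.get("url")
        if url is not None and url in seen:
            continue  # a later duplicate of a URL we already kept
        seen.add(url)
        unique.append(entry)
    return unique


# Illustrative entries only; not the real contents of dsrc.file_list.
example = [
    {"url": "http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar"},
    {"url": "http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar"},
    {"name": "LICENSE", "contents": "some text"},
]
cleaned = drop_duplicate_urls(example)
```

Note that keying on the URL alone would also drop an entry that reuses a URL under a different `file_name`, so choose the key to match what you consider a duplicate.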

In [None]:
dsrc.fetch(force=True)

### Sometimes we make mistakes when entering information

In [None]:
dsrc.add_url("http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar", name='cat', file_name='dog')

In [None]:
dsrc.file_list

In [None]:
dsrc.fetch()

In [None]:
!ls -la $paths.raw_data_path

We now have a copy of `lvq_pak-3.1.tar` called `dog`. Every time we fetch, we will fetch twice unless we get rid of the entry for `dog`.

First, we will want to remove `dog` from our raw data.

Let's take the "Nuke it from orbit. It's the only way to be sure" approach and clean our entire raw data directory. 

In [None]:
!cd .. && make clean_raw

In [None]:
!ls -la $paths.raw_data_path

The other option would have been to manually remove the `dog` file and then force a refetch.

### Exercise: Remove the entry for dog and refetch

In [None]:
# You should now only see the lvq_pak-3.1.tar file
!ls -la $paths.raw_data_path

## Exercise: Mark's Unsupervised Learning Problem

Mark regularly gets handed files full of fashion images, labelled by category. He wants to know how he can use this to help keep up with the latest trends for the magazine.

For now, he's interested in producing a visualization of the various categories so that he can learn more about them. He's hoping these explorations will eventually help him speed up the process of sorting through what he gets sent to review every week.

But first, he has to put this data into a usable format.

### Creating an F-MNIST `DataSource`

For this exercise, you are going to build a `DataSource` out of the Fashion-MNIST dataset.

[Fashion-MNIST][FMNIST] is available from GitHub. Looking at their [README], we see that the raw data is distributed as a set of 4 files with the following checksums:

[FMNIST]: https://github.com/zalandoresearch/fashion-mnist
[README]: https://github.com/zalandoresearch/fashion-mnist/blob/master/README.md

| Name  | Content | Examples | Size | Link | MD5 Checksum|
| --- | --- |--- | --- |--- |--- |
| `train-images-idx3-ubyte.gz`  | training set images  | 60,000|26 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz)|`8d4fb7e6c68d591d4c3dfef9ec88bf0d`|
| `train-labels-idx1-ubyte.gz`  | training set labels  |60,000|29 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz)|`25c81989df183df01b3e8a0aad5dffbe`|
| `t10k-images-idx3-ubyte.gz`  | test set images  | 10,000|4.3 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz)|`bef4ecab320f06d8554ea6380940ec79`|
| `t10k-labels-idx1-ubyte.gz`  | test set labels  | 10,000| 5.1 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz)|`bb300cfdad3c16e7a12a480ee83cd310`|

By the end of this running example, you will build a `DataSource` that downloads these raw files and verifies that the hash values are as expected. You should make sure to include **Description** and **License** metadata in this `DataSource`. When you are finished, save the `DataSource` to the Catalog.
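
To avoid retyping checksums by hand, the table above can be transcribed into a small Python structure. This sketch only captures the file names, URLs, and MD5 values from the README table; the variable names are made up for illustration, and how you pass these values to the `DataSource` is left to the exercise.

```python
# Base URL and MD5 values copied from the Fashion-MNIST README table.
FMNIST_BASE_URL = "http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/"

FMNIST_MD5 = {
    "train-images-idx3-ubyte.gz": "8d4fb7e6c68d591d4c3dfef9ec88bf0d",
    "train-labels-idx1-ubyte.gz": "25c81989df183df01b3e8a0aad5dffbe",
    "t10k-images-idx3-ubyte.gz": "bef4ecab320f06d8554ea6380940ec79",
    "t10k-labels-idx1-ubyte.gz": "bb300cfdad3c16e7a12a480ee83cd310",
}

# One (url, md5) pair per raw file, ready to loop over.
fmnist_files = [
    (FMNIST_BASE_URL + file_name, md5)
    for file_name, md5 in FMNIST_MD5.items()
]

for url, md5 in fmnist_files:
    print(url, md5)
```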

### Exercise: Download Raw Data Source Files for F-MNIST

In [None]:
# Create an fmnist data source object


In [None]:
# Add URL(s) for raw data files
# Note that you will be adding four files to the DataSource object
# and that the hash values have already been provided above!


In [None]:
# Fetch the files


In [None]:
# Check for your new files
!ls -la $paths.raw_data_path

### Unpacking Raw Data Files

In [None]:
unpack_dir = dsrc.unpack()

By default, files are decompressed/unpacked to the `paths.interim_data_path`/`datasource_name` directory:

In [None]:
!ls -la $paths.interim_data_path

In [None]:
# We unpack everything into interim_data_path/datasource_name, which is returned by `unpack()`

In [None]:
!ls -la $unpack_dir

In [None]:
!ls -la $unpack_dir/lvq_pak-3.1
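
For readers curious what an unpack step involves, here is a rough sketch using only the standard library. It is not the `DataSource.unpack()` implementation; `unpack_file` is a hypothetical helper that handles tar archives (like `lvq_pak-3.1.tar`) and single gzipped files (like the F-MNIST `.gz` files), and copies anything else unchanged.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_file(archive_path, dest_dir):
    """Unpack a tar archive or a single .gz file into dest_dir."""
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as archive:
            # Where the "data" extraction filter exists, use it to
            # reject unsafe member paths; older Pythons lack it.
            if hasattr(tarfile, "data_filter"):
                archive.extractall(dest_dir, filter="data")
            else:
                archive.extractall(dest_dir)
    elif archive_path.suffix == ".gz":
        # A bare .gz holds one file; strip the suffix for its name.
        target = dest_dir / archive_path.stem
        with gzip.open(archive_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        # Not an archive: copy it over as-is.
        shutil.copy2(archive_path, dest_dir / archive_path.name)
    return dest_dir
```

Returning the destination directory mirrors the way `unpack()` hands back `unpack_dir` in the cells above.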

### Exercise: Unpack raw data files for F-MNIST

In [None]:
# Check for your files in the unpacked dirs
!ls -la $fmnist_unpack_dir

### Adding Metadata to Raw Data
Wait, what have we actually downloaded, and are we actually allowed to **use** this data? We keep track of two key pieces of metadata along with a raw dataset:
* Description (`DESCR`) Text: Human-readable text describing the dataset, its source, and what it represents
* License (`LICENSE`) Text: Terms of use for this dataset, often in the form of a license agreement

Often, a dataset comes complete with its own README and LICENSE files. If these are available via URL, we can add these like we add any other data file, tagging them as metadata using the `name` field:

In [None]:
dsrc.add_url("http://www.cis.hut.fi/research/lvq_pak/README",
               file_name='lvq-pak.readme', name='DESCR')

In [None]:
dsrc.fetch()
dsrc.unpack()

In [None]:
# We now fetch 2 files. Note the metadata has been tagged accordingly in the `name` field
dsrc.file_list

We need to dig a little deeper to find the license. We find it at the beginning of the README file contained within that distribution:

In [None]:
!head -35 $paths.interim_data_path/lvq-pak/lvq_pak-3.1/README

Rather than trying to be clever, let's just add the license metadata from a Python string that we cut and paste from the above.

In [None]:
license_txt = '''
************************************************************************
*                                                                      *
*                              LVQ_PAK                                 *
*                                                                      *
*                                The                                   *
*                                                                      *
*                   Learning  Vector  Quantization                     *
*                                                                      *
*                          Program  Package                            *
*                                                                      *
*                   Version 3.1 (April 7, 1995)                        *
*                                                                      *
*                          Prepared by the                             *
*                    LVQ Programming Team of the                       *
*                 Helsinki University of Technology                    *
*           Laboratory of Computer and Information Science             *
*                Rakentajanaukio 2 C, SF-02150 Espoo                   *
*                              FINLAND                                 *
*                                                                      *
*                      Copyright (c) 1991-1995                         *
*                                                                      *
************************************************************************
*                                                                      *
*  NOTE: This program package is copyrighted in the sense that it      *
*  may be used for scientific purposes. The package as a whole, or     *
*  parts thereof, cannot be included or used in any commercial         *
*  application without written permission granted by its producents.   *
*  No programs contained in this package may be copied for commercial  *
*  distribution.                                                       *
*                                                                      *
*  All comments concerning this program package may be sent to the     *
*  e-mail address 'lvq@nucleus.hut.fi'.                                *
*                                                                      *
************************************************************************
'''
dsrc.add_metadata(contents=license_txt, kind='LICENSE')

Under the hood, this will create a file, storing the creation instructions in the same `file_list` we use to store the URLs we wish to download:

In [None]:
dsrc.file_list
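
The idea of storing creation instructions can be illustrated with a small sketch: an entry records the text and a target file name, and a later step writes the file. Everything here is hypothetical, including the dict keys and the `<datasource>.<kind>` naming scheme, which is only modelled on the `lvq-pak.readme` name used earlier.

```python
from pathlib import Path


def metadata_entry(contents, kind, datasource_name):
    """Describe a metadata file to be created from an in-memory string."""
    return {
        "name": kind,  # e.g. 'LICENSE' or 'DESCR'
        "file_name": f"{datasource_name}.{kind.lower()}",  # assumed scheme
        "contents": contents,
    }


def write_metadata(entry, raw_data_path):
    """Create the file described by an entry and return its path."""
    target = Path(raw_data_path) / entry["file_name"]
    target.write_text(entry["contents"], encoding="utf-8")
    return target
```

Because the entry holds the full text, the file can be regenerated at any time, even after the raw data directory has been cleaned.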

Now when we fetch, the license file is created from this information:

In [None]:
logger.setLevel(logging.DEBUG)
dsrc.fetch(force=True)
dsrc.unpack()

In [None]:
!ls -la $paths.raw_data_path

### Exercise: Add metadata to F-MNIST

### Adding Raw Data to the Catalog

In [None]:
from src import workflow

In [None]:
workflow.available_datasources()

In [None]:
workflow.add_datasource(dsrc)

In [None]:
workflow.available_datasources()

We will make use of this raw dataset catalog later in this tutorial. We can now load our `DataSource` by name:

In [None]:
ds = DataSource.from_name('lvq-pak')

In [None]:
ds.file_list
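
Conceptually, a catalog is just a persistent mapping from a name to the instructions needed to rebuild a data source. The sketch below shows that idea with a JSON file. The on-disk format used by `workflow` is not shown in this notebook, so the JSON layout and both function names are assumptions for illustration only.

```python
import json
from pathlib import Path


def add_to_catalog(catalog_path, name, file_list):
    """Store a file list under `name` and return all catalog names."""
    catalog_path = Path(catalog_path)
    if catalog_path.exists():
        catalog = json.loads(catalog_path.read_text(encoding="utf-8"))
    else:
        catalog = {}
    catalog[name] = {"file_list": file_list}
    catalog_path.write_text(
        json.dumps(catalog, indent=2, sort_keys=True), encoding="utf-8"
    )
    return sorted(catalog)


def load_from_catalog(catalog_path, name):
    """Return the file list stored under `name`."""
    catalog = json.loads(Path(catalog_path).read_text(encoding="utf-8"))
    return catalog[name]["file_list"]
```

Since the catalog stores instructions rather than data, it is small enough to keep under version control alongside the code.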

### Exercise: Add F-MNIST to the Raw Dataset Catalog

In [None]:
# Your fmnist dataset should now show up here:
workflow.available_datasources()

### Nuke it from Orbit

Now we can blow away all the data that we've downloaded and set up so far, and recreate it from the workflow datasource. Or, use some of our `make` commands!

In [None]:
!cd .. && make clean_raw

In [None]:
!ls -la $paths.raw_data_path

In [None]:
!cd .. && make fetch_sources

In [None]:
!ls -la $paths.raw_data_path

In [None]:
# What about fetch and unpack?
!cd .. && make clean_raw && make clean_interim

In [None]:
!ls -la $paths.raw_data_path

In [None]:
!cd .. && make unpack_sources

In [None]:
!ls -la $paths.raw_data_path

In [None]:
!ls -la $paths.interim_data_path

### Your data sources are now reproducible!