# 2.0: Reproducible Data Sources
"In God we trust. All others must bring data.” – W. Edwards Deming"

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
from src.logging import logger
logger.setLevel(logging.INFO)

# Introducing the `DataSource`
The `DataSource` object handles downloading, unpacking, and processing raw data files, and serves as a container for some basic metadata about the raw data, including **documentation** and **license** information.

Raw data files are downloaded to  `paths.raw_data_path`.
 Cache files and unpacked raw files are saved to `paths.interim_data_path`.
    

## Example: LVQ-Pak,  a Finnish phonetic dataset
The Learning Vector Quantization (lvq-pak) project includes a simple Finnish phonetic dataset
consisting 20-dimensional Mel Frequency Cepstrum Coefficients (MFCCs) labelled with target phoneme information. Our goal is to explore this dataset, process it into a useful form, and make it a part of a reproducible data science workflow. The project can be found at: http://www.cis.hut.fi/research/lvq_pak/




For this example, we are going create a `DataSource` for the LVQ-Pak dataset. The process will consist of
1. Downloading and unpacking the raw data files. 
2. Generating (and recording) hash values for these files.
3. Adding LICENSE and DESCR (description) metadata to this DataSource
4. Adding the complete `DataSource` to the Catalog 


### Downloading Raw Data Source Files

In [3]:
from src.data import DataSource
from src.utils import list_dir
from src import paths

In [4]:
# Create a data source object
datasource_name = 'lvq-pak'
dsrc = DataSource(datasource_name)

In [5]:
# Add URL(s) for raw data files
dsrc.add_url("http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar")

In [6]:
# Fetch the files
logger.setLevel(logging.DEBUG)
dsrc.fetch()

2019-02-08 14:49:53,214 - fetch - DEBUG - No file_name specified. Inferring lvq_pak-3.1.tar from URL
2019-02-08 14:49:53,218 - fetch - DEBUG - lvq_pak-3.1.tar exists, but no hash to check. Setting to sha1:86024a871724e521341da0ffb783956e39aadb6e


True

By default, data files are downloaded to the `paths.raw_data_path` directory:

In [7]:
!ls -la $paths.raw_data_path

total 1484
drwxrwx--- 2 ava00125 domain users   4096 Feb  8 14:49 .
drwxrwx--- 5 ava00125 domain users   4096 Oct 12 15:45 ..
-rw-rw---- 1 ava00125 domain users 747520 Feb  8 14:49 dog
-rw-rw---- 1 ava00125 domain users 747520 Feb  8 14:49 lvq_pak-3.1.tar
-rw-rw---- 1 ava00125 domain users   2483 Feb  8 14:49 lvq-pak.license
-rw-rw---- 1 ava00125 domain users   4958 Feb  8 14:49 lvq-pak.readme


Since we did not specify a hash, or target filename, these are inferred from the downloaded file:

In [8]:
dsrc.file_list

[{'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
  'name': None,
  'file_name': 'lvq_pak-3.1.tar'}]

### Remove a file from the file_list

In [9]:
# Note that if we add a url again, we end up with more of the same file in the file list
dsrc.add_url("http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar")

In [10]:
dsrc.file_list

[{'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
  'name': None,
  'file_name': 'lvq_pak-3.1.tar'},
 {'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': None,
  'name': None,
  'file_name': None}]

In [11]:
dsrc.fetch()

2019-02-08 14:49:53,480 - fetch - DEBUG - lvq_pak-3.1.tar already exists and hash is valid
2019-02-08 14:49:53,481 - fetch - DEBUG - No file_name specified. Inferring lvq_pak-3.1.tar from URL
2019-02-08 14:49:53,486 - fetch - DEBUG - lvq_pak-3.1.tar exists, but no hash to check. Setting to sha1:86024a871724e521341da0ffb783956e39aadb6e


True

Fetch is smart enough to not redownload the same file in this case. Still, this is messy and cumbersome. We can remove entries by removing them from the `file_list`.

In [12]:
dsrc.file_list.pop(1)

{'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
 'hash_type': 'sha1',
 'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
 'name': None,
 'file_name': 'lvq_pak-3.1.tar'}

In [13]:
dsrc.file_list

[{'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
  'name': None,
  'file_name': 'lvq_pak-3.1.tar'}]

In [14]:
dsrc.fetch(force=True)

2019-02-08 14:49:53,589 - fetch - DEBUG - lvq_pak-3.1.tar already exists and hash is valid


True

### Sometimes we make mistakes when entering information

In [15]:
dsrc.add_url("http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar", name='cat', file_name='dog')

In [16]:
dsrc.file_list

[{'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
  'name': None,
  'file_name': 'lvq_pak-3.1.tar'},
 {'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': None,
  'name': 'cat',
  'file_name': 'dog'}]

In [17]:
dsrc.fetch()

2019-02-08 14:49:53,672 - fetch - DEBUG - lvq_pak-3.1.tar already exists and hash is valid
2019-02-08 14:49:53,676 - fetch - DEBUG - dog exists, but no hash to check. Setting to sha1:86024a871724e521341da0ffb783956e39aadb6e


True

In [18]:
!ls -la $paths.raw_data_path

total 1484
drwxrwx--- 2 ava00125 domain users   4096 Feb  8 14:49 .
drwxrwx--- 5 ava00125 domain users   4096 Oct 12 15:45 ..
-rw-rw---- 1 ava00125 domain users 747520 Feb  8 14:49 dog
-rw-rw---- 1 ava00125 domain users 747520 Feb  8 14:49 lvq_pak-3.1.tar
-rw-rw---- 1 ava00125 domain users   2483 Feb  8 14:49 lvq-pak.license
-rw-rw---- 1 ava00125 domain users   4958 Feb  8 14:49 lvq-pak.readme


We now  have a copy of `lvq_pak-3.1.tar` called `dog`. Every time we fetch, we will fetch twice unless we get rid of the entry for `dog`.

First, we will want to remove `dog` from our raw data.

Let's take the "Nuke it from orbit. It's the only way to be sure" approach and clean our entire raw data directory. 

In [19]:
!cd .. && make clean_raw

rm -f data/raw/*


In [20]:
!ls -la $paths.raw_data_path

total 8
drwxrwx--- 2 ava00125 domain users 4096 Feb  8 14:49 .
drwxrwx--- 5 ava00125 domain users 4096 Oct 12 15:45 ..


The other option would have been to manually remove the `dog` file and then forced a refetch.

### Exercise: Remove the entry for dog and refetch

In [21]:
dsrc.file_list

[{'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
  'name': None,
  'file_name': 'lvq_pak-3.1.tar'},
 {'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
  'name': 'cat',
  'file_name': 'dog'}]

In [22]:
dsrc.file_list.pop(1)

{'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
 'hash_type': 'sha1',
 'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
 'name': 'cat',
 'file_name': 'dog'}

In [23]:
dsrc.file_list

[{'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
  'name': None,
  'file_name': 'lvq_pak-3.1.tar'}]

In [24]:
# The fetch here will need to be forced
dsrc.fetch(force=True)

2019-02-08 14:49:55,269 - fetch - DEBUG - Retrieved lvq_pak-3.1.tar (hash sha1:86024a871724e521341da0ffb783956e39aadb6e)


True

In [25]:
# You should now only see the lvq_pak-3.1.tar file
!ls -la $paths.raw_data_path

total 740
drwxrwx--- 2 ava00125 domain users   4096 Feb  8 14:49 .
drwxrwx--- 5 ava00125 domain users   4096 Oct 12 15:45 ..
-rw-rw---- 1 ava00125 domain users 747520 Feb  8 14:49 lvq_pak-3.1.tar


#### Cached Downloads

The DataSource object keeps track of whether the fetch has been performed successfully. Subsequent downloads will be skipped by default:

In [26]:
dsrc.fetch()

2019-02-08 14:49:55,445 - datasets - DEBUG - Data Source lvq-pak is already fetched. Skipping


We can override this, which will check if the downloaded file exists, redownloading if necessary

In [27]:
dsrc.fetch(force=True)

2019-02-08 14:49:55,468 - fetch - DEBUG - lvq_pak-3.1.tar already exists and hash is valid


True

In the previous case, the raw data file existed on the filesystem, and had the correct hash. If the local file has a checksum that doesn't match the saved hash, it will be re-downloaded automatically. Let's corrupt the file and see what happens.

In [28]:
!echo "XXX" >> $paths.raw_data_path/lvq_pak-3.1.tar

In [29]:
dsrc.fetch(force=True)

2019-02-08 14:49:55,652 - fetch - DEBUG - Retrieved lvq_pak-3.1.tar (hash sha1:86024a871724e521341da0ffb783956e39aadb6e)


True

## Exercise: Creating an F-MNIST `DataSource`

For this excercise, you are going build a `DataSource` out of the Fashion-MNIST dataset.

[Fashion-MNIST][FMNIST] is available from GitHub. Looking at their [README], we see that the raw data is distributed as a set of 4 files with the following checksums:

[FMNIST]: https://github.com/zalandoresearch/fashion-mnist
[README]: https://github.com/zalandoresearch/fashion-mnist/blob/master/README.md

| Name  | Content | Examples | Size | Link | MD5 Checksum|
| --- | --- |--- | --- |--- |--- |
| `train-images-idx3-ubyte.gz`  | training set images  | 60,000|26 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz)|`8d4fb7e6c68d591d4c3dfef9ec88bf0d`|
| `train-labels-idx1-ubyte.gz`  | training set labels  |60,000|29 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz)|`25c81989df183df01b3e8a0aad5dffbe`|
| `t10k-images-idx3-ubyte.gz`  | test set images  | 10,000|4.3 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz)|`bef4ecab320f06d8554ea6380940ec79`|
| `t10k-labels-idx1-ubyte.gz`  | test set labels  | 10,000| 5.1 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz)|`bb300cfdad3c16e7a12a480ee83cd310`|

By the end of this running example, you will build a `DataSource` that downloads these raw files and verifies that the hash values are as expected. You should make sure to include **Description** and **License** metadata in this `DataSource`. When you are finished, save the `DataSource` to the Catalog.

### Exercise: Download Raw Data Source Files for F-MNIST

In [30]:
# Create an fmnist data source object
fmnist_dsname = 'fmnist'
fmnist = DataSource(fmnist_dsname)

In [31]:
# Add URL(s) for raw data files
# Note that you will be adding four files to the DataSource object
# and that the hash values have already been provided above!
fmnist.add_url(url='http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz',
               hash_type='md5',
               hash_value='8d4fb7e6c68d591d4c3dfef9ec88bf0d',
               name='train-images')


In [32]:
## Now all the rest at once
url_base = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com'
file_list = [
    ('train-labels-idx1-ubyte.gz','25c81989df183df01b3e8a0aad5dffbe', 'train-labels'),
    ('t10k-images-idx3-ubyte.gz', 'bef4ecab320f06d8554ea6380940ec79', 'test-images'),
    ('t10k-labels-idx1-ubyte.gz', 'bb300cfdad3c16e7a12a480ee83cd310', 'test-labels'),
]
for file, hashval, name in file_list:
    url = f"{url_base}/{file}"
    fmnist.add_url(url=url, hash_type='md5', hash_value=hashval, name=name)

In [33]:
fmnist.file_list

[{'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': '8d4fb7e6c68d591d4c3dfef9ec88bf0d',
  'name': 'train-images',
  'file_name': None},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': '25c81989df183df01b3e8a0aad5dffbe',
  'name': 'train-labels',
  'file_name': None},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': 'bef4ecab320f06d8554ea6380940ec79',
  'name': 'test-images',
  'file_name': None},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': 'bb300cfdad3c16e7a12a480ee83cd310',
  'name': 'test-labels',
  'file_name': None}]

In [34]:
# Fetch the files
fmnist.fetch()

2019-02-08 14:49:55,765 - fetch - DEBUG - No file_name specified. Inferring train-images-idx3-ubyte.gz from URL
2019-02-08 14:49:58,714 - fetch - DEBUG - Retrieved train-images-idx3-ubyte.gz (hash md5:8d4fb7e6c68d591d4c3dfef9ec88bf0d)
2019-02-08 14:49:58,717 - fetch - DEBUG - No file_name specified. Inferring train-labels-idx1-ubyte.gz from URL
2019-02-08 14:49:58,844 - fetch - DEBUG - Retrieved train-labels-idx1-ubyte.gz (hash md5:25c81989df183df01b3e8a0aad5dffbe)
2019-02-08 14:49:58,845 - fetch - DEBUG - No file_name specified. Inferring t10k-images-idx3-ubyte.gz from URL
2019-02-08 14:49:59,457 - fetch - DEBUG - Retrieved t10k-images-idx3-ubyte.gz (hash md5:bef4ecab320f06d8554ea6380940ec79)
2019-02-08 14:49:59,458 - fetch - DEBUG - No file_name specified. Inferring t10k-labels-idx1-ubyte.gz from URL
2019-02-08 14:49:59,469 - fetch - DEBUG - Retrieved t10k-labels-idx1-ubyte.gz (hash md5:bb300cfdad3c16e7a12a480ee83cd310)


True

In [35]:
# Check for your new files
!ls -la $paths.raw_data_path

total 30904
drwxrwx--- 2 ava00125 domain users     4096 Feb  8 14:49 .
drwxrwx--- 5 ava00125 domain users     4096 Oct 12 15:45 ..
-rw-rw---- 1 ava00125 domain users   747520 Feb  8 14:49 lvq_pak-3.1.tar
-rw-rw---- 1 ava00125 domain users  4422102 Feb  8 14:49 t10k-images-idx3-ubyte.gz
-rw-rw---- 1 ava00125 domain users     5148 Feb  8 14:49 t10k-labels-idx1-ubyte.gz
-rw-rw---- 1 ava00125 domain users 26421880 Feb  8 14:49 train-images-idx3-ubyte.gz
-rw-rw---- 1 ava00125 domain users    29515 Feb  8 14:49 train-labels-idx1-ubyte.gz


### Unpacking Raw Data Files

In [36]:
unpack_dir = dsrc.unpack()

2019-02-08 14:49:59,664 - fetch - DEBUG - Extracting lvq_pak-3.1.tar


By default, files are decompressed/unpacked to the `paths.interim_data_path`/`datasource_name` directory:

In [37]:
!ls -la $paths.interim_data_path

total 102408
drwxrwx--- 6 ava00125 domain users     4096 Feb  8 12:29 .
drwxrwx--- 5 ava00125 domain users     4096 Oct 12 15:45 ..
-rw-rw---- 1 ava00125 domain users 47102837 Oct 10 20:58 048a21f52d05f88e50d70c47740ae1cf057549d2.dataset
-rw-rw---- 1 ava00125 domain users     2359 Oct 10 20:58 048a21f52d05f88e50d70c47740ae1cf057549d2.metadata
-rw-rw---- 1 ava00125 domain users   458449 Oct 11 11:21 0f0f977903be6bd247b34c1ee1c9f4ef25befe28.dataset
-rw-rw---- 1 ava00125 domain users     7743 Oct 11 11:21 0f0f977903be6bd247b34c1ee1c9f4ef25befe28.metadata
-rw-rw---- 1 ava00125 domain users  7852836 Oct 10 20:58 1bdd754d481a6fe186e958508000a620555c61b7.dataset
-rw-rw---- 1 ava00125 domain users     2358 Oct 10 20:58 1bdd754d481a6fe186e958508000a620555c61b7.metadata
-rw-rw---- 1 ava00125 domain users   905362 Oct 12 15:45 2c0bb10a816a7d45cce45984f1d5f9007c0a1d16.dataset
-rw-rw---- 1 ava00125 domain users     7743 Oct 12 15:45 2c0bb10a816a7d45cce45984f1d5f9007c0a1d16.metadata
-rw-r

In [38]:
# We unpack everything into interim_data_path/datasource_name, which is returned by `unpack()`

In [39]:
!ls -la $unpack_dir

total 756
drwxrwx--- 3 ava00125 domain users   4096 Feb  8 14:49 .
drwxrwx--- 6 ava00125 domain users   4096 Feb  8 12:29 ..
-rw-rw---- 1 ava00125 domain users 747520 Feb  8 14:49 dog
drwxr-xr-x 2 ava00125 domain users   4096 Apr  6  1995 lvq_pak-3.1
-rw-rw---- 1 ava00125 domain users   2483 Oct 12 15:45 lvq-pak.license
-rw-rw---- 1 ava00125 domain users   4958 Oct 12 15:45 lvq-pak.readme


In [40]:
!ls -la $unpack_dir/lvq_pak-3.1

total 780
drwxr-xr-x 2 ava00125 domain users   4096 Apr  6  1995 .
drwxrwx--- 3 ava00125 domain users   4096 Feb  8 14:49 ..
-rw-r--r-- 1 ava00125 domain users   6358 Apr  6  1995 accuracy.c
-rw-r--r-- 1 ava00125 domain users   7805 Apr  6  1995 balance.c
-rw-r--r-- 1 ava00125 domain users   5577 Apr  6  1995 classify.c
-rw-r--r-- 1 ava00125 domain users   7092 Apr  6  1995 cmatr.c
-rw-r--r-- 1 ava00125 domain users   3797 Apr  6  1995 config.h
-rw-r--r-- 1 ava00125 domain users  28354 Apr  6  1995 datafile.c
-rw-r--r-- 1 ava00125 domain users   4294 Apr  6  1995 datafile.h
-rw-r--r-- 1 ava00125 domain users   5044 Apr  6  1995 elimin.c
-rw-r--r-- 1 ava00125 domain users   2626 Apr  6  1995 errors.h
-rw-r--r-- 1 ava00125 domain users   7122 Apr  6  1995 eveninit.c
-rw-r--r-- 1 ava00125 domain users 226894 Apr  6  1995 ex1.dat
-rw-r--r-- 1 ava00125 domain users 225948 Apr  6  1995 ex2.dat
-rw-r--r-- 1 ava00125 domain users   4226 Apr  6  1995 extract.c
-rw-r--r-- 1 ava00

### Exercise: Unpack raw data files for F-MNIST

In [41]:
fmnist_unpack = fmnist.unpack()

2019-02-08 14:50:00,175 - fetch - DEBUG - Ungzipping train-images-idx3-ubyte
2019-02-08 14:50:00,612 - fetch - DEBUG - Ungzipping train-labels-idx1-ubyte
2019-02-08 14:50:00,616 - fetch - DEBUG - Ungzipping t10k-images-idx3-ubyte
2019-02-08 14:50:00,694 - fetch - DEBUG - Ungzipping t10k-labels-idx1-ubyte


In [42]:
fmnist_unpack

PosixPath('/home/ava00125/src/devel/bus_number/data/interim/fmnist')

In [43]:
!ls -la $paths.interim_data_path

total 102408
drwxrwx--- 6 ava00125 domain users     4096 Feb  8 12:29 .
drwxrwx--- 5 ava00125 domain users     4096 Oct 12 15:45 ..
-rw-rw---- 1 ava00125 domain users 47102837 Oct 10 20:58 048a21f52d05f88e50d70c47740ae1cf057549d2.dataset
-rw-rw---- 1 ava00125 domain users     2359 Oct 10 20:58 048a21f52d05f88e50d70c47740ae1cf057549d2.metadata
-rw-rw---- 1 ava00125 domain users   458449 Oct 11 11:21 0f0f977903be6bd247b34c1ee1c9f4ef25befe28.dataset
-rw-rw---- 1 ava00125 domain users     7743 Oct 11 11:21 0f0f977903be6bd247b34c1ee1c9f4ef25befe28.metadata
-rw-rw---- 1 ava00125 domain users  7852836 Oct 10 20:58 1bdd754d481a6fe186e958508000a620555c61b7.dataset
-rw-rw---- 1 ava00125 domain users     2358 Oct 10 20:58 1bdd754d481a6fe186e958508000a620555c61b7.metadata
-rw-rw---- 1 ava00125 domain users   905362 Oct 12 15:45 2c0bb10a816a7d45cce45984f1d5f9007c0a1d16.dataset
-rw-rw---- 1 ava00125 domain users     7743 Oct 12 15:45 2c0bb10a816a7d45cce45984f1d5f9007c0a1d16.metadata
-rw-r

In [44]:
# Check for your files in the unpacked dirs
!ls -la $fmnist_unpack

total 53812
drwxrwx--- 2 ava00125 domain users     4096 Feb  8 14:05 .
drwxrwx--- 6 ava00125 domain users     4096 Feb  8 12:29 ..
-rw-rw---- 1 ava00125 domain users    62432 Feb  8 14:11 fmnist.LICENSE
-rw-rw---- 1 ava00125 domain users     1144 Feb  8 14:11 fmnist.readme
-rw-rw---- 1 ava00125 domain users    62425 Feb  8 14:07 LICENSE
-rw-rw---- 1 ava00125 domain users  7840016 Feb  8 14:50 t10k-images-idx3-ubyte
-rw-rw---- 1 ava00125 domain users    10008 Feb  8 14:50 t10k-labels-idx1-ubyte
-rw-rw---- 1 ava00125 domain users 47040016 Feb  8 14:50 train-images-idx3-ubyte
-rw-rw---- 1 ava00125 domain users    60008 Feb  8 14:50 train-labels-idx1-ubyte


### Adding Metadata to Raw Data
Wait, what have we actually downloaded, and are we actually allowed to **use** this data? We keep track of two key pieces of metadata along with a raw dataset:
* Description (`DESCR`) Text: Human-readable text describing the dataset, its source, and what it represents
* License (`LICENSE`) Text: Terms of use for this dataset, often in the form of a license agreement

Often, a dataset comes complete with its own README and LICENSE files. If these are available via URL, we can add these like we add any other data file, tagging them as metadata using the `name` field:

In [45]:
dsrc.add_url("http://www.cis.hut.fi/research/lvq_pak/README",
               file_name='lvq-pak.readme', name='DESCR')

In [46]:
dsrc.fetch()
dsrc.unpack()

2019-02-08 14:50:01,137 - fetch - DEBUG - lvq_pak-3.1.tar already exists and hash is valid
2019-02-08 14:50:01,145 - fetch - DEBUG - Retrieved lvq-pak.readme (hash sha1:138b69cc0b4e02950cec5833752e50a54d36fd0f)
2019-02-08 14:50:01,146 - datasets - DEBUG - Data Source lvq-pak is already unpacked. Skipping


PosixPath('/home/ava00125/src/devel/bus_number/data/interim/lvq-pak')

In [47]:
# We now fetch 2 files. Note the metadata has been tagged accordingly in the `name` field
dsrc.file_list

[{'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
  'name': None,
  'file_name': 'lvq_pak-3.1.tar'},
 {'url': 'http://www.cis.hut.fi/research/lvq_pak/README',
  'hash_type': 'sha1',
  'hash_value': '138b69cc0b4e02950cec5833752e50a54d36fd0f',
  'name': 'DESCR',
  'file_name': 'lvq-pak.readme'}]

We need to dig a little deeper to find the license. we find it at the beginning of the README file contained within that distribution:

In [48]:
!head -35 $paths.interim_data_path/lvq-pak/lvq_pak-3.1/README

************************************************************************
*                                                                      *
*                              LVQ_PAK                                 *
*                                                                      *
*                                The                                   *
*                                                                      *
*                   Learning  Vector  Quantization                     *
*                                                                      *
*                          Program  Package                            *
*                                                                      *
*                   Version 3.1 (April 7, 1995)                        *
*                                                                      *
*                          Prepared by the                             *
*                    LVQ Programming T

Rather than trying to be clever, let's just add the license metadata from a python string that we cut and paste from the above.

In [49]:
license_txt = '''
************************************************************************
*                                                                      *
*                              LVQ_PAK                                 *
*                                                                      *
*                                The                                   *
*                                                                      *
*                   Learning  Vector  Quantization                     *
*                                                                      *
*                          Program  Package                            *
*                                                                      *
*                   Version 3.1 (April 7, 1995)                        *
*                                                                      *
*                          Prepared by the                             *
*                    LVQ Programming Team of the                       *
*                 Helsinki University of Technology                    *
*           Laboratory of Computer and Information Science             *
*                Rakentajanaukio 2 C, SF-02150 Espoo                   *
*                              FINLAND                                 *
*                                                                      *
*                      Copyright (c) 1991-1995                         *
*                                                                      *
************************************************************************
*                                                                      *
*  NOTE: This program package is copyrighted in the sense that it      *
*  may be used for scientific purposes. The package as a whole, or     *
*  parts thereof, cannot be included or used in any commercial         *
*  application without written permission granted by its producents.   *
*  No programs contained in this package may be copied for commercial  *
*  distribution.                                                       *
*                                                                      *
*  All comments concerning this program package may be sent to the     *
*  e-mail address 'lvq@nucleus.hut.fi'.                                *
*                                                                      *
************************************************************************
'''
dsrc.add_metadata(contents=license_txt, kind='LICENSE')

Under the hood, this will create a file, storing the creation instructions in the same `file_list` we use to store the URLs we wish to download:

In [50]:
dsrc.file_list

[{'url': 'http://www.cis.hut.fi/research/lvq_pak/lvq_pak-3.1.tar',
  'hash_type': 'sha1',
  'hash_value': '86024a871724e521341da0ffb783956e39aadb6e',
  'name': None,
  'file_name': 'lvq_pak-3.1.tar'},
 {'url': 'http://www.cis.hut.fi/research/lvq_pak/README',
  'hash_type': 'sha1',
  'hash_value': '138b69cc0b4e02950cec5833752e50a54d36fd0f',
  'name': 'DESCR',
  'file_name': 'lvq-pak.readme'},
 {'contents': "\n************************************************************************\n*                                                                      *\n*                              LVQ_PAK                                 *\n*                                                                      *\n*                                The                                   *\n*                                                                      *\n*                   Learning  Vector  Quantization                     *\n*                                                                     

Now when we fetch, the license file is created from this information:

In [51]:
logger.setLevel(logging.DEBUG)
dsrc.fetch(force=True)
dsrc.unpack()

2019-02-08 14:50:01,436 - fetch - DEBUG - lvq_pak-3.1.tar already exists and hash is valid
2019-02-08 14:50:01,440 - fetch - DEBUG - lvq-pak.readme already exists and hash is valid
2019-02-08 14:50:01,442 - fetch - DEBUG - Creating lvq-pak.license from `contents` string
2019-02-08 14:50:01,456 - fetch - DEBUG - lvq-pak.license exists, but no hash to check. Setting to sha1:e5f53b172926d34cb6a49877be49ee08bc4d51c1
2019-02-08 14:50:01,458 - datasets - DEBUG - Data Source lvq-pak is already unpacked. Skipping


PosixPath('/home/ava00125/src/devel/bus_number/data/interim/lvq-pak')

In [52]:
!ls -la $paths.raw_data_path

total 30916
drwxrwx--- 2 ava00125 domain users     4096 Feb  8 14:50 .
drwxrwx--- 5 ava00125 domain users     4096 Oct 12 15:45 ..
-rw-rw---- 1 ava00125 domain users   747520 Feb  8 14:49 lvq_pak-3.1.tar
-rw-rw---- 1 ava00125 domain users     2483 Feb  8 14:50 lvq-pak.license
-rw-rw---- 1 ava00125 domain users     4958 Feb  8 14:50 lvq-pak.readme
-rw-rw---- 1 ava00125 domain users  4422102 Feb  8 14:49 t10k-images-idx3-ubyte.gz
-rw-rw---- 1 ava00125 domain users     5148 Feb  8 14:49 t10k-labels-idx1-ubyte.gz
-rw-rw---- 1 ava00125 domain users 26421880 Feb  8 14:49 train-images-idx3-ubyte.gz
-rw-rw---- 1 ava00125 domain users    29515 Feb  8 14:49 train-labels-idx1-ubyte.gz


### Exercise: Add metadata to F-MNIST

In [53]:
# Here's the link to the readme
readme_url = 'https://github.com/zalandoresearch/fashion-mnist/blob/master/README.md'

In [54]:
# tidying up the readme to a nice useable format for this dataset
fmnist_readme = '''
Fashion-MNIST
=============

Notes
-----
Data Set Characteristics:
    :Number of Instances: 70000
    :Number of Attributes: 728
    :Attribute Information: 28x28 8-bit greyscale image
    :Missing Attribute Values: None
    :Creator: Zalando
    :Date: 2017

This is a copy of Zalando's Fashion-MNIST [F-MNIST] dataset:
https://github.com/zalandoresearch/fashion-mnist

Fashion-MNIST is a dataset of Zalando's article images—consisting of a
training set of 60,000 examples and a test set of 10,000
examples. Each example is a 28x28 grayscale image, associated with a
label from 10 classes. Fashion-MNIST is intended to serve as a direct
drop-in replacement for the original [MNIST] dataset for benchmarking
machine learning algorithms. It shares the same image size and
structure of training and testing splits.

References
----------
  - [F-MNIST] Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms.
    Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
  - [MNIST] The MNIST Database of handwritten digits. Yann LeCun, Corinna Cortes,
    Christopher J.C. Burges. http://yann.lecun.com/exdb/mnist/
'''

In [55]:
# Add the readme info as the DESCR
fmnist.add_metadata(contents=fmnist_readme, kind='DESCR')

In [56]:
fmnist.file_list

[{'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': '8d4fb7e6c68d591d4c3dfef9ec88bf0d',
  'name': 'train-images',
  'file_name': 'train-images-idx3-ubyte.gz'},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': '25c81989df183df01b3e8a0aad5dffbe',
  'name': 'train-labels',
  'file_name': 'train-labels-idx1-ubyte.gz'},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': 'bef4ecab320f06d8554ea6380940ec79',
  'name': 'test-images',
  'file_name': 't10k-images-idx3-ubyte.gz'},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': 'bb300cfdad3c16e7a12a480ee83cd310',
  'name': 'test-labels',
  'file_name': 't10k-labels-idx1-ubyte.gz'},
  'file_name': 'fmnist.readme'

In [57]:
# We can also find the LICENSE in the repo
fmnist_license_url = 'https://github.com/zalandoresearch/fashion-mnist/blob/master/LICENSE'

In [58]:
fmnist.add_url(url=fmnist_license_url, name='DESCR', file_name="fmnist.LICENSE")

In [59]:
fmnist.fetch()

2019-02-08 14:50:01,889 - fetch - DEBUG - train-images-idx3-ubyte.gz already exists and hash is valid
2019-02-08 14:50:01,891 - fetch - DEBUG - train-labels-idx1-ubyte.gz already exists and hash is valid
2019-02-08 14:50:01,904 - fetch - DEBUG - t10k-images-idx3-ubyte.gz already exists and hash is valid
2019-02-08 14:50:01,905 - fetch - DEBUG - t10k-labels-idx1-ubyte.gz already exists and hash is valid
2019-02-08 14:50:01,906 - fetch - DEBUG - Creating fmnist.readme from `contents` string
2019-02-08 14:50:01,917 - fetch - DEBUG - fmnist.readme exists, but no hash to check. Setting to sha1:db57a3964b6b3515901f665412297aabf69e007e
2019-02-08 14:50:02,280 - fetch - DEBUG - Retrieved fmnist.LICENSE (hash sha1:9cf1a09f827056b24769f829cdce9a349a635bb5)


True

In [60]:
fmnist.file_list

[{'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': '8d4fb7e6c68d591d4c3dfef9ec88bf0d',
  'name': 'train-images',
  'file_name': 'train-images-idx3-ubyte.gz'},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': '25c81989df183df01b3e8a0aad5dffbe',
  'name': 'train-labels',
  'file_name': 'train-labels-idx1-ubyte.gz'},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': 'bef4ecab320f06d8554ea6380940ec79',
  'name': 'test-images',
  'file_name': 't10k-images-idx3-ubyte.gz'},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': 'bb300cfdad3c16e7a12a480ee83cd310',
  'name': 'test-labels',
  'file_name': 't10k-labels-idx1-ubyte.gz'},
  'file_name': 'fmnist.readme'

In [61]:
fmnist.file_list

[{'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': '8d4fb7e6c68d591d4c3dfef9ec88bf0d',
  'name': 'train-images',
  'file_name': 'train-images-idx3-ubyte.gz'},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': '25c81989df183df01b3e8a0aad5dffbe',
  'name': 'train-labels',
  'file_name': 'train-labels-idx1-ubyte.gz'},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': 'bef4ecab320f06d8554ea6380940ec79',
  'name': 'test-images',
  'file_name': 't10k-images-idx3-ubyte.gz'},
 {'url': 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz',
  'hash_type': 'md5',
  'hash_value': 'bb300cfdad3c16e7a12a480ee83cd310',
  'name': 'test-labels',
  'file_name': 't10k-labels-idx1-ubyte.gz'},
  'file_name': 'fmnist.readme'

In [62]:
fmnist.unpack()

2019-02-08 14:50:02,363 - datasets - DEBUG - Data Source fmnist is already unpacked. Skipping


PosixPath('/home/ava00125/src/devel/bus_number/data/interim/fmnist')

In [63]:
!ls -la $fmnist_unpack

total 53812
drwxrwx--- 2 ava00125 domain users     4096 Feb  8 14:05 .
drwxrwx--- 6 ava00125 domain users     4096 Feb  8 12:29 ..
-rw-rw---- 1 ava00125 domain users    62432 Feb  8 14:11 fmnist.LICENSE
-rw-rw---- 1 ava00125 domain users     1144 Feb  8 14:11 fmnist.readme
-rw-rw---- 1 ava00125 domain users    62425 Feb  8 14:07 LICENSE
-rw-rw---- 1 ava00125 domain users  7840016 Feb  8 14:50 t10k-images-idx3-ubyte
-rw-rw---- 1 ava00125 domain users    10008 Feb  8 14:50 t10k-labels-idx1-ubyte
-rw-rw---- 1 ava00125 domain users 47040016 Feb  8 14:50 train-images-idx3-ubyte
-rw-rw---- 1 ava00125 domain users    60008 Feb  8 14:50 train-labels-idx1-ubyte


### Adding Raw Data to the Catalog

In [64]:
from src import workflow

In [65]:
workflow.available_datasources()

['fmnist', 'lvq-pak']

In [66]:
workflow.add_datasource(dsrc)

In [67]:
workflow.available_datasources()

['fmnist', 'lvq-pak']

We will make use of this raw dataset catalog later in this tutorial

### Exercise: Add F-MNIST to the Raw Dataset Catalog

In [68]:
workflow.add_datasource(fmnist)

In [69]:
# Your fmnist dataset should now show up here:
workflow.available_datasources()

['fmnist', 'lvq-pak']