## Download a data set

This notebook downloads a data set file from a public location. If the file is a compressed archive, it is decompressed. Upon completion, the raw data set files are located in the `data/` directory.

This notebook reads the following environment variable:
 -  `DATASET_URL`: Public data set URL, e.g. `https://dax-cdn.cdn.appdomain.cloud/dax-fashion-mnist/1.0.2/fashion-mnist.tar.gz`. If the variable is not set, the notebook falls back to a default URL.

In [1]:
import glob
import os
from pathlib import Path
import requests
import tarfile
from urllib.parse import urlparse

Read the data set URL from the `DATASET_URL` environment variable. If the variable is not set, the NOAA weather data set URL shown in the cell is used as the default.

In [2]:
data_file = os.getenv('DATASET_URL',
                      'https://dax-cdn.cdn.appdomain.cloud/'
                      'dax-noaa-weather-data-jfk-airport/1.1.4/'
                      'noaa-weather-data-jfk-airport.tar.gz')
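
The cell above silently falls back to a default URL when `DATASET_URL` is not set. If a missing variable should stop the notebook instead, a stricter lookup could look like the following minimal sketch. The helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`.

    Hypothetical helper: raises RuntimeError if the variable is
    missing or empty instead of falling back to a default value.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            'Environment variable {} is not set.'.format(name))
    return value


# Possible use in place of the os.getenv call with a default:
# data_file = require_env('DATASET_URL')
```

With this variant, the notebook fails early with a clear message rather than downloading a data set the user did not ask for.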

Download the data set from the location specified in `data_file`, extract it (if it is a tar archive), and store it in the directory identified by `data_dir_name`.

In [3]:
data_dir_name = 'data'

print('Downloading data file {} ...'.format(data_file))
r = requests.get(data_file)
if r.status_code != 200:
    raise RuntimeError('Could not fetch {}: HTTP status code {}'
                       .format(data_file, r.status_code))
else:
    # extract data set file name from URL
    data_file_name = Path((urlparse(data_file).path)).name
    # create the directory where the downloaded file will be stored
    data_dir = Path(data_dir_name)
    data_dir.mkdir(parents=True, exist_ok=True)
    downloaded_data_file = data_dir / data_file_name

    print('Saving downloaded file "{}" as ...'.format(data_file_name))
    with open(downloaded_data_file, 'wb') as downloaded_file:
        downloaded_file.write(r.content)

    if r.headers.get('content-type') == 'application/x-tar':
        print('Extracting downloaded file in directory "{}" ...'
              .format(data_dir))
        with tarfile.open(downloaded_data_file, 'r') as tar:
            tar.extractall(data_dir)
        print('Removing downloaded file ...')
        downloaded_data_file.unlink()

Downloading data file https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-data-jfk-airport.tar.gz ...


Saving downloaded file "noaa-weather-data-jfk-airport.tar.gz" as ...
Extracting downloaded file in directory "data" ...
Removing downloaded file ...
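
The cell above decides whether to extract the file based on the `content-type` response header, which depends on how the server is configured. An alternative is to inspect the saved file itself. The following is a minimal sketch under that assumption; the function name `extract_if_tar` is hypothetical, and it relies on the standard library's `tarfile.is_tarfile`. As with the original cell, it should only be used on archives from a trusted source.

```python
import tarfile
from pathlib import Path


def extract_if_tar(archive_path, target_dir):
    """Extract `archive_path` into `target_dir` if it is a tar archive.

    Hypothetical helper. Returns True if the file was extracted (and
    then removed), or False if it is not a tar archive and was left
    untouched.
    """
    archive_path = Path(archive_path)
    target_dir = Path(target_dir)

    # is_tarfile inspects the file content, including compressed
    # archives such as .tar.gz, rather than trusting an HTTP header.
    if not tarfile.is_tarfile(str(archive_path)):
        return False

    target_dir.mkdir(parents=True, exist_ok=True)
    # Mode 'r' lets tarfile detect the compression transparently.
    with tarfile.open(archive_path, 'r') as tar:
        tar.extractall(target_dir)

    # Remove the archive once its content has been extracted.
    archive_path.unlink()
    return True
```

In the notebook, this could replace the header check with a call such as `extract_if_tar(downloaded_data_file, data_dir)`.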


Display the list of extracted data files.

In [4]:
for entry in glob.glob(data_dir_name + "/**/*", recursive=True):
    print(entry)

data/noaa-weather-data-jfk-airport
data/noaa-weather-data-jfk-airport/clean_data.py
data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv
data/noaa-weather-data-jfk-airport/jfk_weather.csv
data/noaa-weather-data-jfk-airport/README.txt
data/noaa-weather-data-jfk-airport/LICENSE.txt
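
The listing above includes directories as well as files, in whatever order `glob` returns them. If a sorted, files-only listing is preferred, a `pathlib`-based variant could look like this sketch. The function name `list_files` is hypothetical.

```python
from pathlib import Path


def list_files(root):
    """Return sorted POSIX-style paths, relative to `root`, of all
    regular files below it (directories are skipped)."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob('*') if p.is_file())
```

For example, `for name in list_files(data_dir_name): print(name)` would print one relative file path per line.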
