DataSet Reader

Included here is documentation for the DataSet class in MaTEx TensorFlow.

Table of Contents

  1. MaTEx 0.6
  2. MaTEx 0.5

MaTEx 0.6

In MaTEx 0.6, datasets.py has been integrated into TensorFlow, so the DataSet class is available directly as tf.DataSet.

data = tf.DataSet(data_name,
                  train_file=None,
                  validation_file=None,
                  test_file=None,
                  train_batch_size=None,
                  test_batch_size=None,
                  normalize=1.0,
                  valid_pct=0.0,
                  test_pct=0.0,
                  sep=',',
                  label_col=True)

datasets.py provides the DataSet class, which reads MNIST, CIFAR, CSV, HDF5 and PNetCDF data in parallel. The DataSet class takes the following arguments (a combined example follows the list):

  • data_name: one of "MNIST", "CIFAR10", "CIFAR100", "CSV", "HDF5" or "PNETCDF"
  • train_file (optional): location of a file containing training data; required for CSV, HDF5 or PNETCDF
  • validation_file (optional): location of a file containing validation data for CSV, HDF5 or PNETCDF
  • test_file (optional): location of a file containing testing data for CSV, HDF5 or PNETCDF
  • train_batch_size (optional): size of a training batch, for use with the next_train_batch method
  • test_batch_size (optional): size of a testing batch, for use with the next_validation_batch and next_test_batch methods
  • normalize (optional): float to divide all data entries by (default 1.0)
  • valid_pct (optional): float; for CIFAR or CSV data, places this fraction of the loaded data into a validation set
  • test_pct (optional): float; for CIFAR or CSV data, places this fraction of the loaded data into a testing set
  • sep (optional): string; separator for the CSV reader, usually ',' (the default) or '\t' for TSV files
  • label_col (optional): bool; if False, the reader will not look for a label column in the CSV file
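
As a combined example, the hedged sketch below loads a tab-separated file with no label column and carves out validation and test splits; the file name data.tsv and the 10% split fractions are illustrative, not part of the API.

import tensorflow as tf

# Hypothetical tab-separated, unlabeled file; 10% of the loaded rows
# go to the validation set and another 10% to the testing set.
data = tf.DataSet("CSV",
                  train_file='data.tsv',
                  sep='\t',
                  label_col=False,
                  valid_pct=0.1,
                  test_pct=0.1)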

The DataSet class provides the following methods:

[data, labels] = data.next_train_batch()
[data, labels] = data.next_validation_batch()
[data, labels] = data.next_test_batch()

None of these methods takes arguments; each returns the next batch of data and labels as a list, using the batch sizes passed to the DataSet constructor.
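
For example, a training epoch can be driven by calling next_train_batch in a loop. This is a minimal sketch, assuming you compute the number of batches per epoch yourself from the size of your training set:

num_batches = 600  # illustrative: 60000 training rows / batch size of 100

for _ in range(num_batches):
    batch_x, batch_y = data.next_train_batch()
    # ... train on batch_x and batch_y here ...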

Note that for CSV files, all rows must have the same number of elements, with the first element being the label (unless label_col=False).
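
For instance, a valid labeled CSV file with two features per row might look like the following (values are illustrative):

0,0.25,0.75
1,0.10,0.90
0,0.60,0.40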

Examples-0.6

Below are several examples of using the parallel dataset reader.

import tensorflow as tf

MNIST

Load MNIST with all entries divided by 255.0.

mnist = tf.DataSet("MNIST", normalize=255.0)

CIFAR-10

Load CIFAR-10 data.

cifar10 = tf.DataSet("CIFAR10")

CSV

Load training and testing data from CSV files.

csv = tf.DataSet("CSV", train_file='train.csv', test_file='test.csv')

HDF5

Load training and testing data from HDF5 files.

hdf5 = tf.DataSet("HDF5", train_file='train.h5', test_file='test.h5')

PNetCDF

Load training and testing data from PNetCDF files.

pnetcdf = tf.DataSet("PNETCDF", train_file='train.nc', test_file='test.nc')

With the next-batch methods

mnist = tf.DataSet("MNIST", normalize=255.0, train_batch_size=64, test_batch_size=100)
batch_x, batch_y = mnist.next_train_batch()
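
A batch can then be fed into a standard TensorFlow 1.x graph through placeholders. The sketch below is illustrative and assumes MNIST samples come back flattened to 784 floats with one-hot labels of length 10; your build may lay the data out differently.

x = tf.placeholder(tf.float32, [None, 784])   # assumed flattened images
y = tf.placeholder(tf.float32, [None, 10])    # assumed one-hot labels
# ... build a model and a train_op on x and y ...

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x, batch_y = mnist.next_train_batch()
    # sess.run(train_op, feed_dict={x: batch_x, y: batch_y})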

MaTEx 0.5

In MaTEx 0.5, the DataSet class lives in the standalone datasets.py module:

from datasets import DataSet

data = DataSet(data_name,
               train_batch_size=None,
               test_batch_size=None,
               normalize=1.0,
               file1=None,
               file2=None,
               valid_pct=0.0,
               test_pct=0.0)

datasets.py provides the DataSet class, which reads MNIST, CIFAR, CSV and PNetCDF data in parallel. The DataSet class takes the following arguments (a combined example follows the list):

  • data_name: one of "MNIST", "CIFAR10", "CIFAR100", "CSV" or "PNETCDF"
  • train_batch_size (optional): size of a training batch, for use with the next_train_batch method
  • test_batch_size (optional): size of a testing batch, for use with the next_validation_batch and next_test_batch methods
  • normalize (optional): float to divide all data entries by (default 1.0)
  • file1 (optional): location of a file containing training data; required for CSV or PNETCDF
  • file2 (optional): location of a file containing testing or validation data for CSV or PNETCDF
  • valid_pct (optional): float; for data_name values other than "MNIST" and "PNETCDF", places this fraction of the loaded data into a validation set
  • test_pct (optional): float; for data_name values other than "MNIST" and "PNETCDF", places this fraction of the loaded data into a testing set
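
As a combined example under the 0.5 API, the hedged sketch below splits the loaded CIFAR-10 data into validation and test fractions; the 10% split fractions and batch sizes are illustrative.

from datasets import DataSet

# Hold out 10% of the loaded CIFAR-10 data for validation and
# another 10% for testing.
cifar10 = DataSet("CIFAR10",
                  train_batch_size=64,
                  test_batch_size=100,
                  valid_pct=0.1,
                  test_pct=0.1)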

The DataSet class provides the following methods:

[data, labels] = data.next_train_batch()
[data, labels] = data.next_validation_batch()
[data, labels] = data.next_test_batch()

None of these methods takes arguments; each returns the next batch of data and labels as a list, using the batch sizes passed to the DataSet constructor.

Note that for CSV files, all rows must have the same number of elements, with the first element being the label.

Examples-0.5

Below are several examples of using the parallel dataset reader.

from datasets import DataSet

MNIST

Load MNIST with all entries divided by 255.0.

mnist = DataSet("MNIST", normalize=255.0)

CIFAR-10

Load CIFAR-10 data.

cifar10 = DataSet("CIFAR10")

CSV

Load training and testing data from CSV files.

csv = DataSet("CSV", file1='train.csv', file2='test.csv')

PNetCDF

Load training and testing data from PNetCDF files.

pnetcdf = DataSet("PNETCDF", file1='train.nc', file2='test.nc')

With the next-batch methods

mnist = DataSet("MNIST", normalize=255.0, train_batch_size=64, test_batch_size=100)
batch_x, batch_y = mnist.next_train_batch()