torchcsv

A PyTorch Dataset subclass for handling numerical data too large to fit in local memory.

Installation

To install, run pip install torchcsv.

Example

The CSVDataset class inherits from torch.utils.data.Dataset, as any custom Dataset does. However, rather than reading the entire data and label .csv files into memory, it makes two assumptions:

  1. The dataset is too large to fit in local memory
  2. The labels are contained in a separate file. If this isn't the case, consider using Dask to extract the column of interest into its own file first (see the sketch just after this list).
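Here is a minimal sketch of that preprocessing step, assuming the labels live in a column of the large data CSV itself; the file paths and column name are illustrative:

import dask.dataframe as dd

# Lazily read the large CSV, then write only the label column to its own file
df = dd.read_csv('path/to/datafile.csv')
df[['Animal Type']].to_csv('path/to/labelfile.csv', single_file=True, index=False)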

So, we initialize the CSVDataset object as

from torchcsv import CSVDataset 

data = CSVDataset(
    datafile='path/to/datafile.csv',
    labelfile='path/to/labelfile.csv',
    target_label='Animal Type', # Column name containing targets in labelfile.csv
    # indices=idx_list # Optionally, pass a list of purely numeric indices to use instead of the entire indices of the labelfile 
)
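As a side note, the optional indices argument makes it easy to carve train/validation splits out of a single file pair without duplicating any data. A sketch, with the sample count and split sizes purely illustrative:

import numpy as np
from torchcsv import CSVDataset

n = 1000  # total number of samples, known from the labelfile
idx = np.random.permutation(n)

train = CSVDataset(
    datafile='path/to/datafile.csv',
    labelfile='path/to/labelfile.csv',
    target_label='Animal Type',
    indices=idx[:800].tolist(),  # first 80% for training
)
val = CSVDataset(
    datafile='path/to/datafile.csv',
    labelfile='path/to/labelfile.csv',
    target_label='Animal Type',
    indices=idx[800:].tolist(),  # remaining 20% for validation
)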

For example, retrieving a single 16.3k-dimensional sample takes just a few milliseconds:

%%time
data[1]

CPU times: user 5.99 ms, sys: 576 µs, total: 6.56 ms
Wall time: 6.19 ms
(tensor([0., 0., 0.,  ..., 0., 0., 0.]), 16)

Now, we can use this like a regular PyTorch Dataset, but without having to worry about memory issues!

For example,

from torch.utils.data import DataLoader
loader = DataLoader(data, batch_size=4, num_workers=0)

Then, fetching the first minibatch:

%%time 
next(iter(loader))

CPU times: user 25.6 ms, sys: 20.8 ms, total: 46.4 ms
Wall time: 76.9 ms

[tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 1.2663,  ..., 0.0000, 0.0000, 0.0000]]),
 tensor([16, 16,  4,  4])]

So loading a minibatch of size 4 takes well under a tenth of a second. The CSVDataset class should scale to arbitrarily large files, keeping as much of the file in memory as it can via the linecache library.
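To make that caching behavior concrete, here is a minimal sketch of the linecache-based pattern such a Dataset can be built on. It is not torchcsv's actual implementation, and it assumes both files have a header row and purely numeric data:

import csv
import linecache

import torch
from torch.utils.data import Dataset

class LineCacheCSV(Dataset):
    def __init__(self, datafile, labelfile, target_label):
        self.datafile = datafile
        # The labelfile is assumed small enough to read fully into memory
        with open(labelfile) as f:
            self.labels = [int(row[target_label]) for row in csv.DictReader(f)]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # linecache reads (and caches) one line at a time, so the data file is
        # never fully loaded; idx + 2 skips the header row and converts the
        # 0-based index to linecache's 1-based line numbering
        line = linecache.getline(self.datafile, idx + 2)
        values = [float(x) for x in line.rstrip('\n').split(',')]
        return torch.tensor(values), self.labels[idx]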

API Documentation

torchcsv.CSVDataset: a Dataset subclass for purely numeric data too large to fit in local memory. As shown in the example above, it is constructed with datafile and labelfile paths, a target_label column name, and an optional indices list.
