
speed is imbalanced and slow #195

Closed
mttbx opened this issue Feb 24, 2019 · 4 comments

@mttbx commented Feb 24, 2019

Hi, I use the lmdb library to train my deep learning model with PyTorch; however, the read speed is imbalanced and slow. I think I can do better with your help.

I'm using Ubuntu, and the version of lmdb is 0.94, installed via pip3.

The output of free -m is:
             total    used    free    shared    buffers    cached
Mem:         15938    6131     159       756       9647      8711
Swap:         2047       0    2047

My lmdb dataset is about 150 GB. Each key is a UUID with some additional info, and each value is an undecoded JPEG image in bytes. I'm reading from the dataset in random order (random reads).

Here's my script in Python; I think it's easy to read even if you don't know anything about PyTorch:


import io
import os.path as osp
import pickle
import random

import lmdb
from PIL import Image
from torch.utils import data


class ImagenetDataset(data.Dataset):
    def __init__(self, dbfile, keyfile, transform):
        self.transform = transform
        # Open the LMDB environment (map_size here is 1 TiB).
        self.db = lmdb.open(dbfile, map_size=1024**4)
        # Load the cached key list if it exists; otherwise build it by
        # iterating over the database, then cache it with pickle.
        if osp.exists(keyfile):
            with open(keyfile, 'rb') as f:
                self.key = pickle.loads(f.read())
        else:
            with self.db.begin() as txn:
                self.key = [k for k, v in txn.cursor()]
            with open(keyfile, 'wb') as f:
                f.write(pickle.dumps(self.key))
        random.shuffle(self.key)

    def __del__(self):
        self.db.close()

    def __getitem__(self, index):
        # Random read: fetch the JPEG bytes for the requested key.
        k = self.key[index]
        with self.db.begin() as txn:
            v = txn.get(k)
        # The class label is encoded in the key's last '_'-separated field.
        cls = int(k.decode('ascii').split('_')[-1])
        img = Image.open(io.BytesIO(v))
        img = img.convert('RGB')
        img = self.transform(img)
        return img, cls

    def __len__(self):
        return len(self.key)

PyTorch will launch 16 workers that use this class to read images at random.
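
For context, a minimal sketch of how a dataset like this is typically consumed (the transform, paths, and batch size here are hypothetical; only num_workers=16 comes from the description above):

from torch.utils.data import DataLoader
from torchvision import transforms

# Hypothetical transform; the real script receives one from outside.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Placeholder paths for the LMDB file and the pickled key cache.
dataset = ImagenetDataset('imagenet.lmdb', 'keys.pkl', transform)
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=16)

for imgs, labels in loader:
    pass  # training step goes here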

Can you help me out?

@jnwatson (Owner)

Your map_size is a terabyte, which is probably too big for your system. It won't matter, though, since your file isn't that big.

Your iterable [k for k, v in txn.cursor()] is receiving the entire value for each key and then dropping it. Much faster would be [k for k in txn.cursor().iternext(keys=True, values=False)]
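
To make the suggestion concrete, here is a minimal sketch of the key-only scan (the database path is a placeholder, and readonly=True/lock=False are optional read-side settings not present in the original script):

import lmdb

# Open the environment read-only; no map_size is needed when only reading.
env = lmdb.open('imagenet.lmdb', readonly=True, lock=False)

with env.begin() as txn:
    # keys=True, values=False yields keys only, so the large JPEG
    # values are never copied out of the database.
    keys = [k for k in txn.cursor().iternext(keys=True, values=False)]

print(len(keys))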

@mttbx (Author) commented Jun 15, 2019

Yeah, that part was slow, but txn.get(k) is slow too. It seems random gets are much slower than iteration.

@jnwatson (Owner)

That's expected. Iteration is always going to be faster, because the keys are close together on disk and in memory, so your operating system's and storage system's read-ahead will load the next key while you're doing other work.

This is particularly true for spinning disks, where random access is about 100x slower than serial access. LMDB is going to be slow if you have more data than RAM or if you're on spinning-disk storage.
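
A rough way to observe this on your own data is to time a bounded sequential scan against the same number of random point reads (a sketch with a placeholder path and sample size, not a rigorous benchmark):

import random
import time
from itertools import islice

import lmdb

env = lmdb.open('imagenet.lmdb', readonly=True, lock=False)

with env.begin() as txn:
    keys = list(txn.cursor().iternext(keys=True, values=False))

# Sequential: values arrive in key order, so OS read-ahead helps.
t0 = time.time()
with env.begin() as txn:
    for _ in islice(txn.cursor().iternext(keys=False, values=True), 10000):
        pass
print('10k sequential reads:', time.time() - t0)

# Random: each get() may land on a cold page far away on disk.
t0 = time.time()
with env.begin() as txn:
    for k in random.sample(keys, 10000):
        txn.get(k)
print('10k random reads:', time.time() - t0)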

@mttbx (Author) commented Jun 17, 2019

Yes, that's exactly what I found. Thank you for your reply.

@mttbx closed this as completed Jun 17, 2019
rangwani-harsh added a commit to rangwani-harsh/vision that referenced this issue Aug 2, 2020

This pull request enhances the speed of cache creation for the LSUN dataset. For "kitchen_train", cache creation was taking more than two hours; with this change it completes within minutes. The issue was pulling the large image values for each key and then dropping them.

For more details, please refer to issue jnwatson/py-lmdb#195.
fmassa pushed a commit to pytorch/vision that referenced this issue Aug 20, 2020

* Only pull keys from db in lsun for faster cache.

* Fixed bug in lsun.py when loading multiple categories

* Make linter happy
bryant1410 pushed a commit to bryant1410/vision-1 that referenced this issue Nov 22, 2020