
speed is imbalanced and slow #195

Closed
mttbx opened this issue Feb 24, 2019 · 4 comments

@mttbx commented Feb 24, 2019

Hi, I use the lmdb library to train my deep learning model with PyTorch; however, the read speed is imbalanced and slow. I think I can do better with your help.

I'm using Ubuntu, and the version of lmdb is 0.94, installed via pip3.

The output of free -m is:
             total    used    free    shared    buffers    cached
Mem:         15938    6131     159       756       9647      8711
Swap:         2047       0    2047

My lmdb dataset is about 150 GB. Each key is a UUID with some additional info, and each value is an undecoded JPEG image in bytes. I'm reading from the dataset in random order (random reads).

Here's my script in Python; I think it's easy to read even if you don't know anything about PyTorch:


import io
import os.path as osp
import pickle
import random

import lmdb
from PIL import Image
from torch.utils import data


class ImagenetDataset(data.Dataset):
    def __init__(self, dbfile, keyfile, transform):
        self.transform = transform
        # Open the LMDB environment (map_size here is 1 TiB).
        self.db = lmdb.open(dbfile, map_size=1024**4)
        # Load the cached key list if it exists; otherwise build it by
        # iterating over the database, then cache it with pickle.
        if osp.exists(keyfile):
            with open(keyfile, 'rb') as f:
                self.key = pickle.loads(f.read())
        else:
            with self.db.begin() as txn:
                self.key = [k for k, v in txn.cursor()]
            with open(keyfile, 'wb') as f:
                f.write(pickle.dumps(self.key))
        random.shuffle(self.key)

    def __del__(self):
        self.db.close()

    def __getitem__(self, index):
        # Random read: fetch the JPEG bytes for the requested key.
        k = self.key[index]
        with self.db.begin() as txn:
            v = txn.get(k)
        # The class label is encoded in the key's last '_'-separated field.
        cls = int(k.decode('ascii').split('_')[-1])
        img = Image.open(io.BytesIO(v))
        img = img.convert('RGB')
        img = self.transform(img)
        return img, cls

    def __len__(self):
        return len(self.key)

PyTorch will launch 16 workers that use this class to read images at random.
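
For context, a minimal sketch of how a dataset like this is typically consumed (the transform, paths, and batch size here are hypothetical; only num_workers=16 comes from the description above):

from torch.utils.data import DataLoader
from torchvision import transforms

# Hypothetical transform; the real script receives one from outside.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Placeholder paths for the LMDB file and the pickled key cache.
dataset = ImagenetDataset('imagenet.lmdb', 'keys.pkl', transform)
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=16)

for imgs, labels in loader:
    pass  # training step goes here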

Can you help me out?

@jnwatson (Owner)

Your map_size is a terabyte, which is probably too big for your system. It won't matter, though, since your file isn't that big.

Your iterable [k for k, v in txn.cursor()] is receiving the entire value for each key and then dropping it. Much faster would be [k for k in txn.cursor().iternext(keys=True, values=False)]
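
To make the suggestion concrete, here is a minimal sketch of the key-only scan (the database path is a placeholder, and readonly=True/lock=False are optional read-side settings not present in the original script):

import lmdb

# Open the environment read-only; no map_size is needed when only reading.
env = lmdb.open('imagenet.lmdb', readonly=True, lock=False)

with env.begin() as txn:
    # keys=True, values=False yields keys only, so the large JPEG
    # values are never copied out of the database.
    keys = [k for k in txn.cursor().iternext(keys=True, values=False)]

print(len(keys))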

@mttbx (Author) commented Jun 15, 2019

Yeah, that part was slow, but txn.get(k) is slow too. It seems random gets are much slower than iteration.

@jnwatson (Owner)

That's expected. Iteration is always going to be faster, because the keys are close together on disk and in memory, so your operating system's and storage system's read-ahead will load the next key while you're doing other work.

This is particularly true for spinning disks, where random access is about 100x slower than serial access. LMDB is going to be slow if you have more data than RAM or if you're on spinning-disk storage.
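
A rough way to observe this on your own data is to time a bounded sequential scan against the same number of random point reads (a sketch with a placeholder path and sample size, not a rigorous benchmark):

import random
import time
from itertools import islice

import lmdb

env = lmdb.open('imagenet.lmdb', readonly=True, lock=False)

with env.begin() as txn:
    keys = list(txn.cursor().iternext(keys=True, values=False))

# Sequential: values arrive in key order, so OS read-ahead helps.
t0 = time.time()
with env.begin() as txn:
    for _ in islice(txn.cursor().iternext(keys=False, values=True), 10000):
        pass
print('10k sequential reads:', time.time() - t0)

# Random: each get() may land on a cold page far away on disk.
t0 = time.time()
with env.begin() as txn:
    for k in random.sample(keys, 10000):
        txn.get(k)
print('10k random reads:', time.time() - t0)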

@mttbx (Author) commented Jun 17, 2019

Yes, that's exactly what I found. Thank you for your reply.

@mttbx closed this as completed Jun 17, 2019
rangwani-harsh added a commit to rangwani-harsh/vision that referenced this issue Aug 2, 2020

This pull request enhances the speed of cache creation for the LSUN dataset. For "kitchen_train", cache creation was taking more than two hours; with this change it completes within minutes. The issue was pulling the large image values for each key and then dropping them.

For more details, please refer to issue jnwatson/py-lmdb#195.
fmassa pushed a commit to pytorch/vision that referenced this issue Aug 20, 2020

* Only pull keys from db in lsun for faster cache.

* Fixed bug in lsun.py when loading multiple categories

* Make linter happy
bryant1410 pushed a commit to bryant1410/vision-1 that referenced this issue Nov 22, 2020