# mmap.ninja

This is a demo of `mmap_ninja`, which lets you store your machine learning datasets in a memory-mapped format for use during training.

This can significantly speed up I/O and cut the time needed to iterate over the dataset by up to **10 times**!

We'll demonstrate its power by converting a text classification dataset
(the IMDB movie reviews) into a memory-mapped format.

In [None]:
!pip install mmap_ninja

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mmap_ninja
  Downloading mmap_ninja-0.2.1.tar.gz (8.2 kB)
Building wheels for collected packages: mmap-ninja
  Building wheel for mmap-ninja (setup.py) ... done
  Created wheel for mmap-ninja: filename=mmap_ninja-0.2.1-py3-none-any.whl size=8095 sha256=7cd66c55b8d9568f6f46aeedbdcfa0425b1d3348febcfffdf4fb5fa18cfb4eeb
  Stored in directory: /root/.cache/pip/wheels/2f/da/3f/4794f761c01ddf0e0e8bd9a668ed4f91ca692437d7345a77ca
Successfully built mmap-ninja
Installing collected packages: mmap-ninja
Successfully installed mmap-ninja-0.2.1


In [None]:
# Load the data: IMDB movie review sentiment classification
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  28.8M      0  0:00:02  0:00:02 --:--:-- 28.8M


The extracted directory has roughly the following structure:

```
.
├── imdbEr.txt
├── imdb.vocab
├── README
├── test
│   ├── labeledBow.feat
│   ├── neg [12500 entries exceeds filelimit, not opening dir]
│   ├── pos [12500 entries exceeds filelimit, not opening dir]
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg [12500 entries exceeds filelimit, not opening dir]
    ├── pos [12500 entries exceeds filelimit, not opening dir]
    ├── unsup [50000 entries exceeds filelimit, not opening dir]
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt


```
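
If you want to confirm the listing programmatically, a minimal sketch like the one below counts the files directly inside each sub-directory. The helper name `count_entries` is hypothetical (it is not part of `mmap_ninja`), and the demo runs on a throwaway temporary directory so that it works without the dataset.

```python
import tempfile
from pathlib import Path


def count_entries(base_dir, sub_dirs):
    """Return {sub_dir: number of regular files directly inside it}."""
    base_dir = Path(base_dir)
    counts = {}
    for sub_dir in sub_dirs:
        directory = base_dir / sub_dir
        if directory.is_dir():
            counts[sub_dir] = sum(1 for path in directory.iterdir() if path.is_file())
        else:
            # A missing sub-directory is reported as empty rather than raising.
            counts[sub_dir] = 0
    return counts


# Demo on a temporary directory that mimics a tiny slice of the layout above.
with tempfile.TemporaryDirectory() as tmp_dir:
    pos_dir = Path(tmp_dir) / 'train' / 'pos'
    pos_dir.mkdir(parents=True)
    for name in ('0_9.txt', '1_7.txt', '2_8.txt'):
        (pos_dir / name).write_text('a review', encoding='utf-8')
    demo_counts = count_entries(tmp_dir, ['train/pos', 'train/neg'])

print(demo_counts)
```

On the extracted dataset, `count_entries('aclImdb', ['train/pos', 'train/neg', 'train/unsup', 'test/pos', 'test/neg'])` should agree with the entry counts in the tree listing.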

Let's print a review:

In [None]:
!cat aclImdb/train/pos/6248_7.txt


Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the directors intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated uncensored swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than put in the right context.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive and live in a world of redundancy and boring horizons. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" 

Now, let's time a full pass over the relevant text files.

In [None]:
import numpy as np

from tqdm import tqdm
from pathlib import Path
from time import time

In [None]:
base_dir = Path('aclImdb')
text_subdirs = [
  'train/pos',
  'train/neg',
  'train/unsup',
  'test/pos',
  'test/neg'
]

In [None]:
def texts_generator():
  for sub_dir in text_subdirs:
    for text_path in (base_dir / sub_dir).iterdir():
      with open(text_path, encoding='utf-8') as in_file:
        yield in_file.read()
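
Note that `Path.iterdir()` yields entries in arbitrary order, so which review comes first can differ between machines. If you need a stable sample order (for example, to line samples up with labels stored elsewhere), a variant along these lines sorts the paths first. The name `sorted_texts_generator` is hypothetical, and the demo uses a temporary directory rather than the real dataset.

```python
import tempfile
from pathlib import Path


def sorted_texts_generator(base_dir, sub_dirs, encoding='utf-8'):
    """Yield the contents of every file in each sub-directory, in sorted path order."""
    base_dir = Path(base_dir)
    for sub_dir in sub_dirs:
        # sorted() orders paths lexicographically, e.g. '10_1.txt' before '2_1.txt'.
        for text_path in sorted((base_dir / sub_dir).iterdir()):
            yield text_path.read_text(encoding=encoding)


# Demo: files are created out of order but read back in sorted order.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = Path(tmp_dir) / 'train' / 'pos'
    demo_dir.mkdir(parents=True)
    (demo_dir / 'b.txt').write_text('second', encoding='utf-8')
    (demo_dir / 'a.txt').write_text('first', encoding='utf-8')
    demo_texts = list(sorted_texts_generator(tmp_dir, ['train/pos']))

print(demo_texts)
```

Sorting costs a little extra time per directory, but it only matters during the one-off conversion.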

In [None]:
print(next(texts_generator()))

<br /><br />Film dominated by raven-haired Barbara Steele, it was seen when I was seven or eight and created permanent images of pallid vampiric men and women stalking a castle, seeking blood. Steele is an icon of horror films and an otherworldly beauty, and the views of the walking dead pre-date Romero's NIGHT OF THE LIVING DEAD shamblers, unifying them in my mind.<br /><br />I don't see the connection between this film and THE HAUNTING, which is clever but ambiguous about the forces present. LA DANZA MACABRE is a b-movie without pretention, daring you to fall in love with Barbara Steele and suffer the consequences. There's no such draw to HAUNTING's overwrought Claire Bloom. The comparisons to the HAUNTING are superficial.<br /><br />And no, this movie does NOT need to be remade. Not only is it a product of the Sixties, but the large percentage of talentless cretins in Hollywood cannot fathom MACABRE's formula for terror. That formula is based on one overriding factor: GOOD WRITING. 

In [None]:
start_t = time()
for text in tqdm(texts_generator()):
  pass
text_t = time() - start_t
print(f'\nTook: {text_t}')

100000it [00:03, 25768.09it/s]


Took: 3.8872451782226562
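
The same measurement is repeated later for the memory-mapped version, so it can be handy to wrap it in a small helper. The sketch below is one way to do that; `time_pass` is a hypothetical name, and it uses `time.perf_counter`, which is intended for measuring short durations, instead of `time`.

```python
from time import perf_counter


def time_pass(iterable):
    """Consume the iterable once and return (item_count, elapsed_seconds)."""
    start = perf_counter()
    count = 0
    for _ in iterable:
        count += 1
    return count, perf_counter() - start


# Demo on a plain range; in the notebook you would pass texts_generator() instead.
demo_count, demo_seconds = time_pass(range(1000))
print(f'{demo_count} items in {demo_seconds:.6f} s')
```

Keep in mind that a first pass over files on disk can be slower than later passes because of operating system caching, so it may be worth timing more than one run.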





Now, let's convert the dataset into a `StringsMmap`!

The first step is to write the text files into a memory-mapped directory with `StringsMmap.from_generator`.

This is done only once for the whole project, because the result is persisted on disk.

In [None]:
from mmap_ninja.string import StringsMmap

StringsMmap.from_generator(
    out_dir='aclImdb_mmap',
    sample_generator=texts_generator(),
    batch_size=1024,
    verbose=True
)

100000it [00:04, 20619.84it/s]


<mmap_ninja.string.StringsMmap at 0x7ff1a3f05bd0>

In [None]:
texts = StringsMmap('aclImdb_mmap')
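
Since the conversion only needs to happen once, a training script can check whether the output directory already exists and skip the conversion if it does. The sketch below shows that pattern with a hypothetical helper, `load_or_build`, which takes the build and load steps as callables. The demo uses stand-in functions so that it runs without `mmap_ninja`; the commented lines show how the calls from this notebook would plug in.

```python
import tempfile
from pathlib import Path


def load_or_build(out_dir, build, load):
    """Call build(out_dir) only if out_dir does not exist yet, then return load(out_dir)."""
    out_dir = Path(out_dir)
    if not out_dir.exists():
        build(str(out_dir))
    return load(str(out_dir))


# With the calls used in this notebook, this would look like:
# texts = load_or_build(
#     'aclImdb_mmap',
#     build=lambda d: StringsMmap.from_generator(
#         out_dir=d, sample_generator=texts_generator(), batch_size=1024, verbose=True),
#     load=StringsMmap,
# )

# Stand-in build/load functions for a runnable demo.
build_calls = []


def demo_build(out_dir):
    build_calls.append(out_dir)
    Path(out_dir).mkdir(parents=True)
    (Path(out_dir) / 'data.txt').write_text('hello', encoding='utf-8')


def demo_load(out_dir):
    return (Path(out_dir) / 'data.txt').read_text(encoding='utf-8')


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / 'demo_mmap'
    first = load_or_build(target, demo_build, demo_load)
    second = load_or_build(target, demo_build, demo_load)  # build is skipped here

print(first, second, len(build_calls))
```

This check is deliberately simple: if a conversion is interrupted halfway, the directory exists but is incomplete, so you would need to delete it before retrying.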

In [None]:
print(texts[0])

<br /><br />Film dominated by raven-haired Barbara Steele, it was seen when I was seven or eight and created permanent images of pallid vampiric men and women stalking a castle, seeking blood. Steele is an icon of horror films and an otherworldly beauty, and the views of the walking dead pre-date Romero's NIGHT OF THE LIVING DEAD shamblers, unifying them in my mind.<br /><br />I don't see the connection between this film and THE HAUNTING, which is clever but ambiguous about the forces present. LA DANZA MACABRE is a b-movie without pretention, daring you to fall in love with Barbara Steele and suffer the consequences. There's no such draw to HAUNTING's overwrought Claire Bloom. The comparisons to the HAUNTING are superficial.<br /><br />And no, this movie does NOT need to be remade. Not only is it a product of the Sixties, but the large percentage of talentless cretins in Hollywood cannot fathom MACABRE's formula for terror. That formula is based on one overriding factor: GOOD WRITING. 

In [None]:
start_t = time()
for text in tqdm(texts):
  pass
mmap_t = time() - start_t
print(f'\nTook: {mmap_t}')

100%|██████████| 100000/100000 [00:00<00:00, 269892.44it/s]


Took: 0.3762497901916504





In [None]:
ratio = text_t / mmap_t
print(f'Iterating over the StringsMmap is {ratio:.2f} times faster than reading the text files!')

Iterating over the StringsMmap is 10.33 times faster than reading the text files!


We've seen a roughly tenfold improvement in the time for one pass over the dataset:
about 3.9 seconds when reading the text files versus about 0.38 seconds with the `StringsMmap`.

On larger datasets, that is the difference between waiting a minute and waiting
a few seconds.

Since a full pass happens for every epoch and for every model you want
to experiment with, the savings add up quickly!
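
To build some intuition for where the speedup comes from, here is a minimal sketch of the general idea: instead of opening 100,000 small files, all texts are concatenated into one file, their byte offsets are stored separately, and the data file is memory-mapped for reading. This is an illustration using only the standard library, not necessarily how `mmap_ninja` lays out its files; `write_flat` and `FlatTexts` are hypothetical names.

```python
import mmap
import struct
import tempfile
from pathlib import Path


def write_flat(texts, data_path, offsets_path):
    """Concatenate UTF-8 encoded texts into one file and record their byte offsets."""
    offsets = [0]
    with open(data_path, 'wb') as data_file:
        for text in texts:
            encoded = text.encode('utf-8')
            data_file.write(encoded)
            offsets.append(offsets[-1] + len(encoded))
    with open(offsets_path, 'wb') as offsets_file:
        # Offsets are stored as little-endian signed 64-bit integers.
        offsets_file.write(struct.pack(f'<{len(offsets)}q', *offsets))


class FlatTexts:
    """Read-only view over the files written by write_flat (integer indexing only)."""

    def __init__(self, data_path, offsets_path):
        raw = Path(offsets_path).read_bytes()
        self.offsets = struct.unpack(f'<{len(raw) // 8}q', raw)
        self._file = open(data_path, 'rb')
        # Note: mmap cannot map an empty file, so at least one non-empty text is assumed.
        self._buffer = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        start, end = self.offsets[index], self.offsets[index + 1]
        return self._buffer[start:end].decode('utf-8')

    def close(self):
        self._buffer.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = Path(tmp_dir) / 'texts.bin'
    offsets_path = Path(tmp_dir) / 'offsets.bin'
    write_flat(['first review', 'second, longer review', 'caf\u00e9'], data_path, offsets_path)
    flat = FlatTexts(data_path, offsets_path)
    demo_items = list(flat)  # iteration works through __getitem__ and IndexError
    demo_last = flat[-1]
    flat.close()

print(demo_items)
```

Reading a sample becomes a slice of one mapped buffer plus a decode, with no per-sample file open and close, which is where much of the saving in a layout like this comes from.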

**Another tip**: You can `append` to or `extend` a `StringsMmap`, just as you would a regular Python `list`!

In [None]:
print(len(texts))

100000


In [None]:
texts.append('This is a new document')

In [None]:
print(len(texts))

100001


In [None]:
texts[-1]

'This is a new document'

In [None]:
texts.extend(['New doc0', 'New doc1'])

In [None]:
print(len(texts))

100003


In [None]:
texts[-3:]

['This is a new document', 'New doc0', 'New doc1']