In [None]:
!git clone https://www.github.com/bminixhofer/nnsplit

Cloning into 'nnsplit'...
remote: Enumerating objects: 2873, done.
remote: Counting objects: 100% (719/719), done.
remote: Compressing objects: 100% (302/302), done.
remote: Total 2873 (delta 403), reused 706 (delta 392), pack-reused 2154
Receiving objects: 100% (2873/2873), 89.36 MiB | 37.66 MiB/s, done.
Resolving deltas: 100% (1540/1540), done.


In [None]:
!wget https://www.dropbox.com/s/cnrhd11zdtc1pic/enwiki-20181001-corpus.xml.bz2?dl=1

--2022-06-27 08:08:57--  https://www.dropbox.com/s/cnrhd11zdtc1pic/enwiki-20181001-corpus.xml.bz2?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/cnrhd11zdtc1pic/enwiki-20181001-corpus.xml.bz2 [following]
--2022-06-27 08:08:57--  https://www.dropbox.com/s/dl/cnrhd11zdtc1pic/enwiki-20181001-corpus.xml.bz2
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc6307bfd9e590ea9c5911e04d3a.dl.dropboxusercontent.com/cd/0/get/Bn_YBJmWbo5tMidOUKoEYcEvo5NdcOQNXVXzsrZ_rowWPVPBRMPIuXYAo8lvEqQ_9T0ZypPC6eoN4AA2AVHYqDPLU8hPfXnuvJMVpIjlM7SR2LGEIfVPhr82D9NhYEjyFTVekQfZO63poN6pFgxKKTiEOZd128pBYLyxg84SYbZgbQ/file?dl=1# [following]
--2022-06-27 08:08:57--  https://uc6307bfd9e590ea9c5911e04d3a.dl.dropboxusercontent.com/cd/0/get/Bn_YBJmWbo5t

In [None]:
import sys
sys.path.append("nnsplit/train")
from text_data import MemoryMapDataset, xml_dump_iter

In [None]:
xml_iter = xml_dump_iter("data.xml", 
                         min_text_length=10, 
                         max_text_length=5000)
next(xml_iter)

StopIteration: ignored

`MemoryMapDataset` is another convenient built-in class, but it is not specific to the Wikipedia dump. It is a `torch.utils.data.Dataset` which can be created from a `texts.txt` and a `slices.pkl` file. The `texts.txt` file is [memory-mapped](https://en.wikipedia.org/wiki/Memory-mapped_file) and `slices.pkl` contains a Python array of indices that determine which range of the text is loaded at each position in the dataset. This allows accessing each text without ever loading all the data into memory.
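To make the idea concrete, here is a minimal sketch of a memory-mapped text file plus a pickled slice index. It is not the NNSplit implementation: `write_texts_and_slices` and `TinyMemoryMapTexts` are hypothetical names, and I am assuming that each slice is a `(start, end)` pair of byte offsets.

```python
import mmap
import pickle
import tempfile
from pathlib import Path


def write_texts_and_slices(texts, text_path, slices_path):
    """Concatenate texts into one file and record each text's byte range."""
    slices = []
    offset = 0
    with open(text_path, "wb") as f:
        for text in texts:
            data = text.encode("utf-8")
            f.write(data)
            slices.append((offset, offset + len(data)))
            offset += len(data)
    with open(slices_path, "wb") as f:
        pickle.dump(slices, f)


class TinyMemoryMapTexts:
    """Hypothetical stand-in (not the NNSplit class): reads texts lazily via mmap."""

    def __init__(self, text_path, slices_path):
        self._file = open(text_path, "rb")
        # Map the whole file read-only; pages are loaded only when accessed.
        self._mmap = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        with open(slices_path, "rb") as f:
            self._slices = pickle.load(f)

    def __len__(self):
        return len(self._slices)

    def __getitem__(self, idx):
        start, end = self._slices[idx]
        return self._mmap[start:end].decode("utf-8")

    def close(self):
        self._mmap.close()
        self._file.close()


tmp_dir = Path(tempfile.mkdtemp())
write_texts_and_slices(
    ["First text.", "Zweiter Text mit Umlaut: ä.", "Third."],
    tmp_dir / "texts.txt",
    tmp_dir / "slices.pkl",
)
demo = TinyMemoryMapTexts(tmp_dir / "texts.txt", tmp_dir / "slices.pkl")
second_text = demo[1]
n_texts = len(demo)
demo.close()
```

Only the slice index lives in memory; the text itself stays on disk until a particular range is requested.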

To create `texts.txt` and `slices.pkl` from an iterator over text, use `MemoryMapDataset.iterator_to_text_and_slices`.

Note that this will be quite slow since iterating over the XML dump takes a significant amount of time, so I would recommend caching `texts.txt` and `slices.pkl` somewhere.

`max_n_texts=10_000_000` is only needed in Colab to keep disk usage in check; feel free to remove it otherwise.

In [None]:
xml_iter = xml_dump_iter("data.xml", 
                         min_text_length=10,
                         max_text_length=5000)
MemoryMapDataset.iterator_to_text_and_slices(xml_iter,
                                             "texts.txt",
                                             "slices.pkl",
                                             max_n_texts=10_000_000)

0it [00:00, ?it/s]

Here, I am saving the outputs to my Drive; you will have to adjust these paths. The Drive has to be mounted before the files can be copied.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

In [None]:
!cp -a slices.pkl "/content/drive/My Drive/Projects/nnsplit/slices.pkl"
!cp -a texts.txt "/content/drive/My Drive/Projects/nnsplit/texts.txt"


# Training

Now we can get started with training!

In [None]:
import sys
sys.path.append("nnsplit/train")

In [None]:
!pip install git+https://github.com/PyTorchLightning/pytorch-lightning

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/PyTorchLightning/pytorch-lightning
  Cloning https://github.com/PyTorchLightning/pytorch-lightning to /tmp/pip-req-build-4qusdskv
  Running command git clone -q https://github.com/PyTorchLightning/pytorch-lightning /tmp/pip-req-build-4qusdskv
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting pyDeprecate>=0.3.1
  Downloading pyDeprecate-0.3.2-py3-none-any.whl (10 kB)
Collecting fsspec[http]!=2021.06.0,>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
     |████████████████████████████████| 140 kB 5.0 MB/s
Collecting torchmetrics>=0.4.1
  Downloading torchmetrics-0.9.1-py3-none-any.whl (419 kB)
     |████████████████████████████████| 419 kB 48.0 MB/s


In [None]:
!pip install onnx

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting onnx
  Downloading onnx-1.12.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
     |████████████████████████████████| 13.1 MB 4.5 MB/s
Installing collected packages: onnx
Successfully installed onnx-1.12.0


In [None]:
import json
from pytorch_lightning.trainer import Trainer
from tqdm.auto import tqdm
from model import Network
from text_data import MemoryMapDataset

NNSplit has a `Network` class which is a `pl.LightningModule` specifying network architecture, data loading logic etc. To instantiate a new network, we need to first get the default hyperparameters.

In [None]:
parser = Network.get_parser()
hparams = parser.parse_args([])
hparams

Namespace(accelerator=None, accumulate_grad_batches=None, amp_backend='native', amp_level=None, auto_lr_find=False, auto_scale_batch_size=False, auto_select_gpus=False, batch_size=128, benchmark=None, check_val_every_n_epoch=1, default_root_dir=None, detect_anomaly=False, deterministic=None, devices=None, enable_checkpointing=True, enable_model_summary=True, enable_progress_bar=True, fast_dev_run=False, gpus=None, gradient_clip_algorithm=None, gradient_clip_val=None, ipus=None, level_weights=[], limit_predict_batches=None, limit_test_batches=None, limit_train_batches=None, limit_val_batches=None, log_every_n_steps=50, logger=True, max_epochs=1, max_steps=-1, max_time=None, min_epochs=None, min_steps=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', num_nodes=1, num_processes=None, num_sanity_val_steps=2, overfit_batches=0.0, plugins=None, precision=32, predict_indices=[], profiler=None, reload_dataloaders_every_epoch=True, reload_dataloaders_every_n_epochs=0,
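Why `parse_args([])`? Passing an empty list tells `argparse` to ignore the real command line, so every field falls back to its default. A small sketch with a hypothetical parser (only the two argument names are borrowed from the output above; this is not `Network.get_parser()`):

```python
import argparse

# Hypothetical parser; the argument names mimic two fields from the output above.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--max_epochs", type=int, default=1)

# An empty list means "no command-line arguments", so every field gets its default.
# Without it, argparse would read sys.argv, which in a notebook holds kernel arguments.
defaults = parser.parse_args([])

# The result is a plain Namespace, so fields can be overridden by assignment.
defaults.max_epochs = 4
```

This is the same pattern used further down, where individual `hparams` fields are overwritten before training.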

## Load text data

Next, we can load the text data created previously.

In [None]:
text_dataset = MemoryMapDataset("texts.txt", "slices.pkl")

Keep in mind that this can be any `torch.utils.data.Dataset` with `str` entries, so you can completely customize it.

In [None]:
text_dataset[0]

Next, create a `Labeler`, which is used to annotate the text from above. Any spaCy model which supports sentencization can be used. You will have to install the appropriate spaCy model with `python -m spacy ...` when running this in Colab.

In [None]:
pip install diskcache

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting diskcache
  Downloading diskcache-5.4.0-py3-none-any.whl (44 kB)
     |████████████████████████████████| 44 kB 2.0 MB/s
Installing collected packages: diskcache
Successfully installed diskcache-5.4.0


In [None]:
pip install SoMaJo

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting SoMaJo
  Downloading SoMaJo-2.2.1-py3-none-any.whl (90 kB)
     |████████████████████████████████| 90 kB 4.1 MB/s
Installing collected packages: SoMaJo
Successfully installed SoMaJo-2.2.1


In [None]:
from labeler import Labeler, SpacySentenceTokenizer, SpacyWordTokenizer
from spacy.tokenizer import Tokenizer

In [None]:
labeler = Labeler(
    [
        SpacySentenceTokenizer(
            "en_core_web_sm", lower_start_prob=0.7, remove_end_punct_prob=0.7
        ),
        SpacyWordTokenizer("en_core_web_sm"),
    ]
)

TypeError: ignored

`Labeler.visualize` shows you what the network sees: 
- `byte` is the UTF-8 encoded text. This has changed in the newest version of NNSplit. Previously characters were used, but using bytes allows NNSplit to work for any language regardless of the characters used to represent it.
- The other rows depend on the `Labeler` and determine what the neural network tries to predict.
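To see why bytes make the input language-independent, here is a short sketch using only the standard library. The sample text is my own and has nothing to do with the training data.

```python
text = "Ein Test. 这是一个测试。"

# Every string, whatever its script, becomes a sequence of integers in 0..255.
byte_values = list(text.encode("utf-8"))

# ASCII characters take one byte; other characters take two to four.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]

print(len(text), "characters ->", len(byte_values), "bytes")
print(bytes_per_char)
```

The vocabulary is therefore fixed at 256 possible values, whereas a character-level vocabulary would have to grow with every new script.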

In [None]:
labeler.visualize("This is a test. This is another test.")

NameError: ignored

## Start training!

Now we can finally start training. 

`train_size` determines how many entries in the dataset to sample for each epoch. 

Using spaCy with multiprocessing leaks memory, so memory usage increases continuously during each epoch and resets at the end. You will therefore have to set `train_size` to a value that fits the available memory. `500_000` works well in Colab.
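One way to picture `train_size` is as a fresh random subset of dataset positions drawn for every epoch. The sketch below is an assumption about the general idea, not the sampling code inside `Network`; `sample_epoch_indices` is a hypothetical helper and the sizes are scaled down.

```python
import random


def sample_epoch_indices(dataset_size, train_size, seed):
    """Pick `train_size` distinct dataset positions for one epoch."""
    rng = random.Random(seed)
    return rng.sample(range(dataset_size), k=min(train_size, dataset_size))


# Hypothetical sizes, scaled down so the example runs instantly.
epoch_0 = sample_epoch_indices(dataset_size=10_000, train_size=500, seed=0)
epoch_1 = sample_epoch_indices(dataset_size=10_000, train_size=500, seed=1)
```

A smaller `train_size` means shorter epochs, so the leaked memory is released more often.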


In [None]:
hparams.gpus = 1
hparams.max_epochs = 4
hparams.train_size = 500_000
hparams.predict_indices = [0, 1] # which split levels of the labeler to predict
# how to weigh the selected indices
# in general sentence boundary detection (index 0 here) should be weighed the highest
hparams.level_weights = [2.0, 0.1]
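As a rough picture of what per-level weights do, the sketch below scales each level's loss by its weight and sums the results. This is an assumption for illustration: the notebook does not show how `Network` combines the levels, and the functions, probabilities and weights here are hypothetical.

```python
import math


def binary_cross_entropy(probs, targets):
    """Mean binary cross-entropy over one level's per-byte predictions."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for p, t in zip(probs, targets):
        total += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
    return total / len(probs)


def weighted_loss(level_probs, level_targets, level_weights):
    """Weighted sum of the per-level losses."""
    return sum(
        w * binary_cross_entropy(p, t)
        for p, t, w in zip(level_probs, level_targets, level_weights)
    )


# Two levels with two byte positions each (hypothetical numbers).
level_probs = [[0.9, 0.1], [0.5, 0.5]]
level_targets = [[1, 0], [1, 0]]
example_loss = weighted_loss(level_probs, level_targets, level_weights=[2.0, 0.1])
```

With a setup like this, mistakes on the heavily weighted level dominate the total, which is the stated reason for giving sentence boundaries the largest weight.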

Instantiate the network.

In [None]:
model = Network(
  text_dataset,
  labeler,
  hparams,
)
model

Instantiate the `pl.trainer.Trainer`.

In [None]:
trainer = Trainer.from_argparse_args(hparams)

And fit the model. Each row of the F1 and precision scores corresponds to one tokenizer of the `Labeler`.

In [None]:
trainer.fit(model)

Finally, store the trained model somewhere. This saves a `.onnx` export of the model in the specified directory.

In [None]:
# onnx metadata which determines how to use the prediction indices to split text
metadata = {
    "split_sequence": json.dumps(
        {
            "instructions": [
                ["Sentence", {"PredictionIndex": 0}],
                ["Token", {"PredictionIndex": 1}],
                ["_Whitespace", {"Function": "whitespace"}],
            ]
        }
    )
}
model.store("en", metadata)
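Note that the `split_sequence` value is a JSON string and not a nested dict, so it has to be decoded again before use. A minimal round trip with the standard library (how NNSplit itself reads this value is not shown here, and `level_to_index` is just an illustrative name):

```python
import json

metadata = {
    "split_sequence": json.dumps(
        {
            "instructions": [
                ["Sentence", {"PredictionIndex": 0}],
                ["Token", {"PredictionIndex": 1}],
                ["_Whitespace", {"Function": "whitespace"}],
            ]
        }
    )
}

# The value is a JSON string, so it has to be decoded before use.
assert isinstance(metadata["split_sequence"], str)
instructions = json.loads(metadata["split_sequence"])["instructions"]

# Map each level name to its prediction index (None if the entry has none).
level_to_index = {name: spec.get("PredictionIndex") for name, spec in instructions}
```

The indices line up with the order of the tokenizers passed to the `Labeler` and with `hparams.predict_indices`.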

# Load the model in NNSplit

First, install NNSplit.

In [None]:
!pip install nnsplit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nnsplit
  Downloading nnsplit-0.5.8_post0-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 5.1 MB/s
Collecting onnxruntime==1.7
  Downloading onnxruntime-1.7.0-cp37-cp37m-manylinux2014_x86_64.whl (4.1 MB)
     |████████████████████████████████| 4.1 MB 47.0 MB/s
Installing collected packages: onnxruntime, nnsplit
Successfully installed nnsplit-0.5.8.post0 onnxruntime-1.7.0


In [None]:
from nnsplit import NNSplit

Instantiate the splitter.

In [None]:
splitter = NNSplit("/content/nnsplit/models/en/model.onnx", use_cuda=False)

And split a text!

In [None]:
splits = splitter.split(["This is a test This is another test."])[0]
splits

Split(Split(Split('This', ' '), Split('is', ' '), Split('a', ' '), Split('test', ' ')), Split(Split('This', ' '), Split('is', ' '), Split('another', ' '), Split('test', ''), Split('.', '')))

The public API of NNSplit has changed significantly, making it much easier to use now. Everything is a `nnsplit.Split` which can be iterated over or stringified with `str(...)`.

In [None]:
for sentence in splits:
    print(str(sentence).ljust(30), type(sentence))

This is a test                 <class 'builtins.Split'>
This is another test.          <class 'builtins.Split'>
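If all you need is a plain list of sentence strings, iterating and stringifying is enough. The sketch below uses `FakeSplit`, a hypothetical stand-in for `nnsplit.Split`, so it runs without a trained model. I am assuming that stringifying concatenates everything below a level, including whitespace; `sentences_as_strings` is likewise my own helper.

```python
class FakeSplit:
    """Hypothetical stand-in for `nnsplit.Split`, used only so this runs without a model."""

    def __init__(self, *parts):
        self.parts = parts

    def __iter__(self):
        return iter(self.parts)

    def __str__(self):
        # Assumption: stringifying concatenates everything below this level.
        return "".join(str(part) for part in self.parts)


def sentences_as_strings(splits):
    """Turn the top-level split into a plain list of sentence strings."""
    return [str(sentence) for sentence in splits]


fake_splits = FakeSplit(
    FakeSplit(FakeSplit("This", " "), FakeSplit("is", " "),
              FakeSplit("a", " "), FakeSplit("test", " ")),
    FakeSplit(FakeSplit("This", " "), FakeSplit("is", " "),
              FakeSplit("another", " "), FakeSplit("test", ""), FakeSplit(".", "")),
)
sentences = sentences_as_strings(fake_splits)
```

With the real library, the same helper should work on the `splits` object from above, since it only relies on iteration and `str(...)`.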


Or if you want to go token-level:

In [None]:
for sentence in splits:
    for token in sentence:
        print(str(token).ljust(10), repr(token).ljust(30), type(token))

    print()

This       Split('This', ' ')             <class 'builtins.Split'>
is         Split('is', ' ')               <class 'builtins.Split'>
a          Split('a', ' ')                <class 'builtins.Split'>
test       Split('test', ' ')             <class 'builtins.Split'>

This       Split('This', ' ')             <class 'builtins.Split'>
is         Split('is', ' ')               <class 'builtins.Split'>
another    Split('another', ' ')          <class 'builtins.Split'>
test       Split('test', '')              <class 'builtins.Split'>
.          Split('.', '')                 <class 'builtins.Split'>



This works down to the smallest unit, where iteration yields `str` values instead of `nnsplit.Split` objects.

In [None]:
for sentence in splits:
    for [text, whitespace] in sentence:
        print(text.ljust(10), type(text))
        print(f'"{whitespace}"'.ljust(10), type(whitespace))
        print()

This       <class 'str'>
" "        <class 'str'>

is         <class 'str'>
" "        <class 'str'>

a          <class 'str'>
" "        <class 'str'>

test       <class 'str'>
" "        <class 'str'>

This       <class 'str'>
" "        <class 'str'>

is         <class 'str'>
" "        <class 'str'>

another    <class 'str'>
" "        <class 'str'>

test       <class 'str'>
""         <class 'str'>

.          <class 'str'>
""         <class 'str'>



Finally, for some benchmarks: If you are running `NNSplit` on GPU, you can increase the speed on large datasets by using a big batch size.

In [None]:
splitter = NNSplit("/content/nnsplit/models/en/model.onnx", use_cuda=False, batch_size=2**14)

In [None]:
text = "This is a test This is another test."

%timeit splitter.split([text])[0]
%timeit splitter.split([text] * 100)[0]
%timeit splitter.split([text] * 1000)[0]
%timeit splitter.split([text] * 10_000)[0]

1000 loops, best of 5: 1.59 ms per loop
10 loops, best of 5: 86.2 ms per loop
1 loop, best of 5: 919 ms per loop
1 loop, best of 5: 9.51 s per loop
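To compare the runs, it helps to convert the timings into throughput. The numbers below are copied from the `%timeit` output above; nothing else is measured here.

```python
# (number of texts, seconds per call) copied from the %timeit output above
timings = [(1, 1.59e-3), (100, 86.2e-3), (1_000, 919e-3), (10_000, 9.51)]

# Throughput is simply texts processed divided by wall-clock time.
throughput = {n: n / seconds for n, seconds in timings}

for n, texts_per_second in throughput.items():
    print(f"{n:>6} texts per call: {texts_per_second:8.0f} texts/s")
```

In this run, going from one text to 100 per call roughly doubles the throughput, after which it levels off at a bit over 1,000 texts per second.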


And voilà! In the CPU run above (`use_cuda=False`), splitting 10,000 short texts took about 9.5 seconds.