Fast, robust sentence splitting with bindings for Python, Rust and JavaScript.
- Robust: Does not depend on proper punctuation and casing to split text into sentences.
- Small: NNSplit uses a byte-level LSTM, so the weights are very small, which makes it easy to run in the browser.
- Portable: Models are trained in Python, but inference can be done from JavaScript, Rust and Python.
- Fast: Can run on your GPU to split 10k short texts in less than 400ms in Colab. See train.ipynb.
NNSplit comes with pretrained models. They were evaluated on the OPUS Open Subtitles dataset by concatenating 2-4 sentences and measuring the fraction of concatenations that were split exactly back into the original sentences.
See evaluate.ipynb for details.
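To make the protocol concrete, here is a rough sketch of the idea; the sentence list and the helper function are illustrative assumptions, not the actual code from evaluate.ipynb:

```python
import random

from nnsplit import NNSplit

splitter = NNSplit.load("en")

def exact_split_fraction(sentences, n_samples=1000):
    """Fraction of concatenations split exactly back into the original sentences."""
    correct = 0
    for _ in range(n_samples):
        # concatenate 2-4 random sentences into a single text
        sample = random.choices(sentences, k=random.randint(2, 4))
        text = " ".join(sample)
        # compare the predicted sentences against the original ones
        predicted = [str(split).strip() for split in splitter.split([text])[0]]
        correct += predicted == [s.strip() for s in sample]
    return correct / n_samples
```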
English

|  | NNSplit | Spacy (Tagger) | Spacy (Sentencizer) |
|---|---|---|---|
| Clean | 0.754371 | 0.853603 | 0.820934 |
| Partial punctuation | 0.485907 | 0.517829 | 0.249753 |
| Partial case | 0.761754 | 0.825119 | 0.819679 |
| Partial punctuation and case | 0.443704 | 0.458619 | 0.249873 |
| No punctuation and case | 0.166273 | 0.180859 | 0.00463281 |
German

|  | NNSplit | Spacy (Tagger) | Spacy (Sentencizer) |
|---|---|---|---|
| Clean | 0.818902 | 0.833368 | 0.878471 |
| Partial punctuation | 0.463999 | 0.426458 | 0.266312 |
| Partial case | 0.823565 | 0.792839 | 0.876678 |
| Partial punctuation and case | 0.447231 | 0.377201 | 0.26697 |
| No punctuation and case | 0.198165 | 0.0952267 | 0.00756195 |
The Python bindings have onnxruntime as their only dependency.
Install NNSplit with pip: `pip install nnsplit`
To enable GPU support, additionally install onnxruntime-gpu: `pip install onnxruntime-gpu`.
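Whether the GPU build is actually picked up can be checked directly via onnxruntime, for example:

```python
import onnxruntime

# prints "GPU" if onnxruntime-gpu is installed and a GPU is available, otherwise "CPU"
print(onnxruntime.get_device())
```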
```python
from nnsplit import NNSplit

splitter = NNSplit.load("en")

# returns `Split` objects
splits = splitter.split(["This is a test This is another test."])[0]

# a `Split` can be iterated over to yield smaller splits or stringified with `str(...)`
for sentence in splits:
    print(sentence)
```
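Since `split` takes a list of texts, a whole batch can be processed in one call, for example:

```python
texts = [
    "This is a test This is another test.",
    "And a second text It also has two sentences.",
]

# one `Split` object is returned per input text
for text_split in splitter.split(texts):
    print([str(sentence) for sentence in text_split])
```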
The JavaScript bindings for NNSplit have tractjs as the only dependency.
Install them with npm: `npm install nnsplit`
The JavaScript API has no `.load(model_name)` method to load a pretrained model. Instead, the first argument to `NNSplit.new` has to be a path to a model in your file system (in Node.js) or a path accessible via `fetch` (in the browser). See models to download the `model.onnx` files for the pretrained models.
```javascript
const nnsplit = require("nnsplit");

async function run() {
    const splitter = await nnsplit.NNSplit.new("path/to/model.onnx");

    let splits = (await splitter.split(["This is a test This is another test."]))[0];
    // logs the sentences; use `x.parts` to get the smaller subsplits
    console.log(splits.parts.map((x) => x.text));
}

run();
NNSplit in the browser currently only works with a bundler and has to be imported asynchronously. The API is the same as in Node.js. See bindings/javascript/dev_server for a full example.
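For illustration, the asynchronous import with a bundler could look roughly like this; the model path is a placeholder and has to be reachable via `fetch`:

```javascript
async function run() {
    // in the browser, the module has to be imported asynchronously
    const nnsplit = await import("nnsplit");

    const splitter = await nnsplit.NNSplit.new("/model.onnx");
    const splits = (await splitter.split(["This is a test This is another test."]))[0];
    console.log(splits.parts.map((x) => x.text));
}

run();
```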
Add NNSplit as a dependency to your Cargo.toml:

```toml
# ...
[dependencies.nnsplit]
version = "<version>"
# "model-loader" automatically downloads pretrained models, "tract-backend" uses tract for inference
features = ["model-loader", "tract-backend"]
# ...
```
```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let splitter = nnsplit::NNSplit::load("en", nnsplit::NNSplitOptions::default())?;

    let input: Vec<&str> = vec!["This is a test This is another test."];
    let splits = &splitter.split(&input)[0];

    for sentence in splits.iter() {
        println!("{}", sentence.text());
    }

    Ok(())
}
```