Fast, robust sentence splitting with bindings for Python, Rust and JavaScript.
- Robust: Does not depend on proper punctuation and casing to split text into sentences.
- Small: NNSplit uses a byte-level LSTM, so the weights are very small, which makes it easy to run in the browser.
- Portable: Models are trained in Python, but inference can be done from JavaScript, Rust and Python.
- Fast: Can run on your GPU to split 10k short texts in less than 400ms in Colab. See train.ipynb.
NNSplit comes with pretrained models. They were evaluated on the OPUS Open Subtitles dataset by concatenating 2-4 sentences and measuring the fraction of concatenations that were split exactly back into the original sentences.
See evaluate.ipynb for details.
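To make the protocol concrete, here is a rough sketch of the idea; the sentence list and the helper function are illustrative assumptions, not the actual code from evaluate.ipynb:

```python
import random

from nnsplit import NNSplit

splitter = NNSplit.load("en")

def exact_split_fraction(sentences, n_samples=1000):
    """Fraction of concatenations split exactly back into the original sentences."""
    correct = 0
    for _ in range(n_samples):
        # concatenate 2-4 random sentences into a single text
        sample = random.choices(sentences, k=random.randint(2, 4))
        text = " ".join(sample)
        # compare the predicted sentences against the original ones
        predicted = [str(split).strip() for split in splitter.split([text])[0]]
        correct += predicted == [s.strip() for s in sample]
    return correct / n_samples
```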
English

|  | NNSplit | Spacy (Tagger) | Spacy (Sentencizer) |
|---|---|---|---|
| Clean | 0.754371 | 0.853603 | 0.820934 |
| Partial punctuation | 0.485907 | 0.517829 | 0.249753 |
| Partial case | 0.761754 | 0.825119 | 0.819679 |
| Partial punctuation and case | 0.443704 | 0.458619 | 0.249873 |
| No punctuation and case | 0.166273 | 0.180859 | 0.00463281 |
German

|  | NNSplit | Spacy (Tagger) | Spacy (Sentencizer) |
|---|---|---|---|
| Clean | 0.818902 | 0.833368 | 0.878471 |
| Partial punctuation | 0.463999 | 0.426458 | 0.266312 |
| Partial case | 0.823565 | 0.792839 | 0.876678 |
| Partial punctuation and case | 0.447231 | 0.377201 | 0.26697 |
| No punctuation and case | 0.198165 | 0.0952267 | 0.00756195 |
The Python bindings have onnxruntime as their only dependency.
Install NNSplit with pip: `pip install nnsplit`
To enable GPU support, additionally install onnxruntime-gpu: `pip install onnxruntime-gpu`.
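Whether the GPU build is actually picked up can be checked directly via onnxruntime, for example:

```python
import onnxruntime

# prints "GPU" if onnxruntime-gpu is installed and a GPU is available, otherwise "CPU"
print(onnxruntime.get_device())
```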
```python
from nnsplit import NNSplit

splitter = NNSplit.load("en")

# returns `Split` objects
splits = splitter.split(["This is a test This is another test."])[0]

# a `Split` can be iterated over to yield smaller splits or stringified with `str(...)`
for sentence in splits:
    print(sentence)
```
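Since `split` takes a list of texts, a whole batch can be processed in one call, for example:

```python
texts = [
    "This is a test This is another test.",
    "And a second text It also has two sentences.",
]

# one `Split` object is returned per input text
for text_split in splitter.split(texts):
    print([str(sentence) for sentence in text_split])
```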
The JavaScript bindings for NNSplit have tractjs as the only dependency.
Install them with npm: `npm install nnsplit`
The JavaScript API has no `.load(model_name)` method to load a pretrained model. Instead, the first argument to `NNSplit.new` has to be a path to a model in your file system (in Node.js) or a path accessible via `fetch` (in the browser). See models to download the `model.onnx` files for the pretrained models.
```javascript
const nnsplit = require("nnsplit");

async function run() {
    const splitter = await nnsplit.NNSplit.new("path/to/model.onnx");

    let splits = (await splitter.split(["This is a test This is another test."]))[0];
    // logs the sentences; use `x.parts` to get the smaller subsplits
    console.log(splits.parts.map((x) => x.text));
}

run();
NNSplit in the browser currently only works with a bundler and has to be imported asynchronously. The API is the same as in Node.js. See bindings/javascript/dev_server for a full example.
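For illustration, the asynchronous import with a bundler could look roughly like this; the model path is a placeholder and has to be reachable via `fetch`:

```javascript
async function run() {
    // in the browser, the module has to be imported asynchronously
    const nnsplit = await import("nnsplit");

    const splitter = await nnsplit.NNSplit.new("/model.onnx");
    const splits = (await splitter.split(["This is a test This is another test."]))[0];
    console.log(splits.parts.map((x) => x.text));
}

run();
```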
Add NNSplit as a dependency to your Cargo.toml:

```toml
# ...
[dependencies.nnsplit]
version = "<version>"
# "model-loader" automatically downloads pretrained models, "tract-backend" uses tract for inference
features = ["model-loader", "tract-backend"]
# ...
```
```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let splitter = nnsplit::NNSplit::load("en", nnsplit::NNSplitOptions::default())?;

    let input: Vec<&str> = vec!["This is a test This is another test."];
    let splits = &splitter.split(&input)[0];

    for sentence in splits.iter() {
        println!("{}", sentence.text());
    }

    Ok(())
}
```