# Where we left off
We loaded our clean JSONL data, converted each record into a `spacy.Doc` object, and reviewed some of the sentence boundaries. Now we're going to tune the `senter` component for better sentence boundary detection.

# Training data
Labeling data for NLP can be confusing. Because we're deciding whether each `token` in a `doc` starts a sentence, we need to review every token in the `doc`. We'll use the `"en_core_web_sm"` model to tokenize our text and then label each token `True` if it is the start of a sentence and `False` otherwise. Be careful with the integer shorthand: `1` works as a stand-in for `True`, but in the cells further down, tokens labeled with the integer `0` come back with `is_sent_start` set to `None` (missing) rather than `False`. See [*Converting existing corpora and annotations*](https://spacy.io/usage/training#data-convert) and [*Annotation format for creating training examples*](https://spacy.io/api/data-formats#dict-input) for more info.
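
To make the labeling scheme concrete, here is a minimal sketch in plain Python. The helper name `sent_start_labels` is my own invention (it is not part of spaCy), and the token list is typed out by hand rather than produced by a tokenizer. It builds one boolean per token from the indices of the tokens that open a sentence.

```python
def sent_start_labels(n_tokens, start_indices):
    """Return one boolean per token: True where a sentence starts.

    Hypothetical helper for illustration only, not a spaCy API.
    """
    starts = set(start_indices)
    # Refuse indices that fall outside the doc instead of silently ignoring them.
    out_of_range = sorted(i for i in starts if not 0 <= i < n_tokens)
    if out_of_range:
        raise ValueError(f"start indices out of range: {out_of_range}")
    return [i in starts for i in range(n_tokens)]


# Hand-written tokens for "This is a test\nThis is another test."
tokens = ["This", "is", "a", "test", "\n", "This", "is", "another", "test", "."]

# Sentences start at token 0 and token 5 (the second "This").
labels = sent_start_labels(len(tokens), [0, 5])

for token, label in zip(tokens, labels):
    print(repr(token), label)
```

A list shaped like `labels`, with exactly one entry per token, is what we pass under the `"sent_starts"` key later in this notebook.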

Labeling our tokens will be tedious -- we have to go one `doc` at a time. If you have a [`prodigy`](https://prodi.gy/) license, this would be a great time to use the [Sentence Segmentation](https://prodi.gy/docs/recipes/#sent) recipe. For those of you who don't have `prodigy`, I've built you a [lightweight `streamlit` app](link/to/app) to make our lives easier.
> NOTE: if you'd like to know how I built the app you can follow my tutorial [here](link/to/app/tutorial).

In [158]:
import json

import spacy


with open("../data/clean.jsonl", "r") as f:
    # one JSON object per line
    data = [json.loads(line) for line in f]
# NOTE: I'm starting from a blank English pipeline instead of using all of "en_core_web_sm".
#       **The only components are the tokenizer and the `senter` sourced on the next lines**
nlp = spacy.blank("en")
nlp.add_pipe("senter", source=spacy.load("en_core_web_sm"))
# `nlp.pipe` is lazy, so each `next(docs)` processes one more doc
docs = nlp.pipe(item["body"] for item in data if "body" in item)
doc = next(docs)

In [159]:
doc


So I understand what that means, but is there a well-known alternative that is more open-standards friendly, not proprietary?  What driver do you use and/or recommend and what are the advantages of it?

In [160]:
sent = next(doc.sents)

In [235]:
doc = nlp("This is a test\nThis is another test.")

In [236]:
[(t.text, t.is_sent_start) for t in doc]

[('This', True),
 ('is', False),
 ('a', False),
 ('test', False),
 ('\n', False),
 ('This', False),
 ('is', False),
 ('another', False),
 ('test', False),
 ('.', False)]

In [267]:
from spacy.tokens import DocBin
from spacy.training import Example


# example doc
doc = nlp("123 This is a test\nThis is another test.")

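# make reference (gold-standard) doc: one label per token, 1 marks a sentence start
# NOTE: the 0 labels show up as `None` (missing), not `False`, in the outputs below
eg = Example.from_dict(doc, {"sent_starts": [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]})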

# add reference doc to DocBin
db = DocBin()
db.add(eg.reference)

In [269]:
# predicted
[*doc.sents]

[123 This is a test
 This is another test.]

In [270]:
doc_bin_doc = [*db.get_docs(nlp.vocab)][0]

In [271]:
# reference
[*doc_bin_doc.sents]

[123, This is a test, This is another test.]

In [272]:
[(t.text, t.is_sent_start) for t in doc_bin_doc]

[('123', None),
 ('This', True),
 ('is', None),
 ('a', None),
 ('test', None),
 ('\n', None),
 ('This', True),
 ('is', None),
 ('another', None),
 ('test', None),
 ('.', None)]

In [273]:
[(t.text, t.is_sent_start) for t in eg.reference]

[('123', None),
 ('This', True),
 ('is', None),
 ('a', None),
 ('test', None),
 ('\n', None),
 ('This', True),
 ('is', None),
 ('another', None),
 ('test', None),
 ('.', None)]
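
As a quick sanity check on labels like the ones above, here is a small sketch in plain Python that groups `(text, is_sent_start)` pairs back into sentences. The helper `group_sentences` is hypothetical (my own name, not a spaCy function), and it assumes that anything other than `True` means "not a sentence start". That assumption is only meant for eyeballing labels, not as a description of how spaCy builds `doc.sents`.

```python
def group_sentences(pairs):
    """Group (text, is_sent_start) pairs into lists of token texts.

    Hypothetical helper: only an explicit True opens a new sentence;
    None and False are both treated as "continue the current sentence".
    """
    sentences = []
    for text, is_start in pairs:
        # Open a new sentence on True, or when nothing has been started yet.
        if is_start is True or not sentences:
            sentences.append([])
        sentences[-1].append(text)
    return sentences


# The pairs printed above for the reference doc, typed out by hand.
pairs = [
    ("123", None),
    ("This", True),
    ("is", None),
    ("a", None),
    ("test", None),
    ("\n", None),
    ("This", True),
    ("is", None),
    ("another", None),
    ("test", None),
    (".", None),
]

sentences = group_sentences(pairs)
for sentence in sentences:
    print(sentence)
```

The three groups line up with the three reference sentences shown earlier, which is a cheap way to confirm the labels say what we meant before training on them.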