### Sentence Segmentation and Boundary Detection
    Deciding where sentences begin and end
    ===================================================
    (a) If the token is a period, it ends a sentence.
    (b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
    (c) If the next token is capitalized, then it ends a sentence.
    ===================================================
    Default: spaCy uses the dependency parser to find sentence boundaries.
    Custom (rule-based or manual):
    - You set the boundaries before the doc is parsed.
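
The three rules above can be tried out without spaCy. The following is only a minimal sketch in plain Python, assuming whitespace tokenization (so a period stays attached to its word) and a tiny, hypothetical abbreviation list; the names `ABBREVIATIONS` and `split_sentences` are illustrative and not part of spaCy.

```python
# Hypothetical hand-compiled abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split text into sentences using rules (a) to (c)."""
    tokens = text.split()  # assumption: whitespace tokenization
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue  # rule (a): only a period can end a sentence
        if token.lower() in ABBREVIATIONS:
            continue  # rule (b): a known abbreviation does not end a sentence
        is_last = i == len(tokens) - 1
        if is_last or tokens[i + 1][0].isupper():
            # rule (c): the next token is capitalized (or the text is over)
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no closing period
        sentences.append(" ".join(current))
    return sentences


for sentence in split_sentences("Dr. Smith arrived at 5 p.m. yesterday. He was late."):
    print(sentence)
```

In this sketch "Dr." is skipped by rule (b), and "p.m." does not trigger a split because the next token is lowercase. Rules this simple still fail on cases such as an abbreviation followed by a proper noun, which is one reason spaCy relies on the dependency parser by default.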

In [28]:
import spacy
nlp = spacy.load('en')

In [79]:
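# Manual (custom) boundary rule: start a new sentence after '.' or '?'
def mycustom_boundary(docx):
    # Skip the last token, since no token follows it
    for token in docx[:-1]:
        if token.text == '.' or token.text == '?':
            # Mark the token after the punctuation as a sentence start
            docx[token.i+1].is_sent_start = True
    return docx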

In [80]:
import spacy
nlp = spacy.load('en')

In [81]:
# Add the rule before the parser so the parser respects these boundaries
nlp.add_pipe(mycustom_boundary, before='parser')

In [82]:
mysentence = nlp("This is my first sentence. I am learning spacy boundary detection. How are you? I am curious")

In [83]:
for sentence in mysentence.sents:
    print(sentence)

This is my first sentence.
I am learning spacy boundary detection.
How are you?
I am curious


### Custom Rule-Based Segmentation

In [56]:
from spacy.lang.en import English
from spacy.pipeline import SentenceSegmenter

In [72]:
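def split_on_newlines(doc):
    # Yield one span per sentence, starting a new sentence after each newline token
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline and not word.is_space:
            # First non-space token after a newline: close the previous sentence
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text == "\n":
            seen_newline = True
    if start < len(doc):
        # Whatever remains is the final sentence
        yield doc[start:len(doc)]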

In [73]:
nlp = English()  # Just the language (tokenizer only), with no statistical model
# Use split_on_newlines as the segmentation strategy
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)

In [74]:
doc = nlp(u"But that paltry harvest limited its usefulness. More recently, sensor technology and real-time data collection have produced bumper crops of employee information for companies.")
for sent in doc.sents:
    print(sent.text)

But that paltry harvest limited its usefulness. More recently, sensor technology and real-time data collection have produced bumper crops of employee information for companies.
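
The whole text comes back as a single sentence because the input contains no newline, so `split_on_newlines` never finds a boundary. To see the strategy actually split, here is a rough sketch of the same logic in plain Python, assuming a crude regex tokenizer in place of spaCy's; the `tokenize` helper is hypothetical and only stands in for `nlp`.

```python
import re


def tokenize(text):
    # Crude stand-in for a tokenizer: each newline and each run of
    # non-whitespace characters becomes one token.
    return re.findall(r"\n|\S+", text)


def split_on_newlines(tokens):
    """Same logic as the spaCy strategy, applied to a list of strings."""
    start = 0
    seen_newline = False
    for i, tok in enumerate(tokens):
        if seen_newline and not tok.isspace():
            # First non-space token after a newline starts a new segment
            yield tokens[start:i]
            start = i
            seen_newline = False
        elif tok == "\n":
            seen_newline = True
    if start < len(tokens):
        yield tokens[start:]


def segments(text):
    # Join each token group back into text and drop the trailing newline
    return [" ".join(group).strip() for group in split_on_newlines(tokenize(text))]


print(segments("First sentence. Still the same line."))
print(segments("First line.\nSecond line."))
```

The first call yields one segment and the second yields two, which mirrors what `doc.sents` would return with the newline strategy: periods are ignored, and only a newline starts a new sentence.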
