
In [0]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [0]:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


### `Doc.sents` is a generator
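It is important to note that `doc.sents` is a *generator*: it yields sentences one at a time and does not support indexing. The sentence boundaries themselves are set when the pipeline runs; `doc.sents` only produces the resulting spans on demand. This means that, while you can fetch a single token by position with `doc[0]`, you can't fetch the "second Doc sentence" with `doc.sents[1]`, which raises `TypeError: 'generator' object is not subscriptable`: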

In [0]:
doc[0]

This
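
The token lookup above works because a Doc supports indexing; a generator does not. The sketch below uses a plain Python generator as a stand-in for `doc.sents` (the name `fake_sents` is hypothetical, and no spaCy is needed to run it). It shows the `TypeError` that subscripting raises, and one way to fetch the n-th item without building a list, using `itertools.islice`.

```python
from itertools import islice

def fake_sents():
    """Hypothetical stand-in for doc.sents: yields one sentence at a time."""
    yield "This is the first sentence."
    yield "This is another sentence."
    yield "This is the last sentence."

# Subscripting a generator fails, just as doc.sents[1] would.
try:
    fake_sents()[1]
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# islice skips the first item and stops after the second one.
second = next(islice(fake_sents(), 1, 2))

print(error_name)
print(second)
```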

However, you *can* build a sentence collection by running `doc.sents` and saving the result to a list:

In [0]:
doc_sents = [sent for sent in doc.sents]
doc_sents

[This is the first sentence.,
 This is another sentence.,
 This is the last sentence.]

In [0]:
# Note that these are Span objects, not text strings.

# list(doc.sents) also works, but a list comprehension lets you filter sentences with a condition.

# Now you can access individual sentences:
print(doc_sents[1])

This is another sentence.


In [0]:
type(doc_sents[1])

spacy.tokens.span.Span

In [0]:
print(doc_sents[1].start, doc_sents[1].end)

6 11
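
The two numbers printed above are token offsets into the Doc: `start` is the index of the sentence's first token and `end` is one past its last token, so they behave like Python slice bounds. The sketch below reproduces this on a plain list; the token list is written by hand to mirror how the example sentence is split (punctuation as separate tokens), and is an assumption rather than spaCy output.

```python
# Hand-written token texts for the example Doc (an assumption, not spaCy output).
tokens = ["This", "is", "the", "first", "sentence", ".",
          "This", "is", "another", "sentence", ".",
          "This", "is", "the", "last", "sentence", "."]

start, end = 6, 11          # the values printed by the cell above
second = tokens[start:end]  # end is exclusive, like a Python slice

print(second)
print(len(second) == end - start)
```

With a real Doc, the equivalent slice would be `doc[doc_sents[1].start:doc_sents[1].end]`.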


## Adding Rules
spaCy's default sentence segmentation relies on the dependency parse and end-of-sentence punctuation to determine segment boundaries. We can add rules of our own, but they have to run *before* the parser when the Doc object is created, as that is where the segment start tokens are assigned:

In [0]:
# Parsing the segmentation start tokens happens during the nlp pipeline
doc2 = nlp(u'This is a sentence. This is a sentence. This is a sentence.')

for token in doc2:
    print(token.is_sent_start, ' '+token.text)

True  This
None  is
None  a
None  sentence
None  .
True  This
None  is
None  a
None  sentence
None  .
True  This
None  is
None  a
None  sentence
None  .


In [0]:
# Notice we haven't run doc2.sents, and yet token.is_sent_start was set to True on three tokens in the Doc.
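
One way to picture this: the flags are already stored on the tokens, and iterating over the sentences only groups tokens according to them. The sketch below mimics that grouping on plain lists copied from the output above. `group_by_sent_start` is a hypothetical helper for illustration, not spaCy's implementation.

```python
def group_by_sent_start(texts, flags):
    """Group token texts into sentences; a True flag opens a new sentence."""
    sentences = []
    for text, flag in zip(texts, flags):
        # Open a new group on a True flag (or for the very first token).
        if flag is True or not sentences:
            sentences.append([])
        sentences[-1].append(text)
    return sentences

# Token texts and is_sent_start values as printed for doc2.
texts = ["This", "is", "a", "sentence", "."] * 3
flags = [True, None, None, None, None] * 3

sentences = group_by_sent_start(texts, flags)
print(len(sentences))
print(sentences[0])
```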

Let's add a semicolon to our existing segmentation rules. That is, whenever a semicolon appears in the text, the token that follows it should start a new segment.

In [0]:
# SPACY'S DEFAULT BEHAVIOR
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc3.sents:
    print(sent)

"
Management is doing things right; leadership is doing the right things.
" -Peter Drucker


In [0]:
for token in doc3:
    print(token.is_sent_start, ' '+token.text)

True  "
True  Management
None  is
None  doing
None  things
None  right
None  ;
None  leadership
None  is
None  doing
None  the
None  right
None  things
None  .
True  "
None  -Peter
None  Drucker


In [0]:
# ADD A NEW RULE TO THE PIPELINE (spaCy v2 style: add_pipe receives the function itself)
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before='parser')

nlp.pipe_names

['tagger', 'set_custom_boundaries', 'parser', 'ner']

The new rule has to run before the document is parsed. Here we can either pass the argument `before='parser'` or `first=True`.
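
The loop in `set_custom_boundaries` iterates over `doc[:-1]` rather than the whole Doc because it looks one token ahead: if the final token were a semicolon, there would be no following token to mark. The sketch below shows the same logic on a plain list of token texts. `sentence_starts` is a hypothetical helper, and the token list is copied from the `doc3` output above.

```python
def sentence_starts(texts):
    """Return the indices of tokens that directly follow a semicolon."""
    starts = []
    # Skip the last token: nothing follows it, so texts[i + 1] would not exist.
    for i, text in enumerate(texts[:-1]):
        if text == ";":
            starts.append(i + 1)
    return starts

texts = ['"', "Management", "is", "doing", "things", "right", ";",
         "leadership", "is", "doing", "the", "right", "things", ".",
         '"', "-Peter", "Drucker"]

starts = sentence_starts(texts)
print(starts, texts[starts[0]])

# A trailing semicolon is simply ignored instead of raising an IndexError.
print(sentence_starts(["a", ";"]))
```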

In [0]:
# Re-run the Doc object creation:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc4.sents:
    print(sent)

"
Management is doing things right;
leadership is doing the right things.
" -Peter Drucker


### Why not change the token directly?
Why not simply set the `.is_sent_start` value to True on existing tokens?

In [0]:
# Find the token we want to change:
doc3[7]

leadership

In [0]:
# Try to change the .is_sent_start attribute:
doc3[7].is_sent_start = True

ValueError: ignored

spaCy refuses to change the `.is_sent_start` attribute after the document has been parsed, to prevent inconsistencies in the data.

## Changing the Rules
In some cases we want to *replace* spaCy's default segmentation with our own set of rules. In this section we'll see how the default behavior breaks on periods. We'll then replace it with a segmenter that breaks on linebreaks.

In [0]:
nlp = spacy.load('en_core_web_sm')  # reset to the original

mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."

# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)

for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'sentence', '.']
['This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n', 'third', 'sentence', '.']


In [0]:
# CHANGING THE RULES (spaCy v2 API: SentenceSegmenter is not available in spaCy v3)
from spacy.pipeline import SentenceSegmenter

def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'): # handles multiple occurrences
            seen_newline = True
    yield doc[start:]      # handles the last group of tokens


sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)

In [0]:
# The function split_on_newlines and the variable sbd can be named anything we want; the SentenceSegmenter component itself is registered in the pipeline under the name 'sbd'.

In [0]:
doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n']
['third', 'sentence', '.']
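
To see why the output is grouped this way, the same control flow can be traced on a plain list of token texts. The sketch below is a pure-Python copy of the strategy with slicing on a list instead of a Doc; the token list is taken from the default-behavior output earlier in this section, and nothing here calls spaCy.

```python
def split_on_newlines(texts):
    """Same control flow as the strategy above, applied to a list of strings."""
    start = 0
    seen_newline = False
    for i, text in enumerate(texts):
        if seen_newline:
            # The previous token was a newline: close the segment before this token.
            yield texts[start:i]
            start = i
            seen_newline = False
        elif text.startswith("\n"):
            seen_newline = True
    yield texts[start:]  # the last group of tokens

texts = ["This", "is", "a", "sentence", ".", "This", "is", "another", ".", "\n\n",
         "This", "is", "a", "\n", "third", "sentence", "."]

segments = list(split_on_newlines(texts))
for segment in segments:
    print(segment)
```

Each newline token stays at the end of the segment it closes, which is why the first two groups end in `'\n\n'` and `'\n'`.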


In [0]:
# Now we've completely overridden the default spaCy behavior
# (splitting on periods) with a new behavior of splitting on newlines.
# Any other delimiter could be used instead of newlines.

for sent in doc.sents:
    print(sent)

This is a sentence. This is another.


This is a 

third sentence.


In [0]:
# Here we see that periods no longer affect segmentation; only linebreaks do. This would be appropriate
# when working with a long list of tweets, for instance, or with contract clauses that each sit on their own line.

In [0]:
my_doc = '''
0.01.17	Formulaire de Soumission [Essentielle]

désigne, relativement au Contrat, le Formulaire de Soumission dûment complété, signé et déposé par l’ENTREPRENEUR pour soumettre sa Soumission relativement à l’Appel d’Offres, subséquemment accepté par l’ORGANISME PUBLIC conformément à la procédure prévue aux Documents d’Appel d’Offres, incluant toutes ses annexes, dont notamment le Bordereau de Prix; 

0.01.18	Institution Financière [Essentielle]

désigne un assureur détenant un permis émis conformément à la Loi sur les assurances (RLRQ, chapitre A-32) l'autorisant à pratiquer l'assurance cautionnement, une société de fiducie titulaire d'un permis délivré en vertu de la Loi sur les sociétés de fiducie et les sociétés d'épargne (RLRQ, chapitre S-29.01), une coopérative de services financiers visée par la Loi sur les coopératives de services financiers (RLRQ, chapitre C-67.3), ou une banque au sens de la Loi sur les banques (L.C. 1991, c. 46);

0.01.19	Loi [Essentielle]

désigne, selon le cas, qu’il s’agisse d’une juridiction fédérale, provinciale, municipale ou étrangère, une loi, un règlement, une ordonnance, un décret, un arrêté-en-conseil, une directive ou politique administrative ou autre instrument législatif ou exécutif d’une autorité publique, une règle de droit commun ainsi que toute décision judiciaire et administrative par un tribunal compétent se rapportant à leur validité, interprétation et application et comprend, lorsque requis, un traité international et un accord inter-provincial ou inter-gouvernemental;

'''

In [0]:
doc = nlp(my_doc)

for sent in doc.sents:
    print(sent)

In [0]:
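doc = nlp(my_doc)

# Show the tokens of each segment, so the newline tokens that end a segment are visible.
for sent in doc.sents:
    print([token.text for token in sent])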