# Sentence Segmentation

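spaCy does a great job of segmenting standard sentences, which in ordinary prose usually end with `.`. With a trained pipeline such as `en_core_web_sm`, the boundaries are derived from the dependency parse rather than from punctuation alone. The sentences can be accessed with the generator `doc.sents`, which yields spans of tokens delimited by the flag `token.is_sent_start`.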

However, we may want to re-define how sentences are segmented, for instance:
- Cut sentences when `;` appears
- Cut sentences **only** when `\n` or line breaks appear (e.g., in poetry), not at `.`

This notebook presents how to deal with such cases.

Overview of contents:
1. Examples of Sentence Segmentation
2. Adding New Sentence Segmentation Rules
3. Changing Sentence Segmentation Rules (section left incomplete, because the code didn't work with my spaCy version)

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided via a download link; I haven't found a repository to fork from.*

## 1. Examples of Sentence Segmentation

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [4]:
# Standard sentence segmentation: on `.`
for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [5]:
# doc.sents is a generator!
doc_sents = [sent for sent in doc.sents]
doc_sents

[This is the first sentence.,
 This is another sentence.,
 This is the last sentence.]

In [6]:
list(doc.sents)

[This is the first sentence.,
 This is another sentence.,
 This is the last sentence.]

In [7]:
# Sentences are really Spans!
type(doc_sents[1])

spacy.tokens.span.Span

In [8]:
print(doc_sents[1].start, doc_sents[1].end)

6 11
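
The `token.is_sent_start` flag mentioned in the introduction can be inspected directly. The following is a minimal sketch, not part of the course material: it assumes a blank English pipeline with spaCy's rule-based `sentencizer` instead of `en_core_web_sm`, so that it runs without a model download. The names `nlp_demo`, `doc_demo`, `start_indices` and `second` are only illustrative.

```python
import spacy

# Assumption: a blank pipeline plus the rule-based "sentencizer" stands in
# for en_core_web_sm, so no trained model needs to be downloaded.
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

doc_demo = nlp_demo(
    "This is the first sentence. This is another sentence. This is the last sentence."
)

# Indices of the tokens flagged as the first token of a sentence
start_indices = [token.i for token in doc_demo if token.is_sent_start]
print(start_indices)

# A sentence is a Span over the token range [start, end) of the Doc
second = list(doc_demo.sents)[1]
print(second.start, second.end)
print(doc_demo[second.start:second.end].text == second.text)
```

The `start` and `end` values are token indices into the `Doc`, so slicing the `Doc` with them gives back the same sentence.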


## 2. Adding New Sentence Segmentation Rules

We can **extend** the sentence segmentation rules, for instance to break sentences on `;`.

In [37]:
nlp = spacy.load('en_core_web_sm')

In [38]:
from spacy.language import Language

In [39]:
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

In [40]:
# The quote is treated as a single sentence, but we want to break it at `;`
for sent in doc3.sents:
    print(sent)

"Management is doing things right; leadership is doing the right things."
-Peter Drucker


In [41]:
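# ADD A NEW RULE TO THE PIPELINE
# Note: "colon_eol" is only the registered component name; the rule acts on semicolons.
@Language.component("colon_eol")
def set_custom_boundaries(doc):
    # doc[:-1] skips the last token, so doc[token.i+1] always exists
    for token in doc[:-1]:
        if token.text == ';':
            # Mark the token following `;` as the start of a new sentence
            doc[token.i+1].is_sent_start = True
    return doc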

In [42]:
nlp.add_pipe("colon_eol", before='parser')

<function __main__.set_custom_boundaries(doc)>

In [43]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'colon_eol',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [44]:
# Re-run the Doc object creation
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

In [45]:
# New segmentation
for sent in doc4.sents:
    print(sent)

"Management is doing things right;
leadership is doing the right things."
-Peter Drucker
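
As an alternative to a custom component, spaCy's built-in rule-based `sentencizer` accepts a `punct_chars` setting that lists the characters treated as sentence-final. This is a hedged sketch and not the course's approach: it assumes a blank English pipeline (no parser), uses a simplified version of the quote without quotation marks or attribution, and the names `nlp_rules`, `doc_rules` and `sentences` are only illustrative.

```python
import spacy

# Assumption: a blank pipeline without a parser, so the sentencizer alone
# decides where sentences start.
nlp_rules = spacy.blank("en")

# Passing punct_chars replaces the default list, so the usual
# sentence-final characters are listed again next to `;`.
nlp_rules.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", ";"]})

doc_rules = nlp_rules(
    "Management is doing things right; leadership is doing the right things."
)

sentences = [sent.text for sent in doc_rules.sents]
for sentence in sentences:
    print(sentence)
```

Unlike the custom component above, which adds a boundary on top of the parser's decisions, this variant relies on punctuation only.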


## 3. Changing Sentence Segmentation Rules

We can also **replace** the sentence segmentation rules. For instance, we can break sentences *only* where a newline character `\n` is found.

Note that the code for this section is missing because it did not work and I didn't have time to fix it. I think the reason is a newer version of spaCy, which has changed the interfaces.
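
One possible way to get newline-only segmentation with the spaCy v3 component API is sketched below. It is not the course's code. It assumes a blank English pipeline, where no parser is present and the component is the only source of sentence boundaries; with a trained pipeline, the assumption would be that the component is added before the parser, as in section 2. The component name `newline_only_boundaries` and the variables `nlp_lines`, `doc_lines` and `lines` are hypothetical.

```python
import spacy
from spacy.language import Language


@Language.component("newline_only_boundaries")  # hypothetical component name
def newline_only_boundaries(doc):
    # Set the flag on EVERY token: True for the first token and for any token
    # that follows a newline token, False everywhere else (including after `.`).
    for token in doc:
        token.is_sent_start = token.i == 0 or "\n" in doc[token.i - 1].text
    return doc


# Assumption: blank pipeline, so nothing else sets sentence boundaries.
nlp_lines = spacy.blank("en")
nlp_lines.add_pipe("newline_only_boundaries")

doc_lines = nlp_lines("Roses are red. Sugar is sweet\nViolets are blue. So are you")

# Each sentence span still contains its trailing "\n" token, hence strip()
lines = [sent.text.strip() for sent in doc_lines.sents]
for line in lines:
    print(line)
```

The sketch relies on the tokenizer emitting the newline as its own whitespace token, which is why the check looks at the text of the previous token.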