Recently I was working with [`spaCy`](https://spacy.io/) and wanted to break a [`Doc`](https://spacy.io/api/doc) object up into its paragraphs.
This seemed very similar to what the existing [`SentenceRecognizer`](https://spacy.io/api/sentencerecognizer) and [`Sentencizer`](https://spacy.io/api/sentencizer) components do, so I figured someone must have already done it.
After quite a bit of searching, I didn't find any promising results on the modeling side, but did come across this gist:

<script src="https://gist.github.com/wpm/bf1f2301b98a883b50e903bc3cc86439.js"></script>

Simple.
Straightforward.
The only thing I'd like more is if I could reference the paragraphs of a `Doc` via an [attribute](https://docs.python.org/3/reference/expressions.html#attribute-references) or [property](https://docs.python.org/3/library/functions.html#property).
Something akin to [`Doc.sents`](https://spacy.io/api/doc#sents).
Lucky for me, the `spaCy` devs thought of this and made it easy to do.

# Extensions

<div class="tenor-gif-embed" data-postid="20193769" data-share-method="host" data-aspect-ratio="1" data-width="100%">
    <a href="https://tenor.com/view/tony-talks-iamtonytalks-antonio-baldwin-hair-swing-swinging-hair-gif-20193769">Tony Talks Iamtonytalks GIF</a>from <a href="https://tenor.com/search/tony+talks-gifs">Tony Talks GIFs</a>
</div>
<script type="text/javascript" async src="https://tenor.com/embed.js"></script>
<table>
    <caption>Not those kinds of extensions</caption>
</table>

Per the `spaCy` [docs](https://spacy.io/usage/processing-pipelines#custom-components-attributes):
>spaCy allows you to set any custom attributes and methods on the `Doc`, `Span` and `Token`, which become available as `Doc._`, `Span._` and `Token._`—for example, `Token._.my_attr`. This lets you store additional information relevant to your application, add new features and functionality to spaCy, and implement your own models trained with other machine learning libraries. It also lets you take advantage of spaCy’s data structures and the `Doc` object as the “single source of truth”.
>
>There are three main types of extensions, which can be defined using the [`Doc.set_extension`](https://spacy.io/api/doc#set_extension), [`Span.set_extension`](https://spacy.io/api/span#set_extension) and [`Token.set_extension`](https://spacy.io/api/token#set_extension) methods.
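
To make those three types concrete (attribute, property, and method extensions), here's a minimal sketch. The extension names `source`, `n_tokens`, and `starts_with` are made up for illustration and aren't part of `spaCy`; I also pass `force=True` so the cell can be re-run without complaining that the extension already exists.

```python
import spacy
from spacy.tokens import Doc

# Hypothetical extension names, chosen only for this sketch.
# `force=True` overwrites an existing extension of the same name,
# which keeps the cell safe to re-run.

# 1. Attribute extension: a default value that can be overwritten per `Doc`.
Doc.set_extension("source", default="unknown", force=True)

# 2. Property extension: computed by the getter each time it is accessed.
Doc.set_extension("n_tokens", getter=lambda doc: len(doc), force=True)

# 3. Method extension: a callable that receives the `Doc` as its first argument.
Doc.set_extension(
    "starts_with",
    method=lambda doc, word: doc[0].text == word,
    force=True,
)

nlp = spacy.blank("en")
doc = nlp("Hello world")
doc._.source = "blog post"  # only attribute extensions can be assigned like this

print(doc._.source)
print(doc._.n_tokens)
print(doc._.starts_with("Hello"))
```

Only the attribute extension stores anything on the `Doc`; the other two are evaluated on demand.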

I'm interested in extracting paragraphs from a `Doc`, so I'll use the `Doc.set_extension` method.
To have the extension use the `paragraphs` function from the gist, we need to supply it as an argument to the `getter` parameter.
This is known as a **property extension**.
From the docs:
>**Property extensions**. Define a getter and an optional setter function. If no setter is provided, the extension is immutable. Since the getter and setter functions are only called when you _retrieve_ the attribute, you can also access values of previously added attribute extensions. For example, a `Doc` getter can average over `Token` attributes. For `Span` extensions, you’ll almost always want to use a property—otherwise, you’d have to write to _every possible_ `Span` in the `Doc` to set up the values correctly.

```python
from typing import Generator

import spacy
from spacy.tokens.doc import Doc
from spacy.tokens.span import Span


# I changed the parameter name `document` to `doc`,
# added type hints, and added some whitespace.
def paragraphs(doc: Doc) -> Generator[Span, None, None]:
    start = 0
    for token in doc:
        if token.is_space and token.text.count("\n") > 1:
            yield doc[start:token.i]
            start = token.i

    yield doc[start:]


# We set the `paras` extension globally.
# This means _all_ `Doc` objects will have
# a `_.paras` attribute.
Doc.set_extension(name="paras", getter=paragraphs)
blank = spacy.blank("en")

# Some example text with two paragraphs.
text = """This is a sentence. This is a second sentence. Here is a third.

This is the start of a new paragraph. This is the end of the paragraph."""
doc = blank(text=text)

# Iterate and print each paragraph in `doc`,
# extracted using the logic defined in the
# `paragraphs` function.
paras = doc._.paras
print(*enumerate(paras), sep="\n")
```

```text
(0, This is a sentence. This is a second sentence. Here is a third.)
(1, 

This is the start of a new paragraph. This is the end of the paragraph.)
```


It's not as pretty as I'd like (I'd rather strip the newlines from each paragraph), but it gets the job done.
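
If the leading newlines bother you, one possible tweak is to resume each paragraph *after* the blank-line token rather than on it. This is only a sketch of that idea, not the gist author's code; `stripped_paragraphs` and the `paras_stripped` extension name are hypothetical, and it assumes the tokenizer emits the blank line as a single whitespace token (as it did in the output I got).

```python
from typing import Iterator

import spacy
from spacy.tokens import Doc, Span


def stripped_paragraphs(doc: Doc) -> Iterator[Span]:
    """Yield paragraphs without the blank-line token that separates them."""
    start = 0
    for token in doc:
        if token.is_space and token.text.count("\n") > 1:
            # Skip empty spans, e.g. when the text opens with blank lines.
            if token.i > start:
                yield doc[start:token.i]
            # Resume after the separator token instead of on it.
            start = token.i + 1

    # Yield whatever follows the last separator, if anything.
    if start < len(doc):
        yield doc[start:]


# `paras_stripped` is a hypothetical name; `force=True` allows re-running.
Doc.set_extension("paras_stripped", getter=stripped_paragraphs, force=True)

nlp = spacy.blank("en")
doc = nlp("First paragraph. Still the first.\n\nSecond paragraph.")
for i, para in enumerate(doc._.paras_stripped):
    print(i, repr(para.text))
```

I registered it under a separate name so it can live alongside `_.paras` while comparing the two.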

And I'd be remiss if I didn't show how to remove the `_.paras` attribute (though you rarely need to: the getter only runs when you access `_.paras`, and it returns a generator, so the extension costs almost nothing in memory).

```python
# Note the semicolon (;) to suppress the output.
Doc.remove_extension("paras");
```
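
If you're curious what that semicolon is hiding, here's a small sketch using `Doc.has_extension` to guard registration and check the removal. The `demo_attr` name is hypothetical and exists only for this example.

```python
from spacy.tokens import Doc

# `demo_attr` is a hypothetical extension used only for this sketch.
# Guarding with `has_extension` avoids an error if it is already registered.
if not Doc.has_extension("demo_attr"):
    Doc.set_extension("demo_attr", default=None)

# `remove_extension` returns the removed (default, method, getter, setter) tuple.
removed = Doc.remove_extension("demo_attr")
print(removed)
print(Doc.has_extension("demo_attr"))
```

The same `has_extension` check is handy before calling `set_extension` in a notebook, where cells tend to get re-run.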

Hopefully this has shed some light on the `set_extension` method(s).
Thanks for reading!