# Extension attributes

- Add custom metadata to documents, tokens and spans
- Come in three flavors:
    - Attribute extensions
    - Property extensions
    - Method extensions
- Accessible via the `._` property
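- Use the `*.set_extension` method to register an extension globally on the class
    - The call looks like `*.set_extension(attribute_name, default=True)`, where `*` is `Token`, `Span`, or `Doc`
    - Keyword arguments of `set_extension` (exactly one of `default`, `method`, or `getter` must be given):
        - `default`: value stored on the object, which can be overwritten (attribute extension)
        - `method`: function exposed as a callable method (method extension)
        - `getter`: function whose return value is computed on each access (property extension)
        - `setter`: optional function called on assignment; only valid together with `getter`
        - `force`: overwrite an extension that is already registered under the same name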


Example:
```
# Import global classes
from spacy.tokens import Doc, Token, Span

# Register the extensions on the Doc, Token and Span classes first
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)

# Then set values on existing doc, token and span objects via ._
doc._.title = 'My document'
token._.is_color = True
span._.has_color = False
```
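
The `setter` argument from the list above is not used in the examples that follow, so here is a minimal sketch of a getter/setter pair. The extension name `display_title`, the helpers `get_title` and `set_title`, and the choice to keep the value in `doc.user_data` are assumptions made for illustration. `force=True` is passed only so the cell can be re-run in the same session.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Hypothetical getter: read the stored title, falling back to a placeholder
def get_title(doc):
    return doc.user_data.get("display_title", "untitled")

# Hypothetical setter: normalize the value before storing it on the doc
def set_title(doc, value):
    doc.user_data["display_title"] = value.strip().title()

# A setter is only accepted together with a getter
Doc.set_extension("display_title", getter=get_title, setter=set_title, force=True)

doc = nlp("The sky is blue.")
default_title = doc._.display_title   # getter runs, nothing stored yet
doc._.display_title = "  my document "  # setter runs on assignment
stored_title = doc._.display_title

print(default_title, "-", stored_title)
```

Without a setter, assigning to a getter-based extension is not possible, which is why the property examples below only read their values.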

## Example - Custom "getter" on a Token

In [1]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color)

doc = nlp("The sky is blue.")

print(doc[3]._.is_color, '-', doc[3].text)

True - blue
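
If `is_color` was already registered earlier in the same session (for example with `default=False`), registering it again with a getter raises a `ValueError`. The following is a minimal sketch of checking for an existing extension with `has_extension` and replacing it with `force=True`; it reuses the color list from the cell above.

```python
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

def get_is_color(token):
    return token.text in ("red", "yellow", "blue")

# First registration (force=True only so this cell can be re-run safely)
Token.set_extension("is_color", default=False, force=True)
already_registered = Token.has_extension("is_color")

# Registering the same name again without force raises a ValueError
try:
    Token.set_extension("is_color", getter=get_is_color)
    raised = False
except ValueError:
    raised = True

# force=True replaces the existing definition with the getter
Token.set_extension("is_color", getter=get_is_color, force=True)

doc = nlp("The sky is blue.")
print(already_registered, raised, doc[3]._.is_color)
```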


In [2]:
# Example from spaCy docs
from spacy.tokens import Token

fruit_getter = lambda token: token.text in ("apple", "pear", "banana")

Token.set_extension("is_fruit", getter=fruit_getter)

doc = nlp("I have an apple")

print(f"Is the token '{doc[3].text}' a fruit? {doc[3]._.is_fruit}")

Is the token 'apple' a fruit? True


## Example - Custom "getter" on a Span

In [3]:
from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)

True - sky is blue
False - The sky


## Method extensions

Method extensions make the extension attribute a callable method. You can then pass one or more arguments to it, and compute attribute values dynamically.

The example below checks whether the doc contains a token with the given text.

In [4]:
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)

In [5]:
doc = nlp("The sky is blue.")

print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

True - blue
False - cloud
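
Since a method extension can take more than one argument, here is a small sketch with an extra keyword argument. The extension name `count_token` and its `ignore_case` parameter are hypothetical and chosen only for illustration.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Hypothetical method: count tokens matching a text, optionally ignoring case
def count_token(doc, token_text, ignore_case=False):
    if ignore_case:
        return sum(token.lower_ == token_text.lower() for token in doc)
    return sum(token.text == token_text for token in doc)

# The doc is passed in automatically as the first argument
Doc.set_extension("count_token", method=count_token, force=True)

doc = nlp("The sky is blue and the sea is blue.")
exact = doc._.count_token("the")
loose = doc._.count_token("the", ignore_case=True)

print(exact, loose)
```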


## Examples - Extension attributes

- Use `Token.set_extension` to register `is_country` (default `False`).
- Update it for "Spain" and print it for all tokens.

### Example 1

In [6]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Register the Token extension attribute 'is_country' with the default value False
Token.set_extension("is_country", default=False)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


### Example 2 - Create a 'getter' to get the REVERSE text

In [7]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]


# Register the Token property extension 'reversed' with the getter get_reversed
Token.set_extension("reversed", getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print("reversed:", token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


### Example 3 - Create a 'has_number' extension

In [8]:
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Define the getter function
def get_has_number(doc):
    # Return True if any token in the doc is number-like (token.like_num)
    return any(token.like_num for token in doc)


# Register the Doc property extension 'has_number' with the getter get_has_number
Doc.set_extension("has_number", getter=get_has_number)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)

has_number: True


In [9]:
doc = nlp("I don't have any numbers.")
print("has_number:", doc._.has_number)

has_number: False


### Example 4 - Convert Span to bold/strong HTML format

In [10]:
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()

# Define the method
def to_html(span, tag):
    # Wrap the span text in an HTML tag and return it
    return "<{tag}>{text}</{tag}>".format(tag=tag, text=span.text)


# Register the Span method extension 'to_html' with the method to_html
Span.set_extension("to_html", method=to_html)

# Process the text and call the to_html method on the span with the tag name 'strong'
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html("strong"))

<strong>Hello world</strong>


### Example 5 - Create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location

In [11]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

over fifty years None
first None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie


### Example - Combine Pipeline + Extensions

Write a pipeline component that finds country names and a custom extension attribute that returns a country’s capital, if available.

In [12]:
import requests

req = requests.get("https://raw.githubusercontent.com/samayo/country-json/master/src/country-by-capital-city.json")
# Parse the JSON response once and reuse it for both lists
data = req.json()
COUNTRIES = [item["country"] for item in data]
capitals = [item["city"] for item in data]

In [13]:
print(COUNTRIES[:5])
print(capitals[:5])

['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra']
['Kabul', 'Tirana', 'Alger', 'Fagatogo', 'Andorra la Vella']


In [14]:
CAPITALS = dict(zip(COUNTRIES, capitals))

In [15]:
from spacy.lang.en import English
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

nlp = English()

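# Create matcher that will be used to *find* countries
matcher = PhraseMatcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 the patterns are passed as a single list:
# matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))
matcher.add("COUNTRY", None, *list(nlp.pipe(COUNTRIES)))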

In [16]:
def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc


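# Add the component to the pipeline (spaCy v2 style; spaCy v3 expects a component
# registered with @Language.component and added by its string name)
nlp.add_pipe(countries_component)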
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute 'capital' with the getter get_capital
Span.set_extension("capital", getter=get_capital)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

['countries_component']
[('Czech Republic', 'GPE', 'Praha'), ('Slovakia', 'GPE', 'Bratislava')]
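
The cells above use the spaCy v2 calling conventions. As a rough sketch of the same idea under spaCy v3 (an assumption about the installed version), the component is registered with `@Language.component` and added by name, and the matcher patterns are passed as a list. The two-entry `CAPITALS` dictionary here is a hard-coded stand-in for the downloaded data so the snippet runs without network access.

```python
from spacy.lang.en import English
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Stand-in for the downloaded country/capital data (assumption for this sketch)
CAPITALS = {"Czech Republic": "Praha", "Slovakia": "Bratislava"}

nlp = English()

# spaCy v3: patterns are passed as a single list of Doc objects
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(list(CAPITALS))))

# spaCy v3: register the function under a name, then add it by that name
@Language.component("countries_component")
def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    doc.ents = [Span(doc, start, end, label="GPE") for _, start, end in matcher(doc)]
    return doc

nlp.add_pipe("countries_component")

# Getter that looks up the span text in the dictionary of capitals
Span.set_extension("capital", getter=lambda span: CAPITALS.get(span.text), force=True)

doc = nlp("Czech Republic may help Slovakia protect its airspace")
results = [(ent.text, ent.label_, ent._.capital) for ent in doc.ents]
print(nlp.pipe_names)
print(results)
```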


In [17]:
doc = nlp("""Senegal retained top spot on the continent, moving up two places to reach 20th in the world - their best ever ranking.

Nigeria, who won bronze in Egypt, went up 12 places to 33 on the global list and third in Africa.

Tunisia, the other semi-finalists at the Nations Cup, were second in Africa, behind Senegal, but moved down four places to 29th in the world.

Surprise quarter-finalists Madagascar were rewarded for their impressive run in Egypt, moving up 12 places to 96th overall.

Benin - who knocked out Morocco in the last-16 - went up six places to 82nd in the world with Morocco also going up six places to 41st in the world and fifth in Africa.

Nations Cup hosts Egypt went up nine spots to make the top 50, moving up to 49th overall.

Ghana are just below the Pharaohs in 7th on the African list having maintained their position of 50th in the world.""")


print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

[('Senegal', 'GPE', 'Dakar'), ('Nigeria', 'GPE', 'Abuja'), ('Egypt', 'GPE', 'Cairo'), ('Tunisia', 'GPE', 'Tunis'), ('Senegal', 'GPE', 'Dakar'), ('Madagascar', 'GPE', 'Antananarivo'), ('Egypt', 'GPE', 'Cairo'), ('Benin', 'GPE', 'Porto-Novo'), ('Morocco', 'GPE', 'Rabat'), ('Morocco', 'GPE', 'Rabat'), ('Egypt', 'GPE', 'Cairo'), ('Ghana', 'GPE', 'Accra')]
