# Data Structures: Doc, Span and Token

Now that you know all about the vocabulary and the string store, we can take a look at the most important data structure: the `Doc`, and its views `Token` and `Span`.

## The `Doc` object

The `Doc` is one of the central data structures in spaCy. It is created automatically when you process a text with the `nlp` object, but you can also instantiate the class manually.

In [42]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

After creating the `nlp` object, we can import the `Doc` class from `spacy.tokens`.

In [3]:
# Import the Doc class
from spacy.tokens import Doc

Here we are creating a `Doc` from three words. The spaces are a list of boolean values indicating whether each word is followed by a space. Every token includes that information, even the last one.

In [8]:
# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

The `Doc` class takes three arguments: the shared `vocab`, the words and the spaces.

In [9]:
# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

In [10]:
print(doc.text)

Hello world!
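
The way `words` and `spaces` combine into the final text can be sketched in plain Python. This is only an illustration of the idea, not spaCy's implementation, and `join_tokens` is a hypothetical helper name:

```python
def join_tokens(words, spaces):
    """Rebuild a text from words and per-word trailing-space flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    # Each word is followed by a single space only if its flag is True
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


text = join_tokens(['Hello', 'world', '!'], [True, False, False])
print(text)
```

Every word needs its own flag, which is why the last token also carries a value even though nothing follows it.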


## The `Span` object (1)

The `Span` object is a slice of the `Doc` consisting of one or more tokens.

![](../imgs/span-object.png)

The `Span` takes at least three arguments: the `Doc` it refers to, and the start and end index of the span. Remember that the end index is exclusive.
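
The exclusive end index follows the same convention as Python's list slicing. A plain-Python analogy, with a list of strings standing in for the tokens of a `Doc` (the variable names are illustrative only):

```python
# A list of strings standing in for the tokens of a Doc
words = ['Hello', 'world', '!']

# Start is inclusive, end is exclusive: positions 0 and 1 are selected
start, end = 0, 2
selected = words[start:end]

print(selected)
# The number of selected items is always end - start
print(len(selected) == end - start)
```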

## The `Span` object (2)

To create a `Span` object manually, we can also import the class from `spacy.tokens`:

In [51]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

We can then instantiate it with the `Doc` and the span's start and end index.

In [52]:
# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create the doc the span will refer to
doc = Doc(nlp.vocab, words=words, spaces=spaces)

In [53]:
# Create a span manually
span = Span(doc, 0, 2)

In [54]:
print(span.text)

Hello world


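To add an entity label to the span, we can pass it in as the `label` argument. The label has to exist in the `StringStore`: looking it up with `doc.vocab.strings[u'GREETING']` only computes the hash, and using that hash leads to `ValueError: [E084] Error assigning label ID 12946562419758953770 to span: not in StringStore.` Adding the string with `doc.vocab.strings.add` stores it and returns the hash.

In [63]:
# Create a span with a label:
GREETING = doc.vocab.strings.add(u'GREETING') # add the entity label and get its hash value
span_with_label = Span(doc, 0, 2, label=GREETING)
print(span_with_label.text, span_with_label.label_)

Hello world GREETING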

For consistency, we usually write label names in capital letters.
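
Why a label has to be in the string store before it can be used can be illustrated with a toy two-way lookup table. This is a simplified model written for this explanation, not spaCy's `StringStore`; the class name `ToyStringStore` and the use of `hashlib` are assumptions made for the sketch:

```python
import hashlib


class ToyStringStore:
    """Toy string <-> hash table; NOT spaCy's implementation."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_of(text):
        # Any stable hash works for the illustration; spaCy uses its own
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        # Store the string and return its hash
        key = self.hash_of(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # Computing a hash does not store anything
            return self.hash_of(key)
        # Raises KeyError if the string was never added
        return self._by_hash[key]


store = ToyStringStore()
only_hashed = store["GREETING"]
try:
    store[only_hashed]
    resolved = True
except KeyError:
    resolved = False
print(resolved)

added = store.add("GREETING")
print(store[added])
```

In this toy model, the hash can only be turned back into `'GREETING'` after `add` has been called, which mirrors the `not in StringStore` situation.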

`doc.ents` is writable, so we can add entities manually by overwriting it with a list of spans.

In [24]:
# Add span to the doc.ents
doc.ents = [span_with_label]

## Best practices

A few tips and tricks before we get started.

- `Doc` and `Span` are very powerful and optimized for performance. They give you access to all references and relationships of the words and sentences.
    - If your application needs to output strings, make sure to convert the `Doc` as late as possible. If you do it too early, you lose all relationships between the tokens.
    - To keep things consistent, try to use built-in token attributes wherever possible, for example `token.i` for the token index.
- Also, don't forget to always pass in the shared `vocab`.
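
A minimal sketch of the "convert as late as possible" tip, assuming spaCy is installed and using only the tokenizer of a blank `English` pipeline. The sentence and variable names are made up for the illustration:

```python
from spacy.lang.en import English

# Tokenizer only; no statistical model is needed for this sketch
nlp = English()
doc = nlp("spaCy keeps tokens linked to their document")

# Keep Token objects while processing: each one knows its index in the Doc
long_tokens = [token for token in doc if len(token.text) > 5]
indices = [token.i for token in long_tokens]

# Because they are still tokens, their neighbours stay reachable via the Doc
previous_words = [doc[token.i - 1].text for token in long_tokens if token.i > 0]

# Convert to plain strings only at the very end, for output
output = ", ".join(token.text for token in long_tokens)
print(indices)
print(previous_words)
print(output)
```

Had the tokens been turned into strings in the first step, the index and neighbour lookups would have needed separate bookkeeping.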

## Creating a `Doc`

Let's create some `Doc` objects from scratch! The `nlp` object has already been created for you.

- Import the `Doc` from `spacy.tokens`.
- Create a `Doc` from the `words` and `spaces`. Don't forget to pass in the vocab!

In [25]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ['spaCy', 'is', 'cool', '!']
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


- Import the `Doc` from `spacy.tokens`.
- Create a `Doc` from the words and spaces. Don't forget to pass in the vocab!

In [28]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ['Go', ',', 'get', 'started', '!']
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


- Import the `Doc` from `spacy.tokens`.
- Complete the `words` and `spaces` to match the desired text and create a `doc`.

In [30]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ['Oh', ',', 'really', '?', '!']
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!
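
When completing `words` and `spaces` by hand, the `spaces` list can also be derived from the desired text. The following is a plain-Python sketch under the assumption that tokens are separated by at most one space; `infer_spaces` is a hypothetical helper, not part of spaCy:

```python
def infer_spaces(text, words):
    """Derive the trailing-space flags for words that spell out text."""
    spaces = []
    pos = 0
    for word in words:
        # Each word must start exactly where the previous one ended
        if not text.startswith(word, pos):
            raise ValueError("word %r does not match the text at position %d" % (word, pos))
        pos += len(word)
        # The flag is True if the next character is a space
        has_space = pos < len(text) and text[pos] == " "
        spaces.append(has_space)
        if has_space:
            pos += 1
    return spaces


spaces = infer_spaces("Oh, really?!", ['Oh', ',', 'really', '?', '!'])
print(spaces)
```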


## Docs, spans and entities from scratch
In this exercise, you'll create the `Doc` and `Span` objects manually, and update the named entities – just like spaCy does behind the scenes. A shared `nlp` object has already been created.

- Import the `Doc` and `Span` classes from `spacy.tokens`.
- Use the `Doc` class directly to create a doc from the words and spaces.

In [31]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ['I', 'like', 'David', 'Bowie']
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

I like David Bowie


- Create a `Span` for "David Bowie" from the `doc` and assign it the label "PERSON".

In [46]:
import spacy
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')

In [49]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
PERSON = doc.vocab.strings[u'PERSON'] # get hash value of entity label
span = Span(doc, 2, 4, label=PERSON)
print(span.text, span.label_)



David Bowie PERSON


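Reference: https://spacy.io/usage/linguistic-features#setting-entities

In spaCy v2.1 and later, the label can also be passed directly as a string. On older versions, as in the run below, a string label raises a `TypeError` and the hash value is required instead.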

In [64]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

TypeError: an integer is required

Overwrite the `doc.ents` with a list of one entity, the "David Bowie" span.

In [65]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
# (passed as a hash value, since a string label raised the TypeError above)
PERSON = doc.vocab.strings.add(u'PERSON')
span = Span(doc, 2, 4, label=PERSON)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

[('David Bowie', 'PERSON')]


Perfect! Creating spaCy's objects manually and modifying the entities will come in handy later when you're writing your own information extraction pipelines.
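
Once `doc.ents` has been set, the entity information is also visible on the individual tokens. A small sketch, assuming spaCy is installed and using a blank `English` pipeline so that no model download is needed:

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Add the label to the string store and use the returned hash
person = doc.vocab.strings.add('PERSON')
doc.ents = [Span(doc, 2, 4, label=person)]

# The entity span knows its text, label and token boundaries
entities = [(ent.text, ent.label_, ent.start, ent.end) for ent in doc.ents]
print(entities)

# Tokens inside the span carry the entity type, the others an empty string
token_types = [(token.text, token.ent_type_) for token in doc]
print(token_types)
```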

## Data structures best practices
The code in this example is trying to analyze a text and collect all proper nouns. If the token following the proper noun is a verb, it should also be extracted. A `doc` object has already been created.

```python
# Get all tokens and part-of-speech tags
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == 'PROPN':
        # Check if the next token is a verb
        if pos_tags[index + 1] == 'VERB':
            print('Found a verb after a proper noun!')
```

### Question

Why is the code bad?

### Possible Answers
- The tokens in the result should be converted back to Token objects. This will let you reuse them in spaCy.
- It only uses lists of strings instead of native token attributes. This is often less efficient, and can't express complex relationships.
- `pos_` is the wrong attribute to use for extracting proper nouns. You should use `tag_` and the 'NNP' and 'NNS' labels instead.

In [67]:
# Get all tokens and part-of-speech tags
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == 'PROPN':
        # Check if the next token is a verb
        if pos_tags[index + 1] == 'VERB':
            print('Found a verb after a proper noun!')

Found a verb after a proper noun!

The second answer is correct.

- Rewrite the code to use the native token attributes instead of a list of `pos_tags`.
- Loop over each `token` in the `doc` and check the `token.pos_` attribute.
- Use `doc[token.i + 1]` to check for the next token and its `.pos_` attribute.

In [69]:
# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == 'PROPN':
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == 'VERB':
            print('Found a verb after a proper noun!')
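
One edge case is worth guarding against: if the last token of the `doc` is a proper noun, `doc[token.i + 1]` is out of range. The sketch below adds a bounds check and collects the matches instead of printing a message. It assumes spaCy is installed; the part-of-speech tags are assigned by hand purely for illustration, and `propn_verb_pairs` and `tagged_doc` are hypothetical helper names:

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()


def propn_verb_pairs(doc):
    """Collect (proper noun, verb) pairs of adjacent tokens."""
    pairs = []
    for token in doc:
        # Only look ahead if there is a next token
        if token.pos_ == 'PROPN' and token.i + 1 < len(doc):
            next_token = doc[token.i + 1]
            if next_token.pos_ == 'VERB':
                pairs.append((token.text, next_token.text))
    return pairs


def tagged_doc(words, tags):
    """Build a Doc and set hand-written part-of-speech tags on it."""
    spaces = [True] * (len(words) - 1) + [False]
    doc = Doc(nlp.vocab, words=words, spaces=spaces)
    for token, tag in zip(doc, tags):
        token.pos_ = tag
    return doc


first = tagged_doc(['Berlin', 'looks', 'nice'], ['PROPN', 'VERB', 'ADJ'])
second = tagged_doc(['I', 'like', 'Berlin'], ['PRON', 'VERB', 'PROPN'])
print(propn_verb_pairs(first))
print(propn_verb_pairs(second))
```

The second document ends in a proper noun, so without the bounds check the lookup of the next token would fail.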