# Working with container objects and customizing spaCy

Token, Span and Doc are the most widely used container objects in spaCy. They represent a token, a phrase or sentence, and a text, respectively.

We can create a Doc object by calling its constructor explicitly:

In [3]:
import spacy
from spacy.tokens.doc import Doc
from spacy.vocab import Vocab
doc = Doc(Vocab(), words=[u'Hi', u'there'])
doc

Hi there 
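The constructor also accepts a `spaces` list that records whether each token is followed by a space, so the Doc can reproduce the original text exactly. The following is a minimal sketch, assuming spaCy v3; the sample words are made up for illustration.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# One boolean per word: True if the token is followed by a space.
words = ["Hi", "there", "!"]
spaces = [True, False, False]

doc = Doc(Vocab(), words=words, spaces=spaces)

print(doc.text)               # text rebuilt from words and spaces
print([t.text for t in doc])  # the individual Token texts
print(len(doc))               # number of tokens
```

Without `spaces`, every token is assumed to be followed by a space, which is why the cell above prints a trailing space after "there".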

### Iterating over a Token's Syntactic Children
We can obtain the leftward syntactic children of the word "apple" in the sample sentence below.

In [6]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I want a green apple')
print([w for w in doc[4].lefts])

[a, green]


We can also use Token.rights to find the rightward children of a word.

In [7]:
print([w for w in doc[1].rights])

[apple]
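To see how `lefts`, `rights` and `children` relate without depending on a trained model, we can build the same sentence by hand. This is only a sketch: the `heads` and `deps` values below are hand-written assumptions for illustration, not parser output, and the code assumes spaCy v3, where `heads` holds absolute token indices.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["I", "want", "a", "green", "apple"]
# heads[i] is the index of token i's syntactic head; the root points to itself.
heads = [1, 1, 4, 4, 1]
deps = ["nsubj", "ROOT", "det", "amod", "dobj"]

doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

apple, want = doc[4], doc[1]
apple_lefts = [t.text for t in apple.lefts]      # children to the left of "apple"
want_rights = [t.text for t in want.rights]      # children to the right of "want"
want_children = [t.text for t in want.children]  # all children, in document order

print(apple_lefts, apple.n_lefts)    # ['a', 'green'] 2
print(want_rights, want.n_rights)    # ['apple'] 1
print(want_children)                 # ['I', 'apple']
```

`n_lefts` and `n_rights` give the counts directly, which avoids building a list when only the number of children matters.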


The Doc object's doc.sents property separates a text into individual sentences.

In [8]:
doc = nlp(u'A severe storm hit the beach. It started to rain.')
for sent in doc.sents:
    print([sent[i] for i in range(len(sent))])

[A, severe, storm, hit, the, beach, .]
[It, started, to, rain, .]


We can still refer to the tokens in a multi-sentence text using the global, document-level indices.

In [9]:
print([doc[i] for i in range(len(doc))])

[A, severe, storm, hit, the, beach, ., It, started, to, rain, .]


The ability to refer to the Token objects in a doc by their sentence-level indices is useful if, for example, we need to check whether the first word in the second sentence of the text being processed is a pronoun.

In [10]:
for i, sent in enumerate(doc.sents):
    if i == 1 and sent[0].pos_ == 'PRON':
        print('The second sentence begins with a pronoun.')

The second sentence begins with a pronoun.
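The relationship between sentence-level and document-level indices can be made explicit with `Token.i` and `Span.start`. The sketch below assumes spaCy v3 and sets the sentence boundaries by hand through `sent_starts` instead of running a parser, so the boundaries are an assumption rather than model output.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = "A severe storm hit the beach . It started to rain .".split()
# True marks the first token of each sentence (tokens 0 and 7).
sent_starts = [i in (0, 7) for i in range(len(words))]

doc = Doc(Vocab(), words=words, sent_starts=sent_starts)

sentences = list(doc.sents)
second = sentences[1]

# second[0] uses a sentence-level index; .i is that token's document-level index.
first_token = second[0]
print(first_token.text, first_token.i)   # It 7

# A Span records where it starts in the Doc, so the two schemes can be converted.
print(doc[second.start].text)            # It
```

Adding `second.start` to a sentence-level index gives the matching document-level index.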


We can count how many sentences end with a verb by checking the last word of each sentence.

In [11]:
counter = 0
for sent in doc.sents:
    if sent[len(sent) - 2].pos_ == 'VERB':
        counter += 1
print(counter)

1


Note that we subtract 2 from len(sent): the indices end at len(sent) - 1, and the last token in each of these sentences is a period, so the last word sits one position earlier.
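The fixed offset of 2 only works when every sentence ends with exactly one punctuation token. A slightly more defensive variant, sketched below, skips punctuation and inspects the last remaining token. The part-of-speech tags and sentence boundaries are hand-written assumptions for illustration, and the code assumes spaCy v3.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = "A severe storm hit the beach . It started to rain .".split()
pos = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN", "PUNCT",
       "PRON", "VERB", "PART", "VERB", "PUNCT"]
sent_starts = [i in (0, 7) for i in range(len(words))]
doc = Doc(Vocab(), words=words, pos=pos, sent_starts=sent_starts)

counter = 0
for sent in doc.sents:
    # Drop punctuation, then look at the last remaining token.
    content = [t for t in sent if t.pos_ != "PUNCT"]
    if content and content[-1].pos_ == "VERB":
        counter += 1
print(counter)   # 1

# Negative indices also work on a Span: sent[-2] is sent[len(sent) - 2].
last_sent = list(doc.sents)[-1]
print(last_sent[-2].text)   # rain
```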

### The doc.noun_chunks container

The Doc object's doc.noun_chunks property allows us to iterate over the noun chunks in the document.

In [12]:
doc = nlp(u'A noun chunk is a phrase that has a noun as its head.')
for chunk in doc.noun_chunks:
    print(chunk)

A noun chunk
a phrase
that
a noun
its head


Alternatively, we might extract noun chunks by iterating over the nouns in the sentence and finding the syntactic children for each noun to form a chunk.

In [13]:
for token in doc:
    if token.pos_ == 'NOUN':
        chunk = ''
        for w in token.children:
            if w.pos_ == 'DET' or w.pos_ == 'ADJ':
                chunk = chunk + w.text + ' '
        chunk = chunk + token.text
        print(chunk)

noun
A chunk
a phrase
a noun
head


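We can rewrite the previous example to use token.lefts and drop the check for a determiner or adjective. This works here because the words that modify each noun in this sentence are leftward syntactic children of that noun. In general, a noun can also have rightward modifiers, such as a relative clause, so the shortcut is not guaranteed for every sentence.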
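A rewrite based on `Token.lefts` might look like the sketch below. To keep it independent of a trained model, the dependency parse is written by hand; the `heads`, `deps` and `pos` values are assumptions for illustration and may differ from what `en_core_web_sm` predicts. The code assumes spaCy v3.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["A", "noun", "chunk", "is", "a", "phrase", "that",
         "has", "a", "noun", "as", "its", "head", "."]
# heads[i] is the index of token i's head; "is" (index 3) is the root.
heads = [2, 2, 3, 3, 5, 3, 7, 5, 9, 7, 7, 12, 10, 3]
deps = ["det", "compound", "nsubj", "ROOT", "det", "attr", "nsubj",
        "relcl", "det", "dobj", "prep", "poss", "pobj", "punct"]
pos = ["DET", "NOUN", "NOUN", "AUX", "DET", "NOUN", "PRON",
       "VERB", "DET", "NOUN", "ADP", "PRON", "NOUN", "PUNCT"]

doc = Doc(Vocab(), words=words, heads=heads, deps=deps, pos=pos)

chunks = []
for token in doc:
    if token.pos_ == "NOUN":
        # All leftward children, with no filter on their part of speech.
        parts = [w.text for w in token.lefts] + [token.text]
        chunks.append(" ".join(parts))

for chunk in chunks:
    print(chunk)
```

With this parse, dropping the part-of-speech check also picks up the compound "noun" and the possessive "its", so the result is closer to what doc.noun_chunks returns, although the modifier "noun" still appears as a chunk of its own.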

### The Span Object

The Span object is a slice of a Doc object. In spaCy v2, one of its most interesting methods was span.merge, which merged the span into a single token. In spaCy v3 the same is done with the doc.retokenize() context manager and retokenizer.merge, as in the cell below. This is useful when the text contains names consisting of several words.

In [14]:
doc = nlp(u'The Golden Gate Bridge is an iconic landmark in San Francisco.')
print([doc[i] for i in range(len(doc))])
span = doc[1:4]
# Merge the three tokens of the span into a single token.
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

[The, Golden, Gate, Bridge, is, an, iconic, landmark, in, San, Francisco, .]
The the DET det
Golden Gate Bridge Golden Gate Bridge PROPN nsubj
is be AUX ROOT
an an DET det
iconic iconic ADJ amod
landmark landmark NOUN attr
in in ADP prep
San San PROPN compound
Francisco Francisco PROPN pobj
. . PUNCT punct


### Try this
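Let's merge "San" and "Francisco" into a single token in the same way. After the first merge the document has fewer tokens, so "San" and "Francisco" now sit at indices 7 and 8.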

In [15]:
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[7:9])
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

The the DET det
Golden Gate Bridge Golden Gate Bridge PROPN nsubj
is be AUX ROOT
an an DET det
iconic iconic ADJ amod
landmark landmark NOUN attr
in in ADP prep
San Francisco San Francisco PROPN pobj
. . PUNCT punct
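Both merges can also be done in a single `retokenize` block, and `merge` accepts an `attrs` dictionary for setting attributes on the merged token. The sketch below assumes spaCy v3 and uses a blank English pipeline, so only the tokenizer runs; the lemma values are chosen here for illustration.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained model is needed to merge
doc = nlp("The Golden Gate Bridge is an iconic landmark in San Francisco.")

# Both spans use indices from the original tokenization, because the merges
# are applied together when the `with` block exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[1:4], attrs={"LEMMA": "Golden Gate Bridge"})
    retokenizer.merge(doc[9:11], attrs={"LEMMA": "San Francisco"})

texts = [t.text for t in doc]
print(texts)
print(doc[1].lemma_, "|", doc[7].lemma_)
```

Queuing both merges in one block avoids having to recompute the indices of the second span after the first merge.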


## Customizing the text-processing pipeline
We can see which pipeline components are available in our nlp object like this.

In [16]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

We can disable pipeline components. Here we create a processing pipeline without a dependency parser. When we call the nlp instance on the text, the tokens don't receive dependency labels.

In [17]:
nlp = spacy.load('en_core_web_sm', disable=['parser'])

doc = nlp(u'I want a green apple.')
for token in doc:
    print(token.text, token.pos_, token.dep_)

I PRON 
want VERB 
a DET 
green ADJ 
apple NOUN 
. PUNCT 
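Components can also be disabled temporarily, and `Doc.has_annotation` reports whether a given annotation was set at all. The sketch below assumes spaCy v3 and uses a blank pipeline with the rule-based `sentencizer` as a stand-in for the parser, so that it runs without a trained model.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter, needs no model
text = "A severe storm hit the beach. It started to rain."

# Temporarily disable the component; it is restored when the block exits.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
    doc_without = nlp(text)

doc_with = nlp(text)

print(names_inside, nlp.pipe_names)
print(doc_without.has_annotation("SENT_START"),
      doc_with.has_annotation("SENT_START"))
```

The same check with `has_annotation("DEP")` can be used to confirm that a Doc processed without the parser carries no dependency labels.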


### Loading a model

In [26]:
nlp = spacy.load('en_core_web_sm')

spacy.load does a few things behind the scenes:
1. It looks at the name of the model to be loaded to determine which Language class to initialize.
2. It iterates over the processing pipeline names and creates the corresponding components to add to the processing pipeline.
3. It loads the model data from disk and makes it available to the Language instance.

We can piece together the model name as shown below.

In [19]:
print(nlp.meta['lang'] + '_' + nlp.meta['name'])

en_core_web_sm


The meta attribute contains the metadata of the language model. We can find where the model package is installed using the code below.

In [20]:
from spacy import util
util.get_package_path('en_core_web_sm')

PosixPath('/home/jose/VSCodeProjects/NLPwithPythonandSpacy/venv/lib/python3.10/site-packages/en_core_web_sm')

We also need the model version, because it is part of the name of the folder that holds the model data.

In [21]:
print(nlp.meta['lang'] + '_' + nlp.meta['name'] + '-' + nlp.meta['version'])

en_core_web_sm-3.3.0


We can look at the list of components used with the model.

In [22]:
nlp.meta['pipeline']

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [24]:
lang = 'en'
model_data_path = '/home/jose/VSCodeProjects/NLPwithPythonandSpacy/venv/lib/python3.10/site-packages/en_core_web_sm/en_core_web_sm-3.3.0'
# 1. Get the Language class for the language code, e.g. English.
lang_cls = spacy.util.get_lang_class(lang)
# 2. Create the pipeline components described in the model's config.cfg.
config = spacy.util.load_config(model_data_path + '/config.cfg')
nlp = lang_cls.from_config(config)
# 3. Load the trained weights and other binary data from disk.
nlp.from_disk(model_data_path)

Note: in spaCy v2 the second step was written as a loop that called nlp.create_pipe(name) and then nlp.add_pipe(component). In spaCy v3 that raises ValueError [E966], because nlp.add_pipe now takes the string name of a registered component factory instead of a component object:

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

For a trained model, the components should be built from the model's config.cfg, as above, so that their settings match the data saved on disk.
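The three loading steps can be tried out without an installed model. The sketch below, assuming spaCy v3, first saves a small blank pipeline to a temporary directory as a stand-in for a model package, then rebuilds it by hand from its `config.cfg`. It illustrates the idea and is not the exact code path of `spacy.load`.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import util

# Stand-in for an installed model: a blank English pipeline saved to disk.
source_nlp = spacy.blank("en")
source_nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir)
    source_nlp.to_disk(model_path)

    # 1. Read the config to find the language and the pipeline definition.
    config = util.load_config(model_path / "config.cfg")
    lang_cls = util.get_lang_class(config["nlp"]["lang"])
    # 2. Build the components described by the config.
    rebuilt_nlp = lang_cls.from_config(config)
    # 3. Load the saved data into those components.
    rebuilt_nlp.from_disk(model_path)

print(type(rebuilt_nlp).__name__, rebuilt_nlp.pipe_names)
```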

### Customizing the pipeline components

In [27]:
doc = nlp(u'I need a taxi to Festy.')
for ent in doc.ents:
    print(ent.text, ent.label_)

Festy ORG


By default, the entity recognizer labels Festy as an organization (ORG covers companies, agencies and other institutions). We want it to be recognized as a city district instead. To do that, we can create a new label, DISTRICT, and provide training examples that show the entity recognizer when to apply it.

In [28]:
LABEL = 'DISTRICT'
TRAIN_DATA = [
    ('We need to deliver it to Festy.', {
        'entities': [(25, 30, 'DISTRICT')]
    }),
    ('I like red oranges', {
        'entities': []
    })
]

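Each training sample pairs a text with its annotations. If the sample contains an entity, we provide its start and end character offsets (the end offset is exclusive) and its type. The second sample contains no entity, so its list is empty. We can now add the DISTRICT label to the entity recognizer. To do so, we must first get the instance of the ner pipeline component.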
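Character offsets are easy to get wrong when counted by hand, so it can help to compute and check them in code. The helper below, `entity_offsets`, is a hypothetical function written for this sketch and is not part of spaCy; it uses only the Python standard library.

```python
def entity_offsets(text, phrase, label):
    """Return a (start, end, label) tuple for the first occurrence of phrase."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    # The end offset is exclusive, matching Python slicing.
    return (start, start + len(phrase), label)


text = "We need to deliver it to Festy."
start, end, label = entity_offsets(text, "Festy", "DISTRICT")

print((start, end, label))   # (25, 30, 'DISTRICT')
print(text[start:end])       # Festy
```

Slicing the text with the offsets, as in the last line, is a quick way to confirm that an annotation covers exactly the intended characters.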

In [29]:
ner = nlp.get_pipe('ner')

Now, we can add a label to it using the .add_label method.

In [30]:
ner.add_label(LABEL)

1

We also need to disable other pipes to make sure only the entity recognizer will be updated during the training process.

In [31]:
nlp.disable_pipes('tagger')
nlp.disable_pipes('parser')

['parser']

Now, we can train the entity recognizer using the samples in the TRAIN_DATA we created earlier.

In [32]:
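import random
from spacy.training import Example

# spaCy v3: nlp.entity no longer exists (it raises AttributeError);
# resume_training() returns an optimizer for an already trained pipeline.
optimizer = nlp.resume_training()

for i in range(25):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        # nlp.update now takes Example objects instead of raw texts and dicts.
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)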
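In spaCy v3, `nlp.update` works on `Example` objects, which pair a Doc with its reference annotations. The sketch below shows that API on a blank pipeline trained from scratch, so it runs without the pretrained model. The tiny data set is the one from this section, and the resulting model is far too small to be useful; the point is only the shape of the training loop.

```python
import random

import spacy
from spacy.training import Example

TRAIN_DATA = [
    ("We need to deliver it to Festy.", {"entities": [(25, 30, "DISTRICT")]}),
    ("I like red oranges", {"entities": []}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("DISTRICT")

# Convert (text, annotations) pairs into Example objects.
examples = [Example.from_dict(nlp.make_doc(text), annotations)
            for text, annotations in TRAIN_DATA]
gold = [(ent.text, ent.label_) for ex in examples for ent in ex.reference.ents]

# initialize() sets up the untrained model and returns an optimizer.
optimizer = nlp.initialize(lambda: examples)

random.seed(0)
losses = {}
for _ in range(25):
    random.shuffle(examples)
    losses = {}  # track the loss of the current pass only
    for example in examples:
        nlp.update([example], sgd=optimizer, losses=losses)

print(gold)
print(losses)
```

Passing a `losses` dictionary makes it possible to watch whether the loss for the `ner` component goes down from one pass to the next.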