In [1]:
import spacy

## Download and install the spaCy English model



In [2]:
%%bash
python -m spacy download en

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 163, in _run_module_as_main
    mod_name, _Error)
  File "/usr/lib/python2.7/runpy.py", line 111, in _get_module_details
    __import__(mod_name)  # Do not catch exceptions initializing package
  File "/usr/local/lib/python2.7/dist-packages/spacy/__init__.py", line 4, in <module>
    from .cli.info import info as cli_info
  File "/usr/local/lib/python2.7/dist-packages/spacy/cli/__init__.py", line 1, in <module>
    from .download import download
  File "/usr/local/lib/python2.7/dist-packages/spacy/cli/download.py", line 10, in <module>
    from .link import link
  File "/usr/local/lib/python2.7/dist-packages/spacy/cli/link.py", line 7, in <module>
    from ..compat import symlink_to, path2str
  File "/usr/local/lib/python2.7/dist-packages/spacy/compat.py", line 9, in <module>
    from thinc.neural.util import copy_array
  File "/usr/local/lib/python2.7/dist-packages/thinc/neural/__init__.py", line 1, in <modu

## Take a look at the architecture

spaCy's general architecture is described at https://spacy.io/assets/img/architecture.svg

## Tokenizing text

Tokenize text and explore the Token API at https://spacy.io/api/token.

For example, the following prints each token's text and its POS tag.

In [2]:
text = "This is a sentence. And this is another sentence." 
nlp = spacy.load('en')
doc = nlp(text)
for token in doc:
    print(token, token.pos_) 

This DET
is VERB
a DET
sentence NOUN
. PUNCT
And CCONJ
this DET
is VERB
another DET
sentence NOUN
. PUNCT


## Splitting text into sentences

spaCy includes sentence boundary detection based on the dependency parse. Sentences are available as a generator via the Doc (doc.sents).
For example:

In [3]:
text = "This is a sentence. And this is another sentence."
doc = nlp(text)
for sent in doc.sents:
    print([(token, token.pos_) for token in sent]) 

[(This, 'DET'), (is, 'VERB'), (a, 'DET'), (sentence, 'NOUN'), (., 'PUNCT')]
[(And, 'CCONJ'), (this, 'DET'), (is, 'VERB'), (another, 'DET'), (sentence, 'NOUN'), (., 'PUNCT')]


### Question

What data model type are sentences in spaCy?

In [4]:
type(next(doc.sents))

spacy.tokens.span.Span
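
A Span is a slice of the Doc, defined by start and end token offsets. The plain-Python sketch below illustrates the idea of deriving such sentence spans from a dependency parse: each token is followed up its head links to a root, and a new sentence starts whenever the root changes. This is not spaCy's implementation. The head indices are hand-written assumptions for the example text, and `find_root` and `sentence_spans` are hypothetical helper names.

```python
# Plain-Python sketch: derive sentence spans from dependency heads.
# Toy data written by hand (not spaCy output): heads[i] is the index of
# token i's head, and a root token points at itself.
tokens = ["This", "is", "a", "sentence", ".",
          "And", "this", "is", "another", "sentence", "."]
heads = [1, 1, 3, 1, 1,
         7, 7, 7, 9, 7, 7]


def find_root(i, heads):
    """Follow head links from token i until reaching a self-headed root."""
    while heads[i] != i:
        i = heads[i]
    return i


def sentence_spans(heads):
    """Return (start, end) token offsets, end exclusive, one pair per root."""
    spans = []
    start = 0
    for i in range(1, len(heads) + 1):
        # Close the current span at the end of input or when the root changes.
        if i == len(heads) or find_root(i, heads) != find_root(start, heads):
            spans.append((start, i))
            start = i
    return spans


spans = sentence_spans(heads)
sentences = [tokens[start:end] for start, end in spans]
print(spans)
print(sentences)
```

The offsets play the same role as the start and end of a Span: slicing the token list with them recovers each sentence.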

## Part-of-speech tagging and dependency parsing

spaCy models come with components for part-of-speech tagging and dependency parsing. Moreover, new dependency parsers can be trained on Universal Dependencies data with the command-line training tool.

For example:



In [8]:
text = "This is a sentence. And this is another sentence." 
doc = nlp(text)
for token in doc:
    print(token, token.pos_, token.dep_, token.head) 

This DET nsubj is
is VERB ROOT is
a DET det sentence
sentence NOUN attr is
. PUNCT punct is
And CCONJ cc is
this DET nsubj is
is VERB ROOT is
another DET det sentence
sentence NOUN attr is
. PUNCT punct is


### Question

Can you extract subjects and objects of sentences using the dependency annotations?

In [11]:
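# 'nsubj' marks subjects. The copula "is" takes an 'attr' complement here
# rather than a direct object ('dobj'), so 'attr' is collected instead.
subj_attr = [token for token in doc if token.dep_ in ('nsubj', 'attr')]
print(subj_attr)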

[This, sentence, this, sentence]
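
The flat list above loses track of which subject belongs to which verb. As a minimal sketch in plain Python, the rows printed by the dependency example can be grouped by their head instead. The rows are copied by hand from that output; the numeric head indices are an assumption added here to tell the two "is" tokens apart, and `arguments_by_head` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-copied (text, dep, head index) rows for the two-sentence example.
# Head indices are an assumption that disambiguates the two "is" tokens.
rows = [
    ("This", "nsubj", 1), ("is", "ROOT", 1), ("a", "det", 3),
    ("sentence", "attr", 1), (".", "punct", 1),
    ("And", "cc", 7), ("this", "nsubj", 7), ("is", "ROOT", 7),
    ("another", "det", 9), ("sentence", "attr", 7), (".", "punct", 7),
]

# Dependency labels treated as subject-like or object-like arguments.
WANTED = {"nsubj", "attr", "dobj"}


def arguments_by_head(rows, wanted=WANTED):
    """Map each head index to a {dep label: token text} dict of its arguments."""
    grouped = defaultdict(dict)
    for text, dep, head in rows:
        if dep in wanted:
            # A later token with the same label under the same head overwrites.
            grouped[head][dep] = text
    return dict(grouped)


args = arguments_by_head(rows)
for head, deps in sorted(args.items()):
    print(rows[head][0], deps)
```

With a real Doc, the same grouping could key on token.head instead of a hand-written index.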


### Visualization

spaCy comes with visualizers for entities and dependencies: displaCy (https://spacy.io/usage/visualizers). These visualizations can be generated directly from Python code. For example:


In [32]:
from spacy import displacy
from IPython.display import display, HTML
# jupyter=False makes render() return the markup as a string instead of displaying it
html = displacy.render(doc, style='dep', jupyter=False)
display(HTML(html))


## Recognizing entities

Most spaCy models come with an NER component for detecting common entity types. Entity information is available at different levels: as a tuple of Span objects on the Doc (doc.ents), as IOB tags on each Token, and as labels on Spans.

For example:


In [16]:
text = "Daniel Vila is visiting Austin, Texas"
doc = nlp(text)
for ent in doc.ents:
    print(ent, ent.label_, [token.dep_ for token in ent])

Daniel Vila PERSON ['compound', 'nsubj']
Austin GPE ['dobj']
Texas GPE ['appos']


### More on visualization

Inside Jupyter notebooks, displaCy can render its HTML output inline. It can also visualize entity annotations:



In [31]:
text = "Daniel Vila is visiting Austin, Texas"
doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)

### Question

How can you access NER information at the Token level? See https://spacy.io/api/token#attributes


In [18]:
text = "Daniel Vila is visiting Austin, Texas"
doc = nlp(text)

for token in doc:
    print(token, token.ent_type_)

Daniel PERSON
Vila PERSON
is 
visiting 
Austin GPE
, 
Texas GPE
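
Token-level entity information also includes an IOB code (token.ent_iob_), which marks whether a token begins an entity (B), is inside one (I), or is outside any entity (O). The sketch below shows, in plain Python, how such per-token tags can be merged back into entity spans. The triples are written by hand to be consistent with the doc.ents output shown earlier, not produced by running a model, and `iob_to_entities` is a hypothetical helper name.

```python
# Hand-written (text, IOB code, entity type) triples for the example sentence.
# With spaCy they would come from token.text, token.ent_iob_, token.ent_type_.
tagged = [
    ("Daniel", "B", "PERSON"), ("Vila", "I", "PERSON"),
    ("is", "O", ""), ("visiting", "O", ""),
    ("Austin", "B", "GPE"), (",", "O", ""), ("Texas", "B", "GPE"),
]


def iob_to_entities(tagged):
    """Merge per-token IOB tags into (entity text, label) pairs."""
    entities = []
    words, label = [], None
    for text, iob, ent_type in tagged:
        if iob == "I" and words:
            # Continue the entity that is currently open.
            words.append(text)
            continue
        if words:
            # Any other tag closes the open entity.
            entities.append((" ".join(words), label))
            words, label = [], None
        if iob == "B":
            # Start a new entity at this token.
            words, label = [text], ent_type
    if words:
        entities.append((" ".join(words), label))
    return entities


entities = iob_to_entities(tagged)
print(entities)
```

An "I" tag with no open entity is ignored here; that is a design choice of this sketch, not a statement about how spaCy handles malformed tags.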
