In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
doc = nlp(u'Tesla is looking at buying U.S. startup $6 million')

In [4]:
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN nsubj
startup VERB ccomp
$ SYM quantmod
6 NUM compound
million NUM dobj


In [5]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1d1565de200>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1d1565de740>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1d156482f80>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1d1567c7b40>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1d15679c5c0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1d156482ff0>)]

In [6]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [7]:
doc2 = nlp(u"Tesla isn't looking into startup anymore")

In [8]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
n't PART neg
looking VERB ROOT
into ADP prep
startup NOUN pobj
anymore ADV advmod


___
## Part-of-Speech Tagging (POS)
The next step after splitting the text into tokens is to assign parts of speech. In the example above, `Tesla` was recognized as a ***proper noun***. This step requires statistical modeling, because the correct tag depends on context. For example, words that follow "the" are typically nouns.

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging
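
The tag names are terse, so it can help to look them up programmatically. A minimal sketch, assuming spaCy is installed: `spacy.explain` reads from spaCy's built-in glossary and does not need a loaded model.

```python
import spacy

# Look up human-readable descriptions for a few of the coarse POS tags
# that appear in the token listings above.
for label in ("PROPN", "AUX", "VERB", "ADP"):
    print(f"{label:<6} {spacy.explain(label)}")
```

The same function also accepts fine-grained tags such as `NNP` and dependency labels such as `nsubj`.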

In [10]:
doc2[0].pos_

'PROPN'

In [11]:
doc2[0].dep_

'nsubj'

___
## Dependencies
We also looked at the syntactic dependency assigned to each token. `Tesla` is identified as an `nsubj`, the ***nominal subject*** of the sentence.

For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)
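
Each dependency label describes the relation between a token and its syntactic head, which spaCy exposes as `token.head`. The sketch below builds a `Doc` by hand (assuming spaCy v3) so that it runs without a trained model. The labels are copied from the `doc2` output above, but the `heads` indices are a hand-written assumption for illustration, not actual parser output.

```python
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Tesla", "is", "n't", "looking", "into", "startup", "anymore"]
# Dependency labels copied from the doc2 output above.
deps = ["nsubj", "aux", "neg", "ROOT", "prep", "pobj", "advmod"]
# ASSUMPTION: hand-annotated head positions (absolute token indices).
# The root token ("looking") points to itself.
heads = [3, 3, 3, 3, 3, 4, 3]

doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# Print each token with its label, its head, and the label's description.
for token in doc:
    print(f"{token.text:<8} {token.dep_:<7} head={token.head.text:<8} {spacy.explain(token.dep_)}")
```

With a loaded pipeline such as `en_core_web_sm`, the same loop works on any parsed `Doc`, and the heads come from the parser instead of a hand-written list.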

___
## Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Value for `doc2[0]`|
|:------|:------|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist only of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|
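
Several of these attributes are purely lexical and can be inspected without a trained pipeline. A small sketch, assuming spaCy with its English language data: it uses `spacy.blank("en")`, which tokenizes but does not tag or parse, so `.pos_`, `.tag_`, and `.lemma_` are deliberately left out. The variable name `nlp_blank` is just an illustrative choice.

```python
import spacy

# A blank English pipeline: tokenizer and lexical attributes only.
nlp_blank = spacy.blank("en")
doc = nlp_blank("Tesla isn't looking into startup anymore")

# Show the word shape and two boolean lexical flags for every token.
for token in doc:
    print(f"{token.text:<8} {token.shape_:<6} {token.is_alpha!s:<6} {token.is_stop}")
```

Running the same loop with `nlp = spacy.load("en_core_web_sm")` would additionally make the tagger-dependent attributes from the table available.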

In [12]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [13]:
doc3[0].pos_

'SCONJ'

In [14]:
life_quote = doc3[16:30]

In [15]:
print(life_quote)

"Life is what happens to us while we are making other plans"


In [16]:
type(life_quote)

spacy.tokens.span.Span

In [18]:
type(doc3)

spacy.tokens.doc.Doc
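
Slicing a `Doc` returns a `Span` whose `start` and `end` are token indices into the parent document. A small sketch, assuming spaCy is installed: it uses a blank English pipeline, since slicing needs only tokenization, and a shorter example sentence than `doc3`.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc = nlp_blank("This is the first sentence. This is another sentence.")

# Token-level slice: tokens 6 through 9 (the end index is exclusive).
span = doc[6:10]

print(span.text)
print(span.start, span.end, len(span))
print(isinstance(span, Span))
```

Because the indices count tokens rather than characters, punctuation marks such as the period at index 5 each take up one position.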

In [21]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [22]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [25]:
doc4[6:10]

This is another sentence

In [26]:
doc4[6].is_sent_start

True

In [27]:
doc4[8].is_sent_start

False
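
In the cells above, sentence boundaries come from the dependency parser in `en_core_web_sm`. As a lighter-weight alternative (a different technique from the one used above), the rule-based `sentencizer` component can set boundaries on a blank pipeline by splitting on sentence-final punctuation. This sketch assumes spaCy v3, where components are added by name. `nlp_rules` is an illustrative variable name.

```python
import spacy

# Blank pipeline plus the rule-based sentence splitter (no trained model needed).
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc = nlp_rules("This is the first sentence. This is another sentence. This is the last sentence.")

# Each sentence is a Span; print its token offsets and text.
for sent in doc.sents:
    print(sent.start, sent.end, sent.text)

# Token 6 opens the second sentence; token 8 is in the middle of it.
print(doc[6].is_sent_start, doc[8].is_sent_start)
```

The rule-based splitter is fast but only looks at punctuation, so on text with abbreviations or unusual formatting the parser-based boundaries may differ.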