In [0]:
!pip install -U spacy

In [0]:
!python -m spacy download en_core_web_sm

In [0]:
## Steps for working with spaCy

# 1. Loading the language library
# 2. Building a pipeline object
# 3. Using tokens
# 4. POS tagging
# 5. Understanding token attributes

In [0]:
import spacy
# Load the small English pipeline and bind it to the name 'nlp'
nlp = spacy.load('en_core_web_sm')

In [0]:
# Create a Doc object by applying the 'nlp' pipeline to our text
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

## Tokenization

In [12]:
for token in doc:
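  # attributes ending in an underscore (pos_, dep_) return readable strings;
  # without the underscore (pos, dep) they return integer IDs
  # dep_ = syntactic dependency label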
  print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj
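
The trailing underscore is worth a short illustration. In spaCy, an attribute without the underscore (for example `token.orth`) is an integer ID, and the underscore variant (`token.orth_`) is the matching string. The sketch below assumes a blank English pipeline from `spacy.blank("en")`, which only tokenizes, so it uses `orth` and `shape` rather than `pos` and `dep`. The names `nlp_blank` and `doc_demo` are illustrative and not part of the cells above.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained model needed).
nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("Tesla is looking at buying U.S. startup for $6 million")

token = doc_demo[0]

# Without the underscore: an integer ID from the shared string store.
print(type(token.orth), type(token.shape))

# With the underscore: the human-readable string for that ID.
print(token.orth_, token.shape_)

# The string store maps between the two representations in both directions.
assert nlp_blank.vocab.strings[token.orth] == token.orth_
assert nlp_blank.vocab.strings[token.orth_] == token.orth
```

The same pairing should hold for `pos`/`pos_` and `dep`/`dep_` once a trained pipeline such as `en_core_web_sm` has filled them in.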


spaCy pipelines: https://spacy.io/usage/spacy-101#pipelines

## Pipeline

In [13]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f6b05499588>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f6b053311c8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f6b05331228>)]

In [14]:
nlp.pipe_names

['tagger', 'parser', 'ner']
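
To see how `nlp.pipe_names` reflects what is in the pipeline, the sketch below starts from a blank English pipeline and adds the rule-based `sentencizer` component. It assumes spaCy v3, where `add_pipe` takes the component's registered string name; the output above appears to come from a v2 model, where the call looks different. `nlp_blank` is an illustrative name.

```python
import spacy

# Assumption: spaCy v3 API. A blank pipeline has a tokenizer but no components.
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)   # empty list: nothing runs after tokenization

# Add the built-in rule-based sentence splitter by its registered name.
nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)

# nlp.pipeline holds (name, component) pairs in execution order.
for name, component in nlp_blank.pipeline:
    print(name, type(component).__name__)
```

Components run in the order listed, so the position of a component in `pipe_names` tells you what annotations are already available when it runs.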

## POS tagging

In [17]:
# The u prefix marks a Unicode string; it is optional in Python 3, where every str is Unicode
doc2 = nlp(u"Tesla isn't    looking into startups anymore.")

for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX ROOT
n't PART neg
    SPACE 
looking VERB acomp
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


In [18]:
# We can grab individual tokens...
doc2[0]

Tesla

In [20]:
# and extract attributes
doc2[0].pos_

'PROPN'

In [21]:
type(doc2)

spacy.tokens.doc.Doc

## Dependencies

spaCy's part-of-speech tagging: https://spacy.io/api/annotation#pos-tagging

Dependency parsing: https://spacy.io/api/annotation#dependency-parsing

Dependencies manual: https://nlp.stanford.edu/software/dependencies_manual.pdf

In [22]:
doc2[0].dep_

'nsubj'

In [23]:
# To see the full name of a tag or dependency label, use spacy.explain:

spacy.explain('PROPN')

'proper noun'

In [24]:
spacy.explain('nsubj')

'nominal subject'
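
Since `spacy.explain` is a plain glossary lookup, it can be run over several labels at once without loading a model. The sketch below uses labels that appear in the outputs above; the names `labels` and `explanations` are illustrative.

```python
import spacy

# Tags and dependency labels taken from the outputs above.
labels = ["PROPN", "AUX", "ADP", "nsubj", "pobj", "VBG"]

# spacy.explain looks each label up in spaCy's built-in glossary.
explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(f"{label:6} {meaning}")
```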

## Additional token attributes

In [25]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)

looking
look


In [26]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

VERB
VBG / verb, gerund or present participle


In [27]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S. : X.X.


In [28]:
# Boolean Values:
print(doc2[0].is_alpha) # is alphabetic
print(doc2[0].is_stop)  # is a stopword

True
False
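
Shapes and boolean flags are lexical attributes, so they can be inspected for every token at once. The sketch below assumes a blank English pipeline is sufficient for these attributes, and it adds `like_num` and `is_currency`, two further flags from the Token API that the cells above do not use. `nlp_blank`, `doc_demo`, and `rows` are illustrative names.

```python
import spacy

# Assumption: a blank English pipeline is enough, because these attributes
# come from the vocabulary and do not need a trained tagger or parser.
nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("Tesla is looking at buying U.S. startup for $6 million")

# One row per token: text, shape, and four boolean flags.
rows = [
    (t.text, t.shape_, t.is_alpha, t.is_stop, t.like_num, t.is_currency)
    for t in doc_demo
]

print(f"{'text':10}{'shape':8}{'alpha':7}{'stop':7}{'num':7}{'currency'}")
for text, shape, alpha, stop, num, currency in rows:
    print(f"{text:10}{shape:8}{alpha!s:7}{stop!s:7}{num!s:7}{currency}")
```

Note that `U.S.` is not alphabetic because of the periods, even though it is a single token.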


## Spans

Large Doc objects can be hard to work with. A Span is a slice of a Doc object, written Doc[start:stop], where start and stop are token indices and stop is exclusive.

In [0]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [30]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [31]:
type(life_quote)

spacy.tokens.span.Span
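
A Span also remembers where it sits in its parent Doc. The sketch below uses a shorter sentence than `doc3` and assumes a blank English pipeline, since slicing only needs tokenization. The names `nlp_blank`, `text`, `doc_demo`, and `quote` are illustrative.

```python
import spacy

# Assumption: a blank English pipeline; slicing only needs tokenization.
nlp_blank = spacy.blank("en")
text = 'The phrase "Life is what happens" was written by Allen Saunders.'
doc_demo = nlp_blank(text)

# Each quotation mark is its own token, so the quote covers tokens 2 to 7.
quote = doc_demo[2:8]
print(quote.text)

# A Span records its position in the parent Doc, in tokens and in characters.
print(quote.start, quote.end, len(quote))
print(text[quote.start_char:quote.end_char])

# Indexing a Span yields Token objects, and the Span keeps a reference to its Doc.
print(quote[1].text, quote.doc is doc_demo)
```

Because the slice is a view onto the Doc rather than a copy of the text, token attributes remain available on every element of the Span.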

## Sentences
Certain tokens in a Doc object are also flagged as the start of a sentence. These flags do not build a list of sentences by themselves, but they let Doc.sents generate the sentence segments on demand. Later we'll write our own segmentation rules.

In [32]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [33]:
doc4[6].is_sent_start

True
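
Because `Doc.sents` is a generator, it has to be wrapped in `list()` before it can be indexed or counted. The sketch below shows this together with the sentence-start flags. It assumes spaCy v3 and swaps in the rule-based `sentencizer` on a blank pipeline, rather than the dependency parser that sets the boundaries in the cells above. `nlp_blank`, `doc_demo`, `sentences`, and `starts` are illustrative names.

```python
import spacy

# Assumption: spaCy v3 and a blank pipeline with the rule-based sentencizer,
# which marks sentence starts from punctuation instead of a dependency parse.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")

doc_demo = nlp_blank(
    "This is the first sentence. This is another sentence. This is the last sentence."
)

# Doc.sents is a generator of Span objects; wrap it in list() to index it.
sentences = list(doc_demo.sents)
print(len(sentences), sentences[1].text)

# The tokens flagged as sentence starts are the first tokens of those spans.
starts = [token.i for token in doc_demo if token.is_sent_start]
print(starts)
```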