#### Using spaCy

- NLTK is a string-processing library: it takes strings as input and returns strings or lists of strings as output. spaCy, in contrast, takes an object-oriented approach: when we parse a text, spaCy returns a document object whose words and sentences are themselves objects.

- spaCy has support for word vectors whereas NLTK does not.

- spaCy aims to implement recent, well-performing algorithms, so its performance usually compares well with NLTK's.

- As we can see below, spaCy performs better in word tokenization and POS tagging, but NLTK outperforms spaCy in sentence tokenization. spaCy's poor performance in sentence tokenization is a result of differing approaches: NLTK simply attempts to split the text into sentences, whereas spaCy constructs a syntactic tree for each sentence, a more robust method that yields much more information about the text.
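
To make the object-oriented point concrete, here is a minimal sketch. It assumes spaCy v3 and uses a blank English pipeline with the rule-based `sentencizer` component, chosen only so the snippet runs without downloading a model; it is not the parser-based sentence splitting described above.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Assumption: spaCy v3 API. A blank pipeline only tokenizes; the
# rule-based "sentencizer" marks sentence boundaries at punctuation.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("I like to play football. I hated it in my childhood though")
sentences = list(doc.sents)

print(type(doc).__name__)           # the parsed text is a Doc object
print(type(sentences[0]).__name__)  # each sentence is a Span object
print(type(doc[0]).__name__)        # each word is a Token object
print([sent.text for sent in sentences])
```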

In [20]:
# pip install spacy
# python -m spacy download en   (shortcut name, deprecated in spaCy v3; prefer the full names below)
# python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_md
# python -m spacy download en_core_web_lg

In [1]:
import spacy

In [3]:
# load the small English pipeline (en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

In [4]:
# create a spaCy document that we will use for
# part-of-speech tagging
doc = nlp("Apple is looking at buying U.K. startup for $1 billion, One billion ...")

The spaCy document object and its tokens have several attributes that can be used to perform a variety of tasks. For instance, the text attribute holds the text of the document or of a single token.

Similarly, a token's pos_ attribute returns its coarse-grained POS tag.

To obtain the fine-grained POS tag, we can use the tag_ attribute. Finally, to get an explanation of a tag, we can call the spacy.explain() function and pass it the tag name.

In [5]:
doc

Apple is looking at buying U.K. startup for $1 billion, One billion ...

In [6]:
for token in doc:
    print(token.text, '\t',
          token.lemma_, '\t',
          token.pos_, '\t',
          token.tag_, '\t',
          token.dep_, '\t',
          token.shape_, '\t',
          token.is_alpha, '\t',
          token.is_stop)

Apple 	 Apple 	 PROPN 	 NNP 	 nsubj 	 Xxxxx 	 True 	 False
is 	 be 	 AUX 	 VBZ 	 aux 	 xx 	 True 	 True
looking 	 look 	 VERB 	 VBG 	 ROOT 	 xxxx 	 True 	 False
at 	 at 	 ADP 	 IN 	 prep 	 xx 	 True 	 True
buying 	 buy 	 VERB 	 VBG 	 pcomp 	 xxxx 	 True 	 False
U.K. 	 U.K. 	 PROPN 	 NNP 	 dobj 	 X.X. 	 False 	 False
startup 	 startup 	 NOUN 	 NN 	 dep 	 xxxx 	 True 	 False
for 	 for 	 ADP 	 IN 	 prep 	 xxx 	 True 	 True
$ 	 $ 	 SYM 	 $ 	 quantmod 	 $ 	 False 	 False
1 	 1 	 NUM 	 CD 	 compound 	 d 	 False 	 False
billion 	 billion 	 NUM 	 CD 	 pobj 	 xxxx 	 True 	 False
, 	 , 	 PUNCT 	 , 	 punct 	 , 	 False 	 False
One 	 one 	 NUM 	 CD 	 compound 	 Xxx 	 True 	 True
billion 	 billion 	 NUM 	 CD 	 appos 	 xxxx 	 True 	 False
... 	 ... 	 PUNCT 	 . 	 punct 	 ... 	 False 	 False


- Text: The original word text.
- Lemma: The base form of the word.
- POS: The simple (coarse-grained) part-of-speech tag.
- Tag: The detailed (fine-grained) part-of-speech tag.
- Dep: Syntactic dependency, i.e. the relation between tokens.
- Shape: The word shape – capitalization, punctuation, digits.
- is_alpha: Does the token consist only of alphabetic characters?
- is_stop: Is the token part of a stop list, i.e. the most common words of the language?
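
Some of the attributes listed above are lexical: they depend only on the token's text, so they are available even without a trained model. The sketch below assumes spaCy v3 and uses a blank English pipeline (tokenizer only). The lemma, POS, tag and dependency columns are left out because they need a trained pipeline such as en_core_web_sm.

```python
import spacy

# Assumption: a blank pipeline, so only the tokenizer and lexical attributes are used.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# One row per token: text, word shape, alphabetic flag, stop-word flag.
rows = [(t.text, t.shape_, t.is_alpha, t.is_stop) for t in doc]
by_text = {row[0]: row for row in rows}

for row in rows:
    print(*row, sep="\t")
```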

More examples:

In [7]:
sen = nlp(u"I like to play football. I hated it in my childhood though")

In [8]:
print(sen.text)

I like to play football. I hated it in my childhood though


Next, let's look at the pos_ attribute. We will print the POS tag of the word "hated", which sits at index 7 (it is the eighth token, since indexing starts at 0 and the period counts as a token).

In [9]:
print(sen[7])
print(sen[7].pos_)

hated
VERB
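
Token positions are easy to get wrong by one, because indexing is zero-based and punctuation marks are tokens too. A small sketch, assuming spaCy v3 and a blank English pipeline (the tokenizer is all that is needed here), that looks the index up instead of counting by hand:

```python
import spacy

# Assumption: tokenizer-only pipeline; no trained model is required to inspect indices.
nlp = spacy.blank("en")
sen = nlp("I like to play football. I hated it in my childhood though")

# token.i is the token's position within the document.
for token in sen:
    print(token.i, token.text)

hated_positions = [token.i for token in sen if token.text == "hated"]
print(hated_positions)
```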


Now let's print the fine-grained POS tag for the word "hated".

In [10]:
print(sen[7].tag_)

VBD


To see what VBD means, we can use the spacy.explain() function as shown below:

In [11]:
print(spacy.explain(sen[7].tag_))

verb, past tense
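
spacy.explain() looks the term up in spaCy's glossary and returns None when there is no entry. The helper below, `describe_tag`, is a hypothetical convenience wrapper (not part of spaCy) that sketches one way to handle unknown tags. It also silences the warning that recent spaCy versions may emit for unknown terms.

```python
import warnings

import spacy


def describe_tag(tag: str) -> str:
    """Hypothetical helper: glossary text for a tag, or a readable fallback."""
    with warnings.catch_warnings():
        # Some spaCy versions warn when the term is not in the glossary.
        warnings.simplefilter("ignore")
        description = spacy.explain(tag)
    if description is None:
        return f"(no glossary entry for {tag!r})"
    return description


print(describe_tag("VBD"))
print(describe_tag("NOT_A_TAG"))
```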


Let's print the text, coarse-grained POS tags, fine-grained POS tags, and the explanation for the tags for all the words in the sentence.

In [12]:
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

I            PRON       PRP      pronoun, personal
like         VERB       VBP      verb, non-3rd person singular present
to           PART       TO       infinitival "to"
play         VERB       VB       verb, base form
football     NOUN       NN       noun, singular or mass
.            PUNCT      .        punctuation mark, sentence closer
I            PRON       PRP      pronoun, personal
hated        VERB       VBD      verb, past tense
it           PRON       PRP      pronoun, personal
in           ADP        IN       conjunction, subordinating or preposition
my           PRON       PRP$     pronoun, possessive
childhood    NOUN       NN       noun, singular or mass
though       ADV        RB       adverb
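
The nested braces in the f-string above, such as `{word.text:{12}}`, are a nested replacement field in the format spec: the inner `{12}` evaluates to the column width. A minimal plain-Python sketch (no spaCy needed; `format_row` and the sample rows are illustrative assumptions) showing that it is equivalent to writing the width directly:

```python
def format_row(text: str, pos: str, tag: str, text_width: int = 12) -> str:
    """Illustrative helper: left-align three columns with fixed widths."""
    # {text:{text_width}} fills in the width from a variable;
    # {pos:10} and {tag:8} hard-code it. Strings are left-aligned by default.
    return f"{text:{text_width}} {pos:10} {tag:8}"


rows = [("I", "PRON", "PRP"), ("hated", "VERB", "VBD")]
for text, pos, tag in rows:
    # The nested form used in the notebook produces the same string.
    assert format_row(text, pos, tag) == f"{text:{12}} {pos:{10}} {tag:{8}}"
    print(format_row(text, pos, tag) + "|")
```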


Another example:

In [13]:
sen = nlp(u'Can you google it? ')
word = sen[2]

In [14]:
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

google       VERB       VB       verb, base form


Here the word "google" is used as a verb. The output above shows its POS tags along with the explanation of the fine-grained tag.

From the output, you can see that the word "google" has been correctly identified as a verb.

#### More examples using spaCy

In [15]:
ex1 = nlp('he drinks a drink')

In [16]:
for word in ex1:
    print(word.text, word.pos, word.pos_)

he 95 PRON
drinks 100 VERB
a 90 DET
drink 92 NOUN
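
The numbers in the output come from the pos attribute, which holds an integer ID, while pos_ (with the trailing underscore) holds the readable string. The sketch below assumes spaCy v3 and, so that it runs without a trained model, assigns the POS tags by hand through the Doc constructor. With a loaded model the tags would be predicted instead.

```python
import spacy
from spacy.symbols import NOUN, VERB
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: hand-assigned tags, standing in for a trained tagger's predictions.
ex1 = Doc(
    nlp.vocab,
    words=["he", "drinks", "a", "drink"],
    pos=["PRON", "VERB", "DET", "NOUN"],
)

# Compare against integer symbols (token.pos) or strings (token.pos_).
verbs = [t.text for t in ex1 if t.pos == VERB]
nouns = [t.text for t in ex1 if t.pos_ == "NOUN"]
print(verbs, nouns)
```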


In [17]:
ex2 = nlp('i fish a fish')

In [18]:
for word in ex2:
    print(word.text, word.pos, word.pos_, word.tag_)

i 95 PRON PRP
fish 100 VERB VBP
a 90 DET DT
fish 92 NOUN NN


In [19]:
# Explain the POS abbreviations
spacy.explain('DT')

'determiner'

In [20]:
spacy.explain('PRON')

'pronoun'

In [21]:
spacy.explain('VBP')

'verb, non-3rd person singular present'

In [22]:
ex3 = nlp('All the faith he had had had had no effect on the outcome of his life')
for word in ex3:
    print(word.text, word.pos, word.pos_, word.tag_)

All 90 DET PDT
the 90 DET DT
faith 92 NOUN NN
he 95 PRON PRP
had 87 AUX VBD
had 87 AUX VBN
had 87 AUX VBN
had 100 VERB VBN
no 90 DET DT
effect 92 NOUN NN
on 85 ADP IN
the 90 DET DT
outcome 92 NOUN NN
of 85 ADP IN
his 95 PRON PRP$
life 92 NOUN NN


#### Visualizing Parts of Speech Tags

The displacy module from the spaCy library is used for this purpose.

To visualize the POS tags inside a Jupyter notebook, we call the render function from the displacy module, pass it the spaCy document and the style of the visualization, and set the jupyter parameter to True. With style='dep', displacy draws the dependency parse and shows each token's POS tag beneath it.

In [23]:
from spacy import displacy

sen = nlp(u"I like to play football. I hated it in my childhood though")
displacy.render(sen, style='dep', jupyter=True, options={'distance': 85})
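
Outside a notebook, the same call can return the markup as a string instead of displaying it. The sketch below assumes spaCy v3 and builds a tiny document by hand (words, POS tags, heads and dependency labels are illustrative assumptions) so that it runs without a trained model. With a loaded model you would pass the parsed document directly.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: a hand-annotated parse; heads are absolute token indices.
doc = Doc(
    nlp.vocab,
    words=["I", "like", "football"],
    pos=["PRON", "VERB", "NOUN"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
)

# jupyter=False makes render() return the SVG markup as a string.
svg = displacy.render(doc, style="dep", jupyter=False, options={"distance": 85})
print(svg[:60])
```

The returned string can then be written to an .svg file or embedded in a web page.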

#### Uses of POS tagging

- Text-to-speech (TTS) applications
- Information retrieval and extraction
- An intermediate step for higher-level NLP tasks such as parsing, semantic analysis, translation, and many more
- Sentiment analysis
- Homonym disambiguation
- Predictions
- Building NER systems (most named entities are nouns)
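
As a rough illustration of two of these uses, the sketch below keys each word by its POS tag to separate the verb "fish" from the noun "fish", and collects nouns as candidates for named entities. It assumes spaCy v3 and hand-assigned tags so that it runs without a trained model. A real application would use the tags predicted by a loaded pipeline.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: hand-assigned tags mirroring the 'i fish a fish' example above.
ex2 = Doc(
    nlp.vocab,
    words=["i", "fish", "a", "fish"],
    pos=["PRON", "VERB", "DET", "NOUN"],
)

# Homonym disambiguation: the same surface form counted per POS tag.
senses = Counter((t.lower_, t.pos_) for t in ex2)

# Candidate entities: keep only nouns and proper nouns.
noun_candidates = [t.text for t in ex2 if t.pos_ in {"NOUN", "PROPN"}]

print(senses)
print(noun_candidates)
```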