# Parts of Speech Assessment - Solutions

For this assessment we'll be using the short story [An Occurrence at Owl Creek Bridge](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

**1. Create a Doc object from the file `owlcreek.txt`**

In [2]:
with open('data/owlcreek.txt') as f:
    doc = nlp(f.read())

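**2. For every token in the fourth sentence (`list(doc.sents)[3]`), print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.**

In [3]:
# doc.sents is a generator, so convert it to a list before indexing.
# Index 3 selects the fourth sentence (indexing starts at 0).
for token in list(doc.sents)[3]:
    # Pad each column to a fixed width so the output lines up.
    print(f'{token.text:{12}} {token.pos_:{6}} {token.tag_:{6}} {spacy.explain(token.tag_)}')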

The          DET    DT     determiner
man          NOUN   NN     noun, singular or mass
's           PART   POS    possessive ending
hands        NOUN   NNS    noun, plural
were         AUX    VBD    verb, past tense
behind       ADV    RB     adverb

            SPACE  _SP    None
his          PRON   PRP$   pronoun, possessive
back         NOUN   NN     noun, singular or mass
,            PUNCT  ,      punctuation mark, comma
the          DET    DT     determiner
wrists       NOUN   NNS    noun, plural
bound        VERB   VBN    verb, past participle
with         ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner
cord         NOUN   NN     noun, singular or mass
.            PUNCT  .      punctuation mark, sentence closer


**3. Provide a frequency list of POS tags from the entire document**

In [4]:
POS_counts = doc.count_by(spacy.attrs.POS)

for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')

84. ADJ  : 257
85. ADP  : 491
86. ADV  : 200
87. AUX  : 184
89. CCONJ: 129
90. DET  : 584
91. INTJ : 5
92. NOUN : 857
93. NUM  : 29
94. PART : 65
95. PRON : 393
96. PROPN: 42
97. PUNCT: 571
98. SCONJ: 47
100. VERB : 506
103. SPACE: 475
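
`Doc.count_by` returns a dictionary keyed by integer attribute IDs, which is why the cell above looks each key up in `doc.vocab` to get a readable label. To illustrate the kind of tally it produces, here is a minimal sketch that counts POS labels by hand with the standard library. It uses only the tokens of the single sentence printed for question 2, copied into a hard-coded list; `tagged_tokens` and `pos_frequencies` are hypothetical names introduced for this example, not part of spaCy.

```python
from collections import Counter

# (text, coarse POS) pairs copied from the token table printed for question 2.
tagged_tokens = [
    ("The", "DET"), ("man", "NOUN"), ("'s", "PART"), ("hands", "NOUN"),
    ("were", "AUX"), ("behind", "ADV"), ("\n", "SPACE"), ("his", "PRON"),
    ("back", "NOUN"), (",", "PUNCT"), ("the", "DET"), ("wrists", "NOUN"),
    ("bound", "VERB"), ("with", "ADP"), ("a", "DET"), ("cord", "NOUN"),
    (".", "PUNCT"),
]


def pos_frequencies(pairs):
    """Count how often each POS label occurs (hypothetical helper)."""
    return Counter(pos for _, pos in pairs)


sentence_counts = pos_frequencies(tagged_tokens)

# Print the labels alphabetically, padded like the frequency list above.
for pos, count in sorted(sentence_counts.items()):
    print(f"{pos:{5}}: {count}")
```

Unlike `count_by`, this sketch keys the result by label string rather than by ID, so no vocabulary lookup is needed when printing.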


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
HINT: the attribute ID for 'NOUN' is 92, as shown in the frequency list above.

In [5]:
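# 92 is the attribute ID for NOUN; len(doc) counts every token, including whitespace.
percent = 100*POS_counts[92]/len(doc)

# Format the percentage with two decimal places.
print(f'{POS_counts[92]}/{len(doc)} = {percent:.2f}%')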

857/4835 = 17.72%
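
The figure above divides by `len(doc)`, which includes the 475 `SPACE` tokens and the 571 `PUNCT` tokens. Whether those belong in the denominator is a judgment call. The sketch below, using only the counts printed for question 3, shows how the noun share shifts under each choice; `pos_counts` and `noun_share` are hypothetical names introduced for this example.

```python
# Counts copied from the frequency list printed for question 3.
pos_counts = {
    "ADJ": 257, "ADP": 491, "ADV": 200, "AUX": 184, "CCONJ": 129,
    "DET": 584, "INTJ": 5, "NOUN": 857, "NUM": 29, "PART": 65,
    "PRON": 393, "PROPN": 42, "PUNCT": 571, "SCONJ": 47,
    "VERB": 506, "SPACE": 475,
}


def noun_share(counts, exclude=()):
    """Percentage of NOUN tokens among all tokens whose label is not excluded."""
    kept_total = sum(n for label, n in counts.items() if label not in exclude)
    return 100 * counts["NOUN"] / kept_total


all_tokens = noun_share(pos_counts)
no_space = noun_share(pos_counts, exclude={"SPACE"})
words_only = noun_share(pos_counts, exclude={"SPACE", "PUNCT"})

print(f"all tokens:           {all_tokens:.2f}%")
print(f"excluding whitespace: {no_space:.2f}%")
print(f"words only:           {words_only:.2f}%")
```

The first value reproduces the answer above; the other two are alternative readings of "percentage of tokens", not corrections to it.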


**5. Display the dependency parse for the fourth sentence (`list(doc.sents)[3]`)**

In [6]:
displacy.render(list(doc.sents)[3], style='dep', jupyter=True, options={'distance': 100})

**6. Show the first two named entities in the document**

In [7]:
for ent in doc.ents[:2]:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Alabama - GPE - Countries, cities, states
twenty feet - QUANTITY - Measurements, as of weight or distance


**7. How many sentences does the document contain?**

In [8]:
len([sent for sent in doc.sents])

319

**8. CHALLENGE: How many sentences contain named entities?**

In [9]:
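# Re-parse each sentence as its own Doc, then keep the Docs that have at least one entity.
list_of_sents = [nlp(sent.text) for sent in doc.sents]
# Use a distinct loop variable so it is not confused with the main `doc`.
list_of_ners = [sent_doc for sent_doc in list_of_sents if sent_doc.ents]
len(list_of_ners)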

37

**9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem**

In [18]:
displacy.render(list_of_sents[0], style='ent', jupyter=True)