# Parts of Speech Assessment

For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

**1. Create a Doc object from the file `peterrabbit.txt`**<br>
> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`

In [2]:
with open('sources/peterrabbit.txt') as f:
    doc = nlp(f.read())
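Called without an `encoding` argument, `open()` falls back to the platform default, which is not always UTF-8. Since the Project Gutenberg file is UTF-8, a more defensive read might look like the sketch below. The helper name `read_text_file` and the throwaway sample file are illustrative assumptions, not part of the assessment; the returned string would be passed to `nlp(...)` exactly as in the cell above.

```python
import tempfile
from pathlib import Path


def read_text_file(path):
    """Return the file's contents, decoding explicitly as UTF-8."""
    return Path(path).read_text(encoding="utf-8")


# Demonstration with a throwaway file standing in for peterrabbit.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("Flopsy, Mopsy, Cotton-tail, and Peter.\n", encoding="utf-8")
    sample_text = read_text_file(sample_path)

print(sample_text)
```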

**2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.**

In [3]:
# Enter your code here:
for token in list(doc.sents)[2]:
    print(f"{token.text:{12}} {token.pos_:{8}} {token.tag_:{8}} {spacy.explain(token.tag_)} ")

They         PRON     PRP      pronoun, personal 
lived        VERB     VBD      verb, past tense 
with         ADP      IN       conjunction, subordinating or preposition 
their        DET      PRP$     pronoun, possessive 
Mother       PROPN    NNP      noun, proper singular 
in           ADP      IN       conjunction, subordinating or preposition 
a            DET      DT       determiner 
sand         NOUN     NN       noun, singular or mass 
-            PUNCT    HYPH     punctuation mark, hyphen 
bank         NOUN     NN       noun, singular or mass 
,            PUNCT    ,        punctuation mark, comma 
underneath   ADP      IN       conjunction, subordinating or preposition 
the          DET      DT       determiner 
root         NOUN     NN       noun, singular or mass 
of           ADP      IN       conjunction, subordinating or preposition 
a            DET      DT       determiner 

            SPACE    _SP      None 
very         ADV      RB       adverb 
big          ADJ

**3. Provide a frequency list of POS tags from the entire document**

In [4]:
pos_counts = doc.count_by(spacy.attrs.POS)
for k,v in sorted(pos_counts.items()):
    print(f"{k}. {doc.vocab[k].text:{5}} {v}")

84. ADJ   49
85. ADP   122
86. ADV   67
87. AUX   48
89. CCONJ 61
90. DET   117
92. NOUN  169
93. NUM   8
94. PART  28
95. PRON  82
96. PROPN 75
97. PUNCT 174
98. SCONJ 20
100. VERB  139
103. SPACE 99
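`doc.count_by(spacy.attrs.POS)` keys its result by integer attribute ID, which is why each line above looks the label up through `doc.vocab`. If the IDs are not needed, counting the string labels directly is one possible alternative. The sketch below assumes only that each token exposes a `.pos_` attribute; the helper `pos_frequencies` and the stand-in tokens are hypothetical, used so the example runs without a loaded model.

```python
from collections import Counter
from types import SimpleNamespace


def pos_frequencies(tokens):
    """Count coarse POS labels for any iterable of objects with a `.pos_` attribute."""
    return Counter(token.pos_ for token in tokens)


# Stand-in tokens (an assumption for this demo); with spaCy you would pass `doc` itself.
fake_tokens = [
    SimpleNamespace(pos_=tag)
    for tag in ["PRON", "VERB", "ADP", "DET", "NOUN", "NOUN"]
]

# Print labels alphabetically, padded to the same width as the listing above.
for label, count in sorted(pos_frequencies(fake_tokens).items()):
    print(f"{label:5} {count}")
```

Sorting here is alphabetical by label rather than by attribute ID, so the order can differ from the `count_by` listing.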


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
HINT: the attribute ID for 'NOUN' is 92 (see the frequency list above)

In [5]:
sum_nouns = 0
sum_total = 0
for k,v in sorted(pos_counts.items()):
    sum_total += v
    if doc.vocab[k].text == 'NOUN':
        sum_nouns += v
print("%.2f" % ((sum_nouns/sum_total)*100)+'%')

13.43%


In [6]:
# alternative solution
percent = 100*pos_counts[92]/len(doc)
print(f'{pos_counts[92]}/{len(doc)} = {percent:.2f}%')

169/1258 = 13.43%
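As a sanity check, the same percentage can be reproduced from the frequency list in step 3 alone, without touching the `Doc`. The sketch below copies those counts into a plain dictionary; the names `pos_table` and `pos_share` are illustrative, not part of the assessment.

```python
# Counts copied from the frequency list printed in step 3.
pos_table = {
    "ADJ": 49, "ADP": 122, "ADV": 67, "AUX": 48, "CCONJ": 61,
    "DET": 117, "NOUN": 169, "NUM": 8, "PART": 28, "PRON": 82,
    "PROPN": 75, "PUNCT": 174, "SCONJ": 20, "VERB": 139, "SPACE": 99,
}


def pos_share(table, label):
    """Percentage of all counted tokens that carry the given POS label."""
    return 100 * table[label] / sum(table.values())


total_tokens = sum(pos_table.values())
noun_share = pos_share(pos_table, "NOUN")
print(f"{pos_table['NOUN']}/{total_tokens} = {noun_share:.2f}%")
```

The counts sum to 1258, the same value as `len(doc)`, so both approaches divide by the same total. Note that this total includes `PUNCT` and `SPACE` tokens.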


**5. Display the Dependency Parse for the third sentence**

In [7]:
displacy.render(list(doc.sents)[2], style='dep', jupyter=True, options={'distance': 50})

**6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit***

In [8]:
for ent in doc.ents[:2]:
    print(ent.text + " - " + ent.label_ + " - " + spacy.explain(ent.label_)  )

Peter Rabbit - PERSON - People, including fictional
Beatrix Potter - PERSON - People, including fictional


**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [9]:
len(list(doc.sents))

62

**8. CHALLENGE: How many sentences contain named entities?**

In [10]:
num_sent_contains_ent = 0
for sent in doc.sents:
    if sent.ents:
        num_sent_contains_ent +=1
num_sent_contains_ent    

34

In [11]:
# alternative solution
len([sent_ent for sent_ent in doc.sents if sent_ent.ents])

34
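Wrapping the count in a small function keeps the loop variable local, so it cannot be confused with a name left over from an earlier cell. The sketch below assumes only that each sentence span exposes an `.ents` sequence, as `sent.ents` does above; the helper `count_spans_with_entities` and the stand-in spans are hypothetical, used so the example runs without a loaded model.

```python
from types import SimpleNamespace


def count_spans_with_entities(spans):
    """Count the spans whose `.ents` sequence is non-empty."""
    return sum(1 for span in spans if span.ents)


# Stand-ins for sentence spans (an assumption for this demo);
# with spaCy you would pass `doc.sents`.
fake_sents = [
    SimpleNamespace(ents=("Peter",)),
    SimpleNamespace(ents=()),
    SimpleNamespace(ents=("Mr. McGregor", "Mrs. McGregor")),
]

print(count_spans_with_entities(fake_sents))
```

Using `sum` over a generator avoids building an intermediate list just to take its length.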

**9. CHALLENGE: Display the named entity visualization for the first sentence of the document, `list_of_sents[0]`**

In [12]:
list_of_sents = list(doc.sents)
displacy.render(list_of_sents[0], style='ent', jupyter=True)

### Great Job!