# Parts of Speech Assessment

For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

**1. Create a Doc object from the file `peterrabbit.txt`**<br>
> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`

In [2]:
with open('../TextFiles/peterrabbit.txt') as file:
    doc = nlp(file.read())

# Optional sanity check: render the named entities found in the whole document.
displacy.render(doc, style='ent', jupyter=True)

**2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.**

In [3]:
# Print the text, coarse POS tag, fine-grained tag, and tag description for each token:
for token in list(doc.sents)[2]:
    print("{:<10} {:<10} {:<10} {:<10}".format(token.text, token.pos_, token.tag_, spacy.explain(token.tag_)))

They       PRON       PRP        pronoun, personal
lived      VERB       VBD        verb, past tense
with       ADP        IN         conjunction, subordinating or preposition
their      PRON       PRP$       pronoun, possessive
Mother     NOUN       NN         noun, singular or mass
in         ADP        IN         conjunction, subordinating or preposition
a          DET        DT         determiner
sand       NOUN       NN         noun, singular or mass
-          PUNCT      HYPH       punctuation mark, hyphen
bank       NOUN       NN         noun, singular or mass
,          PUNCT      ,          punctuation mark, comma
underneath ADP        IN         conjunction, subordinating or preposition
the        DET        DT         determiner
root       NOUN       NN         noun, singular or mass
of         ADP        IN         conjunction, subordinating or preposition
a          DET        DT         determiner

          SPACE      _SP        whitespace
very       ADV        RB         adverb
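
The blank row in the output above comes from a newline token, whose text is printed literally and splits the row in two. Below is a minimal sketch of one way to keep the table aligned, assuming it is acceptable to show each token's `repr()` instead of its raw text. The `rows` list and the `format_row` helper are hypothetical stand-ins that reuse three rows from the output above rather than calling spaCy.

```python
# Stand-in rows copied from the output above: (text, POS, TAG, description).
rows = [
    ("of", "ADP", "IN", "conjunction, subordinating or preposition"),
    ("a", "DET", "DT", "determiner"),
    ("\n", "SPACE", "_SP", "whitespace"),
]


def format_row(text, pos, tag, description):
    # !r applies repr(), so a newline is shown as an escape sequence
    # instead of breaking the row across two lines.
    return f"{text!r:<12} {pos:<10} {tag:<10} {description}"


lines = [format_row(*row) for row in rows]
for line in lines:
    print(line)
```

In the notebook, the same idea would apply to `token.text` inside the existing loop.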

**3. Provide a frequency list of POS tags from the entire document**

In [4]:
POS_counts = doc.count_by(spacy.attrs.POS)

for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')

84. ADJ  : 53
85. ADP  : 125
86. ADV  : 63
87. AUX  : 49
89. CCONJ: 61
90. DET  : 90
92. NOUN : 172
93. NUM  : 9
94. PART : 28
95. PRON : 110
96. PROPN: 74
97. PUNCT: 171
98. SCONJ: 19
100. VERB : 135
103. SPACE: 99


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
HINT: in the frequency list above, the attribute ID for 'NOUN' is 92

In [5]:

total_tokens = len(doc)
noun_tokens = len([token for token in doc if token.pos_ == 'NOUN'])
noun_percentage = (noun_tokens / total_tokens) * 100
noun_percentage

13.672496025437203
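
As a cross-check, the same percentage can be recovered from the frequency list in problem 3. This is a minimal sketch, assuming the counts are copied by hand from that output into a hypothetical `pos_counts` dict; it does not call spaCy.

```python
# Counts copied from the POS frequency list printed in problem 3.
pos_counts = {
    "ADJ": 53, "ADP": 125, "ADV": 63, "AUX": 49, "CCONJ": 61,
    "DET": 90, "NOUN": 172, "NUM": 9, "PART": 28, "PRON": 110,
    "PROPN": 74, "PUNCT": 171, "SCONJ": 19, "VERB": 135, "SPACE": 99,
}

# Every token carries exactly one POS tag, so the counts sum to len(doc).
table_total = sum(pos_counts.values())
noun_share = pos_counts["NOUN"] / table_total * 100

print(f"{pos_counts['NOUN']} / {table_total} = {noun_share:.2f}%")
```

Note that the denominator includes the 99 `SPACE` and 171 `PUNCT` tokens, so the share of nouns among word tokens alone would be higher.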

**5. Display the Dependency Parse for the third sentence**

In [6]:
displacy.render(list(doc.sents)[2], style='dep', jupyter=True, options={'distance': 110})

**6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit***

In [7]:

for entity in doc.ents[:2]:
    print(entity.text, entity.label_)

The Tale of Peter Rabbit WORK_OF_ART
Beatrix Potter PERSON


**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [8]:
sentences = list(doc.sents)
print("Number of sentences:", len(sentences))

Number of sentences: 55


**8. CHALLENGE: How many sentences contain named entities?**

In [9]:

num_sentences_with_entities = 0

for sentence in sentences:
    if any(token.ent_type_ for token in sentence):
        num_sentences_with_entities += 1

num_sentences_with_entities

35

**9. CHALLENGE: Display the named entity visualization for the first sentence of the document (`sentences[0]` from problem 7)**

In [10]:
displacy.render(list(doc.sents)[0], style='ent', jupyter=True)