For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [7]:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

I'm using the latest version of spaCy available at the time of writing; the exact version is printed below.

In [8]:
print(spacy.__version__)

3.2.0


**1. Create a Doc object from the file `peterrabbit.txt`**

In [9]:
with open('../TextFiles/peterrabbit.txt') as f:
    doc = f.read()  # raw text for now; the nlp(doc) call below turns it into a Doc

**2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.**

In [13]:
doc = nlp(doc)

In [14]:
sent_list = [sent for sent in doc.sents]

In [19]:
sent_list[2]



They lived with their Mother in a sand-bank, underneath the root of a
very big fir-tree.

In [21]:
for token in sent_list[2]:
    print(f"{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}} {spacy.explain(token.tag_)}")



         SPACE      _SP        whitespace
They       PRON       PRP        pronoun, personal
lived      VERB       VBD        verb, past tense
with       ADP        IN         conjunction, subordinating or preposition
their      PRON       PRP$       pronoun, possessive
Mother     PROPN      NNP        noun, proper singular
in         ADP        IN         conjunction, subordinating or preposition
a          DET        DT         determiner
sand       NOUN       NN         noun, singular or mass
-          PUNCT      HYPH       punctuation mark, hyphen
bank       NOUN       NN         noun, singular or mass
,          PUNCT      ,          punctuation mark, comma
underneath ADP        IN         conjunction, subordinating or preposition
the        DET        DT         determiner
root       NOUN       NN         noun, singular or mass
of         ADP        IN         conjunction, subordinating or preposition
a          DET        DT         determiner

          SPACE      _SP       
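The `{token.text:{10}}` syntax in the loop above is an f-string format spec with a nested width: the inner `{10}` is evaluated first, and the field is then padded to 10 characters. A minimal sketch in plain Python, using a few rows copied from the output above (the `rows` list and the `format_row` helper are illustrative names, not part of spaCy):

```python
# Rows copied from the output above: (text, coarse POS, fine-grained tag).
rows = [
    ("They", "PRON", "PRP"),
    ("lived", "VERB", "VBD"),
    ("underneath", "ADP", "IN"),
]


def format_row(text, pos, tag, width=10):
    # The nested {width} is substituted first, so with width=10 this is
    # equivalent to f"{text:10} {pos:10} {tag:10}".
    return f"{text:{width}} {pos:{width}} {tag:{width}}"


for row in rows:
    print(format_row(*row))
```

Strings are left-aligned by default, which is why the columns line up as long as no token is longer than the chosen width.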

**3. Provide a frequency list of POS tags from the entire document**

In [24]:
POS_count = doc.count_by(spacy.attrs.POS)

In [25]:
POS_count

{90: 91,
 96: 77,
 85: 121,
 97: 173,
 93: 9,
 103: 99,
 86: 64,
 98: 20,
 92: 168,
 95: 108,
 100: 137,
 84: 51,
 89: 61,
 87: 49,
 94: 30}

In [37]:
for k, v in sorted(POS_count.items()):
    print(f"{k:{5}}. {doc.vocab[k].text:{10}} {v}")

   84. ADJ        51
   85. ADP        121
   86. ADV        64
   87. AUX        49
   89. CCONJ      61
   90. DET        91
   92. NOUN       168
   93. NUM        9
   94. PART       30
   95. PRON       108
   96. PROPN      77
   97. PUNCT      173
   98. SCONJ      20
  100. VERB       137
  103. SPACE      99
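The listing above is ordered by POS ID rather than by frequency. As a rough sketch, the same counts can be sorted so that the most frequent tags come first. The `pos_counts` dictionary below is a hand-copied stand-in for the notebook's results (an assumption for illustration, not something spaCy returns in this shape):

```python
# POS label -> count, copied from the listing above.
pos_counts = {
    "ADJ": 51, "ADP": 121, "ADV": 64, "AUX": 49, "CCONJ": 61,
    "DET": 91, "NOUN": 168, "NUM": 9, "PART": 30, "PRON": 108,
    "PROPN": 77, "PUNCT": 173, "SCONJ": 20, "VERB": 137, "SPACE": 99,
}

# Sort by count, most frequent first.
by_frequency = sorted(pos_counts.items(), key=lambda item: item[1], reverse=True)

for label, count in by_frequency:
    print(f"{label:{10}} {count}")
```

In the notebook itself, the same `sorted(..., key=..., reverse=True)` call could be applied to `POS_count.items()` directly.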


**4. CHALLENGE: What percentage of tokens are nouns?**

In [40]:
len(doc)

1258

In [39]:
POS_count[doc.vocab.strings['NOUN']]/len(doc)  # look up the NOUN ID (92) instead of hard-coding it

0.13354531001589826

That is, about 13.35% of the tokens (168 of 1258) are nouns.
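The ratio above counts every token in the denominator, including whitespace and punctuation. The sketch below reproduces that figure as a formatted percentage and, as an alternative reading of the question (an assumption, not something the exercise asks for), shows the share when `SPACE` and `PUNCT` tokens are excluded. The numbers are copied from the frequency list; the variable names are illustrative:

```python
noun_count = 168      # NOUN tokens, from the frequency list
total_tokens = 1258   # len(doc)
space_count = 99      # SPACE tokens
punct_count = 173     # PUNCT tokens

# Share of nouns among all tokens, as in the cell above.
share_all = noun_count / total_tokens
print(f"{share_all:.2%}")

# Alternative: ignore whitespace and punctuation tokens in the denominator.
word_tokens = total_tokens - space_count - punct_count
share_words = noun_count / word_tokens
print(f"{share_words:.2%}")
```

The `.2%` format spec multiplies by 100 and appends the percent sign, so no manual scaling is needed.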

**5. Display the Dependency Parse for the third sentence**

In [42]:
displacy.render(sent_list[2], style = 'dep', jupyter=True, options={'distance': 110})

**6. Show the first two named entities from Beatrix Potter's The Tale of Peter Rabbit**

In [58]:
def show_ent(doc):
    if doc.ents:
        for ent in doc.ents[:2]:
            print(f"{ent.text:{25}} {ent.label_:{15}} {spacy.explain(ent.label_)}")
    else:
        print("No Entities were found")

In [59]:
show_ent(doc)

The Tale of Peter Rabbit  WORK_OF_ART     Titles of books, songs, etc.
Beatrix Potter            PERSON          People, including fictional


**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [60]:
sent_list = [sent for sent in doc.sents]

In [61]:
len(sent_list)

58

**8. CHALLENGE: How many sentences contain named entities?**

In [66]:
count = 0
for sent in sent_list:
    if sent.ents:
        count = count+1
print(count)

30


In [71]:
with open('../TextFiles/peterrabbit.txt') as f:
    doc = f.read()

In [72]:
doc = nlp(doc)

Another way of doing it: re-parse each sentence as its own Doc and keep only the ones that contain entities.

In [74]:
list_of_sents = [nlp(sent.text) for sent in doc.sents]
list_of_ners = [sent_doc for sent_doc in list_of_sents if sent_doc.ents]  # sent_doc avoids reusing the name doc
len(list_of_ners)

30

**9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem**

In [77]:
displacy.render(list_of_sents[0], style='ent', jupyter=True)