## Parts of Speech Assessment

For this assessment we'll be using the short story The Tale of Peter Rabbit by Beatrix Potter (1902).

The story is in the public domain; the text file was obtained from Project Gutenberg.

In [1]:
# RUN THE CELL to perform standard imports:

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

### 1. Create a Doc object from the file the_tale_of_peter_rabbit.txt

In [2]:
with open(r'the_tale_of_peter_rabbit.txt','r') as file:
    data = file.read()

In [3]:
doc = nlp(data)

### 2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag

In [4]:
# Collect the sentences, skipping any that consist only of whitespace
doc_sentence = []
for sentence in doc.sents:
    if not sentence.text.isspace():
        doc_sentence.append(sentence)

In [5]:
for token in doc_sentence[2]:
    print(f"{token.text:{11}} {token.pos_:{10}} {token.tag_:{10}} {spacy.explain(token.tag_):{10}}")

They        PRON       PRP        pronoun, personal
lived       VERB       VBD        verb, past tense
with        ADP        IN         conjunction, subordinating or preposition
their       PRON       PRP$       pronoun, possessive
Mother      NOUN       NN         noun, singular or mass
in          ADP        IN         conjunction, subordinating or preposition
a           DET        DT         determiner
sand        NOUN       NN         noun, singular or mass
-           PUNCT      HYPH       punctuation mark, hyphen
bank        NOUN       NN         noun, singular or mass
,           PUNCT      ,          punctuation mark, comma
underneath  ADP        IN         conjunction, subordinating or preposition
the         DET        DT         determiner
root        NOUN       NN         noun, singular or mass
of          ADP        IN         conjunction, subordinating or preposition
a           DET        DT         determiner
very        ADV        RB         adverb    
big         AD
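
`spacy.explain` returns `None` for a tag it has no description for, and formatting `None` with a width specifier raises a `TypeError`. A minimal sketch of a row formatter that tolerates this, assuming a hypothetical helper name `format_row` and a made-up tag `ZZZ` for the missing-description case; the first sample row is copied from the output above:

```python
def format_row(text, pos, tag, description):
    """Format one token row; `description` may be None (hypothetical helper)."""
    # Substitute a placeholder so the f-string never has to format None.
    if description is None:
        description = "n/a"
    return f"{text:<11} {pos:<10} {tag:<10} {description}"

# Row copied from the printed output above.
print(format_row("They", "PRON", "PRP", "pronoun, personal"))
# "ZZZ" is a made-up tag standing in for one without a description.
print(format_row("etc", "X", "ZZZ", None))
```

In the notebook this could be called as `print(format_row(token.text, token.pos_, token.tag_, spacy.explain(token.tag_)))` inside the loop.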

### 3. Provide a frequency list of POS tags from the entire document

In [6]:
POS_count = sorted(doc.count_by(spacy.attrs.POS).items())

for pos_tag, count in POS_count:
    print(f'{pos_tag:{1}}. {doc.vocab[pos_tag].text:{5}} : {count:{5}}')

84. ADJ   :    54
85. ADP   :   124
86. ADV   :    67
87. AUX   :    43
89. CCONJ :    60
90. DET   :    93
92. NOUN  :   167
93. NUM   :     8
94. PART  :    27
95. PRON  :   105
96. PROPN :    69
97. PUNCT :   172
98. SCONJ :    16
100. VERB  :   139
101. X     :     1
103. SPACE :    36
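
The list above is ordered by attribute ID. Ranking by frequency is often easier to read; the sketch below does that on a plain dictionary whose counts are copied from the output above (the names `pos_counts` and `ranked` are illustrative, not part of the notebook):

```python
# Counts copied from the frequency list printed above.
pos_counts = {
    "ADJ": 54, "ADP": 124, "ADV": 67, "AUX": 43, "CCONJ": 60,
    "DET": 93, "NOUN": 167, "NUM": 8, "PART": 27, "PRON": 105,
    "PROPN": 69, "PUNCT": 172, "SCONJ": 16, "VERB": 139, "X": 1,
    "SPACE": 36,
}

# Most frequent first; ties are broken alphabetically by tag name.
ranked = sorted(pos_counts.items(), key=lambda item: (-item[1], item[0]))

for name, count in ranked:
    print(f"{name:<5} : {count:>5}")
```

With the live `Doc`, the same dictionary can be built as `{doc.vocab[k].text: v for k, v in doc.count_by(spacy.attrs.POS).items()}`.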


### 4. CHALLENGE: What percentage of tokens are nouns?
HINT: the attribute ID for 'NOUN' is 92

In [7]:
total_count = 0
for pos_tag, count in POS_count:
    total_count += count
    # 92 is the attribute ID for NOUN (see the hint above)
    if pos_tag == 92:
        noun_count = count

noun_percentage = (noun_count / total_count) * 100

print(f'{noun_count}/{total_count} = {noun_percentage}%')

167/1181 = 14.140558848433532%
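
The total of 1181 includes the `SPACE` and `PUNCT` tokens, so the percentage depends on what is counted as a token. A small sketch of how the figure shifts when those are excluded, using the counts printed in problem 3; the helper `share` and its `exclude` parameter are hypothetical names introduced for illustration:

```python
# Counts copied from the frequency list in problem 3.
pos_counts = {
    "ADJ": 54, "ADP": 124, "ADV": 67, "AUX": 43, "CCONJ": 60,
    "DET": 93, "NOUN": 167, "NUM": 8, "PART": 27, "PRON": 105,
    "PROPN": 69, "PUNCT": 172, "SCONJ": 16, "VERB": 139, "X": 1,
    "SPACE": 36,
}

def share(counts, tag, exclude=()):
    """Percentage of `tag` among all tags not listed in `exclude`."""
    total = sum(n for name, n in counts.items() if name not in exclude)
    return 100 * counts[tag] / total

all_tokens = share(pos_counts, "NOUN")
no_space = share(pos_counts, "NOUN", exclude={"SPACE"})
words_only = share(pos_counts, "NOUN", exclude={"SPACE", "PUNCT"})

print(f"all tokens:          {all_tokens:.2f}%")
print(f"without SPACE:       {no_space:.2f}%")
print(f"without SPACE/PUNCT: {words_only:.2f}%")
```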


### 5. Display the Dependency Parse for the third sentence

In [8]:
# The third sentence is at index 2 (the same sentence used in problem 2)
displacy.render(doc_sentence[2], style='dep', jupyter=True)

### 6. Show the first two named entities from Beatrix Potter's **The Tale of Peter Rabbit**

In [9]:
def show_ents(doc):
    # Print each entity's text, label, and the label's description
    if doc.ents:
        for ent in doc.ents:
            print(f"{ent.text} - {ent.label_} - {spacy.explain(ent.label_)}")

In [10]:
doc_sentence[0]

THE TALE OF PETER RABBIT, BY BEATRIX POTTER (1902).

In [11]:
show_ents(doc_sentence[0][:9])

THE TALE OF PETER RABBIT - ORG - Companies, agencies, institutions, etc.


In [12]:
# Fix "THE TALE OF PETER RABBIT": relabel it from ORG to WORK_OF_ART
from spacy.tokens import Span

ents = list(doc.ents)
old_ent = ents[0]
new_ent = Span(doc, old_ent.start, old_ent.end, label="WORK_OF_ART")
ents[0] = new_ent
doc.ents = ents

In [13]:
show_ents(doc_sentence[0][:9])

THE TALE OF PETER RABBIT - WORK_OF_ART - Titles of books, songs, etc.


In [14]:
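# spaCy does not recognize "BEATRIX POTTER" as a person, so add the entity manually.
# Tokens 7 and 8 of the Doc are "BEATRIX" and "POTTER" (the end index 9 is exclusive).

PERSON = doc.vocab.strings['PERSON']
new_ent = Span(doc, 7, 9, label=PERSON)
doc.ents = list(doc.ents) + [new_ent]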

In [15]:
show_ents(doc_sentence[0][:9])

THE TALE OF PETER RABBIT - WORK_OF_ART - Titles of books, songs, etc.
BEATRIX POTTER - PERSON - People, including fictional
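
The token indices 7 and 9 used for the `PERSON` span are hardcoded. A minimal sketch of how they could be located instead, assuming a hypothetical helper `find_span` and a token list reconstructed from the first sentence shown above:

```python
def find_span(tokens, phrase):
    """Return (start, end) indices of `phrase` in `tokens`, end exclusive.

    Hypothetical helper; returns None when the phrase does not occur.
    """
    n = len(phrase)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            return start, start + n
    return None

# Token texts for the start of the title sentence, assumed from the output above.
tokens = ["THE", "TALE", "OF", "PETER", "RABBIT", ",", "BY", "BEATRIX", "POTTER"]

print(find_span(tokens, ["BEATRIX", "POTTER"]))
```

In the notebook, `tokens` could be built as `[t.text for t in doc]`, and the returned pair passed on as `Span(doc, start, end, label="PERSON")`.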


### 7. How many sentences are contained in the Tale of Peter Rabbit?

In [16]:
len(doc_sentence)

56

### 8. CHALLENGE: How many sentences contain named entities?

In [17]:
# Note: this tallies the total number of named entities across all sentences,
# not the number of sentences that contain at least one entity.
total_entities = 0
for sentence in doc_sentence:
    total_entities += len(sentence.ents)

print("Total named entities across all sentences:", total_entities)

Total named entities across all sentences: 50
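
The question asks for the number of sentences that contain at least one named entity, which is a different figure from the total number of entities. A minimal sketch of that count, assuming a hypothetical helper `count_sentences_with_entities`; the sample objects are stand-ins that only mimic the `.ents` attribute of a spaCy sentence span, not real data from the story:

```python
from types import SimpleNamespace

def count_sentences_with_entities(sentences):
    """Count the sentences whose `.ents` is non-empty (hypothetical helper)."""
    return sum(1 for sentence in sentences if len(sentence.ents) > 0)

# Stand-in objects with an `.ents` attribute; the entity names are illustrative.
sample = [
    SimpleNamespace(ents=("Peter",)),
    SimpleNamespace(ents=()),
    SimpleNamespace(ents=("Mr. McGregor", "Peter")),
]

# Two of the three stand-in sentences carry entities, three entities in total.
print(count_sentences_with_entities(sample))
```

In the notebook the call would be `count_sentences_with_entities(doc_sentence)`.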


### 9. CHALLENGE: Display the named entity visualization for doc_sentence[0] (the first sentence) from the previous problem

In [18]:
displacy.render(doc_sentence[0], style ='ent', jupyter=True)