# Parts of Speech Assessment

For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')  # load the small English pipeline
from spacy import displacy

**1. Create a Doc object from the file `peterrabbit.txt`**<br>
> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`

In [2]:
with open('../TextFiles/peterrabbit.txt') as f:
    doc = nlp(f.read())

**2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag**

In [3]:
# Enter your code here:
sent_list = [sent for sent in doc.sents]
print(sent_list[2])



They lived with their Mother in a sand-bank, underneath the root of a
very big fir-tree.


In [4]:
for token in sent_list[2]:
    print(f'{token.text:{10}} {token.pos_:{5}} {token.tag_:{5}} {spacy.explain(token.tag_):{10}}')



         SPACE _SP   whitespace
They       PRON  PRP   pronoun, personal
lived      VERB  VBD   verb, past tense
with       ADP   IN    conjunction, subordinating or preposition
their      PRON  PRP$  pronoun, possessive
Mother     PROPN NNP   noun, proper singular
in         ADP   IN    conjunction, subordinating or preposition
a          DET   DT    determiner
sand       NOUN  NN    noun, singular or mass
-          PUNCT HYPH  punctuation mark, hyphen
bank       NOUN  NN    noun, singular or mass
,          PUNCT ,     punctuation mark, comma
underneath ADP   IN    conjunction, subordinating or preposition
the        DET   DT    determiner
root       NOUN  NN    noun, singular or mass
of         ADP   IN    conjunction, subordinating or preposition
a          DET   DT    determiner

          SPACE _SP   whitespace
very       ADV   RB    adverb    
big        ADJ   JJ    adjective (English), other noun-modifier (Chinese)
fir        NOUN  NN    noun, singular or mass
-          PUN
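`spacy.explain` returns `None` for a label it does not know, and formatting `None` with a width spec such as `:{10}` raises a `TypeError`. The sketch below shows one way to guard the row formatting; `format_row` is a hypothetical helper (not part of spaCy), and the `'ZZZ'` tag is made up purely to exercise the fallback.

```python
def format_row(text, pos, tag, description):
    """Format one table row; `description` may be None for unknown labels."""
    # An empty string stands in for a missing description so that the width
    # spec below never receives None (which would raise a TypeError).
    shown = description or ''
    return f'{text:{10}} {pos:{5}} {tag:{5}} {shown:{10}}'

# A row with a known description, as in the output above.
print(format_row('They', 'PRON', 'PRP', 'pronoun, personal'))

# A row whose (hypothetical) tag has no description.
print(format_row('xyz', 'X', 'ZZZ', None))
```

With spaCy loaded, the call would look like `format_row(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))`.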

**3. Provide a frequency list of POS tags from the entire document**

In [5]:
POS_count = doc.count_by(spacy.attrs.POS)
POS_count

{90: 91,
 96: 77,
 85: 121,
 97: 173,
 93: 9,
 103: 99,
 86: 64,
 98: 20,
 92: 168,
 95: 108,
 100: 137,
 84: 51,
 89: 61,
 87: 49,
 94: 30}

In [6]:
POS_count.items()

dict_items([(90, 91), (96, 77), (85, 121), (97, 173), (93, 9), (103, 99), (86, 64), (98, 20), (92, 168), (95, 108), (100, 137), (84, 51), (89, 61), (87, 49), (94, 30)])

In [7]:
doc.vocab[93].text

'NUM'

In [8]:
# sorted() orders the items by key, i.e. by POS attribute ID.
for key, value in sorted(POS_count.items()):
    print(f'{key} {doc.vocab[key].text:{5}}: {value}')

84 ADJ  : 51
85 ADP  : 121
86 ADV  : 64
87 AUX  : 49
89 CCONJ: 61
90 DET  : 91
92 NOUN : 168
93 NUM  : 9
94 PART : 30
95 PRON : 108
96 PROPN: 77
97 PUNCT: 173
98 SCONJ: 20
100 VERB : 137
103 SPACE: 99
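The listing above is ordered by attribute ID. To rank the tags by frequency instead, the same pairs can be sorted on the count. The sketch below hard-codes the counts printed above in a dictionary named `pos_counts` (a name chosen here for illustration) so that it runs without loading a model.

```python
# Counts copied from the output above, keyed by POS name.
pos_counts = {
    'ADJ': 51, 'ADP': 121, 'ADV': 64, 'AUX': 49, 'CCONJ': 61,
    'DET': 91, 'NOUN': 168, 'NUM': 9, 'PART': 30, 'PRON': 108,
    'PROPN': 77, 'PUNCT': 173, 'SCONJ': 20, 'VERB': 137, 'SPACE': 99,
}

# Sort on the count (second element of each pair), largest first.
by_frequency = sorted(pos_counts.items(), key=lambda item: item[1], reverse=True)

for name, count in by_frequency:
    print(f'{name:{5}}: {count}')
```

With the live `POS_count` dictionary, the equivalent would be to sort `POS_count.items()` with the same `key` and look up each name via `doc.vocab[key].text`.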


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
HINT: the attribute ID for 'NOUN' is 92

In [9]:
doc.vocab[92].text

'NOUN'

In [10]:
POS_count[92]

168

In [11]:
percent = 100 * POS_count[92] / len(doc)
print(f'{POS_count[92]} / {len(doc)} = {percent:.4}%')

168 / 1258 = 13.35%
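The denominator above is `len(doc)`, which includes the 99 `SPACE` and 173 `PUNCT` tokens from the frequency list. As a rough variation, the share of nouns among the remaining tokens can be worked out from the same counts. This is only a sketch with the numbers hard-coded from the outputs above; treating `SPACE` and `PUNCT` as "non-word" tokens is an assumption made here, not part of the exercise.

```python
# Values copied from the outputs above.
total_tokens = 1258
space_tokens = 99
punct_tokens = 173
noun_tokens = 168

# Drop whitespace and punctuation tokens from the denominator.
word_like_tokens = total_tokens - space_tokens - punct_tokens

noun_share = 100 * noun_tokens / word_like_tokens
print(f'{noun_tokens} / {word_like_tokens} = {noun_share:.4}%')
```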


**5. Display the dependency parse for the third sentence**

In [12]:
displacy.render(list(doc.sents)[2], style = 'dep', jupyter = True, options = {'distance' : 100})

In [13]:
for token in sent_list[2]:
    print(token.dep_)

dep
nsubj
ROOT
prep
poss
pobj
prep
det
compound
punct
pobj
punct
prep
det
pobj
prep
det
dep
advmod
amod
compound
punct
pobj
punct
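The two `dep` labels in this list belong to the whitespace tokens seen earlier in the POS table. A possible way to make the listing easier to read is to skip whitespace tokens and print each token next to its label and head. The sketch below uses a hypothetical helper, `dependency_rows`, and stand-in token objects so that it runs without a model; the attributes it reads (`text`, `dep_`, `head`, `is_space`) are the spaCy `Token` attributes of the same names, and the head assigned to the stand-in whitespace token is an assumption.

```python
from types import SimpleNamespace

def dependency_rows(sentence):
    """Return (text, dep label, head text) for every non-whitespace token."""
    return [(token.text, token.dep_, token.head.text)
            for token in sentence
            if not token.is_space]

# Stand-in tokens for illustration only; with spaCy loaded you would pass
# sent_list[2] to dependency_rows instead.
lived = SimpleNamespace(text='lived', dep_='ROOT', is_space=False)
lived.head = lived  # in spaCy the root token is its own head
newline = SimpleNamespace(text='\n', dep_='dep', is_space=True, head=lived)
they = SimpleNamespace(text='They', dep_='nsubj', is_space=False, head=lived)

for text, dep, head in dependency_rows([newline, they, lived]):
    print(f'{text:{10}} {dep:{10}} {head}')
```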


**6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit***

In [14]:
doc.ents[:2]

(The Tale of Peter Rabbit, Beatrix Potter)

In [15]:
def show_entity(doc):
    # Print text, label, and label description for the first two entities.
    if doc.ents:
        for ent in doc.ents[:2]:
            print(f'{ent.text:{20}} {ent.label_:{15}} {str(spacy.explain(ent.label_)):{20}}')
    else:
        print('No named entities found.')

In [16]:
show_entity(doc)

The Tale of Peter Rabbit WORK_OF_ART     Titles of books, songs, etc.
Beatrix Potter       PERSON          People, including fictional


**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [17]:
len(sent_list)

58

**8. CHALLENGE: How many sentences contain named entities?**

In [19]:
# Re-parse each sentence as its own Doc, then keep those with at least one entity.
list_of_sents = [nlp(sent.text) for sent in doc.sents]
list_of_ners = [sent_doc for sent_doc in list_of_sents if sent_doc.ents]
len(list_of_ners)

30
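The cell above re-parses every sentence with `nlp`. Sentence spans from `doc.sents` also expose an `.ents` attribute, so the count can be taken without a second parse. The sketch below uses a hypothetical helper, `count_sentences_with_entities`, with stand-in sentence objects so that it runs without a model; the entity tuples on the stand-ins are illustrative. Note that the result on the real document may differ from 30, because a sentence parsed in isolation can receive different entity predictions than the same sentence parsed in context.

```python
from types import SimpleNamespace

def count_sentences_with_entities(sentences):
    """Count the sentences whose .ents sequence is non-empty."""
    return sum(1 for sent in sentences if sent.ents)

# Stand-in sentences for illustration only; with spaCy loaded you would
# call count_sentences_with_entities(doc.sents).
sentences = [
    SimpleNamespace(
        text='The Tale of Peter Rabbit, by Beatrix Potter (1902).',
        ents=('The Tale of Peter Rabbit', 'Beatrix Potter'),
    ),
    SimpleNamespace(
        text='They lived with their Mother in a sand-bank.',
        ents=(),
    ),
]

print(count_sentences_with_entities(sentences))
```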

In [22]:
list_of_ners

[The Tale of Peter Rabbit, by Beatrix Potter (1902).,
 
 
 Once upon a time there were four little Rabbits, and their names
 were--
 
           Flopsy,
        Mopsy,
    Cotton-tail,
 and Peter.,
 
 
 'Now my dears,' said old Mrs. Rabbit one morning, 'you may go into
 the fields or down the lane, but don't go into Mr. McGregor's garden:
 your Father had an accident there; he was put in a pie by Mrs.
 McGregor.',
 
 
 Then old Mrs. Rabbit took a basket and her umbrella, and went through
 the wood to the baker's.,
 She bought a loaf of brown bread and five
 currant buns.,
 
 
 Flopsy, Mopsy, and Cottontail, who were good little bunnies, went
 down the lane to gather blackberries:
 
 But Peter, who was very naughty, ran straight away to Mr. McGregor's
 garden, and squeezed under the gate!,
 
 
 First he ate some lettuces and some French beans; and then he ate
 some radishes;
 ,
 
 
 Mr. McGregor was on his hands and knees planting out young cabbages,
 but he jumped up and ran after Pete

**9. CHALLENGE: Display the named-entity visualization for `list_of_sents[0]` from the previous problem**

In [20]:
displacy.render(list_of_sents[0], style = 'ent', jupyter = True)

### Great Job!