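# Writing a program that POS-tags an arbitrary input sentence
The tagger is based on the backoff n-gram tagger from Section 5.5 of the NLTK book and is trained on the full Brown corpus, with the last 10% of sentences held out for evaluation.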

Combining taggers with backoff tagging

- Backoff tagging is one of the core features of SequentialBackoffTagger.
- It allows you to chain taggers together so that if one tagger doesn't know how to tag a word, it can pass the word on to the next backoff tagger.
- If that one can't do it either, it can pass the word on to the next backoff tagger, and so on until there are no backoff taggers left.
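
The chaining described above can be sketched without NLTK. The classes below (`ToyDefaultTagger`, `ToyUnigramTagger`, `ToyBigramTagger`) and the three training sentences are made up for illustration; they only mimic the idea of "answer if the context was seen, otherwise ask the backoff tagger" and are not NLTK's implementation.

```python
from collections import Counter, defaultdict

class ToyDefaultTagger:
    """Last link in the chain: always answers with the same tag."""
    def __init__(self, tag):
        self.tag = tag

    def choose(self, word, prev_tag):
        return self.tag

class ToyUnigramTagger:
    """Most frequent tag per word; asks `backoff` about unseen words."""
    def __init__(self, train_sents, backoff):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        self.table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def choose(self, word, prev_tag):
        if word in self.table:
            return self.table[word]
        return self.backoff.choose(word, prev_tag)

class ToyBigramTagger:
    """Most frequent tag per (previous tag, word); asks `backoff` otherwise."""
    def __init__(self, train_sents, backoff):
        counts = defaultdict(Counter)
        for sent in train_sents:
            prev = None
            for word, tag in sent:
                counts[(prev, word)][tag] += 1
                prev = tag
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}
        self.backoff = backoff

    def choose(self, word, prev_tag):
        context = (prev_tag, word)
        if context in self.table:
            return self.table[context]
        return self.backoff.choose(word, prev_tag)

def tag_sentence(tagger, words):
    """Tag left to right, feeding each chosen tag into the next decision."""
    tagged, prev = [], None
    for word in words:
        tag = tagger.choose(word, prev)
        tagged.append((word, tag))
        prev = tag
    return tagged

# Made-up training data, just large enough to exercise every link.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("dogs", "NOUN"), ("like", "VERB"), ("the", "DET"), ("park", "NOUN")],
    [("a", "DET"), ("bird", "NOUN"), ("flies", "VERB"),
     ("like", "ADP"), ("a", "DET"), ("kite", "NOUN")],
]
chain = ToyBigramTagger(
    train, backoff=ToyUnigramTagger(train, backoff=ToyDefaultTagger("UNK")))

result = tag_sentence(chain, ["the", "dogs", "like", "cats"])
print(result)
# [('the', 'DET'), ('dogs', 'NOUN'), ('like', 'VERB'), ('cats', 'UNK')]
```

Here "the" and "like" are answered by the bigram table, "dogs" falls back to the unigram table because the context (DET, "dogs") was never seen, and "cats" falls all the way through to the default tag.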

In [1]:
import nltk
from nltk.corpus import brown

### 1. Universal tagset

In [2]:
brown_tagged_sents = brown.tagged_sents(tagset='universal')
brown_sents = brown.sents()

In [3]:
# split train/test data
size = int(len(brown_tagged_sents)*0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
size

51606

In [4]:
t0 = nltk.DefaultTagger('UNK')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

In [5]:
t1.evaluate(test_sents)

0.9156346262651662

In [6]:
t2.evaluate(test_sents)

0.9245510362314285

In [7]:
t3.evaluate(test_sents)

0.9230527440749355

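-> Backing off up to the trigram tagger (t3) scored slightly lower than backing off only up to the bigram tagger (t2): 0.9231 vs. 0.9246, a gap of about 0.15 percentage points. Tagging with t2 may therefore be the better choice here.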
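
The scores above are token-level accuracies: the fraction of test tokens whose predicted tag equals the gold tag. A minimal sketch of that computation follows; the helper name `token_accuracy` and the example sentences are hypothetical. Recent NLTK releases also offer `accuracy()` on taggers as the replacement for the deprecated `evaluate()`.

```python
def token_accuracy(predicted_sents, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for predicted, gold in zip(predicted_sents, gold_sents):
        for (_, predicted_tag), (_, gold_tag) in zip(predicted, gold):
            correct += predicted_tag == gold_tag
            total += 1
    return correct / total if total else 0.0

# Made-up gold and predicted taggings of one sentence.
gold = [[("time", "NOUN"), ("flies", "VERB"), ("like", "ADP"),
         ("an", "DET"), ("arrow", "NOUN")]]
pred = [[("time", "NOUN"), ("flies", "NOUN"), ("like", "VERB"),
         ("an", "DET"), ("arrow", "NOUN")]]

acc = token_accuracy(pred, gold)
print(acc)  # 3 of 5 tags match: 0.6
```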

### Example sentence

In [8]:
# Tokenize the input, then print only the tag of each token.
tokens = nltk.word_tokenize(input('Enter a sentence: '))
print('Pos Tagging result : ', end="")
for _, tag in t2.tag(tokens):
    print(tag, end=" ")

Enter a sentence: time flies like an arrow.
Pos Tagging result : NOUN VERB ADP DET NOUN . 

-> "time flies like an arrow" is tagged correctly.

Other example sentences

In [9]:
# Tokenize the input, then print only the tag of each token.
tokens = nltk.word_tokenize(input('Enter a sentence: '))
print('Pos Tagging result : ', end="")
for _, tag in t2.tag(tokens):
    print(tag, end=" ")

Enter a sentence: We will see that the tag of a word depends on the word and its context within a sentence. For this reason, we will be working with data at the level of sentences rather than words.
Pos Tagging result : PRON VERB VERB ADP DET NOUN ADP DET NOUN VERB ADP DET NOUN CONJ DET NOUN ADP DET NOUN . ADP DET NOUN . PRON VERB VERB VERB ADP NOUN ADP DET NOUN ADP NOUN ADP ADP NOUN . 

In [10]:
# Tokenize the input, then print only the tag of each token.
tokens = nltk.word_tokenize(input('Enter a sentence: '))
print('Pos Tagging result : ', end="")
for _, tag in t2.tag(tokens):
    print(tag, end=" ")

Enter a sentence: Tagged corpora use many different conventions for tagging words.
Pos Tagging result : UNK UNK NOUN ADJ ADJ NOUN ADP VERB NOUN . 

### 2. Default tagset

In [11]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents()
brown_sents = brown.sents()

In [12]:
size = int(len(brown_tagged_sents)*0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
size

51606

In [13]:
tt1 = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
tt2 = nltk.BigramTagger(train_sents, backoff=tt1)
tt3 = nltk.TrigramTagger(train_sents, backoff=tt2)

In [14]:
tt1.evaluate(test_sents)

0.8912742817627459

In [15]:
tt2.evaluate(test_sents)

0.9125751765470128

In [16]:
tt3.evaluate(test_sents)

0.9130466670857693

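This time, backing off up to the trigram tagger scored slightly higher (tt3: 0.9130 vs. tt2: 0.9126, a gap of about 0.05 percentage points), so tagging with tt3 may be the better choice here.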

In [17]:
# Tokenize the input, then print only the tag of each token.
tokens = nltk.word_tokenize(input('Enter a sentence: '))
print('Pos Tagging result : ', end="")
for _, tag in tt3.tag(tokens):
    print(tag, end=" ")

Enter a sentence: time flies like an arrow.
Pos Tagging result : NN NNS VB AT NN . 

In [18]:
# str.split() only splits on whitespace, so "arrow." stays a single token here.
tokens = input('Enter a sentence: ').split()
print('Pos Tagging result : ', end="")
for _, tag in tt2.tag(tokens):
    print(tag, end=" ")

Enter a sentence: time flies like an arrow.
Pos Tagging result : NN NNS CS AT NN 

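-> Neither tagger gets "time flies like an arrow" right: "flies" is tagged NNS (plural noun) instead of VBZ (verb), and "like" comes out as VB with tt3 or CS (subordinating conjunction) with tt2 instead of the preposition tag IN.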
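
The two cells above also tokenize differently: one uses `nltk.word_tokenize`, the other `str.split()`, which is why the second result has five tags instead of six. The sketch below shows the difference with a simple regular expression standing in for a real tokenizer; the regex is an assumption for illustration and is not how `word_tokenize` works internally.

```python
import re

sentence = "time flies like an arrow."

# Whitespace splitting leaves punctuation glued to the preceding word.
whitespace_tokens = sentence.split()

# Rough stand-in for a tokenizer: runs of word characters, or single
# punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)  # ['time', 'flies', 'like', 'an', 'arrow.']
print(regex_tokens)       # ['time', 'flies', 'like', 'an', 'arrow', '.']
```

A token such as "arrow." is unlikely to appear in the training data, so it ends up with whatever tag the default tagger assigns.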

Other example sentences

In [19]:
# str.split() only splits on whitespace, so punctuation stays attached to words.
tokens = input('Enter a sentence: ').split()
print('Pos Tagging result : ', end="")
for _, tag in tt2.tag(tokens):
    print(tag, end=" ")

Enter a sentence: In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. And God said, Let there be light: and there was light. And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day. And God made the firmament, and divided the waters which were under the firmament from the waters which were above the firmament: and it was so. Thus the heavens and the earth were finished, and all the host of them. But there went up a mist from the earth, and watered the whole face of the ground.  
Pos Tagging result : IN AT NN NP VBD AT NN CC AT NN CC AT NN BEDZ IN NN CC NN CC NN BEDZ IN AT NN IN AT NN CC AT NN-TL IN-TL NP-TL VBD IN AT NN IN AT NN CC NP NN VB RB BE NN CC EX BEDZ NN CC NP VBD AT NN NN CC AT NN PPS VBD NN CC AT NN CC AT NN BED AT OD NN CC NP VBD AT NN CC VBN AT NNS WDT BED IN AT NN IN AT 

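## Limitations

Once we have decided to label a word as a noun, we cannot go back and fix the mistake if we later find evidence that it should have been a verb.

 -> Solution: assign a score to every possible tag sequence and choose the sequence with the highest overall score (the approach taken by Hidden Markov Models).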
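
The difference between committing to tags one at a time and scoring whole sequences can be sketched with a toy model. All probabilities in `TRANS` and `EMIT` below are invented for illustration, and the exhaustive search is only practical for tiny inputs; HMM taggers typically find the same best sequence efficiently with the Viterbi algorithm.

```python
from itertools import product

TAGS = ["NOUN", "VERB", "ADP", "DET"]

# Invented transition scores P(tag | previous tag); "<s>" marks sentence start.
TRANS = {
    "<s>":  {"NOUN": 0.7, "VERB": 0.3},
    "NOUN": {"NOUN": 0.4, "VERB": 0.5, "ADP": 0.1},
    "VERB": {"NOUN": 0.2, "ADP": 0.5, "DET": 0.3},
    "ADP":  {"DET": 0.8, "NOUN": 0.2},
    "DET":  {"NOUN": 1.0},
}
# Invented emission scores P(word | tag).
EMIT = {
    "time":  {"NOUN": 0.9, "VERB": 0.1},
    "flies": {"NOUN": 0.6, "VERB": 0.4},
    "like":  {"VERB": 0.5, "ADP": 0.5},
    "an":    {"DET": 1.0},
    "arrow": {"NOUN": 1.0},
}

def step_score(prev_tag, tag, word):
    return TRANS.get(prev_tag, {}).get(tag, 0.0) * EMIT.get(word, {}).get(tag, 0.0)

def sequence_score(words, tags):
    """Product of step scores along one complete tag sequence."""
    score, prev = 1.0, "<s>"
    for word, tag in zip(words, tags):
        score *= step_score(prev, tag, word)
        prev = tag
    return score

def greedy_tags(words):
    """Commit to the locally best tag at each position, never revisiting."""
    prev, chosen = "<s>", []
    for word in words:
        best = max(TAGS, key=lambda tag: step_score(prev, tag, word))
        chosen.append(best)
        prev = best
    return chosen

def best_sequence(words):
    """Score every possible tag sequence and keep the highest-scoring one."""
    candidates = product(TAGS, repeat=len(words))
    return list(max(candidates, key=lambda tags: sequence_score(words, tags)))

words = ["time", "flies", "like", "an", "arrow"]
greedy = greedy_tags(words)
best = best_sequence(words)
print(greedy)  # ['NOUN', 'NOUN', 'VERB', 'DET', 'NOUN']
print(best)    # ['NOUN', 'VERB', 'ADP', 'DET', 'NOUN']
```

With these numbers the greedy pass prefers NOUN for "flies" and is then stuck with VERB for "like", while the sequence-level search accepts a slightly worse choice at "flies" to get a better overall score.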

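#### + Data sparseness problem of n-gram taggers

1. Size of the n-gram table

With a large n, any given n-gram is less likely to occur in the training corpus, so the sparseness problem becomes more severe. The model also grows with n, because counts have to be kept for every n-gram in the corpus.
With a small n, the n-grams are well covered by the training corpus, but the approximation drifts further from the true probability distribution.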
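
The trade-off can be made concrete by counting contexts on a toy corpus. The sketch below uses made-up sentences and hypothetical helper names (`contexts`, `table_size_and_coverage`); it counts how many distinct (previous n-1 tags, word) contexts the training data yields and what share of test contexts was seen in training.

```python
def contexts(tagged_sents, n):
    """Yield (previous n-1 tags, current word) for every token."""
    for sent in tagged_sents:
        padded = [None] * (n - 1) + [tag for _, tag in sent]
        for i, (word, _) in enumerate(sent):
            yield (tuple(padded[i:i + n - 1]), word)

def table_size_and_coverage(train, test, n):
    """Return (distinct training contexts, share of test contexts seen)."""
    seen = set(contexts(train, n))
    test_contexts = list(contexts(test, n))
    hits = sum(ctx in seen for ctx in test_contexts)
    return len(seen), hits / len(test_contexts)

# Made-up corpus, small enough to check by hand.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("like", "VERB"), ("the", "DET"), ("park", "NOUN")],
    [("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("barks", "VERB")],
]
test = [
    [("dogs", "NOUN"), ("like", "VERB"), ("a", "DET"), ("dog", "NOUN")],
    [("the", "DET"), ("park", "NOUN"), ("sleeps", "VERB")],
]

stats = {n: table_size_and_coverage(train, test, n) for n in (1, 2, 3)}
for n, (size, coverage) in stats.items():
    print(f"n={n}: {size} distinct contexts, {coverage:.2f} of test contexts seen")
# n=1: 9 distinct contexts, 1.00 of test contexts seen
# n=2: 11 distinct contexts, 0.86 of test contexts seen
# n=3: 12 distinct contexts, 0.57 of test contexts seen
```

Even on this tiny example the table grows with n while the share of test contexts it can answer shrinks, which is what the backoff chain is meant to compensate for.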

2. Context available to an n-gram tagger

The only information an n-gram tagger considers from prior context is tags, even though the words themselves might be a useful source of information.
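
A small sketch of what this means in practice: if the lookup key is built only from the previous tags and the current word, two histories with different words but the same tags become indistinguishable. The helper `ngram_context` and the example phrases are hypothetical.

```python
def ngram_context(prev_tagged, word, n):
    """Lookup key: the previous n-1 tags plus the current word.

    The previous words are deliberately dropped.
    """
    prev_tags = [tag for _, tag in prev_tagged][-(n - 1):] if n > 1 else []
    return (tuple(prev_tags), word)

# Different preceding words, identical preceding tags.
ctx_a = ngram_context([("the", "DET"), ("wind", "NOUN")], "blows", 3)
ctx_b = ngram_context([("a", "DET"), ("nose", "NOUN")], "blows", 3)

print(ctx_a)           # (('DET', 'NOUN'), 'blows')
print(ctx_a == ctx_b)  # True
```

Both histories collapse to the same key, so the tagger has to give "blows" the same tag in both cases.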