# Learning to Classify Text

In [41]:
import nltk
from nltk.corpus import *

# 1 Supervised Classification

# 1.1   Gender Identification
In 4 we saw that male and female names have some distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these differences more precisely.

In [42]:
from nltk.corpus import *
import random
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
labeled_names

[('Mick', 'male'),
 ('Wally', 'female'),
 ('Rafaelita', 'female'),
 ('Mace', 'male'),
 ('Cammy', 'male'),
 ('Guenna', 'female'),
 ('Winonah', 'female'),
 ('Edgardo', 'male'),
 ('Maren', 'female'),
 ('Oswell', 'male'),
 ('Huey', 'male'),
 ('Gillian', 'female'),
 ('Konstance', 'female'),
 ('Warden', 'male'),
 ('Wilson', 'male'),
 ('Salome', 'female'),
 ('Wallie', 'male'),
 ('Enoch', 'male'),
 ('Eydie', 'female'),
 ('Bobina', 'female'),
 ('Cristabel', 'female'),
 ('Zane', 'male'),
 ('Skylar', 'male'),
 ('Clarke', 'male'),
 ('Bebe', 'female'),
 ('Katerina', 'female'),
 ('Alexina', 'female'),
 ('Alex', 'male'),
 ('Ruddie', 'male'),
 ('Thea', 'female'),
 ('Ortensia', 'female'),
 ('Bartholomeo', 'male'),
 ('Patti', 'female'),
 ('Rennie', 'female'),
 ('Beilul', 'female'),
 ('Julieta', 'female'),
 ('Elfie', 'female'),
 ('Willamina', 'female'),
 ('Mattheus', 'male'),
 ('Rob', 'male'),
 ('Archy', 'male'),
 ('Andrej', 'male'),
 ('Meriel', 'female'),
 ('Lacy', 'female'),
 ('Ignatius', 'male'),
 ...

male.txt: male names

female.txt: female names

In [43]:
def gender_features(word):
        return {'last_letter': word[-1]}
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
featuresets

[({'last_letter': 'k'}, 'male'),
 ({'last_letter': 'y'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'h'}, 'female'),
 ({'last_letter': 'o'}, 'male'),
 ({'last_letter': 'n'}, 'female'),
 ({'last_letter': 'l'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'n'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'h'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'l'}, 'female'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'r'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'x'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ...

The feature is extracted from the last letter of each name.

In [44]:
len(featuresets)

7944

In [45]:
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Gender Classification:", nltk.classify.accuracy(classifier, test_set))

Gender Classification: 0.762


Training a naive Bayes classifier on only the last letter of each name gives an accuracy of about 76%.

In [46]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     36.7 : 1.0
             last_letter = 'k'              male : female =     30.2 : 1.0
             last_letter = 'p'              male : female =     18.6 : 1.0
             last_letter = 'f'              male : female =     16.6 : 1.0
             last_letter = 'v'              male : female =     11.2 : 1.0


-> These are the feature values the trained classifier found most informative.
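The ratios in this table compare how often a feature value occurs with one label versus the other. As a rough sketch, using made-up toy names rather than the NLTK names corpus and leaving out the smoothing that a real naive Bayes implementation would apply, such a ratio can be estimated from raw counts. `likelihood_ratio` is a hypothetical helper written for this illustration, not an NLTK function.

```python
from collections import Counter

# Toy labeled names, invented for illustration (not from the NLTK names corpus).
toy_names = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
             ("Alice", "female"), ("Luca", "male"), ("Mark", "male"),
             ("Jack", "male"), ("Peter", "male")]

def gender_features(word):
    return {"last_letter": word[-1]}

# Count how often each last letter occurs under each label.
letter_counts = {"male": Counter(), "female": Counter()}
label_counts = Counter()
for name, label in toy_names:
    letter_counts[label][gender_features(name)["last_letter"]] += 1
    label_counts[label] += 1

def likelihood_ratio(letter, label_a, label_b):
    """Unsmoothed P(letter | label_a) / P(letter | label_b) (hypothetical helper)."""
    p_a = letter_counts[label_a][letter] / label_counts[label_a]
    p_b = letter_counts[label_b][letter] / label_counts[label_b]
    return p_a / p_b

print(likelihood_ratio("a", "female", "male"))  # 3.0 for this toy data
```

Without smoothing the ratio is undefined when a letter never occurs with the second label, which is one reason a practical classifier smooths its estimates.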

# 1.2   Choosing The Right Features
Selecting relevant features and deciding how to encode them for a learning method can have an enormous impact on the learning method's ability to extract a good model. Much of the interesting work in building a classifier is deciding what features might be relevant, and how we can represent them. Although it's often possible to get decent performance by using a fairly simple and obvious set of features, there are usually significant gains to be had by using carefully constructed features based on a thorough understanding of the task at hand.

Typically, feature extractors are built through a process of trial-and-error, guided by intuitions about what information is relevant to the problem. It's common to start with a "kitchen sink" approach, including all the features that you can think of, and then checking to see which features actually are helpful. We take this approach for name gender features in 1.2.

In [47]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.784


-> Adding irrelevant features can lower accuracy (overfitting), although in this run the score rose slightly, from 0.762 to 0.784.

### Error Analysis

Once an initial set of features has been chosen, a very productive method for refining the feature set is error analysis. First, we select a development set, containing the corpus data for creating the model. This development set is then subdivided into the training set and the dev-test set.
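The split described here can be sketched as a small helper. This is only an illustration: `three_way_split`, its parameter names and the fixed seed are assumptions made for the example, and the items are plain integers rather than labeled names.

```python
import random

def three_way_split(items, n_test, n_devtest, seed=0):
    """Shuffle once, then carve out test, dev-test and training portions."""
    items = list(items)
    random.Random(seed).shuffle(items)       # seeded so the split is repeatable
    test = items[:n_test]                    # held out until the very end
    devtest = items[n_test:n_test + n_devtest]   # used for error analysis
    train = items[n_test + n_devtest:]       # used to fit the model
    return train, devtest, test

train, devtest, test = three_way_split(range(20), n_test=4, n_devtest=6)
print(len(train), len(devtest), len(test))  # 10 6 4
```

The training and dev-test portions together form the development set; the test portion stays untouched while features are being refined.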

In [48]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

In [49]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.768


In [50]:
errors = []
for (name, tag) in devtest_names:
     guess = classifier.classify(gender_features(name))
     if guess != tag:
         errors.append( (tag, guess, name) )
for (tag, guess, name) in sorted(errors):
     print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Ailyn                         
correct=female   guess=male     name=Alis                          
correct=female   guess=male     name=Annabal                       
correct=female   guess=male     name=Ardelis                       
correct=female   guess=male     name=Ariel                         
correct=female   guess=male     name=Aryn                          
correct=female   guess=male     name=Austin                        
correct=female   guess=male     name=Avis                          
correct=female   guess=male     name=Bo                            
correct=female   guess=male     name=Brenn                         
correct=female   guess=male     name=Britt                         
correct=female   guess=male     name=Bryn                          
correct=female   guess=male     name=Carleen                       
correct=female   guess=male     name=Carlynn                       
correct=female   guess=male     name=Carmon     

-> We can inspect the misclassified examples.

e.g. names ending in yn were mostly female.
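One way to spot such patterns without reading the whole list is to tally the errors by suffix. The sketch below assumes the `(correct, guess, name)` tuple layout used in the error list and copies a handful of names from the output above; it is not the full error set.

```python
from collections import Counter

# A few of the misclassified names printed above, as (correct, guess, name).
errors = [("female", "male", "Ailyn"), ("female", "male", "Aryn"),
          ("female", "male", "Bryn"), ("female", "male", "Alis"),
          ("female", "male", "Avis"), ("female", "male", "Ariel")]

# Count errors per (two-letter suffix, correct label) pair.
suffix_errors = Counter((name[-2:], tag) for (tag, guess, name) in errors)
for (suffix, tag), count in suffix_errors.most_common():
    print(suffix, tag, count)
```

Suffixes that show up repeatedly with the same correct label are candidates for new features.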

In [51]:
def gender_features(word):
     return {'suffix1': word[-1:],
             'suffix2': word[-2:]}

In [52]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.792


-> Accuracy on the dev-test set improves slightly (0.768 to 0.792).

# 1.3   Document Classification
In 1, we saw several examples of corpora where documents have been labeled with categories. Using these corpora, we can build classifiers that will automatically tag new documents with appropriate category labels. First, we construct a list of documents, labeled with the appropriate categories. For this example, we've chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.

### Movie review classification using NaiveBayesClassifier

category : positive or negative

In [53]:
import nltk
import random
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
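# Intended: use the 2000 most frequent words as features. Note that in this run
# list(all_words)[:2000] came out in first-occurrence order (see word_features[:100] below);
# all_words.most_common(2000) would select by frequency explicitly.
word_features = list(all_words)[:2000]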

In [54]:
type(documents)

list

In [55]:
documents[0]

(['ingredients',
  ':',
  'down',
  '-',
  'on',
  '-',
  'his',
  '-',
  'luck',
  'evangelist',
  ',',
  'church',
  'synopsis',
  ':',
  'sonny',
  'dewey',
  '(',
  'robert',
  'duvall',
  ')',
  'is',
  'a',
  'tireless',
  'texas',
  'pentecostal',
  'preacher',
  'who',
  'unexpectedly',
  'catches',
  'his',
  'wife',
  '(',
  'farrah',
  'fawcett',
  ')',
  'in',
  'bed',
  'with',
  'another',
  'guy',
  '.',
  'in',
  'a',
  'regrettable',
  'crime',
  'of',
  'passion',
  'he',
  'takes',
  'a',
  'baseball',
  'bat',
  'to',
  'the',
  'guy',
  "'",
  's',
  'head',
  ',',
  'and',
  'suddenly',
  'finds',
  'himself',
  'a',
  'fugitive',
  'for',
  'murder',
  ',',
  'and',
  'estranged',
  'from',
  'his',
  'wife',
  'and',
  'two',
  'kids',
  '.',
  'to',
  'atone',
  'for',
  'his',
  'sins',
  ',',
  'sonny',
  'flees',
  'to',
  'a',
  'rural',
  'bayou',
  'town',
  'in',
  'louisiana',
  'and',
  'baptizes',
  'himself',
  'as',
  'a',
  'new',
  'creature',
  ...

In [56]:
type(word_features)

list

In [57]:
word_features[:100]

['plot',
 ':',
 'two',
 'teen',
 'couples',
 'go',
 'to',
 'a',
 'church',
 'party',
 ',',
 'drink',
 'and',
 'then',
 'drive',
 '.',
 'they',
 'get',
 'into',
 'an',
 'accident',
 'one',
 'of',
 'the',
 'guys',
 'dies',
 'but',
 'his',
 'girlfriend',
 'continues',
 'see',
 'him',
 'in',
 'her',
 'life',
 'has',
 'nightmares',
 'what',
 "'",
 's',
 'deal',
 '?',
 'watch',
 'movie',
 '"',
 'sorta',
 'find',
 'out',
 'critique',
 'mind',
 '-',
 'fuck',
 'for',
 'generation',
 'that',
 'touches',
 'on',
 'very',
 'cool',
 'idea',
 'presents',
 'it',
 'bad',
 'package',
 'which',
 'is',
 'makes',
 'this',
 'review',
 'even',
 'harder',
 'write',
 'since',
 'i',
 'generally',
 'applaud',
 'films',
 'attempt',
 'break',
 'mold',
 'mess',
 'with',
 'your',
 'head',
 'such',
 '(',
 'lost',
 'highway',
 '&',
 'memento',
 ')',
 'there',
 'are',
 'good',
 'ways',
 'making',
 'all',
 'types',
 'these',
 'folks']

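These were meant to be the most frequent words; the list above is in order of first occurrence.

Each document carries its category at the end: pos or neg.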
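The difference between "first 2000 keys" and "2000 most frequent words" can be seen on a tiny example. This sketch uses `collections.Counter` as a stand-in, on the assumption that it behaves like the frequency distribution in the run above; iteration order of `nltk.FreqDist` may differ between NLTK versions, so it is worth checking on your own installation.

```python
from collections import Counter

# Toy token stream, invented for illustration.
tokens = "plot : the film has a plot , the film is bad , the end".split()
counts = Counter(tokens)

first_seen = list(counts)[:3]                           # order of first occurrence
most_frequent = [w for (w, _) in counts.most_common(3)]  # order of frequency

print(first_seen)
print(most_frequent)
```

Selecting features with `most_common` makes the intent explicit and does not depend on how the container happens to iterate.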

In [58]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
featuresets = [(document_features(d), c) for (d,c) in documents]

Feature function: records whether each feature word occurs in the document, as input for naive Bayes.

In [59]:
featuresets[1]

({'contains(plot)': False,
  'contains(:)': False,
  'contains(two)': True,
  'contains(teen)': False,
  'contains(couples)': False,
  'contains(go)': True,
  'contains(to)': True,
  'contains(a)': True,
  'contains(church)': False,
  'contains(party)': False,
  'contains(,)': True,
  'contains(drink)': False,
  'contains(and)': True,
  'contains(then)': False,
  'contains(drive)': False,
  'contains(.)': True,
  'contains(they)': False,
  'contains(get)': False,
  'contains(into)': True,
  'contains(an)': True,
  'contains(accident)': True,
  'contains(one)': False,
  'contains(of)': True,
  'contains(the)': True,
  'contains(guys)': False,
  'contains(dies)': False,
  'contains(but)': True,
  'contains(his)': False,
  'contains(girlfriend)': False,
  'contains(continues)': False,
  'contains(see)': True,
  'contains(him)': False,
  'contains(in)': True,
  'contains(her)': True,
  'contains(life)': False,
  'contains(has)': True,
  'contains(nightmares)': False,
  ...

In [60]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Document Classification:", nltk.classify.accuracy(classifier,test_set))

Document Classification: 0.79


In [61]:
classifier.show_most_informative_features(5)

Most Informative Features
           contains(ugh) = True              neg : pos    =      8.3 : 1.0
 contains(unimaginative) = True              neg : pos    =      8.3 : 1.0
    contains(schumacher) = True              neg : pos    =      7.4 : 1.0
        contains(shoddy) = True              neg : pos    =      7.0 : 1.0
          contains(mena) = True              neg : pos    =      7.0 : 1.0


# 1.4   Part-of-Speech Tagging
In 5. we built a regular expression tagger that chooses a part-of-speech tag for a word by looking at the internal make-up of the word. However, this regular expression tagger had to be hand-crafted. Instead, we can train a classifier to work out which suffixes are most informative. Let's begin by finding out what the most common suffixes are:

### Using suffixes that are informative about part of speech, with DecisionTreeClassifier

In [62]:
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()
for word in brown.words():
        word = word.lower()
        suffix_fdist[word[-1:]] += 1
        suffix_fdist[word[-2:]] += 1
        suffix_fdist[word[-3:]] += 1
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

In [63]:
common_suffixes

['e',
 ',',
 '.',
 's',
 'd',
 't',
 'he',
 'n',
 'a',
 'of',
 'the',
 'y',
 'r',
 'to',
 'in',
 'f',
 'o',
 'ed',
 'nd',
 'is',
 'on',
 'l',
 'g',
 'and',
 'ng',
 'er',
 'as',
 'ing',
 'h',
 'at',
 'es',
 'or',
 're',
 'it',
 '``',
 'an',
 "''",
 'm',
 ';',
 'i',
 'ly',
 'ion',
 'en',
 'al',
 '?',
 'nt',
 'be',
 'hat',
 'st',
 'his',
 'th',
 'll',
 'le',
 'ce',
 'by',
 'ts',
 'me',
 've',
 "'",
 'se',
 'ut',
 'was',
 'for',
 'ent',
 'ch',
 'k',
 'w',
 'ld',
 '`',
 'rs',
 'ted',
 'ere',
 'her',
 'ne',
 'ns',
 'ith',
 'ad',
 'ry',
 ')',
 '(',
 'te',
 '--',
 'ay',
 'ty',
 'ot',
 'p',
 'nce',
 "'s",
 'ter',
 'om',
 'ss',
 ':',
 'we',
 'are',
 'c',
 'ers',
 'uld',
 'had',
 'so',
 'ey']

In [64]:
def pos_features(word):
        features = {}
        for suffix in common_suffixes:
                features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
        return features
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

In [65]:
featuresets[:10]

[({'endswith(e)': True,
   'endswith(,)': False,
   'endswith(.)': False,
   'endswith(s)': False,
   'endswith(d)': False,
   'endswith(t)': False,
   'endswith(he)': True,
   'endswith(n)': False,
   'endswith(a)': False,
   'endswith(of)': False,
   'endswith(the)': True,
   'endswith(y)': False,
   'endswith(r)': False,
   'endswith(to)': False,
   'endswith(in)': False,
   'endswith(f)': False,
   'endswith(o)': False,
   'endswith(ed)': False,
   'endswith(nd)': False,
   'endswith(is)': False,
   'endswith(on)': False,
   'endswith(l)': False,
   'endswith(g)': False,
   'endswith(and)': False,
   'endswith(ng)': False,
   'endswith(er)': False,
   'endswith(as)': False,
   'endswith(ing)': False,
   'endswith(h)': False,
   'endswith(at)': False,
   'endswith(es)': False,
   'endswith(or)': False,
   'endswith(re)': False,
   'endswith(it)': False,
   'endswith(``)': False,
   'endswith(an)': False,
   "endswith('')": False,
   'endswith(m)': False,
   'endswith(;)': False,
   

-> Used as Emotion, and so on.

In [None]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)
print("POS Tagging:", nltk.classify.accuracy(classifier,test_set))

The decision tree's performance is not particularly good.

In [None]:
print(classifier.pseudocode(depth=4))

# 1.5   Exploiting Context
By augmenting the feature extraction function, we could modify this part-of-speech tagger to leverage a variety of other word-internal features, such as the length of the word, the number of syllables it contains, or its prefix. However, as long as the feature extractor just looks at the target word, we have no way to add features that depend on the context that the word appears in. But contextual features often provide powerful clues about the correct tag — for example, when tagging the word "fly," knowing that the previous word is "a" will allow us to determine that it is functioning as a noun, not a verb.

In [None]:
def pos_features(sentence, i): 
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

In [None]:
pos_features(brown.sents()[0], 8)

tagged_sents = brown.tagged_sents(categories='news')
featuresets = []
for tagged_sent in tagged_sents:
     untagged_sent = nltk.tag.untag(tagged_sent)
     for i, (word, tag) in enumerate(tagged_sent):
         featuresets.append( (pos_features(untagged_sent, i), tag) )

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

nltk.classify.accuracy(classifier, test_set)
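To see what the context feature adds, the extractor can be applied to the word "fly" in two toy sentences. The sentences are invented for this sketch, and the function is restated in compact form so the example runs on its own.

```python
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    # The previous word, or a start marker for the first token.
    features["prev-word"] = "<START>" if i == 0 else sentence[i - 1]
    return features

noun_context = ["She", "swatted", "a", "fly"]   # "fly" as a noun
verb_context = ["Birds", "fly", "south"]        # "fly" as a verb

print(pos_features(noun_context, 3))
print(pos_features(verb_context, 1))
```

The suffix features are identical in both cases; only `prev-word` differs, and that is the information a classifier can use to separate the two readings.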

# 1.6   Sequence Classification
In order to capture the dependencies between related classification tasks, we can use joint classifier models, which choose an appropriate labeling for a collection of related inputs. In the case of part-of-speech tagging, a variety of different sequence classifier models can be used to jointly choose part-of-speech tags for all the words in a given sentence.

One sequence classification strategy, known as consecutive classification or greedy sequence classification, is to find the most likely class label for the first input, then to use that answer to help find the best label for the next input. The process can then be repeated until all of the inputs have been labeled. This is the approach that was taken by the bigram tagger from 5, which began by choosing a part-of-speech tag for the first word in the sentence, and then chose the tag for each subsequent word based on the word itself and the predicted tag for the previous word.
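The greedy strategy can be sketched in a few lines. Here `toy_classify` is a hypothetical hand-written rule set standing in for a trained classifier, and the tag names are simplified; only the control flow (each decision feeding the next) is the point.

```python
def toy_classify(word, prev_tag):
    """Hypothetical stand-in for a trained classifier."""
    if word in ("a", "the"):
        return "DET"
    if prev_tag == "DET":
        return "NOUN"
    if prev_tag in ("NOUN", "PRON"):
        return "VERB"
    return "PRON" if word in ("i", "they") else "NOUN"

def greedy_tag(sentence):
    history = []  # tags predicted so far, left to right
    for word in sentence:
        prev_tag = history[-1] if history else "<START>"
        history.append(toy_classify(word.lower(), prev_tag))
    return list(zip(sentence, history))

print(greedy_tag(["They", "fly", "a", "kite"]))
print(greedy_tag(["A", "fly", "landed"]))
```

Because each tag is fixed as soon as it is chosen, an early mistake propagates to later words; that is the main limitation of the greedy approach.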

### Looking at the surrounding words too: exploiting context

In [None]:
def pos_features(sentence, i, history):
        features = {"suffix(1)": sentence[i][-1:],
                    "suffix(2)": sentence[i][-2:],
                    "suffix(3)": sentence[i][-3:]}
        if i == 0:
                features["prev-word"] = "<START>"
                features["prev-tag"] = "<START>"
        else:
                features["prev-word"] = sentence[i-1]
                features["prev-tag"] = history[i-1]
        return features

In [None]:
#pos_features(brown.sents()[0],8)
print(brown.sents()[0])
list(enumerate(brown.sents()[0]))

### Consecutive POS Tagger using consecutivePosTagger

In [None]:
class ConsecutivePosTagger(nltk.TaggerI):
        def __init__(self, train_sents):
                train_set = []
                for tagged_sent in train_sents:
                    untagged_sent = nltk.tag.untag(tagged_sent)
                    history = []
                    for i, (word, tag) in enumerate(tagged_sent):
                        featureset = pos_features(untagged_sent, i, history)
                        train_set.append( (featureset, tag) )
                        history.append(tag)
                self.classifier = nltk.NaiveBayesClassifier.train(train_set)
        def tag(self, sentence):
                history = []
                for i, word in enumerate(sentence):
                    featureset = pos_features(sentence, i, history)
                    tag = self.classifier.classify(featureset)
                    history.append(tag)
                return list(zip(sentence, history))
tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
print("Consecutive POS Tagger:", tagger.evaluate(test_sents))

# 2.1   Sentence Segmentation
Sentence segmentation can be viewed as a classification task for punctuation: whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence.

The first step is to obtain some data that has already been segmented into sentences and convert it into a form that is suitable for extracting features:

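Finding sentence boundaries in the corpus -> needed in order to analyze parts of speech accurately.

Using the sentences from treebank_raw as the corpus, we compute offsets: first find the token indices that are sentence boundaries, then look at the words around each candidate boundary. The feature sets are built from this.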

In [None]:
def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents

In [None]:
sents = nltk.corpus.treebank_raw.sents()
type(sents)
sents[:10]

In [None]:
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)  # collect the index of each sentence-final token

In [None]:
boundaries
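On a toy input the offset bookkeeping is easy to follow. The three sentences below are invented for this sketch and stand in for `treebank_raw.sents()`.

```python
# Toy pre-segmented sentences, invented for illustration.
toy_sents = [["Hello", "world", "."], ["Is", "it", "late", "?"], ["Yes", "."]]

tokens = []
boundaries = set()
offset = 0
for sent in toy_sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)  # index of the last token of each sentence

print(sorted(boundaries))                       # [2, 6, 8]
print([tokens[i] for i in sorted(boundaries)])  # ['.', '?', '.']
```

Each boundary index points at the token that ends a sentence in the flattened token list, which is the label the classifier is later trained to predict.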

In [None]:
def punct_features(tokens, i):
        return {'next-word-capitalized': tokens[i+1][0].isupper(),
                'prev-word': tokens[i-1].lower(),
                'punct': tokens[i],
                'prev-word-is-one-char': len(tokens[i-1]) == 1}
featuresets = [(punct_features(tokens, i), (i in boundaries))
        for i in range(1, len(tokens)-1)
                if tokens[i] in '.?!']

In [None]:
featuresets

In [None]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)
print("Sentence Segmentation:", nltk.classify.accuracy(classifier, test_set))

# 2.2   Identifying Dialogue Act Types
When processing dialogue, it can be useful to think of utterances as a type of action performed by the speaker. This interpretation is most straightforward for performative statements such as "I forgive you" or "I bet you can't climb that hill." But greetings, questions, answers, assertions, and clarifications can all be thought of as types of speech-based actions. Recognizing the dialogue acts underlying the utterances in a dialogue can be an important first step in understanding the conversation.

The NPS Chat Corpus, which was demonstrated in 1, consists of over 10,000 posts from instant messaging sessions. These posts have all been labeled with one of 15 dialogue act types, such as "Statement," "Emotion," "ynQuestion", and "Continuer." We can therefore use this data to build a classifier that can identify the dialogue act types for new instant messaging posts. The first step is to extract the basic messaging data. We will call xml_posts() to get a data structure representing the XML annotation for each post:

In [None]:
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
        features = {}
        for word in nltk.word_tokenize(post):
                features['contains({})'.format(word.lower())] = True
        return features
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Dialog Act Type:", nltk.classify.accuracy(classifier, test_set))
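The feature extractor here is a plain bag of words: every word that occurs becomes a `contains(...)` key set to True, and absent words are simply not recorded. A dependency-free sketch, with `str.split()` standing in for `nltk.word_tokenize` and two invented posts in place of the NPS Chat data:

```python
def dialogue_act_features(post):
    # str.split() is a simplification of nltk.word_tokenize for this sketch.
    features = {}
    for word in post.split():
        features["contains({})".format(word.lower())] = True
    return features

# Invented posts with plausible labels, for illustration only.
toy_posts = [("hi everyone", "Greet"), ("are you there ?", "ynQuestion")]
featuresets = [(dialogue_act_features(text), label) for (text, label) in toy_posts]
print(featuresets[0])
```

Unlike the document classifier earlier, no fixed vocabulary is used, so each post yields a dictionary only as large as the post itself.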

# 2.3   Recognizing Textual Entailment
Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T entails another text called the "hypothesis" (as already discussed in 5). To date, there have been four RTE Challenges, where shared development and test data is made available to competing teams. Here are a couple of examples of text/hypothesis pairs from the Challenge 3 development dataset. The label True indicates that the entailment holds, and False, that it fails to hold.

-> Returns True or False depending on whether the text entails the hypothesis.

text / hypothesis

In [None]:
rtepairs = nltk.corpus.rte.pairs(['rte3_dev.xml'])
def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features
featuresets = [(rte_features(rtepair), rtepair.value) for rtepair in rtepairs]

In [None]:
len(featuresets),type(featuresets)
featuresets[:5] 

-> True / False are represented as 1 / 0.

In [None]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("RTE Recognition:", nltk.classify.accuracy(classifier, test_set))

Performance is poor because only word overlap is considered; this could be improved by using a neural network.
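The overlap idea behind these features can be sketched with plain set operations. `simple_rte_features` is a hypothetical, heavily simplified stand-in: it splits on whitespace and does none of the extra processing the NLTK extractor performs, and the sentence pairs are invented.

```python
def simple_rte_features(text, hypothesis):
    """Hypothetical simplification: word overlap via set operations only."""
    text_words = set(text.lower().split())
    hyp_words = set(hypothesis.lower().split())
    return {"word_overlap": len(hyp_words & text_words),     # hypothesis words found in the text
            "word_hyp_extra": len(hyp_words - text_words)}   # hypothesis words missing from the text

print(simple_rte_features("the cat sat on the mat", "the cat sat"))
print(simple_rte_features("the cat sat on the mat", "the dog sat"))
```

A hypothesis with many words that do not appear in the text is unlikely to be entailed, but overlap alone cannot distinguish "the cat chased the dog" from "the dog chased the cat", which illustrates why accuracy stays low.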

# Using the scikit-learn Wrapper in NLTK
https://youtu.be/nla4C-VYNEU 
https://www.python-course.eu/neural_networks_with_scikit.php

In [None]:
import nltk
import random
from nltk.corpus import movie_reviews
import pickle
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# keys() follows first-occurrence order, so pick the 3000 most frequent words explicitly
word_features = [w for (w, _) in all_words.most_common(3000)]
def find_features(document):
      words = set(document)
      features = {}
      for w in word_features:
              features[w] = (w in words)
      return features
featuresets = [(find_features(rev), category) for (rev, category) in documents]
training_set = featuresets[:1900]
testing_set = featuresets[1900:]
MNB_classifier = SklearnClassifier(MultinomialNB()) # Multinomial naive Bayes
MNB_classifier.train(training_set)
print("MNB Accuracy:", (nltk.classify.accuracy(MNB_classifier,testing_set))*100)

## Confusion Matrix

In [None]:
def tag_list(tagged_sents):
     return [tag for sent in tagged_sents for (word, tag) in sent]
def apply_tagger(tagger, corpus):
     return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

In [None]:
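# Assumption: t2 (not defined in this notebook) is a bigram tagger with unigram and default backoff, trained on Brown 'news'
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(brown.tagged_sents(categories='news'), backoff=t0)
t2 = nltk.BigramTagger(brown.tagged_sents(categories='news'), backoff=t1)
gold = tag_list(brown.tagged_sents(categories='editorial'))
test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
cm = nltk.ConfusionMatrix(gold, test)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))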
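What a confusion matrix tabulates can be shown with a counter over (gold, predicted) pairs. The tag sequences below are invented toy data, and this sketch only mimics the idea; it is not how `nltk.ConfusionMatrix` is implemented.

```python
from collections import Counter

# Invented gold and predicted tag sequences of equal length.
gold = ["NN", "VB", "NN", "JJ", "NN", "VB"]
test = ["NN", "NN", "NN", "JJ", "VB", "VB"]

# Each cell counts how often a gold tag was predicted as a given tag.
confusion = Counter(zip(gold, test))
labels = sorted(set(gold) | set(test))

print("gold/pred " + " ".join("{:>3}".format(p) for p in labels))
for g in labels:
    row = " ".join("{:>3}".format(confusion[(g, p)]) for p in labels)
    print("{:>9} ".format(g) + row)
```

Entries on the diagonal are correct predictions; off-diagonal entries show which tags are confused with which, for example a gold VB predicted as NN.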