#A04 Document Classification

In [None]:
import nltk
nltk.download('senseval')

##**Problem 1**
###Problem statement

The Senseval 2 Corpus contains data intended to train word-sense disambiguation classifiers. It contains data for four words: hard, interest, line, and serve. Choose one of these four words, and load the corresponding data:


```python
from nltk.corpus import senseval

instances = senseval.instances('hard.pos')
size = int(len(instances) * 0.1)
train_set, test_set = instances[size:], instances[:size]
```
Using this dataset, build a classifier that predicts the correct sense tag for a given instance. See the corpus HOWTO at http://nltk.org/howto for information on using the instance objects returned by the Senseval 2 Corpus.
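Each instance exposes the target word, its position, the tagged context, and the sense labels. As a rough illustration, the sketch below uses a hypothetical stand-in (`FakeInstance`, not part of NLTK) with the same four attribute names, plus a hypothetical helper `context_window` that pads past the edges of the context instead of indexing out of range. The toy data is invented.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the attribute names of a Senseval instance
# (word, position, context, senses); real instances come from senseval.instances().
FakeInstance = namedtuple("FakeInstance", "word position context senses")

def context_window(instance, width=2, pad=("<PAD>", "<PAD>")):
    """Return the (word, tag) pairs `width` tokens to the left and right of the target."""
    ctx, p = instance.context, instance.position
    left = [ctx[i] if i >= 0 else pad for i in range(p - width, p)]
    right = [ctx[i] if i < len(ctx) else pad for i in range(p + 1, p + 1 + width)]
    return left, right

# Toy example: the target word is the first token, so the left side is all padding.
inst = FakeInstance(
    word="interest-n",
    position=0,
    context=[("interest", "NN"), ("rates", "NNS"), ("rise", "VBP")],
    senses=("interest_6",),
)
left, right = context_window(inst)
print(left)
print(right)
```

Padding keeps the feature set the same size for every instance, which matters when the target word opens or closes its context.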


In [None]:
# Solution to Problem 1

from nltk.corpus import senseval

def sense_features(instance):
    context, p = instance.context, instance.position
    # Pad at the edges: p-1 would wrap around to the last token when p == 0,
    # and p+1 would raise IndexError when the target is the final token.
    prev_token = context[p-1] if p > 0 else ("<START>", "<START>")
    next_token = context[p+1] if p + 1 < len(context) else ("<END>", "<END>")

    features = {}
    features["word-type"] = instance.word
    features["word-tag"] = context[p][1]
    features["prev-word"] = prev_token[0]
    features["prev-word-tag"] = prev_token[1]
    features["next-word"] = next_token[0]
    features["next-word-tag"] = next_token[1]
    return features

# Build the training and test sets from the Senseval corpus
instances = senseval.instances('interest.pos')
size = int(len(instances) * 0.1)
train_set, test_set = instances[size:], instances[:size]

# Print each target word with its context and sense labels
for i in train_set[:100]:
    p = i.position
    left = ' '.join(w for (w,t) in i.context[p-2:p])
    word = ' '.join(w for (w,t) in i.context[p:p+1])
    right = ' '.join(w for (w,t) in i.context[p+1:p+3])
    senses = ' '.join(i.senses)
    print('%15s |%10s | %-15s -> %s' % (left, word, right, senses))

# Extract features from both sets and rebuild them as (features, label) pairs
train_set = [(sense_features(instance), instance.senses) for instance in train_set]
test_set = [(sense_features(instance), instance.senses) for instance in test_set]

# Train
classifier = nltk.NaiveBayesClassifier.train(train_set)
# Test: print the classification accuracy
print("\n\nAccuracy   " + str(nltk.classify.accuracy(classifier, test_set)))

because municipal-bond |  interest | is exempt       -> interest_6
  at prevailing |  interest | rates .         -> interest_6
       bet that |  interest | rates will      -> interest_6
      losses if |  interest | rates rise      -> interest_6
                |  interest | rates do        -> interest_6
                |  interest | rate is         -> interest_6
            8 % |  interest | for one         -> interest_6
   track market |  interest | rates ,         -> interest_6
   the national |  interest | ; in            -> interest_4
         to our | interests | in a            -> interest_4
            s . |  interest | rates continue  -> interest_6
            s . |  interest | rates that      -> interest_6
   high british |  interest | rates that      -> interest_6
  strong buying |  interest | in the          -> interest_1
    investors ' |  interest | in buying       -> interest_1
      giant has | interests | in cement       -> interest_5
   plus accrued |  interest | to 

##**Problem 2**
###Problem statement

The synonyms strong and powerful pattern differently (try combining them with chip and sales). What features are relevant in this distinction? Build a classifier that predicts when each word should be used.
(Please extract the collocating words and their frequency histogram, and build a bigram-based classifier using a corpus of your choice.)
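
The histogram the exercise asks for can be as simple as a sorted bar chart of collocate counts. A minimal sketch, using a handful of `powerful` bigram counts copied from the Cell 3 output in this notebook; `collocate_histogram` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter

def collocate_histogram(bigram_counts, target):
    """Sum counts per collocate, i.e. the word that appears next to `target`."""
    hist = Counter()
    for (w1, w2), count in bigram_counts.items():
        if w1 == target:
            hist[w2] += count
        elif w2 == target:
            hist[w1] += count
    return hist

# A few counts taken from the 'powerful' output of Cell 3.
sample = {
    ("the", "powerful"): 15,
    ("a", "powerful"): 27,
    ("most", "powerful"): 12,
    ("more", "powerful"): 13,
    ("powerful", "than"): 5,
}

hist = collocate_histogram(sample, "powerful")
# Print one text bar per collocate, most frequent first.
for word, count in hist.most_common():
    print(f"{word:>6} | {'#' * count} {count}")
```

Running the same helper on the `strong` counts and comparing the two histograms side by side is one way to see which collocates separate the two words.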

Explanation of Cell 1
Download several corpora, then concatenate them into a single dataset.
As preprocessing, lowercase every word.

In [None]:
# Cell 1
import nltk
from nltk.collocations import *
nltk.download('genesis')
nltk.download('brown')
nltk.download('gutenberg')
nltk.download('webtext')
nltk.download('punkt')
nltk.download('reuters')
!unzip /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora/.
import random

dataset = nltk.corpus.brown.words() + nltk.corpus.gutenberg.words() + nltk.corpus.webtext.words() \
              + nltk.corpus.reuters.words() + nltk.corpus.genesis.words()
dataset = [word.lower() for word in dataset]

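Explanation of Cells 2 and 3

Initialize a finder that collects bigram collocations.
Then apply a filter to the finder so that only the collocations containing strong (Cell 2) or powerful (Cell 3) remain.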
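
Note that `apply_ngram_filter` removes every bigram for which the filter function returns True, which is why the lambdas in Cells 2 and 3 test `not in`. The same selection can be sketched in plain Python without the finder; `bigrams_containing` is a hypothetical helper and the token list is a toy example.

```python
from collections import Counter

def bigrams_containing(tokens, target):
    """Count adjacent word pairs and keep only those that include `target`."""
    counts = Counter(zip(tokens, tokens[1:]))
    return Counter({bg: n for bg, n in counts.items() if target in bg})

# Invented toy sentence, already lowercased and tokenized.
tokens = "a strong wind and a strong tide met a powerful engine".split()
strong_bigrams = bigrams_containing(tokens, "strong")
print(strong_bigrams)
```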

In [None]:
# Cell 2
finder = BigramCollocationFinder.from_words(dataset)
bigram_measures = nltk.collocations.BigramAssocMeasures()

word_filter_strong = lambda w1, w2: 'strong' not in (w1, w2)
finder.apply_ngram_filter(word_filter_strong)
strong = finder.ngram_fd
len_strong = len(strong.items())
print(strong.items())



In [None]:
# Cell 3
finder = BigramCollocationFinder.from_words(dataset)
bigram_measures = nltk.collocations.BigramAssocMeasures()

word_filter_powerful = lambda w1, w2: 'powerful' not in (w1, w2)
finder.apply_ngram_filter(word_filter_powerful)
powerful = finder.ngram_fd
len_powerful = len(powerful.items())
print(powerful.items())

dict_items([(('the', 'powerful'), 15), (('powerful', 'new'), 2), (('a', 'powerful'), 27), (('powerful', 'transmitter'), 1), (('powerful', ','), 13), (('most', 'powerful'), 12), (('powerful', 'man'), 1), (('powerful', 'central'), 2), ((',', 'powerful'), 5), (('powerful', 'nations'), 1), (('one', 'powerful'), 1), (('powerful', 'nation'), 1), (('and', 'powerful'), 14), (('powerful', 'mirror'), 1), (('more', 'powerful'), 13), (('powerful', 'than'), 5), (('powerful', 'efforts'), 1), (('powerful', 'weapon'), 1), (('powerful', 'glasses'), 1), (('supremely', 'powerful'), 1), (('powerful', 'divine'), 1), (('no', 'powerful'), 1), (('powerful', 'otherworldly'), 1), (('powerful', 'victory'), 1), (('powerful', 'engines'), 2), (('of', 'powerful'), 5), (('powerful', 'music'), 1), (('them', 'powerful'), 1), (('powerful', 'visual'), 1), (('swift', 'powerful'), 1), (('powerful', 'act'), 1), (('powerful', 'greek'), 1), (('powerful', 'or'), 1), (('powerful', 'and'), 9), (('this', 'powerful'), 2), (('power

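Explanation of Cell 4

Merge the dictionary of collocations containing powerful and the dictionary of collocations containing strong into a single list.
Collocations that occur more than 10 times are dropped, and the resulting list is then shuffled.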

In [None]:
# Cell 4
dataset = list(strong.items()) + list(powerful.items())
dataset = [item for item in dataset if item[1] <= 10]
random.shuffle(dataset)
print(dataset)



Explanation of Cell 5

As features for each collocation, extract its frequency and the word that is paired with strong or powerful.

In [None]:
# Cell 5
def features(collocation_fq):
  features = {}
  if collocation_fq[0][0] == 'strong' or collocation_fq[0][0] == 'powerful':
    features['word'] = collocation_fq[0][1]
  elif collocation_fq[0][1] == 'strong' or collocation_fq[0][1] == 'powerful':
    features['word'] = collocation_fq[0][0]
  features['freq'] = collocation_fq[1]

  return features

def get_label(collocation_fq):
  if collocation_fq[0][0] == 'strong' or collocation_fq[0][0] == 'powerful':
    return collocation_fq[0][0]
  elif collocation_fq[0][1] == 'strong' or collocation_fq[0][1] == 'powerful':
    return collocation_fq[0][1]

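Explanation of Cell 6

Use 80% of the prepared dataset for training and the remaining 20% for testing.

Train the classifier, then compute its accuracy to judge how good the resulting model is.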
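
Because the dataset is shuffled before the split, the reported accuracy can change from run to run. One option, sketched here under the assumption that a fixed seed is acceptable, is to shuffle with a dedicated `random.Random` instance; `split_dataset` and its `seed` parameter are hypothetical names.

```python
import random

def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` deterministically and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

items = list(range(10))
train_a, test_a = split_dataset(items)
train_b, test_b = split_dataset(items)
# The same seed yields the same split on every call.
print(train_a == train_b, len(train_a), len(test_a))
```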

In [None]:
# Cell 6

train_size = int(len(dataset) * 0.8)

train = [(features(item), get_label(item)) for item in dataset[:train_size]]
test = [(features(item), get_label(item)) for item in dataset[train_size:]]

classifier = nltk.NaiveBayesClassifier.train(train)
print("Accuracy   " + str(nltk.classify.accuracy(classifier, test) * 100.0) + " %")

Accuracy   79.44444444444444 %


##**Problem 3**
###Problem statement

The dialog act classifier assigns labels to individual posts, without considering the context in which the post is found. However, dialog acts are highly dependent on context, and some sequences of dialog act are much more likely than others. For example, a ynQuestion dialog act is much more likely to be answered by a yanswer than by a greeting. Make use of this fact to build a consecutive classifier for labeling dialog acts. Be sure to consider what features might be useful. See the code for the consecutive classifier for part-of-speech tags in 1.7 to get some ideas.
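
The claim that some label sequences are more likely than others can be checked by counting adjacent label pairs. A minimal sketch on an invented toy sequence (real labels would come from each post's `class` attribute); `transition_counts` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def transition_counts(labels):
    """Map each label to a Counter of the labels that immediately follow it."""
    following = defaultdict(Counter)
    for prev, nxt in zip(labels, labels[1:]):
        following[prev][nxt] += 1
    return following

# Invented toy sequence of dialog-act labels, for illustration only.
labels = ["Greet", "Greet", "ynQuestion", "yAnswer", "Statement",
          "ynQuestion", "yAnswer", "ynQuestion", "nAnswer"]

following = transition_counts(labels)
print(following["ynQuestion"].most_common())
```

A skewed distribution in these counts is what makes the previous post's label a promising feature.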

In [None]:
# Solution to Problem 3

def dialogue_act_features(target_post, i, history):
  features = {}

  for word in nltk.word_tokenize(target_post):
    features["target-post-contains(%s)" % word.lower()] = True
  if i == 0:
    features["prev-post-tag"] = "<START>"
  else:
    features["prev-post-tag"] = history[i-1]
  
  return features

class Consecutive_post_tagger(nltk.TaggerI):

  def __init__(self, train_posts):
    train_set = []
    history = []
    for i in range(len(train_posts)):
      featureset = dialogue_act_features(train_posts[i].text, i, history)
      train_set.append( (featureset, train_posts[i].get('class')) )
      history.append(train_posts[i].get('class'))
    self.classifier = nltk.NaiveBayesClassifier.train(train_set)

  # Tag posts with the trained tagger
  def tag(self, posts):
    history = []
    for i in range(len(posts)):
      featureset = dialogue_act_features(posts[i].text, i, history)
      tag = self.classifier.classify(featureset)
      history.append(tag)
    return zip(posts, history)
  
  # Compute the tagger's accuracy on the test data
  def accuracy(self, test_posts):
    history = []
    test = []
    for i in range(len(test_posts)):
      featureset = dialogue_act_features(test_posts[i].text, i, history)
      test.append( (featureset, test_posts[i].get('class')))
      tag = self.classifier.classify(featureset)
      history.append(tag)

    accuracy = nltk.classify.accuracy(self.classifier, test)
    return accuracy

1. Download the nps_chat corpus.
2. Create the training and test data.

In [None]:
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
size = int(len(posts) * 0.8)

train, test = posts[:size], posts[size:]

[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Unzipping corpora/nps_chat.zip.


Train on the dialog acts using the consecutive classifier.
After training, compute the accuracy with the accuracy function to evaluate the model.
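
At tagging time the previous tag is the tagger's own prediction rather than the gold label, so an early mistake can influence later decisions. The loop can be sketched with a stand-in classifier; `toy_classify` is a hypothetical rule-based function used only so the example runs without a trained model, and the feature names are illustrative.

```python
def toy_classify(features):
    """Hypothetical stand-in for a trained classifier's classify() method."""
    if features["ends-with-question-mark"]:
        return "ynQuestion"
    if features["prev-post-tag"] == "ynQuestion":
        return "yAnswer"
    return "Statement"

def tag_posts(texts, classify):
    """Greedy left-to-right tagging: each prediction becomes a feature of the next post."""
    history = []
    for text in texts:
        features = {
            "ends-with-question-mark": text.endswith("?"),
            "prev-post-tag": history[-1] if history else "<START>",
        }
        history.append(classify(features))
    return list(zip(texts, history))

tagged = tag_posts(["anyone here?", "yes", "nice weather today"], toy_classify)
print(tagged)
```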

In [None]:
tagger = Consecutive_post_tagger(train)
print(tagger.accuracy(test))

0.645
