The idea behind TF-IDF.
http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

Collect a large number of documents.
Compute TF-IDF for every pair of document and term.

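Here TF stands for term frequency.
It is the number of times a given term occurs in a document, divided by the total number of tokens in that document (repeats included).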

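IDF stands for inverse document frequency: the inverse of DF, the fraction of documents that contain a given term.
There are several variants; scikit-learn uses one of the following two:

$$
idf(t) = \log \frac{1+n_d}{1+df(d,t)} + 1
$$
$$
idf(t) = \log \frac{n_d}{df(d,t)} + 1
$$
The first applies when smooth_idf=True (the default) and the second when smooth_idf=False; the examples here use the second.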

Here $n_d$ is the total number of documents and $df(d,t)$ is the number of documents that contain term $t$.

For example, suppose we are given the following three documents.

- This is a pen.
- This is an apple.
- This is a pineapple and this is a pen.

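In the first document, the TF of pen is 1/4. pen occurs in two of the three documents, so its IDF is log(3/2)+1,
and tf-idf(1, pen) = 1/4 * (log(3/2) + 1).

scikit-learn arrives at a different number for the same entry: CountVectorizer drops one-letter tokens such as "a" by default, so the first document has three tokens, TfidfTransformer uses the raw count as TF, and each row is L2-normalized. The next cell computes that normalized value for pen in the first document; the factor 1/3 cancels in the normalization.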
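
The hand calculation for pen can be checked with a short pure-Python sketch. It assumes the textbook definitions used in this section (TF as count divided by document length, IDF as log(n_d/df) + 1) and a simple tokenizer that lowercases the text and keeps every word, including "a". The helper names tokenize, tf, and idf are made up for illustration and are not scikit-learn functions.

```python
import math
import re

corpus = [
    "This is a pen.",
    "This is an apple.",
    "This is a pineapple and this is a pen.",
]

def tokenize(text):
    # Lowercase and keep every run of word characters, including one-letter words
    return re.findall(r"\w+", text.lower())

docs = [tokenize(text) for text in corpus]

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Unsmoothed variant from the text: log(n_d / df) + 1
    # (assumes the term occurs in at least one document)
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) + 1

tf_pen = tf("pen", docs[0])
idf_pen = idf("pen", docs)
print(tf_pen, idf_pen, tf_pen * idf_pen)
```

Because pen appears in the first and third documents, df is 2 and the logarithm is taken of 3/2.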

In [1]:
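import math
# Unnormalized tf-idf in document 1: x for "pen" (df=2), y for "is" and "this" (df=3)
x, y = 1/3 * (math.log(3.0/2.0) + 1), 1/3 * (math.log(3.0/3.0) + 1)
# L2-normalize over the three tokens of the document
x / math.sqrt(x**2+y**2+y**2)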

0.7049094889309326

In [2]:
corpus = ['This is a pen.',
          'This is an apple.',
          'This is a pineapple and this is a pen.',]

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
X.toarray() 

array([[0, 0, 0, 1, 1, 0, 1],
       [1, 0, 1, 1, 0, 0, 1],
       [0, 1, 0, 2, 1, 1, 2]], dtype=int64)

In [4]:
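# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out().tolist()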

['an', 'and', 'apple', 'is', 'pen', 'pineapple', 'this']

In [4]:
from sklearn.feature_extraction.text import TfidfTransformer

# smooth_idf=True adds 1 to the document counts inside the IDF formula;
# smooth_idf=False uses idf = log(n_d / df) + 1
transformer = TfidfTransformer(smooth_idf = False)
tfidf = transformer.fit_transform(X)
tfidf.toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.50154891,  0.70490949,
         0.        ,  0.50154891],
       [ 0.63834075,  0.        ,  0.63834075,  0.30417279,  0.        ,
         0.        ,  0.30417279],
       [ 0.        ,  0.48421906,  0.        ,  0.46146595,  0.32428715,
         0.48421906,  0.46146595]])

In [6]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r'\b\w+\b')
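# ngram_range is the range of n-gram lengths (in words) to extract as features
# token_pattern is the regular expression that defines a token; with this pattern one-letter words also count as tokens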

In [7]:
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
X_2

array([[1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
       [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1],
       [2, 1, 1, 0, 0, 1, 1, 0, 2, 2, 0, 1, 1, 1, 2, 2]], dtype=int64)

In [8]:
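# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bigram_vectorizer.get_feature_names_out().tolist()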

['a',
 'a pen',
 'a pineapple',
 'an',
 'an apple',
 'and',
 'and this',
 'apple',
 'is',
 'is a',
 'is an',
 'pen',
 'pineapple',
 'pineapple and',
 'this',
 'this is']
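
The feature list above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not CountVectorizer's actual implementation: it assumes lowercasing, the same token_pattern, and n-grams formed by joining adjacent tokens with a space. The function name ngrams is made up for illustration.

```python
import re

corpus = [
    "This is a pen.",
    "This is an apple.",
    "This is a pineapple and this is a pen.",
]

def ngrams(text, n_min=1, n_max=2, token_pattern=r"\b\w+\b"):
    # Tokenize, then emit every run of n adjacent tokens for n_min <= n <= n_max
    tokens = re.findall(token_pattern, text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Vocabulary: all distinct n-grams, sorted alphabetically
vocabulary = sorted({g for text in corpus for g in ngrams(text)})

# Count matrix: one row per document, one column per vocabulary entry
rows = [[ngrams(text).count(g) for g in vocabulary] for text in corpus]

print(vocabulary)
for row in rows:
    print(row)
```

With ngram_range=(1, 2) the unigrams and bigrams share one vocabulary, which is why 'a' and 'a pen' sit next to each other in the sorted feature list.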

Analyze the IMDB dataset using tf-idf.

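Here we classify with logistic regression. Try other methods as well, and use cross-validation while tuning to improve the classification accuracy.
Once the accuracy is reasonably good, evaluate the model on the test data.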
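
One possible way to set up the cross-validation is sketched below. It is not part of the original analysis: the toy data is invented so that the example runs quickly without downloading IMDB, and the pipeline layout and cv=5 are assumptions chosen for illustration. Only standard scikit-learn pieces (make_pipeline, cross_val_score) are used.

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for IMDB: each "review" is a list of word ids.
# Positive reviews draw ids from 0..29, negative reviews from 20..49.
rng = np.random.RandomState(0)
labels = np.array([0, 1] * 100)
reviews = [
    rng.randint(0 if y == 1 else 20, 30 if y == 1 else 50, size=40).tolist()
    for y in labels
]
counts = [Counter(r) for r in reviews]

# Putting the vectorizer and tf-idf inside the pipeline means they are refit on
# the training part of every fold, so no statistics leak from the held-out part.
model = make_pipeline(
    DictVectorizer(),
    TfidfTransformer(smooth_idf=False),
    LogisticRegression(solver="liblinear"),
)

scores = cross_val_score(model, counts, labels, cv=5)
print(scores.mean(), scores.std())
```

To compare classifiers, the last pipeline step can be swapped for another estimator while the folds and features stay the same; the test set is left untouched until a model has been chosen.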

In [1]:
from keras.datasets import imdb
(x_train,  y_train), (x_test, y_test) = imdb.load_data()

x_train.shape, y_train.shape, x_test.shape, y_test.shape

Using TensorFlow backend.


((25000,), (25000,), (25000,), (25000,))

In [2]:
import collections

# Count how often each word id occurs in each review
counts = [collections.Counter(t) for t in x_train]

In [3]:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer()
x_train_counts = v.fit_transform(counts)
x_train_counts.toarray() 

array([[  1.,   0.,  15., ...,   0.,   0.,   0.],
       [  1.,   0.,  15., ...,   0.,   0.,   0.],
       [  1.,   0.,   9., ...,   0.,   0.,   0.],
       ..., 
       [  1.,   0.,  13., ...,   0.,   0.,   0.],
       [  1.,   0.,   5., ...,   0.,   0.,   0.],
       [  1.,   0.,  10., ...,   0.,   0.,   0.]])

In [4]:
# Same per-review word-id counts for the test set
counts_test = [collections.Counter(t) for t in x_test]

x_test_counts = v.transform(counts_test)
x_train_counts.shape, x_test_counts.shape

((25000, 88585), (25000, 88585))

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer(smooth_idf = False)

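# Keep the matrices sparse: a dense 25000 x 88585 float64 array needs about 17.7 GB,
# and LogisticRegression accepts sparse input directly
x_train_tfidf = transformer.fit_transform(x_train_counts)
x_test_tfidf = transformer.transform(x_test_counts)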

x_train_tfidf.shape, y_train.shape, x_test_tfidf.shape, y_test.shape

((25000, 88585), (25000,), (25000, 88585), (25000,))

In [14]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(x_train_tfidf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [15]:
clf.score(x_test_tfidf, y_test)

0.88551999999999997

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

parameters = {'penalty':('l1', 'l2'), 'C':[0.01, 1]}
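# liblinear supports both the l1 and l2 penalties (the newer default solver, lbfgs, does not support l1)
lr = LogisticRegression(solver='liblinear')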
clf = GridSearchCV(lr, parameters)
clf.fit(x_train_tfidf, y_train)

In [None]:
clf.best_score_, clf.best_estimator_

In [4]:
x_train_str = [' '.join([str(t) for t in d]) for d in x_train]
x_test_str = [' '.join([str(t) for t in d]) for d in x_test]

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(2, 2),
                        token_pattern=r'\b\w+\b')
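# Keep the bigram features sparse: the bigram vocabulary is much larger than the
# unigram one, so a dense array would likely not fit in memory
x_train_bigram_tfidf = tfidf.fit_transform(x_train_str)
x_test_bigram_tfidf = tfidf.transform(x_test_str)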

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(x_train_bigram_tfidf, y_train)
clf.score(x_test_bigram_tfidf, y_test)