The idea behind TF-IDF.
http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

Collect a large number of documents.
Compute TF-IDF for every (document, term) pair.

Here TF stands for term frequency.
It is the number of times a term occurs in a document, divided by the total number of tokens (counting repeats) in that document.
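As a minimal sketch of this definition, the snippet below computes TF for a single document. It assumes a very simple tokenizer (lowercase, split on whitespace, strip basic punctuation); the helper names `tokenize` and `term_frequency` are illustrative and not part of any library.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase, split on whitespace, strip basic punctuation.
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]

def term_frequency(text):
    # TF = occurrences of the term / total number of tokens in the document.
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

tf = term_frequency("This is a pen.")
print(tf["pen"])  # 0.25
```

Because every count is divided by the same total, the TF values of one document always sum to 1.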

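IDF stands for inverse document frequency: the inverse of DF, the fraction of documents that contain a given term.
Several variants exist; this note uses one of the following two:

$$
idf(t) = \log \frac{n_d}{1+df(t)}
$$
$$
idf(t) = \log \frac{n_d}{df(t)} + 1
$$
The second form is what scikit-learn's TfidfTransformer computes when smooth_idf=False.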

Here $n_d$ is the number of documents and $df(t)$ is the number of documents that contain term $t$.
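To see how the two variants differ, here is a small sketch that evaluates both for a three-document collection. The function names `idf_textbook` and `idf_plus_one` are made up for this illustration.

```python
import math

def idf_textbook(n_d, df):
    # idf(t) = log(n_d / (1 + df(t)))
    return math.log(n_d / (1 + df))

def idf_plus_one(n_d, df):
    # idf(t) = log(n_d / df(t)) + 1
    return math.log(n_d / df) + 1

n_d = 3
for df in (1, 2, 3):
    print(df, idf_textbook(n_d, df), idf_plus_one(n_d, df))
```

With the first form, a term that appears in every document gets a negative weight (log(3/4) here), and a term in two of three documents gets exactly 0. With the second form, a term that appears in every document gets weight 1, so it is down-weighted but never ignored.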

For example, suppose we are given the following three documents.

- This is a pen.
- This is an apple.
- This is a pineapple and this is a pen.

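In the first document, pen occurs once among four words, so its TF is 1/4. pen appears in two of the three documents, so its IDF is log(3/2) + 1, and
tf-idf(1, pen) = 1/4 * (log(3/2) + 1).
scikit-learn's default tokenizer drops one-character tokens such as "a", leaving three tokens in the first document, so the cells below use a TF of 1/3.

In [2]:
import math
round(1/3 * (math.log(3.0/2.0) + 1), 4)

0.4685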

In [9]:
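import math

# Unnormalized tf-idf for the three tokens of the first document (this, is, pen),
# with idf(t) = log(n_d / df(t)) + 1 as in smooth_idf=False.
# The commented-out lines are the smooth_idf=True variant,
# idf(t) = log((1 + n_d) / (1 + df(t))) + 1.

# is: in 3 of 3 documents
# is_ = 1/3 * (math.log((1 + 3.0)/(1 + 3.0)) + 1)
is_ = 1/3 * (math.log((3.0)/(3.0)) + 1)

# pen: in 2 of 3 documents
# pen = 1/3 * (math.log((1 + 3.0)/(1 + 2.0)) + 1)
pen = 1/3 * (math.log((3.0)/(2.0)) + 1)

# this: in 3 of 3 documents
# this = 1/3 * (math.log((1 + 3.0)/(1 + 3.0)) + 1)
this = 1/3 * (math.log((3.0)/(3.0)) + 1)

# Euclidean (L2) norm of the document vector
square_sum = math.pow(is_, 2) + math.pow(pen, 2) + math.pow(this, 2)
euclidean_denom = math.sqrt(square_sum)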

pen_norm = pen / euclidean_denom
print(pen_norm)

0.7049094889309326
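The hand calculation above can be generalized to every term of every document. The following is a sketch in plain Python, assuming tokens are lowercase runs of two or more word characters and that idf(t) = log(n_d / df(t)) + 1; the helper `tfidf_row` is hypothetical. It uses raw counts instead of count / length, because the per-document length factor cancels out under L2 normalization.

```python
import math
import re
from collections import Counter

docs = ["This is a pen.",
        "This is an apple.",
        "This is a pineapple and this is a pen."]

# Assumption: tokens are lowercase runs of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
n_d = len(docs)

# df[t] = number of documents that contain term t
df = Counter(t for toks in tokenized for t in set(toks))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # raw count * idf, with idf(t) = log(n_d / df(t)) + 1
    raw = {t: c * (math.log(n_d / df[t]) + 1) for t, c in counts.items()}
    # L2-normalize the document vector
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(toks) for toks in tokenized]
print(round(rows[0]["pen"], 4))  # 0.7049
```

The value for pen in the first document agrees with the number computed by hand above.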


In [2]:
corpus = ['This is a pen.',
          'This is an apple.',
          'This is a pineapple and this is a pen.',]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
X.toarray() 

array([[0, 0, 0, 1, 1, 0, 1],
       [1, 0, 1, 1, 0, 0, 1],
       [0, 1, 0, 2, 1, 1, 2]], dtype=int64)

In [4]:
# On scikit-learn 1.0 and later, use get_feature_names_out() instead.
vectorizer.get_feature_names()

['an', 'and', 'apple', 'is', 'pen', 'pineapple', 'this']
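The vocabulary above contains no 'a'. This is because CountVectorizer lowercases the text and, with its default token_pattern, keeps only tokens of two or more word characters. The sketch below imitates that step with the standard `re` module; `simple_tokens` is a hypothetical helper, not the library's implementation.

```python
import re

# CountVectorizer's documented default token_pattern.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

def simple_tokens(text, pattern=DEFAULT_PATTERN):
    # Lowercase first, then extract every match of the token pattern.
    return re.findall(pattern, text.lower())

corpus = ['This is a pen.',
          'This is an apple.',
          'This is a pineapple and this is a pen.']

vocabulary = sorted({tok for doc in corpus for tok in simple_tokens(doc)})
print(vocabulary)
# ['an', 'and', 'apple', 'is', 'pen', 'pineapple', 'this']
```

Passing a pattern such as r'\b\w+\b' keeps one-character tokens, so 'a' would then be part of the vocabulary.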

In [1]:
from sklearn.feature_extraction.text import TfidfTransformer

# smooth_idf=False uses idf(t) = log(n_d / df(t)) + 1;
# smooth_idf=True (the default) adds 1 to both n_d and df(t) inside the log.
transformer = TfidfTransformer(smooth_idf = False)
tfidf = transformer.fit_transform(X)
tfidf.toarray()

In [6]:
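# ngram_range=(1, 2) extracts unigrams and bigrams;
# this token_pattern also keeps one-character tokens such as 'a'.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r'\b\w+\b', min_df=1)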

In [7]:
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
X_2

array([[1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
       [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1],
       [2, 1, 1, 0, 0, 1, 1, 0, 2, 2, 0, 1, 1, 1, 2, 2]], dtype=int64)

In [8]:
# On scikit-learn 1.0 and later, use get_feature_names_out() instead.
bigram_vectorizer.get_feature_names()

['a',
 'a pen',
 'a pineapple',
 'an',
 'an apple',
 'and',
 'and this',
 'apple',
 'is',
 'is a',
 'is an',
 'pen',
 'pineapple',
 'pineapple and',
 'this',
 'this is']
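The feature list above mixes single words and adjacent word pairs. As a rough sketch of how such features can be generated from a token list (the helper `word_ngrams` is made up for this note and is not the library's code):

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    # Slide a window of each size n over the tokens and join each window.
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams(["this", "is", "a", "pen"]))
# ['this', 'is', 'a', 'pen', 'this is', 'is a', 'a pen']
```

Applying this to all three documents and sorting the distinct results gives the 16 features listed above.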

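Next, we analyze the IMDB dataset using tf-idf.

Here we classify with logistic regression, but try other methods as well and use cross-validation while working to improve the classification accuracy.
Once the accuracy is reasonably good, evaluate on the test data.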
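One possible way to compare classifiers under cross-validation is sketched below. To keep it quick to run, it uses a handful of made-up toy reviews instead of IMDB; the texts, labels, and the choice of naive Bayes as the second candidate are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for IMDB (1 = positive, 0 = negative).
texts = ["great movie", "wonderful acting", "loved it", "great fun",
         "terrible movie", "awful acting", "hated it", "boring and awful"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "logistic regression": LogisticRegression(),
    "naive Bayes": MultinomialNB(),
}

results = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline, so IDF is refit on each
    # training fold and never sees the held-out fold.
    model = make_pipeline(TfidfVectorizer(smooth_idf=False), clf)
    scores = cross_val_score(model, texts, labels, cv=4)
    results[name] = scores.mean()
    print(name, results[name])
```

Keeping the test set out of this loop entirely, and scoring on it only once at the end, matches the advice above.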

In [3]:
from keras.datasets import imdb
(x_train,  y_train), (x_test, y_test) = imdb.load_data()

x_train.shape, y_train.shape, x_test.shape, y_test.shape

Using TensorFlow backend.


((25000,), (25000,), (25000,), (25000,))

In [2]:
max(x_train[0])

31050

In [3]:
import collections

# One Counter of word-index frequencies per review
counts = [collections.Counter(t) for t in x_train]

In [4]:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer()
x_train_counts = v.fit_transform(counts)
x_train_counts.toarray() 

array([[  1.,   0.,  15., ...,   0.,   0.,   0.],
       [  1.,   0.,  15., ...,   0.,   0.,   0.],
       [  1.,   0.,   9., ...,   0.,   0.,   0.],
       ..., 
       [  1.,   0.,  13., ...,   0.,   0.,   0.],
       [  1.,   0.,   5., ...,   0.,   0.,   0.],
       [  1.,   0.,  10., ...,   0.,   0.,   0.]])

In [5]:
x_train_counts.__class__

scipy.sparse.csr.csr_matrix

In [6]:
# Same counting step for the test reviews
counts_test = [collections.Counter(t) for t in x_test]

x_test_counts = v.transform(counts_test)
x_train_counts.shape, x_test_counts.shape

((25000, 88585), (25000, 88585))
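It is worth estimating how much memory a matrix of this shape would take if converted to a dense array. The back-of-the-envelope sketch below assumes float64 values; the CSR estimate further assumes 32-bit indices, and the number of nonzeros passed to it is a purely hypothetical figure, not a measurement of this dataset.

```python
rows, cols = 25000, 88585   # shape reported above
bytes_per_value = 8         # float64

dense_bytes = rows * cols * bytes_per_value
dense_gib = dense_bytes / 2**30
print(dense_gib)            # roughly 16.5 GiB

def csr_bytes(nnz, n_rows, value_bytes=8, index_bytes=4):
    # CSR stores one value and one column index per nonzero,
    # plus a row-pointer array of length n_rows + 1.
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes

# Hypothetical: 100 distinct words per review on average.
sparse_mib = csr_bytes(nnz=100 * rows, n_rows=rows) / 2**20
print(sparse_mib)
```

Under these assumptions the sparse form is several hundred times smaller, which is why calling .toarray() on matrices of this size is best avoided.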

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf = False)

# Keep the matrices sparse: LogisticRegression accepts scipy sparse input,
# and a dense 25000 x 88585 array is very expensive to build and fit.
x_train_tfidf = transformer.fit_transform(x_train_counts)
x_test_tfidf = transformer.transform(x_test_counts)

x_train_tfidf.shape, y_train.shape, x_test_tfidf.shape, y_test.shape

((25000, 88585), (25000,), (25000, 88585), (25000,))

In [8]:
from sklearn.linear_model import LogisticRegression
# In the original run, fitting on dense arrays was interrupted (KeyboardInterrupt).
clf = LogisticRegression()
clf.fit(x_train_tfidf, y_train)

In [None]:
clf.score(x_test_tfidf, y_test)

In [4]:
x_train_imdb = x_train[:1000]
x_train_imdb.shape

(1000,)

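In [5]:
# Join each review's word indices into one space-separated string.
# Use the 1000-review subset so the rows line up with y_train_bigram.
x_train_imdb_str = [' '.join(str(t) for t in d) for d in x_train_imdb]

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
# Note: the default token_pattern drops one-character tokens,
# so single-digit word indices are not counted.
bigram_vectorizer_imdb = CountVectorizer(ngram_range=(1, 2), min_df=1)

# Keep the result sparse; a dense array with this many columns is very large.
x_train_bigram = bigram_vectorizer_imdb.fit_transform(x_train_imdb_str)


In [7]:
# Number of unigram and bigram features.
# The original run, which vectorized all 25000 reviews, reported 1800953.
x_train_bigram.shape[1]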

In [8]:
y_train_bigram = y_train[:1000]

In [None]:
from sklearn.linear_model import LogisticRegression
clf_bigram = LogisticRegression()
clf_bigram.fit(x_train_bigram, y_train_bigram)