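# Text Classification with Word2Vec

This notebook shows how to use a pre-trained Word2Vec model to extract features and classify text.

The dataset is Sentiment Labelled Sentences from the UCI repository.
It was built by collecting 1,500 positive and 1,500 negative sentences from Amazon, Yelp, and IMDB.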

- http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences


For the pre-trained embedding model, we use the Google News vectors.

- https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

## Setup

### Imports

In [26]:
import os
from string import punctuation
from time import time

import nltk
import numpy as np
from gensim.models import Word2Vec, KeyedVectors
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Downloading the dataset

In [8]:
!mkdir DATAPATH
!wget -P DATAPATH http://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip
!unzip "DATAPATH/sentiment labelled sentences.zip"

'sentiment labelled sentences.zip'  'sentiment labelled sentences.zip.1'
Archive:  DATAPATH/sentiment labelled sentences.zip
   creating: sentiment labelled sentences/
  inflating: sentiment labelled sentences/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/sentiment labelled sentences/
  inflating: __MACOSX/sentiment labelled sentences/._.DS_Store  
  inflating: sentiment labelled sentences/amazon_cells_labelled.txt  
  inflating: sentiment labelled sentences/imdb_labelled.txt  
  inflating: __MACOSX/sentiment labelled sentences/._imdb_labelled.txt  
  inflating: sentiment labelled sentences/readme.txt  
  inflating: __MACOSX/sentiment labelled sentences/._readme.txt  
  inflating: sentiment labelled sentences/yelp_labelled.txt  
  inflating: __MACOSX/._sentiment labelled sentences  


In [18]:
!mv "sentiment labelled sentences/"*.txt .
!cat amazon_cells_labelled.txt imdb_labelled.txt yelp_labelled.txt > DATAPATH/sentiment_sentences.txt

In [19]:
import gdown
gdown.download('https://drive.google.com/u/0/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM', 'GoogleNews-vectors-negative300.bin.gz', quiet=False)
!gunzip GoogleNews-vectors-negative300.bin.gz
!mv GoogleNews-vectors-negative300.bin DATAPATH

### Loading the data

In [20]:
# Load the pre-trained word2vec model and the dataset
data_path = "DATAPATH"
path_to_model = os.path.join(data_path, 'GoogleNews-vectors-negative300.bin')
training_data_path = os.path.join(data_path, "sentiment_sentences.txt")

# Load Word2vec. This takes a while.
%time w2v_model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
print('done loading Word2Vec')

# Load the texts and their categories
# The file is tab-separated: each line holds a sentence and its category
texts = []
cats = []
with open(training_data_path) as fh:
    for line in fh:
        # Strip the trailing newline so the labels are "0"/"1" rather than "0\n"/"1\n"
        text, sentiment = line.rstrip("\n").split("\t")
        texts.append(text)
        cats.append(sentiment)

CPU times: user 44.2 s, sys: 3.69 s, total: 47.9 s
Wall time: 1min 17s
done loading Word2Vec


In [21]:
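# Inspect the model
# (`vocab` is the gensim 3.x attribute; gensim 4.0 and later expose `key_to_index` instead)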
word2vec_vocab = w2v_model.vocab.keys()
word2vec_vocab_lower = [item.lower() for item in word2vec_vocab]
print(len(word2vec_vocab))

3000000


In [22]:
# Inspect the dataset
print(len(cats), len(texts))
print(texts[1])
print(cats[1])

3000 3000
Good case, Excellent value.
1
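The loading step relies on each line holding a sentence and a label separated by a tab. The sketch below is one possible way to parse that format defensively and check the label balance. It runs on a small in-memory sample whose lines are invented for illustration, and the helper name `read_labelled_sentences` is hypothetical.

```python
import io
from collections import Counter

# Illustrative stand-in for sentiment_sentences.txt: one "sentence<TAB>label" per line.
# These sample lines are made up for the example, not taken from the dataset.
sample = io.StringIO(
    "Great phone, works well.\t1\n"
    "The battery died after a week.\t0\n"
    "\n"
    "Would buy again.\t1\n"
)

def read_labelled_sentences(fh):
    """Return (texts, labels) from a tab-separated file-like object."""
    texts, labels = [], []
    for line in fh:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so any tab inside the sentence is preserved
        text, label = line.rsplit("\t", 1)
        texts.append(text.strip())
        labels.append(label.strip())
    return texts, labels

sample_texts, sample_labels = read_labelled_sentences(sample)
label_counts = Counter(sample_labels)
print(label_counts)
```

Applied to the real file, the same count should come out as 1,500 per class if the dataset matches its description.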



## Text preprocessing

In [27]:
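# Preprocess the texts
def preprocess_corpus(texts):
    mystopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        # Drop stopwords, digit-only tokens, and punctuation, then lowercase what remains.
        # Note: the stopword check runs on the original casing, so capitalized
        # stopwords such as "The" are kept.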
        return [token.lower() for token in tokens if token not in mystopwords and not token.isdigit()
               and token not in punctuation]
    return [remove_stops_digits(word_tokenize(text)) for text in texts]

texts_processed = preprocess_corpus(texts)
print(len(cats), len(texts_processed))
print(texts_processed[1])
print(cats[1])

3000 3000
['good', 'case', 'excellent', 'value']
1
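In the preprocessing function above, tokens are compared against the stopword list before they are lowercased, so a capitalized stopword such as "The" passes the filter. The sketch below shows an alternative ordering that lowercases first. It is only an illustration: `STOPWORDS` is a tiny hypothetical list standing in for NLTK's English stopwords, and `clean_tokens` is not part of the notebook.

```python
from string import punctuation

# Tiny hypothetical stopword list for illustration; the notebook uses NLTK's English list.
STOPWORDS = {"the", "is", "a", "and", "this"}

def clean_tokens(tokens, stopwords=STOPWORDS):
    """Lowercase first, then drop stopwords, digit-only tokens, and punctuation."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered in stopwords or lowered.isdigit() or lowered in punctuation:
            continue
        cleaned.append(lowered)
    return cleaned

tokens = ["The", "case", "is", "excellent", ",", "10", "stars", "!"]
result = clean_tokens(tokens)
print(result)  # ['case', 'excellent', 'stars']
```

Switching to this ordering would change the feature vectors, so the accuracy reported in this notebook would no longer apply as-is. Note also that the Google News vectors are case-sensitive, so lowercasing in either ordering can discard information carried by capitalized words.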



In [31]:
# Build a feature vector for each sentence by averaging the embeddings of its words
def embedding_feats(list_of_lists):
    DIMENSION = 300
    feats = []
    for tokens in list_of_lists:
        feat_for_this = np.zeros(DIMENSION)
        count_for_this = 0
        for token in tokens:
            # Skip tokens that are not in the word2vec vocabulary
            if token in w2v_model:
                feat_for_this += w2v_model[token]
                count_for_this += 1
        if count_for_this:
            feats.append(feat_for_this / count_for_this)
        else:
            # No known tokens: fall back to the zero vector
            feats.append(feat_for_this)
    return feats


train_vectors = embedding_feats(texts_processed)
print(len(train_vectors))

3000
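To make the averaging step concrete, here is a minimal sketch of the same idea on a toy 3-dimensional embedding table. The dictionary `toy_vectors` and the helper `average_embedding` are hypothetical stand-ins for the 300-dimensional Google News model and are not part of the notebook.

```python
import numpy as np

# Toy 3-dimensional embedding table standing in for the 300-dimensional model.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "case": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of in-vocabulary tokens; zero vector if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

vec = average_embedding(["good", "case", "unknownword"], toy_vectors, dim=3)
print(vec)  # [2. 1. 1.]

# A sentence with no known words maps to the zero vector
empty = average_embedding(["unknownword"], toy_vectors, dim=3)
```

Out-of-vocabulary tokens are simply ignored, so a sentence is represented only by the words the model knows. Averaging also discards word order.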


## Training the model

In [32]:
# Train the model
classifier = LogisticRegression(random_state=1234)
train_data, test_data, train_cats, test_cats = train_test_split(train_vectors, cats)
classifier.fit(train_data, train_cats)
print("Accuracy: ", classifier.score(test_data, test_cats))
preds = classifier.predict(test_data)
print(classification_report(test_cats, preds))

Accuracy:  0.8386666666666667
              precision    recall  f1-score   support

           0       0.84      0.83      0.84       369
           1       0.84      0.84      0.84       381

    accuracy                           0.84       750
   macro avg       0.84      0.84      0.84       750
weighted avg       0.84      0.84      0.84       750



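An accuracy of about 84% is a reasonable result for a simple baseline that averages word vectors and feeds them to logistic regression.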
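The split above calls `train_test_split` with its defaults, so a different 750-sentence test set is drawn on every run and the reported accuracy will vary slightly. The sketch below shows one way to make the split repeatable and keep the class ratio equal in both parts. It uses synthetic data rather than the notebook's vectors, and the specific `random_state` and `test_size` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 random 5-dimensional "sentence vectors"
# whose label depends on the first coordinate. Not the notebook's dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Fixing random_state makes the split repeatable; stratify keeps the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

clf = LogisticRegression(random_state=1234)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(X_train.shape, X_test.shape)
```

With a fixed seed, repeated runs evaluate on the same held-out sentences, which makes it easier to compare changes to preprocessing or features.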