<a href="https://colab.research.google.com/github/natumn/ACL2017.6.27-2017.7.11/blob/master/story_or_not_classifiy_doc2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Preprocessing before model generation

Access the story data stored in Cloud Storage.



In [0]:
import json

from google.colab import auth

# Authenticate this Colab session so that gcloud and gsutil can reach the bucket.
auth.authenticate_user()
!gcloud config set project tellerapi-dev
# -p keeps the cell re-runnable when the directory already exists.
!mkdir -p /content/data
!gsutil cp gs://natumn-dev/story-data/000000000000 /content/data/story-data0.json
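
The same download can also be done from Python instead of `gsutil`. The following is a minimal sketch, assuming the `google-cloud-storage` client library is available and the session has already been authenticated; the helper name `download_story_shard` is made up for this illustration and is not part of the notebook.

```python
def download_story_shard(project, bucket_name, blob_name, destination):
    """Download one Cloud Storage object to a local file and return its path."""
    # Imported lazily so the helper can be defined even where the
    # google-cloud-storage package is not installed.
    from google.cloud import storage

    client = storage.Client(project=project)
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.download_to_filename(destination)
    return destination


# Example call mirroring the gsutil command above (needs valid credentials):
# download_story_shard('tellerapi-dev', 'natumn-dev',
#                      'story-data/000000000000',
#                      '/content/data/story-data0.json')
```

Keeping the download in Python makes it easier to loop over several shards later, at the cost of a little more code than the one-line `gsutil cp`.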


In [0]:
# Each input line is one JSON record; its 'Script' field holds a second,
# JSON-encoded document with the scenes and operations of the story.
with open('/content/data/story-data0.json', encoding='utf-8') as f, \
        open('/content/data/storyStrings.txt', 'w', encoding='utf-8') as story_file:
    for counter, line in enumerate(f, start=1):
        story_str = '\n#story{0}\n'.format(counter)
        data = json.loads(line)
        script = data.get('Script')
        if script:
            script_data = json.loads(script)
            # "or []" covers a missing, None, or empty sceneList / ops.
            for scene in script_data.get('sceneList') or []:
                for op in scene.get('ops') or []:
                    # Prefer postMessageOp and fall back to showTextOp.
                    message_op = op.get('postMessageOp') or op.get('showTextOp')
                    if message_op and message_op.get('text') is not None:
                        story_str += message_op.get('text') + ' '

        story_file.write(story_str)
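
To make the nesting easier to follow, here is a small stand-alone sketch of the same extraction applied to a single record. The sample record is invented and only mirrors the keys the cell above reads (`Script`, `sceneList`, `ops`, `postMessageOp`, `showTextOp`, `text`); the helper name `extract_story_text` is likewise hypothetical, and it joins the texts with single spaces rather than appending a trailing space.

```python
import json


def extract_story_text(record_line):
    """Collect the text of every postMessageOp/showTextOp in one JSON record."""
    data = json.loads(record_line)
    script = data.get('Script')
    if not script:
        return ''
    texts = []
    # 'Script' is itself a JSON-encoded string, so it is decoded a second time.
    for scene in json.loads(script).get('sceneList') or []:
        for op in scene.get('ops') or []:
            message_op = op.get('postMessageOp') or op.get('showTextOp')
            if message_op and message_op.get('text') is not None:
                texts.append(message_op['text'])
    return ' '.join(texts)


# Hypothetical record shaped after the keys used in the notebook.
sample_script = {
    'sceneList': [
        {'ops': [
            {'postMessageOp': {'text': 'Hello'}},
            {'showTextOp': {'text': 'It was a dark night.'}},
            {'otherOp': {}},
        ]},
        {'ops': None},
    ]
}
sample_line = json.dumps({'Script': json.dumps(sample_script)})
print(extract_story_text(sample_line))
```

Operations without a text field, and scenes without `ops`, are skipped silently.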


## Computing document similarity with Doc2Vec

We use Doc2Vec through the gensim library.

[About gensim](https://radimrehurek.com/gensim/)

[Introduction to gensim (Qiita article)](https://qiita.com/u6k/items/5170b8d8e3f41531f08a)

First, install gensim.


In [0]:
!pip install --upgrade gensim

Install JUMAN++, a morphological analyzer.

[About JUMAN++](http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN++)

In [0]:
!wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.01.tar.xz
!tar -Jxvf jumanpp-1.01.tar.xz
%cd jumanpp-1.01 
!./configure
!make
!sudo make install
!jumanpp -v
%cd /content

Install KNP, a Japanese syntactic parser.

[About KNP](http://nlp.ist.i.kyoto-u.ac.jp/?KNP)


In [0]:
!wget http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/knp/knp-4.19.tar.bz2
!tar xf knp-4.19.tar.bz2
%cd knp-4.19
!./configure
!make
!sudo make install
!echo "knpとjumanを組み合わせる" | jumanpp | knp
%cd /content

Install pyknp, the Python binding for KNP, so that KNP can be run from Python.

[Source](http://nlp.ist.i.kyoto-u.ac.jp/index.php?PyKNP)

[GitHub](https://github.com/ku-nlp/pyknp)

In [0]:
!wget http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/knp/pyknp-0.3.tar.gz
!tar zxvf pyknp-0.3.tar.gz
%cd pyknp-0.3
!python setup.py install
%cd /content

Train Doc2Vec on the documents.
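
The training cell collects its documents with `corpus_files()`, which expects one sub-directory per category under `./text` and is meant to skip the `LICENSE` file in each of them. Below is a minimal sketch of that file discovery on a throwaway directory tree. The category and file names are invented, `list_corpus_files` is a hypothetical helper, and it selects categories with `is_dir()` instead of the notebook's "does not end with `.txt`" test.

```python
import tempfile
from pathlib import Path


def list_corpus_files(root):
    """Return every document under root/<category>/, skipping LICENSE files."""
    root = Path(root)
    docs = []
    for category in sorted(p for p in root.iterdir() if p.is_dir()):
        for doc in sorted(category.iterdir()):
            if not doc.name.startswith('LICENSE'):
                docs.append(doc)
    return docs


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'text'
    # Made-up layout: one top-level README plus two category directories.
    for rel in ['README.txt', 'cat-a/LICENSE.txt',
                'cat-a/cat-a-1.txt', 'cat-b/cat-b-1.txt']:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text('dummy', encoding='utf-8')
    found = [p.relative_to(root).as_posix() for p in list_corpus_files(root)]
    print(found)
```

Each returned path doubles as the document's tag in the model, which is why the similarity lookup later refers to documents by file path.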

In [0]:
import sys
from os import listdir, path

from gensim import models
from gensim.models.doc2vec import TaggedDocument
from pyknp import Jumanpp

# Written against the gensim 4 API (TaggedDocument, vector_size, epochs, model.dv),
# because the install cell above upgrades gensim to the latest release.

# Create the analyzer once instead of once per document.
jumanpp = Jumanpp()

def corpus_files():
    # Category directories are the entries of ./text that are not .txt files.
    dirs = [path.join('./text', x)
            for x in listdir('./text') if not x.endswith('.txt')]
    # Skip the LICENSE file inside each category directory.
    docs = [path.join(x, y)
            for x in dirs for y in listdir(x) if not y.startswith('LICENSE')]
    return docs

def read_document(filepath):
    with open(filepath, 'r') as f:
        return f.read()

def split_into_words(text):
    result = jumanpp.analysis(text)
    return [mrph.midasi for mrph in result.mrph_list()]

def doc_to_sentence(doc, name):
    words = split_into_words(doc)
    return TaggedDocument(words=words, tags=[name])

def corpus_to_sentences(corpus):
    for idx, name in enumerate(corpus):
        sys.stdout.write('\rPreprocessing {}/{}'.format(idx + 1, len(corpus)))
        yield doc_to_sentence(read_document(name), name)

corpus = corpus_files()
# Materialize the generator: the corpus is iterated once to build the
# vocabulary and again on every training epoch.
sentences = list(corpus_to_sentences(corpus))

# alpha decays linearly from 0.025 to 0.0001 over the 20 epochs.
model = models.Doc2Vec(dm=0, vector_size=300, window=15, alpha=.025,
        min_alpha=.0001, min_count=1, sample=1e-6, epochs=20)
model.build_vocab(sentences)

print('\nStart training')
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

model.save('doc2vec.model')
model = models.Doc2Vec.load('doc2vec.model')

# model.dv.similarity('./text/livedoor-homme/livedoor-homme-4700669.txt', './text/movie-enter/movie-enter-5947726.txt')
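
The commented-out `similarity` call at the end of the training cell compares two documents by the cosine similarity of their learned vectors. As a rough illustration of that measure, independent of gensim, the sketch below computes it by hand for made-up 3-dimensional vectors standing in for the 300-dimensional ones the model learns.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented vectors, for illustration only.
doc_a = [0.2, 0.1, 0.7]
doc_b = [0.4, 0.2, 1.4]    # same direction as doc_a, twice the length
doc_c = [-0.7, 0.1, 0.2]   # points in a mostly unrelated direction

print(cosine_similarity(doc_a, doc_b))  # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # close to 0: nearly orthogonal
```

Because only the direction of the vectors matters, a long and a short document on the same topic can still score close to 1.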