# Distributed Representation - doc2vec (paragraph vector)
This notebook walks through doc2vec (paragraph vector), a method for representing a document, paragraph, or sentence as a continuous vector. The paragraph vector model comes in two variants, PV-DM and PV-DBOW; this notebook implements only PV-DM. The training data are the words in the abstracts of text mining papers scraped from arXiv. For details, see the paper linked below.
  
* _Uses nltk._
* Uses gensim.
* nltk : http://www.nltk.org/book/
* gensim : https://radimrehurek.com/gensim/index.html
* Paper : http://proceedings.mlr.press/v32/le14.pdf

## doc2vec
### Load modules

In [1]:
import os, sys
import nltk
import re
import pandas as pd
import numpy as np
from gensim.models.word2vec import Word2Vec
os.listdir()



['.ipynb_checkpoints',
 'Distributed Representation (doc2vec or paragraph vector).ipynb',
 'Distributed Representation (word2vec).ipynb',
 'Document Representation (term frequency, tf-idf).ipynb',
 'Scrapping text mining papers in arXiv.py',
 'Simple NLP for English.ipynb',
 'Simple NLP for Korean.ipynb',
 'text_mining_paper.csv']

### Load abstracts of text mining papers

In [2]:
papers = pd.read_csv('./text_mining_paper.csv', encoding = 'cp949')
papers.head()

| Unnamed: 0 | abstract | author | meta | subject | title |
|---|---|---|---|---|---|
| 0 | The complicated, evolving landscape of cancer ... | Rocco Piazza, Daniele Ramazzotti, Roberta Spin... | Thu, 9 Mar 2017 01:24:23 GMT (948kb) | Genomics (q-bio.GN) | OncoScore: a novel, Internet-based tool to ass... |
| 1 | Mining textual patterns in news, tweets, paper... | Meng Jiang, Jingbo Shang, Taylor Cassidy, Xian... | Mon, 13 Mar 2017 01:06:19 GMT (1150kb,D) [v2] ... | Computation and Language (cs.CL) | MetaPAD: Meta Pattern Discovery from Massive T... |
| 2 | This paper is a tutorial on Formal Concept Ana... | Dmitry I. Ignatov | Wed, 8 Mar 2017 12:53:21 GMT (3541kb,D) | Information Retrieval (cs.IR) | Introduction to Formal Concept Analysis and It... |
| 3 | Topic models have been widely used in discover... | Jarvan Law, Hankz Hankui Zhuo, Junhua He, Erhu... | Thu, 23 Feb 2017 07:16:03 GMT (96kb,D) | Computation and Language (cs.CL) | LTSG: Latent Topical Skip-Gram for Mutually Le... |
| 4 | Entity extraction is fundamental to many text ... | Zeyi Wen, Dong Deng, Rui Zhang, Kotagiri Ramam... | Sun, 12 Feb 2017 12:46:40 GMT (89kb) | Databases (cs.DB) | A Technical Report: Entity Extraction using Bo... |


In [3]:
abstracts = list(papers['abstract'])

### Preprocessing
1. Extract English words of two or more letters and convert everything to lowercase
2. Convert the result into the corpus format that gensim's Doc2Vec class accepts as input  
(using namedtuple from the collections module)

Reference : http://stackoverflow.com/questions/31321209/doc2vec-how-to-get-document-vectors  
Reference : https://radimrehurek.com/gensim/models/doc2vec.html#tutorial  
Reference : https://www.lucypark.kr/slides/2015-pyconkr/#1

In [4]:
# Lowercase each abstract, then keep runs of two or more ASCII letters
corpus = [re.findall(r'[a-z]{2,}', x.lower()) for x in abstracts]
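
The tokenization step can be tried on a single sentence to see what survives. The sketch below uses a made-up sentence and a hypothetical helper name `tokenize`; neither comes from the notebook's data. It also shows that the class `[a-z]` is not the same as `[A-z]`: the latter spans ASCII 65 through 122 and so also matches characters such as `_` and `^`.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of two or more ASCII letters."""
    return re.findall(r'[a-z]{2,}', text.lower())

# Made-up sentence, only for illustration
sample = "Topic models (e.g. LDA) have been widely used in 2017."
tokens = tokenize(sample)
print(tokens)

# '[A-z]' covers ASCII 65..122, which includes '[', '\\', ']', '^', '_' and '`'
wide = re.findall(r'[A-z]{2,}', 'word_embedding')
narrow = re.findall(r'[a-z]{2,}', 'word_embedding')
print(wide)    # the underscore is treated as part of the token
print(narrow)  # the underscore splits the token
```

Single letters such as the "e" and "g" in "e.g." and all digits are dropped, which is what the two-letter minimum is meant to do.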

In [5]:
from collections import namedtuple
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

In [6]:
tagged_train_docs = [TaggedDocument(words, [tags]) for tags, words in enumerate(corpus)]
tagged_train_docs[0:2]

[TaggedDocument(words=['the', 'complicated', 'evolving', 'landscape', 'of', 'cancer', 'mutations', 'poses', 'formidable', 'challenge', 'to', 'identify', 'cancer', 'genes', 'among', 'the', 'large', 'lists', 'of', 'mutations', 'typically', 'generated', 'in', 'ngs', 'experiments', 'the', 'ability', 'to', 'prioritize', 'these', 'variants', 'is', 'therefore', 'of', 'paramount', 'importance', 'to', 'address', 'this', 'issue', 'we', 'developed', 'oncoscore', 'text', 'mining', 'tool', 'that', 'ranks', 'genes', 'according', 'to', 'their', 'association', 'with', 'cancer', 'based', 'on', 'available', 'biomedical', 'literature', 'receiver', 'operating', 'characteristic', 'curve', 'and', 'the', 'area', 'under', 'the', 'curve', 'auc', 'metrics', 'on', 'manually', 'curated', 'datasets', 'confirmed', 'the', 'excellent', 'discriminating', 'capability', 'of', 'oncoscore', 'oncoscore', 'cut', 'off', 'threshold', 'auc', 'ci', 'indicating', 'that', 'oncoscore', 'provides', 'useful', 'results', 'in', 'cases

### Training doc2vec
See the official documentation for the full list of options. This example trains with the following parameters.  
Reference : https://radimrehurek.com/gensim/models/doc2vec.html
1. *size = 100* : embed into 100-dimensional vectors (both documents and words; whether the word embeddings obtained this way behave like those from word2vec still needs to be verified)
2. *dm = 1* : PV-DM (paragraph vector : distributed memory)
3. *dm_concat = 1* : concatenate the word vectors and the paragraph vector when forming the values of the single hidden layer  
   (the paper reports that concatenation performs better than averaging.)
4. *min_count = 2* : keep only words that appear at least twice
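
To make the difference between concatenation and averaging concrete, here is a minimal NumPy sketch of how the input to the prediction layer could be formed in PV-DM. The window size, variable names, and random vectors are illustrative assumptions; this is not gensim's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100      # embedding size, matching size = 100
window = 4     # hypothetical number of context words

paragraph_vec = rng.normal(size=dim)            # one vector per document
context_vecs = rng.normal(size=(window, dim))   # one vector per context word

# Concatenation: word order is preserved and the input grows with the window
concat_input = np.concatenate([paragraph_vec, context_vecs.ravel()])

# Averaging: the input keeps a fixed size, but word order is lost
mean_input = np.vstack([paragraph_vec, context_vecs]).mean(axis=0)

print(concat_input.shape)  # (500,)
print(mean_input.shape)    # (100,)
```

With concatenation the prediction layer sees (1 + window) * dim inputs, so it has more parameters to train than the averaging variant.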

In [7]:
from gensim.models import doc2vec
# Configure the model (the vocabulary itself is built in the next cell)
config = {'size' : 100, 'dm_concat' : 1, 'dm' : 1, 'min_count' : 2}
doc_vectorizer = doc2vec.Doc2Vec(**config)

In [8]:
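# Train document vectors!
# Note: newer gensim releases require total_examples and epochs to be passed to train()
doc_vectorizer.build_vocab(tagged_train_docs)
n_epochs = 100
alpha_step = 0.0002  # default alpha is 0.025; 0.025 - 100 * 0.0002 = 0.005, so the rate stays positive
for epoch in range(n_epochs):
    doc_vectorizer.train(tagged_train_docs)
    doc_vectorizer.alpha -= alpha_step  # decrease the learning rate
    doc_vectorizer.min_alpha = doc_vectorizer.alpha  # fix the learning rate, no decay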
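
When the learning rate is lowered by hand like this, the step and the number of epochs have to be chosen together so that the rate never reaches zero or goes negative. The sketch below is plain Python arithmetic with no gensim involved; `alpha_schedule` is a hypothetical helper, and 0.025 is assumed as the starting value because it is gensim's default `alpha`.

```python
def alpha_schedule(start_alpha, step, n_epochs):
    """Return the learning rate used in each epoch of a fixed-step decrease."""
    return [start_alpha - step * epoch for epoch in range(n_epochs)]

# A step of 0.0002 over 100 epochs keeps every value positive
safe = alpha_schedule(0.025, 0.0002, 100)
print(min(safe) > 0)

# A step of 0.002 over 100 epochs does not: count the epochs that stay positive
risky = alpha_schedule(0.025, 0.002, 100)
positive_epochs = sum(alpha > 0 for alpha in risky)
print(positive_epochs)
```

A quick check of this kind before a long training run is cheaper than discovering afterwards that most epochs ran with a negative learning rate.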

In [9]:
# Document Embedding
np.asarray(doc_vectorizer.docvecs).shape

(168, 100)
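
One thing a document embedding matrix of this shape could be used for is finding similar abstracts by cosine similarity. The sketch below uses a random stand-in matrix in place of the trained vectors, and `most_similar_docs` is a hypothetical helper rather than a gensim function.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for the trained document embeddings (168 documents, 100 dimensions)
doc_vectors = rng.normal(size=(168, 100))

def most_similar_docs(vectors, query_idx, topn=5):
    """Return (index, cosine similarity) pairs for the topn nearest documents."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]       # cosine similarity to the query document
    sims[query_idx] = -np.inf           # exclude the query itself
    top = np.argsort(-sims)[:topn]      # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in top]

neighbours = most_similar_docs(doc_vectors, query_idx=0)
for idx, sim in neighbours:
    print(idx, round(sim, 3))
```

With the real embeddings, the returned indices could be looked up in `papers['title']` to check whether the neighbours are topically related.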

In [10]:
# Word Embedding
my_word = doc_vectorizer.wv.index2word
word_embedding = [doc_vectorizer.wv[token] for token in my_word]
word_embedding = np.asarray(word_embedding)

In [11]:
word_embedding.shape

(2214, 100)