# Gensim - Doc2Vec for Content Similarity
Comparing document vectors is a simple way to measure how similar two articles are.
In this example I use a small corpus of only 77 documents scraped from Google News Indonesia.

The only module used is gensim.

## Requirement
- Gensim 2.0

## Simple Code

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim
from pprint import pprint
import multiprocessing
import os

2017-05-07 12:39:28,206 : INFO : 'pattern' package found; tag filters are available for English


In [3]:
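import io

dirname = 'google_news'
documents_file = os.listdir(dirname)

documents = []

for fname in documents_file:
    # io.open decodes UTF-8 while reading (Python 2 and 3); the with-block closes the file
    with io.open(os.path.join(dirname, fname), encoding='utf-8') as f:
        content = f.read()
    # the file name without the .txt extension becomes the document tag
    title = fname.replace('.txt', '')
    # lowercase and tokenize; tokens longer than 30 characters are dropped
    documents.append(TaggedDocument(gensim.utils.simple_preprocess(content, max_len=30), [title]))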

In [4]:
pprint(documents[0].tags)

["'Saya Turun di Lapangan dan tak Melihat Ada Upaya Makar'"]


In [5]:
pprint(documents[0].words[:10])

[u'republika',
 u'co',
 u'id',
 u'jakarta',
 u'ketua',
 u'setara',
 u'institute',
 u'hendardi',
 u'yang',
 u'menyebut']
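
The token list above is produced by `gensim.utils.simple_preprocess`. To make the idea concrete, here is a rough stdlib-only sketch of that kind of preprocessing: lowercase the text, pull out alphabetic tokens, and drop tokens that are too short or too long. The helper name `rough_preprocess` and the sample sentence are made up for illustration, and this is an approximation rather than gensim's actual implementation.

```python
import re

# Hypothetical helper: approximates lowercasing plus length-filtered tokenization.
# It is NOT gensim's code (simpler token pattern, no accent handling).
def rough_preprocess(text, min_len=2, max_len=30):
    # runs of letters only: digits, punctuation and underscores split tokens
    tokens = re.findall(r"[^\W\d_]+", text.lower(), flags=re.UNICODE)
    # keep tokens whose length lies within [min_len, max_len]
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Invented sample sentence, assumed to resemble the start of a news article
sample = "REPUBLIKA.CO.ID, JAKARTA -- Ketua Setara Institute Hendardi yang menyebut"
tokens = rough_preprocess(sample)
print(tokens)
```

The `max_len=30` default here mirrors the value passed in cell 3; gensim's own default for that parameter is smaller.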


In [6]:
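cores = multiprocessing.cpu_count()
# dm=0 selects PV-DBOW; dbow_words=1 also trains skip-gram word vectors alongside the doc vectors
# size=200 dimensions, window=10, words seen fewer than 10 times are ignored, 200 training passes
model = Doc2Vec(dm=0, dbow_words=1, size=200, window=10, min_count=10, iter=200, workers=cores, sample=1e-4, negative=5)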

In [None]:
model.build_vocab(documents,update=False)
print(str(model))

2017-05-07 12:39:28,413 : INFO : collecting all words and their counts
2017-05-07 12:39:28,416 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-05-07 12:39:28,448 : INFO : collected 5744 word types and 119 unique tags from a corpus of 119 examples and 31481 words
2017-05-07 12:39:28,451 : INFO : Loading a fresh vocabulary
2017-05-07 12:39:28,523 : INFO : min_count=10 retains 598 unique words (10% of original 5744, drops 5146)
2017-05-07 12:39:28,525 : INFO : min_count=10 leaves 19884 word corpus (63% of original 31481, drops 11597)
2017-05-07 12:39:28,534 : INFO : deleting the raw counts dictionary of 5744 items
2017-05-07 12:39:28,536 : INFO : sample=0.0001 downsamples 598 most-common words
2017-05-07 12:39:28,539 : INFO : downsampling leaves estimated 5476 word corpus (27.5% of prior 19884)
2017-05-07 12:39:28,551 : INFO : estimated required memory for 598 words and 200 dimensions: 1374800 bytes
2017-05-07 12:39:28,560 : INFO : resetting layer weig

Doc2Vec(dbow+w,d200,n5,w10,mc10,s0.0001,t4)
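
The log above reports two reductions: `min_count=10` keeps only 598 of 5744 word types, and `sample=0.0001` then downsamples the remaining frequent words. The sketch below illustrates both steps with plain Python. The function names and the toy corpus are hypothetical, and the keep-probability formula is the word2vec-style rule as I understand gensim applies it, so treat it as an assumption rather than a reference.

```python
from collections import Counter
from math import sqrt

def prune_vocab(token_lists, min_count):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word: c for word, c in counts.items() if c >= min_count}

def keep_probability(count, retained_total, sample):
    """Chance that one occurrence of a word survives downsampling (assumed formula)."""
    threshold = sample * retained_total
    return min(1.0, (sqrt(count / threshold) + 1) * (threshold / count))

# Toy corpus, invented for illustration only
toy_docs = [["jakarta", "ketua", "jakarta"], ["jakarta", "ketua", "makar"]]
vocab = prune_vocab(toy_docs, min_count=2)
print(vocab)  # "makar" appears once and is dropped

# Totals taken from the log: 19884 retained words, sample=1e-4.
# Under the assumed formula, a word seen exactly 10 times keeps roughly 64% of its occurrences.
p = keep_probability(10, retained_total=19884, sample=1e-4)
print(round(p, 2))
```

With a corpus this small, every word that survives `min_count=10` falls above the sampling threshold, which is consistent with the log line saying all 598 words are downsampled.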


In [None]:
%time model.train(documents, total_examples=model.corpus_count, epochs=model.iter)

2017-05-07 12:39:28,734 : INFO : training model with 4 workers on 598 vocabulary and 200 features, using sg=1 hs=0 sample=0.0001 negative=5 window=10
2017-05-07 12:39:29,750 : INFO : PROGRESS: at 2.04% examples, 113827 words/s, in_qsize 8, out_qsize 0
2017-05-07 12:39:30,761 : INFO : PROGRESS: at 3.69% examples, 102839 words/s, in_qsize 8, out_qsize 0
2017-05-07 12:39:31,765 : INFO : PROGRESS: at 5.22% examples, 97023 words/s, in_qsize 7, out_qsize 0
2017-05-07 12:39:32,768 : INFO : PROGRESS: at 6.53% examples, 91136 words/s, in_qsize 7, out_qsize 0
2017-05-07 12:39:33,793 : INFO : PROGRESS: at 8.53% examples, 94670 words/s, in_qsize 8, out_qsize 0
2017-05-07 12:39:34,801 : INFO : PROGRESS: at 11.07% examples, 102307 words/s, in_qsize 7, out_qsize 1
2017-05-07 12:39:35,807 : INFO : PROGRESS: at 13.94% examples, 110434 words/s, in_qsize 8, out_qsize 1
2017-05-07 12:39:36,838 : INFO : PROGRESS: at 16.82% examples, 116267 words/s, in_qsize 8, out_qsize 0
2017-05-07 12:39:37,848 : INFO : P

In [None]:
print(str(model))
# rank the training documents closest to the document tagged with this title
pprint(model.docvecs.most_similar(positive=["5 Kesamaan Xiaomi Mi 6 dan iPhone 7 Plus"], topn=10))

In [None]:
# free-text query: infer a vector for unseen text, then rank the trained documents against it
text_search = '''Zidane Real Madrid'''
inferred_vector = model.infer_vector(text_search.lower().split())
model.docvecs.most_similar([inferred_vector], topn=10)
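
`most_similar` ranks documents by the cosine similarity between the query vector and each stored document vector. The following is a minimal stdlib-only sketch of that ranking; the function names and the three-dimensional toy vectors are invented for illustration, and the real model uses 200-dimensional vectors and a vectorized implementation.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, tagged_vectors, topn=10):
    """Return the topn (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query, vec)) for tag, vec in tagged_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors keyed by hypothetical document tags
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity([1.0, 0.1, 0.0], toy_vectors, topn=2)
print(ranking)
```

The result has the same shape as the output of `most_similar`: a list of `(tag, score)` pairs sorted from most to least similar.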