# Gensim - Doc2vec for Content Similarity
Vector-based content similarity is a simple way to find the articles most similar to a given article.
In this example I use data scraped from Google News Indonesia, only 77 documents in total.

The module used here is gensim.

## Requirement
- Gensim 2.0

## Simple Code

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim
from pprint import pprint
import multiprocessing
import os

2017-05-07 12:39:28,206 : INFO : 'pattern' package found; tag filters are available for English


In [3]:
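import io

dirname = 'google_news'
documents_file = os.listdir(dirname)

documents = []

for fname in documents_file:
    # io.open decodes UTF-8 on both Python 2 and 3; the with block closes the file
    with io.open(os.path.join(dirname, fname), encoding='utf-8') as f:
        content = f.read()
    # the file name without its .txt extension becomes the document tag
    title = fname.replace('.txt', '')
    documents.append(TaggedDocument(gensim.utils.simple_preprocess(content, max_len=30), [title]))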

In [4]:
pprint(documents[0].tags)

["'Saya Turun di Lapangan dan tak Melihat Ada Upaya Makar'"]


In [5]:
pprint(documents[0].words[:10])

[u'republika',
 u'co',
 u'id',
 u'jakarta',
 u'ketua',
 u'setara',
 u'institute',
 u'hendardi',
 u'yang',
 u'menyebut']
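
The token list above shows what `simple_preprocess` does to raw text: it lowercases, splits on anything that is not a letter, and drops tokens outside the length limits. The sketch below is only a rough approximation written for illustration, not gensim's own code; the function name `approx_simple_preprocess` and the sample string are made up.

```python
import re

# Runs of word characters that are not digits, roughly what gensim's tokenizer matches.
TOKEN_PATTERN = re.compile(r"[^\W\d]+", re.UNICODE)


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (an approximation only)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # Keep tokens within the length limits and skip ones starting with an underscore.
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]


# Made-up sample text, not taken from the scraped corpus.
sample = "REPUBLIKA.CO.ID, JAKARTA: 5 Kesamaan Xiaomi Mi 6"
print(approx_simple_preprocess(sample, max_len=30))
# → ['republika', 'co', 'id', 'jakarta', 'kesamaan', 'xiaomi', 'mi']
```

Digits and punctuation disappear entirely, which is why a domain such as `republika.co.id` ends up as three separate tokens.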


In [11]:
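cores = multiprocessing.cpu_count()
# dm=0 selects PV-DBOW; dbow_words=1 also trains skip-gram word vectors alongside the document vectors
# size is the vector dimensionality and iter the number of training epochs (Gensim 2.0 parameter names)
model = Doc2Vec(dm=0, dbow_words=1, size=200, window=10, min_count=10, iter=200, workers=cores, sample=1e-4, negative=5)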

In [12]:
model.build_vocab(documents,update=False)
print(str(model))

2017-05-07 12:40:26,087 : INFO : collecting all words and their counts
2017-05-07 12:40:26,089 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-05-07 12:40:26,113 : INFO : collected 5744 word types and 119 unique tags from a corpus of 119 examples and 31481 words
2017-05-07 12:40:26,114 : INFO : Loading a fresh vocabulary
2017-05-07 12:40:26,125 : INFO : min_count=10 retains 598 unique words (10% of original 5744, drops 5146)
2017-05-07 12:40:26,127 : INFO : min_count=10 leaves 19884 word corpus (63% of original 31481, drops 11597)
2017-05-07 12:40:26,134 : INFO : deleting the raw counts dictionary of 5744 items
2017-05-07 12:40:26,136 : INFO : sample=0.0001 downsamples 598 most-common words
2017-05-07 12:40:26,138 : INFO : downsampling leaves estimated 5476 word corpus (27.5% of prior 19884)
2017-05-07 12:40:26,140 : INFO : estimated required memory for 598 words and 200 dimensions: 1374800 bytes
2017-05-07 12:40:26,145 : INFO : resetting layer weig

Doc2Vec(dbow+w,d200,n5,w10,mc10,s0.0001,t4)
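
The log above reports that `min_count=10` retains only 598 of the 5744 word types. The idea behind that pruning step can be sketched with a plain counter. This is a simplified illustration under stated assumptions, not gensim's implementation; `prune_vocab` and the toy corpus are hypothetical.

```python
from collections import Counter


def prune_vocab(tokenized_docs, min_count):
    """Count tokens over the corpus and keep those seen at least min_count times."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    kept = {tok: n for tok, n in counts.items() if n >= min_count}
    return counts, kept


# Tiny made-up corpus, only for illustration.
toy_docs = [
    ["xiaomi", "mi", "iphone", "plus"],
    ["xiaomi", "iphone", "harga"],
    ["iphone", "termahal", "termurah"],
]
counts, kept = prune_vocab(toy_docs, min_count=2)
print(len(counts), "word types,", len(kept), "kept:", sorted(kept))
# → 7 word types, 2 kept: ['iphone', 'xiaomi']
```

On a corpus this small, a high `min_count` discards most of the vocabulary, so a lower threshold may be worth trying.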


In [13]:
%time model.train(documents, total_examples=model.corpus_count, epochs=model.iter)

2017-05-07 12:40:27,032 : INFO : training model with 4 workers on 598 vocabulary and 200 features, using sg=1 hs=0 sample=0.0001 negative=5 window=10
2017-05-07 12:40:28,038 : INFO : PROGRESS: at 12.84% examples, 143221 words/s, in_qsize 8, out_qsize 0
2017-05-07 12:40:29,048 : INFO : PROGRESS: at 25.02% examples, 139411 words/s, in_qsize 8, out_qsize 0
2017-05-07 12:40:30,054 : INFO : PROGRESS: at 38.92% examples, 144447 words/s, in_qsize 7, out_qsize 0
2017-05-07 12:40:31,061 : INFO : PROGRESS: at 52.20% examples, 145215 words/s, in_qsize 7, out_qsize 0
2017-05-07 12:40:32,088 : INFO : PROGRESS: at 65.81% examples, 145842 words/s, in_qsize 7, out_qsize 0
2017-05-07 12:40:33,095 : INFO : PROGRESS: at 78.45% examples, 144921 words/s, in_qsize 7, out_qsize 0
2017-05-07 12:40:34,112 : INFO : PROGRESS: at 90.81% examples, 143672 words/s, in_qsize 8, out_qsize 1
2017-05-07 12:40:34,801 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-07 12:40:34,811 : INFO : worke

CPU times: user 23.8 s, sys: 532 ms, total: 24.4 s
Wall time: 7.8 s


1119709

In [14]:
print(str(model))
pprint(model.docvecs.most_similar(positive=["5 Kesamaan Xiaomi Mi 6 dan iPhone 7 Plus"], topn=10))

2017-05-07 12:40:36,307 : INFO : precomputing L2-norms of doc weight vectors


Doc2Vec(dbow+w,d200,n5,w10,mc10,s0.0001,t4)
[('Ini Alasan Xiaomi Mi 6 Tiru iPhone 7', 0.6714545488357544),
 ('Mengapa Xiaomi Mi 6 Harus Mengikuti Langkah iPhone 7? - Kompas.com',
  0.5950295925140381),
 ('iPad Masih Jadi Rajanya Tablet', 0.4455420970916748),
 ('Negara Mana yang Jual iPhone 7 Termahal dan Termurah?',
  0.39016592502593994),
 ('Ayu Ting Ting Pasrah Hadapi Terpaan Gosip Nikah Siri', 0.38525086641311646),
 ('Wajah Baru Rio, Jadi Nafas Segar KIA Jualan di Indonesia',
  0.35606831312179565),
 ('Gempar Hujan Salju di Jakarta, PT MRT Minta Maaf', 0.3287229537963867),
 ('Cuplikan Video Barcelona Hajar Villarreal', 0.32460686564445496),
 ('Nih 5 Penyebab Utama Anda Sakit Gigi', 0.32280516624450684),
 ('Marquez Tak Remehkan Rossi dan Vinales', 0.3135116994380951)]
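
The scores returned by `most_similar` are cosine similarities between document vectors, which is why the log mentions precomputing L2-norms. A minimal sketch of that ranking in plain Python follows, assuming a dictionary that maps each tag to its vector; `rank_by_cosine` and the two-dimensional toy vectors are hypothetical and far smaller than the 200-dimensional vectors trained above.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_tag, doc_vectors, topn=10):
    """Return the topn (tag, score) pairs most similar to query_tag, best first."""
    query = doc_vectors[query_tag]
    scores = [(tag, cosine(query, vec))
              for tag, vec in doc_vectors.items() if tag != query_tag]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, only for illustration.
toy_vectors = {
    "phone_a": [1.0, 0.0],
    "phone_b": [0.8, 0.6],
    "football": [0.0, 1.0],
}
for tag, score in rank_by_cosine("phone_a", toy_vectors, topn=2):
    print(tag, round(score, 3))
```

Here `phone_b` ranks first because its vector points in nearly the same direction as `phone_a`, while `football` is orthogonal and scores zero.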


In [15]:
text_search = '''Zidane Real Madrid'''
# tokenize the query the same way as the training documents
inferred_vector = model.infer_vector(gensim.utils.simple_preprocess(text_search, max_len=30))
model.docvecs.most_similar([inferred_vector], topn=10)

[('Zidane Puas dengan Skuat B Real Madrid', 0.5102989077568054),
 ('Kontra BFC, Hanafi Tak Ingin Hasil Lawan Persib Terulang Lagi',
  0.4575687646865845),
 ('Cuplikan Video Barcelona Hajar Villarreal', 0.4462753236293793),
 ('5 Tips Agar Suami Lebih Menikmati Seks Oral', 0.44335606694221497),
 ('Pujian untuk Neymar, Liukannya Seperti Penari Balet', 0.4431913495063782),
 ('Morata Lebih Tajam daripada Benzema, Zidane!', 0.42689239978790283),
 ('Kondisi Julia Perez Masih Naik Turun', 0.4225735366344452),
 ('Eko Bernazar Jalan Kaki Madiun-Jakarta untuk Anies-Sandi, Prabowo: Luar Biasa Dia!',
  0.4221903085708618),
 ('Kabar Salju di Jakarta Cuma Hoax, BMKG: Itu Busa Tumpah',
  0.42209890484809875),
 ('Alvaro Morata Berharap Bisa Lebih Sering Bermain Di Real Madrid',
  0.4204276502132416)]