# Gensim - Doc2vec for Content Similarity
Comparing document vectors is a simple way to measure how similar two articles are.
In this case I use data scraped from Google News Indonesia, only 77 documents in total.

The module used here is gensim.
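
The scores returned by `most_similar` later in this notebook are cosine similarities between vectors. As a minimal sketch of that measure, assuming made-up three-dimensional vectors (the names `cosine_similarity`, `doc_phone_a`, `doc_phone_b` and `doc_football` are illustrative and do not come from the dataset):

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "document vectors", invented for illustration only
doc_phone_a = [0.9, 0.1, 0.0]
doc_phone_b = [0.8, 0.2, 0.1]
doc_football = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(doc_phone_a, doc_phone_b)
sim_unrelated = cosine_similarity(doc_phone_a, doc_football)

# Vectors pointing in a similar direction score closer to 1
print(sim_related > sim_unrelated)
```

A score near 1 means the two vectors point in almost the same direction, while a score near 0 means they are close to orthogonal.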

## Requirements
- Gensim 2.0

## Simple Code

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim
from pprint import pprint
import multiprocessing
import os

2017-05-07 11:55:48,657 : INFO : 'pattern' package found; tag filters are available for English


In [3]:
dirname = 'google_news'
documents_file = os.listdir(dirname)

documents = []

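for fname in documents_file:
    # Python 2: read the raw bytes and decode to unicode; `with` closes the file
    with open(os.path.join(dirname, fname), 'rU') as f:
        content = f.read().decode('utf-8')
    # Use the file name (without the extension) as the document tag
    title = fname.replace('.txt', '')
    documents.append(TaggedDocument(gensim.utils.simple_preprocess(content, max_len=30), [title]))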

In [4]:
pprint(documents[:1][0].tags)

['5 Kesamaan Xiaomi Mi 6 dan iPhone 7 Plus']


In [5]:
pprint(documents[:1][0].words[:10])

[u'datang',
 u'dengan',
 u'dapur',
 u'pacu',
 u'super',
 u'tangguh',
 u'xiaomi',
 u'mi',
 u'berhasilmenyabet',
 u'gelar']


In [6]:
cores = multiprocessing.cpu_count()
model = Doc2Vec(dm=0, dbow_words=1, size=200, window=10, min_count=10, iter=1000, workers=cores, sample=1e-4, negative=5)

In [7]:
model.build_vocab(documents,update=False)
print(str(model))

2017-05-07 11:55:57,457 : INFO : collecting all words and their counts
2017-05-07 11:55:57,459 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-05-07 11:55:57,473 : INFO : collected 4513 word types and 74 unique tags from a corpus of 74 examples and 19417 words
2017-05-07 11:55:57,475 : INFO : Loading a fresh vocabulary
2017-05-07 11:55:57,483 : INFO : min_count=10 retains 349 unique words (7% of original 4513, drops 4164)
2017-05-07 11:55:57,485 : INFO : min_count=10 leaves 10393 word corpus (53% of original 19417, drops 9024)
2017-05-07 11:55:57,490 : INFO : deleting the raw counts dictionary of 4513 items
2017-05-07 11:55:57,493 : INFO : sample=0.0001 downsamples 349 most-common words
2017-05-07 11:55:57,494 : INFO : downsampling leaves estimated 2097 word corpus (20.2% of prior 10393)
2017-05-07 11:55:57,496 : INFO : estimated required memory for 349 words and 200 dimensions: 806900 bytes
2017-05-07 11:55:57,500 : INFO : resetting layer weights


Doc2Vec(dbow+w,d200,n5,w10,mc10,s0.0001,t4)
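
The log above reports that `min_count=10` retains only 349 of the 4513 word types. A rough sketch of what that kind of pruning does follows; it is not gensim's actual implementation, and `prune_vocab` and the toy documents are hypothetical:

```python
from collections import Counter

def prune_vocab(tokenized_docs, min_count):
    """Keep only words seen at least min_count times across all documents."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy corpus, invented for illustration only
toy_docs = [
    ['xiaomi', 'mi', 'iphone', 'plus'],
    ['xiaomi', 'mi', 'kamera'],
    ['xiaomi', 'iphone'],
]

vocab = prune_vocab(toy_docs, min_count=2)
print(sorted(vocab))  # ['iphone', 'mi', 'xiaomi']
```

With a corpus this small, a high `min_count` discards most of the vocabulary, so it may be worth trying a lower value and comparing the results.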


In [8]:
%time model.train(documents, total_examples=model.corpus_count, epochs=model.iter)

2017-05-07 11:55:58,995 : INFO : training model with 4 workers on 349 vocabulary and 200 features, using sg=1 hs=0 sample=0.0001 negative=5 window=10
2017-05-07 11:56:00,004 : INFO : PROGRESS: at 4.01% examples, 86961 words/s, in_qsize 7, out_qsize 0
2017-05-07 11:56:01,013 : INFO : PROGRESS: at 9.66% examples, 104470 words/s, in_qsize 7, out_qsize 0
2017-05-07 11:56:02,023 : INFO : PROGRESS: at 16.56% examples, 119093 words/s, in_qsize 7, out_qsize 0
2017-05-07 11:56:03,033 : INFO : PROGRESS: at 23.16% examples, 124773 words/s, in_qsize 7, out_qsize 0
2017-05-07 11:56:04,041 : INFO : PROGRESS: at 29.26% examples, 126045 words/s, in_qsize 7, out_qsize 0
2017-05-07 11:56:05,043 : INFO : PROGRESS: at 35.91% examples, 129072 words/s, in_qsize 8, out_qsize 0
2017-05-07 11:56:06,048 : INFO : PROGRESS: at 42.56% examples, 131116 words/s, in_qsize 7, out_qsize 0
2017-05-07 11:56:07,053 : INFO : PROGRESS: at 48.91% examples, 131876 words/s, in_qsize 7, out_qsize 0
2017-05-07 11:56:08,058 : INF

CPU times: user 46 s, sys: 2.14 s, total: 48.1 s
Wall time: 16.3 s


2171063

In [9]:
print(str(model))
pprint(model.docvecs.most_similar(positive=["5 Kesamaan Xiaomi Mi 6 dan iPhone 7 Plus"], topn=10))

2017-05-07 11:56:16,731 : INFO : precomputing L2-norms of doc weight vectors


Doc2Vec(dbow+w,d200,n5,w10,mc10,s0.0001,t4)
[('Ini Alasan Xiaomi Mi 6 Tiru iPhone 7', 0.4435577392578125),
 ('Mengapa Xiaomi Mi 6 Harus Mengikuti Langkah iPhone 7? - Kompas.com',
  0.3622362017631531),
 ('Harga Samsung Galaxy Express Prime 2 dan Spesifikasi, Smartphone LTE Android Nougat Sejutaan',
  0.2899801731109619),
 ('Tiga Kota Besar Indonesia Akhirnya Resmi Menjual Sosok Smartphone Flagship Samsung Galaxy S8 dan Samsung Galaxy S8 Plus',
  0.2826429009437561),
 ('Kebisingan "Treble Winner" di Balik Kegagalan Juventus - Kompas.com',
  0.245576411485672),
 ('Ismail Haniya Terpilih Sebagai Pemimpin Hamas', 0.2441180795431137),
 ('Sekilas Berkenalan dengan Oppo F3', 0.23804400861263275),
 ('Ini Alasan Real Madrid Simpan Ronaldo Lawan Granada', 0.23543459177017212),
 ('Stephen Hawking Desak Manusia Pindah ke Planet Lain', 0.23176153004169464),
 ('Kecelakaan, Job Sheila Marcia Banyak Yang Dibatalkan | M.Kapanlagi.com',
  0.22895048558712006)]


In [10]:
text_search = '''Zidane Real Madrid'''
inferred_vector = model.infer_vector(text_search.lower().split())
model.docvecs.most_similar([inferred_vector], topn=10)

[('Rekaman Menegangkan Saat Turbulensi di AirAsia Tujuan Kuala Lumpur, Ada Tangisan Hingga Teriakan - Pos Kupang',
  0.16304472088813782),
 ('Dibully, Curhatan Putri Bungsu Marissa Haque Bikin Nangis - VIVA.co.id',
  0.14578871428966522),
 ('Ponsel Berkamera adalah Kunci di Abad Selfie', 0.1449335366487503),
 ('Ini Penjelasan Bank Mandiri Soal Gangguan Online Banking',
  0.13119998574256897),
 ('Diculik Semenjak 2014 oleh Boko Haram, 82 Perempuan Chibok Dibebaskan',
  0.11745390295982361),
 ('Ini Kata Peserta "Gadget Story" Setelah Menjajal Galaxy S8 - Kompas.com',
  0.11424076557159424),
 ('Penis Anda Tidak Ingin Patah? Hindari Posisi Seks Ini-',
  0.10996179282665253),
 ('Stephen Hawking Desak Manusia Pindah ke Planet Lain', 0.10680105537176132),
 ('Massa Aksi Simpatik 55 Mulai Bubar Diri dari Masjid Istiqlal',
  0.081327423453331),
 ('Seri Galaxy C Bakal Jadi Smartphone Kamera Ganda Samsung?',
  0.07210025191307068)]