# <center>Word2Vec Tutorial</center>
<center>Human Language Processing Class 2017/2018</center>
<center>Lintang Adyuta Sutawika</center>

This tutorial covers word embeddings, in particular the word2vec model. Word2vec was introduced by Mikolov et al. in 2013. It is a machine learning model that can capture analogical relationships between words without being given explicit features. This is achieved by training the model on a very large amount of data, so that it learns the conditional probabilities of words given the words around them.

<img src="Images/CBOW_Skip-gram.png" style="width: 800px;">

Example training input and the output expected from the model:
<img src="Images/training_data.png">
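
As a rough text version of the training-data idea, the sketch below enumerates (center word, context word) pairs in the way a Skip-gram model consumes them. The function name `skipgram_pairs`, the window size, and the example sentence are assumptions made for illustration; they are not taken from the figures.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical example sentence, already tokenized.
sentence = ["raja", "memimpin", "kerajaan", "besar"]
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A CBOW model uses the same windows in the opposite direction: the context words are the input and the center word is the prediction target.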

<img src="Images/Word_pair_relationships.png" style="width: 1000px;"/>

In [9]:
import re
import gensim
import nltk

First, preprocess the tweets: remove usernames, hashtags, URLs, and emoji.

In [10]:
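def renameUser(corpus):
    # Remove @username mentions (the pattern allows up to 15 word characters).
    _new = []
    for _temp in corpus:
        # Put back the character captured before '@' (group 1) so that only
        # the mention is dropped, not the preceding space or punctuation.
        _temp = re.sub(r'(^|[^@\w])@(\w{1,15})\b', r'\1', _temp)
        _new.append(_temp)

    return _new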

def removeHashtag(corpus):
    _new = []
    for _temp in corpus:
        _temp = re.sub(r'#(\w+)', '', _temp)
        _new.append(_temp)

    return _new

def removeURL(corpus):
    _new = []
    for _temp in corpus:
        _temp = re.sub(r'http:\S+', '', _temp, flags=re.MULTILINE)
        _temp = re.sub(r'https:\S+', '', _temp, flags=re.MULTILINE)
        _new.append(_temp)

    return _new

def removeEmoticon(corpus):
    _new = []
    emoticons_str = r"(?:[:=;B\-][oO\"\_\-]?[\-D\)\]\(\]/\\Op3]{2,3})"
    for _temp in corpus:
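        # Remove ASCII emoticons matched by the pattern above.
        _temp = re.sub(emoticons_str, '', _temp)
        # Drop every non-ASCII character, which also removes emoji.
        _temp = re.sub(r'[^\x00-\x7F]', '', _temp)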
        _new.append(_temp)

    return _new
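
To see the cleaning steps on one concrete input, here is a minimal sketch that applies the same kinds of patterns to a single string, in the order URL, username, hashtag, non-ASCII. The helper name `clean_tweet`, the sample tweet, and the final whitespace collapsing are assumptions added for illustration; the username step keeps the character in front of the `@` so that neighbouring words do not get glued together.

```python
import re

def clean_tweet(text):
    """Strip URLs, @mentions, hashtags and non-ASCII characters from one tweet."""
    text = re.sub(r'https?:\S+', '', text)                  # URLs
    text = re.sub(r'(^|[^@\w])@(\w{1,15})\b', r'\1', text)  # usernames
    text = re.sub(r'#(\w+)', '', text)                      # hashtags
    text = re.sub(r'[^\x00-\x7F]', '', text)                # emoji and other non-ASCII
    return " ".join(text.split())                           # collapse leftover spaces

# Hypothetical sample tweet containing all four kinds of noise.
sample = "Halo @budi lihat https://t.co/abc #pagi \U0001F600"
cleaned = clean_tweet(sample)
print(cleaned)
```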

In [11]:
import csv

def getTweetData(filename="dataset/Indonesian_Tweets.tsv"):
    # Load a large corpus of tweets, one tweet per line.
    toFeed = []
    rawSentence = []
    # The 'U' mode flag was removed in Python 3.11; plain text mode
    # already reads with universal newlines.
    with open(filename, 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter='\n', quotechar='|')
        for spam in spamreader:
            rawSentence.append(spam)

    # Skip empty rows and keep the first field of every other row.
    corpusSentence = []
    for individualSentence in rawSentence:
        if individualSentence:
            corpusSentence.append(individualSentence[0])

    # Clean the raw text: URLs, usernames, hashtags, then emoticons.
    _temp = removeURL(corpusSentence)
    _temp = renameUser(_temp)
    _temp = removeHashtag(_temp)
    _temp = removeEmoticon(_temp)

    # Lowercase each tweet and split it into a list of tokens.
    for sentences in _temp:
        token = nltk.wordpunct_tokenize(sentences.lower())
        toFeed.append(token)

    return toFeed

In [15]:
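def getWord2Vec(toFeed, dim=50):
    # `size` is the gensim 3.x name for the embedding dimension;
    # gensim 4.0 and later call this parameter `vector_size`.
    return gensim.models.Word2Vec(toFeed, min_count=1, size=dim)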

Inspect the first 10 sentences of the processed dataset. (The output of getTweetData is a list of tokenized sentences.)

In [38]:
corpus = getTweetData()
model = getWord2Vec(corpus)


Show the vector for the word 'raja'. Use `*.wv['word']`.
<img src="Images/word2vec.png">

In [39]:
# Write your code here
model.wv['gelas']

array([-0.25514695, -0.15566237,  0.05310133,  0.10704625,  0.0473288 ,
       -0.10565176, -0.7485233 ,  0.46781865, -0.20523547, -0.05294118,
        0.48439175,  0.01012699,  0.07449104, -0.5000201 , -0.09068836,
        0.09776992, -0.28746814,  0.61485106, -0.12326465, -0.06156166,
        0.15334214,  0.20230545, -0.46344638,  0.45088896,  0.27353722,
        0.0915217 ,  0.13914753, -0.03433825,  0.4173159 , -0.27812117,
        0.3098348 ,  0.13063599, -0.07655799,  0.27501002,  0.3857099 ,
       -0.14225084,  0.08170392, -0.19950192, -0.03212536, -0.15038414,
       -0.16231695, -0.22847278, -0.5347864 , -0.31906983,  0.40229946,
        0.54658747,  0.21527736, -0.3359916 ,  0.51688844,  0.06668615],
      dtype=float32)

Implement a function that compares the similarity of two word vectors. Use the cosine similarity formula.
<img src="Images/Dot_Product.png">
<center><img src="Images/similarity.svg"></center>

In [5]:
# Write your code here
def similarity(arg):
    pass
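
One possible way to fill in the exercise is sketched below. Cosine similarity is the dot product of the two vectors divided by the product of their lengths. The name `cosine_similarity` and the zero-vector check are choices made for this sketch; it uses only the standard library and accepts any sequence of numbers, including the NumPy arrays returned by `*.wv[...]`.

```python
import math

def cosine_similarity(u, v):
    """Return cos(theta) between two equal-length vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score 1, orthogonal ones 0, opposite ones -1.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

With a trained model, a call such as `cosine_similarity(model.wv['raja'], model.wv['presiden'])` would compare two words, assuming both are in the vocabulary.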

How close is the word 'raja' to 'presiden', or to 'pemimpin'?

In [6]:
# Write your code here

In [114]:
model.wv.most_similar(positive=['makanan', 'minuman'], negative=['makan'])

[('bahan', 0.815974235534668),
 ('alat', 0.814616858959198),
 ('merkur', 0.8072130680084229),
 ('jenis', 0.7932530045509338),
 ('konsumsi', 0.7894991040229797),
 ('penghasilan', 0.788066029548645),
 ('identitas', 0.7792832851409912),
 ('ponsel', 0.7776226997375488),
 ('mesin', 0.7699506282806396),
 ('keuntungan', 0.7630599737167358)]

# Gensim Built-in Functions
<center><img src="Images/gensim_function.png"></center>

# Composition
<center><img src="Images/composition.png"></center>
What happens if the vector for 'presiden' is subtracted from the vector for 'indonesia' and the vector for 'gubernur' is then added?

In [37]:
# Use *.wv.most_similar(positive=[array], negative=[array])
model.wv.most_similar(positive=['indonesia','gubernur'], negative=['presiden'])

[('jakarta', 0.665157675743103),
 ('pilgub', 0.6471673250198364),
 ('dki', 0.6280163526535034),
 ('jawa', 0.6114922761917114),
 ('perkembngan', 0.5894886255264282),
 ('sespibi', 0.5783848762512207),
 ('cagub', 0.5771750807762146),
 ('cawagub', 0.5649923086166382),
 ('erikestrada', 0.556946337223053),
 ('cabup', 0.556396484375)]
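
The arithmetic behind this kind of query can be sketched without gensim: add the positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The three-dimensional vectors below are hand-made toy values, not trained embeddings, and `analogy` is a hypothetical helper. gensim's own implementation differs in details such as how vectors are normalized, so its scores will not match this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    used = set(positive) | set(negative)  # the input words are not candidates
    scored = [(w, cosine(query, vec)) for w, vec in vectors.items() if w not in used]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-made toy vectors (NOT trained): axes loosely mean national, regional, leader.
toy = {
    "indonesia": [1.0, 0.0, 0.0],
    "presiden":  [1.0, 0.0, 1.0],
    "gubernur":  [0.0, 1.0, 1.0],
    "jakarta":   [0.0, 1.0, 0.0],
    "makanan":   [0.5, 0.5, -1.0],
}
result = analogy(toy, positive=["indonesia", "gubernur"], negative=["presiden"])
print(result)
```

In this toy space, 'indonesia' - 'presiden' + 'gubernur' lands exactly on the vector chosen for 'jakarta', which is why it ranks first.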

<tr>
<center>
    <td> <center><img src="Images/sarapan.jpg" alt="Drawing" style="width: 250px;"/> </center></td>
    <td> <center><img src="Images/sendok.jpg" alt="Drawing" style="width: 250px;"/> </center></td>
    <td> <center><img src="Images/garpu.png" alt="Drawing" style="width: 250px;"/> </center></td>
    <td> <center><img src="Images/gelas.jpg" alt="Drawing" style="width: 250px;"/> </center></td>
</center>
</tr>

In [40]:
# Use *.wv.doesnt_match(array)
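
One plausible way to find the word that does not belong is to normalize each vector to unit length, average them, and pick the word whose vector is least similar to that average. The sketch below follows that idea with hand-made toy vectors (not trained embeddings); `odd_one_out` is a hypothetical helper, and gensim's `doesnt_match` may differ in its details.

```python
import math

def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(normed.values())))
    mean = [sum(vec[i] for vec in normed.values()) / len(normed) for i in range(dim)]
    # The dot product with the mean is the score; the lowest score is the outlier.
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Hand-made toy vectors (NOT trained): axis 0 is meal-like, axis 1 is utensil-like.
toy = {
    "sarapan": [1.0, 0.1],
    "sendok":  [0.1, 1.0],
    "garpu":   [0.0, 1.0],
    "gelas":   [0.2, 0.9],
}
outlier = odd_one_out(toy, ["sarapan", "sendok", "garpu", "gelas"])
print(outlier)
```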