In [1]:
%%time
import malaya

CPU times: user 11.4 s, sys: 1.39 s, total: 12.8 s
Wall time: 16.2 s


## What is word mover distance?

<img src="https://vene.ro/images/wmd-obama.png" width="40%" align="left">

Word mover distance (WMD) measures the distance between two documents in a meaningful way, even when they have no words in common. It uses vector embeddings of words, and it has been shown to outperform many state-of-the-art methods in k-nearest neighbors classification.

You can read more about word mover distance from [Word Distance between Word Embeddings](https://towardsdatascience.com/word-distance-between-word-embeddings-cc3e9cf1d632).

**Closer to 0 is better**: a smaller distance means the two documents are more similar.
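
To make the idea concrete, here is a minimal sketch of a *relaxed* word mover distance, where every word simply moves all of its weight to the nearest word in the other document. This relaxation is a lower bound on the exact distance, and it is not how `malaya.word_mover.distance` is implemented. The 2-d vectors in `TOY_VECTORS` are invented for illustration only.

```python
import math

# Toy 2-d embeddings, invented for illustration (NOT Malaya's vectors).
TOY_VECTORS = {
    'saya': (1.0, 0.0),
    'suka': (0.0, 1.0),
    'makan': (1.0, 1.0),
    'ayam': (2.0, 0.0),
    'ikan': (2.0, 1.0),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(source, target, vectors):
    """Each source token has equal weight and moves all of it
    to its nearest target token."""
    weight = 1.0 / len(source)
    return sum(
        weight * min(euclidean(vectors[s], vectors[t]) for t in target)
        for s in source
    )


def relaxed_wmd(left, right, vectors):
    """Maximum of the two one-sided costs, a lower bound on exact WMD."""
    return max(
        one_sided_cost(left, right, vectors),
        one_sided_cost(right, left, vectors),
    )


left_token = 'saya suka makan ayam'.split()
right_token = 'saya suka makan ikan'.split()

print(relaxed_wmd(left_token, left_token, TOY_VECTORS))   # identical documents
print(relaxed_wmd(left_token, right_token, TOY_VECTORS))  # one word differs
```

Identical documents give a distance of 0, and the distance grows as the differing words sit further apart in the embedding space.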

In [2]:
left_sentence = 'saya suka makan ayam'
right_sentence = 'saya suka makan ikan'
left_token = left_sentence.split()
right_token = right_sentence.split()

In [3]:
# load the pretrained wiki word2vec weights and dictionary
w2v_wiki = malaya.word2vec.load_wiki()
# wrap them in the word2vec interface
w2v_wiki = malaya.word2vec.word2vec(w2v_wiki['nce_weights'], w2v_wiki['dictionary'])

In [4]:
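# load the pretrained wiki fast-text weights, dictionary and n-grams
fasttext_wiki, ngrams = malaya.fast_text.load_wiki()
# wrap them in the fast-text interface
fasttext_wiki = malaya.fast_text.fast_text(fasttext_wiki['embed_weights'],
                                           fasttext_wiki['dictionary'], ngrams)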

## Using word2vec

In [5]:
malaya.word_mover.distance(left_token, right_token, w2v_wiki)

0.8225146532058716

In [6]:
malaya.word_mover.distance(left_token, left_token, w2v_wiki)

0.0

## Using fast-text

In [7]:
malaya.word_mover.distance(left_token, right_token, fasttext_wiki)

2.82466983795166

In [8]:
malaya.word_mover.distance(left_token, left_token, fasttext_wiki)

0.0

## Why word mover distance?

You may have heard of skip-thought or siamese networks for learning sentence similarity, but both require a good corpus and are really slow to train. Malaya provides both models so you can train your own text similarity; see [Malaya text-similarity](https://malaya.readthedocs.io/en/latest/Similarity.html).

`word2vec` and `fast-text` are really good at capturing semantic relationships between two words, as shown below:

In [9]:
w2v_wiki.n_closest(word = 'anwar', num_closest=8, metric='cosine')

[['zaid', 0.7285637855529785],
 ['khairy', 0.6839416027069092],
 ['zabidi', 0.6709405183792114],
 ['nizar', 0.6695379018783569],
 ['harussani', 0.6595045328140259],
 ['shahidan', 0.6565827131271362],
 ['azalina', 0.6541041135787964],
 ['shahrizat', 0.6538639068603516]]

The interface returns suggestions together with a cosine similarity score between 0 and 1; closer to 1 is better.
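
A nearest-word query like this can be sketched with plain cosine similarity. The snippet below is only an illustration: `n_closest_sketch` is a hypothetical helper, not Malaya's `n_closest`, and the vectors in `TOY_VECTORS` are made up.

```python
import math

# Toy 2-d embeddings, invented for illustration (NOT Malaya's vectors).
TOY_VECTORS = {
    'ayam': (1.0, 0.1),
    'ikan': (0.9, 0.2),
    'daging': (0.8, 0.4),
    'kereta': (0.1, 1.0),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def n_closest_sketch(word, vectors, num_closest=2):
    """Hypothetical helper: rank every other word by cosine similarity."""
    query = vectors[word]
    scored = [
        [other, cosine_similarity(query, vec)]
        for other, vec in vectors.items()
        if other != word
    ]
    # highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_closest]


closest = n_closest_sketch('ayam', TOY_VECTORS)
print(closest)
```

With these toy vectors, `ikan` and `daging` point in almost the same direction as `ayam`, so they rank above `kereta`.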

Now let's say I want to compare the similarity between two sentences, using the vector representations from our word2vec and fast-text.

I have `rakyat sebenarnya sukakan mahathir` and `rakyat sebenarnya sukakan najib`.

In [10]:
mahathir = 'rakyat sebenarnya sukakan mahathir'
najib = 'rakyat sebenarnya sukakan najib'
malaya.word_mover.distance(mahathir.split(), najib.split(), w2v_wiki)

0.9017602205276489

0.9, quite good. What happens if we make the polarity of the sentence somewhat ambiguous for najib? (Again, this is just an example.)

In [11]:
mahathir = 'rakyat sebenarnya sukakan mahathir'
najib = 'rakyat sebenarnya gilakan najib'
malaya.word_mover.distance(mahathir.split(), najib.split(), w2v_wiki)

1.7690724730491638

We only replaced `sukakan` with `gilakan`, but our word2vec representation, given the pattern `rakyat sebenarnya <word> <person>`, is not able to recognise that both words share the same polarity. The real meaning of `gilakan` has positive polarity, but word2vec learnt `gilakan` as negative or negating.

## Soft mode

What happens if a word is not inside the vectorizer dictionary? `malaya.word_mover.distance` will throw an exception.

In [13]:
left = 'tyi'
right = 'qwe'
malaya.word_mover.distance(left.split(), right.split(), w2v_wiki)

Exception: input not found in dictionary, here top-5 nearest words [qw, qe, we, qwest, qwabe]

If you set `soft = True` and a word is not inside the vectorizer dictionary, it will find the nearest word instead.

In [14]:
left = 'tyi'
right = 'qwe'
malaya.word_mover.distance(left.split(), right.split(), w2v_wiki, soft = True)

1.273216962814331
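
The fallback idea behind soft mode can be sketched as follows. How Malaya picks the nearest word is not stated here, so this sketch assumes plain string similarity from the standard library `difflib`; `soft_lookup` and `TOY_VOCAB` are hypothetical names used only for illustration.

```python
import difflib

# Toy vocabulary, invented for illustration.
TOY_VOCAB = ['makan', 'minum', 'ayam', 'ikan', 'suka']


def soft_lookup(word, vocab):
    """Return the word itself if it is in the vocabulary,
    otherwise the most similar-looking in-vocabulary word.

    Assumption: similarity is difflib's character-based ratio,
    which may differ from what Malaya does internally.
    """
    if word in vocab:
        return word
    # cutoff=0.0 keeps every candidate, so the best match is always returned
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=0.0)
    if not matches:
        raise ValueError('vocabulary is empty')
    return matches[0]


print(soft_lookup('makan', TOY_VOCAB))  # already in the vocabulary
print(soft_lookup('mkan', TOY_VOCAB))   # falls back to the nearest word
```

A fallback like this always returns some word, so a distance computed in soft mode is only as trustworthy as the substitutions it made.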

## Load expander

We want to expand shortforms based on `malaya.normalize.spell` by using word mover distance. If our vectors know that `mkn` is semantically similar to `makan` in the sentence `saya suka mkn ayam`, the word mover distance will be smaller.

This really depends on our vectors, and word2vec may not be able to understand shortforms, so we will use fast-text to work around the `OUT-OF-VOCAB` problem.
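
The ranking idea can be sketched like this: replace the shortform with each candidate word, score the resulting sentence against the original, and sort by distance. This is only an illustration under stated assumptions. `rank_candidates` and `toy_distance` are hypothetical, the vectors are made up, and the distance is a simple nearest-word stand-in rather than the real word mover distance. How Malaya's expander builds and scores its candidates internally is not shown here.

```python
import math

# Toy 2-d embeddings, invented for illustration (NOT Malaya's vectors).
# 'mkn' is deliberately placed close to 'makan'.
TOY_VECTORS = {
    'saya': (1.0, 0.0),
    'suka': (0.0, 1.0),
    'mkn': (1.0, 1.1),
    'makan': (1.0, 1.0),
    'main': (3.0, 3.0),
    'ayam': (2.0, 0.0),
}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def toy_distance(left, right, vectors):
    """Stand-in distance: average distance from each left token
    to its nearest right token (not the exact WMD)."""
    return sum(
        min(euclidean(vectors[l], vectors[r]) for r in right) for l in left
    ) / len(left)


def rank_candidates(sentence, shortform, candidates, vectors):
    """Hypothetical helper: return (shortform, candidate sentence, distance)
    tuples sorted from the smallest distance to the largest."""
    tokens = sentence.split()
    results = []
    for candidate in candidates:
        replaced = [candidate if t == shortform else t for t in tokens]
        distance = toy_distance(tokens, replaced, vectors)
        results.append((shortform, ' '.join(replaced), distance))
    return sorted(results, key=lambda item: item[2])


ranked = rank_candidates('saya suka mkn ayam', 'mkn', ['main', 'makan'], TOY_VECTORS)
for row in ranked:
    print(row)
```

Because the toy vector for `mkn` sits next to `makan`, the sentence with `makan` gets the smallest distance and is ranked first.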

In [19]:
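# load the Malay word list used to generate candidate expansions
malays = malaya.load_malay_dictionary()
# load the pretrained wiki fast-text weights, dictionary and n-grams
wiki, ngrams = malaya.fast_text.load_wiki()
fast_text_embed = malaya.fast_text.fast_text(wiki['embed_weights'], wiki['dictionary'], ngrams)
# build the expander on top of the dictionary and the fast-text vectors
expander = malaya.word_mover.expander(malays, fast_text_embed)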

In [16]:
string = 'y u xsuka makan HUSEIN kt situ tmpt'
another = 'i mmg xska mknn HUSEIN kampng tempt'

In [17]:
expander.expand(string)

[[('tmpt',
   'kenapa awak tak suka makan Husein kat situ tut',
   0.8088938253521919),
  ('tmpt',
   'kenapa awak tak suka makan Husein kat situ tuit',
   0.863929785296917),
  ('tmpt',
   'kenapa awak tak suka makan Husein kat situ tat',
   0.8680638003787995),
  ('tmpt',
   'kenapa awak tak suka makan Husein kat situ top',
   0.8688952446055412),
  ('tmpt',
   'kenapa awak tak suka makan Husein kat situ tip',
   0.8978437346220016),
  ('tmpt',
   'kenapa awak tak suka makan Husein kat situ taat',
   0.936883625289917),
  ('tmpt',
   'kenapa awak tak suka makan Husein kat situ topi',
   0.9442774548711776),
  ('tmpt',
   'kenapa awak tak suka makan Husein kat situ tumit',
   0.9495834815340042),
  ('tmpt',
   'kenapa awak tak suka makan Husein kat situ tempe',
   0.9758907731723786),
  ('tmpt',
   'kenapa awak tak suka makan Husein kat situ ampe',
   0.9821926467533112),
  ('tmpt',
   'kenapa awak tak suka makan Husein kat situ tempo',
   0.9836614096956253),
  ('tmpt',
   'kenapa aw

In [18]:
expander.expand(another)

[[('ska', 'saya memang tak soka mknn Husein kampng tempt', 0.7199365496635437),
  ('ska', 'saya memang tak suka mknn Husein kampng tempt', 0.8050327301025391),
  ('ska', 'saya memang tak sika mknn Husein kampng tempt', 0.8729341626167297),
  ('ska', 'saya memang tak saka mknn Husein kampng tempt', 0.875930666923523),
  ('ska', 'saya memang tak spa mknn Husein kampng tempt', 0.8995948433876038),
  ('ska', 'saya memang tak sua mknn Husein kampng tempt', 0.9496822357177734),
  ('ska', 'saya memang tak seka mknn Husein kampng tempt', 0.9891390204429626),
  ('ska', 'saya memang tak ski mknn Husein kampng tempt', 1.1318669319152832),
  ('ska', 'saya memang tak sia mknn Husein kampng tempt', 1.1666431427001953)],
 [('mknn', 'saya memang tak ska min Husein kampng tempt', 0.8653836846351624),
  ('mknn', 'saya memang tak ska maun Husein kampng tempt', 1.045318603515625),
  ('mknn', 'saya memang tak ska kun Husein kampng tempt', 1.0710314512252808),
  ('mknn', 'saya memang tak ska ken Husein kamp