# Word Embedding Techniques (word2vec)

"A word is known by the company it keeps." (after J. R. Firth)

## What is word2vec?

Word2vec is **not** a single algorithm.

It is a software package for representing words as vectors. It contains two distinct models:
* Skip-gram
* CBOW (continuous bag of words)

Given a window of n words around a word w:
- the **skip-gram** model predicts the neighboring words given the current word w;
- the **CBOW** model, in contrast, predicts the current word w given the neighboring words in the window.
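
To make the difference concrete, here is a minimal sketch of the training examples each model would see. It assumes that `window` counts words on each side of the current word. The function names `skipgram_pairs` and `cbow_pairs` are illustrative only and are not part of word2vec or gensim, and real implementations add details this sketch leaves out.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs: the center word predicts each neighbor."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cbow_pairs(tokens, window):
    """Yield (context, center) pairs: the neighbors jointly predict the center."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            yield context, center


tokens = ["the", "cat", "sat", "down"]  # hypothetical toy sentence
sg = list(skipgram_pairs(tokens, window=1))
cb = list(cbow_pairs(tokens, window=1))
print(sg)  # one (input, target) pair per neighbor
print(cb)  # one (inputs, target) pair per position
```

Skip-gram produces one example per neighbor, while CBOW produces one example per position with all neighbors as input.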


### Import the necessary libraries

In [1]:
import re
import gensim
from gensim.models import Word2Vec
import os
from nltk.tokenize import word_tokenize
from gensim.models.keyedvectors import KeyedVectors



### Preprocessing methods

In [None]:
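def preprocessing(string):
    # Replace every character outside the Arabic Unicode block (U+0600 to U+06FF) with a space.
    string = re.sub(r'[^\u0600-\u06FF]', ' ', string)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s{2,}", " ", string).strip()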
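
A quick check of what this function keeps and drops, on made-up input strings. Note that the Arabic comma (U+060C) lies inside the kept range, so punctuation of that kind stays attached to the word.

```python
import re


def preprocessing(string):
    # Replace every character outside the Arabic Unicode block with a space.
    string = re.sub(r'[^\u0600-\u06FF]', ' ', string)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s{2,}", " ", string).strip()


# Latin letters, ASCII digits and ASCII punctuation are removed.
cleaned = preprocessing("Hello مرحبا  123 بكم!")
print(cleaned)

# The Arabic comma is inside U+0600 to U+06FF, so it is not stripped.
with_comma = preprocessing("فيها، test")
print(with_comma)
```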

### Helper class for memory-efficient reading

In [8]:
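class Sentences(object):
    """Stream the corpus line by line so it never has to fit in memory."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            path = os.path.join(self.dirname, fname)
            print(path)  # log which file is being read
            # The with-block closes each file once it has been consumed.
            with open(path, encoding='utf8') as f:
                for line in f:
                    yield word_tokenize(line)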
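
The reason for a class with `__iter__`, rather than a plain generator, is that training reads the corpus more than once, so the object must be able to start over. The sketch below shows this on a throwaway directory. The file names and contents are invented, and `str.split` stands in for `word_tokenize` so that the example runs without NLTK data.

```python
import os
import tempfile


class Sentences(object):
    """Stream tokenized lines from every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # sorted() only makes the order predictable for this demo.
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname), encoding='utf8') as f:
                for line in f:
                    # Assumption: whitespace split replaces word_tokenize here.
                    yield line.split()


with tempfile.TemporaryDirectory() as corpus_dir:
    for name, text in [("a.txt", "one two\nthree four\n"), ("b.txt", "five six\n")]:
        with open(os.path.join(corpus_dir, name), "w", encoding="utf8") as f:
            f.write(text)

    sentences = Sentences(corpus_dir)
    first_pass = list(sentences)
    second_pass = list(sentences)  # a new iterator, so the files are read again

print(first_pass)
print(first_pass == second_pass)
```

A generator object would be exhausted after the first pass and the second list would be empty.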

### Create and train the word2vec model

In [16]:
sentences = Sentences('./corpus')

In [None]:
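# sg=1 selects skip-gram. On gensim >= 4.0, use vector_size= and epochs= instead of size= and iter=.
model = Word2Vec(sentences, size=100, min_count=10, workers=4, iter=10, sg=1)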

### Save and load the model

In [8]:
model.wv.save('./saved_models/wiki_corpus_word2vec_size100_skip_gram_wv')

In [14]:
word2vec = KeyedVectors.load('./saved_models/wiki_corpus_word2vec_size100_skip_gram_wv')

## Using the trained model

### Vocabulary

In [None]:
# gensim < 4.0 attribute name; gensim 4.x renamed it to index_to_key.
word2vec.index2word

### Vector of a word

In [None]:
word2vec['فيها']

### Most similar

In [14]:
word2vec.most_similar('فيها')

[('بها', 0.720863401889801),
 ('فيه', 0.7065894603729248),
 ('به', 0.5436509251594543),
 ('فيهم', 0.5159602165222168),
 ('فيها،', 0.5068670511245728),
 ('بها،', 0.49020761251449585),
 ('يكفي', 0.41754448413848877),
 ('ذلك', 0.4129946827888489),
 ('لها', 0.4123121500015259),
 ('الكفاية', 0.3986240029335022)]
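
The scores above are cosine similarities between the query vector and every other word vector. Here is a minimal sketch of that ranking in plain Python. The three-dimensional `toy_vectors` and the local `most_similar` function are invented for illustration and are not the gensim implementation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy embeddings, not taken from the trained model.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}
ranking = most_similar(toy_vectors, "king")
print(ranking)
```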

In [19]:
word2vec.most_similar(positive=['ذلك', 'لها'], negative=['يكفي'])

[('الحاضر', 0.43474435806274414),
 ('له', 0.41532206535339355),
 ('الحالي', 0.395750492811203),
 ('ذاك', 0.3909185528755188),
 ('ولادته', 0.3784900903701782),
 ('يحكمها', 0.37023404240608215),
 ('البداية', 0.36946800351142883),
 ('نفس', 0.3639281690120697),
 ('اسمها', 0.3634480834007263),
 ('ذلك،', 0.36070728302001953)]
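
With `positive` and `negative` lists, the query becomes a combination of vectors: the positive words are added, the negative words are subtracted, and the result is ranked by cosine similarity, leaving out the input words. The sketch below illustrates the idea on invented three-dimensional vectors. It is an approximation of the behavior for illustration, not a copy of the gensim code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:   # add the unit vectors of the positive words
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:   # subtract the unit vectors of the negative words
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)
    scored = [(word, sum(q * x for q, x in zip(query, unit(vec))))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy embeddings; dimensions read as (royal, male, female).
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}
result = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(result)
```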

### Similarity between two words

In [None]:
word2vec.similarity('على','مع')

### doesnt_match

In [20]:
word2vec.doesnt_match(['نفس', 'ذلك', 'ولادته', 'البداية'])

'نفس'
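
One way to find the odd one out is to average the unit vectors of the given words and return the word whose vector is least aligned with that average. The sketch below follows that idea on invented two-dimensional vectors. The local `doesnt_match` function is an illustration under that assumption, not the gensim source.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(vectors, words):
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    # Mean direction of the group, rescaled to unit length.
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    # Dot product with the mean: lower means further from the group.
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]


# Hypothetical toy embeddings, not taken from the trained model.
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "car": [0.1, 0.9],
}
odd_one = doesnt_match(toy_vectors, ["breakfast", "lunch", "car", "dinner"])
print(odd_one)
```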