<a href="https://colab.research.google.com/github/mehrn79/npl_wrod2wec/blob/main/NLP_2wec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Word2vec implementation using skip-grams**

## **Preprocessing**

### **libraries for preprocessing**

In [75]:
%pip install hazm



In [76]:
from __future__ import unicode_literals
from hazm import *
from keras.preprocessing import text
from keras.preprocessing.sequence import skipgrams

### **getting our corpus and stop words**

In [77]:
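# Read both files as UTF-8 explicitly so non-ASCII text decodes the same on every platform.
with open('/content/shams.txt', encoding='utf-8') as f:
  lines = f.readlines()

with open('/content/stopwords.txt', encoding='utf-8') as file:
  stopLines = file.readlines()
# Drop the trailing newline from every stop word.
stopWord = [item.replace('\n', "") for item in stopLines]
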

### **removing stop words from our corpus**

In [88]:
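# Tokenize every line with hazm's word_tokenize.
words = [word_tokenize(sent) for sent in lines]

# A set makes each stop-word lookup constant time.
stopSet = set(stopWord)

corpes = []
for wordSent in words:
  # Build a filtered copy instead of calling remove() during iteration,
  # which skips the token that follows each removed stop word.
  kept = [word for word in wordSent if word not in stopSet]
  corpes.append(' '.join(kept))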
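Removing items from a list while looping over that same list makes the iterator skip the element right after each removal, so two stop words in a row are not both removed. The sketch below contrasts that pattern with building a filtered copy. The tokens are made-up English words and the function names `remove_in_place` and `filter_copy` are illustrative, not part of the notebook.

```python
def remove_in_place(tokens, stop_words):
    """Pitfall: mutate the list that is currently being iterated over."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    for tok in tokens:
        if tok in stop_words:
            tokens.remove(tok)  # shifts later items left; the next one is skipped
    return tokens


def filter_copy(tokens, stop_words):
    """Safe variant: build a new list that leaves out the stop words."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = ["the", "a", "cat", "sat"]
stop_words = {"the", "a"}

print(remove_in_place(tokens, stop_words))  # "a" survives because it was skipped
print(filter_copy(tokens, stop_words))      # both stop words are gone
```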


### **tokenizing our corpus**

In [89]:
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(corpes)

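# word_index maps each word to an integer ID, starting at 1.
word2id = tokenizer.word_index
id2word = {v:k for k, v in word2id.items()}

# +1 because index 0 is reserved and never assigned to a word.
vocab_size = len(word2id) + 1
embed_size = 100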

sentencesID = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in corpes]
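To make the ID mapping concrete, here is a rough pure-Python sketch of what this step produces. It assumes the Keras `Tokenizer` behaviour of ranking words by descending frequency and numbering them from 1, and it leaves out punctuation filtering. The helper `build_word_index` and the toy sentences are hypothetical.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an ID: most frequent word gets 1, the next gets 2, ..."""
    counts = Counter(word for doc in texts for word in doc.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


docs = ["sun moon sun", "moon star sun"]
word2id_demo = build_word_index(docs)
id2word_demo = {idx: word for word, idx in word2id_demo.items()}

# Index 0 stays unused, so the vocabulary size is the number of words plus one.
vocab_size_demo = len(word2id_demo) + 1

# Each sentence becomes a list of word IDs.
sentences_demo = [[word2id_demo[w] for w in doc.split()] for doc in docs]
print(word2id_demo, vocab_size_demo, sentences_demo)
```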


### **generate skip-grams**

In [90]:
skip_grams = [skipgrams(sentID, vocabulary_size=vocab_size, window_size=2) for sentID in sentencesID]
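For each sentence, `skipgrams` returns a list of `[target, context]` pairs together with a parallel list of labels: 1 for a pair that really occurs within the window, 0 for a randomly drawn negative pair. The following is a simplified pure-Python sketch of that idea, not the Keras implementation: it draws one negative per positive pair and does not shuffle the result. The function name `skipgram_pairs` and its `seed` parameter are assumptions made for illustration.

```python
import random


def skipgram_pairs(sequence, vocabulary_size, window_size=2, seed=0):
    """Return (couples, labels) for one sentence of word IDs."""
    rng = random.Random(seed)
    couples, labels = [], []

    # Positive pairs: every word paired with its neighbours inside the window.
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                couples.append([target, sequence[j]])
                labels.append(1)

    # Negative pairs: reuse each target with a random word ID (0 is reserved).
    num_positive = len(couples)
    for k in range(num_positive):
        couples.append([couples[k][0], rng.randint(1, vocabulary_size - 1)])
        labels.append(0)

    return couples, labels


couples, labels = skipgram_pairs([1, 2, 3], vocabulary_size=4, window_size=2)
for pair, label in zip(couples, labels):
    print(pair, label)
```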


## **Build the skip-gram model architecture**

### **libraries for the skip-gram model architecture**

In [91]:
# skipgrams is already imported in the preprocessing cell.
from keras.layers import Dense, Embedding, Reshape, add
from keras.models import Model, Sequential
import numpy as np

### **implementation**

In [92]:
targetWord_model = Sequential()
targetWord_model.add(Embedding(vocab_size, embed_size,
                         embeddings_initializer="glorot_uniform",
                         input_length=1))
targetWord_model.add(Reshape((embed_size, )))

contextWord_model = Sequential()
contextWord_model.add(Embedding(vocab_size, embed_size,
                  embeddings_initializer="glorot_uniform",
                  input_length=1))
contextWord_model.add(Reshape((embed_size,)))



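# Merge the target and context embeddings by element-wise addition.
merged_output = add([targetWord_model.output, contextWord_model.output])
# A single sigmoid unit turns the merged vector into a relevance score in (0, 1).
model_combined = Sequential()
model_combined.add(Dense(1, kernel_initializer="glorot_uniform", activation="sigmoid"))
final_model = Model([targetWord_model.input, contextWord_model.input], model_combined(merged_output))
# The 0/1 skip-gram labels are fitted with mean squared error.
final_model.compile(loss="mean_squared_error", optimizer="rmsprop")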
final_model.summary()

Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 embedding_22_input (InputLayer  [(None, 1)]         0           []                               
 )                                                                                                
                                                                                                  
 embedding_23_input (InputLayer  [(None, 1)]         0           []                               
 )                                                                                                
                                                                                                  
 embedding_22 (Embedding)       (None, 1, 100)       1041900     ['embedding_22_input[0][0]']     
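As a rough illustration of what this architecture computes for one (target, context) pair, the sketch below looks up both embeddings, adds them element-wise, and passes the result through a single sigmoid unit. It mirrors the notebook's `add` merge (the original word2vec formulation scores a pair with a dot product instead). The tiny hand-picked weights and the function name `forward` are hypothetical and only serve the example.

```python
import math


def forward(target_id, context_id, target_emb, context_emb, weights, bias):
    """Score one (target, context) pair: add the embeddings, then Dense(1) + sigmoid."""
    merged = [t + c for t, c in zip(target_emb[target_id], context_emb[context_id])]
    logit = sum(w * m for w, m in zip(weights, merged)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Two-word vocabulary (row 0 is the reserved index) with 2-dimensional embeddings.
target_emb = [[0.0, 0.0], [1.0, 2.0]]
context_emb = [[0.0, 0.0], [3.0, -1.0]]
weights = [0.5, 0.5]
bias = 0.0

score = forward(1, 1, target_emb, context_emb, weights, bias)
print(score)
```

During training, the score is pushed toward 1 for pairs that occur in the corpus and toward 0 for the sampled negative pairs.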
                                                                                            

## **train the model**

In [96]:

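for epoch in range(1, 6):
    loss = 0
    for i, elem in enumerate(skip_grams):
        if i % 10000 == 0:
            print('Processed {} (skip_first, skip_second, relevance) pairs'.format(i))
        pairs, pair_labels = elem
        if not pairs:
            # A sentence too short to produce any pair has nothing to train on.
            continue
        # Split the (target, context) pairs into two parallel ID sequences.
        first_ids, second_ids = zip(*pairs)
        pair_first_elem = np.array(first_ids, dtype='int32')
        pair_second_elem = np.array(second_ids, dtype='int32')
        labels = np.array(pair_labels, dtype='int32')
        X = [pair_first_elem, pair_second_elem]
        Y = labels
        # Each sentence is one batch; accumulate its loss over the epoch.
        loss += final_model.train_on_batch(X, Y)

    print('Epoch:', epoch, 'Loss:', loss)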

Processed 0 (skip_first, skip_second, relevance) pairs
Epoch: 1 Loss: 816.8299669250846
Processed 0 (skip_first, skip_second, relevance) pairs
Epoch: 2 Loss: 801.4483801852912
Processed 0 (skip_first, skip_second, relevance) pairs
Epoch: 3 Loss: 795.1650541406125
Processed 0 (skip_first, skip_second, relevance) pairs
Epoch: 4 Loss: 791.6809043437243
Processed 0 (skip_first, skip_second, relevance) pairs
Epoch: 5 Loss: 789.4506377577782
