
# **Word Embedding Technique using Embedding Layer in Keras**

In [101]:
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

## **Food Review Classification**

In [102]:
# Dataset: 10 short food reviews

reviews = ['nice food',
           'amazing food',
           'too good',
           'just loved it',
           'will go again',
           'horrible food',
           'never go there',
           'poor service',
           'poor quality',
           'needs improvement']

# 10 corresponding labels:
# 1 marks a positive review, 0 a negative one.
# The first five reviews are positive and the last five are negative.

sentiment = np.array([1,1,1,1,1,0,0,0,0,0])           

In [103]:
# Encoding words as integers
# Despite its name, one_hot does not return one-hot vectors. It hashes each word
# to an integer index, given a vocabulary size n.
# With n = 30, every index falls in the range 1 to 29 (0 is never produced),
# and two different words can collide on the same index.

# Here 'amazing' got index 23 and 'restaurant' got index 1. The Embedding layer
# later uses these integers directly as row indices into its weight matrix;
# it does not expand them into 0/1 vectors.

one_hot('amazing restaurant',30)

[23, 1]

In [104]:
one_hot('amzaing restaurant',100)

[26, 27]
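
The integer encoding above can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it assumes an MD5-based hash so the numbers are repeatable, whereas `one_hot` uses Python's built-in `hash` by default, so the indices printed here will not match the outputs above. `hashed_index` and `encode` are hypothetical helper names, and punctuation filtering is left out.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1]; 0 is never produced."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(text, n):
    """Lowercase the text, split on whitespace, and hash every word."""
    return [hashed_index(w, n) for w in text.lower().split()]

codes = encode("amazing restaurant", 30)
print(codes)

# With a tiny vocabulary size, different words are forced to share an index.
tiny = encode("nice amazing good loved horrible poor", 3)
print(tiny)  # only the values 1 and 2 are possible here
```

Because the mapping is a hash rather than a learned dictionary, a larger vocabulary size only lowers the chance of collisions; it does not rule them out.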

**One Hot Representation**

In [105]:
# Now let's encode every review as a list of integer word indices

vocab_size = 300
encoded_reviews = [one_hot(i, vocab_size) for i in reviews]
encoded_reviews

# Each review is now a list of integers.
# Note that 'poor quality' became [241, 241]: 'poor' and 'quality' hashed to the same index (a collision).

[[182, 24],
 [203, 24],
 [279, 269],
 [225, 12, 206],
 [183, 169, 107],
 [90, 24],
 [6, 169, 72],
 [241, 291],
 [241, 241],
 [197, 25]]

In [106]:
# Some of the encoded reviews above contain two numbers and some contain three,
# so we pad them to a common length.

# Maximum review length, in words
max_length = 3
padded_reviews = pad_sequences(encoded_reviews, maxlen=max_length, padding='post')  # 'post' pads with zeros at the end
padded_reviews

# Every row now has length 3

array([[182,  24,   0],
       [203,  24,   0],
       [279, 269,   0],
       [225,  12, 206],
       [183, 169, 107],
       [ 90,  24,   0],
       [  6, 169,  72],
       [241, 291,   0],
       [241, 241,   0],
       [197,  25,   0]], dtype=int32)
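
For intuition, post-padding can be reproduced with a few lines of NumPy. This is only a sketch: `pad_post` is a hypothetical helper, and it trims over-long sequences from the end, whereas `pad_sequences` by default trims from the front. That difference does not matter here, since no review is longer than `max_length`.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad (or right-trim) each sequence to exactly maxlen items."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        out[row, :len(trimmed)] = trimmed
    return out

# Two of the encoded reviews from above: one short, one already full length.
padded = pad_post([[182, 24], [225, 12, 206]], maxlen=3)
print(padded)
```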

In [107]:
embedded_vector_size = 5

# let's create a model

model = Sequential()
model.add(Embedding(vocab_size,embedded_vector_size,input_length = max_length,name = 'embedding'))
model.add(Flatten())
model.add(Dense(1, activation = 'sigmoid'))                # it will be a dense layer with one neuron and sigmoid activation function
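
The `Embedding` layer can be thought of as a trainable lookup table with one row per index. The NumPy sketch below uses a random matrix as a stand-in for the layer's weights (an assumption for illustration; the real weights are learned during training) and shows that indexing rows gives the same result as multiplying explicit one-hot vectors by the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 300, 5
table = rng.normal(size=(vocab_size, dim))   # stand-in for the layer's weights

review = np.array([182, 24, 0])              # 'nice food' plus one padding index

looked_up = table[review]                    # shape (3, 5): one row per index

# The same result via explicit one-hot vectors and a matrix product.
one_hot_rows = np.eye(vocab_size)[review]    # shape (3, 300)
via_matmul = one_hot_rows @ table

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))

# Flatten then turns the (3, 5) block into the 15 values fed to the Dense layer.
flat = looked_up.reshape(-1)
print(flat.shape)
```

The lookup avoids ever building the 300-wide one-hot vectors, which is why the layer takes integer indices as input.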

In [108]:
x = padded_reviews
y = sentiment

In [109]:
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',                # binary cross-entropy because each label is either 0 or 1
    metrics = ['accuracy']
)
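
As a rough illustration of what the loss measures, binary cross-entropy can be computed by hand with NumPy. The helper name `binary_crossentropy` and the example probabilities are made up for this sketch; Keras applies its own clipping and reduction internally.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1 - y)*log(1 - p)] over all samples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = np.array([1, 1, 0, 0])
confident = np.array([0.9, 0.8, 0.1, 0.2])   # mostly right, fairly sure
unsure = np.array([0.5, 0.5, 0.5, 0.5])      # always predicts 50/50

loss_confident = binary_crossentropy(y_true, confident)
loss_unsure = binary_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Predictions close to the true label give a small loss, while a constant 0.5 prediction gives log(2) per sample.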

In [110]:
model.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 3, 5)              1500      
                                                                 
 flatten_7 (Flatten)         (None, 15)                0         
                                                                 
 dense_7 (Dense)             (None, 1)                 16        
                                                                 
Total params: 1,516
Trainable params: 1,516
Non-trainable params: 0
_________________________________________________________________
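
The parameter counts in the summary can be checked with simple arithmetic, using the values set earlier in the notebook:

```python
vocab_size = 300
embedded_vector_size = 5
max_length = 3

embedding_params = vocab_size * embedded_vector_size   # one 5-dimensional row per index
flatten_units = max_length * embedded_vector_size      # 3 words x 5 dimensions
dense_params = flatten_units * 1 + 1                   # 15 weights plus one bias

total = embedding_params + dense_params
print(embedding_params, flatten_units, dense_params, total)  # 1500 15 16 1516
```

Almost all of the parameters sit in the embedding table, which is the part of the model this notebook is interested in.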


In [111]:
model.fit(x,y, epochs = 50, verbose = 0)

<keras.callbacks.History at 0x7f6d5128ff90>

In [112]:
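# Note: this evaluates on the same 10 reviews the model was trained on,
# so it reports training accuracy, not how well the model generalizes.
loss, accuracy = model.evaluate(x,y)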
accuracy



1.0

**Sentiment classification was only a pretext task here.**
**What I am really interested in is the word embeddings.**

**By training the sentiment classifier, I also obtained word embeddings as a by-product.**


In [113]:
weights = model.get_layer('embedding').get_weights()[0]

In [114]:
len(weights)                      # 300 rows, one per index, because vocab_size is 300

300

In [115]:
# Let's check the embedding for the word 'nice' (encoded as 182 above)

weights[182]

array([ 0.06393227, -0.04134587, -0.01177133, -0.02179534, -0.07998518],
      dtype=float32)

In [116]:
# Let's check the embedding for the word 'amazing' (encoded as 203 above)

weights[203]

array([ 0.01377268, -0.09530769, -0.09949608, -0.07227844, -0.0379159 ],
      dtype=float32)

**The vectors for 'nice' and 'amazing' are not identical.**

**'nice' and 'amazing' have similar meanings, so we might expect their vectors to be close, but 10 reviews are far too few to learn that reliably.**

**Trained on a much larger dataset, the two vectors would likely end up closer together.**
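
One common way to quantify how close two embedding vectors are is cosine similarity. The sketch below copies the two vectors printed above; `cosine_similarity` is a small helper written for this example, not a Keras function. A value near 1 means the vectors point in nearly the same direction, and a value near 0 means they are unrelated.

```python
import numpy as np

# Embeddings copied from the outputs above
nice = np.array([0.06393227, -0.04134587, -0.01177133, -0.02179534, -0.07998518])
amazing = np.array([0.01377268, -0.09530769, -0.09949608, -0.07227844, -0.0379159])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

similarity = cosine_similarity(nice, amazing)
print(round(similarity, 3))
```

With only 10 training reviews, a single similarity score says little; a rerun with a different random initialization could give a noticeably different value.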