
*References*

https://www.tensorflow.org/tutorials/text/word_embeddings

https://keras.io/layers/embeddings/

In [1]:
!pip install tensorflow==2.0.0 



In [0]:
import tensorflow as tf
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

In [3]:
tf.__version__

'2.2.0-dev20200318'

*Raw text cannot be fed into a neural network directly. It has to be converted into numbers first.*

*Each sentence can be viewed as a list of tokens (words or subwords).*

*Each token is represented by an integer. input_dim is the size of the vocabulary, i.e. the largest token value plus one. Valid token values therefore run from 0 to input_dim - 1, and no token may be mapped to a value greater than or equal to input_dim.*

*Each token is mapped to a trainable 1D vector of fixed size; this vector is its embedding. The length of the vector is a hyperparameter known as output_dim.*

*Sequences contain different numbers of tokens, so the shorter ones are padded with zeros at the end to give them a common length.*
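
*The steps described above can be sketched in plain Python. This is only a toy illustration: the two sentences, the word-level vocabulary, and the choice to reserve id 0 for padding are assumptions made for the example, not part of the dataset used later.*

```python
# Two made-up sentences (hypothetical data, for illustration only).
sentences = ["the movie was great", "the plot was thin and slow"]

# Build a toy word-level vocabulary; id 0 is kept free for padding.
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab) + 1)

# Replace every word with its integer id.
encoded = [[vocab[word] for word in sentence.split()] for sentence in sentences]

# input_dim must be strictly greater than every id in use (padding id included).
input_dim = max(vocab.values()) + 1

# Pad the shorter sequences with zeros at the end.
max_len = max(len(seq) for seq in encoded)
padded = [seq + [0] * (max_len - len(seq)) for seq in encoded]

print(encoded)
print(input_dim)
print(padded)
```

*Here the largest id is 8, so input_dim is 9, and both padded sequences have length 6.*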

**tf.keras.layers.Embedding(input_dim,output_dim)**

In [0]:
embedding_layer = tf.keras.layers.Embedding(1000,5)

In [5]:
x = embedding_layer(tf.constant(2))
x.numpy()

array([ 0.0339292 , -0.0392426 , -0.04252534,  0.03902936, -0.04932391],
      dtype=float32)

In [6]:
x = embedding_layer(tf.constant(999))
x.numpy()

array([-0.00839712,  0.03235992, -0.00803491, -0.0274361 , -0.01136615],
      dtype=float32)

*A token value cannot be equal to input_dim. Valid values run from 0 to input_dim - 1, so 1000 is out of range for this layer.*

In [7]:
x = embedding_layer(tf.constant(1000))
x.numpy()

InvalidArgumentError: ignored
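
*One way to picture why 999 works and 1000 fails is to treat the layer as a table with input_dim rows, one per token value. The NumPy sketch below is a stand-in for that idea, not the layer's actual implementation; the names table, last_row and out_of_range_rejected are made up for this example.*

```python
import numpy as np

input_dim, output_dim = 1000, 5
rng = np.random.default_rng(0)

# Stand-in for the layer's trainable weights: one row of length output_dim per token value.
table = rng.uniform(-0.05, 0.05, size=(input_dim, output_dim))

# Token 999 selects the last row of the table.
last_row = table[input_dim - 1]

# Token 1000 would need a row that does not exist.
try:
    table[input_dim]
    out_of_range_rejected = False
except IndexError:
    out_of_range_rejected = True

print(last_row.shape, out_of_range_rejected)
```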

*If the input to embedding_layer has shape (x,), the output of the embedding layer has shape (x, output_dim).*

In [8]:
x = embedding_layer(tf.constant([1,2,3]))
x.numpy()

array([[ 0.0379575 ,  0.00796989,  0.0126771 , -0.00243366, -0.02905126],
       [ 0.0339292 , -0.0392426 , -0.04252534,  0.03902936, -0.04932391],
       [-0.04611778, -0.00851363, -0.0165308 , -0.02084555,  0.0453205 ]],
      dtype=float32)

*If the input to embedding_layer has shape (x, y), the output of the embedding layer has shape (x, y, output_dim).*

In [9]:
x = embedding_layer(tf.constant([[1,2,3],[4,5,6]]))
x.numpy()

array([[[ 0.0379575 ,  0.00796989,  0.0126771 , -0.00243366,
         -0.02905126],
        [ 0.0339292 , -0.0392426 , -0.04252534,  0.03902936,
         -0.04932391],
        [-0.04611778, -0.00851363, -0.0165308 , -0.02084555,
          0.0453205 ]],

       [[-0.04653094,  0.0314175 ,  0.01323033,  0.03686059,
          0.02354066],
        [-0.01486235, -0.03579427, -0.02614143,  0.00947496,
          0.00098473],
        [-0.02317078, -0.03154291, -0.02784969, -0.02647185,
          0.03611401]]], dtype=float32)
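
*The shape rules above match what plain array indexing does: every token value is replaced by its row, so the output keeps the input shape and gains one trailing axis of length output_dim. The sketch below assumes a random NumPy table in place of the layer's weights, and lookup is a hypothetical helper, not a Keras function.*

```python
import numpy as np

input_dim, output_dim = 1000, 5
table = np.random.default_rng(0).uniform(-0.05, 0.05, size=(input_dim, output_dim))

def lookup(token_ids):
    """Return one table row per token value, keeping the shape of the input."""
    return table[np.asarray(token_ids)]

scalar_out = lookup(2)                         # input shape ()     -> (output_dim,)
vector_out = lookup([1, 2, 3])                 # input shape (3,)   -> (3, output_dim)
matrix_out = lookup([[1, 2, 3], [4, 5, 6]])    # input shape (2, 3) -> (2, 3, output_dim)

print(scalar_out.shape, vector_out.shape, matrix_out.shape)

# The same token value always selects the same row, whatever the input shape.
same_row = np.array_equal(scalar_out, vector_out[1]) and np.array_equal(scalar_out, matrix_out[0, 1])
print(same_row)
```

*This is also why the vector for token 2 is identical in all three outputs shown above.*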

In [0]:
(train_data,test_data),info = tfds.load(
    'imdb_reviews/subwords8k', 
    split = (tfds.Split.TRAIN, tfds.Split.TEST), 
    with_info=True, as_supervised=True)

*info holds the encoder, which encodes text into integer token values and decodes integer values back into text.*

In [11]:
type(info)

tensorflow_datasets.core.dataset_info.DatasetInfo

In [12]:
info.features

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8185>),
})

In [0]:
encoder = info.features['text'].encoder

In [14]:
encoder

<SubwordTextEncoder vocab_size=8185>

In [15]:
encoder.vocab_size

8185

In [16]:
len(encoder.subwords),type(encoder.subwords)

(7928, list)

*An underscore in a subword stands for a space; for example, 'the_' is 'the' followed by a space.*

In [17]:
encoder.subwords[:20]

['the_',
 ', ',
 '. ',
 'a_',
 'and_',
 'of_',
 'to_',
 's_',
 'is_',
 'br',
 'in_',
 'I_',
 'that_',
 'this_',
 'it_',
 ' /><',
 ' />',
 'was_',
 'The_',
 'as_']

In [18]:
sample_string = "Pradeeshwar JaiShankar"
encoded_string = encoder.encode(sample_string)
encoded_string

[7168, 1816, 190, 8033, 904, 836, 8034, 6030, 7206]

In [19]:
for i in encoded_string:
    print("'{}'".format(encoder.decode([i])))

'Pr'
'ade'
'es'
'h'
'war '
'Ja'
'i'
'Shan'
'kar'


In [20]:
decoded_string = encoder.decode(encoded_string)
decoded_string

'Pradeeshwar JaiShankar'
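
*To get a feel for how a string ends up as a list of subword ids, here is a toy greedy longest-match tokenizer. It is a rough sketch only and is not claimed to be the algorithm SubwordTextEncoder uses. The vocabulary is hypothetical: it contains just the pieces printed above, with ids assigned from 1, and it stores the space literally instead of as an underscore.*

```python
# Hypothetical toy vocabulary; ids start at 1 so that 0 stays free for padding.
TOY_SUBWORDS = ["Pr", "ade", "es", "h", "war ", "Ja", "i", "Shan", "kar"]
SUBWORD_TO_ID = {piece: i + 1 for i, piece in enumerate(TOY_SUBWORDS)}
ID_TO_SUBWORD = {i: piece for piece, i in SUBWORD_TO_ID.items()}
MAX_PIECE_LEN = max(len(piece) for piece in TOY_SUBWORDS)

def toy_encode(text):
    """Greedily take the longest vocabulary piece that matches at each position."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(min(len(text), pos + MAX_PIECE_LEN), pos, -1):
            piece = text[pos:end]
            if piece in SUBWORD_TO_ID:
                ids.append(SUBWORD_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError("no subword covers {!r}".format(text[pos]))
    return ids

def toy_decode(ids):
    """Concatenate the pieces back into the original string."""
    return "".join(ID_TO_SUBWORD[i] for i in ids)

toy_ids = toy_encode("Pradeeshwar JaiShankar")
toy_pieces = [ID_TO_SUBWORD[i] for i in toy_ids]
toy_roundtrip = toy_decode(toy_ids)

print(toy_ids)
print(toy_pieces)
print(toy_roundtrip)
```

*With this toy vocabulary the string splits into the same nine pieces as above, and decoding the ids restores the original text.*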

In [21]:
train_data

<DatasetV1Adapter shapes: ((None,), ()), types: (tf.int64, tf.int64)>

*The dataset is split into batches, and the sequences in each batch are padded with zeros.*

In [0]:
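# Shuffle, then batch 10 reviews at a time, zero-padding each batch to its longest review.
# (TF versions before 2.2 also need padded_shapes=([None], []) in padded_batch.)
train_batches = train_data.shuffle(1000).padded_batch(10)
test_batches = test_data.padded_batch(10)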

In [0]:
sample_batch  = next(iter(train_batches))

In [30]:
data,label = sample_batch
print(data.numpy(),label.numpy(),sep='\n\n')

[[ 518 1693  192 ...    0    0    0]
 [ 857  656   15 ...    0    0    0]
 [  62    9   43 ...    0    0    0]
 ...
 [1156  579    7 ...    0    0    0]
 [  12   81  641 ... 7961 3388 7975]
 [ 147 2498 7984 ...    0    0    0]]

[0 0 0 1 1 0 0 1 0 0]


*Each batch is padded to the length of its longest sequence, so the shape differs from batch to batch.*

In [31]:
for data,label in train_batches.take(3):
    print("{:20s} : {}".format("batch size",data.numpy().shape),end='\n\n')

batch size           : (10, 820)

batch size           : (10, 612)

batch size           : (10, 650)
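
*The per-batch padding seen above can be imitated with a few lines of plain Python. This is a minimal sketch under stated assumptions: padded_batches is a hypothetical helper and toy_sequences is made-up data; it only mirrors the idea behind padded_batch, not its implementation.*

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Yield batches in which every sequence is padded to the longest one in that batch."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]

# Made-up token sequences of different lengths.
toy_sequences = [[5, 3], [7, 1, 4, 2], [9], [6, 8, 2], [1, 2, 3, 4, 5], [4]]

batches = list(padded_batches(toy_sequences, batch_size=2))
batch_shapes = [(len(batch), len(batch[0])) for batch in batches]

print(batches[0])
print(batch_shapes)
```

*Each batch has the same number of rows, but its width follows the longest sequence it happens to contain.*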



*The dataset is now ready for training.*