<a href="https://colab.research.google.com/github/jpradeesh3800/ml/blob/master/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*References*

https://www.tensorflow.org/tutorials/text/word_embeddings

https://keras.io/layers/embeddings/

*Install the following packages first so that the notebook runs as expected.*

In [1]:
!pip install --upgrade pip
!pip install gast==0.2.2
!pip install tensorflow_federated==0.7.0
!pip install -q tf-nightly


In [0]:
import tensorflow as tf
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

In [3]:
tf.__version__

'2.2.0-dev20200318'

*Raw text cannot be fed into a neural network as it is. It has to be converted into numbers first.*

*Each sentence can be seen as a list of tokens (words or subword strings).*

*Each token is represented by an integer index. input_dim is the size of the vocabulary, i.e. the largest token index plus one, so every valid index lies in the range 0 to input_dim - 1.*

*Each token is mapped to a trainable 1D vector of fixed size (its embedding). The length of that vector is a hyperparameter known as output_dim.*

*Sequences contain different numbers of tokens, so the shorter ones are padded with zeros at the end.*
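*The token-to-integer mapping described above can be sketched in plain Python. This is only an illustration under simple assumptions: whitespace tokenization, a toy vocabulary built from two made-up sentences, and ID 0 kept free for padding. The names `sentences`, `vocab`, and `encode` are hypothetical and not part of TensorFlow.*

```python
# Toy corpus (made up for illustration).
sentences = ["the movie was great", "the plot was thin"]

# Give each distinct word an integer ID, starting at 1 so that 0 stays free for padding.
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

def encode(sentence):
    """Map a whitespace-tokenized sentence to a list of integer IDs."""
    return [vocab[word] for word in sentence.split()]

encoded = [encode(s) for s in sentences]

# input_dim must be larger than every ID: the number of words plus the padding ID 0.
input_dim = len(vocab) + 1

print(vocab)
print(encoded)
print(input_dim)
```

*With six distinct words and the padding ID, input_dim comes out as 7, and the largest ID in use is 6.*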

**tf.keras.layers.Embedding(input_dim,output_dim)**

In [0]:
embedding_layer = tf.keras.layers.Embedding(1000,5)

In [5]:
x = embedding_layer(tf.constant(2))
x.numpy()

array([ 0.00838083,  0.00927347, -0.01304644,  0.01269159, -0.01352956],
      dtype=float32)

In [6]:
x = embedding_layer(tf.constant(999))
x.numpy()

array([ 0.04020435, -0.01292528, -0.04991257, -0.0295392 , -0.01072909],
      dtype=float32)

*A token index must be strictly less than input_dim. An index equal to input_dim is already out of range.*

In [7]:
x = embedding_layer(tf.constant(1000))
x.numpy()

InvalidArgumentError: ignored
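*Conceptually, the layer behaves like a table with input_dim rows and output_dim columns, and a lookup simply returns one row. The plain-Python sketch below illustrates why 999 works and 1000 does not. It is not how Keras implements the layer; the names `table` and `lookup` are hypothetical, and the random range is an assumption chosen to resemble the values printed above.*

```python
import random

input_dim, output_dim = 1000, 5

# Stand-in for the layer's weights: one row of output_dim floats per token index.
random.seed(0)
table = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
         for _ in range(input_dim)]

def lookup(index):
    """Return the embedding row for a token index, rejecting out-of-range values."""
    if not 0 <= index < input_dim:
        raise IndexError("index {} is not in [0, {})".format(index, input_dim))
    return table[index]

print(len(lookup(999)))  # 999 is the last valid index

try:
    lookup(1000)         # there is no row number 1000
except IndexError as err:
    print("rejected:", err)
```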

*If the input to embedding_layer has shape (x,), the output has shape (x, output_dim).*

In [8]:
x = embedding_layer(tf.constant([1,2,3]))
x.numpy()

array([[-0.03861893,  0.03799578, -0.02285944, -0.04729235, -0.03754127],
       [ 0.00838083,  0.00927347, -0.01304644,  0.01269159, -0.01352956],
       [-0.01134779, -0.02453787, -0.00169013, -0.04312469, -0.02912997]],
      dtype=float32)

*If the input to embedding_layer has shape (x, y), the output has shape (x, y, output_dim).*

In [9]:
x = embedding_layer(tf.constant([[1,2,3],[4,5,6]]))
x.numpy()

array([[[-0.03861893,  0.03799578, -0.02285944, -0.04729235,
         -0.03754127],
        [ 0.00838083,  0.00927347, -0.01304644,  0.01269159,
         -0.01352956],
        [-0.01134779, -0.02453787, -0.00169013, -0.04312469,
         -0.02912997]],

       [[-0.0190549 ,  0.02655848,  0.02650747,  0.03201841,
          0.00613926],
        [-0.02125731, -0.02225229,  0.02481966, -0.03849624,
         -0.00283132],
        [ 0.00096061,  0.01598182,  0.01984005,  0.04290893,
         -0.00679637]]], dtype=float32)
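*The shape rule seen in the three calls above (the input shape is kept and output_dim is appended) can be reproduced with nested lists. This is a minimal sketch, not the Keras implementation: `lookup` is a hypothetical stand-in that returns a dummy vector, and `embed` and `shape` are helper names invented for the example.*

```python
output_dim = 5

def lookup(index):
    """Stand-in embedding: a dummy vector of length output_dim."""
    return [float(index)] * output_dim

def embed(ids):
    """Replace every integer in a (possibly nested) list with its embedding vector."""
    if isinstance(ids, int):
        return lookup(ids)
    return [embed(item) for item in ids]

def shape(nested):
    """Shape of a regular nested list, e.g. [[1, 2], [3, 4]] gives (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

print(shape(embed(2)))                       # scalar input  -> (5,)
print(shape(embed([1, 2, 3])))               # (3,) input    -> (3, 5)
print(shape(embed([[1, 2, 3], [4, 5, 6]])))  # (2, 3) input  -> (2, 3, 5)
```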

In [10]:
(train_data,test_data),info = tfds.load(
    'imdb_reviews/subwords8k', 
    split = (tfds.Split.TRAIN, tfds.Split.TEST), 
    with_info=True, as_supervised=True)

Downloading and preparing dataset imdb_reviews/subwords8k/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incompleteKJ2LQV/imdb_reviews-train.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incompleteKJ2LQV/imdb_reviews-test.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incompleteKJ2LQV/imdb_reviews-unsupervised.tfrecord
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.


*info holds the encoder, which encodes text into integer token IDs and decodes token IDs back into text.*

In [11]:
type(info)

tensorflow_datasets.core.dataset_info.DatasetInfo

In [12]:
info.features

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8185>),
})

In [0]:
encoder = info.features['text'].encoder

In [14]:
encoder

<SubwordTextEncoder vocab_size=8185>

In [15]:
encoder.vocab_size

8185

In [16]:
len(encoder.subwords),type(encoder.subwords)

(7928, list)
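*The list holds 7928 subwords while vocab_size is 8185. The difference can be accounted for if ID 0 is reserved for padding and 256 further IDs act as a fallback, one per possible byte value, for text that no subword matches. That breakdown is an assumption about how SubwordTextEncoder lays out its IDs, not something shown in the outputs above. The arithmetic is:*

```python
num_subwords = 7928   # len(encoder.subwords), from the output above
num_padding_ids = 1   # assumption: ID 0 is reserved for padding
num_byte_ids = 256    # assumption: one fallback ID per possible byte value

vocab_size = num_padding_ids + num_subwords + num_byte_ids
print(vocab_size)  # 8185
```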

*A trailing underscore in a subword stands for a space.*

In [17]:
encoder.subwords[:20]

['the_',
 ', ',
 '. ',
 'a_',
 'and_',
 'of_',
 'to_',
 's_',
 'is_',
 'br',
 'in_',
 'I_',
 'that_',
 'this_',
 'it_',
 ' /><',
 ' />',
 'was_',
 'The_',
 'as_']

In [18]:
sample_string = "Pradeeshwar JaiShankar"
encoded_string = encoder.encode(sample_string)
encoded_string

[7168, 1816, 190, 8033, 904, 836, 8034, 6030, 7206]

In [19]:
for i in encoded_string:
    print("'{}'".format(encoder.decode([i])))

'Pr'
'ade'
'es'
'h'
'war '
'Ja'
'i'
'Shan'
'kar'


In [20]:
decoded_string = encoder.decode(encoded_string)
decoded_string

'Pradeeshwar JaiShankar'
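*In the per-token output above, 'h' and 'i' come out as single characters with IDs 8033 and 8034, which are larger than the 7928 entries in encoder.subwords. One reading that fits these numbers is that IDs are laid out as padding (0), then the subwords (1 to 7928), then one ID per byte value. The sketch below classifies the encoded IDs under that assumed layout; `describe` is a hypothetical helper, not a TFDS function.*

```python
num_subwords = 7928  # len(encoder.subwords), from the earlier output

def describe(token_id):
    """Classify an ID under the assumed layout: 0 = padding, then subwords, then bytes."""
    if token_id == 0:
        return "padding"
    if token_id <= num_subwords:
        # Subword IDs are shifted by one because 0 is taken by padding.
        return "subword #{}".format(token_id - 1)
    return "byte {!r}".format(bytes([token_id - num_subwords - 1]))

# The IDs produced for "Pradeeshwar JaiShankar" above.
for token_id in [7168, 1816, 190, 8033, 904, 836, 8034, 6030, 7206]:
    print(token_id, describe(token_id))
```

*Under this layout, 8033 and 8034 map to the bytes for 'h' and 'i', which agrees with the decoded pieces printed above.*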

In [21]:
train_data

<DatasetV1Adapter shapes: ((None,), ()), types: (tf.int64, tf.int64)>

*The dataset is split into batches of 10 reviews, and the sequences in each batch are padded with zeros to a common length.*

In [0]:
train_batches  = train_data.shuffle(1000).padded_batch(10)
test_batches = test_data.padded_batch(10)

In [0]:
sample_batch  = next(iter(train_batches))

In [24]:
data,label = sample_batch
print(data.numpy(),label.numpy(),sep='\n\n')

[[  69   57   93 ...    0    0    0]
 [ 133  279   86 ...    0    0    0]
 [1566  160 2124 ...    0    0    0]
 ...
 [ 147   82 1622 ...    0    0    0]
 [7963  134  404 ... 7974  166 7962]
 [  69  117   31 ...    0    0    0]]

[1 1 0 1 0 1 1 1 1 1]


*Every sequence is padded to the length of the longest sequence in its batch. Hence the padded length, and with it the batch shape, differs from batch to batch.*

In [25]:
for data,label in train_batches.take(3):
    print("{:20s} : {}".format("batch size",data.numpy().shape),end='\n\n')

batch size           : (10, 455)

batch size           : (10, 539)

batch size           : (10, 754)
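*Per-batch padding can be imitated in plain Python to show why the second dimension changes from batch to batch. This is a minimal sketch of the idea, not the tf.data implementation; `padded_batches` and the sample sequences are made up for the example.*

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Yield batches whose sequences are padded to the longest one in that batch."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (longest - len(seq)) for seq in batch]

# Made-up token ID sequences of varying length.
sequences = [[5, 8], [3], [7, 2, 9, 4], [6, 1, 1], [2], [4, 4]]

for batch in padded_batches(sequences, batch_size=2):
    # Print (rows, padded length) followed by the padded batch itself.
    print((len(batch), len(batch[0])), batch)
```

*Here the three batches have shapes (2, 2), (2, 4) and (2, 2), because each one is padded only as far as its own longest sequence.*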



*The dataset is now ready for training.*