### Embedding 

### Embedding Projector
##### https://projector.tensorflow.org/
Expanding on this, there's a library called TensorFlow Datasets, or TFDS for short, which contains many datasets across lots of different categories. While there are many datasets of different types, particularly image-based ones, there are also a few for text, and we'll be using the IMDB reviews dataset next. This dataset is ideal because it contains a large body of text: 50,000 movie reviews, each categorized as positive or negative. It was authored by Andrew Maas et al. at Stanford, and you can learn more about it at this link: http://ai.stanford.edu/~amaas/data/sentiment/


In [1]:
import tensorflow as tf
print(tf.__version__)

2.0.0-alpha0


In [2]:
pip install -q tensorflow-datasets

Note: you may need to restart the kernel to use updated packages.


In [3]:
import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

AttributeError: module 'tensorflow._api.v2.autograph.experimental' has no attribute 'do_not_convert'

In [12]:
import numpy as np
train_data, test_data = imdb['train'], imdb['test']

In [13]:
training_sentences = []
training_labels = []

validation_sentences = []
validation_labels = []

In [14]:
for s,l in train_data:
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())
    
for s,l in test_data:
    validation_sentences.append(str(s.numpy()))
    validation_labels.append(l.numpy())
    
training_labels_final = np.array(training_labels)
validation_labels_final = np.array(validation_labels)

In [17]:
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = '<OOV>'

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = vocab_size, oov_token = oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences,
                       maxlen= max_length, truncating= trunc_type)

validation_seq = tokenizer.texts_to_sequences(validation_sentences)
# Truncate validation reviews the same way as the training reviews
validation_padded = pad_sequences(validation_seq,
                                  maxlen=max_length, truncating=trunc_type)
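
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default front padding and the two `truncating` modes. The helper `pad_sequence_sketch` is a hypothetical name introduced only for illustration; it is not the Keras implementation.

```python
# Hypothetical helper mimicking front padding and truncation;
# an illustration only, not the Keras source.
def pad_sequence_sketch(seq, maxlen, truncating="pre", value=0):
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    # Pad on the left so every sequence has exactly maxlen entries
    return [value] * (maxlen - len(seq)) + list(seq)

short_seq = [5, 6]
long_seq = [1, 2, 3, 4, 5, 6]

padded_short = pad_sequence_sketch(short_seq, maxlen=4)
kept_tail = pad_sequence_sketch(long_seq, maxlen=4, truncating="pre")
kept_head = pad_sequence_sketch(long_seq, maxlen=4, truncating="post")
print(padded_short, kept_tail, kept_head)
```

With `trunc_type = 'post'`, a long review keeps its first `max_length` tokens, which is why the decoded review further down stops mid-sentence rather than starting mid-sentence.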

In [28]:
# Hello  : 1
# World : 2
# How : 3
reverse_word_index = dict([(value, key) 
                           for (key, value) in word_index.items()])
# 1 : Hello
# 2 : World
# 3 : How

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(padded[1]))
print(training_sentences[1])

b this was soul provoking i am an iranian and living in <OOV> 21st century i didn't know that such big <OOV> have been living in such conditions at the time of my grandfather br br you see that today or even in <OOV> on one side of the world a lady or a baby could have everything served for him or her clean and on demand but here 80 years ago people <OOV> their life to go to somewhere with more grass it's really interesting that these <OOV> bear those difficulties to find <OOV> for their sheep but they lose many the sheep on their way br br i praise the americans who accompanied this tribe they were as
b"This was soul-provoking! I am an Iranian, and living in th 21st century, I didn't know that such big tribes have been living in such conditions at the time of my grandfather!<br /><br />You see that today, or even in 1925, on one side of the world a lady or a baby could have everything served for him or her clean and on-demand, but here 80 years ago, people ventured their life to go to
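
The leading `b` in both printed lines appears to come from calling `str()` on the bytes object returned by `s.numpy()`, which keeps the bytes-literal notation instead of decoding the text. A small sketch of the difference, using a made-up byte string:

```python
raw = b"This was soul-provoking!"

as_str = str(raw)               # keeps the b'...' notation as part of the text
decoded = raw.decode("utf-8")   # plain text without the prefix

print(as_str)
print(decoded)
```

If clean text is preferred, decoding with `.decode("utf-8")` before appending to the sentence lists is one option; the outputs shown in this notebook were produced with `str()`.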

In [29]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 16)           160000    
_________________________________________________________________
flatten (Flatten)            (None, 1920)              0         
_________________________________________________________________
dense (Dense)                (None, 6)                 11526     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
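
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the hyperparameters defined earlier, assuming each Dense layer has one weight per input-unit pair plus one bias per unit.

```python
vocab_size = 10000
embedding_dim = 16
max_length = 120

# Embedding: one vector of embedding_dim values per vocabulary entry
embedding_params = vocab_size * embedding_dim
# Flatten has no parameters; it only reshapes (120, 16) into 1920 values
flatten_units = max_length * embedding_dim
# Dense(6): weights plus biases
dense_params = flatten_units * 6 + 6
# Dense(1): weights plus bias
output_params = 6 * 1 + 1

total_params = embedding_params + dense_params + output_params
print(embedding_params, flatten_units, dense_params, output_params, total_params)
```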


In [34]:
num_epochs = 10

model.fit(padded, training_labels_final, epochs = num_epochs, 
                   validation_data = (validation_padded, validation_labels_final))

W0823 15:59:12.771138 139628876769088 deprecation.py:323] From /home/prakashraj/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7efd5b4f08d0>

Training this way may overfit the training data; later we can learn how to avoid overfitting.


In [35]:
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) #shape :  (vocab_size, embedding_dim)

(10000, 16)
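
The `(vocab_size, embedding_dim)` shape reflects that the embedding layer acts as a lookup table: each token index selects one row. Here is a toy sketch with a 4 x 3 table; the values are made up for illustration, not learned weights.

```python
# Toy table: 4 token indices, 3 dimensions each (the real one is 10000 x 16).
toy_weights = [
    [0.9, 0.1, 0.2],  # index 0 (used for padding tokens)
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

padded_review = [0, 2, 3]  # a made-up padded sequence of token indices

# Embedding: replace each index with its row from the table
embedded = [toy_weights[i] for i in padded_review]
# Flatten: concatenate the rows into one long vector
flattened = [x for vec in embedded for x in vec]

print(embedded)
print(len(flattened))
```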


### TensorFlow Projector
The projector reads the vectors and their metadata from TSV files and plots the vectors in 3D space so we can visualize them.
Open the TensorFlow projector: https://projector.tensorflow.org/

In [44]:

import io

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
# Index 0 is reserved for padding, so the loop starts at 1.
for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    # To the metadata file, we just write out the word.
    out_m.write(word + "\n")
    # To the vectors file, we write out the value of each item in the
    # embedding, i.e. the coefficient of each dimension of this word's vector.
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()
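
To see the file layout without touching disk, the same writing logic can be run against in-memory buffers. This is a sketch with two made-up words and 2-dimensional vectors; line `k` of the vectors buffer corresponds to line `k` of the metadata buffer.

```python
import io

# Made-up words and vectors, for illustration only
toy_words = ["good", "bad"]
toy_vectors = [[0.5, -0.25], [-1.0, 0.75]]

vec_buf = io.StringIO()
meta_buf = io.StringIO()
for word, vec in zip(toy_words, toy_vectors):
    meta_buf.write(word + "\n")                            # one word per line
    vec_buf.write("\t".join(str(x) for x in vec) + "\n")   # tab-separated values

vec_lines = vec_buf.getvalue().splitlines()
meta_lines = meta_buf.getvalue().splitlines()

# Parse the vectors back to confirm the round trip
parsed = [[float(x) for x in line.split("\t")] for line in vec_lines]
print(vec_lines, meta_lines)
```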

In [46]:
try:
    from google.colab import files
except ImportError:
    pass
else:
    # Click the Load button in the projector and load the TSV files
    files.download('vecs.tsv')  # first one
    files.download('meta.tsv')  # second one

NameError: name 'files' is not defined

In [38]:
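sentence = "I really think this is amazing. honest."
# texts_to_sequences expects a list of texts. A bare string is iterated
# character by character, so the output has one entry per character.
# Use tokenizer.texts_to_sequences([sentence]) for word-level tokens.
sequence = tokenizer.texts_to_sequences(sentence)
print(sequence)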

[[11], [], [1430], [967], [4], [1537], [1537], [4730], [], [790], [2015], [11], [2922], [2189], [], [790], [2015], [11], [579], [], [11], [579], [], [4], [1783], [4], [4503], [11], [2922], [1277], [], [], [2015], [1005], [2922], [967], [579], [790], []]
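
The output above has one entry per character because a bare string was passed where a list of texts is expected. The toy sketch below reproduces that effect. `toy_texts_to_sequences` and `toy_word_index` are hypothetical stand-ins with made-up indices; the sketch is simplified and drops unknown words instead of mapping them to an OOV token.

```python
import string

# Made-up indices, for illustration only
toy_word_index = {"i": 1, "really": 2, "think": 3}

def toy_texts_to_sequences(texts, word_index):
    # Replace punctuation with spaces, then split on whitespace
    table = str.maketrans({c: " " for c in string.punctuation})
    result = []
    for text in texts:  # iterating a bare string yields single characters
        words = text.lower().translate(table).split()
        result.append([word_index[w] for w in words if w in word_index])
    return result

sentence = "I really think."
per_char = toy_texts_to_sequences(sentence, toy_word_index)    # bare string
per_word = toy_texts_to_sequences([sentence], toy_word_index)  # list of one text

print(per_char)
print(per_word)
```

Wrapping the sentence in a list yields a single sequence of word indices, which is the shape the padding step expects.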
