<a href="https://colab.research.google.com/github/mspatke/TensorFlow-NLP-DeepDive/blob/main/sentiment_analysis_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings for Sentiment Analysis

This notebook introduces word embeddings. We train our own embeddings with a simple Keras model on a sentiment classification task.

Steps include:
1. Downloading the data from TensorFlow Datasets.
2. Separating the training and test sentences and labels.
3. Preparing the data as padded sequences.
4. Defining our Keras model with an Embedding layer.
5. Training the model and exploring the weights of the embedding layer.


In [None]:
# import required libraries

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(tf.__version__)

## Downloading the TensorFlow `imdb_reviews` dataset

> Make sure the `tensorflow_datasets` package is installed.
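
One way to check this before loading the data is sketched below. The helper name `is_installed` is hypothetical and only for illustration; it relies on the standard-library `importlib.util.find_spec`, which returns `None` when a top-level module cannot be found.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical pre-flight check; run it before calling tfds.load
tfds_available = is_installed("tensorflow_datasets")
if not tfds_available:
    print("tensorflow_datasets not found; install it with: pip install tensorflow-datasets")
```

In Colab the package is typically preinstalled, so the check mostly matters for local environments.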

In [None]:
data , info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)


In [None]:
# separate the train and test datasets

train_data , test_data = data['train'], data['test']

In [None]:
len(train_data), len(test_data)

In [None]:
# create empty lists to store sentences and labels
train_sentences = []
test_sentences = []

train_labels = []
test_labels = []


# iterate over train_data to extract the sentences and labels;
# each sentence arrives as a bytes tensor, so decode it to a Python string

for sent, label in train_data:
  train_sentences.append(sent.numpy().decode('utf8'))
  train_labels.append(label.numpy())


In [None]:
train_sentences[1]

In [None]:
# repeat the extraction for the test split
for sent, label in test_data:
  test_sentences.append(sent.numpy().decode('utf8'))
  test_labels.append(label.numpy())

In [None]:
test_sentences[1]

In [None]:
# convert the label lists into NumPy arrays
train_labels = np.array(train_labels)
test_labels = np.array(test_labels)

## Data preparation - setting up the tokenizer

In [None]:
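vocab_size = 10000    # vocabulary size passed to the tokenizer and the Embedding layer
embedding_dim = 16    # length of each word vector
max_length = 120      # every review is padded or truncated to this many tokens
trunc_type = 'post'   # drop tokens from the end of reviews that are too long
oov_tok = "<oov>"     # placeholder token for out-of-vocabulary words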

In [None]:
# fit the tokenizer on the training sentences only
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# convert the training sentences to padded integer sequences
train_seq = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_seq, maxlen=max_length, truncating=trunc_type)

# reuse the same tokenizer for the test sentences
test_seq = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_seq, maxlen=max_length, truncating=trunc_type)
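
The `pad_sequences` calls above rely on the default `padding='pre'` together with `truncating='post'`. The pure-Python sketch below mimics that combination on toy sequences to make the behaviour visible. The function `pad_pre_truncate_post` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
def pad_pre_truncate_post(sequences, maxlen, pad_value=0):
    """Toy version of padding='pre' with truncating='post'."""
    padded = []
    for seq in sequences:
        # truncating='post': keep the first maxlen tokens, drop the rest
        kept = list(seq)[:maxlen]
        # padding='pre': fill with pad_value at the front up to maxlen
        padded.append([pad_value] * (maxlen - len(kept)) + kept)
    return padded


toy_sequences = [[5, 8, 2], [7, 3, 9, 4, 6, 1]]
result = pad_pre_truncate_post(toy_sequences, maxlen=4)
print(result)  # [[0, 5, 8, 2], [7, 3, 9, 4]]
```

The short sequence gains a leading zero, while the long one loses its last two tokens, so every row ends up with exactly `maxlen` entries.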



In [None]:
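# map integer indices back to words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    # index 0 is the padding value and has no word, so it is shown as '?'
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

# compare the raw review, its padded sequence, and the decoded sequence
print(train_sentences[1])
print(train_padded[1])
print(decode_review(train_padded[1]))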

## Define the Neural Network with Embedding layer

1. Use the Sequential API.
2. Add an Embedding layer as the input layer, with input dimension equal to the vocabulary size.
3. Add a Flatten layer followed by two Dense layers.
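
As a rough sanity check for the `model.summary()` output further down, the parameter counts of this architecture can be worked out by hand. The sketch below assumes the hyperparameters set earlier (`vocab_size = 10000`, `embedding_dim = 16`, `max_length = 120`) and the Dense layer sizes of 6 and 1 used in the model cell that follows. The variable names are chosen here for illustration.

```python
vocab_size = 10000
embedding_dim = 16
max_length = 120

# Embedding: one embedding_dim-sized vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# Flatten has no weights; it reshapes (max_length, embedding_dim) into one vector
flatten_units = max_length * embedding_dim

# Dense(6): one weight per input unit per neuron, plus one bias per neuron
dense1_params = flatten_units * 6 + 6

# Dense(1): six inputs plus one bias
dense2_params = 6 * 1 + 1

total_params = embedding_params + dense1_params + dense2_params
print(embedding_params, flatten_units, dense1_params, dense2_params, total_params)
```

Almost all trainable parameters sit in the embedding table, which is why the vocabulary size and embedding dimension dominate the model size.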

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

## Model Training

In [None]:
num_epochs = 20

model.fit(train_padded ,
          train_labels,
          epochs = num_epochs,
          validation_data=(test_padded, test_labels))

In [None]:
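# the Embedding layer is the first layer of the model
l1 = model.layers[0]

# its only weight matrix holds one row per vocabulary index
weights = l1.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)
print(weights[0])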

In [None]:
l1.get_weights()[0][0]
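
One possible way to explore the learned weights is to compare the vectors of two words with cosine similarity. The sketch below is self-contained and uses made-up three-dimensional vectors, not values from the trained model. The helper `cosine_similarity` and the `toy_weights` dictionary are assumptions introduced for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for rows of the embedding matrix
toy_weights = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.1, -0.5],
}

sim_pos = cosine_similarity(toy_weights["good"], toy_weights["great"])
sim_neg = cosine_similarity(toy_weights["good"], toy_weights["awful"])
print(round(sim_pos, 3), round(sim_neg, 3))
```

With the trained model, the same function could be applied to rows such as `weights[word_index["good"]]`, provided the word's index is below `vocab_size`. Whether words of similar sentiment actually end up close together depends on the training run.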