# Word Embeddings and Sentiment


An embedding represents each word as a vector in a multi-dimensional space. After training, words that play similar roles tend to form clusters in that space.

Thinking in many dimensions is difficult for humans, but the **TensorFlow Projector** <http://projector.tensorflow.org/> makes it fairly easy to view these clusters in a 3D projection (later Colabs will generate the files the projection tool needs).

# Building a Basic Sentiment Model

- To create our embeddings, we’ll use an embedding layer, `tf.keras.layers.Embedding`.
    - It takes 3 arguments:
      1. the size of the tokenized vocabulary,
      2. the number of embedding dimensions to use,
      3. the input length (the sequence length you fixed with padding and truncation).

The output of this layer has one vector per token, so it needs to be reshaped before it can feed a fully connected layer.
You can do this with a plain `Flatten` layer, or use `GlobalAveragePooling1D`, which averages over the sequence dimension; this takes a little extra computation and sometimes gives better results.
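
To make the difference between the two options concrete, here is a minimal sketch in plain Python rather than Keras. The toy values and the names `embedded`, `flattened`, and `pooled` are assumptions chosen for illustration; the real layers operate on whole batches of tensors.

```python
# Toy embedding output for ONE example: 3 time steps, 2 embedding dimensions.
# (The real layer would produce max_length x embedding_dim values per example.)
embedded = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

# Flatten: concatenate every time step into one long vector (3 * 2 = 6 values).
flattened = [value for step in embedded for value in step]

# Global average pooling: average over the time axis, one value per dimension.
num_steps = len(embedded)
pooled = [sum(column) / num_steps for column in zip(*embedded)]

print(flattened)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(pooled)     # [3.0, 4.0]
```

Flattening keeps every value, so the next layer needs one weight per position and dimension. Pooling keeps only one value per dimension, so the next layer is much smaller but word order is lost.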

In our case, we’re only distinguishing positive from negative sentiment, so a single output node is enough (0 for negative, 1 for positive). Because this is binary classification, you can use a binary cross-entropy loss function.

**Given a vocabulary size of 500, maximum sequence length of 50, and embedding dimension of 16, what is the output shape of the Embedding layer?**

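`(None, 50, 16)`: the batch size is left unspecified (`None`), followed by the sequence length (50) and the embedding dimension (16). The vocabulary size does not appear in the output shape.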
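
The shape can be checked by hand with a small sketch that treats the embedding layer as a lookup table. This is plain Python, not the Keras implementation. The batch size of 4, the random initial values, and the names `table`, `batch`, and `embedded` are assumptions for illustration.

```python
import random

vocab_size, max_length, embedding_dim = 500, 50, 16
batch_size = 4  # arbitrary; Keras reports this dimension as None

rng = random.Random(0)

# Lookup table: one row of embedding_dim values per token id.
table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

# A batch of padded sequences: batch_size rows of max_length token ids.
batch = [[rng.randrange(vocab_size) for _ in range(max_length)]
         for _ in range(batch_size)]

# The lookup replaces each token id with its row from the table.
embedded = [[table[token_id] for token_id in sequence] for sequence in batch]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (4, 50, 16)
```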

# A Note on Embedding Networks

The TensorFlow team has two additional suggestions:

1. They suggest that the final layer not use a sigmoid activation when working with embeddings, especially with just two classes as in our sentiment analysis: `tf.keras.layers.Dense(1)`.

2. Additionally, they suggest that instead of passing the string `binary_crossentropy` as the loss function, you use `tf.keras.losses.BinaryCrossentropy(from_logits=True)`, so the loss is computed from the raw (pre-sigmoid) output.
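
A reason commonly given for working from logits is numerical stability. The sketch below, in plain Python, compares binary cross-entropy computed from a sigmoid probability with an algebraically equivalent form computed directly from the logit. It is not TensorFlow's implementation, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(y_true, p):
    # Cross-entropy computed on a probability (what a sigmoid output gives).
    # If p rounds to exactly 0.0 or 1.0, math.log raises an error here.
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def bce_from_logit(y_true, z):
    # The same loss rewritten in terms of the logit z, so log(0) cannot occur.
    return max(z, 0.0) - z * y_true + math.log1p(math.exp(-abs(z)))

# For a moderate logit, both forms agree.
logit, label = 2.0, 1
loss_a = bce_from_probability(label, sigmoid(logit))
loss_b = bce_from_logit(label, logit)
print(round(loss_a, 6), round(loss_b, 6))

# For a confidently wrong prediction, sigmoid(40.0) rounds to 1.0 in floating
# point, so only the logit form can be evaluated safely.
stable = bce_from_logit(0, 40.0)
print(stable)
```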

# Import TensorFlow and related functions

In [2]:
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Get the dataset

We're going to use a dataset containing Amazon and Yelp reviews, with their related sentiment (1 for positive, 0 for negative). This dataset was originally extracted from [here](https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set).

In [3]:
!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P


In [4]:
import numpy as np 
import pandas as pd

dataset = pd.read_csv('/tmp/sentiment.csv')
print(dataset.shape)
dataset.tail()

(1992, 3)


| Unnamed: 0.1 | Unnamed: 0 | text | sentiment |
|---|---|---|---|
| 1987 | 1987 | I think food should have flavor and texture an... | 0 |
| 1988 | 1988 | Appetite instantly gone. | 0 |
| 1989 | 1989 | Overall I was not impressed and would not go b... | 0 |
| 1990 | 1990 | The whole experience was underwhelming and I t... | 0 |
| 1991 | 1991 | Then as if I hadn't wasted enough of my life t... | 0 |


In [5]:
sentences = dataset.text.tolist()
labels = dataset.sentiment.tolist()
print(sentences[10])
print(labels[10])

And the sound quality is great.
1


In [8]:
# Separate out the sentences and labels into training and test set
training_size = int(len(sentences)*0.8)

training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]

training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

# Make labels into numpy arrays for use with the network later
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)


In [9]:
training_labels_final


array([0, 1, 1, ..., 1, 0, 1])
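
The split above takes the first 80% of rows in file order. If the file happens to group reviews by source or by label (an assumption; the notebook does not say), the test set may not resemble the training set. One alternative is to shuffle sentence and label pairs together before splitting. The sketch below uses hypothetical toy data and names rather than the notebook's variables.

```python
import random

# Hypothetical toy data standing in for the notebook's sentences and labels.
toy_sentences = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

# Shuffle sentence/label pairs together so they stay aligned.
pairs = list(zip(toy_sentences, toy_labels))
random.Random(42).shuffle(pairs)  # fixed seed keeps the split reproducible

split = int(len(pairs) * 0.8)
train_pairs, test_pairs = pairs[:split], pairs[split:]

train_sentences, train_labels = map(list, zip(*train_pairs))
test_sentences, test_labels = map(list, zip(*test_pairs))

print(len(train_sentences), len(test_sentences))  # 8 2
```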

# Tokenize the dataset
Tokenize the sentences, pad or truncate each sequence to a fixed length, and map out-of-vocabulary (OOV) words to a dedicated token.

In [10]:
vocab_size = 1000
embedding_dim = 16
max_length = 100
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"

# Fit the tokenizer on the training sentences only, so test-only words map to OOV
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type,
                       truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type,
                               truncating=trunc_type)
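
To show roughly what the tokenizer and padding steps do, here is a simplified sketch in plain Python. It only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation. The helper names `fit_word_index`, `texts_to_sequences`, and `pad_post` are hypothetical.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    # Index 1 is reserved for the OOV token; the remaining words are numbered
    # from 2 upward by descending frequency. Index 0 is left for padding.
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, num_words, oov_index=1):
    # Words that are unknown, or ranked at or beyond num_words, become OOV.
    sequences = []
    for text in texts:
        ids = [word_index.get(word, oov_index) for word in text.lower().split()]
        sequences.append([i if i < num_words else oov_index for i in ids])
    return sequences

def pad_post(sequences, maxlen, pad_value=0):
    # 'post' truncation keeps the first maxlen ids; 'post' padding appends zeros.
    return [seq[:maxlen] + [pad_value] * (maxlen - len(seq)) for seq in sequences]

train = ["good case excellent value", "good sound"]
index = fit_word_index(train)
toy_padded = pad_post(texts_to_sequences(["good phone case"], index, num_words=100), maxlen=5)
print(toy_padded)  # [[2, 1, 3, 0, 0]]
```

Here "phone" never appeared in the training text, so it becomes the OOV index 1, and the two trailing zeros are padding.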

## Review a Sequence
Let's quickly take a look at one of the padded sequences to ensure everything above worked appropriately.

In [11]:
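# Invert word_index so token ids map back to words
reverse_word_index = {index: word for word, index in word_index.items()}

def decode_review(token_ids):
  # Padding id 0 has no entry in word_index, so it is shown as '?'
  return ' '.join(reverse_word_index.get(i, '?') for i in token_ids)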

print(decode_review(padded[1]))
print(training_sentences[1])

good case excellent value ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Good case Excellent value.


In [12]:

print(decode_review(padded[10]))
print(training_sentences[10])

and the sound quality is great ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
And the sound quality is great.


## Train a Basic Sentiment Model with Embeddings

In [15]:
# Build a basic sentiment network
# Note the embedding layer is first,
# and the output is only 1 node as it is either 0 or 1 (negative or positive)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 16)           16000     
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 6)                 9606      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
Total params: 25,613
Trainable params: 25,613
Non-trainable params: 0
_________________________________________________________________
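
The parameter counts in the summary can be reproduced with a little arithmetic. The sketch below is plain Python; `hidden_units` is a name introduced here for the 6-unit dense layer.

```python
vocab_size, embedding_dim, max_length = 1000, 16, 100
hidden_units = 6

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Flatten has no parameters; it only reshapes to max_length * embedding_dim.
flattened_size = max_length * embedding_dim

# Dense layers: one weight per input per unit, plus one bias per unit.
dense_params = flattened_size * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)  # 16000 9606 7 25613
```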


In [16]:
num_epochs = 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f10dd77dcc0>

## Get files for visualizing the network

The cells below write and download two files, `vecs.tsv` (the embedding vectors) and `meta.tsv` (the matching words), for visualizing how your network "sees" the sentiment related to each word. Head to http://projector.tensorflow.org/, load these files, and then click the "Sphereize" checkbox.

In [17]:
# First get the weights of the embedding layer
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape)

(1000, 16)


In [19]:
weights

array([[-0.08513905, -0.07916204,  0.03986736, ...,  0.05990879,
        -0.06803873, -0.06323504],
       [-0.04556246,  0.03551653, -0.03865337, ..., -0.03001493,
        -0.09691637,  0.01935509],
       [-0.01136928, -0.07080145,  0.01817639, ...,  0.05674662,
         0.02572152,  0.02062324],
       ...,
       [-0.0203247 , -0.08085664,  0.09329598, ...,  0.02630224,
        -0.03243666,  0.05038565],
       [-0.0002204 ,  0.10957355,  0.08335437, ..., -0.02368987,
        -0.03652238,  0.08833388],
       [ 0.1348406 ,  0.00526085,  0.0871685 , ...,  0.05671656,
        -0.15233889, -0.07701384]], dtype=float32)
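
Each row of `weights` is the learned vector for one token id. One common way to compare such vectors without a visualization tool is cosine similarity. The sketch below uses hypothetical 3-dimensional toy vectors rather than the trained weights, and the helper names are assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors; real rows of `weights` have 16 values.
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "excellent": [0.8, 0.2, 0.1],
    "terrible": [-0.9, 0.0, 0.1],
}

def nearest(word, vectors):
    # Rank every other word by cosine similarity to `word`, highest first.
    others = [w for w in vectors if w != word]
    return sorted(others,
                  key=lambda w: cosine_similarity(vectors[word], vectors[w]),
                  reverse=True)

print(nearest("great", toy_vectors))  # ['excellent', 'terrible']
```

With the real model you could build `toy_vectors` from `reverse_word_index` and `weights`, although a network this small trained on about 1,600 sentences may not give clean neighbors.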

In [20]:
import io

# Write out the embedding vectors and metadata, one word per line.
# Index 0 is reserved for padding, so start from 1.
with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join(str(x) for x in embeddings) + "\n")

In [21]:
# Download the files
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')


## Predicting Sentiment in New Reviews

Now that you've trained and visualized your network, take a look below at how we can predict sentiment in new reviews the network has never seen before.

In [23]:
# Use the model to predict a review   
fake_reviews = ['I love this phone', 'I hate spaghetti', 
                'Everything was cold',
                'Everything was hot exactly as I wanted', 
                'Everything was green', 
                'the host seated us immediately',
                'they gave us free chocolate cake', 
                'not sure about the wilted flowers on the table',
                'only works when I stand on tippy toes', 
                'does not work when I stand on my head',
                'You can\'t we right for a wrong person and the right one will find your worth',
                'Some stories are always destined to be left half way']

print(fake_reviews) 


# Create the sequences, padded and truncated the same way as the training data
sample_sequences = tokenizer.texts_to_sequences(fake_reviews)
fakes_padded = pad_sequences(sample_sequences, maxlen=max_length, padding=padding_type,
                             truncating=trunc_type)

print('\nHOT OFF THE PRESS! HERE ARE SOME NEWLY MINTED, ABSOLUTELY GENUINE REVIEWS!\n')

classes = model.predict(fakes_padded)

# The closer the class is to 1, the more positive the review is deemed to be
for review, score in zip(fake_reviews, classes):
  print(review)
  print(score)
  print('\n')

['I love this phone', 'I hate spaghetti', 'Everything was cold', 'Everything was hot exactly as I wanted', 'Everything was green', 'the host seated us immediately', 'they gave us free chocolate cake', 'not sure about the wilted flowers on the table', 'only works when I stand on tippy toes', 'does not work when I stand on my head', "You can't we right for a wrong person and the right one will find your worth", 'Some stories are always destined to be left half way']

HOT OFF THE PRESS! HERE ARE SOME NEWLY MINTED, ABSOLUTELY GENUINE REVIEWS!

I love this phone
[0.9975854]


I hate spaghetti
[0.08118349]


Everything was cold
[0.5009543]


Everything was hot exactly as I wanted
[0.63833845]


Everything was green
[0.5471215]


the host seated us immediately
[0.6866338]


they gave us free chocolate cake
[0.9074721]


not sure about the wilted flowers on the table
[0.02860391]


only works when I stand on tippy toes
[0.9821178]


does not work when I stand on my head
[0.01599413]


You can'