# Introduction to Natural Language Processing (NLP) in TensorFlow

### Word Embeddings

Word embeddings, or word vectors, provide a way of mapping words from a vocabulary into a low-dimensional space, where words with similar meanings are close together. Let's play around with a set of pre-trained word vectors, to get used to their properties. There exist many sets of pretrained word embeddings; here, we use ConceptNet Numberbatch, which provides a relatively small download in an easy-to-work-with format (h5).

To read an `h5` file, we'll need to use the `h5py` package. Below, we use the package to open the `mini.h5` file we just downloaded. We extract from the file a list of utf-8-encoded words, as well as their $300$-dimensional vectors.

In [1]:
!pip install h5py



In [2]:
# Load the file and pull out words and embeddings
import h5py

with h5py.File('datasets/mini.h5', 'r') as f:
    all_words = [word.decode('utf-8') for word in f['mat']['axis1'][:]]
    all_embeddings = f['mat']['block0_values'][:]
    
print("all_words dimensions: {0}".format(len(all_words)))
print("all_embeddings dimensions: {0}".format(all_embeddings.shape))

print(all_words[1337])

all_words dimensions: 362891
all_embeddings dimensions: (362891, 300)
/c/de/aufmachung


**Explanation:**

- The code loads the `mini.h5` dataset using the `h5py` package.
- It extracts all words and their corresponding embeddings from the dataset.
- `all_words` is a list of strings: the words are stored as utf-8-encoded bytes, which are decoded on load.
- `all_embeddings` is a matrix where each row corresponds to the embedding of a word.
- The dimensions of both the list and matrix are printed to get a sense of their sizes.
- It prints the word at index 1337 as an example.

Now, `all_words` is a list of $V$ strings (what we call our *vocabulary*), and `all_embeddings` is a $V \times 300$ matrix. The strings are of the form `/c/language_code/word`—for example, `/c/en/cat` and `/c/es/gato`.

We are interested only in the English words. We use Python list comprehensions to pull out the indices of the English words, then extract just the English words (stripping the six-character `/c/en/` prefix) and their embeddings.

In [3]:
# Restrict our vocabulary to just the English words
english_words = [word[6:] for word in all_words if word.startswith('/c/en/')]
english_word_indices = [i for i, word in enumerate(all_words) if word.startswith('/c/en/')]
english_embeddings = all_embeddings[english_word_indices]

print("english_words dimensions: {0}".format(len(english_words)))
print("english_embeddings dimensions: {0}".format(english_embeddings.shape))

print(english_words[1337])

# To focus on semantics (meaning), it helps to normalize our vectors,
# which means rescaling them so that they all have a length of 1.
# After normalization, all word vectors lie on a unit hypersphere, and
# the dot product of two vectors equals the cosine of the angle between them,
# giving a measure of their similarity.

english_words dimensions: 150875
english_embeddings dimensions: (150875, 300)
activated_carbon


**Explanation:**

- This cell filters out only the English words and their embeddings.
- Words starting with `/c/en/` are identified as English. The first 6 characters `/c/en/` are stripped to retain only the word.
- Indices of English words are stored in `english_word_indices`.
- Using these indices, the corresponding embeddings are extracted to `english_embeddings`.
- Dimensions and an example word are printed for verification.

The magnitude of a word vector is less important than its direction; the magnitude can be thought of as representing frequency of use, independent of the semantics of the word. 
Here, we will be interested in semantics, so we *normalize* our vectors, dividing each by its length. 
The result is that all of our word vectors have length 1 and, as such, lie on a unit hypersphere (the 300-dimensional analogue of the unit circle). 
For unit-length vectors, the dot product equals the cosine of the angle between them, and provides a measure of similarity (the bigger the cosine, the smaller the angle).

<img src="Figures/cosine_similarity.png" alt="cosine" style="width: 500px;"/>
<center>Figure adapted from *[Mastering Machine Learning with Spark 2.x](https://www.safaribooksonline.com/library/view/mastering-machine-learning/9781785283451/ba8bef27-953e-42a4-8180-cea152af8118.xhtml)*</center>

In [4]:
import numpy as np

norms = np.linalg.norm(english_embeddings, axis=1)
normalized_embeddings = english_embeddings.astype('float32') / norms.astype('float32').reshape([-1, 1])


**Explanation:**

- The cell normalizes the English word embeddings.
- It computes the norms (lengths) of the embeddings using `np.linalg.norm`.
- Each embedding is then divided by its norm to normalize it.
- The result, `normalized_embeddings`, contains vectors of length 1.

The `np.linalg.norm` function computes the Euclidean norm (or length) of each word vector in `english_embeddings`. The `axis=1` argument ensures that the norm is computed for each row (word vector) individually. The result, `norms`, is an array with one length per word vector.

Each word vector in `english_embeddings` is divided by its corresponding norm to normalize it. The `reshape([-1, 1])` call turns `norms` into a column vector, so that NumPy broadcasting divides every element of a row by that row's own norm. The `astype('float32')` calls cast both operands to 32-bit floats, so the division is done in floating-point arithmetic.

The result is stored in `normalized_embeddings`, which contains the normalized word vectors, each of length 1.
After this step, we'll have word vectors that are standardized in length, allowing us to focus on their direction (or angle) when measuring similarity between words.
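
To make the broadcasting step concrete, here is a minimal sketch on a toy matrix. The names `toy_embeddings`, `toy_norms`, and `toy_normalized` are hypothetical stand-ins, not part of the notebook; the real `english_embeddings` matrix is much larger but is handled the same way.

```python
import numpy as np

# Toy stand-in for `english_embeddings`: 3 vectors of dimension 2.
toy_embeddings = np.array([[3.0, 4.0],
                           [1.0, 0.0],
                           [0.0, 2.0]], dtype='float32')

# One Euclidean norm per row; the result has shape (3,).
toy_norms = np.linalg.norm(toy_embeddings, axis=1)

# Reshape to (3, 1) so that broadcasting divides each row by its own norm.
toy_normalized = toy_embeddings / toy_norms.reshape([-1, 1])

# After normalization every row has length 1, so the dot product of
# two rows is the cosine of the angle between them.
cosine = np.dot(toy_normalized[0], toy_normalized[1])

print(toy_norms)
print(np.linalg.norm(toy_normalized, axis=1))
print(cosine)
```

Without the reshape, NumPy would try to broadcast a length-3 array against rows of length 2 and raise a shape error.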

## Creating a Dictionary for Word Lookup

This dictionary will map words to their indices in the word embeddings matrix. Such a dictionary is useful when you want to quickly retrieve the embedding of a specific word without searching through the entire list.

By constructing this dictionary, the process of finding the vector representation for any given word becomes efficient. Instead of linearly searching through the list of words, the dictionary provides constant-time (O(1)) lookup.

In [5]:
index = {word: i for i, word in enumerate(english_words)}

- The code above constructs a dictionary named `index` using a dictionary comprehension.
- The `enumerate` function is used to loop over `english_words` and generate both the word and its corresponding index.
- Each word (`word`) is used as a key in the dictionary, and its index (`i`) is used as the associated value.
- The resulting dictionary allows for efficient lookups: given a word, you can quickly find its index in the `english_words` list (and, by extension, its corresponding embedding in the `normalized_embeddings` matrix).

This dictionary will be invaluable when you want to quickly retrieve the embedding of a specific word or perform operations based on word embeddings.

*The dot product is a mathematical operation that takes two equal-length sequences of numbers and returns a single number. In the context of word embeddings, the dot product between two normalized word vectors measures the cosine of the angle between them. Since the vectors are normalized (having a magnitude of 1), this dot product directly gives the cosine similarity, which is a measure of how similar the two vectors (and thus the words they represent) are. A value close to 1 indicates high similarity, while a value close to -1 indicates high dissimilarity.*

In [6]:
def similarity_score(w1, w2):
    score = np.dot(normalized_embeddings[index[w1], :], normalized_embeddings[index[w2], :])
    return score

# A word is as similar with itself as possible:
print('cat\tcat\t', similarity_score('cat', 'cat'))

# Closely related words still get high scores:
print('cat\tfeline\t', similarity_score('cat', 'feline'))
print('cat\tdog\t', similarity_score('cat', 'dog'))

# Unrelated words, not so much:
print('cat\tmoo\t', similarity_score('cat', 'moo'))
print('cat\tfreeze\t', similarity_score('cat', 'freeze'))


cat	cat	 1.0
cat	feline	 0.8199548
cat	dog	 0.590724
cat	moo	 0.0039538294
cat	freeze	 -0.030225191


**Function Definition:**
- The similarity_score function calculates the cosine similarity between two words (w1 and w2) using their normalized embeddings.
- It retrieves the embeddings of the words from the normalized_embeddings matrix using the indices provided by the index dictionary.
- The np.dot function computes the dot product of the two embeddings, which, since they're normalized, directly gives the cosine similarity.

**Testing the Function:**
The function is then tested on various word pairs:
- A word with itself (cat vs. cat): As expected, the similarity score is 1 because a word is maximally similar to itself.
- Semantically related words (cat vs. feline and cat vs. dog): The scores are expected to be high as the words are closely related.
- Unrelated words (cat vs. moo and cat vs. freeze): The scores should be considerably lower, reflecting the lack of semantic relation between the pairs.

This function provides a convenient way to measure how semantically similar two words are based on their embeddings.

# Finding Most Similar Words

Using the cosine similarity measure (or dot product for normalized vectors), it's possible to determine which words in the vocabulary are most similar to a specific word. This is typically achieved by computing the similarity score between the given word and every other word in the vocabulary and then ranking the words based on these scores.

In [7]:
def closest_to_vector(v, n):
    all_scores = np.dot(normalized_embeddings, v)
    best_words = map(lambda i: english_words[i], reversed(np.argsort(all_scores)))
    return [next(best_words) for _ in range(n)]

def most_similar(w, n):
    return closest_to_vector(normalized_embeddings[index[w], :], n)


**Explanation:**

**closest_to_vector(v, n) Function:**

- This function finds the n words that have the most similar embeddings to a given vector v.
- It computes the dot product (or cosine similarity since vectors are normalized) between the given vector v and all the word embeddings in normalized_embeddings using np.dot.
- The scores are then sorted in descending order using np.argsort (which returns indices in ascending order, hence the use of reversed).
- Using the sorted indices, the corresponding words are fetched from the english_words list.
- The function finally returns the top n words that are most similar to the input vector v.

**most_similar(w, n) Function:**

- This function is a more user-friendly interface that finds the n words most similar to a given word w.
- It retrieves the normalized embedding of the word w using the index dictionary.
- It then calls the closest_to_vector function to get the n most similar words.

With these functions, you can easily find which words in the vocabulary are most similar to any given word, based on their vector representations.
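
The ranking logic can be checked by hand on a tiny vocabulary. The sketch below is illustrative only: `toy_words`, `toy_vectors`, and `toy_closest_to_vector` are hypothetical names with made-up 2-dimensional vectors, and the slicing idiom `[::-1][:n]` is used in place of `reversed` and `next`.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings" for four words.
toy_words = ['cat', 'kitten', 'dog', 'car']
toy_vectors = np.array([[1.0, 0.0],
                        [0.9, 0.1],
                        [0.6, 0.8],
                        [-1.0, 0.2]])
# Normalize each row to unit length.
toy_vectors /= np.linalg.norm(toy_vectors, axis=1, keepdims=True)

def toy_closest_to_vector(v, n):
    # Cosine similarity of v with every word (rows are unit length).
    scores = toy_vectors @ v
    # argsort is ascending, so reverse it and keep the first n indices.
    top = np.argsort(scores)[::-1][:n]
    return [toy_words[i] for i in top]

result = toy_closest_to_vector(toy_vectors[0], 3)
print(result)
```

As in the notebook's version, the query word itself comes back first, since no vector is more similar to it than its own.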

In [8]:
#finding most similar words for specific inputs
print(most_similar('cat', 10))
print(most_similar('dog', 10))
print(most_similar('duke', 10))

['cat', 'humane_society', 'kitten', 'feline', 'colocolo', 'cats', 'kitty', 'maine_coon', 'housecat', 'sharp_teeth']
['dog', 'dogs', 'wire_haired_dachshund', 'doggy_paddle', 'lhasa_apso', 'good_friend', 'puppy_dog', 'bichon_frise', 'woof_woof', 'golden_retrievers']
['duke', 'dukes', 'duchess', 'duchesses', 'ducal', 'dukedom', 'duchy', 'voivode', 'princes', 'prince']


**Explanation:**

- The function `most_similar` is invoked three times with different input words: 'cat', 'dog', and 'duke'.
- For each word, the function returns a list of the 10 words that are most similar to the input word based on their embeddings.
- The results are printed to provide insight into which words in the vocabulary are deemed most similar to the input words.

By examining the output, you can gauge the effectiveness of the embeddings in capturing semantic similarities between words.

# Solving Analogies with Word Embeddings
The method uses vector arithmetic to construct a new vector ourselves, then finds the words whose embeddings lie nearest to it.

**Explanation:**
Word embeddings have a fascinating property where semantic relationships between words can often be captured using vector arithmetic. For instance, the analogy "man : brother :: woman : ?" can be expressed as the equation: "brother - man + woman". In words, this means: start with the meaning of "brother", subtract the meaning of "man", and add the meaning of "woman". This vector arithmetic often results in a vector close to the word that solves the analogy, in this case, potentially "sister".

By using the closest_to_vector function, we can determine which words in the embedding space are closest to this new vector, and thus, solve the analogy.

In [9]:
def solve_analogy(a1, b1, a2):
    b2 = normalized_embeddings[index[b1], :] - normalized_embeddings[index[a1], :] + normalized_embeddings[index[a2], :]
    return closest_to_vector(b2, 1)

print(solve_analogy("man", "brother", "woman"))
print(solve_analogy("man", "husband", "woman"))
print(solve_analogy("spain", "madrid", "france"))


['sister']
['wife']
['paris']


**Function Definition:**
- The function `solve_analogy` is defined to solve analogies of the form a1:b1::a2:?.
- The logic involves using vector arithmetic: `b2 = embedding(b1) - embedding(a1) + embedding(a2)`. This computes the vector that represents the unknown word in the analogy.
- The function then calls `closest_to_vector` with the computed vector `b2` to find the word in the vocabulary that is closest to this computed vector. This word is the solution to the analogy.

**Testing the Function:**
The function is tested on three different analogies:
- `man:brother::woman:?`
- `man:husband::woman:?`
- `spain:madrid::france:?`

The results are printed to see which words the model predicts as the solutions to these analogies.

By examining the output, you can gauge the ability of the word embeddings to capture semantic relationships and solve word analogies.


In [10]:
#testing analogies
def solve_analogy(a1, b1, a2):
    b2 = normalized_embeddings[index[b1], :] - normalized_embeddings[index[a1], :] + normalized_embeddings[index[a2], :]
    return closest_to_vector(b2, 1)

print(solve_analogy("pen", "paper", "knife"))
print(solve_analogy("philippines", "manila", "america"))
print(solve_analogy("bottle", "liquid", "shelf"))

['knife']
['america']
['shelf']


*Above: a disappointing result. In each case the function simply returns the third input word.*

**Reasons for the Unexpected Results:**
- **Embedding Quality:** Not all word embeddings capture every semantic relationship equally well. The quality and ability of embeddings to solve analogies depend on the data they were trained on and the method used.

- **Limitations of Vector Arithmetic for Analogies:** While vector arithmetic can capture many semantic relationships, it isn't perfect and doesn't always work for every analogy.

- **Vocabulary and Training Data:** The embeddings in the mini.h5 file might be from a limited vocabulary or might not have been trained on a diverse enough dataset to capture all types of relationships.

- **Nature of Analogies:** Analogies are complex and can be interpreted in multiple ways. The relationships in these examples are less common in analogy datasets than the classic "man:king::woman:queen" type.

**Potential Solutions:**
- Different Embeddings
- Fine tuning
- Explicit Relationship Models
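
One further factor may be worth checking: because `closest_to_vector` searches the whole vocabulary, the nearest word to `b1 - a1 + a2` can be `a2` itself whenever the offset `b1 - a1` is small. A common workaround is to exclude the three input words from the candidates. The sketch below illustrates this on made-up 3-dimensional vectors; `toy_words`, `toy_vectors`, `toy_index`, and `toy_solve_analogy` are hypothetical names, and whether this accounts for the results above on the real embeddings is an assumption that would need to be tested.

```python
import numpy as np

# Hypothetical toy vocabulary with hand-picked vectors (not real embeddings).
toy_words = ['man', 'woman', 'brother', 'sister', 'table']
toy_vectors = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [1.0, 0.0, 0.2],
                        [0.3, 1.0, 0.2],
                        [0.0, 0.0, 1.0]])
toy_vectors /= np.linalg.norm(toy_vectors, axis=1, keepdims=True)
toy_index = {w: i for i, w in enumerate(toy_words)}

def toy_solve_analogy(a1, b1, a2, exclude_inputs=True):
    # Same arithmetic as solve_analogy: b1 - a1 + a2.
    target = (toy_vectors[toy_index[b1]]
              - toy_vectors[toy_index[a1]]
              + toy_vectors[toy_index[a2]])
    scores = toy_vectors @ target
    if exclude_inputs:
        # Rule out the three query words as possible answers.
        for w in (a1, b1, a2):
            scores[toy_index[w]] = -np.inf
    return toy_words[int(np.argmax(scores))]

naive = toy_solve_analogy('man', 'brother', 'woman', exclude_inputs=False)
filtered = toy_solve_analogy('man', 'brother', 'woman', exclude_inputs=True)
print(naive, filtered)
```

In this toy setup the unfiltered search returns `woman`, the third input, while the filtered search returns `sister`.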


# Using word embeddings in deep models

**Continuous Space for Words:** Word embeddings enable us to perceive words as existing in a continuous, Euclidean space. This representation allows words to be treated similarly to continuous numerical data, which facilitates the use of various machine learning techniques that are tailored for such data.

**Application - Sentiment Analysis:** The notebook proposes an experiment involving sentiment analysis on a collection of movie reviews. Sentiment analysis aims to determine the mood or sentiment of a piece of text, such as identifying whether a movie review is positive or negative.

**Using Word Embeddings in Models:** To perform sentiment analysis, word embeddings can be employed as features in machine learning models, like logistic regression or neural networks. The embeddings provide a dense representation of words, which can help models capture semantic nuances in the text.

The subsequent cells will delve deeper into the process of sentiment analysis and demonstrate how to utilize word embeddings in building a sentiment analysis model.

In [11]:
#preprocessing movie reviews
import string
remove_punct=str.maketrans('','',string.punctuation)

# This function converts a line of our data file into
# a dict {'x': x, 'y': y}, where x is a 300-dimensional representation
# (the mean embedding) of the words in a review, and y is its label.
def convert_line_to_example(line):
    # Pull out the first character: that's our label (0 or 1)
    y = int(line[0])
    
    # Split the line into words using Python's split() function
    words = line[2:].translate(remove_punct).lower().split()
    
    # Look up the embeddings of each word, ignoring words not
    # in our pretrained vocabulary.
    embeddings = [normalized_embeddings[index[w]] for w in words
                  if w in index]
    
    # Take the mean of the embeddings
    x = np.mean(np.vstack(embeddings), axis=0)
    return {'x': x, 'y': y}

# Apply the function to each line in the file.
with open("movie-simple.txt", "r", encoding='utf-8', errors='ignore') as f:
    dataset = [convert_line_to_example(l) for l in f.readlines()]

**Explanation:**

**Removing Punctuation:**
- The `string.punctuation` variable contains all punctuation characters. Using `str.maketrans()`, a translation table is created that will be used to remove all punctuation from a string.

**Function - `convert_line_to_example(line)`:**
- This function processes a line from the dataset.
- **Label Extraction**: The first character of the line is expected to be the label (0 or 1), indicating whether the review is negative or positive, respectively.
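- **Text Processing**: The function then processes the review text:
    - Punctuation is removed using the translation table created earlier.
    - The text is converted to lowercase.
    - The review is split into individual words.
- **Embedding Lookup**: Each word found in `index` is mapped to its normalized embedding; words outside the pretrained vocabulary are ignored.
- **Averaging**: The embeddings are averaged into a single 300-dimensional vector `x`.
- The function returns a dictionary `{'x': x, 'y': y}`, where `x` is the averaged embedding of the review and `y` is the label.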

This function is a crucial step in preparing the data for training. It ensures that the input text is cleaned and converted into a format amenable for training machine learning models.
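
The same pipeline can be traced on a single line with a three-word toy vocabulary. Everything named `toy_*` below is hypothetical, and the vectors are made up. The guard for reviews with no known words is an addition for illustration; the notebook's function does not include it.

```python
import string
import numpy as np

# Hypothetical 2-dimensional embeddings for a three-word vocabulary.
toy_index = {'great': 0, 'movie': 1, 'boring': 2}
toy_vectors = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [-1.0, 0.0]])
remove_punct = str.maketrans('', '', string.punctuation)

def toy_convert_line(line):
    # First character is the label; the review text starts at position 2.
    y = int(line[0])
    words = line[2:].translate(remove_punct).lower().split()
    # Keep only the words we have embeddings for.
    found = [toy_vectors[toy_index[w]] for w in words if w in toy_index]
    if not found:
        # Added guard (assumption): np.vstack fails on an empty list.
        return None
    x = np.mean(np.vstack(found), axis=0)
    return {'x': x, 'y': y}

example = toy_convert_line("1 Great movie, really GREAT!")
print(example)
```

Here "really" is skipped because it is not in the toy vocabulary, so `x` is the mean of the vectors for "great", "movie", and "great" again.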

In [12]:
len(dataset)
# checks the length (or number of entries) of the dataset variable. 
#This is a common practice to understand the size of the dataset you're working with, 
#especially before processing or training.
#The output will give us the total number of movie reviews in the dataset.

1411

# Train/Test Split
- **Shuffling the Dataset:** Before splitting the data into training and test sets, it's a common practice to shuffle the dataset. This ensures that the training and test data are random samples and do not contain any inherent order that might affect the model's performance.

- **Train/Test Split:** The dataset will be divided into two parts:
    - Training Set: Used for training the model.
    - Test Set: Used for evaluating the model's performance on unseen data.

Typically, a common ratio like 75%-25% or 80%-20% is used for the train-test split. Here, the notebook mentions using three-quarters of the dataset for training and a quarter for testing.

- **Whole Number of Batches:** The cell also notes the intention to ensure that the training set size is a multiple of the batch size. This simplifies the batching process during training.

The subsequent code cells will demonstrate how to shuffle the dataset and perform the train/test split.

In [13]:
import random
random.shuffle(dataset)

batch_size = 100
total_batches = len(dataset) // batch_size
train_batches = 3*total_batches // 4 
train, test = dataset[:train_batches*batch_size], dataset[train_batches*batch_size:]


**Explanation:**

**Data Shuffling:**
- The `random.shuffle()` function is utilized to shuffle the entries of the dataset in place. This randomizes the order of the movie reviews.

**Setting Batch Size:**
- A `batch_size` of 100 is defined, indicating that the model will be trained using batches of 100 reviews at a time.

**Computing Total Batches:**
- The total number of full batches in the dataset is computed using integer division. With 1411 reviews and a batch size of 100, this gives 14 full batches; the 11 leftover reviews do not fill a batch.

**Splitting into Training and Test Sets:**
- Three-quarters of the full batches, rounded down (`train_batches`), are allocated for training: `3*14 // 4 = 10` batches, or 1000 reviews.
- The `test` set takes everything after those first 1000 reviews, which here is the remaining 411 reviews.

By following this approach, the notebook ensures that the training set is a whole number of batches, which is convenient for batch processing during model training. The test set receives whatever is left over, so its size is not necessarily a multiple of the batch size.
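
The split sizes can be worked out with plain integer arithmetic. This is a small sketch using the dataset length of 1411 reported earlier; `n_reviews`, `n_train`, and `n_test` are names introduced here for illustration.

```python
n_reviews = 1411      # len(dataset), as printed earlier
batch_size = 100

total_batches = n_reviews // batch_size       # full batches only
train_batches = 3 * total_batches // 4        # three-quarters, rounded down

n_train = train_batches * batch_size          # size of dataset[:n_train]
n_test = n_reviews - n_train                  # size of dataset[n_train:]

print(total_batches, train_batches, n_train, n_test)
```

Note that `3 * total_batches // 4` multiplies before dividing, so the rounding happens only once.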

# Building the MLP with TensorFlow
**Inputs X and y:** In TensorFlow 1.x, one would define placeholders for the input data X (the averaged review embeddings) and the labels y. Placeholders are symbolic variables that are fed actual data at runtime. The code below uses the TensorFlow 2.x Keras API instead, where an `Input` layer declares the 300-dimensional input shape and data is passed directly to the model.

**MLP Architecture:** The MLP has two hidden layers with 100 and 20 units (both with ReLU activations) and a single sigmoid output unit. It is compiled with the Adam optimizer, a binary cross-entropy loss, and an accuracy metric.

The following code cells will demonstrate how to set up this TensorFlow model.

In [14]:
import tensorflow as tf

# Revised code for TensorFlow 2.x using Keras API

# Define the model using Keras Sequential API
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(300,)),  # Input layer
    tf.keras.layers.Dense(100, activation='relu'),  # First hidden layer
    tf.keras.layers.Dense(20, activation='relu'),  # Second hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid')  # Output layer
])

# Compile the model with binary cross-entropy loss and accuracy metric
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Display the model summary
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 100)               30100     
                                                                 
 dense_1 (Dense)             (None, 20)                2020      
                                                                 
 dense_2 (Dense)             (None, 1)                 21        
                                                                 
Total params: 32,141
Trainable params: 32,141
Non-trainable params: 0
_________________________________________________________________


**Explanation:**

**Importing TensorFlow:** The TensorFlow library is imported, which provides the necessary tools and functions to define, train, and evaluate neural network models.

**Input Placeholders:**
`tf.keras.layers.Input(shape=(300,))`
Placeholders (`tf.placeholder`) are not used in TensorFlow 2.x, which executes eagerly by default instead of requiring a static computational graph to be built first. You define a model and pass data directly to it; the `Input` layer only declares the expected input shape.

**MLP Architecture:**
- `tf.keras.layers.Dense(100, activation='relu')`: The first hidden layer, with 100 neurons and the ReLU activation function. It takes the 300-dimensional input and transforms it.
- `tf.keras.layers.Dense(20, activation='relu')`: The second hidden layer, with 20 neurons and the ReLU activation function. It takes the output of the first hidden layer as its input.
- `tf.keras.layers.Dense(1, activation='sigmoid')`: The output layer. Its single neuron applies a sigmoid, so it outputs the predicted probability that a review is positive rather than a raw score. One neuron suffices since this is a binary classification task.

**Loss and Metrics:**

- **loss:** Computes the binary cross-entropy between the predicted probabilities and the true labels. Because the output layer already applies a sigmoid, the loss operates on probabilities, not logits.
- **accuracy:** Computes the classification accuracy. It thresholds the predicted probabilities at 0.5 to get binary predictions and then compares them to the true labels.

This setup provides a complete three-layer MLP architecture for sentiment analysis, along with the necessary components to train and evaluate the model.
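
The parameter counts in the model summary can be reproduced by hand: a dense layer has one weight per input-output pair plus one bias per output. The helper `dense_params` below is a hypothetical name introduced for this check, not a Keras function.

```python
def dense_params(n_in, n_out):
    # Weight matrix (n_in x n_out) plus one bias per output unit.
    return n_in * n_out + n_out

# Input size followed by the size of each Dense layer.
layer_sizes = [300, 100, 20, 1]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```

The three values match the `Param #` column of the summary, and their sum matches the reported total of 32,141.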

# Training the MLP

**Session Initiation:** In TensorFlow 1.x, executing the computational graph requires starting a TensorFlow session. In TensorFlow 2.x (as discussed previously), sessions and placeholders have been removed in favor of eager execution, so the Keras `fit` and `evaluate` methods are called directly.

**Training Epochs:** The model will be trained for 250 epochs. An epoch is one complete pass over all the training examples, with a forward and backward pass for each batch.

**Evaluation:** After training, the model's accuracy will be evaluated on the test data to determine its performance on unseen samples.

The next code cell trains the model for the specified number of epochs and then evaluates it on the test data.

In [15]:
# Assumes the model has been defined and compiled as shown above.
# Note: this re-splits the (already shuffled) dataset 80/20 instead of
# reusing the batch-aligned `train`/`test` split from the earlier cell.
split_ratio = 0.8
split_index = int(len(dataset) * split_ratio)
train_data = dataset[:split_index]
test_data = dataset[split_index:]

# Training Data Preparation
reviews_train = np.array([sample['x'] for sample in train_data])
labels_train = np.array([sample['y'] for sample in train_data]).reshape([-1,1])

# Test Data Preparation
reviews_test = np.array([sample['x'] for sample in test_data])
labels_test = np.array([sample['y'] for sample in test_data]).reshape([-1,1])

# Train the model for 250 epochs
history = model.fit(reviews_train, labels_train, epochs=250, batch_size=100, shuffle=True, validation_data=(reviews_test, labels_test))

# Evaluate the model's performance on test data
loss, acc = model.evaluate(reviews_test, labels_test)
print("Final accuracy on test data: {:.2f}%".format(acc * 100))


Epoch 1/250
... (epochs 2 through 249 omitted) ...
Epoch 250/250
Final accuracy on test data: 95.41%


We can now examine what our model has learned, seeing how it responds to word vectors for different words:

In [16]:
# Check some words using TensorFlow 2.x
words_to_test = ["exciting", "hated", "boring", "loved"]

for word in words_to_test:
    word_embedding = normalized_embeddings[index[word]].reshape(1, 300)
    predicted_prob = model.predict(word_embedding)
    print(word, predicted_prob)


exciting [[1.]]
hated [[2.8573204e-31]]
boring [[2.2877814e-24]]
loved [[1.]]


**Explanation:**

- We loop over the words in words_to_test.
- For each word, we fetch its embedding and reshape it to be compatible with the model's input shape.
- We use the `model.predict` method to get the model's output probability for the given word embedding.
- Finally, the word and its corresponding predicted probability are printed.

The outputs are the model's predicted probabilities of positive sentiment for the words "exciting", "hated", "boring", and "loved". Since the model was trained on averaged review embeddings, feeding it a single word's embedding is like scoring a one-word review:

- **exciting:** The model predicted a probability of 1.0 (or very close to 1), suggesting that it associates the word "exciting" with a positive sentiment.
- **hated:** The model predicted a probability extremely close to 0 (about 2.9e-31), suggesting that it associates the word "hated" with a negative sentiment.
- **boring:** Similarly, "boring" gets a probability close to 0 (about 2.3e-24), suggesting a negative sentiment.
- **loved:** The probability for "loved" is 1.0, indicating a strong positive sentiment.

These results are in line with what we'd expect. Words like "exciting" and "loved" are associated with positive sentiments, while words like "hated" and "boring" are linked to negative sentiments.

The predictions align with our intuitive understanding of these words' sentiments, which suggests that the model has learned a meaningful mapping from embeddings to sentiment on this dataset.

In [17]:
# Testing my own words
words_to_test = ["inspired", "annoying", "sad", "recommend"]

for word in words_to_test:
    word_embedding = normalized_embeddings[index[word]].reshape(1, 300)
    predicted_prob = model.predict(word_embedding)
    print(word, predicted_prob)

inspired [[1.]]
annoying [[7.6847117e-25]]
sad [[1.1678653e-16]]
recommend [[0.9999999]]


This model works great for such a simple dataset, but does a little less well on something more complex. `movie-pang02.txt`, for instance, has 2000 longer, more complex movie reviews. It's in the same format as our simple dataset. On those longer reviews, this model achieves only 60-80% accuracy. (Increasing the number of epochs to, say, 1000, does help.)

### Recurrent Neural Networks (RNNs)

In the context of deep learning, natural language is commonly modeled with Recurrent Neural Networks (RNNs).
RNNs pass the output of a neuron back to the input of the next time step of the same neuron.
These directed cycles in the RNN architecture give it the ability to model temporal dynamics, making RNNs particularly well suited to modeling sequences (e.g., text).
We can visualize an RNN layer as follows:

<img src="Figures/basic_RNN.PNG" alt="basic_RNN" style="width: 80px;"/>
<center>Figure from *Understanding LSTMs*. https://colah.github.io/posts/2015-08-Understanding-LSTMs/</center>

We can unroll an RNN through time, making the sequence aspect of them more obvious:

<img src="Figures/unrolled_RNN.PNG" alt="basic_RNN" style="width: 400px;"/>
<center>Figure from *Understanding LSTMs*. https://colah.github.io/posts/2015-08-Understanding-LSTMs/</center>

#### RNNs in TensorFlow
How would we implement an RNN in TensorFlow? Given the different forms of RNNs, there are quite a few ways, but we'll stick to a simple one. 

In [19]:
# As always, import TensorFlow first
import tensorflow as tf


Let's assume we have our inputs in word embedding form already, say of dimensionality 100. We'll use a minibatch size of 16.

In [20]:
mb = 16
x_dim = 100

# In TensorFlow 2.x, you don't define placeholders. Instead, when defining a model:
# model = tf.keras.Sequential([...])
# You'll specify the input shape in the first layer:
# tf.keras.layers.Input(shape=(x_dim,))


**Explanation:**

- The code sets two variables, `mb` and `x_dim`, representing the mini-batch size and the dimensionality of the word embeddings, respectively, as described in the preceding markdown cell.
- No tensors are created yet: in TensorFlow 2.x there are no placeholders, and the input shape is declared when the model is built.

In [21]:
h_dim = 64

# For projecting the input
U = tf.Variable(tf.random.truncated_normal([x_dim, h_dim], stddev=0.1))

# For projecting the previous state
W = tf.Variable(tf.random.truncated_normal([h_dim, h_dim], stddev=0.1))

# For projecting the output
V = tf.Variable(tf.random.truncated_normal([h_dim, x_dim], stddev=0.1))


**Explanation:**

- The variable `h_dim` is set to 64, which is the dimension of the hidden layer or state in the RNN.
- `U`: This is the weight matrix for projecting the input. Its shape is `[x_dim, h_dim]`, meaning it will transform data from the input dimension `x_dim` to the hidden layer dimension `h_dim`.
- `W`: This is the weight matrix for projecting the previous state. Its shape is `[h_dim, h_dim]`, allowing it to transform the hidden state from one time step to the next.
- `V`: This is the weight matrix for projecting the output. It will transform data from the hidden layer dimension `h_dim` back to the input dimension `x_dim`.
- All these matrices are initialized with truncated normal distributions having a standard deviation of 0.1, which is a common initialization technique.
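To make "truncated normal" concrete: `tf.random.truncated_normal` redraws any sample that lands more than two standard deviations from the mean. The NumPy sketch below mimics that behavior by resampling; the helper name `truncated_normal` and the seed are illustrative choices, not part of the original notebook.

```python
import numpy as np

def truncated_normal(shape, stddev=0.1, seed=0):
    # Draw from N(0, stddev^2), redrawing any value farther than 2*stddev from 0
    rng = np.random.default_rng(seed)
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2 * stddev
    while out_of_range.any():
        values[out_of_range] = rng.normal(0.0, stddev, size=int(out_of_range.sum()))
        out_of_range = np.abs(values) > 2 * stddev
    return values

x_dim, h_dim = 100, 64
U_sketch = truncated_normal((x_dim, h_dim))

print(U_sketch.shape)
print(np.abs(U_sketch).max() <= 0.2)  # every weight lies within 2 * stddev
```

Clipping the tails this way keeps any single initial weight from being unusually large.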

In [22]:
def RNN_step(x, h):
    h_next = tf.tanh(tf.matmul(x, U) + tf.matmul(h, W))
    
    output = tf.matmul(h_next, V)
    return output, h_next


**Explanation:**

- This function, `RNN_step`, takes in two parameters: `x`, the input data for the current time step, and `h`, the hidden state from the previous time step.
- The function calculates `h_next`, the next hidden state, by:
    - Multiplying the input `x` by the weight matrix `U` and the previous hidden state `h` by the weight matrix `W`.
    - Summing these two results.
    - Applying the `tanh` activation function.
- The output for the current time step is then computed by multiplying `h_next` with the weight matrix `V`.
- The function returns the computed output and the next hidden state h_next.
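In equation form, writing $x_t$ for the input at time step $t$, $h_{t-1}$ for the previous hidden state, and $y_t$ for the output, the step can be summarized as follows. This uses the row-vector convention to match the `tf.matmul` argument order, and has no bias terms, as in the code:

$$h_t = \tanh(x_t U + h_{t-1} W)$$

$$y_t = h_t V$$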

In [24]:
# Initialize hidden state to 0
h0 = tf.zeros([mb, h_dim])

# Define x1 as some sample input data for the first time step
x1 = tf.random.normal([mb, x_dim])

# Forward pass of one RNN step for time step t=1
y1, h1 = RNN_step(x1, h0)

print("Output y1 dimensions:", y1.shape)
print("Hidden state h1 dimensions:", h1.shape)


Output y1 dimensions: (16, 100)
Hidden state h1 dimensions: (16, 64)


**Explanation:**

- The initial hidden state, `h0`, is initialized to a matrix of zeros with dimensions `[mb, h_dim]`, where `mb` is the mini-batch size and `h_dim` is the hidden layer dimension (previously set to 64).
- The function `RNN_step` is then called with the input for the first time step (`x1`) and the initial hidden state (`h0`). This gives the output `y1` and the next hidden state `h1` for time step `t=1`.
- The dimensions of the output `y1` and the hidden state `h1` are then printed for verification.

In [25]:
# Simulate some sample input data for the second time step
x2 = tf.random.normal([mb, x_dim])

# Forward pass of one RNN step for time step t=2
y2, h2 = RNN_step(x2, h1)

print("Output y2 dimensions:", y2.shape)
print("Hidden state h2 dimensions:", h2.shape)


Output y2 dimensions: (16, 100)
Hidden state h2 dimensions: (16, 64)


**Explanation:**

- The code creates a new random tensor `x2` as sample input for the second time step.
- The `RNN_step` function is then called again, this time with `x2` as the input and `h1` (the hidden state from the previous time step) as the previous state.
- This produces `y2`, the output for the second time step, and `h2`, the hidden state to be passed to the next time step.
- The dimensions of `y2` and `h2` are then printed for verification.
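Calling the step function by hand for each time step generalizes naturally to a loop over a whole sequence. The following is a minimal NumPy sketch of that idea, not the notebook's own code: it uses a plain normal initializer instead of a truncated one, random sample inputs, and an arbitrary sequence length of 5.

```python
import numpy as np

rng = np.random.default_rng(0)
mb, x_dim, h_dim, num_steps = 16, 100, 64, 5

# Randomly initialized weights, shaped like U, W, and V above
U = rng.normal(0.0, 0.1, size=(x_dim, h_dim))
W = rng.normal(0.0, 0.1, size=(h_dim, h_dim))
V = rng.normal(0.0, 0.1, size=(h_dim, x_dim))

def rnn_step(x, h):
    # Same computation as RNN_step, written with NumPy
    h_next = np.tanh(x @ U + h @ W)
    return h_next @ V, h_next

# Random sample inputs of shape [num_steps, mb, x_dim]
inputs = rng.normal(size=(num_steps, mb, x_dim))

h = np.zeros((mb, h_dim))  # initial hidden state
outputs = []
for x_t in inputs:
    y_t, h = rnn_step(x_t, h)  # the new state feeds the next step
    outputs.append(y_t)

outputs = np.stack(outputs)
print(outputs.shape, h.shape)
```

The same weights are reused at every step; only the hidden state changes as the sequence is consumed.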

In [26]:
# Number of steps to unroll
num_steps = 10

# List of inputs and hidden states
xs = []
hs = []

# Build RNN
rnn = tf.keras.layers.SimpleRNNCell(h_dim)

# Initialize hidden state to zero
h_t = tf.zeros([mb, h_dim])

for t in range(num_steps):
    x_t = tf.random.normal([mb, x_dim])  # Use sample data in place of placeholders
    h_t, _ = rnn(x_t, [h_t])
    
    xs.append(x_t)
    hs.append(h_t)
    
print("x dimensions:")
print([x_t.shape for x_t in xs])
print("\nh dimensions:")
print([h_t.shape for h_t in hs])


x dimensions:
[TensorShape([16, 100]), TensorShape([16, 100]), TensorShape([16, 100]), TensorShape([16, 100]), TensorShape([16, 100]), TensorShape([16, 100]), TensorShape([16, 100]), TensorShape([16, 100]), TensorShape([16, 100]), TensorShape([16, 100])]

h dimensions:
[TensorShape([16, 64]), TensorShape([16, 64]), TensorShape([16, 64]), TensorShape([16, 64]), TensorShape([16, 64]), TensorShape([16, 64]), TensorShape([16, 64]), TensorShape([16, 64]), TensorShape([16, 64]), TensorShape([16, 64])]


**Explanation:**

- The code specifies that the RNN will be unrolled for `num_steps` time steps, which is set to `10` in this case.
- Empty lists, `xs` and `hs`, are created to store the inputs and hidden states at each time step, respectively.
- The RNN cell is built with `tf.keras.layers.SimpleRNNCell`, which in TensorFlow 2.x takes the place of `tf.contrib.rnn.BasicRNNCell` from TensorFlow 1.x. The cell creates and manages its own weights, so the `U`, `W`, and `V` matrices defined earlier are not used here.
- The hidden state `h_t` is initialized to zeros.

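Within the loop:
- A random tensor `x_t` is generated as sample input for each time step, standing in for real data.
- The built-in RNN cell is called with `x_t` and `[h_t]`. It returns the output and a list of new states; for a simple RNN cell the output is the new hidden state, so we keep the first return value as `h_t`.
- The input and hidden state are then appended to their respective lists.

After the loop, the dimensions of all inputs and hidden states are printed.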
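As an alternative to stepping a cell manually, Keras also offers `tf.keras.layers.SimpleRNN`, which runs the loop over the time axis internally. The sketch below assumes the inputs are arranged as `[batch, time, features]` and again uses random sample data; it is one possible way to do this, not the only one.

```python
import tensorflow as tf

mb, num_steps, x_dim, h_dim = 16, 10, 100, 64

# Random sample inputs, shaped [batch, time, features]
inputs = tf.random.normal([mb, num_steps, x_dim])

# return_sequences=True yields the hidden state at every time step;
# return_state=True additionally returns the final hidden state
rnn_layer = tf.keras.layers.SimpleRNN(h_dim, return_sequences=True, return_state=True)
all_states, final_state = rnn_layer(inputs)

print(all_states.shape)
print(final_state.shape)
```

Here `all_states` plays the role of the `hs` list, stacked along the time axis, and `final_state` corresponds to its last entry.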