# Natural Language Processing - NLP

[Read more](https://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf)

## Tokenizer

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [8]:
sentences = [
    'Today I feel very disappointed',
    'I get a very low score',
    'My parents do and teachers do not believe in me'
]

unique_words = []
for sentence in sentences:
  for word in sentence.split():
    if word not in unique_words:
      unique_words.append(word)

NUM_VOCAB = len(unique_words)
NUM_VOCAB

18

In [10]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
word_index

{'<OOV>': 1,
 'i': 2,
 'very': 3,
 'do': 4,
 'today': 5,
 'feel': 6,
 'disappointed': 7,
 'get': 8,
 'a': 9,
 'low': 10,
 'score': 11,
 'my': 12,
 'parents': 13,
 'and': 14,
 'teachers': 15,
 'not': 16,
 'believe': 17,
 'in': 18,
 'me': 19}
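
The ordering of `word_index` follows word frequency: index 1 is reserved for the out-of-vocabulary token, more frequent words get smaller indices, and words with equal counts keep the order in which they first appeared. As a rough sketch of that idea (not Keras's actual implementation, and ignoring its punctuation filtering), the hypothetical helper `build_word_index` below reproduces the indices for plain whitespace-separated text:

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer index, most frequent words first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts, key=counts.get, reverse=True)
    # Index 0 is left unused (it is the padding value); index 1 is the OOV token.
    vocab = [oov_token] + ordered
    return {word: i for i, word in enumerate(vocab, start=1)}

sentences = [
    'Today I feel very disappointed',
    'I get a very low score',
    'My parents do and teachers do not believe in me',
]
word_index = build_word_index(sentences)
print(word_index["i"], word_index["very"], word_index["do"], word_index["me"])  # 2 3 4 19
```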

In [11]:
new_sentences = [
    'This is such a beautiful day',
    'I really want to go swimming today',
    'Sky is brightly blue'
]

new_sequences = tokenizer.texts_to_sequences(new_sentences)
new_sequences

[[1, 1, 1, 9, 1, 1], [2, 1, 1, 1, 1, 1, 5], [1, 1, 1, 1]]
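
Every word that was not seen during `fit_on_texts` is replaced by the OOV index 1, which is why the sequences above consist mostly of ones. A minimal sketch of that lookup, using a hand-copied subset of the word index and a hypothetical `encode` helper (Keras also strips punctuation, which this sketch ignores):

```python
# Subset of the fitted word index shown earlier; index 1 is the OOV token.
known_words = {"i": 2, "very": 3, "today": 5, "a": 9}
OOV_INDEX = 1

def encode(text, word_index, oov_index=OOV_INDEX):
    """Replace each lowercased word by its index, or by oov_index if unknown."""
    return [word_index.get(word, oov_index) for word in text.lower().split()]

encoded = encode("I really want to go swimming today", known_words)
print(encoded)  # [2, 1, 1, 1, 1, 1, 5]
```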

In [12]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padding_sequences = pad_sequences(new_sequences)
padding_sequences

array([[0, 1, 1, 1, 9, 1, 1],
       [2, 1, 1, 1, 1, 1, 5],
       [0, 0, 0, 1, 1, 1, 1]], dtype=int32)
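
With its default arguments, `pad_sequences` pads with zeros at the front of each sequence until all of them are as long as the longest one. The hypothetical `pad_pre` helper below is a pure-Python sketch of that default behaviour only; it returns nested lists rather than a NumPy array and does not handle truncation:

```python
def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the longest length."""
    maxlen = max(len(seq) for seq in sequences)
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]

new_sequences = [[1, 1, 1, 9, 1, 1], [2, 1, 1, 1, 1, 1, 5], [1, 1, 1, 1]]
padded = pad_pre(new_sequences)
for row in padded:
    print(row)
# [0, 1, 1, 1, 9, 1, 1]
# [2, 1, 1, 1, 1, 1, 5]
# [0, 0, 0, 1, 1, 1, 1]
```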

In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)



Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...





Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.


In [None]:
train_dataset, test_dataset = dataset["train"], dataset["test"]
tokenizer = info.features['text'].encoder

In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(train_dataset))
test_dataset = test_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(test_dataset))

In [None]:
train_dataset

<_PaddedBatchDataset element_spec=(TensorSpec(shape=(None, None), dtype=tf.int64, name=None), TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

In [None]:
tokenizer

<SubwordTextEncoder vocab_size=8185>

## Simple RNN

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = Sequential([
    Embedding(tokenizer.vocab_size, 64),
    SimpleRNN(32),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])

model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, None, 64)          523840    
                                                                 
 simple_rnn_2 (SimpleRNN)    (None, 32)                3104      
                                                                 
 dense_2 (Dense)             (None, 16)                528       
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 527,489
Trainable params: 527,489
Non-trainable params: 0
_________________________________________________________________


### Explanation about model architecture and its parameters

The parameters are the trainable weights and biases that the model learns during training. The parameter count is a rough proxy for the model's capacity, that is, how complex a function the model can represent. More parameters make the model more expressive, but also more prone to overfitting and more expensive to train.

The number of parameters in each layer depends on the layer type, the input shape, and the output shape. Here is how you can calculate them for each layer in your model:

- **Embedding layer**: This layer transforms each token in the input sequence into a 64-dimensional vector. The number of parameters is equal to the vocabulary size (the number of unique tokens) times the embedding dimension. In your case, you are using `tokenizer.vocab_size` as the vocabulary size, which is $8185$. Therefore, the number of parameters in the embedding layer is $8185 \times 64 = 523840$.

- **SimpleRNN layer**: This layer applies a recurrent neural network (RNN) to the embedded input sequence. The RNN has 32 hidden units, so the hidden state at each time step is a 32-dimensional vector; because `return_sequences` is left at its default of `False`, only the last hidden state is passed on to the next layer. The hidden state $h_t$ is computed as $h_t = \tanh(W x_t + H h_{t-1} + b)$ with $W \in \mathbb{R}^{n \times d}, x_t \in \mathbb{R}^d, H \in \mathbb{R}^{n \times n}, b, h_{t-1}, h_t \in \mathbb{R}^n$. The number of parameters to optimize is $n \times n + n \times d + n$ ($\text{hidden state} \times \text{hidden state} + \text{hidden state} \times \text{input dimension} + \text{hidden state}$). Factoring out $n$ gives $n(n + d + 1)$. In this case, the input dimension is $d = 64$ (the embedding dimension) and the hidden dimension is $n = 32$. Therefore, the number of parameters in the RNN layer is $(64 + 32 + 1) \times 32 = 3104$.


- **Dense layer 1**: This layer applies a fully connected (dense) layer to the final hidden state of the RNN. It has 16 units, so it produces a 16-dimensional output vector. The layer computes $y = \mathrm{ReLU}(W x + b)$ with $W \in \mathbb{R}^{n \times d}, x \in \mathbb{R}^d, y, b \in \mathbb{R}^n$, where $d$ is the input dimension and $n$ is the output dimension, so the number of parameters to optimize is $n \times d + n = n(d + 1)$. In this case, the input dimension is $32$ (the RNN output dimension), and the output dimension is $16$. Therefore, the number of parameters in the dense layer 1 is $(32 + 1) \times 16 = 528$.

- **Dense layer 2**: This layer applies another dense layer to the output of dense layer 1. It has 1 unit with a sigmoid activation, so it produces a single scalar in $(0, 1)$ for each input sequence. The number of parameters is equal to (input dimension + 1) * output dimension, where the +1 is for the bias term. In this case, the input dimension is $16$ (the dense layer 1 output dimension), and the output dimension is $1$. Therefore, the number of parameters in the dense layer 2 is $(16 + 1) \times 1 = 17$.
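
The per-layer formulas can be checked with a few lines of plain Python. This is a sketch with hypothetical helper names, assuming the vocabulary size of 8185 reported by the subword encoder above; the results should agree with the `Param #` column of `model.summary()`:

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per token in the vocabulary.
    return vocab_size * embedding_dim

def simple_rnn_params(input_dim, units):
    # Recurrent weights (n x n) + input weights (n x d) + bias (n).
    return units * (units + input_dim + 1)

def dense_params(input_dim, units):
    # Weights (n x d) + bias (n).
    return units * (input_dim + 1)

counts = {
    "embedding": embedding_params(8185, 64),
    "simple_rnn": simple_rnn_params(64, 32),
    "dense_1": dense_params(32, 16),
    "dense_2": dense_params(16, 1),
}
total = sum(counts.values())
print(counts)  # {'embedding': 523840, 'simple_rnn': 3104, 'dense_1': 528, 'dense_2': 17}
print(total)   # 527489
```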

In [None]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_dataset, epochs=1, validation_data=test_dataset)



<keras.callbacks.History at 0x7b225e235030>

## Long Short-Term Memory - LSTM

[Read more](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

In [None]:
from tensorflow.keras.layers import LSTM

model_2 = Sequential([
    Embedding(tokenizer.vocab_size, 64),
    LSTM(32),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])

model_2.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, None, 64)          523840    
                                                                 
 lstm (LSTM)                 (None, 32)                12416     
                                                                 
 dense_4 (Dense)             (None, 16)                528       
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 536,801
Trainable params: 536,801
Non-trainable params: 0
_________________________________________________________________


### Explanation of the parameters in the LSTM layer

If we look closely at the LSTM layer, we see that its number of parameters is much larger than that of the SimpleRNN layer. Look at the formulas of the LSTM, where $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input:

- Input gate: $i_t = \sigma(W^{(i)} [h_{t-1}, x_t] + b^{(i)})$
- Forget gate: $f_t = \sigma(W^{(f)} [h_{t-1}, x_t] + b^{(f)})$
- Output/exposure gate: $o_t = \sigma(W^{(o)} [h_{t-1}, x_t] + b^{(o)})$
- New memory cell: $\tilde{c}_t = \tanh(W^{(c)} [h_{t-1}, x_t] + b^{(c)})$
- Final memory cell: $c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$
- Hidden state: $h_t = o_t * \tanh(c_t)$

Each of the four weight matrices $W^{(i)}, W^{(f)}, W^{(o)}, W^{(c)}$ has shape $n \times (n + d)$ and each bias has $n$ entries, so we need to optimize $(n \times n + n \times d + n) \times 4 = (32 \times 32 + 32 \times 64 + 32) \times 4 = 12416$ parameters.
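
As a quick check of this count, the hypothetical helper below evaluates the formula for the layer sizes used here ($n = 32$, $d = 64$) and adds the unchanged embedding and dense layers to obtain the model total. It is a sketch of the arithmetic only, not of how Keras stores the weights:

```python
def lstm_params(input_dim, units):
    # Four blocks (input, forget, output gate, new memory cell), each with
    # recurrent weights (n x n), input weights (n x d) and a bias (n).
    return 4 * units * (units + input_dim + 1)

lstm_count = lstm_params(input_dim=64, units=32)
# Embedding (8185 * 64), dense_1 (33 * 16) and dense_2 (17 * 1) are unchanged.
model_total = 8185 * 64 + lstm_count + (32 + 1) * 16 + (16 + 1) * 1
print(lstm_count)   # 12416
print(model_total)  # 536801
```

The LSTM count is exactly four times the SimpleRNN count of 3104 for the same $n$ and $d$.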

In [None]:
model_2.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model_2.fit(train_dataset, epochs=1, validation_data=test_dataset)



<keras.callbacks.History at 0x7b225d1a1720>

### Implement LSTM



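```python
import tensorflow as tf

state_size = 32  # hidden size n (assumed here; matches LSTM(32) above)
input_size = 64  # input size d (assumed here; matches the embedding dimension)

# Xavier (Glorot) initialization, one matrix per block:
# index 0 = input gate, 1 = forget gate, 2 = output gate, 3 = new memory cell
initializer = tf.keras.initializers.GlorotUniform()
W = tf.Variable(tf.stack([initializer(shape=(input_size, state_size)) for _ in range(4)]), name='W')
U = tf.Variable(tf.stack([initializer(shape=(state_size, state_size)) for _ in range(4)]), name='U')
b = tf.Variable(tf.zeros((4, state_size)), name='b')

def lstm(pre_layer, x):
  # pre_layer stacks the previous [h, c]: shape (2, batch, state_size)
  # x is the current input: shape (batch, input_size)
  pre_h, pre_c = tf.unstack(pre_layer)

  # Gate
  # Input gate
  i_t = tf.sigmoid(tf.matmul(x, W[0]) + tf.matmul(pre_h, U[0]) + b[0])

  # Forget gate
  f_t = tf.sigmoid(tf.matmul(x, W[1]) + tf.matmul(pre_h, U[1]) + b[1])

  # Output gate
  o_t = tf.sigmoid(tf.matmul(x, W[2]) + tf.matmul(pre_h, U[2]) + b[2])

  # New memory cell
  n_c_t = tf.tanh(tf.matmul(x, W[3]) + tf.matmul(pre_h, U[3]) + b[3])

  # Final memory cell: keep part of the old cell, add the gated new candidate
  c = f_t * pre_c + i_t * n_c_t
  h = o_t * tf.tanh(c)

  return tf.stack([h, c])
```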



## Bidirectional
A bidirectional layer runs one recurrent layer forward and a second one backward over the sequence. Here we wrap an LSTM rather than a SimpleRNN $\rightarrow$ Bidirectional LSTM

In [None]:
from tensorflow.keras.layers import Bidirectional, LSTM

model_3 = Sequential([
    Embedding(tokenizer.vocab_size, 64),
    Bidirectional(LSTM(32)),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])

model_3.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, None, 64)          523840    
                                                                 
 bidirectional (Bidirectiona  (None, 64)               24832     
 l)                                                              
                                                                 
 dense_6 (Dense)             (None, 16)                1040      
                                                                 
 dense_7 (Dense)             (None, 1)                 17        
                                                                 
Total params: 549,729
Trainable params: 549,729
Non-trainable params: 0
_________________________________________________________________


### Explanation of the parameters in the Bidirectional LSTM layer

LSTM:
- Input gate: $i_t = \sigma(W^{(i)} [h_{t-1}, x_t] + b^{(i)})$
- Forget gate: $f_t = \sigma(W^{(f)} [h_{t-1}, x_t] + b^{(f)})$
- Output/exposure gate: $o_t = \sigma(W^{(o)} [h_{t-1}, x_t] + b^{(o)})$
- New memory cell: $\tilde{c}_t = \tanh(W^{(c)} [h_{t-1}, x_t] + b^{(c)})$
- Final memory cell: $c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$
- Hidden state: $h_t = o_t * \tanh(c_t)$

$\rightarrow h_t \in \mathbb{R}^n$

Bidirectional:
- We obtain two vectors $h_t^{(1)}$ and $h_t^{(2)}$, one from an LSTM that reads the sequence forward and one from a second LSTM that reads it backward
- $\hat{y}_t = \gamma(W^{(s)} [h_t^{(1)}, h_t^{(2)}] + c)$
- We concatenate the two vectors $h_t^{(1)}$ and $h_t^{(2)}$, so the layer output has dimension $2n = 64$. Each direction has its own LSTM weights $\rightarrow$ the number of parameters is $((n \times n + n \times d + n) \times 4) \times 2 = ((32 \times 32 + 32 \times 64 + 32) \times 4) \times 2 = 24832$
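
The doubling also affects the next layer: the first dense layer now receives a 64-dimensional input instead of a 32-dimensional one, which is why its count rises from 528 to 1040 in the summary. A small arithmetic sketch with hypothetical helper names:

```python
def lstm_params(input_dim, units):
    # Four blocks, each with n x n + n x d weights and n biases.
    return 4 * units * (units + input_dim + 1)

def bidirectional_lstm_params(input_dim, units):
    # Forward and backward LSTMs have separate weights.
    return 2 * lstm_params(input_dim, units)

bi_count = bidirectional_lstm_params(input_dim=64, units=32)
dense_1_count = (2 * 32 + 1) * 16  # input is the concatenation of both directions
model_total = 8185 * 64 + bi_count + dense_1_count + (16 + 1) * 1
print(bi_count)       # 24832
print(dense_1_count)  # 1040
print(model_total)    # 549729
```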

In [None]:
model_3.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model_3.fit(train_dataset, epochs=1, validation_data=test_dataset)



<keras.callbacks.History at 0x7b225d9c0d00>

## Deep Bidirectional LSTM

In [None]:
model_4 = Sequential([
    Embedding(tokenizer.vocab_size, 64),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])

model_4.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, None, 64)          523840    
                                                                 
 bidirectional_1 (Bidirectio  (None, None, 128)        66048     
 nal)                                                            
                                                                 
 bidirectional_2 (Bidirectio  (None, 64)               41216     
 nal)                                                            
                                                                 
 dense_8 (Dense)             (None, 16)                1040      
                                                                 
 dense_9 (Dense)             (None, 1)                 17        
                                                                 
Total params: 632,161
Trainable params: 632,161
Non-tr
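
When bidirectional layers are stacked, the input dimension of the second layer is the concatenated output of the first, here $2 \times 64 = 128$ rather than the embedding dimension. The sketch below (hypothetical helper name, arithmetic only) applies the same LSTM formula to both layers and should match the two `bidirectional` rows of the summary:

```python
def bidirectional_lstm_params(input_dim, units):
    # Two directions, four blocks each, n x n + n x d weights plus n biases.
    return 2 * 4 * units * (units + input_dim + 1)

first = bidirectional_lstm_params(input_dim=64, units=64)        # fed by the embedding
second = bidirectional_lstm_params(input_dim=2 * 64, units=32)   # fed by the first layer
model_total = 8185 * 64 + first + second + (2 * 32 + 1) * 16 + (16 + 1) * 1
print(first)        # 66048
print(second)       # 41216
print(model_total)  # 632161
```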

## Auto Completion

In [18]:
!wget --no-check-certificate -O /tmp/sonnets.txt \
    https://storage.googleapis.com/protonx-cloud-storage/data.txt

--2023-07-24 03:09:36--  https://storage.googleapis.com/protonx-cloud-storage/data.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.204.128, 172.217.203.128, 172.253.123.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.204.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 93578 (91K) [text/plain]
Saving to: ‘/tmp/sonnets.txt’


2023-07-24 03:09:36 (75.6 MB/s) - ‘/tmp/sonnets.txt’ saved [93578/93578]



In [21]:
with open('/tmp/sonnets.txt') as f:
  data = f.read()
print(data)

FROM fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light'st flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel.
Thou that art now the world's fresh ornament
And only herald to the gaudy spring,
Within thine own bud buriest thy content
And, tender churl, makest waste in niggarding.
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.
When forty winters shall beseige thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery, so gazed on now,
Will be a tatter'd weed, of small worth held:
Then being ask'd where all thy beauty lies,
Where all the treasure of thy lusty days,
To say, within thine own deep-sunken eyes,
Were an all-eating shame and thriftless praise.
How much more praise deserved thy b

In [22]:
type(data)

str

In [27]:
corpus = data.lower().split("\n")
print(len(corpus))
corpus[:10]

2159


['from fairest creatures we desire increase,',
 "that thereby beauty's rose might never die,",
 'but as the riper should by time decease,',
 'his tender heir might bear his memory:',
 'but thou, contracted to thine own bright eyes,',
 "feed'st thy light'st flame with self-substantial fuel,",
 'making a famine where abundance lies,',
 'thyself thy foe, to thy sweet self too cruel.',
 "thou that art now the world's fresh ornament",
 'and only herald to the gaudy spring,']

In [44]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers

In [32]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
total_words

3211

In [35]:
token_list = tokenizer.texts_to_sequences(["Love's fire heats water, water cools not love."])[0]
token_list

[101, 253, 3209, 493, 493, 3210, 15, 14]

In [34]:
# Build the training inputs: every n-gram prefix (length >= 2) of each tokenized line
input_sequences = []
for line in corpus:
  token_list = tokenizer.texts_to_sequences([line])[0]
  for i in range(1, len(token_list)):
    n_gram_sequence = token_list[:i+1]
    print(n_gram_sequence)
    input_sequences.append(n_gram_sequence)

input_sequences

Streaming output truncated to the last 5000 lines.
[153, 5, 174]
[153, 5, 174, 3]
[153, 5, 174, 3, 856]
[153, 5, 174, 3, 856, 857]
[66, 209]
[66, 209, 2595]
[66, 209, 2595, 287]
[66, 209, 2595, 287, 150]
[66, 209, 2595, 287, 150, 2596]
[858, 206]
[858, 206, 1]
[858, 206, 1, 1299]
[858, 206, 1, 1299, 13]
[858, 206, 1, 1299, 13, 23]
[858, 206, 1, 1299, 13, 23, 5]
[858, 206, 1, 1299, 13, 23, 5, 400]
[858, 206]
[858, 206, 1]
[858, 206, 1, 1299]
[858, 206, 1, 1299, 2597]
[858, 206, 1, 1299, 2597, 3]
[858, 206, 1, 1299, 2597, 3, 160]
[858, 206, 1, 1299, 2597, 3, 160, 249]
[1, 7]
[1, 7, 29]
[1, 7, 29, 205]
[1, 7, 29, 205, 13]
[1, 7, 29, 205, 13, 5]
[1, 7, 29, 205, 13, 5, 488]
[1, 7, 29, 205, 13, 5, 488, 411]
[348, 2598]
[348, 2598, 7]
[348, 2598, 7, 66]
[348, 2598, 7, 66, 26]
[348, 2598, 7, 66, 26, 1298]
[348, 2598, 7, 66, 26, 1298, 471]
[348, 2598, 7, 66, 26, 1298, 471, 1258]
[858, 206]
[858, 206, 1]
[858, 206, 1, 75]
[858, 206, 1, 75, 147]
[858, 206, 1, 75, 147, 39]
[858, 206,

[[34, 417],
 [34, 417, 877],
 [34, 417, 877, 166],
 [34, 417, 877, 166, 213],
 [34, 417, 877, 166, 213, 517],
 [8, 878],
 [8, 878, 134],
 [8, 878, 134, 351],
 [8, 878, 134, 351, 102],
 [8, 878, 134, 351, 102, 156],
 [8, 878, 134, 351, 102, 156, 199],
 [16, 22],
 [16, 22, 2],
 [16, 22, 2, 879],
 [16, 22, 2, 879, 61],
 [16, 22, 2, 879, 61, 30],
 [16, 22, 2, 879, 61, 30, 48],
 [16, 22, 2, 879, 61, 30, 48, 634],
 [25, 311],
 [25, 311, 635],
 [25, 311, 635, 102],
 [25, 311, 635, 102, 200],
 [25, 311, 635, 102, 200, 25],
 [25, 311, 635, 102, 200, 25, 278],
 [16, 10],
 [16, 10, 880],
 [16, 10, 880, 3],
 [16, 10, 880, 3, 62],
 [16, 10, 880, 3, 62, 85],
 [16, 10, 880, 3, 62, 85, 214],
 [16, 10, 880, 3, 62, 85, 214, 53],
 [1372, 9],
 [1372, 9, 1373],
 [1372, 9, 1373, 636],
 [1372, 9, 1373, 636, 11],
 [1372, 9, 1373, 636, 11, 122],
 [1372, 9, 1373, 636, 11, 122, 1374],
 [1372, 9, 1373, 636, 11, 122, 1374, 1375],
 [201, 17],
 [201, 17, 1376],
 [201, 17, 1376, 64],
 [201, 17, 1376, 64, 518],
 [201,

In [37]:
import numpy as np

In [40]:
# pad sequence
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
input_sequences

array([[   0,    0,    0, ...,    0,   34,  417],
       [   0,    0,    0, ...,   34,  417,  877],
       [   0,    0,    0, ...,  417,  877,  166],
       ...,
       [   0,    0,    0, ...,  493,  493, 3210],
       [   0,    0,    0, ...,  493, 3210,   15],
       [   0,    0,    0, ..., 3210,   15,   14]], dtype=int32)

In [42]:
# create predictors and label
from tensorflow.keras.utils import to_categorical
predictors, label = input_sequences[:,:-1], input_sequences[:,-1]
label = to_categorical(label, num_classes=total_words)
label.shape

(15462, 3211)
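
To make the data preparation concrete, here is a pure-Python sketch on a shortened token list taken from the output above. The helper names are hypothetical stand-ins for the loop, `pad_sequences`, and `to_categorical`, and the one-hot example uses a toy vocabulary of 5 classes instead of `total_words`:

```python
def ngram_prefixes(tokens):
    """All prefixes of length >= 2, as in the loop over `corpus`."""
    return [tokens[:i + 1] for i in range(1, len(tokens))]

def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen`."""
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]

def one_hot(index, num_classes):
    """One-hot encode a single class index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

tokens = [34, 417, 877, 166]
padded = pad_pre(ngram_prefixes(tokens), maxlen=len(tokens))
predictors = [row[:-1] for row in padded]  # everything but the last token
labels = [row[-1] for row in padded]       # the token to predict
print(predictors)     # [[0, 0, 34], [0, 34, 417], [34, 417, 877]]
print(labels)         # [417, 877, 166]
print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

Each row therefore asks the model to predict the next word from all the words before it on the same line.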

In [45]:
# Create model architecture
model = Sequential([
    Embedding(total_words, 100, input_length=max_sequence_len-1),
    Bidirectional(LSTM(150, return_sequences=True)),
    Dropout(0.2),
    LSTM(100),
    Dense(total_words // 2, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    Dense(total_words, activation='softmax'),
])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 10, 100)           321100    
                                                                 
 bidirectional (Bidirectiona  (None, 10, 300)          301200    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 10, 300)           0         
                                                                 
 lstm_1 (LSTM)               (None, 100)               160400    
                                                                 
 dense (Dense)               (None, 1605)              162105    
                                                                 
 dense_1 (Dense)             (None, 3211)              5156866   
                                                        

In [48]:
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(predictors, label, epochs=1, verbose=1)
# Only 1 epoch is run here to save time; this should be trained for about 100 epochs



In [54]:
# inference (generate next 10 words)
seq = 'despite of all wrinkles'
next_words = 10

for _ in range(next_words):
  token_list = tokenizer.texts_to_sequences([seq])[0]
  token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding="pre")
  probabilities = model.predict(token_list, verbose=0)
  predicted = int(np.argmax(probabilities))
  # index_word is the reverse of word_index; index 0 (padding) has no word
  output_word = tokenizer.index_word.get(predicted, "")
  seq += " " + output_word
print(seq)

despite of all wrinkles of of of of of of of of of of
