# CS 195: Natural Language Processing
## Encoder-Decoder Architectures

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F6_3_EncoderDecoder.ipynb)

## Reference

SLP: RNNs and LSTMs, Chapter 9 of Speech and Language Processing by Daniel Jurafsky & James H. Martin https://web.stanford.edu/~jurafsky/slp3/9.pdf

A ten-minute introduction to sequence-to-sequence learning in Keras: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

Character-level recurrent sequence-to-sequence model: https://keras.io/examples/nlp/lstm_seq2seq/

In [None]:
import sys
!{sys.executable} -m pip install datasets keras tensorflow transformers


## Last time: RNN Language Model

We used recurrent neural networks for *language modeling* - predicting the next word.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_languagemodeling.png?raw=1" width=700>
</div>


image source: SLP Fig. 9.6, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## RNN for Sequence Classification

We could also use the last hidden state of an RNN as the input to a regular feed-forward network and classify a whole sequence.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_classification.png?raw=1" width=700>
</div>


image source: SLP Fig. 9.8, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## RNN Sequence Labeling

RNNs are also good for **sequence labeling**, where the output is a sequence corresponding 1:1 with the input words, as in part-of-speech tagging.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_sequence_labeling.png?raw=1" width=700>
</div>


image source: SLP Fig. 9.7, https://web.stanford.edu/~jurafsky/slp3/9.pdf

### Discussion Question

What sequence-to-sequence NLP tasks can you think of where the input and target sequences don't match up word-for-word?

Sequence-to-sequence NLP tasks that are not 1:1:
* Question answering
* Summarization

## Encoder-Decoder Architecture

**Encoder RNN:** Takes input sequences, produces a context vector

**Context Vector:** Contains essence of the input sequence

**Decoder RNN:** Takes context vector as input, generates an output sequence

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/encoder-decoder.png?raw=1" width=700>
</div>


image source: SLP Fig. 9.16, https://web.stanford.edu/~jurafsky/slp3/9.pdf
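
To make the three pieces above concrete, here is a minimal pure-Python sketch. It is not the Keras model used in this notebook: the weights are random and untrained, and the sizes (`hidden`, `emb`) and helper names (`rnn_step`, `encode`, `decode`) are made up for illustration. It only shows that the encoder's final state becomes the decoder's initial state, and that the input and output lengths do not have to match.

```python
import math
import random

def rnn_step(x, h, W_x, W_h):
    """One simple-RNN step: h_new = tanh(W_x @ x + W_h @ h)."""
    return [
        math.tanh(sum(wx * xi for wx, xi in zip(W_x[j], x))
                  + sum(wh * hi for wh, hi in zip(W_h[j], h)))
        for j in range(len(h))
    ]

def random_matrix(rows, cols, rng):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
hidden, emb = 3, 2  # toy sizes, chosen only for illustration

# The encoder and decoder each get their own weights
enc_Wx, enc_Wh = random_matrix(hidden, emb, rng), random_matrix(hidden, hidden, rng)
dec_Wx, dec_Wh = random_matrix(hidden, emb, rng), random_matrix(hidden, hidden, rng)

def encode(inputs):
    """Run the encoder over every input vector; the final state is the context vector."""
    h = [0.0] * hidden
    for x in inputs:
        h = rnn_step(x, h, enc_Wx, enc_Wh)
    return h

def decode(context, decoder_inputs):
    """Start the decoder from the context vector and collect one state per output step."""
    h = context
    states = []
    for x in decoder_inputs:
        h = rnn_step(x, h, dec_Wx, dec_Wh)
        states.append(h)
    return states

source_steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # 4 input steps
target_steps = [[0.0, 1.0], [1.0, 0.0]]                            # 2 output steps

toy_context = encode(source_steps)
toy_states = decode(toy_context, target_steps)
print(len(toy_context), len(toy_states))
```

In a real model each decoder state would also pass through an output layer that scores the next token.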

## Encoder-Decoder Usage

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/encoder-decoder_detail.png?raw=1" width=800>
</div>


image source: SLP Fig. 9.18, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## Text2Emoji Dataset

Here is a fun dataset that pairs short pieces of text with a sequence of emojis corresponding to the text
* This is kind of like translation
* This is kind of like summarization

In [None]:
from datasets import load_dataset

dataset = load_dataset("KomeijiForce/Text2Emoji",split="train")


In [None]:
# Define a function to check if 'text' is not None
def is_not_none(example):
    return example['text'] is not None

# Filter the dataset
dataset = dataset.filter(is_not_none)


In [None]:
dataset["text"][46]

'Going green has never been trendier! Drive around in style with a lineup of eco-friendly electric cars.'

In [None]:
dataset["emoji"][46]

'♻️🚗✨💐🍃🌱'

In [None]:
len(dataset)

503682

### Importing libraries we'll need

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, SimpleRNN, Embedding, Dense, TimeDistributed
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

### Setting up some tokenizers

In this case, we'll create two different tokenizers
* texts need to be tokenized as words
* emojis need to be tokenized as characters
* might be similar if you translate between different languages
* some problems might be able to use the same tokenizer
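
As a rough illustration of the word-level versus character-level split described above, here is a small pure-Python sketch. It does not use the Keras `Tokenizer`: the helper `build_index` and the demo strings are made up, and indices are assigned by first appearance, whereas Keras orders them by frequency.

```python
def build_index(tokens_per_example):
    """Assign indices from 1 in order of first appearance (0 is left for padding)."""
    index = {}
    for tokens in tokens_per_example:
        for tok in tokens:
            if tok not in index:
                index[tok] = len(index) + 1
    return index

texts_demo = ["drive electric cars", "electric cars are green"]
emoji_demo = ["\t🚗🌱\n", "\t🌱✨\n"]  # \t and \n act as start/end markers

word_tokens = [t.split() for t in texts_demo]  # word-level: split on whitespace
char_tokens = [list(e) for e in emoji_demo]    # character-level: one token per code point

demo_word_index = build_index(word_tokens)
demo_char_index = build_index(char_tokens)

word_ids = [[demo_word_index[w] for w in toks] for toks in word_tokens]
char_ids = [[demo_char_index[c] for c in toks] for toks in char_tokens]
print(word_ids)
print(char_ids)

# Some emoji are more than one code point, so character-level
# tokenization splits them into several tokens.
recycle = "\u267b\ufe0f"  # recycling symbol followed by a variation selector
print(len(recycle))
```

The last line is worth keeping in mind when reading emoji token sequences: a single visible emoji may occupy more than one position.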

In [None]:
# Parameters
max_text_len = 20
max_emoji_len = 10


#texts = dataset["text"][:5000]
#emoji = dataset["emoji"][:5000]

texts = dataset["text"]
emoji = dataset["emoji"]

#add \t and \n as start and ending tokens for the emoji
for idx in range(len(emoji)):
    emoji[idx] = "\t"+emoji[idx]+"\n"

# Tokenize text
text_tokenizer = Tokenizer()

text_tokenizer.fit_on_texts(texts)
text_sequences = text_tokenizer.texts_to_sequences(texts)
text_sequences = pad_sequences(text_sequences, maxlen=max_text_len, padding='post')
text_vocab_size = len(text_tokenizer.word_index) + 1
print("text_vocab_size",text_vocab_size)



# Tokenize emojis
emoji_tokenizer = Tokenizer(char_level=True,filters="")
emoji_tokenizer.fit_on_texts(emoji)
emoji_sequences = emoji_tokenizer.texts_to_sequences(emoji)
emoji_sequences = pad_sequences(emoji_sequences, maxlen=max_emoji_len, padding='post')
emoji_vocab_size = len(emoji_tokenizer.word_index) + 1

#this might be something to try - then use categorical_crossentropy instead of sparse_categorical_crossentropy
#emoji_sequences_oh = to_categorical(emoji_sequences, num_classes=emoji_vocab_size)


print("emoji_vocab_size",emoji_vocab_size)





text_vocab_size 57073
emoji_vocab_size 1387


In [None]:
text_train, text_test, emoji_train, emoji_test = train_test_split(text_sequences,emoji_sequences)
print(text_test[0])
print(emoji_test[0])

[  99   13    1 3089  839    1 1466  370    2 1628 1304  280 3731  167
   23   21 3995    0    0    0]
[157   1 690  25   9   8 157   1 634   3]


In [None]:
print(text_train[2])
print(emoji_train[2])

[   41     1   681   452  5409  3473  1145  1469  5524     1 12374   163
     0     0     0     0     0     0     0     0]
[257   1 353   1  75   1  59 454   1   3]
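
The emoji rows printed above fill all 10 positions and do not begin with a shared start index. One possible reading is that sequences longer than `max_emoji_len` were truncated from the front, since `pad_sequences` defaults to `truncating='pre'` even when `padding='post'` is given. The sketch below imitates that behavior for a single sequence in plain Python; `pad_like_keras` and the `START`/`END` indices are hypothetical names for illustration.

```python
def pad_like_keras(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pure-Python imitation of how pad_sequences treats ONE sequence."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the front, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return seq + pad if padding == "post" else pad + seq

START, END = 1, 2  # hypothetical indices for "\t" and "\n"
long_seq = [START, 10, 11, 12, 13, 14, END]

front_cut = pad_like_keras(long_seq, 5, padding="post")
back_cut = pad_like_keras(long_seq, 5, padding="post", truncating="post")
short = pad_like_keras([START, 10, END], 5, padding="post")

print(front_cut)  # -> [11, 12, 13, 14, 2]  (start marker lost)
print(back_cut)   # -> [1, 10, 11, 12, 13]  (end marker lost)
print(short)      # -> [1, 10, 2, 0, 0]
```

Either truncation mode loses one of the markers for long sequences, so a larger `max_emoji_len` or filtering out long examples may be worth trying.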


### Defining the Encoder

The **Encoder** contains
* an input layer sized to `max_text_len` tokens (shorter texts are padded, longer ones truncated)
* an Embedding layer like usual
* a Recurrent layer
    - `return_state=True` means it will return both the **output** and the internal **state**
    
When training, we will ignore the *output* and just pass the *state* as the context vector
    
Notice that we don't use a `Sequential` model for this - it's going to have to be more flexible, so we explicitly compose each layer.
    

In [None]:
# Encoder
encoder_inputs = Input(shape=(max_text_len,))

enc_emb_layer = Embedding(input_dim=text_vocab_size, output_dim=100)
enc_emb = enc_emb_layer(encoder_inputs)

encoder_rnn = SimpleRNN(100, return_state=True)

encoder_outputs, state_h = encoder_rnn(enc_emb)

context_vector = [state_h]


### Defining the Decoder

The **Decoder** contains
* an input layer with `shape=(None,)` - this should make it flexible to allow for output text of many different lengths
* an Embedding layer like usual
* a recurrent layer - called with the context vector as the initial state
* an output layer for classifying which token (here, an emoji character) comes next in the sequence

In [None]:
# Decoder
decoder_inputs = Input(shape=(None,))

dec_emb_layer = Embedding(emoji_vocab_size, 100)
dec_emb = dec_emb_layer(decoder_inputs)

decoder_rnn = SimpleRNN(100, return_state=True, return_sequences=True)
decoder_outputs, _ = decoder_rnn(dec_emb, initial_state=context_vector) #ignore the returned states for now

decoder_dense = Dense(emoji_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [None]:
# Define the model that will turn
# `encoder_inputs` & `decoder_inputs` into `decoder_outputs`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [None]:
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
model.fit([text_train, emoji_train],
          emoji_train,
          epochs=10,
          batch_size=64,
          validation_data=([text_test, emoji_test],emoji_test) )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x79108d3bc550>
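
One thing to examine in the `model.fit` call above is that `emoji_train` is passed as both the decoder input and the target, so at each step the network is asked to output the token it was just given. The Keras seq2seq reference linked at the top instead offsets the target by one timestep (teacher forcing). Below is a minimal sketch of that offset in plain Python; the helper name and the token indices are hypothetical.

```python
def shift_for_teacher_forcing(sequences):
    """Split each padded row into (decoder input, target), offset by one step."""
    decoder_in = [row[:-1] for row in sequences]     # everything except the last token
    decoder_target = [row[1:] for row in sequences]  # everything except the first token
    return decoder_in, decoder_target

START, END, PAD = 1, 2, 0  # hypothetical token indices
rows = [[START, 7, 8, END, PAD],
        [START, 9, END, PAD, PAD]]

dec_in, dec_target = shift_for_teacher_forcing(rows)
print(dec_in)      # -> [[1, 7, 8, 2], [1, 9, 2, 0]]
print(dec_target)  # -> [[7, 8, 2, 0], [9, 2, 0, 0]]
```

With NumPy arrays such as `emoji_train`, the same split would be `emoji_train[:, :-1]` for the decoder input and `emoji_train[:, 1:]` for the target. Whether this change alone improves the results here is untested.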

## Inference

In order to make predictions on new examples (inference), we need to separate the encoder and decoder models.

In [None]:
# Encoder model for inference
encoder_model = Model(encoder_inputs, context_vector)

# Decoder model for inference
decoder_state_input = Input(shape=(100,))  # This is the input state for the decoder
decoder_emb = dec_emb_layer(decoder_inputs)  # Embedding for decoder input

# Get the output sequence from the decoder RNN
decoder_outputs, decoder_state = decoder_rnn(decoder_emb, initial_state=[decoder_state_input])

# Apply the Dense layer to the output sequence
decoder_outputs = decoder_dense(decoder_outputs)

# Define the decoder model
# Note: The model returns both the output sequence and the final state
decoder_model = Model([decoder_inputs, decoder_state_input], [decoder_outputs,decoder_state])



### Some functions for doing inference

The results here are not good - there are a number of reasons why this could be, and I hope we can explore ideas in class.

We will try this with some higher-power recurrent architectures next time.

In [None]:
def preprocess_input(input_text, text_tokenizer, max_text_len):
    # Tokenize the input text
    input_seq = text_tokenizer.texts_to_sequences([input_text])
    # Pad the sequence
    input_seq = pad_sequences(input_seq, maxlen=max_text_len, padding='post')
    return input_seq

#ChatGPT wrote this method
def sample(preds, temperature=1.0):
    # Apply softmax temperature
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Sample a token with probabilities adjusted by the temperature
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def decode_sequence(initial_state, decoder_model, emoji_tokenizer, max_emoji_len):
    # Start with a sequence containing just the start token index.
    target_seq = np.zeros((1, 1))
    start_token_index = emoji_tokenizer.word_index['\t']  # Assuming '\t' is the start token
    target_seq[0, 0] = start_token_index

    stop_condition = False
    decoded_sequence = ''
    state = initial_state

    while not stop_condition:
        # Predict the next token
        output_tokens, state = decoder_model.predict([target_seq, state])

        # Sample a token
        #sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token_index = sample(output_tokens[0, -1, :],temperature=1.2)
        if sampled_token_index == 0:  # Assuming 0 stands for the padding token
            break
        sampled_char = emoji_tokenizer.index_word.get(sampled_token_index, '')
        if sampled_char == '\n':  # Assuming '\n' is the stop token
            break
        decoded_sequence += sampled_char

        # Update the target sequence to the last predicted token
        target_seq = np.array([[sampled_token_index]])

        if len(decoded_sequence) > max_emoji_len:
            stop_condition = True

    return decoded_sequence


def predict(input_text):
    # Preprocess the input
    input_seq = preprocess_input(input_text, text_tokenizer, max_text_len)

    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Decode the sequence to emoji
    decoded_emoji = decode_sequence(states_value, decoder_model, emoji_tokenizer, max_emoji_len)

    return decoded_emoji

new_text = "Finally got that promotion at work! Feeling so proud and accomplished."
predicted_emoji = predict(new_text)
print("Predicted emoji sequence:", predicted_emoji)
display(predicted_emoji)

Predicted emoji sequence: 											


'\t\t\t\t\t\t\t\t\t\t\t'
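
The `sample` function above divides the log-probabilities by a temperature before drawing a token. The sketch below isolates that rescaling step in plain Python so its effect is visible on a small, made-up distribution. It assumes all probabilities are strictly positive; `apply_temperature` is a hypothetical helper, not part of the notebook.

```python
import math

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution: softmax(log(p) / temperature)."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_probs = [0.7, 0.2, 0.1]  # made-up next-token probabilities

same = apply_temperature(toy_probs, 1.0)     # unchanged
sharper = apply_temperature(toy_probs, 0.5)  # more mass on the top token
flatter = apply_temperature(toy_probs, 2.0)  # closer to uniform

# Greedy decoding (the commented-out argmax) always takes the top token
greedy_choice = max(range(len(toy_probs)), key=lambda i: toy_probs[i])

for name, dist in [("T=1.0", same), ("T=0.5", sharper), ("T=2.0", flatter)]:
    print(name, [round(p, 3) for p in dist])
print("greedy index:", greedy_choice)
```

A temperature above 1, such as the 1.2 used in `decode_sequence`, flattens the distribution and makes less likely tokens more probable.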

## Applied Exploration

Try this code on another dataset for summarization, translation, etc.

Or, you can try a character-level encoding like in this reference: https://keras.io/examples/nlp/lstm_seq2seq/

Run the code for a little while and see if you can come up with some meaningful results.

Write up a description of the data, what you tried, and what your results were.

In [None]:
# Parameters
max_text_len = 20
max_emoji_len = 10


#texts = dataset["text"][:5000]
#emoji = dataset["emoji"][:5000]

texts = dataset["text"]
emoji = dataset["emoji"]

#add \t and \n as start and ending tokens for the emoji
for idx in range(len(emoji)):
    emoji[idx] = "\t"+emoji[idx]+"\n"

# Tokenize text
text_tokenizer = Tokenizer()

text_tokenizer.fit_on_texts(texts)
text_sequences = text_tokenizer.texts_to_sequences(texts)
text_sequences = pad_sequences(text_sequences, maxlen=max_text_len, padding='post')
text_vocab_size = len(text_tokenizer.word_index) + 1
print("text_vocab_size",text_vocab_size)



# Tokenize emojis
emoji_tokenizer = Tokenizer(char_level=True,filters="")
emoji_tokenizer.fit_on_texts(emoji)
emoji_sequences = emoji_tokenizer.texts_to_sequences(emoji)
emoji_sequences = pad_sequences(emoji_sequences, maxlen=max_emoji_len, padding='post')
emoji_vocab_size = len(emoji_tokenizer.word_index) + 1

#this might be something to try - then use categorical_crossentropy instead of sparse_categorical_crossentropy
#emoji_sequences_oh = to_categorical(emoji_sequences, num_classes=emoji_vocab_size)


print("emoji_vocab_size",emoji_vocab_size)





text_vocab_size 57073
emoji_vocab_size 1387


In [None]:
text_train, text_test, emoji_train, emoji_test = train_test_split(text_sequences,emoji_sequences)

In [None]:
# Encoder
encoder_inputs = Input(shape=(max_text_len,))

enc_emb_layer = Embedding(input_dim=text_vocab_size, output_dim=100)
enc_emb = enc_emb_layer(encoder_inputs)

encoder_rnn = SimpleRNN(100, return_state=True)

encoder_outputs, state_h = encoder_rnn(enc_emb)

context_vector = [state_h]

In [None]:
# Decoder
decoder_inputs = Input(shape=(None,))

dec_emb_layer = Embedding(emoji_vocab_size, 100)
dec_emb = dec_emb_layer(decoder_inputs)

decoder_rnn = SimpleRNN(100, return_state=True, return_sequences=True)
decoder_outputs, _ = decoder_rnn(dec_emb, initial_state=context_vector) #ignore the returned states for now

decoder_dense = Dense(emoji_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [None]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [None]:
import keras
optimizer = keras.optimizers.Adam(learning_rate=0.1)  # `lr` is a deprecated alias for `learning_rate`
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
model.fit([text_train, emoji_train],
          emoji_train,
          epochs=1,
          batch_size=64,
          validation_data=([text_test, emoji_test],emoji_test) )





<keras.src.callbacks.History at 0x79109707be50>

In [None]:
# Encoder model for inference
encoder_model = Model(encoder_inputs, context_vector)

# Decoder model for inference
decoder_state_input = Input(shape=(100,))  # This is the input state for the decoder
decoder_emb = dec_emb_layer(decoder_inputs)  # Embedding for decoder input

# Get the output sequence from the decoder RNN
decoder_outputs, decoder_state = decoder_rnn(decoder_emb, initial_state=[decoder_state_input])

# Apply the Dense layer to the output sequence
decoder_outputs = decoder_dense(decoder_outputs)

# Define the decoder model
# Note: The model returns both the output sequence and the final state
decoder_model = Model([decoder_inputs, decoder_state_input], [decoder_outputs,decoder_state])

In [None]:
def preprocess_input(input_text, text_tokenizer, max_text_len):
    # Tokenize the input text
    input_seq = text_tokenizer.texts_to_sequences([input_text])
    # Pad the sequence
    input_seq = pad_sequences(input_seq, maxlen=max_text_len, padding='post')
    return input_seq

#ChatGPT wrote this method
def sample(preds, temperature=1.0):
    # Apply softmax temperature
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Sample a token with probabilities adjusted by the temperature
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def decode_sequence(initial_state, decoder_model, emoji_tokenizer, max_emoji_len):
    # Start with a sequence containing just the start token index.
    target_seq = np.zeros((1, 1))
    start_token_index = emoji_tokenizer.word_index['\t']  # Assuming '\t' is the start token
    target_seq[0, 0] = start_token_index

    stop_condition = False
    decoded_sequence = ''
    state = initial_state

    while not stop_condition:
        # Predict the next token
        output_tokens, state = decoder_model.predict([target_seq, state])

        # Sample a token
        #sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token_index = sample(output_tokens[0, -1, :],temperature=1.2)
        if sampled_token_index == 0:  # Assuming 0 stands for the padding token
            break
        sampled_char = emoji_tokenizer.index_word.get(sampled_token_index, '')
        if sampled_char == '\n':  # Assuming '\n' is the stop token
            break
        decoded_sequence += sampled_char

        # Update the target sequence to the last predicted token
        target_seq = np.array([[sampled_token_index]])

        if len(decoded_sequence) > max_emoji_len:
            stop_condition = True

    return decoded_sequence


def predict(input_text):
    # Preprocess the input
    input_seq = preprocess_input(input_text, text_tokenizer, max_text_len)

    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Decode the sequence to emoji
    decoded_emoji = decode_sequence(states_value, decoder_model, emoji_tokenizer, max_emoji_len)

    return decoded_emoji

new_text = "Finally got that promotion at work! Feeling so proud and accomplished."
predicted_emoji = predict(new_text)
print("Predicted emoji sequence:", predicted_emoji)
display(predicted_emoji)