# CS 195: Natural Language Processing
## Encoder-Decoder Architectures

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F6_3_EncoderDecoder.ipynb)

## Reference

SLP: RNNs and LSTMs, Chapter 9 of Speech and Language Processing by Daniel Jurafsky & James H. Martin https://web.stanford.edu/~jurafsky/slp3/9.pdf

A ten-minute introduction to sequence-to-sequence learning in Keras: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

Character-level recurrent sequence-to-sequence model: https://keras.io/examples/nlp/lstm_seq2seq/

In [1]:
import sys
!{sys.executable} -m pip install datasets keras tensorflow transformers

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m521.2/521.2 kB[0m [31m8.2 MB/s[0m eta [36m0:00:00[0m
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m15.9 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m18.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


## Last time: RNN Language Model

We used recurrent neural networks for *language modeling* - predicting the next word.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_languagemodeling.png?raw=1" width=700>
</div>


image source: SLP Fig. 9.6, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## RNN for Sequence Classification

We could also use the last hidden state an RNN as an input to a regular feed-forward network and do classification of a whole sequence.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_classification.png?raw=1" width=700>
</div>


image source: SLP Fig. 9.8, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## RNN Sequence Labeling

RNNs are also good for **sequence labeling** when the output is a squence corresponding 1:1 with the input words, like part-of-speech tagging.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_sequence_labeling.png?raw=1" width=700>
</div>


image source: SLP Fig. 9.7, https://web.stanford.edu/~jurafsky/slp3/9.pdf

### Discussion Question

What sequence-to-sequence NLP tasks can you think of where the input and target sequences don't match up word-for word?

## Encoder-Decoder Architecture

**Encoder RNN:** Takes input sequences, produces a context vector

**Context Vector:** Contains essence of the input sequence

**Decoder RNN:** Takes context vector as input, generates an output sequence

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/encoder-decoder.png?raw=1" width=700>
</div>


image source: SLP Fig. 9.16, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## Encoder-Decoder usage

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/encoder-decoder_detail.png?raw=1" width=800>
</div>


image source: SLP Fig. 9.18, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## Text2Emoji Dataset

Here is a fun dataset that has short sequences of text along with a sequece of emojis corresponding to the task
* This is kind of like translation
* This is kind of like summarization

In [2]:
from datasets import load_dataset

dataset = load_dataset("KomeijiForce/Text2Emoji",split="train")

Downloading readme:   0%|          | 0.00/100 [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/66.9M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [3]:
# Define a function to check if 'text' is not None
def is_not_none(example):
    return example['text'] is not None

# Filter the dataset
dataset = dataset.filter(is_not_none)

Filter:   0%|          | 0/503687 [00:00<?, ? examples/s]

In [4]:
dataset["text"][46]

'Going green has never been trendier! Drive around in style with a lineup of eco-friendly electric cars.'

In [5]:
dataset["emoji"][46]

'♻️🚗✨💐🍃🌱'

In [6]:
len(dataset)

503682

### Importing libraries we'll need

In [5]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, SimpleRNN, Embedding, Dense, TimeDistributed
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical

### Setting up some tokenizers

In this case, we'll create two different tokenizers
* texts need to be tokenized as words
* emojis need to be tokenized as characters
* might be similar if you translate between different languages
* some problems might be able to use the same tokenizer

In [6]:
# Parameters
max_text_len = 20
max_emoji_len = 10


#texts = dataset["text"][:5000]
#emoji = dataset["emoji"][:5000]

texts = dataset["text"]
emoji = dataset["emoji"]

#add \t and \n as start and ending tokens for the emoji
for idx in range(len(emoji)):
    emoji[idx] = "\t"+emoji[idx]+"\n"

# Tokenize text
text_tokenizer = Tokenizer()

text_tokenizer.fit_on_texts(texts)
text_sequences = text_tokenizer.texts_to_sequences(texts)
text_sequences = pad_sequences(text_sequences, maxlen=max_text_len, padding='post')
text_vocab_size = len(text_tokenizer.word_index) + 1
print("text_vocab_size",text_vocab_size)



# Tokenize emojis
emoji_tokenizer = Tokenizer(char_level=True,filters="")
emoji_tokenizer.fit_on_texts(emoji)
emoji_sequences = emoji_tokenizer.texts_to_sequences(emoji)
emoji_sequences = pad_sequences(emoji_sequences, maxlen=max_emoji_len, padding='post')
emoji_vocab_size = len(emoji_tokenizer.word_index) + 1

#this might be something to try - then use categorical_crossentropy instead of sparse_categorical_crossentropy
#emoji_sequences_oh = to_categorical(emoji_sequences, num_classes=emoji_vocab_size)


print("emoji_vocab_size",emoji_vocab_size)





text_vocab_size 57073
emoji_vocab_size 1387


In [11]:
text_train, text_test, emoji_train, emoji_test = train_test_split(text_sequences,emoji_sequences)
print(text_test[0])
print(emoji_test[0])

[  12   26  534    3 1048   10  842    2  559 3294    0    0    0    0
    0    0    0    0    0    0]
[  2 154  23  25 129   3   0   0   0   0]


In [10]:
print(text_train[2])
print(emoji_train[2])

[  16  102 1289   84   33  137  203   12 4774  107  668  137 2645    0
    0    0    0    0    0    0]
[  2 177 148 118 467  18   3   0   0   0]


### Defining the Encoder

The **Encoder** contains
* an input layer with enough nodes for the largest text input
* an Embedding layer like usual
* a Recurrent layer
    - `return_state=True` means it will return both the **output** and the internal **state**
    
When training, we will ignore the *output* and just pass the *state* as the context vector
    
Notice that we don't use a `Sequential` model for this - it's going to have to be more flexible, so we explicitly compose each layer.
    

In [7]:
# Encoder
encoder_inputs = Input(shape=(max_text_len,))

enc_emb_layer = Embedding(input_dim=text_vocab_size, output_dim=100)
enc_emb = enc_emb_layer(encoder_inputs)

encoder_rnn = SimpleRNN(100, return_state=True)

encoder_outputs, state_h = encoder_rnn(enc_emb)

context_vector = [state_h]


### Defining the Decoder

The **Decoder** contains
* an input layer with `shape=(None,)` - this should make it flexible to allow for output text of many different lengths
* an Embedding layer like usual
* a recurrent layer - called with the context vector as the initial state
* an output layer for classifying which word is next in the sequence

In [8]:
# Decoder
decoder_inputs = Input(shape=(None,))

dec_emb_layer = Embedding(emoji_vocab_size, 100)
dec_emb = dec_emb_layer(decoder_inputs)

decoder_rnn = SimpleRNN(100, return_state=True, return_sequences=True)
decoder_outputs, _ = decoder_rnn(dec_emb, initial_state=context_vector) #ignore the returned states for now

decoder_dense = Dense(emoji_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [9]:
# Define the model that will turn
# `encoder_inputs` & `decoder_inputs` into `decoder_outputs`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [7]:
import pathlib
import random
import string
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(dense_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        attention_output = self.attention(query=inputs, value=inputs, key=inputs)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config


class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "vocab_size": self.vocab_size,
                "embed_dim": self.embed_dim,
            }
        )
        return config


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(latent_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.add = layers.Add()  # instead of `+` to preserve mask
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, use_causal_mask=True
        )
        out_1 = self.layernorm_1(self.add([inputs, attention_output_1]))

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
        )
        out_2 = self.layernorm_2(self.add([out_1, attention_output_2]))

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(self.add([out_2, proj_output]))

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "latent_dim": self.latent_dim,
                "num_heads": self.num_heads,
            }
        )
        return config

In [8]:
embed_dim = 256
latent_dim = 2048
num_heads = 8
vocab_size = text_vocab_size
sequence_length = 20
batch_size = 64

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

decoder_outputs = decoder([decoder_inputs, encoder_outputs])
transformer = keras.Model(
    [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
)

In [9]:
transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit([text_train, emoji_train],
          emoji_train,
          epochs=10,
          batch_size=64,
          validation_data=([text_test, emoji_test],emoji_test) )

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_inputs (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 positional_embedding (Posi  (None, None, 256)            1461580   ['encoder_inputs[0][0]']      
 tionalEmbedding)                                         8                                       
                                                                                                  
 decoder_inputs (InputLayer  [(None, None)]               0         []                            
 )                                                                                      

<keras.src.callbacks.History at 0x785aa4379e70>

In [12]:
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
model.fit([text_train, emoji_train],
          emoji_train,
          epochs=1,
          batch_size=64,
          validation_data=([text_test, emoji_test],emoji_test) )



<keras.src.callbacks.History at 0x7a2177b21060>

## Inference

In order to make predictions on new examples (inference), we need to separate the encoder and decoder models.

In [13]:
# Encoder model for inference
encoder_model = Model(encoder_inputs, context_vector)

# Decoder model for inference
decoder_state_input = Input(shape=(100,))  # This is the input state for the decoder
decoder_emb = dec_emb_layer(decoder_inputs)  # Embedding for decoder input

# Get the output sequence from the decoder RNN
decoder_outputs, decoder_state = decoder_rnn(decoder_emb, initial_state=[decoder_state_input])

# Apply the Dense layer to the output sequence
decoder_outputs = decoder_dense(decoder_outputs)

# Define the decoder model
# Note: The model only returns the output sequence, not the final state
decoder_model = Model([decoder_inputs, decoder_state_input], [decoder_outputs,decoder_state])



### Some functions for doing inference

The results here are not good - there are a number of reasons why this could be, and I hope we can explore ideas in class.

We will try this with some higher-power recurrent architectures next time.

In [11]:
def preprocess_input(input_text, text_tokenizer, max_text_len):
    # Tokenize the input text
    input_seq = text_tokenizer.texts_to_sequences([input_text])
    # Pad the sequence
    input_seq = pad_sequences(input_seq, maxlen=max_text_len, padding='post')
    return input_seq

def preprocess_target(input_text, text_tokenizer, max_text_len):
    # Tokenize the input text
    input_seq = emoji_tokenizer.texts_to_sequences([input_text])
    # Pad the sequence
    input_seq = pad_sequences(input_seq, maxlen=max_emoji_len, padding='post')
    return input_seq

#ChatGPT wrote this method
def sample(preds, temperature=1.0):
    # Apply softmax temperature
    #print("B:", preds)
    preds = np.asarray(preds).astype('float64')
    #print("A:",preds)
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    #print(exp_preds)
    preds = exp_preds / np.sum(exp_preds)
    #print("PL:",preds)
    # Sample a token with probabilities adjusted by the temperature
    probas = np.random.multinomial(1, preds,int(temperature))
    #print("L:",probas)
    #print("LL:",np.argmax(probas))
    return np.argmax(probas)

def decode_sequence(initial_state, decoder_model, emoji_tokenizer, max_emoji_len):
    # Start with a sequence containing just the start token index.
    target_seq = np.zeros((1, 1))
    start_token_index = emoji_tokenizer.word_index['\t']  # Assuming '\t' is the start token
    target_seq[0, 0] = start_token_index

    stop_condition = False
    decoded_sequence = ''
    state = initial_state

    while not stop_condition:
        # Predict the next token
        output_tokens, state = decoder_model.predict([target_seq, state])
        #print(output_tokens[0, -1, :])
        # Sample a token
        #sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token_index = sample(output_tokens[0, -1, :],temperature=2)
        #if sampled_token_index == 2:
        #  sampled_token_index = sample(output_tokens[0, -1, :].pop(2),temperature=2)
        if sampled_token_index == 0:  # Assuming 0 stands for the padding token
          break
        sampled_char = emoji_tokenizer.index_word.get(sampled_token_index, '')
        if sampled_char == '\n':  # Assuming '\n' is the stop token
            break
        decoded_sequence += sampled_char

        # Update the target sequence to the last predicted token
        target_seq = np.array([[sampled_token_index]])

        if len(decoded_sequence) > max_emoji_len:
            stop_condition = True

    return decoded_sequence


def predict(input_text):
    # Preprocess the input
    input_seq = preprocess_input(input_text, text_tokenizer, max_text_len)

    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Decode the sequence to emoji
    decoded_emoji = decode_sequence(states_value, decoder_model, emoji_tokenizer, max_emoji_len)

    return decoded_emoji

new_text = "Finally got that promotion at work! Feeling so proud and accomplished."
#predicted_emoji = predict(new_text)
#print(new_text)
#print("Predicted emoji sequence:", predicted_emoji)
#display(predicted_emoji)

In [12]:
def decode_sequence(input_sentence):
    tokenized_input_sentence = preprocess_input(input_sentence, text_tokenizer, max_text_len)
    decoded_sentence = "\t"
    for i in range(max_emoji_len-1):
        tokenized_target_sentence = preprocess_target(decoded_sentence, text_tokenizer, max_text_len)[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])
        print(type(predictions))
        sampled_token_index = np.argmax(predictions[0, i, :])
        print(sampled_token_index)
        sampled_token = emoji_tokenizer.index_word.get(sampled_token_index, '')

        print(sampled_token)
        decoded_sentence += " " + sampled_token

        if sampled_token == "\n":
            break
    return decoded_sentence


test_eng_texts = "Finally got that promotion at work! Feeling so proud and accomplished."
translated = decode_sequence(test_eng_texts)

print(translated)

2
	
2
	
2
	
2
	
2
	
2
	
2
	
2
	
2
	
	 	 	 	 	 	 	 	 	 	


## Applied Exploration

Try this code on another dataset for summarization, translation, etc.

Or, you can try a character-level encoding like in this reference: https://keras.io/examples/nlp/lstm_seq2seq/

Run the code for a little while and see if you can come up with some meaningful results

Write up a description of the data, what you tried, and what your results were.

In [9]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, SimpleRNN, Embedding, Dense, TimeDistributed, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
import keras
import pandas as pd
import random

In [10]:
# Create an empty df since this problem can use procedurally generated data
df = pd.DataFrame(columns = ['Description', 'Equation'])

#loop number
l_num = 1000
### Add 3 variable equations
for k in range(l_num):
  i = random.randrange(0, 1000)
  j = random.randrange(0, 1000)


  description_string = "Add " + str(i) + " and " + str(j)
  equation_string = str(i) + " + " + str(j)
  df.loc[len(df.index)] = [description_string, equation_string]

  description_string = "Add " + str(i) + " and " + str(j) + " together"
  equation_string = str(i) + " + " + str(j)
  df.loc[len(df.index)] = [description_string, equation_string]


  description_string = "Find the sum of " + str(i) + " and " + str(j)
  equation_string = str(i) + " + " + str(j)
  df.loc[len(df.index)] = [description_string, equation_string]


  description_string = str(i) + " plus " + str(j)
  equation_string = str(i) + " + " + str(j)
  df.loc[len(df.index)] = [description_string, equation_string]


  description_string = "Subtract " + str(i) + " from " + str(j)
  equation_string = str(j) + " - " + str(i)
  df.loc[len(df.index)] = [description_string, equation_string]


  description_string = str(i) + " minus " + str(j)
  equation_string = str(i) + " - " + str(j)
  df.loc[len(df.index)] = [description_string, equation_string]

  description_string = "Do " + str(i) + " times " + str(j)
  equation_string = str(i) + " * " + str(j)
  df.loc[len(df.index)] = [description_string, equation_string]


  description_string = str(i) + " times " + str(j)
  equation_string = str(i) + " * " + str(j)
  df.loc[len(df.index)] = [description_string, equation_string]


  description_string = "Find the product of " + str(i) + " and " + str(j)
  equation_string = str(i) + " * " + str(j)
  df.loc[len(df.index)] = [description_string, equation_string]


  description_string = "Divide " + str(i) + " by " + str(j)
  equation_string = str(i) + " / " + str(j)
  df.loc[len(df.index)] = [description_string, equation_string]


  description_string = str(i) + " divided by " + str(j)
  equation_string = str(i) + " / " + str(j)
  df.loc[len(df.index)] = [description_string, equation_string]

df.head()

Unnamed: 0,Description,Equation
0,Add 640 and 33,640 + 33
1,Add 640 and 33 together,640 + 33
2,Find the sum of 640 and 33,640 + 33
3,640 plus 33,640 + 33
4,Subtract 640 from 33,33 - 640


In [11]:
#loop number
l_num = 1000
### Add 3 variable equations
for i in range(l_num):
  a = random.randrange(0, 1000)
  b = random.randrange(0, 1000)
  c = random.randrange(0, 1000)
  description_string = "Add " + str(a) + " and " + str(b) + " then subtract" + str(c)
  equation_string = "(" + str(a) + " + " + str(b) + ") - " + str(c)
  df.loc[len(df.index)] = [description_string, equation_string]

for i in range(l_num):
  a = random.randrange(0, 1000)
  b = random.randrange(0, 1000)
  c = random.randrange(0, 1000)
  description_string = str(a) + " plus " + str(b) + " minus" + str(c)
  equation_string = "(" + str(a) + " + " + str(b) + ") - " + str(c)
  df.loc[len(df.index)] = [description_string, equation_string]

for i in range(l_num):
  a = random.randrange(0, 1000)
  b = random.randrange(0, 1000)
  c = random.randrange(0, 1000)
  description_string = str(a) + " minus " + str(b) + " plus" + str(c)
  equation_string = "(" + str(a) + " - " + str(b) + ") + " + str(c)
  df.loc[len(df.index)] = [description_string, equation_string]

for i in range(l_num):
  a = random.randrange(0, 1000)
  b = random.randrange(0, 1000)
  c = random.randrange(0, 1000)
  description_string = str(a) + " plus " + str(b) + " times " + str(c)
  equation_string = "(" + str(a) + " + " + str(b) + ") * " + str(c)
  df.loc[len(df.index)] = [description_string, equation_string]

for i in range(l_num):
  a = random.randrange(0, 1000)
  b = random.randrange(0, 1000)
  c = random.randrange(0, 1000)
  description_string = str(a) + " times " + str(b) + " times " + str(c)
  equation_string = "(" + str(a) + " * " + str(b) + ") * " + str(c)
  df.loc[len(df.index)] = [description_string, equation_string]

for i in range(l_num):
  a = random.randrange(0, 1000)
  b = random.randrange(0, 1000)
  c = random.randrange(0, 1000)
  description_string = str(a) + " times " + str(b) + " times " + str(c)
  equation_string = "(" + str(a) + " * " + str(b) + ") * " + str(c)
  df.loc[len(df.index)] = [description_string, equation_string]

for i in range(l_num):
  a = random.randrange(0, 1000)
  b = random.randrange(0, 1000)
  c = random.randrange(0, 1000)
  description_string = str(a) + " times " + str(b) + " divide " + str(c)
  equation_string = "(" + str(a) + " * " + str(b) + ") / " + str(c)
  df.loc[len(df.index)] = [description_string, equation_string]

for i in range(l_num):
  a = random.randrange(0, 1000)
  b = random.randrange(0, 1000)
  c = random.randrange(0, 1000)
  description_string = str(a) + " times " + str(b) + " divided by " + str(c)
  equation_string = "(" + str(a) + " * " + str(b) + ") / " + str(c)
  df.loc[len(df.index)] = [description_string, equation_string]

for i in range(l_num):
  a = random.randrange(0, 1000)
  b = random.randrange(0, 1000)
  c = random.randrange(0, 1000)
  description_string = str(a) + " times " + str(b) + " divide " + str(c)
  equation_string = str(c) + "/ (" + str(a) + " * " + str(b) + ")"
  df.loc[len(df.index)] = [description_string, equation_string]

for i in range(l_num):
  a = random.randrange(0, 1000)
  b = random.randrange(0, 1000)
  c = random.randrange(0, 1000)
  description_string = " Add " + str(a) + " and " + str(b) + " then multiply by " + str(c)
  equation_string =  "(" + str(a) + " + " + str(b) + ") *" + str(c)
  df.loc[len(df.index)] = [description_string, equation_string]

In [12]:
df = df.sample(frac=0.5)

In [13]:
# Parameters
max_desc_len = 20
max_eq_len = 10

from datasets.utils.logging import EmptyTqdm
desc = list(df["Description"])
eq = list(df["Equation"])


#add \t and \n as start and ending tokens for the emoji
for idx in range(len(eq)):
    eq[idx] = "\t " + eq[idx] + " \n"

target_characters = set()
for eq_text in eq:
  for char in eq_text:
      if char not in target_characters:
          target_characters.add(char)

input_characters = sorted(list(desc))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in desc])
max_decoder_seq_length = max([len(txt) for txt in eq])

print("Number of samples:", len(desc))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)

Number of samples: 10500
Number of unique input tokens: 10500
Number of unique output tokens: 19
Max sequence length for inputs: 37
Max sequence length for outputs: 21


In [14]:
# Tokenize descriptions
desc_tokenizer = Tokenizer()

desc_tokenizer.fit_on_texts(desc)
desc_sequences = desc_tokenizer.texts_to_sequences(desc)
desc_sequences = pad_sequences(desc_sequences, maxlen=max_desc_len, padding='post')
desc_vocab_size = len(desc_tokenizer.word_index) + 1
print("Description Vocab Size: ",desc_vocab_size)

# Tokenize equations
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])
#print(target_token_index)

encoder_input_data = np.zeros(
    (len(desc), max_encoder_seq_length, num_encoder_tokens), dtype="float32"
)
decoder_input_data = np.zeros(
    (len(desc), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)
decoder_target_data = np.zeros(
    (len(desc), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)

for i, (input_text, target_text) in enumerate(zip(desc_sequences, eq)):
    encoder_input_data[i, 0, input_text] = 1.0
    #encoder_input_data[i, 1 :, input_token_index[" "]] = 1.0
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
    decoder_target_data[i, t:, target_token_index[" "]] = 1.0

Description Vocab Size:  2197


In [15]:
batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [16]:
#enc_in_train, enc_in_test, dec_in_train, dec_in_test, dec_tar_train, dec_tar_test = train_test_split(encoder_input_data,decoder_input_data,decoder_target_data)
split = int((len(encoder_input_data)/6)*5)
enc_in_train = encoder_input_data[:split]
enc_in_test = encoder_input_data[split:]
dec_in_train = decoder_input_data[:split]
dec_in_test = decoder_input_data[split:]
dec_tar_train = decoder_target_data[:split]
dec_tar_test = decoder_target_data[split:]

In [18]:
print(len(enc_in_train))

8750


In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([enc_in_train, dec_in_train],
          dec_tar_train,
          epochs=1,
          batch_size=batch_size,
          validation_data=([enc_in_test, dec_in_test],dec_tar_test))

In [None]:
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

In [None]:
def preprocess_input(input_desc, desc_tokenizer, max_desc_len):
    # Tokenize the input text
    input_seq = desc_tokenizer.texts_to_sequences([input_desc])
    # Pad the sequence
    input_seq = pad_sequences(input_seq, maxlen=max_desc_len, padding='post')

    encoder_input_data = np.zeros(
    (len(desc), max_encoder_seq_length, num_encoder_tokens), dtype="float32"
    )

    for i, (input_text) in enumerate(desc_sequences):
      encoder_input_data[i, 0, input_text] = 1.0
    #print(encoder_input_data)
    return encoder_input_data

def decode_sequence(states_value):
    # Encode the input as state vectors.
    #states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ""
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0

        # Update states
        states_value = [h, c]
    return decoded_sentence

def predict(input_desc):

    # Preprocess the input
    input_seq = preprocess_input(input_desc, desc_tokenizer, max_desc_len)

    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Decode the sequence to an equation
    decoded_eq = decode_sequence(states_value)

    return decoded_eq

new_desc = "12 plus 14"
predicted_equation = predict(new_desc)
print(new_desc)
print("Predicted equation:", predicted_equation)
display(predicted_equation)