<a href="https://colab.research.google.com/github/mmonch/Sidecar_Project/blob/main/notebooks/Sidecar_Project_Word_level_seq2seq_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a id='Q0'></a>
<center><a target="_blank" href="https://sit.academy/"><img src="https://drive.google.com/uc?id=1z0U84GYqhbWWpCenFajh8_8XFRGyOc3U" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>
<center> <h1> Notebook 4: Word-level Seq2Seq RNN Model </h1> </center>
<p style="margin-bottom:1cm;"></p>
<center><h4>Marlies Monch, SIT Academy, 2022</h4></center>
<p style="margin-bottom:1cm;"></p>

<div style="background:#EEEDF5;border-top:0.1cm solid #EF475B;border-bottom:0.1cm solid #EF475B;">
    <div style="margin-left: 0.5cm;margin-top: 0.5cm;margin-bottom: 0.5cm;color:#303030">
        <p><strong>Goal:</strong> Run a word-level seq2seq model (RNN) on the attribute technical names to match the attribute business names</p>
        <strong> Outline:</strong>
        <a id='P0' name="P0"></a>
        <ol>
            <li> <a style="color:#303030" href='#I'>Introduction </a> </li>
            <li> <a style="color:#303030" href='#SU'>Set up</a></li>
            <li> <a style="color:#303030" href='#DP'>Data Preparation</a></li>
            <li> <a style="color:#303030" href='#CT'>Compile and Train the Model</a></li>
            <li> <a style="color:#303030" href='#TP'>Test Data Performance</a></li>
            <li> <a style="color:#303030" href='#CL'>Conclusion</a></li>
        </ol>
        <strong>Keywords:</strong> data preprocessing, seq2seq, NLP, Sidecar attribute names, word-level.
    </div>
</div>
</nav>

<a id='I' name="I"></a>
## [Introduction](#P0)

Sources:

https://keras.io/examples/nlp/neural_machine_translation_with_transformer/

https://loeb.nyc/blog/data-science-word-expander

https://towardsdatascience.com/nlp-building-text-cleanup-and-preprocessing-pipeline-eba4095245a0

https://towardsdatascience.com/guide-to-fine-tuning-text-generation-models-gpt-2-gpt-neo-and-t5-dc5de6b3bc5e

https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/

<a id='SU' name="SU"></a>
## [Set up](#P0)

###Package Installations

In [8]:
!pip install nb_black
!pip install contractions
!pip install textsearch
!pip install tqdm
!pip install --upgrade IPython



<IPython.core.display.Javascript object>

### Magics

In [9]:
# auto reload packages and modules when they are modified
%load_ext autoreload
%autoreload 2
# draw matplotlib plots in line
%matplotlib inline
# enforce PEP 8 code on jupyter lab ...
#%load_ext lab_black
# ... or jupyter notebook
%load_ext nb_black

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
The nb_black extension is already loaded. To reload it, use:
  %reload_ext nb_black


<IPython.core.display.Javascript object>

### Package Imports

In [10]:
import nltk
import pandas as pd
import numpy as np
import tqdm
import tensorflow as tf
import unicodedata
import re
import contractions
import sklearn
from tensorflow.keras import layers
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import TextVectorization

# from nltk get "punkt"
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

<IPython.core.display.Javascript object>

In [11]:
# fix random seed for reproducibility
seed = 42

# for numpy
np.random.seed(seed)
# for tenserflow.keras
tf.random.set_seed(seed)

<IPython.core.display.Javascript object>

### User-Dependent Variables

In [12]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


<IPython.core.display.Javascript object>

In [13]:
data = pd.read_csv("gdrive/My Drive/SIDECAR_P/data/Sidecar_Data_Sample.csv")

<IPython.core.display.Javascript object>

<a id='DP'></a>
## [Data Preparation](#P0)

First we will remove very abstract rows of data that would confuse the model. Ten we will pre-process the text by stripping underscores, excess white spaces etc. Next, we create a Dataframe paring each Attribute Technical Name with it's respective Attribute Business Name. 




### Text Pre-Processing

In [14]:
# remove P_AF18XXXX values for better performance
data_no_paf = data[data['Attribute_Technical_Name'].str.contains("P_AF")==False]

<IPython.core.display.Javascript object>

In [15]:
# preprocess and normalize Text

# in case text not english
def remove_accented_chars(text):
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  return text

# preprocessing
def pre_process_text(labels):
  norm_docs = []
  for string in tqdm.tqdm(labels):
    string = string.replace("_", " ")
    string = string.translate(string.maketrans("\n\t\r", "   "))
    string = remove_accented_chars(string) 
    # and inset a space where a number follows a letter et vice versa
    string = re.sub(r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)', ' ', string)
    # insert space where an uppercase letter follows a lowercase letter
    string = re.sub(r"(?<![A-Z\W])(?=[A-Z])", " ", string)
    string = contractions.fix(string)
    string = string.replace("-", " to ")
    # remove special characters or whitespaces
    string = re.sub(r"[^a-zA-Z0-9\s]", "", string, flags=re.I|re.A)
    string = string.lower()
    string = string.strip()
    # no splitting needed for this RNN
    # string = string.split(" ")
    norm_docs.append(string)
  return norm_docs

<IPython.core.display.Javascript object>

In [16]:
prep_tech = pre_process_text(data_no_paf["Attribute_Technical_Name"])


100%|██████████| 3941/3941 [00:00<00:00, 55763.50it/s]


<IPython.core.display.Javascript object>

In [17]:
prep_business = pre_process_text(data_no_paf["Attribute_Business_Name"])

100%|██████████| 3941/3941 [00:00<00:00, 49707.26it/s]


<IPython.core.display.Javascript object>

In [18]:
# add start and end tags for business labels so that they can be read by the RNN model
prep_business = ["[start] " + s + " [end]" for s in prep_business]

<IPython.core.display.Javascript object>

In [None]:
prep_business

### Create Training Dataset

In [20]:
# parse text into text pairs
tech = pd.DataFrame(prep_tech, columns =["tech_name"], dtype="string")
busi = pd.DataFrame(prep_business, columns =["busi_name"], dtype="string")

attribute_df = pd.concat([tech,busi], axis=1)
attribute_df

Unnamed: 0,tech_name,busi_name
0,id,[start] technical id of the patient [end]
1,gndr cd,[start] gender code [end]
2,livg arngmnt cd,[start] living arrangement [end]
3,mrtl stus cd,[start] marital status code [end]
4,ocupatn cd,[start] occupation code [end]
...,...,...
3936,config asset list data source,[start] config asset list data source [end]
3937,attribute sample data,[start] attribute sample data [end]
3938,property is dq,[start] property is dq [end]
3939,property dq calculation,[start] property dq calculation [end]


<IPython.core.display.Javascript object>

### Train Test Split

In [21]:
# shuffle and define labels
attribute_df = attribute_df.sample(frac=1, random_state=42)
# X = attribute_df["tech_name"]
# Y = attribute_df["busi_name"]

# train test split
# x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.2, train_size=None, random_state=42)

num_val_samples = int(0.15 * len(attribute_df))
num_train_samples = len(attribute_df) - 2 * num_val_samples
train_pairs = attribute_df[:num_train_samples]
val_pairs = attribute_df[num_train_samples : num_train_samples + num_val_samples]
test_pairs = attribute_df[num_train_samples + num_val_samples :]

print(f"{len(attribute_df)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

3941 total pairs
2759 training pairs
591 validation pairs
591 test pairs


<IPython.core.display.Javascript object>

### Text Vectorization

In [22]:
# Vectorize the data

vocab_size = 15000
sequence_length = 20
batch_size = 64

tech_vectorization = TextVectorization(
    max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length,
)
busi_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
)
train_tech_texts = [pair[0] for pair in train_pairs]
train_busi_texts = [pair[1] for pair in train_pairs]
tech_vectorization.adapt(train_tech_texts)
busi_vectorization.adapt(train_busi_texts)

<IPython.core.display.Javascript object>

In [None]:
train_pairs = train_pairs.to_records(index=False)
train_pairs = list(train_pairs)


In [None]:
val_pairs = val_pairs.to_records(index=False)
val_pairs = list(val_pairs)

In [None]:
val_pairs

In [26]:
# function to format datasets and vectorize them
def format_dataset(tech, busi):
    tech = tech_vectorization(tech)
    busi = busi_vectorization(busi)
    return ({"encoder_inputs": tech, "decoder_inputs": busi[:, :-1],}, busi[:, 1:])

# function to make dataset pairs and shuffle the data
def make_dataset(pairs):
    tech_texts, busi_texts = zip(*pairs)
    tech_texts = list(tech_texts)
    busi_texts = list(busi_texts)
    dataset = tf.data.Dataset.from_tensor_slices((tech_texts, busi_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.shuffle(2048).prefetch(16).cache()

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

<IPython.core.display.Javascript object>

In [27]:
# encoder and decoder input shapes
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 20)
inputs["decoder_inputs"].shape: (64, 20)
targets.shape: (64, 20)


<IPython.core.display.Javascript object>

<a id='CT'></a>
## [Compile and Train the Model](#P0)

In [28]:
# Build the model
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, tf.newaxis, :], dtype="int32")
        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)


class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super(PositionalEmbedding, self).__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(latent_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)

        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)

<IPython.core.display.Javascript object>

In [29]:
embed_dim = 256
latent_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

decoder_outputs = decoder([decoder_inputs, encoder_outputs])
transformer = keras.Model(
    [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
)

<IPython.core.display.Javascript object>

In [30]:
# Compile the model
epochs = 60  # This should be at least 30 for convergence

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
# train the model 
transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)
# save model
transformer.save("transformer_RNN_attribute_labels_class_weights")

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 encoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 positional_embedding (Position  (None, None, 256)   3845120     ['encoder_inputs[0][0]']         
 alEmbedding)                                                                                     
                                                                                                  
 decoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 transformer_encoder (Transform  (None, None, 256)   3155456     ['positional_embedding[



INFO:tensorflow:Assets written to: transformer_RNN_attribute_labels_class_weights/assets


INFO:tensorflow:Assets written to: transformer_RNN_attribute_labels_class_weights/assets
  layer_config = serialize_layer_fn(layer)
  return generic_utils.serialize_keras_object(obj)


<IPython.core.display.Javascript object>

<a id='TP'></a>
## [Test Data Performance](#P0)

In [31]:
busi_vocab = busi_vectorization.get_vocabulary()
busi_index_lookup = dict(zip(range(len(busi_vocab)), busi_vocab))
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
    tokenized_input_sentence = tech_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = busi_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])

        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = busi_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token

        if sampled_token == "[end]":
            break
    return decoded_sentence


test_tech_texts = [pair[0] for pair in val_pairs]
for _ in range(len(test_pairs)):
    input_sentence = np.random.choice(test_tech_texts)
    translated = decode_sequence(input_sentence)
    print(translated)

[start] [UNK] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]  [UNK]  [UNK] 
[start] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]
[start] [UNK] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]  [UNK]  [UNK] 
[start] [UNK] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]  [UNK]  [UNK] 
[start] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]  [UNK]  [UNK]  [UNK]
[start] [UNK] [UNK] [UNK] [UNK] [UNK]  [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK]    [UNK] [UNK] [UNK]
[start] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]
[start] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]
[start] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] 

<IPython.core.display.Javascript object>

<a id='CL'></a>
## [Conclusion](#P0)

Although the model has acceptablee Accuracy of .86, the predicions made on the test data set by the model are all unclear. Thus, this model is unable to generate any useful translations of the Technical Names into the Attribute Business Names. 

In Sum, the character-level Sequence to Sequence Model is the best performing nlp model for this task.

<div style="border-top:0.1cm solid #EF475B"></div>
    <strong><a href='#Q0'><div style="text-align: right"> <h3>End of this Notebook.</h3></div></a></strong>