# Disclaimer

Released under the CC BY 4.0 License (https://creativecommons.org/licenses/by/4.0/)

Part of the code was inspired by the material presented [here](https://www.coursera.org/learn/sequence-models-in-nlp). Some of the material shown here is related to an assignment for that course. To avoid any ethical or legal problems, some of the code (e.g. the implemented loss function) is not shown in its entirety.

# Purpose of this notebook

The purpose of this notebook is to present the implementation I prepared for the task of identifying whether two short texts are related. For example, one can train a model to identify whether two questions are duplicates of one another, or whether two tweets are likely to come from the same author. The first task was the motivation for an assignment in the course cited above, while the second is the one motivating my own learning process. The presented implementation uses TensorFlow 2.3.0.

# Intro

The model is structured as a siamese model, where the input is composed of two separate instances of the texts to be analyzed. In this specific model, the similarity measure between input instances is implicitly given by the structure of the batches passed to the model (more on this optimization later on). The labels that we care about (e.g. question topic, tweet author) are not passed to the model. In this way we fit a generic model that, in principle, can predict the similarity of two inputs even if the model did not encounter that particular author or question topic at training time.

# Features

The input features for the model are the sequences of indexes obtained by tokenizing the input texts. The details of how these sequences are produced are not the focus of this notebook; however, an example based on the tweet author identification task is presented later on.

# Model definition
The model is a fairly simple example of a siamese model, where the same sequence of layers is applied to each of the two inputs. The resulting embeddings are then concatenated to form the actual model output, from which one can compute the similarity between the inputs. It is important to remark that both inputs are processed using the same parameter values, since the layer instances are the same.

The core of the model is represented by the siamese sequential part, composed of:
1. an embedding layer
2. a bidirectional LSTM sequence of cells
3. a flattening layer

The last layer simply concatenates the results obtained by the sequential part, applied to each input.
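The weight sharing described above can be illustrated with a framework-free sketch. This is not the notebook's model: a single `tanh` projection stands in for the embedding and LSTM layers, and the names `shared_weights`, `encode` and `siamese_forward` are hypothetical. The point is only that one set of parameters encodes both inputs, and that the output is the concatenation of the two branches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the shared encoder parameters:
# one weight matrix, reused for both inputs.
shared_weights = rng.normal(size=(5, 4))  # (sequence_length, embedding_size)


def encode(batch):
    """Project a (batch_size, sequence_length) batch with the shared weights."""
    return np.tanh(batch @ shared_weights)


def siamese_forward(batch_a, batch_b):
    """Encode both inputs with the same parameters and concatenate along axis 1."""
    return np.concatenate([encode(batch_a), encode(batch_b)], axis=1)


batch_a = rng.normal(size=(3, 5))
batch_b = rng.normal(size=(3, 5))

output = siamese_forward(batch_a, batch_b)
print(output.shape)  # (3, 8): two branches of size 4, side by side

# Feeding the same batch to both branches yields two identical halves,
# because there is only one set of parameters.
same_input_output = siamese_forward(batch_a, batch_a)
halves_match = np.allclose(same_input_output[:, :4], same_input_output[:, 4:])
print(halves_match)
```

Because the two halves of the output come from the same encoder, any downstream similarity measure compares representations that live in the same space.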

In [None]:
#base imports
import os
import re
import string

#to avoid GPU-related problems when using masking features
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

from tensorflow.keras import Input, layers
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

import pandas as pd
import numpy as np

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stopwords_english = stopwords.words('english') 
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

In [None]:
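def get_model(model_dimension=32, sequence_length=5, n_tokens=100):
    # the defaults for sequence_length and n_tokens mirror the defaults
    # used by the generator and by init_tokenizer defined below

    # each input is a batch of fixed-length sequences of token indexes
    a_sequence = Input(shape=(sequence_length,), name="input_a_sequence", dtype=tf.float32)

    b_sequence = Input(shape=(sequence_length,), name="input_b_sequence", dtype=tf.float32)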

    lstm_seq = tf.keras.Sequential()
    lstm_seq.add(layers.Embedding(n_tokens, model_dimension, input_length=sequence_length, mask_zero=True))
    lstm_seq.add(layers.Bidirectional(layers.LSTM(model_dimension)))
    lstm_seq.add(layers.Flatten())

    a_branch = lstm_seq(a_sequence)

    b_branch = lstm_seq(b_sequence)


    inputs = [
        a_sequence,
        b_sequence
    ]

    concatenated_branches_layer = layers.Concatenate(axis=1, name="branches")([a_branch, b_branch])

    model = tf.keras.Model(
        inputs=inputs,
        outputs=[concatenated_branches_layer]
    )

    return model

## Batch structure and loss function definition

The basic idea is to use a triplet loss function that extends the relation between input entries to the entire batch of data. In essence, if $A, B$ are two input batches of shape $(batch\_size, sequence\_size)$, we want the entries at the same row position, $(A^{(j)}, B^{(j)})$, to be positive examples, and every pair taken from different row positions, $(A^{(j)}, B^{(k)})$ with $k \neq j$, to be a negative example. We want to build batches where the resulting representations of similar examples are much closer than the resulting representations of dissimilar examples: $d(repr(A^{(j)}), repr(B^{(j)})) + \alpha < d(repr(A^{(j)}), repr(B^{(k)}))$ where $k \neq j$, $A^{(j)}, B^{(j)}, B^{(k)}$ are the anchor, positive and negative examples respectively, and $\alpha$ is a margin constant (a hyperparameter of the model) that pushes the model to learn to actually represent different instances in a different way.

If batches are built this way, we can define a loss function that takes into consideration all possible pairs in a batch. We can measure the similarity between two of these resulting embeddings using the cosine similarity metric, and produce a matrix of cosine similarities for all possible pairs in the batch. The function described in [week4](https://www.coursera.org/learn/sequence-models-in-nlp) uses this matrix to measure an average loss computed using the similarities of the positive pairs and the negative pairs.
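To make the batch structure concrete, here is a minimal NumPy sketch of the similarity matrix and of the margin condition. It is not the loss function from the course, which is deliberately omitted; the toy embeddings and the helper name `cosine_similarity_matrix` are assumptions made for illustration. Since cosine similarity grows as two representations get closer, the margin condition is restated here as $s(A^{(j)}, B^{(j)}) - \alpha > s(A^{(j)}, B^{(k)})$ for $k \neq j$.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the matrix whose (j, k) entry is the cosine similarity of a[j] and b[k]."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


# Toy embeddings (illustrative values): row j of `a` and row j of `b` form a positive pair.
a = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
b = np.array([[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]])

similarities = cosine_similarity_matrix(a, b)

# Positive pairs sit on the main diagonal, negative pairs everywhere else.
positives = np.diag(similarities)
negative_mask = ~np.eye(len(a), dtype=bool)

alpha = 0.25

# For every anchor j and every negative k != j, check the margin condition.
# Diagonal cells are not negatives, so they are marked as satisfied.
satisfied = (positives[:, None] - alpha > similarities) | ~negative_mask

print(np.round(similarities, 3))
print(satisfied.all())
```

A batch of size $n$ therefore provides $n$ positive pairs and $n(n-1)$ negative pairs without any explicit label being passed to the model.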

In [None]:
#loss function
def get_loss_function(alpha=0.25, batch_size=64):

    identity_matrix = tf.eye(batch_size)

    def _loss_fn(y_true, y_pred):
        """
            y_pred: the data computed by the last layer of the model.
                    In my tf2 model, the last layer concatenates the two resulting embeddings
        """
        branch_size = y_pred.shape[1] // 2
        a_branch = y_pred[:, 0:branch_size]
        b_branch = y_pred[:, branch_size:]
        ###
        # the rest of the implementation is mostly identical to the 
        # loss function that has to be implemented in [week4]
        # apart from the different modules in which helper functions
        # can be found: np, tf.keras.backend and tf.linalg instead of
        # the fastnp implementations in trax
        ###
        loss = None
        return loss
    return _loss_fn

## Generator for the tweet author identification task

The following utility functions are used by the generator function to build the batched inputs. Using these functions, the generator transforms the input sentences by preprocessing them with stop-word filtering and stemming, and then converting them into sequences of indexes using a tokenizer.

In [None]:
def init_tokenizer(df, n_tokens=100):

    token_generator = TweetTokenizer(
        preserve_case=False,
        reduce_len=False,
        strip_handles=False
    )

    texts = [token_generator.tokenize(normalize_text(text)) for text in df["text"].values]

    tokenizer = Tokenizer(oov_token="OOV", num_words=n_tokens)

    tokenizer.fit_on_texts(texts)

    return tokenizer


def normalize_text(text):

    text = text.replace("\n", " ").lower()

    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r'[,!?;-]+', '.', text)
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)

    text = re.sub(r"@[\w\-\d]+", "", text)

    text = re.sub(r"#([\w\-\d]+)", "\\1", text)

    stemmer = PorterStemmer()
    premade_tokenizer = TweetTokenizer(preserve_case=False)

    filtered_words = []

    for word in premade_tokenizer.tokenize(text):
        if(
            word not in stopwords_english
            and word not in string.punctuation
        ):
            stem_word = stemmer.stem(word)
            filtered_words.append(stem_word)

    return " ".join(filtered_words)


def get_normal_tokenized_text_sequence(tokenizer, text, length):
    oov_index = tokenizer.texts_to_sequences(['OOV'])[0][0]

    tokenized_sequence = tokenizer.texts_to_sequences([text])[0][0:length]

    if(len(tokenized_sequence) < length):
        tokenized_sequence.extend([oov_index] * (length - len(tokenized_sequence)))

    remapped_oov_index = 0

    # remapping the default OOV index (1) to 0, to take advantage of the masking capability
    # of the Embedding layer
    tokenized_sequence = [(v if v != oov_index else remapped_oov_index) for v in tokenized_sequence]

    return tokenized_sequence


def get_instance_features(text_a, text_b, tokenizer, sequence_length):
    text_a = normalize_text(text_a)

    text_b = normalize_text(text_b)

    instance_features = {
        "input_a_sequence": get_normal_tokenized_text_sequence(tokenizer, text_a, length=sequence_length),
        "input_b_sequence": get_normal_tokenized_text_sequence(tokenizer, text_b, length=sequence_length),
    }

    return instance_features


def get_example_dataframe():
    data = [
        {'handle': 'h4', 'text': 'hey this is a completely unrelated text 0'},
        {'handle': 'h4', 'text': 'hey this is a completely unrelated text 1'},
        {'handle': 'h4', 'text': 'hey this is a completely unrelated text 2'},
        {'handle': 'h3', 'text': 'another handle is writing 0'},
        {'handle': 'h3', 'text': 'another handle is writing 1'},
        {'handle': 'h3', 'text': 'another handle is writing 2'},
        {'handle': 'h2', 'text': 'a very different type of text 0'},
        {'handle': 'h2', 'text': 'a very different type of text 1'},
        {'handle': 'h2', 'text': 'a very different type of text 2'},
        {'handle': 'h1', 'text': 'this is a test tweet 0'},
        {'handle': 'h1', 'text': 'this is a test tweet 1'},
        {'handle': 'h1', 'text': 'this is a test tweet 2'}
    ]
    return pd.DataFrame(data)


def get_random_tweet_for_handle(df, handle, other_text=None):

    go_on = True
    max_attempts = 10
    attempts = 0
    while(go_on):
        result = normalize_text(df[df["handle"] == handle]["text"].sample().iloc[0])
        attempts += 1
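        # retry while the sampled text is empty or identical to other_text,
        # giving up after max_attempts attempts
        go_on = (
            (
                result in [None, '']
                or (other_text is not None and result == other_text)
            )
            and attempts < max_attempts
        )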
    
    return result

The following is a possible implementation of the generator function that can be used when training the model. The returned value is a batch consisting of the sequences of indexes corresponding to the tokenized inputs. Note that the second element of the returned pair, which plays the role of the $y\_true$ labels, is not actually used by the loss function: positive and negative pairs are determined by the row positions within the batch.

In [None]:
def generator_siamese_network(batch_size=4, sequence_length=5):
    df = get_example_dataframe()

    tokenizer = init_tokenizer(df)

    features = {
        "input_a_sequence": [],
        "input_b_sequence": []
    }

    output = np.full((batch_size, batch_size), -1)
    np.fill_diagonal(output, 1)

    while(True):
        features = {
            "input_a_sequence": [],
            "input_b_sequence": []
        }

        #getting a list of unique authors (handles)
        handles = df.handle.unique()
        np.random.shuffle(handles)
        handles = handles[0:batch_size]

        for handle in handles:
            tweet_a = get_random_tweet_for_handle(df, handle)
            tweet_b = get_random_tweet_for_handle(df, handle, tweet_a)

            instance_features = get_instance_features(tweet_a, tweet_b, tokenizer, sequence_length=sequence_length)

            for instance_key, instance_value in instance_features.items():
                features[instance_key].append(instance_value)

        features["input_a_sequence"] = np.array(features["input_a_sequence"], dtype=np.float32)
        features["input_b_sequence"] = np.array(features["input_b_sequence"], dtype=np.float32)

        yield(features, output)

# Model training

When compiling the model, an instance of the loss function must be created and passed as a parameter. In this case, I used a dictionary to link the loss function to the named output layer of the model, which can be useful if more than one output layer is specified. Given the definition of this model, tracking accuracy at this stage is not useful: the last layer outputs the embeddings of the input sequences, not a prediction. The accuracy of the model will be evaluated later on.

In [None]:
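batch_size = 4
sequence_length = 5
steps_per_epoch = 128
epochs = 1
alpha = 0.25
learning_rate = 0.001

# build the siamese model defined above
model = get_model(sequence_length=sequence_length)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate),
    loss={
        "branches": get_loss_function(
            batch_size=batch_size,
            alpha=alpha
        )
    },
    metrics=[]
)

# instantiate the generator: model.fit needs the generator object, not the function
batch_generator = generator_siamese_network(batch_size=batch_size, sequence_length=sequence_length)

history = model.fit(
    batch_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=epochs
)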

# Model evaluation

The model defined here can be evaluated in terms of the similarity between the produced embeddings. To compute this metric, we simply take the prediction results for an input batch and compute the cosine similarity of the resulting embeddings, after normalization. If everything worked as expected, the resulting matrix will have its highest values (row-wise) on the main diagonal, where values are expected to be close to 1. The other cells will hopefully contain lower values, possibly negative ones. To make a prediction, one can set and tune a threshold: if the value is over that threshold, the inputs are predicted as similar; otherwise the result is negative. A second threshold could be used so that values in between the two thresholds are interpreted as undetermined.
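The two-threshold decision can be sketched as follows. This is only an illustration under stated assumptions, not the evaluation code of the notebook: the helper names `similarity_from_concatenated` and `decide`, the threshold values and the toy output are all hypothetical. The sketch assumes that the model output holds the two embeddings side by side, as produced by the concatenation layer.

```python
import numpy as np


def similarity_from_concatenated(model_output):
    """Split a concatenated output into its two branches and compare them.

    Assumes the first half of each row is the embedding of input A
    and the second half is the embedding of input B.
    """
    branch_size = model_output.shape[1] // 2
    a = model_output[:, :branch_size]
    b = model_output[:, branch_size:]
    # Normalize each row to unit length, so the dot product is the cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def decide(similarities, lower=0.25, upper=0.75):
    """Label each pair as similar, different, or undetermined (between thresholds)."""
    decisions = np.full(similarities.shape, "undetermined", dtype=object)
    decisions[similarities >= upper] = "similar"
    decisions[similarities < lower] = "different"
    return decisions


# Toy output for a batch of two pairs with embeddings of size 2 (illustrative values).
toy_output = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
])

similarities = similarity_from_concatenated(toy_output)
decisions = decide(similarities)
print(np.round(similarities, 3))
print(decisions)
```

In practice both thresholds would be tuned on held-out pairs, trading off the number of undetermined answers against the error rate of the confident ones.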

In [None]:
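# the generator yields (features, labels) pairs: only the features are needed here
batch_features, _ = next(batch_generator)
model_result = model.predict(batch_features)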

threshold = 0.75

def compute_predictions(model_result):
    #using normalization, dot product, compute the similarity matrix
    return None

similarity_matrix_values = compute_predictions(model_result)

similarity_decision = similarity_matrix_values >= threshold

# Final remarks

The first task (similarity between questions) turned out to be much easier for the model to learn than the second task (tweet author identification). A possible explanation is that the text sequences in the question dataset are longer and arguably contain more information than short, possibly unrelated tweets retrieved from the Twitter platform. Also, in this particular setting, emojis and special characters are not considered.