
Understanding the Models


Google Translate API

The first model is based on the Google Translate API. The code for translating the requests into Spanish can be found here, in the file google_translate.ipynb. The model is built using the googletrans library.

Note: Only version 4.0.0-rc1 works because of the ever-changing Google API rules. The command for installing this particular version can be found in the code and the documentation.

Working

  • The library contains the Translator class, which calls the Google Translate API.
  • The translate method within the class takes the text to be translated, the source language, and the target language as parameters.
  • It returns a result object whose text attribute holds the translated string, as shown in the sketch below.
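Assuming the googletrans==4.0.0-rc1 release noted above, a minimal usage sketch looks like this (the example sentence is illustrative; "en" and "es" are the ISO codes for English and Spanish):

```python
from googletrans import Translator

# Client that wraps the Google Translate web API.
translator = Translator()

# Translate an English request into Spanish; src/dest take ISO language codes.
result = translator.translate("Where is the nearest hospital?", src="en", dest="es")

print(result.text)  # the translated string, e.g. "¿Dónde está el hospital más cercano?"
```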

Sequence-to-Sequence Transformer

The second model is Transformer-based. The code for translating the requests into Spanish can be found here, in the file english_to_spanish.ipynb. The model is built mainly with the tensorflow library. The sequence-to-sequence Transformer consists of a TransformerEncoder and a TransformerDecoder chained together. To make the model aware of word order, a PositionalEmbedding layer is also used.

Note: The complete list of requirements can be found in the libraries required section in this part of the Wiki.

Working

The code loosely follows a Keras tutorial that can be found here. The basic steps that constitute the model are as follows:

  1. Text Vectorization
  2. TransformerEncoder Layer Implementation
  3. TransformerDecoder Layer Implementation
  4. PositionalEmbedding Layer Implementation

Text Vectorization

Using the TextVectorization layer from Keras, we create two TextVectorization layers (one for English and one for Spanish). These layers turn each original string into a sequence of integers, where each integer is the index of a word in a vocabulary.

The English layer will use the default string standardization (strip punctuation characters) and splitting scheme (split on whitespace). In contrast, the Spanish layer will use a custom standardization, where we add the character "¿" to the set of punctuation characters to be stripped.
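The snippet below sketches how the two layers can be defined, loosely following the Keras tutorial; the vocabulary size and sequence length are illustrative placeholders rather than the notebook's exact settings.

```python
import re
import string

import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Illustrative hyperparameters (the notebook may use different values).
vocab_size = 15000
sequence_length = 20

# Strip the usual punctuation plus the Spanish inverted question mark.
strip_chars = string.punctuation + "¿"

def spanish_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

# English: default standardization and whitespace splitting.
eng_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Spanish: custom standardization; one extra position so the target
# sequence can be offset by one token during training.
spa_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=spanish_standardization,
)

# The vocabularies are learned from the training text with adapt(), e.g.:
# eng_vectorization.adapt(train_english_texts)
# spa_vectorization.adapt(train_spanish_texts)
```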

TransformerEncoder Layer Implementation

The source sequence is passed to the TransformerEncoder, which produces a new representation of the sequence.
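A condensed sketch of such an encoder layer is given below; the constructor arguments (embed_dim, dense_dim, num_heads) are placeholders, and the notebook's own class may differ in detail.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    """Self-attention block followed by a feed-forward block,
    each wrapped in a residual connection and layer normalization."""

    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs):
        # Every source position attends to every other source position.
        attention_output = self.attention(query=inputs, value=inputs, key=inputs)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)
```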

TransformerDecoder Layer Implementation

The representation created in the previous layer is now passed to the TransformerDecoder layer with the target sequence so far (target words 0 to N). The TransformerDecoder will then seek to predict the following words in the target sequence (N+1 and beyond).
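A comparable sketch of the decoder layer is given below; the causal mask built inside call is what keeps target position i from attending to positions after i, and the layer sizes are again placeholders rather than the notebook's exact values.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerDecoder(layers.Layer):
    """Masked self-attention over the target, cross-attention over the
    encoder output, then a feed-forward block, each with a residual connection."""

    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.self_attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.cross_attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()

    def call(self, inputs, encoder_outputs):
        # Causal mask: target position i may only attend to positions 0..i.
        seq_len = tf.shape(inputs)[1]
        row = tf.range(seq_len)[:, tf.newaxis]
        col = tf.range(seq_len)[tf.newaxis, :]
        causal_mask = (row >= col)[tf.newaxis, :, :]  # shape (1, T, T), broadcast over the batch

        attention_1 = self.self_attention(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_1)

        # Cross-attention: the decoder attends to the encoder's representation.
        attention_2 = self.cross_attention(
            query=out_1, value=encoder_outputs, key=encoder_outputs
        )
        out_2 = self.layernorm_2(out_1 + attention_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)
```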

PositionalEmbedding Layer Implementation

The PositionalEmbedding layer makes the model aware of the order of words in a sequence, since the attention layers by themselves are order-agnostic. Separately, because the TransformerDecoder sees the entire target sequence at once, causal masking ensures that it only uses information from target tokens 0 to N when predicting token N+1 (otherwise it could use information from the future, which would result in a model that cannot be used at inference time).
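A minimal sketch of such a layer, using two learned Embedding layers (one for token identity, one for position) in the style of the Keras tutorial:

```python
import tensorflow as tf
from tensorflow.keras import layers

class PositionalEmbedding(layers.Layer):
    """Sum of a learned token embedding and a learned position embedding."""

    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=embed_dim)

    def call(self, inputs):
        # inputs: integer token ids of shape (batch, sequence_length)
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        # Broadcasting adds the same position vector to every sequence in the batch.
        return embedded_tokens + embedded_positions
```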


Transformers Explained

To understand the Transformer model, it is essential to understand the concept and mechanism of attention. The Transformer architecture follows an encoder-decoder structure but does not rely on recurrence or convolutions to generate its output.
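At the core of the architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The self-contained NumPy sketch below (independent of the notebook) shows the computation for one set of queries, keys, and values:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V, as in "Attention Is All You Need"."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the values
```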

Architecture

Figure: the encoder-decoder structure of the Transformer architecture, taken from “Attention Is All You Need”.

In a nutshell, the task of the encoder, on the left half of the Transformer architecture, is to map an input sequence to a sequence of continuous representations, which is then fed into a decoder.

The decoder, on the right half of the architecture, receives the output of the encoder together with the decoder output at the previous time step, to generate an output sequence.

Working

The Transformer model runs as follows:

  • Each word forming an input sequence is transformed into a d-dimensional embedding vector.
  • Each embedding vector representing an input word is augmented by adding, element-wise, a positional encoding vector of the same length, thereby introducing positional information into the input (a sketch of the paper's sinusoidal encoding is given after this list).
  • The augmented embedding vectors are fed into the encoder block, which consists of two sublayers (multi-head self-attention and a position-wise feed-forward network). Since the encoder attends to all words in the input sequence, irrespective of whether they precede or succeed the word under consideration, the Transformer encoder is bidirectional.
  • The decoder receives as input its own predicted output word at time step t − 1.
  • The input to the decoder is also augmented by positional encoding, in the same manner as on the encoder side.
  • The augmented decoder input is fed into the three sublayers that make up the decoder block (masked multi-head self-attention, encoder-decoder attention, and a position-wise feed-forward network). Masking is applied in the first sublayer to stop the decoder from attending to succeeding words. In the second sublayer, the decoder also receives the output of the encoder, which allows it to attend to all of the words in the input sequence.
  • The output of the decoder finally passes through a fully connected layer, followed by a softmax layer, to generate a prediction for the next word of the output sequence.
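For reference, the original paper injects word order through fixed sinusoidal positional encodings, whereas the notebook's PositionalEmbedding layer learns them; a small NumPy sketch of the paper's scheme is:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed positional encodings from "Attention Is All You Need":
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions use cosine
    return encoding

# Each d-dimensional word embedding is summed element-wise with its row of this matrix.
```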

For a detailed understanding, refer to this blog post by Stefania Cristina.