Source: https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt

# Decoder models

Decoder models use only the decoder of a Transformer model. <span style="color:blue">At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called <b><i>auto-regressive models</i></b></span>.

<span style="color:blue">The <b>pretraining</b> of decoder models usually revolves around <b>predicting the next word in the sentence.</b></span>

These models are <span style="color:blue">best suited for tasks involving <b>text generation</b></span>.

Representatives of this family of models include:
- [CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)
- [GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)
- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html)
- [Transformer XL](https://huggingface.co/transformers/model_doc/transfo-xl.html)

<img src="images/Decoder-Models-1.png" style="width:800px;" title="decoder models">

<span style="color:blue">The decoder outputs a numerical representation for each word in the input sequence: one sequence of numbers per input word (a.k.a. a feature vector; stacked together, these form the feature tensor). The dimension of the feature vector is defined by the architecture of the model.</span>

<span style="color:blue">Where the <b>decoder differs from the encoder is principally in its self-attention mechanism</b>: it uses <b>"masked self-attention"</b>.</span>

<span style="color:red">For example, for the word "to" in "Welcome to NYC", the vector is unaffected by the word "NYC" because all the words on the right (a.k.a. the right context) are masked. Rather than benefiting from all the words on the left and right (i.e., bidirectional context), decoders only have access to a single context: either the left context or the right context.</span> <span style="color:green">The <b>masked self-attention mechanism</b> differs from the self-attention mechanism by <b>using an additional mask to hide the context on one side of each word</b>. A word's numerical representation is not affected by the words in the hidden context.</span>
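
The masking described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not how any particular library implements it: the 4-dimensional word vectors are random, and the queries, keys and values are the input vectors themselves (no learned projections). It checks that replacing the vector for "NYC" leaves the output for "to" unchanged when the right context is masked.

```python
import numpy as np

def self_attention(x, causal=True):
    """Single-head self-attention over x of shape (seq_len, d_model).

    Assumption of this sketch: queries, keys and values are x itself.
    """
    seq_len, d_model = x.shape
    scores = x @ x.T / np.sqrt(d_model)  # (seq_len, seq_len)
    if causal:
        # True above the diagonal marks the right-context positions to hide.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; masked positions end up with weight exactly 0.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x  # one output vector per input word

rng = np.random.default_rng(0)
tokens = ["Welcome", "to", "NYC"]
x = rng.normal(size=(len(tokens), 4))  # made-up 4-dim vector per word

x_changed = x.copy()
x_changed[2] = rng.normal(size=4)  # swap in a different vector for "NYC"

out = self_attention(x)
out_changed = self_attention(x_changed)

# With the mask, the representation of "to" ignores "NYC".
to_unchanged = np.allclose(out[1], out_changed[1])
# Without the mask (bidirectional context), "to" does depend on "NYC".
bidirectional_changed = not np.allclose(
    self_attention(x, causal=False)[1],
    self_attention(x_changed, causal=False)[1],
)
print(out.shape, to_unchanged, bidirectional_changed)
```

The output keeps one vector per input word, so its shape is `(3, 4)` here; in a real model the second dimension is fixed by the architecture.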

<img src="images/Decoder-Models-2.png" style="width:800px;" title="decoder models">
<br><br>
<b>Why should we use a decoder?</b>
<img src="images/Decoder-Models-3.png" style="width:500px;" title="decoder models">

<b>Causal Language Modelling</b>

<span style="color:blue">The ability to generate a word or a sequence of words given a known sequence of words. This is known as <b>Causal Language Modelling</b> or <b>Natural Language Generation.</b></span>

<b>Example: Guessing the next word in a sentence</b>

- Step 1:
    - Input to decoder model: "My"
    - Output of decoder model: A vector (sequence of numbers) that represents a single word $\Rightarrow$ Apply a small transformation to the vector so that it maps to all the words known by the model $\Rightarrow$ Predict the most probable following word, in this case "name"

- Step 2: <span style="color:green"><b>Auto-regressive aspect</b> (i.e., we use the past outputs as inputs in the following steps)</span>
    - Input to decoder model: "My name"
    - Output of decoder model: "is"
    
- Step 3: Repeat the process until we are satisfied.
    
> GPT-2, for example, has a maximum context size of 1024 tokens, so the prompt and the generated text together can span up to 1024 tokens.
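
The steps above can be sketched as a greedy generation loop. Everything here is a toy stand-in and an assumption of this sketch: the four-entry vocabulary, the hand-written score table that plays the role of the "small transformation", and the one-hot `toy_decoder` are hypothetical and only mimic the shape of the computation, not a trained model. The loop stops on a hypothetical `<eos>` token or when the context limit is reached.

```python
import numpy as np

VOCAB = ["My", "name", "is", "<eos>"]  # toy vocabulary (assumption)
MAX_CONTEXT = 1024  # GPT-2's context size, counted in tokens

# Hand-written next-token scores: row = current token, column = next token.
NEXT_TOKEN_SCORES = np.array([
    [0.0, 5.0, 0.0, 0.0],  # after "My"    -> "name"
    [0.0, 0.0, 5.0, 0.0],  # after "name"  -> "is"
    [0.0, 0.0, 0.0, 5.0],  # after "is"    -> "<eos>"
    [0.0, 0.0, 0.0, 5.0],  # after "<eos>" -> "<eos>"
])

def toy_decoder(token_ids):
    """Stand-in decoder: one feature vector per input token (one-hot here)."""
    return np.eye(len(VOCAB))[token_ids]  # (seq_len, hidden)

def next_token(token_ids):
    features = toy_decoder(token_ids)
    # Map the last feature vector to one score per known word.
    logits = features[-1] @ NEXT_TOKEN_SCORES
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs))  # greedy: keep the most probable word

def generate(prompt, max_context=MAX_CONTEXT):
    token_ids = [VOCAB.index(word) for word in prompt]
    while len(token_ids) < max_context:
        new_id = next_token(token_ids)
        if VOCAB[new_id] == "<eos>":
            break
        token_ids.append(new_id)  # past output becomes part of the next input
    return [VOCAB[i] for i in token_ids]

generated = generate(["My"])
print(generated)  # ['My', 'name', 'is']
```

Picking the single most probable word at each step is only one decoding strategy; sampling from the probabilities would fit the same loop.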

<img src="images/Decoder-Models-4.png" style="width:500px;" title="decoder models">