In [None]:
'We’ll cover the following models and their corresponding tasks:'

In [None]:
'''
* Wav2Vec2 for audio classification and automatic speech recognition (ASR)
* Vision Transformer (ViT) and ConvNeXT for image classification
* DETR for object detection
* Mask2Former for image segmentation
* GLPN for depth estimation
* BERT for NLP tasks like text classification, token classification and question answering that use an encoder
* GPT2 for NLP tasks like text generation that use a decoder
* BART for NLP tasks like summarization and translation that use an encoder-decoder
'''

#### @Types of language models

In [None]:
'''
* Encoder-only models (like BERT): These models use a bidirectional approach to understand context from both directions. 
They’re best suited for tasks that require deep understanding of text, such as classification, named entity recognition, 
and question answering.

* Decoder-only models (like GPT, Llama): These models process text from left to right and are particularly good 
at text generation tasks. They can complete sentences, write essays, or even generate code based on a prompt.

* Encoder-decoder models (like T5, BART): These models combine both approaches, using an encoder to understand 
the input and a decoder to generate output. 
They excel at sequence-to-sequence tasks like translation, summarization, and question answering.
'''
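
One way to picture the difference between these families is through the attention mask. The sketch below is a simplified illustration, not code from any library: `bidirectional_mask` and `causal_mask` are hypothetical helper names, and a `1` means "this position may attend to that position".

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "cat", "sat"]
for name, mask in [("encoder", bidirectional_mask(3)), ("decoder", causal_mask(3))]:
    print(name)
    for token, row in zip(tokens, mask):
        # Each row lists which tokens this token is allowed to look at.
        print(f"  {token:>4}: {row}")
```

An encoder-decoder model uses both: the encoder sees the full input, while the decoder applies the causal mask to the tokens it has generated so far and additionally attends to the encoder output.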

![Transformer architecture](images/transformers_architecture.png)

> Understanding which part of the Transformer architecture (encoder, decoder, or both) is best suited for a particular NLP task is key to choosing the right model. Generally, tasks requiring bidirectional context use encoders, tasks generating text use decoders, and tasks converting one sequence to another use encoder-decoders.

#### @Text generation

In [None]:
'''
* GPT-2 is a decoder-only model pretrained on a large amount of text. 
It can generate convincing (though not always true!) text from a prompt, and it can complete other NLP tasks 
like question answering despite not being explicitly trained to do so.
'''
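
To make "generate text given a prompt" concrete, here is a minimal sketch of greedy autoregressive decoding. It is a toy under stated assumptions: `NEXT_TOKEN_SCORES` is a made-up lookup table standing in for a trained decoder, and it conditions only on the last token, whereas a real model such as GPT-2 conditions on the whole prefix.

```python
# Hypothetical next-token scores; a trained decoder would compute these.
NEXT_TOKEN_SCORES = {
    "<bos>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"sat": 0.6, "<eos>": 0.4},
    "sat": {"<eos>": 1.0},
}


def greedy_generate(prompt, max_new_tokens=10):
    """Repeatedly append the highest-scoring next token until <eos>."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


print(greedy_generate(["<bos>"]))
```

Each step feeds the model's own previous output back in as input, which is what "autoregressive" means. Real systems often replace the greedy `max` with sampling or beam search.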

#### @Text classification

In [None]:
'''
Text classification involves assigning predefined categories to text documents, 
such as sentiment analysis, topic classification, or spam detection.
'''

In [None]:
'''
* BERT is an encoder-only model and is the first model to effectively implement deep bidirectionality to learn 
richer representations of the text by attending to words on both sides.
'''

#### @Token classification

In [None]:
'''Token classification involves assigning a label to each token in a sequence, 
such as in named entity recognition or part-of-speech tagging.'''

In [None]:
'''To use BERT for token classification tasks like named entity recognition (NER), 
add a token classification head on top of the base BERT model. '''
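
As a rough sketch of what a token classification head does, the code below applies one linear layer to each token's hidden state and keeps the highest-scoring label. All numbers are made up for illustration: BERT-base hidden states have 768 dimensions rather than 2, and the head weights would be learned during fine-tuning.

```python
def linear(vector, weights, bias):
    """One linear layer: `weights` holds one row of coefficients per label."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def classify_tokens(hidden_states, weights, bias, labels):
    """Apply the same head to every token and keep the best label for each."""
    predictions = []
    for state in hidden_states:
        logits = linear(state, weights, bias)
        best = max(range(len(logits)), key=lambda i: logits[i])
        predictions.append(labels[best])
    return predictions


# Hypothetical hidden states and head parameters.
TOKENS = ["Ada", "visited", "Paris"]
HIDDEN_STATES = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
LABELS = ["O", "B-PER", "B-LOC"]
WEIGHTS = [[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]]
BIAS = [0.5, 0.0, 0.0]

predicted = classify_tokens(HIDDEN_STATES, WEIGHTS, BIAS, LABELS)
for token, label in zip(TOKENS, predicted):
    print(f"{token:>8} -> {label}")
```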

#### @Question answering

In [None]:
'''Question answering involves finding the answer to a question within a given context or passage.'''

In [None]:
'''To use BERT for question answering, add a span classification head on top of the base BERT model.'''
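
A span classification head produces two scores per token: one for "the answer starts here" and one for "the answer ends here". The sketch below shows one common way to turn those scores into an answer, assuming the logits are already computed. `best_span` is a hypothetical helper and the logits are invented for the example.

```python
def best_span(start_logits, end_logits, max_answer_len=5):
    """Return the (start, end) pair with the highest combined score,
    subject to start <= end and a maximum answer length."""
    best, best_score = None, float("-inf")
    for start, start_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Question: "What is in Paris?" (logits are made up for illustration)
context = ["The", "Eiffel", "Tower", "is", "in", "Paris"]
start_logits = [2.0, 1.5, 0.1, 0.0, 0.0, 0.3]
end_logits = [0.1, 0.5, 2.5, 0.0, 0.0, 1.0]

start, end = best_span(start_logits, end_logits)
print(" ".join(context[start:end + 1]))
```

Searching over pairs rather than taking two independent argmaxes avoids invalid answers where the predicted end comes before the predicted start.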

💡 Notice how easy it is to use BERT for different tasks once it’s been pretrained. You only need to add a specific head to the pretrained model to manipulate the hidden states into your desired output!

#### @Summarization

In [None]:
'''
Summarization involves condensing a longer text into a shorter version while preserving its key information and meaning.

Encoder-decoder models like BART and T5 are designed for the sequence-to-sequence pattern of a summarization task.
'''

#### @Translation

In [None]:
'''
Translation involves converting text from one language to another while preserving its meaning. 
It is another example of a sequence-to-sequence task, which means you can use an encoder-decoder model like BART or T5 to do it.
'''

#### @Speech and audio

In [None]:
'''
* Whisper is an encoder-decoder (sequence-to-sequence) Transformer pretrained on 680,000 hours of labeled audio data. 
This amount of pretraining data enables zero-shot performance on audio tasks in English and many other languages.
'''

![Whisper architecture](images/whisper_architecture.png)

In [None]:
'''
This model has two main components:

* An encoder processes the input audio. The raw audio is first converted into a log-Mel spectrogram. 
This spectrogram is then passed through a Transformer encoder network.

* A decoder takes the encoded audio representation and autoregressively predicts the corresponding text tokens.
'''
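
To get a feel for the size of the encoder input, the sketch below estimates the shape of a log-Mel spectrogram and converts frequencies to the mel scale. The constants (16 kHz audio, a 160-sample hop, 80 mel bins) are assumptions based on commonly cited Whisper settings and are not stated in these notes. The mel conversion uses the widely used `2595 * log10(1 + f / 700)` formula, which may differ from the exact filter bank a given implementation uses.

```python
import math

# Assumed front-end settings, for illustration only.
SAMPLE_RATE = 16_000  # samples per second
HOP_LENGTH = 160      # samples between consecutive frames (10 ms)
N_MELS = 80           # number of mel frequency bins


def hz_to_mel(freq_hz):
    """Map a frequency in Hz to the perceptually motivated mel scale."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def spectrogram_shape(duration_seconds):
    """Approximate (mel bins, time frames) for a clip of the given length."""
    n_samples = round(duration_seconds * SAMPLE_RATE)
    return (N_MELS, n_samples // HOP_LENGTH)


print(spectrogram_shape(30))
print(round(hz_to_mel(1000)))
```

Under these assumptions, the encoder sees a sequence of time frames with one vector of mel energies per frame, rather than raw audio samples.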

#### @Automatic speech recognition

In [None]:
'''
To use the pretrained model for automatic speech recognition, you leverage its full encoder-decoder structure. 
The encoder processes the audio input, and the decoder autoregressively generates the transcript token by token.
'''

#### @Computer vision

In [None]:
'''
There are two ways to approach computer vision tasks:

* Split an image into a sequence of patches and process them in parallel with a Transformer.
* Use a modern CNN, like ConvNeXT, which relies on convolutional layers but adopts modern network designs.
'''
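
The first approach can be sketched in a few lines. The helper below, `split_into_patches`, is a hypothetical name and handles only a single-channel image stored as a list of rows. A real ViT also projects each flattened patch to an embedding and adds position information, which is omitted here.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image into flattened, non-overlapping square patches,
    returned in row-major order (left to right, top to bottom)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[row][col]
                     for row in range(top, top + patch_size)
                     for col in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# A 4x4 toy image whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, 2)
for patch in patches:
    print(patch)
```

With the same logic, a 224x224 image cut into 16x16 patches yields 14 x 14 = 196 patches, which the Transformer then treats like a sequence of 196 tokens.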


> A third approach mixes Transformers with convolutions (for example, Convolutional Vision Transformer or LeViT). 

#### @Image classification

In [None]:
'''ViT and ConvNeXT can both be used for image classification; the main difference is that ViT uses an attention mechanism
while ConvNeXT uses convolutions.

ViT replaces convolutions entirely with a pure Transformer architecture. If you’re familiar with the original Transformer, then you’re already most of the way toward understanding ViT.
'''