**Transformers**

A transformer is a neural network architecture for sequence-to-sequence tasks such as text summarization, question answering, and machine translation. The architecture diagram from "Attention Is All You Need" shows two blocks: an encoder and a decoder. Both blocks use self-attention instead of recurrent layers such as LSTMs. Because self-attention processes all positions of a sequence in parallel, training is faster and the model scales better.

![Transformer _model_architecture](Transformer_model_architecture..png)

**History of Transformers**

The transformer was introduced in the 2017 research paper "Attention Is All You Need" from Google Brain. It was initially developed for machine translation and is now used for many other purposes.

**Impact of Transformers in NLP**

- **Revolution in NLP:** For about 50 years, NLP relied on heuristics, statistical models, N-grams, bag-of-words, embedding models, and recurrent networks (RNNs, GRUs, LSTMs). Transformers have since delivered state-of-the-art results, with GPT being a prominent example.

- **Democratizing AI:** Before transformers, building NLP applications from scratch was common. Now, fine-tuning large models like BERT and GPT for specific domains yields state-of-the-art results.

- **Multimodal Capability:** Transformers aren't limited to NLP; they are used in various domains such as speech-to-text, text-to-image, and text-to-video.

- **Acceleration of GenAI:** Transformers have accelerated image generation and other Generative AI work, including in combination with GANs.

- **Unification of Deep Learning:** Transformers are used in NLP, GenAI, computer vision, reinforcement learning, and more.

**Why Were Transformers Created?**

Sequence-to-sequence learning with neural networks involves two blocks: an encoder and a decoder. Previously, both were commonly built from LSTMs, but the encoder had to compress the whole input into a single fixed-size context vector, which limited performance on long sequences. The introduction of attention mechanisms in neural machine translation solved this issue. However, these models still processed tokens one after another, so training was slow.

**Attention is All You Need**

This research paper introduced the transformer, which replaced LSTM with self-attention. This led to parallel processing, smaller components, stability, and robustness, along with the use of different hyperparameters.

**Timeline of Transformers**

- **2000-2014:** RNNs/LSTMs - The Origin Story
- **2014:** Attention
- **2017:** Transformer
- **2018:** BERT/GPT (Transfer Learning)
- **2018-2020:** Vision Transformer/AlphaFold 2
- **2021:** Gen AI
- **2022:** ChatGPT/Stable Diffusion

**Advantages of Transformers**

- Scalability
- Transfer learning
- Multimodal capability
- Flexible architecture (e.g., BERT for encoder, GPT for decoder)
- Ecosystem support (e.g., Hugging Face)
- Integrated AI technologies (e.g., GANs + Transformers, RL + Transformers, CV + Transformers)

**Disadvantages of Transformers**

- High computational cost
- Requires a huge amount of data
- Overfitting
- Energy consumption
- Interpretability issues
- Bias (ethical concerns with data)

**Self Attention**

Self-attention is a mechanism that converts static word embeddings into contextual embeddings, suitable for various NLP applications.

**Self Attention in Transformers**

- Text can be encoded in several ways: one-hot encoding, bag of words, word embeddings, and contextual embeddings.
- Word embeddings convert the words of a sentence into semantic vectors.
- Self-attention converts word embeddings into contextual embeddings dynamically.
- Calculating self-attention: for instance, given the sentence "money bank grows," each word is converted into a new embedding based on self-attention calculations. The calculation is shown in the image below.
![self attention mechanism](self_attention.png)
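
The parameter-free calculation described above can be sketched in a few lines of plain Python. This is only an illustration: the 2-dimensional embedding values for "money", "bank", and "grows" are made up, and the helper names (`softmax`, `dot`, `simple_self_attention`) are hypothetical. Each word's new embedding is a softmax-weighted average of all word embeddings, where the weights come from dot-product similarity.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def simple_self_attention(embeddings):
    """Self-attention with no weights or biases (nothing is learned)."""
    dim = len(embeddings[0])
    contextual = []
    for query in embeddings:
        # Similarity of this word to every word in the sentence.
        scores = [dot(query, other) for other in embeddings]
        weights = softmax(scores)
        # New embedding = weighted sum of all word embeddings.
        contextual.append(
            [sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)]
        )
    return contextual


# Made-up toy embeddings, purely for illustration.
embeddings = {"money": [1.0, 0.0], "bank": [0.7, 0.7], "grows": [0.0, 1.0]}
contextual = simple_self_attention(list(embeddings.values()))
for word, vec in zip(embeddings, contextual):
    print(word, [round(x, 3) for x in vec])
```

Because there are no trainable parameters here, the same input always produces the same contextual embeddings, regardless of the task.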

- **Points to consider:**

1. This operation runs in parallel for all words, but the disadvantage is that it does not capture which word is used with which other words, i.e. the order of the words is lost.

![parallel operation](parallel_operation.png)

2. No parameters are involved, meaning there are no weights or biases. The result is therefore called a general contextual embedding: the new embedding is generic rather than task-specific, and it is not learnable because no parameters are used to produce it.

![no learning](no_learning.png)

**Progress**


![progress](progress.png)


The progress diagram shows how words are fed to the machine. First we introduce word embeddings; their disadvantage is that they do not know the context of a word. We therefore convert word embeddings into contextual embeddings, also known as general contextual embeddings. The drawback of general contextual embeddings is that they have no learnable parameters. To learn parameters, we use task-specific embeddings.

- To introduce learnable parameters, each word embedding is converted into three new embeddings: query, key, and value.

![three new embedding](three_new_embedding.png)

- To derive new vectors from one vector we use a linear transformation: the word embedding is multiplied by the weight matrices Wq, Wk, and Wv. These matrices start out random and are updated during training until good weights are found. The results are the three new matrices: query, key, and value.

![formation](formation.png)
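
As a rough sketch of the linear transformation described above, the snippet below multiplies a small embedding matrix by three randomly initialized weight matrices. The sizes (`d_model = 4`, `d_k = 3`), the variable names, and the helpers `matmul` and `random_matrix` are assumptions chosen for illustration. In a real model the weights would be updated by training, which is not shown here.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def random_matrix(rows, cols, rng):
    """Random initial values; training would adjust these later."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_k = 4, 3  # assumed toy sizes

# One row per word, e.g. for a three-word sentence.
X = random_matrix(3, d_model, rng)

# Three separate weight matrices, one per role.
W_q = random_matrix(d_model, d_k, rng)
W_k = random_matrix(d_model, d_k, rng)
W_v = random_matrix(d_model, d_k, rng)

Q = matmul(X, W_q)  # queries
K = matmul(X, W_k)  # keys
V = matmul(X, W_v)  # values

print(len(Q), "query vectors of size", len(Q[0]))
```

Each word gets its own query, key, and value vector, and every row is computed independently of the others, which is what allows the projections to run in parallel.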

- These three matrices are then used in the self-attention calculation, which yields task-specific embeddings.


![calulation](calulation.png)

- All of the steps above run in parallel, whether the input has 3 words or 3000k words, as shown in the image.

![self attention parallel](self_attention_parallel.png)

- The discussion above is summarized by the formula Attention(Q, K, V) = softmax(QKᵀ/√dk)V, known as Scaled Dot-Product Attention. Here, dk is the dimension of the key vectors. QKᵀ is divided by √dk because the variance of a dot product grows with the dimension: low-dimensional vectors give low-variance scores, and high-dimensional vectors give high-variance scores. Dividing by √dk normalizes the scores. It also keeps the softmax away from its saturated region, which helps to mitigate the problem of vanishing gradients.

![detail](detail.png)
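
The formula and the role of √dk can be checked with a small experiment. The sketch below is illustrative only: it assumes query and key components drawn from a standard normal distribution, and the function names are hypothetical. It implements softmax(QKᵀ/√dk)V on nested lists and estimates the variance of raw dot products for two values of dk.

```python
import math
import random


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as nested lists."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


def dot_product_variance(d_k, samples=2000, seed=0):
    """Empirical variance of q.k when q and k have standard normal entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / samples
    return sum((x - mean) ** 2 for x in dots) / samples


for d_k in (4, 64):
    raw = dot_product_variance(d_k)
    # Dividing scores by sqrt(d_k) divides their variance by d_k.
    print(f"d_k={d_k}: raw variance ~ {raw:.1f}, after scaling ~ {raw / d_k:.2f}")

example = scaled_dot_product_attention(
    Q=[[1.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0, 0.0], [0.0, 1.0]]
)
print(example)
```

Under these assumptions the raw variance grows roughly in proportion to dk, while the scaled variance stays close to 1 for both sizes.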

### TRANSFORMER ARCHITECTURE

![Transformer _model_architecture](Transformer_model_architecture..png)

Detailed Notes:
Summary of Transformer Networks

Transformer networks are deep learning models designed to mimic the human process of attention. They excel in tasks involving sequence understanding and generation, such as natural language processing and image recognition.

Key Concepts:

- Attention: Transformers attend to elements of a sequence, capturing dependencies and relationships.
- Encoder: The encoder processes the input sequence, generating a representation that captures its context and meaning.
- Decoder: The decoder generates the output sequence, one element at a time, using information from the encoder and its own attention mechanisms.

Encoder Architecture:

- Consists of multiple encoder layers, each with:
  - Multi-Head Self-Attention (MHSA): Captures relationships between elements in the sequence.
  - Feed-Forward Neural Network (FFNN): Processes information from MHSA.
- Adds Positional Encoding to input embeddings to convey element order.

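
The notes do not say which positional encoding is used. As one possibility, the sketch below assumes the fixed sinusoidal encoding from "Attention Is All You Need", where even dimensions use a sine and odd dimensions use a cosine of the position at different frequencies. The function names and the toy sizes are hypothetical.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """One d_model-sized vector per position (assumed sinusoidal scheme)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimension
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimension
    return pe


def add_positions(embeddings):
    """Add the positional encoding to each embedding, element by element."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]


# The same toy embedding at three positions becomes three different vectors.
same_word = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]
with_positions = add_positions(same_word)
for position, vec in enumerate(with_positions):
    print(position, [round(x, 3) for x in vec])
```

After the addition, identical words at different positions no longer share the same input vector, which is how order information reaches the attention layers.
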
Decoder Architecture:

- Consists of multiple decoder layers, each with:
  - Masked Self-Attention (MSA): Ensures elements in the output sequence only attend to previous elements.
  - Multi-Head Attention (MHA): Attends to encoder output to gather relevant information.
  - FFNN: Further processes information from MHA.

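
Masked self-attention can be illustrated with a small causal mask. In this hedged sketch (helper names are hypothetical), position i may look at positions 0..i only; masked scores are set to negative infinity so that their softmax weight becomes exactly zero.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores, ignoring positions that are not allowed."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, allowed)]
    highest = max(masked)  # finite, because a position can always see itself
    exps = [math.exp(s - highest) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


n = 3
mask = causal_mask(n)
scores = [[0.0] * n for _ in range(n)]  # equal raw scores, for illustration

attention_weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
for row in attention_weights:
    print([round(w, 3) for w in row])
```

With equal raw scores, each position spreads its attention evenly over itself and the positions before it, and gives no weight to later positions.
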
How Transformers Work:

1. The input sequence is converted to embeddings, and positional encodings are added.
2. The encoders process the embeddings, capturing contextual information and relationships.
3. The decoder receives the encoder output and attends to both the encoder output and its own sequence.
4. The decoder generates the output sequence, one element at a time, using information from attention and feed-forward layers.
5. The model adjusts its weights throughout training to optimize output accuracy.

Benefits:

- Handles long-range dependencies effectively.
- Efficient: all elements of a sequence are processed in parallel.
- Improved performance in tasks involving sequence understanding and generation.

Applications:

- Natural language processing: Translation, summarization, language modeling
- Image recognition and understanding
- Sequence-to-sequence modeling: Time series forecasting, speech recognition


### Vision Transformers

![Vision Transformers](vt1.png)

![Vision Transformers](vt2.png)

![Vision Transformers](vt3.png)

![Vision Transformers](vt4.png)

![Vision Transformers](vt5.png)

![Vision Transformers](vt6.png)


## Summary of Vision Transformers 

Introduction:

Vision Transformers (ViTs) are deep learning models designed for computer vision tasks, inspired by the success of Transformers in natural language processing. ViTs employ a self-attention mechanism to understand relationships between different parts of an image.

Image Segmentation and Tokenization:

ViTs divide an image into smaller patches, which are then flattened into sequences of tokens. This process enables the Transformer model to process and understand image content more efficiently.
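
The patch-and-flatten step can be sketched as follows. This is a minimal illustration on a tiny made-up 4×4 image with 3 channels; the function name `image_to_patch_tokens` is hypothetical, and the later steps of a real ViT (projecting each token and adding position information) are left out.

```python
def image_to_patch_tokens(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for r in range(top, top + patch_size):
                for c in range(left, left + patch_size):
                    patch.extend(image[r][c])  # append this pixel's channels
            tokens.append(patch)
    return tokens


# Toy image: each pixel stores [row, column, 0] so patches are easy to inspect.
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
tokens = image_to_patch_tokens(image, patch_size=2)
print(len(tokens), "tokens, each of length", len(tokens[0]))
```

A 4×4 image with 2×2 patches yields 4 tokens, each holding 2 × 2 × 3 = 12 values, so the image becomes a short sequence that a Transformer can consume.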

Architecture of Vision Transformers:

- Encoder: A stack of layers, including self-attention layers, feed-forward networks, and a classification layer. The encoder extracts meaningful features and understands spatial relationships within the image.
- No Decoder: Unlike traditional Transformers, ViTs lack a decoder because the primary goal is to understand the image content rather than generate output sequences.

Benefits of Vision Transformers:

- Ability to understand global and local relationships within an image.
- Reduced dimensionality of input vectors, improving computational efficiency and noise reduction.
- Extraction of essential features and elimination of irrelevant variations.

Key Differences from Traditional Transformers:

- Absence of a decoder.
- Self-attention layers tailored for computer vision tasks.
- Focus on understanding image content rather than generating output sequences.

Applications of Vision Transformers:

- Image classification
- Object detection
- Other computer vision tasks


### SWIN TRANSFORMER

![SWIN TRANSFORMER](swin_t.png)

Summary:

The Swin Transformer (Shifted Window Transformer) architecture, introduced in 2021, advances Vision Transformers by handling high-resolution images effectively with lower computational complexity.

Architecture:

The Swin Transformer consists of:

- Patch partition: Dividing the input image into patches.
- Linear Embedding: Converting the pixels of each patch into a vector for the Transformer layers.

Swin Transformer Blocks:
- Each block has two subunits with normalization layers, attention modules, and multi-layer perceptron layers.
- The first subunit uses window multi-head self-attention while the second uses shifted window multi-head self-attention.

Window-based Multi-Head Self-Attention:
- Computes relationships between patches within local windows.

Shifted Window Multi-Head Self-Attention:
- Maintains connections between windows while preserving computational efficiency by shifting the window grid, implemented as a cyclic shift of the patches.
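
The effect of shifting the windows can be seen on a toy 4×4 grid of patch ids. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the window size is 2 with a shift of 1, and the attention masking that a full implementation needs for wrapped-around patches is omitted.

```python
def window_partition(grid, window):
    """Group a 2-D grid of patches into non-overlapping window x window groups."""
    height, width = len(grid), len(grid[0])
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append(
                [grid[r][c]
                 for r in range(top, top + window)
                 for c in range(left, left + window)]
            )
    return windows


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift`, wrapping around the edges."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[(r + shift) % height][(c + shift) % width] for c in range(width)]
        for r in range(height)
    ]


# Patch ids 0..15 laid out row by row.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]

regular_windows = window_partition(grid, window=2)
shifted_windows = window_partition(cyclic_shift(grid, shift=1), window=2)

print("regular:", regular_windows)
print("shifted:", shifted_windows)
```

In the shifted layout, the first window groups patches that belonged to four different regular windows, which is how information can flow between neighboring windows from one subunit to the next.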

Patch Merging:
- Merges each group of 2×2 adjacent patches into one, so later stages cover a larger part of the image and capture more global information.
- Reduces image resolution by a factor of 2 at each merging stage.
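
A minimal sketch of patch merging, assuming a feature map stored as nested lists with one feature vector per patch: each 2×2 group of neighboring patches is concatenated into a single longer vector, which halves the height and width. The function name is hypothetical, and the linear layer that the Swin paper applies after the concatenation to reduce the channel count is not shown.

```python
def merge_patches(feature_map):
    """Concatenate each 2x2 block of neighboring patch vectors into one vector."""
    height, width = len(feature_map), len(feature_map[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    merged = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            row.append(
                feature_map[r][c]
                + feature_map[r + 1][c]
                + feature_map[r][c + 1]
                + feature_map[r + 1][c + 1]
            )
        merged.append(row)
    return merged


# Toy 4x4 feature map with a single channel holding the patch id.
feature_map = [[[r * 4 + c] for c in range(4)] for r in range(4)]
merged = merge_patches(feature_map)
print(len(merged), "x", len(merged[0]), "patches with", len(merged[0][0]), "channels")
```

The 4×4 map becomes a 2×2 map, and each merged patch carries four times as many channels as before.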

Task-Specific Head:
- For image classification, a linear layer is applied to the final output.
- For object detection or segmentation, outputs from all stages are provided to task-specific heads.

Variants:

There are four variants of the Swin Transformer: Tiny (Swin-T), Small (Swin-S), Base (Swin-B), and Large (Swin-L), each differing in parameters and number of layers.

Advantages:

- Effectively handles high-resolution images.
- Lower computational complexity than Vision Transformers.
- Provides features from different resolutions for object detection and segmentation tasks.
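
The complexity advantage can be made concrete with the cost estimates given in the Swin Transformer paper for an h×w grid of patches with C channels: global self-attention costs about 4hwC² + 2(hw)²C operations, while window attention with window size M costs about 4hwC² + 2M²hwC. The function names and the example sizes below are chosen for illustration only.

```python
def global_attention_cost(height, width, channels):
    """Approximate cost of global self-attention: quadratic in patch count."""
    n = height * width
    return 4 * n * channels**2 + 2 * n**2 * channels


def window_attention_cost(height, width, channels, window):
    """Approximate cost of window self-attention: linear in patch count."""
    n = height * width
    return 4 * n * channels**2 + 2 * window**2 * n * channels


channels, window = 96, 7  # illustrative values
for side in (56, 112):
    g = global_attention_cost(side, side, channels)
    w = window_attention_cost(side, side, channels, window)
    print(f"{side}x{side} patches: global / window cost ratio = {g / w:.1f}")
```

Doubling the side length multiplies the window cost by 4, whereas the quadratic term of the global cost grows by a factor of 16, so the gap widens as resolution increases.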