# Transformer Image Classification on Flowers-104 (Conceptual Version)

## 1. Overview

Goal: Classify 104 species of flowers using a Vision Transformer (ViT).
- Dataset: Flowers-104 (~8k train, ~1k val, ~1k test)
- Input images resized to 224×224
- ViT model applied: patch-based self-attention transformer

Diagram (high-level workflow):

In [None]:
Input Image (224x224x3)
        │
  Split into patches (16x16)
        │
 Flatten & Linear projection → patch embeddings
        │
  + Position embeddings
        │
 Transformer Encoder blocks (self-attention + MLP)
        │
  Flatten → MLP head
        │
  Softmax → Class probabilities (104 classes)


## 2. Data Handling (Conceptual)

- Data stored in TFRecords
- Training / validation / test splits
- Preprocessing:
  - Resize images to 224×224
  - Normalize pixels to [0,1]
  - Data augmentation (optional): random flip, random saturation

Illustration:

In [None]:
[Image] → Resize → Normalize → Optional Augment → Transformer


Code (demo only):

In [None]:
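import tensorflow as tf

def preprocess(image, label):
    # Resize to the 224x224 input size the ViT expects (result is float32)
    image = tf.image.resize(image, (224, 224))
    # Scale pixel values from [0, 255] to [0, 1]
    image = tf.cast(image, tf.float32) / 255.0
    return image, label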
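
The optional augmentation step is not part of the demo function. As a rough, dependency-free sketch of what "normalize, then random flip" does to pixel values, assume an image stored as nested lists of `[R, G, B]` integers. The helper names and the 0.5 flip probability are illustrative assumptions, and saturation is left out for brevity.

```python
import random


def normalize(image):
    """Scale integer pixel values in [0, 255] to floats in [0, 1]."""
    return [[[channel / 255.0 for channel in pixel] for pixel in row]
            for row in image]


def random_flip_left_right(image, rng, probability=0.5):
    """Mirror the image horizontally with the given probability."""
    if rng.random() < probability:
        return [list(reversed(row)) for row in image]
    return image


# A 1x2 image: one red pixel next to one blue pixel.
tiny_image = [[[255, 0, 0], [0, 0, 255]]]
rng = random.Random(0)  # seeded so runs are repeatable
augmented = random_flip_left_right(normalize(tiny_image), rng)
print(augmented)
```

In a TensorFlow input pipeline the corresponding built-ins would be `tf.image.random_flip_left_right` and `tf.image.random_saturation`, applied to training data only.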


## 3. Patch Creation

- Patch size: 16×16 pixels
- Image → split into patches → flatten → linear projection

Diagram:

In [None]:
224x224 image → 16x16 patches → 196 patches (14x14 grid) 
Each patch → 16*16*3 = 768 features → Dense projection to 64-dim vector
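
The numbers in the diagram follow from simple integer arithmetic. The sketch below reproduces them and also counts the weights of the dense projection. The function name and the returned dictionary keys are illustrative, not part of the notebook.

```python
def patch_stats(image_size=224, patch_size=16, channels=3, embed_dim=64):
    """Return grid size, patch count, features per patch, and projection size."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size                 # patches per side
    num_patches = grid * grid                       # sequence length
    patch_dim = patch_size * patch_size * channels  # flattened patch size
    # Dense projection: one weight per (input feature, output unit) plus biases.
    projection_params = patch_dim * embed_dim + embed_dim
    return {
        "grid": grid,
        "num_patches": num_patches,
        "patch_dim": patch_dim,
        "projection_params": projection_params,
    }


stats = patch_stats()
print(stats)
```

Halving the patch size to 8 would quadruple the sequence length to 784 patches, which matters because self-attention cost grows quadratically with the number of patches.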


Patch layer (conceptual):

In [None]:
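import tensorflow as tf

class Patches(tf.keras.layers.Layer):
    def call(self, images):
        batch_size = tf.shape(images)[0]  # dynamic batch dimension
        patches = tf.image.extract_patches(images, sizes=[1,16,16,1], strides=[1,16,16,1], rates=[1,1,1,1], padding="VALID")
        # (batch, 14, 14, 768) -> (batch, 196, 768)
        return tf.reshape(patches, [batch_size, -1, patches.shape[-1]])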
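
To make the patch ordering concrete without TensorFlow, here is a small plain-Python sketch of the same idea on a nested-list image. It assumes non-overlapping patches and drops partial patches at the edges, which is intended to mirror `padding="VALID"`. The function name `extract_patches` here is a local helper, not the TensorFlow op.

```python
def extract_patches(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append([
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ])
    return patches


# 4x4 single-channel image whose pixel values are 0..15 in reading order.
image = [[[4 * r + c] for c in range(4)] for r in range(4)]
patches = extract_patches(image, patch_size=2)
for index, patch in enumerate(patches):
    print(index, patch)
```

Each output row is one flattened patch, so a 224×224×3 image with 16×16 patches would give 196 rows of 768 values.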


## 4. Positional Encoding

- Transformers have no inherent spatial info → add position embeddings to patches
- Encodes location of each patch

Diagram:

In [None]:
Patch embeddings + Position embeddings → Input to Transformer
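
A minimal sketch of the addition, assuming a learned position table with one vector per patch position. The random table below is only a hypothetical stand-in for trained weights, and both helper names are illustrative.

```python
import random


def make_position_table(num_patches, embed_dim, seed=0):
    """Stand-in for a learned table: one small random vector per position."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.02, 0.02) for _ in range(embed_dim)]
            for _ in range(num_patches)]


def add_position_embeddings(patch_embeddings, position_table):
    """Add the position vector for slot i to the patch embedding in slot i."""
    if len(patch_embeddings) != len(position_table):
        raise ValueError("need exactly one position vector per patch")
    return [[p + e for p, e in zip(patch, position)]
            for patch, position in zip(patch_embeddings, position_table)]


# Two patches with identical content at different grid positions.
patches = [[0.5, 0.5], [0.5, 0.5]]
table = make_position_table(num_patches=2, embed_dim=2)
encoded = add_position_embeddings(patches, table)
print(encoded)
```

After the addition the two identical patches no longer have identical vectors, which is what lets the encoder distinguish where each patch came from.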


## 5. Transformer Encoder

- Multi-head self-attention: capture global relationships between patches
- Layer normalization before attention
- MLP block after attention
- Residual connections for stability

Block diagram (single transformer layer):

In [None]:
Patch embeddings
      │
 LayerNorm
      │
 Multi-Head Self-Attention
      │
 Add Residual
      │
 LayerNorm
      │
 MLP (Dense + Dropout)
      │
 Add Residual → Output
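
The block diagram can be traced in plain Python. This is a deliberately simplified sketch: a single attention head with no learned query/key/value projections, layer normalization without scale and shift parameters, no dropout, and a weightless ReLU standing in for the MLP. All function names are illustrative.

```python
import math


def layer_norm(vector, eps=1e-6):
    """Normalize one token to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention where queries, keys, and values are the tokens."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)  # attention from this token to every token
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(dim)])
    return outputs


def add(a, b):
    """Elementwise sum of two token sequences (the residual connection)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def encoder_block(tokens, mlp_fn):
    """Pre-norm block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    attended = self_attention([layer_norm(t) for t in tokens])
    tokens = add(tokens, attended)                      # first residual
    mlp_out = [mlp_fn(layer_norm(t)) for t in tokens]
    return add(tokens, mlp_out)                         # second residual


def relu_mlp(vector):
    """Hypothetical stand-in for the MLP: elementwise ReLU, no weights."""
    return [max(0.0, v) for v in vector]


tokens = [[1.0, 3.0], [2.0, 0.0], [0.5, 0.5]]
output = encoder_block(tokens, relu_mlp)
print(output)
```

The output has the same shape as the input, which is why encoder blocks can be stacked one after another.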


## 6. Classification Head

- Flatten transformer output
- MLP layers → dense features
- Final dense layer → 104 classes

Diagram:

In [None]:
[Flattened patches] → Dense(2048) → Dense(1024) → Dense(104) → Softmax


Conceptual code:

In [None]:
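import tensorflow as tf

# Stand-in for the encoder output (assumed shape: 196 patches x 64 dims)
transformer_output = tf.keras.Input(shape=(196, 64))
x = tf.keras.layers.Flatten()(transformer_output)
x = tf.keras.layers.Dense(2048, activation='gelu')(x)
x = tf.keras.layers.Dense(1024, activation='gelu')(x)
logits = tf.keras.layers.Dense(104)(x)  # raw scores; softmax turns them into probabilities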
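
A quick parameter count shows what flattening costs. The sketch below assumes the encoder output keeps the 196 patches × 64 dimensions from Section 3 and that every dense layer has a bias. The helper names are illustrative.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


def head_param_counts(num_patches=196, embed_dim=64,
                      hidden=(2048, 1024), num_classes=104):
    """Parameter count of each dense layer in the classification head."""
    width = num_patches * embed_dim  # Flatten: 196 * 64 = 12544 features
    counts = []
    for units in (*hidden, num_classes):
        counts.append(dense_params(width, units))
        width = units  # this layer's output feeds the next layer
    return counts


counts = head_param_counts()
print(counts, sum(counts))
```

Under these assumptions the first dense layer alone holds about 25.7 million of the head's roughly 27.9 million parameters, because it sees every patch embedding at once.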


## 7. Training Strategy (Conceptual)

- Learning rate schedule: warmup → max → sustain → exponential decay
- Batch size: scaled with available hardware (TPU/GPU)
- Steps per epoch: number of training samples // batch size (integer division)
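
As a small worked example of the last two bullets, assume a per-replica batch of 16 on 8 accelerator cores and 8,000 training images. All three numbers are illustrative round figures, not measured values.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Scale the batch with the number of accelerator replicas."""
    return per_replica_batch * num_replicas


def steps_per_epoch(num_train_samples, batch_size):
    """Full batches per epoch; a trailing partial batch is dropped."""
    return num_train_samples // batch_size


batch_size = global_batch_size(per_replica_batch=16, num_replicas=8)
steps = steps_per_epoch(num_train_samples=8000, batch_size=batch_size)
print(batch_size, steps)
```

On a single device with the same per-replica batch, the epoch would take 500 steps instead, so the step count shrinks as the hardware scales up.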

Diagram (LR schedule):

In [None]:
LR Start → Ramp-up → Max LR → Sustain → Exponential Decay → LR Min
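
One way to write such a schedule is as a function of the epoch index. Every default below (start, max, and min rates, ramp-up and sustain lengths, decay factor) is a hypothetical value chosen for illustration, not a setting taken from this notebook.

```python
def lr_schedule(epoch, lr_start=1e-5, lr_max=5e-5, lr_min=1e-5,
                rampup_epochs=5, sustain_epochs=2, exp_decay=0.8):
    """Linear ramp-up, flat sustain, then exponential decay toward lr_min."""
    if epoch < rampup_epochs:
        # Linear interpolation from lr_start up to lr_max.
        return lr_start + (lr_max - lr_start) * epoch / rampup_epochs
    if epoch < rampup_epochs + sustain_epochs:
        return lr_max
    # The gap above lr_min shrinks by exp_decay every epoch.
    decay_steps = epoch - rampup_epochs - sustain_epochs
    return lr_min + (lr_max - lr_min) * exp_decay ** decay_steps


rates = [lr_schedule(epoch) for epoch in range(20)]
for epoch, rate in enumerate(rates):
    print(f"epoch {epoch:2d}: lr = {rate:.2e}")
```

A function with this shape can be passed to `tf.keras.callbacks.LearningRateScheduler`, which calls it with the epoch index at the start of each epoch.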


Training curves (conceptual):

In [None]:
Loss: decreases over epochs
Accuracy: increases over epochs


## 8. Evaluation

- Metrics: Accuracy, F1, Precision, Recall
- Confusion matrix: normalized per class

Diagram (Conceptual Confusion Matrix):

In [None]:
Predicted classes on X-axis
Actual classes on Y-axis
Diagonal = correct predictions
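
The metrics above can all be derived from the confusion matrix. The sketch below uses three classes instead of 104 and made-up labels. It assumes rows are actual classes and columns are predicted classes, and it normalizes each row so that every actual class sums to 1.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[actual][predicted] counts how often each pair occurs."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so every actual class sums to 1."""
    return [[count / sum(row) if sum(row) else 0.0 for count in row]
            for row in matrix]


def per_class_metrics(matrix):
    """Precision, recall, and F1 for every class."""
    size = len(matrix)
    results = []
    for c in range(size):
        true_positive = matrix[c][c]
        predicted_total = sum(matrix[r][c] for r in range(size))  # column sum
        actual_total = sum(matrix[c])                             # row sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results


def accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


y_true = [0, 0, 1, 1, 2, 2]  # made-up labels for illustration
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred, num_classes=3)
print(matrix)
print(normalize_rows(matrix))
print(accuracy(matrix), per_class_metrics(matrix))
```

With many classes, the row-normalized matrix is usually the easier one to read, since classes with few samples are not drowned out by frequent ones.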

Batch predictions visualized:

In [None]:
Image + True label → Predicted label
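
Turning a batch of logits into the predicted labels shown next to each image takes a softmax followed by an argmax. The logits and labels below are invented, and three classes stand in for the 104 flower classes.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the most likely class index and its probability."""
    probabilities = softmax(logits)
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    return predicted, probabilities[predicted]


batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # made-up model outputs
true_labels = [0, 1]
for logits, true_label in zip(batch_logits, true_labels):
    predicted, confidence = predict(logits)
    status = "correct" if predicted == true_label else "wrong"
    print(f"true={true_label} predicted={predicted} "
          f"confidence={confidence:.2f} ({status})")
```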


## 9. Key Takeaways

- ViT splits images into patches, which lets a transformer process an image as a sequence of tokens.
- Self-attention captures global dependencies between patches, unlike the local receptive fields of CNNs.
- The MLP head maps the flattened transformer features to class scores, as in a standard classification layer.
- TPU/GPU acceleration speeds up training, but understanding the concepts does not require heavy training.
- Visualization of patches, attention maps, and confusion matrices helps interpret model behavior.