ViT: Visual Transformer Networks

Toy implementation of "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" on MNIST.

Chunking Images

    tiles = einops.rearrange(images, 'b c (h t1) (w t2) -> b (h w) c t1 t2', t1=tile_size, t2=tile_size)

Attention

Positional Embeddings

Mislabelled Images

Confusion Matrix Heatmap
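
The tiling step under Chunking Images can be made more concrete with a small sketch. It is not the repository's code: it reproduces the same axis movement as the `einops.rearrange` pattern using plain NumPy `reshape` and `transpose`, so each step is visible. The helper name `tile_images`, the batch of two 1-channel 28x28 arrays, and `tile_size=7` are assumptions chosen for illustration (28x28 matches MNIST; the tile size used in the project is not stated).

```python
import numpy as np


def tile_images(images, tile_size):
    """Split (b, c, H, W) images into (b, h*w, c, tile_size, tile_size) tiles.

    Hypothetical helper mirroring the pattern
    'b c (h t1) (w t2) -> b (h w) c t1 t2' with t1 = t2 = tile_size.
    """
    b, c, height, width = images.shape
    if height % tile_size or width % tile_size:
        raise ValueError("image height and width must be divisible by tile_size")
    h, w = height // tile_size, width // tile_size

    # Split each spatial axis into (number of tiles, pixels per tile).
    tiles = images.reshape(b, c, h, tile_size, w, tile_size)
    # Reorder to (b, h, w, c, t1, t2) so the tile grid comes before the channels.
    tiles = tiles.transpose(0, 2, 4, 1, 3, 5)
    # Flatten the tile grid (h, w) into a single sequence axis, row by row.
    return tiles.reshape(b, h * w, c, tile_size, tile_size)


# Assumed example input: two MNIST-sized, single-channel images.
images = np.arange(2 * 1 * 28 * 28, dtype=np.float32).reshape(2, 1, 28, 28)
tiles = tile_images(images, tile_size=7)
print(tiles.shape)  # (2, 16, 1, 7, 7)
```

With these assumed sizes each image becomes a sequence of 16 tiles in row-major order, so tile index 5 is the tile in the second row and second column of the 4x4 grid. Each tile would then typically be flattened and linearly projected before entering the transformer.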