[View in Colaboratory](https://colab.research.google.com/github/jonbaer/googlecolab/blob/master/TPU_MachineLearning.ipynb)

### Lecture Structure
- Q&A
- Why make a TPU? (10 minutes)
- Short demo of a matrix operation (5 minutes)
- Q&A 
- TPU Explained (technical) (10 minutes)
- TPU Use cases (5 minutes) 
- Benchmark Demo #1 (Simple Matrix Add Operation) for CPU vs TPU (10 minutes)
- Q&A 
- Benchmark Demo #2 (Neural Network for Image Classification) for GPU vs TPU (10 minutes)
- Demo #3 Training a text generation on a TPU (10 minutes)
- Rap, probably. Lol



## What is a Tensor Processing Unit?

![alt text](http://www.cdrinfo.com/images/uploaded/Google_TPU_3.jpg)

- A Tensor Processing Unit (TPU) is a custom computer processing chip designed by Google

![alt text](https://i.imgur.com/dMx3Ilw.png)

- They've been using TPUs in their data centers since 2015

![alt text](https://i.imgur.com/TnpIdxd.png)

- They designed them specifically for machine learning applications
- They use them for Google Translate, Photos, Search, Assistant, Gmail, Cloud, etc.

## Why did they make their own chip? 

![alt text](https://i.imgur.com/N1eOy9m.png)

- There has been a lot of progress in machine learning in the past few years

![alt text](https://image.slidesharecdn.com/20170222mldlintroduction-170222094304/95/machine-learning-deep-learning-and-data-analysis-introduction-38-638.jpg?cb=1496630353)

- Neural Networks in particular almost always outperform other machine learning models if given enough data & compute
- Neural networks require a lot of compute! It's called deep learning.

![alt text](https://image.slidesharecdn.com/2015scalaworld-150927123309-lva1-app6892/95/neural-network-as-a-function-12-638.jpg?cb=1443357296)

![alt text](https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/Images/Linear_1.jpg)

- Neural networks are just a series of matrix operations applied to input data
- And if there's a lot of data to input, that's a lot of matrix operations to compute
- Like a lot. Giant, giant matrices full of numbers all being multiplied in parallel
- Most of the math is just 'multiply a bunch of numbers, and add the results'
- We can connect these two together in a single operation called multiply-accumulate (MAC). 
- And if we don’t need to do anything else, we can multiply-accumulate really, really fast.
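To make the multiply-accumulate idea above concrete, here is a minimal pure-Python sketch. The names `mac`, `dot`, and `matvec` are made up for this illustration; real hardware fuses the multiply and the add into a single step, which plain Python can only imitate.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: multiply a and b, add the product to acc."""
    return acc + a * b


def dot(xs, ws):
    """Dot product written as a chain of MAC steps."""
    if len(xs) != len(ws):
        raise ValueError("inputs must have the same length")
    acc = 0
    for x, w in zip(xs, ws):
        acc = mac(acc, x, w)
    return acc


def matvec(matrix, vector):
    """Matrix-vector product: one dot product (one MAC chain) per row."""
    return [dot(row, vector) for row in matrix]


print(dot([1, 2, 3], [4, 5, 6]))            # 1*4 + 2*5 + 3*6
print(matvec([[1, 2], [3, 4]], [10, 20]))   # one MAC chain per matrix row
```

Every entry of a matrix product is one of these MAC chains, which is why a chip that does nothing but MACs can cover most of a neural network's arithmetic.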

![alt text](https://c1.staticflickr.com/2/1640/25046013104_68059057ab_b.jpg)

![alt text](https://cdn-images-1.medium.com/max/800/1*OZWu6AW9nI9HXb4g1XFCQA.png)

- A CPU is a scalar machine, which means it processes instructions one step at a time. 
- CPUs can perform these matrix operations pretty well.
- But Moore's Law is coming to an end. 

![alt text](https://www.carestream.com/blog/wp-content/uploads/2015/09/CSH_CPU-GPU_Illustration.png)

- A CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time
- Luckily, GPUs (Graphics Processing Units) can perform matrix operations far faster than CPUs, often by orders of magnitude.

![alt text](https://www.researchgate.net/profile/Abel_Paz/publication/231167191/figure/fig2/AS:300437820461057@1448641366026/Comparison-of-CPU-versus-GPU-architecture.png)

- A GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.
- That's because GPUs were designed for 3D game rendering, which often involves parallel operations
- The ability of a GPU with 100+ cores to process thousands of threads can accelerate some software by 100x over a CPU alone. 
- What’s more, the GPU achieves this acceleration while being more power- and cost-efficient than a CPU.
- So when neural networks run on GPUs, they run much faster than on CPUs

![alt text](http://share.opsy.st/54176dc2ec16c-Vivante-Sept-Fig1.jpg)
![alt text](https://cdn-images-1.medium.com/max/600/1*Yx9XF4H4spE8Bm8XVY1bYA.png)

- A GPU is a vector machine. You can give it a long list of data — a 1D vector — and run computations on the entire list at the same time. 
- This way, we can perform more computations per second, but we have to perform the same computation on a vector of data in parallel. 
- GPUs are general purpose chips. They don't just perform matrix operations, they can really do any kind of computation.
- GPUs are optimized for taking huge batches of data and performing the same operation over and over very quickly
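The scalar-versus-vector difference described above can be sketched by counting steps. This is only a toy model: the `lanes` parameter is a hypothetical vector width chosen for the example, and plain Python does not actually run the lanes in parallel.

```python
def scalar_add(xs, ys):
    """Scalar style: one addition per step, one element at a time."""
    out = []
    steps = 0
    for x, y in zip(xs, ys):
        out.append(x + y)
        steps += 1
    return out, steps


def vector_add(xs, ys, lanes=4):
    """Vector style: each step applies the same addition to `lanes` elements."""
    out = []
    steps = 0
    for start in range(0, len(xs), lanes):
        chunk_x = xs[start:start + lanes]
        chunk_y = ys[start:start + lanes]
        # Models one "vector instruction" covering a whole chunk.
        out.extend(x + y for x, y in zip(chunk_x, chunk_y))
        steps += 1
    return out, steps


xs = list(range(10))
ys = [1] * 10
print(scalar_add(xs, ys))           # same result, one step per element
print(vector_add(xs, ys, lanes=4))  # same result, one step per chunk
```

Both functions return the same sums; only the step count changes, and only because every element gets the same operation.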


## The Power of Linear Algebra

- Let's perform a simple matrix operation on the GPU using the MXNet machine learning library
- This is basically CUDA (GPU programming): MXNet's NDArray is a thin wrapper on top of CUDA constructs. Awesome. 
- TensorFlow uses lots of CUDA as well. 

![alt text](https://cdn.kastatic.org/googleusercontent/xj8YqV88KB29MEKR5Iq68oUo1h2kFAIAewMsHeWS9-7l0KaB6BI3sOmpfGSCzsVU8z5Evq6QIrwbEAqBnZ5W06g0CQ)

In [0]:
#1 - MXNet Dependencies

In [0]:
#2 Basically CUDA time. CUDA is awesomeeeeeeee <3 u Nvidia 

# Q&A 

## The TPU

![alt text](https://pbs.twimg.com/media/Dcww6cOUQAAgKYb.jpg)

- 3 Generations 
- Google took 15 months for the TPUv1, and that was astonishingly fast for an ASIC.
- ASICs are initially expensive, requiring specialized engineers and manufacturing costs that start at around a million dollars. 
- And they are inflexible: there’s no way to change the chip once it’s finished.
- But if you know you’ll be doing one particular job in enough volume, the recurring benefits can make up for the initial drawbacks. 
- ASICs are generally the fastest and most energy-efficient way to accomplish a task. 

3 generations of TPUs. Describe a TPU. Then the simple CPU vs TPU add benchmark. Then the Japanese MLP GPU vs TPU benchmark. Then TensorFlow Shakespeare. 

![alt text](https://cdn-images-1.medium.com/max/800/1*dW4_Z8wsMcoS44dtKNacZQ.gif)

- The data of a neural network is arranged in a matrix, a 2D vector.
- So, Google decided they needed to build a matrix machine (the Tensor Processing Unit, or TPU)
- And they really only care about multiply-accumulate, so they prioritized that over other instructions that a processor would normally support.
- Most of the chip is devoted to the MACs that perform matrix multiplication, and other operations are mostly ignored.
- Google wanted to design a chip specifically for the matrix operations that neural networks require so that it would run them even more efficiently.

![alt text](https://cloud.google.com/tpu/docs/images/tpu--sys-arch3.png)

- TPU hardware is comprised of four independent chips.
- The following block diagram describes the components of a single chip. 
- Each chip consists of two compute cores called Tensor Cores. 
- A Tensor Core consists of scalar, vector and matrix units (MXU). 
- In addition, 8 GB of on-chip memory (HBM) is associated with each Tensor Core.
- The bulk of the compute horsepower in a Cloud TPU is provided by the MXU. 
- Each MXU is capable of performing 16K multiply-accumulate operations in each cycle. 
- While the MXU's inputs and outputs are 32-bit floating point values, the MXU performs multiplies at reduced bfloat16 precision. 
- Bfloat16 is a 16-bit floating point representation that provides better training and model accuracy than the IEEE half-precision representation.
- From a software perspective, each of the 8 cores on a Cloud TPU can execute user computations (XLA ops) independently. 
- High-bandwidth interconnects allow the chips to communicate directly with each other.
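To get a feel for the reduced bfloat16 precision mentioned above, here is a small sketch. bfloat16 keeps the sign bit and all 8 exponent bits of a 32-bit float but only the top 7 mantissa bits, so it can be modeled by keeping the upper 16 bits of the float32 pattern. The helper name `to_bfloat16` is made up for this example, and plain truncation is a simplifying assumption; real hardware may round instead.

```python
import struct


def to_bfloat16(value):
    """Return value as it would look after truncation to bfloat16 precision."""
    # Reinterpret the float32 encoding of value as a 32-bit unsigned integer.
    bits32 = struct.unpack(">I", struct.pack(">f", value))[0]
    # Keep the sign bit, the 8 exponent bits and the top 7 mantissa bits.
    bits16 = bits32 >> 16
    # Pad the dropped mantissa bits with zeros and convert back to a float.
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


for x in (1.0, 3.14159, 257.0):
    print(x, "->", to_bfloat16(x))

# The "16K multiply-accumulate operations" figure matches a 128x128 grid.
print(128 * 128)
```

The range stays the same as float32 because the exponent is untouched; what is lost is fine detail, which is why 257.0 collapses to 256.0 in this model.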

![alt text](https://cdn-images-1.medium.com/max/600/1*d7Lg4cYdQO2kxt9nSmyj-g.png)

- XLA is an experimental JIT compiler for the backend of TensorFlow. 
- It turns your TF graph into linear algebra, and it has backends of its own to run on CPUs, GPUs, or TPUs
- proprietary 

## The Systolic Array

- The way to achieve that matrix performance is through a piece of architecture called a systolic array.
- This is the interesting bit, and it’s why a TPU is performant. 
- A systolic array is a kind of hardware algorithm, and it describes a pattern of cells on a chip that computes matrix multiplication. 
- “Systolic” describes how data moves in waves across the chip, like the beating of a human heart.

![alt text](https://cdn-images-1.medium.com/max/800/1*umH-qhj3j1Z1k35uBkXM8g.png)

- On a TPU, for 2x2 inputs, each term in the output is the sum of two products. 
- No product is reused, but the individual terms are.
- We’ll implement this by building a 2x2 grid. (It’s actually a grid, not just an abstraction — hardware is fun like that).
- Note that 2x2 is a toy example, and the full size MXU is a monstrous 128x128.
- Let’s say AB/CD represents our activations and EF/GH our weights. For our array, we first load up the weights like so:

![alt text](https://cdn-images-1.medium.com/max/800/1*P83xXpFMjgLkAXf9i4uC9g.png)

- Our activations go into an input queue, which for our example will go on the left of each row.

![alt text](https://cdn-images-1.medium.com/max/800/1*PyZRdGouPq26ON8G1vo7gQ.png)

- Every cycle of the clock, each cell will execute the following steps, all in parallel:

1. Multiply our weight and the activation coming in from the left. If there’s no cell on the left, take from the input queue.
2. Add that product to the partial sum coming in from above. If there’s no cell above, the partial sum from above is zero.
3. Pass the activation to the cell to the right. If there’s no cell on the right, throw away the activation.
4. Pass the partial sum to the cell on the bottom. If there’s no cell on the bottom, collect the partial sum as an output.

[Python Example](https://github.com/antonpaquin/SystolicArrayDemo/blob/master/systolic.py)

By these rules, you can see the activations will start on the left and move one cell to the right per cycle, and the partial sums will start on top and move one cell down per cycle.

![alt text](https://cdn-images-1.medium.com/max/800/1*b7t8CxJJ8K61BC0NHXjfyg.png )

- That's our data flow. 

- So one cycle, of the many happening in parallel, looks like this:

1. Top left reads A from input queue, multiplies with weight E to produce product AE.
2. AE added to partial sum 0 from above, produces partial sum AE.
3. Activation A passed to the cell in the top right.
4. Partial sum AE passed to the cell in the bottom left.

![alt text](https://cdn-images-1.medium.com/max/600/1*_v7REW2VWoVsob9HS7kMFw.gif)

- It takes 3n-2 cycles to fully compute the result matrix for n x n inputs, whereas a standard sequential solution takes on the order of n³ multiply-accumulate steps. 
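The four per-cell rules and the cycle count above can be checked with a small simulation. This is a sketch written for these notes, not the linked example and not Google's implementation: the function name `systolic_matmul` is made up, and it assumes that row r of the array receives its activations r cycles late (the staggered input the diagrams suggest), so that each activation meets the right partial sum.

```python
def systolic_matmul(activations, weights):
    """Simulate activations x weights on an n x n weight-stationary array."""
    m = len(activations)   # number of input rows (the batch)
    n = len(weights)       # the array is n x n; cell (r, c) holds weights[r][c]
    act = [[0] * n for _ in range(n)]    # activation each cell passes right
    psum = [[0] * n for _ in range(n)]   # partial sum each cell passes down
    result = [[0] * n for _ in range(m)]
    cycles = m + 2 * n - 2               # equals 3n - 2 when m == n

    for t in range(cycles):
        new_act = [[0] * n for _ in range(n)]
        new_psum = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    # Rule 1: no cell on the left, so read the input queue.
                    i = t - r            # row r is delayed by r cycles
                    a_in = activations[i][r] if 0 <= i < m else 0
                else:
                    a_in = act[r][c - 1]
                # Rule 2: partial sum from above, or zero on the top row.
                p_in = psum[r - 1][c] if r > 0 else 0
                new_act[r][c] = a_in                            # rule 3
                new_psum[r][c] = p_in + a_in * weights[r][c]    # rule 4
        act, psum = new_act, new_psum
        # Bottom-row partial sums are finished outputs.
        for c in range(n):
            i = t - (n - 1) - c
            if 0 <= i < m:
                result[i][c] = psum[n - 1][c]
    return result, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, cycles)
```

For the 2x2 case, the first finished output is AE + BG, which leaves the bottom-left cell on the second cycle, matching the walkthrough above.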

![alt text](https://cdn-images-1.medium.com/max/600/1*IyFcYIFIaMwAn90RIJYfMw.png)

- We can do this because we’re running 128x128 MAC operations in parallel. 
- Multipliers are usually big and expensive to implement in hardware, but the high density of systolic arrays lets Google pack 16,384 of them into the MXU. 
- This translates directly into speed when training and running your network.
- Weights are loaded in much the same way as activations — through the input queue. 
- We just send a special control signal (red in the above diagram) to tell the array to store weights as they pass by, instead of running MAC operations.
- Weights remain in the same processing elements, so we can send an entire batch through before loading a new set, reducing overhead.
- That’s it! The rest of the chip is important and worth going over, but the core advantage of the TPU is its MXU — a systolic array matrix multiplication unit.


## What are the Use Cases for TPUs?

![alt text](https://storage.googleapis.com/gweb-cloudblog-publish/original_images/tpu-6tlel.PNG)

CPUs:

- Quick prototyping that requires maximum flexibility
- Simple models that do not take long to train
- Small models with small effective batch sizes
- Models that are dominated by custom TensorFlow operations written in C++
- Models that are limited by available I/O or the networking bandwidth of the host system


GPUs:

- Models that are not written in TensorFlow or cannot be written in TensorFlow
- Models for which source does not exist or is too onerous to change
- Models with a significant number of custom TensorFlow operations that must run at least partially on CPUs
- Models with TensorFlow ops that are not available on Cloud TPU (see the list of available TensorFlow ops)
- Medium-to-large models with larger effective batch sizes

TPUs:

- Models dominated by matrix computations
- Models with no custom TensorFlow operations inside the main training loop
- Models that train for weeks or months
- Larger and very large models with very large effective batch sizes


# Benchmark #1 - Simple matrix add operation on CPU vs TPU

In [0]:
# 3 Dependencies for benchmark 1

In [0]:
#4 CPU Benchmark

In [0]:
# 5 TPU Benchmark

In [0]:
# 6 TPU Continued

# Q&A 

## Key TPU Functions in TensorFlow

#### The TPU Estimator

- Estimators are TensorFlow's model-level abstraction.
- Standard Estimators can drive models on CPUs and GPUs. You must use tf.contrib.tpu.TPUEstimator to drive a model on TPUs.

```
my_tpu_estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,
    config=tf.contrib.tpu.RunConfig(),
    use_tpu=False)
```
#### The TPU Run Configuration



```
my_tpu_run_config = tf.contrib.tpu.RunConfig(
    master=master,
    evaluation_master=master,
    model_dir=FLAGS.model_dir,
    session_config=tf.ConfigProto(
        allow_soft_placement=True, log_device_placement=True),
    tpu_config=tf.contrib.tpu.TPUConfig(FLAGS.iterations,
                                        FLAGS.num_shards),
)

```
#### The Cross Shard Optimizer

- When training on a Cloud TPU, you must wrap the optimizer in a tf.contrib.tpu.CrossShardOptimizer, which uses an allreduce to aggregate gradients and broadcast the result to each shard (each TPU core).

```
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
if FLAGS.use_tpu:
  optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
```
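
What the allreduce does can be sketched in a few lines of plain Python. This is an illustration only, not how TensorFlow implements it: the function name `allreduce_mean` is made up, each shard's gradients are modeled as a simple list of numbers, and averaging (rather than summing) is an assumption made for the sketch.

```python
def allreduce_mean(shard_gradients):
    """Average gradients across shards and hand every shard the same result."""
    num_shards = len(shard_gradients)
    num_params = len(shard_gradients[0])
    # Reduce: combine the gradient for each parameter across all shards.
    mean = [
        sum(grads[p] for grads in shard_gradients) / num_shards
        for p in range(num_params)
    ]
    # Broadcast: every shard receives its own copy of the combined gradient.
    return [list(mean) for _ in range(num_shards)]


# Two shards, two parameters each.
print(allreduce_mean([[1.0, 2.0], [3.0, 4.0]]))
```

Because every shard ends up with identical gradients, every shard applies the same update and the model copies stay in sync.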

#### Tensor dimensions need to be statically defined at compile time thanks to XLA

- During regular TensorFlow execution, any unknown shape dimensions are determined dynamically.
- XLA, however, requires that all tensor dimensions be statically defined at compile time. All shapes must evaluate to a constant and must not depend on external data or on stateful operations like variables or a random number generator.
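
One place this shows up in practice is batching: if the dataset size is not a multiple of the batch size, the last batch is smaller and the batch dimension is no longer a constant. The sketch below shows the idea in plain Python with a made-up helper, `fixed_size_batches`, that keeps only full batches. In TensorFlow, `tf.data.Dataset.batch` has a `drop_remainder` argument that serves the same purpose.

```python
def fixed_size_batches(items, batch_size):
    """Yield only full batches so every batch has the same, known length."""
    full = len(items) - len(items) % batch_size   # items that fit full batches
    for start in range(0, full, batch_size):
        yield items[start:start + batch_size]


# 10 items with batch size 4: the 2 leftover items are dropped.
for batch in fixed_size_batches(list(range(10)), 4):
    print(len(batch), batch)
```

Dropping the remainder trades a few examples per epoch for a shape that is known at compile time.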





# Benchmark #2 - Multilayer perceptron for image classification using the CIFAR-10 dataset (GPU vs TPU)

In [0]:
import tensorflow as tf
import tensorflow.keras.backend as K
import numpy as np
from tensorflow.contrib.tpu.python.tpu import keras_support
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, AveragePooling2D, Dense, Dropout, Flatten
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.utils import to_categorical
import os

def basic_mlp_module(input, units):
    x = Dense(units)(input)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = Dropout(0.5)(x)
    return x

def create_mlp_model():
    input = Input((32*32*3,))
    x = basic_mlp_module(input, 2048)
    x = basic_mlp_module(x, 1024)
    x = basic_mlp_module(x, 512)
    x = basic_mlp_module(x, 256)
    x = basic_mlp_module(x, 128)
    x = basic_mlp_module(x, 64)
    x = basic_mlp_module(x, 32)
    x = basic_mlp_module(x, 16)
    x = Dense(10, activation="softmax")(x)
    return Model(input, x)

def main():
    # 7 lets make this into TPU Code!!!
    K.clear_session()

    # CIFAR
    (X_train, y_train), (_, _) = cifar10.load_data()
    X_train = (X_train / 255.0).reshape(50000, -1)
    y_train = to_categorical(y_train)

    # Model building
    model = create_mlp_model()
    model.compile(tf.train.AdamOptimizer(learning_rate=1e-3), loss="categorical_crossentropy", metrics=["acc"])

    model.fit(X_train, y_train, batch_size=1024, epochs=10)

if __name__ == "__main__":
    main()

### Let's look at their default 'train Shakespeare in 5 minutes' example

https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/shakespeare_with_tpu_and_keras.ipynb

 