========================================Compilers and Interpreters==================================

In [None]:
def add_():
    return '''
def add(a, b):
    return a + b
'''

def fancy_func_():
    return '''
def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g
'''

def evoke_():
    return add_() + fancy_func_() + 'print(fancy_func(1, 2, 3, 4))'

prog = evoke_()
print(prog)
y = compile(prog, '', 'exec')
exec(y)

##### Imperative (interpreted) programming and symbolic programming

Imperative programming is easier to write and debug. Most imperative Python code is straightforward, and debugging is simpler because you can obtain and print every relevant intermediate value or use Python’s built-in debugging tools.

Symbolic programming is more efficient and easier to port. Symbolic programming makes it easier to optimize the code during compilation, while also having the ability to port the program into a format independent of Python. This allows the program to be run in a non-Python environment, thus avoiding any potential performance issues related to the Python interpreter.
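
To make the "easier to optimize during compilation" point concrete, here is a minimal sketch that treats a program as data and simplifies it before running it. It uses Python's standard `ast` module (`ast.unparse` needs Python 3.9+). The `ConstantFolder` class and the `optimize` helper are hypothetical names chosen for illustration, and this is not how TorchScript itself works internally.

```python
import ast
import operator

# Binary operators this toy optimizer knows how to evaluate ahead of time
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ConstantFolder(ast.NodeTransformer):
    """Replace constant sub-expressions such as (2 + 3) by their value."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the children first
        op = _OPS.get(type(node.op))
        both_constant = (
            isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and isinstance(node.left.value, (int, float))
            and isinstance(node.right.value, (int, float))
        )
        if op is not None and both_constant:
            folded = ast.Constant(value=op(node.left.value, node.right.value))
            return ast.copy_location(folded, node)
        return node


def optimize(source):
    """Parse an expression, fold its constants, and return the new tree."""
    tree = ast.parse(source, mode="eval")
    return ast.fix_missing_locations(ConstantFolder().visit(tree))


tree = optimize("x * (2 + 3) + 4 * 5")
optimized_source = ast.unparse(tree)         # the simplified program as text
code = compile(tree, "<symbolic>", "eval")   # compile once ...
result = eval(code, {"x": 10})               # ... then run it
print(optimized_source, "->", result)
```

Because the whole expression is available before execution, the optimizer can remove work that an interpreter would otherwise repeat on every call.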

In [1]:
import torch
from torch import nn
from d2l import torch as d2l


# Factory for networks
def get_net():
    net = nn.Sequential(nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 2))
    return net

x = torch.randn(size=(1, 512))
net = get_net()
net(x)

tensor([[-0.0645,  0.0214]], grad_fn=<AddmmBackward0>)

In [None]:
net = torch.jit.script(net)
net(x)

In [5]:
#@save
class Benchmark:
    """For measuring running time."""
    def __init__(self, description='Done'):
        self.description = description

    def __enter__(self):
        self.timer = d2l.Timer()
        return self

    def __exit__(self, *args):
        print(f'{self.description}: {self.timer.stop():.4f} sec')

net = get_net()
with Benchmark('Without torchscript'):
    for i in range(10000): net(x)

net = torch.jit.script(net)
with Benchmark('With torchscript'):
    for i in range(10000): net(x)

Without torchscript: 0.7279 sec
With torchscript: 0.6189 sec


In [7]:
# One benefit of compiling the model is that we can serialize (save) the model and its parameters to disk.
import os

path = '../data/my_mlp'
net.save(path)
# Portable replacement for `!ls -lh`, which is not available in the Windows shell
print(f'{path}: {os.path.getsize(path) / 1024:.1f} KiB')


======================================Asynchronous Computation==================================

In PyTorch, GPU operations are asynchronous by default. When you call a function that uses the GPU, the operation is enqueued on that device but not necessarily executed until later. This lets more computations run in parallel, including operations on the CPU or on other GPUs.
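
The enqueue-now, execute-later behavior can be imitated without a GPU. The sketch below is only an analogy built on the standard library: a single worker thread stands in for one device queue, `slow_kernel` is a hypothetical stand-in for a GPU kernel, and waiting on the futures plays the role of a synchronization call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_kernel(i):
    """Hypothetical stand-in for a GPU kernel that takes about 50 ms."""
    time.sleep(0.05)
    return i * i


# One worker thread plays the role of a single device queue.
with ThreadPoolExecutor(max_workers=1) as device_queue:
    start = time.perf_counter()
    futures = [device_queue.submit(slow_kernel, i) for i in range(4)]
    enqueue_time = time.perf_counter() - start   # submitting returns almost at once

    results = [f.result() for f in futures]      # blocks until the work is done
    total_time = time.perf_counter() - start

print(f"enqueue: {enqueue_time:.4f} s, total: {total_time:.4f} s")
```

Timing only the enqueue step would make the work look nearly free, which is why GPU benchmarks need an explicit synchronization point before the timer stops.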

In [None]:
import os
import subprocess
import numpy
import torch
from torch import nn
from d2l import torch as d2l

In [None]:
# On the first GPU operation, PyTorch triggers a bunch of one-time costs:
# 1. CUDA context initialization
#   Creating the CUDA context can take tens to hundreds of milliseconds
# 2. Memory allocator setup
#   GPU memory pools are initialized lazily
# 3. Kernel loading / JIT compilation
#   cuBLAS / CUDA kernels are loaded and sometimes compiled
# 4. Driver synchronization
#   The first call often forces a sync that later calls avoid
# If you don't warm up, the first timed iteration includes all of this overhead.

# The next three lines warm up the GPU. They ensure that the CUDA context and stream exist
# and that the kernels are loaded, so later operations measure only asynchronous compute.
device = d2l.try_gpu()
a = torch.randn(size=(1000, 1000), device=device)
b = torch.mm(a, a)

with d2l.Benchmark('numpy'):
    for _ in range(10):
        a = numpy.random.normal(size=(1000, 1000))
        b = numpy.dot(a, a)

with d2l.Benchmark('torch'):
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)

with d2l.Benchmark():
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)
    torch.cuda.synchronize(device)  # Wait for all the computations to finish

In [None]:
x = torch.ones((1, 2), device=device)
y = torch.ones((1, 2), device=device)
z = x * y + 2
z

==========================================Automatic Parallelism======================================

#### Computational Graphs and Automatic Parallelism

Modern deep learning frameworks such as **PyTorch** and **MXNet** automatically build **computational graphs** that represent operations and their dependencies. These graphs make it clear which computations depend on others and which can run **in parallel**.

By explicitly encoding dependencies, the framework can execute **independent operations simultaneously**, improving performance without manual coordination.
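
As a small illustration of how a dependency graph exposes parallelism, the sketch below groups the operations of the earlier `fancy_func` example (`e = add(a, b)`, `f = add(c, d)`, `g = add(e, f)`) into waves of mutually independent nodes. It relies on the standard-library `graphlib` module (Python 3.9+); the variable names are illustrative, and real frameworks use far more elaborate schedulers.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its set:
# e = add(a, b), f = add(c, d), g = add(e, f)
dependencies = {"e": {"a", "b"}, "f": {"c", "d"}, "g": {"e", "f"}}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())   # every node whose inputs are available
    waves.append(ready)
    sorter.done(*ready)                  # mark the whole wave as finished

print(waves)
```

Nodes in the same wave have no dependencies on one another, so `e` and `f` could run concurrently, while `g` must wait for both.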

##### Execution and Parallelism

On a **single device**, most operators (e.g., matrix multiplication or convolution) already use **all available hardware resources**:
- CPUs utilize all cores and threads
- GPUs utilize all compute units

Because of this, additional parallelism provides limited benefit on a single device. Significant speedups appear mainly in **multi-device setups**.

##### Automatic Multi-Device Scaling

When multiple devices are available, frameworks can automatically:
- Distribute computations across GPUs and CPUs
- Overlap computation with communication
- Increase overall training throughput

This requires little to no manual device management.

##### Key Takeaways

| Concept | Description |
|-------|-------------|
| **Computational Graph** | Encodes operations and dependencies |
| **Automatic Parallelism** | Independent operations run concurrently |
| **Single Device** | One operator already saturates hardware |
| **Multi-Device** | Workload distribution enables major speedups |
| **Developer Experience** | Simple code with optimized execution |

##### Summary

Deep learning frameworks rely on **computational graphs** to automatically schedule and parallelize computation across CPUs and GPUs. This allows developers to write **high-level, concise code** while still achieving **efficient hardware utilization**.


In [None]:
# need at least two GPUs to run the experiments in this section

In [None]:
import torch
from d2l import torch as d2l

In [None]:
devices = d2l.try_all_gpus()
def run(x):
    return [x.mm(x) for _ in range(50)]

# two variables
x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])
x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])

In [None]:
# warm up all devices first
run(x_gpu1)
run(x_gpu2)
torch.cuda.synchronize(devices[0]) #  waits for all kernels in all streams on a CUDA device to complete. 
torch.cuda.synchronize(devices[1])

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    torch.cuda.synchronize(devices[0])

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    torch.cuda.synchronize(devices[1])

In [None]:
# If we remove the synchronize statement between both tasks the system is free to parallelize computation on both devices automatically.

with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    torch.cuda.synchronize()

In [None]:
# In many cases we need to move data between different devices, say between the CPU and GPU, or between different GPUs. For instance, this occurs when we want to perform 
# distributed optimization where we need to aggregate the gradients over multiple accelerator cards. Let's simulate this by computing on the GPU and then copying the results 
# back to the CPU.

def copy_to_cpu(x, non_blocking=False):
    return [y.to('cpu', non_blocking=non_blocking) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    torch.cuda.synchronize()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    torch.cuda.synchronize()

In [None]:
# This is somewhat inefficient. Note that we could already start copying parts of y to the CPU while the remainder of the list is still being computed. This situation occurs, 
# e.g., when we compute the (backprop) gradient on a minibatch. The gradients of some of the parameters will be available earlier than that of others. Hence it works to our
# advantage to start using PCI-Express bus bandwidth while the GPU is still running. In PyTorch, several functions such as to() and copy_() admit an explicit non_blocking argument, 
# which lets the caller bypass synchronization when it is unnecessary. Setting non_blocking=True allows us to simulate this scenario.

# The total time required for both operations is (as expected) less than the sum of their parts. Note that this task is different from parallel computation as it uses a different 
# resource: the bus between the CPU and GPUs. In fact, we could compute on both devices and communicate, all at the same time. As noted above, there is a dependency between 
# computation and communication: y[i] must be computed before it can be copied to the CPU. Fortunately, the system can copy y[i-1] while computing y[i] to reduce the total 
# running time.

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y, True)
    torch.cuda.synchronize()
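
Why overlapping helps can be estimated with a simple two-stage pipeline model. The sketch below is a back-of-the-envelope calculation, not a measurement: the per-item costs are hypothetical, and it assumes every `y[i]` takes the same time to compute and to copy.

```python
def sequential_time(n, compute, copy):
    """All n results are computed first, then all of them are copied."""
    return n * compute + n * copy


def overlapped_time(n, compute, copy):
    """Result i-1 is copied while result i is computed (two-stage pipeline)."""
    return compute + (n - 1) * max(compute, copy) + copy


n, compute_ms, copy_ms = 50, 4.0, 3.0   # hypothetical per-item costs in ms
seq = sequential_time(n, compute_ms, copy_ms)
ovl = overlapped_time(n, compute_ms, copy_ms)
print(f"sequential: {seq} ms, overlapped: {ovl} ms")
```

In this model the slower of the two stages sets the pace, so the total approaches `n * max(compute, copy)` instead of `n * (compute + copy)`.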

Below is an illustration of the computational graph and its dependencies for a simple two-layer MLP when training on a CPU and two GPUs. 
It would be quite painful to schedule the parallel program resulting from this manually. This is where it is advantageous to have a graph-based computing backend for optimization.

!["c g"](./Images/Computationalgraph1.png)



=======================================Hardware=========================================

![Latency numbers](./Images/LatencyNumbers.png)

#### Computers

##### 1. Overview

Modern computers used in **deep learning** or **high-performance computing (HPC)** environments are composed of multiple key components that work together to execute, store, and transfer data efficiently.

A well-balanced system design ensures that **no single component becomes a bottleneck** during computation, data movement, or communication.

##### 2. Key Components of a Computer

A typical deep learning computer includes:

1. **Processor (CPU)**  
   - Executes programs and system operations.  
   - Modern CPUs have 8 or more cores.  
   - Manages program logic, control flow, and data orchestration.

2. **Memory (RAM)**  
   - Temporarily stores data during computation, such as model parameters and activations.  
   - Provides fast access for CPUs and GPUs.  

3. **Network Connection**  
   - Provides external communication (e.g., Ethernet).  
   - Bandwidth ranges from **1 Gbit/s to 100 Gbit/s**, depending on hardware.  
   - In servers, advanced interconnects improve data flow between machines.

4. **Expansion Bus (PCIe)**  
   - High-speed interface connecting CPUs to GPUs, SSDs, and network cards.  
   - Provides **direct data transfer** between components.  
   - Servers may include up to **8 GPUs** connected via PCIe; desktops usually have 1–2.

5. **Durable Storage**  
   - Includes hard drives (HDDs) or solid-state drives (SSDs).  
   - Often connected via PCIe for fast data movement.  
   - Stores large datasets and checkpoints efficiently.

##### 3. Data Flow and Connectivity

In a typical system, the **CPU acts as the central hub**:
- It connects to **RAM, storage, GPUs, and the network** via the **PCIe bus**.
- Data flows between these components as shown:

![PCI connection](./Images/PCIConnection.png)

##### Example:
- AMD’s **Threadripper 3** has **64 PCIe 4.0 lanes**, each supporting **16 Gbit/s** of data transfer in both directions.
- Total memory bandwidth can reach **up to 100 GB/s**.
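
The lane figures translate into bytes per second with a unit conversion. The sketch below uses the raw 16 Gbit/s per lane quoted above and ignores encoding and protocol overhead (an assumption), so real throughput is somewhat lower; the helper name is hypothetical.

```python
BITS_PER_BYTE = 8


def lanes_to_gbytes_per_s(lanes, gbit_per_lane=16):
    """Raw one-direction bandwidth in GB/s for a given number of PCIe 4.0 lanes."""
    return lanes * gbit_per_lane / BITS_PER_BYTE


per_lane = lanes_to_gbytes_per_s(1)     # a single lane
x16_slot = lanes_to_gbytes_per_s(16)    # a typical GPU slot
all_lanes = lanes_to_gbytes_per_s(64)   # every lane of the CPU combined

print(per_lane, x16_slot, all_lanes)
```

So one lane carries about 2 GB/s, a 16-lane GPU slot about 32 GB/s, and all 64 lanes together about 128 GB/s in each direction.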

##### 4. System Performance Considerations

To achieve high performance:
- Data movement between CPU, GPU, and storage must be **well-balanced**.  
- **CPU/GPU starvation** occurs if data isn’t fed quickly enough to processing units.  
- The **network should not slow down synchronization** when using distributed training.

##### Optimization Techniques:
- **Interleave computation and communication** so that processors are always busy.  
- **Avoid bottlenecks**: each component should operate near its optimal throughput.

##### ✅ Key Takeaways

| Component | Function | Typical Bandwidth | Key Role in Deep Learning |
|------------|-----------|-------------------|---------------------------|
| **CPU** | Executes instructions, orchestrates operations | Up to 100 GB/s (RAM access) | Control & coordination |
| **RAM** | Stores intermediate results and activations | 40–100 GB/s | Fast temporary storage |
| **PCIe** | Connects CPU, GPU, and storage | ~2 GB/s per lane, ~32 GB/s per 16-lane slot (PCIe 4.0) | High-speed interconnect |
| **GPU** | Performs large-scale parallel computation | 500+ GB/s (onboard) | Training/inference core |
| **Storage (SSD)** | Persistent data storage | 1–3 GB/s | Dataset & checkpoint storage |
| **Network** | Cross-system data exchange | 1–100 Gbit/s | Distributed computing |


#### Memory

##### 1. CPU Memory (Main RAM)

Modern CPUs typically use **DDR4 RAM**, providing **20–25 GB/s bandwidth per module**.  
Each module has a **64-bit-wide bus**, and CPUs often support **2–4 memory channels**, giving a total bandwidth between **40 GB/s and 100 GB/s**.  
There are usually multiple **banks per channel** (e.g., AMD’s Threadripper CPUs have 8 slots).

##### 2. How Memory Access Works

When data is accessed from RAM, the CPU must:
1. **Send an address** to the memory module.
2. Perform a **burst read**, where multiple consecutive data elements are read after the initial setup.

- **First access latency:** ~100 ns  
- **Subsequent reads:** ~0.2 ns  
  → The first read is about **500× slower** than subsequent ones.

This means:
- **Random memory access** is costly.
- **Sequential (burst) access** is much faster and should be preferred whenever possible.
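
The cost difference can be put into numbers with the two latencies quoted above. This is an idealized sketch: it assumes a single burst can cover all the requested words, which real DRAM burst lengths do not allow, so treat the result as an upper bound on the benefit.

```python
FIRST_ACCESS_NS = 100.0   # address setup plus the first word
BURST_WORD_NS = 0.2       # each further word within the same burst


def random_read_ns(n_words):
    """Every word pays the full first-access latency."""
    return n_words * FIRST_ACCESS_NS


def burst_read_ns(n_words):
    """Only the first word pays the setup cost; the rest stream out."""
    return FIRST_ACCESS_NS + (n_words - 1) * BURST_WORD_NS


n = 64
random_ns = random_read_ns(n)
burst_ns = burst_read_ns(n)
print(f"random: {random_ns:.1f} ns, burst: {burst_ns:.1f} ns, "
      f"ratio: {random_ns / burst_ns:.1f}x")
```

Under these assumptions, reading 64 words at random addresses takes 6,400 ns, while one burst takes roughly 113 ns.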

##### 3. Memory Banks and Alignment

Memory is divided into **banks**, each capable of independent access.  
This allows:
- **Up to 4× higher random read throughput** when accesses are evenly distributed.  
- However, **burst reads** still outperform random reads overall.

To optimize memory performance:
- **Align data structures** to 64-bit boundaries.
- Modern compilers handle alignment **automatically** when appropriate flags are used.

##### 4. GPU Memory

GPUs have far higher bandwidth needs due to massive parallelism.  
They address this through two main design strategies:
1. **Wider memory buses** (e.g., the NVIDIA RTX 2080 Ti has a 352-bit bus).  
2. **High-performance memory types**, such as:
   - **GDDR6** (500+ GB/s bandwidth)
   - **HBM (High Bandwidth Memory)** for top-end models like **NVIDIA Volta V100**.

GPU memory:
- Uses a **dedicated silicon interface**, making it **very fast but expensive**.
- Has **smaller capacity** than CPU RAM.
- Prioritizes **throughput over capacity** for deep learning workloads.
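
Bus width converts to bandwidth once a per-pin data rate is known. In the sketch below, the 14 Gbit/s GDDR6 rate and the DDR4-3200 rate of 3.2 Gbit/s per pin are assumptions that do not come from the text above; only the 352-bit and 64-bit bus widths do.

```python
def memory_bandwidth_gb_s(bus_width_bits, gbit_per_pin):
    """Peak bandwidth in GB/s = bus width (bits) x per-pin rate (Gbit/s) / 8."""
    return bus_width_bits * gbit_per_pin / 8


# Assumed per-pin rates: 14 Gbit/s for GDDR6, 3.2 Gbit/s for DDR4-3200
rtx_2080_ti = memory_bandwidth_gb_s(352, 14)
ddr4_module = memory_bandwidth_gb_s(64, 3.2)

print(f"GPU: {rtx_2080_ti:.1f} GB/s, one DDR4 module: {ddr4_module:.1f} GB/s")
```

With these rates the wide GPU bus delivers a few hundred GB/s, while a single 64-bit DDR4 module stays in the 20–25 GB/s range mentioned earlier.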

##### ✅ Key Takeaways

| Aspect | CPU Memory | GPU Memory |
|--------|-------------|------------|
| Type | DDR4 | GDDR6 / HBM |
| Bandwidth | 20–100 GB/s | 500+ GB/s |
| Capacity | Larger | Smaller |
| Latency | Higher initial read | Optimized for throughput |
| Access pattern | Prefer sequential (burst) | Highly parallel |
| Cost | Relatively cheap | Expensive (dedicated silicon) |

##### 5. Summary

- Memory serves as the CPU’s or GPU’s workspace for actively used data.
- The key to high performance lies in **sequential access** and **alignment**.
- GPU memory is designed for **speed and parallelism**, while CPU memory is optimized for **capacity and flexibility**.
- Understanding burst reads, banks, and alignment helps avoid inefficient random memory access.

#### Storage

##### 1. Overview

Like RAM, **storage devices** are defined by two main performance metrics:
- **Bandwidth**: how much data can be transferred per second.  
- **Latency**: how long it takes to start a transfer.  

Storage systems tend to have much higher latency and lower bandwidth than memory.  
Different storage technologies balance cost, speed, and capacity differently.

##### 2. Hard Disk Drives (HDDs)

**Hard disk drives (HDDs)** are mechanical storage devices that have been used for decades.  
They store data on **spinning platters** accessed by **read/write heads** that move physically to the correct track.

##### Key characteristics
- Speed: ~**7,200 RPM**  
- Latency: ~**8 ms per access**  
- Performance: ~**100 IOPS (I/O operations per second)**  
- Bandwidth: **100–200 MB/s**
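
These figures are related by simple arithmetic, sketched below. The 4 KiB request size is a hypothetical choice made only to show how far random-access throughput falls below the sequential bandwidth.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation in milliseconds."""
    return 60_000 / rpm


def iops(access_ms):
    """Operations per second if every access takes access_ms milliseconds."""
    return 1000 / access_ms


full_rotation = rotation_ms(7200)             # one revolution
avg_rotational_latency = full_rotation / 2    # on average, half a revolution
hdd_iops = iops(8)                            # at ~8 ms per access

# Hypothetical workload: random 4 KiB reads
random_mb_per_s = hdd_iops * 4 * 1024 / 1e6

print(f"{full_rotation:.2f} ms per rotation, {hdd_iops:.0f} IOPS, "
      f"{random_mb_per_s:.2f} MB/s for random 4 KiB reads")
```

At roughly 8 ms per access the drive manages on the order of 100 operations per second, so small random reads reach well under 1 MB/s even though sequential reads reach 100–200 MB/s.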

##### Limitations
- Slow random access due to mechanical movement.  
- Bandwidth improvements limited by physics (disk speed and density).  
- Susceptible to catastrophic **mechanical failure**.  

🟠 **Best for:** archival or low-cost large-capacity storage.

##### 3. Solid State Drives (SSDs)

**Solid state drives (SSDs)** use flash memory to store data without moving parts.  
They are **orders of magnitude faster** than HDDs, especially for random access.

##### Performance
- **100,000–500,000 IOPS** (vs. ~100 for HDDs)  
- **1–3 GB/s bandwidth**, up to **8 GB/s** with **NVMe PCIe 4.0**

##### Advantages
- No moving parts → low latency and high reliability.  
- Much faster for both sequential and random reads.

##### Drawbacks
- **Writes are slow**: SSDs store data in large blocks (256 KB or larger).  
  Writing requires reading, erasing, and rewriting an entire block.  
- **Limited write endurance:** cells wear out after thousands of writes.  
  → Mitigated by **wear leveling** in firmware.  
- Not recommended for swap space or large log aggregation due to high write volume.  
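
The read-erase-rewrite cycle explains why small writes are expensive. The sketch below computes a worst-case write amplification under simplifying assumptions: writes are aligned to block boundaries and the drive's firmware does no remapping or buffering, which real SSDs do. The function name is hypothetical.

```python
def write_amplification(write_bytes, block_bytes=256 * 1024):
    """Bytes physically rewritten per byte the application asked to write."""
    blocks_touched = -(-write_bytes // block_bytes)   # ceiling division
    return blocks_touched * block_bytes / write_bytes


small_write = write_amplification(4 * 1024)      # a 4 KiB log entry
full_block = write_amplification(256 * 1024)     # exactly one block

print(f"4 KiB write: {small_write:.0f}x, 256 KiB write: {full_block:.0f}x")
```

In this model a 4 KiB update forces a whole 256 KiB block to be rewritten, which is why write-heavy uses such as swap or log aggregation wear out a drive quickly.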

🟢 **Best for:** OS drives, high-speed data processing, ML datasets, caching.

##### 4. Cloud Storage

**Cloud storage** provides virtualized, scalable storage over a network.

##### Characteristics
- Dynamically adjustable capacity and bandwidth.  
- Users can configure **IOPS (Input/Output Operations per Second)** to control performance.  
- Higher latency due to network communication.

🟣 **Best for:** scalable workloads and distributed systems, not low-latency operations like local ML training.

##### ✅ Key Comparison

| Feature | HDD | SSD | Cloud Storage |
|----------|-----|-----|---------------|
| Type | Mechanical | Flash (no moving parts) | Virtual / Network |
| Typical IOPS | ~100 | 100,000–500,000 | Configurable |
| Bandwidth | 100–200 MB/s | 1–8 GB/s | Varies (network-limited) |
| Latency | ~8 ms | ~100 µs | Depends on network |
| Cost | Low | Medium | Pay-as-you-go |
| Durability | Fragile | Moderate (wear-out) | High (redundant) |
| Best for | Archival / cold storage | Fast workloads | Elastic storage |

##### 5. Summary

- **HDDs**: Cheap, high capacity, but slow and mechanical.  
- **SSDs**: Fast, silent, and reliable, but have limited write endurance.  
- **Cloud storage**: Scalable and flexible, but slower due to network latency.  
Understanding these trade-offs is essential for balancing **speed, cost, and capacity** in computing and deep learning workflows.


#### CPUs

##### 1. Overview

**Central Processing Units (CPUs)** are the core computational components of a computer system.  
They contain:
- **Processor cores**: execute instructions.
- **Caches**: store recently used data for fast access.
- **Interconnects**: connect cores, caches, and memory subsystems.

Modern CPUs often include:
- **Integrated GPUs** (for graphics and computation)
- **Vector processing units** (for high-performance math operations, e.g., convolutions)

**Example:**  
An Intel Skylake quad-core CPU integrates cores, caches, GPU, and system interfaces (Ethernet, Wi-Fi, USB) via a ring bus or PCIe connection.

##### 2. Microarchitecture

Each CPU core contains a pipeline with multiple stages that process instructions efficiently.

##### Typical stages:
1. **Fetch**: Load instructions from memory.  
2. **Decode**: Translate machine instructions into micro-operations.  
3. **Dispatch**: Send instructions to execution units.  
4. **Execute**: Perform arithmetic or logical operations.  
5. **Write-back**: Store results in registers or memory.

Modern CPUs execute **multiple instructions per clock cycle** (superscalar and out-of-order execution).  
For example, **ARM Cortex A77** can perform **up to 8 operations per cycle**.

**Branch prediction:**  
When encountering a conditional branch, CPUs predict which path will be taken to avoid idle cycles.  
If the prediction is wrong, the pipeline is flushed and restarted. A misprediction costs cycles, but accurate predictions improve average throughput.

##### 3. Vectorization

To boost performance in compute-heavy tasks like deep learning, CPUs use **vectorization**, i.e., performing the same operation on multiple data points simultaneously.

##### SIMD (Single Instruction, Multiple Data)
Instruction sets enabling vectorization:
- ARM: **NEON**
- Intel/AMD: **AVX**, **AVX2**, **AVX-512**

Example: a **128-bit NEON** register holds eight 16-bit integers, so a single instruction can add 8 pairs of integers in one clock cycle.
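
The number of values handled per instruction follows from the register width divided by the element width. The sketch below assumes 128-bit NEON, 256-bit AVX2, and 512-bit AVX-512 registers and ignores everything else that affects real throughput; both helper names are hypothetical.

```python
def simd_lanes(register_bits, element_bits):
    """How many elements fit in one vector register."""
    return register_bits // element_bits


def instructions_needed(n_elements, register_bits, element_bits):
    """Vector instructions required to process n_elements once."""
    lanes = simd_lanes(register_bits, element_bits)
    return -(-n_elements // lanes)   # ceiling division


neon_int16 = simd_lanes(128, 16)     # NEON with 16-bit integers
avx2_fp32 = simd_lanes(256, 32)      # AVX2 with 32-bit floats
avx512_fp32 = simd_lanes(512, 32)    # AVX-512 with 32-bit floats

vector_adds = instructions_needed(1000, 128, 16)
print(neon_int16, avx2_fp32, avx512_fp32, vector_adds)
```

Adding 1,000 pairs of 16-bit integers therefore needs 125 NEON instructions instead of 1,000 scalar ones.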

**Benefit:**  
Vectorization enables CPUs to process arrays, matrices, or tensors much faster by grouping operations, reducing the number of instructions executed.

While this boosts CPU throughput, **GPUs** still outperform CPUs because they have **thousands of parallel vector units**.

##### 4. Cache

**Caches** are small, fast memories inside the CPU that minimize data retrieval delays from slower RAM.

##### Cache hierarchy

| Level | Typical Size | Speed | Description |
|--------|---------------|--------|--------------|
| **Registers** | Few KB | Fastest | Store currently executing data. |
| **L1 Cache** | 32–64 KB per core | Extremely fast | First-level data and instruction cache. |
| **L2 Cache** | 256–512 KB per core | Fast | Intermediate cache layer. |
| **L3 Cache** | 4–8 MB shared | Slower | Shared across cores, used for inter-core data. |

**Example:** AMD EPYC CPUs can feature **256 MB of L3 cache** to boost multi-core data sharing.

##### Optimization Concepts
- **Spatial locality:** Consecutive data stored together to exploit sequential access.  
- **Temporal locality:** Recently used data kept in cache for reuse.  
- **Prefetching:** Predicts and loads data likely to be used next.

##### Cache Misses and Performance

A **cache miss** occurs when data is not found in cache, forcing the CPU to fetch it from RAM (much slower).

- **Miss penalty:** CPU stalls while waiting for data.  
- **Mitigation:** Access data sequentially, reuse data, and design algorithms for locality.
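
The effect of access patterns on misses can be shown with a toy simulator. The sketch below models a direct-mapped cache with 64 lines of 64 bytes each; these parameters and the `count_misses` helper are hypothetical, and real CPU caches are set-associative and use prefetching.

```python
def count_misses(addresses, line_bytes=64, num_lines=64):
    """Count misses in a toy direct-mapped cache."""
    tags = [None] * num_lines          # tag currently held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this byte
        index = block % num_lines      # cache line that block maps to
        tag = block // num_lines
        if tags[index] != tag:
            misses += 1
            tags[index] = tag          # load the block, evicting the old one
    return misses


n = 1024
sequential = [i * 4 for i in range(n)]    # consecutive 4-byte values
strided = [i * 64 for i in range(n)]      # one value per 64-byte block

seq_misses = count_misses(sequential)
strided_misses = count_misses(strided)
print(f"sequential: {seq_misses} misses, strided: {strided_misses} misses")
```

Both loops perform 1,024 reads, but the sequential one reuses each loaded line 16 times, while the strided one misses on every access.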

##### Trade-offs
- Larger caches reduce misses but increase **latency** and **power usage**.  
- CPU design balances cache size and access speed to optimize performance.

##### ✅ Key Takeaways

| Concept | Purpose | Example / Benefit |
|----------|----------|------------------|
| **Cores** | Execute parallel tasks | Multicore performance |
| **Pipelines** | Overlap instruction execution; parallel arithmetic at the register level | Improved throughput |
| **Vector Units** | SIMD parallel operations | Boost ML and math performance |
| **Caches** | Reduce latency | Faster memory access |
| **Branch Prediction** | Minimize stalls | Higher efficiency |

While CPUs handle **control-heavy and sequential** logic efficiently, GPUs dominate in **massively parallel** computation tasks like deep learning.

#### GPUs and Other Accelerators

##### 1. Role of GPUs in Deep Learning

It’s no exaggeration to say that **deep learning owes its success to GPUs**. GPU evolution paralleled the rise of modern neural networks: their massive parallelism and floating-point efficiency made training on large datasets feasible.

Unlike CPUs, GPUs are optimized for **massive parallel computation**, making them ideal for:
- **Training neural networks** (large matrix multiplications, gradient accumulation)
- **Inference** (forward propagation with minimal intermediate storage)

##### 2. Precision and Efficiency

Training must maintain numerical stability while storing gradients, so **mixed precision** is used:
- **FP16 (half-precision)** for efficiency and memory savings  
- **FP32 (single-precision)** for minimal numerical error  

Modern GPUs (e.g., NVIDIA T4) balance both to handle training and inference efficiently using specialized units (Tensor Cores).

##### 3. Vectorization and Core Scaling

##### 3.1 Vectorization Beyond CPUs
Traditional CPUs perform scalar or short-vector operations (e.g., SIMD).  
GPUs extend this by:
- Performing **dozens of operations in parallel** per core.
- Handling **matrix operations** rather than just vector arithmetic.

For instance, NVIDIA Turing GPUs allow **16 floating-point operations per vector simultaneously**.

##### 3.2 Adding More Cores
Instead of a few powerful cores like CPUs, GPUs scale horizontally with **thousands of smaller cores**, each performing simple operations in parallel.  
This design massively boosts throughput for matrix-heavy computations like convolutions or attention mechanisms.

##### 4. GPU Architecture Overview

Each modern GPU contains:
- Multiple **Streaming Multiprocessors (SMs)**: independent blocks of parallel execution.  
- Each SM includes **integer units**, **floating-point units**, and **Tensor Cores**.  
- SMs are grouped into **Graphics Processing Clusters (GPCs)**, which together form the GPU die.

**Example:**
NVIDIA’s **Turing TU102** GPU (e.g., RTX 2080 Ti) consists of:
- **12 streaming multiprocessors per GPC**
- **Shared L2 cache and memory channels**
- **Flexible modular design**: blocks can be enabled/disabled for yield or thermal reasons.

##### 5. Tensor Cores

##### What They Are
Tensor Cores are **specialized hardware units** that accelerate matrix multiplications, the heart of deep learning.  
They execute small dense matrix operations (e.g., **4×4**, **8×8**, **16×16**) extremely efficiently.

##### Purpose
Tensor Cores are optimized for:
- **Mixed precision arithmetic** (FP16 input, FP32 accumulation)
- **High-throughput training and inference**
- **Deep learning software** (cuDNN and frameworks such as PyTorch) that uses them automatically.
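
Why accumulation is kept in FP32 can be demonstrated without a GPU. The sketch below emulates rounding to half and single precision with the standard `struct` module (format codes `"e"` and `"f"`) and sums the same FP16 inputs with two different accumulator precisions. It is a software emulation for illustration only, not a model of how Tensor Cores are implemented.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, round_fn):
    """Sum the values, rounding the running total after every addition."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


inputs = [to_fp16(0.001)] * 10_000        # FP16 inputs
reference = inputs[0] * len(inputs)       # the sum without accumulator rounding

fp16_sum = accumulate(inputs, to_fp16)    # FP16 accumulator
fp32_sum = accumulate(inputs, to_fp32)    # FP32 accumulator

print(f"reference: {reference:.4f}, FP16: {fp16_sum:.4f}, FP32: {fp32_sum:.4f}")
```

Once the running total grows large relative to the addend, the FP16 accumulator can no longer represent the small increments and stops growing, while the FP32 accumulator stays close to the reference.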

**Comparison Example:**
| Operation Type | Hardware | Efficiency |
|----------------|-----------|-------------|
| Scalar ops | CPU | Low |
| Vector ops | GPU SM | High |
| Matrix ops | Tensor Core | Extremely high |

##### 6. Practical Implications for Deep Learning

##### Advantages:
- **Massive parallelism** → ideal for batch matrix operations.  
- **Efficient mixed precision** → reduced memory use and higher speed.  
- **Dedicated hardware (Tensor Cores)** → large speed-ups for neural network workloads.

##### Limitations:
- GPUs are **less efficient for serial or branching logic**.  
- Limited **on-board (device) memory** can restrict model size.  
- Performance depends heavily on **memory bandwidth** and **framework optimization**.

##### ✅ Key Takeaways

| Feature | CPU | GPU | Tensor Core |
|----------|-----|-----|--------------|
| Parallelism | Few cores | Thousands of cores | Specialized matrix units |
| Optimized for | Sequential logic | Vector/matrix operations | Matrix multiplications |
| Precision | FP32 / FP64 | FP16 / FP32 | FP16 / FP32 mixed |
| Typical use | Control-heavy workloads | Deep learning, rendering | Neural network training |
| Example | Intel Core i9 | NVIDIA Turing TU102 | NVIDIA Tensor Core in T4/V100 |

##### In short:
GPUs and other accelerators (like TPUs) are built for **parallel math**, not general computation.  
They revolutionized deep learning by enabling large-scale training with **billions of parameters** through:
- **Parallel execution of small operations**
- **Tensor Core acceleration**
- **Mixed-precision computation**

These architectural innovations make GPUs indispensable for both **AI research** and **real-world model deployment**.


#### Networks and Buses

##### 1. Overview

When a single device (like a CPU or GPU) is not enough for computation, **data transfer** between devices becomes essential.  
This is where **networks and buses** play a critical role in synchronizing and transferring information efficiently.

Key trade-offs include:
- **Bandwidth**: how much data can be transferred per second.
- **Latency**: how long it takes before a transfer starts.
- **Cost, distance, and flexibility**: these affect scalability and architecture.

While **WiFi** is flexible and cheap, it’s unsuitable for deep learning due to low bandwidth and high latency.  
Instead, we focus on **high-performance interconnects** like PCIe, Ethernet, NVLink, and switches.

##### 2. PCIe (Peripheral Component Interconnect Express)

- **Purpose:** High-bandwidth, low-latency communication between CPU, GPU, and other peripherals.  
- **Speed:** Up to **32 GB/s** for a 16-lane PCIe 4.0 slot (about 2 GB/s per lane).  
- **Latency:** Extremely low (~5 µs).  
- **Limitations:**  
  - Limited number of lanes per processor (e.g., 128 for AMD EPYC, 48 for Intel Xeon, 16 for Core i9).  
  - GPUs often occupy 16 lanes each, limiting how many can be connected to the CPU.  
  - Large bulk transfers are preferred to minimize packet overhead.
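
The preference for bulk transfers follows from the fixed latency that every transfer pays. The sketch below combines the ~5 µs latency and 32 GB/s bandwidth quoted above in a simple model (transfer time = latency + size / bandwidth); the transfer sizes are hypothetical, and protocol overhead is ignored.

```python
LATENCY_S = 5e-6         # fixed cost per transfer, about 5 microseconds
BANDWIDTH_B_S = 32e9     # 32 GB/s for a 16-lane PCIe 4.0 slot


def effective_bandwidth(size_bytes):
    """Achieved bytes per second once the fixed latency is included."""
    transfer_time = LATENCY_S + size_bytes / BANDWIDTH_B_S
    return size_bytes / transfer_time


small = effective_bandwidth(4 * 1024)          # one 4 KiB transfer
large = effective_bandwidth(64 * 1024 ** 2)    # one 64 MiB transfer

print(f"4 KiB: {small / 1e9:.2f} GB/s, 64 MiB: {large / 1e9:.2f} GB/s")
```

In this model a 4 KiB transfer achieves under 1 GB/s because latency dominates, while a 64 MiB transfer comes close to the full 32 GB/s.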

##### 3. Ethernet

- **Most common** networking method for connecting computers and servers.  
- **Pros:**  
  - Cheap, resilient, and supports long-distance connections.  
  - Widely used in **data centers and cloud environments** (e.g., AWS, Azure).  
- **Bandwidth:**  
  - Consumer-grade: **1 Gbit/s**  
  - High-end / cloud: **10–100 Gbit/s**
- **Overhead:**  
  - Ethernet requires protocols like **TCP/IP or UDP**, which add extra latency.  
  - Typically connects two devices (computer ↔ switch).

##### 4. Switches

- **Function:** Allow multiple devices to communicate concurrently, enabling **many-to-many connections**.  
- Example:  
  - An Ethernet switch might connect **40+ servers** at high bandwidth.  
- **In deep learning clusters:**  
  - Switches can also link multiple GPUs or nodes together (e.g., distributed training setups).  
- **PCIe Switching:**  
  - PCIe itself can also be switched to connect multiple GPUs to a single CPU host.

##### 5. NVLink

- **Purpose:** NVIDIA’s high-speed alternative to PCIe for **GPU-to-GPU** and **GPU-to-CPU** communication.  
- **Performance:**  
  - Up to **300 GB/s** aggregate bandwidth across all links on server GPUs (significantly higher than PCIe).  
  - **Server GPUs** (e.g., Volta V100) have 6 NVLink connections.  
  - **Consumer GPUs** (e.g., RTX 2080 Ti) have only one NVLink (100 GB/s).  
- **Use Case:**  
  - Used with **NCCL (NVIDIA Collective Communication Library)** for fast inter-GPU communication in distributed deep learning.

##### ✅ Key Takeaways

| Technology | Typical Bandwidth | Latency | Purpose | Use Case |
|-------------|------------------|----------|----------|-----------|
| **PCIe 4.0** | Up to 32 GB/s per 16-lane slot | ~5 µs | CPU–GPU / storage interconnect | Internal device communication |
| **Ethernet** | 1–100 Gbit/s | Higher | Long-distance connectivity | Data centers, cloud |
| **Switches** | Varies | Depends on setup | Multi-device connectivity | Distributed GPU/CPU clusters |
| **NVLink** | Up to 300 GB/s (aggregate) | Very low | GPU–GPU / GPU–CPU | High-speed deep learning clusters |


=====================================Training on Multiple GPUs===========================================

#### Training on Multiple GPUs

##### 1. Overview

When deep learning models grow large, with millions or even billions of parameters, a **single GPU** often becomes insufficient due to **memory limitations** or **computation bottlenecks**.  
To overcome this, we distribute the model or data across **multiple GPUs**.  

This section explains how training is split across GPUs using **model parallelism** and **data parallelism**.

##### 2. Splitting the Problem

##### 2.1 Motivation
Large models (e.g., ResNet, BERT, GPT) may not fit entirely into one GPU’s memory, especially at large batch sizes.  
By dividing the model across multiple GPUs, we can:
- Reduce memory load per GPU.
- Train larger networks efficiently.
- Parallelize computation to shorten training time.
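
A back-of-the-envelope calculation illustrates the memory argument. The sketch below assumes float32 values and three copies per parameter (weight, gradient, and one optimizer buffer such as momentum); it ignores activations and assumes the model can be split perfectly evenly. Both the function name and the 1-billion-parameter model are hypothetical.

```python
def training_memory_gb(num_params, bytes_per_value=4, copies=3):
    """Approximate training memory in GB (1 GB = 1e9 bytes).

    copies=3 assumes weights + gradients + one optimizer buffer,
    all stored as float32. Activations are not counted.
    """
    return num_params * bytes_per_value * copies / 1e9

num_params = 1_000_000_000  # hypothetical 1-billion-parameter model
total_gb = training_memory_gb(num_params)

for num_gpus in (1, 2, 4, 8):
    # Idealized even split of the model across GPUs
    print(f"{num_gpus} GPU(s): {total_gb / num_gpus:.1f} GB per GPU")
```

Under these assumptions the model needs about 12 GB before any activations are stored, so splitting it across devices quickly becomes necessary as parameter counts grow.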

##### 3. Model Parallelism

##### Concept
In **model parallelism**, different parts (layers or operations) of the model are placed on different GPUs.  
Each GPU computes only a portion of the forward and backward passes.

**Example:**
- GPU 1 handles the first half of the network layers.
- GPU 2 handles the later layers.
- Activations are passed between GPUs during the forward pass, and gradients are passed back during the backward pass.

##### Advantages
- Enables training of **very large models** that exceed single-GPU memory.
- Efficient for architectures with large layers (e.g., Transformers).

##### Disadvantages
- Requires frequent **synchronization** between GPUs.
- Introduces **communication overhead** whenever activations and gradients move between GPUs.
- Imbalanced workloads can cause **idle GPU time** (known as "pipeline bubbles" in pipeline parallelism).

##### Types of Model Parallelism
| Type | Description |
|------|--------------|
| **Layer Partitioning** | Different layers assigned to different GPUs. |
| **Tensor Partitioning** | Splits large matrix/tensor operations across GPUs. |
| **Pipeline Parallelism** | Stages of the model processed in a streaming fashion: while GPU 1 handles batch *n*, GPU 2 processes batch *n-1*. |
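
The streaming behaviour of pipeline parallelism, and the idle time it leaves at the start and end, can be sketched with a small schedule simulation. This is a simplified model that covers only the forward pass and assumes every stage takes exactly one time step per batch; the function names are illustrative.

```python
def pipeline_schedule(num_stages, num_batches):
    """Return grid[t][s]: index of the batch that stage s processes at
    time step t, or None if the stage is idle at that step."""
    num_steps = num_batches + num_stages - 1
    grid = []
    for t in range(num_steps):
        row = []
        for s in range(num_stages):
            b = t - s  # batch b reaches stage s at step b + s
            row.append(b if 0 <= b < num_batches else None)
        grid.append(row)
    return grid

def idle_fraction(grid):
    """Share of (step, stage) slots in which a stage has nothing to do."""
    slots = [slot for row in grid for slot in row]
    return sum(slot is None for slot in slots) / len(slots)

grid = pipeline_schedule(num_stages=2, num_batches=4)
for t, row in enumerate(grid):
    print(t, ["idle" if b is None else f"batch {b}" for b in row])
print(f"idle fraction: {idle_fraction(grid):.2f}")
```

In this model the idle share is `(stages - 1) / (batches + stages - 1)`, so feeding more batches through the pipeline shrinks the bubbles, while adding stages enlarges them.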

##### 4. Data Parallelism

##### Concept
In **data parallelism**, each GPU holds a **full copy of the model** but trains on **different mini-batches** of data simultaneously.

Steps:
1. Each GPU performs a **forward and backward pass** on its batch.  
2. The **gradients** are averaged (synchronized) across GPUs.  
3. Model parameters are updated consistently across all GPUs.

**Example:**  
With 4 GPUs, each processes ¼ of every minibatch per step, then synchronizes its gradients with the others.
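
The three steps above can be sketched without any GPU at all. The toy example below fits `y = w * x` with a squared loss, splits one minibatch across four simulated workers, and checks that the aggregated gradient matches the gradient computed on the whole minibatch. The data, learning rate, and helper name are illustrative assumptions.

```python
def grad_sum(w, xs, ys):
    """Sum over a shard of d/dw (w * x - y)^2 = 2 * (w * x - y) * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # targets generated from y = 2x
w, lr, num_workers = 0.0, 0.01, 4

# Split the minibatch evenly: each worker gets a contiguous shard
shard = len(xs) // num_workers
x_shards = [xs[i * shard:(i + 1) * shard] for i in range(num_workers)]
y_shards = [ys[i * shard:(i + 1) * shard] for i in range(num_workers)]

# Step 1: every worker computes the gradient on its own shard
local_grads = [grad_sum(w, xs_i, ys_i) for xs_i, ys_i in zip(x_shards, y_shards)]

# Step 2: aggregate across workers, then divide by the full batch size
avg_grad = sum(local_grads) / len(xs)

# Step 3: every worker applies the same update, so the copies stay identical
w_new = w - lr * avg_grad

full_batch_grad = grad_sum(w, xs, ys) / len(xs)
print(avg_grad, full_batch_grad, w_new)
```

Because the shard gradients add up to the full-batch gradient, data parallelism changes where the work is done but not the update that results from it.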

##### Advantages
- Scales efficiently to **many GPUs**.
- Requires **minimal model changes**.
- Works well for large datasets with many examples.

##### Disadvantages
- Synchronization cost grows with GPU count.
- May lead to communication bottlenecks if network bandwidth is low.

##### 5. Combining Techniques

In modern deep learning:
- **Hybrid strategies** (e.g., model + data parallelism) are used for massive models.  
- Frameworks like **PyTorch Distributed**, **DeepSpeed**, and **Megatron-LM** support multi-level parallelism for large-scale training (billions of parameters).

##### ✅ Key Takeaways

| Approach | Model Copy | GPU Role | Best For | Main Limitation |
|-----------|-------------|----------|-----------|------------------|
| **Model Parallelism** | Split across GPUs | Each GPU handles part of the model | Very large models | High communication overhead |
| **Data Parallelism** | Full model per GPU | Each GPU handles different data | Large datasets | Gradient synchronization cost |
| **Pipeline Parallelism** | Split sequentially | GPUs work in stages | Deep sequential models | Idle time (pipeline bubbles) |

##### Summary:
- **Model parallelism** breaks up the **model**; **data parallelism** splits the **data**.  
- Synchronization and communication efficiency determine scalability.  
- Combining both enables training of models far beyond single-GPU limits, which is the foundation of **modern distributed deep learning**.

In [8]:
%matplotlib inline
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

In [9]:
# Initialize model parameters
scale = 0.01
W1 = torch.randn(size=(20, 1, 3, 3)) * scale
b1 = torch.zeros(20)
W2 = torch.randn(size=(50, 20, 5, 5)) * scale
b2 = torch.zeros(50)
W3 = torch.randn(size=(800, 128)) * scale
b3 = torch.zeros(128)
W4 = torch.randn(size=(128, 10)) * scale
b4 = torch.zeros(10)
params = [W1, b1, W2, b2, W3, b3, W4, b4]

# Define the model
def lenet(X, params):
    h1_conv = F.conv2d(input=X, weight=params[0], bias=params[1])
    h1_activation = F.relu(h1_conv)
    h1 = F.avg_pool2d(input=h1_activation, kernel_size=(2, 2), stride=(2, 2))
    h2_conv = F.conv2d(input=h1, weight=params[2], bias=params[3])
    h2_activation = F.relu(h2_conv)
    h2 = F.avg_pool2d(input=h2_activation, kernel_size=(2, 2), stride=(2, 2))
    h2 = h2.reshape(h2.shape[0], -1)
    h3_linear = torch.mm(h2, params[4]) + params[5]
    h3 = F.relu(h3_linear)
    y_hat = torch.mm(h3, params[6]) + params[7]
    return y_hat

# Cross-entropy loss function
loss = nn.CrossEntropyLoss(reduction='none')

In [10]:
# Data synchronization needs two helpers:
# 1. get_params: copy a list of parameters to a given device and attach gradients.
# 2. allreduce: sum tensors across multiple devices and broadcast the result back to all of them.

def get_params(params, device):
    new_params = [p.to(device) for p in params]
    for p in new_params:
        p.requires_grad_()
    return new_params

def allreduce(data):
    # Sum all tensors onto the first device
    for i in range(1, len(data)):
        data[0][:] += data[i].to(data[0].device)
    # Broadcast the summed result back to every other device
    for i in range(1, len(data)):
        data[i][:] = data[0].to(data[i].device)

In [11]:
new_params = get_params(params, d2l.try_gpu(0))
print('b1 weight:', new_params[1])
print('b1 grad:', new_params[1].grad)

b1 weight: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       requires_grad=True)
b1 grad: None


In [12]:
data = [torch.ones((1, 2), device=d2l.try_gpu(i)) * (i + 1) for i in range(2)]
print('before allreduce:\n', data[0], '\n', data[1])
allreduce(data)
print('after allreduce:\n', data[0], '\n', data[1])

before allreduce:
 tensor([[1., 1.]]) 
 tensor([[2., 2.]])
after allreduce:
 tensor([[3., 3.]]) 
 tensor([[3., 3.]])
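
The same sum-then-broadcast pattern can be reproduced on plain Python lists, which also makes it easy to count how many device-to-device copies it needs. This is a sketch only: the lists stand in for tensors on different devices, and `allreduce_lists` is a hypothetical helper rather than part of any library.

```python
def allreduce_lists(data):
    """Sum all lists into data[0], then copy the result back to the others.

    Works in place and returns the number of simulated transfers.
    """
    transfers = 0
    for i in range(1, len(data)):
        # Simulated copy of data[i] to the first device, followed by an add
        data[0][:] = [a + b for a, b in zip(data[0], data[i])]
        transfers += 1
    for i in range(1, len(data)):
        # Simulated copy of the summed result back to device i
        data[i][:] = data[0]
        transfers += 1
    return transfers

data = [[float(i + 1)] * 2 for i in range(4)]  # four simulated devices
print("before:", data)
num_transfers = allreduce_lists(data)
print("after: ", data)
print("transfers:", num_transfers)
```

With `n` devices this scheme performs `2 * (n - 1)` transfers, all of them routed through the first device, which is one way to see why synchronization cost grows with the number of GPUs.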


In [None]:
# Distribute a minibatch evenly across multiple GPUs (for example, with 100 training examples in the minibatch and 2 GPUs, send 50 examples to each GPU)

data = torch.arange(20).reshape(4, 5)
devices = [torch.device('cuda:0'), torch.device('cuda:1')]
split = nn.parallel.scatter(data, devices)
print('input :', data)
print('load into', devices)
print('output:', split)

# For later reuse we define a split_batch function that splits both data and labels.
def split_batch(X, y, devices):
    """Split `X` and `y` into multiple devices."""
    assert X.shape[0] == y.shape[0]
    return (nn.parallel.scatter(X, devices), # data
            nn.parallel.scatter(y, devices)) # label
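
For readers without two GPUs at hand, the splitting itself can be illustrated on nested lists. The sketch below cuts the rows into contiguous shards of size `ceil(n / k)`; that chunking rule is an assumption made for illustration, and `split_rows` is a hypothetical helper, not a PyTorch function.

```python
import math

def split_rows(rows, num_devices):
    """Split rows into at most num_devices contiguous shards of size ceil(n / k)."""
    if not rows:
        return []
    size = math.ceil(len(rows) / num_devices)
    return [rows[i:i + size] for i in range(0, len(rows), size)]

# Same values as torch.arange(20).reshape(4, 5), as a list of rows
data = [list(range(r * 5, r * 5 + 5)) for r in range(4)]
labels = [0, 1, 0, 1]  # hypothetical labels, one per row

X_shards = split_rows(data, 2)
y_shards = split_rows(labels, 2)
for X_shard, y_shard in zip(X_shards, y_shards):
    print(X_shard, y_shard)
```

Splitting features and labels with the same rule keeps every example paired with its label on the same shard, which is the property `split_batch` relies on.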

In [None]:
def train_batch(X, y, device_params, devices, lr):
    X_shards, y_shards = split_batch(X, y, devices)
    # Loss is calculated separately on each GPU
    ls = [loss(lenet(X_shard, device_W), y_shard).sum()
          for X_shard, y_shard, device_W in zip(X_shards, y_shards, device_params)]
    for l in ls:  # Backpropagation is performed separately on each GPU
        l.backward()
    # Sum all gradients from each GPU and broadcast them to all GPUs
    with torch.no_grad():
        for i in range(len(device_params[0])):
            allreduce([device_params[c][i].grad for c in range(len(devices))])
    # The model parameters are updated separately on each GPU
    for param in device_params:
        d2l.sgd(param, lr, X.shape[0]) # Divide by the full batch size, since gradients were summed over all shards

def train(num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    devices = [d2l.try_gpu(i) for i in range(num_gpus)]
    # Copy model parameters to `num_gpus` GPUs
    device_params = [get_params(params, d) for d in devices]
    num_epochs = 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    timer = d2l.Timer()
    for epoch in range(num_epochs):
        timer.start()
        for X, y in train_iter:
            # Perform multi-GPU training for a single minibatch
            train_batch(X, y, device_params, devices, lr)
            torch.cuda.synchronize()
        timer.stop()
        # Evaluate the model on GPU 0
        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(lambda x: lenet(x, device_params[0]), test_iter, devices[0]),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
          f'on {str(devices)}')

train(num_gpus=1, batch_size=256, lr=0.2)

train(num_gpus=2, batch_size=256, lr=0.2)

====================================Concise Implementation for Multiple GPUs=================================

In [13]:
import torch
from torch import nn
from d2l import torch as d2l

In [14]:
#@save
def resnet18(num_classes, in_channels=1):
    """A slightly modified ResNet-18 model."""
    def resnet_block(in_channels, out_channels, num_residuals,
                     first_block=False):
        blk = []
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.append(d2l.Residual(out_channels, use_1x1conv=True,
                                        strides=2))
            else:
                blk.append(d2l.Residual(out_channels))
        return nn.Sequential(*blk)

    # This model uses a smaller convolution kernel, stride, and padding and
    # removes the max-pooling layer
    net = nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU())
    net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
    net.add_module("resnet_block2", resnet_block(64, 128, 2))
    net.add_module("resnet_block3", resnet_block(128, 256, 2))
    net.add_module("resnet_block4", resnet_block(256, 512, 2))
    net.add_module("global_avg_pool", nn.AdaptiveAvgPool2d((1,1)))
    net.add_module("fc", nn.Sequential(nn.Flatten(),
                                       nn.Linear(512, num_classes)))
    return net

In [16]:
net = resnet18(10)
# Get a list of GPUs
devices = d2l.try_all_gpus()
# We will initialize the network inside the training loop



In [None]:
# For efficient parallelism, the training code needs to do four things:
# 1. Initialize the network parameters across all devices.
# 2. Divide each minibatch across all devices while iterating over the dataset.
# 3. Compute the loss and its gradient in parallel across devices.
# 4. Aggregate the gradients and update the parameters accordingly.

def train(net, num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    devices = [d2l.try_gpu(i) for i in range(num_gpus)]
    def init_weights(module):
        if type(module) in [nn.Linear, nn.Conv2d]:
            nn.init.normal_(module.weight, std=0.01)
    net.apply(init_weights)
    # Set the model on multiple GPUs
    net = nn.DataParallel(net, device_ids=devices)
    trainer = torch.optim.SGD(net.parameters(), lr)
    loss = nn.CrossEntropyLoss()
    timer, num_epochs = d2l.Timer(), 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    for epoch in range(num_epochs):
        net.train()
        timer.start()
        for X, y in train_iter:
            trainer.zero_grad()
            X, y = X.to(devices[0]), y.to(devices[0]) # Inputs go to the first device; DataParallel splits them across the GPUs
            l = loss(net(X), y)
            l.backward()
            trainer.step()
        timer.stop()
        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(net, test_iter),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
          f'on {str(devices)}')

In [None]:
train(net, num_gpus=1, batch_size=256, lr=0.1)

In [None]:
train(net, num_gpus=2, batch_size=512, lr=0.2)
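
The two calls above double the batch size and the learning rate together, so each GPU still sees 256 examples per step. This is consistent with the linear scaling heuristic for the learning rate, sketched below. The heuristic is a common rule of thumb rather than a guarantee, and `scaled_lr` is a hypothetical helper.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Scale the learning rate in proportion to the batch size (a heuristic)."""
    return base_lr * batch_size / base_batch_size

base_lr, per_gpu_batch = 0.1, 256
for num_gpus in (1, 2, 4):
    batch_size = per_gpu_batch * num_gpus  # keep the per-GPU batch fixed
    lr = scaled_lr(base_lr, per_gpu_batch, batch_size)
    print(f"num_gpus={num_gpus}, batch_size={batch_size}, lr={lr:.2f}")
```

In practice the scaled value is a starting point for tuning, and very large batches often need additional care such as learning-rate warmup.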

===================================Parameter Servers==========================================

![Data Parallel training](./Images//DataParallelTraining.png)
![Parameter sync strategies](./Images/ParameterSyncStragegies.png)