### 🧾 Code Cell 1: Move a tensor to the GPU

In [None]:
import torch

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create tensor on CPU
x = torch.tensor([1.0, 2.0, 3.0])
print(x.device)   # cpu

# Move tensor to the selected device (GPU if available)
x_gpu = x.to(device)
print(x_gpu.device)  # cuda:0 on a GPU machine, otherwise cpu
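
One detail that is easy to miss: for tensors, `.to(device)` returns a new tensor and leaves the original where it was. The sketch below illustrates this and also shows creating a tensor directly on the target device with the `device=` argument. The names `still_cpu`, `z`, and `same_device` are introduced here for illustration only.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.tensor([1.0, 2.0, 3.0])

# Calling .to() without using the result does not move x itself
x.to(device)
still_cpu = x.device.type
print(still_cpu)  # cpu

# Reassign to keep the moved copy
x = x.to(device)

# Alternatively, create the tensor on the target device from the start
z = torch.zeros(3, device=device)
same_device = x.device == z.device
print(same_device)
```
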


### 🧾 Code Cell 2: Demonstrating a Device Mismatch Error
This cell intentionally produces a device mismatch error to show what happens when tensors and models are placed on different devices (e.g., CPU vs GPU).

In [None]:
# ============================================================
# ⚠️ Part 1: Trigger a Device Mismatch Error
# ============================================================

import torch
import torch.nn as nn

print("✅ CUDA available:", torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple model and move it to GPU
model = nn.Linear(10, 2).to(device)
print("Model device:", next(model.parameters()).device)

# Create data still on CPU
x = torch.randn(4, 10)
print("Data device:", x.device)

# Try forward pass (💥 this will crash when the model is on the GPU!)
out = model(x)


### 🧾 Code Cell 3: Fixing the Device Mismatch Error
The fix for the error in Code Cell 2 is to place the input tensor on the same device as the model before the forward pass.
- **Device Placement**: Move tensors and the model to the same device (CPU or GPU) before calling the model.
- **Tips**: Print the `.device` of the model parameters and of the input to confirm that they match.

In [None]:
# ============================================================
# ✅ Part 2: Fix the Device Mismatch Error
# ============================================================
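
A minimal sketch of the fix, assuming the same `nn.Linear(10, 2)` model and `(4, 10)` input as in Part 1: look up the device the model's parameters live on, move the input there, and then run the forward pass. The variable name `model_device` is introduced here for illustration.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Same setup as Part 1: model on the selected device, data created on CPU
model = nn.Linear(10, 2).to(device)
x = torch.randn(4, 10)

# Fix: move the input to the device that holds the model's parameters
model_device = next(model.parameters()).device
x = x.to(model_device)

out = model(x)
print("Output shape:", out.shape)
print("Output device:", out.device)
```

Reading the device from the model, rather than reusing the global `device` variable, keeps the input and the model in sync even if the model is moved later.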


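### 🧾 Code Cell 4: Demonstrating a CUDA Out-of-Memory (OOM) Error
This cell shows what happens when a computation needs more GPU memory than is available. It intentionally allocates a very large tensor and multiplies it by itself, catches the resulting `RuntimeError`, prints the allocated and reserved memory, and then retries with a smaller tensor.
A related cause of memory growth, not shown in this cell, is keeping tensors that require gradients alive inside a loop without detaching them, so that the computation graph of every iteration stays in memory.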
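
The loop pattern mentioned above can be sketched on a small scale as follows. This is an illustrative CPU example under assumed names (`running_with_graph`, `running_detached`, and a toy `nn.Linear(10, 1)` model), not the code from this notebook: summing the loss tensor itself keeps every iteration's graph reachable, while summing `loss.item()` stores only a Python float.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
data = torch.randn(8, 10)
target = torch.randn(8, 1)

running_with_graph = 0.0  # will become a tensor that references every graph
running_detached = 0.0    # stays a plain Python float

for _ in range(3):
    loss = ((model(data) - target) ** 2).mean()

    # Problematic: the sum holds a reference to each iteration's graph
    running_with_graph = running_with_graph + loss

    # Safer for logging: .item() copies the value out as a float
    running_detached += loss.item()

print(type(running_with_graph), running_with_graph.grad_fn is not None)
print(type(running_detached))
```
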

In [None]:
# 🚀 Demonstration: CUDA Out of Memory (OOM) in PyTorch
import torch
import torch.nn as nn

print("✅ CUDA available:", torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 1️⃣: Print GPU info
!nvidia-smi

# Step 2️⃣: Create a large random tensor
try:
    print("\n--- Trying to allocate a large tensor on GPU ---")
    x = torch.randn(40000, 40000, device=device)  # This may trigger OOM
    y = x @ x
except RuntimeError as e:
    print("\n❌ RuntimeError caught:")
    print(e)

# Step 3️⃣: Check how much memory was used
print("\n--- Memory info after error ---")
allocated = torch.cuda.memory_allocated() / 1e6
reserved = torch.cuda.memory_reserved() / 1e6
print(f"Allocated: {allocated:.2f} MB, Reserved: {reserved:.2f} MB")

# Step 4️⃣: Fix by reducing the tensor size
torch.cuda.empty_cache()
print("\n--- Retrying with smaller tensor ---")
x = torch.randn(4000, 4000, device=device)
y = x @ x
print("✅ Success! Tensor shape:", y.shape)


✅ CUDA available: True
Sat Oct 18 07:16:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   47C    P8             10W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
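
A rough estimate helps explain why the first attempt may fail on this GPU while the retry succeeds. The sketch below counts only the bytes for `x` and for the result `y = x @ x` in float32 (the default dtype of `torch.randn`) and ignores any extra workspace the matrix multiplication may need. `matmul_bytes` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import torch

def matmul_bytes(n, dtype=torch.float32):
    """Bytes needed to hold an n x n matrix plus an n x n result."""
    bytes_per_element = torch.empty((), dtype=dtype).element_size()
    one_matrix = n * n * bytes_per_element
    return 2 * one_matrix  # x and y = x @ x

large = matmul_bytes(40000)
small = matmul_bytes(4000)

print(f"40000 x 40000: {large / 1e9:.1f} GB")  # 12.8 GB
print(f"4000 x 4000:   {small / 1e6:.1f} MB")  # 128.0 MB
```

The Tesla T4 in the output above reports 15360 MiB (about 16.1 GB) in total, so the larger case leaves little headroom; whether it actually fails depends on what else is using GPU memory at the time.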
                       

### 🧾 Code Cell 5: Vanishing Gradient Demonstration

This cell illustrates the vanishing gradient problem, where gradients become very small in the earlier layers of a deep network.
The code builds a 10-layer model with sigmoid activations, runs one forward and backward pass, and prints the mean absolute weight gradient of the last, a middle, and the first layer.
You'll see that the first (input) layer has the smallest gradient, showing that the signal fades as it travels backward through many layers.
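
One way to see why this happens: the derivative of the sigmoid, `s * (1 - s)`, never exceeds 0.25, so every sigmoid the gradient passes through on the way back scales it by at most that factor. The sketch below checks the bound numerically and multiplies it across nine sigmoid layers. It deliberately ignores the weight matrices, which is a simplifying assumption, and the names `max_slope` and `activation_bound` are illustrative.

```python
import torch

# Evaluate the sigmoid derivative s * (1 - s) on a grid of inputs
z = torch.linspace(-10, 10, 2001)
s = torch.sigmoid(z)
slope = s * (1 - s)

# The largest slope occurs around z = 0
max_slope = slope.max().item()
print(f"Max sigmoid derivative: {max_slope:.4f}")

# Upper bound on the activation-only factor after 9 sigmoid layers
activation_bound = max_slope ** 9
print(f"Bound after 9 sigmoids: {activation_bound:.2e}")
```
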

In [None]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# ============================================
# Example 1: Vanishing Gradient
# ============================================

print("VANISHING GRADIENT DEMO")
print("-" * 40)

# Deep network with sigmoid activation
class VanishingNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 10 linear layers; a sigmoid follows each of the first 9
        self.fc1 = nn.Linear(1, 5)
        self.fc2 = nn.Linear(5, 5)
        self.fc3 = nn.Linear(5, 5)
        self.fc4 = nn.Linear(5, 5)
        self.fc5 = nn.Linear(5, 5)
        self.fc6 = nn.Linear(5, 5)
        self.fc7 = nn.Linear(5, 5)
        self.fc8 = nn.Linear(5, 5)
        self.fc9 = nn.Linear(5, 5)
        self.fc10 = nn.Linear(5, 1)

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        x = torch.sigmoid(self.fc4(x))
        x = torch.sigmoid(self.fc5(x))
        x = torch.sigmoid(self.fc6(x))
        x = torch.sigmoid(self.fc7(x))
        x = torch.sigmoid(self.fc8(x))
        x = torch.sigmoid(self.fc9(x))
        x = self.fc10(x)
        return x

# Create model and compute gradients
model = VanishingNet()
x = torch.tensor([[1.0]])
y = torch.tensor([[0.5]])

# Forward and backward pass
output = model(x)
loss = (output - y) ** 2
loss.backward()

# Check gradients
print("Gradient in last layer (fc10):", model.fc10.weight.grad.abs().mean().item())
print("Gradient in middle layer (fc5):", model.fc5.weight.grad.abs().mean().item())
print("Gradient in first layer (fc1):", model.fc1.weight.grad.abs().mean().item())
print("\n⚠️ Notice: Gradients get smaller in earlier layers!")
print("This is the vanishing gradient problem with sigmoid.")


VANISHING GRADIENT DEMO
----------------------------------------
Gradient in last layer (fc10): 0.8905013799667358
Gradient in middle layer (fc5): 2.7121091989101842e-05
Gradient in first layer (fc1): 1.3565953693728261e-08

⚠️ Notice: Gradients get smaller in earlier layers!
This is the vanishing gradient problem with sigmoid.
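
To look at every layer rather than three, the same measurement can be written as a loop. The sketch below is a compact variant under stated assumptions: it rebuilds the same 1 → 5 → ... → 5 → 1 architecture with `nn.Sequential` instead of reusing `VanishingNet`, and it fixes the random seed so that a run is repeatable. The names `net` and `grad_by_layer` are introduced here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed for a repeatable run

# Same layer sizes as VanishingNet, sigmoid after each of the first 9 layers
sizes = [1] + [5] * 9 + [1]
layers = []
for i in range(10):
    layers.append(nn.Linear(sizes[i], sizes[i + 1]))
    if i < 9:
        layers.append(nn.Sigmoid())
net = nn.Sequential(*layers)

x = torch.tensor([[1.0]])
y = torch.tensor([[0.5]])

# One forward and backward pass with a squared-error loss
loss = ((net(x) - y) ** 2).mean()
loss.backward()

# Mean absolute weight gradient of each linear layer, first to last
grad_by_layer = [
    module.weight.grad.abs().mean().item()
    for module in net
    if isinstance(module, nn.Linear)
]

for idx, g in enumerate(grad_by_layer, start=1):
    print(f"fc{idx}: {g:.3e}")
```
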
