In [5]:
import torch

print("torch.backends availability and settings:")

print("\nCPU backend:")
print(torch.backends.cpu)

print("\nCUDA backend:")
print(torch.backends.cuda.is_built())  # True if this PyTorch binary was compiled with CUDA support (a GPU may still be absent)

print("\ncuDNN backend:")
print(f"Enabled: {torch.backends.cudnn.enabled}")
print(f"Available: {torch.backends.cudnn.is_available()}")
print(f"Version: {torch.backends.cudnn.version()}")

print("\ncuSparseLt backend:")
print(torch.backends.cusparselt)

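# Note: torch.backends.mha configures the multi-head attention fast path;
# "MHA" stands for multi-head attention, not memory-efficient attention.
print("\nMHA (Memory Efficient Attention):")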
print(torch.backends.mha)

print("\nMPS (Metal Performance Shaders for Apple):")
if hasattr(torch.backends, 'mps'):
    print(f"Is built: {torch.backends.mps.is_built()}")
else:
    print("MPS not available")

print("\nMKL backend:")
print(torch.backends.mkl)

print("\nMKLDNN backend:")
print(f"Enabled: {torch.backends.mkldnn.enabled}")

print("\nNNPACK backend:")
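# Note: this prints the module object itself, not a boolean flag;
# torch.backends.nnpack.is_available() reports whether NNPACK support is built in.
print(f"Enabled: {torch.backends.nnpack}")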

print("\nOpenMP backend:")
print(torch.backends.openmp)



torch.backends availability and settings:

CPU backend:
<module 'torch.backends.cpu' from 'C:\\Users\\fadzw\\Downloads\\Trash\\test-torch\\venv\\Lib\\site-packages\\torch\\backends\\cpu\\__init__.py'>

CUDA backend:
False

cuDNN backend:
Enabled: True
Available: False
Version: None

cuSparseLt backend:
<module 'torch.backends.cusparselt' from 'C:\\Users\\fadzw\\Downloads\\Trash\\test-torch\\venv\\Lib\\site-packages\\torch\\backends\\cusparselt\\__init__.py'>

MHA (Memory Efficient Attention):
<module 'torch.backends.mha' from 'C:\\Users\\fadzw\\Downloads\\Trash\\test-torch\\venv\\Lib\\site-packages\\torch\\backends\\mha\\__init__.py'>

MPS (Metal Performance Shaders for Apple):
Is built: False

MKL backend:
<module 'torch.backends.mkl' from 'C:\\Users\\fadzw\\Downloads\\Trash\\test-torch\\venv\\Lib\\site-packages\\torch\\backends\\mkl\\__init__.py'>

MKLDNN backend:
Enabled: True

NNPACK backend:
Enabled: <module 'torch.backends.nnpack' from 'C:\\Users\\fadzw\\Downloads\\Trash\\test-to

In [4]:
import torch

cpu_capability = torch.backends.cpu.get_cpu_capability()

print(f'CPU capability: {cpu_capability}')
# Reports the CPU instruction set (e.g. AVX2) that PyTorch's kernels will use on this machine.

cuda_built = torch.backends.cuda.is_built()

print(f"Is PyTorch built with CUDA support? {cuda_built}")

if cuda_built and torch.cuda.is_available():
    print('before changing:')
    print(f'TF32 matmul allowed ? {torch.backends.cuda.matmul.allow_tf32}')

    # Disable TF32 matmul (trades speed for full FP32 precision)
    torch.backends.cuda.matmul.allow_tf32 = False
    print(f'tf32 matmul allowed after disabling {torch.backends.cuda.matmul.allow_tf32}')

    # Turn TF32 matmul back on (note: this sets True, it does not restore the earlier value)
    torch.backends.cuda.matmul.allow_tf32 = True
    print(f'TF32 matmul allowed after re-enabling {torch.backends.cuda.matmul.allow_tf32}')
else:
    print('CUDA is not built or not available')

CPU capability: AVX2
Is PyTorch built with CUDA support? False
CUDA is not built or not available


You're asking about a set of low-level PyTorch functions related to **Scaled Dot Product Attention (SDPA)**, especially different backend implementations for performance. These functions don't require CUDA to understand conceptually — so let's walk through them clearly.

---

## ✅ **What is Scaled Dot Product Attention (SDPA)?**

This is the **core mechanism** in transformers (e.g., BERT, GPT). It works like this:

```text
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
```

Where:

* `Q`: Query matrix
* `K`: Key matrix
* `V`: Value matrix
* `d_k`: Dimension of the key vectors (used to scale the dot products)
* `softmax`: applied row by row, it turns the scaled scores into attention weights that sum to 1

This operation tells the model how much focus (attention) to give to each token when processing sequences.
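As a rough check of the formula above, the sketch below computes attention by hand and compares it with PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`. The tensor shapes and the tolerance are arbitrary choices made for illustration.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy shapes chosen for illustration: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 4, 5, 8)
k = torch.randn(2, 4, 5, 8)
v = torch.randn(2, 4, 5, 8)

d_k = q.size(-1)
scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (2, 4, 5, 5) similarity scores
weights = torch.softmax(scores, dim=-1)            # each row sums to 1
manual = weights @ v                               # (2, 4, 5, 8) weighted sum of values

# PyTorch's built-in SDPA picks a backend internally
fused = F.scaled_dot_product_attention(q, k, v)

print("shapes:", tuple(manual.shape), tuple(fused.shape))
print("match:", torch.allclose(manual, fused, atol=1e-5))
```

Whichever backend PyTorch selects, the result should agree with the hand-written version up to floating-point rounding.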

---

## 🔥 What is **Flash Scaled Dot Product Attention**?

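FlashAttention is a **highly optimized, fused GPU kernel** implementation of exact SDPA, developed by Tri Dao and collaborators at Stanford's *HazyResearch* lab. PyTorch integrated it as one of the backends behind `scaled_dot_product_attention`, where it typically reduces both runtime and memory use, especially on long sequences and large models.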

Function:

```python
torch.backends.cuda.flash_sdp_enabled()
```

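* ✅ **Returns `True` if the FlashAttention backend is enabled** (PyTorch is allowed to select it)
* ❌ **Returns `False` if it has been disabled**, e.g. via `torch.backends.cuda.enable_flash_sdp(False)`
* The flag is global and does not probe hardware, so it can be `True` even on a machine without CUDA, where the kernel can never actually run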
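A minimal sketch of reading and toggling this flag is shown below. It uses `torch.backends.cuda.enable_flash_sdp`, the setter that pairs with `flash_sdp_enabled()`, and restores the original value at the end. The printed values depend on your build, so none are shown.

```python
import torch

# Read the global flag; this does not require a GPU.
was_enabled = torch.backends.cuda.flash_sdp_enabled()
print(f"Flash SDP permitted: {was_enabled}")

# Forbid the Flash backend, then confirm the flag changed.
torch.backends.cuda.enable_flash_sdp(False)
after_disable = torch.backends.cuda.flash_sdp_enabled()
print(f"After disabling: {after_disable}")

# Restore the original setting so later code is unaffected.
torch.backends.cuda.enable_flash_sdp(was_enabled)
```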

---

## 🧠 What is **Memory Efficient SDPA**?

A different kernel that uses **less memory** than the standard attention algorithm. It's especially useful for long sequences or large batches.

Function:

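```python
torch.backends.cuda.enable_mem_efficient_sdp(enabled: bool)
```

* **Enables/disables memory-efficient attention** in PyTorch. The companion `torch.backends.cuda.mem_efficient_sdp_enabled()` reports the current setting.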

---

## ➗ What is **Math SDPA**?

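This is the **reference math-based implementation**, composed from ordinary tensor operations rather than a single fused kernel like the Flash or memory-efficient backends. It's **accurate** and serves as the fallback when no fused kernel applies, but it is slower and uses more memory because it materializes the full attention matrix.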

Function:

```python
torch.backends.cuda.math_sdp_enabled()
```

* Checks whether the math-based backend is currently enabled (not whether a particular call actually used it).

Function:

```python
torch.backends.cuda.enable_math_sdp(enabled: bool)
```

* Manually turn it on or off.

---

## 🎯 What is **FP16/BF16 reduction in math SDPA**?

These are **reduced-precision floating point formats**:

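* **FP16**: 16-bit float with 5 exponent bits and 10 mantissa bits
* **BF16**: 16-bit float with 8 exponent bits and 7 mantissa bits (the same range as FP32 but lower precision, which tends to help training stability)

Reduction refers to **summing over elements** (as in softmax or matmul). Accumulating those sums in FP16/BF16 instead of FP32 reduces memory use and speeds up computation, but can cause **numerical instability** (rounding error, and overflow in FP16) if not handled properly.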
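To make the precision risk concrete, here is a small pure-Python sketch that emulates a running sum kept in FP16 and in BF16. The helpers `to_fp16` and `to_bf16` are hypothetical, written only for this illustration; they are not PyTorch APIs, and real kernels accumulate differently.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round to bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


values = [0.01] * 4096  # exact sum is about 40.96

acc_full = sum(values)  # double-precision accumulator
acc_fp16 = 0.0
acc_bf16 = 0.0
for val in values:
    # Round after every addition, as a low-precision accumulator would.
    acc_fp16 = to_fp16(acc_fp16 + to_fp16(val))
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(val))

print(f"full precision: {acc_full:.4f}")
print(f"fp16 accumulator: {acc_fp16:.4f}")
print(f"bf16 accumulator: {acc_bf16:.4f}")
```

Both low-precision accumulators stall once the running total grows large enough that adding 0.01 rounds back to the same value. BF16 stalls earlier because it has fewer mantissa bits.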

Function:

```python
torch.backends.cuda.fp16_bf16_reduction_math_sdp_allowed()
```

* Returns whether this fast-reduction path is allowed in the math version of SDPA.

Function:

```python
torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(enabled: bool)
```

* Enables/disables this behavior.

---

### 🔁 Summary

| Function                                      | Meaning                                                     |
| --------------------------------------------- | ----------------------------------------------------------- |
| `flash_sdp_enabled()`                         | Is the FlashAttention backend enabled (allowed to be used)  |
| `enable_mem_efficient_sdp(enabled)`           | Enable/disable the memory-efficient SDPA backend            |
| `math_sdp_enabled()`                          | Is the basic math implementation enabled                    |
| `enable_math_sdp(enabled)`                    | Turn math-based SDPA on/off                                 |
| `fp16_bf16_reduction_math_sdp_allowed()`      | Can math SDPA use reduced-precision sums                    |
| `allow_fp16_bf16_reduction_math_sdp(enabled)` | Enable/disable fp16/bf16 reduction                          |

---

These functions read and write global flags, so they can be called even without CUDA. A return value of `True` only means a backend is permitted, not that it can run on your hardware. They still give you insight into the **backend strategy PyTorch is using for attention layers**.




---

## 🔁 **What is cuDNN Scaled Dot Product Attention?**

### ✅ `torch.backends.cuda.cudnn_sdp_enabled()`

* Returns `True` if **cuDNN’s implementation** of **Scaled Dot Product Attention (SDPA)** is enabled, meaning PyTorch is allowed to select it.

### ✅ `torch.backends.cuda.enable_cudnn_sdp(enabled: bool)`

* Enables or disables the use of **cuDNN's SDPA** kernel.

### 🧠 What is cuDNN SDPA?

* **cuDNN (CUDA Deep Neural Network Library)** is NVIDIA's GPU-accelerated library for deep learning primitives.
* Starting with cuDNN v8.9, it includes a **fused attention kernel** (scaled dot-product attention as a single optimized GPU kernel).
* This offers a balance of **performance, memory usage, and precision**.

### 🚀 Use Cases

* cuDNN SDPA is used **automatically** by PyTorch when:

  * A compatible GPU is present (usually recent Ampere/Hopper architecture)
  * Input dimensions and settings allow it
* Works well for:

  * **Short-to-medium sequence lengths**
  * Use in models like **BERT**, **T5**, etc.
  * **Inference or training** with decent batch sizes

---

## 📜 Are there any research papers on cuDNN SDPA?

* **No standalone research paper** on cuDNN SDPA — it's **proprietary** and part of NVIDIA's **cuDNN documentation** and release notes:

  * You can check [NVIDIA cuDNN Release Notes](https://docs.nvidia.com/deeplearning/cudnn/release-notes/index.html)
  * Or [cuDNN Developer Guide](https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html)

---

## 🔥 What is FlashAttention?

FlashAttention is a **high-performance attention kernel** developed by **HazyResearch** (Stanford).

### 📘 Paper:

> **FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness**
> Tri Dao et al., 2022
> [📄 Read it here](https://arxiv.org/abs/2205.14135)

### 🧠 Key Ideas:

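* Traditional attention is **memory bound**: most of the time goes to reading and writing the large score matrix in GPU high-bandwidth memory (HBM), not to arithmetic
* FlashAttention uses **tiling and fused kernels** to:

  * Avoid materializing the full attention matrix in HBM
  * Reduce memory usage from quadratic to linear in sequence length
  * Increase speed by processing blocks of Q, K, and V in fast on-chip memory (registers/shared memory)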
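The tiling idea rests on an "online softmax": keys and values are processed block by block while only a running maximum, a running denominator, and a running output are kept. The pure-Python sketch below shows that bookkeeping for a single query row. It illustrates the numerical trick only, not the actual GPU kernel; the function names and the block size are made up for this example.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: compute every score first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    denom = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / denom for j in range(dim)]


def tiled_attention_row(q, keys, values, block=2):
    """Process keys/values in blocks, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[j] for e, v in zip(exps, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 0.25, 2.0]
keys = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 0.5, 0.0], [0.0, 0.3, -0.7, 1.2],
        [2.0, 0.1, 0.1, -0.5], [-0.4, 0.9, 0.6, 0.8]]
values = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [0.0, 2.0, 0.0],
          [3.0, -1.0, 1.0], [1.0, 1.0, 1.0]]

ref = naive_attention_row(q, keys, values)
tiled = tiled_attention_row(q, keys, values, block=2)
print(ref)
print(tiled)
```

Both functions produce the same row up to rounding, but the tiled version never holds more than one block of scores at a time.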

### 🧪 Use Cases:

* Ideal for **long sequences** and **large models** (like LLMs)
* Particularly useful during **training**

---

## ✅ Check if PyTorch is Built with FlashAttention

### `torch.backends.cuda.is_flash_attention_available()`

* Returns `True` if the FlashAttention backend was compiled into this PyTorch build and is available; it returns `False` in non-CUDA environments.

---

## ✅ Can FlashAttention Be Used for This Attention Call?

### `torch.backends.cuda.can_use_flash_attention(params: SDPAParams, debug=False)`

* This checks if the current inputs (shape, dtype, etc.) are **compatible** with FlashAttention.
* With `debug=True`, it emits warnings explaining why FlashAttention could not be used.
* If not, PyTorch will **fall back to other backends** (like cuDNN or math).

---

## 🔧 What is `SDPAParams`?

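This is a PyTorch internal struct (exposed as `torch.backends.cuda.SDPAParams`) that packages all inputs needed for **Scaled Dot Product Attention**. Its constructor takes the fields positionally, in this order:

```python
SDPAParams(
    query,      # Tensor
    key,        # Tensor
    value,      # Tensor
    attn_mask,  # Optional[Tensor]
    dropout,    # float, dropout probability
    is_causal,  # bool
)
```

Recent PyTorch releases add a trailing `enable_gqa: bool` argument, so check the signature for your installed version.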

### Why it exists:

* PyTorch provides **multiple SDPA backends** (Flash, cuDNN, memory-efficient, math).
* `SDPAParams` is used to let these backends inspect the inputs and decide if they can run the operation efficiently.

---

## 🔁 Summary Table

| Function                          | Purpose                                                |
| --------------------------------- | ------------------------------------------------------ |
| `cudnn_sdp_enabled()`             | Is cuDNN's SDPA enabled?                               |
| `enable_cudnn_sdp(enabled)`       | Turn cuDNN SDPA on/off                                 |
| `is_flash_attention_available()`  | Is FlashAttention compiled & usable?                   |
| `can_use_flash_attention(params)` | Can FlashAttention run with given tensors?             |
| `SDPAParams`                      | Struct for query, key, value, etc., passed to backends |

---



## ✅ `torch.backends.cuda.can_use_efficient_attention(params, debug=False)`

* This function checks if the **efficient attention backend** (also called **"memory-efficient attention"**) **can be used** for your specific input tensors and settings (passed via `SDPAParams`).

---

### 🧠 What is Efficient Attention?

**Efficient Attention** (aka **Memory-Efficient Attention**) is an optimized implementation that:

* Reduces memory usage by **avoiding explicit attention matrices**.
* Fuses operations to minimize intermediate memory.
* Is suitable for **large batch sizes** or **long sequences**.
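A quick back-of-the-envelope calculation shows why avoiding the explicit attention matrix matters. The batch size, head count, and sequence length below are illustrative assumptions, not figures from any particular model.

```python
# Size of the explicit attention-score matrix for one forward pass.
batch = 8
heads = 16
seq_len = 4096
bytes_per_element = 2  # FP16

# One (seq_len x seq_len) score matrix per head per batch element.
score_bytes = batch * heads * seq_len * seq_len * bytes_per_element
score_gib = score_bytes / 2**30

print(f"Explicit attention matrix: {score_gib:.1f} GiB")
```

With these numbers the score matrix alone takes 4 GiB, and doubling the sequence length quadruples it.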

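It was inspired by:

> **"Self-attention Does Not Need O(n²) Memory"** by Markus N. Rabe and Charles Staats (2021)
> (PyTorch's kernel derives from the memory-efficient attention in the xFormers library; it is related to FlashAttention but is a separate implementation)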

Used heavily in:

* Training **transformers**
* Large models where **GPU memory** is a limiting factor

---

## ✅ `torch.backends.cuda.can_use_cudnn_attention(params, debug=False)`

* Checks if the **cuDNN backend** can handle the current attention configuration.
* If not, PyTorch will **fall back** to another backend like Flash, math, or memory-efficient.

---

## ✅ `torch.backends.cuda.sdp_kernel(...)`

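This **context manager** is used to **enable/disable specific backends** for `scaled_dot_product_attention()`. Recent PyTorch releases deprecate it in favor of `torch.nn.attention.sdpa_kernel`, but it still illustrates the idea.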

```python
torch.backends.cuda.sdp_kernel(
    enable_flash=True,
    enable_math=True,
    enable_mem_efficient=True,
    enable_cudnn=True
)
```

You can use it like this:

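```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 2, 8, 16)  # (batch, heads, seq_len, head_dim)
with torch.backends.cuda.sdp_kernel(enable_flash=False):
    # This block will not use FlashAttention
    out = F.scaled_dot_product_attention(q, k, v)
```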

---

### 🔁 What Are the "Three" (actually **Four**) Backends?

PyTorch's SDPA system dynamically chooses between the following:

| Backend                                       | Description                                                                      |
| --------------------------------------------- | -------------------------------------------------------------------------------- |
| ✅ **FlashAttention**                          | Usually fastest on supported GPUs; best for long sequences and large batches.    |
| ✅ **Memory-Efficient** (efficient\_attention) | Uses less memory, suitable when GPU RAM is tight.                                |
| ✅ **cuDNN Attention**                         | Uses NVIDIA's cuDNN implementation; often a good general-purpose choice.         |
| ✅ **Math-based (unfused)**                    | Reference fallback built from standard ops; slower, used when the others can’t.  |

---

## 🔁 How PyTorch Chooses a Backend (Simplified):

1. If FlashAttention is available and inputs are compatible → use it
2. Else, if Memory-Efficient works → use it
3. Else, if cuDNN works → use it
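4. Else → fall back to math implementation

The exact priority order is an implementation detail and has varied between PyTorch releases, so treat this list as a rough guide.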
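The fall-through described in this list can be sketched in a few lines of plain Python. Everything here is hypothetical and for illustration only: `pick_backend`, the backend names, and the dictionaries are not PyTorch APIs, and the real dispatcher applies many more checks.

```python
def pick_backend(enabled, usable,
                 priority=("flash", "mem_efficient", "cudnn", "math")):
    """Return the first backend that is both enabled and usable for the inputs."""
    for name in priority:
        if enabled.get(name, False) and usable.get(name, False):
            return name
    raise RuntimeError("no SDPA backend is both enabled and usable")


# Hypothetical state: every backend is switched on by the user...
enabled = {"flash": True, "mem_efficient": True, "cudnn": True, "math": True}
# ...but on a CPU-only machine only the math path can actually run.
usable_cpu = {"flash": False, "mem_efficient": False, "cudnn": False, "math": True}
# On a suitable GPU with compatible inputs, Flash would qualify first.
usable_gpu = {"flash": True, "mem_efficient": True, "cudnn": True, "math": True}

choice_cpu = pick_backend(enabled, usable_cpu)
choice_gpu = pick_backend(enabled, usable_gpu)
print(choice_cpu, choice_gpu)
```

The sketch separates "enabled" (a user-controlled flag) from "usable" (a property of the hardware and inputs), which mirrors the difference between the `*_sdp_enabled()` functions and the `can_use_*` checks.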

---

## ✅ What is `SDPAParams`?

It's an internal struct passed to backend decision functions. Its fields, in constructor order (all positional; recent releases add a trailing `enable_gqa: bool`):

```python
SDPAParams(
    query,      # Tensor
    key,        # Tensor
    value,      # Tensor
    attn_mask,  # Optional[Tensor]
    dropout,    # float, dropout probability
    is_causal,  # bool
)
```

Used in:

* `can_use_flash_attention(...)`
* `can_use_efficient_attention(...)`
* `can_use_cudnn_attention(...)`

---

## 💡 Summary

* **FlashAttention** = Usually fastest, but GPU- and shape-sensitive.
* **EfficientAttention** = Uses less memory, typically slower than Flash, great fallback.
* **cuDNNAttention** = Good general purpose, fused, stable.
* **Math** = Slow fallback that works everywhere.

Use `sdp_kernel()` (or its replacement, `torch.nn.attention.sdpa_kernel`) to test or force which backend is used.

