
# **High-Performance Computing Benchmark: CPU vs. GPU**

**Objective:** Show how much slower a standard Python loop is than vectorized, JIT-compiled, and GPU-accelerated alternatives on the same numerical task.

**The Task:** Evaluate a trigonometric expression (`sin(x) * cos(x)`) for **10,000,000** data points and sum the results.

**The Methods:**
1.  **Pure Python:** Standard Interpreter Loop (Baseline)
2.  **NumPy:** CPU Vectorization (Standard Optimization)
3.  **Numba:** JIT Compilation (Advanced CPU Optimization)
4.  **CuPy:** GPU Acceleration (Parallel Processing on NVIDIA T4)

In [1]:
import time
import math
import numpy as np
import cupy as cp
from numba import jit, prange
from tqdm import tqdm  # Progress bar for the slow Python loop
from rich import print

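# Hardware check: count CUDA devices and read the first device's name
gpu_count = cp.cuda.runtime.getDeviceCount()
gpu_name = cp.cuda.runtime.getDeviceProperties(0)["name"].decode()
print("Environment Ready.")
print(f"GPU Detected: {gpu_count}x {gpu_name}")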

### 1. **Configuration**
We set the dataset size to **10 million** elements.

*Note: Trigonometric functions (sine/cosine) cost far more per element than simple arithmetic such as addition, so the workload stresses computation rather than loop bookkeeping alone.*

In [2]:
N_SIZE = 10_000_000
print(f"Dataset Size: {N_SIZE:,} elements")
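
To put the dataset size in perspective, the arithmetic below estimates how much memory the arrays need. It is a rough sketch, not a measurement: it assumes 4 bytes per element (`float32`, the dtype used in the NumPy and CuPy cells) and up to three temporary arrays for `sin(x)`, `cos(x)`, and their product. The names `BYTES_PER_FLOAT32` and `N_TEMPORARIES` are introduced here for illustration only.

```python
# Rough memory estimate for the benchmark arrays (a sketch, not a measurement).
N_SIZE = 10_000_000
BYTES_PER_FLOAT32 = 4   # float32, the dtype used by the NumPy and CuPy cells
N_TEMPORARIES = 3       # assumed upper bound: sin(x), cos(x), and their product

input_mb = N_SIZE * BYTES_PER_FLOAT32 / 1e6
peak_mb = input_mb * (1 + N_TEMPORARIES)

print(f"Input array:      {input_mb:.0f} MB")
print(f"Assumed peak use: {peak_mb:.0f} MB")
```

Under these assumptions the working set is far larger than a typical CPU cache, so the vectorized methods also depend on memory bandwidth, not only on arithmetic speed.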

### 2. **Baseline: Pure Python Loop**
We iterate through the 10 million items using a standard `for` loop.

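**Warning:** This is by far the slowest method. Every iteration pays interpreter overhead (bytecode dispatch, dynamic type checks, and creation of a new float object), and nothing is vectorized. The Global Interpreter Lock (GIL) is not the cause here, because the loop runs on a single thread. A progress bar is included to visualize the processing time; note that `tqdm` itself adds a small per-iteration cost to the measurement.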

In [3]:
def python_heavy_math(n):
    result = 0.0
    # tqdm displays a progress bar; it adds a small per-iteration overhead to the measured time
    for x in tqdm(range(n), desc="Processing Python Loop"):
        result += math.sin(x) * math.cos(x)
    return result

print(f"Running Pure Python implementation on {N_SIZE:,} items...")

t0 = time.perf_counter()
python_heavy_math(N_SIZE)
t1 = time.perf_counter()

time_python = t1 - t0
print(f"\nPython Execution Time: {time_python:.4f} seconds")

Processing Python Loop: 100%|██████████| 10000000/10000000 [00:03<00:00, 3165416.42it/s]
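
A single timed run can be noisy, and the progress bar adds its own cost. The sketch below shows one way to reduce both effects: it repeats the measurement and keeps the fastest run, using the same loop without `tqdm`. The helper names `best_of` and `small_loop`, the repeat count, and the reduced size of 100,000 items are assumptions chosen so the example finishes quickly.

```python
import math
import time

def best_of(func, *args, repeats=3):
    """Return the fastest wall-clock time (seconds) over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - t0)
    return min(timings)

def small_loop(n):
    """Same computation as the baseline, without the progress bar."""
    result = 0.0
    for x in range(n):
        result += math.sin(x) * math.cos(x)
    return result

elapsed = best_of(small_loop, 100_000)
print(f"Best of 3 runs for 100,000 items: {elapsed:.4f} s")
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since slower runs usually reflect interference from other processes rather than the code itself.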


### 3. **NumPy: Vectorization**
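We switch to NumPy, which runs the loop in optimized, compiled C code instead of the Python interpreter. While significantly faster, it still executes these element-wise operations on a single CPU core and allocates a temporary array for each intermediate result. The array uses `float32`, whereas the Python loop works in double precision, so the two sums agree only approximately.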

In [4]:
print(f"Running NumPy implementation...")

# Create Data
data_np = np.arange(N_SIZE, dtype=np.float32)

t0 = time.perf_counter()
# Vectorized Operation
np.sum(np.sin(data_np) * np.cos(data_np))
t1 = time.perf_counter()

time_numpy = t1 - t0
print(f"NumPy Execution Time: {time_numpy:.4f} seconds")
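
Before comparing speeds, it is worth confirming that the loop and the vectorized expression compute the same quantity. The following check is a minimal sketch: it uses a small size (`n_check`, an assumption chosen for speed) and `float64` so that NumPy matches the precision of Python floats.

```python
import math
import numpy as np

n_check = 100_000  # small size chosen for a quick check (an assumption)

# Reference: plain Python loop in double precision
loop_sum = 0.0
for x in range(n_check):
    loop_sum += math.sin(x) * math.cos(x)

# Vectorized version on the same inputs, also in double precision
data64 = np.arange(n_check, dtype=np.float64)
numpy_sum = float(np.sum(np.sin(data64) * np.cos(data64)))

print(f"Loop sum:  {loop_sum:.6f}")
print(f"NumPy sum: {numpy_sum:.6f}")
print("Agree:", math.isclose(loop_sum, numpy_sum, abs_tol=1e-6))
```

The comparison uses a tolerance because NumPy adds the terms in a different order than the loop, and floating-point addition is not associative.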

### 4. **Numba: JIT Compilation**
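Numba uses Just-In-Time (JIT) compilation to convert the Python function into machine code, so the loop runs at near-native speed on the CPU. With `parallel=True`, `prange` also splits the iterations across CPU cores and combines the per-thread partial sums.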

*Note: A warmup run is performed first to compile the code, ensuring the benchmark measures only execution time.*

In [5]:
@jit(nopython=True, parallel=True)
def numba_heavy_math(arr):
    res = 0.0
    for i in prange(len(arr)):
        res += np.sin(arr[i]) * np.cos(arr[i])
    return res

print(f"Running Numba implementation...")
# Warmup (Compilation)
numba_heavy_math(data_np)

# Benchmark
t0 = time.perf_counter()
numba_heavy_math(data_np)
t1 = time.perf_counter()

time_numba = t1 - t0
print(f"Numba Execution Time: {time_numba:.4f} seconds")
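
Conceptually, a parallel reduction splits the index range into chunks, accumulates one partial sum per chunk, and adds the partial sums at the end. The sketch below illustrates that idea in plain Python. It runs sequentially and is not how Numba is implemented internally; `chunked_sum`, `sequential_sum`, and the chunk count of 4 are hypothetical names and values used only for illustration.

```python
import math

def chunked_sum(n, n_chunks=4):
    """Sum sin(x)*cos(x) over range(n) via per-chunk partial sums."""
    chunk = max(1, math.ceil(n / n_chunks))
    partials = []
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        partial = 0.0
        for x in range(start, stop):   # each chunk could run on its own core
            partial += math.sin(x) * math.cos(x)
        partials.append(partial)
    return sum(partials)               # final combine step

def sequential_sum(n):
    """Reference: one accumulator over the whole range."""
    total = 0.0
    for x in range(n):
        total += math.sin(x) * math.cos(x)
    return total

n_demo = 10_000
print(math.isclose(chunked_sum(n_demo), sequential_sum(n_demo), abs_tol=1e-9))
```

Because the additions happen in a different order, the chunked result can differ from the sequential one in the last few bits, which is why the comparison uses a tolerance.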

### 5. **CuPy: GPU Acceleration**
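We create the data directly in GPU memory (VRAM) with `cp.arange` and execute the operation in parallel across thousands of CUDA cores. Because the array never lives in host memory, the timing below excludes any host-to-device transfer.

*Note: CuPy launches GPU kernels asynchronously, so `cp.cuda.Stream.null.synchronize()` is used to ensure the timer waits for the GPU to finish calculations.*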

In [6]:
print(f"Running CuPy implementation...")

# Create the data directly on the GPU (no host-to-device copy)
data_cp = cp.arange(N_SIZE, dtype=cp.float32)

# Warmup (compiles and caches the CUDA kernels)
cp.sum(cp.sin(data_cp) * cp.cos(data_cp))
cp.cuda.Stream.null.synchronize()

# Benchmark
t0 = time.perf_counter()
cp.sum(cp.sin(data_cp) * cp.cos(data_cp))
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()

time_cupy = t1 - t0
print(f"CuPy Execution Time: {time_cupy:.4f} seconds")
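
In many real workloads the data starts in host memory, so the copy to the GPU is part of the cost. The sketch below times the copy, the computation, and the retrieval of the scalar result together. It assumes CuPy and a CUDA device are available and skips the measurement otherwise; `time_with_transfer`, `GPU_OK`, and the size of 1,000,000 are names and values introduced here for illustration.

```python
import time

try:
    import numpy as np
    import cupy as cp
    GPU_OK = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy missing or no usable CUDA device
    GPU_OK = False

def time_with_transfer(n):
    """Time host-to-device copy + compute + copy of the scalar result back."""
    host_data = np.arange(n, dtype=np.float32)   # array starts in host RAM
    t0 = time.perf_counter()
    device_data = cp.asarray(host_data)          # host -> device copy
    total = cp.sum(cp.sin(device_data) * cp.cos(device_data))
    result = float(total)                        # copies the scalar back; waits for the GPU
    return result, time.perf_counter() - t0

if GPU_OK:
    _, elapsed = time_with_transfer(1_000_000)
    print(f"CuPy time including transfer: {elapsed:.4f} s")
else:
    print("No CUDA device available; skipping.")
```

The first call also includes kernel compilation, so a warmup call before measuring would be needed for a fair comparison with the compute-only timing.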

### 6. **Final Results Summary**

In [7]:
print(f"{'-'*40}")
print("BENCHMARK RESULTS (Lower is Better)")
print(f"{'-'*40}")
print(f"1. Pure Python: {time_python:.4f} s")
print(f"2. NumPy (CPU): {time_numpy:.4f} s")
print(f"3. Numba (CPU): {time_numba:.4f} s")
print(f"4. CuPy (GPU):  {time_cupy:.4f} s")
print(f"{'-'*40}")
print(f"SPEEDUP FACTOR:")
print(f"CuPy is {time_numpy / time_cupy:.1f}x faster than NumPy")
print(f"CuPy is {time_python / time_cupy:.1f}x faster than Python")
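
The speedup factors can also be computed for every method at once. The helper below is a small sketch; `speedups` is a hypothetical name, and the timings in `example_times` are placeholder values for illustration only, not measured results.

```python
def speedups(times, baseline):
    """Return {name: baseline_time / time} for every entry in `times`."""
    base = times[baseline]
    return {name: base / t for name, t in times.items() if t > 0}

# Placeholder values for illustration only; NOT measured results.
example_times = {"Pure Python": 8.0, "NumPy": 4.0, "Numba": 2.0, "CuPy": 1.0}

for name, factor in speedups(example_times, "Pure Python").items():
    print(f"{name:<12} {factor:5.1f}x vs Pure Python")
```

In this notebook the same helper could be called with the measured values, for example `speedups({"Pure Python": time_python, "NumPy": time_numpy, "Numba": time_numba, "CuPy": time_cupy}, "Pure Python")`.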
