# Notebook 02: Device Query and GPU Properties
## Phase 1: Foundations - CUDA Architecture Basics

**Learning Objectives:**
- Query and display GPU device properties
- Understand GPU architecture components (SMs, cores, warps)
- Learn about the memory hierarchy and the size of each level
- Make informed decisions about kernel configuration
- Understand compute capability and its significance

## Concept: GPU Architecture

**NVIDIA GPU Architecture Components:**

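1. **Streaming Multiprocessors (SMs)**: The GPU's main processing units; each thread block runs entirely on one SM
2. **CUDA Cores**: Scalar arithmetic units within an SM
3. **Warp**: Group of 32 threads that an SM schedules and executes together
4. **Memory Hierarchy**:
   - Registers (fastest, private to each thread)
   - Shared Memory (fast, on-chip, shared by the threads of a block)
   - L1/L2 Cache (L1 is per-SM, L2 is shared by all SMs)
   - Global Memory (largest and slowest, visible to all threads)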
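
Because the hardware schedules whole warps, a block whose thread count is not a multiple of 32 still occupies a full final warp. The sketch below is plain host-side C++ that illustrates the arithmetic; the helper names `warpsPerBlock` and `idleLanes` are hypothetical and not part of the CUDA API.

```cpp
#include <cassert>

// Hypothetical helper: number of warps needed to hold one thread block.
// A partially filled last warp still counts as a whole warp.
int warpsPerBlock(int threadsPerBlock, int warpSize = 32) {
    assert(threadsPerBlock > 0 && warpSize > 0);
    return (threadsPerBlock + warpSize - 1) / warpSize;  // ceiling division
}

// Hypothetical helper: lanes in the last warp that have no thread assigned.
int idleLanes(int threadsPerBlock, int warpSize = 32) {
    return warpsPerBlock(threadsPerBlock, warpSize) * warpSize - threadsPerBlock;
}
```

For instance, a 100-thread block needs 4 warps and leaves 28 lanes idle, whereas a 256-thread block fills exactly 8 warps. This is one reason block sizes are usually chosen as multiples of 32.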

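**Compute Capability**: Version number identifying a GPU's architecture and feature set
- Format: Major.Minor (e.g., 7.5, 8.0, 8.6)
- The major number identifies the architecture generation; the minor number marks a revision within it
- Higher numbers generally mean a newer architecture with more features, but not necessarily more of every resource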
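
Code that depends on a feature usually checks for a minimum compute capability. A minimal sketch of that comparison follows, in plain host-side C++; `hasComputeCapability` is a hypothetical helper name, and in a real program the first two arguments would come from `prop.major` and `prop.minor`.

```cpp
#include <cassert>

// Hypothetical helper: true if a device with capability major.minor meets
// the minimum reqMajor.reqMinor. The major number is compared first and the
// minor number only breaks ties, so 9.0 satisfies a requirement of 8.6.
bool hasComputeCapability(int major, int minor, int reqMajor, int reqMinor) {
    assert(major >= 0 && minor >= 0);
    return major > reqMajor || (major == reqMajor && minor >= reqMinor);
}
```

Comparing the two numbers separately avoids the pitfalls of treating the version as a floating-point value or of checking the minor number on its own.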

## Example 1: Count Available Devices

In [None]:
%%cu
#include <stdio.h>

int main() {
    int deviceCount = 0;
    cudaError_t error = cudaGetDeviceCount(&deviceCount);
    
    if (error != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(error));
        return 1;
    }
    
    printf("Number of CUDA devices: %d\n\n", deviceCount);
    
    if (deviceCount == 0) {
        printf("No CUDA-capable devices found!\n");
        return 1;
    }
    
    return 0;
}

## Example 2: Basic Device Properties

In [None]:
%%cu
#include <stdio.h>

int main() {
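    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    
    for (int dev = 0; dev < deviceCount; dev++) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            // Skip devices whose properties cannot be read
            printf("Device %d: cudaGetDeviceProperties failed: %s\n",
                   dev, cudaGetErrorString(err));
            continue;
        }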
        
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute Capability: %d.%d\n", prop.major, prop.minor);
        printf("  Total Global Memory: %.2f GB\n", 
               prop.totalGlobalMem / 1024.0 / 1024.0 / 1024.0);
        printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("  Max Threads per Block: %d\n", prop.maxThreadsPerBlock);
        printf("\n");
    }
    
    return 0;
}

## Example 3: Comprehensive Device Information

In [None]:
%%cu
#include <stdio.h>

int main() {
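    int dev = 0;
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }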
    
    printf("=== Device %d: %s ===\n\n", dev, prop.name);
    
    // Compute capability
    printf("Compute Capability: %d.%d\n", prop.major, prop.minor);
    
    // Memory information
    printf("\n--- Memory Information ---\n");
    printf("Total Global Memory: %.2f GB\n", 
           prop.totalGlobalMem / 1024.0 / 1024.0 / 1024.0);
    printf("Shared Memory per Block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Constant Memory: %zu bytes\n", prop.totalConstMem);
    printf("Registers per Block: %d\n", prop.regsPerBlock);
    printf("L2 Cache Size: %d bytes\n", prop.l2CacheSize);
    
    // Execution configuration
    printf("\n--- Execution Configuration ---\n");
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
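    // FP32 cores per SM depend on the exact compute capability, not only the
    // major version (e.g. 8.0 has 64 but 8.6 has 128). This covers common
    // architectures; unknown ones fall back to 128, so treat it as approximate.
    int cc = prop.major * 10 + prop.minor;
    int coresPerSM = (cc == 60 || cc == 70 || cc == 72 || cc == 75 || cc == 80) ? 64
                   : (prop.major == 3) ? 192 : 128;
    printf("CUDA Cores per SM: ~%d (approx)\n", coresPerSM);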
    printf("Warp Size: %d threads\n", prop.warpSize);
    printf("Max Threads per Block: %d\n", prop.maxThreadsPerBlock);
    printf("Max Threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max Blocks per SM: %d\n", prop.maxBlocksPerMultiProcessor);
    
    // Grid and block dimensions
    printf("\n--- Grid and Block Limits ---\n");
    printf("Max Grid Size: (%d, %d, %d)\n", 
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Max Block Dimensions: (%d, %d, %d)\n", 
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    
    // Performance features
    printf("\n--- Performance Features ---\n");
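    // Query the clocks through device attributes; the cudaDeviceProp clock
    // fields are marked deprecated in recent CUDA toolkits. Values are in kHz.
    int clockKHz = 0, memClockKHz = 0;
    cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, dev);
    cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, dev);
    printf("Clock Rate: %.2f GHz\n", clockKHz / 1.0e6);
    printf("Memory Clock Rate: %.2f GHz\n", memClockKHz / 1.0e6);
    printf("Memory Bus Width: %d-bit\n", prop.memoryBusWidth);
    // Factor 2.0 assumes double-data-rate memory; kHz * bytes / 1e6 gives GB/s
    printf("Peak Memory Bandwidth: %.2f GB/s\n", 
           2.0 * memClockKHz * (prop.memoryBusWidth / 8) / 1.0e6);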
    
    // Capabilities
    printf("\n--- Capabilities ---\n");
    printf("Concurrent Kernels: %s\n", prop.concurrentKernels ? "Yes" : "No");
    printf("ECC Enabled: %s\n", prop.ECCEnabled ? "Yes" : "No");
    printf("Unified Addressing: %s\n", prop.unifiedAddressing ? "Yes" : "No");
    printf("Managed Memory: %s\n", prop.managedMemory ? "Yes" : "No");
    
    return 0;
}
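
The peak-bandwidth line in Example 3 packs several unit conversions into one expression. The sketch below spells them out in plain host-side C++; `peakBandwidthGBs` is a hypothetical helper name, and the factor of 2 is an assumption that the memory transfers data twice per clock (double data rate).

```cpp
#include <cassert>

// Hypothetical helper: theoretical peak memory bandwidth in GB/s.
//   memoryClockKHz: memory clock in kHz, the unit CUDA reports
//   busWidthBits:   memory bus width in bits
// Assumes double-data-rate memory (two transfers per clock cycle).
double peakBandwidthGBs(int memoryClockKHz, int busWidthBits) {
    assert(memoryClockKHz > 0 && busWidthBits > 0);
    double bytesPerTransfer = busWidthBits / 8.0;            // bits -> bytes
    double transfersPerSec = 2.0 * memoryClockKHz * 1.0e3;   // kHz -> Hz, x2 for DDR
    return transfersPerSec * bytesPerTransfer / 1.0e9;       // bytes/s -> GB/s
}
```

With a 5,001,000 kHz memory clock and a 256-bit bus, this evaluates to about 320 GB/s. Real kernels reach only a fraction of this figure, so it is best read as an upper bound.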

## Example 4: Selecting and Setting Active Device

In [None]:
%%cu
#include <stdio.h>

__global__ void identifyDevice() {
    printf("Hello from the active GPU device!\n");
}

int main() {
    int deviceCount = 0;  // stays 0 if the query below fails
    cudaGetDeviceCount(&deviceCount);
    printf("Total devices: %d\n\n", deviceCount);
    
    // Get current device
    int currentDevice;
    cudaGetDevice(&currentDevice);
    printf("Current active device: %d\n", currentDevice);
    
    // Get its properties
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, currentDevice);
    printf("Device name: %s\n\n", prop.name);
    
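    // Set device explicitly (useful in multi-GPU systems)
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        printf("cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Set active device to 0\n");
    
    // Launch kernel on active device
    identifyDevice<<<1, 1>>>();
    // Launch errors are reported by cudaGetLastError(); errors during
    // execution are reported by the synchronize call
    err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }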
    
    return 0;
}

## Example 5: Calculate Optimal Thread Configuration

In [None]:
%%cu
#include <stdio.h>

void printOptimalConfig(int dataSize) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    
    printf("\nOptimal Configuration for %d elements:\n", dataSize);
    printf("----------------------------------------\n");
    
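    // Common thread block sizes (all multiples of the 32-thread warp)
    int blockSizes[] = {128, 256, 512, 1024};
    int numSizes = sizeof(blockSizes) / sizeof(blockSizes[0]);
    
    for (int i = 0; i < numSizes; i++) {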
        int threadsPerBlock = blockSizes[i];
        
        if (threadsPerBlock > prop.maxThreadsPerBlock) {
            continue;
        }
        
        int numBlocks = (dataSize + threadsPerBlock - 1) / threadsPerBlock;
        int totalThreads = numBlocks * threadsPerBlock;
        int wastedThreads = totalThreads - dataSize;
        float efficiency = 100.0 * dataSize / totalThreads;
        
        printf("Threads/Block: %4d | Blocks: %6d | "
               "Total Threads: %8d | Wasted: %6d (%.1f%% efficient)\n",
               threadsPerBlock, numBlocks, totalThreads, wastedThreads, efficiency);
    }
}

int main() {
    printOptimalConfig(10000);
    printOptimalConfig(1000000);
    printOptimalConfig(1048576);  // Power of 2
    
    return 0;
}
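
Example 5 ranks block sizes only by how many threads go unused at the end of the data. Block size also determines how many blocks can be resident on an SM at once. The sketch below, in plain host-side C++, estimates that from the thread and block limits alone; the helper names are hypothetical, and register and shared-memory usage (which can lower the result further) are deliberately ignored.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: upper bound on blocks resident on one SM, using only
// the per-SM thread limit and the per-SM block limit.
int maxResidentBlocks(int threadsPerBlock, int maxThreadsPerSM, int maxBlocksPerSM) {
    assert(threadsPerBlock > 0);
    return std::min(maxThreadsPerSM / threadsPerBlock, maxBlocksPerSM);
}

// Hypothetical helper: fraction of the SM's thread slots that those
// resident blocks fill (0.0 to 1.0).
double threadOccupancy(int threadsPerBlock, int maxThreadsPerSM, int maxBlocksPerSM) {
    int blocks = maxResidentBlocks(threadsPerBlock, maxThreadsPerSM, maxBlocksPerSM);
    return static_cast<double>(blocks * threadsPerBlock) / maxThreadsPerSM;
}
```

With illustrative limits of 1536 threads and 16 blocks per SM, a 1024-thread block leaves a third of the thread slots empty (only one block fits), while 256-thread blocks fill all of them (six blocks fit). In a real program the limits would come from `prop.maxThreadsPerMultiProcessor` and `prop.maxBlocksPerMultiProcessor`, and the runtime's `cudaOccupancyMaxActiveBlocksPerMultiprocessor` also accounts for a kernel's register and shared-memory usage.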

## Practical Exercise

**Exercise 1:** Write a program that displays memory information for your GPU in a human-readable format (GB, MB, KB).

**Exercise 2:** Calculate the theoretical peak FLOPS (floating-point operations per second) of your GPU based on clock rate and core count. Hint: a fused multiply-add counts as two floating-point operations.

**Exercise 3:** Create a function that recommends the best block size for a given problem size.

**Exercise 4:** Compare the specifications of your GPU with NVIDIA's published specs online.

In [None]:
%%cu
// Your solution here
#include <stdio.h>

int main() {
    // TODO: Implement your solution
    
    return 0;
}

## Key Takeaways

1. **cudaGetDeviceCount()** returns the number of CUDA-capable GPUs
2. **cudaGetDeviceProperties()** retrieves detailed GPU information
3. **cudaSetDevice()** selects which GPU to use (important for multi-GPU systems)
4. **Compute capability** indicates the GPU architecture and available features
5. Understanding GPU limits (max threads per block, blocks per SM, memory sizes) is crucial for choosing launch configurations that perform well
6. Different GPUs have different capabilities, so always query before making assumptions
7. **Warp size** is 32 threads on all current NVIDIA GPUs; read `prop.warpSize` rather than hard-coding it

## Next Steps

In the next notebook, we'll learn how to:
- Perform vector addition on the GPU
- Allocate and manage GPU memory
- Transfer data between host and device
- Calculate proper thread indices

Continue to: **03_vector_add.ipynb**

## Notes

*Use this space to write your own notes and observations:*

---

My GPU specifications:
- Device Name: 
- Compute Capability: 
- Global Memory: 
- SM Count: 

---