# Notebook 01: Hello World from GPU
## Phase 1: Foundations - CUDA Architecture Basics

**Learning Objectives:**
- Understand the basic CUDA programming model
- Learn how to write and launch a simple CUDA kernel
- Understand the difference between host (CPU) and device (GPU) code
- Learn how to compile and execute CUDA programs
- Understand the kernel launch syntax `<<<grid, block>>>`

## Concept: CUDA Programming Model

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. It allows developers to use GPUs for general-purpose processing.

**Key Concepts:**
- **Host**: The CPU and its memory (RAM)
- **Device**: The GPU and its memory (VRAM)
- **Kernel**: A function that runs on the GPU
- **Thread**: The smallest unit of execution on the GPU; each thread runs its own instance of the kernel

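**Typical CUDA Program Structure** (for programs that process data; the Hello World examples in this notebook only need step 3):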
1. Allocate memory on GPU
2. Copy data from host to device
3. Launch kernel on GPU
4. Copy results back from device to host
5. Free GPU memory
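
The five steps above can be sketched as follows. This is a minimal illustration rather than part of the original notebook: the kernel `addOne`, the helper `runAddOne`, and the sample values are hypothetical names chosen for the example. Error checking is left out to keep the steps visible, and the sketch assumes the array fits in a single block (at most 1024 elements on current GPUs).

```cuda
#include <stdio.h>

// Hypothetical kernel: each thread increments one array element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1;
    }
}

// Hypothetical helper that walks through the five steps for an
// n-element host array (assumes n <= 1024 so one block is enough).
void runAddOne(int *hostData, int n) {
    int *deviceData = NULL;
    size_t bytes = n * sizeof(int);

    cudaMalloc((void **)&deviceData, bytes);                          // 1. allocate on GPU
    cudaMemcpy(deviceData, hostData, bytes, cudaMemcpyHostToDevice);  // 2. host -> device
    addOne<<<1, n>>>(deviceData, n);                                  // 3. launch kernel
    cudaMemcpy(hostData, deviceData, bytes, cudaMemcpyDeviceToHost);  // 4. device -> host
    cudaFree(deviceData);                                             // 5. free GPU memory
}

int main() {
    int values[4] = {10, 20, 30, 40};
    runAddOne(values, 4);
    for (int i = 0; i < 4; i++) {
        printf("%d ", values[i]);
    }
    printf("\n");
    return 0;
}
```

No explicit `cudaDeviceSynchronize()` is needed here because the blocking `cudaMemcpy` in step 4 waits for the kernel on the default stream to finish before copying.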

**Kernel Launch Syntax:**
```cuda
kernelFunction<<<numBlocks, threadsPerBlock>>>(arguments);
```
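
When a kernel needs at least one thread per data element, a common way to choose `numBlocks` is ceiling division by `threadsPerBlock`. The sketch below is illustrative only: `blocksFor` and `reportConfig` are hypothetical names, and the values 1000 and 256 are arbitrary. It relies on the built-in variables `gridDim.x` and `blockDim.x`, which hold the launch configuration inside a kernel.

```cuda
#include <stdio.h>

// Hypothetical helper: smallest number of blocks whose threads cover n elements.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical kernel: a single thread reports the launch configuration.
__global__ void reportConfig(int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        printf("n = %d, grid = %u blocks x %u threads\n",
               n, gridDim.x, blockDim.x);
    }
}

int main() {
    int n = 1000;
    int threadsPerBlock = 256;
    int numBlocks = blocksFor(n, threadsPerBlock);

    reportConfig<<<numBlocks, threadsPerBlock>>>(n);
    cudaDeviceSynchronize();
    return 0;
}
```

With these values the launch uses 4 blocks, giving 1024 threads for 1000 elements, so a real kernel would also need a bounds check such as `if (i < n)` to keep the extra threads idle.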

## Example 1: Simple Hello World Kernel

In [None]:
%%cu
#include <stdio.h>

// Kernel definition - runs on GPU
__global__ void helloFromGPU() {
    printf("Hello World from GPU!\n");
}

int main() {
    // Host code
    printf("Hello World from CPU!\n");
    
    // Launch kernel with 1 block and 1 thread
    helloFromGPU<<<1, 1>>>();
    
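    // Kernel launches are asynchronous: block until the GPU finishes.
    // This also flushes the device-side printf buffer so the output appears.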
    cudaDeviceSynchronize();
    
    return 0;
}

## Example 2: Multiple Threads Saying Hello

In [None]:
%%cu
#include <stdio.h>

__global__ void helloFromMultipleThreads() {
    printf("Hello from thread %d!\n", threadIdx.x);
}

int main() {
    printf("Launching kernel with 10 threads...\n");
    
    // Launch kernel with 1 block and 10 threads
    helloFromMultipleThreads<<<1, 10>>>();
    
    cudaDeviceSynchronize();
    
    return 0;
}

## Example 3: Multiple Blocks and Threads

In [None]:
%%cu
#include <stdio.h>

__global__ void helloFromBlocksAndThreads() {
    printf("Hello from block %d, thread %d!\n", blockIdx.x, threadIdx.x);
}

int main() {
    printf("Launching kernel with 3 blocks and 4 threads per block...\n");
    
    // Launch kernel with 3 blocks and 4 threads per block
    helloFromBlocksAndThreads<<<3, 4>>>();
    
    cudaDeviceSynchronize();
    
    return 0;
}
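
The block and thread indices printed in Example 3 can be combined into an index that is unique across the whole grid, which is how kernels usually map threads to data elements. The sketch below assumes a one-dimensional launch; `globalIndex` and `printGlobalIndex` are hypothetical names, and `blockDim.x` is the built-in variable holding the number of threads per block.

```cuda
#include <stdio.h>

// Hypothetical helper, callable from both host and device, so the
// arithmetic can also be checked on the CPU.
__host__ __device__ int globalIndex(int block, int threadsPerBlock, int thread) {
    return block * threadsPerBlock + thread;
}

// Hypothetical kernel: each thread prints its block-local and global index.
__global__ void printGlobalIndex() {
    int idx = globalIndex(blockIdx.x, blockDim.x, threadIdx.x);
    printf("block %u, thread %u -> global index %d\n",
           blockIdx.x, threadIdx.x, idx);
}

int main() {
    // Same configuration as Example 3: 3 blocks of 4 threads each
    printGlobalIndex<<<3, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

This launch prints 12 lines covering global indices 0 through 11. The order of the lines may differ between runs because blocks are not scheduled in a fixed order.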

## Example 4: Basic Error Checking

In [None]:
%%cu
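#include <stdio.h>
#include <stdlib.h>  // exit, EXIT_FAILURE

// Macro for error checking: evaluates a CUDA runtime call and, if it did not
// return cudaSuccess, prints the file, line, and error string, then exits
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            printf("CUDA error in %s:%d: %s\n", __FILE__, __LINE__, \
                   cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)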

__global__ void simpleKernel() {
    printf("Kernel executed successfully!\n");
}

int main() {
    printf("Launching kernel with error checking...\n");
    
    simpleKernel<<<1, 1>>>();
    
    // Check for kernel launch errors (for example, an invalid grid or block size)
    CUDA_CHECK(cudaGetLastError());
    
    // Synchronize and check for execution errors
    CUDA_CHECK(cudaDeviceSynchronize());
    
    printf("Kernel completed without errors!\n");
    
    return 0;
}

## Practical Exercise

**Exercise 1:** Modify the hello world kernel to print a custom message with both block and thread IDs.

**Exercise 2:** Launch a kernel with 5 blocks and 8 threads per block. How many total threads are executing?

**Exercise 3:** Add error checking to ensure the kernel launches successfully.

**Exercise 4:** Experiment with different block and thread configurations. What happens when the values are very large, for example more than 1024 threads per block?

In [None]:
%%cu
// Your solution here
#include <stdio.h>

__global__ void myCustomKernel() {
    // TODO: Implement your custom kernel
}

int main() {
    // TODO: Launch your kernel
    
    return 0;
}

## Key Takeaways

1. **CUDA kernels** are functions that run on the GPU, declared with the `__global__` qualifier
2. **Kernel launch syntax**: `kernelName<<<numBlocks, threadsPerBlock>>>(args)`
3. **threadIdx.x** gives the thread index within a block
4. **blockIdx.x** gives the block index within the grid
5. **cudaDeviceSynchronize()** blocks the host until all previously launched GPU work has completed
6. Always check for CUDA errors in production code
7. GPU threads execute in parallel, so output order may vary

## Next Steps

In the next notebook, we'll learn how to:
- Query GPU device properties
- Understand GPU architecture (SMs, cores, warps)
- Check GPU capabilities programmatically
- Make informed decisions about kernel configuration

Continue to: **02_device_query.ipynb**

## Notes

*Use this space to write your own notes and observations:*

---



---