# Notebook 06: Memory Basics and Data Transfer

## Phase 2: Memory Management

**Learning Objectives:**

- Master CUDA memory allocation and deallocation
- Understand different memory transfer patterns
- Learn about pinned (page-locked) memory
- Measure memory transfer bandwidth
- Optimize host-device data movement

## Concept: CUDA Memory Model

**Memory Types:**

- **Global Memory**: Large, slow, accessible by all threads
- **Pinned Memory**: Non-pageable host memory for faster transfers
- **Device Memory**: GPU VRAM

**Memory Functions:**

```cuda
cudaMalloc(&d_ptr, size);           // Allocate device memory
cudaFree(d_ptr);                    // Free device memory
cudaMemcpy(dst, src, size, kind);   // Copy memory
cudaMallocHost(&h_ptr, size);       // Allocate pinned memory
cudaFreeHost(h_ptr);                // Free pinned memory
```

**Transfer Bandwidth:**

- Pinned memory: ~12 GB/s (PCIe 3.0 x16)
- Pageable memory: ~6 GB/s
- Async transfers possible with pinned memory
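The gap between pinned and pageable transfers quoted above can be measured directly. The following is a minimal sketch, not a definitive benchmark: the helper names `timeH2D` and `bandwidthGBps` and the 64 MiB buffer size are assumptions chosen for illustration, and the numbers it prints depend on the GPU, the PCIe generation, and the driver.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            printf("CUDA error at %s:%d: %s\n", __FILE__, __LINE__, \
                   cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Bandwidth in GB/s (1 GB = 1e9 bytes) for `bytes` moved in `ms` milliseconds.
// Hypothetical helper, not part of the CUDA API.
double bandwidthGBps(size_t bytes, float ms) {
    return ((double)bytes / 1.0e9) / ((double)ms / 1.0e3);
}

// Time one host-to-device copy with CUDA events and return milliseconds.
// Hypothetical helper, not part of the CUDA API.
float timeH2D(float *d_dst, const float *h_src, size_t bytes) {
    cudaEvent_t start, stop;
    float ms = 0.0f;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    CUDA_CHECK(cudaEventRecord(start, 0));
    CUDA_CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaEventRecord(stop, 0));
    CUDA_CHECK(cudaEventSynchronize(stop));   // wait until the copy has finished
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const size_t bytes = 64u * 1024u * 1024u;  // 64 MiB, an arbitrary test size

    // Pageable host buffer (malloc) and pinned host buffer (cudaMallocHost)
    float *h_pageable = (float*)malloc(bytes);
    if (h_pageable == NULL) {
        printf("Failed to allocate pageable host memory\n");
        return 1;
    }
    float *h_pinned = NULL;
    float *d_data = NULL;
    CUDA_CHECK(cudaMallocHost(&h_pinned, bytes));
    CUDA_CHECK(cudaMalloc(&d_data, bytes));
    memset(h_pageable, 0, bytes);
    memset(h_pinned, 0, bytes);

    // Warm-up copy so one-time initialization cost is not timed
    CUDA_CHECK(cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice));

    float msPageable = timeH2D(d_data, h_pageable, bytes);
    float msPinned = timeH2D(d_data, h_pinned, bytes);

    printf("Pageable: %.3f ms, %.2f GB/s\n", msPageable, bandwidthGBps(bytes, msPageable));
    printf("Pinned:   %.3f ms, %.2f GB/s\n", msPinned, bandwidthGBps(bytes, msPinned));

    CUDA_CHECK(cudaFree(d_data));
    CUDA_CHECK(cudaFreeHost(h_pinned));   // pinned memory is freed with cudaFreeHost
    free(h_pageable);
    return 0;
}
```

A single timed copy is noisy; averaging several repetitions per buffer type would give steadier figures.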

## Example 1: Basic Memory Allocation and Data Transfer

In [None]:
%%cu
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            printf("CUDA error at %s:%d: %s\n", __FILE__, __LINE__, \
                   cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

int main() {
    int n = 1000;
    size_t size = n * sizeof(float);
    
    printf("=== Basic Memory Allocation Example ===\n");
    printf("Array size: %d elements (%zu bytes)\n\n", n, size);
    
    // Step 1: Allocate host (CPU) memory
    float *h_data = (float*)malloc(size);
    if (h_data == NULL) {
        printf("Failed to allocate host memory\n");
        return 1;
    }
    printf("✓ Host memory allocated: %p\n", (void*)h_data);  // %p expects void*
    
    // Initialize host data
    for (int i = 0; i < n; i++) {
        h_data[i] = i * 1.0f;
    }
    printf("✓ Host data initialized (first 5: %.1f, %.1f, %.1f, %.1f, %.1f)\n",
           h_data[0], h_data[1], h_data[2], h_data[3], h_data[4]);
    
    // Step 2: Allocate device (GPU) memory
    float *d_data;
    CUDA_CHECK(cudaMalloc(&d_data, size));
    printf("✓ Device memory allocated: %p\n", (void*)d_data);  // %p expects void*
    
    // Step 3: Copy host to device
    CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));
    printf("✓ Data copied from host to device\n");
    
    // Step 4: Verify by copying back
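    float *h_verify = (float*)malloc(size);
    if (h_verify == NULL) {
        printf("Failed to allocate host verification buffer\n");
        cudaFree(d_data);
        free(h_data);
        return 1;
    }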
    CUDA_CHECK(cudaMemcpy(h_verify, d_data, size, cudaMemcpyDeviceToHost));
    printf("✓ Data copied back for verification\n");
    
    // Verify data integrity
    bool correct = true;
    for (int i = 0; i < n; i++) {
        if (h_verify[i] != h_data[i]) {
            printf("✗ Mismatch at index %d: %.1f != %.1f\n", i, h_verify[i], h_data[i]);
            correct = false;
            break;
        }
    }
    
    if (correct) {
        printf("✓ Data transfer successful! All values match.\n");
    }
    
    // Step 5: Clean up
    CUDA_CHECK(cudaFree(d_data));
    free(h_data);
    free(h_verify);
    printf("\n✓ Memory freed successfully\n");
    
    return 0;
}

## Exercise: Complete Memory Management Workflow

In [None]:
%%cu
// Exercise: Implement a function that allocates memory, transfers data, 
// processes it on GPU, and returns the result

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            printf("CUDA error: %s\n", cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

__global__ void scaleArray(float *data, float scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= scale;
    }
}

int main() {
    // TODO: 
    // 1. Allocate host and device memory
    // 2. Initialize host data
    // 3. Transfer to device
    // 4. Launch kernel to scale array by 2.0
    // 5. Transfer back and verify results
    // 6. Clean up memory
    
    int n = 1000;
    size_t size = n * sizeof(float);
    
    // Your solution here
    printf("Exercise: Implement complete memory management workflow\n");
    
    return 0;
}
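
One possible solution to the exercise, as a sketch rather than the reference answer: the block size of 256 threads and the helper `blocksFor` are assumptions, not part of the original exercise. It follows the six TODO steps in order and reuses the `scaleArray` kernel from the exercise cell.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            printf("CUDA error at %s:%d: %s\n", __FILE__, __LINE__, \
                   cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

__global__ void scaleArray(float *data, float scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= scale;
    }
}

// Number of blocks needed to cover n elements (hypothetical helper).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 1000;
    const size_t size = n * sizeof(float);
    const float scale = 2.0f;

    // 1. Allocate host and device memory
    float *h_data = (float*)malloc(size);
    if (h_data == NULL) {
        printf("Failed to allocate host memory\n");
        return 1;
    }
    float *d_data = NULL;
    CUDA_CHECK(cudaMalloc(&d_data, size));

    // 2. Initialize host data
    for (int i = 0; i < n; i++) {
        h_data[i] = (float)i;
    }

    // 3. Transfer to device
    CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));

    // 4. Launch kernel to scale the array by 2.0
    const int threads = 256;  // assumed block size
    scaleArray<<<blocksFor(n, threads), threads>>>(d_data, scale, n);
    CUDA_CHECK(cudaGetLastError());  // catches launch configuration errors

    // 5. Transfer back (cudaMemcpy waits for the kernel) and verify
    CUDA_CHECK(cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost));
    int errors = 0;
    for (int i = 0; i < n; i++) {
        if (h_data[i] != (float)i * scale) {
            errors++;
        }
    }
    printf("%s (%d mismatches)\n", errors == 0 ? "PASS" : "FAIL", errors);

    // 6. Clean up memory
    CUDA_CHECK(cudaFree(d_data));
    free(h_data);
    return errors == 0 ? 0 : 1;
}
```

The exact equality check is safe here only because every input is a small integer and multiplying by 2 is exact in `float`; a general kernel would need a tolerance.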

## Key Takeaways

1. **cudaMalloc** allocates memory on the GPU, **cudaFree** releases it
2. **cudaMemcpy** transfers data; specify direction with cudaMemcpyKind
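3. **Pinned memory** (cudaMallocHost) typically transfers about 1.5-2x faster than pageable memory (roughly 12 GB/s versus 6 GB/s on PCIe 3.0 x16)
4. **Pageable memory** (malloc) is the default for host allocations but is slower for GPU transfers, because the driver must first stage the data through a pinned buffer
5. Always check the return value of every CUDA API call, for example with the CUDA_CHECK macro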
6. **cudaEvent** API provides accurate timing for GPU operations
7. **Bandwidth = Data Size / Transfer Time** measures transfer efficiency
8. Minimize host-device transfers - they are expensive operations
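
As a worked instance of the bandwidth formula in takeaway 7, assume (hypothetically, not as a measured result) that a 256 MiB buffer is copied in 25 ms:

$$
\text{Bandwidth} = \frac{\text{Data Size}}{\text{Transfer Time}}
= \frac{256 \times 2^{20}\ \text{B}}{25 \times 10^{-3}\ \text{s}}
= \frac{268{,}435{,}456\ \text{B}}{0.025\ \text{s}}
\approx 1.07 \times 10^{10}\ \text{B/s}
\approx 10.7\ \text{GB/s}
$$

Here 1 GB is taken as $10^9$ bytes. A result like this would sit between the ~6 GB/s and ~12 GB/s figures quoted earlier for pageable and pinned transfers.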

## Next Steps

In the next notebook, we'll learn about:
- Memory bandwidth benchmarking techniques
- Measuring effective bandwidth
- Understanding PCIe transfer limits
- Optimizing memory transfer patterns

Continue to: **07_memory_bandwidth_benchmarking.ipynb**


## Notes

*Use this space to write your own notes and observations:*

---