Welcome to the Parallel Computing assignment! This guide is designed to serve as a beginner-friendly tutorial for 3rd-year IT students. It will walk you through setting up your environment, writing basic parallel programs using OpenMP, MPI, and CUDA, and comparing their performance.
To get started, you will need to install the necessary compilers and libraries for OpenMP, MPI, and CUDA. Run the following commands in your Ubuntu terminal.
OpenMP is typically bundled with the GNU Compiler Collection (GCC). Install GCC using:
```bash
sudo apt update
sudo apt install build-essential
```

To verify the installation:

```bash
gcc --version
```

OpenMPI is an open-source implementation of the Message Passing Interface (MPI) standard. Install it with:
```bash
sudo apt install openmpi-bin libopenmpi-dev
```

To verify the installation:

```bash
mpicc --version
mpirun --version
```

If your system has an NVIDIA GPU, you can install the CUDA toolkit:

```bash
sudo apt install nvidia-cuda-toolkit
```

To verify the installation:

```bash
nvcc --version
```

The repository contains a `vector_add` example for each framework. Here is how to compile and run each one manually.
Alternatively, you can use the provided `benchmark.sh` script to run all of them at once!
Compile the OpenMP version:

```bash
gcc -fopenmp openmp/vector_add_openmp.c -o openmp/vector_add_openmp
```

The `-fopenmp` flag tells the compiler to recognize OpenMP directives (like `#pragma omp`).

Execute:

```bash
./openmp/vector_add_openmp
```
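For reference, here is a minimal sketch of what an OpenMP vector addition looks like. The array size, initialization, and timing shown are illustrative assumptions; the actual `vector_add_openmp.c` in the repository may differ.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000  /* assumed array size; the repository version may use a different one */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    double start = omp_get_wtime();

    /* The pragma splits the loop iterations across the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("OpenMP time: %f s (max threads: %d)\n",
           omp_get_wtime() - start, omp_get_max_threads());

    free(a); free(b); free(c);
    return 0;
}
```

The single `#pragma omp parallel for` line is all that distinguishes this from serial C code; setting `OMP_NUM_THREADS` before running changes how many threads the loop is divided across.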
Compile the MPI version:

```bash
mpicc mpi/vector_add_mpi.c -o mpi/vector_add_mpi
```

Execute:

```bash
mpirun -np 4 ./mpi/vector_add_mpi
```

The `-np 4` argument specifies that the program should be run using 4 separate processes.
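Conceptually, the MPI version scatters chunks of the input arrays to every process, each process adds its own chunk, and the partial results are gathered back on rank 0. The sketch below illustrates that pattern with an assumed array size and variable names; the repository's `vector_add_mpi.c` may be organized differently.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 8000000  /* assumed total size; must be divisible by the number of processes */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;                 /* elements handled by each process */

    double *a = NULL, *b = NULL, *c = NULL;
    if (rank == 0) {                      /* only the root holds the full arrays */
        a = malloc(N * sizeof(double));
        b = malloc(N * sizeof(double));
        c = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
    }

    double *la = malloc(chunk * sizeof(double));
    double *lb = malloc(chunk * sizeof(double));
    double *lc = malloc(chunk * sizeof(double));

    /* Distribute equal chunks of a and b to every process. */
    MPI_Scatter(a, chunk, MPI_DOUBLE, la, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, chunk, MPI_DOUBLE, lb, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)
        lc[i] = la[i] + lb[i];

    /* Collect the partial results back on the root process. */
    MPI_Gather(lc, chunk, MPI_DOUBLE, c, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("done, c[1] = %f\n", c[1]);

    MPI_Finalize();
    return 0;
}
```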
Compile the CUDA version:

```bash
nvcc cuda/vector_add_cuda.cu -o cuda/vector_add_cuda
```

Execute:

```bash
./cuda/vector_add_cuda
```
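For orientation, here is a minimal sketch of a CUDA vector addition: copy the arrays to the GPU, launch one thread per element, and copy the result back. The array size and the choice of 256 threads per block are assumptions for illustration; the repository's `vector_add_cuda.cu` may differ.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 10000000  /* assumed array size */

/* Each thread computes exactly one element of the output vector. */
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    size_t bytes = N * sizeof(float);
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) { ha[i] = i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                          /* threads per block */
    int blocks = (N + threads - 1) / threads;   /* ceil(N / threads) blocks */
    vector_add<<<blocks, threads>>>(da, db, dc, N);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[1] = %f\n", hc[1]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```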
Note: This section will be expanded in your final report based on your specific hardware and benchmark results.
When evaluating parallel performance, simply writing parallel code isn't enough; you must tune it to your hardware's specifications.
- OpenMP (Thread Count): Performance generally improves as you increase the number of threads up to the number of physical CPU cores. Exceeding this can lead to context-switching overhead. You can control the thread count via the `OMP_NUM_THREADS` environment variable (a short sketch follows this list).
- MPI (Process Distribution): Distributing processes across multiple physical nodes over a network introduces communication overhead. Optimization involves minimizing data transfer (e.g., in `MPI_Scatter` and `MPI_Gather`) and maximizing computation.
- CUDA (Memory Coalescing & Cores): GPUs possess thousands of CUDA cores designed for massive throughput. To fully utilize them, memory accesses must be "coalesced" (threads in a warp accessing contiguous memory blocks). Additionally, the grid and block dimensions (e.g., threads per block) should be tuned according to the GPU's streaming multiprocessor (SM) architecture.
- L3 Cache: The size of your CPU's L3 cache dictates how much data can be kept close to the processing cores. If your array size far exceeds the L3 cache, you will incur significant main-memory latency penalties, often bottlenecking performance before CPU compute limits are reached.
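As a starting point for thread-count experiments, the short sketch below (not part of the repository code) queries the processors the runtime can see and sets the thread count programmatically with `omp_set_num_threads`, which takes precedence over `OMP_NUM_THREADS` for subsequent parallel regions.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* omp_get_num_procs() reports logical processors; on a CPU with
       hyper-threading, half of this value is often a better match for
       the number of physical cores. */
    int threads = omp_get_num_procs();
    omp_set_num_threads(threads);

    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}
```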
If you run into issues, check the following common pitfalls:
- "command not found: nvcc":
- Cause: The CUDA toolkit is either not installed or its
bindirectory is not in your system'sPATH. - Fix: Add
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}to your~/.bashrcfile and runsource ~/.bashrc.
- "MPI_ERR_OTHER: known error not in list" or Network Issues with mpirun:
- Cause: OpenMPI can sometimes get confused by loopback interfaces or multiple network adapters.
- Fix: Try running with
mpirun --mca btl_tcp_if_include lo -np 4 ...to restrict it to the local loopback interface.
- NVIDIA Driver Conflicts:
  - Cause: You may have upgraded your kernel, causing a mismatch with the loaded NVIDIA driver (e.g., `nvidia-smi` fails).
  - Fix: Reboot your machine or reinstall the specific driver version: `sudo apt install --reinstall nvidia-driver-<version>`.
- Array Size Not Divisible by Processes (MPI):
  - Cause: Our basic MPI implementation requires the total array size to be evenly divisible by the number of processes (the `-np` argument).
  - Fix: Ensure `N % num_processes == 0`, or modify the MPI code to handle remainders using `MPI_Scatterv` (see the sketch after this list).
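For the last point, the sketch below shows one way to compute per-rank counts and displacements for `MPI_Scatterv` so that any array size works. The constant `N` and the variable names are illustrative, not taken from the repository code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1000003  /* deliberately not divisible by typical process counts */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Give the first (N % size) ranks one extra element each. */
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    for (int r = 0, offset = 0; r < size; r++) {
        counts[r] = N / size + (r < N % size ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }

    double *a = NULL;
    if (rank == 0) {
        a = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) a[i] = i;
    }

    double *local = malloc(counts[rank] * sizeof(double));

    /* Unlike MPI_Scatter, each rank may receive a different number of elements. */
    MPI_Scatterv(a, counts, displs, MPI_DOUBLE,
                 local, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d received %d elements\n", rank, counts[rank]);

    MPI_Finalize();
    return 0;
}
```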