Hrachya Tandilyan edited this page Sep 5, 2017 · 3 revisions

Parallel Programming Course

Project Specifications

Abstract

The project is to implement a distance calculator tool that computes distances between large numerical vectors. The tool uses several calculation engines, each built on a different parallel programming technique, such as C++11 standard library multithreading, MPI, or CUDA.

Input

  • Two sets of numeric vectors.

    • First set is called “Query Vectors”.
    • Second set is called “Dataset Vectors”.
  • The default coordinate type is float.

  • All vectors have the same length; the default is 512.

  • The default number of query vectors is 1024.

  • The default number of dataset vectors is 1024.

Query and dataset vectors should be initialized in one of the following two ways:

  • From input .csv files, given separately for query vectors and dataset vectors (csv is the simplest format for representing a data table; each row corresponds to a single vector).
    • If a query vectors csv file is provided, its content must exactly match the specified query vector count and length; otherwise, exit with a corresponding error message.
    • If a dataset vectors csv file is provided, its content must exactly match the specified dataset vector count and length; otherwise, exit with a corresponding error message.
    • NOTE: this input method is helpful for writing fixed test cases with specific input vectors, which can serve as regression tests throughout the development process.
  • If no csv file is provided for the query vectors and/or the dataset vectors, the corresponding vectors should be randomly generated.
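
The two initialization paths above can be sketched as follows. This is only an illustration, not a required design: the function names `read_csv_vectors` and `random_vectors`, the flat row-major storage, the fixed seed parameter, and the [0, 1) value range are all assumptions made for the example.

```cpp
#include <cstddef>
#include <istream>
#include <random>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads exactly `count` rows of `length` comma-separated
// floats into one contiguous row-major buffer. Throws std::runtime_error if
// the shape of the input does not match.
std::vector<float> read_csv_vectors(std::istream& in, std::size_t count,
                                    std::size_t length) {
    std::vector<float> data;
    data.reserve(count * length);
    std::string line;
    std::size_t rows = 0;
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // tolerate blank lines
        std::istringstream row(line);
        std::string cell;
        std::size_t cols = 0;
        while (std::getline(row, cell, ',')) {
            data.push_back(std::stof(cell));  // throws on non-numeric cells
            ++cols;
        }
        ++rows;
        if (cols != length) {
            throw std::runtime_error("row " + std::to_string(rows) +
                                     ": expected " + std::to_string(length) +
                                     " values, got " + std::to_string(cols));
        }
    }
    if (rows != count) {
        throw std::runtime_error("expected " + std::to_string(count) +
                                 " vectors, got " + std::to_string(rows));
    }
    return data;
}

// Hypothetical helper: fills count * length floats with uniform values in
// [0, 1). The range and the explicit seed are assumptions; a fixed seed keeps
// generated inputs reproducible between runs.
std::vector<float> random_vectors(std::size_t count, std::size_t length,
                                  unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> data(count * length);
    for (float& x : data) x = dist(gen);
    return data;
}
```

A caller would catch the exception, print its message, and exit with a non-zero status to satisfy the "exit with a corresponding error message" requirement.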

Output

  • Runtime details
    • Each calculation phase runtime
    • Total runtime
  • Distance matrix, containing the distance from each query vector to every dataset vector.
  • Distance metrics: L1, L2, Hamming.
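
For reference, the three metrics might be written as plain sequential functions like the ones below. The function names and the pointer-plus-length signature are hypothetical. The specification does not say how Hamming distance applies to float vectors; this sketch assumes it means the number of coordinates at which the two vectors differ.

```cpp
#include <cmath>
#include <cstddef>

// L1 (Manhattan) distance: sum of absolute coordinate differences.
float l1_distance(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += std::fabs(a[i] - b[i]);
    return sum;
}

// L2 (Euclidean) distance: square root of the sum of squared differences.
float l2_distance(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Hamming distance, ASSUMED here to be the count of differing coordinates.
float hamming_distance(const float* a, const float* b, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) ++count;
    }
    return static_cast<float>(count);
}
```

For a = (0, 3, 1) and b = (4, 0, 1), these return 7, 5, and 2 respectively.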

Components

Components are described in terms of concepts; the choice of how to implement each concept (functions, a single class, a hierarchy of classes) is left to the student. The architecture and the connections between concepts must also be defined by the students (use of classic design patterns is a big advantage).

Distance Calculator

The main component, which encapsulates all other concepts and provides a common interface for all types of distance calculation (including support for engine-specific arguments, such as those needed for MPI initialization). This concept is also responsible for storing the input and output datasets, initializing them, and handing them to the corresponding calculation engine (avoid copying datasets: keep all the vectors in one place and pass references to them).

Distance-Metric-Specific Calculation Engines

These engines must provide the actual distance calculation, each using a specific technique such as C++11 threads, MPI, or CUDA. There must also be a simple sequential engine that does all the calculations with nested loops. Each engine can have several modes, selectable by clients of the engine (there can be more than one version of the calculation for a given technique).
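
As a rough sketch of the idea, the block below shows a sequential nested-loop engine next to a C++11 `std::thread` engine that splits the query rows into contiguous chunks. The names `VectorSet`, `Metric`, `compute_rows`, `sequential_engine`, and `threaded_engine` are hypothetical, and row partitioning is just one possible mode, not the prescribed one. Both engines read the vectors through a non-owning view, so no dataset is copied.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical non-owning view over `count` row-major vectors of `length`.
struct VectorSet {
    const float* data;
    std::size_t count;
    std::size_t length;
};

using Metric = float (*)(const float*, const float*, std::size_t);

float l2_distance(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Fills rows [first, last) of the row-major result matrix `out`.
void compute_rows(const VectorSet& q, const VectorSet& d, Metric metric,
                  std::size_t first, std::size_t last, float* out) {
    for (std::size_t i = first; i < last; ++i) {
        for (std::size_t j = 0; j < d.count; ++j) {
            out[i * d.count + j] = metric(q.data + i * q.length,
                                          d.data + j * d.length, q.length);
        }
    }
}

// Reference engine: plain nested loops on the calling thread.
std::vector<float> sequential_engine(const VectorSet& q, const VectorSet& d,
                                     Metric metric) {
    std::vector<float> out(q.count * d.count);
    compute_rows(q, d, metric, 0, q.count, out.data());
    return out;
}

// Threaded engine: each thread owns a disjoint block of query rows, so the
// threads never write to the same element and no locking is needed.
std::vector<float> threaded_engine(const VectorSet& q, const VectorSet& d,
                                   Metric metric, unsigned threads) {
    std::vector<float> out(q.count * d.count);
    threads = std::max(1u, threads);
    const std::size_t chunk = (q.count + threads - 1) / threads;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t first = std::min(q.count, t * chunk);
        const std::size_t last = std::min(q.count, first + chunk);
        if (first == last) break;  // more threads than rows
        pool.emplace_back([&q, &d, metric, first, last, &out] {
            compute_rows(q, d, metric, first, last, out.data());
        });
    }
    for (std::thread& th : pool) th.join();
    return out;
}
```

Because both engines share `compute_rows`, they perform the same floating-point operations in the same order per element, so their results match exactly. With GCC or Clang on Linux the program may need the `-pthread` flag.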

Stopwatch/Precise Timing mechanism

Since we have several implementations of the same calculations, we need a well-tuned stopwatch to measure the calculation runtimes and find out the winner! The stopwatch must be easy to use, thread safe, and as precise as possible. Third-party libraries may be used for this component only, but wrap them so that the rest of the code sees your own interface (the C++11 standard library has some good timing classes that can be used here).
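
One possible shape for such a wrapper, built on `std::chrono::steady_clock` from the C++11 standard library, is sketched below. The class name `Stopwatch`, the phase-name keys, and the choice to guard the internal maps with a mutex are assumptions for illustration; other designs (for example, one stopwatch per thread) would also meet the requirement.

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical stopwatch that accumulates elapsed time per named phase.
class Stopwatch {
public:
    using Clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes

    // Marks the beginning of a phase.
    void start(const std::string& phase) {
        std::lock_guard<std::mutex> lock(mutex_);
        starts_[phase] = Clock::now();
    }

    // Ends a phase, adds the elapsed time to its total, and returns the
    // elapsed milliseconds. Throws if the phase was never started.
    double stop(const std::string& phase) {
        const Clock::time_point end = Clock::now();  // sample before locking
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = starts_.find(phase);
        if (it == starts_.end()) {
            throw std::logic_error("phase was not started: " + phase);
        }
        const double ms =
            std::chrono::duration<double, std::milli>(end - it->second).count();
        totals_[phase] += ms;
        starts_.erase(it);
        return ms;
    }

    // Total milliseconds recorded for a phase (0 if it never ran).
    double total_ms(const std::string& phase) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = totals_.find(phase);
        return it == totals_.end() ? 0.0 : it->second;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, Clock::time_point> starts_;
    std::map<std::string, double> totals_;
};
```

Typical use would be `sw.start("calculation")` before an engine runs and `sw.stop("calculation")` right after it, with `total_ms` queried when the runtime details are reported.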

Input/Output/Logging mechanism

The final executable must have a fully functional command line interface with proper input processing, validation, a detailed help message, etc. There must be several ways of reporting the results (writing to separate files, writing to the output stream, both, etc.). Additionally, provide several logging levels, including a debug level at which you can dump every detail of the hard parts of the execution.
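
The logging levels could be handled by something as small as the sketch below. The level names, the `Logger` class, and the `[LEVEL] message` output format are all hypothetical; the specification only asks for several levels with a debug level among them.

```cpp
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical levels, ordered from least to most verbose.
enum class LogLevel { Error = 0, Warning, Info, Debug };

// Hypothetical logger: writes messages at or below the configured verbosity
// to any std::ostream (std::cout, a file stream, a string stream in tests).
class Logger {
public:
    Logger(std::ostream& out, LogLevel threshold)
        : out_(out), threshold_(threshold) {}

    void log(LogLevel level, const std::string& message) {
        if (level > threshold_) return;  // too verbose for the current setting
        std::lock_guard<std::mutex> lock(mutex_);  // keep lines from interleaving
        out_ << '[' << name(level) << "] " << message << '\n';
    }

private:
    static const char* name(LogLevel level) {
        switch (level) {
            case LogLevel::Error:   return "ERROR";
            case LogLevel::Warning: return "WARNING";
            case LogLevel::Info:    return "INFO";
            case LogLevel::Debug:   return "DEBUG";
        }
        return "UNKNOWN";
    }

    std::ostream& out_;
    LogLevel threshold_;
    std::mutex mutex_;
};
```

Taking the stream by reference lets the same logger report to the output stream or to a separate file, depending on what the command line requests.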

Results validation mechanism

Results validation must be supported. It works as follows:

  • After running the distance calculation with a specific engine (say, the CUDA engine), store the resulting distance matrix.
  • Rerun the distance calculation with the simplest engine (nested C++ for loops), which is assumed to be a 100% correct implementation.
  • Compare the distance matrices from both runs to validate the correctness of the first engine.

Results validation can be enabled or disabled (disabled by default) with a dedicated command line flag.
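
The comparison step might look like the sketch below. The `ValidationReport` struct and `validate` function are hypothetical names. The use of an absolute tolerance is also an assumption: the specification does not say how close the matrices must be, but engines that sum floats in a different order can differ from the reference in the last few bits, so an exact comparison may be too strict.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical summary of a validation run.
struct ValidationReport {
    bool passed;
    std::size_t mismatches;  // elements whose error exceeds the tolerance
    float max_abs_error;     // largest absolute difference observed
};

// Compares a candidate distance matrix against the reference produced by the
// sequential engine, element by element. `tolerance` is an ASSUMED absolute
// error bound; pass 0 to require exact equality.
ValidationReport validate(const std::vector<float>& reference,
                          const std::vector<float>& candidate,
                          float tolerance) {
    if (reference.size() != candidate.size()) {
        throw std::invalid_argument("distance matrices differ in size");
    }
    ValidationReport report{true, 0, 0.0f};
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const float diff = std::fabs(reference[i] - candidate[i]);
        if (diff > report.max_abs_error) report.max_abs_error = diff;
        // Written as !(diff <= tolerance) so that NaN counts as a mismatch.
        if (!(diff <= tolerance)) {
            ++report.mismatches;
            report.passed = false;
        }
    }
    return report;
}
```

Reporting the mismatch count and the maximum error, rather than a bare pass/fail, gives the debug log something concrete to print when an engine disagrees with the reference.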

Implementation Requirements

Try to use a top-down approach: design, then implement!

  • Implement step by step, and commit every step!
  • Every committed version of the code must compile!
  • Write complete commit messages that describe the changes!
  • Empty functions and classes may be committed if they are marked with TODO.
  • Do not commit temporary files or build artifacts.

Keep the code clean and readable, and use the same indentation throughout.

Write Doxygen comments for all the header files and their components! Write comments for the hard parts of the implementation as well. If you are not familiar with Doxygen, set aside an hour to study it (believe me, you will thank me later).