VectorBlitz - High-Performance Matrix Server

A high-performance HTTP server for matrix multiplication and pattern matching

C | NGINX | CUDA | AVX-512 | License: Academic

Features | Architecture | Quick Start | Optimizations | Benchmarking | Documentation


About

This project implements a high-performance HTTP server designed to process matrix operations and pattern matching for a content detection system. Built on top of NGINX with a custom C module using nginx-link-function, the server handles POST requests containing two matrices and multiple patterns, computing the minimum Euclidean distance between each pattern and the matrix multiplication result.

The project was developed across four progressive phases, each focusing on different aspects of system performance:

| Phase   | Focus                 | Description                                               |
|---------|-----------------------|-----------------------------------------------------------|
| Phase 1 | Implementation        | Basic server functionality with correct algorithm         |
| Phase 2 | Benchmarking          | Performance evaluation using Design of Experiments (DoE)  |
| Phase 3 | Scalar Optimization   | Cache-aware algorithms and loop unrolling                 |
| Phase 4 | Hardware Acceleration | SIMD (SSE/AVX/AVX-512) and SIMT (CUDA) implementations    |

Features

Core Functionality

  • Matrix Multiplication: Efficient K×K matrix multiplication on 32-bit unsigned integer (UINT32) elements
  • Pattern Matching: Custom distance computation between patterns and matrix results
  • HTTP Interface: RESTful POST endpoint for processing requests
  • Binary Protocol: Optimized binary request/response format
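As a point of reference for the K×K product listed above, here is a minimal baseline sketch. It assumes row-major storage and `uint32_t` elements; the function name `multiply_naive` is hypothetical and not taken from the repository.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline K x K product C = A * B on row-major uint32_t arrays.
 * Unsigned arithmetic wraps modulo 2^32, so overflow is well defined.
 * Hypothetical name; the repository's own function may differ. */
void multiply_naive(const uint32_t *a, const uint32_t *b,
                    uint32_t *c, size_t k)
{
    for (size_t i = 0; i < k; ++i) {
        for (size_t j = 0; j < k; ++j) {
            uint32_t sum = 0;
            /* Walks a row of A and a column of B (strided access). */
            for (size_t x = 0; x < k; ++x)
                sum += a[i * k + x] * b[x * k + j];
            c[i * k + j] = sum;
        }
    }
}
```

Calling `multiply_naive(a, b, c, k)` with three buffers of `k * k` elements fills `c`. The column-wise walk over `b` is the access pattern that a cache-aware variant would try to avoid.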

Performance Optimizations

  • Cache-Aware Algorithms: Row-by-row (line-by-line) matrix multiplication for improved cache utilization
  • Loop Unrolling: Manual unrolling with configurable widths (2, 4, 8, 16, 32)
  • SIMD Vectorization: SSE (128-bit), AVX2 (256-bit), and AVX-512 (512-bit) implementations
  • CUDA Acceleration: GPU-based parallel processing for large workloads
  • Sparse Optimization: Skip zero elements during computation
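The cache-aware and sparse items above can be illustrated together. The sketch below assumes that "line-by-line" means an i-k-j loop order over row-major arrays, so the inner loop reads and writes contiguous rows, and that the sparse shortcut skips zero entries of the left matrix. Both are assumptions about the approach, and `multiply_rowwise` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Row-wise (i-k-j) K x K product on row-major uint32_t arrays.
 * Assumed interpretation of the "line-by-line" technique. */
void multiply_rowwise(const uint32_t *a, const uint32_t *b,
                      uint32_t *c, size_t k)
{
    memset(c, 0, k * k * sizeof *c);
    for (size_t i = 0; i < k; ++i) {
        uint32_t *c_row = c + i * k;
        for (size_t x = 0; x < k; ++x) {
            const uint32_t a_ix = a[i * k + x];
            if (a_ix == 0)
                continue; /* sparse shortcut: a zero contributes nothing */
            const uint32_t *b_row = b + x * k;
            /* Inner loop touches consecutive addresses in both rows. */
            for (size_t j = 0; j < k; ++j)
                c_row[j] += a_ix * b_row[j];
        }
    }
}
```

The result is identical to the textbook i-j-k order because unsigned addition wraps consistently; only the memory access pattern changes.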

Benchmarking Framework

  • Automated Testing: Full experiment automation via the make-plots-phase*.sh scripts
  • Load Generation: Integration with wrk2 for precise request rate control
  • Statistical Analysis: Multi-run aggregation with variance reporting
  • Visualization: Automated plot generation with Python/Matplotlib
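To make the multi-run aggregation concrete, the sketch below computes the mean and the sample variance of one metric across repeated runs. The repository's analysis scripts use Python; C is used here only to stay consistent with the server code, and `run_stats` and `aggregate_runs` are hypothetical names.

```c
#include <stddef.h>

/* Mean and sample variance of one metric across repeated runs. */
typedef struct {
    double mean;
    double variance; /* unbiased estimate, divides by n - 1 */
} run_stats;

/* Hypothetical helper: aggregates n samples (for example, the
 * throughput measured in each run). Returns zeros when n is 0. */
run_stats aggregate_runs(const double *samples, size_t n)
{
    run_stats s = {0.0, 0.0};
    if (n == 0)
        return s;

    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += samples[i];
    s.mean = sum / (double)n;

    if (n > 1) {
        double sq = 0.0;
        for (size_t i = 0; i < n; ++i) {
            const double d = samples[i] - s.mean;
            sq += d * d;
        }
        s.variance = sq / (double)(n - 1);
    }
    return s;
}
```

With a single run the variance is reported as 0, since no spread can be estimated from one sample.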

Architecture

VectorBlitz/
├── project/
│   ├── server-implementation/     # Core server code
│   │   ├── main.c                 # NGINX module entry point
│   │   └── Makefile               # Multi-target build system
│   ├── utils/                     # Algorithm implementations
│   │   ├── utils.c/h              # Scalar algorithms
│   │   ├── simd.c/h               # SSE/AVX/AVX-512 implementations
│   │   └── simt.cu/h              # CUDA GPU implementation
│   ├── tests-server/              # Unit test suite
│   │   └── src/                   # Test sources (CMake-based)
│   ├── perfs-eval-phase2/         # Phase 2: DoE experiments
│   ├── perfs-eval-phase3/         # Phase 3: Optimization analysis
│   ├── perfs-eval-phase4/         # Phase 4: SIMD/SIMT benchmarks
│   ├── nginx-conf/                # Server configuration
│   └── wrk2/                      # Load testing tool (submodule)
└── extra-project/
    ├── statements/                # Project specifications
    ├── reports/                   # Phase reports (PDF)
    └── videos/                    # Demo videos

Request Processing Pipeline

HTTP POST Request
       │
       ▼
┌──────────────────┐
│  parse_request() │  ─── Decode binary header + matrices + patterns
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ multiply_matrix()│  ─── K×K matrix multiplication (scalar/SIMD/SIMT)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ test_patterns()  │  ─── Compute min distance for each pattern
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ res_to_string()  │  ─── Format results as CSV response
└────────┬─────────┘
         │
         ▼
   HTTP Response
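The last two pipeline stages can be sketched as follows. The README does not spell out the distance definition, so this example assumes that each pattern is slid across the flattened result matrix and that the distance is the sum of squared differences, without the square root. The names `min_sq_distance` and `distances_to_csv` are hypothetical stand-ins, not the signatures of `test_patterns()` or `res_to_string()`.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed distance: the smallest sum of squared differences between
 * the pattern and any window of the flattened result. Arithmetic is
 * uint32_t and wraps modulo 2^32. Returns UINT32_MAX if the pattern
 * is empty or longer than the result. */
uint32_t min_sq_distance(const uint32_t *res, size_t res_len,
                         const uint32_t *pat, size_t pat_len)
{
    uint32_t best = UINT32_MAX;
    if (pat_len == 0 || pat_len > res_len)
        return best;
    for (size_t start = 0; start + pat_len <= res_len; ++start) {
        uint32_t dist = 0;
        for (size_t j = 0; j < pat_len; ++j) {
            const uint32_t diff = res[start + j] - pat[j];
            dist += diff * diff;
        }
        if (dist < best)
            best = dist;
    }
    return best;
}

/* Writes "d0,d1,...,dn-1" into out. Returns the number of characters
 * written, or -1 if the buffer is too small. */
int distances_to_csv(const uint32_t *dist, size_t n,
                     char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; i < n; ++i) {
        const int w = snprintf(out + used, out_size - used,
                               "%s%" PRIu32, i ? "," : "", dist[i]);
        if (w < 0 || (size_t)w >= out_size - used)
            return -1; /* output would have been truncated */
        used += (size_t)w;
    }
    return (int)used;
}
```

A caller would run `min_sq_distance` once per pattern, collect the results in an array, and pass that array to `distances_to_csv` to build the response body.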

Quick Start

Prerequisites

  • GCC 9+ with C11 support
  • NGINX with nginx-link-function module
  • CMake 3.10+ (for tests)
  • Python 3.8+ with matplotlib, numpy, pandas
  • wrk2 (included as submodule)

For SIMD: CPU with AVX-512 support
For SIMT: NVIDIA GPU with CUDA toolkit 11+

Installation

# Clone the repository
git clone https://github.com/mathisdelsart/VectorBlitz.git
cd VectorBlitz

# Initialize submodules
git submodule update --init --recursive

# Build wrk2 load generator
cd project/wrk2 && make && cd ../..

Building the Server

cd project/server-implementation

# Build naive version (Phase 2 baseline)
make build

# Build with scalar optimizations (Phase 3)
make build "CFLAGS=-DBEST"

# Build with SIMD acceleration (Phase 4)
make build_simd CFLAGS+="-DSIMDBEST"

# Build with CUDA acceleration (Phase 4)
make build_simt

Running the Server

# Start server (default: port 8888)
make run_release

# Start with specific optimizations
make run_release "CFLAGS=-DBEST"

# Start with multiple workers
make run_release NB_WORKER=4

# Debug mode with GDB
make run_gdb

# Memory profiling with Valgrind
make run_valgrind

Sending a Test Request

# Using the provided script
./send-request.sh

# Manual request with curl
curl -X POST http://localhost:8888/ \
  --data-binary @test_request.bin

Optimizations

Phase 3: Scalar Optimizations

| Flag        | Description                          | Technique                                |
|-------------|--------------------------------------|------------------------------------------|
| CACHE_AWARE | Cache-friendly matrix multiplication | Line-by-line traversal                   |
| UNROLL      | Loop unrolling                       | Manual unrolling with configurable width |
| BEST        | Combined optimizations               | All techniques + custom improvements     |

# Compile with specific optimization
make build "CFLAGS=-DCACHE_AWARE -DUNROLL"

# Rebuild and run best version
make -B run_release "CFLAGS=-DBEST"
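To show what manual unrolling looks like in practice, here is a minimal sketch with a fixed width of 4 applied to the inner row update of a row-wise multiplication. The width, the choice of loop, and the name `row_update_unroll4` are illustrative assumptions; the repository's `UNROLL` code path may be structured differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes c_row[j] += a * b_row[j] for j in [0, n), unrolled by 4.
 * Hypothetical helper illustrating the technique only. */
void row_update_unroll4(uint32_t *c_row, const uint32_t *b_row,
                        uint32_t a, size_t n)
{
    size_t j = 0;
    /* Main loop: four independent updates per iteration, which cuts
     * loop-control overhead and exposes instruction-level parallelism. */
    for (; j + 4 <= n; j += 4) {
        c_row[j]     += a * b_row[j];
        c_row[j + 1] += a * b_row[j + 1];
        c_row[j + 2] += a * b_row[j + 2];
        c_row[j + 3] += a * b_row[j + 3];
    }
    /* Remainder loop: handles the last n % 4 elements. */
    for (; j < n; ++j)
        c_row[j] += a * b_row[j];
}
```

The remainder loop is what keeps the function correct when K is not a multiple of the unroll width.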

Phase 4: Hardware Acceleration

SIMD (CPU Vectorization)

| Flag     | Vector Width   | Instructions  |
|----------|----------------|---------------|
| SIMD128  | 128-bit        | SSE2/SSE4     |
| SIMD256  | 256-bit        | AVX2          |
| SIMD512  | 512-bit        | AVX-512       |
| SIMDBEST | Best available | Auto-selected |

# Run with AVX-512
make -B run_release_simd CFLAGS+="-DSIMD512"

# Run best SIMD version
make -B run_release_simd CFLAGS+="-DSIMDBEST"
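For the 128-bit case, the sketch below vectorizes the same kind of row update four `uint32_t` lanes at a time. It is an illustration rather than the repository's `simd.c`: the name `row_update_simd128` is hypothetical, and the vector path relies on the SSE4.1 intrinsic `_mm_mullo_epi32`, so it is compiled only when the compiler targets SSE4.1 (for example with `-msse4.1`). Otherwise the scalar loop handles everything.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE4_1__)
#include <smmintrin.h> /* SSE4.1: _mm_mullo_epi32 */
#endif

/* Computes c_row[j] += a * b_row[j] for j in [0, n).
 * Hypothetical helper; processes 4 lanes per step when SSE4.1 is on. */
void row_update_simd128(uint32_t *c_row, const uint32_t *b_row,
                        uint32_t a, size_t n)
{
    size_t j = 0;
#if defined(__SSE4_1__)
    const __m128i va = _mm_set1_epi32((int)a); /* broadcast the scalar */
    for (; j + 4 <= n; j += 4) {
        /* Unaligned loads, so no alignment requirement on the rows. */
        __m128i vb = _mm_loadu_si128((const __m128i *)(b_row + j));
        __m128i vc = _mm_loadu_si128((const __m128i *)(c_row + j));
        /* The low 32 bits of each product match unsigned wraparound. */
        vc = _mm_add_epi32(vc, _mm_mullo_epi32(va, vb));
        _mm_storeu_si128((__m128i *)(c_row + j), vc);
    }
#endif
    /* Scalar tail, or the whole row when SSE4.1 is unavailable. */
    for (; j < n; ++j)
        c_row[j] += a * b_row[j];
}
```

The 256-bit and 512-bit variants follow the same shape with wider registers and the corresponding AVX2 or AVX-512 intrinsics, processing 8 or 16 lanes per step.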

SIMT (GPU Acceleration)

# Run with default block size
make -B run_release_simt

# Run with custom CUDA block size (max 32×32)
make -B run_release_simt NVCCFLAGS+="-DCUDA_BLOCK_SIZE=16"
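The 32×32 ceiling comes from CUDA's limit of 1024 threads per block. As a host-side illustration, the sketch below computes how many blocks per grid dimension are needed to cover a K×K matrix for a given block size. It assumes a square two-dimensional launch with one thread per output element, which the README does not confirm, and `grid_dim_for` is a hypothetical name.

```c
#include <stddef.h>

/* Blocks needed along one grid dimension to cover k rows (or columns)
 * with block_size x block_size thread blocks. Returns 0 when the block
 * size is invalid. Hypothetical helper; assumes one thread per element. */
size_t grid_dim_for(size_t k, size_t block_size)
{
    if (block_size == 0 || block_size > 32)
        return 0; /* 32 x 32 = 1024 threads, the per-block limit */
    /* Ceiling division so a partial last block is still launched. */
    return (k + block_size - 1) / block_size;
}
```

For example, a 100×100 matrix with `CUDA_BLOCK_SIZE=16` needs 7 blocks per dimension, and threads whose index falls outside the matrix must be masked off inside the kernel.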

Benchmarking

Running Performance Experiments

# Generate all plots for a specific phase
bash make-plots-phase2.sh
bash make-plots-phase3.sh
bash make-plots-phase4.sh

Key Metrics

  • Throughput: Requests per second (RPS)
  • Latency: p50, p95, p99, p99.9 percentiles
  • Resource Utilization: CPU, memory usage
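To clarify how a latency percentile is read from raw samples, here is a minimal nearest-rank sketch. wrk2 reports percentiles itself, so this is illustrative only; `latency_percentile` is a hypothetical name, and the percentile is passed in per-mille so that p99.9 can be expressed as the integer 999.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *x, const void *y)
{
    const double a = *(const double *)x;
    const double b = *(const double *)y;
    return (a > b) - (a < b);
}

/* Nearest-rank percentile. Sorts lat in place.
 * permille: 500 = p50, 950 = p95, 990 = p99, 999 = p99.9.
 * Returns 0.0 when there are no samples. */
double latency_percentile(double *lat, size_t n, unsigned permille)
{
    if (n == 0)
        return 0.0;
    qsort(lat, n, sizeof *lat, cmp_double);
    /* rank = ceil(permille / 1000 * n), using integer arithmetic */
    size_t rank = ((size_t)permille * n + 999) / 1000;
    if (rank == 0)
        rank = 1;
    if (rank > n)
        rank = n;
    return lat[rank - 1];
}
```

With few samples the high percentiles collapse onto the maximum, which is one reason tail latencies need long runs to be meaningful.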

Testing

cd project/tests-server

# Build and run tests
cmake -B build && cmake --build build
ctest --test-dir build

# Or use the script
./run-tests-server.sh

Test suites include:

  • test-parsing.c - Request parsing validation
  • test-matrix.c - Matrix multiplication correctness
  • test-patterns.c - Pattern matching accuracy
  • test-res_to_string.c - Output formatting

Documentation


Author

Mathis DELSART

License

This project is developed for academic purposes as part of university coursework.


Built for LINFO2241 - Architecture and Performance of Computer Systems @ UCLouvain (Université catholique de Louvain).
