VectorBlitz - High-Performance Matrix Server

A high-performance HTTP server for matrix multiplication and pattern matching

About

VectorBlitz is an HTTP server that performs matrix multiplication and pattern matching for a content detection system. It runs on NGINX through a custom C module loaded with nginx-link-function. Each POST request carries two matrices and a set of patterns; the server multiplies the matrices and computes, for each pattern, the minimum Euclidean distance between that pattern and the product.

The project was developed across four progressive phases, each focusing on different aspects of system performance:

| Phase | Focus | Description |
| --- | --- | --- |
| Phase 1 | Implementation | Basic server functionality with correct algorithm |
| Phase 2 | Benchmarking | Performance evaluation using Design of Experiments (DoE) |
| Phase 3 | Scalar Optimization | Cache-aware algorithms and loop unrolling |
| Phase 4 | Hardware Acceleration | SIMD (SSE/AVX/AVX-512) and SIMT (CUDA) implementations |

Features

Core Functionality

Matrix Multiplication: Efficient K×K matrix multiplication on unsigned 32-bit integers (UINT32)
Pattern Matching: Custom distance computation between patterns and matrix results
HTTP Interface: RESTful POST endpoint for processing requests
Binary Protocol: Optimized binary request/response format
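To make the pattern-matching step concrete, here is a minimal sketch in C. It assumes the result matrix is scanned as a flat row-major array and that the distance is the sum of squared differences between a pattern and each contiguous window; the function name `min_sq_distance` and the 64-bit accumulator are illustrative choices, not taken from the project sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest squared Euclidean distance between `pattern`
 * (length p_len) and any contiguous window of `data` (length n).
 * Returns UINT64_MAX when no window fits.
 * The sum may overflow uint64_t for extreme inputs; ignored in this sketch. */
uint64_t min_sq_distance(const uint32_t *data, size_t n,
                         const uint32_t *pattern, size_t p_len)
{
    uint64_t best = UINT64_MAX;
    if (p_len == 0 || p_len > n)
        return best;

    for (size_t start = 0; start + p_len <= n; start++) {
        uint64_t dist = 0;
        /* Stop early once this window can no longer beat the best one. */
        for (size_t j = 0; j < p_len && dist < best; j++) {
            uint64_t a = data[start + j], b = pattern[j];
            uint64_t d = a > b ? a - b : b - a; /* avoids unsigned wrap-around */
            dist += d * d;
        }
        if (dist < best)
            best = dist;
    }
    return best;
}
```

Calling this once per pattern against the flattened product gives one minimum distance per pattern, which matches the shape of the response described above.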

Performance Optimizations

Cache-Aware Algorithms: Line-by-line matrix multiplication for improved cache utilization
Loop Unrolling: Manual unrolling with configurable widths (2, 4, 8, 16, 32)
SIMD Vectorization: SSE (128-bit), AVX2 (256-bit), and AVX-512 (512-bit) implementations
CUDA Acceleration: GPU-based parallel processing for large workloads
Sparse Optimization: Skip zero elements during computation
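The cache-aware and sparse ideas from the list above can be sketched together. This is one possible formulation under stated assumptions, not necessarily the one in `utils.c`: matrices are row-major, the loops run in i-k-j order so the inner loop walks one row of B and one row of C contiguously, and a zero element of A skips its whole inner loop. The name `multiply_ikj` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: C = A * B for K x K row-major uint32 matrices.
 * Unsigned arithmetic wraps modulo 2^32. */
void multiply_ikj(const uint32_t *a, const uint32_t *b, uint32_t *c, size_t k)
{
    memset(c, 0, k * k * sizeof *c);
    for (size_t i = 0; i < k; i++) {
        uint32_t *c_row = &c[i * k];
        for (size_t x = 0; x < k; x++) {
            uint32_t aix = a[i * k + x];
            if (aix == 0)
                continue;               /* sparse shortcut: nothing to add */
            const uint32_t *b_row = &b[x * k];
            /* Both rows are read sequentially, which is friendly to the cache. */
            for (size_t j = 0; j < k; j++)
                c_row[j] += aix * b_row[j];
        }
    }
}
```

Compared with the textbook i-j-k order, this variant never strides down a column of B, which is usually where cache misses come from for large K.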

Benchmarking Framework

Automated Testing: Full experiment automation via the per-phase make-plots-phase*.sh scripts
Load Generation: Integration with wrk2 for precise request rate control
Statistical Analysis: Multi-run aggregation with variance reporting
Visualization: Automated plot generation with Python/Matplotlib
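As an illustration of multi-run aggregation, the sketch below computes the mean and the unbiased sample variance of a metric across runs using Welford's online algorithm. It is written in C to match the rest of the code base; the project's own analysis scripts may well do this in Python, and the names `run_stats` and `aggregate_runs` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical container for the aggregated statistics of one metric. */
typedef struct {
    double mean;
    double variance; /* unbiased sample variance, 0 when n < 2 */
} run_stats;

/* Welford's online algorithm: one pass, numerically stable. */
run_stats aggregate_runs(const double *samples, size_t n)
{
    run_stats s = {0.0, 0.0};
    double m2 = 0.0; /* running sum of squared deviations from the mean */

    for (size_t i = 0; i < n; i++) {
        double delta = samples[i] - s.mean;
        s.mean += delta / (double)(i + 1);
        m2 += delta * (samples[i] - s.mean);
    }
    if (n > 1)
        s.variance = m2 / (double)(n - 1);
    return s;
}
```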

Architecture

VectorBlitz/
├── project/
│   ├── server-implementation/     # Core server code
│   │   ├── main.c                 # NGINX module entry point
│   │   └── Makefile               # Multi-target build system
│   ├── utils/                     # Algorithm implementations
│   │   ├── utils.c/h              # Scalar algorithms
│   │   ├── simd.c/h               # SSE/AVX/AVX-512 implementations
│   │   └── simt.cu/h              # CUDA GPU implementation
│   ├── tests-server/              # Unit test suite
│   │   └── src/                   # Test sources (CMake-based)
│   ├── perfs-eval-phase2/         # Phase 2: DoE experiments
│   ├── perfs-eval-phase3/         # Phase 3: Optimization analysis
│   ├── perfs-eval-phase4/         # Phase 4: SIMD/SIMT benchmarks
│   ├── nginx-conf/                # Server configuration
│   └── wrk2/                      # Load testing tool (submodule)
└── extra-project/
    ├── statements/                # Project specifications
    ├── reports/                   # Phase reports (PDF)
    └── videos/                    # Demo videos

Request Processing Pipeline

HTTP POST Request
       │
       ▼
┌──────────────────┐
│  parse_request() │  ─── Decode binary header + matrices + patterns
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ multiply_matrix()│  ─── K×K matrix multiplication (scalar/SIMD/SIMT)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ test_patterns()  │  ─── Compute min distance for each pattern
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ res_to_string()  │  ─── Format results as CSV response
└────────┬─────────┘
         │
         ▼
   HTTP Response
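The last stage of the pipeline turns the per-pattern results into a CSV string. The sketch below shows one way to do that safely with `snprintf`. The real signature of `res_to_string()` is not documented here, so the name `format_csv`, its parameters, and its return convention are assumptions made for illustration.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: writes `values` as comma-separated decimals into `out`
 * (capacity `cap`, including the terminating NUL).
 * Returns the number of characters written, or -1 if the buffer is too small. */
int format_csv(const uint32_t *values, size_t n, char *out, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        /* Every value after the first is preceded by a comma. */
        int w = snprintf(out + used, cap - used,
                         i == 0 ? "%" PRIu32 : ",%" PRIu32, values[i]);
        if (w < 0 || (size_t)w >= cap - used)
            return -1; /* output would have been truncated */
        used += (size_t)w;
    }
    return (int)used;
}
```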

Quick Start

Prerequisites

GCC 9+ with C11 support
NGINX with nginx-link-function module
CMake 3.10+ (for tests)
Python 3.8+ with matplotlib, numpy, pandas
wrk2 (included as submodule)

For SIMD: CPU with AVX-512 support
For SIMT: NVIDIA GPU with CUDA toolkit 11+

Installation

# Clone the repository
git clone https://github.com/mathisdelsart/VectorBlitz.git
cd VectorBlitz

# Initialize submodules
git submodule update --init --recursive

# Build wrk2 load generator (lives under project/, see Architecture)
cd project/wrk2 && make && cd ../..

Building the Server

cd project/server-implementation

# Build naive version (Phase 2 baseline)
make build

# Build with scalar optimizations (Phase 3)
make build "CFLAGS=-DBEST"

# Build with SIMD acceleration (Phase 4)
make build_simd CFLAGS+="-DSIMDBEST"

# Build with CUDA acceleration (Phase 4)
make build_simt

Running the Server

# Start server (default: port 8888)
make run_release

# Start with specific optimizations
make run_release "CFLAGS=-DBEST"

# Start with multiple workers
make run_release NB_WORKER=4

# Debug mode with GDB
make run_gdb

# Memory profiling with Valgrind
make run_valgrind

Sending a Test Request

# Using the provided script
./send-request.sh

# Manual request with curl
curl -X POST http://localhost:8888/ \
  --data-binary @test_request.bin

Optimizations

Phase 3: Scalar Optimizations

| Flag | Description | Technique |
| --- | --- | --- |
| `CACHE_AWARE` | Cache-friendly matrix multiplication | Line-by-line traversal |
| `UNROLL` | Loop unrolling | Manual unrolling with configurable width |
| `BEST` | Combined optimizations | All techniques + custom improvements |

# Compile with specific optimization
make build "CFLAGS=-DCACHE_AWARE -DUNROLL"

# Rebuild from scratch (-B) and run the best version
make -B run_release "CFLAGS=-DBEST"
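To illustrate what manual unrolling looks like, here is a minimal sketch of a row update unrolled by 4, one of the widths listed earlier. The function name `axpy_unroll4` is hypothetical, and the project's own unrolled loops may be structured differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: dst[j] += scale * src[j] for j in [0, n),
 * with the main loop manually unrolled by 4. */
void axpy_unroll4(uint32_t *dst, const uint32_t *src, uint32_t scale, size_t n)
{
    size_t j = 0;

    /* Four independent updates per iteration reduce loop overhead. */
    for (; j + 4 <= n; j += 4) {
        dst[j]     += scale * src[j];
        dst[j + 1] += scale * src[j + 1];
        dst[j + 2] += scale * src[j + 2];
        dst[j + 3] += scale * src[j + 3];
    }
    /* Remainder loop handles the last n % 4 elements. */
    for (; j < n; j++)
        dst[j] += scale * src[j];
}
```

The remainder loop matters: without it, any length that is not a multiple of the unroll width would leave trailing elements untouched.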

Phase 4: Hardware Acceleration

SIMD (CPU Vectorization)

| Flag | Vector Width | Instructions |
| --- | --- | --- |
| `SIMD128` | 128-bit | SSE2/SSE4 |
| `SIMD256` | 256-bit | AVX2 |
| `SIMD512` | 512-bit | AVX-512 |
| `SIMDBEST` | Best available | Auto-selected |

# Run with AVX-512
make -B run_release_simd CFLAGS+="-DSIMD512"

# Run best SIMD version
make -B run_release_simd CFLAGS+="-DSIMDBEST"
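For a sense of what the 256-bit path might look like, the sketch below updates eight uint32 lanes per iteration with AVX2 intrinsics and falls back to scalar code when AVX2 is not enabled at compile time (for example without `-mavx2`). The function name `axpy_simd256` is hypothetical and this is not the code in `simd.c`.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Hypothetical sketch: dst[j] += scale * src[j] for j in [0, n). */
void axpy_simd256(uint32_t *dst, const uint32_t *src, uint32_t scale, size_t n)
{
    size_t j = 0;
#ifdef __AVX2__
    const __m256i vscale = _mm256_set1_epi32((int)scale);
    for (; j + 8 <= n; j += 8) {
        /* Unaligned loads: the buffers are not assumed to be 32-byte aligned. */
        __m256i s = _mm256_loadu_si256((const __m256i *)(src + j));
        __m256i d = _mm256_loadu_si256((const __m256i *)(dst + j));
        /* mullo keeps the low 32 bits of each product, matching uint32 wrap. */
        d = _mm256_add_epi32(d, _mm256_mullo_epi32(s, vscale));
        _mm256_storeu_si256((__m256i *)(dst + j), d);
    }
#endif
    /* Scalar tail, and the whole loop when AVX2 is unavailable. */
    for (; j < n; j++)
        dst[j] += scale * src[j];
}
```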

SIMT (GPU Acceleration)

# Run with default block size
make -B run_release_simt

# Run with custom CUDA block size (max 32×32)
make -B run_release_simt NVCCFLAGS+="-DCUDA_BLOCK_SIZE=16"

Benchmarking

Running Performance Experiments

# Generate all plots for a specific phase
bash make-plots-phase2.sh
bash make-plots-phase3.sh
bash make-plots-phase4.sh

Key Metrics

Throughput: Requests per second (RPS)
Latency: p50, p95, p99, p99.9 percentiles
Resource Utilization: CPU, memory usage
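The latency percentiles above can be read as follows: p99 is the latency that 99% of requests did not exceed. The sketch below uses the nearest-rank definition on a sorted array of samples. It is only an illustration of the metric; wrk2 computes its own percentiles, and the names `sort_latencies` and `percentile` are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts latency samples in place, in ascending order. */
void sort_latencies(double *lat, size_t n)
{
    qsort(lat, n, sizeof *lat, cmp_double);
}

/* Nearest-rank percentile of an ascending array.
 * Assumes n > 0 and 0 < p <= 100. */
double percentile(const double *sorted, size_t n, double p)
{
    double exact = p / 100.0 * (double)n;
    size_t rank = (size_t)exact;

    if ((double)rank < exact)
        rank++;            /* round up to the next whole rank */
    if (rank < 1)
        rank = 1;
    if (rank > n)
        rank = n;
    return sorted[rank - 1];
}
```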

Testing

cd project/tests-server

# Build and run tests (cmake -B requires CMake 3.13+, ctest --test-dir requires CMake 3.20+)
cmake -B build && cmake --build build
ctest --test-dir build

# Or use the script
./run-tests-server.sh

Test suites include:

test-parsing.c - Request parsing validation
test-matrix.c - Matrix multiplication correctness
test-patterns.c - Pattern matching accuracy
test-res_to_string.c - Output formatting

Documentation

Phase 1 Statement - Server implementation
Phase 2 Statement - Performance evaluation
Phase 3 Statement - Scalar optimizations
Phase 4 Statement - Hardware acceleration
Phase 2 Report - DoE analysis results
Phase 3 Report - Optimization analysis

Author

Mathis DELSART

License

This project is developed for academic purposes as part of university coursework.

Built for LINFO2241 - Architecture and Performance of Computer Systems @ UCLouvain (Université catholique de Louvain).
