A high-performance HTTP server for matrix multiplication and pattern matching
This project implements a high-performance HTTP server designed to process matrix operations and pattern matching for a content detection system. Built on top of NGINX with a custom C module using nginx-link-function, the server handles POST requests containing two matrices and multiple patterns, computing the minimum Euclidean distance between each pattern and the matrix multiplication result.
The project was developed across four progressive phases, each focusing on different aspects of system performance:
| Phase | Focus | Description |
|---|---|---|
| Phase 1 | Implementation | Basic server functionality with correct algorithm |
| Phase 2 | Benchmarking | Performance evaluation using Design of Experiments (DoE) |
| Phase 3 | Scalar Optimization | Cache-aware algorithms and loop unrolling |
| Phase 4 | Hardware Acceleration | SIMD (SSE/AVX/AVX-512) and SIMT (CUDA) implementations |
- Matrix Multiplication: Efficient K×K matrix multiplication with UINT32 precision
- Pattern Matching: Custom distance computation between patterns and matrix results
- HTTP Interface: RESTful POST endpoint for processing requests
- Binary Protocol: Optimized binary request/response format
- Cache-Aware Algorithms: Line-by-line matrix multiplication for improved cache utilization
- Loop Unrolling: Manual unrolling with configurable widths (2, 4, 8, 16, 32)
- SIMD Vectorization: SSE (128-bit), AVX2 (256-bit), and AVX-512 (512-bit) implementations
- CUDA Acceleration: GPU-based parallel processing for large workloads
- Sparse Optimization: Skip zero elements during computation
- Automated Testing: Full experiment automation via `make_plots.sh`
- Load Generation: Integration with `wrk2` for precise request rate control
- Statistical Analysis: Multi-run aggregation with variance reporting
- Visualization: Automated plot generation with Python/Matplotlib
VectorBlitz/
├── project/
│ ├── server-implementation/ # Core server code
│ │ ├── main.c # NGINX module entry point
│ │ └── Makefile # Multi-target build system
│ ├── utils/ # Algorithm implementations
│ │ ├── utils.c/h # Scalar algorithms
│ │ ├── simd.c/h # SSE/AVX/AVX-512 implementations
│ │ └── simt.cu/h # CUDA GPU implementation
│ ├── tests-server/ # Unit test suite
│ │ └── src/ # Test sources (CMake-based)
│ ├── perfs-eval-phase2/ # Phase 2: DoE experiments
│ ├── perfs-eval-phase3/ # Phase 3: Optimization analysis
│ ├── perfs-eval-phase4/ # Phase 4: SIMD/SIMT benchmarks
│ ├── nginx-conf/ # Server configuration
│ └── wrk2/ # Load testing tool (submodule)
└── extra-project/
├── statements/ # Project specifications
├── reports/ # Phase reports (PDF)
└── videos/ # Demo videos
HTTP POST Request
│
▼
┌──────────────────┐
│ parse_request() │ ─── Decode binary header + matrices + patterns
└────────┬─────────┘
│
▼
┌──────────────────┐
│ multiply_matrix()│ ─── K×K matrix multiplication (scalar/SIMD/SIMT)
└────────┬─────────┘
│
▼
┌──────────────────┐
│ test_patterns() │ ─── Compute min distance for each pattern
└────────┬─────────┘
│
▼
┌──────────────────┐
│ res_to_string() │ ─── Format results as CSV response
└────────┬─────────┘
│
▼
HTTP Response
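
The last two stages of the pipeline above can be sketched as follows. This is only an illustration under stated assumptions, not the project's code: the helper names `min_sq_distance` and `distances_to_csv` are hypothetical, the pattern is assumed to be a flat array compared against every same-length window of the flattened result matrix, and the squared Euclidean distance is used so that no square root is needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: smallest squared Euclidean distance between `pattern`
 * (length plen) and every window of the same length in the flattened result
 * matrix `res` (length n). Returns UINT64_MAX if plen > n. */
uint64_t min_sq_distance(const uint32_t *res, size_t n,
                         const uint32_t *pattern, size_t plen)
{
    uint64_t best = UINT64_MAX;
    for (size_t start = 0; start + plen <= n; start++) {
        uint64_t dist = 0;
        for (size_t j = 0; j < plen; j++) {
            uint64_t a = res[start + j], b = pattern[j];
            uint64_t d = a > b ? a - b : b - a;  /* absolute difference */
            dist += d * d;  /* unsigned, so very large sums wrap modulo 2^64 */
        }
        if (dist < best)
            best = dist;
    }
    return best;
}

/* Hypothetical formatter: writes the distances as "d0,d1,...". Returns the
 * number of characters written, or -1 if `buf` is too small. */
int distances_to_csv(const uint64_t *dist, size_t count, char *buf, size_t cap)
{
    size_t used = 0;
    if (cap == 0)
        return -1;
    buf[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        int w = snprintf(buf + used, cap - used, "%s%llu",
                         i ? "," : "", (unsigned long long)dist[i]);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

Keeping the distance squared preserves the ordering of candidates, so the minimum is found at the same window as with the true Euclidean distance.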
- GCC 9+ with C11 support
- NGINX with `ngx-link-function` module
- CMake 3.10+ (for tests)
- Python 3.8+ with `matplotlib`, `numpy`, `pandas`
- wrk2 (included as submodule)

For SIMD: CPU with AVX-512 support
For SIMT: NVIDIA GPU with CUDA toolkit 11+
# Clone the repository
git clone https://github.com/mathisdelsart/VectorBlitz.git
cd VectorBlitz
# Initialize submodules
git submodule update --init --recursive
# Build wrk2 load generator
cd project/wrk2 && make && cd ../..

cd project/server-implementation
# Build naive version (Phase 2 baseline)
make build
# Build with scalar optimizations (Phase 3)
make build "CFLAGS=-DBEST"
# Build with SIMD acceleration (Phase 4)
make build_simd CFLAGS+="-DSIMDBEST"
# Build with CUDA acceleration (Phase 4)
make build_simt

# Start server (default: port 8888)
make run_release
# Start with specific optimizations
make run_release "CFLAGS=-DBEST"
# Start with multiple workers
make run_release NB_WORKER=4
# Debug mode with GDB
make run_gdb
# Memory profiling with Valgrind
make run_valgrind

# Using the provided script
./send-request.sh
# Manual request with curl
curl -X POST http://localhost:8888/ \
  --data-binary @test_request.bin

| Flag | Description | Technique |
|---|---|---|
| `CACHE_AWARE` | Cache-friendly matrix multiplication | Line-by-line traversal |
| `UNROLL` | Loop unrolling | Manual unrolling with configurable width |
| `BEST` | Combined optimizations | All techniques + custom improvements |
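
The line-by-line traversal, manual unrolling, and zero-skipping techniques named in this README can be combined roughly as in the sketch below. It is a minimal illustration, not the code in `utils.c`: the function name `matmul_ikj` is hypothetical, the matrices are assumed to be stored row-major, and the unroll width is fixed at 4 instead of being configurable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Computes c = a * b for row-major K x K matrices of uint32_t.
 * Products and sums wrap modulo 2^32, as unsigned arithmetic does in C. */
void matmul_ikj(const uint32_t *a, const uint32_t *b, uint32_t *c, size_t k)
{
    memset(c, 0, k * k * sizeof *c);
    for (size_t i = 0; i < k; i++) {
        uint32_t *crow = &c[i * k];
        for (size_t p = 0; p < k; p++) {
            uint32_t aip = a[i * k + p];
            if (aip == 0)
                continue;                /* sparse shortcut: nothing to add */
            const uint32_t *brow = &b[p * k];
            size_t j = 0;
            /* Inner loop unrolled by 4; both b and c are walked along a
             * row, so consecutive accesses stay in the same cache lines. */
            for (; j + 4 <= k; j += 4) {
                crow[j]     += aip * brow[j];
                crow[j + 1] += aip * brow[j + 1];
                crow[j + 2] += aip * brow[j + 2];
                crow[j + 3] += aip * brow[j + 3];
            }
            for (; j < k; j++)           /* remainder when k % 4 != 0 */
                crow[j] += aip * brow[j];
        }
    }
}
```

Compared with the textbook i-j-k order, which reads `b` down a column, the i-k-j order reads every operand along a row, which is what makes the traversal cache-friendly.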
# Compile with specific optimization
make build "CFLAGS=-DCACHE_AWARE -DUNROLL"
# Compile best version
make -B run_release "CFLAGS=-DBEST"

| Flag | Vector Width | Instructions |
|---|---|---|
| `SIMD128` | 128-bit | SSE2/SSE4 |
| `SIMD256` | 256-bit | AVX2 |
| `SIMD512` | 512-bit | AVX-512 |
| `SIMDBEST` | Best available | Auto-selected |
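
To give an idea of what the vector widths mean for the inner loop of the multiplication, here is a small sketch that processes four `uint32_t` lanes (128 bits) per step. It deliberately uses the GCC/Clang vector extension rather than SSE/AVX intrinsics so that it compiles without extra flags; it is a stand-in for, not a copy of, the code in `simd.c`, and the name `axpy_u32` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Four packed uint32_t lanes (128 bits); a GCC/Clang vector extension. */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* crow[j] += scalar * brow[j] for j in [0, n), four lanes at a time. */
void axpy_u32(uint32_t *crow, const uint32_t *brow, uint32_t scalar, size_t n)
{
    const v4u32 s = {scalar, scalar, scalar, scalar};
    size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        v4u32 bv, cv;
        memcpy(&bv, brow + j, sizeof bv);  /* load, safe for unaligned data */
        memcpy(&cv, crow + j, sizeof cv);
        cv += s * bv;                      /* lane-wise multiply and add */
        memcpy(crow + j, &cv, sizeof cv);  /* store */
    }
    for (; j < n; j++)                     /* scalar tail */
        crow[j] += scalar * brow[j];
}
```

With the same approach, a `vector_size` of 32 or 64 bytes would correspond to the 256-bit and 512-bit rows of the table, handling 8 or 16 lanes per step on hardware that supports them.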
# Run with AVX-512
make -B run_release_simd CFLAGS+="-DSIMD512"
# Run best SIMD version
make -B run_release_simd CFLAGS+="-DSIMDBEST"

# Run with default block size
make -B run_release_simt
# Run with custom CUDA block size (max 32×32)
make -B run_release_simt NVCCFLAGS+="-DCUDA_BLOCK_SIZE=16"

# Generate all plots for a specific phase
bash make-plots-phase2.sh
bash make-plots-phase3.sh
bash make-plots-phase4.sh

- Throughput: Requests per second (RPS)
- Latency: p50, p95, p99, p99.9 percentiles
- Resource Utilization: CPU, memory usage
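
For readers unfamiliar with latency percentiles, the sketch below shows one common definition, the nearest-rank method, applied to a set of latency samples. It only illustrates what a value such as p95 means; it is not how wrk2 or the plotting scripts compute their figures, and the function name `percentile` is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles in ascending order. */
static int cmp_double(const void *x, const void *y)
{
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

/* Nearest-rank percentile: sorts `samples` in place and returns the value
 * at rank ceil(pct / 100 * n). Requires n > 0 and 0 < pct <= 100. */
double percentile(double *samples, size_t n, double pct)
{
    qsort(samples, n, sizeof *samples, cmp_double);
    double exact = pct / 100.0 * (double)n;
    size_t rank = (size_t)exact;
    if ((double)rank < exact)
        rank++;                 /* round up without needing libm */
    if (rank < 1)
        rank = 1;
    if (rank > n)
        rank = n;
    return samples[rank - 1];
}
```

With ten samples, p50 is the fifth smallest value and p95 is the largest, which is why high percentiles need many requests per run to be meaningful.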
cd project/tests-server
# Build and run tests
cmake -B build && cmake --build build
ctest --test-dir build
# Or use the script
./run-tests-server.sh

Test suites include:
- `test-parsing.c` - Request parsing validation
- `test-matrix.c` - Matrix multiplication correctness
- `test-patterns.c` - Pattern matching accuracy
- `test-res_to_string.c` - Output formatting
- Phase 1 Statement - Server implementation
- Phase 2 Statement - Performance evaluation
- Phase 3 Statement - Scalar optimizations
- Phase 4 Statement - Hardware acceleration
- Phase 2 Report - DoE analysis results
- Phase 3 Report - Optimization analysis
This project is developed for academic purposes as part of university coursework.
Built for LINFO2241 - Architecture and Performance of Computer Systems @ UCLouvain (Université catholique de Louvain).