Quantized matmul in CUDA, with a PyTorch interface.

Original code from FasterTransformer / TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/kernels

Adapted to support a different quantization scheme.