
Use a BLAS lib & CLBlast for CPU & GPU speedups. [Enhancement] #7

Closed
trholding opened this issue Jul 23, 2023 · 4 comments

Comments

@trholding
Contributor

@karpathy Would you consider using a BLAS library to speed up compute so that larger models become usable?
If so, please also add an option to compile with CLBlast (a drop-in BLAS-compatible library) so that compute can be offloaded to the GPU via OpenCL.

https://www.netlib.org/blas/#_reference_blas_version_3_11_0
https://www.openblas.net/
https://github.com/CNugteren/CLBlast

@trholding trholding changed the title from "Use a BLAS lib and CLBlast for CPU and GPU speedups. [Enhancement]" to "Use a BLAS lib & CLBlast for CPU & GPU speedups. [Enhancement]" Jul 23, 2023
@karpathy
Owner

Yeah... I definitely want to keep the simplicity of the repo. I'll take a look.

@hchenphd

If CPU (BLAS) & GPU (CLBlast) speedups were applied, I would be very interested in benchmarking the program with different quantized models on different edge devices, such as the RK3588, the Nvidia Jetson Orin Nano, and even Android phones (e.g. Qualcomm Snapdragon 7/8).
As karpathy said, "in this repo we focus on more narrow applications", and I believe deploying the program on edge devices with CPU & GPU speedups could have significant value for many interesting and specific scenarios.

@MostWrong

MostWrong commented Jul 24, 2023

So I tried using CBLAS:

```c
#include <cblas.h>
#include <math.h>

// a += b (single-precision AXPY with alpha = 1)
void accum(float *a, float *b, int size) {
    cblas_saxpy(size, 1.0f, b, 1, a, 1);
}

// RMS norm: o = weight * (x / rms(x))
void rmsnorm(float* o, float* x, float* weight, int size) {
    float ss = cblas_sdot(size, x, 1, x, 1); // sum of squares
    ss /= size;
    ss += 1e-5f;
    ss = 1.0f / sqrtf(ss);

    for (int j = 0; j < size; j++) {
        o[j] = weight[j] * (ss * x[j]);
    }
}

// xout = W @ x, where W is (d, n) stored row-major
void matmul(float* xout, float* x, float* w, int n, int d) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1, 0.0f, xout, 1);
}
```

(Note: the BLAS increment arguments are ints, not floats, and `sqrtf` needs `<math.h>`.)

Gives a decent speedup

trholding referenced this issue in trholding/llama2.c Jul 31, 2023
Added BLAS support:

+ Openblas
+ CLBlast (GPU)

CLBlast is considerably slower; needs investigation.

Added APE binary prompt support

Usage:

Ape run:
$   run.com

Baremetal Boot:
$  qemu-system-x86_64 -serial stdio -hda run.com
(input is broken on baremetal)

Updated Makefile

Usage:
make runopenblas
make runclblast
@trholding
Contributor Author

Available in a separate fork. Thanks for the discussion. Closing.


4 participants